
Studies in Systems, Decision and Control 166

Ruizhuo Song
Qinglai Wei
Qing Li

Adaptive Dynamic
Programming:
Single and Multiple
Controllers
Studies in Systems, Decision and Control

Volume 166

Series editor
Janusz Kacprzyk, Systems Research Institute, Polish Academy of Sciences,
Warsaw, Poland
e-mail: [email protected]
The series “Studies in Systems, Decision and Control” (SSDC) covers both new
developments and advances, as well as the state of the art, in the various areas of
broadly perceived systems, decision making and control–quickly, up to date and
with a high quality. The intent is to cover the theory, applications, and perspectives
on the state of the art and future developments relevant to systems, decision
making, control, complex processes and related areas, as embedded in the fields of
engineering, computer science, physics, economics, social and life sciences, as well
as the paradigms and methodologies behind them. The series contains monographs,
textbooks, lecture notes and edited volumes in systems, decision making and
control spanning the areas of Cyber-Physical Systems, Autonomous Systems,
Sensor Networks, Control Systems, Energy Systems, Automotive Systems,
Biological Systems, Vehicular Networking and Connected Vehicles, Aerospace
Systems, Automation, Manufacturing, Smart Grids, Nonlinear Systems, Power
Systems, Robotics, Social Systems, Economic Systems and other. Of particular
value to both the contributors and the readership are the short publication timeframe
and the world-wide distribution and exposure which enable both a wide and rapid
dissemination of research output.

More information about this series at http://www.springer.com/series/13304


Ruizhuo Song
Qinglai Wei
Qing Li

Adaptive Dynamic
Programming: Single
and Multiple Controllers

Ruizhuo Song
University of Science and Technology Beijing
Beijing, China

Qinglai Wei
Institute of Automation
Chinese Academy of Sciences
Beijing, China

Qing Li
University of Science and Technology Beijing
Beijing, China

ISSN 2198-4182 ISSN 2198-4190 (electronic)


Studies in Systems, Decision and Control
ISBN 978-981-13-1711-8 ISBN 978-981-13-1712-5 (eBook)
https://doi.org/10.1007/978-981-13-1712-5

Jointly published with Science Press, Beijing, China

The print edition is not for sale in China Mainland. Customers from China Mainland please order the
print book from: Science Press, Beijing.

Library of Congress Control Number: 2018949333

© Science Press, Beijing and Springer Nature Singapore Pte Ltd. 2019
This work is subject to copyright. All rights are reserved by the Publishers, whether the whole or part
of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations,
recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission
or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar
methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this
publication does not imply, even in the absence of a specific statement, that such names are exempt from
the relevant protective laws and regulations and therefore free for general use.
The publishers, the authors, and the editors are safe to assume that the advice and information in this
book are believed to be true and accurate at the date of publication. Neither the publishers nor the
authors or the editors give a warranty, express or implied, with respect to the material contained herein or
for any errors or omissions that may have been made. The publishers remain neutral with regard to
jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Singapore Pte Ltd.
The registered company address is: 152 Beach Road, #21-01/04 Gateway East, Singapore 189721,
Singapore
Preface

In the past few decades, neurobiological discoveries towards understanding learning
mechanisms of the human brain have provided indications of how to design
more effective and responsive decision and control systems for complex interconnected
human-engineered systems. Advances in computational intelligence reinforcement
learning (RL) methods, particularly those with an actor-critic structure in
the family of approximate/adaptive dynamic programming (ADP) techniques, have
made inroads in comprehending and mimicking brain functionality at the level
of the brainstem, cerebellum, basal ganglia, and cerebral cortex. The work of Doya
and others has shown that the basal ganglia act as a critic in selecting action
commands sent to the muscle motor control systems. The concentration of dopamine
neurotransmitters in the basal ganglia is modified based on rewards received
from previous actions, so that successful actions are more likely to be repeated in
the future. The cerebellum learns a dynamics model of the environment for control
purposes. This motivates a three-level actor-critic structure with learning networks
for the critic, the actor, and the model dynamics. Based on this actor-critic structure,
ADP algorithms, which are state-feedback control methods, have been developed as
powerful tools for solving optimal control problems online.
ADP techniques are used to design new structures of adaptive feedback control
systems that learn the solutions to optimal control problems by measuring data
along the system trajectories. This book studies the optimal control based on ADP
for two categories of dynamic feedback control systems: systems with single
control input and systems with multiple control inputs.
This book is organized into 13 chapters. First, Chap. 1 gives a preparation of the
book. After that, the book is divided into two parts. In Part I, we develop novel
ADP methods for optimal control of the systems with one control input. Chapter 2
introduces a finite-time optimal control method for a class of unknown nonlinear
systems. Chapter 3 proposes finite-horizon optimal control for a class of nonaffine
time-delay nonlinear systems. Chapter 4 presents multi-objective optimal control
for a class of nonlinear time-delay systems. Chapter 5 develops multiple actor-critic


structures for continuous-time optimal control using input–output data. In Chap. 6,
an optimal control method for complex-valued nonlinear systems is studied based on
ADP. In Chap. 7, off-policy neuro-optimal control is developed for unknown
complex-valued nonlinear systems based on policy iteration. In Chap. 8, an optimal
tracking control scheme is proposed together with its convergence proof. Part II concerns
multi-player systems in Chaps. 9–13. In Chap. 9, an off-policy actor-critic
structure for optimal control of unknown systems with disturbances is presented. In
Chap. 10, an iterative ADP method for solving a class of nonlinear zero-sum differential
games is introduced. In Chap. 11, a neural-network-based synchronous iteration
learning method for multi-player zero-sum games is developed. In Chap. 12,
an off-policy integral reinforcement learning (IRL) method is employed to solve
nonlinear continuous-time multi-player non-zero-sum games. In Chap. 13, optimal
distributed synchronization control for continuous-time heterogeneous multi-agent
differential graphical games is established.
Dr. Ruizhuo Song completed Chaps. 1–12 and wrote 250,000 words. Dr. Qinglai
Wei completed Chap. 13 and wrote 50,000 words. The authors would like to
acknowledge the help and encouragement they have received from colleagues in the
University of Science and Technology Beijing, and Institute of Automation,
Chinese Academy of Sciences during the course of writing this book.
The authors are very grateful to the National Natural Science Foundation of
China (NSFC) for providing necessary financial support to our research. The present
book is the result of NSFC Grants 61374105, 61673054, and 61722312, and was
supported in part by the Fundamental Research Funds for the Central Universities
under Grant FRF-GF-17-B45.
This book is dedicated to Tiantian, who makes every day exciting.

Beijing, China
October 2017

Ruizhuo Song
Qinglai Wei
Qing Li
Contents

1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Optimal Control . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1.1 Continuous-Time LQR . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1.2 Discrete-Time LQR . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Adaptive Dynamic Programming . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 Review of Matrix Algebra . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2 Neural-Network-Based Approach for Finite-Time Optimal
Control . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.2 Problem Formulation and Motivation . . . . . . . . . . . . . . . . . . . . 9
2.3 The Data-Based Identifier . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.4 Derivation of the Iterative ADP Algorithm
with Convergence Analysis . . . . . . . . . . . . . . . . . . . . . . . . 11
2.5 Neural Network Implementation of the Iterative Control
Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.6 Simulation Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.7 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3 Nearly Finite-Horizon Optimal Control for Nonaffine Time-Delay
Nonlinear Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.2 Problem Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.3 The Iteration ADP Algorithm and Its Convergence . . . . . . . . . . 30
3.3.1 The Novel ADP Iteration Algorithm . . . . . . . . . . . . . . 30
3.3.2 Convergence Analysis of the Improved Iteration
Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.3.3 Neural Network Implementation of the Iteration
ADP Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38


3.4 Simulation Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40


3.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
4 Multi-objective Optimal Control for Time-Delay Systems . . . . . . . . 49
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
4.2 Problem Formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
4.3 Derivation of the ADP Algorithm for Time-Delay Systems . . . . 51
4.4 Neural Network Implementation for the Multi-objective
Optimal Control Problem of Time-Delay Systems . . . . . . . . . . . 54
4.5 Simulation Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
4.6 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
5 Multiple Actor-Critic Optimal Control via ADP . . . . . . . . . . . . . . . 63
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
5.2 Problem Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
5.3 SIANN Architecture-Based Classification . . . . . . . . . . . . . . . . . 66
5.4 Optimal Control Based on ADP . . . . . . . . . . . . . . . . . . . . . . . . 69
5.4.1 Model Neural Network . . . . . . . . . . . . . . . . . . . . . . . . 70
5.4.2 Critic Network and Action Network . . . . . . . . . . . . . . 74
5.5 Simulation Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
5.6 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
6 Optimal Control for a Class of Complex-Valued Nonlinear
Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
6.2 Motivations and Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . 96
6.3 ADP-Based Optimal Control Design . . . . . . . . . . . . . . . . . . . . 99
6.3.1 Critic Network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
6.3.2 Action Network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
6.3.3 Design of the Compensation Controller . . . . . . . . . . . . 102
6.3.4 Stability Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
6.4 Simulation Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
6.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
7 Off-Policy Neuro-Optimal Control for Unknown
Complex-Valued Nonlinear Systems . . . . . . . . . . . . . . . . . . . . . . . . 113
7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
7.2 Problem Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
7.3 Off-Policy Optimal Control Method . . . . . . . . . . . . . . . . . . . . . 115
7.3.1 Convergence Analysis of Off-Policy PI Algorithm . . . . 117
7.3.2 Implementation Method of Off-Policy Iteration
Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119

7.3.3 Implementation Process . . . . . . . . . . . . . . . . . . . . . . . 122


7.4 Simulation Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
7.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
8 Approximation-Error-ADP-Based Optimal Tracking Control
for Chaotic Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
8.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
8.2 Problem Formulation and Preliminaries . . . . . . . . . . . . . . . . . . 128
8.3 Optimal Tracking Control Scheme Based on
Approximation-Error ADP Algorithm . . . . . . . . . . . . . . . . . . . 130
8.3.1 Description of Approximation-Error ADP
Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130
8.3.2 Convergence Analysis of The Iterative ADP
Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
8.4 Simulation Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136
8.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144
9 Off-Policy Actor-Critic Structure for Optimal Control
of Unknown Systems with Disturbances . . . . . . . . . . . . . . . . . . . . . 147
9.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
9.2 Problem Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148
9.3 Off-Policy Actor-Critic Integral Reinforcement Learning . . . . . . 151
9.3.1 On-Policy IRL for Nonzero Disturbance . . . . . . . . . . . 151
9.3.2 Off-Policy IRL for Nonzero Disturbance . . . . . . . . . . . 152
9.3.3 NN Approximation for Actor-Critic Structure . . . . . . . . 154
9.4 Disturbance Compensation Redesign and Stability Analysis . . . 157
9.4.1 Disturbance Compensation Off-Policy Controller
Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157
9.4.2 Stability Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158
9.5 Simulation Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161
9.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163
10 An Iterative ADP Method to Solve for a Class of Nonlinear
Zero-Sum Differential Games . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 165
10.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 165
10.2 Preliminaries and Assumptions . . . . . . . . . . . . . . . . . . . . . . . . 166
10.3 Iterative Approximate Dynamic Programming Method
for ZS Differential Games . . . . . . . . . . . . . . . . . . . . . . . . . . . . 169
10.3.1 Derivation of The Iterative ADP Method . . . . . . . . . . . 169
10.3.2 The Procedure of the Method . . . . . . . . . . . . . . . . . . . 174
10.3.3 The Properties of the Iterative ADP Method . . . . . . . . 176

10.4 Neural Network Implementation . . . . . . . . . . . . . . . . . . . . . . . 190


10.4.1 The Model Network . . . . . . . . . . . . . . . . . . . . . . . . . . 191
10.4.2 The Critic Network . . . . . . . . . . . . . . . . . . . . . . . . . . 192
10.4.3 The Action Network . . . . . . . . . . . . . . . . . . . . . . . . . . 193
10.5 Simulation Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 195
10.6 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 204
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 204
11 Neural-Network-Based Synchronous Iteration Learning
Method for Multi-player Zero-Sum Games . . . . . . . . . . . . . . . . . . . 207
11.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 207
11.2 Motivations and Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . 208
11.3 Synchronous Solution of Multi-player ZS Games . . . . . . . . . . . 213
11.3.1 Derivation of Off-Policy Algorithm . . . . . . . . . . . . . . . 213
11.3.2 Implementation Method for Off-Policy Algorithm . . . . 214
11.3.3 Stability Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . 218
11.4 Simulation Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 219
11.5 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 224
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 224
12 Off-Policy Integral Reinforcement Learning Method
for Multi-player Non-zero-Sum Games . . . . . . . . . . . . . . . . . . . . . . 227
12.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 227
12.2 Problem Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 228
12.3 Multi-player Learning PI Solution for NZS Games . . . . . . . . . . 229
12.4 Off-Policy Integral Reinforcement Learning Method . . . . . . . . . 234
12.4.1 Derivation of Off-Policy Algorithm . . . . . . . . . . . . . . . 234
12.4.2 Implementation Method for Off-Policy Algorithm . . . . 236
12.4.3 Stability Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . 238
12.5 Simulation Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 242
12.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 248
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 248
13 Optimal Distributed Synchronization Control for Heterogeneous
Multi-agent Graphical Games . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 251
13.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 251
13.2 Graphs and Synchronization of Multi-agent Systems . . . . . . . . . 252
13.2.1 Graph Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 252
13.2.2 Synchronization and Tracking Error Dynamic
Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 253
13.3 Optimal Distributed Cooperative Control for Multi-agent
Differential Graphical Games . . . . . . . . . . . . . . . . . . . . . . . . . . 255
13.3.1 Cooperative Performance Index Function . . . . . . . . . . . 256
13.3.2 Nash Equilibrium . . . . . . . . . . . . . . . . . . . . . . . . . . . . 257

13.4 Heterogeneous Multi-agent Differential Graphical Games


by Iterative ADP Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . 259
13.4.1 Derivation of the Heterogeneous Multi-agent
Differential Graphical Games . . . . . . . . . . . . . . . . . . . 259
13.4.2 Properties of the Developed Policy Iteration
Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 260
13.4.3 Heterogeneous Multi-agent Policy Iteration
Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 264
13.5 Simulation Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 265
13.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 269
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 269
Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 271
About the Authors

Ruizhuo Song Ph.D., Associate Professor, School of Automation and Electrical
Engineering, University of Science and Technology Beijing.
She received her Ph.D. in control theory and control engineering from
Northeastern University, Shenyang, China, in 2012. She was a Postdoctoral Fellow
at the University of Science and Technology Beijing, Beijing, China. She is
currently an Associate Professor at the School of Automation and Electrical
Engineering, University of Science and Technology Beijing. She was a Visiting
Scholar in the Department of Electrical Engineering at the University of Texas at
Arlington, Arlington, TX, USA, from 2013 to 2014. She was a Visiting Scholar in
the Department of Electrical, Computer, and Biomedical Engineering at the
University of Rhode Island, Kingston, RI, USA, from January 2018 to February
2018. Her current research interests include optimal control, multi-player games,
neural network-based control, nonlinear control, wireless sensor networks, and
adaptive dynamic programming and their industrial application. She has published
over 40 journal and conference papers and co-authored two monographs.

Qinglai Wei Ph.D., Professor, The State Key Laboratory of Management and
Control for Complex Systems, Institute of Automation, Chinese Academy of
Sciences.
He received the B.S. degree in Automation, and the Ph.D. degree in control
theory and control engineering, from the Northeastern University, Shenyang,
China, in 2002 and 2009, respectively. From 2009–2011, he was a postdoctoral
fellow with The State Key Laboratory of Management and Control for Complex
Systems, Institute of Automation, Chinese Academy of Sciences, Beijing, China.
He is currently a professor of the institute. He has authored three books, and
published over 80 international journal papers. His research interests include
adaptive dynamic programming, neural network-based control, computational
intelligence, optimal control, nonlinear systems and their industrial applications.
Dr. Wei is an Associate Editor of IEEE Transactions on Automation Science and
Engineering since 2017, IEEE Transactions on Consumer Electronics since 2017,
Control Engineering (in Chinese) since 2017, IEEE Transactions on Cognitive and


Developmental Systems since 2017, IEEE Transactions on Systems, Man, and
Cybernetics: Systems since 2016, Information Sciences since 2016,
Neurocomputing since 2016, Optimal Control Applications and Methods since
2016, Acta Automatica Sinica since 2015, and has been holding the same position
for IEEE Transactions on Neural Networks and Learning Systems during 2014–
2015. He is the Secretary of IEEE Computational Intelligence Society (CIS) Beijing
Chapter since 2015. He was the Program Chair of The 14th International
Symposium on Neural Networks (ISNN 2017), Program Co-Chair of The 24th
International Conference on Neural Information Processing (ICONIP 2017),
Registration Chair of the 12th World Congress on Intelligent Control and
Automation (WCICA 2016), 2014 IEEE World Congress on Computational
Intelligence (WCCI 2014), the 2013 International Conference on Brain Inspired
Cognitive Systems (BICS 2013), and the Eighth International Symposium on
Neural Networks (ISNN 2011). He was the Publication Chair of 5th International
Conference on Information Science and Technology (ICIST 2015) and the Ninth
International Symposium on Neural Networks (ISNN 2012). He was the Finance
Chair of the 4th International Conference on Intelligent Control and Information
Processing (ICICIP 2013) and the Publicity Chair of the 2012 International
Conference on Brain Inspired Cognitive Systems (BICS 2012). He was a guest
editor for Neural Computing and Applications and Neurocomputing in 2013 and
2014, respectively. He was a recipient of Shuang-Chuang Talents in Jiangsu
Province, China, in 2014. He was a recipient of the Outstanding Paper Awards of
IEEE Transactions on Neural Networks and Learning Systems in 2018, Acta
Automatica Sinica in 2011, Zhang Siying Outstanding Paper Award of Chinese
Control and Decision Conference (CCDC) in 2015 and Best Paper Award of IEEE
6th Data Driven Control and Learning Systems Conference (DDCLS) in 2017. He
was a recipient of Young Researcher Award of Asia Pacific Neural Network
Society (APNNS) in 2016. He was a recipient of the Young Scientist Award, the Yang Jiachi
Science and Technology Awards (Second Class Prize), and Natural Science Award
(First Prize) in Chinese Association of Automation (CAA) in 2017. He was the PI
for 13 national and local government projects.

Qing Li Ph.D., received his B.E. from North China University of Science and
Technology, Tangshan, China, in 1993, and the Ph.D. in control theory and its
applications from University of Science and Technology Beijing, Beijing, China, in
2000. He is currently a Professor at the School of Automation and Electrical
Engineering, University of Science and Technology Beijing, Beijing, China. He was
a Visiting Scholar at Ryerson University, Toronto, Canada, from February
2006 to February 2007. His research interests include intelligent control and
intelligent optimization.
Symbols

x State vector
u Control vector
F System function
i Index
ℝⁿ State space
X State set
J, V Performance index functions
U Utility function
J* Optimal performance index function
u* Law of optimal control
N Terminal time
W Weight matrix between the hidden layer and output layer
Q, R Positive definite matrices
a, b Learning rate
H Hamiltonian function
ec Estimation error
Ec Squared residual error

Chapter 1
Introduction

1.1 Optimal Control

Optimal control is one particular branch of modern control. It deals with the problem
of finding a control law for a given system such that a certain optimality criterion is
achieved. A control problem includes a cost functional that is a function of the state and
control variables. An optimal control is a set of differential equations describing the
paths of the control variables that minimize the cost functional. The optimal control
can be derived using Pontryagin's maximum principle (a necessary condition, also
known as Pontryagin's minimum principle or simply Pontryagin's principle), or by
solving the Hamilton–Jacobi–Bellman (HJB) equation (a sufficient condition). For
linear systems with a quadratic performance function, the HJB equation reduces to the
algebraic Riccati equation (ARE) [1].
Dynamic programming is based on Bellman's principle of optimality: an optimal
(control) policy has the property that, no matter what the previous decisions have been,
the remaining decisions must constitute an optimal policy with regard to the state
resulting from those previous decisions. Dynamic programming is a very useful tool
in solving optimization and optimal control problems.
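As a minimal numerical illustration of Bellman's principle, consider a finite-horizon discrete-time LQR with quadratic cost-to-go Vₖ(x) = ½xᵀPₖx: dynamic programming computes Pₖ by a backward sweep from the terminal cost. The system matrices and horizon below are illustrative, not from the text.

```python
import numpy as np

# Illustrative discrete-time linear system x(k+1) = Ax(k) + Bu(k)
# with quadratic stage cost (1/2)(x'Qx + u'Ru); not from the text.
A = np.array([[1.0, 0.1],
              [0.0, 1.0]])
B = np.array([[0.0],
              [0.1]])
Q = np.eye(2)          # stage state weighting
R = np.array([[1.0]])  # stage control weighting
N = 20                 # horizon length

# Backward sweep (Bellman's principle): starting from the terminal
# weight P_N, each step minimizes stage cost plus cost-to-go.
P = Q.copy()           # terminal cost V_N(x) = (1/2) x'P_N x
gains = []
for k in range(N - 1, -1, -1):
    K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)  # optimal gain at step k
    P = Q + K.T @ R @ K + (A - B @ K).T @ P @ (A - B @ K)
    gains.append(K)
gains.reverse()        # gains[k] now gives u(k) = -gains[k] @ x(k)
```

Note that the recursion runs backward in time even though the control is applied forward; this backward sweep is the computational burden that motivates the forward-in-time methods discussed later in this chapter.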
Next, we introduce optimal control problems for continuous-time and
discrete-time linear systems.

1.1.1 Continuous-Time LQR

A special case of the general optimal control problem given is the linear quadratic
regulator (LQR) optimal control problem [2]. The LQR considers the linear
time-invariant dynamical system described by

ẋ(t) = Ax(t) + Bu(t) (1.1)


with state x(t) ∈ ℝⁿ and control input u(t) ∈ ℝᵐ. To this system is associated the
infinite-horizon quadratic cost function or performance index

V(x(t₀), t₀) = (1/2) ∫_{t₀}^{∞} (xᵀ(τ)Qx(τ) + uᵀ(τ)Ru(τ)) dτ    (1.2)

with weighting matrices Q > 0 and R > 0. It is assumed that (A, B) is stabilizable
and (A, √Q) is detectable.
The LQR optimal control problem requires finding the control policy that mini-
mizes the cost (1.2):

u∗ (t) = arg min V (t0 , x(t0 ), u(t)). (1.3)


u(t)

The solution of this optimal control problem is given by the state-feedback

u(t) = −Kx(t) (1.4)

where K = R⁻¹BᵀP, and the matrix P is a positive definite solution of the algebraic
Riccati equation (ARE)

AᵀP + PA + Q − PBR⁻¹BᵀP = 0.    (1.5)

Under the stabilizability and detectability conditions, there is a unique positive
semidefinite solution of the ARE that yields a stabilizing closed-loop controller given
by (1.4). That is, the closed-loop system is asymptotically stable.
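The design above, solving the ARE (1.5) and forming the gain K of (1.4), can be sketched numerically with SciPy's ARE solver; the double-integrator plant below is an illustrative assumption, not from the text.

```python
import numpy as np
from scipy.linalg import solve_continuous_are

# Illustrative double-integrator plant: xdot = Ax + Bu
A = np.array([[0.0, 1.0],
              [0.0, 0.0]])
B = np.array([[0.0],
              [1.0]])
Q = np.eye(2)          # state weighting, Q > 0
R = np.array([[1.0]])  # control weighting, R > 0

# Solve the ARE (1.5): A'P + PA + Q - P B R^{-1} B' P = 0
P = solve_continuous_are(A, B, Q, R)

# State-feedback gain (1.4): u = -Kx with K = R^{-1} B' P
K = np.linalg.solve(R, B.T @ P)

# The closed-loop matrix A - BK should be Hurwitz (asymptotically stable)
closed_loop_eigs = np.linalg.eigvals(A - B @ K)
```

Checking that the real parts of `closed_loop_eigs` are negative confirms the asymptotic stability claimed for the closed-loop system.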

1.1.2 Discrete-Time LQR

Consider the discrete-time LQR problem, where the dynamical system is described by

x(k + 1) = Ax(k) + Bu(k) (1.6)

with k the discrete-time index. The associated infinite-horizon performance index
has deterministic stage costs and is

V(k) = (1/2) Σ_{i=k}^{∞} (xᵀ(i)Qx(i) + uᵀ(i)Ru(i)).    (1.7)

The value function for a fixed policy depends only on the initial state x(k). A difference
equation equivalent to this infinite sum is given by

V(x(k)) = (1/2)(xᵀ(k)Qx(k) + uᵀ(k)Ru(k)) + V(x(k + 1)).    (1.8)
Assuming the value is quadratic in the state, so that

V(x(k)) = (1/2) xᵀ(k)Px(k)    (1.9)
for some kernel matrix P, yields the Bellman equation form

xᵀ(k)Px(k) = xᵀ(k)Qx(k) + uᵀ(k)Ru(k) + (Ax(k) + Bu(k))ᵀP(Ax(k) + Bu(k)).    (1.10)

Assuming a constant state feedback policy u(k) = −Kx(k) for some stabilizing
gain K, we write

(A − BK)ᵀP(A − BK) − P + Q + KᵀRK = 0.    (1.11)

This is a Lyapunov equation.
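For a given stabilizing gain K, (1.11) is linear in P and can be solved directly; evaluating a fixed policy this way is the building block of the ADP methods developed later. The sketch below uses illustrative matrices and, for the sake of a check, takes K as the optimal gain from the discrete ARE, so the evaluated P should coincide with the optimal kernel.

```python
import numpy as np
from scipy.linalg import solve_discrete_are, solve_discrete_lyapunov

# Illustrative discrete-time plant; not from the text.
A = np.array([[1.0, 0.1],
              [0.0, 1.0]])
B = np.array([[0.0],
              [0.1]])
Q = np.eye(2)
R = np.array([[1.0]])

# Optimal gain from the discrete-time ARE, used here as the fixed policy.
P_opt = solve_discrete_are(A, B, Q, R)
K = np.linalg.solve(R + B.T @ P_opt @ B, B.T @ P_opt @ A)

# Policy evaluation: solve the Lyapunov equation (1.11)
# (A-BK)' P (A-BK) - P + Q + K'RK = 0 for the kernel P.
Acl = A - B @ K
P = solve_discrete_lyapunov(Acl.T, Q + K.T @ R @ K)
# Since K is optimal here, P recovers P_opt.
```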


The above gives solution methods for linear-system optimal control problems. In general, however, it is often computationally untenable to run true dynamic programming, because of the backward-in-time numerical process required for its solution, i.e., as a result of the well-known "curse of dimensionality".

1.2 Adaptive Dynamic Programming

Reinforcement learning (RL) is a type of machine learning developed in the computational intelligence community in computer science and engineering, and it has been extensively used to solve optimal control problems. RL is a computational approach to learning from interactions with the surrounding environment: it is concerned with how an agent or actor ought to take actions so as to optimize the cost of its long-term interactions with the environment. In the context of control, the environment is the dynamic system, the agent corresponds to the controller, and actions correspond to control signals. The RL objective is to find a strategy that minimizes an expected long-term cost.
One type of reinforcement learning algorithm employs the actor-critic structure shown in Fig. 1.1, which is also used in [3]. This structure produces forward-in-time algorithms that are implemented in real time, wherein an actor component applies an action, or control policy, to the environment, and a critic component assesses the value of that action. The learning mechanism supported by the actor-critic structure has two steps: policy evaluation, performed by the critic, and policy improvement, performed by the actor. The policy evaluation step is performed by observing from the environment the results of applying current actions. Performance or value can be defined in terms of optimality objectives such as

Fig. 1.1 Reinforcement learning with an actor/critic structure

minimum fuel, minimum energy, minimum risk, or maximum reward. Based on the
assessment of the performance, one of several schemes can then be used to modify
or improve the control policy in the sense that the new policy yields a value that is
improved relative to the previous value. In this scheme, reinforcement learning is a
means of learning optimal behaviors by observing the real-time responses from the
environment to nonoptimal control policies.
Werbos developed actor-critic techniques for feedback control of discrete-time
dynamical systems that learn optimal policies online in real time using data measured
along the system trajectories [4–7]. These methods, known as approximate dynamic
programming or adaptive dynamic programming (ADP), comprise a family of the
basic learning methods: heuristic dynamic programming (HDP), action-dependent
HDP (ADHDP), dual heuristic dynamic programming (DHP), ADDHP, globalized
DHP (GDHP), and ADGDHP. ADP builds a critic to approximate the cost function and an actor to approximate the optimal control of dynamic programming, using a function approximation structure such as neural networks (NNs) [8].
By solving the Bellman equation in real time, forward in time, using data measured along the system trajectories, the ADP algorithm has become an effective intelligent control method and has played an important role in seeking solutions to the optimal control problem [9]. ADP now has two main iteration forms, namely policy iteration (PI) and value iteration (VI) [10]. PI algorithms contain policy evaluation and policy improvement steps; an initial stabilizing control law is required, which is often difficult to obtain. Compared to VI algorithms, in most applications PI requires fewer iterations, as it is a Newton method, but every iteration is more computationally demanding. VI algorithms solve the optimal control problem without requiring an initial stabilizing control law, and are therefore easier to implement. For system (1.6) and the cost function (1.7), the detailed procedures of PI and VI are given as follows.

In the following, $r(x(k), u(k)) = \frac{1}{2}\left\{x^T(k)Qx(k) + u^T(k)Ru(k)\right\}$ denotes the stage cost from (1.7).

1. PI Algorithm
(1) Initialize. Select any admissible (i.e., stabilizing) control policy $h^{[0]}(x(k))$.
(2) Policy Evaluation Step. Determine the value of the current policy using the Bellman equation

$$V^{[i+1]}(x(k)) = r(x(k), h^{[i]}(x(k))) + V^{[i+1]}(x(k+1)). \qquad (1.12)$$

(3) Policy Improvement Step. Determine an improved policy using

$$h^{[i+1]}(x(k)) = \arg\min_{h(\cdot)}\left\{r(x(k), h(x(k))) + V^{[i+1]}(x(k+1))\right\}. \qquad (1.13)$$

2. VI Algorithm
(1) Initialize. Select any control policy $h^{[0]}(x(k))$, not necessarily admissible or stabilizing.
(2) Value Update Step. Update the value using

$$V^{[i+1]}(x(k)) = r(x(k), h^{[i]}(x(k))) + V^{[i]}(x(k+1)). \qquad (1.14)$$

(3) Policy Improvement Step. Determine an improved policy using

$$h^{[i+1]}(x(k)) = \arg\min_{h(\cdot)}\left\{r(x(k), h(x(k))) + V^{[i+1]}(x(k+1))\right\}. \qquad (1.15)$$
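For the linear case (1.6)-(1.7), the VI procedure above has closed-form steps: with a quadratic value $V^{[i]}(x) = \frac{1}{2}x^T P^{[i]}x$, the policy improvement and value update reduce to a Riccati recursion on the kernel matrix. A minimal sketch under assumed system matrices:

```python
import numpy as np

# Illustrative (assumed) system and weights.
A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [0.1]])
Q, R = np.eye(2), np.array([[1.0]])

P = np.zeros((2, 2))                     # V[0] = 0: VI needs no initial stabilizing policy
for _ in range(2000):
    # policy improvement for the current quadratic value: K = (R + B'PB)^{-1} B'PA
    K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
    # value update: P <- Q + K'RK + (A - BK)'P(A - BK)
    P = Q + K.T @ R @ K + (A - B @ K).T @ P @ (A - B @ K)

# the fixed point satisfies the discrete-time algebraic Riccati equation
dare = A.T @ P @ A - P + Q - A.T @ P @ B @ np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
assert np.allclose(dare, 0.0, atol=1e-6)
```

Starting from $P = 0$ illustrates the practical advantage of VI noted above: no admissible initial policy is needed, and the iterates still converge to the optimal kernel matrix.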

1.3 Review of Matrix Algebra

In this book, some matrix manipulations are the basic mathematical vehicle and, for those whose memory needs refreshing, we provide a short review.
1. For any n × n matrices A and B, $(AB)^T = B^T A^T$.
2. For any n × n matrices A and B, if A and B are nonsingular, then $(AB)^{-1} = B^{-1}A^{-1}$.
3. The Kronecker product of two matrices $A = [a_{ij}] \in \mathbb{C}^{m \times n}$ and $B = [b_{ij}] \in \mathbb{C}^{p \times q}$ is $A \otimes B = [a_{ij}B] \in \mathbb{C}^{mp \times nq}$.
4. If $A = [a_1, a_2, \ldots, a_n] \in \mathbb{C}^{m \times n}$, where the $a_i$ are the columns of A, the stacking operator is defined by $s(A) = [a_1^T, a_2^T, \ldots, a_n^T]^T$. It converts $A \in \mathbb{C}^{m \times n}$ into a vector $s(A) \in \mathbb{C}^{mn}$. Then for matrices A, B and D of compatible dimensions we have

$$s(ABD) = (D^T \otimes A)s(B). \qquad (1.16)$$

5. If $x \in \mathbb{R}^n$ is a vector, then the square of the Euclidean norm is $\|x\|^2 = x^T x$.
6. If Q is symmetric, then $\dfrac{\partial}{\partial x}\left(x^T Qx\right) = 2Qx$.
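The stacking identity (1.16) is easy to verify numerically; the sketch below checks it for random matrices (sizes are an illustrative assumption), using column-major reshaping as the stacking operator s(·).

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4))          # arbitrary test matrices with compatible shapes
B = rng.standard_normal((4, 2))
D = rng.standard_normal((2, 5))

def s(M):
    # stack the columns of M into a single vector (column-major order)
    return M.reshape(-1, order="F")

lhs = s(A @ B @ D)                       # left-hand side of (1.16)
rhs = np.kron(D.T, A) @ s(B)             # right-hand side of (1.16)
assert np.allclose(lhs, rhs)
```

This identity is what turns matrix equations such as Lyapunov equations into ordinary linear systems in the stacked unknown.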

References

1. Zhang, H., Liu, D., Luo, Y., Wang, D.: Adaptive Dynamic Programming for Control-Algorithms
and Stability. Springer, London (2013)
2. Vrabie, D., Vamvoudakis, K., Lewis, F.: Optimal Adaptive Control and Differential Games by
Reinforcement Learning Principles. The Institution of Engineering and Technology, London
(2013)
3. Barto, A., Sutton, R., Anderson, C.: Neuronlike adaptive elements that can solve difficult learning control problems. IEEE Trans. Syst. Man Cybern. SMC-13(5), 834–846 (1983)
4. Werbos, P.: A menu of designs for reinforcement learning over time. In: Miller, W.T., Sutton,
R.S., Werbos, P.J. (eds.) Neural Networks for Control, pp. 67–95. MIT Press, Cambridge (1991)
5. Werbos, P.: Approximate dynamic programming for real-time control and neural modeling.
In: White, D.A., Sofge, D.A. (eds.) Handbook of Intelligent Control. Van Nostrand Reinhold,
New York (1992)
6. Werbos, P.: Neural networks for control and system identification. In: Proceedings of the IEEE Conference on Decision and Control, Tampa, FL, pp. 260–265 (1989)
7. Werbos, P.: Advanced forecasting methods for global crisis warning and models of intelligence.
General Syst. Yearbook 22, 25–38 (1977)
8. Liu, D., Wei, Q., Yang, X., Li, H., Wang, D.: Adaptive Dynamic Programming with Applications
in Optimal Control. Springer International Publishing, Berlin (2017)
9. Werbos, P.: ADP: the key direction for future research in intelligent control and understanding
brain intelligence. IEEE Trans. Syst. Man Cybern. Part B: Cybern. 38(4), 898–900 (2008)
10. Lewis, F., Vrabie, D.: Reinforcement learning and adaptive dynamic programming for feedback
control. IEEE Circuits Syst. Mag. 9(3), 32–50 (2009)
Chapter 2
Neural-Network-Based Approach for
Finite-Time Optimal Control

This chapter proposes a novel finite-time optimal control method for unknown nonlinear systems, based on input-output data and the ADP algorithm. In this method, a single-hidden-layer feed-forward network (SLFN) trained with the extreme learning machine (ELM) is used to construct a data-based identifier of the unknown system dynamics. Based on this identifier, the finite-time optimal control method is established by the ADP algorithm. Two further SLFNs with ELM are used in the ADP method to facilitate the implementation of the iterative algorithm; they approximate the performance index function and the optimal control law at each iteration, respectively. A simulation example is provided to demonstrate the effectiveness of the proposed control scheme.

2.1 Introduction

The linear optimal control problem with a quadratic cost function is probably the best-known control problem [1, 2], and it can be reduced to a Riccati equation. The optimal control of nonlinear systems, in contrast, is usually a challenging and difficult problem [3, 4]. Furthermore, compared with the case of known system dynamics, the optimal control problem is far more intractable when the system dynamics is unknown. Generally speaking, most real systems are far too complex to admit perfect mathematical models. Whenever no model is available for designing the system controller, and none is easy to produce, a standard way out is to resort to data-based techniques [5]: (1) on the basis of input-output data, a model of the unknown system dynamics is identified; (2) on the basis of the estimated model of the system dynamics, the controller is designed by model-based design techniques.
It is well known that neural networks are an effective tool for intelligent identification based on input-output data, owing to their nonlinearity, adaptivity, self-learning and fault tolerance [6–10]. Among them, the SLFN is one of the most
© Science Press, Beijing and Springer Nature Singapore Pte Ltd. 2019 7
R. Song et al., Adaptive Dynamic Programming: Single and Multiple
Controllers, Studies in Systems, Decision and Control 166,
https://doi.org/10.1007/978-981-13-1712-5_2

useful types [11]. In [12], Hornik proved that if the activation function is continuous, bounded, and non-constant, then continuous mappings can be approximated by SLFNs with additive hidden nodes over compact input sets. In [13], Leshno improved the results of [12] and proved that SLFNs with additive hidden nodes and a non-polynomial activation function can approximate any continuous target function. In [11], it is proven that SLFNs with randomly generated additive hidden nodes and a broad class of activation functions can universally approximate any continuous target function on any compact subset of Euclidean space. For SLFN training, there are three main approaches: (1) gradient-descent based, for example the back-propagation (BP) method [14]; (2) least-squares based, for example the ELM method used in this chapter; and (3) standard-optimization based, for example the support vector machine (SVM).
However, the learning speed of feed-forward neural networks is in general far slower than required, and this has been a major bottleneck in their applications for the past decades [15]. Two key reasons are: (1) slow gradient-based learning algorithms are extensively used to train the networks, and (2) all the parameters of the networks are tuned iteratively by such algorithms. Unlike conventional neural network training, in this chapter the ELM method is used to train the SLFN. Since such an SLFN can act as a universal approximator, one may simply choose the hidden nodes randomly and then only adjust the output weights linking the hidden layer and the output layer. For a given network architecture, ELM requires no human-intervened parameters, so it converges fast and can be easily used.
Based on the SLFN identifier, a finite-time optimal control method is presented in this chapter. In finite-time control problems, the system must be stabilized to zero within finite time. The controller design for finite-time problems still presents a challenge to control engineers, owing to the lack of methodology and the difficulty of determining the control horizon. Few results address finite-time optimal control based on the ADP algorithm. Reference [16] solved the finite-horizon optimal control problem for a class of discrete-time nonlinear systems using the ADP algorithm, but the method in [16] adopts BP networks to obtain the optimal control, which have slow convergence speed.
In this chapter, we design the finite-time optimal controller based on SLFNs with ELM for unknown nonlinear systems. First, the identifier is established from the input-output data, and it is proven that the identification error converges to zero. Upon the data-based identifier, the optimal control method is proposed. We prove that the iterative performance index function converges to the optimum, and the optimal control is also obtained. Compared with other popular implementation methods such as BP, the SLFN with ELM has a fast response speed and is fully automatic: except for the target error and the allowed maximum number of hidden nodes, no control parameters need to be manually tuned by users.
The rest of this chapter is organized as follows. In Sect. 2.2, the problem formulation is presented. In Sect. 2.3, the identifier is developed based on the input-output data. In Sect. 2.4, the iterative ADP algorithm and its convergence proof are given. In Sect. 2.5, the neural network implementation of the iterative algorithm is described. In Sect. 2.6, an example is given to demonstrate the effectiveness of the proposed control scheme. In Sect. 2.7, the conclusion is drawn.

2.2 Problem Formulation and Motivation

Consider the following unknown discrete-time nonlinear systems

x(k + 1) = F(x(k), u(k)), (2.1)

where the state $x(k) \in \mathbb{R}^n$ and the control $u(k) \in \mathbb{R}^m$. $F(x(k), u(k))$ is an unknown continuous function. Assume that the state is completely controllable and bounded on $\Omega$, and $F(0, 0) = 0$. The finite-time performance index function is defined as follows:

$$J(x(k), U(k, K)) = \sum_{i=k}^{K}\left\{x^T(i)Qx(i) + u^T(i)Ru(i)\right\} \qquad (2.2)$$

where Q and R are positive definite matrices, K is a finite positive integer, and the control sequence $U(k, K) = (u(k), u(k+1), \ldots, u(K))$ is finite-time admissible [16]. The length of $U(k, K)$ is defined as $K - k + 1$.
This chapter aims to find the optimal control for system (2.1) with respect to the performance index function (2.2). Since the system dynamics is completely unknown, the optimal control problem cannot be solved directly. Therefore, it is desirable to propose a method that does not need the exact system dynamics but only input-output data, which can be obtained during the operation of the system. In this chapter, we propose a data-based optimal control scheme using SLFNs with ELM and the ADP method for general unknown nonlinear systems. The design of the proposed controller is divided into two steps:
(1) The unknown nonlinear system dynamics is identified by SLFN identification
scheme with convergence proof.
(2) The optimal controller is designed based on the data-based identifier.
In the following sections, we discuss the establishment of the data-based identifier and the controller design in detail.
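To make the objects in (2.1)-(2.2) concrete, the sketch below evaluates the finite-time cost of a candidate control sequence on a stand-in scalar system; both the dynamics F and the sequence are illustrative assumptions, not the plant studied later.

```python
import numpy as np

def F(x, u):
    # stand-in for the unknown dynamics (2.1); an assumed toy map with F(0, 0) = 0
    return 0.5 * x + np.sin(u)

def cost(x0, U, Q=1.0, R=1.0):
    # J(x(k), U(k, K)) = sum of x'Qx + u'Ru along the trajectory, cf. (2.2)
    x, J = x0, 0.0
    for u in U:
        J += Q * x**2 + R * u**2
        x = F(x, u)
    return J

U = [-0.5, -0.2, 0.0]                    # a candidate (assumed) finite-time control sequence
J = cost(1.0, U)
assert cost(0.0, [0.0]) == 0.0           # staying at the origin costs nothing
```

The ADP algorithm of Sect. 2.4 searches over such finite-length sequences for the one minimizing this cost, using only trajectory data of this kind.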

2.3 The Data-Based Identifier

In this section, the ELM method is introduced and the data-based identifier is established with a convergence proof. The structure of the SLFN is shown in Fig. 2.1.

Consider $N_1$ arbitrary distinct samples $(\bar{x}(i), \bar{y}(i))$, where $\bar{x}(i) \in \mathbb{R}^{n_1}$, $\bar{y}(i) \in \mathbb{R}^{m_1}$, $i = 1, 2, \ldots, N_1$. The weight vector between the input neurons and the jth hidden neuron is $w_j \in \mathbb{R}^{n_2}$. The weight vector between the output neurons and the jth hidden neuron is $\bar{\beta}_j \in \mathbb{R}^{n_3}$, which will be designed by the ELM method [17]. The number of hidden neurons is L. The threshold of the jth hidden neuron is $b_j$. The hidden-layer activation function $g_L(\bar{x})$ is infinitely differentiable; then the

Fig. 2.1 The basic SLFN architecture

mathematical model of the SLFN is [15]

$$f_L(\bar{x}(i)) = \sum_{j=1}^{L}\bar{\beta}_j\, g_L(w_j^T\bar{x}(i) + b_j), \quad i = 1, 2, \ldots, N_1. \qquad (2.3)$$

Unlike the traditional popular implementations of SLFN, in this chapter ELM is used to adjust the output weights. In theory, Refs. [18, 19] show that the input weights and hidden-neuron biases of the SLFN need not be adjusted during training, and one may simply assign random values to them. For convenience of explanation, let $\beta_L = [\bar{\beta}_1, \bar{\beta}_2, \ldots, \bar{\beta}_L]^T \in \mathbb{R}^{L \times m_1}$, $\bar{Y} = [\bar{y}(1), \bar{y}(2), \ldots, \bar{y}(N_1)]^T \in \mathbb{R}^{N_1 \times m_1}$, and

$$H = [h(\bar{x}(1)), h(\bar{x}(2)), \ldots, h(\bar{x}(N_1))]^T = \begin{bmatrix} G(w_1, b_1, \bar{x}(1)) & \cdots & G(w_L, b_L, \bar{x}(1)) \\ \vdots & \ddots & \vdots \\ G(w_1, b_1, \bar{x}(N_1)) & \cdots & G(w_L, b_L, \bar{x}(N_1)) \end{bmatrix}_{N_1 \times L},$$

where $G(w_j, b_j, \bar{x}(i)) = g_L(w_j^T\bar{x}(i) + b_j)$. So we have

$$H\beta_L = \bar{Y}. \qquad (2.4)$$

Based on the least-squares method, it can be obtained that

$$\beta_L = H^+\bar{Y}, \qquad (2.5)$$

where $H^+ = (H^T H)^{-1}H^T$.

For the SLFN in (2.3), the output weight $\beta_L$ is the only quantity we need to obtain. In the following, a theorem is given to show that $\beta_L$ in (2.5) is well defined, i.e., that H has full rank.

Theorem 2.1 Let the SLFN be defined as in (2.3), with L hidden neurons. For $N_1$ arbitrary distinct input samples $\bar{x}(i)$ and any randomly generated $w_j$ and $b_j$, the matrix H in (2.4) has full rank with probability one.

Proof Since the $\bar{x}(i)$ are distinct, for any vector $w_j$ generated according to any continuous probability distribution, the values $w_j^T\bar{x}(1), w_j^T\bar{x}(2), \ldots, w_j^T\bar{x}(N_1)$ are, with probability one, different from each other. Define the jth column of H as $c(j) = [g_L(w_j^T\bar{x}(1) + b_j), g_L(w_j^T\bar{x}(2) + b_j), \ldots, g_L(w_j^T\bar{x}(N_1) + b_j)]^T$; then $c(j)$ does not belong to any subspace of dimension less than $N_1$ [19]. It follows that for any $w_j$ and $b_j$ generated according to a continuous probability distribution, H in (2.4) is full rank with probability one, and hence invertible when it is square.

Therefore, the SLFN with the ELM method is summarized as follows [20]:

Step 1. Given a training set $(\bar{x}(i), \bar{y}(i))$, $i = 1, 2, \ldots, N_1$, a hidden-node output function $G(w_j, b_j, \bar{x}(i))$, and a hidden-node number L.
Step 2. Assign arbitrary hidden-node parameters $(w_j, b_j)$, $j = 1, 2, \ldots, L$.
Step 3. Calculate the hidden-layer output matrix H.
Step 4. Calculate $\beta_L$ according to (2.5).
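The four steps can be sketched in a few lines. The toy data set, the sigmoid activation, and the weight ranges below are assumptions for illustration only:

```python
import numpy as np

rng = np.random.default_rng(1)
N1, n, L = 200, 1, 20                    # samples, input dimension, hidden nodes (assumed)

X = rng.uniform(-1.0, 1.0, (N1, n))      # Step 1: training inputs x_bar(i)
Y = np.sin(3.0 * X)                      #         and targets y_bar(i), a toy smooth mapping

W = rng.uniform(-3.0, 3.0, (n, L))       # Step 2: random hidden-node parameters (w_j, b_j);
b = rng.uniform(-3.0, 3.0, L)            #         ranges are an assumption

H = 1.0 / (1.0 + np.exp(-(X @ W + b)))   # Step 3: hidden-layer output matrix with sigmoid G

beta, *_ = np.linalg.lstsq(H, Y, rcond=None)   # Step 4: beta_L = H^+ Y_bar, cf. (2.5)

err = np.max(np.abs(H @ beta - Y))       # training error of the fitted network
```

Note that no iterative tuning is involved: training is one random draw plus one least-squares solve, which is the source of ELM's speed advantage reported in Sect. 2.6.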

Remark 2.1 The ELM algorithm can work with a wide class of activation functions, such as sigmoidal, radial-basis, sine, cosine and exponential functions. Feed-forward networks with arbitrarily assigned input weights and hidden-layer biases can universally approximate any continuous function on any compact input set [21].

Remark 2.2 It is important to point out that $\beta_L$ in (2.5) has the smallest norm among all the least-squares solutions of $H\beta_L = \bar{Y}$. Since the input weights and hidden-neuron biases of the SLFN are simply assigned random values, training an SLFN is equivalent to finding a least-squares solution of the linear system $H\beta_L = \bar{Y}$. Although almost all learning algorithms aim to reach the minimum training error, most of them cannot reach it, because of local minima or because an infinite number of training iterations is not allowed in applications [21]. Fortunately, the special unique solution $\beta_L$ in (2.5) has the smallest norm among all the least-squares solutions.
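The minimum-norm property in Remark 2.2 can be checked numerically: the Moore-Penrose pseudoinverse returns the minimum-norm least-squares solution, and adding any null-space component can only increase the norm. The toy matrices below are assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
H = rng.standard_normal((5, 8))          # underdetermined toy system: infinitely many solutions
Y = rng.standard_normal(5)

Hp = np.linalg.pinv(H)
beta_min = Hp @ Y                        # minimum-norm least-squares solution

# any other exact solution differs from beta_min by a null-space component of H ...
z = rng.standard_normal(8)
null_part = z - Hp @ (H @ z)             # projection of z onto the null space of H
beta_other = beta_min + null_part

assert np.allclose(H @ beta_min, Y)      # both solve H beta = Y
assert np.allclose(H @ beta_other, Y)
# ... and, since beta_min lies in the row space, orthogonal to null_part,
# the extra component can only increase the norm
assert np.linalg.norm(beta_min) <= np.linalg.norm(beta_other)
```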

2.4 Derivation of the Iterative ADP Algorithm with Convergence Analysis

For the unknown nonlinear system (2.1), the data-based identifier is established.
Then we can design the iterative ADP algorithm to get the solution of the finite-time
optimal control problem.

First, the derivations of the optimal control $u^*(k)$ and $J^*(x(k))$ are given in detail. It is known that, for the case of finite-horizon optimization, the optimal performance index function $J^*(x(k))$ satisfies [16]

$$J^*(x(k)) = \inf_{U(k,K)}\{J(x(k), U(k, K))\}, \qquad (2.6)$$

where $U(k, K)$ stands for a finite-time control sequence. The length of the control sequence is not assigned.

According to Bellman's optimality principle, the following Hamilton–Jacobi–Bellman (HJB) equation holds:

$$J^*(x(k)) = \inf_{u(k)}\{x^T(k)Qx(k) + u^T(k)Ru(k) + J^*(x(k+1))\}. \qquad (2.7)$$

Define the law of the optimal control sequence starting at k by

$$U^*(k, K) = \arg\inf_{U(k,K)}\{J(x(k), U(k, K))\}, \qquad (2.8)$$

and the law of the optimal control vector by

$$u^*(k) = \arg\inf_{u(k)}\{x^T(k)Qx(k) + u^T(k)Ru(k) + J^*(x(k+1))\}. \qquad (2.9)$$

Therefore, we have

$$J^*(x(k)) = x^T(k)Qx(k) + u^{*T}(k)Ru^*(k) + J^*(x(k+1)). \qquad (2.10)$$

Based on the above preparation, the finite-time ADP method for the unknown system is proposed. The iterative procedure is as follows.

For the iterative step $i = 1$, the performance index function is computed as

$$V^{[1]}(x(k)) = \inf_{u(k)}\{x^T(k)Qx(k) + u^T(k)Ru(k) + V^{[0]}(x(k+1))\} = x^T(k)Qx(k) + u^{[1]T}(k)Ru^{[1]}(k) + V^{[0]}(x(k+1)), \qquad (2.11)$$

where

$$u^{[1]}(x(k)) = \arg\inf_{u(k)}\{x^T(k)Qx(k) + u^T(k)Ru(k) + V^{[0]}(x(k+1))\}, \qquad (2.12)$$

and $V^{[0]}(x(k+1))$ has two expression forms according to two different cases.

If for $x(k)$ there exists $U(k, K) = (u(k))$ such that $F(x(k), u(k)) = 0$, then $V^{[0]}(x(k+1))$ is

$$V^{[0]}(x(k+1)) = J(x(k+1), U^*(k+1, k+1)) = 0, \quad \forall x(k+1) \qquad (2.13)$$


where $U^*(k+1, k+1) = (0)$. In this situation, the constraint $F(x(k), u^{[1]}(k)) = 0$ on (2.11) is necessary.

If for $x(k)$ there exists $U(k, \bar{K}) = (u(k), u(k+1), \ldots, u(\bar{K}))$ such that $F(x(k), U(k, \bar{K})) = 0$, then $V^{[0]}(x(k+1))$ is

$$V^{[0]}(x(k+1)) = J(x(k+1), U^*(k+1, K)), \qquad (2.14)$$

where $U^*(k+1, K) = (u^*(k+1), u^*(k+2), \ldots, u^*(K))$.

For the iterative step $i > 1$, the performance index function is updated as

$$V^{[i+1]}(x(k)) = \inf_{u(k)}\{x^T(k)Qx(k) + u^T(k)Ru(k) + V^{[i]}(x(k+1))\} = x^T(k)Qx(k) + u^{[i+1]T}(k)Ru^{[i+1]}(k) + V^{[i]}(x(k+1)), \qquad (2.15)$$

where

$$u^{[i+1]}(x(k)) = \arg\inf_{u(k)}\{x^T(k)Qx(k) + u^T(k)Ru(k) + V^{[i]}(x(k+1))\}. \qquad (2.16)$$

In the above recurrent iterative procedure, the index i is the iterative step and k is the time step. The optimal control and the optimal performance index function can be obtained by the iterative ADP algorithm (2.11)–(2.16).
In the following part, we will present the convergence analysis of the iterative
ADP algorithm (2.11)–(2.16).

Theorem 2.2 For an arbitrary state vector $x(k)$, let the performance index function $V^{[i+1]}(x(k))$ be obtained by the iterative ADP algorithm (2.11)–(2.16). Then $\{V^{[i+1]}(x(k))\}$ is a monotonically nonincreasing sequence for $i \geq 1$, i.e., $V^{[i+1]}(x(k)) \leq V^{[i]}(x(k))$, $\forall i \geq 1$.

Proof Mathematical induction is used to prove the theorem.

First, for $i = 1$, we have $V^{[1]}(x(k))$ in (2.11), $V^{[0]}(x(k+1))$ in (2.13), and the finite-time admissible control sequence $U^*(k, k+1) = (u^{[1]}(k), U^*(k+1, k+1)) = (u^{[1]}(k), 0)$. For $i = 2$, we have

$$V^{[2]}(x(k)) = x^T(k)Qx(k) + u^{[2]T}(k)Ru^{[2]}(k) + V^{[1]}(x(k+1)). \qquad (2.17)$$

From (2.11), we have

$$V^{[1]}(x(k+1)) = \inf_{u(k+1)}\{x^T(k+1)Qx(k+1) + u^T(k+1)Ru(k+1) + V^{[0]}(x(k+2))\}. \qquad (2.18)$$

So (2.17) can be expressed as

$$V^{[2]}(x(k)) = \inf_{u(k)}\Big\{x^T(k)Qx(k) + u^T(k)Ru(k) + \inf_{u(k+1)}\{x^T(k+1)Qx(k+1) + u^T(k+1)Ru(k+1) + V^{[0]}(x(k+2))\}\Big\}, \qquad (2.19)$$

which can be written as

$$V^{[2]}(x(k)) = \inf_{U(k,k+1)}\sum_{l=k}^{k+1}\{x^T(l)Qx(l) + u^T(l)Ru(l)\}. \qquad (2.20)$$

If $U(k, k+1)$ in (2.20) is chosen as $U(k, k+1) = (u(k), u(k+1)) = (u^{[1]}(k), 0)$, then we have

$$\sum_{l=k}^{k+1}\{x^T(l)Qx(l) + u^T(l)Ru(l)\} = x^T(k)Qx(k) + u^{[1]T}(k)Ru^{[1]}(k) = V^{[1]}(x(k)). \qquad (2.21)$$

So, according to (2.20) and (2.21), we have $V^{[2]}(x(k)) \leq V^{[1]}(x(k))$, i.e., the claim holds for $i = 1$.
Second, we assume that for $i = j - 1$ the inequality

$$V^{[j]}(x(k)) \leq V^{[j-1]}(x(k)) \qquad (2.22)$$

holds.
Then, according to (2.15), for $i = j$ we have

$$V^{[j+1]}(x(k)) = \inf_{u(k)}\Big\{x^T(k)Qx(k) + u^T(k)Ru(k) + \inf_{u(k+1)}\big\{x^T(k+1)Qx(k+1) + u^T(k+1)Ru(k+1) + \cdots + \inf_{u(k+j)}\{x^T(k+j)Qx(k+j) + u^T(k+j)Ru(k+j)\}\cdots\big\}\Big\}. \qquad (2.23)$$

So we can obtain

$$V^{[j+1]}(x(k)) = \inf_{U(k,k+j)}\sum_{l=k}^{k+j}\{x^T(l)Qx(l) + u^T(l)Ru(l)\}. \qquad (2.24)$$

If we let $U(k, k+j) = (u^{[j]}(k), u^{[j-1]}(k+1), \ldots, u^{[1]}(k+j-1), 0)$ in (2.24), then we can get

$$\sum_{l=k}^{k+j}\{x^T(l)Qx(l) + u^T(l)Ru(l)\} = x^T(k)Qx(k) + u^{[j]T}(k)Ru^{[j]}(k) + x^T(k+1)Qx(k+1) + u^{[j-1]T}(k+1)Ru^{[j-1]}(k+1) + \cdots + x^T(k+j-1)Qx(k+j-1) + u^{[1]T}(k+j-1)Ru^{[1]}(k+j-1) + x^T(k+j)Qx(k+j). \qquad (2.25)$$

As mentioned in the iterative algorithm, the constraint $F(x(k), u^{[1]}(k)) = 0$, $\forall x(k)$, on (2.11) is necessary. So we can get

$$x(k+j) = F(x(k+j-1), u^{[1]}(k+j-1)) = 0. \qquad (2.26)$$

Thus, we have

$$\sum_{l=k}^{k+j}\{x^T(l)Qx(l) + u^T(l)Ru(l)\} = V^{[j]}(x(k)). \qquad (2.27)$$

Therefore, we obtain

$$V^{[j+1]}(x(k)) \leq V^{[j]}(x(k)). \qquad (2.28)$$

For the situation (2.14), it can easily be proven according to the above method.
Therefore, we can conclude that V [i+1] (x(k)) ≤ V [i] (x(k)), ∀i.

From Theorem 2.2, it is clear that the iterative performance index function is convergent. So we can define the limit of the sequence $\{V^{[i+1]}(x(k))\}$ as $V^o(x(k))$. In the next theorem, we prove that $V^o(x(k))$ satisfies the HJB equation.

Theorem 2.3 Let $V^o(x(k)) = \lim_{i\to\infty} V^{[i+1]}(x(k))$. Then $V^o(x(k))$ satisfies

$$V^o(x(k)) = \inf_{u(k)}\{x^T(k)Qx(k) + u^T(k)Ru(k) + V^o(x(k+1))\}. \qquad (2.29)$$

Proof According to (2.15) and (2.16), for any admissible control vector $\eta(k)$ we have

$$V^{[i+1]}(x(k)) \leq x^T(k)Qx(k) + \eta^T(k)R\eta(k) + V^{[i]}(x(k+1)). \qquad (2.30)$$

From Theorem 2.2, we can obtain

$$V^o(x(k)) \leq V^{[i+1]}(x(k)). \qquad (2.31)$$

So it can be obtained that

$$V^o(x(k)) \leq x^T(k)Qx(k) + \eta^T(k)R\eta(k) + V^{[i]}(x(k+1)). \qquad (2.32)$$

Letting $i \to \infty$, (2.32) can be written as


$$V^o(x(k)) \leq x^T(k)Qx(k) + \eta^T(k)R\eta(k) + V^o(x(k+1)). \qquad (2.33)$$

Since $\eta(k)$ is an arbitrary admissible control, we can obtain

$$V^o(x(k)) \leq \inf_{u(k)}\{x^T(k)Qx(k) + u^T(k)Ru(k) + V^o(x(k+1))\}. \qquad (2.34)$$

On the other hand, according to the definition $V^o(x(k)) = \lim_{i\to\infty} V^{[i+1]}(x(k))$, for an arbitrary positive number $\varepsilon$ there exists a positive integer p such that

$$V^{[p]}(x(k)) \geq V^o(x(k)) \geq V^{[p]}(x(k)) - \varepsilon. \qquad (2.35)$$

From (2.15), we have

$$V^{[p]}(x(k)) = x^T(k)Qx(k) + u^{[p]T}(k)Ru^{[p]}(k) + V^{[p-1]}(x(k+1)). \qquad (2.36)$$

So, according to (2.35) and (2.36), we have

$$V^o(x(k)) \geq x^T(k)Qx(k) + u^{[p]T}(k)Ru^{[p]}(k) + V^{[p-1]}(x(k+1)) - \varepsilon. \qquad (2.37)$$

Since $V^{[p-1]}(x(k+1)) \geq V^o(x(k+1))$, $\forall p$, (2.37) can be written as

$$V^o(x(k)) \geq x^T(k)Qx(k) + u^{[p]T}(k)Ru^{[p]}(k) + V^o(x(k+1)) - \varepsilon. \qquad (2.38)$$

Hence we have

$$V^o(x(k)) \geq \inf_{u(k)}\{x^T(k)Qx(k) + u^T(k)Ru(k) + V^o(x(k+1))\} - \varepsilon. \qquad (2.39)$$

Since $\varepsilon$ is arbitrary, we have

$$V^o(x(k)) \geq \inf_{u(k)}\{x^T(k)Qx(k) + u^T(k)Ru(k) + V^o(x(k+1))\}. \qquad (2.40)$$

Thus, from (2.34) and (2.40), we have

$$V^o(x(k)) = \inf_{u(k)}\{x^T(k)Qx(k) + u^T(k)Ru(k) + V^o(x(k+1))\}. \qquad (2.41)$$

From Theorems 2.2 and 2.3, it can be concluded that $V^o(x(k))$ is the optimal performance index function, i.e., $V^o(x(k)) = J^*(x(k))$. So we have the following corollary.

Corollary 2.1 Let the iterative algorithm be expressed as (2.11)–(2.16). Then the iterative control $u^{[i]}(k)$ converges to the optimal control $u^*(k)$ as $i \to \infty$, i.e.,

$$u^*(k) = \arg\inf_{u(k)}\{x^T(k)Qx(k) + u^T(k)Ru(k) + J^*(x(k+1))\}. \qquad (2.42)$$

In this section, the iterative control algorithm for data-based unknown systems has been proposed, together with its convergence analysis. In the next section, the neural network implementation of the iterative control algorithm is presented.

2.5 Neural Network Implementation of the Iterative Control Algorithm

The input-output data are used to identify the unknown nonlinear system until the identification error is within the desired precision range. Then the data-based identifier is used for the controller design. The diagram of the whole structure is shown in Fig. 2.2.

In Fig. 2.2, the SLFN module is the identifier, the action network module is used to approximate the iterative control $u^{[i]}(k)$, and the critic network module is used to approximate the iterative performance index function. The action network and the critic network are both SLFNs with ELM. The detailed implementation steps are as follows.
Step 1. Train the identifier using input-output data.
Step 2. Choose an error bound $\varepsilon$, and randomly choose an initial state $x(0)$.
Step 3. Calculate an initial finite-time admissible control sequence for $x(0)$, i.e., $U(0, K) = (u(0), u(1), \ldots, u(K))$. The corresponding state sequence is $(x(0), x(1), \ldots, x(K+1))$, where $x(K+1) = 0$.

Fig. 2.2 The basic structure of the proposed control method



Step 4. For the state $x(K)$, run the iterative ADP algorithm (2.11)–(2.13) for $i = 1$, and (2.15)–(2.16) for $i > 1$, until $|V^{[i+1]}(x(K)) - V^{[i]}(x(K))| < \varepsilon$.
Step 5. For the states $x(k)$, $k = K-1, K-2, \ldots, 0$, run the iterative ADP algorithm (2.11)–(2.12) and (2.14) for $i = 1$, and (2.15)–(2.16) for $i > 1$, until $|V^{[i+1]}(x(k)) - V^{[i]}(x(k))| < \varepsilon$.
Step 6. Stop.
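As a simplified stand-in for Steps 1-6, the sketch below runs the value update (2.15)-(2.16) for the scalar system used in Sect. 2.6, with a lookup table over a state grid replacing the two networks and the true map replacing the trained identifier. The grid ranges and the candidate-control set are assumptions; also note it is initialized from $V = 0$, value-iteration style, so the iterates increase toward the optimum rather than decrease as in Theorem 2.2.

```python
import numpy as np

def F(x, u):                                   # system (2.43), standing in for the identifier
    return x + np.sin(0.1 * x**2 + u)

xs = np.linspace(-2.0, 2.0, 81)                # state grid (plays the role of the critic)
us = np.linspace(-2.0, 2.0, 81)                # candidate controls (plays the role of the actor)
Q = R = 1.0

V = np.zeros_like(xs)                          # V[0] = 0
for _ in range(50):                            # iterative steps i
    xn = F(xs[:, None], us[None, :])           # successor states for every (x, u) pair
    Vn = np.interp(xn.ravel(), xs, V).reshape(xn.shape)  # V[i](x(k+1)); clamped off-grid
    costs = Q * xs[:, None]**2 + R * us[None, :]**2 + Vn
    V_new = costs.min(axis=1)                  # value update as in (2.15)
    if np.max(np.abs(V_new - V)) < 1e-8:       # stopping rule as in Steps 4-5
        break
    V = V_new

i0 = np.abs(xs).argmin()                       # grid point closest to the origin
assert V[i0] < 1e-8                            # the zero state costs nothing to stabilize
```

In the chapter's actual scheme, the tabular minimization over `us` is replaced by the ELM-trained action network and the table `V` by the ELM-trained critic network.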

2.6 Simulation Study

To evaluate the performance of our iterative ADP algorithm with the data-based identifier, an example is provided in this section.
Consider the following nonlinear system [16]

x(k + 1) = x(k) + sin(0.1x 2 (k) + u(k)), (2.43)

where the state x(k) ∈ R and the control u(k) ∈ R.


In this chapter, 5000 sampling points are used to train the SLFN identifier. The
number of hidden neurons is 20. The weight vectors between the input neurons and
the jth hidden neuron are selected in (0, 1). The threshold of the hidden neuron is
selected in (0, 1). For 50 test points, we get the identification results in Fig. 2.3. The
red dashed line is the test points, the blue solid line is the output of the data-based
identifier, and the purple line with sign “×” is the identification error. From Fig. 2.3,
we can see that the identifier reconstructs the unknown nonlinear system accurately.
Based on the identification results, the optimal ADP controller is designed. The initial state for system (2.43) is x(0) = 1.5. To implement the proposed iterative algorithm, two neural networks, both SLFNs with ELM, are used for the action network and the critic network, respectively. To demonstrate the effectiveness of the proposed scheme, we implement the iterative algorithm with ELM-trained and BP-trained neural networks, respectively. The maximal iteration step is 50 for both kinds of neural networks. The convergence precision of ELM is $10^{-6}$, and the convergence precision of BP is $10^{-4}$. The simulation results are shown in Figs. 2.4, 2.5, 2.6, 2.7, 2.8 and 2.9. Figures 2.4 and 2.5 show the trajectories of the iterative performance index function obtained by ELM and BP, respectively. In Fig. 2.4, the iterative performance index function converges after 5 iterative steps, while in Fig. 2.5 it takes 15 iterative steps. Figures 2.6 and 2.7 show the state trajectories obtained by the ELM method and the BP method, respectively. Figures 2.8 and 2.9 show the control trajectories obtained by the ELM method and the BP method, respectively. With the ELM method, the state and control trajectories converge after 4 time steps, while with the BP method it takes 15 time steps. From the figures we can see that the results


Fig. 2.3 The training results of the data-based identifier

Fig. 2.4 The performance index function obtained by ELM method

of the ELM method are faster and smoother than those of the BP method. It can be concluded that the learning speed of the ELM method is faster than that of the BP method, while better generalization performance is obtained.

Fig. 2.5 The performance index function obtained by BP method

Fig. 2.6 The state obtained by ELM method

Fig. 2.7 The state obtained by BP method

Fig. 2.8 The control obtained by ELM method

2.7 Conclusions

This chapter studied the ELM method for the optimal control of unknown nonlinear systems. Using input-output data, a data-based identifier was established. A finite-time optimal control scheme was then proposed based on the iterative ADP algorithm, and the theorems showed that the proposed iterative algorithm is convergent. The simulation study demonstrated the effectiveness of the proposed control algorithm.
Fig. 2.9 The control obtained by BP method

References

1. Duncan, T., Guo, L., Pasik-Duncan, B.: Adaptive continuous-time linear quadratic gaussian
control. IEEE Trans. Autom. Control 44(9), 1653–1662 (1999)
2. Gabasov, R., Kirillova, F., Balashevich, N.: Open-loop and closed-loop optimization of linear
control systems. Asian J. Control 2(3), 155–168 (2000)
3. Jin, X., Yang, G., Peng, L.: Robust adaptive tracking control of distributed delay systems with
actuator and communication failures. Asian J. Control 14(5), 1282–1298 (2012)
4. Zhang, H., Song, R., Wei, Q., Zhang, T.: Optimal tracking control for a class of nonlinear
discrete-time systems with time delays based on heuristic dynamic programming. IEEE Trans.
Neural Netw. 22(12), 1851–1862 (2011)
5. Guardabassi, G., Savaresi, S.: Virtual reference direct design method: an off-line approach to
data-based control system design. IEEE Trans. Autom. Control 45(5), 954–959 (2000)
6. Jagannathan, S.: Neural Network Control of Nonlinear Discrete-Time Systems. CRC Press,
Boca Raton (2006)
7. Yu, W.: Recent Advances in Intelligent Control Systems. Springer, London (2009)
8. Fernández-Navarro, F., Hervás-Martínez, C., Gutierrez, P.: Generalised Gaussian radial basis
function neural networks. Soft Comput. 17, 519–533 (2013)
9. Richert, D., Masaud, K., Macnab, C.: Discrete-time weight updates in neural-adaptive control.
Soft Comput. 17, 431–444 (2013)
10. Kuntal, M., Pratihar, D., Nath, A.: Analysis and synthesis of laser forming process using neural
networks and neuro-fuzzy inference system. Soft Comput. 17, 849–865 (2013)
11. Huang, G., Chen, L., Siew, C.: Universal approximation using incremental constructive feed-
forward networks with random hidden nodes. IEEE Trans. Neural Netw. 17(4), 879–892 (2006)
12. Hornik, K.: Approximation capabilities of multilayer feedforward networks. Neural Netw. 4,
251–257 (1991)
13. Leshno, M., Lin, V., Pinkus, A., Schocken, S.: Multilayer feedforward networks with a non-
polynomial activation function can approximate any function. Neural Netw. 6, 861–867 (1993)

14. Zhang, H., Wei, Q., Luo, Y.: A novel infinite-time optimal tracking control scheme for a class
of discrete-time nonlinear systems via the greedy HDP iteration algorithm. IEEE Trans. Syst.
Man Cybern. Part B: Cybern. 38(4), 937–942 (2008)
15. Huang, G., Siew, C.: Extreme learning machine: RBF network case. In: Proceedings of the
Eighth International Conference on Control, Automation, Robotics and Vision (ICARCV
2004), Dec 6–9, Kunming, China, vol. 2, pp. 1029–1036. (2004)
16. Wang, F., Jin, N., Liu, D., Wei, Q.: Adaptive dynamic programming for finite-horizon optimal control of discrete-time nonlinear systems with ε-error bound. IEEE Trans. Neural Netw. 22(1), 24–36 (2011)
17. Zhang, R., Huang, G., Sundararajan, N., Saratchandran, P.: Multi-category classification using
extreme learning machine for microarray gene expression cancer diagnosis. IEEE/ACM Trans.
Comput. Biol. Bioinform. 4(3), 485–495 (2007)
18. Tamura, S., Tateishi, M.: Capabilities of a four-layered feedforward neural network: four layers versus three. IEEE Trans. Neural Netw. 8(2), 251–255 (1997)
19. Huang, G.: Learning capability and storage capacity of two-hidden-layer feedforward networks.
IEEE Trans. Neural Netw. 14(2), 274–281 (2003)
20. Huang, G., Wang, D., Lan, Y.: Extreme learning machines: a survey. Int. J. Mach. Learn.
Cybern. 2(2), 107–122 (2011)
21. Huang, G., Zhu, Q., Siew, C.: Extreme learning machine: theory and applications. Neurocom-
puting 70, 489–501 (2006)
Chapter 3
Nearly Finite-Horizon Optimal Control
for Nonaffine Time-Delay Nonlinear
Systems

In this chapter, a novel ADP algorithm is developed to solve the nearly optimal finite-horizon control problem for a class of deterministic nonaffine nonlinear time-delay systems. The idea is to use the ADP technique to obtain a nearly optimal control which brings the performance index function close to the greatest lower bound of all performance index functions within finite time. The proposed algorithm contains two cases with different initial iterations. In the first case, there exists a control policy which drives an arbitrary state of the system to zero in one time step. In the second case, there exists a control sequence which drives the system to zero in multiple time steps. State updating is used to determine the optimal state. Convergence analysis of the performance index function is given. Furthermore, the relationship between the number of iteration steps and the length of the control sequence is presented. Two neural networks are used to approximate the performance index function and compute the optimal control policy, facilitating the implementation of the ADP iteration algorithm. At last, two examples are used to demonstrate the effectiveness of the proposed ADP iteration algorithm.

3.1 Introduction

Time-delay phenomena are often encountered in physical and biological systems, and require special attention in engineering applications [1]. Transportation systems, communication systems, chemical processing systems, metallurgical processing systems and power systems are examples of time-delay systems. Delays may degrade the control efficiency and even destabilize the control systems [2]. There have therefore been many works on systems with time delays in various research areas such as electrical engineering, chemical engineering and networked control [3]. In the past few decades, the stabilization and control of time-delay systems have always been a key focus in the control field [3, 4]. Furthermore, many researchers have studied

© Science Press, Beijing and Springer Nature Singapore Pte Ltd. 2019 25
R. Song et al., Adaptive Dynamic Programming: Single and Multiple
Controllers, Studies in Systems, Decision and Control 166,
https://doi.org/10.1007/978-981-13-1712-5_3

the controllability of linear time-delay systems [5, 6] and proposed related theorems for judging it. In addition, the optimal control problem is often encountered in industrial production, so the investigation of optimal control for time-delay systems is significant. In [7], D. H. Chyung pointed out the disadvantages of handling the optimal control problem by rewriting a discrete time-delay system as an extended system through the dimension-augmentation method, and some direct methods for linear time-delay systems were presented in [7, 8]. For nonlinear time-delay systems, owing to the complexity of the systems, the optimal control problem has rarely been researched. The finite-horizon optimal control problem for a class of discrete-time nonlinear systems was solved in [9] using an ADP algorithm, but that method cannot be applied to nonlinear time-delay systems, because the delayed states of such systems are coupled with each other: the state at the current time k is decided by the states before k and the control law, while the control law is not known before it is obtained. So, based on the research results in [9], we propose a new ADP algorithm to solve the nearly finite-horizon optimal control problem for discrete time-delay systems through the framework of the Hamilton–Jacobi–Bellman (HJB) equation.
In this chapter, the optimal controller is designed based on the original time-delay system directly. The state updating method is proposed to determine the optimal state of the time-delay system. For finite-horizon optimal control, the system should reach zero when the final running step N is finite, but this is impossible in practice, so the results in this chapter hold in the sense of an error bound. The main contributions of this chapter can be summarized as follows.
(1) The finite-horizon optimal control of deterministic discrete time-delay systems is studied for the first time based on the ADP algorithm.
(2) State updating is used to determine the optimal states of the HJB equation.
(3) The relationship between the number of iteration steps and the length of the control sequence is given.
This chapter is organized as follows. In Sect. 3.2, the problem formulation is presented. In Sect. 3.3, the nearly finite-horizon optimal control scheme is developed based on the iteration ADP algorithm, and the convergence proof is given. In Sect. 3.4, two examples are given to demonstrate the effectiveness of the proposed control scheme. In Sect. 3.5, the conclusion is drawn.

3.2 Problem Statement

Consider a class of deterministic time-delay nonaffine nonlinear systems



x(t + 1) = F(x(t), x(t − h_1), x(t − h_2), . . . , x(t − h_l), u(t)),
x(t) = χ(t), −h_l ≤ t ≤ 0,   (3.1)

where x(t) ∈ R^n is the state and x(t − h_1), x(t − h_2), . . . , x(t − h_l) ∈ R^n are the delayed states. u(t) ∈ R^m is the system input. χ(t) is the initial state function, and h_i, i = 1, 2, . . . , l, are the time delays, which are nonnegative integers with 0 < h_1 < h_2 < · · · < h_l. F(x(t), x(t − h_1), x(t − h_2), . . . , x(t − h_l), u(t)) is a known function with F(0, 0, . . . , 0) = 0.
For any time step k, the performance index function for state x(k) under the control sequence U(k, N + k − 1) = (u(k), u(k + 1), . . . , u(N + k − 1)) is defined as

J(x(k), U(k, N + k − 1)) = Σ_{j=k}^{N+k−1} {x^T(j)Qx(j) + u^T(j)Ru(j)},   (3.2)

where Q and R are positive definite constant matrices.
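The cost (3.2) is simply an accumulation of quadratic stage costs over the N steps of the control sequence. As a minimal sketch, it can be evaluated as below; the 1-D rollout values are made-up illustrative numbers, not from the chapter.

```python
import numpy as np

def performance_index(xs, us, Q, R):
    """Finite-horizon cost (3.2): sum of x^T Q x + u^T R u over the
    N steps of the control sequence U(k, N+k-1)."""
    return sum(x @ Q @ x + u @ R @ u for x, u in zip(xs, us))

# Illustrative 1-D state/control rollout with Q = R = 1.
Q = np.array([[1.0]])
R = np.array([[1.0]])
xs = [np.array([1.0]), np.array([0.5]), np.array([0.25])]
us = [np.array([-0.5]), np.array([-0.25]), np.array([-0.125])]
J = performance_index(xs, us, Q, R)  # 1.25 + 0.3125 + 0.078125 = 1.640625
```

Positive definite Q and R guarantee every stage cost is nonnegative, which is what makes the performance index functions below positive definite.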


In this chapter, we focus on solving the nearly finite-horizon optimal control problem for system (3.1). The feedback control u(k) must not only stabilize the system within a finite number of time steps but also guarantee that the performance index function (3.2) is finite, so the control sequence U(k, N + k − 1) = (u(k), u(k + 1), . . . , u(N + k − 1)) must be admissible.

Definition 3.1 N-step control sequence: For any time step k, we define the N-step control sequence U(k, N + k − 1) = (u(k), u(k + 1), . . . , u(N + k − 1)). The length of U(k, N + k − 1) is N.

Definition 3.2 Final state: We define the final state x_f = x_f(x(k), U(k, N + k − 1)), i.e., x_f = x(N + k).

Definition 3.3 Admissible control sequence: An N-step control sequence is said to be admissible for x(k) if the final state x_f(x(k), U(k, N + k − 1)) = 0 and J(x(k), U(k, N + k − 1)) is finite.

Remark 3.1 Definitions 3.1 and 3.2 are used to conveniently state the admissible control sequence of Definition 3.3, which is necessary for the theorems of this chapter.

Remark 3.2 It is important to point out that the length N of the control sequence cannot be designated in advance; it is calculated by the proposed algorithm. If the calculated length of the optimal control sequence at time step k is L, then we take the optimal control sequence length at time step k to be N = L.

According to the theory of dynamic programming [10], the optimal performance index function is defined as

J*(x(k)) = inf_{U(k,N+k−1)} J(x(k), U(k, N + k − 1))   (3.3)
         = inf_{u(k)} {x^T(k)Qx(k) + u^T(k)Ru(k) + J*(x(k + 1))},   (3.4)

and the optimal control policy is

u*(k) = arg inf_{u(k)} {x^T(k)Qx(k) + u^T(k)Ru(k) + J*(x(k + 1))},   (3.5)

so the state under the optimal control policy is

x*(t + 1) = F(x*(t), x*(t − h_1), . . . , x*(t − h_l), u*(t)), t = 0, 1, . . . , k, . . . ,   (3.6)

and then the HJB equation is written as

J*(x*(k)) = J(x*(k), U*(k, N + k − 1))
          = x*^T(k)Qx*(k) + u*^T(k)Ru*(k) + J*(x*(k + 1)).   (3.7)

Remark 3.3 From Remark 3.2, we can see that the length N of the optimal control sequence is an unknown finite number that cannot be designated in advance. We can, however, say that if the length of the optimal control sequence at time step k is N, then the length of the optimal control sequence at time step k + 1 is N − 1. Therefore, the HJB equation (3.7) is established.

In the following, we give an explanation of the validity of Eq. (3.4). First, we define U*(k, N + k − 1) = (u*(k), u*(k + 1), . . . , u*(N + k − 1)), i.e.,

U*(k, N + k − 1) = arg inf_{U(k,N+k−1)} J(x(k), U(k, N + k − 1)).   (3.8)

Then we have

J*(x(k)) = inf_{U(k,N+k−1)} J(x(k), U(k, N + k − 1))
         = J(x(k), U*(k, N + k − 1)).   (3.9)

Then according to (3.2), we can get

J*(x(k)) = Σ_{j=k}^{N+k−1} {x^T(j)Qx(j) + u*^T(j)Ru*(j)}
         = x^T(k)Qx(k) + u*^T(k)Ru*(k)
         + · · ·
         + x^T(N + k − 1)Qx(N + k − 1) + u*^T(N + k − 1)Ru*(N + k − 1).   (3.10)

Equation (3.10) can be written as

J*(x(k)) = x^T(k)Qx(k) + u*^T(k)Ru*(k)
         + · · ·
         + x^T(N + k − 2)Qx(N + k − 2) + u*^T(N + k − 2)Ru*(N + k − 2)
         + inf_{u(N+k−1)} {x^T(N + k − 1)Qx(N + k − 1) + u^T(N + k − 1)Ru(N + k − 1)}.   (3.11)

We also obtain

J*(x(k)) = x^T(k)Qx(k) + u*^T(k)Ru*(k)
         + · · ·
         + inf_{u(N+k−2)} {x^T(N + k − 2)Qx(N + k − 2) + u^T(N + k − 2)Ru(N + k − 2)
         + inf_{u(N+k−1)} {x^T(N + k − 1)Qx(N + k − 1) + u^T(N + k − 1)Ru(N + k − 1)}}.   (3.12)

So we have

J*(x(k)) = inf_{u(k)} {x^T(k)Qx(k) + u^T(k)Ru(k)
         + · · ·
         + inf_{u(N+k−2)} {x^T(N + k − 2)Qx(N + k − 2) + u^T(N + k − 2)Ru(N + k − 2)
         + inf_{u(N+k−1)} {x^T(N + k − 1)Qx(N + k − 1) + u^T(N + k − 1)Ru(N + k − 1)}} . . .}.   (3.13)

Thus, according to (3.9), Eq. (3.13) can be expressed as

J*(x(k)) = inf_{u(k)} {x^T(k)Qx(k) + u^T(k)Ru(k)
         + inf_{U(k+1,N+k−1)} J(x(k + 1), U(k + 1, N + k − 1))}
         = inf_{u(k)} {x^T(k)Qx(k) + u^T(k)Ru(k) + J*(x(k + 1))}.   (3.14)

Therefore, Eqs. (3.3) and (3.4) are established.


In the following part we will give a novel iteration ADP algorithm to get the nearly
optimal solution.

3.3 The Iteration ADP Algorithm and Its Convergence

3.3.1 The Novel ADP Iteration Algorithm

In this subsection, we give the novel iteration ADP algorithm in detail. For the state x(k) of system (3.1), there exist two cases. Case 1: ∃U(k, k) which makes x(k + 1) = 0. Case 2: ∃U(k, k + m), m > 0, which makes x(k + m + 1) = 0. In the following, we discuss the two cases, respectively.
Case 1: There exists U(k, k) = (β(k)) which makes x(k + 1) = 0 for system (3.1). We set the optimal control sequence U*(k + 1, k + 1) = (0). The states of the system are driven by a given initial state χ(t), −h_l ≤ t ≤ 0, and the initial control policy β(t). We set V^[0](x(k + 1)) = J(x(k + 1), U*(k + 1, k + 1)) = 0, ∀x(k + 1); then for time step k, we have the following iterations:

u^[1](k) = arg inf_{u(k)} {x^T(k)Qx(k) + u^T(k)Ru(k) + V^[0](x(k + 1))},
s.t. F(x(k), x(k − h_1), x(k − h_2), . . . , x(k − h_l), u(k)) = 0,   (3.15)

and

V [1] (x [1] (k)) = x [1]T (k)Qx [1] (k) + u [1]T (k)Ru [1] (k) + V [0] (x [0] (k + 1)), (3.16)

where the states in (3.16) are obtained as

x [1] (t + 1) = F(x [1] (t), x [1] (t − h 1 ), x [1] (t − h 2 ), . . . , x [1] (t − h l ), u [1] (t)),


t = 0, 1, . . . , k − 1, (3.17)

and

x [0] (t + 1) = F(x [1] (t), x [1] (t − h 1 ), x [1] (t − h 2 ), . . . , x [1] (t − h l ), u [1] (t)),


t = k, k + 1, . . . , (3.18)

For the iteration step i = 1, 2, . . . , we have the iterations

u^[i+1](k) = arg inf_{u(k)} {x^T(k)Qx(k) + u^T(k)Ru(k) + V^[i](x(k + 1))}   (3.19)

and

V [i+1] (x [i+1] (k)) = x [i+1]T (k)Qx [i+1] (k) + u [i+1]T (k)Ru [i+1] (k)
+ V [i] (x [i] (k + 1)), (3.20)

where V^[i](x(k + 1)) in (3.19) is obtained as

V^[i](x(k + 1)) = inf_{u(k+1)} {x^T(k + 1)Qx(k + 1) + u^T(k + 1)Ru(k + 1) + V^[i−1](x(k + 2))},   (3.21)

and the states in (3.20) are updated as

x [i+1] (t + 1) = F(x [i+1] (t), x [i+1] (t − h 1 ), x [i+1] (t − h 2 ), . . . ,


x [i+1] (t − h l ), u [i+1] (t)), t = 0, 1, . . . , k − 1, (3.22)

and

x [i] (t + 1) = F(x [i+1] (t), x [i+1] (t − h 1 ), x [i+1] (t − h 2 ), . . . ,


x [i+1] (t − h l ), u [i+1] (t)), t = k, k + 1, . . . . (3.23)

Case 2: There exists a finite-horizon admissible control sequence U(k, k + m) = (β(k), β(k + 1), . . . , β(k + m)) which makes x_f(x(k), U(k, k + m)) = 0. We suppose that for x(k + 1), there exists an optimal control sequence U*(k + 1, k + j − 1) = (u*(k + 1), u*(k + 2), . . . , u*(k + j − 1)). For time step k, the ADP algorithm iterates between

u^[1](k) = arg inf_{u(k)} {x^T(k)Qx(k) + u^T(k)Ru(k) + V^[0](x(k + 1))}   (3.24)

and

V [1] (x [1] (k)) = x [1]T (k)Qx [1] (k) + u [1]T (k)Ru [1] (k) + V [0] (x [0] (k + 1)), (3.25)

where, ∀x(k + 1), V^[0](x(k + 1)) in (3.24) is obtained as

V^[0](x(k + 1)) = J(x(k + 1), U*(k + 1, k + j − 1))
               = J*(x(k + 1)).   (3.26)

In (3.25), V [0] (x [0] (k + 1)) is obtained by the similar Eq. (3.26). The states in (3.25)
are obtained as

x [1] (t + 1) = F(x [1] (t), x [1] (t − h 1 ), x [1] (t − h 2 ), . . . ,


x [1] (t − h l ), u [1] (t)), t = 0, 1, . . . , k − 1, (3.27)

and

x [0] (t + 1) = F(x [1] (t), x [1] (t − h 1 ), x [1] (t − h 2 ), . . . ,


x [1] (t − h l ), u [1] (t)), t = k, k + 1, . . . (3.28)

For the iteration step i = 1, 2, . . ., the iteration algorithm is implemented as follows:

u^[i+1](k) = arg inf_{u(k)} {x^T(k)Qx(k) + u^T(k)Ru(k) + V^[i](x(k + 1))}   (3.29)

and

V [i+1] (x [i+1] (k)) = (x [i+1] (k))T Qx [i+1] (k) + u [i+1]T (k)Ru [i+1] (k)
+ V [i] (x [i] (k + 1)), (3.30)

where V^[i](x(k + 1)) in (3.29) is updated as

V^[i](x(k + 1)) = inf_{u(k+1)} {x^T(k + 1)Qx(k + 1) + u^T(k + 1)Ru(k + 1) + V^[i−1](x(k + 2))},   (3.31)

and the states in (3.30) are obtained as

x [i+1] (t + 1) = F(x [i+1] (t), x [i+1] (t − h 1 ), x [i+1] (t − h 2 ), . . . ,


x [i+1] (t − h l ), u [i+1] (t)), t = 0, 1, . . . , k − 1, (3.32)

and

x [i] (t + 1) = F(x [i+1] (t), x [i+1] (t − h 1 ), x [i+1] (t − h 2 ), . . . ,


x [i+1] (t − h l ), u [i+1] (t)), t = k, k + 1, . . . . (3.33)

This completes the iteration algorithm. From the two cases, we can see that if V^[0] = 0 in (3.25), then Case 1 is a special case of Case 2. The algorithms are summarized as follows.

Algorithm 1 ADP Algorithm


Initialization:
Compute u [1] (k) and V [1] (x [1] (k)) by (3.15) and (3.16) in Case 1, or by (3.24) and (3.25) in Case
2;
Update:
Update u [i+1] (k) and V [i+1] (x [i+1] (k)) by (3.19) and (3.20) in Case 1, or by (3.29) and (3.30) in
Case 2.
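To make the iteration concrete, the following sketch runs the inf-and-update core of (3.29)-(3.31) on a scalar surrogate plant x(t+1) = 0.9x(t) + u(t) with no delays, using a state grid and linear interpolation for V. The plant, grid sizes, and weights are illustrative assumptions, and the delay bookkeeping of the state-updating equations (3.32)-(3.33) is deliberately omitted here.

```python
import numpy as np

Q, R = 1.0, 1.0
xs = np.linspace(-2.0, 2.0, 81)              # state grid
us = np.linspace(-2.0, 2.0, 81)              # candidate controls u(k)
V = np.zeros_like(xs)                        # V^[0] = 0, as in Case 1

for i in range(60):                          # iteration index i
    V_next = np.empty_like(V)
    for j, x in enumerate(xs):
        # all x(k+1) candidates, one per control, clipped to the grid
        x1 = np.clip(0.9 * x + us, xs[0], xs[-1])
        cost = Q * x * x + R * us**2 + np.interp(x1, xs, V)
        V_next[j] = cost.min()               # inf over u(k)
    delta = np.max(np.abs(V_next - V))
    V = V_next
    if delta < 1e-9:                         # successive iterates agree
        break

V_hat = V                                    # converged performance index on the grid
```

Each sweep performs the minimization over u(k) and the value update once per grid state; with delays present, the chapter's algorithm additionally re-rolls the state trajectory through (3.32)-(3.33) at every iteration.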

Remark 3.4 The state x(k) of system (3.1) is driven by the fixed initial states χ(t), −h_l ≤ t ≤ 0. If there exists a control sequence U(k, k) = (β(k)) which makes x(k + 1) = 0 hold, then we use Case 1 of the algorithm to obtain the optimal control. Otherwise, if no such U(k, k) exists but there is a control sequence U(k, k + m) = (β(k), β(k + 1), . . . , β(k + m)) which makes x_f(x(k), U(k, k + m)) = 0, then we adopt Case 2 of the algorithm. The detailed implementation process of the second case is as follows.
For system (3.1), there exists an arbitrary finite-horizon admissible control sequence U(k, k + m) = (β(k), β(k + 1), . . . , β(k + m)) with the corresponding state sequence (x(k + 1), x(k + 2), . . . , x(k + m + 1)), in which x(k + m + 1) = 0. Clearly, U(k, k + m) may not be the optimal one, which means two things: (1) the length m + 1 of the control sequence U(k, k + m) may not be optimal; (2) the law of the control sequence U(k, k + m) may not be optimal. So it is necessary to use the proposed algorithm to obtain the optimal one.
We now start the discussion of the proposed algorithm from the state x(k + m). Obviously, the situation of x(k + m) belongs to Case 1, so the optimal control for x(k + m) can be obtained by Case 1 of the proposed algorithm. Although the state x(k + m) can reach zero in one step, the optimal number of control steps may be more than one; this property can be seen in Corollary 3.1. Next, we can obtain the optimal control for x(k + m − 1) according to Case 2 of the proposed algorithm. This process continues until the optimal control of state x(k) is obtained. From [9] we know that if the optimal control length of state x(k + m_1 + 1) is the same as that of x(k + m_1), then the two states x(k + m_1) and x(k + m_1 + 1) are in the same circular region, and the finite-horizon optimal controls for the two states are the same. The detailed analysis can be found in [9].

3.3.2 Convergence Analysis of the Improved Iteration Algorithm

In the above subsection, the novel algorithm for finite-horizon optimal control of time-delay nonlinear systems was presented in detail. In the following, we prove that the algorithm is convergent and that the limit of the sequence of performance index functions V^[i+1](x^[i+1](k)) satisfies the HJB equation (3.7).

Theorem 3.1 For system (3.1), the states of the system are driven by a given initial
state χ (t), −h l ≤ t ≤ 0, and the initial finite-horizon admissible control policy β(t).
The iteration algorithm is as in (3.15)–(3.33). For time step k, ∀ x(k) and U (k, k + i),
we define

Λ[i+1] (x(k), U (k, k + i)) = x T (k)Qx(k) + u T (k)Ru(k)


+ x T (k + 1)Qx(k + 1) + u T (k + 1)Ru(k + 1)
+ ···
+ x T (k + i)Qx(k + i) + u T (k + i)Ru(k + i)
+ V [0] (x(k + i + 1)), (3.34)

where V^[0](x(k + i + 1)) is as in (3.26) and V^[i+1](x(k)) is updated as in (3.31). Then we have

V^[i+1](x(k)) = inf_{U(k,k+i)} Λ^[i+1](x(k), U(k, k + i)).   (3.35)

Proof From (3.31) we have

V^[i+1](x(k)) = inf_{u(k)} {x^T(k)Qx(k) + u^T(k)Ru(k)
              + inf_{u(k+1)} {x^T(k + 1)Qx(k + 1) + u^T(k + 1)Ru(k + 1)
              + · · ·
              + inf_{u(k+i)} {x^T(k + i)Qx(k + i) + u^T(k + i)Ru(k + i)
              + V^[0](x(k + i + 1))} . . .}}.   (3.36)

So we can further obtain

V^[i+1](x(k)) = inf_{U(k,k+i)} {x^T(k)Qx(k) + u^T(k)Ru(k)
              + x^T(k + 1)Qx(k + 1) + u^T(k + 1)Ru(k + 1)
              + · · ·
              + x^T(k + i)Qx(k + i) + u^T(k + i)Ru(k + i)
              + V^[0](x(k + i + 1))}.   (3.37)

Thus we have

V^[i+1](x(k)) = inf_{U(k,k+i)} Λ^[i+1](x(k), U(k, k + i)).   (3.38)

Based on Theorem 3.1, we give the monotonicity theorem about the sequence of
performance index functions V [i+1] (x [i+1] (k)), ∀x [i+1] (k).
Theorem 3.2 For system (3.1), let the iteration algorithm be as in (3.15)–(3.33).
Then we have V [i+1] (x [i] (k)) ≤ V [i] (x [i] (k)), ∀i > 0, for Case 1; V [i+1] (x [i] (k)) ≤
V [i] (x [i] (k)), ∀i ≥ 0, for Case 2.
Proof We first give the proof for Case 2. Define Û(k, k + i) = (u^[i](k), u^[i−1](k + 1), . . . , u^[1](k + i − 1), u*(k + i)); then, according to the definition of Λ^[i+1](x(k), Û(k, k + i)) in (3.34), we have

Λ[i+1] (x(k), Û (k, k + i)) = x T (k)Qx(k) + u [i]T (k)Ru [i] (k)


+ ···
+ x T (k + i − 1)Qx(k + i − 1)
+ u [1]T (k + i − 1)Ru [1] (k + i − 1)
+ x T (k + i)Qx(k + i) + u ∗T (k + i)Ru ∗ (k + i)
+ V [0] (x(k + i + 1)). (3.39)

From (3.26) and (3.4), we get

V [0] (x(k + i)) = J ∗ (x(k + i))


= x T (k + i)Qx(k + i) + u ∗T (k + i)Ru ∗ (k + i)
+ J ∗ (x(k + i + 1))
= x T (k + i)Qx(k + i) + u ∗T (k + i)Ru ∗ (k + i)
+ V [0] (x(k + i + 1)). (3.40)

On the other side, from (3.31), we have

V [i] (x(k)) = x T (k)Qx(k) + u [i]T (k)Ru [i] (k)


+ ···
+ x T (k + i − 1)Qx(k + i − 1) + u [1]T (k + i − 1)Ru [1] (k + i − 1)
+ V [0] (x(k + i)). (3.41)

So according to (3.40), we obtain

Λ[i+1] (x(k), Û (k, k + i)) = V [i] (x(k)). (3.42)

From Theorem 3.1, we can get

V [i+1] (x(k)) ≤ Λ[i+1] (x(k), Û (k, k + i)). (3.43)

So we have ∀x(k),

V [i+1] (x(k)) ≤ V [i] (x(k)), (3.44)

i.e., for x [i] (k),

V [i+1] (x [i] (k)) ≤ V [i] (x [i] (k)). (3.45)

For Case 1, we set V^[0] = 0, and the proof is similar to that of Case 2.

From Theorem 3.2, we can conclude that the sequence of performance index functions {V^[i+1](x(k))} is monotonically nonincreasing. Since the performance index function is positive definite, the sequence is convergent. Thus we define V^∞(x(k)) = lim_{i→∞} V^[i+1](x(k)) and u^∞(k) = lim_{i→∞} u^[i+1](k), and x^∞(k) is the state under u^∞(k). In the following, we give a theorem indicating that V^∞(x^∞(k)) satisfies the HJB equation.

Theorem 3.3 For system (3.1), the iteration algorithm is as in (3.15)–(3.33). Then
we have

V ∞ (x ∞ (k)) = x ∞T (k)Qx ∞ (k) + u ∞T (k)Ru ∞ (k) + V ∞ (x ∞ (k + 1)). (3.46)

Proof Let ε be an arbitrary positive number. Since V^[i+1](x(k)) is nonincreasing and V^∞(x(k)) = lim_{i→∞} V^[i+1](x(k)), there exists a positive integer p such that

V^[p](x(k)) − ε ≤ V^∞(x(k)) ≤ V^[p](x(k)).   (3.47)

So we have

V^∞(x(k)) ≥ inf_{u(k)} {x^T(k)Qx(k) + u^T(k)Ru(k) + V^[p−1](x(k + 1))} − ε.   (3.48)

According to Theorem 3.2, we have

V^∞(x(k)) ≥ inf_{u(k)} {x^T(k)Qx(k) + u^T(k)Ru(k) + V^∞(x(k + 1))} − ε.   (3.49)

Since ε is arbitrary, we have

V^∞(x(k)) ≥ inf_{u(k)} {x^T(k)Qx(k) + u^T(k)Ru(k) + V^∞(x(k + 1))}.   (3.50)

On the other side, according to Theorem 3.2, we have

V^∞(x(k)) ≤ V^[i+1](x(k))
          = inf_{u(k)} {x^T(k)Qx(k) + u^T(k)Ru(k) + V^[i](x(k + 1))}.   (3.51)

Letting i → ∞, we have

V^∞(x(k)) ≤ inf_{u(k)} {x^T(k)Qx(k) + u^T(k)Ru(k) + V^∞(x(k + 1))}.   (3.52)

So from (3.50) and (3.52), we get

V^∞(x(k)) = inf_{u(k)} {x^T(k)Qx(k) + u^T(k)Ru(k) + V^∞(x(k + 1))}, ∀x(k).   (3.53)

According to (3.29), we obtain u^∞(k). From (3.32) and (3.33), we have the corresponding state x^∞(k); thus the following expression

V ∞ (x ∞ (k)) = x ∞T (k)Qx ∞ (k) + u ∞T (k)Ru ∞ (k) + V ∞ (x ∞ (k + 1)) (3.54)

holds, which completes the proof.

So we can say that V^∞(x^∞(k)) = J*(x*(k)). Until now, we have proven that for all k, the iteration algorithm converges to the optimal performance index function as the iteration index i → ∞. For the finite-horizon optimal control problem of time-delay systems, another aspect is the length N of the optimal control sequence. In this chapter, the specific value of N is not known, but we can analyze the relationship between the iteration index i and the terminal time N.

Theorem 3.4 Let the iteration algorithm be as in (3.24)–(3.33). If V^[0](x(k + i + 1)) = J(x(k + i + 1), U*(k + i + 1, k + i + j − 1)), ∀x(k + i + 1), then the state of system (3.1) at time step k can reach zero in N = i + j steps for Case 2.

Proof For Case 2 of the iteration algorithm, we have

V [i+1] (x [i+1] (k)) = x [i+1]T (k)Qx [i+1] (k) + u [i+1]T (k)Ru [i+1] (k)
+ x [i]T (k + 1)Qx [i] (k + 1) + u [i]T (k + 1)Ru [i] (k + 1)
+ ···
+ x [1]T (k + i)Qx [1] (k + i) + u [1]T (k + i)Ru [1] (k + i)
+ V [0] (x [0] (k + i + 1)). (3.55)

According to [9], we can see that the optimal control sequence for x^[i+1](k) is U(k, k + i) = (u^[i+1](k), u^[i](k + 1), . . . , u^[1](k + i)). Since V^[0](x^[0](k + i + 1)) = J(x^[0](k + i + 1), U*(k + i + 1, k + i + j − 1)), we obtain N = i + j.

For Case 1 of the proposed iteration algorithm, we have the following corollary.
Corollary 3.1 Let the iteration algorithm be as in (3.15)–(3.23). Then for system (3.1), the state at time step k can reach zero in N = i + 1 steps for Case 1.

Proof For Case 1, we have

V [i+1] (x [i+1] (k)) = x [i+1]T (k)Qx [i+1] (k) + u [i+1]T (k)Ru [i+1] (k)
+ x [i]T (k + 1)Qx [i] (k + 1) + u [i]T (k + 1)Ru [i] (k + 1)
+ ···
+ x [1]T (k + i)Qx [1] (k + i) + u [1]T (k + i)Ru [1] (k + i)
+ V [0] (x [0] (k + i + 1))
= J (x [i+1] (k), U (k, k + i)), (3.56)

where U*(k, k + i) = (u^[i+1](k), u^[i](k + 1), . . . , u^[1](k + i)), and each element of U*(k, k + i) is obtained from (3.15) and (3.19). According to Case 1, x^[0](k + i + 1) = 0, so the state at time step k can reach zero in N = i + 1 steps.

We can see that for time step k, the exactly optimal controller is obtained only as i → ∞, which induces N → ∞ time steps according to Theorem 3.4 and Corollary 3.1. In this chapter, we want to obtain the nearly optimal performance index function within finite N time steps. The following corollary establishes the existence of the nearly optimal performance index function and the nearly optimal control.

Corollary 3.2 For system (3.1), let the iteration algorithm be as in (3.15)–(3.33); then ∀ε > 0, ∃I ∈ N such that ∀i > I, we have

|V^[i+1](x^[i+1](k)) − J*(x*(k))| ≤ ε.   (3.57)

Proof From Theorems 3.2 and 3.3, we can see that lim_{i→∞} V^[i](x^[i](k)) = J*(x*(k)); the conclusion then follows easily from the definition of the limit.
So we can say that V^[i](x^[i](k)) is the nearly optimal performance index function in the sense of ε, and the corresponding nearly optimal control is defined as follows:

u_ε(k) = arg inf_{u(k)} {x^T(k)Qx(k) + u^T(k)Ru(k) + V^[i](x(k + 1))}.   (3.58)

Remark 3.5 From Theorem 3.4 and Corollary 3.1, we can see that the length of the
control sequence N is dependent on the iteration step. In addition, from Corollary 3.2,
we know that the iteration step is dependent on ε. So it is concluded that the length
of the control sequence N is dependent on ε.
From (3.57), we can see that the inequality is hard to check directly, since J*(x*(k)) is unknown. So in practice, we adopt the following criterion in place of (3.57):

|V^[i+1](x^[i+1](k)) − V^[i](x^[i](k))| ≤ ε.   (3.59)
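In code, the practical test (3.59) amounts to comparing successive iterates of the performance index function; the sequence of values below is made up purely for illustration.

```python
def converged(v_next, v_prev, eps=1e-4):
    """Practical stopping test (3.59): stop iterating once successive
    performance index values differ by at most eps."""
    return abs(v_next - v_prev) <= eps

# Illustrative sequence of iterative performance index values (made up):
history = [8.2, 7.1, 6.6, 6.4, 6.3901, 6.39009]
stop = next(i for i in range(1, len(history))
            if converged(history[i], history[i - 1]))  # first index passing (3.59)
```

Because Theorem 3.2 makes the sequence monotonically nonincreasing and bounded below, a small successive difference is a reasonable proxy for closeness to the limit J*.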

3.3.3 Neural Network Implementation of the Iteration ADP Algorithm

The nonlinear optimal control solution relies on solving the HJB equation, whose exact solution is generally impossible to obtain for nonlinear time-delay systems. So in this section we employ neural networks to approximate u^[i](k) and V^[i+1](x(k)).
Assume the number of hidden layer neurons is denoted by l, the weight matrix between the input layer and the hidden layer is denoted by Ŵ, and the weight matrix between the hidden layer and the output layer is denoted by W; then the output of the three-layer neural network is represented by

F̂(X, W, Ŵ) = W^T σ(Ŵ^T X),   (3.60)

where σ(Ŵ^T X) ∈ R^l and [σ(z)]_i = (e^{z_i} − e^{−z_i})/(e^{z_i} + e^{−z_i}), i = 1, 2, . . . , l, is the activation function. The gradient descent rule is adopted for the weight update of each neural network.
There are two networks, the critic network and the action network. Both are chosen as three-layer back-propagation (BP) neural networks. The whole structure diagram is shown in Fig. 3.1.

Fig. 3.1 The structure diagram of the algorithm

3.3.3.1 The Critic Network

The critic network is used to approximate the performance index function V^[i+1](x(k)). The output of the critic network is denoted as

V̂ [i+1] (x(k)) = wc[i+1]T σ (vc[i+1]T x(k)). (3.61)

The target function can be written as

V [i+1] (x(k)) = x T (k)Qx(k) + û [i+1]T (k)R û [i+1] (k) + V̂ [i] (x(k + 1)). (3.62)

Then we define the error function for the critic network as follows:

ec[i+1] (k) = V̂ [i+1] (x(k)) − V [i+1] (x(k)). (3.63)

The objective function to be minimized in the critic network is

E_c^[i+1](k) = (1/2)(e_c^[i+1](k))².   (3.64)
So the gradient-based weight update rule for the critic network is given by

w_c^[i+2](k) = w_c^[i+1](k) + Δw_c^[i+1](k),   (3.65)

v_c^[i+2](k) = v_c^[i+1](k) + Δv_c^[i+1](k),   (3.66)

where

Δw_c^[i+1](k) = −α_c ∂E_c^[i+1](k)/∂w_c^[i+1](k),   (3.67)

Δv_c^[i+1](k) = −α_c ∂E_c^[i+1](k)/∂v_c^[i+1](k),   (3.68)

and the learning rate α_c of the critic network is a positive number.
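The critic update (3.61)–(3.68) can be sketched as a hand-coded gradient step (a hypothetical NumPy sketch with a scalar critic output; the target value 1.05 and the weight shapes are illustrative assumptions):

```python
import numpy as np

def critic_update(x, target, wc, vc, alpha_c=0.05):
    """One gradient-descent step (3.65)-(3.68) on E_c = 0.5 * e_c^2.

    wc : hidden-to-output weights, shape (l, 1)
    vc : input-to-hidden weights, shape (n, l)
    """
    h = np.tanh(vc.T @ x)                       # hidden activations sigma(vc^T x)
    e_c = float(wc[:, 0] @ h) - target          # critic error e_c(k), eq. (3.63)
    grad_wc = e_c * h[:, None]                  # dE_c/dwc
    grad_vc = e_c * np.outer(x, wc[:, 0] * (1.0 - h**2))  # dE_c/dvc (chain rule)
    return wc - alpha_c * grad_wc, vc - alpha_c * grad_vc

rng = np.random.default_rng(1)
x, target = np.array([0.5, -0.5]), 1.05         # target value is illustrative
wc = rng.uniform(-0.1, 0.1, (8, 1))             # initial weights from (-0.1, 0.1)
vc = rng.uniform(-0.1, 0.1, (2, 8))
e0 = abs(float(wc[:, 0] @ np.tanh(vc.T @ x)) - target)
for _ in range(200):                            # 200 update iterations, as in Sect. 3.4
    wc, vc = critic_update(x, target, wc, vc)
e1 = abs(float(wc[:, 0] @ np.tanh(vc.T @ x)) - target)
```

With the small learning rate, each step reduces the squared error for this single training sample, so the final error is smaller than the initial one.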



3.3.3.2 The Action Network

In the action network, the states x(k), x(k − h_1), . . . , x(k − h_l) are used as inputs, and the control û^[i](k) is the output of the network. The output can be formulated as

û [i] (k) = wa[i]T σ (va[i]T Y (k)), (3.69)

where Y (k) = [x T (k), x T (k − h 1 ), . . . , x T (k − h l )]T .


We define the output error of the action network as follows:

ea[i] (k) = û [i] (k) − u [i] (k). (3.70)

The weights in the action network are updated to minimize the following perfor-
mance error measure
E_a^[i](k) = (1/2) e_a^[i]T(k) e_a^[i](k).   (3.71)
The weights updating algorithm is similar to the one for the critic network. By
the gradient descent rule, we can obtain

w_a^[i+1](k) = w_a^[i](k) + Δw_a^[i](k),   (3.72)

v_a^[i+1](k) = v_a^[i](k) + Δv_a^[i](k),   (3.73)

where

Δw_a^[i](k) = −α_a ∂E_a^[i](k)/∂w_a^[i](k),   (3.74)

Δv_a^[i](k) = −α_a ∂E_a^[i](k)/∂v_a^[i](k),   (3.75)

and the learning rate α_a of the action network is a positive number.


In the next section, we will give a simulation study to explain the proposed iteration algorithm in detail.

3.4 Simulation Study

Example 3.1 We take the example in [9] with modification:

x(t + 1) = x(t − 2) + sin(0.1x 2 (t) + u(t)). (3.76)



Fig. 3.2 The performance trajectory for x(3) = 0.5

We give the initial states as χ_1(−2) = χ_1(−1) = χ_1(0) = 1.5, and the initial control policy as β(t) = sin^{−1}(x(t + 1) − x(t − 2)) − 0.1x²(t). We implement the proposed algorithm at the time instant k = 3.
First, according to the initial control policy β(t) of system (3.76), we obtain the first group of state data: x(1) = 0.8, x(2) = 0.7, x(3) = 0.5, x(4) = 0. We can also obtain the second group of state data: x(1) = 1.4, x(2) = 1.2, x(3) = 1.1, x(4) = 0.8, x(5) = 0.7, x(6) = 0.5, x(7) = 0. Obviously, for the first sequence of states we can get the optimal controller by Case 1 of the proposed algorithm. For the second one, the optimal controller can be obtained by Case 2 of the proposed algorithm, and the optimal control sequence U^o(k + 1, k + j + 1) can be obtained from the first group of state data. We select Q = R = 1.
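A quick consistency check (assuming NumPy; the helper names `plant` and `beta` are ours): substituting u = β(t) into (3.76) gives sin(0.1x²(t) + β(t)) = x(t + 1) − x(t − 2), so the initial policy exactly reproduces the listed state data.

```python
import numpy as np

def plant(x_t, x_tm2, u):
    """System (3.76): x(t+1) = x(t-2) + sin(0.1*x^2(t) + u(t))."""
    return x_tm2 + np.sin(0.1 * x_t**2 + u)

def beta(x_next, x_t, x_tm2):
    """Initial policy beta(t) = arcsin(x(t+1) - x(t-2)) - 0.1*x^2(t)."""
    return np.arcsin(x_next - x_tm2) - 0.1 * x_t**2

# x(-2), x(-1), x(0) = 1.5, plus the first state data group from the text
xs = {-2: 1.5, -1: 1.5, 0: 1.5, 1: 0.8, 2: 0.7, 3: 0.5, 4: 0.0}
recon = []
for t in range(0, 4):                       # reproduce x(1), ..., x(4)
    u = beta(xs[t + 1], xs[t], xs[t - 2])
    recon.append(plant(xs[t], xs[t - 2], u))
```

The arcsin argument stays within [−1, 1] for this data (the tightest case is x(3) − x(0) = −1.0), so the policy is well defined at every step.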
Three-layer BP neural networks are used for the critic network and the action network, with structures 2–8–1 and 6–8–1, respectively. The number of weight-update iterations for the two neural networks is 200. The initial weights are chosen randomly from (−0.1, 0.1), and the learning rates are α_a = α_c = 0.05. The performance index trajectories for the first and second state data groups are shown in Figs. 3.2 and 3.3, respectively. According to Theorem 3.2, for the first state group, the performance index is decreasing for i > 0; for the second state group, the performance index is decreasing for i ≥ 0. The state trajectory and the control trajectory for the second state data group are shown in Figs. 3.4 and 3.5. From the figures, we can see that the system is asymptotically stable. The simulation study shows that the new iteration ADP algorithm is feasible.

Fig. 3.3 The performance trajectory for x(3) = 1.1

Fig. 3.4 The state trajectory using the second state data group

Example 3.2 To demonstrate the effectiveness of the proposed iteration algorithm, we give a more substantial application: the ball-and-beam experiment. A ball is placed on a beam as shown in Fig. 3.6.

Fig. 3.5 The control trajectory using the second state data group

Fig. 3.6 Ball and beam experiment

The beam angle α can be expressed in terms of the servo gear angle θ as α ≈ (2d/L)θ.
The equation of motion for the ball is given as

(M/R² + m) r̈ + mg sin α − mr(α̇)² = 0,   (3.77)

where r is the ball position coordinate. The mass of the ball is m = 0.1 kg, the radius of the ball is R = 0.015 m, the radius of the lever gear is d = 0.03 m, the length of the beam is L = 1.0 m, and the ball's moment of inertia is M = 10⁻⁵ kg·m². Given the time step h, let r(t) = r(th), α(t) = α(th) and θ(t) = θ(th); then Eq. (3.77) is discretized as
x(t + 1) = x(t) + y(t) − A sin((2d/L)θ(t)) + Bx(t)(θ(t) − z(t))²,
y(t + 1) = y(t) − A sin((2d/L)θ(t)) + Bx(t)(θ(t) − z(t))²,
z(t + 1) = θ(t),   (3.78)

where A = mgh²R²/(M + mR²) and B = 4d²mR²/(L²(M + mR²)). The state is X(t) = (x(t), y(t), z(t))^T, in which x(t) = r(t), y(t) = r(t) − r(t − 1) and z(t) = θ(t − 1). The control input is u(t) = θ(t). For the convenience of analysis, system (3.78) is rewritten as follows:
x(t + 1) = x(t − 2) + y(t) − A sin((2d/L)θ(t)) + Bx(t)(θ(t) − z(t))²,
y(t + 1) = y(t) − A sin((2d/L)θ(t)) + Bx(t)(θ(t) − z(t − 2))²,
z(t + 1) = θ(t).   (3.79)
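Under the stated parameters, the discretized model (3.79) can be coded directly (a minimal Python sketch; the gravitational acceleration g = 9.81 m/s² is an assumption, since its value is not given in the text):

```python
import math

# Physical constants from the text; g = 9.81 m/s^2 is an assumed value.
m, R, d, L, M, h, g = 0.1, 0.015, 0.03, 1.0, 1e-5, 0.1, 9.81
A = m * g * h**2 * R**2 / (M + m * R**2)
B = 4 * d**2 * m * R**2 / (L**2 * (M + m * R**2))

def bb_step(x, y, z, theta, x_d2, z_d2):
    """One step of the delayed ball-and-beam model (3.79).

    (x, y, z) is the current state X(t), theta is the control u(t) = theta(t),
    and x_d2 = x(t-2), z_d2 = z(t-2) are the delayed states.
    """
    s = A * math.sin(2.0 * d / L * theta)
    x_next = x_d2 + y - s + B * x * (theta - z)**2
    y_next = y - s + B * x * (theta - z_d2)**2
    z_next = theta
    return x_next, y_next, z_next
```

With these constants, A ≈ 0.068 and B ≈ 0.0025, so the sine term dominates the quadratic term for small deviations.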

In this chapter, h is selected as 0.1, and the states of time-delay system (3.79) are X(1) = [1.0027, 0.0098, 1]^T, X(2) = [0.0000, 0.0057, 1.0012]^T, X(3) = [1.0000, 0.0016, 1.0000]^T, X(4) = [1.0002, −0.0025, 0.9994]^T and X(5) = [0, 0, 0]^T. The initial states are χ(−2) = [0.9929, 0.0221, 1.0000]^T, χ(−1) = [−0.0057, 0.0180, 1.0000]^T and χ(0) = [0.9984, 0.0139, 1.0000]^T. The initial control sequence is (1.0000, 1.0012, 1.0000, 0.9994, 0.0000). Obviously, the initial control sequence and states are not the optimal ones, so the algorithm proposed in this chapter is adopted to obtain the optimal solution. We select Q = R = 1. The number of weight-update iterations for the two neural networks is 200. The initial weights of the critic network are chosen randomly from (−0.1, 0.1), the initial weights of the action network are chosen randomly from [−2, 2], and the learning rates are α_a = α_c = 0.001. Consider the states X(4) = [1.0002, −0.0025, 0.9994]^T and X(1) = [1.0027, 0.0098, 1]^T. Obviously, for the state X(4) we can get the optimal controller by Case 1 of the proposed algorithm. For the state X(1), the optimal controller can be obtained by Case 2 of the proposed algorithm. Then we obtain the performance index function trajectories of the two states as shown in Figs. 3.7 and 3.8, which satisfy Theorem 3.2, i.e., for the state X(4), the performance index is decreasing for i > 0, and for the state X(1), the performance index is decreasing for i ≥ 0. The state trajectories and the control trajectory for state X(1) are shown in Figs. 3.9, 3.10, 3.11 and 3.12. From the figures, we can see that the states of the system are asymptotically stable. Based on the above analysis, we can conclude that the proposed iteration ADP algorithm is satisfactory.

Fig. 3.7 The performance trajectory for X(4)

Fig. 3.8 The performance trajectory for X(1)



Fig. 3.9 The state trajectory of x(t)

Fig. 3.10 The state trajectory of y(t)



Fig. 3.11 The state trajectory of z(t)

Fig. 3.12 The control trajectory



3.5 Conclusion

This chapter proposed a novel ADP algorithm to deal with the nearly finite-horizon optimal control problem for a class of deterministic nonaffine time-delay nonlinear systems. To determine the optimal state, state updating was included in the novel ADP algorithm. The theorems showed that the proposed iteration algorithm is convergent. Moreover, the relationship between the iteration steps and the time steps was given. The simulation study demonstrated the effectiveness of the proposed control algorithm.

References

1. Niculescu, S.: Delay Effects on Stability: A Robust Control Approach. Springer, Berlin (2001)
2. Gu, K., Kharitonov, V., Chen, J.: Stability of Time-Delay Systems. Birkhäuser, Boston (2003)
3. Song, R., Zhang, H., Luo, Y., Wei, Q.: Optimal control laws for time-delay systems with
saturating actuators based on heuristic dynamic programming. Neurocomputing 73(16–18),
3020–3027 (2010)
4. Huang, J., Lewis, F.: Neural-network predictive control for nonlinear dynamic systems with
time-delay. IEEE Trans. Neural Netw. 14(2), 377–389 (2003)
5. Chyung, D.: On the controllability of linear systems with delay in control. IEEE Trans. Autom.
Control 15(2), 255–257 (1970)
6. Phat, V.: Controllability of discrete-time systems with multiple delays on controls and states.
Int. J. Control 49(5), 1645–1654 (1989)
7. Chyung, D.: Discrete optimal systems with time delay. IEEE Trans. Autom. Control 13(1), 117
(1968)
8. Chyung, D., Lee, E.: Linear optimal systems with time delays. SIAM J. Control 4(3), 548–575
(1966)
9. Wang, F., Jin, N., Liu, D., Wei, Q.: Adaptive dynamic programming for finite-horizon optimal control of discrete-time nonlinear systems with ε-error bound. IEEE Trans. Neural Netw. 22, 24–36 (2011)
10. Manu, M., Mohammad, J.: Time-Delay Systems Analysis, Optimization and Applications.
North-Holland, New York (1987)
Chapter 4
Multi-objective Optimal Control for Time-Delay Systems

In this chapter, a novel multi-objective ADP method is constructed to obtain the optimal controller for a class of nonlinear time-delay systems. Using the weighted sum technology, the original multi-objective optimal control problem is transformed into a single-objective one. An ADP method is established for nonlinear time-delay systems to solve the optimal control problem. Convergence analysis is given to show that the presented iterative performance index function sequence is convergent and the closed-loop system is asymptotically stable. Neural networks are used to obtain the approximate control policy and the approximate performance index function, respectively. Two simulation examples are presented to illustrate the performance of the presented optimal control method.

4.1 Introduction

The multi-objective optimal control problem for a class of unknown discrete-time nonlinear systems was discussed in [1]. In [2], an optimal control scheme with a discount factor was developed for nonaffine nonlinear unknown discrete-time systems. However, as far as we know, how to obtain the multi-objective optimal control solution of nonlinear time-delay systems based on the ADP algorithm is still an intractable problem. This chapter discusses this difficult problem explicitly. First, a single-objective optimal control problem is obtained by the weighted sum technology. Then, the iterative ADP optimal control method for time-delay systems is established, and the convergence analysis is presented. The neural network implementation is also given. Finally, to illustrate the control effect of the proposed multi-objective optimal control method, two simulation examples are introduced.
The rest of this chapter is organized as follows. In Sect. 4.2, we present the problem formulation. In Sect. 4.3, we develop the multi-objective optimal control scheme

© Science Press, Beijing and Springer Nature Singapore Pte Ltd. 2019
R. Song et al., Adaptive Dynamic Programming: Single and Multiple
Controllers, Studies in Systems, Decision and Control 166,
https://doi.org/10.1007/978-981-13-1712-5_4

and give the corresponding convergence proof. In Sect. 4.4, we present the implementation process using neural networks. In Sect. 4.5, we give examples to demonstrate the validity of the proposed control scheme. In Sect. 4.6, we draw the conclusion.

4.2 Problem Formulation

We consider the unknown nonlinear time-delay systems as follows:

x(t + 1) = f (x(t), x(t − h)) + g(x(t), x(t − h))u(t), (4.1)

where u(t) ∈ R^m and the states x(t), x(t − h) ∈ R^n, with x(t) = 0 for t ≤ 0. F(x(t), x(t − h), u(t)) = f(x(t), x(t − h)) + g(x(t), x(t − h))u(t) is an unknown continuous function.
For the system (4.1), we will consider the following multi-objective optimal control problem:

inf_{u(t)} P_i(x(t), u(t)),  i = 1, 2, . . . , h,   (4.2)

where P_i(x(t), u(t)) is the performance index function, defined as

P_i(x(t), u(t)) = Σ_{j=t}^∞ [ L_i(x(j)) + u^T(j)R_i u(j) ],   (4.3)

in which the utility function I_i(j) = L_i(x(j)) + u^T(j)R_i u(j) is positive definite and R_i is a positive definite matrix.
In this chapter, the weighted sum technology is used to combine the different objectives, so the following single-objective performance index function is obtained:

P(x(t)) = Σ_{i=1}^h w_i P_i(x(t)),   (4.4)

where W = [w_1, w_2, . . . , w_h]^T, w_i ≥ 0 and Σ_{i=1}^h w_i = 1.
So we have the following expression:

P(x(t)) = Σ_{i=1}^h w_i I_i(t) + Σ_{i=1}^h w_i P_i(x(t + 1))
        = W^T I(t) + P(x(t + 1)),   (4.5)

where I(t) = [I_1(t), I_2(t), . . . , I_h(t)]^T.
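The weighted-sum scalarization (4.4)–(4.5) reduces to a dot product. A minimal NumPy sketch (the utility values are illustrative; the weight vector is the one used later in Example 4.1):

```python
import numpy as np

def scalarize(utilities, w):
    """Weighted-sum scalarization of per-objective utilities, cf. (4.4)-(4.5):
    returns W^T I(t) for I(t) = [I_1(t), ..., I_h(t)]^T."""
    w = np.asarray(w, dtype=float)
    assert np.all(w >= 0) and np.isclose(w.sum(), 1.0)  # w_i >= 0, sum w_i = 1
    return float(w @ np.asarray(utilities, dtype=float))

# Illustrative utility values with the weight vector used in Example 4.1
I_t = scalarize([0.2, 0.4, 0.1], [0.1, 0.3, 0.6])
```

Because the weights are fixed and nonnegative, the scalarized utility inherits positive definiteness from the individual utilities, which the convergence analysis in Sect. 4.3 relies on.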


Therefore, we can define the optimal performance index function as

P*(x(t)) = inf_{u(t)} {P(x(t), u(t))},   (4.6)

and the optimal control policy as

u*(t) = arg inf_{u(t)} {P(x(t), u(t))}.   (4.7)

In fact, the optimal control policy u*(t) should satisfy the following first-order necessary condition according to Bellman's principle of optimality:

u*(t) = −(1/2) ( Σ_{i=1}^h w_i R_i )^{−1} (∂F/∂u)^T Σ_{i=1}^h w_i (∂P_i/∂F).   (4.8)

Thus we obtain the optimal performance index function as follows:

P*(x(t)) = Σ_{i=1}^h w_i L_i(x(t)) + Σ_{i=1}^h w_i P_i*(x(t + 1))
  + (1/4) [ ( Σ_{i=1}^h w_i R_i )^{−1} (∂F/∂u)^T Σ_{i=1}^h w_i (∂P_i/∂F) ]^T ( Σ_{i=1}^h w_i R_i )
  × [ ( Σ_{i=1}^h w_i R_i )^{−1} (∂F/∂u)^T Σ_{i=1}^h w_i (∂P_i/∂F) ].   (4.9)

4.3 Derivation of the ADP Algorithm for Time-Delay Systems

The aim of this chapter is to obtain the solution of the optimal control problem (4.6), so we propose the following ADP algorithm.
Let us start from i = 0 and define the initial iterative performance index function P^[0](·) = 0. Then we can calculate the initial iterative control vector u^[0](t):

u^[0](t) = arg inf_{u(t)} { W^T I(x(t), u(t)) }.   (4.10)

So we can get the next-step performance index function as

P^[1](x(t)) = inf_{u(t)} { W^T I(x(t), u(t)) },   (4.11)

accordingly, we have the following state

x(t + 1) = F(x(t), x(t − h), u [0] (t)). (4.12)

Then the iterative control is updated by

u^[i](t) = arg inf_{u(t)} { W^T I(x(t), u(t)) + P^[i](x(t + 1)) },   (4.13)

and the iterative performance index function is updated as

P^[i+1](x(t)) = inf_{u(t)} { W^T I(x(t), u(t)) + P^[i](x(t + 1)) }.   (4.14)

So we can have the updated state

x(t + 1) = F(x(t), x(t − h), u [i] (t)). (4.15)
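The iteration (4.10)–(4.15) can be illustrated on a hypothetical delay-free scalar system, replacing the infimum by a search over a control grid and storing P^[i] on a state grid (this is only a tabular sketch, not the neural-network implementation of Sect. 4.4; the dynamics and cost are our own toy choices):

```python
import numpy as np

# Toy delay-free scalar dynamics x(t+1) = 0.5*x + u with utility W^T I = x^2 + u^2.
xs = np.linspace(-1.0, 1.0, 41)      # state grid
us = np.linspace(-1.0, 1.0, 41)      # control grid (includes u = 0)
P = np.zeros_like(xs)                # P^[0](.) = 0, as in (4.10)

def read(P, x):
    """Piecewise-linear read-out of the tabulated performance index."""
    return np.interp(np.clip(x, -1.0, 1.0), xs, P)

history = []
for i in range(30):                  # iterate (4.14): P^[i+1] = min_u {utility + P^[i]}
    P = np.array([np.min(x**2 + us**2 + read(P, 0.5 * x + us)) for x in xs])
    history.append(P.copy())
```

Theorem 4.1 below predicts that the stored iterates are nondecreasing in i, and Lemma 4.2 that they stay bounded; both properties can be checked directly on `history`.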

In the next part, we will prove the convergence of the proposed algorithm and the asymptotic stability of the closed-loop system.
Lemma 4.1 Let {μ^[i](t)} be an arbitrary control sequence, and let u^[i](t) be obtained from (4.13). The iterative performance index function P^[i] is obtained from (4.14), and Υ^[i] is defined as

Υ^[i+1](x(t)) = W^T I(x(t), μ^[i](t)) + Υ^[i](x(t + 1)),   (4.16)

where x(t + 1) is obtained under μ^[i](t). If the initial values of P^[i] and Υ^[i] are the same for i = 0, then P^[i+1](x(t)) ≤ Υ^[i+1](x(t)), ∀i.
Lemma 4.2 Let the iterative performance index function P^[i](x(t)) be obtained from (4.14). Then P^[i](x(t)) is bounded, i.e., 0 ≤ P^[i+1](x(t)) ≤ B, ∀i, where B is a positive constant.

Theorem 4.1 Let the iterative performance index function P^[i](x(t)) be obtained from (4.14). Then P^[i+1](x(t)) ≥ P^[i](x(t)), ∀i.
Proof Define Θ^[i](x(t)) as follows:

Θ^[i](x(t)) = W^T I(x(t), u^[i](t)) + Θ^[i−1](x(t + 1)),   (4.17)

with Θ^[0](·) = P^[0](·) = 0.


Here, mathematical induction is used to prove the conclusion Θ^[i](x(t)) ≤ P^[i+1](x(t)).
It is quite clear that for i = 0, we have

P [1] (x(t)) − Θ [0] (x(t)) = W T I (x(t), u [0] (t)) ≥ 0. (4.18)

So it follows for i = 0 that

P [1] (x(t)) ≥ Θ [0] (x(t)). (4.19)

Suppose that for i − 1, P^[i](x(t)) ≥ Θ^[i−1](x(t)), ∀x(t). Then we have

P^[i+1](x(t)) = W^T I(x(t), u^[i](t)) + P^[i](x(t + 1)),   (4.20)

and

Θ^[i](x(t)) = W^T I(x(t), u^[i](t)) + Θ^[i−1](x(t + 1)).   (4.21)

Thus we can conclude that

P^[i+1](x(t)) − Θ^[i](x(t)) = P^[i](x(t + 1)) − Θ^[i−1](x(t + 1)) ≥ 0.   (4.22)

So, we have
P [i+1] (x(t)) ≥ Θ [i] (x(t)), ∀i. (4.23)

Furthermore, it is known from Lemma 4.1 that P^[i](x(t)) ≤ Θ^[i](x(t)). So it can be concluded that

P^[i](x(t)) ≤ Θ^[i](x(t)) ≤ P^[i+1](x(t)),   (4.24)

which means that {P^[i](x(t))} is convergent as i → ∞.
Hence, as i → ∞, P^[i](x(t)) → P*(x(t)), and accordingly u^[i](t) → u*(t).

Theorem 4.2 Let the iterative performance index function P^[i](x(t)) be obtained from (4.14), let u*(t) be expressed as in (4.13), and let P*(x(t)) be expressed as in (4.14). Then u*(t) stabilizes the system (4.1) asymptotically.

Proof It has been proven that P^[i+1](x(t)) ≥ 0, so P*(x(t)) ≥ 0. From (4.5), we have

P*(x(t + 1)) − P*(x(t)) = −W^T I(x(t), u*(t)) ≤ 0.   (4.25)

So, the closed-loop system is asymptotically stable according to the Lyapunov theorem.
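The Lyapunov argument (4.25) can be checked numerically on a hypothetical scalar example: iterate (4.14) on a grid until near convergence, extract the greedy control as in (4.7), and observe that the converged performance index decreases along the closed-loop trajectory (a tabular sketch with our own toy dynamics and cost, not the book's neural-network method):

```python
import numpy as np

# Toy scalar system x(t+1) = 0.5*x + u with utility x^2 + u^2, tabulated on grids.
xs = np.linspace(-1.0, 1.0, 41)      # state grid
us = np.linspace(-1.0, 1.0, 41)      # control grid
P = np.zeros_like(xs)

def read(P, x):
    """Piecewise-linear read-out of the tabulated performance index."""
    return np.interp(np.clip(x, -1.0, 1.0), xs, P)

for _ in range(200):                 # iterate (4.14) until (near) convergence
    P = np.array([np.min(x**2 + us**2 + read(P, 0.5 * x + us)) for x in xs])

def u_star(x):
    """Greedy control, cf. (4.7), for the converged performance index."""
    return us[np.argmin(x**2 + us**2 + read(P, 0.5 * x + us))]

x, values = 0.9, []
for _ in range(20):                  # closed-loop rollout from x(0) = 0.9
    values.append(read(P, x))
    x = 0.5 * x + u_star(x)
```

Up to small grid-interpolation errors, the recorded values are nonincreasing along the trajectory, matching (4.25), and the state is driven toward the origin.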

Remark 4.1 In fact, the proposed method is a further development of [3]. Although the multi-objective control problem was presented in [3], the time-delay state of the nonlinear system was not considered there, and the multi-objective optimal control problem of time-delay systems was not discussed in [1] either. In this chapter, the multi-objective optimal control problem of time-delay systems is solved successfully, so this chapter is an advancement over the traditional ADP literature.

4.4 Neural Network Implementation for the Multi-objective Optimal Control Problem of Time-Delay Systems

There are two neural networks in the ADP algorithm, i.e., the critic network and
the action network. The approximate performance index function P [i+1] (x(t)) is
obtained from the critic network, which is denoted as follows:

P̂ [i+1] (x(t)) = vc[i+1]T σ (wc[i+1]T x(t)). (4.26)

The approximation error is defined as

ec[i+1] (t) = P̂ [i+1] (x(t)) − P [i+1] (x(t)), (4.27)

where the target value P^[i+1](x(t)) is defined as

P^[i+1](x(t)) = W^T I(x(t), û^[i](t)) + P̂^[i](x(t + 1)).   (4.28)

To obtain the update rule, we define

Y_c^[i+1](t) = (1/2) e_c^[i+1]T(t) e_c^[i+1](t).   (4.29)
Thus the critic network weights are updated by

v_c^[i+2](t) = v_c^[i+1](t) + Δv_c^[i+1](t),   (4.30)

where

Δv_c^[i+1](t) = −β_c ∂Y_c^[i+1](t)/∂v_c^[i+1](t),   (4.31)

and

w_c^[i+2](t) = w_c^[i+1](t) + Δw_c^[i+1](t),   (4.32)

where

Δw_c^[i+1](t) = −β_c ∂Y_c^[i+1](t)/∂w_c^[i+1](t).   (4.33)

The regulation parameter β_c > 0.


The approximate iterative control is obtained by the action network. The states x(t − 1), x(t − 2), . . . , x(t − h) are used as inputs, and the action network output is û^[i](t), which can be formulated as

û [i] (t) = va[i]T σ (wa[i]T X (t)), (4.34)

where X (t) = [x T (t − 1), x T (t − 2), . . . , x T (t − h)]T .


The approximation error of the network is defined as follows:

ea[i] (t) = û [i] (t) − u [i] (t). (4.35)

To obtain the update rule, we define the following performance error measure:

Y_a^[i](t) = (1/2) e_a^[i]T(t) e_a^[i](t).   (4.36)
The action network weights are updated by

v_a^[i+1](t) = v_a^[i](t) + Δv_a^[i](t),   (4.37)

where

Δv_a^[i](t) = −β_a ∂Y_a^[i](t)/∂v_a^[i](t),   (4.38)

and

w_a^[i+1](t) = w_a^[i](t) + Δw_a^[i](t),   (4.39)

where

Δw_a^[i](t) = −β_a ∂Y_a^[i](t)/∂w_a^[i](t).   (4.40)

The regulation parameter β_a > 0.

4.5 Simulation Results

Example 4.1 To illustrate the detailed implementation procedure of the presented method, we consider the following nonlinear time-delay system [4]:

x(t + 1) = f(x(t), x(t − 2)) + g(x(t), x(t − 2))u(t),
x(s) = x_0,  −2 ≤ s ≤ 0,   (4.41)

where

0.2x1 (t) exp (x2 (t))2 x2 (t − 2)
f (x(t), x(t − 2)) = ,
0.3 (x2 (t))2 x1 (t − 2)

and

x2 (t − 2) 0.2
g(x(t), x(t − 1), x(t − 2)) = .
0.1 1

The performance index functions P_i(x(t)) are similar to those in [1] and are defined as follows:
P_1(x(t)) = Σ_{j=t}^∞ [ ln(x_1²(j) + 1) + u²(j) ],   (4.42)

P_2(x(t)) = Σ_{j=t}^∞ [ ln(x_2²(j) + 1) + u²(j) ],   (4.43)

and

P_3(x(t)) = Σ_{j=t}^∞ [ ln(x_1²(j) + x_2²(j) + 1) + u²(j) ].   (4.44)
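A finite-horizon numerical approximation of (4.42)–(4.44) for system (4.41) can be sketched as follows (plain-Python sketch; the uncontrolled rollout u = 0 and the 50-step truncation of the infinite sums are illustrative assumptions of ours):

```python
import math

def drift(x, xd):
    """Drift term f of system (4.41); x = [x1, x2] at time t, xd at t-2."""
    return [0.2 * x[0] * math.exp(x[1]**2) * xd[1],
            0.3 * x[1]**2 * xd[0]]

# x(-2), x(-1), x(0) all equal to the initial state x0 = [0.5, -0.5]
traj = [[0.5, -0.5]] * 3
for t in range(50):                       # uncontrolled rollout, u = 0
    traj.append(drift(traj[-1], traj[-3]))

W = [0.1, 0.3, 0.6]                       # weight vector from Example 4.1
P = [0.0, 0.0, 0.0]
for x1, x2 in traj[2:]:                   # truncated sums (4.42)-(4.44) with u = 0
    P[0] += math.log(x1**2 + 1.0)
    P[1] += math.log(x2**2 + 1.0)
    P[2] += math.log(x1**2 + x2**2 + 1.0)
P_weighted = sum(wi * Pi for wi, Pi in zip(W, P))
```

For this initial state the states contract quickly even without control, so the truncated sums are already close to their limits; the weighted value P_weighted is the quantity the critic network is trained to approximate.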

According to the desired system properties, the weight vector W is selected as W = [0.1, 0.3, 0.6]^T. To implement the established algorithm, we let the initial state of time-delay system (4.41) be x_0 = [0.5, −0.5]^T. We also use neural networks to implement the iterative ADP algorithm. The critic network is a three-layer BP neural network, which is used to approximate the iterative performance index function. Another three-layer BP neural network is used as the action network to approximate the iterative control. To increase the neural network accuracy, the two networks are trained for 1000 steps, and the regulation parameters are both 0.001. We then get the iterative performance index functions shown in Fig. 4.1, which converge to a constant. The system runs for 200 steps, and the simulation results are obtained: the state trajectories are shown in Fig. 4.2, and the control input trajectories are shown in Fig. 4.3. It is quite clear that the time-delay system converges after 60 time steps, so the iterative multi-objective optimal control method constructed in this chapter has good performance.

Example 4.2 This example is derived from [3, 5] with modifications:

x_1(t + 1) = x_1(t) + 10⁻³ x_2(t − 1),
x_2(t + 1) = 0.03151 sin(x_1(t − 2)) + 0.9991x_2(t) + 0.0892x_3(t),
x_3(t + 1) = −0.01x_2(t) + 0.99x_3(t − 1) + u(t),
x(s) = x_0,  −2 ≤ s ≤ 0.   (4.45)

For system (4.45), define the performance index functions



Fig. 4.1 The iterative performance index functions

Fig. 4.2 The state trajectories x1 and x2



Fig. 4.3 The control trajectories u1 and u2


P_1(x(t)) = Σ_{j=t}^∞ [ ln(x_1²(j) + x_2²(j)) + u²(j) ],   (4.46)

P_2(x(t)) = Σ_{j=t}^∞ [ ln(x_2²(j) + x_3²(j)) + u²(j) ],   (4.47)

P_3(x(t)) = Σ_{j=t}^∞ [ ln(x_1²(j) + x_3²(j)) + u²(j) ],   (4.48)

and

P_4(x(t)) = Σ_{j=t}^∞ [ ln(x_1²(j) + 1) + u²(j) ].   (4.49)

The weight vector is W = [0.1, 0.3, 0.1, 0.5]^T. Then the performance index function P(x(t)) can be obtained based on the weighted sum method. The initial state is selected as x_0 = [0.5, −0.5, 1]^T. The structures, learning rates and initial weights of the critic network and action network are the same as in Example 4.1. After 10 iterative steps, the performance index function trajectory is obtained as in Fig. 4.4. The state trajectories are shown in Fig. 4.5, and the control trajectory is given in Fig. 4.6.

Fig. 4.4 The iterative performance index functions

Fig. 4.5 The state trajectories x1, x2 and x3



Fig. 4.6 The control trajectory u

Fig. 4.7 The state trajectories x1, x2 and x3 obtained by [1]



Fig. 4.8 The control trajectory u obtained by [1]

To illustrate the performance advantage, the method in [1] has also been used to obtain the optimal control of system (4.45). After 1500 time steps, the system states and control converge to zero, as shown in Figs. 4.7 and 4.8. From the comparison, we can see that the convergence speed of the method presented in this chapter is faster and its performance is better than that of the method in [1].

4.6 Conclusions

This chapter addressed nonlinear time-delay systems and solved the multi-objective optimal control problem using the presented ADP method. By the weighted sum technology, the original multi-objective optimal control problem was transformed into a single-objective one. An ADP method was then established for the considered nonlinear time-delay systems, and the convergence analysis proved that the iterative performance index functions converge to the optimal one. The critic and action networks were used to obtain the iterative performance index function and the iterative control policy, respectively. In the simulation, multiple performance index functions were given, and the results achieved using the proposed method illustrate the validity and effectiveness of the proposed multi-objective optimal control method.

References

1. Wei, Q., Zhang, H., Dai, J.: Model-free multiobjective approximate dynamic programming for
discrete-time nonlinear systems with general performance index functions. Neurocomputing
72(7–9), 1839–1848 (2009)
2. Wang, D., Liu, D., Wei, Q., Zhao, D., Jin, N.: Optimal control of unknown nonaffine nonlinear
discrete-time systems based on adaptive dynamic programming. Automatica 48(8), 1825–1832
(2012)
3. Song, R., Xiao, W., Zhang, H.: Multi-objective optimal control for a class of unknown nonlinear
systems based on finite-approximation-error ADP algorithm. Neurocomputing 119(7), 212–221
(2013)
4. Zhang, H., Song, R., Wei, Q., Zhang, T.: Optimal tracking control for a class of nonlinear discrete-time systems with time delays based on heuristic dynamic programming. IEEE Trans. Neural Netw. 22(12), 1851–1862 (2011)
5. Gyurkovics, É., Takács, T.: Quadratic stabilisation with H∞ -norm bound of non-linear discrete-
time uncertain systems with bounded control. Syst. Control Lett. 50, 277–289 (2003)
Chapter 5
Multiple Actor-Critic Optimal Control via ADP

In industrial process control, there may be multiple performance objectives, depending on salient features of the input-output data. Aiming at this situation, this chapter
proposes multiple actor-critic structures to obtain the optimal control via input-output
data for unknown nonlinear systems. The shunting inhibitory artificial neural net-
work (SIANN) is used to classify the input-output data into one of several categories.
Different performance measure functions may be defined for disparate categories.
The ADP algorithm, which contains model module, critic network and action net-
work, is used to establish the optimal control in each category. A recurrent neural
network (RNN) model is used to reconstruct the unknown system dynamics using
input-output data. Neural networks are used to approximate the critic and action
networks, respectively. It is proven that the model error and the closed unknown
system are uniformly ultimately bounded (UUB). Simulation results demonstrate
the performance of the proposed optimal control scheme for the unknown nonlinear
system.

5.1 Introduction

Recent work in experimental neurocognitive psychology has revealed the existence of new relations between structures in the human brain and their functions in decision and control [1, 2]. Mechanisms in the brain involving the emotional gates in the
amygdala and deliberative decisions between risky choices in the anterior cingulate
cortex (ACC) indicate the presence of multiple interacting actor-critic structures.
That research reveals the role of the amygdala in fast intuitive response to the envi-
ronment based on stored patterns, and the role of the ACC in deliberative response
when risk or mismatch with stored patterns is detected. It is shown that the amygdala
uses features and cues received from interactions with the environment to classify

possible responses into different categories based on context and match with previously encountered situations. These categories can be viewed as representing stored
behavior responses or patterns. Werbos [3, 4] discusses novel ADP structures with
multiple critic levels based on functional regions in the human cerebral cortex. The
interplay between using previously stored experiences to quickly decide possible
actions, and real-time exploration and learning for precise control is emphasized.
This agrees with the work in [1, 2] which details how stored behavior patterns can be
used to enact fast decisions by drawing on previous experiences when there is match
between observed attributes and stored patterns. In the event of risk, mismatch, or
anxiety, higher-level control mechanisms are recruited by the ACC that involve more
focus on real-time observations and exploration.
There are existing approaches to adaptive control and ADP that take into account
some features of these new studies. Included is the multiple-model adaptive control
method of [5]. Multiple actor-critic structures were developed by [5–8]. These works
used multiple modules, each of which contains a stored dynamics model of the
environment and a controller. Incoming data were classified into modules based on
prediction errors between observed data and the prediction of the dynamics model
in each module.
In industrial process control, the system dynamics are difficult and expensive to
estimate and cannot be accurately obtained [9]. Therefore, it is difficult to design
optimal controllers for these unknown systems. On the other hand, new imperatives
in minimizing resource use and pollution, while maximizing throughput and yield,
make it important to control industrial processes in an optimal fashion. Using proper
control techniques, the input-output data generated in the process of system operation
can be accessed and used to design optimal controllers for unknown systems [10].
Recently, many studies have been done on data-based control schemes for unknown
systems [11–17]. For large-scale industrial processes, there are usually different
production performance measures according to desired process properties. Therefore,
more extensive architectures are needed to design optimal controllers for large-scale
industrial processes, which respond quickly based on previously learned behavior
responses, allow adaptive online learning to guarantee real-time performance, and
accommodate different performance measures for different situations.
Such industrial imperatives show the need for more comprehensive approaches to
data-driven process control. Multiple performance objectives may be needed depend-
ing on different important features hidden in observed data. This requires different
control categories that may not only depend on closeness of match with predictions
based on stored dynamics models as used in [2, 18]. The work of Levine [19] and
Werbos [20] indicates that more complex structures are responsible for learning in
the human brain than the standard three-level actor-critic based on networks for actor,
model, and critic.
It has been shown [1, 2] that shunting inhibition is a powerful computational
mechanism and plays an important role in sensory neural information processing
systems [21]. Bouzerdoum [22] proposed the shunting inhibitory artificial neural
network (SIANN) which can be used for highly effective classification and func-
tion approximation. In SIANN the synaptic interactions among neurons are medi-
ated via shunting inhibition. Shunting inhibition is more powerful than multi-layer
perceptrons in that each shunting neuron has a higher discrimination capacity than
a perceptron neuron and allows the network to construct complex decision surfaces
much more readily. Therefore, SIANN is a powerful and effective classifier. In [23],
efficient training algorithms for a class of shunting inhibitory convolution neural
networks are presented.
In this chapter we bring together the recent studies in neurocognitive psychology
[19, 20] and recent mathematical machinery for implementing shunting inhibition
[21, 22] to develop a new class of process controllers that have a novel multiple actor
critic structure. This multiple network structure has learning on several timescales.
Stored experiences are first used to train a SIANN to classify environmental cues into
different behavior performance categories according to different technical require-
ments. This classification is general, and does not depend only on the match of observed
data with stored dynamics models. Then, data observed in real-time are used to
learn deliberative responses for more precise online optimal control. This results in
faster immediate control using stored experiences, along with real-time deliberative
learning.
The contributions of this chapter are as follows.
(1) Based on neurocognitive psychology, a novel controller based on multiple actor-
critic structures is developed for unknown systems. This controller trades off fast
actions based on stored behavior patterns with real-time exploration using current
input-output data.
(2) A SIANN is used to classify input-output data into categories based on salient
features of previously recorded data. Shunting inhibition allows for higher dis-
criminatory capabilities and has been shown important in neural information
processing.
(3) In each category, an RNN is used to identify the system dynamics, novel parameter
update algorithms are given, and it is proven that the parameter errors are uniformly
ultimately bounded (UUB).
(4) Action-critic networks are developed in each category to obtain the optimal per-
formance measure function and the optimal controller based on current observed
input-output data. It is proven that the closed-loop systems and the weight errors
are UUB based on Lyapunov techniques.
The rest of the chapter is organized as follows. In Sect. 5.2, the problem moti-
vations and preliminaries are presented. In Sect. 5.3, the SIANN architecture-based
category technique is developed. In Sect. 5.4, the model network, critic network and
action network are introduced. In Sect. 5.5, two examples are given to demonstrate
the effectiveness of the proposed optimal control scheme. In Sect. 5.6, the conclusion
is drawn.

5.2 Problem Statement

For most complex industrial systems, it is difficult, time consuming, and expensive to
identify an accurate mathematics model. Therefore, optimal controllers are difficult
to design. Motivated by this problem, a multiple actor-critic structure is proposed

to obtain optimal controllers based on measured input-output data. This allows the
construction of adaptive optimal control systems for industrial processes that are
responsive to changing plant conditions and performance requirements.
This structure uses fast classification based on stored memories, such as
occurs in the amygdala and orbitofrontal cortex (OFC) in [1, 2]. It also allows for
real-time exploration and learning, such as occurs in the ACC in [1, 2]. This structure
conforms to the ideas of [3], which stress the importance of an integrated utilization
of stored memories and real-time exploration.
The overall multiple actor-critic structure is shown in Fig. 5.1. A SIANN is trained
to classify previously recorded data into different memory categories [24]. In Fig. 5.1,
xi stands for the measured system data, which are also the input of the SIANN. The
trained SIANN is used to classify data recorded in real time into categories, and its
output y = 1, 2, . . . , L, is the category label. Within each category, ADP is used to
establish the optimal control. If the output of SIANN is y = i, then the ith ADP
structure is activated. For each ADP, an RNN is trained to model the dynamics.
Then, the critic network and the action network of that ADP are used to determine
the performance measure function and the optimal control. The actor-critic structure
in the active category is tuned online based on real-time recorded data. The details
are given in the next sections.

5.3 SIANN Architecture-Based Classification

In industrial process control, there are usually different production performance mea-
sures according to different desired process properties. Therefore, the input-output
data should be classified into different categories based on various features inherent
in the data.

Fig. 5.1 The structure diagram of the algorithm

It has been shown in [1] that shunting inhibition is important to explain
the manner in which the amygdala and OFC classify data based on environmental
cues. In this section, the structure and updating method are introduced for a SIANN
[21] that is used to classify the measured data into different control response cat-
egories depending on match between certain attributes of the data and previously
stored experiences. In this chapter, SIANN consists of one input layer, one hidden
layer and one output layer. In the hidden units, the shunting inhibition is the basic
synaptic interaction. Generalized shunting neurons are used in the hidden layer of
the SIANN. The structure of each neuron is shown in Fig. 5.2 and given as follows
[21].
The output of each neuron is expressed as

s_j = (g_h(w_j^T x + w_{j0}) + b_j) / (a_j + f_h(c_j^T x + c_{j0})),          (5.1)

where x = [x_1, x_2, ..., x_m]^T, w_j = [w_{j1}, w_{j2}, ..., w_{jm}]^T and c_j = [c_{j1}, c_{j2}, ..., c_{jm}]^T. Here s_j is the output of the jth hidden-layer neuron and x_i is the ith input to the network. The SIANN parameters include the passive decay rate a_j > 0 of the jth neuron, the bias b_j, and the weights w_{ji} and c_{ji} from the ith input to the jth neuron; w_{j0} and c_{j0} are constants. The nonlinear activation functions are f_h(·) and g_h(·), with f_h(·) selected to be positive. Note that the denominator of (5.1) can never be zero due to the definitions of the variables and functions.
The output layer of the SIANN is given by [22]


y = Σ_{j=1}^{n} v_j s_j + d = v^T s + d,          (5.2)

where s = [s_1, s_2, ..., s_n]^T and v = [v_1, v_2, ..., v_n]^T, in which v_j is the weight from the jth hidden neuron to the output neuron, and d is the output bias.
Fig. 5.2 The structure of the generalized shunting neuron

Remark 5.1 In (5.1), the denominator is termed a shunting inhibition. This is because
f_h(·) is nonnegative, so that the combined effects of neighboring neuron inputs x_i
in the denominator can only decrease the neuron output s_j. The numerator of (5.1) is
a standard feedforward neural network structure. It is shown in [21] that the neuron
structure (5.1) gives more flexibility in classification, using a smaller number of
hidden-layer neurons than standard feedforward neural networks.
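To make the classification network concrete, the forward pass defined by (5.1) and (5.2) can be sketched as follows. This is a minimal illustration, not the authors' implementation; taking g_h as tanh and f_h as a sigmoid is an assumption (sigmoid functions are the choice used later in Sect. 5.5).

```python
import numpy as np

def siann_forward(x, W, w0, b, a, C, c0, v, d):
    """SIANN forward pass: shunting hidden layer (5.1), linear output (5.2).

    x: input (m,); W, C: hidden weights (n, m); w0, c0, b: biases (n,);
    a: passive decay rates (n,), each a_j > 0; v: output weights (n,); d: bias.
    """
    g = np.tanh                                    # g_h, numerator activation
    f = lambda z: 1.0 / (1.0 + np.exp(-z))         # f_h, positive so a_j + f_h > 0
    s = (g(W @ x + w0) + b) / (a + f(C @ x + c0))  # shunting neurons, Eq. (5.1)
    return v @ s + d                               # linear output neuron, Eq. (5.2)

# A trained SIANN would threshold this output to produce the category label y.
rng = np.random.default_rng(0)
m, n = 2, 10
y = siann_forward(rng.standard_normal(m),
                  rng.standard_normal((n, m)), np.zeros(n),
                  rng.uniform(-0.1, 0.1, n), np.ones(n),
                  rng.standard_normal((n, m)), np.zeros(n),
                  rng.standard_normal(n), 1.0)
print(np.isfinite(y))
```

Thresholding the scalar output then yields the category label used to select an ADP structure.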
It is necessary to train the SIANN to properly classify data in real-time into
prescribed categories. This is accomplished using previously recorded data. The
SIANN weights for proper classification are determined by supervised learning as
follows.
The gradient descent method is used to train the parameters. Given the desired
output category yd , define the error e = y − yd and

E = (1/2) e^2.          (5.3)
According to the gradient descent algorithm, the updates for the parameters are given
by backpropagation as follows [25]:

v̇ = -γ_v (∂E/∂e)(∂y/∂v) = -γ_v e s,          (5.4)

ḋ = -γ_d e,          (5.5)

ẇ_j = -γ_w (∂E/∂e)(∂y/∂s_j)(∂s_j/∂w_j) = -γ_w e v_j g'_h(w_j^T x + w_{j0}) x^T / [a_j + f_h(c_j^T x + c_{j0})],          (5.6)

ḃ_j = -γ_b (∂E/∂e)(∂y/∂s_j)(∂s_j/∂b_j) = -γ_b e v_j / [a_j + f_h(c_j^T x + c_{j0})],          (5.7)

ȧ_j = -γ_a (∂E/∂e)(∂y/∂s_j)(∂s_j/∂a_j) = -γ_a e v_j [-(g_h(w_j^T x + w_{j0}) + b_j)] / [a_j + f_h(c_j^T x + c_{j0})]^2,          (5.8)

ċ_j = -γ_c (∂E/∂e)(∂y/∂s_j)(∂s_j/∂c_j) = -γ_c e v_j [-(g_h(w_j^T x + w_{j0}) + b_j)] f'_h(c_j^T x + c_{j0}) x^T / [a_j + f_h(c_j^T x + c_{j0})]^2,          (5.9)

where γ_v, γ_d, γ_w, γ_b, γ_a and γ_c are positive learning rates.


Using these parameter update equations, the SIANN is trained, using previously
recorded data, to properly classify the input-output data into the appropriate cate-
gories. Then, based on the trained SIANN classifier, one can decide the category label
for the input-output data that are measured in real time. Then, the optimal control
for each category is found in real-time using the category ADP structures as in the
subsequent sections.
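Using these equations, a discrete-time training step can be sketched as follows. This is an illustrative fragment, not the authors' code: all learning rates γ are collapsed into a single lr, and g_h = tanh, f_h = sigmoid are assumed choices.

```python
import numpy as np

def siann_train_step(x, yd, p, lr=0.05):
    """One gradient-descent update of all SIANN parameters, following
    Eqs. (5.4)-(5.9) in discrete time. `p` is a dict with keys
    W, w0, C, c0, a, b, v, d."""
    g, dg = np.tanh, lambda z: 1.0 - np.tanh(z) ** 2
    f = lambda z: 1.0 / (1.0 + np.exp(-z))
    df = lambda z: f(z) * (1.0 - f(z))

    zw, zc = p["W"] @ x + p["w0"], p["C"] @ x + p["c0"]
    num, den = g(zw) + p["b"], p["a"] + f(zc)
    s = num / den                              # hidden outputs, Eq. (5.1)
    e = p["v"] @ s + p["d"] - yd               # output error e = y - yd

    common = e * p["v"]                        # shared factor e * v_j in (5.6)-(5.9)
    p["v"] -= lr * e * s                                   # Eq. (5.4)
    p["d"] -= lr * e                                       # Eq. (5.5)
    p["W"] -= lr * (common * dg(zw) / den)[:, None] * x    # Eq. (5.6)
    p["b"] -= lr * common / den                            # Eq. (5.7)
    p["a"] -= lr * common * (-num) / den ** 2              # Eq. (5.8)
    p["C"] -= lr * (common * (-num) / den ** 2 * df(zc))[:, None] * x  # Eq. (5.9)
    return 0.5 * e ** 2                        # squared error, Eq. (5.3)

rng = np.random.default_rng(1)
n, m = 10, 2
p = {"W": rng.uniform(-1, 1, (n, m)), "w0": np.zeros(n),
     "C": rng.uniform(-1, 1, (n, m)), "c0": np.zeros(n),
     "a": np.full(n, 1.0), "b": rng.uniform(-0.1, 0.1, n),
     "v": rng.uniform(-1, 1, n), "d": 1.0}
x, yd = np.array([0.3, -0.2]), 1.0
errs = [siann_train_step(x, yd, p) for _ in range(200)]
print(errs[-1] < errs[0])
```

Iterating this step over the recorded input-output data drives the error E of (5.3) down, as the decreasing error sequence in the script illustrates.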

5.4 Optimal Control Based on ADP

In the previous section a SIANN is trained using previously recorded data to classify
the input-output data in real-time into different categories based on data attributes
and features. Based on this SIANN classification, an ADP structure in each category
is used to determine, based on data measured in real time, the optimal control for the
data attributes and performance measure in that category. In this section, a novel ADP
structure is designed that captures the required system identification and optimality
factors in process control.
This structure uses fast classification based on stored memories, such as occurs
in the amygdala and OFC in [7, 8], and also allows for real-time exploration and
learning, such as occurs in the ACC in [7, 8]. This structure provides an integrated
utilization of stored memories and real-time exploration [9].
The ADP structure is used in each category as shown in Fig. 5.1. The ADP structure
has the standard three networks for model, critic, and actor. In this section, first,
the optimization problem is introduced. Then a recursive neural network (RNN) is
trained to identify the system dynamics. Novel neural network (NN) parameter tuning
algorithms for the model RNN are given and the state estimation error is proven to
converge with time. After this model identification phase, the ADP critic network
and action network are designed to use data recorded in real time to compute online
the performance measure value function and the optimal control action, respectively.
A theorem is given to prove that the closed-loop system is UUB.
Based on the previous section, suppose the input-output data is classified into cate-
gory y = i. Then we can train the ith ADP using that input-output data. Throughout
this section, it is understood that the design refers to the ADP structure in each
category i. To conserve notation, we omit the subscript i throughout.
The performance measure function for the ith ADP is given as
J(x) = ∫_t^∞ r(x(τ), u(τ)) dτ,          (5.10)

where x and u are the state and control inputs of the ith ADP, r (x, u) = Q(x) +
u T Ru, where R is a positive definite matrix. It is assumed that there exists a scalar
P > 0, s.t. Q(x) > P x T x.
The infinitesimal version of (5.10) is the Bellman equation

0 = JxT ẋ + r, (5.11)

where J_x = ∂J/∂x. The Hamiltonian function is given as

H (x, u, Jx ) = JxT ẋ + r. (5.12)

The optimal performance measure function is defined as


J*(x) = min_u ∫_t^∞ r(x(τ), u(τ)) dτ,          (5.13)

and the optimal control can be obtained by

u = arg min_u ∫_t^∞ r(x(τ), u(τ)) dτ.          (5.14)

In the following subsections, the detailed derivations for the model neural network,
the critic network and the action network are given.

5.4.1 Model Neural Network

The previous section showed how to use SIANN to classify the data into one of
i categories. Please see Fig. 5.1. In each category i, a RNN is used to identify the
system dynamics for the ith ADP structure. The dynamics in category i is taken to
be modeled by the RNN given as [26]

ẋ = AT1 x + AT2 f (x) + AT3 u + AT4 + ε + τd . (5.15)

This is a very general model of a nonlinear system that can closely fit many dynamical
models. Here, A1, A2, A3 and A4 are the unknown ideal weight matrices for the ith ADP,
which are assumed to satisfy ||Ai||_F ≤ Ai^B, where Ai^B > 0 is constant, i = 1, 2, 3, 4.
ε is the bounded approximation error, and is assumed to satisfy ε^T ε ≤ β_A e_x^T e_x, where
β_A is a bounded constant and the state estimation error is e_x = x − x̂.
The activation function f(x) is monotonically increasing, and it is taken
to satisfy 0 ≤ ||f(x_1) − f(x_2)|| ≤ k||x_1 − x_2||, where x_1 > x_2 and k > 0. τ_d is a bounded
disturbance, i.e., ||τ_d|| ≤ d_B, where d_B is a constant.
The approximate dynamics model is constructed as

x̂˙ = ÂT1 x̂ + ÂT2 f (x̂) + ÂT3 u + ÂT4 + A5 ex , (5.16)

where Â1, Â2, Â3 and Â4 are the estimated values of the ideal unknown weights, and
A5 is a square design matrix that satisfies

A5 − A1^T − (1/2) A2^T A2 − (k^2/2 + 1/2 + β_A/2) I > 0.

Define the parameter identification errors as Ã1 = A1 − Â1 , Ã2 = A2 − Â2 , Ã3 =
A3 − Â3 , Ã4 = A4 − Â4 , and define f˜(ex ) = f (x) − f (x̂).
Due to disturbances and modeling errors, the model (5.16) cannot exactly recon-
struct the unknown dynamics (5.15). It is desired to tune the parameters in (5.16)
such that the parameter estimation errors and state estimation error are uniformly
ultimately bounded.

Definition 5.1 Uniformly Ultimately Bounded (UUB) [27–29]: The equilibrium
point x_e is said to be UUB if there exists a compact set U ⊂ R^n such that, for all
x(t_0) = x_0 ∈ U, there exist a δ > 0 and a number T(δ, x_0) such that ||x(t) − x_e|| < δ
for all t ≥ t_0 + T.
The next theorem extends the result in [26] by providing robust tuning methods
that render the Lyapunov derivative negative outside a bounded region. These robustified tuning
methods are required due to the disturbances in (5.15).
Theorem 5.1 Consider the approximate system model (5.16), and take the update
methods for the tunable parameters as

dÂ_1/dt = β_1 x̂ e_x^T − β_1 ||e_x|| Â_1,
dÂ_2/dt = β_2 f(x̂) e_x^T − β_2 ||e_x|| Â_2,
dÂ_3/dt = β_3 u e_x^T − β_3 ||e_x|| Â_3,
dÂ_4/dt = β_4 e_x^T − β_4 ||e_x|| Â_4,

where β_i > 0, i = 1, 2, 3, 4. If the initial values of the state estimation error e_x(0)
and the parameter identification errors Ã_i(0) are bounded, then the state estimation
error e_x and the parameter identification errors Ã_i are UUB.
Proof Let the initial values of the state estimation error ex (0), and the parameter
identification errors Ãi (0) be bounded, then the NN approximation property (5.15)
holds for the state x [29].
Define the Lyapunov function candidate:

V = V_1 + V_2,          (5.17)

where V_1 = (1/2) e_x^T e_x and V_2 = V_{A1} + V_{A2} + V_{A3} + V_{A4}, in which
V_{A1} = (1/(2β_1)) tr{Ã_1^T Ã_1}, V_{A2} = (1/(2β_2)) tr{Ã_2^T Ã_2}, V_{A3} = (1/(2β_3)) tr{Ã_3^T Ã_3}, V_{A4} = (1/(2β_4)) Ã_4 Ã_4^T.
Since the equations

AT1 x − ÂT1 x̂ = (AT1 x − AT1 x̂) + (AT1 x̂ − ÂT1 x̂) = AT1 ex + ÃT1 x̂ (5.18)

and

AT2 f (x) − ÂT2 f (x̂) = (AT2 f (x) − AT2 f (x̂)) + (AT2 f (x̂) − ÂT2 f (x̂))
= AT2 f˜(ex ) + ÃT2 f (x̂) (5.19)

hold, one has

ėx = ẋ − x̂˙
= AT1 ex + ÃT1 x̂ + AT2 f˜(ex ) + ÃT2 f (x̂) + ÃT3 u + ÃT4 + ε − A5 ex + τd . (5.20)

Therefore,

V̇1 = exT (AT1 ex + ÃT1 x̂ + AT2 f˜(ex ) + ÃT2 f (x̂) + ÃT3 u + ÃT4 + ε − A5 ex + τd ).
(5.21)

Since

e_x^T A_2^T f̃(e_x) ≤ (1/2) e_x^T A_2^T A_2 e_x + (1/2) k^2 e_x^T e_x          (5.22)

and

e_x^T ε ≤ (1/2) e_x^T e_x + (1/2) ε^T ε ≤ (1/2 + β_A/2) e_x^T e_x,          (5.23)

one can get

V̇_1 ≤ e_x^T A_1^T e_x + e_x^T Ã_1^T x̂ + e_x^T Ã_2^T f(x̂) + (1/2) e_x^T A_2^T A_2 e_x
   + (k^2/2 + 1/2 + β_A/2) e_x^T e_x + e_x^T Ã_3^T u + e_x^T Ã_4^T − e_x^T A_5 e_x + e_x^T τ_d.          (5.24)

On the other hand, according to [29], one has

V̇_{A1} = −tr{Ã_1^T x̂ e_x^T} + ||e_x|| tr{Ã_1^T Â_1}
      ≤ −e_x^T Ã_1^T x̂ + ||e_x|| ||Ã_1||_F A_1^B − ||e_x|| ||Ã_1||_F^2,          (5.25)

V̇_{A2} = −tr{Ã_2^T f(x̂) e_x^T} + ||e_x|| tr{Ã_2^T Â_2}
      ≤ −e_x^T Ã_2^T f(x̂) + ||e_x|| ||Ã_2||_F A_2^B − ||e_x|| ||Ã_2||_F^2,          (5.26)

V̇_{A3} = −tr{Ã_3^T u e_x^T} + ||e_x|| tr{Ã_3^T Â_3}
      ≤ −e_x^T Ã_3^T u + ||e_x|| ||Ã_3||_F A_3^B − ||e_x|| ||Ã_3||_F^2,          (5.27)

V̇_{A4} = −tr{e_x^T Ã_4^T} + ||e_x|| tr{Ã_4^T Â_4}
      ≤ −e_x^T Ã_4^T + ||e_x|| ||Ã_4||_F A_4^B − ||e_x|| ||Ã_4||_F^2.          (5.28)

Therefore,

V̇_2 ≤ −e_x^T Ã_1^T x̂ − e_x^T Ã_2^T f(x̂) − e_x^T Ã_3^T u − e_x^T Ã_4^T
   + ||e_x|| ||Ã_1||_F A_1^B − ||e_x|| ||Ã_1||_F^2 + ||e_x|| ||Ã_2||_F A_2^B − ||e_x|| ||Ã_2||_F^2
   + ||e_x|| ||Ã_3||_F A_3^B − ||e_x|| ||Ã_3||_F^2 + ||e_x|| ||Ã_4||_F A_4^B − ||e_x|| ||Ã_4||_F^2.          (5.29)

From (5.24) and (5.29), one can obtain that

V̇ ≤ e_x^T [A_1^T + (1/2) A_2^T A_2 + (k^2/2 + 1/2 + β_A/2) I − A_5] e_x
   + ||e_x|| ||Ã_1||_F A_1^B − ||e_x|| ||Ã_1||_F^2 + ||e_x|| ||Ã_2||_F A_2^B − ||e_x|| ||Ã_2||_F^2
   + ||e_x|| ||Ã_3||_F A_3^B − ||e_x|| ||Ã_3||_F^2 + ||e_x|| ||Ã_4||_F A_4^B − ||e_x|| ||Ã_4||_F^2 + d_B ||e_x||.          (5.30)

 
Let K_V = A_5 − A_1^T − (1/2) A_2^T A_2 − (k^2/2 + 1/2 + β_A/2) I, and then

V̇ ≤ −||e_x|| [λ_min(K_V) ||e_x|| + ||Ã_1||_F(||Ã_1||_F − A_1^B) + ||Ã_2||_F(||Ã_2||_F − A_2^B)
   + ||Ã_3||_F(||Ã_3||_F − A_3^B) + ||Ã_4||_F(||Ã_4||_F − A_4^B) − d_B]
  = −||e_x|| [λ_min(K_V) ||e_x|| + (||Ã_1||_F − A_1^B/2)^2 − (A_1^B/2)^2
   + (||Ã_2||_F − A_2^B/2)^2 − (A_2^B/2)^2 + (||Ã_3||_F − A_3^B/2)^2 − (A_3^B/2)^2
   + (||Ã_4||_F − A_4^B/2)^2 − (A_4^B/2)^2 − d_B],          (5.31)

which is guaranteed negative as long as

||e_x|| > [ (A_1^B/2)^2 + (A_2^B/2)^2 + (A_3^B/2)^2 + (A_4^B/2)^2 + d_B ] / λ_min(K_V) ≡ B_e,          (5.32)

or

||Ã_1||_F > A_1^B/2 + sqrt((A_1^B/2)^2 + d_B),          (5.33)

||Ã_2||_F > A_2^B/2 + sqrt((A_2^B/2)^2 + d_B),          (5.34)

||Ã_3||_F > A_3^B/2 + sqrt((A_3^B/2)^2 + d_B),          (5.35)

||Ã_4||_F > A_4^B/2 + sqrt((A_4^B/2)^2 + d_B).          (5.36)

Thus, V̇ (t) is negative outside a compact set. This demonstrates the UUB of ||ex || and
|| Ã1 || F , || Ã2 || F , || Ã3 || F , || Ã4 || F .

Remark 5.2 In fact, many NN activation functions are bounded and have bounded
derivatives. In Theorem 5.1, the unknown system is bounded, and the NN param-
eters and output are bounded. Therefore, the initial state estimation error ex and
the initial parameter identification errors Ãi are bounded. The approximation error
boundedness property is established in [25].

From Theorem 5.1, it can be seen that as t → ∞, the parameter estimates Âi
converge to bounded regions, such that the state estimation error ex is bounded. Let
the steady state value of Âi be denoted as Bi . Then after convergence of the model
parameters, (5.16) can be rewritten as

ẋ = B1T x + B2T f (x) + B3T u + B4T . (5.37)

In the following, the optimal control for the ith ADP, based on the well-trained RNN
(5.37), will be designed.
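The identification scheme can be sketched in discrete time as follows. This is a forward-Euler illustration under assumed dimensions and gains; the "true" system used to generate x here is hypothetical and noise-free.

```python
import numpy as np

def identifier_step(x, xhat, u, Ah, A5, beta, dt, f=np.tanh):
    """One Euler step of the model RNN (5.16) with the robust tuning
    laws of Theorem 5.1. Ah = [A1h, A2h, A3h, A4h] are the estimates,
    updated in place; a single gain `beta` replaces beta_1..beta_4."""
    A1h, A2h, A3h, A4h = Ah
    ex = x - xhat                                    # state estimation error
    xhat_dot = A1h.T @ xhat + A2h.T @ f(xhat) + A3h.T @ u + A4h + A5 @ ex
    n = np.linalg.norm(ex)
    A1h += dt * beta * (np.outer(xhat, ex) - n * A1h)      # dA1h/dt
    A2h += dt * beta * (np.outer(f(xhat), ex) - n * A2h)   # dA2h/dt
    A3h += dt * beta * (np.outer(u, ex) - n * A3h)         # dA3h/dt
    A4h += dt * beta * (ex - n * A4h)                      # dA4h/dt
    return xhat + dt * xhat_dot

# Hypothetical true dynamics: xdot = A1.T x + A2.T tanh(x) + A3.T u
A1, A2 = -0.8 * np.eye(2), 0.1 * np.eye(2)
A3 = np.array([[0.5, 0.0]])
x, xhat = np.array([1.0, -1.0]), np.zeros(2)
Ah = [np.zeros((2, 2)), np.zeros((2, 2)), np.zeros((1, 2)), np.zeros(2)]
A5, dt = 5.0 * np.eye(2), 0.005
for k in range(4000):
    u = np.array([np.sin(0.01 * k)])
    xhat = identifier_step(x, xhat, u, Ah, A5, beta=0.5, dt=dt)
    x = x + dt * (A1.T @ x + A2.T @ np.tanh(x) + A3.T @ u)
print(np.linalg.norm(x - xhat))
```

The printed error norm shrinks well below its initial value, consistent with the UUB conclusion of Theorem 5.1.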

5.4.2 Critic Network and Action Network

Based on the trained RNN model (5.37), the critic network expression in each cate-
gory i is

J = WcT ϕc + εc , (5.38)

where Wc is the ideal critic network weight matrix, ϕc is the activation function and
εc is the approximation error. We assume that ||∇ϕc || ≤ ϕcd M .
For the actual neural networks, let the estimate of Wc be Ŵc , then the actual output
of the critic network in each category i is

Jˆ = ŴcT ϕc . (5.39)

Define the weight estimation error of the critic network as

W̃c = Wc − Ŵc . (5.40)

Substitute this into the Bellman equation (5.11) to obtain the equation error

ec = ŴcT ∇ϕc ẋ + r = −W̃cT ∇ϕc ẋ + ε H , (5.41)

where

ε H = WcT ∇ϕc ẋ + r. (5.42)

Let

E_c = (1/2) e_c^T e_c.          (5.43)

Then, the weight update law for Ŵ_c is defined as

dŴ_c/dt = −α_c ∂E_c/∂Ŵ_c = −α_c ξ_1(ξ_1^T Ŵ_c + r) / (ξ_1^T ξ_1 + 1)^2,          (5.44)

where α_c > 0 is the learning rate of the critic network and ξ_1 = ∇ϕ_c ẋ.
Define ξ_3 = ξ_1^T ξ_1 + 1 and ξ_2 = ξ_1/ξ_3, with ||ξ_2|| ≥ ξ_{2m}. Then the critic weight error dynamics are

dW̃_c/dt = −dŴ_c/dt = α_c ξ_1(ξ_1^T Ŵ_c + r)/ξ_3^2 = −α_c ξ_2 ξ_2^T W̃_c + α_c ξ_2 ε_H/ξ_3.          (5.45)

The action network is used to obtain the control policy u in each category i and
is given by

u = WaT ϕa + εa , (5.46)

where W_a is the ideal weight matrix of the action network and ϕ_a is the activation
function. Under the persistent excitation condition, ϕ_{aM} ≥ ||ϕ_a|| ≥ ϕ_{am}.
ε_a is the action network approximation error. The actual output of the action network
is

û = ŴaT ϕa , (5.47)

where Ŵa is the actual weight of the action network.


From (5.37) and the stationarity condition ∂H/∂u = 0, the desired feedback control input is

u = −(1/2) R^{-1} B_3 ∇ϕ_c^T Ŵ_c.          (5.48)
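The closed form in (5.48) follows from the stationarity of the Hamiltonian; spelling out the standard step under the identified dynamics (5.37):

```latex
0 = \frac{\partial H}{\partial u}
  = \left(\frac{\partial \dot{x}}{\partial u}\right)^{T} \hat{J}_x + 2Ru
  = B_3 \nabla\varphi_c^{T} \hat{W}_c + 2Ru
\quad\Longrightarrow\quad
u = -\frac{1}{2} R^{-1} B_3 \nabla\varphi_c^{T} \hat{W}_c .
```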
The error between the actual output of the action network and the desired feedback
control input is expressed as follows:

e_a = Ŵ_a^T ϕ_a + (1/2) R^{-1} B_3 ∇ϕ_c^T Ŵ_c.          (5.49)

Define the objective function as follows:

E_a = (1/2) e_a^T e_a.          (5.50)

Then the weight update law for the action network weight is a gradient descent
algorithm, which is given by

dŴ_a/dt = −α_a ϕ_a (Ŵ_a^T ϕ_a + (1/2) R^{-1} B_3 ∇ϕ_c^T Ŵ_c)^T,          (5.51)

where αa is the learning rate of the action network.
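In discrete time, the critic update (5.44) together with the actor update (5.51) can be sketched as follows. This is an illustrative fragment: the quadratic critic features, linear actor features, and the values of B_3 and R are hypothetical placeholders.

```python
import numpy as np

def critic_actor_step(x, x_dot, r, Wc, Wa, B3, R_inv, dt=0.01, ac=1.0, aa=1.0):
    """One Euler step of the critic update (5.44) and actor update (5.51)."""
    grad_c = np.array([[2 * x[0], 0.0],      # ∇ϕc for ϕc = [x1^2, x1*x2, x2^2]
                       [x[1], x[0]],
                       [0.0, 2 * x[1]]])
    xi1 = grad_c @ x_dot                      # ξ1 = ∇ϕc ẋ
    ec = xi1 @ Wc + r                         # Bellman residual, Eq. (5.41)
    Wc = Wc - dt * ac * xi1 * ec / (xi1 @ xi1 + 1.0) ** 2   # Eq. (5.44)
    phi_a = x                                 # actor features (assumed linear)
    u_des = -0.5 * R_inv @ (B3 @ (grad_c.T @ Wc))           # Eq. (5.48)
    ea = Wa.T @ phi_a - u_des                 # actor error, Eq. (5.49)
    Wa = Wa - dt * aa * np.outer(phi_a, ea)   # Eq. (5.51)
    return Wc, Wa, ec

# Repeatedly applying the step to one recorded sample drives the
# Bellman residual toward zero.
Wc, Wa = np.zeros(3), np.zeros((2, 1))
B3, R_inv = np.array([[1.0, 0.0]]), np.array([[1.0]])
x, x_dot, r = np.array([1.0, 1.0]), np.array([-1.0, -0.5]), 1.0
for _ in range(500):
    Wc, Wa, ec = critic_actor_step(x, x_dot, r, Wc, Wa, B3, R_inv)
print(abs(float(ec)))
```

In the full scheme the sample (x, ẋ, r) changes at every step as new data are recorded in real time.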


Define the weight estimation error of the action network as

W̃a = Wa − Ŵa . (5.52)

Then the dynamics of W̃_a are

dW̃_a/dt = −dŴ_a/dt = α_a ϕ_a ((W_a − W̃_a)^T ϕ_a + (1/2) R^{-1} B_3 ∇ϕ_c^T (W_c − W̃_c))^T
        = α_a ϕ_a (−W̃_a^T ϕ_a − (1/2) R^{-1} B_3 ∇ϕ_c^T W̃_c + W_a^T ϕ_a + (1/2) R^{-1} B_3 ∇ϕ_c^T W_c)^T.          (5.53)

The RNN model (5.37) with controller (5.47) is

ẋ = B_1^T x + B_2^T f(x) + B_3^T û + B_4^T
  = B_1^T x + B_2^T f(x) + B_3^T W_a^T ϕ_a − B_3^T W̃_a^T ϕ_a + B_4^T
  = B_1^T x + B_2^T f(x) + B_3^T u − B_3^T ε_a − B_3^T W̃_a^T ϕ_a + B_4^T.          (5.54)

The next theorem proves the asymptotic convergence of the optimal control scheme.
The result extends the results in [30].
Theorem 5.2 Let the optimal control input for (5.37) be provided by (5.47), and
the weight updating laws of the critic network and the action network be given as
in (5.44) and (5.51), respectively. Suppose there exist positive scalars l1 , l2 and l3
satisfying

l_1 < l_2 ϕ_{am}^2 / (4 ||B_3||^2 ϕ_{aM}^2),          (5.55)

l_2 < 2 ξ_{2m}^2 / (||R^{-1} B_3||^2 ϕ_{cdM}^2),          (5.56)

and

l_3 > max{ ||B_3||^2 l_1 / λ_min(R), (2l_1 ||B_1|| + l_1 ||B_2||^2 + l_1 k^2 + 4l_1) / P }.          (5.57)

If the initial values of the state x(0), the control u(0) and the weight estimation
errors W̃_c(0) and W̃_a(0) are bounded, then the state x in (5.54), the control input u, and the
weight estimation errors W̃_c and W̃_a are UUB.

Proof Choose the Lyapunov function candidate as follows:

Γ = Γ_1 + Γ_2 + Γ_3,          (5.58)

where Γ_1 = (1/(2α_c)) W̃_c^T W̃_c, Γ_2 = (l_2/(2α_a)) tr{W̃_a^T W̃_a} with l_2 > 0, and Γ_3 = l_1 x^T x + l_3 J with l_1 > 0, l_3 > 0.
Then the time derivative of the Lyapunov function candidate (5.58) along the
trajectories of the closed-loop systems (5.54) is computed as Γ˙ = Γ˙1 + Γ˙2 + Γ˙3 .
According to (5.45), it can be obtained that

Γ̇_1 = W̃_c^T (−ξ_2 ξ_2^T W̃_c + ξ_2 ε_H/ξ_3)
    ≤ −||ξ_2||^2 ||W̃_c||^2 + (1/2) ||ξ_2||^2 ||W̃_c||^2 + (1/2) ||ε_H/ξ_3||^2
    ≤ −(1/2) ξ_{2m}^2 ||W̃_c||^2 + (1/2) ||ε_H/ξ_3||^2.          (5.59)

Define ε_{12} = W_a^T ϕ_a + (1/2) R^{-1} B_3 ∇ϕ_c^T W_c and assume ||ε_{12}|| ≤ ε_{12M}. Then, based on
(5.53), one has

Γ̇_2 = −l_2 tr{ W̃_a^T ϕ_a (W̃_a^T ϕ_a + (1/2) R^{-1} B_3 ∇ϕ_c^T W̃_c − ε_{12})^T }
    ≤ −l_2 ||ϕ_a||^2 ||W̃_a||^2 + (l_2/2) ||ϕ_a||^2 ||W̃_a||^2 + (l_2/2) ε_{12M}^2
      + (l_2/4) ||ϕ_a||^2 ||W̃_a||^2 + (l_2/4) ||R^{-1} B_3 ∇ϕ_c^T||^2 ||W̃_c||^2
    ≤ −(l_2/4) ϕ_{am}^2 ||W̃_a||^2 + (l_2/4) ||R^{-1} B_3||^2 ϕ_{cdM}^2 ||W̃_c||^2 + (l_2/2) ε_{12M}^2.          (5.60)
The time derivative of Γ3 is calculated as follows:

Γ˙3 = 2l1 x T (B1T x + B2T f (x) + B3T u − B3T εa − B3T W̃aT ϕa + B4T ) + l3 (−r (x, u)).
(5.61)

The following inequalities hold:

2x^T B_2^T f ≤ (||B_2||^2 + k^2) ||x||^2,          (5.62)

−2x^T B_3^T W̃_a^T ϕ_a ≤ ||x||^2 + ||B_3||^2 ϕ_{aM}^2 ||W̃_a||^2,          (5.63)

2x^T B_3^T u ≤ ||x||^2 + ||B_3||^2 ||u||^2,          (5.64)

−2x^T B_3^T ε_a ≤ ||x||^2 + ||B_3||^2 ε_{aM}^2.          (5.65)

Then (5.61) can be rewritten as

Γ˙3 ≤(2l1 ||B1 || + l1 ||B2 ||2 + l1 k 2 + 4l1 − l3 P)||x||2 + (||B3 ||2 l1 − l3 λmin (R))||u||2
+ l1 ||B3 ||2 ϕa2M ||W̃a ||2 + l1 ||B3 ||2 εa2 M + l1 ||B4 ||2 . (5.66)

Let ε_Γ = (l_2/2) ε_{12M}^2 + (1/2) ||ε_H/ξ_3||^2 + l_1 ||B_3||^2 ε_{aM}^2 + l_1 ||B_4||^2. Then one has
2 2 ξ3

Γ̇ ≤ −(l_3 P − 2l_1 ||B_1|| − l_1 ||B_2||^2 − l_1 k^2 − 4l_1) ||x||^2 − (l_3 λ_min(R) − ||B_3||^2 l_1) ||u||^2
   − ((1/2) ξ_{2m}^2 − (l_2/4) ||R^{-1} B_3||^2 ϕ_{cdM}^2) ||W̃_c||^2
   − ((l_2/4) ϕ_{am}^2 − l_1 ||B_3||^2 ϕ_{aM}^2) ||W̃_a||^2 + ε_Γ.          (5.67)

If l_1, l_2 and l_3 are selected to satisfy (5.55)–(5.57), and

||x|| > sqrt( ε_Γ / (l_3 P − 2l_1 ||B_1|| − l_1 ||B_2||^2 − l_1 k^2 − 4l_1) )          (5.68)

or

||u|| > sqrt( ε_Γ / (l_3 λ_min(R) − ||B_3||^2 l_1) )          (5.69)

or

||W̃_c|| > sqrt( ε_Γ / ((1/2) ξ_{2m}^2 − (l_2/4) ||R^{-1} B_3||^2 ϕ_{cdM}^2) )          (5.70)

or

||W̃_a|| > sqrt( ε_Γ / ((l_2/4) ϕ_{am}^2 − l_1 ||B_3||^2 ϕ_{aM}^2) )          (5.71)

holds, then Γ̇ < 0. Therefore, according to the standard Lyapunov extension, x, u, W̃_c and
W̃_a are UUB.

Theorem 5.3 Suppose the hypotheses of Theorem 5.2 hold. If ||ϕ_a|| ≤ ϕ_{aM}, ||∇ϕ_c|| ≤ ϕ_{cdM}, ||W_c|| ≤ W_{cM}, ||W_a|| ≤ W_{aM} and ||ε_a|| ≤ ε_{aM}, then
(1) H(x, Ŵ_a, Ŵ_c) is UUB, where H(x, Ŵ_a, Ŵ_c) = Ĵ_x^T (B_1^T x + B_2^T f(x) + B_3^T û + B_4^T) + Q(x) + û^T R û;
(2) û is close to u within a small bound.

Proof (1) According to (5.39) and (5.47), one has

H(x, Ŵ_a, Ŵ_c) = ∇ϕ_c^T (W_c − W̃_c) (B_1^T x + B_2^T f(x) + B_4^T)
              + ∇ϕ_c^T (W_c − W̃_c) (B_3^T W_a^T ϕ_a − B_3^T W̃_a^T ϕ_a)
              + Q(x) + ϕ_a^T (W_a − W̃_a) R (W_a^T − W̃_a^T) ϕ_a.          (5.72)

Equation (5.72) can be further written as

H(x, Ŵ_a, Ŵ_c) = ∇ϕ_c^T W_c (B_1^T x + B_2^T f(x) + B_4^T)
              − ∇ϕ_c^T W̃_c (B_1^T x + B_2^T f(x) + B_4^T)
              + ∇ϕ_c^T W_c B_3^T W_a^T ϕ_a − ∇ϕ_c^T W_c B_3^T W̃_a^T ϕ_a
              − ∇ϕ_c^T W̃_c B_3^T W_a^T ϕ_a + ∇ϕ_c^T W̃_c B_3^T W̃_a^T ϕ_a
              + Q(x) + ϕ_a^T W_a R W_a^T ϕ_a − ϕ_a^T W_a R W̃_a^T ϕ_a
              − ϕ_a^T W̃_a R W_a^T ϕ_a + ϕ_a^T W̃_a R W̃_a^T ϕ_a.          (5.73)

For a fixed admissible control policy, H (x, Wa , Wc ) is bounded. Then one can let
||H (x, Wa , Wc )|| ≤ H B . Therefore, (5.73) can be written as
 
H(x, Ŵ_a, Ŵ_c) = −∇ϕ_c^T W̃_c (B_1^T x + B_2^T f(x) + B_4^T) − ∇ϕ_c^T W_c B_3^T W̃_a^T ϕ_a
              − ∇ϕ_c^T W̃_c B_3^T W_a^T ϕ_a + ∇ϕ_c^T W̃_c B_3^T W̃_a^T ϕ_a
              − ϕ_a^T W_a R W̃_a^T ϕ_a − ϕ_a^T W̃_a R W_a^T ϕ_a + ϕ_a^T W̃_a R W̃_a^T ϕ_a + H_B.          (5.74)

Then one has

lim_{t→∞} ||H(x, Ŵ_a, Ŵ_c)|| ≤ ϕ_{cdM} (||B_1|| + ||B_2|| k) ||W̃_c|| ||x||
   + ϕ_{cdM} (||B_4|| + ||B_3|| W_{aM} ϕ_{aM}) ||W̃_c||
   + (ϕ_{cdM} W_{cM} ||B_3|| ϕ_{aM} + 2 ϕ_{aM}^2 W_{aM} ||R||) ||W̃_a||
   + ϕ_{cdM} ||B_3|| ϕ_{aM} ||W̃_c|| ||W̃_a|| + ϕ_{aM}^2 ||R|| ||W̃_a||^2 + H_B.          (5.75)

According to Theorem 5.2, the signals on the right-hand side of (5.75) are UUB,
therefore H (x, û, Jˆx ) is UUB.

(2) Since

û − u = Ŵ_a^T ϕ_a − W_a^T ϕ_a − ε_a = −W̃_a^T ϕ_a − ε_a,          (5.76)

one can get

lim_{t→∞} ||û − u|| ≤ ||W̃_a|| ϕ_{aM} + ε_{aM}.          (5.77)

Equation (5.77) means that û is close to the control input u within a small bound.
Remark 5.3 According to Theorem 5.1 and the properties of NN, we can see that
the initial weight estimation errors W̃c and W̃a are bounded. Therefore, the initial
value H is bounded in Theorem 5.3.
Remark 5.4 It should be mentioned that Theorems 5.1–5.3 are based on the following
fact: the model network (RNN) is trained first, and then the critic and action network
weights are tuned according to the steady-state system model (5.37).
In the next theorem, the simultaneous tuning method for the model, critic and
action networks will be discussed further.
Theorem 5.4 Let the optimal control input for (5.16) be provided by (5.47). The
update methods for the tunable parameters of Âi in (5.16), i = 1, 2, 3, 4 are given
in Theorem 5.1. The weight updating laws of the critic and the action networks are
given as

dŴ_c/dt = −α_c ξ_1(ξ_1^T Ŵ_c + r) / (ξ_1^T ξ_1 + 1)^2,          (5.78)

dŴ_a/dt = −α_a ϕ_a (Ŵ_a^T ϕ_a + (1/2) R^{-1} Â_3 ∇ϕ_c^T Ŵ_c)^T.          (5.79)

If

A_5 > Â_1^T + (1/2) Â_2^T Â_2 + (k^2/2 + 3/2 + β_A/2) I,          (5.80)

and there exist positive scalars l_5, l_6 and l_7 satisfying

l_5 < 2 ξ_{2m}^2 / (||R^{-1} Â_3||^2 ϕ_{cdM}^2),          (5.81)

l_6 < l_5 ϕ_{am}^2 / (4 ||Â_3||^2 ϕ_{aM}^2),          (5.82)

l_7 > max{ Θ_7 / P, l_6 ||Â_3||^2 / λ_min(R) },          (5.83)

where Θ_7 = 2l_6 ||Â_1|| + l_6 ||Â_2||^2 + l_6 ||A_5||^2 + l_6 k^2 + 4l_6, then the weight estima-
tion errors W̃_c and W̃_a are UUB if the model error e_x is UUB.

Proof Choose the Lyapunov function candidate as

Θ = V + Γ_4 + Γ_5 + Γ_6,          (5.84)

where V is given in (5.17), Γ_4 = (1/(2α_c)) W̃_c^T W̃_c, Γ_5 = (l_5/(2α_a)) tr{W̃_a^T W̃_a} with l_5 > 0, and
Γ_6 = l_6 x̂^T x̂ + l_7 J(x̂, u) with l_6 > 0, l_7 > 0.
From Theorem 5.1, one has

V̇ ≤ e_x^T [A_1^T + (1/2) A_2^T A_2 + (k^2/2 + 1/2 + β_A/2) I − A_5] e_x
   + ||e_x|| ||Ã_1||_F A_1^B − ||e_x|| ||Ã_1||_F^2 + ||e_x|| ||Ã_2||_F A_2^B − ||e_x|| ||Ã_2||_F^2
   + ||e_x|| ||Ã_3||_F A_3^B − ||e_x|| ||Ã_3||_F^2 + ||e_x|| ||Ã_4||_F A_4^B − ||e_x|| ||Ã_4||_F^2 + d_B ||e_x||.          (5.85)

From Theorem 5.2,

Γ̇_4 ≤ −(1/2) ξ_{2m}^2 ||W̃_c||^2 + (1/2) ||ε_H/ξ_3||^2,          (5.86)

and

Γ̇_5 ≤ −(l_5/4) ϕ_{am}^2 ||W̃_a||^2 + (l_5/4) ||R^{-1} Â_3||^2 ϕ_{cdM}^2 ||W̃_c||^2 + (l_5/2) ε_{12M}^2.          (5.87)
The approximate dynamics model (5.16) with controller (5.47) is

x̂˙ = ÂT1 x̂ + ÂT2 f (x̂) + ÂT3 u − ÂT3 εa − ÂT3 W̃aT ϕa + ÂT4 + A5 ex . (5.88)

Then

Γ̇_6 ≤ (2l_6 ||Â_1|| + l_6 ||Â_2||^2 + l_6 ||A_5||^2 + l_6 k^2 + 4l_6 − l_7 P) ||x||^2
    + (l_6 ||Â_3||^2 − l_7 λ_min(R)) ||u||^2 + e_x^T e_x
    + l_6 ||Â_3||^2 ϕ_{aM}^2 ||W̃_a||^2 + l_6 ||Â_3||^2 ε_{aM}^2 + l_6 ||Â_4||^2.          (5.89)

Let d_{VΓ} = (1/2) ||ε_H/ξ_3||^2 + (l_5/2) ε_{12M}^2 + l_6 ||Â_3||^2 ε_{aM}^2 + l_6 ||Â_4||^2, and
Θ_A = A_5 − A_1^T − (1/2) A_2^T A_2 − (k^2/2 + 3/2 + β_A/2) I. Then

Θ̇ < −(l_7 λ_min(R) − l_6 ||Â_3||^2) ||u||^2 − (l_7 P − Θ_7) ||x||^2
   − λ_min(Θ_A) ||e_x||^2 − ((1/2) ξ_{2m}^2 − (l_5/4) ||R^{-1} Â_3||^2 ϕ_{cdM}^2) ||W̃_c||^2
   − ((l_5/4) ϕ_{am}^2 − l_6 ||Â_3||^2 ϕ_{aM}^2) ||W̃_a||^2
   − ||e_x|| [ Σ_{i=1}^{4} (||Ã_i||_F^2 − ||Ã_i||_F A_i^B) + d_B ] + d_{VΓ},          (5.90)

which is guaranteed negative as long as

||e_x|| > d_{VΓ} / [ Σ_{i=1}^{4} (||Ã_i||_F^2 − ||Ã_i||_F A_i^B) + d_B ].          (5.91)

Thus, Θ̇ is negative outside a compact set.

Remark 5.5 From Theorem 5.4, it can be seen that once the system model error
enters a bounded region, the weights of the critic and the action networks will be
UUB.

Remark 5.6 In this chapter, a novel multiple actor-critic structure is established to


obtain the optimal control for unknown systems. The algorithm is different from the
methods in [31, 32]. In [31], the optimized adaptive control and trajectory generation
for a class of wheeled inverted pendulum (WIP) models of vehicle systems were
investigated. In [32], automatic motion control was investigated for WIP models,
which have been widely applied in the modeling of a large range of modern two-wheeled
vehicles. The underactuated WIP model was decomposed into two subsystems. One
subsystem consists of planar movement of vehicle forward motion and yaw angular
motions. The other is about the pendulum tilt motion. In this chapter, a SIANN is
trained to classify previously recorded data into different memory categories, and
the optimal control for the data attributes and performance measure in that category
are calculated by multiple actor-critic structure.

5.5 Simulation Study

In this section two simulations are provided to demonstrate the versatility of the
proposed SIANN/Multiple ADP structure.
Example 5.1 We consider a simplified model of a continuously stirred tank reactor
with an exothermic reaction [33]. The model is given by
⎡ẋ_1⎤   ⎡ 13/6    5/12⎤ ⎡x_1⎤   ⎡−2⎤
⎣ẋ_2⎦ = ⎣−50/3   −8/3 ⎦ ⎣x_2⎦ + ⎣ 0⎦ u,          (5.92)
where x1 represents the temperature and x2 represents the concentration of the initial
product of the chemical reaction.
Define two categories as follows:

$$y_d = \begin{cases} 1, & \|x\| \ge 0.1, \\ -1, & \|x\| < 0.1. \end{cases} \qquad (5.93)$$

For the two categories, define the respective utility functions as



$$r(x, u) = \begin{cases} x^TQx + u^TR_1u, & y_d = 1, \\ 10\log_{10}(x_1 + x_2)^2 + u^TR_2u, & y_d = -1, \end{cases} \qquad (5.94)$$

where Q = I2 , R1 = 1 and R2 = 100.
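The category rule (5.93) and the switched utility (5.94) can be written as a small helper, using Q = I₂, R₁ = 1 and R₂ = 100 from above; the sample states below are illustrative assumptions.

```python
import numpy as np

Q, R1, R2 = np.eye(2), 1.0, 100.0

def category(x):
    """Category label (5.93): 1 if ||x|| >= 0.1, otherwise -1."""
    return 1 if np.linalg.norm(x) >= 0.1 else -1

def utility(x, u):
    """Switched utility (5.94), selected by the category of the state x."""
    if category(x) == 1:
        return float(x @ Q @ x + R1 * u**2)
    return float(10 * np.log10((x[0] + x[1])**2) + R2 * u**2)

print(utility(np.array([0.3, 0.4]), 0.1))    # quadratic branch (category 1)
print(utility(np.array([0.05, 0.05]), 0.0))  # logarithmic branch (category -1)
```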


For system (5.92), the admissible control is used to obtain process response data
which is used as historical input-output data to train the SIANN and model RNN.
The norm of the state time-function is shown in Fig. 5.3.
First, the SIANN classifier is trained. The SIANN neuron output is (5.1) with $w_{j0} = 0$, $c_{j0} = 0$, and $g_h(\cdot)$ and $f_h(\cdot)$ are taken as sigmoid functions. Let the initial values of $a_j$ and $b_j$ be selected randomly in (−0.1, 0.1). Let the initial weights $w_j$ and $c_j$ be selected randomly in (−1, 1). Let the initial parameter in the output of the SIANN be $d = 1$, and the initial weight $v$ be selected randomly in (−1, 1). The number of shunting neurons is 10.

Fig. 5.3 The state norm (||x|| versus time steps)

Fig. 5.4 The SIANN training result (category labels yd and y versus time steps)
The SIANN is trained using the historical data generated above and the parameter-tuning algorithms in the previous section. When the historical data are fed into the trained
SIANN, the classification results are shown in Fig. 5.4. Note that, in Fig. 5.3, the
norm of the state falls below 0.1 at 300 time steps. This corresponds to the category
classification change observed in Fig. 5.4.
Second, the RNN estimation model is trained, which is used to estimate the system dynamics. Let the initial elements of the matrices $\hat A_i$, $i = 1, 2, 3, 4$, be selected in $(-0.5, 0.5)$. Let $A_5 = \begin{bmatrix} 2 & 2 \\ 2 & 2 \end{bmatrix}$ and $\beta_i = 0.05$, $i = 1, 2, 3, 4$. Let the activation function be $f(\cdot) = \mathrm{tansig}(\cdot)$. The stored data generated as in Fig. 5.3 are used to train the model RNN. The training RNN error is shown in Fig. 5.5.
Finally, the optimal controller is simulated based on the trained SIANN classifier
and RNN estimation model. The structures of the critic network and action network
are 2-8-1 and 2-8-1, respectively, in each category. Let the initial weights of the critic and action networks for the two categories be selected in (−0.7, 0.7). Let the activation function be a hyperbolic function, and let the learning rate of the critic and action networks be 0.02. The trained SIANN is used to classify the system state obtained by the optimal controller. To increase the accuracy, the weights of the SIANN are modified online. The classification result is shown in Fig. 5.6. Once the state is assigned to a category, the RNN is used to estimate the system state. The test RNN estimation error is shown in Fig. 5.7.
Fig. 5.5 The training error of RNN (ex(1) and ex(2) versus time steps)

Fig. 5.6 The test classification result (y and yd versus time steps)

After 600 time steps, the convergent trajectory of Wc is obtained, which is shown in Fig. 5.8. The control and state trajectories are shown in Figs. 5.9 and 5.10. The simulation results reveal that the proposed optimal control method for the unknown system operates properly.
Fig. 5.7 The test RNN error (ex(1) and ex(2) versus time steps)

Fig. 5.8 The convergence trajectory of the critic network weights Wc (Wc(1)–Wc(8) versus time steps)

Example 5.2 Consider the nonlinear oscillator [34]:



$$\begin{cases} \dot x_1 = x_1 + x_2 - x_1(x_1^2 + x_2^2), \\ \dot x_2 = -x_1 + x_2 - x_2(x_1^2 + x_2^2) + u. \end{cases} \qquad (5.95)$$
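As a consistency check on the model, the uncontrolled oscillator (u = 0) has an attracting limit cycle on the unit circle $x_1^2 + x_2^2 = 1$ (the radius obeys $\dot r = r(1 - r^2)$), which a short forward-Euler simulation reproduces; the step size and initial state are illustrative assumptions.

```python
import numpy as np

def dynamics(x, u=0.0):
    """Nonlinear oscillator (5.95)."""
    r2 = x[0]**2 + x[1]**2
    return np.array([x[0] + x[1] - x[0]*r2,
                     -x[0] + x[1] - x[1]*r2 + u])

x = np.array([0.5, 0.0])     # illustrative initial state (assumption)
dt = 0.01
for _ in range(2000):        # 20 seconds of forward-Euler integration
    x = x + dt * dynamics(x)
print(np.hypot(x[0], x[1]))  # radius approaches 1 when u = 0
```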
Fig. 5.9 The control trajectory (control versus time steps)

Fig. 5.10 The state trajectories (x1 and x2 versus time steps)

Define the performance measure functions as



$$r(x, u) = \begin{cases} x^TQ_1x + u^TR_1u, & y_d = 1, \\ \|x\| + x_1^2 + u^TR_1u, & y_d = -1. \end{cases} \qquad (5.96)$$
Fig. 5.11 The training result for SIANN (category labels yd and y versus time steps)

Define the category label as



$$y_d = \begin{cases} 1, & x_1 > 0.3,\ \text{or}\ x_2 > 0.3,\ \text{or}\ x_1 \le -0.3,\ \text{or}\ x_2 \le -0.3, \\ -1, & -0.3 \le x_1 < 0.3\ \text{and}\ -0.3 \le x_2 < 0.3. \end{cases} \qquad (5.97)$$
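The two-branch rule (5.97) amounts to testing whether the state has left a box around the origin; a minimal sketch (the first branch is given priority at the overlapping boundary $x_1 = -0.3$, which is an implementation choice):

```python
def label(x):
    """Category label (5.97): +1 outside the box around the origin, -1 inside.
    The +1 branch is checked first where the printed boundary cases overlap."""
    x1, x2 = x
    if x1 > 0.3 or x2 > 0.3 or x1 <= -0.3 or x2 <= -0.3:
        return 1
    return -1

print([label(p) for p in ([0.5, 0.0], [0.1, -0.1], [0.0, 0.4])])  # [1, -1, 1]
```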

For system (5.95), the admissible control is used to obtain the historical data of the
system state. Based on this obtained system data, the SIANN classifier and the RNN
are trained.
First, the SIANN is trained. The parameters are similar to those in Example 5.1, and the SIANN training result is shown in Fig. 5.11.
Second, the RNN estimation model is trained, based on the classified states.
Let the initial parameters Âi , i = 1, 2, 3, 4 of RNN be selected in (−0.2, 0.2) and
A5 = [2, 2; 2, 2]. Let βi = 0.05, i = 1, 2, 3, 4. The RNN training error is shown in
Fig. 5.12.
Finally, the optimal controller is established based on the trained SIANN classifier
and RNN estimation model. The structures of action network and critic network are
2-8-1 and 2-8-1, respectively. Let the initial weights Wa and Wc be selected in (−1, 1)
and the activation function be sigmoid function. The trained SIANN is used to classify
the system state online, and the classification result is shown in Fig. 5.13. Once the state is assigned to a category, the RNN is used to estimate the system state. The RNN estimation error is shown in Fig. 5.14. The control and state trajectories are shown in
Figs. 5.15 and 5.16. From simulation results, it can be seen that the proposed control
method for unknown systems is effective.
Fig. 5.12 The RNN training error (ex(1) and ex(2) versus time steps)

Fig. 5.13 The test classification result (test y and test yd versus time steps)

Fig. 5.14 The RNN estimate error (ex(1) and ex(2) versus time steps)

Fig. 5.15 The control trajectory (control versus time steps)

Fig. 5.16 The state trajectories (x1 and x2 versus time steps)

5.6 Conclusions

This chapter proposed multiple actor-critic structures to obtain the optimal control
by input-output data for unknown systems. First we classified the input-output data
into several categories by SIANN. The performance measure functions were defined
for each category. Then the optimal controller was obtained by the ADP algorithm. The RNN was used to reconstruct the unknown system dynamics using input-output data. Neural networks were used to implement the critic network and action network, respectively. It was proven that the model error and the closed-loop system are UUB. Simulation results demonstrated the effectiveness of the proposed optimal control scheme for the unknown nonlinear system.

References

1. Levine, D., Ramirez Jr., P.: An attentional theory of emotional influences on risky decisions.
Prog. Brain Res. 202(2), 369–388 (2013)
2. Levine, D., Mills, B., Estrada, S.: Modeling emotional influences on human decision making
under risk. In: Proceedings of International Joint Conference on Neural Networks, pp. 1657–
1662 (2005)
3. Werbos, P.: Intelligence in the brain: a theory of how it works and how to build it. Neural Netw.
22, 200–212 (2009)
4. Werbos, P.: Stable adaptive control using new critic designs. In: Proceedings of Adaptation, Noise, and Self-Organizing Systems (1998)

5. Narendra, K., Balakrishnan, J.: Adaptive control using multiple models. IEEE Trans. Autom. Control 42(2), 171–187 (1997)
6. Sugimoto, N., Morimoto, J., Hyon, S., Kawato, M.: The eMOSAIC model for humanoid robot
control. Neural Netw. 29–30, 8–19 (2012)
7. Doya, K.: What are the computations of the cerebellum, the basal ganglia and the cerebral
cortex? Neural Netw. 12(7–8), 961–974 (1999)
8. Hikosaka, O., Nakahara, H., Rand, M., Sakai, K., Lu, X., Nakamura, K., Miyachi, S., Doya, K.:
Parallel neural networks for learning sequential procedures. Trends Neurosci. 22(10), 464–471
(1999)
9. Lee, J., Lee, J.: Approximate dynamic programming-based approaches for input-output data-
driven control of nonlinear processes. Automatica 41(7), 1281–1288 (2005)
10. Song, R., Xiao, W., Zhang, H.: Multi-objective optimal control for a class of unknown nonlinear systems based on finite-approximation-error ADP algorithm. Neurocomputing 119(7), 212–221 (2013)
11. Li, H., Liu, D., Wang, D.: Integral reinforcement learning for linear continuous-time zero-sum
games with completely unknown dynamics. IEEE Trans. Autom. Sci. Eng. 11(3), 706–714
(2014)
12. Yang, X., Liu, D., Huang, Y.: Neural-network-based online optimal control for uncertain non-
linear continuous-time systems with control constraints. IET Control Theory Appl. 7(17),
2037–2047 (2013)
13. Lewis, F., Vamvoudakis, K.: Reinforcement learning for partially observable dynamic pro-
cesses: adaptive dynamic programming using measured output data. IEEE Trans. Syst. Man
Cybern. Part B: Cybern. 41(1), 14–25 (2011)
14. Li, Z., Duan, Z., Lewis, F.: Distributed robust consensus control of multi-agent systems with
heterogeneous matching uncertainties. Automatica 50(3), 883–889 (2014)
15. Modares, H., Lewis, F., Naghibi-Sistani, M.: Integral reinforcement learning and experience
replay for adaptive optimal control of partially unknown constrained-input continuous-time
systems. Automatica 50(1), 193–202 (2014)
16. Zhang, H., Lewis, F.: Adaptive cooperative tracking control of higher-order nonlinear systems
with unknown dynamics. Automatica 48(7), 1432–1439 (2012)
17. Wei, Q., Liu, D.: Adaptive dynamic programming for optimal tracking control of unknown
nonlinear systems with application to coal gasification. IEEE Trans. Autom. Sci. Eng. 11(4),
1020–1036 (2014)
18. Doya, K., Samejima, K., Katagiri, K., Kawato, M.: Multiple model-based reinforcement learn-
ing. Neural Comput. 14, 1347–1369 (2002)
19. Levine, D.: Neural dynamics of affect, gist, probability, and choice. Cogn. Syst. Res. 15–16,
57–72 (2012)
20. Werbos, P.: Using ADP to understand and replicate brain intelligence: the next level design.
IEEE International Symposium on Approximate Dynamic Programming and Reinforcement
Learning, pp. 209–216 (2007)
21. Arulampalam, G., Bouzerdoum, A.: A generalized feedforward neural network architecture
for classification and regression. Neural Netw. 16, 561–568 (2003)
22. Bouzerdoum, A.: Classification and function approximation using feedforward shunting
inhibitory artificial neural networks. In: Proceedings of the IEEE-INNS-ENNS International
Joint Conference on Neural Networks, vol. 6, pp. 613–618 (2000)
23. Tivive, F., Bouzerdoum, A.: Efficient training algorithms for a class of shunting inhibitory
convolutional neural networks. IEEE Trans. Neural Netw. 16(3), 541–556 (2005)
24. Song, R., Lewis, F., Wei, Q., Zhang, H.: Off-policy actor-critic structure for optimal control of
unknown systems with disturbances. IEEE Trans. Cybern. 46(5), 1041–1050 (2016)
25. Hornik, K., Stinchcombe, M., White, H., Auer, P.: Degree of approximation results for feedfor-
ward networks approximating unknown mappings and their derivatives. Neural Comput. 6(6),
1262–1275 (1994)
26. Zhang, H., Cui, L., Zhang, X., Luo, Y.: Data-driven robust approximate optimal tracking control
for unknown general nonlinear systems using adaptive dynamic programming method. IEEE
Trans. Neural Netw. 22(12), 2226–2236 (2011)

27. Kim, Y., Lewis, F.: Neural network output feedback control of robot manipulators. IEEE Trans.
Robot. Autom. 15(2), 301–309 (1999)
28. Khalil, H.: Nonlinear System. Prentice-Hall, NJ (2002)
29. Lewis, F., Jagannathan, S., Yesildirek, A.: Neural Network Control of Robot Manipulators and
Nonlinear Systems. Taylor and Francis, London (1999)
30. Song, R., Xiao, W., Zhang, H., Sun, C.: Adaptive dynamic programming for a class of complex-
valued nonlinear systems. IEEE Trans. Neural Netw. Learn. Syst. 25(9), 1733–1739 (2014)
31. Yang, C., Li, Z., Li, J.: Trajectory planning and optimized adaptive control for a class of wheeled
inverted pendulum vehicle models. IEEE Trans. Cybern. 43(1), 24–36 (2013)
32. Yang, C., Li, Z., Cui, R., Xu, B.: Neural network-based motion control of an underactuated
wheeled inverted pendulum model. IEEE Trans. Neural Netw. Learn. Syst. 25(1), 2004–2016
(2014)
33. Beard, R.: Improving the Closed-Loop Performance of Nonlinear Systems, Ph.D. thesis, Rens-
selaer Polytechnic Institute, Troy, NY (1995)
34. Abu-Khalaf, M., Lewis, F.: Nearly optimal control laws for nonlinear systems with saturating
actuators using a neural network HJB approach. Automatica 41, 779–791 (2005)
Chapter 6
Optimal Control for a Class of Complex-Valued Nonlinear Systems

In this chapter, an optimal control scheme based on ADP is developed to solve infinite-horizon optimal control problems of continuous-time complex-valued nonlinear systems. A new performance index function is established based on complex-valued state and control. Using system transformations, the complex-valued system
is transformed into a real-valued one, which overcomes Cauchy–Riemann conditions
effectively. Based on the transformed system and the performance index function, a
new ADP method is developed to obtain the optimal control law using neural net-
works. A compensation controller is developed to compensate the approximation
errors of neural networks. Stability properties of the nonlinear system are analyzed
and convergence properties of the weights for neural networks are presented. Finally,
simulation results demonstrate the performance of the developed optimal control
scheme for complex-valued nonlinear systems.

6.1 Introduction

In many science problems and engineering applications, the parameters and signals
are complex-valued [1, 2], such as quantum systems [3] and complex-valued neural
networks [4]. In [5], complex-valued filters were proposed for complex signals and
systems. In [6], a complex-valued B-spline neural network was proposed to model the
complex-valued Wiener system. In [7], a complex-valued pipelined recurrent neural
network for nonlinear adaptive prediction of complex nonlinear and nonstationary
signals was proposed. In [8], the output feedback stabilization of complex-valued
reaction-advection-diffusion systems was studied. In [4], the global asymptotic sta-
bility of delayed complex-valued recurrent neural networks was studied. In [9], a
reinforcement learning algorithm with complex-valued functions was proposed. In
the investigations of complex-valued systems, many system designers want to find the optimal value of the complex-valued parameters or controlled variable by optimizing a chosen performance index function [10].

© Science Press, Beijing and Springer Nature Singapore Pte Ltd. 2019. R. Song et al., Adaptive Dynamic Programming: Single and Multiple Controllers, Studies in Systems, Decision and Control 166, https://doi.org/10.1007/978-981-13-1712-5_6
From the previous discussions, it can be seen that the optimal control schemes
based on ADP are constrained to real-valued systems. In many real-world systems,
however, the system states and controls are complex values [11]. As there exist
inherent differences between real-valued systems and complex-valued ones, the
ADP methods for real-valued systems cannot solve the optimal control problems
of complex-valued systems, directly. To the best of our knowledge, there are no dis-
cussions on ADP for complex-valued systems. Therefore, a novel ADP method for
complex-valued systems is eagerly anticipated. This motivates our research.
In this chapter, for the first time an ADP-based optimal control scheme for
complex-valued systems is developed. First, a new performance index function
is defined based on the complex-valued state and control. Second, using system
transformations, the complex-valued system is transformed into a real-valued one,
where the Cauchy–Riemann conditions can effectively be avoided. Then, a new ADP
method is developed to obtain the optimal control law of the nonlinear systems. Neu-
ral networks, including critic and action networks, are employed to implement the
developed ADP method. A compensation control method is established to overcome
the approximation errors of neural networks. It is proven that the developed control
scheme makes the closed-loop system uniformly ultimately bounded (UUB). It is
also shown that the weights of neural networks will converge to a finite neighbor-
hood of the optimal weights. Finally, the simulation studies are given to show the
effectiveness of the developed control scheme.

6.2 Motivations and Preliminaries

Consider the following complex-valued nonlinear system

ż = f (z) + g(z)u, (6.1)

where z ∈ Cn is the system state, f (z) ∈ Cn and f (0) = 0. Let g(z) ∈ Cn×n be a
bounded input gain, i.e., ||g(·)|| ≤ ḡ, where ḡ is a positive constant, and || · || stands
for the 2-norm, unless special declaration is given. Let u ∈ Cn be the control vector.
Let z 0 be the initial state. For system (6.1), the infinite-horizon performance index
function is defined as
$$J(z) = \int_t^{\infty} \bar r(z(\tau), u(\tau))\,d\tau, \qquad (6.2)$$

where the utility function is $\bar r(z, u) = z^HQ_1z + u^HR_1u$. Let $Q_1$ and $R_1$ be diagonal positive definite matrices, and let $z^H$ and $u^H$ denote the complex conjugate transposes of $z$ and $u$, respectively.

The aim of this chapter is to obtain the optimal control of the complex-valued
nonlinear system (6.1). In order to achieve this purpose, the following assumptions
are necessary.
Assumption 6.1 [4] Let $i^2 = -1$, and write $z = x + iy$, where $x \in \mathbb{R}^n$ and $y \in \mathbb{R}^n$. If $C(z) \in \mathbb{C}^n$ is a complex-valued function, then it can be expressed as $C(z) = C^R(x, y) + iC^I(x, y)$, where $C^R(x, y) \in \mathbb{R}^n$ and $C^I(x, y) \in \mathbb{R}^n$.

Assumption 6.2 Let $f(z) = (f_1(z), f_2(z), \ldots, f_n(z))^T$, and $f_j(z) = f_j^R(x, y) + if_j^I(x, y)$, $j = 1, 2, \ldots, n$. The partial derivatives of $f_j(z)$ with respect to $x$ and $y$ satisfy $\left\|\frac{\partial f_j^R}{\partial x}\right\|_1 \le \lambda_j^{RR}$, $\left\|\frac{\partial f_j^R}{\partial y}\right\|_1 \le \lambda_j^{RI}$, $\left\|\frac{\partial f_j^I}{\partial x}\right\|_1 \le \lambda_j^{IR}$ and $\left\|\frac{\partial f_j^I}{\partial y}\right\|_1 \le \lambda_j^{II}$, where $\lambda_j^{RR}$, $\lambda_j^{RI}$, $\lambda_j^{IR}$ and $\lambda_j^{II}$ are positive constants. Let $\|\cdot\|_1$ stand for the 1-norm.

According to the above preparations, the system transformation for system (6.1) will now be given. Let $f(z) = f^R(x, y) + if^I(x, y)$, $g(z) = g^R(x, y) + ig^I(x, y)$ and $u = u^R + iu^I$. Then, system (6.1) can be written as

$$\dot x + i\dot y = f^R(x, y) + if^I(x, y) + \left(g^R(x, y) + ig^I(x, y)\right)\left(u^R + iu^I\right). \qquad (6.3)$$

Let $\eta = \begin{bmatrix} x \\ y \end{bmatrix}$, $\nu = \begin{bmatrix} u^R \\ u^I \end{bmatrix}$, $F(\eta) = \begin{bmatrix} f^R(x, y) \\ f^I(x, y) \end{bmatrix}$ and $G(\eta) = \begin{bmatrix} g^R(x, y) & -g^I(x, y) \\ g^I(x, y) & g^R(x, y) \end{bmatrix}$.
Then, we can obtain

η̇ = F(η) + G(η)ν, (6.4)

where η ∈ R2n , F(η) ∈ R2n , G(η) ∈ R2n×2n and ν ∈ R2n . From (6.4) we can see
that F(0) = 0.
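The stacking that produces (6.4) can be verified numerically: for any z and u, the real and imaginary parts of ż from (6.1) must coincide with η̇ from (6.4). The sketch below runs this check for a scalar example; the particular f and g are assumptions chosen only for the test.

```python
import numpy as np

# Illustrative complex dynamics (assumptions; note f(0) = 0)
f = lambda z: -z + 0.1j * z**2
g = lambda z: np.array([[1.0 + 0.5j]])

z = np.array([0.3 - 0.2j])
u = np.array([0.1 + 0.4j])
zdot = f(z) + g(z) @ u                       # complex-valued system (6.1)

# Transformed real-valued system (6.4)
eta = np.concatenate([z.real, z.imag])
nu = np.concatenate([u.real, u.imag])
F = np.concatenate([f(z).real, f(z).imag])
gR, gI = g(z).real, g(z).imag
G = np.block([[gR, -gI], [gI, gR]])
eta_dot = F + G @ nu

assert np.allclose(eta_dot, np.concatenate([zdot.real, zdot.imag]))
print("transformation (6.1) <-> (6.4) is consistent")
```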

Remark 6.1 In fact, the system transformations between system (6.1) and system
(6.4) are equivalent and reversible, which can be seen in the following equations

$$\begin{aligned}
&\dot\eta = F(\eta) + G(\eta)\nu\\
\Leftrightarrow\ & \begin{bmatrix} \dot x \\ \dot y \end{bmatrix} = \begin{bmatrix} f^R(x, y) + g^R(x, y)u^R - g^I(x, y)u^I \\ f^I(x, y) + g^I(x, y)u^R + g^R(x, y)u^I \end{bmatrix}\\
\Leftrightarrow\ & \dot x + i\dot y = f^R(x, y) + if^I(x, y) + \left(g^R(x, y) + ig^I(x, y)\right)u^R + \left(ig^I(x, y) + g^R(x, y)\right)iu^I\\
\Leftrightarrow\ & \dot x + i\dot y = f^R(x, y) + if^I(x, y) + \left(g^R(x, y) + ig^I(x, y)\right)\left(u^R + iu^I\right)\\
\Leftrightarrow\ & \dot z = f(z) + g(z)u.
\end{aligned}$$

Therefore, if the optimal control for (6.4) is acquired, then the optimal control
problem of system (6.1) is also solved. In the following section, the optimal control
scheme of system (6.4) will be developed based on continuous-time ADP method.

Let $Q = \mathrm{diag}(Q_1, Q_1)$ and $R = \mathrm{diag}(R_1, R_1)$. According to the definition of $\bar r(z, u)$, the utility function can be expressed as

$$r(\eta, \nu) = \eta^TQ\eta + \nu^TR\nu. \qquad (6.5)$$

According to (6.5), the performance index function (6.2) can be expressed as


$$J(\eta) = \int_t^{\infty} r(\eta(\tau), \nu(\tau))\,d\tau. \qquad (6.6)$$

For an arbitrary admissible control law ν, if the associated performance index func-
tion J (η) is given in (6.6), then an infinitesimal version of (6.6) is the so-called
nonlinear Lyapunov equation [12]

0 = JηT (F(η) + G(η)ν) + r (η, ν), (6.7)

where $J_\eta = \dfrac{\partial J}{\partial \eta}$ is the partial derivative of the performance index function $J$. Define the optimal performance index function as

$$J^*(\eta) = \min_{\nu \in U} \int_t^{\infty} r(\eta(\tau), \nu(\eta(\tau)))\,d\tau, \qquad (6.8)$$

where U is a set of admissible control laws. Defining the Hamiltonian function as

H (η, ν, Jη ) = JηT (F(η) + G(η)ν) + r (η, ν), (6.9)

the optimal performance index function $J^*(\eta)$ satisfies

$$0 = \min_{\nu \in U} H(\eta, \nu, J_\eta^*). \qquad (6.10)$$

According to (6.9) and (6.10), the optimal control law can be expressed as

$$\nu^*(\eta) = -\frac{1}{2}R^{-1}G^T(\eta)J_\eta^*(\eta). \qquad (6.11)$$
Remark 6.2 In this chapter, the system transformations between (6.1) and (6.4) are necessary. We should say that the optimal control cannot be obtained from (6.1) and (6.2) directly. For example, if the optimal control is calculated from (6.1) and (6.2), then, according to the Bellman optimality principle, we have $u = -\frac{1}{2}R_1^{-1}g^H(z)J_z(z)$. Let $z = x + iy$ and $J = J^R + iJ^I$. The partial derivative $J_z = \frac{\partial J}{\partial z}$ exists only if the Cauchy–Riemann conditions are satisfied, i.e., $\frac{\partial J^R}{\partial x} = \frac{\partial J^I}{\partial y}$ and $\frac{\partial J^R}{\partial y} = -\frac{\partial J^I}{\partial x}$. As the performance index function $J$ is real-valued, $J^I = 0$, which means $\frac{\partial J^R}{\partial x} = \frac{\partial J^I}{\partial y} = 0$ and $\frac{\partial J^R}{\partial y} = -\frac{\partial J^I}{\partial x} = 0$. Therefore, $\frac{\partial J}{\partial z} = \frac{\partial J^R}{\partial x} + i\frac{\partial J^I}{\partial x} = \frac{\partial J^I}{\partial y} - i\frac{\partial J^R}{\partial y} = 0$. Thus the optimal control of complex-valued system (6.1) would be $u = 0$, which is obviously invalid. If complex-valued system (6.1) is transformed into (6.4), then the Cauchy–Riemann conditions are avoided. Therefore, the optimal control of system (6.1) can be obtained from the transformed system (6.4) and the performance index function (6.6).

In the next section, the ADP-based optimal control method will be given in detail.

6.3 ADP-Based Optimal Control Design

In this section, neural networks are introduced to implement the optimal control
method. Let the number of hidden layer neurons of the neural network be L. Let
the weight matrix between the input layer and hidden layer be Y . Let the weight
matrix between the hidden layer and output layer be W and let the input vector be X .
Then, the output of the neural network is represented as FN (X, Y, W ) = W T σ (Y X ),
where σ (Y X ) is the activation function. For convenience of analysis, only the output
weight W is updated, while the hidden weight is kept fixed [13]. Hence, the neural
network function can be simplified by the expression FN (X, W ) = W T σ̄ (X ), where
σ̄ (X ) = σ (Y X ).
There are two neural networks in the ADP method, which are critic and action
networks, respectively. In the following subsections, the detailed design methods for
the critic and action networks will be given.

6.3.1 Critic Network

The critic network is utilized to approximate the performance index function J (z),
and the ideal critic network is expressed as J (z) = W̄cH ϕ̄c (z) + εc , where W̄c ∈ Cn c1
is the ideal critic network weight matrix. Let ϕ̄c (z) ∈ Cn c1 be the activation function
and let εc ∈ R be the approximation error of the critic network.
Let $\bar W_c = \bar W_c^R + i\bar W_c^I$ and $\bar\varphi_c = \bar\varphi_c^R + i\bar\varphi_c^I$. Then, we have

$$J(\eta) = (\bar W_c^R - i\bar W_c^I)^T(\bar\varphi_c^R + i\bar\varphi_c^I) + \varepsilon_c = \bar W_c^{RT}\bar\varphi_c^R + \bar W_c^{IT}\bar\varphi_c^I + i(\bar W_c^{IT}\bar\varphi_c^R - \bar W_c^{RT}\bar\varphi_c^I) + \varepsilon_c. \qquad (6.12)$$

Let $W_c = \begin{bmatrix} \bar W_c^R \\ \bar W_c^I \end{bmatrix}$ and $\varphi_c = \begin{bmatrix} \bar\varphi_c^R \\ \bar\varphi_c^I \end{bmatrix}$. As $J(\eta)$ is a real-valued function, we can get

J (η) = WcT ϕc (η) + εc . (6.13)
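The reduction from (6.12) to (6.13) rests on the real part of the complex inner product $\bar W_c^H\bar\varphi_c$ equaling the inner product of the stacked real vectors; a small randomized check (the random weights are an illustrative assumption):

```python
import numpy as np

rng = np.random.default_rng(0)
Wbar = rng.normal(size=5) + 1j * rng.normal(size=5)    # complex critic weights
phibar = rng.normal(size=5) + 1j * rng.normal(size=5)  # complex activations

# Stacked real-valued counterparts, as in (6.12)-(6.13)
Wc = np.concatenate([Wbar.real, Wbar.imag])
phic = np.concatenate([phibar.real, phibar.imag])

lhs = np.vdot(Wbar, phibar).real  # Re(Wbar^H phibar); vdot conjugates arg 1
rhs = Wc @ phic                   # Wc^T phic
assert np.isclose(lhs, rhs)
print("real part of the complex inner product matches the stacked form")
```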

Thus the derivative of J (η) is written as

Jη = ∇ϕcT (η)Wc + ∇εc , (6.14)

where $\nabla\varphi_c(\eta) = \dfrac{\partial \varphi_c(\eta)}{\partial \eta}$ and $\nabla\varepsilon_c = \dfrac{\partial \varepsilon_c}{\partial \eta}$. According to (6.13), Hamiltonian function (6.9) can be expressed as

$$H(\eta, \nu, W_c) = W_c^T\nabla\varphi_c(F + G\nu) + r(\eta, \nu) - \varepsilon_H, \qquad (6.15)$$

where ε H is the residual error, which is expressed as ε H = −∇εcT (F + Gν). Let Ŵc
be the estimate of Wc , and then the output of the critic network is

Jˆ(η) = ŴcT ϕc (η). (6.16)

Hence, Hamiltonian function (6.9) can be approximated by

H (η, ν, Ŵc ) = ŴcT ∇ϕc (F + Gν) + r (η, ν). (6.17)

Then, we can define the weight estimation error of the critic network as

W̃c = Wc − Ŵc . (6.18)

Note that, for a fixed admissible control law ν, Hamiltonian function (6.9) becomes
H (η, ν, Jη ) = 0, which means H (η, ν, Wc ) = 0. Therefore, from (6.15), we have

ε H = WcT ∇ϕc (F + Gν) + r (η, ν). (6.19)

Let ec = H (η, ν, Ŵc ) − H (η, ν, Wc ). We can obtain

ec = − W̃cT ∇ϕc (η)(F(η) + G(η)ν) + ε H . (6.20)

It is desired to select $\hat W_c$ to minimize the squared residual error $E_c = \frac{1}{2}e_c^Te_c$. A normalized gradient descent algorithm is used to train the critic network [12]. Then, the weight update rule of the critic network, $\dot{\hat W}_c$, is derived as

$$\dot{\hat W}_c = -\alpha_c\frac{\partial E_c}{\partial \hat W_c} = -\alpha_c\frac{\xi_1(\xi_1^T\hat W_c + r(\eta, \nu))}{(\xi_1^T\xi_1 + 1)^2}, \qquad (6.21)$$

where αc > 0 is the learning rate of the critic network and ξ1 = ∇ϕc (F + Gν).
It is a modified Levenberg–Marquardt algorithm, i.e., $(\xi_1^T\xi_1 + 1)$ is replaced by $(\xi_1^T\xi_1 + 1)^2$, which is used for normalization and will be required in the proofs [12]. Let $\xi_2 = \dfrac{\xi_1}{\xi_3}$ and $\xi_3 = \xi_1^T\xi_1 + 1$. We have

$$\dot{\tilde W}_c = \alpha_c\frac{\xi_1(\xi_1^T\hat W_c + r)}{\xi_3^2} = -\alpha_c\xi_2\xi_2^T\tilde W_c + \alpha_c\frac{\xi_1(\xi_1^TW_c + r)}{\xi_3^2} = -\alpha_c\xi_2\xi_2^T\tilde W_c + \alpha_c\xi_2\frac{\varepsilon_H}{\xi_3}. \qquad (6.22)$$
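One iteration of the normalized critic update (6.21) can be sketched as below; the fixed signal ξ₁, the reward value, and the learning rate are illustrative assumptions, while the update line itself follows (6.21).

```python
import numpy as np

def critic_update(Wc_hat, xi1, r, alpha_c=0.5):
    """Normalized gradient step (6.21) on the critic weights."""
    e = float(xi1 @ Wc_hat + r)      # Hamiltonian residual xi1^T Wc_hat + r
    denom = (xi1 @ xi1 + 1.0) ** 2   # Levenberg-Marquardt-style normalization
    return Wc_hat - alpha_c * xi1 * e / denom

# Illustrative stand-in for xi1 = grad(phi_c) (F + G nu) (assumption)
xi1 = np.array([0.4, -0.2, 0.1])
Wc_hat = np.zeros(3)
for _ in range(200):
    Wc_hat = critic_update(Wc_hat, xi1, r=0.5)
print(abs(float(xi1 @ Wc_hat + 0.5)))  # residual is driven toward zero
```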

6.3.2 Action Network

The action network is used to obtain the control law u. The ideal expression of the
action network is u = W̄aT ϕ̄a (z) + ε̄a , where W̄a ∈ Cna1 ×n is the ideal weight matrix
of the action network. Let ϕ̄a (η) ∈ Cna1 be the activation function and let ε̄a ∈ Cn be
the approximation error of the action network.
Let $\bar W_a = \bar W_a^R + i\bar W_a^I$, $\bar\varphi_a = \bar\varphi_a^R + i\bar\varphi_a^I$ and $\bar\varepsilon_a = \bar\varepsilon_a^R + i\bar\varepsilon_a^I$. Let $W_a^T = \begin{bmatrix} \bar W_a^{RT} & -\bar W_a^{IT} \\ \bar W_a^{IT} & \bar W_a^{RT} \end{bmatrix}$, $\varphi_a = \begin{bmatrix} \bar\varphi_a^R \\ \bar\varphi_a^I \end{bmatrix}$ and $\varepsilon_a = \begin{bmatrix} \bar\varepsilon_a^R \\ \bar\varepsilon_a^I \end{bmatrix}$. We have

$$\nu = W_a^T\varphi_a(\eta) + \varepsilon_a. \qquad (6.23)$$

The output of the action network is

ν̂(η) = ŴaT ϕa (η), (6.24)

where Ŵa is the estimation of Wa . We can define the output error of the action
network as
$$e_a = \hat W_a^T\varphi_a + \frac{1}{2}R^{-1}G^T\nabla\varphi_c^T\hat W_c. \qquad (6.25)$$
The objective function to be minimized by the action network is defined as

$$E_a = \frac{1}{2}e_a^Te_a. \qquad (6.26)$$
The weight update law for the action network weight is a gradient descent algo-
rithm, which is given by

$$\dot{\hat W}_a = -\alpha_a\varphi_a\left(\hat W_a^T\varphi_a + \frac{1}{2}R^{-1}G^T\nabla\varphi_c^T\hat W_c\right)^T, \qquad (6.27)$$
102 6 Optimal Control for a Class of Complex-Valued Nonlinear Systems

where αa is the learning rate of the action network. Define the weight estimation
error of the action network as

W̃a = Wa − Ŵa . (6.28)
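One gradient step of the actor update (6.27) on the error (6.25) can be sketched as follows; the matrix `M` stands in for $R^{-1}G^T\nabla\varphi_c^T$, and every numerical value is an illustrative assumption.

```python
import numpy as np

def actor_update(Wa_hat, phi_a, M, Wc_hat, alpha_a=0.1):
    """Gradient-descent step (6.27) on the actor error (6.25)."""
    e_a = Wa_hat.T @ phi_a + 0.5 * M @ Wc_hat
    return Wa_hat - alpha_a * np.outer(phi_a, e_a)

phi_a = np.array([0.5, -0.3])        # actor activations (assumption)
M = np.array([[0.2, 0.1, 0.0]])      # stands in for R^{-1} G^T grad(phi_c)^T
Wc_hat = np.array([1.0, -1.0, 0.5])  # current critic weights (assumption)
Wa_hat = np.zeros((2, 1))
for _ in range(500):
    Wa_hat = actor_update(Wa_hat, phi_a, M, Wc_hat)
print(np.linalg.norm(Wa_hat.T @ phi_a + 0.5 * M @ Wc_hat))  # error -> 0
```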

Then, we have

$$\dot{\tilde W}_a = \alpha_a\varphi_a\left((W_a - \tilde W_a)^T\varphi_a + \frac{1}{2}R^{-1}G^T\nabla\varphi_c^T(W_c - \tilde W_c)\right)^T = \alpha_a\varphi_a\left(-\tilde W_a^T\varphi_a - \frac{1}{2}R^{-1}G^T\nabla\varphi_c^T\tilde W_c + W_a^T\varphi_a + \frac{1}{2}R^{-1}G^T\nabla\varphi_c^TW_c\right)^T. \qquad (6.29)$$

As $\nu = -\frac{1}{2}R^{-1}G^TJ_\eta$, according to (6.14) and (6.23), we have

$$W_a^T\varphi_a + \varepsilon_a = -\frac{1}{2}R^{-1}G^T\nabla\varphi_c^TW_c - \frac{1}{2}R^{-1}G^T\nabla\varepsilon_c. \qquad (6.30)$$
Thus, (6.29) can be rewritten as

$$\dot{\tilde W}_a = -\alpha_a\varphi_a\left(\tilde W_a^T\varphi_a + \frac{1}{2}R^{-1}G^T\nabla\varphi_c^T\tilde W_c - \varepsilon_{12}\right)^T, \qquad (6.31)$$

where $\varepsilon_{12} = -\varepsilon_a - \frac{1}{2}R^{-1}G^T\nabla\varepsilon_c$.

6.3.3 Design of the Compensation Controller

In this subsection, a compensation controller is designed to overcome the approxi-


mation errors of the critic and action networks. Before the detailed design method,
the following assumption is necessary.
Assumption 6.3 The approximation errors of the critic and action networks, i.e., εc
and εa satisfy ||εc || ≤ εcM and ||εa || ≤ εa M . The residual error is upper bounded, i.e.
||ε H || ≤ ε H M . εcM , εa M and ε H M are positive numbers. The vectors of the activation
functions of the action network satisfy ||ϕa || ≤ ϕa M , where ϕa M is a positive number.
Define the compensation controller as

$$\nu_c = -\frac{K_cG^T\eta}{\eta^T\eta + b}, \qquad (6.32)$$

where $K_c \ge \dfrac{\eta^T\eta + b}{2\eta^TGG^T\eta}\|G\|^2\varepsilon_{aM}^2$, and $b > 0$ is a constant. Then, the optimal control law can be expressed as

νall = ν̂ + νc , (6.33)

where νc is the compensation controller, and ν̂ is the output of the action network.
Substituting (6.33) into (6.4), we can get

η̇ = F + G ŴaT ϕa + Gνc . (6.34)

As ŴaT ϕa = WaT ϕa − W̃aT ϕa = ν − εa − W̃aT ϕa , we can obtain

η̇ = F + Gν − Gεa − G W̃aT ϕa + Gνc . (6.35)
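The compensation term (6.32), with $K_c$ chosen above its stated lower bound, can be sketched and checked against the inequality it is designed to enforce (used later in (6.55)); the 1.5 margin factor and the numerical values are illustrative assumptions.

```python
import numpy as np

def compensator(eta, G, eps_aM, b=0.1, margin=1.5):
    """Compensation controller (6.32); K_c satisfies the stated lower bound."""
    quad = 2.0 * eta @ G @ G.T @ eta
    Kc = margin * (eta @ eta + b) / quad * np.linalg.norm(G, 2) ** 2 * eps_aM ** 2
    return -Kc * (G.T @ eta) / (eta @ eta + b)

eta = np.array([0.4, -0.2])
G = np.array([[1.0, 0.0], [0.2, 0.8]])
eps_aM = 0.05
nu_c = compensator(eta, G, eps_aM)

# Design inequality 2 eta^T G nu_c <= -||G||^2 eps_aM^2
lhs = 2.0 * eta @ G @ nu_c
rhs = -np.linalg.norm(G, 2) ** 2 * eps_aM ** 2
print(lhs <= rhs)
```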

In the next subsection, the stability analysis will be given.

6.3.4 Stability Analysis

For continuous-time ADP methods, the signals need to be persistently exciting in


the learning process [12], i.e., the persistence of excitation assumption.
Assumption 6.4 Let the signal ξ2 be persistently exciting over the interval [t, t + T ],
i.e. there exist constants β1 > 0, β2 > 0 and T > 0, such that, for all t,
$$\beta_1I \le \int_t^{t+T}\xi_2(\tau)\xi_2^T(\tau)\,d\tau \le \beta_2I \qquad (6.36)$$

holds.
Remark 6.3 This assumption ensures that system (6.4) is sufficiently persistently excited for tuning the critic and action networks. In fact, the persistence of excitation assumption ensures $\xi_{2m} \le \|\xi_2\|$, where $\xi_{2m}$ is a positive number [12].
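The PE condition (6.36) can be checked numerically for a concrete signal by approximating the integral of ξ₂ξ₂ᵀ over a window and inspecting its extreme eigenvalues; the two-tone test signal below is an illustrative assumption.

```python
import numpy as np

def pe_bounds(xi2_fn, t0=0.0, T=10.0, dt=0.01):
    """Riemann-sum approximation of int_{t0}^{t0+T} xi2 xi2^T dt,
    returning its smallest and largest eigenvalues."""
    ts = np.arange(t0, t0 + T, dt)
    M = sum(np.outer(xi2_fn(t), xi2_fn(t)) for t in ts) * dt
    w = np.linalg.eigvalsh(M)
    return w[0], w[-1]   # play the roles of beta1 and beta2 in (6.36)

xi2 = lambda t: np.array([np.sin(t), np.cos(2.0 * t)])  # two-tone signal
lo, hi = pe_bounds(xi2)
print(lo > 0.0, hi >= lo)  # positive lower bound => PE over the window
```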
Before giving the main result, the following preparation works are presented.
Lemma 6.1 For $\forall x \in \mathbb{R}^n$, we have

$$\|x\|_2 \le \|x\|_1 \le \sqrt{n}\|x\|_2. \qquad (6.37)$$

Proof Let $x = (x_1, x_2, \ldots, x_n)^T$. As $\|x\|_2^2 = \sum_{i=1}^n |x_i|^2 \le \left(\sum_{i=1}^n |x_i|\right)^2 = \|x\|_1^2$, we can get $\|x\|_2 \le \|x\|_1$. As $\|x\|_1^2 = \left(\sum_{i=1}^n |x_i|\right)^2 \le n\sum_{i=1}^n |x_i|^2 = n\|x\|_2^2$, we can obtain $\|x\|_1 \le \sqrt{n}\|x\|_2$.
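Lemma 6.1 is the standard equivalence between the 1-norm and the 2-norm; a randomized numerical check of (6.37):

```python
import numpy as np

rng = np.random.default_rng(1)
for _ in range(100):
    n = int(rng.integers(1, 20))
    x = rng.normal(size=n)
    l1 = np.abs(x).sum()
    l2 = np.sqrt((x**2).sum())
    assert l2 <= l1 + 1e-12               # ||x||_2 <= ||x||_1
    assert l1 <= np.sqrt(n) * l2 + 1e-12  # ||x||_1 <= sqrt(n) ||x||_2
print("norm equivalence (6.37) holds on all random samples")
```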

Theorem 6.1 For system (6.1), if $f(z)$ satisfies Assumptions 6.1 and 6.2, then we have

$$\|F(\eta) - F(\eta')\|_2 \le k\|\eta - \eta'\|_2, \qquad (6.38)$$

where $\eta = \begin{bmatrix} x \\ y \end{bmatrix}$ and $\eta' = \begin{bmatrix} x' \\ y' \end{bmatrix}$. Let $k_j^R = \max\{\lambda_j^{RR}, \lambda_j^{RI}\}$ and $k_j^I = \max\{\lambda_j^{IR}, \lambda_j^{II}\}$, $j = 1, 2, \ldots, n$. Let $k' = \sum_{j=1}^n k_j^R + \sum_{j=1}^n k_j^I$ and $k = \sqrt{2n}\,k'$.

Proof According to Assumption 6.2 and the mean value theorem for multi-variable functions, we have

$$\|f_j^R(x, y) - f_j^R(x', y')\|_1 \le \lambda_j^{RR}\|x - x'\|_1 + \lambda_j^{RI}\|y - y'\|_1. \qquad (6.39)$$

According to the definition of the 1-norm, we have $\|\eta - \eta'\|_1 = \|x - x'\|_1 + \|y - y'\|_1$, and

$$\|f_j^R(x, y) - f_j^R(x', y')\|_1 \le k_j^R\|\eta - \eta'\|_1. \qquad (6.40)$$

According to (6.40), we can get

$$\|f^R(x, y) - f^R(x', y')\|_1 \le \sum_{j=1}^n k_j^R\|\eta - \eta'\|_1. \qquad (6.41)$$

Following the same idea as in (6.40) and (6.41), we can also obtain

$$\|f^I(x, y) - f^I(x', y')\|_1 \le \sum_{j=1}^n k_j^I\|\eta - \eta'\|_1. \qquad (6.42)$$

Therefore, we can get

$$\|F(\eta) - F(\eta')\|_1 = \|f^R(x, y) - f^R(x', y')\|_1 + \|f^I(x, y) - f^I(x', y')\|_1 \le k'\|\eta - \eta'\|_1. \qquad (6.43)$$

According to Lemma 6.1, we have

$$\|F(\eta) - F(\eta')\|_2 \le \|F(\eta) - F(\eta')\|_1, \qquad (6.44)$$

and

$$\|\eta - \eta'\|_1 \le \sqrt{2n}\|\eta - \eta'\|_2. \qquad (6.45)$$

From (6.43)–(6.45), we can obtain $\|F(\eta) - F(\eta')\|_2 \le k\|\eta - \eta'\|_2$. The proof is completed.
The next theorems show that the estimation errors of the critic and action networks are UUB.

Theorem 6.2 Let the weights of the critic network $\hat W_c$ be updated by (6.21). If the inequality $\left\|\frac{\xi_1^T}{\xi_3}\tilde W_c\right\| > \varepsilon_{HM}$ holds, then the estimation error $\tilde W_c$ converges to the set $\tilde W_c \le \beta_1^{-1}\beta_2T(1 + 2\rho\beta_2\alpha_c)\varepsilon_{HM}$ exponentially, where $\rho > 0$.
Proof Let $\tilde W_c$ be defined as in (6.18). Define the following Lyapunov function candidate

$$L = \frac{1}{2\alpha_c}\tilde W_c^T\tilde W_c. \qquad (6.46)$$

The derivative of (6.46) is given by

$$\dot L = -\tilde W_c^T\frac{\xi_1\xi_1^T}{\xi_3^2}\tilde W_c + \tilde W_c^T\frac{\xi_1\varepsilon_H}{\xi_3^2} \le -\left\|\frac{\xi_1^T}{\xi_3}\tilde W_c\right\|\left(\left\|\frac{\xi_1^T}{\xi_3}\tilde W_c\right\| - \left\|\frac{\varepsilon_H}{\xi_3}\right\|\right). \qquad (6.47)$$

As $\xi_3 \ge 1$, we have $\left\|\frac{\varepsilon_H}{\xi_3}\right\| < \varepsilon_{HM}$. If $\left\|\frac{\xi_1^T}{\xi_3}\tilde W_c\right\| > \varepsilon_{HM}$, then we can get $\dot L \le 0$. That means $L$ decreases and $\|\xi_2^T\tilde W_c\|$ is bounded. In light of [14] and Technical Lemma 2 in [12], $\tilde W_c \le \beta_1^{-1}\beta_2T(1 + 2\rho\beta_2\alpha_c)\varepsilon_{HM}$.
Theorem 6.3 Let the optimal control law be expressed as in (6.33). The weight update laws of the critic and action networks are given as in (6.21) and (6.27), respectively. If there exist parameters l₂ and l₃ that satisfy

l₂ > ||G||², (6.48)

and

l₃ > max{||G||²/λmin(R), (2k + 3)/λmin(Q)}, (6.49)

respectively, then the system state η in (6.4) is UUB and the weights of the critic and action networks, i.e., Ŵc and Ŵa, converge to finite neighborhoods of the optimal ones.
Proof Choose the Lyapunov function candidate as

V = V1 + V2 + V3 , (6.50)

where V₁ = (1/(2αc))W̃cᵀW̃c, V₂ = (l₂/(2αa))tr{W̃aᵀW̃a}, V₃ = ηᵀη + l₃J(η), l₂ > 0, l₃ > 0.
Taking the derivative of the Lyapunov function candidate (6.50), we can get V̇ =
V̇1 + V̇2 + V̇3 . According to (6.22), we have
V̇₁ = −(W̃cᵀξ₂)ᵀW̃cᵀξ₂ + (W̃cᵀξ₂)ᵀ(ε_H/ξ₃). (6.51)

According to (6.31), we can get



V̇₂ = −l₂tr{W̃aᵀϕa(W̃aᵀϕa + (1/2)R⁻¹Gᵀ∇ϕcᵀW̃c − ε₁₂)ᵀ}
   = −l₂(W̃aᵀϕa)ᵀW̃aᵀϕa + l₂(W̃aᵀϕa)ᵀε₁₂ − (l₂/2)(W̃aᵀϕa)ᵀR⁻¹Gᵀ∇ϕcᵀW̃c. (6.52)
The derivative of V3 can be expressed as

V̇3 = (2ηT F − 2ηT G W̃aT ϕa + 2ηT Gν − 2ηT Gεa + 2ηT Gνc ) + l3 (−r (η, ν)).
(6.53)

From Theorem 6.1, we have 2ηT F ≤ 2k ||η||2 . In addition, we can obtain

−2ηT G W̃aT ϕa ≤ ||η||2 + ||G||2 (W̃aT ϕa )T W̃aT ϕa ,


2ηT Gν ≤ ||η||2 + ||G||2 ||ν||2 ,
−2ηT Gεa ≤ ||η||2 + ||G||2 εa2 M . (6.54)

From (6.32), we can get




2ηᵀGνc = 2ηᵀG(−KcGᵀη/(ηᵀη + b)) ≤ −||G||²ε_{aM}². (6.55)

Then, (6.53) can be rewritten as

V̇3 ≤ (2k + 3 − l3 λmin (Q))||η||2 + (||G||2 − l3 λmin (R))||ν||2


+ ||G||2 (W̃aT ϕa )T W̃aT ϕa . (6.56)

Let Z = [ηᵀ, νᵀ, (W̃cᵀξ₂)ᵀ, (W̃aᵀϕa)ᵀ]ᵀ and N_V = [0, 0, ε_H/ξ₃, M_{V4}]ᵀ, where M_{V4} = −(l₂/2)R⁻¹Gᵀ∇ϕcᵀW̃c + l₂ε₁₂, and M_V = diag{(l₃λmin(Q) − 2k − 3), (l₃λmin(R) − ||G||²), 1, (l₂ − ||G||²)}. Thus, we have

V̇ ≤ −Z T MV Z + Z T N V
≤ −||Z ||2 λmin (MV ) + ||Z ||||N V ||. (6.57)

According to (6.48) and (6.49), we can see that if ||Z|| ≥ ||N_V||/λmin(M_V) ≡ Z_B, then V̇ ≤ 0. As M_V and ε_H/ξ₃ are both upper bounded, ||N_V|| is upper bounded. Therefore, the state η and the weight errors W̃c and W̃a are UUB [15].
The proof is completed.

Theorem 6.4 Let the weight updating laws of the critic and the action networks be
given by (6.21) and (6.27), respectively. If Theorem 6.3 holds, then the control law
νall converges to a finite neighborhood of the optimal control law ν ∗ .

Proof From Theorem 6.3, there exist ν̄ > 0 and W̄a > 0 such that lim_{t→∞}||νc|| ≤ ν̄ and lim_{t→∞}||W̃a|| ≤ W̄a. From (6.33), we have

νall − ν ∗ = ν̂ + νc − ν ∗
= ŴaT ϕa + νc − WaT ϕa − εa
= W̃aT ϕa + νc − εa . (6.58)

Therefore, we have

lim_{t→∞}||νall − ν*|| ≤ W̄aϕ_{aM} + ν̄ + ε_{aM}. (6.59)

As W̄aϕ_{aM} + ν̄ + ε_{aM} is finite, we can obtain the conclusion. The proof is completed.

6.4 Simulation Study

Example 6.1 Our first example is chosen as Example 3 in [3] with modifications. Consider the following nonlinear complex-valued harmonic oscillator system

ż = i⁻¹(−2z(z² − 25)/(2z² − 1)) + 10(1 + i)u, (6.60)

where z ∈ C¹, z = zᴿ + izᴵ and u = uᴿ + iuᴵ. The utility function is defined as r̄(z, u) = zᴴQ₁z + uᴴR₁u. Let Q₁ = E and R₁ = E, where E is the identity matrix with a suitable dimension. Let η = [zᴿ, zᴵ]ᵀ and ν = [uᴿ, uᴵ]ᵀ. Let the critic and action networks be expressed as Ĵ(η) = Ŵcᵀϕc(Ycη) and ν̂(η) = Ŵaᵀϕa(Yaη), where Yc and Ya are constant matrices with suitable dimensions. The activation functions of the critic and action networks are hyperbolic tangent functions. The structures of the critic and action networks are 2-8-1 and 2-8-2, respectively. The initial weights of Ŵc and Ŵa are selected arbitrarily from (−0.1, 0.1). The learning rates of the critic and action networks are selected as αc = αa = 0.01. Let

z₀ = −1 + i. Let Kc = 3 and b = 1.02. Implementing the developed ADP method


for 40 time steps, the trajectories of the control and state are shown in Figs. 6.1
and 6.2, respectively. The weights of the critic and action networks converge to
Wc = [−0.0904; 0.0989; −0.0586; 0.0214; −0.0304; 0.0435; −0.0943; −0.0866]
and Wa = [−0.0269, 0.0117; 0.0197, −0.0548; 0.0336, −0.0790; 0.0789, −0.0980;
−0.0825, −0.0881; 0.0078, −0.0354; −0.0143, 0.0558; 0.0234, −0.0329], respec-
tively. From Fig. 6.2, we can see that the system state is UUB, which verifies the
effectiveness of the developed ADP method.
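The block structure of the transformed control channel can be sanity-checked in code: a complex gain such as the g = 10(1 + i) of (6.60) acts on ν = [uᴿ, uᴵ]ᵀ through the real matrix [gᴿ, −gᴵ; gᴵ, gᴿ]. A minimal check (values illustrative):

```python
import numpy as np

# The complex gain g = 10(1+i) acts on u = uR + i*uI; in the enlarged real
# coordinates nu = [uR, uI]^T it becomes the 2x2 block matrix
# G = [[gR, -gI], [gI, gR]] used in the transformed system.
g = 10 * (1 + 1j)
G = np.array([[g.real, -g.imag],
              [g.imag,  g.real]])

u = 0.3 - 0.7j                      # an arbitrary complex control value
nu = np.array([u.real, u.imag])
gu = g * u                          # complex product
Gnu = G @ nu                        # real-matrix counterpart
assert np.allclose(Gnu, [gu.real, gu.imag])
```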

Example 6.2 In the second example, the effectiveness of the developed ADP method
will be justified by a complex-valued Chen system [16]. The system can be expressed
as

ż₁ = −μz₁ + z₂(z₃ + α) + iz₁u₁,
ż₂ = −μz₂ + z₁(z₃ − α) + 10u₂, (6.61)
ż₃ = 1 − 0.5(z̄₁z₂ + z₁z̄₂) + u₃,

where μ = 0.8 and α = 1.8. Let z = [z 1 , z 2 , z 3 ]T ∈ C3 and g(z) = diag(i z 1 , 10, 1).
Let z̄ 1 and z̄ 2 denote the complex conjugate vectors of z 1 and z 2 , respectively. Define
η = [z 1R , z 2R , z 3R , z 1I , z 2I , z 3I ]T and ν = [u 1R , u 2R , u 3R , u 1I , u 2I , u 3I ]T .

The structures of action and critic networks are 6-8-6 and 6-6-1, respectively.
Let the training rules of the neural networks be the same as in Example 6.1. Let

Fig. 6.1 Control trajectories (real(u) and imag(u) versus time steps)



Fig. 6.2 State trajectories (real(z) and imag(z) versus time steps)

Fig. 6.3 Control trajectories (real and imaginary parts of u₁, u₂, u₃ versus time steps)



Fig. 6.4 State trajectories (real and imaginary parts of z₁, z₂, z₃ versus time steps)

z₀ = [1 + i, 1 − i, 0.5]ᵀ. Implementing the developed ADP method for 500 time


steps, the trajectories of the control and state are displayed in Figs. 6.3 and 6.4,
respectively. From the simulation results, we can say that the ADP method developed
in this chapter is effective and feasible.

6.5 Conclusion

In this chapter, an optimal control scheme based on the ADP method has been developed for complex-valued systems for the first time. First, the performance index function is defined based on the complex-valued state and control. Then, system transformations are used to overcome the restrictions of the Cauchy–Riemann conditions. Based on the transformed system and the corresponding performance index function, a new ADP-based optimal control method is established. A compensation controller is presented to compensate for the approximation errors of the neural networks. Finally, simulation examples are given to show the effectiveness of the developed optimal control scheme.

References

1. Adali, T., Schreier, P., Scharf, L.: Complex-valued signal processing: the proper way to deal
with impropriety. IEEE Trans. Signal Process. 59(11), 5101–5125 (2011)

2. Fang, T., Sun, J.: Stability analysis of complex-valued impulsive system. IET Control Theory
Appl. 7(8), 1152–1159 (2013)
3. Yang, C.: Stability and quantization of complex-valued nonlinear quantum systems. Chaos,
Solitons Fractals 42, 711–723 (2009)
4. Hu, J., Wang, J.: Global stability of complex-valued recurrent neural networks with time-delays.
IEEE Trans. Neural Netw. Learn. Syst. 23(6), 853–865 (2012)
5. Huang, S., Li, C., Liu, Y.: Complex-valued filtering based on the minimization of complex-error
entropy. IEEE Trans. Neural Netw. Learn. Syst. 24(5), 695–708 (2013)
6. Hong, X., Chen, S.: Modeling of complex-valued wiener systems using B-spline neural net-
work. IEEE Trans. Neural Netw. 22(5), 818–825 (2011)
7. Goh, S., Mandic, D.: Nonlinear adaptive prediction of complex-valued signals by complex-
valued PRNN. IEEE Trans. Signal Process. 53(5), 1827–1836 (2005)
8. Bolognani, S., Smyshlyaev, A., Krstic, M.: Adaptive output feedback control for complex-
valued reaction-advection-diffusion systems, In: Proceedings of American Control Conference,
Seattle, Washington, USA, pp. 961–966, (2008)
9. Hamagami, T., Shibuya, T., Shimada, S.: Complex-valued reinforcement learning. In: Proceed-
ings of IEEE International Conference on Systems, Man, and Cybernetics, Taipei, Taiwan, pp.
4175–4179 (2006)
10. Paulraj, A., Nabar, R., Gore, D.: Introduction to Space-Time Wireless Communications. Cam-
bridge University Press, Cambridge (2003)
11. Mandic, D.P., Goh, V.S.L.: Complex Valued Nonlinear Adaptive Filters: Noncircularity, Widely
Linear and Neural Models. Wiley, New York (2009)
12. Vamvoudakis, K.G., Lewis, F.L.: Online actor-critic algorithm to solve the continuous-time
infinite horizon optimal control problem. Automatica 46(5), 878–888 (2010)
13. Dierks, T., Jagannathan, S.: Online optimal control of affine nonlinear discrete-time systems
with unknown internal dynamics by using time-based policy update. IEEE Trans. Neural Netw.
Learn. Syst. 23(7), 1118–1129 (2012)
14. Khalil, H.K.: Nonlinear System. Prentice-Hall, Upper Saddle River (2002)
15. Lewis, F.L., Jagannathan, S., Yesildirek, A.: Neural Network Control of Robot Manipulators
and Nonlinear Systems. Taylor & Francis, New York (1999)
16. Mahmoud, G.M., Aly, S.A., Farghaly, A.A.: On chaos synchronization of a complex two
coupled dynamos system. Chaos, Solitons Fractals 33, 178–187 (2007)
Chapter 7
Off-Policy Neuro-Optimal Control
for Unknown Complex-Valued
Nonlinear Systems

This chapter establishes an optimal control scheme for unknown complex-valued systems. Policy iteration (PI) is used to obtain the solution of the Hamilton–Jacobi–Bellman (HJB) equation. Off-policy learning allows the iterative performance index and iterative control to be obtained with completely unknown dynamics. Critic and action networks are used to obtain the iterative control and iterative performance index, and they execute policy evaluation and policy improvement. The asymptotic stability of the closed-loop system and the convergence of the iterative performance index function are proven. By the Lyapunov technique, the weight estimation error is proven to be uniformly ultimately bounded (UUB). A simulation study demonstrates the effectiveness of the proposed optimal control method.

7.1 Introduction

Policy iteration (PI) and the value iteration (VI) are the two main algorithms in ADP
[1, 2]. In [3], VI algorithm for nonlinear systems was proposed, which can find the
optimal control law by iterative performance index function and iterative control
with a zero initial performance index function. When the iteration index increases
to infinity, it is proven that the iterative performance index function is a nondecreas-
ing sequence and bounded. In [4], PI algorithm was proposed, which can obtain the
optimal control by constructing a sequence of stabilizing iterative control. In con-
ventional PI, the system dynamics is necessary. While in many industrial systems,
the dynamics is difficult to be known. Off-policy method, which is based on PI, can
solve the HJB equation by completely unknown dynamics. In [5], H∞ tracking con-
trol of completely unknown continuous-time systems via off-policy reinforcement
learning was discussed. Note that, Ref. [6] studied the optimal control problem for
complex-valued nonlinear systems in the frame of infinite-horizon ADP algorithm
© Science Press, Beijing and Springer Nature Singapore Pte Ltd. 2019 113
R. Song et al., Adaptive Dynamic Programming: Single and Multiple
Controllers, Studies in Systems, Decision and Control 166,
https://doi.org/10.1007/978-981-13-1712-5_7

with known dynamics. In this chapter, we further study the optimal control problem of unknown complex-valued nonlinear systems based on off-policy PI.
In this chapter, we consider the optimal control problem of complex-valued
unknown system. The PI algorithm is used to obtain the solution of the HJB equation.
Off-policy learning allows the iterative performance index and the iterative control
to be obtained with completely unknown dynamics. Action and critic networks are
used to approximate the iterative performance index and iterative control. Therefore, the established method has two steps, policy evaluation and policy improvement, which are executed by the critic and actor, respectively. It is proven that the closed-loop system is asymptotically stable, the iterative performance index function is convergent, and the weight error is uniformly ultimately bounded (UUB).
The rest of the chapter is organized as follows. In Sect. 7.2, the problem motiva-
tions and preliminaries are presented. In Sect. 7.3, the optimal control design scheme
is developed based on PI, and the asymptotic stability is proved. In Sect. 7.4, two
examples are given to demonstrate the effectiveness of the proposed optimal control
scheme. In Sect. 7.5, the conclusion is drawn.

7.2 Problem Statement

Consider a class of complex-valued continuous-time systems as follows:

ż = f (z) + g(z)u, (7.1)

where z ∈ Cn is the complex-valued system state, u ∈ Cm is the complex-valued


control input. f (z) ∈ Cn and g(z) ∈ Cn×m are complex-valued functions.
We want to find an optimal control policy u(t) = u(z(t)), which minimizes the
following performance index function
Ω = ∫₀^∞ (zᴴXz + uᴴYu)dt, (7.2)

where X and Y are positive definite symmetric matrices.


According to [6], the dimension-enlargement method is used to obtain the transformed system. Assume that z = zᴿ + izᴵ, u = uᴿ + iuᴵ, f = fᴿ + ifᴵ and g = gᴿ + igᴵ. Then, system (7.1) can be expressed as

ṡ = Γ (s) + Υ (s)q, (7.3)


where s = [zᴿᵀ, zᴵᵀ]ᵀ, Γ(s) = [fᴿ(zᴿ, zᴵ)ᵀ, fᴵ(zᴿ, zᴵ)ᵀ]ᵀ, Υ(s) = [gᴿ(zᴿ, zᴵ), −gᴵ(zᴿ, zᴵ); gᴵ(zᴿ, zᴵ), gᴿ(zᴿ, zᴵ)] and q = [uᴿᵀ, uᴵᵀ]ᵀ. Therefore, the complex-valued system (7.1) is transformed into the real-valued system (7.3). Although the dimension of system (7.3) increases two-fold,

the complex-valued optimization problem is mapped into the real domain. According to this transformation, the performance index function (7.2) can be expressed as

the transformation, performance index function (7.2) can be expressed as
Ω(s) = ∫_{t}^{∞} (sᵀMs + qᵀNq)dτ, (7.4)

where r (s, q) = s T Ms + q T N q, M = diag(X, X ), and N = diag(Y, Y ).


If the associated cost function Ω is C1 , then an infinitesimal equivalent to (7.4)
is the Bellman equation

0 = r (s, q) + ΩsT (Γ + Υ q), (7.5)

where Ωs = ∂Ω/∂s is taken here as a column vector. Given that q is an admissible
control policy, if Ω satisfies (7.5), with r (s, q) ≥ 0, then it can be shown that Ω is
a Lyapunov function for the system (7.3) with control policy q.
We define Hamiltonian function H :

H (s, q, Ωs ) = r (s, q) + ΩsT (Γ + Υ q). (7.6)

Then the optimal performance index function Ω ∗ satisfies the HJB equation

0 = min_q H(s, q, Ωs*). (7.7)

Assuming that the minimum on the right-hand side of (7.7) exists and is unique,
then the optimal control function is

q* = −(1/2)N⁻¹ΥᵀΩs*. (7.8)
The optimal control problem can now be formulated: Find an admissible control
policy q(t) = q(s(t)) such that the performance index function (7.4) associated with
the system (7.3) is minimized.
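As a concrete illustration of (7.8): for a quadratic value function Ω(s) = sᵀPs, the gradient is Ωs = 2Ps and the optimal control reduces to linear state feedback. The matrices P, N and Υ below are illustrative stand-ins, not taken from this chapter:

```python
import numpy as np

# For a quadratic value Omega(s) = s^T P s we have Omega_s = 2 P s, so (7.8)
# reduces to linear feedback q* = -N^{-1} Upsilon^T P s. All matrices here
# are illustrative.
P = np.array([[2.0, 0.5], [0.5, 1.0]])
N = np.eye(2)
Upsilon = np.array([[0.0, 1.0], [1.0, 0.0]])

def q_star(s):
    omega_s = 2 * P @ s                          # gradient of s^T P s
    return -0.5 * np.linalg.solve(N, Upsilon.T @ omega_s)

print(q_star(np.array([1.0, -2.0])))             # prints [ 1.5 -1. ]
```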

7.3 Off-Policy Optimal Control Method

The PI algorithm given in [7], which can obtain the optimal control, requires the full system dynamics, because both Γ and Υ appear in the Bellman equation (7.5). In
[8–10], the data is used to construct the optimal control. In this section, the off-policy
PI algorithm will be presented to solve (7.5) without the dynamics of system (7.3).
First, the PI algorithm is given as follows. The PI algorithm begins from an
admissible control q [0] , then the iterative performance index function Ω [i] is updated
as

0 = r (s, q [i] ) + Ωs[i]T (Γ + Υ q [i] ) (7.9)

and the iterative control is derived by

q[i+1] = −(1/2)N⁻¹ΥᵀΩs[i]. (7.10)
Based on the PI algorithm (7.9) and (7.10), the off-policy PI algorithm will be
derived. Note that, for any time t and time interval T > 0, the iterative performance index function satisfies
Ω[i](s(t − T)) = ∫_{t−T}^{t} r(s, q[i])dτ + Ω[i](s(t)). (7.11)

To derive the off-policy PI method, we rewrite the system (7.3) as

ṡ = Γ (s) + Υ (s)q [i] + Υ (s)(q − q [i] ), (7.12)

where q [i] is the iterative control.


Then from (7.11) and (7.12), we can get

Ω[i](s(t)) − Ω[i](s(t − T)) = ∫_{t−T}^{t} Ωs[i]ᵀṡ dτ = ∫_{t−T}^{t} Ωs[i]ᵀ(Γ + Υq[i] + Υ(q − q[i]))dτ. (7.13)

According to (7.9), we can have

Ωs[i]T (Γ + Υ q [i] ) = −s T Ms − q [i]T N q [i] . (7.14)

Then (7.13) can be expressed as

Ω[i](s(t)) − Ω[i](s(t − T)) = ∫_{t−T}^{t} (−sᵀMs − q[i]ᵀNq[i])dτ + ∫_{t−T}^{t} Ωs[i]ᵀΥ(q − q[i])dτ. (7.15)

From (7.10), we have

Ωs[i]T Υ = −2q [i+1]T N . (7.16)

Therefore, we can get the off-policy Bellman equation as



Ω[i](s(t)) − Ω[i](s(t − T)) = ∫_{t−T}^{t} (−sᵀMs − q[i]ᵀNq[i])dτ + ∫_{t−T}^{t} (−2q[i+1]ᵀN(q − q[i]))dτ. (7.17)

7.3.1 Convergence Analysis of Off-Policy PI Algorithm

In this subsection, we will analyze the stability of the closed-loop system and the
convergence of the established off-policy PI algorithm.

Theorem 7.1 Let the iterative performance index function Ω [i] satisfy
Ω[i](s) = ∫_{t}^{∞} (sᵀMs + q[i]ᵀNq[i])dτ. (7.18)

Let the iterative control input q [i+1] be obtained from (7.10). Then the closed-loop
system (7.3) is asymptotically stable.

Proof Take Ω[i] as the Lyapunov function candidate. Taking the derivative of Ω[i] along the system ṡ = Γ + Υq[i+1], we have

Ω̇ [i] = Ωs[i]T ṡ = Ωs[i]T Γ + Ωs[i]T Υ q [i+1] . (7.19)

According to (7.9), we can obtain

Ωs[i]T Γ = −Ωs[i]T Υ q [i] − s T Ms − q [i]T N q [i] . (7.20)

Then we have

Ω̇ [i] = Ωs[i]T Υ q [i+1] − Ωs[i]T Υ q [i] − s T Ms − q [i]T N q [i] . (7.21)

From (7.10), we have

Ωs[i]T Υ = −2q [i+1]T N . (7.22)

Then (7.21) can be expressed as

Ω̇ [i] = 2q [i+1]T N (q [i] − q [i+1] ) − s T Ms − q [i]T N q [i] . (7.23)

Since N is symmetric positive definite, there exist a diagonal matrix Λ and an orthogonal symmetric matrix B such that N = BΛB, where the elements of Λ are the singular values of N. Define y[i] = Bq[i]; then q[i] = B⁻¹y[i]. Therefore, (7.23) is expressed as

Ω̇[i] = 2y[i+1]ᵀΛ(y[i] − y[i+1]) − sᵀMs − y[i]ᵀΛy[i]
     = 2y[i+1]ᵀΛy[i] − 2y[i+1]ᵀΛy[i+1] − sᵀMs − y[i]ᵀΛy[i]. (7.24)

Since Λkk is a singular value and Λkk > 0, we get

Ω̇[i] = ∑_{k=1}^{m} Λkk(2y_k[i+1]y_k[i] − 2y_k[i+1]y_k[i+1] − y_k[i]y_k[i]) − sᵀMs < 0. (7.25)

Therefore, it is clear that the closed-loop system is asymptotically stable under


each iterative control input.
This completes the proof.

Theorem 7.2 Define the iterative performance index Ω[i] satisfying (7.9), with q[i+1] obtained from (7.10); then Ω[i] is non-increasing in i, i.e., Ω*(s) ≤ Ω[i+1] ≤ Ω[i].

Proof According to (7.9), we have

Ωs[i]T (Γ + Υ q [i] ) = −s T Ms − q [i]T N q [i] . (7.26)

From (7.26), we can get

Ωs[i]ᵀΓ + Ωs[i]ᵀΥq[i+1] = −sᵀMs − q[i]ᵀNq[i] − Ωs[i]ᵀΥq[i] + Ωs[i]ᵀΥq[i+1]. (7.27)

From (7.22), (7.27) is written as

Ωs[i]T Γ + Ωs[i]T Υ q [i+1]


= − q [i]T N q [i] + 2q [i+1]T N (q [i] − q [i+1] ) − s T Ms
= − q [i]T N q [i] + 2q [i+1]T N q [i] − 2q [i+1]T N q [i+1] − s T Ms. (7.28)

Furthermore, consider Ω [i+1] and Ω [i] taking the derivatives along system ṡ =
Γ + Υ q [i+1] , respectively. Then we can obtain
Ω[i+1](s) − Ω[i](s) = −∫_{0}^{∞} (d(Ω[i+1] − Ω[i])/ds)ᵀ(Γ + Υq[i+1])dτ
= −∫_{0}^{∞} Ωs[i+1]ᵀ(Γ + Υq[i+1])dτ + ∫_{0}^{∞} Ωs[i]ᵀ(Γ + Υq[i+1])dτ
= −∫_{0}^{∞} Ωs[i+1]ᵀ(Γ + Υq[i+1])dτ − ∫_{0}^{∞} (sᵀMs + q[i]ᵀNq[i] − 2q[i+1]ᵀNq[i] + 2q[i+1]ᵀNq[i+1])dτ. (7.29)

From (7.26), we can have

Ωs[i+1]T (Γ + Υ q [i+1] ) = −s T Ms − q [i+1]T N q [i+1] . (7.30)

Then (7.29) is written as

Ω[i+1](s) − Ω[i](s) = ∫_{0}^{∞} (−q[i]ᵀNq[i] + 2q[i+1]ᵀNq[i] − q[i+1]ᵀNq[i+1])dτ. (7.31)

According to the proof of Theorem 7.1 and (7.31), we have

Ω[i+1](s) − Ω[i](s) = −∫_{0}^{∞} ∑_{k=1}^{m} Λkk(y_k[i]y_k[i] − 2y_k[i+1]y_k[i] + y_k[i+1]y_k[i+1])dτ ≤ 0. (7.32)

Moreover, according to the definition of Ω*, we have Ω* ≤ Ω[i+1]. Therefore, it is concluded that Ω* ≤ Ω[i+1] ≤ Ω[i].
This completes the proof.
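The monotonicity of Theorem 7.2 can be observed on a linear special case of (7.3), where policy evaluation (7.9) becomes a Lyapunov equation and policy improvement follows (7.10) (a Kleinman-style iteration). The system matrices below are illustrative, and the Lyapunov solver is a small vectorization-based sketch:

```python
import numpy as np

# Kleinman-style PI on a linear special case s_dot = A s + B q with cost
# integrand s^T M s + q^T N q. Policy evaluation solves a Lyapunov equation,
# policy improvement mirrors (7.10); the cost matrix P should be
# non-increasing in i, as in Theorem 7.2. A, B here are illustrative.
A = np.array([[0.0, 1.0], [-1.0, -2.0]])
B = np.array([[0.0], [1.0]])
M, N = np.eye(2), np.eye(1)

def lyap(Acl, Q):
    # Solve Acl^T P + P Acl + Q = 0 by vectorizing the linear map.
    n = Acl.shape[0]
    I = np.eye(n)
    lhs = np.kron(I, Acl.T) + np.kron(Acl.T, I)
    return np.linalg.solve(lhs, -Q.reshape(-1)).reshape(n, n)

K = np.zeros((1, 2))                    # A is Hurwitz, so K = 0 is admissible
P_prev = None
for i in range(10):
    P = lyap(A - B @ K, M + K.T @ N @ K)           # policy evaluation
    if P_prev is not None:                          # monotone: P_prev >= P
        assert np.min(np.linalg.eigvalsh(P_prev - P)) >= -1e-9
    K = np.linalg.solve(N, B.T @ P)                 # policy improvement
    P_prev = P
```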

7.3.2 Implementation Method of Off-Policy Iteration Algorithm

From the off-policy Bellman equation (7.17), we can see that the system dynamics is
not required. Therefore, the off-policy PI algorithm depends on the system state s and
iterative control q [i] . In (7.17), the function structures Ω [i] and q [i] are unknown, so in
the following subsection the critic and action networks are presented to approximate
the iterative performance index function and the iterative control.
The critic network is expressed as

Ω[i](s) = WΩ[i]ᵀφΩ(s) + εΩ[i], (7.33)

where WΩ[i] is the ideal weight of the critic network, φΩ(s) is the activation function, and εΩ[i] is the residual error. The estimation of Ω[i](s) is given as follows:

Ω̂ [i] (s) = ŴΩ[i]T φΩ (s), (7.34)

where ŴΩ[i] is the estimation of WΩ[i] .



The action network is given as follows:

q [i] (s) = Wq[i]T φq (s) + εq[i] , (7.35)

where Wq[i] is the ideal weight of the action network, φq(s) is the activation function, and εq[i] is the residual error. Accordingly, the estimation of q[i](s) is given as follows:

q̂ [i] (s) = Ŵq[i]T φq (s). (7.36)

Then according to (7.17), we can define


e_H[i] = Ω̂[i](s(t)) − Ω̂[i](s(t − T)) + ∫_{t−T}^{t} (sᵀMs + q̂[i]ᵀNq̂[i])dτ + ∫_{t−T}^{t} 2q̂[i+1]ᵀN(q − q̂[i])dτ. (7.37)

From (7.34) we have

Ω̂[i](s(t)) − Ω̂[i](s(t − T)) = ŴΩ[i]ᵀΔφΩ = (ΔφΩᵀ ⊗ I)vec(ŴΩ[i]ᵀ) = (ΔφΩᵀ ⊗ I)ŴΩ[i], (7.38)

where ΔφΩ = φΩ(s(t)) − φΩ(s(t − T)).
According to (7.36), we have

2q̂[i+1]ᵀN(q − q̂[i]) = 2(Ŵq[i+1]ᵀϕq)ᵀN(q − q̂[i]) = 2ϕqᵀŴq[i+1]N(q − q̂[i]) = 2(((q − q̂[i])ᵀN) ⊗ ϕqᵀ)vec(Ŵq[i+1]). (7.39)

Then, (7.37) can be expressed as


e_H[i] = (ΔφΩᵀ ⊗ I)ŴΩ[i] + ∫_{t−T}^{t} (sᵀMs + q̂[i]ᵀNq̂[i])dτ + [∫_{t−T}^{t} 2(((q − q̂[i])ᵀN) ⊗ ϕqᵀ)dτ] vec(Ŵq[i+1]), (7.40)

i.e.,

e_H[i] = [(ΔφΩᵀ ⊗ I), ∫_{t−T}^{t} 2(((q − q̂[i])ᵀN) ⊗ ϕqᵀ)dτ] [ŴΩ[i]; vec(Ŵq[i+1])] + ∫_{t−T}^{t} (sᵀMs + q̂[i]ᵀNq̂[i])dτ. (7.41)
Define C̄[i] = [(ΔφΩᵀ ⊗ I), ∫_{t−T}^{t} 2(((q − q̂[i])ᵀN) ⊗ ϕqᵀ)dτ] and D̄[i] = ∫_{t−T}^{t} (sᵀMs + q̂[i]ᵀNq̂[i])dτ. Then

e_H[i] = C̄[i]Ŵ[i] + D̄[i]. (7.42)

Let E[i] = 0.5e_H[i]ᵀe_H[i] and Ŵ[i] = [ŴΩ[i]ᵀ, vec(Ŵq[i+1])ᵀ]ᵀ. Then the update method for the weights of the critic and action networks is

Ŵ˙[i] = −ηW C̄[i]ᵀ(C̄[i]Ŵ[i] + D̄[i]), (7.43)

where ηW > 0.
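The gradient rule (7.43) can be exercised on a synthetic batch: with C̄ and D̄ replaced by random stand-ins (not data from an actual trajectory), the update should drive Ŵ to the weight vector that zeros the residual C̄Ŵ + D̄:

```python
import numpy as np

# Gradient rule (7.43) on a synthetic batch: e = C_bar @ W_hat + D_bar.
# C_bar and D_bar are random stand-ins for the data matrices; the rule
# should recover the weight vector that zeros the residual.
rng = np.random.default_rng(1)
C_bar = rng.standard_normal((20, 4))
W_true = np.array([1.0, -2.0, 0.5, 3.0])
D_bar = -C_bar @ W_true                      # so C_bar @ W_true + D_bar = 0

eta = 0.5 / np.linalg.norm(C_bar, 2) ** 2    # step size below 1/sigma_max^2
W_hat = np.zeros(4)
for _ in range(2000):
    W_hat -= eta * C_bar.T @ (C_bar @ W_hat + D_bar)

assert np.allclose(W_hat, W_true, atol=1e-6)
```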
In the following theorem, we will prove that the proposed implementation method
is convergent.

Theorem 7.3 Let the control input q[i] be given by (7.36), and let the updating methods for the critic and action networks be as in (7.43). Define the weight error as W̃[i] = W[i] − Ŵ[i]; then, for every iterative step, W̃[i] is UUB.

Proof Let the Lyapunov function candidate be as follows:

Σ = (l₁/(2ηW))W̃[i]ᵀW̃[i], (7.44)

where l1 > 0.
According to (7.43), we can obtain

W̃˙ [i] = ηW C̄ [i]T (C̄ [i] (W [i] − W̃ [i] ) + D̄ [i] )


= −ηW C̄ [i]T C̄ [i] W̃ [i] + ηW C̄ [i]T (C̄ [i] W [i] + D̄ [i] ). (7.45)

Then we have

Σ̇ = −l₁W̃[i]ᵀC̄[i]ᵀC̄[i]W̃[i] + l₁W̃[i]ᵀC̄[i]ᵀ(C̄[i]W[i] + D̄[i])
  ≤ −l₁||C̄[i]||²||W̃[i]||² + ||W̃[i]||² + ||S[i]||²
  = (−l₁||C̄[i]||² + 1)||W̃[i]||² + ||S[i]||², (7.46)

where S[i] = (1/2)||l₁C̄[i]ᵀC̄[i]W[i]||² + (1/2)||l₁C̄[i]ᵀD̄[i]||².

Therefore, if there exists l₁ satisfying l₁||C̄[i]||² > 1, and

||W̃[i]|| > ||S[i]||/√(l₁||C̄[i]||² − 1) (7.47)

holds, then Σ̇ < 0, which indicates that W̃[i] is UUB.


The proof is completed.

7.3.3 Implementation Process

In this part, the implementation process is presented.


Step 1: Select the initial state s and an initial admissible control. Define the matrices M and N in the performance index function.
Step 2: Let i = 1, 2, . . .; select the initial weights ŴΩ and Ŵq of the critic and action networks.
Step 3: Update the critic and action weights according to (7.43).
Step 4: If the output of the critic network satisfies ||Ω̂[i+1] − Ω̂[i]|| < θ, then go to Step 5. Else, go to Step 3.
Step 5: Stop.
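Steps 1–5 can be run end-to-end on a scalar linear special case ṡ = as + bq, for which the critic Ω̂ = ps² and actor q̂ = ks make each pass of the off-policy Bellman equation (7.17) linear in the unknowns (p_i, k_{i+1}). All numbers below (system, cost, exploration signal) are illustrative; for this instance the analytic optimum is p* = √2 − 1:

```python
import numpy as np

# Steps 1-5 on the scalar system s_dot = a*s + b*q with cost weights (M, N).
# Critic: Omega_hat = p*s^2; actor: q_hat = k*s. Each pass of the off-policy
# Bellman equation (7.17) is then linear in (p_i, k_{i+1}) and is solved by
# least squares over one batch of exploratory data.
a, b, M, N = -1.0, 1.0, 1.0, 1.0
dt, T, n_int = 1e-3, 0.05, 200          # Euler step, interval length, count
steps = int(T / dt)

# Steps 1-2: collect data once with an exploratory behavior control.
t_grid = dt * np.arange(n_int * steps + 1)
q_traj = 0.5 * np.sin(10 * t_grid) + 0.3 * np.cos(3 * t_grid)
s_traj = np.empty_like(t_grid)
s_traj[0] = 1.0
for j in range(len(t_grid) - 1):
    s_traj[j + 1] = s_traj[j] + dt * (a * s_traj[j] + b * q_traj[j])

k = 0.0                                  # admissible initial gain (a < 0)
for i in range(8):                       # Steps 3-4: iterate on the data
    Phi, rhs = [], []
    for j in range(n_int):
        sl = slice(j * steps, j * steps + steps + 1)
        s, q = s_traj[sl], q_traj[sl]
        phi1 = s[-1] ** 2 - s[0] ** 2                      # Omega difference
        phi2 = dt * np.sum(2 * N * (q[:-1] - k * s[:-1]) * s[:-1])
        Phi.append([phi1, phi2])
        rhs.append(-dt * np.sum((M + N * k ** 2) * s[:-1] ** 2))
    p, k = np.linalg.lstsq(np.array(Phi), np.array(rhs), rcond=None)[0]

# Analytic optimum for this instance: p* = sqrt(2) - 1, k* = -b*p*/N.
assert abs(p - (np.sqrt(2) - 1)) < 0.05
assert abs(k + (np.sqrt(2) - 1)) < 0.05
```

Note that the dynamics (a, b) are used only to generate the synthetic data; the least-squares step itself sees nothing but states and controls, which is the point of the off-policy formulation.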

7.4 Simulation Study

In this section, two examples are presented to demonstrate the effectiveness of the
proposed optimal control method.

Example 7.1 In this chapter, we first consider the following nonlinear complex-
valued system [11]

ż = f (z) + u, (7.48)

where

f(z) = −[8, 0; 0, 6]z + [2 + 3i, 3 − i; 4 − 2i, 1 + 2i]f̄(z), (7.49)

and

f̄_j(z) = (1 − exp(−s_j))/(1 + exp(−s_j)) + i/(1 + exp(−y_j)), j = 1, 2. (7.50)

Fig. 7.1 State trajectories (z₁ᴿ, z₂ᴿ, z₁ᴵ, z₂ᴵ versus time steps)

According to the transformation in (7.3), it can be obtained that s = [z₁ᴿ, z₂ᴿ, z₁ᴵ, z₂ᴵ]ᵀ, Γ = [f₁ᴿ, f₂ᴿ, f₁ᴵ, f₂ᴵ]ᵀ and q = [u₁ᴿ, u₂ᴿ, u₁ᴵ, u₂ᴵ]ᵀ.
For the infinite-horizon optimal control problem, the utility function is defined as r̄(z, u(z)) = zᴴXz + uᴴYu, where X = Y = I. The real part and the imaginary part of the activation functions in the critic network and the action network are sigmoid functions. The input of the networks is s. In the implementation procedure, the initial weights of the critic and action networks are selected in [−0.5, 0.5]. After 50 time steps, the critic network weight ŴΩ and the action network weight Ŵq are convergent, and the optimal control is obtained. The state and control trajectories are shown in Figs. 7.1 and 7.2. We can see that the closed-loop system is asymptotically stable.
Example 7.2 Consider the following complex-valued oscillator system with modifications [6]

ż₁ = z₁ + z₂ − z₁(z₁² + z₂²) + u₁,
ż₂ = −z₁ + z₂ − z₂(z₁² + z₂²) + u₂, (7.51)

where z = [z₁, z₂]ᵀ ∈ C², u = [u₁, u₂]ᵀ ∈ C², z_j = z_jᴿ + iz_jᴵ and u_j = u_jᴿ + iu_jᴵ, j = 1, 2. Let s = [z₁ᴿ, z₂ᴿ, z₁ᴵ, z₂ᴵ]ᵀ and q = [u₁ᴿ, u₂ᴿ, u₁ᴵ, u₂ᴵ]ᵀ. According to the proposed method, the critic and action networks are designed to approximate the performance index function and the control. The initial weights are selected in [−0.5, 0.5]. The activation functions of the critic and action networks are sigmoid functions. After 100 time steps, the critic network weight ŴΩ and the action network weight Ŵq are convergent, and the optimal control is achieved. The state and control trajectories are demonstrated in Figs. 7.3 and 7.4; they are all convergent.

Fig. 7.2 Control trajectories (u₁ᴿ, u₂ᴿ, u₁ᴵ, u₂ᴵ versus time steps)

Fig. 7.3 State trajectories (z₁ᴿ, z₂ᴿ, z₁ᴵ, z₂ᴵ versus time steps)



Fig. 7.4 Control trajectories (u₁ᴿ, u₂ᴿ, u₁ᴵ, u₂ᴵ versus time steps)

7.5 Conclusion

An optimal control method for unknown complex-valued systems is proposed based on the PI algorithm. Off-policy learning is used to solve the HJB equation with completely unknown dynamics. Critic and action networks are used to obtain the iterative performance index and iterative control. The asymptotic stability of the closed-loop system, the convergence of the iterative performance index function, and the UUB property of the weight error are proven. A simulation study demonstrates the effectiveness of the proposed optimal control method.

References

1. Sutton, R., Barto, A.: Reinforcement Learning: An Introduction, A Bradford Book. The MIT
Press, Cambridge (2005)
2. Lewis, F., Vrabie, D.: Reinforcement learning and adaptive dynamic programming for feedback
control. IEEE Circuits Syst. Mag. 9(3), 32–50 (2009)
3. Al-Tamimi, A., Lewis, F., Abu-Khalaf, M.: Discrete-time nonlinear HJB solution using approx-
imate dynamic programming: convergence proof. IEEE Trans. Syst. Man Cybern. B Cybern.
38(4), 943–949 (2008)
4. Murray, J., Cox, C., Lendaris, G., Saeks, R.: Adaptive dynamic programming. IEEE Trans.
Syst. Man Cybern. Syst. 32(2), 140–153 (2002)
5. Modares, H., Lewis, F., Jiang, Z.: H∞ tracking control of completely unknown continuous-time
systems via off-policy reinforcement learning. IEEE Trans. Neural Netw. Learn. Syst. 26(10),
2550–2562 (2015)

6. Song, R., Xiao, W., Zhang, H., Sun, C.: Adaptive dynamic programming for a class of complex-
valued nonlinear systems. IEEE Trans. Neural Netw. Learn. Syst. 25(9), 1733–1739 (2014)
7. Wang, J., Xu, X., Liu, D., Sun, Z., Chen, Q.: Self-learning cruise control using kernel-based
least squares policy iteration. IEEE Trans. Control Syst. Technol. 22(3), 1078–1087 (2014)
8. Luo, B., Wu, H., Huang, T., Liu, D.: Data-based approximate policy iteration for affine nonlinear
continuous-time optimal control design. Automatica 50(12), 3281–3290 (2014)
9. Modares, H., Lewis, F.: Linear quadratic tracking control of partially-unknown continuous-time
systems using reinforcement learning. IEEE Trans. Autom. Control 59, 3051–3056 (2014)
10. Kiumarsi, B., Lewis, F., Modares, H., Karimpur, A., Naghibi-Sistani, M.: Reinforcement Q-
learning for optimal tracking control of linear discrete-time systems with unknown dynamics.
Automatica 50(4), 1167–1175 (2014)
11. Abu-Khalaf, M., Lewis, F.: Nearly optimal control laws for nonlinear systems with saturating actuators using a neural network HJB approach. Automatica 41, 779–791 (2005)
Chapter 8
Approximation-Error-ADP-Based
Optimal Tracking Control for Chaotic
Systems

In this chapter, an optimal tracking control scheme is proposed for a class of discrete-time chaotic systems using the approximation-error-based ADP algorithm. Via a system transformation, the optimal tracking problem is transformed into an optimal regulation problem, and then the novel optimal tracking control method is proposed. It is shown that, for the iterative ADP algorithm with finite approximation error, the iterative performance index functions can converge to a finite neighborhood of the greatest lower bound of all performance index functions under some convergence conditions. Two examples are given to demonstrate the validity of the proposed optimal tracking control scheme for chaotic systems.

8.1 Introduction

During recent years, the control problem of chaotic systems has received considerable attention [1, 2]. Many different methods have been applied theoretically and experimentally to control chaotic systems, such as the adaptive synchronization control method [3] and the impulsive control method [4]. However, the methods mentioned above just focus on designing controllers for chaotic systems. Few of them consider the optimal tracking control problem, which is an important index of chaotic systems.
Although the iterative ADP algorithm has improved greatly in the control field, how to solve the optimal tracking control problem for chaotic systems based on the ADP algorithm is still an open problem. The reason is that in most implementation methods of ADP algorithms, the accurate optimal control and performance index function are not obtained, because of the approximation error between the approximate function and the expected one, and this approximation error will influence the control performance of the chaotic systems. So the approximation error needs to be considered in the ADP algorithm. This motivates our research. In this chapter, we propose an approximation-error ADP algorithm to deal with the optimal tracking control problem for chaotic systems. First, the optimal tracking problem of the chaotic
© Science Press, Beijing and Springer Nature Singapore Pte Ltd. 2019 127
R. Song et al., Adaptive Dynamic Programming: Single and Multiple
Controllers, Studies in Systems, Decision and Control 166,
https://doi.org/10.1007/978-981-13-1712-5_8

system is transformed into the optimal regulation problem, and the corresponding
performance index function is defined. Then, the approximation-error ADP algo-
rithm is established. It is proved that the ADP algorithm with approximation error
makes the iterative performance index functions converge to a finite neighborhood
of the optimal performance index function. Finally, two simulation examples are
given to show the effectiveness of the proposed optimal tracking control algorithm
for chaotic systems.
This chapter is organized as follows. In Sect. 8.2, we present the problem for-
mulation. In Sect. 8.3, the optimal tracking control scheme is developed and the
convergence proof is given. In Sect. 8.4, two examples are given to demonstrate the
effectiveness of the proposed tracking control scheme. In Sect. 8.5, the conclusion is
given.

8.2 Problem Formulation and Preliminaries

Consider the MIMO chaotic dynamic system which can be represented by the fol-
lowing form:

    x1(k + 1) = f1(x(k)) + Σ_{j=1}^{m} g1j uj(k),
        ⋮
    xn(k + 1) = fn(x(k)) + Σ_{j=1}^{m} gnj uj(k),                        (8.1)

where x(k) = [x1(k), x2(k), . . . , xn(k)]T is the system state vector, which is assumed
to be available for measurement, u(k) = [u1(k), u2(k), . . . , um(k)]T is the control
input, and gij, i = 1, 2, . . . , n, j = 1, 2, . . . , m, are the constant control gains. If we denote
f(x(k)) = [f1(x(k)), f2(x(k)), . . . , fn(x(k))]T and
        ⎡ g11 g12 · · · g1m ⎤
    g = ⎢  ⋮    ⋮        ⋮  ⎥ ,                                          (8.2)
        ⎣ gn1 gn2 · · · gnm ⎦

then the chaotic system (8.1) can be rewritten as

x(k + 1) = F(x(k), u(k)) = f (x(k)) + gu(k). (8.3)

In fact, system (8.3) covers a large family of systems: many chaotic systems can be
described as in (8.3), such as the Hénon map [5], the new discrete chaotic system
proposed in [6], and many others.

The objective of this chapter is to construct an optimal tracking controller, such
that the system state x tracks the reference signal xd, and all the signals in the
closed-loop system remain bounded. To meet the objective, we make the following
assumption.

Assumption 8.1 The desired trajectory signal xd is continuous, bounded, and
available for measurement. Furthermore, the desired trajectory xd has the follow-
ing form [7]:

xd (k + 1) = f (xd (k)) + gu d (k). (8.4)

Thus we define the tracking error as follows:

e(k + 1) = x(k + 1) − xd (k + 1). (8.5)

According to Refs. [8, 9], we define the control error

w(k) = u(k) − u d (k), (8.6)

and w(k) = 0, for k < 0, where u d (k) denotes the expected control, which can be
given as

u d (k) = g T (gg T + ε0 )−1 (xd (k + 1) − f (xd (k))), (8.7)

where ε0 is a constant matrix, which is used to guarantee the existence of the matrix
inverse.
Then the tracking error is obtained as follows:

e(k + 1) = x(k + 1) − xd (k + 1)
= f (x(k)) − f (xd (k)) + gw(k)
= f (e(k) + xd (k)) − f (xd (k)) + gw(k). (8.8)

Remark 8.1 Actually, the term gT(ggT + ε0)−1 in Eq. (8.7) is the generalized
inverse of g. As g ∈ Rn×m is not square in general, the generalized inverse technique
is used to obtain the inverse of g.
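As a quick illustration of (8.7), the expected control can be computed with standard linear algebra. The sketch below is only illustrative (the dynamics f, the gain matrix g, and the regularization ε0 are placeholder choices, not the chaotic systems of Sect. 8.4); it forms u_d(k) and checks that it reproduces the desired dynamics (8.4):

```python
import numpy as np

def expected_control(g, eps0, xd_next, f_xd):
    """u_d(k) = g^T (g g^T + eps0)^(-1) (x_d(k+1) - f(x_d(k))), Eq. (8.7)."""
    return g.T @ np.linalg.solve(g @ g.T + eps0, xd_next - f_xd)

# Illustrative 2-state / 2-input example; g, f, eps0 are placeholder choices.
g = np.array([[0.2, 0.0], [0.0, 0.5]])
eps0 = 1e-6 * np.eye(2)                 # small regularization matrix
f = lambda x: np.array([x[1], -0.5 * x[0]])

xd = np.array([1.0, 0.5])               # x_d(k)
xd_next = np.array([0.9, 0.6])          # x_d(k+1)
ud = expected_control(g, eps0, xd_next, f(xd))

# Check: with this u_d, x_d(k+1) ≈ f(x_d(k)) + g u_d(k), as in (8.4)
print(np.allclose(f(xd) + g @ ud, xd_next, atol=1e-4))
```

With ε0 small, g u_d(k) recovers xd(k + 1) − f(xd(k)) up to the regularization error.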

To solve the optimal tracking control problem, we present the following perfor-
mance index function


    J(e(k), w(k)) = Σ_{l=k}^{∞} U(e(l), w(l)),                           (8.9)

where U (e(k), w(k)) > 0, ∀e(k), w(k), is the utility function. In this chapter, we
define

U (e(k), w(k)) = eT (k)Qe(k) + (w(k) − w(k − 1))T R(w(k) − w(k − 1)), (8.10)

where Q and R are both diagonal positive definite matrices. In (8.10), the first term
penalizes the tracking error, and the second term penalizes the difference of the
control error.
According to Bellman’s principle of optimality, the optimal performance index
function satisfies the HJB equation as follows:

    J*(e(k)) = inf_{w(k)} {U(e(k), w(k)) + J*(e(k + 1))},                (8.11)

and the optimal control error w*(k) is

    w*(k) = arg inf_{w(k)} J(e(k), w(k)).                                (8.12)

Thus, the HJB equation (8.11) can be written as

J ∗ (e(k)) = U (e(k), w∗ (k)) + J ∗ (e(k + 1)), (8.13)

and the optimal tracking control is

u ∗ (k) = w∗ (k) + u d (k). (8.14)

We can see that to obtain the optimal tracking control u*(k), we must obtain w*(k)
and the optimal performance index function J*(e(k)). Generally, J*(e(k)) is unknown
before all the control errors w*(k) are considered. If we adopt the traditional dynamic
programming method to obtain the optimal performance index function, then we have
to face the “curse of dimensionality”. This makes the optimal control nearly impossible
to obtain from the HJB equation. So, in the next part, we present an iterative ADP
algorithm based on Bellman’s principle of optimality.

8.3 Optimal Tracking Control Scheme Based on
Approximation-Error ADP Algorithm

8.3.1 Description of Approximation-Error ADP Algorithm

First, for ∀e(k), let the initial function Υ (e(k)) be an arbitrary function that satisfies
Υ (e(k)) ∈ Ῡe(k) , where Ῡe(k) is defined as follows.

Definition 8.1 Let

Ῡe(k) = {Υ (e(k)) : Υ (e(k)) > 0, and Υ (e(k + 1)) < Υ (e(k))} (8.15)

be the initial positive definite function set, where e(k + 1) = F(x(k), u(k)) −
xd(k + 1).
For ∀e(k), let the initial performance index function J [0] (e(k)) = θ̂Υ (e(k)), where
θ̂ > 0 is a large enough finite positive constant. The initial iterative control policy
w[0] (k) can be computed as follows:

    w[0](k) = arg inf_{w(k)} {U(e(k), w(k)) + J[0](e(k + 1))},           (8.16)

where J [0] (e(k + 1)) = θ̂Υ (e(k + 1)). The performance index function can be
updated as

J [1] (e(k)) = U (e(k), w[0] (k)) + J [0] (e(k + 1)). (8.17)

For i = 1, 2, . . ., ∀ e(k), the iterative ADP algorithm will iterate between


    w[i](e(k)) = arg inf_{w(k)} {U(e(k), w(k)) + J[i](e(k + 1))},        (8.18)

and the iterative performance index functions


    J[i+1](e(k)) = inf_{w(k)} {U(e(k), w(k)) + J[i](e(k + 1))}.          (8.19)
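To make the iteration (8.16)–(8.19) concrete, the following sketch runs the exact iterative ADP updates on a deliberately simple assumed example: a scalar linear error system e(k + 1) = ae(k) + gw(k) with quadratic utility, discretized on grids. None of this comes from the chapter's chaotic examples; it only illustrates the update structure, with J[0] = θ̂Υ(e) = θ̂qe²:

```python
import numpy as np

# Assumed scalar error dynamics e(k+1) = a*e + g*w, utility U = q*e^2 + r*w^2
a, g, q, r, theta_hat = 0.8, 1.0, 1.0, 1.0, 6.0

E = np.linspace(-2.0, 2.0, 201)        # grid of error states e(k)
W = np.linspace(-2.0, 2.0, 401)        # grid of candidate controls w(k)

J = theta_hat * q * E**2               # J[0](e) = theta_hat * Upsilon(e)
for i in range(60):
    # Evaluate U(e,w) + J[i](e(k+1)) for every (e, w) pair, then minimize
    E_next = a * E[:, None] + g * W[None, :]      # successor states e(k+1)
    J_next = np.interp(E_next, E, J)              # J[i] at successor states
    Q_val = q * E[:, None]**2 + r * W[None, :]**2 + J_next
    J_new = Q_val.min(axis=1)                     # update (8.19)
    if np.max(np.abs(J_new - J)) < 1e-10:
        break
    J = J_new

w_opt = W[Q_val.argmin(axis=1)]                   # greedy policy, as in (8.18)
print(J[150])                                     # converged cost at e = 1.0
```

Since J[0] upper-bounds the optimal cost, the iterates decrease monotonically toward it, matching the convergence behavior analyzed in Sect. 8.3.2.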

In fact, the exact iterative control policy w[i](k) and the iterative performance
index function J[i](e(k)) are generally impossible to obtain. For example, if
neural networks are used to implement the iterative ADP algorithm, then no matter
what kind of neural network we choose, an approximation error between the output
of the neural network and the expected output must exist. Because of this
approximation error, the exact iterative control policy of the chaotic system can-
not generally be obtained. So the iterative ADP algorithm with finite approximation
error is expressed as follows:

    ŵ[0](e(k)) = arg inf_{w(k)} {U(e(k), w(k)) + Ĵ[0](e(k + 1))} + α[0](e(k)),   (8.20)

where Jˆ[0] (e(k + 1)) = θ̂Υ (e(k + 1)). The performance index function can be
updated as

    Ĵ[1](e(k)) = U(e(k), ŵ[0](k)) + Ĵ[0](e(k + 1)) + β[0](e(k)),         (8.21)

where α[0] (e(k)) and β [0] (e(k)) are the approximation error functions of the iterative
control and iterative performance index function, respectively.
For i = 1, 2, . . ., the iterative ADP algorithm will iterate between

    ŵ[i](e(k)) = arg inf_{w(k)} {U(e(k), w(k)) + Ĵ[i](e(k + 1))} + α[i](e(k)),   (8.22)

and the performance index function

    Ĵ[i+1](e(k)) = inf_{w(k)} {U(e(k), w(k)) + Ĵ[i](e(k + 1))} + β[i](e(k)),   (8.23)

where α[i] (e(k)) and β [i] (e(k)) are the approximation error functions of the iterative
control and iterative performance index function, respectively.

Remark 8.2 The proposed approximation-error-ADP-based method for chaotic sys-
tems is a development of the traditional ADP method. In the traditional ADP
method, approximation errors arise in the process of implementation, but the
analyses of the traditional ADP method disregard them. So, in this chapter, the
approximation error is considered in the algorithm analysis, and the corresponding
convergence proof is given. In the following theorems, we will prove that the
iterative performance index function with approximation error is convergent if
certain conditions are satisfied.

8.3.2 Convergence Analysis of the Iterative ADP Algorithm

In fact, as the iteration index i → ∞, the bound on the accumulated approximation
error may also increase to infinity, although the approximation error in each single
iteration is finite [10]. The following theorem shows this property.

Theorem 8.1 Let e(k) be an arbitrary controllable state. For i = 1, 2, . . ., define a
new iterative performance index function as

    Δ[i](e(k)) = inf_{w(k)} {U(e(k), w(k)) + Ĵ[i−1](e(k + 1))},          (8.24)

where Ĵ[i−1](e(k + 1)) is defined in (8.23), and w(k) can be obtained accurately. If
the initial iterative performance index function Ĵ[0](e(k)) = Δ[0](e(k)), and there
exists a finite constant ε1 that makes

    |Ĵ[i](e(k)) − Δ[i](e(k))| ≤ ε1                                       (8.25)

hold uniformly, then we have

    |Ĵ[i](e(k)) − J[i](e(k))| ≤ iε1,                                     (8.26)

where ε1 is called the uniform finite approximation error.

Proof The theorem can be proved by mathematical induction. First, let i = 1. We
have

    Δ[1](e(k)) = inf_{w(k)} {U(e(k), w(k)) + Ĵ[0](e(k + 1))}
               = J[1](e(k)).                                             (8.27)

Then, according to (8.25), we can get

    |Ĵ[1](e(k)) − J[1](e(k))| ≤ ε1.                                      (8.28)

Assume that (8.26) holds for i − 1. Then, for i, we have

    Δ[i](e(k)) = inf_{w(k)} {U(e(k), w(k)) + Ĵ[i−1](e(k + 1))}
               ≤ inf_{w(k)} {U(e(k), w(k)) + J[i−1](e(k + 1)) + (i − 1)ε1}
               = J[i](e(k)) + (i − 1)ε1.                                 (8.29)

On the other hand, we can have

    Δ[i](e(k)) = inf_{w(k)} {U(e(k), w(k)) + Ĵ[i−1](e(k + 1))}
               ≥ inf_{w(k)} {U(e(k), w(k)) + J[i−1](e(k + 1)) − (i − 1)ε1}
               = J[i](e(k)) − (i − 1)ε1.                                 (8.30)

So, we get

    −(i − 1)ε1 ≤ Δ[i](e(k)) − J[i](e(k)) ≤ (i − 1)ε1.                    (8.31)

Then, according to (8.25), we can get

    |Ĵ[i](e(k)) − J[i](e(k))| ≤ iε1.                                     (8.32)
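The linear growth iε1 in (8.26) can be observed numerically. In the sketch below (an assumed scalar example, not from the chapter; the per-iteration approximation error of (8.25) is modeled as bounded random noise), exact and perturbed value iterates are compared and the bound of Theorem 8.1 is checked at every step:

```python
import numpy as np

rng = np.random.default_rng(0)
a, g, q, r = 0.8, 1.0, 1.0, 1.0        # assumed scalar system and weights
E = np.linspace(-2.0, 2.0, 201)
W = np.linspace(-2.0, 2.0, 401)
eps1 = 1e-2                            # uniform finite approximation error

def bellman(J):
    """One exact update J -> inf_w {U(e,w) + J(e(k+1))} on the grid."""
    E_next = a * E[:, None] + g * W[None, :]
    Q_val = q * E[:, None]**2 + r * W[None, :]**2 + np.interp(E_next, E, J)
    return Q_val.min(axis=1)

J = 6.0 * E**2                         # J[0] = theta_hat * Upsilon
J_hat = J.copy()                       # perturbed iterate, J_hat[0] = J[0]
for i in range(1, 21):
    J = bellman(J)
    # Model of (8.25): the computed iterate is off by at most eps1
    J_hat = bellman(J_hat) + rng.uniform(-eps1, eps1, size=E.shape)
    # Theorem 8.1: the gap grows at most linearly, |J_hat[i] - J[i]| <= i*eps1
    assert np.max(np.abs(J_hat - J)) <= i * eps1 + 1e-12
```

The assertion never fails because the exact update is nonexpansive in the sup norm, which is precisely the mechanism behind (8.29)–(8.32).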

It is obvious that the property analysis of the iterative performance index function
Ĵ[i](e(k)) and the iterative control policy ŵ[i](k) is very difficult. In the next part,
a novel convergence analysis is built.

Theorem 8.2 Let e(k) be an arbitrary controllable state. For ∀ i = 0, 1, . . ., let
Δ[i](e(k)) be expressed as (8.24) and Ĵ[i](e(k)) be expressed as (8.23). Let ζ < ∞
and 1 ≤ η < ∞ be constants that make

J ∗ (e(k + 1)) ≤ ζU (e(k), w(k)), (8.33)

and

    J[0](e(k)) ≤ η J*(e(k))                                              (8.34)

hold uniformly. If there exists 1 ≤ ι < ∞ that makes

Jˆ[i] (e(k)) ≤ ιΔ[i] (e(k)) (8.35)

hold uniformly, then we have

    Ĵ[i](e(k)) ≤ ι(1 + Σ_{j=1}^{i} ζ^j ι^{j−1}(ι − 1)/(ζ + 1)^j + ζ^i ι^i (η − 1)/(ζ + 1)^i) J*(e(k)),   (8.36)

where we define Σ_{j}^{i}(·) = 0 for ∀ j > i and i, j = 0, 1, . . ..
j

Proof The theorem can be proved by mathematical induction. First, let i = 0. Then,
(8.36) becomes

    Ĵ[0](e(k)) ≤ ιη J*(e(k)).                                            (8.37)

As Ĵ[0](e(k)) ≤ η J*(e(k)) and ι ≥ 1, we can obtain Ĵ[0](e(k)) ≤ η J*(e(k)) ≤
ιη J*(e(k)), which is (8.37). So, the conclusion holds for i = 0.
Next, let i = 1. According to (8.27), we have

    Δ[1](e(k)) = inf_{w(k)} {U(e(k), w(k)) + Ĵ[0](e(k + 1))}
               ≤ inf_{w(k)} {U(e(k), w(k)) + ιη J*(e(k + 1))}.           (8.38)

As 1 ≤ ι < ∞ and 1 ≤ η < ∞, we have ιη − 1 ≥ 0. Then, according to (8.33), we have

    Δ[1](e(k)) ≤ inf_{w(k)} {(1 + ζ(ιη − 1)/(ζ + 1)) U(e(k), w(k)) + (ιη − (ιη − 1)/(ζ + 1)) J*(e(k + 1))}
               = (1 + ζ(ιη − 1)/(ζ + 1)) inf_{w(k)} {U(e(k), w(k)) + J*(e(k + 1))}
               = (1 + ζ(ι − 1)/(ζ + 1) + ζι(η − 1)/(ζ + 1)) J*(e(k)).    (8.39)

According to (8.35), we can obtain

    Ĵ[1](e(k)) ≤ ι(1 + ζ(ι − 1)/(ζ + 1) + ζι(η − 1)/(ζ + 1)) J*(e(k)),   (8.40)

which shows that (8.36) holds for i = 1.



Assume that (8.36) holds for i − 1. Then, for i, we have

    Δ[i](e(k)) = inf_{w(k)} {U(e(k), w(k)) + Ĵ[i−1](e(k + 1))}
               ≤ inf_{w(k)} {U(e(k), w(k)) + ι(1 + Σ_{j=1}^{i−1} ζ^j ι^{j−1}(ι − 1)/(ζ + 1)^j
                 + ζ^{i−1} ι^{i−1} (η − 1)/(ζ + 1)^{i−1}) J*(e(k + 1))}.   (8.41)

Applying (8.33) with the same splitting as in (8.39), (8.41) can be written as

    Δ[i](e(k)) ≤ (1 + Σ_{j=1}^{i} ζ^j ι^{j−1}(ι − 1)/(ζ + 1)^j + ζ^i ι^i (η − 1)/(ζ + 1)^i)
                 × inf_{w(k)} {U(e(k), w(k)) + J*(e(k + 1))}
               = (1 + Σ_{j=1}^{i} ζ^j ι^{j−1}(ι − 1)/(ζ + 1)^j + ζ^i ι^i (η − 1)/(ζ + 1)^i) J*(e(k)).   (8.42)

Then, according to (8.35), we can obtain (8.36) which proves the conclusion for
∀ i = 0, 1, . . ..

Theorem 8.3 Let e(k) be an arbitrary controllable state. Suppose Theorem 8.2 holds
for ∀e(k). If for ζ < ∞ and ι ≥ 1, the inequality

    ι < (ζ + 1)/ζ                                                        (8.43)

holds, then as i → ∞, the iterative performance index function Jˆ[i] (e(k)) in the iter-
ative ADP algorithm (8.20)–(8.23) is uniformly convergent into a bounded neigh-
borhood of the optimal performance index function J ∗ (e(k)), i.e.,
 
    lim_{i→∞} Ĵ[i](e(k)) = Ĵ∞(e(k)) ≤ ι(1 + ζ(ι − 1)/(1 − ζ(ι − 1))) J*(e(k)).   (8.44)

Proof According to (8.42) in Theorem 8.2, we can see that for j = 1, 2, . . ., the
sequence {ζ^j ι^{j−1}(ι − 1)/(ζ + 1)^j} is a geometric series with ratio ζι/(ζ + 1). Then,
(8.42) can be written as

    Δ[i](e(k)) ≤ (1 + (ζ(ι − 1)/(ζ + 1)) (1 − (ζι/(ζ + 1))^i)/(1 − ζι/(ζ + 1)) + ζ^i ι^i (η − 1)/(ζ + 1)^i) J*(e(k)).   (8.45)

As i → ∞, if 1 ≤ ι < (ζ + 1)/ζ, then (8.45) becomes

    lim_{i→∞} Δ[i](e(k)) = Δ∞(e(k)) ≤ (1 + ζ(ι − 1)/(1 − ζ(ι − 1))) J*(e(k)).   (8.46)

According to (8.35), letting i → ∞, we have

    Ĵ∞(e(k)) ≤ ιΔ∞(e(k)).                                                (8.47)

Considering (8.47) and (8.46), we can obtain

    Ĵ∞(e(k)) ≤ ι(1 + ζ(ι − 1)/(1 − ζ(ι − 1))) J*(e(k)).                  (8.48)
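The limit in (8.44) is just the sum of the geometric series appearing in (8.36). A quick numerical check with illustrative values ζ = 2, ι = 1.2, η = 1.5 (chosen so that ι < (ζ + 1)/ζ = 1.5, as required by (8.43)) shows the partial sums approaching ι(1 + ζ(ι − 1)/(1 − ζ(ι − 1))):

```python
# Partial sums of the bound (8.36) versus the closed-form limit (8.44);
# zeta, iota, eta are illustrative values satisfying condition (8.43)
zeta, iota, eta = 2.0, 1.2, 1.5

def bound(i):
    """Coefficient of J*(e(k)) in (8.36) at iteration i."""
    s = sum(zeta**j * iota**(j - 1) * (iota - 1) / (zeta + 1)**j
            for j in range(1, i + 1))
    tail = zeta**i * iota**i * (eta - 1) / (zeta + 1)**i
    return iota * (1.0 + s + tail)

limit = iota * (1.0 + zeta * (iota - 1) / (1.0 - zeta * (iota - 1)))
print(bound(5), bound(50), limit)
```

For these values the geometric ratio is ζι/(ζ + 1) = 0.8, so the partial sums converge and the influence of the initial overestimate η vanishes, exactly as Theorem 8.3 states.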

Furthermore, we can obtain the following corollary.

Corollary 8.1 Let e(k) be an arbitrary controllable state. Suppose Theorem 8.2
holds for ∀e(k). If for ζ < ∞ and ι ≥ 1, the inequality (8.43) holds, then the iterative
control policy ŵ[i] (k) of the iterative ADP algorithm (8.20)–(8.23) is convergent, i.e.,

    ŵ∞(k) = lim_{i→∞} ŵ[i](k) = arg inf_{w(k)} {U(e(k), w(k)) + Ĵ∞(e(k + 1))}.   (8.49)

Therefore, it is proved that the new iterative ADP algorithm with finite approximation
error is convergent. In the next section, examples are given to illustrate the
performance of the proposed method for chaotic systems.

8.4 Simulation Study

To evaluate the performance of our iterative ADP algorithm with finite approximation
error, we choose two examples with quadratic utility functions.

Fig. 8.1 The Hénon chaotic orbits

Example 8.1 Consider the Hénon mapping as follows:

    [x1(k + 1)]   [1 + b x2(k) − a x1²(k)]   [0.2  0 ] [u1(k)]
    [x2(k + 1)] = [        x1(k)         ] + [ 0  0.5] [u2(k)],          (8.50)

where a = 1.4, b = 0.3, and [x1(0), x2(0)]T = [0.1, −0.1]T. The chaotic orbits are
shown in Fig. 8.1.

The desired orbit xd(k) is generated by the following expression:

    xd(k + 1) = A xd(k),                                                 (8.51)

with

    A = [ cos wT   sin wT ]
        [ −sin wT  cos wT ],

where T = 0.1 s and w = 0.8π.
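The uncontrolled Hénon orbit of Fig. 8.1 and the reference orbit (8.51) can be generated in a few lines (plotting omitted; the step counts are arbitrary choices):

```python
import numpy as np

a, b = 1.4, 0.3
x = np.array([0.1, -0.1])              # x(0)
orbit = [x]
for _ in range(2000):                  # uncontrolled map, u(k) = 0, Eq. (8.50)
    x = np.array([1.0 + b * x[1] - a * x[0]**2, x[0]])
    orbit.append(x)
orbit = np.array(orbit)                # traces the attractor of Fig. 8.1

T, w = 0.1, 0.8 * np.pi                # reference dynamics (8.51)
A = np.array([[np.cos(w * T),  np.sin(w * T)],
              [-np.sin(w * T), np.cos(w * T)]])
xd = np.array([1.0, 0.5])              # initial reference state of Example 8.1
ref = [xd]
for _ in range(250):
    xd = A @ xd
    ref.append(xd)
ref = np.array(ref)                    # circular reference orbit, constant norm
```

Since A is a rotation matrix, the reference orbit keeps a constant norm, while the Hénon orbit stays bounded on its strange attractor.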


The initial state is selected as [1, 0.5]T, and the utility function is defined as
U(e(k), w(k)) = eT(k)Qe(k) + wT(k)Rw(k), where Q = R = I2. Let θ̂ = 6 and
Υ(e(k)) = eT(k)Qe(k) to initialize the algorithm. To illustrate the performance
of the proposed optimal tracking control scheme, two different approximation preci-
sions, ε1 = 10−2 and ε1 = 10−8, are selected. For ε1 = 10−2, the performance
index function trajectory is obtained as in Fig. 8.2; the iterative performance index
function is not convergent.

Fig. 8.2 The performance index function trajectory for ε1 = 10−2

The state tracking trajectories are shown in Figs. 8.3 and 8.4, and the tracking error
trajectories in Fig. 8.5. We can see that the iterative control cannot make the system
track the desired orbit satisfactorily. For ε1 = 10−8, the convergence conditions are
satisfied, and the performance index function is shown in Fig. 8.6. The state trajecto-
ries are shown in Figs. 8.7 and 8.8, and the tracking error trajectories in Fig. 8.9.
From the simulation results, we can see that the optimal tracking control of the
chaotic systems can be obtained if the convergence conditions are satisfied.

Example 8.2 Consider the following continuous-time chaotic system [11]:

    ẋ1 = θ1(x2 − h(x1)) + u1,
    ẋ2 = x1 − x2 + x3 + u2,                                              (8.52)
    ẋ3 = −θ2 x2 + u3,

where h(x1) = m1 x1 + (1/2)(m0 − m1)(|x1 + θ3| − |x1 − θ3|), θ1 = 9, θ2 = 14.28,
θ3 = 1, m0 = −1/7, and m1 = 2/7.

According to Euler's discretization method, the continuous-time chaotic system
can be represented as follows:

    x1(k + 1) = x1(k) + T̂(θ1(x2(k) − h(x1(k))) + u1(k)),
    x2(k + 1) = x2(k) + T̂(x1(k) − x2(k) + x3(k) + u2(k)),                (8.53)
    x3(k + 1) = x3(k) + T̂(−θ2 x2(k) + u3(k)),
Fig. 8.3 The trajectories of x1 and xd1 for ε1 = 10−2

Fig. 8.4 The trajectories of x2 and xd2 for ε1 = 10−2



Fig. 8.5 The trajectories of e1 and e2 for ε1 = 10−2

Fig. 8.6 The performance index function trajectory for ε1 = 10−8



Fig. 8.7 The trajectories of x1 and xd1 for ε1 = 10−8

Fig. 8.8 The trajectories of x2 and xd2 for ε1 = 10−8



Fig. 8.9 The trajectories of e1 and e2 for ε1 = 10−8

where T̂ = 0.001. The attractor of the discrete chaotic system (8.53) is shown in Fig. 8.10.

Fig. 8.10 The chaotic attractor of system (8.53)
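The Euler discretization (8.53) is straightforward to simulate. The sketch below iterates the uncontrolled system (u = 0) from the initial state of Example 8.2; the number of steps is an arbitrary choice:

```python
import numpy as np

theta1, theta2, theta3 = 9.0, 14.28, 1.0
m0, m1 = -1.0 / 7.0, 2.0 / 7.0
T_hat = 0.001

def h(x1):
    # Piecewise-linear nonlinearity of (8.52)
    return m1 * x1 + 0.5 * (m0 - m1) * (abs(x1 + theta3) - abs(x1 - theta3))

def step(x, u=np.zeros(3)):
    # One Euler step of the discrete system (8.53)
    return np.array([
        x[0] + T_hat * (theta1 * (x[1] - h(x[0])) + u[0]),
        x[1] + T_hat * (x[0] - x[1] + x[2] + u[1]),
        x[2] + T_hat * (-theta2 * x[1] + u[2]),
    ])

x = np.array([0.2, 0.13, 0.17])        # initial state of Example 8.2
traj = [x]
for _ in range(20000):
    x = step(x)
    traj.append(x)
traj = np.array(traj)                  # traces the attractor of Fig. 8.10
```

With T̂ = 0.001 the Euler map reproduces the continuous-time attractor closely; larger step sizes would distort or destabilize the discretization.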

The initial state is selected as [0.2, 0.13, 0.17]T, and the reference trajectory is
[0.5, −0.5, 0.4]T. The utility function is defined as U(e(k), w(k)) = eT(k)Qe(k) +
wT(k)Rw(k), where Q = R = I3. Let θ̂ = 6 and Υ(e(k)) = eT(k)Qe(k) to initialize
the algorithm. The performance index function is shown in Fig. 8.11, the system
states in Fig. 8.12, and the state errors in Fig. 8.13. It can be seen that the proposed
algorithm is effective and the simulation results are satisfactory.

Fig. 8.11 The performance index function trajectory for ε1 = 10−8

Fig. 8.12 The trajectories of x1, x2 and x3 for ε1 = 10−8



Fig. 8.13 The trajectories of e1, e2 and e3 for ε1 = 10−8

8.5 Conclusion

This chapter proposed an optimal tracking control method for chaotic systems. Via
system transformation, the optimal tracking problem was transformed into an optimal
regulation problem, and then the approximation-error ADP algorithm was introduced
to deal with the optimal regulation problem with rigorous convergence analysis.
Finally, two examples have been given to demonstrate the validity of the proposed
optimal tracking control scheme for chaotic systems.

References

1. Zhang, H., Huang, W., Wang, Z., Chai, T.: Adaptive synchronization between two different
chaotic systems with unknown parameters. Phys. Lett. A 350(5–6), 363–366 (2006)
2. Chen, S., Lu, J.: Synchronization of an uncertain unified chaotic system via adaptive control.
Chaos, Solitons Fractals 14(4), 643–647 (2002)
3. Zhang, H., Wang, Z., Liu, D.: Chaotifying fuzzy hyperbolic model using adaptive inverse
optimal control approach. Int. J. Bifurc. Chaos 14(10), 3505–3517 (2004)
4. Zhang, H., Ma, T., Fu, J., Tong, S.: Robust lag synchronization of two different chaotic systems
via dual-stage impulsive control. Chin. Phys. B 18(9), 3751–3757 (2009)
5. Hénon, M.: A two-dimensional mapping with a strange attractor. Commun. Math. Phys. 50(1), 69–77 (1976)
6. Lu, J., Wu, X., Lü, J., Kang, L.: A new discrete chaotic system with rational fraction and its
dynamical behaviors. Chaos, Solitons Fractals 22(2), 311–319 (2004)
7. Zhang, H., Cui, L., Zhang, X., Luo, Y.: Data-driven robust approximate optimal tracking control
for unknown general nonlinear systems using adaptive dynamic programming method. IEEE
Trans. Neural Netw. 22(12), 2226–2236 (2011)

8. Zhang, H., Song, R., Wei, Q., Zhang, T.: Optimal tracking control for a class of nonlinear
discrete-time systems with time delays based on heuristic dynamic programming. IEEE Trans.
Neural Netw. 22(12), 1851–1862 (2011)
9. Zhang, H., Wei, Q., Luo, Y.: A novel infinite-time optimal tracking control scheme for a class of
discrete-time nonlinear systems via the greedy HDP iteration algorithm. IEEE Trans. Syst. Man Cybern. B Cybern. 38(4), 937–942 (2008)
10. Liu, D., Wei, Q.: Finite-approximation-error-based optimal control approach for discrete-time
nonlinear systems. IEEE Trans. Syst. Man Cybern. B Cybern. 43(2), 779–789 (2013)
11. Ma, T., Zhang, H., Fu, J.: Exponential synchronization of stochastic impulsive perturbed chaotic
Lur’e systems with time-varying delay and parametric uncertainty. Chin. Phys. B 17(12), 4407–
4417 (2008)
Chapter 9
Off-Policy Actor-Critic Structure
for Optimal Control of Unknown
Systems with Disturbances

An optimal control method is developed for unknown continuous-time systems with
unknown disturbances in this chapter. The integral reinforcement learning (IRL) algo-
rithm is presented to obtain the iterative control. Off-policy learning is used to allow
the dynamics to be completely unknown. Neural networks (NN) are used to con-
struct the critic and action networks. It is shown that if there are unknown disturbances,
off-policy IRL may not converge or may be biased. To reduce the influence of the
unknown disturbances, a disturbance compensation controller is added. It is proven
that the weight errors are uniformly ultimately bounded (UUB) based on Lyapunov
techniques. Convergence of the Hamiltonian function is also proven. The simulation
study demonstrates the effectiveness of the proposed optimal control method for
unknown systems with disturbances.

9.1 Introduction

Adaptive control is a body of online design techniques that use measured data along
system trajectories to learn unknown system dynamics, and compensate for distur-
bances and modeling errors to provide guaranteed performance [1–3]. Reinforce-
ment learning (RL), which improves the action in a given uncertain environment,
can obtain the optimal adaptive control in an uncertain noisy environment with less
computational burden [4–6]. Integral RL (IRL) is based on the integral temporal dif-
ference and uses RL ideas to find the value of the parameters of the infinite horizon
performance index function [7], which has been one of the focus area in the theory
of optimal control [8, 9].
IRL is conceptually based on the policy iteration (PI) technique [10, 11]. The PI algo-
rithm [12, 13] is an iterative approach to solving the HJB equation by constructing a
sequence of stabilizing control policies that converge to the optimal control solution.
IRL allows the development of a Bellman equation that does not contain the system
dynamics. In [14], an online algorithm that uses integral reinforcement knowledge

© Science Press, Beijing and Springer Nature Singapore Pte Ltd. 2019 147
R. Song et al., Adaptive Dynamic Programming: Single and Multiple
Controllers, Studies in Systems, Decision and Control 166,
https://doi.org/10.1007/978-981-13-1712-5_9

for learning the continuous-time optimal control solution of nonlinear systems with
infinite horizon costs and partial knowledge of the system dynamics was introduced.
It is worth noting that most IRL algorithms are on-policy, i.e., the performance
index function is evaluated using system data generated with the policies being eval-
uated. This means that on-policy learning methods use “inaccurate” data to learn the
performance index function, which increases the accumulated error. On-policy
IRL algorithms also generally require knowledge of the system input coupling function
g(x).
A recently developed approach for IRL is off-policy learning, which can learn
the solution of HJB equation from the system data generated by an arbitrary control
[15]. In [16], the off-policy IRL method was developed to solve the H-infinity control
problem of continuous-time systems with completely unknown dynamics. In [17], a
novel PI approach for finding online adaptive optimal controllers for continuous-time
linear systems with completely unknown system dynamics was presented. In Refs. [1,
18, 19], robust optimal control designs for uncertain nonlinear systems were studied
based on robust ADP, in which the unmodeled dynamics were taken into account
and the interactions between the system model and the dynamic uncertainty were
studied using robust nonlinear control theory. However, external disturbances exist
in many industrial systems and depend directly on the time scale. If the external
disturbances are considered as an interior part of the systems, then the original
systems become time-varying ones, which is undesirable for engineers. Therefore,
effective optimal control methods need to be developed to deal with systems with
disturbances. This motivates our research interest.
In this chapter, an off-policy IRL algorithm is established to obtain the optimal con-
trol of unknown systems with unknown disturbances. Because unknown external
disturbances exist, off-policy IRL methods may be biased and fail to give optimal
controllers. A disturbance-compensation redesigned off-policy IRL controller is
given here. It is proven that the weight errors are uniformly ultimately bounded
(UUB) based on Lyapunov techniques.

9.2 Problem Statement

Consider the following nonlinear system

ẋ = f (x) + g(x)u + d, (9.1)

where x ∈ Rn is the system state vector, u ∈ Rm is the system control input, d =
d(t) ∈ Rn is the unknown disturbance, f(x) ∈ Rn is the unknown drift dynamics of the
system with f(0) = 0, and g(x) ∈ Rn×m is the unknown input dynamics of the system. f
and g are locally Lipschitz functions. It is assumed that there exists a matrix gM with
the same dimensions as g, such that ggM^T is a symmetric positive semidefinite matrix,
||ggM^T|| > 1, and ||g|| ≤ Bg, where Bg > 0. Assume

||d|| ≤ Bd , (9.2)

where Bd is a positive number.
For system (9.1) with d = 0, the performance index function is defined as

    J(x) = ∫_t^∞ r(x(τ), u(τ))dτ,                                        (9.3)

where r (x, u) = Q(x) + u T Ru, in which Q(x) > 0 and R is a symmetric positive
definite matrix. It is assumed that there exists q > 0 satisfying Q(x) ≥ q||x||2 .
To begin with the algorithms, let us introduce the concept of the admissible control
[18, 20].

Definition 9.1 For system (9.1) with d = 0, a control policy u(x) is defined as
admissible, if u(x) is continuous on a set Ω ∈ Rn , u(0) = 0, u(x) stabilizes the
system, and J (x) defined in (9.3) is finite for all x.

For an arbitrary admissible control u and d = 0, the infinitesimal version of (9.3) is
the Bellman equation

    0 = Jx^T ẋ + r,                                                      (9.4)

which means along the solution of (9.1), one has

JxT ẋ + Q(x) + u T Ru = JxT ( f + gu) + Q(x) + u T Ru = 0. (9.5)

Therefore, one can get

JxT ( f + gu) = −Q(x) − u T Ru. (9.6)

Define the Hamiltonian function as

H (x, u, Jx ) = JxT ẋ + r, (9.7)

and let J* be the optimal performance index function; then

    0 = min_{u∈U} {H(x, u, Jx*)}.                                        (9.8)

If the solution J ∗ exists and it is continuously differentiable, then the optimal control
can be expressed as
 
u ∗ = arg min H (x, u, Jx∗ ) . (9.9)
u∈U

In general, if d = 0, then the analytical solution of (9.8) can be approximated by the
policy iteration (PI) technique, which is given as follows.

On-Policy Integral Reinforcement Learning (IRL)


From (9.10) and (9.11), it is clear that the exact knowledge of system dynamics is
necessary for the iterations. However for many complex industrial control processes,
it is difficult to estimate or obtain the system models. Therefore, the on-policy inte-
gral reinforcement learning (IRL) algorithm with infinite horizon performance index
function is presented [11], which is used to learn the optimal control solution with-
out using the complete knowledge of system dynamics (9.1). The on-policy IRL
algorithm is given as follows.

Algorithm 2 PI
1: Let i = 0, select an admissible control u[0].
2: Let the iteration index i ≥ 0, and solve J[i] from

    0 = Jx[i]T( f + gu[i]) + Q(x) + u[i]T Ru[i],                         (9.10)

where J[0] = 0.
3: Update the control u[i] by

    u[i+1] = −(1/2) R−1 gT Jx[i].                                        (9.11)

Let d = 0. For any time T > 0 and t > T, the performance index function (9.3)
can be written in IRL form as

    J(x(t − T)) = ∫_{t−T}^{t} r(x(τ), u(τ))dτ + J(x(t)).                 (9.12)

Define r_I = ∫_{t−T}^{t} r(x(τ), u(τ))dτ. Then (9.12) can be expressed as

    J(x(t − T)) = r_I + J(x(t)).                                         (9.13)

Let u [i] be obtained by (9.11). Then the original system (9.1) with d = 0 can be
rewritten as

ẋ = f + gu [i] . (9.14)

From (9.10), one has

    Jx[i]T( f + gu[i]) = −Q(x) − u[i]T Ru[i].                            (9.15)

Therefore, for i = 1, 2, . . ., one can obtain

    J[i](x(t)) − J[i](x(t − T)) = ∫_{t−T}^{t} Jx[i]T ẋ dτ
                                = −∫_{t−T}^{t} Q(x)dτ − ∫_{t−T}^{t} u[i]T Ru[i] dτ.   (9.16)

This is an IRL Bellman equation that can be solved instead of (9.10) at each step
of the PI algorithm. This means that f(x) is not needed in this IRL PI algorithm.
Details are given in [7].
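The data-based character of (9.16) can be verified on a case where the value function is known in closed form. The sketch below uses an assumed scalar linear system ẋ = ax + bu with a fixed stabilizing feedback u = −Kx (not an example from this chapter): the integral reinforcement r_I is computed from trajectory samples only, and the IRL Bellman identity (9.13) is checked against the analytic value:

```python
import numpy as np

a, b, K = 1.0, 1.0, 2.0                # closed loop lam = a - b*K = -1 (stable)
q, r = 1.0, 0.5
lam = a - b * K
c = q + r * K**2                       # running cost r(x, u) = c*x^2 along u = -Kx

def J(x):
    # Closed-form value: integral of c*x(tau)^2 along x(tau) = x*exp(lam*tau)
    return c * x**2 / (-2.0 * lam)

T = 0.4
x0 = 1.5                               # state at time t - T
ts = np.linspace(0.0, T, 2001)
xs = x0 * np.exp(lam * ts)             # trajectory samples over [t-T, t]
ys = c * xs**2
r_I = float(np.sum(0.5 * (ys[1:] + ys[:-1]) * np.diff(ts)))  # trapezoid rule

# IRL Bellman equation (9.13): J(x(t-T)) = r_I + J(x(t)); both sides ≈ 3.375
print(J(x0), r_I + J(xs[-1]))
```

Note that r_I is computed purely from measured state samples; neither a nor b appears in the evaluation step, which is the point of the IRL formulation.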

9.3 Off-Policy Actor-Critic Integral Reinforcement Learning

Based on Sect. 9.2, we will now analyze the case where d is not equal to zero. In this
section, on-policy IRL for a nonzero disturbance is first presented, and then the
off-policy IRL is given.

9.3.1 On-Policy IRL for Nonzero Disturbance

If the disturbance d is not equal to zero, the IRL PI algorithm gives incorrect results.
Let u = u[i]; then the original system (9.1) can be rewritten as

ẋ = f + gu [i] + d. (9.17)

According to (9.15), one has the Bellman equation

    J[i](x(t)) − J[i](x(t − T)) = ∫_{t−T}^{t} Jx[i]T ẋ dτ
        = −∫_{t−T}^{t} Q(x)dτ − ∫_{t−T}^{t} u[i]T Ru[i] dτ + ∫_{t−T}^{t} Jx[i]T d dτ.   (9.18)

Therefore, one can define

    e1 = J[i](x(t − T)) − J[i](x(t)) − ∫_{t−T}^{t} Q(x)dτ − ∫_{t−T}^{t} u[i]T Ru[i] dτ + ∫_{t−T}^{t} Jx[i]T d dτ.   (9.19)

This shows that the equation error e1 is biased by a term depending on the unknown
disturbance d. Therefore:

(1) The least squares method in [17] will not give the correct solution for J [i] .
(2) If the unknown disturbance d is not equal to zero, the iterative control u [i] cannot
guarantee the stability of the closed-loop system when dynamics uncertainty
occurs.
Therefore, the original method for d = 0 is not suited to the nonlinear system
with an unknown external disturbance. In the following subsection, an off-policy IRL
algorithm is used to decrease e1 in (9.19) and to stabilize the closed-loop system
under the external disturbance.

9.3.2 Off-Policy IRL for Nonzero Disturbance

Here we detail off-policy IRL for the case of nonzero disturbance. Off-policy IRL
allows learning with completely unknown dynamics. However, it is seen here that
when there is an unknown disturbance, off-policy IRL may not perform properly.
Let u [i] be obtained by (9.11), and then the original system (9.1) can be rewritten
as

ẋ = f + gu [i] + g(u − u [i] ) + d. (9.20)

According to (9.15), one has the off-policy Bellman equation


J [i] (x(t)) − J [i] (x(t − T )) = − ∫_{t−T}^{t} Q(x)dτ − ∫_{t−T}^{t} u [i]T Ru [i] dτ + ∫_{t−T}^{t} Jx[i]T v[i] dτ
= − ∫_{t−T}^{t} Q(x)dτ − ∫_{t−T}^{t} u [i]T Ru [i] dτ
+ ∫_{t−T}^{t} (−2u [i+1]T R(u − u [i] ))dτ + ∫_{t−T}^{t} (Jx[i]T d)dτ, (9.21)

where v[i] = gu − gu [i] + d.


The off-policy Bellman equation is the main equation used in off-policy learning.
The PI algorithm is implemented simply by iterating on (9.21), as detailed in the
next results.

Algorithm 3 PI
1: Let i = 0, select an admissible control u [0] .
2: For iteration index i ≥ 0, solve J [i] and u [i+1] simultaneously from (9.21).

Lemma 9.1 In essence, off-policy Algorithm 3 is equivalent to PI Algorithm 2 and
converges to the optimal control solution.

Proof (1) From (9.10) in Algorithm 2, one has

Jx[i]T ( f + gu [i] ) = −Q(x) − u [i]T Ru [i] . (9.22)

Since

J [i] (x(t)) − J [i] (x(t − T )) = ∫_{t−T}^{t} Jx[i]T ẋ dτ. (9.23)

Therefore, according to (9.20), one has


J [i] (x(t)) − J [i] (x(t − T )) = ∫_{t−T}^{t} Jx[i]T ẋ dτ
= ∫_{t−T}^{t} Jx[i]T ( f + gu [i] + g(u − u [i] ) + d)dτ. (9.24)

From (9.22), one can obtain

J [i] (x(t)) − J [i] (x(t − T ))
= − ∫_{t−T}^{t} Q(x)dτ − ∫_{t−T}^{t} u [i]T Ru [i] dτ + ∫_{t−T}^{t} Jx[i]T v[i] dτ. (9.25)

Here, (9.25) is the off-policy Bellman equation, which is the main equation in
Algorithm 3. From (9.22)–(9.25), it can be seen that from Algorithm 2, we can derive
Algorithm 3.
In (9.20), if one lets d = 0 and u = u [i] , then (9.24) can be written as
J [i] (x(t)) − J [i] (x(t − T )) = ∫_{t−T}^{t} Jx[i]T ẋ dτ
= ∫_{t−T}^{t} Jx[i]T ( f + gu [i] )dτ. (9.26)

Therefore, according to (9.4), one has


∫_{t−T}^{t} Jx[i]T ( f + gu [i] )dτ = ∫_{t−T}^{t} (−Q(x) − u [i]T Ru [i] )dτ, (9.27)

which means

Jx[i]T ( f + gu [i] ) = −Q(x) − u [i]T Ru [i] . (9.28)

From (9.26)–(9.28), it can be seen that from Algorithm 3, we can derive Algorithm 2.
(2) In [21] and [22], it was shown that as the iteration of Algorithm 2 proceeds,
J [i] and u [i] converge to the optimal solutions J ∗ and u ∗ , respectively. Therefore, one
can get the optimal solutions J ∗ and u ∗ by Algorithm 3.

In fact, the iteration (9.10) in Algorithm 2 requires knowledge of the system dynamics,
while in (9.21) of Algorithm 3 the system dynamics do not appear explicitly.
Therefore, Algorithms 2 and 3 address different situations, although the two algorithms
are essentially the same. The online solution of (9.21) is detailed in Sect. 9.3.3.
According to (9.21), one can define

e2 = J [i] (x(t − T )) − J [i] (x(t)) − ∫_{t−T}^{t} Q(x)dτ − ∫_{t−T}^{t} u [i]T Ru [i] dτ
+ ∫_{t−T}^{t} Jx[i]T v[i] dτ
= J [i] (x(t − T )) − J [i] (x(t)) − ∫_{t−T}^{t} Q(x)dτ − ∫_{t−T}^{t} u [i]T Ru [i] dτ
+ ∫_{t−T}^{t} Jx[i]T (gu − gu [i] )dτ + ∫_{t−T}^{t} Jx[i]T d dτ. (9.29)

This equation was developed for the case d = 0 in [19]. Therefore, the equation
error e2 is biased by a term depending on unknown disturbance d. This may cause
nonconvergence or biased results.
It is noted that in (9.29), d is the unknown external disturbance and Jx[i] may be
nonanalytic. For solving Jx[i] and u [i] from (9.29), critic and action networks are
introduced to obtain Jx[i] and u [i] approximately, as shown next.

9.3.3 NN Approximation for Actor-Critic Structure

Here we introduce neural network approximation structures for Jx[i] and u [i] , termed
the critic NN and the actor NN, respectively. For off-policy learning, these
two structures are updated simultaneously using the off-policy Bellman equation
(9.21), as shown here.
Let the ideal critic network expression be

J [i] (x) = Wc[i]T ϕc (x) + εc[i] (x), (9.30)

where Wc[i] ∈ Rn1×1 is the ideal weight of the critic network, ϕc ∈ Rn1×1 is the
activation function, and εc[i] is the residual error. Then one has

Jx[i] = ∇ϕcT Wc[i] + ∇εc[i] . (9.31)

Let the estimation of Wc[i] be Ŵc[i] , and then the estimation of J [i] can be expressed
as

Jˆ[i] = Ŵc[i]T ϕc . (9.32)



Accordingly, one has

Jˆx[i] = ∇ϕcT Ŵc[i] . (9.33)
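A minimal sketch of the critic approximation (9.32) and its gradient (9.33) for a scalar state. The two tanh features and the weight values below are illustrative assumptions, not the chapter's trained network.

```python
import math

W_c = [0.8, -0.3]   # assumed critic weights (stand-in for W_hat_c)

def phi_c(x):
    """Hypothetical activation vector phi_c(x) with two tanh features."""
    return [math.tanh(x), math.tanh(2 * x)]

def grad_phi_c(x):
    """Elementwise derivatives d phi_c / dx."""
    return [1 - math.tanh(x) ** 2, 2 * (1 - math.tanh(2 * x) ** 2)]

def J_hat(x):
    """Critic value estimate J_hat = W_c^T phi_c(x), as in (9.32)."""
    return sum(w * p for w, p in zip(W_c, phi_c(x)))

def J_hat_x(x):
    """Critic gradient estimate J_hat_x = (grad phi_c)^T W_c, as in (9.33)."""
    return sum(w * g for w, g in zip(W_c, grad_phi_c(x)))
```

A finite-difference check of J_hat against J_hat_x confirms the two expressions are consistent.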

Let Δϕc = ϕc (x(t)) − ϕc (x(t − T )); then the first term of (9.21) is expressed as

Jˆ[i] (x(t)) − Jˆ[i] (x(t − T )) = Ŵc[i]T Δϕc
= (ΔϕcT ⊗ I )vec(Ŵc[i]T )
= (ΔϕcT ⊗ I )Ŵc[i] . (9.34)

The last term of (9.29) is

Jˆx[i]T d = Ŵc[i]T ∇ϕc d = ((∇ϕc d)T ⊗ I )Ŵc[i] . (9.35)

Let the ideal action network expression be

u [i] (x) = Wa[i]T ϕa (x) + εa[i] (x), (9.36)

where Wa[i] ∈ Rn2×m is the ideal weight of the action network, ϕa ∈ Rn2×1 is the
activation function, and εa[i] is the residual error. Let the estimation of Wa[i] be Ŵa[i] ;
then the estimation of u [i] can be expressed as

û [i] = Ŵa[i]T ϕa . (9.37)

Then one has

Jx[i]T g(u − û [i] ) = − 2û [i+1]T R(u − û [i] )
= − 2(Ŵa[i+1]T ϕa )T R(u − û [i] )
= − 2ϕaT Ŵa[i+1] R(u − û [i] )
= − 2(((u − û [i] )T R) ⊗ ϕaT )vec(Ŵa[i+1] ). (9.38)
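The regressor forms in (9.34) and (9.38) rest on the vectorization identity aᵀW b = (bᵀ ⊗ aᵀ)vec(W), where vec stacks the columns of W. A small numeric check with hypothetical vectors and a hypothetical matrix:

```python
def kron_row(b, a):
    """Row vector b^T (Kronecker) a^T, returned as a flat list."""
    return [bj * ai for bj in b for ai in a]

def vec(W):
    """Column-stacking vectorization of a matrix given as a list of rows."""
    rows, cols = len(W), len(W[0])
    return [W[i][j] for j in range(cols) for i in range(rows)]

# Assumed small example: a in R^2, W in R^{2x3}, b in R^3.
a = [1.0, 2.0]
b = [3.0, -1.0, 0.5]
W = [[0.2, 1.0, -0.4],
     [0.7, -0.3, 0.9]]

lhs = sum(a[i] * W[i][j] * b[j] for i in range(2) for j in range(3))  # a^T W b
rhs = sum(p * q for p, q in zip(kron_row(b, a), vec(W)))             # (b^T (x) a^T) vec(W)
```

This identity is what lets the unknown weight matrix appear linearly in the residual, so the critic and actor weights can be solved for jointly.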

Therefore, one can define the residual error as


e3 = Jˆ[i] (x(t − T )) − Jˆ[i] (x(t)) − ∫_{t−T}^{t} Q(x)dτ − ∫_{t−T}^{t} û [i]T R û [i] dτ
+ ∫_{t−T}^{t} Jˆx[i]T g(u − û [i] )dτ + ∫_{t−T}^{t} ( Jˆx[i]T d)dτ. (9.39)

According to (9.34)–(9.38), (9.39) is written as


e3 = − (ΔϕcT ⊗ I )Ŵc[i] − ∫_{t−T}^{t} Q(x)dτ − ∫_{t−T}^{t} û [i]T R û [i] dτ
− 2 ∫_{t−T}^{t} (((u − û [i] )T R) ⊗ ϕaT )dτ vec(Ŵa[i+1] )
+ ∫_{t−T}^{t} ((∇ϕc d)T ⊗ I )dτ Ŵc[i] . (9.40)

Define Dcc = −(ΔϕcT ⊗ I ), Dx x = ∫_{t−T}^{t} Q(x)dτ + ∫_{t−T}^{t} û [i]T R û [i] dτ ,
Daa = −2 ∫_{t−T}^{t} (((u − û [i] )T R) ⊗ ϕaT )dτ , and Ddd = ∫_{t−T}^{t} ((∇ϕc d)T ⊗ I )dτ .
Then (9.40) is expressed as

e3 = Dcc Ŵc[i] − Dx x + Daa vec(Ŵa[i+1] ) + Ddd Ŵc[i] . (9.41)




Let Ψ = [Dcc + Ddd , Daa ] and Ŵ [i] = [Ŵc[i]T , vec(Ŵa[i+1] )T ]T . Then one has

e3 = Ψ Ŵ [i] − Dx x . (9.42)

This error allows the weights of the critic NN and the actor NN to be updated
simultaneously. Data are repeatedly collected at the end of each interval of length
T . When sufficient samples have been collected, this equation can be solved for the
weight vector using the gradient descent method. Unfortunately, d is unknown, so that
Ddd and Ψ are unknown. Therefore we define D̄dd = ∫_{t−T}^{t} ((∇ϕc Bd )T ⊗ I )dτ and
Ψ̄ = [Dcc + D̄dd , Daa ], where Bd is the bound of d in (9.2).
Then, one has the estimated residual error expressed as

e4 = Ψ̄ Ŵ [i] − Dx x . (9.43)

Let E = (1/2)e4T e4 . According to the gradient descent method, a solution can be found
by updating Ŵ [i] using

Ŵ˙ [i] = −αw Ψ̄ T (Ψ̄ Ŵ [i] − Dx x ), (9.44)

where αw is a positive number. This yields an on-line method for updating the weights
for the critic NN and the actor NN simultaneously.
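An Euler-discretized sketch of the update (9.44). The regressor Ψ̄ and target Dxx below are hypothetical placeholders, not quantities computed from a real system; with a full-rank Ψ̄, the iteration drives the residual Ψ̄Ŵ − Dxx to zero.

```python
def matvec(M, v):
    return [sum(M[i][j] * v[j] for j in range(len(v))) for i in range(len(M))]

def transpose(M):
    return [list(row) for row in zip(*M)]

# Assumed 2x2 regressor Psi_bar and target D_xx for illustration only.
Psi_bar = [[2.0, 0.5],
           [0.5, 1.0]]
D_xx = [1.0, 0.5]

W = [0.0, 0.0]   # weight estimate W_hat
alpha = 0.05     # learning rate alpha_w (Euler step of (9.44))
for _ in range(2000):
    e4 = [a - b for a, b in zip(matvec(Psi_bar, W), D_xx)]   # Psi_bar W - D_xx
    grad = matvec(transpose(Psi_bar), e4)                    # Psi_bar^T e4
    W = [w - alpha * g for w, g in zip(W, grad)]
```

Because this toy Ψ̄ is square and invertible, W converges to Ψ̄⁻¹Dxx; in the chapter's setting, Ψ̄ is a data matrix built from the integrals in (9.40), and persistent excitation plays the role of full rank.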
Define the weight error W̃ [i] = W [i] − Ŵ [i] ; then according to (9.44), one has

W̃˙ [i] = αw Ψ̄ T Ψ̄ Ŵ [i] − αw Ψ̄ T Dx x
= − αw Ψ̄ T Ψ̄ W̃ [i] + αw Ψ̄ T Ψ̄ W [i] − αw Ψ̄ T Dx x . (9.45)

9.4 Disturbance Compensation Redesign and Stability Analysis

It has been seen that if there is an unknown disturbance, off-policy IRL may not perform
properly and may yield a biased solution. In this section we show how to redesign the
off-policy IRL method by adding a disturbance compensator. It is shown, using
Lyapunov techniques, that this method yields proper performance.

9.4.1 Disturbance Compensation Off-Policy Controller Design

Here we propose the structure of the disturbance compensated off-policy IRL method.
Stability analysis is given in terms of Lyapunov theory. The following assumption is
first given.
Assumption 9.1 The activation function of the action network satisfies ||ϕa || ≤ ϕa M .
The partial derivative of the critic activation function satisfies ||∇ϕc || ≤ ϕcd M . The
ideal weights of the critic and action networks satisfy ||Wc[i] || ≤ WcM[i] and
||Wa[i] || ≤ WaM[i] on a compact set. The approximation error of the action network
satisfies ||εa[i] || ≤ εa[i]M on a compact set.
Remark 9.1 Assumption 9.1 is a standard assumption in NN control theory. Many
NN activation functions are bounded and have bounded derivatives. Examples
include the sigmoid, symmetric sigmoid, hyperbolic tangent, and radial basis func-
tion. Continuous functions are bounded on a compact set. Hence the NN weights are
bounded. The approximation error boundedness property is established in [23].
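The boundedness claims in Remark 9.1 can be checked numerically for a representative activation; the tanh function and the grid of sample points below are arbitrary choices for illustration.

```python
import math

# For phi(x) = tanh(x): |phi(x)| <= 1 and 0 < phi'(x) = 1 - tanh(x)^2 <= 1,
# matching the bounds phi_aM and phi_cdM assumed in Assumption 9.1.
xs = [0.5 * k for k in range(-20, 21)]          # arbitrary sample grid
phi = [math.tanh(x) for x in xs]                # activation values
dphi = [1.0 - math.tanh(x) ** 2 for x in xs]    # derivative values
```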
From Assumption 9.1, it can be seen that Ψ̄ and Dx x are bounded. Without loss
of generality, write ||Ψ̄ || ≤ BΨ and ||Dx x || ≤ Bx x for positive numbers BΨ and Bx x .
Disturbance Compensated Off-Policy IRL
For an unknown disturbance d ≠ 0, the methods in [17, 19] can be modified
to compensate for the unknown disturbance, as now shown.
The disturbance compensation controller is designed as

u c[i] = −K c g TM x/(x T x + b), (9.46)

where K c ≥ Bd2 (x T x + b)/2 and b > 0. Let

u s[i] = û [i] + u c[i] (9.47)

be the control input of system (9.1); then one can write

ẋ = f + g û [i] + gu c[i] + d. (9.48)
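A scalar sketch of the compensated control (9.46)-(9.47). The constants B_d, b, g_M and the stand-in actor output u_hat below are assumed values for illustration, not taken from the chapter's example.

```python
def u_comp(x, B_d=0.5, b=1.0, g_M=1.0):
    """Disturbance compensation term (9.46) for scalar x, with K_c at its lower bound."""
    xTx = x * x
    K_c = B_d ** 2 * (xTx + b) / 2.0       # satisfies K_c >= B_d^2 (x^T x + b)/2
    return -K_c * g_M * x / (xTx + b)      # simplifies to -(B_d^2 / 2) * g_M * x here

def u_s(x, u_hat):
    """Total control input (9.47): actor output plus compensation."""
    return u_hat + u_comp(x)
```

Note that with K_c chosen exactly at its lower bound, the compensation reduces to a linear damping term proportional to the square of the disturbance bound, which is what makes the cancellation in (9.50) possible.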

Fig. 9.1 Whole structure of system (9.48)

The whole structure of system (9.48) is shown in Fig. 9.1.
Based on (9.48), the following theorems can be obtained.

9.4.2 Stability Analysis

Our main result follows. It verifies the performance of the disturbance compensation
redesigned off-policy IRL algorithm.
Theorem 9.1 Let the control input be given by (9.47), and let the updating method
for the critic and action networks be as in (9.44). Suppose Assumption 9.1 holds and let
the initial states be in the set such that the NN approximation error is bounded as in
the assumption. Then for every iterative step i, the weight errors W̃ [i] are UUB.
Proof Choose the Lyapunov function candidate as follows:

Σ = Σ1 + Σ2 + Σ3 , (9.49)

where Σ1 = x T x, Σ2 = l1 J [i] (x), and Σ3 = (l2 /(2αw ))W̃ [i]T W̃ [i] , with l1 > 0, l2 > 0.
As f is locally Lipschitz, there exists B f > 0 s.t. || f (x)|| ≤ B f ||x||. Therefore,
from (9.48), one has

Σ̇1 = 2x T ( f + g û [i] + gu c[i] + d)
≤ 2B f ||x||2 + ||x||2 + Bg2 ||û [i] ||2 + 2x T gu c[i] + ||x||2 Bd2 + 1
≤ (2B f + 1)||x||2 + Bg2 ||û [i] ||2 − Bd2 ||gg TM ||||x||2 + ||x||2 Bd2 + 1
≤ (2B f + 1)||x||2 + Bg2 ||û [i] ||2 + 1. (9.50)

From (9.10), one can get



Σ̇2 = l1 J˙[i] (x)
≤ − l1 Q(x) − l1 λmin (R)||û [i] ||2
≤ − l1 q||x||2 − l1 λmin (R)||û [i] ||2 . (9.51)

Furthermore, one obtains

Σ̇3 = − l2 W̃ [i]T Ψ̄ T Ψ̄ W̃ [i] + l2 W̃ [i]T Ψ̄ T Ψ̄ W [i] − l2 W̃ [i]T Ψ̄ T Dx x
≤ − l2 B̄Ψ2 ||W̃ [i] ||2 + l2 ||W̃ [i] ||( B̄Ψ2 ||W [i] || + B̄Ψ Bx x ). (9.52)

Thus,

Σ̇ ≤ (2B f + 1 − l1 q)||x||2 + (Bg2 − l1 λmin (R))||û [i] ||2 + 1
− l2 B̄Ψ2 ||W̃ [i] ||2 + l2 ||W̃ [i] ||( B̄Ψ2 ||W [i] || + B̄Ψ Bx x ). (9.53)

Let Z = [x T , û [i]T , ||W̃ [i] ||]T ; then (9.53) is written as

Σ̇ ≤ −Z T M Z + Z T N + 1, (9.54)

where M = diag{l1 q − 2B f − 1, l1 λmin (R) − Bg2 , l2 B̄Ψ2 } and
N = [0, 0, l2 ( B̄Ψ2 ||W [i] || + B̄Ψ Bx x )]T . Select

l1 > max{(2B f + 1)/q, Bg2 /λmin (R)}, (9.55)

and

l2 > 0. (9.56)

If

||Z || ≥ ||N ||/(2λmin (M)) + √(||N ||2 /(4λmin2 (M)) + 1/λmin (M)), (9.57)

then Σ̇ ≤ 0. Therefore, the weight error W̃ [i] is UUB.
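The ultimate bound implied by (9.57) can be computed directly; the values of ||N|| and λmin(M) below are hypothetical placeholders, since in practice they depend on the chosen gains l1, l2 and the system bounds.

```python
import math

def uub_radius(N_norm, lam_min):
    """Radius from (9.57): once ||Z|| exceeds this value, Sigma_dot <= 0."""
    return N_norm / (2 * lam_min) + math.sqrt(
        N_norm ** 2 / (4 * lam_min ** 2) + 1.0 / lam_min)

r = uub_radius(N_norm=0.8, lam_min=2.0)   # assumed numbers for illustration
```

The radius shrinks as λmin(M) grows, which is why (9.55)-(9.56) push l1 and l2 large enough to make M positive definite with comfortable margin.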


Remark 9.2 It is assumed that the compact set in Assumption 9.1 is larger than the
set to which the state is bounded. This can be guaranteed under a mild condition on
the initial states as detailed in Theorem 4.2.1 of [24].
This result shows that the closed-loop system (9.48) is stable regardless of the unknown
disturbance d. In the next theorem, the convergence analyses of H (x, û [i] , Jˆx[i] ) and
u s[i] are given.

Theorem 9.2 Suppose the hypotheses of Theorem 9.1 hold. Define H (x, u s[i] , Jˆx[i] ) =
Jˆx[i]T ( f + g û [i] + gu c[i] + d) + Q(x) + û [i]T R û [i] . Then
(1) u s[i] is close to u [i] within a finite bound, as t → ∞.
(2) H (x, u s[i] , Jˆx[i] ) = H (x, Ŵa[i] , Ŵc[i] ) is UUB.

Proof (1) From Theorem 9.1, x will converge to a bounded neighborhood of zero as t → ∞.
Therefore, there exist Buc > 0 and Bwa > 0 satisfying ||u c[i] || ≤ Buc and ||W̃a[i] || ≤ Bwa .
Since u s[i] − u [i] = Ŵa[i]T ϕa − Wa[i]T ϕa − εa[i] + u c[i] = −W̃a[i]T ϕa − εa[i] + u c[i] ,
one can get

||u s[i] − u [i] || ≤ ϕa M Bwa + εa[i]M + Buc . (9.58)

Equation (9.58) means that u s[i] is close to u [i] within a finite bound.
(2) According to (9.32) and (9.37), one has

H (x, Ŵa[i] , Ŵc[i] ) = (Wc[i] − W̃c[i] )T ∇ϕc f + (Wc[i] − W̃c[i] )T ∇ϕc gu c[i]
+ (Wc[i] − W̃c[i] )T ∇ϕc (gWa[i]T ϕa − g W̃a[i]T ϕa + d)
+ Q(x) + ϕaT (Wa[i] − W̃a[i] )R(Wa[i]T − W̃a[i]T )ϕa , (9.59)

which means that

H (x, Ŵa[i] , Ŵc[i] ) = Wc[i]T ∇ϕc f − W̃c[i]T ∇ϕc f + Wc[i]T ∇ϕc gu c[i] − W̃c[i]T ∇ϕc gu c[i]
+ Wc[i]T ∇ϕc gWa[i]T ϕa − W̃c[i]T ∇ϕc gWa[i]T ϕa
− Wc[i]T ∇ϕc g W̃a[i]T ϕa + W̃c[i]T ∇ϕc g W̃a[i]T ϕa
+ Wc[i]T ∇ϕc d − W̃c[i]T ∇ϕc d + Q(x)
+ ϕaT Wa[i] RWa[i]T ϕa − ϕaT W̃a[i] RWa[i]T ϕa
− ϕaT Wa[i] R W̃a[i]T ϕa + ϕaT W̃a[i] R W̃a[i]T ϕa . (9.60)

As εa → 0 and εc → 0, for a fixed admissible control policy, H (x, Wa[i] , Wc[i] ) is
bounded, i.e., there exists H B s.t. ||H (x, Wa[i] , Wc[i] )|| ≤ H B . Therefore, (9.60) can
be written as

H (x, Ŵa[i] , Ŵc[i] ) = − W̃c[i]T ∇ϕc f + Wc[i]T ∇ϕc gu c[i] − W̃c[i]T ∇ϕc gu c[i]
− W̃c[i]T ∇ϕc gWa[i]T ϕa − Wc[i]T ∇ϕc g W̃a[i]T ϕa
+ W̃c[i]T ∇ϕc g W̃a[i]T ϕa − ϕaT W̃a[i] RWa[i]T ϕa
− ϕaT Wa[i] R W̃a[i]T ϕa + ϕaT W̃a[i] R W̃a[i]T ϕa
+ Wc[i]T ∇ϕc d − W̃c[i]T ∇ϕc d + H B . (9.61)

According to Assumption 9.1, one has



||H (x, Ŵa[i] , Ŵc[i] )|| ≤ϕcd M B f ||W̃c[i] ||||x|| + ϕcd M Bg ||Wc[i] ||Buc
+ ϕcd M Bg ||W̃c[i] ||Buc + ϕcd M Bg ϕa M ||Wa[i] ||||W̃c[i] ||
+ ϕcd M Bg ϕa M ||Wc[i] ||||W̃a[i] ||
+ ϕcd M Bg ϕa M ||W̃a[i] ||||W̃c[i] ||
+ 2ϕa2M ||Wa[i] ||||R||||W̃a[i] || + ϕa2M ||R||||W̃a[i] ||2
+ ϕcd M ||Wc[i] ||Bd + ϕcd M ||W̃c[i] ||Bd + H B . (9.62)

According to Theorems 9.1 and 9.2, the signals on the right-hand side of (9.62)
are bounded; therefore H (x, u s[i] , Jˆx[i] ) is UUB.

The previous theorem indicates that Jˆ[i] and u s[i] are close to J [i] and u [i] within a
small bound. The final theorem shows that J [i] and u [i] converge to the optimal
value and control policy.

Theorem 9.3 Let J [i] and u [i] be defined in (9.10) and (9.11). Then ∀i = 0, 1, 2, . . . ,
u [i+1] is admissible, and 0 ≤ J [i+1] ≤ J [i] . Furthermore, lim_{i→∞} J [i] = J ∗ and
lim_{i→∞} u [i] = u ∗ .

Proof The proof can be seen in [19, 21].

9.5 Simulation Study

In this section we present simulation results that verify the proper performance of
the disturbance compensated off-policy IRL algorithm.
Consider the following torsional pendulum system [25]




dθ/dt = ω + d1 ,
J dω/dt = u − Mgl sin θ − f d (dθ/dt) + d2 , (9.63)

where M = 1/3 kg and l = 2/3 m are the mass and length of the pendulum bar,
respectively. The system states are the current angle θ and the angular velocity ω. Let
J = 4Ml 2 /3 and f d = 0.2 be the rotary inertia and frictional factor, respectively.
Let g = 9.8 m/s2 be the gravitational acceleration. d = [d1 ; d2 ] is white noise.
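A forward-Euler sketch of simulating the pendulum (9.63). The feedback law used here is a hypothetical stabilizing linear gain, not the learned ADP controller, and the noise scale and step size are assumptions.

```python
import math, random

# Parameters from (9.63): M = 1/3 kg, l = 2/3 m, J = 4Ml^2/3, f_d = 0.2.
M, l, g = 1.0 / 3.0, 2.0 / 3.0, 9.8
J = 4.0 * M * l ** 2 / 3.0
f_d = 0.2
dt = 0.01
random.seed(0)

theta, omega = 1.0, -0.5           # assumed initial state
for _ in range(1500):              # 1500 steps, matching the plots' horizon
    u = -2.0 * theta - 2.0 * omega           # hypothetical linear feedback
    d1 = 0.01 * random.gauss(0.0, 1.0)       # small white-noise disturbances
    d2 = 0.01 * random.gauss(0.0, 1.0)
    dtheta = omega + d1
    domega = (u - M * g * l * math.sin(theta) - f_d * dtheta + d2) / J
    theta += dt * dtheta
    omega += dt * domega
```

Under this stand-in controller, the state settles near the origin up to the noise floor, qualitatively matching the behavior shown in Fig. 9.3.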
The activation functions ϕc and ϕa are hyperbolic tangent functions. The structures
of the critic and action networks are 2-8-1 and 2-8-1, respectively. Let Q = I2 ,
R = 1, and αw = 0.01; then the control input and system state trajectories are displayed
in Figs. 9.2 and 9.3, respectively. Therefore, we can conclude the effectiveness of the
disturbance compensated off-policy IRL algorithm in this chapter.

Fig. 9.2 The proposed control input u (control vs. time steps)

Fig. 9.3 The state x under the proposed control input (states x(1), x(2) vs. time steps)



9.6 Conclusion

This chapter proposes an optimal controller for unknown continuous-time systems
with unknown disturbances. Based on policy iteration, an off-policy IRL algorithm is
established to obtain the iterative control. Critic and action networks are used to obtain
the iterative performance index function and control approximately. The weight
updating method is given based on off-policy IRL. A compensation controller is
constructed to reduce the influence of unknown disturbances. Based on Lyapunov
techniques, it is proved that the weight errors are UUB. The convergence of the
Hamiltonian function is also proven. The simulation study demonstrates the effectiveness of the
proposed optimal control method for unknown systems with disturbances.
From this chapter, we can see that the weight error bound depends on the disturbance
bound. In the future, we will further study methods that focus on the unknown
disturbance itself instead of its bound.

References

1. Jiang, Y., Jiang, Z.: Robust adaptive dynamic programming for large-scale systems with an
application to multimachine power systems. IEEE Trans. Circuits Syst. II: Express Briefs 59(10),
693–697 (2012)
2. Chen, B., Liu, K., Liu, X., Shi, P., Lin, C., Zhang, H.: Approximation-based adaptive neural
control design for a class of nonlinear systems. IEEE Trans. Cybern. 44(5), 610–619 (2014)
3. Lewis, F., Vamvoudakis, K.: Reinforcement learning for partially observable dynamic pro-
cesses: adaptive dynamic programming using measured output data. IEEE Trans. Syst. Man
Cybern. Part B: Cybern. 41(1), 14–25 (2011)
4. Song, R., Xiao, W., Zhang, H., Sun, C.: Adaptive dynamic programming for a class of complex-
valued nonlinear systems. IEEE Trans. Neural Netw. Learn. Syst. 25(9), 1733–1739 (2014)
5. Lee, J., Park, J., Choi, Y.: Integral Q-learning and explorized policy iteration for adaptive
optimal control of continuous-time linear systems. Automatica 48(11), 2850–2859 (2012)
6. Kiumarsi, B., Lewis, F., Modares, H., Karimpour, A., Naghibi-Sistani, M.: Reinforcement Q-
learning for optimal tracking control of linear discrete-time systems with unknown dynamics.
Automatica 50(4), 1167–1175 (2014)
7. Vrabie, D., Pastravanu, O., Lewis, F., Abu-Khalaf, M.: Adaptive optimal control for continuous-
time linear systems based on policy iteration. Automatica 45(2), 477–484 (2009)
8. Lewis, F., Vrabie, D., Vamvoudakis, K.: Reinforcement learning and feedback control: using
natural decision methods to design optimal adaptive controllers. IEEE Control Syst. Mag.
32(6), 76–105 (2012)
9. Bhasin, S., Kamalapurkar, R., Johnson, M., Vamvoudakis, K., Lewis, F., Dixon, W.: A novel
actor-critic-identifier architecture for approximate optimal control of uncertain nonlinear sys-
tems. Automatica 49(1), 82–92 (2013)
10. Vrabie, D., Lewis, F.: Adaptive dynamic programming for online solution of a zero-sum dif-
ferential game. J. Control Theory Appl. 9(3), 353–360 (2011)
11. Vrabie, D., Lewis, F.: Integral reinforcement learning for online computation of feedback nash
strategies of nonzero-sum differential games. In: Proceedings of Decision and Control, Atlanta,
GA, USA, pp. 3066–3071 (2010)
12. Sutton, R., Barto, A.: Reinforcement Learning: An Introduction. A Bradford Book. The MIT
Press, Cambridge (2005)

13. Wang, J., Xu, X., Liu, D., Sun, Z., Chen, Q.: Self-learning cruise control using Kernel-based
least squares policy iteration. IEEE Trans. Control Syst. Technol. 22(3), 1078–1087 (2014)
14. Vamvoudakis, K., Vrabie, D., Lewis, F.: Online adaptive algorithm for optimal control with
integral reinforcement learning. Int. J. Robust Nonlinear Control 24(17), 2686–2710 (2015)
15. Li, H., Liu, D., Wang, D.: Integral reinforcement learning for linear continuous-time zero-sum
games with completely unknown dynamics. IEEE Trans. Autom. Sci. Eng. 11(3), 706–714
(2014)
16. Luo, B., Wu, H., Huang, T.: Off-policy reinforcement learning for H∞ control design. IEEE
Trans. Cybern. 45(1), 65–76 (2015)
17. Jiang, Y., Jiang, Z.: Computational adaptive optimal control for continuous-time linear systems
with completely unknown dynamics. Automatica 48(10), 2699–2704 (2012)
18. Jiang, Y., Jiang, Z.: Robust adaptive dynamic programming for nonlinear control design. In:
Proceedings of IEEE Conference on Decision and Control, Maui, Hawaii, USA, pp. 1896–1901
(2012)
19. Jiang, Y., Jiang, Z.: Robust adaptive dynamic programming and feedback stabilization of non-
linear systems. IEEE Trans. Neural Netw. Learn. Syst. 25(5), 882–893 (2014)
20. Beard, R., Saridis, G., Wen, J.: Galerkin approximations of the generalized Hamilton-Jacobi-
Bellman equation. Automatica 33(12), 2159–2177 (1997)
21. Abu-Khalaf, M., Lewis, F.: Nearly optimal control laws for nonlinear systems with saturating
actuators using a neural network HJB approach. Automatica 41(5), 779–791 (2005)
22. Saridis, G., Lee, C.: An approximation theory of optimal control for trainable manipulators.
IEEE Trans. Syst. Man Cybern. Part B: Cybern. 9(3), 152–159 (1979)
23. Hornik, K., Stinchcombe, M., White, H., Auer, P.: Degree of approximation results for feedfor-
ward networks approximating unknown mappings and their derivatives. Neural Comput. 6(6),
1262–1275 (1994)
24. Lewis, F., Jagannathan, S., Yesildirek, A.: Neural Network Control of Robot Manipulators and
Nonlinear Systems. Taylor and Francis, London (1999)
25. Liu, D., Wei, Q.: Policy iteration adaptive dynamic programming algorithm for discrete-time
nonlinear systems. IEEE Trans. Neural Netw. Learn. Syst. 25(3), 621–634 (2014)
Chapter 10
An Iterative ADP Method to Solve for a Class of Nonlinear Zero-Sum Differential Games

In this chapter, an iterative ADP method is presented to solve a class of continuous-time
nonlinear two-person zero-sum differential games. The idea is to use the ADP
technique to iteratively obtain the optimal control pair that makes the performance
index function reach the saddle point of the zero-sum differential game. When the
saddle point does not exist, the mixed optimal control pair is obtained to make the
performance index function reach the mixed optimum. Rigorous proofs guarantee
that the control pair stabilizes the nonlinear system, and the convergence of the
performance index function is also proved. Neural networks are used
to approximate the performance index function, compute the optimal control policy,
and model the nonlinear system, respectively, to facilitate the implementation of
the iterative ADP method. Two examples are given to demonstrate the validity of the
proposed method.

10.1 Introduction

A large class of real systems are controlled by more than one controller or decision
maker, each using an individual strategy. These controllers often operate in a
group with a general quadratic performance index function as a game [1]. Zero-sum
(ZS) differential game theory has been widely applied to decision making problems
[2–7], stimulated by a vast number of applications, including those in economics,
management, communication networks, power networks, and in the design of com-
plex engineering systems. In these situations, many control schemes are presented
in order to reach some form of optimality [8, 9]. Traditional approaches to dealing with
ZS differential games are to find the optimal solution, or saddle point, of the
game. Hence, much interest has been devoted to the existence conditions of
differential ZS games [10, 11].
© Science Press, Beijing and Springer Nature Singapore Pte Ltd. 2019 165
R. Song et al., Adaptive Dynamic Programming: Single and Multiple
Controllers, Studies in Systems, Decision and Control 166,
https://doi.org/10.1007/978-981-13-1712-5_10

In the real world, however, the existence conditions of the saddle point for ZS
differential games are so difficult to satisfy that many applications of ZS differential
games are limited to linear systems [12–14]. On the other hand, for many ZS
differential games, especially in the nonlinear case, the optimal solution of the game (or
saddle point) does not exist inherently. Therefore, it is necessary to study optimal
control approaches for ZS differential games in which the saddle point is invalid. An
earlier optimal control scheme adopts the mixed trajectory method [15, 16]:
one player selects an optimal probability distribution over his control set, the
other player selects an optimal probability distribution over his own control set, and
then the expected solution of the game is obtained in the sense of
probability. The expected solution of the game is called the mixed optimal solution, and
the corresponding performance index function is the mixed optimal performance index
function. The main difficulty of the mixed trajectory approach for ZS differential games is
that the optimal probability distribution is too hard to obtain, if not impossible, over
the whole real space. Furthermore, the mixed optimal solution is hardly reached once
the control schemes are determined. In most cases (i.e., in engineering cases), however,
the optimal solution or mixed optimal solution of the ZS differential game has
to be achieved by a determined optimal or mixed optimal control scheme. In order
to overcome these difficulties, a new iterative approach is proposed in this chapter
to solve ZS differential games for nonlinear systems.
In this chapter, for the first time, continuous-time two-person ZS differential
games for nonlinear systems are solved by the iterative ADP method. When
the saddle point exists, using the proposed iterative ADP method, the optimal control
pair is obtained that makes the performance index function reach the saddle point. When
the saddle point does not exist, according to the mixed trajectory method, a determined
mixed optimal control scheme is proposed to obtain the mixed optimal performance
index function. In brief, the main contributions of this chapter include:
(1) Constructing a new iterative method to solve two-person ZS differential games
for a class of nonlinear systems using ADP.
(2) Obtaining the optimal control pair that makes the performance index function
reach the saddle point, with rigorous stability and convergence proofs.
(3) Designing a determined mixed optimal control scheme to obtain the mixed optimal
performance index function when no saddle point exists, with analysis of
stability and convergence.
This chapter is organized as follows. Section 10.2 presents the preliminaries and
assumptions. In Sect. 10.3, the iterative ADP method for ZS differential games is
proposed. In Sect. 10.4, the neural network implementation of the optimal control
scheme is presented. In Sect. 10.5, simulation studies are given to demonstrate the
effectiveness of the proposed method. The conclusion is drawn in Sect. 10.6.

10.2 Preliminaries and Assumptions

Consider the following two-person ZS differential game. The state trajectory of the
game at time t, denoted by x = x(t), is described by the continuous-time affine nonlinear
function

ẋ = f (x, u, w) = a(x) + b(x)u + c(x)w, (10.1)

where x ∈ Rn , u ∈ Rk , w ∈ Rm and the initial condition x(0) = x0 is given.


The two control variables u and w are functions on [0, ∞) chosen, respectively,
by player I and player II from some control sets U [0, ∞) and W [0, ∞), subject to
the constraints u ∈ U (t) and w ∈ W (t) for t ∈ [0, ∞), for given convex and compact
sets U (t) ⊂ Rk , W (t) ⊂ Rm . The performance index function is of the generalized
quadratic form (see [17]) given by
V (x(0), u, w) = ∫_0^∞ (x T Ax + u T Bu + wT Cw + 2u T Dw
+ 2x T Eu + 2x T Fw)dt, (10.2)

where the matrices A, B, C, D, E, and F have suitable dimensions, with A ≥ 0, B > 0,
and C < 0. Hence, for all t ∈ [0, ∞), the performance index function V (x(t), u, w)
(denoted by V (x) for brevity in the sequel) is convex in u and concave in w. Here
l(x, u, w) = x T Ax + u T Bu + wT Cw + 2u T Dw + 2x T Eu + 2x T Fw is a general
quadratic utility function. In the above ZS differential game, there are two
controllers or players: player I tries to minimize the performance index function
V (x), while player II attempts to maximize it. According to the situation of the
two players, we have the following definitions.
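A scalar instance illustrating why A ≥ 0, B > 0, C < 0 make the utility l(x, u, w) convex in u and concave in w; all coefficient values below are hypothetical.

```python
# Scalar utility l(x, u, w) = Ax^2 + Bu^2 + Cw^2 + 2Duw + 2Exu + 2Fxw,
# with assumed coefficients satisfying A >= 0, B > 0, C < 0.
A, B, C, D, E, F = 1.0, 2.0, -3.0, 0.5, 0.2, -0.1

def utility(x, u, w):
    return (A * x * x + B * u * u + C * w * w
            + 2 * D * u * w + 2 * E * x * u + 2 * F * x * w)

# Second-difference tests at a fixed point: positive curvature in u (B > 0),
# negative curvature in w (C < 0).
x0, u0, w0, h = 1.0, 0.3, -0.2, 0.1
convex_u = utility(x0, u0 - h, w0) + utility(x0, u0 + h, w0) - 2 * utility(x0, u0, w0)
concave_w = utility(x0, u0, w0 - h) + utility(x0, u0, w0 + h) - 2 * utility(x0, u0, w0)
```

Because the utility is quadratic, the second differences equal 2Bh² and 2Ch² exactly, so the min-in-u, max-in-w structure of the game is well posed.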
Let

V̄ (x) := inf_{u∈U [t,∞)} sup_{w∈W [t,∞)} V (x, u, w) (10.3)

be the upper performance index function, and

V̲ (x) := sup_{w∈W [t,∞)} inf_{u∈U [t,∞)} V (x, u, w) (10.4)

be the lower performance index function, with the obvious inequality V̄ (x) ≥ V̲ (x).
Define the optimal control pairs for the upper and lower performance index functions
as (ū, w̄) and (u̲, w̲), respectively. Then we have

V̄ (x) = V (x, ū, w̄), (10.5)

and

V̲ (x) = V (x, u̲, w̲). (10.6)

If both V̄ (x) and V̲ (x) exist and

V̄ (x) = V̲ (x) = V ∗ (x), (10.7)



we say that the optimal performance index function of the ZS differential game or the
saddle point exists and the corresponding optimal control pair is denoted by (u ∗ , w∗ ).
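The gap between the upper value (10.3) and the lower value (10.4) can be illustrated on a finite toy payoff table; this matching-pennies-style example with a two-action grid is purely illustrative and not the chapter's continuous game.

```python
# Payoff table V(u_i, w_j): player I (rows) minimizes, player II (cols) maximizes.
# This table has no saddle point, so the upper value strictly exceeds the lower.
payoff = [[1.0, -1.0],
          [-1.0, 1.0]]

upper = min(max(row) for row in payoff)                       # inf_u sup_w
cols = [[payoff[i][j] for i in range(2)] for j in range(2)]
lower = max(min(col) for col in cols)                         # sup_w inf_u
```

Here upper = 1 and lower = -1, so V̄ > V̲ and no pure saddle point exists; this is exactly the situation the mixed optimal scheme of this chapter is designed to handle.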
The following assumptions and lemmas are in effect in the remaining sections.

Assumption 10.1 The nonlinear system (10.1) is controllable.

Assumption 10.2 The upper performance index function and the lower performance
index function both exist.

Assumption 10.3 The system function f (x, u, w) is Lipschitz continuous on a set
in Rn containing the origin. The constraint sets U [0, ∞) and W [0, ∞) are subsets of the
space of measurable and integrable functions on [0, ∞). The control sets
U (t) and W (t), with u ∈ U (t) and w ∈ W (t) for t ∈ [0, ∞), are nonempty, closed,
convex, and measurably dependent on t.

Based on the above assumptions, the following two lemmas are important for applying
the dynamic programming method.

Lemma 10.1 (principle of optimality) If the upper and lower performance index
functions are defined as (10.3) and (10.4), respectively, then for 0 ≤ t ≤ tˆ < ∞, x ∈
Rn , u ∈ Rk , w ∈ Rm , we have

V̄ (x) = inf_{u∈U [t,tˆ)} sup_{w∈W [t,tˆ)} { ∫_t^{tˆ} (x T Ax + u T Bu
+ wT Cw + 2u T Dw + 2x T Eu + 2x T Fw)dt + V̄ (x(tˆ)) }, (10.8)

and

V̲ (x) = sup_{w∈W [t,tˆ)} inf_{u∈U [t,tˆ)} { ∫_t^{tˆ} (x T Ax + u T Bu
+ wT Cw + 2u T Dw + 2x T Eu + 2x T Fw)dt + V̲ (x(tˆ)) }. (10.9)

Proof See [18].

According to the above lemma, we can prove the following results.

Lemma 10.2 (HJI equation) If the upper and lower performance index functions are
defined as (10.3) and (10.4), respectively, we can obtain the following Hamilton–
Jacobi–Isaacs (HJI) equations:

H J I (V̄ (x), ū, w̄) = V̄ t (x) + H (V̄ x (x), ū, w̄) = 0, (10.10)

where V̄ t (x) = dV̄ (x)/dt, V̄ x (x) = dV̄ (x)/dx, and H (V̄ x (x), u, w) =
inf_{u∈U} sup_{w∈W} {V̄ x (a(x) + b(x)u + c(x)w) + (x T Ax + u T Bu + wT Cw +
2u T Dw + 2x T Eu + 2x T Fw)} is called the upper Hamilton function; and

H J I (V̲ (x), u̲, w̲) = V̲ t (x) + H (V̲ x (x), u̲, w̲) = 0, (10.11)

where V̲ t (x) = dV̲ (x)/dt, V̲ x (x) = dV̲ (x)/dx, and H (V̲ x (x), u, w) =
sup_{w∈W} inf_{u∈U} {V̲ x (a(x) + b(x)u + c(x)w) + (x T Ax + u T Bu + wT Cw +
2u T Dw + 2x T Eu + 2x T Fw)} is called the lower Hamilton function.

Proof See [18].

10.3 Iterative Approximate Dynamic Programming Method for ZS Differential Games

The optimal control pair is obtained by solving the HJI equations (10.10) and (10.11), but these equations generally cannot be solved analytically; no existing method rigorously solves this type of equation for the optimal performance index function of the system. This is why we introduce the iterative ADP method. In this section, we propose the iterative ADP method for ZS differential games and show that the iterative ADP framework extends to this setting.

10.3.1 Derivation of the Iterative ADP Method

The goal of the proposed iterative approximate dynamic programming method is to use the adaptive critic design technique to adaptively construct an optimal control pair $(u^*, w^*)$ that takes an arbitrary initial state $x(0)$ to the equilibrium $0$ and simultaneously makes the performance index function reach the saddle point $V^*(x)$ under rigorous convergence and stability criteria. Since the values of the upper and lower performance index functions are not necessarily equal (in which case no saddle point exists), the optimal control pair $(u^*, w^*)$ may not exist. This motivates us to change the control scheme to obtain a new performance index function satisfying $\underline{V}(x) \le V^o(x) \le \overline{V}(x)$, where $V^o(x)$ is the mixed optimal performance index function of the ZS differential game [15, 16]. Since $V^o(x)$ does not satisfy the HJI equations (10.10) and (10.11), it cannot be solved for directly. Therefore, an iterative approximation approach is proposed in this chapter to approximate the mixed optimal performance index function.

Theorem 10.1 If Assumptions 10.1–10.3 hold, $(\overline{u}, \overline{w})$ is the optimal control pair for the upper performance index function, and $(\underline{u}, \underline{w})$ is the optimal control pair
170 10 An Iterative ADP Method to Solve for a Class …

for the lower performance index function, then there exist control pairs $(\overline{u}, w)$ and $(u, \underline{w})$ such that $V^o(x) = V(x, \overline{u}, w) = V(x, u, \underline{w})$. Furthermore, if the saddle point exists, then $V^o(x) = V^*(x)$.
Proof According to (10.3) and (10.5), we have $V^o(x) \le \overline{V}(x, \overline{u}, \overline{w})$. Simultaneously, we have $V(x, \overline{u}, w) \le \overline{V}(x, \overline{u}, \overline{w})$. As the system (10.1) is controllable and the performance index is continuous in $w$ on $\mathbb{R}^m$, there exists a control pair $(\overline{u}, w)$ that makes $V^o(x) = V(x, \overline{u}, w)$. On the other hand, according to (10.4) and (10.6), we have $V^o(x) \ge \underline{V}(x, \underline{u}, \underline{w})$, and also $V(x, u, \underline{w}) \ge \underline{V}(x, \underline{u}, \underline{w})$. As the performance index is continuous in $u$ on $\mathbb{R}^k$, there exists a control pair $(u, \underline{w})$ that makes $V^o(x) = V(x, u, \underline{w})$. Then $V^o(x) = V(x, \overline{u}, w) = V(x, u, \underline{w})$.
If the saddle point exists, we have $V^*(x) = \overline{V}(x) = \underline{V}(x)$. On the other hand, $\underline{V}(x) \le V^o(x) \le \overline{V}(x)$. Then obviously $V^o(x) = V^*(x)$. □
The above theorem establishes the relationship between the optimal (or mixed optimal) performance index function and the upper and lower performance index functions. It also implies that the mixed optimal performance index function can be obtained by regulating the optimal control pairs for the upper and lower performance index functions. So, first, it is necessary to find the optimal control pairs for both the upper and lower performance index functions.
Differentiating the HJI equation (10.10) with respect to the control $w$ for the upper performance index function yields
\[
\frac{\partial \overline{H}}{\partial w} = \overline{V}_x^T c(x) + 2w^T C + 2u^T D + 2x^T F = 0.
\tag{10.12}
\]
Then we get
\[
\overline{w} = -\frac{1}{2} C^{-1} \big( 2D^T u + 2F^T x + c^T(x)\overline{V}_x \big).
\tag{10.13}
\]
Substituting (10.13) into (10.10) and taking the derivative with respect to $u$, we obtain
\[
\overline{u} = -\frac{1}{2} (B - DC^{-1}D^T)^{-1} \big( 2(E^T - DC^{-1}F^T)x + (b^T(x) - DC^{-1}c^T(x))\overline{V}_x \big).
\tag{10.14}
\]
Thus, the explicit expression of the optimal control pair $(\overline{u}, \overline{w})$ for the upper performance index function is obtained.
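As a quick numerical sanity check (not from the book), the closed forms (10.13) and (10.14) can be verified to be stationary points of the quadratic Hamiltonian. All matrices, the evaluations of $b(x)$, $c(x)$, and the gradient sample $\overline{V}_x$ below are assumed purely for illustration.

```python
import numpy as np

# Hypothetical data: n = 2 states, scalar u and w; Vx is an assumed
# sample of the value-function gradient (in the chapter it would come
# from the current iterate of the upper performance index function).
A = np.array([[2.0, 0.3], [0.3, 1.5]])
B = np.array([[2.0]]); C = np.array([[3.0]]); D = np.array([[0.4]])
E = np.array([[0.5], [0.2]]); F = np.array([[0.1], [0.3]])
b = np.array([[1.0], [0.5]])   # b(x) evaluated at the current state
c = np.array([[0.2], [1.0]])   # c(x) evaluated at the current state
x = np.array([[1.0], [-2.0]])
Vx = np.array([[0.7], [0.4]])  # assumed gradient sample

Cinv = np.linalg.inv(C)
# (10.14): u-bar via the Schur complement B - D C^{-1} D^T
u = -0.5 * np.linalg.solve(
    B - D @ Cinv @ D.T,
    2.0 * (E.T - D @ Cinv @ F.T) @ x + (b.T - D @ Cinv @ c.T) @ Vx)
# (10.13): w-bar given u-bar
w = -0.5 * Cinv @ (2.0 * D.T @ u + 2.0 * F.T @ x + c.T @ Vx)

# Both Hamiltonian gradients should vanish at the stationary point
dH_du = b.T @ Vx + 2 * B @ u + 2 * D @ w + 2 * E.T @ x
dH_dw = c.T @ Vx + 2 * C @ w + 2 * D.T @ u + 2 * F.T @ x
print(float(np.abs(dH_du).max()), float(np.abs(dH_dw).max()))
```

Both printed residuals are at floating-point level, confirming that (10.13) and (10.14) zero out $\partial H/\partial w$ and $\partial H/\partial u$ simultaneously.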
For the lower performance index function, according to (10.11), taking the derivative with respect to the control $u$ gives
\[
\frac{\partial \underline{H}}{\partial u} = \underline{V}_x^T b(x) + 2u^T B + 2w^T D + 2x^T E = 0.
\tag{10.15}
\]
Then we get
\[
\underline{u} = -\frac{1}{2} B^{-1} \big( 2Dw + 2E^T x + b^T(x)\underline{V}_x \big).
\tag{10.16}
\]
Substituting (10.16) into (10.11) and taking the derivative with respect to $w$, we obtain
\[
\underline{w} = -\frac{1}{2} (C - D^T B^{-1} D)^{-1} \big( 2(F^T - D^T B^{-1} E^T)x + (c^T(x) - D^T B^{-1} b^T(x))\underline{V}_x \big).
\tag{10.17}
\]
So the optimal control pair $(\underline{u}, \underline{w})$ for the lower performance index function is also obtained.
If equality (10.7) holds under the optimal control pairs $(\overline{u}, \overline{w})$ and $(\underline{u}, \underline{w})$, we have a saddle point; if not, the game has no saddle point. For a differential game without a saddle point, we adopt a mixed trajectory method to obtain the mathematical expectation of the performance index function. To apply the mixed trajectory method, the game matrix must be built from the trajectory sets of the control pairs. Sufficiently small Gaussian noises $\gamma_u \in \mathbb{R}^k$ and $\gamma_w \in \mathbb{R}^m$ are added to the optimal controls $\underline{u}$ and $\overline{w}$, respectively, where $\gamma_u^i \sim (0, \sigma_i^2)$, $i = 1, 2, \ldots, k$, and $\gamma_w^j \sim (0, \sigma_j^2)$, $j = 1, 2, \ldots, m$, are zero-mean exploration noises with variances $\sigma_i^2$ and $\sigma_j^2$, respectively.

Therefore, the upper and lower performance index functions (10.3) and (10.4) become $\overline{V}(x, \overline{u}, (\overline{w} + \gamma_w))$ and $\underline{V}(x, (\underline{u} + \gamma_u), \underline{w})$, respectively, where $\overline{V}(x, \overline{u}, (\overline{w} + \gamma_w)) > \underline{V}(x, (\underline{u} + \gamma_u), \underline{w})$ holds. Define the following game matrix:
\[
L = \begin{array}{c|cc}
 & II_1 & II_2 \\ \hline
I_1 & L_{11} & L_{12} \\
I_2 & L_{21} & L_{22}
\end{array}
\tag{10.18}
\]
where $L_{11} = \overline{V}(x, \overline{u}, \overline{w})$, $L_{12} = \underline{V}(x, (\underline{u} + \gamma_u), \underline{w})$, $L_{21} = \underline{V}(x, \underline{u}, \underline{w})$, and $L_{22} = \overline{V}(x, \overline{u}, (\overline{w} + \gamma_w))$. Here $I_1, I_2$ are trajectories of player I, and $II_1, II_2$ are trajectories of player II.
According to the principle of mixed trajectories [16], the expected performance index function is
\[
E(V(x)) = \min_{P_{Ii}} \max_{P_{IIj}} \sum_{i=1}^{2} \sum_{j=1}^{2} P_{Ii} L_{ij} P_{IIj},
\tag{10.19}
\]
where $P_{Ii} > 0$, $i = 1, 2$, is the probability of player I choosing trajectory $I_i$, satisfying $\sum_{i=1}^{2} P_{Ii} = 1$, and $P_{IIj} > 0$, $j = 1, 2$, is the probability of player II choosing trajectory $II_j$, satisfying $\sum_{j=1}^{2} P_{IIj} = 1$. Then we have

\[
E(V(x)) = \alpha \overline{V}(x) + (1 - \alpha)\underline{V}(x),
\tag{10.20}
\]
where $\alpha = \dfrac{V^o - \underline{V}}{\overline{V} - \underline{V}}$.
For example, let the game matrix be
\[
L = \begin{array}{c|cc}
 & II_1 & II_2 \\ \hline
I_1 & L_{11} = 11 & L_{12} = 7 \\
I_2 & L_{21} = 5 & L_{22} = 9
\end{array}.
\]
According to (10.19), for the trajectories of player I, choose trajectory $I_1$ with probability $P_I$ and trajectory $I_2$ with probability $(1 - P_I)$; for the trajectories of player II, choose trajectory $II_1$ with probability $P_{II}$ and trajectory $II_2$ with probability $(1 - P_{II})$. Then the mathematical expectation can be expressed as
\[
\begin{aligned}
E(L) &= L_{11} P_I P_{II} + L_{12} P_I (1 - P_{II}) + L_{21} (1 - P_I) P_{II} + L_{22} (1 - P_I)(1 - P_{II}) \\
&= 11 P_I P_{II} + 7 P_I (1 - P_{II}) + 5 (1 - P_I) P_{II} + 9 (1 - P_I)(1 - P_{II}) \\
&= 8\left(P_I - \frac{1}{2}\right)\left(P_{II} - \frac{1}{4}\right) + 8.
\end{aligned}
\tag{10.21}
\]
So the value of the expectation is 8, and it can be written as
\[
E(L) = \frac{1}{2} L_{11} + \frac{1}{2} L_{21}.
\]
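The mixed-strategy computation above can be reproduced in a few lines. The sketch below uses the standard closed form for a $2 \times 2$ matrix game without a pure saddle point (this closed form is textbook game theory, not part of the chapter's algorithm):

```python
import numpy as np

# The chapter's 2x2 example game: rows = player I trajectories,
# columns = player II trajectories.
L = np.array([[11.0, 7.0],
              [5.0,  9.0]])

den = L[0, 0] - L[0, 1] - L[1, 0] + L[1, 1]    # nonzero -> no pure saddle point
p = (L[1, 1] - L[1, 0]) / den                  # P_I of choosing trajectory I_1
q = (L[1, 1] - L[0, 1]) / den                  # P_II of choosing trajectory II_1
value = (L[0, 0] * L[1, 1] - L[0, 1] * L[1, 0]) / den   # expected value

print(p, q, value)   # 0.5 0.25 8.0
```

With $P_I = 1/2$ and $P_{II} = 1/4$ the expectation equals 8 regardless of the opponent's mixture, matching (10.21).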
Remark 10.1 From the above example we can see that once a single trajectory in the trajectory set is chosen, the expected value 8 cannot be attained by any individual play. In most practical optimal control settings, however, the expected optimal (or mixed optimal) performance has to be achieved.
In the following, the method to achieve the mixed optimal performance index function is described. Computing the expected performance index function $N$ times under the exploration noises $\gamma_w \in \mathbb{R}^m$ and $\gamma_u \in \mathbb{R}^k$ in the controls $w$ and $u$, we obtain $E_1(V(x)), E_2(V(x)), \ldots, E_N(V(x))$. Then the mixed optimal performance index function can be written as
\[
V^o(x) = E(E_i(V(x))) = \frac{1}{N} \sum_{i=1}^{N} E_i(V(x)) = \alpha \overline{V}(x) + (1 - \alpha)\underline{V}(x),
\tag{10.22}
\]
where $\alpha = \dfrac{V^o - \underline{V}}{\overline{V} - \underline{V}}$.
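The definition of $\alpha$ looks circular, but it simply states that $V^o$ is the convex combination of the two bounds with weight $\alpha$. A tiny check with assumed numbers (the same 8, 5, 9 scale as the matrix-game example):

```python
# Assumed values for illustration: lower bound, upper bound, and a mixed
# optimal value lying between them.
V_up, V_low = 9.0, 5.0
V_o = 8.0                     # must satisfy V_low <= V_o <= V_up

alpha = (V_o - V_low) / (V_up - V_low)
recombined = alpha * V_up + (1 - alpha) * V_low
print(alpha, recombined)      # 0.75 8.0
```

Any $V^o$ in the interval $[\underline{V}, \overline{V}]$ yields an $\alpha \in [0, 1]$, and the convex combination reproduces $V^o$ exactly.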

Let $l^o(x, \overline{u}, \overline{w}, \underline{u}, \underline{w}) = \alpha\, l(x, \overline{u}, \overline{w}) + (1 - \alpha)\, l(x, \underline{u}, \underline{w})$; then $V^o(x)$ can be expressed as
\[
V^o(x(0)) = \int_0^\infty l^o(x, \overline{u}, \overline{w}, \underline{u}, \underline{w})\,dt.
\tag{10.23}
\]
Then, according to Theorem 10.1, the mixed optimal control pair can be obtained by regulating the control $w$ in the control pair $(\overline{u}, w)$ so as to minimize the error between $\hat{V}(x)$ and $V^o(x)$, where the performance index function $\hat{V}(x)$ is defined by
\[
\hat{V}(x(0)) = V(x(0), \overline{u}, w) = \int_0^\infty \big( x^T A x + \overline{u}^T B \overline{u} + w^T C w + 2\overline{u}^T D w + 2x^T E \overline{u} + 2x^T F w \big)\,dt,
\tag{10.24}
\]
and $\underline{V}(x(0)) \le \hat{V}(x(0)) \le \overline{V}(x(0))$.
Define
\[
\tilde{V}(x(0)) = \hat{V}(x(0)) - V^o(x(0)) = \int_0^\infty \tilde{l}(x, \overline{u}, \overline{w}, \underline{u}, \underline{w}, w)\,dt,
\tag{10.25}
\]
where $\tilde{l}(x, \overline{u}, \overline{w}, \underline{u}, \underline{w}, w) = l(x, \overline{u}, w) - l^o(x, \overline{u}, \overline{w}, \underline{u}, \underline{w})$, abbreviated as $\tilde{l}(x, w)$. Then the problem can be described as
\[
\min_{w} \big( \tilde{V}(x) \big)^2.
\tag{10.26}
\]

According to the principle of optimality, when $\tilde{V}(x) \ge 0$ we have the following Hamilton–Jacobi–Bellman (HJB) equation:
\[
HJB(\tilde{V}(x), w) = \tilde{V}_t(x) + H(\tilde{V}_x(x), x, w) = 0,
\tag{10.27}
\]
where $\tilde{V}_t(x) = \dfrac{d\tilde{V}(x)}{dt}$, $\tilde{V}_x(x) = \dfrac{d\tilde{V}(x)}{dx}$, and the Hamiltonian is $H(\tilde{V}_x(x), x, w) = \min_{w \in W} \big\{ \tilde{V}_x^T \big(a(x) + b(x)\overline{u} + c(x)w\big) + \tilde{l}(x, w) \big\}$.
When $\tilde{V}(x) < 0$, we have $-\tilde{V}(x) = -(\hat{V}(x) - V^o(x)) > 0$, and the HJB equation becomes
\[
HJB(-\tilde{V}(x), w) = (-\tilde{V}_t(x)) + H((-\tilde{V}_x(x)), x, w) = \tilde{V}_t(x) + H(\tilde{V}_x(x), x, w) = 0,
\tag{10.28}
\]
which is the same as (10.27). Then the optimal control $w$ can be obtained by differentiating the HJB equation (10.27) with respect to the control $w$:
\[
w = -\frac{1}{2} C^{-1} \big( 2D^T \overline{u} + 2F^T x + c^T(x)\tilde{V}_x \big).
\tag{10.29}
\]
Remark 10.2 We can also obtain the mixed optimal control pair by regulating the control $u$ in the control pair $(u, \underline{w})$ so as to minimize the error between $\hat{V}(x)$ and $V^o(x)$, where the performance index function $\hat{V}(x)$ is defined by
\[
\hat{V}(x(0)) = V(x(0), u, \underline{w}) = \int_0^\infty \big( x^T A x + u^T B u + \underline{w}^T C \underline{w} + 2u^T D \underline{w} + 2x^T E u + 2x^T F \underline{w} \big)\,dt.
\tag{10.30}
\]

Remark 10.3 From (10.24) to (10.30) we can see that effective mixed optimal control schemes are proposed to obtain the mixed optimal performance index function when the saddle point does not exist, although the uniqueness of the mixed optimal control pair is rarely guaranteed.

10.3.2 The Procedure of the Method

Given the above preparation, we now formulate the iterative approximate dynamic
programming method for ZS differential games as follows.

Step 1 Initialize the algorithm with a stabilizing performance index function $V^{[0]}$ and control pair $(u^{[0]}, w^{[0]})$ for which Assumptions 10.1–10.3 hold. Set the computation precision $\zeta > 0$.

Step 2 For $i = 0, 1, \ldots$, from the same initial state $x(0)$, run the system with control pair $(\overline{u}^{[i]}, \overline{w}^{[i]})$ for the upper performance index function, and run the system with control pair $(\underline{u}^{[i]}, \underline{w}^{[i]})$ for the lower performance index function.

Step 3 For $i = 0, 1, \ldots$, for the upper performance index function let
\[
\overline{V}^{[i]}(x(0)) = \int_0^\infty \big( x^T A x + \overline{u}^{[i+1]T} B \overline{u}^{[i+1]} + \overline{w}^{[i+1]T} C \overline{w}^{[i+1]} + 2\overline{u}^{[i+1]T} D \overline{w}^{[i+1]} + 2x^T E \overline{u}^{[i+1]} + 2x^T F \overline{w}^{[i+1]} \big)\,dt,
\tag{10.31}
\]
where the iterative optimal control pair is given by
\[
\overline{u}^{[i+1]} = -\frac{1}{2}(B - DC^{-1}D^T)^{-1} \big( 2(E^T - DC^{-1}F^T)x + (b^T(x) - DC^{-1}c^T(x))\overline{V}_x^{[i]} \big),
\tag{10.32}
\]

and
\[
\overline{w}^{[i+1]} = -\frac{1}{2} C^{-1} \big( 2D^T \overline{u}^{[i+1]} + 2F^T x + c^T(x)\overline{V}_x^{[i]} \big),
\tag{10.33}
\]
where $(\overline{u}^{[i]}, \overline{w}^{[i]})$ satisfies $HJI(\overline{V}^{[i]}(x), \overline{u}^{[i]}, \overline{w}^{[i]}) = 0$.

Step 4 If $\big| \overline{V}^{[i+1]}(x(0)) - \overline{V}^{[i]}(x(0)) \big| < \zeta$, let $\overline{u} = \overline{u}^{[i]}$, $\overline{w} = \overline{w}^{[i]}$, and $\overline{V}(x) = \overline{V}^{[i+1]}(x)$, and go to Step 5; else set $i = i + 1$ and go to Step 3.

Step 5 For $i = 0, 1, \ldots$, for the lower performance index function let
\[
\underline{V}^{[i]}(x(0)) = \int_0^\infty \big( x^T A x + \underline{u}^{[i+1]T} B \underline{u}^{[i+1]} + \underline{w}^{[i+1]T} C \underline{w}^{[i+1]} + 2\underline{u}^{[i+1]T} D \underline{w}^{[i+1]} + 2x^T E \underline{u}^{[i+1]} + 2x^T F \underline{w}^{[i+1]} \big)\,dt,
\tag{10.34}
\]

where the iterative optimal control pair is given by
\[
\underline{w}^{[i+1]} = -\frac{1}{2}(C - D^T B^{-1} D)^{-1} \big( 2(F^T - D^T B^{-1} E^T)x + (c^T(x) - D^T B^{-1} b^T(x))\underline{V}_x^{[i]} \big),
\tag{10.35}
\]
and
\[
\underline{u}^{[i+1]} = -\frac{1}{2} B^{-1} \big( 2D\underline{w}^{[i+1]} + 2E^T x + b^T(x)\underline{V}_x^{[i]} \big),
\tag{10.36}
\]
where $(\underline{u}^{[i]}, \underline{w}^{[i]})$ satisfies the HJI equation $HJI(\underline{V}^{[i]}(x), \underline{u}^{[i]}, \underline{w}^{[i]}) = 0$.

Step 6 If $\big| \underline{V}^{[i+1]}(x(0)) - \underline{V}^{[i]}(x(0)) \big| < \zeta$, let $\underline{u} = \underline{u}^{[i]}$, $\underline{w} = \underline{w}^{[i]}$, and $\underline{V}(x) = \underline{V}^{[i+1]}(x)$, and go to the next step; else set $i = i + 1$ and go to Step 5.

Step 7 If $\big| \overline{V}(x(0)) - \underline{V}(x(0)) \big| < \zeta$, stop; the saddle point is achieved. Else go to the next step.

Step 8 For $i = 0, 1, \ldots$, regulate the control $w$ for the upper performance index function and let
\[
\tilde{V}^{[i+1]}(x(0)) = \hat{V}^{[i+1]}(x(0)) - V^o(x(0)) = \int_0^\infty \big( x^T A x + \overline{u}^T B \overline{u} + w^{[i]T} C w^{[i]} + 2\overline{u}^T D w^{[i]} + 2x^T E \overline{u} + 2x^T F w^{[i]} - l^o(x, \overline{u}, \overline{w}, \underline{u}, \underline{w}) \big)\,dt,
\tag{10.37}
\]

with the iterative optimal control given by
\[
w^{[i]} = -\frac{1}{2} C^{-1} \big( 2D^T \overline{u} + 2F^T x + c^T(x)\tilde{V}_x^{[i+1]} \big).
\tag{10.38}
\]

Step 9 If $\big| \hat{V}^{[i+1]}(x(0)) - V^o(x(0)) \big| < \zeta$, stop; else set $i = i + 1$ and go to Step 8.

Remark 10.4 In Step 8 of the above procedure, we can instead regulate the control $u$ for the lower performance index function, with the new performance index function expressed as
\[
\tilde{V}^{[i+1]}(x(0)) = \int_0^\infty \big( x^T A x + u^{[i]T} B u^{[i]} + \underline{w}^T C \underline{w} + 2u^{[i]T} D \underline{w} + 2x^T E u^{[i]} + 2x^T F \underline{w} - l^o(x, \overline{u}, \overline{w}, \underline{u}, \underline{w}) \big)\,dt,
\tag{10.39}
\]
and we can obtain a similar result.
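To make the upper-value loop of Steps 3 and 4 concrete, the sketch below specializes it to a scalar linear-quadratic game with $D = E = F = 0$ and utility $l = qx^2 + ru^2 - g_2 w^2$ (so $C = -g_2 < 0$ and the supremum over $w$ is finite), with $\overline{V}^{[i]}(x) = p_i x^2$. All numbers are assumed for illustration; this is a simplified instance of the procedure, not the book's general nonlinear implementation.

```python
import numpy as np

# Scalar dynamics xdot = a x + b u + c w and cost weights (assumed values)
a, b, c = -1.0, 1.0, 1.0
q, r, g2 = 1.0, 1.0, 4.0      # g2 plays the role of -C

p = 0.0                        # V-bar[0] = 0 is stabilizing since a < 0
for i in range(25):
    ku = -b * p / r            # u[i+1] gain from (10.32) with V_x = 2 p x
    kw = c * p / g2            # w[i+1] gain from (10.33) with C = -g2
    a_cl = a + b * ku + c * kw # closed-loop dynamics under the pair
    # policy-evaluation step: solve 2 p a_cl + q + r ku^2 - g2 kw^2 = 0
    p = -(q + r * ku**2 - g2 * kw**2) / (2.0 * a_cl)

# closed-form positive root of alpha p^2 - 2 a p - q = 0,
# with alpha = b^2/r - c^2/g2, for comparison
alpha = b * b / r - c * c / g2
p_star = (a + np.sqrt(a * a + alpha * q)) / alpha
print(round(p, 6), round(p_star, 6))   # 0.430501 0.430501
```

Each pass through the loop plays the role of one sweep of Steps 3 and 4: the gains `ku`, `kw` form the iterative control pair, and the update of `p` is the evaluation of the new upper performance index coefficient.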

10.3.3 The Properties of the Iterative ADP Method

In this subsection, we present proofs showing that the proposed iterative ADP method for ZS differential games improves the properties of the system. The following definition is needed for the remaining proofs.

Definition 10.1 (K function) A continuous function $\alpha : [0, a) \to [0, \infty)$ is said to belong to class $\mathcal{K}$ if it is strictly increasing and $\alpha(0) = 0$.
Theorem 10.2 If Assumptions 10.1–10.3 hold, $\overline{u}^{[i]} \in \mathbb{R}^k$, $\overline{w}^{[i]} \in \mathbb{R}^m$, $\overline{V}^{[i]}(x) \in C^1$ satisfies the HJI equation $HJI(\overline{V}^{[i]}(x), \overline{u}^{[i]}, \overline{w}^{[i]}) = 0$, $i = 0, 1, \ldots$, and for all $t$, $l(x, \overline{u}^{[i]}, \overline{w}^{[i]}) = x^T A x + \overline{u}^{[i]T} B \overline{u}^{[i]} + \overline{w}^{[i]T} C \overline{w}^{[i]} + 2\overline{u}^{[i]T} D \overline{w}^{[i]} + 2x^T E \overline{u}^{[i]} + 2x^T F \overline{w}^{[i]} \ge 0$, then the new control pair derived by
\[
\begin{aligned}
\overline{w}^{[i+1]} &= -\frac{1}{2} C^{-1} \big( 2D^T \overline{u}^{[i+1]} + 2F^T x + c^T(x)\overline{V}_x^{[i]} \big), \\
\overline{u}^{[i+1]} &= -\frac{1}{2}(B - DC^{-1}D^T)^{-1} \big( 2(E^T - DC^{-1}F^T)x + (b^T(x) - DC^{-1}c^T(x))\overline{V}_x^{[i]} \big),
\end{aligned}
\tag{10.40}
\]
which satisfies (10.31), makes the system (10.1) asymptotically stable.


Proof Since the system is continuous and $\overline{V}^{[i]}(x) \in C^1$, we have
\[
\frac{d\overline{V}^{[i]}(x)}{dt} = \frac{d\overline{V}^{[i]}(x, \overline{u}^{[i+1]}, \overline{w}^{[i+1]})}{dt} = \overline{V}_x^{[i]T} a(x) + \overline{V}_x^{[i]T} b(x)\overline{u}^{[i+1]} + \overline{V}_x^{[i]T} c(x)\overline{w}^{[i+1]}.
\tag{10.41}
\]
According to (10.40), we get
\[
\frac{d\overline{V}^{[i]}(x)}{dt} = \overline{V}_x^{[i]T} a(x) + \overline{V}_x^{[i]T} \big(b(x) - c(x)C^{-1}D^T\big)\overline{u}^{[i+1]} - \overline{V}_x^{[i]T} c(x)C^{-1}F^T x - \frac{1}{2}\overline{V}_x^{[i]T} c(x)C^{-1}c^T(x)\overline{V}_x^{[i]}.
\tag{10.42}
\]
From the HJI equation we have
\[
\begin{aligned}
0 = {} & \overline{V}_x^{[i]T} f(x, \overline{u}^{[i]}, \overline{w}^{[i]}) + l(x, \overline{u}^{[i]}, \overline{w}^{[i]}) \\
= {} & \overline{V}_x^{[i]T} a(x) + \overline{V}_x^{[i]T} \big(b(x) - c(x)C^{-1}D^T\big)\overline{u}^{[i]} + 2x^T (E - FC^{-1}D^T)\overline{u}^{[i]} - \overline{V}_x^{[i]T} c(x)C^{-1}F^T x \\
& - \frac{1}{4}\overline{V}_x^{[i]T} c(x)C^{-1}c^T(x)\overline{V}_x^{[i]} + x^T A x + \overline{u}^{[i]T}(B - DC^{-1}D^T)\overline{u}^{[i]} - x^T F C^{-1} F^T x.
\end{aligned}
\tag{10.43}
\]
Substituting (10.43) into (10.42) gives
\[
\begin{aligned}
\frac{d\overline{V}^{[i]}(x)}{dt} = {} & \overline{V}_x^{[i]T} \big(b(x) - c(x)C^{-1}D^T\big)\big(\overline{u}^{[i+1]} - \overline{u}^{[i]}\big) - x^T A x - \overline{u}^{[i]T}(B - DC^{-1}D^T)\overline{u}^{[i]} \\
& - \frac{1}{4}\overline{V}_x^{[i]T} c(x)C^{-1}c^T(x)\overline{V}_x^{[i]} - 2x^T (E - FC^{-1}D^T)\overline{u}^{[i]} + x^T F C^{-1} F^T x.
\end{aligned}
\tag{10.44}
\]
According to (10.40), we have
\[
\begin{aligned}
\frac{d\overline{V}^{[i]}(x)}{dt} = {} & -\big(\overline{u}^{[i+1]} - \overline{u}^{[i]}\big)^T (B - DC^{-1}D^T)\big(\overline{u}^{[i+1]} - \overline{u}^{[i]}\big) - \overline{u}^{[i+1]T}(B - DC^{-1}D^T)\overline{u}^{[i+1]} \\
& - x^T A x - \frac{1}{4}\overline{V}_x^{[i]T} c(x)C^{-1}c^T(x)\overline{V}_x^{[i]} - 2x^T (E - FC^{-1}D^T)\overline{u}^{[i+1]} + x^T F C^{-1} F^T x.
\end{aligned}
\tag{10.45}
\]

Substituting (10.40) into the utility function, we obtain
\[
\begin{aligned}
l(x, \overline{u}^{[i+1]}, \overline{w}^{[i+1]}) = {} & x^T A x + \overline{u}^{[i+1]T} B \overline{u}^{[i+1]} + \overline{w}^{[i+1]T} C \overline{w}^{[i+1]} + 2\overline{u}^{[i+1]T} D \overline{w}^{[i+1]} + 2x^T E \overline{u}^{[i+1]} + 2x^T F \overline{w}^{[i+1]} \\
= {} & x^T A x + \overline{u}^{[i+1]T}(B - DC^{-1}D^T)\overline{u}^{[i+1]} + \frac{1}{4}\overline{V}_x^{[i]T} c(x)C^{-1}c^T(x)\overline{V}_x^{[i]} \\
& + 2x^T (E - FC^{-1}D^T)\overline{u}^{[i+1]} - x^T F C^{-1} F^T x \ \ge 0.
\end{aligned}
\tag{10.46}
\]
Thus we can derive $\dfrac{d\overline{V}^{[i]}(x)}{dt} \le 0$.
Since $\overline{V}^{[i]}(x) \ge 0$, there exist two class-$\mathcal{K}$ functions $\alpha(\|x\|)$ and $\beta(\|x\|)$ satisfying $\alpha(\|x\|) \le \overline{V}^{[i]}(x) \le \beta(\|x\|)$.
For all $\varepsilon > 0$, there exists $\delta(\varepsilon) > 0$ such that $\beta(\delta) \le \alpha(\varepsilon)$. Let $t_0$ be any initial time. Since $d\overline{V}^{[i]}(x)/dt \le 0$, for $t \in [t_0, \infty)$ we have
\[
\overline{V}^{[i]}(x(t)) - \overline{V}^{[i]}(x(t_0)) = \int_{t_0}^{t} \frac{d\overline{V}^{[i]}(x)}{d\tau}\,d\tau \le 0.
\tag{10.47}
\]
So for all $t_0$ with $\|x(t_0)\| < \delta(\varepsilon)$, and for all $t \in [t_0, \infty)$, we have
\[
\alpha(\varepsilon) \ge \beta(\delta) \ge \overline{V}^{[i]}(x(t_0)) \ge \overline{V}^{[i]}(x(t)) \ge \alpha(\|x\|).
\tag{10.48}
\]
As $\alpha(\|x\|)$ belongs to class $\mathcal{K}$, we obtain
\[
\|x\| \le \varepsilon.
\tag{10.49}
\]
Then we can conclude that the system (10.1) is asymptotically stable. □

Theorem 10.3 If Assumptions 10.1–10.3 hold, $\underline{u}^{[i]} \in \mathbb{R}^k$, $\underline{w}^{[i]} \in \mathbb{R}^m$, $\underline{V}^{[i]}(x) \in C^1$ satisfies the HJI equation $HJI(\underline{V}^{[i]}(x), \underline{u}^{[i]}, \underline{w}^{[i]}) = 0$, $i = 0, 1, \ldots$, and for all $t$, $l(x, \underline{u}^{[i]}, \underline{w}^{[i]}) = x^T A x + \underline{u}^{[i]T} B \underline{u}^{[i]} + \underline{w}^{[i]T} C \underline{w}^{[i]} + 2\underline{u}^{[i]T} D \underline{w}^{[i]} + 2x^T E \underline{u}^{[i]} + 2x^T F \underline{w}^{[i]} < 0$, then the control pair $(\underline{u}^{[i]}, \underline{w}^{[i]})$ formulated by
\[
\begin{aligned}
\underline{u}^{[i+1]} &= -\frac{1}{2} B^{-1} \big( 2D\underline{w}^{[i+1]} + 2E^T x + b^T(x)\underline{V}_x^{[i]} \big), \\
\underline{w}^{[i+1]} &= -\frac{1}{2}(C - D^T B^{-1} D)^{-1} \big( 2(F^T - D^T B^{-1} E^T)x + (c^T(x) - D^T B^{-1} b^T(x))\underline{V}_x^{[i]} \big),
\end{aligned}
\tag{10.50}
\]
which satisfies the performance index function (10.34), makes the system (10.1) asymptotically stable.

Proof Since the system is continuous and $\underline{V}^{[i]}(x) \in C^1$, we have
\[
\frac{d\underline{V}^{[i]}(x)}{dt} = \frac{d\underline{V}^{[i]}(x, \underline{u}^{[i+1]}, \underline{w}^{[i+1]})}{dt} = \underline{V}_x^{[i]T} a(x) + \underline{V}_x^{[i]T} b(x)\underline{u}^{[i+1]} + \underline{V}_x^{[i]T} c(x)\underline{w}^{[i+1]}.
\tag{10.51}
\]
According to (10.50), we get
\[
\frac{d\underline{V}^{[i]}(x)}{dt} = \underline{V}_x^{[i]T} a(x) + \underline{V}_x^{[i]T} \big(c(x) - b(x)B^{-1}D\big)\underline{w}^{[i+1]} - \underline{V}_x^{[i]T} b(x)B^{-1}E^T x - \frac{1}{2}\underline{V}_x^{[i]T} b(x)B^{-1}b^T(x)\underline{V}_x^{[i]}.
\tag{10.52}
\]
From the HJI equation we have
\[
\begin{aligned}
0 = {} & \underline{V}_x^{[i]T} f(x, \underline{u}^{[i]}, \underline{w}^{[i]}) + l(x, \underline{u}^{[i]}, \underline{w}^{[i]}) \\
= {} & \underline{V}_x^{[i]T} a(x) + \underline{V}_x^{[i]T} \big(c(x) - b(x)B^{-1}D\big)\underline{w}^{[i]} - \frac{1}{4}\underline{V}_x^{[i]T} b(x)B^{-1}b^T(x)\underline{V}_x^{[i]} \\
& + x^T A x + \underline{w}^{[i]T}(C - D^T B^{-1} D)\underline{w}^{[i]} - \underline{V}_x^{[i]T} b(x)B^{-1}E^T x - x^T E B^{-1} E^T x + 2x^T (F - E B^{-1} D)\underline{w}^{[i]}.
\end{aligned}
\tag{10.53}
\]
Substituting (10.53) into (10.52) gives
\[
\begin{aligned}
\frac{d\underline{V}^{[i]}(x)}{dt} = {} & \underline{V}_x^{[i]T} \big(c(x) - b(x)B^{-1}D\big)\big(\underline{w}^{[i+1]} - \underline{w}^{[i]}\big) - x^T A x - \underline{w}^{[i]T}(C - D^T B^{-1} D)\underline{w}^{[i]} \\
& + x^T E B^{-1} E^T x - 2x^T (F - E B^{-1} D)\underline{w}^{[i]} - \frac{1}{4}\underline{V}_x^{[i]T} b(x)B^{-1}b^T(x)\underline{V}_x^{[i]}.
\end{aligned}
\tag{10.54}
\]
According to (10.50), we have
\[
\begin{aligned}
\frac{d\underline{V}^{[i]}(x)}{dt} = {} & -\big(\underline{w}^{[i+1]} - \underline{w}^{[i]}\big)^T (C - D^T B^{-1} D)\big(\underline{w}^{[i+1]} - \underline{w}^{[i]}\big) - \underline{w}^{[i+1]T}(C - D^T B^{-1} D)\underline{w}^{[i+1]} - x^T A x \\
& + x^T E B^{-1} E^T x - 2x^T (F - E B^{-1} D)\underline{w}^{[i+1]} - \frac{1}{4}\underline{V}_x^{[i]T} b(x)B^{-1}b^T(x)\underline{V}_x^{[i]}.
\end{aligned}
\tag{10.55}
\]
Substituting (10.50) into the utility function, we obtain
\[
\begin{aligned}
l(x, \underline{u}^{[i+1]}, \underline{w}^{[i+1]}) = {} & x^T A x + \underline{u}^{[i+1]T} B \underline{u}^{[i+1]} + \underline{w}^{[i+1]T} C \underline{w}^{[i+1]} + 2\underline{u}^{[i+1]T} D \underline{w}^{[i+1]} + 2x^T E \underline{u}^{[i+1]} + 2x^T F \underline{w}^{[i+1]} \\
= {} & \underline{w}^{[i+1]T}(C - D^T B^{-1} D)\underline{w}^{[i+1]} + x^T A x - x^T E B^{-1} E^T x + 2x^T (F - E B^{-1} D)\underline{w}^{[i+1]} \\
& + \frac{1}{4}\underline{V}_x^{[i]T} b(x)B^{-1}b^T(x)\underline{V}_x^{[i]} \ < 0.
\end{aligned}
\tag{10.56}
\]
So we have $\dfrac{d\underline{V}^{[i]}(x)}{dt} > 0$.
Since $\underline{V}^{[i]}(x) < 0$, there exist two class-$\mathcal{K}$ functions $\alpha(\|x\|)$ and $\beta(\|x\|)$ satisfying $\alpha(\|x\|) \le -\underline{V}^{[i]}(x) \le \beta(\|x\|)$.
For all $\varepsilon > 0$, there exists $\delta(\varepsilon) > 0$ such that $\beta(\delta) \le \alpha(\varepsilon)$. Since $-d\underline{V}^{[i]}(x)/dt < 0$, for $t \in [t_0, \infty)$ we have
\[
-\underline{V}^{[i]}(x(t)) - \big(-\underline{V}^{[i]}(x(t_0))\big) = -\int_{t_0}^{t} \frac{d\underline{V}^{[i]}(x)}{d\tau}\,d\tau < 0.
\tag{10.57}
\]
So for all $t_0$ with $\|x(t_0)\| < \delta(\varepsilon)$, and for all $t \in [t_0, \infty)$, we have
\[
\alpha(\varepsilon) \ge \beta(\delta) \ge -\underline{V}^{[i]}(x(t_0)) > -\underline{V}^{[i]}(x(t)) \ge \alpha(\|x\|).
\tag{10.58}
\]
As $\alpha(\|x\|)$ belongs to class $\mathcal{K}$, we obtain
\[
\|x\| < \varepsilon.
\tag{10.59}
\]
Then we can conclude that the system (10.1) is asymptotically stable. □

Corollary 10.1 If Assumptions 10.1–10.3 hold, $\underline{u}^{[i]} \in \mathbb{R}^k$, $\underline{w}^{[i]} \in \mathbb{R}^m$, $\underline{V}^{[i]}(x) \in C^1$ satisfies the HJI equation $HJI(\underline{V}^{[i]}(x), \underline{u}^{[i]}, \underline{w}^{[i]}) = 0$, $i = 0, 1, \ldots$, and for all $t$, $l(x, \underline{u}^{[i]}, \underline{w}^{[i]}) = x^T A x + \underline{u}^{[i]T} B \underline{u}^{[i]} + \underline{w}^{[i]T} C \underline{w}^{[i]} + 2\underline{u}^{[i]T} D \underline{w}^{[i]} + 2x^T E \underline{u}^{[i]} + 2x^T F \underline{w}^{[i]} \ge 0$, then the control pair $(\underline{u}^{[i]}, \underline{w}^{[i]})$ that satisfies the performance index function (10.34) makes the system (10.1) asymptotically stable.

Proof According to (10.31) and (10.34), we have $\underline{V}^{[i]}(x) \le \overline{V}^{[i]}(x)$. As the utility function $l(x, \underline{u}^{[i]}, \underline{w}^{[i]}) \ge 0$, we have $\underline{V}^{[i]}(x) \ge 0$, so $0 \le \underline{V}^{[i]}(x) \le \overline{V}^{[i]}(x)$. From the proof of Theorem 10.2, we know that for all $t_0$ there exist two class-$\mathcal{K}$ functions $\alpha(\|x\|)$ and $\beta(\|x\|)$ satisfying
\[
\alpha(\varepsilon) \ge \beta(\delta) \ge \overline{V}^{[i]}(x(t_0)) \ge \overline{V}^{[i]}(x(t)) \ge \alpha(\|x\|).
\tag{10.60}
\]

Since $\overline{V}^{[i]}(x) \in C^1$, $\underline{V}^{[i]}(x) \in C^1$, and $\overline{V}^{[i]}(x) \to 0$, there exist time instants $t_1$ and $t_2$ (without loss of generality, let $t_0 < t_1 < t_2$) satisfying
\[
\overline{V}^{[i]}(x(t_0)) \ge \overline{V}^{[i]}(x(t_1)) \ge \underline{V}^{[i]}(x(t_0)) \ge \overline{V}^{[i]}(x(t_2)).
\tag{10.61}
\]
Choose $\varepsilon_1 > 0$ such that $\underline{V}^{[i]}(x(t_0)) \ge \alpha(\varepsilon_1) \ge \overline{V}^{[i]}(x(t_2))$. Then there exists $\delta_1(\varepsilon_1) > 0$ such that $\alpha(\varepsilon_1) \ge \beta(\delta_1) \ge \overline{V}^{[i]}(x(t_2))$. Then we obtain
\[
\underline{V}^{[i]}(x(t_0)) \ge \alpha(\varepsilon_1) \ge \beta(\delta_1) \ge \overline{V}^{[i]}(x(t_2)) \ge \overline{V}^{[i]}(x(t)) \ge \underline{V}^{[i]}(x(t)) \ge \alpha(\|x\|).
\tag{10.62}
\]
According to (10.60), we have
\[
\alpha(\varepsilon) \ge \beta(\delta) \ge \underline{V}^{[i]}(x(t_0)) \ge \alpha(\varepsilon_1) \ge \beta(\delta_1) \ge \underline{V}^{[i]}(x(t)) \ge \alpha(\|x\|).
\tag{10.63}
\]
As $\alpha(\|x\|)$ belongs to class $\mathcal{K}$, we obtain
\[
\|x\| \le \varepsilon.
\tag{10.64}
\]
Then we can conclude that the system (10.1) is asymptotically stable. □


[i]
Corollary 10.2 If Assumptions 10.1–10.3 hold, $\overline{u}^{[i]} \in \mathbb{R}^k$, $\overline{w}^{[i]} \in \mathbb{R}^m$, $\overline{V}^{[i]}(x) \in C^1$ satisfies the HJI equation $HJI(\overline{V}^{[i]}(x), \overline{u}^{[i]}, \overline{w}^{[i]}) = 0$, $i = 0, 1, \ldots$, and for all $t$, $l(x, \overline{u}^{[i]}, \overline{w}^{[i]}) = x^T A x + \overline{u}^{[i]T} B \overline{u}^{[i]} + \overline{w}^{[i]T} C \overline{w}^{[i]} + 2\overline{u}^{[i]T} D \overline{w}^{[i]} + 2x^T E \overline{u}^{[i]} + 2x^T F \overline{w}^{[i]} < 0$, then the control pair $(\overline{u}^{[i]}, \overline{w}^{[i]})$ that satisfies the performance index function (10.31) makes the system (10.1) asymptotically stable.

Proof According to (10.31) and (10.34), we have $\underline{V}^{[i]}(x) \le \overline{V}^{[i]}(x)$. From the utility function $l(x, \overline{u}^{[i]}, \overline{w}^{[i]}) < 0$, we know $\overline{V}^{[i]}(x) < 0$, so $\underline{V}^{[i]}(x) \le \overline{V}^{[i]}(x) < 0$. From the proof of Theorem 10.3, we know that there exist two class-$\mathcal{K}$ functions $\alpha(\|x\|)$ and $\beta(\|x\|)$ satisfying
\[
\alpha(\varepsilon) \ge \beta(\delta) \ge -\underline{V}^{[i]}(x(t_0)) \ge -\underline{V}^{[i]}(x(t)) \ge \alpha(\|x\|).
\tag{10.65}
\]
Since $\overline{V}^{[i]}(x) \in C^1$, $\underline{V}^{[i]}(x) \in C^1$, and $\underline{V}^{[i]}(x) \to 0$, there exist time instants $t_1$ and $t_2$ satisfying
\[
-\underline{V}^{[i]}(x(t_0)) \ge -\underline{V}^{[i]}(x(t_1)) \ge -\overline{V}^{[i]}(x(t_0)) \ge -\underline{V}^{[i]}(x(t_2)).
\tag{10.66}
\]
Choose $\varepsilon_1 > 0$ such that $-\overline{V}^{[i]}(x(t_0)) \ge \alpha(\varepsilon_1) \ge -\underline{V}^{[i]}(x(t_2))$. Then there exists $\delta_1(\varepsilon_1) > 0$ such that $\alpha(\varepsilon_1) \ge \beta(\delta_1) \ge -\underline{V}^{[i]}(x(t_2))$. Then we obtain

\[
-\overline{V}^{[i]}(x(t_0)) \ge \alpha(\varepsilon_1) \ge \beta(\delta_1) \ge -\underline{V}^{[i]}(x(t_2)) \ge -\underline{V}^{[i]}(x(t)) \ge -\overline{V}^{[i]}(x(t)) \ge \alpha(\|x\|).
\tag{10.67}
\]
According to (10.65), we have
\[
\alpha(\varepsilon) \ge \beta(\delta) \ge -\overline{V}^{[i]}(x(t_0)) \ge \alpha(\varepsilon_1) \ge \beta(\delta_1) \ge -\overline{V}^{[i]}(x(t)) \ge \alpha(\|x\|).
\tag{10.68}
\]
As $\alpha(\|x\|)$ belongs to class $\mathcal{K}$, we obtain
\[
\|x\| < \varepsilon.
\tag{10.69}
\]
Then we can conclude that the system (10.1) is asymptotically stable. □


Theorem 10.4 If Assumptions 10.1–10.3 hold, $\overline{u}^{[i]} \in \mathbb{R}^k$, $\overline{w}^{[i]} \in \mathbb{R}^m$, $\overline{V}^{[i]}(x) \in C^1$ satisfies the HJI equation $HJI(\overline{V}^{[i]}(x), \overline{u}^{[i]}, \overline{w}^{[i]}) = 0$, $i = 0, 1, \ldots$, and $l(x, \overline{u}^{[i]}, \overline{w}^{[i]}) = x^T A x + \overline{u}^{[i]T} B \overline{u}^{[i]} + \overline{w}^{[i]T} C \overline{w}^{[i]} + 2\overline{u}^{[i]T} D \overline{w}^{[i]} + 2x^T E \overline{u}^{[i]} + 2x^T F \overline{w}^{[i]}$ is the utility function, then the control pair $(\overline{u}^{[i]}, \overline{w}^{[i]})$ that satisfies the upper performance index function (10.31) is a pair of asymptotically stable controls for system (10.1).

Proof For the time sequence $t_0 < t_1 < t_2 < \cdots < t_m < t_{m+1} < \cdots$, without loss of generality, suppose $l(x, \overline{u}^{[i]}, \overline{w}^{[i]}) \ge 0$ on $[t_{2n}, t_{2n+1})$ and $l(x, \overline{u}^{[i]}, \overline{w}^{[i]}) < 0$ on $[t_{2n+1}, t_{2(n+1)})$, where $n = 0, 1, \ldots$.
For $t \in [t_0, t_1)$ we have $l(x, \overline{u}^{[i]}, \overline{w}^{[i]}) \ge 0$ and $\int_{t_0}^{t_1} l(x, \overline{u}^{[i]}, \overline{w}^{[i]})\,dt \ge 0$. According to Theorem 10.2, we have
\[
\|x(t_0)\| \ge \|x(t_1')\| \ge \|x(t_1)\|,
\tag{10.70}
\]
where $t_1' \in [t_0, t_1)$.
For $t \in [t_1, t_2)$ we have $l(x, \overline{u}^{[i]}, \overline{w}^{[i]}) < 0$ and $\int_{t_1}^{t_2} l(x, \overline{u}^{[i]}, \overline{w}^{[i]})\,dt < 0$. According to Corollary 10.2, we have
\[
\|x(t_1)\| > \|x(t_2')\| > \|x(t_2)\|,
\tag{10.71}
\]
where $t_2' \in [t_1, t_2)$.
So we can obtain
\[
\|x(t_0)\| \ge \|x(t_0')\| > \|x(t_2)\|,
\tag{10.72}
\]
where $t_0' \in [t_0, t_2)$.
Then, by mathematical induction, for all $t$ we have $\|x(t')\| \le \|x(t)\|$, where $t' \in [t, \infty)$. So we can conclude that the system (10.1) is asymptotically stable, and the proof is completed. □

Theorem 10.5 If Assumptions 10.1–10.3 hold, $\underline{u}^{[i]} \in \mathbb{R}^k$, $\underline{w}^{[i]} \in \mathbb{R}^m$, $\underline{V}^{[i]}(x) \in C^1$ satisfies the HJI equation $HJI(\underline{V}^{[i]}(x), \underline{u}^{[i]}, \underline{w}^{[i]}) = 0$, $i = 0, 1, \ldots$, and $l(x, \underline{u}^{[i]}, \underline{w}^{[i]}) = x^T A x + \underline{u}^{[i]T} B \underline{u}^{[i]} + \underline{w}^{[i]T} C \underline{w}^{[i]} + 2\underline{u}^{[i]T} D \underline{w}^{[i]} + 2x^T E \underline{u}^{[i]} + 2x^T F \underline{w}^{[i]}$ is the utility function, then the control pair $(\underline{u}^{[i]}, \underline{w}^{[i]})$ that satisfies the lower performance index function (10.34) is a pair of asymptotically stable controls for system (10.1).

Proof Similar to the proof of Theorem 10.4, the conclusion follows from Theorem 10.3 and Corollary 10.1; the detailed steps are omitted. □

In the following, the convergence properties of the iterates for the ZS differential games are analyzed to guarantee that the iterative control pairs reach the optimum.

Proposition 10.1 If Assumptions 10.1–10.3 hold, $\overline{u}^{[i]} \in \mathbb{R}^k$, $\overline{w}^{[i]} \in \mathbb{R}^m$, and $\overline{V}^{[i]}(x) \in C^1$ satisfies the HJI equation $HJI(\overline{V}^{[i]}(x), \overline{u}^{[i]}, \overline{w}^{[i]}) = 0$, then the iterative control pair $(\overline{u}^{[i]}, \overline{w}^{[i]})$ formulated by
\[
\begin{aligned}
\overline{w}^{[i+1]} &= -\frac{1}{2} C^{-1} \big( 2D^T \overline{u}^{[i+1]} + 2F^T x + c^T(x)\overline{V}_x^{[i]} \big), \\
\overline{u}^{[i+1]} &= -\frac{1}{2}(B - DC^{-1}D^T)^{-1} \big( 2(E^T - DC^{-1}F^T)x + (b^T(x) - DC^{-1}c^T(x))\overline{V}_x^{[i]} \big),
\end{aligned}
\tag{10.73}
\]
makes the upper performance index function satisfy $\overline{V}^{[i]}(x) \to \overline{V}(x)$ as $i \to \infty$.

Proof To show the convergence of the upper performance index function, we primarily consider the property of $d\big(\overline{V}^{[i+1]}(x) - \overline{V}^{[i]}(x)\big)/dt$.
According to the HJI equation $HJI(\overline{V}^{[i]}(x), \overline{u}^{[i]}, \overline{w}^{[i]}) = 0$, we can obtain $d\overline{V}^{[i+1]}(x)/dt$ by replacing the index $i$ by $i + 1$:
\[
\frac{d\overline{V}^{[i+1]}(x)}{dt} = -\Big( x^T A x + \overline{u}^{[i+1]T}(B - DC^{-1}D^T)\overline{u}^{[i+1]} + \frac{1}{4}\overline{V}_x^{[i]T} c(x)C^{-1}c^T(x)\overline{V}_x^{[i]} + 2x^T (E - FC^{-1}D^T)\overline{u}^{[i+1]} - x^T F C^{-1} F^T x \Big).
\tag{10.74}
\]
According to (10.45), we can obtain
\[
\frac{d\big(\overline{V}^{[i+1]}(x) - \overline{V}^{[i]}(x)\big)}{dt} = \frac{d\overline{V}^{[i+1]}(x)}{dt} - \frac{d\overline{V}^{[i]}(x)}{dt} = \big(\overline{u}^{[i+1]} - \overline{u}^{[i]}\big)^T (B - DC^{-1}D^T)\big(\overline{u}^{[i+1]} - \overline{u}^{[i]}\big) \ge 0.
\tag{10.75}
\]
Since the system $f$ is asymptotically stable, its state trajectories $x$ converge to zero, and so does $\overline{V}^{[i+1]}(x) - \overline{V}^{[i]}(x)$. Since $d\big(\overline{V}^{[i+1]}(x) - \overline{V}^{[i]}(x)\big)/dt \ge 0$ along these trajectories, this implies that $\overline{V}^{[i+1]}(x) - \overline{V}^{[i]}(x) \le 0$, that is, $\overline{V}^{[i+1]}(x) \le \overline{V}^{[i]}(x)$. As such, $\overline{V}^{[i]}(x) \to \overline{V}(x)$ as $i \to \infty$. □

Remark 10.5 The convergence of the upper performance index function cannot guarantee the convergence of the lower performance index function. In fact, the lower performance index function may be bounded but not convergent, so it is necessary to analyze the convergence of the lower performance index function separately.
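The monotone decrease $\overline{V}^{[i+1]}(x) \le \overline{V}^{[i]}(x)$ asserted by Proposition 10.1 can be observed numerically. The sketch below reuses the assumed scalar linear-quadratic setting from the earlier illustration ($D = E = F = 0$, $l = qx^2 + ru^2 - g_2 w^2$, $\overline{V}^{[i]}(x) = p_i x^2$) and checks that, after the first evaluation, the iterates never increase:

```python
import numpy as np

# Assumed scalar ZS game: xdot = a x + b u + c w, l = q x^2 + r u^2 - g2 w^2
a, b, c, q, r, g2 = -1.0, 1.0, 1.0, 1.0, 1.0, 4.0

ps, p = [], 0.0
for i in range(12):
    ku, kw = -b * p / r, c * p / g2          # iterative control pair (10.73)
    p = -(q + r * ku**2 - g2 * kw**2) / (2.0 * (a + b * ku + c * kw))
    ps.append(p)                              # coefficient of V-bar[i+1]

# after the first evaluation the sequence should be nonincreasing
diffs = [ps[i + 1] - ps[i] for i in range(len(ps) - 1)]
print(all(d <= 1e-12 for d in diffs))
```

The printed check confirms $p_1 \ge p_2 \ge \cdots$, i.e., the upper iterates converge from above toward the limit coefficient, consistent with $\overline{V}^{[i]}(x) \to \overline{V}(x)$.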

Proposition 10.2 If Assumptions 10.1–10.3 hold, $\underline{u}^{[i]} \in \mathbb{R}^k$, $\underline{w}^{[i]} \in \mathbb{R}^m$, and $\underline{V}^{[i]}(x) \in C^1$ satisfies the HJI equation $HJI(\underline{V}^{[i]}(x), \underline{u}^{[i]}, \underline{w}^{[i]}) = 0$, then the iterative control pair $(\underline{u}^{[i]}, \underline{w}^{[i]})$ formulated by
\[
\begin{aligned}
\underline{u}^{[i+1]} &= -\frac{1}{2} B^{-1} \big( 2D\underline{w}^{[i+1]} + 2E^T x + b^T(x)\underline{V}_x^{[i]} \big), \\
\underline{w}^{[i+1]} &= -\frac{1}{2}(C - D^T B^{-1} D)^{-1} \big( 2(F^T - D^T B^{-1} E^T)x + (c^T(x) - D^T B^{-1} b^T(x))\underline{V}_x^{[i]} \big),
\end{aligned}
\tag{10.76}
\]
makes the lower performance index function satisfy $\underline{V}^{[i]}(x) \to \underline{V}(x)$ as $i \to \infty$.

Proof To show the convergence of the lower performance index function, we again consider the property of $d\big(\underline{V}^{[i+1]}(x) - \underline{V}^{[i]}(x)\big)/dt$.
From the HJI equation $HJI(\underline{V}^{[i]}(x), \underline{u}^{[i]}, \underline{w}^{[i]}) = 0$, we can obtain $d\underline{V}^{[i+1]}(x)/dt$ by replacing the index $i$ by $i + 1$:
\[
\frac{d\underline{V}^{[i+1]}(x)}{dt} = -\underline{w}^{[i+1]T}(C - D^T B^{-1} D)\underline{w}^{[i+1]} - x^T A x + x^T E B^{-1} E^T x - 2x^T (F - E B^{-1} D)\underline{w}^{[i+1]} - \frac{1}{4}\underline{V}_x^{[i]T} b(x)B^{-1}b^T(x)\underline{V}_x^{[i]}.
\tag{10.77}
\]
According to (10.55), we have
\[
\frac{d\big(\underline{V}^{[i+1]}(x) - \underline{V}^{[i]}(x)\big)}{dt} = \frac{d\underline{V}^{[i+1]}(x)}{dt} - \frac{d\underline{V}^{[i]}(x)}{dt} = \big(\underline{w}^{[i+1]} - \underline{w}^{[i]}\big)^T (C - D^T B^{-1} D)\big(\underline{w}^{[i+1]} - \underline{w}^{[i]}\big) \le 0.
\tag{10.78}
\]
Since the system $f$ is stable, $\underline{V}^{[i+1]}(x) - \underline{V}^{[i]}(x)$ converges to zero along the state trajectories. From $d\big(\underline{V}^{[i+1]}(x) - \underline{V}^{[i]}(x)\big)/dt \le 0$, we can conclude $\underline{V}^{[i+1]}(x) - \underline{V}^{[i]}(x) \ge 0$, that is, $\underline{V}^{[i+1]}(x) \ge \underline{V}^{[i]}(x)$. As such, $\underline{V}^{[i]}(x)$ is an increasing sequence in $i$. On the other hand, $\underline{V}^{[i+1]}(x) \le \overline{V}^{[i+1]}(x)$, and $\overline{V}^{[i+1]}(x)$ is a convergent performance index function; therefore $\underline{V}^{[i+1]}(x)$ is convergent, i.e., $\underline{V}^{[i+1]}(x) \to \underline{V}(x)$ as $i \to \infty$. □

Theorem 10.6 If the optimal performance index function of the ZS differential game (the saddle point) exists, that is, $\overline{V}(x) = \underline{V}(x) = V^*(x)$, then the control pairs $(\overline{u}^{[i+1]}, \overline{w}^{[i+1]})$ and $(\underline{u}^{[i+1]}, \underline{w}^{[i+1]})$ make $\overline{V}^{[i]}(x) \to V^*(x)$ and $\underline{V}^{[i]}(x) \to V^*(x)$, respectively, as $i \to \infty$.

Proof The proof is by contradiction. For the upper performance index function, assume that there exists a control pair $(u^*, w^*)$ that makes the performance index function reach the saddle point $V^*(x)$ while $V^*(x) \ne \overline{V}(x)$.
According to Proposition 10.1, we have $\overline{V}^{[i]}(x) \to \overline{V}(x)$ under the control pair $(\overline{u}^{[i+1]}, \overline{w}^{[i+1]})$ as $i \to \infty$. As the performance index function is convex in $u$ and concave in $w$, the optimal upper performance index function $\overline{V}(x)$ is uniquely determined by the optimal control pair $(\overline{u}, \overline{w})$ for the same initial state $x(0)$. On the other hand, the control pair $(u^*, w^*)$ makes the performance index function reach the saddle point. According to the property of the saddle point [15, 18], the control pair $(u^*, w^*)$ satisfies the HJI equation (10.10). Then we can derive the following control pair:
\[
\begin{aligned}
w^* &= -\frac{1}{2} C^{-1} \big( 2D^T u + 2F^T x + c^T(x)\overline{V}_x \big), \\
u^* &= -\frac{1}{2}(B - DC^{-1}D^T)^{-1} \big( 2(E^T - DC^{-1}F^T)x + (b^T(x) - DC^{-1}c^T(x))\overline{V}_x \big),
\end{aligned}
\tag{10.79}
\]
which is the same as (10.13) and (10.14). So the control pair $(u^*, w^*)$ in (10.79) uniquely determines the upper performance index function $\overline{V}(x)$ for the same initial state $x(0)$, which contradicts the assumption.
Similarly, we can derive $\underline{V}^{[i]}(x) \to V^*(x)$ under the control pair $(\underline{u}^{[i+1]}, \underline{w}^{[i+1]})$ as $i \to \infty$. □
When the saddle point does not exist, we regulate the control $w$ in the control pair $(\overline{u}, w)$ to make the performance index function $\hat{V}(x)$ reach the mixed optimal performance index function $V^o(x)$. The following propositions and theorems address this situation.

Proposition 10.3 If Assumptions 10.1–10.3 hold, $\overline{u} \in \mathbb{R}^k$, $w^{[i]} \in \mathbb{R}^m$, and the utility function is $\tilde{l}(x, w^{[i]}) = l(x, \overline{u}, w^{[i]}) - l^o(x, \overline{u}, \overline{w}, \underline{u}, \underline{w})$, then the control pair $(\overline{u}, w^{[i]})$ that satisfies the performance index function (10.37) makes the system (10.1) asymptotically stable.

Proof The result follows directly from Theorems 10.4 and 10.5. □

Proposition 10.4 If Assumptions 10.1–10.3 hold, $\overline{u} \in \mathbb{R}^k$, $w^{[i]} \in \mathbb{R}^m$, and for all $t$ the utility function $\tilde{l}(x, w^{[i]}) = l(x, \overline{u}, w^{[i]}) - l^o(x, \overline{u}, \overline{w}, \underline{u}, \underline{w}) \ge 0$, then the control pair $(\overline{u}, w^{[i]})$ that satisfies the performance index function (10.37) makes $\tilde{V}^{[i]}(x)$ convergent as $i \to \infty$.
Proof Let $\tilde{V}^{[i]}(x) = \hat{V}^{[i]}(x) - V^o(x)$ as defined in (10.37). We have
\[
\frac{d\tilde{V}^{[i+1]}(x)}{dt} = \frac{d\tilde{V}^{[i+1]}(x, \overline{u}, w^{[i]})}{dt} = \tilde{V}_x^{[i+1]T} a(x) + \tilde{V}_x^{[i+1]T} b(x)\overline{u} + \tilde{V}_x^{[i+1]T} c(x)w^{[i]}.
\tag{10.80}
\]
From the HJB equation we have
\[
\begin{aligned}
0 = {} & \tilde{V}_x^{[i+1]T} \big(a(x) + b(x)\overline{u} + c(x)w^{[i+1]}\big) + l(x, \overline{u}, w^{[i+1]}) \\
= {} & \tilde{V}_x^{[i+1]T} a(x) + \tilde{V}_x^{[i+1]T} b(x)\overline{u} + \tilde{V}_x^{[i+1]T} c(x)w^{[i+1]} \\
& + x^T A x + \overline{u}^T B \overline{u} + w^{[i+1]T} C w^{[i+1]} + 2\overline{u}^T D w^{[i+1]} + 2x^T E \overline{u} + 2x^T F w^{[i+1]}.
\end{aligned}
\tag{10.81}
\]
Substituting (10.81) into (10.80) gives
\[
\begin{aligned}
\frac{d\tilde{V}^{[i+1]}(x)}{dt} = {} & \tilde{V}_x^{[i+1]T} c(x)\big(w^{[i]} - w^{[i+1]}\big) - x^T A x - \overline{u}^T B \overline{u} - w^{[i+1]T} C w^{[i+1]} \\
& - 2\overline{u}^T D w^{[i+1]} - 2x^T E \overline{u} - 2x^T F w^{[i+1]}.
\end{aligned}
\tag{10.82}
\]
According to (10.38), we have
\[
\begin{aligned}
\frac{d\tilde{V}^{[i+1]}(x)}{dt} = {} & -\big(w^{[i]} - w^{[i+1]}\big)^T C \big(w^{[i]} - w^{[i+1]}\big) - x^T A x - \overline{u}^T B \overline{u} - w^{[i]T} C w^{[i]} \\
& - 2\overline{u}^T D w^{[i]} - 2x^T E \overline{u} - 2x^T F w^{[i]}.
\end{aligned}
\tag{10.83}
\]
According to the HJB equation, we can obtain $d\tilde{V}^{[i]}(x)/dt$ by replacing the index $i + 1$ by $i$:
\[
\frac{d\tilde{V}^{[i]}(x)}{dt} = -x^T A x - \overline{u}^T B \overline{u} - w^{[i]T} C w^{[i]} - 2\overline{u}^T D w^{[i]} - 2x^T E \overline{u} - 2x^T F w^{[i]}.
\tag{10.84}
\]
Then we have
\[
\frac{d\big(\tilde{V}^{[i+1]}(x) - \tilde{V}^{[i]}(x)\big)}{dt} = -\big(w^{[i]} - w^{[i+1]}\big)^T C \big(w^{[i]} - w^{[i+1]}\big) \ge 0.
\tag{10.85}
\]
Since the system $f$ is asymptotically stable, its state trajectories $x$ converge to zero, and so does $\tilde{V}^{[i+1]}(x) - \tilde{V}^{[i]}(x)$. Since $d\big(\tilde{V}^{[i+1]}(x) - \tilde{V}^{[i]}(x)\big)/dt \ge 0$ along these trajectories, this implies $\tilde{V}^{[i+1]}(x) - \tilde{V}^{[i]}(x) \le 0$, that is, $\tilde{V}^{[i+1]}(x) \le \tilde{V}^{[i]}(x)$. As such, $\tilde{V}^{[i]}(x)$ is convergent. □

Proposition 10.5 If Assumptions 10.1–10.3 hold, $\overline{u} \in \mathbb{R}^k$, $w^{[i]} \in \mathbb{R}^m$, and the utility function $\tilde{l}(x, w^{[i]}) = l(x, \overline{u}, w^{[i]}) - l^o(x, \overline{u}, \overline{w}, \underline{u}, \underline{w}) < 0$, then the control pair $(\overline{u}, w^{[i]})$ that satisfies the performance index function (10.37) makes $\tilde{V}^{[i]}(x)$ convergent as $i \to \infty$.
Proof Since the system is continuous and Ṽ^[i+1](x) ∈ C^1, we have
188 10 An Iterative ADP Method to Solve for a Class …

d(−Ṽ^[i+1](x))/dt = −dṼ^[i+1](x, u, w^[i])/dt
                   = −Ṽ_x^[i+1]T a(x) − Ṽ_x^[i+1]T b(x)u − Ṽ_x^[i+1]T c(x)w^[i].    (10.86)
a(x) − V x[i+1]T c(x)w[i] . (10.86)

From the HJB equation we have

0 = −Ṽ_x^[i+1]T (a(x) + b(x)u + c(x)w^[i+1]) − l(x, u, w^[i+1])
  = −Ṽ_x^[i+1]T a(x) − Ṽ_x^[i+1]T b(x)u − Ṽ_x^[i+1]T c(x)w^[i+1] − x^T Ax
    − u^T Bu − w^[i+1]T Cw^[i+1] − 2u^T Dw^[i+1] − 2x^T Eu − 2x^T Fw^[i+1].    (10.87)

Substituting (10.87) into (10.86), we obtain

d(−Ṽ^[i+1](x))/dt = −Ṽ_x^[i+1]T c(x)(w^[i] − w^[i+1]) + x^T Ax + u^T Bu
                   + w^[i+1]T Cw^[i+1] + 2u^T Dw^[i+1] + 2x^T Eu + 2x^T Fw^[i+1].    (10.88)

According to (10.40), we have

d(−Ṽ^[i+1](x))/dt = (w^[i] − w^[i+1])^T C(w^[i] − w^[i+1]) + x^T Ax + u^T Bu
                   + w^[i]T Cw^[i] + 2u^T Dw^[i] + 2x^T Eu + 2x^T Fw^[i].    (10.89)

According to the HJI equation, we can obtain d(−Ṽ^[i](x))/dt by replacing the index “i + 1” with the index “i”:

d(−Ṽ^[i](x))/dt = x^T Ax + u^T Bu + w^[i]T Cw^[i] + 2u^T Dw^[i] + 2x^T Eu + 2x^T Fw^[i].    (10.90)

Then we have

d((−Ṽ^[i+1](x)) − (−Ṽ^[i](x)))/dt
  = (w^[i] − w^[i+1])^T C(w^[i] − w^[i+1])
    + x^T Ax + u^T Bu + w^[i]T Cw^[i] + 2u^T Dw^[i] + 2x^T Eu + 2x^T Fw^[i]
    − (x^T Ax + u^T Bu + w^[i]T Cw^[i] + 2u^T Dw^[i] + 2x^T Eu + 2x^T Fw^[i])
  = (w^[i] − w^[i+1])^T C(w^[i] − w^[i+1])
  < 0.    (10.91)

Since the system f is asymptotically stable, its state trajectories x converge to zero, and so does Ṽ^[i+1](x) − Ṽ^[i](x). Since d(Ṽ^[i+1](x) − Ṽ^[i](x))/dt < 0 on these trajectories, this implies that Ṽ^[i+1](x) − Ṽ^[i](x) > 0, that is, Ṽ^[i+1](x) > Ṽ^[i](x). As such, Ṽ^[i](x) is convergent.

Theorem 10.7 If Assumptions 10.1–10.3 hold, u ∈ R^k, w^[i] ∈ R^m, and l̃(x, w^[i]) = l(x, u, w^[i]) − l^o(x, u, w, u, w) is the utility function, then the control pair (u, w^[i]) that satisfies the performance index function (10.37) makes Ṽ^[i](x) convergent as i → ∞.
Proof For the time sequence t_0 < t_1 < t_2 < · · · < t_m < t_{m+1} < · · ·, without loss of generality, we suppose that l̃(x, w^[i]) ≥ 0 on [t_{2n}, t_{2n+1}) and l̃(x, w^[i]) < 0 on [t_{2n+1}, t_{2(n+1)}), where n = 0, 1, 2, . . ..
For t ∈ [t_{2n}, t_{2n+1}) we have l̃(x, w^[i]) ≥ 0 and ∫_{t_0}^{t_1} l̃(x, w^[i]) dt ≥ 0. According to Proposition 10.4, we have Ṽ^[i+1](x) ≤ Ṽ^[i](x).
For t ∈ [t_{2n+1}, t_{2(n+1)}) we have l̃(x, w^[i]) < 0 and ∫_{t_1}^{t_2} l̃(x, w^[i]) dt < 0. According to Proposition 10.5, we have Ṽ^[i+1](x) > Ṽ^[i](x).
Then, for ∀t_0, we have

Ṽ^[i+1](x(t_0)) = ∫_{t_0}^{t_1} l̃(x, w^[i]) dt + ∫_{t_1}^{t_2} l̃(x, w^[i]) dt
                 + · · · + ∫_{t_m}^{t_{m+1}} l̃(x, w^[i]) dt + · · ·
               < Ṽ^[i](x(t_0)).    (10.92)

So Ṽ^[i](x) is convergent as i → ∞.


Theorem 10.8 If Assumptions 10.1–10.3 hold, u ∈ R^k, w^[i] ∈ R^m, and l̃(x, w^[i]) = l(x, u, w^[i]) − l^o(x, u, w, u, w) is the utility function, then the control pair (u, w^[i]) that satisfies the performance index function (10.37) makes V^[i](x) → V^o(x) as i → ∞.
Proof The theorem is proved by contradiction. Suppose that the control pair (u, w^[i]) makes the performance index function V^[i](x) converge to V′(x) with V′(x) ≠ V^o(x).
According to Theorem 10.7, as i → ∞ we have the following HJB equation based on the principle of optimality:

HJB(V′(x), w′) = V′_t(x) + H(V′_x(x), x, w′) = 0.    (10.93)

From the above assumption we know that |V^[i](x) − V^o(x)| ≠ 0 as i → ∞.
From Theorem 10.7, we know that there exists a control pair (u^o, w^o) that makes V(x, u^o, w^o) = V^o(x), which minimizes the performance index function V(x). According to the principle of optimality we also have the following HJB equation:

HJB(V^o(x), w^o) = V^o_t(x) + H(V^o_x(x), x, w^o) = 0.    (10.94)

This is a contradiction, so the assumption does not hold. Thus we have V^[i](x) → V^o(x) as i → ∞.

Remark 10.6 If we regulate the control u for the lower performance index function which satisfies (10.39), we can also prove that the iterative control pair (u^[i], w) stabilizes the nonlinear system (10.1) and that the performance index function V^[i](x) → V^o(x) as i → ∞. The proof procedure is similar to the proofs of Propositions 10.1–10.5 and Theorems 10.7 and 10.8, and is omitted.

10.4 Neural Network Implementation

As a computer can only deal with digital and discrete signals, it is necessary to transform the continuous-time system and performance index function into the corresponding discrete-time form. Discretizing the system function and the performance index function using the Euler and trapezoidal methods [19, 20] leads to

x(t + 1) = x(t) + (a(x(t)) + b(x(t))u(t) + c(x(t))w(t)) Δt,    (10.95)

and

V(x(0)) = Σ_{t=0}^{∞} (x^T(t)Ax(t) + u^T(t)Bu(t) + w^T(t)Cw(t)
          + 2u^T(t)Dw(t) + 2x^T(t)Eu(t) + 2x^T(t)Fw(t)) Δt,    (10.96)

where Δt is the sampling time interval.
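As a sketch of this discretization step (hypothetical helper functions; `a`, `b`, `c` stand for the system functions in (10.95) and A–F for the weighting matrices in (10.96)):

```python
import numpy as np

def euler_step(x, u, w, a, b, c, dt):
    """One Euler step: x(t+1) = x(t) + (a(x) + b(x)u + c(x)w) * dt, per (10.95)."""
    return x + (a(x) + b(x) @ u + c(x) @ w) * dt

def discrete_cost(xs, us, ws, A, B, C, D, E, F, dt):
    """Riemann-sum approximation of the quadratic performance index (10.96)."""
    total = 0.0
    for x, u, w in zip(xs, us, ws):
        total += (x @ A @ x + u @ B @ u + w @ C @ w
                  + 2 * u @ D @ w + 2 * x @ E @ u + 2 * x @ F @ w) * dt
    return total
```

For instance, with the scalar plant of Example 10.1 below (a(x) = x, b = c = 1) and Δt = 0.01, one Euler step from x = 1 with u = w = 0 returns 1.01.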


In the case of linear systems, the performance index function is quadratic and the control policy is linear. In the nonlinear case this is not necessarily true, and therefore we use neural networks to approximate u^[i], w^[i] and V^[i](x).
Using an NN, a function F(X) can be expressed by

F(X) = F(X, Y^*, W^*) + ξ(X),    (10.97)

where Y^*, W^* are the ideal weight parameters and ξ(X) is the estimation error.
There are ten neural networks used to implement the iterative ADP method: three model networks, three critic networks and four action networks. All the neural networks are chosen as three-layer feed-forward neural networks. The whole structure diagram is shown in Fig. 10.1.

Fig. 10.1 The structure diagram of the iterative ADP method for ZS differential games

10.4.1 The Model Network

For the nonlinear system, before carrying out the iterative ADP method, we should first train the model network. The output of the model network is given as

x̂(t + 1) = WmT σ (VmT X (t)), (10.98)

where X (t) = [x(t) u(t) w(t)]T .


We define the error function of the model network as

em (t) = x̂(t + 1) − x(t + 1). (10.99)

Define the performance error index as

1 2
E m (t) = e (t). (10.100)
2 m

Then the gradient-based weight update rule for the model network can be described by

w_m(t + 1) = w_m(t) + Δw_m(t),    (10.101)

Δw_m(t) = η_m (−∂E_m(t)/∂w_m(t)),    (10.102)

where η_m > 0 is the learning rate of the model network.


After the model network is trained, its weights are kept unchanged.
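A minimal NumPy sketch of the model-network forward pass (10.98) and gradient update (10.99)–(10.102); the 3-8-1 shape (X(t) = [x; u; w] for a scalar system), the seed and the learning rate are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# x_hat(t+1) = Wm^T sigma(Vm^T X(t)), with X(t) = [x; u; w]   (10.98)
n_in, n_hid, n_out = 3, 8, 1           # assumed sizes for a scalar example
Vm = rng.uniform(-1, 1, (n_in, n_hid))
Wm = rng.uniform(-1, 1, (n_hid, n_out))
eta_m = 0.01                           # learning rate eta_m > 0

def model_forward(X):
    s = np.tanh(Vm.T @ X)              # hidden layer sigma(Vm^T X)
    return Wm.T @ s, s

def model_train_step(X, x_next):
    """One gradient step on Em = 0.5 * em^2, per (10.100)-(10.102)."""
    global Wm, Vm
    x_hat, s = model_forward(X)
    e = x_hat - x_next                 # em(t) = x_hat(t+1) - x(t+1)   (10.99)
    Wm -= eta_m * np.outer(s, e)                        # output-layer gradient
    Vm -= eta_m * np.outer(X, (Wm @ e) * (1 - s ** 2))  # hidden-layer gradient
    return 0.5 * float(e @ e)
```

Training on samples of a known discrete-time plant drives the loss down; after training, the weights are frozen as described above.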

10.4.2 The Critic Network

The critic network is used to approximate the performance index functions, i.e., the upper, the lower, and the mixed optimal performance index functions. The output of the critic network is denoted as

V̂^[i](X(t)) = W_c^[i]T σ(Y_c^[i]T X(t)),    (10.103)

where X(t) = [x(t) u(t) w(t)]^T is the input of the critic networks.
The three critic networks have three target functions. For the upper performance index function, the target function can be written as

V̄^[i](x(t)) = (x^T(t)Ax(t) + u^[i+1]T(t)Bu^[i+1](t) + w^[i+1]T Cw^[i+1]
              + 2u^[i+1]T Dw^[i+1] + 2x^T Eu^[i+1] + 2x^T Fw^[i+1]) Δt + V̂^[i](x(t + 1)),    (10.104)

where V̂^[i](x(t + 1)) is the output of the upper critic network.
For the lower performance index function, the target function can be written as

V̲^[i](x(t)) = (x^T(t)Ax(t) + u^[i+1]T(t)Bu^[i+1](t) + w^[i+1]T Cw^[i+1]
              + 2u^[i+1]T Dw^[i+1] + 2x^T Eu^[i+1] + 2x^T Fw^[i+1]) Δt + V̂^[i](x(t + 1)),    (10.105)

where V̂^[i](x(t + 1)) is the output of the lower critic network.
Then, for the upper performance index function, we define the error function of the critic network as

e_c^[i](t) = V̂^[i](x(t)) − V̄^[i](x(t)),    (10.106)

where V̂^[i](x(t)) is the output of the upper critic neural network.
The objective function to be minimized for the critic network is

E_c^[i](t) = (1/2)(e_c^[i](t))^2.    (10.107)

So the gradient-based weight update rule for the critic network [21, 22] is given by

w_c^[i](t + 1) = w_c^[i](t) + Δw_c^[i](t),    (10.108)

Δw_c^[i](t) = η_c (−∂E_c^[i](t)/∂w_c^[i](t)),    (10.109)

∂E_c^[i](t)/∂w_c^[i](t) = (∂E_c^[i](t)/∂V̂^[i](x(t))) (∂V̂^[i](x(t))/∂w_c^[i](t)),    (10.110)

where η_c > 0 is the learning rate of the critic network and w_c(t) is the weight vector of the critic network.
For the lower performance index function, the error function of the critic network is defined by

e_c^[i](t) = V̂^[i](x(t)) − V̲^[i](x(t)).    (10.111)

And for the mixed optimal performance index function, the error function can be expressed as

e_c^{o(i)}(t) = V̂^[i](x(t)) − V^o(x(t)),    (10.112)

where V̂^[i](x(t)) is the output of the critic network.
For the lower and mixed optimal performance index functions, the weight update rule of the critic neural network is the same as that for the upper performance index function. The details are omitted here.
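The bootstrapped critic target and update (10.104)–(10.110) can be sketched as follows (scalar example layout; the 3-8-1 shape, learning rate and weight initialization are illustrative assumptions, and only the hidden-to-output weights are updated here):

```python
import numpy as np

rng_c = np.random.default_rng(1)
Yc = rng_c.uniform(-1, 1, (3, 8))      # input-to-hidden weights, X = [x; u; w]
Wc = rng_c.uniform(-1, 1, (8, 1))      # hidden-to-output weights
eta_c = 0.01

def critic(X):
    s = np.tanh(Yc.T @ X)
    return float(Wc.T @ s), s          # V_hat(X) = Wc^T sigma(Yc^T X)   (10.103)

def utility(x, u, w, A, B, C, D, E, F):
    return (x @ A @ x + u @ B @ u + w @ C @ w
            + 2 * u @ D @ w + 2 * x @ E @ u + 2 * x @ F @ w)

def critic_update(X, X_next, A, B, C, D, E, F, dt):
    """One gradient step of Wc toward the target utility*dt + V_hat(x(t+1))."""
    global Wc
    x, u, w = X[:1], X[1:2], X[2:3]    # scalar layout, assumed
    v_next, _ = critic(X_next)
    target = utility(x, u, w, A, B, C, D, E, F) * dt + v_next   # as in (10.104)
    v_hat, s = critic(X)
    e = v_hat - target                 # ec(t)   (10.106)
    Wc -= eta_c * e * s.reshape(-1, 1) # gradient of 0.5 * ec^2   (10.107)-(10.110)
    return 0.5 * e ** 2
```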

10.4.3 The Action Network

The action network is used to approximate the optimal and mixed optimal controls.
There are four action networks, two are used to approximate the optimal control

pair for the upper performance index function, and two are used to approximate the optimal control pair for the lower performance index function.
For the two action networks of the upper performance index function, x(t) is used as the input of the first action network to create the control u(t), and [x(t) u(t)]^T is used as the input of the other action network to create the control w(t).
The target function of the first action network (the u network) is the discretized form of Eq. (10.32):

u^[i+1] = −(1/2)(B − DC^{-1}D^T)^{-1} (2(E^T − DC^{-1}F^T)x
          + (b^T(x) − DC^{-1}c^T(x)) ∂V̄^[i](x(t + 1))/∂x(t + 1)).    (10.113)

And the target function of the second action network (the w network) is the discretized form of Eq. (10.33):

w^[i+1] = −(1/2)C^{-1} (2D^T u^[i+1] + 2F^T x + c^T(x) ∂V̄^[i](x(t + 1))/∂x(t + 1)).    (10.114)

For the two action networks of the lower performance index function, x(t) is used as the input of the first action network to create the control w(t), and [x(t) u(t)]^T is used as the input of the other action network to create the control u(t).
The target function of the first action network (the w network) is the discretized form of Eq. (10.35):

w^[i+1] = −(1/2)(C − D^T B^{-1}D)^{-1} (2(F^T − D^T B^{-1}E^T)x
          + (c^T(x) − D^T B^{-1}b^T(x)) ∂V̲^[i](x(t + 1))/∂x(t + 1)).    (10.115)

The target function of the second action network (the u network) is the discretized form of Eq. (10.36):

u^[i+1] = −(1/2)B^{-1} (2Dw^[i+1] + 2E^T x + b^T(x) ∂V̲^[i](x(t + 1))/∂x(t + 1)).    (10.116)

The output of the action network, e.g., the first network for the upper performance index function, can be formulated as

û^[i](t) = W_a^[i]T σ(Y_a^[i]T x(t)).    (10.117)

The target of the output of the action network is given by (10.113), so we can define the output error of the action network as

e_a^[i](t) = û^[i](t) − u^[i](t).    (10.118)

The weights of the action network are updated to minimize the following performance error measure:

E_a^[i](t) = (1/2)(e_a^[i](t))^2.    (10.119)

The weight update algorithm is similar to that of the critic network. By the gradient descent rule, we obtain

w_a^[i](t + 1) = w_a^[i](t) + Δw_a^[i](t),    (10.120)

Δw_a^[i](t) = η_a (−∂E_a^[i](t)/∂w_a^[i](t)),    (10.121)

∂E_a^[i](t)/∂w_a^[i](t) = (∂E_a^[i](t)/∂e_a^[i](t)) (∂e_a^[i](t)/∂û^[i](t)) (∂û^[i](t)/∂w_a^[i](t)),    (10.122)

where η_a > 0 is the learning rate of the action network.


The weight update rule for the other action networks is similar and is omitted.
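A minimal numerical sketch of the lower-index targets (10.115)–(10.116) (hypothetical helper; the E^T and B^{-1} placements are taken for dimensional consistency with (10.113)):

```python
import numpy as np

def lower_targets(x, gradV, b, c, B, C, D, E, F):
    """Target controls for the lower performance index, per (10.115)-(10.116)."""
    Binv = np.linalg.inv(B)
    M = C - D.T @ Binv @ D
    w = -0.5 * np.linalg.solve(M, 2 * (F.T - D.T @ Binv @ E.T) @ x
                                  + (c.T - D.T @ Binv @ b.T) @ gradV)
    u = -0.5 * Binv @ (2 * D @ w + 2 * E.T @ x + b.T @ gradV)
    return u, w
```

By construction, the returned pair satisfies the stationarity conditions 2Bu + 2Dw + 2E^T x + b^T ∇V = 0 and 2Cw + 2D^T u + 2F^T x + c^T ∇V = 0.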
After obtaining the optimal control pairs for the upper and lower performance
index respectively, we regulate the w action network for the upper performance
index function to compute the mixed optimal control pair. The output of the w action
network is

ŵ^[i](t) = W_a^[i]T σ(Y_a^[i]T z(t)),    (10.123)

where z(t) = [x(t) u(t)]^T.
As the error function of the mixed optimal performance index function is expressed as (10.112), the update rule can be written as

Δw_a^[i](t) = η_a (−∂E_c^{o(i)}(t)/∂w_a^[i](t))
           = −η_a (∂E_c^{o(i)}(t)/∂e_c^{o(i)}(t)) (∂e_c^{o(i)}(t)/∂w^[i](t)) (∂w^[i](t)/∂w_a^[i](t)),    (10.124)

where E_c^{o(i)}(t) = (1/2)(e_c^{o(i)}(t))^2.

10.5 Simulation Study

In this section, two examples are used to illustrate the effectiveness of the proposed
approach for continuous-time nonlinear quadratic ZS games.

Example 10.1 Consider the following linear system [17]

ẋ = x + u + w (10.125)

with the performance index function

J = ∫_0^∞ (x^2 + u^2 − γ^2 w^2) dt,    (10.126)

where γ^2 = 2.

We choose three-layer neural networks as the critic network and the model network, with the structures 1-8-1 and 2-8-1, respectively. The action networks also have three layers. For the upper performance index function, the structure of the u action network is 1-8-1 and the structure of the w action network is 2-8-1; for the lower performance index function, the structure of the u action network is 2-8-1 and the structure of the w action network is 1-8-1. The initial weights of the action
networks, the critic network and the model network are all set randomly in [−1, 1]. It should be mentioned that the model network is trained first. For the given initial state x(0) = 1, we train the model network for 20,000 steps with the learning rate η_m = 0.01. After the training of the model network is completed, its weights are kept unchanged. Then the critic network and the action networks are trained for 100 time steps so that the given accuracy ζ = 10^{-6} is reached. In the training process, the learning rates are η_a = η_c = 0.01.
The system and the performance index function are transformed according to (10.95) and (10.96), where the sampling time interval is Δt = 0.01. The iteration number is i = 4. The convergence trajectory of the performance index function is shown in Fig. 10.2.
From Theorem 4.1 in [17] we know that the saddle point exists for the system (10.125) with the performance index function (10.126). From Fig. 10.2, we can see that the performance index functions reach the saddle point after 4 iterations, which demonstrates the effectiveness of the method proposed in this chapter.
Figure 10.3 shows the iterative trajectories of the control variables u. The optimal
control trajectories are displayed in Fig. 10.4 and the corresponding state trajectory
is shown in Fig. 10.5.
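For this scalar example a closed-form cross-check is available. Assuming a quadratic value function V(x) = p x² (so that u* = −px and w* = px/γ²), substituting into x² + u² − γ²w² + V_x(x + u + w) = 0 gives 1 + 2p − (1 − 1/γ²)p² = 0; a quick numerical check:

```python
import numpy as np

gamma2 = 2.0
# -(1 - 1/gamma2) * p^2 + 2*p + 1 = 0
coeffs = [-(1 - 1 / gamma2), 2.0, 1.0]
p = max(np.roots(coeffs).real)                   # positive root, p = 2 + sqrt(6)
residual = 1 + 2 * p - (1 - 1 / gamma2) * p ** 2
```

The positive root p = 2 + √6 ≈ 4.449 solves the scalar HJI under this quadratic-value assumption, with saddle-point controls u* = −px and w* = px/2.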

Example 10.2 Consider the following continuous-time affine nonlinear system:

ẋ = [0.1x_1^2 + 0.05x_2]   [0.1 + x_1x_2       x_1     ]       [0.1 + x_1x_2   0.5 + x_1              ]
    [0.2x_1^2 − 0.15x_2] + [    x_1       0.2 + x_1x_2 ] w  +  [   x_2^2    0.1 + x_1^2   0.3 + x_1x_2] u,    (10.127)

where x = [x_1 x_2]^T, u = [u_1 u_2 u_3]^T and w = [w_1 w_2]^T.
The performance index function is formulated as

V(x(0), u, w) = ∫_0^∞ (x^T Ax + u^T Bu + w^T Cw + 2u^T Dw + 2x^T Eu + 2x^T Fw) dt,    (10.128)

where

A = [1 0; 0 1],  B = [1 0 0; 0 1 0; 0 0 1],  C = [−3 0; 0 −3],
D = [1 0 0; 0 1 1]^T,  E = [1 0 1; 0 1 1],  F = [−1 0; 0 −1].
The system and the performance index function are transformed according to (10.95) and (10.96), where the sampling time interval is Δt = 0.01. The critic network and the model network are also chosen as three-layer neural networks, with the structures 2-8-1 and 7-8-2, respectively. The action networks are also chosen as three-layer neural networks. For the upper performance index function, the structure of the u action network is 2-8-3 and the structure of the w action network is 5-8-2; for the lower performance index function, the structure of the w action network is 2-8-2 and the structure of the u action network is 4-8-3. The initial weights are also randomly chosen in [−1, 1]. For the given initial state x(0) = [−1 0.5]^T, we train the model network for 20,000 steps. After the training of the model network is completed, the weights are kept unchanged. Then the critic network and the action networks are trained
[Plot: 1st-iteration upper and lower performance index functions and the optimal performance index function vs. time steps.]

Fig. 10.2 The trajectories of upper and lower performance index functions

[Plot: 1st-iteration controls for the upper and lower performance indices and the optimal control vs. time steps.]

Fig. 10.3 The control trajectories

[Plot: the optimal controls u and w vs. time steps.]

Fig. 10.4 The trajectories of optimal control pair




Fig. 10.5 The state trajectory

[Plot: 1st and limiting iterations of the upper and lower performance index functions vs. time steps.]

Fig. 10.6 Performance index functions

for 1000 time steps so that the given accuracy ζ = 10^{-4} is reached. The convergence curves of the performance index functions are shown in Fig. 10.6.
Figure 10.7 shows the convergent trajectories of the control variable u_1 for the upper performance index function, and the optimal control trajectories for the upper perfor-

[Plot: 1st and limiting iterations of control u_1 for the upper performance index function vs. time steps.]

Fig. 10.7 The control u 1 for upper performance index function

[Plot: optimal controls u_1, u_2, u_3 and w_1, w_2 for the upper performance index function vs. time steps.]

Fig. 10.8 The optimal control for upper performance index function

[Plot: 1st and limiting iterations of control w_2 for the lower performance index function vs. time steps.]

Fig. 10.9 The control w2 for lower performance index function

[Plot: optimal controls u_1, u_2, u_3 and w_1, w_2 for the lower performance index function vs. time steps.]

Fig. 10.10 The optimal control for lower performance index function

mance index function are displayed in Fig. 10.8. Figure 10.9 shows the convergent trajectories of the control variable w_2 for the lower performance index function, and the optimal control trajectories for the lower performance index function are displayed in Fig. 10.10.

[Plot: upper, lower and mixed optimal performance index functions vs. time steps.]

Fig. 10.11 The performance index function trajectories

After 7 iterations, we obtain the value of the upper performance index function V̄(x(0)) = 70.93180829072188 and the value of the lower performance index function V̲(x(0)) = 22.15196172808129. So the saddle point does not exist, which can also be seen from Fig. 10.6. Using the mixed trajectory method proposed in this chapter, let the Gaussian noises be γ_w ∈ R^2 and γ_u ∈ R^3, where γ_{wi} ∼ N(0, σ_i^2), i = 1, 2, and γ_{uj} ∼ N(0, σ_j^2), j = 1, 2, 3, with σ_i = 10^{-3}, i = 1, 2, and σ_j = 10^{-3}, j = 1, 2, 3. Adding the Gaussian noises γ_w and γ_u to w and u and taking N = 100, according to (10.18)–(10.23) we obtain the value of the mixed optimal performance index function as 45.15594227340300, and the mixed optimal performance index function can be expressed as V^o(x) = 0.4716 V̄(x) + 0.5284 V̲(x).
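The mixing weight can be recovered from the three reported values, since V^o = αV̄ + (1 − α)V̲ implies α = (V^o − V̲)/(V̄ − V̲); a quick check with the numbers above:

```python
V_upper = 70.93180829072188
V_lower = 22.15196172808129
V_mixed = 45.15594227340300

alpha = (V_mixed - V_lower) / (V_upper - V_lower)
# alpha is approximately 0.4716 and 1 - alpha approximately 0.5284,
# matching the mixing weights reported in the chapter (up to rounding)
```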
Figure 10.11 displays the trajectories of the mixed optimal performance index function; the corresponding mixed optimal control trajectories and state trajectories are shown in Figs. 10.12 and 10.13, respectively. From the above simulation results we can see that, using the proposed iterative ADP method, the mixed optimal performance index function is successfully obtained when the saddle point of the differential game does not exist.

[Plot: mixed optimal controls u_1, u_2, u_3 and w_1, w_2 vs. time steps.]

Fig. 10.12 The mixed optimal control trajectories

[Plot: state trajectories x_1 and x_2 vs. time steps.]

Fig. 10.13 The state trajectories



10.6 Conclusions

In this chapter, we proposed an effective iterative ADP method to solve a class of continuous-time nonlinear two-person ZS differential games. When the saddle point exists, the iterative ADP method makes the performance index function reach the saddle point, with rigorous stability and convergence analysis. When there is no saddle point, a deterministic control scheme is proposed to guarantee that the performance index function reaches the mixed optimal solution, and the stability and convergence properties are also analyzed. Finally, the simulation studies have demonstrated the effectiveness of the proposed method.

References

1. Jamshidi, M.: Large-Scale Systems-Modeling and Control. North-Holland, Amsterdam, The


Netherlands (1982)
2. Chang, H., Marcus, S.: Two-person zero-sum markov games: receding horizon approach. IEEE
Trans. Autom. Control 48(11), 1951–1961 (2003)
3. Chen, B., Tseng, C., Uang, H.: Fuzzy differential games for nonlinear stochastic systems:
suboptimal approach. IEEE Trans. Fuzzy Syst. 10(2), 222–233 (2002)
4. Hwang, K., Chiou, J., Chen, T.: Reinforcement learning in zero-sum Markov games for robot
soccer systems. In: Proceedings of the 2004 IEEE International Conference on Networking,
Sensing and Control, Taipei, Taiwan, pp. 1110–1114 (2004)
5. Laraki, R., Solan, E.: The value of zero-sum stopping games in continuous time. SIAM J.
Control Optim. 43(5), 1913–1922 (2005)
6. Leslie, D., Collins, E.: Individual Q-learning in normal form games. SIAM J. Control Optim.
44(2), 495–514 (2005)
7. Gu, D.: A differential game approach to formation control. IEEE Trans. Control Syst. Technol.
16(1), 85–93 (2008)
8. Basar, T., Olsder, G.: Dynamic Noncooperative Game Theory. Academic, New York (1982)
9. Altman, E., Basar, T.: Multiuser rate-based flow control. IEEE Trans. Commun. 46(7), 940–949
(1998)
10. Goebel, R.: Convexity in zero-sum differential games. In: Proceedings of IEEE Conference on
Decision and Control, pp. 3964–3969 (2002)
11. Zhang, P., Deng, H., Xi, J.: On the value of two-person zero-sum linear quadratic differential
games. In: Proceedings of the 44th IEEE Conference on Decision and Control, and the European
Control Conference 2005 Seville, Spain, pp. 12–15 (2005)
12. Hua, X., Mizukami, K.: Linear-quadratic zero-sum differential games for generalized state
space systems. IEEE Trans. Autom. Control 39(1), 143–147 (1994)
13. Jimenez, M., Poznyak, A.: Robust and adaptive strategies with pre-identification via sliding
mode technique in LQ differential games. In: Proceedings of the 2006 American Control
Conference Minneapolis, Minnesota, USA, pp. 14–16 (2006)
14. Engwerda, J.: Uniqueness conditions for the affine open-loop linear quadratic differential game.
Automatica 44(2), 504–511 (2008)
15. Bertsekas, D.: Convex Analysis and Optimization. Athena Scientific, Belmont (2003)
16. Owen, G.: Game Theory. Academic Press, New York (1982)
17. Basar, T., Bernhard, P.: H ∞ Optimal Control and Related Minimax Design Problems.
Birkhäuser, Boston (1995)
18. Yong, J.: Dynamic programming and Hamilton–Jacobi–Bellman equation. Shanghai Science
Press, Shanghai (1991)

19. Padhi, R., Unnikrishnan, N., Wang, X., Balakrishnan, S.: A single network adaptive critic
(SNAC) architecture for optimal control synthesis for a class of nonlinear systems. Neural
Netw. 19(10), 1648–1660 (2006)
20. Gupta, S.: Numerical Methods for Engineers. Wiley Eastern Ltd. and New Age International
Company, New Delhi (1995)
21. Si, J., Wang, Y.: On-line learning control by association and reinforcement. IEEE Trans. Neural
Netw. 12(2), 264–275 (2001)
22. Enns, R., Si, J.: Helicopter trimming and tracking control using direct neural dynamic pro-
gramming. IEEE Trans. Neural Netw. 14(7), 929–939 (2003)
Chapter 11
Neural-Network-Based Synchronous
Iteration Learning Method for
Multi-player Zero-Sum Games

In this chapter, a synchronous solution method for multi-player zero-sum (ZS) games without system dynamics is established based on neural networks. The policy iteration (PI) algorithm is presented to solve the Hamilton–Jacobi–Bellman (HJB) equation. It is proven that the obtained iterative cost function converges to the optimal game value. To avoid using the system dynamics, an off-policy learning method is given to obtain the iterative cost function, controls and disturbances based on PI. A critic neural network (CNN), action neural networks (ANNs) and disturbance neural networks (DNNs) are used to approximate the cost function, controls and disturbances. The weights of the neural networks compose the synchronous weight matrix, and it is proven that the synchronous weight matrix is uniformly ultimately bounded (UUB). Two examples are given to show the effectiveness of the proposed synchronous solution method for multi-player ZS games.

11.1 Introduction

The importance of strategic behavior in the human and social world is increasingly
recognized in theory and practice. As a result, game theory has emerged as a fun-
damental instrument in pure and applied research [1]. Modern day society relies on
the operation of complex systems, including aircraft, automobiles, electric power
systems, economic entities, business organizations, banking and finance systems,
computer networks, manufacturing systems, and industrial processes. Networked
dynamical agents have cooperative team-based goals as well as individual selfish
goals, and their interplay can be complex and yield unexpected results in terms
of emergent teams. Cooperation and conflict of multiple decision-makers for such
systems can be studied within the field of cooperative and noncooperative game
theory [2]. It is known that many real-world systems are controlled by more than one controller or decision-maker, each using an individual strategy. These controllers often operate in a group with a general quadratic performance index
© Science Press, Beijing and Springer Nature Singapore Pte Ltd. 2019 207
R. Song et al., Adaptive Dynamic Programming: Single and Multiple
Controllers, Studies in Systems, Decision and Control 166,
https://doi.org/10.1007/978-981-13-1712-5_11

function, forming a game. Therefore, multi-player games have been studied by many scholars. In


[3], off-policy integral reinforcement learning method was developed to solve nonlin-
ear continuous-time multi-player non-zero-sum (NZS) games. In [4], a multi-player
zero-sum (ZS) differential games for a class of continuous-time uncertain nonlinear
systems were solved using upper and lower iterations. ZS game theory relies on
solving the Hamilton–Jacobi–Isaacs (HJI) equations, a generalized version of the
Hamilton–Jacobi–Bellman (HJB) equations appearing in optimal control problems.
In the nonlinear case the HJI equations are difficult or impossible to solve, and may
not have global analytic solutions even in simple cases. Therefore, many approximate
methods are proposed to obtain the solution of HJI equations [5–8].
In this chapter, multi-player ZS games for completely unknown continuous-time nonlinear systems are solved explicitly. The main contributions of this chapter are summarized as follows.
(1) A synchronous solution method based on PI algorithm and neural networks is
established.
(2) It is proven that, for the traditional PI algorithm with known system dynamics, the iterative cost function converges to the optimal game value.
(3) Synchronous solution method is given to solve the off-policy HJB equation with
convergence analysis, according to critic neural network (CNN), action neural
networks (ANNs) and disturbance neural networks (DNNs).
(4) The uniformly ultimately bounded (UUB) of the synchronous weight matrix is
proven.
The rest of this chapter is organized as follows. In Sect. 11.2, we present the moti-
vations and preliminaries of the discussed problem. In Sect. 11.3, the synchronous
solution of multi-player ZS games is developed and the convergence proof is given.
In Sect. 11.4, two examples are given to demonstrate the effectiveness of the proposed
scheme. In Sect. 11.5, the conclusion is drawn.

11.2 Motivations and Preliminaries

In this chapter, we consider the continuous-time nonlinear system described by

ẋ = f(x) + g(x) Σ_{i=1}^{p} u_i + h(x) Σ_{j=1}^{q} d_j,    (11.1)

where x ∈ Ω ⊂ R^n is the system state, and u_i ∈ R^{m_1} and d_j ∈ R^{m_2} are the control inputs and the disturbance inputs, respectively. f(x) ∈ R^n, g(x) and h(x) are unknown functions, f(0) = 0, and x = 0 is an equilibrium point of the system. Assume that f(x), g(x) and h(x) are locally Lipschitz on the compact set Ω that contains the origin, and that the dynamical system is stabilizable on Ω. The performance index function is a generalized quadratic form given by
is a generalized quadratic form given by

J(x(0), U_p, D_q) = ∫_0^∞ (x^T Qx + Σ_{i=1}^{p} u_i^T R_i u_i − Σ_{j=1}^{q} d_j^T S_j d_j) dt,    (11.2)

where Q, R_i and S_j are positive definite matrices, U_p = {u_1, u_2, . . . , u_p} and D_q = {d_1, d_2, . . . , d_q}. Then, we define the multi-player ZS differential game subject to (11.1) as

V*(x(0)) = inf_{u_1} inf_{u_2} · · · inf_{u_p} sup_{d_1} sup_{d_2} · · · sup_{d_q} J(x(0), U_p, D_q).    (11.3)

The multi-player ZS differential game selects the minimizing player set U_p and the maximizing player set D_q such that the saddle point (U_p*, D_q*) satisfies the following inequalities:

J(x, U_p*, D_q) ≤ J(x, U_p*, D_q*) ≤ J(x, U_p, D_q*),    (11.4)

where U_p* = {u_1*, u_2*, . . . , u_p*} and D_q* = {d_1*, d_2*, . . . , d_q*}.
In this chapter, we assume that the multi-player optimal control problem has a unique solution if and only if the Nash condition holds [9]:

V*(x) = inf_{U_p} sup_{D_q} J(x, U_p, D_q) = sup_{D_q} inf_{U_p} J(x, U_p, D_q).    (11.5)

If we fix the feedback policy (U_p(x), D_q(x)), then the value or cost of the policy is

V(x(t)) = ∫_t^∞ (x^T Qx + Σ_{i=1}^{p} u_i^T R_i u_i − Σ_{j=1}^{q} d_j^T S_j d_j) dt.    (11.6)

By using Leibniz's formula and differentiating, (11.6) has a differential equivalent. Then we can obtain the nonlinear ZS game Bellman equation, which is given in terms of the Hamiltonian function:

H(x, ∇V, U_p, D_q) = x^T Qx + Σ_{i=1}^{p} u_i^T R_i u_i − Σ_{j=1}^{q} d_j^T S_j d_j
                   + ∇V^T (f + g Σ_{i=1}^{p} u_i + h Σ_{j=1}^{q} d_j) = 0,    (11.7)

where ∇V = ∂V/∂x. The stationarity conditions are

∂H/∂u_i = 0, i = 1, 2, . . . , p,    (11.8)

and

∂H/∂d_j = 0, j = 1, 2, . . . , q.    (11.9)

According to (11.7), the optimal controls and disturbances are

u_i* = −(1/2) R_i^{-1} g^T ∇V*, i = 1, 2, . . . , p,    (11.10)

and

d_j* = (1/2) S_j^{-1} h^T ∇V*, j = 1, 2, . . . , q.    (11.11)
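The stationarity solutions (11.10)–(11.11) are direct to evaluate once ∇V is available; a minimal sketch (hypothetical helper):

```python
import numpy as np

def stationary_policies(gradV, g, h, R_list, S_list):
    """Controls (11.10) and disturbances (11.11) from the value gradient."""
    us = [-0.5 * np.linalg.solve(R, g.T @ gradV) for R in R_list]
    ds = [0.5 * np.linalg.solve(S, h.T @ gradV) for S in S_list]
    return us, ds
```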
From the Bellman equation (11.7), we can derive V* from the solution of the HJI equation

0 = x^T Qx + ∇V^T f − (1/4) Σ_{i=1}^{p} ∇V^T g R_i^{-1} g^T ∇V + (1/4) Σ_{j=1}^{q} ∇V^T h S_j^{-1} h^T ∇V.    (11.12)

Note that if (11.12) is solved, then the optimal controls are obtained. In the general case, the PI algorithm can be applied to obtain V*. The implementation is given in Algorithm 4, whose convergence will be analyzed in the next theorem.

Algorithm 4 PI for nonlinear multi-player ZS differential games

1: Start with stabilizing initial policies u_1^[0], u_2^[0], . . . , u_p^[0] and d_1^[0], d_2^[0], . . . , d_q^[0].
2: For k = 1, 2, 3, . . . , solve V^[k] from

   0 = x^T Qx + Σ_{i=1}^{p} u_i^[k]T R_i u_i^[k] − Σ_{j=1}^{q} d_j^[k]T S_j d_j^[k]
     + ∇V^[k]T (f + g Σ_{i=1}^{p} u_i^[k] + h Σ_{j=1}^{q} d_j^[k]).    (11.13)

3: Update the controls and disturbances using

   u_i^[k+1] = −(1/2) R_i^{-1} g^T ∇V^[k]    (11.14)

   and

   d_j^[k+1] = (1/2) S_j^{-1} h^T ∇V^[k].    (11.15)

4: Let k = k + 1, return to step 2 and continue.



Theorem 11.1 Define V^{[k]} as in (11.13). Let the control policy u_i^{[k]} and disturbance policy d_j^{[k]} be given by (11.14) and (11.15), respectively. Then the iterative values V^{[k]} converge to the optimal game value V^*, as k \to \infty.

Proof According to (11.13), we have

\dot{V}^{[k+1]} = -x^T Q x - \sum_{i=1}^{p} u_i^{[k+1]T} R_i u_i^{[k+1]} + \sum_{j=1}^{q} d_j^{[k+1]T} S_j d_j^{[k+1]}.  (11.16)

Then

\dot{V}^{[k]} = -x^T Q x - \sum_{i=1}^{p} u_i^{[k]T} R_i u_i^{[k]} + \sum_{j=1}^{q} d_j^{[k]T} S_j d_j^{[k]}
    - \sum_{i=1}^{p} u_i^{[k+1]T} R_i u_i^{[k+1]} + \sum_{j=1}^{q} d_j^{[k+1]T} S_j d_j^{[k+1]}
    + \sum_{i=1}^{p} u_i^{[k+1]T} R_i u_i^{[k+1]} - \sum_{j=1}^{q} d_j^{[k+1]T} S_j d_j^{[k+1]}
  = \dot{V}^{[k+1]} + \sum_{i=1}^{p} u_i^{[k+1]T} R_i u_i^{[k+1]} - \sum_{j=1}^{q} d_j^{[k+1]T} S_j d_j^{[k+1]}
    - \sum_{i=1}^{p} u_i^{[k]T} R_i u_i^{[k]} + \sum_{j=1}^{q} d_j^{[k]T} S_j d_j^{[k]}.  (11.17)

By rearranging terms, we have

\dot{V}^{[k]} = \dot{V}^{[k+1]} - \sum_{i=1}^{p} (u_i^{[k+1]} - u_i^{[k]})^T R_i (u_i^{[k+1]} - u_i^{[k]}) + 2 \sum_{i=1}^{p} u_i^{[k+1]T} R_i (u_i^{[k+1]} - u_i^{[k]})
    + \sum_{j=1}^{q} (d_j^{[k+1]} - d_j^{[k]})^T S_j (d_j^{[k+1]} - d_j^{[k]}) - 2 \sum_{j=1}^{q} d_j^{[k+1]T} S_j (d_j^{[k+1]} - d_j^{[k]}).  (11.18)

Let \bar{u}_i^{[k]} = u_i^{[k+1]} - u_i^{[k]} and \bar{d}_j^{[k]} = d_j^{[k+1]} - d_j^{[k]}; then


\dot{V}^{[k]} = \dot{V}^{[k+1]} - \sum_{i=1}^{p} \bar{u}_i^{[k]T} R_i \bar{u}_i^{[k]} + 2 \sum_{i=1}^{p} u_i^{[k+1]T} R_i \bar{u}_i^{[k]} + \sum_{j=1}^{q} \bar{d}_j^{[k]T} S_j \bar{d}_j^{[k]} - 2 \sum_{j=1}^{q} d_j^{[k+1]T} S_j \bar{d}_j^{[k]}.  (11.19)

From (11.14) and (11.15), we have

\nabla V^{[k]T} g = -2 u_i^{[k+1]T} R_i,  (11.20)

and

\nabla V^{[k]T} h = 2 d_j^{[k+1]T} S_j.  (11.21)

Then (11.19) is expressed as

\dot{V}^{[k]} = \dot{V}^{[k+1]} - \sum_{i=1}^{p} \bar{u}_i^{[k]T} R_i \bar{u}_i^{[k]} - \sum_{i=1}^{p} \nabla V^{[k]T} g \bar{u}_i^{[k]} + \sum_{j=1}^{q} \bar{d}_j^{[k]T} S_j \bar{d}_j^{[k]} - \sum_{j=1}^{q} \nabla V^{[k]T} h \bar{d}_j^{[k]}.  (11.22)

Thus, sufficient conditions for \dot{V}^{[k]} \le \dot{V}^{[k+1]} are

\bar{u}_i^{[k]T} R_i \bar{u}_i^{[k]} + \nabla V^{[k]T} g \bar{u}_i^{[k]} > 0,  (11.23)

and

\bar{d}_j^{[k]T} S_j \bar{d}_j^{[k]} - \nabla V^{[k]T} h \bar{d}_j^{[k]} < 0.  (11.24)

Hence, \dot{V}^{[k]} \le \dot{V}^{[k+1]} holds if \delta_H(S_j) \|\bar{d}_j^{[k]}\| \le \|\nabla V^{[k]T} h\| and \nabla V^{[k]T} g \bar{u}_i^{[k]} > 0, or if \delta_H(S_j) \|\bar{d}_j^{[k]}\| \le \|\nabla V^{[k]T} h\| and \delta_L(R_i) \|\bar{u}_i^{[k]}\| > \|\nabla V^{[k]T} g\|, where \delta_L(\cdot) is the operator which takes the minimum singular value, and \delta_H(\cdot) is the operator which takes the maximum singular value. This completes the proof.


From Algorithm 4, we can see that the PI algorithm depends on the system dynamics, which are unknown in this chapter. Therefore, in the next section, an off-policy PI algorithm will be presented which can solve for the control and disturbance policies synchronously.

11.3 Synchronous Solution of Multi-player ZS Games

In this section, an off-policy algorithm is proposed based on Algorithm 4. The neural network implementation process is also given. Based on that, the stability of the synchronous solution method is proven.

11.3.1 Derivation of Off-Policy Algorithm

Let u_i^{[k]} and d_j^{[k]} be obtained by (11.14) and (11.15); then the original system (11.1) is rewritten as

\dot{x} = f + g \sum_{i=1}^{p} u_i^{[k]} + h \sum_{j=1}^{q} d_j^{[k]} + g \sum_{i=1}^{p} (u_i - u_i^{[k]}) + h \sum_{j=1}^{q} (d_j - d_j^{[k]}).  (11.25)

Substituting (11.25) into (11.6), we have

V^{[k]}(x(t+T)) - V^{[k]}(x(t)) = \int_t^{t+T} \nabla V^{[k]T} \dot{x} \, d\tau
  = \int_t^{t+T} \nabla V^{[k]T} \Big( f + g \sum_{i=1}^{p} u_i^{[k]} + h \sum_{j=1}^{q} d_j^{[k]} \Big) d\tau
  + \int_t^{t+T} \nabla V^{[k]T} \Big( g \sum_{i=1}^{p} (u_i - u_i^{[k]}) + h \sum_{j=1}^{q} (d_j - d_j^{[k]}) \Big) d\tau.  (11.26)

According to (11.13), (11.26) becomes

V^{[k]}(x(t+T)) - V^{[k]}(x(t)) = -\int_t^{t+T} \Big( x^T Q x + \sum_{i=1}^{p} u_i^{[k]T} R_i u_i^{[k]} - \sum_{j=1}^{q} d_j^{[k]T} S_j d_j^{[k]} \Big) d\tau
  + \int_t^{t+T} \nabla V^{[k]T} \Big( g \sum_{i=1}^{p} (u_i - u_i^{[k]}) + h \sum_{j=1}^{q} (d_j - d_j^{[k]}) \Big) d\tau.  (11.27)

Substituting (11.20) and (11.21) into (11.27) yields the off-policy Bellman equation for multi-player ZS games, which is expressed as

V^{[k]}(x(t+T)) - V^{[k]}(x(t)) = -\int_t^{t+T} \Big( x^T Q x + \sum_{i=1}^{p} u_i^{[k]T} R_i u_i^{[k]} - \sum_{j=1}^{q} d_j^{[k]T} S_j d_j^{[k]} \Big) d\tau
  + \int_t^{t+T} \Big( -2 \sum_{i=1}^{p} u_i^{[k+1]T} R_i (u_i - u_i^{[k]}) + 2 \sum_{j=1}^{q} d_j^{[k+1]T} S_j (d_j - d_j^{[k]}) \Big) d\tau.  (11.28)

It can be seen that (11.28) reveals two points. First, the system dynamics are not necessary for obtaining V^{[k]}. Second, u_i^{[k]}, d_j^{[k]} and V^{[k]} can be obtained synchronously. In the next part, the implementation method for solving (11.28) will be presented.

11.3.2 Implementation Method for Off-Policy Algorithm

In this part, the method for solving the off-policy Bellman equation (11.28) is given. Critic, action and disturbance networks are applied to approximate V^{[k]}, u_i^{[k]} and d_j^{[k]}. The implementation block diagram is shown in Fig. 11.1, where the CNN, ANNs and DNNs are used to approximate the cost, the control policies and the disturbances, respectively. In the neural network, if the number of hidden-layer neurons is L, the weight matrix between the input layer and the hidden layer is Y, the weight matrix between the hidden layer and the output layer is W, and the input vector of the neural network is X, then the output of the three-layer neural network is represented by

Fig. 11.1 Implementation block diagram (the plant together with the CNN, ANNs and DNNs)



F_N(X, Y, W) = W^T \hat{\sigma}(Y X),  (11.29)

where \hat{\sigma}(Y X) is the activation function. For convenience of analysis, only the output weight W is updated during training, while the hidden-layer weight is kept unchanged. Hence, in the following, the neural network function (11.29) can be simplified to the expression

F_N(X, W) = W^T \sigma(X).  (11.30)
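A minimal numeric sketch of (11.29)–(11.30) (numpy, with hypothetical layer sizes): with the hidden weight Y frozen, the network output is linear in the trainable output weight W, which is what makes the weight updates below tractable:

```python
import numpy as np

def FN(X, Y, W):
    """Three-layer network output of (11.29): F_N(X, Y, W) = W^T sigma(Y X)."""
    return W.T @ np.tanh(Y @ X)

rng = np.random.default_rng(1)
n, L, m = 3, 8, 1                # input, hidden and output sizes (hypothetical)
Y = rng.uniform(-1, 1, (L, n))   # hidden weight: fixed during training
W = rng.uniform(-1, 1, (L, m))   # output weight: the only trained parameter
X = rng.uniform(-1, 1, (n,))

out = FN(X, Y, W)
# With Y fixed, tanh(Y X) is just a feature vector sigma(X): the form (11.30).
phi = np.tanh(Y @ X)
assert np.allclose(out, W.T @ phi)
# The output is linear in W:
W2 = rng.uniform(-1, 1, (L, m))
assert np.allclose(FN(X, Y, W + W2), FN(X, Y, W) + FN(X, Y, W2))
```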

The neural network expression of the CNN is given as

V^{[k]}(x) = A^{[k]T} \phi_V(x) + \delta_V(x),  (11.31)

where A^{[k]} is the ideal weight of the critic network, \phi_V(x) is the activation function, and \delta_V(x) is the residual error. Let the estimate of A^{[k]} be \hat{A}^{[k]}. Then the estimate of V^{[k]}(x) is

\hat{V}^{[k]}(x) = \hat{A}^{[k]T} \phi_V(x),  (11.32)

and

\nabla \hat{V}^{[k]}(x) = \nabla \phi_V^T(x) \hat{A}^{[k]}.  (11.33)

The neural network expression of the ANN is

u_i^{[k]} = B_i^{[k]T} \phi_u(x) + \delta_u(x),  (11.34)

where B_i^{[k]} is the ideal weight of the action network, \phi_u(x) is the activation function, and \delta_u(x) is the residual error. Let \hat{B}_i^{[k]} be the estimate of B_i^{[k]}; then the estimate of u_i^{[k]} is

\hat{u}_i^{[k]} = \hat{B}_i^{[k]T} \phi_u(x).  (11.35)

The neural network expression of the DNN is

d_j^{[k]} = C_j^{[k]T} \phi_d(x) + \delta_d(x),  (11.36)

where C_j^{[k]} is the ideal weight of the disturbance network, \phi_d(x) is the activation function, and \delta_d(x) is the residual error. Let \hat{C}_j^{[k]} be the estimate of C_j^{[k]}; then the estimate of d_j^{[k]} is

\hat{d}_j^{[k]} = \hat{C}_j^{[k]T} \phi_d(x).  (11.37)

According to (11.28), we define the equation error as



e^{[k]} = \hat{V}^{[k]}(x(t)) - \hat{V}^{[k]}(x(t+T)) - \int_t^{t+T} \Big( x^T Q x + \sum_{i=1}^{p} \hat{u}_i^{[k]T} R_i \hat{u}_i^{[k]} - \sum_{j=1}^{q} \hat{d}_j^{[k]T} S_j \hat{d}_j^{[k]} \Big) d\tau
  + \int_t^{t+T} \Big( -2 \sum_{i=1}^{p} \hat{u}_i^{[k+1]T} R_i (u_i - \hat{u}_i^{[k]}) + 2 \sum_{j=1}^{q} \hat{d}_j^{[k+1]T} S_j (d_j - \hat{d}_j^{[k]}) \Big) d\tau.  (11.38)

Therefore, substituting (11.32), (11.35) and (11.37) into (11.38), we have

e^{[k]} = \hat{V}^{[k]}(x(t)) - \hat{V}^{[k]}(x(t+T)) - \int_t^{t+T} \Big( x^T Q x + \sum_{i=1}^{p} \hat{u}_i^{[k]T} R_i \hat{u}_i^{[k]} - \sum_{j=1}^{q} \hat{d}_j^{[k]T} S_j \hat{d}_j^{[k]} \Big) d\tau
  + \int_t^{t+T} \Big( -2 \sum_{i=1}^{p} \phi_u^T \hat{B}_i^{[k+1]} R_i (u_i - \hat{u}_i^{[k]}) + 2 \sum_{j=1}^{q} \phi_d^T \hat{C}_j^{[k+1]} S_j (d_j - \hat{d}_j^{[k]}) \Big) d\tau.  (11.39)

Since

\sum_{i=1}^{p} \phi_u^T \hat{B}_i^{[k+1]} R_i (u_i - \hat{u}_i^{[k]}) = \sum_{i=1}^{p} \Big( (u_i - \hat{u}_i^{[k]})^T R_i \otimes \phi_u^T \Big) \mathrm{vec}(\hat{B}_i^{[k+1]}),  (11.40)

where \otimes denotes the Kronecker product, and

\sum_{j=1}^{q} \phi_d^T \hat{C}_j^{[k+1]} S_j (d_j - \hat{d}_j^{[k]}) = \sum_{j=1}^{q} \Big( (d_j - \hat{d}_j^{[k]})^T S_j \otimes \phi_d^T \Big) \mathrm{vec}(\hat{C}_j^{[k+1]}).  (11.41)
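The vectorization in (11.40)–(11.41) is the standard identity a^T M b = (b^T \otimes a^T) vec(M), with vec stacking columns. A quick numeric check with hypothetical dimensions:

```python
import numpy as np

rng = np.random.default_rng(2)
L, m = 8, 2                      # hidden size and control dimension (hypothetical)
phi = rng.standard_normal(L)     # stands in for phi_u(x)
B = rng.standard_normal((L, m))  # stands in for the action weight matrix
R = np.eye(m)                    # stands in for R_i (symmetric)
c = rng.standard_normal(m)       # stands in for u_i - u_i_hat

vec = lambda M: M.flatten(order='F')   # column-stacking vectorization

# phi^T B R c  ==  (c^T R (x) phi^T) vec(B), the identity used in (11.40)
lhs = phi @ B @ R @ c
rhs = np.kron(c @ R, phi) @ vec(B)
assert np.isclose(lhs, rhs)
print("Kronecker/vec identity checks out:", lhs)
```

This rewriting is what turns the unknown weight matrix into a plain unknown vector, so the error equation becomes linear in the stacked weights.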

Substituting (11.40) and (11.41) into (11.39) yields

e^{[k]} = \big( (\phi_V(x(t)) - \phi_V(x(t+T)))^T \otimes I \big) \hat{A}^{[k]} - \int_t^{t+T} \Big( x^T Q x + \sum_{i=1}^{p} \hat{u}_i^{[k]T} R_i \hat{u}_i^{[k]} - \sum_{j=1}^{q} \hat{d}_j^{[k]T} S_j \hat{d}_j^{[k]} \Big) d\tau
  + \int_t^{t+T} \Big( -2 \sum_{i=1}^{p} (u_i - \hat{u}_i^{[k]})^T R_i \otimes \phi_u^T \Big) d\tau \, \mathrm{vec}(\hat{B}_i^{[k+1]})
  + \int_t^{t+T} \Big( 2 \sum_{j=1}^{q} (d_j - \hat{d}_j^{[k]})^T S_j \otimes \phi_d^T \Big) d\tau \, \mathrm{vec}(\hat{C}_j^{[k+1]}).  (11.42)

Define

\Pi_V = (\phi_V(x(t)) - \phi_V(x(t+T)))^T \otimes I,  (11.43)

\Pi = \int_t^{t+T} \Big( x^T Q x + \sum_{i=1}^{p} \hat{u}_i^{[k]T} R_i \hat{u}_i^{[k]} - \sum_{j=1}^{q} \hat{d}_j^{[k]T} S_j \hat{d}_j^{[k]} \Big) d\tau,  (11.44)

\Pi_u = \int_t^{t+T} \Big( -2 \sum_{i=1}^{p} (u_i - \hat{u}_i^{[k]})^T R_i \otimes \phi_u^T \Big) d\tau,  (11.45)

\Pi_d = \int_t^{t+T} \Big( 2 \sum_{j=1}^{q} (d_j - \hat{d}_j^{[k]})^T S_j \otimes \phi_d^T \Big) d\tau.  (11.46)

Then we have

e^{[k]} = \Pi_V \hat{A}^{[k]} - \Pi + \Pi_u \mathrm{vec}(\hat{B}_i^{[k+1]}) + \Pi_d \mathrm{vec}(\hat{C}_j^{[k+1]})
  = [\Pi_V \; \Pi_u \; \Pi_d] \big[ \hat{A}^{[k]T}, \; \mathrm{vec}(\hat{B}_i^{[k+1]})^T, \; \mathrm{vec}(\hat{C}_j^{[k+1]})^T \big]^T - \Pi.  (11.47)

Define the activation function matrix

\Pi_\Pi = [\Pi_V \; \Pi_u \; \Pi_d],  (11.48)

and the synchronous weight matrix

\hat{W}_{i,j}^{[k]} = \big[ \hat{A}^{[k]T}, \; \mathrm{vec}(\hat{B}_i^{[k+1]})^T, \; \mathrm{vec}(\hat{C}_j^{[k+1]})^T \big]^T.  (11.49)

Then (11.47) is

e^{[k]} = \Pi_\Pi \hat{W}_{i,j}^{[k]} - \Pi.  (11.50)



Define E^{[k]} = \frac{1}{2} e^{[k]T} e^{[k]}. Then, according to the gradient descent algorithm, the update rule for the weight \hat{W}_{i,j}^{[k]} is

\dot{\hat{W}}_{i,j}^{[k]} = -\eta_{i,j}^{[k]} \Pi_\Pi^T \big( \Pi_\Pi \hat{W}_{i,j}^{[k]} - \Pi \big),  (11.51)

where \eta_{i,j}^{[k]} is a positive number.


According to the gradient descent algorithm, the optimal weight \hat{W}_{i,j}^{[k]} minimizes E^{[k]} and can be obtained adaptively by (11.51). Therefore, the weights of the critic, action and disturbance networks are solved simultaneously. In the proposed method, only one equation, instead of (11.13)–(11.15) in Algorithm 4, is needed to obtain the optimal solution of the multi-player ZS games.
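Discretizing the continuous-time update (11.51) with Euler steps gives plain gradient descent on E^{[k]}. The sketch below uses random matrices as hypothetical stand-ins for \Pi_\Pi and \Pi and shows the weight converging to the least-squares solution:

```python
import numpy as np

rng = np.random.default_rng(3)
rows, dim = 40, 6                          # data rows and stacked-weight size (hypothetical)
Pi_mat = rng.standard_normal((rows, dim))  # stands in for Pi_Pi
Pi_vec = rng.standard_normal(rows)         # stands in for Pi

W_hat = np.zeros(dim)
eta, dt = 1.0, 0.01                        # learning rate and Euler step (hypothetical)
for _ in range(20000):
    # Euler-discretized (11.51): W_dot = -eta * Pi_Pi^T (Pi_Pi W - Pi)
    W_hat -= dt * eta * Pi_mat.T @ (Pi_mat @ W_hat - Pi_vec)

W_ls = np.linalg.lstsq(Pi_mat, Pi_vec, rcond=None)[0]
assert np.allclose(W_hat, W_ls, atol=1e-6)
print("gradient flow reaches the least-squares weight:", np.round(W_ls, 3))
```

The step size must keep dt * eta * ||Pi_Pi||^2 below 2 for the discrete iteration to remain stable, mirroring the positive-gain condition on \eta_{i,j}^{[k]}.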

11.3.3 Stability Analysis

Theorem 11.2 Let the update rule for the critic, action and disturbance networks be as in (11.51). Define the weight estimation error as \tilde{W}_{i,j}^{[k]} = W_{i,j}^{[k]} - \hat{W}_{i,j}^{[k]}. Then \tilde{W}_{i,j}^{[k]} is UUB.

Proof Let the Lyapunov function candidate be

\Lambda_{i,j}^{[k]} = \frac{\alpha}{2 \eta_{i,j}^{[k]}} \tilde{W}_{i,j}^{[k]T} \tilde{W}_{i,j}^{[k]}, \quad \forall i, j, k,  (11.52)

where \alpha > 0.
According to (11.51), we have

\dot{\tilde{W}}_{i,j}^{[k]} = \eta_{i,j}^{[k]} \Pi_\Pi^T \big( \Pi_\Pi (W_{i,j}^{[k]} - \tilde{W}_{i,j}^{[k]}) - \Pi \big) = -\eta_{i,j}^{[k]} \Pi_\Pi^T \Pi_\Pi \tilde{W}_{i,j}^{[k]} + \eta_{i,j}^{[k]} \Pi_\Pi^T \Pi_\Pi W_{i,j}^{[k]} - \eta_{i,j}^{[k]} \Pi_\Pi^T \Pi.  (11.53)

Therefore, the derivative of (11.52) is

\dot{\Lambda}_{i,j}^{[k]} = \frac{\alpha}{\eta_{i,j}^{[k]}} \tilde{W}_{i,j}^{[k]T} \dot{\tilde{W}}_{i,j}^{[k]}
  = -\alpha \tilde{W}_{i,j}^{[k]T} \Pi_\Pi^T \Pi_\Pi \tilde{W}_{i,j}^{[k]} + \alpha \tilde{W}_{i,j}^{[k]T} \Pi_\Pi^T \Pi_\Pi W_{i,j}^{[k]} - \alpha \tilde{W}_{i,j}^{[k]T} \Pi_\Pi^T \Pi
  \le -\alpha \|\tilde{W}_{i,j}^{[k]}\|^2 \|\Pi_\Pi\|^2 + \alpha \tilde{W}_{i,j}^{[k]T} \Pi_\Pi^T \Pi_\Pi W_{i,j}^{[k]} - \alpha \tilde{W}_{i,j}^{[k]T} \Pi_\Pi^T \Pi
  \le -\alpha \|\tilde{W}_{i,j}^{[k]}\|^2 \|\Pi_\Pi\|^2 + \frac{1}{2} \|\tilde{W}_{i,j}^{[k]}\|^2 \|\Pi_\Pi\|^2 + \frac{\alpha^2}{2} \|W_{i,j}^{[k]}\|^2 \|\Pi_\Pi\|^2 + \frac{1}{2} \|\tilde{W}_{i,j}^{[k]}\|^2 \|\Pi_\Pi\|^2 + \frac{\alpha^2}{2} \|\Pi\|^2.  (11.54)
2 2

By rearranging, (11.54) becomes

\dot{\Lambda}_{i,j}^{[k]} \le (-\alpha + 1) \|\tilde{W}_{i,j}^{[k]}\|^2 \|\Pi_\Pi\|^2 + \frac{\alpha^2}{2} \|W_{i,j}^{[k]}\|^2 \|\Pi_\Pi\|^2 + \frac{\alpha^2}{2} \|\Pi\|^2.  (11.55)

Define

\Sigma_{i,j}^{[k]} = \frac{\alpha^2}{2} \|W_{i,j}^{[k]}\|^2 \|\Pi_\Pi\|^2 + \frac{\alpha^2}{2} \|\Pi\|^2.  (11.56)

Then (11.55) is

\dot{\Lambda}_{i,j}^{[k]} \le (-\alpha + 1) \|\tilde{W}_{i,j}^{[k]}\|^2 \|\Pi_\Pi\|^2 + \Sigma_{i,j}^{[k]}.  (11.57)

Thus, if

\alpha > 1,  (11.58)

and

\|\tilde{W}_{i,j}^{[k]}\|^2 > \frac{\Sigma_{i,j}^{[k]}}{(\alpha - 1) \|\Pi_\Pi\|^2},  (11.59)

then \tilde{W}_{i,j}^{[k]} is UUB. This completes the proof.




According to Theorem 11.2, if the convergence condition is satisfied, then \hat{V}^{[k]} \to V^{[k]}, \hat{u}_i^{[k]} \to u_i^{[k]} and \hat{d}_j^{[k]} \to d_j^{[k]}.

11.4 Simulation Study

In this section, two examples will be provided to demonstrate the effectiveness of the optimal control scheme proposed in this chapter.
Example 11.1 Consider the following linear system [10] with modifications

\dot{x} = x + u + d.  (11.60)

In this example, the initial state is x(0) = 1. We select hyperbolic tangent functions as the activation functions of the critic, action and disturbance networks. The structure of each network is 1-8-1. The initial weight W is selected arbitrarily from (−1, 1); the dimension of W is 24 × 1. For the cost function, Q, R and S in the utility function are identity matrices of appropriate dimensions. After 500 time steps, the simulation results are obtained. In Fig. 11.2, the cost function is shown, which converges to zero as time increases. The control and disturbance

Fig. 11.2 Cost function

Fig. 11.3 Control

trajectories are given in Figs. 11.3 and 11.4. Under the action of the obtained control
and disturbance inputs, the state trajectory is displayed in Fig. 11.5. It is clear that
the presented method in this chapter is very effective and feasible.

Fig. 11.4 Disturbance

Fig. 11.5 State



Fig. 11.6 Cost function

Example 11.2 Consider the following affine-in-control-input nonlinear system [11]

\dot{x} = f(x) + g(x) \sum_{i=1}^{p} u_i + h(x) \sum_{j=1}^{q} d_j,  (11.61)

where

f(x) = \big[ x_2; \; -x_2 - \frac{1}{2} x_1 + \frac{1}{4} x_2 (\cos(2x_1) + 2)^2 + \frac{1}{4} x_2 (\sin(4x_1^2) + 2)^2 \big],
g(x) = \big[ 0; \; \cos(2x_1) + 2 \big],  h(x) = \big[ 0; \; \sin(4x_1^2) + 2 \big],  p = q = 1.

In this simulation, the initial state is x(0) = [1, -1]^T. Hyperbolic tangent functions are used as the activation functions of the critic, action and disturbance networks. The structure of each network is 2-8-1. The initial weight W is selected arbitrarily from (−1, 1); the dimension of W is 24 × 1. For the cost function of (11.61), Q, R and S in the utility function are identity matrices of appropriate dimensions. The simulation results are obtained over 2500 time steps. The cost function is shown in Fig. 11.6; it is ZS. The control and disturbance trajectories are given in Figs. 11.7 and 11.8, and the state trajectories are displayed in Fig. 11.9. We can see that the closed-loop system state, the control and the disturbance inputs converge to zero as the time step increases. Thus the proposed synchronous method for multi-player ZS games in this chapter is very effective.

Fig. 11.7 Control

Fig. 11.8 Disturbance



Fig. 11.9 State (trajectories x(1) and x(2))

11.5 Conclusions

This chapter proposed a synchronous solution method, based on neural networks, for multi-player ZS games without system dynamics. A PI algorithm is presented to solve the HJB equation when the system dynamics are known, and it is proven that the iterative cost function obtained by PI converges to the optimal game value. Based on PI, an off-policy learning method is given to obtain the iterative cost function, controls and disturbances. The weights of the CNN, ANNs and DNNs compose the synchronous weight matrix, which is proven to be UUB by the Lyapunov technique. The simulation study indicates the effectiveness of the proposed synchronous solution method for multi-player ZS games. A future research problem is to apply the proposed approach to a class of systems with interconnection terms.

References

1. Yeung, D., Petrosyan, L.: Cooperative Stochastic Differential Games. Springer, Berlin (2006)
2. Lewis, F., Vrabie, D., Syrmos, V.: Optimal Control, 3rd edn. Wiley, Hoboken (2012)
3. Song, R., Lewis, F., Wei, Q.: Off-policy integral reinforcement learning method to solve nonlin-
ear continuous-time multi-player non-zero-sum games. IEEE Trans. Neural Networks Learn.
Syst. 28(3), 704–713 (2016)
4. Liu, D., Wei, Q.: Multiperson zero-sum differential games for a class of uncertain nonlinear
systems. Int. J. Adap. Control Signal Process. 28(3–5), 205–231 (2014)

5. Mu, C., Sun, C., Song, A., Yu, H.: Iterative GDHP-based approximate optimal tracking control
for a class of discrete-time nonlinear systems. Neurocomputing 214(19), 775–784 (2016)
6. Fang, X., Zheng, D., He, H., Ni, Z.: Data-driven heuristic dynamic programming with virtual
reality. Neurocomputing 166(20), 244–255 (2015)
7. Feng, T., Zhang, H., Luo, Y., Zhang, J.: Stability analysis of heuristic dynamic programming
algorithm for nonlinear systems. Neurocomputing 149(Part C, 3), 1461–1468 (2015)
8. Feng, T., Zhang, H., Luo, Y., Liang, H.: Globally optimal distributed cooperative control for
general linear multi-agent systems. Neurocomputing 203(26), 12–21 (2016)
9. Lewis, F., Vrabie, D., Syrmos, V.: Optimal Control. Wiley, New York (2012)
10. Basar, T., Olsder, G.: Dynamic Noncooperative Game Theory. Academic Press, New York
(1982)
11. Vamvoudakis, K., Lewis, F.: Multi-player non-zero-sum games: online adaptive learning solu-
tion of coupled Hamilton-Jacobi equations. Automatica 47(8), 1556–1569 (2011)
Chapter 12
Off-Policy Integral Reinforcement
Learning Method for Multi-player
Non-zero-Sum Games

This chapter establishes an off-policy integral reinforcement learning (IRL) method to solve nonlinear continuous-time non-zero-sum (NZS) games with unknown system dynamics. The IRL algorithm is presented to obtain the iterative control, and off-policy learning is used to allow the dynamics to be completely unknown. Off-policy IRL is designed to perform policy evaluation and policy improvement in the policy iteration (PI) algorithm. Critic and action networks are used to obtain the performance index and control for each player. A gradient descent algorithm updates the critic and action weights simultaneously. The convergence analysis of the weights is given. The asymptotic stability of the closed-loop system and the existence of the Nash equilibrium are proven. A simulation study demonstrates the effectiveness of the developed method for nonlinear continuous-time NZS games with unknown system dynamics.
12.1 Introduction

Non-zero-sum (NZS) games with N players rely on solving the coupled Hamilton–Jacobi (HJ) equations: each player decides on the Nash equilibrium depending on HJ equations coupled through their quadratic terms. For linear NZS games, this reduces to the coupled algebraic Riccati equations [1]. For nonlinear NZS games, the problem is difficult and may not even have analytic solutions. Therefore, many intelligent methods have been proposed to obtain approximate solutions [2–6].
IRL allows the development of a Bellman equation that does not contain the system dynamics [7–10]. It is worth noting that most IRL algorithms are on-policy, i.e., the performance index function is evaluated using system data generated with the policies being evaluated. This means on-policy learning methods use "inaccurate" data to learn the performance index function, which increases the accumulated error. Off-policy IRL, in contrast, can learn the solution of the HJB equation from system data generated by an arbitrary control. Moreover, off-policy IRL can be regarded as a direct learning method for NZS games, which avoids the
© Science Press, Beijing and Springer Nature Singapore Pte Ltd. 2019 227
R. Song et al., Adaptive Dynamic Programming: Single and Multiple
Controllers, Studies in Systems, Decision and Control 166,
https://doi.org/10.1007/978-981-13-1712-5_12

identification of system dynamics. In [11], a novel PI approach for finding online adaptive optimal controllers for CT linear systems with completely unknown system dynamics was presented; that work introduced the idea of off-policy control for nonlinear systems. In [12], an off-policy RL method was introduced to learn the solution of the HJI equation of CT systems with completely unknown dynamics from real system data instead of a mathematical system model, and its convergence was proved. In [13], an off-policy optimal control method was developed for unknown CT systems with unknown disturbances. These studies are the foundation of our work. Note that the system dynamics are not known in advance in many multi-player NZS games, which makes it difficult to obtain the Nash equilibrium depending on the HJ equations for each player. Therefore, effective methods need to be developed to deal with multi-player NZS games with unknown dynamics. This motivates our research interest.
In this chapter, an off-policy IRL method is studied for CT multi-player NZS games with unknown dynamics. First, the PI algorithm is introduced with a convergence analysis. Then the off-policy method is designed to perform policy evaluation and policy improvement without the system dynamics. Neural networks are used to approximate the critic and action networks, and the update methods for the neural network weights are given. It is proven that the weight errors of the neural networks are uniformly ultimately bounded (UUB). It is also proven that the closed-loop system is asymptotically stable and that the Nash equilibrium can be obtained. The main contributions of this chapter are summarized as follows.
(1) For nonlinear CT NZS games, an off-policy Bellman equation is developed for policy updating without the system dynamics.
(2) It is proven that the weight errors of the neural networks are UUB.
(3) The asymptotic stability of the closed-loop system is proven.
The rest of the chapter is organized as follows. In Sect. 12.2, the problem moti-
vations and preliminaries are presented. In Sect. 12.3, the multi-player learning PI
solution for NZS games is introduced. In Sect. 12.4, the off-policy IRL optimal
control method is considered. In Sect. 12.5, examples are given to demonstrate the
effectiveness of the proposed optimal control scheme. In Sect. 12.6, the conclusion
is drawn.

12.2 Problem Statement

Consider the following nonlinear system

\dot{x}(t) = f(x(t)) + g(x(t)) \sum_{j=1}^{N} u_j(t),  (12.1)

where the state x(t) \in \mathbb{R}^n and the controls u_j(t) \in \mathbb{R}^m. This system has N inputs or players, and they influence each other through their joint effects on the overall system

state dynamics. The system functions f(x) and g(x) are unknown. Let Ω, containing the origin, be a closed subset of \mathbb{R}^n to which all motions of (12.1) are restricted. Let f + g \sum_{j=1}^{N} u_j be Lipschitz continuous on Ω, and assume system (12.1) is stabilizable in the sense that there exists an admissible control on Ω that asymptotically stabilizes the system. Define u_{-i} as the supplementary set of player i: u_{-i} = \{u_j, j \in \{1, 2, \ldots, i-1, i+1, \ldots, N\}\}. Define the performance index function of player i as

J_i(x(0), u_i, u_{-i}) = \int_0^\infty r_i(x(t), u_i, u_{-i}) dt,  (12.2)

where the utility function r_i(x, u_i, u_{-i}) = Q_i(x) + \sum_{j=1}^{N} u_j^T R_{ij} u_j, in which the function Q_i(x) \ge 0 and R_{ij} > 0 are symmetric matrices.
Q i (x)0 and Ri j > 0 are symmetric matrices.
Define the multi-player differential game

V_i^*(x(t), u_i, u_{-i}) = \min_{u_i} \int_t^\infty \Big( Q_i(x) + \sum_{j=1}^{N} u_j^T R_{ij} u_j \Big) d\tau.  (12.3)
This game implies that all the players have the same competitive hierarchical level and seek to attain a Nash equilibrium, as given by the following definition.

Definition 12.1 (Nash equilibrium [14, 15]) Policies \{u_i^*, u_{-i}^*\} = \{u_1^*, u_2^*, \ldots, u_i^*, \ldots, u_N^*\} are said to constitute a Nash equilibrium solution for the N-player game if

J_i^* = J_i(u_i^*, u_{-i}^*) \le J_i(u_i, u_{-i}^*),  i \in N;  (12.4)

hence the N-tuple \{J_1^*, J_2^*, \ldots, J_N^*\} is known as a Nash equilibrium value set or outcome of the N-player game.
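Definition 12.1 can be illustrated on a hypothetical two-player static quadratic game (not the differential game of this chapter): the Nash point solves the pair of best-response conditions, and no unilateral deviation improves the deviating player's cost:

```python
import numpy as np

# Hypothetical two-player quadratic costs; each player minimizes its own J_i.
J1 = lambda u1, u2: (u1 - 1)**2 + u1 * u2
J2 = lambda u1, u2: (u2 - 2)**2 + u1 * u2

# Best responses from dJ1/du1 = 0 and dJ2/du2 = 0:
#   u1 + u2/2 = 1  and  u1/2 + u2 = 2  ->  solve the coupled linear pair.
A = np.array([[1.0, 0.5], [0.5, 1.0]])
b = np.array([1.0, 2.0])
u1_star, u2_star = np.linalg.solve(A, b)

# Check (12.4): a unilateral deviation never helps the deviating player.
for dev in np.linspace(-1, 1, 21):
    assert J1(u1_star, u2_star) <= J1(u1_star + dev, u2_star) + 1e-12
    assert J2(u1_star, u2_star) <= J2(u1_star, u2_star + dev) + 1e-12
print("Nash point:", (u1_star, u2_star))
```

The coupling term u1*u2 is what makes the best responses interdependent, the static analogue of the HJ equations being coupled through their quadratic terms.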

12.3 Multi-player Learning PI Solution for NZS Games

In this section, the PI solution for NZS games is introduced with a convergence analysis. From Definition 12.1, it can be seen that if any player unilaterally changes his control policy while the policies of all other players remain the same, then that player will obtain worse performance. For fixed stabilizing feedback control policies u_i and u_{-i}, define the value function

V_i(x(t)) = \int_t^\infty r_i(x, u_i, u_{-i}) d\tau = \int_t^\infty \Big( Q_i(x) + \sum_{j=1}^{N} u_j^T R_{ij} u_j \Big) d\tau.  (12.5)

Using the Leibniz formula, the Hamiltonian function is given by the Bellman equation

H_i(x, \nabla V_i, u_i, u_{-i}) = Q_i(x) + \sum_{j=1}^{N} u_j^T R_{ij} u_j + \nabla V_i^T \Big( f(x) + g(x) \sum_{j=1}^{N} u_j \Big) = 0.  (12.6)

Then the following policies can be yielded from \partial H_i / \partial u_i = 0:

u_i(x) = -\frac{1}{2} R_{ii}^{-1} g^T(x) \nabla V_i.  (12.7)
Therefore, one obtains the N-coupled Hamilton–Jacobi (HJ) equations

0 = \nabla V_i^T \Big( f(x) - \frac{1}{2} g(x) \sum_{j=1}^{N} R_{jj}^{-1} g^T(x) \nabla V_j \Big) + Q_i(x) + \frac{1}{4} \sum_{j=1}^{N} \nabla V_j^T g(x) R_{jj}^{-1} R_{ij} R_{jj}^{-1} g^T(x) \nabla V_j.  (12.8)

Define the best-response HJ equation as the Bellman equation (12.6) with control u_i^* given by (12.7) and arbitrary policies u_{-i}:

H_i(x, \nabla V_i, u_i^*, u_{-i}) = \nabla V_i^T f(x) + Q_i + \nabla V_i^T g(x) \sum_{j \ne i} u_j - \frac{1}{4} \nabla V_i^T g(x) R_{ii}^{-1} g^T(x) \nabla V_i + \sum_{j \ne i} u_j^T R_{ij} u_j.  (12.9)

In [1], the following policy iteration for N-player games has been proposed to solve (12.8). Here [0] and [k] in the superscripts denote the iteration index. The following two theorems establish the convergence of Algorithm 5.
Theorem 12.1 Suppose Ω is bounded, the system functions f(x) and g(x) are known, the iterative control u_i^{[k]} of player i is obtained by the PI algorithm (12.10)–(12.12), and the controls u_{-i} do not update their policies. Then the iterative cost function is convergent, and the values converge to those of the best-response HJ equation (12.9).
Proof Let

H_i^o(x, \nabla V_i^{[k]}, u_{-i}) = \min_{u_i^{[k]}} H_i(x, \nabla V_i^{[k]}, u_i^{[k]}, u_{-i}) = H_i(x, \nabla V_i^{[k]}, u_i^{[k+1]}, u_{-i}).  (12.13)

Algorithm 5 Policy Iteration

1: Start with stabilizing initial policies u_1^{[0]}, u_2^{[0]}, \ldots, u_N^{[0]}.
2: Given the N-tuple of policies u_1^{[k]}, u_2^{[k]}, \ldots, u_N^{[k]}, solve for the N-tuple of costs V_1^{[k]}, V_2^{[k]}, \ldots, V_N^{[k]} using

H_i(x, \nabla V_i^{[k]}, u_i^{[k]}, u_{-i}^{[k]}) = \nabla V_i^{[k]T} \Big( f(x) + g(x) \sum_{j=1}^{N} u_j^{[k]} \Big) + r_i(x, u_1^{[k]}, u_2^{[k]}, \ldots, u_N^{[k]}) = 0  (12.10)

with V_i^{[k]}(0) = 0.
3: Update the N-tuple of control policies using

u_i^{[k+1]} = \arg\min_{u_i} [H_i(x, \nabla V_i, u_i, u_{-i})],  (12.11)

which explicitly is

u_i^{[k+1]} = -\frac{1}{2} R_{ii}^{-1} g^T(x) \nabla V_i^{[k]}.  (12.12)

Since only player i updates its control, let

H_i(x, \nabla V_i^{[k]}, u_i^{[k]}, u_{-i}) = 0.  (12.14)

Therefore, from (12.13) and (12.14), one has

H_i^o(x, \nabla V_i^{[k]}, u_{-i}) \le 0.  (12.15)

With u_i^{[k+1]} and the current policies u_{-i}, the orbital derivative becomes

\dot{V}_i^{[k]} = H_i(x, \nabla V_i^{[k]}, u_i^{[k+1]}, u_{-i}) - r_i(x, u_i^{[k+1]}, u_{-i}).  (12.16)

According to (12.13), Eq. (12.16) is

\dot{V}_i^{[k]} = H_i^o(x, \nabla V_i^{[k]}, u_{-i}) - r_i(x, u_i^{[k+1]}, u_{-i}).  (12.17)

From (12.15), it follows that

\dot{V}_i^{[k]} \le -r_i(x, u_i^{[k+1]}, u_{-i}).  (12.18)

On the other side, as only player i updates its control,

H_i(x, \nabla V_i^{[k+1]}, u_i^{[k+1]}, u_{-i}) = 0,  (12.19)

and from (12.10)



\dot{V}_i^{[k+1]} = H_i(x, \nabla V_i^{[k+1]}, u_i^{[k+1]}, u_{-i}) - r_i(x, u_i^{[k+1]}, u_{-i}) = -r_i(x, u_i^{[k+1]}, u_{-i}).  (12.20)

According to (12.18) and (12.20),

\dot{V}_i^{[k]} \le \dot{V}_i^{[k+1]}.  (12.21)

It follows that [15]

V_i^{[k]} \ge V_i^{[k+1]},  (12.22)

which means that V_i^{[k]} is a monotonically decreasing sequence. Define \lim_{k \to \infty} V_i^{[k]} = V_i^\infty; according to [16, 17], V_i^\infty = V_i^*. That is, the algorithm converges to the value of the best-response HJ equation, V_i^*.

Theorem 12.2 Suppose the system functions f(x) and g(x) are known, and all the iterative controls u_i^{[k]} are obtained by the PI algorithm (12.10)–(12.12). If Ω is bounded, then the iterative values V_i^{[k]} converge to the optimal game values V_i^*, as k \to \infty.

Proof As

\dot{V}_i^{[k+1]} = -Q_i(x) - \sum_{j=1}^{N} u_j^{[k+1]T} R_{ij} u_j^{[k+1]},  (12.23)

and

\dot{V}_i^{[k]} = -Q_i(x) - \sum_{j=1}^{N} u_j^{[k]T} R_{ij} u_j^{[k]} + \sum_{j=1}^{N} u_j^{[k+1]T} R_{ij} u_j^{[k+1]} - \sum_{j=1}^{N} u_j^{[k+1]T} R_{ij} u_j^{[k+1]}
  = \dot{V}_i^{[k+1]} - \sum_{j=1}^{N} u_j^{[k]T} R_{ij} u_j^{[k]} + \sum_{j=1}^{N} u_j^{[k+1]T} R_{ij} u_j^{[k+1]}
  = \dot{V}_i^{[k+1]} - \sum_{j=1}^{N} (u_j^{[k+1]} - u_j^{[k]})^T R_{ij} (u_j^{[k+1]} - u_j^{[k]}) + 2 \sum_{j=1}^{N} u_j^{[k+1]T} R_{ij} u_j^{[k+1]} - 2 \sum_{j=1}^{N} u_j^{[k+1]T} R_{ij} u_j^{[k]}
  = \dot{V}_i^{[k+1]} - \sum_{j=1}^{N} (u_j^{[k+1]} - u_j^{[k]})^T R_{ij} (u_j^{[k+1]} - u_j^{[k]}) + 2 \sum_{j=1}^{N} u_j^{[k+1]T} R_{ij} (u_j^{[k+1]} - u_j^{[k]}).  (12.24)

Let

\bar{u}_j^{[k]} = u_j^{[k+1]} - u_j^{[k]}.  (12.25)

Then a sufficient condition for \dot{V}_i^{[k]} \le \dot{V}_i^{[k+1]} is

\sum_{j=1}^{N} \bar{u}_j^{[k]T} R_{ij} \bar{u}_j^{[k]} - 2 \sum_{j=1}^{N} u_j^{[k+1]T} R_{ij} \bar{u}_j^{[k]} \ge 0.  (12.26)

According to (12.12), (12.26) means

\bar{u}_j^{[k]T} R_{ij} \bar{u}_j^{[k]} + \nabla V_j^{[k]T} g(x) R_{jj}^{-T} R_{ij} \bar{u}_j^{[k]} \ge 0.  (12.27)

Hence, if \nabla V_j^{[k]T} g(x) R_{jj}^{-T} R_{ij} \bar{u}_j^{[k]} \ge 0, then \dot{V}_i^{[k]} \le \dot{V}_i^{[k+1]} holds. If \nabla V_j^{[k]T} g(x) R_{jj}^{-T} R_{ij} \bar{u}_j^{[k]} < 0, then a sufficient condition for \dot{V}_i^{[k]} \le \dot{V}_i^{[k+1]} becomes

\|\bar{u}_j^{[k]T} R_{ij} \bar{u}_j^{[k]}\| \ge \|\nabla V_j^{[k]T} g(x) R_{jj}^{-T} R_{ij} \bar{u}_j^{[k]}\|,  (12.28)

i.e.,

\delta_L(R_{ij}) \|\bar{u}_j^{[k]}\| \ge \|\nabla V_j^{[k]}\| \|g(x)\| \delta_H(R_{jj}^{-T} R_{ij}),  (12.29)

where \delta_L(\cdot) is the operator which takes the minimum singular value, and \delta_H(\cdot) is the operator which takes the maximum singular value. Specifically, (12.29) holds if \delta_H(R_{jj}^{-T} R_{ij}) = 0. By integration of \dot{V}_i^{[k]} \le \dot{V}_i^{[k+1]}, it follows that

V_i^{[k]} \ge V_i^{[k+1]},  (12.30)

which shows that V_i^{[k]} is a nonincreasing sequence. Define \lim_{k \to \infty} V_i^{[k]} = V_i^\infty; one has

V_i^\infty \le V_i^{[k+1]} \le V_i^{[k]}.  (12.31)

According to [17], V_i^\infty = V_i^*. Thus the algorithm converges to V_i^*.

Remark 12.1 In the above proof, if \nabla V_j^{[k]T} g(x) R_{jj}^{-T} R_{ij} \bar{u}_j^{[k]} \ge 0, then \dot{V}_i^{[k]} \le \dot{V}_i^{[k+1]}. If \nabla V_j^{[k]T} g(x) R_{jj}^{-T} R_{ij} \bar{u}_j^{[k]} < 0 and (12.29) holds, then \dot{V}_i^{[k]} \le \dot{V}_i^{[k+1]}.

Theorems 12.1–12.2 prove that the PI algorithm (Algorithm 5) is convergent. Note that in system (12.1), f(x) and g(x) are unknown; therefore Algorithm 5 in (12.10)–(12.12) cannot be used directly to obtain the Nash equilibrium for the unknown system (12.1). In this chapter, an off-policy IRL method based on Algorithm 5 is established for multi-player NZS games.
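The monotone convergence in (12.22) and (12.30) can be observed on a scalar single-controller LQ analogue of the PI loop (a sketch, not the multi-player algorithm itself): for ẋ = x + u with Q = R = 1, the ARE 2P + 1 − P² = 0 has the stabilizing solution P* = 1 + √2, and the PI values decrease monotonically to it:

```python
import math

a, b, Q, R = 1.0, 1.0, 1.0, 1.0
P_star = 1.0 + math.sqrt(2.0)      # stabilizing root of 2P + Q - P^2/R = 0

K = 2.0                            # stabilizing initial gain (a - b*K < 0)
P_seq = []
for _ in range(10):
    # Policy evaluation (scalar Lyapunov eq.): 2(a - b*K) P + Q + K^2 R = 0
    P = (Q + R * K**2) / (2.0 * (b * K - a))
    P_seq.append(P)
    # Policy improvement: K <- R^{-1} b P
    K = b * P / R

# Values decrease monotonically to the ARE solution, mirroring (12.22)/(12.30).
assert all(P_seq[i] >= P_seq[i + 1] - 1e-12 for i in range(len(P_seq) - 1))
assert abs(P_seq[-1] - P_star) < 1e-9
print("PI values:", [round(p, 5) for p in P_seq[:4]], "->", round(P_star, 5))
```

Each evaluation/improvement pair is one pass through steps 2–3 of Algorithm 5, specialized to one player and a scalar linear system.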

12.4 Off-Policy Integral Reinforcement Learning Method

From Algorithm 5, it can be seen that the PI algorithm depends on the system dynamics, which are not known in this chapter. Therefore, an off-policy IRL algorithm is established to solve the NZS games. In this section, the off-policy IRL is first presented; then the method for solving the off-policy Bellman equation is developed; finally, the theoretical analyses and the implementation method are given.

12.4.1 Derivation of Off-Policy Algorithm

Let u_i^{[k]} be obtained by (12.12); then the original system (12.1) can be rewritten as

\dot{x} = f(x) + \sum_{j=1}^{N} g(x) u_j^{[k]} + \sum_{j=1}^{N} g(x) (u_j - u_j^{[k]}).  (12.32)

Then

V_i^{[k]}(x(t+T)) - V_i^{[k]}(x(t)) = \int_t^{t+T} \nabla V_i^{[k]T} \dot{x} \, d\tau
  = \int_t^{t+T} \nabla V_i^{[k]T} \Big( f(x) + \sum_{j=1}^{N} g(x) u_j^{[k]} \Big) d\tau + \int_t^{t+T} \sum_{j=1}^{N} \nabla V_i^{[k]T} g(x) (u_j - u_j^{[k]}) d\tau.  (12.33)

From (12.10), one has

\nabla V_i^{[k]T} \Big( f(x) + \sum_{j=1}^{N} g(x) u_j^{[k]} \Big) = -Q_i(x) - \sum_{j=1}^{N} u_j^{[k]T} R_{ij} u_j^{[k]}.  (12.34)

Then (12.33) is given by

V_i^{[k]}(x(t+T)) - V_i^{[k]}(x(t)) = -\int_t^{t+T} Q_i(x) d\tau - \int_t^{t+T} \sum_{j=1}^{N} u_j^{[k]T} R_{ij} u_j^{[k]} d\tau + \int_t^{t+T} \nabla V_i^{[k]T} g(x) \sum_{j=1}^{N} (u_j - u_j^{[k]}) d\tau.  (12.35)

From (12.12), one has

g^T(x) \nabla V_i^{[k]} = -2 R_{ii} u_i^{[k+1]}.  (12.36)

Then the off-policy Bellman equation is obtained from (12.35):

V_i^{[k]}(x(t+T)) - V_i^{[k]}(x(t)) = -\int_t^{t+T} Q_i(x) d\tau - \int_t^{t+T} \sum_{j=1}^{N} u_j^{[k]T} R_{ij} u_j^{[k]} d\tau - 2 \int_t^{t+T} \sum_{j=1}^{N} u_i^{[k+1]T} R_{ii} (u_j - u_j^{[k]}) d\tau.  (12.37)

Remark 12.2 Notice that in (12.37), the term \nabla V_i^{[k]T} g(x), which depends on the unknown function g(x), is replaced by u_i^{[k+1]T} R_{ii}, which can be obtained by measuring the state online. Therefore, (12.37) plays an important role in separating the system dynamics from the iterative process; it is referred to as the off-policy Bellman equation. By the off-policy Bellman equation (12.37), one can obtain the optimal control of the N-player nonlinear differential game without requiring the system dynamics.

The off-policy IRL method is summarized as follows.

Algorithm 6 Off-Policy IRL

1: Let the iteration index k = 0, and select admissible controls u_1^{[0]}, u_2^{[0]}, \ldots, u_N^{[0]}.
2: For k = 0, 1, 2, \ldots, solve for V_i^{[k]} and u_i^{[k+1]} from the off-policy Bellman equation (12.37).

The next part will give the implementation method for Algorithm 6.
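As a minimal illustration of the data-based character of Algorithm 6, the sketch below runs off-policy IRL on a scalar single-player analogue (ẋ = x + u, Q = R = 1; the behavior input, Euler step and interval length are all hypothetical choices, and this is not the book's neural implementation). One batch of trajectory data, generated by an arbitrary behavior input, is reused at every iteration to solve a scalar version of (12.37) by least squares for the value coefficient P and the improved gain:

```python
import numpy as np

# Scalar analogue: x_dot = a x + b u, cost integrand Q x^2 + R u^2, V = P x^2.
a, b, Q, R = 1.0, 1.0, 1.0, 1.0
P_star = 1.0 + np.sqrt(2.0)        # ARE solution, for comparison only

# --- Collect data once with an arbitrary (behavior) input u = -2x + exploration.
dt, T_total, T = 1e-3, 10.0, 0.1
steps = int(T_total / dt)
x = np.zeros(steps + 1); x[0] = 1.0
u = np.zeros(steps)
t = np.arange(steps) * dt
for k in range(steps):
    u[k] = -2.0 * x[k] + 0.3 * np.sin(5.0 * t[k]) + 0.2 * np.sin(1.3 * t[k])
    x[k + 1] = x[k] + dt * (a * x[k] + b * u[k])   # Euler integration

# Per-interval integrals, computed once and reused at every iteration:
n_int = int(T / dt)
idx = np.arange(0, steps - n_int + 1, n_int)
Ixx = np.array([np.sum(x[i:i + n_int] ** 2) * dt for i in idx])              # int x^2
Ixu = np.array([np.sum(x[i:i + n_int] * u[i:i + n_int]) * dt for i in idx])  # int x u
dPhi = x[idx + n_int] ** 2 - x[idx] ** 2                                     # x(t+T)^2 - x(t)^2

# --- Iterations: unknowns are P (value of gain K) and the improved gain K_next,
# from the scalar (12.37): P*dPhi = -(Q + R K^2) Ixx + 2 K_next R (Ixu + K Ixx).
K = 2.0                            # admissible initial gain
for _ in range(10):
    A_ls = np.column_stack([dPhi, -2.0 * R * (Ixu + K * Ixx)])
    b_ls = -(Q + R * K ** 2) * Ixx
    P, K = np.linalg.lstsq(A_ls, b_ls, rcond=None)[0]

print("learned P ~", P, "vs ARE solution", P_star)
```

The learned P and gain approach the ARE value 1 + √2 even though the data were generated by a different (behavior) policy, which is the essence of off-policy learning: evaluation and improvement are extracted jointly from the same stored data, with no model of f or g.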

12.4.2 Implementation Method for Off-Policy Algorithm

In this part, the method for solving the off-policy Bellman equation (12.37) is given. Critic and action networks are used to approximate V_i^{[k]}(x(t)) and u_i^{[k]} for each player, respectively. The neural network expression of the critic network is

    V_i^{[k]}(x) = M_i^{[k]T} \varphi_i(x) + \varepsilon_i^{[k]}(x),  (12.38)

where M_i^{[k]} is the ideal weight of the critic network, \varphi_i(x) is the activation function, and \varepsilon_i^{[k]}(x) is the residual error. Let the estimate of M_i^{[k]} be \hat{M}_i^{[k]}; then the estimate of V_i^{[k]}(x) is

    \hat{V}_i^{[k]}(x) = \hat{M}_i^{[k]T} \varphi_i(x).  (12.39)

Accordingly,

    \nabla \hat{V}_i^{[k]}(x) = \nabla \varphi_i^T(x) \hat{M}_i^{[k]}.  (12.40)

Define the neural network expression of the action network as

    u_i^{[k]}(x) = N_i^{[k]T} \phi_i(x) + \delta_i^{[k]}(x),  (12.41)

where N_i^{[k]} is the ideal weight of the action network, \phi_i(x) is the activation function, and \delta_i^{[k]}(x) is the residual error. Let \hat{N}_i^{[k]} be the estimate of N_i^{[k]}; then the estimate of u_i^{[k]}(x) is

    \hat{u}_i^{[k]}(x) = \hat{N}_i^{[k]T} \phi_i(x).  (12.42)

Therefore, from (12.37) the equation error is defined as

    e_i^{[k]} = \hat{V}_i^{[k]}(x(t)) - \hat{V}_i^{[k]}(x(t+T)) - \int_t^{t+T} Q_i(x) \, d\tau
        - \int_t^{t+T} \sum_{j=1}^{N} \hat{u}_j^{[k]T} R_{ij} \hat{u}_j^{[k]} \, d\tau
        - 2 \int_t^{t+T} \sum_{j=1}^{N} \hat{u}_i^{[k+1]T} R_{ii} \big( u_j - \hat{u}_j^{[k]} \big) \, d\tau.  (12.43)

Substituting (12.39) and (12.42) into (12.43), one can get

    e_i^{[k]} = \hat{M}_i^{[k]T} \varphi_i(x(t)) - \hat{M}_i^{[k]T} \varphi_i(x(t+T)) - \int_t^{t+T} Q_i(x) \, d\tau
        - \int_t^{t+T} \sum_{j=1}^{N} \hat{u}_j^{[k]T} R_{ij} \hat{u}_j^{[k]} \, d\tau
        - 2 \int_t^{t+T} \sum_{j=1}^{N} \phi_i^T(x(\tau)) \hat{N}_i^{[k+1]} R_{ii} \big( u_j - \hat{u}_j^{[k]} \big) \, d\tau.  (12.44)

As

    \sum_{j=1}^{N} \phi_i^T(x) \hat{N}_i^{[k+1]} R_{ii} \big( u_j - \hat{u}_j^{[k]} \big)
      = \Big( \Big( \sum_{j=1}^{N} \big( u_j - \hat{u}_j^{[k]} \big) \Big)^T R_{ii} \otimes \phi_i^T \Big) \mathrm{vec}\big( \hat{N}_i^{[k+1]} \big).  (12.45)

Here the symbol ⊗ stands for Kronecker product. Then (12.44) becomes

    e_i^{[k]} = -\big( (\varphi_i(x(t+T)) - \varphi_i(x(t)))^T \otimes I \big) \hat{M}_i^{[k]} - \int_t^{t+T} Q_i(x) \, d\tau
        - \int_t^{t+T} \sum_{j=1}^{N} \hat{u}_j^{[k]T} R_{ij} \hat{u}_j^{[k]} \, d\tau
        - 2 \int_t^{t+T} \Big( \Big( \sum_{j=1}^{N} \big( u_j - \hat{u}_j^{[k]} \big) \Big)^T R_{ii} \otimes \phi_i^T \Big) d\tau \, \mathrm{vec}\big( \hat{N}_i^{[k+1]} \big).  (12.46)

Define I_{xx,i} = (\varphi_i(x(t+T)) - \varphi_i(x(t)))^T \otimes I, I_{xu,i}^{[k]} = -\int_t^{t+T} Q_i(x) \, d\tau - \int_t^{t+T} \sum_{j=1}^{N} \hat{u}_j^{[k]T} R_{ij} \hat{u}_j^{[k]} \, d\tau, and I_{uu,i}^{[k]} = 2 \int_t^{t+T} \Big( \big( \sum_{j=1}^{N} ( u_j - \hat{u}_j^{[k]} ) \big)^T R_{ii} \otimes \phi_i^T \Big) d\tau. Therefore (12.46) is written as

    e_i^{[k]} = -I_{xx,i} \hat{M}_i^{[k]} + I_{xu,i}^{[k]} - I_{uu,i}^{[k]} \mathrm{vec}\big( \hat{N}_i^{[k+1]} \big).  (12.47)

Let I_{w,i}^{[k]} = \big[ -I_{xx,i} \;\; -I_{uu,i}^{[k]} \big] and \hat{W}_i^{[k]} = \begin{bmatrix} \hat{M}_i^{[k]} \\ \mathrm{vec}( \hat{N}_i^{[k+1]} ) \end{bmatrix}; then (12.47) is given by

    e_i^{[k]} = I_{w,i}^{[k]} \hat{W}_i^{[k]} + I_{xu,i}^{[k]}.  (12.48)
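The vectorization step in (12.45) is the standard Kronecker identity a^T X b = (b^T \otimes a^T)\mathrm{vec}(X), with column-stacking vec. A quick numerical check of that identity (the dimensions and random values are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
phi = rng.standard_normal(4)          # activation vector phi_i(x)
N_hat = rng.standard_normal((4, 2))   # action-network weight matrix
r = rng.standard_normal(2)            # stands in for R_ii * sum_j (u_j - u_hat_j)

lhs = phi @ N_hat @ r                              # phi^T N r
rhs = np.kron(r, phi) @ N_hat.flatten(order="F")   # (r^T kron phi^T) vec(N)
```

Column-major flattening (`order="F"`) matches the column-stacking vec operator, which is what makes the unknown weight appear linearly in (12.46).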

To obtain the update rule for the weights of the critic and action networks, the optimization objective is defined as

    E_i^{[k]} = \frac{1}{2} e_i^{[k]T} e_i^{[k]}.  (12.49)
Thus, using the gradient descent algorithm, one can get

    \dot{\hat{W}}_i^{[k]} = -\eta_i^{[k]} I_{w,i}^{[k]T} \big( I_{w,i}^{[k]} \hat{W}_i^{[k]} + I_{xu,i}^{[k]} \big),  (12.50)

where ηi[k] is a positive number.
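The update (12.50) is a gradient flow on E_i^{[k]}; discretized with a forward-Euler step, it drives \hat{W} toward the weight that zeroes the residual e_i^{[k]}. A minimal sketch with fixed, invertible illustrative data matrices (in the algorithm, I_{w,i}^{[k]} and I_{xu,i}^{[k]} are assembled from measured trajectory data):

```python
import numpy as np

# Illustrative data matrices standing in for I_w,i^{[k]} and I_xu,i^{[k]}.
I_w = np.array([[1.0, 0.3],
                [0.0, 2.0]])
I_xu = np.array([0.5, -1.0])
eta = 0.05                            # learning rate eta_i^{[k]}

W = np.zeros(2)                       # stacked weights [M_hat; vec(N_hat)]
for _ in range(2000):                 # forward-Euler steps of (12.50)
    W = W - eta * I_w.T @ (I_w @ W + I_xu)

# For invertible I_w the stationary point satisfies I_w W + I_xu = 0.
W_star = -np.linalg.solve(I_w, I_xu)
```

The step size must satisfy eta * lambda_max(I_w^T I_w) < 2 for the discrete update to converge, which the values above do.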


Remark 12.3 According to the gradient descent algorithm, the optimal weight W_i^{[k]}, which minimizes (12.49), can be obtained by (12.50). Therefore, the weights of the critic and action networks are solved simultaneously. This means that only one equation is necessary, instead of (12.10) and (12.11) in Algorithm 5, to obtain the optimal solution of the multi-player NZS games.

12.4.3 Stability Analysis

In Theorems 12.1–12.2, the convergence of the iterative cost function was proven. In the following theorems, one first proves that the weight estimates of the critic and action networks converge to the ideal ones within a small bound, based on the off-policy IRL. Then one proves the asymptotic stability of the closed-loop system and the existence of a Nash equilibrium.
Theorem 12.3 Let the update method for critic and action networks be as in (12.50).
Define the weight estimation error as W̃i[k] = Wi[k] − Ŵi[k] , ∀ k, i. Then W̃i[k] is UUB.
Proof Select the Lyapunov function candidate as follows:

    L_i^{[k]} = \frac{l}{2\eta_i^{[k]}} \tilde{W}_i^{[k]T} \tilde{W}_i^{[k]},  (12.51)

where l > 0.
According to (12.50), one has

    \dot{\tilde{W}}_i^{[k]} = \eta_i^{[k]} I_{w,i}^{[k]T} \big( I_{w,i}^{[k]} ( W_i^{[k]} - \tilde{W}_i^{[k]} ) + I_{xu,i}^{[k]} \big)
      = -\eta_i^{[k]} I_{w,i}^{[k]T} I_{w,i}^{[k]} \tilde{W}_i^{[k]} + \eta_i^{[k]} I_{w,i}^{[k]T} \big( I_{w,i}^{[k]} W_i^{[k]} + I_{xu,i}^{[k]} \big).  (12.52)

Thus, the gradient of (12.51) is

L̇ i[k] = − l W̃i[k]T Iw,i


[k]T [k]
Iw,i W̃i[k] + l W̃i[k]T Iw,i
[k]T [k]
(Iw,i Wi[k] + Ixu,i
[k]
)
≤ − l||W̃i[k] ||2 ||Iw,i
[k] 2
|| + l W̃i[k]T Iw,i
[k]T [k]
Iw,i Wi[k] + l W̃i[k]T Iw,i
[k]T [k]
Ixu,i
1
≤ − l||W̃i[k] ||2 ||Iw,i
[k] 2
|| + ||W̃i[k] ||2 ||Iw,i
[k] 2
||
2
l2 1 l 2 [k] 2
+ ||Wi[k] ||2 ||Iw,i [k] 2
|| + ||W̃i[k] ||2 ||Iw,i
[k] 2
|| + ||Ixu,i || . (12.53)
2 2 2
That is,

    \dot{L}_i^{[k]} \le -(l-1) \| \tilde{W}_i^{[k]} \|^2 \| I_{w,i}^{[k]} \|^2 + \frac{l^2}{2} \| W_i^{[k]} \|^2 \| I_{w,i}^{[k]} \|^2 + \frac{l^2}{2} \| I_{xu,i}^{[k]} \|^2.  (12.54)

Let C_i^{[k]} = \frac{l^2}{2} \| W_i^{[k]} \|^2 \| I_{w,i}^{[k]} \|^2 + \frac{l^2}{2} \| I_{xu,i}^{[k]} \|^2; then (12.54) is

    \dot{L}_i^{[k]} \le -(l-1) \| \tilde{W}_i^{[k]} \|^2 \| I_{w,i}^{[k]} \|^2 + C_i^{[k]}.  (12.55)

Therefore, if l satisfies

    l > 1,  (12.56)

and

    \| \tilde{W}_i^{[k]} \|^2 > \frac{C_i^{[k]}}{(l-1) \| I_{w,i}^{[k]} \|^2},  (12.57)

then \tilde{W}_i^{[k]} is UUB.

Based on Theorem 12.3, one can assume that \hat{V}_i^{[k]}(x) \to V_i^{[k]}(x) and \hat{u}_i^{[k]}(x) \to u_i^{[k]}(x). Then one can obtain the following theorem, which further proves the asymptotic stability of the closed-loop system.

Theorem 12.4 Let the control be given as in (12.12); then the closed-loop system is asymptotically stable.

Proof Define the Lyapunov function candidate as

    V_i^{[k]}(x(t)) = \int_t^{\infty} \Big( Q_i(x) + \sum_{j=1}^{N} u_j^{[k]T} R_{ij} u_j^{[k]} \Big) d\tau, \quad \forall k.  (12.58)

Taking the time derivative, one obtains

    \dot{V}_i^{[k]}(x(t)) = -Q_i(x) - \sum_{j=1}^{N} u_j^{[k]T} R_{ij} u_j^{[k]} < 0.  (12.59)

Therefore, V_i^{[k]}(x(t)), \forall k, is a Lyapunov function, and the closed-loop system is asymptotically stable.
 
From Theorem 12.4, it is clear that, \forall k, the iterative controls \{u_i^{[k]}, i = 1, 2, \ldots, N\} make the closed-loop system asymptotically stable. Now, we are ready to give the following theorem, which demonstrates that \{u_i^*, i = 1, 2, \ldots, N\} is in Nash equilibrium.

Theorem 12.5 For fixed stabilizing controls \{u_i, i = 1, 2, \ldots, N\}, define the Hamiltonian function as in (12.6), and let \{u_i^*, i = 1, 2, \ldots, N\} be defined as in (12.7). Then \{u_i^*, i = 1, 2, \ldots, N\} is in Nash equilibrium.

Proof As V_i(x(\infty)) = 0, from (12.2) one has

    J_i(x(0), u_i, u_{-i}) = \int_0^{\infty} \Big( Q_i(x) + \sum_{j=1}^{N} u_j^T R_{ij} u_j \Big) dt + \int_0^{\infty} \dot{V}_i \, dt + V_i(x(0)).  (12.60)

According to (12.6), (12.60) can be written as

    J_i(x(0), u_i, u_{-i}) = \int_0^{\infty} H_i(x, \nabla V_i, u_i, u_{-i}) \, dt + V_i(x(0)).  (12.61)

As u_i^* = u_i(V_i(x)), which is given by (12.7), then

    H_i(x, \nabla V_i, u_i^*, u_{-i}^*) = Q_i(x) + \sum_{j=1}^{N} u_j^{*T} R_{ij} u_j^* + \nabla V_i^T \Big( f(x) + \sum_{j=1}^{N} g(x) u_j^* \Big) = 0.  (12.62)

According to (12.6), one has

    H_i(x, \nabla V_i, u_i, u_{-i}) = H_i(x, \nabla V_i, u_i^*, u_{-i}^*) + \sum_{j=1}^{N} (u_j - u_j^*)^T R_{ij} (u_j - u_j^*)
        + \sum_{j=1}^{N} \nabla V_i^T g(x)(u_j - u_j^*) + 2 \sum_{j=1}^{N} u_j^{*T} R_{ij} (u_j - u_j^*).  (12.63)

Therefore, (12.61) is expressed as

    J_i(x(0), u_i, u_{-i}) = V_i(x(0)) + \int_0^{\infty} \sum_{j=1}^{N} (u_j - u_j^*)^T R_{ij} (u_j - u_j^*) \, dt
        + \int_0^{\infty} \sum_{j=1}^{N} \nabla V_i^T g(x)(u_j - u_j^*) \, dt + 2 \int_0^{\infty} \sum_{j=1}^{N} u_j^{*T} R_{ij} (u_j - u_j^*) \, dt.  (12.64)

Furthermore, (12.64) can be written as

    J_i(x(0), u_i, u_{-i}) = V_i(x(0)) + \int_0^{\infty} \sum_{j=1, j \ne i}^{N} \nabla V_i^T g(x)(u_j - u_j^*) \, dt
        + \int_0^{\infty} \nabla V_i^T g(x)(u_i - u_i^*) \, dt
        + 2 \int_0^{\infty} \sum_{j=1, j \ne i}^{N} u_j^{*T} R_{ij} (u_j - u_j^*) \, dt + 2 \int_0^{\infty} u_i^{*T} R_{ii} (u_i - u_i^*) \, dt
        + \int_0^{\infty} \sum_{j=1, j \ne i}^{N} (u_j - u_j^*)^T R_{ij} (u_j - u_j^*) \, dt + \int_0^{\infty} (u_i - u_i^*)^T R_{ii} (u_i - u_i^*) \, dt.  (12.65)

Note that, at the equilibrium point, one has u_i = u_i^* and u_{-i} = u_{-i}^*. Thus

    J_i(x(0), u_i^*, u_{-i}^*) = V_i(x(0)).  (12.66)

From (12.65), one can get

    J_i(x(0), u_i, u_{-i}^*) = V_i(x(0)) + \int_0^{\infty} \nabla V_i^T g(x)(u_i - u_i^*) \, dt
        + 2 \int_0^{\infty} u_i^{*T} R_{ii} (u_i - u_i^*) \, dt + \int_0^{\infty} (u_i - u_i^*)^T R_{ii} (u_i - u_i^*) \, dt.  (12.67)

Note that u_i^* = u_i(V_i(x)); then \nabla V_i^T g(x) = -2u_i^{*T} R_{ii}. Therefore, Eq. (12.67) becomes

    J_i(x(0), u_i, u_{-i}^*) = V_i(x(0)) + \int_0^{\infty} (u_i - u_i^*)^T R_{ii} (u_i - u_i^*) \, dt.  (12.68)

Then clearly J_i(x(0), u_i^*, u_{-i}^*) in (12.66) and J_i(x(0), u_i, u_{-i}^*) in (12.68) satisfy (12.4). This means \{u_i^*, i = 1, 2, \ldots, N\} is in Nash equilibrium.

Based on the above analyses, it is clear that off-policy IRL obtains \hat{V}_i^{[k]}(x) and \hat{u}_i^{[k]}(x) simultaneously. With the weight update method (12.50), \hat{V}_i^{[k]}(x) \to V_i^{[k]}(x) and \hat{u}_i^{[k]}(x) \to u_i^{[k]}(x). It has been proven that u_i^{[k]}(x) makes the closed-loop system asymptotically stable and that u_i^{[k]}(x) \to u_i^*(x) as k \to \infty, and the final theorem demonstrates that \{u_i^*, i = 1, 2, \ldots, N\} is in Nash equilibrium. Therefore, one is now ready to give the following computational algorithm for practical online implementation.

Algorithm 7 Computational Algorithm for Off-Policy IRL

1: Let the iteration index k = 0, and start with stabilizing initial policies u_1^{[0]}, u_2^{[0]}, \ldots, u_N^{[0]}.
2: For k = 0, 1, \ldots, solve for \hat{W}_i^{[k]} by

    \dot{\hat{W}}_i^{[k]} = -\eta_i^{[k]} I_{w,i}^{[k]T} \big( I_{w,i}^{[k]} \hat{W}_i^{[k]} + I_{xu,i}^{[k]} \big).  (12.69)

3: Update the N-tuple of control policies \hat{u}_i^{[k+1]} and the value functions \hat{V}_i^{[k]}(x).
4: Let k \leftarrow k + 1, and repeat Step 2 until

    \| \hat{V}_i^{[k+1]} - \hat{V}_i^{[k]} \| \le \varepsilon,  (12.70)

where \varepsilon > 0 is a predefined constant threshold.
5: Stop.
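The outer loop of Algorithm 7 is an ordinary successive-approximation loop whose exit test is (12.70). A skeleton of that loop, shown on a toy scalar update V \leftarrow 0.5V + 1 (an illustrative stand-in for solving (12.69); its fixed point is 2):

```python
def iterate_until_converged(update, V0, eps=1e-8, k_max=1000):
    """Repeat V <- update(V) until |V_{k+1} - V_k| <= eps, as in (12.70)."""
    V = V0
    for k in range(k_max):
        V_next = update(V)
        if abs(V_next - V) <= eps:
            return V_next, k + 1
        V = V_next
    return V, k_max

V_final, iters = iterate_until_converged(lambda V: 0.5 * V + 1.0, 0.0)
```

The cap `k_max` guards against a threshold \varepsilon that the iteration cannot reach in finite time.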

12.5 Simulation Study

Here we present simulations of linear and nonlinear systems to show that the games can be solved by the off-policy IRL method of this chapter.

Example 12.1 Consider the following linear system with modification [18, 19]:

    \dot{x} = 2x + 3u_1 + 3u_2.  (12.71)

Define

    J_1 = \int_0^{\infty} (9x^2 + 0.64u_1^2 + u_2^2) \, dt,  (12.72)

and

    J_2 = \int_0^{\infty} (3x^2 + u_1^2 + u_2^2) \, dt.  (12.73)

Let the initial state be x(0) = 0.5. For each player, the structures of the critic and action networks are both 1–8–1. The initial weights are selected in (−0.5, 0.5). Let Q_1 = 9, Q_2 = 3, R_{11} = 0.64, and R_{12} = R_{21} = R_{22} = 1. The activation functions \varphi_i and \phi_i are hyperbolic functions, and \eta_i = 0.01. After 100 time steps, the state and control trajectories are shown in Figs. 12.1 and 12.2. As shown by the above analyses, the iterative value functions are monotonically decreasing, as given in Figs. 12.3 and 12.4.
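For this scalar linear game, the Nash solution can also be cross-checked analytically. Assuming quadratic value functions V_i = P_i x^2 and the standard feedback-Nash conditions of the linear-quadratic NZS game (a textbook construction, not the chapter's neural implementation), each u_i^* = -(B P_i / R_{ii}) x and the P_i solve two coupled scalar Riccati equations, which a short Newton iteration can solve:

```python
import numpy as np

# Example 12.1 data: x_dot = A x + B u1 + B u2.
A, B = 2.0, 3.0
Q = [9.0, 3.0]
R = [[0.64, 1.0],
     [1.0, 1.0]]          # R[i][j] corresponds to R_{(i+1)(j+1)}

def residuals(P):
    """Coupled scalar Riccati residuals: with K_i = B P_i / R_ii,
    0 = Q_i + R_i1 K_1^2 + R_i2 K_2^2 + 2 P_i (A - B K_1 - B K_2)."""
    K = [B * P[i] / R[i][i] for i in range(2)]
    ac = A - B * (K[0] + K[1])      # closed-loop dynamics
    return np.array([Q[i] + R[i][0] * K[0] ** 2 + R[i][1] * K[1] ** 2
                     + 2.0 * P[i] * ac for i in range(2)])

# Newton iteration with a finite-difference Jacobian.
P = np.array([1.0, 0.5])
for _ in range(30):
    F = residuals(P)
    J = np.empty((2, 2))
    for j in range(2):
        step = np.zeros(2)
        step[j] = 1e-7
        J[:, j] = (residuals(P + step) - F) / 1e-7
    P = P - np.linalg.solve(J, F)
```

From this starting guess, the iteration lands on a root with positive P_i and a stable closed loop, i.e., one stabilizing Nash solution of the game.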

Example 12.2 Consider the following nonlinear system in [20] with modification:

    \dot{x} = f(x) + g u_1 + g u_2 + g u_3,  (12.74)

Fig. 12.1 The trajectory of state (plot of x versus time steps)

Fig. 12.2 The trajectories of control (plot of u_1, u_2 versus time steps)



Fig. 12.3 The trajectory of V_1 (plot of V_1 versus iteration steps)

Fig. 12.4 The trajectory of V_2 (plot of V_2 versus iteration steps)



Fig. 12.5 The trajectory of state (plot of x_1, x_2 versus time steps)


where f(x) = \begin{bmatrix} f_1(x) \\ f_2(x) \end{bmatrix}, f_1(x) = -2x_1 + x_2, f_2(x) = -0.5x_1 - x_2 + x_1^2 x_2 + 0.25x_2(\cos(2x_1) + 2)^2 + 0.25x_2(\sin(4x_1^2) + 2)^2, and g(x) = \begin{bmatrix} 0 \\ 2x_1 \end{bmatrix}. Define

    J_1(x) = \int_0^{\infty} \Big( \frac{1}{8}x_1^2 + \frac{1}{4}x_2^2 + u_1^2 + 0.9u_2^2 + u_3^2 \Big) dt,  (12.75)

    J_2(x) = \int_0^{\infty} \Big( \frac{1}{2}x_1^2 + x_2^2 + u_1^2 + u_2^2 + 5u_3^2 \Big) dt,  (12.76)

    J_3(x) = \int_0^{\infty} \Big( \frac{1}{4}x_1^2 + \frac{1}{2}x_2^2 + 3u_1^2 + 2u_2^2 + u_3^2 \Big) dt.  (12.77)

Let the initial state be x(0) = [1; -1]. For each player, the structures of the critic and action networks are both 2–8–1. The activation functions \varphi_i and \phi_i are hyperbolic functions. Let \eta_i = 0.05. After 1000 time steps, the state and control trajectories are shown in Figs. 12.5 and 12.6. Figures 12.7, 12.8 and 12.9 show the value function of each player, which is monotonically decreasing.

Fig. 12.6 The trajectories of control (plot of u_1, u_2, u_3 versus time steps)

Fig. 12.7 The trajectory of V_1 (plot of V_1 versus iteration steps)



Fig. 12.8 The trajectory of V_2 (plot of V_2 versus iteration steps)

Fig. 12.9 The trajectory of V_3 (plot of V_3 versus iteration steps)



12.6 Conclusion

This chapter establishes an off-policy IRL method for CT multi-player NZS games with unknown dynamics. Since the system dynamics are unknown, off-policy IRL is used to perform policy evaluation and policy improvement in the PI algorithm. The critic and action networks are used to obtain the performance index and control for each player. The convergence of the weights is proven, as are the asymptotic stability of the closed-loop system and the existence of a Nash equilibrium. A simulation study demonstrates the effectiveness of the proposed method.

Furthermore, it is noted that the condition required for the proof of Theorem 12.2 is restrictive, and in future work we will concentrate on relaxing it. We also notice that the unknown system functions f and g are the same for each player in this chapter. In our future work, we will discuss the off-policy IRL method with player-dependent unknown functions f_i or g_i, which will make the research more extensive and deeper.

References

1. Vamvoudakis, K., Lewis, F.: Multi-player non-zero-sum games: online adaptive learning solu-
tion of coupled Hamilton-Jacobi equations. Automatica 47(8), 1556–1569 (2011)
2. Zhang, H., Wei, Q., Luo, Y.: A novel infinite-time optimal tracking control scheme for a class
of discrete-time nonlinear systems via the greedy HDP iteration algorithm. IEEE Trans. Syst.
Man Cybern. Part B-Cybern. 38(4), 937–942 (2008)
3. Wei, Q., Wang, F., Liu, D., Yang, X.: Finite-approximation-error based discrete-time iterative
adaptive dynamic programming. IEEE Trans. Cybern. 44(12), 2820–2833 (2014)
4. Wei, Q., Liu, D.: A novel iterative θ-adaptive dynamic programming for discrete-time nonlinear systems. IEEE Trans. Automat. Sci. Eng. 11(4), 1176–1190 (2014)
5. Song, R., Xiao, W., Zhang, H., Sun, C.: Adaptive dynamic programming for a class of complex-
valued nonlinear systems. IEEE Trans. Neural Netw. Learn. Syst. 25(9), 1733–1739 (2014)
6. Song, R., Lewis, F., Wei, Q., Zhang, H., Jiang, Z., Levine, D.: Multiple actor-critic structures for continuous-time optimal control using input-output data. IEEE Trans. Neural Netw. Learn. Syst. 26(4), 851–865 (2015)
7. Modares, H., Lewis, F., Naghibi-Sistani, M.B.: Adaptive optimal control of unknown
constrained-input systems using policy iteration and neural networks. IEEE Trans. Neural
Netw. Learn. Syst. 24(10), 1513–1525 (2013)
8. Modares, H., Lewis, F.: Optimal tracking control of nonlinear partially-unknown constrained-
input systems using integral reinforcement learning. Automatica 50(7), 1780–1792 (2014)
9. Modares, H., Lewis, F., Naghibi-Sistani, M.: Integral reinforcement learning and experience
replay for adaptive optimal control of partially-unknown constrained-input continuous-time
systems. Automatica 50, 193–202 (2014)
10. Kiumarsi, B., Lewis, F., Naghibi-Sistani, M., Karimpour, A.: Approximate dynamic program-
ming for optimal tracking control of unknown linear systems using measured data. IEEE Trans.
Cybern. 45(12), 2770–2779 (2015)
11. Jiang, Y., Jiang, Z.: Computational adaptive optimal control for continuous-time linear systems
with completely unknown dynamics. Automatica 48(10), 2699–2704 (2012)
12. Luo, B., Wu, H., Huang, T.: Off-policy reinforcement learning for H∞ control design. IEEE Trans. Cybern. 45(1), 65–76 (2015)

13. Song, R., Lewis, F., Wei, Q., Zhang, H.: Off-policy actor-critic structure for optimal control of
unknown systems with disturbances. IEEE Trans. Cybern. 46(5), 1041–1050 (2016)
14. Lewis, F., Vrabie, D., Syrmos, V.L.: Optimal Control, 3rd edn. Wiley, Hoboken (2012)
15. Vamvoudakis, K., Lewis, F., Hudas, G.: Multi-agent differential graphical games: online adap-
tive learning solution for synchronization with optimality. Automatica 48(8), 1598–1611 (2012)
16. Abu-Khalaf, M., Lewis, F.: Nearly optimal control laws for nonlinear systems with saturating actuators using a neural network HJB approach. Automatica 41(5), 779–791 (2005)
17. Leake, R., Liu, R.: Construction of suboptimal control sequences. SIAM J. Control 5(1), 54–63
(1967)
18. Jungers, M., De Pieri, E., Abou-Kandil, H.: Solving coupled algebraic Riccati equations from
closed-loop Nash strategy, by lack of trust approach. Int. J. Tomogr. Stat. 7(F07), 49–54 (2007)
19. Limebeer, D., Anderson, B., Hendel, H.: A Nash game approach to mixed H2/H∞ control. IEEE Trans. Autom. Control 39(1), 69–82 (1994)
20. Liu, D., Li, H., Wang, D.: Online synchronous approximate optimal learning algorithm for
multiplayer nonzero-sum games with unknown dynamics. IEEE Trans. Syst. Man Cybern.:
Syst. 44(8), 1015–1027 (2014)
Chapter 13
Optimal Distributed Synchronization
Control for Heterogeneous Multi-agent
Graphical Games

In this chapter, a new optimal coordination control for the consensus problem of het-
erogeneous multi-agent differential graphical games by iterative ADP is developed.
The main idea is to use iterative ADP technique to obtain the iterative control law
which makes all the agents track a given dynamics and simultaneously makes the
iterative performance index function reach the Nash equilibrium. In the developed
heterogeneous multi-agent differential graphical games, the agent of each node is
different from the one of other nodes. The dynamics and performance index function
for each node depend only on local neighbor information. A cooperative policy iter-
ation algorithm for graphical differential games is developed to achieve the optimal
control law for the agent of each node, where the coupled Hamilton–Jacobi (HJ)
equations for optimal coordination control of heterogeneous multi-agent differential
games can be avoided. Convergence analysis is developed to show that the perfor-
mance index functions of heterogeneous multi-agent differential graphical games
can converge to the Nash equilibrium. Simulation results will show the effectiveness
of the developed optimal control scheme.

13.1 Introduction

A large class of real systems are controlled by more than one controller or decision
maker with each using an individual strategy. These controllers often operate in a
group with coupled performance index functions as a game [1]. Stimulated by a vast
number of applications, including those in economics, management, communication
networks, power networks, and in the design of complex engineering systems, game
theory [2] has been very successful in modeling strategic behavior, where the outcome
for each player depends on the actions of himself and all the other players.
© Science Press, Beijing and Springer Nature Singapore Pte Ltd. 2019
R. Song et al., Adaptive Dynamic Programming: Single and Multiple Controllers, Studies in Systems, Decision and Control 166, https://doi.org/10.1007/978-981-13-1712-5_13

For previous policy iteration ADP algorithms of multi-player games, it is always desired that the system states of each agent converge to the equilibrium of the system. In many real-world games, however, the states of each agent are required to track a desired dynamics, i.e., synchronization control. Synchronization behavior in multi-agent optimal control by ADP, whose pioneering work was proposed by Vamvoudakis et al. [3], was developed for graphical differential games, where the Nash equilibrium of the game was achieved by a policy iteration algorithm. In this chapter, inspired by [3], the
optimal cooperative control for the heterogeneous multi-agent graphical differential
games is investigated. We emphasize that in the developed heterogeneous graphical
differential game, the autonomous agent of each node and the desired dynamics can
be different from each other. By a system transformation, the optimal synchronization control problem is transformed into an optimal multi-agent cooperative regulation problem. The corresponding performance index function of the graphical differential game is also transformed, so that the optimal control law for each agent can be expressed. The main contribution of this chapter is to develop
an effective policy iteration of ADP to obtain the optimal cooperative control law
for heterogeneous multi-agent graphical differential games. Convergence properties
of the iterative ADP algorithm are also developed, which guarantee that the perfor-
mance index function of the agent for each node converges to the Nash equilibrium
of the games. A simulation example will be presented to show the effectiveness of
the developed algorithm.
This chapter is organized as follows. In Sect. 13.2, graphs and synchronization
of heterogeneous multi-agent dynamic systems are presented. In Sect. 13.3, optimal
distributed cooperative control for heterogeneous multi-agent differential graphical
games will be presented. The expressions of the optimal control law will be devel-
oped in this section. In Sect. 13.4, heterogeneous multi-agent differential graphical
games by iterative ADP algorithm will be developed. The properties of the itera-
tive performance index functions will be analyzed. In Sect. 13.5, simulation results
are given to demonstrate the effectiveness of the developed algorithm. Finally, in
Sect. 13.6, the conclusions will be shown.

13.2 Graphs and Synchronization of Multi-agent Systems

In this section, a background review of communication graphs is given and the


problem of synchronization of heterogeneous multi-agent systems subject to external
disturbances is formulated.

13.2.1 Graph Theory

Let G = (V, E, A) be a weighted graph of N nodes with the nonempty finite set of
nodes V = {v1 , v2 , . . . , v N }, where the set of edges E belongs to the product space of
V (i.e., E ⊆ V × V), an edge of G is denoted by ρi j = (v j , vi ), which is a direct path
from node j to node i, and A = [ρi j ] is a weighted adjacency matrix with nonnegative

adjacency elements, i.e., \rho_{ij} \ge 0, with \rho_{ij} \in E \Leftrightarrow \rho_{ij} > 0 and \rho_{ij} = 0 otherwise. The node index i belongs to a finite index set N = \{1, 2, \ldots, N\}.

Definition 13.1 (Laplacian Matrix) The graph Laplacian matrix L = [l_{ij}] is defined as D - A, with D = \mathrm{diag}\{d_i\} \in R^{N \times N} being the in-degree matrix of the graph, where d_i = \sum_{j=1}^{N} \rho_{ij} is the in-degree of node v_i in the graph.
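A minimal sketch of Definition 13.1 for a 3-node digraph (the weights \rho_{ij} are arbitrary illustrative values); by construction every row of L = D - A sums to zero:

```python
import numpy as np

# Weighted adjacency matrix Adj = [rho_ij] of a 3-node digraph.
Adj = np.array([[0.0, 1.0, 0.0],
                [0.5, 0.0, 1.0],
                [0.0, 2.0, 0.0]])
D = np.diag(Adj.sum(axis=1))   # in-degree matrix, d_i = sum_j rho_ij
L = D - Adj                    # graph Laplacian
```

The zero row sums (L·1 = 0) are what make the consensus error vanish when all agents agree.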
In this chapter, we assume that the graph is simple, e.g., no repeated edges and no
self loops. The set of neighbors of node vi is denoted by Ni = {v j ∈ V; (v j , vi ) ∈ E}.
A graph is referred to as a spanning tree, if there is a node vi (called the root), such
that there is a directed path from the root to any other nodes in the graph. A graph is
said to be strongly connected, if there is a directed path from node i to node j, for all
distinct nodes vi , v j ∈ V. A digraph has a spanning tree if it is strongly connected,
but not vice versa.

13.2.2 Synchronization and Tracking Error Dynamic Systems

Consider a heterogeneous multi-agent system with N agents in the form of com-


munication network G. For an arbitrary node i, i = 1, 2, . . . , N , the heterogeneous
agent is expressed as

ẋi = Ai xi + Bi u i , (13.1)

where xi (t) ∈ Rn is the state of node vi , and u i (t) ∈ Rm is the input coordination
control. Let Ai and Bi for ∀ i ∈ N be constant matrices. In this chapter, we assume that
the control gain matrix Bi satisfies rank{Bi } ≥ n for the convenience of our analysis.
For ∀ i ∈ N, assume that the state xi of the agent for each node is controllable. The
local neighborhood tracking error for ∀ i ∈ N is defined as

δi = ρi j (xi − x j ) + σi (xi − x0 ), (13.2)
j∈Ni

where σi ≥ 0 is the pinning gain. Note that σi > 0 for at least one i. We let σi = 0
if and only if there is not a direct path from the control node to the ith node in G.
Otherwise we have σi > 0.
The target node or state is x0 (t) ∈ Rn , which satisfies the dynamics

ẋ0 = A0 x0 . (13.3)

The synchronization control design problem is to design local control protocols for
all the nodes in G to synchronize to the state of the control node, i.e., for ∀ i ∈ N, we
require xi (t) → x0 (t).

According to (13.1), the global network dynamics can be expressed as

ẋ = Ax + Bu, (13.4)

where the global state vector of the multi-agent system (13.4) is x = [x_1^T, x_2^T, \ldots, x_N^T]^T \in R^{Nn} and the global control input is u = [u_1^T, u_2^T, \ldots, u_N^T]^T \in R^{Nm}. Let A = \mathrm{diag}\{A_i\} and B = \mathrm{diag}\{B_i\}, i = 1, 2, \ldots, N. According to (13.2) and (13.4), the global error vector for the network G is

    \delta = (L \otimes I_n)x + (G \otimes I_n)(x - \bar{x}_0) = ((L + G) \otimes I_n)(x - \bar{x}_0) = \mathcal{L}(x - \bar{x}_0),  (13.5)

where \mathcal{L} = (L + G) \otimes I_n, I_n is the identity matrix of dimension n, L is the Laplacian matrix, \otimes denotes the Kronecker product operator, and G = \mathrm{diag}\{\sigma_i\} \in R^{N \times N} is a diagonal matrix with diagonal entries \sigma_i.
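The equivalence between the stacked form (13.5) and the local definition (13.2) can be checked numerically on a small example (a 2-node graph with n = 2 and illustrative weights):

```python
import numpy as np

Adj = np.array([[0.0, 1.0],
                [0.5, 0.0]])             # rho_ij
sigma = np.array([1.0, 0.0])             # pinning gains sigma_i
x = [np.array([1.0, -2.0]), np.array([0.5, 0.3])]
x0 = np.array([0.2, 0.1])                # target state

# Local errors (13.2), stacked
delta_local = np.concatenate([
    sum(Adj[i, j] * (x[i] - x[j]) for j in range(2)) + sigma[i] * (x[i] - x0)
    for i in range(2)])

# Global form (13.5): delta = ((L + G) kron I_n)(x - x0_bar)
L = np.diag(Adj.sum(axis=1)) - Adj
G = np.diag(sigma)
delta_global = np.kron(L + G, np.eye(2)) @ (np.concatenate(x) - np.tile(x0, 2))
```

The Kronecker product with I_n simply applies the same graph couplings to every state component.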
In [3], for ∀ i ∈ N, the system matrices satisfy A_i = A_0. This makes the control u_i ≡ 0 once x_i reaches the target node x_0. In this chapter, according to (13.1) and (13.3), for ∀ i ∈ N, when x_i = x_0 the control u_i cannot be zero. Thus, we should first solve for the desired control at the target state. Let u_{di} be the desired control that satisfies the following equation:

    \dot{x}_0 = A_i x_0 + B_i u_{di}.  (13.6)

According to (13.3), we have

    u_{di} = B_i^+ (A_0 - A_i) x_0,  (13.7)

where B_i^+ is the Moore–Penrose pseudo-inverse of B_i. Let x_{ei} = x_i - x_0 and u_{ei} = u_i - u_{di}. Then, according to (13.7), agent (13.1) can be rewritten as

    \dot{x}_{ei} = A_i x_{ei} + B_i u_{ei} + (A_i - A_0)x_0 + B_i u_{di} = A_i x_{ei} + B_i u_{ei}.  (13.8)
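The cancellation in (13.8) can be verified numerically: with rank(B_i) \ge n, the feedforward term (13.7) exactly absorbs the model mismatch (A_i - A_0)x_0. All matrices below are illustrative choices (n = 2, m = 3, B_i of full row rank):

```python
import numpy as np

A0 = np.array([[0.0, 1.0],
               [-1.0, 0.0]])             # leader dynamics A_0
Ai = np.array([[1.0, 0.0],
               [0.5, -1.0]])             # agent dynamics A_i
Bi = np.array([[1.0, 0.0, 1.0],
               [0.0, 1.0, 1.0]])         # rank(B_i) = 2 = n
x0 = np.array([0.7, -0.4])

u_di = np.linalg.pinv(Bi) @ (A0 - Ai) @ x0   # desired control (13.7)
residual = (Ai - A0) @ x0 + Bi @ u_di        # mismatch left over in (13.8)
```

Full row rank of B_i makes B_i B_i^+ = I_n, which is exactly why the residual vanishes.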

According to (13.2), the tracking error vector for the network G can be expressed as

    \dot{\delta}_i = \sum_{j \in N_i} \rho_{ij}(\dot{x}_i - \dot{x}_j) + \sigma_i(\dot{x}_i - \dot{x}_0)
      = \sum_{j \in N_i} \rho_{ij}(\dot{x}_{ei} - \dot{x}_{ej}) + \sigma_i \dot{x}_{ei}
      = \sum_{j \in N_i} \rho_{ij}(A_i x_{ei} + B_i u_{ei} - A_j x_{ej} - B_j u_{ej}) + \sigma_i(A_i x_{ei} + B_i u_{ei})
      = \sum_{j \in N_i} \rho_{ij} A_i (x_{ei} - x_{ej}) + \sum_{j \in N_i} \rho_{ij}(A_i - A_j)x_{ej} + \sum_{j \in N_i} \rho_{ij}(B_i u_{ei} - B_j u_{ej}) + \sigma_i A_i x_{ei} + \sigma_i B_i u_{ei}
      = A_i \delta_i + \sum_{j \in N_i} \rho_{ij}(A_i - A_j)x_{ej} + (d_i + \sigma_i)B_i u_{ei} - \sum_{j \in N_i} \rho_{ij} B_j u_{ej}.  (13.9)

From (13.9), we can see that the tracking error dynamics are a function of \delta_i and x_{ej}. This means that the distributed control u_i should be designed using the information of \delta_i and x_{ej}, so a system transformation is necessary.

Let j_1^i, j_2^i, \ldots, j_{N_i}^i be the nodes in N_i. Define new state and control vectors as \tilde{x}_{ei} = [x_{ej_1^i}^T, x_{ej_2^i}^T, \ldots, x_{ej_{N_i}^i}^T]^T and \tilde{u}_{ei} = [u_{ej_1^i}^T, u_{ej_2^i}^T, \ldots, u_{ej_{N_i}^i}^T]^T. If we let z_i = [\delta_i^T, \tilde{x}_{ei}^T]^T and \bar{u}_{ei} = [u_{ei}^T, \tilde{u}_{ei}^T]^T, then we can obtain the following expression:

    \dot{z}_i = \bar{A}_i z_i + \bar{B}_i \bar{u}_{ei},  (13.10)

where \bar{A}_i and \bar{B}_i are expressed as

    \bar{A}_i = \begin{bmatrix} A_i & \rho_{ij_1^i}(A_i - A_{j_1^i}) & \cdots & \rho_{ij_{N_i}^i}(A_i - A_{j_{N_i}^i}) \\ 0 & A_{j_1^i} & & 0 \\ \vdots & & \ddots & \\ 0 & 0 & & A_{j_{N_i}^i} \end{bmatrix},  (13.11)

and

    \bar{B}_i = \begin{bmatrix} (d_i + \sigma_i)B_i & -\rho_{ij_1^i}B_{j_1^i} & \cdots & -\rho_{ij_{N_i}^i}B_{j_{N_i}^i} \\ 0 & B_{j_1^i} & & 0 \\ \vdots & & \ddots & \\ 0 & 0 & & B_{j_{N_i}^i} \end{bmatrix},  (13.12)

respectively. We can see that system (13.10) is a multi-input system. The control inputs are u_{ei} and all the u_{ej} of its neighbors, so the controls for agent i are coupled with those of its neighbors. This makes the controller design very difficult. In the next section, an optimal distributed control based on an iterative ADP algorithm will be developed that makes all agents reach the target state.

13.3 Optimal Distributed Cooperative Control for Multi-agent Differential Graphical Games

In this section, our goal is to design an optimal distributed control that reaches consensus while simultaneously minimizing the local performance index function for system (13.10). Define u_{-ei} as the vector of the controls of the neighbors of node i, i.e., u_{-ei} \triangleq \{u_{ej} : j \in N_i\}.

13.3.1 Cooperative Performance Index Function

For ∀ i ∈ N, let Q_{ii}, R_{ii}, and R_{ij}, j ∈ N_i, all be constant symmetric matrices satisfying Q_{ii} \ge 0, R_{ii} > 0, and R_{ij} \ge 0. For ∀ i ∈ N, define the utility function as

    U_i(z_i, u_{ei}, u_{-ei}) = z_i^T \bar{Q}_{ii} z_i + u_{ei}^T R_{ii} u_{ei} + \sum_{j \in N_i} u_{ej}^T R_{ij} u_{ej}
      = \delta_i^T Q_{ii} \delta_i + u_{ei}^T R_{ii} u_{ei} + \sum_{j \in N_i} u_{ej}^T R_{ij} u_{ej},  (13.13)

where

    z_i^T \bar{Q}_{ii} z_i = \begin{bmatrix} \delta_i \\ x_{ej_1^i} \\ \vdots \\ x_{ej_{N_i}^i} \end{bmatrix}^T \begin{bmatrix} Q_{ii} & 0 & \cdots & 0 \\ 0 & 0 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 0 \end{bmatrix} \begin{bmatrix} \delta_i \\ x_{ej_1^i} \\ \vdots \\ x_{ej_{N_i}^i} \end{bmatrix} = \delta_i^T Q_{ii} \delta_i.  (13.14)

Then, we can define the local performance index functions as

    J_i(z_i(0), u_{ei}, u_{-ei}) = \int_0^{\infty} \Big( \delta_i^T Q_{ii} \delta_i + u_{ei}^T R_{ii} u_{ei} + \sum_{j \in N_i} u_{ej}^T R_{ij} u_{ej} \Big) dt
      = \int_0^{\infty} U_i(z_i, u_{ei}, u_{-ei}) \, dt.  (13.15)

From (13.15), we can see that the performance index function includes only information about the inputs of node i and its neighbors. The goal of this chapter is to design the local optimal distributed control that minimizes the local performance index functions (13.15) subject to (13.10) and makes all nodes (agents) reach consensus at the target state x_0.

Definition 13.2 (Admissible Coordination Control Law [3]) A control law u_{ei}, i ∈ N, is an admissible coordination control law if u_{ei} is continuous, u_{ei}(0) = 0, u_{ei} stabilizes agent (13.10) locally, and the performance index function (13.15) is finite.
If u_{ei} and u_{ej}, j ∈ N_i, are admissible control laws, then we can define the local performance index function V_i(z_i) as

    V_i(z_i(0)) = \int_0^{\infty} U_i(z_i, u_{ei}, u_{-ei}) \, dt.  (13.16)

For admissible control laws u_{ei} and u_{ej}, j ∈ N_i, the Hamiltonian H_i(z_i, \partial V_i / \partial z_i, u_{ei}, u_{-ei}) satisfies the following cooperative Hamilton–Jacobi (HJ) equation:

    H_i\Big(z_i, \frac{\partial V_i}{\partial z_i}, u_{ei}, u_{-ei}\Big) \equiv \frac{\partial V_i^T}{\partial z_i}\big( \bar{A}_i z_i + \bar{B}_i \bar{u}_{ei} \big) + \delta_i^T Q_{ii} \delta_i + u_{ei}^T R_{ii} u_{ei} + \sum_{j \in N_i} u_{ej}^T R_{ij} u_{ej} = 0  (13.17)

with boundary condition V_i(0) = 0. If V_i^*(z_i) is the optimal local performance index function, then V_i^*(z_i) satisfies the coupled HJ equations

    \min_{u_{ei}} H_i\Big(z_i, \frac{\partial V_i^*}{\partial z_i}, u_{ei}, u_{-ei}\Big) = 0,  (13.18)

and the local optimal control law u_{ei}^* satisfies

    u_{ei}^* = \arg\min_{u_{ei}} H_i\Big(z_i, \frac{\partial V_i^*}{\partial z_i}, u_{ei}, u_{-ei}\Big).  (13.19)

13.3.2 Nash Equilibrium

In this subsection, the optimality of the multi-agent systems will be developed. The
corresponding property will also be presented. According to [3], we introduce the
Nash equilibrium definition for multi-agent differential games.

Definition 13.3 (Global Nash equilibrium) A sequence of control laws \{u_{e1}^*, u_{e2}^*, \ldots, u_{eN}^*\} is said to be a Nash equilibrium solution of the N multi-agent game if, for all i ∈ N and any admissible u_{ei},

    J_i^* = J_i(u_{e1}^*, \ldots, u_{ei}^*, \ldots, u_{eN}^*) \le J_i(u_{e1}^*, \ldots, u_{ei}, \ldots, u_{eN}^*).  (13.20)

The N-tuple of local performance index functions \{J_1^*, J_2^*, \ldots, J_N^*\} is known as the Nash equilibrium of the N multi-agent game on G.

Theorem 13.1 Let J_i^* be the Nash equilibrium solution that satisfies (13.20). If u_{e1}^*, u_{e2}^*, \ldots, u_{eN}^* are optimal control laws, then for ∀i we have

    u_{ei}^* = \arg\min_{u_{ei}} H_i\Big(z_i, \frac{\partial J_i^*}{\partial z_i}, u_{ei}, u_{-ei}^*\Big).  (13.21)

Proof The conclusion can be proven by contradiction. Assume that the conclusion is false. Let \{u_{e1}^*, u_{e2}^*, \ldots, u_{eN}^*\} be optimal control laws. Then there must exist a u_{el}^* such that

    u_{el}^* \ne \arg\min_{u_{el}} H_l\Big(z_l, \frac{\partial J_l^*}{\partial z_l}, u_{el}, u_{-el}^*\Big).  (13.22)
 
As u ∗e1 , u ∗e2 , . . . , u ∗eN is the optimal control law and Jl∗ is the Nash equilibrium
solution, for l = 1, 2, . . . , N , we can get


∂ J∗ ∂ J∗
Hl zl , l , u ∗el , u ∗−el = l Āl zl + B̄l ū ∗el + δlT Q ll δl
∂zl ∂zl

+u ∗T ∗
el Rll u el + u ∗T ∗
ej Rl j u ej = 0. (13.23)
j∈Nl

If (13.23) holds, then we can let u oel satisfy




∂ Jl∗ ∗
u oel = arg min Hl zl , , u el , u −el . (13.24)
u el ∂zl

According to (13.24), we have





∂ J∗ ∂ J∗
Hl zl , l , u oel , u ∗−el = min Hl zl , l , u el , u ∗−el . (13.25)
∂zl u el ∂zl

Hence, we can obtain





∂ Jl∗ o ∗ ∂ Jl∗ ∗ ∗
Hl zl , ,u ,u ≤ Hl zl , ,u ,u
∂zl el −el ∂zl el −el
= 0. (13.26)

Find a performance index function Jlo (zl (0), u oel , u ∗−el ) that satisfies



∂ Jo ∂ Jo
Hl zl , l , u oel , u ∗−el = l Āl zl + B̄l ū oel + δlT Q ll δl
∂zl ∂zl

+u ∗T ∗
el Rll u el + u ∗T ∗
ej Rl j u ej = 0. (13.27)
j∈Nl

Then, we can obtain




˙
Jl = − δl Q ll δl + u el Rll u el +
o T oT o ∗T ∗
u ej Rl j u ej . (13.28)
j∈Nl

According to (13.27) and (13.28), we have





˙ ∂ Jl∗ o ∗

Jl = Hl zl , ,u ,u − δl Q ll δl + u el Rll u el +
T oT o
u ej Rl j u ej ≤ J˙lo .
∗T ∗
∂zl el −el j∈N l

(13.29)

Hence, Jl∗ ≥ Jlo for l = 1, 2, . . . N . It is contradiction. So the conclusion holds.


From Theorem 13.1, we can see that if, for $\forall i \in \mathbf{N}$, all the controls from its neighbors are optimal, i.e., $u_{ej} = u_{ej}^*$, $j \in N_i$, then the optimal distributed control $u_{ei}^*$ can be obtained by (13.21). The following results can be derived according to the properties of $u_{ei}^*$, $i \in \mathbf{N}$.

Lemma 1 Let $V_i^*(z_i) > 0$, $i \in \mathbf{N}$, be a solution to the HJ equation (13.18), and let the optimal distributed control policies $u_{ei}^*$, $i \in \mathbf{N}$, be given by (13.19) in terms of $V_i^*(z_i)$. Then:
(1) The local neighborhood consensus error (13.10) converges to zero.
(2) The local performance index functions $J_i^*(z_i(0), u_{ei}^*, u_{-ei}^*)$ are equal to $V_i^*(z_i(0))$, $i \in \mathbf{N}$, and $u_{ei}^*$ and $u_{ej}^*$, $j \in N_i$, are in Nash equilibrium.

Proof The proof can be found in [3]; the details are omitted here.

13.4 Heterogeneous Multi-agent Differential Graphical Games by Iterative ADP Algorithm

From the previous subsection, we know that once the optimal performance index function is obtained, the optimal control law can be obtained by solving the HJ equation (13.18). However, Eq. (13.18) is difficult or impossible to solve directly by dynamic programming. In this section, a policy iteration (PI) algorithm of adaptive dynamic programming is developed which solves the HJ equation (13.18) iteratively. Convergence properties of the developed algorithm will also be presented in this section.

13.4.1 Derivation of the Heterogeneous Multi-agent Differential Graphical Games

In the developed policy iteration algorithm, the performance index function and control law are updated by iterations, with the iteration index $k$ increasing from 0 to infinity. For $\forall i \in \mathbf{N}$, let $u_{ei}^{[0]}$ be an arbitrary admissible control law. Let $V_i^{[0]}$ be the iterative performance index function constructed by $u_{ei}^{[0]}$, which satisfies

$$
H_i\left(z_i, \frac{\partial V_i^{[0]}}{\partial z_i}, u_{ei}^{[0]}, u_{-ei}^{[0]}\right) = 0. \tag{13.30}
$$

According to (13.30), we can update the iterative control law $u_{ei}^{[1]}$ by

$$
u_{ei}^{[1]} = \arg\min_{u_{ei}} H_i\left(z_i, \frac{\partial V_i^{[0]}}{\partial z_i}, u_{ei}, u_{-ei}^{[0]}\right). \tag{13.31}
$$

For $\forall k = 1, 2, \ldots$, solve for $V_i^{[k]}$ that satisfies the following Hamiltonian

$$
H_i\left(z_i, \frac{\partial V_i^{[k]}}{\partial z_i}, u_{ei}^{[k]}, u_{-ei}^{[k]}\right) = \frac{\partial V_i^{[k]T}}{\partial z_i}\left(\bar{A}_i z_i + \bar{B}_i \bar{u}_{ei}^{[k]}\right) + \delta_i^T Q_{ii} \delta_i + u_{ei}^{[k]T} R_{ii} u_{ei}^{[k]} + \sum_{j \in N_i} u_{ej}^{[k]T} R_{ij} u_{ej}^{[k]} = 0, \tag{13.32}
$$

and update the iterative control law by

$$
u_{ei}^{[k+1]} = \arg\min_{u_{ei}} H_i\left(z_i, \frac{\partial V_i^{[k]}}{\partial z_i}, u_{ei}, u_{-ei}^{[k]}\right). \tag{13.33}
$$

From the multi-agent policy iteration (13.30)–(13.33), we can see that $V_i^{[k]}$ is used to approximate $J_i^*$ and $u_{ei}^{[k]}$ is used to approximate $u_{ei}^*$. Therefore, as $k \to \infty$, it is expected that the algorithm converges, i.e., that $V_i^{[k]}$ and $u_{ei}^{[k]}$ converge to the optimal ones. In the next subsection, we will show such properties of the developed policy iteration algorithm.
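The structure of the iteration (13.30)–(13.33) can be illustrated on a single scalar agent with no neighbors ($N_i$ empty), where policy evaluation reduces to a scalar Lyapunov equation and policy improvement to a gain update. The dynamics and weights below are illustrative assumptions, not data from this chapter.

```python
import math

# Scalar single-agent analogue of (13.30)-(13.33): dynamics dz/dt = a z + b u,
# cost integral of (q z^2 + r u^2). With V[k](z) = P[k] z^2 and u[k] = -K[k] z,
# policy evaluation (13.32) becomes 2(a - bK)P + q + rK^2 = 0, and policy
# improvement (13.33) becomes K[k+1] = b P[k] / r.
a, b, q, r = -1.0, 1.0, 1.0, 1.0   # illustrative values

K = 0.0                            # u[0] = 0 is admissible here since a < 0
P_hist = []
for k in range(20):
    # Policy evaluation: Hamiltonian = 0 along the closed-loop trajectory.
    P = -(q + r * K * K) / (2.0 * (a - b * K))
    P_hist.append(P)
    # Policy improvement: minimize 2Pz(az + bu) + qz^2 + ru^2 over u.
    K = b * P / r

# The iterates decrease monotonically to the Riccati solution of
# 2aP - b^2 P^2 / r + q = 0; for these numbers, P* = sqrt(2) - 1.
P_star = math.sqrt(2.0) - 1.0
assert all(P_hist[k + 1] <= P_hist[k] + 1e-12 for k in range(len(P_hist) - 1))
assert abs(P_hist[-1] - P_star) < 1e-9
```

The monotone nonincreasing value sequence seen here is exactly the behavior established for $V_i^{[k]}$ in the theorems that follow.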

13.4.2 Properties of the Developed Policy Iteration Algorithm

In this subsection, we will first present the convergence property of the multi-agent policy iteration algorithm. Initialized by admissible control laws, the monotonicity of the iterative performance index functions is discussed. In [3], the properties of the iterative performance index function are analyzed for the linear agent expressed by $\dot{x}_i = A x_i + B_i u_i$. In this chapter, inspired by [3], the properties of the iterative performance index function are analyzed for $\dot{x}_i = A_i x_i + B_i u_i$.
Lemma 2 (Solution for the best iterative control law) Given fixed neighbor policies $u_{-ei} \triangleq \{u_{ej} : j \in N_i\}$, the best iterative control law can be expressed by

$$
u_{ei}^{[k+1]} = -\frac{1}{2}(d_i + \sigma_i) R_{ii}^{-1} B_i^T \frac{\partial V_i^{[k]}}{\partial \delta_i}. \tag{13.34}
$$

Proof According to (13.32) and (13.33), taking the derivative with respect to $u_{ei}$, we have

$$
u_{ei}^{[k+1]} = \arg\min_{u_{ei}} \left\{ \frac{\partial V_i^{[k]T}}{\partial z_i}\left(\bar{A}_i z_i + \bar{B}_i \bar{u}_{ei}^{[k]}\right) + \delta_i^T Q_{ii} \delta_i + u_{ei}^{[k]T} R_{ii} u_{ei}^{[k]} + \sum_{j \in N_i} u_{ej}^{[k]T} R_{ij} u_{ej}^{[k]} \right\}
$$

$$
= -\frac{1}{2} R_{ii}^{-1} \left(\frac{\partial \bar{u}_{ei}^{[k]}}{\partial u_{ei}}\right)^T
\begin{bmatrix}
(d_i + \sigma_i) B_i^T & 0 & \cdots & 0 \\
-e_{ij_1} B_{j_1}^T & B_{j_1}^T & & 0 \\
\vdots & & \ddots & \\
-e_{ij_{N_i}} B_{j_{N_i}}^T & 0 & & B_{j_{N_i}}^T
\end{bmatrix}
\begin{bmatrix}
\dfrac{\partial V_i^{[k]}}{\partial \delta_i} \\ \dfrac{\partial V_i^{[k]}}{\partial x_{j_1 i}} \\ \vdots \\ \dfrac{\partial V_i^{[k]}}{\partial x_{j_{N_i} i}}
\end{bmatrix}
= -\frac{1}{2}(d_i + \sigma_i) R_{ii}^{-1} B_i^T \frac{\partial V_i^{[k]}}{\partial \delta_i}. \tag{13.35}
$$

The proof is completed.


Remark 1 In [3], it is shown that the iterative control law and the iterative performance index function are both functions of $\delta_i$. From Lemma 2, we can see that to obtain the iterative control law $u_{ei}^{[k+1]}$, only partial information of the tracking error $z_i$, namely $\delta_i$, is necessary. On the other hand, according to (13.9), $\dot{\delta}_i$ is a function of $z_i$. If the iterative control $u_{-ei}$ is given, then we can find an iterative performance index function $V_i^{[k]} = V_i^{[k]}(z_i)$ that satisfies (13.32). Hence, $V_i^{[k]}$ is a function of $z_i$, which is an obvious difference from the one in [3].
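As a sketch, (13.34) can be evaluated directly once a quadratic form $V_i^{[k]} = \delta_i^T P \delta_i$ is assumed for the iterative value function, so that $\partial V_i^{[k]}/\partial \delta_i = 2P\delta_i$. The kernel $P$ and the sample error $\delta$ below are illustrative assumptions; $d_1$, $\sigma_1$, $R_{11}$, and $B_1$ follow the node-1 data of the simulation section, with $d_1$ taken as the weighted in-degree $\rho_{13} + \rho_{14}$.

```python
import numpy as np

# Best-response formula (13.34) under an assumed quadratic value function
# V[k](delta) = delta^T P delta, so grad V[k] = 2 P delta.
d1, sigma1 = 0.2, 0.5                    # in-degree and pinning gain of node 1
R11 = 4.0 * np.eye(2)
B1 = np.array([[2.0, 0.0], [0.0, 1.0]])
P = np.eye(2)                            # assumed quadratic value kernel
delta = np.array([1.0, -1.0])            # sample local consensus error

grad_V = 2.0 * P @ delta                 # dV[k]/d(delta)
u_next = -0.5 * (d1 + sigma1) * np.linalg.solve(R11, B1.T @ grad_V)  # Eq. (13.34)
print(u_next)  # [-0.35, 0.175]
```

Note that only $\delta_i$ enters the control update, exactly as Remark 1 observes.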
Next, inspired by [3], the convergence properties of the iterative performance
index functions will be developed by the following theorems.
Theorem 13.2 (Convergence of the policy iteration algorithm when only one agent updates its policy and all players $u_{-ei}$ in its neighborhood do not change) Given fixed neighbor policies $u_{-ei}^o$, let the iterative performance index function $V_i^{[k]}$ and the iterative control law $u_{ei}^{[k+1]}$ be updated by the policy iteration (13.30)–(13.33). Then the iterative performance index function $V_i^{[k]}$ is convergent as $k \to \infty$, i.e., $\lim_{k \to \infty} V_i^{[k]} = V_i^\infty$, where $V_i^\infty$ satisfies the following HJ equation

$$
\min_{u_{ei}} H_i\left(z_i, \frac{\partial V_i^\infty}{\partial z_i}, u_{ei}, u_{-ei}^o\right) = 0. \tag{13.36}
$$

Proof Let $\tilde{u}_{ei}^o = [u_{ej_1}^{oT}, u_{ej_2}^{oT}, \ldots, u_{ej_{N_i}}^{oT}]^T$ and $\hat{u}_{ei}^{[k]} = [u_{ei}^{[k]T}, \tilde{u}_{ei}^{oT}]^T$. Taking the derivative of $V_i^{[k]}$ along the trajectory $\bar{A}_i z_i + \bar{B}_i \hat{u}_{ei}^{[k+1]}$, we have

$$
\dot{V}_i^{[k]} = \frac{\partial V_i^{[k]T}}{\partial z_i}\left(\bar{A}_i z_i + \bar{B}_i \hat{u}_{ei}^{[k+1]}\right) = H_i\left(z_i, \frac{\partial V_i^{[k]}}{\partial z_i}, u_{ei}^{[k+1]}, u_{-ei}^o\right) - \left(\delta_i^T Q_{ii} \delta_i + u_{ei}^{[k+1]T} R_{ii} u_{ei}^{[k+1]} + \sum_{j \in N_i} u_{ej}^{oT} R_{ij} u_{ej}^o\right). \tag{13.37}
$$

According to (13.32) and (13.33), we can get

$$
H_i\left(z_i, \frac{\partial V_i^{[k]}}{\partial z_i}, u_{ei}^{[k+1]}, u_{-ei}^o\right) = \min_{u_{ei}} H_i\left(z_i, \frac{\partial V_i^{[k]}}{\partial z_i}, u_{ei}, u_{-ei}^o\right) \le H_i\left(z_i, \frac{\partial V_i^{[k]}}{\partial z_i}, u_{ei}^{[k]}, u_{-ei}^o\right) = 0. \tag{13.38}
$$

Substituting the iteration index $k+1$ into the HJ equation (13.32), we have

$$
H_i\left(z_i, \frac{\partial V_i^{[k+1]}}{\partial z_i}, u_{ei}^{[k+1]}, u_{-ei}^o\right) = \frac{\partial V_i^{[k+1]T}}{\partial z_i}\left(\bar{A}_i z_i + \bar{B}_i \hat{u}_{ei}^{[k+1]}\right) + \delta_i^T Q_{ii} \delta_i + u_{ei}^{[k+1]T} R_{ii} u_{ei}^{[k+1]} + \sum_{j \in N_i} u_{ej}^{oT} R_{ij} u_{ej}^o = 0, \tag{13.39}
$$

which means

$$
\dot{V}_i^{[k+1]} = -\left(\delta_i^T Q_{ii} \delta_i + u_{ei}^{[k+1]T} R_{ii} u_{ei}^{[k+1]} + \sum_{j \in N_i} u_{ej}^{oT} R_{ij} u_{ej}^o\right). \tag{13.40}
$$

According to (13.37), (13.38) and (13.40), we can obtain $\dot{V}_i^{[k+1]} \ge \dot{V}_i^{[k]}$. By integrating both sides of $\dot{V}_i^{[k+1]} \ge \dot{V}_i^{[k]}$, we can obtain $V_i^{[k+1]} \le V_i^{[k]}$. As $V_i^{[k]}$ is lower bounded, $V_i^{[k]}$ is convergent as $k \to \infty$, i.e., $\lim_{k \to \infty} V_i^{[k]} = V_i^\infty$. According to (13.32) and (13.33), letting $k \to \infty$, we can obtain (13.36). The proof is completed.

From Theorem 13.2, we can see that for $i = 1, 2, \ldots, N$, if the neighbor control laws $u_{-ei}^o$ are fixed, then the iterative performance index functions and iterative control laws converge to the optimum. Next, the convergence property of the iterative performance index functions and iterative control laws when all nodes update their control laws will be developed.

Theorem 13.3 (Convergence of the policy iteration algorithm when all agents update their policies) Assume all nodes $i$ update their policies at each iteration of the policy iteration algorithm (13.30)–(13.33). Define $\varsigma(R_{ij} R_{jj}^{-1})$ as the maximum singular value of $R_{ij} R_{jj}^{-1}$. For small $\varsigma(R_{ij} R_{jj}^{-1})$, the iterative performance index function $V_i^{[k]}$ converges to the optimal $J_i^*$, i.e., $\lim_{k \to \infty} V_i^{[k]} = J_i^*$.

Proof According to (13.33) and (13.35), letting $\breve{u}_{ei}^{[k+1]} = [u_{ei}^{[k+1]T}, \tilde{u}_{ei}^{[k]T}]^T$, we have

$$
\frac{dV_i^{[k]}}{dt} = \frac{\partial V_i^{[k]T}}{\partial z_i}\left(\bar{A}_i z_i + \bar{B}_i \breve{u}_{ei}^{[k+1]}\right). \tag{13.41}
$$

According to the Hamiltonian (13.32), we have

$$
\frac{\partial V_i^{[k]T}}{\partial z_i} \bar{A}_i z_i = -\frac{\partial V_i^{[k]T}}{\partial z_i} \bar{B}_i \bar{u}_{ei}^{[k]} - \delta_i^T Q_{ii} \delta_i - u_{ei}^{[k]T} R_{ii} u_{ei}^{[k]} - \sum_{j \in N_i} u_{ej}^{[k]T} R_{ij} u_{ej}^{[k]}. \tag{13.42}
$$

According to the Hamiltonian (13.32) for $k+1$, we can obtain

$$
\begin{aligned}
\dot{V}_i^{[k+1]} - \dot{V}_i^{[k]} &= \frac{\partial V_i^{[k]T}}{\partial z_i} \bar{B}_i\left(\bar{u}_{ei}^{[k]} - \breve{u}_{ei}^{[k+1]}\right) - u_{ei}^{[k+1]T} R_{ii} u_{ei}^{[k+1]} - \sum_{j \in N_i} u_{ej}^{[k+1]T} R_{ij} u_{ej}^{[k+1]} + u_{ei}^{[k]T} R_{ii} u_{ei}^{[k]} + \sum_{j \in N_i} u_{ej}^{[k]T} R_{ij} u_{ej}^{[k]} \\
&= (d_i + \sigma_i)\frac{\partial V_i^{[k]T}}{\partial \delta_i} B_i\left(u_{ei}^{[k]} - u_{ei}^{[k+1]}\right) - u_{ei}^{[k+1]T} R_{ii} u_{ei}^{[k+1]} - \sum_{j \in N_i} u_{ej}^{[k+1]T} R_{ij} u_{ej}^{[k+1]} + u_{ei}^{[k]T} R_{ii} u_{ei}^{[k]} + \sum_{j \in N_i} u_{ej}^{[k]T} R_{ij} u_{ej}^{[k]} \\
&= 2 u_{ei}^{[k+1]T} R_{ii}\left(u_{ei}^{[k+1]} - u_{ei}^{[k]}\right) - u_{ei}^{[k+1]T} R_{ii} u_{ei}^{[k+1]} + u_{ei}^{[k]T} R_{ii} u_{ei}^{[k]} \\
&\quad + \sum_{j \in N_i}\left(u_{ej}^{[k+1]} - u_{ej}^{[k]}\right)^T R_{ij}\left(u_{ej}^{[k+1]} - u_{ej}^{[k]}\right) - 2\sum_{j \in N_i}\left(u_{ej}^{[k+1]} - u_{ej}^{[k]}\right)^T R_{ij} u_{ej}^{[k+1]} \\
&= \left(u_{ei}^{[k+1]} - u_{ei}^{[k]}\right)^T R_{ii}\left(u_{ei}^{[k+1]} - u_{ei}^{[k]}\right) + \sum_{j \in N_i}\left(u_{ej}^{[k+1]} - u_{ej}^{[k]}\right)^T R_{ij}\left(u_{ej}^{[k+1]} - u_{ej}^{[k]}\right) - 2\sum_{j \in N_i}\left(u_{ej}^{[k+1]} - u_{ej}^{[k]}\right)^T R_{ij} u_{ej}^{[k+1]}. \tag{13.43}
\end{aligned}
$$

A sufficient condition for $\dot{V}_i^{[k+1]} - \dot{V}_i^{[k]} \ge 0$ is

$$
\Delta u_{ei}^T R_{ii} \Delta u_{ei} + \sum_{j \in N_i} \Delta u_{ej}^T R_{ij} \Delta u_{ej} \ge 2\sum_{j \in N_i} \Delta u_{ej}^T R_{ij} u_{ej}^{[k+1]} = -\sum_{j \in N_i} (d_j + \sigma_j) \Delta u_{ej}^T R_{ij} R_{jj}^{-1} B_j^T \frac{\partial V_j^{[k]}}{\partial \delta_j}, \tag{13.44}
$$

where $\Delta u_{e\iota} = u_{e\iota}^{[k+1]} - u_{e\iota}^{[k]}$, $\iota \in \{i\} \cup N_i$. We can see that for $\forall i$, if $\varsigma(R_{ij} R_{jj}^{-1})$, $\rho_{ij}$, $j \in N_i$, and $\sigma_i$ are small, then inequality (13.44) holds, which means $\dot{V}_i^{[k+1]} \ge \dot{V}_i^{[k]}$. By integration of $\dot{V}_i^{[k+1]} \ge \dot{V}_i^{[k]}$, we can obtain $V_i^{[k+1]} \le V_i^{[k]}$. Hence, the iterative performance index function $V_i^{[k]}$ is monotonically nonincreasing and lower bounded. As such, $V_i^{[k]}$ is convergent as $k \to \infty$, i.e., $\lim_{k \to \infty} V_i^{[k]} = V_i^\infty$.

It is obvious that $V_i^\infty \ge J_i^*$. On the other hand, let $\{u_{e1}^{[\ell]}, u_{e2}^{[\ell]}, \ldots, u_{eN}^{[\ell]}\}$ be arbitrary admissible control laws. For $\forall i$, there must exist a performance index function $V_i^{[\ell]}$ that satisfies

$$
H_i\left(z_i, \frac{\partial V_i^{[\ell]}}{\partial z_i}, u_{ei}^{[\ell]}, u_{-ei}^{[\ell]}\right) = 0. \tag{13.45}
$$

Let $u_{ei}^{[\ell+1]} = \arg\min_{u_{ei}} H_i\left(z_i, \frac{\partial V_i^{[\ell]}}{\partial z_i}, u_{ei}, u_{-ei}^{[\ell]}\right)$, and then we have $V_i^\infty \le V_i^{[\ell+1]} \le V_i^{[\ell]}$. As $\{u_{e1}^{[\ell]}, u_{e2}^{[\ell]}, \ldots, u_{eN}^{[\ell]}\}$ are arbitrary, let

$$
\{u_{e1}^{[\ell]}, u_{e2}^{[\ell]}, \ldots, u_{eN}^{[\ell]}\} = \{u_{e1}^*, u_{e2}^*, \ldots, u_{eN}^*\}, \tag{13.46}
$$

and then we can obtain $V_i^\infty \le J_i^*$. Therefore, we can obtain that $\lim_{k \to \infty} V_i^{[k]} = J_i^*$. The proof is completed.

Remark 2 In [3], it is shown that for the linear multi-agent system $\dot{x}_i = A x_i + B_i u_i$, if the edge weights $\rho_{ij}$ and $\varsigma(R_{ij} R_{jj}^{-1})$ are small, then the iterative performance index function converges to the optimum. From Theorem 13.3, we have that for the multi-agent system (13.1), if $\varsigma(R_{ij} R_{jj}^{-1})$ is small, then the iterative performance index function $V_i^{[k]}$ will also converge to the optimum as $k \to \infty$, while the constraint on $\rho_{ij}$ is omitted.
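The quantity $\varsigma(R_{ij} R_{jj}^{-1})$ appearing in Theorem 13.3 is straightforward to evaluate numerically; the weight matrices below are illustrative (diagonal, so the answer is just the largest diagonal ratio).

```python
import numpy as np

# Maximum singular value of R_ij R_jj^{-1}, the quantity in Theorem 13.3.
R_ij = np.diag([1.0, 2.0])
R_jj = np.diag([4.0, 2.0])

M = R_ij @ np.linalg.inv(R_jj)
sigma_max = np.linalg.svd(M, compute_uv=False)[0]
print(sigma_max)  # 1.0: the ratios are 1/4 and 2/2, and the larger is 1.0

# Algorithm 8 (below) shrinks R_ij by a factor zeta < 1 when monotonicity
# fails, which shrinks sigma_max proportionally:
zeta = 0.5
sigma_scaled = np.linalg.svd(zeta * M, compute_uv=False)[0]
print(sigma_scaled)  # 0.5
```

This is the mechanism by which the retry step of the policy iteration algorithm enforces the small-$\varsigma$ condition of the theorem.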

13.4.3 Heterogeneous Multi-agent Policy Iteration Algorithm

Based on the above preparations, we summarize the heterogeneous multi-agent policy iteration algorithm in Algorithm 8.

Algorithm 8 Heterogeneous multi-agent policy iteration algorithm

Initialization:
Choose randomly an admissible control law $u_{ei}^{[0]}$, $\forall i \in \mathbf{N}$;
Choose a computation precision $\varepsilon$;
Choose a constant $0 < \zeta < 1$;
Choose positive definite matrices $Q_{ii}$, $R_{ii}$, and $R_{ij}$. Obtain the desired control $u_{di}$, for $\forall i \in \mathbf{N}$, by $u_{di} = B_i^{-1}(A_0 - A_i)x_0$;
Transform the agent into (13.10), i.e., $\dot{z}_i = \bar{A}_i z_i + \bar{B}_i \bar{u}_{ei}$.

Iteration:
1: Construct the utility function $U_i(z_i, u_{ei}, u_{-ei})$ by (13.13).
2: Let the iteration index $k = 0$. Construct a performance index function $V_i^{[0]}$ to satisfy (13.30).
3: For $i \in \mathbf{N}$, do Policy Improvement
$$
u_{ei}^{[k+1]} = \arg\min_{u_{ei}} H_i\left(z_i, \frac{\partial V_i^{[k]}}{\partial z_i}, u_{ei}, u_{-ei}^{[k]}\right);
$$
4: Do Policy Evaluation, i.e., solve for $V_i^{[k+1]}$ from
$$
H_i\left(z_i, \frac{\partial V_i^{[k+1]}}{\partial z_i}, u_{ei}^{[k+1]}, u_{-ei}^{[k+1]}\right) = 0;
$$
5: If $V_i^{[k+1]} \le V_i^{[k]}$, go to the next step. Else, let $R_{ij} = \zeta R_{ij}$, $j \in N_i$, and go to Step 2.
6: If $V_i^{[k]} - V_i^{[k+1]} \le \varepsilon$, then the optimal performance index function and optimal control law are obtained; go to Step 7. Else, let $k = k + 1$ and go to Step 3.
7: Return $u_{ei}^{[k+1]}$ and $V_i^{[k+1]}$. Let $u_i^* = u_{ei}^{[k+1]} + u_{di}$.

13.5 Simulation Study

In this section, the performance of our iterative ADP algorithm is evaluated by numerical experiments. Consider the five-node strongly connected digraph structure shown in Fig. 13.1, with a leader node connected to node 3. The edge weights are taken as $\rho_{13} = 0.1$, $\rho_{14} = 0.1$, $\rho_{21} = 0.5$, $\rho_{31} = 0.4$, $\rho_{32} = 0.3$, $\rho_{41} = 0.2$, $\rho_{45} = 0.1$, $\rho_{52} = 0.8$, and $\rho_{54} = 0.7$, respectively. The pinning gains are taken as $\sigma_1 = 0.5$, $\sigma_2 = 0.3$, $\sigma_3 = 0.2$, $\sigma_4 = 0.1$, and $\sigma_5 = 0.1$, respectively.
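Assuming the standard convention that $d_i$ is the weighted in-degree $d_i = \sum_{j \in N_i} \rho_{ij}$ (the quantity appearing in (13.34)), the graph data above can be encoded as follows; this is only a bookkeeping sketch, with node indices shifted to 0-based.

```python
import numpy as np

# Encode the digraph of Fig. 13.1: entry rho[i, j] holds the edge weight
# rho_{(i+1)(j+1)} from the listed values; sigma holds the pinning gains.
N = 5
rho = np.zeros((N, N))
rho[0, 2], rho[0, 3] = 0.1, 0.1      # rho_13, rho_14
rho[1, 0] = 0.5                      # rho_21
rho[2, 0], rho[2, 1] = 0.4, 0.3      # rho_31, rho_32
rho[3, 0], rho[3, 4] = 0.2, 0.1      # rho_41, rho_45
rho[4, 1], rho[4, 3] = 0.8, 0.7      # rho_52, rho_54
sigma = np.array([0.5, 0.3, 0.2, 0.1, 0.1])

d = rho.sum(axis=1)                  # weighted in-degrees d_i
print(d)  # [0.2 0.5 0.7 0.3 1.5]
```

These in-degrees and pinning gains are exactly the $(d_i + \sigma_i)$ factors used by the distributed control law of each node.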
For the structure in Fig. 13.1, the dynamics of each node are considered as
Node 1:
$$
\begin{bmatrix} \dot{x}_{11} \\ \dot{x}_{12} \end{bmatrix} = \begin{bmatrix} -1 & 1 \\ -3 & -1 \end{bmatrix} \begin{bmatrix} x_{11} \\ x_{12} \end{bmatrix} + \begin{bmatrix} 2 & 0 \\ 0 & 1 \end{bmatrix} \begin{bmatrix} u_{11} \\ u_{12} \end{bmatrix}, \tag{13.47}
$$
Node 2:
$$
\begin{bmatrix} \dot{x}_{21} \\ \dot{x}_{22} \end{bmatrix} = \begin{bmatrix} -1 & 1 \\ -4 & -1 \end{bmatrix} \begin{bmatrix} x_{21} \\ x_{22} \end{bmatrix} + \begin{bmatrix} 2 & 0 \\ 0 & 3 \end{bmatrix} \begin{bmatrix} u_{21} \\ u_{22} \end{bmatrix}, \tag{13.48}
$$
Node 3:
$$
\begin{bmatrix} \dot{x}_{31} \\ \dot{x}_{32} \end{bmatrix} = \begin{bmatrix} -2 & 1 \\ -1 & -1 \end{bmatrix} \begin{bmatrix} x_{31} \\ x_{32} \end{bmatrix} + \begin{bmatrix} 2 & 0 \\ 0 & 2 \end{bmatrix} \begin{bmatrix} u_{31} \\ u_{32} \end{bmatrix}, \tag{13.49}
$$
Node 4:
$$
\begin{bmatrix} \dot{x}_{41} \\ \dot{x}_{42} \end{bmatrix} = \begin{bmatrix} -1 & 1 \\ -2 & -1 \end{bmatrix} \begin{bmatrix} x_{41} \\ x_{42} \end{bmatrix} + \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} \begin{bmatrix} u_{41} \\ u_{42} \end{bmatrix}, \tag{13.50}
$$
Fig. 13.1 The structure of the five-node digraph

Node 5:

Fig. 13.2 Tracking errors $x_{ei}$, $i = 1, 2, 3, 4$. a Tracking error $x_{e1}$. b Tracking error $x_{e2}$. c Tracking error $x_{e3}$. d Tracking error $x_{e4}$

       
$$
\begin{bmatrix} \dot{x}_{51} \\ \dot{x}_{52} \end{bmatrix} = \begin{bmatrix} -2 & 1 \\ -3 & -1 \end{bmatrix} \begin{bmatrix} x_{51} \\ x_{52} \end{bmatrix} + \begin{bmatrix} 3 & 0 \\ 0 & 2 \end{bmatrix} \begin{bmatrix} u_{51} \\ u_{52} \end{bmatrix}. \tag{13.51}
$$

Let the desired dynamics be expressed as

$$
\begin{bmatrix} \dot{x}_{01} \\ \dot{x}_{02} \end{bmatrix} = \begin{bmatrix} -2 & 1 \\ -4 & -1 \end{bmatrix} \begin{bmatrix} x_{01} \\ x_{02} \end{bmatrix}. \tag{13.52}
$$

Define the utility function as in (13.13), where the weight matrices are expressed as follows, with $I$ denoting the identity matrix with suitable dimensions:

$$
\begin{aligned}
& Q_{11} = Q_{22} = Q_{33} = Q_{44} = Q_{55} = I, \\
& R_{11} = 4I,\ R_{12} = I,\ R_{13} = I,\ R_{14} = I, \\
& R_{21} = I,\ R_{22} = 4I, \\
& R_{31} = I,\ R_{32} = I,\ R_{33} = 5I, \\
& R_{41} = I,\ R_{44} = 9I,\ R_{45} = I, \\
& R_{52} = I,\ R_{54} = I,\ R_{55} = 9I. \tag{13.53}
\end{aligned}
$$
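With the node and leader dynamics above, the feedforward term $u_{di} = B_i^{-1}(A_0 - A_i)x_0$ from the initialization of Algorithm 8 can be sanity-checked numerically for node 1: applying $u_{d1}$ at any leader state makes the node-1 dynamics reproduce the leader dynamics (13.52). The leader state used below is a sample value.

```python
import numpy as np

# Feedforward check for node 1: u_d1 = B_1^{-1}(A_0 - A_1) x_0.
A0 = np.array([[-2.0, 1.0], [-4.0, -1.0]])   # leader dynamics (13.52)
A1 = np.array([[-1.0, 1.0], [-3.0, -1.0]])   # node-1 dynamics (13.47)
B1 = np.array([[2.0, 0.0], [0.0, 1.0]])
x0 = np.array([0.5, -0.5])                   # sample leader state

u_d1 = np.linalg.solve(B1, (A0 - A1) @ x0)
# An agent sitting exactly on the leader trajectory matches its derivative:
assert np.allclose(A1 @ x0 + B1 @ u_d1, A0 @ x0)
print(u_d1)  # [-0.25, -0.5]
```

The residual control $u_{ei}$ produced by the policy iteration then only has to regulate the tracking error, not compensate the model mismatch.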

Fig. 13.3 Tracking error $x_{e5}$ and the control errors $u_{ei}$, $i = 1, 2, 3$. a Tracking error $x_{e5}$. b Control error $u_{e1}$. c Control error $u_{e2}$. d Control error $u_{e3}$

Fig. 13.4 Control errors $u_{ei}$, $i = 4, 5$. a Control error $u_{e4}$. b Control error $u_{e5}$

Fig. 13.5 The evolution of the agent states

Fig. 13.6 The evolution of the agent control

Let the initial states of each agent in the games be

$$
x_1(0) = \begin{bmatrix} 1 \\ -1 \end{bmatrix},\ x_2(0) = \begin{bmatrix} -1 \\ 1 \end{bmatrix},\ x_3(0) = \begin{bmatrix} 1 \\ 1 \end{bmatrix},\ x_4(0) = \begin{bmatrix} 2 \\ -2 \end{bmatrix},\ x_5(0) = \begin{bmatrix} -2 \\ 2 \end{bmatrix},\ x_0(0) = \begin{bmatrix} 0.5 \\ -0.5 \end{bmatrix}. \tag{13.54}
$$

Define $x_{ei} = x_i - x_0$, $i = 1, 2, 3, 4, 5$. The graphical game is implemented as in Algorithm 8 for $k = 20$ iterations. The tracking errors of the agents at nodes 1–5 are shown in Figs. 13.2a–d and 13.3a, respectively. The control errors $u_{ei}$, $i = 1, 2, \ldots, 5$, are shown in Figs. 13.3b–d and 13.4, respectively.
From Figs. 13.2, 13.3 and 13.4, we can see that after 20 iterations, the iterative states and iterative control laws converge to the optimum. Applying the achieved optimal control laws to the agents, the states are shown in Fig. 13.5 and the corresponding optimal control trajectories for each node are shown in Fig. 13.6. From Figs. 13.5 and 13.6, we can see that the states of each node converge to the desired trajectory $x_0$, which verifies the effectiveness of the developed iterative ADP algorithm.

13.6 Conclusion

In this chapter, an effective policy-iteration-based ADP algorithm is developed to solve the optimal coordination control problems for heterogeneous multi-agent differential graphical games. The developed heterogeneous differential graphical game permits the agent dynamics of each node to differ from those of the other nodes. An optimal cooperative policy iteration algorithm for graphical differential games is developed to achieve the optimal control law for the agent of each node, guaranteeing that the dynamics of each node can track the desired one. Convergence analysis shows that the performance index functions of the heterogeneous multi-agent differential graphical games converge to the Nash equilibrium. Finally, simulation results show the effectiveness of the developed optimal control scheme.

References

1. Jamshidi, M.: Large-Scale Systems-Modeling and Control. The Netherlands Press, Amsterdam
(1982)
2. Owen, G.: Game Theory. Academic Press, New York (1982)
3. Vamvoudakis, K., Lewis, F., Hudas, G.: Multi-agent differential graphical games: online adaptive
learning solution for synchronization with optimality. Automatica 48(8), 1598–1611 (2012)
Index

A
Action neural networks, 207
Adaptive dynamic programming, 4
Algebraic Riccati equation, 1
Anterior cingulate cortex, 63
Approximate/adaptive dynamic programming, v
Approximate dynamic programming, 4

C
Critic neural network, 207

D
Disturbance neural networks, 207

E
Extreme learning machine, 7

H
Hamilton-Jacobi-Bellman, 1
Hamilton–Jacobi–Isaacs, 208

I
Integral reinforcement learning, vi

L
Linear quadratic regulator, 1

N
Neural network, 4, 69
Non-zero-sum, 208, 227

O
Orbitofrontal cortex, 66

P
Policy iteration, 4

R
Recurrent neural network, 63
Reinforcement learning, 3

S
Shunting inhibitory artificial neural network, 63, 64
Single-hidden layer feed-forward network, 7
Support vector machine, 8

U
Uniformly ultimately bounded, 63

V
Value iteration, 4

W
Wheeled inverted pendulum, 83

Z
Zero-sum, 165

© Science Press, Beijing and Springer Nature Singapore Pte Ltd. 2019 271
R. Song et al., Adaptive Dynamic Programming: Single and Multiple
Controllers, Studies in Systems, Decision and Control 166,
https://doi.org/10.1007/978-981-13-1712-5
