
Reinforcement Learning and Optimal Control

by
Dimitri P. Bertsekas
Massachusetts Institute of Technology

DRAFT TEXTBOOK
This is a draft of a textbook that is scheduled to be finalized in 2019,
and to be published by Athena Scientific. It represents “work in progress,”
and it will be periodically updated. It more than likely contains errors
(hopefully not serious ones). Furthermore, its references to the literature
are incomplete. Your comments and suggestions to the author at
dimitrib@mit.edu are welcome. The date of last revision is given below.

December 14, 2018

WWW site for book information and orders

http://www.athenasc.com

Athena Scientific, Belmont, Massachusetts


Athena Scientific
Post Office Box 805
Nashua, NH 03060
U.S.A.

Email: info@athenasc.com
WWW: http://www.athenasc.com

Publisher’s Cataloging-in-Publication Data

Bertsekas, Dimitri P.
Reinforcement Learning and Optimal Control
Includes Bibliography and Index
1. Mathematical Optimization. 2. Dynamic Programming. I. Title.
QA402.5 .B465 2019 519.703 00-91281

ISBN-10: 1-886529-39-6, ISBN-13: 978-1-886529-39-7


ABOUT THE AUTHOR

Dimitri Bertsekas studied Mechanical and Electrical Engineering at the
National Technical University of Athens, Greece, and obtained his Ph.D.
in system science from the Massachusetts Institute of Technology. He has
held faculty positions with the Engineering-Economic Systems Department,
Stanford University, and the Electrical Engineering Department of the
University of Illinois, Urbana. Since 1979 he has been teaching at the
Electrical Engineering and Computer Science Department of the Massachusetts
Institute of Technology (M.I.T.), where he is currently the McAfee Professor
of Engineering.
His teaching and research span several fields, including deterministic
optimization, dynamic programming and stochastic control, large-scale
and distributed computation, and data communication networks. He has
authored or coauthored numerous research papers and seventeen books,
several of which are currently used as textbooks in MIT classes, including
“Dynamic Programming and Optimal Control,” “Data Networks,” “Introduction
to Probability,” and “Nonlinear Programming.”
Professor Bertsekas was awarded the INFORMS 1997 Prize for Research
Excellence in the Interface Between Operations Research and Computer
Science for his book “Neuro-Dynamic Programming” (co-authored with
John Tsitsiklis), the 2001 AACC John R. Ragazzini Education Award,
the 2009 INFORMS Expository Writing Award, the 2014 AACC Richard
Bellman Heritage Award, the 2014 Khachiyan Prize for Life-Time
Accomplishments in Optimization, the 2015 George B. Dantzig Prize, and
the 2018 John von Neumann Theory Prize. In 2001, he was elected to the
United States National Academy of Engineering for “pioneering contributions
to fundamental research, practice and education of optimization/control
theory, and especially its application to data communication networks.”

ATHENA SCIENTIFIC
OPTIMIZATION AND COMPUTATION SERIES

1. Abstract Dynamic Programming, 2nd Edition, by Dimitri P. Bertsekas,
   2018, ISBN 978-1-886529-46-5, 360 pages
2. Dynamic Programming and Optimal Control, Two-Volume Set, by
   Dimitri P. Bertsekas, 2017, ISBN 1-886529-08-6, 1270 pages
3. Nonlinear Programming, 3rd Edition, by Dimitri P. Bertsekas, 2016,
   ISBN 1-886529-05-1, 880 pages
4. Convex Optimization Algorithms, by Dimitri P. Bertsekas, 2015,
   ISBN 978-1-886529-28-1, 576 pages
5. Convex Optimization Theory, by Dimitri P. Bertsekas, 2009,
   ISBN 978-1-886529-31-1, 256 pages
6. Introduction to Probability, 2nd Edition, by Dimitri P. Bertsekas and
   John N. Tsitsiklis, 2008, ISBN 978-1-886529-23-6, 544 pages
7. Convex Analysis and Optimization, by Dimitri P. Bertsekas, Angelia
   Nedić, and Asuman E. Ozdaglar, 2003, ISBN 1-886529-45-0, 560 pages
8. Network Optimization: Continuous and Discrete Models, by Dimitri P.
   Bertsekas, 1998, ISBN 1-886529-02-7, 608 pages
9. Network Flows and Monotropic Optimization, by R. Tyrrell Rockafellar,
   1998, ISBN 1-886529-06-X, 634 pages
10. Introduction to Linear Optimization, by Dimitris Bertsimas and
    John N. Tsitsiklis, 1997, ISBN 1-886529-19-1, 608 pages
11. Parallel and Distributed Computation: Numerical Methods, by Dimitri P.
    Bertsekas and John N. Tsitsiklis, 1997, ISBN 1-886529-01-9, 718 pages
12. Neuro-Dynamic Programming, by Dimitri P. Bertsekas and John N.
    Tsitsiklis, 1996, ISBN 1-886529-10-8, 512 pages
13. Constrained Optimization and Lagrange Multiplier Methods, by
    Dimitri P. Bertsekas, 1996, ISBN 1-886529-04-3, 410 pages
14. Stochastic Optimal Control: The Discrete-Time Case, by Dimitri P.
    Bertsekas and Steven E. Shreve, 1996, ISBN 1-886529-03-5, 330 pages

Contents

1. Exact Dynamic Programming

1.1. Deterministic Dynamic Programming (p. 2)
1.1.1. Deterministic Problems (p. 2)
1.1.2. The Dynamic Programming Algorithm (p. 7)
1.1.3. Approximation in Value Space (p. 12)
1.1.4. Model-Free Approximate Solution - Q-Learning (p. 13)
1.2. Stochastic Dynamic Programming (p. 14)
1.3. Examples, Variations, and Simplifications (p. 17)
1.3.1. Deterministic Shortest Path Problems (p. 18)
1.3.2. Discrete Deterministic Optimization (p. 19)
1.3.3. Problems with a Terminal State (p. 23)
1.3.4. Forecasts (p. 26)
1.3.5. Problems with Uncontrollable State Components (p. 27)
1.3.6. Partial State Information and Belief States (p. 32)
1.3.7. Linear Quadratic Optimal Control (p. 35)
1.4. Reinforcement Learning and Optimal Control - Some Terminology (p. 38)
1.5. Notes and Sources (p. 40)

2. Approximation in Value Space

2.1. Variants of Approximation in Value Space (p. 3)
2.1.1. Off-Line and On-Line Methods (p. 4)
2.1.2. Simplifying the Lookahead Minimization (p. 5)
2.1.3. Model-Free Approximation in Value and Policy Space (p. 6)
2.1.4. When is Approximation in Value Space Effective? (p. 9)
2.2. Multistep Lookahead (p. 10)
2.2.1. Multistep Lookahead and Rolling Horizon (p. 11)
2.2.2. Multistep Lookahead and Deterministic Problems (p. 13)
2.3. Problem Approximation (p. 14)
2.3.1. Enforced Decomposition (p. 15)
2.3.2. Probabilistic Approximation - Certainty Equivalent Control (p. 21)
2.4. Rollout and Model Predictive Control (p. 27)
2.4.1. Rollout for Deterministic Problems (p. 27)
2.4.2. Stochastic Rollout and Monte Carlo Tree Search (p. 34)
2.4.3. Model Predictive Control (p. 41)
2.5. Notes and Sources (p. 46)

3. Parametric Approximation

3.1. Approximation Architectures (p. 2)
3.1.1. Linear and Nonlinear Feature-Based Architectures (p. 2)
3.1.2. Training of Linear and Nonlinear Architectures (p. 7)
3.1.3. Incremental Gradient and Newton Methods (p. 9)
3.2. Neural Networks (p. 21)
3.2.1. Training of Neural Networks (p. 24)
3.2.2. Multilayer and Deep Neural Networks (p. 26)
3.3. Sequential Dynamic Programming Approximation (p. 29)
3.4. Q-factor Parametric Approximation (p. 31)
3.5. Notes and Sources (p. 33)

4. Infinite Horizon Reinforcement Learning

4.1. An Overview of Infinite Horizon Problems (p. 2)
4.2. Stochastic Shortest Path Problems (p. 5)
4.3. Discounted Problems (p. 14)
4.4. Exact and Approximate Value Iteration (p. 19)
4.5. Policy Iteration (p. 22)
4.5.1. Exact Policy Iteration (p. 22)
4.5.2. Policy Iteration for Q-factors (p. 27)
4.5.3. Limited Lookahead Policies and Rollout (p. 28)
4.5.4. Approximate Policy Iteration - Error Bounds (p. 30)
4.6. Simulation-Based Policy Iteration with Parametric Approximation (p. 34)
4.6.1. Self-Learning and Actor-Critic Systems (p. 34)
4.6.2. A Model-Based Variant (p. 35)
4.6.3. A Model-Free Variant (p. 37)
4.6.4. Issues Relating to Approximate Policy Iteration (p. 39)
4.7. Exact and Approximate Linear Programming (p. 42)
4.8. Q-Learning (p. 44)
4.9. Additional Methods - Temporal Differences (p. 47)
4.10. Approximation in Policy Space (p. 58)
4.11. Notes and Sources (p. 60)
4.12. Appendix: Mathematical Analysis (p. 63)
4.12.1. Proofs for Stochastic Shortest Path Problems (p. 63)
4.12.2. Proofs for Discounted Problems (p. 69)
4.12.3. Convergence of Exact Policy Iteration (p. 69)
4.12.4. Error Bounds for Approximate Policy Iteration (p. 70)

5. Aggregation

5.1. Aggregation Frameworks
5.2. Classical and Biased Forms of the Aggregate Problem
5.3. Bellman’s Equation for the Aggregate Problem
5.4. Algorithms for the Aggregate Problem
5.5. Some Examples
5.6. Spatiotemporal Aggregation for Deterministic Problems
5.7. Notes and Sources

References

Index
Preface
In this book we consider large and challenging multistage decision
problems, which can in principle be solved by dynamic programming (DP
for short), but whose exact solution is computationally intractable. We
discuss solution methods that rely on approximations to produce suboptimal
policies with adequate performance. These methods are collectively referred
to as reinforcement learning, and also by alternative names such as
approximate dynamic programming and neuro-dynamic programming.
Our subject has benefited greatly from the interplay of ideas from
optimal control and from artificial intelligence. One of the aims of the
book is to explore the common boundary between these two fields and to
form a bridge that is accessible to workers with a background in either field.
Our primary focus will be on approximation in value space. Here, the
control at each state is obtained by limited lookahead with cost function
approximation, i.e., by optimization of the cost over a limited horizon, plus
an approximation of the optimal future cost, starting from the end of this
horizon. The latter cost, which we generally denote by J˜, is a function of
the state where we may be at the end of the horizon. It may be computed
by a variety of methods, possibly involving simulation and/or some given or
separately derived heuristic/suboptimal policy. The use of simulation often
allows for model-free implementations that do not require the availability
of a mathematical model, a major idea that has allowed the use of dynamic
programming beyond its classical boundaries.
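To make the lookahead idea concrete, here is a minimal sketch, not taken
from the book, of one-step lookahead for a deterministic problem with
finitely many controls at each state. The dynamics f, stage cost g,
admissible control set U, and the approximation J_tilde are placeholders
to be supplied for a specific problem.

def one_step_lookahead_control(x, U, f, g, J_tilde):
    """Return a control u in U(x) minimizing g(x, u) + J_tilde(f(x, u))."""
    best_u, best_cost = None, float("inf")
    for u in U(x):                           # enumerate the admissible controls at x
        cost = g(x, u) + J_tilde(f(x, u))    # stage cost plus approximate cost-to-go
        if cost < best_cost:
            best_u, best_cost = u, cost
    return best_u

Multistep lookahead would replace the single application of f and g by an
optimization over a short horizon, with J_tilde applied at the end of that
horizon.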
We focus selectively on four types of methods for obtaining J˜:
(a) Problem approximation: Here J˜ is the optimal cost function of a
related simpler problem, which is solved by exact DP. Certainty
equivalent control and enforced decomposition schemes are discussed
in some detail.
(b) Rollout and model predictive control: Here J˜ is the cost function of
some known heuristic policy. The needed cost values to implement a
rollout policy are often calculated by simulation. While this method
applies to stochastic problems, the reliance on simulation favors
deterministic problems, including challenging combinatorial problems
for which heuristics may be readily implemented. Rollout may also
be combined with adaptive simulation and Monte Carlo tree search,
which have proved very effective in the context of games such as
backgammon, chess, Go, and others. A minimal code sketch of rollout
is given after this list.
Model predictive control was originally developed for continuous-space
optimal control problems that involve some goal state, e.g., the origin
in a classical control context. It can be viewed as a specialized
rollout method that is based on an optimization algorithm for reaching
a goal state.
(c) Parametric cost approximation: Here J˜ is chosen from within a
parametric class of functions, including neural networks, with the
parameters “optimized” or “trained” by using state-cost sample pairs
and some type of incremental least squares/regression algorithm; a
minimal sketch of such a fit for a linear feature-based architecture
is also given after this list. Approximate policy iteration and its
variants are covered in some detail, including several actor-critic
schemes. These include policy evaluation that involves temporal
difference-based training methods, and policy improvement that is
based on approximation in policy space.
(d) Aggregation: Here the cost function J˜ is the optimal cost function
of some approximation to the original problem, called the aggregate
problem, which has fewer states. The aggregate problem can be
formulated in a variety of ways, and may be solved by using exact
DP techniques. Its optimal cost function is then used as J˜ in a
limited lookahead scheme. Aggregation may also be used to provide
local improvements to parametric approximation schemes that involve
neural networks or linear feature-based architectures.
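To illustrate item (b), the following is a minimal sketch of rollout for a
deterministic problem, not taken from the book: the cost-to-go approximation
is simply the cost accumulated by simulating a given base heuristic. The
dynamics f, stage cost g, admissible control set U, base policy, simulation
horizon, and terminal test are all placeholders for a specific problem.

def heuristic_cost(x, f, g, base_policy, horizon, is_terminal):
    """Simulate the base heuristic from state x and return its accumulated cost."""
    total = 0.0
    for _ in range(horizon):
        if is_terminal(x):
            break
        u = base_policy(x)          # control chosen by the base heuristic
        total += g(x, u)            # accumulate the stage cost
        x = f(x, u)                 # advance the deterministic dynamics
    return total

def rollout_control(x, U, f, g, base_policy, horizon, is_terminal):
    """One-step lookahead using the base heuristic's cost as the approximation."""
    best_u, best_cost = None, float("inf")
    for u in U(x):
        cost = g(x, u) + heuristic_cost(f(x, u), f, g,
                                        base_policy, horizon, is_terminal)
        if cost < best_cost:
            best_u, best_cost = u, cost
    return best_u

Note that rollout_control has the same structure as the earlier lookahead
sketch, with J_tilde replaced by the simulated cost of the base heuristic;
this is the sense in which rollout is a special case of approximation in
value space.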
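For item (c), here is a minimal sketch, again not taken from the book, of an
incremental least-squares fit of a linear feature-based architecture
J_tilde(x) = features(x) . r from state-cost sample pairs. The feature map
features, its dimension dim, the step size, and the number of passes are
placeholder choices.

import numpy as np

def fit_linear_cost_approximation(samples, features, dim, step=0.01, passes=10):
    """Incremental (LMS-type) least-squares fit of J_tilde(x) = features(x) . r,
    given a list of (state, observed_cost) sample pairs.
    features(x) is assumed to return a NumPy array of length dim."""
    r = np.zeros(dim)
    for _ in range(passes):
        for x, cost in samples:
            phi = features(x)
            r += step * (cost - phi @ r) * phi   # gradient step on the squared error
    return r

def J_tilde(x, r, features):
    """Evaluate the fitted linear cost approximation at state x."""
    return features(x) @ r

The fitted J_tilde can then be plugged into the one-step lookahead sketch
given earlier.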
We have adopted a gradual expository approach, which proceeds
along three directions:
(1) From exact DP to approximate DP: We first discuss exact DP
algorithms, explain why they may be difficult to implement, and then
use them as the basis for approximations.
(2) From finite horizon to infinite horizon problems: We first discuss
finite horizon exact and approximate DP methodologies, which are
intuitive and mathematically simple, in Chapters 1-3. We then progress
to infinite horizon problems in Chapters 4 and 5.
(3) From model-based to model-free approaches: Reinforcement learning
methods offer a major potential benefit over classical DP approaches,
which were practiced exclusively up to the early 90s: they can be
implemented by using a simulator/computer model rather than a
mathematical model. In our presentation, we first discuss model-based
methods, and then we identify those methods that can be appropriately
modified to work with a simulator.
After the first chapter, each new class of methods is introduced as a
more sophisticated or generalized version of a simpler method introduced
earlier. Moreover, each type of method is illustrated by means of examples,
which should be helpful in providing insight into its use, but may also be
skipped selectively and without loss of continuity. Detailed solutions to
some of the simpler examples are given, and may illustrate some of the
implementation details.
The mathematical style of this book is somewhat different from that
of the author’s dynamic programming books [Ber12], [Ber17a], [Ber18a],
and the neuro-dynamic programming research monograph, written jointly
with John Tsitsiklis [BeT96]. While we rigorously present the theory of
finite and infinite horizon dynamic programming, and some fundamental
approximation methods, we rely more on intuitive explanations and less on
proof-based insights. Moreover, our mathematical requirements are modest:
calculus, elementary probability, and a minimal use of matrix-vector
algebra.
Furthermore, we present methods that are often successful in practice,
but have less than solid performance properties. This is a reflection of the
state of the art in the field: there are no methods that are guaranteed to
work for all or even most problems, but there are enough methods to try
on a given problem with a reasonable chance of success in the end. For this
process to work, however, it is important to have proper intuition into the
inner workings of each type of method, as well as an understanding of its
analytical and computational properties. To quote a statement from the
preface of the neuro-dynamic programming (NDP) monograph [BeT96]:
“It is primarily through an understanding of the mathematical structure of
the NDP methodology that we will be able to identify promising or solid
algorithms from the bewildering array of speculative proposals and claims
that can be found in the literature.”

Dimitri P. Bertsekas
Winter 2018
