Reinforcement Learning and Optimal Control
by
Dimitri P. Bertsekas
Massachusetts Institute of Technology
DRAFT TEXTBOOK
This is a draft of a textbook that is scheduled to be finalized in 2019, and to be published by Athena Scientific. It represents “work in progress,” and it will be periodically updated. It more than likely contains errors (hopefully not serious ones). Furthermore, its references to the literature are incomplete. Your comments and suggestions to the author at [email protected] are welcome. The date of last revision is given below.
Email: [email protected]
WWW: http://www.athenasc.com
Bertsekas, Dimitri P.
Reinforcement Learning and Optimal Control
Includes Bibliography and Index
1. Mathematical Optimization. 2. Dynamic Programming. I. Title.
QA402.5 .B465 2019 519.703 00-91281
ATHENA SCIENTIFIC
OPTIMIZATION AND COMPUTATION SERIES
Contents
3. Parametric Approximation
3.1. Approximation Architectures . . . . . . . . . . . . . . . p. 2
3.1.1. Linear and Nonlinear Feature-Based Architectures . . . p. 2
3.1.2. Training of Linear and Nonlinear Architectures . . . . p. 7
3.1.3. Incremental Gradient and Newton Methods . . . . . . p. 9
3.2. Neural Networks . . . . . . . . . . . . . . . . . . . . p. 21
3.2.1. Training of Neural Networks . . . . . . . . . . . . p. 24
3.2.2. Multilayer and Deep Neural Networks . . . . . . . . p. 26
3.3. Sequential Dynamic Programming Approximation . . . . . . p. 29
3.4. Q-factor Parametric Approximation . . . . . . . . . . . . p. 31
3.5. Notes and Sources . . . . . . . . . . . . . . . . . . . p. 33
5. Aggregation
5.1. Aggregation Frameworks . . . . . . . . . . . . . . . . . . p.
5.2. Classical and Biased Forms of the Aggregate Problem . . . . . p.
5.3. Bellman’s Equation for the Aggregate Problem . . . . . . . . p.
5.4. Algorithms for the Aggregate Problem . . . . . . . . . . . . p.
5.5. Some Examples . . . . . . . . . . . . . . . . . . . . . . p.
5.6. Spatiotemporal Aggregation for Deterministic Problems . . . . p.
5.7. Notes and Sources . . . . . . . . . . . . . . . . . . . . p.
References . . . . . . . . . . . . . . . . . . . . . . . . . p.
Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . p.
Preface
In this book we consider large and challenging multistage decision problems, which can be solved in principle by dynamic programming (DP for short), but whose exact solution is computationally intractable. We discuss solution methods that rely on approximations to produce suboptimal policies with adequate performance. These methods are collectively referred to as reinforcement learning, and also by alternative names such as approximate dynamic programming and neuro-dynamic programming.
Our subject has benefited greatly from the interplay of ideas from optimal control and from artificial intelligence. One of the aims of the book is to explore the common boundary between these two fields and to form a bridge that is accessible to workers with a background in either field.
Our primary focus will be on approximation in value space. Here, the control at each state is obtained by limited lookahead with cost function approximation, i.e., by optimization of the cost over a limited horizon, plus an approximation of the optimal future cost, starting from the end of this horizon. The latter cost, which we generally denote by $\tilde{J}$, is a function of the state where we may be at the end of the horizon. It may be computed by a variety of methods, possibly involving simulation and/or some given or separately derived heuristic/suboptimal policy. The use of simulation often allows for model-free implementations that do not require the availability of a mathematical model, a major idea that has allowed the use of dynamic programming beyond its classical boundaries.
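To fix ideas, the following is a minimal sketch of the one-step lookahead rule that underlies approximation in value space, written in generic finite-horizon DP notation; the system function $f_k$, stage cost $g_k$, disturbance $w_k$, and control constraint set $U_k(x_k)$ are assumed here and are introduced formally later in the book:
\[
\tilde{u}_k \in \arg\min_{u_k \in U_k(x_k)} E\Big\{ g_k(x_k, u_k, w_k) + \tilde{J}_{k+1}\big(f_k(x_k, u_k, w_k)\big)\Big\},
\]
where $\tilde{J}_{k+1}$ is the cost approximation evaluated at the next state $x_{k+1} = f_k(x_k, u_k, w_k)$. Multistep lookahead variants replace the single stage above with optimization over several stages, followed by $\tilde{J}$ at the end of the lookahead horizon.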
We focus selectively on four types of methods for obtaining $\tilde{J}$:
(a) Problem approximation: Here $\tilde{J}$ is the optimal cost function of a related simpler problem, which is solved by exact DP. Certainty equivalent control and enforced decomposition schemes are discussed in some detail.
(b) Rollout and model predictive control: Here $\tilde{J}$ is the cost function of some known heuristic policy. The needed cost values to implement a rollout policy are often calculated by simulation. While this method applies to stochastic problems, the reliance on simulation favors deterministic problems, including challenging combinatorial problems for which heuristics may be readily implemented. Rollout may also
Dimitri P. Bertsekas
Winter 2018