Application of Reinforcement Learning - Finance
Summary of Notation 15
Overview 17
0.1. Learning Reinforcement Learning . . . . . . . . . . . . . . . . . . . . . . . . 17
0.2. What you’ll learn from this Book . . . . . . . . . . . . . . . . . . . . . . . . . 18
0.3. Expected Background to read this Book . . . . . . . . . . . . . . . . . . . . . 19
0.4. Decluttering the Jargon linked to Reinforcement Learning . . . . . . . . . . 20
0.5. Introduction to the Markov Decision Process (MDP) framework . . . . . . . 23
0.6. Real-world problems that fit the MDP framework . . . . . . . . . . . . . . . 25
0.7. The inherent difficulty in solving MDPs . . . . . . . . . . . . . . . . . . . . . 26
0.8. Value Function, Bellman Equation, Dynamic Programming and RL . . . . . 27
0.9. Outline of Chapters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
0.9.1. Module I: Processes and Planning Algorithms . . . . . . . . . . . . . 29
0.9.2. Module II: Modeling Financial Applications . . . . . . . . . . . . . . 30
0.9.3. Module III: Reinforcement Learning Algorithms . . . . . . . . . . . . 32
0.9.4. Module IV: Finishing Touches . . . . . . . . . . . . . . . . . . . . . . . 33
0.9.5. Short Appendix Chapters . . . . . . . . . . . . . . . . . . . . . . . . . 33
1. Markov Processes 59
1.1. The Concept of State in a Process . . . . . . . . . . . . . . . . . . . . . . . . . 59
1.2. Understanding Markov Property from Stock Price Examples . . . . . . . . . 59
1.3. Formal Definitions for Markov Processes . . . . . . . . . . . . . . . . . . . . 65
1.3.1. Starting States . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
1.3.2. Terminal States . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
1.3.3. Markov Process Implementation . . . . . . . . . . . . . . . . . . . . . 68
1.4. Stock Price Examples modeled as Markov Processes . . . . . . . . . . . . . . 70
1.5. Finite Markov Processes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
1.6. Simple Inventory Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
1.7. Stationary Distribution of a Markov Process . . . . . . . . . . . . . . . . . . . 76
1.8. Formalism of Markov Reward Processes . . . . . . . . . . . . . . . . . . . . . 80
1.9. Simple Inventory Example as a Markov Reward Process . . . . . . . . . . . . 82
1.10. Finite Markov Reward Processes . . . . . . . . . . . . . . . . . . . . . . . . . 84
1.11. Simple Inventory Example as a Finite Markov Reward Process . . . . . . . . 86
1.12. Value Function of a Markov Reward Process . . . . . . . . . . . . . . . . . . . 88
1.13. Summary of Key Learnings from this Chapter . . . . . . . . . . . . . . . . . 92
4.2. Linear Function Approximation . . . . . . . . . . . . . . . . . . . . . . . . . . 168
4.3. Neural Network Function Approximation . . . . . . . . . . . . . . . . . . . . 174
4.4. Tabular as a form of FunctionApprox . . . . . . . . . . . . . . . . . . . . . . . 186
4.5. Approximate Policy Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . 188
4.6. Approximate Value Iteration . . . . . . . . . . . . . . . . . . . . . . . . . . . . 190
4.7. Finite-Horizon Approximate Policy Evaluation . . . . . . . . . . . . . . . . . 190
4.8. Finite-Horizon Approximate Value Iteration . . . . . . . . . . . . . . . . . . . 192
4.9. Finite-Horizon Approximate Q-Value Iteration . . . . . . . . . . . . . . . . . 192
4.10. How to Construct the Non-Terminal States Distribution . . . . . . . . . . . . 194
4.11. Key Takeaways from this Chapter . . . . . . . . . . . . . . . . . . . . . . . . . 195
7.8. Optimal Exercise of American Options cast as a Finite MDP . . . . . . . . . . 258
7.9. Generalizing to Optimal-Stopping Problems . . . . . . . . . . . . . . . . . . 265
7.10. Pricing/Hedging in an Incomplete Market cast as an MDP . . . . . . . . . . 267
7.11. Key Takeaways from this Chapter . . . . . . . . . . . . . . . . . . . . . . . . . 269
10.7. Conceptual Linkage between DP and TD algorithms . . . . . . . . . . . . . . 380
10.8. Convergence of RL Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . 384
10.9. Key Takeaways from this Chapter . . . . . . . . . . . . . . . . . . . . . . . . . 385
IV. Finishing Touches 451
Appendix 505
B. Portfolio Theory 509
B.1. Setting and Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 509
B.2. Portfolio Returns . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 509
B.3. Derivation of Efficient Frontier Curve . . . . . . . . . . . . . . . . . . . . . . 509
B.4. Global Minimum Variance Portfolio (GMVP) . . . . . . . . . . . . . . . . . . 510
B.5. Orthogonal Efficient Portfolios . . . . . . . . . . . . . . . . . . . . . . . . . . 510
B.6. Two-fund Theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 510
B.7. An example of the Efficient Frontier for 16 assets . . . . . . . . . . . . . . . . 511
B.8. CAPM: Linearity of Covariance Vector w.r.t. Mean Returns . . . . . . . . . . 511
B.9. Useful Corollaries of CAPM . . . . . . . . . . . . . . . . . . . . . . . . . . . . 512
B.10. Cross-Sectional Variance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 512
B.11. Efficient Set with a Risk-Free Asset . . . . . . . . . . . . . . . . . . . . . . . . 512
Bibliography 537
Preface
We (Ashwin and Tikhon) have spent all of our educational and professional lives oper-
ating in the intersection of Mathematics and Computer Science - Ashwin for more than 3
decades and Tikhon for more than a decade. During these periods, we’ve commonly held
two obsessions. The first is to bring together the (strangely) disparate worlds of Math-
ematics and Software Development. The second is to focus on the pedagogy of various
topics across Mathematics and Computer Science. Fundamentally, this book was born out
of a deep desire to release our twin obsessions so as to not just educate the next generation
of scientists and engineers, but to also present some new and creative ways of teaching
technically challenging topics in ways that are easy to absorb and retain.
Apart from these common obsessions, each of us has developed some expertise in a few
topics that come together in this book. Ashwin’s undergraduate and doctoral education
was in Computer Science, with specializations in Algorithms, Discrete Mathematics and
Abstract Algebra. He then spent more than two decades of his career (across the Finance
and Retail industries) in the realm of Computational Mathematics, recently focused on
Machine Learning and Optimization. In his role as an Adjunct Professor at Stanford Uni-
versity, Ashwin specializes in Reinforcement Learning and Mathematical Finance. The
content of this book is essentially an expansion of the content of the course CME 241 he
teaches at Stanford. Tikhon’s education is in Computer Science and he has specialized in
Software Design, with an emphasis on treating software design as mathematical specifi-
cation of “what to do” versus computational mechanics of “how to do.” This is a powerful
way of developing software, particularly for mathematical applications, significantly im-
proving readability, modularity and correctness. This leads to code that naturally and
clearly reflects the mathematics, thus blurring the artificial lines between Mathematics
and Programming. He has also championed the philosophy of leveraging programming
as a powerful way to learn mathematical concepts. Ashwin has been greatly influenced by
Tikhon on this philosophy and both of us have been quite successful in imparting to our
students a deep understanding of a variety of mathematical topics by using programming
as a powerful pedagogical tool.
In fact, the key distinguishing feature of this book is to promote learning through an ap-
propriate blend of A) intuitive understanding of the concepts, B) mathematical rigor, and
C) programming of the models and algorithms (with sound software design principles
that reflect the mathematical specification). We’ve found this unique approach to teach-
ing facilitates strong retention of the concepts because students are active learners when
they code everything they are learning, in a manner that reflects the mathematical con-
cepts. We have strived to create a healthy balance between content accessibility and in-
tuition development on one hand versus technical rigor and completeness on the other
hand. Throughout the book, we provide proper mathematical notation, theorems (and
sometimes formal proofs) as well as well-designed working code for various models and
algorithms. But we have always accompanied this formalism with intuition development
using simple examples and appropriate visualization.
We want to highlight that this book emphasizes the foundational components of Reinforce-
ment Learning - Markov Decision Processes, Bellman Equations, Fixed-Points, Dynamic
Programming, Function Approximation, Sampling, Experience-Replay, Batch Methods,
Value-based versus Policy-based Learning, balancing Exploration and Exploitation, blend-
ing Planning and Learning etc. So although we have covered several key algorithms in
this book, we do not dwell on specifics of the algorithms - rather, we emphasize the core
principles and always allow for various types of flexibility in tweaking those algorithms
(our investment in modular software design of the algorithms facilitates this flexibility).
Likewise, we have kept the content of the financial applications fairly basic, emphasizing
the core ideas, and developing working code for simplified versions of these financial ap-
plications. Getting these financial applications to be effective in practice is a much more
ambitious endeavor - we don’t attempt that in this book, but we highlight what it would
take to make it work in practice. The theme of this book is understanding of core con-
cepts rather than addressing all the nuances (and frictions) one typically encounters in
the real-world. The financial content in this book is a significant fraction of the broader
topic of Mathematical Finance and we hope that this book provides the side benefit of a
fairly quick yet robust education in the key topics of Portfolio Management, Derivatives
Pricing and Order-Book Trading.
We were introduced to modern Reinforcement Learning by the works of Richard Sutton,
including his seminal book with Andrew Barto. There are several other works by other
authors, many with more mathematical detail, but we found Sutton’s works much easier
to learn this topic from. Our book tends to follow Sutton’s less rigorous but more intuitive
approach, but we provide a bit more mathematical formalism/detail and we use precise
working code instead of the typical pseudo-code found in textbooks on these topics. We
have also been greatly influenced by David Silver’s excellent RL lecture series at Univer-
sity College London that is available on YouTube. We have strived to follow the structure
of David Silver’s lecture series, typically augmenting it with more detail. So it pays to em-
phasize that the content of this book is not our original work. Rather, our contribution is
to present content that is widely and publicly available in a manner that is easier to learn
(particularly due to our augmented approach of “learning by coding”). Likewise, the fi-
nancial content is not our original work - it is based on standard material on Mathematical
Finance and based on a few papers that treat the Financial problems as Stochastic Control
problems. However, we found the presentation in these papers not easy to understand for
the typical student. Moreover, some of these papers did not explicitly model these prob-
lems as Markov Decision Processes, and some of them did not consider Reinforcement
Learning as an option to solve these problems. So we presented the content in these pa-
pers in more detail, specifically with clearer notation and explanations, and with working
Python code. It’s interesting to note that Ashwin worked on some of these finance prob-
lems during his time at Goldman Sachs and Morgan Stanley, but at that time, these prob-
lems were not viewed from the lens of Stochastic Control. While designing the content of
CME 241 at Stanford, Ashwin realized that several problems from his past finance career
can be cast as Markov Decision Processes, which led him to the above-mentioned papers,
which in turn led to the content creation for CME 241, that then extended into the finan-
cial content in this book. There are several Appendices in this book to succinctly provide
appropriate pre-requisite mathematical/financial content. We have strived to provide ref-
erences throughout the chapters and appendices to enable curious students to learn each
topic in more depth. However, we are not exhaustive in our list of references because typ-
ically each of our references tends to be fairly exhaustive in the papers/books it in turn
references.
We have many people to thank - those who provided the support and encouragement
for us to write this book. Firstly, we would like to thank our managers at Target Cor-
poration - Paritosh Desai and Mike McNamara. Ashwin would like to thank all of the
faculty and staff at Stanford University he works with, for providing a wonderful environ-
ment and excellent support for CME 241, notably George Papanicolau, Kay Giesecke, Peter
Glynn, Gianluca Iaccarino, Indira Choudhury and Jess Galvez. Ashwin would also like to
thank all his students who implicitly proof-read the contents, and his course assistants
Sven Lerner and Jeff Gu. Tikhon would like to thank Paritosh Desai for help exploring the
software design and concepts used in the book as well as his immediate team at Target for
teaching him about how Markov Decision Processes apply in the real world.
Summary of Notation
Z Set of integers
Z+ Set of positive integers, i.e., {1, 2, 3, . . .}
Z≥0 Set of non-negative integers, i.e., {0, 1, 2, 3, . . .}
R Set of real numbers
R+ Set of positive real numbers
R≥0 Set of non-negative real numbers
log(x) Natural Logarithm (to the base e) of x
|x| Absolute Value of x
sign(x) +1 if x > 0, -1 if x < 0, 0 if x = 0
[a, b] Set of real numbers that are ≥ a and ≤ b. The notation x ∈ [a, b]
is shorthand for x ∈ R and a ≤ x ≤ b
[a, b) Set of real numbers that are ≥ a and < b. The notation x ∈ [a, b)
is shorthand for x ∈ R and a ≤ x < b
(a, b] Set of real numbers that are > a and ≤ b. The notation x ∈ (a, b]
is shorthand for x ∈ R and a < x ≤ b
∅ The Empty Set (Null Set)
∑ᵢ₌₁ⁿ aᵢ Sum of terms a1 , a2 , . . . , an
∏ᵢ₌₁ⁿ aᵢ Product of terms a1 , a2 , . . . , an
≈ approximately equal to
x∈X x is an element of the set X
x∈/X x is not an element of the set X
X ∪Y Union of the sets X and Y
X ∩Y Intersection of the sets X and Y
X −Y Set Difference of the sets X and Y, i.e., the set of elements within
the set X that are not elements of the set Y
X ×Y Cartesian Product of the sets X and Y
Xk For a set X and an integer k ≥ 1, this refers to the Cartesian
Product X ×X ×. . .×X with k occurrences of X in the Cartesian
Product (note: X 1 = X )
f :X→Y Function f with Domain X and Co-domain Y
fk For a function f and an integer k ≥ 0, this refers to the function
composition of f with itself, repeated k times. So, f k (x) is the
value f (f (. . . f (x) . . .)) with k occurrences of f in this function-
composition expression (note: f 1 = f and f 0 is the identity
function)
f −1 Inverse function of a bijective function f : X → Y, i.e., for all
x ∈ X , f −1 (f (x)) = x and for all y ∈ Y, f (f −1 (y)) = y
f ′ (x0 ) Derivative of the function f : X → R with respect to its domain
variable x ∈ X , evaluated at x = x0
f ′′ (x0 ) Second Derivative of the function f : X → R with respect to its
domain variable x ∈ X , evaluated at x = x0
P[X] Probability Density Function (PDF) of random variable X
P[X = x] Probability that random variable X takes the value x
P[X|Y ] Probability Density Function (PDF) of random variable X, con-
ditional on the value of random variable Y (i.e., PDF of X ex-
pressed as a function of the values of Y )
P[X = x|Y = y] Probability that random variable X takes the value x, condi-
tional on random variable Y taking the value y
E[X] Expected Value of random variable X
E[X|Y ] Expected Value of random variable X, conditional on the value
of random variable Y (i.e., Expected Value of X expressed as a
function of the values of Y )
E[X|Y = y] Expected Value of random variable X, conditional on random
variable Y taking the value y
x ∼ N (µ, σ 2 ) Random variable x follows a Normal Distribution with mean µ
and variance σ 2
x ∼ Poisson(λ) Random variable x follows a Poisson Distribution with mean λ
f (x; w) Here f refers to a parameterized function with domain X (x ∈
X ), w refers to the parameters controlling the definition of the
function f
vT Row-vector with components equal to the components of the
Column-vector v, i.e., Transpose of the Column-vector v (by de-
fault, we assume vectors are expressed as Column-vectors)
AT Transpose of the matrix A
|v| L2 norm of vector v ∈ Rᵐ , i.e., if v = (v1 , v2 , . . . , vm ), then
|v| = √(v1² + v2² + . . . + vm²)
Overview
• Focus on the foundational theory underpinning RL. Our treatment of this theory is
based on undergraduate-level Probability, Optimization, Statistics and Linear Alge-
bra. We emphasize rigorous but simple mathematical notations and formulations
in developing the theory, and encourage you to write out the equations rather than
just reading from the book. Occasionally, we invoke some advanced mathematics
(eg: Stochastic Calculus) but the majority of the book is based on easily understand-
able mathematics. In particular, two basic theory concepts - Bellman Optimality
Equation and Generalized Policy Iteration - are emphasized throughout the book as
they form the basis of pretty much everything we do in RL, even in the most ad-
vanced algorithms.
• Parallel to the mathematical rigor, we bring the concepts to life with simple exam-
ples and informal descriptions to help you develop an intuitive understanding of the
mathematical concepts. We drive towards creating appropriate mental models to
visualize the concepts. Often, this involves turning mathematical abstractions into
physical examples (emphasizing visual intuition). So we go back and forth between
rigor and intuition, between abstractions and visuals, so as to blend them nicely and
get the best of both worlds.
• Each time you learn a new mathematical concept or algorithm, we encourage you to
write small pieces of code (in Python) that implements the concept/algorithm. As an
example, if you just learned a surprising theorem, we’d ask you to write a simulator
to simply verify the statement of the theorem. We emphasize this approach not just to
bolster the theoretical and intuitive understanding with a hands-on experience, but
also because there is a strong emotional effect of seeing expected results emanating
from one’s code, which in turn promotes long-term retention of the concepts. Most
importantly, we avoid messy and complicated ML/RL/BigData tools/packages and
stick to bare-bones Python/numpy as these unnecessary tools/packages are huge
blockages to core understanding. We believe coding from scratch and designing
the code to reflect the mathematical structure/concepts is the correct approach to
truly understand the concepts/algorithms.
• Lastly, it is important to work with examples that are A) simplified versions of real-
world problems in a business domain rich with applications, B) adequately com-
prehensible without prior business-domain knowledge, C) intellectually interesting
and D) sufficiently marketable to employers. We’ve chosen Financial Trading ap-
plications. For each financial problem, we first cover the traditional approaches (in-
cluding solutions from landmark papers) and then cast the problem in ways that
can be solved with RL. We have made considerable effort to make this book self-
contained in terms of the financial knowledge required to navigate these prob-
lems.
• You will learn about the simple but powerful theory of Markov Decision Processes
(MDPs) – a framework for Sequential Optimal Decisioning under Uncertainty. You
will firmly understand the power of Bellman Equations, which is at the heart of all
Dynamic Programming as well as all RL algorithms.
• You will master Dynamic Programming (DP) Algorithms, which are a class of (in
the language of AI) Planning Algorithms. You will learn about Policy Iteration,
Value Iteration, Backward Induction, Approximate Dynamic Programming and the
all-important concept of Generalized Policy Iteration which lies at the heart of all DP
as well as all RL algorithms.
• You will gain a solid understanding of a variety of Reinforcement Learning (RL) Al-
gorithms, starting with the basic algorithms like SARSA and Q-Learning and mov-
ing on to several important algorithms that work well in practice, including Gradient
Temporal Difference, Deep Q-Networks, Least-Squares Policy Iteration, Policy Gra-
dient, Monte-Carlo Tree Search. You will learn about how to gain advantages in these
algorithms with bootstrapping, off-policy learning and deep-neural-networks-based
function approximation. You will learn how to balance exploration and exploitation
with Multi-Armed Bandits techniques like Upper Confidence Bounds, Thompson
Sampling, Gradient Bandits and Information State-Space algorithms. You will also
learn how to blend Planning and Learning methodologies, which is very important
in practice.
reflect the mathematical principles). The larger take-away from this book will be
a rare (and high-in-demand) ability to blend Applied Mathematics concepts with
Software Design paradigms.
• We treat each of the above problems as MDPs (i.e., Optimal Decisioning formula-
tions), first going over classical/analytical solutions to these problems, then intro-
ducing real-world frictions/considerations, and tackling them with DP and/or RL.
• We implement a wide range of Algorithms and develop various models in a git code
base that we refer to and explain in detail throughout the book. This code base not
only provides detailed clarity on the algorithms/models, but also serves to educate
on healthy programming patterns suitable not just for RL, but more generally for any
Applied Mathematics work.
• Experience with (but not necessarily expertise in) Python is expected and a good
deal of comfort with numpy is required. Note that much of the Python program-
ming in this book is for mathematical modeling and for numerical algorithms, so one
doesn’t need to know Python from the perspective of building engineering systems
or user-interfaces. So you don’t need to be a professional software developer/engineer
but you need to have a healthy interest in learning Python best practices associated
with mathematical modeling, algorithms development and numerical programming
(we teach these best practices in this book). We don’t use any of the popular (but
messy and complicated) Big Data/Machine Learning libraries such as Pandas, PyS-
park, scikit, Tensorflow, PyTorch, OpenCV, NLTK etc. (all you need to know is
numpy).
• Familiarity with git and use of an Integrated Development Environment (IDE), eg:
Pycharm or Emacs (with Python plugins), is recommended, but not required.
• Familiarity with LaTeX for writing equations is recommended, but not required (other
typesetting tools, or even hand-written math is fine, but LaTeX is a skill that is very
valuable if you’d like a future in the general domain of Applied Mathematics).
• You need to be strong in undergraduate-level Probability as it is the most important
foundation underpinning RL.
• You will also need to have some preparation in undergraduate-level Numerical Op-
timization, Statistics, Linear Algebra.
• No background in Finance is required, but a strong appetite for Mathematical Fi-
nance is required.
Figure 0.1.: Many Faces of Reinforcement Learning (Image Credit: David Silver’s RL
Course)
More importantly, the class of problems RL aims to solve can be described with a simple
yet powerful mathematical framework known as Markov Decision Processes (abbreviated as
MDPs). We have an entire chapter dedicated to deep coverage of MDPs, but we provide
a quick high-level introduction to MDPs in the next section.
Figure 0.2.: Branches of Machine Learning (Image Credit: David Silver’s RL Course)
0.5. Introduction to the Markov Decision Process (MDP) framework
The framework of a Markov Decision Process is depicted in Figure 0.3. As the Figure in-
dicates, the Agent and the Environment interact in a time-sequenced loop. The term Agent
refers to an algorithm (AI algorithm) and the term Environment refers to an abstract entity
that serves up uncertain outcomes to the Agent. It is important to note that the Environ-
ment is indeed abstract in this framework and can be used to model all kinds of real-world
situations such as the financial market serving up random stock prices or customers of
a company serving up random demand or a chess opponent serving up random moves
(from the perspective of the Agent), or really anything at all you can imagine that serves
up something random at each time step (it is up to us to model an Environment appropri-
ately to fit the MDP framework).
As the Figure indicates, at each time step t, the Agent observes an abstract piece of infor-
mation (which we call State) and a numerical (real number) quantity that we call Reward.
Note that the concept of State is indeed completely abstract in this framework and we can
model State to be any data type, as complex or elaborate as we’d like. This flexibility in
modeling State permits us to model all kinds of real-world situations as an MDP. Upon
observing a State and a Reward at time step t, the Agent responds by taking an Action.
Again, the concept of Action is completely abstract and is meant to represent an activity
performed by an AI algorithm. It could be a purchase or sale of a stock responding to
market stock price movements, or it could be movement of inventory from a warehouse
to a store in response to large sales at the store, or it could be a chess move in response
to the opponent’s chess move (opponent is Environment), or really anything at all you can
imagine that responds to observations (State and Reward) served by the Environment.
Upon receiving an Action from the Agent at time step t, the Environment responds (with
time ticking over to t + 1) by serving up the next time step’s random State and random
Reward. A technical detail (that we shall explain in detail later) is that the State is assumed
to have the Markov Property, which means:
• The next State/Reward depends only on Current State (for a given Action).
• The current State encapsulates all relevant information from the history of the inter-
action between the Agent and the Environment.
• The current State is a sufficient statistic of the future (for a given Action).
The goal of the Agent at any point in time is to maximize the Expected Sum of all future
Rewards by controlling (at each time step) the Action as a function of the observed State
(at that time step). This function from a State to Action at any time step is known as the
Policy function. So we say that the agent’s job is to exercise control by determining the opti-
mal Policy function. Hence, this is a dynamic (i.e., time-sequenced) control system under
uncertainty. If the above description was too terse, don’t worry - we will explain all of this
in great detail in the coming chapters. For now, we just wanted to provide a quick flavor
of what the MDP framework looks like. Now we sketch the above description with some
(terse) mathematical notation to provide a bit more of the overview of the MDP frame-
work. The following notation is for discrete time steps (continuous time steps notation is
analogous, but technically more complicated to describe here):
We denote time steps as t = 1, 2, 3, . . .. Markov State at time t is denoted as St ∈ S where
S is referred to as the State Space (a countable set). Action at time t is denoted as At ∈ A
where A is referred to as the Action Space (a countable set). Reward at time t is denoted as
Rt ∈ R (a real number).
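To make this interaction loop concrete, the sketch below shows its bare structure in Python.
The Environment and Agent classes, their method names, and the toy dynamics are purely
illustrative assumptions for this overview (the framework developed later in the book uses
richer, more carefully designed abstractions), but the loop itself is exactly the time-sequenced
State → Action → (next State, Reward) cycle described above.

import random

class Environment:
    """Illustrative Environment: serves up the next State and a Reward at each time step."""
    def step(self, state, action):
        next_state = state + action        # toy transition dynamics
        reward = -abs(next_state)          # toy Reward: staying near 0 is rewarded
        return next_state, reward

class Agent:
    """Illustrative Agent: a Policy maps the observed State to an Action."""
    def policy(self, state):
        return random.choice([-1, 0, 1])   # a placeholder (random) Policy

env, agent = Environment(), Agent()
state, total_reward = 0, 0.0
for t in range(10):                          # the Agent-Environment loop
    action = agent.policy(state)             # Agent responds to the observed State
    state, reward = env.step(state, action)  # Environment serves next State and Reward
    total_reward += reward                   # Agent's goal: maximize the sum of Rewards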
Figure 0.4.: Baby Learning MDP
We have a room in a house with a vase atop a bookcase. At the top of the Figure is a
baby (learning Agent) on the other side of the room who wants to make her way to the
bookcase, reach for the vase, and topple it - doing this efficiently (i.e., in quick time and
quietly) would mean a large Reward for the baby. At each time step, the baby finds her-
self in a certain posture (eg: lying on the floor, or sitting up, or trying to walk etc.) and
observes various visuals around the room - her posture and her visuals would constitute
the State for the baby at each time step. The baby’s Actions are various options of physical
movements to try to get to the other side of the room (assume the baby is still learning how
to walk). The baby tries one physical movement, but is unable to move forward with that
movement. That would mean a negative Reward - the baby quickly learns that this move-
ment is probably not a good idea. Then she tries a different movement, perhaps trying to
stand on her feet and start walking. She makes a couple of good steps forward (positive
Rewards), but then falls down and hurts herself (that would be a big negative Reward). So
by trial and error, the baby learns about the consequences of different movements (dif-
ferent actions). Eventually, the baby learns that by holding on to the couch, she can walk
across, and then when she reaches the bookcase, she learns (again by trial and error) a
technique to climb the bookcase that is quick yet quiet (so she doesn’t raise her mom’s
attention). This means the baby learns of the optimal policy (best actions for each of the
states she finds herself in) after essentially what is a “trial and error” method of learning
what works and what doesn’t. This example is essentially generalized in the MDP frame-
work, and the baby’s “trial and error” way of learning is essentially a special case of the
general technique of Reinforcement Learning.
Figure 0.5.: Self-driving Car MDP
Figure 0.5 illustrates the MDP for a self-driving car. At the top of the figure is the Agent
(the car’s driving algorithm) and at the bottom of the figure is the Environment (constitut-
ing everything the car faces when driving - other vehicles, traffic signals, road conditions,
weather etc.). The State consists of the car’s location, velocity, and all of the information
picked up by the car’s sensors/cameras. The Action consists of the steering, acceleration
and brake. The Reward would be a combination of metrics on ride comfort and safety,
as well as the negative of each time step (because maximizing the accumulated Reward
would then amount to minimizing time taken to reach the destination).
Lastly, when there are many actions, the Agent needs to try them all to check if there
are some hidden gems (great actions that haven’t been tried yet), which in turn means
one could end up wasting effort on “duds” (bad actions). So the agent has to find the
balance between exploitation (retrying actions which have yielded good rewards so far)
and exploration (trying actions that have either not been tried enough or not been tried at
all).
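To make the exploration-exploitation trade-off concrete, here is a minimal sketch of the
ε-greedy rule, one of the simplest ways of striking this balance. The action-value estimates
and the value of ε below are illustrative assumptions for this example only.

import random

def epsilon_greedy(value_estimates, epsilon=0.1):
    """With probability epsilon explore a random action, else exploit the best estimate."""
    if random.random() < epsilon:
        return random.randrange(len(value_estimates))                          # explore
    return max(range(len(value_estimates)), key=lambda a: value_estimates[a])  # exploit

estimates = [0.1, 0.5, 0.9, 0.3]   # assumed estimates of each action's reward
chosen = [epsilon_greedy(estimates, epsilon=0.2) for _ in range(1000)]
# Most choices will be action 2 (exploitation), with occasional exploratory picks.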
All of this seems to indicate that we don’t have much hope in solving MDPs in a reliable
and efficient manner. But it turns out that with some clever mathematics, we can indeed
make some good inroads. We outline the core idea of this “clever mathematics” in the
next section.
0.8. Value Function, Bellman Equation, Dynamic Programming and RL

For a given policy π, the Value Function V π : S → R gives, for each state s, the Expected Sum
of future (discounted) Rewards obtained by following the policy π from state s. It satisfies the
following recursive formulation:

V π (s) = Σ_{r,s′} p(r, s′ |s, π(s)) · (r + γ · V π (s′ )) for all s ∈ S

Here γ ∈ [0, 1] is a specified discount factor and p(r, s′ |s, a) denotes the probability of reward
r and next state s′ , given current state s and action a.
This equation says that when the Agent is in a given state s, it takes an action a = π(s),
then sees a random next state s′ and a random reward r, so V π (s) can be broken into
the expectation of r (immediate next step’s expected reward) and the remainder of the
future expected accumulated rewards (which can be written in terms of the expectation
of V π (s′ )). We won’t get into the details of how to solve this recursive formulation in
this chapter (will cover this in great detail in future chapters), but it’s important for you
to recognize for now that this recursive formulation is the key to evaluating the Value
Function for a given policy.
However, evaluating the Value Function for a given policy is not the end goal - it is
simply a means to the end goal of evaluating the Optimal Value Function (from which we
obtain the Optimal Policy). The Optimal Value Function V ∗ : S → R is defined as:

V ∗ (s) = max_π V π (s) for all s ∈ S
The good news is that even the Optimal Value Function can be expressed recursively, as
follows:
V ∗ (s) = max_a Σ_{r,s′} p(r, s′ |s, a) · (r + γ · V ∗ (s′ )) for all s ∈ S
Furthermore, we can prove that there exists an Optimal Policy π ∗ achieving V ∗ (s) for
all s ∈ S (the proof is constructive, which gives a simple method to obtain the function
π ∗ from the function V ∗ ). Specifically, this means that the Value Function obtained by
following the optimal policy π ∗ is the same as the Optimal Value Function V ∗ , i.e.,
V π∗ (s) = V ∗ (s) for all s ∈ S
There is a bit of terminology here to get familiar with. The problem of calculating V π (s)
(Value Function for a given policy) is known as the Prediction problem (since this amounts
to statistical estimation of the expected returns from any given state when following a
policy π). The problem of calculating the Optimal Value Function V ∗ (and hence, Opti-
mal Policy π ∗ ), is known as the Control problem (since this requires steering of the policy
such that we obtain the maximum expected return from any state). Solving the Prediction
problem is typically a stepping stone towards solving the (harder) problem of Control.
These recursive equations for V π and V ∗ are known as the (famous) Bellman Equations
(which you will hear a lot about in future chapters). In a continuous-time formulation,
the Bellman Equation is referred to as the famous Hamilton-Jacobi-Bellman (HJB) equation.
The algorithms to solve the prediction and control problems based on Bellman equations
are broadly classified as:
• Dynamic Programming algorithms (when a model of the Environment’s transition probabilities is available), and
• Reinforcement Learning algorithms (when such a model is not available).
Now let’s talk a bit about the difference between Dynamic Programming and Reinforce-
ment Learning algorithms. Dynamic Programming algorithms (which we cover a lot of in
this book) assume that the agent knows of the transition probabilities p and the algorithm
takes advantage of the knowledge of those probabilities (leveraging the Bellman Equation
to efficiently calculate the Value Function). Dynamic Programming algorithms are con-
sidered to be Planning and not Learning (in the language of A.I.) because the algorithm
doesn’t need to interact with the Environment and doesn’t need to learn from the (states,
rewards) data stream coming from the Environment. Rather, armed with the transition
probabilities, the algorithm can reason about future probabilistic outcomes and perform
the requisite optimization calculation to infer the Optimal Policy. So it plans its path to
success, rather than learning about how to succeed.
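As a concrete illustration of this Planning view, the sketch below runs tabular Value Iteration
on a tiny made-up MDP, directly iterating the recursive equation for V ∗ given earlier and then
reading off a greedy policy. The two-state MDP and its transition probabilities are assumptions
invented just for this example.

# A toy MDP: p[s][a] is a list of (probability, reward, next_state) triples
p = {
    "s0": {"stay": [(1.0, 0.0, "s0")], "go": [(0.8, 1.0, "s1"), (0.2, -1.0, "s0")]},
    "s1": {"stay": [(1.0, 2.0, "s1")], "go": [(1.0, 0.0, "s0")]},
}
gamma = 0.9
v = {s: 0.0 for s in p}                               # initial guess for V*

def q_value(s, a):                                    # expected reward + discounted value
    return sum(prob * (r + gamma * v[s1]) for prob, r, s1 in p[s][a])

for _ in range(1000):                                 # iterate the Bellman Optimality update
    v = {s: max(q_value(s, a) for a in p[s]) for s in p}

optimal_policy = {s: max(p[s], key=lambda a: q_value(s, a)) for s in p}
# v now approximates V*, and optimal_policy is greedy with respect to it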
However, in typical real-world situations, one doesn’t really know the transition proba-
bilities p. This is the realm of Reinforcement Learning (RL). RL algorithms interact with
the Environment, learn with each new (state, reward) pair received from the Environ-
ment, and incrementally figure out the Optimal Value Function (with the “trial and error”
approach that we outlined earlier). However, note that the Environment interaction could
be real interaction or simulated interaction. In the latter case, we do have a model of the tran-
sitions but the structure of the model is so complex that we only have access to samples of
the next state and reward (rather than an explicit representation of the probabilities). This
is known as a Sampling Model of the Environment. With access to such a sampling model
of the environment (eg: a robot learning on a simulated terrain), we can employ the same
RL algorithm that we would have used when interacting with a real environment (eg: a
robot learning on an actual terrain). In fact, most RL algorithms in practice learn from
simulated models of the environment. As we explained earlier, RL is essentially a “trial
and error” learning approach and hence, is quite laborious and fundamentally inefficient.
The recent progress in RL is coming from more efficient ways of learning the Optimal
Value Function, and better ways of approximating the Optimal Value Function. One of
the key challenges for RL in the future is to identify better ways of finding the balance be-
tween “exploration” and “exploitation” of actions. In any case, one of the key reasons RL
has started doing well lately is due to the assistance it has obtained from Deep Learning
(typically Deep Neural Networks are used to approximate the Value Function and/or to
approximate the Policy Function). RL with such deep learning approximations is known
by the catchy modern term Deep RL.
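To contrast the Learning view with the Planning sketch above, here is a minimal sketch of
tabular Q-Learning (one of the basic RL algorithms covered later in this book), which learns
purely from sampled transitions. The env_step function below is an assumed sampling model
of the same toy two-state MDP used in the Value Iteration sketch; the algorithm itself never
looks at the transition probabilities.

import random
from collections import defaultdict

def env_step(s, a):                        # sampling model: returns a sampled (next_state, reward)
    if s == "s0" and a == "go":
        return ("s1", 1.0) if random.random() < 0.8 else ("s0", -1.0)
    if s == "s1" and a == "go":
        return ("s0", 0.0)
    return (s, 0.0 if s == "s0" else 2.0)  # "stay" keeps the current state

states, actions = ["s0", "s1"], ["stay", "go"]
alpha, gamma, epsilon = 0.1, 0.9, 0.1
q = defaultdict(float)                     # Q-Value estimates, Q(s, a)

for _ in range(5000):                      # episodes of trial-and-error interaction
    s = random.choice(states)
    for _ in range(20):
        a = (random.choice(actions) if random.random() < epsilon
             else max(actions, key=lambda a_: q[(s, a_)]))          # explore vs. exploit
        s1, r = env_step(s, a)                                      # learn from a single sample
        target = r + gamma * max(q[(s1, a_)] for a_ in actions)
        q[(s, a)] += alpha * (target - q[(s, a)])                   # incremental update
        s = s1
# The greedy policy with respect to q approximates the Optimal Policy of the toy MDP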
We believe the current promise of A.I. is dependent on the success of Deep RL. The
next decade will be exciting as RL research will likely yield improved algorithms and its
pairing with Deep Learning will hopefully enable us to solve fairly complex real-world
stochastic control problems.
Before covering the contents of the chapters in these 4 modules, the book starts with 2
unnumbered chapters. The first of these unnumbered chapters is this chapter (the one you
are reading) which serves as an Overview, covering the pedagogical aspects of learning
RL (and more generally Applied Math), outline of the learnings to be acquired from this
book, background required to read this book, a high-level overview of Stochastic Con-
trol, MDP, Value Function, Bellman Equation and RL, and finally the outline of chapters
in this book. The second unnumbered chapter is called Programming and Design. Since
this book makes heavy use of Python code for developing mathematical models and for
algorithm implementations, we cover the requisite Python background (specifically the
design paradigms we use in our Python code) in this chapter. To be clear, this chapter
is not a full Python tutorial – the reader is expected to have some background in Python
already. It is a tutorial of some key techniques and practices in Python (that many readers
of this book might not be accustomed to) that we use heavily in this book and that are also
highly relevant to programming in the broader area of Applied Mathematics. We cover
the topics of Type Annotations, List and Dict Comprehensions, Functional Programming,
Interface Design with Abstract Base Classes, Generics Programming and Generators.
The remaining chapters in this book are organized in the 4 modules we listed above.
include the concept of Reward (but not Action) - the inclusion of Reward yields a frame-
work known as Markov Reward Process. With Markov Reward Processes, we can talk
about Value Functions and Bellman Equation, which serve as great preparation for under-
standing Value Function and Bellman Equation later in the context of MDPs. Chapter 1
motivates these concepts with examples of stock prices and with a simple inventory ex-
ample that serves first as a Markov Process and then as a Markov Reward Process. There
is also a significant amount of programming content in this chapter to develop comfort as
well as depth in these concepts.
Chapter 2 on Markov Decision Processes lays the foundational theory underpinning RL –
the framework for representing problems dealing with sequential optimal decisioning un-
der uncertainty (Markov Decision Process). You will learn about the relationship between
Markov Decision Processes and Markov Reward Processes, about the Value Function and
the Bellman Equations. Again, there is a considerable amount of programming exercises
in this chapter. The heavy investment in this theory together with hands-on programming
will put you in a highly advantaged position to learn the following chapters in a very clear
and speedy manner.
Chapter 3 on Dynamic Programming covers the Planning technique of Dynamic Program-
ming (DP), which is an important class of foundational algorithms that can be an alter-
native to RL if the MDP is not too large or too complex. Also, learning these algorithms
provides important foundations to be able to understand subsequent RL algorithms more
deeply. You will learn about several important DP algorithms by the end of the chapter
and you will learn about why DP gets difficult in practice which draws you to the mo-
tivation behind RL. Again, we cover plenty of programming exercises that are quick to
implement and will aid considerably in internalizing the concepts. Finally, we emphasize
a special algorithm - Backward Induction - for solving finite-horizon Markov Decision Pro-
cesses, which is the setting for the financial applications we cover in this book.
The Dynamic Programming algorithms covered in Chapter 3 suffer from the two so-
called curses: Curse of Dimensionality and Curse of Modeling. These curses can be cured
with a combination of sampling and function approximation. Module III covers the sam-
pling cure (using Reinforcement Learning). Chapter 4 on Function Approximation and Ap-
proximate Dynamic Programming covers the topic of function approximation and shows
how an intermediate cure - Approximate Dynamic Programming (function approxima-
tion without sampling) - is often quite viable and can be suitable for some problems. As
part of this chapter, we implement linear function approximation and approximation with
deep neural networks (forward and back propagation algorithms) so we can use these ap-
proximations in Approximate Dynamic Programming algorithms and later also in RL.
complex and DP/ADP algorithms don’t quite scale, which means we need to tap into RL
algorithms to solve them. So we revisit these financial applications in Module III when we
cover RL algorithms.
Chapter 6 is titled Dynamic Asset Allocation and Consumption. This chapter covers the
first of the 5 Financial Applications. This problem is about how to adjust the allocation
of one’s wealth to various investment choices in response to changes in financial markets.
The problem also involves how much wealth to consume in each interval over one’s life-
time so as to obtain the best utility from wealth consumption. Hence, it is the joint prob-
lem of (dynamic) allocation of wealth to financial assets and appropriate consumption of
one’s wealth over a period of time. This problem is best understood in the context of Mer-
ton’s landmark paper in 1969 where he stated and solved this problem. This chapter is
mainly focused on the mathematical derivation of Merton’s solution of this problem with
Dynamic Programming. You will also learn how to solve the asset allocation problem in
a simple setting with Approximate Backward Induction (an ADP algorithm covered in
Chapter 4).
Chapter 7 covers a very important topic in Mathematical Finance: Pricing and Hedging
of Derivatives. Full and rigorous coverage of derivatives pricing and hedging is a fairly
elaborate and advanced topic, and beyond the scope of this book. But we have provided a
way to understand the theory by considering a very simple setting - that of a single-period
with discrete outcomes and no provision for the rebalancing of hedges that is typical
in the general theory. Following the coverage of the foundational theory, we cover the
problem of optimal pricing/hedging of derivatives in an incomplete market and the problem
of optimal exercise of American Options (both problems are modeled as MDPs). In this
chapter, you will learn about some highly important financial foundations such as the
concepts of arbitrage, replication, market completeness, and the all-important risk-neutral
measure. You will learn the proofs of the two fundamental theorems of asset pricing in
this simple setting. We also provide an overview of the general theory (beyond this simple
setting). Next you will learn about how to price/hedge derivatives incorporating real-
world frictions by modeling this problem as an MDP. In the final module of this chapter,
you will learn how to model the more general problem of optimal stopping as an MDP.
You will learn how to use Backward Induction (a DP algorithm we learned in Chapter 3)
to solve this problem when the state-space is not too big. By the end of this chapter, you
will have developed significant expertise in pricing and hedging complex derivatives,
a skill that is in high demand in the finance industry.
Chapter 8 on Order-Book Algorithms covers the remaining two Financial Applications,
pertaining to the world of Algorithmic Trading. The current practice in Algorithmic Trad-
ing is to employ techniques that are rules-based and heuristic. However, Algorithmic
Trading is quickly transforming into Machine Learning-based Algorithms. In this chap-
ter, you will be first introduced to the mechanics of trade order placements (market orders
and limit orders), and then introduced to a very important real-world problem – how to
submit a large-sized market order by splitting the shares to be transacted and timing the
splits optimally in order to overcome “price impact” and gain maximum proceeds. You
will learn about the classical methods based on Dynamic Programming. Next you will
learn about the market frictions and the need to tackle them with RL. In the second half
of this chapter, we cover the Algorithmic-Trading twin of the Optimal Execution problem
– that of a market-maker having to submit dynamically-changing bid and ask limit orders
so she can make maximum gains. You will learn about how market-makers (a big and
thriving industry) operate. Then you will learn about how to formulate this problem as
an MDP. We will do a thorough coverage of the classical Dynamic Programming solution
31
by Avellaneda and Stoikov. Finally, you will be exposed to the real-world nuances of this
problem, and hence, the need to tackle it with a market-calibrated simulator and RL.
Chapter 12 on Policy Gradient Algorithms introduces a very different class of RL algo-
rithms that are based on improving the policy using the gradient of the policy function
approximation (rather than the usual policy improvement based on explicit argmax on
Q-Value Function). When action spaces are large or continuous, Policy Gradient tends
to be the only option and so, this chapter is useful to overcome many real-world chal-
lenges (including those in many financial applications) where the action space is indeed
large. You will learn about the mathematical proof of the elegant Policy Gradient Theo-
rem and implement a couple of Policy Gradient Algorithms from scratch. You will learn
about state-of-the-art Actor-Critic methods and a couple of specialized algorithms that
have worked well in practice. Lastly, you will also learn about Evolutionary Strategies,
an algorithm that looks quite similar to Policy Gradient Algorithms, but is technically not
an RL Algorithm. However, learning about Evolutionary Strategies is important because
some real-world applications, including Financial Applications, can indeed be tackled well
with Evolutionary Strategies.
The second appendix is a technical perspective of Function Approximations as Affine Spaces,
which helps develop a deeper mathematical understanding of function approximations.
The third appendix is on Portfolio Theory covering the mathematical foundations of balanc-
ing return versus risk in portfolios and the much-celebrated Capital Asset Pricing Model
(CAPM). The fourth appendix covers the basics of Stochastic Calculus as we need some of
this theory (Ito Integral, Ito’s Lemma etc.) in the derivations in a couple of the chapters in
Module II. The fifth appendix is on the HJB Equation, which is a key part of the derivation
of the closed-form solutions for 2 of the 5 financial applications we cover in Module II. The
sixth appendix covers the derivation of the famous Black-Scholes Equation (and its solution
for Call/Put Options). The seventh appendix covers the formulas for Bayesian updates to
conjugate prior distributions for the parameters of Gaussian and Bernoulli data distribu-
tions.
Programming and Design
Programming is creative work with few constraints: imagine something and you can prob-
ably build it — in many different ways. Liberating and gratifying, but also challenging. Just
like starting a novel from a blank page or a painting from a blank canvas, a new program
is so open that it’s a bit intimidating. Where do you start? What will the system look like?
How will you get it right? How do you split your problem up? How do you prevent your
code from evolving into a complete mess?
There’s no easy answer. Programming is inherently iterative — we rarely get the right
design at first, but we can always edit code and refactor over time. But iteration itself is
not enough; just like a painter needs technique and composition, a programmer needs
patterns and design.
Existing teaching resources tend to deemphasize programming techniques and design.
Theory-heavy and algorithms-heavy books show models and algorithms as self-contained
procedures written in pseudocode, without the broader context (and corresponding de-
sign considerations) of a real codebase. Newer AI/ML materials sometimes take a differ-
ent tack and provide real code examples using industry-strength frameworks, but rarely
touch on software design questions.
In this book, we take a third approach. Starting from scratch, we build a Python frame-
work that reflects the key ideas and algorithms we cover in this book. The abstractions
we define map to the key concepts we introduce; how we structure the code maps to the
relationships between those concepts.
Unlike the pseudocode approach, we do not implement algorithms in a vacuum; rather,
each algorithm builds on abstractions introduced earlier in the book. By starting from
scratch (rather than using an existing ML framework) we keep the code reasonably sim-
ple, without needing to worry about specific examples going out of date. We can focus
on the concepts important to this book while teaching programming and design in situ,
demonstrating an intentional approach to code design.
losophy of code design oriented around defining and combining abstractions that reflect
how we think about our domain. Since code itself can point to specific design ideas and
capabilities, there’s a feedback loop: expanding the programming abstractions we’ve de-
signed can help us find new algorithms and functionality, improving our understanding
of the domain.
Just what is an abstraction? An appropriately abstract question! An abstraction is a
“compound idea”: a single concept that combines multiple separate ideas into one. We
can combine ideas along two axes:
• We can compose different concepts together, thinking about how they behave as one
unit. A car engine has thousands of parts that interact in complex ways, but we can
think about it as a single object for most purposes.
• We can unify different concepts by identifying how they are similar. Different breeds
of dogs might look totally different, but we can think of all of them as dogs.
The human mind can only handle so many distinct ideas at a time — we have an inher-
ently limited working memory. A rather simplified model is that we only have a handful
of “slots” in working memory and we simply can’t track more independent thoughts at
the same time. The way we overcome this limitation is by coming up with new ideas (new
abstractions) that combine multiple concepts into one.
We want to organize code around abstractions for the same reason that we use abstrac-
tions to understand more complex ideas. How do you understand code? Do you run the
program in your head? That’s a natural starting point and it works for simple programs
but it quickly becomes difficult and then impossible. A computer doesn’t have working-
memory limitations and can run billions of instructions a second that we can’t possibly
keep up with. The computer doesn’t need structure or abstraction in the code it runs, but
we need it to have any hope of writing or understanding anything beyond the simplest
of programs. Abstractions in our code group information and logic so that we can think
about rich concepts rather than tracking every single bit of information and every single
instruction separately.
The details may differ, but designing code around abstractions that correspond to a solid
mental model of the domain works well in any area and with any programming language.
It might take some extra up-front thought but, done well, this style of design pays divi-
dends. Our goal is to write code that makes life easier for ourselves; this helps for everything
from “one-off” experimental code through software engineering efforts with large teams.
cd rl-book
¹ If you are not familiar with Git and GitHub, look through GitHub’s Getting Started documentation.
Then, create and activate a Python virtual environment²:
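The exact commands depend on your operating system and shell; on a Unix-like system with
Python 3, creating and activating a virtual environment typically looks like this (the directory
name .venv is just a common convention, not necessarily the one used by this code base):

python3 -m venv .venv
source .venv/bin/activate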
You only need to create the environment once, but you will need to activate it every time
that you want to work on the code from a new shell. Once the environment is activated,
you can install the right versions of each Python dependency:
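Assuming the repository lists its pinned dependencies in a requirements.txt file (a common
Python convention; check the repository for its actual instructions), this is typically:

pip install -r requirements.txt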
To access the framework itself, you need to install it in editable mode (-e):
pip install -e .
Once the environment is set up, you can confirm that it works by running the framework’s
automated tests:
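The repository’s own instructions may name a specific command; the “OK” message described
below is what Python’s built-in unittest runner prints, so the invocation is typically along
these lines:

python -m unittest discover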
If everything installed correctly, you should see an “OK” message on the last line of the
output after running this command.
Let’s jump into an extended example to see exactly what this means. One of the key
building blocks for Reinforcement Learning — all of statistics and machine learning, really
— is Probability. How are we going to handle uncertainty and randomness in our code?
One approach would be to keep Probability implicit. Whenever we have a random vari-
able, we could call a function and get a random result (technically referred to as a sample).
If we were writing a Monopoly game with two six-sided dice, we would define it like this:
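(The definition below is a sketch consistent with the six_sided function and the roll_dice
behavior referred to later in this section; the actual code base may differ in details.)

import random

def six_sided():
    return random.randint(1, 6)   # one sample: the outcome of rolling a six-sided die

def roll_dice():
    return six_sided() + six_sided()   # the outcome of rolling two dice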
² A Python “virtual environment” is a way to manage Python dependencies on a per-project basis. Having a different environment for different Python projects lets each project have its own version of Python libraries, which avoids problems when one project needs an older version of a library and another project needs a newer version.
This works, but it’s pretty limited. We can’t do anything except get one outcome at
a time. More importantly, this only captures a slice of how we think about Probability:
there’s randomness but we never even mentioned probability distributions (referred to as
simply distributions for the rest of this chapter). We have outcomes and we have a function
we can call repeatedly, but there’s no way to tell that function apart from a function that
has nothing to do with Probability but just happens to return an integer.
How can we write code to get the expected value of a distribution? If we have a paramet-
ric distribution (eg: a distribution like Poisson or Gaussian, characterized by parameters),
can we get the parameters out if we need them?
Since distributions are implicit in the code, the intentions of the code aren’t clear and it is
hard to write code that generalizes over distributions. Distributions are absolutely crucial
for machine learning, so this is not a great starting point.
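A natural improvement is to make the concept of a distribution explicit as its own abstraction.
A minimal sketch of such an interface, consistent with the description that follows (the
framework’s actual definition may add type annotations and further functionality), is:

from abc import ABC, abstractmethod

class Distribution(ABC):
    """Interface for anything we can draw samples from."""

    @abstractmethod
    def sample(self):
        """Return a single random outcome (a sample) from this distribution."""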
This class defines an interface: a definition of what we require for something to qualify
as a distribution. Any kind of distribution we implement in the future will be able to, at
minimum, generate samples; when we write functions that sample distributions, they can
require their inputs to inherit from Distribution.
The class itself does not actually implement sample. Distribution captures the abstract
concept of distributions that we can sample, but we would need to specify a specific distri-
bution to actually sample anything. To reflect this in Python, we’ve made Distribution an
abstract base class (ABC), with sample as an abstract method—a method without an im-
plementation. Abstract classes and abstract methods are features that Python provides to
help us define interfaces for abstractions. We can define the Distribution class to structure
the rest of our probability distribution code before we define any specific distributions.
0.12.2. A Concrete Distribution
Now that we have an interface, what do we do with it? An interface can be approached
from two sides:
• Something that requires the interface. This will be code that uses operations specified in the interface and works with any value that satisfies those requirements.
• Something that provides the interface. This will be some value that supports the
operations specified in the interface.
If we have some code that requires an interface and some other code that satisfies the
interface, we know that we can put the two together and get something that works —
even if the two sides were written without any knowledge or reference to each other. The
interface manages how the two sides interact.
To use our Distribution class, we can start by providing a concrete class3 that imple-
ments the interface. Let’s say that we wanted to model dice — perhaps for a game of D&D
or Monopoly. We could do this by defining a Die class that represents an n-sided die and
inherits from Distribution:
import random

class Die(Distribution):
    def __init__(self, sides):
        self.sides = sides

    def sample(self):
        return random.randint(1, self.sides)

six_sided = Die(6)

def roll_dice():
    return six_sided.sample() + six_sided.sample()
This version of roll_dice has exactly the same behavior as roll_dice in the previous
section, but it took a bunch of extra code to get there. What was the point?
The key difference is that we now have a value that represents the distribution of rolling
a die, not just the outcome of a roll. The code is easier to understand — when we come
across a Die object, the meaning and intention behind it is clear — and it gives us a place to
add additional die-specific functionality. For example, it would be useful for debugging
if we could print not just the outcome of rolling a die but the die itself — otherwise, how
would we know if we rolled a die with the right number of sides for the given situation?
If we were using a function to represent our die, printing it would not be useful:
>>> print(six_sided)
<function six_sided at 0x7f00ea3e3040>
That said, the Die class we’ve defined so far isn’t much better:
>>> print(Die(6))
<__main__.Die object at 0x7ff6bcadc190>
With a class — and unlike a function — we can fix this. Python lets us change some of
the built-in behavior of objects by overriding special methods. To change how the class is
printed, we can override __repr__:4
3 In this context, a concrete class is any class that is not an abstract class. More generally, “concrete” is the opposite of “abstract” — when an abstraction can represent multiple more specific concepts, we call any of the specific concepts “concrete.”
4 Our definition of __repr__ used a Python feature called an “f-string.” Introduced in Python 3.6, f-strings make it easier to inject Python values into strings. By putting an f in front of a string literal, we can include a Python value in a string: f"{1 + 1}" gives us the string "2".
class Die(Distribution):
    ...
    def __repr__(self):
        return f"Die(sides={self.sides})"
Much better:
>>> print(Die(6))
Die(sides=6)
This seems small but makes debugging much easier, especially as the codebase gets
larger and more complex.
Dataclasses
The Die class we wrote is intentionally simple. Our die is defined by a single property: the
number of sides it has. The __init__ method takes the number of sides as an input and
puts it into an attribute of the class; once a Die object is created, there is no reason to change
this value — if we need a die with a different number of sides, we can just create a new
object. Abstractions do not have to be complex to be useful.
Unfortunately, some of the default behavior of Python classes isn’t well-suited to simple
classes. We’ve already seen that we need to override __repr__ to get useful behavior, but
that’s not the only default that’s inconvenient. Python’s default way to compare objects for
equality — the __eq__ method — uses the is operator, which means it compares objects by identity. This makes sense for classes in general, which can change over time, but it is a poor fit for a simple abstraction like Die. Two Die objects with the same number of sides have
the same behavior and represent the same probability distribution, but with the default
version of __eq__, two Die objects declared separately will never be equal:
>>> six_sided = Die(6)
>>> six_sided == six_sided
True
>>> six_sided == Die(6)
False
>>> Die(6) == Die(6)
False
This behavior is inconvenient and confusing, the sort of edge-case that leads to hard-to-
spot bugs. Just like we overrode __repr__, we can fix this by overriding __eq__:
def __eq__(self, other):
    return self.sides == other.sides
However, this simple implementation will lead to errors if we use == to compare a Die
with a non-Die value:
>>> Die(6) == None
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File ".../rl/chapter1/probability.py", line 18, in __eq__
    return self.sides == other.sides
AttributeError: 'NoneType' object has no attribute 'sides'
We generally won’t be comparing values of different types with == — for None, Die(6) is
None would be more idiomatic — but the usual expectation in Python is that == on different
types will return False rather than raising an exception. We can fix this by explicitly checking the type of other:
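That is, something like the following (this matches the manual definition shown at the end of this subsection):

def __eq__(self, other):
    if isinstance(other, Die):
        return self.sides == other.sides
    return False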
Most of the classes we will define in the rest of the book follow this same pattern —
they’re defined by a small number of parameters, all that __init__ does is set a few attributes, and they need custom __repr__ and __eq__ methods. Manually defining __init__,
__repr__ and __eq__ for every single class isn’t too bad — the definitions are entirely sys-
tematic — but it carries some real costs:
• Extra code without important content makes it harder to read and navigate through a
codebase.
• It’s easy for mistakes to sneak in. For example, if you add an attribute to a class but
forget to add it to its __eq__ method, you won’t get an error — == will just ignore
that attribute. Unless you have tests that explicitly check how == handles your new
attribute, this oversight can sneak through and lead to weird behavior in code that
uses your class.
• Frankly, writing these methods by hand is just tedious.
Luckily, Python 3.7 introduced a feature that fixes all of these problems: dataclasses. The dataclasses module provides a decorator5 that lets us write a class that behaves like Die without needing to manually implement __init__, __repr__ or __eq__. We still have access to “normal” class features like inheritance (from Distribution) and custom methods (sample):
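A sketch of that dataclass version (the frozen and fully annotated variants appear later in this section):

from dataclasses import dataclass

@dataclass
class Die(Distribution):
    sides: int

    def sample(self):
        return random.randint(1, self.sides)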
This version of Die has the exact behavior we want in a way that’s easier to write and
— more importantly — far easier to read. For comparison, here’s the code we would have
needed without dataclasses:
class Die(Distribution):
    def __init__(self, sides):
        self.sides = sides

    def __repr__(self):
        return f"Die(sides={self.sides})"
5 Python decorators are modifiers that can be applied to class, function and method definitions. A decorator is written above the definition that it applies to, starting with a @ symbol. Examples include abstractmethod — which we saw earlier — and dataclass.
    def __eq__(self, other):
        if isinstance(other, Die):
            return self.sides == other.sides
        return False

    def sample(self):
        return random.randint(1, self.sides)
As you can imagine, the difference would be even starker for classes with more at-
tributes!
Dataclasses provide such a useful foundation for classes in Python that the majority of
the classes we define in this book are dataclasses — we use dataclasses unless we have a
specific reason not to.
Immutability
Once we’ve created a Die object, it does not make sense to change its number of sides — if
we need a distribution for a different die, we can create a new object instead. If we change
the sides of a Die object in one part of our code, it will also change in every other part of
the codebase that uses that object, in ways that are hard to track. Even if the change made
sense in one place, chances are it is not expected in other parts of the code. Changing state can create invisible connections between seemingly separate parts of the codebase that are hard to track mentally. A sure recipe for bugs!
Normally, we avoid this kind of problem in Python purely by convention: nothing stops
us from changing sides on a Die object, but we know not to do that. This is doable, but
hardly ideal; just like it is better to rely on seatbelts rather than pure driver skill, it is better
to have the language prevent us from doing the wrong thing than relying on pure conven-
tion. Normal Python classes don’t have a convenient way to stop attributes from changing,
but luckily dataclasses do:
@dataclass(frozen=True)
class Die(Distribution):
...
>>> d = Die(6)
>>> d.sides = 10
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<string>", line 4, in __setattr__
dataclasses.FrozenInstanceError: cannot assign to field 'sides'
An object that we cannot change is called immutable. Instead of changing the object
in place, we can return a fresh copy with the attribute changed; dataclasses provides a
replace function that makes this easy:
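For instance (a small sketch; with such a simple class we could equally construct the new object directly):

import dataclasses

d6 = Die(6)
d20 = dataclasses.replace(d6, sides=20)   # a new Die(sides=20); d6 is unchanged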
This example is a bit convoluted — with such a simple object, we would just write d20 =
Die(20) — but dataclasses.replace becomes a lot more useful with more complex objects
that have multiple attributes.
Returning a fresh copy of data rather than modifying in place is a common pattern in
Python libraries. For example, the majority of Pandas operations — like drop or fillna —
return a copy of the dataframe rather than modifying the dataframe in place. These meth-
ods have an inplace argument as an option, but this leads to enough confusing behavior
that the Pandas team is currently deliberating on deprecating inplace altogether.
Apart from helping prevent odd behavior and bugs, frozen=True has an important bonus:
we can use immutable objects as dictionary keys and set elements. Without frozen=True,
we would get a TypeError because non-frozen dataclasses do not implement __hash__:
>>> d = Die(6)
>>> {d: "abc"}
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: unhashable type: 'Die'
>>> {d}
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: unhashable type: 'Die'
Immutable dataclass objects act like plain data — not too different from strings and ints.
In this book, we follow the same practice with frozen=True as we do with dataclasses in
general: we set frozen=True unless there is a specific reason not to.
The types of an object’s attributes are a useful indicator of how the object should be
used. Python’s dataclasses let us use type annotations (also known as “type hints”) to
specify the type of each attribute:
@dataclass(frozen=True)
class Die(Distribution):
    sides: int
In normal Python, these type annotations exist primarily for documentation — a user can see the types of each attribute at a glance, but the language does not raise an error when an object is created with the wrong type for an attribute. External tools — Integrated Development Environments (IDEs) and typecheckers — can catch type mismatches in annotated Python code without running the code. With a type-aware editor, Die("foo") would be underlined with an error message. The message in that case comes from pyright running over the language server protocol (LSP), but Python has a number of different typecheckers available.
Instead of needing to call sample to see an error — which we then have to carefully read
to track back to the source of the mistake — the mistake is highlighted for us without even
needing to run the code.
Static Typing
Being able to find type mismatches without running code is called static typing. Some lan-
guages — like Java and C++ — require all code to be statically typed; Python does not. In
fact, Python started out as a dynamically typed language with no type annotations. Type
errors would only come up when the code containing the error was run.
Python is still primarily a dynamically typed language — type annotations are optional in
most places and there is no built-in checking for annotations. In the Die(”foo”) example,
we only got an error when we ran code that passed sides into a function that required
an int (random.randint). We can get static checking with external tools, but even then it
remains optional — even statically checked Python code runs dynamic type checks, and
we can freely mix statically checked and “normal” Python. Optional static typing on top of a dynamically typed language is called gradual typing because we can incrementally
add static types to an existing dynamically typed codebase.
Dataclass attributes are not the only place where knowing types is useful; it would also
be handy for function parameters, return values and variables. Python supports optional
annotations on all of these; dataclasses are the only language construct where annotations
are required. To help mix annotated and unannotated code, typecheckers will report mis-
matches in code that is explicitly annotated, but will usually not try to guess types for
unannotated code.
How would we add type annotations to our example code? So far, we’ve defined two
classes:
6 Python has a number of external typecheckers, including:
• mypy
• pyright
• pytype
• pyre
• Distribution, an abstract class defining interfaces for probability distributions in
general
• Die, a concrete class for the distribution of an n-sided die
We’ve already annotated that sides in Die has to be an int. We also know that the outcome
of a die roll is an int. We can annotate this by adding -> int after def sample(...):
@dataclass(frozen=True)
class Die(Distribution):
    sides: int

    def sample(self) -> int:
        return random.randint(1, self.sides)
Other kinds of concrete distributions would have other sorts of outcomes. A coin flip would either be "heads" or "tails"; a normal distribution would produce a float. Type annotations are particularly important when writing code for any kind of mathematical modeling because they ensure that the type specifications in our code correspond clearly to the precise specification of sets in our mathematical (notational) description. But what about the abstract Distribution class itself? So far, its sample method carries no return annotation:
class Distribution(ABC):
    @abstractmethod
    def sample(self):
        pass
This works — annotations are optional — but it can get confusing: some code we write will work for any kind of distribution, some code needs distributions that return numbers, and other code will need something else… In every instance, sample had better return something, but that isn’t explicitly annotated. When we leave out annotations our code will still work, but our editor or IDE will not catch as many mistakes.
The difficulty here is that different kinds of distributions — different implementations of
the Distribution interface — will return different types from sample. To deal with this, we
need type variables: variables that stand in for some type that can be different in different
contexts. Type variables are also known as “generics” because they let us write classes that
generically work for any type.
To add annotations to the abstract Distribution class, we will need to define a type vari-
able for the outcomes of the distribution, then tell Python that Distribution is “generic”
in that type:
from abc import ABC, abstractmethod
from typing import Generic, TypeVar

A = TypeVar('A')

# Distribution is "generic in A"
class Distribution(ABC, Generic[A]):
    # Sampling must produce a value of type A
    @abstractmethod
    def sample(self) -> A:
        pass
In this code, we’ve defined a type variable A and specified that Distribution uses A
by inheriting from Generic[A]. We can now write type annotations for distributions with
specific types of outcomes: for example, Die would be an instance of Distribution[int] since
the outcome of a die roll is always an int. We can make this explicit in the class definition:
class Die(Distribution[int]):
...
This lets us write specialized functions that only work with certain kinds of distribu-
tions. Let’s say we wanted to write a function that approximated the expected value of
a distribution by sampling repeatedly and calculating the mean. This function works for
distributions that have numeric outcomes — float or int — but not other kinds of distri-
butions (How would we calculate an average for a random name?). We can annotate this
explicitly by using Distribution[float]:8
import statistics

def expected_value(d: Distribution[float], n: int = 100) -> float:
    return statistics.mean(d.sample() for _ in range(n))
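For example (a quick usage sketch), sampling a six-sided Die this way should give an estimate close to its true mean of 3.5:

d = Die(6)
print(expected_value(d, n=10_000))   # roughly 3.5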
0.12.5. Functionality
So far, we’ve covered two abstractions for working with probability distributions:
• Distribution: an abstract class that defines the interface for probability distributions
• Die: a distribution for rolling fair n-sided dice
This is an illustrative example, but it doesn’t let us do much. If all we needed were n-
sided dice, a separate Distribution class would be overkill. Abstractions are a means for
managing complexity, but any abstraction we define also adds some complexity to a code-
base itself — it’s one more concept for a programmer to learn and understand. It’s always
worth considering whether the added complexity from defining and using an abstraction
is worth the benefit. How does the abstraction help us understand the code? What kind
of mistakes does it prevent — and what kind of mistakes does it encourage? What kind of
7 Traditionally, type variables have one-letter capitalized names — although it’s perfectly fine to use full words if that would make the code clearer.
8 The float type in Python also covers int, so we can pass a Distribution[int] anywhere that a Distribution[float] is required.
added functionality does it give us? If we don’t have sufficiently solid answers to these
questions, we should consider leaving the abstraction out.
If all we cared about were dice, Distribution wouldn’t carry its weight. Reinforcement
Learning, though, involves both a wide range of specific distributions — any given Re-
inforcement Learning problem can have domain-specific distributions — as well as algo-
rithms that need to work for all of these problems. This gives us two reasons to define a
Distribution abstraction: Distribution will unify different applications of Reinforcement
Learning and will generalize our Reinforcement Learning code to work in different con-
texts. By programming against a general interface like Distribution, our algorithms will
be able to work for the different applications we present in the book — and even work for
applications we weren’t thinking about when we designed the code.
One of the practical advantages of defining general-purpose abstractions in our code is
that it gives us a place to add functionality that will work for any instance of the abstraction.
For example, one of the most common operations for a probability distribution that we can
sample is drawing n samples. Of course, we could just write a loop every time we needed
to do this:
samples = []
for _ in range(100):
    samples += [distribution.sample()]
This code is fine, but it’s not great. A for loop in Python might be doing pretty much anything, so it’s hard to understand what a loop is doing at a glance; we’d have to carefully read the code to see that it’s putting 100 samples in a list. Since this is such a common operation, we can add a method for it instead:
class Distribution(ABC, Generic[A]):
    ...
    def sample_n(self, n: int) -> Sequence[A]:
        return [self.sample() for _ in range(n)]
The implementation here is different — it’s using a list comprehension9 rather than a
normal for loop — but it’s accomplishing the same thing. The more important distinction
happens when we use the method; instead of needing a for loop or list comprehension
each time, we can just write:
9 List comprehensions are a Python feature to build lists by looping over something. The simplest pattern is the same as writing a for loop:
foo = [expr for x in xs]
# is the same as:
foo = []
for x in xs:
    foo += [expr]
List comprehensions can combine multiple lists, acting like nested for loops:
>>> [(x, y) for x in range(3) for y in range(2)]
[(0, 0), (0, 1), (1, 0), (1, 1), (2, 0), (2, 1)]
They can also have if clauses to only keep elements that match a condition:
>>> [x for x in range(10) if x % 2 == 0]
[0, 2, 4, 6, 8]
Some combination of for and if clauses can let us build surprisingly complicated lists! Comprehensions will often be easier to read than loops—a loop could be doing anything, a comprehension is always creating a list—but it’s always a judgement call. A couple of nested for loops might be easier to read than a sufficiently convoluted comprehension!
samples = distribution.sample_n(100)
The meaning of this line of code — and the programmer’s intention behind it — are
immediately clear at a glance.
Of course, this example is pretty limited. The list comprehension to build a list with 100
samples is a bit more complicated than just calling sample_n(100), but not by much — it’s
still perfectly readable at a glance. This pattern of implementing general-purpose func-
tions on our abstractions becomes a lot more useful as the functions themselves become
more complicated.
However, there is another advantage to defining methods like sample_n: some kinds of
distributions might have more efficient or more accurate ways to implement the same logic.
If that’s the case, we would override sample_n to use the better implementation. Code
that uses sample_n would automatically benefit; code that used a loop or comprehension
instead would not. For example, this happens if we implement a distribution by wrapping
a function from numpy’s random module:
import numpy as np

@dataclass
class Gaussian(Distribution[float]):
    μ: float
    σ: float

    def sample(self) -> float:
        return np.random.normal(loc=self.μ, scale=self.σ)

    def sample_n(self, n: int) -> Sequence[float]:
        return np.random.normal(loc=self.μ, scale=self.σ, size=n)
numpy is optimized for array operations, which means there is a fixed overhead for each call to numpy.random.normal, but a single call can generate an entire array of samples at once. The performance impact is significant:
>>> d = Gaussian(μ=0, σ=1)
>>> timeit.timeit(lambda: [d.sample() for _ in range(100)])
293.33819171399955
>>> timeit.timeit(lambda: d.sample_n(100))
5.566364272999635
with an object. This is a useful capability that lets us abstract over doing the same kind of action on different sorts of objects. Our sample_n method, for example, with its default implementation, gives us two things:
1. a default implementation of sample_n that works for every kind of Distribution, and
2. the ability for specific distributions to override sample_n with a more efficient or more accurate implementation.
If we made sample_n a normal function we could get 1 but not 2; if we left sample_n as an abstract method, we’d get 2 but not 1. Having a non-abstract method on the abstract class gives us the best of both worlds.
So if methods are our “verbs,” what else do we need? While methods abstract over
actions, they do this indirectly — we can talk about objects as standalone values, but we can
only use methods on objects, with no way to talk about computation itself. Stretching the
metaphor with grammar, it’s like having verbs without infinitives or gerunds — we’d be
able to talk about “somebody skiing,” but not about “skiing” itself or somebody “planning
to ski!”
In this world, “nouns” (objects) are first-class citizens but “verbs” (methods) aren’t.
What it takes to be a “first-class” value in a programming language is a fuzzy concept;
a reasonable litmus test is whether we can pass a value to a function or store it in a data
structure. We can do this with objects, but it’s not clear what this would mean for methods.
Functions, on the other hand, can be passed around like any other value in Python. Suppose, for example, we had some action we wanted to repeat 10 times:
for _ in range(10):
    do_something()
Instead of writing a loop each time, we could factor this logic into a function that took n
and do_something as arguments:
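A sketch consistent with the description that follows (the exact definition is not shown in this excerpt):

from typing import Callable

def repeat(action: Callable[[], None], n: int) -> None:
    # Call the given action n times
    for _ in range(n):
        action()

repeat(do_something, 10)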
repeat takes action and n as arguments, then calls action n times. action has the type
Callable which, in Python, covers functions as well as any other objects you can call with
the f() syntax. We can also specify the return type and arguments a Callable should have;
if we wanted the type of a function that took an int and a str as input and returned a bool,
we would write Callable[[int, str], bool].
The version with the repeat function makes our intentions clear in the code. A for loop
can do many different things, while repeat will always just repeat. It’s not a big difference
in this case — the for loop version is sufficiently easy to read that it’s not a big impediment
— but it becomes more important with complex or domain-specific logic.
Let’s revisit the expected_value function we defined earlier. Its previous version only worked for Distribution[float]; by also passing in a function f that maps outcomes to numbers, it can handle any kind of distribution:
def expected_value(
    d: Distribution[A],
    f: Callable[[A], float],
    n: int
) -> float:
    return statistics.mean(f(d.sample()) for _ in range(n))
The implementation of expected_value has barely changed — it’s the same mean calcu-
lation as previously, except we apply f to each outcome. This small change, however,
has made the function far more flexible: we can now call expected_value on any sort of
Distribution, not just Distribution[float].
Going back to our coin example, we could use it like this:
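For instance, assuming a hypothetical coin_flip: Distribution[str] whose outcomes are "heads" or "tails" (the coin example mentioned earlier; the names here are illustrative):

def payoff(outcome: str) -> float:
    # Pay 1.0 for heads, nothing for tails
    return 1.0 if outcome == "heads" else 0.0

expected_value(coin_flip, payoff, n=1000)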
The payoff function maps outcomes to numbers and then we calculate the expected
value using that mapping.
We’ll see first-class functions used in a number of places throughout the book; the key
idea to remember is that functions are values that we can pass around or store just like any
other object.
Lambdas
payoff itself is a pretty reasonable function:
it has a clear name and works as a standalone
concept. Often, though, we want to use a first-class function in some specific context where
giving the function a name is not needed or even distracting. Even in cases with reasonable
names like payoff, it might not be worth introducing an extra named function if it will only
be used in one place.
Luckily, Python gives us an alternative: lambda. Lambdas are function literals. We can
write 3.0 and get a number without giving it a name, and we can write a lambda expression
to get a function without giving it a name. Here’s the same example as with the payoff
function but using a lambda instead:
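Using the same hypothetical coin_flip distribution as above:

expected_value(coin_flip, lambda outcome: 1.0 if outcome == "heads" else 0.0, n=1000)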
The lambda expression here behaves exactly the same as def payoff did in the earlier
version. Note how the lambda is a single expression with no return — if you ever need multiple statements in a function, you’ll have to use a def instead of a lambda. In practice,
lambdas are great for functions whose bodies are short expressions, but anything that’s
too long or complicated will read better as a standalone function def.
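For context, the discussion below refers to an iterative square-root routine (Newton's method) with a hard-coded stopping threshold. A sketch of the kind of loop in question (reconstructed here for illustration; the iterator-based version appears later in this section):

def sqrt(a: float) -> float:
    x = a / 2   # initial guess
    x_n = a
    while abs(x_n - x) > 0.01:
        x_n = x
        x = (x + (a / x)) / 2
    return x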
The hard coded 0.01 in the while loop should be suspicious. How do we know that 0.01
is the right stopping condition? How do we decide when to stop at all?
The trick with this question is that we can’t know when to stop when we’re implementing
a general-purpose function because the level of precision we need will depend on what
the result is used for! It’s the caller of the function that knows when to stop, not the author.
The first improvement we can make is to turn the 0.01 into an extra parameter:
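That is, something along these lines:

def sqrt(a: float, threshold: float = 0.01) -> float:
    x = a / 2   # initial guess
    x_n = a
    while abs(x_n - x) > threshold:
        x_n = x
        x = (x + (a / x)) / 2
    return x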
This is a definite improvement over a literal 0.01, but it’s still limited. We’ve provided an
extra parameter for how the function behaves, but the control is still fundamentally with
the function. The caller of the function might want to stop before the method converges if
it’s taking too many iterations or too much time, but there’s no way to do that by changing
the threshold parameter. We could provide additional parameters for all of these, but
we’d quickly end up with the logic for how to stop iteration requiring a lot more code
and complexity than the iterative algorithm itself! Even that wouldn’t be enough; if the
function isn’t behaving as expected in some specific application, we might want to print out
intermediate values or graph the convergence over time — so should we include additional
control parameters for that?
Then what do we do when we have n other iterative algorithms? Do we copy-paste the
same stopping logic and parameters into each one? We’d end up with a lot of redundant
code!
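The discussion that follows refers to Python's iteration protocol. As a minimal illustration (not from the original text), the same for syntax loops over lists, sets and dictionaries alike:

for x in [3, 2, 1]:
    print(x)   # prints 3, 2, 1 in list order
for x in {3, 2, 1}:
    print(x)   # set iteration order is not insertion order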
Note how the iterator for the set ({3, 2, 1}) prints 1 2 3 rather than 3 2 1 — sets do
not preserve the order in which elements are added, so they iterate over elements in some
kind of internally defined order instead.
When we iterate over a dictionary, we will print the keys rather than the values because
that is the default iterator. To get values or key-value pairs we’d need to use the values
and items methods respectively, each of which returns a different kind of iterator over the
dictionary.
d = {'a': 1, 'b': 2, 'c': 3}
for k in d: print(k)
for v in d.values(): print(v)
for k, v in d.items(): print(k, v)
In each of these three cases we’re still looping over the same dictionary, we just get a dif-
ferent view each time—iterators give us the flexibility of iterating over the same structure
in different ways.
Iterators aren’t just for loops: they give us a first-class abstraction for iteration. We can
pass them into functions; for example, Python’s list function can convert any iterator into
a list. This is handy when we want to see the elements of specialized iterators if the iterator
itself does not print out its values:
>>> range(5)
range(0, 5)
>>> list(range(5))
[0, 1, 2, 3, 4]
Since iterators are first-class values, we can also write general-purpose iterator functions.
The Python standard library has a set of operations like this in the itertools module;
for example, itertools.takewhile lets us stop iterating as soon as some condition stops
holding:
>>> elements = [1, 3, 2, 5, 3]
>>> list(itertools.takewhile(lambda x: x < 5, elements))
[1, 3, 2]
Note how we converted the result of takewhile into a list — without that, we’d see that
takewhile returns some kind of opaque internal object that implements that iterator specif-
ically. This works fine — we can use the takewhile object anywhere we could use any other
iterator — but it looks a bit odd in the Python interpreter:
>>> itertools.takewhile(lambda x: x < 5, elements)
<itertools.takewhile object at 0x7f8e3baefb00>
Now that we’ve seen a few examples of how we can use iterators, how do we define
our own? In the most general sense, a Python Iterator is any object that implements
a __next__() method, but implementing iterators this way is pretty awkward. Luckily,
Python has a more convenient way to create an iterator by creating a generator using the
yield keyword. yield acts similar to return from a function, except instead of stopping
the function altogether, it outputs the yielded value to an iterator and pauses the function
until the yielded element is consumed by the caller.
This is a bit of an abstract description, so let’s look at how this would apply to our sqrt
function. Instead of looping and stopping based on some condition, we’ll write a version
of sqrt that returns an iterator with each iteration of the algorithm as a value:
def sqrt(a: float) -> Iterator[float]:
    x = a / 2  # initial guess
    while True:
        x = (x + (a / x)) / 2
        yield x
With this version, we update x at each iteration and then yield the updated value. In-
stead of getting a single value, the caller of the function gets an iterator that contains an
infinite number of iterations; it is up to the caller to decide how many iterations to evalu-
ate and when to stop. The sqrt function itself has an infinite loop, but this isn’t a problem
because execution of the function pauses at each yield which lets the caller of the function
stop it whenever they want.
To do 10 iterations of the sqrt algorithm, we could use itertools.islice:
>>> iterations = list(itertools.islice(sqrt(25), 10))
>>> iterations[-1]
5.0
A fixed number of iterations can be useful for exploration, but we probably want the
threshold-based convergence logic we had earlier. Since we now have a first-class abstrac-
tion for iteration, we can write a general-purpose converge function that takes an iterator
and returns a version of that same iterator that stops as soon as two values are sufficiently
close. Python 3.10 and later provides itertools.pairwise which makes the code pretty
simple:
def converge(values: Iterator[float], threshold: float) -> Iterator[float]:
    for a, b in itertools.pairwise(values):
        yield a
        if abs(a - b) < threshold:
            break
For older versions of Python, we’d have to implement our version of pairwise as well:
def pairwise(values: Iterator[A]) -> Iterator[Tuple[A, A]]:
    a = next(values, None)
    if a is None:
        return
    for b in values:
        yield (a, b)
        a = b
Both of these follow a common pattern with iterators: each function takes an iterator
as an input and returns an iterator as an output. This doesn’t always have to be the case,
but we get a major advantage when it is: iterator → iterator operations compose. We can
get relatively complex behavior by starting with an iterator (like our sqrt example) then
applying multiple operations to it. For example, somebody calling sqrt might want to
converge at some threshold but, just in case the algorithm gets stuck for some reason, also
have a hard stop at 10,000 iterations. We don’t need to write a new version of sqrt or even
converge to do this; instead, we can use converge with itertools.islice:
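A sketch of what that composition might look like (the specific threshold, iteration cap and input value are placeholders, not from the original text):

import itertools

results = converge(sqrt(25.0), threshold=0.001)
capped = itertools.islice(results, 10_000)   # hard stop after 10,000 iterations
print(list(capped)[-1])                      # approximately 5.0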
This is a powerful programming style because we can write and test each operation—
sqrt, converge, islice — in isolation and get complex behavior by combining them in the
right way. If we were writing the same logic without iterators, we would need a single loop
that calculated each step of sqrt, checked for convergence and kept a counter to stop after
10,000 steps — and we’d need to replicate this pattern for every single such algorithm!
Iterators and generators will come up all throughout this book because they provide a
programming abstraction for processes, making them a great foundation for the mathemat-
ical processes that underlie Reinforcement Learning.
• Python has type annotations which are required for dataclasses but are also useful
for describing interfaces in functions and methods. Additional tools like mypy or
PyCharm can use these type annotations to catch errors without needing to run the
code.
• Functions are first-class values in Python, meaning that they can be stored in vari-
ables and passed to other functions as arguments. Classes abstract over data; func-
tions abstract over computation.
• Iterators abstract over iteration: computations that happen in sequence, producing
a value after each iteration. Reinforcement learning focuses primarily on iterative
algorithms, so iterators become one of the key abstractions for working with different
reinforcement learning algorithms.
Part I.
1. Markov Processes
This book is about “Sequential Decisioning under Sequential Uncertainty.” In this chap-
ter, we will ignore the “sequential decisioning” aspect and focus just on the “sequential
uncertainty” aspect.
Figure 1.1.: Logistic Curves
where L is an arbitrary reference level and α1 ∈ R≥0 is a “pull strength” parameter. Note
that this probability is defined as a logistic function of L − Xt with the steepness of the
logistic function controlled by the parameter α1 (see Figure 1.1).
The way to interpret this logistic function of L − Xt is that if Xt is greater than the
reference level L (making P[Xt+1 = Xt + 1] < 0.5), then there is more of a down-pull than
an up-pull. Likewise, if Xt is less than L, then there is more of an up-pull. The extent of
the pull is controlled by the magnitude of the parameter α1 . We refer to this behavior as
mean-reverting behavior, meaning the stock price tends to revert to the “mean” (i.e., to the
reference level L).
We can model the state St = Xt and note that the probabilities of the next state St+1
depend only on the current state St and not on the previous states S0 , S1 , . . . , St−1 . In-
formally, we phrase this property as: “The future is independent of the past given the
present.” Formally, we can state this property of the states as:
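That is (the standard statement of the Markov Property, written in the same notation as the surrounding text):

P[St+1 | S0, S1, . . . , St] = P[St+1 | St] for all t ≥ 0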
This is a highly desirable property since it helps make the mathematics of such processes
much easier and the computations much more tractable. We call this the Markov Property
of States, or simply that these are Markov States.
Let us now code this up. First, we create a dataclass to represent the dynamics of this
process. As you can see in the code below, the dataclass Process1 contains two attributes
level_param: int and alpha1: float = 0.25 to represent L and α1 respectively. It contains
the method up_prob to calculate P[Xt+1 = Xt + 1] and the method next_state, which
samples from a Bernoulli distribution (whose probability is obtained from the method
up_prob) and creates the next state St+1 from the current state St. Also, note the nested dataclass State meant to represent the state of Process 1 (its only attribute price: int reflects the fact that the state consists of only the current price, which is an integer).
import numpy as np
from dataclasses import dataclass

@dataclass
class Process1:
    @dataclass
    class State:
        price: int

    level_param: int  # level to which price mean-reverts
    alpha1: float = 0.25  # strength of mean-reversion (non-negative value)

    def up_prob(self, state: State) -> float:
        return 1. / (1 + np.exp(-self.alpha1 * (self.level_param - state.price)))

    def next_state(self, state: State) -> State:
        up_move: int = np.random.binomial(1, self.up_prob(state), 1)[0]
        return Process1.State(price=state.price + up_move * 2 - 1)
Next, we write a simple simulator using Python’s generator functionality (using yield)
as follows:
def simulation(process, start_state):
    state = start_state
    while True:
        yield state
        state = process.next_state(state)
Now we can use this simulator function to generate sampling traces. In the following
code, we generate num_traces number of sampling traces over time_steps number of time
steps starting from a price X0 of start_price. The use of Python’s generator feature lets
us do this “lazily” (on-demand) using the itertools.islice function.
import itertools

def process1_price_traces(
    start_price: int,
    level_param: int,
    alpha1: float,
    time_steps: int,
    num_traces: int
) -> np.ndarray:
    process = Process1(level_param=level_param, alpha1=alpha1)
    start_state = Process1.State(price=start_price)
    return np.vstack([
        np.fromiter((s.price for s in itertools.islice(
            simulation(process, start_state),
            time_steps + 1
        )), float) for _ in range(num_traces)])
We note that if we model the state St as Xt, we won’t satisfy the Markov Property because the probabilities of Xt+1 depend not just on Xt but also on Xt−1. However, we can perform a little trick here and create an augmented state St consisting of the pair (Xt, Xt − Xt−1). In case t = 0, the state S0 can assume the value (X0, Null) where Null is just a symbol denoting the fact that there have been no stock price movements thus far. With the state St as this pair (Xt, Xt − Xt−1), we can see that the Markov Property is indeed satisfied:
The code for generation of sampling traces of the stock price is almost identical to the
code we wrote for Process 1.
def process2_price_traces(
    start_price: int,
    alpha2: float,
    time_steps: int,
    num_traces: int
) -> np.ndarray:
    process = Process2(alpha2=alpha2)
    # Assumed: Process2.State(price, is_prev_move_up), with None meaning "no prior move"
    start_state = Process2.State(price=start_price, is_prev_move_up=None)
    return np.vstack([
        np.fromiter((s.price for s in itertools.islice(
            simulation(process, start_state),
            time_steps + 1
        )), float) for _ in range(num_traces)])
Figure 1.2.: Unit-Sigmoid Curves
history is greater than the number of up-moves in history, then there will be more of an up-pull than a down-pull for the next price movement Xt+1 − Xt (and likewise the other way round when Ut > Dt). The extent of this “reverse pull” is controlled by the “pull strength” parameter α3 (governed by the sigmoid-shaped function f).
Again, note that if we model the state St as Xt, we won’t satisfy the Markov Property because the probabilities of the next state St+1 = Xt+1 depend on the entire history of stock price moves and not just on the current state St = Xt. However, we can again do something
clever and create a compact enough state St consisting of simply the pair (Ut , Dt ). With
this representation for the state St , the Markov Property is indeed satisfied:
The code for generation of sampling traces of the stock price is shown below:
def process3_price_traces(
    start_price: int,
    alpha3: float,
    time_steps: int,
    num_traces: int
) -> np.ndarray:
    process = Process3(alpha3=alpha3)
    start_state = Process3.State(num_up_moves=0, num_down_moves=0)
    return np.vstack([
        np.fromiter((start_price + s.num_up_moves - s.num_down_moves
                     for s in itertools.islice(simulation(process, start_state),
                                               time_steps + 1)), float)
        for _ in range(num_traces)])
Figure 1.3.: Single Sampling Trace
As suggested for Process 1, you can plot graphs of sampling traces of the stock price,
or plot graphs of the probability distributions of the stock price at various terminal time
points T for Processes 2 and 3, by playing with the code in rl/chapter2/stock_price_simulations.py.
Figure 1.3 shows a single sampling trace of stock prices for each of the 3 processes. Figure
1.4 shows the probability distribution of the stock price at terminal time T = 100 over 1000
traces.
Having developed the intuition for the Markov Property of States, we are now ready
to formalize the notion of Markov Processes (some of the literature refers to Markov Pro-
cesses as Markov Chains, but we will stick with the term Markov Processes).
Figure 1.4.: Terminal Distribution
• A countable set of states S (known as the State Space) and a set T ⊆ S (known as
the set of Terminal States)
• Termination: If an outcome for ST (for some time step T ) is a state in the set T , then
this sequence outcome terminates at time step T .
Definition 1.3.2. A Time-Homogeneous Markov Process is a Markov Process with the addi-
tional property that P[St+1 |St ] is independent of t.
This means the dynamics of a Time-Homogeneous Markov Process can be fully specified with the function
P : (S − T ) × S → [0, 1]
defined as:
P(s, s′) = P[St+1 = s′ | St = s] for all s ∈ S − T , s′ ∈ S
such that
∑_{s′ ∈ S} P(s, s′) = 1 for all s ∈ S − T
Note that the arguments to P in the above specification are devoid of the time index t
(hence, the term Time-Homogeneous which means “time-invariant”). Moreover, note that a
Markov Process that is not time-homogeneous can be converted to a Time-Homogeneous
Markov Process by augmenting all states with the time index t. This means if the original
state space of a Markov Process that is not time-homogeneous is S, then the state space of
the corresponding Time-Homogeneous Markov Process is Z≥0 × S (where Z≥0 denotes
the domain of the time index). This is because each time step has its own unique set of (augmented) states, which means the entire set of states in Z≥0 × S can be covered by time-invariant transition probabilities, thus qualifying as a Time-Homogeneous Markov Process. Therefore, henceforth, any time we say Markov Process, assume we are referring
to a Discrete-Time, Time-Homogeneous Markov Process with a Countable State Space (unless
explicitly specified otherwise), which in turn will be characterized by the transition prob-
ability function P. Note that the stock price examples (all 3 of the Processes we covered)
are examples of a (Time-Homogeneous) Markov Process, even without requiring aug-
menting the state with the time index.
The classical definitions and theory of Markov Processes model “termination” with the
idea of Absorbing States. A state s is called an absorbing state if P(s, s) = 1. This means,
once we reach an absorbing state, we are “trapped” there, hence capturing the notion of
“termination.” So the classical definitions and theory of Markov Processes typically don’t
include an explicit specification of states as terminal and non-terminal. However, when
we get to Markov Reward Processes and Markov Decision Processes (frameworks that are
extensions of Markov Processes), we will need to explicitly specify states as terminal and
non-terminal states, rather than model the notion of termination with absorbing states.
So, for consistency in definitions and in the development of the theory, we are going with
a framework where states in a Markov Process are explicitly specified as terminal or non-
terminal states. We won’t consider an absorbing state as a terminal state as the Markov
Process keeps moving forward in time forever when it gets to an absorbing state. We will
refer to S − T as the set of Non-Terminal States N (and we will refer to a state in N as a
non-terminal state). The sequence S0 , S1 , S2 , . . . terminates at time step t = T if ST ∈ T .
We say that a Markov Process is fully specified by P in the sense that this gives us the
transition probabilities that govern the complete dynamics of the Markov Process. A way
to understand this is to relate specification of P to the specification of rules in a game (such
as chess or monopoly). These games are specified with a finite (in fact, fairly compact)
set of rules that is easy for a newbie to the game to understand. However, when we want
to actually play the game, we need to specify the starting position (one could start these
games at arbitrary, but legal, starting positions and not just at some canonical starting
position). The specification of the start state of the game is analogous to the specification
of µ. Specifying µ together with P enables us to generate sampling traces of the Markov Process (analogous to actually playing games like chess or monopoly). These sampling traces typically exhibit a wide range of outcomes, because the process is random and runs over many time steps, in contrast to the compact specification of the transition probabilities. The sampling traces enable us to answer questions such as the probability distribution of states at specific future time steps, or the expected time of first occurrence of a specific state, given a certain starting probability distribution µ.
Thinking about the separation between specifying the rules of the game versus actually
playing the game helps us understand the need to separate the notion of dynamics speci-
fication P (fundamental to the time-homogeneous character of the Markov Process) and
the notion of starting distribution µ (required to perform sampling traces). Hence, the
separation of concerns between P and µ is key to the conceptualization of Markov Pro-
cesses. Likewise, we separate the concerns in our code design as well, as evidenced by
how we separated the next_state method in the Process dataclasses and the simulation
function.
on_non_terminal can be used on any object in State (i.e. for any state in S, terminal or non-
terminal). As an example, let’s say you want to calculate the expected number of states
one would traverse after a certain state and before hitting a terminal state. Clearly, this cal-
culation is well-defined for non-terminal states and the function f would implement this
by either some kind of analytical method or by sampling state-transition sequences and
averaging the counts of non-terminal states traversed across those sequences. By defining
(defaulting) this value to be 0 for terminal states, we can then invoke such a calculation for
all states in S, terminal or non-terminal, and embed this calculation in an algorithm without worrying about special handling in the code for the edge case of being a terminal state.
Now we are ready to write a class to represent Markov Processes. We create an ab-
stract class MarkovProcess parameterized by a generic type (TypeVar(’S’)) representing a
generic state space Generic[S]. The abstract class has an abstract method called transition
that is meant to specify the transition probability distribution of next states, given a current
non-terminal state. We know that transition is well-defined only for non-terminal states
and hence, its argument is clearly type-annotated as NonTerminal[S]. The return type of
transition is Distribution[State[S]], which as we know from the Chapter on Program-
ming and Design, represents the probability distribution of next states. We also have a
method simulate that enables us to generate an Iterable (generator) of sampled states,
given as input a start_state_distribution: Distribution[NonTerminal[S]] (from which
we sample the starting state). The sampling of next states relies on the implementation
of the sample method for the Distribution[State[S]] object produced by the transition
method.
Here’s the full body of the abstract class MarkovProcess:
class MarkovProcess(ABC, Generic[S]):
    @abstractmethod
    def transition(self, state: NonTerminal[S]) -> Distribution[State[S]]:
        pass

    def simulate(
        self,
        start_state_distribution: Distribution[NonTerminal[S]]
    ) -> Iterable[State[S]]:
        state: State[S] = start_state_distribution.sample()
        yield state
        while isinstance(state, NonTerminal):
            state = self.transition(state).sample()
            yield state
from rl.distribution import Constant
import numpy as np

def process3_price_traces(
    start_price: int,
    alpha3: float,
    time_steps: int,
    num_traces: int
) -> np.ndarray:
    mp = StockPriceMP3(alpha3=alpha3)
    start_state_distribution = Constant(
        NonTerminal(StateMP3(num_up_moves=0, num_down_moves=0))
    )
    return np.vstack([np.fromiter(
        (start_price + s.state.num_up_moves - s.state.num_down_moves for s in
         itertools.islice(
             mp.simulate(start_state_distribution),
             time_steps + 1
         )),
        float
    ) for _ in range(num_traces)])
N → (S → [0, 1])
With this curried view, we can represent the outer → as a map (in Python, as a dictio-
nary of type Mapping) whose keys are the non-terminal states N , and each non-terminal-
state key maps to a FiniteDistribution[S] type that represents the inner →, i.e. a finite
probability distribution of the next states transitioned to from a given non-terminal state.
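In code, this curried view corresponds to a type alias along these lines (a sketch consistent with how Transition[S] is used below; states are wrapped as described in the next paragraphs):

from typing import Mapping

Transition = Mapping[NonTerminal[S], FiniteDistribution[State[S]]]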
4 Currying is the technique of converting a function that takes multiple arguments into a sequence of functions that each takes a single argument, as illustrated above for the P function.
Figure 1.5.: Weather Markov Process
Note that the FiniteDistribution[S] will only contain the set of states transitioned to
with non-zero probability. To make things concrete, here’s a toy Markov Process data
structure example of a city with highly unpredictable weather outcomes from one day
to the next (note: Categorical type inherits from FiniteDistribution type in the code at
rl/distribution.py):
from rl.distribution import Categorical

{
    "Rain": Categorical({"Rain": 0.3, "Nice": 0.7}),
    "Snow": Categorical({"Rain": 0.4, "Snow": 0.6}),
    "Nice": Categorical({"Rain": 0.2, "Snow": 0.3})
}
To create a Transition data type from the above example of the weather Markov Process,
we’d need to wrap each of the “Rain,” “Snow” and “Nice” strings with NonTerminal.
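For illustration only, the wrapped form would look roughly like this (in practice the FiniteMarkovProcess constructor described below does this wrapping for us):

{
    NonTerminal("Rain"): Categorical({NonTerminal("Rain"): 0.3, NonTerminal("Nice"): 0.7}),
    NonTerminal("Snow"): Categorical({NonTerminal("Rain"): 0.4, NonTerminal("Snow"): 0.6}),
    NonTerminal("Nice"): Categorical({NonTerminal("Rain"): 0.2, NonTerminal("Snow"): 0.3})
}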
Now we are ready to write the code for the FiniteMarkovProcess class.5 The __init__
method (constructor) takes as argument a transition_map whose type is similar to Transition[S]
except that we use the S type directly in the Mapping representation instead of NonTerminal[S]
or State[S] (this is convenient for users to specify their Markov Process in a succinct
Mapping representation without the burden of wrapping each S with a NonTerminal[S] or
Terminal[S]). The dictionary we created above for the weather Markov Process can be
used as the transition_map argument. However, this means the __init__ method needs
to wrap the specified S states as NonTerminal[S] or Terminal[S] when creating the attribute
self.transition_map. We also have an attribute self.non_terminal_states: Sequence[NonTerminal[S]]
that is an ordered sequence of the non-terminal states. We implement the transition
method by simply returning the FiniteDistribution[State[S]] the given state: NonTerminal[S]
maps to in the attribute self.transition_map: Transition[S]. Note that along with the
transition method, we have implemented the __repr__ method for a well-formatted dis-
play of self.transition_map.
5 FiniteMarkovProcess is defined in the file rl/markov_process.py.
ability
f(i) = e^(−λ) · λ^i / i!
Denote F : Z≥0 → [0, 1] as the Poisson cumulative probability distribution function, i.e.,
F(i) = ∑_{j=0}^{i} f(j)
Assume you have storage capacity for at most C ∈ Z≥0 bicycles in your store. Each
evening at 6pm when your store closes, you have the choice to order a certain number
of bicycles from your supplier (including the option to not order any bicycles, on a given
day). The ordered bicycles will arrive 36 hours later (at 6am the day after the day after
you order - we refer to this as delivery lead time of 36 hours). Denote the State at 6pm
store-closing each day as (α, β), where α is the inventory in the store (referred to as On-Hand Inventory at 6pm) and β is the inventory on a truck from the supplier (that you had ordered the previous day) that will arrive in your store the next morning at 6am (β is referred to as On-Order Inventory at 6pm). Due to your storage capacity constraint of at
most C bicycles, your ordering policy is to order C − (α + β) if α + β < C and to not order
if α + β ≥ C. The precise sequence of events in a 24-hour cycle is:
So restricting ourselves to this finite set of states, our order quantity equals C − (α + β)
when the state is (α, β).
If current state St is (α, β), there are only α + β + 1 possible next states St+1 as follows:
(α + β − i, C − (α + β)) for i = 0, 1, . . . , α + β
Note that the next state’s (St+1 ) On-Hand can be zero resulting from any of infinite pos-
sible demand outcomes greater than or equal to α + β.
So we are now ready to write code for this simple inventory example as a Markov Pro-
cess. All we have to do is to create a derived class inherited from FiniteMarkovProcess
and write a method to construct the transition_map: Transition. Note that the generic
state type S is replaced here with the @dataclass InventoryState consisting of the pair of
On-Hand and On-Order inventory quantities comprising the state of this Finite Markov
Process.
Let us utilize the __repr__ method written previously to view the transition probabilities
for the simple case of C = 2 and λ = 1.0 (this code is in the file rl/chapter2/simple_inventory_mp.py)
user_capacity = 2
user_poisson_lambda = 1.0

si_mp = SimpleInventoryMPFinite(
    capacity=user_capacity,
    poisson_lambda=user_poisson_lambda
)

print(si_mp)
From State InventoryState(on_hand=0, on_order=1):
To State InventoryState(on_hand=1, on_order=1) with Probability 0.368
To State InventoryState(on_hand=0, on_order=1) with Probability 0.632
From State InventoryState(on_hand=0, on_order=2):
To State InventoryState(on_hand=2, on_order=0) with Probability 0.368
To State InventoryState(on_hand=1, on_order=0) with Probability 0.368
To State InventoryState(on_hand=0, on_order=0) with Probability 0.264
From State InventoryState(on_hand=1, on_order=0):
To State InventoryState(on_hand=1, on_order=1) with Probability 0.368
To State InventoryState(on_hand=0, on_order=1) with Probability 0.632
From State InventoryState(on_hand=1, on_order=1):
To State InventoryState(on_hand=2, on_order=0) with Probability 0.368
To State InventoryState(on_hand=1, on_order=0) with Probability 0.368
To State InventoryState(on_hand=0, on_order=0) with Probability 0.264
From State InventoryState(on_hand=2, on_order=0):
To State InventoryState(on_hand=2, on_order=0) with Probability 0.368
To State InventoryState(on_hand=1, on_order=0) with Probability 0.368
To State InventoryState(on_hand=0, on_order=0) with Probability 0.264
For a graphical view of this Markov Process, see Figure 1.6. The nodes are the states,
labeled with their corresponding α and β values. The directed edges are the probabilistic
state transitions from 6pm on a day to 6pm on the next day, with the transition probabilities
labeled on them.
We can perform a number of interesting experiments and calculations with this simple
Markov Process and we encourage you to play with this code by changing values of the
capacity C and poisson mean λ, performing simulations and probabilistic calculations of
natural curiosity for a store owner.
There is a rich and interesting theory for Markov Processes. However, we won’t go into
this theory as our coverage of Markov Processes so far is a sufficient building block to take
us to the incremental topics of Markov Reward Processes and Markov Decision Processes.
However, before we move on, we’d like to show just a glimpse of the rich theory with the
calculation of Stationary Probabilities and apply it to the case of the above simple inventory
Markov Process.
The intuitive view of the stationary distribution π is that (under specific conditions we
are not listing here) if we let the Markov Process run forever, then in the long run the states
occur at specific time steps with relative frequencies (probabilities) given by a distribution
π that is independent of the time step. The probability of occurrence of a specific state s
at a time step (asymptotically far out in the future) should be equal to the sum-product
of probabilities of occurrence of all the states at the previous time step and the transition
Figure 1.6.: Simple Inventory Markov Process
probabilities from those states to s. But since the states’ occurrence probabilities are invari-
ant in time, the π distribution for the previous time step is the same as the π distribution
for the time step we considered. This argument holds for all states s, and that is exactly
the statement of the definition of Stationary Distribution formalized above.
If we specialize this definition of Stationary Distribution to Finite-States, Discrete-Time,
Time-Homogeneous Markov Processes with state space S = {s1 , s2 , . . . , sn } = N , then we
can express the Stationary Distribution π as follows:
$$\pi(s_j) = \sum_{i=1}^{n} \pi(s_i) \cdot \mathcal{P}(s_i, s_j) \quad \text{for all } j = 1, 2, \ldots, n$$
Below we use bold-face notation to represent functions as vectors and matrices (since
we assume finite states). So, π is a column vector of length n and P is the n × n transition
probability matrix (rows are source states, columns are destination states with each row
summing to 1). Then, the statement of the above definition can be succinctly expressed
as:
$$\boldsymbol{\pi}^T = \boldsymbol{\pi}^T \cdot \mathbf{P}$$
which can be re-written as:
$$\mathbf{P}^T \cdot \boldsymbol{\pi} = \boldsymbol{\pi}$$
But this is simply saying that π is an eigenvector of P T with eigenvalue of 1. So then, it
should be easy to obtain the stationary distribution π from an eigenvectors and eigenvalues
calculation of P T .
Let us write code to compute the stationary distribution. We shall add two methods
in the FiniteMarkovProcess class, one for setting up the transition probability matrix P
(get_transition_matrix method) and another to calculate the stationary distribution π
(get_stationary_distribution) from this transition probability matrix. Note that P is re-
stricted to N × N → [0, 1] (rather than N × S → [0, 1]) because these probability tran-
sitions suffice for all the calculations we will be performing for Finite Markov Processes.
Here’s the code for the two methods (the full code for FiniteMarkovProcess is in the file
rl/markov_process.py):
import numpy as np
from rl.distribution import FiniteDistribution, Categorical

def get_transition_matrix(self) -> np.ndarray:
    sz = len(self.non_terminal_states)
    mat = np.zeros((sz, sz))
    for i, s1 in enumerate(self.non_terminal_states):
        for j, s2 in enumerate(self.non_terminal_states):
            mat[i, j] = self.transition(s1).probability(s2)
    return mat

def get_stationary_distribution(self) -> FiniteDistribution[S]:
    eig_vals, eig_vecs = np.linalg.eig(self.get_transition_matrix().T)
    index_of_first_unit_eig_val = np.where(
        np.abs(eig_vals - 1) < 1e-8)[0][0]
    eig_vec_of_unit_eig_val = np.real(
        eig_vecs[:, index_of_first_unit_eig_val])
    return Categorical({
        self.non_terminal_states[i].state: ev
        for i, ev in enumerate(eig_vec_of_unit_eig_val /
                               sum(eig_vec_of_unit_eig_val))
    })
We skip the theory that tells us about the conditions under which a stationary distribu-
tion is well-defined, or the conditions under which there is a unique stationary distribu-
tion. Instead, we just go ahead with this calculation here assuming this Markov Process
satisfies those conditions (it does!). So, we simply seek the index within the eig_vals vector of the eigenvalue equal to 1 (accounting for floating-point error). Next, we pull out the column of the eig_vecs matrix at that index, and convert it into a real-valued vector (eigenvector/eigenvalue calculations are, in general, complex-number calculations - see the documentation of the np.linalg.eig function). This gives us the real-valued eigenvector with eigenvalue equal to 1. Finally, we have to normalize the eigenvector so its values add up to 1 (since we want probabilities), and return the probabilities as a Categorical distribution.
Running this code for the simple case of capacity C = 2 and poisson mean λ = 1.0
(instance of SimpleInventoryMPFinite) produces the following output for the stationary
distribution π:
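The full output listing is not reproduced here; it can be generated as follows (the probability noted in the comment is the one cited in the text below, the remaining values follow from the eigenvector calculation):

# Assuming the si_mp instance constructed earlier for C = 2, lambda = 1.0:
print(si_mp.get_stationary_distribution())
# The printed Categorical lists each of the 6 InventoryState values with its
# long-run probability; InventoryState(on_hand=0, on_order=1) has the largest
# probability (roughly 0.28, as discussed below).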
This tells us that On-Hand of 0 and On-Order of 1 is the state occurring most frequently
(28% of the time) when the system is played out indefinitely.
Let us summarize the 3 different representations we’ve covered:
• Functional representation: the transition method, giving a probability distribution of next states for any non-terminal state - this works even for infinite state spaces and supports simulation.
• Sparse data-structure representation: the Transition mapping from each non-terminal state to a finite probability distribution of next states, as used by FiniteMarkovProcess.
• Dense data-structure representation: the transition probability matrix (a 2D numpy array), as used above for the stationary distribution calculation.
Now we are ready to move to our next topic of Markov Reward Processes. We’d like to
finish this section by stating that the Markov Property owes its name to a mathematician
from a century ago - Andrey Markov. Although the Markov Property seems like a simple
enough concept, the concept has had profound implications on our ability to compute or
reason with systems involving time-sequenced uncertainty in practice. There are several
good books to learn more about Markov Processes - we recommend the book by Paul
Gagniuc (Gagniuc 2017).
1.8. Formalism of Markov Reward Processes
As we’ve said earlier, the reason we covered Markov Processes is because we want to make
our way to Markov Decision Processes (the framework for Reinforcement Learning algo-
rithms) by adding incremental features to Markov Processes. Now we cover an interme-
diate framework between Markov Processes and Markov Decision Processes, known as
Markov Reward Processes. We essentially just include the notion of a numerical reward to
a Markov Process each time we transition from one state to the next. These rewards are
random, and all we need to do is to specify the probability distributions of these rewards
as we make state transitions.
The main purpose of Markov Reward Processes is to calculate how much reward we
would accumulate (in expectation, from each of the non-terminal states) if we let the Pro-
cess run indefinitely, bearing in mind that future rewards need to be discounted appropri-
ately (otherwise the sum of rewards could blow up to ∞). In order to solve the problem of
calculating expected accumulative rewards from each non-terminal state, we will first set
up some formalism for Markov Reward Processes, develop some (elegant) theory on cal-
culating rewards accumulation, write plenty of code (based on the theory), and apply the
theory and code to the simple inventory example (which we will embellish with rewards
equal to negative of the costs incurred at the store).
Definition 1.8.1. A Markov Reward Process is a Markov Process, along with a time-indexed
sequence of Reward random variables Rt ∈ D (a countable subset of R) for time steps t =
1, 2, . . ., satisfying the Markov Property (including Rewards): P[(Rt+1 , St+1 )|St , St−1 , . . . , S0 ] =
P[(Rt+1 , St+1 )|St ] for all t ≥ 0.
It pays to emphasize again (like we emphasized for Markov Processes), that the def-
initions and theory of Markov Reward Processes we cover (by default) are for discrete-
time, for countable state spaces and countable set of pairs of next state and reward tran-
sitions (with the knowledge that the definitions and theory are analogously extensible to
continuous-time and uncountable spaces/transitions). In the more general case, where
states or rewards are uncountable, the same concepts apply except that the mathematical
formalism needs to be more detailed and more careful. Specifically, we’d end up with
integrals instead of summations, and probability density functions (for continuous prob-
ability distributions) instead of probability mass functions (for discrete probability dis-
tributions). For ease of notation and more importantly, for ease of understanding of the
core concepts (without being distracted by heavy mathematical formalism), we’ve cho-
sen to stay with discrete-time, countable S and countable D (by default). However, there
will be examples of Markov Reward Processes in this book involving continuous-time and
uncountable S and D (please adjust the definitions and formulas accordingly).
Since we commonly assume Time-Homogeneity of Markov Processes, we shall also (by
default) assume Time-Homogeneity for Markov Reward Processes, i.e., P[(Rt+1 , St+1 )|St ]
is independent of t.
With the default assumption of time-homogeneity, the transition probabilities of a Markov
Reward Process can be expressed as a transition probability function:
$$\mathcal{P}_R : \mathcal{N} \times \mathcal{D} \times \mathcal{S} \to [0, 1]$$
defined as:
$$\mathcal{P}_R(s, r, s') = \mathbb{P}[(R_{t+1} = r, S_{t+1} = s') \mid S_t = s]$$
for all $s \in \mathcal{N}, r \in \mathcal{D}, s' \in \mathcal{S}$, such that
$$\sum_{s' \in \mathcal{S}} \sum_{r \in \mathcal{D}} \mathcal{P}_R(s, r, s') = 1 \quad \text{for all } s \in \mathcal{N}$$
The subsection on Start States we had covered for Markov Processes naturally applies
to Markov Reward Processes as well. So we won’t repeat the section here, rather we sim-
ply highlight that when it comes to simulations, we need a separate specification of the
probability distribution of start states. Also, by inheriting from our framework of Markov
Processes, we model the notion of a “process termination” by explicitly specifying states
as terminal states or non-terminal states. The sequence S0 , R1 , S1 , R2 , S2 , . . . terminates at
time step t = T if ST ∈ T , with RT being the final reward in the sequence.
If all random sequences of states in a Markov Reward Process terminate, we refer to them as episodic sequences (otherwise, we refer to them as continuing sequences).
Let’s write some code that captures this formalism. We create an abstract class MarkovRewardProcess that inherits from the abstract class MarkovProcess. Analogous to
MarkovProcess’s transition method (that represents P), MarkovRewardProcess has an ab-
stract method transition_reward that represents PR . Note that the return type of transition_reward
is Distribution[Tuple[State[S], float]], representing the probability distribution of (next
state, reward) pairs transitioned to.
Also, analogous to MarkovProcess’s simulate method, MarkovRewardProcess has the method
simulate_reward which generates a stream of TransitionStep[S] objects. Each TransitionStep[S]
object consists of a 3-tuple: (state, next state, reward) representing the sampled transitions
within the generated sampling trace. Here’s the actual code:
@dataclass(frozen=True)
class TransitionStep(Generic[S]):
    state: NonTerminal[S]
    next_state: State[S]
    reward: float

class MarkovRewardProcess(MarkovProcess[S]):
    @abstractmethod
    def transition_reward(self, state: NonTerminal[S])\
            -> Distribution[Tuple[State[S], float]]:
        pass

    def simulate_reward(
        self,
        start_state_distribution: Distribution[NonTerminal[S]]
    ) -> Iterable[TransitionStep[S]]:
        state: State[S] = start_state_distribution.sample()
        reward: float = 0.

        while isinstance(state, NonTerminal):
            next_distribution = self.transition_reward(state)
            next_state, reward = next_distribution.sample()
            yield TransitionStep(state, next_state, reward)
            state = next_state
So the idea is that if someone wants to model a Markov Reward Process, they’d simply
have to create a concrete class that implements the interface of the abstract class MarkovRewardProcess
(specifically implement the abstract method transition_reward). But note that the abstract
method transition of MarkovProcess also needs to be implemented to make the whole
thing concrete. However, we don’t have to implement it in the concrete class implementing
the interface of MarkovRewardProcess - in fact, we can implement it in the MarkovRewardProcess
class itself by tapping the method transition_reward. Here’s the code for the transition
method in MarkovRewardProcess:
from rl.distribution import Distribution, SampledDistribution

def transition(self, state: NonTerminal[S]) -> Distribution[State[S]]:
    distribution = self.transition_reward(state)

    def next_state(distribution=distribution):
        next_s, _ = distribution.sample()
        return next_s

    return SampledDistribution(next_state)
Given a specification of $\mathcal{P}_R$, we can construct the reward transition function
$$\mathcal{R}_T : \mathcal{N} \times \mathcal{S} \to \mathbb{R}$$
defined as:
$$\mathcal{R}_T(s, s') = \mathbb{E}[R_{t+1} \mid S_{t+1} = s', S_t = s] = \sum_{r \in \mathcal{D}} \frac{\mathcal{P}_R(s, r, s')}{\mathcal{P}(s, s')} \cdot r = \sum_{r \in \mathcal{D}} \frac{\mathcal{P}_R(s, r, s')}{\sum_{r \in \mathcal{D}} \mathcal{P}_R(s, r, s')} \cdot r$$
Likewise, the reward function
$$\mathcal{R} : \mathcal{N} \to \mathbb{R}$$
is defined as:
$$\mathcal{R}(s) = \mathbb{E}[R_{t+1} \mid S_t = s] = \sum_{s' \in \mathcal{S}} \mathcal{P}(s, s') \cdot \mathcal{R}_T(s, s') = \sum_{s' \in \mathcal{S}} \sum_{r \in \mathcal{D}} \mathcal{P}_R(s, r, s') \cdot r$$
We’ve created a bit of notational clutter here. So it would be a good idea for you to
take a few minutes to pause, reflect and internalize the differences between PR , P (of the
implicit Markov Process), RT and R. This notation will analogously re-appear when we
learn about Markov Decision Processes in Chapter 2. Moreover, this notation will be used
considerably in the rest of the book, so it pays to get comfortable with their semantics.
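To make the distinctions concrete, here is a small illustrative sketch (not from the book’s codebase; the two states "A"/"B", rewards and probabilities are made up) that computes $\mathcal{P}$, $\mathcal{R}_T$ and $\mathcal{R}$ from a toy $\mathcal{P}_R$ specification using plain dictionaries:

from collections import defaultdict

# A hypothetical 2-state toy specification of P_R: (s, r, s') -> probability.
pr = {
    ("A", 1.0, "A"): 0.3, ("A", 2.0, "B"): 0.7,
    ("B", 0.0, "A"): 0.5, ("B", 5.0, "B"): 0.5,
}

p = defaultdict(float)     # P(s, s') = sum over r of P_R(s, r, s')
r_fn = defaultdict(float)  # R(s) = sum over (r, s') of P_R(s, r, s') * r
for (s, r, s1), prob in pr.items():
    p[(s, s1)] += prob
    r_fn[s] += prob * r

# R_T(s, s') = expected reward on an s -> s' transition
rt = {(s, s1): sum(prob * r for (s_, r, s1_), prob in pr.items()
                   if (s_, s1_) == (s, s1)) / p[(s, s1)]
      for (s, s1) in p}

print(p[("A", "B")])   # 0.7
print(r_fn["A"])       # 0.3 * 1.0 + 0.7 * 2.0 = 1.7
print(rt[("A", "B")])  # 2.0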
1.9. Simple Inventory Example as a Markov Reward Process
We now embellish the simple inventory example with rewards equal to the negative of the costs incurred at the store:
• Holding cost of h for each bicycle that remains in your store overnight. Think of
this as “interest on inventory” - each day your bicycle remains unsold, you lose the
opportunity to gain interest on the cash you paid to buy the bicycle. Holding cost
also includes the cost of upkeep of inventory.
• Stockout cost of p for each unit of “missed demand,” i.e., for each customer wanting
to buy a bicycle that you could not satisfy with available inventory, e.g., if 3 customers
show up during the day wanting to buy a bicycle each, and you have only 1 bicycle
at 8am (store opening time), then you lost two units of demand, incurring a cost of
2p. Think of the cost of p per unit as the lost revenue plus disappointment for the
customer. Typically p ≫ h.
The precise sequence of events in each 24-hour cycle is the same as for the Markov Process version of this example, now with the holding cost and stockout cost above incorporated as (negative) rewards at each daily state transition.
Since the customer demand on any day can be an infinite set of possibilities (a Poisson distribution over the entire range of non-negative integers), we have an infinite set of pairs of next state and reward we could transition to from a given current state. Let’s see what the probabilities of each of these transitions look like. For a given current state $S_t := (\alpha, \beta)$, if the customer demand for the day is $i$, then the probability of the corresponding transition to the pair of next state and reward is:
$$\mathcal{P}_R((\alpha, \beta),\; -h \cdot \alpha - p \cdot \max(i - (\alpha + \beta), 0),\; (\max(\alpha + \beta - i, 0), \max(C - (\alpha + \beta), 0))) = \frac{e^{-\lambda} \lambda^i}{i!} \quad \text{for all } i = 0, 1, 2, \ldots$$
Now let’s write some code to implement this simple inventory example as a Markov
Reward Process as described above. All we have to do is to create a concrete class imple-
menting the interface of the abstract class MarkovRewardProcess (specifically implement
the abstract method transition_reward). The code below in transition_reward method
in class SimpleInventoryMRP samples the customer demand from a Poisson distribution,
uses the above formulas for the pair of next state and reward as a function of the customer
demand sample, and returns an instance of SampledDistribution. Note that the generic
state type S is replaced here with the @dataclass InventoryState to represent a state of
this Markov Reward Process, comprising the On-Hand and On-Order inventory quan-
tities.
from rl.distribution import SampledDistribution
import numpy as np

@dataclass(frozen=True)
class InventoryState:
    on_hand: int
    on_order: int

    def inventory_position(self) -> int:
        return self.on_hand + self.on_order

class SimpleInventoryMRP(MarkovRewardProcess[InventoryState]):
    def __init__(
        self,
        capacity: int,
        poisson_lambda: float,
        holding_cost: float,
        stockout_cost: float
    ):
        self.capacity: int = capacity
        self.poisson_lambda: float = poisson_lambda
        self.holding_cost: float = holding_cost
        self.stockout_cost: float = stockout_cost

    def transition_reward(
        self,
        state: NonTerminal[InventoryState]
    ) -> SampledDistribution[Tuple[State[InventoryState], float]]:

        def sample_next_state_reward(state=state) ->\
                Tuple[State[InventoryState], float]:
            demand_sample: int = np.random.poisson(self.poisson_lambda)
            ip: int = state.state.inventory_position()
            next_state: InventoryState = InventoryState(
                max(ip - demand_sample, 0),
                max(self.capacity - ip, 0)
            )
            reward: float = - self.holding_cost * state.state.on_hand\
                - self.stockout_cost * max(demand_sample - ip, 0)
            return NonTerminal(next_state), reward

        return SampledDistribution(sample_next_state_reward)
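As a quick usage sketch (assuming a Constant single-outcome distribution in rl/distribution.py for the start state; the parameter values are illustrative), we could sample a short reward trace from this process:

import itertools
from rl.distribution import Constant

si_mrp = SimpleInventoryMRP(
    capacity=2,
    poisson_lambda=1.0,
    holding_cost=1.0,
    stockout_cost=10.0
)
start = Constant(NonTerminal(InventoryState(on_hand=0, on_order=0)))
# Print the first 5 sampled (state, next state, reward) transition steps.
for step in itertools.islice(si_mrp.simulate_reward(start), 5):
    print(step)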
1.10. Finite Markov Reward Processes
If the state space is finite and the set of unique pairs of next state and reward transitions
from each non-terminal state is also finite, we refer to the Markov Reward Process as a
Finite Markov Reward Process. So let us write some code for a Finite Markov Reward Pro-
cess. We create a concrete class FiniteMarkovRewardProcess that primarily inherits from
FiniteMarkovProcess (a concrete class) and secondarily implements the interface of the
abstract class MarkovRewardProcess. Our first task is to think about the data structure re-
quired to specify an instance of FiniteMarkovRewardProcess (i.e., the data structure we’d
pass to the __init__ method of FiniteMarkovRewardProcess). Analogous to how we cur-
ried P for a Markov Process as N → (S → [0, 1]) (where S = {s1 , s2 , . . . , sn } and N has
m ≤ n states), here we curry PR as:
N → (S × D → [0, 1])
Since S is finite and since the set of unique pairs of next state and reward transitions is
also finite, this leads to the analog of the Transition data type for the case of Finite Markov
Reward Processes (named RewardTransition) as follows:
import numpy as np
from rl.distribution import FiniteDistribution, Categorical
from collections import defaultdict
from typing import Mapping, Tuple, Dict, Set

# The analogs of the Transition data type for Finite Markov Reward Processes:
StateReward = FiniteDistribution[Tuple[State[S], float]]
RewardTransition = Mapping[NonTerminal[S], StateReward[S]]

class FiniteMarkovRewardProcess(FiniteMarkovProcess[S],
                                MarkovRewardProcess[S]):

    transition_reward_map: RewardTransition[S]
    reward_function_vec: np.ndarray
The code for FiniteMarkovRewardProcess (and more) is in rl/markov_process.py.
    def __init__(
        self,
        transition_reward_map: Mapping[S, FiniteDistribution[Tuple[S, float]]]
    ):
        transition_map: Dict[S, FiniteDistribution[S]] = {}

        for state, trans in transition_reward_map.items():
            probabilities: Dict[S, float] = defaultdict(float)
            for (next_state, _), probability in trans:
                probabilities[next_state] += probability

            transition_map[state] = Categorical(probabilities)

        super().__init__(transition_map)

        nt: Set[S] = set(transition_reward_map.keys())
        self.transition_reward_map = {
            NonTerminal(s): Categorical(
                {(NonTerminal(s1) if s1 in nt else Terminal(s1), r): p
                 for (s1, r), p in v}
            ) for s, v in transition_reward_map.items()
        }

        self.reward_function_vec = np.array([
            sum(probability * reward for (_, reward), probability in
                self.transition_reward_map[state])
            for state in self.non_terminal_states
        ])

    def transition_reward(self, state: NonTerminal[S]) -> StateReward[S]:
        return self.transition_reward_map[state]
1.11. Simple Inventory Example as a Finite Markov Reward Process
When the next state’s ($S_{t+1}$) On-Hand is greater than zero, it means all of the day’s demand was satisfied with inventory that was available at store-opening (= α + β), and hence, each of these next states $S_{t+1}$ corresponds to no stockout cost and only an overnight holding cost of hα. Therefore, the expected reward for each of these transitions is simply −h · α.
When the next state’s ($S_{t+1}$) On-Hand is equal to zero, there are two possibilities:
1. The demand for the day was exactly α + β, meaning all demand was satisfied with
available store inventory (so no stockout cost and only overnight holding cost), or
2. The demand for the day was strictly greater than α + β, meaning there’s some stock-
out cost in addition to overnight holding cost. The exact stockout cost is an expec-
tation calculation involving the number of units of missed demand under the corre-
sponding poisson probabilities of demand exceeding α + β.
So now let’s write code for the simple inventory example as a Finite Markov Reward Process, as a derived class SimpleInventoryMRPFinite that constructs the transition_reward_map required by the __init__ of FiniteMarkovRewardProcess:
from dataclasses import dataclass
from scipy.stats import poisson

@dataclass(frozen=True)
class InventoryState:
    on_hand: int
    on_order: int

    def inventory_position(self) -> int:
        return self.on_hand + self.on_order

class SimpleInventoryMRPFinite(FiniteMarkovRewardProcess[InventoryState]):
    def __init__(
        self,
        capacity: int,
        poisson_lambda: float,
        holding_cost: float,
        stockout_cost: float
    ):
        self.capacity: int = capacity
        self.poisson_lambda: float = poisson_lambda
        self.holding_cost: float = holding_cost
        self.stockout_cost: float = stockout_cost

        self.poisson_distr = poisson(poisson_lambda)
        super().__init__(self.get_transition_reward_map())

    def get_transition_reward_map(self) -> \
            Mapping[
                InventoryState,
                FiniteDistribution[Tuple[InventoryState, float]]
            ]:
        d: Dict[InventoryState, Categorical[Tuple[InventoryState, float]]] = {}
        for alpha in range(self.capacity + 1):
            for beta in range(self.capacity + 1 - alpha):
                state = InventoryState(alpha, beta)
                ip = state.inventory_position()
                beta1 = self.capacity - ip
                base_reward = - self.holding_cost * state.on_hand
                sr_probs_map: Dict[Tuple[InventoryState, float], float] =\
                    {(InventoryState(ip - i, beta1), base_reward):
                     self.poisson_distr.pmf(i) for i in range(ip)}
                probability = 1 - self.poisson_distr.cdf(ip - 1)
                reward = base_reward - self.stockout_cost *\
                    (probability * (self.poisson_lambda - ip) +
                     ip * self.poisson_distr.pmf(ip))
                sr_probs_map[(InventoryState(0, beta1), reward)] = probability
                d[state] = Categorical(sr_probs_map)
        return d
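As a usage sketch, we can instantiate this class with the parameter values used elsewhere in this book (C = 2, λ = 1.0, h = 1.0, p = 10.0) and inspect the expected-reward vector R populated in the __init__ of FiniteMarkovRewardProcess above:

si_mrp = SimpleInventoryMRPFinite(
    capacity=2,
    poisson_lambda=1.0,
    holding_cost=1.0,
    stockout_cost=10.0
)
# One entry of R(s) per non-terminal state, ordered as in
# si_mrp.non_terminal_states.
print(si_mrp.reward_function_vec)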
1.12. Value Function of a Markov Reward Process
Now we are ready to formally define the main problem involving Markov Reward Pro-
cesses. As we’ve said earlier, we’d like to compute the “expected accumulated rewards”
from any non-terminal state. However, if we simply add up the rewards in a sampling trace following time step $t$ as $\sum_{i=t+1}^{\infty} R_i = R_{t+1} + R_{t+2} + \ldots$, the sum would often diverge to infinity. So we allow for rewards accumulation to be done with a discount factor $\gamma \in [0, 1]$: We define the (random) Return $G_t$ as the “discounted accumulation of future rewards” following time step $t$. Formally,
$$G_t = \sum_{i=t+1}^{\infty} \gamma^{i-t-1} \cdot R_i = R_{t+1} + \gamma \cdot R_{t+2} + \gamma^2 \cdot R_{t+3} + \ldots$$
We use the above definition of Return even for a terminating sequence (say terminating
at t = T , i.e., ST ∈ T ), by treating Ri = 0 for all i > T .
Note that γ can range from a value of 0 on one extreme (called “myopic”) to a value
of 1 on another extreme (called “far-sighted”). “Myopic” means the Return is the same
as Reward (no accumulation of future Rewards in the Return). With “far-sighted” (γ =
1), the Return calculation can diverge for continuing (non-terminating) Markov Reward
Processes but “far-sighted” is indeed applicable for episodic Markov Reward Processes
(where all random sequences of the process terminate). Apart from the Return divergence
consideration, γ < 1 helps algorithms become more tractable (as we shall see later when
we get to Reinforcement Learning). We should also point out that the reason to have γ < 1
is not just for mathematical convenience or computational tractability - there are valid
modeling reasons to discount Rewards when accumulating to a Return. When Reward
is modeled as a financial quantity (revenues, costs, profits etc.), as will be the case in
most financial applications, it makes sense to incorporate time-value-of-money which is a
fundamental concept in Economics/Finance that says there is greater benefit in receiving a
dollar now versus later (which is the economic reason why interest is paid or earned). So it is common to set $\gamma$ to be the discounting based on the prevailing interest rate ($\gamma = \frac{1}{1+r}$ where $r$ is the interest rate over a single time step). Another technical reason for setting $\gamma < 1$ is that our models often don’t fully capture future uncertainty and so, discounting with $\gamma$ acts to de-emphasize future rewards that might not be accurate (due to future uncertainty modeling limitations). Lastly, from an AI perspective, if we want to build machines that act like humans, psychologists have indeed demonstrated that human/animal behavior prefers immediate reward over future reward.
Note that we are (as usual) assuming the fact that the Markov Reward Process is time-
homogeneous (time-invariant probabilities of state transitions and rewards).
As you might imagine now, we’d want to identify non-terminal states with large ex-
pected returns and those with small expected returns. This, in fact, is the main problem
involving a Markov Reward Process - to compute the “Expected Return” associated with
each non-terminal state in the Markov Reward Process. Formally, we are interested in
computing the Value Function
V :N →R
defined as:
V (s) = E[Gt |St = s] for all s ∈ N , for all t = 0, 1, 2, . . .
For the rest of the book, we will assume that whenever we are talking about a Value
Function, the discount factor γ is appropriate to ensure that the Expected Return from
each state is finite.
Now we show a creative piece of mathematics due to Richard Bellman. Bellman noted
(Bellman 1957b) that the Value Function has a recursive structure. Specifically,
Figure 1.7.: Visualization of MRP Bellman Equation
$$V(s) = \mathcal{R}(s) + \gamma \cdot \sum_{s' \in \mathcal{N}} \mathcal{P}(s, s') \cdot V(s') \quad \text{for all } s \in \mathcal{N} \tag{1.1}$$
Note that although the transitions to random states s′ , s′′ , . . . are in the state space of
S rather than N , the right-hand-side above sums over states s′ , s′′ , . . . only in N because
transitions to terminal states (in T = S − N ) don’t contribute any reward beyond the
rewards produced before reaching the terminal state.
We refer to this recursive equation (1.1) for the Value Function as the Bellman Equation
for Markov Reward Processes. Figure 1.7 is a convenient visualization aid of this important
equation. In the rest of the book, we will depict quite a few of these types of state-transition
visualizations to aid with creating mental models of key concepts.
For the case of Finite Markov Reward Processes, assume S = {s1 , s2 , . . . , sn } and assume
N has m ≤ n states. Below we use bold-face notation to represent functions as column
vectors and matrices since we have finite states/transitions. So, V is a column vector of
length m, P is an m × m matrix, and R is a column vector of length m (rows/columns
corresponding to states in N ), so we can express the above equation in vector and matrix
notation as follows:
$$\mathbf{V} = \mathbf{R} + \gamma \mathbf{P} \cdot \mathbf{V}$$
Therefore,
$$\Rightarrow \mathbf{V} = (\mathbf{I}_m - \gamma \mathbf{P})^{-1} \cdot \mathbf{R} \tag{1.2}$$
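A minimal sketch of this direct linear-algebra calculation, as a method on FiniteMarkovRewardProcess (the method name get_value_function_vec is assumed here; it relies on get_transition_matrix and reward_function_vec defined earlier):

import numpy as np

def get_value_function_vec(self, gamma: float) -> np.ndarray:
    # Solve (I_m - gamma * P) V = R rather than explicitly inverting the matrix.
    return np.linalg.solve(
        np.eye(len(self.non_terminal_states)) -
        gamma * self.get_transition_matrix(),
        self.reward_function_vec
    )

Applying this to the simple inventory MRP instance above yields the Value Function referenced in the next paragraph.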
This tells us that On-Hand of 0 and On-Order of 2 has the highest expected reward.
However, the Value Function is highest for On-Hand of 0 and On-Order of 1.
This computation for the Value Function works if the state space is not too large (the size
of the square linear system of equations is equal to the number of non-terminal states). When
the state space is large, this direct method of solving a linear system of equations won’t
scale and we have to resort to numerical methods to solve the recursive Bellman Equation.
This is the topic of Dynamic Programming and Reinforcement Learning algorithms that
we shall learn in this book.
1.13. Summary of Key Learnings from this Chapter
Before we end this chapter, we’d like to highlight the two highly important concepts we
learnt in this chapter:
• Markov Property: A concept that enables us to reason effectively and compute effi-
ciently in practical systems involving sequential uncertainty
• Bellman Equation: A mathematical insight that enables us to express the Value Func-
tion recursively - this equation (and its Optimality version covered in Chapter 2) is
in fact the core idea within all Dynamic Programming and Reinforcement Learning
algorithms.
2. Markov Decision Processes
We’ve said before that this book is about “sequential decisioning” under “sequential un-
certainty.” In Chapter 1, we covered the “sequential uncertainty” aspect with the frame-
work of Markov Processes, and we extended the framework to also incorporate the no-
tion of uncertain “Reward” each time we make a state transition - we called this extended
framework Markov Reward Processes. However, this framework had no notion of “se-
quential decisioning.” In this chapter, we further extend the framework of Markov Re-
ward Processes to incorporate the notion of “sequential decisioning,” formally known as
Markov Decision Processes. Before we step into the formalism of Markov Decision Pro-
cesses, let us develop some intuition and motivation for the need to have such a framework
- to handle sequential decisioning. Let’s do this by re-visiting the simple inventory exam-
ple we covered in Chapter 1.
Recall that at 6pm store-closing each evening, having observed the State (α, β), we followed the policy of ordering the quantity:
$$\theta = \max(C - (\alpha + \beta), 0)$$
where θ ∈ Z≥0 is the order quantity, C ∈ Z≥0 is the space capacity (in bicycle units) at
the store, α is the On-Hand Inventory and β is the On-Order Inventory ((α, β) comprising
the State). We calculated the Value Function for the Markov Reward Process that results
from following this policy. Now we ask the question: Is this Value Function good enough?
More importantly, we ask the question: Can we improve this Value Function by following a
different ordering policy? Perhaps by ordering less than that implied by the above formula
for θ? This leads to the natural question - Can we identify the ordering policy that yields
the Optimal Value Function (one with the highest expected returns, i.e., lowest expected
accumulated costs, from each state)? Let us get an intuitive sense for this optimization
problem by considering a concrete example.
Assume that instead of bicycles, we want to control the inventory of a specific type of
toothpaste in the store. Assume you have space for 20 units of toothpaste on the shelf as-
signed to the toothpaste (assume there is no space in the backroom of the store). Assume
that customer demand follows a Poisson distribution with Poisson parameter λ = 3.0. At
6pm store-closing each evening, when you observe the State as (α, β), you now have a
choice of ordering a quantity of toothpastes from any of the following values for the or-
der quantity θ : {0, 1, . . . , max(20 − (α + β), 0)}. Let’s say at Monday 6pm store-closing,
α = 4 and β = 3. So, you have a choice of order quantities from among the integers in
the range of 0 to (20 - (4 + 3) = 13) (i.e., 14 choices). Previously, in the Markov Reward
Process model, you would have ordered 13 units on Monday store-closing. This means
on Wednesday morning at 6am, a truck would have arrived with 13 units of the tooth-
paste. If you sold say 2 units of the toothpaste on Tuesday, then on Wednesday 8am at
store-opening, you’d have 4 + 3 - 2 + 13 = 18 units of toothpaste on your shelf. If you keep
following this policy, you’d typically have almost a full shelf at store-opening each day,
which covers almost a week worth of expected demand for the toothpaste. This means
your risk of going out-of-stock on the toothpaste is extremely low, but you’d be incurring
considerable holding cost (you’d have close to a full shelf of toothpastes sitting around
almost each night). So as a store manager, you’d be thinking - “I can lower my costs by
ordering less than that prescribed by the formula of 20 − (α + β).” But how much less?
If you order too little, you’d start the day with too little inventory and might risk going
out-of-stock. That’s a risk you are highly uncomfortable with since the stockout cost per
unit of missed demand (we called it p) is typically much higher than the holding cost per
unit (we called it h). So you’d rather “err” on the side of having more inventory. But how
much more? We also need to factor in the fact that the 36-hour lead time means a large
order incurs large holding costs two days later. Most importantly, to find this right balance
in terms of a precise mathematical optimization of the Value Function, we’d have to factor
in the uncertainty of demand (based on daily Poisson probabilities) in our calculations.
Now this gives you a flavor of the problem of sequential decisioning (each day you have
to decide how much to order) under sequential uncertainty.
To deal with the “decisioning” aspect, we will introduce the notion of Action to com-
plement the previously introduced notions of State and Reward. In the inventory example,
the order quantity is our Action. After observing the State, we choose from among a set
of Actions (in this case, we choose from within the set {0, 1, . . . , max(C − (α + β), 0)}).
We note that the Action we take upon observing a state affects the next day’s state. This
is because the next day’s On-Order is exactly equal to today’s order quantity (i.e., today’s
action). This in turn might affect our next day’s action since the action (order quantity)
is typically a function of the state (On-Hand and On-Order inventory). Also note that the
Action we take on a given day will influence the Rewards after a couple of days (i.e. after
the order arrives). It may affect our holding cost adversely if we had ordered too much or
it may affect our stockout cost adversely if we had ordered too little and then experienced
high demand.
• At each time step t, an Action At is picked (from among a specified choice of actions)
upon observing the State St
• Given an observed State St and a performed Action At , the probabilities of the state
and reward of the next time step (St+1 and Rt+1 ) are in general a function of not just
the state St , but also of the action At .
We are tasked with maximizing the Expected Return from each state (i.e., maximizing
the Value Function). This seems like a pretty hard problem in the general case because
there is a cyclic interplay between:
• the choice of action depending on the observed state, on one hand, and
• next state/reward probabilities depending on action (and state), on the other hand.
There is also the challenge that actions might have delayed consequences on rewards,
and it’s not clear how to disentangle the effects of actions from different time steps on a
Figure 2.1.: Markov Decision Process
future reward. So without direct correspondence between actions and rewards, how can
we control the actions so as to maximize expected accumulated rewards? To answer this
question, we will need to set up some notation and theory. Before we formally define the
Markov Decision Process framework and its associated (elegant) theory, let us set up a
bit of terminology.
Using the language of AI, we say that at each time step t, the Agent (the algorithm we
design) observes the state St , after which the Agent performs action At , after which the
Environment (upon seeing St and At ) produces a random pair: the next state St+1
and the next reward Rt+1 , after which the Agent observes this next state St+1 , and the cycle
repeats (until we reach a terminal state). This cyclic interplay is depicted in Figure 2.1.
Note that time ticks over from t to t + 1 when the environment sees the state St and action
At .
The MDP framework was formalized in a paper by Richard Bellman (Bellman 1957a)
and the MDP theory was developed further in Richard Bellman’s book named Dynamic
Programming (Bellman 1957b) and in Ronald Howard’s book named Dynamic Programming
and Markov Processes (Howard 1960).
• A countable set of states S (known as the State Space), a set T ⊆ S (known as the
set of Terminal States), and a countable set of actions A
• A time-indexed sequence of environment-generated random states St ∈ S for time
steps t = 0, 1, 2, . . ., a time-indexed sequence of environment-generated Reward ran-
dom variables Rt ∈ D (a countable subset of R) for time steps t = 1, 2, . . ., and a time-
indexed sequence of agent-controllable actions At ∈ A for time steps t = 0, 1, 2, . . ..
(Sometimes we restrict the set of actions allowable from specific states, in which case,
we abuse the A notation to refer to a function whose domain is N and range is A,
and we say that the set of actions allowable from a state s ∈ N is A(s).)
• Markov Property:
$$\mathbb{P}[(R_{t+1}, S_{t+1}) \mid (S_t, A_t, S_{t-1}, A_{t-1}, \ldots, S_0, A_0)] = \mathbb{P}[(R_{t+1}, S_{t+1}) \mid (S_t, A_t)] \quad \text{for all } t \geq 0$$
• Termination: If an outcome for ST (for some time step T ) is a state in the set T , then
this sequence outcome terminates at time step T .
As in the case of Markov Reward Processes, we denote the set of non-terminal states
S − T as N and refer to any state in N as a non-terminal state. The sequence:
$$S_0, A_0, R_1, S_1, A_1, R_2, S_2, \ldots$$
terminates at time step T if ST ∈ T (i.e., the final reward is RT and the final action is
AT −1 ).
In the more general case, where states or rewards are uncountable, the same concepts
apply except that the mathematical formalism needs to be more detailed and more careful.
Specifically, we’d end up with integrals instead of summations, and probability density
functions (for continuous probability distributions) instead of probability mass functions
(for discrete probability distributions). For ease of notation and more importantly, for
ease of understanding of the core concepts (without being distracted by heavy mathemat-
ical formalism), we’ve chosen to stay with discrete-time, countable S, countable A and
countable D (by default). However, there will be examples of Markov Decision Processes
in this book involving continuous-time and uncountable S, A and D (please adjust the
definitions and formulas accordingly).
As in the case of Markov Processes and Markov Reward Processes, we shall (by default)
assume Time-Homogeneity for Markov Decision Processes, i.e., P[(Rt+1 , St+1 )|(St , At )] is
independent of t. This means the transition probabilities of a Markov Decision Process can,
in the most general case, be expressed as a state-reward transition probability function:
$$\mathcal{P}_R : \mathcal{N} \times \mathcal{A} \times \mathcal{D} \times \mathcal{S} \to [0, 1]$$
defined as:
$$\mathcal{P}_R(s, a, r, s') = \mathbb{P}[(R_{t+1} = r, S_{t+1} = s') \mid (S_t = s, A_t = a)]$$
for all $s \in \mathcal{N}, a \in \mathcal{A}, r \in \mathcal{D}, s' \in \mathcal{S}$.
Henceforth, any time we say Markov Decision Process, assume we are referring to a
Discrete-Time, Time-Homogeneous Markov Decision Process with countable spaces and
countable transitions (unless explicitly specified otherwise), which in turn can be charac-
terized by the state-reward transition probability function PR . Given a specification of PR ,
we can construct:
• The state transition probability function
P : N × A × S → [0, 1]
defined as:
$$\mathcal{P}(s, a, s') = \sum_{r \in \mathcal{D}} \mathcal{P}_R(s, a, r, s')$$
• The reward transition function
$$\mathcal{R}_T : \mathcal{N} \times \mathcal{A} \times \mathcal{S} \to \mathbb{R}$$
defined as:
$$\mathcal{R}_T(s, a, s') = \mathbb{E}[R_{t+1} \mid S_{t+1} = s', S_t = s, A_t = a] = \sum_{r \in \mathcal{D}} \frac{\mathcal{P}_R(s, a, r, s')}{\mathcal{P}(s, a, s')} \cdot r = \sum_{r \in \mathcal{D}} \frac{\mathcal{P}_R(s, a, r, s')}{\sum_{r \in \mathcal{D}} \mathcal{P}_R(s, a, r, s')} \cdot r$$
• The reward function
$$\mathcal{R} : \mathcal{N} \times \mathcal{A} \to \mathbb{R}$$
is defined as:
$$\mathcal{R}(s, a) = \mathbb{E}[R_{t+1} \mid S_t = s, A_t = a] = \sum_{s' \in \mathcal{S}} \mathcal{P}(s, a, s') \cdot \mathcal{R}_T(s, a, s') = \sum_{s' \in \mathcal{S}} \sum_{r \in \mathcal{D}} \mathcal{P}_R(s, a, r, s') \cdot r$$
2.4. Policy
Having understood the dynamics of a Markov Decision Process, we now move on to the
specification of the Agent’s actions as a function of the current state. In the general case,
we assume that the Agent will perform a random action At , according to a probability
distribution that is a function of the current state St . We refer to this function as a Policy.
Formally, a Policy is a function
$$\pi : \mathcal{N} \times \mathcal{A} \to [0, 1]$$
defined as:
$$\pi(s, a) = \mathbb{P}[A_t = a \mid S_t = s] \quad \text{for all } s \in \mathcal{N}, a \in \mathcal{A}$$
such that
$$\sum_{a \in \mathcal{A}} \pi(s, a) = 1 \quad \text{for all } s \in \mathcal{N}$$
Note that the definition above assumes that a Policy is Markovian, i.e., the action prob-
abilities depend only on the current state and not the history. The definition above also
assumes that a Policy is Stationary, i.e., P[At = a|St = s] is invariant in time t. If we do
encounter a situation where the policy would need to depend on the time t, we’ll simply
include t to be part of the state, which would make the Policy stationary (albeit at the cost
of state-space bloat and hence, computational cost).
When we have a policy such that the action probability distribution for each state is
concentrated on a single action, we refer to it as a deterministic policy. Formally, a deter-
ministic policy $\pi_D : \mathcal{N} \to \mathcal{A}$ has the property that for all $s \in \mathcal{N}$, $\pi(s, \pi_D(s)) = 1$ and $\pi(s, a) = 0$ for all $a \neq \pi_D(s)$.
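The rl/policy.py module provides a DeterministicPolicy class for this purpose (it is used as a base class in the inventory example below). A minimal sketch consistent with the Policy interface, assuming a Constant (single-outcome) distribution in rl/distribution.py, could look as follows (the exact implementation in the library may differ):

from typing import Callable
from rl.distribution import Constant

class DeterministicPolicy(Policy[S, A]):
    def __init__(self, action_for: Callable[[S], A]):
        # Function mapping each non-terminal state to the single action taken.
        self.action_for = action_for

    def act(self, state: NonTerminal[S]) -> Constant[A]:
        # All probability mass on the single prescribed action.
        return Constant(self.action_for(state.state))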
We will often encounter policies that assign equal probabilities to all actions, from each
non-terminal state. We implement this class of policies as follows:
from rl.distribution import Choose

@dataclass(frozen=True)
class UniformPolicy(Policy[S, A]):
    valid_actions: Callable[[S], Iterable[A]]

    def act(self, state: NonTerminal[S]) -> Choose[A]:
        return Choose(self.valid_actions(state.state))
The above code is in the file rl/policy.py.
Now let’s write some code to create some concrete policies for an example we are famil-
iar with - the simple inventory example. We first create a concrete class SimpleInventoryDeterministicPolicy
for deterministic inventory replenishment policies, as a derived class of DeterministicPolicy.
Note that the generic state type S is replaced here with the class InventoryState that repre-
sents a state in the inventory example, comprising the On-Hand and On-Order inven-
tory quantities. Also note that the generic action type A is replaced here with the int type
since in this example, the action is the quantity of inventory to be ordered at store-closing
(which is an integer quantity). Invoking the act method of SimpleInventoryDeterministicPolicy
runs the following deterministic policy:
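The policy orders up to a reorder point r, i.e., θ = max(r − (α + β), 0), mirroring the stochastic version shown below. A sketch of such a class, assuming the DeterministicPolicy base sketched above (whose constructor takes a function from state to action), could be:

class SimpleInventoryDeterministicPolicy(
        DeterministicPolicy[InventoryState, int]):
    def __init__(self, reorder_point: int):
        self.reorder_point: int = reorder_point

        def action_for(s: InventoryState) -> int:
            # Order just enough to bring the inventory position up to the
            # reorder point (never a negative quantity).
            return max(self.reorder_point - s.inventory_position(), 0)

        super().__init__(action_for)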
We can instantiate a specific deterministic policy with a reorder point of say 8 as:
si_dp = SimpleInventoryDeterministicPolicy(reorder_point=8)
Now let’s write some code to create stochastic policies for the inventory example. We
create a concrete class SimpleInventoryStochasticPolicy that implements the interface of
the abstract class Policy (specifically implements the abstract method act). The code in
act implements a stochastic policy as a SampledDistribution[int] driven by a sampling of
the Poisson distribution for the reorder point. Specifically, the reorder point r is treated as
a Poisson random variable with a specified mean (of say λ ∈ R≥0 ). We sample a value of
the reorder point r from this Poisson distribution (with mean λ). Then, we create a sample
order quantity (action) θ ∈ Z≥0 defined as:
θ = max(r − (α + β), 0)
import numpy as np
from rl.distribution import SampledDistribution

class SimpleInventoryStochasticPolicy(Policy[InventoryState, int]):
    def __init__(self, reorder_point_poisson_mean: float):
        self.reorder_point_poisson_mean: float = reorder_point_poisson_mean

    def act(self, state: NonTerminal[InventoryState]) -> \
            SampledDistribution[int]:
        def action_func(state=state) -> int:
            reorder_point_sample: int = \
                np.random.poisson(self.reorder_point_poisson_mean)
            return max(
                reorder_point_sample - state.state.inventory_position(),
                0
            )
        return SampledDistribution(action_func)
We can instantiate a specific stochastic policy with a reorder point poisson distribution
mean of say 8.0 as:
si_sp = SimpleInventoryStochasticPolicy(reorder_point_poisson_mean=8.0)
We will revisit the simple inventory example in a bit after we cover the code for Markov
Decision Processes, when we’ll show how to simulate the Markov Decision Process for this
simple inventory example, with the agent running a deterministic policy. But before we
move on to the code design for Markov Decision Processes (to accompany the above im-
plementation of Policies), we need to cover an important insight linking Markov Decision
Processes, Policies and Markov Reward Processes.
The insight is that an MDP evaluated with a fixed policy $\pi$ is equivalent to an implied Markov Reward Process. Specifically, the state-reward transition probability function $\mathcal{P}_R^{\pi}$ of the MRP implied by the evaluation of the MDP with the policy $\pi$ is defined as:
$$\mathcal{P}_R^{\pi}(s, r, s') = \sum_{a \in \mathcal{A}} \pi(s, a) \cdot \mathcal{P}_R(s, a, r, s')$$
Likewise,
$$\mathcal{P}^{\pi}(s, s') = \sum_{a \in \mathcal{A}} \pi(s, a) \cdot \mathcal{P}(s, a, s')$$
$$\mathcal{R}_T^{\pi}(s, s') = \sum_{a \in \mathcal{A}} \pi(s, a) \cdot \mathcal{R}_T(s, a, s')$$
$$\mathcal{R}^{\pi}(s) = \sum_{a \in \mathcal{A}} \pi(s, a) \cdot \mathcal{R}(s, a)$$
So any time we talk about an MDP evaluated with a fixed policy, you should know that
we are effectively talking about the implied MRP. This insight is now going to be key in
the design of our code to represent Markov Decision Processes.
We create an abstract class called MarkovDecisionProcess (code shown below) with two
abstract methods - step and actions. The step method is key: it is meant to specify the
distribution of pairs of next state and reward, given a non-terminal state and action. The
actions method’s interface specifies that it takes as input a state: NonTerminal[S] and
produces as output an Iterable[A] to represent the set of actions allowable for the input
state (since the set of actions can be potentially infinite - in which case we’d have to return
an Iterator[A] - the return type is fairly generic, i.e., Iterable[A]).
The apply_policy method takes as input a policy: Policy[S, A] and returns a MarkovRewardProcess
representing the implied MRP. Let’s understand the code in apply_policy: First, we con-
struct a class RewardProcess that implements the abstract method transition_reward of
MarkovRewardProcess. transition_reward takes as input a state: NonTerminal[S], creates
actions: Distribution[A] by applying the given policy on state, and finally uses the
apply method of Distribution to transform actions: Distribution[A] into a Distribution[Tuple[State[S],
float]] (distribution of (next state, reward) pairs) using the abstract method step.
We also write the simulate_actions method that is analogous to the simulate_reward
method we had written for MarkovRewardProcess for generating a sampling trace. In this
case, each step in the sampling trace involves sampling an action from the given policy and
then sampling the pair of next state and reward, given the state and sampled action. Each
generated TransitionStep object consists of the 4-tuple: (state, action, next state, reward).
Here’s the actual code:
class MarkovDecisionProcess(ABC, Generic[S, A]):
    @abstractmethod
    def actions(self, state: NonTerminal[S]) -> Iterable[A]:
        pass

    @abstractmethod
    def step(self, state: NonTerminal[S], action: A)\
            -> Distribution[Tuple[State[S], float]]:
        pass

    def apply_policy(self, policy: Policy[S, A]) -> MarkovRewardProcess[S]:
        mdp = self

        class RewardProcess(MarkovRewardProcess[S]):
            def transition_reward(self, state: NonTerminal[S])\
                    -> Distribution[Tuple[State[S], float]]:
                actions: Distribution[A] = policy.act(state)
                return actions.apply(lambda a: mdp.step(state, a))

        return RewardProcess()

    def simulate_actions(
        self,
        start_states: Distribution[NonTerminal[S]],
        policy: Policy[S, A]
    ) -> Iterable[TransitionStep[S, A]]:
        state: State[S] = start_states.sample()
        while isinstance(state, NonTerminal):
            action_distribution = policy.act(state)
            action = action_distribution.sample()
            next_distribution = self.step(state, action)
            next_state, reward = next_distribution.sample()
            yield TransitionStep(state, action, next_state, reward)
            state = next_state
We now model the simple inventory example without a capacity constraint as a Markov Decision Process, by implementing the abstract methods step and actions: sampling the pair of next state and reward within step lets us effectively describe the MDP in terms of this step method. The actions method returns an Iterator[int], an infinite generator of non-negative integers, to represent the fact that the action space (order quantities) for any state comprises all non-negative integers.
import itertools
import numpy as np
from rl.distribution import SampledDistribution

@dataclass(frozen=True)
class SimpleInventoryMDPNoCap(MarkovDecisionProcess[InventoryState, int]):
    poisson_lambda: float
    holding_cost: float
    stockout_cost: float

    def step(
        self,
        state: NonTerminal[InventoryState],
        order: int
    ) -> SampledDistribution[Tuple[State[InventoryState], float]]:

        def sample_next_state_reward(
            state=state,
            order=order
        ) -> Tuple[State[InventoryState], float]:
            demand_sample: int = np.random.poisson(self.poisson_lambda)
            ip: int = state.state.inventory_position()
            next_state: InventoryState = InventoryState(
                max(ip - demand_sample, 0),
                order
            )
            reward: float = - self.holding_cost * state.state.on_hand\
                - self.stockout_cost * max(demand_sample - ip, 0)
            return NonTerminal(next_state), reward

        return SampledDistribution(sample_next_state_reward)

    def actions(self, state: NonTerminal[InventoryState]) -> Iterator[int]:
        return itertools.count(start=0, step=1)
We leave it to you as an exercise to run various simulations of the MRP implied by the de-
terministic and stochastic policy instances we had created earlier (the above code is in the
file rl/chapter3/simple_inventory_mdp_nocap.py). See the method fraction_of_days_oos
in this file as an example of a simulation to calculate the percentage of days when we’d be
unable to satisfy some customer demand for toothpaste due to too little inventory at store-
opening (naturally, the higher the re-order point in the policy, the lesser the percentage of
days when we’d be Out-of-Stock). This kind of simulation exercise helps build intuition
on the tradeoffs we have to make between having too little inventory versus having too
much inventory (holding costs versus stockout costs) - essentially leading to our ultimate
goal of determining the Optimal Policy (more on this later).
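For instance, a crude simulation-based comparison of policies could average the reward over a truncated trace of the implied MRP (a sketch, assuming the Constant distribution from rl/distribution.py; the parameter values are illustrative):

import itertools
from rl.distribution import Constant

si_mdp_nocap = SimpleInventoryMDPNoCap(
    poisson_lambda=2.0, holding_cost=1.0, stockout_cost=10.0)
implied_mrp = si_mdp_nocap.apply_policy(si_sp)  # si_sp: the stochastic policy above
start = Constant(NonTerminal(InventoryState(on_hand=0, on_order=0)))
steps = itertools.islice(implied_mrp.simulate_reward(start), 1000)
# Average reward per day over a 1000-day trace.
print(sum(step.reward for step in steps) / 1000)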
If the state space is finite, the action space is finite, and the set of pairs of next state and
reward transitions from each pair of non-terminal state and action is also finite, we refer
to the Markov Decision Process as a Finite Markov Decision Process. Let us write some code for a Finite Markov Decision Pro-
Process. We create a concrete class FiniteMarkovDecisionProcess that implements the in-
terface of the abstract class MarkovDecisionProcess (specifically implements the abstract
methods step and actions). Our first task is to think about the data structure required
to specify an instance of FiniteMarkovDecisionProcess (i.e., the data structure we’d pass
to the __init__ method of FiniteMarkovDecisionProcess). Analogous to how we curried
PR for a Markov Reward Process as N → (S × D → [0, 1]) (where S = {s1 , s2 , . . . , sn } and
N has m ≤ n states), here we curry PR for the MDP as:
N → (A → (S × D → [0, 1]))
Since S is finite, A is finite, and the set of next state and reward transitions for each pair
of current state and action is also finite, we can represent PR as a data structure of type
StateActionMapping[S, A] as shown below:
ActionMapping = Mapping[A, StateReward[S]]
StateActionMapping = Mapping[NonTerminal[S], ActionMapping[A, S]]

class FiniteMarkovDecisionProcess(MarkovDecisionProcess[S, A]):

    mapping: StateActionMapping[S, A]

    def __init__(
        self,
        mapping: Mapping[S, Mapping[A, FiniteDistribution[Tuple[S, float]]]]
    ):
        non_terminals: Set[S] = set(mapping.keys())
        self.mapping = {NonTerminal(s): {a: Categorical(
            {(NonTerminal(s1) if s1 in non_terminals else Terminal(s1), r): p
             for (s1, r), p in v}
        ) for a, v in d.items()} for s, d in mapping.items()}
        self.non_terminal_states = list(self.mapping.keys())

    def __repr__(self) -> str:
        display = ""
        for s, d in self.mapping.items():
            display += f"From State {s.state}:\n"
            for a, d1 in d.items():
                display += f"  With Action {a}:\n"
                for (s1, r), p in d1:
                    opt = "Terminal " if isinstance(s1, Terminal) else ""
                    display += f"    To [{opt}State {s1.state} and "\
                        + f"Reward {r:.3f}] with Probability {p:.3f}\n"
        return display

    def step(self, state: NonTerminal[S], action: A) -> StateReward[S]:
        action_map: ActionMapping[A, S] = self.mapping[state]
        return action_map[action]

    def actions(self, state: NonTerminal[S]) -> Iterable[A]:
        return self.mapping[state].keys()
Now that we’ve implemented a finite MDP, let’s implement a finite policy that maps each
non-terminal state to a probability distribution over a finite set of actions. So we create a
concrete class (a @dataclass) FinitePolicy that implements the interface of the abstract class
Policy (specifically implements the abstract method act). An instance of FinitePolicy is
specified with the attribute self.policy_map: Mapping[S, FiniteDistribution[A]] since
this type captures the structure of the π : N × A → [0, 1] function in the curried form
$$\mathcal{N} \to (\mathcal{A} \to [0, 1])$$
for the case of finite S and finite A. The act method is straightforward. We also implement
a __repr__ method for pretty-printing of self.policy_map.
@dataclass(frozen=True)
class FinitePolicy(Policy[S, A]):
    policy_map: Mapping[S, FiniteDistribution[A]]

    def __repr__(self) -> str:
        display = ""
        for s, d in self.policy_map.items():
            display += f"For State {s}:\n"
            for a, p in d:
                display += f"  Do Action {a} with Probability {p:.3f}\n"
        return display

    def act(self, state: NonTerminal[S]) -> FiniteDistribution[A]:
        return self.policy_map[state.state]
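As a usage sketch (assuming the InventoryState class and Categorical from rl/distribution.py), here is a hypothetical finite stochastic policy for the capacity-2 inventory example that chooses uniformly among the feasible order quantities from each state:

from rl.distribution import Categorical

user_capacity = 2
uniform_order_policy: FinitePolicy[InventoryState, int] = FinitePolicy({
    InventoryState(alpha, beta): Categorical({
        theta: 1.0 / (user_capacity - (alpha + beta) + 1)
        for theta in range(user_capacity - (alpha + beta) + 1)
    })
    for alpha in range(user_capacity + 1)
    for beta in range(user_capacity + 1 - alpha)
})
print(uniform_order_policy)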
Armed with a FinitePolicy class, we can now write a method apply_finite_policy in
FiniteMarkovDecisionProcess that takes as input a policy: FinitePolicy[S, A] and re-
turns a FiniteMarkovRewardProcess[S] by processing the finite structures of both the
MDP and the Policy, and producing a finite structure of the implied MRP.
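The actual implementation lives in rl/markov_decision_process.py; the sketch below simply applies the $\mathcal{P}_R^{\pi}$ formula from earlier in this chapter to the finite mapping structures (the exact library code may differ):

from collections import defaultdict
from typing import DefaultDict

def apply_finite_policy(self, policy: FinitePolicy[S, A])\
        -> FiniteMarkovRewardProcess[S]:
    transition_mapping: Dict[S, FiniteDistribution[Tuple[S, float]]] = {}

    for state in self.mapping:
        outcomes: DefaultDict[Tuple[S, float], float] = defaultdict(float)
        action_map: ActionMapping[A, S] = self.mapping[state]
        # P_R^pi(s, r, s') = sum over actions a of pi(s, a) * P_R(s, a, r, s')
        for action, p_action in policy.act(state):
            for (s1, r), p in action_map[action]:
                outcomes[(s1.state, r)] += p_action * p

        transition_mapping[state.state] = Categorical(outcomes)

    return FiniteMarkovRewardProcess(transition_mapping)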
With this (expected-reward) alteration, the set of pairs of next state
and reward from any pair of current state and action is also finite. Note that this cre-
ative alteration of the reward definition is purely to reduce this Markov Decision Process
into a Finite Markov Decision Process. Let's now work out the calculation of the reward
transition function RT .
When the next state’s (St+1 ) On-Hand is greater than zero, it means all of the day’s
demand was satisfied with inventory that was available at store-opening (= α + β), and
hence, each of these next states St+1 corresponds to no stockout cost and only an overnight
holding cost of hα. Therefore, for all α, β (with 0 ≤ α + β ≤ C) and for all order quantities
(actions) θ (with 0 ≤ θ ≤ C − (α + β)), the reward for each of these transitions is simply −h · α.
When next state’s (St+1 ) On-Hand is equal to zero, there are two possibilities:
1. The demand for the day was exactly α + β, meaning all demand was satisfied with
available store inventory (so no stockout cost and only overnight holding cost), or
2. The demand for the day was strictly greater than α + β, meaning there’s some stock-
out cost in addition to overnight holding cost. The exact stockout cost is an expec-
tation calculation involving the number of units of missed demand under the corre-
sponding poisson probabilities of demand exceeding α + β.
So now let’s write some code for the simple inventory example as a Finite Markov De-
cision Process as described above. All we have to do is to create a derived class inherited
from FiniteMarkovDecisionProcess and write a method to construct the mapping (i.e., PR )
that the __init__ constructor of FiniteMarkovDecisionProcess requires as input. Note that
the generic state type S is replaced here with the @dataclass InventoryState to represent
the inventory state, comprising the On-Hand and On-Order inventory quantities, and
the generic action type A is replaced here with int to represent the order quantity.
Now let’s test this out with some example inputs (as shown below). We construct an
instance of the SimpleInventoryMDPCap class with these inputs (named si_mdp below), then
construct an instance of the FinitePolicy[InventoryState, int] class (a deterministic pol-
icy, named fdp below), and combine them to produce the implied MRP (an instance of the
FiniteMarkovRewardProcess[InventoryState] class).
user_capacity = 2
user_poisson_lambda = 1.0
user_holding_cost = 1.0
user_stockout_cost = 10.0

si_mdp: FiniteMarkovDecisionProcess[InventoryState, int] =\
    SimpleInventoryMDPCap(
        capacity=user_capacity,
        poisson_lambda=user_poisson_lambda,
        holding_cost=user_holding_cost,
        stockout_cost=user_stockout_cost
    )

fdp: FiniteDeterministicPolicy[InventoryState, int] = \
    FiniteDeterministicPolicy(
        {InventoryState(alpha, beta): user_capacity - (alpha + beta)
         for alpha in range(user_capacity + 1)
         for beta in range(user_capacity + 1 - alpha)}
    )

implied_mrp: FiniteMarkovRewardProcess[InventoryState] =\
    si_mdp.apply_finite_policy(fdp)
This brings us to the definition of the Value Function for a fixed policy. The State-Value Function of an MDP evaluated with a fixed policy $\pi$
$$V^{\pi} : \mathcal{N} \to \mathbb{R}$$
is defined as:
$$V^{\pi}(s) = \mathbb{E}_{\pi,\mathcal{P}_R}[G_t \mid S_t = s] \quad \text{for all } s \in \mathcal{N}, \text{ for all } t = 0, 1, 2, \ldots$$
For the rest of the book, we assume that whenever we are talking about a Value Function,
the discount factor γ is appropriate to ensure that the Expected Return from each state is
finite - in particular, γ < 1 for continuing (non-terminating) MDPs where the Return could
otherwise diverge.
We expand V π (s) = Eπ,PR [Gt |St = s] as follows:
$$V^{\pi}(s) = \mathbb{E}_{\pi,\mathcal{P}_R}[R_{t+1}|S_t=s] + \gamma \cdot \mathbb{E}_{\pi,\mathcal{P}_R}[R_{t+2}|S_t=s] + \gamma^2 \cdot \mathbb{E}_{\pi,\mathcal{P}_R}[R_{t+3}|S_t=s] + \ldots$$
$$= \sum_{a\in\mathcal{A}} \pi(s,a) \cdot \mathcal{R}(s,a) + \gamma \cdot \sum_{a\in\mathcal{A}} \pi(s,a) \sum_{s'\in\mathcal{N}} \mathcal{P}(s,a,s') \sum_{a'\in\mathcal{A}} \pi(s',a') \cdot \mathcal{R}(s',a')$$
$$+ \gamma^2 \cdot \sum_{a\in\mathcal{A}} \pi(s,a) \sum_{s'\in\mathcal{N}} \mathcal{P}(s,a,s') \sum_{a'\in\mathcal{A}} \pi(s',a') \sum_{s''\in\mathcal{N}} \mathcal{P}(s',a',s'') \sum_{a''\in\mathcal{A}} \pi(s'',a'') \cdot \mathcal{R}(s'',a'') + \ldots$$
$$= \mathcal{R}^{\pi}(s) + \gamma \cdot \sum_{s'\in\mathcal{N}} \mathcal{P}^{\pi}(s,s') \cdot \mathcal{R}^{\pi}(s') + \gamma^2 \cdot \sum_{s'\in\mathcal{N}} \mathcal{P}^{\pi}(s,s') \sum_{s''\in\mathcal{N}} \mathcal{P}^{\pi}(s',s'') \cdot \mathcal{R}^{\pi}(s'') + \ldots$$
But from Equation (1.1) in Chapter 1, we know that the last expression above is equal
to the π-implied MRP’s Value Function for state s. So, the Value Function V π of an MDP
evaluated with a fixed policy π is exactly the same function as the Value Function of the
π-implied MRP. So we can apply the MRP Bellman Equation on V π , i.e.,
$$V^{\pi}(s) = \mathcal{R}^{\pi}(s) + \gamma \cdot \sum_{s'\in\mathcal{N}} \mathcal{P}^{\pi}(s,s') \cdot V^{\pi}(s')$$
$$= \sum_{a\in\mathcal{A}} \pi(s,a) \cdot \mathcal{R}(s,a) + \gamma \cdot \sum_{a\in\mathcal{A}} \pi(s,a) \sum_{s'\in\mathcal{N}} \mathcal{P}(s,a,s') \cdot V^{\pi}(s') \tag{2.1}$$
$$= \sum_{a\in\mathcal{A}} \pi(s,a) \cdot \Big(\mathcal{R}(s,a) + \gamma \cdot \sum_{s'\in\mathcal{N}} \mathcal{P}(s,a,s') \cdot V^{\pi}(s')\Big) \quad \text{for all } s \in \mathcal{N}$$
As we saw in Chapter 1, for finite state spaces that are not too large, Equation (2.1) can be
solved for V π (i.e. solution to the MDP Prediction problem) with a linear algebra solution
(Equation (1.2) from Chapter 1). More generally, Equation (2.1) will be a key equation
for the rest of the book in developing various Dynamic Programming and Reinforcement
Learning algorithms for the MDP Prediction problem. However, there is another Value Function
that’s also going to be crucial in developing MDP algorithms - one which maps a (state,
action) pair to the expected return originating from the (state, action) pair when evaluated
with a fixed policy. This is known as the Action-Value Function of an MDP evaluated with
a fixed policy π:
$$Q^{\pi} : \mathcal{N} \times \mathcal{A} \to \mathbb{R}$$
defined as:
$$Q^{\pi}(s, a) = \mathbb{E}_{\pi,\mathcal{P}_R}[G_t \mid (S_t = s, A_t = a)] \quad \text{for all } s \in \mathcal{N}, a \in \mathcal{A}$$
The State-Value Function $V^{\pi}$ and the Action-Value Function $Q^{\pi}$ (for a fixed policy $\pi$) are linked as follows:
$$V^{\pi}(s) = \sum_{a \in \mathcal{A}} \pi(s, a) \cdot Q^{\pi}(s, a) \quad \text{for all } s \in \mathcal{N} \tag{2.2}$$
$$Q^{\pi}(s, a) = \mathcal{R}(s, a) + \gamma \cdot \sum_{s' \in \mathcal{N}} \mathcal{P}(s, a, s') \cdot V^{\pi}(s') \quad \text{for all } s \in \mathcal{N}, a \in \mathcal{A} \tag{2.3}$$
Combining Equations (2.2) and (2.3) yields:
$$Q^{\pi}(s, a) = \mathcal{R}(s, a) + \gamma \cdot \sum_{s' \in \mathcal{N}} \mathcal{P}(s, a, s') \sum_{a' \in \mathcal{A}} \pi(s', a') \cdot Q^{\pi}(s', a') \quad \text{for all } s \in \mathcal{N}, a \in \mathcal{A} \tag{2.4}$$
Figure 2.2.: Visualization of MDP State-Value Function Bellman Policy Equation
Equation (2.1) is known as the MDP State-Value Function Bellman Policy Equation (Fig-
ure 2.2 serves as a visualization aid for this Equation). Equation (2.4) is known as the MDP
Action-Value Function Bellman Policy Equation (Figure 2.3 serves as a visualization aid
for this Equation). Note that Equation (2.2) and Equation (2.3) are embedded in Figure
2.2 as well as in Figure 2.3. Equations (2.1), (2.2), (2.3) and (2.4) are collectively known
as the MDP Bellman Policy Equations.
For the rest of the book, in these MDP transition figures, we shall always depict states
as elliptical-shaped nodes and actions as rectangular-shaped nodes. Notice that transition
from a state node to an action node is associated with a probability represented by π and
transition from an action node to a state node is associated with a probability represented
by P.
Note that for finite MDPs of state space not too large, we can solve the MDP Predic-
tion problem (solving for V π and equivalently, Qπ ) in a straightforward manner: Given
a policy π, we can create the finite MRP implied by π, using the method apply_policy
in FiniteMarkovDecisionProcess, then use the direct linear-algebraic solution that we cov-
ered in Chapter 1 to calculate the Value Function of the π-implied MRP. We know that the
π-implied MRP’s Value Function is the same as the State-Value Function V π of the MDP
which can then be used to arrive at the Action-Value Function Qπ of the MDP (using Equa-
tion (2.3)). For large state spaces, we need to use iterative/numerical methods (Dynamic
Programming and Reinforcement Learning algorithms) to solve this Prediction problem
(covered later in this book).
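As a sketch of this procedure for the simple inventory MDP and the deterministic policy constructed above (γ = 0.9 is an illustrative choice, and get_value_function_vec is the linear-algebra solver sketched in Chapter 1):

user_gamma = 0.9
implied_mrp = si_mdp.apply_finite_policy(fdp)
# V^pi of the MDP under policy fdp equals the Value Function of the
# pi-implied MRP: one value per non-terminal state.
print(implied_mrp.get_value_function_vec(gamma=user_gamma))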
Figure 2.3.: Visualization of MDP Action-Value Function Bellman Policy Equation
The Optimal Value Function
$$V^* : \mathcal{N} \to \mathbb{R}$$
is defined as:
$$V^*(s) = \max_{\pi \in \Pi} V^{\pi}(s) \quad \text{for all } s \in \mathcal{N}$$
where $\Pi$ is the set of stationary (stochastic) policies over the spaces of $\mathcal{N}$ and $\mathcal{A}$.
The way to read the above definition is that for each non-terminal state s, we consider
all possible stochastic stationary policies π, and maximize V π (s) across all these choices
of π. Note that the maximization over choices of π is done separately for each s, so it’s
conceivable that different choices of π might maximize V π (s) for different s ∈ N . Thus,
from the above definition of V ∗ , we can’t yet talk about the notion of “An Optimal Policy.”
So, for now, let’s just focus on the notion of Optimal Value Function, as defined above.
Note also that we haven’t yet talked about how to achieve the above-defined maximization
through an algorithm - we have simply defined the Optimal Value Function.
Likewise, the Optimal Action-Value Function
$$Q^* : \mathcal{N} \times \mathcal{A} \to \mathbb{R}$$
is defined as:
$$Q^*(s, a) = \max_{\pi \in \Pi} Q^{\pi}(s, a) \quad \text{for all } s \in \mathcal{N}, a \in \mathcal{A}$$
Let's first think about what it means to act optimally from a given non-terminal state $s$, i.e., let's unravel $V^*(s)$. We pick the best possible action we can take from $s$, and then act optimally thereafter. Formally:
$$V^*(s) = \max_{a \in \mathcal{A}} Q^*(s, a) \quad \text{for all } s \in \mathcal{N} \tag{2.5}$$
Likewise, let’s think about what it means to be optimal from a given non-terminal-state
and action pair (s, a), i.e, let’s unravel Q∗ (s, a). First, we get the immediate expected re-
ward R(s, a). Next, we consider all possible random states s′ ∈ S we can transition to,
and from each of those states which are non-terminal states, we recursively act optimally.
Formally, this gives us the following equation:
Q∗(s, a) = R(s, a) + γ · Σ_{s′∈N} P(s, a, s′) · V∗(s′)   for all s ∈ N, a ∈ A   (2.6)
Substituting for Q∗(s, a) from Equation (2.6) in Equation (2.5) gives:

V∗(s) = max_{a∈A} {R(s, a) + γ · Σ_{s′∈N} P(s, a, s′) · V∗(s′)}   for all s ∈ N   (2.7)

Equation (2.7) is known as the MDP State-Value Function Bellman Optimality Equation
and is depicted in Figure 2.4 as a visualization aid.
Substituting for V ∗ (s) from Equation (2.5) in Equation (2.6) gives:
Q∗(s, a) = R(s, a) + γ · Σ_{s′∈N} P(s, a, s′) · max_{a′∈A} Q∗(s′, a′)   for all s ∈ N, a ∈ A   (2.8)
Equation (2.8) is known as the MDP Action-Value Function Bellman Optimality Equa-
tion and is depicted in Figure 2.5 as a visualization aid.
Note that Equation (2.5) and Equation (2.6) are embedded in Figure 2.4 as well as in Fig-
ure 2.5. Equations (2.7), (2.5), (2.6) and (2.8) are collectively known as the MDP Bellman
Optimality Equations. We should highlight that when someone says MDP Bellman Equa-
tion or simply Bellman Equation, unless they explicitly state otherwise, they'd be referring to
the MDP Bellman Optimality Equations (and typically specifically the MDP State-Value
Function Bellman Optimality Equation). This is because the MDP Bellman Optimality
Equations address the ultimate purpose of Markov Decision Processes - to identify the
Optimal Value Function and the associated policy/policies that achieve the Optimal Value
Function (i.e., enabling us to solve the MDP Control problem).
Again, it pays to emphasize that the Bellman Optimality Equations don't directly give
us a recipe to calculate the Optimal Value Function or the policy/policies that achieve
the Optimal Value Function - they simply state a powerful mathematical property of the
Optimal Value Function that (as we shall see later in this book) helps us come up with al-
gorithms (Dynamic Programming and Reinforcement Learning) to calculate the Optimal
Value Function and the associated policy/policies that achieve the Optimal Value Func-
tion.

Figure 2.4.: Visualization of MDP State-Value Function Bellman Optimality Equation
We have been using the phrase “policy/policies that achieve the Optimal Value Func-
tion,” but we haven’t yet provided a clear definition of such a policy (or policies). In fact,
as mentioned earlier, it’s not clear from the definition of V ∗ if such a policy (one that
would achieve V ∗ ) exists (because it’s conceivable that different policies π achieve the
maximization of V π (s) for different states s ∈ N ). So instead, we define an Optimal Policy
π ∗ : N × A → [0, 1] as one that “dominates” all other policies with respect to the Value
Functions for the policies. Formally,

π∗ ∈ Π is an Optimal Policy if V^{π∗}(s) ≥ V^π(s) for all π ∈ Π and for all states s ∈ N
The definition of an Optimal Policy π ∗ says that it is a policy that is “better than or
equal to” (on the V π metric) all other stationary policies for all non-terminal states (note
that there could be multiple Optimal Policies). Putting this definition together with the
definition of the Optimal Value Function V ∗ , the natural question to then ask is whether
there exists an Optimal Policy π∗ that maximizes V^π(s) for all s ∈ N, i.e., whether there
exists a π∗ such that V∗(s) = V^{π∗}(s) for all s ∈ N. On the face of it, this seems like a
strong statement. However, the answer is in the affirmative in most MDP settings of in-
terest. The following theorem and proof are for our default setting of MDP (discrete-time,
countable-spaces, time-homogeneous), but the statements and argument themes below
apply to various other MDP settings as well. The MDP book by Martin Puterman (Puter-
man 2014) provides rigorous proofs for a variety of settings.
This brings us to the following theorem:

Theorem 2.10.1. For any (discrete-time, countable-spaces, time-homogeneous) MDP:
• There exists an Optimal Policy π∗ ∈ Π, i.e., there exists a policy π∗ ∈ Π such that V^{π∗}(s) = V∗(s) for all s ∈ N.
• All Optimal Policies achieve the Optimal Value Function, i.e., V^{π∗}(s) = V∗(s) for all s ∈ N, for all Optimal Policies π∗.
• All Optimal Policies achieve the Optimal Action-Value Function, i.e., Q^{π∗}(s, a) = Q∗(s, a) for all s ∈ N, for all a ∈ A, for all Optimal Policies π∗.

Before proceeding with the proof of Theorem (2.10.1), we establish a simple Lemma.
Lemma 2.10.2. For any two Optimal Policies π∗_1 and π∗_2, V^{π∗_1}(s) = V^{π∗_2}(s) for all s ∈ N.

Proof. Since π∗_1 is an Optimal Policy, from the Optimal Policy definition, we have:
V^{π∗_1}(s) ≥ V^{π∗_2}(s) for all s ∈ N. Likewise, since π∗_2 is an Optimal Policy, from the
Optimal Policy definition, we have: V^{π∗_2}(s) ≥ V^{π∗_1}(s) for all s ∈ N. This implies:
V^{π∗_1}(s) = V^{π∗_2}(s) for all s ∈ N.
Proof. As a consequence of the above Lemma, all we need to do to prove Theorem (2.10.1)
is to establish an Optimal Policy that achieves the Optimal Value Function and the Opti-
mal Action-Value Function. We construct a Deterministic Policy (as a candidate Optimal
Policy) π∗_D : N → A as follows:

π∗_D(s) = argmax_{a∈A} Q∗(s, a)   for all s ∈ N   (2.9)
Note that for any specific s, if two or more actions a achieve the maximization of Q∗ (s, a),
then we use an arbitrary rule in breaking ties and assigning a single action a as the output
of the above arg max operation.
First we show that π∗_D achieves the Optimal Value Functions V∗ and Q∗. Since
π∗_D(s) = argmax_{a∈A} Q∗(s, a) and V∗(s) = max_{a∈A} Q∗(s, a) for all s ∈ N, we can infer for all
s ∈ N that:

V∗(s) = Q∗(s, π∗_D(s))

This says that we achieve the Optimal Value Function from a given non-terminal state s
if we first take the action prescribed by the policy π∗_D (i.e., the action π∗_D(s)), followed by
achieving the Optimal Value Function from each of the next time step's states. But note
that each of the next time step's states can achieve the Optimal Value Function by doing
the same thing described above ("first take the action prescribed by π∗_D, followed by ..."),
and so on and so forth for the states at further time steps. Thus, the Optimal Value Function V∗ is
achieved if from each non-terminal state, we take the action prescribed by π∗_D. Likewise,
the Optimal Action-Value Function Q∗ is achieved if from each non-terminal state, we take
the action a (argument to Q∗) followed by future actions prescribed by π∗_D. Formally, this
says:

V^{π∗_D}(s) = V∗(s)   for all s ∈ N

Q^{π∗_D}(s, a) = Q∗(s, a)   for all s ∈ N, for all a ∈ A
Finally, we argue that π∗_D is an Optimal Policy. Assume the contrary (that π∗_D is not
an Optimal Policy). Then there exists a policy π ∈ Π and a state s ∈ N such that V^π(s) >
V^{π∗_D}(s). Since V^{π∗_D}(s) = V∗(s), we have: V^π(s) > V∗(s), which contradicts the Optimal
Value Function definition: V∗(s) = max_{π∈Π} V^π(s) for all s ∈ N. Hence, π∗_D must be an
Optimal Policy.
Equation (2.9) is a key construction that goes hand-in-hand with the Bellman Optimality
Equations in designing the various Dynamic Programming and Reinforcement Learning
algorithms to solve the MDP Control problem (i.e., to solve for V ∗ , Q∗ and π ∗ ). Lastly, it’s
important to note that unlike the Prediction problem which has a straightforward linear-
algebra-solver for small state spaces, the Control problem is non-linear and so, doesn’t
have an analogous straightforward linear-algebra-solver. The simplest solutions for the
Control problem (even for small state spaces) are the Dynamic Programming algorithms
we will cover in Chapter 3.
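To illustrate Equation (2.9) concretely, here is a minimal, hypothetical sketch (using plain Python dictionaries rather than the book's rl library) of extracting a deterministic Optimal Policy from a given Optimal Action-Value Function:

from typing import Dict, Tuple, TypeVar

S = TypeVar('S')  # state type
A = TypeVar('A')  # action type

def optimal_deterministic_policy(
    q_star: Dict[Tuple[S, A], float]   # Q*(s, a) for all non-terminal s and actions a
) -> Dict[S, A]:
    # Equation (2.9): pi*_D(s) = argmax_a Q*(s, a), with ties broken arbitrarily
    # (here, by whichever maximizing action is encountered first).
    policy: Dict[S, A] = {}
    best: Dict[S, float] = {}
    for (s, a), q in q_star.items():
        if s not in best or q > best[s]:
            best[s] = q
            policy[s] = a
    return policy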
State Space:
The definitions we’ve provided for MRPs and MDPs were for countable (discrete) state
spaces. As a special case, we considered finite state spaces since we have pretty straight-
forward algorithms for exact solution of Prediction and Control problems for finite MDPs
(which we shall learn about in Chapter 3). We emphasize finite MDPs because they help
you develop a sound understanding of the core concepts and make it easy to program
the algorithms (known as “tabular” algorithms since we can represent the MDP in a “ta-
ble,” more specifically a Python data structure like dict or numpy array). However, these
algorithms are practical only if the finite state space is not too large. Unfortunately, in
many real-world problems, state spaces are either very large-finite or infinite (sometimes
continuous-valued spaces). Large state spaces are unavoidable because phenomena in na-
ture and metrics in business evolve in time due to a complex set of factors and often depend
on history. To capture all these factors and to enable the Markov Property, we invariably
end up with having to model large state spaces, which suffer from two "curses": the Curse of Dimensionality (the sheer size of the state space) and the Curse of Modeling (the difficulty of explicitly representing the transition probabilities of such a model). These two curses are typically tackled with a combination of two techniques:
• Approximation of the Value Function - We create an approximate representation
of the Value Function (eg: by using a supervised learning representation such as
a neural network). This permits us to work with an appropriately sampled subset
of the state space, infer the Value Function in this state space subset, and interpo-
late/extrapolate/generalize the Value Function in the remainder of the State Space.
• Sampling from the state-reward transition probabilities PR - Instead of working with
the explicit transition probabilities, we simply use the state-reward sample transi-
tions and employ Reinforcement Learning algorithms to incrementally improve the
estimates of the (approximated) Value Function. When state spaces are large, rep-
resenting explicit transition probabilities is impossible (not enough storage space),
and simply sampling from these probability distributions is our only option (and as
you shall learn, is surprisingly effective).
This combination of sampling a state space subset, approximation of the Value Function
(with deep neural networks), sampling state-reward transitions, and clever Reinforcement
Learning algorithms goes a long way in breaking both the curse of dimensionality and
curse of modeling. In fact, this combination is a common pattern in the broader field
of Applied Mathematics to break these curses. The combination of Sampling and Func-
tion Approximation (particularly with the modern advances in Deep Learning) is likely
to pave the way for future advances in the broader fields of Real-World AI and Applied
Mathematics in general. We recognize that some of this discussion is a bit premature since
we haven’t even started teaching Reinforcement Learning yet. But we hope that this sec-
tion provides some high-level perspective and connects the learnings from this chapter to
the techniques/algorithms that will come later in this book. We will also remind you of
this joint-importance of sampling and function approximation once we get started with
Reinforcement Learning algorithms later in this book.
Action Space:
Similar to state spaces, the definitions we’ve provided for MDPs were for countable (dis-
crete) action spaces. As a special case, we considered finite action spaces (together with
finite state spaces) since we have pretty straightforward algorithms for exact solution of
Prediction and Control problems for finite MDPs. As mentioned above, in these algo-
rithms, we represent the MDP in Python data structures like dict or numpy array. How-
ever, these finite-MDP algorithms are practical only if the state and action spaces are not
too large. In many real-world problems, action spaces do end up as fairly large - either
finite-large or infinite (sometimes continuous-valued action spaces). The large size of the
action space affects algorithms for MDPs in a couple of ways:
• Large action space makes the representation, estimation and evaluation of the pol-
icy π, of the Action-Value function for a policy Qπ and of the Optimal Action-Value
function Q∗ difficult. We have to resort to function approximation and sampling as
ways to overcome the large size of the action space.
• The Bellman Optimality Equation leads to a crucial calculation step in Dynamic Pro-
gramming and Reinforcement Learning algorithms that involves identifying the ac-
tion for each non-terminal state that maximizes the Action-Value Function Q. When
the action space is large, we cannot afford to evaluate Q for each action for an encoun-
tered state (as is done in simple tabular algorithms). Rather, we need to tap into an
optimization algorithm to perform the maximization of Q over the action space, for
each encountered state (see the sketch following this list). Separately, there is a special class of Reinforcement Learning
algorithms called Policy Gradient Algorithms (that we shall later learn about) that
are particularly valuable for large action spaces (where other types of Reinforcement
Learning algorithms are not efficient and often, simply not an option). However,
these techniques to deal with large action spaces require care and attention as they
have their own drawbacks (more on this later).
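As a minimal sketch of the point about needing an optimization algorithm for the maximization of Q (a hypothetical example, not from the book's code library, assuming a callable q that evaluates Q(s, a) for the encountered state s over a continuous one-dimensional action interval, and using SciPy's generic scalar optimizer):

from typing import Callable
from scipy.optimize import minimize_scalar

def best_action(
    q: Callable[[float], float],   # a -> Q(s, a) for a fixed encountered state s (assumed given)
    action_low: float,
    action_high: float
) -> float:
    # Maximize Q(s, a) over a bounded continuous action interval by
    # minimizing its negation with a generic scalar optimizer.
    result = minimize_scalar(
        lambda a: -q(a),
        bounds=(action_low, action_high),
        method="bounded"
    )
    return result.x

# Example with a made-up concave Q(s, a) that peaks at a = 2.0
print(best_action(lambda a: -(a - 2.0) ** 2, 0.0, 5.0))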
Time Steps:
The definitions we've provided for MRPs and MDPs were for discrete time steps. We dis-
tinguish between processes whose time steps terminate (known as terminating or episodic
MRPs/MDPs) and those whose time steps continue without termination (known as continuing MRPs/MDPs). We've
talked about how the choice of γ matters in these cases (γ = 1 doesn’t work for some con-
tinuing MDPs because reward accumulation can blow up to infinity). We won’t cover
it in this book, but there is an alternative formulation of the Value Function as expected
average reward (instead of expected discounted accumulated reward) where we don’t
discount even for continuing MDPs. We had also mentioned earlier that an alternative to
discrete time steps is continuous time steps, which is convenient for analytical tractability.
Sometimes, even if state space and action space components have discrete values (eg:
price of a security traded in fine discrete units, or number of shares of a security bought/sold
on a given day), for modeling purposes, we find it convenient to represent
these components as continuous values (i.e., uncountable state space). The advantage
of continuous state/action space representation (especially when paired with continuous
time) is that we get considerable mathematical benefits from differential calculus as well
as from properties of continuous probability distributions (eg: gaussian distribution con-
veniences). In fact, continuous state/action space and continuous time are very popu-
lar in Mathematical Finance since some of the groundbreaking work from Mathematical
Economics from the 1960s and 1970s - Robert Merton’s Portfolio Optimization formula-
tion and solution (Merton 1969) and Black-Scholes’ Options Pricing model (Black and
Scholes 1973), to name a couple - are grounded in stochastic calculus, which models stock
prices/portfolio value as gaussian evolutions in continuous time (more on this later in
the book) and treats trades (buy/sell quantities) as also continuous variables (permitting
partial derivatives and tractable partial differential equations).
When all three of state space, action space and time steps are modeled as continuous, the
Bellman Optimality Equation we covered in this chapter for countable spaces and discrete-
time morphs into a differential calculus formulation and is known as the famous Hamilton-
Jacobi-Bellman (HJB) equation. The HJB Equation is commonly used to model and solve
many problems in engineering, physics, economics and finance. We shall cover a couple of
financial applications in this book that have elegant formulations in terms of the HJB equa-
tion and equally elegant analytical solutions of the Optimal Value Function and Optimal
Policy (tapping into stochastic calculus and differential equations).
Figure 2.6.: Partially-Observable Markov Decision Process
In the more general agent-environment interaction depicted in Figure 2.6, we distinguish between two notions of state:
• The internal representation of the environment at each time step t (let's call it S_t^{(e)}). This internal representation of the environment is what drives the probabilistic transition to the next time step t + 1, producing the random pair of next (environment) state S_{t+1}^{(e)} and reward R_{t+1}.
• The agent state at each time step t (let's call it S_t^{(a)}). The agent state is what controls the action A_t the agent takes at time step t, i.e., the agent runs a policy π which is a function of the agent state S_t^{(a)}, producing a probability distribution of actions A_t.
In our definition of MDP, note that we implicitly assumed that S_t^{(e)} = S_t^{(a)} at each time
step t, and called it the (common) state S_t at time t. Secondly, we assumed that this state
S_t is fully observable by the agent. To understand full observability, let us (first intuitively)
understand the concept of partial observability in a more generic setting than what we had
assumed in the framework for MDP. In this more generic framework, we denote O_t as
the information available to the agent from the environment at time step t, as depicted in
Figure 2.6. The notion of partial observability in this more generic framework is that from
the history of observations, actions and rewards up to time step t, the agent does not have
full knowledge of the environment state S_t^{(e)}. This lack of full knowledge of S_t^{(e)} is known
as partial observability. Full observability, on the other hand, means that the agent can fully
construct S_t^{(e)} as a function of the history of observations, actions and rewards up to time
step t. Since we have the flexibility to model the exact data structures to represent obser-
vations, state and actions in this more generic framework, existence of full observability
lets us re-structure the observation data at time step t to be O_t = S_t^{(e)}. Since we have also
assumed S_t^{(e)} = S_t^{(a)}, we have:

O_t = S_t^{(e)} = S_t^{(a)}   for all time steps t = 0, 1, 2, . . .
The above statement specialized the framework to that of Markov Decision Processes,
which we can now name more precisely as Fully-Observable Markov Decision Processes
(when viewed from the lens of the more generic framework described above, that permits
partial observability or full observability).
In practice, you will often find that the agent doesn't know the true internal represen-
tation (S_t^{(e)}) of the environment (i.e., partial observability). Think about what it would
take to know what drives a stock price from time step t to t + 1 - the agent would need
to have access to pretty much every little detail of trading activity in the entire world,
and more! However, since the MDP framework is simple and convenient, and since we
have tractable Dynamic Programming and Reinforcement Learning algorithms to solve
MDPs, we often do pretend that O_t = S_t^{(e)} = S_t^{(a)} and carry on with our business of solv-
ing the assumed/modeled MDP. Often, this assumption of O_t = S_t^{(e)} = S_t^{(a)} turns out
to be a reasonable approximate model of the real-world, but there are indeed situations
where this assumption is far-fetched. These are situations where we have access to too
little information pertaining to the key aspects of the internal state representation (S_t^{(e)})
of the environment. It turns out that we have a formal framework for these situations -
this framework is known as a Partially-Observable Markov Decision Process (POMDP for
short). By default, the acronym MDP will refer to a Fully-Observable Markov Decision
Process (i.e., corresponding to the MDP definition we have given earlier in this chapter).
So let’s now define a POMDP.
A POMDP has the usual features of an MDP (discrete-time, countable states, countable
actions, countable next state-reward transition probabilities, discount factor, plus the assump-
tion of time-homogeneity), together with the notion of a random observation O_t at each time
step t (each observation O_t lies within the Observation Space O) and an observation proba-
bility function Z : S × A × O → [0, 1] defined as:

Z(s′, a, o) = P[O_{t+1} = o | (S_{t+1} = s′, A_t = a)]
It pays to emphasize that although a POMDP works with the notion of a state St , the
agent doesn’t have knowledge of St . It only has knowledge of observation Ot because Ot is
the extent of information made available from the environment. The agent will then need
to essentially “guess” (probabilistically) what the state St might be at each time step t in
order to take the action At . The agent’s goal in a POMDP is the same as that for an MDP:
to determine the Optimal Value Function and to identify an Optimal Policy (achieving the
Optimal Value Function).
Just like we have the rich theory and algorithms for MDPs, we have the theory and
algorithms for POMDPs. POMDP theory is founded on the notion of belief states. The
informal notion of a belief state is that since the agent doesn’t get to see the state St (it only
sees the observations Ot ) at each time step t, the agent needs to keep track of what it thinks
the state St might be, i.e., it maintains a probability distribution of states St conditioned
on history. Let’s make this a bit more formal.
Let us refer to the history H_t known to the agent at time t as the sequence of data it has
collected up to time t. Formally, this data sequence H_t is:

H_t = (O_0, R_0, A_0, O_1, R_1, A_1, . . . , O_{t−1}, R_{t−1}, A_{t−1}, O_t, R_t)

A Belief State b(h)_t at time t is a probability distribution over states, conditioned on the
history h, i.e.,

b(h)_t = (P[S_t = s_1 | H_t = h], P[S_t = s_2 | H_t = h], . . .)

such that Σ_{s∈S} b(h)_t(s) = 1 for all histories h and for each t = 0, 1, 2, . . ..
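To make the notion of a belief state a bit more tangible, here is a minimal, hypothetical sketch (not from the book's code library) of representing a belief state as a dictionary and updating it with Bayes' rule after taking action a and observing o, assuming dictionaries p for the transition probabilities P(s, a, s′) and z for the observation probabilities Z(s′, a, o):

from typing import Dict, Tuple, TypeVar

S = TypeVar('S')  # state type
A = TypeVar('A')  # action type
O = TypeVar('O')  # observation type

Belief = Dict[S, float]  # maps each state s to P[S_t = s | H_t = h]

def update_belief(
    b: Belief,                          # current belief b(h)_t
    a: A,                               # action A_t taken
    o: O,                               # observation O_{t+1} received
    p: Dict[Tuple[S, A, S], float],     # P(s, a, s'): state-transition probabilities
    z: Dict[Tuple[S, A, O], float]      # Z(s', a, o): observation probabilities
) -> Belief:
    # Bayes' rule: b'(s') is proportional to Z(s', a, o) * sum_s P(s, a, s') * b(s)
    unnormalized = {
        s1: z.get((s1, a, o), 0.0) *
            sum(p.get((s, a, s1), 0.0) * prob for s, prob in b.items())
        for s1 in {s1 for (_, _, s1) in p}
    }
    total = sum(unnormalized.values())
    return {s1: v / total for s1, v in unnormalized.items()} if total > 0 \
        else unnormalized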
Since the history Ht satisfies the Markov Property, the belief state b(h)t satisfies the
Markov Property. So we can reduce the POMDP to an MDP M with the set of belief
states of the POMDP as the set of states of the MDP M . Note that even if the set of states of
the POMDP were finite, the set of states of the MDP M will be infinite (i.e. infinite belief
states). We can see that this will almost always end up as a giant MDP M . So although this
is useful for theoretical reasoning, practically solving this MDP M is often quite hard com-
putationally. However, specialized techniques have been developed to solve POMDPs but
as you might expect, their computational complexity is still quite high. So we end up with
a choice when encountering a POMDP - either try to solve it with a POMDP algorithm
(computationally inefficient but capturing the reality of the real-world problem) or try to
approximate it as an MDP (pretending O_t = S_t^{(e)} = S_t^{(a)}), which will likely be compu-
tationally more efficient but might be a gross approximation of the real-world problem,
which in turn means its effectiveness in practice might be compromised. This is the mod-
eling dilemma we often end up with: what is the right level of detail of real-world factors
we need to capture in our model? How do we prevent state spaces from exploding be-
yond practical computational tractability? The answers to these questions typically have
to do with depth of understanding of the nuances of the real-world problem and a trial-
and-error process of: formulating the model, solving for the optimal policy, testing the
efficacy of this policy in practice (with appropriate measurements to capture real-world
metrics), learning about the drawbacks of our model, and iterating back to tweak (or com-
pletely change) the model.
Let’s consider a classic example of a card game such as Poker or Blackjack as a POMDP
where your objective as a player is to identify the optimal policy to maximize your ex-
pected return (Optimal Value Function). The observation Ot would be the entire set of
information you would have seen up to time step t (or a compressed version of this en-
tire information that suffices for predicting transitions and for taking actions). The state
St would include, among other things, the set of cards you have, the set of cards your
opponents have (which you don’t see), and the entire set of exposed as well as unex-
posed cards not held by players. Thus, the state is only partially observable. With this
POMDP structure, we proceed to develop a model of the transition probabilities of next
state St+1 and reward Rt+1 , conditional on current state St and current action At . We also
develop a model of the probabilities of next observation Ot+1 , conditional on next state
St+1 and current action At . These probabilities are estimated from data collected from
various games (capturing opponent behaviors) and knowledge of the cards-structure of
the deck (or decks) used to play the game. Now let’s think about what would happen
if we modeled this card game as an MDP. We’d no longer have the unseen cards as part
of our state. Instead, the state St will be limited to the information seen up to time t (i.e.,
St = Ot ). We can still estimate the transition probabilities, but since it’s much harder to
estimate in this case, our estimate will likely be quite noisy and nowhere near as reliable
as the probability estimates in the POMDP case. The advantage though with modeling it
as an MDP is that the algorithm to arrive at the Optimal Value Function/Optimal Policy
is a lot more tractable compared to the algorithm for the POMDP model. So it’s a tradeoff
between the reliability of the probability estimates versus the tractability of the algorithm
to solve for the Optimal Value Function/Policy.
The purpose of this subsection on POMDPs is to highlight that by default a lot of prob-
lems in the real-world are POMDPs and it can sometimes take quite a bit of domain-
knowledge, modeling creativity and real-world experimentation to treat them as MDPs
and make the solution to the modeled MDP successful in practice.
The idea of partial observability was introduced in a paper by K. J. Åström (Åström
1965). To learn more about POMDP theory, we refer you to the POMDP book by Vikram
Krishnamurthy (Krishnamurthy 2016).
2.12. Summary of Key Learnings from this Chapter
• MDP Bellman Policy Equations
• MDP Bellman Optimality Equations
• Theorem (2.10.1) on the existence of an Optimal Policy, and of each Optimal Policy
achieving the Optimal Value Function
3. Dynamic Programming Algorithms
As a reminder, much of this book is about algorithms to solve the MDP Control problem,
i.e., to compute the Optimal Value Function (and an associated Optimal Policy). We will
also cover algorithms for the MDP Prediction problem, i.e., to compute the Value Function
when the AI agent executes a fixed policy π (which, as we know from Chapter 2, is the
same as computing the Value Function of the π-implied MRP). Our typical approach will
be to first cover algorithms to solve the Prediction problem before covering algorithms to
solve the Control problem - not just because Prediction is a key component in solving the
Control problem, but also because it helps understand the key aspects of the techniques
employed in the Control algorithm in the simpler setting of Prediction.
“I spent the Fall quarter (of 1950) at RAND. My first task was to find a name
for multistage decision processes. An interesting question is, ‘Where did the
name, dynamic programming, come from?’ The 1950s were not good years for
mathematical research. We had a very interesting gentleman in Washington
named Wilson. He was Secretary of Defense, and he actually had a patholog-
ical fear and hatred of the word, research. I’m not using the term lightly; I’m
using it precisely. His face would suffuse, he would turn red, and he would
get violent if people used the term, research, in his presence. You can imagine
how he felt, then, about the term, mathematical. The RAND Corporation was
employed by the Air Force, and the Air Force had Wilson as its boss, essen-
tially. Hence, I felt I had to do something to shield Wilson and the Air Force
from the fact that I was really doing mathematics inside the RAND Corpora-
tion. What title, what name, could I choose? In the first place I was interested
in planning, in decision making, in thinking. But planning, is not a good word
for various reasons. I decided therefore to use the word, ‘programming.’ I
wanted to get across the idea that this was dynamic, this was multistage, this
was time-varying—I thought, let’s kill two birds with one stone. Let’s take a
word that has an absolutely precise meaning, namely dynamic, in the classical
physical sense. It also has a very interesting property as an adjective, and that
is it’s impossible to use the word, dynamic, in a pejorative sense. Try think-
ing of some combination that will possibly give it a pejorative meaning. It’s
impossible. Thus, I thought dynamic programming was a good name. It was
something not even a Congressman could object to. So I used it as an umbrella
for my activities.”
Bellman had coined the term Dynamic Programming to refer to the general theory of
MDPs, together with the techniques to solve MDPs (i.e., to solve the Control problem). So
the MDP Bellman Optimality Equation was part of this catch-all term Dynamic Program-
ming. The core semantic of the term Dynamic Programming was that the Optimal Value
Function can be expressed recursively - meaning, to act optimally from a given state, we
will need to act optimally from each of the resulting next states (which is the essence of the
Bellman Optimality Equation). In fact, Bellman used the term “Principle of Optimality”
to refer to this idea of "Optimal Substructure," and articulated it as follows:

"An optimal policy has the property that whatever the initial state and initial decision are, the remaining decisions must constitute an optimal policy with regard to the state resulting from the first decision."

So, you can see that the term Dynamic Programming was not just an algorithm in its
original usage. Crucially, Bellman laid out an iterative algorithm to solve for the Op-
timal Value Function (i.e., to solve the MDP Control problem). Over the course of the
next decade, the term Dynamic Programming got associated with (multiple) algorithms
to solve the MDP Control problem. The term Dynamic Programming was extended to
also refer to algorithms to solve the MDP Prediction problem. Over the next couple of
decades, Computer Scientists started refering to the term Dynamic Programming as any
algorithm that solves a problem through a recursive formulation as long as the algorithm
makes repeated invocations to the solutions of each subproblem (overlapping subproblem
structure). A classic such example is the algorithm to compute the Fibonacci sequence by
caching the Fibonacci values and re-using those values during the course of the algorithm
execution. The algorithm to calculate the shortest path in a graph is another classic exam-
ple where each shortest (i.e. optimal) path includes sub-paths that are optimal. However,
in this book, we won’t use the term Dynamic Programming in this broader sense. We will
use the term Dynamic Programming to be restricted to algorithms to solve the MDP Pre-
diction and Control problems (even though Bellman originally used it only in the context
of Control). More specifically, we will use the term Dynamic Programming in the narrow
context of Planning algorithms for problems with the following two specializations:
• The state space is finite, the action space is finite, and the set of pairs of next state
and reward (given any pair of current state and action) are also finite.
• We have explicit knowledge of the model probabilities (either in the form of PR or
in the form of P and R separately).
So we want to solve for an x such that x = cos(x). Knowing the frequency and ampli-
tude of cosine, we can see that the cosine curve intersects the line y = x at only one point,
which should be somewhere between 0 and π/2. But there is no easy way to solve for this
point. Here’s an idea: Start with any value x0 ∈ R, calculate x1 = cos(x0 ), then calculate
x2 = cos(x1 ), and so on …, i.e, xi+1 = cos(xi ) for i = 0, 1, 2, . . .. You will find that xi
and xi+1 get closer and closer as i increases, i.e., |xi+1 − xi | ≤ |xi − xi−1 | for all i ≥ 1.
So it seems like lim_{i→∞} x_i = lim_{i→∞} cos(x_{i−1}) = cos(lim_{i→∞} x_i), which would imply that
for large enough i, xi would serve as an approximation to the solution of the equation
x = cos(x). But why does this method of repeated applications of the function f (no mat-
ter what x0 we start with) work? Why does it not diverge or oscillate? How quickly does
it converge? If there were multiple fixed-points, which fixed-point would it converge to
(if at all)? Can we characterize a class of functions f for which this method (repeatedly
applying f , starting with any arbitrary value of x0 ) would work (in terms of solving the
equation x = f (x))? These are the questions Fixed-Point theory attempts to answer. Can
you think of problems you have solved in the past which fall into this method pattern that
we’ve illustrated above for f (x) = cos(x)? It’s likely you have, because most of the root-
finding and optimization methods (including multi-variate solvers) are essentially based
on the idea of Fixed-Point. If this doesn’t sound convincing, consider the simple Newton
method:
For a differentiable function g : R → R whose root we want to solve for, the Newton
method update rule is:
x_{i+1} = x_i − g(x_i) / g′(x_i)

Setting f(x) = x − g(x) / g′(x), the update rule is:

x_{i+1} = f(x_i)

and it solves the equation x = f(x) (solves for the fixed-point of f), i.e., it solves the
equation:

x = x − g(x) / g′(x)   ⇒   g(x) = 0
Thus, we see the same method pattern as we saw above for cos(x) (repeated application
of a function, starting with any initial value) enables us to solve for the root of g.
More broadly, what we are saying is that if we have a function f : X → X (for some arbi-
trary domain X ), under appropriate conditions (that we will state soon), f (f (. . . f (x0 ) . . .))
converges to a fixed-point of f , i.e., to the solution of the equation x = f (x) (no matter
what x0 ∈ X we start with). Now we are ready to state this formally. The statement of
the following theorem (due to Stefan Banach) is quite terse, so we will provide plenty of
explanation on how to interpret it and how to use it after stating the theorem (we skip the
proof of the theorem).
Theorem 3.3.1 (Banach Fixed-Point Theorem). Let X be a non-empty set equipped with a
complete metric d : X × X → R. Let f : X → X be such that there exists a L ∈ [0, 1) such that
d(f (x1 ), f (x2 )) ≤ L · d(x1 , x2 ) for all x1 , x2 ∈ X (this property of f is called a contraction, and
we refer to f as a contraction function). Then,
1. There exists a unique Fixed-Point x∗ ∈ X, i.e.,

x∗ = f(x∗)

2. For any x_0 ∈ X, and the sequence [x_i | i = 0, 1, 2, . . .] defined as x_{i+1} = f(x_i) for all i = 0, 1, 2, . . .,

lim_{i→∞} x_i = x∗

3. d(x∗, x_i) ≤ (L^i / (1 − L)) · d(x_1, x_0)

Equivalently,

d(x∗, x_{i+1}) ≤ (L / (1 − L)) · d(x_{i+1}, x_i)

d(x∗, x_{i+1}) ≤ L · d(x∗, x_i)
We realize this is quite terse and will now demystify the theorem in a simple, intuitive
manner. First we need to explain what complete metric means. Let’s start with the term
metric. A metric is simply a function d : X × X → R that satisfies the usual "distance"
properties (for any x_1, x_2, x_3 ∈ X):
• d(x_1, x_2) = 0 ⟺ x_1 = x_2 (Identity)
• d(x_1, x_2) = d(x_2, x_1) (Symmetry)
• d(x_1, x_2) ≤ d(x_1, x_3) + d(x_3, x_2) (Triangle Inequality)
The term complete is a bit of a technical detail on sequences not escaping the set X (that’s
required in the proof). Since we won’t be doing the proof and since this technical detail is
not so important for the intuition, we skip the formal definition of complete. A non-empty
set X equipped with the function d (and the technical detail of being complete) is known
as a complete metric space.
Now we move on to the key concept of contraction. A function f : X → X is said to
be a contraction function if two points in X get closer when they are mapped by f (the
statement: d(f (x1 ), f (x2 )) ≤ L · d(x1 , x2 ) for all x1 , x2 ∈ X , for some L ∈ [0, 1)).
The theorem basically says that for any contraction function f , there is not only a unique
fixed-point x∗ , one can arrive at x∗ by repeated application of f , starting with any initial
value x0 ∈ X :
f (f (. . . f (x0 ) . . .)) → x∗
We use the notation f^i : X → X for i = 0, 1, 2, . . . to denote f composed with itself i times, i.e., f^0 is the identity function and f^{i+1}(x) = f(f^i(x)) for all i = 0, 1, 2, . . ., so that the convergence statement above can be written as lim_{i→∞} f^i(x_0) = x∗ for any x_0 ∈ X.
Banach Fixed-Point Theorem also gives us a statement on the speed of convergence re-
lating the distance between x∗ and any xi to the distance between any two successive xi .
This is a powerful theorem. All we need to do is identify the appropriate set X to work
with, identify the appropriate metric d to work with, and ensure that f is indeed a con-
traction function (with respect to d). This enables us to solve for the fixed-point of f with
the above-described iterative process of applying f repeatedly, starting with any arbitrary
value of x0 ∈ X .
We leave it to you as an exercise to verify that f (x) = cos(x) is a contraction function in
the domain X = R with metric d defined as d(x1 , x2 ) = |x1 − x2 |. Now let’s write some
code to implement the fixed-point algorithm we described above. Note that we implement
this for any generic type X to represent an arbitrary domain X .
from typing import Callable, Iterator, TypeVar

X = TypeVar('X')

def iterate(step: Callable[[X], X], start: X) -> Iterator[X]:
    state = start

    while True:
        yield state
        state = step(state)
The above function takes as input a function (step: Callable[[X], X]) and a starting
value (start: X), and repeatedly applies the function while yielding the values in the
form of an Iterator[X], i.e., as a stream of values. This produces an endless stream though.
We need a way to specify convergence, i.e., when successive values of the stream are “close
enough.”
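Here is a minimal sketch of such a converge function, consistent with the description that follows (the version in rl/iterate.py may differ in its details):

from typing import Callable, Iterator, TypeVar

X = TypeVar('X')

def converge(values: Iterator[X], done: Callable[[X, X], bool]) -> Iterator[X]:
    # Yield values from the input stream, stopping as soon as two successive
    # values are deemed "close enough" by the user-supplied done function.
    a = next(values, None)
    if a is None:
        return
    yield a
    for b in values:
        yield b
        if done(a, b):
            return
        a = b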
The above function converge takes as input the generated values from iterate (argu-
ment values: Iterator[X]) and a signal to indicate convergence (argument done: Callable[[X,
X], bool]), and produces the generated values until done is True. It is the user’s responsi-
bility to write the function done and pass it to converge. Now let’s use these two functions
to solve for x = cos(x).
import numpy as np

x = 0.0
values = converge(
    iterate(lambda y: np.cos(y), x),
    lambda a, b: np.abs(a - b) < 1e-3
)
for i, v in enumerate(values):
    print(f"{i}: {v:.4f}")
This prints a trace with the index of the stream and the value at that index as the function
cos is repeatedly applied. It terminates when two successive values are within 3 decimal
places of each other.
0: 0.0000
1: 1.0000
2: 0.5403
3: 0.8576
4: 0.6543
5: 0.7935
6: 0.7014
7: 0.7640
8: 0.7221
9: 0.7504
10: 0.7314
11: 0.7442
12: 0.7356
13: 0.7414
14: 0.7375
15: 0.7401
16: 0.7384
17: 0.7396
18: 0.7388
We encourage you to try other starting values (other than the one we have above: x0 =
0.0) and see the trace. We also encourage you to identify other functions f which are
contractions in an appropriate metric. The above fixed-point code is in the file rl/iterate.py.
In this file, you will find two more functions last and converged that produce the final value
of the given iterator when its values converge according to the done function.
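As one example of the exercise suggested above, here is a sketch (our own example, not from rl/iterate.py) that reuses iterate and converge to run the Newton method update rule from earlier in this chapter on g(x) = x² − 2, whose positive root is √2:

# Newton's method as a fixed-point iteration: f(x) = x - g(x)/g'(x)
g = lambda x: x * x - 2.0
g_prime = lambda x: 2.0 * x

newton_values = converge(
    iterate(lambda x: x - g(x) / g_prime(x), 3.0),
    lambda a, b: abs(a - b) < 1e-10
)
for i, v in enumerate(newton_values):
    print(f"{i}: {v:.10f}")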
Our first Dynamic Programming algorithm is known as the Policy Evaluation algorithm
and it solves the MDP Prediction problem. Assume we are given a Finite MDP together
with a fixed policy π, so that we have the π-implied MRP's state-reward transition probability
function

P_R^π : N × D × S → [0, 1]

in the form of a data structure (since the states are finite, and the pairs of next state and
reward transitions from each non-terminal state are also finite). The Prediction problem is
to compute the Value Function of the MDP when evaluated with the policy π (equivalently,
the Value Function of the π-implied MRP), which we denote as V π : N → R.
We know from Chapters 1 and 2 that by extracting (from PR π ) the transition probability
function P π : N × S → [0, 1] of the implicit Markov Process and the reward function
Rπ : N → R, we can perform the following calculation for the Value Function V π : N → R
(expressed as a column vector V π ∈ Rm ) to solve this Prediction problem:
V π = (Im − γP π )−1 · Rπ
where I_m is the m × m identity matrix, the column vector R^π ∈ R^m represents the reward
function R^π, and P^π is an m × m matrix representing the transition probability function
P^π (rows and columns corresponding to the non-terminal states). However, when m is large,
this calculation won't scale. So, we look for a numerical algorithm that would solve (for V^π)
the following MRP Bellman Equation (for a large number of finite states):
V π = Rπ + γP π · V π
We define the Bellman Policy Operator B^π : R^m → R^m as:

B^π(V) = R^π + γP^π · V   for all V ∈ R^m   (3.1)

So the MRP Bellman Equation above can be expressed succinctly as:

V^π = B^π(V^π)

which means V^π ∈ R^m is a Fixed-Point of the Bellman Policy Operator B^π : R^m → R^m.
Note that the Bellman Policy Operator can be generalized to the case of non-finite MDPs
and V π is still a Fixed-Point for various generalizations of interest. However, since this
chapter focuses on developing algorithms for finite MDPs, we will work with the above
narrower (Equation (3.1)) definition. Also, for proofs of correctness of the DP algorithms
(based on Fixed-Point) in this chapter, we shall assume the discount factor γ < 1.
Note that B π is an affine transformation on vectors in Rm and should be thought of as
a generalization of a simple 1-D (R → R) affine transformation y = a + bx where the
multiplier b is replaced with the matrix γP π and the shift a is replaced with the column
vector Rπ .
We’d like to come up with a metric for which B π is a contraction function so we can
take advantage of Banach Fixed-Point Theorem and solve this Prediction problem by it-
erative applications of the Bellman Policy Operator B π . For any Value Function V ∈ Rm
(representing V : N → R), we shall express the Value for any state s ∈ N as V (s).
Our metric d : R^m × R^m → R shall be the L∞ norm defined as:

d(X, Y) = ‖X − Y‖_∞ = max_{s∈N} |(X − Y)(s)|

B^π is a contraction function under this metric because for all X, Y ∈ R^m:

max_{s∈N} |(B^π(X) − B^π(Y))(s)| = γ · max_{s∈N} |(P^π · (X − Y))(s)| ≤ γ · max_{s∈N} |(X − Y)(s)|
This gives us the following iterative algorithm (known as the Policy Evaluation algorithm
for fixed policy π : N × A → [0, 1]):
Vi+1 = B π (Vi ) = Rπ + γP π · Vi
We stop the algorithm when d(Vi , Vi+1 ) = maxs∈N |(Vi − Vi+1 )(s)| is adequately small.
It pays to emphasize that Banach Fixed-Point Theorem not only assures convergence to
the unique solution V π (no matter what Value Function V0 we start the algorithm with), it
also assures a reasonable speed of convergence (dependent on the choice of starting Value
Function V0 and the choice of γ). Now let’s write the code for Policy Evaluation.
DEFAULT_TOLERANCE = 1e-5

V = Mapping[NonTerminal[S], float]

def evaluate_mrp(
    mrp: FiniteMarkovRewardProcess[S],
    gamma: float
) -> Iterator[np.ndarray]:

    def update(v: np.ndarray) -> np.ndarray:
        return mrp.reward_function_vec + gamma * \
            mrp.get_transition_matrix().dot(v)

    v_0: np.ndarray = np.zeros(len(mrp.non_terminal_states))

    return iterate(update, v_0)

def almost_equal_np_arrays(
    v1: np.ndarray,
    v2: np.ndarray,
    tolerance: float = DEFAULT_TOLERANCE
) -> bool:
    return max(abs(v1 - v2)) < tolerance

def evaluate_mrp_result(
    mrp: FiniteMarkovRewardProcess[S],
    gamma: float
) -> V[S]:
    v_star: np.ndarray = converged(
        evaluate_mrp(mrp, gamma=gamma),
        done=almost_equal_np_arrays
    )
    return {s: v_star[i] for i, s in enumerate(mrp.non_terminal_states)}
The code should be fairly self-explanatory. Since the Policy Evaluation problem applies
to Finite MRPs, the function evaluate_mrp above takes as input mrp: FiniteMarkovRewardProcess[S]
and a gamma: float to produce an Iterator on Value Functions represented as np.ndarray
(for fast vector/matrix calculations). The function update in evaluate_mrp represents the
application of the Bellman Policy Operator B π . The function evaluate_mrp_result pro-
duces the Value Function for the given mrp and the given gamma, returning the last value
function on the Iterator (which terminates based on the almost_equal_np_arrays func-
tion, considering the maximum of the absolute value differences across all states). Note
that the return type of evaluate_mrp_result is V[S] which is an alias for Mapping[NonTerminal[S],
float], capturing the semantic of N → R. Note that evaluate_mrp is useful for debugging
(by looking at the trace of value functions in the execution of the Policy Evaluation algo-
rithm) while evaluate_mrp_result produces the desired output Value Function.
Note that although we defined the Bellman Policy Operator B π as operating on Value
Functions of the π-implied MRP, we can also view the Bellman Policy Operator B π as
operating on Value Functions of an MDP. To support this MDP view, we express Equation
(3.1) in terms of the MDP transitions/rewards specification, as follows:
B^π(V)(s) = Σ_{a∈A} π(s, a) · R(s, a) + γ · Σ_{a∈A} π(s, a) · Σ_{s′∈N} P(s, a, s′) · V(s′)   for all s ∈ N   (3.2)
If the number of non-terminal states of a given MRP is m, then the running time of each
iteration is O(m2 ). Note though that to construct an MRP from a given MDP and a given
policy, we have to perform O(m2 · k) operations, where k = |A|.
It is also useful to define the Greedy Policy Function

G : R^m → (N → A)

interpreted as a function mapping a Value Function V ∈ R^m to a deterministic policy π′_D : N → A, defined as:

G(V)(s) = π′_D(s) = argmax_{a∈A} {R(s, a) + γ · Σ_{s′∈N} P(s, a, s′) · V(s′)}   for all s ∈ N   (3.3)
Note that for any specific s, if two or more actions a achieve the maximization of R(s, a) +
γ · Σ_{s′∈N} P(s, a, s′) · V(s′), then we use an arbitrary rule in breaking ties and assigning a
single action a as the output of the above arg max operation. We shall use Equation (3.3)
in our mathematical exposition but we require a different (but equivalent) expression for
G(V )(s) to guide us with our code since the interface for FiniteMarkovDecisionProcess
operates on PR , rather than R and P. The equivalent expression for G(V )(s) is as follows:
G(V)(s) = argmax_{a∈A} { Σ_{s′∈S} Σ_{r∈D} P_R(s, a, r, s′) · (r + γ · W(s′)) }   for all s ∈ N   (3.4)

where W : S → R extends the Value Function V : N → R to all states, i.e., W(s′) = V(s′)
for s′ ∈ N and W(s′) = 0 for terminal states s′ ∈ T.

Note that in Equation (3.4), because we have to work with P_R, we need to consider
transitions to all states s′ ∈ S (versus transitions to all states s′ ∈ N in Equation (3.3)), and
so, we need to handle the transitions to terminal states s′ ∈ T carefully (essentially by using
the W function as defined above).
Now let’s write some code to create this “greedy policy” from a given value function,
guided by Equation (3.4).
import operator

def extended_vf(v: V[S], s: State[S]) -> float:
    def non_terminal_vf(st: NonTerminal[S], v=v) -> float:
        return v[st]
    return s.on_non_terminal(non_terminal_vf, 0.0)

def greedy_policy_from_vf(
    mdp: FiniteMarkovDecisionProcess[S, A],
    vf: V[S],
    gamma: float
) -> FiniteDeterministicPolicy[S, A]:
    greedy_policy_dict: Dict[S, A] = {}

    for s in mdp.non_terminal_states:
        q_values: Iterator[Tuple[A, float]] = \
            ((a, mdp.mapping[s][a].expectation(
                lambda s_r: s_r[1] + gamma * extended_vf(vf, s_r[0])
            )) for a in mdp.actions(s))

        greedy_policy_dict[s.state] = \
            max(q_values, key=operator.itemgetter(1))[0]

    return FiniteDeterministicPolicy(greedy_policy_dict)
As you can see above, the function greedy_policy_from_vf loops through all the non-
terminal states that serve as keys in greedy_policy_dict: Dict[S, A]. Within this loop,
we go through all the actions in A(s) and compute Q-Value Q(s, a) as the sum (over all
(s′ , r) pairs) of PR (s, a, r, s′ ) · (r + γ · W (s′ )), written as E(s′ ,r)∼PR [r + γ · W (s′ )]. Finally,
we calculate argmax_a Q(s, a) for all non-terminal states s, and return the resulting deterministic
policy as a FiniteDeterministicPolicy (which is our greedy policy).
Note that the extended_vf function represents the W : S → R function used in the right-hand-
side of Equation (3.4): it is the usual value function when its argument is a non-
terminal state and is the default value of 0 when its argument is a terminal state. We
shall use the extended_vf function in other Dynamic Programming algorithms later in this
chapter as they also involve the W : S → R function in the right-hand-side of their corre-
sponding governing equation.
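As a quick illustration of extended_vf (a hypothetical two-state example, relying on the NonTerminal and Terminal state wrappers used by the code above):

v: V[str] = {
    NonTerminal("in_stock"): 10.0,
    NonTerminal("low_stock"): 4.0
}
print(extended_vf(v, NonTerminal("in_stock")))   # 10.0: looked up in v
print(extended_vf(v, Terminal("sold_out")))      # 0.0: terminal states default to 0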
The word “Greedy” is a reference to the term “Greedy Algorithm,” which means an al-
gorithm that takes heuristic steps guided by locally-optimal choices in the hope of moving
towards a global optimum. Here, the reference to Greedy Policy means: if we have a policy
π and its corresponding Value Function V^π (obtained, say, using the Policy Evaluation algo-
rithm), then applying the Greedy Policy Function G on V^π gives us a deterministic policy
π′_D : N → A that is hopefully "better" than π in the sense that V^{π′_D} is "greater" than V^π.
We shall now make this statement precise and show how to use the Greedy Policy Function
to perform Policy Improvement.
If we are dealing with finite MDPs (with m non-terminal states), we'd represent the
Value Functions as vectors X, Y ∈ R^m, and say that X ≥ Y if and only if X(s) ≥ Y(s) for
all s ∈ N.
So whenever you hear terms like “Better Value Function” or “Improved Value Function,”
you should interpret it to mean that the Value Function is no worse for each of the states
(versus the Value Function it’s being compared to).
So then, what about the claim of π′_D = G(V^π) being "better" than π? The following theorem (known as the Policy Improvement Theorem) makes this claim precise: for any policy π and its greedy policy π′_D = G(V^π), we have V^{π′_D}(s) ≥ V^π(s) for all s ∈ N.
Proof. This proof is based on application of the Bellman Policy Operator on Value Func-
tions of the given MDP (note: this MDP view of the Bellman Policy Operator is expressed
in Equation (3.2)). We start by noting that applying the Bellman Policy Operator B^{π′_D} re-
peatedly, starting with the Value Function V^π, will converge to the Value Function V^{π′_D}.
Formally,

lim_{i→∞} (B^{π′_D})^i(V^π) = V^{π′_D}

So the proof is complete if we prove that:

(B^{π′_D})^{i+1}(V^π) ≥ (B^{π′_D})^i(V^π)   for all i = 0, 1, 2, . . .

which means we get a non-decreasing sequence of Value Functions [(B^{π′_D})^i(V^π) | i =
0, 1, 2, . . .] with repeated applications of B^{π′_D} starting with the Value Function V^π.
Let us prove this by induction. The base case (for i = 0) of the induction is to prove
that:

B^{π′_D}(V^π) ≥ V^π

Note that for the case of the deterministic policy π′_D and Value Function V^π, Equation
(3.2) simplifies to:

B^{π′_D}(V^π)(s) = R(s, π′_D(s)) + γ · Σ_{s′∈N} P(s, π′_D(s), s′) · V^π(s′)   for all s ∈ N

From Equation (3.3), we know that for each s ∈ N, π′_D(s) = G(V^π)(s) is the action that
maximizes {R(s, a) + γ · Σ_{s′∈N} P(s, a, s′) · V^π(s′)}. Therefore,

B^{π′_D}(V^π)(s) = max_{a∈A} {R(s, a) + γ · Σ_{s′∈N} P(s, a, s′) · V^π(s′)} = max_{a∈A} Q^π(s, a)   for all s ∈ N

Let's compare this equation against the Bellman Policy Equation for π (below):

V^π(s) = Σ_{a∈A} π(s, a) · Q^π(s, a)   for all s ∈ N

We see that V^π(s) is a weighted average of Q^π(s, a) (with weights equal to probabilities
π(s, a) over choices of a) while B^{π′_D}(V^π)(s) is the maximum (over choices of a) of Q^π(s, a).
Therefore,

B^{π′_D}(V^π) ≥ V^π
This establishes the base case of the proof by induction. Now to complete the proof, all
we have to do is to prove:

If (B^{π′_D})^{i+1}(V^π) ≥ (B^{π′_D})^i(V^π), then (B^{π′_D})^{i+2}(V^π) ≥ (B^{π′_D})^{i+1}(V^π), for all i = 0, 1, 2, . . .

Since (B^{π′_D})^{i+1}(V^π) = B^{π′_D}((B^{π′_D})^i(V^π)), from the definition of the Bellman Policy Oper-
ator (Equation (3.1)), we can write the following two equations:

(B^{π′_D})^{i+2}(V^π)(s) = R(s, π′_D(s)) + γ · Σ_{s′∈N} P(s, π′_D(s), s′) · (B^{π′_D})^{i+1}(V^π)(s′)   for all s ∈ N

(B^{π′_D})^{i+1}(V^π)(s) = R(s, π′_D(s)) + γ · Σ_{s′∈N} P(s, π′_D(s), s′) · (B^{π′_D})^i(V^π)(s′)   for all s ∈ N

Subtracting each side of the second equation from the first equation yields:

(B^{π′_D})^{i+2}(V^π)(s) − (B^{π′_D})^{i+1}(V^π)(s) = γ · Σ_{s′∈N} P(s, π′_D(s), s′) · ((B^{π′_D})^{i+1}(V^π)(s′) − (B^{π′_D})^i(V^π)(s′))   for all s ∈ N

Since γ · P(s, π′_D(s), s′) consists of all non-negative values and since the induction step
assumes (B^{π′_D})^{i+1}(V^π)(s′) ≥ (B^{π′_D})^i(V^π)(s′) for all s′ ∈ N, the right-hand-side of this
equation is non-negative, meaning the left-hand-side of this equation is non-negative, i.e.,

(B^{π′_D})^{i+2}(V^π)(s) ≥ (B^{π′_D})^{i+1}(V^π)(s)   for all s ∈ N
The way to understand the above proof is to think in terms of how each stage of further
application of B^{π′_D} improves the Value Function. Stage 0 is when you have the Value
Function V^π, where we execute the policy π throughout the MDP. Stage 1 is when you
have the Value Function B^{π′_D}(V^π), where from each state s, we execute the policy π′_D for
the first time step following s and then execute the policy π for all further time steps. This
has the effect of improving the Value Function from Stage 0 (V^π) to Stage 1 (B^{π′_D}(V^π)).
Stage 2 is when you have the Value Function (B^{π′_D})²(V^π), where from each state s, we
execute the policy π′_D for the first two time steps following s and then execute the policy π
for all further time steps. This has the effect of improving the Value Function from Stage
1 (B^{π′_D}(V^π)) to Stage 2 ((B^{π′_D})²(V^π)). And so on ... each stage applies policy π′_D instead
of policy π for one extra time step, which has the effect of improving the Value Function.
Note that "improve" means ≥ (really means that the Value Function doesn't get worse for
any of the states). These stages are simply the iterations of the Policy Evaluation algorithm
(using policy π′_D) with starting Value Function V^π, building a non-decreasing sequence of
Value Functions [(B^{π′_D})^i(V^π) | i = 0, 1, 2, . . .] that get closer and closer until they converge
to the Value Function V^{π′_D} that is ≥ V^π (hence, the term Policy Improvement).
The Policy Improvement Theorem yields our first Dynamic Programming algorithm
(called Policy Iteration) to solve the MDP Control problem. The Policy Iteration algorithm
is due to Ronald Howard (Howard 1960).
Figure 3.1.: Policy Iteration Loop
The Policy Iteration algorithm (depicted as a loop in Figure 3.1) starts with an arbitrary
Value Function V_0 ∈ R^m, and for each j = 0, 1, 2, . . . performs a Policy Improvement step
producing the deterministic policy π_{j+1} = G(V_j), followed by a Policy Evaluation step
producing the Value Function V_{j+1} = V^{π_{j+1}}. We perform these iterations (over j) until
V_{j+1} is identical to V_j (i.e., there is no further improvement to the Value Function). When
this happens, the following should hold:

V_j(s) = B^{G(V_j)}(V_j)(s) = R(s, G(V_j)(s)) + γ · Σ_{s′∈N} P(s, G(V_j)(s), s′) · V_j(s′)   for all s ∈ N
Figure 3.2.: Policy Iteration Convergence
From Equation (3.3), we know that for each s ∈ N, π_{j+1}(s) = G(V_j)(s) is the action that
maximizes {R(s, a) + γ · Σ_{s′∈N} P(s, a, s′) · V_j(s′)}. Therefore,

V_j(s) = max_{a∈A} {R(s, a) + γ · Σ_{s′∈N} P(s, a, s′) · V_j(s′)}   for all s ∈ N
But this in fact is the MDP Bellman Optimality Equation, which would mean that V_j =
V∗, i.e., when V_{j+1} is identical to V_j, the Policy Iteration algorithm has converged to the
Optimal Value Function. The associated deterministic policy at the convergence of the
Policy Iteration algorithm (π_{j+1} : N → A) is an Optimal Policy because V^{π_{j+1}} = V_j = V∗,
meaning that evaluating the MDP with the deterministic policy π_{j+1} achieves the Optimal
Value Function (depicted in Figure 3.2). This means the Policy Iteration algorithm solves
the MDP Control problem. This proves the following Theorem:
Theorem 3.7.1 (Policy Iteration Convergence Theorem). For a Finite MDP with |N| = m
and γ < 1, the Policy Iteration algorithm converges to the Optimal Value Function V∗ ∈ R^m along
with a Deterministic Optimal Policy π∗_D : N → A, no matter which Value Function V_0 ∈ R^m we
start the algorithm with.
Now let’s write some code for Policy Iteration Algorithm. Unlike Policy Evaluation
which repeatedly operates on Value Functions (and returns a Value Function), Policy Itera-
tion repeatedly operates on a pair of Value Function and Policy (and returns a pair of Value
Function and Policy). In the code below, notice the type Tuple[V[S], FinitePolicy[S,
A]] that represents a pair of Value Function and Policy. The function policy_iteration
repeatedly applies the function update on a pair of Value Function and Policy. The update
function, after splitting its input vf_policy into vf: V[S] and pi: FinitePolicy[S, A],
creates an MRP (mrp: FiniteMarkovRewardProcess[S]) from the combination of the input
mdp and pi. Then it performs a policy evaluation on mrp (using the evaluate_mrp_result
function) to produce a Value Function policy_vf: V[S], and finally creates a greedy (im-
proved) policy named improved_pi from policy_vf (using the previously-written func-
tion greedy_policy_from_vf). Thus the function update performs a Policy Evaluation fol-
lowed by a Policy Improvement. Notice also that policy_iteration offers the option to
perform the linear-algebra-solver-based computation of Value Function for a given policy
(get_value_function_vec method of the mrp object), in case the state space is not too large.
policy_iteration returns an Iterator on pairs of Value Function and Policy produced by
this process of repeated Policy Evaluation and Policy Improvement. almost_equal_vf_pis
is the function to decide termination based on the distance between two successive Value
Functions produced by Policy Iteration. policy_iteration_result returns the final (opti-
mal) pair of Value Function and Policy (from the Iterator produced by policy_iteration),
based on the termination criterion of almost_equal_vf_pis.
DEFAULT_TOLERANCE = 1e-5

def policy_iteration(
    mdp: FiniteMarkovDecisionProcess[S, A],
    gamma: float,
    matrix_method_for_mrp_eval: bool = False
) -> Iterator[Tuple[V[S], FinitePolicy[S, A]]]:

    def update(vf_policy: Tuple[V[S], FinitePolicy[S, A]])\
            -> Tuple[V[S], FiniteDeterministicPolicy[S, A]]:
        vf, pi = vf_policy
        mrp: FiniteMarkovRewardProcess[S] = mdp.apply_finite_policy(pi)
        policy_vf: V[S] = {mrp.non_terminal_states[i]: v for i, v in
                           enumerate(mrp.get_value_function_vec(gamma))}\
            if matrix_method_for_mrp_eval else evaluate_mrp_result(mrp, gamma)
        improved_pi: FiniteDeterministicPolicy[S, A] = greedy_policy_from_vf(
            mdp,
            policy_vf,
            gamma
        )
        return policy_vf, improved_pi

    v_0: V[S] = {s: 0.0 for s in mdp.non_terminal_states}
    pi_0: FinitePolicy[S, A] = FinitePolicy(
        {s.state: Choose(mdp.actions(s)) for s in mdp.non_terminal_states}
    )
    return iterate(update, (v_0, pi_0))

def almost_equal_vf_pis(
    x1: Tuple[V[S], FinitePolicy[S, A]],
    x2: Tuple[V[S], FinitePolicy[S, A]]
) -> bool:
    return max(
        abs(x1[0][s] - x2[0][s]) for s in x1[0]
    ) < DEFAULT_TOLERANCE

def policy_iteration_result(
    mdp: FiniteMarkovDecisionProcess[S, A],
    gamma: float,
) -> Tuple[V[S], FiniteDeterministicPolicy[S, A]]:
    return converged(policy_iteration(mdp, gamma), done=almost_equal_vf_pis)
If the number of non-terminal states of a given MDP is m and the number of actions
(|A|) is k, then the running time of Policy Improvement is O(m2 · k) and we’ve already
seen before that each iteration of Policy Evaluation is O(m2 · k).
Analogous to the Bellman Policy Operator, we define the Bellman Optimality Operator

B∗ : R^m → R^m

as:

B∗(V)(s) = max_{a∈A} {R(s, a) + γ · Σ_{s′∈N} P(s, a, s′) · V(s′)}   for all s ∈ N   (3.5)

We shall use Equation (3.5) in our mathematical exposition but we require a different
(but equivalent) expression for B∗(V)(s) to guide us with our code since the interface
for FiniteMarkovDecisionProcess operates on PR , rather than R and P. The equivalent
expression for B ∗ (V )(s) is as follows:
XX
B ∗ (V )(s) = max{ PR (s, a, r, s′ ) · (r + γ · W (s′ ))} for all s ∈ N (3.6)
a∈A
s′ ∈S r∈D
Note that the Bellman Optimality Operator B*, the Greedy Policy Function G and the Bellman Policy Operator B^π are related as follows:

B^{G(V)}(V) = B*(V) for all V ∈ ℝ^m     (3.7)

which is a succinct representation of the first stage of Policy Evaluation with an improved policy G(V^π) (note how all three of the Bellman Policy Operator, the Bellman Optimality Operator and the Greedy Policy Function come together in this equation).
Much like how the Bellman Policy Operator B π was motivated by the MDP Bellman
Policy Equation (equivalently, the MRP Bellman Equation), Bellman Optimality Operator
B ∗ is motivated by the MDP Bellman Optimality Equation (re-stated below):
V*(s) = max_{a∈A} { R(s, a) + γ · Σ_{s′∈N} P(s, a, s′) · V*(s′) }   for all s ∈ N
Therefore, we can express the MDP Bellman Optimality Equation succinctly as:

V* = B*(V*)

which means V* ∈ ℝ^m is a Fixed-Point of the Bellman Optimality Operator B* : ℝ^m → ℝ^m.
Note that the definitions of the Greedy Policy Function and of the Bellman Optimality
Operator that we have provided can be generalized to non-finite MDPs, and consequently
we can generalize Equation (3.7) and the statement that V ∗ is a Fixed-Point of the Bellman
Optimality Operator would still hold. However, in this chapter, since we are focused on
developing algorithms for finite MDPs, we shall stick to the definitions we’ve provided for
the case of finite MDPs.
Much like how we proved that B π is a contraction function, we want to prove that B ∗
is a contraction function (under L∞ norm) so we can take advantage of Banach Fixed-
Point Theorem and solve the Control problem by iterative applications of the Bellman
Optimality Operator B ∗ . So we need to prove that for all X, Y ∈ Rm ,
max_{s∈N} |(B*(X) − B*(Y))(s)| ≤ γ · max_{s∈N} |(X − Y)(s)|
This proof is a bit harder than the proof we did for B^π. Here we need to utilize two key properties of B*.

1. Monotonicity Property: for X, Y ∈ ℝ^m such that X(s) ≥ Y(s) for all s ∈ N, we have for all s ∈ N:

B*(X)(s) − B*(Y)(s)
= max_{a∈A} { R(s, a) + γ · Σ_{s′∈N} P(s, a, s′) · X(s′) } − max_{a∈A} { R(s, a) + γ · Σ_{s′∈N} P(s, a, s′) · Y(s′) } ≥ 0

2. Constant Shift Property: for X ∈ ℝ^m and a constant c ∈ ℝ (writing X + c for the Value Function that maps each s to X(s) + c), we have for all s ∈ N:

B*(X + c)(s) = max_{a∈A} { R(s, a) + γ · Σ_{s′∈N} P(s, a, s′) · (X(s′) + c) }
= max_{a∈A} { R(s, a) + γ · Σ_{s′∈N} P(s, a, s′) · X(s′) } + γ · c = B*(X)(s) + γ · c
With these two properties of B* in place, let's prove that B* is a contraction function. For given X, Y ∈ ℝ^m, let c = max_{s∈N} |(X − Y)(s)|, which means:

X(s) − c ≤ Y(s) ≤ X(s) + c for all s ∈ N

Since B* has the monotonicity property, we can apply B* throughout the above double-inequality:

B*(X − c)(s) ≤ B*(Y)(s) ≤ B*(X + c)(s) for all s ∈ N

Since B* has the constant shift property,

B*(X)(s) − γ · c ≤ B*(Y)(s) ≤ B*(X)(s) + γ · c for all s ∈ N

In other words,

max_{s∈N} |(B*(X) − B*(Y))(s)| ≤ γ · c = γ · max_{s∈N} |(X − Y)(s)|

which establishes that B* is a contraction function under the L∞ norm, so the Banach Fixed-Point Theorem applies: repeated application of B* (starting from any Value Function V₀ ∈ ℝ^m) converges to its unique Fixed-Point V*.
This gives us the following iterative algorithm, known as the Value Iteration algorithm, due to Richard Bellman (Bellman 1957a):

1. Start with any Value Function V₀ ∈ ℝ^m.
2. Iterating over i = 0, 1, 2, ..., calculate in each iteration:

V_{i+1}(s) = B*(V_i)(s) for all s ∈ N

We stop the algorithm when d(V_i, V_{i+1}) = max_{s∈N} |(V_i − V_{i+1})(s)| is adequately small.
It pays to emphasize that Banach Fixed-Point Theorem not only assures convergence to
the unique solution V ∗ (no matter what Value Function V0 we start the algorithm with), it
also assures a reasonable speed of convergence (dependent on the choice of starting Value
Function V0 and the choice of γ).
Value Iteration produces the Optimal Value Function V*, but how do we obtain the associated Optimal Policy? The answer lies in the Greedy Policy function G. Equation (3.7) told us that:

B^{G(V)}(V) = B*(V) for all V ∈ ℝ^m
Specializing V to be V*, we get:

B^{G(V*)}(V*) = B*(V*)

But we know that V* is the Fixed-Point of the Bellman Optimality Operator B*, i.e., B*(V*) = V*. Therefore,

B^{G(V*)}(V*) = V*

The above equation says V* is the Fixed-Point of the Bellman Policy Operator B^{G(V*)}. However, we know that B^{G(V*)} has a unique Fixed-Point equal to V^{G(V*)}. Therefore,

V^{G(V*)} = V*
This says that evaluating the MDP with the deterministic greedy policy G(V ∗ ) (policy
created from the Optimal Value Function V ∗ using the Greedy Policy Function G) in fact
achieves the Optimal Value Function V ∗ . In other words, G(V ∗ ) is the (Deterministic)
Optimal Policy π ∗ we’ve been seeking.
Now let's write the code for Value Iteration. The function value_iteration returns an Iterator on Value Functions (of type V[S]) produced by the Value Iteration algorithm. It uses the function update for application of the Bellman Optimality Operator. update prepares the Q-Values for a state by looping through all the allowable actions for the state, and then calculates the maximum of those Q-Values (over the actions). The Q-Value calculation is the same as what we saw in greedy_policy_from_vf: E_{(s′,r)∼P_R}[r + γ · W(s′)], using the P_R probabilities represented in the mapping attribute of the mdp object (essentially Equation (3.6)). Note the use of the previously-written function extended_vf to handle the function W : S → ℝ that appears in the definition of the Bellman Optimality Operator in Equation (3.6). The function value_iteration_result returns the final (optimal) Value Function, together with its associated Optimal Policy. It simply returns the last Value Function of the Iterator[V[S]] returned by value_iteration, using the termination condition specified in almost_equal_vfs.
DEFAULT_TOLERANCE = 1e-5

def value_iteration(
    mdp: FiniteMarkovDecisionProcess[S, A],
    gamma: float
) -> Iterator[V[S]]:

    def update(v: V[S]) -> V[S]:
        return {s: max(mdp.mapping[s][a].expectation(
            lambda s_r: s_r[1] + gamma * extended_vf(v, s_r[0])
        ) for a in mdp.actions(s)) for s in v}

    v_0: V[S] = {s: 0.0 for s in mdp.non_terminal_states}
    return iterate(update, v_0)

def almost_equal_vfs(
    v1: V[S],
    v2: V[S],
    tolerance: float = DEFAULT_TOLERANCE
) -> bool:
    return max(abs(v1[s] - v2[s]) for s in v1) < tolerance

def value_iteration_result(
    mdp: FiniteMarkovDecisionProcess[S, A],
    gamma: float
) -> Tuple[V[S], FiniteDeterministicPolicy[S, A]]:
    opt_vf: V[S] = converged(
        value_iteration(mdp, gamma),
        done=almost_equal_vfs
    )
    opt_policy: FiniteDeterministicPolicy[S, A] = greedy_policy_from_vf(
        mdp,
        opt_vf,
        gamma
    )
    return opt_vf, opt_policy
If the number of non-terminal states of a given MDP is m and the number of actions (|A|) is k, then the running time of each iteration of Value Iteration is O(m² · k).
We encourage you to play with the above implementations of Policy Evaluation, Policy
Iteration and Value Iteration (code in the file rl/dynamic_programming.py) by running it
on MDPs/Policies of your choice, and observing the traces of the algorithms.
Now let’s write some code to evaluate si_mdp with the policy fdp.
from pprint import pprint
implied_mrp: FiniteMarkovRewardProcess[InventoryState] =\
si_mdp.apply_finite_policy(fdp)
user_gamma = 0.9
pprint(evaluate_mrp_result(implied_mrp, gamma=user_gamma))
NonTerminal(state=InventoryState(on_hand=1, on_order=1)): -29.345029758390766,
NonTerminal(state=InventoryState(on_hand=2, on_order=0)): -30.345029758390766}
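The Policy Iteration run whose output is discussed next is not shown above. Here is a minimal sketch of how it might be invoked, assuming the si_mdp and user_gamma objects and the policy_iteration_result function defined earlier (variable names are illustrative):

opt_vf_pi, opt_policy_pi = policy_iteration_result(
    si_mdp,
    gamma=user_gamma
)
pprint(opt_vf_pi)
print(opt_policy_pi)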
This prints the Optimal Value Function and the corresponding Optimal Policy.
As we can see, the Optimal Policy is to not order if the Inventory Position (sum of On-
Hand and On-Order) is greater than 1 unit and to order 1 unit if the Inventory Position is
0 or 1. Finally, let’s run Value Iteration.
opt_vf_vi, opt_policy_vi = value_iteration_result(si_mdp, gamma=user_gamma)
pprint(opt_vf_vi)
print(opt_policy_vi)
You’ll see the output from Value Iteration matches the output produced from Policy
Iteration - this is a good validation of our code correctness. We encourage you to play
around with user_capacity, user_poisson_lambda, user_holding_cost, user_stockout_cost
and user_gamma (code in __main__ in rl/chapter3/simple_inventory_mdp_cap.py). As a
valuable exercise, using this code, discover the mathematical structure of the Optimal Pol-
icy as a function of the above inputs.
π₁ = G(V₀), V₀ → B^{π₁}(V₀) → (B^{π₁})²(V₀) → ... → (B^{π₁})^i(V₀) → ... → V^{π₁} = V₁
π₂ = G(V₁), V₁ → B^{π₂}(V₁) → (B^{π₂})²(V₁) → ... → (B^{π₂})^i(V₁) → ... → V^{π₂} = V₂
...
...
Each row in the layout above represents the progression of the Value Function for a
specific policy. Each row starts with the creation of the policy (for that row) using the
Greedy Policy Function G, and the remainder of the row consists of successive applications
of the Bellman Policy Operator (using that row’s policy) until convergence to the Value
Function for that row’s policy. So each row starts with a Policy Improvement and the rest
of the row is a Policy Evaluation. Notice how the end of one row dovetails into the start
of the next row with application of the Greedy Policy Function G. It’s also important to
recognize that Greedy Policy Function as well as Bellman Policy Operator apply to all states
in N . So, in fact, the entire Policy Iteration algorithm has 3 nested loops. The outermost
loop is over the rows in this 2-dimensional layout (each iteration in this outermost loop
creates an improved policy). The loop within this outermost loop is over the columns in
each row (each iteration in this loop applies the Bellman Policy Operator, i.e. the iterations
of Policy Evaluation). The innermost loop is over each state in N since we need to sweep
through all states in updating the Value Function when the Bellman Policy Operator is
applied on a Value Function (we also need to sweep through all states in applying the
Greedy Policy Function to improve the policy).
A higher-level view of Policy Iteration is to think of Policy Evaluation and Policy Im-
provement going back and forth iteratively - Policy Evaluation takes a policy and creates
the Value Function for that policy, while Policy Improvement takes a Value Function and
creates a Greedy Policy from it (that is improved relative to the previous policy). This was
depicted in Figure 3.1. It is important to recognize that this loop of Policy Evaluation and
Policy Improvement works to make the Value Function and the Policy increasingly con-
sistent with each other, until we reach convergence when the Value Function and Policy
become completely consistent with each other (as was illustrated in Figure 3.2).
We’d also like to share a visual of Policy Iteration that is quite popular in much of the
literature on Dynamic Programming, originally appearing in Sutton and Barto’s RL book
(Richard S. Sutton and Barto 2018). It is the visual of Figure 3.3. It’s a somewhat fuzzy sort
of visual, but it has its benefits in terms of the pedagogy of Policy Iteration. The idea behind
this image is that the lower line represents the “policy line” indicating the progression of
the policies as Policy Iteration algorithm moves along and the upper line represents the
“value function line” indicating the progression of the Value Functions as Policy Itera-
tion algorithm moves along. The arrows pointing towards the upper line (“value function
line”) represent a Policy Evaluation for a given policy π, yielding the point (Value Func-
tion) V π on the upper line. The arrows pointing towards the lower line (“policy line”)
represent a Greedy Policy Improvement from a Value Function V π , yielding the point
(policy) π ′ = G(V π ) on the lower line. The key concept here is that Policy Evaluation
(arrows pointing to upper line) and Policy Improvement (arrows pointing to lower line)
are “competing” - they “push in different directions” even as they aim to get the Value
Function and Policy to be consistent with each other. This concept of simultaneously try-
ing to compete and trying to be consistent might seem confusing and contradictory, so it
deserves a proper explanation. Things become clear by noting that there are actually two
notions of consistency between a Value Function V and Policy π.
Figure 3.3.: Progression Lines of Value Function and Policy in Policy Iteration (Image
Credit: Sutton-Barto’s RL Book)
1. The notion of the Value Function V being consistent with/close to the Value Function
V π of the policy π.
2. The notion of the Policy π being consistent with/close to the Greedy Policy G(V ) of
the Value Function V .
Policy Evaluation aims for the first notion of consistency, but in the process, makes it
worse in terms of the second notion of consistency. Policy Improvement aims for the sec-
ond notion of consistency, but in the process, makes it worse in terms of the first notion
of consistency. This also helps us understand the rationale for alternating between Policy
Evaluation and Policy Improvement so that neither of the above two notions of consistency
slip up too much (thanks to the alternating propping up of the two notions of consistency).
Also, note that as Policy Iteration progresses, the upper line and lower line get closer and
closer and the “pushing in different directions” looks more and more collaborative rather
than competing (the gaps in consistency become lesser and lesser). In the end, the two
lines intersect, when there is no more pushing to do for either of Policy Evaluation or Policy
Improvement since at convergence, π ∗ and V ∗ have become completely consistent.
Now we are ready to talk about a very important idea known as Generalized Policy Iter-
ation that is emphasized throughout Sutton and Barto’s RL book (Richard S. Sutton and
Barto 2018) as the perspective that unifies all variants of DP as well as RL algorithms.
Generalized Policy Iteration is the idea that we can evaluate the Value Function for a pol-
icy with any Policy Evaluation method, and we can improve a policy with any Policy Im-
provement method (not necessarily the methods used in the classical Policy Iteration DP
algorithm). In particular, we'd like to emphasize the idea that neither Policy Evaluation nor Policy Improvement needs to go fully towards the notion of consistency it is striving for. As a simple example, think of modifying Policy Evaluation (say for a policy π) to not go all the way to V^π, but instead just perform, say, 3 applications of the Bellman Policy Operator. This means it would partially bridge the gap on the first notion of consistency (getting closer to V^π but not going all the way to V^π), but it would also mean not slipping up too much on the second notion of consistency. As another example, think of updating just 5 of the states (say in a large state space) with the Greedy Policy Improvement function (rather than the normal Greedy Policy Improvement function that operates on all the states). This means it would partially bridge the gap on the second notion of consistency (getting closer to G(V^π) but not going all the way to G(V^π)), but it would also mean not slipping up too much on the first notion of consistency. A concrete example of Generalized
Policy Iteration is in fact Value Iteration. In Value Iteration, we apply the Bellman Policy
Operator just once before moving on to Policy Improvement. In a 2-dimensional layout,
this is what Value Iteration looks like:
π₁ = G(V₀), V₀ → B^{π₁}(V₀) = V₁
π₂ = G(V₁), V₁ → B^{π₂}(V₁) = V₂
...
...
π_{j+1} = G(V_j), V_j → B^{π_{j+1}}(V_j) = V*
So the greedy policy improvement step is unchanged, but Policy Evaluation is reduced
to just a single Bellman Policy Operator application. In fact, pretty much all control al-
gorithms in Reinforcement Learning can be viewed as special cases of Generalized Policy
Iteration. In some of the simple versions of Reinforcement Learning Control algorithms,
the Policy Evaluation step is done for just a single state (versus for all states in usual Policy
Iteration, or even in Value Iteration) and the Policy Improvement step is also done for just
a single state. So essentially these Reinforcement Learning Control algorithms are an al-
ternating sequence of single-state policy evaluation and single-state policy improvement
(where the single-state is the state produced by sampling or the state that is encountered
in a real-world environment interaction). Figure 3.4 illustrates Generalized Policy Itera-
tion as the shorter-length arrows (versus the longer-length arrows seen in Figure 3.3 for
the usual Policy Iteration algorithm). Note how these shorter-length arrows don’t go all
the way to either the “value function line” or the “policy line” but they do go some part
of the way towards the line they are meant to go towards at that stage in the algorithm.
We would go so far as to say that the Bellman Equations and the concept of General-
ized Policy Iteration are the two most important concepts to internalize in the study of
Reinforcement Learning, and we highly encourage you to think along the lines of these
two ideas when we present several algorithms later in this book. The importance of the
concept of Generalized Policy Iteration (GPI) might not be fully visible to you yet, but we
hope that GPI will be your mantra by the time you finish this book. For now, let’s just note
the key takeaway regarding GPI - it is any algorithm to solve MDP control that alternates
between some form of Policy Evaluation and some form of Policy Improvement. We will
bring up GPI several times later in this book.
Figure 3.4.: Progression Lines of Value Function and Policy in Generalized Policy Iteration
(Image Credit: Coursera Course on Fundamentals of RL)
One noteworthy form of Asynchronous Dynamic Programming prioritizes the order in which states are updated according to their Value Function Gaps (how much an update would change each state's value), an approach commonly known as prioritized sweeping. After each state's value is updated with the Bellman Optimality Operator, we update the Value Function Gap for all the states whose Value Function Gap does get changed as a result of this state value update. These are exactly the states from which we have a prob-
abilistic transition to the state whose value just got updated. What this also means is that
we need to maintain the reverse transition dynamics in our data structure representation.
So, after each state value update, the queue of states is resorted (by their value function
gaps). We always pull out the state with the largest value function gap (from the top of
the queue), and update the value function for that state. This prioritizes updates of states
with the largest gaps, and it ensures that we quickly get to a point where all value function
gaps are low enough.
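Here is a standalone sketch of this gap-prioritized update scheme on a simplified dict-based finite MDP representation (mdp[s][a] is a list of (probability, reward, next_state) triples; states absent from mdp are treated as terminal; all names are illustrative and this is not the book's interface):

import heapq
from typing import Dict, List, Set, Tuple

def prioritized_sweeping(
    mdp: Dict[str, Dict[str, List[Tuple[float, float, str]]]],
    gamma: float,
    tolerance: float = 1e-5
) -> Dict[str, float]:
    v: Dict[str, float] = {s: 0.0 for s in mdp}

    def backup(s: str) -> float:
        # Bellman Optimality Operator applied at state s (terminal states have value 0)
        return max(
            sum(p * (r + gamma * v.get(s1, 0.0)) for p, r, s1 in transitions)
            for transitions in mdp[s].values()
        )

    # reverse transition dynamics: predecessors[s1] = states with a transition into s1
    predecessors: Dict[str, Set[str]] = {s: set() for s in mdp}
    for s, actions in mdp.items():
        for transitions in actions.values():
            for _, _, s1 in transitions:
                if s1 in predecessors:
                    predecessors[s1].add(s)

    # max-heap (via negated gaps) of Value Function Gaps |B*(V)(s) - V(s)|
    heap: List[Tuple[float, str]] = [(-abs(backup(s) - v[s]), s) for s in mdp]
    heapq.heapify(heap)
    while heap:
        neg_gap, s = heapq.heappop(heap)
        if -neg_gap < tolerance:
            break    # the largest remaining gap is small enough
        v[s] = backup(s)
        # only the predecessors of s can have their gaps changed by this update
        for s0 in predecessors[s]:
            heapq.heappush(heap, (-abs(backup(s0) - v[s0]), s0))
    return v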
Another form of Asynchronous Dynamic Programming worth mentioning here is Real-
Time Dynamic Programming (RTDP). RTDP means we run a Dynamic Programming algo-
rithm while the AI agent is experiencing real-time interaction with the environment. When
a state is visited during the real-time interaction, we make an update for that state’s value.
Then, as we transition to another state as a result of the real-time interaction, we update
that new state’s value, and so on. Note also that in RTDP, the choice of action is the real-
time action executed by the AI agent, which the environment responds to. This action
choice is governed by the policy implied by the value function for the encountered state
at that point in time in the real-time interaction.
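Here is a standalone sketch of this RTDP idea, reusing the simplified dict-based MDP representation from the previous sketch; the real-time environment interaction is simulated here by sampling transitions (all names are illustrative):

import random
from typing import Dict, List, Tuple

def rtdp_episode(
    mdp: Dict[str, Dict[str, List[Tuple[float, float, str]]]],
    v: Dict[str, float],
    start_state: str,
    gamma: float,
    max_steps: int = 1000
) -> None:
    s = start_state
    for _ in range(max_steps):
        if s not in mdp:    # terminal state reached
            break

        def q(a: str) -> float:
            # Q-Value of action a at the currently-visited state s
            return sum(p * (r + gamma * v.get(s1, 0.0))
                       for p, r, s1 in mdp[s][a])

        # update the visited state's value with a Bellman Optimality backup
        greedy_action = max(mdp[s], key=q)
        v[s] = q(greedy_action)
        # the real-time action is the greedy action implied by the current
        # value function; the environment's response is simulated by sampling
        probs, _, next_states = zip(*mdp[s][greedy_action])
        s = random.choices(next_states, weights=probs)[0]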
Finally, we need to highlight that often special types of structures of MDPs can bene-
fit from specific customizations of Dynamic Programming algorithms (typically, Asyn-
chronous). One such specialization is when each state is encountered not more than once
in each random sequence of state occurrences when an AI agent plays out an MDP, and
when all such random sequences of the MDP terminate. This structure can be conceptu-
alized as a Directed Acyclic Graph wherein each non-terminal node in the Directed Acyclic
Graph (DAG) represents a pair of non-terminal state and action, and each terminal node in
the DAG represents a terminal state (the graph edges represent probabilistic transitions of
the MDP). In this specialization, the MDP Prediction and Control problems can be solved
in a fairly simple manner - by walking backwards on the DAG from the terminal nodes and
setting the Value Function of visited states (in the backward DAG walk) using the Bellman
Optimality Equation (for Control) or Bellman Policy Equation (for Prediction). Here we
don’t need the “iterate to convergence” approach of Policy Evaluation or Policy Iteration
or Value Iteration. Rather, all these Dynamic Programming algorithms essentially reduce
to a simple back-propagation of the Value Function on the DAG. This means, states are
visited (and their Value Functions set) in the order determined by the reverse sequence
of a Topological Sort on the DAG. We shall make this DAG back-propagation Dynamic
Programming algorithm clear for a special DAG structure - Finite-Horizon MDPs - where
all random sequences of the MDP terminate within a fixed number of time steps and each
time step has a separate (from other time steps) set of states. This special case of Finite-
Horizon MDPs is fairly common in Financial Applications and so, we cover it in detail in
the next section.
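Before specializing to Finite-Horizon MDPs, here is a standalone sketch of this DAG back-propagation idea for the Control problem, again on the simplified dict-based MDP representation, visiting states in reverse topological order (illustrative, not the book's implementation):

from graphlib import TopologicalSorter
from typing import Dict, List, Tuple

def dag_optimal_vf(
    mdp: Dict[str, Dict[str, List[Tuple[float, float, str]]]],
    gamma: float
) -> Dict[str, float]:
    # TopologicalSorter(graph) treats graph[s] as the nodes that must come
    # before s; by listing each state's successor states there, static_order()
    # yields successor states before the states that lead to them, which is
    # exactly the backward-walk order we need on the DAG.
    graph = {s: {s1 for trs in actions.values() for _, _, s1 in trs}
             for s, actions in mdp.items()}
    v: Dict[str, float] = {}
    for s in TopologicalSorter(graph).static_order():
        if s not in mdp:          # terminal state
            v[s] = 0.0
        else:                     # Bellman Optimality Equation, no iteration needed
            v[s] = max(
                sum(p * (r + gamma * v[s1]) for p, r, s1 in trs)
                for trs in mdp[s].values()
            )
    return v

For the Prediction problem, the max over actions would simply be replaced by a probability-weighted average over actions under the given policy, per the Bellman Policy Equation.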
In a Finite-Horizon MDP, every sequence of states terminates within a fixed number of time steps T, and each time step has a separate (from other time steps) set of countable states. So, all states at time-step T are terminal states and some states before time-step T could be terminal states. For all
t = 0, 1, . . . , T , denote the set of states for time step t as St , the set of terminal states for
time step t as Tt and the set of non-terminal states for time step t as Nt = St − Tt (note:
NT = ∅). As mentioned previously, when the MDP is not time-homogeneous, we augment
each state to include the index of the time step so that the augmented state at time step t
is (t, st ) for st ∈ St . The entire MDP’s (augmented) state space S is:
{(t, st )|t = 0, 1, . . . , T, st ∈ St }
We need a Python class to represent this augmented state space.
@dataclass(frozen=True)
class WithTime(Generic[S]):
    state: S
    time: int = 0
Likewise, the set of terminal states T of the entire MDP is:

{(t, st) | t = 0, 1, ..., T, st ∈ Tt}
As usual, the set of non-terminal states is denoted as N = S − T .
We denote the set of rewards receivable by the AI agent at time t as Dt (countable subset
of R) and we denote the allowable actions for states in Nt as At . In a more generic setting,
as we shall represent in our code, each non-terminal state (t, s_t) has its own set of allowable actions, denoted A(s_t). However, for ease of exposition, here we shall treat all non-terminal
states at a particular time step to have the same set of allowable actions At . Let us denote
the entire action space A of the MDP as the union of all the At over all t = 0, 1, . . . , T − 1.
The state-reward transition probability function

P_R : N × A × D × S → [0, 1]

is given by:

P_R((t, s_t), a_t, r_{t′}, (t′, s_{t′})) = (P_R)_t(s_t, a_t, r_{t′}, s_{t′}) if t′ = t + 1, s_{t′} ∈ S_{t′} and r_{t′} ∈ D_{t′}, and 0 otherwise,

for all t = 0, 1, ..., T − 1, s_t ∈ N_t, a_t ∈ A_t.
So it is convenient to represent a finite-horizon MDP with separate state-reward transition probability functions (P_R)_t for each time step. Likewise, it is convenient to represent any policy of the MDP

π : N × A → [0, 1]

as:

π((t, s_t), a_t) = π_t(s_t, a_t)

where

π_t : N_t × A_t → [0, 1]

are the separate policies for each of the time steps t = 0, 1, ..., T − 1.
So essentially we interpret π as being composed of the sequence (π_0, π_1, ..., π_{T−1}).
Consequently, the Value Function for a given policy π (equivalently, the Value Function for the π-implied MRP)

V^π : N → ℝ

can be conveniently represented in terms of a sequence of Value Functions

V_t^π : N_t → ℝ

for each of the time steps t = 0, 1, ..., T − 1, defined as:

V_t^π(s_t) = Σ_{s_{t+1}∈S_{t+1}} Σ_{r_{t+1}∈D_{t+1}} (P_R^{π_t})_t(s_t, r_{t+1}, s_{t+1}) · (r_{t+1} + γ · W_{t+1}^π(s_{t+1}))     (3.8)

for all t = 0, 1, ..., T − 1, s_t ∈ N_t
where

W_t^π(s_t) = V_t^π(s_t) if s_t ∈ N_t, and 0 if s_t ∈ T_t,   for all t = 1, 2, ..., T

and where (P_R^{π_t})_t : N_t × D_{t+1} × S_{t+1} → [0, 1] for all t = 0, 1, ..., T − 1 represent the π-implied MRP's state-reward transition probability functions for the time steps, defined as:

(P_R^{π_t})_t(s_t, r_{t+1}, s_{t+1}) = Σ_{a_t∈A_t} π_t(s_t, a_t) · (P_R)_t(s_t, a_t, r_{t+1}, s_{t+1})   for all t = 0, 1, ..., T − 1
So for a Finite MDP, this yields a simple algorithm to calculate V_t^π for all t, by simply decrementing down from t = T − 1 to t = 0 and using Equation (3.8) to calculate V_t^π for all t = 0, 1, ..., T − 1 from the known values of W_{t+1}^π (since we are decrementing in time index t).
This algorithm is the adaptation of Policy Evaluation to the finite horizon case with this
simple technique of “stepping back in time” (known as Backward Induction). Let’s write
some code to implement this algorithm. We are given an MDP over the augmented (finite)
state space WithTime[S], and a policy π (also over the augmented state space WithTime[S]).
So, we can use the method apply_finite_policy in FiniteMarkovDecisionProcess[WithTime[S],
A] to obtain the π-implied MRP of type FiniteMarkovRewardProcess[WithTime[S]].
Our first task is to "unwrap" the state-reward probability transition function P_R^π of this MRP into a time-indexed sequence of state-reward transition probability functions (P_R^{π_t})_t for t = 0, 1, ..., T − 1, arranged as a Sequence[RewardTransition[S]] (this unwrapping is performed by the function unwrap_finite_horizon_MRP in rl/finite_horizon.py, analogous to the unwrap_finite_horizon_MDP function shown later in this section).
Now that we have the state-reward transition functions (P_R^{π_t})_t arranged in the form of a Sequence[RewardTransition[S]], we are ready to perform backward induction to calculate V_t^π. The following function evaluate accomplishes it with a straightforward use of Equation (3.8), as described above. Note the use of the previously-written extended_vf function, that represents the W_t^π : S_t → ℝ function appearing on the right-hand-side of Equation (3.8).
def evaluate(
    steps: Sequence[RewardTransition[S]],
    gamma: float
) -> Iterator[V[S]]:
    v: List[V[S]] = []
    for step in reversed(steps):
        v.append({s: res.expectation(
            lambda s_r: s_r[1] + gamma * (
                extended_vf(v[-1], s_r[0]) if len(v) > 0 else 0.
            )
        ) for s, res in step.items()})
    return reversed(v)
If |N_t| is O(m), then the running time of this algorithm is O(m² · T). However, note that it takes O(m² · k · T) to convert the MDP to the π-implied MRP (where |A_t| is O(k)).
Now we move on to the Control problem - to calculate the Optimal Value Function and the Optimal Policy. Similar to the pattern seen so far, the Optimal Value Function

V* : N → ℝ

can be conveniently represented in terms of a sequence of Value Functions

V_t* : N_t → ℝ

for each of the time steps t = 0, 1, ..., T − 1, defined as:

V_t*(s_t) = max_{a_t∈A_t} { Σ_{s_{t+1}∈S_{t+1}} Σ_{r_{t+1}∈D_{t+1}} (P_R)_t(s_t, a_t, r_{t+1}, s_{t+1}) · (r_{t+1} + γ · W_{t+1}*(s_{t+1})) }     (3.9)

for all t = 0, 1, ..., T − 1, s_t ∈ N_t

where

W_t*(s_t) = V_t*(s_t) if s_t ∈ N_t, and 0 if s_t ∈ T_t,   for all t = 1, 2, ..., T
The associated Optimal (Deterministic) Policy

(π_D*)_t : N_t → A_t

is defined as:

(π_D*)_t(s_t) = argmax_{a_t∈A_t} { Σ_{s_{t+1}∈S_{t+1}} Σ_{r_{t+1}∈D_{t+1}} (P_R)_t(s_t, a_t, r_{t+1}, s_{t+1}) · (r_{t+1} + γ · W_{t+1}*(s_{t+1})) }     (3.10)

for all t = 0, 1, ..., T − 1, s_t ∈ N_t
So for a Finite MDP, this yields a simple algorithm to calculate V_t* for all t, by simply decrementing down from t = T − 1 to t = 0, using Equation (3.9) to calculate V_t*, and Equation (3.10) to calculate (π_D*)_t, for all t = 0, 1, ..., T − 1, from the known values of W_{t+1}* (since we are decrementing in time index t).
This algorithm is the adaptation of Value Iteration to the finite horizon case with this
simple technique of “stepping back in time” (known as Backward Induction). Let’s write
some code to implement this algorithm. We are given an MDP over the augmented (finite) state space WithTime[S]. So this MDP is of type FiniteMarkovDecisionProcess[WithTime[S], A]. Our first task is to "unwrap" the state-reward probability transition function P_R of this MDP into a time-indexed sequence of state-reward probability transition functions (P_R)_t, t = 0, 1, ..., T − 1. This is accomplished by the following function unwrap_finite_horizon_MDP (itertools.groupby groups the augmented states by their time step, and the function without_time strips the time step from the augmented states when placing the states in (P_R)_t, i.e., in a Sequence[StateActionMapping[S, A]]).
from itertools import groupby

ActionMapping = Mapping[A, StateReward[S]]
StateActionMapping = Mapping[NonTerminal[S], ActionMapping[A, S]]

def unwrap_finite_horizon_MDP(
    process: FiniteMarkovDecisionProcess[WithTime[S], A]
) -> Sequence[StateActionMapping[S, A]]:

    def time(x: WithTime[S]) -> int:
        return x.time

    def single_without_time(
        s_r: Tuple[State[WithTime[S]], float]
    ) -> Tuple[State[S], float]:
        if isinstance(s_r[0], NonTerminal):
            ret: Tuple[State[S], float] = (
                NonTerminal(s_r[0].state.state),
                s_r[1]
            )
        else:
            ret = (Terminal(s_r[0].state.state), s_r[1])
        return ret

    def without_time(arg: ActionMapping[A, WithTime[S]]) -> \
            ActionMapping[A, S]:
        return {a: sr_distr.map(single_without_time)
                for a, sr_distr in arg.items()}

    return [{NonTerminal(s.state): without_time(
        process.mapping[NonTerminal(s)]
    ) for s in states} for _, states in groupby(
        sorted(
            (nt.state for nt in process.non_terminal_states),
            key=time
        ),
        key=time
    )]
Now that we have the state-reward transition functions (P_R)_t arranged in the form of a Sequence[StateActionMapping[S, A]], we are ready to perform backward induction to calculate V_t*. The following function optimal_vf_and_policy accomplishes it with a straightforward use of Equations (3.9) and (3.10), as described above.
from operator import itemgetter

def optimal_vf_and_policy(
    steps: Sequence[StateActionMapping[S, A]],
    gamma: float
) -> Iterator[Tuple[V[S], FiniteDeterministicPolicy[S, A]]]:
    v_p: List[Tuple[V[S], FiniteDeterministicPolicy[S, A]]] = []
    for step in reversed(steps):
        this_v: Dict[NonTerminal[S], float] = {}
        this_a: Dict[S, A] = {}
        for s, actions_map in step.items():
            action_values = ((res.expectation(
                lambda s_r: s_r[1] + gamma * (
                    extended_vf(v_p[-1][0], s_r[0]) if len(v_p) > 0 else 0.
                )
            ), a) for a, res in actions_map.items())
            v_star, a_star = max(action_values, key=itemgetter(0))
            this_v[s] = v_star
            this_a[s.state] = a_star
        v_p.append((this_v, FiniteDeterministicPolicy(this_a)))
    return reversed(v_p)
If |N_t| is O(m) for all t and |A_t| is O(k), then the running time of this algorithm is O(m² · k · T).
Note that these algorithms for finite-horizon finite MDPs do not require any “iterations
to convergence” like we had for regular Policy Evaluation and Value Iteration. Rather, in
these algorithms we simply walk back in time and immediately obtain the Value Function
for each time step from the next time step’s Value Function (which is already known since
we walk back in time). This technique of “backpropagation of Value Function” goes by
the name of Backward Induction algorithms, and is quite commonplace in many Financial
applications (as we shall see later in this book). The above Backward Induction code is in
the file rl/finite_horizon.py.
The inventory evolves as I_{t+1} = I_t − min(I_t, d_t), where d_t is the random demand on day t, governed by a Poisson distribution with mean λ_i if the action (index of the price choice) on day t is i ∈ A_t. Also, note that the sales revenue on day t is equal to min(I_t, d_t) · P_i. Therefore, the state-reward probability transition function for time index t

(P_R)_t : N_t × A_t × D_{t+1} × S_{t+1} → [0, 1]
is defined as:

(P_R)_t(I_t, i, r_{t+1}, I_t − k) =
    e^(−λ_i) · λ_i^k / k!                     if k < I_t and r_{t+1} = k · P_i
    Σ_{j=I_t}^∞ e^(−λ_i) · λ_i^j / j!         if k = I_t and r_{t+1} = k · P_i
    0                                         otherwise
Mapping[WithTime[int],
Mapping[int, FiniteDistribution[Tuple[WithTime[int], float]]]]
for s in range(initial_inventory + 1)
})
self.mdp = finite_horizon_MDP(self.single_step_mdp, time_steps)
The class ClearancePricingMDP also provides the following two methods (code below):
• get_vf_for_policy that produces the Value Function for a given policy π, by first creating the π-implied MRP from mdp, then unwrapping the MRP into a sequence of state-reward transition probability functions (P_R^{π_t})_t, and then performing backward induction using the previously-written function evaluate to calculate the Value Function.
• get_optimal_vf_and_policy that produces the Optimal Value Function and Optimal Policy, by first unwrapping self.mdp into a sequence of state-reward transition probability functions (P_R)_t, and then performing backward induction using the previously-written function optimal_vf_and_policy to calculate the Optimal Value Function and Optimal Policy.
from rl.finite_horizon import evaluate, optimal_vf_and_policy

def get_vf_for_policy(
    self,
    policy: FinitePolicy[WithTime[int], int]
) -> Iterator[V[int]]:
    mrp: FiniteMarkovRewardProcess[WithTime[int]] \
        = self.mdp.apply_finite_policy(policy)
    return evaluate(unwrap_finite_horizon_MRP(mrp), 1.)

def get_optimal_vf_and_policy(self)\
        -> Iterator[Tuple[V[int], FiniteDeterministicPolicy[int, int]]]:
    return optimal_vf_and_policy(unwrap_finite_horizon_MDP(self.mdp), 1.)
Now let’s create a simple instance of ClearancePricingMDP for M = 12, T = 8 and 4 price
choices: “Full Price,” “30% Off,” “50% Off,” “70% Off” with respective mean daily demand
of 0.5, 1.0, 1.5, 2.5.
ii = 12
steps = 8
pairs = [(1.0, 0.5), (0.7, 1.0), (0.5, 1.5), (0.3, 2.5)]
cp: ClearancePricingMDP = ClearancePricingMDP(
    initial_inventory=ii,
    time_steps=steps,
    price_lambda_pairs=pairs
)
Now let us calculate its Value Function for a stationary policy that chooses “Full Price”
if inventory is less than 2, otherwise “30% Off” if inventory is less than 5, otherwise “50%
Off” if inventory is less than 8, otherwise “70% Off.” Since we have a stationary policy, we
can represent it as a single-step policy and combine it with the single-step MDP we had cre-
ated above (attribute single_step_mdp) to create a single_step_mrp: FiniteMarkovRewardProcess[int].
Then we use the function finite_horizon_MRP (from file rl/finite_horizon.py) to create the
entire (augmented state) MRP of type FiniteMarkovRewardProcess[WithTime[int]]. Fi-
nally, we unwrap this MRP into a sequence of state-reward transition probability functions
and perform backward induction to calculate the Value Function for this stationary policy.
Running the following code tells us that V_0^π(12) is about 4.91 (assuming full price is 1),
which is the Expected Revenue one would obtain over 8 days, starting with an inventory
of 12, and executing this stationary policy (under the assumed demand distributions as a
function of the price choices).
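Here is a minimal sketch of how that calculation might look, assuming the cp, ii and steps objects created above, the evaluate and unwrap_finite_horizon_MRP functions from rl/finite_horizon.py, and a finite_horizon_MRP function analogous to finite_horizon_MDP (module paths, the FiniteDeterministicPolicy constructor form and exact signatures are assumptions):

from rl.policy import FiniteDeterministicPolicy
from rl.finite_horizon import finite_horizon_MRP, unwrap_finite_horizon_MRP, evaluate

def stationary_price_choice(inventory: int) -> int:
    # price index: 0 = "Full Price", 1 = "30% Off", 2 = "50% Off", 3 = "70% Off"
    if inventory < 2:
        return 0
    elif inventory < 5:
        return 1
    elif inventory < 8:
        return 2
    return 3

single_step_policy: FiniteDeterministicPolicy[int, int] = \
    FiniteDeterministicPolicy(
        {s: stationary_price_choice(s) for s in range(ii + 1)}
    )
single_step_mrp: FiniteMarkovRewardProcess[int] = \
    cp.single_step_mdp.apply_finite_policy(single_step_policy)
vf_for_policy: Iterator[V[int]] = evaluate(
    unwrap_finite_horizon_MRP(finite_horizon_MRP(single_step_mrp, steps)),
    1.
)
print(next(iter(vf_for_policy)))    # the time-0 Value Function V_0^pi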
Figure 3.5.: Optimal Policy Heatmap
Now let us determine the Optimal Policy and Optimal Value Function for this instance of ClearancePricingMDP. Running cp.get_optimal_vf_and_policy() and evaluating the Optimal Value Function for time step 0 and inventory of 12, i.e., V_0*(12), gives us a value of 5.64, which is the Expected Revenue we'd obtain over the 8 days if we executed the Optimal Policy.
Now let us plot the Optimal Price as a function of time steps and inventory levels.
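Here is a minimal sketch of how the prices grid used in the plot below might be constructed, assuming each FiniteDeterministicPolicy produced by cp.get_optimal_vf_and_policy() exposes its state-to-action mapping as an action_for attribute (an assumption), with inventory levels 0, 1, ..., ii:

import matplotlib.pyplot as plt
import numpy as np
from typing import List

# one row per time step: the optimal price index for each inventory level
prices: List[List[int]] = [
    [policy.action_for[s] for s in range(ii + 1)]
    for _, policy in cp.get_optimal_vf_and_policy()
]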
heatmap = plt.imshow(np.array(prices).T, origin='lower')
plt.colorbar(heatmap, shrink=0.5, aspect=5)
plt.xlabel("Time Steps")
plt.ylabel("Inventory")
plt.show()
Figure 3.5 shows us the image produced by the above code. The light shade is “Full
Price,” the medium shade is “30% Off” and the dark shade is “50% Off.” This tells us
that on day 0, the Optimal Price is “30% Off” (corresponding to State 12, i.e., for starting
inventory M = I0 = 12). However, if the starting inventory I0 were less than 7, then the
Optimal Price is “Full Price.” This makes intuitive sense because the lower the inventory,
the less inclination we’d have to cut prices. We see that the thresholds for price cuts shift
as time progresses (as we move horizontally in the figure). For instance, on Day 5, we set
“Full Price” only if inventory has dropped below 3 (this would happen if we had a good
degree of sales on the first 5 days), we set “30% Off” if inventory is 3 or 4 or 5, and we set
“50% Off” if inventory is greater than 5. So even if we sold 6 units in the first 5 days, we’d
offer “50% Off” because we have only 3 days remaining now and 6 units of inventory left.
This makes intuitive sense. We see that the thresholds shift even further as we move to
Days 6 and 7. We encourage you to play with this simple application of Dynamic Pricing
by changing M, T, N, [(P_i, λ_i) | 1 ≤ i ≤ N] and studying how the Optimal Value Function
changes and more importantly, studying the thresholds of inventory (under optimality)
for various choices of prices and how these thresholds vary as time progresses.
The Finite MDP algorithms covered in this chapter are called “tabular” algorithms. The
word “tabular” (for “table”) refers to the fact that the MDP is specified in the form of a
finite data structure and the Value Function is also represented as a finite “table” of non-
terminal states and values. These tabular algorithms typically make a sweep through all
non-terminal states in each iteration to update the Value Function. This is not possible for
large state spaces or infinite state spaces where we need some function approximation for
the Value Function. The good news is that we can modify each of these tabular algorithms
such that instead of sweeping through all the non-terminal states at each step, we simply
sample an appropriate subset of non-terminal states, calculate the values for these sam-
pled states with the appropriate Bellman calculations (just like in the tabular algorithms),
and then create/update a function approximation (for the Value Function) with the sam-
pled states’ calculated values. The important point is that the fundamental structure of the
algorithms and the fundamental principles (Fixed-Point and Bellman Operators) are still
the same when we generalize from these tabular algorithms to function approximation-
based algorithms. In Chapter 4, we cover generalizations of these Dynamic Programming
algorithms from tabular methods to function approximation methods. We call these algo-
rithms Approximate Dynamic Programming.
We finish this chapter by referring you to the various excellent papers and books by Dim-
itri Bertsekas - (Dimitri P. Bertsekas 1981), (Dimitri P. Bertsekas 1983), (Dimitri P. Bert-
sekas 2005), (Dimitri P. Bertsekas 2012), (D. P. Bertsekas and Tsitsiklis 1996) - for a com-
prehensive treatment of the variants of DP, including Asynchronous DP, Finite-Horizon
DP and Approximate DP.
3.16. Summary of Key Learnings from this Chapter
Before we end this chapter, we’d like to highlight the three highly important concepts we
learnt in this chapter:
4. Function Approximation and Approximate Dynamic Programming
When the state space of an MDP is very large, the Dynamic Programming algorithms we developed run into two problems:
1. A “tabular” representation of the MDP (or of the Value Function) won't fit within storage limits.
2. Sweeping through all states and their transition probabilities would be time-prohibitive (or simply impossible, in the case of infinite state spaces).
Hence, when the state space is very large, we need to resort to approximation of the
Value Function. The Dynamic Programming algorithms would need to be suitably mod-
ified to their Approximate Dynamic Programming (abbreviated as ADP) versions. The
good news is that it’s not hard to modify each of the (tabular) Dynamic Programming al-
gorithms such that instead of sweeping through all the states in each iteration, we simply
sample an appropriate subset of the states, calculate the values for those states (with the
same Bellman Operator calculations as for the case of tabular), and then create/update a
function approximation (for the Value Function) using the sampled states’ calculated val-
ues. Furthermore, if the set of transitions from a given state is large (or infinite), instead
of using the explicit probabilities of those transitions, we can sample from the transitions
probability distribution. The fundamental structure of the algorithms and the fundamen-
tal principles (Fixed-Point and Bellman Operators) would still be the same.
So, in this chapter, we do a quick review of function approximation, write some code
for a couple of standard function approximation methods, and then utilize these func-
tion approximation methods to develop Approximate Dynamic Programming algorithms
(in particular, Approximate Policy Evaluation, Approximate Value Iteration and Approx-
imate Backward Induction). Since you are reading this book, it’s highly likely that you are
already familiar with the simple and standard function approximation methods such as
linear function approximation and function approximation using supervised learning with neural networks. So we shall go through the background on linear function approximation and supervised learning with neural networks in a quick and terse manner, with the goal of
developing some code for these methods that we can use not just for the ADP algorithms
for this chapter, but also for RL algorithms later in the book. Note also that apart from
approximation of State-Value Functions N → R and Action-Value Functions N × A → R,
these function approximation methods can also be used for approximation of Stochastic
Policies N × A → [0, 1] in Policy-based RL algorithms.
4.1. Function Approximation
In this section, we describe function approximation in a fairly generic setting (not specific
to approximation of Value Functions or Policies). We denote the predictor variable as x,
belonging to an arbitrary domain denoted X and the response variable as y ∈ R. We
treat x and y as unknown random variables and our goal is to estimate the probability
distribution function f of the conditional random variable y|x from data provided in the
form of a sequence of (x, y) pairs. We shall consider parameterized functions f with the
parameters denoted as w. The exact data type of w will depend on the specific form of
function approximation. We denote the estimated probability of y conditional on x as
f (x; w)(y). Assume we are given data in the form of a sequence of n (x, y) pairs, as follows:
[(xi , yi )|1 ≤ i ≤ n]
The notion of estimating the conditional probability P[y|x] is formalized by solving for
w = w∗ such that:
w* = argmax_w { Π_{i=1}^n f(x_i; w)(y_i) } = argmax_w { Σ_{i=1}^n log f(x_i; w)(y_i) }
Denoting the model corresponding to the solved parameters w* as M, the prediction E_M[y|x] = E_{f(x;w*)}[y] is the expected value of y conditional on x under this model. We refer to E_M[y|x] as the function approximation's prediction for a given predictor variable x.
For the purposes of Approximate Dynamic Programming and Reinforcement Learning,
the function approximation's prediction E_M[y|x] provides an estimate of the Value Function
for any state (x takes the role of the State, and y takes the role of the Return following that
State). In the case of function approximation for (stochastic) policies, x takes the role of
the State, y takes the role of the Action for that policy, and f (x; w) provides the probabil-
ity distribution of actions for state x (corresponding to that policy). It’s also worthwhile
pointing out that the broader theory of function approximations covers the case of multi-
dimensional y (where y is a real-valued vector, rather than scalar) - this allows us to solve
classification problems as well as regression problems. However, for ease of exposition
and for sufficient coverage of function approximation applications in this book, we only
cover the case of scalar y.
Now let us write some code that captures this framework. We write an abstract base class
FunctionApprox type-parameterized by X (to permit arbitrary data types X ), representing
f (x; w), with the following 3 key methods, each of which will work with inputs of generic
Iterable type (Iterable is any data type that we can iterate over, such as Sequence types
or Iterator type):
1. @abstractmethod solve: takes as input an Iterable of (x, y) pairs and solves for the optimal internal parameters w* that minimize the cross-entropy between the empirical probability distribution of the input data of (x, y) pairs and the model prob-
ability distribution f (x; w). Some implementations of solve are iterative numerical
methods and would require an additional input of error_tolerance that specifies
the required precision of the best-fit parameters w∗ . When an implementation of
solve is an analytical solution not requiring an error tolerance, we specify the input
error_tolerance as None. The output of solve is the FunctionApprox f (x; w ∗ ) (i.e.,
corresponding to the solved parameters w∗ ).
2. update: takes as input an Iterable of (x, y) pairs and updates the parameters w defin-
ing f (x; w). The purpose of update is to perform an incremental (iterative) improve-
ment to the parameters w, given the input data of (x, y) pairs in the current iteration.
The output of update is the FunctionApprox corresponding to the updated parame-
ters. Note that we should be able to solve based on an appropriate series of incre-
mental updates (up to a specified error_tolerance).
3. @abstractmethod evaluate: takes as input an Iterable of x values and calculates
E_M[y|x] = E_{f(x;w)}[y] for each of the input x values, and outputs these expected val-
ues in the form of a numpy.ndarray.
• @abstractmethod update_with_gradient: takes as input a Gradient and updates the
internal parameters using the gradient values (e.g., gradient descent update to the
parameters), returning the updated FunctionApprox.
The update method is written with ∂Obj(x_i, y_i)/∂Out(x_i) defined as follows, for each training data point (x_i, y_i):

∂Obj(x_i, y_i)/∂Out(x_i) = E_M[y|x_i] − y_i
It turns out that for each concrete function approximation that we'd want to implement, if the Objective Obj(x_i, y_i) is the cross-entropy loss function, we can identify a model-computed value Out(x_i) (either the output of the model or an intermediate computation of the model) such that ∂Obj(x_i, y_i)/∂Out(x_i) is equal to the prediction error E_M[y|x_i] − y_i (for each training data point (x_i, y_i)), and we can come up with a numerical algorithm to compute ∇_w Out(x_i), so that by chain-rule, we have the required gradient:

∇_w Obj(x_i, y_i) = ∂Obj(x_i, y_i)/∂Out(x_i) · ∇_w Out(x_i) = (E_M[y|x_i] − y_i) · ∇_w Out(x_i)
The update method implements this chain-rule calculation by setting obj_deriv_out_fun to be the prediction error E_M[y|x_i] − y_i, and by delegating the calculation of ∇_w Out(x_i) to the concrete implementation of the abstract method objective_gradient.
Note that the Gradient class contains a single attribute of type FunctionApprox so that a
Gradient object can represent the gradient values in the form of the internal parameters of
the FunctionApprox attribute (since each gradient value is simply a partial derivative with
respect to an internal parameter).
We use the TypeVar F to refer to a concrete class that would implement the abstract in-
terface of FunctionApprox.
from __future__ import annotations
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Callable, Generic, Iterable, Iterator, Optional, \
    Sequence, Tuple, TypeVar
import numpy as np

X = TypeVar('X')
F = TypeVar('F', bound='FunctionApprox')

class FunctionApprox(ABC, Generic[X]):

    @abstractmethod
    def objective_gradient(
        self: F,
        xy_vals_seq: Iterable[Tuple[X, float]],
        obj_deriv_out_fun: Callable[[Sequence[X], Sequence[float]], np.ndarray]
    ) -> Gradient[F]:
        pass

    @abstractmethod
    def evaluate(self, x_values_seq: Iterable[X]) -> np.ndarray:
        pass

    @abstractmethod
    def update_with_gradient(
        self: F,
        gradient: Gradient[F]
    ) -> F:
        pass

    def update(
        self: F,
        xy_vals_seq: Iterable[Tuple[X, float]]
    ) -> F:
        # the objective's derivative with respect to the model output is
        # the prediction error E_M[y|x_i] - y_i
        def deriv_func(x: Sequence[X], y: Sequence[float]) -> np.ndarray:
            return self.evaluate(x) - np.array(y)

        return self.update_with_gradient(
            self.objective_gradient(xy_vals_seq, deriv_func)
        )

    @abstractmethod
    def solve(
        self: F,
        xy_vals_seq: Iterable[Tuple[X, float]],
        error_tolerance: Optional[float] = None
    ) -> F:
        pass

@dataclass(frozen=True)
class Gradient(Generic[F]):
    function_approx: F
When concrete classes implementing FunctionApprox write the solve method in terms
of the update method, they will need to check if a newly updated FunctionApprox is “close
enough” to the previous FunctionApprox. So each of them will need to implement their
own version of “Are two FunctionApprox instances within a certain error_tolerance of each
other?” Hence, we need the following abstract method within:
@abstractmethod
def within(self: F, other: F, tolerance: float) -> bool:
    pass
Any concrete class that implements this abstract class FunctionApprox will need to im-
plement these five abstract methods of FunctionApprox, based on the specific assumptions
that the concrete class makes for f .
Next, we write some methods useful for classes that inherit from FunctionApprox. Firstly,
we write a method called iterate_updates that takes as input a stream (Iterator) of Iterable
of (x, y) pairs, and performs a series of incremental updates to the parameters w (each us-
ing the update method), with each update done for each Iterable of (x, y) pairs in the
input stream xy_seq: Iterator[Iterable[Tuple[X, float]]]. iterate_updates returns an
Iterator of FunctionApprox representing the successively updated FunctionApprox instances
as a consequence of the repeated invocations to update. Note the use of the rl.iterate.accumulate
function (a wrapped version of itertools.accumulate) that calculates accumulated results
(including intermediate results) on an Iterable, based on a provided function to govern
the accumulation. In the code below, the Iterable is the input stream xy_seq_stream and
the function governing the accumulation is the update method of FunctionApprox.
import rl.iterate as iterate

def iterate_updates(
    self: F,
    xy_seq_stream: Iterator[Iterable[Tuple[X, float]]]
) -> Iterator[F]:
    return iterate.accumulate(
        xy_seq_stream,
        lambda fa, xy: fa.update(xy),
        initial=self
    )
We also write a method rmse that computes the Root-Mean-Squared-Error of the predictions self.evaluate(x_seq) relative to the response values y_seq, for a given Iterable of (x, y) pairs:

def rmse(
    self,
    xy_vals_seq: Iterable[Tuple[X, float]]
) -> float:
    x_seq, y_seq = zip(*xy_vals_seq)
    errors: np.ndarray = self.evaluate(x_seq) - np.array(y_seq)
    return np.sqrt(np.mean(errors * errors))
Finally, we write a method argmax that takes as input an Iterable of x values and returns
the x value that maximizes E_{f(x;w)}[y].
The above code for FunctionApprox and Gradient is in the file rl/function_approx.py.
rl/function_approx.py also contains the convenience methods __add__ (to add two FunctionApprox),
__mul__ (to multiply a FunctionApprox with a real-valued scalar), and __call__ (to treat
a FunctionApprox object syntactically as a function taking an x: X as input, essentially
a shorthand for evaluate on a single x value). __add__ and __mul__ are meant to per-
form element-wise addition and scalar-multiplication on the internal parameters w of the
Function Approximation (see Appendix F on viewing Function Approximations as Vec-
tor Spaces). Likewise, it contains the methods __add__ and __mul__ for the Gradient class
that simply delegates to the __add__ and __mul__ methods of the FunctionApprox within
Gradient, and it also contains the method zero that returns a Gradient which is uniformly
zero for each of the parameter values.
Now we are ready to cover a concrete but simple function approximation - the case of
linear function approximation.
For linear function approximation, we assume feature functions

ϕ_j : X → ℝ for each j = 1, 2, ..., m

with ϕ(x) ∈ ℝ^m denoting the feature vector (ϕ_1(x), ϕ_2(x), ..., ϕ_m(x)) for each x ∈ X, and the parameters w given by a weights vector w = (w_1, w_2, ..., w_m) ∈ ℝ^m. The prediction for a given x is

E_M[y|x] = Σ_{j=1}^m ϕ_j(x) · w_j = ϕ(x)^T · w

and the conditional probability distribution of y given x is assumed to be Gaussian with this mean and constant variance σ²:

ℙ[y|x] = f(x; w)(y) = (1 / √(2πσ²)) · e^(−(y − ϕ(x)^T · w)² / (2σ²))
So, the cross-entropy loss function (ignoring constant terms associated with σ²) for a given set of data points [(x_i, y_i) | 1 ≤ i ≤ n] is defined as:

L(w) = (1/(2n)) · Σ_{i=1}^n (ϕ(x_i)^T · w − y_i)²
Note that this loss function is (up to the constant factor 1/2) the mean-squared-error of the linear (in w) predictions ϕ(x_i)^T · w relative to the response values y_i associated with the predictor values x_i, over all 1 ≤ i ≤ n.
If we include L2 regularization (with λ as the regularization coefficient), then the reg-
ularized loss function is:
L(w) = (1/(2n)) · Σ_{i=1}^n (ϕ(x_i)^T · w − y_i)² + (1/2) · λ · |w|²

The gradient of this regularized loss function with respect to w is:

∇_w L(w) = (1/n) · (Σ_{i=1}^n ϕ(x_i) · (ϕ(x_i)^T · w − y_i)) + λ · w
We had said previously that for each concrete function approximation that we'd want to implement, if the Objective Obj(x_i, y_i) is the cross-entropy loss function, we can identify a model-computed value Out(x_i) (either the output of the model or an intermediate computation of the model) such that ∂Obj(x_i, y_i)/∂Out(x_i) is equal to the prediction error E_M[y|x_i] − y_i (for each training data point (x_i, y_i)), and we can come up with a numerical algorithm to compute ∇_w Out(x_i), so that by chain-rule, we have the required gradient ∇_w Obj(x_i, y_i) (without regularization). In the case of this linear function approximation, the model-computed value Out(x_i) is simply the model prediction for predictor variable x_i, i.e., Out(x_i) = ϕ(x_i)^T · w = E_M[y|x_i]. Therefore,

∂Obj(x_i, y_i)/∂Out(x_i) = ϕ(x_i)^T · w − y_i = E_M[y|x_i] − y_i
If the data used in iteration t of an incremental (gradient-based) update is [(x_{t,i}, y_{t,i}) | 1 ≤ i ≤ n_t], the gradient estimate for that iteration is:

G_{(x_t, y_t)}(w_t) = (1/n_t) · Σ_{i=1}^{n_t} ϕ(x_{t,i}) · (ϕ(x_{t,i})^T · w_t − y_{t,i}) + λ · w_t

which can be interpreted as the mean (over the data in iteration t) of the feature vectors ϕ(x_{t,i}) weighted by the (scalar) linear prediction errors ϕ(x_{t,i})^T · w_t − y_{t,i} (plus the regularization term λ · w_t).
Then, the update to the weights vector w is performed by gradient descent using this gradient estimate; specifically, we use the ADAM variant of gradient descent, implemented in the Weights class shown below:
SMALL_NUM = 1e-6

from dataclasses import dataclass, replace
from typing import Optional
import numpy as np

@dataclass(frozen=True)
class AdamGradient:
    learning_rate: float
    decay1: float
    decay2: float

    @staticmethod
    def default_settings() -> AdamGradient:
        return AdamGradient(
            learning_rate=0.001,
            decay1=0.9,
            decay2=0.999
        )

@dataclass(frozen=True)
class Weights:
    adam_gradient: AdamGradient
    time: int
    weights: np.ndarray
    adam_cache1: np.ndarray
    adam_cache2: np.ndarray

    @staticmethod
    def create(
        weights: np.ndarray,
        adam_gradient: AdamGradient = AdamGradient.default_settings(),
        adam_cache1: Optional[np.ndarray] = None,
        adam_cache2: Optional[np.ndarray] = None
    ) -> Weights:
        # weights comes first so that the parameters with defaults follow it
        return Weights(
            adam_gradient=adam_gradient,
            time=0,
            weights=weights,
            adam_cache1=np.zeros_like(
                weights
            ) if adam_cache1 is None else adam_cache1,
            adam_cache2=np.zeros_like(
                weights
            ) if adam_cache2 is None else adam_cache2
        )

    def update(self, gradient: np.ndarray) -> Weights:
        time: int = self.time + 1
        # exponentially-decayed first and second moments of the gradient
        new_adam_cache1: np.ndarray = self.adam_gradient.decay1 * \
            self.adam_cache1 + (1 - self.adam_gradient.decay1) * gradient
        new_adam_cache2: np.ndarray = self.adam_gradient.decay2 * \
            self.adam_cache2 + (1 - self.adam_gradient.decay2) * gradient ** 2
        # bias-corrected moment estimates
        corrected_m: np.ndarray = new_adam_cache1 / \
            (1 - self.adam_gradient.decay1 ** time)
        corrected_v: np.ndarray = new_adam_cache2 / \
            (1 - self.adam_gradient.decay2 ** time)
        new_weights: np.ndarray = self.weights - \
            self.adam_gradient.learning_rate * corrected_m / \
            (np.sqrt(corrected_v) + SMALL_NUM)
        return replace(
            self,
            time=time,
            weights=new_weights,
            adam_cache1=new_adam_cache1,
            adam_cache2=new_adam_cache2,
        )

    def within(self, other: Weights, tolerance: float) -> bool:
        return np.all(np.abs(self.weights - other.weights) <= tolerance).item()
With this Weights class, we are ready to write the dataclass LinearFunctionApprox for
linear function approximation, inheriting from the abstract base class FunctionApprox. It
has attributes feature_functions that represents ϕ_j : X → ℝ for all j = 1, 2, ..., m,
regularization_coeff that represents the regularization coefficient λ, weights which is an
instance of the Weights class we wrote above, and direct_solve (which we will explain
shortly). The static method create serves as a factory method to create a new instance of
LinearFunctionApprox. The method get_feature_values takes as input an x_values_seq:
Iterable[X] (representing a sequence or stream of x ∈ X ), and produces as output the
corresponding feature vectors ϕ(x) ∈ Rm for each x in the input. The feature vectors are
output in the form of a 2-D numpy array, with each feature vector ϕ(x) (for each x in the
input sequence) appearing as a row in the output 2-D numpy array (the number of rows
in this numpy array is the length of the input x_values_seq and the number of columns is
the number of feature functions). Note that often we want to include a bias term in our
linear function approximation, in which case we need to prepend the sequence of feature
functions we provide as input with an artificial feature function lambda _: 1. to represent
the constant feature with value 1. This will ensure we have a bias weight in addition to
each of the weights that serve as coefficients to the (non-artificial) feature functions.
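Here is a sketch of the LinearFunctionApprox declaration together with the create, get_feature_values and objective_gradient methods described above - a minimal sketch consistent with that description (the full implementation is in rl/function_approx.py and its exact details may differ):

@dataclass(frozen=True)
class LinearFunctionApprox(FunctionApprox[X]):

    feature_functions: Sequence[Callable[[X], float]]
    regularization_coeff: float
    weights: Weights
    direct_solve: bool

    @staticmethod
    def create(
        feature_functions: Sequence[Callable[[X], float]],
        adam_gradient: AdamGradient = AdamGradient.default_settings(),
        regularization_coeff: float = 0.,
        weights: Optional[Weights] = None,
        direct_solve: bool = True
    ) -> LinearFunctionApprox[X]:
        return LinearFunctionApprox(
            feature_functions=feature_functions,
            regularization_coeff=regularization_coeff,
            weights=Weights.create(
                weights=np.zeros(len(feature_functions)),
                adam_gradient=adam_gradient
            ) if weights is None else weights,
            direct_solve=direct_solve
        )

    def get_feature_values(self, x_values_seq: Iterable[X]) -> np.ndarray:
        # one row per input x, one column per feature function
        return np.array(
            [[f(x) for f in self.feature_functions] for x in x_values_seq]
        )

    def objective_gradient(
        self,
        xy_vals_seq: Iterable[Tuple[X, float]],
        obj_deriv_out_fun: Callable[[Sequence[X], Sequence[float]], np.ndarray]
    ) -> Gradient[LinearFunctionApprox[X]]:
        x_vals, y_vals = zip(*xy_vals_seq)
        # the prediction errors E_M[y|x_i] - y_i, supplied by the update method
        obj_deriv_out: np.ndarray = obj_deriv_out_fun(x_vals, y_vals)
        features: np.ndarray = self.get_feature_values(x_vals)
        # mean of feature vectors weighted by prediction errors, plus regularization
        gradient: np.ndarray = \
            features.T.dot(obj_deriv_out) / len(obj_deriv_out) \
            + self.regularization_coeff * self.weights.weights
        return Gradient(replace(
            self,
            weights=replace(self.weights, weights=gradient)
        ))

    # (evaluate, update_with_gradient, within and solve are shown below)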
The method evaluate (an abstract method in FunctionApprox) calculates the prediction E_M[y|x] for each input x as: ϕ(x)^T · w = Σ_{j=1}^m ϕ_j(x) · w_j. The method objective_gradient (from FunctionApprox) performs the calculation G_{(x_t, y_t)}(w_t) shown above: the mean of the feature vectors ϕ(x_{t,i}) weighted by the (scalar) linear prediction errors ϕ(x_{t,i})^T · w_t − y_{t,i}
(plus regularization term λ · wt ). The variable obj_deriv_out takes the role of the linear
prediction errors, when objective_gradient is invoked by the update method through the
method update_with_gradient. The method update_with_gradient (from FunctionApprox)
updates the weights using the calculated gradient along with the ADAM cache updates
(invoking the update method of the Weights class to ensure there are no in-place updates),
and returns a new LinearFunctionApprox object containing the updated weights.
def evaluate(self, x_values_seq: Iterable[X]) -> np.ndarray:
    return np.dot(
        self.get_feature_values(x_values_seq),
        self.weights.weights
    )

def update_with_gradient(
    self,
    gradient: Gradient[LinearFunctionApprox[X]]
) -> LinearFunctionApprox[X]:
    return replace(
        self,
        weights=self.weights.update(
            gradient.function_approx.weights.weights
        )
    )
We also require the within method, that simply delegates to the within method of the
Weights class.
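A minimal sketch of that within method (the isinstance guard is an assumption):

def within(self, other: FunctionApprox[X], tolerance: float) -> bool:
    if isinstance(other, LinearFunctionApprox):
        return self.weights.within(other.weights, tolerance)
    return False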
The only method that remains to be written now is the solve method. Note that for linear
function approximation, we can directly solve for w∗ if the number of feature functions m
is not too large. If the entire provided data is [(xi , yi )|1 ≤ i ≤ n], then the gradient estimate
based on this data can be set to 0 to solve for w∗ , i.e.,
(1/n) · (Σ_{i=1}^n ϕ(x_i) · (ϕ(x_i)^T · w* − y_i)) + λ · w* = 0

We denote Φ as the n rows × m columns matrix defined as Φ_{i,j} = ϕ_j(x_i) and the column vector Y ∈ ℝ^n defined as Y_i = y_i. Then we can write the above equation as:

(1/n) · Φ^T · (Φ · w* − Y) + λ · w* = 0

⇒ (Φ^T · Φ + n·λ · I_m) · w* = Φ^T · Y

⇒ w* = (Φ^T · Φ + n·λ · I_m)^{-1} · Φ^T · Y
where Im is the m × m identity matrix. Note that this direct linear-algebraic solution for
solving a square linear system of equations of size m is computationally feasible only if m
is not too large.
On the other hand, if the number of feature functions m is large, then we solve for w∗
by repeatedly calling update. The attribute direct_solve: bool in LinearFunctionApprox
specifies whether to perform a direct solution (linear algebra calculations shown above)
or to perform a sequence of iterative (incremental) updates to w using gradient descent.
The code below for the method solve does exactly this:
import itertools
import rl.iterate as iterate
def solve(
self,
xy_vals_seq: Iterable[Tuple[X, float]],
error_tolerance: Optional[float] = None
) -> LinearFunctionApprox[X]:
if self.direct_solve:
x_vals, y_vals = zip(*xy_vals_seq)
feature_vals: np.ndarray = self.get_feature_values(x_vals)
feature_vals_T: np.ndarray = feature_vals.T
left: np.ndarray = np.dot(feature_vals_T, feature_vals) \
+ feature_vals.shape[0] * self.regularization_coeff * \
np.eye(len(self.weights.weights))
right: np.ndarray = np.dot(feature_vals_T, y_vals)
ret = replace(
self,
weights=Weights.create(
adam_gradient=self.weights.adam_gradient,
weights=np.linalg.solve(left, right)
)
)
else:
tol: float = 1e-6 if error_tolerance is None else error_tolerance
def done(
a: LinearFunctionApprox[X],
b: LinearFunctionApprox[X],
tol: float = tol
) -> bool:
return a.within(b, tol)
ret = iterate.converged(
self.iterate_updates(itertools.repeat(list(xy_vals_seq))),
done=done
)
return ret
We denote the parameters for layer l as the matrix wl with dim(ol ) rows and dim(il )
columns. Note that the number of neurons in layer l is equal to dim(ol ). Since we are
restricting ourselves to scalar y, dim(oL ) = 1 and so, the number of neurons in the output
layer is 1.
The neurons in layer l define a linear transformation from the layer input i_l to a variable we
denote as s_l. Therefore,

s_l = w_l · i_l for all l = 0, 1, . . . , L

and the layer output is o_l = g_l(s_l), where g_l(·) is the activation function of layer l.
Equations (4.1), (4.2) and (4.3) together define the calculation of the neural network
prediction oL (associated with the response variable y), given the predictor variable x.
This calculation is known as forward-propagation and will define the evaluate method of
the deep neural network function approximation class we shall soon write.
Our goal is to derive an expression for the cross-entropy loss gradient ∇wl L for all
l = 0, 1, . . . , L. For ease of understanding, our following exposition will be expressed
in terms of the cross-entropy loss function for a single predictor variable input x ∈ X and
it’s associated single response variable y ∈ R (the code will generalize appropriately to
the cross-entropy loss function for a given set of data points [xi , yi |1 ≤ i ≤ n]).
We can reduce this problem of calculating the cross-entropy loss gradient to the problem
of calculating P_l = ∇_{s_l} L for all l = 0, 1, . . . , L, as revealed by the following chain-rule
calculation:

∇_{w_l} L = (∇_{s_l} L) · i_l^T = P_l · i_l^T    (4.4)
Note that Pl · iTl represents the outer-product of the dim(ol )-size vector Pl and the
dim(il )-size vector il giving a matrix of size dim(ol ) × dim(il ).
If we include L2 regularization (with λ_l as the regularization coefficient for layer l), then:

∇_{w_l} L = P_l · i_l^T + λ_l · w_l
Notation : Description
i_l : Vector Input to layer l, for all l = 0, 1, . . . , L
o_l : Vector Output of layer l, for all l = 0, 1, . . . , L
ϕ(x) : Feature Vector for predictor variable x
y : Response variable associated with predictor variable x
w_l : Matrix of Parameters for layer l, for all l = 0, 1, . . . , L
g_l(·) : Activation function for layer l, for l = 0, 1, . . . , L
s_l : s_l = w_l · i_l, o_l = g_l(s_l), for all l = 0, 1, . . . , L
P_l : P_l = ∇_{s_l} L, for all l = 0, 1, . . . , L
λ_l : Regularization coefficient for layer l, for all l = 0, 1, . . . , L
Now that we have reduced the loss gradient calculation to calculation of Pl , we spend
the rest of this section deriving the analytical calculation of Pl . The following theorem tells
us that Pl has a recursive formulation that forms the core of the back-propagation algorithm
for a feed-forward fully-connected deep neural network.
Theorem 4.3.1. For all l = 0, 1, . . . , L − 1,

P_l = (w_{l+1}^T · P_{l+1}) ◦ g_l'(s_l)

where the symbol ◦ represents the Hadamard Product, i.e., point-wise multiplication of two vectors
of the same dimension.
Proof. We start by applying the chain rule on Pl .
Pl = ∇sl L = (∇sl sl+1 )T · ∇sl+1 L = (∇sl sl+1 )T · Pl+1 (4.5)
Next, note that:
sl+1 = wl+1 · gl (sl )
Therefore,
∇sl sl+1 = wl+1 · Diagonal(gl′ (sl ))
where the notation Diagonal(v) for an m-dimensional vector v represents an m × m
diagonal matrix whose elements are the same (also in same order) as the elements of v.
Substituting this in Equation (4.5) yields:
P_l = (w_{l+1} · Diagonal(g_l'(s_l)))^T · P_{l+1} = Diagonal(g_l'(s_l)) · w_{l+1}^T · P_{l+1} = (w_{l+1}^T · P_{l+1}) ◦ g_l'(s_l)
Now all we need to do is to calculate PL = ∇sL L so that we can run this recursive
formulation for Pl , estimate the loss gradient ∇wl L for any given data (using Equation
(4.4)), and perform gradient descent to arrive at wl∗ for all l = 0, 1, . . . L.
Firstly, note that sL , oL , PL are all scalars, so let’s just write them as sL , oL , PL respec-
tively (without the bold-facing) to make it explicit in the derivation that they are scalars.
Specifically, the gradient ∇_{s_L} L = ∂L/∂s_L.
To calculate ∂L/∂s_L, we need to assume a functional form for P[y|s_L]. We work with a fairly
generic exponential functional form for the probability distribution function:

p(y|θ, τ) = h(y, τ) · e^{(θ·y − A(θ)) / d(τ)}
where θ should be thought of as the “center” parameter (related to the mean) of the
probability distribution and τ should be thought of as the “dispersion” parameter (re-
lated to the variance) of the distribution. h(·, ·), A(·), d(·) are general functions whose spe-
cializations define the family of distributions that can be modeled with this fairly generic
exponential functional form (note that this structure is adopted from the framework of
Generalized Linear Models).
For our neural network function approximation, we assume that τ is a constant, and we
set θ to be sL . So,
P[y|s_L] = p(y|s_L, τ) = h(y, τ) · e^{(s_L·y − A(s_L)) / d(τ)}
Lemma 4.3.2.

E_p[y|s_L] = A'(s_L)

Proof. Since

∫_{−∞}^{∞} p(y|s_L, τ) · dy = 1,

the partial derivative of the left-hand-side of the above equation with respect to s_L is
zero. In other words,

∂{∫_{−∞}^{∞} p(y|s_L, τ) · dy} / ∂s_L = 0

Hence,

∂{∫_{−∞}^{∞} h(y, τ) · e^{(s_L·y − A(s_L))/d(τ)} · dy} / ∂s_L = 0

Taking the partial derivative inside the integral, we get:

∫_{−∞}^{∞} h(y, τ) · e^{(s_L·y − A(s_L))/d(τ)} · ((y − A'(s_L)) / d(τ)) · dy = 0

⇒ ∫_{−∞}^{∞} p(y|s_L, τ) · (y − A'(s_L)) · dy = 0

⇒ E_p[y|s_L] = A'(s_L)
The above equation is important since it tells us that the output layer activation function
g_L(·) must be set to be the derivative of the A(·) function. In the theory of generalized linear
models, the derivative of the A(·) function serves as the canonical link function for a given
probability distribution of the response variable conditional on the predictor variable. In
other words,

o_L = g_L(s_L) = E_p[y|s_L] = A'(s_L)    (4.6)

Now we are equipped to derive a simple expression for P_L.
Theorem 4.3.3.

P_L = ∂L/∂s_L = (o_L − y) / d(τ)
Proof. The Cross-Entropy Loss (Negative Log-Likelihood) for a single training data point
(x, y) is given by:
L = −log(h(y, τ)) + (A(s_L) − s_L · y) / d(τ)
Therefore,
P_L = ∂L/∂s_L = (A'(s_L) − y) / d(τ)
But from Equation (4.6), we know that A′ (sL ) = oL . Therefore,
P_L = ∂L/∂s_L = (o_L − y) / d(τ)
• Normal distribution N(µ, σ²) for y:

s_L = µ, τ = σ, h(y, τ) = e^{−y²/(2τ²)} / √(2πτ²), d(τ) = τ², A(s_L) = s_L²/2

⇒ o_L = g_L(s_L) = E[y|s_L] = A'(s_L) = s_L

so the output layer activation function g_L is the identity function.

• Bernoulli distribution for binary-valued y (for which A(s_L) = log(1 + e^{s_L})):

⇒ o_L = g_L(s_L) = E[y|s_L] = A'(s_L) = 1 / (1 + e^{−s_L})

Hence, the output layer activation function g_L is the logistic function. This generalizes to the softmax g_L when we generalize this framework to multivariate y, which in
turn enables us to classify inputs x into a finite set of categories represented by y as
one-hot-encodings.
• Poisson distribution for y parameterized by λ:

s_L = log λ, τ = 1, h(y, τ) = 1/y!, d(τ) = 1, A(s_L) = e^{s_L}

⇒ o_L = g_L(s_L) = E[y|s_L] = A'(s_L) = e^{s_L}

so the output layer activation function g_L is the exponential function.
Now we are ready to write a class for function approximation with the deep neural
network framework described above. We assume that the activation functions gl (·) are
identical for all l = 0, 1, . . . , L − 1 (known as the hidden layers activation function) and
the activation function gL (·) is known as the output layer activation function. Note that
often we want to include a bias term in the linear transformations of the layers. To include
a bias term in layer 0, just like in the case of LinearFunctionApprox, we prepend the sequence
of feature functions we want to provide as input with an artificial feature function lambda
_: 1. to represent the constant feature with value 1. This ensures we have a bias weight
in layer 0 in addition to each of the weights (in layer 0) that serve as coefficients to the
(non-artificial) feature functions. Moreover, we allow the specification of a bias boolean
variable to enable a bias term in each of the layers l = 1, 2, . . . , L.
Before we develop the code for forward-propagation and back-propagation, we write
a @dataclass to hold the configuration of a deep neural network (number of neurons in
the layers, the bias boolean variable, hidden layers activation function and output layer
activation function).
@dataclass(frozen=True)
class DNNSpec:
neurons: Sequence[int]
bias: bool
hidden_activation: Callable[[np.ndarray], np.ndarray]
hidden_activation_deriv: Callable[[np.ndarray], np.ndarray]
output_activation: Callable[[np.ndarray], np.ndarray]
output_activation_deriv: Callable[[np.ndarray], np.ndarray]
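For concreteness, here is a hypothetical DNNSpec instance (not from the text) for count-valued responses, using the exponential output activation that matches the canonical link A'(s_L) = e^{s_L} of the Poisson specialization discussed earlier. Note the convention (used in the code below) that activation derivatives are expressed as functions of the layer outputs:

import numpy as np

poisson_style_spec = DNNSpec(
    neurons=[4],                                        # one hidden layer with 4 neurons
    bias=True,
    hidden_activation=lambda arg: np.tanh(arg),
    hidden_activation_deriv=lambda res: 1. - res ** 2,  # tanh'(s) expressed via the output tanh(s)
    output_activation=lambda arg: np.exp(arg),
    output_activation_deriv=lambda res: res             # (e^s)' = e^s, i.e., equal to the output itself
)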
The method forward_propagation takes an input of the same type as the input of
get_feature_values (x_values_seq: Iterable[X]) and returns a list
with L + 2 numpy arrays. The last element of the returned list is a 1-D numpy array
representing the final output of the neural network: oL = EM [y|x] for each of the x values
in the input x_values_seq. The remaining L + 1 elements in the returned list are each 2-D
numpy arrays, consisting of il for all l = 0, 1, . . . L (for each of the x values provided as
input in x_values_seq).
The method evaluate (from FunctionApprox) returns the last element (oL = EM [y|x])
of the list returned by forward_propagation.
The method backward_propagation is the most important method of DNNApprox, calculat-
ing ∇wl Obj for all l = 0, 1, . . . , L, for some objective function Obj. We had said previously
that for each concrete function approximation that we’d want to implement, if the Objec-
tive Obj(xi , yi ) is the cross-entropy loss function, we can identify a model-computed value
Out(xi ) (either the output of the model or an intermediate computation of the model) such
that ∂Obj(x_i, y_i)/∂Out(x_i) is equal to the prediction error E_M[y|x_i] − y_i (for each training data point
(xi , yi )) and we can come up with a numerical algorithm to compute ∇w Out(xi ), so that
by chain-rule, we have the required gradient ∇w Obj(xi , yi ) (without regularization). In
the case of this DNN function approximation, the model-computed value Out(xi ) is sL .
Thus,

∂Obj(x_i, y_i)/∂Out(x_i) = ∂L/∂s_L = P_L = o_L − y_i = E_M[y|x_i] − y_i
backward_propagation takes two inputs: fwd_prop, the result of forward-propagation up to (but excluding) the final output (i.e., the arrays i_l for l = 0, 1, . . . , L), and obj_deriv_out, the derivative of the objective function with respect to the linear predictor of the final layer.
If we generalize the objective function from the cross-entropy loss function L to an arbi-
trary objective function Obj and define Pl to be ∇sl Obj (generalized from ∇sl L), then the
output of backward_propagation would be equal to Pl · iTl (i.e., without the regularization
term) for all l = 0, 1, . . . L.
The first step in backward_propagation is to set PL (variable deriv in the code) equal to
obj_deriv_out (which in the case of cross-entropy loss as Obj and sL as Out, reduces to
the prediction error EM [y|xi ] − yi ). As we walk back through the layers of the DNN, the
variable deriv represents Pl = ∇sl Obj, evaluated for each of the values made available by
fwd_prop (note that deriv is updated in each iteration of the loop reflecting Theorem 4.3.1:
P_l = (w_{l+1}^T · P_{l+1}) ◦ g_l'(s_l)). Note also that the returned list back_prop is populated with
the result of Equation (4.4): ∇_{w_l}Obj = P_l · i_l^T.
The method objective_gradient (from FunctionApprox) takes as input an Iterable of
(x, y) pairs and the ∂Obj/∂Out function, invokes the forward_propagation method (whose result is passed
as input to backward_propagation), then invokes backward_propagation, and finally adds on
the regularization term λ · w_l to the output of backward_propagation to return the gradient
∇_{w_l}Obj for all l = 0, 1, . . . , L.
The method update_with_gradient (from FunctionApprox) takes as input a gradient (e.g.,
∇_{w_l}Obj), updates the weights w_l for all l = 0, 1, . . . , L along with the ADAM cache updates
(invoking the update method of the Weights class to ensure there are no in-place
updates), and returns a new instance of DNNApprox that contains the updated weights.
Finally, the method solve (from FunctionApprox) utilizes the method iterate_updates
(inherited from FunctionApprox) along with the method within to perform a best-fit of the
weights that minimizes the cross-entropy loss function (basically, a series of incremental
updates based on gradient descent).
def forward_propagation(
    self,
    x_values_seq: Iterable[X]
) -> Sequence[np.ndarray]:
    inp: np.ndarray = self.get_feature_values(x_values_seq)
    ret: List[np.ndarray] = [inp]
    for w in self.weights[:-1]:
        out: np.ndarray = self.dnn_spec.hidden_activation(
            np.dot(inp, w.weights.T)
        )
        if self.dnn_spec.bias:
            inp = np.insert(out, 0, 1., axis=1)
        else:
            inp = out
        ret.append(inp)
    ret.append(
        self.dnn_spec.output_activation(
            np.dot(inp, self.weights[-1].weights.T)
        )[:, 0]
    )
    return ret
def evaluate(self, x_values_seq: Iterable[X]) -> np.ndarray:
return self.forward_propagation(x_values_seq)[-1]
def backward_propagation(
self,
fwd_prop: Sequence[np.ndarray],
obj_deriv_out: np.ndarray
) -> Sequence[np.ndarray]:
"""
:param fwd_prop represents the result of forward propagation (without
the final output), a sequence of L+1 2-D np.ndarrays of the DNN.
: param obj_deriv_out represents the derivative of the objective
function with respect to the linear predictor of the final layer.
:return: list (of length L+1) of |o_l| x |i_l| 2-D arrays,
i.e., same as the type of self.weights.weights
This function computes the gradient (with respect to weights) of
the objective where the output layer activation function
is the canonical link function of the conditional distribution of y|x
"""
deriv: np.ndarray = obj_deriv_out.reshape(1, -1)
back_prop: List[np.ndarray] = [np.dot(deriv, fwd_prop[-1]) /
deriv.shape[1]]
# L is the number of hidden layers, n is the number of points
# layer l deriv represents dObj/ds_l where s_l = i_l . weights_l
# (s_l is the result of applying layer l without the activation func)
for i in reversed(range(len(self.weights) - 1)):
# deriv_l is a 2-D array of dimension |o_l| x n
# The recursive formulation of deriv is as follows:
# deriv_{l-1} = (weights_l^T inner deriv_l) hadamard g'(s_{l-1}),
# which is ((|i_l| x |o_l|) inner (|o_l| x n)) hadamard
# (|i_l| x n), which is (|i_l| x n) = (|o_{l-1}| x n)
# Note: g’(s_{l-1}) is expressed as hidden layer activation
# derivative as a function of o_{l-1} (=i_l).
deriv = np.dot(self.weights[i + 1].weights.T, deriv) * \
self.dnn_spec.hidden_activation_deriv(fwd_prop[i + 1].T)
# If self.dnn_spec.bias is True, then |i_l| = |o_{l-1}| + 1, in which
# case the first row of the calculated deriv is removed to yield
# a 2-D array of dimension |o_{l-1}| x n.
if self.dnn_spec.bias:
deriv = deriv[1:]
# layer l gradient is deriv_l inner fwd_prop[l], which is
# of dimension (|o_l| x n) inner (n x |i_l|) = |o_l| x |i_l|
back_prop.append(np.dot(deriv, fwd_prop[i]) / deriv.shape[1])
return back_prop[::-1]
def objective_gradient(
self,
xy_vals_seq: Iterable[Tuple[X, float]],
obj_deriv_out_fun: Callable[[Sequence[X], Sequence[float]], np.ndarray]
) -> Gradient[DNNApprox[X]]:
x_vals, y_vals = zip(*xy_vals_seq)
obj_deriv_out: np.ndarray = obj_deriv_out_fun(x_vals, y_vals)
fwd_prop: Sequence[np.ndarray] = self.forward_propagation(x_vals)[:-1]
gradient: Sequence[np.ndarray] = \
[x + self.regularization_coeff * self.weights[i].weights
for i, x in enumerate(self.backward_propagation(
fwd_prop=fwd_prop,
obj_deriv_out=obj_deriv_out
))]
return Gradient(replace(
self,
weights=[replace(w, weights=g) for
w, g in zip(self.weights, gradient)]
))
def solve(
self,
xy_vals_seq: Iterable[Tuple[X, float]],
error_tolerance: Optional[float] = None
) -> DNNApprox[X]:
tol: float = 1e-6 if error_tolerance is None else error_tolerance
def done(
a: DNNApprox[X],
b: DNNApprox[X],
tol: float = tol
) -> bool:
return a.within(b, tol)
return iterate.converged(
self.iterate_updates(itertools.repeat(list(xy_vals_seq))),
done=done
)
def within(self, other: FunctionApprox[X], tolerance: float) -> bool:
if isinstance(other, DNNApprox):
return all(w1.within(w2, tolerance)
for w1, w2 in zip(self.weights, other.weights))
else:
return False
Next we wrap this in an Iterator that returns a certain number of (x, y) pairs upon each
request for data points.
def data_seq_generator(
data_generator: Iterator[Tuple[Triple, float]],
num_pts: int
) -> Iterator[DataSeq]:
while True:
pts: DataSeq = list(islice(data_generator, num_pts))
yield pts
Likewise, let’s write a function to create a DNNApprox with 1 hidden layer with 2 neu-
rons and a little bit of regularization since this deep neural network is somewhat over-
parameterized to fit the data generated from the linear data model with noise.
def get_dnn_model() -> DNNApprox[Triple]:
ffs = feature_functions()
ag = adam_gradient()
def relu(arg: np.ndarray) -> np.ndarray:
return np.vectorize(lambda x: x if x > 0. else 0.)(arg)
def relu_deriv(res: np.ndarray) -> np.ndarray:
return np.vectorize(lambda x: 1. if x > 0. else 0.)(res)
def identity(arg: np.ndarray) -> np.ndarray:
return arg
def identity_deriv(res: np.ndarray) -> np.ndarray:
return np.ones_like(res)
ds = DNNSpec(
neurons=[2],
bias=True,
hidden_activation=relu,
hidden_activation_deriv=relu_deriv,
output_activation=identity,
output_activation_deriv=identity_deriv
)
return DNNApprox.create(
feature_functions=ffs,
dnn_spec=ds,
adam_gradient=ag,
regularization_coeff=0.05
)
Now let’s write some code to do a direct_solve with the LinearFunctionApprox based
on the data from the data model we have set up.
Figure 4.1.: SGD Convergence
linear_model_rmse_seq: Sequence[float] = \
[lfa.rmse(test_data) for lfa in islice(
get_linear_model().iterate_updates(training_data_gen),
training_iterations
)]
dnn_model_rmse_seq: Sequence[float] = \
[dfa.rmse(test_data) for dfa in islice(
get_dnn_model().iterate_updates(training_data_gen),
training_iterations
)]
4.4. Tabular as a form of FunctionApprox
Now we consider a simple case where we have a fixed and finite set of x-values X =
{x1 , x2 , . . . , xn }, and any data set of (x, y) pairs made available to us needs to have its
x-values from within this finite set X . The prediction E[y|x] for each x ∈ X needs to be
calculated only from the y-values associated with this x within the data set of (x, y) pairs.
In other words, the y-values in the data associated with other x should not influence the
prediction for x. Since we’d like the prediction for x to be E[y|x], it would make sense for
the prediction for a given x to be some sort of average of all the y-values associated with
x within the data set of (x, y) pairs seen so far. This simple case is referred to as Tabular
because we can store all x ∈ X together with their corresponding predictions E[y|x] in a
finite data structure (loosely referred to as a “table”).
So the calculation for Tabular prediction of E[y|x] is particularly straightforward. What
is interesting though is the fact that Tabular prediction actually fits the interface of FunctionApprox
in terms of the following three methods that we have emphasized as the essence of FunctionApprox:
• the solve method, that would simply take the average of all the y-values associated
with each x in the given data set, and store those averages in a dictionary data struc-
ture.
• the update method, that would update the current averages in the dictionary data
structure, based on the new data set of (x, y) pairs that is provided.
• the evaluate method, that would simply look up the dictionary data structure for
the y-value averages associated with each x-value provided as input.
This view of Tabular prediction as a special case of FunctionApprox also permits us to cast
the tabular algorithms of Dynamic Programming and Reinforcement Learning as special
cases of the function approximation versions of the algorithms (using the Tabular class we
develop below).
So now let us write the code for @dataclass Tabular as an implementation of the abstract
base class FunctionApprox. The attributes of @dataclass Tabular are:
• values_map which is a dictionary mapping each x-value to the average of the y-values
associated with x that have been seen so far in the data.
• counts_map which is a dictionary mapping each x-value to the count of y-values as-
sociated with x that have been seen so far in the data. We need to track the count of
y-values associated with each x because this enables us to update values_map appro-
priately upon seeing a new y-value associated with a given x.
• count_to_weight_func which defines a function from number of y-values seen so far
(associated with a given x) to the weight assigned to the most recent y. This enables
us to do a weighted average of the y-values seen so far, controlling the emphasis to
be placed on more recent y-values relative to previously seen y-values (associated
with a given x).
@dataclass(frozen=True)
class Tabular(FunctionApprox[X]):
    values_map: Mapping[X, float] = field(default_factory=lambda: {})
    counts_map: Mapping[X, int] = field(default_factory=lambda: {})
    count_to_weight_func: Callable[[int], float] = \
        field(default_factory=lambda: lambda n: 1.0 / n)
def objective_gradient(
self,
xy_vals_seq: Iterable[Tuple[X, float]],
obj_deriv_out_fun: Callable[[Sequence[X], Sequence[float]], np.ndarray]
) -> Gradient[Tabular[X]]:
x_vals, y_vals = zip(*xy_vals_seq)
obj_deriv_out: np.ndarray = obj_deriv_out_fun(x_vals, y_vals)
sums_map: Dict[X, float] = defaultdict(float)
counts_map: Dict[X, int] = defaultdict(int)
for x, o in zip(x_vals, obj_deriv_out):
sums_map[x] += o
counts_map[x] += 1
return Gradient(replace(
self,
values_map={x: sums_map[x] / counts_map[x] for x in sums_map},
counts_map=counts_map
))
def evaluate(self, x_values_seq: Iterable[X]) -> np.ndarray:
return np.array([self.values_map.get(x, 0.) for x in x_values_seq])
def update_with_gradient(
self,
gradient: Gradient[Tabular[X]]
) -> Tabular[X]:
values_map: Dict[X, float] = dict(self.values_map)
counts_map: Dict[X, int] = dict(self.counts_map)
for key in gradient.function_approx.values_map:
counts_map[key] = counts_map.get(key, 0) + \
gradient.function_approx.counts_map[key]
weight: float = self.count_to_weight_func(counts_map[key])
values_map[key] = values_map.get(key, 0.) - \
weight * gradient.function_approx.values_map[key]
return replace(
self,
values_map=values_map,
counts_map=counts_map
)
def solve(
self,
xy_vals_seq: Iterable[Tuple[X, float]],
error_tolerance: Optional[float] = None
) -> Tabular[X]:
values_map: Dict[X, float] = {}
counts_map: Dict[X, int] = {}
for x, y in xy_vals_seq:
counts_map[x] = counts_map.get(x, 0) + 1
weight: float = self.count_to_weight_func(counts_map[x])
values_map[x] = weight * y + (1 - weight) * values_map.get(x, 0.)
return replace(
self,
values_map=values_map,
counts_map=counts_map
)
def within(self, other: FunctionApprox[X], tolerance: float) -> bool:
if isinstance(other, Tabular):
return all(abs(self.values_map[s] - other.values_map.get(s, 0.))
<= tolerance for s in self.values_map)
return False
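As a quick sanity check, here is a hypothetical usage sketch of Tabular (assuming the attribute defaults sketched above), estimating E[y|x] from a handful of (x, y) pairs:

data = [('a', 1.0), ('a', 3.0), ('b', 10.0), ('a', 2.0)]
tab = Tabular().solve(data)
print(tab.values_map)            # roughly {'a': 2.0, 'b': 10.0}
print(tab.evaluate(['a', 'b']))  # array([ 2., 10.])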
Here’s a valuable insight: This Tabular setting is actually a special case of linear function
approximation by setting a feature function ϕi (·) for each xi as: ϕi (xi ) = 1 and ϕi (x) = 0
187
for each x ̸= xi (i.e., ϕi (·) is the indicator function for xi , and the Φ matrix is the identity
matrix), and the corresponding weights wi equal to the average of the y-values associated
with xi in the given data. This also means that the count_to_weights_func plays the role of
the learning rate function (as a function of the number of iterations in stochastic gradient
descent).
When we implement Approximate Dynamic Programming (ADP) algorithms with the
abstract class FunctionApprox (later in this chapter), using the Tabular class (for FunctionApprox)
enables us to specialize the ADP algorithm implementation to the Tabular DP algorithms
(that we covered in Chapter 3). Note that in the tabular DP algorithms, the finite set of
states takes the role of X and the Value Function for a given state x = s takes the role of
the “predicted” y-value associated with x. We also note that in the tabular DP algorithms,
in each iteration of sweeping through all the states, the Value Function for a state x = s
is set to the current y value (not the average of all y-values seen so far). The current y-
value is simply the right-hand-side of the Bellman Equation corresponding to the tabular
DP algorithm. Consequently, when using Tabular class for tabular DP, we’d need to set
count_to_weight_func to be the function lambda _: 1 (this is because a weight of 1 for the
current y-value sets values_map[x] equal to the current y-value).
Likewise, when we implement RL algorithms (using the abstract class FunctionApprox)
later in this book, using the Tabular class (for FunctionApprox) specializes the RL algo-
rithm implementation to Tabular RL. In Tabular RL, we average all the Returns seen so
far for a given state. If we choose to do a plain average (equal importance for all y-
values seen so far, associated with a given x), then in the Tabular class, we'd need to set
count_to_weight_func to be the function lambda n: 1. / n.
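To make the two settings concrete, here is a small illustrative sketch (attribute name as in the Tabular class above):

# Tabular DP: overwrite with the latest y-value (right-hand-side of the Bellman Equation)
tabular_dp = Tabular(count_to_weight_func=lambda _: 1.)
# Tabular RL: plain average of all y-values (Returns) seen so far for each x
tabular_rl = Tabular(count_to_weight_func=lambda n: 1. / n)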
We want to emphasize that although tabular algorithms are just a special case of al-
gorithms with function approximation, we give special coverage in this book to tabular
algorithms because they help us conceptualize the core concepts in a simple (tabular) set-
ting without the distraction of some of the details and complications in the apparatus of
function approximation.
Now we are ready to write algorithms for Approximate Dynamic Programming (ADP).
Before we go there, it pays to emphasize that we have described and implemented a fairly
generic framework for gradient-based estimation of function approximations, given arbi-
trary training data. It can be used for arbitrary objective functions and arbitrary functional
forms/neural networks (beyond the concrete classes we implemented). We encourage
you to explore implementing and using this function approximation code for other types
of objectives and other types of functional forms/neural networks.
This is typical in many real-world problems where the state space is either very large or is
continuous-valued, and the transitions could be too many or could be continuous-valued
transitions. So, here’s what we do to overcome these challenges:
• We specify a sampling probability distribution of non-terminal states, from which we sample the requisite number of non-terminal states in each iteration. The type of this sampling distribution is aliased as:
NTStateDistribution = Distribution[NonTerminal[S]]
• We sample pairs of (next state s′ , reward r) from a given non-terminal state s, and
calculate the expectation E[r + γ · V (s′ )] by averaging r + γ · V (s′ ) across the sam-
pled pairs. Note that the method expectation of a Distribution object performs a
sampled expectation. V (s′ ) is obtained from the function approximation instance
of FunctionApprox that is being updated in each iteration. The type of the function
approximation of the Value Function is aliased as follows (this type will be used not
just for Approximate Dynamic Programming algorithms, but also for Reinforcement
Learning Algorithms).
ValueFunctionApprox = FunctionApprox[NonTerminal[S]]
• The sampled list of non-terminal states s comprises our x-values and the associated
sampled expectations described above comprise our y-values. This list of (x, y) pairs
is used to update the approximation of the Value Function in each iteration (pro-
ducing a new instance of ValueFunctionApprox using its update method).
The entire code is shown below. The evaluate_mrp method produces an Iterator on
ValueFunctionApprox instances, and the code that calls evaluate_mrp can decide when/how
to terminate the iterations of Approximate Policy Evaluation.
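A minimal sketch of what evaluate_mrp might look like under the description above (the exact signature, parameter names and the iterate helper are assumptions consistent with the surrounding code, not a definitive listing):

from rl.iterate import iterate  # assumed helper: repeatedly applies a step function to a start value

def evaluate_mrp(
    mrp: MarkovRewardProcess[S],
    gamma: float,
    approx_0: ValueFunctionApprox[S],
    non_terminal_states_distribution: NTStateDistribution[S],
    num_state_samples: int
) -> Iterator[ValueFunctionApprox[S]]:
    def update(v: ValueFunctionApprox[S]) -> ValueFunctionApprox[S]:
        # sample non-terminal states to serve as the x-values for this iteration
        nt_states: Sequence[NonTerminal[S]] = \
            non_terminal_states_distribution.sample_n(num_state_samples)

        def return_(s_r: Tuple[State[S], float]) -> float:
            s1, r = s_r
            return r + gamma * extended_vf(v, s1)  # sampled Bellman target

        # update the function approximation with the (state, sampled expectation) pairs
        return v.update(
            [(s, mrp.transition_reward(s).expectation(return_))
             for s in nt_states]
        )

    return iterate(update, approx_0)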
Notice the function extended_vf used to evaluate the Value Function for the next state
transitioned to. However, the next state could be terminal or non-terminal, and the Value
Function is only defined for non-terminal states. extended_vf utilizes the method on_non_terminal
we had written in Chapter 1 when designing the State class - it evaluates to the default
value of 0 for a terminal state (and evaluates the given ValueFunctionApprox for a non-
terminal state).
extended_vf will be useful not just for Approximate Dynamic Programming algorithms,
but also for Reinforcement Learning algorithms.
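A minimal sketch of extended_vf consistent with this description (assuming the State.on_non_terminal helper from Chapter 1 and that FunctionApprox instances are callable on non-terminal states):

def extended_vf(vf: ValueFunctionApprox[S], s: State[S]) -> float:
    # Evaluate vf for a non-terminal state, defaulting to 0. for a terminal state
    return s.on_non_terminal(vf, 0.)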
We can of course make the same types of adaptations from Tabular to Approximate as we did in
the functions evaluate_mrp and value_iteration above.
In the backward_evaluate code below, the input argument mrp_f0_mu_triples is a list of
triples, with each triple corresponding to each non-terminal time step in the finite horizon.
Each triple consists of:
• An instance of MarkovRewardProcess - note that each time step has its own instance
of MarkovRewardProcess representation of transitions from non-terminal states s in a
time step t to the (state s′, reward r) pairs in the next time step t + 1 (variable mrp in
the code below).
• An instance of ValueFunctionApprox to capture the approximate Value Function for
the time step (variable approx0 in the code below represents the initial ValueFunctionApprox
instances).
• A sampling probability distribution of non-terminal states in the time step (variable
mu in the code below).
The backward induction code below should be pretty self-explanatory. Note that in
backward induction, we don’t invoke the update method of FunctionApprox like we did in
the non-finite-horizon cases - here we invoke the solve method which internally performs
a series of updates on the FunctionApprox for a given time step (until we converge to within
a specified level of error_tolerance). In the non-finite-horizon cases, it was okay to sim-
ply do a single update in each iteration because we revisit the same set of states in further
iterations. Here, once we converge to an acceptable ValueFunctionApprox (using solve)
for a specific time step, we won’t be performing any more updates to the Value Function
for that time step (since we move on to the next time step, in reverse). backward_evaluate
returns an Iterator over ValueFunctionApprox objects, from time step 0 to the horizon time
step. We should point out that in the code below, we’ve taken special care to handle ter-
minal states (that occur either at the end of the horizon or can even occur before the end
of the horizon) - this is done using the extended_vf function we’d written earlier.
MRP_FuncApprox_Distribution = Tuple[MarkovRewardProcess[S],
ValueFunctionApprox[S],
NTStateDistribution[S]]
def backward_evaluate(
mrp_f0_mu_triples: Sequence[MRP_FuncApprox_Distribution[S]],
gamma: float,
num_state_samples: int,
error_tolerance: float
) -> Iterator[ValueFunctionApprox[S]]:
v: List[ValueFunctionApprox[S]] = []
for i, (mrp, approx0, mu) in enumerate(reversed(mrp_f0_mu_triples)):
def return_(s_r: Tuple[State[S], float], i=i) -> float:
s1, r = s_r
return r + gamma * (extended_vf(v[i-1], s1) if i > 0 else 0.)
v.append(
approx0.solve(
[(s, mrp.transition_reward(s).expectation(return_))
for s in mu.sample_n(num_state_samples)],
error_tolerance
)
)
return reversed(v)
4.8. Finite-Horizon Approximate Value Iteration
Now that we’ve understood and coded finite-horizon Approximate Policy Evaluation (to
solve the finite-horizon Prediction problem), we can extend the same concepts to finite-
horizon Approximate Value Iteration (to solve the finite-horizon Control problem). The
code below in back_opt_vf_and_policy is almost the same as the code above in backward_evaluate,
except that instead of MarkovRewardProcess, here we have MarkovDecisionProcess. For each
non-terminal time step, we maximize the Q-Value function (over all actions a) for each
non-terminal state s. back_opt_vf_and_policy returns an Iterator over pairs of ValueFunctionApprox
and DeterministicPolicy objects (representing the Optimal Value Function and the Opti-
mal Policy respectively), from time step 0 to the horizon time step.
from rl.distribution import Constant
from operator import itemgetter
MDP_FuncApproxV_Distribution = Tuple[
MarkovDecisionProcess[S, A],
ValueFunctionApprox[S],
NTStateDistribution[S]
]
def back_opt_vf_and_policy(
mdp_f0_mu_triples: Sequence[MDP_FuncApproxV_Distribution[S, A]],
gamma: float,
num_state_samples: int,
error_tolerance: float
) -> Iterator[Tuple[ValueFunctionApprox[S], DeterministicPolicy[S, A]]]:
vp: List[Tuple[ValueFunctionApprox[S], DeterministicPolicy[S, A]]] = []
for i, (mdp, approx0, mu) in enumerate(reversed(mdp_f0_mu_triples)):
def return_(s_r: Tuple[State[S], float], i=i) -> float:
s1, r = s_r
return r + gamma * (extended_vf(vp[i-1][0], s1) if i > 0 else 0.)
this_v = approx0.solve(
[(s, max(mdp.step(s, a).expectation(return_)
for a in mdp.actions(s)))
for s in mu.sample_n(num_state_samples)],
error_tolerance
)
def deter_policy(state: S) -> A:
return max(
((mdp.step(NonTerminal(state), a).expectation(return_), a)
for a in mdp.actions(NonTerminal(state))),
key=itemgetter(0)
)[1]
vp.append((this_v, DeterministicPolicy(deter_policy)))
return reversed(vp)
With the optimal Action-Value (Q-Value) function, no knowledge of transition probabilities is
necessary to extract the optimal State-Value function and the optimal Policy (since we just
need to perform a max / arg max over all the actions for any non-terminal state). This con-
need to perform a max / arg max over all the actions for any non-terminal state). This con-
trasts with the case of working with the optimal State-Value function which requires us
to also avail of the transition probabilities, rewards and discount factor in order to extract
the optimal policy. We shall see later that Reinforcement Learning algorithms for Control
work with Action-Value (Q-Value) Functions for this very reason.
Performing backward induction on the optimal Q-value function means that knowledge
of the optimal Q-value function for a given time step t immediately gives us the optimal
State-Value function and the optimal policy for the same time step t. This contrasts with
performing backward induction on the optimal State-Value function - knowledge of the
optimal State-Value function for a given time step t cannot give us the optimal policy for
the same time step t (for that, we need the optimal State-Value function for time step t + 1
and furthermore, we also need the t to t + 1 state/reward transition probabilities).
So now we develop an algorithm that works with a function approximation for the Q-
Value function and steps back in time similar to the backward induction we had performed
earlier for the (State-)Value function. Just like we defined an alias type ValueFunctionApprox
for the State-Value function, we define an alias type QValueFunctionApprox for the Action-
Value function, as follows:
QValueFunctionApprox = FunctionApprox[Tuple[NonTerminal[S], A]]
MDP_FuncApproxQ_Distribution = Tuple[
MarkovDecisionProcess[S, A],
QValueFunctionApprox[S, A],
NTStateDistribution[S]
]
def back_opt_qvf(
mdp_f0_mu_triples: Sequence[MDP_FuncApproxQ_Distribution[S, A]],
gamma: float,
num_state_samples: int,
error_tolerance: float
) -> Iterator[QValueFunctionApprox[S, A]]:
horizon: int = len(mdp_f0_mu_triples)
qvf: List[QValueFunctionApprox[S, A]] = []
for i, (mdp, approx0, mu) in enumerate(reversed(mdp_f0_mu_triples)):
def return_(s_r: Tuple[State[S], float], i=i) -> float:
s1, r = s_r
next_return: float = max(
qvf[i-1]((s1, a)) for a in
mdp_f0_mu_triples[horizon - i][0].actions(s1)
) if i > 0 and isinstance(s1, NonTerminal) else 0.
return r + gamma * next_return
this_qvf = approx0.solve(
[((s, a), mdp.step(s, a).expectation(return_))
for s in mu.sample_n(num_state_samples) for a in mdp.actions(s)],
error_tolerance
)
qvf.append(this_qvf)
return reversed(qvf)
We should also point out here that working with the optimal Q-value function (rather
than the optimal State-Value function) in the context of ADP prepares us nicely for RL
because RL algorithms typically work with the optimal Q-value function instead of the
optimal State-Value function.
All of the above code for Approximate Dynamic Programming (ADP) algorithms is in
the file rl/approximate_dynamic_programming.py. We encourage you to create instances
of MarkovRewardProcess and MarkovDecisionProcess (including finite-horizon instances)
and play with the above ADP code with different choices of function approximations,
non-terminal state sampling distributions, and number of samples. A simple but valuable
exercise is to reproduce the tabular versions of these algorithms by using the Tabular im-
plementation of FunctionApprox (note: the count_to_weight_func would then need to be
lambda _: 1.) in the above ADP functions.
horizon. Sometimes you can take advantage of the mathematical structure of the under-
lying Markov Process to come up with an analytical expression (exact or approximate)
for the probability distribution of non-terminal states at any time step for the underly-
ing Markov Process of the MRP/implied-MRP. For instance, if the Markov Process is de-
scribed by a stochastic differential equation (SDE) and if we are able to solve the SDE,
we would know the analytical expression for the probability distribution of non-terminal
states. If we cannot take advantage of any such special properties, then we can generate
sampling traces by time-incrementally sampling from the state-transition probability dis-
tributions of each of the Markov Reward Processes at each time step (if we are solving
a Control problem, then we create implied-MRPs by evaluating the given MDPs with a
uniform policy). The states reached by these sampling traces at any fixed time step pro-
vide a SampledDistribution of non-terminal states for that time step. If the above choices
are infeasible or computationally expensive, then a simple and neutral choice is to use a
uniform distribution over the non-terminal states for each time step.
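As an illustrative sketch, the "simple and neutral choice" of a uniform distribution over known non-terminal states could be constructed as follows (assuming the SampledDistribution constructor from rl.distribution takes a zero-argument sampler, as in earlier chapters):

import random
from rl.distribution import SampledDistribution  # assumed API from earlier chapters

def uniform_nt_distribution(nt_states: Sequence[NonTerminal[S]]) -> NTStateDistribution[S]:
    # Uniform sampling over a given finite collection of non-terminal states
    return SampledDistribution(lambda: random.choice(nt_states))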
We will write some code in Chapter 6 to create a SampledDistribution of non-terminal
states for each time step of a finite-horizon problem by stitching together samples of state
transitions at each time step. If you are curious about this now, you can take a peek at the
code in rl/chapter7/asset_alloc_discrete.py.
Part II.
5. Utility Theory
what we care about when it comes to financial applications. However, Utility of Money
is not so straightforward because different people respond to different levels of money in
different ways. Moreover, in many financial applications, Utility functions help us deter-
mine the tradeoff between financial return and risk, and this involves (challenging) as-
sessments of the likelihood of various outcomes. The next section develops the intuition
on these concepts.
This asymmetry in the assignment of utility to outcomes results in this psychology of “I need to be compensated for taking this risk.” We
refer to this individualized demand of “compensation for risk” as the attitude of Risk-
Aversion. It means that individuals have differing degrees of discomfort with taking risk,
and they want to be compensated commensurately for taking risk. The amount of compen-
sation they seek is called Risk-Premium. The more Risk-Averse an individual is, the more
Risk-Premium the individual seeks. In the example above, your friend was more risk-
averse than you. Your risk-premium was $70 and your friend’s risk-premium was $150.
But the most important concept that you are learning here is that the root-cause of Risk-
Aversion is the asymmetry in the assignment of utility to outcomes of opposite sign and
same magnitude. We have introduced this notion of “asymmetry of utility” in a simple,
intuitive manner with this example, but we will soon embark on developing the formal
theory for this notion, and introduce a simple and elegant mathematical framework for
Utility Functions, Risk-Aversion and Risk-Premium.
A quick note before we get into the mathematical framework - you might be thinking
that a typical casino would actually charge you a bit more than $250 upfront for playing
the above game (because the casino needs to make a profit, on an expected basis), and
people are indeed willing to pay this amount at a typical casino. So what about the risk-
aversion we talked about earlier? The crucial point here is that people who play at casinos
are looking for entertainment and excitement emanating purely from the psychological
aspects of experiencing risk. They are willing to pay money for this entertainment and
excitement, and this payment is separate from the cost of pure financial utility that we
described above. So if people knew the true odds of pure-chance games of the type we
described above and if people did not care for entertainment and excitement value of risk-
taking in these games, focusing purely on financial utility, then what they’d be willing to
pay upfront to play such a game will be based on the type of calculations we outlined
above (meaning for the example we described, they’d typically pay less than $250 upfront
to play the game).
The last two properties above enable us to establish the Risk-Premium. Now let us un-
derstand the nature of Utility as a function of financial outcomes. The key is to note that
Utility is a non-linear function of financial outcomes. We call this non-linear function
the Utility function - it represents the “happiness”/“satisfaction” as a function of money.
You should think of the concept of Utility in terms of Utility of Consumption of money, i.e.,
what exactly do the financial gains fetch you in your life or business. This is the idea of
“value” (utility) derived from consuming the financial gains (or the negative utility of
requisite financial recovery from monetary losses). So now let us look at another simple
example to illustrate the concept of Utility of Consumption, this time not of consumption
of money, but of consumption of cookies (to make the concept vivid and intuitive). Fig-
ure 5.1 shows two curves - we refer to the lower curve as the marginal satisfaction (utility)
Figure 5.1.: Utility Curve
curve and the upper curve as the accumulated satisfaction (utility) curve. Marginal Util-
ity refers to the incremental satisfaction we gain from an additional unit of consumption
and Accumulated Utility refers to the aggregate satisfaction obtained from a certain number
of units of consumption (in continuous-space, you can think of accumulated utility func-
tion as the integral, over consumption, of marginal utility function). In this example, we
are consuming (i.e., eating) cookies. The marginal satisfaction curve tells us that the first
cookie we eat provides us with 100 units of satisfaction (i.e., utility). The second cookie
provides us 80 units of satisfaction, which is intuitive because you are not as hungry after
eating the first cookie compared to before eating the first cookie. Also, the emotions of bit-
ing into the first cookie are extra positive because of the novelty of the experience. When
you get to your 5th cookie, although you are still enjoying the cookie, you don't enjoy it
nearly as much as the first couple of cookies. The marginal satisfaction curve shows this -
the 5th cookie provides us 30 units of satisfaction, and the 10th cookie provides us only 10
units of satisfaction. If we’d keep going, we might even find that the marginal satisfaction
turns negative (as in, one might feel too full or maybe even feel like throwing up).
So, we see that the marginal utility function is a decreasing function. Hence, the accumu-
lated utility function is a concave function. The accumulated utility function is the Utility
of Consumption function (call it U) that we've been discussing so far. Let us denote the
number of cookies eaten as x, and so the total “satisfaction” (utility) after eating x cookies
is referred to as U(x). In our financial examples, x would be the amount of money one has
at one’s disposal, and is typically an uncertain outcome, i.e., x is a random variable with
an associated probability distribution. The extent of asymmetry in utility assignments for
gains versus losses that we saw earlier manifests as extent of concavity of the U (·) function
(which as we’ve discussed earlier, determines the extent of Risk-Aversion).
Now let’s examine the concave nature of the Utility function for financial outcomes with
another illustrative example. Let’s say you have to pick between two situations:
• In Situation 1, you have a 10% probability of winning a million dollars (and 90%
probability of winning 0).
• In Situation 2, you have a 0.1% probability of winning a billion dollars (and 99.9%
probability of winning 0).

Figure 5.2.: Certainty-Equivalent Value
The expected winning in Situation 1 is $100,000 and the expected winning in Situation
2 is $1,000,000 (i.e., 10 times more than Situation 1). If you analyzed this naively as win-
ning expectation maximization, you’d choose Situation 2. But most people would choose
Situation 1. The reason for this is that the Utility of a billion dollars is nowhere close to
1000 times the utility of a million dollars (except for some very wealthy people perhaps).
In fact, the ratio of Utility of a billion dollars to Utility of a million dollars might be more
like 10. So, the choice of Situation 1 over Situation 2 is usually quite clear - it’s about Utility
expectation maximization. So if the Utility of 0 dollars is 0 units, the Utility of a million
dollars is say 1000 units, and the Utility of a billion dollars is say 10000 units (i.e., 10 times
that of a million dollars), then we see that the Utility of financial gains is a fairly concave
function.
Certainty-Equivalent Value represents the certain amount we’d pay to consume an un-
certain outcome. This is the amount of $180 you were willing to pay to play the casino
game of the previous section.
Figure 5.2 illustrates this concept of Certainty-Equivalent Value in graphical terms. Next,
we define Risk-Premium in two different conventions:
• Absolute Risk-Premium πA :
πA = E[x] − xCE
• Relative Risk-Premium π_R:

π_R = π_A / E[x] = (x̄ − x_CE) / x̄ = 1 − x_CE / x̄

where x̄ = E[x] and x_CE denotes the Certainty-Equivalent Value. To relate the Risk-Premium
to the shape of the Utility function, we write the Taylor-series expansion of U(x) around x̄,
ignoring terms beyond quadratic:

U(x) ≈ U(x̄) + U'(x̄) · (x − x̄) + (1/2) · U''(x̄) · (x − x̄)²
Taking the expectation of U (x) in the above formula, we get:
E[U(x)] ≈ U(x̄) + (1/2) · U''(x̄) · σ_x²
2
Next, we write the Taylor-series expansion for U(x_CE) around x̄ and ignore terms be-
yond linear in the expansion, as follows:

U(x_CE) ≈ U(x̄) + U'(x̄) · (x_CE − x̄)
Since E[U (x)] = U (xCE ) (by definition of xCE ), the above two expressions are approxi-
mately the same. Hence,
U'(x̄) · (x_CE − x̄) ≈ (1/2) · U''(x̄) · σ_x²    (5.1)
From Equation (5.1), Absolute Risk-Premium
π_A = x̄ − x_CE ≈ −(1/2) · (U''(x̄) / U'(x̄)) · σ_x²
We refer to the function

A(x) = −U''(x) / U'(x)

as the Absolute Risk-Aversion function, so that π_A ≈ (1/2) · A(x̄) · σ_x². Likewise, we refer to the function

R(x) = −(U''(x) · x) / U'(x)
as the Relative Risk-Aversion function. Therefore,
π_R ≈ (1/2) · R(x̄) · σ_{x/x̄}²

where σ_{x/x̄} denotes the standard deviation of x/x̄ (so σ_{x/x̄}² = σ_x²/x̄²).
Now let's take stock of what we've learned here. We've shown that the Risk-Premium is
proportional to the product of:

• the extent of Risk-Aversion (A(x̄) in the absolute case, R(x̄) in the relative case), and
• the extent of uncertainty in the outcome (the variance σ_x² or σ_{x/x̄}²).

We've expressed the extent of Risk-Aversion to be proportional to the negative ratio of:

• the curvature of the Utility function, U''(x̄), to
• the slope of the Utility function, U'(x̄).
Consider the Utility function

U(x) = (1 − e^{−ax}) / a for a ≠ 0, and U(x) = x for a = 0

Firstly, note that U(x) is continuous with respect to a for all x ∈ R since:

lim_{a→0} (1 − e^{−ax}) / a = x
Now let us analyze the function U (·) for any fixed a. We note that for all a ∈ R:
• U (0) = 0
• U'(x) = e^{−ax} > 0 for all x ∈ R
• U''(x) = −a · e^{−ax}
This means U(·) is a monotonically increasing function passing through the origin, and
its curvature has the opposite sign to that of a (note: no curvature when a = 0).
So now we can calculate the Absolute Risk-Aversion function:
A(x) = −U''(x) / U'(x) = a
So we see that the Absolute Risk-Aversion function is the constant value a. Conse-
quently, we say that this Utility function corresponds to Constant Absolute Risk-Aversion
(CARA). The parameter a is referred to as the Coefficient of CARA. The magnitude of pos-
itive a signifies the degree of risk-aversion. a = 0 is the case of being Risk-Neutral. Nega-
tive values of a mean one is “risk-seeking,” i.e., one will pay to take risk (the opposite of
risk-aversion) and the magnitude of negative a signifies the degree of risk-seeking.
If the random outcome x ∼ N (µ, σ 2 ), then using Equation (A.5) from Appendix A, we
get:
E[U(x)] = (1 − e^{−aµ + a²σ²/2}) / a for a ≠ 0
E[U(x)] = µ for a = 0

x_CE = µ − aσ²/2

Absolute Risk Premium π_A = µ − x_CE = aσ²/2
For optimization problems where we need to choose across probability distributions
where σ² is a function of µ, we seek the distribution that maximizes x_CE = µ − aσ²/2. This
clearly illustrates the concept of “risk-adjusted-return” because µ serves as the “return”
and the risk-adjustment aσ²/2 is proportional to the product of risk-aversion a and risk (i.e.,
variance in outcomes) σ².
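Here is a small, self-contained numerical check of the CARA certainty-equivalent formula x_CE = µ − aσ²/2 via Monte Carlo sampling (the numbers are purely illustrative):

import numpy as np

# Check x_CE = mu - a*sigma^2/2 for x ~ N(mu, sigma^2) and U(x) = (1 - exp(-a*x)) / a
a, mu, sigma = 2.0, 0.1, 0.2
rng = np.random.default_rng(0)
x = rng.normal(mu, sigma, size=1_000_000)
expected_utility = np.mean((1 - np.exp(-a * x)) / a)
x_ce_mc = -np.log(1 - a * expected_utility) / a    # invert U to get the certainty-equivalent
print(x_ce_mc, mu - a * sigma ** 2 / 2)            # both approximately 0.06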
Our task is to determine the allocation π (out of the given $1) to invest in the risky
asset (so, 1 − π is invested in the riskless asset) so as to maximize the Expected Utility of
Consumption of Portfolio Wealth in 1 year. Note that we allow π to be unconstrained, i.e.,
π can be any real number from −∞ to +∞. So, if π > 0, we buy the risky asset and if
π < 0, we “short-sell” the risky asset. Investing π in the risky asset means in 1 year, the
risky asset’s value will be a normal distribution N (π(1 + µ), π 2 σ 2 ). Likewise, if 1 − π > 0,
we lend 1 − π (and will be paid back (1 − π)(1 + r) in 1 year), and if 1 − π < 0, we borrow
1 − π (and need to pay back (1 − π)(1 + r) in 1 year).
Portfolio Wealth W in 1 year is given by:
W ∼ N (1 + r + π(µ − r), π 2 σ 2 )
We assume CARA Utility with a ̸= 0, so:
U(W) = (1 − e^{−aW}) / a
We know that maximizing E[U (W )] is equivalent to maximizing the Certainty-Equivalent
Value of Wealth W , which in this case (using the formula for xCE in the section on CARA)
is given by:
1 + r + π(µ − r) − aπ²σ²/2

This is a quadratic concave function of π for a > 0, and so, taking its derivative with
respect to π and setting it to 0 gives us the optimal investment fraction in the risky asset
(π*) as follows:

π* = (µ − r) / (aσ²)
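For example, plugging in illustrative numbers (not from the text):

mu, r, sigma, a = 0.08, 0.02, 0.2, 3.0     # illustrative values
pi_star = (mu - r) / (a * sigma ** 2)
print(pi_star)                             # 0.5, i.e., invest 50% of wealth in the risky asset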
Consider the Utility function

U(x) = (x^{1−γ} − 1) / (1 − γ) for γ ≠ 1, and U(x) = log(x) for γ = 1

Firstly, note that U(x) is continuous with respect to γ for all x ∈ R+ since:

lim_{γ→1} (x^{1−γ} − 1) / (1 − γ) = log(x)
Now let us analyze the function U (·) for any fixed γ. We note that for all γ ∈ R:
• U (1) = 0
• U'(x) = x^{−γ} > 0 for all x ∈ R+
• U''(x) = −γ · x^{−γ−1}
This means U(·) is a monotonically increasing function passing through (1, 0), and its
curvature has the opposite sign to that of γ (note: no curvature when γ = 0).
So now we can calculate the Relative Risk-Aversion function:
R(x) = −(U''(x) · x) / U'(x) = γ
So we see that the Relative Risk-Aversion function is the constant value γ. Consequently,
we say that this Utility function corresponds to Constant Relative Risk-Aversion (CRRA). The
parameter γ is referred to as the Coefficient of CRRA. The magnitude of positive γ signifies
the degree of risk-aversion. γ = 0 yields the Utility function U (x) = x − 1 and is the case
of being Risk-Neutral. Negative values of γ mean one is “risk-seeking,” i.e., one will pay
to take risk (the opposite of risk-aversion) and the magnitude of negative γ signifies the
degree of risk-seeking.
If the random outcome x is lognormal, with log(x) ∼ N (µ, σ 2 ), then making a substitu-
tion y = log(x), expressing E[U (x)] as E[U (ey )], and using Equation (A.5) in Appendix A,
we get:
E[U(x)] = (e^{µ(1−γ) + σ²(1−γ)²/2} − 1) / (1 − γ) for γ ≠ 1
E[U(x)] = µ for γ = 1

x_CE = e^{µ + σ²(1−γ)/2}

Relative Risk Premium π_R = 1 − x_CE/x̄ = 1 − e^{−σ²γ/2}
For optimization problems where we need to choose across probability distributions
where σ² is a function of µ, we seek the distribution that maximizes log(x_CE) = µ +
σ²(1−γ)/2. Just like in the case of CARA, this clearly illustrates the concept of “risk-adjusted-
return” because µ + σ²/2 serves as the “return” and the risk-adjustment γσ²/2 is proportional
to the product of risk-aversion γ and risk (i.e., variance in outcomes) σ².
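Analogous to the CARA check earlier, here is a small Monte Carlo verification of the CRRA certainty-equivalent formula x_CE = e^{µ + σ²(1−γ)/2} (illustrative numbers):

import numpy as np

# Check x_CE for log(x) ~ N(mu, sigma^2) and U(x) = (x^(1-gamma) - 1) / (1 - gamma), gamma != 1
gamma, mu, sigma = 2.0, 0.05, 0.25
rng = np.random.default_rng(0)
x = np.exp(rng.normal(mu, sigma, size=1_000_000))
eu = np.mean((x ** (1 - gamma) - 1) / (1 - gamma))
x_ce_mc = (eu * (1 - gamma) + 1) ** (1 / (1 - gamma))      # invert U
print(x_ce_mc, np.exp(mu + sigma ** 2 * (1 - gamma) / 2))  # both approximately 1.019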
• A risky asset, evolving in continuous time, with value denoted St at time t, whose
movements are defined by the Ito process:
dSt = µ · St · dt + σ · St · dzt
• A riskless asset, also evolving in continuous time, with value denoted R_t at time t, whose growth is defined by: dR_t = r · R_t · dt
We are given $1 to invest over a period of 1 year. We are asked to maintain a constant
fraction of investment of wealth (denoted π ∈ R) in the risky asset at each time t (with
1−π as the fraction of investment in the riskless asset at each time t). Note that to maintain
a constant fraction of investment in the risky asset, we need to continuously rebalance the
portfolio of the risky asset and riskless asset. Our task is to determine the constant π that
maximizes the Expected Utility of Consumption of Wealth at the end of 1 year. We allow π
to be unconstrained, i.e., π can take any value from −∞ to +∞. Positive π means we have
a “long” position in the risky asset and negative π means we have a “short” position in the
risky asset. Likewise, positive 1 − π means we are lending money at the riskless interest
rate of r and negative 1 − π means we are borrowing money at the riskless interest rate of
r.
We denote the Wealth at time t as Wt . Without loss of generality, assume W0 = 1. Since
Wt is the portfolio wealth at time t, the value of the investment in the risky asset at time t
would need to be π ·Wt and the value of the investment in the riskless asset at time t would
need to be (1 − π) · Wt . Therefore, the change in the value of the risky asset investment
from time t to time t + dt is:
µ · π · Wt · dt + σ · π · Wt · dzt
Likewise, the change in the value of the riskless asset investment from time t to time
t + dt is:
r · (1 − π) · Wt · dt
Therefore, the infinitesimal change in portfolio wealth dWt from time t to time t + dt is
given by:
dW_t = (r + π(µ − r)) · W_t · dt + π · σ · W_t · dz_t

Applying Ito's Lemma to log W_t gives us:

d(log W_t) = ((r + π(µ − r)) · W_t · (1/W_t) − (1/2) · π²σ²W_t² · (1/W_t²)) · dt + π · σ · W_t · (1/W_t) · dz_t
= (r + π(µ − r) − π²σ²/2) · dt + π · σ · dz_t
Therefore,

log W_t = ∫_0^t (r + π(µ − r) − π²σ²/2) · du + ∫_0^t π · σ · dz_u
Using the martingale property and Ito Isometry for the Ito integral ∫_0^t π · σ · dz_u (see Ap-
pendix C), we get:

log W_1 ∼ N(r + π(µ − r) − π²σ²/2, π²σ²)
We assume CRRA Utility with γ ≠ 0, so:

U(W_1) = (W_1^{1−γ} − 1) / (1 − γ) for γ ≠ 1
U(W_1) = log(W_1) for γ = 1

Using the formula for x_CE in the section on CRRA, maximizing E[U(W_1)] is equivalent to maximizing log(x_CE) = r + π(µ − r) − γπ²σ²/2, and setting its derivative with respect to π to 0 yields the optimal constant fraction invested in the risky asset: π* = (µ − r) / (γσ²).
5.9. Key Takeaways from this Chapter
• An individual’s financial risk-aversion is represented by the concave nature of the
individual’s Utility as a function of financial outcomes.
• Risk-Premium (compensation an individual seeks for taking financial risk) is roughly
proportional to the individual’s financial risk-aversion and the measure of uncer-
tainty in financial outcomes.
• Risk-Adjusted-Return in finance should be thought of as the Certainty-Equivalent-
Value, whose Utility is the Expected Utility across uncertain (risky) financial out-
comes.
6. Dynamic Asset-Allocation and
Consumption
This chapter covers the first of five financial applications of Stochastic Control covered in
this book. This financial application deals with the topic of investment management for
not just a financial company, but more broadly for any corporation or for any individual.
The nuances for specific companies and individuals can vary considerably but what is
common across these entities is the need to:
• Periodically decide how one’s investment portfolio should be split across various
choices of investment assets - the key being how much money to invest in more risky
assets (which have potential for high returns on investment) versus less risky assets
(that tend to yield modest returns on investment). This problem of optimally allocat-
ing capital across investment assets of varying risk-return profiles relates to the topic
of Utility Theory we covered in Chapter 5. However, in this chapter, we deal with
the further challenge of adjusting one’s allocation of capital across assets, as time
progresses. We refer to this feature as Dynamic Asset Allocation (the word dynamic
refers to the adjustment of capital allocation to adapt to changing circumstances)
• Periodically decide how much capital to leave in one’s investment portfolio versus
how much money to consume for one’s personal needs/pleasures (or for a corpora-
tion’s operational requirements) by extracting money from one’s investment portfo-
lio. Extracting money from one’s investment portfolio can mean potentially losing
out on investment growth opportunities, but the flip side of this is the Utility of Con-
sumption that a corporation/individual desires. Noting that ultimately our goal is
to maximize total utility of consumption over a certain time horizon, this decision
of investing versus consuming really amounts to the timing of consumption of one’s
money over the given time horizon.
Thus, this problem constitutes the dual and dynamic decisioning of asset-allocation and
consumption. To gain an intuitive understanding of the challenge of this dual dynamic
decisioning problem, let us consider this problem from the perspective of personal finance
in a simplified setting.
• Receiving money: This could include your periodic salary, which typically remains
constant for a period of time, but can change if you get a promotion or if you get a
new job. This also includes money you liquidate from your investment portfolio, eg:
if you sell some stock, and decide not to re-invest in other investment assets. This
also includes interest you earn from your savings account or from some bonds you
might own. There are many other ways one can receive money, some fixed regular
payments and some uncertain in terms of payment quantity and timing, and we
won’t enumerate all the different ways of receiving money. We just want to highlight
here that receiving money at various points in time is one of the key financial aspects
in one’s life.
• Consuming money: The word “consume” refers to “spending.” Note that one needs
to consume money periodically to satisfy basic needs like shelter, food and clothing.
The rent or mortgage you pay on your house is one example - it may be a fixed
amount every month, but if your mortgage rate is a floating rate, it is subject to vari-
ation. Moreover, if you move to a new house, the rent or mortgage can be different.
The money you spend on food and clothing also constitutes consuming money. This
can often be fairly stable from one month to the next, but if you have a newborn baby,
it might require additional expenses of the baby’s food, clothing and perhaps also
toys. Then there is consumption of money that is beyond the “necessities” - things
like eating out at a fancy restaurant on the weekend, taking a summer vacation, buy-
ing a luxury car or an expensive watch etc. One gains “satisfaction”/“happiness”
(i.e., Utility) from this consumption of money. The key point here is that we need to
periodically make a decision on how much to spend (consume money) on a weekly
or monthly basis. One faces a tension in the dynamic decision between consuming
money (that gives us Consumption Utility) and saving money (which is the money we
put in our investment portfolio in the hope of the money growing, so we can con-
sume potentially larger amounts of money in the future).
• Investing Money: Let us suppose there are a variety of investment assets you can in-
vest in - simple savings account giving small interest, exchange-traded stocks (rang-
ing from value stocks to growth stocks, with their respective risk-return tradeoffs),
real-estate (the house you bought and live in is indeed considered an investment
asset), commodities such as gold, paintings etc. We call the composition of money
invested in these assets as one’s investment portfolio (see Appendix B for a quick
introduction to Portfolio Theory). Periodically, we need to decide if one should play
safe by putting most of one’s money in a savings account, or if we should allocate
investment capital mostly in stocks, or if we should be more speculative and invest
in an early-stage startup or in a rare painting. Reviewing the composition and poten-
tially re-allocating capital (referred to as re-balancing one’s portfolio) is the problem
of dynamic asset-allocation. Note also that we can put some of our received money
into our investment portfolio (meaning we choose to not consume that money right
away). Likewise, we can extract some money out of our investment portfolio so we
can consume money. The decisions of inserting money into and extracting money from
our investment portfolio essentially constitute the dynamic money-consumption decision we
make, which goes together with the dynamic asset-allocation decision.
The above description has hopefully given you a flavor of the dual and dynamic deci-
sioning of asset-allocation and consumption. Ultimately, our personal goal is to maximize
the Expected Aggregated Utility of Consumption of Money over our lifetime (and per-
haps, also include the Utility of Consumption of Money for one’s spouse and children,
after one dies). Since investment portfolios are stochastic in nature and since we have to
periodically make decisions on asset-allocation and consumption, you can see that this has
all the ingredients of a Stochastic Control problem, and hence can be modeled as a Markov
Decision Process (albeit typically fairly complicated, since real-life finances have plenty of
nuances). Here’s a rough and informal sketch of what that MDP might look like (bear in
mind that we will formalize the MDP for simplified cases later in this chapter):
• States: The State can be quite complex in general, but mainly it consists of one’s age
(to keep track of the time to reach the MDP horizon), the quantities of money in-
vested in each investment asset, the valuation of the assets invested in, and poten-
tially also other aspects like one’s job/career situation (required to make predictions
of future salary possibilities).
• Actions: The Action is two-fold. Firstly, it’s the vector of investment amounts one
chooses to make at each time step (the time steps are at the periodicity at which we
review our investment portfolio for potential re-allocation of capital across assets).
Secondly, it’s the quantity of money one chooses to consume that is flexible/optional
(i.e., beyond the fixed payments like rent that we are committed to make).
• Model: The Model (probabilities of next state and reward, given current state and
action) can be fairly complex in most real-life situations. The hardest aspect is the
prediction of what might happen tomorrow in our life and career (we need this pre-
diction since it determines our future likelihood to receive money, consume money
and invest money). Moreover, the uncertain movements of investment assets would
need to be captured by our model.
Since our goal here was to simply do a rough and informal sketch, the above coverage of
the MDP is very hazy but we hope you get a sense for what the MDP might look like. Now
we are ready to take a simple special case of this MDP which does away with many of the
real-world frictions and complexities, yet retains the key features (in particular, the dual
dynamic decisioning aspect). This simple special case was the subject of Merton’s Portfo-
lio Problem (Merton 1969) which he formulated and solved in 1969 in a landmark paper.
A key feature of his formulation was that time is continuous and so, state (based on asset
prices) evolves as a continuous-time stochastic process, and actions (asset-allocation and
consumption) are made continuously. We cover the important parts of his paper in the
next section. Note that our coverage below requires some familiarity with Stochastic Cal-
culus (covered in Appendix C) and with the Hamilton-Jacobi-Bellman Equation (covered
in Appendix D), which is the continuous-time analog of Bellman’s Optimality Equation.
to live for T more years (T is a fixed real number). So, in the language of the previous
section, you will not be receiving money for the rest of your life, other than the option of
extracting money from your investment portfolio. Also assume that you have no fixed
payments to make like mortgage, subscriptions etc. (assume that you have already paid
for a retirement service that provides you with your essential food, clothing and other ser-
vices). This means all of your money consumption is flexible/optional, i.e., you have a choice
of consuming any real non-negative number at any point in time. All of the above are big
(and honestly, unreasonable) assumptions but they help keep the problem simple enough
for analytical tractability. In spite of these over-simplified assumptions, the problem for-
mulation still captures the salient aspects of dual dynamic decisioning of asset-allocation
and consumption while eliminating the clutter of A) receiving money from external sources
and B) consuming money that is of a non-optional nature.
We define wealth at any time t (denoted Wt ) as the aggregate market value of your
investment assets. Note that since no external money is received and since all consumption
is optional, Wt is your “net-worth.” Assume there are a fixed number n of risky assets and
a single riskless asset. Assume that each risky asset has a known normal distribution of
returns. Now we make a couple of big assumptions for analytical tractability:
• You are allowed to buy or sell any fractional quantities of assets at any point in time
(i.e., in continuous time).
• There are no transaction costs with any of the buy or sell transactions in any of the
assets.
You start with wealth W0 at time t = 0. As mentioned earlier, the goal is to maximize
your expected lifetime-aggregated Utility of Consumption of money with the actions at
any point in time being two-fold: Asset-Allocation and Consumption (Consumption being
equal to the capital extracted from the investment portfolio at any point in time). Note
that since there is no external source of money and since all capital extracted from the
investment portfolio at any point in time is immediately consumed, you are never adding
capital to your investment portfolio. The growth of the investment portfolio can happen
only from growth in the market value of assets in your investment portfolio. Lastly, we
assume that the Consumption Utility function is Constant Relative Risk-Aversion (CRRA),
which we covered in Chapter 5.
For ease of exposition, we formalize the problem setting and derive Merton’s beautiful
analytical solution for the case of n = 1 (i.e., only 1 risky asset). The solution generalizes
in a straightforward manner to the case of n > 1 risky assets, so it pays to keep the notation
and explanations simple, emphasizing intuition rather than heavy technical details.
Since we are operating in continuous-time, the risky asset follows a stochastic process
(denoted S) - specifically an Ito process (introductory background on Ito processes and
Ito’s Lemma covered in Appendix C), as follows:
dSt = µ · St · dt + σ · St · dzt
where µ ∈ R, σ ∈ R+ are fixed constants (note that for n assets, we would instead work
with a vector for µ and a matrix for σ).
The riskless asset has no uncertainty associated with it and has a fixed rate of growth in
continuous-time, so the valuation of the riskless asset Rt at time t is given by:
dRt = r · Rt · dt
Assume r ∈ R is a fixed constant, representing the instantaneous riskless growth of
money. We denote the consumption of wealth (equal to extraction of money from the
investment portfolio) per unit time (at time t) as c(t, Wt ) ≥ 0 to make it clear that the con-
sumption (our decision at any time t) will in general depend on both time t and wealth Wt .
Note that we talk about “rate of consumption in time” because consumption is assumed
to be continuous in time. As mentioned earlier, we denote wealth at time t as Wt (note
that W is a stochastic process too). We assume that Wt > 0 for all t ≥ 0. This is a rea-
sonable assumption to make as it manifests in constraining the consumption (extraction
from investment portfolio) to ensure wealth remains positive. We denote the fraction of
wealth allocated to the risky asset at time t as π(t, Wt ). Just like consumption c, risky-asset
allocation fraction π is a function of time t and wealth Wt . Since there is only one risky
asset, the fraction of wealth allocated to the riskless asset at time t is 1 − π(t, Wt ). Unlike
the constraint c(t, Wt ) ≥ 0, π(t, Wt ) is assumed to be unconstrained. Note that c(t, Wt )
and π(t, Wt ) together constitute the decision (MDP action) at time t. To keep our notation
light, we shall write ct for c(t, Wt ) and πt for π(t, Wt ), but please do recognize through-
out the derivation that both are functions of wealth Wt at time t as well as of time t itself.
Finally, we assume that the Utility of Consumption function is defined as:
$$U(x) = \frac{x^{1-\gamma}}{1-\gamma}$$
for a risk-aversion parameter $\gamma \ne 1$. This Utility function is essentially the CRRA Utility
function (ignoring the constant term $\frac{-1}{1-\gamma}$) that we covered in Chapter 5 for $\gamma \ne 1$. $\gamma$ is
the Coefficient of CRRA, equal to $\frac{-x \cdot U''(x)}{U'(x)}$. We will not cover the case of the CRRA Utility
function for $\gamma = 1$ (i.e., $U(x) = \log(x)$), but we encourage you to work out the derivation
for $U(x) = \log(x)$ as an exercise.
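To make the CRRA property concrete, here is a minimal sketch of our own (not part of the book's code library) that evaluates this Utility function and uses simple finite differences to confirm numerically that the coefficient of relative risk-aversion $\frac{-x \cdot U''(x)}{U'(x)}$ equals $\gamma$.

# Sketch (illustrative only): numerically verify that U(x) = x^(1 - gamma) / (1 - gamma)
# has constant relative risk-aversion equal to gamma.

def crra_utility(x: float, gamma: float) -> float:
    return x ** (1 - gamma) / (1 - gamma)

def relative_risk_aversion(x: float, gamma: float, h: float = 1e-5) -> float:
    # Central finite differences for U'(x) and U''(x)
    u_prime = (crra_utility(x + h, gamma) - crra_utility(x - h, gamma)) / (2 * h)
    u_double_prime = (crra_utility(x + h, gamma) - 2 * crra_utility(x, gamma)
                      + crra_utility(x - h, gamma)) / (h * h)
    return -x * u_double_prime / u_prime

if __name__ == '__main__':
    for gamma in [0.5, 2.0, 5.0]:
        # Each printed value should be approximately equal to the input gamma
        print(gamma, relative_risk_aversion(x=1.7, gamma=gamma))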
Due to our assumption of no addition of money to our investment portfolio of the risky
asset St and riskless asset Rt and due to our assumption of no transaction costs of buy-
ing/selling any fractional quantities of risky as well as riskless assets, the time-evolution
for wealth should be conceptualized as a continuous adjustment of the allocation πt and
continuous extraction from the portfolio (equal to continuous consumption ct ).
Since the value of the risky asset investment at time t is πt · Wt , the change in the value
of the risky asset investment from time t to time t + dt is:
µ · πt · Wt · dt + σ · πt · Wt · dzt
Likewise, since the value of the riskless asset investment at time t is (1 − πt ) · Wt , the
change in the value of the riskless asset investment from time t to time t + dt is:
r · (1 − πt ) · Wt · dt
Therefore, the infinitesimal change in wealth $dW_t$ from time $t$ to time $t + dt$ is given by:
$$dW_t = ((\pi_t \cdot (\mu - r) + r) \cdot W_t - c_t) \cdot dt + \pi_t \cdot \sigma \cdot W_t \cdot dz_t \quad (6.1)$$
Note that this is an Ito process defining the stochastic evolution of wealth.
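Before proceeding with the HJB-based solution, it can help to see this wealth dynamics operationally. The following is a small simulation sketch of our own (not the book's code), discretizing Equation (6.1) with an Euler-Maruyama scheme for arbitrary, hypothetical allocation and consumption functions pi_func and c_func.

# Minimal Euler-Maruyama sketch (illustrative only) of the wealth dynamics in
# Equation (6.1): dW = ((pi*(mu - r) + r)*W - c)*dt + pi*sigma*W*dz.
from typing import Callable
import numpy as np

def simulate_wealth(
    w0: float,
    mu: float,
    sigma: float,
    r: float,
    horizon: float,
    steps: int,
    pi_func: Callable[[float, float], float],  # (t, W) -> fraction in risky asset
    c_func: Callable[[float, float], float]    # (t, W) -> consumption rate
) -> np.ndarray:
    dt = horizon / steps
    path = np.empty(steps + 1)
    path[0] = w0
    for i in range(steps):
        t, w = i * dt, path[i]
        pi, c = pi_func(t, w), c_func(t, w)
        drift = ((pi * (mu - r) + r) * w - c) * dt
        diffusion = pi * sigma * w * np.sqrt(dt) * np.random.standard_normal()
        path[i + 1] = max(w + drift + diffusion, 1e-8)  # keep wealth positive
    return path

# Example: constant 60% risky allocation, consume 5% of wealth per unit time
path = simulate_wealth(100.0, 0.08, 0.2, 0.02, 10.0, 1000,
                       lambda t, w: 0.6, lambda t, w: 0.05 * w)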
Our goal is to determine optimal (π(t, Wt ), c(t, Wt )) at any time t to maximize:
$$\mathbb{E}\left[\int_t^T \frac{e^{-\rho(s-t)} \cdot c_s^{1-\gamma}}{1-\gamma} \cdot ds + \frac{e^{-\rho(T-t)} \cdot B(T) \cdot W_T^{1-\gamma}}{1-\gamma} \;\middle|\; W_t\right]$$
where ρ ≥ 0 is the utility discount rate to account for the fact that future utility of consump-
tion might be less than current utility of consumption, and B(·) is known as the “bequest”
function (think of this as the money you will leave for your family when you die at time
T). We can solve this problem for arbitrary bequest $B(T)$ but for simplicity, we shall con-
sider $B(T) = \epsilon^\gamma$ where $0 < \epsilon \ll 1$, meaning “no bequest.” We require the bequest to be $\epsilon^\gamma$
rather than 0 for technical reasons that will become apparent later.
We should think of this problem as a continuous-time Stochastic Control problem where
the MDP is defined as below. The Reward per unit time at any time $t < T$ is the Utility of Consumption:
$$U(c_t) = \frac{c_t^{1-\gamma}}{1-\gamma}$$
and the Reward at time $T$ is:
$$B(T) \cdot U(W_T) = \epsilon^\gamma \cdot \frac{W_T^{1-\gamma}}{1-\gamma}$$
The Return at time t is the accumulated discounted Reward:
$$\int_t^T e^{-\rho(s-t)} \cdot \frac{c_s^{1-\gamma}}{1-\gamma} \cdot ds + e^{-\rho(T-t)} \cdot \epsilon^\gamma \cdot \frac{W_T^{1-\gamma}}{1-\gamma}$$
Our goal is to find the Policy : (t, Wt ) → (πt , ct ) that maximizes the Expected Return. Note
the important constraint that ct ≥ 0, but πt is unconstrained.
Our first step is to write out the Hamilton-Jacobi-Bellman (HJB) Equation (the analog
of the Bellman Optimality Equation in continuous-time). We denote the Optimal Value
Function as V ∗ such that the Optimal Value for wealth Wt at time t is V ∗ (t, Wt ). Note
that unlike Section 3.13 in Chapter 3 where we denoted the Optimal Value Function as
a time-indexed sequence Vt∗ (·), here we make t an explicit functional argument of V ∗ .
This is because in the continuous-time setting, we are interested in the time-differential
of the Optimal Value Function. Appendix D provides the derivation of the general HJB
formulation (Equation (D.1) in Appendix D) - this general HJB Equation specializes here
to the following:
$$\max_{\pi_t, c_t}\left\{\mathbb{E}_t\left[dV^*(t, W_t) + \frac{c_t^{1-\gamma}}{1-\gamma} \cdot dt\right]\right\} = \rho \cdot V^*(t, W_t) \cdot dt \quad (6.2)$$
Now use Ito’s Lemma on dV ∗ , remove the dzt term since it’s a martingale, and divide
throughout by dt to produce the HJB Equation in partial-differential form for any 0 ≤
t < T , as follows (the general form of this transformation appears as Equation (D.2) in
Appendix D):
$$\max_{\pi_t, c_t}\left\{\frac{\partial V^*}{\partial t} + \frac{\partial V^*}{\partial W_t} \cdot ((\pi_t \cdot (\mu - r) + r) \cdot W_t - c_t) + \frac{\partial^2 V^*}{\partial W_t^2} \cdot \frac{\pi_t^2 \cdot \sigma^2 \cdot W_t^2}{2} + \frac{c_t^{1-\gamma}}{1-\gamma}\right\} = \rho \cdot V^*(t, W_t) \quad (6.3)$$
along with the terminal-time boundary condition:
$$V^*(T, W_T) = \epsilon^\gamma \cdot \frac{W_T^{1-\gamma}}{1-\gamma}$$
Let us write Equation (6.3) more succinctly as:
$$\max_{\pi_t, c_t} \Phi(t, W_t; \pi_t, c_t) = \rho \cdot V^*(t, W_t) \quad (6.4)$$
where $\Phi(t, W_t; \pi_t, c_t)$ denotes the expression inside the curly braces of Equation (6.3).
It pays to emphasize again that we are working with the constraints $W_t > 0$ and $c_t \geq 0$ for $0 \leq t < T$.
To find optimal πt∗ , c∗t , we take the partial derivatives of Φ(t, Wt ; πt , ct ) with respect to
πt and ct , and equate to 0 (first-order conditions for Φ). The partial derivative of Φ with
respect to πt is:
$$(\mu - r) \cdot \frac{\partial V^*}{\partial W_t} + \frac{\partial^2 V^*}{\partial W_t^2} \cdot \pi_t \cdot \sigma^2 \cdot W_t = 0$$
$$\Rightarrow \pi_t^* = \frac{-\frac{\partial V^*}{\partial W_t} \cdot (\mu - r)}{\frac{\partial^2 V^*}{\partial W_t^2} \cdot \sigma^2 \cdot W_t} \quad (6.5)$$
The partial derivative of $\Phi$ with respect to $c_t$ is:
$$-\frac{\partial V^*}{\partial W_t} + (c_t^*)^{-\gamma} = 0$$
$$\Rightarrow c_t^* = \left(\frac{\partial V^*}{\partial W_t}\right)^{-\frac{1}{\gamma}} \quad (6.6)$$
Now substitute $\pi_t^*$ (from Equation (6.5)) and $c_t^*$ (from Equation (6.6)) in $\Phi(t, W_t; \pi_t, c_t)$
(in Equation (6.3)) and equate to $\rho \cdot V^*(t, W_t)$. This gives us the Optimal Value Function
Partial Differential Equation (PDE):
$$\frac{\partial V^*}{\partial t} - \frac{(\mu-r)^2}{2\sigma^2} \cdot \frac{\left(\frac{\partial V^*}{\partial W_t}\right)^2}{\frac{\partial^2 V^*}{\partial W_t^2}} + \frac{\partial V^*}{\partial W_t} \cdot r \cdot W_t + \frac{\gamma}{1-\gamma} \cdot \left(\frac{\partial V^*}{\partial W_t}\right)^{\frac{\gamma-1}{\gamma}} = \rho \cdot V^*(t, W_t) \quad (6.7)$$
with the boundary condition:
$$V^*(T, W_T) = \epsilon^\gamma \cdot \frac{W_T^{1-\gamma}}{1-\gamma}$$
The second-order conditions for $\Phi$ are satisfied under the assumptions $c_t^* > 0$, $W_t > 0$,
$\frac{\partial^2 V^*}{\partial W_t^2} < 0$ for all $0 \leq t < T$ (we will later show that these are all satisfied in the solution
we derive), and for concave $U(\cdot)$, i.e., $\gamma > 0$.
Next, we want to reduce the PDE (6.7) to an Ordinary Differential Equation (ODE) so
we can solve the (simpler) ODE. Towards this goal, we surmise a guess solution in
terms of a deterministic function $f$ of time:
$$V^*(t, W_t) = f(t)^\gamma \cdot \frac{W_t^{1-\gamma}}{1-\gamma} \quad (6.8)$$
Then,
$$\frac{\partial V^*}{\partial t} = \gamma \cdot f(t)^{\gamma-1} \cdot f'(t) \cdot \frac{W_t^{1-\gamma}}{1-\gamma} \quad (6.9)$$
$$\frac{\partial V^*}{\partial W_t} = f(t)^\gamma \cdot W_t^{-\gamma} \quad (6.10)$$
$$\frac{\partial^2 V^*}{\partial W_t^2} = -f(t)^\gamma \cdot \gamma \cdot W_t^{-\gamma-1} \quad (6.11)$$
Substituting the guess solution in the PDE, we get the simple ODE:
$$f'(t) = \nu \cdot f(t) - 1 \quad (6.12)$$
where
$$\nu = \frac{\rho - (1-\gamma) \cdot \left(\frac{(\mu-r)^2}{2\sigma^2\gamma} + r\right)}{\gamma}$$
We note that the bequest function $B(T) = \epsilon^\gamma$ proves to be convenient in order to
fit the guess solution for $t = T$. This means the boundary condition for this ODE is
$f(T) = \epsilon$. Consequently, this ODE together with this boundary condition has a simple
enough solution, as follows:
$$f(t) = \begin{cases} \frac{1 + (\nu\epsilon - 1) \cdot e^{-\nu(T-t)}}{\nu} & \text{for } \nu \ne 0 \\ T - t + \epsilon & \text{for } \nu = 0 \end{cases} \quad (6.13)$$
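As a quick sanity check (a sketch of our own, not from the book's code), one can verify numerically that the expression for $f(t)$ in Equation (6.13) satisfies the ODE $f'(t) = \nu \cdot f(t) - 1$ and the boundary condition $f(T) = \epsilon$.

# Sketch: check that f(t) from Equation (6.13) satisfies f'(t) = nu * f(t) - 1
# and the boundary condition f(T) = epsilon, using a finite-difference derivative.
from math import exp, isclose

def f(t: float, nu: float, T: float, epsilon: float) -> float:
    if nu == 0:
        return T - t + epsilon
    return (1 + (nu * epsilon - 1) * exp(-nu * (T - t))) / nu

nu, T, epsilon, h = 0.03, 20.0, 1e-6, 1e-6
for t in [0.0, 5.0, 12.5]:
    f_prime = (f(t + h, nu, T, epsilon) - f(t - h, nu, T, epsilon)) / (2 * h)
    assert isclose(f_prime, nu * f(t, nu, T, epsilon) - 1, rel_tol=1e-4)
assert isclose(f(T, nu, T, epsilon), epsilon, rel_tol=1e-6)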
Substituting $V^*$ (from Equation (6.8)) and its partial derivatives (from Equations (6.9),
(6.10) and (6.11)) in Equations (6.5) and (6.6), we get:
$$\pi^*(t, W_t) = \frac{\mu - r}{\sigma^2 \gamma} \quad (6.14)$$
$$c^*(t, W_t) = \frac{W_t}{f(t)} = \begin{cases} \frac{\nu \cdot W_t}{1 + (\nu\epsilon - 1) \cdot e^{-\nu(T-t)}} & \text{for } \nu \ne 0 \\ \frac{W_t}{T - t + \epsilon} & \text{for } \nu = 0 \end{cases} \quad (6.15)$$
Finally, substituting the solution for $f(t)$ (Equation (6.13)) in Equation (6.8), we get:
$$V^*(t, W_t) = \begin{cases} \frac{(1 + (\nu\epsilon - 1) \cdot e^{-\nu(T-t)})^\gamma}{\nu^\gamma} \cdot \frac{W_t^{1-\gamma}}{1-\gamma} & \text{for } \nu \ne 0 \\ (T - t + \epsilon)^\gamma \cdot \frac{W_t^{1-\gamma}}{1-\gamma} & \text{for } \nu = 0 \end{cases} \quad (6.16)$$
Note that $f(t) > 0$ for all $0 \leq t < T$ (for all $\nu$) ensures $W_t > 0$, $c_t^* > 0$, $\frac{\partial^2 V^*}{\partial W_t^2} < 0$. This
ensures the constraints $W_t > 0$ and $c_t \geq 0$ are satisfied and the second-order conditions for
Φ are also satisfied. A very important lesson in solving Merton’s Portfolio problem is the
fact that the HJB Formulation is key and that this solution approach provides a template
for similar continuous-time stochastic control problems.
Note from Equation (6.14) that the Optimal Allocation $\pi^*(t, W_t)$ is a constant fraction of wealth (independent of both $t$ and $W_t$), which means that asset allocation is straightforward - we just need to keep re-balancing to maintain this
constant fraction of our wealth in the risky asset. We expect our wealth to grow over time
and so, the capital in the risky asset would also grow proportionately.
The form of the solution for $\pi^*(t, W_t)$ is extremely intuitive - the excess return of the
risky asset ($\mu - r$) shows up in the numerator, which makes sense, since one would expect
to invest a higher fraction of one’s wealth in the risky asset if it gives us a higher excess
return. It also makes sense that the volatility $\sigma$ of the risky asset (squared) shows up in
the denominator (the greater the volatility, the less we’d allocate to the risky asset, since
we are typically risk-averse, i.e., $\gamma > 0$). Likewise, it makes sense that the coefficient of
CRRA $\gamma$ shows up in the denominator since a more risk-averse individual (greater value
of $\gamma$) will want to invest less in the risky asset.
The Optimal Consumption Rate $c^*(t, W_t)$ should be conceptualized in terms of the Opti-
mal Fractional Consumption Rate, i.e., the Optimal Consumption Rate $c^*(t, W_t)$ as a fraction
of the Wealth $W_t$. Note that the Optimal Fractional Consumption Rate depends only on
$t$ (it is equal to $\frac{1}{f(t)}$). This means no matter what our wealth is, we should be extracting
a fraction of our wealth on a daily/monthly/yearly basis that is only dependent on our
age. Note also that if $\epsilon < \frac{1}{\nu}$, the Optimal Fractional Consumption Rate increases as time
progresses. This makes intuitive sense because when we have many more years to live,
we’d want to consume less and invest more to give the portfolio more ability to grow, and
when we get close to our death, we increase our consumption (since the optimal is “to die
broke,” assuming no bequest).
Now let us understand how the Wealth process evolves. Let us substitute for π ∗ (t, Wt )
(from Equation (6.14)) and c∗ (t, Wt ) (from Equation (6.15)) in the Wealth process defined
in Equation (6.1). This yields the following Wealth process W ∗ when we asset-allocate
optimally and consume optimally:
$$dW_t^* = \left(r + \frac{(\mu-r)^2}{\sigma^2\gamma} - \frac{1}{f(t)}\right) \cdot W_t^* \cdot dt + \frac{\mu - r}{\sigma\gamma} \cdot W_t^* \cdot dz_t \quad (6.17)$$
The first thing to note about this Wealth process is that it is a lognormal process of the
form covered in Section C.7 of Appendix C. The lognormal volatility (fractional dispersion)
of this wealth process is constant ($= \frac{\mu-r}{\sigma\gamma}$). The lognormal drift (fractional drift) is
independent of the wealth but is dependent on time ($= r + \frac{(\mu-r)^2}{\sigma^2\gamma} - \frac{1}{f(t)}$). From the solu-
tion of the general lognormal process derived in Section C.7 of Appendix C, we conclude
that:
$$\mathbb{E}[W_t^*] = W_0 \cdot e^{(r + \frac{(\mu-r)^2}{\sigma^2\gamma})t} \cdot e^{-\int_0^t \frac{du}{f(u)}} = \begin{cases} W_0 \cdot e^{(r + \frac{(\mu-r)^2}{\sigma^2\gamma})t} \cdot \left(1 - \frac{1 - e^{-\nu t}}{1 + (\nu\epsilon - 1) \cdot e^{-\nu T}}\right) & \text{if } \nu \ne 0 \\ W_0 \cdot e^{(r + \frac{(\mu-r)^2}{\sigma^2\gamma})t} \cdot \left(1 - \frac{t}{T + \epsilon}\right) & \text{if } \nu = 0 \end{cases} \quad (6.18)$$
Since we assume no bequest, we should expect the Wealth process to keep growing up
to some point in time and then fall all the way down to 0 when time runs out (i.e., when
$t = T$). We shall soon write the code for Equation (6.18) and plot the graph for this rise
and fall. An important point to note is that although the wealth process growth varies in
time (expected wealth growth rate $= r + \frac{(\mu-r)^2}{\sigma^2\gamma} - \frac{1}{f(t)}$ as seen from Equation (6.17)), the
variation (in time) of the wealth process growth is only due to the fractional consumption
rate varying in time. If we ignore the fractional consumption rate ($= \frac{1}{f(t)}$), then what
we get is the Expected Portfolio Annual Return of $r + \frac{(\mu-r)^2}{\sigma^2\gamma}$, which is a constant (does
not depend on either time $t$ or on Wealth $W_t^*$). Now let us write some code to calculate
the time-trajectories of Expected Wealth, Fractional Consumption Rate, Expected Wealth
Growth Rate and Expected Portfolio Annual Return.
The code should be pretty self-explanatory. We will just provide a few explanations of
variables in the code that may not be entirely obvious: portfolio_return calculates the
Expected Portfolio Annual Return, nu calculates the value of ν, f represents the function
f (t), wealth_growth_rate calculates the Expected Wealth Growth Rate as a function of time
t. The expected_wealth method assumes W0 = 1.
from dataclasses import dataclass
from math import exp

@dataclass(frozen=True)
class MertonPortfolio:
    mu: float
    sigma: float
    r: float
    rho: float
    horizon: float
    gamma: float
    epsilon: float = 1e-6

    def excess(self) -> float:
        return self.mu - self.r

    def variance(self) -> float:
        return self.sigma * self.sigma

    def allocation(self) -> float:
        return self.excess() / (self.gamma * self.variance())

    def portfolio_return(self) -> float:
        return self.r + self.allocation() * self.excess()

    def nu(self) -> float:
        # nu = (rho - (1 - gamma) * ((mu - r)^2 / (2 * sigma^2 * gamma) + r)) / gamma
        return (self.rho - (1 - self.gamma) *
                (self.r + self.allocation() * self.excess() / 2)) / self.gamma

    def f(self, time: float) -> float:
        remaining: float = self.horizon - time
        nu = self.nu()
        if nu == 0:
            ret = remaining + self.epsilon
        else:
            ret = (1 + (nu * self.epsilon - 1) * exp(-nu * remaining)) / nu
        return ret

    def fractional_consumption_rate(self, time: float) -> float:
        return 1 / self.f(time)

    def wealth_growth_rate(self, time: float) -> float:
        return self.portfolio_return() - self.fractional_consumption_rate(time)

    def expected_wealth(self, time: float) -> float:
        # Equation (6.18), with W_0 = 1
        base: float = exp(self.portfolio_return() * time)
        nu = self.nu()
        if nu == 0:
            ret = base * (1 - time / (self.horizon + self.epsilon))
        else:
            ret = base * (1 - (1 - exp(-nu * time)) /
                          (1 + (nu * self.epsilon - 1) *
                           exp(-nu * self.horizon)))
        return ret
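Here is a short driver sketch of our own (not the book's script; matplotlib is assumed as the plotting dependency, and the MertonPortfolio class above is assumed to be in scope) that evaluates the class for the input values used in Figures 6.1 and 6.2.

# Usage sketch (illustrative driver): evaluate MertonPortfolio for the inputs
# used in Figures 6.1 and 6.2 and plot the resulting time-trajectories.
import numpy as np
import matplotlib.pyplot as plt

mp = MertonPortfolio(mu=0.1, sigma=0.1, r=0.02, rho=0.01,
                     horizon=20.0, gamma=2.0)
times = np.linspace(0.0, mp.horizon - 0.1, 200)
consumption = [mp.fractional_consumption_rate(t) for t in times]
growth = [mp.wealth_growth_rate(t) for t in times]
wealth = [mp.expected_wealth(t) for t in times]

plt.plot(times, consumption, label="Fractional Consumption Rate")
plt.plot(times, growth, label="Expected Wealth Growth Rate")
plt.axhline(mp.portfolio_return(), linestyle="--",
            label="Expected Portfolio Annual Return")
plt.legend()
plt.show()

plt.plot(times, wealth, label="Expected Wealth (W0 = 1)")
plt.legend()
plt.show()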
Figure 6.1.: Portfolio Return and Consumption Rate
As time progresses, the Fractional Consumption Rate rises, and at some point it becomes greater than the Expected Portfolio
Annual Return. This illustrates how the optimal behavior is to consume modestly and in-
vest more when one is younger, then to gradually increase the consumption as one ages,
and finally to ramp up the consumption sharply when one is close to the end of one’s life.
Figure 6.1 shows the visual for this (along with the Expected Wealth Growth Rate) using
the above code for input values of: T = 20, µ = 10%, σ = 10%, r = 2%, ρ = 1%, γ = 2.0.
Figure 6.2 shows the time-trajectory of the expected wealth based on Equation (6.18) for
the same input values as listed above. Notice how the Expected Wealth rises in a convex
shape for several years since the consumption during all these years is quite modest, and
then the shape of the Expected Wealth curve turns concave at about 12 years, peaks at
about 16 years (when Fractional Consumption Rate rises to equal Expected Portfolio An-
nual Return), and then falls precipitously in the last couple of years (as the Consumption
increasingly drains the Wealth down to 0).
Figure 6.2.: Expected Wealth Time-Trajectory
$$U(W_T) = \frac{1 - e^{-a W_T}}{a} \text{ for some fixed } a \ne 0$$
Thus, the problem is to maximize, for each $t = 0, 1, \ldots, T-1$, over choices of $x_t \in \mathbb{R}$,
the value:
$$\mathbb{E}\left[\gamma^{T-t} \cdot \frac{1 - e^{-a W_T}}{a} \;\middle|\; (t, W_t)\right]$$
Since $\gamma^{T-t}$ and $a$ are constants, this is equivalent to maximizing, for each $t = 0, 1, \ldots, T-1$,
over choices of $x_t \in \mathbb{R}$, the value:
$$\mathbb{E}\left[\frac{-e^{-a W_T}}{a} \;\middle|\; (t, W_t)\right] \quad (6.19)$$
We formulate this problem as a Continuous States and Continuous Actions discrete-time
finite-horizon MDP by specifying its State Transitions, Rewards and Discount Factor pre-
cisely. The problem then is to solve the MDP’s Control problem to find the Optimal Policy.
The terminal time for the finite-horizon MDP is T and hence, all the states at time t =
T are terminal states. We shall follow the notation of finite-horizon MDPs that we had
covered in Section 3.13 of Chapter 3. The State st ∈ St at any time step t = 0, 1, . . . , T
consists of the wealth Wt . The decision (Action) at ∈ At at any time step t = 0, 1, . . . , T − 1
is the quantity of investment in the risky asset (= xt ). Hence, the quantity of investment
in the riskless asset at time t will be Wt − xt . A deterministic policy at time t (for all
t = 0, 1, . . . T − 1) is denoted as πt , and hence, we write: πt (Wt ) = xt . Likewise, an optimal
deterministic policy at time t (for all t = 0, 1, . . . , T − 1) is denoted as πt∗ , and hence, we
write: πt∗ (Wt ) = x∗t .
Denote the random variable for the single-time-step return of the risky asset from time
t to time t + 1 as Yt ∼ N (µ, σ 2 ) for all t = 0, 1, . . . T − 1. So,
Wt+1 = xt · (1 + Yt ) + (Wt − xt ) · (1 + r) = xt · (Yt − r) + Wt · (1 + r) (6.20)
for all t = 0, 1, . . . , T − 1.
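To make this state-transition concrete, here is a tiny sketch of our own (not the book's code, with illustrative parameter values) that samples $W_{t+1}$ from $W_t$ and $x_t$ according to Equation (6.20).

# Sketch (illustrative): sample the next wealth per Equation (6.20),
# W_{t+1} = x_t * (1 + Y_t) + (W_t - x_t) * (1 + r), with Y_t ~ N(mu, sigma^2).
import numpy as np

def sample_next_wealth(w_t: float, x_t: float, mu: float, sigma: float,
                       r: float, rng: np.random.Generator) -> float:
    y_t = rng.normal(loc=mu, scale=sigma)
    return x_t * (1 + y_t) + (w_t - x_t) * (1 + r)

rng = np.random.default_rng(0)
samples = [sample_next_wealth(1.0, 1.5, 0.13, 0.2, 0.07, rng) for _ in range(5)]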
The MDP Reward is 0 for all t = 0, 1, . . . , T − 1. As a result of the simplified objective
(6.19) above, the MDP Reward for t = T is the following random quantity:
$$\frac{-e^{-a W_T}}{a}$$
We set the MDP discount factor to be $\gamma = 1$ (again, because of the simplified objective
(6.19) above).
We denote the Value Function at time $t$ (for all $t = 0, 1, \ldots, T-1$) for a given policy
$\pi = (\pi_0, \pi_1, \ldots, \pi_{T-1})$ as:
$$V_t^\pi(W_t) = \mathbb{E}_\pi\left[\frac{-e^{-a W_T}}{a} \;\middle|\; (t, W_t)\right]$$
We denote the Optimal Value Function at time $t$ (for all $t = 0, 1, \ldots, T-1$) as:
$$V_t^*(W_t) = \max_\pi V_t^\pi(W_t) = \max_\pi\left\{\mathbb{E}_\pi\left[\frac{-e^{-a W_T}}{a} \;\middle|\; (t, W_t)\right]\right\}$$
The Bellman Optimality Equation is:
$$V_t^*(W_t) = \max_{x_t} Q_t^*(W_t, x_t) = \max_{x_t}\left\{\mathbb{E}_{Y_t \sim N(\mu,\sigma^2)}[V_{t+1}^*(W_{t+1})]\right\} \text{ for } t = 0, 1, \ldots, T-2$$
$$V_{T-1}^*(W_{T-1}) = \max_{x_{T-1}} Q_{T-1}^*(W_{T-1}, x_{T-1}) = \max_{x_{T-1}}\left\{\mathbb{E}_{Y_{T-1} \sim N(\mu,\sigma^2)}\left[\frac{-e^{-a W_T}}{a}\right]\right\}$$
where Q∗t is the Optimal Action-Value Function at time t for all t = 0, 1, . . . , T − 1.
We make an educated guess for the functional form of the Optimal Value Function as:
$$V_t^*(W_t) = -b_t \cdot e^{-c_t \cdot W_t} \quad (6.21)$$
where $b_t, c_t$ do not depend on the wealth $W_t$. Expressing the Bellman Optimality Equation with this functional form (and using the wealth transition of Equation (6.20)) gives:
$$V_t^*(W_t) = \max_{x_t}\left\{\mathbb{E}_{Y_t \sim N(\mu,\sigma^2)}\left[-b_{t+1} \cdot e^{-c_{t+1} \cdot (x_t \cdot (Y_t - r) + W_t \cdot (1+r))}\right]\right\}$$
The expectation of this exponential form (under the normal distribution) evaluates to:
$$V_t^*(W_t) = \max_{x_t}\left\{-b_{t+1} \cdot e^{-c_{t+1} \cdot (1+r) \cdot W_t - c_{t+1} \cdot (\mu - r) \cdot x_t + c_{t+1}^2 \cdot \frac{\sigma^2}{2} \cdot x_t^2}\right\} \quad (6.22)$$
Since Vt∗ (Wt ) = maxxt Q∗t (Wt , xt ), from Equation (6.22), we can infer the functional
form for Q∗t (Wt , xt ) in terms of bt+1 and ct+1 :
$$Q_t^*(W_t, x_t) = -b_{t+1} \cdot e^{-c_{t+1} \cdot (1+r) \cdot W_t - c_{t+1} \cdot (\mu - r) \cdot x_t + c_{t+1}^2 \cdot \frac{\sigma^2}{2} \cdot x_t^2} \quad (6.23)$$
Since the right-hand-side of the Bellman Optimality Equation (6.22) involves a max over
xt , we can say that the partial derivative of the term inside the max with respect to xt is 0.
This enables us to write the Optimal Allocation x∗t in terms of ct+1 , as follows:
$$-c_{t+1} \cdot (\mu - r) + \sigma^2 \cdot c_{t+1}^2 \cdot x_t^* = 0$$
$$\Rightarrow x_t^* = \frac{\mu - r}{\sigma^2 \cdot c_{t+1}} \quad (6.24)$$
Next we substitute this maximizing x∗t in the Bellman Optimality Equation (Equation
(6.22)):
$$V_t^*(W_t) = -b_{t+1} \cdot e^{-\frac{(\mu-r)^2}{2\sigma^2}} \cdot e^{-c_{t+1} \cdot (1+r) \cdot W_t}$$
But since $V_t^*(W_t) = -b_t \cdot e^{-c_t \cdot W_t}$ (Equation (6.21)), we can write the following recursive equations for $b_t$ and $c_t$:
$$b_t = b_{t+1} \cdot e^{-\frac{(\mu-r)^2}{2\sigma^2}}$$
$$c_t = c_{t+1} \cdot (1 + r)$$
We can calculate $b_{T-1}$ and $c_{T-1}$ from the knowledge of the MDP Reward $\frac{-e^{-a W_T}}{a}$ (Utility
of Terminal Wealth) at time $t = T$, which will enable us to unroll the above recursions for
$b_t$ and $c_t$ for all $t = 0, 1, \ldots, T-2$.
$$V_{T-1}^*(W_{T-1}) = \max_{x_{T-1}}\left\{\mathbb{E}_{Y_{T-1} \sim N(\mu,\sigma^2)}\left[\frac{-e^{-a W_T}}{a}\right]\right\}$$
From Equation (6.20), we can write this as:
$$V_{T-1}^*(W_{T-1}) = \frac{-e^{-\frac{(\mu-r)^2}{2\sigma^2}}}{a} \cdot e^{-a \cdot (1+r) \cdot W_{T-1}}$$
Therefore,
$$b_{T-1} = \frac{e^{-\frac{(\mu-r)^2}{2\sigma^2}}}{a}$$
$$c_{T-1} = a \cdot (1 + r)$$
Now we can unroll the above recursions for $b_t$ and $c_t$ for all $t = 0, 1, \ldots, T-2$ as:
$$b_t = \frac{e^{-\frac{(\mu-r)^2 \cdot (T-t)}{2\sigma^2}}}{a}$$
$$c_t = a \cdot (1+r)^{T-t}$$
Substituting the solution for ct+1 in Equation (6.24) gives us the solution for the Optimal
Policy:
$$\pi_t^*(W_t) = x_t^* = \frac{\mu - r}{\sigma^2 \cdot a \cdot (1+r)^{T-t-1}} \quad (6.25)$$
for all t = 0, 1, . . . , T −1. Note that the optimal action at time step t (for all t = 0, 1, . . . , T −
1) does not depend on the state Wt at time t (it only depends on the time t). Hence, the
optimal policy πt∗ (·) for a fixed time t is a constant deterministic policy function.
Substituting the solutions for bt and ct in Equation (6.21) gives us the solution for the
Optimal Value Function:
$$V_t^*(W_t) = \frac{-e^{-\frac{(\mu-r)^2 \cdot (T-t)}{2\sigma^2}}}{a} \cdot e^{-a(1+r)^{T-t} \cdot W_t} \quad (6.26)$$
for all t = 0, 1, . . . , T − 1.
Substituting the solutions for bt+1 and ct+1 in Equation (6.23) gives us the solution for
the Optimal Action-Value Function:
$$Q_t^*(W_t, x_t) = \frac{-e^{-\frac{(\mu-r)^2 \cdot (T-t-1)}{2\sigma^2}}}{a} \cdot e^{-a(1+r)^{T-t} \cdot W_t - a(\mu-r)(1+r)^{T-t-1} \cdot x_t + \frac{(a\sigma(1+r)^{T-t-1})^2}{2} \cdot x_t^2} \quad (6.27)$$
for all t = 0, 1, . . . , T − 1.
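Since Equations (6.25) and (6.26) are in closed form, they are easy to code up directly. The following is a small sketch of our own (not part of the book's codebase) that evaluates them; such a closed-form reference is handy for sanity-checking approximate solution methods, as we do later in this chapter.

# Sketch (not from the book's library): closed-form solutions of the
# discrete-time CARA problem, per Equations (6.25) and (6.26).
from math import exp

def optimal_allocation(t: int, T: int, mu: float, sigma: float,
                       r: float, a: float) -> float:
    # Equation (6.25): x*_t = (mu - r) / (sigma^2 * a * (1 + r)^(T - t - 1))
    return (mu - r) / (sigma * sigma * a * (1 + r) ** (T - t - 1))

def optimal_value(t: int, w_t: float, T: int, mu: float, sigma: float,
                  r: float, a: float) -> float:
    # Equation (6.26)
    return -exp(-(mu - r) ** 2 * (T - t) / (2 * sigma * sigma)) / a * \
        exp(-a * (1 + r) ** (T - t) * w_t)

# Example with the parameter values used later in this section
for t in range(4):
    print(t, optimal_allocation(t, T=4, mu=0.13, sigma=0.2, r=0.07, a=1.0))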
But real-world problems involving dynamic asset-allocation and consumption are not
so simple and clean. We have arbitrary, more complex asset price movements. Utility
functions don’t fit into simple CRRA/CARA formulas. In practice, trading often occurs
in discrete space - asset prices, allocation amounts and consumption are often discrete
quantities. Moreover, when we change our asset allocations or liquidate a portion of our
portfolio to consume, we incur transaction costs. Furthermore, trading doesn’t always
happen in continuous-time - there are typically specific windows of time where one is
locked-out from trading or there are trading restrictions. Lastly, many investments are
illiquid (eg: real-estate) or simply not allowed to be liquidated until a certain horizon (eg:
retirement funds), which poses major constraints on extracting money from one’s portfolio
for consumption. So even though prices/allocation amounts/consumption might be close
to being continuous-variables, the other above-mentioned frictions mean that we don’t get
the benefits of calculus that we obtained in the simple examples we covered.
With the above real-world considerations, we need to tap into Dynamic Programming
- more specifically, Approximate Dynamic Programming since real-world problems have
large state spaces and large action spaces (even if these spaces are not continuous, they
tend to be close to continuous). Appropriate function approximation of the Value Function
is key to solving these problems. Implementing a full-blown real-world investment and
consumption management system is beyond the scope of this book, but let us implement
an illustrative example that provides sufficient understanding of how a full-blown real-
world example would be implemented. We have to keep things simple enough and yet
sufficiently general. So here is the setting we will implement for:
• One risky asset and one riskless asset, with the risky asset’s return distribution and the riskless return allowed to be arbitrary and to vary across (discrete) time steps.
• A finite set of choices for the quantity of investment in the risky asset at each time step.
• An arbitrary utility function applied to the terminal wealth (no intermediate consumption).
The code in the class AssetAllocDiscrete below is fairly self-explanatory. We use the
function back_opt_qvf covered in Section 4.9 of Chapter 4 to perform backward induc-
tion on the optimal Q-Value Function. Since the state space is continuous, the optimal
Q-Value Function is represented as a QValueFunctionApprox (specifically, as a DNNApprox).
Moreover, since we are working with a generic distribution of returns that govern the
state transitions of this MDP, we need to work with the methods of the abstract class
MarkovDecisionProcess (and not the class FiniteMarkovDecisionProcess). The method
backward_induction_qvf below makes the call to back_opt_qvf. Since the risky returns
distribution is arbitrary and since the utility function is arbitrary, we don’t have prior
knowledge of the functional form of the Q-Value function. Hence, the user of the class
AssetAllocDiscrete also needs to provide the set of feature functions (feature_functions
in the code below) and the specification of a deep neural network to represent the Q-
Value function (dnn_spec in the code below). The rest of the code below is mainly about
preparing the input mdp_f0_mu_triples to be passed to back_opt_qvf. As was explained
in Section 4.9 of Chapter 4, mdp_f0_mu_triples is a sequence (for each time step) of the
following triples:
• A MarkovDecisionProcess[float, float] object for that time step, prepared by the method get_mdp.
• A QValueFunctionApprox[float, float] object, prepared by get_qvf_func_approx.
This method sets up a DNNApprox[Tuple[NonTerminal[float], float]] object that rep-
resents a neural-network function approximation for the optimal Q-Value Function.
So the input to this neural network would be a Tuple[NonTerminal[float], float]
representing a (state, action) pair.
• A SampledDistribution[NonTerminal[float]] object representing the distribution of non-terminal states for that time step, prepared by get_states_distribution.
                    sampler=sr_sampler_func,
                    expectation_samples=1000
                )

            def actions(self, wealth: NonTerminal[float]) -> Sequence[float]:
                return alloc_choices

        return AssetAllocMDP()

    def get_qvf_func_approx(self) -> \
            DNNApprox[Tuple[NonTerminal[float], float]]:
        adam_gradient: AdamGradient = AdamGradient(
            learning_rate=0.1,
            decay1=0.9,
            decay2=0.999
        )
        ffs: List[Callable[[Tuple[NonTerminal[float], float]], float]] = []
        for f in self.feature_functions:
            def this_f(pair: Tuple[NonTerminal[float], float], f=f) -> float:
                return f((pair[0].state, pair[1]))
            ffs.append(this_f)
        return DNNApprox.create(
            feature_functions=ffs,
            dnn_spec=self.dnn_spec,
            adam_gradient=adam_gradient
        )

    def get_states_distribution(self, t: int) -> \
            SampledDistribution[NonTerminal[float]]:
        actions_distr: Choose[float] = self.uniform_actions()

        def states_sampler_func() -> NonTerminal[float]:
            wealth: float = self.initial_wealth_distribution.sample()
            for i in range(t):
                distr: Distribution[float] = self.risky_return_distributions[i]
                rate: float = self.riskless_returns[i]
                alloc: float = actions_distr.sample()
                wealth = alloc * (1 + distr.sample()) + \
                    (wealth - alloc) * (1 + rate)
            return NonTerminal(wealth)

        return SampledDistribution(states_sampler_func)

    def backward_induction_qvf(self) -> \
            Iterator[QValueFunctionApprox[float, float]]:
        init_fa: DNNApprox[Tuple[NonTerminal[float], float]] = \
            self.get_qvf_func_approx()
        mdp_f0_mu_triples: Sequence[Tuple[
            MarkovDecisionProcess[float, float],
            DNNApprox[Tuple[NonTerminal[float], float]],
            SampledDistribution[NonTerminal[float]]
        ]] = [(
            self.get_mdp(i),
            init_fa,
            self.get_states_distribution(i)
        ) for i in range(self.time_steps())]
        num_state_samples: int = 300
        error_tolerance: float = 1e-6
        return back_opt_qvf(
            mdp_f0_mu_triples=mdp_f0_mu_triples,
            gamma=1.0,
            num_state_samples=num_state_samples,
            error_tolerance=error_tolerance
        )
The above code is in the file rl/chapter7/asset_alloc_discrete.py. We encourage you to
create a few different instances of AssetAllocDiscrete by varying its inputs (try different
return distributions, different utility functions, different action spaces). But how do we
know the code above is correct? We need a way to test it. A good test is to specialize the
inputs to fit the setting of Section 6.4 for which we have a closed-form solution to com-
pare against. So let us write some code to specialize the inputs to fit this setting. Since
the above code has been written with an educational motivation rather than an efficient-
computation motivation, the convergence of the backward induction ADP algorithm is
going to be slow. So we shall test it on a small number of time steps and provide some
assistance for fast convergence (using limited knowledge from the closed-form solution
in specifying the function approximation). We write code below to create an instance of
AssetAllocDiscrete with time steps T = 4, µ = 13%, σ = 20%, r = 7%, coefficient of
CARA a = 1.0. We set up risky_return_distributions as a sequence of identical Gaussian
distributions, riskless_returns as a sequence of identical riskless rate of returns, and
utility_func as a lambda parameterized by the coefficient of CARA a. We know from the
closed-form solution that the optimal allocation to the risky asset for each of time steps
t = 0, 1, 2, 3 is given by:
$$x_t^* = \frac{1.5}{1.07^{3-t}}$$
Therefore, we set risky_alloc_choices (action choices) in the range [1.0, 2.0] in incre-
ments of 0.1 to see if our code can hit the correct values within the 0.1 granularity of action
choices.
To specify feature_functions and dnn_spec, we need to leverage the functional form of
the closed-form solution for the Action-Value function (i.e., Equation (6.27)). We observe
that we can write this as:
$$Q_t^*(W_t, x_t) = -sign(a) \cdot e^{-(\alpha_0 + \alpha_1 \cdot W_t + \alpha_2 \cdot x_t + \alpha_3 \cdot x_t^2)}$$
where
$$\alpha_0 = \frac{(\mu-r)^2 \cdot (T-t-1)}{2\sigma^2} + \log(|a|)$$
$$\alpha_1 = a(1+r)^{T-t}$$
$$\alpha_2 = a(\mu-r)(1+r)^{T-t-1}$$
$$\alpha_3 = -\frac{(a\sigma(1+r)^{T-t-1})^2}{2}$$
This means the function approximation for $Q_t^*$ can be set up with a neural network with
no hidden layers, with the output layer activation function as $g(S) = -sign(a) \cdot e^{-S}$, and
with the feature functions as:
$$\phi_1((W_t, x_t)) = 1$$
$$\phi_2((W_t, x_t)) = W_t$$
$$\phi_3((W_t, x_t)) = x_t$$
$$\phi_4((W_t, x_t)) = x_t^2$$
We set initial_wealth_distribution to be a normal distribution with a mean of init_wealth
(set equal to 1.0 below) and a standard deviation of init_wealth_stdev (set equal to a
small value of 0.1 below).
from typing import Callable, Sequence, Tuple
import numpy as np
from rl.distribution import Gaussian
# The module paths below assume the book's rl library layout
from rl.function_approx import DNNSpec
from rl.chapter7.asset_alloc_discrete import AssetAllocDiscrete

steps: int = 4
mu: float = 0.13
sigma: float = 0.2
r: float = 0.07
a: float = 1.0
init_wealth: float = 1.0
init_wealth_stdev: float = 0.1

excess: float = mu - r
var: float = sigma * sigma
base_alloc: float = excess / (a * var)

risky_ret: Sequence[Gaussian] = [Gaussian(mu=mu, sigma=sigma)
                                 for _ in range(steps)]
riskless_ret: Sequence[float] = [r for _ in range(steps)]
utility_function: Callable[[float], float] = lambda x: - np.exp(-a * x) / a
alloc_choices: Sequence[float] = np.linspace(
    2 / 3 * base_alloc,
    4 / 3 * base_alloc,
    11
)
feature_funcs: Sequence[Callable[[Tuple[float, float]], float]] = \
    [
        lambda _: 1.,
        lambda w_x: w_x[0],
        lambda w_x: w_x[1],
        lambda w_x: w_x[1] * w_x[1]
    ]
dnn: DNNSpec = DNNSpec(
    neurons=[],
    bias=False,
    hidden_activation=lambda x: x,
    hidden_activation_deriv=lambda y: np.ones_like(y),
    output_activation=lambda x: - np.sign(a) * np.exp(-x),
    output_activation_deriv=lambda y: -y
)
init_wealth_distr: Gaussian = Gaussian(
    mu=init_wealth,
    sigma=init_wealth_stdev
)
aad: AssetAllocDiscrete = AssetAllocDiscrete(
    risky_return_distributions=risky_ret,
    riskless_returns=riskless_ret,
    utility_func=utility_function,
    risky_alloc_choices=alloc_choices,
    feature_functions=feature_funcs,
    dnn_spec=dnn,
    initial_wealth_distribution=init_wealth_distr
)
Next, we perform the Q-Value backward induction, step through the returned iterator
(fetching the Q-Value function for each time step from t = 0 to t = T − 1), and evaluate
the Q-values at the init_wealth (for each time step) for all alloc_choices. Performing a
max and arg max over the alloc_choices at the init_wealth gives us the Optimal Value
function and the Optimal Policy for each time step for wealth equal to init_wealth.
from operator import itemgetter
from pprint import pprint

# Iterate over the Q-Value functions returned by backward induction
# (one per time step), evaluating them at wealth = init_wealth.
for t, q in enumerate(aad.backward_induction_qvf()):
    print(f"Time {t:d}")
    print()
    opt_alloc: float = max(
        ((q((NonTerminal(init_wealth), ac)), ac) for ac in alloc_choices),
        key=itemgetter(0)
    )[1]
    val: float = max(q((NonTerminal(init_wealth), ac))
                     for ac in alloc_choices)
    print(f"Opt Risky Allocation = {opt_alloc:.3f}, Opt Val = {val:.3f}")
    print("Optimal Weights below:")
    for wts in q.weights:
        pprint(wts.weights)
    print()
Time 0
Time 1
Time 2
Time 3
print(f"W_t Weight = {w_t_wt:.3f}")
print(f"x_t Weight = {x_t_wt:.3f}")
print(f"x_t^2 Weight = {x_t2_wt:.3f}")
print()
Time 0
Time 1
Time 2
Time 3
As mentioned previously, this serves as a good test for the correctness of the implemen-
tation of AssetAllocDiscrete.
We need to point out here that the general case of dynamic asset allocation and consump-
tion for a large number of risky assets will involve a continuous-valued action space of
high dimension. This means ADP algorithms will have challenges in performing the
max / arg max calculation across this large and continuous action space. Even many of
the RL algorithms find it challenging to deal with very large action spaces. Sometimes we
can take advantage of the specifics of the control problem to overcome this challenge. But
in a general setting, these large/continuous action spaces require special types of RL algo-
rithms that are well suited to tackle such action spaces. One such class of RL algorithms
is Policy Gradient Algorithms that we shall learn in Chapter 12.
6.6. Key Takeaways from this Chapter
• A fundamental problem in Mathematical Finance is that of jointly deciding on A) op-
timal investment allocation (among risky and riskless investment assets) and B) op-
timal consumption, over a finite horizon. Merton, in his landmark paper from 1969,
provided an elegant closed-form solution under assumptions of continuous-time,
normal distribution of returns on the assets, CRRA utility, and frictionless transac-
tions.
• In a more general setting of the above problem, we need to model it as an MDP. If the
MDP is not too large and if the asset return distributions are known, we can employ
finite-horizon ADP algorithms to solve it. However, in typical real-world situations,
the action space can be quite large and the asset return distributions are unknown.
This points to RL, and specifically RL algorithms that are well suited to tackle large
action spaces (such as Policy Gradient Algorithms).
7. Derivatives Pricing and Hedging
In this chapter, we cover two applications of MDP Control regarding financial derivatives’
pricing and hedging (the word hedging refers to reducing or eliminating market risks as-
sociated with a derivative). The first application is to identify the optimal time/state to
exercise an American Option (a type of financial derivative) in an idealized market set-
ting (akin to the “frictionless” market setting of Merton’s Portfolio problem from Chapter
6). Optimal exercise of an American Option is the key to determining it’s fair price. The
second application is to identify the optimal hedging strategy for derivatives in real-world
situations (technically refered to as incomplete markets, a term we will define shortly). The
optimal hedging strategy of a derivative is the key to determining it’s fair price in the real-
world (incomplete market) setting. Both of these applications can be cast as Markov De-
cision Processes where the Optimal Policy gives the Optimal Hedging/Optimal Exercise
in the respective applications, leading to the fair price of the derivatives under considera-
tion. Casting these derivatives applications as MDPs means that we can tackle them with
Dynamic Programming or Reinforcement Learning algorithms, providing an interesting
and valuable alternative to the traditional methods of pricing derivatives.
In order to understand and appreciate the modeling of these derivatives applications as
MDPs, one requires some background in the classical theory of derivatives pricing. Unfor-
tunately, thorough coverage of this theory is beyond the scope of this book and we refer
you to Tomas Bjork’s book on Arbitrage Theory in Continuous Time (Björk 2005) for a
thorough understanding of this theory. We shall spend much of this chapter covering the
very basics of this theory, and in particular explaining the key technical concepts (such
as arbitrage, replication, risk-neutral measure, market-completeness etc.) in a simple and
intuitive manner. In fact, we shall cover the theory for the very simple case of discrete-
time with a single-period. While that is nowhere near enough to do justice to the rich
continuous-time theory of derivatives pricing and hedging, this is the best we can do in a
single chapter. The good news is that MDP-modeling of the two problems we want to solve
- optimal exercise of american options and optimal hedging of derivatives in a real-world
(incomplete market) setting - doesn’t require one to have a thorough understanding of
the classical theory. Rather, an intuitive understanding of the key technical and economic
concepts should suffice, which we bring to life in the simple setting of discrete-time with
a single-period. We start this chapter with a quick introduction to derivatives, next we
describe the simple setting of a single-period with formal mathematical notation, cover-
ing the key concepts (arbitrage, replication, risk-neutral measure, market-completeness
etc.), state and prove the all-important fundamental theorems of asset pricing (only for
the single-period setting), and finally show how these two derivatives applications can be
cast as MDPs, along with the appropriate algorithms to solve the MDPs.
7.1. A Brief Introduction to Derivatives
In this section, we provide a quick introduction to derivatives, and we refer you to the book by John Hull (Hull 2010) for a thorough coverage of Derivatives. The
term “Derivative” is based on the word “derived” - it refers to the fact that a derivative is
a financial instrument whose structure and hence, value is derived from the performance
of an underlying entity or entities (which we shall simply refer to as “underlying”). The
underlying can be pretty much any financial entity - it could be a stock, currency, bond,
basket of stocks, or something more exotic like another derivative. The term performance
also refers to something fairly generic - it could be the price of a stock or commodity, it
could be the interest rate a bond yields, it could be the average price of a stock over a time
interval, it could be a market-index, or it could be something more exotic like the implied
volatility of an option (which itself is a type of derivative). Technically, a derivative is a
legal contract between the derivative buyer and seller that either:
• Entitles the derivative buyer to cashflow (which we’ll refer to as derivative payoff ) at
future point(s) in time, with the payoff being contingent on the underlying’s perfor-
mance (i.e., the payoff is a precise mathematical function of the underlying’s perfor-
mance, eg: a function of the underlying’s price at a future point in time). This type
of derivative is known as a “lock-type” derivative.
• Provides the derivative buyer with choices at future points in time, upon making
which, the derivative buyer can avail of cashflow (i.e., payoff ) that is contingent on
the underlying’s performance. This type of derivative is known as an “option-type”
derivative (the word “option” referring to the choice or choices the buyer can make
to trigger the contingent payoff).
Although both “lock-type” and “option-type” derivatives can get very complex
(with contracts running over several pages of legal descriptions), we now illustrate both
these types of derivatives by going over the most basic derivative structures. In the fol-
lowing descriptions, current time (when the derivative is bought/sold) is denoted as time
t = 0.
7.1.1. Forwards
The most basic form of Forward Contract involves specification of:
• A future point in time t = T (we refer to T as the expiry of the Forward Contract).
• A fixed price K (the forward price) that the forward contract buyer commits to pay to the seller at time t = T in exchange for the underlying.
In addition, the contract establishes that at time t = T , the forward contract seller needs
to deliver the underlying (say a stock with price St at time t) to the forward contract buyer.
This means at time t = T , effectively the payoff for the buyer is ST −K (likewise, the payoff
for the seller is K −ST ). This is because the buyer, upon receiving the underlying from the
seller, can immediately sell the underlying in the market for the price of ST and so, would
have made a gain of ST − K (note ST − K can be negative, in which case the payoff for the
buyer is negative).
The problem of forward contract “pricing” is to determine the fair value of K so that
the price of this forward contract derivative at the time of contract creation is 0. As time
t progresses, the underlying price might fluctuate, which would cause a movement away
from the initial price of 0. If the underlying price increases, the price of the forward would
naturally increase (and if the underlying price decreases, the price of the forward would
naturally decrease). This is an example of a “lock-type” derivative since neither the buyer
nor the seller of the forward contract need to make any choices at time t = T . Rather, the
payoff for the buyer is determined directly by the formula ST − K and the payoff for the
seller is determined directly by the formula K − ST .
7.1.2. European Options
The most basic forms of European Options are European Call and Put Options. A European Call Option involves specification of:
• A future point in time t = T (we refer to T as the expiry of the Call Option).
• Underlying Price K known as Strike Price.
The contract gives the buyer (owner) of the European Call Option the right, but not the
obligation, to buy the underlying at time t = T at the price of K. Since the option owner
doesn’t have the obligation to buy, if the price ST of the underlying at time t = T ends
up being equal to or below K, the rational decision for the option owner would be to not
buy (at price K), which would result in a payoff of 0 (in this outcome, we say that the call
option is out-of-the-money). However, if ST > K, the option owner would make an instant
profit of ST − K by exercising her right to buy the underlying at the price of K. Hence, the
payoff in this case is ST − K (in this outcome, we say that the call option is in-the-money).
We can combine the two cases and say that the payoff is f (ST ) = max(ST −K, 0). Since the
payoff is always non-negative, the call option owner would need to pay for this privilege.
The amount the option owner would need to pay to own this call option is known as the
fair price of the call option. Identifying the value of this fair price is the highly celebrated
problem of Option Pricing (which you will learn more about as this chapter progresses).
A European Put Option is very similar to a European Call Option with the only differ-
ence being that the owner of the European Put Option has the right (but not the obliga-
tion) to sell the underlying at time t = T at the price of K. This means that the payoff is
f (ST ) = max(K − ST , 0). Payoffs for these Call and Put Options are known as “hockey-
stick” payoffs because if you plot the f (·) function, it is a flat line on the out-of-the-money
side and a sloped line on the in-the-money side. Such European Call and Put Options are
“Option-Type” (and not “Lock-Type”) derivatives since they involve a choice to be made
by the option owner (the choice of exercising the right to buy/sell at the Strike Price K).
However, it is possible to construct derivatives with the same payoff as these European
Call/Put Options by simply writing in the contract that the option owner will get paid
max(ST − K, 0) (in case of Call Option) or will get paid max(K − ST , 0) (in case of Put
Option) at time t = T . Such derivatives contracts do away with the option owner’s exer-
cise choice and hence, they are “Lock-Type” contracts. There is a subtle difference - setting
these derivatives up as “Option-Type” means the option owner might act “irrationally” -
the call option owner might mistakenly buy even if ST < K, or the call option owner might
for some reason forget/neglect to exercise her option even when ST > K. Setting up such
contracts as “Lock-Type” takes away the possibilities of these types of irrationalities from
the option owner.
A more general European Derivative involves an arbitrary function f (·) (generalizing
from the hockey-stick payoffs) and could be set up as “Option-Type” or “Lock-Type.”
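The payoffs described above are simple to express in code. The following sketch (our own illustration; the function names are ours) captures the forward, European call, and European put payoffs as functions of the underlying's price at expiry.

# Sketch: payoff functions (in terms of the underlying price S_T at expiry)
# for the basic derivatives described above.

def forward_payoff(s_T: float, k: float) -> float:
    return s_T - k               # buyer's payoff; the seller's payoff is k - s_T

def european_call_payoff(s_T: float, k: float) -> float:
    return max(s_T - k, 0.0)     # "hockey-stick" payoff

def european_put_payoff(s_T: float, k: float) -> float:
    return max(k - s_T, 0.0)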
7.1.3. American Options
The adjective “European” in the above Call and Put Options means that the payoff can happen only at a fixed point in time t = T. This is in contrast to
American Options. The most basic forms of American Options are American Call and Put
Options. American Call and Put Options are essentially extensions of the corresponding
European Call and Put Options by allowing the buyer (owner) of the American Option to
exercise the option to buy (in the case of Call) or sell (in the case of Put) at any time t ≤ T .
The allowance of exercise at any time at or before the expiry time T can often be a tricky
financial decision for the option owner. At each point in time when the American Option
is in-the-money (i.e., positive payoff upon exercise), the option owner might be tempted to
exercise and collect the payoff but might as well be thinking that if she waits, the option
might become more in-the-money (i.e., prospect of a bigger payoff if she waits for a while).
Hence, it’s clear that an American Option is always of the “Option-Type” (and not “Lock-
Type”) since the timing of the decision (option) to exercise is very important in the case
of an American Option. This also means that the problem of pricing an American Option
(the fair price the buyer would need to pay to own an American Option) is much harder
than the problem of pricing a European Option.
So what purpose do derivatives serve? There are actually many motivations for different
market participants, but we’ll just list two key motivations. The first reason is to protect
against adverse market movements that might damage the value of one’s portfolio (this is
known as hedging). As an example, buying a put option can reduce or eliminate the risk
associated with ownership of the underlying. The second reason is operational or financial
convenience in trading to express a speculative view of market movements. For instance, if
one thinks a stock will increase in value by 50% over the next two years, instead of paying
say $100,000 to buy the stock (hoping to make $50,000 after two years), one can simply
buy a call option on $100,000 of the stock (paying the option price of say $5,000). If the
stock price indeed appreciates by 50% after 2 years, one makes $50,000 - $5,000 = $45,000.
Although one made $5000 less than the alternative of simply buying the stock, the fact that
one needs to pay $5000 (versus $50,000) to enter into the trade means the potential return
on investment is much higher.
Next, we embark on the journey of learning how to value derivatives, i.e., how to figure
out the fair price that one would be willing to buy or sell the derivative for at any point
in time. As mentioned earlier, the general theory of derivatives pricing is quite rich and
elaborate (based on continuous-time stochastic processes), and we don’t cover it in this
book. Instead, we provide intuition for the core concepts underlying derivatives pricing
theory in the context of a simple, special case - that of discrete-time with a single-period.
We formalize this simple setting in the next section.
We denote the set of (finitely many) random outcomes at time $t = 1$ as $\Omega = \{\omega_1, \omega_2, \ldots, \omega_n\}$, with the real-world probabilities of these outcomes given by a probability measure
$$\mu: \Omega \to [0, 1]$$
such that
$$\sum_{i=1}^n \mu(\omega_i) = 1$$
This simple single-period setting involves $m + 1$ fundamental assets $A_0, A_1, \ldots, A_m$
where $A_0$ is a riskless asset (i.e., its price evolves deterministically from $t = 0$ to $t = 1$)
and $A_1, \ldots, A_m$ are risky assets. We denote the Spot Price (at $t = 0$) of $A_j$ as $S_j^{(0)}$ for all
$j = 0, 1, \ldots, m$. We denote the Price of $A_j$ in outcome $\omega_i$ (at $t = 1$) as $S_j^{(i)}$ for all $j = 0, \ldots, m$ and $i = 1, \ldots, n$.
Assume that all asset prices are real numbers, i.e., in $\mathbb{R}$ (negative prices are typically unrealistic,
but we allow them for simplicity of exposition). For convenience, we normalize
the Spot Price (at $t = 0$) of the riskless asset $A_0$ to be 1. Therefore,
$$S_0^{(0)} = 1 \text{ and } S_0^{(i)} = 1 + r \text{ for all } i = 1, \ldots, n$$
where $r$ represents the constant riskless rate of growth. We should interpret this riskless
rate of growth as the “time value of money” and $\frac{1}{1+r}$ as the riskless discount factor corre-
sponding to the “time value of money.”
We represent a portfolio as a vector $\theta = (\theta_0, \theta_1, \ldots, \theta_m) \in \mathbb{R}^{m+1}$, where $\theta_j$ denotes the number of units held in asset $A_j$. The Value of the portfolio $\theta$ at $t = 0$, denoted $V_\theta^{(0)}$, is:
$$V_\theta^{(0)} = \sum_{j=0}^m \theta_j \cdot S_j^{(0)} \quad (7.1)$$
The Value of portfolio $\theta$ in random outcome $\omega_i$ (at $t = 1$), denoted by $V_\theta^{(i)}$, is:
$$V_\theta^{(i)} = \sum_{j=0}^m \theta_j \cdot S_j^{(i)} \text{ for all } i = 1, \ldots, n \quad (7.2)$$
We define an Arbitrage Portfolio as a portfolio $\theta$ that satisfies the following three conditions:
• $V_\theta^{(0)} \leq 0$
• $V_\theta^{(i)} \geq 0$ for all $i = 1, \ldots, n$
• There exists an $i \in \{1, \ldots, n\}$ such that $\mu(\omega_i) > 0$ and $V_\theta^{(i)} > 0$
Thus, with an Arbitrage Portfolio, we never end up (at t = 1) with less value than what
we start with (at t = 0), and we end up with expected value strictly greater than what
we start with. This is the formalism of the notion of arbitrage, i.e., “making money from
nothing.” Arbitrage allows market participants to make infinite returns. In an efficient
market, arbitrage would disappear as soon as it appears since market participants would
immediately exploit it and through the process of exploiting the arbitrage, immediately
eliminate the arbitrage. Hence, Finance Theory typically assumes “arbitrage-free” markets
(i.e., financial markets with no arbitrage opportunities).
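The three conditions above translate directly into code. Here is a minimal sketch of our own (not from the book's library; the function name and the numeric example are ours) that checks whether a candidate portfolio is an arbitrage portfolio in this single-period setting.

# Sketch: check the three arbitrage-portfolio conditions in the single-period
# setting. spot[j] = S_j^(0), prices[i][j] = S_j^(i), mu[i] = mu(omega_i).
from typing import Sequence

def is_arbitrage_portfolio(theta: Sequence[float],
                           spot: Sequence[float],
                           prices: Sequence[Sequence[float]],
                           mu: Sequence[float]) -> bool:
    v0 = sum(th * s for th, s in zip(theta, spot))
    v = [sum(th * s for th, s in zip(theta, row)) for row in prices]
    return v0 <= 0 and all(vi >= 0 for vi in v) and \
        any(p > 0 and vi > 0 for p, vi in zip(mu, v))

# Two assets (riskless + one risky), two outcomes: shorting the riskless asset
# to buy an underpriced risky asset is an arbitrage here.
print(is_arbitrage_portfolio(
    theta=[-100.0, 1.0],
    spot=[1.0, 100.0],
    prices=[[1.05, 110.0], [1.05, 106.0]],
    mu=[0.5, 0.5]
))  # True: the portfolio costs 0 at t = 0 and pays 5.0 or 1.0 at t = 1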
Next, we describe another very important concept in Mathematical Economics/Finance
- the concept of a Risk-Neutral Probability Measure. Consider a Probability Distribution π :
$\Omega \to [0, 1]$ such that
$$\sum_{i=1}^n \pi(\omega_i) = 1$$
Then, π is said to be a Risk-Neutral Probability Measure if:
$$S_j^{(0)} = \frac{1}{1+r} \cdot \sum_{i=1}^n \pi(\omega_i) \cdot S_j^{(i)} \text{ for all } j = 0, 1, \ldots, m \quad (7.3)$$
So for each of the m + 1 assets, the asset spot price (at t = 0) is the riskless rate-discounted
expectation (under π) of the asset price at t = 1. The term “risk-neutral” here is the same
as the term “risk-neutral” we used in Chapter 5, meaning it’s a situation where one doesn’t
need to be compensated for taking risk (the situation of a linear utility function). How-
ever, we are not saying that the market is risk-neutral - if that were the case, the market
probability measure µ would be a risk-neutral probability measure. We are simply defin-
ing π as a hypothetical construct under which each asset’s spot price is equal to the riskless
rate-discounted expectation (under π) of the asset’s price at t = 1. This means that under
the hypothetical π, there’s no return in excess of r for taking on the risk of probabilistic out-
comes at t = 1 (note: outcome probabilities are governed by the hypothetical π). Hence,
we refer to π as a risk-neutral probability measure. The purpose of this hypothetical con-
struct π is that it helps in the development of Derivatives Pricing and Hedging Theory, as
we shall soon see. The actual probabilities of outcomes in Ω are governed by µ, and not π.
Before we cover the two fundamental theorems of asset pricing, we need to cover an
important lemma that we will utilize in the proofs of the two fundamental theorems of
asset pricing.
Lemma 7.3.1. For any portfolio θ = (θ_0, θ_1, ..., θ_m) ∈ R^{m+1} and any risk-neutral probability measure π : Ω → [0, 1],

$$V_\theta^{(0)} = \frac{1}{1+r} \cdot \sum_{i=1}^n \pi(\omega_i) \cdot V_\theta^{(i)}$$
Proof. Using Equations (7.1), (7.3) and (7.2), the proof is straightforward:

$$V_\theta^{(0)} = \sum_{j=0}^m \theta_j \cdot S_j^{(0)} = \sum_{j=0}^m \theta_j \cdot \frac{1}{1+r} \cdot \sum_{i=1}^n \pi(\omega_i) \cdot S_j^{(i)} = \frac{1}{1+r} \cdot \sum_{i=1}^n \pi(\omega_i) \cdot \sum_{j=0}^m \theta_j \cdot S_j^{(i)} = \frac{1}{1+r} \cdot \sum_{i=1}^n \pi(\omega_i) \cdot V_\theta^{(i)}$$
Now we are ready to cover the two fundamental theorems of asset pricing (sometimes also referred to as the fundamental theorems of arbitrage and the fundamental theorems of finance!). We start with the first fundamental theorem of asset pricing, which associates absence of arbitrage with existence of a risk-neutral probability measure.
Theorem 7.4.1 (First Fundamental Theorem of Asset Pricing (1st FTAP)). Our simple setting of discrete-time with a single-period will not admit arbitrage portfolios if and only if there exists a Risk-Neutral Probability Measure.
Proof. First we prove the easy implication - if there exists a Risk-Neutral Probability Mea-
sure π, then we cannot have any arbitrage portfolios. Let’s review what it takes to have an
arbitrage portfolio θ = (θ0 , θ1 , . . . , θm ). The following are two of the three conditions to
be satisfied to qualify as an arbitrage portfolio θ (according to the definition of arbitrage
portfolio we gave above):
• V_θ^{(i)} ≥ 0 for all i = 1, ..., n
• There exists an i ∈ {1, ..., n} such that µ(ω_i) > 0 (⇒ π(ω_i) > 0) and V_θ^{(i)} > 0

But if these two conditions are satisfied, the third condition V_θ^{(0)} ≤ 0 cannot be satisfied because from Lemma (7.3.1), we know that:

$$V_\theta^{(0)} = \frac{1}{1+r} \cdot \sum_{i=1}^n \pi(\omega_i) \cdot V_\theta^{(i)}$$
which is strictly greater than 0, given the two conditions stated above. Hence, all three
conditions cannot be simultaneously satisfied which eliminates the possibility of arbitrage
for any portfolio θ.
Next, we prove the reverse (harder to prove) implication - if a risk-neutral probability
measure doesn’t exist, then there exists an arbitrage portfolio θ. We define V ⊂ Rm as the
set of vectors v = (v1 , . . . , vm ) such that
$$v_j = \frac{1}{1+r} \cdot \sum_{i=1}^n \mu(\omega_i) \cdot S_j^{(i)} \text{ for all } j = 1, \ldots, m$$

with V defined as spanning over all possible probability distributions µ : Ω → [0, 1]. V is a bounded, closed, convex polytope in R^m. By the definition of a risk-neutral probability measure, we can say that if a risk-neutral probability measure doesn't exist, the vector (S_1^{(0)}, ..., S_m^{(0)}) ∉ V. The Hyperplane Separation Theorem implies that there exists a non-zero vector (θ_1, ..., θ_m) such that for any v = (v_1, ..., v_m) ∈ V,

$$\sum_{j=1}^m \theta_j \cdot v_j > \sum_{j=1}^m \theta_j \cdot S_j^{(0)}$$
In particular, consider vectors v corresponding to the corners of V, those for which the full
probability mass is on a particular ωi ∈ Ω, i.e.,
$$\sum_{j=1}^m \theta_j \cdot \Big(\frac{1}{1+r} \cdot S_j^{(i)}\Big) > \sum_{j=1}^m \theta_j \cdot S_j^{(0)} \text{ for all } i = 1, \ldots, n$$

Hence, we can choose a θ_0 such that:

$$\sum_{j=1}^m \theta_j \cdot \Big(\frac{1}{1+r} \cdot S_j^{(i)}\Big) > -\theta_0 > \sum_{j=1}^m \theta_j \cdot S_j^{(0)} \text{ for all } i = 1, \ldots, n$$

Therefore, noting that S_0^{(0)} = 1 and S_0^{(i)} = 1 + r,

$$\frac{1}{1+r} \cdot \sum_{j=0}^m \theta_j \cdot S_j^{(i)} > 0 > \sum_{j=0}^m \theta_j \cdot S_j^{(0)} \text{ for all } i = 1, \ldots, n$$
This can be rewritten in terms of the Values of portfolio θ = (θ_0, θ_1, ..., θ_m) at t = 0 and t = 1, as follows:

$$\frac{1}{1+r} \cdot V_\theta^{(i)} > 0 > V_\theta^{(0)} \text{ for all } i = 1, \ldots, n$$

Thus, we can see that all three conditions in the definition of arbitrage portfolio are satisfied and hence, θ = (θ_0, θ_1, ..., θ_m) is an arbitrage portfolio.
Now we are ready to move on to the second fundamental theorem of asset pricing, which
associates replication of derivatives with a unique risk-neutral probability measure.
A portfolio θ = (θ_0, θ_1, ..., θ_m) whose value at t = 1 matches the payoff of a derivative D in every random outcome (i.e., V_θ^{(i)} = V_D^{(i)} for all i = 1, ..., n) is called a replicating portfolio for D. The negatives of the components (θ_0, θ_1, ..., θ_m) are known as the hedges for D since they can be used to offset the risk in the payoff of D at t = 1.
Definition 7.5.3. An arbitrage-free market (i.e., a market devoid of arbitrage) is said to be
Complete if every derivative in the market has a replicating portfolio.
Theorem 7.5.1 (Second Fundamental Theorem of Asset Pricing (2nd FTAP)). A market (in
our simple setting of discrete-time with a single-period) is Complete if and only if there is a unique
Risk-Neutral Probability Measure.
Proof. We will first prove that in an arbitrage-free market, if every derivative has a replicat-
ing portfolio (i.e., the market is complete), then there is a unique risk-neutral probability
measure. We define n special derivatives (known as Arrow-Debreu securities), one for each
random outcome in Ω at t = 1. We define the time t = 1 payoff of Arrow-Debreu security
Dk (for each of k = 1, . . . , n) as follows:
$$V_{D_k}^{(i)} = \mathbb{I}_{i=k} \text{ for all } i = 1, \ldots, n$$

where I represents the indicator function. This means the payoff of derivative D_k is 1 for random outcome ω_k and 0 for all other random outcomes.
Since each derivative has a replicating portfolio, denote θ^{(k)} = (θ_0^{(k)}, θ_1^{(k)}, ..., θ_m^{(k)}) as the replicating portfolio for D_k, for each k = 1, ..., n. Therefore, for each k = 1, ..., n:

$$V_{\theta^{(k)}}^{(i)} = \sum_{j=0}^m \theta_j^{(k)} \cdot S_j^{(i)} = V_{D_k}^{(i)} = \mathbb{I}_{i=k} \text{ for all } i = 1, \ldots, n$$
Using Lemma (7.3.1), we can write the following equation for any risk-neutral probability measure π, for each k = 1, ..., n:

$$\sum_{j=0}^m \theta_j^{(k)} \cdot S_j^{(0)} = V_{\theta^{(k)}}^{(0)} = \frac{1}{1+r} \cdot \sum_{i=1}^n \pi(\omega_i) \cdot V_{\theta^{(k)}}^{(i)} = \frac{1}{1+r} \cdot \sum_{i=1}^n \pi(\omega_i) \cdot \mathbb{I}_{i=k} = \frac{1}{1+r} \cdot \pi(\omega_k)$$

We note that the above equation is satisfied by a unique π : Ω → [0, 1], defined as:

$$\pi(\omega_k) = (1+r) \cdot \sum_{j=0}^m \theta_j^{(k)} \cdot S_j^{(0)} \text{ for all } k = 1, \ldots, n$$
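To make this construction concrete, the following small sketch (illustrative only, not from the book's codebase) takes a toy complete market with as many fundamental assets as outcomes, solves for the replicating portfolio of each Arrow-Debreu security, and recovers the unique risk-neutral probability measure via the formula above:

import numpy as np

# Toy complete market: n = 2 outcomes, m + 1 = 2 assets (riskless + 1 risky)
r = 0.05
spot = np.array([1.0, 100.0])       # S_j^(0)
t1 = np.array([[1.0 + r, 120.0],    # S_j^(1)
               [1.0 + r,  90.0]])   # S_j^(2)
n = t1.shape[0]

pi = np.zeros(n)
for k in range(n):
    payoff = np.eye(n)[k]                      # Arrow-Debreu security D_k
    theta_k = np.linalg.solve(t1, payoff)      # replicating portfolio of D_k
    pi[k] = (1.0 + r) * np.dot(theta_k, spot)  # pi(omega_k)

print(pi, pi.sum())  # risk-neutral probabilities, summing to 1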
For the other direction, assume the market is incomplete, i.e., there exists a derivative D (with payoff V_D^{(i)} in random outcome ω_i at t = 1) that does not have a replicating portfolio. Define the vectors v_j = (S_j^{(1)}, ..., S_j^{(n)}) ∈ R^n for each j = 0, 1, ..., m, and the payoff vector v = (V_D^{(1)}, ..., V_D^{(n)}) ∈ R^n. Since D does not have a replicating portfolio, v is not in the span of {v_0, v_1, ..., v_m}, which means {v_0, v_1, ..., v_m} do not span R^n. Hence, there exists a non-zero vector u = (u_1, ..., u_n) ∈ R^n orthogonal to each of v_0, v_1, ..., v_m, i.e.,

$$\sum_{i=1}^n u_i \cdot S_j^{(i)} = 0 \text{ for all } j = 0, 1, \ldots, m \quad (7.5)$$

Note that S_0^{(i)} = 1 + r for all i = 1, ..., n and so,

$$\sum_{i=1}^n u_i = 0 \quad (7.6)$$
Let π be a risk-neutral probability measure (which exists by the 1st FTAP, since the market is assumed to be arbitrage-free). Using the vector u, we can construct a different probability measure π′ as follows:
• Construct π′(ω_i) > 0 for each i where π(ω_i) > 0 by making ϵ > 0 sufficiently small, and set π′(ω_i) = 0 for each i where π(ω_i) = 0
Equations (7.3), (7.5) and (7.6) then ensure that π′ satisfies the risk-neutral condition for each of the fundamental assets:

$$\frac{1}{1+r} \cdot \sum_{i=1}^n \pi'(\omega_i) \cdot S_j^{(i)} = \frac{1}{1+r} \cdot \sum_{i=1}^n \pi(\omega_i) \cdot S_j^{(i)} + \frac{\epsilon}{1+r} \cdot \sum_{i=1}^n u_i \cdot S_j^{(i)} = S_j^{(0)} \text{ for all } j = 0, 1, \ldots, m$$

So π′ is also a risk-neutral probability measure, and it differs from π. Hence, if the risk-neutral probability measure is unique, every derivative must have a replicating portfolio, i.e., the market is complete.
Together, the two fundamental theorems of asset pricing give us the following classification of markets:
• Market with arbitrage ⇔ No risk-neutral probability measure
• Complete (arbitrage-free) market ⇔ Unique risk-neutral probability measure
• Incomplete (arbitrage-free) market ⇔ Multiple risk-neutral probability measures
The next topic is derivatives pricing, which is based on the concepts of replication of derivatives and risk-neutral probability measures, and so is tied to the concepts of arbitrage and completeness.
Theorem 7.6.1. For our simple setting of discrete-time with a single-period, if the market is complete, then the price (at t = 0) of any derivative D (with payoffs V_D^{(i)} at t = 1) is the value of its replicating portfolio θ at t = 0, which can be expressed as the riskless rate-discounted expectation (under the risk-neutral probability measure π) of the derivative's payoff at t = 1:

$$V_D^{(0)} = \frac{1}{1+r} \cdot \sum_{i=1}^n \pi(\omega_i) \cdot V_D^{(i)} \quad (7.9)$$
Proof. It seems quite reasonable that since θ is the replicating portfolio for D, the value of the replicating portfolio at time t = 0 (equal to V_θ^{(0)} = Σ_{j=0}^m θ_j · S_j^{(0)}) should be the price (at t = 0) of derivative D. However, we will formalize the proof by first arguing that any candidate derivative price for D other than V_θ^{(0)} leads to arbitrage, thus dismissing those other candidate derivative prices, and then arguing that with V_θ^{(0)} as the price of derivative D, we eliminate the possibility of an arbitrage position involving D.
Consider candidate derivative prices V_θ^{(0)} − x for any positive real number x. The position (1, −θ_0 + x, −θ_1, ..., −θ_m) (where the first component denotes units of D and the remaining components denote units of the fundamental assets A_0, A_1, ..., A_m) has value x·(1+r) > 0 in each of the random outcomes at t = 1. But this position has spot (t = 0) value of 0, which means this is an Arbitrage Position, rendering these candidate derivative prices invalid. Next consider candidate derivative prices V_θ^{(0)} + x for any positive real number x. The position (−1, θ_0 + x, θ_1, ..., θ_m) has value x·(1+r) > 0 in each of the random outcomes at t = 1. But this position has spot (t = 0) value of 0, which means this is an Arbitrage Position, rendering these candidate derivative prices invalid as well. So every candidate derivative price other than V_θ^{(0)} is invalid. Now our goal is to establish V_θ^{(0)} as the derivative price of D by showing that we eliminate the possibility of an arbitrage position in the market involving D if V_θ^{(0)} is indeed the derivative price.
Firstly, note that V_θ^{(0)} can be expressed as the riskless rate-discounted expectation (under π) of the payoff of D at t = 1, i.e.,

$$V_\theta^{(0)} = \sum_{j=0}^m \theta_j \cdot S_j^{(0)} = \sum_{j=0}^m \theta_j \cdot \frac{1}{1+r} \cdot \sum_{i=1}^n \pi(\omega_i) \cdot S_j^{(i)} = \frac{1}{1+r} \cdot \sum_{i=1}^n \pi(\omega_i) \cdot \sum_{j=0}^m \theta_j \cdot S_j^{(i)} = \frac{1}{1+r} \cdot \sum_{i=1}^n \pi(\omega_i) \cdot V_D^{(i)} \quad (7.10)$$
Now consider a general position γ_D consisting of α units of the derivative D together with a portfolio β = (β_0, β_1, ..., β_m) of the fundamental assets. With the derivative priced at V_D^{(0)}, its value at t = 0 is:

$$V_{\gamma_D}^{(0)} = \alpha \cdot V_D^{(0)} + \sum_{j=0}^m \beta_j \cdot S_j^{(0)} \quad (7.11)$$

and its value in random outcome ω_i (at t = 1) is:

$$V_{\gamma_D}^{(i)} = \alpha \cdot V_D^{(i)} + \sum_{j=0}^m \beta_j \cdot S_j^{(i)} \text{ for all } i = 1, \ldots, n \quad (7.12)$$

Combining the linearity in Equations (7.3), (7.10), (7.11) and (7.12), we get:

$$V_{\gamma_D}^{(0)} = \frac{1}{1+r} \cdot \sum_{i=1}^n \pi(\omega_i) \cdot V_{\gamma_D}^{(i)} \quad (7.13)$$
So the position spot value (at t = 0) is the riskless rate-discounted expectation (under π) of the position value at t = 1. For any γ_D (containing any arbitrary portfolio β), with derivative price V_D^{(0)} equal to V_θ^{(0)}, if the following two conditions are satisfied:
• V_{γ_D}^{(i)} ≥ 0 for all i = 1, ..., n
• There exists an i ∈ {1, ..., n} such that µ(ω_i) > 0 (⇒ π(ω_i) > 0) and V_{γ_D}^{(i)} > 0
then:

$$V_{\gamma_D}^{(0)} = \frac{1}{1+r} \cdot \sum_{i=1}^n \pi(\omega_i) \cdot V_{\gamma_D}^{(i)} > 0$$
This eliminates any arbitrage possibility if D is priced at V_θ^{(0)}.
To summarize, we have eliminated all candidate derivative prices other than V_θ^{(0)}, and we have established the price V_θ^{(0)} as the correct price of D in the sense that we eliminate the possibility of an arbitrage position involving D if the price of D is V_θ^{(0)}.
Finally, we note that with the derivative price V_D^{(0)} = V_θ^{(0)}, from Equation (7.10), we have:

$$V_D^{(0)} = \frac{1}{1+r} \cdot \sum_{i=1}^n \pi(\omega_i) \cdot V_D^{(i)}$$
Now let us consider the special case of 1 risky asset (m = 1) and 2 random outcomes
(n = 2), which we will show is a Complete Market. To lighten notation, we drop the
subscript 1 on the risky asset price. Without loss of generality, we assume S (1) < S (2) .
No-arbitrage requires:
S (1) ≤ (1 + r) · S (0) ≤ S (2)
Assuming absence of arbitrage and invoking 1st FTAP, there exists a risk-neutral proba-
bility measure π such that:
$$S^{(0)} = \frac{1}{1+r} \cdot (\pi(\omega_1) \cdot S^{(1)} + \pi(\omega_2) \cdot S^{(2)})$$
π(ω1 ) + π(ω2 ) = 1
With 2 linear equations and 2 variables, this has a straightforward solution, as follows:
$$\pi(\omega_1) = \frac{S^{(2)} - (1+r) \cdot S^{(0)}}{S^{(2)} - S^{(1)}}, \qquad \pi(\omega_2) = \frac{(1+r) \cdot S^{(0)} - S^{(1)}}{S^{(2)} - S^{(1)}}$$
Conditions S (1) < S (2) and S (1) ≤ (1 + r) · S (0) ≤ S (2) ensure that 0 ≤ π(ω1 ), π(ω2 ) ≤ 1.
Also note that this is a unique solution for π(ω1 ), π(ω2 ), which means that the risk-neutral
probability measure is unique, implying that this is a complete market.
We can use these probabilities to price a derivative D as:
$$V_D^{(0)} = \frac{1}{1+r} \cdot (\pi(\omega_1) \cdot V_D^{(1)} + \pi(\omega_2) \cdot V_D^{(2)})$$

Now let us try to form a replicating portfolio (θ_0, θ_1) for D:

$$V_D^{(1)} = \theta_0 \cdot (1+r) + \theta_1 \cdot S^{(1)}$$
$$V_D^{(2)} = \theta_0 \cdot (1+r) + \theta_1 \cdot S^{(2)}$$

Solving this yields the Replicating Portfolio (θ_0, θ_1) as follows:

$$\theta_0 = \frac{1}{1+r} \cdot \frac{V_D^{(1)} \cdot S^{(2)} - V_D^{(2)} \cdot S^{(1)}}{S^{(2)} - S^{(1)}} \quad \text{and} \quad \theta_1 = \frac{V_D^{(2)} - V_D^{(1)}}{S^{(2)} - S^{(1)}} \quad (7.14)$$

Note that the derivative price can also be expressed as:

$$V_D^{(0)} = \theta_0 + \theta_1 \cdot S^{(0)}$$
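The two equivalent pricing routes above (risk-neutral expectation and replication) are easy to verify numerically. Here is a small sketch (with made-up numbers, not from the book's codebase) for a call-style payoff in this 1-risky-asset, 2-outcomes setting:

r = 0.05
s0, s1, s2 = 100.0, 90.0, 120.0          # S^(0), and S^(1) < S^(2)
payoff = lambda s: max(s - 105.0, 0.0)   # derivative payoff at t = 1
v1, v2 = payoff(s1), payoff(s2)

# Risk-neutral probabilities from the closed-form solution above
pi1 = (s2 - (1 + r) * s0) / (s2 - s1)
pi2 = ((1 + r) * s0 - s1) / (s2 - s1)
price_rn = (pi1 * v1 + pi2 * v2) / (1 + r)

# Replicating portfolio (theta0, theta1) from Equation (7.14)
theta1 = (v2 - v1) / (s2 - s1)
theta0 = (v1 * s2 - v2 * s1) / ((1 + r) * (s2 - s1))
price_rep = theta0 + theta1 * s0

print(price_rn, price_rep)  # the two prices coincide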
7.6.2. Derivatives Pricing when Market is Incomplete
Theorem (7.6.1) assumed a complete market, but what about an incomplete market? Re-
call that an incomplete market means some derivatives can’t be replicated. Absence of
a replicating portfolio for a derivative precludes usual no-arbitrage arguments. The 2nd
FTAP says that in an incomplete market, there are multiple risk-neutral probability mea-
sures which means there are multiple derivative prices (each consistent with no-arbitrage).
To develop intuition for derivatives pricing when the market is incomplete, let us con-
sider the special case of 1 risky asset (m = 1) and 3 random outcomes (n = 3), which we
will show is an Incomplete Market. To lighten notation, we drop the subscript 1 on the
risky asset price. Without loss of generality, we assume S (1) < S (2) < S (3) . No-arbitrage
requires:
S (1) ≤ S (0) · (1 + r) ≤ S (3)
Assuming absence of arbitrage and invoking the 1st FTAP, there exists a risk-neutral probability measure π such that:

$$S^{(0)} = \frac{1}{1+r} \cdot (\pi(\omega_1) \cdot S^{(1)} + \pi(\omega_2) \cdot S^{(2)} + \pi(\omega_3) \cdot S^{(3)})$$
$$\pi(\omega_1) + \pi(\omega_2) + \pi(\omega_3) = 1$$

2 equations and 3 variables implies that there are multiple risk-neutral probability measures. Now let us try to form a replicating portfolio (θ_0, θ_1) for a derivative D:

$$V_D^{(1)} = \theta_0 \cdot (1+r) + \theta_1 \cdot S^{(1)}$$
$$V_D^{(2)} = \theta_0 \cdot (1+r) + \theta_1 \cdot S^{(2)}$$
$$V_D^{(3)} = \theta_0 \cdot (1+r) + \theta_1 \cdot S^{(3)}$$

3 equations and 2 variables implies that there is no replicating portfolio for some D. This means this is an Incomplete Market.
So with multiple risk-neutral probability measures (and consequent, multiple deriva-
tive prices), how do we go about determining how much to buy/sell derivatives for? One
approach to handle derivative pricing in an incomplete market is the technique called Su-
perhedging, which provides upper and lower bounds for the derivative price. The idea
of Superhedging is to create a portfolio of fundamental assets whose Value dominates the
derivative payoff in all random outcomes at t = 1. Superhedging Price is the smallest pos-
sible Portfolio Spot (t = 0) Value among all such Derivative-Payoff-Dominating portfolios.
Without getting into too many details of the Superhedging technique (out of scope for this
book), we shall simply sketch the outline of this technique for our simple setting.
We note that for our simple setting of discrete-time with a single-period, this is a constrained linear optimization problem:

$$\min_\theta \sum_{j=0}^m \theta_j \cdot S_j^{(0)} \text{ such that } \sum_{j=0}^m \theta_j \cdot S_j^{(i)} \geq V_D^{(i)} \text{ for all } i = 1, \ldots, n \quad (7.15)$$

Let θ* = (θ_0*, θ_1*, ..., θ_m*) be the solution to Equation (7.15). Then the Superhedging Price is:

$$SP = \sum_{j=0}^m \theta_j^* \cdot S_j^{(0)}$$
After establishing feasibility, we define the Lagrangian J(θ, λ) as follows:

$$J(\theta, \lambda) = \sum_{j=0}^m \theta_j \cdot S_j^{(0)} + \sum_{i=1}^n \lambda_i \cdot \Big(V_D^{(i)} - \sum_{j=0}^m \theta_j \cdot S_j^{(i)}\Big)$$

$$\nabla_\theta J(\theta^*, \lambda) = 0 \Rightarrow S_j^{(0)} = \sum_{i=1}^n \lambda_i \cdot S_j^{(i)} \text{ for all } j = 0, 1, \ldots, m$$

This implies λ_i = π(ω_i)/(1+r) for all i = 1, ..., n for a risk-neutral probability measure π : Ω → [0, 1] (λ can be thought of as "discounted probabilities").
Define the Lagrangian Dual

$$L(\lambda) = \inf_\theta J(\theta, \lambda)$$

Then, the Superhedging Price

$$SP = \sum_{j=0}^m \theta_j^* \cdot S_j^{(0)} = \sup_\lambda L(\lambda) = \sup_\lambda \inf_\theta J(\theta, \lambda)$$

Complementary Slackness and some linear algebra over the space of risk-neutral probability measures π : Ω → [0, 1] enables us to argue that:

$$SP = \sup_\pi \sum_{i=1}^n \frac{\pi(\omega_i)}{1+r} \cdot V_D^{(i)}$$
This means the Superhedging Price is the least upper-bound of the riskless rate-discounted
expectation of derivative payoff across each of the risk-neutral probability measures in the
incomplete market, which is quite an intuitive thing to do amidst multiple risk-neutral
probability measures.
Likewise, the Subhedging Price SB is defined through the analogous constrained linear optimization problem:

$$\max_\theta \sum_{j=0}^m \theta_j \cdot S_j^{(0)} \text{ such that } \sum_{j=0}^m \theta_j \cdot S_j^{(i)} \leq V_D^{(i)} \text{ for all } i = 1, \ldots, n$$

An entirely analogous argument (using the Lagrangian Dual and Complementary Slackness) yields:

$$SB = \inf_\pi \sum_{i=1}^n \frac{\pi(\omega_i)}{1+r} \cdot V_D^{(i)}$$

This means the Subhedging Price is the greatest lower-bound of the riskless rate-discounted expectation of derivative payoff across each of the risk-neutral probability measures in the incomplete market, which again is quite an intuitive thing to do amidst multiple risk-neutral probability measures.
So this technique provides a lower bound (SB) and an upper bound (SP) for the derivative price, meaning:
• A price outside these bounds leads to an arbitrage
• Valid prices must be established within these bounds
But often these bounds are not tight and so, not useful in practice.
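Although we don't go into the details of the Superhedging technique, the single-period linear program (7.15) (and its subhedging counterpart) is simple enough to solve numerically. The sketch below (illustrative only, with made-up market data) uses scipy.optimize.linprog to compute SP and SB for a toy incomplete market with 1 risky asset and 3 outcomes:

import numpy as np
from scipy.optimize import linprog

r = 0.05
spot = np.array([1.0, 100.0])             # S_j^(0): riskless and risky asset
t1 = np.array([[1 + r,  90.0],            # S_j^(i) for the n = 3 outcomes
               [1 + r, 105.0],
               [1 + r, 120.0]])
payoff = np.maximum(t1[:, 1] - 105.0, 0)  # call-style payoff V_D^(i)
free = [(None, None)] * len(spot)         # theta is unconstrained in sign

# Superhedging: min theta . spot  such that  t1 @ theta >= payoff
sp = linprog(c=spot, A_ub=-t1, b_ub=-payoff, bounds=free, method="highs")
# Subhedging:  max theta . spot  such that  t1 @ theta <= payoff
sb = linprog(c=-spot, A_ub=t1, b_ub=payoff, bounds=free, method="highs")

print("Superhedging Price SP =", sp.fun)
print("Subhedging Price SB =", -sb.fun)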
The alternative approach is to identify hedges that maximize the Expected Utility of the combination of the derivative along with its hedges, for an appropriately chosen market/trader Utility Function (as covered in Chapter 5). The Utility function is a specification of reward-versus-risk preference that effectively chooses the risk-neutral probability measure (and hence, the Price).
Consider a concave Utility function U : R → R applied to the Value in each random outcome ω_i, i = 1, ..., n, at t = 1 (eg: U(x) = (1 − e^{−a·x})/a where a ∈ R is the degree of risk-aversion). Let the real-world probabilities be given by µ : Ω → [0, 1]. Denote V_D = (V_D^{(1)}, ..., V_D^{(n)}) as the payoff of Derivative D at t = 1. Let us say that you buy the derivative D at t = 0 and will receive the random outcome-contingent payoff V_D at t = 1.
derivative D at t = 0 and will receive the random outcome-contingent payoff VD at t = 1.
Let x be the candidate derivative price for D, which means you will pay a cash quantity
of x at t = 0 for the privilege of receiving the payoff VD at t = 1. We refer to the candi-
date hedge as Portfolio θ = (θ0 , θ1 , . . . , θm ), representing the units held in the fundamental
assets.
Note that at t = 0, the cash quantity x you’d be paying to buy the derivative and the
cash quantity you’d be paying to buy the Portfolio θ should sum to 0 (note: either of these
cash quantities can be positive or negative, but they need to sum to 0 since “money can’t
just appear or disappear”). Formally,
$$x + \sum_{j=0}^m \theta_j \cdot S_j^{(0)} = 0 \quad (7.16)$$
Our goal is to solve for the appropriate values of x and θ based on an Expected Utility
consideration (that we are about to explain). Consider the Utility of the position consisting
of derivative D together with portfolio θ in random outcome ωi at t = 1:
$$U\Big(V_D^{(i)} + \sum_{j=0}^m \theta_j \cdot S_j^{(i)}\Big)$$

So, the Expected Utility of this position at t = 1 is:

$$\sum_{i=1}^n \mu(\omega_i) \cdot U\Big(V_D^{(i)} + \sum_{j=0}^m \theta_j \cdot S_j^{(i)}\Big) \quad (7.17)$$

Noting that S_0^{(0)} = 1 and S_0^{(i)} = 1 + r for all i = 1, ..., n, we can substitute for the value of θ_0 = −(x + Σ_{j=1}^m θ_j · S_j^{(0)}) (obtained from Equation (7.16)) in the above Expected Utility expression (7.17), so as to rewrite this Expected Utility expression in terms of just (θ_1, ..., θ_m) (call it θ_{1:m}) as:

$$g(V_D, x, \theta_{1:m}) = \sum_{i=1}^n \mu(\omega_i) \cdot U\Big(V_D^{(i)} - (1+r) \cdot x + \sum_{j=1}^m \theta_j \cdot (S_j^{(i)} - (1+r) \cdot S_j^{(0)})\Big)$$
The core principle here (known as Expected-Utility-Indifference Pricing) is that introducing a t = 1 payoff of V_D together with a derivative price payment of x* at t = 0 keeps the Maximum Expected Utility unchanged. Formally, the derivative price x* is defined by the indifference condition:

$$\max_{\theta_{1:m}} g(V_D, x^*, \theta_{1:m}) = \max_{\theta_{1:m}} g(0, 0, \theta_{1:m})$$

The (θ_1*, ..., θ_m*) that achieve max_{θ_{1:m}} g(V_D, x*, θ_{1:m}), together with θ_0* = −(x* + Σ_{j=1}^m θ_j* · S_j^{(0)}), are the requisite hedges associated with the derivative price x*. Note that the Price of V_D will NOT be the negative of the Price of −V_D, hence these prices simply serve as bid prices or ask prices, depending on whether one pays or receives the random outcomes-contingent payoff V_D.
To develop some intuition for what this solution looks like, let us now write some code
for the case of 1 risky asset (i.e., m = 1). To make things interesting, we will write code
for the case where the risky asset price at t = 1 (denoted S) follows a normal distribution
S ∼ N (µ, σ 2 ). This means we have a continuous (rather than discrete) set of values for
the risky asset price at t = 1. Since there are more than 2 random outcomes at time t = 1,
this is the case of an Incomplete Market. Moreover, we assume the CARA utility function:
$$U(y) = \frac{1 - e^{-a \cdot y}}{a}$$
where a is the CARA coefficient of risk-aversion.
We refer to the units of investment in the risky asset as α and the units of investment in
the riskless asset as β. Let S0 be the spot (t = 0) value of the risky asset (riskless asset
value at t = 0 is 1). Let f (S) be the payoff of the derivative D at t = 1. So, the price of
derivative D is the breakeven value x∗ such that:
$$\max_\alpha \mathbb{E}_{S \sim N(\mu, \sigma^2)}\Big[\frac{1 - e^{-a \cdot (f(S) - (1+r) \cdot x^* + \alpha \cdot (S - (1+r) \cdot S_0))}}{a}\Big] = \max_\alpha \mathbb{E}_{S \sim N(\mu, \sigma^2)}\Big[\frac{1 - e^{-a \cdot \alpha \cdot (S - (1+r) \cdot S_0)}}{a}\Big] \quad (7.18)$$
The maximizing value of α (call it α∗ ) on the left-hand-side of Equation (7.18) along
with β ∗ = −(x∗ + α∗ · S0 ) are the requisite hedges associated with the derivative price x∗ .
We set up a @dataclass MaxExpUtility with attributes to represent the risky asset spot
price S0 (risky_spot), the riskless rate r (riskless_rate), mean µ of S (risky_mean), stan-
dard deviation σ of S (risky_stdev), and the payoff function f (·) of the derivative (payoff_func).
@dataclass(frozen=True)
class MaxExpUtility:
    risky_spot: float  # risky asset price at t=0
    riskless_rate: float  # riskless asset price grows from 1 to 1+r
    risky_mean: float  # mean of risky asset price at t=1
    risky_stdev: float  # std dev of risky asset price at t=1
    payoff_func: Callable[[float], float]  # derivative payoff at t=1
Before we write code to solve the derivatives pricing and hedging problem for an incom-
plete market, let us write code to solve the problem for a complete market (as this will serve
as a good comparison against the incomplete market solution). For a complete market, the
risky asset has two random prices at t = 1: prices µ + σ and µ − σ, with probabilities of 0.5
each. As we’ve seen in Section 7.6.1, we can perfectly replicate a derivative payoff in this
complete market situation as it amounts to solving 2 linear equations in 2 unknowns (solu-
tion shown in Equation (7.14)). The number of units of the requisite hedges are simply the
negatives of the replicating portfolio units. The method complete_mkt_price_and_hedges
(of the MaxExpUtility class) shown below implements this solution, producing a dictio-
nary comprising of the derivative price (price) and the hedge units α (alpha) and β (beta).
def complete_mkt_price_and_hedges(self) -> Mapping[str, float]:
    x = self.risky_mean + self.risky_stdev
    z = self.risky_mean - self.risky_stdev
    v1 = self.payoff_func(x)
    v2 = self.payoff_func(z)
    alpha = (v1 - v2) / (z - x)
    beta = -1 / (1 + self.riskless_rate) * (v1 + alpha * x)
    price = -(beta + alpha * self.risky_spot)
    return {"price": price, "alpha": alpha, "beta": beta}
Next, we write a method max_exp_util_for_zero that calculates the maximum expected utility for the case of a zero derivative payoff at t = 1, i.e., it calculates:

$$\max_\alpha \mathbb{E}_{S \sim N(\mu, \sigma^2)}\Big[\frac{1 - e^{-a \cdot (-(1+r) \cdot c + \alpha \cdot (S - (1+r) \cdot S_0))}}{a}\Big]$$

where c is the cash paid at t = 0 (so, c = −(α · S_0 + β)).
The method max_exp_util_for_zero accepts as input c: float (representing the cash c paid at t = 0) and risk_aversion_param: float (representing the CARA coefficient of risk-aversion a). Referring to Section A.4.1 in Appendix A, we have a closed-form solution to this maximization problem:

$$\alpha^* = \frac{\mu - (1+r) \cdot S_0}{a \cdot \sigma^2}, \qquad \beta^* = -(c + \alpha^* \cdot S_0)$$

Substituting α* in the Expected Utility expression above gives the following maximum value for the Expected Utility for this special case:

$$\frac{1 - e^{a \cdot (1+r) \cdot c - \frac{(\mu - (1+r) \cdot S_0)^2}{2\sigma^2}}}{a}$$
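A minimal sketch of max_exp_util_for_zero, assuming it simply evaluates the closed-form expressions above and returns a dictionary in the same format as the other methods of MaxExpUtility, might look like this:

def max_exp_util_for_zero(
    self,
    c: float,
    risk_aversion_param: float
) -> Mapping[str, float]:
    # closed-form solution for the case of a zero derivative payoff
    a = risk_aversion_param
    er = 1 + self.riskless_rate
    excess = self.risky_mean - er * self.risky_spot
    alpha_star = excess / (a * self.risky_stdev * self.risky_stdev)
    beta_star = -(c + alpha_star * self.risky_spot)
    # maximum expected utility obtained by substituting alpha_star back in
    max_val = (1 - np.exp(a * er * c - excess * excess /
                          (2 * self.risky_stdev * self.risky_stdev))) / a
    return {"alpha": alpha_star, "beta": beta_star, "max_val": max_val}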
Next we write a method max_exp_util that calculates the maximum expected utility for the general case of a derivative with an arbitrary payoff f(·) at t = 1 (provided as input pf: Callable[[float], float] below), i.e., it calculates:

$$\max_\alpha \mathbb{E}_{S \sim N(\mu, \sigma^2)}\Big[\frac{1 - e^{-a \cdot (f(S) - (1+r) \cdot c + \alpha \cdot (S - (1+r) \cdot S_0))}}{a}\Big]$$

Clearly, this has no closed-form solution since f(·) is an arbitrary payoff. The method max_exp_util uses the scipy.integrate.quad function to calculate the expectation as an integral of the CARA utility function of f(S) − (1+r)·c + α·(S − (1+r)·S_0) multiplied by the probability density of N(µ, σ²), and then uses the scipy.optimize.minimize_scalar function to perform the maximization over values of α.
from scipy.integrate import quad
from scipy.optimize import minimize_scalar

def max_exp_util(
    self,
    c: float,
    pf: Callable[[float], float],
    risk_aversion_param: float
) -> Mapping[str, float]:
    sigma2 = self.risky_stdev * self.risky_stdev
    mu = self.risky_mean
    s0 = self.risky_spot
    er = 1 + self.riskless_rate
    factor = 1 / np.sqrt(2 * np.pi * sigma2)
    integral_lb = self.risky_mean - self.risky_stdev * 6
    integral_ub = self.risky_mean + self.risky_stdev * 6

    def eval_expectation(alpha: float, c=c) -> float:
        def integrand(rand: float, alpha=alpha, c=c) -> float:
            payoff = pf(rand) - er * c \
                + alpha * (rand - er * s0)
            exponent = -(0.5 * (rand - mu) * (rand - mu) / sigma2
                         + risk_aversion_param * payoff)
            return (1 - factor * np.exp(exponent)) / risk_aversion_param
        return -quad(integrand, integral_lb, integral_ub)[0]

    res = minimize_scalar(eval_expectation)
    alpha_star = res["x"]
    max_val = -res["fun"]
    beta_star = -(c + alpha_star * s0)
    return {"alpha": alpha_star, "beta": beta_star, "max_val": max_val}
Finally, it’s time to put it all together - the method max_exp_util_price_and_hedge be-
low calculates the maximizing x∗ in Equation (7.18). First, we call max_exp_util_for_zero
(with c set to 0) to calculate the right-hand-side of Equation (7.18). Next, we create a wrap-
per function prep_func around max_exp_util, which is provided as input to scipt.optimize.root_scalar
to solve for x∗ in the right-hand-side of Equation (7.18). Plugging x∗ (opt_price in the code
below) in max_exp_util provides the hedges α∗ and β ∗ (alpha and beta in the code below).
from scipy.optimize import root_scalar

def max_exp_util_price_and_hedge(
    self,
    risk_aversion_param: float
) -> Mapping[str, float]:
    meu_for_zero = self.max_exp_util_for_zero(
        0.,
        risk_aversion_param
    )["max_val"]

    def prep_func(pr: float) -> float:
        return self.max_exp_util(
            pr,
            self.payoff_func,
            risk_aversion_param
        )["max_val"] - meu_for_zero

    lb = self.risky_mean - self.risky_stdev * 10
    ub = self.risky_mean + self.risky_stdev * 10
    payoff_vals = [self.payoff_func(x) for x in np.linspace(lb, ub, 1001)]
    lb_payoff = min(payoff_vals)
    ub_payoff = max(payoff_vals)
    opt_price = root_scalar(
        prep_func,
        bracket=[lb_payoff, ub_payoff],
        method="brentq"
    ).root
    hedges = self.max_exp_util(
        opt_price,
        self.payoff_func,
        risk_aversion_param
    )
    alpha = hedges["alpha"]
    beta = hedges["beta"]
    return {"price": opt_price, "alpha": alpha, "beta": beta}
The above code for the class MaxExpUtility is in the file rl/chapter8/max_exp_utility.py.
As ever, we encourage you to play with various choices of S0 , r, µ, σ, f to create instances
of MaxExpUtility, analyze the obtained prices/hedges, and plot some graphs to develop
intuition on how the results change as a function of the various inputs.
Running this code for S_0 = 100, r = 5%, µ = 110, σ = 25 when buying a call option (European since we have only one time period) with strike price = 105, the method complete_mkt_price_and_hedges gives an option price of 11.43, risky asset hedge units of -0.6 (i.e., we hedge the risk of owning the call option by short-selling 60% of the risky asset) and riskless asset hedge units of 48.57 (i.e., we take the $60 proceeds of the short-sale less the $11.43 option price payment = $48.57 of cash and invest it in a risk-free bank account earning 5% interest). As mentioned earlier, this is the perfect hedge if we had a complete market (i.e., two random outcomes). Running this code for the same inputs for an incomplete market (calling the method max_exp_util_price_and_hedge for risk-aversion parameter values of a = 0.3, 0.6, 0.9) gives us the results discussed below.
We note that the call option price is quite high (23.28) when the risk-aversion is low at
a = 0.3 (relative to the complete market price of 11.43) but the call option price drops to
12.67 and 8.87 for a = 0.6 and a = 0.9 respectively. This makes sense since if you are more
risk-averse (high a), then you’d be less willing to take the risk of buying a call option and
hence, would want to pay less to buy the call option. Note how the risky asset short-sale
is significantly less (~47% - ~49%) compared to the risky asset short-sale of 60% in the case of a complete market. The varying investments in the riskless asset (as a function
of the risk-aversion a) essentially account for the variation in option prices (as a function
of a). Figure 7.1 provides tremendous intuition on how the hedges work for the case of
a complete market and for the cases of an incomplete market with the 3 choices of risk-
aversion parameters. Note that we have plotted the negatives of the hedge portfolio values
at t = 1 so as to visualize them appropriately relative to the payoff of the call option. Note
that the hedge portfolio value is a linear function of the risky asset price at t = 1. Notice
how the slope and intercept of the hedge portfolio value changes for the 3 risk-aversion
scenarios and how they compare against the complete market hedge portfolio value.
Figure 7.1.: Hedges when buying a Call Option
Now let us consider the case of selling the same call option. In our code, the only change
we make is to make the payoff function lambda x: - max(x - 105.0, 0) instead of lambda
x: max(x - 105.0, 0) to reflect the fact that we are now selling the call option and so, our
payoff will be the negative of that of an owner of the call option.
With the same inputs of S_0 = 100, r = 5%, µ = 110, σ = 25, and for the same risk-aversion parameter values of a = 0.3, 0.6, 0.9, we get the results discussed below.
We note that the sale price demand for the call option is quite low (6.31) when the
risk-aversion is low at a = 0.3 (relative to the complete market price of 11.43) but the
sale price demand for the call option rises sharply to 32.32 and 44.24 for a = 0.6 and
a = 0.9 respectively. This makes sense since if you are more risk-averse (high a), then
you’d be less willing to take the risk of selling a call option and hence, would want to charge
more for the sale of the call option. Note how the risky asset hedge units are less (~52%
- 53%) compared to the risky asset hedge units (60%) in the case of a complete market.
The varying riskless borrowing amounts (as a function of the risk-aversion a) essentially
account for the variation in option prices (as a function of a). Figure 7.2 provides the visual
intuition on how the hedges work for the 3 choices of risk-aversion parameters (along with
the hedges for the complete market, for reference).
Note that each buyer and each seller might have a different level of risk-aversion, mean-
ing each of them would have a different buy price bid/different sale price ask. A transac-
tion can occur between a buyer and a seller (with potentially different risk-aversion levels)
if the buyer’s bid matches the seller’s ask.
Figure 7.2.: Hedges when selling a Call Option
Now consider a single-period market with two risky assets (m = 2) and two random outcomes (n = 2), where the riskless asset grows by a factor of e^r from t = 0 to t = 1. A risk-neutral probability measure π would need to satisfy:

$$S_1^{(0)} = e^{-r} \cdot (\pi(\omega_1) \cdot S_1^{(1)} + \pi(\omega_2) \cdot S_1^{(2)})$$
$$S_2^{(0)} = e^{-r} \cdot (\pi(\omega_1) \cdot S_2^{(1)} + \pi(\omega_2) \cdot S_2^{(2)})$$
$$\pi(\omega_1) + \pi(\omega_2) = 1$$

3 equations and 2 variables implies that there is no risk-neutral probability measure π for various sets of values of S_1^{(1)}, S_1^{(2)}, S_2^{(1)}, S_2^{(2)}. Let's try to form a replicating portfolio (θ_0, θ_1, θ_2) for a derivative D:

$$V_D^{(1)} = \theta_0 \cdot e^r + \theta_1 \cdot S_1^{(1)} + \theta_2 \cdot S_2^{(1)}$$
$$V_D^{(2)} = \theta_0 \cdot e^r + \theta_1 \cdot S_1^{(2)} + \theta_2 \cdot S_2^{(2)}$$

2 equations and 3 variables implies that there are multiple replicating portfolios. Each such replicating portfolio yields a price for D as:

$$V_D^{(0)} = \theta_0 + \theta_1 \cdot S_1^{(0)} + \theta_2 \cdot S_2^{(0)}$$
Select two such replicating portfolios with different V_D^{(0)}. The combination of one of these replicating portfolios with the negative of the other replicating portfolio is an Arbitrage Portfolio because:
• They cancel off each other's portfolio value in each of the t = 1 states
• The combined portfolio value can be made to be negative at t = 0 (by appropriately choosing the replicating portfolio to negate)
So, to summarize the single-period setting: in a complete market, we can price a derivative in either of two equivalent ways:
• Solve for the replicating portfolio (i.e., solve for the units in the fundamental assets that would replicate the derivative payoff), and then calculate the derivative price as the value of this replicating portfolio at t = 0.
• Calculate the probabilities of random-outcomes for the unique risk-neutral probability measure, and then calculate the derivative price as the riskless rate-discounted expectation (under this risk-neutral probability measure) of the derivative payoff.
It turns out that even in the multi-period setting, when the market is complete, we can
calculate the derivative price (not just at t = 0, but at any random outcome at any future
time) with either of the above two (equivalent) methods, as long as we appropriately
adjust the fundamental assets’ units in the replicating portfolio (depending on the random
outcome) as we move from one time step to the next. It is important to note that when we
alter the fundamental assets’ units in the replicating portfolio at each time step, we need to
respect the constraint that money cannot enter or leave the replicating portfolio (i.e., it is a
self-financing replicating portfolio with the replicating portfolio value remaining unchanged
in the process of altering the units in the fundamental assets). It is also important to note
that the alteration in units in the fundamental assets is dependent on the prices of the
fundamental assets (which are random outcomes as we move forward from one time step
to the next). Hence, the fundamental assets’ units in the replicating portfolio evolve as
random variables, while respecting the self-financing constraint. Therefore, the replicating portfolio in a multi-period setting is often referred to as a Dynamic Self-Financing Replicating Portfolio to reflect the fact that the replicating portfolio is adapting to the changing prices of
the fundamental assets. The negatives of the fundamental assets’ units in the replicating
portfolio form the hedges for the derivative.
To ensure that the market is complete in a multi-period setting, we need to assume that
the market is “frictionless” - that we can trade in real-number quantities in any fundamen-
tal asset and that there are no transaction costs for any trades at any time step. From a com-
putational perspective, we walk back in time from the final time step (call it t = T ) to t = 0,
and calculate the fundamental assets’ units in the replicating portfolio in a “backward re-
cursive manner.” As in the case of the single-period setting, each backward-recursive step
from outcomes at time t + 1 to a specific outcome at time t simply involves solving a linear
system of equations where each unknown is the replicating portfolio units in a specific
fundamental asset and each equation corresponds to the value of the replicating portfolio
at a specific outcome at time t + 1 (which is established recursively). The market is com-
plete if there is a unique solution to each linear system of equations (for each time t and
for each outcome at time t) in this backward-recursive computation. This gives us not just
the replicating portfolio (and consequently, hedges) at each outcome at each time step,
but also the price at each outcome at each time step (the price is equal to the value of the
calculated replicating portfolio at that outcome at that time step).
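As an illustration of a single backward-recursive step, the following sketch (illustrative only, not the book's code) solves for the replicating portfolio units at one time-t outcome, given the fundamental assets' prices and the (recursively established) derivative values at the time-(t+1) outcomes reachable from it:

import numpy as np

def replication_step(
    asset_prices_next: np.ndarray,   # k x (m+1): asset prices in the k
                                     # time-(t+1) outcomes reachable from here
    target_values_next: np.ndarray   # k: derivative values in those outcomes
):
    # Solve: asset_prices_next @ theta = target_values_next
    theta, _, rank, _ = np.linalg.lstsq(
        asset_prices_next, target_values_next, rcond=None
    )
    return theta, rank  # unique solution when rank equals the number of assets

# one-step binomial example: riskless asset (r = 0) and one risky asset
prices_next = np.array([[1.0, 120.0],
                        [1.0,  80.0]])
option_values_next = np.array([20.0, 0.0])   # e.g., a call struck at 100
theta, rank = replication_step(prices_next, option_values_next)
spot = np.array([1.0, 100.0])
print(theta, theta @ spot)  # units in each asset, and the time-t price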
Equivalently, we can do a backward-recursive calculation in terms of the risk-neutral
probability measures, with each risk-neutral probability measure giving us the transition
probabilities from an outcome at time step t to outcomes at time step t+1. Again, in a com-
plete market, it amounts to a unique solution of each of these linear system of equations.
For each of these linear system of equations, an unknown is a transition probability to a
time t + 1 outcome and an equation corresponds to a specific fundamental asset’s prices
at the time t + 1 outcomes. This calculation is popularized (and easily understood) in the
simple context of a Binomial Options Pricing Model. We devote Section 7.8 to coverage of
the original Binomial Options Pricing Model and model it as a Finite-State Finite-Horizon
MDP (and utilize the Finite-Horizon DP code developed in Chapter 3 to solve the MDP).
There is considerable literature on how to price in incomplete markets for multi-period/continuous-time settings, which includes the Superhedging approach as well as the Expected-Utility-Indifference approach that we covered in Subsection 7.6.2 for the simple setting of discrete-time with a single-period. However, in practice, these approaches are not adopted as they fail to capture real-world nuances adequately. Besides, most of these approaches lead to fairly
wide price bounds that are not particularly useful in practice. In Section 7.10, we extend
the Expected-Utility-Indifference approach that we had covered for the single-period setting
to the multi-period setting. It turns out that this approach can be modeled as an MDP,
with the adjustments to the hedge quantities at each time step as the actions of the MDP -
calculating the optimal policy gives us the optimal derivative hedging strategy and the as-
sociated optimal value function gives us the derivative price. This approach is applicable
to real-world situations and one can even incorporate all the real-world frictions in one’s
MDP to build a practical solution for derivatives trading (covered in Section 7.10).
Figure 7.3.: Binomial Option Pricing Model (Binomial Tree)
The parameters q, u and d are calibrated so that the probability distribution of the log-price-ratios {log(S_{i,0}/S_{0,0}), log(S_{i,1}/S_{0,0}), ..., log(S_{i,i}/S_{0,0})} after i time steps (with each time step of interval T/n for a given expiry time T ∈ R^+ and a fixed number of time steps n ∈ Z^+) serves as a good approximation to N((r − σ²/2)·iT/n, σ²·iT/n) (which we know to be the risk-neutral probability distribution of log(S_{iT/n}/S_0) in the continuous-time process defined by Equation (7.19), as derived in Section C.7 in Appendix C), for all i = 0, 1, ..., n. Note that the starting price S_{0,0} of this discrete-time approximation process is equal to the starting price S_0 of the continuous-time process.
This calibration of q, u and d can be done in a variety of ways and there are indeed several
variants of Binomial Options Pricing Models with different choices of how q, u and d are
calibrated. We shall implement the choice made in the original Binomial Options Pricing
Model that was proposed in a seminal paper by Cox, Ross, Rubinstein (Cox, Ross, and
Rubinstein 1979). Their choice is best understood in two steps:
• As a first step, ignore the drift term r·S_t·dt of the lognormal process, and assume the underlying price follows the martingale process dS_t = σ·S_t·dz_t. They chose d to be equal to 1/u and calibrated u such that for any i ∈ Z_{≥0} and any 0 ≤ j ≤ i, the variance of the two equal-probability random outcomes log(S_{i+1,j+1}/S_{i,j}) = log(u) and log(S_{i+1,j}/S_{i,j}) = log(d) = −log(u) is equal to the variance σ²T/n of the normally-distributed random variable log(S_{t+T/n}/S_t) for any t ≥ 0 (assuming the process dS_t = σ·S_t·dz_t). This yields:

$$\log^2(u) = \frac{\sigma^2 T}{n} \Rightarrow u = e^{\sigma \sqrt{\frac{T}{n}}}$$
• As a second step, q needs to be calibrated to account for the drift term r·S_t·dt in the lognormal process under the risk-neutral probability measure. Specifically, q is adjusted so that for any i ∈ Z_{≥0} and any 0 ≤ j ≤ i, the mean of the two random outcomes S_{i+1,j+1}/S_{i,j} = u and S_{i+1,j}/S_{i,j} = 1/u is equal to the mean e^{rT/n} of the lognormally-distributed random variable S_{t+T/n}/S_t for any t ≥ 0 (assuming the process dS_t = r·S_t·dt + σ·S_t·dz_t). This yields:

$$q \cdot u + \frac{1-q}{u} = e^{\frac{rT}{n}} \Rightarrow q = \frac{u \cdot e^{\frac{rT}{n}} - 1}{u^2 - 1} = \frac{e^{\frac{rT}{n} + \sigma\sqrt{\frac{T}{n}}} - 1}{e^{2\sigma\sqrt{\frac{T}{n}}} - 1}$$
This calibration for u and q ensures that as n → ∞ (i.e., as the time step interval T/n → 0), the mean and variance of the binomial distribution after i time steps match the mean (r − σ²/2)·iT/n and variance σ²·iT/n of the normally-distributed random variable log(S_{iT/n}/S_0) in the continuous-time process defined by Equation (7.19), for all i = 0, 1, ..., n. Note that log(S_{i,j}/S_{0,0}) follows a random walk Markov Process (reminiscent of the random walk examples in Chapter 1) with each movement in state space scaled by a factor of log(u).
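Here is a tiny sketch of this calibration of u and q (with assumed parameter values purely for illustration), along with a check that the one-step mean of the price ratio matches e^{rT/n}:

import numpy as np

sigma, r, expiry, num_steps = 0.25, 0.05, 1.0, 300
dt = expiry / num_steps
u = np.exp(sigma * np.sqrt(dt))
d = 1 / u
q = (np.exp(r * dt) * u - 1) / (u * u - 1)

print(u, d, q)
print(q * u + (1 - q) * d, np.exp(r * dt))  # one-step means match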
Thus, we have the parameters u and q that fully specify the Binomial Options Pricing
Model. Now we get to the application of this model. We are interested in using this model
for optimal exercise (and hence, pricing) of American Options. This is in contrast to the
Black-Scholes Partial Differential Equation which only enabled us to price options with
a fixed payoff at a fixed point in time (eg: European Call and Put Options). Of course,
a special case of American Options is indeed European Options. It’s important to note
that here we are tackling the much harder problem of the ideal timing of exercise of an
American Option - the Binomial Options Pricing Model is well suited for this.
As mentioned earlier, we want to model the problem of Optimal Exercise of American
Options as a discrete-time, finite-horizon, finite-states MDP. We set the terminal time to be
t = T + 1, meaning all the states at time T + 1 are terminal states. Here we will utilize the
states and state transitions (probabilistic price movements of the underlying) given by the
Binomial Options Pricing Model as the states and state transitions in the MDP. The MDP
actions in each state will be binary - either exercise the option (and immediately move
to a terminal state) or don’t exercise the option (i.e., continue on to the next time step’s
random state, as given by the Binomial Options Pricing Model). If the exercise action is
chosen, the MDP reward is the option payoff. If the continue action is chosen, the reward
is 0. The discount factor γ is e^{-rT/n} since (as we've learnt in the single-period case), the price (which translates here to the Optimal Value Function) is defined as the riskless rate-discounted expectation (under the risk-neutral probability measure) of the option payoff.
In the multi-period setting, the overall discounting amounts to composition (multiplica-
tion) of each time step’s discounting (which is equal to γ) and the overall risk-neutral
probability measure amounts to the composition of each time step’s risk-neutral probabil-
ity measure (which is specified by the calibrated value q).
Now let’s write some code to determine the Optimal Exercise of American Options
(and hence, the price of American Options) by modeling this problem as a discrete-time,
finite-horizon, finite-states MDP. We create a dataclass OptimalExerciseBinTree whose
attributes are spot_price (specifying the current, i.e., time=0 price of the underlying),
payoff (specifying the option payoff, when exercised), expiry (specifying the time T to
expiration of the American Option), rate (specifying the riskless rate r), vol (specifying
the lognormal volatility σ), and num_steps (specifying the number n of time steps in the
binomial tree). Note that each time step is of interval Tn (which is implemented below in
the method dt). Note also that the payoff function is fairly generic taking two arguments -
the first argument is the time at which the option is exercised, and the second argument is
the underlying price at the time the option is exercised. Note that for a typical American
Call or Put Option, the payoff does not depend on time and the dependency on the un-
derlying price is the standard “hockey-stick” payoff that we are now fairly familiar with
(however, we designed the interface to allow for more general option payoff functions).
The set of states S_i at time step i (for all 0 ≤ i ≤ T + 1) is {0, 1, ..., i}, and the method state_price below calculates the price in state j at time step i as:

$$S_{i,j} = S_{0,0} \cdot e^{(2j - i) \cdot \sigma \cdot \sqrt{\frac{T}{n}}}$$
from rl.dynamic_programming import V
from rl.policy import FiniteDeterministicPolicy

@dataclass(frozen=True)
class OptimalExerciseBinTree:
    spot_price: float
    payoff: Callable[[float, float], float]
    expiry: float
    rate: float
    vol: float
    num_steps: int

    def dt(self) -> float:
        return self.expiry / self.num_steps

    def state_price(self, i: int, j: int) -> float:
        return self.spot_price * np.exp((2 * j - i) * self.vol *
                                        np.sqrt(self.dt()))

    def get_opt_vf_and_policy(self) -> \
            Iterator[Tuple[V[int], FiniteDeterministicPolicy[int, bool]]]:
        dt: float = self.dt()
        up_factor: float = np.exp(self.vol * np.sqrt(dt))
        up_prob: float = (np.exp(self.rate * dt) * up_factor - 1) / \
            (up_factor * up_factor - 1)
        return optimal_vf_and_policy(
            steps=[
                {NonTerminal(j): {
                    True: Constant(
                        (
                            Terminal(-1),
                            self.payoff(i * dt, self.state_price(i, j))
                        )
                    ),
                    False: Categorical(
                        {
                            (NonTerminal(j + 1), 0.): up_prob,
                            (NonTerminal(j), 0.): 1 - up_prob
                        }
                    )
                } for j in range(i + 1)}
                for i in range(self.num_steps + 1)
            ],
            gamma=np.exp(-self.rate * dt)
        )
Now we want to try out this code on an American Call Option and American Put Option.
We know that it is never optimal to exercise an American Call Option before the option
expiration. The reason for this is as follows: Upon early exercise (say at time τ < T ),
we borrow cash K (to pay for the purchase of the underlying) and own the underlying
(valued at S_τ). So, at option expiration T, we owe cash K·e^{r(T−τ)} and own the underlying valued at S_T, which is an overall value at time T of S_T − K·e^{r(T−τ)}. We argue that this value is always less than the value max(S_T − K, 0) we'd obtain at option expiration T if we'd made the choice to not exercise early. If the call option ends up in-the-money at option expiration T (i.e., S_T > K), then S_T − K·e^{r(T−τ)} is less than the value S_T − K we'd get by exercising at option expiration T. If the call option ends up not being in-the-money at option expiration T (i.e., S_T ≤ K), then S_T − K·e^{r(T−τ)} < 0, which is less than the 0 payoff we'd obtain at
is never optimal to exercise a call option early, no matter how much in-the-money we get
before option expiration). Hence, the price of an American Call Option should be equal
to the price of an European Call Option with the same strike price and expiration time.
to the price of a European Call Option with the same strike price and expiration time.
However, for an American Put Option, it is indeed sometimes optimal to exercise early and hence, the price of an American Put Option is greater than the price of a European Put Option with the same strike price and expiration time. Thus, it is interesting to ask the question: For each time t < T, what is the threshold of underlying price S_t below which it is optimal to exercise an American Put Option? It is interesting to view this threshold as a function of time (we call this function the optimal exercise boundary of an American Put Option). One would expect that this optimal exercise boundary rises as one gets closer
to the option expiration T . But exactly what shape does this optimal exercise boundary
have? We can answer this question by analyzing the optimal policy at each time step - we
just need to find the state k at each time step i such that the Optimal Policy πi∗ (·) evaluates
to True for all states j ≤ k (and evaluates to False for all states j > k). We write the
following method to calculate the Optimal Exercise Boundary:
def option_exercise_boundary(
    self,
    policy_seq: Sequence[FiniteDeterministicPolicy[int, bool]],
    is_call: bool
) -> Sequence[Tuple[float, float]]:
    dt: float = self.dt()
    ex_boundary: List[Tuple[float, float]] = []
    for i in range(self.num_steps + 1):
        ex_points = [j for j in range(i + 1)
                     if policy_seq[i].action_for[j] and
                     self.payoff(i * dt, self.state_price(i, j)) > 0]
        if len(ex_points) > 0:
            boundary_pt = min(ex_points) if is_call else max(ex_points)
            ex_boundary.append(
                (i * dt, self.state_price(i, boundary_pt))
            )
    return ex_boundary
Essentially, the Optimal Exercise Boundary at time step i (for the put option) is:

$$B_i = \max_{j : \pi_i^*(j) = True} S_{i,j}$$

with the little detail that we only consider those states j for which the option payoff is positive. For some time steps i, none of the states j qualify as π_i^*(j) = True, in which case we don't include that time step i in the output sequence.
To compare the results of American Call and Put Option Pricing on this Binomial Op-
tions Pricing Model against the corresponding European Options prices, we write the fol-
lowing method to implement the Black-Scholes closed-form solution (derived as Equa-
tions E.7 and E.8 in Appendix E):
def european_price(self, is_call: bool, strike: float) -> float:
    sigma_sqrt: float = self.vol * np.sqrt(self.expiry)
    d1: float = (np.log(self.spot_price / strike) +
                 (self.rate + self.vol * self.vol / 2.) * self.expiry) / \
        sigma_sqrt
    d2: float = d1 - sigma_sqrt
    if is_call:
        ret = self.spot_price * norm.cdf(d1) - \
            strike * np.exp(-self.rate * self.expiry) * norm.cdf(d2)
    else:
        ret = strike * np.exp(-self.rate * self.expiry) * norm.cdf(-d2) - \
            self.spot_price * norm.cdf(-d1)
    return ret
Here’s some code to price an American Put Option (changing is_call to True will price
American Call Options):
from rl.gen_utils.plot_funcs import plot_list_of_curves

spot_price_val: float = 100.0
strike: float = 100.0
is_call: bool = False
expiry_val: float = 1.0
rate_val: float = 0.05
vol_val: float = 0.25
num_steps_val: int = 300

if is_call:
    opt_payoff = lambda _, x: max(x - strike, 0)
else:
    opt_payoff = lambda _, x: max(strike - x, 0)

opt_ex_bin_tree: OptimalExerciseBinTree = OptimalExerciseBinTree(
    spot_price=spot_price_val,
    payoff=opt_payoff,
    expiry=expiry_val,
    rate=rate_val,
    vol=vol_val,
    num_steps=num_steps_val
)

vf_seq, policy_seq = zip(*opt_ex_bin_tree.get_opt_vf_and_policy())
ex_boundary: Sequence[Tuple[float, float]] = \
    opt_ex_bin_tree.option_exercise_boundary(policy_seq, is_call)
time_pts, ex_bound_pts = zip(*ex_boundary)
label = ("Call" if is_call else "Put") + " Option Exercise Boundary"
plot_list_of_curves(
    list_of_x_vals=[time_pts],
    list_of_y_vals=[ex_bound_pts],
    list_of_colors=["b"],
    list_of_curve_labels=[label],
    x_label="Time",
    y_label="Underlying Price",
    title=label
)

european: float = opt_ex_bin_tree.european_price(is_call, strike)
print(f"European Price = {european:.3f}")

am_price: float = vf_seq[0][NonTerminal(0)]
print(f"American Price = {am_price:.3f}")
Running this code, we see that the price of this American Put Option is significantly higher than the price of the corresponding European Put Option. The exercise boundary produced by this
code is shown in Figure 7.4. The locally-jagged nature of the exercise boundary curve is
because of the “diamond-like” local-structure of the underlying prices at the nodes in the
binomial tree. We can see that when the time to expiry is large, it is not optimal to exercise
unless the underlying price drops significantly. It is only when the time to expiry becomes
quite small that the optimal exercise boundary rises sharply towards the strike price value.
Changing is_call to True (and not changing any of the other inputs) produces European and American prices that are identical.
Figure 7.4.: Put Option Exercise Boundary
This is a numerical validation of our proof above that it is never optimal to exercise an
American Call Option before option expiration.
The above code is in the file rl/chapter8/optimal_exercise_bin_tree.py. As ever, we en-
courage you to play with various choices of inputs to develop intuition for how American
Option Pricing changes as a function of the inputs (and how American Put Option Ex-
ercise Boundary changes). Note that you can specify the option payoff as any arbitrary
function of time and the underlying price.
A Stopping Time τ of a stochastic process X is a random time such that, for each time t, whether or not the event {τ ≤ t} has occurred can be determined from the history of the process up to time t (i.e., the decision to stop at time t uses only information available by time t).
A simple example of Stopping Time is Hitting Time of a set A for a process X. Informally,
it is the first time when X takes a value within the set A. Formally, the Hitting Time T_{X,A} is defined as:

$$T_{X,A} = \min\{t \in \mathbb{R} \mid X_t \in A\}$$
A simple and common example of Hitting Time is the first time a process exceeds a
certain fixed threshold level. As an example, we might say we want to sell a stock when
the stock price exceeds $100. This $100 threshold constitutes our stopping policy, which
determines the stopping time (hitting time) in terms of when we want to sell the stock (i.e.,
exit owning the stock). Different people may have different criteria for exiting owning the stock (your friend's threshold might be $90), and each person's criterion defines their own stopping policy and hence, their own stopping time random variable.
Now that we have defined Stopping Time, we are ready to define the Optimal Stopping problem. Optimal Stopping for a stochastic process X is a function W(·) whose domain is the set of potential initial values of the stochastic process and whose co-domain is the set of real numbers, defined as:

$$W(x) = \max_{\tau} \mathbb{E}[H(X_\tau) \mid X_0 = x]$$

where τ ranges over a set of stopping times of X and H(·) is a function from the domain of the stochastic process values to the set of real numbers.
Intuitively, you should think of Optimal Stopping as searching through many Stopping
Times (i.e., many Stopping Policies), and picking out the best Stopping Policy - the one
that maximizes the expected value of a function H(·) applied on the stochastic process at
the stopping time.
Unsurprisingly (noting the connection to Optimal Control in an MDP), W (·) is called
the Value function, and H is called the Reward function. Note that sometimes we can
have several stopping times that maximize E[H(Xτ )] and we say that the optimal stopping
time is the smallest stopping time achieving the maximum value. We mentioned above
that Optimal Exercise of American Options is a special case of Optimal Stopping. Let’s
understand this specialization better:
• X is the stochastic process for the underlying’s price in the risk-neutral probability
measure.
• x is the underlying security’s current price.
• τ is a set of exercise times, each exercise time corresponding to a specific policy of
option exercise (i.e., specific stopping policy).
• W (·) is the American Option price as a function of the underlying’s current price x.
• H(·) is the option payoff function (with riskless-rate discounting built into H(·)).
Now let us define Optimal Stopping problems as control problems in Markov Decision
Processes (MDPs).
We model Optimal Stopping as an MDP whose state is the current value of the stochastic process, whose actions are binary (stop, or continue for one more time step), and whose reward is H(·) of the process value upon stopping (and 0 upon continuing). For a given stopping policy (i.e., a given stopping time τ), the Value Function for initial state x corresponds to E[H(X_τ) | X_0 = x], and the Optimal Value Function V* corresponds to the maximum value of E[H(X_τ) | X_0 = x].
For discrete time steps, the Bellman Optimality Equation is:

$$V^*(X_t) = \max(H(X_t), \mathbb{E}[V^*(X_{t+1}) \mid X_t])$$
Thus, we see that Optimal Stopping is the solution to the above Bellman Optimality
Equation (solving the Control problem of the MDP described above). For a finite number
of time steps, we can run a backward induction algorithm from the final time step back
to time step 0 (essentially a generalization of the backward induction we did with the
Binomial Options Pricing Model to determine Optimal Exercise of American Options).
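The following is a minimal sketch (not the book's code) of such a backward induction for an optimal stopping problem on a finite state space, given per-time-step transition matrices and a reward function H (with any discounting assumed to be built into H, as in the American Options setting above):

import numpy as np
from typing import List, Callable

def optimal_stopping_backward_induction(
    transitions: List[np.ndarray],        # transitions[t][s, s'] for t = 0..T-1
    reward: Callable[[int, int], float],  # H as a function of (t, state)
    num_states: int
) -> List[np.ndarray]:
    T = len(transitions)
    # at the final time step, stopping is the only sensible choice
    vf = [np.array([reward(T, s) for s in range(num_states)])]
    for t in reversed(range(T)):
        continue_vals = transitions[t] @ vf[0]   # E[V*_{t+1} | s]
        stop_vals = np.array([reward(t, s) for s in range(num_states)])
        vf.insert(0, np.maximum(stop_vals, continue_vals))
    return vf  # vf[t][s] is the optimal value at time t in state s

The optimal stopping rule at time t in state s is then to stop exactly when reward(t, s) attains vf[t][s].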
Many derivatives pricing problems (and indeed many problems in the broader space
of Mathematical Finance) can be cast as Optimal Stopping and hence can be modeled as
MDPs (as described above). The important point here is that this enables us to employ
Dynamic Programming or Reinforcement Learning algorithms to identify optimal stop-
ping policy for exotic derivatives (which typically yields a pricing algorithm for exotic
derivatives). When the state space is large (eg: when the payoff depends on several un-
derlying assets or when the payoff depends on the history of underlying’s prices, such
as Asian Options-payoff with American exercise feature), the classical algorithms used in
the finance industry for exotic derivatives pricing are not computationally tractable. This
points to the use of Reinforcement Learning algorithms which tend to be good at handling
large state spaces by effectively leveraging sampling and function approximation method-
ologies in the context of solving the Bellman Optimality Equation. Hence, we propose
Reinforcement Learning as a promising alternative technique to pricing of certain exotic
derivatives that can be cast as Optimal Stopping problems. We will discuss this more after
having covered Reinforcement Learning algorithms.
Recall from Section 7.6.2 that pricing a derivative in an incomplete market involves a "choice" of risk-neutral probability measure (and hence, a choice of one price out of the many valid derivative prices). In practice, this "choice" is typically made in ad-hoc and inconsistent ways. Hence, we propose making this "choice" in a mathematically-disciplined manner, by noting that ultimately a trader is interested in maximizing the "risk-adjusted return" of a derivative together with its hedges (by sequential/dynamic adjustment of the hedge quantities). Once we take this view, it is reminiscent of the Asset Allocation problem we covered in Chapter 6 and the maximization objective is based
on the specification of preference for trading risk versus return (which in turn, amounts
to specification of a Utility function). Therefore, similar to the Asset Allocation problem,
the decision at each time step is the set of adjustments one needs to make to the hedge
quantities. With this rough overview, we are now ready to formalize the MDP model
for this approach to multi-period pricing/hedging in an incomplete market. For ease of
exposition, we simplify the problem setup a bit, although the approach and model we de-
scribe below essentially applies to more complex, more frictionful markets as well. Our
exposition below is an adaptation of the treatment in the Deep Hedging paper by Buehler,
Gonon, Teichmann, Wood, Mohan, Kochems (Bühler et al. 2018).
Assume we have a portfolio of m derivatives and we refer to our collective position
across the portfolio of m derivatives as D. Assume each of these m derivatives expires
by time T (i.e., all of their contingent cashflows will transpire by time T ). We model the
problem as a discrete-time finite-horizon MDP with the terminal time at t = T + 1 (i.e., all
states at time t = T + 1 are terminal states). We require the following notation to model
the MDP:
We will use the notation that we have previously used for discrete-time finite-horizon
MDPs, i.e., we will use time-subscripts in our notation.
We denote the State Space at time t (for all 0 ≤ t ≤ T + 1) as S_t and a specific state at time t as s_t ∈ S_t. Among other things, the key ingredients of s_t include: α_t, P_t, β_t, D. In practice, s_t will include many other components (in general, any market information relevant to hedge trading decisions). However, for simplicity (motivated by ease of articulation), we assume s_t is simply the 4-tuple:

$$s_t := (\alpha_t, P_t, \beta_t, D)$$
We denote the Action Space at time t (for all 0 ≤ t ≤ T) as A_t and a specific action at time t as a_t ∈ A_t. a_t represents the number of units of hedges traded at time t (i.e., adjustments to be made to the hedges at each time step). Since there are n hedge positions (n assets to be traded), a_t ∈ R^n, i.e., A_t ⊆ R^n. Note that for each of the n assets, its corresponding component in a_t is positive if we buy the asset at time t and negative if we sell the asset at time t. Any trading restrictions (eg: constraints on short-selling) will essentially manifest themselves in terms of the exact definition of A_t as a function of s_t.
State transitions are essentially defined by the random movements of prices of the assets
that make up the potential hedges, i.e., P[Pt+1|Pt]. In practice, this is available either as an
explicit transition-probabilities model or, more likely, in the form of a simulator
that produces an on-demand sample of the next time step’s prices, given the current time
step’s prices. Either way, the internals of P[Pt+1 |Pt ] are estimated from actual market data
and realistic trading/market assumptions. The practical details of how to estimate these
internals are beyond the scope of this book - it suffices to say here that this estimation is a
form of supervised learning, albeit fairly nuanced due to the requirement of capturing the
complexities of market-price behavior. For the following description of the MDP, simply
assume that we have access to P[Pt+1 |Pt ] in some form.
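As a concrete (and purely illustrative) picture of what such a sampling model's interface might look like, here is a minimal sketch of a one-step price sampler, assuming independent lognormal moves for the n hedge assets; the drift/volatility inputs and the independence assumption are placeholders rather than a recommended model:

import numpy as np

def sample_next_prices(
    current_prices: np.ndarray,   # P_t, shape (n,)
    mu: np.ndarray,               # illustrative per-asset drifts
    sigma: np.ndarray,            # illustrative per-asset volatilities
    rng: np.random.Generator
) -> np.ndarray:
    # one on-demand sample of P_{t+1} given P_t under lognormal steps
    z = rng.standard_normal(current_prices.shape)
    return current_prices * np.exp(mu - 0.5 * sigma ** 2 + sigma * z)

In practice, such a sampler would be replaced by one estimated from market data, as described above.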
It is important to pay careful attention to the sequence of events at each time step t =
0, . . . , T , described below:
This Pricing principle is known as the principle of Indifference Pricing. The hedging strategy
for D ∪ D′ at time t (for all 0 ≤ t < T ) is given by the associated Optimal Deterministic
Policy πt∗ : St → At
• Pricing of derivatives in a complete market in two equivalent ways: A) Based on
construction of a replicating portfolio, and B) Based on riskless rate-discounted ex-
pectation in the risk-neutral probability measure.
• Optimal Exercise of American Options (and it’s generalization to Optimal Stopping
problems) cast as an MDP Control problem.
• Pricing and Hedging of Derivatives in an Incomplete (real-world) Market cast as an
MDP Control problem.
8. Order-Book Trading Algorithms
In this chapter, we venture into the world of Algorithmic Trading and specifically, we cover
a couple of problems involving a trading Order Book that can be cast as Markov Decision
Processes, and hence tackled with Dynamic Programming or Reinforcement Learning. We
start the chapter by covering the basics of how trade orders are submitted and executed on
an Order Book, a structure that allows for efficient transactions between buyers and sellers
of a financial asset. Without loss of generality, we refer to the financial asset being traded
on the Order Book as a “stock” and the number of units of the asset as “shares.” Next we
will explain how a large trade can significantly shift the Order Book, a phenomenon known
as Price Impact. Finally, we will cover the two algorithmic trading problems that can be cast
as MDPs. The first problem is Optimal Execution of the sale of a large number of shares
of a stock so as to yield the maximum utility of sales proceeds over a finite horizon. This
involves breaking up the sale of the shares into appropriate pieces and selling those pieces
at the right times so as to achieve the goal of maximizing the utility of sales proceeds.
Hence, it is an MDP Control problem where the actions are the number of shares sold
at each time step. The second problem is Optimal Market-Making, i.e., the optimal bids
(willingness to buy a certain number of shares at a certain price) and asks (willingness to
sell a certain number of shares at a certain price) to be submitted on the Order Book. Again,
by optimal, we mean maximization of the utility of revenues generated by the market-
maker over a finite-horizon (market-makers generate revenue through the spread, i.e. the
gap between the bid and ask prices they offer). This is also an MDP Control problem
where the actions are the bid and ask prices along with the bid and ask shares at each
time step.
For a deeper study on the topics of Order Book, Price Impact, Order Execution, Market-
Making (and related topics), we refer you to the comprehensive treatment in Olivier Gueant’s
book (Gueant 2016).
Figure 8.1.: Trading Order Book (Image Credit: https://nms.kcl.ac.uk/rll/enrique-miranda/index.html)
presented for trading in the form of this aggregated view. Thus, the OB data structure can
be represented as two sorted lists of (Price, Size) pairs:
• We refer to $P_0^{(b)}$ as The Best Bid Price (lightened to Best Bid) to signify that it is the
highest offer to buy and hence, the best price for a seller to transact with.
• Likewise, we refer to $P_0^{(a)}$ as The Best Ask Price (lightened to Best Ask) to signify that it is
the lowest offer to sell and hence, the best price for a buyer to transact with.
• $\frac{P_0^{(a)} + P_0^{(b)}}{2}$ is referred to as The Mid Price (lightened to Mid).
• $P_0^{(a)} - P_0^{(b)}$ is referred to as The Best Bid-Ask Spread (lightened to Spread).
• $P_{n-1}^{(a)} - P_{m-1}^{(b)}$ is referred to as The Market Depth (lightened to Depth).
Although an actual real-world trading order book has many other details, we believe this
simplified coverage is adequate for the purposes of core understanding of order book trad-
ing and to navigate the problems of optimal order execution and optimal market-making.
Apart from Limit Orders, traders can express their interest to buy/sell with another type
of order - a Market Order (abbreviated as MO). A Market Order (MO) states one’s intent
to buy/sell N shares at the best possible price(s) available on the OB at the time of MO sub-
mission. So, an LO is keen on price and not so keen on time (willing to wait to get the price
one wants) while an MO is keen on time (desire to trade right away) and not so keen on
price (will take whatever the best LO price is on the OB). So now let us understand the
actual transactions that occur between LOs and MOs (buy and sell interactions, and how
the OB changes as a result of these interactions). Firstly, we note that in normal trading
activity, a newly submitted sell LO’s price is typically above the price of the best buy LO
on the OB. But if a new sell LO's price is less than or equal to the best buy
LO's price, we say that the market has crossed (to mean that the range of bid prices and the
range of ask prices have intersected), which results in an immediate transaction that eats
into the OB’s Buy LOs.
Precisely, a new Sell LO (P, N ) potentially transacts with (and hence, removes) the best
Buy LOs on the OB.
$$\text{Removal: } \Big[\Big(P_i^{(b)}, \min\Big(N_i^{(b)}, \max\Big(0, N - \sum_{j=0}^{i-1} N_j^{(b)}\Big)\Big)\Big) \;\Big|\; \big(i : P_i^{(b)} \geq P\big)\Big] \tag{8.1}$$
After this removal, it potentially adds the following LO to the asks side of the OB:
$$\Big(P, \max\Big(0, N - \sum_{i: P_i^{(b)} \geq P} N_i^{(b)}\Big)\Big) \tag{8.2}$$
Likewise, a new Buy LO $(P, N)$ potentially transacts with (and hence, removes) the
best Sell LOs on the OB:
$$\text{Removal: } \Big[\Big(P_i^{(a)}, \min\Big(N_i^{(a)}, \max\Big(0, N - \sum_{j=0}^{i-1} N_j^{(a)}\Big)\Big)\Big) \;\Big|\; \big(i : P_i^{(a)} \leq P\big)\Big] \tag{8.3}$$
After this removal, it potentially adds the following to the bids side of the OB:
$$\Big(P, \max\Big(0, N - \sum_{i: P_i^{(a)} \leq P} N_i^{(a)}\Big)\Big) \tag{8.4}$$
When a Market Order (MO) is submitted, things are simpler. A Sell Market Order of $N$
shares will remove the best Buy LOs on the OB:
$$\text{Removal: } \Big[\Big(P_i^{(b)}, \min\Big(N_i^{(b)}, \max\Big(0, N - \sum_{j=0}^{i-1} N_j^{(b)}\Big)\Big)\Big) \;\Big|\; 0 \leq i < m\Big] \tag{8.5}$$
The sales proceeds for this MO are:
$$\sum_{i=0}^{m-1} P_i^{(b)} \cdot \min\Big(N_i^{(b)}, \max\Big(0, N - \sum_{j=0}^{i-1} N_j^{(b)}\Big)\Big) \tag{8.6}$$
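As a quick numerical illustration of Equations (8.5) and (8.6), with hypothetical Buy LOs (not from the book's examples): suppose the Buy LOs are [(100, 40), (99, 60), (98, 50)] and a Sell MO of N = 120 shares arrives.

# hypothetical Buy LOs as (price, shares) pairs, best bid first
buy_los = [(100.0, 40), (99.0, 60), (98.0, 50)]
n = 120    # size of the incoming Sell MO

proceeds = 0.0
remaining = n
for price, shares in buy_los:
    transacted = min(shares, remaining)   # the summand of Equation (8.6)
    proceeds += price * transacted
    remaining -= transacted

print(proceeds)           # 100*40 + 99*60 + 98*20 = 11900.0
print(n * buy_los[0][0])  # best-possible proceeds of 12000.0 (attained only if N <= 40)

The shortfall of 100.0 relative to the best-possible proceeds is precisely the effect discussed next.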
We note that if $N$ is large, the sales proceeds for this MO can be significantly lower than
the best possible sales proceeds ($= N \cdot P_0^{(b)}$), which happens only if $N \leq N_0^{(b)}$. Note also
that if $N$ is large, the new Best Bid Price (the new value of $P_0^{(b)}$) can be significantly lower than
the Best Bid Price before the MO was submitted (because the MO "eats into" a significant
volume of Buy LOs on the OB). This "eating into" the Buy LOs on the OB and the consequent
lowering of the Best Bid Price (and hence, of the Mid Price) is known as the Price Impact of an MO
(more specifically, as the Temporary Price Impact of an MO). We use the word "temporary"
because subsequent to this "eating into" the Buy LOs of the OB (and the consequent "hole,"
i.e., large Bid-Ask Spread), market participants will submit "replenishment LOs" (both
Buy LOs and Sell LOs) on the OB. These replenishment LOs would typically narrow the
Bid-Ask Spread, and the eventual settlement of the Best Bid/Best Ask/Mid Prices consti-
tutes what we call Permanent Price Impact - which refers to the changes in the OB Best Bid/Best
Ask/Mid prices relative to the corresponding prices before submission of the MO.
Likewise, a Buy Market Order of $N$ shares will remove the best Sell LOs on the OB:
$$\text{Removal: } \Big[\Big(P_i^{(a)}, \min\Big(N_i^{(a)}, \max\Big(0, N - \sum_{j=0}^{i-1} N_j^{(a)}\Big)\Big)\Big) \;\Big|\; 0 \leq i < n\Big] \tag{8.7}$$
The purchase bill for this MO is:
$$\sum_{i=0}^{n-1} P_i^{(a)} \cdot \min\Big(N_i^{(a)}, \max\Big(0, N - \sum_{j=0}^{i-1} N_j^{(a)}\Big)\Big) \tag{8.8}$$
If $N$ is large, the purchase bill for this MO can be significantly higher than the best
possible purchase bill ($= N \cdot P_0^{(a)}$), which happens only if $N \leq N_0^{(a)}$. All that we wrote
above in terms of Temporary and Permanent Price Impact naturally applies in the opposite
direction for a Buy MO.
We refer to all of the above-described OB movements, including both temporary and
permanent Price Impacts broadly as Order Book Dynamics. There is considerable literature
on modeling Order Book Dynamics and some of these models can get fairly complex in
order to capture various real-world nuances. Much of this literature is beyond the scope
of this book. In this chapter, we will cover a few simple models for how a sell MO will
move the OB’s Best Bid Price (rather than a model for how it will move the entire OB). The
model for how a buy MO will move the OB’s Best Ask Price is naturally identical.
Now let’s write some code that models how LOs and MOs interact with the OB. We
write a class OrderBook that represents the Buy and Sell Limit Orders on the Order Book,
which are each represented as a sorted sequence of the type DollarsAndShares, which is a
dataclass we created to represent any pair of a dollar amount (dollar: float) and num-
ber of shares (shares: int). Sometimes, we use DollarsAndShares to represent an LO (pair
of price and shares) as in the case of the sorted lists of Buy and Sell LOs. At other times, we
use DollarsAndShares to represent the pair of total dollars transacted and total shares trans-
acted when an MO is executed on the OB. The OrderBook maintains a price-descending se-
quence of PriceSizePairs for Buy LOs (descending_bids) and a price-ascending sequence
of PriceSizePairs for Sell LOs (ascending_asks). We write the basic methods to get the
OrderBook’s highest bid price (method bid_price), lowest ask price (method ask_price),
mid price (method mid_price), spread between the highest bid price and lowest ask price
(method bid_ask_spread), and market depth (method market_depth).
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple

@dataclass(frozen=True)
class DollarsAndShares:
    dollars: float
    shares: int

PriceSizePairs = Sequence[DollarsAndShares]

@dataclass(frozen=True)
class OrderBook:
    descending_bids: PriceSizePairs
    ascending_asks: PriceSizePairs

    def bid_price(self) -> float:
        return self.descending_bids[0].dollars

    def ask_price(self) -> float:
        return self.ascending_asks[0].dollars

    def mid_price(self) -> float:
        return (self.bid_price() + self.ask_price()) / 2

    def bid_ask_spread(self) -> float:
        return self.ask_price() - self.bid_price()

    def market_depth(self) -> float:
        return self.ascending_asks[-1].dollars - \
            self.descending_bids[-1].dollars
Next we want to write methods for LOs and MOs to interact with the OrderBook. Notice
that each of Equation (8.1) (new Sell LO potentially removing some of the beginning of
the Buy LOs on the OB), Equation (8.3) (new Buy LO potentially removing some of the
beginning of the Sell LOs on the OB), Equation (8.5) (Sell MO removing some of the
beginning of the Buy LOs on the OB) and Equation (8.7) (Buy MO removing some of the
beginning of the Sell LOs on the OB) all perform a common core function - they “eat into”
the most significant LOs (on the opposite side) on the OB. So we first write a @staticmethod
eat_book for this common function.
eat_book takes as input a ps_pairs: PriceSizePairs (representing one side of the OB)
and the number of shares: int to buy/sell. Notice eat_book’s return type: Tuple[DollarsAndShares,
PriceSizePairs]. The returned DollarsAndShares represents the pair of dollars transacted
and the number of shares transacted (with number of shares transacted being less than
or equal to the input shares). The returned PriceSizePairs represents the remainder
of ps_pairs after the transacted number of shares have eaten into the input ps_pairs.
eat_book first deletes (i.e. “eats up”) as much of the beginning of the ps_pairs: PriceSizePairs
data structure as it can (basically matching the input number of shares with an appropri-
ate number of shares at the beginning of the ps_pairs: PriceSizePairs input). Note that
the returned PriceSizePairs is a separate data structure, ensuring the immutability of the
input ps_pairs: PriceSizePairs.
@staticmethod
def eat_book(
    ps_pairs: PriceSizePairs,
    shares: int
) -> Tuple[DollarsAndShares, PriceSizePairs]:
    rem_shares: int = shares
    dollars: float = 0.
    for i, d_s in enumerate(ps_pairs):
        this_price: float = d_s.dollars
        this_shares: int = d_s.shares
        dollars += this_price * min(rem_shares, this_shares)
        if rem_shares < this_shares:
            return (
                DollarsAndShares(dollars=dollars, shares=shares),
                [DollarsAndShares(
                    dollars=this_price,
                    shares=this_shares - rem_shares
                )] + list(ps_pairs[i+1:])
            )
        else:
            rem_shares -= this_shares
    return (
        DollarsAndShares(dollars=dollars, shares=shares - rem_shares),
        []
    )
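To make the return semantics concrete, here is a quick illustrative check of eat_book on a hypothetical bid side (these specific levels are not from the book's examples):

bids = [
    DollarsAndShares(dollars=100.0, shares=40),
    DollarsAndShares(dollars=99.0, shares=60),
    DollarsAndShares(dollars=98.0, shares=50)
]
transacted, remaining_book = OrderBook.eat_book(bids, 70)
print(transacted)       # DollarsAndShares(dollars=6970.0, shares=70): 40 @ 100 and 30 @ 99
print(remaining_book)   # 30 shares left at 99.0, followed by the untouched 50 at 98.0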
Now we are ready to write the method sell_limit_order which takes Sell LO Price and
Sell LO shares as input. As you can see in the code below, first it potentially removes
(if it “crosses”) an appropriate number of shares on the Buy LOs side of the OB (using
the @staticmethod eat_book), and then potentially adds an appropriate number of shares
at the Sell LO Price on the Sell LOs side of the OB. sell_limit_order returns a pair of
DollarsAndShares type and OrderBook type. The returned DollarsAndShares represents the
pair of dollars transacted and the number of shares transacted with the Buy LOs side of
the OB (with number of shares transacted being less than or equal to the input shares).
The returned OrderBook represents the new OB after potentially eating into the Buy LOs
side of the OB and then potentially adding some shares at the Sell LO Price on the Sell
LOs side of the OB. Note that the returned OrderBook is a newly-created data structure,
ensuring the immutability of self. We urge you to read the code below carefully as there
are many subtle details that are handled in the code.
from dataclasses import replace

def sell_limit_order(self, price: float, shares: int) -> \
        Tuple[DollarsAndShares, OrderBook]:
    index: Optional[int] = next((i for i, d_s
                                 in enumerate(self.descending_bids)
                                 if d_s.dollars < price), None)
    eligible_bids: PriceSizePairs = self.descending_bids \
        if index is None else self.descending_bids[:index]
    ineligible_bids: PriceSizePairs = [] if index is None else \
        self.descending_bids[index:]

    d_s, rem_bids = OrderBook.eat_book(eligible_bids, shares)
    new_bids: PriceSizePairs = list(rem_bids) + list(ineligible_bids)
    rem_shares: int = shares - d_s.shares

    if rem_shares > 0:
        new_asks: List[DollarsAndShares] = list(self.ascending_asks)
        index1: Optional[int] = next((i for i, d_s
                                      in enumerate(new_asks)
                                      if d_s.dollars >= price), None)
        if index1 is None:
            new_asks.append(DollarsAndShares(
                dollars=price,
                shares=rem_shares
            ))
        elif new_asks[index1].dollars != price:
            new_asks.insert(index1, DollarsAndShares(
                dollars=price,
                shares=rem_shares
            ))
        else:
            new_asks[index1] = DollarsAndShares(
                dollars=price,
                shares=new_asks[index1].shares + rem_shares
            )
        return d_s, OrderBook(
            ascending_asks=new_asks,
            descending_bids=new_bids
        )
    else:
        return d_s, replace(
            self,
            descending_bids=new_bids
        )
Next, we write the easier method sell_market_order which takes as input the number
of shares to be sold (as a market order). sell_market_order transacts with the appropri-
ate number of shares on the Buy LOs side of the OB (removing those many shares from
the Buy LOs side). It returns a pair of DollarsAndShares type and OrderBook type. The
returned DollarsAndShares represents the pair of dollars transacted and the number of
shares transacted (with number of shares transacted being less than or equal to the in-
put shares). The returned OrderBook represents the remainder of the OB after the trans-
acted number of shares have eaten into the Buy LOs side of the OB. Note that the returned
OrderBook is a newly-created data structure, ensuring the immutability of self.
def sell_market_order(
    self,
    shares: int
) -> Tuple[DollarsAndShares, OrderBook]:
    d_s, rem_bids = OrderBook.eat_book(
        self.descending_bids,
        shares
    )
    return (d_s, replace(self, descending_bids=rem_bids))
We won’t list the methods buy_limit_order and buy_market_order here as they are com-
pletely analogous (you can find the entire code for OrderBook in the file rl/chapter9/order_book.py).
Now let us test out this code by creating a sample OrderBook and submitting some LOs and
MOs to interact with the OrderBook.
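The construction code for the starting book is not reproduced here; a minimal sketch consistent with the description below (bid prices 100 down to 91, ask prices 105 up to 114, hence a bid-ask spread of 5) might look as follows - the choice of 100 shares at every price level is an illustrative assumption, and we bind the book to a variable we'll call ob0:

bids: PriceSizePairs = [DollarsAndShares(dollars=float(p), shares=100)
                        for p in range(100, 90, -1)]   # bids at 100 down to 91
asks: PriceSizePairs = [DollarsAndShares(dollars=float(p), shares=100)
                        for p in range(105, 115)]      # asks at 105 up to 114
ob0: OrderBook = OrderBook(descending_bids=bids, ascending_asks=asks)
print(ob0.bid_price(), ob0.ask_price(), ob0.bid_ask_spread())   # 100.0 105.0 5.0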
The above code creates an OrderBook in the price range [91, 114] with a bid-ask spread
of 5. Figure 8.2 depicts this OrderBook visually.
Let’s submit a Sell LO that says we’d like to sell 40 shares as long as the transacted price
is greater than or equal to 107. Our Sell LO should simply get added to the Sell LOs side
of the OB.
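A sketch of the corresponding call (the names d_s1 and ob0 are our assumptions; ob1 is the name used below):

d_s1, ob1 = ob0.sell_limit_order(107, 40)   # Sell LO: price 107, 40 shares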
The new OrderBook ob1 has 40 more shares at the price level of 107, as depicted in Figure
8.3.
Now let’s submit a Sell MO that says we’d like to sell 120 shares at the “best price.” Our
Sell MO should transact with 120 shares at “best prices” of 100 and 99 as well (since the
OB does not have enough Buy LO shares at the price of 100).
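A sketch of the corresponding call:

d_s2, ob2 = ob1.sell_market_order(120)      # Sell MO: 120 shares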
The new OrderBook ob2 has 120 fewer shares on the Buy LOs side of the OB, as depicted
in Figure 8.4.
Now let’s submit a Buy LO that says we’d like to buy 80 shares as long as the transacted
price is less than or equal to 100. Our Buy LO should get added to the Buy LOs side of the
OB.
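A sketch of the corresponding call (assuming buy_limit_order has the signature analogous to sell_limit_order):

d_s3, ob3 = ob2.buy_limit_order(100, 80)    # Buy LO: price 100, 80 shares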
Figure 8.2.: Starting Order Book
Figure 8.4.: Order Book after Sell MO
Figure 8.6.: Order Book after 2nd Sell LO
The new OrderBook ob3 has re-introduced a Buy LO at the price level of 100 (now with
80 shares), as depicted in Figure 8.5.
Now let’s submit a Sell LO that says we’d like to sell 60 shares as long as the transacted
price is greater than or equal to 104. Our Sell LO should get added to the Sell LOs side of
the OB.
d_s4, ob4 = ob3.sell_limit_order(104, 60)
The new OrderBook ob4 has introduced a Sell LO at a price of 104 with 60 shares, as
depicted in Figure 8.6.
Now let’s submit a Buy MO that says we’d like to buy 150 shares at the “best price.” Our
Buy MO should transact with 150 shares at “best prices” on the Sell LOs side of the OB.
d_s5, ob5 = ob4.buy_market_order(150)
The new OrderBook ob5 has 150 fewer shares on the Sell LOs side of the OB, wiping out all
the shares at the price level of 104 and almost wiping out all the shares at the price level
of 105, as depicted in Figure 8.7.
This has served as a good test of our code (the transactions work as we'd like). We
encourage you to write more code of this sort to interact with the OrderBook, and to
produce graphs of the evolution of the OrderBook, as this will help develop stronger intu-
ition and internalize the concepts we've learnt above. All of the above code is in the file
rl/chapter9/order_book.py.
Now we are ready to get started with the problem of Optimal Execution of a large-sized
Market Order.
Figure 8.7.: Order Book after Buy MO
new investment objectives. You have to sell all of the N shares you own in this stock in
the next T hours, but you have been instructed to accomplish the sale by submitting only
Market Orders (not allowed to submit any Limit Orders because of the uncertainty in the
time of execution of the sale with a Limit Order). You can submit sell market orders (of
any size) at the start of each hour - so you have T opportunities to submit market orders
of any size. Your goal is to maximize the Expected Total Utility of sales proceeds for all N
shares over the T hours. Your task is to break up N into T appropriate chunks to maximize
the Expected Total Utility objective. If you attempt to sell the N shares too fast (i.e., too
many in the first few hours), as we’ve learnt above, each (MO) sale will eat a lot into the
Buy LOs on the OB (Temporary Price Impact) which would result in transacting at prices
below the best price (Best Bid Price). Moreover, you risk moving the Best Bid Price on the
OB significantly lower (Permanent Price Impact) that would affect the sales proceeds for
the next few sales you’d make. On the other hand, if you sell the N shares too slow (i.e., too
few in the first few hours), you might transact at good prices but then you risk running
out of time, which means you will have to dump a lot of shares with time running out
which in turn would mean transacting at prices below the best price. Moreover, selling
too slow exposes you to more uncertainty in market price movements over a longer time
period, and more uncertainty in sales proceeds means the Expected Utility objective gets
hurt. Thus, the precise timing and sizes in the breakup of shares is vital. You will need
to have an estimate of the Temporary and Permanent Price Impact of your Market Orders,
which can help you identify the appropriate number of shares to sell at the start of each
hour.
Unsurprisingly, we can model this problem as a Markov Decision Process control prob-
lem where the actions at each time step (each hour, in this case) are the number of shares
sold at the time step and the rewards are the Utility of sales proceeds at each time step. To
keep things simple and intuitive, we shall model Price Impact of Market Orders in terms of
their effect on the Best Bid Price (rather than in terms of their effect on the entire OB). In
other words, we won’t be modeling the entire OB Price Dynamics, just the Best Bid Price
Dynamics. We shall refer to the OB activity of an MO immediately “eating into the Buy
LOs” (and hence, potentially transacting at prices lower than the best price) as the Tempo-
rary Price Impact. As mentioned earlier, this is followed by subsequent replenishment of
both Buy and Sell LOs on the OB (stabilizing the OB) - we refer to any eventual (end of
the hour) lowering of the Best Bid Price (relative to the Best Bid Price before the MO was
submitted) as the Permanent Price Impact. Modeling the temporary and permanent Price
Impacts separately helps us in deciding on the optimal actions (optimal shares to be sold
at the start of each hour).
Now we develop some formalism to describe this problem precisely. As mentioned
earlier, we make a number of simplifying assumptions in modeling the OB Dynamics for
ease of articulation (without diluting the most important concepts). We index discrete
time by t = 0, 1, . . . , T . We denote Pt as the Best Bid Price on the OB at the start of time
step t (for all t = 0, 1, . . . , T ) and Nt as the number of shares sold at time step t for all
t = 0, 1, . . . , T − 1. We denote the number of shares remaining to be sold at the start of
time step $t$ as $R_t$ for all $t = 0, 1, \ldots, T$. Therefore,
$$R_t = N - \sum_{i=0}^{t-1} N_i \quad \text{for all } t = 0, 1, \ldots, T$$
Note that:
$$R_0 = N$$
$$R_{t+1} = R_t - N_t \quad \text{for all } t = 0, 1, \ldots, T - 1$$
Also note that we need to sell everything by time $t = T$ and so:
$$N_{T-1} = R_{T-1} \Rightarrow R_T = 0$$
The model of Best Bid Price Dynamics from one time step to the next is given by:
Pt+1 = ft (Pt , Nt , ϵt ) for all t = 0, 1, . . . , T − 1
where ft is an arbitrary function incorporating:
• The Permanent Price Impact of selling Nt shares.
• The Price-Impact-independent market-movement of the Best Bid Price from time t
to time t + 1.
• Noise ϵt , a source of randomness in Best Bid Price movements.
The sales proceeds from the sale at time step t, for all t = 0, 1, . . . , T − 1, is defined as:
Nt · Qt = Nt · (Pt − gt (Pt , Nt ))
where gt is a function modeling the Temporary Price Impact (i.e., the Nt MO “eating into”
the Buy LOs on the OB). Qt should be interpreted as the average Buy LO price transacted
against by the Nt MO at time t.
Lastly, we denote the Utility (of Sales Proceeds) function as U (·).
As mentioned previously, solving for the optimal number of shares to be sold at each
time step can be modeled as a discrete-time finite-horizon Markov Decision Process, which
we describe below in terms of the order of MDP activity at each time step t = 0, 1, . . . , T −1
(the MDP horizon is time T meaning all states at time T are terminal states). We follow
the notational style of finite-horizon MDPs that should now be familiar from previous
chapters.
Order of Events at time step t for all t = 0, 1, . . . , T − 1:
• Observe State st := (Pt , Rt ) ∈ St
• Perform Action at := Nt ∈ At
• Receive Reward rt+1 := U (Nt · Qt ) = U (Nt · (Pt − gt (Pt , Nt )))
• Experience Price Dynamics Pt+1 = ft (Pt , Nt , ϵt ) and set Rt+1 = Rt − Nt so as to
obtain the next state st+1 = (Pt+1 , Rt+1 ) ∈ St+1 .
Note that we have intentionally not specified if Pt , Rt , Nt are integers or real numbers, or
if constrained to be non-negative etc. Those precise specifications will be customized to the
nuances/constraints of the specific Optimal Order Execution problem we'd be solving. By
default, we shall assume that Pt ∈ R+ and Nt, Rt ∈ Z≥0 (as these represent realistic trading
situations), although we do consider special cases later in the chapter where Pt , Rt ∈ R
(unconstrained real numbers for analytical tractability).
The goal is to find the Optimal Policy $\pi^* = (\pi_0^*, \pi_1^*, \ldots, \pi_{T-1}^*)$ (defined as $\pi_t^*((P_t, R_t)) = N_t^*$)
that maximizes:
$$\mathbb{E}\Big[\sum_{t=0}^{T-1} \gamma^t \cdot U(N_t \cdot Q_t)\Big]$$
where γ is the discount factor to account for the fact that future utility of sales proceeds
can be modeled to be less valuable than today’s.
Now let us write some code to solve this MDP. We write a class OptimalOrderExecution
which models a fairly generic MDP for Optimal Order Execution as described above, and
solves the Control problem with Approximate Value Iteration using the backward induc-
tion algorithm that we implemented in Chapter 4. Let us start by taking a look at the
attributes (inputs) in OptimalOrderExecution:
• shares refers to the total number of shares N to be sold over T time steps.
• time_steps refers to the number of time steps T .
• avg_exec_price_diff refers to the time-sequenced functions gt that return the reduc-
tion in the average price obtained by the Market Order at time t due to eating into the
Buy LOs. gt takes as input the type PriceAndShares that represents a pair of price:
float and shares: int (in this case, the price is Pt and the shares is the MO size Nt
at time t). As explained earlier, the sales proceeds at time t is: Nt · (Pt − gt (Pt , Nt )).
• price_dynamics refers to the time-sequenced functions ft that represent the price dy-
namics: Pt+1 ∼ ft (Pt , Nt ). ft outputs a probability distribution of prices for Pt+1 .
• utility_func refers to the Utility of Sales Proceeds function, incorporating any risk-
aversion.
• discount_factor refers to the discount factor γ.
• func_approx refers to the ValueFunctionApprox type to be used to approximate the
Value Function for each time step (since we are doing backward induction).
• initial_price_distribution refers to the probability distribution of prices P0 at time
0, which is used to generate the samples of states at each of the time steps (needed
in the approximate backward induction algorithm).
from rl.approximate_dynamic_programming import ValueFunctionApprox

@dataclass(frozen=True)
class PriceAndShares:
    price: float
    shares: int

@dataclass(frozen=True)
class OptimalOrderExecution:
    shares: int
    time_steps: int
    avg_exec_price_diff: Sequence[Callable[[PriceAndShares], float]]
    price_dynamics: Sequence[Callable[[PriceAndShares], Distribution[float]]]
    utility_func: Callable[[float], float]
    discount_factor: float
    func_approx: ValueFunctionApprox[PriceAndShares]
    initial_price_distribution: Distribution[float]
The two key things we need to perform the backward induction are:
• A method get_mdp that given a time step t, produces the MarkovDecisionProcess ob-
ject representing the transitions from time t to time t+1. The class OptimalExecutionMDP
within get_mdp implements the abstract methods step and actions of the abstract
class MarkovDecisionProcess. The code should be fairly self-explanatory - just a cou-
ple of things to point out here. Firstly, the input p_r: NonTerminal[PriceAndShares]
to the step method represents the state (Pt , Rt ) at time t, and the variable p_s: PriceAndShares
represents the pair of (Pt , Nt ), which serves as input to avg_exec_price_diff and
price_dynamics (attributes of OptimalOrderExecution). Secondly, note that the actions
method returns an Iterator on a single int at time t = T −1 because of the constraint
NT −1 = RT −1 .
• A method get_states_distribution that returns the probability distribution of states
(Pt , Rt ) at time t (of type SampledDistribution[NonTerminal[PriceAndShares]]). The
code here is similar to the get_states_distribution method of AssetAllocDiscrete
in Chapter 6 (essentially, walking forward from time 0 to time t by sampling from
the state-transition probability distribution and also sampling from uniform choices
over all actions at each time step).
# The beginning of get_mdp (the setup of the inner OptimalExecutionMDP class and
# of the step method's sampler, which computes next_state and reward) is not
# reproduced here; the fragment below is the tail of that method.
            )
            return (NonTerminal(next_state), reward)

        return SampledDistribution(
            sampler=sr_sampler_func,
            expectation_samples=100
        )

    def actions(self, p_s: NonTerminal[PriceAndShares]) -> \
            Iterator[int]:
        if t == steps - 1:
            return iter([p_s.state.shares])
        else:
            return iter(range(p_s.state.shares + 1))

return OptimalExecutionMDP()
def get_states_distribution(self, t: int) -> \
        SampledDistribution[NonTerminal[PriceAndShares]]:

    def states_sampler_func() -> NonTerminal[PriceAndShares]:
        price: float = self.initial_price_distribution.sample()
        rem: int = self.shares
        for i in range(t):
            sell: int = Choose(range(rem + 1)).sample()
            price = self.price_dynamics[i](PriceAndShares(
                price=price,
                shares=rem
            )).sample()
            rem -= sell
        return NonTerminal(PriceAndShares(
            price=price,
            shares=rem
        ))

    return SampledDistribution(states_sampler_func)
Finally, we produce the Optimal Value Function and Optimal Policy for each time step
with the following method backward_induction_vf_and_pi:
from rl.approximate_dynamic_programming import back_opt_vf_and_policy

def backward_induction_vf_and_pi(
    self
) -> Iterator[Tuple[ValueFunctionApprox[PriceAndShares],
                    DeterministicPolicy[PriceAndShares, int]]]:
    mdp_f0_mu_triples: Sequence[Tuple[
        MarkovDecisionProcess[PriceAndShares, int],
        ValueFunctionApprox[PriceAndShares],
        SampledDistribution[NonTerminal[PriceAndShares]]
    ]] = [(
        self.get_mdp(i),
        self.func_approx,
        self.get_states_distribution(i)
    ) for i in range(self.time_steps)]

    num_state_samples: int = 10000
    error_tolerance: float = 1e-6

    return back_opt_vf_and_policy(
        mdp_f0_mu_triples=mdp_f0_mu_triples,
        gamma=self.discount_factor,
        num_state_samples=num_state_samples,
        error_tolerance=error_tolerance
    )
different temporary and permanent price impact functions, different utility functions, im-
pose a few constraints etc.). Note that the above code has been written with an educa-
tional motivation rather than an efficient-computation motivation, so the convergence of
the backward induction ADP algorithm is going to be slow. How do we know that the
above code is correct? Well, we need to create a simple special case that yields a closed-
form solution that we can compare the Optimal Value Function and Optimal Policy pro-
duced by OptimalOrderExecution against. This will be the subject of the following subsec-
tion.
$$P_{t+1} = f_t(P_t, N_t, \epsilon_t) = P_t - \alpha \cdot N_t + \epsilon_t$$
where $\alpha \in \mathbb{R}$ and $\epsilon_t$ for all $t = 0, 1, \ldots, T-1$ are independent and identically distributed
(i.i.d.) with $\mathbb{E}[\epsilon_t \mid N_t, P_t] = 0$. Therefore, the Permanent Price Impact (as an Expectation)
is $\alpha \cdot N_t$.
As for the Temporary Price Impact, we know that $g_t$ needs to be a non-decreasing function
of $N_t$. We assume the simple linear form $g_t(P_t, N_t) = \beta \cdot N_t$ for some constant $\beta \geq 0$, together
with the identity Utility function and discount factor $\gamma = 1$, so that the Value Function for a
policy $\pi$ is:
$$V_t^{\pi}((P_t, R_t)) = \mathbb{E}_{\pi}\Big[\sum_{i=t}^{T-1} N_i \cdot (P_i - \beta \cdot N_i) \,\Big|\, (P_t, R_t)\Big]$$
The Optimal Value Function satisfies the finite-horizon Bellman Optimality Equation
for all $t = 0, 1, \ldots, T-2$, as follows:
$$V_t^*((P_t, R_t)) = \max_{N_t} \{ N_t \cdot (P_t - \beta \cdot N_t) + \mathbb{E}[V_{t+1}^*((P_{t+1}, R_{t+1}))] \}$$
and
$$V_{T-1}^*((P_{T-1}, R_{T-1})) = N_{T-1} \cdot (P_{T-1} - \beta \cdot N_{T-1}) = R_{T-1} \cdot (P_{T-1} - \beta \cdot R_{T-1})$$
Substituting the expression for $V_{T-1}^*$ (and using $R_{T-1} = R_{T-2} - N_{T-2}$ along with the price dynamics) gives:
$$V_{T-2}^*((P_{T-2}, R_{T-2})) = \max_{N_{T-2}} \{ R_{T-2} \cdot P_{T-2} - \beta \cdot R_{T-2}^2 + (\alpha - 2\beta) \cdot (N_{T-2}^2 - N_{T-2} \cdot R_{T-2}) \} \tag{8.9}$$
For the case $\alpha \geq 2\beta$ (noting that $0 \leq N_{T-2} \leq R_{T-2}$), the maximum is attained at a boundary,
giving the trivial solution $N_{T-2}^* = 0$ or $N_{T-2}^* = R_{T-2}$. So we focus on the typical case $\alpha < 2\beta$.
Differentiating the expression inside the max in Equation (8.9) with respect to $N_{T-2}$ and setting
the derivative to 0 gives:
$$(\alpha - 2\beta) \cdot (2N_{T-2}^* - R_{T-2}) = 0 \Rightarrow N_{T-2}^* = \frac{R_{T-2}}{2}$$
Substituting this solution for $N_{T-2}^*$ in Equation (8.9) gives:
$$V_{T-2}^*((P_{T-2}, R_{T-2})) = R_{T-2} \cdot P_{T-2} - R_{T-2}^2 \cdot \Big(\frac{\alpha + 2\beta}{4}\Big)$$
Continuing backwards in time in this manner gives:
$$N_t^* = \frac{R_t}{T-t} \quad \text{for all } t = 0, 1, \ldots, T-1$$
$$V_t^*((P_t, R_t)) = R_t \cdot P_t - \frac{R_t^2}{2} \cdot \Big(\frac{2\beta + \alpha \cdot (T-t-1)}{T-t}\Big) \quad \text{for all } t = 0, 1, \ldots, T-1$$
Rolling forward in time, we see that $N_t^* = \frac{N}{T}$, i.e., splitting the $N$ shares uniformly across
the $T$ time steps. Hence, the Optimal Policy is a constant deterministic function (i.e., inde-
pendent of the State). Note that a uniform split makes intuitive sense because Price Impact
and Market Movement are both linear and additive, and don't interact. This optimization
is essentially equivalent to minimizing $\sum_{t=1}^{T} N_t^2$ subject to the constraint $\sum_{t=1}^{T} N_t = N$. The
Optimal Expected Total Sales Proceeds is equal to:
$$N \cdot P_0 - \frac{N^2}{2} \cdot \Big(\alpha + \frac{2\beta - \alpha}{T}\Big)$$
Implementation Shortfall is the technical term used to refer to the reduction in Total Sales
Proceeds relative to the maximum possible sales proceeds ($= N \cdot P_0$). So, in this simple
linear model, the Implementation Shortfall from Price Impact is $\frac{N^2}{2} \cdot \big(\alpha + \frac{2\beta - \alpha}{T}\big)$. Note that
the Implementation Shortfall is non-zero even if one had infinite time available ($T \rightarrow \infty$)
for the case of $\alpha > 0$. If the Price Impact were purely temporary ($\alpha = 0$, i.e., the price fully
snapped back), then the Implementation Shortfall would approach zero as the available time grows.
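As a quick illustration with hypothetical values $N = 100$, $P_0 = 100$, $\alpha = 0.05$, $\beta = 0.05$ and $T = 5$: the Implementation Shortfall is $\frac{100^2}{2} \cdot (0.05 + \frac{0.1 - 0.05}{5}) = 5000 \cdot 0.06 = 300$, i.e., 3% of the best-possible proceeds of $N \cdot P_0 = 10{,}000$.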
So now let’s customize the class OptimalOrderExecution to this simple linear price im-
pact model, and compare the Optimal Value Function and Optimal Policy produced by
OptimalOrderExecution against the above-derived closed-form solutions. We write code
below to create an instance of OptimalOrderExecution with time steps T = 5, total number
of shares to be sold N = 100, linear permanent price impact with α = 0.03, linear tempo-
rary price impact with β = 0.03, utility function as the identity function (no risk-aversion),
and discount factor γ = 1. We set the standard deviation for the price dynamics probabil-
ity distribution to 0 to speed up the calculation. Since we know the closed-form solution
for the Optimal Value Function, we provide some assistance to OptimalOrderExecution by
setting up a linear function approximation with two features: $P_t \cdot R_t$ and $R_t^2$. The task
of OptimalOrderExecution is to infer the correct coefficients of these features for each time
step. If the coefficients match that of the closed-form solution, it provides a great degree
of confidence that our code is working correctly.
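The instantiation code is not reproduced here; the sketch below shows roughly how such an instance could be set up with the parameter values quoted above. It assumes the Gaussian distribution and LinearFunctionApprox classes from the accompanying rl library (as used in earlier chapters); treat the constructor details and the initial price distribution's parameters as indicative assumptions rather than the book's exact code.

from rl.distribution import Gaussian
from rl.function_approx import LinearFunctionApprox

num_shares: int = 100
num_time_steps: int = 5
alpha: float = 0.03   # permanent price impact coefficient
beta: float = 0.03    # temporary price impact coefficient

ooe: OptimalOrderExecution = OptimalOrderExecution(
    shares=num_shares,
    time_steps=num_time_steps,
    avg_exec_price_diff=[lambda p_s: beta * p_s.shares] * num_time_steps,
    price_dynamics=[lambda p_s: Gaussian(
        μ=p_s.price - alpha * p_s.shares,
        σ=0.     # zero standard deviation, as discussed above
    )] * num_time_steps,
    utility_func=lambda x: x,       # identity utility (no risk-aversion)
    discount_factor=1.0,
    func_approx=LinearFunctionApprox.create(
        feature_functions=[
            lambda x: x.state.price * x.state.shares,   # feature P_t * R_t
            lambda x: x.state.shares * x.state.shares   # feature R_t * R_t
        ]
    ),
    initial_price_distribution=Gaussian(μ=100., σ=10.)  # illustrative
)
it_vf_pi = ooe.backward_induction_vf_and_pi()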
Next we evaluate this Optimal Value Function and Optimal Policy on a particular state
for all time steps, and compare that against the closed-form solution. The state we use for
evaluation is as follows:
The code to evaluate the obtained Optimal Value Function and Optimal Policy on the
above state is as follows:
With 100,000 state samples for each time step and only 10 state transition samples (since
the standard deviation of ϵ is set to be very small), this prints the following:
Time 0
Time 1
Optimal Sales = 20, Opt Val = 9762.479
Time 2
Time 3
Time 4
for t in range(num_time_steps):
    print(f"Time {t:d}")
    print()
    left: int = num_time_steps - t
    opt_sale_anal: float = num_shares / num_time_steps
    wt1: float = 1
    wt2: float = -(2 * beta + alpha * (left - 1)) / (2 * left)
    val_anal: float = wt1 * state.price * state.shares + \
        wt2 * state.shares * state.shares
    # (the print statements comparing the obtained and analytical quantities
    # are not reproduced here)
Time 0
Weight2 = -0.022
Time 1
Time 2
Time 3
Time 4
We need to point out here that the general case of optimal order execution, involving
modeling of the entire Order Book's dynamics, will have to deal with a large state space.
This means the ADP algorithm will suffer from the curse of dimensionality, and so we will
need to employ RL algorithms.
$$P_{t+1} = P_t - (\beta \cdot N_t + \theta \cdot X_t) + \epsilon_t$$
$$X_{t+1} = \rho \cdot X_t + \eta_t$$
$$Q_t = P_t - (\beta \cdot N_t + \theta \cdot X_t)$$
where ϵt and ηt are each independent and identically distributed random variables with
mean zero for all t = 0, 1, . . . , T − 1; ϵt and ηt are also independent of each other for all
t = 0, 1, . . . , T − 1. Xt can be thought of as a market factor affecting Pt linearly. Applying
the finite-horizon Bellman Optimality Equation on the Optimal Value Function (and the
same backward-recursive approach as before) yields:
$$N_t^* = \frac{R_t}{T-t} + h(t, \beta, \theta, \rho) \cdot X_t$$
$$V_t^*((P_t, R_t, X_t)) = R_t \cdot P_t - (\text{quadratic in } (R_t, X_t) + \text{constant})$$
Essentially, the serial-correlation predictability (ρ ̸= 0) alters the uniform-split strategy.
In the same paper, Bertsimas and Lo presented a more realistic model called Linear-
Percentage Temporary (abbreviated as LPT) Price Impact model, whose salient features in-
clude:
• Geometric random walk: consistent with real data, and avoids non-positive prices.
• Fractional Price Impact $\frac{g_t(P_t, N_t)}{P_t}$ doesn't depend on $P_t$ (this is validated by real data).
• Purely Temporary Price Impact, i.e., the price Pt snaps back after the Temporary
Price Impact (no Permanent effect of Market Orders on future prices).
$$P_{t+1} = P_t \cdot e^{Z_t}$$
$$X_{t+1} = \rho \cdot X_t + \eta_t$$
$$Q_t = P_t \cdot (1 - \beta \cdot N_t - \theta \cdot X_t)$$
where $Z_t$ are independent and identically distributed random variables with mean $\mu_Z$
and variance $\sigma_Z^2$, $\eta_t$ are independent and identically distributed random variables with
mean zero for all $t = 0, 1, \ldots, T-1$, and $Z_t$ and $\eta_t$ are independent of each other for all
$t = 0, 1, \ldots, T-1$. $X_t$ can be thought of as a market factor affecting $P_t$ multiplicatively. With
the same derivation methodology as before, we get the solution:
$$N_t^* = c_t^{(1)} + c_t^{(2)} \cdot R_t + c_t^{(3)} \cdot X_t$$
$$V_t^*((P_t, R_t, X_t)) = e^{\mu_Z + \frac{\sigma_Z^2}{2}} \cdot P_t \cdot \big(c_t^{(4)} + c_t^{(5)} \cdot R_t + c_t^{(6)} \cdot X_t + c_t^{(7)} \cdot R_t^2 + c_t^{(8)} \cdot X_t^2 + c_t^{(9)} \cdot R_t \cdot X_t\big)$$
where $c_t^{(k)}$, $1 \leq k \leq 9$, are constants (independent of $P_t, R_t, X_t$).
As an exercise, we recommend implementing the above (LPT) model by customizing
OptimalOrderExecution and comparing the obtained Optimal Value Function and Optimal
Policy against the closed-form solution (you can find the exact expressions for the $c_t^{(k)}$
coefficients in the Bertsimas and Lo paper).
risk-aversion. The incorporation of risk-aversion affects the time-trajectory of $N_t^*$. Clearly,
if $\lambda = 0$, we get the usual uniform-split strategy $N_t^* = \frac{N}{T}$. The other extreme assumption
is to minimize $Var[Y]$, which yields $N_0^* = N$ (sell everything immediately, because the
only thing we want to avoid is uncertainty of sales proceeds). In their paper, Almgren
and Chriss go on to derive the Efficient Frontier for this problem (analogous to the Effi-
cient Frontier Portfolio Theory we outline in Appendix B). They also derive solutions for
specific utility functions.
To model a real-world trading situation, the first step is to start with the MDP we de-
scribed earlier with an appropriate model for the price dynamics ft (·) and the temporary
price impact gt (·) (incorporating potential time-heterogeneity, non-linear price dynamics
and non-linear impact). The OptimalOrderExecution class we wrote above allows us to
incorporate all of the above. We can also model various real-world “frictions” such as
discrete prices, discrete number of shares, constraints on prices and number of shares, as
well as trading fees. To make the model truer to reality and more sophisticated, we can
introduce various market factors in the State which would invariably lead to bloating of
the State Space. We would also need to capture Cross-Asset Market Impact. As a further
step, we could represent the entire Order Book (or a compact summary of the size/shape
of the Order book) as part of the state, which leads to further bloating of the state space.
All of this makes ADP infeasible and one would need to employ Reinforcement Learning
algorithms. More importantly, we’d need to write a realistic Order Book Dynamics sim-
ulator capturing all of the above real-world considerations that an RL algorithm would
learn from. There are a lot of practical and technical details involved in writing a real-
world simulator and we won’t be covering those details in this book. It suffices for here
to say that the simulator would essentially be a sampling model that has learnt the Order
Book Dynamics from market data (supervised learning of the Order Book Dynamics). Us-
ing such a simulator and with a deep learning-based function approximation of the Value
Function, we can solve a practical Optimal Order Execution problem with Reinforcement
Learning. We refer you to a couple of papers for further reading on this:
• Paper by Nevmyvaka, Feng, Kearns in 2006 (Nevmyvaka, Feng, and Kearns 2006)
• Paper by Vyetrenko and Xu in 2019 (Vyetrenko and Xu 2019)
Designing real-world simulators for Order Book Dynamics and using Reinforcement
Learning for Optimal Order Execution is an exciting area for future research as well as
engineering design. We hope this section has provided sufficient foundations for you to
dig into this topic further.
bid prices by submitting Buy LOs on an OB and quotes one’s ask prices by submitting
Sell LOs on the OB. Market-makers are known as liquidity providers in the market because
they make shares of the stock available for trading on the OB (both on the buy side and
sell side). In general, anyone who submits LOs can be thought of as a market liquidity
provider. Likewise, anyone who submits MOs can be thought of as a market liquidity taker
(because an MO takes shares out of the volume that was made available for trading on the
OB).
There is typically a fairly complex interplay between liquidity providers (including market-
makers) and liquidity takers. Modeling OB dynamics is about modeling this complex in-
terplay, predicting arrivals of MOs and LOs, in response to market events and in response
to observed activity on the OB. In this section, we view the OB from the perspective of a
single market-maker who aims to make money with Buy/Sell LOs of appropriate bid-ask
spread and with appropriate volume of shares (specified in their submitted LOs). The
market-maker is likely to be successful if she can do a good job of forecasting OB Dynam-
ics and dynamically adjusting her Buy/Sell LOs on the OB. The goal of the market-maker
is to maximize one’s Utility of Gains at the end of a suitable horizon of time.
The core intuition in the decision of how to set the price and shares in the market-maker’s
Buy and Sell LOs is as follows: If the market-maker’s bid-ask spread is too narrow, they
will have more frequent transactions but smaller gains per transaction (more likelihood of
their LOs being transacted against by an MO or an opposite-side LO). On the other hand,
if the market-maker’s bid-ask spread is too wide, they will have less frequent transactions
but larger gains per transaction (less likelihood of their LOs being transacted against by
an MO or an opposite-side LO). Also of great importance is the fact that a market-maker
needs to carefully manage potentially large inventory buildup (either on the long side
or the short side) so as to avoid scenarios of consequent unfavorable forced liquidation
upon reaching the horizon time. Inventory buildup can occur if the market participants
consistently transact against mostly one side of the market-maker’s submitted LOs. With
this high-level intuition, let us make these concepts of market-making precise. We start
by developing some notation to help articulate the problem of Optimal Market-Making
clearly. We will re-use some of the notation and terminology we had developed for the
problem of Optimal Order Execution. As ever, for ease of exposition, we will simplify the
setting for the Optimal Market-Making problem.
Assume there are a finite number of time steps indexed by t = 0, 1, . . . , T . Assume
the market-maker always shows a bid price and ask price (at each time t) along with the
associated bid shares and ask shares on the OB. Also assume, for ease of exposition, that
the market-maker can add or remove bid/ask shares from the OB costlessly. We use the
following notation:
• We refer to $\delta_t^{(a)} = P_t^{(a)} - S_t$ as the market-maker's Ask Spread (relative to OB Mid).
• We refer to $\delta_t^{(b)} + \delta_t^{(a)} = P_t^{(a)} - P_t^{(b)}$ as the market-maker's Bid-Ask Spread.
• Random variable $X_t^{(b)} \in \mathbb{Z}_{\geq 0}$ refers to the total number of the market-maker's Bid Shares
that have been transacted against (by MOs or by Sell LOs) up to time $t$ ($X_t^{(b)}$ is often
referred to as the cumulative "hits" up to time $t$, as in "the market-maker's buy offer
has been hit").
• Random variable $X_t^{(a)} \in \mathbb{Z}_{\geq 0}$ refers to the total number of the market-maker's Ask
Shares that have been transacted against (by MOs or by Buy LOs) up to time $t$ ($X_t^{(a)}$
is often referred to as the cumulative "lifts" up to time $t$, as in "the market-maker's
sell offer has been lifted").
With this notation in place, we can write the trading account balance equation for all
$t = 0, 1, \ldots, T-1$ as follows:
$$W_{t+1} = W_t + P_t^{(a)} \cdot (X_{t+1}^{(a)} - X_t^{(a)}) - P_t^{(b)} \cdot (X_{t+1}^{(b)} - X_t^{(b)}) \tag{8.10}$$
Note that since the inventory $I_0$ at time 0 is equal to 0, the inventory $I_t$ at time $t$ is given
by the equation:
$$I_t = X_t^{(b)} - X_t^{(a)}$$
The market-maker's goal is to maximize (for an appropriately shaped concave utility
function $U(\cdot)$) the sum of the trading account value at time $T$ and the value of the inventory
of shares held at time $T$, i.e., we maximize:
$$\mathbb{E}[U(W_T + I_T \cdot S_T)]$$
As we alluded to earlier, this problem can be cast as a discrete-time finite-horizon Markov
Decision Process (with discount factor γ = 1). Following the usual notation for discrete-
time finite-horizon MDPs, the order of activity for the MDP at each time step t = 0, 1, . . . , T −
1 is as follows:
that maximizes:
$$\mathbb{E}\Big[\sum_{t=1}^{T} R_t\Big] = \mathbb{E}[R_T] = \mathbb{E}[U(W_T + I_T \cdot S_T)]$$
8.3.1. Avellaneda-Stoikov Continuous-Time Formulation
A landmark paper by Avellaneda and Stoikov (Avellaneda and Stoikov 2008) formulated
this optimal market-making problem in it’s continuous-time version. Their formulation
is conducive to analytical tractability and they came up with a simple, clean and intuitive
solution. In this subsection, we go over their formulation and in the next subsection, we
show the derivation of their solution. We adapt our discrete-time notation above to their
continuous-time setting.
$[X_t^{(b)} \mid 0 \leq t < T]$ and $[X_t^{(a)} \mid 0 \leq t < T]$ are assumed to be continuous-time Poisson
processes, with the hit rate per unit of time and the lift rate per unit of time denoted as $\lambda_t^{(b)}$ and
$\lambda_t^{(a)}$ respectively. Hence, we can write the following:
$$dX_t^{(b)} \sim Poisson(\lambda_t^{(b)} \cdot dt)$$
$$dX_t^{(a)} \sim Poisson(\lambda_t^{(a)} \cdot dt)$$
$$\lambda_t^{(b)} = f^{(b)}(\delta_t^{(b)})$$
$$\lambda_t^{(a)} = f^{(a)}(\delta_t^{(a)})$$
for decreasing functions $f^{(b)}(\cdot)$ and $f^{(a)}(\cdot)$.
$$dW_t = P_t^{(a)} \cdot dX_t^{(a)} - P_t^{(b)} \cdot dX_t^{(b)}$$
$$I_t = X_t^{(b)} - X_t^{(a)} \quad (\text{note: } I_0 = 0)$$
Since the infinitesimal Poisson random variables $dX_t^{(b)}$ (shares hit in the time interval from $t$
to $t + dt$) and $dX_t^{(a)}$ (shares lifted in the time interval from $t$ to $t + dt$) are Bernoulli random
variables (shares hit/lifted within a time interval of duration $dt$ will be 0 or 1), $N_t^{(b)}$ and $N_t^{(a)}$
(the number of shares in the submitted LOs for the infinitesimal time interval from $t$ to $t + dt$)
can be assumed to be 1.
This simplifies the Action at time $t$ to be just the pair:
$$(\delta_t^{(b)}, \delta_t^{(a)})$$
The OB Mid Price Dynamics is assumed to be scaled Brownian motion:
$$dS_t = \sigma \cdot dz_t$$
for some $\sigma \in \mathbb{R}^+$.
The Utility function is assumed to be $U(x) = -e^{-\gamma x}$, where $\gamma > 0$ is the risk-aversion pa-
rameter (this Utility function is essentially the CARA Utility function devoid of associated
constants).
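To build intuition for these dynamics before diving into the HJB treatment, here is a minimal discrete-time simulation sketch (not from the book's code repository), assuming the exponential intensity form f(δ) = c · e^{−k·δ} that appears later in this section, with illustrative parameter values and fixed spreads:

import math
import numpy as np

def simulate_market_maker(
    delta_b: float, delta_a: float,       # fixed bid/ask spreads (for illustration)
    S0: float = 100.0, T: float = 1.0, dt: float = 0.005,
    sigma: float = 2.0, c: float = 140.0, k: float = 1.5,
    seed: int = 0
) -> tuple:
    rng = np.random.default_rng(seed)
    S, W, I = S0, 0.0, 0
    for _ in range(int(T / dt)):
        # hits/lifts as Bernoulli approximations of Poisson(lambda * dt) arrivals
        if rng.random() < c * math.exp(-k * delta_b) * dt:
            W -= S - delta_b          # bought 1 share at the bid price S - delta_b
            I += 1
        if rng.random() < c * math.exp(-k * delta_a) * dt:
            W += S + delta_a          # sold 1 share at the ask price S + delta_a
            I -= 1
        S += sigma * math.sqrt(dt) * rng.standard_normal()   # dS = sigma * dz
    return W + I * S, I   # terminal wealth (inventory marked at the mid) and inventory

Wider spreads reduce the arrival intensities but earn more per fill, which is precisely the trade-off the optimal spreads derived below balance.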
as separate functional arguments of V ∗ (instead of the typical approach of making the
state, as a tuple, a single functional argument). This is because in the continuous-time
setting, we are interested in the time-differential of the Optimal Value Function and we
also want to represent the dependency of the Optimal Value Function on each of St , Wt , It
as explicit separate dependencies. Appendix D provides the derivation of the general HJB
formulation (Equation (D.1) in Appendix D) - this general HJB Equation specializes here
to the following:
$$\max_{\delta_t^{(b)}, \delta_t^{(a)}} \Big\{ \frac{\partial V^*}{\partial t} \cdot dt + \mathbb{E}\Big[\sigma \cdot \frac{\partial V^*}{\partial S_t} \cdot dz_t + \frac{\sigma^2}{2} \cdot \frac{\partial^2 V^*}{\partial S_t^2} \cdot (dz_t)^2\Big]$$
$$+ \lambda_t^{(b)} \cdot dt \cdot V^*(t, S_t, W_t - S_t + \delta_t^{(b)}, I_t + 1)$$
$$+ \lambda_t^{(a)} \cdot dt \cdot V^*(t, S_t, W_t + S_t + \delta_t^{(a)}, I_t - 1)$$
$$+ (1 - \lambda_t^{(b)} \cdot dt - \lambda_t^{(a)} \cdot dt) \cdot V^*(t, S_t, W_t, I_t)$$
$$- V^*(t, S_t, W_t, I_t)\Big\} = 0$$
Next, we want to convert the HJB Equation to a Partial Differential Equation (PDE). We
can simplify the above HJB equation with a few observations:
• $\mathbb{E}[dz_t] = 0$.
• $\mathbb{E}[(dz_t)^2] = dt$.
• Organize the terms involving $\lambda_t^{(b)}$ and $\lambda_t^{(a)}$ with some algebra.
• Divide throughout by $dt$.
$$\max_{\delta_t^{(b)}, \delta_t^{(a)}} \Big\{ \frac{\partial V^*}{\partial t} + \frac{\sigma^2}{2} \cdot \frac{\partial^2 V^*}{\partial S_t^2} + \lambda_t^{(b)} \cdot \big(V^*(t, S_t, W_t - S_t + \delta_t^{(b)}, I_t + 1) - V^*(t, S_t, W_t, I_t)\big)$$
$$+ \lambda_t^{(a)} \cdot \big(V^*(t, S_t, W_t + S_t + \delta_t^{(a)}, I_t - 1) - V^*(t, S_t, W_t, I_t)\big)\Big\} = 0$$
Writing $\lambda_t^{(b)} = f^{(b)}(\delta_t^{(b)})$ and $\lambda_t^{(a)} = f^{(a)}(\delta_t^{(a)})$, and moving the max inside, this becomes:
$$\frac{\partial V^*}{\partial t} + \frac{\sigma^2}{2} \cdot \frac{\partial^2 V^*}{\partial S_t^2} + \max_{\delta_t^{(b)}}\big\{f^{(b)}(\delta_t^{(b)}) \cdot \big(V^*(t, S_t, W_t - S_t + \delta_t^{(b)}, I_t + 1) - V^*(t, S_t, W_t, I_t)\big)\big\}$$
$$+ \max_{\delta_t^{(a)}}\big\{f^{(a)}(\delta_t^{(a)}) \cdot \big(V^*(t, S_t, W_t + S_t + \delta_t^{(a)}, I_t - 1) - V^*(t, S_t, W_t, I_t)\big)\big\} = 0$$
Making the ansatz $V^*(t, S_t, W_t, I_t) = -e^{-\gamma \cdot (W_t + \theta(t, S_t, I_t))}$ for some function $\theta$, this reduces to a PDE for $\theta$:
$$\frac{\partial \theta}{\partial t} + \frac{\sigma^2}{2} \cdot \Big(\frac{\partial^2 \theta}{\partial S_t^2} - \gamma \cdot \Big(\frac{\partial \theta}{\partial S_t}\Big)^2\Big)$$
$$+ \max_{\delta_t^{(b)}}\Big\{\frac{f^{(b)}(\delta_t^{(b)})}{\gamma} \cdot \big(1 - e^{-\gamma \cdot (\delta_t^{(b)} - S_t + \theta(t, S_t, I_t+1) - \theta(t, S_t, I_t))}\big)\Big\}$$
$$+ \max_{\delta_t^{(a)}}\Big\{\frac{f^{(a)}(\delta_t^{(a)})}{\gamma} \cdot \big(1 - e^{-\gamma \cdot (\delta_t^{(a)} + S_t + \theta(t, S_t, I_t-1) - \theta(t, S_t, I_t))}\big)\Big\} = 0$$
Let $Q_t^{(b)}$ denote the Indifference Bid Price, i.e., the price at which buying one share leaves the
Optimal Value unchanged:
$$-e^{-\gamma \cdot (W_t - Q_t^{(b)} + \theta(t, S_t, I_t+1))} = -e^{-\gamma \cdot (W_t + \theta(t, S_t, I_t))}$$
$$\Rightarrow Q_t^{(b)} = \theta(t, S_t, I_t+1) - \theta(t, S_t, I_t) \tag{8.14}$$
Likewise for $Q_t^{(a)}$, we get:
$$Q_t^{(a)} = \theta(t, S_t, I_t) - \theta(t, S_t, I_t-1) \tag{8.15}$$
Using Equations (8.14) and (8.15), we bring $Q_t^{(b)}$ and $Q_t^{(a)}$ into the PDE for $\theta$:
$$\frac{\partial \theta}{\partial t} + \frac{\sigma^2}{2} \cdot \Big(\frac{\partial^2 \theta}{\partial S_t^2} - \gamma \cdot \Big(\frac{\partial \theta}{\partial S_t}\Big)^2\Big) + \max_{\delta_t^{(b)}} g(\delta_t^{(b)}) + \max_{\delta_t^{(a)}} h(\delta_t^{(a)}) = 0$$
$$\text{where } g(\delta_t^{(b)}) = \frac{f^{(b)}(\delta_t^{(b)})}{\gamma} \cdot \big(1 - e^{-\gamma \cdot (\delta_t^{(b)} - S_t + Q_t^{(b)})}\big)$$
$$\text{and } h(\delta_t^{(a)}) = \frac{f^{(a)}(\delta_t^{(a)})}{\gamma} \cdot \big(1 - e^{-\gamma \cdot (\delta_t^{(a)} + S_t - Q_t^{(a)})}\big)$$
To maximize $g(\delta_t^{(b)})$, differentiate $g$ with respect to $\delta_t^{(b)}$ and set the derivative to 0:
$$e^{-\gamma \cdot (\delta_t^{(b)*} - S_t + Q_t^{(b)})} \cdot \Big(\gamma \cdot f^{(b)}(\delta_t^{(b)*}) - \frac{\partial f^{(b)}}{\partial \delta_t^{(b)}}(\delta_t^{(b)*})\Big) + \frac{\partial f^{(b)}}{\partial \delta_t^{(b)}}(\delta_t^{(b)*}) = 0$$
$$\Rightarrow \delta_t^{(b)*} = S_t - P_t^{(b)*} = S_t - Q_t^{(b)} + \frac{1}{\gamma} \cdot \log \Bigg(1 - \gamma \cdot \frac{f^{(b)}(\delta_t^{(b)*})}{\frac{\partial f^{(b)}}{\partial \delta_t^{(b)}}(\delta_t^{(b)*})}\Bigg) \tag{8.16}$$
To maximize $h(\delta_t^{(a)})$, differentiate $h$ with respect to $\delta_t^{(a)}$ and set the derivative to 0:
$$e^{-\gamma \cdot (\delta_t^{(a)*} + S_t - Q_t^{(a)})} \cdot \Big(\gamma \cdot f^{(a)}(\delta_t^{(a)*}) - \frac{\partial f^{(a)}}{\partial \delta_t^{(a)}}(\delta_t^{(a)*})\Big) + \frac{\partial f^{(a)}}{\partial \delta_t^{(a)}}(\delta_t^{(a)*}) = 0$$
$$\Rightarrow \delta_t^{(a)*} = P_t^{(a)*} - S_t = Q_t^{(a)} - S_t + \frac{1}{\gamma} \cdot \log \Bigg(1 - \gamma \cdot \frac{f^{(a)}(\delta_t^{(a)*})}{\frac{\partial f^{(a)}}{\partial \delta_t^{(a)}}(\delta_t^{(a)*})}\Bigg) \tag{8.17}$$
Equations (8.16) and (8.17) are implicit equations for $\delta_t^{(b)*}$ and $\delta_t^{(a)*}$ respectively.
Now let us write the PDE in terms of the Optimal Bid and Ask Spreads:
$$\frac{\partial \theta}{\partial t} + \frac{\sigma^2}{2} \cdot \Big(\frac{\partial^2 \theta}{\partial S_t^2} - \gamma \cdot \Big(\frac{\partial \theta}{\partial S_t}\Big)^2\Big)
+ \frac{f^{(b)}(\delta_t^{(b)*})}{\gamma} \cdot \big(1 - e^{-\gamma \cdot (\delta_t^{(b)*} - S_t + \theta(t, S_t, I_t+1) - \theta(t, S_t, I_t))}\big)
+ \frac{f^{(a)}(\delta_t^{(a)*})}{\gamma} \cdot \big(1 - e^{-\gamma \cdot (\delta_t^{(a)*} + S_t + \theta(t, S_t, I_t-1) - \theta(t, S_t, I_t))}\big) = 0 \tag{8.18}$$
with boundary condition: $\theta(T, S_T, I_T) = I_T \cdot S_T$
• First we solve PDE (8.18) for $\theta$ in terms of $\delta_t^{(b)*}$ and $\delta_t^{(a)*}$. In general, this would be a
numerical PDE solution.
• Using Equations (8.14) and (8.15), and using the above-obtained $\theta$ in terms of $\delta_t^{(b)*}$
and $\delta_t^{(a)*}$, we get $Q_t^{(b)}$ and $Q_t^{(a)}$ in terms of $\delta_t^{(b)*}$ and $\delta_t^{(a)*}$.
• Then we substitute the above-obtained $Q_t^{(b)}$ and $Q_t^{(a)}$ (in terms of $\delta_t^{(b)*}$ and $\delta_t^{(a)*}$) in
Equations (8.16) and (8.17).
• Finally, we solve the implicit equations for $\delta_t^{(b)*}$ and $\delta_t^{(a)*}$ (in general, numerically).
Consider first the simple case where the market-maker stops market-making from time $t$ onwards
(so that $W$ and $I$ remain unchanged from $t$ to $T$). The process $dS_t = \sigma \cdot dz_t$ then implies that
$S_T \sim \mathcal{N}(S_t, \sigma^2 \cdot (T-t))$, and hence:
$$V^*(t, S_t, W_t, I_t) = -e^{-\gamma \cdot (W_t + I_t \cdot S_t - \frac{\gamma \cdot I_t^2 \cdot \sigma^2 \cdot (T-t)}{2})}$$
Hence,
$$V^*(t, S_t, W_t - Q_t^{(b)}, I_t + 1) = -e^{-\gamma \cdot (W_t - Q_t^{(b)} + (I_t+1) \cdot S_t - \frac{\gamma \cdot (I_t+1)^2 \cdot \sigma^2 \cdot (T-t)}{2})}$$
Setting
$$V^*(t, S_t, W_t, I_t) = V^*(t, S_t, W_t - Q_t^{(b)}, I_t + 1)$$
i.e.,
$$-e^{-\gamma \cdot (W_t + I_t \cdot S_t - \frac{\gamma \cdot I_t^2 \cdot \sigma^2 \cdot (T-t)}{2})} = -e^{-\gamma \cdot (W_t - Q_t^{(b)} + (I_t+1) \cdot S_t - \frac{\gamma \cdot (I_t+1)^2 \cdot \sigma^2 \cdot (T-t)}{2})}$$
implies:
$$Q_t^{(b)} = S_t - (2I_t + 1) \cdot \frac{\gamma \cdot \sigma^2 \cdot (T-t)}{2}$$
Likewise, we can derive:
$$Q_t^{(a)} = S_t - (2I_t - 1) \cdot \frac{\gamma \cdot \sigma^2 \cdot (T-t)}{2}$$
The formulas for the Indifference Mid Price and the Indifference Bid-Ask Price Spread are
as follows:
$$Q_t^{(m)} = S_t - I_t \cdot \gamma \cdot \sigma^2 \cdot (T-t)$$
$$Q_t^{(a)} - Q_t^{(b)} = \gamma \cdot \sigma^2 \cdot (T-t)$$
These results for the simple case of no-market-making-after-time-$t$ serve as approxima-
tions for our problem of optimal market-making. Think of $Q_t^{(m)}$ as a pseudo mid price for
the market-maker - an adjustment to the OB mid price $S_t$ that takes into account the magni-
tude and sign of $I_t$. If the market-maker is long inventory ($I_t > 0$), then $Q_t^{(m)} < S_t$, which
makes intuitive sense since the market-maker is interested in reducing her risk of inven-
tory buildup and so would be more inclined to sell than buy, leading her to show bid
and ask prices whose average is lower than the OB mid price $S_t$. Likewise, if the market-
maker is short inventory ($I_t < 0$), then $Q_t^{(m)} > S_t$, indicating an inclination to buy rather than
sell.
Armed with this intuition, we come back to optimal market-making, observing from
Equations (8.16) and (8.17):
$$P_t^{(b)*} < Q_t^{(b)} < Q_t^{(m)} < Q_t^{(a)} < P_t^{(a)*}$$
Visualize this ascending sequence of prices $[P_t^{(b)*}, Q_t^{(b)}, Q_t^{(m)}, Q_t^{(a)}, P_t^{(a)*}]$ as jointly sliding
up/down (relative to the OB mid price $S_t$) as a function of the inventory $I_t$'s magnitude and
sign, and perceive $P_t^{(b)*}$, $P_t^{(a)*}$ in terms of their spreads to the pseudo mid price $Q_t^{(m)}$:
$$Q_t^{(m)} - P_t^{(b)*} = \frac{Q_t^{(a)} - Q_t^{(b)}}{2} + \frac{1}{\gamma} \cdot \log \Bigg(1 - \gamma \cdot \frac{f^{(b)}(\delta_t^{(b)*})}{\frac{\partial f^{(b)}}{\partial \delta_t^{(b)}}(\delta_t^{(b)*})}\Bigg)$$
Assuming the functional form $f^{(b)}(\delta) = f^{(a)}(\delta) = c \cdot e^{-k \cdot \delta}$ (for constants $c, k > 0$), Equations
(8.16) and (8.17) simplify to:
$$\delta_t^{(b)*} = S_t - Q_t^{(b)} + \frac{1}{\gamma} \cdot \log \Big(1 + \frac{\gamma}{k}\Big) \tag{8.19}$$
$$\delta_t^{(a)*} = Q_t^{(a)} - S_t + \frac{1}{\gamma} \cdot \log \Big(1 + \frac{\gamma}{k}\Big) \tag{8.20}$$
which means $P_t^{(b)*}$ and $P_t^{(a)*}$ are equidistant from $Q_t^{(m)}$. Substituting these simplified
$\delta_t^{(b)*}$, $\delta_t^{(a)*}$ in Equation (8.18) reduces the PDE to:
$$\frac{\partial \theta}{\partial t} + \frac{\sigma^2}{2} \cdot \Big(\frac{\partial^2 \theta}{\partial S_t^2} - \gamma \cdot \Big(\frac{\partial \theta}{\partial S_t}\Big)^2\Big) + \frac{c}{k+\gamma} \cdot \big(e^{-k \cdot \delta_t^{(b)*}} + e^{-k \cdot \delta_t^{(a)*}}\big) = 0 \tag{8.21}$$
with boundary condition $\theta(T, S_T, I_T) = I_T \cdot S_T$
Note that this PDE (8.21) involves $\delta_t^{(b)*}$ and $\delta_t^{(a)*}$. However, Equations (8.19), (8.20),
(8.14) and (8.15) enable expressing $\delta_t^{(b)*}$ and $\delta_t^{(a)*}$ in terms of $\theta(t, S_t, I_t-1)$, $\theta(t, S_t, I_t)$ and
$\theta(t, S_t, I_t+1)$. This gives us a PDE just in terms of $\theta$. Solving that PDE for $\theta$ would give
us not only $V^*(t, S_t, W_t, I_t)$ but also $\delta_t^{(b)*}$ and $\delta_t^{(a)*}$ (using Equations (8.19), (8.20), (8.14)
and (8.15)). To solve the PDE, we need to make a couple of approximations.
First we make a linear approximation for $e^{-k \cdot \delta_t^{(b)*}}$ and $e^{-k \cdot \delta_t^{(a)*}}$ in PDE (8.21) as follows:
$$\frac{\partial \theta}{\partial t} + \frac{\sigma^2}{2} \cdot \Big(\frac{\partial^2 \theta}{\partial S_t^2} - \gamma \cdot \Big(\frac{\partial \theta}{\partial S_t}\Big)^2\Big) + \frac{c}{k+\gamma} \cdot \big(1 - k \cdot \delta_t^{(b)*} + 1 - k \cdot \delta_t^{(a)*}\big) = 0 \tag{8.22}$$
Combining Equations (8.19), (8.20), (8.14) and (8.15) gives us:
$$\delta_t^{(b)*} + \delta_t^{(a)*} = \frac{2}{\gamma} \cdot \log \Big(1 + \frac{\gamma}{k}\Big) + 2\theta(t, S_t, I_t) - \theta(t, S_t, I_t+1) - \theta(t, S_t, I_t-1)$$
With this expression for $\delta_t^{(b)*} + \delta_t^{(a)*}$, PDE (8.22) takes the form:
$$\frac{\partial \theta}{\partial t} + \frac{\sigma^2}{2} \cdot \Big(\frac{\partial^2 \theta}{\partial S_t^2} - \gamma \cdot \Big(\frac{\partial \theta}{\partial S_t}\Big)^2\Big) + \frac{c}{k+\gamma} \cdot \Big(2 - \frac{2k}{\gamma} \cdot \log \Big(1 + \frac{\gamma}{k}\Big) - k \cdot \big(2\theta(t, S_t, I_t) - \theta(t, S_t, I_t+1) - \theta(t, S_t, I_t-1)\big)\Big) = 0 \tag{8.23}$$
Next, we approximate $\theta(t, S_t, I_t)$ with an expansion in $I_t$ up to the quadratic term:
$$\theta(t, S_t, I_t) \approx \theta^{(0)}(t, S_t) + I_t \cdot \theta^{(1)}(t, S_t) + \frac{I_t^2}{2} \cdot \theta^{(2)}(t, S_t)$$
We note that the Optimal Value Function $V^*$ can depend on $S_t$ only through the current
value of the inventory (i.e., through $I_t \cdot S_t$), i.e., it cannot depend on $S_t$ in any other way. This
means $V^*(t, S_t, W_t, 0) = -e^{-\gamma \cdot (W_t + \theta^{(0)}(t, S_t))}$ is independent of $S_t$, and hence $\theta^{(0)}(t, S_t)$ is
independent of $S_t$. So we can write it as simply $\theta^{(0)}(t)$, meaning $\frac{\partial \theta^{(0)}}{\partial S_t}$ and $\frac{\partial^2 \theta^{(0)}}{\partial S_t^2}$ are equal
to 0. Therefore, we can write the approximate expansion for $\theta(t, S_t, I_t)$ as:
$$\theta(t, S_t, I_t) = \theta^{(0)}(t) + I_t \cdot \theta^{(1)}(t, S_t) + \frac{I_t^2}{2} \cdot \theta^{(2)}(t, S_t) \tag{8.24}$$
Substituting this approximation Equation (8.24) for $\theta(t, S_t, I_t)$ in PDE (8.23) and grouping
the resulting terms, we obtain three groups, each of which must equal 0:
• Terms devoid of $I_t$ (i.e., $I_t^0$)
• Terms involving $I_t$ (i.e., $I_t^1$)
• Terms involving $I_t^2$
The terms involving $I_t$ give:
$$\frac{\partial \theta^{(1)}}{\partial t} + \frac{\sigma^2}{2} \cdot \frac{\partial^2 \theta^{(1)}}{\partial S_t^2} = 0 \quad \text{with boundary condition } \theta^{(1)}(T, S_T) = S_T$$
The solution to this PDE is $\theta^{(1)}(t, S_t) = S_t$. A similar treatment of the terms involving $I_t^2$
yields $\theta^{(2)}(t, S_t) = -\gamma \cdot \sigma^2 \cdot (T-t)$ (independent of $S_t$). The terms devoid of $I_t$ give:
$$\frac{\partial \theta^{(0)}}{\partial t} + \frac{c}{k+\gamma} \cdot \Big(2 - \frac{2k}{\gamma} \cdot \log \Big(1 + \frac{\gamma}{k}\Big) + k \cdot \theta^{(2)}\Big) = 0 \quad \text{with boundary condition } \theta^{(0)}(T) = 0$$
whose solution is:
$$\theta^{(0)}(t) = \frac{c}{k+\gamma} \cdot \Big(\Big(2 - \frac{2k}{\gamma} \cdot \log \Big(1 + \frac{\gamma}{k}\Big)\Big) \cdot (T-t) - \frac{k \gamma \sigma^2}{2} \cdot (T-t)^2\Big) \tag{8.28}$$
This completes the PDE solution for $\theta(t, S_t, I_t)$ and hence, for $V^*(t, S_t, W_t, I_t)$. Lastly, we
derive formulas for $Q_t^{(b)}$, $Q_t^{(a)}$, $Q_t^{(m)}$, $\delta_t^{(b)*}$, $\delta_t^{(a)*}$.
Using Equations (8.14) and (8.15), we get:
$$Q_t^{(b)} = \theta^{(1)}(t, S_t) + \frac{2I_t + 1}{2} \cdot \theta^{(2)}(t, S_t) = S_t - (2I_t + 1) \cdot \frac{\gamma \cdot \sigma^2 \cdot (T-t)}{2} \tag{8.29}$$
$$Q_t^{(a)} = \theta^{(1)}(t, S_t) + \frac{2I_t - 1}{2} \cdot \theta^{(2)}(t, S_t) = S_t - (2I_t - 1) \cdot \frac{\gamma \cdot \sigma^2 \cdot (T-t)}{2} \tag{8.30}$$
Using Equations (8.19) and (8.20), we get:
$$\delta_t^{(b)*} = \frac{(2I_t + 1) \cdot \gamma \cdot \sigma^2 \cdot (T-t)}{2} + \frac{1}{\gamma} \cdot \log \Big(1 + \frac{\gamma}{k}\Big) \tag{8.31}$$
$$\delta_t^{(a)*} = \frac{(1 - 2I_t) \cdot \gamma \cdot \sigma^2 \cdot (T-t)}{2} + \frac{1}{\gamma} \cdot \log \Big(1 + \frac{\gamma}{k}\Big) \tag{8.32}$$
$$\text{Optimal Bid-Ask Spread: } \delta_t^{(b)*} + \delta_t^{(a)*} = \gamma \cdot \sigma^2 \cdot (T - t) + \frac{2}{\gamma} \cdot \log \left(1 + \frac{\gamma}{k}\right) \quad (8.33)$$

$$\text{Inner Spreads: } Q_t^{(m)} - Q_t^{(b)} = Q_t^{(a)} - Q_t^{(m)} = \frac{\gamma \cdot \sigma^2 \cdot (T - t)}{2}$$
This completes the analytical approximation to the solution of the Avellaneda-Stoikov
continuous-time formulation of the Optimal Market-Making problem.
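To make these closed-form expressions concrete, here is a small illustrative sketch (our own code, not from the book's rl package; the function name and example parameter values are ours) that computes the optimal quotes of Equations (8.29)-(8.32), taking $\delta_t^{(b)*}$ and $\delta_t^{(a)*}$ to be the distances of the optimal bid/ask prices below/above the OB mid price $S_t$:

import math

def optimal_quotes(s_t, i_t, gamma, sigma, k, time_to_horizon):
    # inventory-risk term gamma * sigma^2 * (T - t), which drives the quote skew
    risk_term = gamma * sigma * sigma * time_to_horizon
    half_spread_term = (1.0 / gamma) * math.log(1 + gamma / k)
    delta_bid = (2 * i_t + 1) * risk_term / 2 + half_spread_term   # Equation (8.31)
    delta_ask = (1 - 2 * i_t) * risk_term / 2 + half_spread_term   # Equation (8.32)
    bid_price = s_t - delta_bid
    ask_price = s_t + delta_ask
    return bid_price, ask_price, delta_bid + delta_ask   # last value matches Equation (8.33)

# Example: positive inventory pushes both quotes down, skewing towards selling off inventory
print(optimal_quotes(s_t=100.0, i_t=2, gamma=0.1, sigma=2.0, k=1.5, time_to_horizon=1.0))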
For RL approaches to practical market-making, we refer you to:

• A paper from University of Liverpool (Spooner et al. 2018)
• A paper from J.P.Morgan Research (Ganesh et al. 2019)
This topic of development of models for OB Dynamics and RL algorithms for practical
market-making is an exciting area for future research as well as engineering design. We
hope this section has provided sufficient foundations for you to dig into this topic further.
Part III. Reinforcement Learning Algorithms
9. Monte-Carlo and Temporal-Difference for
Prediction
1. The AI Agent interacts with the actual environment and doesn’t bother with either a
model of explicit transition probabilities (probabilities model) or a model of transition
samples (sampling model).
2. We create a sampling model (by learning from interaction with the actual environ-
ment) and treat this sampling model as a simulated environment (meaning, the AI
agent interacts with this simulated environment).
From the perspective of the AI agent, either way there is an environment interface that
will serve up (at each time step) a single experience of (next state, reward) pair when
the agent performs a certain action in a given state. So essentially, either way, our access
is simply to a stream of individual experiences of next state and reward rather than their
explicit probabilities. So, then the question is - at a conceptual level, how does RL go
about solving Prediction and Control problems with just this limited access (access to
only experiences and not explicit probabilities)? This will become clearer and clearer as we
make our way through Module III, but it would be a good idea now for us to briefly sketch
an intuitive overview of the RL approach (before we dive into the actual RL algorithms).
To understand the core idea of how RL works, we take you back to the start of the book
where we went over how a baby learns to walk. Specifically, we’d like you to develop in-
tuition for how humans and other animals learn to perform requisite tasks or behave in
appropriate ways, so as to get trained to make suitable decisions. Humans/animals don’t
build a model of explicit probabilities in their minds in a way that a DP/ADP algorithm
would require. Rather, their learning is essentially a sort of “trial and error” method -
they try an action, receive an experience (i.e., next state and reward) from their environ-
ment, then take a new action, receive another experience, and so on … and then over a
period of time, they figure out which actions might be leading to good outcomes (pro-
ducing good rewards) and which actions might be leading to poor outcomes (poor re-
wards). This learning process involves raising the priority of actions perceived as good,
and lowering the priority of actions perceived as bad. Humans/animals don’t quite link
their actions to the immediate reward - they link their actions to the cumulative rewards
(Returns) obtained after performing an action. Linking actions to cumulative rewards is
challenging because multiple actions have significantly overlapping reward sequences,
and often rewards show up in a delayed manner. Indeed, learning by attributing good
versus bad outcomes to specific past actions is the powerful part of human/animal learn-
ing. Humans/animals are essentially estimating a Q-Value Function and are updating
their Q-Value function each time they receive a new experience (of essentially a pair of
next state and reward). Exactly how humans/animals manage to estimate Q-Value func-
tions efficiently is unclear (a big area of ongoing research), but RL algorithms have specific
techniques to estimate the Q-Value function in an incremental manner by updating the Q-
Value function in subtle ways after each experience of next state and reward received from
either the actual environment or simulated environment.
We should also point out another important feature of human/animal learning - it is the
fact that humans/animals are good at generalizing their inferences from experiences, i.e.,
they can interpolate and extrapolate the linkages between their actions and the outcomes
received from their environment. Technically, this translates to a suitable function approx-
imation of the Q-Value function. So before we embark on studying the details of various
RL algorithms, it’s important to recognize that RL overcomes complexity (specifically, the
Curse of Dimensionality and Curse of Modeling, as we have alluded to in previous chap-
ters) with a combination of:

• Learning incrementally, from a stream of individual experiences of (next state, reward), by updating the Q-Value Function estimate after each experience
• Appropriate generalization ability in the function approximation of the Q-Value Function
This idea of solving the MDP Prediction and Control problems in this manner (learning
incrementally from a stream of data with appropriate generalization ability in the Q-Value
function approximation) came from the Ph.D. thesis of Chris Watkins (Watkins 1989).
As mentioned before, we consider the RL book by Sutton and Barto (Richard S. Sutton
and Barto 2018) as the best source for a comprehensive study of RL algorithms as well as
the best source for all references associated with RL (hence, we don’t provide too many
references in this book).
As mentioned in previous chapters, most RL algorithms are founded on the Bellman
Equations and all RL Control algorithms are based on the fundamental idea of Generalized
Policy Iteration that we have explained in Chapter 2. But the exact ways in which the Bell-
man Equations and Generalized Policy Iteration idea are utilized in RL algorithms differ
from one algorithm to another, and they differ significantly from how the Bellman Equa-
tions/Generalized Policy Iteration idea is utilized in DP algorithms.
As has been our practice, we start with the Prediction problem (this chapter) and then
cover the Control problem (next chapter).
Recall that a trace experience is a sequence:

$$S_0, R_1, S_1, R_2, S_2, \ldots$$

and that the Return $G_t$ from each state $S_t$ in a trace experience is defined as:

$$G_t = \sum_{i=t+1}^{\infty} \gamma^{i-t-1} \cdot R_i = R_{t+1} + \gamma \cdot R_{t+2} + \gamma^2 \cdot R_{t+3} + \ldots = R_{t+1} + \gamma \cdot G_{t+1}$$
We use the above definition of Return even for a terminating trace experience (say ter-
minating at t = T , i.e., ST ∈ T ), by treating Ri = 0 for all i > T .
The RL prediction algorithms we will soon develop consume a stream of atomic experi-
ences or a stream of trace experiences to learn the requisite Value Function. So we want the
input to an RL Prediction algorithm to be either an Iterable of atomic experiences or an
Iterable of trace experiences. Now let’s talk about the representation (in code) of a single
atomic experience and the representation of a single trace experience. We take you back
to the code in Chapter 1 where we had set up a @dataclass TransitionStep that served as
a building block in the method simulate_reward in the abstract class MarkovRewardProcess.
@dataclass(frozen=True)
class TransitionStep(Generic[S]):
state: NonTerminal[S]
next_state: State[S]
reward: float
from typing import Iterable, Iterator

import rl.markov_process as mp
from rl.approximate_dynamic_programming import ValueFunctionApprox
from rl.iterate import last
from rl.returns import returns
def mc_prediction(
traces: Iterable[Iterable[mp.TransitionStep[S]]],
approx_0: ValueFunctionApprox[S],
gamma: float,
episode_length_tolerance: float = 1e-6
) -> Iterator[ValueFunctionApprox[S]]:
episodes: Iterator[Iterator[mp.ReturnStep[S]]] = \
(returns(trace, gamma, episode_length_tolerance) for trace in traces)
f = approx_0
yield f
for episode in episodes:
f = last(f.iterate_updates(
[(step.state, step.return_)] for step in episode
))
yield f
The core of the mc_prediction function above is the call to the returns function (de-
tailed below and available in the file rl/returns.py). returns takes as input: trace repre-
senting a trace experience (Iterable of TransitionStep), the discount factor gamma, and an
episode_length_tolerance that determines how many time steps to cover in each trace ex-
perience when γ < 1 (as many steps as needed until $\gamma^{\text{steps}}$ falls below episode_length_tolerance,
or until the trace experience ends in a terminal state, whichever happens first). If γ = 1,
each trace experience needs to end in a terminal state (else the returns function will loop
forever).
The returns function calculates the returns Gt (accumulated discounted rewards) start-
ing from each state St in the trace experience.2 The key is to walk backwards from the end
of the trace experience to the start (so as to reuse the calculated returns while walking
backwards: Gt = Rt+1 + γ · Gt+1 ). Note the use of iterate.accumulate to perform this
backwards-walk calculation, which in turn uses the add_return method in TransitionStep
to create an instance of ReturnStep. The ReturnStep (as seen in the code below) class is de-
rived from the TransitionStep class and includes the additional attribute named return_.
We add a method called add_return in TransitionStep so we can augment the attributes
state, reward, next_state with the additional attribute return_ that is computed as reward
plus gamma times the return_ from the next state.3
@dataclass(frozen=True)
class TransitionStep(Generic[S]):
state: NonTerminal[S]
next_state: State[S]
reward: float
def add_return(self, gamma: float, return_: float) -> ReturnStep[S]:
return ReturnStep(
self.state,
self.next_state,
self.reward,
return_=self.reward + gamma * return_
)
@dataclass(frozen=True)
class ReturnStep(TransitionStep[S]):
return_: float
² returns is defined in the file rl/returns.py.
³ TransitionStep and the add_return method are defined in the file rl/markov_process.py.

import itertools
import math
from typing import Iterable, Iterator
import rl.iterate as iterate
import rl.markov_process as mp
def returns(
trace: Iterable[mp.TransitionStep[S]],
gamma: float,
tolerance: float
) -> Iterator[mp.ReturnStep[S]]:
trace = iter(trace)
max_steps = round(math.log(tolerance) / math.log(gamma)) if gamma < 1 \
else None
if max_steps is not None:
trace = itertools.islice(trace, max_steps * 2)
*transitions, last_transition = list(trace)
return_steps = iterate.accumulate(
reversed(transitions),
func=lambda next, curr: curr.add_return(gamma, next.return_),
initial=last_transition.add_return(gamma, 0)
)
return_steps = reversed(list(return_steps))
if max_steps is not None:
return_steps = itertools.islice(return_steps, max_steps)
return return_steps
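As a quick illustration of the backwards-walk calculation, here is a usage sketch (our own example, assuming NonTerminal, Terminal and TransitionStep live in rl/markov_process.py as per the footnotes above) on a hand-constructed 3-step trace experience with γ = 0.9:

from rl.markov_process import NonTerminal, Terminal, TransitionStep

trace = [
    TransitionStep(state=NonTerminal("A"), next_state=NonTerminal("B"), reward=1.0),
    TransitionStep(state=NonTerminal("B"), next_state=NonTerminal("A"), reward=2.0),
    TransitionStep(state=NonTerminal("A"), next_state=Terminal("T"), reward=3.0)
]

for step in returns(trace, gamma=0.9, tolerance=1e-6):
    print(step.state, step.return_)
# Walking backwards: G_2 = 3.0, G_1 = 2.0 + 0.9 * 3.0 = 4.7, G_0 = 1.0 + 0.9 * 4.7 = 5.23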
We say that the trace experiences are episodic traces if each trace experience ends in a
terminal state to signify that each trace experience is an episode, after whose termination
we move on to the next episode. Trace experiences that do not terminate are known as
continuing traces. We say that an RL problem is episodic if the input trace experiences are
all episodic (likewise, we say that an RL problem is continuing if some of the input trace
experiences are continuing).
Assume that the probability distribution of returns conditional on a state is modeled by
a function approximation as a (state-conditional) normal distribution, whose mean (Value
Function) we denote as V (s; w) where s denotes a state for which the function approxima-
tion is being evaluated and w denotes the set of parameters in the function approximation
(eg: the weights in a neural network). Then, the loss function for supervised learning of
the Value Function is the sum of squares of differences between observed returns and the
Value Function estimate from the function approximation. For a state St visited at time t
in a trace experience and it’s associated return Gt on the trace experience, the contribution
to the loss function is:
$$\mathcal{L}_{(S_t, G_t)}(w) = \frac{1}{2} \cdot \left(V(S_t; w) - G_t\right)^2 \quad (9.1)$$
It’s gradient with respect to w is:
314
of incremental supervised learning to Reinforcement Learning parameter updates. We
should interpret the change in parameters ∆w as the product of three conceptual entities:
• Learning Rate α
• Return Residual of the observed return Gt relative to the estimated conditional ex-
pected return V (St ; w)
• Estimate Gradient of the conditional expected return V (St ; w) with respect to the pa-
rameters w
This interpretation of the change in parameters as the product of these three conceptual
entities: (Learning rate, Return Residual, Estimate Gradient) is important as this will be a
repeated pattern in many of the RL algorithms we will cover.
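As a concrete (if simplified) illustration of this three-entity decomposition, the following sketch performs a single MC Prediction parameter update for a linear function approximation (this is our own toy example, not the book's FunctionApprox classes):

import numpy as np

alpha = 0.1                         # Learning Rate
w = np.array([0.5, -0.2, 1.0])      # current parameters of the linear function approximation
phi_s = np.array([1.0, 3.0, 0.0])   # feature vector of the visited state S_t
g_t = 2.5                           # observed trace experience return G_t from S_t

v_estimate = float(np.dot(phi_s, w))   # V(S_t; w)
estimate_gradient = phi_s              # gradient of V(S_t; w) w.r.t. w (linear case)
# Learning Rate x Return Residual x Estimate Gradient
w = w + alpha * (g_t - v_estimate) * estimate_gradient
print(w)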
Now we consider a simple case of Monte-Carlo Prediction where the MRP consists of a
finite state space with the non-terminal states N = {s1 , s2 , . . . , sm }. In this case, we rep-
resent the Value Function of the MRP in a data structure (dictionary) of (state, expected
return) pairs. This is known as “Tabular” Monte-Carlo (more generally as Tabular RL to
reflect the fact that we represent the calculated Value Function in a “table” , i.e., dictio-
nary). Note that in this case, Monte-Carlo Prediction reduces to a very simple calculation
wherein for each state, we simply maintain the average of the trace experience returns
from that state onwards (averaged over state visitations across trace experiences), and
the average is updated in an incremental manner. Recall from Section 4.4 of Chapter 4
that this is exactly what’s done in the Tabular class (in file rl/func_approx.py). We also
recall from Section 4.4 of Chapter 4 that Tabular implements the interface of the abstract
class FunctionApprox and so, we can perform Tabular Monte-Carlo Prediction by passing a
Tabular instance as the approx_0: ValueFunctionApprox argument to the mc_prediction function
above. The implementation of the update method in Tabular is exactly as we desire: it per-
forms an incremental averaging of the trace experience returns obtained from each state
onwards (over a stream of trace experiences).
Let us denote $V_n(s_i)$ as the estimate of the Value Function for a state $s_i$ after the $n$-th occurrence of the state $s_i$ (when doing Tabular Monte-Carlo Prediction) and let $Y_i^{(1)}, Y_i^{(2)}, \ldots, Y_i^{(n)}$ be the trace experience returns associated with the $n$ occurrences of state $s_i$. Let us denote the count_to_weight_func attribute of Tabular as $f$. Then, the Tabular update at the $n$-th occurrence of state $s_i$ (with its associated return $Y_i^{(n)}$) is as follows:

$$V_n(s_i) = (1 - f(n)) \cdot V_{n-1}(s_i) + f(n) \cdot Y_i^{(n)} = V_{n-1}(s_i) + f(n) \cdot \left(Y_i^{(n)} - V_{n-1}(s_i)\right) \quad (9.3)$$
Thus, we see that the update (change) to the Value Function for a state $s_i$ is equal to $f(n)$ (the weight for the latest trace experience return $Y_i^{(n)}$ from state $s_i$) times the difference between the latest trace experience return $Y_i^{(n)}$ and the current Value Function estimate $V_{n-1}(s_i)$. This is a good perspective as it tells us how to adjust the Value Function estimate in an intuitive manner. In the case of the default setting of count_to_weight_func as $f(n) = \frac{1}{n}$, we get:

$$V_n(s_i) = V_{n-1}(s_i) + \frac{1}{n} \cdot \left(Y_i^{(n)} - V_{n-1}(s_i)\right)$$

For example, at the $n = 50 + 1 = 51$-st occurrence of a state, this moves the Value Function estimate in the direction from the current estimate to the latest trace experience return, by a magnitude of $\frac{1}{n}$ of their gap.
Expanding the incremental updates across values of $n$ in Equation (9.3), we get:

$$V_n(s_i) = f(n) \cdot Y_i^{(n)} + (1 - f(n)) \cdot f(n-1) \cdot Y_i^{(n-1)} + \ldots + (1 - f(n)) \cdot (1 - f(n-1)) \cdots (1 - f(2)) \cdot f(1) \cdot Y_i^{(1)} \quad (9.5)$$

In the case of $f(n) = \frac{1}{n}$, this reduces to:

$$V_n(s_i) = \frac{1}{n} \cdot Y_i^{(n)} + \frac{n-1}{n} \cdot \frac{1}{n-1} \cdot Y_i^{(n-1)} + \ldots + \frac{n-1}{n} \cdot \frac{n-2}{n-1} \cdots \frac{1}{2} \cdot \frac{1}{1} \cdot Y_i^{(1)} = \frac{\sum_{k=1}^{n} Y_i^{(k)}}{n} \quad (9.6)$$
which is an equally-weighted average of the trace experience returns from the state.
From the Law of Large Numbers, we know that the sample average converges to the ex-
pected value, which is the core idea behind the Monte-Carlo method.
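A tiny numerical check of this (our own snippet, not part of the book's code): applying the incremental update of Equation (9.3) with $f(n) = \frac{1}{n}$ reproduces the plain average of the trace experience returns from a state.

returns_from_state = [10.0, 4.0, 7.0, 13.0]   # trace experience returns observed from some state

v = 0.0
for n, y in enumerate(returns_from_state, start=1):
    v = v + (1.0 / n) * (y - v)   # Equation (9.3) with f(n) = 1/n

assert abs(v - sum(returns_from_state) / len(returns_from_state)) < 1e-12
print(v)   # 8.5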
Note that the Tabular class as an implementation of the abstract class FunctionApprox
is not just a software design happenstance - there is a formal mathematical specialization
here that is vital to recognize. This tabular representation is actually a special case of linear
function approximation by setting a feature function $\phi_i(\cdot)$ for each $x_i$ as: $\phi_i(x_i) = 1$ and
$\phi_i(x) = 0$ for each $x \neq x_i$ (i.e., $\phi_i(\cdot)$ is the indicator function for $x_i$, and the $\Phi$ matrix
of Chapter 4 reduces to the identity matrix). So we can conceptualize Tabular Monte-
Carlo Prediction as a linear function approximation with the feature functions equal to
the indicator functions for each of the non-terminal states and the linear-approximation
parameters wi equal to the Value Function estimates for the corresponding non-terminal
states.
With this perspective, more broadly, we can view Tabular RL as a special case of RL with
Linear Function Approximation of the Value Function. Moreover, the count_to_weight_func
attribute of Tabular plays the role of the learning rate (as a function of the number of it-
erations in stochastic gradient descent). This becomes clear if we write Equation (9.3) in
terms of parameter updates: write $V_n(s_i)$ as parameter value $w_i^{(n)}$ to denote the $n$-th update to parameter $w_i$ corresponding to state $s_i$, and write $f(n)$ as learning rate $\alpha_n$ for the $n$-th update to $w_i$, so that Equation (9.3) becomes:

$$w_i^{(n)} = w_i^{(n-1)} + \alpha_n \cdot \left(Y_i^{(n)} - w_i^{(n-1)}\right)$$

While the default setting of count_to_weight_func as $f(n) = \frac{1}{n}$ gives equal weight to all of the
past trace experience returns, we want to point out that real-world situations are not sta-
tionary in the sense that the environment typically evolves over a period of time and so, RL
algorithms have to appropriately adapt to the changing environment. The way to adapt
effectively is to have an element of “forgetfulness” of the past because if one learns about
the distant past far too strongly in a changing environment, our predictions (and eventu-
ally control) would not be effective. So, how does an RL algorithm “forget?” Well, one
can “forget” through an appropriate time-decay of the weights when averaging trace ex-
perience returns. If we set a constant learning rate α (in Tabular, this would correspond to
count_to_weight_func=lambda _: alpha), we’d obtain “forgetfulness” with lower weights
for old data points and higher weights for recent data points. This is because with a con-
stant learning rate α, Equation (9.5) reduces to:

$$V_n(s_i) = \sum_{j=1}^{n} \alpha \cdot (1 - \alpha)^{n-j} \cdot Y_i^{(j)}$$

so the weight on an older return decays geometrically with its age, and the weights sum to 1 asymptotically:

$$\lim_{n \to \infty} \sum_{j=1}^{n} \alpha \cdot (1 - \alpha)^{n-j} = \lim_{n \to \infty} 1 - (1 - \alpha)^n = 1$$
It’s worthwhile pointing out that the Monte-Carlo algorithm we’ve implemented above
is known as Each-Visit Monte-Carlo to refer to the fact that we include each occurrence
of a state in a trace experience. So if a particular state appears 10 times in a given trace
experience, we have 10 (state, return) pairs that are used to make the update (for just that
state) at the end of that trace experience. This is in contrast to First-Visit Monte-Carlo in
which only the first occurrence of a state in a trace experience is included in the set of
(state, return) pairs used to make an update at the end of the trace experience. So First-
Visit Monte-Carlo needs to keep track of whether a state has already been visited in a
trace experience (repeat occurrences of states in a trace experience are ignored). We won’t
implement First-Visit Monte-Carlo in this book, and leave it to you as an exercise.
Now let’s write some code to test our implementation of Monte-Carlo Prediction. To do
so, we go back to a simple finite MRP example from Chapter 1 - SimpleInventoryMRPFinite.
The following code creates an instance of the MRP and computes its exact Value Function
based on Equation (1.2).
from rl.chapter2.simple_inventory_mrp import SimpleInventoryMRPFinite
user_capacity = 2
user_poisson_lambda = 1.0
user_holding_cost = 1.0
user_stockout_cost = 10.0
user_gamma = 0.9
si_mrp = SimpleInventoryMRPFinite(
capacity=user_capacity,
poisson_lambda=user_poisson_lambda,
holding_cost=user_holding_cost,
stockout_cost=user_stockout_cost
)
si_mrp.display_value_function(gamma=user_gamma)
This prints the exact Value Function for each of the 6 non-terminal states.
Next, we run Monte-Carlo Prediction by first generating a stream of trace experiences (in
the form of sampling traces) from the MRP, and then calling mc_prediction using Tabular
with equal-weights-learning-rate (i.e., default count_to_weight_func of lambda n: 1.0 /
n).
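The code to do this is along the following lines (a sketch that assumes the mc_prediction function above and the book's rl package; the exact test code is in the file referenced at the end of this section):

import itertools
from pprint import pprint

from rl.distribution import Choose
from rl.function_approx import Tabular
import rl.iterate as iterate

num_traces = 60000

mc_traces = si_mrp.reward_traces(Choose(si_mrp.non_terminal_states))
mc_vfs = mc_prediction(
    traces=mc_traces,
    approx_0=Tabular(),
    gamma=user_gamma,
    episode_length_tolerance=1e-6
)
final_mc_vf = iterate.last(itertools.islice(mc_vfs, num_traces + 1))
pprint({s: round(v, 3) for s, v in final_mc_vf.values_map.items()})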
We see that the Value Function computed by Tabular Monte-Carlo Prediction with 60000
trace experiences is within 0.01 of the exact Value Function, for each of the states.
This completes the coverage of our first RL Prediction algorithm: Monte-Carlo Predic-
tion. This has the advantage of being a very simple, easy-to-understand algorithm with an
unbiased estimate of the Value Function. But Monte-Carlo can be slow to converge to the
correct Value Function and another disadvantage of Monte-Carlo is that it requires entire
trace experiences (or long-enough trace experiences when γ < 1). The next RL Prediction
algorithm we cover (Temporal-Difference) overcomes these weaknesses.
9.4. Temporal-Difference (TD) Prediction
To understand Temporal-Difference (TD) Prediction, we start with its Tabular version
as it is simple to understand (and then we can generalize to TD Prediction with Function
Approximation). To understand Tabular TD prediction, we begin by taking another look at
the Value Function update in Tabular Monte-Carlo (MC) Prediction with constant learning
rate. Recall that in Tabular MC Prediction with constant learning rate $\alpha$, the Value Function update for a state $S_t$ visited in a trace experience (with associated return $G_t$) is:

$$V(S_t) \leftarrow V(S_t) + \alpha \cdot \left(G_t - V(S_t)\right)$$

Tabular TD Prediction replaces the return $G_t$ with the bootstrapped TD Target $R_{t+1} + \gamma \cdot V(S_{t+1})$, so that the Value Function can be updated after each atomic experience:

$$V(S_t) \leftarrow V(S_t) + \alpha \cdot \left(R_{t+1} + \gamma \cdot V(S_{t+1}) - V(S_t)\right)$$

This generalizes in a straightforward manner to TD Prediction with function approximation. To understand how the parameters of the function approximation update,
let’s consider the loss function for TD. We start with the single-state loss function for MC
(Equation (9.1)) and simply replace Gt with Rt+1 + γ · V (St+1 , w) as follows:
$$\mathcal{L}_{(S_t, S_{t+1}, R_{t+1})}(w) = \frac{1}{2} \cdot \left(V(S_t; w) - (R_{t+1} + \gamma \cdot V(S_{t+1}; w))\right)^2 \quad (9.9)$$
Unlike MC, in the case of TD, we don’t take the gradient of this loss function. Instead we
“cheat” in the gradient calculation by ignoring the dependency of V (St+1 ; w) on w. This
“gradient with cheating” calculation is known as semi-gradient. Specifically, we pretend
that the only dependency of the loss function on w is through V (St ; w). Hence, the semi-
gradient calculation results in the following formula for change in parameters w:
This looks similar to the formula for parameters update in the case of MC (with Gt
replaced by Rt+1 + γ · V (St+1 ; w)). Hence, this has the same structure as MC in terms of
conceptualizing the change in parameters as the product of the following 3 entities:
• Learning Rate α
• TD Error δt = Rt+1 + γ · V (St+1 ; w) − V (St ; w)
• Estimate Gradient of the conditional expected return V (St ; w) with respect to the pa-
rameters w
Now let’s write some code to implement TD Prediction (with Function Approximation).
Unlike MC which takes as input a stream of trace experiences, TD works with a more
granular stream: a stream of atomic experiences. Note that a stream of trace experiences can
be broken up into a stream of atomic experiences, but we could also obtain a stream of
atomic experiences in other ways (not necessarily from a stream of trace experiences).
Thus, the TD prediction algorithm we write below (td_prediction) takes as input an
Iterable[TransitionStep[S]]. td_prediction produces an Iterator of ValueFunctionApprox[S],
i.e., an updated function approximation of the Value Function after each atomic experience
in the input atomic experiences stream. Similar to our implementation of MC, our imple-
mentation of TD is based on supervised learning on a stream of (x, y) pairs, but there are
two key differences:
1. The update of the ValueFunctionApprox is done after each atomic experience, versus
MC where the updates are done at the end of each trace experience.
2. The y-value depends on the Value Function estimate, as seen from the update Equa-
tion (9.10) above. This means we cannot use the iterate_updates method of FunctionApprox
that MC Prediction uses. Rather, we need to directly use the rl.iterate.accumulate
function (a wrapped version of itertools.accumulate). As seen in the code below,
the accumulation is performed on the input transitions: Iterable[TransitionStep[S]]
and the function governing the accumulation is the step function in the code below
that calls the update method of ValueFunctionApprox. Note that the y-values passed
to update involve a call to the estimated Value Function v for the next_state of each
transition. However, since the next_state could be Terminal or NonTerminal, and
since ValueFunctionApprox is valid only for non-terminal states, we use the extended_vf
function we had implemented in Chapter 4 to handle the cases of the next state being
Terminal or NonTerminal (with terminal states evaluating to the default value of 0).
import rl.iterate as iterate
import rl.markov_process as mp
from rl.approximate_dynamic_programming import ValueFunctionApprox
from rl.approximate_dynamic_programming import extended_vf
def td_prediction(
transitions: Iterable[mp.TransitionStep[S]],
approx_0: ValueFunctionApprox[S],
gamma: float
) -> Iterator[ValueFunctionApprox[S]]:
def step(
v: ValueFunctionApprox[S],
transition: mp.TransitionStep[S]
) -> ValueFunctionApprox[S]:
return v.update([(
transition.state,
transition.reward + gamma * extended_vf(v, transition.next_state)
)])
return iterate.accumulate(transitions, step, initial=approx_0)
import itertools
from rl.distribution import Distribution, Choose
from rl.approximate_dynamic_programming import NTStateDistribution
def mrp_episodes_stream(
mrp: MarkovRewardProcess[S],
start_state_distribution: NTStateDistribution[S]
) -> Iterable[Iterable[TransitionStep[S]]]:
return mrp.reward_traces(start_state_distribution)
def fmrp_episodes_stream(
fmrp: FiniteMarkovRewardProcess[S]
) -> Iterable[Iterable[TransitionStep[S]]]:
return mrp_episodes_stream(fmrp, Choose(fmrp.non_terminal_states))
def unit_experiences_from_episodes(
episodes: Iterable[Iterable[TransitionStep[S]]],
episode_length: int
) -> Iterable[TransitionStep[S]]:
return itertools.chain.from_iterable(
itertools.islice(episode, episode_length) for episode in episodes
)
For the learning rate, we use the following schedule:

$$\alpha_n = \frac{\alpha}{\left(1 + \frac{n-1}{H}\right)^{\beta}} \quad (9.11)$$

where $\alpha_n$ is the learning rate to be used at the $n$-th Value Function update for a given state, $\alpha$ is the initial learning rate (i.e. $\alpha = \alpha_1$), $H$ (we call it "half life") is the number of updates for the learning rate to decrease to half the initial learning rate (if $\beta$ is 1), and $\beta$ is the exponent controlling the curvature of the decrease in the learning rate. We shall often set $\beta = 0.5$.
def learning_rate_schedule(
initial_learning_rate: float,
half_life: float,
exponent: float
) -> Callable[[int], float]:
def lr_func(n: int) -> float:
return initial_learning_rate * (1 + (n - 1) / half_life) ** -exponent
return lr_func
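A quick usage check of this schedule (our own snippet): with exponent β = 1 the learning rate halves after H updates, while β = 0.5 decays more slowly.

lr_beta_one = learning_rate_schedule(initial_learning_rate=0.03, half_life=1000, exponent=1.0)
lr_beta_half = learning_rate_schedule(initial_learning_rate=0.03, half_life=1000, exponent=0.5)

print(lr_beta_one(1), lr_beta_one(1001))    # 0.03 0.015
print(lr_beta_half(1), lr_beta_half(1001))  # 0.03 0.0212...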
With these functions available, we can now write code to test our implementation of
TD Prediction. We use the same instance si_mrp: SimpleInventoryMRPFinite that we had
created above when testing MC Prediction. We use the same number of episodes (60000)
we had used when testing MC Prediction. We set initial learning rate α = 0.03, half life
H = 1000 and exponent β = 0.5. We set the episode length (number of atomic experiences
in a single trace experience) to be 100 (about the same as with the settings we had for
testing MC Prediction). We use the same discount factor γ = 0.9.
{NonTerminal(state=InventoryState(on_hand=0, on_order=0)): -35.529,
NonTerminal(state=InventoryState(on_hand=0, on_order=1)): -27.868,
NonTerminal(state=InventoryState(on_hand=0, on_order=2)): -28.344,
NonTerminal(state=InventoryState(on_hand=1, on_order=0)): -28.935,
NonTerminal(state=InventoryState(on_hand=1, on_order=1)): -29.386,
NonTerminal(state=InventoryState(on_hand=2, on_order=0)): -30.305}
Thus, we see that our implementation of TD prediction with the above settings fetches us
an estimated Value Function within 0.065 of the true Value Function after 60,000 episodes.
As ever, we encourage you to play with various settings for MC Prediction and TD pre-
diction to develop some intuition for how the results change as you change the settings.
You can play with the code in the file rl/chapter10/simple_inventory_mrp.py.
9.5. TD versus MC
It is often claimed that TD is the most significant and innovative idea in the development
of the field of Reinforcement Learning. The key to TD is that it blends the advantages
of Dynamic Programming (DP) and Monte-Carlo (MC). Like DP, TD updates the Value
Function estimate by bootstrapping from the Value Function estimate of the next state
experienced (essentially, drawing from Bellman Equation). Like MC, TD learns from ex-
periences without requiring access to transition probabilities (MC and TD updates are ex-
perience updates while DP updates are transition-probabilities-averaged-updates). So TD over-
comes curse of dimensionality and curse of modeling (computational limitation of DP),
and also has the advantage of not requiring entire trace experiences (practical limitation
of MC).
The TD idea has it’s origins in a seminal book by Harry Klopf (Klopf and Data Sciences
Laboratory 1972) that greatly influenced Richard Sutton and Andrew Barto to pursue the
TD idea further, after which they published several papers on TD, much of whose content
is covered in their RL book (Richard S. Sutton and Barto 2018).
scoring the goal (which is essentially the Value Function). If a pass to her teammate did not result in a goal but greatly increased the chances of scoring one, then passing the ball in that state is a good action and its Q-Value gets boosted immediately. She will likely try that action (or a similar one) again, so actions with better Q-Values are prioritized, which drives towards better and quicker goal-scoring opportunities and likely eventually results in a goal. Such goal-scoring (based on active learning during the game, cutting out poor actions and promoting good actions) would be hailed by commentators as "success from continuous and eager learning" on the part of the soccer player. This is essentially TD learning.
If you think about career decisions and relationship decisions in our lives, MC-style
learning is quite infeasible because we simply don’t have sufficient “episodes” (for certain
decisions, our entire life might be a single episode), and waiting to analyze and adjust until
the end of an episode might be far too late in our lives. Rather, we learn and adjust our
evaluations of situations constantly in a TD-like manner. Think about various important
decisions we make in our lives and you will see that we learn by perpetual adjustment of
estimates and we are efficient in the use of limited experiences we obtain in our lives.
The standard convergence guarantees for these Tabular Prediction algorithms require the learning rate schedule $\alpha_n$ (as a function of the number of Value Function updates $n$ for a given state) to satisfy:

$$\sum_{n=1}^{\infty} \alpha_n = \infty \qquad \text{and} \qquad \sum_{n=1}^{\infty} \alpha_n^2 < \infty$$

The stochastic approximation conditions above are known as the Robbins-Monro sched-
ule and apply to a general class of iterative methods used for root-finding or optimization
when data is noisy. The intuition here is that the steps should be large enough (first con-
dition) to eventually overcome any unfavorable initial values or noisy data and yet the
steps should eventually become small enough (second condition) to ensure convergence.
Note that in Equation (9.11), exponent β = 1 satisfies the Robbins-Monro conditions. In
particular, our default choice of count_to_weight_func=lambda n: 1.0 / n in Tabular sat-
isfies the Robbins-Monro conditions, but our other common choice of constant learning
rate does not satisfy the Robbins-Monro conditions. However, we want to emphasize that
the Robbins-Monro conditions are typically not that useful in practice because it is not a
statement of speed of convergence and it is not a statement on closeness to the true optima
(in practice, the goal is typically simply to get fairly close to the true answer reasonably
quickly).
The bad news with TD (due to the bias in it’s update) is that TD Prediction with function
approximation does not always converge to the true value function. Most TD Prediction
convergence proofs are for the Tabular case, however some proofs are for the case of linear
function approximation of the Value Function.
The flip side of MC’s bias advantage over TD is that the TD Target Rt+1 + γ · V (St+1 ; w)
has much lower variance than Gt because Gt depends on many random state transitions
and random rewards (on the remainder of the trace experience) whose variances accu-
mulate, whereas the TD Target depends on only the next random state transition St+1 and
the next random reward Rt+1 .
As for speed of convergence and efficiency in use of limited set of experiences data,
we still don’t have formal proofs on whether MC is better or TD is better. More impor-
tantly, because MC and TD have significant differences in their usage of data, nature of
updates, and frequency of updates, it is not even clear how to create a level-playing field
when comparing MC and TD for speed of convergence or for efficiency in usage of limited
experiences data. The typical comparisons between MC and TD are done with constant
learning rates, and in practice TD has generally been observed to learn faster than MC with
constant learning rates.
A popular simple problem in the literature (when comparing RL prediction algorithms)
is a random walk MRP with states {0, 1, 2, . . . , B} with 0 and B as the terminal states (think
of these as terminating barriers of a random walk) and the remaining states as the non-
terminal states. From any non-terminal state i, we transition to state i + 1 with probability
p and to state i − 1 with probability 1 − p. The reward is 0 upon each transition, except
if we transition from state B − 1 to terminal state B, which results in a reward of 1. It is
easy to see that for p = 0.5 (symmetric random walk) and discount factor γ = 1, the Value Function is given by:
$V(i) = \frac{i}{B}$ for all $0 < i < B$. We'd like to analyze how MC and TD converge, if at all,
to this Value Function, starting from a neutral initial Value Function of $V(i) = 0.5$ for all
$0 < i < B$. The following code sets up this random walk MRP.
from rl.distribution import Categorical
class RandomWalkMRP(FiniteMarkovRewardProcess[int]):
barrier: int
p: float
def __init__(
self,
barrier: int,
p: float
):
self.barrier = barrier
self.p = p
super().__init__(self.get_transition_map())
def get_transition_map(self) -> \
Mapping[int, Categorical[Tuple[int, float]]]:
d: Dict[int, Categorical[Tuple[int, float]]] = {
i: Categorical({
(i + 1, 0. if i < self.barrier - 1 else 1.): self.p,
(i - 1, 0.): 1 - self.p
}) for i in range(1, self.barrier)
}
return d
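We can then create an instance of this random walk MRP and display its exact Value Function for comparison against the MC/TD estimates (a usage sketch, assuming the display_value_function method of FiniteMarkovRewardProcess used earlier):

this_barrier: int = 10
this_p: float = 0.5
random_walk = RandomWalkMRP(barrier=this_barrier, p=this_p)
random_walk.display_value_function(gamma=1.0)   # exact V(i) = i/B for the symmetric random walk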
Figure 9.1.: MC and TD Convergence for Random Walk MRP
We run our implementations of MC Prediction and TD Prediction on this random walk MRP,
and plot the root-mean-squared-errors (RMSE) of the Value Function averaged across the
non-terminal states as a function of episode batches (i.e., visualize how the RMSE of the
Value Function evolves as the MC/TD algorithm progresses). This is done by calling the
function compare_mc_and_td which is in the file rl/chapter10/prediction_utils.py.
Figure 9.1 depicts the convergence for our implementations of MC and TD Prediction
for constant learning rates of α = 0.01 (darker curves) and α = 0.05 (lighter curves). We
produced this Figure by using data from 700 episodes generated from the random walk
MRP with barrier B = 10, p = 0.5 and discount factor γ = 1 (a single episode refers
to a single trace experience that terminates either at state 0 or at state B). We plotted
the RMSE after each batch of 7 episodes, hence each of the 4 curves shown in the Figure
has 100 RMSE data points plotted. Firstly, we clearly see that MC has significantly more
variance as evidenced by the choppy MC RMSE progression curves. Secondly, we note
that α = 0.01 is a fairly small learning rate and so, the progression of RMSE is quite slow
on the darker curves. On the other hand, notice the quick learning for α = 0.05 (lighter
curves). MC RMSE curve is not just choppy, it’s evident that it progresses quite quickly
in the first few episode batches (relative to the corresponding TD) but is slow after the
first few episode batches (relative to the corresponding TD). This results in TD reaching
fairly small RMSE quicker than the corresponding MC (this is especially stark for TD with
α = 0.05, i.e. the dashed lighter curve in the Figure). This behavior of TD outperforming
the comparable MC (with constant learning rate) is typical for MRP problems.
Lastly, it’s important to recognize that MC is not very sensitive to the initial Value Func-
tion while TD is more sensitive to the initial Value Function. We encourage you to play
with the initial Value Function for this random walk example and evaluate how it affects
MC and TD convergence speed.
More generally, we encourage you to play with the compare_mc_and_td function on other
choices of MRP (ones we have created earlier in this book such as the inventory examples,
or make up your own MRPs) so you can develop good intuition for how MC and TD
Prediction algorithms converge for a variety of choices of learning rate schedules, initial
Value Function choices, choices of discount factor etc.
We now consider RL Prediction with Experience Replay: we are given a fixed, finite data set of trace experiences, each specified as a sequence of (state, reward) pairs, and we repeatedly (re-)use this data to learn the Value Function. The following function converts such given data into trace experiences in the form of TransitionSteps:

def get_fixed_episodes_from_sr_pairs_seq(
sr_pairs_seq: Sequence[Sequence[Tuple[S, float]]],
terminal_state: S
) -> Sequence[Sequence[TransitionStep[S]]]:
return [[TransitionStep(
state=NonTerminal(s),
reward=r,
next_state=NonTerminal(trace[i+1][0])
if i < len(trace) - 1 else Terminal(terminal_state)
) for i, (s, r) in enumerate(trace)] for trace in sr_pairs_seq]
import numpy as np
def get_episodes_stream(
fixed_episodes: Sequence[Sequence[TransitionStep[S]]]
) -> Iterator[Sequence[TransitionStep[S]]]:
num_episodes: int = len(fixed_episodes)
while True:
yield fixed_episodes[np.random.randint(num_episodes)]
As we know, TD works with atomic experiences rather than trace experiences. So we
need the following function to split the fixed finite set of trace experiences into a fixed finite
set of atomic experiences:
import itertools
def fixed_experiences_from_fixed_episodes(
fixed_episodes: Sequence[Sequence[TransitionStep[S]]]
) -> Sequence[TransitionStep[S]]:
return list(itertools.chain.from_iterable(fixed_episodes))
We’d like TD Prediction to run on an endless stream of TransitionStep[S] from the fixed
finite set of atomic experiences produced by fixed_experiences_from_fixed_episodes. So
we write the following function to generate an endless stream by repeatedly randomly
(uniformly) sampling from the fixed finite set of atomic experiences:
def get_experiences_stream(
fixed_experiences: Sequence[TransitionStep[S]]
) -> Iterator[TransitionStep[S]]:
num_experiences: int = len(fixed_experiences)
while True:
yield fixed_experiences[np.random.randint(num_experiences)]
As a simple test, we consider the following given data of 5 trace experiences, each a sequence of (state, reward) pairs for two states 'A' and 'B':

given_data: Sequence[Sequence[Tuple[str, float]]] = [
    [('A', 2.), ('A', 6.), ('B', 1.), ('B', 2.)],
    [('A', 3.), ('B', 2.), ('A', 4.), ('B', 2.), ('B', 0.)],
    [('B', 3.), ('B', 6.), ('A', 1.), ('B', 1.)],
    [('A', 0.), ('B', 2.), ('A', 4.), ('B', 4.), ('B', 2.), ('B', 3.)],
    [('B', 8.), ('B', 2.)]
]
We set γ = 0.9, convert given_data into fixed_episodes (using get_fixed_episodes_from_sr_pairs_seq), and calculate the mean discounted return from each occurrence of each state directly from this given data, which prints:
{NonTerminal(state='B'): 5.190378571428572,
 NonTerminal(state='A'): 8.261809999999999}
Now let’s run MC Prediction with experience-replayed 100,000 trace experiences with
equal weighting for each of the (state, return) pairs, i.e., with count_to_weight_func at-
tribute of Tabular set to the function lambda n: 1.0 / n:
import rl.monte_carlo as mc
import rl.iterate as iterate
def mc_prediction(
episodes_stream: Iterator[Sequence[TransitionStep[S]]],
gamma: float,
num_episodes: int
) -> Mapping[NonTerminal[S], float]:
return iterate.last(itertools.islice(
mc.mc_prediction(
traces=episodes_stream,
approx_0=Tabular(),
gamma=gamma,
episode_length_tolerance=1e-10
),
num_episodes
)).values_map
num_mc_episodes: int = 100000
episodes: Iterator[Sequence[TransitionStep[str]]] = \
get_episodes_stream(fixed_episodes)
mc_pred: Mapping[NonTerminal[str], float] = mc_prediction(
episodes_stream=episodes,
gamma=gamma,
num_episodes=num_mc_episodes
)
pprint(mc_pred)
This prints:
{NonTerminal(state='A'): 8.262643843836214,
 NonTerminal(state='B'): 5.191276907315868}
So, as expected, it ties out within the standard error for 100,000 trace experiences. Now
let’s move on to TD Prediction. Let’s run TD Prediction on experience-replayed 1,000,000
atomic experiences with a learning rate schedule having an initial learning rate of 0.01,
decaying with a half life of 10000, and with an exponent of 0.5.
import rl.td as td
from rl.function_approx import learning_rate_schedule, Tabular
def td_prediction(
experiences_stream: Iterator[TransitionStep[S]],
gamma: float,
num_experiences: int
) -> Mapping[NonTerminal[S], float]:
return iterate.last(itertools.islice(
td.td_prediction(
transitions=experiences_stream,
approx_0=Tabular(count_to_weight_func=learning_rate_schedule(
initial_learning_rate=0.01,
half_life=10000,
exponent=0.5
)),
gamma=gamma
),
num_experiences
)).values_map
fixed_experiences: Sequence[TransitionStep[str]] = \
fixed_experiences_from_fixed_episodes(fixed_episodes)
experiences: Iterator[TransitionStep[str]] = \
get_experiences_stream(fixed_experiences)
num_td_experiences: int = 1000000

td_pred: Mapping[NonTerminal[str], float] = td_prediction(
    experiences_stream=experiences,
    gamma=gamma,
    num_experiences=num_td_experiences
)

pprint(td_pred)
This prints:
{NonTerminal(state='A'): 9.899838136517303,
 NonTerminal(state='B'): 7.444114569419306}
We note that this Value Function is vastly different from the Value Function produced by
MC Prediction. Is there a bug in our code, or perhaps a more serious conceptual problem?
Nope - there is neither a bug here nor a more serious problem. This is exactly what TD
Prediction on Experience Replay on a fixed finite data set is meant to produce. So, what
Value Function does this correspond to? It turns out that TD Prediction drives towards a
Value Function of an MRP that is implied by the fixed finite set of given experiences. By the
term implied, we mean the maximum likelihood estimate for the transition probabilities
PR , estimated from the given fixed finite data, i.e.,
$$\mathcal{P}_R(s, r, s') = \frac{\sum_{i=1}^{N} \mathbb{I}_{S_i = s, R_{i+1} = r, S_{i+1} = s'}}{\sum_{i=1}^{N} \mathbb{I}_{S_i = s}} \quad (9.12)$$
where the fixed finite set of atomic experiences are [(Si , Ri+1 , Si+1 )|1 ≤ i ≤ N ], and I
denotes the indicator function.
So let’s write some code to construct this MRP based on the above formula.
import collections
import itertools

from rl.distribution import Categorical
from rl.markov_process import FiniteMarkovRewardProcess
def finite_mrp(
fixed_experiences: Sequence[TransitionStep[S]]
) -> FiniteMarkovRewardProcess[S]:
def by_state(tr: TransitionStep[S]) -> S:
return tr.state.state
d: Mapping[S, Sequence[Tuple[S, float]]] = \
{s: [(t.next_state.state, t.reward) for t in l] for s, l in
itertools.groupby(
sorted(fixed_experiences, key=by_state),
key=by_state
)}
mrp: Dict[S, Categorical[Tuple[S, float]]] = \
{s: Categorical({x: y / len(l) for x, y in
collections.Counter(l).items()})
for s, l in d.items()}
return FiniteMarkovRewardProcess(mrp)
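We can then use this function (a sketch of the usage, building on the variables defined earlier in this example) to construct the data-implied MRP from our fixed finite set of atomic experiences and display its exact Value Function:

implied_mrp: FiniteMarkovRewardProcess[str] = finite_mrp(fixed_experiences)
implied_mrp.display_value_function(gamma=gamma)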
The displayed Value Function of the data-implied MRP shows that our TD Prediction algorithm doesn't exactly match it, but it gets close. It turns out that a variation of our TD Prediction algorithm
exactly matches the Value Function of the data-implied MRP. We won’t implement this
variation in this chapter, but will describe it briefly here. The variation is as follows:
• The Value Function is not updated after each atomic experience, rather the Value
Function is updated at the end of each batch of atomic experiences.
• Each batch of atomic experiences consists of a single occurrence of each atomic ex-
perience in the given fixed finite data set.
• The updates to the Value Function to be performed at the end of each batch are accu-
mulated in a buffer after each atomic experience and the buffer’s contents are used to
update the Value Function only at the end of the batch. Specifically, this means that
the right-hand-side of Equation (9.10) is calculated at the end of each atomic expe-
rience and these calculated values are accumulated in the buffer until the end of the
batch, at which point the buffer’s contents are used to update the Value Function.
This variant of the TD Prediction algorithm is known as Batch Updating and more broadly,
RL algorithms that update the Value Function at the end of a batch of experiences are ref-
ered to as Batch Methods. This contrasts with Incremental Methods, which are RL algorithms
that update the Value Function after each atomic experience (in the case of TD) or at the
end of each trace experience (in the case of MC). The MC and TD Prediction algorithms
we implemented earlier in this chapter are Incremental Methods. We will cover Batch
Methods in detail in Chapter 11.
Although our TD Prediction algorithm is an Incremental Method, it did get fairly close
to the Value Function of the data-implied MRP. So let us ignore the nuance that our TD
Prediction algorithm didn’t exactly match the Value Function of the data-implied MRP
and instead focus on the fact that our MC Prediction algorithm and our TD Prediction al-
gorithm drove towards two very different Value Functions. The MC Prediction algorithm
learns a “fairly naive” Value Function - one that is based on the mean of the observed re-
turns (for each state) in the given fixed finite data. The TD Prediction algorithm is learning
something “deeper” - it is (implicitly) constructing an MRP based on the given fixed fi-
nite data (Equation (9.12)), and then (implicitly) calculating the Value Function of the
constructed MRP. The mechanics of the TD Prediction algorithms don’t actually construct
the MRP and calculate the Value Function of the MRP - rather, the TD Prediction algo-
rithm directly drives towards the Value Function of the data-implied MRP. However, the
fact that it gets to this “more nuanced” Value Function means that it is (implicitly) trying
to infer a transition structure from the given data, and hence, we say that it is learning
something “deeper” than what MC is learning. This has practical implications. Firstly,
this learning facet of TD means that it exploits any Markov property in the environment
and so, TD algorithms are more efficient (learn faster than MC) in Markov environments.
On the other hand, the naive nature of MC (not exploiting any Markov property in the
environment) is advantageous (more effective than TD) in non-Markov environments.
We encourage you to try Experience Replay on larger input data sets, and to code up
Batch Method variants of MC and TD prediction algorithms. As a starting point, the expe-
rience replay code for this chapter is in the file rl/chapter10/mc_td_experience_replay.py.
• Bootstrapping: By “bootstrapping,” we mean that the Value Function update uses the current estimate of the Value Function of the next state (or next state-action pair), rather than the actual full return from the state. TD and DP do bootstrap, while MC does not bootstrap.
• Experiencing: By “experiencing,” we mean that the algorithm uses experiences ob-
tained by interacting with an actual or simulated environment, rather than per-
forming expectation calculations with a model of transition probabilities (the latter
doesn’t require interactions with an environment and hence, doesn’t “experience”).
MC and TD do experience, while DP does not experience.
We illustrate this perspective of bootstrapping (or not) and experiencing (or not) with
some very popular diagrams that we are borrowing from lecture slides from David Silver’s
RL course and from teaching content prepared by Richard Sutton.
The first diagram is Figure 9.2, known as the MC backup diagram for an MDP (although
we are covering Prediction in this chapter, these concepts also apply to MDP Control).
The root of the tree is the state whose Value Function we want to update. The remaining
nodes of the tree are the future states that might be visited and future actions that might be
taken. The branching on the tree is due to the probabilistic transitions of the MDP and the
multiple choices of actions that might be taken at each time step. The nodes marked as “T”
are the terminal states. The highlighted path on the tree from the root node (current state)
to a terminal state indicates a particular trace experience used by the MC algorithm. The
highlighted path is the set of future states/actions used in updating the Value Function of
the current state (root node). We say that the Value Function is “backed up” along this
highlighted path (to mean that the Value Function update calculation propagates from the
bottom of the highlighted path to the top, since the trace experience return is calculated as
accumulated rewards from the bottom to the top, i.e., from the end of the trace experience
to the beginning of the trace experience). This is why we refer to such diagrams as backup
diagrams. Since MC “experiences” , it only considers a single child node from any node
(rather than all the child nodes, which would be the case if we considered all probabilistic
transitions or considered all action choices). So the backup is narrow (doesn’t go wide
across the tree). Since MC does not “bootstrap,” it doesn’t use the Value Function estimate
from it’s child/grandchild node (next time step’s state/action) - instead, it utilizes the
rewards at all future states/actions along the entire trace experience. So the backup works
deep into the tree (is not shallow as would be the case in “bootstrapping”). In summary,
the MC backup is narrow and deep.
The next diagram is Figure 9.3, known as the TD backup diagram for an MDP. Again,
the highlighting applies to the future states/actions used in updating the Value Function
of the current state (root node). The Value Function is “backed up” along this highlighted
portion of the tree. Since TD “experiences” , it only considers a single child node from any
node (rather than all the child nodes, which would be the case if we considered all prob-
abilistic transitions or considered all actions choices). So the backup is narrow (doesn’t
go wide across the tree). Since TD “bootstraps” , it uses the Value Function estimate from
it’s child/grandchild node (next time step’s state/action) and doesn’t utilize rewards at
states/actions beyond the next time step’s state/action. So the backup is shallow (doesn’t
work deep into the tree). In summary, the TD backup is narrow and shallow.
The next diagram is Figure 9.4, known as the DP backup diagram for an MDP. Again,
the highlighting applies to the future states/actions used in updating the Value Function
of the current state (root node). The Value Function is “backed up” along this highlighted
portion of the tree. Since DP does not “experience” and utilizes the knowledge of prob-
abilities of all next states and considers all choices of actions (in the case of Control), it
considers all child nodes (all choices of actions) and all grandchild nodes (all probabilis-
tic transitions to next states) from the root node (current state). So the backup goes wide
across the tree. Since DP “bootstraps” , it uses the Value Function estimate from it’s chil-
Figure 9.2.: MC Backup Diagram (Image Credit: David Silver’s RL Course)
dren/grandchildren nodes (next time step’s states/actions) and doesn’t utilize rewards at
states/actions beyond the next time step’s states/actions. So the backup is shallow (doesn’t
work deep into the tree). In summary, the DP backup is wide and shallow.
This perspective of shallow versus deep (for “bootstrapping” or not) and of narrow ver-
sus wide (for “experiencing” or not) is a great way to visualize and internalize the core
ideas within MC, TD and DP, and it helps us compare and contrast these methods in a
simple and intuitive manner. We must thank Rich Sutton for this excellent pedagogical
contribution. This brings us to the next diagram (Figure 9.5) which provides a unified
view of RL in a single picture. The top of this Figure shows methods that “bootstrap” (in-
cluding TD and DP) and the bottom of this Figure shows methods that do not “bootstrap”
(including MC and methods known as “Exhaustive Search” that go both deep into the
tree and wide across the tree - we shall cover some of these methods in a later chapter).
Therefore the vertical dimension of this Figure refers to the depth of the backup. The left of
this Figure shows methods that “experience” (including TD and MC) and the right of this
Figure shows methods that do not “experience” (including DP and “Exhaustive Search”).
Therefore, the horizontal dimension of this Figure refers to the width of the backup.
Figure 9.3.: TD Backup Diagram (Image Credit: David Silver’s RL Course)
Figure 9.4.: DP Backup Diagram (Image Credit: David Silver’s RL Course)
Figure 9.5.: Unified View of RL (Image Credit: Sutton-Barto’s RL Book)
We now cover an important generalization of the TD approach to RL Prediction known as TD(λ). λ is a continuous-valued parameter in the range [0, 1] such
that λ = 0 corresponds to the TD approach and λ = 1 corresponds to the MC approach.
Tuning λ between 0 and 1 allows us to span the spectrum from the TD approach to the
MC approach, essentially a blended approach known as the TD(λ) approach. The TD(λ)
approach for RL Prediction gives us the TD(λ) Prediction algorithm. To get to the TD(λ)
Prediction algorithm (in this section), we start with the TD Prediction algorithm we wrote
earlier, generalize it to a multi-time-step bootstrapping prediction algorithm, extend that
further to an algorithm known as the λ-Return Prediction algorithm, after which we shall
be ready to present the TD(λ) Prediction algorithm.
We define the $n$-step bootstrapped return $G_{t,n}$ (for the Tabular case) as:

$$G_{t,n} = \sum_{i=t+1}^{t+n} \gamma^{i-t-1} \cdot R_i + \gamma^n \cdot V(S_{t+n}) = R_{t+1} + \gamma \cdot R_{t+2} + \gamma^2 \cdot R_{t+3} + \ldots + \gamma^{n-1} \cdot R_{t+n} + \gamma^n \cdot V(S_{t+n})$$

and, in the case of function approximation with parameters $w$, as:

$$G_{t,n} = \sum_{i=t+1}^{t+n} \gamma^{i-t-1} \cdot R_i + \gamma^n \cdot V(S_{t+n}; w) = R_{t+1} + \gamma \cdot R_{t+2} + \gamma^2 \cdot R_{t+3} + \ldots + \gamma^{n-1} \cdot R_{t+n} + \gamma^n \cdot V(S_{t+n}; w)$$
The nuances we outlined above for when the trace experience terminates naturally apply here as well. The parameters update for this $n$-step bootstrapping Prediction algorithm is:

$$\Delta w = \alpha \cdot \left(G_{t,n} - V(S_t; w)\right) \cdot \nabla_w V(S_t; w) \quad (9.14)$$

Equation (9.14) looks similar to the parameters update equations for the MC and TD
Prediction algorithms we covered earlier, in terms of conceptualizing the change in pa-
rameters as the product of the following 3 entities:
• Learning Rate α
• n-step Bootstrapped Error Gt,n − V (St ; w)
• Estimate Gradient of the conditional expected return V (St ; w) with respect to the pa-
rameters w
n serves as a parameter taking us across the spectrum from TD to MC. n = 1 is the case
of TD while sufficiently large n is the case of MC. If a trace experience is of length T (i.e.,
ST ∈ T ), then n ≥ T will not have any bootstrapping (since the bootstrapping target goes
beyond the length of the trace experience) and hence, this makes it identical to MC.
We note that for large n, the update to the Value Function for state St visited at time
t happens in a delayed manner (after n steps, at time t + n), which is unlike the TD al-
gorithm we had developed earlier where the update happens at the very next time step.
We won’t be implementing this n-step bootstrapping Prediction algorithm and leave it as
an exercise for you to implement (re-using some of the functions/classes we have devel-
oped so far in this book). A key point to note for your implementation: The input won’t
be an Iterable of atomic experiences (like in the case of the TD Prediction algorithm we
implemented), rather it will be an Iterable of trace experiences (i.e., the input will be the
same as for our MC Prediction algorithm: Iterable[Iterable[TransitionStep[S]]]) since
we need multiple future rewards in the trace to perform an update to the current state.
More generally, we can consider a target that is a weighted average of the $n$-step bootstrapped returns and the full return:

$$\sum_{n=1}^{N} u_n \cdot G_{t,n} + u \cdot G_t \qquad \text{where } u + \sum_{n=1}^{N} u_n = 1$$
Note that any of the un or u can be 0, as long as they all sum up to 1. The λ-Return target
is a special case of the weights un and u, and applies to episodic problems (i.e., where
every trace experience terminates). For a given state St with the episode terminating at
time T (i.e., ST ∈ T ), the weights for the λ-Return target are as follows:
$$G_t^{(\lambda)} = (1 - \lambda) \cdot \sum_{n=1}^{T-t-1} \lambda^{n-1} \cdot G_{t,n} + \lambda^{T-t-1} \cdot G_t \quad (9.15)$$
We note that for λ = 0, the λ-Return target reduces to the TD (1-step bootstrapping)
target and for λ = 1, the λ-Return target reduces to the MC target Gt . The λ parameter
gives us a smooth way of tuning from TD (λ = 0) to MC (λ = 1).
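The following small sketch (our own illustration, not the book's implementation) computes the λ-Return target of Equation (9.15) for a terminating trace, given the remaining rewards and the current Value Function estimates of the subsequent states; it reduces to the TD target for λ = 0 and to $G_t$ for λ = 1:

from typing import Sequence

def lambda_return(
    rewards: Sequence[float],      # R_{t+1}, ..., R_T (remaining rewards of the trace)
    next_values: Sequence[float],  # V(S_{t+1}), ..., V(S_T) estimates (last one unused: S_T is terminal)
    gamma: float,
    lambd: float
) -> float:
    steps = len(rewards)           # T - t
    # n-step bootstrapped returns G_{t,n} for n = 1, ..., T - t
    discounted_reward_sum = 0.0
    n_step_returns = []
    for n in range(1, steps + 1):
        discounted_reward_sum += gamma ** (n - 1) * rewards[n - 1]
        bootstrap = gamma ** n * (next_values[n - 1] if n < steps else 0.0)
        n_step_returns.append(discounted_reward_sum + bootstrap)
    # Equation (9.15): (1 - lambda) * sum_{n=1}^{T-t-1} lambda^(n-1) * G_{t,n} + lambda^(T-t-1) * G_t
    target = sum((1 - lambd) * lambd ** (n - 1) * n_step_returns[n - 1]
                 for n in range(1, steps))
    return target + lambd ** (steps - 1) * n_step_returns[-1]

print(lambda_return(rewards=[1.0, 2.0, 3.0], next_values=[0.5, 0.8, 0.0], gamma=0.9, lambd=0.5))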
The (Offline) λ-Return Prediction algorithm updates the parameters $w$ for each state $S_t$ in an episode as:

$$\Delta w = \alpha \cdot \left(G_t^{(\lambda)} - V(S_t; w)\right) \cdot \nabla_w V(S_t; w) \quad (9.16)$$

Note that for λ > 0, Equation (9.16) tells us that the parameters w of the function
approximation can be updated only at the end of an episode (the term episode refers to
a terminating trace experience). Updating w according to Equation (9.16) for all states
St , t = 0, . . . , T − 1, at the end of each episode gives us the Offline λ-Return Prediction algo-
rithm. The term Offline refers to the fact that we have to wait till the end of an episode
to make an update to the parameters w of the function approximation (rather than mak-
ing parameter updates after each time step in the episode, which we refer to as an Online
algorithm). Online algorithms are appealing because the Value Function update for an
atomic experience could be utilized immediately by the updates for the next few atomic
experiences, and so it facilitates continuous/fast learning. So the natural question to ask
here is if we can turn the Offline λ-return Prediction algorithm outlined above to an Online
version. An online version is indeed possible (it’s known as the TD(λ) Prediction algo-
rithm) and is the topic of the remaining subsections of this section. But before we begin
the coverage of the (Online) TD(λ) Prediction algorithm, let’s wrap up this subsection
with an implementation of this Offline version (i.e., the λ-Return Prediction algorithm).
from typing import Iterable, Iterator, List, Sequence

import numpy as np

import rl.markov_process as mp
from rl.markov_process import NonTerminal
from rl.approximate_dynamic_programming import ValueFunctionApprox, extended_vf
def lambda_return_prediction(
traces: Iterable[Iterable[mp.TransitionStep[S]]],
approx_0: ValueFunctionApprox[S],
gamma: float,
lambd: float
) -> Iterator[ValueFunctionApprox[S]]:
func_approx: ValueFunctionApprox[S] = approx_0
yield func_approx
for trace in traces:
gp: List[float] = [1.]
lp: List[float] = [1.]
predictors: List[NonTerminal[S]] = []
partials: List[List[float]] = []
weights: List[List[float]] = []
trace_seq: Sequence[mp.TransitionStep[S]] = list(trace)
for t, tr in enumerate(trace_seq):
for i, partial in enumerate(partials):
partial.append(
partial[-1] +
gp[t - i] * (tr.reward - func_approx(tr.state)) +
(gp[t - i] * gamma * extended_vf(func_approx, tr.next_state)
if t < len(trace_seq) - 1 else 0.)
)
weights[i].append(
weights[i][-1] * lambd if t < len(trace_seq)
else lp[t - i]
340
)
predictors.append(tr.state)
partials.append([tr.reward +
(gamma * extended_vf(func_approx, tr.next_state)
if t < len(trace_seq) - 1 else 0.)])
weights.append([1. - (lambd if t < len(trace_seq) else 0.)])
gp.append(gp[-1] * gamma)
lp.append(lp[-1] * lambd)
responses: Sequence[float] = [np.dot(p, w) for p, w in
zip(partials, weights)]
for p, r in zip(predictors, responses):
func_approx = func_approx.update([(p, r)])
yield func_approx
$$M(t) = \begin{cases} I_{t=t_1} & \text{if } t \leq t_1 \\ M(t_i) \cdot \theta^{t-t_i} + I_{t=t_{i+1}} & \text{if } t_i < t \leq t_{i+1} \text{ for any } 1 \leq i < n \\ M(t_n) \cdot \theta^{t-t_n} & \text{otherwise (i.e., if } t > t_n) \end{cases} \tag{9.17}$$
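Since plot_memory_function itself is not shown here, the following hypothetical helper sketches how M(t) of Equation (9.17) can be evaluated (assuming event_times is sorted in increasing order and non-empty):

from typing import Sequence

def memory_function(t: float, event_times: Sequence[float],
                     theta: float) -> float:
    # Evaluate M(t): each event adds 1 to the memory, and the memory
    # decays by a factor of theta per unit of elapsed time.
    m: float = 0.
    last_event: float = event_times[0]
    for ti in event_times:
        if ti > t:
            break
        m = m * theta ** (ti - last_event) + 1.
        last_event = ti
    return m * theta ** (t - last_event)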
Figure 9.6.: Memory Function (Frequency and Recency)
Let’s run this for θ = 0.8 and an arbitrary sequence of event times:
theta = 0.8
event_times = [2.0, 3.0, 4.0, 7.0, 9.0, 14.0, 15.0, 21.0]
plot_memory_function(theta, event_times)
of the discount factor and the TD-λ parameter) and the event timings are the time steps at
which the state s occurs in a trace experience. Thus, we define Eligibility Traces for a given
trace experience at any time step t (of the trace experience) as a function Et : N → R≥0 as
follows:
$$E_0(s) = I_{S_0=s}, \qquad E_t(s) = \gamma \cdot \lambda \cdot E_{t-1}(s) + I_{S_t=s} \text{ for all } t > 0, \text{ for all } s \in \mathcal{N}$$
Note the similarities and differences relative to the TD update we have seen earlier.
Firstly, this is an online algorithm since we make an update at each time step in a trace
experience. Secondly, we update the Value Function for all states at each time step (un-
like TD Prediction which updates the Value Function only for the particular state that is
visited at that time step). Thirdly, the change in the Value Function for each state s ∈ N
is proportional to the TD-Error δt = Rt+1 + γ · V (St+1 ) − V (St ), much like in the case of
the TD update. However, here the TD-Error is multiplied by the eligibility trace Et (s) for
each state s at each time step t. So, we can compactly write the update as:
$$V(s) \leftarrow V(s) + \alpha \cdot \delta_t \cdot E_t(s), \text{ for all } s \in \mathcal{N}$$
Theorem 9.6.1.
$$\sum_{t=0}^{T-1} \alpha \cdot \delta_t \cdot E_t(s) = \sum_{t=0}^{T-1} \alpha \cdot (G_t^{(\lambda)} - V(S_t)) \cdot I_{S_t=s}, \text{ for all } s \in \mathcal{N}$$
Proof. We begin the proof with the following important identity:
$$\begin{aligned}
G_t^{(\lambda)} - V(S_t) = -V(S_t) &+ (1-\lambda) \cdot \lambda^0 \cdot (R_{t+1} + \gamma \cdot V(S_{t+1})) \\
&+ (1-\lambda) \cdot \lambda^1 \cdot (R_{t+1} + \gamma \cdot R_{t+2} + \gamma^2 \cdot V(S_{t+2})) \\
&+ (1-\lambda) \cdot \lambda^2 \cdot (R_{t+1} + \gamma \cdot R_{t+2} + \gamma^2 \cdot R_{t+3} + \gamma^3 \cdot V(S_{t+3})) \\
&+ \ldots \\
= -V(S_t) &+ (\gamma\lambda)^0 \cdot (R_{t+1} + \gamma \cdot V(S_{t+1}) - \gamma\lambda \cdot V(S_{t+1})) \\
&+ (\gamma\lambda)^1 \cdot (R_{t+2} + \gamma \cdot V(S_{t+2}) - \gamma\lambda \cdot V(S_{t+2})) \\
&+ (\gamma\lambda)^2 \cdot (R_{t+3} + \gamma \cdot V(S_{t+3}) - \gamma\lambda \cdot V(S_{t+3})) \\
&+ \ldots \\
= \ &(\gamma\lambda)^0 \cdot (R_{t+1} + \gamma \cdot V(S_{t+1}) - V(S_t)) \\
&+ (\gamma\lambda)^1 \cdot (R_{t+2} + \gamma \cdot V(S_{t+2}) - V(S_{t+1})) \\
&+ (\gamma\lambda)^2 \cdot (R_{t+3} + \gamma \cdot V(S_{t+3}) - V(S_{t+2})) \\
&+ \ldots \\
= \ &\delta_t + \gamma\lambda \cdot \delta_{t+1} + (\gamma\lambda)^2 \cdot \delta_{t+2} + \ldots
\end{aligned} \tag{9.19}$$
Now assume that a specific non-terminal state s appears at time steps $t_1, t_2, \ldots, t_n$. Then,
$$\begin{aligned}
\alpha \cdot \sum_{t=0}^{T-1} (G_t^{(\lambda)} - V(S_t)) \cdot I_{S_t=s} &= \alpha \cdot \sum_{i=1}^{n} (G_{t_i}^{(\lambda)} - V(S_{t_i})) \\
&= \alpha \cdot \sum_{i=1}^{n} (\delta_{t_i} + \gamma\lambda \cdot \delta_{t_i+1} + (\gamma\lambda)^2 \cdot \delta_{t_i+2} + \ldots) \\
&= \sum_{t=0}^{T-1} \alpha \cdot \delta_t \cdot E_t(s)
\end{aligned}$$
If we set λ = 0 in this Tabular TD(λ) Prediction algorithm, we note that Et (s) reduces
to ISt =s and so, the Tabular TD(λ) prediction algorithm’s update for λ = 0 at each time
step t reduces to:
V (St ) ← V (St ) + α · δt
which is exactly the update of the Tabular TD Prediction algorithm. Therefore, TD algorithms
are often referred to as TD(0).
If we set λ = 1 in this Tabular TD(λ) Prediction algorithm with episodic traces (i.e., all
trace experiences terminating), Theorem 9.6.1 tells us that the sum of all changes in the
Value Function for any specific state s ∈ N over the course of the entire trace experience
(= $\sum_{t=0}^{T-1} \alpha \cdot \delta_t \cdot E_t(s)$) is equal to the change in the Value Function for s in the Every-Visit
MC Prediction algorithm as a result of its offline update for state s (= $\sum_{t=0}^{T-1} \alpha \cdot (G_t - V(S_t)) \cdot I_{S_t=s}$).
Hence, TD(1) is considered to be “equivalent” to Every-Visit MC.
To clarify, TD(λ) Prediction is an online algorithm and hence, not exactly equivalent to
the offline λ-Return Prediction algorithm. However, if we modified the TD(λ) Prediction
algorithm to be offline, then they are equivalent. The offline version of TD(λ) Prediction
would not make the updates to the Value Function at each time step - rather, it would ac-
cumulate the changes to the Value Function (as prescribed by the TD(λ) update formula)
in a buffer, and then at the end of the trace experience, it would update the Value Function
with the contents of the buffer.
However, as explained earlier, online updates are desirable because the changes to the
Value Function at each time step can be immediately used by the next time steps' updates,
and so it promotes rapid learning without having to wait for a trace experience to end.
Moreover, online algorithms can be used in situations where we don’t have a complete
episode.
With an understanding of Tabular TD(λ) Prediction in place, we can generalize TD(λ)
Prediction to the case of function approximation in a straightforward manner. In the case
of function approximation, the data type of eligibility traces will be the same data type
as that of the parameters w in the function approximation (so here we denote eligibility
traces at time t of a trace experience as simply Et rather than as a function of states as we
had done for the Tabular case above). We initialize E0 at the start of each trace experience
to ∇w V (S0 ; w). Then, for each time step t > 0, Et is calculated recursively in terms of the
previous time step’s value Et−1 , which is then used to update the parameters of the Value
Function approximation, as follows:
Et = γλ · Et−1 + ∇w V (St ; w)
∆w = α · δt · Et
where δt now denotes the TD Error based on the function approximation for the Value
Function.
The idea of Eligibility Traces has its origins in a seminal book by Harry Klopf (Klopf
and Data Sciences Laboratory 1972) that greatly influenced Richard Sutton and Andrew
Barto to pursue the idea of Eligibility Traces further, after which they published several
papers on Eligibility Traces, much of whose content is covered in their RL book (Richard
S. Sutton and Barto 2018).
response variable yt to be the TD target. Then we need to update the eligibility traces el_tr
and update the function approximation func_approx using the updated el_tr.
Thankfully, the __mul__ method of Gradient class enables us to conveniently multiply
el_tr with γ · λ and then, it also enables us to multiply the updated el_tr with the predic-
tion error EM [y|xt ]−yt = V (St ; w)−(Rt+1 +γ ·V (St+1 ; w)) (in the code as func_approx(x)
- y), which is then used (as a Gradient type) to update the internal parameters of the
func_approx. The __add__ method of Gradient enables us to add ∇w V (St ; w) (as a Gradient
type) to el_tr * gamma * lambd. The only seemingly difficult part is calculating ∇w V (St ; w).
The FunctionApprox interface provides us with a method objective_gradient to calculate
the gradient of any specified objective (call it Obj(x, y)). But here we have to calculate
the gradient of the prediction of the function approximation. Thankfully, the interface of
objective_gradient is fairly generic and we actually have a choice of constructing Obj(x, y)
to be whatever function we want (not necessarily a minimizing Objective Function). We
specify Obj(x, y) in terms of the obj_deriv_out_func argument, which, as a reminder, represents
$\frac{\partial Obj(x,y)}{\partial Out(x)}$. Note that we have assumed a Gaussian distribution for the returns conditioned
on the state. So we can set Out(x) to be the function approximation's prediction
V(S_t; w) and we can set Obj(x, y) = Out(x), meaning obj_deriv_out_func ($\frac{\partial Obj(x,y)}{\partial Out(x)}$) is a
function returning the constant value of 1 (as seen in the code below).
from typing import Iterable, Iterator, TypeVar
import numpy as np
import rl.markov_process as mp
from rl.markov_process import NonTerminal
from rl.function_approx import Gradient
from rl.approximate_dynamic_programming import ValueFunctionApprox, extended_vf

S = TypeVar('S')


def td_lambda_prediction(
    traces: Iterable[Iterable[mp.TransitionStep[S]]],
    approx_0: ValueFunctionApprox[S],
    gamma: float,
    lambd: float
) -> Iterator[ValueFunctionApprox[S]]:
    func_approx: ValueFunctionApprox[S] = approx_0
    yield func_approx

    for trace in traces:
        el_tr: Gradient[ValueFunctionApprox[S]] = Gradient(func_approx).zero()
        for step in trace:
            x: NonTerminal[S] = step.state
            y: float = step.reward + gamma * \
                extended_vf(func_approx, step.next_state)
            el_tr = el_tr * (gamma * lambd) + func_approx.objective_gradient(
                xy_vals_seq=[(x, y)],
                obj_deriv_out_fun=lambda x1, y1: np.ones(len(x1))
            )
            func_approx = func_approx.update_with_gradient(
                el_tr * (func_approx(x) - y)
            )
            yield func_approx
import rl.iterate as iterate
import rl.td_lambda as td_lambda
import itertools
from pprint import pprint
from rl.chapter10.prediction_utils import fmrp_episodes_stream
from rl.function_approx import learning_rate_schedule

gamma: float = 0.9
episode_length: int = 100
initial_learning_rate: float = 0.03
half_life: float = 1000.0
exponent: float = 0.5
lambda_param = 0.3

episodes: Iterable[Iterable[TransitionStep[S]]] = \
    fmrp_episodes_stream(si_mrp)
curtailed_episodes: Iterable[Iterable[TransitionStep[S]]] = \
    (itertools.islice(episode, episode_length) for episode in episodes)
learning_rate_func: Callable[[int], float] = learning_rate_schedule(
    initial_learning_rate=initial_learning_rate,
    half_life=half_life,
    exponent=exponent
)
td_lambda_vfs: Iterator[ValueFunctionApprox[S]] = td_lambda.td_lambda_prediction(
    traces=curtailed_episodes,
    approx_0=Tabular(count_to_weight_func=learning_rate_func),
    gamma=gamma,
    lambd=lambda_param
)
num_episodes = 60000
final_td_lambda_vf: ValueFunctionApprox[S] = \
    iterate.last(itertools.islice(td_lambda_vfs, episode_length * num_episodes))
pprint({s: round(final_td_lambda_vf(s), 3) for s in si_mrp.non_terminal_states})
Thus, we see that our implementation of TD(λ) Prediction with the above settings fetches
us an estimated Value Function fairly close to the true Value Function. As ever, we encour-
age you to play with various settings for TD(λ) Prediction to develop an intuition for how
the results change as you change the settings, and particularly as you change the λ param-
eter. You can play with the code in the file rl/chapter10/simple_inventory_mrp.py.
• “Equivalence” of λ-Return Prediction and TD(λ) Prediction, hence TD is equivalent
to TD(0) and MC is “equivalent” to TD(1).
10. Monte-Carlo and Temporal-Difference
for Control
In chapter 9, we covered MC and TD algorithms to solve the Prediction problem. In this
chapter, we cover MC and TD algorithms to solve the Control problem. As a reminder,
MC and TD algorithms are Reinforcement Learning algorithms that only have access to
an individual experience (at a time) of next state and reward when the AI agent performs
an action in a given state. The individual experience could be the result of an interaction
with an actual environment or could be served by a simulated environment (as explained
at the start of Chapter 9). It also pays to remind that RL algorithms overcome the Curse
of Dimensionality and the Curse of Modeling by incrementally updating (learning) an
appropriate function approximation of the Value Function from a stream of individual
experiences. Hence, large-scale Control problems that are typically seen in the real-world
are often tackled by RL.
Figure 10.1.: Progression Lines of Value Function and Policy in Generalized Policy Itera-
tion (Image Credit: Coursera Course on Fundamentals of RL)
for all states in usual Policy Iteration) and the Policy Improvement step is also done for
just a single state. So essentially these RL Control algorithms are an alternating sequence
of single-state policy evaluation and single-state policy improvement (where the single-
state is the state produced by sampling or the state that is encountered in a real-world
environment interaction). Similar to the case of Prediction, we first cover Monte-Carlo
(MC) Control and then move on to Temporal-Difference (TD) Control.
However, we note that Equation 10.1 can be written more succinctly as:
$$\pi_D'(s) = \arg\max_{a \in \mathcal{A}} Q^{\pi}(s, a) \text{ for all } s \in \mathcal{N} \tag{10.2}$$
This view of Greedy Policy Improvement is valuable because instead of doing Policy
Evaluation for calculating V π (MC Prediction), we can instead do Policy Evaluation to cal-
culate Qπ (with MC Prediction for the Q-Value Function). With this modification to Policy
Evaluation, we can keep alternating between Policy Evaluation and Policy Improvement
until convergence to obtain the Optimal Value Function and Optimal Policy. Indeed, this
is a valid MC Control algorithm. However, this algorithm is not practical as each Policy
Evaluation (MC Prediction) typically takes very long to converge (as we have noted in
Chapter 9) and the number of iterations of Evaluation and Improvement until GPI con-
vergence will also be large. More importantly, this algorithm simply modifies the Policy
Iteration DP/ADP algorithm by replacing DP/ADP Policy Evaluation with MC Q-Value
Policy Evaluation - hence, we simply end up with a slower version of the Policy Iteration
DP/ADP algorithm. Instead, we seek an MC Control Algorithm that switches from Policy
Evaluation to Policy Improvement without requiring Policy Evaluation to converge (this
is essentially the GPI idea).
So the natural GPI idea here would be to do the usual MC Prediction updates (of the
Q-Value estimate) at the end of an episode, then improve the policy at the end of that
episode, then perform MC Prediction updates (with the improved policy) at the end of the
next episode, and so on. Let's see what this algorithm looks like. Equation 10.2 tells us
that all we need to perform the requisite greedy action (from the improved policy) at any
time step in any episode is an estimate of the Q-Value Function. For ease of understanding,
for now, let us just restrict ourselves to the case of Tabular Every-Visit MC Control with
equal weights for each of the Return data points obtained for any (state, action) pair. In
this case, we can simply perform the following two updates at the end of each episode for
each (St , At ) pair encountered in the episode (note that at each time step t, At is based on
the greedy policy derived from the current estimate of the Q-Value function):
$$Count(S_t, A_t) \leftarrow Count(S_t, A_t) + 1$$
$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \frac{1}{Count(S_t, A_t)} \cdot (G_t - Q(S_t, A_t)) \tag{10.3}$$
It’s important to note that Count(St , At ) is accumulated over the set of all episodes seen
thus far. Note that the estimate Q(St , At ) is not an estimate of the Q-Value Function for a
single policy - rather, it keeps updating as we encounter new greedy policies across the
set of episodes.
So is this now our first Tabular RL Control algorithm? Not quite - there is yet another
problem. This problem is more subtle and we illustrate the problem with a simple ex-
ample. Let’s consider a specific state (call it s) and assume that there are only two al-
lowable actions a1 and a2 for state s. Let’s say the true Q-Value Function for state s is:
Qtrue (s, a1 ) = 2, Qtrue (s, a2 ) = 5. Let’s say we initialize the Q-Value Function estimate as:
Q(s, a1 ) = Q(s, a2 ) = 0. When we encounter state s for the first time, the action to be taken
is arbitrary between a1 and a2 since they both have the same Q-Value estimate (meaning
both a1 and a2 yield the same max value for Q(s, a) among the two choices for a). Let’s
say we arbitrarily pick a1 as the action choice and let’s say for this first encounter of state s
(with the arbitrarily picked action a1 ), the return obtained is 3. So Q(s, a1 ) updates to the
value 3. So when the state s is encountered for the second time, we see that Q(s, a1 ) = 3
and Q(s, a2 ) = 0 and so, action a1 will be taken according to the greedy policy implied
by the estimate of the Q-Value Function. Let’s say we now obtain a return of -1, updating
Q(s, a1) to (3 − 1)/2 = 1. When s is encountered for the third time, yet again action a1 will
be taken according to the greedy policy implied by the estimate of the Q-Value Function.
Let's say we now obtain a return of 2, updating Q(s, a1) to (3 − 1 + 2)/3 = 4/3. We see that as long
as the returns associated with a1 are not negative enough to make the estimate Q(s, a1 )
negative, a2 is “locked out” by a1 because the first few occurrences of a1 happen to yield
an average return greater than the initialization of Q(s, a2 ). Even if a2 was chosen, it is
possible that the first few occurrences of a2 yield an average return smaller than the av-
erage return obtained on the first few occurrences of a1 , in which case a2 could still get
locked-out prematurely.
This problem goes beyond MC Control and applies to the broader problem of RL Con-
trol - updates can get biased by initial random occurrences of returns (or return estimates),
which in turn could prevent certain actions from being sufficiently chosen (thus, disal-
lowing accurate estimates of the Q-Values for those actions). While we do want to ex-
ploit actions that seem to be fetching higher returns, we also want to adequately explore all
possible actions so we can obtain an accurate-enough estimate of their Q-Values. This is
essentially the Explore-Exploit dilemma of the famous Multi-Armed Bandit Problem. In
Chapter 13, we will cover the Multi-Armed Bandit problem in detail, along with a variety
of techniques to solve the Multi-Armed Bandit problem (which are essentially creative
ways of resolving the Explore-Exploit dilemma). We will see in Chapter 13 that a sim-
ple way of resolving the Explore-Exploit dilemma is with a method known as ϵ-greedy,
which essentially means we must be greedy (“exploit”) a certain (1 − ϵ) fraction of the
time and for the remaining (ϵ) fraction of the time, we explore all possible actions. The
term “certain fraction of the time” refers to probabilities of choosing actions, which means
an ϵ-greedy policy (generated from a Q-Value Function estimate) will be a stochastic pol-
icy. For the sake of simplicity, in this book, we will employ the ϵ-greedy method to resolve
the Explore-Exploit dilemma in all RL Control algorithms involving the Explore-Exploit
dilemma (although you must understand that we can replace the ϵ-greedy method by the
other methods we shall cover in Chapter 13 in any of the RL Control algorithms where we
run into the Explore-Exploit dilemma). So we need to tweak the Tabular MC Control al-
gorithm described above to perform Policy Improvement with the ϵ-greedy method. The
formal definition of the ϵ-greedy stochastic policy π ′ (obtained from the current estimate
of the Q-Value Function) for a Finite MDP (since we are focused on Tabular RL Control)
is as follows:
$$\text{Improved Stochastic Policy } \pi'(s, a) = \begin{cases} \frac{\epsilon}{|\mathcal{A}|} + 1 - \epsilon & \text{if } a = \arg\max_{b \in \mathcal{A}} Q(s, b) \\ \frac{\epsilon}{|\mathcal{A}|} & \text{otherwise} \end{cases}$$
where A denotes the set of allowable actions and ϵ ∈ [0, 1] is the specification of the
degree of exploration.
This says that with probability 1 − ϵ, we select the action that maximizes the Q-Value
Function estimate for a given state, and with probability ϵ, we uniform-randomly select
each of the allowable actions (including the maximizing action). Hence, the maximizing
action is chosen with probability ϵ/|A| + 1 − ϵ. Note that if ϵ is zero, π′ reduces to the
deterministic greedy policy π′_D that we had defined earlier. So the greedy policy can be
viewed as a special case of the ϵ-greedy policy with ϵ set to 0.
Theorem 10.2.1. For a Finite MDP, if π is a policy such that for all s ∈ N, π(s, a) ≥ ϵ/|A| for all
a ∈ A, then the ϵ-greedy policy π′ obtained from Q^π is an improvement over π, i.e., V^{π′}(s) ≥
V^π(s) for all s ∈ N.
Proof. We’ve previously learnt that for any policy π ′ , if we apply the Bellman Policy Oper-
′ ′
ator B π repeatedly (starting with V π ), we converge to V π . In other words,
′ ′
lim (B π )i (V π ) = V π
i→∞
′ ′ ′
B π (V π )(s) = (Rπ + γ · P π · V π )(s)
′
X ′
= Rπ (s) + γ · P π (s, s′ ) · V π (s′ )
s′ ∈N
X X
= π ′ (s, a) · (R(s, a) + γ · P(s, a, s′ ) · V π (s′ ))
a∈A s′ ∈N
X
′
= π (s, a) · Q (s, a)
π
a∈A
X ϵ
= · Qπ (s, a) + (1 − ϵ) · max Qπ (s, a)
|A| a∈A
a∈A
X ϵ X π(s, a) − ϵ
|A|
≥ · Qπ (s, a) + (1 − ϵ) · · Qπ (s, a)
|A| 1−ϵ
a∈A a∈A
X
= π(s, a) · Qπ (s, a)
a∈A
= V (s) for all s ∈ N
π
PThe line with the inequality above is due to the fact that for any fixed s ∈ N , maxa∈A Qπ (s, a) ≥
a∈A wa · Q (s, a) (maximum Q-Value greater than or equal to a weighted average of all
π
ϵ
π(s,a)− |A| P
Q-Values, for a given state) with the weights wa = 1−ϵ such that a∈A wa = 1 and
0 ≤ wa ≤ 1 for all a ∈ A.
This completes the base case of the proof by induction.
The induction step is easy and is proved as a consequence of the monotonicity property
of the B^π operator (for any π), which is defined as follows:
$$X(s) \leq Y(s) \text{ for all } s \implies B^{\pi}(X)(s) \leq B^{\pi}(Y)(s) \text{ for all } s$$
Note that we proved the monotonicity property of the B^π operator in the chapter on Dynamic
Programming. A straightforward application of this monotonicity property provides the induction
step of the proof:
$$(B^{\pi'})^{i+1}(V^{\pi}) \geq (B^{\pi'})^{i}(V^{\pi}) \implies (B^{\pi'})^{i+2}(V^{\pi}) \geq (B^{\pi'})^{i+1}(V^{\pi}) \text{ for all } i = 0, 1, 2, \ldots$$
We note that for any ϵ-greedy policy π, we do ensure the condition that for all s ∈ N,
π(s, a) ≥ ϵ/|A| for all a ∈ A. So we just need to ensure that this condition holds true for
the initial choice of π (in the GPI with MC algorithm). An easy way to ensure this is to
choose the initial π to be a uniform choice over actions (for each state), i.e., for all s ∈ N,
π(s, a) = 1/|A| for all a ∈ A.
• Do Policy Evaluation with the Q-Value Function with Q-Value updates at the end of
each episode.
• Do Policy Improvement with an ϵ-greedy Policy (readily obtained from the Q-Value
Function estimate at any time step for any episode).
So now we are ready to develop the details of the Monte-Carlo Control algorithm that we've
been seeking. For ease of understanding, we first cover the Tabular version and then
we will implement the generalized version with function approximation. Note that an
ϵ-greedy policy enables adequate exploration of actions, but we will also need to do ade-
quate exploration of states in order to achieve a suitable estimate of the Q-Value Function.
Moreover, as our Control algorithm proceeds and the Q-Value Function estimate gets bet-
ter and better, we reduce the amount of exploration and eventually (as the number of
episodes tends to infinity), we want ϵ (the degree of exploration) to tend to zero. In fact,
this behavior has a catchy acronym associated with it, which we define below:
Definition 10.3.1. We refer to Greedy In The Limit with Infinite Exploration (abbreviated as
GLIE) as the behavior that has the following two properties:
1. All state-action pairs are explored infinitely many times, i.e., for all s ∈ N , for all
a ∈ A, and Countk (s, a) denoting the number of occurrences of (s, a) pairs after k
episodes:
lim Countk (s, a) = ∞
k→∞
2. The policy converges to a greedy policy, i.e., for all s ∈ N , for all a ∈ A, and πk (s, a)
denoting the ϵ-greedy policy obtained from the Q-Value Function estimate after k
episodes:
lim πk (s, a) = Ia=arg maxb∈A Q(s,b)
k→∞
A simple way by which our method of using the ϵ-greedy policy (for policy improve-
ment) can be made GLIE is by reducing ϵ as a function of number of episodes k as follows:
$$\epsilon_k = \frac{1}{k}$$
So now we are ready to describe the Tabular MC Control algorithm we’ve been seeking.
We ensure that this algorithm has GLIE behavior and so, we refer to it as GLIE Tabular
Monte-Carlo Control. The following is the outline of the procedure for each episode (ter-
minating trace experience) in the algorithm:
• Generate the trace experience (episode) with actions sampled from the ϵ-greedy pol-
icy π obtained from the estimate of the Q-Value Function that is available at the start
of the trace experience. Also, sample the first state of the trace experience from a
uniform distribution of states in N . This ensures infinite exploration of both states
and actions. Let’s denote the contents of this trace experience as:
S0 , A 0 , R 1 , S 1 , A 1 , . . . , R T , S T
and define the trace experience return Gt associated with (St , At ) as:
$$G_t = \sum_{i=t+1}^{T} \gamma^{i-t-1} \cdot R_i = R_{t+1} + \gamma \cdot R_{t+2} + \gamma^2 \cdot R_{t+3} + \ldots + \gamma^{T-t-1} \cdot R_T$$
• For each state St and action At in the trace experience, perform the following updates
at the end of the trace experience:
$$Count(S_t, A_t) \leftarrow Count(S_t, A_t) + 1$$
$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \frac{1}{Count(S_t, A_t)} \cdot (G_t - Q(S_t, A_t))$$
• Let’s say this trace experience is the k-th trace experience in the sequence of trace
experiences. Then, at the end of the trace experience, set:
$$\epsilon \leftarrow \frac{1}{k}$$
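To make the above outline concrete, here is a small self-contained sketch of Tabular GLIE Monte-Carlo Control (written from scratch for illustration, not the book's rl library implementation). The helpers simulate_episode, start_state_sampler and actions are hypothetical: simulate_episode(policy, start_state) is assumed to return the list of (state, action, reward) triples of one terminating episode, with each reward being the one received upon taking that action.

import random
from collections import defaultdict
from typing import Callable, Dict, List, Tuple, TypeVar

S = TypeVar('S')
A = TypeVar('A')


def glie_tabular_mc_control(
    simulate_episode: Callable[[Callable[[S], A], S], List[Tuple[S, A, float]]],
    start_state_sampler: Callable[[], S],  # e.g., uniform over non-terminal states
    actions: Callable[[S], List[A]],
    gamma: float,
    num_episodes: int
) -> Dict[Tuple[S, A], float]:
    q: Dict[Tuple[S, A], float] = defaultdict(float)
    counts: Dict[Tuple[S, A], int] = defaultdict(int)

    for k in range(1, num_episodes + 1):
        epsilon: float = 1.0 / k  # GLIE schedule

        def policy(s: S) -> A:
            # epsilon-greedy with respect to the current Q-Value estimate
            if random.random() < epsilon:
                return random.choice(actions(s))
            return max(actions(s), key=lambda a: q[(s, a)])

        episode = simulate_episode(policy, start_state_sampler())

        # compute returns backwards and do the every-visit updates
        g: float = 0.
        for s, a, r in reversed(episode):
            g = r + gamma * g
            counts[(s, a)] += 1
            q[(s, a)] += (g - q[(s, a)]) / counts[(s, a)]
    return dict(q)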
We state the following important theorem without proof.
Theorem 10.3.1. The above-described GLIE Tabular Monte-Carlo Control algorithm converges to
the Optimal Action-Value function: Q(s, a) → Q∗ (s, a) for all s ∈ N , for all a ∈ A. Hence, GLIE
Tabular Monte-Carlo Control converges to an Optimal (Deterministic) Policy π ∗ .
The extension from Tabular to Function Approximation of the Q-Value Function is straight-
forward. The update (change) in the parameters w of the Q-Value Function Approxima-
tion Q(s, a; w) is as follows:
$$\Delta w = \alpha \cdot (G_t - Q(S_t, A_t; w)) \cdot \nabla_w Q(S_t, A_t; w) \tag{10.4}$$
• states: NTStateDistribution[S] - This represents an arbitrary distribution of the
non-terminal states, which in turn allows us to sample the starting state (from this
distribution) for each trace experience.
• approx_0: QValueFunctionApprox[S, A] - This represents the initial function approx-
imation of the Q-Value function (that is meant to be updated, in an immutable man-
ner, through the course of the algorithm).
• gamma: float - This represents the discount factor to be used in estimating the Q-
Value Function.
• epsilon_as_func_of_episodes: Callable[[int], float] - This represents the extent
of exploration (ϵ) as a function of the number of trace experiences done so far (allowing
us to generalize from our default choice of ϵ(k) = 1/k).
• episode_length_tolerance: float - This represents the tolerance that determines
the trace experience length T (the minimum T such that γ T < tolerance).
def greedy_policy_from_qvf(
    q: QValueFunctionApprox[S, A],
    actions: Callable[[NonTerminal[S]], Iterable[A]]
) -> DeterministicPolicy[S, A]:
    def optimal_action(s: S) -> A:
        _, a = q.argmax((NonTerminal(s), a) for a in actions(NonTerminal(s)))
        return a
    return DeterministicPolicy(optimal_action)


def epsilon_greedy_policy(
    q: QValueFunctionApprox[S, A],
    mdp: MarkovDecisionProcess[S, A],
    epsilon: float = 0.0
) -> Policy[S, A]:
    def explore(s: S, mdp=mdp) -> Iterable[A]:
        return mdp.actions(NonTerminal(s))
    return RandomPolicy(Categorical(
        {UniformPolicy(explore): epsilon,
         greedy_policy_from_qvf(q, mdp.actions): 1 - epsilon}
    ))


@dataclass(frozen=True)
class RandomPolicy(Policy[S, A]):
    policy_choices: Distribution[Policy[S, A]]

    def act(self, state: NonTerminal[S]) -> Distribution[A]:
        policy: Policy[S, A] = self.policy_choices.sample()
        return policy.act(state)
Now let us test glie_mc_control on the simple inventory MDP we wrote in Chapter 2.
First let’s run Value Iteration so we can determine the true Optimal Value Function and
Optimal Policy
This prints:
True Optimal Value Function
{NonTerminal(state=InventoryState(on_hand=0, on_order=0)): -34.894855194671294,
NonTerminal(state=InventoryState(on_hand=0, on_order=1)): -27.66095964467877,
NonTerminal(state=InventoryState(on_hand=0, on_order=2)): -27.99189950444479,
NonTerminal(state=InventoryState(on_hand=1, on_order=0)): -28.66095964467877,
NonTerminal(state=InventoryState(on_hand=1, on_order=1)): -28.99189950444479,
NonTerminal(state=InventoryState(on_hand=2, on_order=0)): -29.991899504444792}
True Optimal Policy
For State InventoryState(on_hand=0, on_order=0): Do Action 1
For State InventoryState(on_hand=0, on_order=1): Do Action 1
For State InventoryState(on_hand=0, on_order=2): Do Action 0
For State InventoryState(on_hand=1, on_order=0): Do Action 1
For State InventoryState(on_hand=1, on_order=1): Do Action 0
For State InventoryState(on_hand=2, on_order=0): Do Action 0
Now let’s fetch the final estimate of the Optimal Q-Value Function after num_episodes
have run, and extract from it the estimate of the Optimal State-Value Function and the
Optimal Policy.
def get_vf_and_policy_from_qvf(
    mdp: FiniteMarkovDecisionProcess[S, A],
    qvf: QValueFunctionApprox[S, A]
) -> Tuple[V[S], FiniteDeterministicPolicy[S, A]]:
    opt_vf: V[S] = {
        s: max(qvf((s, a)) for a in mdp.actions(s))
        for s in mdp.non_terminal_states
    }
    opt_policy: FiniteDeterministicPolicy[S, A] = \
        FiniteDeterministicPolicy({
            s.state: qvf.argmax((s, a) for a in mdp.actions(s))[1]
            for s in mdp.non_terminal_states
        })
    return opt_vf, opt_policy


opt_vf, opt_policy = get_vf_and_policy_from_qvf(
    mdp=si_mdp,
    qvf=final_qvf
)
print(f"GLIE MC Optimal Value Function with {num_episodes:d} episodes")
pprint(opt_vf)
print(f"GLIE MC Optimal Policy with {num_episodes:d} episodes")
print(opt_policy)
This prints:
We see that this reasonably converges to the true Value Function (and reaches the true
Optimal Policy) as produced by Value Iteration.
The code above is in the file rl/chapter11/simple_inventory_mdp_cap.py. Also see the
helper functions in rl/chapter11/control_utils.py which you can use to run your own ex-
periments and tests for RL Control algorithms.
10.4. SARSA
Just like in the case of RL Prediction, the natural idea is to replace MC Control with TD
Control using the TD Target Rt+1 +γ ·Q(St+1 , At+1 ; w) as a biased estimate of Gt when up-
dating Q(St , At ; w). This means the parameters update in Equation (10.4) gets modified
to the following parameters update:
$$\Delta w = \alpha \cdot (R_{t+1} + \gamma \cdot Q(S_{t+1}, A_{t+1}; w) - Q(S_t, A_t; w)) \cdot \nabla_w Q(S_t, A_t; w) \tag{10.5}$$
Unlike MC Control where updates are made at the end of each trace experience (i.e.,
episode), a TD control algorithm can update at the end of each atomic experience. This
means the Q-Value Function Approximation is updated after each atomic experience (con-
tinuous learning), which in turn means that the ϵ-greedy policy will be (automatically) up-
dated at the end of each atomic experience. At each time step t in a trace experience, the
current ϵ-greedy policy is used to sample At from St and is also used to sample At+1 from
St+1 . Note that in MC Control, the same ϵ-greedy policy is used to sample all the actions
from their corresponding states in the trace experience, and so in MC Control, we were
able to generate the entire trace experience with the currently available ϵ-greedy policy.
However, here in TD Control, we need to generate a trace experience incrementally since
the action to be taken from a state depends on the just-updated ϵ-greedy policy (that is
derived from the just-updated Q-Value Function).
Just like in the case of RL Prediction, the disadvantage of the TD Target being a biased
estimate of the return is compensated by a reduction in the variance of the return esti-
mate. Also, TD Control offers a better speed of convergence (as we shall soon illustrate).
Most importantly, TD Control offers the ability to be used in situations where we have incomplete
trace experiences (which happens often in real-world situations where experiments get
curtailed/disrupted) and also in situations where we never reach a terminal
state (continuing trace).
Note that Equation (10.5) has the entities
• State St
• Action At
• Reward Rt+1
• State St+1
• Action At+1

which is what gives this algorithm the name SARSA.
Figure 10.2.: Visualization of SARSA Algorithm
• epsilon_as_func_of_episodes: Callable[[int], float] - This represents the extent
of exploration (ϵ) as a function of the number of episodes.
• max_episode_length: int - This represents the number of time steps at which we
would curtail a trace experience and start a new one. As we’ve explained, TD Con-
trol doesn’t require complete trace experiences, and so we can do as little or as large
a number of time steps in a trace experience (max_episode_length gives us that con-
trol).
• Given the current state and action, we obtain a sample of the pair of next_state and
reward (using the sample method of the Distribution obtained from mdp.step(state,
action)).
• Obtain the next_action from next_state using the function epsilon_greedy_action
which utilizes the ϵ-greedy policy derived from the current Q-Value Function esti-
mate (referenced by q).
• Update the Q-Value Function based on Equation (10.5) (using the update method
of q: QValueFunctionApprox[S, A]). Note that this is an immutable update since we
produce an Iterable (generator) of the Q-Value Function estimate after each time
step.
Before the code for glie_sarsa, let’s understand the code for epsilon_greedy_action
which returns an action sampled from the ϵ-greedy policy probability distribution that
is derived from the Q-Value Function estimate, given as input a non-terminal state, a Q-
Value Function estimate, the set of allowable actions, and ϵ.
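The book's implementation is not reproduced here, but a minimal sketch consistent with how epsilon_greedy_action is invoked in the code below might look as follows (treat the exact signature and the import locations of Categorical and QValueFunctionApprox as assumptions):

from operator import itemgetter
from typing import Set, TypeVar

from rl.distribution import Categorical
from rl.markov_process import NonTerminal
from rl.approximate_dynamic_programming import QValueFunctionApprox

S = TypeVar('S')
A = TypeVar('A')


def epsilon_greedy_action(
    q: QValueFunctionApprox[S, A],
    nt_state: NonTerminal[S],
    actions: Set[A],
    epsilon: float
) -> A:
    # the action maximizing the current Q-Value estimate for this state
    greedy_action: A = max(
        ((a, q((nt_state, a))) for a in actions),
        key=itemgetter(1)
    )[0]
    # epsilon-greedy: every action gets probability epsilon / |A|, and the
    # greedy action gets an additional probability of 1 - epsilon
    return Categorical(
        {a: epsilon / len(actions) +
         (1 - epsilon if a == greedy_action else 0.)
         for a in actions}
    ).sample()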
num_episodes += 1
epsilon: float = epsilon_as_func_of_episodes(num_episodes)
state: NonTerminal[S] = states.sample()
action: A = epsilon_greedy_action(
    q=q,
    nt_state=state,
    actions=set(mdp.actions(state)),
    epsilon=epsilon
)
steps: int = 0
while isinstance(state, NonTerminal) and steps < max_episode_length:
    next_state, reward = mdp.step(state, action).sample()
    if isinstance(next_state, NonTerminal):
        next_action: A = epsilon_greedy_action(
            q=q,
            nt_state=next_state,
            actions=set(mdp.actions(next_state)),
            epsilon=epsilon
        )
        q = q.update([(
            (state, action),
            reward + gamma * q((next_state, next_action))
        )])
        action = next_action
    else:
        q = q.update([((state, action), reward)])
    yield q
    steps += 1
    state = next_state
Now let’s fetch the final estimate of the Optimal Q-Value Function after num_episodes *
max_episode_length updates of the Q-Value Function, and extract from it the estimate of
the Optimal State-Value Function and the Optimal Policy (using the function get_vf_and_policy_from_qvf
that we had written earlier).
import itertools
import rl.iterate as iterate

num_updates = num_episodes * max_episode_length

final_qvf: QValueFunctionApprox[InventoryState, int] = \
    iterate.last(itertools.islice(qvfs, num_updates))
opt_vf, opt_policy = get_vf_and_policy_from_qvf(
    mdp=si_mdp,
    qvf=final_qvf
)
print(f"GLIE SARSA Optimal Value Function with {num_updates:d} updates")
pprint(opt_vf)
print(f"GLIE SARSA Optimal Policy with {num_updates:d} updates")
print(opt_policy)
This prints:
We see that this reasonably converges to the true Value Function (and reaches the true
Optimal Policy) as produced by Value Iteration (whose results were displayed when we
tested GLIE MC Control).
The code above is in the file rl/chapter11/simple_inventory_mdp_cap.py. Also see the
helper functions in rl/chapter11/control_utils.py which you can use to run your own ex-
periments and tests for RL Control algorithms.
For Tabular GLIE MC Control, we stated a theorem for theoretical guarantee of con-
vergence to the true Optimal Value Function (and hence, true Optimal Policy). Is there
something analogous for Tabular GLIE SARSA? The answer is in the affirmative, with the
added condition that we reduce the learning rate according to the Robbins-Monro schedule.
We state the following theorem without proof.
Theorem 10.4.1. Tabular SARSA converges to the Optimal Action-Value function, Q(s, a) →
Q∗ (s, a) (hence, converges to an Optimal Deterministic Policy π ∗ ), under the following conditions:
• Robbins-Monro schedule of step-sizes α_t:
$$\sum_{t=1}^{\infty} \alpha_t = \infty, \qquad \sum_{t=1}^{\infty} \alpha_t^2 < \infty$$
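As a concrete example (added here for illustration), the schedule $\alpha_t = \frac{\alpha_0}{t}$ satisfies both conditions, since the harmonic series $\sum_{t=1}^{\infty} \frac{1}{t}$ diverges while $\sum_{t=1}^{\infty} \frac{1}{t^2} = \frac{\pi^2}{6}$ converges; more generally, $\alpha_t = \frac{\alpha_0}{t^p}$ satisfies them for any $p \in (\frac{1}{2}, 1]$.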
Now let’s compare GLIE MC Control and GLIE SARSA. This comparison is analogous
to the comparison in Section 9.5.2 in Chapter 9 regarding their bias, variance and conver-
gence properties. GLIE SARSA carries a biased estimate of the Q-Value Function com-
pared to the unbiased estimate of GLIE MC Control. On the flip side, the TD Target
Rt+1 + γ · Q(St+1 , At+1 ; w) has much lower variance than Gt because Gt depends on many
random state transitions and random rewards (on the remainder of the trace experience)
whose variances accumulate, whereas the TD Target depends on only the next random
state transition St+1 and the next random reward Rt+1. The bad news with GLIE SARSA
(due to the bias in its update) is that with function approximation, it does not always
converge to the Optimal Value Function/Policy.
As mentioned in Chapter 9, because MC and TD have significant differences in their
usage of data, nature of updates, and frequency of updates, it is not even clear how to
create a level-playing field when comparing MC and TD for speed of convergence or for
efficiency in usage of limited experiences data. The typical comparisons between MC and
TD are done with constant learning rates, and it’s been determined that practically GLIE
SARSA learns faster than GLIE MC Control with constant learning rates. We illustrate
this by running GLIE MC Control and GLIE SARSA on SimpleInventoryMDPCap, and plot
the root-mean-squared-errors (RMSE) of the Q-Value Function estimates as a function of
batches of episodes (i.e., visualize how the RMSE of the Q-Value Function evolves as the
two algorithms progress). This is done by calling the function compare_mc_sarsa_ql which
is in the file rl/chapter11/control_utils.py.
Figure 10.3 depicts the convergence for our implementations of GLIE MC Control and
GLIE SARSA for a constant learning rate of α = 0.05. We produced this Figure by using
data from 500 episodes generated from the same SimpleInventoryMDPCap object we had
created earlier (with same discount factor γ = 0.9). We plotted the RMSE after each batch
of 10 episodes, hence both curves shown in the Figure have 50 RMSE data points plotted.
Firstly, we clearly see that MC Control has significantly more variance as evidenced by
the choppy MC Control RMSE progression curve. Secondly, we note that the MC Con-
trol RMSE curve progresses quite quickly in the first few episode batches but is slow to
converge after the first few episode batches (relative to the progression of SARSA). This
results in SARSA reaching fairly small RMSE quicker than MC Control. This behavior of
GLIE SARSA outperforming the comparable GLIE MC Control (with constant learning
rate) is typical in most MDP Control problems.
Lastly, it’s important to recognize that MC Control is not very sensitive to the initial
Value Function while SARSA is more sensitive to the initial Value Function. We encourage
you to play with the initial Value Function for this SimpleInventoryMDPCap example and
evaluate how it affects the convergence speeds.
More generally, we encourage you to play with the compare_mc_sarsa_ql function on
other MDP choices (ones we have created earlier in this book, or make up your own MDPs)
so you can develop good intuition for how GLIE MC Control and GLIE SARSA algorithms
converge for a variety of choices of learning rate schedules, initial Value Function choices,
choices of discount factor etc.
Figure 10.3.: GLIE MC Control and GLIE SARSA Convergence for SimpleInventoryMDP-
Cap
10.5. SARSA(λ)
Much like how we extended TD Prediction to TD(λ) Prediction, we can extend SARSA to
SARSA(λ), which gives us a way to tune the spectrum from MC Control to SARSA using
the λ parameter. Recall that in order to develop TD(λ) Prediction from TD Prediction,
we first developed the n-step TD Prediction Algorithm, then the Offline λ-Return TD Al-
gorithm, and finally the Online TD(λ) Algorithm. We develop an analogous progression
from SARSA to SARSA(λ).
So the first thing to do is to extend SARSA to n-step-bootstrapped SARSA, whose bootstrapped
target is as follows:
$$G_{t,n} = \sum_{i=t+1}^{t+n} \gamma^{i-t-1} \cdot R_i + \gamma^n \cdot Q(S_{t+n}, A_{t+n}; w) = R_{t+1} + \gamma \cdot R_{t+2} + \ldots + \gamma^{n-1} \cdot R_{t+n} + \gamma^n \cdot Q(S_{t+n}, A_{t+n}; w)$$
As in the case of Prediction, we can then consider a weighted combination of these targets:
$$\sum_{n=1}^{N} u_n \cdot G_{t,n} + u \cdot G_t \quad \text{where } u + \sum_{n=1}^{N} u_n = 1$$
Any of the un or u can be 0, as long as they all sum up to 1. The λ-Return target is a special
case of weights un and u, defined as follows:
Then, the Offline λ-Return SARSA Algorithm makes the following updates (performed
at the end of each trace experience) for each (St , At ) encountered in the trace experience:
$$\Delta w = \alpha \cdot (G_t^{(\lambda)} - Q(S_t, A_t; w)) \cdot \nabla_w Q(S_t, A_t; w)$$
Finally, we create the SARSA(λ) Algorithm, which is the online “version” of the above
λ-Return SARSA Algorithm. The calculations/updates at each time step t for each trace
experience are as follows:
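The per-time-step updates would presumably mirror those of TD(λ) Prediction, with the eligibility trace now driven by the gradient of the Q-Value Function Approximation (stated here as the natural analog, since the original display is not reproduced above):
$$E_t = \gamma\lambda \cdot E_{t-1} + \nabla_w Q(S_t, A_t; w)$$
$$\delta_t = R_{t+1} + \gamma \cdot Q(S_{t+1}, A_{t+1}; w) - Q(S_t, A_t; w)$$
$$\Delta w = \alpha \cdot \delta_t \cdot E_t$$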
optimal behavior subsequent to taking action A). However, in the SARSA algorithm, the
behavior policy producing A from S and the target policy producing A′ from S ′ are in fact
the same policy - the ϵ-greedy policy. Algorithms such as SARSA, in which the behavior
policy is the same as the target policy, are referred to as On-Policy Algorithms to indicate
the fact that the behavior used to generate data (experiences) does not deviate from the
policy we are aiming for (target policy, which drives towards the optimal policy).
The separation of behavior policy and target policy as two separate policies gives us
algorithms that are known as Off-Policy Algorithms to indicate the fact that the behavior
policy is allowed to “deviate off” from the target policy. This separation enables us to
construct more general and more powerful RL algorithms. We will use the notation π for
the target policy and the notation µ for the behavior policy - therefore, we say that Off-
Policy algorithms estimate the Value Function for target policy π while following behavior
policy µ. Off-Policy algorithms can be very valuable in real-world situations where we can
learn the target policy π by observing humans or other AI agents who follow a behavior
policy µ. Another great practical benefit is to be able to re-use prior experiences that were
generated from old policies, say π1, π2, . . .. Yet another powerful benefit is that we can
learn multiple target policies π1, π2, . . . while following one behavior policy µ. Let's now make
the concept of Off-Policy Learning concrete by covering the most basic (and most famous)
Off-Policy Control Algorithm, which goes by the name of Q-Learning.
10.6.1. Q-Learning
The best way to understand the (Off-Policy) Q-Learning algorithm is to tweak SARSA
to make it Off-Policy. Instead of having both the action A and the next action A′ being
generated by the same ϵ-greedy policy, we generate (i.e., sample) action A (from state
S) using an exploratory behavior policy µ and we generate the next action A′ (from next
state S ′ ) using the target policy π. The behavior policy can be any policy as long as it is
exploratory enough to be able to obtain sufficient data for all actions (in order to obtain
an adequate estimate of the Q-Value Function). Note that in SARSA, when we roll over to
the next (new) time step, the new time step’s state S is set to be equal to the previous time
step’s next state S ′ and the new time step’s action A is set to be equal to the previous time
step’s next action A′ . However, in Q-Learning, we only set the new time step’s state S to
be equal to the previous time step’s next state S ′ . The action A for the new time step will
be generated using the behavior policy µ, and won’t be equal to the previous time step’s
next action A′ (that would have been generated using the target policy π).
This Q-Learning idea of two separate policies - behavior policy and target policy - is
fairly generic, and can be used in algorithms beyond solving the Control problem. How-
ever, here we are interested in Q-Learning for Control and so, we want to ensure that the
target policy eventually becomes the optimal policy. One straightforward way to accom-
plish this is to make the target policy equal to the deterministic greedy policy derived from
the Q-Value Function estimate at every step. Thus, the update for Q-Learning Control al-
gorithm is as follows:
$$\Delta w = \alpha \cdot \delta_t \cdot \nabla_w Q(S_t, A_t; w)$$
where
$$\delta_t = R_{t+1} + \gamma \cdot \max_{a' \in \mathcal{A}} Q(S_{t+1}, a'; w) - Q(S_t, A_t; w)$$
Figure 10.4.: Visualization of Q-Learning Algorithm
Following our convention from Chapter 2, we depict the Q-Learning algorithm in Figure
10.4 with states as elliptical-shaped nodes, actions as rectangular-shaped nodes, and the
edges as samples from transition probability distribution and action choices.
Although we have highlighted some attractive features of Q-Learning (on account of be-
ing Off-Policy), it turns out that Q-Learning when combined with function approximation
of the Q-Value Function leads to convergence issues (more on this later). However, Tab-
ular Q-Learning converges under the usual appropriate conditions. There is considerable
literature on convergence of Tabular Q-Learning and we won’t go over those convergence
theorems in this book - here it suffices to say that the convergence proofs for Tabular Q-
Learning require infinite exploration of all (state, action) pairs and appropriate stochastic
approximation conditions for step sizes.
Now let us write some code for Q-Learning. The function q_learning below is quite
similar to the function glie_sarsa we wrote earlier. Here are the differences:
The above code is in the file rl/td.py. Much like how we tested GLIE SARSA on SimpleInventoryMDPCap,
the code in the file rl/chapter11/simple_inventory_mdp_cap.py also tests Q-Learning on
SimpleInventoryMDPCap. We encourage you to leverage the helper functions in rl/chapter11/control_utils.py
to run your own experiments and tests for Q-Learning. In particular, the functions for
Q-Learning in rl/chapter11/control_utils.py employ the common practice of using the ϵ-
greedy policy as the behavior policy.
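To illustrate the key difference from glie_sarsa, here is a minimal sketch (for illustration only, not the rl/td.py implementation) of the inner Q-Learning loop, written as a fragment in the same style as the glie_sarsa fragment above and re-using the epsilon_greedy_action helper; the essential change is that the update target maximizes over the next state's actions rather than using the ϵ-greedy-sampled next action:

state: NonTerminal[S] = states.sample()
steps: int = 0
while isinstance(state, NonTerminal) and steps < max_episode_length:
    # behavior policy: epsilon-greedy (exploratory)
    action: A = epsilon_greedy_action(
        q=q,
        nt_state=state,
        actions=set(mdp.actions(state)),
        epsilon=epsilon
    )
    next_state, reward = mdp.step(state, action).sample()
    # target policy: greedy, i.e., bootstrap with the max Q-Value over
    # the next state's actions (0 if the next state is terminal)
    next_return: float = max(
        q((next_state, a)) for a in mdp.actions(next_state)
    ) if isinstance(next_state, NonTerminal) else 0.
    q = q.update([((state, action), reward + gamma * next_return)])
    yield q
    steps += 1
    state = next_state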
For all (s_r, s_c) ∈ N, for all (a_r, a_c) ∈ A((s_r, s_c)), if (s_r + a_r, s_c + a_c) ∈ N, then:
$$\mathcal{P}_R((s_r, s_c), (a_r, a_c), -1, (s_r + a_r - 1, s_c + a_c)) = p_{1, s_c + a_c} \cdot I_{(s_r + a_r - 1, s_c + a_c) \in \mathcal{S}}$$
$$\mathcal{P}_R((s_r, s_c), (a_r, a_c), -1, (s_r + a_r + 1, s_c + a_c)) = p_{2, s_c + a_c} \cdot I_{(s_r + a_r + 1, s_c + a_c) \in \mathcal{S}}$$
$$\mathcal{P}_R((s_r, s_c), (a_r, a_c), -1, (s_r + a_r, s_c + a_c)) = 1 - p_{1, s_c + a_c} - p_{2, s_c + a_c}$$
Discount Factor γ = 1
Now let’s write some code to model this problem with the above MDP spec, and run
Value Iteration, SARSA and Q-Learning as three different ways of solving this MDP Con-
trol problem.
We start with the problem specification in the form of a Python class WindyGrid and write
some helper functions before getting into the MDP creation and DP/RL algorithms.
'''
Cell specifies (row, column) coordinate
'''
Cell = Tuple[int, int]
CellSet = Set[Cell]
Move = Tuple[int, int]
'''
WindSpec specifies a random vertical wind for each column.
Each random vertical wind is specified by a (p1, p2) pair
where p1 specifies probability of Downward Wind (could take you
one step lower in row coordinate unless prevented by a block or
boundary) and p2 specifies probability of Upward Wind (could take
you one step higher in row coordinate unless prevented by a
block or boundary). If one bumps against a block or boundary, one
incurs a bump cost and doesn't move. The remaining probability
1 - p1 - p2 corresponds to No Wind.
'''
WindSpec = Sequence[Tuple[float, float]]

possible_moves: Mapping[Move, str] = {
    (-1, 0): 'D',
    (1, 0): 'U',
    (0, -1): 'L',
    (0, 1): 'R'
}
@dataclass(frozen=True)
class WindyGrid:

    rows: int  # number of grid rows
    columns: int  # number of grid columns
    blocks: CellSet  # coordinates of block cells
    terminals: CellSet  # coordinates of goal cells
    wind: WindSpec  # spec of vertical random wind for the columns
    bump_cost: float  # cost of bumping against block or boundary

    @staticmethod
    def add_move_to_cell(cell: Cell, move: Move) -> Cell:
        return cell[0] + move[0], cell[1] + move[1]

    def is_valid_state(self, cell: Cell) -> bool:
        '''
        checks if a cell is a valid state of the MDP
        '''
        return 0 <= cell[0] < self.rows and 0 <= cell[1] < self.columns \
            and cell not in self.blocks

    def get_all_nt_states(self) -> CellSet:
        '''
        returns all the non-terminal states
        '''
        return {(i, j) for i in range(self.rows) for j in range(self.columns)
                if (i, j) not in set.union(self.blocks, self.terminals)}

    def get_actions_and_next_states(self, nt_state: Cell) \
            -> Set[Tuple[Move, Cell]]:
        '''
        given a non-terminal state, returns the set of all possible
        (action, next_state) pairs
        '''
        temp: Set[Tuple[Move, Cell]] = {(a, WindyGrid.add_move_to_cell(
            nt_state,
            a
        )) for a in possible_moves}
        return {(a, s) for a, s in temp if self.is_valid_state(s)}
Next we write a method to calculate the transition probabilities. The code below should
be self-explanatory and mimics the description of the problem above and the mathematical
specification of the transition probabilities given above.
Next we write a method to create the MarkovDecisionProcess for the Windy Grid.
from rl.markov_decision_process import FiniteDeterministicPolicy
from rl.dynamic_programming import value_iteration_result, V
from rl.chapter11.control_utils import glie_sarsa_finite_learning_rate
from rl.chapter11.control_utils import q_learning_finite_learning_rate
from rl.chapter11.control_utils import get_vf_and_policy_from_qvf

def get_vi_vf_and_policy(self) -> \
        Tuple[V[Cell], FiniteDeterministicPolicy[Cell, Move]]:
    '''
    Performs the Value Iteration DP algorithm returning the
    Optimal Value Function (as a V[Cell]) and the Optimal Policy
    (as a FiniteDeterministicPolicy[Cell, Move])
    '''
    return value_iteration_result(self.get_finite_mdp(), gamma=1.)

def get_glie_sarsa_vf_and_policy(
    self,
    epsilon_as_func_of_episodes: Callable[[int], float],
    learning_rate: float,
    num_updates: int
) -> Tuple[V[Cell], FiniteDeterministicPolicy[Cell, Move]]:
    qvfs: Iterator[QValueFunctionApprox[Cell, Move]] = \
        glie_sarsa_finite_learning_rate(
            fmdp=self.get_finite_mdp(),
            initial_learning_rate=learning_rate,
            half_life=1e8,
            exponent=1.0,
            gamma=1.0,
            epsilon_as_func_of_episodes=epsilon_as_func_of_episodes,
            max_episode_length=int(1e8)
        )
    final_qvf: QValueFunctionApprox[Cell, Move] = \
        iterate.last(itertools.islice(qvfs, num_updates))
    return get_vf_and_policy_from_qvf(
        mdp=self.get_finite_mdp(),
        qvf=final_qvf
    )

def get_q_learning_vf_and_policy(
    self,
    epsilon: float,
    learning_rate: float,
    num_updates: int
) -> Tuple[V[Cell], FiniteDeterministicPolicy[Cell, Move]]:
    qvfs: Iterator[QValueFunctionApprox[Cell, Move]] = \
        q_learning_finite_learning_rate(
            fmdp=self.get_finite_mdp(),
            initial_learning_rate=learning_rate,
            half_life=1e8,
            exponent=1.0,
            gamma=1.0,
            epsilon=epsilon,
            max_episode_length=int(1e8)
        )
    final_qvf: QValueFunctionApprox[Cell, Move] = \
        iterate.last(itertools.islice(qvfs, num_updates))
    return get_vf_and_policy_from_qvf(
        mdp=self.get_finite_mdp(),
        qvf=final_qvf
    )
The above code is in the file rl/chapter11/windy_grid.py. Note that this file also contains
some helpful printing functions that pretty-prints the grid, along with the calculated Op-
timal Value Functions and Optimal Policies. The method print_wind_and_bumps prints the
column wind probabilities and the cost of bumping into a block/boundary. The method
print_vf_and_policy prints a given Value Function and a given Deterministic Policy - this
method can be used to print the Optimal Value Function and Optimal Policy produced by
Value Iteration, by SARSA and by Q-Learning. In the printing of a deterministic policy,
“X” represents a block, “T” represents a terminal cell, and the characters “L,” “R,” “D,”
“U” represent “Left,” “Right,” “Down,” “Up” moves respectively.
Now let’s run our code on a small instance of a Windy Grid.
wg = WindyGrid(
    rows=5,
    columns=5,
    blocks={(0, 1), (0, 2), (0, 4), (2, 3), (3, 0), (4, 0)},
    terminals={(3, 4)},
    wind=[(0., 0.9), (0.0, 0.8), (0.7, 0.0), (0.8, 0.0), (0.9, 0.0)],
    bump_cost=4.0
)
wg.print_wind_and_bumps()
vi_vf_dict, vi_policy = wg.get_vi_vf_and_policy()
print("Value Iteration\n")
wg.print_vf_and_policy(
    vf_dict=vi_vf_dict,
    policy=vi_policy
)
epsilon_as_func_of_episodes: Callable[[int], float] = lambda k: 1. / k
learning_rate: float = 0.03
num_updates: int = 100000

sarsa_vf_dict, sarsa_policy = wg.get_glie_sarsa_vf_and_policy(
    epsilon_as_func_of_episodes=epsilon_as_func_of_episodes,
    learning_rate=learning_rate,
    num_updates=num_updates
)
print("SARSA\n")
wg.print_vf_and_policy(
    vf_dict=sarsa_vf_dict,
    policy=sarsa_policy
)
epsilon: float = 0.2

ql_vf_dict, ql_policy = wg.get_q_learning_vf_and_policy(
    epsilon=epsilon,
    learning_rate=learning_rate,
    num_updates=num_updates
)
print("Q-Learning\n")
wg.print_vf_and_policy(
    vf_dict=ql_vf_dict,
    policy=ql_policy
)
Value Iteration
0 1 2 3 4
4 XXXXX 5.25 2.02 1.10 1.00
3 XXXXX 8.53 5.20 1.00 0.00
2 9.21 6.90 8.53 XXXXX 1.00
1 8.36 9.21 8.36 12.16 11.00
0 10.12 XXXXX XXXXX 17.16 XXXXX
0 1 2 3 4
4 X R R R D
3 X R R R T
2 R U U X U
1 R U L L U
0 U X X U X
SARSA
0 1 2 3 4
4 XXXXX 5.47 2.02 1.08 1.00
3 XXXXX 8.78 5.37 1.00 0.00
2 9.14 7.03 8.29 XXXXX 1.00
1 8.51 9.16 8.27 11.92 12.58
0 10.05 XXXXX XXXXX 16.48 XXXXX
0 1 2 3 4
4 X R R R D
3 X R R R T
2 R U U X U
1 R U L L U
0 U X X U X
Q-Learning
0 1 2 3 4
4 XXXXX 5.45 2.02 1.09 1.00
3 XXXXX 8.09 5.12 1.00 0.00
2 8.78 6.76 7.92 XXXXX 1.00
1 8.31 8.85 8.09 11.52 10.93
0 9.85 XXXXX XXXXX 16.16 XXXXX
0 1 2 3 4
4 X R R R D
3 X R R R T
2 R U U X U
1 R U L L U
0 U X X U X
Value Iteration should be considered as the benchmark since it calculates the Optimal
Value Function within the default tolerance of 1e-5. We see that both SARSA and Q-
Learning get fairly close to the Optimal Value Function after only 100,000 updates (i.e.,
100,000 moves across various episodes). We also see that both SARSA and Q-Learning
Figure 10.5.: GLIE SARSA and Q-Learning Convergence for Windy Grid (Bump Cost = 4)
Figure 10.6.: GLIE SARSA and Q-Learning Convergence for Windy Grid (Bump Cost =
100,000)
SARSA and when to use Q-Learning. Roughly speaking, use SARSA if you are training
your AI agent with interaction with the actual environment where you care about time and
money consumed while doing the training with actual environment-interaction (eg: you
don’t want to risk damaging a robot by walking it towards an optimal path in the prox-
imity of physical danger). On the other hand, use Q-Learning if you are training your
AI agent with a simulated environment where large negative rewards don’t cause actual
time/money losses, but these large negative rewards help the AI agent learn quickly. In
a financial trading example, if you are training your RL agent in an actual trading envi-
ronment, you’d want to use SARSA as Q-Learning can potentially incur big losses while
SARSA (although slower in learning) will avoid real trading losses during the process of
learning. On the other hand, if you are training your RL agent in a simulated trading en-
vironment, Q-Learning is the way to go as it will learn fast by incurring “paper trading”
losses as part of the process of executing risky trades.
Note that Q-Learning (and Off-policy Learning in general) has higher per-sample vari-
ance than SARSA, which could lead to problems in convergence, especially when we em-
ploy function approximation for the Q-Value Function. Q-Learning has been shown to
be particularly problematic in converging when using neural networks for its Q-Value
function approximation.
The SARSA algorithm was introduced in a paper by Rummery and Niranjan (Rummery
and Niranjan 1994). The Q-Learning algorithm was introduced in the Ph.D. thesis of Chris
Watkins (Watkins 1989).
method is known as Importance Sampling, a fairly general technique (beyond RL) for es-
timating properties of a particular probability distribution, while only having access to
samples of a different probability distribution. Specializing this technique to Off-Policy
Control, we estimate the Value Function for the target policy (probability distribution of
interest) while having access to samples generated from the probability distribution of the
behavior policy. Specifically, Importance Sampling enables us to calculate EX∼P [f (X)]
(where P is the probability distribution of interest), given samples from probability dis-
tribution Q, as follows:
$$E_{X \sim P}[f(X)] = \sum P(X) \cdot f(X) = \sum Q(X) \cdot \frac{P(X)}{Q(X)} \cdot f(X) = E_{X \sim Q}\left[\frac{P(X)}{Q(X)} \cdot f(X)\right]$$
So basically, the function f(X) of samples X is scaled by the ratio of the probabilities
P(X) and Q(X).
Let’s employ this Importance Sampling method for Off-Policy Monte Carlo Prediction,
where we need to estimate the Value Function for policy π while only having access to
trace experience returns generated using policy µ. The idea is straightforward - we simply
weight the returns Gt according to the similarity between policies π and µ, by multiplying
importance sampling corrections along whole episodes. Let us define ρt as the product
of the ratio of action probabilities (on the two policies π and µ) from time t to time T − 1
(assume episode ends at time T). Specifically,
$$\rho_t = \frac{\pi(S_t, A_t)}{\mu(S_t, A_t)} \cdot \frac{\pi(S_{t+1}, A_{t+1})}{\mu(S_{t+1}, A_{t+1})} \cdots \frac{\pi(S_{T-1}, A_{T-1})}{\mu(S_{T-1}, A_{T-1})}$$
and we update the Value Function toward the importance-weighted return $\rho_t \cdot G_t$.
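As a small self-contained illustration (written for this discussion, not from the book's codebase), here is how the importance-sampling-weighted return updates could be computed for a tabular Off-Policy Every-Visit MC Prediction pass over one episode, given callables for the two policies' action probabilities:

from typing import Callable, Dict, List, Tuple, TypeVar

S = TypeVar('S')
A = TypeVar('A')


def off_policy_mc_prediction_update(
    v: Dict[S, float],
    counts: Dict[S, int],
    episode: List[Tuple[S, A, float]],      # (state, action, reward on transition)
    target_prob: Callable[[S, A], float],    # pi(s, a)
    behavior_prob: Callable[[S, A], float],  # mu(s, a)
    gamma: float
) -> None:
    # walk the episode backwards, accumulating both the return G_t and the
    # importance sampling correction rho_t (product of pi/mu ratios from t to T-1)
    g: float = 0.
    rho: float = 1.
    for s, a, r in reversed(episode):
        g = r + gamma * g
        rho *= target_prob(s, a) / behavior_prob(s, a)
        counts[s] = counts.get(s, 0) + 1
        # simple average of the importance-weighted returns rho_t * G_t
        v[s] = v.get(s, 0.) + (rho * g - v.get(s, 0.)) / counts[s]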
Figure 10.7.: Policy Evaluation (DP Algorithm with Full Backup)
$$\Delta w = \alpha \cdot \frac{\pi(S_t, A_t)}{\mu(S_t, A_t)} \cdot (R_{t+1} + \gamma \cdot Q(S_{t+1}, A_{t+1}; w) - Q(S_t, A_t; w)) \cdot \nabla_w Q(S_t, A_t; w)$$
This has much lower variance than MC importance sampling. A key advantage of TD
importance sampling is that policies only need to be similar over a single time step.
Since the modifications from On-Policy algorithms to Off-Policy algorithms based on
Importance Sampling are just a small tweak of scaling the update by importance sam-
pling corrections, we won’t implement the Off-Policy Importance Sampling algorithms in
Python code. However, we encourage you to implement the Prediction and Control MC
and TD Off-Policy algorithms (based on Importance Sampling) described above.
Figure 10.8.: TD Prediction (RL Algorithm with Sample Backup)
Figure 10.10.: SARSA (RL Algorithm with Sample Backup)
Figure 10.12.: Q-Learning (RL Algorithm with Sample Backup)
On/Off Policy   Algorithm   Tabular   Linear   Non-Linear
On-Policy       MC          ✓         ✓        ✓
On-Policy       TD(0)       ✓         ✓        ✗
On-Policy       TD(λ)       ✓         ✓        ✗
Off-Policy      MC          ✓         ✓        ✓
Off-Policy      TD(0)       ✓         ✗        ✗
Off-Policy      TD(λ)       ✓         ✗        ✗
Convergence problems can arise when an RL algorithm combines all three elements of what is known as the Deadly Triad:
• Bootstrapping, i.e., updating with a target that involves the current Value Function estimate (as is the case with Temporal-Difference)
• Off-Policy learning
• Function Approximation of the Value Function
On/Off Policy   Algorithm     Tabular   Linear   Non-Linear
On-Policy       MC            ✓         ✓        ✓
On-Policy       TD            ✓         ✓        ✗
On-Policy       Gradient TD   ✓         ✓        ✓
Off-Policy      MC            ✓         ✓        ✓
Off-Policy      TD            ✓         ✗        ✗
Off-Policy      Gradient TD   ✓         ✓        ✓
For now, we simply want to share that Gradient TD updates the value function approximation's parameters with the actual gradient (not semi-gradient) of an appropriate loss function, and the gradient formula involves bootstrapping. Thus, it avails of the advantages of bootstrapping without the disadvantages of semi-gradient (which we cheekily referred to as “cheating” in Chapter 9). Figure 10.15 expands upon Figure 10.14 by incorporating convergence properties of Gradient TD.
Now let's move on to convergence of Control Algorithms. Figure 10.16 provides the picture. (✓) means the algorithm doesn't quite hit the Optimal Value Function, but bounces around near the Optimal Value Function. Gradient Q-Learning is the adaptation of Q-Learning
with Gradient TD. So this method is Off-Policy, is bootstrapped, but avoids semi-gradient.
This enables it to converge for linear function approximations. However, it diverges when
used with non-linear function approximations. So, for Control, even with Gradient TD,
the deadly triad still exists for a combination of [Bootstrapping, Off-Policy, Non-Linear
Function Approximation]. In Chapter 11, we shall cover the DQN algorithm which is
an innovative and practically effective method for getting around the deadly triad for RL
Control.
11. Batch RL, Experience-Replay, DQN,
LSPI, Gradient TD
In Chapters 9 and 10, we covered the basic RL algorithms for Prediction and Control re-
spectively. Specifically, we covered the basic Monte-Carlo (MC) and Temporal-Difference
(TD) techniques. We want to highlight two key aspects of these basic RL algorithms:
1. The experiences data arrives as a single unit of experience at a time (a trace experience for MC, an atomic experience for TD), that unit of experience is used by the algorithm for Value Function learning, and then it is not used again later in the algorithm (essentially, each unit of experience, once consumed, is not re-consumed for further learning). It doesn't have to be this way - one can develop RL algorithms that re-use experience data - this approach is known as Experience-Replay (in fact, we saw a glimpse of Experience-Replay in Section 9.5.3 of Chapter 9).
2. The Value Function approximation is updated incrementally, i.e., after each unit of experience (we refer to this as Incremental RL). Alternatively, one can first collect the entire set of experiences data and then learn the Value Function from that whole batch of data - this approach is known as Batch RL.
Thus, we have a choice of doing Experience-Replay or not, and we have a choice of doing Batch RL or Incremental RL. In fact, some of the interesting and practically effective
algorithms combine both the ideas of Experience-Replay and Batch RL. This chapter starts
with the coverage of Batch RL and Experience-Replay. Then, we cover some key algo-
rithms (including Deep Q-Networks and Least Squares Policy Iteration) that effectively
leverage Batch RL and/or Experience-Replay. Next, we look deeper into the issue of the Deadly Triad (that we had alluded to in Chapter 10) by viewing Value Functions as Vectors (as we had done in Chapter 3), understanding Value Function Vector transformations with a balance of geometric intuition and mathematical rigor, and providing insights into convergence issues for a variety of traditional loss functions used to develop RL algorithms. Finally, this treatment of Value Functions as Vectors leads us in the direction of overcoming the Deadly Triad by defining an appropriate loss function, calculating whose gradient provides a more robust set of RL algorithms known as Gradient Temporal-Difference (abbreviated as Gradient TD).
11.1. Batch RL and Experience-Replay
Let us understand Incremental RL versus Batch RL in the context of fixed finite experiences
data. To make things simple and easy to understand, we first focus on understanding the
difference for the case of MC Prediction (i.e., to calculate the Value Function of an MRP
using Monte-Carlo). In fact, we had covered this setting in Section 9.5.3 of Chapter 9.
To refresh this setting, specifically we have access to a fixed finite sequence/stream of
MRP trace experiences (i.e., Iterable[Iterable[TransitionStep[S]]]), which we know
can be converted to returns-augmented data of the form Iterable[Iterable[ReturnStep[S]]]
(using the returns function1 ). Flattening this data to Iterable[ReturnStep[S]] and ex-
tracting from it the (state, return) pairs gives us the fixed, finite training data for MC
Prediction, that we denote as follows:
D = [(Si , Gi )|1 ≤ i ≤ n]
We’ve learnt in Chapter 9 that we can do an Incremental MC Prediction estimation
V (s; w) by updating w after each MRP trace experience with the gradient calculation
∇w L(w) for each data pair (Si , Gi ), as follows:
L_{(S_i,G_i)}(w) = (1/2) · (V(S_i; w) − G_i)²
∇_w L_{(S_i,G_i)}(w) = (V(S_i; w) − G_i) · ∇_w V(S_i; w)
∆w = α · (G_i − V(S_i; w)) · ∇_w V(S_i; w)
The Incremental MC Prediction algorithm performs n updates in sequence for data pairs
(Si , Gi ), i = 1, 2, . . . , n using the update method of FunctionApprox. We note that Incremen-
tal RL makes inefficient use of available training data D because we essentially “discard”
each of these units of training data after it’s used to perform an update. We want to make
efficient use of the given data with Batch RL. Batch MC Prediction aims to estimate the
MRP Value Function V (s; w∗ ) such that
w* = argmin_w (1/2n) · Σ_{i=1}^{n} (V(S_i; w) − G_i)² = argmin_w E_{(S,G)∼D}[(1/2) · (V(S; w) − G)²]
This in fact is the solve method of FunctionApprox on training data D. This approach is
called Batch RL because we first collect and store the entire set (batch) of data D available
to us, and then we find the best possible parameters w∗ fitting this data D. Note that unlike
Incremental RL, here we are not updating the MRP Value Function estimate while the data
arrives - we simply store the data as it arrives and start the MRP Value Function estima-
tion procedure once we are ready with the entire (batch) data D in storage. As we know
from the implementation of the solve method of FunctionApprox, finding the best possi-
ble parameters w∗ from the batch D involves calling the update method of FunctionApprox
with repeated use of the available data pairs (S, G) in the stored data set D. Each of these
updates to the parameters w is as follows:
∆w = α · (1/n) · Σ_{i=1}^{n} (G_i − V(S_i; w)) · ∇_w V(S_i; w)
¹returns is defined in the file rl/returns.py
Note that unlike Incremental MC where each update to w uses data from a single trace
experience, each update to w in Batch MC uses all of the trace experiences data (all of the
batch data). If we keep doing these updates repeatedly, we will ultimately converge to the
desired MRP Value Function V (s; w∗ ). The repeated use of the available data in D means
that we are doing Batch MC Prediction using Experience-Replay. So we see that this makes
more efficient use of the available training data D due to the re-use of the data pairs in D.
The code for this Batch MC Prediction algorithm (batch_mc_prediction) is shown be-
low.2 From the input trace experiences (traces in the code below), we first create the set of
ReturnStep transitions that span across the set of all input trace experiences (return_steps
in the code below). This involves calculating the return associated with each state encoun-
tered in traces (across all trace experiences). From return_steps, we create the (state,
return) pairs that constitute the fixed, finite training data D, which is then passed to the
solve method of approx: ValueFunctionApprox[S].
from typing import Iterable, TypeVar
import itertools

import rl.markov_process as mp
from rl.returns import returns
from rl.approximate_dynamic_programming import ValueFunctionApprox

S = TypeVar('S')


def batch_mc_prediction(
    traces: Iterable[Iterable[mp.TransitionStep[S]]],
    approx: ValueFunctionApprox[S],
    gamma: float,
    episode_length_tolerance: float = 1e-6,
    convergence_tolerance: float = 1e-5
) -> ValueFunctionApprox[S]:
    '''traces is a finite iterable'''
    # Convert each trace experience into return-augmented steps, across all traces
    return_steps: Iterable[mp.ReturnStep[S]] = \
        itertools.chain.from_iterable(
            returns(trace, gamma, episode_length_tolerance) for trace in traces
        )
    # Solve for the best-fitting Value Function from all (state, return) pairs
    return approx.solve(
        [(step.state, step.return_) for step in return_steps],
        convergence_tolerance
    )
Now let’s move on to Batch TD Prediction. Here we have fixed, finite experiences data
D available as:
D = [(Si , Ri , Si′ )|1 ≤ i ≤ n]
where (Ri , Si′ ) is the pair of reward and next state from a state Si . So, Experiences Data D is
presented in the form of a fixed, finite number of atomic experiences. This is represented
in code as an Iterable[TransitionStep[S]].
Just like Batch MC Prediction, here in Batch TD Prediction, we first collect and store the
data as it arrives, and once we are ready with the batch of data D in storage, we start the
MRP Value Function estimation procedure. The parameters w are updated with repeated
use of the atomic experiences in the stored data D. Each of these updates to the parameters
w is as follows:
∆w = α · (1/n) · Σ_{i=1}^{n} (R_i + γ · V(S'_i; w) − V(S_i; w)) · ∇_w V(S_i; w)
Note that unlike Incremental TD where each update to w uses data from a single atomic
experience, each update to w in Batch TD uses all of the atomic experiences data (all of
²batch_mc_prediction is defined in the file rl/monte_carlo.py.
the batch data). The repeated use of the available data in D means that we are doing Batch
TD Prediction using Experience-Replay. So we see that this makes more efficient use of the
available training data D due to the re-use of the data pairs in D.
The code for this Batch TD Prediction algorithm (batch_td_prediction) is shown be-
low.3 We create a Sequence[TransitionStep] from the fixed, finite-length input atomic ex-
periences D (transitions in the code below), and call the update method of FunctionApprox
repeatedly, passing the data D (now in the form of a Sequence[TransitionStep]) to each
invocation of the update method (using the function itertools.repeat). This repeated in-
vocation of the update method is done by using the function iterate.accumulate. This is
done until convergence (convergence based on the done function in the code below), at
which point we return the converged FunctionApprox.
from typing import Iterable, Sequence, TypeVar
import itertools
import numpy as np

import rl.markov_process as mp
from rl.approximate_dynamic_programming import ValueFunctionApprox, extended_vf
import rl.iterate as iterate

S = TypeVar('S')


def batch_td_prediction(
    transitions: Iterable[mp.TransitionStep[S]],
    approx_0: ValueFunctionApprox[S],
    gamma: float,
    convergence_tolerance: float = 1e-5
) -> ValueFunctionApprox[S]:
    '''transitions is a finite iterable'''
    def step(
        v: ValueFunctionApprox[S],
        tr_seq: Sequence[mp.TransitionStep[S]]
    ) -> ValueFunctionApprox[S]:
        # One sweep of updates using all of the batch data
        return v.update([(
            tr.state, tr.reward + gamma * extended_vf(v, tr.next_state)
        ) for tr in tr_seq])

    def done(
        a: ValueFunctionApprox[S],
        b: ValueFunctionApprox[S],
        convergence_tolerance=convergence_tolerance
    ) -> bool:
        return b.within(a, convergence_tolerance)

    return iterate.converged(
        iterate.accumulate(
            itertools.repeat(list(transitions)),
            step,
            initial=approx_0
        ),
        done=done
    )
Likewise, we can do Batch TD(λ) Prediction. Here we are given a fixed, finite number of trace experiences, with trace experience i consisting of T_i atomic experiences (S_{i,t}, R_{i,t+1}, S_{i,t+1}) for t = 0, 1, . . . , T_i − 1. For trace experience i, for each time step t in the trace experience, we calculate the eligibility traces as follows:

E_{i,t} = γλ · E_{i,t−1} + ∇_w V(S_{i,t}; w) for t = 1, 2, . . . , T_i − 1

with the eligibility traces initialized at time 0 for trace experience i as E_{i,0} = ∇_w V(S_{i,0}; w).
³batch_td_prediction is defined in the file rl/td.py.
Then, each update to the parameters w is as follows:
∆w = α · (1/n) · Σ_{i=1}^{n} (1/T_i) · Σ_{t=0}^{T_i−1} (R_{i,t+1} + γ · V(S_{i,t+1}; w) − V(S_{i,t}; w)) · E_{i,t}     (11.1)
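As a concrete illustration of Equation (11.1), here is a minimal numpy sketch (not from the book's rl library) of one batch update for a linear value function V(s; w) = ϕ(s)^T · w. It assumes each trace experience is supplied as a list of (ϕ(S_t), R_{t+1}, ϕ(S_{t+1})) triples, with a zero feature vector standing in for a terminal next state:

import numpy as np
from typing import List, Sequence, Tuple

Trace = List[Tuple[np.ndarray, float, np.ndarray]]  # (phi(S_t), R_{t+1}, phi(S_{t+1}))

def batch_td_lambda_update(
    w: np.ndarray,
    traces: Sequence[Trace],
    gamma: float,
    lambd: float,
    alpha: float
) -> np.ndarray:
    delta_w = np.zeros_like(w)
    for trace in traces:
        eligibility = np.zeros_like(w)
        for t, (phi_s, reward, phi_next_s) in enumerate(trace):
            # For a linear approximation, grad_w V(s; w) = phi(s)
            eligibility = phi_s if t == 0 else gamma * lambd * eligibility + phi_s
            td_error = reward + gamma * phi_next_s.dot(w) - phi_s.dot(w)
            delta_w += td_error * eligibility / len(trace)   # inner (1 / T_i) average
    return w + alpha * delta_w / len(traces)                 # outer (1 / n) average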
The re-use of experiences data can be facilitated by the following ExperienceReplayMemory class. It stores units of experience (of generic type T) as they arrive, and samples mini-batches from the stored data, with sampling weights determined by a time-weighting function applied in reverse order of insertion (so that, with a decaying weighting function, more recently-added data is given greater sampling weight).
from typing import Callable, Generic, Iterable, Iterator, List, Sequence, TypeVar

from rl.distribution import Categorical

T = TypeVar('T')


class ExperienceReplayMemory(Generic[T]):
    saved_transitions: List[T]
    time_weights_func: Callable[[int], float]
    weights: List[float]
    weights_sum: float

    def __init__(
        self,
        time_weights_func: Callable[[int], float] = lambda _: 1.0,
    ):
        self.saved_transitions = []
        self.time_weights_func = time_weights_func
        self.weights = []
        self.weights_sum = 0.0

    def add_data(self, transition: T) -> None:
        self.saved_transitions.append(transition)
        weight: float = self.time_weights_func(len(self.saved_transitions) - 1)
        self.weights.append(weight)
        self.weights_sum += weight

    def sample_mini_batch(self, mini_batch_size: int) -> Sequence[T]:
        # Weights are applied in reverse insertion order, so the most recently-added
        # transition receives the weight time_weights_func(0)
        num_transitions: int = len(self.saved_transitions)
        return Categorical(
            {tr: self.weights[num_transitions - 1 - i] / self.weights_sum
             for i, tr in enumerate(self.saved_transitions)}
        ).sample_n(min(mini_batch_size, num_transitions))

    def replay(
        self,
        transitions: Iterable[T],
        mini_batch_size: int
    ) -> Iterator[Sequence[T]]:
        for transition in transitions:
            self.add_data(transition)
            yield self.sample_mini_batch(mini_batch_size)

        while True:
            yield self.sample_mini_batch(mini_batch_size)
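For instance, the memory can be used on its own as follows (a small illustrative snippet, with arbitrary integer "transitions" and an arbitrarily-chosen half-life of 100):

# Exponentially-decaying weights: the weight halves every 100 positions of "age"
memory: ExperienceReplayMemory[int] = ExperienceReplayMemory(
    time_weights_func=lambda t: 0.5 ** (t / 100)
)
for i in range(1000):
    memory.add_data(i)
# Mini-batch of 32 sampled transitions, biased towards recently-added data
mini_batch = memory.sample_mini_batch(32)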
We now consider Prediction with a linear function approximation of the MRP Value Function, i.e.,

V(s; w) = Σ_{j=1}^{m} ϕ_j(s) · w_j = ϕ(s)^T · w for all s ∈ N
where ϕ(s) ∈ Rm is the feature vector for state s.
The direct solution of the MRP Value Function using simple linear algebra operations is
known as Least-Squares (abbreviated as LS) solution. We start with Batch MC Prediction
for the case of linear function approximation, which is known as Least-Squares Monte-
Carlo (abbreviated as LSMC).
The loss function for Batch MC Prediction with data [(S_i, G_i) | 1 ≤ i ≤ n] is:

L(w) = (1/2n) · Σ_{i=1}^{n} (Σ_{j=1}^{m} ϕ_j(S_i) · w_j − G_i)² = (1/2n) · Σ_{i=1}^{n} (ϕ(S_i)^T · w − G_i)²
We set the gradient of this loss function to 0 and solve for w*. This yields:

Σ_{i=1}^{n} ϕ(S_i) · (ϕ(S_i)^T · w* − G_i) = 0

We can calculate the solution w* as A⁻¹ · b, where the m × m matrix A is accumulated at each data pair (S_i, G_i) as:

A ← A + ϕ(S_i) · ϕ(S_i)^T (note the Outer-Product of ϕ(S_i) with itself)

and the m-Vector b is accumulated at each data pair (S_i, G_i) as:

b ← b + ϕ(S_i) · G_i
To implement this algorithm, we can simply call batch_mc_prediction that we had writ-
ten earlier by setting the argument approx as LinearFunctionApprox and by setting the at-
tribute direct_solve in approx: LinearFunctionApprox[S] as True. If you read the code
under direct_solve=True branch in the solve method, you will see that it indeed performs
the above-described linear algebra calculations. The inversion of the matrix A is O(m³) complexity. However, we can speed up the algorithm to be O(m²) with a different implementation - we can maintain the inverse of A after each (S_i, G_i) update to A by applying
the Sherman-Morrison formula for incremental inverse (Sherman and Morrison 1950).
The Sherman-Morrison incremental inverse for A is as follows:

A⁻¹ ← A⁻¹ − (A⁻¹ · ϕ(S_i) · ϕ(S_i)^T · A⁻¹) / (1 + ϕ(S_i)^T · A⁻¹ · ϕ(S_i))

with A⁻¹ initialized to (1/ϵ) · I_m for a small positive ϵ (I_m being the m × m identity matrix), so that each update is performed in O(m²) time.
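The following small numpy snippet (purely illustrative, not from the book's code) checks the Sherman-Morrison update against a direct matrix inversion for a single rank-one update:

import numpy as np

m = 5
rng = np.random.default_rng(0)
A = np.eye(m)                              # start from the identity for simplicity
A_inv = np.linalg.inv(A)
phi = rng.normal(size=m)                   # feature vector phi(S_i)

# Rank-one update of A and the corresponding Sherman-Morrison update of its inverse
A_new = A + np.outer(phi, phi)
A_inv_new = A_inv - np.outer(A_inv @ phi, A_inv @ phi) / (1.0 + phi @ A_inv @ phi)

print(np.allclose(A_inv_new, np.linalg.inv(A_new)))   # True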
11.3.2. Least-Squares Temporal-Difference (LSTD)
For the case of linear function approximation, the loss function for Batch TD Prediction with data [(S_i, R_i, S'_i) | 1 ≤ i ≤ n] is:

L(w) = (1/2n) · Σ_{i=1}^{n} (ϕ(S_i)^T · w − (R_i + γ · ϕ(S'_i)^T · w))²
We set the semi-gradient of this loss function to 0 and solve for w*. This yields:

Σ_{i=1}^{n} ϕ(S_i) · (ϕ(S_i)^T · w* − (R_i + γ · ϕ(S'_i)^T · w*)) = 0

We can calculate the solution w* as A⁻¹ · b, where the m × m matrix A is accumulated at each atomic experience (S_i, R_i, S'_i) as:

A ← A + ϕ(S_i) · (ϕ(S_i) − γ · ϕ(S'_i))^T (note the Outer-Product)

and the m-Vector b is accumulated at each atomic experience (S_i, R_i, S'_i) as:

b ← b + ϕ(S_i) · R_i
The LSTD algorithm (with Sherman-Morrison incremental inverse) is implemented in the function least_squares_td below:

from typing import Callable, Iterable, Sequence, TypeVar
import numpy as np

from rl.markov_process import NonTerminal, TransitionStep
from rl.function_approx import LinearFunctionApprox, Weights

S = TypeVar('S')


def least_squares_td(
    transitions: Iterable[TransitionStep[S]],
    feature_functions: Sequence[Callable[[NonTerminal[S]], float]],
    gamma: float,
    epsilon: float
) -> LinearFunctionApprox[NonTerminal[S]]:
    '''transitions is a finite iterable'''
    num_features: int = len(feature_functions)
    a_inv: np.ndarray = np.eye(num_features) / epsilon
    b_vec: np.ndarray = np.zeros(num_features)
    for tr in transitions:
        phi1: np.ndarray = np.array([f(tr.state) for f in feature_functions])
        if isinstance(tr.next_state, NonTerminal):
            phi2 = phi1 - gamma * np.array([f(tr.next_state)
                                            for f in feature_functions])
        else:
            phi2 = phi1
        # Sherman-Morrison incremental update of the inverse of A
        temp: np.ndarray = a_inv.T.dot(phi2)
        a_inv = a_inv - np.outer(a_inv.dot(phi1), temp) / (1 + phi1.dot(temp))
        b_vec += phi1 * tr.reward
    opt_wts: np.ndarray = a_inv.dot(b_vec)
    return LinearFunctionApprox.create(
        feature_functions=feature_functions,
        weights=Weights.create(opt_wts)
    )
Let’s say we have access to only 10,000 transitions (each transition is an object of the type
TransitionStep). First we generate these 10,000 sampled transitions from the RandomWalkMRP
object we created above.
from typing import Iterable, Sequence
import itertools

from rl.approximate_dynamic_programming import NTStateDistribution
from rl.markov_process import NonTerminal, TransitionStep
from rl.distribution import Choose

num_transitions: int = 10000

nt_states: Sequence[NonTerminal[int]] = random_walk.non_terminal_states
start_distribution: NTStateDistribution[int] = Choose(set(nt_states))
traces: Iterable[Iterable[TransitionStep[int]]] = \
    random_walk.reward_traces(start_distribution)
transitions: Iterable[TransitionStep[int]] = \
    itertools.chain.from_iterable(traces)
td_transitions: Iterable[TransitionStep[int]] = \
    itertools.islice(transitions, num_transitions)
Before running LSTD, let’s run Incremental Tabular TD on the 10,000 transitions in
td_transitions and obtain the resultant Value Function (td_vf in the code below). Since
there are only 10,000 transitions, we use an aggressive initial learning rate of 0.5 to promote
fast learning, but we let this high learning rate decay quickly so the learning stabilizes.
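Since the code for that run isn't shown here, the following is a minimal self-contained sketch (not the book's td_prediction function; the decay schedule and the gamma value in the usage line are illustrative choices) of Incremental Tabular TD on a finite iterable of TransitionSteps, with a learning rate that starts at 0.5 and decays with the number of updates to each state:

from typing import Dict, Iterable
from rl.markov_process import NonTerminal, TransitionStep

def tabular_td_sketch(
    transitions: Iterable[TransitionStep[int]],
    gamma: float,
    initial_learning_rate: float = 0.5,
    half_life: float = 1000.0
) -> Dict[int, float]:
    vf: Dict[int, float] = {}
    counts: Dict[int, int] = {}
    for tr in transitions:
        s = tr.state.state
        next_v = vf.get(tr.next_state.state, 0.0) \
            if isinstance(tr.next_state, NonTerminal) else 0.0
        counts[s] = counts.get(s, 0) + 1
        # Learning rate decays as the number of updates to this state grows
        alpha = initial_learning_rate / (1.0 + (counts[s] - 1) / half_life)
        v = vf.get(s, 0.0)
        vf[s] = v + alpha * (tr.reward + gamma * next_v - v)
    return vf

# td_vf_dict = tabular_td_sketch(td_transitions, gamma=0.9)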
Finally, we run the LSTD algorithm on 10,000 transitions. Note that the Value Function
of RandomWalkMRP, for p ̸= 0.5, is non-linear as a function of the integer states. So we use
non-linear features that can approximate arbitrary non-linear shapes - a good choice is the
set of (orthogonal) Laguerre Polynomials. In the code below, we use the first 5 Laguerre
Polynomials (i.e., up to the degree 4 polynomial) as the feature functions for the linear function
approximation of the Value Function. Then we invoke the LSTD algorithm we wrote above
to calculate the LinearFunctionApprox based on this batch of 10,000 transitions.
Figure 11.1 depicts how the LSTD Value Function estimate (for 10,000 transitions) lstd_vf
compares against Incremental Tabular TD Value Function estimate (for 10,000 transitions)
Figure 11.1.: LSTD and Tabular TD Value Functions
td_vf and against the true value function true_vf (obtained using the linear-algebra-solver-
based calculation of the MRP Value Function). We encourage you to modify the parame-
ters used in the code above to see how it alters the results - specifically play around with
this_barrier, this_p, gamma, num_transitions, the learning rate trajectory for Incremental
Tabular TD, the number of Laguerre polynomials, and epsilon. The above code is in the
file rl/chapter12/random_walk_lstd.py.
11.3.3. LSTD(λ)
Likewise, we can do LSTD(λ) using Eligibility Traces. Here we are given a fixed, finite number of trace experiences, with trace experience i consisting of T_i atomic experiences (S_{i,t}, R_{i,t+1}, S_{i,t+1}) for t = 0, 1, . . . , T_i − 1.
Denote the Eligibility Traces of trace experience i at time t as E_{i,t}. Note that the eligibility traces accumulate ∇_w V(s; w) = ϕ(s) in each trace experience, and when accumulating, the previous time step's eligibility traces are discounted by λγ. By setting the right-hand-side of Equation (11.1) to 0 (i.e., setting the update to w over all atomic experiences data to 0), we get:

Σ_{i=1}^{n} (1/T_i) · Σ_{t=0}^{T_i−1} E_{i,t} · (ϕ(S_{i,t})^T · w* − (R_{i,t+1} + γ · ϕ(S_{i,t+1})^T · w*)) = 0
We can calculate the solution w* as A⁻¹ · b, where the m × m matrix A is accumulated at each atomic experience (S_{i,t}, R_{i,t+1}, S_{i,t+1}) as:

A ← A + (1/T_i) · E_{i,t} · (ϕ(S_{i,t}) − γ · ϕ(S_{i,t+1}))^T (note the Outer-Product)
On/Off Policy   Algorithm     Tabular   Linear   Non-Linear
On-Policy       MC            ✓         ✓        ✓
On-Policy       LSMC          ✓         ✓        -
On-Policy       TD            ✓         ✓        ✗
On-Policy       LSTD          ✓         ✓        -
On-Policy       Gradient TD   ✓         ✓        ✓
Off-Policy      MC            ✓         ✓        ✓
Off-Policy      LSMC          ✓         ✗        -
Off-Policy      TD            ✓         ✗        ✗
Off-Policy      LSTD          ✓         ✗        -
Off-Policy      Gradient TD   ✓         ✓        ✓
and the m-Vector b is accumulated at each atomic experience (S_{i,t}, R_{i,t+1}, S_{i,t+1}) as:

b ← b + (1/T_i) · E_{i,t} · R_{i,t+1}
With Sherman-Morrison incremental inverse, we can reduce the computational complexity from O(m³) to O(m²).
There are two reasons why a straightforward combination of deep learning (for Q-Value function approximation) with RL algorithms such as Q-Learning tends to be unstable:
1) The sequences of states made available to deep learning through trace experiences are highly correlated, whereas deep learning algorithms are premised on data samples being independent.
2) The data distribution changes as the RL algorithm learns new behaviors, whereas deep learning algorithms are premised on a fixed underlying distribution (i.e., stationary).
Experience-Replay serves to smooth the training data distribution over many past be-
haviors, effectively resolving the correlation issue as well as the non-stationary issue. Hence,
Experience-Replay is a powerful idea for Off-Policy TD Control. The idea of using Experience-
Replay for Off-Policy TD Control is due to the Ph.D. thesis of Long Lin (Lin 1993).
To make this idea of Q-Learning with Experience-Replay clear, we make a few changes to the q_learning function we had written in Chapter 10, resulting in the following function q_learning_experience_replay:
from typing import Callable, Iterator, Sequence, TypeVar

from rl.markov_decision_process import MarkovDecisionProcess, TransitionStep
from rl.markov_process import NonTerminal
from rl.policy import Policy  # Policy is assumed here to live in rl.policy
from rl.approximate_dynamic_programming import QValueFunctionApprox
from rl.approximate_dynamic_programming import NTStateDistribution
from rl.experience_replay import ExperienceReplayMemory

S = TypeVar('S')
A = TypeVar('A')

PolicyFromQType = Callable[
    [QValueFunctionApprox[S, A], MarkovDecisionProcess[S, A]],
    Policy[S, A]
]


def q_learning_experience_replay(
    mdp: MarkovDecisionProcess[S, A],
    policy_from_q: PolicyFromQType,
    states: NTStateDistribution[S],
    approx_0: QValueFunctionApprox[S, A],
    gamma: float,
    max_episode_length: int,
    mini_batch_size: int,
    weights_decay_half_life: float
) -> Iterator[QValueFunctionApprox[S, A]]:
    exp_replay: ExperienceReplayMemory[TransitionStep[S, A]] = \
        ExperienceReplayMemory(
            time_weights_func=lambda t: 0.5 ** (t / weights_decay_half_life),
        )
    q: QValueFunctionApprox[S, A] = approx_0
    yield q
    while True:
        state: NonTerminal[S] = states.sample()
        steps: int = 0
        while isinstance(state, NonTerminal) and steps < max_episode_length:
            policy: Policy[S, A] = policy_from_q(q, mdp)
            action: A = policy.act(state).sample()
            next_state, reward = mdp.step(state, action).sample()
            # Store the atomic experience rather than using it directly
            exp_replay.add_data(TransitionStep(
                state=state,
                action=action,
                next_state=next_state,
                reward=reward
            ))
            # Sample a mini-batch from the replay memory and do the Q-Learning update
            trs: Sequence[TransitionStep[S, A]] = \
                exp_replay.sample_mini_batch(mini_batch_size)
            q = q.update(
                [(
                    (tr.state, tr.action),
                    tr.reward + gamma * (
                        max(q((tr.next_state, a))
                            for a in mdp.actions(tr.next_state))
                        if isinstance(tr.next_state, NonTerminal) else 0.)
                ) for tr in trs],
            )
            yield q
            steps += 1
            state = next_state
The key difference between the q_learning algorithm we wrote in Chapter 10 and this
q_learning_experience_replay algorithm is that here we have an experience-replay mem-
ory (using the ExperienceReplayMemory class we had implemented earlier). In the q_learning
algorithm, the (state, action, next_state, reward) 4-tuple comprising TransitionStep (that
is used to perform the Q-Learning update) was the result of action being sampled from
the behavior policy (derived from the current estimate of the Q-Value Function, eg: ϵ-
greedy), and then the next_state and reward being generated from the (state, action)
pair using the step method of mdp. Here in q_learning_experience_replay, we don’t use
this 4-tuple TransitionStep to perform the update - rather, we append this 4-tuple to the
ExperienceReplayMemory (using the add_data method), then we sample mini_batch_sized
TransitionSteps from the ExperienceReplayMemory (giving more sampling weightage to
the more recently added TransitionSteps), and use those 4-tuple TransitionSteps to per-
form the Q-Learning update. Note that these sampled TransitionSteps might be from
old behavior policies (derived from old estimates of the Q-Value Function). The key is
that this algorithm re-uses atomic experiences that were previously prepared by the algo-
rithm, which also means that it re-uses behavior policies that were previously constructed
by the algorithm.
The argument mini_batch_size refers to the number of TransitionSteps to be drawn
from the ExperienceReplayMemory at each step. The argument weights_decay_half_life
refers to the half life of an exponential decay function for the weights used in the sampling
of the TransitionSteps (the most recently added TransitionStep has the highest weight).
With this understanding, the code should be self-explanatory.
The above code is in the file rl/td.py.
The DQN (Deep Q-Networks) algorithm uses Experience-Replay together with fixed Q-Learning targets: in addition to the Q-network with parameters w, it maintains a target network with “frozen” parameters w−, and these frozen parameters are used to compute the Q-Learning targets (i.e., w− takes the place of w in the target term of the Q-Learning update equation).
We won’t implement the DQN algorithm in Python code - however, we sketch the out-
line of the algorithm, as follows:
At each time t for each episode:
• Given state St , take action At according to ϵ-greedy policy extracted from Q-network
values Q(St , a; w).
• Given state St and action At , obtain reward Rt+1 and next state St+1 from the envi-
ronment.
• Append atomic experience (St , At , Rt+1 , St+1 ) in experience-replay memory D.
• Sample a random mini-batch of atomic experiences (si , ai , ri , s′i ) ∼ D.
• Using this mini-batch of atomic experiences, update the Q-network parameters w
with the Q-learning targets based on “frozen” parameters w− of the target network.
∆w = α · Σ_i (r_i + γ · max_{a'_i} Q(s'_i, a'_i; w−) − Q(s_i, a_i; w)) · ∇_w Q(s_i, a_i; w)
• St ← St+1
• Once every C time steps, set w− ← w.
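To make the role of the frozen target parameters concrete, here is a small hedged sketch (not the DQN implementation from any library) of the mini-batch update step, using a linear Q-function Q(s, a; w) = ϕ(s, a)^T · w for simplicity (DQN itself uses a neural network, but the structure of the update is the same). The mini-batch data layout is an assumption made purely for illustration:

import numpy as np
from typing import List, Tuple

# Each mini-batch element: (phi(s_i, a_i), r_i, [phi(s'_i, a') for each action a'], is_terminal)
MiniBatch = List[Tuple[np.ndarray, float, List[np.ndarray], bool]]

def dqn_minibatch_update(
    w: np.ndarray,          # online Q-network parameters
    w_frozen: np.ndarray,   # "frozen" target-network parameters w^-
    mini_batch: MiniBatch,
    gamma: float,
    alpha: float
) -> np.ndarray:
    delta_w = np.zeros_like(w)
    for phi_sa, r, next_phis, is_terminal in mini_batch:
        if is_terminal or not next_phis:
            target = r
        else:
            # The max over next actions is evaluated with the frozen parameters w^-
            target = r + gamma * max(p.dot(w_frozen) for p in next_phis)
        # For a linear Q-function, grad_w Q(s, a; w) = phi(s, a)
        delta_w += alpha * (target - phi_sa.dot(w)) * phi_sa
    return w + delta_w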
To learn more about the effectiveness of DQN for Atari games, see the Original DQN
Paper (Mnih et al. 2013) and the DQN Nature Paper (Mnih et al. 2015) that DeepMind
has published.
Now we are ready to cover Batch RL Control (specifically Least-Squares TD Control),
which combines the ideas of Least-Squares TD Prediction and Q-Learning with Experience-
Replay.
Least-Squares TD Control does Generalized Policy Iteration with:
1. Policy Evaluation as Least-Squares Q-Value Prediction, with a direct linear-algebraic solve for the linear function approximation weights w using batch experiences data generated using policy π.
2. ϵ-Greedy Policy Improvement.
In this section, we focus on Off-Policy Control with Least-Squares TD. This algorithm is
known as Least-Squares Policy Iteration, abbreviated as LSPI, developed by Lagoudakis
and Parr (Lagoudakis and Parr 2003). LSPI has been an important go-to algorithm in the
history of RL Control because of its simplicity and effectiveness. The basic idea of LSPI is
that it does Generalized Policy Iteration (GPI) in the form of Q-Learning with Experience-
Replay, with the key being that instead of doing the usual Q-Learning update after each
atomic experience, we do batch Q-Learning for the Policy Evaluation phase of GPI. We
spend the rest of this section describing LSPI in detail and then implementing it in Python
code.
The input to LSPI is a fixed finite data set D, consisting of a set of (s, a, r, s′ ) atomic
experiences, i.e., a set of rl.markov_decision_process.TransitionStep objects, and the task
of LSPI is to determine the Optimal Q-Value Function (and hence, Optimal Policy) based
on this experiences data set D using an experience-replayed, batch Q-Learning technique
described below. Assume D consists of n atomic experiences, indexed as i = 1, 2, . . . n,
with atomic experience i denoted as (si , ai , ri , s′i ).
In LSPI, each iteration of GPI involves access to the experiences data D and to a deterministic target policy πD (produced by the previous iteration's Policy Improvement).
Given D and πD , the goal of each iteration of GPI is to solve for weights w∗ that mini-
mizes:
L(w) = Σ_{i=1}^{n} (Q(s_i, a_i; w) − (r_i + γ · Q(s'_i, π_D(s'_i); w)))² = Σ_{i=1}^{n} (ϕ(s_i, a_i)^T · w − (r_i + γ · ϕ(s'_i, π_D(s'_i))^T · w))²
The solution for the weights w* is attained by setting the semi-gradient of L(w) to 0, i.e.,

Σ_{i=1}^{n} ϕ(s_i, a_i) · (ϕ(s_i, a_i)^T · w* − (r_i + γ · ϕ(s'_i, π_D(s'_i))^T · w*)) = 0     (11.2)
We can calculate the solution w* as A⁻¹ · b, where the m × m matrix A is accumulated for each TransitionStep (s_i, a_i, r_i, s'_i) as:

A ← A + ϕ(s_i, a_i) · (ϕ(s_i, a_i) − γ · ϕ(s'_i, π_D(s'_i)))^T (note the Outer-Product)

and the m-Vector b is accumulated at each atomic experience (s_i, a_i, r_i, s'_i) as:

b ← b + ϕ(s_i, a_i) · r_i
Q(s, a; w*) = ϕ(s, a)^T · w* = Σ_{j=1}^{m} ϕ_j(s, a) · w*_j
This solve for the Q-Value Function of the target policy is known as LSTDQ, and it constitutes the Policy Evaluation phase of each GPI iteration, which alternates with greedy policy improvements. Note how LSTDQ in each iteration re-uses the same data D, i.e., LSPI does experience-replay.
We should point out here that the LSPI algorithm we described above should be consid-
ered as the standard variant of LSPI. However, we can design several other variants of LSPI,
in terms of how the experiences data is sourced and used. Firstly, we should note that the
experiences data D essentially provides the behavior policy for Q-Learning (along with the
consequent reward and next state transition). In the standard variant we described above,
since D is provided from an external source, the behavior policy that generates this data
D must come from an external source. It doesn’t have to be this way - we could generate
the experiences data from a behavior policy derived from the Q-Value estimates produced
by LSTDQ (eg: ϵ-greedy policy). This would mean the experiences data used in the al-
gorithm is not a fixed, finite data set, rather a variable, incrementally-produced data set.
Even if the behavior policy was external, the data set D might not be a fixed finite data set
- rather, it could be made available as an on-demand, variable data stream. Furthermore,
in each iteration of GPI, we could use a subset of the experiences data made available un-
til that point of time (rather than the approach of the standard variant of LSPI that uses
all of the available experiences data). If we choose to sample a subset of the available ex-
periences data, we might give more sampling-weightage to the more recently generated
data. This would especially be the case if the experiences data was being generated from
a policy derived from the Q-Value estimates produced by LSTDQ. In this case, we would
leverage the ExperienceReplayMemory class we’d written earlier.
Next, we write code to implement the standard variant of LSPI we described above. First,
we write a function to implement LSTDQ. As described above, the inputs to LSTDQ are
the experiences data D (transitions in the code below) and a deterministic target policy
πD (target_policy in the code below). Since we are doing a linear function approxima-
tion, the input also includes a set of features, described as functions of state and action
(feature_functions in the code below). Lastly, the inputs also include the discount factor
γ and the numerical control parameter ϵ. The code below should be fairly self-explanatory,
as it is a straightforward extension of LSTD (implemented in function least_squares_td
earlier). The key differences are that this is an estimate of the Action-Value (Q-Value)
function, rather than the State-Value Function, and the target used in the least-squares
calculation is the Q-Learning target (produced by the target_policy).
from typing import Callable, Iterable, Sequence, Tuple, TypeVar
import numpy as np

from rl.markov_decision_process import TransitionStep
from rl.markov_process import NonTerminal
from rl.policy import DeterministicPolicy  # assumed location in the book's rl library
from rl.function_approx import LinearFunctionApprox, Weights

S = TypeVar('S')
A = TypeVar('A')


def least_squares_tdq(
    transitions: Iterable[TransitionStep[S, A]],
    feature_functions: Sequence[Callable[[Tuple[NonTerminal[S], A]], float]],
    target_policy: DeterministicPolicy[S, A],
    gamma: float,
    epsilon: float
) -> LinearFunctionApprox[Tuple[NonTerminal[S], A]]:
    '''transitions is a finite iterable'''
    num_features: int = len(feature_functions)
    a_inv: np.ndarray = np.eye(num_features) / epsilon
    b_vec: np.ndarray = np.zeros(num_features)
    for tr in transitions:
        phi1: np.ndarray = np.array([f((tr.state, tr.action))
                                     for f in feature_functions])
        if isinstance(tr.next_state, NonTerminal):
            # Q-Learning target uses the action prescribed by the target policy
            phi2 = phi1 - gamma * np.array([
                f((tr.next_state, target_policy.action_for(tr.next_state.state)))
                for f in feature_functions])
        else:
            phi2 = phi1
        temp: np.ndarray = a_inv.T.dot(phi2)
        a_inv = a_inv - np.outer(a_inv.dot(phi1), temp) / (1 + phi1.dot(temp))
        b_vec += phi1 * tr.reward
    opt_wts: np.ndarray = a_inv.dot(b_vec)
    return LinearFunctionApprox.create(
        feature_functions=feature_functions,
        weights=Weights.create(opt_wts)
    )
Now we are ready to write the standard variant of LSPI. The code below is a straight-
forward implementation of our description above, looping through the iterations of GPI,
yielding the Q-Value LinearFunctionApprox after each iteration of GPI.
from typing import Iterator

from rl.monte_carlo import greedy_policy_from_qvf  # assumed location of this helper


def least_squares_policy_iteration(
    transitions: Iterable[TransitionStep[S, A]],
    actions: Callable[[NonTerminal[S]], Iterable[A]],
    feature_functions: Sequence[Callable[[Tuple[NonTerminal[S], A]], float]],
    initial_target_policy: DeterministicPolicy[S, A],
    gamma: float,
    epsilon: float
) -> Iterator[LinearFunctionApprox[Tuple[NonTerminal[S], A]]]:
    '''transitions is a finite iterable'''
    target_policy: DeterministicPolicy[S, A] = initial_target_policy
    transitions_seq: Sequence[TransitionStep[S, A]] = list(transitions)
    while True:
        # Policy Evaluation: batch Q-Learning (LSTDQ) re-using the same data
        q: LinearFunctionApprox[Tuple[NonTerminal[S], A]] = \
            least_squares_tdq(
                transitions=transitions_seq,
                feature_functions=feature_functions,
                target_policy=target_policy,
                gamma=gamma,
                epsilon=epsilon,
            )
        # Policy Improvement: greedy policy from the estimated Q-Value Function
        target_policy = greedy_policy_from_qvf(q, actions)
        yield q
It is straightforward to model this problem as an MDP. The State is the number of vil-
lagers at risk on any given night (if the vampire is still alive, the State is the number of
villagers and if the vampire is dead, the State is 0, which is the only Terminal State). The
Action is the number of villagers poisoned on any given night. The Reward is zero as long
as the vampire is alive, and is equal to the number of villagers remaining if the vampire
dies. Let us refer to the initial number of villagers as I.
It is rather straightforward to solve this with Dynamic Programming (say, Value Itera-
tion) since we know the transition probabilities and reward function and since the state
and action spaces are finite. However, in a situation where we don’t know the exact prob-
abilities with which the vampire operates, and we only had access to observations on spe-
cific days, we can attempt to solve this problem with Reinforcement Learning (assuming
we had access to observations of many vampires operating on many villages). In any case,
our goal here is to test LSPI using this vampire problem as an example. So we write some
code to first model this MDP as described above, solve it with value iteration (to obtain the
benchmark, i.e., true Optimal Value Function and true Optimal Policy to compare against),
then generate atomic experiences data from the MDP, and then solve this problem with
LSPI using this stream of generated atomic experiences.
The following methods (part of the MDP class for this problem, in the file rl/chapter12/vampire.py) construct the LSPI feature functions from Laguerre polynomials of two factors of (state, action) (lagval here is numpy.polynomial.laguerre.lagval):

    def lspi_features(
        self,
        factor1_features: int,
        factor2_features: int
    ) -> Sequence[Callable[[Tuple[NonTerminal[int], int]], float]]:
        ret: List[Callable[[Tuple[NonTerminal[int], int]], float]] = []
        ident1: np.ndarray = np.eye(factor1_features)
        ident2: np.ndarray = np.eye(factor2_features)
        for i in range(factor1_features):
            def factor1_ff(x: Tuple[NonTerminal[int], int], i=i) -> float:
                return lagval(
                    float((x[0].state - x[1]) ** 2 / x[0].state),
                    ident1[i]
                )
            ret.append(factor1_ff)
        for j in range(factor2_features):
            def factor2_ff(x: Tuple[NonTerminal[int], int], j=j) -> float:
                return lagval(
                    float((x[0].state - x[1]) * x[1] / x[0].state),
                    ident2[j]
                )
            ret.append(factor2_ff)
        return ret
    def lspi_transitions(self) -> Iterator[TransitionStep[int, int]]:
        states_distribution: Choose[NonTerminal[int]] = \
            Choose(self.non_terminal_states)
        while True:
            state: NonTerminal[int] = states_distribution.sample()
            action: int = Choose(range(state.state)).sample()
            next_state, reward = self.step(state, action).sample()
            transition: TransitionStep[int, int] = TransitionStep(
                state=state,
                action=action,
                next_state=next_state,
                reward=reward
            )
            yield transition

    def lspi_vf_and_policy(self) -> \
            Tuple[V[int], FiniteDeterministicPolicy[int, int]]:
        transitions: Iterable[TransitionStep[int, int]] = itertools.islice(
            self.lspi_transitions(),
            20000
        )
        qvf_iter: Iterator[LinearFunctionApprox[Tuple[
            NonTerminal[int], int]]] = least_squares_policy_iteration(
                transitions=transitions,
                actions=self.actions,
                feature_functions=self.lspi_features(4, 4),
                initial_target_policy=DeterministicPolicy(
                    lambda s: int(s / 2)
                ),
                gamma=1.0,
                epsilon=1e-5
            )
        qvf: LinearFunctionApprox[Tuple[NonTerminal[int], int]] = \
            iterate.last(
                itertools.islice(
                    qvf_iter,
                    20
                )
            )
        return get_vf_and_policy_from_qvf(self, qvf)
The above code should be self-explanatory. The main challenge with LSPI is that we need to construct feature functions of the state and action such that the Q-Value Function is linear in those features. In this case, since we simply want to test the correctness of our LSPI implementation, we define feature functions (in the method lspi_features above) based on our knowledge of the true optimal Q-Value Function from the Dynamic Programming solution. The atomic experiences comprising the experiences data D for LSPI to use are generated with a uniform distribution over non-terminal states and a uniform distribution over actions for a given state (in the method lspi_transitions above).
Figure 11.3.: True versus LSPI Optimal Value Function
Figure 11.3 shows the plot of the True Optimal Value Function (from Value Iteration)
versus the LSPI-estimated Optimal Value Function.
Figure 11.4 shows the plot of the True Optimal Policy (from Value Iteration) versus the
LSPI-estimated Optimal Policy.
The above code is in the file rl/chapter12/vampire.py. As ever, we encourage you to
modify some of the parameters in this code (including choices of feature functions, nature
and number of atomic transitions used, number of GPI iterations, choice of ϵ, and perhaps
even a different dynamic for the vampire behavior), and see how the results change.
Figure 11.4.: True versus LSPI Optimal Policy
In the financial trading industry, it has traditionally not been a common practice to ex-
plicitly view the American Options Pricing problem as an MDP. Specialized algorithms
have been developed to price American Options. We now provide a quick overview of the
common practice in pricing American Options in the financial trading industry. Firstly,
we should note that the price of some American Options is equal to the price of the corre-
sponding European Option, for which we have a closed-form solution under the assump-
tion of a lognormal process for the underlying - this is the case for a plain-vanilla American
call option whose price (as we proved in Chapter 7) is equal to the price of a plain-vanilla
European call option. However, this is not the case for a plain-vanilla American put option.
Secondly, we should note that if the payoff of an American option is dependent on only the
current price of the underlying (and not on the past prices of the underlying) - in which
case, we say that the option payoff is not “history-dependent” - and if the dimension of the
state space is not large, then we can do a simple backward induction on a binomial tree (as
we showed in Chapter 7). In practice, a more detailed data structure such as a trinomial
tree or a lattice is often used for more accurate backward-induction calculations. However,
if the payoff is history-dependent (i.e., payoff depends on past prices of the underlying)
or if the payoff depends on the prices of several underlying assets, then the state space is
too large for backward induction to handle. In such cases, the standard approach in the
financial trading industry is to use the Longstaff-Schwartz pricing algorithm (Longstaff
and Schwartz 2001). We won’t cover the Longstaff-Schwartz pricing algorithm in detail in
this book - it suffices to share here that the Longstaff-Schwartz pricing algorithm combines 3 ideas:
• Monte-Carlo simulation of paths of the underlying asset price(s).
• Least-squares regression (across the simulated paths) to estimate the continuation value as a function of the underlying price(s).
• Backward-recursion from the option expiration to the present, comparing the exercise value against the estimated continuation value at each time step.
The goal of this section is to explain how to price American Options with Reinforcement
Learning, as an alternative to the Longstaff-Schwartz algorithm.
We do so with a customized version of LSPI. The key customization comes from the fact that there are only two actions. The action
to exercise produces a (state-conditioned) reward (i.e., option payoff) and transition to a
terminal state. The action to continue produces no reward and transitions to a new state
at the next time step. Let us refer to these 2 actions as: a = c (continue the option) and
a = e (exercise the option).
Since we know the exercise value in any state, we only need to create a linear function approximation for the continuation value, i.e., for the Q-Value Q(s, c) for all non-terminal states s. If we denote the payoff in non-terminal state s as g(s), then Q(s, e) = g(s). So we write:

Q̂(s, a; w) = ϕ(s)^T · w if a = c, and Q̂(s, a; w) = g(s) if a = e, for all s ∈ N

for feature functions ϕ(·) = [ϕ_i(·) | i = 1, . . . , m], which are functions of only the state (and not the action).
Each iteration of GPI in the LSPI algorithm starts with a deterministic target policy πD (·)
that is made available as a greedy policy derived from the previous iteration's LSTDQ-solved Q̂(s, a; w*). The LSTDQ solution for w* is based on pre-generated training data
and with the Q-Learning target policy set to be πD . Since we learn the Q-Value function
for only a = c, the behavior policy µ generating experiences data for training is a constant
function µ(s) = c. Note also that for American Options, the reward for a = c is 0. So each
atomic experience for training is of the form (s, c, 0, s′ ). This means we can represent each
atomic experience for training as a 2-tuple (s, s′ ). This reduces the LSPI Semi-Gradient
Equation (11.2) to:
Σ_i ϕ(s_i) · (ϕ(s_i)^T · w* − γ · Q̂(s'_i, π_D(s'_i); w*)) = 0     (11.3)
• C1: If s′i is non-terminal and πD (s′i ) = c (i.e., ϕ(s′i )T ·w ≥ g(s′i )): Substitute ϕ(s′i )T ·w∗
for Q̂(s′i , πD (s′i ); w∗ ) in Equation (11.3)
• C2: If s′i is a terminal state or πD (s′i ) = e (i.e., g(s′i ) > ϕ(s′i )T · w): Substitute g(s′i )
for Q̂(s′i , πD (s′i ); w∗ ) in Equation (11.3)
So we can rewrite Equation (11.3) using indicator notation I for cases C1, C2 as:

Σ_i ϕ(s_i) · (ϕ(s_i)^T · w* − I_{C1} · γ · ϕ(s'_i)^T · w* − I_{C2} · γ · g(s'_i)) = 0
This yields the solution w* = A⁻¹ · b, where the m × m matrix A is accumulated at each atomic experience (s_i, s'_i) as:

A ← A + ϕ(s_i) · (ϕ(s_i) − I_{C1} · γ · ϕ(s'_i))^T

and the m-Vector b is accumulated at each atomic experience (s_i, s'_i) as:

b ← b + I_{C2} · γ · g(s'_i) · ϕ(s_i)
Li, Szepesvari, Schuurmans (Li, Szepesvári, and Schuurmans 2009) recommend in their paper to use 7 feature functions: the first 4 Laguerre polynomials, which are functions of the underlying price, and 3 functions of time. Precisely, the feature functions they recommend are:
• ϕ_0(S_t) = 1
• ϕ_1(S_t) = e^{−M_t/2}
• ϕ_2(S_t) = e^{−M_t/2} · (1 − M_t)
• ϕ_3(S_t) = e^{−M_t/2} · (1 − 2M_t + M_t²/2)
• ϕ_0^{(t)}(t) = sin(π(T − t) / (2T))
• ϕ_1^{(t)}(t) = log(T − t)
• ϕ_2^{(t)}(t) = (t/T)²
where M_t = S_t/K (S_t is the current underlying price and K is the American Option strike), t is the current time, and T is the expiration time (i.e., 0 ≤ t < T).
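A direct Python rendering of these 7 feature functions (a small illustrative sketch, not the book's code, with the state assumed to be the pair (t, S_t) and with the strike and expiration passed in as constants) might look as follows:

from math import exp, log, pi, sin
from typing import Callable, List, Tuple

def american_option_lspi_features(
    strike: float,
    expiry: float
) -> List[Callable[[Tuple[float, float]], float]]:
    '''Each feature maps a state (t, s) = (current time, underlying price) to a float.'''
    def moneyness(s: float) -> float:
        return s / strike                                      # M_t = S_t / K

    def phi0(ts: Tuple[float, float]) -> float: return 1.0
    def phi1(ts: Tuple[float, float]) -> float: return exp(-moneyness(ts[1]) / 2)
    def phi2(ts: Tuple[float, float]) -> float:
        m = moneyness(ts[1])
        return exp(-m / 2) * (1 - m)
    def phi3(ts: Tuple[float, float]) -> float:
        m = moneyness(ts[1])
        return exp(-m / 2) * (1 - 2 * m + m * m / 2)
    def phi0_t(ts: Tuple[float, float]) -> float:
        return sin(pi * (expiry - ts[0]) / (2 * expiry))
    def phi1_t(ts: Tuple[float, float]) -> float: return log(expiry - ts[0])
    def phi2_t(ts: Tuple[float, float]) -> float: return (ts[0] / expiry) ** 2

    return [phi0, phi1, phi2, phi3, phi0_t, phi1_t, phi2_t]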
11.7. Value Function Geometry
Now we look deeper into the issue of the Deadly Triad (that we had alluded to in Chap-
ter 10) by viewing Value Functions as Vectors (as we had done in Chapter 3), understanding Value Function Vector transformations with a balance of geometric intuition and mathematical rigor, and providing insights into convergence issues for a variety of traditional loss functions used to develop RL algorithms. As ever, the best way to understand Vector
transformations is to visualize it and so, we loosely refer to this topic as Value Function Ge-
ometry. The geometric intuition is particularly useful for linear function approximations.
To promote intuition, we shall present this content for linear function approximations of
the Value Function and stick to Prediction (rather than Control) although many of the
concepts covered in this section are well-extensible to non-linear function approximations
and to the Control problem.
This treatment was originally presented in the LSPI paper by Lagoudakis and Parr (Lagoudakis
and Parr 2003) and has been covered in detail in the RL book by Sutton and Barto (Richard
S. Sutton and Barto 2018). This treatment of Value Functions as Vectors leads us in the
direction of overcoming the Deadly Triad by defining an appropriate loss function, cal-
culating whose gradient provides a more robust set of RL algorithms known as Gradient
Temporal-Difference (abbreviated as Gradient TD), which we shall cover in the next sec-
tion.
Along with visual intuition, it is important to write precise notation for Value Function
transformations and approximations. So we start with a set of formal definitions, keeping
the setting fairly simple and basic for ease of understanding.
We assume a finite state space S = {s_1, s_2, . . . , s_n} with no terminal states, a fixed (stationary) policy π, and m feature functions ϕ_1, ϕ_2, . . . , ϕ_m : S → R, giving the linear function approximation of the Value Function:

V_w(s) = ϕ(s)^T · w = Σ_{j=1}^{m} ϕ_j(s) · w_j for all s ∈ S
Assuming independence of the feature functions, the m feature functions give us m in-
dependent vectors in the vector space Rn . Feature function ϕj gives us the vector [ϕj (s1 ), ϕj (s2 ), . . . , ϕj (sn )] ∈
Rn . These m vectors are the m columns of the n×m matrix Φ = [ϕj (si )], 1 ≤ i ≤ n, 1 ≤ j ≤
m. The span of these m independent vectors is an m-dimensional vector subspace within
this n-dimensional vector space, spanned by the set of all w = (w1 , w2 , . . . , wm ) ∈ Rm . The
vector Vw = Φ · w in this vector subspace has coordinates [Vw (s1 ), Vw (s2 ), . . . , Vw (sn )].
The vector Vw is fully specified by w (so we often say w to mean Vw ). Our interest is in
identifying an appropriate w ∈ Rm that represents an adequate linear function approxi-
mation Vw = Φ · w of the Value Function V π .
We denote the probability distribution of occurrence of states under policy π as µπ :
S → [0, 1]. In accordance with the notation we used in Chapter 2, R(s, a) refers to the
Expected Reward upon taking action a in state s, and P(s, a, s′ ) refers to the probability of
transition from state s to state s′ upon taking action a. Define
R^π(s) = Σ_{a∈A} π(s, a) · R(s, a) for all s ∈ S

P^π(s, s') = Σ_{a∈A} π(s, a) · P(s, a, s') for all s, s' ∈ S
to denote the Expected Reward and state transition probabilities respectively of the π-
implied MRP.
Rπ refers to vector [Rπ (s1 ), Rπ (s2 ), . . . , Rπ (sn )] and P π refers to matrix [P π (si , si′ )], 1 ≤
i, i′ ≤ n. Denote γ < 1 (since there are no terminal states) as the MDP discount factor.
Note that B π is a linear operator in vector space Rn . So we henceforth denote and treat
B π as an n × n matrix, representing the linear operator. We’ve learnt in Chapter 2 that V π
is the fixed point of B π . Therefore, we can write:
Bπ · V π = V π
This means, if we start with an arbitrary Value Function vector V and repeatedly apply
B π , by Fixed-Point Theorem, we will reach the fixed point V π . We’ve learnt in Chapter
2 that this is in fact the Dynamic Programming Policy Evaluation algorithm. Note that
Tabular Monte Carlo also converges to V π (albeit slowly).
Next, we introduce the Projection Operator ΠΦ for the subspace spanned by the column
vectors (feature functions) of Φ. We define ΠΦ (V ) as the vector in the subspace spanned
by the column vectors of Φ that represents the orthogonal projection of Value Function
vector V on the Φ subspace. To make this precise, we first define “distance” d(V1 , V2 )
between Value Function vectors V1 , V2 , weighted by µπ across the n dimensions of V1 , V2 .
Specifically,
d(V_1, V_2) = Σ_{i=1}^{n} µ_π(s_i) · (V_1(s_i) − V_2(s_i))² = (V_1 − V_2)^T · D · (V_1 − V_2)
Figure 11.6.: Value Function Geometry (Image Credit: Sutton-Barto’s RL Book)
where D is the square diagonal matrix consisting of the diagonal elements µπ (si ), 1 ≤ i ≤
n.
With this “distance” metric, we define ΠΦ (V ) as the Value Function vector in the sub-
space spanned by the column vectors of Φ that is given by arg minw d(V , Vw ). This is a
weighted least squares regression with solution:
w* = (Φ^T · D · Φ)^{-1} · Φ^T · D · V

Therefore, the Projection Operator Π_Φ can be written as:

Π_Φ = Φ · (Φ^T · D · Φ)^{-1} · Φ^T · D
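As a quick illustration (a self-contained numpy snippet, with randomly-chosen Φ and µ_π purely for demonstration), the projection matrix Π_Φ can be computed directly from Φ and D, and one can verify that it is idempotent (projecting twice is the same as projecting once):

import numpy as np

rng = np.random.default_rng(42)
n, m = 6, 3
phi_matrix = rng.normal(size=(n, m))            # Phi: n x m feature matrix
mu = rng.random(n)
mu = mu / mu.sum()                              # mu_pi: state-occurrence distribution
D = np.diag(mu)

# Projection operator Pi_Phi = Phi (Phi^T D Phi)^{-1} Phi^T D
proj = phi_matrix @ np.linalg.inv(phi_matrix.T @ D @ phi_matrix) @ phi_matrix.T @ D

v = rng.normal(size=n)                          # an arbitrary Value Function vector
print(np.allclose(proj @ (proj @ v), proj @ v))  # True: projecting twice changes nothing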
The first Value Function vector of interest in the Φ subspace is the projection Π_Φ · V^π, which is what Monte-Carlo algorithms converge to. However, MC algorithms can be slow to converge, so we seek function approximations in the Φ subspace that are based on
Temporal-Difference (TD), i.e., bootstrapped methods. The remaining three Value Func-
tion vectors in the Φ subspace are based on TD methods.
We denote the second Value Function vector of interest in the Φ subspace as wBE . The
acronym BE stands for Bellman Error. To understand this, consider the application of the
Bellman Policy Operator B π on a Value Function vector Vw in the Φ subspace. Applying
B π on Vw typically throws Vw out of the Φ subspace. The idea is to find a Value Function
vector Vw in the Φ subspace such that the “distance” between Vw and B π · Vw is mini-
mized, i.e. we minimize the “error vector” BE = B π · Vw − Vw (Figure 11.6 provides the
visualization). Hence, we say we are minimizing the Bellman Error (or simply that we are
minimizing BE), and we refer to wBE as the Value Function vector in the Φ subspace for
which BE is minimized. Formally, we define it as:

w_BE = arg min_w d(B^π · V_w, V_w)

Since B^π · V_w = R^π + γP^π · Φ · w and V_w = Φ · w, this is a weighted least-squares problem whose solution is:

w_BE = ((Φ − γP^π · Φ)^T · D · (Φ − γP^π · Φ))^{-1} · (Φ − γP^π · Φ)^T · D · R^π
The above formulation can be used to compute wBE if we know the model probabilities
P π and reward function Rπ . But often, in practice, we don’t know P π and Rπ , in which
case we seek model-free learning of wBE , specifically with a TD (bootstrapped) algorithm.
One model-free approach is to attempt Stochastic Gradient Descent with the gradient of the square of the expected TD error, as follows:
∆w = −α · (1/2) · ∇_w (E_π[δ])²
   = −α · E_π[r + γ · ϕ(s')^T · w − ϕ(s)^T · w] · ∇_w E_π[δ]
   = α · (E_π[r + γ · ϕ(s')^T · w] − ϕ(s)^T · w) · (ϕ(s) − γ · E_π[ϕ(s')])
This is called the Residual Gradient algorithm, due to Leemon Baird (Baird 1995). It requires
two independent samples of s′ transitioning from s. If we do have that, it converges to wBE
robustly (even for non-linear function approximations). But this algorithm is slow, and
doesn’t converge to a desirable place. Another issue is that wBE is not learnable if we
can only access the features, and not underlying states. These issues led researchers to
consider alternative TD algorithms.
We denote the third Value Function vector of interest in the Φ subspace as wT DE and
define it as the vector in the Φ subspace for which the expected square of the TD error δ
(when following policy π) is minimized. Formally,
w_TDE = arg min_w Σ_{s∈S} µ_π(s) Σ_{r,s'} P_π(r, s'|s) · (r + γ · ϕ(s')^T · w − ϕ(s)^T · w)²
To perform Stochastic Gradient Descent, we have to estimate the gradient of the expected square of the TD error by sampling. The weight update for each gradient sample in the Stochastic Gradient Descent is:

∆w = −α · (1/2) · ∇_w δ² = α · δ · (ϕ(s) − γ · ϕ(s')) = α · (r + γ · ϕ(s')^T · w − ϕ(s)^T · w) · (ϕ(s) − γ · ϕ(s'))
This algorithm is called Naive Residual Gradient, due to Leemon Baird (Baird 1995). Naive
Residual Gradient converges robustly, but again, not to a desirable place. So researchers
had to look even further.
This brings us to the fourth (and final) Value Function vector of interest in the Φ sub-
space. We denote this Value Function vector as wP BE . The acronym P BE stands for Pro-
jected Bellman Error. To understand this, first consider the composition of the Projection
Operator ΠΦ and the Bellman Policy Operator B π , i.e., ΠΦ · B π (we call this composed
operator as the Projected Bellman operator). Visualize the application of this Projected Bell-
man operator on a Value Function vector Vw in the Φ subspace. Applying B π on Vw typ-
ically throws Vw out of the Φ subspace and then further applying ΠΦ brings it back to
the Φ subspace (call this resultant Value Function vector Vw′ ). The idea is to find a Value
Function vector Vw in the Φ subspace for which the “distance” between Vw and Vw′ is
minimized, i.e. we minimize the “error vector” P BE = ΠΦ · B π · Vw − Vw (Figure 11.6
provides the visualization). Hence, we say we are minimizing the Projected Bellman Error
(or simply that we are minimizing P BE), and we refer to wP BE as the Value Function
vector in the Φ subspace for which P BE is minimized. It turns out that the minimum of
PBE is actually zero, i.e., Φ · wP BE is a fixed point of operator ΠΦ · B π . Let us write out
this statement formally. We know:
ΠΦ = Φ · (ΦT · D · Φ)−1 · ΦT · D
B π · V = Rπ + γP π · V
The above formulation can be used to compute wP BE if we know the model probabil-
ities P π and reward function Rπ . But often, in practice, we don’t know P π and Rπ , in
which case we seek model-free learning of wP BE , specifically with a TD (bootstrapped)
algorithm.
The question is how do we construct matrix
A = ΦT · D · (Φ − γP π · Φ)
and vector
b = ΦT · D · Rπ
without a model?
Following policy π, each time we perform an individual transition from s to s′ getting
reward r, we get a sample estimate of A and b. The sample estimate of A is the outer-
product of vectors ϕ(s) and ϕ(s)−γ ·ϕ(s′ ). The sample estimate of b is scalar r times vector
ϕ(s). We average these sample estimates across many such individual transitions. Note
that this algorithm is exactly the Least Squares Temporal Difference (LSTD) algorithm
we’ve covered earlier in this chapter. Thus, we now know that LSTD converges to wP BE ,
i.e., minimizes (in fact takes down to 0) P BE. If the number of features m is large or if we
are doing non-linear function approximation or Off-Policy, then we seek a gradient-based
TD algorithm. It turns out that our usual Semi-Gradient TD algorithm converges to wP BE
in the case of on-policy linear function approximation. Note that the update for the usual
Semi-Gradient TD algorithm in the case of on-policy linear function approximation is as
follows:
∆w = α · (r + γ · ϕ(s')^T · w − ϕ(s)^T · w) · ϕ(s)

Setting the expected value of this update to 0 at convergence yields:

Φ^T · D · (Φ − γP^π · Φ) · w = Φ^T · D · R^π
11.8. Gradient Temporal-Difference (Gradient TD)
For on-policy linear function approximation, the semi-gradient TD algorithm gives us
wP BE . But to obtain wP BE in the case of non-linear function approximation or in the case
of Off-Policy, we need a different approach. The different approach is Gradient Temporal-
Difference (abbreviated, Gradient TD), the subject of this section.
The original Gradient TD algorithm, due to Sutton, Szepesvari, Maei (R. S. Sutton,
Szepesvári, and Maei 2008) is typically abbreviated as GTD. Researchers then came up
with a second-generation Gradient TD algorithm (R. S. Sutton et al. 2009), which is typi-
cally abbreviated as GTD-2. The same researchers also came up with a TD algorithm with
Gradient Correction (R. S. Sutton et al. 2009), which is typically abbreviated as TDC.
We now cover the TDC algorithm. For simplicity of articulation and ease of under-
standing, we restrict to the case of linear function approximation in our coverage of the
TDC algorithm below. However, do bear in mind that much of the concepts below extend
to non-linear function approximation (which is where we reap the benefits of Gradient
TD).
Our first task is to set up the appropriate loss function whose gradient will drive the Stochastic Gradient Descent. The loss function to be minimized is the (µ_π-weighted) squared magnitude of the Projected Bellman Error. Denoting the Bellman Error vector B^π · V_w − V_w as δ_w, this loss function can be written as:

L(w) = (Φ^T · D · δ_w)^T · (Φ^T · D · Φ)^{-1} · (Φ^T · D · δ_w)

whose gradient is:

∇_w L(w) = 2 · ∇_w(Φ^T · D · δ_w)^T · (Φ^T · D · Φ)^{-1} · (Φ^T · D · δ_w)

We want to estimate this gradient from individual transitions data. So we express each of the 3 terms forming the product in the gradient expression above as expectations of functions of individual transitions s → (r, s') (obtained while following policy π). Denoting r + γ · ϕ(s')^T · w − ϕ(s)^T · w as δ, we get:
ΦT · D · δw = E[δ · ϕ(s)]
∇w (ΦT · D · δw )T = E[(∇w δ) · ϕ(s)T ] = E[(γ · ϕ(s′ ) − ϕ(s)) · ϕ(s)T ]
ΦT · D · Φ = E[ϕ(s) · ϕ(s)T ]
Substituting, we get:
∇_w L(w) = 2 · E[(γ · ϕ(s') − ϕ(s)) · ϕ(s)^T] · (E[ϕ(s) · ϕ(s)^T])^{-1} · E[δ · ϕ(s)]

∆w = −α · (1/2) · ∇_w L(w)
   = α · E[(ϕ(s) − γ · ϕ(s')) · ϕ(s)^T] · (E[ϕ(s) · ϕ(s)^T])^{-1} · E[δ · ϕ(s)]
   = α · (E[ϕ(s) · ϕ(s)^T] − γ · E[ϕ(s') · ϕ(s)^T]) · (E[ϕ(s) · ϕ(s)^T])^{-1} · E[δ · ϕ(s)]
   = α · (E[δ · ϕ(s)] − γ · E[ϕ(s') · ϕ(s)^T] · (E[ϕ(s) · ϕ(s)^T])^{-1} · E[δ · ϕ(s)])
   = α · (E[δ · ϕ(s)] − γ · E[ϕ(s') · ϕ(s)^T] · θ)
where θ = (E[ϕ(s) · ϕ(s)T ])−1 · E[δ · ϕ(s)] is the solution to the weighted least-squares
linear regression of B π · V − V against Φ, with weights as µπ .
We can perform this gradient descent with a technique known as Cascade Learning, which involves simultaneously updating both w and θ (with θ converging faster). The updates are as follows:

∆w = α · δ · ϕ(s) − α · γ · ϕ(s') · (ϕ(s)^T · θ)
∆θ = β · (δ − ϕ(s)^T · θ) · ϕ(s)

where β is the learning rate for the θ updates.
419
12. Policy Gradient Algorithms
It’s time to take stock of what we have learnt so far to set up context for this chapter. So
far, we have covered a range of RL Control algorithms, all of which are based on General-
ized Policy Iteration (GPI). All of these algorithms perform GPI by learning the Q-Value
Function and improving the policy by identifying the action that fetches the best Q-Value
(i.e., action value) for each state. Notice that the way we implemented this best action iden-
tification is by sweeping through all the actions for each state. This works well only if the
set of actions for each state is reasonably small. But if the action space is large/continuous,
we have to resort to some sort of optimization method to identify the best action for each
state, which is potentially complicated and expensive.
In this chapter, we cover RL Control algorithms that take a vastly different approach.
These Control algorithms are still based on GPI, but the Policy Improvement of their GPI
is not based on consulting the Q-Value Function, as has been the case with Control al-
gorithms we covered in the previous two chapters. Rather, the approach in the class of
algorithms we cover in this chapter is to directly find the Policy that fetches the “Best Ex-
pected Returns.” Specifically, the algorithms of this chapter perform a Gradient Ascent
on “Expected Returns” with the gradient defined with respect to the parameters of a Pol-
icy function approximation. We shall work with a stochastic policy of the form π(s, a; θ),
with θ denoting the parameters of the policy function approximation π. So we are basically
learning this parameterized policy that selects actions without consulting a Value Func-
tion. Note that we might still engage a Value Function approximation (call it Q(s, a; w)) in our algorithm, but its role is only to help learn the policy parameters θ and not to
identify the action with the best action-value for each state. So the two function approx-
imations π(s, a; θ) and Q(s, a; w) collaborate to improve the policy using gradient ascent
(based on gradient of “expected returns” with respect to θ). π(s, a; θ) is the primary
worker here (known as Actor) and Q(s, a; w) is the support worker (known as Critic).
The Critic parameters w are optimized by minimizing a suitable loss function defined in
terms of Q(s, a; w) while the Actor parameters θ are optimized by maximizing a suitable
“Expected Returns” function. Note that we still haven't defined what this “Expected Returns” function is (we will do so shortly), but we already see that this idea is appealing
for large/continuous action spaces where sweeping through actions is infeasible. We will
soon dig into the details of this new approach to RL Control (known as Policy Gradient, ab-
breviated as PG) - for now, it’s important to recognize the big picture that PG is basically
GPI with Policy Improvement done as a Policy Gradient Ascent.
The contrast between the RL Control algorithms covered in the previous two chapters
and the algorithms of this chapter actually is part of the following bigger-picture classifi-
cation of learning algorithms for Control:
• Value Function-based: Here we learn the Value Function (typically with a function
approximation for the Value Function) and the Policy is implicit, readily derived
from the Value Function (e.g., ϵ-greedy).
• Policy-based: Here we learn the Policy (with a function approximation for the Pol-
icy), and there is no need to learn a Value Function.
• Actor-Critic: Here we primarily learn the Policy (with a function approximation for
the Policy, known as Actor), and secondarily learn the Value Function (with a func-
tion approximation for the Value Function, known as Critic).
Let us start by enumerating the advantages of PG algorithms. We’ve already said that
PG algorithms are effective in large action spaces, especially high-dimensional or contin-
uous action spaces, because in such spaces selecting an action by deriving an improved
policy from an updating Q-Value function is intractable. A key advantage of PG is that it
naturally explores because the policy function approximation is configured as a stochastic
policy. Moreover, PG finds the best Stochastic Policy. This is not a factor for MDPs since
we know that there exists an optimal Deterministic Policy for any MDP but we often deal
with Partially-Observable MDPs (POMDPs) in the real-world, for which the set of optimal
policies might all be stochastic policies. We have an advantage in the case of MDPs as well
since PG algorithms naturally converge to the deterministic policy (the variance in the pol-
icy distribution will automatically converge to 0) whereas in Value Function-based algo-
rithms, we have to reduce the ϵ of the ϵ-greedy policy by hand, and the appropriate declining trajectory of ϵ is typically hard to figure out by manual tuning. In situations where the policy function is simpler than the Value Function, we naturally benefit from pursuing Policy-based algorithms rather than Value Function-based algorithms. Perhaps the biggest advantage of PG algorithms is that prior knowledge of the functional form of the Optimal Policy enables us to build that known functional form into the function approximation for the policy. Lastly, PG offers numerical benefits as small changes in θ yield
small changes in π, and consequently small changes in the distribution of occurrences of
states. This results in stronger convergence guarantees for PG algorithms relative to Value
Function-based algorithms.
Now let’s understand the disadvantages of PG Algorithms. The main disadvantage of
PG Algorithms is that because they are based on gradient ascent, they typically converge to
a local optimum whereas Value Function-based algorithms converge to a global optimum.
Furthermore, the Policy Evaluation of PG is typically inefficient and can have high vari-
ance. Lastly, the Policy Improvements of PG happen in small steps and so, PG algorithms
are slow to converge.
12.2. Policy Gradient Theorem
In this section, we start by setting up some notation, and then state and prove the Policy
Gradient Theorem (abbreviated as PGT). The PGT provides the key calculation for PG
Algorithms.
Value Function $V^{\pi}(s)$ and Action Value Function $Q^{\pi}(s, a)$ are defined as:
$$V^{\pi}(s) = \mathbb{E}_{\pi}\left[\sum_{k=t}^{\infty} \gamma^{k-t} \cdot R_{k+1} \mid S_t = s\right] \text{ for all } t = 0, 1, 2, \ldots$$
$$Q^{\pi}(s, a) = \mathbb{E}_{\pi}\left[\sum_{k=t}^{\infty} \gamma^{k-t} \cdot R_{k+1} \mid S_t = s, A_t = a\right] \text{ for all } t = 0, 1, 2, \ldots$$
J(θ), V π , Qπ are all measures of Expected Returns, so it pays to specify exactly how they
differ. J(θ) is the Expected Return when following policy π (that is parameterized by θ),
averaged over all states s ∈ N and all actions a ∈ A. The idea is to perform a gradient as-
cent with J(θ) as the objective function, with each step in the gradient ascent essentially
pushing θ (and hence, π) in a desirable direction, until J(θ) is maximized. V π (s) is the
Expected Return for a specific state s ∈ N when following policy π. Qπ (s, a) is the Ex-
pected Return for a specific state s ∈ N and specific action a ∈ A when following policy
π.
We define the Advantage Function as:
$$A^{\pi}(s, a) = Q^{\pi}(s, a) - V^{\pi}(s)$$
The advantage function captures how much more value a particular action provides relative to the average value across actions (for a given state). The advantage function plays an important role in reducing the variance in PG Algorithms.
Also, p(s → s′ , t, π) will be a key function for us in the PGT proof - it denotes the prob-
ability of going from state s to s′ in t steps by following policy π.
We express the “Expected Returns” Objective J(θ) as follows:
$$J(\theta) = \mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty} \gamma^t \cdot R_{t+1}\right] = \sum_{t=0}^{\infty} \gamma^t \cdot \mathbb{E}_{\pi}[R_{t+1}]$$
$$= \sum_{t=0}^{\infty} \gamma^t \cdot \sum_{s \in \mathcal{N}} \left(\sum_{S_0 \in \mathcal{N}} p_0(S_0) \cdot p(S_0 \to s, t, \pi)\right) \cdot \sum_{a \in \mathcal{A}} \pi(s, a; \theta) \cdot \mathcal{R}_s^a$$
$$= \sum_{s \in \mathcal{N}} \left(\sum_{S_0 \in \mathcal{N}} \sum_{t=0}^{\infty} \gamma^t \cdot p_0(S_0) \cdot p(S_0 \to s, t, \pi)\right) \cdot \sum_{a \in \mathcal{A}} \pi(s, a; \theta) \cdot \mathcal{R}_s^a$$
Definition 12.2.1.
$$J(\theta) = \sum_{s \in \mathcal{N}} \rho^{\pi}(s) \cdot \sum_{a \in \mathcal{A}} \pi(s, a; \theta) \cdot \mathcal{R}_s^a$$
where
$$\rho^{\pi}(s) = \sum_{S_0 \in \mathcal{N}} \sum_{t=0}^{\infty} \gamma^t \cdot p_0(S_0) \cdot p(S_0 \to s, t, \pi)$$
is the key function (for PG) that we shall refer to as the Discounted-Aggregate State-Visitation Measure. Note that $\rho^{\pi}(s)$ is a measure over the set of non-terminal states, but is not a probability measure. Think of $\rho^{\pi}(s)$ as weights reflecting the relative likelihood of occurrence of states on a trace experience (adjusted for discounting, i.e., lesser importance to reaching a state later on a trace experience). We can still talk about the distribution of states under the measure $\rho^{\pi}$, but we say that this distribution is improper to convey the fact that $\sum_{s \in \mathcal{N}} \rho^{\pi}(s) \neq 1$ (i.e., the distribution is not normalized). We talk about this improper distribution of states under the measure $\rho^{\pi}$ so we can use (as a convenience) the "expected value" notation for any random variable $f : \mathcal{N} \to \mathbb{R}$ under this improper distribution, i.e., we use the notation:
$$\mathbb{E}_{s \sim \rho^{\pi}}[f(s)] = \sum_{s \in \mathcal{N}} \rho^{\pi}(s) \cdot f(s)$$
Using this notation, we can re-write the above definition of J(θ) as:
$$J(\theta) = \mathbb{E}_{s \sim \rho^{\pi}, a \sim \pi}[\mathcal{R}_s^a]$$
As mentioned above, note that ρπ (s) (representing the discounting-adjusted probability
distribution of occurrence of states, ignoring normalizing factor turning the ρπ measure
into a probability measure) depends on θ but there’s no ∇θ ρπ (s) term in ∇θ J(θ).
Also note that:
∇θ π(s, a; θ) = π(s, a; θ) · ∇θ log π(s, a; θ)
∇θ log π(s, a; θ) is the Score function (Gradient of log-likelihood) that is commonly used
in Statistics.
Since ρπ is the Discounted-Aggregate State-Visitation Measure, we can sample-estimate ∇θ J(θ)
by calculating γ t ·(∇θ log π(St , At ; θ))·Qπ (St , At ) at each time step in each trace experience
(noting that the state occurrence probabilities and action occurrence probabilities are im-
plicit in the trace experiences, and ignoring the probability measure-normalizing factor),
and update the parameters θ (according to Stochastic Gradient Ascent) using each atomic
experience’s ∇θ J(θ) estimate.
We typically calculate the Score ∇θ log π(s, a; θ) using an analytically-convenient func-
tional form for the conditional probability distribution a|s (in terms of θ) so that the
derivative of the logarithm of this functional form is analytically tractable (this will be
clear in the next section when we consider a couple of examples of canonical functional
forms for a|s). In many PG Algorithms, we estimate Qπ (s, a) with a function approxima-
tion Q(s, a; w). We will later show how to avoid the estimate bias of Q(s, a; w).
Thus, the PGT enables a numerical estimate of ∇θ J(θ) which in turn enables Policy
Gradient Ascent.
Note that $\nabla_\theta \mathcal{R}_{S_0}^{A_0} = 0$, so remove that term.
$$\nabla_\theta J(\theta) = \sum_{S_0 \in \mathcal{N}} p_0(S_0) \cdot \sum_{A_0 \in \mathcal{A}} \nabla_\theta \pi(S_0, A_0; \theta) \cdot Q^{\pi}(S_0, A_0) + \sum_{S_0 \in \mathcal{N}} p_0(S_0) \cdot \sum_{A_0 \in \mathcal{A}} \pi(S_0, A_0; \theta) \cdot \nabla_\theta \left(\sum_{S_1 \in \mathcal{N}} \gamma \cdot \mathcal{P}_{S_0,S_1}^{A_0} \cdot V^{\pi}(S_1)\right)$$
Now bring the $\nabla_\theta$ inside the $\sum_{S_1 \in \mathcal{N}}$ to apply only on $V^{\pi}(S_1)$.
$$\nabla_\theta J(\theta) = \sum_{S_0 \in \mathcal{N}} p_0(S_0) \cdot \sum_{A_0 \in \mathcal{A}} \nabla_\theta \pi(S_0, A_0; \theta) \cdot Q^{\pi}(S_0, A_0) + \sum_{S_0 \in \mathcal{N}} p_0(S_0) \cdot \sum_{A_0 \in \mathcal{A}} \pi(S_0, A_0; \theta) \cdot \sum_{S_1 \in \mathcal{N}} \gamma \cdot \mathcal{P}_{S_0,S_1}^{A_0} \cdot \nabla_\theta V^{\pi}(S_1)$$
Now bring $\sum_{S_0 \in \mathcal{N}}$ and $\sum_{A_0 \in \mathcal{A}}$ inside the $\sum_{S_1 \in \mathcal{N}}$:
$$\nabla_\theta J(\theta) = \sum_{S_0 \in \mathcal{N}} p_0(S_0) \cdot \sum_{A_0 \in \mathcal{A}} \nabla_\theta \pi(S_0, A_0; \theta) \cdot Q^{\pi}(S_0, A_0) + \gamma \cdot \sum_{S_1 \in \mathcal{N}} \sum_{S_0 \in \mathcal{N}} p_0(S_0) \cdot \left(\sum_{A_0 \in \mathcal{A}} \pi(S_0, A_0; \theta) \cdot \mathcal{P}_{S_0,S_1}^{A_0}\right) \cdot \nabla_\theta V^{\pi}(S_1)$$
Note that $\sum_{A_0 \in \mathcal{A}} \pi(S_0, A_0; \theta) \cdot \mathcal{P}_{S_0,S_1}^{A_0} = p(S_0 \to S_1, 1, \pi)$. So,
$$\nabla_\theta J(\theta) = \sum_{S_0 \in \mathcal{N}} p_0(S_0) \cdot \sum_{A_0 \in \mathcal{A}} \nabla_\theta \pi(S_0, A_0; \theta) \cdot Q^{\pi}(S_0, A_0) + \gamma \cdot \sum_{S_1 \in \mathcal{N}} \sum_{S_0 \in \mathcal{N}} p_0(S_0) \cdot p(S_0 \to S_1, 1, \pi) \cdot \nabla_\theta V^{\pi}(S_1)$$
Now expand $V^{\pi}(S_1)$ to $\sum_{A_1 \in \mathcal{A}} \pi(S_1, A_1; \theta) \cdot Q^{\pi}(S_1, A_1)$.
$$\nabla_\theta J(\theta) = \sum_{S_0 \in \mathcal{N}} p_0(S_0) \cdot \sum_{A_0 \in \mathcal{A}} \nabla_\theta \pi(S_0, A_0; \theta) \cdot Q^{\pi}(S_0, A_0) + \gamma \cdot \sum_{S_1 \in \mathcal{N}} \sum_{S_0 \in \mathcal{N}} p_0(S_0) \cdot p(S_0 \to S_1, 1, \pi) \cdot \nabla_\theta \left(\sum_{A_1 \in \mathcal{A}} \pi(S_1, A_1; \theta) \cdot Q^{\pi}(S_1, A_1)\right)$$
We are now back to where we started, calculating the gradient of $\sum_a \pi \cdot Q^{\pi}$. Follow the same process of calculating the gradient of $\pi \cdot Q^{\pi}$ by parts, then Bellman-expanding $Q^{\pi}$ (to calculate its gradient), and iterate.
$$\nabla_\theta J(\theta) = \sum_{S_0 \in \mathcal{N}} p_0(S_0) \cdot \sum_{A_0 \in \mathcal{A}} \nabla_\theta \pi(S_0, A_0; \theta) \cdot Q^{\pi}(S_0, A_0) + \gamma \cdot \sum_{S_1 \in \mathcal{N}} \sum_{S_0 \in \mathcal{N}} p_0(S_0) \cdot p(S_0 \to S_1, 1, \pi) \cdot \left(\sum_{A_1 \in \mathcal{A}} \nabla_\theta \pi(S_1, A_1; \theta) \cdot Q^{\pi}(S_1, A_1) + \ldots\right)$$
Bring $\sum_{t=0}^{\infty}$ inside $\sum_{S_t \in \mathcal{N}}$ and $\sum_{S_0 \in \mathcal{N}}$, and note that $\sum_{A_t \in \mathcal{A}} \nabla_\theta \pi(S_t, A_t; \theta) \cdot Q^{\pi}(S_t, A_t)$ is independent of $t$. Then,
$$\nabla_\theta J(\theta) = \sum_{s \in \mathcal{N}} \sum_{S_0 \in \mathcal{N}} \sum_{t=0}^{\infty} \gamma^t \cdot p_0(S_0) \cdot p(S_0 \to s, t, \pi) \cdot \sum_{a \in \mathcal{A}} \nabla_\theta \pi(s, a; \theta) \cdot Q^{\pi}(s, a)$$
Remember that $\sum_{S_0 \in \mathcal{N}} \sum_{t=0}^{\infty} \gamma^t \cdot p_0(S_0) \cdot p(S_0 \to s, t, \pi) \overset{def}{=} \rho^{\pi}(s)$. So,
$$\nabla_\theta J(\theta) = \sum_{s \in \mathcal{N}} \rho^{\pi}(s) \cdot \sum_{a \in \mathcal{A}} \nabla_\theta \pi(s, a; \theta) \cdot Q^{\pi}(s, a)$$
Q.E.D.
This proof is borrowed from the Appendix of the famous paper by Sutton, McAllester,
Singh, Mansour on Policy Gradient Methods for Reinforcement Learning with Function
Approximation (R. Sutton et al. 2001).
Note that using the “Expected Value” notation under the improper distribution implied
by the Discounted-Aggregate State-Visitation Measure ρπ , we can write the statement of
PGT as:
$$\nabla_\theta J(\theta) = \sum_{s \in \mathcal{N}} \rho^{\pi}(s) \cdot \sum_{a \in \mathcal{A}} \pi(s, a; \theta) \cdot (\nabla_\theta \log \pi(s, a; \theta)) \cdot Q^{\pi}(s, a) = \mathbb{E}_{s \sim \rho^{\pi}, a \sim \pi}[(\nabla_\theta \log \pi(s, a; \theta)) \cdot Q^{\pi}(s, a)]$$
As explained earlier, since the state occurrence probabilities and action occurrence prob-
abilities are implicit in the trace experiences, we can sample-estimate ∇θ J(θ) by calculat-
ing γ t · (∇θ log π(St , At ; θ)) · Qπ (St , At ) at each time step in each trace experience, and
update the parameters θ (according to Stochastic Gradient Ascent) with this calculation.
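As a concrete illustration of this sample-based estimate, here is a minimal sketch of the per-trace-experience update (the names trace, score_func and q are ours; in the REINFORCE algorithm covered later in this chapter, the trace experience return serves as the estimate of $Q^{\pi}(S_t, A_t)$):

import numpy as np
from typing import Callable, Sequence, Tuple

def pg_trace_update(
    theta: np.ndarray,
    trace: Sequence[Tuple[object, object, float]],  # [(S_t, A_t, q_t)], q_t estimating Q^pi(S_t, A_t)
    score_func: Callable[[object, object, np.ndarray], np.ndarray],  # grad_theta log pi(s, a; theta)
    alpha: float,
    gamma: float
) -> np.ndarray:
    # Stochastic Gradient Ascent: theta <- theta + alpha * gamma^t * score(S_t, A_t; theta) * q_t
    gamma_prod = 1.0
    for s, a, q in trace:
        theta = theta + alpha * gamma_prod * score_func(s, a, theta) * q
        gamma_prod *= gamma
    return theta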
For finite action spaces, a canonical functional form for the policy uses action features ϕ(s, a) and takes the softmax form:
$$\pi(s, a; \theta) = \frac{e^{\phi(s,a)^T \cdot \theta}}{\sum_{b \in \mathcal{A}} e^{\phi(s,b)^T \cdot \theta}} \text{ for all } s \in \mathcal{N}, a \in \mathcal{A}$$
Then the score function is given by:
$$\nabla_\theta \log \pi(s, a; \theta) = \phi(s, a) - \sum_{b \in \mathcal{A}} \pi(s, b; \theta) \cdot \phi(s, b) = \phi(s, a) - \mathbb{E}_{\pi}[\phi(s, \cdot)]$$
The intuitive interpretation is that the score function for an action a represents the “ad-
vantage” of the feature vector for action a over the mean feature vector (across all actions),
for a given state s.
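A minimal NumPy sketch of this canonical policy and its score (the feature-matrix layout and the function names below are our own choices):

import numpy as np

def softmax_policy_probs(phi_sa: np.ndarray, theta: np.ndarray) -> np.ndarray:
    # phi_sa has shape (num_actions, m), with row a holding the feature vector phi(s, a)
    prefs = phi_sa.dot(theta)
    prefs = prefs - prefs.max()          # subtract max for numerical stability
    exp_prefs = np.exp(prefs)
    return exp_prefs / exp_prefs.sum()

def softmax_score(phi_sa: np.ndarray, theta: np.ndarray, a: int) -> np.ndarray:
    # Score = phi(s, a) - sum_b pi(s, b; theta) * phi(s, b)
    probs = softmax_policy_probs(phi_sa, theta)
    return phi_sa[a] - probs.dot(phi_sa)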
12.3.2. Canonical π(s, a; θ) for Single-Dimensional Continuous Action Spaces
For single-dimensional continuous action spaces (i.e., A = R), we often use a Gaussian
distribution for the Policy. Assume θ is an m-vector (θ1 , . . . , θm ) and assume the state
features vector ϕ(s) is given by (ϕ1 (s), . . . , ϕm (s)) for all s ∈ N .
We set the mean of the gaussian distribution for the Policy as a linear combination of
state features, i.e., ϕ(s)T · θ, and we set the variance to be a fixed value, say σ 2 . We could
make the variance parameterized as well, but let’s work with fixed variance to keep things
simple.
The Gaussian policy selects an action $a$ as follows:
$$a \sim \mathcal{N}(\phi(s)^T \cdot \theta, \sigma^2)$$
The score function is then given by:
$$\nabla_\theta \log \pi(s, a; \theta) = \frac{(a - \phi(s)^T \cdot \theta) \cdot \phi(s)}{\sigma^2}$$
This is easily extensible to multi-dimensional continuous action spaces by considering
a multi-dimensional gaussian distribution for the Policy.
The intuitive interpretation is that the score function for an action a is proportional to
the feature vector for given state s scaled by the “advantage” of the action a over the mean
action (note: each a ∈ R).
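The score formula above can be sanity-checked numerically. A small sketch (all numbers below are arbitrary illustrative values, and the function names are ours):

import numpy as np

def gaussian_log_pi(a: float, phi_s: np.ndarray, theta: np.ndarray, sigma: float) -> float:
    mean = phi_s.dot(theta)
    return -0.5 * np.log(2 * np.pi * sigma ** 2) - (a - mean) ** 2 / (2 * sigma ** 2)

def gaussian_score(a: float, phi_s: np.ndarray, theta: np.ndarray, sigma: float) -> np.ndarray:
    # Score = (a - phi(s)^T . theta) * phi(s) / sigma^2
    return (a - phi_s.dot(theta)) * phi_s / sigma ** 2

phi_s, theta, a, sigma, eps = np.array([1.0, 2.0]), np.array([0.3, -0.1]), 0.5, 0.8, 1e-6
finite_diff = np.array([
    (gaussian_log_pi(a, phi_s, theta + eps * e, sigma) -
     gaussian_log_pi(a, phi_s, theta - eps * e, sigma)) / (2 * eps)
    for e in np.eye(len(theta))
])
assert np.allclose(finite_diff, gaussian_score(a, phi_s, theta, sigma), atol=1e-5)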
For each of the above two examples (finite action spaces and continuous action spaces),
think of the “features advantage” of an action as the compass for the Gradient Ascent.
The gradient estimate for an encountered action is proportional to the action’s “features
advantage” scaled by the action’s Value Function. The intuition is that the Gradient Ascent
encourages picking actions that are yielding more favorable outcomes (Policy Improvement)
so as to ultimately get to a point where the optimal action is selected for each state.
This Policy Gradient algorithm is Monte-Carlo because it is not bootstrapped (com-
plete returns are used as an unbiased sample of Qπ , rather than a bootstrapped estimate).
In terms of our previously-described classification of RL algorithms as Value Function-
based or Policy-based or Actor-Critic, REINFORCE is a Policy-based algorithm since RE-
INFORCE does not involve learning a Value Function.
Now let’s write some code to implement the REINFORCE algorithm. In this chapter, we
will focus our Python code implementation of Policy Gradient algorithms to continuous
action spaces, although it should be clear based on the discussion so far that the Policy
Gradient approach applies to arbitrary action spaces (we’ve already seen an example of
the policy function parameterization for discrete action spaces). To keep things simple, the
function reinforce_gaussian below implements REINFORCE for the simple case of single-
dimensional continuous action space (i.e. A = R), although this can be easily extended to
multi-dimensional continuous action spaces. So in the code below, we work with a generic
state space given by TypeVar(’S’) and the action space is specialized to float (representing
R).
As seen earlier in the canonical example for single-dimensional continuous action space,
we assume a Gaussian distribution for the policy. Specifically, the policy is represented by
an arbitrary parameterized function approximation using the class FunctionApprox. As a
reminder, an instance of FunctionApprox represents a probability distribution function f of the conditional random variable y|x, where x belongs to an arbitrary domain X and y ∈ R (the probability of y conditional on x is denoted as f(x; θ)(y), where θ denotes
the parameters of the FunctionApprox). Note that the evaluate method of FunctionApprox
takes as input an Iterable of x values and calculates g(x; θ) = Ef (x;θ) [y] for each of the x
values. In our case here, x represents non-terminal states in N and y represents actions
in R, so f (s; θ) denotes the probability distribution of actions, conditional on state s ∈ N ,
and g(s; θ) represents the Expected Value of (real-numbered) actions, conditional on state
s ∈ N . Since we have assumed the policy to be Gaussian,
$$\pi(s, a; \theta) = \frac{1}{\sqrt{2\pi\sigma^2}} \cdot e^{-\frac{(a - g(s;\theta))^2}{2\sigma^2}}$$
To be clear, our code below works with the abstract base class FunctionApprox (meaning it is an arbitrary parameterized function approximation) with the assumption that the
probability distribution of actions given a state is Gaussian whose variance σ 2 is assumed
to be a constant. Assume we have m features for our function approximation, denoted as
ϕ(s) = (ϕ1 (s), . . . , ϕm (s)) for all s ∈ N .
σ is specified in the code below as policy_stdev. The input policy_mean_approx0: FunctionApprox[NonTerminal[S]] specifies the function approximation we initialize the algorithm with (it is up to the user
of reinforce_gaussian to configure policy_mean_approx0 with the appropriate functional
form for the function approximation, the hyper-parameter values, and the initial values of
the parameters θ that we want to solve for).
The Gaussian policy (of the type GaussianPolicyFromApprox) selects an action a (given
state s) by sampling from the gaussian distribution defined by mean g(s; θ) and variance
σ2.
The score function is given by:
$$\nabla_\theta \log \pi(s, a; \theta) = \frac{(a - g(s; \theta)) \cdot \nabla_\theta g(s; \theta)}{\sigma^2}$$
The outer loop of reinforce_gaussian generates trace experiences by simulating the given MDP (using the initial states distribution p0 : N → [0, 1]) and the current policy π (that is parameterized by θ, which updates after each trace experience). The inner loop loops over an Iterator of step: ReturnStep[S, float] objects produced by the returns method for each trace experience.
The variable grad is assigned the value of the negative score for an encountered (St, At) in a trace experience, i.e., it is assigned the value:
$$-\nabla_\theta \log \pi(S_t, A_t; \theta) = \frac{(g(S_t; \theta) - A_t) \cdot \nabla_\theta g(S_t; \theta)}{\sigma^2}$$
We negate the sign of the score because we are performing Gradient Ascent rather than Gradient Descent (the FunctionApprox class has been written for Gradient Descent). The variable scaled_grad multiplies the negative of the score (grad) with γ^t (gamma_prod) and the return Gt (step.return_). The rest of the code should be self-explanatory.
reinforce_gaussian returns an Iterable of FunctionApprox representing the stream of
updated policies π(s, ·; θ), with each of these FunctionApprox being generated (using yield)
at the end of each trace experience.
import numpy as np
from dataclasses import dataclass
from typing import Iterable, Iterator, Sequence, TypeVar
from rl.distribution import Distribution, Gaussian
from rl.policy import Policy
from rl.markov_process import NonTerminal
from rl.markov_decision_process import MarkovDecisionProcess, TransitionStep
from rl.function_approx import FunctionApprox, Gradient
from rl.returns import returns  # assumed location of the returns() helper used below
S = TypeVar('S')
@dataclass(frozen=True)
class GaussianPolicyFromApprox(Policy[S, float]):
function_approx: FunctionApprox[NonTerminal[S]]
stdev: float
def act(self, state: NonTerminal[S]) -> Gaussian:
return Gaussian(
mu=self.function_approx(state),
sigma=self.stdev
)
def reinforce_gaussian(
mdp: MarkovDecisionProcess[S, float],
policy_mean_approx0: FunctionApprox[NonTerminal[S]],
start_states_distribution: Distribution[NonTerminal[S]],
policy_stdev: float,
gamma: float,
episode_length_tolerance: float
) -> Iterator[FunctionApprox[NonTerminal[S]]]:
policy_mean_approx: FunctionApprox[NonTerminal[S]] = policy_mean_approx0
yield policy_mean_approx
while True:
policy: Policy[S, float] = GaussianPolicyFromApprox(
function_approx=policy_mean_approx,
stdev=policy_stdev
)
trace: Iterable[TransitionStep[S, float]] = mdp.simulate_actions(
start_states=start_states_distribution,
policy=policy
)
gamma_prod: float = 1.0
for step in returns(trace, gamma, episode_length_tolerance):
def obj_deriv_out(
states: Sequence[NonTerminal[S]],
actions: Sequence[float]
) -> np.ndarray:
return (policy_mean_approx.evaluate(states) -
np.array(actions)) / (policy_stdev * policy_stdev)
grad: Gradient[FunctionApprox[NonTerminal[S]]] = \
policy_mean_approx.objective_gradient(
xy_vals_seq=[(step.state, step.action)],
obj_deriv_out_fun=obj_deriv_out
)
scaled_grad: Gradient[FunctionApprox[NonTerminal[S]]] = \
grad * gamma_prod * step.return_
policy_mean_approx = \
policy_mean_approx.update_with_gradient(scaled_grad)
gamma_prod *= gamma
yield policy_mean_approx
The method get_mdp below sets up this MDP (should be self-explanatory as the con-
struction is very similar to the construction of the single-step MDPs in AssetAllocDiscrete).
for f in self.policy_feature_funcs:
def this_f(st: NonTerminal[AssetAllocState], f=f) -> float:
return f(st.state)
ffs.append(this_f)
return DNNApprox.create(
feature_functions=ffs,
dnn_spec=self.policy_mean_dnn_spec,
adam_gradient=adam_gradient
)
def reinforce(self) -> \
Iterator[FunctionApprox[NonTerminal[AssetAllocState]]]:
return reinforce_gaussian(
mdp=self.get_mdp(),
policy_mean_approx0=self.policy_mean_approx(),
start_states_distribution=self.start_states_distribution(),
policy_stdev=self.policy_stdev,
gamma=1.0,
episode_length_tolerance=1e-5
)
Next, we print the closed-form solution of the optimal action for states at each time step
(note: the closed-form solution for optimal action is independent of wealth Wt , and is only
dependent on t).
base_alloc: float = (mu - r) / (a * sigma * sigma)
for t in range(steps):
    alloc: float = base_alloc / (1 + r) ** (steps - t - 1)
    print(f"Time {t:d}: Optimal Risky Allocation = {alloc:.3f}")
This prints:
Time 0: Optimal Risky Allocation = 1.144
Time 1: Optimal Risky Allocation = 1.224
Time 2: Optimal Risky Allocation = 1.310
Time 3: Optimal Risky Allocation = 1.402
Time 4: Optimal Risky Allocation = 1.500
Next we set up an instance of AssetAllocPG with the above parameters. Note that the
policy_mean_dnn_spec argument to the constructor of AssetAllocPG is set up as a trivial
neural network with no hidden layers and the identity function as the output layer activa-
tion function. Note also that the policy_feature_funcs argument to the constructor is set
up with the single feature function (1 + r)t .
from rl.distribution import Gaussian
from rl.function_approx import DNNSpec
risky_ret: Sequence[Gaussian] = [Gaussian(mu=mu, sigma=sigma)
for _ in range(steps)]
riskless_ret: Sequence[float] = [r for _ in range(steps)]
utility_function: Callable[[float], float] = lambda x: - np.exp(-a * x) / a
policy_feature_funcs: Sequence[Callable[[AssetAllocState], float]] = \
[
lambda w_t: (1 + r) ** w_t[1]
]
init_wealth_distr: Gaussian = Gaussian(mu=init_wealth, sigma=init_wealth_stdev)
policy_mean_dnn_spec: DNNSpec = DNNSpec(
neurons=[],
bias=False,
hidden_activation=lambda x: x,
hidden_activation_deriv=lambda y: np.ones_like(y),
output_activation=lambda x: x,
output_activation_deriv=lambda y: np.ones_like(y)
)
aad: AssetAllocPG = AssetAllocPG(
risky_return_distributions=risky_ret,
riskless_returns=riskless_ret,
utility_func=utility_function,
policy_feature_funcs=policy_feature_funcs,
policy_mean_dnn_spec=policy_mean_dnn_spec,
policy_stdev=policy_stdev,
initial_wealth_distribution=init_wealth_distr
)
Next, we invoke the method reinforce of this AssetAllocPG instance. In practice, we’d
have parameterized the standard deviation of the policy probability distribution just like
we parameterized the mean of the policy probability distribution, and we’d have updated
those parameters in a similar manner (the standard deviation would converge to 0, i.e.,
the policy would converge to the optimal deterministic policy given by the closed-form
solution). As an exercise, extend the function reinforce_gaussian to include a second
FunctionApprox for the standard deviation of the policy probability distribution and up-
date this FunctionApprox along with the updates to the mean FunctionApprox. However,
since we set the standard deviation of the policy probability distribution to be a constant
σ and since we use a Monte-Carlo method, the variance of the mean estimate of the policy
probability distribution is significantly high. So we take the average of the mean estimate
over several iterations (below we average the estimate from iteration 10000 to iteration
20000).
reinforce_policies: Iterator[FunctionApprox[
NonTerminal[AssetAllocState]]] = aad.reinforce()
num_episodes: int = 10000
averaging_episodes: int = 10000
policies: Sequence[FunctionApprox[NonTerminal[AssetAllocState]]] = \
list(itertools.islice(
reinforce_policies,
num_episodes,
num_episodes + averaging_episodes
))
for t in range(steps):
    opt_alloc: float = np.mean([p(NonTerminal((init_wealth, t)))
                                for p in policies])
    print(f"Time {t:d}: Optimal Risky Allocation = {opt_alloc:.3f}")
This prints:
So we see that the estimate of the mean action for the 5 time steps from our implemen-
tation of the REINFORCE method gets fairly close to the closed-form solution.
The above code is in the file rl/chapter13/asset_alloc_reinforce.py. As ever, we encour-
age you to tweak the parameters and explore how the results vary.
As an exercise, we encourage you to implement an extension of this problem. Along
with the risky asset allocation choice as the action at each time step, also include a con-
sumption quantity (wealth to be extracted at each time step, along the lines of Merton’s
Dynamic Portfolio Allocation and Consumption problem) as part of the action at each
time step. So the action at each time step would be a pair (c, a) where c is the quantity to
consume and a is the quantity to allocate to the risky asset. Note that the consumption
is constrained to be non-negative and at most the amount of wealth at any time step (a is
unconstrained). The reward at each time step is the Utility of Consumption.
Bear in mind though that the efficient way to use the Critic is in the spirit of GPI, i.e., we
don’t take Q(s, a; w) for the current policy (current θ) all the way to convergence (thinking
about updates of w for a given Policy as Policy Evaluation phase of GPI). Instead, we
switch between Policy Evaluation (updates of w) and Policy Improvement (updates of
θ) quite frequently. In fact, with a bootstrapped (TD) approach, we would update both
w and θ after each atomic experience. w is updated such that a suitable loss function
is minimized. This can be done using any of the usual Value Function approximation
methods we have covered previously, including:
This method of calculating the gradient of J(θ) can be thought of as Approximate Policy
Gradient due to the bias of the Critic Q(s, a; w) (serving as an approximation of Qπ (s, a)),
i.e.,
$$\nabla_\theta J(\theta) \approx \sum_{s \in \mathcal{N}} \rho^{\pi}(s) \cdot \sum_{a \in \mathcal{A}} \nabla_\theta \pi(s, a; \theta) \cdot Q(s, a; w)$$
Now let’s implement some code to perform Policy Gradient with the Critic updated
using Temporal-Difference (again, for the simple case of single-dimensional continuous
action space). In the function actor_critic_gaussian below, the key changes (from the
code in reinforce_gaussian) are:
• The Q-Value function approximation parameters w are updated after each atomic experience (s, a, r, s′, a′) as:
$$\Delta w = \alpha_w \cdot (r + \gamma \cdot Q(s', a'; w) - Q(s, a; w)) \cdot \nabla_w Q(s, a; w)$$
(with the bootstrapped term $\gamma \cdot Q(s', a'; w)$ dropped if s′ is a terminal state)
gamma_prod: float = 1.0
state: NonTerminal[S] = start_states_distribution.sample()
action: float = Gaussian(
mu=policy_mean_approx(state),
sigma=policy_stdev
).sample()
while isinstance(state, NonTerminal) and steps < max_episode_length:
next_state, reward = mdp.step(state, action).sample()
if isinstance(next_state, NonTerminal):
next_action: float = Gaussian(
mu=policy_mean_approx(next_state),
sigma=policy_stdev
).sample()
q = q.update([(
(state, action),
reward + gamma * q((next_state, next_action))
)])
action = next_action
else:
q = q.update([((state, action), reward)])
def obj_deriv_out(
states: Sequence[NonTerminal[S]],
actions: Sequence[float]
) -> np.ndarray:
return (policy_mean_approx.evaluate(states) -
np.array(actions)) / (policy_stdev * policy_stdev)
grad: Gradient[FunctionApprox[NonTerminal[S]]] = \
policy_mean_approx.objective_gradient(
xy_vals_seq=[(state, action)],
obj_deriv_out_fun=obj_deriv_out
)
scaled_grad: Gradient[FunctionApprox[NonTerminal[S]]] = \
grad * gamma_prod * q((state, action))
policy_mean_approx = \
policy_mean_approx.update_with_gradient(scaled_grad)
yield policy_mean_approx
gamma_prod *= gamma
steps += 1
state = next_state
A good Baseline Function B(s) is a function approximation V (s; v) of the State-Value
Function V π (s). So then we can rewrite the Actor-Critic Policy Gradient algorithm using
an estimate of the Advantage Function, as follows:
• The Q-Value function approximation parameters w are updated after each atomic
experience as:
A simpler way is to use the TD Error of the State-Value Function as an estimate of the
Advantage Function. To understand this idea, let δ π denote the TD Error for the true State-
Value Function V π (s). Then,
δ π = r + γ · V π (s′ ) − V π (s)
Note that $\delta^{\pi}$ is an unbiased estimate of the Advantage Function $A^{\pi}(s, a)$. This is because
$$\mathbb{E}_{\pi}[\delta^{\pi} \mid s, a] = \mathbb{E}_{\pi}[r + \gamma \cdot V^{\pi}(s') \mid s, a] - V^{\pi}(s) = Q^{\pi}(s, a) - V^{\pi}(s) = A^{\pi}(s, a)$$
This approach requires only one set of Critic parameters v, and we don't have to worry about the Action-Value Function Q.
Now let’s implement some code for this TD Error-based PG Algorithm (again, for the
simple case of single-dimensional continuous action space). In the function actor_critic_td_error_gaussian
below:
• The State-Value function approximation parameters v are updated after each atomic experience as:
$$\Delta v = \alpha_v \cdot (r + \gamma \cdot V(s'; v) - V(s; v)) \cdot \nabla_v V(s; v)$$
• The Policy Mean function approximation parameters θ are updated after each atomic experience as:
$$\Delta \theta = \alpha_\theta \cdot \gamma^t \cdot (r + \gamma \cdot V(s'; v) - V(s; v)) \cdot \nabla_\theta \log \pi(s, a; \theta)$$
where αθ is the learning rate for the Policy Mean function approximation (and the bootstrapped term γ · V(s′; v) is dropped if s′ is a terminal state).
from rl.approximate_dynamic_programming import NTStateDistribution, ValueFunctionApprox
def actor_critic_td_error_gaussian(
mdp: MarkovDecisionProcess[S, float],
policy_mean_approx0: FunctionApprox[NonTerminal[S]],
value_func_approx0: ValueFunctionApprox[S],
start_states_distribution: NTStateDistribution[S],
policy_stdev: float,
gamma: float,
max_episode_length: float
) -> Iterator[FunctionApprox[NonTerminal[S]]]:
policy_mean_approx: FunctionApprox[NonTerminal[S]] = policy_mean_approx0
yield policy_mean_approx
vf: ValueFunctionApprox[S] = value_func_approx0
while True:
steps: int = 0
gamma_prod: float = 1.0
state: NonTerminal[S] = start_states_distribution.sample()
while isinstance(state, NonTerminal) and steps < max_episode_length:
action: float = Gaussian(
mu=policy_mean_approx(state),
sigma=policy_stdev
).sample()
next_state, reward = mdp.step(state, action).sample()
if isinstance(next_state, NonTerminal):
td_target: float = reward + gamma * vf(next_state)
else:
td_target = reward
td_error: float = td_target - vf(state)
vf = vf.update([(state, td_target)])
def obj_deriv_out(
states: Sequence[NonTerminal[S]],
actions: Sequence[float]
) -> np.ndarray:
return (policy_mean_approx.evaluate(states) -
np.array(actions)) / (policy_stdev * policy_stdev)
grad: Gradient[FunctionApprox[NonTerminal[S]]] = \
policy_mean_approx.objective_gradient(
xy_vals_seq=[(state, action)],
obj_deriv_out_fun=obj_deriv_out
)
scaled_grad: Gradient[FunctionApprox[NonTerminal[S]]] = \
grad * gamma_prod * td_error
policy_mean_approx = \
policy_mean_approx.update_with_gradient(scaled_grad)
yield policy_mean_approx
gamma_prod *= gamma
steps += 1
state = next_state
Likewise, we can implement an Actor-Critic algorithm using Eligibility Traces (i.e., TD(λ))
for the State-Value Function Approximation and also for the Policy Mean Function Ap-
proximation. The eligibility-trace and parameter updates after each atomic experience, to parameters v of the State-Value function approximation and parameters θ of the policy mean function approximation, are given by (with δ denoting the TD Error $R_{t+1} + \gamma \cdot V(S_{t+1}; v) - V(S_t; v)$):
$$E_v \leftarrow \gamma \cdot \lambda_v \cdot E_v + \nabla_v V(S_t; v), \quad\quad \Delta v = \alpha_v \cdot \delta \cdot E_v$$
$$E_\theta \leftarrow \gamma \cdot \lambda_\theta \cdot E_\theta + \gamma^t \cdot \nabla_\theta \log \pi(S_t, A_t; \theta), \quad\quad \Delta \theta = \alpha_\theta \cdot \delta \cdot E_\theta$$
where λv and λθ are the TD(λ) parameters respectively for the State-Value Function Approximation and the Policy Mean Function Approximation, and αv and αθ are their learning rates.
We encourage you to implement in code this Actor-Critic algorithm using Eligibility
Traces.
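As a starting point for that exercise, here is a minimal per-step sketch with linear function approximations ($V(s; v) = \phi(s)^T \cdot v$, Gaussian policy mean $\phi(s)^T \cdot \theta$ with fixed σ) rather than the book's FunctionApprox classes; all names below are ours:

import numpy as np
from typing import Optional, Tuple

def actor_critic_traces_step(
    v: np.ndarray, theta: np.ndarray,
    e_v: np.ndarray, e_theta: np.ndarray,                # eligibility traces
    phi_s: np.ndarray, phi_next: Optional[np.ndarray],   # features of s and s' (None if s' is terminal)
    action: float, reward: float,
    gamma: float, gamma_prod: float, sigma: float,
    lambda_v: float, lambda_theta: float,
    alpha_v: float, alpha_theta: float
) -> Tuple[np.ndarray, np.ndarray, np.ndarray, np.ndarray]:
    # TD Error of the State-Value Function V(s; v) = phi(s)^T . v
    v_next = 0.0 if phi_next is None else phi_next.dot(v)
    delta = reward + gamma * v_next - phi_s.dot(v)
    # Accumulate the eligibility traces
    e_v = gamma * lambda_v * e_v + phi_s
    score = (action - phi_s.dot(theta)) * phi_s / sigma ** 2   # Gaussian-policy score
    e_theta = gamma * lambda_theta * e_theta + gamma_prod * score
    # Parameter updates driven by the TD Error
    v = v + alpha_v * delta * e_v
    theta = theta + alpha_theta * delta * e_theta
    return v, theta, e_v, e_theta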
Now let’s compare these methods on the AssetAllocPG instance we had created earlier
to test REINFORCE, i.e., for time steps T = 5, µ = 13%, σ = 20%, r = 7%, coefficient
of CARA a = 1.0, probability distribution of wealth at the start of each trace experience
as N (1.0, 0.1), and constant standard deviation σ of the policy probability distribution of
actions for a given state as 0.5. The __main__ code in rl/chapter13/asset_alloc_pg.py eval-
uates the mean action for the start state of (t = 0, W0 = 1.0) after each episode (over 50,000
episodes) for each of the above-implemented PG algorithms’ function approximation for
the policy mean. It then plots the progress of the evaluated mean action for the start state
over the 50,000 episodes (each point plotted as an average over a batch of 200 episodes),
along with the benchmark of the optimal action for the start state from the known closed-
form solution. Figure 12.1 shows the graph, validating the points we have made above on
bias and variance of these algorithms.
Actor-Critic methods were developed in the late 1970s and 1980s, but received little attention in the 1990s. In the past two decades, there has been a revival of Actor-Critic methods.
For a more detailed coverage of Actor-Critic methods, see the paper by Degris, White,
Sutton (Degris, White, and Sutton 2012).
• Q(s, a; w)
• A(s, a; w, v) = Q(s, a; w) − V (s; v)
• δ(s, s′ , r; v) = r + γ · V (s′ ; v) − V (s; v)
However, each of the above proxies for $Q^{\pi}(s, a)$ in PG algorithms has a bias. In this section, we talk about how to overcome this bias. The basis for overcoming bias is an important
Theorem known as the Compatible Function Approximation Theorem. We state and prove this
theorem, and then explain how we could use it in a PG algorithm.
Figure 12.1.: Progress of PG Algorithms
Theorem 12.7.1 (Compatible Function Approximation Theorem). Let wθ∗ denote the Critic
parameters w that minimize the following mean-squared-error for given policy parameters θ:
$$\sum_{s \in \mathcal{N}} \rho^{\pi}(s) \cdot \sum_{a \in \mathcal{A}} \pi(s, a; \theta) \cdot (Q^{\pi}(s, a) - Q(s, a; w))^2$$
Assume that the data type of θ is the same as the data type of w and furthermore, assume that for
any policy parameters θ, the Critic gradient at wθ∗ is compatible with the Actor score function,
i.e.,
∇w Q(s, a; wθ∗ ) = ∇θ log π(s, a; θ) for all s ∈ N , for all a ∈ A
Then the Policy Gradient using critic Q(s, a; wθ∗ ) is exact:
$$\nabla_\theta J(\theta) = \sum_{s \in \mathcal{N}} \rho^{\pi}(s) \cdot \sum_{a \in \mathcal{A}} \nabla_\theta \pi(s, a; \theta) \cdot Q(s, a; w_\theta^*)$$
Proof. For a given θ, since $w_\theta^*$ minimizes the mean-squared-error as defined above, we have:
$$\sum_{s \in \mathcal{N}} \rho^{\pi}(s) \cdot \sum_{a \in \mathcal{A}} \pi(s, a; \theta) \cdot (Q^{\pi}(s, a) - Q(s, a; w_\theta^*)) \cdot \nabla_w Q(s, a; w_\theta^*) = 0$$
But since $\nabla_w Q(s, a; w_\theta^*) = \nabla_\theta \log \pi(s, a; \theta)$, we have:
$$\sum_{s \in \mathcal{N}} \rho^{\pi}(s) \cdot \sum_{a \in \mathcal{A}} \pi(s, a; \theta) \cdot (Q^{\pi}(s, a) - Q(s, a; w_\theta^*)) \cdot \nabla_\theta \log \pi(s, a; \theta) = 0$$
Therefore,
$$\sum_{s \in \mathcal{N}} \rho^{\pi}(s) \cdot \sum_{a \in \mathcal{A}} \pi(s, a; \theta) \cdot Q^{\pi}(s, a) \cdot \nabla_\theta \log \pi(s, a; \theta) = \sum_{s \in \mathcal{N}} \rho^{\pi}(s) \cdot \sum_{a \in \mathcal{A}} \pi(s, a; \theta) \cdot Q(s, a; w_\theta^*) \cdot \nabla_\theta \log \pi(s, a; \theta)$$
But $\nabla_\theta J(\theta) = \sum_{s \in \mathcal{N}} \rho^{\pi}(s) \cdot \sum_{a \in \mathcal{A}} \pi(s, a; \theta) \cdot Q^{\pi}(s, a) \cdot \nabla_\theta \log \pi(s, a; \theta)$. So,
$$\nabla_\theta J(\theta) = \sum_{s \in \mathcal{N}} \rho^{\pi}(s) \cdot \sum_{a \in \mathcal{A}} \pi(s, a; \theta) \cdot Q(s, a; w_\theta^*) \cdot \nabla_\theta \log \pi(s, a; \theta) = \sum_{s \in \mathcal{N}} \rho^{\pi}(s) \cdot \sum_{a \in \mathcal{A}} \nabla_\theta \pi(s, a; \theta) \cdot Q(s, a; w_\theta^*)$$
Q.E.D.
This proof originally appeared in the famous paper by Sutton, McAllester, Singh, Man-
sour on Policy Gradient Methods for Reinforcement Learning with Function Approxima-
tion (R. Sutton et al. 2001).
This means if the compatibility assumption of the Theorem is satisfied, we can use the
critic function approximation Q(s, a; wθ∗ ) and still have exact Policy Gradient (i.e., no bias
due to using a function approximation for the Q-Value Function). However, note that in
practice, we invoke the spirit of GPI and don’t take Q(s, a; w) to convergence for the current
θ. Rather, we update both w and θ frequently, and this turns out to be good enough in
terms of lowering the bias.
A simple way to enable Compatible Function Approximation is to make Q(s, a; w) a lin-
ear function approximation, with the features of the linear function approximation equal
to the Score of the policy function approximation, as follows:
$$Q(s, a; w) = \sum_{i=1}^{m} \frac{\partial \log \pi(s, a; \theta)}{\partial \theta_i} \cdot w_i \text{ for all } s \in \mathcal{N}, \text{ for all } a \in \mathcal{A}$$
which means the feature functions η(s, a) = (η1 (s, a), η2 (s, a), . . . , ηm (s, a)) of the linear
function approximation are given by:
$$\eta_i(s, a) = \frac{\partial \log \pi(s, a; \theta)}{\partial \theta_i} \text{ for all } s \in \mathcal{N}, \text{ for all } a \in \mathcal{A}, \text{ for all } i = 1, \ldots, m$$
This means the feature functions η is identically equal to the Score. Note that although
here we assume Q(s, a; w) to be a linear function approximation, the policy function ap-
proximation π(s, a; θ) can be more flexible. All that is required is that θ consists of exactly m parameters (matching the number of parameters m of w) and that each of the partial derivatives $\frac{\partial \log \pi(s,a;\theta)}{\partial \theta_i}$ lines up with a corresponding feature $\eta_i(s, a)$ of the linear function approximation Q(s, a; w). This means that as θ updates (as a consequence
of Stochastic Gradient Ascent), π(s, a; θ) updates, and consequently the feature functions
η(s, a) = ∇θ log π(s, a; θ) update. This means the feature vector η(s, a) is not constant
for a given (s, a) pair. Rather, the feature vector η(s, a) for a given (s, a) pair varies in
accordance with θ varying.
If we assume the canonical function approximation for π(s, a; θ) for finite action spaces
that we had described in Section 12.3, then:
$$\eta(s, a) = \phi(s, a) - \sum_{b \in \mathcal{A}} \pi(s, b; \theta) \cdot \phi(s, b) \text{ for all } s \in \mathcal{N}, \text{ for all } a \in \mathcal{A}$$
Note the dependency of feature vector η(s, a) on θ.
If we assume the canonical function approximation for π(s, a; θ) for single-dimensional
continuous action spaces that we had described in Section 12.3, then:
$$\eta(s, a) = \frac{(a - \phi(s)^T \cdot \theta) \cdot \phi(s)}{\sigma^2} \text{ for all } s \in \mathcal{N}, \text{ for all } a \in \mathcal{A}$$
Note the dependency of feature vector η(s, a) on θ.
We note that any compatible linear function approximation Q(s, a; w) serves as an ap-
proximation of the advantage function because:
$$\begin{aligned}
\sum_{a \in \mathcal{A}} \pi(s, a; \theta) \cdot Q(s, a; w) &= \sum_{a \in \mathcal{A}} \pi(s, a; \theta) \cdot \left(\sum_{i=1}^{m} \frac{\partial \log \pi(s, a; \theta)}{\partial \theta_i} \cdot w_i\right) \\
&= \sum_{a \in \mathcal{A}} \sum_{i=1}^{m} \frac{\partial \pi(s, a; \theta)}{\partial \theta_i} \cdot w_i = \sum_{i=1}^{m} \left(\sum_{a \in \mathcal{A}} \frac{\partial \pi(s, a; \theta)}{\partial \theta_i}\right) \cdot w_i \\
&= \sum_{i=1}^{m} \frac{\partial}{\partial \theta_i}\left(\sum_{a \in \mathcal{A}} \pi(s, a; \theta)\right) \cdot w_i = \sum_{i=1}^{m} \frac{\partial 1}{\partial \theta_i} \cdot w_i = 0
\end{aligned}$$
Denoting ∇θ log π(s, a; θ) as the score column vector SC(s, a; θ) and assuming compat-
ible linear-approximation critic:
$$\begin{aligned}
\nabla_\theta J(\theta) &= \sum_{s \in \mathcal{N}} \rho^{\pi}(s) \cdot \sum_{a \in \mathcal{A}} \pi(s, a; \theta) \cdot SC(s, a; \theta) \cdot (SC(s, a; \theta)^T \cdot w_\theta^*) \\
&= \sum_{s \in \mathcal{N}} \rho^{\pi}(s) \cdot \sum_{a \in \mathcal{A}} \pi(s, a; \theta) \cdot (SC(s, a; \theta) \cdot SC(s, a; \theta)^T) \cdot w_\theta^* \\
&= \mathbb{E}_{s \sim \rho^{\pi}, a \sim \pi}[SC(s, a; \theta) \cdot SC(s, a; \theta)^T] \cdot w_\theta^*
\end{aligned}$$
Note that $\mathbb{E}_{s \sim \rho^{\pi}, a \sim \pi}[SC(s, a; \theta) \cdot SC(s, a; \theta)^T]$ is the Fisher Information Matrix $FIM_{\rho^{\pi},\pi}(\theta)$ with respect to $s \sim \rho^{\pi}, a \sim \pi$. Therefore, we can write $\nabla_\theta J(\theta)$ more succinctly as:
$$\nabla_\theta J(\theta) = FIM_{\rho^{\pi},\pi}(\theta) \cdot w_\theta^*$$
Thus, we can update θ after each atomic experience at time step t by calculating the gradient of J(θ) for the atomic experience as the outer product of $SC(S_t, A_t; \theta)$ with itself (which gives an m × m matrix), then multiplying this matrix with the vector w, and then scaling by $\gamma^t$, i.e.,
$$\Delta \theta = \alpha_\theta \cdot \gamma^t \cdot SC(S_t, A_t; \theta) \cdot SC(S_t, A_t; \theta)^T \cdot w$$
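A small NumPy sketch of this θ update (assuming the score vector and the compatible-critic weights w are already available; the names below are ours):

import numpy as np

def compatible_actor_update(
    theta: np.ndarray,
    w: np.ndarray,
    score: np.ndarray,    # SC(S_t, A_t; theta) = grad_theta log pi(S_t, A_t; theta)
    gamma_prod: float,    # gamma^t
    alpha_theta: float
) -> np.ndarray:
    # Delta theta = alpha_theta * gamma^t * SC . SC^T . w (outer product, then matrix-vector product)
    return theta + alpha_theta * gamma_prod * np.outer(score, score).dot(w)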
The update for w after each atomic experience is the usual Q-Value Function Approximation update, with the Q-Value loss function gradient for the atomic experience calculated as (using the compatible linear approximation $Q(s, a; w) = SC(s, a; \theta)^T \cdot w$, and dropping the bootstrapped term at a terminal state):
$$\Delta w = \alpha_w \cdot (R_{t+1} + \gamma \cdot SC(S_{t+1}, A_{t+1}; \theta)^T \cdot w - SC(S_t, A_t; \theta)^T \cdot w) \cdot SC(S_t, A_t; \theta)$$
This completes our coverage of the basic Policy Gradient Methods. Next, we cover a
couple of special Policy Gradient Methods that have worked well in practice - Natural
Policy Gradient and Deterministic Policy Gradient.
12.8. Policy Gradient Methods in Practice
12.8.1. Natural Policy Gradient
Natural Policy Gradient (abbreviated NPG) is due to a paper by Kakade (Kakade 2001)
that utilizes the idea of Natural Gradient first introduced by Amari (Amari 1998). We
won’t cover the theory of Natural Gradient in detail here, and refer you to the above two
papers instead. Here we give a high-level overview of the concepts, and describe the al-
gorithm.
The core motivation for Natural Gradient is that when the parameters space has a certain
underlying structure (as is the case with the parameters space of θ in the context of maxi-
mizing J(θ)), the usual gradient does not represent its steepest descent direction, but the
Natural Gradient does. The steepest descent direction of an arbitrary function f (θ) to be
minimized is defined as the vector ∆θ that minimizes f (θ + ∆θ) under the constraint that
the length |∆θ| is a constant. In general, the length |∆θ| is defined with respect to some
positive-definite matrix G(θ) governed by the underlying structure of the θ parameters
space, i.e.,
|∆θ|2 = (∆θ)T · G(θ) · ∆θ
We can show that under the length metric defined by the matrix G, the steepest descent direction is:
$$\nabla_\theta^{nat} f(\theta) = G^{-1}(\theta) \cdot \nabla_\theta f(\theta)$$
We refer to this steepest descent direction $\nabla_\theta^{nat} f(\theta)$ as the Natural Gradient. We can update the parameters θ in this Natural Gradient direction in order to achieve steepest descent (according to the matrix G), as follows:
$$\Delta \theta = -\alpha \cdot \nabla_\theta^{nat} f(\theta)$$
Amari showed that for a supervised learning problem of estimating the conditional
probability distribution of y|x with a function approximation (i.e., where the loss function
is defined as the KL divergence between the data distribution and the model distribution),
the matrix G is the Fisher Information Matrix for y|x.
Kakade specialized this idea of Natural Gradient to the case of Policy Gradient (naming
it Natural Policy Gradient) with the objective function f (θ) equal to the negative of the
Expected Returns J(θ). This gives the Natural Policy Gradient $\nabla_\theta^{nat} J(\theta)$, defined as:
$$\nabla_\theta^{nat} J(\theta) = FIM_{\rho^{\pi},\pi}^{-1}(\theta) \cdot \nabla_\theta J(\theta)$$
Combining this with the earlier result $\nabla_\theta J(\theta) = FIM_{\rho^{\pi},\pi}(\theta) \cdot w_\theta^*$ (which holds when using a compatible linear function approximation for the Critic), we get:
$$\nabla_\theta^{nat} J(\theta) = w_\theta^*$$
This compact result enables a simple algorithm for Natural Policy Gradient (NPG):
• After each atomic experience, update the Critic parameters w with the critic loss gradient (as in the Actor-Critic algorithms described earlier in this chapter).
• After each atomic experience, update the Actor parameters θ in the direction of the Natural Policy Gradient:
$$\Delta \theta = \alpha_\theta \cdot w$$
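A minimal sketch of these two NPG updates with a compatible linear Critic $Q(s, a; w) = SC(s, a; \theta)^T \cdot w$ (we assume a SARSA-style TD update for the Critic here; the exact Critic update can be any of the usual ones, and all names below are ours):

import numpy as np
from typing import Optional, Tuple

def npg_step(
    theta: np.ndarray, w: np.ndarray,
    score: np.ndarray,                   # SC(S_t, A_t; theta)
    score_next: Optional[np.ndarray],    # SC(S_{t+1}, A_{t+1}; theta), None if S_{t+1} is terminal
    reward: float, gamma: float,
    alpha_w: float, alpha_theta: float
) -> Tuple[np.ndarray, np.ndarray]:
    # Compatible linear Critic: Q(s, a; w) = SC(s, a; theta)^T . w
    q = score.dot(w)
    q_next = 0.0 if score_next is None else score_next.dot(w)
    td_error = reward + gamma * q_next - q
    # Critic update: semi-gradient step on the squared TD Error
    w = w + alpha_w * td_error * score
    # Natural Policy Gradient update: Delta theta = alpha_theta * w
    theta = theta + alpha_theta * w
    return theta, w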
Note that ∇θ πD (s; θ) is a Jacobian matrix as it takes the partial derivatives of a poten-
tially multi-dimensional action a = πD (s; θ) with respect to each parameter in θ. As we’ve
pointed out during the coverage of (stochastic) PG, when θ changes, policy πD changes,
which changes the state distribution ρπD . So it’s not clear that this calculation indeed guar-
antees improvement - it doesn’t take into account the effect of changing θ on ρπD . How-
ever, as was the case in PGT, Deterministic Policy Gradient Theorem (abbreviated DPGT)
ensures that there is no need to compute the gradient of ρπD with respect to θ, and that
the update described above indeed follows the gradient of the Expected Return objective
function. We formalize this now by stating the DPGT.
Analogous to the Expected Returns Objective defined for (stochastic) PG, we define the
Expected Returns Objective J(θ) for DPG as:
$$J(\theta) = \mathbb{E}_{\pi_D}\left[\sum_{t=0}^{\infty} \gamma^t \cdot R_{t+1}\right] = \sum_{s \in \mathcal{N}} \rho^{\pi_D}(s) \cdot \mathcal{R}_s^{\pi_D(s;\theta)}$$
where
$$\rho^{\pi_D}(s) = \sum_{S_0 \in \mathcal{N}} \sum_{t=0}^{\infty} \gamma^t \cdot p_0(S_0) \cdot p(S_0 \to s, t, \pi_D)$$
Theorem 12.8.1 (Deterministic Policy Gradient Theorem). Given an MDP with action space $\mathbb{R}^k$, with appropriate gradient existence conditions,
$$\nabla_\theta J(\theta) = \sum_{s \in \mathcal{N}} \rho^{\pi_D}(s) \cdot \nabla_\theta \pi_D(s; \theta) \cdot \nabla_a Q^{\pi_D}(s, a)\Big|_{a=\pi_D(s;\theta)} = \mathbb{E}_{s \sim \rho^{\pi_D}}\left[\nabla_\theta \pi_D(s; \theta) \cdot \nabla_a Q^{\pi_D}(s, a)\Big|_{a=\pi_D(s;\theta)}\right]$$
$$\Delta \theta = \alpha_\theta \cdot \nabla_\theta \pi_D(S_t; \theta) \cdot \nabla_a Q(S_t, a; w)\Big|_{a=\pi_D(S_t;\theta)}$$
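As a small illustration of this update, here is a sketch for a single-dimensional action with a linear deterministic policy $\pi_D(s; \theta) = \phi(s)^T \cdot \theta$ (so $\nabla_\theta \pi_D(s; \theta) = \phi(s)$); the Critic's action-derivative grad_a_q is assumed to be supplied by the caller, and all names below are ours:

import numpy as np
from typing import Callable

def dpg_actor_update(
    theta: np.ndarray,
    phi_s: np.ndarray,                    # state features; pi_D(s; theta) = phi(s)^T . theta
    grad_a_q: Callable[[float], float],   # derivative of the Critic Q(s, a; w) with respect to a
    alpha_theta: float
) -> np.ndarray:
    a = phi_s.dot(theta)                  # deterministic action for state s
    # Delta theta = alpha_theta * grad_theta pi_D(s; theta) * grad_a Q(s, a; w) at a = pi_D(s; theta)
    return theta + alpha_theta * phi_s * grad_a_q(a)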
Critic Bias can be resolved with a Compatible Function Approximation Theorem for
DPG (see Silver et al. paper for details). Instabilities caused by Bootstrapped Off-Policy
Learning with Function Approximation can be resolved with Gradient Temporal Differ-
ence (GTD).
$$\nabla_\theta\left(\mathbb{E}_{\psi \sim p_\theta}[F(\psi)]\right) = \nabla_\theta\left(\int_\psi p_\theta(\psi) \cdot F(\psi) \cdot d\psi\right) = \int_\psi \nabla_\theta(p_\theta(\psi)) \cdot F(\psi) \cdot d\psi = \int_\psi p_\theta(\psi) \cdot \nabla_\theta(\log p_\theta(\psi)) \cdot F(\psi) \cdot d\psi = \mathbb{E}_{\psi \sim p_\theta}[\nabla_\theta(\log p_\theta(\psi)) \cdot F(\psi)] \quad (12.2)$$
Now let’s see how NES can be applied to solving MDP Control. We set F (·) to be the
(stochastic) Return of an MDP. ψ corresponds to the parameters of a deterministic policy
πψ : N → A. ψ ∈ Rm is drawn from an isotropic m-variate Gaussian distribution, i.e.,
Gaussian with mean vector θ ∈ Rm and fixed diagonal covariance matrix σ 2 Im where
σ ∈ R is kept fixed and $I_m$ is the m × m identity matrix. The average objective (Expected Return) can then be written as:
$$\mathbb{E}_{\psi \sim \mathcal{N}(\theta, \sigma^2 I_m)}[F(\psi)] = \mathbb{E}_{\epsilon \sim \mathcal{N}(0, I_m)}[F(\theta + \sigma \cdot \epsilon)]$$
where $\epsilon \in \mathbb{R}^m$ is the standard normal random variable generating ψ. Hence, from Equa-
tion (12.2), the gradient (∇θ ) of Expected Return can be written as:
$$\nabla_\theta\left(\mathbb{E}_{\psi \sim \mathcal{N}(\theta, \sigma^2 I_m)}[F(\psi)]\right) = \mathbb{E}_{\psi \sim \mathcal{N}(\theta, \sigma^2 I_m)}\left[\nabla_\theta\left(\frac{-(\psi - \theta)^T \cdot (\psi - \theta)}{2\sigma^2}\right) \cdot F(\psi)\right] = \frac{1}{\sigma} \cdot \mathbb{E}_{\epsilon \sim \mathcal{N}(0, I_m)}[\epsilon \cdot F(\theta + \sigma \cdot \epsilon)]$$
Now we come up with a sampling-based algorithm to solve the MDP. The above formula
helps estimate the gradient of Expected Return by sampling several ϵ (each ϵ represents a
Policy πθ+σ·ϵ ), and averaging ϵ · F (θ + σ · ϵ) across a large set (n) of ϵ samples.
Note that evaluating F (θ + σ · ϵ) involves playing an episode for a given sampled ϵ, and
obtaining that episode’s Return F (θ + σ · ϵ). Hence, we have n values of ϵ, n Policies πθ+σ·ϵ ,
and n Returns F (θ + σ · ϵ).
Given the gradient estimate, we update θ in this gradient direction, which in turn leads
to new samples of ϵ (new set of Policies πθ+σ·ϵ ), and the process repeats until Eϵ∼N (0,Im ) [F (θ+
σ · ϵ)] is maximized.
The key inputs to the algorithm are an initial value of the policy parameters θ0, the fixed standard deviation σ, the learning rate α, and the number of samples n per iteration.
With these inputs, for each iteration t = 0, 1, 2, . . ., the algorithm performs the following
steps:
• Sample $\epsilon_1, \epsilon_2, \ldots, \epsilon_n \sim \mathcal{N}(0, I_m)$.
• Compute Returns $F_i \leftarrow F(\theta_t + \sigma \cdot \epsilon_i)$ for $i = 1, 2, \ldots, n$.
• $\theta_{t+1} \leftarrow \theta_t + \frac{\alpha}{n\sigma} \sum_{i=1}^{n} \epsilon_i \cdot F_i$ (a minimal code sketch of these steps follows below).
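Here is a minimal sketch of these iteration steps (F is any function that plays an episode with the deterministic policy parameterized by its argument and returns that episode's Return; the function and argument names are ours):

import numpy as np
from typing import Callable

def nes(
    f: Callable[[np.ndarray], float],   # F(psi): Return of an episode played with policy pi_psi
    theta0: np.ndarray,
    sigma: float,
    alpha: float,
    n: int,
    num_iterations: int
) -> np.ndarray:
    theta = theta0.copy()
    for _ in range(num_iterations):
        epsilons = np.random.randn(n, len(theta))                        # epsilon_i ~ N(0, I_m)
        returns_ = np.array([f(theta + sigma * eps) for eps in epsilons])
        # theta_{t+1} <- theta_t + alpha / (n * sigma) * sum_i epsilon_i * F_i
        theta = theta + alpha / (n * sigma) * epsilons.T.dot(returns_)
    return theta

# As a toy stand-in for an episode Return, maximizing -(psi - 1)^2 drives theta towards 1:
# nes(lambda psi: -float(np.sum((psi - 1.0) ** 2)), np.zeros(3), sigma=0.1, alpha=0.01, n=50, num_iterations=2000)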
On the surface, this NES algorithm looks like PG because it’s not Value Function-based
(it’s Policy-based, like PG). Also, similar to PG, it uses a gradient to move the policy to-
wards optimality. But, ES does not interact with the environment (like PG/RL does).
ES operates at a high-level, ignoring the (state, action, reward) interplay. Specifically, it
does not aim to assign credit to actions in specific states. Hence, ES doesn’t have the core
essence of RL: Estimating the Q-Value Function for a Policy and using it to Improve the Policy.
Therefore, we don’t classify ES as Reinforcement Learning. Rather, we consider ES to be
an alternative approach to RL Algorithms.
What is the effectiveness of ES compared to RL? The traditional view has been that ES
won’t work on high-dimensional problems. Specifically, ES has been shown to be data-
inefficient relative to RL. This is because ES resembles simple hill-climbing based only on
finite differences along a few random directions at each step. However, ES is very simple to
implement (no Value Function approximation or back-propagation needed), and is highly
parallelizable. ES has the benefits of being indifferent to distribution of rewards and to ac-
tion frequency, and is tolerant of long horizons. A paper from OpenAI Research (Salimans
et al. 2017) shows techniques to make NES more robust and more data-efficient, and they
demonstrate that NES has more exploratory behavior than advanced PG algorithms.
• Policy Gradient Theorem gives us a simple formula for ∇θ J(θ) in terms of the score
of the policy function approximation (i.e., gradient of the log of the policy with re-
spect to the policy parameters θ).
• We can reduce variance in PG algorithms by using a Critic and by using an estimate of the Advantage Function in place of the Q-Value Function.
• Compatible Function Approximation Theorem enables us to overcome bias in PG
Algorithms.
• Natural Policy Gradient and Deterministic Policy Gradient are specialized PG algo-
rithms that have worked well in practice.
• Evolutionary Strategies are technically not RL, but they resemble PG Algorithms and
can sometimes be quite effective in solving MDP Control problems.
Part IV.
Finishing Touches
13. Multi-Armed Bandits: Exploration versus
Exploitation
We learnt in Chapter 10 that balancing exploration and exploitation is vital in RL Control
algorithms. While we want to exploit actions that seem to be fetching good returns, we
also want to adequately explore all possible actions so we can obtain an accurate-enough
estimate of their Q-Values. We had mentioned that this is essentially the Explore-Exploit
dilemma of the famous Multi-Armed Bandit Problem. The Multi-Armed Bandit prob-
lem provides a simple setting to understand the explore-exploit tradeoff and to develop
explore-exploit balancing algorithms. The approaches followed by the Multi-Armed Ban-
dit algorithms are then well-transportable to the more complex setting of RL Control.
In this Chapter, we start by specifying the Multi-Armed Bandit problem, followed by
coverage of a variety of techniques to solve the Multi-Armed Bandit problem (i.e., effec-
tively balancing exploration against exploitation). We’ve actually seen one of these algo-
rithms already for RL Control - following an ϵ-greedy policy, which naturally is applicable
to the simpler setting of Multi-Armed Bandits. We had mentioned in Chapter 10 that we
can simply replace the ϵ-greedy approach with any other algorithm for explore-exploit
tradeoff. In this chapter, we consider a variety of such algorithms, many of which are far
more sophisticated compared to the simple ϵ-greedy approach. However, we cover these
algorithms for the simple setting of Multi-Armed Bandits as it promotes understanding
and development of intuition. After covering a range of algorithms for Multi-Armed Ban-
dits, we consider an extended problem known as Contextual Bandits, that is a step between
the Multi-Armed Bandits problem and the RL Control problem (in terms of problem com-
plexity). Finally, we explain how the algorithms for Multi-Armed Bandits can be easily
transported to the more nuanced/extended setting of Contextual Bandits, and further ex-
tended to RL Control.
ematically disciplined manner. Before we do that, let’s look at a few common examples of
the explore-exploit dilemma.
The term Multi-Armed Bandit (abbreviated as MAB) is a spoof name that stands for
“Many One-Armed Bandits” and the term One-Armed Bandit refers to playing a slot-machine
in a casino (that has a single lever to be pulled, that presumably addicts us and eventually
takes away all our money, hence the term “bandit”). Multi-Armed Bandit refers to the
problem of playing several slot machines (each of which has an unknown fixed payout
probability distribution) in a manner that we can make the maximum cumulative gains
by playing over multiple rounds (by selecting a single slot machine in a single round).
The core idea is that to achieve maximum cumulative gains, one would need to balance
the notions of exploration and exploitation, no matter which selection strategy one would
pursue.
The AI Agent’s goal is to maximize the following Expected Cumulative Rewards over a
certain number of time steps T :
$$\mathbb{E}\left[\sum_{t=1}^{T} R_t\right]$$
So the AI Agent has T selections of actions to make (in sequence), basing each of those
selections only on the rewards it has observed before that time step (specifically, the AI
Agent does not have knowledge of the probability distributions Ra ). Any selection strat-
egy to maximize the Expected Cumulative Rewards risks wasting time on “duds” while
exploring and also risks missing untapped “gems” while exploiting.
It is immediately observable that the Environment doesn’t have a notion of State. When
the AI Agent selects an arm, the Environment simply samples from the probability distri-
bution for that arm. However, the AI Agent might maintain relevant features of the history
(of actions taken and rewards obtained) as its State, which would help the AI Agent in
making the arm-selection (action) decision. The arm-selection action is then based on
a (Policy) function of the agent’s State. So, the agent’s arm-selection strategy is basically
this Policy. Thus, even though a MAB is not posed as an MDP, the agent could model it
as an MDP and solve it with an appropriate Planning or Learning algorithm. However,
many MAB algorithms don’t take this formal MDP approach. Instead, they rely on heuris-
tic methods that don’t aim to optimize - they simply strive for good Cumulative Rewards
(in Expectation). Note that even in a simple heuristic algorithm, At is a random variable
simply because it is a function of past (random) rewards.
13.1.3. Regret
The idea of Regret is quite fundamental in designing algorithms for MAB. In this section,
we illuminate this idea.
We define the Action Value Q(a) as the (unknown) mean reward of action a, i.e.,
Q(a) = E[r|a]
We define the Optimal Value V ∗ and Optimal Action a∗ (noting that there could be multiple
optimal actions) as:
$$V^* = \max_{a \in \mathcal{A}} Q(a) = Q(a^*)$$
The Regret $l_t$ at time step t and the Total Regret $L_T$ over T time steps are defined as:
$$l_t = \mathbb{E}[V^* - Q(A_t)]$$
$$L_T = \sum_{t=1}^{T} l_t = \sum_{t=1}^{T} \mathbb{E}[V^* - Q(A_t)]$$
Maximizing the Expected Cumulative Rewards is the same as Minimizing Total Regret.
The Gap $\Delta_a$ of an action a is defined as:
$$\Delta_a = V^* - Q(a)$$
Letting $N_T(a)$ denote the (random) number of selections of action a over the T time steps and $Count_T(a) = \mathbb{E}[N_T(a)]$, we can express Total Regret as the sum-product (over actions) of Counts and Gaps, as follows:
$$L_T = \sum_{t=1}^{T} \mathbb{E}[V^* - Q(A_t)] = \sum_{a \in \mathcal{A}} \mathbb{E}[N_T(a)] \cdot (V^* - Q(a)) = \sum_{a \in \mathcal{A}} Count_T(a) \cdot \Delta_a$$
A good algorithm ensures small Counts for large Gaps. The core challenge though is that
we don’t know the Gaps.
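A tiny sketch of the Counts-times-Gaps expression for Total Regret (the numbers below are purely illustrative, and the function name is ours):

import numpy as np

def total_regret(expected_counts: np.ndarray, q_values: np.ndarray) -> float:
    # L_T = sum_a E[N_T(a)] * (V* - Q(a)) = sum_a Count_T(a) * Delta_a
    gaps = q_values.max() - q_values
    return float(expected_counts.dot(gaps))

# e.g., three arms with Q-Values [1.0, 0.8, 0.5] selected [70, 20, 10] times (in expectation):
# total_regret(np.array([70, 20, 10]), np.array([1.0, 0.8, 0.5])) == 20 * 0.2 + 10 * 0.5 == 9.0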
In this chapter, we implement (in code) a few different algorithms for the MAB problem.
So let’s invest in an abstract base class whose interface can be implemented by each of the
algorithms we develop. The code for this abstract base class MABBase is shown below. Its constructor takes 3 inputs: arm_distributions, time_steps and num_episodes.
Each of the algorithms we’d like to write simply needs to implement the @abstractmethod
get_episode_rewards_actions which is meant to return a 1-D ndarray of actions taken by
the algorithm across the T time steps (for a single episode), and a 1-D ndarray of rewards
produced in response to those actions.
We write the following self-explanatory methods for the abstract base class MABBase:
def get_expected_action_counts(self) -> ndarray:
return mean(self.get_action_counts(), axis=0)
$$\hat{Q}_t(a) = \frac{1}{N_t(a)} \sum_{s=1}^{t} R_s \cdot \mathbb{I}_{A_s=a}$$
As ever, arg max ties are broken with an arbitrary rule in prioritizing actions. We’ve
noted in Chapter 10 that such an algorithm can lock into a suboptimal action forever (sub-
optimal a is an action for which ∆a > 0). This results in CountT (a) being a linear function
of T for some suboptimal a, which means the Total Regret is a linear function of T (we
refer to this as Linear Total Regret).
Now let's consider the ϵ-greedy algorithm, which explores forever. At each time-step t, with probability 1 − ϵ it selects the greedy action $\arg\max_{a \in \mathcal{A}} \hat{Q}_t(a)$, and with probability ϵ it selects a random action (uniformly across all actions). A constant value of ϵ ensures a minimum regret proportional to the mean gap, i.e.,
$$l_t \geq \frac{\epsilon}{|\mathcal{A}|} \sum_{a \in \mathcal{A}} \Delta_a$$
$$\hat{Q}_t(A_t) = \hat{Q}_{t-1}(A_t) + \frac{R_t - \hat{Q}_{t-1}(A_t)}{N_t(A_t)}$$
The idea here is that by setting a high initial value for the estimate of Q-Values (which we
refer to as Optimistic Initialization), we encourage systematic exploration early on. Another
way of doing optimistic initialization is to set a high value for N0 (a) for all a ∈ A, which
likewise encourages systematic exploration early on. However, these optimistic initializa-
tion ideas only serve to promote exploration early on and eventually, one can still lock into
a suboptimal action. Specifically, the Greedy algorithm together with optimistic initializa-
tion cannot be prevented from having Linear Total Regret in the general case. Likewise,
the ϵ-Greedy algorithm together with optimistic initialization cannot be prevented from
having Linear Total Regret in the general case. But in practice, these simple ideas of doing
optimistic initialization work quite well.
One can instead use a decaying ϵt schedule such as the following (for some constant c > 0):
$$d = \min_{a | \Delta_a > 0} \Delta_a$$
$$\epsilon_t = \min\left(1, \frac{c|\mathcal{A}|}{d^2(t+1)}\right)$$
It can be shown that this decay schedule achieves Logarithmic Total Regret. However,
note that the above schedule requires advance knowledge of the gaps ∆a (which by def-
inition, is not known to the AI Agent). In practice, implementing some decay schedule
helps considerably. Let’s now write some code to implement Decaying ϵt -Greedy algo-
rithm along with Optimistic Initialization.
The class EpsilonGreedy shown below implements the interface of the abstract base class
MABBase. Its constructor inputs arm_distributions, time_steps and num_episodes are the inputs we have seen before (used to pass to the constructor of the abstract base class MABBase). epsilon and epsilon_half_life are the inputs used to specify the declining trajectory of ϵt: epsilon refers to ϵ0 (the initial value of ϵ) and epsilon_half_life refers to the half life of an exponentially-decaying ϵt (used in the @staticmethod get_epsilon_decay_func). count_init and mean_init refer to the values of N0 and Q̂0 respectively. get_episode_rewards_actions implements MABBase's @abstractmethod interface, and its code below should be self-explanatory.
mean_init: float = 0.,
) -> None:
if epsilon < 0 or epsilon > 1 or \
epsilon_half_life <= 1 or count_init < 0:
raise ValueError
super().__init__(
arm_distributions=arm_distributions,
time_steps=time_steps,
num_episodes=num_episodes
)
self.epsilon_func: Callable[[int], float] = \
EpsilonGreedy.get_epsilon_decay_func(epsilon, epsilon_half_life)
self.count_init: int = count_init
self.mean_init: float = mean_init
@staticmethod
def get_epsilon_decay_func(
epsilon,
epsilon_half_life
) -> Callable[[int], float]:
def epsilon_decay(
t: int,
epsilon=epsilon,
epsilon_half_life=epsilon_half_life
) -> float:
return epsilon * 2 ** -(t / epsilon_half_life)
return epsilon_decay
def get_episode_rewards_actions(self) -> Tuple[ndarray, ndarray]:
counts: List[int] = [self.count_init] * self.num_arms
means: List[float] = [self.mean_init] * self.num_arms
ep_rewards: ndarray = empty(self.time_steps)
ep_actions: ndarray = empty(self.time_steps, dtype=int)
for i in range(self.time_steps):
max_action: int = max(enumerate(means), key=itemgetter(1))[0]
epsl: float = self.epsilon_func(i)
action: int = max_action if Bernoulli(1 - epsl).sample() else \
Range(self.num_arms).sample()
reward: float = self.arm_distributions[action].sample()
counts[action] += 1
means[action] += (reward - means[action]) / counts[action]
ep_rewards[i] = reward
ep_actions[i] = action
return ep_rewards, ep_actions
Figure 13.1.: Total Regret Curves
mean_init (Q̂0 ), observe how the graphs change, and develop better intuition for these
simple algorithms.
Theorem 13.3.1 (Lai and Robbins Lower-Bound). Asymptotic Total Regret is at least logarithmic in the number of time steps, i.e., as T → ∞,
$$L_T \geq \log T \sum_{a | \Delta_a > 0} \frac{1}{\Delta_a} \geq \log T \sum_{a | \Delta_a > 0} \frac{\Delta_a}{KL(\mathcal{R}_a \| \mathcal{R}_{a^*})}$$
This makes intuitive sense because it would be hard for an algorithm to have low to-
tal regret if the KL Divergence of arm reward-distributions (relative to the optimal arm’s
reward-distribution) are low (i.e., arms that look distributionally-similar to the optimal
arm) but the Gaps (Expected Rewards of Arms relative to Optimal Arm) are not small
- these are the MAB problem instances where the algorithm will have a hard time iso-
lating the optimal arm simply from reward samples (we’d get similar sampling reward-
distributions of arms), and suboptimal arm selections would inflate the Total Regret.
Figure 13.2.: Q-Value Distributions
Figure 13.3.: Q-Value Distributions
would be to select the arm with the highest value of µa + c · σa across the arms (for some
fixed c ∈ R+ ). Thus, we are comparing (across actions) c standard errors higher than the
mean reward estimate (i.e., the upper-end of an appropriate confidence interval for the
mean reward). In this Figure, let’s say µa + c · σa is highest for the blue arm. So we play
the blue arm, and let’s say we get a somewhat low reward for the blue arm. This might
do two things to the blue arm’s sampling distribution - it can move blue’s µa lower and it
can also also lower blue’s σa (simply due to the fact that the number of blue arm samples
has grown). With the new µa and σa for the blue arm, let’s say the updated sampling
distributions are as shown in Figure 13.3. With the blue arm’s sampling distribution of the
mean reward narrower, let’s say the red arm now has the highest µa + c · σa , and so we
play the red arm. This process goes on until the sampling distributions get narrow enough
to give us adequate confidence in the mean rewards for the actions (i.e., obtain confident
estimates of Q(a)) so we can home in on the action with highest Q(a).
It pays to emphasize that Optimism in the Face of Uncertainty is a great approach to resolve
the Explore-Exploit dilemma because you gain regardless of whether the exploration due
to Optimism produces large rewards or not. If it does produce large rewards, you gain
immediately by collecting the large rewards. If it does not produce large rewards, you still
gain by acquiring the knowledge that certain actions (that you have explored) might not
be the best actions, which helps you in the long-run by focusing your attention on other
actions.
A formalization of the above intuition on Optimism in the Face of Uncertainty is the idea
of Upper Confidence Bounds (abbreviated as UCB). The idea of UCB is that along with an
estimate Q̂t (a) (for each a after t time steps), we also maintain an estimate Ût (a) represent-
ing the upper confidence interval width for the mean reward of a (after t time steps) such
that Q(a) < Q̂t (a) + Ût (a) with high probability. This naturally depends on the number
of times that a has been selected so far (call it Nt (a)). A small value of Nt (a) would imply
a large value of Ût (a) since the estimate of the mean reward would be fairly uncertain. On
the other hand, a large value of Nt (a) would imply a small value of Ût (a) since the esti-
462
mate of the mean reward would be fairly certain. We refer to Q̂t (a) + Ût (a) as the Upper
Confidence Bound (or simply UCB). The idea is to select the action that maximizes the UCB.
Formally, the action At+1 selected for the next (t + 1) time step is as follows:
Next, we develop the famous UCB1 Algorithm. In order to do that, we tap into an
important result from Statistics known as Hoeffding’s Inequality.
1X
n
X̄n = Xi
n
i=1
We can apply Hoeffding’s Inequality to MAB problem instances whose rewards have
probability distributions with [0, 1]-support. Conditioned on selecting action a at time step
t, sample mean X̄n specializes to Q̂t (a), and we set n = Nt (a) and u = Ût (a). Therefore,
Next, we pick a small probability p for Q(a) exceeding UCB Q̂t (a) + Ût (a). Now solve
for Ût (a), as follows: s
− log p
e−2Nt (a)·Ût (a) = p ⇒ Ût (a) =
2
2Nt (a)
We reduce p as we observe more rewards, eg: p = t−α (for some fixed α > 0). This ensures
we select the optimal action as t → ∞. Thus,
s
α log t
Ût (a) =
2Nt (a)
It has been shown that the UCB1 Algorithm achieves logarithmic total regret asymptoti-
cally. Specifically,
463
Theorem 13.4.2 (UCB1 Logarithmic Total Regret). As T → ∞,
X 4α · log T 2α · ∆a
LT ≤ +
∆a α−1
a|∆a >0
Now let’s implement the UCB1 Algorithm in code. The class UCB1 below implements
the interface of the abstract base class MABBase. We’ve implemented the below code for
rewards range [0, B] (adjusting the above UCB1 formula apropriately from [0, 1] range to
[0, B] range). B is specified as the constructor input bounds_range. The constructor input
alpha corresponds to the parameter α specified above. get_episode_rewards_actions im-
plements MABBase’s @abstracmethod interface, and it’s code below should be self-explanatory.
from numpy import ndarray, empty, sqrt, log
from operator import itemgetter
class UCB1(MABBase):
def __init__(
self,
arm_distributions: Sequence[Distribution[float]],
time_steps: int,
num_episodes: int,
bounds_range: float,
alpha: float
) -> None:
if bounds_range < 0 or alpha <= 0:
raise ValueError
super().__init__(
arm_distributions=arm_distributions,
time_steps=time_steps,
num_episodes=num_episodes
)
self.bounds_range: float = bounds_range
self.alpha: float = alpha
def get_episode_rewards_actions(self) -> Tuple[ndarray, ndarray]:
ep_rewards: ndarray = empty(self.time_steps)
ep_actions: ndarray = empty(self.time_steps, dtype=int)
for i in range(self.num_arms):
ep_rewards[i] = self.arm_distributions[i].sample()
ep_actions[i] = i
counts: List[int] = [1] * self.num_arms
means: List[float] = [ep_rewards[j] for j in range(self.num_arms)]
for i in range(self.num_arms, self.time_steps):
ucbs: Sequence[float] = [means[j] + self.bounds_range *
sqrt(0.5 * self.alpha * log(i) /
counts[j])
for j in range(self.num_arms)]
action: int = max(enumerate(ucbs), key=itemgetter(1))[0]
reward: float = self.arm_distributions[action].sample()
counts[action] += 1
means[action] += (reward - means[action]) / counts[action]
ep_rewards[i] = reward
ep_actions[i] = action
return ep_rewards, ep_actions
The above code is in the file rl/chapter14/ucb1.py. The code in __main__ sets up a
UCB1 instance with 6 arms, each having a binomial distribution with n = 10 and p =
{0.4, 0.8, 0.1, 0.5, 0.9, 0.2} for the 6 arms. When run with 1000 time steps, 500 episodes
and α = 4, we get the Total Regret Curve as shown in Figure 13.4.
We encourage you to modify the code in __main__ to model other distributions for the
arms, examine the results obtained, and develop more intuition for the UCB1 Algorithm.
464
Figure 13.4.: UCB1 Total Regret Curve
We get a better performance if our prior knowledge of P[R] is accurate. A simple exam-
ple of Bayesian UCB is to model independent Gaussian distributions. Assume the reward
distribution is Gaussian: Ra (r) = N (r; µa , σa2 ) for all a ∈ A, where µa and σa2 denote the
mean and variance respectively of the Gaussian reward distribution of a. The idea is to
compute a Gaussian posterior over µa , σa2 , as follows:
Y
P[µa , σa2 |Ht ] ∝ P[µa , σa2 ] · N (Rt ; µa , σa2 )
t|At =a
465
This posterior calculation can be performed in an incremental manner by updating P[µAt , σA 2 |H ]
t t
after each time step t (observing Rt after selecting action At ). This incremental calcula-
tion with Bayesian updates to hyperparameters (parameters controlling the probability
distributions of µa and σa2 ) is described in detail in Section G.1 in Appendix G.
Given this posterior distribution for µa and σa2 for all a ∈ A after each time step t, we
select the action that maximizes the Expectation of “c standard-errors above mean” , i.e.,
c · σa
At+1 = arg max EP[µa ,σa2 |Ht ] [µa + p ]
a∈A Nt (a)
• Outcome 1: Ra11 (with probability 0.7) and Ra12 (with probability 0.2). Thus, Out-
come 1 has probability 0.7 * 0.2 = 0.14. In Outcome 1, a1 has the maximum E[r|a]
among all actions since Ra11 has mean 5.0 and Ra12 has mean 2.0.
• Outcome 2: Ra11 (with probability 0.7) and Ra22 (with probability 0.8). Thus, Out-
come 2 has probability 0.7 * 0.8 = 0.56. In Outcome 2, a2 has the maximum E[r|a]
among all actions since Ra11 has mean 5.0 and Ra22 has mean 7.0.
• Outcome 3: Ra21 (with probability 0.3) and Ra12 (with probability 0.2). Thus, Out-
come 3 has probability 0.3 * 0.2 = 0.06. In Outcome 3, a1 has the maximum E[r|a]
among all actions since Ra21 has mean 10.0 and Ra12 has mean 2.0.
• Outcome 4: Ra21 (with probability 0.3) and Ra22 (with probability 0.8). Thus, Out-
come 4 has probability 0.3 * 0.8 = 0.24. In Outcome 4, a1 has the maximum E[r|a]
among all actions since Ra21 has mean 10.0 and Ra22 has mean 7.0.
Thus, a1 has the maximum E[r|a] among the two actions in Outcomes 1, 3 and 4, amount-
ing to a total outcomes probability of 0.14 + 0.06 + 0.24 = 0.44, and a2 has the maximum
466
E[r|a] among the two actions only in Outcome 2, which has an outcome probability of 0.56.
Therefore, in the next time step (t + 1), the Probability Matching method will select action
a1 with probability 0.44 and a2 with probability 0.56.
Generalizing this Probability Matching method to an arbitrary number of actions and
to an arbitrary number of probabilistic outcomes for the conditional reward distributions
for each action, we can write the probabilistic selection of actions at time step t + 1 as:
P[At+1 |Ht ] = PDt ∼P[R|Ht ] [EDt [r|At+1 ] > EDt [r|a] for all a ̸= At+1 ] (13.1)
P[At+1 |Ht ] = PDt ∼P[R|Ht ] [EDt [r|At+1 ] > EDt [r|a]for all a ̸= At+1 ]
= EDt ∼P[R|Ht ] [IAt+1 =arg maxa∈A EDt [r|a] ]
• Select the action (for time step t + 1) that maximizes this sample Action-Value func-
tion:
At+1 = arg max Q̂t (a)
a∈A
467
It turns out that Thompson Sampling achieves the Lai-Robbins lower bound for Loga-
rithmic Total Regret. To learn more about Thompson Sampling, we refer you to the excel-
lent tutorial on Thompson Sampling by Russo, Roy, Kazerouni, Osband, Wen (Russo et
al. 2018).
Now we implement Thompson Sampling by assuming a Gaussian distribution of re-
wards for each action. The posterior distributions for each action are produced by perform-
ing Bayesian updates of the hyperparameters that govern the estimated Gaussian-Inverse-
Gamma Probability Distributions of the parameters of the Gaussian reward distributions
for each action. Section G.1 of Appendix G describes the Bayesian updates of the hyper-
parameters θ, α, β, and the code below implements this update in the variable bayes in
method get_episode_rewards_actions (this method implements the @abstractmethod in-
terface of abstract base class MABBase). The sample mean rewards are obtained by invoking
the sample method of Gaussian and Gamma classes, and assigned to the variable mean_draws.
The variable theta refers to the hyperparameter θ, the variable alpha refers to the hyper-
parameter α, and the variable beta refers to the hyperparameter β. The rest of the code in
the method get_episode_rewards_actions should be self-explanatory.
468
Figure 13.5.: Thompson Sampling (Gaussian) Total Regret Curve
ep_actions[i] = action
return ep_rewards, ep_actions
The above code is in the file rl/chapter14/ts_gaussian.py. The code in __main__ sets
up a ThompsonSamplingGaussian instance with 6 arms, each having a Gaussian rewards
distribution. When run with 1000 time steps and 500 episodes, we get the Total Regret
Curve as shown in Figure 13.5.
We encourage you to modify the code in __main__ to try other mean and variance set-
tings for the Gaussian reward distributions of the arms, examine the results obtained, and
develop more intuition for Thompson Sampling for Gaussians.
Now we implement Thompson Sampling by assuming a Bernoulli distribution of re-
wards for each action. The posterior distributions for each action are produced by per-
forming Bayesian updates of the hyperparameters that govern the estimated Beta Prob-
ability Distributions of the parameters of the Bernoulli reward distributions for each ac-
tion. Section G.2 of Appendix G describes the Bayesian updates of the hyperparameters
α and β, and the code below implements this update in the variable bayes in method
get_episode_rewards_actions (this method implements the @abstractmethod interface of
abstract base class MABBase). The sample mean rewards are obtained by invoking the
sample method of the Beta class, and assigned to the variable mean_draws. The variable
alpha refers to the hyperparameter α and the variable beta refers to the hyperparame-
ter β. The rest of the code in the method get_episode_rewards_actions should be self-
explanatory.
from rl.distribution import Bernoulli, Beta
from operator import itemgetter
from numpy import ndarray, empty
class ThompsonSamplingBernoulli(MABBase):
def __init__(
self,
469
Figure 13.6.: Thompson Sampling (Bernoulli) Total Regret Curve
arm_distributions: Sequence[Bernoulli],
time_steps: int,
num_episodes: int
) -> None:
super().__init__(
arm_distributions=arm_distributions,
time_steps=time_steps,
num_episodes=num_episodes
)
def get_episode_rewards_actions(self) -> Tuple[ndarray, ndarray]:
ep_rewards: ndarray = empty(self.time_steps)
ep_actions: ndarray = empty(self.time_steps, dtype=int)
bayes: List[Tuple[int, int]] = [(1, 1)] * self.num_arms
for i in range(self.time_steps):
mean_draws: Sequence[float] = \
[Beta(alpha=alpha, beta=beta).sample() for alpha, beta in bayes]
action: int = max(enumerate(mean_draws), key=itemgetter(1))[0]
reward: float = float(self.arm_distributions[action].sample())
alpha, beta = bayes[action]
bayes[action] = (alpha + int(reward), beta + int(1 - reward))
ep_rewards[i] = reward
ep_actions[i] = action
return ep_rewards, ep_actions
The above code is in the file rl/chapter14/ts_bernoulli.py. The code in __main__ sets
up a ThompsonSamplingBernoulli instance with 6 arms, each having a Bernoulli rewards
distribution. When run with 1000 time steps and 500 episodes, we get the Total Regret
Curve as shown in Figure 13.6.
We encourage you to modify the code in __main__ to try other mean settings for the
Bernoulli reward distributions of the arms, examine the results obtained, and develop
more intuition for Thompson Sampling for Bernoullis.
470
13.6. Gradient Bandits
Now we cover a MAB algorithm that is similar to Policy Gradient for MDPs. This MAB
algorithm’s action selection is randomized and the action selection probabilities are con-
structed through Gradient Ascent (much like Stochastic Policy Gradient for MDPs). This
MAB Algorithm and it’s variants are cheekily refered to as Gradient Bandits. Our coverage
below follows the coverage of Gradient Bandit algorithm in the RL book by Sutton and
Barto (Richard S. Sutton and Barto 2018).
The basic idea is that we have m Score parameters (to be optimized), one for each action,
denoted as {sa |a ∈ A} that define the action-selection probabilities, which in turn defines
an Expected Reward Objective function to be maximized, as follows:
X
J(sa1 , . . . , sam ) = π(a) · E[r|a]
a∈A
e sa
π(a) = P sb
for all a ∈ A
b∈A e
The Score parameters are meant to represent the relative value of actions based on the
rewards seen until a certain time step, and are adjusted appropriately after each time step
(using Gradient Ascent). Note that π(·) is a Softmax function of the Score parameters.
Gradient Ascent moves the Score parameters sa (and hence, action probabilities π(a))
in the direction of the gradient of the objective function J(sa1 , . . . , sam ) with respect to
∂J
(sa1 , . . . , sam ). To construct this gradient of J(·), we calculate ∂s a
for each a ∈ A, as follows:
P
∂J ∂( a′ ∈A π(a′ ) · E[r|a′ ])
=
∂sa ∂sa
X ∂π(a′ )
= E[r|a′ ] ·
′
∂sa
a ∈A
X ∂ log π(a′ )
= π(a′ ) · E[r|a′ ] ·
∂sa
a′ ∈A
∂ log π(a′ )
= Ea′ ∼π,r∼Ra′ [r · ]
∂sa
∂J
= Ea′ ∼π,r∼Ra′ [r · (Ia=a′ − π(a))]
∂sa
At each time step t, we approximate the gradient with the (At , Rt ) sample as:
471
πt (a) is the probability of selecting action a at time step t, derived from the Score st (a) at
time step t.
We can reduce the variance of this estimate with a baseline B that is independent of a,
as follows:
(Rt − B) · (Ia=At − πt (a)) for all a ∈ A
This doesn’t introduce any bias in the estimate of the gradient of J(·) because:
∂ log π(a′ )
Ea′ ∼π [B · (Ia=a′ − π(a))] = Ea′ ∼π [B · ]
∂sa
X ∂ log π(a′ )
=B· π(a′ ) ·
′
∂sa
a ∈A
X ∂π(a′ )
=B·
∂sa
a′ ∈A
P
∂( a′ ∈A π(a′ ))
=B·
∂sa
∂1
=B·
∂sa
=0
P
We can use B = R̄t = 1t ts=1 Rs (average of all rewards obtained until time step t). So,
the update to scores st (a) for all a ∈ A is:
It should be noted that this Gradient Bandit algorithm and it’s variant Gradient Bandit
algorithms are simply a special case of policy gradient-based RL algorithms.
Now let’s write some code to implement this Gradient Algorithm. Apart from the usual
constructor inputs arm_distributions, time_steps and num_episodes that are passed along
to the constructor of the abstract base class MABBase, GradientBandits’ constructor also
takes as input learning_rate (specifying the initial learning rate) and learning_rate_decay
(specifying the speed at which the learning rate decays), which influence how the variable
step_size is set at every time step. The variable scores represents st (a) for all a ∈ A and
the variable probs represents πt (a) for all a ∈ A. The rest of the code below should be
self-explanatory, based on the above description of the calculations.
472
Figure 13.7.: Gradient Algorithm Total Regret Curve
)
self.learning_rate: float = learning_rate
self.learning_rate_decay: float = learning_rate_decay
def get_episode_rewards_actions(self) -> Tuple[ndarray, ndarray]:
ep_rewards: ndarray = empty(self.time_steps)
ep_actions: ndarray = empty(self.time_steps, dtype=int)
scores: List[float] = [0.] * self.num_arms
avg_reward: float = 0.
for i in range(self.time_steps):
max_score: float = max(scores)
exp_scores: Sequence[float] = [exp(s - max_score) for s in scores]
sum_exp_scores = sum(exp_scores)
probs: Sequence[float] = [s / sum_exp_scores for s in exp_scores]
action: int = Categorical(
{i: p for i, p in enumerate(probs)}
).sample()
reward: float = self.arm_distributions[action].sample()
avg_reward += (reward - avg_reward) / (i + 1)
step_size: float = self.learning_rate *\
(i / self.learning_rate_decay + 1) ** -0.5
for j in range(self.num_arms):
scores[j] += step_size * (reward - avg_reward) *\
((1 if j == action else 0) - probs[j])
ep_rewards[i] = reward
ep_actions[i] = action
return ep_rewards, ep_actions
The above code is in the file rl/chapter14/gradient_bandits.py. The code in __main__ sets
up a GradientBandits instance with 6 arms, each having a Gaussian reward distribution.
When run with 1000 time steps and 500 episodes, we get the Total Regret Curve as shown
in Figure 13.7.
We encourage you to modify the code in __main__ to try other mean and standard de-
viation settings for the Gaussian reward distributions of the arms, examine the results
obtained, and develop more intuition for this Gradient Algorithm.
473
Figure 13.8.: Gaussian Horse Race - Total Regret Curves
Running this horse race for 7 Gaussian arms with 500 time steps, 500 episodes and the
settings as specified in the file rl/chapter14/plot_mab_graphs.py, we obtain Figure 13.8 for
the Total Regret Curves for each of these algorithms.
Figure 13.9 shows the number of times each arm is pulled (for each of these algorithms).
The X-axis is sorted by the mean of the reward distributions of the arms. For each arm,
the left-to-right order of the arm-pulls count is the order in which the 5 MAB algorithms
are listed above. As we can see, the arms with low means are pulled only a few times and
the arms with high means are pulled often.
The file rl/chapter14/plot_mab_graphs.py also has a function to run a horse race for
Bernoulli arms with the following algorithms:
474
Figure 13.9.: Gaussian Horse Race - Arms Count
• Decaying ϵt -Greedy
• UCB1
• Thompson Sampling
• Gradient Bandit
Running this horse race for 9 Bernoulli arms with 500 time steps, 500 episodes and the
settings as specified in the file rl/chapter14/plot_mab_graphs.py, we obtain Figure 13.10
for the Total Regret Curves for each of these algorithms.
Figure 13.11 shows the number of times each arm is pulled (for each of the algorithms).
The X-axis is sorted by the mean of the reward distributions of the arms. For each arm,
the left-to-right order of the arm-pulls count is the order in which the 6 MAB algorithms
are listed above. As we can see, the arms with low means are pulled only a few times and
the arms with high means are pulled often.
We encourage you to experiment with the code in rl/chapter14/plot_mab_graphs.py:
try different arm distributions, try different input parameters for each of the algorithms,
plot the graphs, and try to explain the relative performance of the algorithms (perhaps by
writing some more diagnostics code). This will help build tremendous intuition on the
pros and cons of these algorithms.
475
Figure 13.10.: Bernoulli Horse Race - Total Regret Curves
476
features of history is known as Information State (to indicate that the agent captures all of
the relevant information known so far in the State of the modeled MDP). Before we explain
this Information State Space MDP approach in more detail, it pays to develop an intuitive
understanding of the Value of Information.
The key idea is that Exploration enables the agent to acquire information, which in turn
enables the agent to make more informed decisions as far as it’s future arm-selection strat-
egy is concerned. The natural question to ask then is whether we can quantify the value
of this information that can be acquired by Exploration. In other words, how much would
a decision-maker be willing to pay to acquire information (through exploration), prior to
making a decision? Vaguely speaking, the decision-maker should be paying an amount
equal to the gains in long-term (accumulated) reward that can be obtained upon getting
the information, less the sacrifice of excess immediate reward one would have obtained
had one exploited rather than explored. We can see that this approach aims to settle the
explore-exploit trade-off in a mathematically rigorous manner by establishing the Value of
Information. Note that information gain is higher in a more uncertain situation (all else be-
ing equal). Therefore, it makes sense to explore uncertain situations more. By formalizing
the value of information, we can trade-off exploration and exploitation optimally.
Now let us formalize the approach of treating a MAB as an Information State Space
MDP. After each time step of a MAB, we construct an Information State s̃, which comprises
of relevant features of the history until that time step. Essentially, s̃ summarizes all of the
information accumulated so far that is pertinent to be able to predict the reward distri-
bution for each action. Each action a causes a transition to a new information state s̃′ (by
adding information about the reward obtained after performing action a), with probabil-
ity P̃(s̃, a, s̃′ ). Note that this probability depends on the reward probability function Ra of
the MAB. Moreover, the MAB reward r obtained upon performing action a constitutes the
Reward of the Information State Space MDP for that time step. Putting all this together,
we have an MDP M̃ in information state space as follows:
The key point to note is that since Ra is unknown to the AI Agent in the MAB problem,
the State Transition Probability function and the Reward function of the Information State
Space MDP M̃ are unknown to the AI Agent. However, at any given time step, the AI Agent
can utilize the information within s̃ to form an estimate of Ra , which in turn gives estimates
of the State Transition Probability function and the Reward function of the Information
State Space MDP M̃ .
Note that M̃ will typically be a fairly complex MDP over an infinite number of infor-
mation states, and hence is not easy to solve. However, since it is after all an MDP, we
can use Dynamic Programming or Reinforcement Learning algorithms to arrive at the
Optimal Policy, which prescribes the optimal MAB action to take at that time step. If a
Dynamic Programming approach is taken, then after each time step, as new information
arrives (in the form of the MAB reward in response to the action taken), the estimates of
the State Transition probability function and the Reward function change, meaning the In-
formation State Space MDP to be solved changes, and consequently the Action-Selection
477
strategy for the MAB problem (prescribed by the Optimal Policy of the Information State
Space MDP) changes. A common approach is to treat the Information State Space MDP as
a Bayes-Adaptive MDP. Specifically, if we have m arms a1 , . . . , am , the state s̃ is modeled as
(s˜a1 , . . . , sa˜m ) such that s˜a for any a ∈ A represents a posterior probability distribution over
Ra , which is Bayes-updated after observing the reward upon each pull of the arm a. This
Bayes-Adaptive MDP can be tackled with the highly-celebrated Dynamic Programming
method known as Gittins Index, which was introduced in a 1979 paper by Gittins (Gittins
1979). The Gittins Index approach finds the Bayes-optimal explore-exploit trade-off with
respect to the prior distribution.
To grasp the concept of Information State Space MDP, let us consider a Bernoulli Ban-
dit problem with m arms with arm a’s reward probability distribution Ra given by the
Bernoulli distribution B(µa ), where µa ∈ [0, 1] (i.e., reward = 1 with probability µa , and
reward = 0 with probability 1 − µa ). If we denote the m arms by a1 , a2 , . . . , am , then the
information state is s̃ = (αa1 , βa1 , αa2 , βa2 . . . , αam , βam ), where αa is the number of pulls
of arm a (so far) for which the reward was 1 and βa is the number of pulls of arm a (so
far) for which the reward was 0. Note that by the Law of Large Numbers, in the long-run,
αa +βa → µa .
αa
We can treat this as a Bayes-adaptive MDP as follows: We model the prior distribution
over Ra as the Beta Distribution Beta(αa , βa ) over the unknown parameter µa . Each time
arm a is pulled, we update the posterior for Ra as:
• Beta(αa + 1, βa ) if r = 1
• Beta(αa , βa + 1) if r = 0
Note that the component (αa , βa ) within the information state provides the model Beta(αa , βa )
as the probability distribution over µa . Moreover, note that each state transition (updating
either αa or βa by 1) is essentially a Bayesian model update (Section G.2 in Appendix G
provides details of Bayesian updates to a Beta distribution over a Bernoulli parameter).
Note that in general, an exact solution to a Bayes-adaptive MDP is typically intractable.
In 2014, Guez, Heess, Silver, Dayan (Guez et al. 2014) came up with a Simulation-based
Search method, which involves a forward search in information state space using simula-
tions from current information state, to solve a Bayes-adaptive MDP.
478
as the Context. This means, the Context influences the rewards probability distribution for
each arm. This is known as the Contextual Bandit problem, which we formalize below:
The AI Agent’s goal is to maximize the following Expected Cumulative Rewards over a
certain number of time steps T :
X
T
E[ Rt ]
t=1
Each of the algorithms we’ve covered for the MAB problem can be easily extended to
the Contextual Bandit problem. The key idea in the extension of the MAB algorithms is
that we have to take into account the Context, when dealing with the rewards probability
distribution. In the MAB problem, the algorithms deal with a finite set of reward distribu-
tions, one for each of the actions. Here in the Contextual Bandit problem, the algorithms
work with function approximations for the rewards probability distributions where each
function approximation takes as input a pair of (Context, Action).
We won’t cover the details of the extensions of all MAB Algorithms to Contextual Bandit
algorithms. Rather, we simply sketch a simple Upper-Confidence-Bound algorithm for the
Contextual Bandit problem to convey a sense of how to extend the MAB algorithms to the
Contextual Bandit problem. Assume that the sampling distribution of the mean reward
for each (Context, Action) pair is a gaussian distribution, and so we maintain two function
approximations µ(c, a; w) and σ(c, a; v) to represent the mean and standard deviation of
the sampling distribution of mean reward for any context c and any action a. It’s important
to note that for MAB, we simply maintained a finite set of estimates µa and σa , i.e., two
parameters for each action a. Here we replace µa with function approximation µ(c, a; w)
and we replace σa with function approximation σ(c, a; v). After the receipt of a reward
from the Environment, the parameters w and v are appropriately updated. We essentially
perform supervised learning in an incremental manner when updating these parameters
of the function approximations. Note that σ(c, a; v) represents a function approximation
for the standard error of the mean reward for a given context c and given action a. A simple
Upper-Confidence-Bound algorithm would then select the action for a given context Ct at
479
time step t that maximizes µ(Ct , a; w) + α · σ(Ct , a; v) over all choices of a ∈ A, for some
fixed α. Thus, we are comparing (across actions) α standard errors higher than the mean
reward estimate (i.e., the upper-end of an appropriate confidence interval for the mean
reward) for Context Ct .
We want to highlight that many authors refer to the Context in Contextual Bandits as
State. We desist from using the term State in Contextual Bandits since we want to reserve
the term State to refer to the concept of “transitions” (as is the case in MDPs). Note that
the Context does not “transition” to the next Context in the next time step in Contextual
Bandits problems. Rather, the Context is drawn at random independently at each time step
from the Context probability distribution C. This is in contrast to the State in MDPs which
transitions to the next state at the next time step based on the State Transition probability
function of the MDP.
We finish this chapter by simply pointing out that the approaches of the MAB algorithms
can be further extended to resolve the Explore-Exploit dilemma in RL Control. From the
perspective of this extension, it pays to emphasize that MAB algorithms that fall under the
category of Optimism in the Face of Uncertainty can be roughly split into:
• Those that estimate the Q-Values (i.e., estimate E[r|a] from observed data) and the
uncertainty of the Q-Values estimate. When extending to RL Control, we estimate
the Q-Value Function for the (unknown) MDP and the uncertainty of the Q-Value
Function estimate. Note that when moving from MAB to RL Control, the Q-Values
are no longer simply the Expected Reward for a given action - rather, they are the Ex-
pected Return (i.e., accumulated rewards) from a given state and a given action. This
extension from Expected Reward to Expected Return introduces significant com-
plexity in the calculation of the uncertainty of the Q-Value Function estimate.
• Those that estimate the Model of the MDP, i.e., estimate of the State-Reward Transi-
tion Probability function PR of the MDP, and the uncertainty of the PR estimate. This
includes extension of Bayesian Bandits, Thompson Sampling and Bayes-Adaptive
MDP (for Information State Space MDP) where we replace P[R|Ht ] in the case of
Bandits with P[PR |Ht ] in the case of RL Control. Some of these algorithms sam-
ple from the estimated PR , and learn the Optimal Value Function/Optimal Policy
from the samples. Some other algorithms are Planning-oriented. Specifically, the
Planning-oriented approach is to run a Planning method (eg: Policy Iteration, Value
Iteration) using the estimated PR , then generate more data using the Optimal Pol-
icy (produced by the Planning method), use the generated data to improve the PR
estimate, then run the Planning method again to come up with the Optimal Policy
(for the MDP based on the improved PR estimate), and loop on in this manner until
convergence. As an example of this Planning-oriented approach, we refer you to the
paper on RMax Algorithm (Brafman and Tennenholtz 2001) to learn more.
480
– Optimistic Initialization
– Optimism in the Face of Uncertainty, eg: UCB, Bayesian UCB
– Probability Matching, eg: Thompson Sampling
– Gradient Bandit Algorithms
– Information State Space MDPs (incorporating value of Information), typically
solved by treating as Bayes-Adaptive MDPs
• The above MAB algorithms are well-extensible to Contextual Bandits and RL Con-
trol.
481
14. Blending Learning and Planning
After coverage of the issue of Exploration versus Exploitation in the last chapter, in this
chapter, we cover the topic of Planning versus Learning (and how to blend the two ap-
proaches) in the context of solving MDP Prediction and Control problems. In this chapter,
we also provide some coverage of the much-celebrated Monte-Carlo Tree-Search (abbre-
viated as MCTS) algorithm and it’s spiritual origin - the Adaptive Multi-Stage Sampling
(abbreviated as AMS) algorithm. MCTS and AMS are examples of Planning algorithms
tackled with sampling/RL-based techniques.
1. By interacting with the MDP Environment E, the AI Agent can build a Model of the
Environment (call it M ) and then use that model to estimate the requisite Value Func-
tion/Policy. We refer to this as the Model-Based approach. Solving Prediction/Control
using a Model of the Environment (i.e., Model-Based approach) is known as Planning
the solution. The term Planning comes from the fact that the AI Agent projects (with
the help of the model M ) probabilistic scenarios of future states/rewards for vari-
ous choices of actions from specific states, and solves for the requisite Value Func-
tion/Policy based on the model-projected future outcomes.
2. By interacting with the MDP Environment E, the AI Agent can directly estimate
the requisite Value Function/Policy, without bothering to build a Model of the En-
vironment. We refer to this as the Model-Free approach. Solving Prediction/Control
without using a model (i.e., Model-Free approach) is known as Learning the solu-
tion. The term Learning comes from the fact that the AI Agent “learns” the requisite
Value Function/Policy directly from experiences data obtained by interacting with
the MDP Environment E (without requiring any model).
Let us now dive a bit deeper into both these approaches to understand them better.
483
By “building a model,” we mean estimating PR from experiences data obtained by in-
teracting with the MDP Environment E. How does the AI Agent do this? Well, this is a
matter of estimating the conditional probability density function of pairs of (next state, re-
ward), conditioned on a particular pair of (state, action). This is an exercise in Supervised
Learning, where the y-values are (next state, reward) pairs and the x-values are (state,
action) pairs. We covered how to do Supervised Learning in Chapter 4. Also, note that
Equation (9.12) in Chapter 9 provides a simple tabular calculation to estimate the PR func-
tion for an MRP from a fixed, finite set of atomic experiences of (state, reward, next state)
triples. Following this Equation, we had written the function finite_mrp to construct a
FiniteMarkovRewardProcess (which includes a tabular PR function of explicit probabilities
of transitions), given as input a Sequence[TransitionStep[S]] (i.e., fixed, finite set of MRP
atomic experiences). This approach can be easily extended to estimate the PR function for
an MDP. Ok - now we have a model M in the form of an estimated PR . The next thing
to do in this approach of Planning the solution of Prediction/Control is to use the model
M to estimate the requisite Value Function/Policy. There are two broad approaches to do
this:
484
Figure 14.1.: Planning with a Supervised-Learnt Model
by planning (the planning being done with a Reinforcement Learning algorithm in-
teracting with the learnt simulator).
Figure 14.1 depicts the above-described approach of Planning the solution of Predic-
tion/Control. We start with an arbitrary Policy that is used to interact with the Environ-
ment E (upward-pointing arrow in the Figure). These interactions generate Experiences,
which are used to perform Supervised Learning (rightward-pointing arrow in the Figure)
to learn a model M . This model M is used to plan the requisite Value Function/Policy
(leftward-pointing arrow in the Figure). The Policy produced through this process of
Planning is then used to further interact with the Environment E, which in turn generates
a fresh set of Experiences, which in turn are used to update the Model M (incremental su-
pervised learning), which in turn is used to plan an updated Value Function/Policy, and
so the cycle repeats.
485
14.1.3. Advantages and Disadvantages of Planning versus Learning
In the previous two subsections, we covered the two different approaches to solving Pre-
diction/Control, either by Planning (subsection 14.1.1) or by Learning (subsection 14.1.2).
Let us now talk about their advantages and disadvantages.
Planning involves constructing a Model, so it’s natural advantage is to be able to con-
struct a model (from experiences data) with efficient and robust supervised learning meth-
ods. The other key advantage of Planning is that we can reason about Model Uncertainty.
Specifically, when we learn the Model M using supervised learning, we typically obtain
the standard errors for estimation of model parameters, which can then be used to create
confidence intervals for the Value Function and Policy planned using the model. Further-
more, since modeling real-world problems tends to be rather difficult, it is valuable to cre-
ate a family of models with differing assumptions, with different functional forms, with
differing parameterizations etc., and reason about how the Value Function/Policy would
disperse as a function of this range of models. This is quite beneficial in typical real-world
problems since it enables us to do Prediction/Control in a robust manner.
The disadvantage of Planning is that we have two sources of approximation error - the
first from supervised learning in estimating the model M , and the second from construct-
ing the Value Function/Policy (given the model). The Learning approach (without resort-
ing to a model, i.e., Model-Free RL) is thus advantageous is not having the first source of
approximation error (i.e., Model Error).
In this subsection, we show a rather creative and practically powerful approach to solve
real-world Prediction and Control problems. We basically extend Figure 14.1 to Figure
14.2. As you can see in Figure 14.2, the change is that there is a downward-pointing ar-
row from the Experiences node to the Policy node. This downward-pointing arrow refers
to Model-Free Reinforcement Learning, i.e., learning the Value Function/Policy directly from
experiences obtained by interacting with Environment E, i.e., Model-Free RL. This means
we obtain the requisite Value Function/Policy through the collaborative approach of Plan-
ning (using the model M ) and Learning (using Model-Free RL).
Note that when Planning is based on RL using experiences obtained by interacting
with the Simulated Environment S (based on Model M ), then we obtain the requisite
Value Function/Policy from two sources of experiences (from E and S) that are com-
bined and provided to an RL Algorithm. This means we simultaneously do Model-Based
RL and Model-Free RL. This is creative and powerful because it blends the best of both
worlds - Planning (with Model-Based RL) and Learning (with Model-Free RL). Apart
from Model-Free RL and Model-Based RL being blended here to obtain a more accurate
Value Function/Policy, the Model is simultaneously being updated with incremental su-
pervised learning (rightward-pointing arrow in Figure 14.2) as new experiences are being
generated as a result of the Policy interacting with the Environment E (upward-pointing
arrow in Figure 14.2).
This framework of blending Planning and Learning was created by Richard Sutton which
he named as Dyna (Richard S. Sutton 1991).
486
Figure 14.2.: Blending Planning and Learning
487
14.2. Decision-Time Planning
In the next two sections of this chapter, we cover a couple of Planning methods that are
sampling-based (experiences obtained by interacting with a sampling model) and use RL
techniques to solve for the requisite Value Function/Policy from the model-sampled expe-
riences. We cover the famous Monte-Carlo Tree-Search (MCTS) algorithm, followed by
an algorithm which is MCTS’ spiritual origin - the Adaptive Multi-Stage Sampling (AMS)
algorithm.
Both these algorithms are examples of Decision-Time Planning. The term Decision-Time
Planning requires some explanation. When it comes to Planning (with a model), there are
two possibilities:
• Background Planning: This refers to a planning method where the AI Agent pre-
computes the requisite Value Function/Policy for all states, and when it is time for
the AI Agent to perform the requisite action for a given state, it simply has to refer to
the pre-calculated policy and apply that policy to the given state. Essentially, in the
background, the AI Agent is constantly improving the requisite Value Function/Policy,
irrespective of which state the AI Agent is currently required to act on. Hence, the
term Background Planning.
• Decision-Time Planning: This approach contrasts with Background Planning. In
this approach, when the AI Agent has to identify the best action to take for a specific
state that the AI Agent currently encounters, the calculations for that best-action-
identification happens only when the AI Agent reaches that state. This is appropriate
in situations when there are such a large number of states in the state space that Back-
ground Planning is infeasible. However, for Decision-Time Planning to be effective,
the AI Agent needs to have sufficient time to be able to perform the calculations to
identify the action to take upon reaching a given state. This is feasible in games like
Chess where there is indeed some time for the AI Agent to make it’s move upon en-
countering a specific state of the chessboard (the move response doesn’t need to be
immediate). However, this is not feasible for a self-driving car, where the decision to
accelerate/brake or to steer must be immediate (this requires Background Planning).
Hence, with Decision-Time Planning, the AI Agent focuses all of the available computa-
tion and memory resources for the sole purpose of identifying the best action for a partic-
ular state (the state that has just been reached by the AI Agent). Decision-Time Planning is
typically successful because of this focus on a single state and consequently, on the states
that are most likely to be reached within the next few time steps (essentially, avoiding any
wasteful computation on states that are unlikely to be reached from the given state).
Decision-Time Planning typically looks much deeper than just a single time step ahead
(DP algorithms only look a single time step ahead) and evaluates action choices leading
to many different state and reward possibilities over the next several time steps. Searching
deeper than a single time step ahead is required because these Decision-Time Planning
algorithms typically work with imperfect Q-Values.
Decision-Time Planning methods sometimes go by the name Heuristic Search. Heuristic
Search refers to the method of growing out a tree of future states/actions/rewards from
the given state (which serves as the root of the tree). In classical Heuristic Search, an
approximate Value Function is calculated at the leaves of the tree and the Value Function
is then backed up to the root of the tree. Knowing the backed-up Q-Values at the root of
the tree enables the calculation of the best action for the root state. Modern methods of
Heuristic Search are very efficient in how the Value Function is approximated and backed
488
up. Monte-Carlo Tree-Search (MCTS) in one such efficient method that we cover in the
next section.
• Selection: Starting from the root node R (given state), we successively select children
nodes all the way until a leaf node L. This involves selecting actions based on a tree
policy, and selecting next states by sampling from the model of state transitions. The
trees in Figure 14.3 show states colored as white and actions colored as gray. This
Figure shows the Q-Values for a 2-player game (eg: Chess) where the reward is 1 at
termination for a win, 0 at termination for a loss, and 0 throughout the time the game
is in play. So the Q-Values in the Figure are displayed at each node in the form of
Wins as a fractions of Games Played that passed through the node (Games through
a node means the number of sampling traces that have run through the node). So
the label “1/6” for one of the State nodes (under “Selection,” the first image in the
Figure) means that we’ve had 6 sampling traces from the root node that have passed
through this State node labeled “1/6,” and 1 of those games was won by us. For
Actions nodes (gray nodes), the labels correspond to Opponent Wins as a fraction of
Games through the Action node. So the label “2/3” for one of the Action leaf nodes
means that we’ve had 3 sampling traces from the root node that have passed through
this Action leaf node, and 2 of those resulted in wins for the opponent (i.e., 1 win for
us).
• Expansion: On some rounds, the tree is expanded from L by adding a child node C
to it. In the Figure, we see that L is the Action leaf node labeled as “3/3” and we add
a child node C (state) to it labeled “0/0” (because we don’t yet have any sampling
traces running through this added state C).
• Simulation: From L (or from C if this round involved adding C), we complete the
sampling trace (that started from R and ran through L) all the way to a terminal state
489
Figure 14.3.: Monte-Carlo Tree-Search (This wikipedia image is being used under the cre-
ative commons license CC BY-SA 4.0)
The Selection Step in MCTS involves picking a child node (action) with “most promise,”
for each state in the sampling trace of the Selection Step. This means prioritizing actions
with higher Q-Value estimates. However, this needs to be balanced against actions that
haven’t been tried sufficiently (i.e., those actions whose Q-Value estimates have consider-
able uncertainty). This is our usual Explore v/s Exploit tradeoff that we covered in detail
in Chapter 13. The Explore v/s Exploit formula for games was first provided by Kocsis
and Szepesvari (Kocsis and Szepesvári 2006). This formula is known as Upper Confidence
Bound 1 for Trees (abbreviated as UCT). Most current MCTS Algorithms are based on some
variant of UCT. UCT is based on the UCB1 formula of Auer, Cesa-Bianchi, Fischer (Auer,
Cesa-Bianchi, and Fischer 2002).
490
state and action) is also provided. AMS overcomes the curse of dimensionality by sam-
pling the next state. The key idea in AMS is to adaptively select actions based on a suit-
able tradeoff between Exploration and Exploitation. AMS was the first algorithm to apply
the theory of Multi-Armed Bandits to derive a provably convergent algorithm for solving
finite-horizon MDPs. Moreover, it performs far better than the typical backward-induction
approach to solving finite-horizon MDPs, in cases where the state space is very large and
the action space is fairly small.
We use the same notation we used in section 3.13 of Chapter 3 for Finite-Horizon MDPs
(time steps t = 0, 1, . . . T ). We assume that the state space St for time step t is very large
for all t = 0, 1, . . . , T − 1 (the state space ST for time step T consists of all terminal states).
We assume that the action space At for time step t is fairly small for all t = 0, 1, . . . , T − 1.
We denote the probability distribution for the next state, conditional on the current state
and action (for time step t) as the function Pt : (St × At ) → (St+1 → [0, 1]), defined as:
As mentioned above, for all t = 0, 1, . . . , T − 1, AMS has access to only a sampling model
of Pt , that can be used to fetch a sample of the next state from St+1 . We also assume
that we are given the Expected Reward function Rt : St × At → R for each time step
t = 0, 1, . . . , T − 1 defined as:
Now let us understand how the Nt action selections are done for a given state st . First
we select each of the actions in At exactly once. This is a total of |At | action selections.
Each of the remaining Nt − |At | action selections (indexed as i ranging from |At | to Nt − 1)
is made based on the action that maximizes the following UCT formula (thus balancing
491
exploration and exploitation):
s
2 log i
Q̂t (st , at ) + (14.1)
Ntst ,at
When all Nt action selections are made for a given state st , Vt∗ (st ) = maxat ∈At Q∗t (st , at )
is approximated as:
X N st ,at
V̂tNt (st ) = t
· Q̂t (st , at ) (14.2)
Nt
at ∈At
Now let’s write a Python class to implement AMS. We start by writing it’s constructor.
For convenience, we assume each of the state spaces St (for t = 0, 1, . . . , T ) is the same
(denoted as S) and the allowable actions are the same across all time steps (denoted as
A).
492
PNtst ,at Nt+1 (st ,at ,j)
In the code below, vals_sum builds up the sum j=1 V̂t+1 (st+1 ), and counts rep-
resents Ntst ,at . Before the for loop, we initialize vals_sum by selecting each action at ∈
At (st ) exactly once. Then, for each iteration i of the for loop (for i ranging from |At (st )| to
Nt − 1), we calculate the Upper-Confidence Value (ucb_vals in the code below) for each
of the actions at ∈ At (st ) using the UCT formula of Equation (14.1), and pick an action
a∗t that maximizes ucb_vals. After the termination of the for loop, optimal_vf_and_policy
returns the Optimal Value Function approximation for st based on Equation (14.2) and
the recommended action for st as the action that maximizes Q̂t (st , at )
import numpy as np
from operator import itemgetter
def optimal_vf_and_policy(self, t: int, s: S) -> \
Tuple[float, A]:
actions: Set[A] = self.actions_funcs[t](s)
state_distr_func: Callable[[S, A], Distribution[S]] = \
self.state_distr_funcs[t]
expected_reward_func: Callable[[S, A], float] = \
self.expected_reward_funcs[t]
rewards: Mapping[A, float] = {a: expected_reward_func(s, a)
for a in actions}
val_sums: Dict[A, float] = {a: (self.optimal_vf_and_policy(
t + 1,
state_distr_func(s, a).sample()
)[0] if t < self.num_steps - 1 else 0.) for a in actions}
counts: Dict[A, int] = {a: 1 for a in actions}
for i in range(len(actions), self.num_samples[t]):
ucb_vals: Mapping[A, float] = \
{a: rewards[a] + self.gamma * val_sums[a] / counts[a] +
np.sqrt(2 * np.log(i) / counts[a]) for a in actions}
max_actions: Sequence[A] = [a for a, u in ucb_vals.items()
if u == max(ucb_vals.values())]
a_star: A = np.random.default_rng().choice(max_actions)
val_sums[a_star] += (self.optimal_vf_and_policy(
t + 1,
state_distr_func(s, a_star).sample()
)[0] if t < self.num_steps - 1 else 0.)
counts[a_star] += 1
return (
sum(counts[a] / self.num_samples[t] *
(rewards[a] + self.gamma * val_sums[a] / counts[a])
for a in actions),
max(
[(a, rewards[a] + self.gamma * val_sums[a] / counts[a])
for a in actions],
key=itemgetter(1)
)[0]
)
The above code is in the file rl/chapter15/ams.py. The __main__ in this file tests the
AMS algorithm for the simple case of the Dynamic Pricing problem that we had covered
in Section 3.14 of Chapter 3, although the Dynamic Pricing problem itself is not a problem
where AMS would do better than backward induction (since it’s state space is not very
large). We encourage you to play with our implementation of AMS by constructing a
finite-horizon MDP with a large state space (and small-enough action space). An example
of such a problem is Optimal Stopping (in particular, pricing of American Options) that
we had covered in Chapter 7.
Now let’s analyze the running-time complexity of AMS. Let N = max (N0 , N1 , . . . , NT −1 ).
At each time step t, the algorithm makes at most N recursive calls, and so the running-
493
time complexity is O(N T ). Note that since we need to select every action at least once for
every state at every time step, N ≥ |A|, meaning the running-time complexity is at least
|A|T . Compare this against the running-time complexity of backward induction, which
is O(|S|2 · |A| · T ). So, AMS is more efficient when S is very large (which is typical in
many real-world problems). In their paper, Chang, Fu, Hu, Marcus proved that the Value
Function approximation V̂0N0 is asymptotically unbiased, i.e.,
They also proved that the worst-possible bias is bounded by a quantity that converges to
PT −1 ln Nt
zero at the rate of O( t=0 Nt ). Specifically,
X
T −1
ln Nt
0 ≤ V0∗ (s0 ) − E[V̂0N0 (s0 )] ≤ O( ) for all s0 ∈ S
Nt
t=0
494
15. Summary and Real-World Considerations
The purpose of this chapter is two-fold: Firstly to summarize the key learnings from this
book, and secondly to provide some commentary on how to take the learnings from this
book into practice (to solve real-world problems). On the latter, we specifically focus on
the challenges one faces in the real-world - modeling difficulties, problem-size difficulties,
operational challenges, data challenges (access, cleaning, organization), product manage-
ment challenges (eg: addressing the gap between the technical problem being solved and
the business problem to be solved), and also change-management challenges as one shifts
an enterprise from legacy systems to an AI system.
495
nored, and in such situations, we have to employ (computationally expensive) algorithms
to solve the POMDP.
In Chapter 3, we first covered the foundation of the classical Dynamic Programming
(DP) algorithms - the Banach Fixed-Point Theorem, which gives us a simple method for
iteratively solving for a fixed-point of a contraction function. Next, we constructed the
Bellman Policy Operator and showed that it’s a contraction function, meaning we can take
advantage of the Banach Fixed-Point Theorem, yielding a DP algorithm to solve the Predic-
tion problem, refered to as the Policy Evaluation algorithm. Next, we introduced the no-
tions of a Greedy Policy and Policy Improvement, which yields a DP algorithm known as
Policy Iteration to solve the Control problem. Next, we constructed the Bellman Optimal-
ity Operator and showed that it’s a contraction function, meaning we can take advantage
of the Banach Fixed-Point Theorem, yielding a DP algorithm to solve the Control problem,
refered to as the Value Iteration algorithm. Next, we introduced the all-important concept
of Generalized Policy Iteration (GPI) - the powerful idea of alternating between any method
for Policy Evaluation and any method for Policy Improvement, including methods that
are partial applications of Policy Evaluation or Policy Improvement. This generalized per-
spective unifies almost all of the algorithms that solve MDP Control problems (including
Reinforcement Learning algorithms). We finished this chapter with coverage of Backward
Induction algorithms to solve Prediction and Control problems for finite-horizon MDPs
- Backward Induction is a simple technique to backpropagate the Value Function from
horizon-end to the start. It is important to note that the DP algorithms in this chapter ap-
ply to MDPs with a finite number of states and that these algorithms are computationally
feasible only if the state space is not too large (the next chapter extends these DP algo-
rithms to handle large state spaces, including infinite state spaces).
In Chapter 4, we first covered a refresher on Function Approximation by developing
the calculations first for linear function approximation and then for feed-forward fully-
connected deep neural networks. We also explained that a Tabular prediction can be
viewed as a special form of function approximation (since it satisfies the interface we de-
signed for Function Approximation). With this apparatus for Function Approximation,
we extended the DP algorithms of the previous chapter to Approximate Dynamic Pro-
gramming (ADP) algorithms in a rather straightforward manner. In fact, DP algorithms
can be viewed as special cases of ADP algorithms by setting the function approximation
to be Tabular. Essentially, we replace tabular Value Function updates with updates to
Function Approximation parameters (where the Function Approximation represents the
Value Function). The sweeps over all states in the tabular (DP) algorithms are replaced by
sampling states in the ADP algorithms, and expectation calculations in Bellman Operators
are handled in ADP as averages of the corresponding calculations over transition samples
(versus calculations using explicit transition probabilities in the DP algorithms).
Module II was about Modeling Financial Applications as MDPs. We started Module
II with a basic coverage of Utility Theory in Chapter 5. The concept of Utility is vital
since Utility of cashflows is the appropriate Reward in the MDP for many financial appli-
cations. In this chapter, we explained that an individual’s financial risk-aversion is rep-
resented by the concave nature of the individual’s Utility as a function of financial out-
comes. We showed that the Risk-Premium (compensation an individual seeks for taking
financial risk) is roughly proportional to the individual’s financial risk-aversion and also
proportional to the measure of uncertainty in financial outcomes. Risk-Adjusted-Return
in finance should be thought of as the Certainty-Equivalent-Value, whose Utility is the
Expected Utility across uncertain (risky) financial outcomes. We finished this chapter by
covering the Constant Absolute Risk-Aversion (CARA) and the Constant Relative Risk-
Aversion (CRRA) Utility functions, along with simple asset allocation examples for each
of CARA and CRRA Utility functions.
In Chapter 6, we covered the problem of Dynamic Asset-Allocation and Consumption.
This is a fundamental problem in Mathematical Finance of jointly deciding on A) optimal
investment allocation (among risky and riskless investment assets) and B) optimal con-
sumption, over a finite horizon. We first covered Merton’s landmark paper from 1969 that
provided an elegant closed-form solution under assumptions of continuous-time, normal
distribution of returns on the assets, CRRA utility, and frictionless transactions. In a more
general setting of this problem, we need to model it as an MDP. If the MDP is not too large
and if the asset return distributions are known, we can employ finite-horizon ADP algo-
rithms to solve it. However, in typical real-world situations, the action space can be quite
large and the asset return distributions are unknown. This points to RL, and specifically
RL algorithms that are well suited to tackle large action spaces (such as Policy Gradient
Algorithms).
In Chapter 7, we covered the problem of pricing and hedging of derivative securities.
We started with the fundamental concepts of Arbitrage, Market-Completeness and Risk-
Neutral Probability Measure. Based on these concepts, we stated and proved the two fun-
damental theorems of Asset Pricing for the simple case of a single discrete time-step. These
theorems imply that the pricing of derivatives in an arbitrage-free and complete market
can be done in two equivalent ways: A) Based on construction of a replicating portfolio,
and B) Based on riskless rate-discounted expectation in the risk-neutral probability mea-
sure. Finally, we covered two financial trading problems that can be cast as MDPs. The
first problem is the Optimal Exercise of American Options (and its generalization to Op-
timal Stopping problems). The second problem is the Pricing and Hedging of Derivatives
in an Incomplete (real-world) Market.
In Chapter 8, we covered problems involving trading optimally on an Order Book. We
started with developing an understanding of the core ingredients of an Order Book: Limit
Orders, Market Orders, Order Book Dynamics, and Price Impact. The rest of the chapter
covered two important problems that can be cast as MDPs. These are the problems of Op-
timal Order Execution and Optimal Market-Making. For each of these two problems, we
derived closed-form solutions under highly simplified assumptions (eg: Bertsimas-Lo,
Avellaneda-Stoikov formulations), which helps develop intuition. Since these problems
are modeled as finite-horizon MDPs, we can implement backward-induction ADP algo-
rithms to solve them. However, in practice, we need to develop Reinforcement Learning
algorithms (and associated market simulators) to solve these problems in real-world set-
tings to overcome the Curse of Dimensionality and Curse of Modeling.
Module III covered Reinforcement Learning algorithms. Module III starts by motivat-
ing the case for Reinforcement Learning (RL). In the real world, we typically do not have
access to a model of state-reward transition probabilities. Instead, we simply have access
to an environment that serves up the next state and reward, given the current state and
action, at each step in the AI Agent's interaction with the environment. The environment
could be the actual environment or could be a simulated environment (the latter from a
learnt model of the environment). RL algorithms for Prediction/Control learn the requi-
site Value Function/Policy by obtaining sufficient data (atomic experiences) from interaction
with the environment. This is a sort of “trial and error” learning, through a process of pri-
oritizing actions that seem to fetch good rewards, and deprioritizing actions that seem to
fetch poor rewards. Specifically, RL algorithms are in the business of learning an approx-
imate Q-Value Function, an estimate of the Expected Return for any given action in any
given state. The success of RL algorithms depends not only on their ability to learn the
Q-Value Function in an incremental manner through interactions with the environment,
but also on their ability to perform good generalization of the Q-Value Function with ap-
propriate function approximation (often using deep neural networks, in which case we
term it as Deep RL). Most RL algorithms are founded on the Bellman Equations and all
RL Control algorithms are based on the fundamental idea of Generalized Policy Iteration.
In Chapter 9, we covered RL Prediction algorithms. Specifically, we covered Monte-
Carlo (MC) and Temporal-Difference (TD) algorithms for Prediction. A key learning from
this Chapter was the Bias-Variance tradeoff in MC versus TD. Another key learning was
that while MC Prediction learns the statistical mean of the observed returns, TD Prediction
learns something “deeper” - TD implicitly estimates an MRP from the observed data and
produces the Value Function of the implicitly-estimated MRP. We emphasized viewing TD
versus MC versus DP from the perspectives of “bootstrapping” and “experiencing.” We
finished this Chapter by covering λ-Return Prediction and TD(λ) Prediction algorithms,
which give us a way to tradeoff bias versus variance (along the spectrum of MC to TD) by
tuning the λ parameter. TD is equivalent to TD(0) and MC is “equivalent” to TD(1).
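For concreteness, here is a minimal sketch of tabular TD(0) Prediction (a generic illustration, not this book's rl library interface). The atomic experiences (state, reward, next_state, done) are assumed to come from interaction with the environment under the policy being evaluated.

```python
from collections import defaultdict

def td0_prediction(experiences, gamma=0.9, alpha=0.1):
    # experiences: iterable of atomic experiences (state, reward, next_state, done)
    v = defaultdict(float)                    # tabular Value Function estimate
    for state, reward, next_state, done in experiences:
        # bootstrapped TD target uses the current estimate of the next state's value
        target = reward + (0.0 if done else gamma * v[next_state])
        v[state] += alpha * (target - v[state])
    return dict(v)
```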
In Chapter 10, we covered RL Control algorithms. We re-emphasized that RL Control
is based on the idea of Generalized Policy Iteration (GPI). We explained that Policy Eval-
uation is done for the Q-Value Function (instead of the State-Value Function), and that
the Improved Policy needs to be exploratory, eg: ϵ-greedy. Next we described an im-
portant concept - Greedy in the Limit with Infinite Exploration (GLIE). Our first RL Control
algorithm was GLIE Monte-Carlo Control. Next, we covered two important TD Control
algorithms: SARSA (which is On-Policy) and Q-Learning (which is Off-Policy). We briefly
covered Importance Sampling, which is a different way of doing Off-Policy algorithms.
We wrapped up this Chapter with some commentary on the convergence of RL Predic-
tion and RL Control algorithms. We highlighted a strong pattern of situations when we
run into convergence issues - it is when all three of [Bootstrapping, Function Approxima-
tion, Off-Policy] are done together. We’ve seen how each of these three is individually
beneficial, but when the three come together, it's “too much of a good thing”, bringing
about convergence issues. The confluence of these three is known as the Deadly Triad (an
example of this would be Q-Learning with Function Approximation).
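As a concrete illustration of Off-Policy TD Control, here is a minimal sketch of tabular Q-Learning with an ϵ-greedy behavior policy (a generic sketch, not this book's rl library interface; env.reset() and env.step(state, action) are a hypothetical environment API returning a start state and a (reward, next_state, done) triple respectively).

```python
import random
from collections import defaultdict

def q_learning(env, actions, num_episodes=1000, gamma=0.9, alpha=0.1, epsilon=0.1):
    q = defaultdict(float)                        # Q-Value estimates keyed by (state, action)
    for _ in range(num_episodes):
        state, done = env.reset(), False
        while not done:
            # epsilon-greedy behavior policy keeps the Improved Policy exploratory
            if random.random() < epsilon:
                action = random.choice(actions)
            else:
                action = max(actions, key=lambda a: q[(state, a)])
            reward, next_state, done = env.step(state, action)
            # greedy (max) bootstrapped target is what makes Q-Learning Off-Policy
            target = reward + (0.0 if done else gamma * max(q[(next_state, a)] for a in actions))
            q[(state, action)] += alpha * (target - q[(state, action)])
            state = next_state
    return q
```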
In Chapter 11, we covered the more nuanced RL Algorithms, going beyond the plain-
vanilla MC and TD algorithms we covered in Chapters 9 and 10. We started this Chapter by
introducing the novel ideas of Batch RL and Experience-Replay. Next, we covered the Least-
Squares Monte-Carlo (LSMC) Prediction algorithm and the Least-Squares Temporal-Difference
(LSTD) algorithm, which is a direct (gradient-free) solution of Batch TD. Next, we covered
the very important Deep Q-Networks (DQN) algorithm, which uses Experience-Replay
and fixed Q-learning targets, in order to avoid the pitfalls of time-correlation and varying
TD Target. Next, we covered the Least-Squares Policy Iteration (LSPI) algorithm, which is
an Off-Policy, Experience-Replay Control Algorithm using LSTDQ for Policy Evaluation.
Then we showed how Optimal Exercise of American Options can be tackled with LSPI and
Deep Q-Learning algorithms. In the second half of this Chapter, we looked deeper into the
issue of the Deadly Triad by viewing Value Functions as Vectors so as to understand Value
Function Vector transformations with a balance of geometric intuition and mathematical
rigor, providing insights into convergence issues for a variety of traditional loss functions
used to develop RL algorithms. Finally, this treatment of Value Functions as Vectors led
us in the direction of overcoming the Deadly Triad by defining an appropriate loss func-
tion, calculating whose gradient provides a more robust set of RL algorithms known as
Gradient Temporal Difference (Gradient TD).
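To illustrate the gradient-free nature of LSTD, here is a minimal sketch of batch LSTD Prediction with linear features (a generic illustration, not this book's rl library interface); phi is a hypothetical feature map and experiences is a batch of atomic experiences (state, reward, next_state, done).

```python
import numpy as np

def lstd_prediction(experiences, phi, gamma=0.9, reg=1e-5):
    m = len(phi(experiences[0][0]))
    A = reg * np.eye(m)                          # small regularization keeps A invertible
    b = np.zeros(m)
    for state, reward, next_state, done in experiences:
        x = np.asarray(phi(state), dtype=float)
        x_next = np.zeros(m) if done else np.asarray(phi(next_state), dtype=float)
        A += np.outer(x, x - gamma * x_next)     # accumulate the LSTD A matrix
        b += reward * x                          # accumulate the LSTD b vector
    return np.linalg.solve(A, b)                 # linear Value Function weights, solved directly
```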
In Chapter 12, we covered Policy Gradient (PG) algorithms, which are based on GPI
with Policy Improvement as a Stochastic Gradient Ascent for an Expected Returns Objective
using a policy function approximation. We started with the Policy Gradient Theorem that
gives us a simple formula for the gradient of the Expected Returns Objective in terms of the
score of the policy function approximation. Our first PG algorithm was the REINFORCE
algorithm, a Monte-Carlo Policy Gradient algorithm with no bias but high variance. We
showed how to tackle the Optimal Asset Allocation problem with REINFORCE. Next, we
showed how we can reduce variance in PG algorithms by using a critic and by using an es-
timate of the advantage function in place of the Q-Value Function. Next, we showed how
to overcome bias in PG Algorithms based on the Compatible Function Approximation Theo-
rem. Finally, we covered two specialized PG algorithms that have worked well in practice -
Natural Policy Gradient and Deterministic Policy Gradient. We also provided some cover-
age of Evolutionary Strategies, which are technically not RL algorithms, but they resemble
PG Algorithms and can sometimes be quite effective in solving MDP Control problems.
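As a reminder of the key formulas (one common statement of these results, not necessarily in this book's exact notation), the Policy Gradient Theorem for the Expected Returns Objective $J(\theta)$ and the resulting REINFORCE update are:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\Big[\sum_{t \geq 0} \gamma^t \cdot \nabla_\theta \log \pi_\theta(a_t | s_t) \cdot Q^{\pi_\theta}(s_t, a_t)\Big], \qquad \Delta\theta = \alpha \cdot \gamma^t \cdot G_t \cdot \nabla_\theta \log \pi_\theta(a_t | s_t)$$

REINFORCE replaces $Q^{\pi_\theta}(s_t, a_t)$ with the sampled return $G_t$ (unbiased but high-variance); an Actor-Critic replaces it with a learned critic, and substituting an estimate of the advantage $A^{\pi_\theta}(s_t, a_t) = Q^{\pi_\theta}(s_t, a_t) - V^{\pi_\theta}(s_t)$ reduces variance further.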
In Module IV, we provided some finishing touches by covering the topic of Exploration
versus Exploitation and the topic of Blending Learning and Planning in some detail. In
Chapter 13, we provided significant coverage of algorithms for the Multi-Armed Ban-
dit (MAB) problem, which provides a simple setting to understand and appreciate the
nuances of the Explore versus Exploit dilemma that we typically need to resolve within
RL Control algorithms. We started with simple methods such as Naive Exploration (eg:
ϵ-greedy) and Optimistic Initialization. Next, we covered methods based on the broad
approach of Optimism in the Face of Uncertainty (eg: Upper-Confidence Bounds). Next,
we covered the powerful and practically effective method of Probability Matching (eg:
Thompson Sampling). Then we also covered Gradient Bandit Algorithms and a disci-
plined approach to balancing exploration and exploitation by forming Information State
Space MDPs (incorporating the Value of Information), typically solved by treating them as
Bayes-Adaptive MDPs. Finally, we noted that the above MAB algorithms are well-extensible to
Contextual Bandits and RL Control.
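As a small concrete example of Optimism in the Face of Uncertainty, here is a minimal sketch of the UCB1 algorithm for a Multi-Armed Bandit (a generic sketch, not this book's implementation); pull(arm) is a hypothetical function returning a sampled reward for the chosen arm.

```python
import math

def ucb1(pull, num_arms, num_steps, c=math.sqrt(2)):
    counts = [0] * num_arms                       # number of pulls per arm
    means = [0.0] * num_arms                      # empirical mean reward per arm
    for t in range(1, num_steps + 1):
        if t <= num_arms:
            arm = t - 1                           # pull each arm once to initialize
        else:
            # empirical mean plus an optimism bonus that shrinks as an arm is pulled more
            arm = max(range(num_arms),
                      key=lambda a: means[a] + c * math.sqrt(math.log(t) / counts[a]))
        reward = pull(arm)
        counts[arm] += 1
        means[arm] += (reward - means[arm]) / counts[arm]   # incremental mean update
    return means, counts
```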
In Chapter 14, we covered the issue of Planning versus Learning, and showed how
to blend Planning and Learning. Next, we covered Monte-Carlo Tree-Search (MCTS),
which is a Planning algorithm based on Tree-Search and based on sampling/RL tech-
niques. Lastly, we covered Adaptive Multi-Stage Sampling (AMS), which we consider to
be the spiritual origin of MCTS - it is an efficient algorithm for finite-horizon MDPs with
very large state space and fairly small action space.
obvious choice at all - it requires considerable thought and typically one would need to
consult with the business head to identify what exactly is the objective function in run-
ning the business, eg: the precise definition of the Utility function. One should also bear
in mind that a typical real-world problem is actually a Partially Observable Markov Deci-
sion Process (POMDP) rather than an MDP. In the pursuit of computational tractability,
one might approximate the POMDP as an MDP but in order to do so, one requires strong
understanding of the business domain. However, sometimes partial state-observability
cannot be ignored, and in such situations, we have to employ (computationally expen-
sive) algorithms to solve the POMDP. Indeed, controlling state space explosion is one of
the biggest challenges in the real-world. Much of the effort in modeling an MDP is to de-
fine a state space that finds the appropriate balance between capturing the key aspects of
the real-world problem and attaining computational tractability.
Now we’d like to share the approach we usually take when encountering a new prob-
lem, like one of the Financial Applications we covered in Module II. Our first stab at the
problem is to create a simpler version of the problem that lends itself to analytical tractabil-
ity, exploring ways to develop a closed-form solution (like we obtained for some of the
Financial Applications in Module II). This typically requires removing some of the fric-
tions and constraints of the real-world problem. For Financial Applications, this might
involve assuming no transaction costs, perhaps assuming continuous trading, perhaps as-
suming no liquidity constraints. There are multiple advantages of deriving a closed-form
solution with simplified assumptions. Firstly, the closed-form solution immediately pro-
vides tremendous intuition as it shows the analytical dependency of the Optimal Value
Function/Optimal Policy on the inputs and parameters of the problem. Secondly, when
we eventually obtain the solution to the full-fledged model, we can test the solution by
creating a special case of the full-fledged model that reduces to the simplified model for
which we have a closed-form solution. Thirdly, the expressions within the closed-form
solution provide us with some guidance on constructing appropriate features for function
approximation when solving the full-fledged model.
The next stage would be to bring in some of the real-world frictions and constraints,
and attempt to solve the problem with Dynamic Programming (or Approximate Dynamic
Programming). This means we need to construct a model of state-reward transition prob-
abilities. Such a model would be estimated from real-world data obtained from interac-
tion with the actual environment. However, often we find that Dynamic Programming
(or Approximate Dynamic Programming) is not an option due to the Curse of Modeling
(i.e., hard to build a model of transition probabilities). This leaves us with the eventual
go-to option of pursuing a Reinforcement Learning technique. In most real-world prob-
lems, we’d employ RL not with actual environment interactions, but with simulated en-
vironment interactions. This means we need to build a sampling model estimated from
real-world data obtained from interactions with the actual environment. In fact, in many
real-world problems, we’d want to augment the data-learnt simulator with human knowl-
edge/assumptions (specifically, information that a human expert might be knowledgeable
about but that might not be readily obtained from electronic data). Having a simulator of the
environment is very valuable because we can run it indefinitely and also because we can
create a variety of scenarios (with different settings/assumptions) to run the simulator in.
Deep Learning-based function approximations have been quite successful in the context
of Reinforcement Learning algorithms (we refer to this as Deep Reinforcement Learning).
Lastly, it pays to re-emphasize that the learnings from Chapter 14 are very important for
real-world problems. In particular, the idea of blending model-based RL with model-free
RL (Figure 14.2) is an attractive option for real-world applications because the real-world
is typically not stationary and hence, models need to be updated continuously.
Given the plethora of choices for different types of RL algorithms, it is indeed difficult
to figure out which RL algorithm would be most suitable for a given real-world problem.
As ever, we recommend starting with a simple algorithm such as the MC and TD meth-
ods we used in Chapters 9 and 10. Although the simple algorithms may not be powerful
enough for many real-world applications, they are a good place to start to try out on a
smaller size of the actual problem - these simple RL algorithms are very easy to imple-
ment, reason about and debug. However, the most important advice we can give you is
that after having understood the various nuances of the specific real-world problem you
want to solve, you should aim to construct an RL algorithm that is customized for your
problem. One must recognize that the set of RL algorithms is not a fixed menu to choose
from. Rather, there are various pieces of RL algorithms that are open to modification. In
fact, we can combine different aspects of different algorithms to suit our specific needs for
a given real-world problem. We not only make choices on features in function approxima-
tions and on hyper-parameters, we also make choices on the exact design of the algorithm
method, eg: how exactly we’d like to do Off-Policy Learning, or how exactly we’d like to
do the Policy Evaluation component of Generalized Policy Iteration in our Control algo-
rithm. In practice, we’ve found that we often end up with the more advanced algorithms
due to the typical real-world problem complexity or state-space/action-space size. There
is no silver bullet here, and one has to try various algorithms to see which one works best
for the given problem. However, it pays to share that the algorithms that have worked
well for us in real-world problems are Least-Squares Policy Iteration, Gradient Temporal-
Difference, Deep Q-Networks and Natural Policy Gradient. We have always paid attention
to Richard Sutton’s mantra of avoiding the Deadly Triad. We recommend the excellent
paper by van Hasselt, Doron, Strub, Hessel, Sonnerat, and Modayil (Hasselt et al. 2018) to under-
stand the nuances of the Deadly Triad in the context of Deep Reinforcement Learning.
It’s important to recognize that the code we developed in this book is for educational
purposes and we barely made an attempt to make the code performant. In practice, this
type of educational code won’t suffice - we need to develop highly performant code and
make the code parallelizable wherever possible. This requires an investment in a suitable
distributed system for storage and compute, so the RL algorithms can be trained in an
efficient manner.
When it comes to making an RL algorithm successful in a real-world application, the
design and implementation of the model and the algorithm is only a small piece of the
overall puzzle. Indeed, one needs to build an entire ecosystem of data management, soft-
ware engineering, model training infrastructure, model deployment platform, tools for
easy debugging, measurements/instrumentation and explainability of results. Moreover,
it is vital to have a strong Product Management practice in order to ensure that the algo-
rithm is serving the needs of the overall product being built. Indeed, the goal is to build a
successful product, not just a model and an algorithm. A key challenge in many organiza-
tions is to replace a legacy system or a manual system with a modern solution (eg: with
an RL-based solution). This requires investment in a culture change in the organization
so that all stakeholders are supportive, otherwise the change management will be very
challenging.
When the product carrying the RL algorithm runs in production, it is vital to evaluate
whether the real-world problem is actually being solved effectively by defining, evaluating
and reporting the appropriate success metrics. If those metrics are found to be inadequate,
we need the appropriate feedback system in the organization to investigate why the prod-
uct (and perhaps the model) is not delivering the requisite results. It could be that we
have designed a model which is not quite the right fit for the real-world problem, in which
case we improve the model in the next iteration of this feedback system. It often takes sev-
eral iterations of evaluating the success metrics, providing feedback, and improving the
model (and sometimes the algorithm) in order to achieve adequate results. An important
point to note is that typically in practice, we rarely need to solve all the way to an optimum -
typically, being close to the optimum is good enough to achieve the requisite success metrics.
A Product Manager must constantly question whether we are solving the right problem,
and whether we are investing our efforts in the most important aspects of the problem (eg:
ask if it suffices to be reasonably close to optimum).
Lastly, one must recognize that typically in the real-world, we are plagued with noisy
data, incomplete data and sometimes plain wrong data. The design of the model needs
to take this into account. Also, there is no such thing as the “perfect model” - in practice,
a model is simply a crude approximation of reality. It should be assumed by default that
we have bad data and that we have an imperfect model. Hence, it is important to build a
system that can reason about uncertainties in data and about uncertainties with the model.
A book can simply not do justice to explaining the various nuances and complications
that arise in developing and deploying an RL-based solution in the real-world. Here we
have simply scratched the surface of the various issues that arise. You would truly un-
derstand and appreciate these nuances and complications only by stepping into the real-
world and experiencing it for yourself. However, it is important to first be grounded in
the foundations of RL, which is what we hope you got from this book.
Appendix
A. Moment Generating Function and its
Applications
The purpose of this Appendix is to introduce the Moment Generating Function (MGF) and
demonstrate its utility in several applications in Applied Mathematics.

$$f_x^{(n)}(t) = \mathbb{E}_x[x^n \cdot e^{tx}] \quad \text{for all } n \in \mathbb{Z}_{\geq 0}, \text{ for all } t \in \mathbb{R} \tag{A.2}$$
Note that this holds true for any distribution for x. This is rather convenient since all we
need is the functional form for the distribution of x. This would lead us to the expression
for the MGF (in terms of t). Then, we take derivatives of this MGF and evaluate those
derivatives at 0 to obtain the moments of x.
Equation (A.4) helps us calculate the often-appearing expectation $\mathbb{E}_x[x^n \cdot e^x]$. In fact,
$\mathbb{E}_x[e^x]$ and $\mathbb{E}_x[x \cdot e^x]$ are very common in several areas of Applied Mathematics. Again,
note that this holds true for any distribution for $x$.
The MGF should be thought of as an alternative specification of a random variable (al-
ternative to specifying its Probability Distribution). This alternative specification is very
valuable because it can sometimes provide better analytical tractability than working with
the Probability Density Function or Cumulative Distribution Function (as an example, see
the below section on the MGF for linear functions of independent random variables).
The Probability Density Function of x is complicated to calculate as it involves convolu-
tions. However, observe that the MGF fx of x is given by:
$$f_x(t) = \mathbb{E}\big[e^{t(\alpha_0 + \sum_{i=1}^m \alpha_i x_i)}\big] = e^{\alpha_0 t} \cdot \prod_{i=1}^m \mathbb{E}[e^{t \alpha_i x_i}] = e^{\alpha_0 t} \cdot \prod_{i=1}^m f_{\alpha_i x_i}(t) = e^{\alpha_0 t} \cdot \prod_{i=1}^m f_{x_i}(\alpha_i t)$$
This means the MGF of $x$ can be calculated as $e^{\alpha_0 t}$ times the product of the MGFs of $\alpha_i x_i$
(equivalently, the $\alpha_i$-scaled MGFs of $x_i$) for all $i = 1, 2, \ldots, m$. This gives us a much more
analytically tractable handle on the probability distribution of $x$ (compared to the convolution approach).
$$f'_{x \sim N(\mu,\sigma^2)}(t) = \mathbb{E}_{x \sim N(\mu,\sigma^2)}[x \cdot e^{tx}] = (\mu + \sigma^2 t) \cdot e^{\mu t + \frac{\sigma^2 t^2}{2}} \tag{A.6}$$
$$f''_{x \sim N(\mu,\sigma^2)}(t) = \mathbb{E}_{x \sim N(\mu,\sigma^2)}[x^2 \cdot e^{tx}] = \big((\mu + \sigma^2 t)^2 + \sigma^2\big) \cdot e^{\mu t + \frac{\sigma^2 t^2}{2}} \tag{A.7}$$
$$f'_{x \sim N(\mu,\sigma^2)}(0) = \mathbb{E}_{x \sim N(\mu,\sigma^2)}[x] = \mu$$
$$f''_{x \sim N(\mu,\sigma^2)}(0) = \mathbb{E}_{x \sim N(\mu,\sigma^2)}[x^2] = \mu^2 + \sigma^2$$
$$f'_{x \sim N(\mu,\sigma^2)}(1) = \mathbb{E}_{x \sim N(\mu,\sigma^2)}[x \cdot e^x] = (\mu + \sigma^2) \cdot e^{\mu + \frac{\sigma^2}{2}}$$
$$f''_{x \sim N(\mu,\sigma^2)}(1) = \mathbb{E}_{x \sim N(\mu,\sigma^2)}[x^2 \cdot e^x] = \big((\mu + \sigma^2)^2 + \sigma^2\big) \cdot e^{\mu + \frac{\sigma^2}{2}}$$
This problem of minimizing $\mathbb{E}_x[e^{tx}]$ shows up a lot in various places in Applied Mathe-
matics when dealing with exponential functions (eg: when optimizing the Expectation of
a Constant Absolute Risk-Aversion (CARA) Utility function $U(y) = \frac{1 - e^{-\gamma y}}{\gamma}$, where $\gamma$ is the
coefficient of risk-aversion and where $y$ is a parameterized function of a random variable $x$).
Let us denote $t^*$ as the value of $t$ that minimizes the MGF. Specifically,
$$\min_{t \in \mathbb{R}} f_{x \sim N(\mu,\sigma^2)}(t) = e^{\mu t^* + \frac{\sigma^2 t^{*2}}{2}} = e^{\frac{-\mu^2}{2\sigma^2}} \tag{A.9}$$
$$f'_{x \sim B(\mu+\sigma, \mu-\sigma)}(t) = 0.5 \cdot \big((\mu + \sigma) \cdot e^{(\mu+\sigma)t} + (\mu - \sigma) \cdot e^{(\mu-\sigma)t}\big)$$
Note that unless $\mu$ lies in the open interval $(-\sigma, \sigma)$ (i.e., the absolute value of the mean is less than the standard
deviation), $f'_{x \sim B(\mu+\sigma, \mu-\sigma)}(t)$ will not be 0 for any value of $t$. Therefore, for this minimiza-
tion to be non-trivial, we will henceforth assume $\mu \in (-\sigma, \sigma)$. With this assumption in
place, setting $f'_{x \sim B(\mu+\sigma, \mu-\sigma)}(t)$ to 0 yields:
$$(\mu + \sigma) \cdot e^{(\mu+\sigma)t^*} + (\mu - \sigma) \cdot e^{(\mu-\sigma)t^*} = 0$$
which leads to:
$$t^* = \frac{1}{2\sigma} \ln\Big(\frac{\sigma - \mu}{\mu + \sigma}\Big)$$
Note that
$$f''_{x \sim B(\mu+\sigma,\mu-\sigma)}(t) = 0.5 \cdot \big((\mu+\sigma)^2 \cdot e^{(\mu+\sigma)t} + (\mu-\sigma)^2 \cdot e^{(\mu-\sigma)t}\big) > 0 \text{ for all } t \in \mathbb{R}$$
$$\min_{t \in \mathbb{R}} f_{x \sim B(\mu+\sigma,\mu-\sigma)}(t) = 0.5 \cdot \big(e^{(\mu+\sigma)t^*} + e^{(\mu-\sigma)t^*}\big) = 0.5 \cdot \Big(\Big(\frac{\sigma-\mu}{\mu+\sigma}\Big)^{\frac{\mu+\sigma}{2\sigma}} + \Big(\frac{\sigma-\mu}{\mu+\sigma}\Big)^{\frac{\mu-\sigma}{2\sigma}}\Big)$$
B. Portfolio Theory
In this Appendix, we provide a quick and terse introduction to Portfolio Theory. While this
topic is not a direct pre-requisite for the topics we cover in the chapters, we believe one
should have some familiarity with the risk versus reward considerations when construct-
ing portfolios of financial assets, and know of the important results. To keep this Appendix
brief, we will provide the minimal content required to understand the essence of the key
concepts. We won’t be doing rigorous proofs. We will also ignore details pertaining to
edge-case/irregular-case conditions so as to focus on the core concepts.
$$X_p^T \cdot \mathbf{1}_n = 1$$
where $\mathbf{1}_n \in \mathbb{R}^n$ is a column vector comprising all 1's.
We shall drop the subscript p in Xp whenever the reference to portfolio p is clear.
$$X^T \cdot \mathbf{1}_n = 1$$
$$X^T \cdot R = r_p$$
where rp is the mean return for Efficient Portfolio p. We set up the Lagrangian and solve
to express X in terms of R, V, rp . Substituting for X gives us the efficient frontier parabola
of Efficient Portfolio Variance $\sigma_p^2$ as a function of its mean $r_p$:
$$\sigma_p^2 = \frac{a - 2b r_p + c r_p^2}{ac - b^2}$$
where
• $a = R^T \cdot V^{-1} \cdot R$
• $b = R^T \cdot V^{-1} \cdot \mathbf{1}_n$
• $c = \mathbf{1}_n^T \cdot V^{-1} \cdot \mathbf{1}_n$
• It has mean $r_0 = \frac{b}{c}$.
• It has variance $\sigma_0^2 = \frac{1}{c}$.
• It has investment proportions $X_0 = \frac{V^{-1} \cdot \mathbf{1}_n}{c}$.
GMVP is positively correlated with all portfolios and with all assets. GMVP's covariance
with all portfolios and with all assets is a constant value equal to $\sigma_0^2 = \frac{1}{c}$ (which is also
equal to its own variance).
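A minimal numpy sketch of the quantities above (hypothetical inputs: R is the vector of asset mean returns and V the covariance matrix of asset returns):

```python
import numpy as np

def efficient_frontier_quantities(R, V):
    V_inv = np.linalg.inv(V)
    ones = np.ones(len(R))
    a = R @ V_inv @ R
    b = R @ V_inv @ ones
    c = ones @ V_inv @ ones
    gmvp_weights = V_inv @ ones / c                 # GMVP investment proportions X_0
    gmvp_mean, gmvp_variance = b / c, 1.0 / c       # GMVP mean r_0 and variance sigma_0^2
    def frontier_variance(r_p):
        # variance of the efficient portfolio with mean r_p
        return (a - 2 * b * r_p + c * r_p ** 2) / (a * c - b ** 2)
    return gmvp_weights, gmvp_mean, gmvp_variance, frontier_variance
```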
$$r_z = \frac{a - b r_p}{b - c r_p}$$
z always lies on the opposite side of p on the (efficient frontier) parabola. If we treat
the Efficient Frontier as a curve of mean (y-axis) versus variance (x-axis), the straight line
from p to GMVP intersects the mean axis (y-axis) at rz . If we treat the Efficient Frontier
as a curve of mean (y-axis) versus standard deviation (x-axis), the tangent to the efficient
frontier at p intersects the mean axis (y-axis) at rz . Moreover, all portfolios on one side of
the efficient frontier are positively correlated with each other.
Figure B.1.: Efficient Frontier for 16 Assets
Varying α from −∞ to +∞ basically traces the entire efficient frontier. So to construct all
efficient portfolios, we just need to identify two canonical efficient portfolios. One of them
is GMVP. The other is a portfolio we call Special Efficient Portfolio (SEP) with:
• Mean $r_1 = \frac{a}{b}$.
• Variance $\sigma_1^2 = \frac{a}{b^2}$.
• Investment proportions $X_1 = \frac{V^{-1} \cdot R}{b}$.
The orthogonal portfolio to SEP has mean $r_z = \frac{a - b \cdot \frac{a}{b}}{b - c \cdot \frac{a}{b}} = 0$.
where $\beta_p = \frac{V \cdot X_p}{\sigma_p^2} \in \mathbb{R}^n$ is the vector of slope coefficients of regressions where the ex-
planatory variable is the portfolio mean return $r_p \in \mathbb{R}$ and the $n$ dependent variables are
the asset mean returns $R \in \mathbb{R}^n$.
The linearity of $\beta_p$ w.r.t. mean returns $R$ is famously known as the Capital Asset Pricing
Model (CAPM).
• So, in this case, covariance vector $V \cdot X_p$ and $\beta_p$ are just scalar multiples of asset mean
vector.
• The investment proportion $X$ in a given individual asset changes monotonically
along the efficient frontier.
• Covariance $V \cdot X$ is also monotonic along the efficient frontier.
• But $\beta$ is not monotonic, which means that for every individual asset, there is a unique
pair of efficient portfolios that result in maximum and minimum $\beta$s for that asset
(expressible in terms of $a$, $b$, $c$).
• These two portfolios lie symmetrically on opposite sides of the efficient frontier (their
$\beta$s are equal and of opposite signs), and are the only two orthogonal efficient port-
folios with the same variance ($= 2\sigma_0^2$).
• If $r_F < r_0$, then $r_T > r_F$.
• If $r_F > r_0$, then $r_T < r_F$.
• All portfolios on this efficient set are perfectly correlated.
C. Introduction to and Overview of
Stochastic Calculus Basics
In this Appendix, we provide a quick introduction to the Basics of Stochastic Calculus. To
be clear, Stochastic Calculus is a vast topic requiring an entire graduate-level course to de-
velop a good understanding. We shall only be scratching the surface of Stochastic Calculus
and even with the very basics of this subject, we will focus more on intuition than rigor,
and familiarize you with just the most important results relevant to this book. For an ade-
quate treatment of Stochastic Calculus relevant to Finance, we recommend Steven Shreve’s
two-volume discourse Stochastic Calculus for Finance I (Shreve 2003) and Stochastic Cal-
culus for Finance II (Shreve 2004). For a broader treatment of Stochastic Calculus, we
recommend Bernt Oksendal’s book on Stochastic Differential Equations (Øksendal 2003).
• Independent Increments: Increments $Z_{t_1} - Z_{t_0}, Z_{t_2} - Z_{t_1}, \ldots, Z_{t_n} - Z_{t_{n-1}}$ are inde-
pendent of each other.
• Martingale (i.e., Zero-Drift) Property: the Expected Value of any Increment is 0:
$$\mathbb{E}[Z_{t_{i+1}} - Z_{t_i}] = \sum_{j=t_i}^{t_{i+1}-1} \mathbb{E}[Z_{j+1} - Z_j] = 0 \quad \text{for all } i = 0, 1, \ldots, n-1$$
• Variance of Increment: $\mathbb{E}[(Z_{t_{i+1}} - Z_{t_i})^2] = t_{i+1} - t_i$ for all $i = 0, 1, \ldots, n-1$.
Moreover, we have an important property that Quadratic Variation equals Time Steps.
Quadratic Variation over the time interval [ti , ti+1 ] for all i = 0, 1, . . . , n − 1 is defined as:
$$\sum_{j=t_i}^{t_{i+1}-1} (Z_{j+1} - Z_j)^2$$
j=ti
It pays to emphasize the important conceptual difference between the Variance of Incre-
ment property and the Quadratic Variation property. The Variance of Increment property
is a statement about the expectation of the square of the $Z_{t_{i+1}} - Z_{t_i}$ increment, whereas the
Quadratic Variation property is a statement of certainty (note: there is no $\mathbb{E}[\cdots]$ in this
statement) about the sum of squares of atomic increments $Y_j$ over the discrete-steps time-
interval $[t_i, t_{i+1}]$. The Quadratic Variation property owes to the fact that $\mathbb{P}[Y_t^2 = 1] = 1$ for
all $t = 0, 1, \ldots$.
We can view the Quadratic Variations of a Process X over all discrete-step time intervals
[0, t] as a Process denoted [X], defined as:
$$[X]_t = \sum_{j=0}^{t-1} (X_{j+1} - X_j)^2$$
Thus, for the simple random walk Markov Process Z, we have the succinct formula:
[Z]t = t for all t (i.e., this Quadratic Variation process is a deterministic process).
$$z_t^{(n)} = \frac{1}{\sqrt{n}} \cdot Z_{nt} \quad \text{for all } t = 0, \frac{1}{n}, \frac{2}{n}, \ldots$$
It's easy to show that the above properties of the simple random walk process hold for
the $z^{(n)}$ process as well. Now consider the continuous-time process $z$ defined as:
$$z_t = \lim_{n \to \infty} z_t^{(n)} \quad \text{for all } t \in \mathbb{R}_{\geq 0}$$
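A minimal simulation sketch of the scaled random walk $z^{(n)}$ (a generic illustration, not this book's code): for large $n$, sample paths of $z^{(n)}$ on $[0, T]$ behave like Brownian motion.

```python
import numpy as np

def scaled_random_walk_path(n, T=1.0, seed=0):
    rng = np.random.default_rng(seed)
    steps = rng.choice([-1.0, 1.0], size=int(n * T))            # atomic +/-1 increments Y_j
    z = np.concatenate(([0.0], np.cumsum(steps))) / np.sqrt(n)  # z_t^(n) = Z_{nt} / sqrt(n)
    times = np.arange(len(z)) / n                               # t = 0, 1/n, 2/n, ...
    return times, z
```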
C.3. Continuous-Time Stochastic Processes
Brownian motion z is our first example of a Continuous-Time Stochastic Process. Now let
us define a general continuous-time stochastic process, although for the sake of simplic-
ity, we shall restrict ourselves to one-dimensional real-valued continuous-time stochastic
processes.
• t ∈ [0, T ]
• ω∈Ω
Random variable $\lim_{h \to 0} \frac{z_{t+h} - z_t}{h}$ is almost always infinite.
The intuition is that $\frac{z_{t+h} - z_t}{h}$ has a standard deviation of $\frac{1}{\sqrt{h}}$, which goes to $\infty$ as $h$ goes
to 0.
• Sample traces $z(\omega)$ have infinite total variation, meaning:
Random variable $\int_S^T |dz_t| = \infty$ (almost always)
This means each sample random trace of Brownian motion has quadratic variation equal
to the time interval of the trace. The quadratic variation of $z$ expressed as a process $[z]$ has
the deterministic value of $t$ at time $t$. Expressed in infinitesimal terms, we say that:
$$(dz_t)^2 = dt$$
This formula generalizes to:
$$(dz_t^{(1)}) \cdot (dz_t^{(2)}) = \rho \cdot dt$$
where $z^{(1)}$ and $z^{(2)}$ are two different Brownian motions with correlation between the
random variables $z_t^{(1)}$ and $z_t^{(2)}$ equal to $\rho$ for all $t > 0$.
You should intuitively interpret the formula $(dz_t)^2 = dt$ (and its generalization) as
a deterministic statement, and in fact this statement is used as an algebraic convenience
in Brownian motion-based stochastic calculus, forming the core of Ito Isometry and Ito's
Lemma (which we cover shortly, but first we need to define the Ito Integral).
In the interest of focusing on intuition rather than rigor, we skip the technical details
of filtrations and adapted processes that make the above integral sensible. Instead, we
simply say that this integral makes sense only if the random variable $X_s$ for any time $s$ is
disallowed from depending on $z_{s'}$ for any $s' > s$ (i.e., the stochastic process $X$ cannot peek
into the future) and that the time-integral $\int_0^t X_s^2 \cdot ds$ is finite for all $t \geq 0$. So we shall roll
forward with the assumption that the stochastic process $Y$ is defined as the above-specified
integral (known as the Ito Integral) of a stochastic process $X$ with respect to Brownian
motion. The equivalent notation is:
$$dY_t = X_t \cdot dz_t$$
We state without proof the following properties of the Ito Integral stochastic process Y :
• $Y$ is a martingale, i.e., $\mathbb{E}[(Y_t - Y_s)|Y_s] = 0$ (i.e., $\mathbb{E}[Y_t|Y_s] = Y_s$) for all $0 \leq s < t$.
• Ito Isometry: $\mathbb{E}[Y_t^2] = \int_0^t \mathbb{E}[X_s^2] \cdot ds$.
• Quadratic Variance formula: $[Y]_t = \int_0^t X_s^2 \cdot ds$.
Note that we have generalized the notation $[X]$ for discrete-time processes to continuous-
time processes, defined as $[X]_t = \int_0^t (dX_s)^2$ for any continuous-time stochastic process.
Ito Isometry generalizes to:
$$\mathbb{E}\Big[\Big(\int_S^T X_t^{(1)} \cdot dz_t^{(1)}\Big)\Big(\int_S^T X_t^{(2)} \cdot dz_t^{(2)}\Big)\Big] = \int_S^T \mathbb{E}[X_t^{(1)} \cdot X_t^{(2)}] \cdot \rho \cdot dt$$
where $X^{(1)}$ and $X^{(2)}$ are two different stochastic processes, and $z^{(1)}$ and $z^{(2)}$ are two
different Brownian motions with correlation between the random variables $z_t^{(1)}$ and $z_t^{(2)}$
equal to $\rho$ for all $t > 0$.
Likewise, the Quadratic Variance formula generalizes to:
$$\int_S^T (X_t^{(1)} \cdot dz_t^{(1)}) \cdot (X_t^{(2)} \cdot dz_t^{(2)}) = \int_S^T X_t^{(1)} \cdot X_t^{(2)} \cdot \rho \cdot dt$$
C.6. Ito’s Lemma
We can extend the above Ito Integral to an Ito process $Y$ as defined below:
$$dY_t = \mu_t \cdot dt + \sigma_t \cdot dz_t$$
We require the same conditions for the stochastic process $\sigma$ as we required above for $X$
in the definition of the Ito Integral. Moreover, we require that $\int_0^t |\mu_s| \cdot ds$ is finite for all
$t \geq 0$.
In the context of this Ito process Y described above, we refer to µ as the drift process and
we refer to σ as the dispersion process.
Now, consider a twice-differentiable function $f : [0, T] \times \mathbb{R} \to \mathbb{R}$. We define a stochastic
process whose (random) value at time $t$ is $f(t, Y_t)$. Let's write its Taylor series with respect
to the variables $t$ and $Y_t$.
$$df(t, Y_t) = \frac{\partial f}{\partial t} \cdot dt + \frac{\partial f}{\partial Y_t} \cdot (\mu_t \cdot dt + \sigma_t \cdot dz_t) + \frac{1}{2} \cdot \frac{\partial^2 f}{\partial Y_t^2} \cdot (\mu_t \cdot dt + \sigma_t \cdot dz_t)^2 + \ldots$$
Next, we use the rules $(dt)^2 = 0$, $dt \cdot dz_t = 0$, $(dz_t)^2 = dt$ to get Ito's Lemma:
$$df(t, Y_t) = \Big(\frac{\partial f}{\partial t} + \mu_t \cdot \frac{\partial f}{\partial Y_t} + \frac{\sigma_t^2}{2} \cdot \frac{\partial^2 f}{\partial Y_t^2}\Big) \cdot dt + \sigma_t \cdot \frac{\partial f}{\partial Y_t} \cdot dz_t \tag{C.1}$$
In the multi-dimensional case, with
$$dY_t = \mu_t \cdot dt + \sigma_t \cdot dz_t$$
Ito's Lemma reads:
$$df(t, Y_t) = \Big(\frac{\partial f}{\partial t} + (\nabla_Y f)^T \cdot \mu_t + \frac{1}{2} Tr[\sigma_t^T \cdot (\Delta_Y f) \cdot \sigma_t]\Big) \cdot dt + (\nabla_Y f)^T \cdot \sigma_t \cdot dz_t \tag{C.2}$$
where the symbol $\nabla$ represents the gradient of a function, the symbol $\Delta$ represents the
Hessian of a function, and the symbol $Tr$ represents the Trace of a matrix.
Next, we cover two common Ito processes, and use Ito’s Lemma to solve the Stochastic
Differential Equation represented by these Ito Processes:
C.7. A Lognormal Process
Consider a stochastic process $x$ described in the form of the following Ito process:
$$dx_t = \mu(t) \cdot x_t \cdot dt + \sigma(t) \cdot x_t \cdot dz_t$$
Note that here z is standard (one-dimensional) Brownian motion, and µ, σ are deter-
ministic functions of time t. This is solved easily by defining an appropriate function of xt
and applying Ito’s Lemma, as follows:
$$y_t = \log(x_t)$$
$$dy_t = \Big(\mu(t) \cdot x_t \cdot \frac{1}{x_t} - \frac{\sigma^2(t) \cdot x_t^2}{2} \cdot \frac{1}{x_t^2}\Big) \cdot dt + \sigma(t) \cdot x_t \cdot \frac{1}{x_t} \cdot dz_t = \Big(\mu(t) - \frac{\sigma^2(t)}{2}\Big) \cdot dt + \sigma(t) \cdot dz_t$$
So,
$$y_T = y_S + \int_S^T \Big(\mu(t) - \frac{\sigma^2(t)}{2}\Big) \cdot dt + \int_S^T \sigma(t) \cdot dz_t$$
$$x_T = x_S \cdot e^{\int_S^T (\mu(t) - \frac{\sigma^2(t)}{2}) \cdot dt + \int_S^T \sigma(t) \cdot dz_t}$$
$$\mathbb{E}[x_T | x_S] = x_S \cdot e^{\int_S^T \mu(t) \cdot dt}$$
$$\mathbb{E}[x_T^2 | x_S] = x_S^2 \cdot e^{\int_S^T (2\mu(t) + \sigma^2(t)) \cdot dt}$$
$$Variance[x_T | x_S] = \mathbb{E}[x_T^2 | x_S] - (\mathbb{E}[x_T | x_S])^2 = x_S^2 \cdot e^{\int_S^T 2\mu(t) \cdot dt} \cdot \Big(e^{\int_S^T \sigma^2(t) \cdot dt} - 1\Big)$$
The special case of µ(t) = µ (constant) and σ(t) = σ (constant) is a very common Ito
process used all over Finance/Economics (for its simplicity, tractability as well as practi-
cality), and is known as Geometric Brownian Motion, to reflect the fact that the stochastic
increment of the process (σ · xt · dzt ) is multiplicative to the level of the process xt . If we
consider this special case, we get:
$$y_T = \log(x_T) \sim N\Big(\log(x_S) + \big(\mu - \frac{\sigma^2}{2}\big)(T - S), \; \sigma^2 (T - S)\Big)$$
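A minimal sketch simulating Geometric Brownian Motion paths for this constant-$(\mu, \sigma)$ special case (a generic illustration, not this book's code), using the lognormal transition derived above:

```python
import numpy as np

def simulate_gbm_paths(x0, mu, sigma, T=1.0, num_steps=252, num_paths=1000, seed=0):
    rng = np.random.default_rng(seed)
    dt = T / num_steps
    # increments of log(x) are Normal((mu - sigma^2/2) * dt, sigma^2 * dt)
    dlog = ((mu - 0.5 * sigma ** 2) * dt
            + sigma * np.sqrt(dt) * rng.standard_normal((num_paths, num_steps)))
    log_paths = np.log(x0) + np.cumsum(dlog, axis=1)
    start = np.full((num_paths, 1), np.log(x0))
    return np.exp(np.hstack([start, log_paths]))    # shape: (num_paths, num_steps + 1)
```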
C.8. A Mean-Reverting Process
Now we consider a stochastic process $x$ described in the form of the following Ito process:
$$dx_t = \mu(t) \cdot x_t \cdot dt + \sigma(t) \cdot dz_t$$
Defining $y_t = x_t \cdot e^{-\int_0^t \mu(u) \cdot du}$ and applying Ito's Lemma, we get:
$$dy_t = \Big(-x_t \cdot \mu(t) \cdot e^{-\int_0^t \mu(u) \cdot du} + \mu(t) \cdot x_t \cdot e^{-\int_0^t \mu(u) \cdot du}\Big) \cdot dt + \sigma(t) \cdot e^{-\int_0^t \mu(u) \cdot du} \cdot dz_t = \sigma(t) \cdot e^{-\int_0^t \mu(u) \cdot du} \cdot dz_t$$
For the special case of constant $\mu$ and constant $\sigma$, this yields:
$$x_T \sim N\Big(x_S \cdot e^{\mu(T-S)}, \; \frac{\sigma^2}{2\mu} \cdot \big(e^{2\mu(T-S)} - 1\big)\Big)$$
D. The Hamilton-Jacobi-Bellman (HJB)
Equation
In this Appendix, we provide a quick coverage of the Hamilton-Jacobi-Bellman (HJB)
Equation, which is the continuous-time version of the Bellman Optimality Equation. Al-
though much of this book covers Markov Decision Processes in a discrete-time setting, we
do cover some classical Mathematical Finance Stochastic Control formulations in continuous-
time. To understand these formulations, one must first understand the HJB Equation,
which is the purpose of this Appendix. As is the norm in the Appendices in this book,
we will compromise on some of the rigor and emphasize the intuition to develop basic
familiarity with HJB.
$$\max_{a_t \in \mathcal{A}_t} \big\{e^{-\rho t} \cdot R(t, s_t, a_t) \cdot dt + \mathbb{E}_{(t, s_t, a_t)}[e^{-\rho(t+dt)} \cdot V^*(t+dt, s_{t+dt}) - e^{-\rho t} \cdot V^*(t, s_t)]\big\} = 0$$
$$\Rightarrow \max_{a_t \in \mathcal{A}_t} \big\{e^{-\rho t} \cdot R(t, s_t, a_t) \cdot dt + \mathbb{E}_{(t, s_t, a_t)}[e^{-\rho t} \cdot (dV^*(t, s_t) - \rho \cdot V^*(t, s_t) \cdot dt)]\big\} = 0$$
Multiplying throughout by $e^{\rho t}$ and re-arranging, we get:
$$\rho \cdot V^*(t, s_t) \cdot dt = \max_{a_t \in \mathcal{A}_t} \big\{R(t, s_t, a_t) \cdot dt + \mathbb{E}_{(t, s_t, a_t)}[dV^*(t, s_t)]\big\} \tag{D.1}$$
For a finite-horizon problem terminating at time $T$, the above equation is subject to the ter-
minal condition:
$$V^*(T, s_T) = T(s_T)$$
for some terminal reward function $T(\cdot)$.
Equation (D.1) is known as the Hamilton-Jacobi-Bellman Equation - the continuous-
time analog of the Bellman Optimality Equation. In the literature, it is often written in a
more compact form that essentially takes the above form and “divides throughout by dt.”
This requires a few technical details involving the stochastic differentiation operator. To
keep things simple, we shall stick to the HJB formulation of Equation (D.1).
$$dV^*(t, s_t) = \Big(\frac{\partial V^*}{\partial t} + (\nabla_s V^*)^T \cdot \mu_t + \frac{1}{2} Tr[\sigma_t^T \cdot (\Delta_s V^*) \cdot \sigma_t]\Big) \cdot dt + (\nabla_s V^*)^T \cdot \sigma_t \cdot dz_t$$
Substituting this expression for $dV^*(t, s_t)$ in Equation (D.1), and noting that the expectation
of the $dz_t$ term is 0, we get:
$$\rho \cdot V^*(t, s_t) = \max_{a_t \in \mathcal{A}_t} \Big\{\frac{\partial V^*}{\partial t} + (\nabla_s V^*)^T \cdot \mu_t + \frac{1}{2} Tr[\sigma_t^T \cdot (\Delta_s V^*) \cdot \sigma_t] + R(t, s_t, a_t)\Big\} \tag{D.2}$$
For a finite-horizon problem terminating at time $T$, the above equation is subject to the ter-
minal condition:
$$V^*(T, s_T) = T(s_T)$$
for some terminal reward function $T(\cdot)$.
E. Black-Scholes Equation and its Solution
for Call/Put Options
In this Appendix, we sketch the derivation of the much-celebrated Black-Scholes equation
and its solution for Call and Put Options (Black and Scholes 1973). As is the norm in
the Appendices in this book, we will compromise on some of the rigor and emphasize the
intuition to develop basic familiarity with concepts in continuous-time derivatives pricing
and hedging.
E.1. Assumptions
The Black-Scholes Model is about pricing and hedging of a derivative on a single under-
lying asset (henceforth, simply known as “underlying”). The model makes several sim-
plifying assumptions for analytical convenience. Here are the assumptions:
• The underlying (whose price we denote as $S_t$ at time $t$) follows a special case of the
lognormal process we covered in Section C.7 of Appendix C, where the drift $\mu(t)$ is
a constant (call it $\mu \in \mathbb{R}$) and the dispersion $\sigma(t)$ is also a constant (call it $\sigma \in \mathbb{R}^+$):
$$dS_t = \mu \cdot S_t \cdot dt + \sigma \cdot S_t \cdot dz_t \tag{E.1}$$
This process is often referred to as Geometric Brownian Motion to reflect the fact that
the stochastic increment of the process ($\sigma \cdot S_t \cdot dz_t$) is multiplicative to the level of
the process $S_t$.
• The derivative has a known payoff at time t = T , as a function f : R+ → R of the
underlying price ST at time T .
• Apart from the underlying, the market also includes a riskless asset (which should
be thought of as lending/borrowing money at a constant infinitesimal rate of annual
return equal to $r$). The riskless asset (denote its price as $R_t$ at time $t$) movements
can thus be described as:
$$dR_t = r \cdot R_t \cdot dt$$
• Assume that we can trade in any real-number quantity in the underlying as well as in
the riskless asset, in continuous-time, without any transaction costs (i.e., the typical
“frictionless” market assumption).
$$dV(t, S_t) = \Big(\frac{\partial V}{\partial t} + \mu \cdot S_t \cdot \frac{\partial V}{\partial S_t} + \frac{\sigma^2}{2} \cdot S_t^2 \cdot \frac{\partial^2 V}{\partial S_t^2}\Big) \cdot dt + \sigma \cdot S_t \cdot \frac{\partial V}{\partial S_t} \cdot dz_t \tag{E.2}$$
Now here comes the key idea: create a portfolio comprising the derivative and the
underlying so as to eliminate the incremental uncertainty arising from the Brownian mo-
tion increment $dz_t$. It's clear from the coefficients of $dz_t$ in Equations (E.1) and (E.2) that
this can be accomplished with a portfolio comprising $\frac{\partial V}{\partial S_t}$ units of the underlying and $-1$
units of the derivative (i.e., by selling a derivative contract written on a single unit of the
underlying). Let us refer to the value of this portfolio as $\Pi_t$ at time $t$. Thus,
$$\Pi_t = -V(t, S_t) + \frac{\partial V}{\partial S_t} \cdot S_t \tag{E.3}$$
Over an infinitesimal time-period [t, t + dt], the change in the portfolio value Πt is given
by:
$$d\Pi_t = -dV(t, S_t) + \frac{\partial V}{\partial S_t} \cdot dS_t$$
Substituting for $dS_t$ and $dV(t, S_t)$ from Equations (E.1) and (E.2), we get:
$$d\Pi_t = \Big(-\frac{\partial V}{\partial t} - \frac{\sigma^2}{2} \cdot S_t^2 \cdot \frac{\partial^2 V}{\partial S_t^2}\Big) \cdot dt \tag{E.4}$$
∂t 2 ∂St2
Thus, we have eliminated the incremental uncertainty arising from dzt and hence, this
is a riskless portfolio. To ensure the market remains free of arbitrage, the infinitesimal rate
of annual return for this riskless portfolio must be the same as that for the riskless asset,
i.e., must be equal to r. Therefore,
$$d\Pi_t = r \cdot \Pi_t \cdot dt \tag{E.5}$$
From Equations (E.4) and (E.5), we infer that:
$$-\frac{\partial V}{\partial t} - \frac{\sigma^2}{2} \cdot S_t^2 \cdot \frac{\partial^2 V}{\partial S_t^2} = r \cdot \Pi_t$$
Substituting for $\Pi_t$ from Equation (E.3), we get:
$$-\frac{\partial V}{\partial t} - \frac{\sigma^2}{2} \cdot S_t^2 \cdot \frac{\partial^2 V}{\partial S_t^2} = r \cdot \Big(-V(t, S_t) + \frac{\partial V}{\partial S_t} \cdot S_t\Big)$$
Re-arranging, we arrive at the famous Black-Scholes equation:
$$\frac{\partial V}{\partial t} + \frac{\sigma^2}{2} \cdot S_t^2 \cdot \frac{\partial^2 V}{\partial S_t^2} + r \cdot S_t \cdot \frac{\partial V}{\partial S_t} - r \cdot V(t, S_t) = 0 \tag{E.6}$$
A few key points to note here:
changes. Note that $-\frac{\partial V}{\partial S_t}$ represents the hedge units in the underlying at any time $t$
for any underlying price $S_t$, which nullifies the risk of changes to the derivative price
$V(t, S_t)$.
3. The drift µ of the underlying price movement (interpreted as expected annual rate of
return of the underlying) does not appear in the Black-Scholes Equation and hence,
the price of any derivative will be independent of the expected rate of return of the
underlying. Note though the prominent appearance of σ (referred to as the underly-
ing volatility) and the riskless rate of return r in the Black-Scholes equation.
$$\tau = T - t$$
$$x = \log\Big(\frac{S_t}{K}\Big) + \Big(r - \frac{\sigma^2}{2}\Big) \cdot \tau$$
$$u(\tau, x) = C(t, S_t) \cdot e^{r\tau}$$
This reduces the Black-Scholes PDE into the Heat Equation:
$$\frac{\partial u}{\partial \tau} = \frac{\sigma^2}{2} \cdot \frac{\partial^2 u}{\partial x^2}$$
The terminal condition $C(T, S_T) = \max(S_T - K, 0)$ transforms into the Heat Equation's
initial condition:
$$u(0, x) = K \cdot (e^{\max(x, 0)} - 1)$$
Using the standard convolution method for solving this Heat Equation with initial con-
dition u(0, x), we obtain the Green’s Function Solution:
$$u(\tau, x) = \frac{1}{\sigma\sqrt{2\pi\tau}} \cdot \int_{-\infty}^{+\infty} u(0, y) \cdot e^{-\frac{(x-y)^2}{2\sigma^2 \tau}} \cdot dy$$
$$u(\tau, x) = K \cdot e^{x + \frac{\sigma^2 \tau}{2}} \cdot N(d_1) - K \cdot N(d_2)$$
where $N(\cdot)$ is the standard normal cumulative distribution function:
$$N(z) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{z} e^{-\frac{y^2}{2}} \cdot dy$$
and
$$d_1 = \frac{x + \sigma^2 \tau}{\sigma\sqrt{\tau}}, \quad d_2 = d_1 - \sigma\sqrt{\tau}$$
Substituting for $\tau, x, u(\tau, x)$ with $t, S_t, C(t, S_t)$, we get the Black-Scholes price of a European
Call Option:
$$C(t, S_t) = S_t \cdot N(d_1) - K \cdot e^{-r(T-t)} \cdot N(d_2)$$
with
$$d_1 = \frac{\log(\frac{S_t}{K}) + (r + \frac{\sigma^2}{2})(T-t)}{\sigma\sqrt{T-t}}, \quad d_2 = d_1 - \sigma\sqrt{T-t}$$
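A minimal sketch of these resulting formulas for European Call and Put prices (standard results; the Put follows from put-call parity, and scipy.stats.norm provides $N(\cdot)$):

```python
from math import exp, log, sqrt
from scipy.stats import norm

def black_scholes_call(S, K, r, sigma, tau):
    # tau is the time to expiry T - t
    d1 = (log(S / K) + (r + 0.5 * sigma ** 2) * tau) / (sigma * sqrt(tau))
    d2 = d1 - sigma * sqrt(tau)
    return S * norm.cdf(d1) - K * exp(-r * tau) * norm.cdf(d2)

def black_scholes_put(S, K, r, sigma, tau):
    # put-call parity: P = C - S + K * e^{-r * tau}
    return black_scholes_call(S, K, r, sigma, tau) - S + K * exp(-r * tau)
```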
F. Function Approximations as Affine Spaces
F.1. Vector Space
A Vector space is defined as a commutative group V under an addition operation (written
as +), together with multiplication of elements of V with elements of a field K (known
as scalars), expressed as a binary in-fix operation ∗ : K × V → V, with the following
properties:
Then the set of all linear maps with domain V and co-domain W constitutes a function
space (restricted to just this subspace of all linear maps, rather than the space of all V → W
functions) that we denote as L(V, W).
The specialization of the function space of linear maps to the space L(V, K) (i.e., spe-
cializing the vector space W to the scalars field K) is known as the dual vector space and
is denoted as V ∗ .
F.4. Affine Space
An Affine Space is defined as a set A associated with a vector space V and a binary in-fix
operation ⊕ : A × V → A, with the following properties:
• For all a ∈ A, a ⊕ 0 = a, where 0 is the zero vector in V (this is known as the right
identity property).
• For all v1 , v2 ∈ V, for all a ∈ A, (a ⊕ v1 ) ⊕ v2 = a ⊕ (v1 + v2 ) (this is known as the
associativity property).
• For each a ∈ A, the mapping fa : V → A defined as fa (v) = a ⊕ v for all v ∈ V is a
bijection (i.e., one-to-one and onto mapping).
The elements of an affine space are called points and the elements of the vector space
associated with an affine space are called translations. The idea behind affine spaces is that
unlike a vector space, an affine space doesn’t have a notion of a zero element and one cannot
add two points in the affine space. Instead one adds a translation (from the associated vector
space) to a point (from the affine space) to yield another point (in the affine space). The
term translation is used to signify that we “translate” (i.e. shift) a point to another point
in the affine space with the shift being effected by a translation in the associated vector
space. This means there is a notion of “subtracting” one point of the affine space from
another point of the affine space (denoted with the operation ⊖), yielding a translation in
the associated vector space.
A simple way to visualize an affine space is by considering the simple example of the
affine space of all 3-D points on the plane defined by the equation z = 1, i.e., the set of
all points (x, y, 1) for all x ∈ R, y ∈ R. The associated vector space is the set of all 3-D
points on the plane defined by the equation z = 0, i.e., the set of all points (x, y, 0) for
all x ∈ R, y ∈ R (with the usual addition and scalar multiplication operations). We see
that any point (x, y, 1) on the affine space is translated to the point (x + x′ , y + y ′ , 1) by the
translation (x′ , y ′ , 0) in the vector space. Note that the translation (0, 0, 0) (zero vector)
results in the point (x, y, 1) remaining unchanged. Note that translations (x′ , y ′ , 0) and
(x′′ , y ′′ , 0) applied one after the other is the same as the single translation (x′ +x′′ , y ′ +y ′′ , 0).
Finally, note that for any fixed point (x, y, 1), we have a bijective mapping from the vector
space z = 0 to the affine space z = 1 that maps any translation (x′ , y ′ , 0) to the point
(x + x′ , y + y ′ , 1).
type D is specified as a generic container data type because we consider generic function
approximations here. A specific family of function approximations will customize to a
specific container data type for D (eg: linear function approximations will customize D to
a Sequence data type, a feed-forward deep neural network will customize D to a Sequence
of 2-dimensional arrays). We are interested in viewing Function Approximations as points
in an appropriate Affine Space. To explain this, we start by viewing parameters as points
in an Affine Space.
G : X → (P → G)
as:
G(x)(p) = ∇p f (x, p)
We refer to this affine space R as the Representational Space to signify the fact that the ⊕
operation for R simply “delegates” to the ⊕ operation for P and so, the parameters p ∈ P
basically serve as the internal representation of the function approximation I(p) : X → R.
This “delegation” from R to P implies that I is a linear map from Parameters Space P to
Representational Space R.
Notice that the __add__ method of the Gradient class in rl/function_approx.py is over-
loaded. One of the __add__ methods corresponds to vector addition of two gradients in the
Gradient Space G. The other __add__ method corresponds to the ⊕ operation adding a gra-
dient (treated as a translation in the vector space of gradients) to a function approximation
(treated as a point in the affine space of function approximations).
F.7. Stochastic Gradient Descent
Stochastic Gradient Descent is a function
SGD : X × R → (P → P)
representing a mapping from (predictor, response) data to a “parameters-update” func-
tion (in order to improve the function approximation), defined as:
$$SGD(x, y)(p) = p \oplus U(p)$$
where the function U : P → G is defined as:
$$U(p) = \alpha \cdot e(p) \cdot G(x)(p)$$
in terms of:
• Learning rate α ∈ R+
• Prediction error function e : P → R
• Gradient operator G(x) : P → G
Note that the product of functions e and G(x) above is element-wise in their common
domain P = D[R], resulting in the scalar (R) multiplication of vectors in G.
Updating vector p to vector p ⊕ U (p) in the Parameters Space P results in updating
function I(p) : X → R to function I(p ⊕ U (p)) : X → R in the Representational Space R.
This is rather convenient since we can view the ⊕ operation for the Parameters Space P as
effectively the ⊕ operation in the Representational Space R.
• Learning rate $\alpha \in \mathbb{R}^+$
• Prediction Error $y - \Phi(x)^T \cdot p \in \mathbb{R}$ for the updating data $(x, y) \in X \times \mathbb{R}$
• Inner-product of the feature vector $\Phi(x) \in \mathbb{R}^m$ of the updating input value $x \in X$
and the feature vector $\Phi(z) \in \mathbb{R}^m$ of the evaluation input value $z \in X$.
These three quantities combine as shown below.
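Putting these components together (a sketch of the standard linear-approximation case, consistent with the quantities listed above), the SGD update from data $(x, y)$ and the resulting change in the prediction at an evaluation point $z$ are:

$$p \leftarrow p + \alpha \cdot (y - \Phi(x)^T \cdot p) \cdot \Phi(x), \qquad \Delta\big(\Phi(z)^T \cdot p\big) = \alpha \cdot (y - \Phi(x)^T \cdot p) \cdot \big(\Phi(x)^T \cdot \Phi(z)\big)$$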
G. Conjugate Priors for Gaussian and
Bernoulli Distributions
The setting for this Appendix is that we receive data incrementally as x1 , x2 , . . . and we as-
sume a certain probability distribution (eg: Gaussian, Bernoulli) for each xi , i = 1, 2, . . ..
We utilize an appropriate conjugate prior for the assumed data distribution so that we
can derive the posterior distribution for the parameters of the assumed data distribution.
We can then say that for any n ∈ Z+ , the conjugate prior is the probability distribution
for the parameters of the assumed data distribution, conditional on the first n data points
(x1 , x2 , . . . xn ) and the posterior is the probability distribution for the parameters of the
assumed distribution, conditional on the first n + 1 data points (x1 , x2 , . . . , xn+1 ). This
amounts to performing Bayesian updates on the hyperparameters upon receipt of each
incremental data xi (hyperparameters refer to the parameters of the prior and posterior
distributions). In this appendix, we shall not cover the derivations of the posterior dis-
tribution from the prior distribution and the data distribution. We shall simply state the
results (references for derivations can be found on the Conjugate Prior Wikipedia Page).
$$x_n \sim N(\mu, \sigma^2)$$
and we assume both $\mu$ and $\sigma^2$ are unknown random variables with a Gaussian-Inverse-
Gamma Probability Distribution Conjugate Prior for $\mu$ and $\sigma^2$, i.e.,
$$\mu | x_1, \ldots, x_n \sim N\Big(\theta_n, \frac{\sigma^2}{n}\Big)$$
$$\sigma^2 | x_1, \ldots, x_n \sim IG(\alpha_n, \beta_n)$$
where $IG(\alpha_n, \beta_n)$ refers to the Inverse Gamma distribution with parameters $\alpha_n$ and $\beta_n$.
This means $\frac{1}{\sigma^2} | x_1, \ldots, x_n$ follows a Gamma distribution with parameters $\alpha_n$ and $\beta_n$, i.e.,
the probability of $\frac{1}{\sigma^2}$ having a value $y \in \mathbb{R}^+$ is:
$$\frac{\beta^{\alpha} \cdot y^{\alpha - 1} \cdot e^{-\beta y}}{\Gamma(\alpha)}$$
where $\Gamma(\cdot)$ is the Gamma Function.
$\theta_n, \alpha_n, \beta_n$ are hyperparameters determining the probability distributions of $\mu$ and $\sigma^2$,
conditional on data $x_1, \ldots, x_n$.
Then, the posterior distribution is given by:
$$\mu | x_1, \ldots, x_{n+1} \sim N\Big(\frac{n\theta_n + x_{n+1}}{n+1}, \frac{\sigma^2}{n+1}\Big)$$
$$\sigma^2 | x_1, \ldots, x_{n+1} \sim IG\Big(\alpha_n + \frac{1}{2}, \; \beta_n + \frac{n(x_{n+1} - \theta_n)^2}{2(n+1)}\Big)$$
This means upon receipt of the data point $x_{n+1}$, the hyperparameters can be updated
as:
$$\theta_{n+1} = \frac{n\theta_n + x_{n+1}}{n+1}, \quad \alpha_{n+1} = \alpha_n + \frac{1}{2}, \quad \beta_{n+1} = \beta_n + \frac{n(x_{n+1} - \theta_n)^2}{2(n+1)}$$
$$p | x_1, \ldots, x_n \sim Beta(\alpha_n, \beta_n)$$
where $Beta(\alpha_n, \beta_n)$ refers to the Beta distribution with parameters $\alpha_n$ and $\beta_n$, i.e., the
probability of $p$ having a value $y \in [0, 1]$ is:
$$\frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha) \cdot \Gamma(\beta)} \cdot y^{\alpha - 1} \cdot (1 - y)^{\beta - 1}$$
where $\Gamma(\cdot)$ is the Gamma Function.
$\alpha_n, \beta_n$ are hyperparameters determining the probability distribution of $p$, conditional
on data $x_1, \ldots, x_n$.
Then, the posterior distribution is given by $p | x_1, \ldots, x_{n+1} \sim Beta(\alpha_{n+1}, \beta_{n+1})$, where:
$$\alpha_{n+1} = \alpha_n + \mathbb{I}_{x_{n+1} = 1}, \quad \beta_{n+1} = \beta_n + \mathbb{I}_{x_{n+1} = 0}$$
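A minimal sketch of these incremental Bayesian hyperparameter updates (a generic illustration, not this book's code):

```python
def gaussian_hyperparameter_update(theta, alpha, beta, n, x_new):
    # Gaussian data with unknown mean and variance: Gaussian-Inverse-Gamma prior
    theta_new = (n * theta + x_new) / (n + 1)
    alpha_new = alpha + 0.5
    beta_new = beta + n * (x_new - theta) ** 2 / (2 * (n + 1))
    return theta_new, alpha_new, beta_new, n + 1

def bernoulli_hyperparameter_update(alpha, beta, x_new):
    # Bernoulli data: Beta prior, increment the success or failure count
    return alpha + (1 if x_new == 1 else 0), beta + (1 if x_new == 0 else 0)
```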
Bibliography
Almgren, Robert, and Neil Chriss. 2000. “Optimal Execution of Portfolio Transactions.”
Journal of Risk, 5–39.
Amari, S. 1998. “Natural Gradient Works Efficiently in Learning.” Neural Computation 10
(2): 251–76.
Auer, Peter, Nicolò Cesa-Bianchi, and Paul Fischer. 2002. “Finite-Time Analysis of the
Multiarmed Bandit Problem.” Machine Learning 47 (2): 235–56. https://doi.org/10.
1023/A:1013689704352.
Avellaneda, Marco, and Sasha Stoikov. 2008. “High-Frequency Trading in a Limit Or-
der Book.” Quantitative Finance 8 (3): 217–24. http://www.informaworld.com/10.1080/
14697680701381228.
Åström, K. J. 1965. “Optimal Control of Markov Processes with Incomplete State In-
formation.” Journal of Mathematical Analysis and Applications 10 (1): 174–205. https:
//doi.org/10.1016/0022-247X(65)90154-X.
Baird, Leemon. 1995. “Residual Algorithms: Reinforcement Learning with Function Ap-
proximation.” In Machine Learning Proceedings 1995, edited by Armand Prieditis and
Stuart Russell, 30–37. San Francisco (CA): Morgan Kaufmann. https://doi.org/https:
//doi.org/10.1016/B978-1-55860-377-6.50013-X.
Barberà, Salvador, Christian Seidl, and Peter J. Hammond, eds. 1998. Handbook of Utility
Theory, Vol. 1. Dordrecht: Kluwer. http://gso.gbv.de/DB=2.1/CMD?ACT=SRCHA&SRT=
YOP&IKT=1016&TRM=ppn+266386229&sourceid=fbw_bibsonomy.
Bellman, Richard. 1957a. “A Markovian Decision Process.” Journal of Mathematics and
Mechanics 6 (5): 679–84. http://www.jstor.org/stable/24900506.
———. 1957b. Dynamic Programming. 1st ed. Princeton, NJ, USA: Princeton University
Press.
Bertsekas, D. P., and J. N. Tsitsiklis. 1996. Neuro-Dynamic Programming. Belmont, MA:
Athena Scientific.
Bertsekas, Dimitri P. 1981. “Distributed Dynamic Programming.” In 1981 20th IEEE Con-
ference on Decision and Control Including the Symposium on Adaptive Processes, 774–79.
https://doi.org/10.1109/CDC.1981.269319.
———. 1983. “Distributed Asynchronous Computation of Fixed Points.” Mathematical
Programming 27: 107–20.
———. 2005. Dynamic Programming and Optimal Control, Volume 1, 3rd Edition. Athena
Scientific.
———. 2012. Dynamic Programming and Optimal Control, Volume 2: Approximate Dynamic
Programming. Athena Scientific.
Bertsimas, Dimitris, and Andrew W. Lo. 1998. “Optimal Control of Execution Costs.”
Journal of Financial Markets 1 (1): 1–50.
Björk, Tomas. 2005. Arbitrage Theory in Continuous Time. 2nd ed. Oxford: Oxford University
Press. http://gso.gbv.de/DB=2.1/CMD?ACT=SRCHA&SRT=YOP&IKT=1016&TRM=
ppn+505893878&sourceid=fbw_bibsonomy.
Black, Fisher, and Myron S. Scholes. 1973. “The Pricing of Options and Corporate Liabili-
ties.” Journal of Political Economy 81 (3): 637–54.
Bradtke, Steven J., and Andrew G. Barto. 1996. “Linear Least-Squares Algorithms for
Temporal Difference Learning.” Mach. Learn. 22 (1-3): 33–57. http://dblp.uni-trier.
de/db/journals/ml/ml22.html#BradtkeB96.
Brafman, Ronen I., and Moshe Tennenholtz. 2001. “R-MAX - a General Polynomial Time
Algorithm for Near-Optimal Reinforcement Learning.” In IJCAI, edited by Bernhard
Nebel, 953–58. Morgan Kaufmann. http://dblp.uni-trier.de/db/conf/ijcai/ijcai2001.
html#BrafmanT01.
Bühler, Hans, Lukas Gonon, Josef Teichmann, and Ben Wood. 2018. “Deep Hedging.”
http://arxiv.org/abs/1802.03042.
Chang, Hyeong Soo, Michael C. Fu, Jiaqiao Hu, and Steven I. Marcus. 2005. “An Adaptive
Sampling Algorithm for Solving Markov Decision Processes.” Operations Research 53
(1): 126–39. http://dblp.uni-trier.de/db/journals/ior/ior53.html#ChangFHM05.
Coulom, Rémi. 2006. “Efficient Selectivity and Backup Operators in Monte-Carlo Tree
Search.” In Computers and Games, edited by H. Jaap van den Herik, Paolo Ciancarini,
and H. H. L. M. Donkers, 4630:72–83. Lecture Notes in Computer Science. Springer.
http://dblp.uni-trier.de/db/conf/cg/cg2006.html#Coulom06.
Cox, J., S. Ross, and M. Rubinstein. 1979. “Option Pricing: A Simplified Approach.” Jour-
nal of Financial Economics 7: 229–63.
Degris, Thomas, Martha White, and Richard S. Sutton. 2012. “Off-Policy Actor-Critic.”
CoRR abs/1205.4839. http://arxiv.org/abs/1205.4839.
Gagniuc, Paul A. 2017. Markov Chains: From Theory to Implementation and Experimentation.
John Wiley & Sons.
Ganesh, Sumitra, Nelson Vadori, Mengda Xu, Hua Zheng, Prashant P. Reddy, and Manuela
Veloso. 2019. “Reinforcement Learning for Market Making in a Multi-Agent Dealer
Market.” CoRR abs/1911.05892. http://dblp.uni-trier.de/db/journals/corr/corr1911.
html#abs-1911-05892.
Gittins, John C. 1979. “Bandit Processes and Dynamic Allocation Indices.” Journal of the
Royal Statistical Society. Series B (Methodological), 148–77.
Goodfellow, Ian, Yoshua Bengio, and Aaron Courville. 2016. Deep Learning. MIT Press.
Guéant, Olivier. 2016. The Financial Mathematics of Market Liquidity: From Optimal Execution
to Market Making. Chapman & Hall/CRC Financial Mathematics Series.
Guez, Arthur, Nicolas Heess, David Silver, and Peter Dayan. 2014. “Bayes-Adaptive
Simulation-Based Search with Value Function Approximation.” In NIPS, edited by
Zoubin Ghahramani, Max Welling, Corinna Cortes, Neil D. Lawrence, and Kilian Q.
Weinberger, 451–59. http://dblp.uni-trier.de/db/conf/nips/nips2014.html#GuezHSD14.
Hasselt, Hado van, Yotam Doron, Florian Strub, Matteo Hessel, Nicolas Sonnerat, and
Joseph Modayil. 2018. “Deep Reinforcement Learning and the Deadly Triad.” CoRR
abs/1812.02648. http://arxiv.org/abs/1812.02648.
Howard, R. A. 1960. Dynamic Programming and Markov Processes. Cambridge, MA: MIT
Press.
Hull, John C. 2010. Options, Futures, and Other Derivatives. Seventh. Pearson.
Kakade, Sham M. 2001. “A Natural Policy Gradient.” In NIPS, edited by Thomas G. Di-
etterich, Suzanna Becker, and Zoubin Ghahramani, 1531–38. MIT Press. http://dblp.
uni-trier.de/db/conf/nips/nips2001.html#Kakade01.
Kingma, Diederik P., and Jimmy Ba. 2014. “Adam: A Method for Stochastic Optimiza-
tion.” CoRR abs/1412.6980. http://dblp.uni-trier.de/db/journals/corr/corr1412.
html#KingmaB14.
Klopf, A. H. 1972. Brain Function and Adaptive Systems: A Heterostatic Theory. Special Reports.
Data Sciences Laboratory, Air Force Cambridge Research Laboratories, Air Force Sys-
tems Command, United States Air Force. https://books.google.com/books?id=C2hztwEACAAJ.
Kocsis, L., and Cs. Szepesvári. 2006. “Bandit Based Monte-Carlo Planning.” In ECML,
282–93.
Krishnamurthy, Vikram. 2016. Partially Observed Markov Decision Processes: From Filtering to
Controlled Sensing. Cambridge University Press. https://doi.org/10.1017/CBO9781316471104.
Lagoudakis, Michail G., and Ronald Parr. 2003. “Least-Squares Policy Iteration.” J. Mach.
Learn. Res. 4: 1107–49. http://dblp.uni-trier.de/db/journals/jmlr/jmlr4.html#LagoudakisP03.
Lai, T. L., and H. Robbins. 1985. “Asymptotically Efficient Adaptive Allocation Rules.”
Advances in Applied Mathematics 6: 4–22.
Li, Y., Cs. Szepesvári, and D. Schuurmans. 2009. “Learning Exercise Policies for American
Options.” In AISTATS, 5:352–59. http://www.ics.uci.edu/~aistats/.
Lin, Long-Ji. 1993. “Reinforcement Learning for Robots Using Neural Networks.” PhD
thesis, Pittsburgh, PA: Carnegie Mellon University.
Longstaff, Francis A., and Eduardo S. Schwartz. 2001. “Valuing American Options by
Simulation: A Simple Least-Squares Approach.” Review of Financial Studies 14 (1): 113–
47. https://doi.org/10.1093/rfs/14.1.113.
Merton, Robert C. 1969. “Lifetime Portfolio Selection Under Uncertainty: The Continuous-
Time Case.” The Review of Economics and Statistics 51 (3): 247–57. https://doi.org/10.
2307/1926560.
Mnih, Volodymyr, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou,
Daan Wierstra, and Martin Riedmiller. 2013. “Playing Atari with Deep Reinforcement
Learning.” http://arxiv.org/abs/1312.5602.
Mnih, Volodymyr, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G.
Bellemare, Alex Graves, et al. 2015. “Human-Level Control Through Deep Reinforce-
ment Learning.” Nature 518 (7540): 529–33. https://doi.org/10.1038/nature14236.
Nevmyvaka, Yuriy, Yi Feng, and Michael J. Kearns. 2006. “Reinforcement Learning for
Optimized Trade Execution.” In ICML, edited by William W. Cohen and Andrew W.
Moore, 148:673–80. ACM International Conference Proceeding Series. ACM. http:
//dblp.uni-trier.de/db/conf/icml/icml2006.html#NevmyvakaFK06.
Puterman, Martin L. 2014. Markov Decision Processes: Discrete Stochastic Dynamic Program-
ming. John Wiley & Sons.
Rummery, G. A., and M. Niranjan. 1994. “On-Line Q-Learning Using Connectionist Sys-
tems.” CUED/F-INFENG/TR-166. Engineering Department, Cambridge University.
Russo, Daniel J., Benjamin Van Roy, Abbas Kazerouni, Ian Osband, and Zheng Wen. 2018.
“A Tutorial on Thompson Sampling.” Foundations and Trends® in Machine Learning 11
(1): 1–96. https://doi.org/10.1561/2200000070.
Salimans, Tim, Jonathan Ho, Xi Chen, Szymon Sidor, and Ilya Sutskever. 2017. “Evo-
lution Strategies as a Scalable Alternative to Reinforcement Learning.” arXiv Preprint
arXiv:1703.03864.
Sherman, Jack, and Winifred J. Morrison. 1950. “Adjustment of an Inverse Matrix Corre-
sponding to a Change in One Element of a Given Matrix.” The Annals of Mathematical
Statistics 21 (1): 124–27. https://doi.org/10.1214/aoms/1177729893.
Shreve, Steven E. 2003. Stochastic Calculus for Finance I: The Binomial Asset Pricing Model.
New York, NY: Springer-Verlag.
———. 2004. Stochastic Calculus for Finance II: Continuous-Time Models. New York: Springer.
Silver, David, Aja Huang, Chris J. Maddison, Arthur Guez, Laurent Sifre, George van den
Driessche, Julian Schrittwieser, et al. 2016. “Mastering the Game of Go with Deep
Neural Networks and Tree Search.” Nature 529 (7587): 484–89. https://doi.org/10.
1038/nature16961.
Silver, David, Guy Lever, Nicolas Heess, Thomas Degris, Daan Wierstra, and Martin A.
Riedmiller. 2014. “Deterministic Policy Gradient Algorithms.” In ICML, 32:387–95.
JMLR Workshop and Conference Proceedings. JMLR.org. http://dblp.uni-trier.de/
db/conf/icml/icml2014.html#SilverLHDWR14.
Spooner, Thomas, John Fearnley, Rahul Savani, and Andreas Koukorinis. 2018. “Market
Making via Reinforcement Learning.” CoRR abs/1804.04216. http://dblp.uni-trier.
de/db/journals/corr/corr1804.html#abs-1804-04216.
Sutton, R., D. McAllester, S. Singh, and Y. Mansour. 2001. “Policy Gradient Methods for
Reinforcement Learning with Function Approximation.” In NIPS. MIT Press.
Sutton, R. S., H. R. Maei, D. Precup, S. Bhatnagar, D. Silver, Cs. Szepesvári, and E. Wiewiora.
2009. “Fast Gradient-Descent Methods for Temporal-Difference Learning with Linear
Function Approximation.” In ICML, 993–1000.
Sutton, R. S., Cs. Szepesvári, and H. R. Maei. 2008. “A Convergent O(n) Algorithm for Off-
Policy Temporal-Difference Learning with Linear Function Approximation.” In NIPS,
1609–16.
Sutton, Richard S. 1991. “Dyna, an Integrated Architecture for Learning, Planning, and Re-
acting.” SIGART Bull. 2 (4): 160–63. http://dblp.uni-trier.de/db/journals/sigart/
sigart2.html#Sutton91.
Sutton, Richard S., and Andrew G. Barto. 2018. Reinforcement Learning: An Introduction.
Second. The MIT Press. http://incompleteideas.net/book/the-book-2nd.html.
Vyetrenko, Svitlana, and Shaojie Xu. 2019. “Risk-Sensitive Compact Decision Trees for Au-
tonomous Execution in Presence of Simulated Market Response.” CoRR abs/1906.02312.
http://dblp.uni-trier.de/db/journals/corr/corr1906.html#abs-1906-02312.
Watkins, C. J. C. H. 1989. “Learning from Delayed Rewards.” PhD thesis, King’s College,
Cambridge.
Williams, R. J. 1992. “Simple Statistical Gradient-Following Algorithms for Connectionist
Reinforcement Learning.” Machine Learning 8: 229–56.
Øksendal, Bernt. 2003. Stochastic Differential Equations. 6th ed. Universitext. Berlin; Heidel-
berg: Springer. http://aleph.bib.uni-mannheim.de/F/?func=find-b&request=106106503&
find_code=020&adjacent=N&local_base=MAN01PUBLIC&x=0&y=0.