
Foundations of Reinforcement Learning with

Applications in Finance

Ashwin Rao, Tikhon Jelvis


Contents
Preface 11

Summary of Notation 15

Overview 17
0.1. Learning Reinforcement Learning . . . . . . . . . . . . . . . . . . . . . . . . 17
0.2. What you’ll learn from this Book . . . . . . . . . . . . . . . . . . . . . . . . . 18
0.3. Expected Background to read this Book . . . . . . . . . . . . . . . . . . . . . 19
0.4. Decluttering the Jargon linked to Reinforcement Learning . . . . . . . . . . 20
0.5. Introduction to the Markov Decision Process (MDP) framework . . . . . . . 23
0.6. Real-world problems that fit the MDP framework . . . . . . . . . . . . . . . 25
0.7. The inherent difficulty in solving MDPs . . . . . . . . . . . . . . . . . . . . . 26
0.8. Value Function, Bellman Equation, Dynamic Programming and RL . . . . . 27
0.9. Outline of Chapters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
0.9.1. Module I: Processes and Planning Algorithms . . . . . . . . . . . . . 29
0.9.2. Module II: Modeling Financial Applications . . . . . . . . . . . . . . 30
0.9.3. Module III: Reinforcement Learning Algorithms . . . . . . . . . . . . 32
0.9.4. Module IV: Finishing Touches . . . . . . . . . . . . . . . . . . . . . . . 33
0.9.5. Short Appendix Chapters . . . . . . . . . . . . . . . . . . . . . . . . . 33

Programming and Design 35


0.10. Code Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
0.11. Environment Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
0.12. Classes and Interfaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
0.12.1. A Distribution Interface . . . . . . . . . . . . . . . . . . . . . . . . . . 38
0.12.2. A Concrete Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . 39
0.12.3. Checking Types . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
0.12.4. Type Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
0.12.5. Functionality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
0.13. Abstracting over Computation . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
0.13.1. First-Class Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
0.13.2. Iterative Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
0.14. Key Takeaways . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54

I. Processes and Planning Algorithms 57

1. Markov Processes 59
1.1. The Concept of State in a Process . . . . . . . . . . . . . . . . . . . . . . . . . 59
1.2. Understanding Markov Property from Stock Price Examples . . . . . . . . . 59
1.3. Formal Definitions for Markov Processes . . . . . . . . . . . . . . . . . . . . 65
1.3.1. Starting States . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
1.3.2. Terminal States . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68

1.3.3. Markov Process Implementation . . . . . . . . . . . . . . . . . . . . . 68
1.4. Stock Price Examples modeled as Markov Processes . . . . . . . . . . . . . . 70
1.5. Finite Markov Processes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
1.6. Simple Inventory Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
1.7. Stationary Distribution of a Markov Process . . . . . . . . . . . . . . . . . . . 76
1.8. Formalism of Markov Reward Processes . . . . . . . . . . . . . . . . . . . . . 80
1.9. Simple Inventory Example as a Markov Reward Process . . . . . . . . . . . . 82
1.10. Finite Markov Reward Processes . . . . . . . . . . . . . . . . . . . . . . . . . 84
1.11. Simple Inventory Example as a Finite Markov Reward Process . . . . . . . . 86
1.12. Value Function of a Markov Reward Process . . . . . . . . . . . . . . . . . . . 88
1.13. Summary of Key Learnings from this Chapter . . . . . . . . . . . . . . . . . 92

2. Markov Decision Processes 93


2.1. Simple Inventory Example: How much to Order? . . . . . . . . . . . . . . . . 93
2.2. The Difficulty of Sequential Decisioning under Uncertainty . . . . . . . . . . 94
2.3. Formal Definition of a Markov Decision Process . . . . . . . . . . . . . . . . 95
2.4. Policy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
2.5. [Markov Decision Process, Policy] := Markov Reward Process . . . . . . . . 100
2.6. Simple Inventory Example with Unlimited Capacity (Infinite State/Action
Space) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
2.7. Finite Markov Decision Processes . . . . . . . . . . . . . . . . . . . . . . . . . 103
2.8. Simple Inventory Example as a Finite Markov Decision Process . . . . . . . . 106
2.9. MDP Value Function for a Fixed Policy . . . . . . . . . . . . . . . . . . . . . . 109
2.10. Optimal Value Function and Optimal Policies . . . . . . . . . . . . . . . . . . 112
2.11. Variants and Extensions of MDPs . . . . . . . . . . . . . . . . . . . . . . . . . 116
2.11.1. Size of Spaces and Discrete versus Continuous . . . . . . . . . . . . . 116
2.11.2. Partially-Observable Markov Decision Processes (POMDPs) . . . . . 119
2.12. Summary of Key Learnings from this Chapter . . . . . . . . . . . . . . . . . 123

3. Dynamic Programming Algorithms 125


3.1. Planning versus Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
3.2. Usage of the term Dynamic Programming . . . . . . . . . . . . . . . . . . . . . 125
3.3. Fixed-Point Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
3.4. Bellman Policy Operator and Policy Evaluation Algorithm . . . . . . . . . . 131
3.5. Greedy Policy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134
3.6. Policy Improvement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
3.7. Policy Iteration Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138
3.8. Bellman Optimality Operator and Value Iteration Algorithm . . . . . . . . . 140
3.9. Optimal Policy from Optimal Value Function . . . . . . . . . . . . . . . . . . 143
3.10. Revisiting the Simple Inventory Example . . . . . . . . . . . . . . . . . . . . 145
3.11. Generalized Policy Iteration . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146
3.12. Asynchronous Dynamic Programming . . . . . . . . . . . . . . . . . . . . . . 149
3.13. Finite-Horizon Dynamic Programming: Backward Induction . . . . . . . . . 151
3.14. Dynamic Pricing for End-of-Life/End-of-Season of a Product . . . . . . . . . 157
3.15. Generalizations to Non-Tabular Algorithms . . . . . . . . . . . . . . . . . . . 161
3.16. Summary of Key Learnings from this Chapter . . . . . . . . . . . . . . . . . 162

4. Function Approximation and Approximate Dynamic Programming 163


4.1. Function Approximation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 164

4.2. Linear Function Approximation . . . . . . . . . . . . . . . . . . . . . . . . . . 168
4.3. Neural Network Function Approximation . . . . . . . . . . . . . . . . . . . . 174
4.4. Tabular as a form of FunctionApprox . . . . . . . . . . . . . . . . . . . . . . . 186
4.5. Approximate Policy Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . 188
4.6. Approximate Value Iteration . . . . . . . . . . . . . . . . . . . . . . . . . . . . 190
4.7. Finite-Horizon Approximate Policy Evaluation . . . . . . . . . . . . . . . . . 190
4.8. Finite-Horizon Approximate Value Iteration . . . . . . . . . . . . . . . . . . . 192
4.9. Finite-Horizon Approximate Q-Value Iteration . . . . . . . . . . . . . . . . . 192
4.10. How to Construct the Non-Terminal States Distribution . . . . . . . . . . . . 194
4.11. Key Takeaways from this Chapter . . . . . . . . . . . . . . . . . . . . . . . . . 195

II. Modeling Financial Applications 197

5. Utility Theory 199


5.1. Introduction to the Concept of Utility . . . . . . . . . . . . . . . . . . . . . . 199
5.2. A Simple Financial Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . 200
5.3. The Shape of the Utility function . . . . . . . . . . . . . . . . . . . . . . . . . 201
5.4. Calculating the Risk-Premium . . . . . . . . . . . . . . . . . . . . . . . . . . . 203
5.5. Constant Absolute Risk-Aversion (CARA) . . . . . . . . . . . . . . . . . . . . 205
5.6. A Portfolio Application of CARA . . . . . . . . . . . . . . . . . . . . . . . . . 206
5.7. Constant Relative Risk-Aversion (CRRA) . . . . . . . . . . . . . . . . . . . . 207
5.8. A Portfolio Application of CRRA . . . . . . . . . . . . . . . . . . . . . . . . . 208
5.9. Key Takeaways from this Chapter . . . . . . . . . . . . . . . . . . . . . . . . . 210

6. Dynamic Asset-Allocation and Consumption 211


6.1. Optimization of Personal Finance . . . . . . . . . . . . . . . . . . . . . . . . . 211
6.2. Merton’s Portfolio Problem and Solution . . . . . . . . . . . . . . . . . . . . . 213
6.3. Developing Intuition for the Solution to Merton’s Portfolio Problem . . . . . 218
6.4. A Discrete-Time Asset-Allocation Example . . . . . . . . . . . . . . . . . . . 221
6.5. Porting to Real-World . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 225
6.6. Key Takeaways from this Chapter . . . . . . . . . . . . . . . . . . . . . . . . . 233

7. Derivatives Pricing and Hedging 235


7.1. A Brief Introduction to Derivatives . . . . . . . . . . . . . . . . . . . . . . . . 235
7.1.1. Forwards . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 236
7.1.2. European Options . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 237
7.1.3. American Options . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 237
7.2. Notation for the Single-Period Simple Setting . . . . . . . . . . . . . . . . . . 238
7.3. Portfolios, Arbitrage and Risk-Neutral Probability Measure . . . . . . . . . . 239
7.4. First Fundamental Theorem of Asset Pricing (1st FTAP) . . . . . . . . . . . . 240
7.5. Second Fundamental Theorem of Asset Pricing (2nd FTAP) . . . . . . . . . 242
7.6. Derivatives Pricing in Single-Period Setting . . . . . . . . . . . . . . . . . . . 244
7.6.1. Derivatives Pricing when Market is Complete . . . . . . . . . . . . . 244
7.6.2. Derivatives Pricing when Market is Incomplete . . . . . . . . . . . . . 247
7.6.3. Derivatives Pricing when Market has Arbitrage . . . . . . . . . . . . 255
7.7. Derivatives Pricing in Multi-Period/Continuous-Time Settings . . . . . . . . 256
7.7.1. Multi-Period Complete-Market Setting . . . . . . . . . . . . . . . . . 256
7.7.2. Continuous-Time Complete-Market Setting . . . . . . . . . . . . . . . 257

7.8. Optimal Exercise of American Options cast as a Finite MDP . . . . . . . . . . 258
7.9. Generalizing to Optimal-Stopping Problems . . . . . . . . . . . . . . . . . . 265
7.10. Pricing/Hedging in an Incomplete Market cast as an MDP . . . . . . . . . . 267
7.11. Key Takeaways from this Chapter . . . . . . . . . . . . . . . . . . . . . . . . . 269

8. Order-Book Trading Algorithms 271


8.1. Basics of Order Book and Price Impact . . . . . . . . . . . . . . . . . . . . . . 271
8.2. Optimal Execution of a Market Order . . . . . . . . . . . . . . . . . . . . . . 280
8.2.1. Simple Linear Price Impact Model with no Risk-Aversion . . . . . . . 286
8.2.2. Paper by Bertsimas and Lo on Optimal Order Execution . . . . . . . 291
8.2.3. Incorporating Risk-Aversion and Real-World Considerations . . . . . 292
8.3. Optimal Market-Making . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 293
8.3.1. Avellaneda-Stoikov Continuous-Time Formulation . . . . . . . . . . 296
8.3.2. Solving the Avellaneda-Stoikov Formulation . . . . . . . . . . . . . . 296
8.3.3. Analytical Approximation to the Solution to Avellaneda-Stoikov For-
mulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 301
8.3.4. Real-World Market-Making . . . . . . . . . . . . . . . . . . . . . . . . 304
8.4. Key Takeaways from this Chapter . . . . . . . . . . . . . . . . . . . . . . . . . 305

III. Reinforcement Learning Algorithms 307

9. Monte-Carlo and Temporal-Difference for Prediction 309


9.1. Overview of the Reinforcement Learning approach . . . . . . . . . . . . . . 309
9.2. RL for Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 311
9.3. Monte-Carlo (MC) Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . 312
9.4. Temporal-Difference (TD) Prediction . . . . . . . . . . . . . . . . . . . . . . . 319
9.5. TD versus MC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 323
9.5.1. TD learning akin to human learning . . . . . . . . . . . . . . . . . . . 323
9.5.2. Bias, Variance and Convergence . . . . . . . . . . . . . . . . . . . . . 324
9.5.3. Fixed-Data Experience Replay on TD versus MC . . . . . . . . . . . . 327
9.5.4. Bootstrapping and Experiencing . . . . . . . . . . . . . . . . . . . . . 332
9.6. TD(λ) Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 334
9.6.1. n-Step Bootstrapping Prediction Algorithm . . . . . . . . . . . . . . . 338
9.6.2. λ-Return Prediction Algorithm . . . . . . . . . . . . . . . . . . . . . . 339
9.6.3. Eligibility Traces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 341
9.6.4. Implementation of the TD(λ) Prediction algorithm . . . . . . . . . . 345
9.7. Key Takeaways from this Chapter . . . . . . . . . . . . . . . . . . . . . . . . . 347

10. Monte-Carlo and Temporal-Difference for Control 349


10.1. Refresher on Generalized Policy Iteration (GPI) . . . . . . . . . . . . . . . . . . 349
10.2. GPI with Evaluation as Monte-Carlo . . . . . . . . . . . . . . . . . . . . . . . 350
10.3. GLIE Monte-Carlo Control . . . . . . . . . . . . . . . . . . . . . . . . . . . . 354
10.4. SARSA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 359
10.5. SARSA(λ) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 366
10.6. Off-Policy Control . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 367
10.6.1. Q-Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 368
10.6.2. Windy Grid . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 371
10.6.3. Importance Sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . 378

10.7. Conceptual Linkage between DP and TD algorithms . . . . . . . . . . . . . . 380
10.8. Convergence of RL Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . 384
10.9. Key Takeaways from this Chapter . . . . . . . . . . . . . . . . . . . . . . . . . 385

11. Batch RL, Experience-Replay, DQN, LSPI, Gradient TD 387


11.1. Batch RL and Experience-Replay . . . . . . . . . . . . . . . . . . . . . . . . . 388
11.2. A generic implementation of Experience-Replay . . . . . . . . . . . . . . . . 391
11.3. Least-Squares RL Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 392
11.3.1. Least-Squares Monte-Carlo (LSMC) . . . . . . . . . . . . . . . . . . . 393
11.3.2. Least-Squares Temporal-Difference (LSTD) . . . . . . . . . . . . . . . 394
11.3.3. LSTD(λ) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 397
11.3.4. Convergence of Least-Squares Prediction . . . . . . . . . . . . . . . . 398
11.4. Q-Learning with Experience-Replay . . . . . . . . . . . . . . . . . . . . . . . 398
11.4.1. Deep Q-Networks (DQN) Algorithm . . . . . . . . . . . . . . . . . . 400
11.5. Least-Squares Policy Iteration (LSPI) . . . . . . . . . . . . . . . . . . . . . . . 401
11.5.1. Saving your Village from a Vampire . . . . . . . . . . . . . . . . . . . 404
11.5.2. Least-Squares Control Convergence . . . . . . . . . . . . . . . . . . . 407
11.6. RL for Optimal Exercise of American Options . . . . . . . . . . . . . . . . . . 407
11.6.1. LSPI for American Options Pricing . . . . . . . . . . . . . . . . . . . . 409
11.6.2. Deep Q-Learning for American Options Pricing . . . . . . . . . . . . 411
11.7. Value Function Geometry . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 412
11.7.1. Notation and Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . 412
11.7.2. Bellman Policy Operator and Projection Operator . . . . . . . . . . . 413
11.7.3. Vectors of interest in the Φ subspace . . . . . . . . . . . . . . . . . . . 414
11.8. Gradient Temporal-Difference (Gradient TD) . . . . . . . . . . . . . . . . . . 418
11.9. Key Takeaways from this Chapter . . . . . . . . . . . . . . . . . . . . . . . . . 419

12. Policy Gradient Algorithms 421


12.1. Advantages and Disadvantages of Policy Gradient Algorithms . . . . . . . . 422
12.2. Policy Gradient Theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 423
12.2.1. Notation and Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . 423
12.2.2. Statement of the Policy Gradient Theorem . . . . . . . . . . . . . . . 424
12.2.3. Proof of the Policy Gradient Theorem . . . . . . . . . . . . . . . . . . 425
12.3. Score function for Canonical Policy Functions . . . . . . . . . . . . . . . . . . 427
12.3.1. Canonical π(s, a; θ) for Finite Action Spaces . . . . . . . . . . . . . . . 427
12.3.2. Canonical π(s, a; θ) for Single-Dimensional Continuous Action Spaces . . 428
12.4. REINFORCE Algorithm (Monte-Carlo Policy Gradient) . . . . . . . . . . . . 428
12.5. Optimal Asset Allocation (Revisited) . . . . . . . . . . . . . . . . . . . . . . . 431
12.6. Actor-Critic and Variance Reduction . . . . . . . . . . . . . . . . . . . . . . . 435
12.7. Overcoming Bias with Compatible Function Approximation . . . . . . . . . 440
12.8. Policy Gradient Methods in Practice . . . . . . . . . . . . . . . . . . . . . . . 444
12.8.1. Natural Policy Gradient . . . . . . . . . . . . . . . . . . . . . . . . . . 444
12.8.2. Deterministic Policy Gradient . . . . . . . . . . . . . . . . . . . . . . . 445
12.9. Evolutionary Strategies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 447
12.10. Key Takeaways from this Chapter . . . . . . . . . . . . . . . . . . . . . . . . 448

IV. Finishing Touches 451

13. Multi-Armed Bandits: Exploration versus Exploitation 453


13.1. Introduction to the Multi-Armed Bandit Problem . . . . . . . . . . . . . . . . 453
13.1.1. Some Examples of Explore-Exploit Dilemma . . . . . . . . . . . . . . 454
13.1.2. Problem Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 454
13.1.3. Regret . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 455
13.1.4. Counts and Gaps . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 455
13.2. Simple Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 457
13.2.1. Greedy and ϵ-Greedy . . . . . . . . . . . . . . . . . . . . . . . . . . . . 457
13.2.2. Optimistic Initialization . . . . . . . . . . . . . . . . . . . . . . . . . . 457
13.2.3. Decaying ϵt -Greedy Algorithm . . . . . . . . . . . . . . . . . . . . . . 458
13.3. Lower Bound . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 460
13.4. Upper Confidence Bound Algorithms . . . . . . . . . . . . . . . . . . . . . . 461
13.4.1. Hoeffding’s Inequality . . . . . . . . . . . . . . . . . . . . . . . . . . . 463
13.4.2. UCB1 Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 463
13.4.3. Bayesian UCB . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 465
13.5. Probability Matching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 466
13.5.1. Thompson Sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . 467
13.6. Gradient Bandits . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 471
13.7. Horse Race . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 474
13.8. Information State Space MDP . . . . . . . . . . . . . . . . . . . . . . . . . . . 475
13.9. Extending to Contextual Bandits and RL Control . . . . . . . . . . . . . . . . 478
13.10. Key Takeaways from this Chapter . . . . . . . . . . . . . . . . . . . . . . . . 480

14. Blending Learning and Planning 483


14.1. Planning versus Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 483
14.1.1. Planning the solution of Prediction/Control . . . . . . . . . . . . . . 483
14.1.2. Learning the solution of Prediction/Control . . . . . . . . . . . . . . 485
14.1.3. Advantages and Disadvantages of Planning versus Learning . . . . . 486
14.1.4. Blending Planning and Learning . . . . . . . . . . . . . . . . . . . . . 486
14.2. Decision-Time Planning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 488
14.3. Monte-Carlo Tree-Search (MCTS) . . . . . . . . . . . . . . . . . . . . . . . . 489
14.4. Adaptive Multi-Stage Sampling . . . . . . . . . . . . . . . . . . . . . . . . . . 490
14.5. Summary of Key Learnings from this Chapter . . . . . . . . . . . . . . . . . 494

15. Summary and Real-World Considerations 495


15.1. Summary of Key Learnings from this Book . . . . . . . . . . . . . . . . . . . 495
15.2. RL in the Real-World . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 499

Appendix 505

A. Moment Generating Function and its Applications 505


A.1. The Moment Generating Function (MGF) . . . . . . . . . . . . . . . . . . . . 505
A.2. MGF for Linear Functions of Random Variables . . . . . . . . . . . . . . . . . 505
A.3. MGF for the Normal Distribution . . . . . . . . . . . . . . . . . . . . . . . . . 506
A.4. Minimizing the MGF . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 506
A.4.1. Minimizing the MGF when x follows a normal distribution . . . . . 507
A.4.2. Minimizing the MGF when x is a symmetric binary distribution . . . 507

B. Portfolio Theory 509
B.1. Setting and Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 509
B.2. Portfolio Returns . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 509
B.3. Derivation of Efficient Frontier Curve . . . . . . . . . . . . . . . . . . . . . . 509
B.4. Global Minimum Variance Portfolio (GMVP) . . . . . . . . . . . . . . . . . . 510
B.5. Orthogonal Efficient Portfolios . . . . . . . . . . . . . . . . . . . . . . . . . . 510
B.6. Two-fund Theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 510
B.7. An example of the Efficient Frontier for 16 assets . . . . . . . . . . . . . . . . 511
B.8. CAPM: Linearity of Covariance Vector w.r.t. Mean Returns . . . . . . . . . . 511
B.9. Useful Corollaries of CAPM . . . . . . . . . . . . . . . . . . . . . . . . . . . . 512
B.10. Cross-Sectional Variance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 512
B.11. Efficient Set with a Risk-Free Asset . . . . . . . . . . . . . . . . . . . . . . . . 512

C. Introduction to and Overview of Stochastic Calculus Basics 513


C.1. Simple Random Walk . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 513
C.2. Brownian Motion as Scaled Random Walk . . . . . . . . . . . . . . . . . . . . 514
C.3. Continuous-Time Stochastic Processes . . . . . . . . . . . . . . . . . . . . . . 515
C.4. Properties of Brownian Motion sample traces . . . . . . . . . . . . . . . . . . 515
C.5. Ito Integral . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 516
C.6. Ito’s Lemma . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 517
C.7. A Lognormal Process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 518
C.8. A Mean-Reverting Process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 519

D. The Hamilton-Jacobi-Bellman (HJB) Equation 521


D.1. HJB as a continuous-time version of Bellman Optimality Equation . . . . . . 521
D.2. HJB with State Transitions as an Ito Process . . . . . . . . . . . . . . . . . . . 522

E. Black-Scholes Equation and its Solution for Call/Put Options 523


E.1. Assumptions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 523
E.2. Derivation of the Black-Scholes Equation . . . . . . . . . . . . . . . . . . . . 523
E.3. Solution of the Black-Scholes Equation for Call/Put Options . . . . . . . . . 525

F. Function Approximations as Affine Spaces 527


F.1. Vector Space . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 527
F.2. Function Space . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 527
F.3. Linear Map of Vector Spaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . 527
F.4. Affine Space . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 528
F.5. Linear Map of Affine Spaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . 528
F.6. Function Approximations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 528
F.6.1. D[R] as an Affine Space P . . . . . . . . . . . . . . . . . . . . . . . . . 529
F.6.2. Representational Space R . . . . . . . . . . . . . . . . . . . . . . . . . 529
F.7. Stochastic Gradient Descent . . . . . . . . . . . . . . . . . . . . . . . . . . . . 530
F.8. SGD Update for Linear Function Approximations . . . . . . . . . . . . . . . 530

G. Conjugate Priors for Gaussian and Bernoulli Distributions 533


G.1. Conjugate Prior for Gaussian Distribution . . . . . . . . . . . . . . . . . . . . 533
G.2. Conjugate Prior for Bernoulli Distribution . . . . . . . . . . . . . . . . . . . . 534

Bibliography 537

Preface
We (Ashwin and Tikhon) have spent all of our educational and professional lives oper-
ating at the intersection of Mathematics and Computer Science - Ashwin for more than 3
decades and Tikhon for more than a decade. During these periods, we’ve commonly held
two obsessions. The first is to bring together the (strangely) disparate worlds of Math-
ematics and Software Development. The second is to focus on the pedagogy of various
topics across Mathematics and Computer Science. Fundamentally, this book was born out
of a deep desire to release our twin obsessions so as to not just educate the next generation
of scientists and engineers, but to also present some new and creative ways of teaching
technically challenging topics in ways that are easy to absorb and retain.
Apart from these common obsessions, each of us has developed some expertise in a few
topics that come together in this book. Ashwin’s undergraduate and doctoral education
was in Computer Science, with specializations in Algorithms, Discrete Mathematics and
Abstract Algebra. He then spent more than two decades of his career (across the Finance
and Retail industries) in the realm of Computational Mathematics, recently focused on
Machine Learning and Optimization. In his role as an Adjunct Professor at Stanford Uni-
versity, Ashwin specializes in Reinforcement Learning and Mathematical Finance. The
content of this book is essentially an expansion of the content of the course CME 241 he
teaches at Stanford. Tikhon’s education is in Computer Science and he has specialized in
Software Design, with an emphasis on treating software design as mathematical specifi-
cation of “what to do” versus computational mechanics of “how to do.” This is a powerful
way of developing software, particularly for mathematical applications, significantly im-
proving readability, modularity and correctness. This leads to code that naturally and
clearly reflects the mathematics, thus blurring the artificial lines between Mathematics
and Programming. He has also championed the philosophy of leveraging programming
as a powerful way to learn mathematical concepts. Ashwin has been greatly influenced by
Tikhon on this philosophy and both of us have been quite successful in imparting to our stu-
dents a deep understanding of a variety of mathematical topics by using programming
as a powerful pedagogical tool.
In fact, the key distinguishing feature of this book is to promote learning through an ap-
propriate blend of A) intuitive understanding of the concepts, B) mathematical rigor, and
C) programming of the models and algorithms (with sound software design principles
that reflect the mathematical specification). We’ve found this unique approach to teach-
ing facilitates strong retention of the concepts because students are active learners when
they code everything they are learning, in a manner that reflects the mathematical con-
cepts. We have strived to create a healthy balance between content accessibility and in-
tuition development on one hand versus technical rigor and completeness on the other
hand. Throughout the book, we provide proper mathematical notation, theorems (and
sometimes formal proofs) as well as well-designed working code for various models and
algorithms. But we have always accompanied this formalism with intuition development
using simple examples and appropriate visualization.
We want to highlight that this book emphasizes the foundational components of Reinforce-
ment Learning - Markov Decision Processes, Bellman Equations, Fixed-Points, Dynamic

Programming, Function Approximation, Sampling, Experience-Replay, Batch Methods,
Value-based versus Policy-based Learning, balancing Exploration and Exploitation, blend-
ing Planning and Learning etc. So although we have covered several key algorithms in
this book, we do not dwell on specifics of the algorithms - rather, we emphasize the core
principles and always allow for various types of flexibility in tweaking those algorithms
(our investment in modular software design of the algorithms facilitates this flexibility).
Likewise, we have kept the content of the financial applications fairly basic, emphasizing
the core ideas, and developing working code for simplified versions of these financial ap-
plications. Getting these financial applications to be effective in practice is a much more
ambitious endeavor - we don’t attempt that in this book, but we highlight what it would
take to make it work in practice. The theme of this book is understanding of core con-
cepts rather than addressing all the nuances (and frictions) one typically encounters in
the real-world. The financial content in this book is a significant fraction of the broader
topic of Mathematical Finance and we hope that this book provides the side benefit of a
fairly quick yet robust education in the key topics of Portfolio Management, Derivatives
Pricing and Order-Book Trading.
We were introduced to modern Reinforcement Learning by the works of Richard Sutton,
including his seminal book with Andrew Barto. There are several other works by other
authors, many with more mathematical detail, but we found Sutton’s works much eas-
ier to learn this topic from. Our book tends to follow Sutton’s less rigorous but more intuitive
approach but we provide a bit more mathematical formalism/detail and we use precise
working code instead of the typical pseudo-code found in textbooks on these topics. We
have also been greatly influenced by David Silver’s excellent RL lecture series at Univer-
sity College London that is available on YouTube. We have strived to follow the structure
of David Silver’s lecture series, typically augmenting it with more detail. So it pays to em-
phasize that the content of this book is not our original work. Rather, our contribution is
to present content that is widely and publicly available in a manner that is easier to learn
(particularly due to our augmented approach of “learning by coding”). Likewise, the fi-
nancial content is not our original work - it is based on standard material on Mathematical
Finance and based on a few papers that treat the Financial problems as Stochastic Control
problems. However, we found the presentation in these papers not easy to understand for
the typical student. Moreover, some of these papers did not explicitly model these prob-
lems as Markov Decision Processes, and some of them did not consider Reinforcement
Learning as an option to solve these problems. So we presented the content in these pa-
pers in more detail, specifically with clearer notation and explanations, and with working
Python code. It’s interesting to note that Ashwin worked on some of these finance prob-
lems during his time at Goldman Sachs and Morgan Stanley, but at that time, these prob-
lems were not viewed from the lens of Stochastic Control. While designing the content of
CME 241 at Stanford, Ashwin realized that several problems from his past finance career
can be cast as Markov Decision Processes, which led him to the above-mentioned papers,
which in turn led to the content creation for CME 241, that then extended into the finan-
cial content in this book. There are several Appendices in this book to succinctly provide
appropriate pre-requisite mathematical/financial content. We have strived to provide ref-
erences throughout the chapters and appendices to enable curious students to learn each
topic in more depth. However, we are not exhaustive in our list of references because typ-
ically each of our references tends to be fairly exhaustive in the papers/books it in turn
references.
We have many people to thank - those who provided the support and encouragement
for us to write this book. Firstly, we would like to thank our managers at Target Cor-

poration - Paritosh Desai and Mike McNamara. Ashwin would like to thank all of the
faculty and staff at Stanford University he works with, for providing a wonderful environ-
ment and excellent support for CME 241, notably George Papanicolau, Kay Giesecke, Peter
Glynn, Gianluca Iaccarino, Indira Choudhury and Jess Galvez. Ashwin would also like to
thank all his students who implicitly proof-read the contents, and his course assistants
Sven Lerner and Jeff Gu. Tikhon would like to thank Paritosh Desai for help exploring the
software design and concepts used in the book as well as his immediate team at Target for
teaching him about how Markov Decision Processes apply in the real world.

Summary of Notation

Z Set of integers
Z+ Set of positive integers, i.e., {1, 2, 3, . . .}
Z≥0 Set of non-negative integers, i.e., {0, 1, 2, 3, . . .}
R Set of real numbers
R+ Set of positive real numbers
R≥0 Set of non-negative real numbers
log(x) Natural Logarithm (to the base e) of x
|x| Absolute Value of x
sign(x) +1 if x > 0, -1 if x < 0, 0 if x = 0
[a, b] Set of real numbers that are ≥ a and ≤ b. The notation x ∈ [a, b]
is shorthand for x ∈ R and a ≤ x ≤ b
[a, b) Set of real numbers that are ≥ a and < b. The notation x ∈ [a, b)
is shorthand for x ∈ R and a ≤ x < b
(a, b] Set of real numbers that are > a and ≤ b. The notation x ∈ (a, b]
is shorthand for x ∈ R and a < x ≤ b
∅ The Empty Set (Null Set)
∑_{i=1}^{n} a_i Sum of terms a1 , a2 , . . . , an
∏_{i=1}^{n} a_i Product of terms a1 , a2 , . . . , an
≈ approximately equal to
x∈X x is an element of the set X
x∈/X x is not an element of the set X
X ∪Y Union of the sets X and Y
X ∩Y Intersection of the sets X and Y
X −Y Set Difference of the sets X and Y, i.e., the set of elements within
the set X that are not elements of the set Y
X ×Y Cartesian Product of the sets X and Y
X^k For a set X and an integer k ≥ 1, this refers to the Cartesian
Product X × X × . . . × X with k occurrences of X in the Cartesian
Product (note: X^1 = X )
f :X→Y Function f with Domain X and Co-domain Y
f^k For a function f and an integer k ≥ 0, this refers to the function
composition of f with itself, repeated k times. So, f^k(x) is the
value f (f (. . . f (x) . . .)) with k occurrences of f in this function-
composition expression (note: f^1 = f and f^0 is the identity
function)
f^{-1} Inverse function of a bijective function f : X → Y, i.e., for all
x ∈ X , f^{-1}(f (x)) = x and for all y ∈ Y, f (f^{-1}(y)) = y
f ′ (x0 ) Derivative of the function f : X → R with respect to its domain
variable x ∈ X , evaluated at x = x0

f ′′ (x0 ) Second Derivative of the function f : X → R with respect to its
domain variable x ∈ X , evaluated at x = x0
P[X] Probability Density Function (PDF) of random variable X
P[X = x] Probability that random variable X takes the value x
P[X|Y ] Probability Density Function (PDF) of random variable X, con-
ditional on the value of random variable Y (i.e., PDF of X ex-
pressed as a function of the values of Y )
P[X = x|Y = y] Probability that random variable X takes the value x, condi-
tional on random variable Y taking the value y
E[X] Expected Value of random variable X
E[X|Y ] Expected Value of random variable X, conditional on the value
of random variable Y (i.e., Expected Value of X expressed as a
function of the values of Y )
E[X|Y = y] Expected Value of random variable X, conditional on random
variable Y taking the value y
x ∼ N (µ, σ²) Random variable x follows a Normal Distribution with mean µ
and variance σ 2
x ∼ P oisson(λ) Random variable x follows a Poisson Distribution with mean λ
f (x; w) Here f refers to a parameterized function with domain X (x ∈
X ), w refers to the parameters controlling the definition of the
function f
v^T Row-vector with components equal to the components of the
Column-vector v, i.e., Transpose of the Column-vector v (by de-
fault, we assume vectors are expressed as Column-vectors)
A^T Transpose of the matrix A
|v| L2 norm of vector v ∈ R^m, i.e., if v = (v1 , v2 , . . . , vm ), then
|v| = √(v1² + v2² + . . . + vm²)
A^{-1} Matrix-Inverse of the square matrix A
A·B Matrix-Multiplication of matrices A and B (note: vector nota-
tion v typically refers to a column-vector, i.e., a matrix with
1 column, and so v^T · w is simply the inner-product of same-
dimensional vectors v and w)
Im m × m Identity Matrix
Diagonal(v) m × m Diagonal Matrix whose elements are the same (also in
same order) as the elements of the m-dimensional Vector v
dim(v) Dimension of a vector v
Ic I represents the Indicator function and Ic = 1 if condition c is
True, = 0 if c is False
arg maxx∈X f (x) This refers to the value of x ∈ X that maximizes f (x), i.e.,
maxx∈X f (x) = f (arg maxx∈X f (x))
∇w f (w) Gradient of the function f with respect to w (note: w could be
an arbitrary data structure and this gradient is of the same data
type as the data type of w)
x←y Variable x is assigned (or updated to) the value of y

Overview

0.1. Learning Reinforcement Learning


Reinforcement Learning (RL) is emerging as a practical, powerful technique for solving
a variety of complex business problems across industries that involve Sequential Optimal
Decisioning under Uncertainty. Although RL is classified as a branch of Machine Learn-
ing (ML), it tends to be viewed and treated quite differently from other branches of ML
(Supervised and Unsupervised Learning). Indeed, RL seems to hold the key to unlock-
ing the promise of AI – machines that adapt their decisions to vagaries in observed in-
formation, while continuously steering towards the optimal outcome. Its penetration in
high-profile problems like self-driving cars, robotics and strategy games points to a future
where RL algorithms will have decisioning abilities far superior to humans.
But when it comes to getting educated in RL, there seems to be a reluctance to jump right
in because RL seems to have acquired a reputation of being mysterious and exotic. We of-
ten hear even technical people claim that RL involves “advanced math” and “complicated
engineering,” and so there seems to be a psychological barrier to entry. While real-world
RL algorithms and implementations do get fairly elaborate and complicated in overcom-
ing the proverbial last-mile of business problems, the foundations of RL can actually be
learned without heavy technical machinery. A key goal of this book is to demystify RL
by finding a balance between A) providing depth of understanding and B) keeping
technical content basic. So now we list the key features of this book which enable this
balance:

• Focus on the foundational theory underpinning RL. Our treatment of this theory is
based on undergraduate-level Probability, Optimization, Statistics and Linear Alge-
bra. We emphasize rigorous but simple mathematical notations and formulations
in developing the theory, and encourage you to write out the equations rather than
just reading from the book. Occasionally, we invoke some advanced mathematics
(eg: Stochastic Calculus) but the majority of the book is based on easily understand-
able mathematics. In particular, two basic theory concepts - Bellman Optimality
Equation and Generalized Policy Iteration - are emphasized throughout the book as
they form the basis of pretty much everything we do in RL, even in the most ad-
vanced algorithms.
• Parallel to the mathematical rigor, we bring the concepts to life with simple exam-
ples and informal descriptions to help you develop an intuitive understanding of the
mathematical concepts. We drive towards creating appropriate mental models to
visualize the concepts. Often, this involves turning mathematical abstractions into
physical examples (emphasizing visual intuition). So we go back and forth between
rigor and intuition, between abstractions and visuals, so as to blend them nicely and
get the best of both worlds.
• Each time you learn a new mathematical concept or algorithm, we encourage you to
write small pieces of code (in Python) that implements the concept/algorithm. As an
example, if you just learned a surprising theorem, we’d ask you to write a simulator

to simply verify the statement of the theorem. We emphasize this approach not just to
bolster the theoretical and intuitive understanding with a hands-on experience, but
also because there is a strong emotional effect of seeing expected results emanating
from one’s code, which in turn promotes long-term retention of the concepts. Most
importantly, we avoid messy and complicated ML/RL/BigData tools/packages and
stick to bare-bones Python/numpy as these unnecessary tools/packages are huge
blockages to core understanding. We believe coding from scratch and designing
the code to reflect the mathematical structure/concepts is the correct approach to
truly understand the concepts/algorithms.
• Lastly, it is important to work with examples that are A) simplified versions of real-
world problems in a business domain rich with applications, B) adequately com-
prehensible without prior business-domain knowledge, C) intellectually interesting
and D) sufficiently marketable to employers. We’ve chosen Financial Trading ap-
plications. For each financial problem, we first cover the traditional approaches (in-
cluding solutions from landmark papers) and then cast the problem in ways that
can be solved with RL. We have made considerable effort to make this book self-
contained in terms of the financial knowledge required to navigate these prob-
lems.

0.2. What you’ll learn from this Book


Here is what you will specifically learn and gain from the book:

• You will learn about the simple but powerful theory of Markov Decision Processes
(MDPs) – a framework for Sequential Optimal Decisioning under Uncertainty. You
will firmly understand the power of Bellman Equations, which are at the heart of all
Dynamic Programming as well as all RL algorithms.

• You will master Dynamic Programming (DP) Algorithms, which are a class of (in
the language of AI) Planning Algorithms. You will learn about Policy Iteration,
Value Iteration, Backward Induction, Approximate Dynamic Programming and the
all-important concept of Generalized Policy Iteration which lies at the heart of all DP
as well as all RL algorithms.

• You will gain a solid understanding of a variety of Reinforcement Learning (RL) Al-
gorithms, starting with the basic algorithms like SARSA and Q-Learning and mov-
ing on to several important algorithms that work well in practice, including Gradient
Temporal Difference, Deep Q-Networks, Least-Squares Policy Iteration, Policy Gra-
dient, and Monte-Carlo Tree Search. You will learn how to gain advantages in these
algorithms with bootstrapping, off-policy learning and deep-neural-networks-based
function approximation. You will learn how to balance exploration and exploitation
with Multi-Armed Bandits techniques like Upper Confidence Bounds, Thompson
Sampling, Gradient Bandits and Information State-Space algorithms. You will also
learn how to blend Planning and Learning methodologies, which is very important
in practice.

• You will work through plenty of “from-scratch” Python implementations of models
and algorithms. Throughout the book, we emphasize healthy Python programming
practices including interface design, type annotations, functional programming and
inheritance-based polymorphism (always ensuring that the programming principles

reflect the mathematical principles). The larger take-away from this book will be
a rare (and high-in-demand) ability to blend Applied Mathematics concepts with
Software Design paradigms.

• You will go deep into important Financial Trading problems, including:

– (Dynamic) Asset-Allocation to maximize Utility of Consumption
– Pricing and Hedging of Derivatives in an Incomplete Market
– Optimal Exercise/Stopping of Path-Dependent American Options
– Optimal Trade Order Execution (managing Price Impact)
– Optimal Market-Making (Bid/Ask managing Inventory Risk)

• We treat each of the above problems as MDPs (i.e., Optimal Decisioning formula-
tions), first going over classical/analytical solutions to these problems, then intro-
ducing real-world frictions/considerations, and tackling with DP and/or RL.

• As a bonus, we throw in a few applications beyond Finance, including a couple from
Supply-Chain and Clearance Pricing in a Retail business.

• We implement a wide range of Algorithms and develop various models in a git code
base that we refer to and explain in detail throughout the book. This code base not
only provides detailed clarity on the algorithms/models, but also serves to educate
on healthy programming patterns suitable not just for RL, but more generally for any
Applied Mathematics work.

• In summary, this book blends Theory/Mathematics, Programming/Algorithms and
Real-World Financial Nuances while always keeping things simple and intuitive.

0.3. Expected Background to read this Book


There is no shortcut to learning Reinforcement Learning or learning the Financial Appli-
cations content. You will need to allocate 100-200 hours of effort to learn this material
(assuming you have no prior background in these topics). This extent of effort incorpo-
rates the time required to write out the equations/theory as well as the coding of the mod-
els/algorithms, while you are making your way through this book. Note that although
we have kept the Mathematics, Programming and Financial content fairly basic, this topic
is only for technically-inclined readers. Below we outline the technical preparation that is
required to follow the material covered in this book.

• Experience with (but not necessarily expertise in) Python is expected and a good
deal of comfort with numpy is required. Note that much of the Python program-
ming in this book is for mathematical modeling and for numerical algorithms, so one
doesn’t need to know Python from the perspective of building engineering systems
or user-interfaces. So you don’t need to be a professional software developer/engineer
but you need to have a healthy interest in learning Python best practices associated
with mathematical modeling, algorithms development and numerical programming
(we teach these best practices in this book). We don’t use any of the popular (but
messy and complicated) Big Data/Machine Learning libraries such as Pandas, PyS-
park, scikit, Tensorflow, PyTorch, OpenCV, NLTK etc. (all you need to know is
numpy).

• Familiarity with git and use of an Integrated Development Environment (IDE), eg:
Pycharm or Emacs (with Python plugins), is recommended, but not required.
• Familiarity with LaTeX for writing equations is recommended, but not required (other
typesetting tools, or even hand-written math is fine, but LaTeX is a skill that is very
valuable if you’d like a future in the general domain of Applied Mathematics).
• You need to be strong in undergraduate-level Probability as it is the most important
foundation underpinning RL.
• You will also need to have some preparation in undergraduate-level Numerical Op-
timization, Statistics, Linear Algebra.
• No background in Finance is required, but a strong appetite for Mathematical Fi-
nance is required.

0.4. Decluttering the Jargon linked to Reinforcement Learning


Machine Learning has exploded in the past decade or so, and Reinforcement Learning
(treated as a branch of Machine Learning and hence, a branch of A.I.) has surfaced to the
mainstream in both academia and in the industry. It is important to understand what Re-
inforcement Learning aims to solve, rather than the more opaque view of RL as a technique
to learn from data. RL aims to solve problems that involve making Sequential Optimal De-
cisions under Uncertainty. Let us break down this jargon so as to develop an intuitive (and
high-level) understanding of the features pertaining to the problems RL solves.
Firstly, let us understand the term uncertainty. This means the problems under con-
sideration involve random variables that evolve over time. The technical term for this is
stochastic processes. We will cover this in detail later in this book, but for now, it’s impor-
tant to recognize that evolution of random variables over time is very common in nature
(e.g. weather) and in business (e.g. customer demand or stock prices), but modeling and
navigating such random evolutions can be enormously challenging.
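
To make the notion of a stochastic process a little more concrete, here is a minimal sketch (our own illustration, not code from the book's repository) that simulates one random evolution of a stock price as a simple additive random walk using numpy; the function name and parameter values are hypothetical.

```python
# A minimal, illustrative sketch of a stochastic process: a stock price whose
# value evolves randomly over time. All parameter values here are made up.
import numpy as np

def simulate_price_path(start_price: float = 100.0,
                        num_steps: int = 250,
                        step_stdev: float = 1.0,
                        seed: int = 42) -> np.ndarray:
    """One sample path of an additive random walk for a stock price."""
    rng = np.random.default_rng(seed)
    moves = rng.normal(loc=0.0, scale=step_stdev, size=num_steps)  # random moves
    return start_price + np.cumsum(moves)  # price = start + cumulative sum of moves

path = simulate_price_path()
print(path[:5])  # the first few prices of one random evolution
```
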
The next term is Optimal Decisions, which refers to the technical term Optimization. This
means there is a well-defined quantity to be maximized (the “goal”). The quantity to be
maximized might be financial (like investment value or business profitability), or it could
be a safety or speed metric (such as health of customers or time to travel), or something
more complicated like a blend of multiple objectives rolled into a single objective.
The next term is Sequential, which refers to the fact that as we move forward in time, the
relevant random variables’ values evolve, and the optimal decisions have to be adjusted to
the “changing circumstances.” Due to this non-static nature of the optimal decisions, the
term Dynamic Decisions is often used in the literature covering this subject.
Putting together the three notions of (Uncertainty/Stochastic, Optimization, Sequen-
tial/Dynamic Decisions), these problems (that RL tackles) have the common feature that
one needs to overpower the uncertainty by persistent steering towards the goal. This brings us
to the term Control (in reference to persistent steering). These problems are often (aptly)
characterized by the technical term stochastic control. So you see that there is indeed a lot
of jargon here. All of this jargon will become amply clear after the first few chapters in this
book where we develop mathematical formalism to understand these concepts precisely
(and also write plenty of code to internalize these concepts). For now, we just wanted to
familiarize you with the range of jargon linked to Reinforcement Learning.
This jargon overload is due to the confluence of terms from Control Theory (emerging
from Engineering disciplines), from Operations Research, and from Artificial Intelligence
(emerging from Computer Science). For simplicity, we prefer to refer to the class of prob-


Figure 0.1.: Many Faces of Reinforcement Learning (Image Credit: David Silver’s RL
Course)

lems RL aims to solve as stochastic control problems. Reinforcement Learning is a class
of algorithms that are used to solve Stochastic Control problems. We should point out
here that there are other disciplines (beyond Control Theory, Operations Research and
Artificial Intelligence) with a rich history of developing theory and techniques within the
general space of Stochastic Control. Figure 0.1 (a popular image on the internet, first seen
by us in David Silver’s RL course slides) illustrates the many faces of Stochastic Control,
which has often been referred to as “The many faces of Reinforcement Learning.”

It is also important to recognize that Reinforcement Learning is considered to be a branch
of Machine Learning. While there is no crisp definition for Machine Learning (ML), ML
generally refers to the broad set of techniques to infer mathematical models/functions by
acquiring (“learning”) knowledge of patterns and properties in the presented data. In
this regard, Reinforcement Learning does fit this definition. However, unlike the other
branches of ML (Supervised Learning and Unsupervised Learning), Reinforcement Learn-
ing is a lot more ambitious - it not only learns the patterns and properties of the presented
data, it also learns about the appropriate behaviors to be exercised (appropriate decisions
to be made) so as to drive towards the optimization objective. It is sometimes said that
Supervised Learning and Unsupervised Learning are about “minimization” (i.e., they min-
imize the fitting error of a model to the presented data), while Reinforcement Learning is
about “maximization” (i.e., RL identifies the suitable decisions to be made to maximize a
well-defined objective). Figure 0.2 depicts the in-vogue classification of Machine Learning.

More importantly, the class of problems RL aims to solve can be described with a simple
yet powerful mathematical framework known as Markov Decision Processes (abbreviated as
MDPs). We have an entire chapter dedicated to deep coverage of MDPs, but we provide
a quick high-level introduction to MDPs in the next section.

Figure 0.2.: Branches of Machine Learning (Image Credit: David Silver’s RL Course)

Figure 0.3.: The MDP Framework

0.5. Introduction to the Markov Decision Process (MDP)
framework
The framework of a Markov Decision Process is depicted in Figure 0.3. As the Figure in-
dicates, the Agent and the Environment interact in a time-sequenced loop. The term Agent
refers to an algorithm (AI algorithm) and the term Environment refers to an abstract entity
that serves up uncertain outcomes to the Agent. It is important to note that the Environ-
ment is indeed abstract in this framework and can be used to model all kinds of real-world
situations such as the financial market serving up random stock prices or customers of
a company serving up random demand or a chess opponent serving up random moves
(from the perspective of the Agent), or really anything at all you can imagine that serves
up something random at each time step (it is up to us to model an Environment appropri-
ately to fit the MDP framework).
As the Figure indicates, at each time step t, the Agent observes an abstract piece of infor-
mation (which we call State) and a numerical (real number) quantity that we call Reward.
Note that the concept of State is indeed completely abstract in this framework and we can
model State to be any data type, as complex or elaborate as we’d like. This flexibility in
modeling State permits us to model all kinds of real-world situations as an MDP. Upon
observing a State and a Reward at time step t, the Agent responds by taking an Action.
Again, the concept of Action is completely abstract and is meant to represent an activity
performed by an AI algorithm. It could be a purchase or sale of a stock responding to
market stock price movements, or it could be movement of inventory from a warehouse
to a store in response to large sales at the store, or it could be a chess move in response
to the opponent’s chess move (opponent is Environment), or really anything at all you can
imagine that responds to observations (State and Reward) served by the Environment.
Upon receiving an Action from the Agent at time step t, the Environment responds (with
time ticking over to t + 1) by serving up the next time step’s random State and random
Reward. A technical detail (that we shall explain in detail later) is that the State is assumed
to have the Markov Property, which means:

• The next State/Reward depends only on Current State (for a given Action).
• The current State encapsulates all relevant information from the history of the inter-
action between the Agent and the Environment.
• The current State is a sufficient statistic of the future (for a given Action).

The goal of the Agent at any point in time is to maximize the Expected Sum of all future
Rewards by controlling (at each time step) the Action as a function of the observed State
(at that time step). This function from a State to Action at any time step is known as the
Policy function. So we say that the Agent’s job is to exercise control by determining the opti-
mal Policy function. Hence, this is a dynamic (i.e., time-sequenced) control system under
uncertainty. If the above description was too terse, don’t worry - we will explain all of this
in great detail in the coming chapters. For now, we just wanted to provide a quick flavor
of what the MDP framework looks like. Now we sketch the above description with some
(terse) mathematical notation to provide a bit more of the overview of the MDP frame-
work. The following notation is for discrete time steps (continuous time steps notation is
analogous, but technically more complicated to describe here):
We denote time steps as t = 1, 2, 3, . . .. Markov State at time t is denoted as St ∈ S where
S is referred to as the State Space (a countable set). Action at time t is denoted as At ∈ A
where A is referred to as the Action Space (a countable set). Reward at time t is denoted as

Rt ∈ D where D is a countable subset of R (representing the numerical feedback served
by the Environment, along with the State, at each time step t).

Figure 0.4.: Baby Learning MDP

We represent the transition probabilities from one time step to the next with the follow-
ing notation:

p(r, s' \mid s, a) = \mathbb{P}[R_{t+1} = r, S_{t+1} = s' \mid S_t = s, A_t = a]


γ ∈ [0, 1] is known as the discount factor used to discount Rewards when accumulating
Rewards, as follows:

\text{Return } G_t = R_{t+1} + \gamma \cdot R_{t+2} + \gamma^2 \cdot R_{t+3} + \ldots


The discount factor γ allows us to model situations where a future reward is less desir-
able than a current reward of the same quantity.
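
For instance, if γ = 0.9 and the Environment were to serve a constant Reward of 1 at every
future time step, the Return would be the geometric series

G_t = 1 + 0.9 + 0.9^2 + \ldots = \frac{1}{1 - 0.9} = 10

so a Reward that arrives k time steps in the future is weighted by only γ^{k-1} relative to the
first Reward R_{t+1}.
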
The goal is to find a Policy π : S → A that maximizes E[Gt |St = s] for all s ∈ S. In
subsequent chapters, we clarify that the MDP framework actually considers more general
policies than described here - policies that are stochastic, i.e., functions that take as in-
put a state and output a probability distribution of actions (rather than a single action).
However, for ease of understanding of the core concepts, in this chapter, we stick to de-
terministic policies π : S → A.
The intuition here is that the two entities Agent and Environment interact in a time-
sequenced loop wherein the Environment serves up next states and rewards based on the
transition probability function p and the Agent exerts control over the vagaries of p by ex-
ercising the policy π in a way that optimizes the Expected “accumulated rewards” (i.e.,
Expected Return) from any state.
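
To make this loop concrete, here is a small, purely illustrative Python sketch; the step and
policy functions below are toy stand-ins invented just for this sketch (they are not part of
any library or of the framework developed later in this book):

import random

def step(state, action):
    # toy Environment dynamics: serve the next State and a random Reward
    next_state = state + action
    reward = random.random() - 0.5
    return next_state, reward

def policy(state):
    # a trivial deterministic Policy: every State maps to the same Action
    return 1

state, total_reward, gamma = 0, 0.0, 0.9
for t in range(100):
    action = policy(state)                 # Agent: observed State -> Action
    state, reward = step(state, action)    # Environment: next State and Reward
    total_reward += gamma ** t * reward    # accumulate the discounted Return

An RL algorithm’s job, loosely speaking, is to improve the policy function based on the
stream of (State, Reward) observations generated by exactly this kind of loop.
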
It’s worth pointing out that the MDP framework is inspired by how babies (Agent) learn
to perform tasks (i.e., take Actions) in response to the random activities and events (States
and Rewards) they observe as being served up from the world (Environment) around them.
Figure 0.4 illustrates this - at the bottom of the Figure (labeled “World,” i.e., Environment),

we have a room in a house with a vase atop a bookcase. At the top of the Figure is a
baby (learning Agent) on the other side of the room who wants to make her way to the
bookcase, reach for the vase, and topple it - doing this efficiently (i.e., in quick time and
quietly) would mean a large Reward for the baby. At each time step, the baby finds her-
self in a certain posture (eg: lying on the floor, or sitting up, or trying to walk etc.) and
observes various visuals around the room - her posture and her visuals would constitute
the State for the baby at each time step. The baby’s Actions are various options of physical
movements to try to get to the other side of the room (assume the baby is still learning how
to walk). The baby tries one physical movement, but is unable to move forward with that
movement. That would mean a negative Reward - the baby quickly learns that this move-
ment is probably not a good idea. Then she tries a different movement, perhaps trying to
stand on her feet and start walking. She makes a couple of good steps forward (positive
Rewards), but then falls down and hurts herself (that would be a big negative Reward). So
by trial and error, the baby learns about the consequences of different movements (dif-
ferent actions). Eventually, the baby learns that by holding on to the couch, she can walk
across, and then when she reaches the bookcase, she learns (again by trial and error) a
technique to climb the bookcase that is quick yet quiet (so she doesn’t raise her mom’s
attention). This means the baby learns the optimal policy (best actions for each of the
states she finds herself in) through what is essentially a “trial and error” method of learning
what works and what doesn’t. This example is essentially generalized in the MDP frame-
work, and the baby’s “trial and error” way of learning is essentially a special case of the
general technique of Reinforcement Learning.

0.6. Real-world problems that fit the MDP framework


As you might imagine by now, all kinds of problems in nature and in business (and indeed,
in our personal lives) can be modeled as Markov Decision Processes. Here is a sample of
such problems:

• Self-driving vehicle (Actions constitute speed/steering to optimize safety/time).


• Game of Chess (Actions constitute moves of the pieces to optimize chances of win-
ning the game).
• Complex Logistical Operations, such as those in a Warehouse (Actions constitute
inventory movements to optimize throughput/time).
• Making a humanoid robot walk/run on a difficult terrain (Actions are walking move-
ments to optimize time to destination).
• Manage an investment portfolio (Actions are trades to optimize long-term invest-
ment gains).
• Optimal decisions during a football game (Actions are strategic game calls to opti-
mize chances of winning the game).
• Strategy to win an election (Actions constitute political decisions to optimize chances
of winning the election).

Figure 0.5 illustrates the MDP for a self-driving car. At the top of the figure is the Agent
(the car’s driving algorithm) and at the bottom of the figure is the Environment (constitut-
ing everything the car faces when driving - other vehicles, traffic signals, road conditions,
weather etc.). The State consists of the car’s location, velocity, and all of the information
picked up by the car’s sensors/cameras. The Action consists of the steering, acceleration
and brake. The Reward would be a combination of metrics on ride comfort and safety,
as well as the negative of each time step (because maximizing the accumulated Reward
would then amount to minimizing time taken to reach the destination).

Figure 0.5.: Self-driving Car MDP

0.7. The inherent difficulty in solving MDPs


“Solving” an MDP refers to identifying the optimal policy with an algorithm. This section
paints an intuitive picture of why solving a general MDP is fundamentally a hard problem.
Often, the challenge is simply that the State Space is very large (involving many variables)
or complex (elaborate data structure), and hence, is computationally intractable. Like-
wise, sometimes the Action Space can be quite large or complex.
But the main reason why solving an MDP is inherently difficult is the fact that there
is no direct feedback on what the “correct” Action is for a given State. What we mean
by that is that unlike a supervised learning framework, the MDP framework doesn’t give
us anything other than a Reward feedback to indicate if an Action is the right one or not.
A large Reward might encourage the Agent, but it’s not clear if one just got lucky with
the large Reward or if there could be an even larger Reward if the Agent tries the Action
again. The linkage between Actions and Rewards is further complicated by the fact that
there is time-sequenced complexity in an MDP, meaning an Action can influence future
States, which in turn influences future Actions. Consequently, we sometimes find that
Actions can have delayed consequences, i.e., the Rewards for a good Action might come
after many time steps (eg: in a game of Chess, a brilliant move leads to a win after several
further moves).
The other problem one encounters in real-world situations is that the Agent often doesn’t
know the model of the environment. By model, we are referring to the probabilities of
state-transitions and rewards (the function p we defined above). This means the Agent
has to simultaneously learn the model (from the real-world data stream) and solve for the
optimal policy.

Lastly, when there are many actions, the Agent needs to try them all to check if there
are some hidden gems (great actions that haven’t been tried yet), which in turn means
one could end up wasting effort on “duds” (bad actions). So the agent has to find the
balance between exploitation (retrying actions which have yielded good rewards so far)
and exploration (trying actions that have either not been tried enough or not been tried at
all).
All of this seems to indicate that we don’t have much hope in solving MDPs in a reliable
and efficient manner. But it turns out that with some clever mathematics, we can indeed
make some good inroads. We outline the core idea of this “clever mathematics” in the
next section.

0.8. Value Function, Bellman Equation, Dynamic Programming and RL

Perhaps the most important concept we want to highlight in this entire book is the idea of
a Value Function and how we can compute it in an efficient manner with either Planning or
Learning algorithms. The Value Function V π : S → R for a given policy π is defined as:

V^\pi(s) = \mathbb{E}_{\pi,p}[G_t \mid S_t = s] \text{ for all } s \in \mathcal{S}


The intuitive way to understand Value Function is that it tells us how much “accumu-
lated future reward” (i.e., Return) we expect to obtain from a given state. The randomness
under the expectation comes from the uncertain future states and rewards the Agent is go-
ing to see (based on the function p), upon taking future actions prescribed by the policy
π. The key in evaluating the Value Function for a given policy is that it can be expressed
recursively, in terms of the Value Function for the next time step’s states. In other words,
V^\pi(s) = \sum_{r,s'} p(r, s' \mid s, \pi(s)) \cdot (r + \gamma \cdot V^\pi(s')) \text{ for all } s \in \mathcal{S}

This equation says that when the Agent is in a given state s, it takes an action a = π(s),
then sees a random next state s′ and a random reward r, so V π (s) can be broken into
the expectation of r (immediate next step’s expected reward) and the remainder of the
future expected accumulated rewards (which can be written in terms of the expectation
of V π (s′ )). We won’t get into the details of how to solve this recursive formulation in
this chapter (will cover this in great detail in future chapters), but it’s important for you
to recognize for now that this recursive formulation is the key to evaluating the Value
Function for a given policy.
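
Although the detailed algorithms come later, the following illustrative sketch (not the
framework code developed in subsequent chapters) shows how the recursion can be exploited
in the tabular case, assuming a hypothetical dictionary p mapping each (state, action) pair
to a list of (probability, reward, next state) triples, and a policy given as a dictionary from
each state to the action π(s):

def evaluate_policy(states, policy, p, gamma, num_iterations=1000):
    # repeatedly apply the recursion, starting from an arbitrary guess for V^pi
    v = {s: 0.0 for s in states}
    for _ in range(num_iterations):
        v = {s: sum(prob * (r + gamma * v[s1])
                    for prob, r, s1 in p[(s, policy[s])])
             for s in states}
    return v

Each pass replaces the current guess of V^pi(s) with the right-hand side of the recursion;
for a finite MDP with γ < 1, these iterates converge to the true Value Function V^pi.
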
However, evaluating the Value Function for a given policy is not the end goal - it is
simply a means to the end goal of evaluating the Optimal Value Function (from which we
obtain the Optimal Policy). The Optimal Value Function V ∗ : S → R is defined as:

V^*(s) = \max_\pi V^\pi(s) \text{ for all } s \in \mathcal{S}

The good news is that even the Optimal Value Function can be expressed recursively, as
follows:
V^*(s) = \max_a \sum_{r,s'} p(r, s' \mid s, a) \cdot (r + \gamma \cdot V^*(s')) \text{ for all } s \in \mathcal{S}

Furthermore, we can prove that there exists an Optimal Policy π ∗ achieving V ∗ (s) for
all s ∈ S (the proof is constructive, which gives a simple method to obtain the function
π ∗ from the function V ∗ ). Specifically, this means that the Value Function obtained by
following the optimal policy π ∗ is the same as the Optimal Value Function V ∗ , i.e.,

V^{\pi^*}(s) = V^*(s) \text{ for all } s \in \mathcal{S}
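
Purely as an illustrative sketch again (with the same hypothetical states, p and gamma as
above, plus a list of actions assumed to be available in every state), the recursion for V* can
be iterated in the same fashion, and an optimal deterministic policy can then be read off
greedily - which is precisely the constructive step mentioned above:

def value_iteration(states, actions, p, gamma, num_iterations=1000):
    v = {s: 0.0 for s in states}
    for _ in range(num_iterations):
        v = {s: max(sum(prob * (r + gamma * v[s1])
                        for prob, r, s1 in p[(s, a)])
                    for a in actions)
             for s in states}
    # constructive step: for each state, pick an action attaining the maximum
    pi_star = {s: max(actions,
                      key=lambda a: sum(prob * (r + gamma * v[s1])
                                        for prob, r, s1 in p[(s, a)]))
               for s in states}
    return v, pi_star
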
There is a bit of terminology here to get familiar with. The problem of calculating V π (s)
(Value Function for a given policy) is known as the Prediction problem (since this amounts
to statistical estimation of the expected returns from any given state when following a
policy π). The problem of calculating the Optimal Value Function V ∗ (and hence, Opti-
mal Policy π ∗ ), is known as the Control problem (since this requires steering of the policy
such that we obtain the maximum expected return from any state). Solving the Prediction
problem is typically a stepping stone towards solving the (harder) problem of Control.
These recursive equations for V π and V ∗ are known as the (famous) Bellman Equations
(which you will hear a lot about in future chapters). In a continuous-time formulation,
the Bellman Equation is referred to as the famous Hamilton-Jacobi-Bellman (HJB) equation.
The algorithms to solve the prediction and control problems based on Bellman equations
are broadly classified as:

• Dynamic Programming, a class of (in the language of A.I.) Planning algorithms.


• Reinforcement Learning, a class of (in the language of A.I.) Learning algorithms.

Now let’s talk a bit about the difference between Dynamic Programming and Reinforce-
ment Learning algorithms. Dynamic Programming algorithms (which we cover a lot of in
this book) assume that the Agent knows the transition probabilities p and the algorithm
takes advantage of the knowledge of those probabilities (leveraging the Bellman Equation
to efficiently calculate the Value Function). Dynamic Programming algorithms are con-
sidered to be Planning and not Learning (in the language of A.I.) because the algorithm
doesn’t need to interact with the Environment and doesn’t need to learn from the (states,
rewards) data stream coming from the Environment. Rather, armed with the transition
probabilities, the algorithm can reason about future probabilistic outcomes and perform
the requisite optimization calculation to infer the Optimal Policy. So it plans its path to
success, rather than learning about how to succeed.
However, in typical real-world situations, one doesn’t really know the transition proba-
bilities p. This is the realm of Reinforcement Learning (RL). RL algorithms interact with
the Environment, learn with each new (state, reward) pair received from the Environ-
ment, and incrementally figure out the Optimal Value Function (with the “trial and error”
approach that we outlined earlier). However, note that the Environment interaction could
be real interaction or simulated interaction. In the latter case, we do have a model of the tran-
sitions but the structure of the model is so complex that we only have access to samples of
the next state and reward (rather than an explicit representation of the probabilities). This
is known as a Sampling Model of the Environment. With access to such a sampling model
of the environment (eg: a robot learning on a simulated terrain), we can employ the same
RL algorithm that we would have used when interacting with a real environment (eg: a
robot learning on an actual terrain). In fact, most RL algorithms in practice learn from
simulated models of the environment. As we explained earlier, RL is essentially a “trial
and error” learning approach and hence, is quite laborious and fundamentally inefficient.
The recent progress in RL is coming from more efficient ways of learning the Optimal
Value Function, and better ways of approximating the Optimal Value Function. One of

the key challenges for RL in the future is to identify better ways of finding the balance be-
tween “exploration” and “exploitation” of actions. In any case, one of the key reasons RL
has started doing well lately is due to the assistance it has obtained from Deep Learning
(typically Deep Neural Networks are used to approximate the Value Function and/or to
approximate the Policy Function). RL with such deep learning approximations is known
by the catchy modern term Deep RL.
We believe the current promise of A.I. is dependent on the success of Deep RL. The
next decade will be exciting as RL research will likely yield improved algorithms and its
pairing with Deep Learning will hopefully enable us to solve fairly complex real-world
stochastic control problems.

0.9. Outline of Chapters


The chapters in this book are organized into 4 modules as follows:

• Module I: Processes and Planning Algorithms (Chapters 1-4)


• Module II: Modeling Financial Applications (Chapters 5-8)
• Module III: Reinforcement Learning Algorithms (Chapters 9-12)
• Module IV: Finishing Touches (Chapters 13-15)

Before covering the contents of the chapters in these 4 modules, the book starts with 2
unnumbered chapters. The first of these unnumbered chapters is this chapter (the one you
are reading) which serves as an Overview, covering the pedagogical aspects of learning
RL (and more generally Applied Math), outline of the learnings to be acquired from this
book, background required to read this book, a high-level overview of Stochastic Con-
trol, MDP, Value Function, Bellman Equation and RL, and finally the outline of chapters
in this book. The second unnumbered chapter is called Programming and Design. Since
this book makes heavy use of Python code for developing mathematical models and for
algorithms implementations, we cover the requisite Python background (specifically the
design paradigms we use in our Python code) in this chapter. To be clear, this chapter
is not a full Python tutorial – the reader is expected to have some background in Python
already. It is a tutorial of some key techniques and practices in Python (that many readers
of this book might not be accustomed to) that we use heavily in this book and that are also
highly relevant to programming in the broader area of Applied Mathematics. We cover
the topics of Type Annotations, List and Dict Comprehensions, Functional Programming,
Interface Design with Abstract Base Classes, Generic Programming and Generators.
The remaining chapters in this book are organized in the 4 modules we listed above.

0.9.1. Module I: Processes and Planning Algorithms


The first module of this book covers the theory of Markov Decision Processes (MDP),
Dynamic Programming (DP) and Approximate Dynamic Programming (ADP) across
Chapters 1-4.
In order to understand the MDP framework, we start with the foundations of Markov
Processes (sometimes referred to as Markov Chains) in Chapter 1. Markov Processes do
not have any Rewards or Actions, they only have states and state transitions. We be-
lieve spending a lot of time on this simplified framework of Markov Processes is excellent
preparation before getting to MDPs. Chapter 1 then builds upon Markov Processes to

include the concept of Reward (but not Action) - the inclusion of Reward yields a frame-
work known as Markov Reward Process. With Markov Reward Processes, we can talk
about Value Functions and Bellman Equation, which serve as great preparation for under-
standing Value Function and Bellman Equation later in the context of MDPs. Chapter 1
motivates these concepts with examples of stock prices and with a simple inventory ex-
ample that serves first as a Markov Process and then as a Markov Reward Process. There
is also a significant amount of programming content in this chapter to develop comfort as
well as depth in these concepts.
Chapter 2 on Markov Decision Processes lays the foundational theory underpinning RL –
the framework for representing problems dealing with sequential optimal decisioning un-
der uncertainty (Markov Decision Process). You will learn about the relationship between
Markov Decision Processes and Markov Reward Processes, about the Value Function and
the Bellman Equations. Again, there is a considerable amount of programming exercises
in this chapter. The heavy investment in this theory together with hands-on programming
will put you in a highly advantaged position to learn the following chapters in a very clear
and speedy manner.
Chapter 3 on Dynamic Programming covers the Planning technique of Dynamic Program-
ming (DP), which is an important class of foundational algorithms that can be an alter-
native to RL if the MDP is not too large or too complex. Also, learning these algorithms
provides important foundations to be able to understand subsequent RL algorithms more
deeply. You will learn about several important DP algorithms by the end of the chapter
and you will learn about why DP gets difficult in practice, which leads you to the mo-
tivation behind RL. Again, we cover plenty of programming exercises that are quick to
implement and will aid considerably in internalizing the concepts. Finally, we emphasize
a special algorithm - Backward Induction - for solving finite-horizon Markov Decision Pro-
cesses, which is the setting for the financial applications we cover in this book.
The Dynamic Programming algorithms covered in Chapter 3 suffer from the two so-
called curses: Curse of Dimensionality and Curse of Modeling. These curses can be cured
with a combination of sampling and function approximation. Module III covers the sam-
pling cure (using Reinforcement Learning). Chapter 4 on Function Approximation and Ap-
proximate Dynamic Programming covers the topic of function approximation and shows
how an intermediate cure - Approximate Dynamic Programming (function approxima-
tion without sampling) - is often quite viable and can be suitable for some problems. As
part of this chapter, we implement linear function approximation and approximation with
deep neural networks (forward and back propagation algorithms) so we can use these ap-
proximations in Approximate Dynamic Programming algorithms and later also in RL.

0.9.2. Module II: Modeling Financial Applications


The second module of this book covers the background on Utility Theory and 5 financial
applications of Stochastic Control across Chapters 5-8.
We begin this module with Chapter 5 on Utility Theory which covers a very important
Economics concept that is a pre-requisite for most of the Financial Applications we cover
in subsequent chapters. This is the concept of risk-aversion (i.e., how people want to be
compensated for taking risk) and the related concepts of risk-premium and Utility func-
tions. The remaining chapters in this module cover not only the 5 financial applications,
but also great detail on how to model them as MDPs, develop DP/ADP algorithms to
solve them, and write plenty of code to implement the algorithms, which helps internal-
ize the learnings quite well. Note that in practice these financial applications can get fairly

complex and DP/ADP algorithms don’t quite scale, which means we need to tap into RL
algorithms to solve them. So we revisit these financial applications in Module III when we
cover RL algorithms.
Chapter 6 is titled Dynamic Asset Allocation and Consumption. This chapter covers the
first of the 5 Financial Applications. This problem is about how to adjust the allocation
of one’s wealth to various investment choices in response to changes in financial markets.
The problem also involves how much wealth to consume in each interval over one’s life-
time so as to obtain the best utility from wealth consumption. Hence, it is the joint prob-
lem of (dynamic) allocation of wealth to financial assets and appropriate consumption of
one’s wealth over a period of time. This problem is best understood in the context of Mer-
ton’s landmark paper in 1969 where he stated and solved this problem. This chapter is
mainly focused on the mathematical derivation of Merton’s solution of this problem with
Dynamic Programming. You will also learn how to solve the asset allocation problem in
a simple setting with Approximate Backward Induction (an ADP algorithm covered in
Chapter 4).
Chapter 7 covers a very important topic in Mathematical Finance: Pricing and Hedging
of Derivatives. Full and rigorous coverage of derivatives pricing and hedging is a fairly
elaborate and advanced topic, and beyond the scope of this book. But we have provided a
way to understand the theory by considering a very simple setting - that of a single-period
with discrete outcomes and no provision for rebalancing of the hedges, that is typical
in the general theory. Following the coverage of the foundational theory, we cover the
problem of optimal pricing/hedging of derivatives in an incomplete market and the problem
of optimal exercise of American Options (both problems are modeled as MDPs). In this
chapter, you will learn about some highly important financial foundations such as the
concepts of arbitrage, replication, market completeness, and the all-important risk-neutral
measure. You will learn the proofs of the two fundamental theorems of asset pricing in
this simple setting. We also provide an overview of the general theory (beyond this simple
setting). Next you will learn about how to price/hedge derivatives incorporating real-
world frictions by modeling this problem as an MDP. In the final section of this chapter,
you will learn how to model the more general problem of optimal stopping as an MDP.
You will learn how to use Backward Induction (a DP algorithm we learned in Chapter 3)
to solve this problem when the state-space is not too big. By the end of this chapter, you
will have developed significant expertise in pricing and hedging complex derivatives,
a skill that is in high demand in the finance industry.
Chapter 8 on Order-Book Algorithms covers the remaining two Financial Applications,
pertaining to the world of Algorithmic Trading. The current practice in Algorithmic Trad-
ing is to employ techniques that are rules-based and heuristic. However, Algorithmic
Trading is quickly transforming into Machine Learning-based Algorithms. In this chap-
ter, you will be first introduced to the mechanics of trade order placements (market orders
and limit orders), and then introduced to a very important real-world problem – how to
submit a large-sized market order by splitting the shares to be transacted and timing the
splits optimally in order to overcome “price impact” and gain maximum proceeds. You
will learn about the classical methods based on Dynamic Programming. Next you will
learn about the market frictions and the need to tackle them with RL. In the second half
of this chapter, we cover the Algorithmic-Trading twin of the Optimal Execution problem
– that of a market-maker having to submit dynamically-changing bid and ask limit orders
so she can make maximum gains. You will learn about how market-makers (a big and
thriving industry) operate. Then you will learn about how to formulate this problem as
an MDP. We will do a thorough coverage of the classical Dynamic Programming solution

by Avellaneda and Stoikov. Finally, you will be exposed to the real-world nuances of this
problem, and hence, the need to tackle it with a market-calibrated simulator and RL.

0.9.3. Module III: Reinforcement Learning Algorithms


The third module of this book covers Reinforcement Learning algorithms across Chapters
9-12.
Chapter 9 on Monte-Carlo and Temporal-Difference methods for Prediction starts a new phase
in this book - our entry into the world of RL algorithms. To understand the basics of RL,
we start this chapter by restricting the RL problem to a very simple one – one where the
state space is small and manageable as a table enumeration (known as tabular RL) and
one where we only have to calculate the Value Function for a Fixed Policy (this problem
is known as the Prediction problem, versus the optimization problem which is known
as the Control problem). The restriction to Tabular Prediction is important because it
makes it much easier to understand the core concepts of Monte-Carlo (MC) and Temporal-
Difference (TD) in this simplified setting. The later part of this chapter extends Tabular
Prediction to Prediction with Function Approximation (leveraging the function approxi-
mation foundations we develop in Chapter 4 in the context of ADP). The remaining chap-
ters will build upon this chapter by adding more complexity and more nuances, while
retaining much of the key core concepts developed in this chapter. As ever, you will learn
by coding plenty of MC and TD algorithms from scratch.
Chapter 10 on Monte-Carlo and Temporal-Difference for Control makes the natural exten-
sion from Prediction to Control, while initially remaining in the tabular setting. The in-
vestments made in understanding the core concepts of MC and TD in Chapter 9 bear fruit
here as important Control Algorithms such as SARSA and Q-learning can now be learned
with enormous clarity. In this chapter, we implement both SARSA and Q-Learning from
scratch in Python. This chapter also introduces a very important concept for the future
success of RL in the real-world: off-policy learning (Q-Learning is the simplest off-policy
learning algorithm and it has had good success in various applications). The later part of
this chapter extends Tabular Control to Control with Function Approximation (leveraging
the function approximation foundations we develop in Chapter 4).
Chapter 11 on Batch RL, Experience-Replay, DQN, LSPI, Gradient TD moves on from basic
and more traditional RL algorithms to recent innovations in RL. We start this chapter with
the important ideas of Batch RL and Experience Replay, which make more efficient use of
data by storing data as it comes and re-using it in batches throughout the learning process
of the algorithm. We emphasize the innovative Deep Q-Networks algorithm that leverages
Experience-Replay and Deep Neural Networks for Q-Learning. We also emphasize a sim-
ple but important Batch RL technique using linear function approximation - the algorithm
for Prediction is known as Least-Squares Temporal Difference and its extension to solve
the Control problem is known as Least-Squares Policy Iteration. We discuss these algo-
rithms in the context of one of the Financial Applications from Module II. In the later part
of this chapter, we provide deeper insights into the core mathematics underpinning RL
algorithms (back to the basics of Bellman Equation). Understanding Value Function Geom-
etry will place you in a highly advantaged situation in terms of truly understanding what
it is that makes some Algorithms succeed in certain situations and fail in other situations.
This chapter also explains how to break out of the so-called Deadly Triad (when bootstrap-
ping, function approximation and off-policy are employed together, RL algorithms tend
to fail). The state-of-the-art Gradient TD Algorithm resists the deadly triad and we dive
deep into its inner workings to understand how and why.

Chapter 12 on Policy Gradient Algorithms introduces a very different class of RL algo-
rithms that are based on improving the policy using the gradient of the policy function
approximation (rather than the usual policy improvement based on explicit argmax on
Q-Value Function). When action spaces are large or continuous, Policy Gradient tends
to be the only option and so, this chapter is useful to overcome many real-world chal-
lenges (including those in many financial applications) where the action space is indeed
large. You will learn about the mathematical proof of the elegant Policy Gradient Theo-
rem and implement a couple of Policy Gradient Algorithms from scratch. You will learn
about state-of-the-art Actor-Critic methods and a couple of specialized algorithms that
have worked well in practice. Lastly, you will also learn about Evolutionary Strategies,
an algorithm that looks quite similar to Policy Gradient Algorithms, but is technically not
an RL Algorithm. However, learning about Evolutionary Strategies is important because
some real-world applications, including Financial Applications, can indeed be tackled well
with Evolutionary Strategies.

0.9.4. Module IV: Finishing Touches


The fourth module of this book provides the finishing touches - covering some miscella-
neous topics, showing how to blend together the Planning approach of Module I with the
Learning approach of Module III, and providing perspectives on practical nuances in the
real-world.
Chapter 13 on Multi-Armed Bandits: Exploration versus Exploitation is a deep-dive into
the topic of balancing exploration and exploitation, a topic of great importance in RL al-
gorithms. Exploration versus Exploitation is best understood in the simpler setting of the
Multi-Armed Bandit (MAB) problem. You will learn about various state-of-the-art MAB
algorithms, implement them in Python, and draw various graphs to understand how they
perform versus each other in various problem settings. You will then be exposed to Con-
textual Bandits which is a popular approach in optimizing choices of Advertisement place-
ments. Finally, you will learn how to apply the MAB algorithms within RL.
Chapter 14 on Blending Learning and Planning brings the various pieces of Planning
and Learning concepts learned in this book together. You will learn that in practice, one
needs to be creative about blending planning and learning concepts (a technique known as
Model-based RL). In practice, many problems are indeed tackled using Model-based RL.
You will also get familiar with an algorithm (Monte Carlo Tree Search) that was highly
popularized when it solved the Game of Go, a problem that was thought to be insur-
mountable by present AI technology.
Chapter 15 is the concluding chapter on Summary and Real-World Considerations. The
purpose of this chapter is two-fold: Firstly to summarize the key learnings from this book,
and secondly to provide some commentary on how to take the learnings from this book
into practice (to solve real-world problems). We specifically focus on the challenges one
faces in the real-world - modeling difficulties, problem-size difficulties, operational chal-
lenges, data challenges (access, cleaning, organization), and also change-management
challenges as one shifts an enterprise from legacy systems to an AI system. We also pro-
vide some guidance on how to go about building an end-to-end system based on RL.

0.9.5. Short Appendix Chapters


Finally, we have 7 short Appendix chapters at the end of this book. The first appendix
is on Moment Generating Functions and its use in various calculations across this book.

The second appendix is a technical perspective of Function Approximations as Affine Spaces,
which helps develop a deeper mathematical understanding of function approximations.
The third appendix is on Portfolio Theory covering the mathematical foundations of balanc-
ing return versus risk in portfolios and the much-celebrated Capital Asset Pricing Model
(CAPM). The fourth appendix covers the basics of Stochastic Calculus as we need some of
this theory (Ito Integral, Ito’s Lemma etc.) in the derivations in a couple of the chapters in
Module II. The fifth appendix is on the HJB Equation, which is a key part of the derivation
of the closed-form solutions for 2 of the 5 financial applications we cover in Module II. The
sixth appendix covers the derivation of the famous Black-Scholes Equation (and its solution
for Call/Put Options). The seventh appendix covers the formulas for Bayesian updates to
conjugate prior distributions for the parameters of Gaussian and Bernoulli data distribu-
tions.

Programming and Design
Programming is creative work with few constraints: imagine something and you can prob-
ably build it — in many different ways. Liberating and gratifying, but also challenging. Just
like starting a novel from a blank page or a painting from a blank canvas, a new program
is so open that it’s a bit intimidating. Where do you start? What will the system look like?
How will you get it right? How do you split your problem up? How do you prevent your
code from evolving into a complete mess?
There’s no easy answer. Programming is inherently iterative — we rarely get the right
design at first, but we can always edit code and refactor over time. But iteration itself is
not enough; just like a painter needs technique and composition, a programmer needs
patterns and design.
Existing teaching resources tend to deemphasize programming techniques and design.
Theory-heavy and algorithms-heavy books show models and algorithms as self-contained
procedures written in pseudocode, without the broader context (and corresponding de-
sign considerations) of a real codebase. Newer AI/ML materials sometimes take a differ-
ent tack and provide real code examples using industry-strength frameworks, but rarely
touch on software design questions.
In this book, we take a third approach. Starting from scratch, we build a Python frame-
work that reflects the key ideas and algorithms we cover in this book. The abstractions
we define map to the key concepts we introduce; how we structure the code maps to the
relationships between those concepts.
Unlike the pseudocode approach, we do not implement algorithms in a vacuum; rather,
each algorithm builds on abstractions introduced earlier in the book. By starting from
scratch (rather than using an existing ML framework) we keep the code reasonably sim-
ple, without needing to worry about specific examples going out of date. We can focus
on the concepts important to this book while teaching programming and design in situ,
demonstrating an intentional approach to code design.

0.10. Code Design


How can we take a complex domain like reinforcement learning and turn it into code that
is easy to understand, debug and extend? How can we split this problem into manageable
pieces? How do those pieces interact?
There is no simple single answer to these questions. No two programming challenges
are identical and the same challenge has many reasonable solutions. A solid design will
not be completely clear up-front; it helps to have a clear direction in mind, but expect to
revisit specific decisions over time. That’s exactly what we did with the code for this book:
we had a vision for a Python Reinforcement Learning framework that matched the topics
we present, but as we wrote more and more of the book, we revised the framework code
as we came up with better ideas or found new requirements our previous design did not
cover.
We might have no easy answers, but we do have patterns and principles that — in our
experience — consistently produce quality code. Taken together, these ideas form a
philosophy of code design oriented around defining and combining abstractions that reflect
how we think about our domain. Since code itself can point to specific design ideas and
capabilities, there’s a feedback loop: expanding the programming abstractions we’ve de-
signed can help us find new algorithms and functionality, improving our understanding
of the domain.
Just what is an abstraction? An appropriately abstract question! An abstraction is a
“compound idea”: a single concept that combines multiple separate ideas into one. We
can combine ideas along two axes:

• We can compose different concepts together, thinking about how they behave as one
unit. A car engine has thousands of parts that interact in complex ways, but we can
think about it as a single object for most purposes.
• We can unify different concepts by identifying how they are similar. Different breeds
of dogs might look totally different, but we can think of all of them as dogs.

The human mind can only handle so many distinct ideas at a time — we have an inher-
ently limited working memory. A rather simplified model is that we only have a handful
of “slots” in working memory and we simply can’t track more independent thoughts at
the same time. The way we overcome this limitation is by coming up with new ideas (new
abstractions) that combine multiple concepts into one.
We want to organize code around abstractions for the same reason that we use abstrac-
tions to understand more complex ideas. How do you understand code? Do you run the
program in your head? That’s a natural starting point and it works for simple programs
but it quickly becomes difficult and then impossible. A computer doesn’t have working-
memory limitations and can run billions of instructions a second that we can’t possibly
keep up with. The computer doesn’t need structure or abstraction in the code it runs, but
we need it to have any hope of writing or understanding anything beyond the simplest
of programs. Abstractions in our code group information and logic so that we can think
about rich concepts rather than tracking every single bit of information and every single
instruction separately.
The details may differ, but designing code around abstractions that correspond to a solid
mental model of the domain works well in any area and with any programming language.
It might take some extra up-front thought but, done well, this style of design pays divi-
dends. Our goal is to write code that makes life easier for ourselves; this helps for everything
from “one-off” experimental code through software engineering efforts with large teams.

0.11. Environment Setup


You can follow along with all of the examples in this book by getting a copy of the RL
framework from GitHub1 and setting up a dedicated Python 3 environment for the code.
The Python code depends on several Python libraries. Once you have a copy of the code
repository, you can create an environment with the right libraries by running a few shell
commands.
First, move to the directory with the codebase:

cd rl-book
1
If you are not familiar with Git and GitHub, look through GitHub’s Getting Started documentation.

Then, create and activate a Python virtual environment2:

python3 -m venv .venv
source .venv/bin/activate

You only need to create the environment once, but you will need to activate it every time
that you want to work on the code from a new shell. Once the environment is activated,
you can install the right versions of each Python dependency:

pip install -r requirements.txt

To access the framework itself, you need to install it in editable mode (-e):

pip install -e .

Once the environment is set up, you can confirm that it works by running the frame-
work’s automated tests:

python -m unittest discover

If everything installed correctly, you should see an “OK” message on the last line of the
output after running this command.

0.12. Classes and Interfaces


What does designing clean abstractions actually entail? There are always two parts to
answering this question:

1. Understanding the domain concept that you are modeling.


2. Figuring out how to express that concept with features and patterns provided by
your programming language.

Let’s jump into an extended example to see exactly what this means. One of the key
building blocks for Reinforcement Learning — all of statistics and machine learning, really
— is Probability. How are we going to handle uncertainty and randomness in our code?
One approach would be to keep Probability implicit. Whenever we have a random vari-
able, we could call a function and get a random result (technically referred to as a sample).
If we were writing a Monopoly game with two six-sided dice, we would define it like this:

from random import randint

def six_sided():
    return randint(1, 6)

def roll_dice():
    return six_sided() + six_sided()

2
A Python “virtual environment” is a way to manage Python dependencies on a per-project basis. Having a
different environment for different Python projects lets each project have its own version of Python libraries,
which avoids problems when one project needs an older version of a library and another project needs a newer
version.

This works, but it’s pretty limited. We can’t do anything except get one outcome at
a time. More importantly, this only captures a slice of how we think about Probability:
there’s randomness but we never even mentioned probability distributions (referred to as
simply distributions for the rest of this chapter). We have outcomes and we have a function
we can call repeatedly, but there’s no way to tell that function apart from a function that
has nothing to do with Probability but just happens to return an integer.
How can we write code to get the expected value of a distribution? If we have a paramet-
ric distribution (eg: a distribution like Poisson or Gaussian, characterized by parameters),
can we get the parameters out if we need them?
Since distributions are implicit in the code, the intentions of the code aren’t clear and it is
hard to write code that generalizes over distributions. Distributions are absolutely crucial
for machine learning, so this is not a great starting point.

0.12.1. A Distribution Interface


To address these problems, let’s define an abstraction for probability distributions.
How do we represent a distribution in code? What can we do with distributions? That
depends on exactly what kind of distribution we’re working with. If we know something
about the structure of a distribution — perhaps it’s a Poisson distribution where λ = 5,
perhaps it’s an empirical distribution with set probabilities for each outcome — we could
do quite a bit: produce an exact Probability Distribution Function (PDF) or Cumulative
Distribution Function (CDF), calculate expectations and do various operations efficiently.
But that isn’t the case for all the distributions we work with! What if the distribution comes
from a complicated simulation? At the extreme, we might not be able to do anything except
draw samples from the distribution.
Sampling is the least common denominator. We can sample distributions where we
don’t know enough to do anything else, and we can sample distributions where we know
the exact form and parameters. Any abstraction we start with for a probability distribution
needs to cover sampling, and any abstraction that requires more than just sampling will
not let us handle all the distributions we care about.
In Python, we can express this idea with a class:

from abc import ABC, abstractmethod

class Distribution(ABC):
    @abstractmethod
    def sample(self):
        pass

This class defines an interface: a definition of what we require for something to qualify
as a distribution. Any kind of distribution we implement in the future will be able to, at
minimum, generate samples; when we write functions that sample distributions, they can
require their inputs to inherit from Distribution.
The class itself does not actually implement sample. Distribution captures the abstract
concept of distributions that we can sample, but we would need to specify a specific distri-
bution to actually sample anything. To reflect this in Python, we’ve made Distribution an
abstract base class (ABC), with sample as an abstract method—a method without an im-
plementation. Abstract classes and abstract methods are features that Python provides to
help us define interfaces for abstractions. We can define the Distribution class to structure
the rest of our probability distribution code before we define any specific distributions.

0.12.2. A Concrete Distribution
Now that we have an interface, what do we do with it? An interface can be approached
from two sides:
• Something that requires the interface. This will be code that uses operations speci-
fied in the interface and work with any value that satisfies those requirements.
• Something that provides the interface. This will be some value that supports the
operations specified in the interface.
If we have some code that requires an interface and some other code that satisfies the
interface, we know that we can put the two together and get something that works —
even if the two sides were written without any knowledge or reference to each other. The
interface manages how the two sides interact.
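
As a small illustration of the “requires” side, consider the following helper (invented here
purely for illustration, not part of the interface we just defined); it works with any value
that provides Distribution, using nothing beyond sample:

def expected_value(d, n=1000):
    # requires only the Distribution interface: sample n outcomes and average them
    return sum(d.sample() for _ in range(n)) / n

Any concrete distribution we define below can be passed to such a function unchanged,
because the function relies only on what the interface guarantees.
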
To use our Distribution class, we can start by providing a concrete class3 that imple-
ments the interface. Let’s say that we wanted to model dice — perhaps for a game of D&D
or Monopoly. We could do this by defining a Die class that represents an n-sided die and
inherits from Distribution:
import random

class Die(Distribution):
    def __init__(self, sides):
        self.sides = sides

    def sample(self):
        return random.randint(1, self.sides)

six_sided = Die(6)

def roll_dice():
    return six_sided.sample() + six_sided.sample()

This version of roll_dice has exactly the same behavior as roll_dice in the previous
section, but it took a bunch of extra code to get there. What was the point?
The key difference is that we now have a value that represents the distribution of rolling
a die, not just the outcome of a roll. The code is easier to understand — when we come
across a Die object, the meaning and intention behind it is clear — and it gives us a place to
add additional die-specific functionality. For example, it would be useful for debugging
if we could print not just the outcome of rolling a die but the die itself — otherwise, how
would we know if we rolled a die with the right number of sides for the given situation?
If we were using a function to represent our die, printing it would not be useful:
>>> print(six_sided)
<function six_sided at 0x7f00ea3e3040>

That said, the Die class we’ve defined so far isn’t much better:
>>> print(Die(6))
<__main__.Die object at 0x7ff6bcadc190>

With a class — and unlike a function — we can fix this. Python lets us change some of
the built-in behavior of objects by overriding special methods. To change how the class is
printed, we can override __repr__:4
3
In this context, a concrete class is any class that is not an abstract class. More generally, “concrete” is the
opposite of “abstract” — when an abstraction can represent multiple more specific concepts, we call any of
the specific concepts “concrete.”
4
Our definition of __repr__ used a Python feature called an “f-string.” Introduced in Python 3.6, f-strings make
it easier to inject Python values into strings. By putting an f in front of a string literal, we can include a Python
value in a string: f"{1 + 1}" gives us the string "2".

class Die(Distribution):
    ...

    def __repr__(self):
        return f"Die(sides={self.sides})"

Much better:
>>> print(Die(6))
Die(sides=6)

This seems small but makes debugging much easier, especially as the codebase gets
larger and more complex.
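
For instance (just as an illustration), the custom __repr__ also kicks in when Die objects
appear inside containers, which is exactly where debugging output tends to get hard to read:

>>> print([Die(6), Die(20)])
[Die(sides=6), Die(sides=20)]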

Dataclasses
The Die class we wrote is intentionally simple. Our die is defined by a single property: the
number of sides it has. The __init__ method takes the number of sides as an input and
puts it into an attribute of the class; once a Die object is created, there is no reason to change
this value — if we need a die with a different number of sides, we can just create a new
object. Abstractions do not have to be complex to be useful.
Unfortunately, some of the default behavior of Python classes isn’t well-suited to simple
classes. We’ve already seen that we need to override __repr__ to get useful behavior, but
that’s not the only default that’s inconvenient. Python’s default way to compare objects for
equality — the __eq__ method — uses the is operator, which means it compares objects
by identity. This makes sense for classes in general which can change over time, but it is a
poor fit for a simple abstraction like Die. Two Die objects with the same number of sides have
the same behavior and represent the same probability distribution, but with the default
version of __eq__, two Die objects declared separately will never be equal:
>>> six_sided = Die(6)
>>> six_sided == six_sided
True
>>> six_sided == Die(6)
False
>>> Die(6) == Die(6)
False

This behavior is inconvenient and confusing, the sort of edge-case that leads to hard-to-
spot bugs. Just like we overrode __repr__, we can fix this by overriding __eq__:
def __eq__(self, other):
    return self.sides == other.sides

This fixes the weird behavior we saw earlier:


>>> Die(6) == Die(6)
True

However, this simple implementation will lead to errors if we use == to compare a Die
with a non-Die value:
>>> Die(6) == None
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File ".../rl/chapter1/probability.py", line 18, in __eq__
    return self.sides == other.sides
AttributeError: 'NoneType' object has no attribute 'sides'

We generally won’t be comparing values of different types with == — for None, Die(6) is
None would be more idiomatic — but the usual expectation in Python is that == on different
types will return False rather than raising an exception. We can fix this by explicitly checking
the type of other:

def __eq__(self, other):
    if isinstance(other, Die):
        return self.sides == other.sides
    return False

>>> Die(6) == None
False

Most of the classes we will define in the rest of the book follow this same pattern —
they’re defined by a small number of parameters, all that __init__ does is set a few at-
tributes and they need custom __repr__ and __eq__ methods. Manually defining __init__,
__repr__ and __eq__ for every single class isn’t too bad — the definitions are entirely sys-
tematic — but it carries some real costs:

• Extra code without important content makes it harder to read and navigate through a
codebase.
• It’s easy for mistakes to sneak in. For example, if you add an attribute to a class but
forget to add it to its __eq__ method, you won’t get an error — == will just ignore
that attribute. Unless you have tests that explicitly check how == handles your new
attribute, this oversight can sneak through and lead to weird behavior in code that
uses your class.
• Frankly, writing these methods by hand is just tedious.

Luckily, Python 3.7 introduced a feature that fixes all of these problems: dataclasses.
The dataclasses module provides a decorator5 that lets us write a class that behaves like
Die without needing to manually implement __init__, __repr__ or __eq__. We still have
access to “normal” class features like inheritance ((Distribution)) and custom methods
(sample):

from dataclasses import dataclass

@dataclass
class Die(Distribution):
    sides: int

    def sample(self):
        return random.randint(1, self.sides)

This version of Die has the exact behavior we want in a way that’s easier to write and
— more importantly — far easier to read. For comparison, here’s the code we would have
needed without dataclasses:

class Die(Distribution):
    def __init__(self, sides):
        self.sides = sides

    def __repr__(self):
        return f"Die(sides={self.sides})"

    def __eq__(self, other):
        if isinstance(other, Die):
            return self.sides == other.sides
        return False

    def sample(self):
        return random.randint(1, self.sides)

5
Python decorators are modifiers that can be applied to class, function and method definitions. A decorator is
written above the definition that it applies to, starting with a @ symbol. Examples include abstractmethod —
which we saw earlier — and dataclass.

As you can imagine, the difference would be even starker for classes with more at-
tributes!
Dataclasses provide such a useful foundation for classes in Python that the majority of
the classes we define in this book are dataclasses — we use dataclasses unless we have a
specific reason not to.

Immutability

Once we’ve created a Die object, it does not make sense to change its number of sides — if
we need a distribution for a different die, we can create a new object instead. If we change
the sides of a Die object in one part of our code, it will also change in every other part of
the codebase that uses that object, in ways that are hard to track. Even if the change made
sense in one place, chances are it is not expected in other parts of the code. Changing state
can create invisible connections between seemingly separate parts of the codebase which
becomes hard to mentally track. A sure recipe for bugs!
Normally, we avoid this kind of problem in Python purely by convention: nothing stops
us from changing sides on a Die object, but we know not to do that. This is doable, but
hardly ideal; just like it is better to rely on seatbelts rather than pure driver skill, it is better
to have the language prevent us from doing the wrong thing than relying on pure conven-
tion. Normal Python classes don’t have a convenient way to stop attributes from changing,
but luckily dataclasses do:

@dataclass(frozen=True)
class Die(Distribution):
    ...

With frozen=True, attempting to change sides will raise an exception:

>>> d = Die(6)
>>> d.sides = 10
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<string>", line 4, in __setattr__
dataclasses.FrozenInstanceError: cannot assign to field 'sides'

An object that we cannot change is called immutable. Instead of changing the object
in place, we can return a fresh copy with the attribute changed; dataclasses provides a
replace function that makes this easy:

>>> import dataclasses
>>> d6 = Die(6)
>>> d20 = dataclasses.replace(d6, sides=20)
>>> d20
Die(sides=20)

This example is a bit convoluted — with such a simple object, we would just write d20 =
Die(20) — but dataclasses.replace becomes a lot more useful with more complex objects
that have multiple attributes.
Returning a fresh copy of data rather than modifying in place is a common pattern in
Python libraries. For example, the majority of Pandas operations — like drop or fillna —
return a copy of the dataframe rather than modifying the dataframe in place. These meth-
ods have an inplace argument as an option, but this leads to enough confusing behavior
that the Pandas team is currently deliberating on deprecating inplace altogether.
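For instance, here is a small illustrative snippet of this copy-returning style in Pandas (the
DataFrame and column name are made up for the example):

import pandas as pd

df = pd.DataFrame({"x": [1.0, None, 3.0]})

filled = df.fillna(0.0)        # returns a new dataframe; df itself is unchanged
df.fillna(0.0, inplace=True)   # modifies df in place and returns None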
Apart from helping prevent odd behavior and bugs, frozen=True has an important bonus:
we can use immutable objects as dictionary keys and set elements. Without frozen=True,
we would get a TypeError because non-frozen dataclasses do not implement __hash__:
>>> d = Die(6)
>>> {d: "abc"}
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: unhashable type: 'Die'
>>> {d}
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: unhashable type: 'Die'

With frozen=True, dictionaries and sets work as expected:


>>> d = Die(6)
>>> {d: "abc"}
{Die(sides=6): 'abc'}
>>> {d}
{Die(sides=6)}

Immutable dataclass objects act like plain data — not too different from strings and ints.
In this book, we follow the same practice with frozen=True as we do with dataclasses in
general: we set frozen=True unless there is a specific reason not to.

0.12.3. Checking Types


A die has to have an int number of sides — 0.5 sides or "foo" sides simply doesn't make
sense. Python will not stop us from trying Die("foo"), but we would get a TypeError if we
tried sampling it:
>>> foo = Die("foo")
>>> foo.sample()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File ".../rl/chapter1/probability.py", line 37, in sample
    return random.randint(1, self.sides)
  File ".../lib/python3.8/random.py", line 248, in randint
    return self.randrange(a, b+1)
TypeError: can only concatenate str (not "int") to str

The types of an object’s attributes are a useful indicator of how the object should be
used. Python’s dataclasses let us use type annotations (also known as “type hints”) to
specify the type of each attribute:
@dataclass(frozen=True)
class Die(Distribution):
    sides: int

In normal Python, these type annotations exist primarily for documentation — a user
can see the types of each attribute at a glance, but the language does not raise an error when
an object is created with the wrong types in an attribute. External tools — Integrated
Development Environments (IDEs) and typecheckers — can catch type mismatches in
annotated Python code without running the code. With a type-aware editor, Die("foo")
would be underlined with an error message:

Argument of type "Literal['foo']" cannot be assigned to parameter "sides"
  of type "int" in function "__init__"
  "Literal['foo']" is incompatible with "int" [reportGeneralTypeIssues]

This particular message comes from pyright running over the language server protocol
(LSP), but Python has a number of different typecheckers available⁶.
Instead of needing to call sample to see an error — which we then have to carefully read
to track back to the source of the mistake — the mistake is highlighted for us without even
needing to run the code.

Static Typing

Being able to find type mismatches without running code is called static typing. Some lan-
guages — like Java and C++ — require all code to be statically typed; Python does not. In
fact, Python started out as a dynamically typed language with no type annotations. Type
errors would only come up when the code containing the error was run.
Python is still primarily a dynamically typed language — type annotations are optional in
most places and there is no built-in checking for annotations. In the Die(”foo”) example,
we only got an error when we ran code that passed sides into a function that required
an int (random.randint). We can get static checking with external tools, but even then it
remains optional — even statically checked Python code runs dynamic type checks, and
we can freely mix statically checked and “normal” Python. Optional static typing on top
of a dynamically typed language is called gradual typing because we can incrementally
add static types to an existing dynamically typed codebase.
Dataclass attributes are not the only place where knowing types is useful; it would also
be handy for function parameters, return values and variables. Python supports optional
annotations on all of these; dataclasses are the only language construct where annotations
are required. To help mix annotated and unannotated code, typecheckers will report mis-
matches in code that is explicitly annotated, but will usually not try to guess types for
unannotated code.
⁶ Python has a number of external typecheckers, including:

• mypy
• pyright
• pytype
• pyre

The PyCharm IDE also has a proprietary typechecker built-in. These tools can be run from the
command line or integrated into editors. Different checkers mostly overlap in functionality and
coverage, but have slight differences in the sort of errors they detect and the style of error messages
they generate.

How would we add type annotations to our example code? So far, we've defined two
classes:
• Distribution, an abstract class defining interfaces for probability distributions in
general
• Die, a concrete class for the distribution of an n-sided die

We've already annotated that sides in Die has to be an int. We also know that the outcome
of a die roll is an int. We can annotate this by adding -> int after def sample(...):

@dataclass(frozen=True)
class Die(Distribution):
    sides: int

    def sample(self) -> int:
        return random.randint(1, self.sides)

Other kinds of concrete distributions would have other sorts of outcomes. A coin flip
would either be "heads" or "tails"; a normal distribution would produce a float. Type
annotations are particularly important when writing code for any kind of mathematical
modeling because they ensure that the type specifications in our code correspond clearly to
the precise specification of sets in our mathematical (notational) description.
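As an illustration, here is one way the coin-flip distribution mentioned above could be
written (a sketch assuming the outcomes are the strings "heads" and "tails"; this exact
class is not part of the code shown so far):

import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Coin(Distribution):
    def sample(self) -> str:
        # a fair coin: each outcome with probability 0.5
        return "heads" if random.random() < 0.5 else "tails"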

0.12.4. Type Variables


Annotating sample for the Die class was straightforward: the outcome of a die roll is always
a number, so sample always returns int. But what type would sample in general have? The
Distribution class defines an interface for any distribution, which means it needs to cover
any type of outcomes. For our first version of the Distribution class, we didn’t annotate
anything for sample:

class Distribution(ABC):
    @abstractmethod
    def sample(self):
        pass

This works — annotations are optional — but it can get confusing: some code we write
will work for any kind of distribution, some code needs distributions that return numbers,
other code will need something else… In every case, sample had better return something, but
that isn't explicitly annotated. When we leave out annotations our code will still work, but
our editor or IDE will not catch as many mistakes.
The difficulty here is that different kinds of distributions — different implementations of
the Distribution interface — will return different types from sample. To deal with this, we
need type variables: variables that stand in for some type that can be different in different
contexts. Type variables are also known as “generics” because they let us write classes that
generically work for any type.
To add annotations to the abstract Distribution class, we will need to define a type vari-
able for the outcomes of the distribution, then tell Python that Distribution is “generic”
in that type:

from typing import Generic, TypeVar

# A type variable named "A"
A = TypeVar("A")


# Distribution is "generic in A"
class Distribution(ABC, Generic[A]):
    # Sampling must produce a value of type A
    @abstractmethod
    def sample(self) -> A:
        pass

In this code, we've defined a type variable A⁷ and specified that Distribution uses A
by inheriting from Generic[A]. We can now write type annotations for distributions with
specific types of outcomes: for example, Die would be an instance of Distribution[int] since
the outcome of a die roll is always an int. We can make this explicit in the class definition:

class Die(Distribution[int]):
    ...

This lets us write specialized functions that only work with certain kinds of distribu-
tions. Let’s say we wanted to write a function that approximated the expected value of
a distribution by sampling repeatedly and calculating the mean. This function works for
distributions that have numeric outcomes — float or int — but not other kinds of distri-
butions (How would we calculate an average for a random name?). We can annotate this
explicitly by using Distribution[float]⁸:

import statistics


def expected_value(d: Distribution[float], n: int = 100) -> float:
    return statistics.mean(d.sample() for _ in range(n))

With this function:

• expected_value(Die(6)) would be fine
• expected_value(RandomName()) (where RandomName is a Distribution[str], sketched
  below) would be a type error

Using expected_value on a distribution with non-numeric outcomes would raise a type
error at runtime. Having this highlighted in the editor can save us time — we see the
mistake right away, rather than waiting for tests to run — and will catch the problem even
if our test suite doesn't.
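To make the RandomName example concrete, here is a purely hypothetical sketch of what
such a class could look like (the book does not define RandomName; the names attribute and
constructor are made up for illustration):

import random
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class RandomName(Distribution[str]):
    names: Sequence[str]

    def sample(self) -> str:
        # picks one of the given names uniformly at random
        return random.choice(self.names)

With such a class, a typechecker would flag expected_value(RandomName(["Alice", "Bob"]))
because str outcomes are not compatible with Distribution[float].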

0.12.5. Functionality
So far, we’ve covered two abstractions for working with probability distributions:

• Distribution: an abstract class that defines the interface for probability distributions
• Die: a distribution for rolling fair n-sided dice

This is an illustrative example, but it doesn’t let us do much. If all we needed were n-
sided dice, a separate Distribution class would be overkill. Abstractions are a means for
managing complexity, but any abstraction we define also adds some complexity to a code-
base itself — it’s one more concept for a programmer to learn and understand. It’s always
worth considering whether the added complexity from defining and using an abstraction
is worth the benefit. How does the abstraction help us understand the code? What kind
of mistakes does it prevent — and what kind of mistakes does it encourage? What kind of
7
Traditionally, type variables have one-letter capitalized names — although it’s perfectly fine to use full words
if that would make the code clearer.
8
The float type in Python also covers int, so we can pass a Distribution[int] anywhere that a
Distribution[float] is required.

46
added functionality does it give us? If we don’t have sufficiently solid answers to these
questions, we should consider leaving the abstraction out.
If all we cared about were dice, Distribution wouldn’t carry its weight. Reinforcement
Learning, though, involves both a wide range of specific distributions — any given Re-
inforcement Learning problem can have domain-specific distributions — as well as algo-
rithms that need to work for all of these problems. This gives us two reasons to define a
Distribution abstraction: Distribution will unify different applications of Reinforcement
Learning and will generalize our Reinforcement Learning code to work in different con-
texts. By programming against a general interface like Distribution, our algorithms will
be able to work for the different applications we present in the book — and even work for
applications we weren’t thinking about when we designed the code.
One of the practical advantages of defining general-purpose abstractions in our code is
that it gives us a place to add functionality that will work for any instance of the abstraction.
For example, one of the most common operations for a probability distribution that we can
sample is drawing n samples. Of course, we could just write a loop every time we needed
to do this:
samples = []
for _ in range(100):
    samples += [distribution.sample()]

This code is fine, but it's not great. A for loop in Python might be doing pretty much
anything, since the same construct is used for every kind of repetition. It's hard to understand what a
loop is doing at a glance, so we’d have to carefully read the code to see that it’s putting
100 samples in a list. Since this is such a common operation, we can add a method for it
instead:
class Distribution(ABC, Generic[A]):
    ...

    def sample_n(self, n: int) -> Sequence[A]:
        return [self.sample() for _ in range(n)]

The implementation here is different — it's using a list comprehension⁹ rather than a
normal for loop — but it’s accomplishing the same thing. The more important distinction
happens when we use the method; instead of needing a for loop or list comprehension
each time, we can just write:
samples = distribution.sample_n(100)

The meaning of this line of code — and the programmer's intention behind it — are
immediately clear at a glance.
Of course, this example is pretty limited. The list comprehension to build a list with 100
samples is a bit more complicated than just calling sample_n(100), but not by much — it's
still perfectly readable at a glance. This pattern of implementing general-purpose func-
tions on our abstractions becomes a lot more useful as the functions themselves become
more complicated.

⁹ List comprehensions are a Python feature to build lists by looping over something. The simplest
pattern is the same as writing a for loop:

foo = [expr for x in xs]
# is the same as:
foo = []
for x in xs:
    foo += [expr]

List comprehensions can combine multiple lists, acting like nested for loops:

>>> [(x, y) for x in range(3) for y in range(2)]
[(0, 0), (0, 1), (1, 0), (1, 1), (2, 0), (2, 1)]

They can also have if clauses to only keep elements that match a condition:

>>> [x for x in range(10) if x % 2 == 0]
[0, 2, 4, 6, 8]

Some combination of for and if clauses can let us build surprisingly complicated lists! Comprehensions
will often be easier to read than loops — a loop could be doing anything, a comprehension is always
creating a list — but it's always a judgement call. A couple of nested for loops might be easier to read
than a sufficiently convoluted comprehension!
However, there is another advantage to defining methods like sample_n: some kinds of
distributions might have more efficient or more accurate ways to implement the same logic.
If that’s the case, we would override sample_n to use the better implementation. Code
that uses sample_n would automatically benefit; code that used a loop or comprehension
instead would not. For example, this happens if we implement a distribution by wrapping
a function from numpy’s random module:
import numpy as np


@dataclass
class Gaussian(Distribution[float]):
    μ: float
    σ: float

    def sample(self) -> float:
        return np.random.normal(loc=self.μ, scale=self.σ)

    def sample_n(self, n: int) -> Sequence[float]:
        return np.random.normal(loc=self.μ, scale=self.σ, size=n)

numpy is optimized for array operations, which means that there is an up-front cost to
calling numpy.random.normal the first time, but it can quickly generate additional samples
after that. The performance impact is significant¹⁰:
>>> d = Gaussian(μ=0, σ=1)
>>> timeit.timeit(lambda: [d.sample() for _ in range(100)])
293.33819171399955
>>> timeit.timeit(lambda: d.sample_n(100))
5.566364272999635

That’s a 53× difference!


The code for Distribution and several concrete classes implementing the Distribution
interface (including Gaussian) is in the file rl/distribution.py.

0.13. Abstracting over Computation


So far, we’ve seen how we can build up a programming model for our domain by defining
interfaces (like Distribution) and writing classes (like Die) that implement these inter-
faces. Classes and interfaces give us a way to model the “things” in our domain, but, in
an area like Reinforcement Learning, “things” aren’t enough: we also want some way to
abstract over the actions that we’re taking or the computation that we’re performing.
¹⁰ This code uses the timeit module from Python's standard library, which provides a simple way to
benchmark small bits of Python code. By default, it measures how long it takes to execute 1,000,000
iterations of the function in seconds, so the two examples here took 0.293ms and 0.006ms respectively.

Classes do give us one way to model behavior: methods. A common analogy is that
objects act as "nouns" and methods act as "verbs" — methods are the actions we can take
with an object. This is a useful capability that lets us abstract over doing the same kind
of action on different sorts of objects. Our sample_n method, for example, with its default
implementation, gives us two things:

1. If we implement a new type of distribution with a custom sample method, we get
   sample_n for free for that distribution.
2. If we implement a new type of distribution which has a way to get n samples faster
   than calling sample n times, we can override the method to use the faster algorithm.

If we made sample_n a normal function, we could get 1 but not 2; if we left sample_n as
an abstract method, we'd get 2 but not 1. Having a non-abstract method on the abstract
class gives us the best of both worlds.
So if methods are our “verbs,” what else do we need? While methods abstract over
actions, they do this indirectly — we can talk about objects as standalone values, but we can
only use methods on objects, with no way to talk about computation itself. Stretching the
metaphor with grammar, it’s like having verbs without infinitives or gerunds — we’d be
able to talk about “somebody skiing,” but not about “skiing” itself or somebody “planning
to ski!”
In this world, “nouns” (objects) are first-class citizens but “verbs” (methods) aren’t.
What it takes to be a “first-class” value in a programming language is a fuzzy concept;
a reasonable litmus test is whether we can pass a value to a function or store it in a data
structure. We can do this with objects, but it’s not clear what this would mean for methods.

0.13.1. First-Class Functions


If we didn’t have a first-class way to talk about actions (as opposed to objects), one way we
could work around this would be to represent functions as objects with a single method.
We’d be able to pass them around just like normal values and, when we needed to actually
perform the action or computation, we’d just call the object’s method. This solution shows
us that it makes sense to have a first-class way to work with actions, but it requires an extra
layer of abstraction (an object just to have a single method) which doesn’t add anything
substantial on its own while making our intentions less clear.
Luckily, we don’t have to resort to a one-method object pattern in Python because Python
has first-class functions: functions are already values that we can pass around and store,
without needing a separate wrapper object.
First-class functions give us a new way to abstract over computation. Methods let us
talk about the same kind of behavior for different objects; first-class functions let us do
something with different actions. A simple example might be repeating the same action n
times. Without an abstraction, we might do this with a for loop:

for _ in range(10):
    do_something()

Instead of writing a loop each time, we could factor this logic into a function that took n
and do_something as arguments:

from typing import Callable


def repeat(action: Callable, n: int):
    for _ in range(n):
        action()


repeat(do_something, 10)
repeat takes action and n as arguments, then calls action n times. action has the type
Callable which, in Python, covers functions as well as any other objects you can call with
the f() syntax. We can also specify the return type and arguments a Callable should have;
if we wanted the type of a function that took an int and a str as input and returned a bool,
we would write Callable[[int, str], bool].
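For instance, here is one way repeat itself could be annotated more precisely (a sketch, not
the earlier definition, assuming the action takes no arguments and returns nothing):

from typing import Callable


def repeat(action: Callable[[], None], n: int) -> None:
    # action is a function that takes no arguments and returns nothing
    for _ in range(n):
        action()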
The version with the repeat function makes our intentions clear in the code. A for loop
can do many different things, while repeat will always just repeat. It’s not a big difference
in this case — the for loop version is sufficiently easy to read that it’s not a big impediment
— but it becomes more important with complex or domain-specific logic.
Let’s take a look at the expected_value function we defined earlier:

def expected_value(d: Distribution[float], n: int) -> float:
    return statistics.mean(d.sample() for _ in range(n))

We had to restrict this function to Distribution[float] because it only makes sense to
take an expectation of a numeric outcome. But what if we have something else like a
coin flip? We would still like some way to understand the expectation of the distribution,
but to make that meaningful we'd need to have some mapping from outcomes ("heads"
or "tails") to numbers (for example, we could say "heads" is 1 and "tails" is 0). We
could do this by taking our Coin distribution, converting it to a Distribution[float] and
calling expected_value on that, but this might be inefficient and it's certainly awkward.
An alternative would be to provide the mapping as an argument to the expected_value
function:

def expected_value(
    d: Distribution[A],
    f: Callable[[A], float],
    n: int
) -> float:
    return statistics.mean(f(d.sample()) for _ in range(n))

The implementation of expected_value has barely changed — it’s the same mean calcu-
lation as previously, except we apply f to each outcome. This small change, however,
has made the function far more flexible: we can now call expected_value on any sort of
Distribution, not just Distribution[float].
Going back to our coin example, we could use it like this:

def payoff(coin: Coin) -> float:
    return 1.0 if coin == "heads" else 0.0


expected_value(coin_flip, payoff)

The payoff function maps outcomes to numbers and then we calculate the expected
value using that mapping.
We’ll see first-class functions used in a number of places throughout the book; the key
idea to remember is that functions are values that we can pass around or store just like any
other object.

Lambdas

payoff itself is a pretty reasonable function: it has a clear name and works as a standalone
concept. Often, though, we want to use a first-class function in some specific context where
giving the function a name is not needed or even distracting. Even in cases with reasonable

names like payoff, it might not be worth introducing an extra named function if it will only
be used in one place.
Luckily, Python gives us an alternative: lambda. Lambdas are function literals. We can
write 3.0 and get a number without giving it a name, and we can write a lambda expression
to get a function without giving it a name. Here’s the same example as with the payoff
function but using a lambda instead:

expected_value(coin_flip, lambda coin: 1.0 if coin == "heads" else 0.0)

The lambda expression here behaves exactly the same as def payoff did in the earlier
version. Note how the lambda is a single expression with no return — if you ever need
multiple statements in a function, you’ll have to use a def instead of a lambda. In practice,
lambdas are great for functions whose bodies are short expressions, but anything that’s
too long or complicated will read better as a standalone function def.

0.13.2. Iterative Algorithms


First-class functions give us an abstraction over individual computations: we can pass func-
tions around, give them inputs and get outputs, but the computation between the input
and the output is a complete black box. The caller of the function has no control over what
happens inside the function. This limitation can be a real problem!
One common scenario in Reinforcement Learning — and other areas in numerical pro-
gramming — is algorithms that iteratively converge to the correct result. We can run the
algorithm repeatedly to get more and more accurate results, but the improvements with
each iteration get progressively smaller. For example, we can approximate the square root
of a by starting with some initial guess x0 and repeatedly calculating xn+1:

$$x_{n+1} = \frac{x_n + \frac{a}{x_n}}{2}$$
At each iteration, xn+1 gets closer to the right answer by smaller and smaller steps. At
some point the change from xn to xn+1 becomes small enough that we decide to stop. In
Python, this logic might look something like this:

def sqrt(a: float) -> float:
    x = a / 2  # initial guess
    x_n = (x + (a / x)) / 2
    # keep iterating until successive guesses are within 0.01 of each other
    while abs(x_n - x) > 0.01:
        x = x_n
        x_n = (x + (a / x)) / 2
    return x_n

The hard coded 0.01 in the while loop should be suspicious. How do we know that 0.01
is the right stopping condition? How do we decide when to stop at all?
The trick with this question is that we can’t know when to stop when we’re implementing
a general-purpose function because the level of precision we need will depend on what
the result is used for! It’s the caller of the function that knows when to stop, not the author.
The first improvement we can make is to turn the 0.01 into an extra parameter:

def sqrt(a: float, threshold: float) -> float:
    x = a / 2  # initial guess
    x_n = (x + (a / x)) / 2
    while abs(x_n - x) > threshold:
        x = x_n
        x_n = (x + (a / x)) / 2
    return x_n

This is a definite improvement over a literal 0.01, but it’s still limited. We’ve provided an
extra parameter for how the function behaves, but the control is still fundamentally with
the function. The caller of the function might want to stop before the method converges if
it’s taking too many iterations or too much time, but there’s no way to do that by changing
the threshold parameter. We could provide additional parameters for all of these, but
we’d quickly end up with the logic for how to stop iteration requiring a lot more code
and complexity than the iterative algorithm itself! Even that wouldn’t be enough; if the
function isn’t behaving as expected in some specific application, we might want to print out
intermediate values or graph the convergence over time — so should we include additional
control parameters for that?
Then what do we do when we have n other iterative algorithms? Do we copy-paste the
same stopping logic and parameters into each one? We’d end up with a lot of redundant
code!

Iterators and Generators


This friction points to a conceptual distinction that our code does not support: what happens
at each iteration is logically separate from how we do the iteration, but the two are fully coupled
in our implementation. We need some way to abstract over iteration in some way that lets
us separate producing values iteratively from consuming them.
Luckily, Python provides powerful facilities for doing exactly this: iterators and genera-
tors. Iterators give us a way of consuming values and generators give us a way of producing
values.
You might not realize it, but chances are your Python code uses iterators all the time.
Python’s for loop uses an iterator under the hood to get each value it’s looping over —
this is how for loops work for lists, dictionaries, sets, ranges and even custom types. Try
it out:
for x in [3, 2, 1]: print(x)
for x in {3, 2, 1}: print(x)
for x in range(3): print(x)

Note how the iterator for the set ({3, 2, 1}) prints 1 2 3 rather than 3 2 1 — sets do
not preserve the order in which elements are added, so they iterate over elements in some
kind of internally defined order instead.
When we iterate over a dictionary, we will print the keys rather than the values because
that is the default iterator. To get values or key-value pairs we’d need to use the values
and items methods respectively, each of which returns a different kind of iterator over the
dictionary.
d = {'a': 1, 'b': 2, 'c': 3}
for k in d: print(k)
for v in d.values(): print(v)
for k, v in d.items(): print(k, v)

In each of these three cases we’re still looping over the same dictionary, we just get a dif-
ferent view each time—iterators give us the flexibility of iterating over the same structure
in different ways.
Iterators aren’t just for loops: they give us a first-class abstraction for iteration. We can
pass them into functions; for example, Python’s list function can convert any iterator into
a list. This is handy when we want to see the elements of specialized iterators if the iterator
itself does not print out its values:

>>> range(5)
range(0, 5)
>>> list(range(5))
[0, 1, 2, 3, 4]

Since iterators are first-class values, we can also write general-purpose iterator functions.
The Python standard library has a set of operations like this in the itertools module;
for example, itertools.takewhile lets us stop iterating as soon as some condition stops
holding:
>>> elements = [1, 3, 2, 5, 3]
>>> list(itertools.takewhile(lambda x: x < 5, elements))
[1, 3, 2]

Note how we converted the result of takewhile into a list — without that, we’d see that
takewhile returns some kind of opaque internal object that implements that iterator specif-
ically. This works fine — we can use the takewhile object anywhere we could use any other
iterator — but it looks a bit odd in the Python interpreter:
>>> itertools.takewhile(lambda x: x < 5, elements)
<itertools.takewhile object at 0x7f8e3baefb00>

Now that we’ve seen a few examples of how we can use iterators, how do we define
our own? In the most general sense, a Python Iterator is any object that implements
a __next__() method, but implementing iterators this way is pretty awkward. Luckily,
Python has a more convenient way to create an iterator by creating a generator using the
yield keyword. yield acts similar to return from a function, except instead of stopping
the function altogether, it outputs the yielded value to an iterator and pauses the function
until the yielded element is consumed by the caller.
This is a bit of an abstract description, so let’s look at how this would apply to our sqrt
function. Instead of looping and stopping based on some condition, we’ll write a version
of sqrt that returns an iterator with each iteration of the algorithm as a value:
def sqrt(a: float) -> Iterator[float]:
    x = a / 2  # initial guess
    while True:
        x = (x + (a / x)) / 2
        yield x

With this version, we update x at each iteration and then yield the updated value. In-
stead of getting a single value, the caller of the function gets an iterator that contains an
infinite number of iterations; it is up to the caller to decide how many iterations to evalu-
ate and when to stop. The sqrt function itself has an infinite loop, but this isn’t a problem
because execution of the function pauses at each yield which lets the caller of the function
stop it whenever they want.
To do 10 iterations of the sqrt algorithm, we could use itertools.islice:
>>> iterations = list(itertools.islice(sqrt(25), 10))
>>> iterations[-1]
5.0

A fixed number of iterations can be useful for exploration, but we probably want the
threshold-based convergence logic we had earlier. Since we now have a first-class abstrac-
tion for iteration, we can write a general-purpose converge function that takes an iterator
and returns a version of that same iterator that stops as soon as two values are sufficiently
close. Python 3.10 and later provides itertools.pairwise which makes the code pretty
simple:

def converge(values: Iterator[float], threshold: float) -> Iterator[float]:
    for a, b in itertools.pairwise(values):
        yield a
        if abs(a - b) < threshold:
            break

For older versions of Python, we’d have to implement our version of pairwise as well:
def pairwise(values: Iterator[A]) -> Iterator[Tuple[A, A]]:
    a = next(values, None)
    if a is None:
        return
    for b in values:
        yield (a, b)
        a = b

Both of these follow a common pattern with iterators: each function takes an iterator
as an input and returns an iterator as an output. This doesn’t always have to be the case,
but we get a major advantage when it is: iterator → iterator operations compose. We can
get relatively complex behavior by starting with an iterator (like our sqrt example) then
applying multiple operations to it. For example, somebody calling sqrt might want to
converge at some threshold but, just in case the algorithm gets stuck for some reason, also
have a hard stop at 10,000 iterations. We don’t need to write a new version of sqrt or even
converge to do this; instead, we can use converge with itertools.islice:

results = converge(sqrt(n), 0.001)
capped_results = itertools.islice(results, 10000)

This is a powerful programming style because we can write and test each operation —
sqrt, converge, islice — in isolation and get complex behavior by combining them in the
right way. If we were writing the same logic without iterators, we would need a single loop
that calculated each step of sqrt, checked for convergence and kept a counter to stop after
10,000 steps — and we’d need to replicate this pattern for every single such algorithm!
Iterators and generators will come up all throughout this book because they provide a
programming abstraction for processes, making them a great foundation for the mathemat-
ical processes that underlie Reinforcement Learning.

0.14. Key Takeaways


• The code in this book is designed not only to illustrate the topics and algorithms
covered in each chapter, but also to demonstrate some general principles of code de-
sign. Paying attention to code design gives us a codebase that’s easier to understand,
debug and extend.
• Code design is centered around abstractions. An abstraction is a code construct that
combines separate pieces of information or unifies distinct concepts into one. Pro-
gramming languages like Python provide different tools for expressing abstractions,
but the goal is always the same: make the code easier to understand.
• Classes and interfaces (abstract classes in Python) define abstractions over data.
Multiple implementations of the same concept can have the same interface (like
Distribution), and code written using that interface will work for objects of any com-
patible class. Traditional Python classes have some inconvenient limitations which
can be avoided by using dataclasses (classes with the @dataclass decorator).

• Python has type annotations which are required for dataclasses but are also useful
for describing interfaces in functions and methods. Additional tools like mypy or
PyCharm can use these type annotations to catch errors without needing to run the
code.
• Functions are first-class values in Python, meaning that they can be stored in vari-
ables and passed to other functions as arguments. Classes abstract over data; func-
tions abstract over computation.
• Iterators abstract over iteration: computations that happen in sequence, producing
a value after each iteration. Reinforcement learning focuses primarily on iterative
algorithms, so iterators become one of the key abstractions for working with different
reinforcement learning algorithms.

Part I.

Processes and Planning Algorithms

1. Markov Processes
This book is about “Sequential Decisioning under Sequential Uncertainty.” In this chap-
ter, we will ignore the “sequential decisioning” aspect and focus just on the “sequential
uncertainty” aspect.

1.1. The Concept of State in a Process


For a gentle introduction to the concept of State, we start with an informal notion of the
terms Process and State (this will be formalized later in this chapter). Informally, think of
a Process as producing a sequence of random outcomes at discrete time steps that we’ll
index by a time variable t = 0, 1, 2, . . .. The random outcomes produced by a process
might be key financial/trading/business metrics one cares about, such as prices of financial
derivatives or the value of a portfolio held by an investor. To understand and reason about
the evolution of these random outcomes of a process, it is beneficial to focus on the internal
representation of the process at each point in time t, that is fundamentally responsible for
driving the outcomes produced by the process. We refer to this internal representation
of the process at time t as the (random) state of the process at time t and denote it as St .
Specifically, we are interested in the probability of the next state St+1 , given the present
state St and the past states S0 , S1 , . . . , St−1 , i.e., P[St+1 |St , St−1 , . . . , S0 ]. So to clarify, we
distinguish between the internal representation (state) and the output (outcomes) of the
Process. The state could be any data type - it could be something as simple as the daily
closing price of a single stock, or it could be something quite elaborate like the number of
shares of each publicly traded stock held by each bank in the U.S., as noted at the end of
each week.

1.2. Understanding Markov Property from Stock Price Examples


We will be learning about Markov Processes in this chapter and these processes have States
that possess a property known as the Markov Property. So we will now learn about the
Markov Property of States. Let us develop some intuition for this property with some exam-
ples of random evolution of stock prices over time.
To aid with the intuition, let us pretend that stock prices take on only integer values and
that it’s acceptable to have zero or negative stock prices. Let us denote the stock price at
time t as Xt . Let us assume that from time step t to the next time step t + 1, the stock price
can either go up by 1 or go down by 1, i.e., the only two outcomes for Xt+1 are Xt + 1 or
Xt − 1. To understand the random evolution of the stock prices in time, we just need to
quantify the probability of an up-move P[Xt+1 = Xt + 1] since the probability of a down-
move P[Xt+1 = Xt − 1] = 1 − P[Xt+1 = Xt + 1]. We will consider 3 different processes
for the evolution of stock prices. These 3 processes will prescribe P[Xt+1 = Xt + 1] in 3
different ways.
Process 1:

$$P[X_{t+1} = X_t + 1] = \frac{1}{1 + e^{-\alpha_1 (L - X_t)}}$$
Figure 1.1.: Logistic Curves

where L is an arbitrary reference level and α1 ∈ R≥0 is a “pull strength” parameter. Note
that this probability is defined as a logistic function of L − Xt with the steepness of the
logistic function controlled by the parameter α1 (see Figure 1.1).
The way to interpret this logistic function of L − Xt is that if Xt is greater than the
reference level L (making P[Xt+1 = Xt + 1] < 0.5), then there is more of a down-pull than
an up-pull. Likewise, if Xt is less than L, then there is more of an up-pull. The extent of
the pull is controlled by the magnitude of the parameter α1 . We refer to this behavior as
mean-reverting behavior, meaning the stock price tends to revert to the “mean” (i.e., to the
reference level L).
We can model the state St = Xt and note that the probabilities of the next state St+1
depend only on the current state St and not on the previous states S0 , S1 , . . . , St−1 . In-
formally, we phrase this property as: “The future is independent of the past given the
present.” Formally, we can state this property of the states as:

P[St+1 |St , St−1 , . . . , S0 ] = P[St+1 |St ] for all t ≥ 0

This is a highly desirable property since it helps make the mathematics of such processes
much easier and the computations much more tractable. We call this the Markov Property
of States, or simply that these are Markov States.
Let us now code this up. First, we create a dataclass to represent the dynamics of this
process. As you can see in the code below, the dataclass Process1 contains two attributes
level_param: int and alpha1: float = 0.25 to represent L and α1 respectively. It contains
the method up_prob to calculate P[Xt+1 = Xt + 1] and the method next_state, which
samples from a Bernoulli distribution (whose probability is obtained from the method
up_prob) and creates the next state St+1 from the current state St . Also, note the nested
dataclass State meant to represent the state of Process 1 (its only attribute price: int
reflects the fact that the state consists of only the current price, which is an integer).

import numpy as np
from dataclasses import dataclass


@dataclass
class Process1:
    @dataclass
    class State:
        price: int

    level_param: int  # level to which price mean-reverts
    alpha1: float = 0.25  # strength of mean-reversion (non-negative value)

    def up_prob(self, state: State) -> float:
        return 1. / (1 + np.exp(-self.alpha1 * (self.level_param - state.price)))

    def next_state(self, state: State) -> State:
        up_move: int = np.random.binomial(1, self.up_prob(state), 1)[0]
        return Process1.State(price=state.price + up_move * 2 - 1)

Next, we write a simple simulator using Python’s generator functionality (using yield)
as follows:
def simulation(process, start_state):
    state = start_state
    while True:
        yield state
        state = process.next_state(state)

Now we can use this simulator function to generate sampling traces. In the following
code, we generate num_traces number of sampling traces over time_steps number of time
steps starting from a price X0 of start_price. The use of Python’s generator feature lets
us do this “lazily” (on-demand) using the itertools.islice function.
import itertools


def process1_price_traces(
    start_price: int,
    level_param: int,
    alpha1: float,
    time_steps: int,
    num_traces: int
) -> np.ndarray:
    process = Process1(level_param=level_param, alpha1=alpha1)
    start_state = Process1.State(price=start_price)
    return np.vstack([
        np.fromiter((s.price for s in itertools.islice(
            simulation(process, start_state),
            time_steps + 1
        )), float) for _ in range(num_traces)])

The entire code is in the file rl/chapter2/stock_price_simulations.py. We encourage
you to play with this code with different start_price, level_param, alpha1, time_steps,
num_traces. You can plot graphs of sampling traces of the stock price, or plot graphs of
the terminal distributions of the stock price at various time points (both of these plotting
functions are made available for you in this code).
Now let us consider a different process.
Process 2:

$$P[X_{t+1} = X_t + 1] = \begin{cases} 0.5(1 - \alpha_2 (X_t - X_{t-1})) & \text{if } t > 0 \\ 0.5 & \text{if } t = 0 \end{cases}$$
where α2 is a “pull strength” parameter in the closed interval [0, 1]. The intuition here
is that the direction of the next move Xt+1 − Xt is biased in the reverse direction of the
previous move Xt − Xt−1 and the extent of the bias is controlled by the parameter α2 .

We note that if we model the state St as Xt, we won't satisfy the Markov Property because
the probabilities of Xt+1 depend on not just Xt but also on Xt−1. However, we can perform
a little trick here and create an augmented state St consisting of the pair (Xt, Xt − Xt−1).
In case t = 0, the state S0 can assume the value (X0, Null) where Null is just a symbol
denoting the fact that there have been no stock price movements thus far. With the state
St as this pair (Xt, Xt − Xt−1), we can see that the Markov Property is indeed satisfied:

$$P[(X_{t+1}, X_{t+1} - X_t)|(X_t, X_t - X_{t-1}), (X_{t-1}, X_{t-1} - X_{t-2}), \ldots, (X_0, Null)]$$
$$= P[(X_{t+1}, X_{t+1} - X_t)|(X_t, X_t - X_{t-1})] = 0.5(1 - \alpha_2 (X_{t+1} - X_t)(X_t - X_{t-1}))$$


It would be natural to wonder why the state doesn't consist of simply Xt − Xt−1; in
other words, why is Xt also required to be part of the state? It is true that knowledge of
simply Xt − Xt−1 fully determines the probabilities of Xt+1 − Xt. So if we set the state to be
simply Xt − Xt−1 at any time step t, then we do get a Markov Process with just two states +1
and -1 (with probability transitions between them). However, this simple Markov Process
doesn't tell us what the value of the stock price Xt is by looking at the state Xt − Xt−1 at time
t. In this application, we not only care about Markovian state transition probabilities, but
we also care about knowledge of the stock price at any time t from knowledge of the state at
time t. Hence, we model the state as the pair (Xt, Xt − Xt−1).
Note that if we had modeled the state St as the entire stock price history (X0 , X1 , . . . , Xt ),
then the Markov Property would be satisfied trivially. The Markov Property would also be
satisfied if we had modeled the state St as the pair (Xt , Xt−1 ) for t > 0 and S0 as (X0 , N ull).
However, our choice of St := (Xt , Xt − Xt−1 ) is the “simplest/minimal” internal represen-
tation. In fact, in this entire book, our endeavor in modeling states for various processes is
to ensure the Markov Property with the “simplest/minimal” representation for the state.
The corresponding dataclass for Process 2 is shown below:
from typing import Mapping, Optional

handy_map: Mapping[Optional[bool], int] = {True: -1, False: 1, None: 0}


@dataclass
class Process2:
    @dataclass
    class State:
        price: int
        is_prev_move_up: Optional[bool]

    alpha2: float = 0.75  # strength of reverse-pull (value in [0,1])

    def up_prob(self, state: State) -> float:
        return 0.5 * (1 + self.alpha2 * handy_map[state.is_prev_move_up])

    def next_state(self, state: State) -> State:
        up_move: int = np.random.binomial(1, self.up_prob(state), 1)[0]
        return Process2.State(
            price=state.price + up_move * 2 - 1,
            is_prev_move_up=bool(up_move)
        )

The code for generation of sampling traces of the stock price is almost identical to the
code we wrote for Process 1.
def process2_price_traces(
    start_price: int,
    alpha2: float,
    time_steps: int,
    num_traces: int
) -> np.ndarray:
    process = Process2(alpha2=alpha2)
    start_state = Process2.State(price=start_price, is_prev_move_up=None)
    return np.vstack([
        np.fromiter((s.price for s in itertools.islice(
            simulation(process, start_state),
            time_steps + 1
        )), float) for _ in range(num_traces)])

Figure 1.2.: Unit-Sigmoid Curves

Now let us look at a more complicated process.


Process 3: This is an extension of Process 2 where the probability of next movement
depends not only on the last movement, but on all past movements. Specifically, it depends
on the number of past up-moves (call it $U_t = \sum_{i=1}^t \max(X_i - X_{i-1}, 0)$) relative to
the number of past down-moves (call it $D_t = \sum_{i=1}^t \max(X_{i-1} - X_i, 0)$) in the following
manner:

$$P[X_{t+1} = X_t + 1] = \begin{cases} \frac{1}{1 + (\frac{U_t + D_t}{D_t} - 1)^{\alpha_3}} & \text{if } t > 0 \\ 0.5 & \text{if } t = 0 \end{cases}$$

where α3 ∈ R≥0 is a "pull strength" parameter. Let us view the above probability expression as:

$$f\left(\frac{D_t}{U_t + D_t}; \alpha_3\right)$$

where f : [0, 1] → [0, 1] is a sigmoid-shaped function

$$f(x; \alpha) = \frac{1}{1 + (\frac{1}{x} - 1)^{\alpha}}$$

whose steepness at x = 0.5 is controlled by the parameter α (note: values of α < 1 will
produce an inverse sigmoid as seen in Figure 1.2, which shows unit-sigmoid functions f
for different values of α).
The probability of the next up-movement is fundamentally dependent on the quantity
$\frac{D_t}{U_t + D_t}$ (the function f simply serves to control the extent of the "reverse pull").
$\frac{D_t}{U_t + D_t}$ is the fraction of past time steps when there was a down-move. So, if the
number of down-moves in history is greater than the number of up-moves in history, then there will be more of an
up-pull than a down-pull for the next price movement Xt+1 − Xt (likewise, the other way
round when Ut > Dt). The extent of this "reverse pull" is controlled by the "pull strength"
parameter α3 (governed by the sigmoid-shaped function f).
Again, note that if we model the state St as Xt, we won't satisfy the Markov Property
because the probabilities of the next state St+1 = Xt+1 depend on the entire history of stock
price moves and not just on the current state St = Xt. However, we can again do something
clever and create a compact enough state St consisting of simply the pair (Ut, Dt). With
this representation for the state St, the Markov Property is indeed satisfied:

$$P[(U_{t+1}, D_{t+1})|(U_t, D_t), (U_{t-1}, D_{t-1}), \ldots, (U_0, D_0)] = P[(U_{t+1}, D_{t+1})|(U_t, D_t)]$$

$$= \begin{cases} f\left(\frac{D_t}{U_t + D_t}; \alpha_3\right) & \text{if } U_{t+1} = U_t + 1, D_{t+1} = D_t \\ f\left(\frac{U_t}{U_t + D_t}; \alpha_3\right) & \text{if } U_{t+1} = U_t, D_{t+1} = D_t + 1 \end{cases}$$
It is important to note that unlike Processes 1 and 2, the stock price Xt is actually not part of
the state St in Process 3. This is because Ut and Dt together contain sufficient information
to capture the stock price Xt (since Xt = X0 + Ut − Dt , and noting that X0 is provided as
a constant).
The corresponding dataclass for Process 3 is shown below:
@dataclass
class Process3:
    @dataclass
    class State:
        num_up_moves: int
        num_down_moves: int

    alpha3: float = 1.0  # strength of reverse-pull (non-negative value)

    def up_prob(self, state: State) -> float:
        total = state.num_up_moves + state.num_down_moves
        if total == 0:
            return 0.5
        elif state.num_down_moves == 0:
            return state.num_down_moves ** self.alpha3
        else:
            return 1. / (1 + (total / state.num_down_moves - 1) ** self.alpha3)

    def next_state(self, state: State) -> State:
        up_move: int = np.random.binomial(1, self.up_prob(state), 1)[0]
        return Process3.State(
            num_up_moves=state.num_up_moves + up_move,
            num_down_moves=state.num_down_moves + 1 - up_move
        )

The code for generation of sampling traces of the stock price is shown below:
def process3_price_traces(
    start_price: int,
    alpha3: float,
    time_steps: int,
    num_traces: int
) -> np.ndarray:
    process = Process3(alpha3=alpha3)
    start_state = Process3.State(num_up_moves=0, num_down_moves=0)
    return np.vstack([
        np.fromiter((start_price + s.num_up_moves - s.num_down_moves
                     for s in itertools.islice(simulation(process, start_state),
                                               time_steps + 1)), float)
        for _ in range(num_traces)])

Figure 1.3.: Single Sampling Trace

As suggested for Process 1, you can plot graphs of sampling traces of the stock price,
or plot graphs of the probability distributions of the stock price at various terminal time
points T for Processes 2 and 3, by playing with the code in rl/chapter2/stock_price_simulations.py.
Figure 1.3 shows a single sampling trace of stock prices for each of the 3 processes. Figure
1.4 shows the probability distribution of the stock price at terminal time T = 100 over 1000
traces.
Having developed the intuition for the Markov Property of States, we are now ready
to formalize the notion of Markov Processes (some of the literature refers to Markov Pro-
cesses as Markov Chains, but we will stick with the term Markov Processes).

1.3. Formal Definitions for Markov Processes


Our formal definitions in this book will be restricted to Discrete-Time Markov Processes,
where time moves forward in discrete time steps t = 0, 1, 2, . . .. Also for ease of exposition,
our formal definitions in this book will be restricted to sets of states that are countable.
A countable set can be either a finite set or an infinite set of the same cardinality as the
set of natural numbers, i.e., a set that is enumerable. This book will cover examples of
continuous-time Markov Processes, where time is a continuous variable.¹ This book will
also cover examples of sets of states that are uncountable.² However, for ease of exposition,
our definitions and development of the theory in this book will be restricted to discrete-
time and countable sets of states. The definitions and theory can be analogously extended
to continuous-time or to uncountable sets of states (we request you to self-adjust the defi-
nitions and theory accordingly when you encounter continuous-time or uncountable sets
of states in this book).
¹ Markov Processes in continuous-time often go by the name Stochastic Processes, and their calculus
goes by the name Stochastic Calculus (see Appendix C).
² Uncountable sets are those with cardinality larger than the set of natural numbers, e.g., the set of
real numbers, which are not enumerable.

Figure 1.4.: Terminal Distribution

Definition 1.3.1. A Markov Process consists of:

• A countable set of states S (known as the State Space) and a set T ⊆ S (known as
the set of Terminal States)

• A time-indexed sequence of random states St ∈ S for time steps t = 0, 1, 2, . . ., with
  each state transition satisfying the Markov Property: P[St+1|St, St−1, . . . , S0] =
  P[St+1|St] for all t ≥ 0.

• Termination: If an outcome for ST (for some time step T ) is a state in the set T , then
this sequence outcome terminates at time step T .

We refer to P[St+1 |St ] as the transition probabilities for time t.

Definition 1.3.2. A Time-Homogeneous Markov Process is a Markov Process with the addi-
tional property that P[St+1 |St ] is independent of t.

This means, the dynamics of a Time-Homogeneous Markov Process can be fully speci-
fied with the function

P : (S − T ) × S → [0, 1]

defined as:

P(s, s′) = P[St+1 = s′|St = s] for time steps t = 0, 1, 2, . . ., for all s ∈ S − T , s′ ∈ S

such that

$$\sum_{s' \in S} P(s, s') = 1 \text{ for all } s \in S - T$$

We refer to the function P as the transition probability function of the Time-Homogeneous
Markov Process, with the first argument to P thought of as the "source" state and
the second argument as the "destination" state.
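To make the notation concrete, here is a tiny, purely illustrative way of representing such a
transition probability function for a finite state space in Python (a hypothetical two-state
example; the data structures this book actually uses for finite Markov Processes come later):

from typing import Mapping

# Hypothetical 2-state process: each inner mapping is P(s, .) over destination states
transition_map: Mapping[str, Mapping[str, float]] = {
    "Sunny": {"Sunny": 0.9, "Rainy": 0.1},
    "Rainy": {"Sunny": 0.4, "Rainy": 0.6}
}

# each row must sum to 1, as required by the definition above
assert all(abs(sum(row.values()) - 1.0) < 1e-8 for row in transition_map.values())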

Note that the arguments to P in the above specification are devoid of the time index t
(hence, the term Time-Homogeneous which means “time-invariant”). Moreover, note that a
Markov Process that is not time-homogeneous can be converted to a Time-Homogeneous
Markov Process by augmenting all states with the time index t. This means if the original
state space of a Markov Process that is not time-homogeneous is S, then the state space of
the corresponding Time-Homogeneous Markov Process is Z≥0 × S (where Z≥0 denotes
the domain of the time index). This is because each time step has its own unique set of
(augmented) states, which means the entire set of states in Z≥0 × S can be covered by
time-invariant transition probabilities, thus qualifying as a Time-Homogeneous Markov
Process. Therefore, henceforth, any time we say Markov Process, assume we are referring
to a Discrete-Time, Time-Homogeneous Markov Process with a Countable State Space (unless
explicitly specified otherwise), which in turn will be characterized by the transition prob-
ability function P. Note that the stock price examples (all 3 of the Processes we covered)
are examples of a (Time-Homogeneous) Markov Process, even without requiring aug-
menting the state with the time index.
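To spell out the augmentation with a small equation (consistent with the reasoning above):
if the original process has transition probabilities P[St+1 = s′|St = s] that depend on the time
index t, then the augmented process over states (t, s) has the time-invariant transition
probability function

$$P((t, s), (t', s')) = \begin{cases} P[S_{t+1} = s' | S_t = s] & \text{if } t' = t + 1 \\ 0 & \text{otherwise} \end{cases}$$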
The classical definitions and theory of Markov Processes model “termination” with the
idea of Absorbing States. A state s is called an absorbing state if P(s, s) = 1. This means,
once we reach an absorbing state, we are “trapped” there, hence capturing the notion of
“termination.” So the classical definitions and theory of Markov Processes typically don’t
include an explicit specification of states as terminal and non-terminal. However, when
we get to Markov Reward Processes and Markov Decision Processes (frameworks that are
extensions of Markov Processes), we will need to explicitly specify states as terminal and
non-terminal states, rather than model the notion of termination with absorbing states.
So, for consistency in definitions and in the development of the theory, we are going with
a framework where states in a Markov Process are explicitly specified as terminal or non-
terminal states. We won’t consider an absorbing state as a terminal state as the Markov
Process keeps moving forward in time forever when it gets to an absorbing state. We will
refer to S − T as the set of Non-Terminal States N (and we will refer to a state in N as a
non-terminal state). The sequence S0 , S1 , S2 , . . . terminates at time step t = T if ST ∈ T .

1.3.1. Starting States


Now it’s natural to ask the question: How do we “start” the Markov Process (in the stock
price examples, this was the notion of the start state)? More generally, we’d like to spec-
ify a probability distribution of start states so we can perform simulations and (let’s say)
compute the probability distribution of states at specific future time steps. While this is a
relevant question, we’d like to separate the following two specifications:

• Specification of the transition probability function P


• Specification of the probability distribution of start states (denote this as µ : N →
[0, 1])

We say that a Markov Process is fully specified by P in the sense that this gives us the
transition probabilities that govern the complete dynamics of the Markov Process. A way
to understand this is to relate specification of P to the specification of rules in a game (such
as chess or monopoly). These games are specified with a finite (in fact, fairly compact)
set of rules that is easy for a newbie to the game to understand. However, when we want
to actually play the game, we need to specify the starting position (one could start these
games at arbitrary, but legal, starting positions and not just at some canonical starting
position). The specification of the start state of the game is analogous to the specification

of µ. Specifying µ together with P enables us to generate sampling traces of the Markov Process
(analogously, play games like chess or monopoly). These sampling traces typically result
in a wide range of outcomes due to sampling and the long-running nature of the Markov Process
(in contrast to the compact specification of transition probabilities). These sampling traces enable
us to answer questions such as the probability distribution of states at specific future time
steps, or the expected time of first occurrence of a specific state, given a certain starting
probability distribution µ.
Thinking about the separation between specifying the rules of the game versus actually
playing the game helps us understand the need to separate the notion of dynamics speci-
fication P (fundamental to the time-homogeneous character of the Markov Process) and
the notion of starting distribution µ (required to perform sampling traces). Hence, the
separation of concerns between P and µ is key to the conceptualization of Markov Pro-
cesses. Likewise, we separate the concerns in our code design as well, as evidenced by
how we separated the next_state method in the Process dataclasses and the simulation
function.

1.3.2. Terminal States


Games are examples of Markov Processes that terminate at specific states (based on rules
for winning or losing the game). In general, in a Markov Process, termination might occur
after a variable number of time steps (like in the games examples), or like we will see in
many financial application examples, termination might occur after a fixed number of time
steps, or like in the stock price examples we saw earlier, there is in fact no termination.
If all random sequences of states (sampling traces) reach a terminal state, then we say
that these random sequences of the Markov Process are Episodic (otherwise, we call them
Continuing sequences). The notion of episodic sequences is important in Reinforcement
Learning since some Reinforcement Learning algorithms require episodic sequences.
When we cover some of the financial applications later in this book, we will find that the
Markov Process terminates after a fixed number of time steps, say T . In these applications,
the time index t is part of the state representation, each state with time index t = T is
labeled a terminal state, and all states with time index t < T will transition to states with
time index t + 1.
Now we are ready to write some code for Markov Processes, where we illustrate how to
specify that certain states are terminal states.

1.3.3. Markov Process Implementation


The first thing we do is to create separate classes for non-terminal states N and terminal
states T , with an abstract base class for all states S (terminal or non-terminal). In the code
below, the abstract base class (ABC) State represents S. The class State is parameterized
by a generic type (TypeVar('S')) representing a generic state space Generic[S].
The concrete class Terminal represents T and the concrete class NonTerminal represents
N . The method on_non_terminal will prove to be very beneficial in the implementation
of various algorithms we shall be writing for Markov Processes and also for Markov Re-
ward Processes and Markov Decision Processes (which are extensions of Markov Pro-
cesses). The method on_non_terminal enables us to calculate a value for all states in S
even though the calculation is defined only for all non-terminal states N . The argument f
to on_non_terminal defines this value through a function from N to an arbitrary value-
type X. The argument default provides the default value for terminal states T so that
on_non_terminal can be used on any object in State (i.e. for any state in S, terminal or non-
terminal). As an example, let’s say you want to calculate the expected number of states
one would traverse after a certain state and before hitting a terminal state. Clearly, this cal-
culation is well-defined for non-terminal states and the function f would implement this
by either some kind of analytical method or by sampling state-transition sequences and
averaging the counts of non-terminal states traversed across those sequences. By defining
(defaulting) this value to be 0 for terminal states, we can then invoke such a calculation for
all states S, terminal or non-terminal, and embed this calculation in an algorithm without
worrying about special handling in the code for the edge case of being a terminal state.

from __future__ import annotations

from abc import ABC
from dataclasses import dataclass
from typing import Callable, Generic, TypeVar

S = TypeVar('S')
X = TypeVar('X')
class State(ABC, Generic[S]):
state: S
def on_non_terminal(
self,
f: Callable[[NonTerminal[S]], X],
default: X
) -> X:
if isinstance(self, NonTerminal):
return f(self)
else:
return default
@dataclass(frozen=True)
class Terminal(State[S]):
state: S
@dataclass(frozen=True)
class NonTerminal(State[S]):
state: S
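To make on_non_terminal concrete, here is a minimal usage sketch (the state values and the function f below are made up purely for illustration; we assume these classes can be imported from rl/markov_process.py, the file the footnotes point to):

from rl.markov_process import NonTerminal, State, Terminal

# made-up example states, purely for illustration
s1: State[str] = NonTerminal("Sunny")
s2: State[str] = Terminal("Closed")

# f is defined only for non-terminal states; terminal states fall back to default
f = lambda nt: len(nt.state)
print(s1.on_non_terminal(f, default=0))  # prints 5
print(s2.on_non_terminal(f, default=0))  # prints 0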

Now we are ready to write a class to represent Markov Processes. We create an ab-
stract class MarkovProcess parameterized by a generic type (TypeVar('S')) representing a
generic state space Generic[S]. The abstract class has an abstract method called transition
that is meant to specify the transition probability distribution of next states, given a current
non-terminal state. We know that transition is well-defined only for non-terminal states
and hence, its argument is clearly type-annotated as NonTerminal[S]. The return type of
transition is Distribution[State[S]], which as we know from the Chapter on Program-
ming and Design, represents the probability distribution of next states. We also have a
method simulate that enables us to generate an Iterable (generator) of sampled states,
given as input a start_state_distribution: Distribution[NonTerminal[S]] (from which
we sample the starting state). The sampling of next states relies on the implementation
of the sample method for the Distribution[State[S]] object produced by the transition
method.
Here's the full body of the abstract class MarkovProcess (defined in the file rl/markov_process.py):

from abc import abstractmethod


from rl.distribution import Distribution
from typing import Iterable
class MarkovProcess(ABC, Generic[S]):
@abstractmethod
def transition(self, state: NonTerminal[S]) -> Distribution[State[S]]:
pass
def simulate(
self,
start_state_distribution: Distribution[NonTerminal[S]]
) -> Iterable[State[S]]:
state: State[S] = start_state_distribution.sample()
yield state
while isinstance(state, NonTerminal):
state = self.transition(state).sample()
yield state

1.4. Stock Price Examples modeled as Markov Processes


So if you have a mathematical specification of the transition probabilities of a Markov Pro-
cess, all you need to do is to create a concrete class that implements the interface of the ab-
stract class MarkovProcess (specifically by implementing the abstract method transition)
in a manner that captures your mathematical specification of the transition probabilities.
Let us write this for the case of Process 3 (the 3rd example of stock price transitions we
covered earlier). We name the concrete class as StockPriceMP3. Note that the generic state
space S is now replaced with a specific state space represented by the type @dataclass
StateMP3. The code should be self-explanatory since we implemented this process as a
standalone in the previous section. Note the use of the Categorical distribution in the
transition method to capture the two-outcome probability distribution of next states (for
movements up or down).
from rl.distribution import Categorical
from rl.gen_utils.common_funcs import get_unit_sigmoid_func
@dataclass
class StateMP3:
num_up_moves: int
num_down_moves: int
@dataclass
class StockPriceMP3(MarkovProcess[StateMP3]):
alpha3: float = 1.0 # strength of reverse-pull (non-negative value)
def up_prob(self, state: StateMP3) -> float:
total = state.num_up_moves + state.num_down_moves
return get_unit_sigmoid_func(self.alpha3)(
state.num_down_moves / total
) if total else 0.5
def transition(
self,
state: NonTerminal[StateMP3]
) -> Categorical[State[StateMP3]]:
up_p = self.up_prob(state.state)
return Categorical({
NonTerminal(StateMP3(
state.state.num_up_moves + 1, state.state.num_down_moves
)): up_p,
NonTerminal(StateMP3(
state.state.num_up_moves, state.state.num_down_moves + 1
)): 1 - up_p
})

To generate sampling traces, we write the following function:

from rl.distribution import Constant
import itertools
import numpy as np
def process3_price_traces(
start_price: int,
alpha3: float,
time_steps: int,
num_traces: int
) -> np.ndarray:
mp = StockPriceMP3(alpha3=alpha3)
start_state_distribution = Constant(
NonTerminal(StateMP3(num_up_moves=0, num_down_moves=0))
)
return np.vstack([np.fromiter(
(start_price + s.state.num_up_moves - s.state.num_down_moves for s in
itertools.islice(
mp.simulate(start_state_distribution),
time_steps + 1
)),
float
) for _ in range(num_traces)])

We leave it to you as an exercise to similarly implement Stock Price Processes 1 and 2
that we had covered earlier. The complete code along with the driver to set input param-
eters, run all 3 processes and create plots is in the file rl/chapter2/stock_price_mp.py. We
encourage you to change the input parameters in __main__ and get an intuitive feel for how
the simulation results vary with the changes in parameters.
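For instance, a quick run with some hypothetical input parameter values (not necessarily those in __main__) might look like this:

# hypothetical parameter values, purely for illustration
traces = process3_price_traces(
    start_price=100,
    alpha3=1.5,
    time_steps=100,
    num_traces=1000
)
# traces is a 2D numpy array: one row per sampling trace, one column per time step
print(traces.shape)  # (1000, 101)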

1.5. Finite Markov Processes


Now let us consider Markov Processes with a finite state space. So we can represent the
state space as S = {s1 , s2 , . . . , sn }. Assume the set of non-terminal states N has m ≤ n
states. Let us refer to Markov Processes with finite state spaces as Finite Markov Processes.
Since Finite Markov Processes are a subclass of Markov Processes, it would make sense to
create a concrete class FiniteMarkovProcess that implements the interface of the abstract
class MarkovProcess (specifically implement the abstract method transition). But first let’s
think about the data structure required to specify an instance of a FiniteMarkovProcess
(i.e., the data structure we’d pass to the __init__ method of FiniteMarkovProcess). One
choice is an m × n 2D numpy array representation, i.e., matrix elements representing tran-
sition probabilities
P : N × S → [0, 1]
However, we often find that this matrix can be sparse since one often transitions from a
given state to just a small set of states. So we'd like a sparse representation, and we can
accomplish this by conceptualizing P in an equivalent curried form as follows:

N → (S → [0, 1])

With this curried view, we can represent the outer → as a map (in Python, as a dictio-
nary of type Mapping) whose keys are the non-terminal states N , and each non-terminal-
state key maps to a FiniteDistribution[S] type that represents the inner →, i.e. a finite
probability distribution of the next states transitioned to from a given non-terminal state.
(Currying is the technique of converting a function that takes multiple arguments into a
sequence of functions that each takes a single argument, as illustrated above for the P function.)
Figure 1.5.: Weather Markov Process

Note that the FiniteDistribution[S] will only contain the set of states transitioned to
with non-zero probability. To make things concrete, here’s a toy Markov Process data
structure example of a city with highly unpredictable weather outcomes from one day
to the next (note: Categorical type inherits from FiniteDistribution type in the code at
rl/distribution.py):
from rl.distribution import Categorical
{
    "Rain": Categorical({"Rain": 0.3, "Nice": 0.7}),
    "Snow": Categorical({"Rain": 0.4, "Snow": 0.6}),
    "Nice": Categorical({"Rain": 0.2, "Snow": 0.3})
}

It is common to view this Markov Process representation as a directed graph, as depicted
in Figure 1.5. The nodes are the states and the directed edges are the probabilistic state
transitions, with the transition probabilities labeled on them.
Our goal now is to define a FiniteMarkovProcess class that is a concrete class imple-
mentation of the abstract class MarkovProcess. This requires us to wrap the states in the
keys/values of the FiniteMarkovProcess dictionary with the appropriate Terminal or NonTerminal
wrapping. Let’s create an alias called Transition for this wrapped dictionary data struc-
ture since we will use this wrapped data structure often:
from typing import Mapping
from rl.distribution import FiniteDistribution
Transition = Mapping[NonTerminal[S], FiniteDistribution[State[S]]]

To create a Transition data type from the above example of the weather Markov Process,
we’d need to wrap each of the “Rain,” “Snow” and “Nice” strings with NonTerminal.
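As a sketch (assuming the NonTerminal class and the Transition alias above, e.g. imported from rl/markov_process.py), the wrapped version of the weather transition map, with the same probabilities as above, would look like this:

from rl.distribution import Categorical
from rl.markov_process import NonTerminal, Transition  # assumed location, per the footnotes

weather_transition: Transition[str] = {
    NonTerminal("Rain"): Categorical({NonTerminal("Rain"): 0.3, NonTerminal("Nice"): 0.7}),
    NonTerminal("Snow"): Categorical({NonTerminal("Rain"): 0.4, NonTerminal("Snow"): 0.6}),
    NonTerminal("Nice"): Categorical({NonTerminal("Rain"): 0.2, NonTerminal("Snow"): 0.3})
}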

Now we are ready to write the code for the FiniteMarkovProcess class (defined in the file
rl/markov_process.py). The __init__
method (constructor) takes as argument a transition_map whose type is similar to Transition[S]
except that we use the S type directly in the Mapping representation instead of NonTerminal[S]
or State[S] (this is convenient for users to specify their Markov Process in a succinct
Mapping representation without the burden of wrapping each S with a NonTerminal[S] or
Terminal[S]). The dictionary we created above for the weather Markov Process can be
used as the transition_map argument. However, this means the __init__ method needs
to wrap the specified S states as NonTerminal[S] or Terminal[S] when creating the attribute
self.transition_map. We also have an attribute self.non_terminal_states: Sequence[NonTerminal[S]]
that is an ordered sequence of the non-terminal states. We implement the transition
method by simply returning the FiniteDistribution[State[S]] the given state: NonTerminal[S]
maps to in the attribute self.transition_map: Transition[S]. Note that along with the
transition method, we have implemented the __repr__ method for a well-formatted dis-
play of self.transition_map.

from typing import Sequence, Set


from rl.distribution import FiniteDistribution, Categorical
class FiniteMarkovProcess(MarkovProcess[S]):
non_terminal_states: Sequence[NonTerminal[S]]
transition_map: Transition[S]
def __init__(self, transition_map: Mapping[S, FiniteDistribution[S]]):
non_terminals: Set[S] = set(transition_map.keys())
self.transition_map = {
NonTerminal(s): Categorical(
{(NonTerminal(s1) if s1 in non_terminals else Terminal(s1)): p
for s1, p in v}
) for s, v in transition_map.items()
}
self.non_terminal_states = list(self.transition_map.keys())
    def __repr__(self) -> str:
        display = ""
        for s, d in self.transition_map.items():
            display += f"From State {s.state}:\n"
            for s1, p in d:
                opt = (
                    "Terminal State" if isinstance(s1, Terminal) else "State"
                )
                display += f"  To {opt} {s1.state} with Probability {p:.3f}\n"
        return display
def transition(self, state: NonTerminal[S]) -> FiniteDistribution[State[S]]:
return self.transition_map[state]
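As a small illustration (a sketch re-using the weather example above, with the classes above assumed to be in scope), we can construct a FiniteMarkovProcess from the unwrapped dictionary and sample a trace of daily weather:

import itertools
from rl.distribution import Categorical, Constant

weather_mp = FiniteMarkovProcess({
    "Rain": Categorical({"Rain": 0.3, "Nice": 0.7}),
    "Snow": Categorical({"Rain": 0.4, "Snow": 0.6}),
    "Nice": Categorical({"Rain": 0.2, "Snow": 0.3})
})
print(weather_mp)  # displays the wrapped transition probabilities via __repr__

# Sample 10 successive days of weather, starting from a "Nice" day
start = Constant(NonTerminal("Nice"))
print([s.state for s in itertools.islice(weather_mp.simulate(start), 10)])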

1.6. Simple Inventory Example


To help conceptualize Finite Markov Processes, let us consider a simple example of changes
in inventory at a store. Assume you are the store manager and that you are tasked with
controlling the ordering of inventory from a supplier. Let us focus on the inventory of a
particular type of bicycle. Assume that each day there is random (non-negative integer)
demand for the bicycle with the probabilities of demand following a Poisson distribution
(with Poisson parameter λ ∈ R≥0 ), i.e. demand i for each i = 0, 1, 2, . . . occurs with probability

$$f(i) = \frac{e^{-\lambda} \lambda^i}{i!}$$
Denote F : Z≥0 → [0, 1] as the poisson cumulative probability distribution function, i.e.,

$$F(i) = \sum_{j=0}^{i} f(j)$$

Assume you have storage capacity for at most C ∈ Z≥0 bicycles in your store. Each
evening at 6pm when your store closes, you have the choice to order a certain number
of bicycles from your supplier (including the option to not order any bicycles, on a given
day). The ordered bicycles will arrive 36 hours later (at 6am the day after the day after
you order - we refer to this as delivery lead time of 36 hours). Denote the State at 6pm
store-closing each day as (α, β), where α is the inventory in the store (referred to as On-
Hand Inventory at 6pm) and β is the inventory on a truck from the supplier (that you
had ordered the previous day) that will arrive in your store the next morning at 6am (β
is referred to as On-Order Inventory at 6pm). Due to your storage capacity constraint of at
most C bicycles, your ordering policy is to order C − (α + β) if α + β < C and to not order
if α + β ≥ C. The precise sequence of events in a 24-hour cycle is:

• Observe the (α, β) State at 6pm store-closing (call this state St )


• Immediately order according to the ordering policy described above
• Receive bicycles at 6am if you had ordered 36 hours ago
• Open the store at 8am
• Experience random demand from customers according to demand probabilities stated
above (number of bicycles sold for the day will be the minimum of demand on the
day and inventory at store opening on the day)
• Close the store at 6pm and observe the state (this state is St+1 )

If we let this process run for a while, in steady-state we ensure that α + β ≤ C. So
to model this process as a Finite Markov Process, we shall only consider the steady-state
(finite) set of states

S = {(α, β)|α ∈ Z≥0 , β ∈ Z≥0 , 0 ≤ α + β ≤ C}

So restricting ourselves to this finite set of states, our order quantity equals C − (α + β)
when the state is (α, β).
If current state St is (α, β), there are only α + β + 1 possible next states St+1 as follows:

(α + β − i, C − (α + β)) for i = 0, 1, . . . , α + β

with transition probabilities governed by the Poisson probabilities of demand as follows:

$$\mathcal{P}((\alpha, \beta), (\alpha + \beta - i, C - (\alpha + \beta))) = f(i) \text{ for } 0 \leq i \leq \alpha + \beta - 1$$

$$\mathcal{P}((\alpha, \beta), (0, C - (\alpha + \beta))) = \sum_{j=\alpha+\beta}^{\infty} f(j) = 1 - F(\alpha + \beta - 1)$$

Note that the next state’s (St+1 ) On-Hand can be zero resulting from any of infinite pos-
sible demand outcomes greater than or equal to α + β.

So we are now ready to write code for this simple inventory example as a Markov Pro-
cess. All we have to do is to create a derived class inherited from FiniteMarkovProcess
and write a method to construct the transition_map: Transition. Note that the generic
state type S is replaced here with the @dataclass InventoryState consisting of the pair of
On-Hand and On-Order inventory quantities comprising the state of this Finite Markov
Process.

from typing import Dict, Mapping
from rl.distribution import Categorical, FiniteDistribution
from scipy.stats import poisson
@dataclass(frozen=True)
class InventoryState:
on_hand: int
on_order: int
def inventory_position(self) -> int:
return self.on_hand + self.on_order
class SimpleInventoryMPFinite(FiniteMarkovProcess[InventoryState]):
def __init__(
self,
capacity: int,
poisson_lambda: float
):
self.capacity: int = capacity
self.poisson_lambda: float = poisson_lambda
self.poisson_distr = poisson(poisson_lambda)
super().__init__(self.get_transition_map())
def get_transition_map(self) -> \
Mapping[InventoryState, FiniteDistribution[InventoryState]]:
d: Dict[InventoryState, Categorical[InventoryState]] = {}
for alpha in range(self.capacity + 1):
for beta in range(self.capacity + 1 - alpha):
state = InventoryState(alpha, beta)
ip = state.inventory_position()
beta1 = self.capacity - ip
state_probs_map: Mapping[InventoryState, float] = {
InventoryState(ip - i, beta1):
(self.poisson_distr.pmf(i) if i < ip else
1 - self.poisson_distr.cdf(ip - 1))
for i in range(ip + 1)
}
d[InventoryState(alpha, beta)] = Categorical(state_probs_map)
return d

Let us utilize the __repr__ method written previously to view the transition probabilities
for the simple case of C = 2 and λ = 1.0 (this code is in the file rl/chapter2/simple_inventory_mp.py).

user_capacity = 2
user_poisson_lambda = 1.0
si_mp = SimpleInventoryMPFinite(
capacity=user_capacity,
poisson_lambda=user_poisson_lambda
)
print(si_mp)

The output we get is nicely displayed as:

From State InventoryState(on_hand=0, on_order=0):
To State InventoryState(on_hand=0, on_order=2) with Probability 1.000
From State InventoryState(on_hand=0, on_order=1):
To State InventoryState(on_hand=1, on_order=1) with Probability 0.368
To State InventoryState(on_hand=0, on_order=1) with Probability 0.632
From State InventoryState(on_hand=0, on_order=2):
To State InventoryState(on_hand=2, on_order=0) with Probability 0.368
To State InventoryState(on_hand=1, on_order=0) with Probability 0.368
To State InventoryState(on_hand=0, on_order=0) with Probability 0.264
From State InventoryState(on_hand=1, on_order=0):
To State InventoryState(on_hand=1, on_order=1) with Probability 0.368
To State InventoryState(on_hand=0, on_order=1) with Probability 0.632
From State InventoryState(on_hand=1, on_order=1):
To State InventoryState(on_hand=2, on_order=0) with Probability 0.368
To State InventoryState(on_hand=1, on_order=0) with Probability 0.368
To State InventoryState(on_hand=0, on_order=0) with Probability 0.264
From State InventoryState(on_hand=2, on_order=0):
To State InventoryState(on_hand=2, on_order=0) with Probability 0.368
To State InventoryState(on_hand=1, on_order=0) with Probability 0.368
To State InventoryState(on_hand=0, on_order=0) with Probability 0.264

For a graphical view of this Markov Process, see Figure 1.6. The nodes are the states,
labeled with their corresponding α and β values. The directed edges are the probabilistic
state transitions from 6pm on a day to 6pm on the next day, with the transition probabilities
labeled on them.

Figure 1.6.: Simple Inventory Markov Process
We can perform a number of interesting experiments and calculations with this simple
Markov Process and we encourage you to play with this code by changing values of the
capacity C and poisson mean λ, performing simulations and probabilistic calculations of
natural curiosity for a store owner.
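For example, here is a minimal sketch (with made-up parameter choices, and assuming SimpleInventoryMPFinite and InventoryState above are in scope) that uses simulate to estimate the probability distribution of the state 100 days out:

import itertools
from collections import Counter
from rl.distribution import Constant

si_mp = SimpleInventoryMPFinite(capacity=2, poisson_lambda=1.0)
start = Constant(NonTerminal(InventoryState(on_hand=0, on_order=0)))

# Estimate the distribution of the state at day 100 from 1000 sampled traces
num_traces = 1000
counts = Counter(
    list(itertools.islice(si_mp.simulate(start), 101))[-1].state
    for _ in range(num_traces)
)
print({s: round(c / num_traces, 3) for s, c in counts.items()})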
There is a rich and interesting theory for Markov Processes. However, we won’t go into
this theory as our coverage of Markov Processes so far is a sufficient building block to take
us to the incremental topics of Markov Reward Processes and Markov Decision Processes.
However, before we move on, we’d like to show just a glimpse of the rich theory with the
calculation of Stationary Probabilities and apply it to the case of the above simple inventory
Markov Process.

1.7. Stationary Distribution of a Markov Process


Definition 1.7.1. The Stationary Distribution of a (Discrete-Time, Time-Homogeneous) Markov
Process with state space S = N and transition probability function P : N × N → [0, 1] is
a probability distribution function π : N → [0, 1] such that:
$$\pi(s) = \sum_{s' \in \mathcal{N}} \pi(s') \cdot \mathcal{P}(s', s) \text{ for all } s \in \mathcal{N}$$

The intuitive view of the stationary distribution π is that (under specific conditions we
are not listing here) if we let the Markov Process run forever, then in the long run the states
occur at specific time steps with relative frequencies (probabilities) given by a distribution
π that is independent of the time step. The probability of occurrence of a specific state s
at a time step (asymptotically far out in the future) should be equal to the sum-product
of probabilities of occurrence of all the states at the previous time step and the transition
probabilities from those states to s. But since the states’ occurrence probabilities are invari-
ant in time, the π distribution for the previous time step is the same as the π distribution
for the time step we considered. This argument holds for all states s, and that is exactly
the statement of the definition of Stationary Distribution formalized above.
If we specialize this definition of Stationary Distribution to Finite-States, Discrete-Time,
Time-Homogeneous Markov Processes with state space S = {s1 , s2 , . . . , sn } = N , then we
can express the Stationary Distribution π as follows:

$$\pi(s_j) = \sum_{i=1}^{n} \pi(s_i) \cdot \mathcal{P}(s_i, s_j) \text{ for all } j = 1, 2, \ldots, n$$

Below we use bold-face notation to represent functions as vectors and matrices (since
we assume finite states). So, π is a column vector of length n and P is the n × n transition
probability matrix (rows are source states, columns are destination states with each row
summing to 1). Then, the statement of the above definition can be succinctly expressed
as:
$$\boldsymbol{\pi}^T = \boldsymbol{\pi}^T \cdot \boldsymbol{\mathcal{P}}$$

which can be re-written as:

$$\boldsymbol{\mathcal{P}}^T \cdot \boldsymbol{\pi} = \boldsymbol{\pi}$$

But this is simply saying that π is an eigenvector of P^T with eigenvalue of 1. So then, it
should be easy to obtain the stationary distribution π from an eigenvectors and eigenvalues
calculation of P^T.
Let us write code to compute the stationary distribution. We shall add two methods
in the FiniteMarkovProcess class, one for setting up the transition probability matrix P
(get_transition_matrix method) and another to calculate the stationary distribution π
(get_stationary_distribution) from this transition probability matrix. Note that P is re-
stricted to N × N → [0, 1] (rather than N × S → [0, 1]) because these probability tran-
sitions suffice for all the calculations we will be performing for Finite Markov Processes.
Here’s the code for the two methods (the full code for FiniteMarkovProcess is in the file
rl/markov_process.py):

import numpy as np
from rl.distribution import FiniteDistribution, Categorical
def get_transition_matrix(self) -> np.ndarray:
sz = len(self.non_terminal_states)
mat = np.zeros((sz, sz))
for i, s1 in enumerate(self.non_terminal_states):
for j, s2 in enumerate(self.non_terminal_states):
mat[i, j] = self.transition(s1).probability(s2)
return mat
def get_stationary_distribution(self) -> FiniteDistribution[S]:
eig_vals, eig_vecs = np.linalg.eig(self.get_transition_matrix().T)
index_of_first_unit_eig_val = np.where(
np.abs(eig_vals - 1) < 1e-8)[0][0]
eig_vec_of_unit_eig_val = np.real(
eig_vecs[:, index_of_first_unit_eig_val])
return Categorical({
self.non_terminal_states[i].state: ev
for i, ev in enumerate(eig_vec_of_unit_eig_val /
sum(eig_vec_of_unit_eig_val))
})

We skip the theory that tells us about the conditions under which a stationary distribu-
tion is well-defined, or the conditions under which there is a unique stationary distribu-
tion. Instead, we just go ahead with this calculation here assuming this Markov Process
satisfies those conditions (it does!). So, we simply seek the index of the eig_vals vec-
tor with eigenvalue equal to 1 (accounting for floating-point error). Next, we pull out
the column of the eig_vecs matrix at the eig_vals index calculated above, and convert it
into a real-valued vector (eigenvectors/eigenvalues calculations are, in general, complex
numbers calculations - see the reference for the np.linalg.eig function). So this gives
us the real-valued eigenvector with eigenvalue equal to 1. Finally, we have to normalize
the eigenvector so its values add up to 1 (since we want probabilities), and return the
probabilities as a Categorical distribution.
Running this code for the simple case of capacity C = 2 and poisson mean λ = 1.0
(instance of SimpleInventoryMPFinite) produces the following output for the stationary
distribution π:

{InventoryState(on_hand=0, on_order=0): 0.117,
InventoryState(on_hand=0, on_order=1): 0.279,
InventoryState(on_hand=0, on_order=2): 0.117,
InventoryState(on_hand=1, on_order=0): 0.162,
InventoryState(on_hand=1, on_order=1): 0.162,
InventoryState(on_hand=2, on_order=0): 0.162}

This tells us that On-Hand of 0 and On-Order of 1 is the state occurring most frequently
(28% of the time) when the system is played out indefinitely.
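As a quick sanity check of this calculation (a sketch assuming the classes above are in scope), we can verify numerically that the computed π is indeed invariant under the transition probability matrix:

import numpy as np

si_mp = SimpleInventoryMPFinite(capacity=2, poisson_lambda=1.0)
stationary = si_mp.get_stationary_distribution()

p_matrix = si_mp.get_transition_matrix()
pi_vec = np.array([
    stationary.probability(s.state) for s in si_mp.non_terminal_states
])
# pi_vec.dot(p_matrix) is the row-vector product pi^T . P, which should equal pi^T
print(np.allclose(pi_vec.dot(p_matrix), pi_vec))  # True, up to numerical precision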
Let us summarize the 3 different representations we’ve covered:

• Functional Representation: as given by the transition method, i.e., given a non-
terminal state, the transition method returns a probability distribution of next states.
This representation is valuable when performing simulations by sampling the next
state from the returned probability distribution of next states. This is applicable to
the general case of Markov Processes (including infinite state spaces).
• Sparse Data Structure Representation: as given by transition_map: Transition, which
is convenient for compact storage and useful for visualization (eg: __repr__ method
display or as a directed graph figure). This is applicable only to Finite Markov Pro-
cesses.
• Dense Data Structure Representation: as given by the get_transition_matrix 2D
numpy array, which is useful for performing linear algebra that is often required
to calculate mathematical properties of the process (eg: to calculate the stationary
distribution). This is applicable only to Finite Markov Processes.

Now we are ready to move to our next topic of Markov Reward Processes. We’d like to
finish this section by stating that the Markov Property owes its name to a mathematician
from a century ago - Andrey Markov. Although the Markov Property seems like a simple
enough concept, the concept has had profound implications on our ability to compute or
reason with systems involving time-sequenced uncertainty in practice. There are several
good books to learn more about Markov Processes - we recommend the book by Paul
Gagniuc (Gagniuc 2017).

1.8. Formalism of Markov Reward Processes
As we’ve said earlier, the reason we covered Markov Processes is because we want to make
our way to Markov Decision Processes (the framework for Reinforcement Learning algo-
rithms) by adding incremental features to Markov Processes. Now we cover an interme-
diate framework between Markov Processes and Markov Decision Processes, known as
Markov Reward Processes. We essentially just include the notion of a numerical reward to
a Markov Process each time we transition from one state to the next. These rewards are
random, and all we need to do is to specify the probability distributions of these rewards
as we make state transitions.
The main purpose of Markov Reward Processes is to calculate how much reward we
would accumulate (in expectation, from each of the non-terminal states) if we let the Pro-
cess run indefinitely, bearing in mind that future rewards need to be discounted appropri-
ately (otherwise the sum of rewards could blow up to ∞). In order to solve the problem of
calculating expected accumulative rewards from each non-terminal state, we will first set
up some formalism for Markov Reward Processes, develop some (elegant) theory on cal-
culating rewards accumulation, write plenty of code (based on the theory), and apply the
theory and code to the simple inventory example (which we will embellish with rewards
equal to negative of the costs incurred at the store).

Definition 1.8.1. A Markov Reward Process is a Markov Process, along with a time-indexed
sequence of Reward random variables Rt ∈ D (a countable subset of R) for time steps t =
1, 2, . . ., satisfying the Markov Property (including Rewards): P[(Rt+1 , St+1 )|St , St−1 , . . . , S0 ] =
P[(Rt+1 , St+1 )|St ] for all t ≥ 0.

It pays to emphasize again (like we emphasized for Markov Processes), that the def-
initions and theory of Markov Reward Processes we cover (by default) are for discrete-
time, for countable state spaces and countable set of pairs of next state and reward tran-
sitions (with the knowledge that the definitions and theory are analogously extensible to
continuous-time and uncountable spaces/transitions). In the more general case, where
states or rewards are uncountable, the same concepts apply except that the mathematical
formalism needs to be more detailed and more careful. Specifically, we’d end up with
integrals instead of summations, and probability density functions (for continuous prob-
ability distributions) instead of probability mass functions (for discrete probability dis-
tributions). For ease of notation and more importantly, for ease of understanding of the
core concepts (without being distracted by heavy mathematical formalism), we’ve cho-
sen to stay with discrete-time, countable S and countable D (by default). However, there
will be examples of Markov Reward Processes in this book involving continuous-time and
uncountable S and D (please adjust the definitions and formulas accordingly).
Since we commonly assume Time-Homogeneity of Markov Processes, we shall also (by
default) assume Time-Homogeneity for Markov Reward Processes, i.e., P[(Rt+1 , St+1 )|St ]
is independent of t.
With the default assumption of time-homogeneity, the transition probabilities of a Markov
Reward Process can be expressed as a transition probability function:

PR : N × D × S → [0, 1]

defined as:

$$\mathcal{P}_R(s, r, s') = \mathbb{P}[(R_{t+1} = r, S_{t+1} = s') \mid S_t = s] \text{ for time steps } t = 0, 1, 2, \ldots,$$

$$\text{for all } s \in \mathcal{N}, r \in \mathcal{D}, s' \in \mathcal{S}, \text{ such that } \sum_{s' \in \mathcal{S}} \sum_{r \in \mathcal{D}} \mathcal{P}_R(s, r, s') = 1 \text{ for all } s \in \mathcal{N}$$

The subsection on Start States we had covered for Markov Processes naturally applies
to Markov Reward Processes as well. So we won’t repeat the section here, rather we sim-
ply highlight that when it comes to simulations, we need a separate specification of the
probability distribution of start states. Also, by inheriting from our framework of Markov
Processes, we model the notion of a “process termination” by explicitly specifying states
as terminal states or non-terminal states. The sequence S0 , R1 , S1 , R2 , S2 , . . . terminates at
time step t = T if ST ∈ T , with RT being the final reward in the sequence.
If all random sequences of states in a Markov Reward Process terminate, we refer to it
as episodic sequences (otherwise, we refer to it as continuing sequences).
Let's write some code that captures this formalism. We create an abstract class
MarkovRewardProcess that inherits from the abstract class MarkovProcess. Analogous to
MarkovProcess’s transition method (that represents P), MarkovRewardProcess has an ab-
stract method transition_reward that represents PR . Note that the return type of transition_reward
is Distribution[Tuple[State[S], float]], representing the probability distribution of (next
state, reward) pairs transitioned to.
Also, analogous to MarkovProcess’s simulate method, MarkovRewardProcess has the method
simulate_reward which generates a stream of TransitionStep[S] objects. Each TransitionStep[S]
object consists of a 3-tuple: (state, next state, reward) representing the sampled transitions
within the generated sampling trace. Here’s the actual code:

from typing import Tuple

@dataclass(frozen=True)
class TransitionStep(Generic[S]):
state: NonTerminal[S]
next_state: State[S]
reward: float
class MarkovRewardProcess(MarkovProcess[S]):
@abstractmethod
def transition_reward(self, state: NonTerminal[S])\
-> Distribution[Tuple[State[S], float]]:
pass
def simulate_reward(
self,
start_state_distribution: Distribution[NonTerminal[S]]
) -> Iterable[TransitionStep[S]]:
state: State[S] = start_state_distribution.sample()
reward: float = 0.
while isinstance(state, NonTerminal):
next_distribution = self.transition_reward(state)
next_state, reward = next_distribution.sample()
yield TransitionStep(state, next_state, reward)
state = next_state

So the idea is that if someone wants to model a Markov Reward Process, they’d simply
have to create a concrete class that implements the interface of the abstract class MarkovRewardProcess
(specifically implement the abstract method transition_reward). But note that the abstract
method transition of MarkovProcess also needs to be implemented to make the whole
thing concrete. However, we don’t have to implement it in the concrete class implementing
the interface of MarkovRewardProcess - in fact, we can implement it in the MarkovRewardProcess
class itself by tapping the method transition_reward. Here’s the code for the transition
method in MarkovRewardProcess:

from rl.distribution import Distribution, SampledDistribution
def transition(self, state: NonTerminal[S]) -> Distribution[State[S]]:
distribution = self.transition_reward(state)
def next_state(distribution=distribution):
next_s, _ = distribution.sample()
return next_s
return SampledDistribution(next_state)

Note that since the transition_reward method is abstract in MarkovRewardProcess (the full
definition of MarkovRewardProcess is in the file rl/markov_process.py), the only thing the
transition method can do is to tap into the sample method of the abstract
Distribution object produced by transition_reward and return a SampledDistribution.
Now let us develop some more theory. Given a specification of PR , we can extract:

• The transition probability function P : N × S → [0, 1] of the implicit Markov Process, defined as:

$$\mathcal{P}(s, s') = \sum_{r \in \mathcal{D}} \mathcal{P}_R(s, r, s')$$

• The reward transition function R_T : N × S → R, defined as:

$$\mathcal{R}_T(s, s') = \mathbb{E}[R_{t+1} \mid S_{t+1} = s', S_t = s] = \sum_{r \in \mathcal{D}} \frac{\mathcal{P}_R(s, r, s')}{\mathcal{P}(s, s')} \cdot r = \sum_{r \in \mathcal{D}} \frac{\mathcal{P}_R(s, r, s')}{\sum_{r \in \mathcal{D}} \mathcal{P}_R(s, r, s')} \cdot r$$

The Rewards specification of most Markov Reward Processes we encounter in practice
can be directly expressed as the reward transition function RT (versus the more general
specification of PR ). Lastly, we want to highlight that we can transform either of PR or
RT into a “more compact” reward function that is sufficient to perform key calculations
involving Markov Reward Processes. This reward function

R:N →R

is defined as:
$$\mathcal{R}(s) = \mathbb{E}[R_{t+1} \mid S_t = s] = \sum_{s' \in \mathcal{S}} \mathcal{P}(s, s') \cdot \mathcal{R}_T(s, s') = \sum_{s' \in \mathcal{S}} \sum_{r \in \mathcal{D}} \mathcal{P}_R(s, r, s') \cdot r$$

We’ve created a bit of notational clutter here. So it would be a good idea for you to
take a few minutes to pause, reflect and internalize the differences between PR , P (of the
implicit Markov Process), RT and R. This notation will analogously re-appear when we
learn about Markov Decision Processes in Chapter 2. Moreover, this notation will be used
considerably in the rest of the book, so it pays to get comfortable with their semantics.
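As a small made-up example to anchor these definitions: suppose that from a non-terminal state s1 the only transitions are PR (s1, 2, s1) = 0.6 and PR (s1, 10, s2) = 0.4. Then the implicit Markov Process has P(s1, s1) = 0.6 and P(s1, s2) = 0.4, the reward transition function is RT (s1, s1) = 2 and RT (s1, s2) = 10, and the reward function collapses all of this into the single number R(s1) = 0.6 · 2 + 0.4 · 10 = 5.2.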

1.9. Simple Inventory Example as a Markov Reward Process


Now we return to the simple inventory example and embellish it with a reward structure to
turn it into a Markov Reward Process (business costs will be modeled as negative rewards).
Let us assume that your store business incurs two types of costs:
• Holding cost of h for each bicycle that remains in your store overnight. Think of
this as “interest on inventory” - each day your bicycle remains unsold, you lose the
opportunity to gain interest on the cash you paid to buy the bicycle. Holding cost
also includes the cost of upkeep of inventory.
• Stockout cost of p for each unit of “missed demand,” i.e., for each customer wanting
to buy a bicycle that you could not satisfy with available inventory, eg: if 3 customers
show up during the day wanting to buy a bicycle each, and you have only 1 bicycle
at 8am (store opening time), then you lost two units of demand, incurring a cost of
2p. Think of the cost of p per unit as the lost revenue plus disappointment for the
customer. Typically p ≫ h.

Let us go through the precise sequence of events, now with incorporation of rewards,
in each 24-hour cycle:

• Observe the (α, β) State at 6pm store-closing (call this state St )


• Immediately order according to the ordering policy given by: Order quantity =
max(C − (α + β), 0)
• Record any overnight holding cost incurred as described above
• Receive bicycles at 6am if you had ordered 36 hours ago
• Open the store at 8am
• Experience random demand from customers according to the specified poisson prob-
abilities (poisson mean = λ)
• Record any stockout cost due to missed demand as described above
• Close the store at 6pm, register the reward Rt+1 as the negative sum of overnight
holding cost and the day’s stockout cost, and observe the state (this state is St+1 )

Since the customer demand on any day can be an infinite set of possibilities (poisson
distribution over the entire range of non-negative integers), we have an infinite set of pairs
of next state and reward we could transition to from a given current state. Let’s see what
the probabilities of each of these transitions look like. For a given current state St :=
(α, β), if customer demand for the day is i, then the next state St+1 is:

(max(α + β − i, 0), max(C − (α + β), 0))

and the reward Rt+1 is:


−h · α − p · max(i − (α + β), 0)
Note that the overnight holding cost applies to each unit of on-hand inventory at store
closing (= α) and the stockout cost applies only to any units of “missed demand” (=
max(i − (α + β), 0)). Since two different values of demand i ∈ Z≥0 do not collide on
any unique pair (s′ , r) of next state and reward, we can express the transition probability
function PR for this Simple Inventory Example as a Markov Reward Process as:

$$\mathcal{P}_R((\alpha, \beta), -h \cdot \alpha - p \cdot \max(i - (\alpha + \beta), 0), (\max(\alpha + \beta - i, 0), \max(C - (\alpha + \beta), 0)))$$
$$= \frac{e^{-\lambda} \lambda^i}{i!} \text{ for all } i = 0, 1, 2, \ldots$$
Now let’s write some code to implement this simple inventory example as a Markov
Reward Process as described above. All we have to do is to create a concrete class imple-
menting the interface of the abstract class MarkovRewardProcess (specifically implement

the abstract method transition_reward). The code below in transition_reward method
in class SimpleInventoryMRP samples the customer demand from a Poisson distribution,
uses the above formulas for the pair of next state and reward as a function of the customer
demand sample, and returns an instance of SampledDistribution. Note that the generic
state type S is replaced here with the @dataclass InventoryState to represent a state of
this Markov Reward Process, comprising the On-Hand and On-Order inventory quan-
tities.
from rl.distribution import SampledDistribution
import numpy as np
@dataclass(frozen=True)
class InventoryState:
on_hand: int
on_order: int
def inventory_position(self) -> int:
return self.on_hand + self.on_order
class SimpleInventoryMRP(MarkovRewardProcess[InventoryState]):
def __init__(
self,
capacity: int,
poisson_lambda: float,
holding_cost: float,
stockout_cost: float
):
self.capacity = capacity
self.poisson_lambda: float = poisson_lambda
self.holding_cost: float = holding_cost
self.stockout_cost: float = stockout_cost
def transition_reward(
self,
state: NonTerminal[InventoryState]
) -> SampledDistribution[Tuple[State[InventoryState], float]]:
def sample_next_state_reward(state=state) ->\
Tuple[State[InventoryState], float]:
demand_sample: int = np.random.poisson(self.poisson_lambda)
ip: int = state.state.inventory_position()
next_state: InventoryState = InventoryState(
max(ip - demand_sample, 0),
max(self.capacity - ip, 0)
)
reward: float = - self.holding_cost * state.state.on_hand\
- self.stockout_cost * max(demand_sample - ip, 0)
return NonTerminal(next_state), reward
return SampledDistribution(sample_next_state_reward)

The above code can be found in the file rl/chapter2/simple_inventory_mrp.py. We leave
it as an exercise for you to use the simulate_reward method inherited by SimpleInventoryMRP
to perform simulations and analyze the statistics produced from the sampling traces.
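To get you started, here is a minimal sketch (using the same parameter values we use later in this chapter, and assuming SimpleInventoryMRP and InventoryState above are in scope) that estimates the average reward per time step from a single long sampling trace:

import itertools
from rl.distribution import Constant

si_mrp = SimpleInventoryMRP(
    capacity=2,
    poisson_lambda=1.0,
    holding_cost=1.0,
    stockout_cost=10.0
)
start = Constant(NonTerminal(InventoryState(on_hand=0, on_order=0)))

# Average reward per day over a single sampling trace of 10,000 days
num_steps = 10000
steps = itertools.islice(si_mrp.simulate_reward(start), num_steps)
print(sum(step.reward for step in steps) / num_steps)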

1.10. Finite Markov Reward Processes


Certain calculations for Markov Reward Processes can be performed easily if:

• The state space is finite (S = {s1 , s2 , . . . , sn }), and


• The set of unique pairs of next state and reward transitions from each of the states
in N is finite

If we satisfy the above two characteristics, we refer to the Markov Reward Process as a
Finite Markov Reward Process. So let us write some code for a Finite Markov Reward Pro-
cess. We create a concrete class FiniteMarkovRewardProcess that primarily inherits from
FiniteMarkovProcess (a concrete class) and secondarily implements the interface of the
abstract class MarkovRewardProcess. Our first task is to think about the data structure re-
quired to specify an instance of FiniteMarkovRewardProcess (i.e., the data structure we’d
pass to the __init__ method of FiniteMarkovRewardProcess). Analogous to how we cur-
ried P for a Markov Process as N → (S → [0, 1]) (where S = {s1 , s2 , . . . , sn } and N has
m ≤ n states), here we curry PR as:

N → (S × D → [0, 1])

Since S is finite and since the set of unique pairs of next state and reward transitions is
also finite, this leads to the analog of the Transition data type for the case of Finite Markov
Reward Processes (named RewardTransition) as follows:

StateReward = FiniteDistribution[Tuple[State[S], float]]


RewardTransition = Mapping[NonTerminal[S], StateReward[S]]

The FiniteMarkovRewardProcess class has 3 responsibilities:

• It needs to accept as input to __init__ a Mapping type similar to RewardTransition
using simply S instead of NonTerminal[S] or State[S] in order to make it conve-
nient for the user to specify a FiniteMarkovRewardProcess as a succinct dictionary, with-
out being encumbered with wrapping S as NonTerminal[S] or Terminal[S] types.
This means the __init__ method (constructor) needs to appropriately wrap S as
NonTerminal[S] or Terminal[S] types to create the attribute transition_reward_map:
RewardTransition[S]. Also, the __init__ method needs to create a transition_map:
Transition[S] (extracted from the input to __init__) in order to instantiate its con-
crete parent class FiniteMarkovProcess.
• It needs to implement the transition_reward method analogous to the implementa-
tion of the transition method in FiniteMarkovProcess
• It needs to compute the reward function R : N → R from the transition probabil-
ity function PR (i.e. from self.transition_reward_map: RewardTransition) based on
the expectation calculation we specified above (as mentioned earlier, R is key to the
relevant calculations we shall soon be performing on Finite Markov Reward Pro-
cesses). To perform further calculations with the reward function R, we need to
produce it as a 1-dimensional numpy array (i.e., a vector) attribute of the class (we
name it as reward_function_vec).

Here's the code that fulfills the above three responsibilities (the code for FiniteMarkovRewardProcess, and more, is in rl/markov_process.py):

import numpy as np
from rl.distribution import FiniteDistribution, Categorical
from collections import defaultdict
from typing import Mapping, Tuple, Dict, Set
class FiniteMarkovRewardProcess(FiniteMarkovProcess[S],
MarkovRewardProcess[S]):
transition_reward_map: RewardTransition[S]
reward_function_vec: np.ndarray

def __init__(
self,
transition_reward_map: Mapping[S, FiniteDistribution[Tuple[S, float]]]
):
transition_map: Dict[S, FiniteDistribution[S]] = {}
for state, trans in transition_reward_map.items():
probabilities: Dict[S, float] = defaultdict(float)
for (next_state, _), probability in trans:
probabilities[next_state] += probability
transition_map[state] = Categorical(probabilities)
super().__init__(transition_map)
nt: Set[S] = set(transition_reward_map.keys())
self.transition_reward_map = {
NonTerminal(s): Categorical(
{(NonTerminal(s1) if s1 in nt else Terminal(s1), r): p
for (s1, r), p in v}
) for s, v in transition_reward_map.items()
}
self.reward_function_vec = np.array([
sum(probability * reward for (_, reward), probability in
self.transition_reward_map[state])
for state in self.non_terminal_states
])
def transition_reward(self, state: NonTerminal[S]) -> StateReward[S]:
return self.transition_reward_map[state]

1.11. Simple Inventory Example as a Finite Markov Reward Process
Now we’d like to model the simple inventory example as a Finite Markov Reward Process
so we can take advantage of the algorithms that apply to Finite Markov Reward Processes.
As we’ve noted previously, our ordering policy ensures that in steady-state, the sum of
On-Hand (denote as α) and On-Order (denote as β) won’t exceed the capacity C. So we
constrain the set of states such that this condition is satisfied: 0 ≤ α + β ≤ C (i.e., finite
number of states). Although the set of states is finite, there are an infinite number of pairs
of next state and reward outcomes possible from any given current state. This is because
there are an infinite set of possibilities of customer demand on any given day (resulting
in infinite possibilities of stockout cost, i.e., negative reward, on any day). To qualify as
a Finite Markov Reward Process, we’ll need to model in a manner such that we have a
finite set of pairs of next state and reward outcomes from a given current state. So what
we’ll do is that instead of considering (St+1 , Rt+1 ) as the pair of next state and reward, we
model the pair of next state and reward to instead be (St+1 , E[Rt+1 |(St , St+1 )]) (we know
PR due to the Poisson probabilities of customer demand, so we can actually calculate this
conditional expectation of reward). So given a state s, the pairs of next state and reward
would be: (s′ , RT (s, s′ )) for all the s′ we transition to from s. Since the set of possible next
states s′ are finite, these newly-modeled rewards associated with the transitions (RT (s, s′ ))
are also finite and hence, the set of pairs of next state and reward from any current state
are also finite. Note that this creative alteration of the reward definition is purely to reduce
this Markov Reward Process into a Finite Markov Reward Process. Let’s now work out the
calculation of the reward transition function RT .
When the next state’s (St+1 ) On-Hand is greater than zero, it means all of the day’s
demand was satisfied with inventory that was available at store-opening (= α + β), and

hence, each of these next states St+1 corresponds to no stockout cost and only an overnight
holding cost of hα. Therefore,

RT ((α, β), (α + β − i, C − (α + β))) = −hα for 0 ≤ i ≤ α + β − 1

When next state’s (St+1 ) On-Hand is equal to zero, there are two possibilities:

1. The demand for the day was exactly α + β, meaning all demand was satisfied with
available store inventory (so no stockout cost and only overnight holding cost), or
2. The demand for the day was strictly greater than α + β, meaning there’s some stock-
out cost in addition to overnight holding cost. The exact stockout cost is an expec-
tation calculation involving the number of units of missed demand under the corre-
sponding poisson probabilities of demand exceeding α + β.

This calculation is shown below:

$$\mathcal{R}_T((\alpha, \beta), (0, C - (\alpha + \beta))) = -h\alpha - p \left( \sum_{j=\alpha+\beta+1}^{\infty} f(j) \cdot (j - (\alpha + \beta)) \right)$$
$$= -h\alpha - p \left( \lambda \left(1 - F(\alpha + \beta - 1)\right) - (\alpha + \beta)\left(1 - F(\alpha + \beta)\right) \right)$$

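As a quick numerical check (using the parameter values we use at the end of this chapter: C = 2, λ = 1.0, h = 1.0, p = 10.0), consider the current state (α, β) = (0, 2), so α + β = 2, F(1) ≈ 0.736 and F(2) ≈ 0.920. Transitions to next states with positive On-Hand carry reward −hα = 0, while

$$\mathcal{R}_T((0,2), (0,0)) = -1.0 \cdot 0 - 10.0 \cdot (1.0 \cdot (1 - 0.736) - 2 \cdot (1 - 0.920)) \approx -1.036$$

so the reward function value for this state is R((0, 2)) = (1 − F(1)) · (−1.036) ≈ 0.264 · (−1.036) ≈ −0.274, consistent with the reward_function_vec output displayed at the end of this chapter.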

So now we have a specification of RT , but when it comes to our coding interface, we are
expected to specify PR as that is the interface through which we create a FiniteMarkovRewardProcess.
Fear not - a specification of PR is easy once we have a specification of RT . We simply cre-
ate 4-tuples (s, r, s′ , p) for all s ∈ N , s′ ∈ S such that r = RT (s, s′ ) and p = P(s, s′ ) (we
know P along with RT ), and the set of all these 4-tuples (for all s ∈ N , s′ ∈ S) consti-
tute the specification of PR , i.e., PR (s, r, s′ ) = p. This turns our reward-definition-altered
mathematical model of a Finite Markov Reward Process into a programming model of the
FiniteMarkovRewardProcess class. This reward-definition-altered model enables us to gain
from the fact that we can leverage the algorithms we’ll be writing for Finite Markov Reward
Processes (including some simple and elegant linear-algebra-solver-based solutions). The
downside of this reward-definition-altered model is that it prevents us from generating
samples of the specific rewards encountered when transitioning from one state to another
(because we no longer capture the probabilities of individual reward outcomes). Note
that we can indeed generate sampling traces, but each transition step in the sampling trace
will only show us the “mean reward” (specifically, the expected reward conditioned on
current state and next state).
In fact, most Markov Reward Processes you'd encounter in practice can be modeled as a combina-
tion of RT and P, and you’d simply follow the above RT to PR representation transforma-
tion drill to present this information in the form of PR to instantiate a FiniteMarkovRewardProcess.
We designed the interface to accept PR as input since that is the most general interface for
specifying Markov Reward Processes.
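A sketch of this transformation drill in code (this helper is not part of the book's codebase; the names are illustrative) might look as follows, producing exactly the kind of Mapping that the FiniteMarkovRewardProcess constructor expects:

from typing import Callable, Mapping, Tuple, TypeVar
from rl.distribution import Categorical, FiniteDistribution

S = TypeVar('S')

def p_and_rt_to_pr(
    transition_map: Mapping[S, FiniteDistribution[S]],
    reward_transition: Callable[[S, S], float]
) -> Mapping[S, Categorical[Tuple[S, float]]]:
    # For each current state s, attach the reward R_T(s, s') to each next
    # state s' while keeping the transition probability P(s, s')
    return {
        s: Categorical({
            (s1, reward_transition(s, s1)): p for s1, p in d
        })
        for s, d in transition_map.items()
    }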
So now let’s write some code for the simple inventory example as a Finite Markov Re-
ward Process as described above. All we have to do is to create a derived class inherited
from FiniteMarkovRewardProcess and write a method to construct the transition_reward_map
input to the constructor (__init__) of FiniteMarkovRewardProcess (i.e., PR ). Note that the
generic state type S is replaced here with the @dataclass InventoryState to represent the
inventory state, comprising of the On-Hand and On-Order inventory quantities.
from scipy.stats import poisson
@dataclass(frozen=True)
class InventoryState:
on_hand: int
on_order: int
def inventory_position(self) -> int:
return self.on_hand + self.on_order
class SimpleInventoryMRPFinite(FiniteMarkovRewardProcess[InventoryState]):
def __init__(
self,
capacity: int,
poisson_lambda: float,
holding_cost: float,
stockout_cost: float
):
self.capacity: int = capacity
self.poisson_lambda: float = poisson_lambda
self.holding_cost: float = holding_cost
self.stockout_cost: float = stockout_cost
self.poisson_distr = poisson(poisson_lambda)
super().__init__(self.get_transition_reward_map())
def get_transition_reward_map(self) -> \
Mapping[
InventoryState,
FiniteDistribution[Tuple[InventoryState, float]]
]:
d: Dict[InventoryState, Categorical[Tuple[InventoryState, float]]] = {}
for alpha in range(self.capacity + 1):
for beta in range(self.capacity + 1 - alpha):
state = InventoryState(alpha, beta)
ip = state.inventory_position()
beta1 = self.capacity - ip
base_reward = - self.holding_cost * state.on_hand
sr_probs_map: Dict[Tuple[InventoryState, float], float] =\
{(InventoryState(ip - i, beta1), base_reward):
self.poisson_distr.pmf(i) for i in range(ip)}
probability = 1 - self.poisson_distr.cdf(ip - 1)
reward = base_reward - self.stockout_cost *\
(probability * (self.poisson_lambda - ip) +
ip * self.poisson_distr.pmf(ip))
sr_probs_map[(InventoryState(0, beta1), reward)] = probability
d[state] = Categorical(sr_probs_map)
return d

The above code is in the file rl/chapter2/simple_inventory_mrp.py. We encourage you
to play with the inputs to SimpleInventoryMRPFinite in __main__ and view the transition
probabilities and rewards of the constructed Finite Markov Reward Process.
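For instance, a quick way to eyeball the constructed transitions and rewards (a sketch, using the same parameter values as in the Value Function calculation that follows) is:

si_mrp = SimpleInventoryMRPFinite(
    capacity=2,
    poisson_lambda=1.0,
    holding_cost=1.0,
    stockout_cost=10.0
)
for s in si_mrp.non_terminal_states:
    print(f"From State {s.state}:")
    for (s1, r), p in si_mrp.transition_reward(s):
        print(f"  To State {s1.state} with Reward {r:.3f} and Probability {p:.3f}")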

1.12. Value Function of a Markov Reward Process

Now we are ready to formally define the main problem involving Markov Reward Pro-
cesses. As we’ve said earlier, we’d like to compute the “expected accumulated rewards”
from any non-terminal state. However, if we simply add up the rewards in a sampling
trace following time step t as $\sum_{i=t+1}^{\infty} R_i = R_{t+1} + R_{t+2} + \ldots$, the sum would often di-
verge to infinity. So we allow for rewards accumulation to be done with a discount factor
γ ∈ [0, 1]: We define the (random) Return Gt as the “discounted accumulation of future
rewards” following time step t. Formally,

$$G_t = \sum_{i=t+1}^{\infty} \gamma^{i-t-1} \cdot R_i = R_{t+1} + \gamma \cdot R_{t+2} + \gamma^2 \cdot R_{t+3} + \ldots$$

We use the above definition of Return even for a terminating sequence (say terminating
at t = T , i.e., ST ∈ T ), by treating Ri = 0 for all i > T .
Note that γ can range from a value of 0 on one extreme (called “myopic”) to a value
of 1 on another extreme (called “far-sighted”). “Myopic” means the Return is the same
as Reward (no accumulation of future Rewards in the Return). With “far-sighted” (γ =
1), the Return calculation can diverge for continuing (non-terminating) Markov Reward
Processes but “far-sighted” is indeed applicable for episodic Markov Reward Processes
(where all random sequences of the process terminate). Apart from the Return divergence
consideration, γ < 1 helps algorithms become more tractable (as we shall see later when
we get to Reinforcement Learning). We should also point out that the reason to have γ < 1
is not just for mathematical convenience or computational tractability - there are valid
modeling reasons to discount Rewards when accumulating to a Return. When Reward
is modeled as a financial quantity (revenues, costs, profits etc.), as will be the case in
most financial applications, it makes sense to incorporate time-value-of-money which is a
fundamental concept in Economics/Finance that says there is greater benefit in receiving a
dollar now versus later (which is the economic reason why interest is paid or earned). So
it is common to set γ to be the discounting based on the prevailing interest rate (γ = 1/(1+r)
where r is the interest rate over a single time step). Another technical reason for setting γ <
1 is that our models often don’t fully capture future uncertainty and so, discounting with
γ acts to undermine future rewards that might not be accurate (due to future uncertainty
modeling limitations). Lastly, from an AI perspective, if we want to build machines that
act like humans, psychologists have indeed demonstrated that human/animal behavior
prefers immediate reward over future reward.
Note that we are (as usual) assuming the fact that the Markov Reward Process is time-
homogeneous (time-invariant probabilities of state transitions and rewards).
As you might imagine now, we’d want to identify non-terminal states with large ex-
pected returns and those with small expected returns. This, in fact, is the main problem
involving a Markov Reward Process - to compute the “Expected Return” associated with
each non-terminal state in the Markov Reward Process. Formally, we are interested in
computing the Value Function
V :N →R

defined as:
V (s) = E[Gt |St = s] for all s ∈ N , for all t = 0, 1, 2, . . .

For the rest of the book, we will assume that whenever we are talking about a Value
Function, the discount factor γ is appropriate to ensure that the Expected Return from
each state is finite.
Now we show a creative piece of mathematics due to Richard Bellman. Bellman noted
(Bellman 1957b) that the Value Function has a recursive structure. Specifically,

Figure 1.7.: Visualization of MRP Bellman Equation

$$
\begin{aligned}
V(s) &= E[R_{t+1}|S_t=s] + \gamma \cdot E[R_{t+2}|S_t=s] + \gamma^2 \cdot E[R_{t+3}|S_t=s] + \ldots \\
&= R(s) + \gamma \cdot \sum_{s' \in N} P[S_{t+1}=s'|S_t=s] \cdot E[R_{t+2}|S_{t+1}=s'] \\
&\quad + \gamma^2 \cdot \sum_{s' \in N} P[S_{t+1}=s'|S_t=s] \sum_{s'' \in N} P[S_{t+2}=s''|S_{t+1}=s'] \cdot E[R_{t+3}|S_{t+2}=s''] \\
&\quad + \ldots \\
&= R(s) + \gamma \cdot \sum_{s' \in N} P(s,s') \cdot R(s') + \gamma^2 \cdot \sum_{s' \in N} P(s,s') \sum_{s'' \in N} P(s',s'') \cdot R(s'') + \ldots \\
&= R(s) + \gamma \cdot \sum_{s' \in N} P(s,s') \cdot (R(s') + \gamma \cdot \sum_{s'' \in N} P(s',s'') \cdot R(s'') + \ldots) \\
&= R(s) + \gamma \cdot \sum_{s' \in N} P(s,s') \cdot V(s') \quad \text{for all } s \in N
\end{aligned}
\tag{1.1}
$$

Note that although the transitions are to random states s′, s′′, . . . in the overall state space
S rather than just N, the right-hand-side above sums over states s′, s′′, . . . only in N because
transitions to terminal states (in T = S − N) don't contribute any reward beyond the
rewards produced before reaching the terminal state.
We refer to this recursive equation (1.1) for the Value Function as the Bellman Equation
for Markov Reward Processes. Figure 1.7 is a convenient visualization aid for this important
equation. In the rest of the book, we will depict quite a few of these types of state-transition
visualizations to aid with creating mental models of key concepts.
For the case of Finite Markov Reward Processes, assume S = {s1, s2, . . . , sn} and assume
N has m ≤ n states. Below we use bold-face notation to represent functions as column
vectors and matrices, since we have finite states/transitions. So, V is a column vector of
length m, P is an m × m matrix, and R is a column vector of length m (rows/columns
corresponding to states in N), so we can express the above equation in vector and matrix
notation as follows:

$$\mathbf{V} = \mathbf{R} + \gamma \mathbf{P} \cdot \mathbf{V}$$

Therefore,

$$\mathbf{V} = (\mathbf{I}_m - \gamma \mathbf{P})^{-1} \cdot \mathbf{R} \tag{1.2}$$

where $\mathbf{I}_m$ is the m × m identity matrix.


Let us write some code to implement the calculation of Equation (1.2). In the FiniteMarkovRewardProcess
class, we implement the method get_value_function_vec that performs the above calcu-
lation for the Value Function V in terms of the reward function R and the transition prob-
ability function P of the implicit Markov Process. The Value Function V is produced as a
1D numpy array (i.e., a vector). Here's the code:

def get_value_function_vec(self, gamma: float) -> np.ndarray:
    return np.linalg.solve(
        np.eye(len(self.non_terminal_states)) -
        gamma * self.get_transition_matrix(),
        self.reward_function_vec
    )

Invoking this get_value_function_vec method on SimpleInventoryMRPFinite for the sim-
ple case of capacity C = 2, Poisson mean λ = 1.0, holding cost h = 1.0, stockout cost
p = 10.0, and discount factor γ = 0.9 yields the following result:

{NonTerminal(state=InventoryState(on_hand=0, on_order=0)): -35.511,
 NonTerminal(state=InventoryState(on_hand=0, on_order=1)): -27.932,
 NonTerminal(state=InventoryState(on_hand=0, on_order=2)): -28.345,
 NonTerminal(state=InventoryState(on_hand=1, on_order=0)): -28.932,
 NonTerminal(state=InventoryState(on_hand=1, on_order=1)): -29.345,
 NonTerminal(state=InventoryState(on_hand=2, on_order=0)): -30.345}

The corresponding values of the attribute reward_function_vec (i.e., R) are:

{NonTerminal(state=InventoryState(on_hand=1, on_order=0)): -3.325,
 NonTerminal(state=InventoryState(on_hand=0, on_order=0)): -10.0,
 NonTerminal(state=InventoryState(on_hand=0, on_order=1)): -2.325,
 NonTerminal(state=InventoryState(on_hand=0, on_order=2)): -0.274,
 NonTerminal(state=InventoryState(on_hand=1, on_order=1)): -1.274,
 NonTerminal(state=InventoryState(on_hand=2, on_order=0)): -2.274}

This tells us that On-Hand of 0 and On-Order of 2 has the highest expected reward.
However, the Value Function is highest for On-Hand of 0 and On-Order of 1.
This computation of the Value Function works if the state space is not too large (the size
of the square linear system of equations is equal to the number of non-terminal states). When
the state space is large, this direct method of solving a linear system of equations won't
scale, and we have to resort to numerical methods to solve the recursive Bellman Equation.
This is the topic of the Dynamic Programming and Reinforcement Learning algorithms that
we shall learn in this book.

1.13. Summary of Key Learnings from this Chapter
Before we end this chapter, we’d like to highlight the two highly important concepts we
learnt in this chapter:

• Markov Property: A concept that enables us to reason effectively and compute effi-
ciently in practical systems involving sequential uncertainty
• Bellman Equation: A mathematical insight that enables us to express the Value Func-
tion recursively - this equation (and its Optimality version covered in Chapter 2) is
in fact the core idea within all Dynamic Programming and Reinforcement Learning
algorithms.

2. Markov Decision Processes
We’ve said before that this book is about “sequential decisioning” under “sequential un-
certainty.” In Chapter 1, we covered the “sequential uncertainty” aspect with the frame-
work of Markov Processes, and we extended the framework to also incorporate the no-
tion of uncertain “Reward” each time we make a state transition - we called this extended
framework Markov Reward Processes. However, this framework had no notion of “se-
quential decisioning.” In this chapter, we further extend the framework of Markov Re-
ward Processes to incorporate the notion of “sequential decisioning,” formally known as
Markov Decision Processes. Before we step into the formalism of Markov Decision Pro-
cesses, let us develop some intuition and motivation for the need to have such a framework
- to handle sequential decisioning. Let’s do this by re-visiting the simple inventory exam-
ple we covered in Chapter 1.

2.1. Simple Inventory Example: How much to Order?


When we covered the simple inventory example in Chapter 1 as a Markov Reward Process,
the ordering policy was:

θ = max(C − (α + β), 0)
where θ ∈ Z≥0 is the order quantity, C ∈ Z≥0 is the space capacity (in bicycle units) at
the store, α is the On-Hand Inventory and β is the On-Order Inventory ((α, β) comprising
the State). We calculated the Value Function for the Markov Reward Process that results
from following this policy. Now we ask the question: Is this Value Function good enough?
More importantly, we ask the question: Can we improve this Value Function by following a
different ordering policy? Perhaps by ordering less than that implied by the above formula
for θ? This leads to the natural question - Can we identify the ordering policy that yields
the Optimal Value Function (one with the highest expected returns, i.e., lowest expected
accumulated costs, from each state)? Let us get an intuitive sense for this optimization
problem by considering a concrete example.
Assume that instead of bicycles, we want to control the inventory of a specific type of
toothpaste in the store. Assume you have space for 20 units of toothpaste on the shelf as-
signed to the toothpaste (assume there is no space in the backroom of the store). Assume
that customer demand follows a Poisson distribution with Poisson parameter λ = 3.0. At
6pm store-closing each evening, when you observe the State as (α, β), you now have a
choice of ordering a quantity of toothpastes from any of the following values for the or-
der quantity θ : {0, 1, . . . , max(20 − (α + β), 0)}. Let’s say at Monday 6pm store-closing,
α = 4 and β = 3. So, you have a choice of order quantities from among the integers in
the range of 0 to (20 - (4 + 3) = 13) (i.e., 14 choices). Previously, in the Markov Reward
Process model, you would have ordered 13 units on Monday store-closing. This means
on Wednesday morning at 6am, a truck would have arrived with 13 units of the tooth-
paste. If you sold say 2 units of the toothpaste on Tuesday, then on Wednesday 8am at
store-opening, you’d have 4 + 3 - 2 + 13 = 18 units of toothpaste on your shelf. If you keep

following this policy, you’d typically have almost a full shelf at store-opening each day,
which covers almost a week's worth of expected demand for the toothpaste. This means
your risk of going out-of-stock on the toothpaste is extremely low, but you’d be incurring
considerable holding cost (you’d have close to a full shelf of toothpastes sitting around
almost each night). So as a store manager, you’d be thinking - “I can lower my costs by
ordering less than that prescribed by the formula of 20 − (α + β).” But how much less?
If you order too little, you’d start the day with too little inventory and might risk going
out-of-stock. That’s a risk you are highly uncomfortable with since the stockout cost per
unit of missed demand (we called it p) is typically much higher than the holding cost per
unit (we called it h). So you’d rather “err” on the side of having more inventory. But how
much more? We also need to factor in the fact that the 36-hour lead time means a large
order incurs large holding costs two days later. Most importantly, to find this right balance
in terms of a precise mathematical optimization of the Value Function, we’d have to factor
in the uncertainty of demand (based on daily Poisson probabilities) in our calculations.
Now this gives you a flavor of the problem of sequential decisioning (each day you have
to decide how much to order) under sequential uncertainty.
To deal with the “decisioning” aspect, we will introduce the notion of Action to com-
plement the previously introduced notions of State and Reward. In the inventory example,
the order quantity is our Action. After observing the State, we choose from among a set
of Actions (in this case, we choose from within the set {0, 1, . . . , max(C − (α + β), 0)}).
We note that the Action we take upon observing a state affects the next day’s state. This
is because the next day’s On-Order is exactly equal to today’s order quantity (i.e., today’s
action). This in turn might affect our next day’s action since the action (order quantity)
is typically a function of the state (On-Hand and On-Order inventory). Also note that the
Action we take on a given day will influence the Rewards after a couple of days (i.e. after
the order arrives). It may affect our holding cost adversely if we had ordered too much or
it may affect our stockout cost adversely if we had ordered too little and then experienced
high demand.

2.2. The Difficulty of Sequential Decisioning under Uncertainty


This simple inventory example has given us a peek into the world of Markov Decision
Processes, which in general, have two distinct (and inter-dependent) high-level features:

• At each time step t, an Action At is picked (from among a specified choice of actions)
upon observing the State St
• Given an observed State St and a performed Action At , the probabilities of the state
and reward of the next time step (St+1 and Rt+1 ) are in general a function of not just
the state St , but also of the action At .

We are tasked with maximizing the Expected Return from each state (i.e., maximizing
the Value Function). This seems like a pretty hard problem in the general case because
there is a cyclic interplay between:

• actions depending on state, on one hand, and

• next state/reward probabilities depending on action (and state) on the other hand.

There is also the challenge that actions might have delayed consequences on rewards,
and it’s not clear how to disentangle the effects of actions from different time steps on a

Figure 2.1.: Markov Decision Process

future reward. So without direct correspondence between actions and rewards, how can
we control the actions so as to maximize expected accumulated rewards? To answer this
question, we will need to set up some notation and theory. Before we formally define the
Markov Decision Process framework and its associated (elegant) theory, let us set up a
bit of terminology.
Using the language of AI, we say that at each time step t, the Agent (the algorithm we
design) observes the state St, after which the Agent performs action At, after which the
Environment (upon seeing St and At) produces a random pair: the next state St+1
and the next reward Rt+1, after which the Agent observes this next state St+1, and the cycle
repeats (until we reach a terminal state). This cyclic interplay is depicted in Figure 2.1.
Note that time ticks over from t to t + 1 when the environment sees the state St and action
At.
The MDP framework was formalized in a paper by Richard Bellman (Bellman 1957a)
and the MDP theory was developed further in Richard Bellman’s book named Dynamic
Programming (Bellman 1957b) and in Ronald Howard’s book named Dynamic Programming
and Markov Processes (Howard 1960).

2.3. Formal Definition of a Markov Decision Process


Similar to the definitions of Markov Processes and Markov Reward Processes, for ease
of exposition, the definitions and theory of Markov Decision Processes below will be for
discrete-time, for countable state spaces and countable set of pairs of next state and reward
transitions (with the knowledge that the definitions and theory are analogously extensible
to continuous-time and uncountable spaces, which we shall indeed encounter later in this
book).

Definition 2.3.1. A Markov Decision Process comprises of:

• A countable set of states S (known as the State Space), a set T ⊆ S (known as the
set of Terminal States), and a countable set of actions A

• A time-indexed sequence of environment-generated random states St ∈ S for time

steps t = 0, 1, 2, . . ., a time-indexed sequence of environment-generated Reward ran-
dom variables Rt ∈ D (a countable subset of R) for time steps t = 1, 2, . . ., and a time-
indexed sequence of agent-controllable actions At ∈ A for time steps t = 0, 1, 2, . . ..
(Sometimes we restrict the set of actions allowable from specific states, in which case,
we abuse the A notation to refer to a function whose domain is N and range is A,
and we say that the set of actions allowable from a state s ∈ N is A(s).)

• Markov Property:

P[(Rt+1 , St+1 )|(St , At , St−1 , At−1 , . . . , S0 , A0 )] = P[(Rt+1 , St+1 )|(St , At )] for all t ≥ 0

• Termination: If an outcome for ST (for some time step T ) is a state in the set T , then
this sequence outcome terminates at time step T .

As in the case of Markov Reward Processes, we denote the set of non-terminal states
S − T as N and refer to any state in N as a non-terminal state. The sequence:

S0, A0, R1, S1, A1, R2, S2, . . .

terminates at time step T if ST ∈ T (i.e., the final reward is RT and the final action is
AT −1 ).
In the more general case, where states or rewards are uncountable, the same concepts
apply except that the mathematical formalism needs to be more detailed and more careful.
Specifically, we’d end up with integrals instead of summations, and probability density
functions (for continuous probability distributions) instead of probability mass functions
(for discrete probability distributions). For ease of notation and more importantly, for
ease of understanding of the core concepts (without being distracted by heavy mathemat-
ical formalism), we’ve chosen to stay with discrete-time, countable S, countable A and
countable D (by default). However, there will be examples of Markov Decision Processes
in this book involving continuous-time and uncountable S, A and D (please adjust the
definitions and formulas accordingly).
As in the case of Markov Processes and Markov Reward Processes, we shall (by default)
assume Time-Homogeneity for Markov Decision Processes, i.e., P[(Rt+1 , St+1 )|(St , At )] is
independent of t. This means the transition probabilities of a Markov Decision Process can,
in the most general case, be expressed as a state-reward transition probability function:

PR : N × A × D × S → [0, 1]
defined as:

PR (s, a, r, s′ ) = P[(Rt+1 = r, St+1 = s′ )|(St = s, At = a)]


for time steps t = 0, 1, 2, . . ., for all s ∈ N, a ∈ A, r ∈ D, s′ ∈ S, such that

$$\sum_{s' \in S} \sum_{r \in D} P_R(s, a, r, s') = 1 \quad \text{for all } s \in N, a \in A$$

Henceforth, any time we say Markov Decision Process, assume we are referring to a
Discrete-Time, Time-Homogeneous Markov Decision Process with countable spaces and
countable transitions (unless explicitly specified otherwise), which in turn can be charac-
terized by the state-reward transition probability function PR . Given a specification of PR ,
we can construct:

• The state transition probability function

  $$P : N \times A \times S \rightarrow [0, 1]$$

  defined as:

  $$P(s, a, s') = \sum_{r \in D} P_R(s, a, r, s')$$

• The reward transition function:

  $$R_T : N \times A \times S \rightarrow \mathbb{R}$$

  defined as:

  $$R_T(s, a, s') = E[R_{t+1} | (S_{t+1} = s', S_t = s, A_t = a)]
  = \sum_{r \in D} \frac{P_R(s, a, r, s')}{P(s, a, s')} \cdot r
  = \sum_{r \in D} \frac{P_R(s, a, r, s')}{\sum_{r \in D} P_R(s, a, r, s')} \cdot r$$

The Rewards specification of most Markov Decision Processes we encounter in practice
can be directly expressed as the reward transition function RT (versus the more general
specification of PR). Lastly, we want to highlight that we can transform either of PR or
RT into a “more compact” reward function that is sufficient to perform key calculations
involving Markov Decision Processes. This reward function

$$R : N \times A \rightarrow \mathbb{R}$$

is defined as:

$$R(s, a) = E[R_{t+1} | (S_t = s, A_t = a)]
= \sum_{s' \in S} P(s, a, s') \cdot R_T(s, a, s')
= \sum_{s' \in S} \sum_{r \in D} P_R(s, a, r, s') \cdot r$$
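As a quick illustration of these transformations, here is a minimal sketch (using a hypothetical nested-dictionary representation of PR for a tiny finite MDP, not the book's FiniteMarkovDecisionProcess interface) that computes P(s, a, s') and R(s, a) from PR:

from collections import defaultdict

# Hypothetical finite P_R: maps (s, a) to a dict of (r, s') -> probability
P_R = {
    ("s1", "a1"): {(5.0, "s1"): 0.3, (2.0, "s2"): 0.7},
    ("s1", "a2"): {(1.0, "s2"): 1.0},
}

def state_transition_prob(p_r):
    # P(s, a, s') = sum over r of P_R(s, a, r, s')
    p = defaultdict(float)
    for (s, a), dist in p_r.items():
        for (r, s1), prob in dist.items():
            p[(s, a, s1)] += prob
    return dict(p)

def reward_function(p_r):
    # R(s, a) = sum over (r, s') of P_R(s, a, r, s') * r
    return {(s, a): sum(prob * r for (r, s1), prob in dist.items())
            for (s, a), dist in p_r.items()}

print(state_transition_prob(P_R))  # e.g. {('s1','a1','s1'): 0.3, ('s1','a1','s2'): 0.7, ('s1','a2','s2'): 1.0}
print(reward_function(P_R))        # e.g. {('s1','a1'): 2.9, ('s1','a2'): 1.0}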

2.4. Policy
Having understood the dynamics of a Markov Decision Process, we now move on to the
specification of the Agent’s actions as a function of the current state. In the general case,
we assume that the Agent will perform a random action At , according to a probability
distribution that is a function of the current state St . We refer to this function as a Policy.
Formally, a Policy is a function

$$\pi : N \times A \rightarrow [0, 1]$$

defined as:

$$\pi(s, a) = P[A_t = a | S_t = s] \quad \text{for time steps } t = 0, 1, 2, \ldots, \text{ for all } s \in N, a \in A$$

such that

$$\sum_{a \in A} \pi(s, a) = 1 \quad \text{for all } s \in N$$
Note that the definition above assumes that a Policy is Markovian, i.e., the action prob-
abilities depend only on the current state and not the history. The definition above also
assumes that a Policy is Stationary, i.e., P[At = a|St = s] is invariant in time t. If we do
encounter a situation where the policy would need to depend on the time t, we’ll simply
include t to be part of the state, which would make the Policy stationary (albeit at the cost
of state-space bloat and hence, computational cost).
When we have a policy such that the action probability distribution for each state is
concentrated on a single action, we refer to it as a deterministic policy. Formally, a deter-
ministic policy πD : N → A has the property that for all s ∈ N ,

π(s, πD(s)) = 1 and π(s, a) = 0 for all a ∈ A with a ≠ πD(s)


So we shall denote deterministic policies simply as the function πD . We shall refer to
non-deterministic policies as stochastic policies (the word stochastic reflecting the fact that
the agent will perform a random action according to the probability distribution specified
by π). So when we use the notation π, assume that we are dealing with a stochastic (i.e.,
non-deterministic) policy and when we use the notation πD , assume that we are dealing
with a deterministic policy.
Let’s write some code to get a grip on the concept of Policy—we start with the design
of an abstract class called Policy that represents a general Policy, as we have articulated
above. The only method it contains is an abstract method act that accepts as input a state:
NonTerminal[S] (as seen before in the classes MarkovProcess and MarkovRewardProcess, S is
a generic type to represent a generic state space) and produces as output a Distribution[A]
representing the probability distribution of the random action as a function of the input
non-terminal state. Note that A represents a generic type to represent a generic action
space.
A = TypeVar('A')
S = TypeVar('S')

class Policy(ABC, Generic[S, A]):
    @abstractmethod
    def act(self, state: NonTerminal[S]) -> Distribution[A]:
        pass

Next, we implement a class for deterministic policies.


@dataclass(frozen=True)
class DeterministicPolicy(Policy[S, A]):
    action_for: Callable[[S], A]

    def act(self, state: NonTerminal[S]) -> Constant[A]:
        return Constant(self.action_for(state.state))

We will often encounter policies that assign equal probabilities to all actions, from each
non-terminal state. We implement this class of policies as follows:
from rl.distribution import Choose

@dataclass(frozen=True)
class UniformPolicy(Policy[S, A]):
    valid_actions: Callable[[S], Iterable[A]]

    def act(self, state: NonTerminal[S]) -> Choose[A]:
        return Choose(self.valid_actions(state.state))

The above code is in the file rl/policy.py.
Now let’s write some code to create some concrete policies for an example we are famil-
iar with - the simple inventory example. We first create a concrete class SimpleInventoryDeterministicPolicy
for deterministic inventory replenishment policies that is a derived class of DeterministicPolicy.
Note that the generic state type S is replaced here with the class InventoryState that repre-
sents a state in the inventory example, comprising of the On-Hand and On-Order inven-
tory quantities. Also note that the generic action type A is replaced here with the int type
since in this example, the action is the quantity of inventory to be ordered at store-closing
(which is an integer quantity). Invoking the act method of SimpleInventoryDeterministicPolicy
runs the following deterministic policy:

πD ((α, β)) = max(r − (α + β), 0)


where r is a parameter representing the “reorder point” (meaning, we order only when
the inventory position falls below the “reorder point”), α is the On-Hand Inventory at
store-closing, β is the On-Order Inventory at store-closing, and inventory position is equal
to α + β. In Chapter 1, we set the reorder point to be equal to the store capacity C.

from rl.distribution import Constant

@dataclass(frozen=True)
class InventoryState:
    on_hand: int
    on_order: int

    def inventory_position(self) -> int:
        return self.on_hand + self.on_order


class SimpleInventoryDeterministicPolicy(
    DeterministicPolicy[InventoryState, int]
):
    def __init__(self, reorder_point: int):
        self.reorder_point: int = reorder_point

        def action_for(s: InventoryState) -> int:
            return max(self.reorder_point - s.inventory_position(), 0)

        super().__init__(action_for)

We can instantiate a specific deterministic policy with a reorder point of say 8 as:

si_dp = SimpleInventoryDeterministicPolicy(reorder_point=8)

Now let’s write some code to create stochastic policies for the inventory example. We
create a concrete class SimpleInventoryStochasticPolicy that implements the interface of
the abstract class Policy (specifically implements the abstract method act). The code in
act implements a stochastic policy as a SampledDistribution[int] driven by a sampling of
the Poisson distribution for the reorder point. Specifically, the reorder point r is treated as
a Poisson random variable with a specified mean (of say λ ∈ R≥0 ). We sample a value of
the reorder point r from this Poisson distribution (with mean λ). Then, we create a sample
order quantity (action) θ ∈ Z≥0 defined as:

θ = max(r − (α + β), 0)

import numpy as np
from rl.distribution import SampledDistribution

class SimpleInventoryStochasticPolicy(Policy[InventoryState, int]):
    def __init__(self, reorder_point_poisson_mean: float):
        self.reorder_point_poisson_mean: float = reorder_point_poisson_mean

    def act(self, state: NonTerminal[InventoryState]) -> \
            SampledDistribution[int]:
        def action_func(state=state) -> int:
            reorder_point_sample: int = \
                np.random.poisson(self.reorder_point_poisson_mean)
            return max(
                reorder_point_sample - state.state.inventory_position(),
                0
            )
        return SampledDistribution(action_func)

We can instantiate a specific stochastic policy with a reorder point poisson distribution
mean of say 8.0 as:

si_sp = SimpleInventoryStochasticPolicy(reorder_point_poisson_mean=8.0)

We will revisit the simple inventory example in a bit after we cover the code for Markov
Decision Processes, when we’ll show how to simulate the Markov Decision Process for this
simple inventory example, with the agent running a deterministic policy. But before we
move on to the code design for Markov Decision Processes (to accompany the above im-
plementation of Policies), we need to cover an important insight linking Markov Decision
Processes, Policies and Markov Reward Processes.

2.5. [Markov Decision Process, Policy] := Markov Reward Process
This section has an important insight - that if we evaluate a Markov Decision Process
(MDP) with a fixed policy π (in general, with a fixed stochastic policy π), we get the
Markov Reward Process (MRP) that is implied by the combination of the MDP and the
policy π. Let’s clarify this with notational precision. But first we need to point out that we
have some notation clashes between MDP and MRP. We used PR to denote the transition
probability function of the MRP as well as to denote the state-reward transition probabil-
ity function of the MDP. We used P to denote the transition probability function of the
Markov Process implicit in the MRP as well as to denote the state transition probability
function of the MDP. We used RT to denote the reward transition function of the MRP
as well as to denote the reward transition function of the MDP. We used R to denote the
reward function of the MRP as well as to denote the reward function of the MDP. We can
resolve these notation clashes by noting the arguments to PR , P, RT and R, but to be extra-
clear, we’ll put a superscript of π to each of the functions PR , P, RT and R of the π-implied
MRP so as to distinguish between these functions for the MDP versus the π-implied MRP.
Let’s say we are given a fixed policy π and an MDP specified by it’s state-reward tran-
sition probability function PR . Then the transition probability function PR π of the MRP

implied by the evaluation of the MDP with the policy π is defined as:
X
PR
π
(s, r, s′ ) = π(s, a) · PR (s, a, r, s′ )
a∈A

Likewise,
X
P π (s, s′ ) = π(s, a) · P(s, a, s′ )
a∈A

100
X
RπT (s, s′ ) = π(s, a) · RT (s, a, s′ )
a∈A
X
Rπ (s) = π(s, a) · R(s, a)
a∈A

So any time we talk about an MDP evaluated with a fixed policy, you should know that
we are effectively talking about the implied MRP. This insight is now going to be key in
the design of our code to represent Markov Decision Processes.
We create an abstract class called MarkovDecisionProcess (code shown below) with two
abstract methods - step and actions. The step method is key: it is meant to specify the
distribution of pairs of next state and reward, given a non-terminal state and action. The
actions method’s interface specifies that it takes as input a state: NonTerminal[S] and
produces as output an Iterable[A] to represent the set of actions allowable for the input
state (since the set of actions can be potentially infinite - in which case we’d have to return
an Iterator[A] - the return type is fairly generic, i.e., Iterable[A]).
The apply_policy method takes as input a policy: Policy[S, A] and returns a MarkovRewardProcess
representing the implied MRP. Let’s understand the code in apply_policy: First, we con-
struct a class RewardProcess that implements the abstract method transition_reward of
MarkovRewardProcess. transition_reward takes as input a state: NonTerminal[S], creates
actions: Distribution[A] by applying the given policy on state, and finally uses the
apply method of Distribution to transform actions: Distribution[A] into a Distribution[Tuple[State[S],
float]] (distribution of (next state, reward) pairs) using the abstract method step.
We also write the simulate_actions method that is analogous to the simulate_reward
method we had written for MarkovRewardProcess for generating a sampling trace. In this
case, each step in the sampling trace involves sampling an action from the given policy and
then sampling the pair of next state and reward, given the state and sampled action. Each
generated TransitionStep object consists of the 4-tuple: (state, action, next state, reward).
Here’s the actual code:

from rl.distribution import Distribution

@dataclass(frozen=True)
class TransitionStep(Generic[S, A]):
    state: NonTerminal[S]
    action: A
    next_state: State[S]
    reward: float


class MarkovDecisionProcess(ABC, Generic[S, A]):
    @abstractmethod
    def actions(self, state: NonTerminal[S]) -> Iterable[A]:
        pass

    @abstractmethod
    def step(
        self,
        state: NonTerminal[S],
        action: A
    ) -> Distribution[Tuple[State[S], float]]:
        pass

    def apply_policy(self, policy: Policy[S, A]) -> MarkovRewardProcess[S]:
        mdp = self

        class RewardProcess(MarkovRewardProcess[S]):
            def transition_reward(
                self,
                state: NonTerminal[S]
            ) -> Distribution[Tuple[State[S], float]]:
                actions: Distribution[A] = policy.act(state)
                return actions.apply(lambda a: mdp.step(state, a))

        return RewardProcess()

    def simulate_actions(
        self,
        start_states: Distribution[NonTerminal[S]],
        policy: Policy[S, A]
    ) -> Iterable[TransitionStep[S, A]]:
        state: State[S] = start_states.sample()
        while isinstance(state, NonTerminal):
            action_distribution = policy.act(state)
            action = action_distribution.sample()
            next_distribution = self.step(state, action)
            next_state, reward = next_distribution.sample()
            yield TransitionStep(state, action, next_state, reward)
            state = next_state

The above code is in the file rl/markov_decision_process.py.

2.6. Simple Inventory Example with Unlimited Capacity (Infinite State/Action Space)
Now we come back to our simple inventory example. Unlike previous situations of this ex-
ample, here we assume that there is no space capacity constraint on toothpaste. This means
we have a choice of ordering any (unlimited) non-negative integer quantity of toothpaste
units. Therefore, the action space is infinite. Also, since the order quantity shows up as
On-Order the next day and as delivered inventory the day after the next day, the On-Hand
and On-Order quantities are also unbounded. Hence, the state space is infinite. Due to the
infinite state and action spaces, we won’t be able to take advantage of the so-called “Tab-
ular Dynamic Programming Algorithms” we will cover in Chapter 3 (algorithms that are
meant for finite state and action spaces). There is still significant value in modeling infi-
nite MDPs of this type because we can perform simulations (by sampling from an infinite
space). Simulations are valuable not just to explore various properties and metrics rele-
vant in the real-world problem modeled with an MDP, but simulations also enable us to
design approximate algorithms to calculate Value Functions for given policies as well as
Optimal Value Functions (which is the ultimate purpose of modeling MDPs).
We will cover details on these approximate algorithms later in the book - for now, it’s
important for you to simply get familiar with how to model infinite MDPs of this type. This
infinite-space inventory example serves as a great introduction to modeling an infinite
(but countable) MDP.
We create a concrete class SimpleInventoryMDPNoCap that implements the abstract class
MarkovDecisionProcess (specifically implements abstract methods step and actions). The
attributes poisson_lambda, holding_cost and stockout_cost have the same semantics as
what we had covered for Markov Reward Processes in Chapter 1 (SimpleInventoryMRP).
The step method takes as input a state: NonTerminal[InventoryState] and an order: int
(representing the MDP action). We sample from the poisson probability distribution of
customer demand (calling it demand_sample: int). Using order: int and demand_sample:
int, we obtain a sample of the pair of next_state: InventoryState and reward: float. This
sample pair is returned as a SampledDistribution object. The above sampling dynamics

effectively describe the MDP in terms of this step method. The actions method returns
an Iterator[int], an infinite generator of non-negative integers to represent the fact that
the action space (order quantities) for any state comprise of all non-negative integers.

import itertools
import numpy as np
from rl.distribution import SampledDistribution

@dataclass(frozen=True)
class SimpleInventoryMDPNoCap(MarkovDecisionProcess[InventoryState, int]):
    poisson_lambda: float
    holding_cost: float
    stockout_cost: float

    def step(
        self,
        state: NonTerminal[InventoryState],
        order: int
    ) -> SampledDistribution[Tuple[State[InventoryState], float]]:

        def sample_next_state_reward(
            state=state,
            order=order
        ) -> Tuple[State[InventoryState], float]:
            demand_sample: int = np.random.poisson(self.poisson_lambda)
            ip: int = state.state.inventory_position()
            next_state: InventoryState = InventoryState(
                max(ip - demand_sample, 0),
                order
            )
            reward: float = - self.holding_cost * state.state.on_hand\
                - self.stockout_cost * max(demand_sample - ip, 0)
            return NonTerminal(next_state), reward

        return SampledDistribution(sample_next_state_reward)

    def actions(self, state: NonTerminal[InventoryState]) -> Iterator[int]:
        return itertools.count(start=0, step=1)

We leave it to you as an exercise to run various simulations of the MRP implied by the de-
terministic and stochastic policy instances we had created earlier (the above code is in the
file rl/chapter3/simple_inventory_mdp_nocap.py). See the method fraction_of_days_oos
in this file as an example of a simulation to calculate the percentage of days when we’d be
unable to satisfy some customer demand for toothpaste due to too little inventory at store-
opening (naturally, the higher the re-order point in the policy, the lesser the percentage of
days when we’d be Out-of-Stock). This kind of simulation exercise helps build intuition
on the tradeoffs we have to make between having too little inventory versus having too
much inventory (stockout costs versus holding costs) - essentially leading to our ultimate
goal of determining the Optimal Policy (more on this later).
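To give a flavor of such a simulation (a rough sketch of the idea, not the exact fraction_of_days_oos implementation in that file; the parameters, start state and step count below are arbitrary illustrative choices), one could estimate the fraction of out-of-stock days by simulating the MDP with the deterministic policy si_dp created earlier and counting the days on which a stockout cost was incurred:

import itertools
from rl.distribution import Constant

si_mdp_nocap = SimpleInventoryMDPNoCap(poisson_lambda=2.0, holding_cost=1.0,
                                       stockout_cost=10.0)
start = Constant(NonTerminal(InventoryState(on_hand=0, on_order=0)))

num_steps = 10000
steps = itertools.islice(si_mdp_nocap.simulate_actions(start, si_dp), num_steps)
# With no stockout, the reward is exactly -holding_cost * on_hand; a strictly
# smaller reward signals that some customer demand went unmet on that day.
oos_days = sum(
    1 for step in steps
    if step.reward < -si_mdp_nocap.holding_cost * step.state.state.on_hand
)
print(f"Fraction of out-of-stock days: {oos_days / num_steps:.3f}")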

2.7. Finite Markov Decision Processes


Certain calculations for Markov Decision Processes can be performed easily if:

• The state space is finite (S = {s1 , s2 , . . . , sn }),


• The action space A(s) is finite for each s ∈ N ,
• The set of unique pairs of next state and reward transitions from each pair of current
non-terminal state and action is finite.

If we satisfy the above three characteristics, we refer to the Markov Decision Process as
a Finite Markov Decision Process. Let us write some code for a Finite Markov Decision
Process. We create a concrete class FiniteMarkovDecisionProcess that implements the in-
terface of the abstract class MarkovDecisionProcess (specifically implements the abstract
methods step and the actions). Our first task is to think about the data structure required
to specify an instance of FiniteMarkovDecisionProcess (i.e., the data structure we’d pass
to the __init__ method of FiniteMarkovDecisionProcess). Analogous to how we curried
PR for a Markov Reward Process as N → (S × D → [0, 1]) (where S = {s1 , s2 , . . . , sn } and
N has m ≤ n states), here we curry PR for the MDP as:

N → (A → (S × D → [0, 1]))

Since S is finite, A is finite, and the set of next state and reward transitions for each pair
of current state and action is also finite, we can represent PR as a data structure of type
StateActionMapping[S, A] as shown below:

StateReward = FiniteDistribution[Tuple[State[S], float]]


ActionMapping = Mapping[A, StateReward[S]]
StateActionMapping = Mapping[NonTerminal[S], ActionMapping[A, S]]

The constructor (__init__ method) of FiniteMarkovDecisionProcess takes as input mapping


which is essentially of the same structure as StateActionMapping[S, A], except that the
Mapping is specified in terms of S rather than NonTerminal[S] or State[S] so as to make it
easy for a user to specify a FiniteMarkovDecisionProcess without the overhead of wrap-
ping S in NonTerminal[S] or Terminal[S]. But this means __init__ needs to do the wrapping
to construct the attribute self.mapping: StateActionMapping[S, A]. This represents the
complete structure of the Finite MDP - it maps each non-terminal state to an action map,
and it maps each action in each action map to a finite probability distribution of pairs of
next state and reward (essentially the structure of the PR function). Along with the at-
tribute self.mapping, we also have an attribute non_terminal_states: Sequence[NonTerminal[S]]
that is an ordered sequence of non-terminal states. Now let’s consider the implemen-
tation of the abstract method step of MarkovDecisionProcess. It takes as input a state:
NonTerminal[S] and an action: A. self.mapping[state][action] gives us an object of type
FiniteDistribution[Tuple[State[S], float]] which represents a finite probability distri-
bution of pairs of next state and reward, which is exactly what we want to return. This sat-
isfies the responsibility of FiniteMarkovDecisionProcess in terms of implementing the ab-
stract method step of the abstract class MarkovDecisionProcess. The other abstract method
to implement is the actions method which produces an Iterable on the allowed actions
A(s) for a given s ∈ N by invoking self.mapping[state].keys(). The __repr__ method
shown below is quite straightforward.
from rl.distribution import FiniteDistribution, SampledDistribution

class FiniteMarkovDecisionProcess(MarkovDecisionProcess[S, A]):
    mapping: StateActionMapping[S, A]
    non_terminal_states: Sequence[NonTerminal[S]]

    def __init__(
        self,
        mapping: Mapping[S, Mapping[A, FiniteDistribution[Tuple[S, float]]]]
    ):
        non_terminals: Set[S] = set(mapping.keys())
        self.mapping = {NonTerminal(s): {a: Categorical(
            {(NonTerminal(s1) if s1 in non_terminals else Terminal(s1), r): p
             for (s1, r), p in v}
        ) for a, v in d.items()} for s, d in mapping.items()}
        self.non_terminal_states = list(self.mapping.keys())

    def __repr__(self) -> str:
        display = ""
        for s, d in self.mapping.items():
            display += f"From State {s.state}:\n"
            for a, d1 in d.items():
                display += f"  With Action {a}:\n"
                for (s1, r), p in d1:
                    opt = "Terminal " if isinstance(s1, Terminal) else ""
                    display += f"    To [{opt}State {s1.state} and "\
                        + f"Reward {r:.3f}] with Probability {p:.3f}\n"
        return display

    def step(self, state: NonTerminal[S], action: A) -> StateReward[S]:
        action_map: ActionMapping[A, S] = self.mapping[state]
        return action_map[action]

    def actions(self, state: NonTerminal[S]) -> Iterable[A]:
        return self.mapping[state].keys()

Now that we’ve implemented a finite MDP, let’s implement a finite policy that maps each
non-terminal state to a probability distribution over a finite set of actions. So we create a
concrete class @datasclass FinitePolicy that implements the interface of the abstract class
Policy (specifically implements the abstract method act). An instance of FinitePolicy is
specified with the attribute self.policy_map: Mapping[S, FiniteDistribution[A]]] since
this type captures the structure of the π : N × A → [0, 1] function in the curried form

N → (A → [0, 1])
for the case of finite S and finite A. The act method is straightforward. We also implement
a __repr__ method for pretty-printing of self.policy_map.

@dataclass(frozen=True)
class FinitePolicy(Policy[S, A]):
    policy_map: Mapping[S, FiniteDistribution[A]]

    def __repr__(self) -> str:
        display = ""
        for s, d in self.policy_map.items():
            display += f"For State {s}:\n"
            for a, p in d:
                display += f"  Do Action {a} with Probability {p:.3f}\n"
        return display

    def act(self, state: NonTerminal[S]) -> FiniteDistribution[A]:
        return self.policy_map[state.state]

Let’s also implement a finite deterministic policy as a derived class of FinitePolicy.

class FiniteDeterministicPolicy(FinitePolicy[S, A]):
    action_for: Mapping[S, A]

    def __init__(self, action_for: Mapping[S, A]):
        self.action_for = action_for
        super().__init__(policy_map={s: Constant(a) for s, a in
                                     self.action_for.items()})

    def __repr__(self) -> str:
        display = ""
        for s, a in self.action_for.items():
            display += f"For State {s}: Do Action {a}\n"
        return display

Armed with a FinitePolicy class, we can now write a method apply_finite_policy in
FiniteMarkovDecisionProcess that takes as input a policy: FinitePolicy[S, A] and re-
turns a FiniteMarkovRewardProcess[S] by processing the finite structures of both of the
MDP and the Policy, and producing a finite structure of the implied MRP.

from collections import defaultdict
from rl.distribution import FiniteDistribution, Categorical

def apply_finite_policy(self, policy: FinitePolicy[S, A])\
        -> FiniteMarkovRewardProcess[S]:

    transition_mapping: Dict[S, FiniteDistribution[Tuple[S, float]]] = {}

    for state in self.mapping:
        action_map: ActionMapping[A, S] = self.mapping[state]
        outcomes: DefaultDict[Tuple[S, float], float]\
            = defaultdict(float)
        actions = policy.act(state)
        for action, p_action in actions:
            for (s1, r), p in action_map[action]:
                outcomes[(s1.state, r)] += p_action * p

        transition_mapping[state.state] = Categorical(outcomes)

    return FiniteMarkovRewardProcess(transition_mapping)

The code for FiniteMarkovDecisionProcess is in rl/markov_decision_process.py and the
code for FinitePolicy and FiniteDeterministicPolicy is in rl/policy.py.

2.8. Simple Inventory Example as a Finite Markov Decision Process
Now we’d like to model the simple inventory example as a Finite Markov Decision Pro-
cess so we can take advantage of the algorithms specifically for Finite Markov Decision
Processes. To enable finite states and finite actions, we now re-introduce the constraint of
space capacity C and apply the restriction that the order quantity (action) cannot exceed
C − (α + β) where α is the On-Hand component of the State and β is the On-Order compo-
nent of the State. Thus, the action space for any given state (α, β) ∈ S is finite. Next, note
that this ordering policy ensures that in steady-state, the sum of On-Hand and On-Order
will not exceed the capacity C. So we constrain the set of states to be the steady-state set
of finite states
S = {(α, β)|α ∈ Z≥0 , β ∈ Z≥0 , 0 ≤ α + β ≤ C}
Although the set of states is finite, there are an infinite number of pairs of next state and
reward outcomes possible from any given pair of current state and action. This is because
there are an infinite set of possibilities of customer demand on any given day (resulting in
infinite possibilities of stockout cost, i.e., negative reward, on any day). To qualify as a Fi-
nite Markov Decision Process, we need to model in a manner such that we have a finite set
of pairs of next state and reward outcomes from any given pair of current state and action.
So what we do is that instead of considering (St+1 , Rt+1 ) as the pair of next state and re-
ward, we model the pair of next state and reward to instead be (St+1 , E[Rt+1 |(St , St+1 , At )])
(we know PR due to the Poisson probabilities of customer demand, so we can actually cal-
culate this conditional expectation of reward). So given a state s and action a, the pairs of
next state and reward would be: (s′ , RT (s, a, s′ )) for all the s′ we transition to from (s, a).
Since the set of possible next states s′ are finite, these newly-modeled rewards associated
with the transitions (RT (s, a, s′ )) are also finite and hence, the set of pairs of next state

and reward from any pair of current state and action are also finite. Note that this cre-
ative alteration of the reward definition is purely to reduce this Markov Decision Process
into a Finite Markov Decision Process. Let’s now work out the calculation of the reward
transition function RT .
When the next state’s (St+1 ) On-Hand is greater than zero, it means all of the day’s
demand was satisfied with inventory that was available at store-opening (= α + β), and
hence, each of these next states St+1 correspond to no stockout cost and only an overnight
holding cost of hα. Therefore, for all α, β (with 0 ≤ α + β ≤ C) and for all order quantity
(action) θ (with 0 ≤ θ ≤ C − (α + β)):

RT ((α, β), θ, (α + β − i, θ)) = −hα for 0 ≤ i ≤ α + β − 1

When next state’s (St+1 ) On-Hand is equal to zero, there are two possibilities:

1. The demand for the day was exactly α + β, meaning all demand was satisifed with
available store inventory (so no stockout cost and only overnight holding cost), or
2. The demand for the day was strictly greater than α + β, meaning there’s some stock-
out cost in addition to overnight holding cost. The exact stockout cost is an expec-
tation calculation involving the number of units of missed demand under the corre-
sponding poisson probabilities of demand exceeding α + β.

This calculation is shown below (with f and F denoting the PMF and CDF of the Poisson
demand distribution):

$$R_T((\alpha, \beta), \theta, (0, \theta)) = -h\alpha - p(\sum_{j=\alpha+\beta+1}^{\infty} f(j) \cdot (j - (\alpha + \beta)))$$
$$= -h\alpha - p(\lambda(1 - F(\alpha + \beta - 1)) - (\alpha + \beta)(1 - F(\alpha + \beta)))$$
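As a quick sanity check of this closed form (a throwaway sketch with arbitrarily chosen h, p, λ, α, β values and a truncated brute-force sum), we can compare it against a direct expectation calculation using scipy:

from scipy.stats import poisson

h, p, lam = 1.0, 10.0, 1.0    # illustrative holding cost, stockout cost, Poisson mean
alpha, beta = 1, 1
n = alpha + beta
distr = poisson(lam)

# Brute-force expected units of missed demand: sum_{j > n} f(j) * (j - n), truncated
brute = sum(distr.pmf(j) * (j - n) for j in range(n + 1, 200))

# Closed form: lambda * (1 - F(n - 1)) - n * (1 - F(n))
closed = lam * (1 - distr.cdf(n - 1)) - n * (1 - distr.cdf(n))

print(-h * alpha - p * brute)    # reward via the brute-force expectation
print(-h * alpha - p * closed)   # reward via the closed form (matches closely)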


So now we have a specification of RT , but when it comes to our coding interface, we are
expected to specify PR as that is the interface through which we create a FiniteMarkovDecisionProcess.
Fear not - a specification of PR is easy once we have a specification of RT . We simply
create 5-tuples (s, a, r, s′ , p) for all s ∈ N , s′ ∈ S, a ∈ A such that r = RT (s, a, s′ ) and
p = P(s, a, s′ ) (we know P along with RT ), and the set of all these 5-tuples (for all
s ∈ N , s′ ∈ S, a ∈ A) constitute the specification of PR , i.e., PR (s, a, r, s′ ) = p. This
turns our reward-definition-altered mathematical model of a Finite Markov Decision Pro-
cess into a programming model of the FiniteMarkovDecisionProcess class. This reward-
definition-altered model enables us to gain from the fact that we can leverage the algo-
rithms we’ll be writing for Finite Markov Decision Processes (specifically, the classical
Dynamic Programming algorithms - covered in Chapter 3). The downside of this reward-
definition-altered model is that it prevents us from generating sampling traces of the spe-
cific rewards encountered when transitioning from one state to another (because we no
longer capture the probabilities of individual reward outcomes). Note that we can indeed
perform simulations, but each transition step in the sampling trace will only show us the
“mean reward” (specifically, the expected reward conditioned on current state, action and
next state).
In fact, most Markov Decision Processes you'd encounter in practice can be modeled as a combina-
tion of RT and P, and you'd simply follow the above RT-to-PR representation transforma-
tion drill to present this information in the form of PR to instantiate a FiniteMarkovDecisionProcess.
We designed the interface to accept PR as input since that is the most general interface for
specifying Markov Decision Processes.

107
So now let’s write some code for the simple inventory example as a Finite Markov De-
cision Process as described above. All we have to do is to create a derived class inherited
from FiniteMarkovDecisionProcess and write a method to construct the mapping (i.e., PR )
that the __init__ constructor of FiniteMarkovDecisionProcess requires as input. Note that
the generic state type S is replaced here with the @dataclass InventoryState to represent
the inventory state, comprising of the On-Hand and On-Order inventory quantities, and
the generic action type A is replaced here with int to represent the order quantity.

from scipy.stats import poisson
from rl.distribution import Categorical

InvOrderMapping = Mapping[
    InventoryState,
    Mapping[int, Categorical[Tuple[InventoryState, float]]]
]

class SimpleInventoryMDPCap(FiniteMarkovDecisionProcess[InventoryState, int]):

    def __init__(
        self,
        capacity: int,
        poisson_lambda: float,
        holding_cost: float,
        stockout_cost: float
    ):
        self.capacity: int = capacity
        self.poisson_lambda: float = poisson_lambda
        self.holding_cost: float = holding_cost
        self.stockout_cost: float = stockout_cost
        self.poisson_distr = poisson(poisson_lambda)
        super().__init__(self.get_action_transition_reward_map())

    def get_action_transition_reward_map(self) -> InvOrderMapping:
        d: Dict[InventoryState, Dict[int, Categorical[Tuple[InventoryState,
                                                            float]]]] = {}

        for alpha in range(self.capacity + 1):
            for beta in range(self.capacity + 1 - alpha):
                state: InventoryState = InventoryState(alpha, beta)
                ip: int = state.inventory_position()
                base_reward: float = - self.holding_cost * alpha
                d1: Dict[int, Categorical[Tuple[InventoryState, float]]] = {}

                for order in range(self.capacity - ip + 1):
                    sr_probs_dict: Dict[Tuple[InventoryState, float], float] =\
                        {(InventoryState(ip - i, order), base_reward):
                         self.poisson_distr.pmf(i) for i in range(ip)}

                    probability: float = 1 - self.poisson_distr.cdf(ip - 1)
                    reward: float = base_reward - self.stockout_cost *\
                        (probability * (self.poisson_lambda - ip) +
                         ip * self.poisson_distr.pmf(ip))
                    sr_probs_dict[(InventoryState(0, order), reward)] = \
                        probability
                    d1[order] = Categorical(sr_probs_dict)

                d[state] = d1

        return d

Now let’s test this out with some example inputs (as shown below). We construct an
instance of the SimpleInventoryMDPCap class with these inputs (named si_mdp below), then
construct an instance of the FinitePolicy[InventoryState, int] class (a deterministic pol-
icy, named fdp below), and combine them to produce the implied MRP (an instance of the
FiniteMarkovRewardProcess[InventoryState] class).

user_capacity = 2
user_poisson_lambda = 1.0
user_holding_cost = 1.0
user_stockout_cost = 10.0

si_mdp: FiniteMarkovDecisionProcess[InventoryState, int] =\
    SimpleInventoryMDPCap(
        capacity=user_capacity,
        poisson_lambda=user_poisson_lambda,
        holding_cost=user_holding_cost,
        stockout_cost=user_stockout_cost
    )

fdp: FiniteDeterministicPolicy[InventoryState, int] = \
    FiniteDeterministicPolicy(
        {InventoryState(alpha, beta): user_capacity - (alpha + beta)
         for alpha in range(user_capacity + 1)
         for beta in range(user_capacity + 1 - alpha)}
    )

implied_mrp: FiniteMarkovRewardProcess[InventoryState] =\
    si_mdp.apply_finite_policy(fdp)

The above code is in the file rl/chapter3/simple_inventory_mdp_cap.py. We encourage
you to play with the inputs in __main__, produce the resultant implied MRP, and explore
its characteristics (such as its Reward Function and its Value Function).

2.9. MDP Value Function for a Fixed Policy


Now we are ready to talk about the Value Function for an MDP evaluated with a fixed
policy π (also known as the MDP Prediction problem). The term Prediction refers to the fact
that this problem is about forecasting the expected future return when the agent follows
a specific policy. Just like in the case of MRP, we define the Return Gt at time step t for an
MDP as:
$$G_t = \sum_{i=t+1}^{\infty} \gamma^{i-t-1} \cdot R_i = R_{t+1} + \gamma \cdot R_{t+2} + \gamma^2 \cdot R_{t+3} + \ldots$$

where γ ∈ [0, 1] is a specified discount factor.


We use the above definition of Return even for a terminating sequence (say terminating
at t = T , i.e., ST ∈ T ), by treating Ri = 0 for all i > T .
The Value Function for an MDP evaluated with a fixed policy π

Vπ :N →R

is defined as:

V π (s) = Eπ,PR [Gt |St = s] for all s ∈ N , for all t = 0, 1, 2, . . .

For the rest of the book, we assume that whenever we are talking about a Value Function,
the discount factor γ is appropriate to ensure that the Expected Return from each state is
finite - in particular, γ < 1 for continuing (non-terminating) MDPs where the Return could
otherwise diverge.
We expand V π (s) = Eπ,PR [Gt |St = s] as follows:

$$
\begin{aligned}
& E_{\pi,P_R}[R_{t+1}|S_t=s] + \gamma \cdot E_{\pi,P_R}[R_{t+2}|S_t=s] + \gamma^2 \cdot E_{\pi,P_R}[R_{t+3}|S_t=s] + \ldots \\
&= \sum_{a \in A} \pi(s,a) \cdot R(s,a) + \gamma \cdot \sum_{a \in A} \pi(s,a) \sum_{s' \in N} P(s,a,s') \sum_{a' \in A} \pi(s',a') \cdot R(s',a') \\
&\quad + \gamma^2 \cdot \sum_{a \in A} \pi(s,a) \sum_{s' \in N} P(s,a,s') \sum_{a' \in A} \pi(s',a') \sum_{s'' \in N} P(s',a',s'') \sum_{a'' \in A} \pi(s'',a'') \cdot R(s'',a'') \\
&\quad + \ldots \\
&= R^{\pi}(s) + \gamma \cdot \sum_{s' \in N} P^{\pi}(s,s') \cdot R^{\pi}(s') + \gamma^2 \cdot \sum_{s' \in N} P^{\pi}(s,s') \sum_{s'' \in N} P^{\pi}(s',s'') \cdot R^{\pi}(s'') + \ldots
\end{aligned}
$$

But from Equation (1.1) in Chapter 1, we know that the last expression above is equal
to the π-implied MRP’s Value Function for state s. So, the Value Function V π of an MDP
evaluated with a fixed policy π is exactly the same function as the Value Function of the
π-implied MRP. So we can apply the MRP Bellman Equation on V π , i.e.,

$$
\begin{aligned}
V^{\pi}(s) &= R^{\pi}(s) + \gamma \cdot \sum_{s' \in N} P^{\pi}(s,s') \cdot V^{\pi}(s') \\
&= \sum_{a \in A} \pi(s,a) \cdot R(s,a) + \gamma \cdot \sum_{a \in A} \pi(s,a) \sum_{s' \in N} P(s,a,s') \cdot V^{\pi}(s') \\
&= \sum_{a \in A} \pi(s,a) \cdot (R(s,a) + \gamma \cdot \sum_{s' \in N} P(s,a,s') \cdot V^{\pi}(s')) \quad \text{for all } s \in N
\end{aligned}
\tag{2.1}
$$

As we saw in Chapter 1, for finite state spaces that are not too large, Equation (2.1) can be
solved for V π (i.e. solution to the MDP Prediction problem) with a linear algebra solution
(Equation (1.2) from Chapter 1). More generally, Equation (2.1) will be a key equation
for the rest of the book in developing various Dynamic Programming and Reinforcement
Learning algorithms for the MDP Prediction problem. However, there is another Value Function
that’s also going to be crucial in developing MDP algorithms - one which maps a (state,
action) pair to the expected return originating from the (state, action) pair when evaluated
with a fixed policy. This is known as the Action-Value Function of an MDP evaluated with
a fixed policy π:
Qπ : N × A → R
defined as:

Qπ (s, a) = Eπ,PR [Gt |(St = s, At = a)] for all s ∈ N , a ∈ A, for all t = 0, 1, 2, . . .

To avoid terminology confusion, we refer to V π as the State-Value Function (albeit often


simply abbreviated to Value Function) for policy π, to distinguish from the Action-Value
Function Qπ . The way to interpret Qπ (s, a) is that it’s the Expected Return from a given
non-terminal state s by first taking the action a and subsequently following policy π. With
this interpretation of Qπ (s, a), we can perceive V π (s) as the “weighted average” of Qπ (s, a)
(over all possible actions a from a non-terminal state s) with the weights equal to proba-
bilities of action a, given state s (i.e., π(s, a)). Precisely,
$$V^{\pi}(s) = \sum_{a \in A} \pi(s,a) \cdot Q^{\pi}(s,a) \quad \text{for all } s \in N \tag{2.2}$$

Figure 2.2.: Visualization of MDP State-Value Function Bellman Policy Equation

Combining Equation (2.1) and Equation (2.2) yields:

$$Q^{\pi}(s,a) = R(s,a) + \gamma \cdot \sum_{s' \in N} P(s,a,s') \cdot V^{\pi}(s') \quad \text{for all } s \in N, a \in A \tag{2.3}$$

Combining Equation (2.3) and Equation (2.2) yields:

$$Q^{\pi}(s,a) = R(s,a) + \gamma \cdot \sum_{s' \in N} P(s,a,s') \sum_{a' \in A} \pi(s',a') \cdot Q^{\pi}(s',a') \quad \text{for all } s \in N, a \in A \tag{2.4}$$

Equation (2.1) is known as the MDP State-Value Function Bellman Policy Equation (Fig-
ure 2.2 serves as a visualization aid for this Equation). Equation (2.4) is known as the MDP
Action-Value Function Bellman Policy Equation (Figure 2.3 serves as a visualization aid
for this Equation). Note that Equation (2.2) and Equation (2.3) are embedded in Figure
2.2 as well as in Figure 2.3. Equations (2.1), (2.2), (2.3) and (2.4) are collectively known
as the MDP Bellman Policy Equations.
For the rest of the book, in these MDP transition figures, we shall always depict states
as elliptical-shaped nodes and actions as rectangular-shaped nodes. Notice that transition
from a state node to an action node is associated with a probability represented by π and
transition from an action node to a state node is associated with a probability represented
by P.
Note that for finite MDPs of state space not too large, we can solve the MDP Predic-
tion problem (solving for V π and equivalently, Qπ ) in a straightforward manner: Given
a policy π, we can create the finite MRP implied by π, using the method apply_policy
in FiniteMarkovDecisionProcess, then use the direct linear-algebraic solution that we cov-
ered in Chapter 1 to calculate the Value Function of the π-implied MRP. We know that the
π-implied MRP’s Value Function is the same as the State-Value Function V π of the MDP
which can then be used to arrive at the Action-Value Function Qπ of the MDP (using Equa-
tion (2.3)). For large state spaces, we need to use iterative/numerical methods (Dynamic
Programming and Reinforcement Learning algorithms) to solve this Prediction problem
(covered later in this book).
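To make this concrete, here is a small sketch (continuing with the si_mdp, fdp and implied_mrp objects from the previous section, assuming the get_value_function_vec method from Chapter 1 and that the NonTerminal wrapper compares by its underlying state) that computes V^π via the π-implied MRP and then Q^π via Equation (2.3):

user_gamma = 0.9

# V^pi: Value Function of the pi-implied MRP (via Equation (1.2) of Chapter 1)
v_pi = {s: v for s, v in zip(implied_mrp.non_terminal_states,
                             implied_mrp.get_value_function_vec(gamma=user_gamma))}

# Q^pi via Equation (2.3): R(s, a) + gamma * sum_{s'} P(s, a, s') * V^pi(s')
q_pi = {}
for s in si_mdp.non_terminal_states:
    for a in si_mdp.actions(s):
        q_pi[(s.state, a)] = sum(
            p * (r + user_gamma * (v_pi[s1] if isinstance(s1, NonTerminal) else 0.))
            for (s1, r), p in si_mdp.mapping[s][a]
        )

print(q_pi)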

Figure 2.3.: Visualization of MDP Action-Value Function Bellman Policy Equation

2.10. Optimal Value Function and Optimal Policies


Finally, we arrive at the main purpose of a Markov Decision Process - to identify a policy
(or policies) that would yield the Optimal Value Function (i.e., the best possible Expected
Return from each of the non-terminal states). We say that a Markov Decision Process is
“solved” when we identify its Optimal Value Function (together with its associated Opti-
mal Policy, i.e., a Policy that yields the Optimal Value Function). The problem of identi-
fying the Optimal Value Function and its associated Optimal Policy/Policies is known as
the MDP Control problem. The term Control refers to the fact that this problem involves
steering the actions (by iterative modifications of the policy) to drive the Value Function
towards Optimality. Formally, the Optimal Value Function

$$V^* : N \rightarrow \mathbb{R}$$

is defined as:

$$V^*(s) = \max_{\pi \in \Pi} V^{\pi}(s) \quad \text{for all } s \in N$$

where Π is the set of stationary (stochastic) policies over the spaces of N and A.
The way to read the above definition is that for each non-terminal state s, we consider
all possible stochastic stationary policies π, and maximize V π (s) across all these choices
of π. Note that the maximization over choices of π is done separately for each s, so it’s
conceivable that different choices of π might maximize V π (s) for different s ∈ N . Thus,
from the above definition of V ∗ , we can’t yet talk about the notion of “An Optimal Policy.”
So, for now, let’s just focus on the notion of Optimal Value Function, as defined above.
Note also that we haven’t yet talked about how to achieve the above-defined maximization
through an algorithm - we have simply defined the Optimal Value Function.
Likewise, the Optimal Action-Value Function

$$Q^* : N \times A \rightarrow \mathbb{R}$$

is defined as:

$$Q^*(s, a) = \max_{\pi \in \Pi} Q^{\pi}(s, a) \quad \text{for all } s \in N, a \in A$$

V ∗ is often referred to as the Optimal State-Value Function to distinguish it from the
Optimal Action-Value Function Q∗ (although, for succinctness, V ∗ is often also referred
to as simply the Optimal Value Function). To be clear, if someone says, Optimal Value
Function, by default, they’d be referring to the Optimal State-Value Function V ∗ (not Q∗ ).
Much like how the Value Function(s) for a fixed policy have a recursive formulation,
Bellman noted (Bellman 1957b) that we can create a recursive formulation for the Optimal
Value Function(s). Let us start by unraveling the Optimal State-Value Function V ∗ (s) for
a given non-terminal state s - we consider all possible actions a ∈ A we can take from state
s, and pick the action a that yields the best Action-Value from thereon, i.e., the action a
that yields the best Q∗ (s, a). Formally, this gives us the following equation:

$$V^*(s) = \max_{a \in A} Q^*(s, a) \quad \text{for all } s \in N \quad (2.5)$$

Likewise, let’s think about what it means to be optimal from a given non-terminal-state
and action pair (s, a), i.e, let’s unravel Q∗ (s, a). First, we get the immediate expected re-
ward R(s, a). Next, we consider all possible random states s′ ∈ S we can transition to,
and from each of those states which are non-terminal states, we recursively act optimally.
Formally, this gives us the following equation:
$$Q^*(s, a) = R(s, a) + \gamma \cdot \sum_{s' \in N} P(s, a, s') \cdot V^*(s') \quad \text{for all } s \in N, a \in A \quad (2.6)$$

Substituting for Q∗ (s, a) from Equation (2.6) in Equation (2.5) gives:


$$V^*(s) = \max_{a \in A} \Big\{ R(s, a) + \gamma \cdot \sum_{s' \in N} P(s, a, s') \cdot V^*(s') \Big\} \quad \text{for all } s \in N \quad (2.7)$$

Equation (2.7) is known as the MDP State-Value Function Bellman Optimality Equation
and is depicted in Figure 2.4 as a visualization aid.
Substituting for V ∗ (s) from Equation (2.5) in Equation (2.6) gives:

$$Q^*(s, a) = R(s, a) + \gamma \cdot \sum_{s' \in N} P(s, a, s') \cdot \max_{a' \in A} Q^*(s', a') \quad \text{for all } s \in N, a \in A \quad (2.8)$$

Equation (2.8) is known as the MDP Action-Value Function Bellman Optimality Equa-
tion and is depicted in Figure 2.5 as a visualization aid.
Note that Equation (2.5) and Equation (2.6) are embedded in Figure 2.4 as well as in Fig-
ure 2.5. Equations (2.7), (2.5), (2.6) and (2.8) are collectively known as the MDP Bellman
Optimality Equations. We should highlight that when someone says MDP Bellman Equa-
tion or simply Bellman Equation, unless they explicitly state otherwise, they’d be referring to
the MDP Bellman Optimality Equations (and typically specifically the MDP State-Value
Function Bellman Optimality Equation). This is because the MDP Bellman Optimality
Equations address the ultimate purpose of Markov Decision Processes - to identify the
Optimal Value Function and the associated policy/policies that achieve the Optimal Value
Function (i.e., enabling us to solve the MDP Control problem).
Again, it pays to emphasize that the Bellman Optimality Equations don’t directly give
us a recipe to calculate the Optimal Value Function or the policy/policies that achieve

Figure 2.4.: Visualization of MDP State-Value Function Bellman Optimality Equation

Figure 2.5.: Visualization of MDP Action-Value Function Bellman Optimality Equation

the Optimal Value Function - they simply state a powerful mathematical property of the
Optimal Value Function that (as we shall see later in this book) helps us come up with al-
gorithms (Dynamic Programming and Reinforcement Learning) to calculate the Optimal
Value Function and the associated policy/policies that achieve the Optimal Value Func-
tion.
We have been using the phrase “policy/policies that achieve the Optimal Value Func-
tion,” but we haven’t yet provided a clear definition of such a policy (or policies). In fact,
as mentioned earlier, it’s not clear from the definition of V ∗ if such a policy (one that
would achieve V ∗ ) exists (because it’s conceivable that different policies π achieve the
maximization of V π (s) for different states s ∈ N ). So instead, we define an Optimal Policy
π ∗ : N × A → [0, 1] as one that “dominates” all other policies with respect to the Value
Functions for the policies. Formally,

$$\pi^* \in \Pi \text{ is an Optimal Policy if } V^{\pi^*}(s) \geq V^{\pi}(s) \text{ for all } \pi \in \Pi \text{ and for all states } s \in N$$

The definition of an Optimal Policy π ∗ says that it is a policy that is “better than or
equal to” (on the V π metric) all other stationary policies for all non-terminal states (note
that there could be multiple Optimal Policies). Putting this definition together with the
definition of the Optimal Value Function V ∗ , the natural question to then ask is whether
there exists an Optimal Policy π ∗ that maximizes V π (s) for all s ∈ N , i.e., whether there
exists a π ∗ such that $V^*(s) = V^{\pi^*}(s)$ for all s ∈ N . On the face of it, this seems like a
strong statement. However, this question is answered in the affirmative in most MDP settings of
interest. The following theorem and proof are for our default setting of MDP (discrete-time,
countable-spaces, time-homogeneous), but the statements and argument themes below
apply to various other MDP settings as well. The MDP book by Martin Puterman (Puter-
man 2014) provides rigorous proofs for a variety of settings.

Theorem 2.10.1. For any (discrete-time, countable-spaces, time-homogeneous) MDP:

• There exists an Optimal Policy π ∗ ∈ Π, i.e., there exists a Policy π ∗ ∈ Π such that $V^{\pi^*}(s) \geq V^{\pi}(s)$ for all policies π ∈ Π and for all states s ∈ N

• All Optimal Policies achieve the Optimal Value Function, i.e. $V^{\pi^*}(s) = V^*(s)$ for all s ∈ N , for all Optimal Policies π ∗

• All Optimal Policies achieve the Optimal Action-Value Function, i.e. $Q^{\pi^*}(s, a) = Q^*(s, a)$ for all s ∈ N , for all a ∈ A, for all Optimal Policies π ∗

Before proceeding with the proof of Theorem (2.10.1), we establish a simple Lemma.

Lemma 2.10.2. For any two Optimal Policies π1∗ and π2∗ , $V^{\pi_1^*}(s) = V^{\pi_2^*}(s)$ for all s ∈ N

Proof. Since π1∗ is an Optimal Policy, from the Optimal Policy definition, we have: $V^{\pi_1^*}(s) \geq V^{\pi_2^*}(s)$ for all s ∈ N . Likewise, since π2∗ is an Optimal Policy, from the Optimal Policy definition, we have: $V^{\pi_2^*}(s) \geq V^{\pi_1^*}(s)$ for all s ∈ N . This implies: $V^{\pi_1^*}(s) = V^{\pi_2^*}(s)$ for all s ∈ N .

Now we are ready to prove Theorem (2.10.1)

Proof. As a consequence of the above Lemma, all we need to do to prove Theorem (2.10.1) is to establish an Optimal Policy that achieves the Optimal Value Function and the Optimal Action-Value Function. We construct a Deterministic Policy (as a candidate Optimal Policy) $\pi_D^* : N \rightarrow A$ as follows:

$$\pi_D^*(s) = \arg\max_{a \in A} Q^*(s, a) \quad \text{for all } s \in N \quad (2.9)$$

Note that for any specific s, if two or more actions a achieve the maximization of Q∗ (s, a), then we use an arbitrary rule in breaking ties and assigning a single action a as the output of the above arg max operation.

First we show that $\pi_D^*$ achieves the Optimal Value Functions V ∗ and Q∗ . Since $\pi_D^*(s) = \arg\max_{a \in A} Q^*(s, a)$ and $V^*(s) = \max_{a \in A} Q^*(s, a)$ for all s ∈ N , we can infer for all s ∈ N that:

$$V^*(s) = Q^*(s, \pi_D^*(s))$$

This says that we achieve the Optimal Value Function from a given non-terminal state s if we first take the action prescribed by the policy $\pi_D^*$ (i.e., the action $\pi_D^*(s)$), followed by achieving the Optimal Value Function from each of the next time step’s states. But note that each of the next time step’s states can achieve the Optimal Value Function by doing the same thing described above (“first take action prescribed by $\pi_D^*$, followed by ...”), and so on and so forth for further time step’s states. Thus, the Optimal Value Function V ∗ is achieved if from each non-terminal state, we take the action prescribed by $\pi_D^*$. Likewise, the Optimal Action-Value Function Q∗ is achieved if from each non-terminal state, we take the action a (argument to Q∗ ) followed by future actions prescribed by $\pi_D^*$. Formally, this says:

$$V^{\pi_D^*}(s) = V^*(s) \quad \text{for all } s \in N$$
$$Q^{\pi_D^*}(s, a) = Q^*(s, a) \quad \text{for all } s \in N, \text{ for all } a \in A$$

Finally, we argue that $\pi_D^*$ is an Optimal Policy. Assume the contradiction (that $\pi_D^*$ is not an Optimal Policy). Then there exists a policy π ∈ Π and a state s ∈ N such that $V^{\pi}(s) > V^{\pi_D^*}(s)$. Since $V^{\pi_D^*}(s) = V^*(s)$, we have: $V^{\pi}(s) > V^*(s)$, which contradicts the Optimal Value Function Definition: $V^*(s) = \max_{\pi \in \Pi} V^{\pi}(s)$ for all s ∈ N . Hence, $\pi_D^*$ must be an Optimal Policy.
Equation (2.9) is a key construction that goes hand-in-hand with the Bellman Optimality
Equations in designing the various Dynamic Programming and Reinforcement Learning
algorithms to solve the MDP Control problem (i.e., to solve for V ∗ , Q∗ and π ∗ ). Lastly, it’s
important to note that unlike the Prediction problem which has a straightforward linear-
algebra-solver for small state spaces, the Control problem is non-linear and so, doesn’t
have an analogous straightforward linear-algebra-solver. The simplest solutions for the
Control problem (even for small state spaces) are the Dynamic Programming algorithms
we will cover in Chapter 3.
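
As a tiny illustration of the construction in Equation (2.9), the sketch below extracts a deterministic policy from a tabular Optimal Action-Value Function; q_star is a hypothetical dict mapping each non-terminal state to its action-values (it is not part of the book's code).

# Hypothetical tabular Q*: state -> {action -> Q*(s, a)}
q_star = {
    "s1": {"a1": 3.2, "a2": 4.1},
    "s2": {"a1": 0.7, "a2": 0.5},
}

# pi*_D(s) = argmax_a Q*(s, a), with ties broken arbitrarily (here: first maximum found)
pi_star_d = {s: max(q_vals, key=q_vals.get) for s, q_vals in q_star.items()}
print(pi_star_d)   # {'s1': 'a2', 's2': 'a1'}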

2.11. Variants and Extensions of MDPs


2.11.1. Size of Spaces and Discrete versus Continuous
Variants of MDPs can be organized by variations in the size and type of:
• State Space
• Action Space
• Time Steps

State Space:

The definitions we’ve provided for MRPs and MDPs were for countable (discrete) state
spaces. As a special case, we considered finite state spaces since we have pretty straight-
forward algorithms for exact solution of Prediction and Control problems for finite MDPs
(which we shall learn about in Chapter 3). We emphasize finite MDPs because they help
you develop a sound understanding of the core concepts and make it easy to program
the algorithms (known as “tabular” algorithms since we can represent the MDP in a “ta-
ble,” more specifically a Python data structure like dict or numpy array). However, these
algorithms are practical only if the finite state space is not too large. Unfortunately, in
many real-world problems, state spaces are either very large-finite or infinite (sometimes
continuous-valued spaces). Large state spaces are unavoidable because phenomena in na-
ture and metrics in business evolve in time due to a complex set of factors and often depend
on history. To capture all these factors and to enable the Markov Property, we invariably
end up with having to model large state spaces which suffer from two “curses”:

• Curse of Dimensionality (size of state space S)


• Curse of Modeling (size/complexity of state-reward transition probabilities PR )

Curse of Dimensionality is a term coined by Richard Bellman in the context of Dynamic


Programming. It refers to the fact that when the number of dimensions in the state space
grows, there is an exponential increase in the number of samples required to attain an ade-
quate level of accuracy in algorithms. Consider this simple example (adaptation of an ex-
ample by Bellman himself) - In a single dimension of space from 0 to 1, 100 evenly spaced
sample points suffice to sample the space within a threshold distance of 0.01 between
points. An equivalent sampling in 10 dimensions ($[0, 1]^{10}$) within a threshold distance of
0.01 between points will require $10^{20}$ points. So the 10-dimensional space requires points
that are greater by a factor of $10^{18}$ relative to the points required in a single dimension. This
explosion in requisite points in the state space is known as the Curse of Dimensionality.
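
These grid-point counts can be verified with a couple of lines of arithmetic (a quick sketch, using the same 0.01 spacing as above):

spacing = 0.01
points_1d = int(round(1 / spacing))   # 100 points along a single dimension
points_10d = points_1d ** 10          # 100^10 = 10^20 points for [0, 1]^10
print(points_10d)                     # 100000000000000000000
print(points_10d // points_1d)        # 10^18: the blow-up factor relative to one dimension
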
Curse of Modeling refers to the fact that when state spaces are large or when the struc-
ture of state-reward transition probabilities is complex, explicit modeling of these transi-
tion probabilities is very hard and often impossible (the set of probabilities can go beyond
memory or even disk storage space). Even if it’s possible to fit the probabilities in avail-
able storage space, estimating the actual probability values can be very difficult in complex
real-world situations.
To overcome these two curses, we can attempt to contain the state space size with some
dimension reduction techniques, i.e., including only the most relevant factors in the state
representation. Secondly, if future outcomes depend on history, we can include just the
past few time steps’ values rather than the entire history in the state representation. These
savings in state space size are essentially prudent approximations in the state represen-
tation. Such state space modeling considerations often require a sound understanding
of the real-world problem. Recent advances in unsupervised Machine Learning can also
help us contain the state space size. We won’t discuss these modeling aspects in detail
here - rather, we’d just like to emphasize for now that modeling the state space appropri-
ately is one of the most important skills in real-world Reinforcement Learning, and we will
illustrate some of these modeling aspects through a few examples later in this book.
Even after performing these modeling exercises in reducing the state space size, we often
still end up with fairly large state spaces (so as to capture sufficient nuances of the real-
world problem). We battle these two curses in fundamentally two (complementary) ways:

• Approximation of the Value Function - We create an approximate representation
of the Value Function (eg: by using a supervised learning representation such as
a neural network). This permits us to work with an appropriately sampled subset
of the state space, infer the Value Function in this state space subset, and interpo-
late/extrapolate/generalize the Value Function in the remainder of the State Space.
• Sampling from the state-reward transition probabilities PR - Instead of working with
the explicit transition probabilities, we simply use the state-reward sample transi-
tions and employ Reinforcement Learning algorithms to incrementally improve the
estimates of the (approximated) Value Function. When state spaces are large, rep-
resenting explicit transition probabilities is impossible (not enough storage space),
and simply sampling from these probability distributions is our only option (and as
you shall learn, is surprisingly effective).

This combination of sampling a state space subset, approximation of the Value Function
(with deep neural networks), sampling state-reward transitions, and clever Reinforcement
Learning algorithms goes a long way in breaking both the curse of dimensionality and
curse of modeling. In fact, this combination is a common pattern in the broader field
of Applied Mathematics to break these curses. The combination of Sampling and Func-
tion Approximation (particularly with the modern advances in Deep Learning) are likely
to pave the way for future advances in the broader fields of Real-World AI and Applied
Mathematics in general. We recognize that some of this discussion is a bit premature since
we haven’t even started teaching Reinforcement Learning yet. But we hope that this sec-
tion provides some high-level perspective and connects the learnings from this chapter to
the techniques/algorithms that will come later in this book. We will also remind you of
this joint-importance of sampling and function approximation once we get started with
Reinforcement Learning algorithms later in this book.
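
As a toy illustration of the sampling idea mentioned above, the sketch below contrasts the two interfaces: an explicit distribution over state-reward transitions versus a sampling model. The states, action and probabilities are made up; in a real problem the sampler would typically come from a simulator or from actual interactions, with no explicit probabilities available.

import random

# Explicit model: the full distribution over (next_state, reward) for a (state, action) pair
explicit_model = {
    ("s", "a"): [(("s1", 1.0), 0.7), (("s2", -1.0), 0.3)],   # ((s', r), probability)
}

# Sampling model: all we expose is the ability to draw one (next_state, reward) sample
def sample_transition(state, action):
    outcomes, probs = zip(*explicit_model[(state, action)])
    return random.choices(outcomes, weights=probs)[0]

print(sample_transition("s", "a"))   # e.g. ('s1', 1.0)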

Action Space:
Similar to state spaces, the definitions we’ve provided for MDPs were for countable (dis-
crete) action spaces. As a special case, we considered finite action spaces (together with
finite state spaces) since we have pretty straightforward algorithms for exact solution of
Prediction and Control problems for finite MDPs. As mentioned above, in these algo-
rithms, we represent the MDP in Python data structures like dict or numpy array. How-
ever, these finite-MDP algorithms are practical only if the state and action spaces are not
too large. In many real-world problems, action spaces do end up as fairly large - either
finite-large or infinite (sometimes continuous-valued action spaces). The large size of the
action space affects algorithms for MDPs in a couple of ways:

• Large action space makes the representation, estimation and evaluation of the pol-
icy π, of the Action-Value function for a policy Qπ and of the Optimal Action-Value
function Q∗ difficult. We have to resort to function approximation and sampling as
ways to overcome the large size of the action space.
• The Bellman Optimality Equation leads to a crucial calculation step in Dynamic Pro-
gramming and Reinforcement Learning algorithms that involves identifying the ac-
tion for each non-terminal state that maximizes the Action-Value Function Q. When
the action space is large, we cannot afford to evaluate Q for each action for an encoun-
tered state (as is done in simple tabular algorithms). Rather, we need to tap into an
optimization algorithm to perform the maximization of Q over the action space, for
each encountered state. Separately, there is a special class of Reinforcement Learning

algorithms called Policy Gradient Algorithms (that we shall later learn about) that
are particularly valuable for large action spaces (where other types of Reinforcement
Learning algorithms are not efficient and often, simply not an option). However,
these techniques to deal with large action spaces require care and attention as they
have their own drawbacks (more on this later).

Time Steps:

The definitions we’ve provided for MRP and MDP were for discrete time steps. We dis-
tinguish discrete time steps as terminating time-steps (known as terminating or episodic
MRPs/MDPs) or non-terminating time-steps (known as continuing MRPs/MDPs). We’ve
talked about how the choice of γ matters in these cases (γ = 1 doesn’t work for some con-
tinuing MDPs because reward accumulation can blow up to infinity). We won’t cover
it in this book, but there is an alternative formulation of the Value Function as expected
average reward (instead of expected discounted accumulated reward) where we don’t
discount even for continuing MDPs. We had also mentioned earlier that an alternative to
discrete time steps is continuous time steps, which is convenient for analytical tractability.
Sometimes, even if state space and action space components have discrete values (eg:
price of a security traded in fine discrete units, or number of shares of a security bought/sold
on a given day), for modeling purposes, we find it convenient to represent
these components as continuous values (i.e., uncountable state space). The advantage
of continuous state/action space representation (especially when paired with continuous
time) is that we get considerable mathematical benefits from differential calculus as well
as from properties of continuous probability distributions (eg: gaussian distribution con-
veniences). In fact, continuous state/action space and continuous time are very popu-
lar in Mathematical Finance since some of the groundbreaking work from Mathematical
Economics from the 1960s and 1970s - Robert Merton’s Portfolio Optimization formula-
tion and solution (Merton 1969) and Black-Scholes’ Options Pricing model (Black and
Scholes 1973), to name a couple - are grounded in stochastic calculus1 which models stock
prices/portfolio value as gaussian evolutions in continuous time (more on this later in
the book) and treats trades (buy/sell quantities) as also continuous variables (permitting
partial derivatives and tractable partial differential equations).
When all three of state space, action space and time steps are modeled as continuous, the
Bellman Optimality Equation we covered in this chapter for countable spaces and discrete-
time morphs into a differential calculus formulation and is known as the famous Hamilton-
Jacobi-Bellman (HJB) equation2 . The HJB Equation is commonly used to model and solve
many problems in engineering, physics, economics and finance. We shall cover a couple of
financial applications in this book that have elegant formulations in terms of the HJB equa-
tion and equally elegant analytical solutions of the Optimal Value Function and Optimal
Policy (tapping into stochastic calculus and differential equations).

2.11.2. Partially-Observable Markov Decision Processes (POMDPs)


You might have noticed in the definition of MDP that there are actually two different no-
tions of state, which we collapsed into a single notion of state. These two notions of state
are:
1 Appendix C provides a quick introduction to and overview of Stochastic Calculus.
2 Appendix D provides a quick introduction to the HJB Equation.

Figure 2.6.: Partially-Observable Markov Decision Process

• The internal representation of the environment at each time step t (let’s call it $S_t^{(e)}$). This internal representation of the environment is what drives the probabilistic transition to the next time step t + 1, producing the random pair of next (environment) state $S_{t+1}^{(e)}$ and reward $R_{t+1}$.

• The agent state at each time step t (let’s call it $S_t^{(a)}$). The agent state is what controls the action $A_t$ the agent takes at time step t, i.e., the agent runs a policy π which is a function of the agent state $S_t^{(a)}$, producing a probability distribution of actions $A_t$.

In our definition of MDP, note that we implicitly assumed that $S_t^{(e)} = S_t^{(a)}$ at each time
step t, and called it the (common) state St at time t. Secondly, we assumed that this state
St is fully observable by the agent. To understand full observability, let us (first intuitively)
understand the concept of partial observability in a more generic setting than what we had
assumed in the framework for MDP. In this more generic framework, we denote Ot as
the information available to the agent from the environment at time step t, as depicted in
Figure 2.6. The notion of partial observability in this more generic framework is that from
the history of observations, actions and rewards up to time step t, the agent does not have
full knowledge of the environment state $S_t^{(e)}$. This lack of full knowledge of $S_t^{(e)}$ is known
as partial observability. Full observability, on the other hand, means that the agent can fully
construct $S_t^{(e)}$ as a function of the history of observations, actions and rewards up to time
step t. Since we have the flexibility to model the exact data structures to represent obser-
vations, state and actions in this more generic framework, existence of full observability
lets us re-structure the observation data at time step t to be $O_t = S_t^{(e)}$. Since we have also
assumed $S_t^{(e)} = S_t^{(a)}$, we have:

$$O_t = S_t^{(e)} = S_t^{(a)} \quad \text{for all time steps } t = 0, 1, 2, \ldots$$

The above statement specialized the framework to that of Markov Decision Processes,
which we can now name more precisely as Fully-Observable Markov Decision Processes
(when viewed from the lens of the more generic framework described above, that permits
partial observability or full observability).

In practice, you will often find that the agent doesn’t know the true internal represen-
tation ($S_t^{(e)}$) of the environment (i.e., partial observability). Think about what it would
take to know what drives a stock price from time step t to t + 1 (the agent would need
to have access to pretty much every little detail of trading activity in the entire world,
and more!). However, since the MDP framework is simple and convenient, and since we
have tractable Dynamic Programming and Reinforcement Learning algorithms to solve
MDPs, we often do pretend that $O_t = S_t^{(e)} = S_t^{(a)}$ and carry on with our business of solv-
ing the assumed/modeled MDP. Often, this assumption of $O_t = S_t^{(e)} = S_t^{(a)}$ turns out
to be a reasonable approximate model of the real-world but there are indeed situations
where this assumption is far-fetched. These are situations where we have access to too
little information pertaining to the key aspects of the internal state representation ($S_t^{(e)}$)
of the environment. It turns out that we have a formal framework for these situations -
this framework is known as Partially-Observable Markov Decision Process (POMDP for
short). By default, the acronym MDP will refer to a Fully-Observable Markov Decision
Process (i.e. corresponding to the MDP definition we have given earlier in this chapter).
So let’s now define a POMDP.
A POMDP has the usual features of an MDP (discrete-time, countable states, countable
actions, countable next state-reward transition probabilities, discount factor, plus assum-
ing time-homogeneity), together with the notion of random observation Ot at each time
step t (each observation Ot lies within the Observation Space O) and observation proba-
bility function Z : S × A × O → [0, 1] defined as:

Z(s′ , a, o) = P[Ot+1 = o|(St+1 = s′ , At = a)]

It pays to emphasize that although a POMDP works with the notion of a state St , the
agent doesn’t have knowledge of St . It only has knowledge of observation Ot because Ot is
the extent of information made available from the environment. The agent will then need
to essentially “guess” (probabilistically) what the state St might be at each time step t in
order to take the action At . The agent’s goal in a POMDP is the same as that for an MDP:
to determine the Optimal Value Function and to identify an Optimal Policy (achieving the
Optimal Value Function).
Just like we have the rich theory and algorithms for MDPs, we have the theory and
algorithms for POMDPs. POMDP theory is founded on the notion of belief states. The
informal notion of a belief state is that since the agent doesn’t get to see the state St (it only
sees the observations Ot ) at each time step t, the agent needs to keep track of what it thinks
the state St might be, i.e., it maintains a probability distribution of states St conditioned
on history. Let’s make this a bit more formal.
Let us refer to the history Ht known to the agent at time t as the sequence of data it has
collected up to time t. Formally, this data sequence Ht is:

(O0 , A0 , R1 , O1 , A1 , R2 , . . . , Ot−1 , At−1 , Rt , Ot )

A Belief State b(h)t at time t is a probability distribution over states, conditioned on the
history h, i.e.,

$$b(h)_t = (P[S_t = s_1 | H_t = h], P[S_t = s_2 | H_t = h], \ldots)$$

such that $\sum_{s \in S} b(h)_t(s) = 1$ for all histories h and for each t = 0, 1, 2, . . ..
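
Although this chapter doesn't derive it, the standard belief-state update (a Bayes filter built from the transition probabilities P and the observation probability function Z defined above) can be sketched as follows; b, P and Z below are hypothetical dict-based tabular models, not part of the book's code.

def belief_update(b, a, o, P, Z):
    # b: {state: probability}, P: {(s, a, s'): probability}, Z: {(s', a, o): probability}
    # New belief: b'(s') proportional to Z(s', a, o) * sum_s P(s, a, s') * b(s)
    states = {s_next for (_, _, s_next) in P}
    unnormalized = {
        s_next: Z.get((s_next, a, o), 0.0) *
                sum(P.get((s, a, s_next), 0.0) * p for s, p in b.items())
        for s_next in states
    }
    total = sum(unnormalized.values())   # assumed non-zero for observations that can occur
    return {s: u / total for s, u in unnormalized.items()}
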
Since the history Ht satisfies the Markov Property, the belief state b(h)t satisfies the
Markov Property. So we can reduce the POMDP to an MDP M with the set of belief
states of the POMDP as the set of states of the MDP M . Note that even if the set of states of

the POMDP were finite, the set of states of the MDP M will be infinite (i.e. infinite belief
states). We can see that this will almost always end up as a giant MDP M . So although this
is useful for theoretical reasoning, practically solving this MDP M is often quite hard com-
putationally. However, specialized techniques have been developed to solve POMDPs but
as you might expect, their computational complexity is still quite high. So we end up with
a choice when encountering a POMDP - either try to solve it with a POMDP algorithm
(computationally inefficient but capturing the reality of the real-world problem) or try to
approximate it as an MDP (pretending $O_t = S_t^{(e)} = S_t^{(a)}$) which will likely be compu-
tationally more efficient but might be a gross approximation of the real-world problem,
which in turn means its effectiveness in practice might be compromised. This is the mod-
eling dilemma we often end up with: what is the right level of detail of real-world factors
we need to capture in our model? How do we prevent state spaces from exploding be-
yond practical computational tractability? The answers to these questions typically have
to do with depth of understanding of the nuances of the real-world problem and a trial-
and-error process of: formulating the model, solving for the optimal policy, testing the
efficacy of this policy in practice (with appropriate measurements to capture real-world
metrics), learning about the drawbacks of our model, and iterating back to tweak (or com-
pletely change) the model.
Let’s consider a classic example of a card game such as Poker or Blackjack as a POMDP
where your objective as a player is to identify the optimal policy to maximize your ex-
pected return (Optimal Value Function). The observation Ot would be the entire set of
information you would have seen up to time step t (or a compressed version of this en-
tire information that suffices for predicting transitions and for taking actions). The state
St would include, among other things, the set of cards you have, the set of cards your
opponents have (which you don’t see), and the entire set of exposed as well as unex-
posed cards not held by players. Thus, the state is only partially observable. With this
POMDP structure, we proceed to develop a model of the transition probabilities of next
state St+1 and reward Rt+1 , conditional on current state St and current action At . We also
develop a model of the probabilities of next observation Ot+1 , conditional on next state
St+1 and current action At . These probabilities are estimated from data collected from
various games (capturing opponent behaviors) and knowledge of the cards-structure of
the deck (or decks) used to play the game. Now let’s think about what would happen
if we modeled this card game as an MDP. We’d no longer have the unseen cards as part
of our state. Instead, the state St will be limited to the information seen up to time t (i.e.,
St = Ot ). We can still estimate the transition probabilities, but since it’s much harder to
estimate in this case, our estimate will likely be quite noisy and nowhere near as reliable
as the probability estimates in the POMDP case. The advantage though with modeling it
as an MDP is that the algorithm to arrive at the Optimal Value Function/Optimal Policy
is a lot more tractable compared to the algorithm for the POMDP model. So it’s a tradeoff
between the reliability of the probability estimates versus the tractability of the algorithm
to solve for the Optimal Value Function/Policy.
The purpose of this subsection on POMDPs is to highlight that by default a lot of prob-
lems in the real-world are POMDPs and it can sometimes take quite a bit of domain-
knowledge, modeling creativity and real-world experimentation to treat them as MDPs
and make the solution to the modeled MDP successful in practice.
The idea of partial observability was introduced in a paper by K. J. Åström (Åström
1965). To learn more about POMDP theory, we refer you to the POMDP book by Vikram
Krishnamurthy (Krishnamurthy 2016).

2.12. Summary of Key Learnings from this Chapter
• MDP Bellman Policy Equations
• MDP Bellman Optimality Equations
• Theorem (2.10.1) on the existence of an Optimal Policy, and of each Optimal Policy
achieving the Optimal Value Function

3. Dynamic Programming Algorithms
As a reminder, much of this book is about algorithms to solve the MDP Control problem,
i.e., to compute the Optimal Value Function (and an associated Optimal Policy). We will
also cover algorithms for the MDP Prediction problem, i.e., to compute the Value Function
when the AI agent executes a fixed policy π (which, as we know from Chapter 2, is the
same as computing the Value Function of the π-implied MRP). Our typical approach will
be to first cover algorithms to solve the Prediction problem before covering algorithms to
solve the Control problem - not just because Prediction is a key component in solving the
Control problem, but also because it helps understand the key aspects of the techniques
employed in the Control algorithm in the simpler setting of Prediction.

3.1. Planning versus Learning


In this book, we shall look at Prediction and Control from the lens of AI (and we’ll specifi-
cally use the terminology of AI). We shall distinguish between algorithms that don’t have
a model of the MDP environment (no access to the PR function) versus algorithms that do
have a model of the MDP environment (meaning PR is available to us either in terms of ex-
plicit probability distribution representations or available to us just as a sampling model).
The former (algorithms without access to a model) are known as Learning Algorithms to
reflect the fact that the AI agent will need to interact with the real-world environment
(eg: a robot learning to navigate in an actual forest) and learn the Value Function from
data (states encountered, actions taken, rewards observed) it receives through interac-
tions with the environment. The latter (algorithms with access to a model of the MDP
environment) are known as Planning Algorithms to reflect the fact that the AI agent re-
quires no real-world environment interaction and in fact, projects (with the help of the
model) probabilistic scenarios of future states/rewards for various choices of actions, and
solves for the requisite Value Function based on the projected outcomes. In both Learning
and Planning, the Bellman Equation is the fundamental concept driving the algorithms
but the details of the algorithms will typically make them appear fairly different. We will
only focus on Planning algorithms in this chapter, and in fact, will only focus on a subclass
of Planning algorithms known as Dynamic Programming.

3.2. Usage of the term Dynamic Programming


Unfortunately, the term Dynamic Programming tends to be used by different fields in
somewhat different ways. So it pays to clarify the history and the current usage of the
term. The term Dynamic Programming was coined by Richard Bellman himself. Here is the
rather interesting story told by Bellman about how and why he coined the term.

“I spent the Fall quarter (of 1950) at RAND. My first task was to find a name
for multistage decision processes. An interesting question is, ‘Where did the
name, dynamic programming, come from?’ The 1950s were not good years for

mathematical research. We had a very interesting gentleman in Washington
named Wilson. He was Secretary of Defense, and he actually had a patholog-
ical fear and hatred of the word, research. I’m not using the term lightly; I’m
using it precisely. His face would suffuse, he would turn red, and he would
get violent if people used the term, research, in his presence. You can imagine
how he felt, then, about the term, mathematical. The RAND Corporation was
employed by the Air Force, and the Air Force had Wilson as its boss, essen-
tially. Hence, I felt I had to do something to shield Wilson and the Air Force
from the fact that I was really doing mathematics inside the RAND Corpora-
tion. What title, what name, could I choose? In the first place I was interested
in planning, in decision making, in thinking. But planning, is not a good word
for various reasons. I decided therefore to use the word, ‘programming.’ I
wanted to get across the idea that this was dynamic, this was multistage, this
was time-varying—I thought, let’s kill two birds with one stone. Let’s take a
word that has an absolutely precise meaning, namely dynamic, in the classical
physical sense. It also has a very interesting property as an adjective, and that
is it’s impossible to use the word, dynamic, in a pejorative sense. Try think-
ing of some combination that will possibly give it a pejorative meaning. It’s
impossible. Thus, I thought dynamic programming was a good name. It was
something not even a Congressman could object to. So I used it as an umbrella
for my activities.”

Bellman had coined the term Dynamic Programming to refer to the general theory of
MDPs, together with the techniques to solve MDPs (i.e., to solve the Control problem). So
the MDP Bellman Optimality Equation was part of this catch-all term Dynamic Program-
ming. The core semantic of the term Dynamic Programming was that the Optimal Value
Function can be expressed recursively - meaning, to act optimally from a given state, we
will need to act optimally from each of the resulting next states (which is the essence of the
Bellman Optimality Equation). In fact, Bellman used the term “Principle of Optimality”
to refer to this idea of “Optimal Substructure,” and articulated it as follows:

PRINCIPLE OF OPTIMALITY. An optimal policy has the property that what-


ever the initial state and initial decisions are, the remaining decisions must
constitute an optimal policy with regard to the state resulting from the first
decisions.

So, you can see that the term Dynamic Programming was not just an algorithm in its
original usage. Crucially, Bellman laid out an iterative algorithm to solve for the Op-
timal Value Function (i.e., to solve the MDP Control problem). Over the course of the
next decade, the term Dynamic Programming got associated with (multiple) algorithms
to solve the MDP Control problem. The term Dynamic Programming was extended to
also refer to algorithms to solve the MDP Prediction problem. Over the next couple of
decades, Computer Scientists started referring to the term Dynamic Programming as any
algorithm that solves a problem through a recursive formulation as long as the algorithm
makes repeated invocations to the solutions of each subproblem (overlapping subproblem
structure). A classic such example is the algorithm to compute the Fibonacci sequence by
caching the Fibonacci values and re-using those values during the course of the algorithm
execution. The algorithm to calculate the shortest path in a graph is another classic exam-
ple where each shortest (i.e. optimal) path includes sub-paths that are optimal. However,
in this book, we won’t use the term Dynamic Programming in this broader sense. We will

use the term Dynamic Programming to be restricted to algorithms to solve the MDP Pre-
diction and Control problems (even though Bellman originally used it only in the context
of Control). More specifically, we will use the term Dynamic Programming in the narrow
context of Planning algorithms for problems with the following two specializations:

• The state space is finite, the action space is finite, and the set of pairs of next state
and reward (given any pair of current state and action) are also finite.
• We have explicit knowledge of the model probabilities (either in the form of PR or
in the form of P and R separately).

This is the setting of the class FiniteMarkovDecisionProcess we had covered in Chap-


ter 2. In this setting, Dynamic Programming algorithms solve the Prediction and Con-
trol problems exactly (meaning the computed Value Function converges to the true Value
Function as the algorithm iterations keep increasing). There are variants of Dynamic Pro-
gramming algorithms known as Asynchronous Dynamic Programming algorithms, Ap-
proximate Dynamic Programming algorithms etc. But without such qualifications, when
we use just the term Dynamic Programming, we will be referring to the “classical” it-
erative algorithms (that we will soon describe) for the above-mentioned setting of the
FiniteMarkovDecisionProcess class to solve MDP Prediction and Control exactly. Even
though these classical Dynamic Programming algorithms don’t scale to large state/action
spaces, they are extremely vital to develop one’s core understanding of the key concepts in
the more advanced algorithms that will enable us to scale (eg: the Reinforcement Learning
algorithms that we shall introduce in later chapters).

3.3. Fixed-Point Theory


We start by covering 3 classical Dynamic Programming algorithms. Each of the 3 algo-
rithms is founded on the Bellman Equations we had covered in Chapter 2. Each of the 3
algorithms is an iterative algorithm where the computed Value Function converges to the
true Value Function as the number of iterations approaches infinity. Each of the 3 algo-
rithms is based on the concept of Fixed-Point and updates the computed Value Function
towards the Fixed-Point (which in this case, is the true Value Function). Fixed-Point is
actually a fairly generic and important concept in the broader fields of Pure as well as Ap-
plied Mathematics (also important in Theoretical Computer Science), and we believe un-
derstanding Fixed-Point theory has many benefits beyond the needs of the subject of this
book. Of more relevance is the fact that the Fixed-Point view of Dynamic Programming
is the best way to understand Dynamic Programming. We shall not only cover the theory
of Dynamic Programming through the Fixed-Point perspective, but we shall also imple-
ment Dynamic Programming algorithms in our code based on the Fixed-Point concept. So
this section will be a short primer on general Fixed-Point Theory (and implementation in
code) before we get to the 3 Dynamic Programming algorithms.
Definition 3.3.1. The Fixed-Point of a function f : X → X (for some arbitrary domain X )
is a value x ∈ X that satisfies the equation: x = f (x).
Note that for some functions, there will be multiple fixed-points and for some other
functions, a fixed-point won’t exist. We will be considering functions which have a unique
fixed-point (this will be the case for the Dynamic Programming algorithms).
Let’s warm up to the above-defined abstract concept of Fixed-Point with a concrete ex-
ample. Consider the function f (x) = cos(x) defined for x ∈ R (x in radians, to be clear).

So we want to solve for an x such that x = cos(x). Knowing the frequency and ampli-
tude of cosine, we can see that the cosine curve intersects the line y = x at only one point,
which should be somewhere between 0 and π/2. But there is no easy way to solve for this
point. Here’s an idea: Start with any value x0 ∈ R, calculate x1 = cos(x0 ), then calculate
x2 = cos(x1 ), and so on …, i.e, xi+1 = cos(xi ) for i = 0, 1, 2, . . .. You will find that xi
and xi+1 get closer and closer as i increases, i.e., |xi+1 − xi | ≤ |xi − xi−1 | for all i ≥ 1.
So it seems like limi→∞ xi = limi→∞ cos(xi−1 ) = limi→∞ cos(xi ), which would imply that
for large enough i, xi would serve as an approximation to the solution of the equation
x = cos(x). But why does this method of repeated applications of the function f (no mat-
ter what x0 we start with) work? Why does it not diverge or oscillate? How quickly does
it converge? If there were multiple fixed-points, which fixed-point would it converge to
(if at all)? Can we characterize a class of functions f for which this method (repeatedly
applying f , starting with any arbitrary value of x0 ) would work (in terms of solving the
equation x = f (x))? These are the questions Fixed-Point theory attempts to answer. Can
you think of problems you have solved in the past which fall into this method pattern that
we’ve illustrated above for f (x) = cos(x)? It’s likely you have, because most of the root-
finding and optimization methods (including multi-variate solvers) are essentially based
on the idea of Fixed-Point. If this doesn’t sound convincing, consider the simple Newton
method:
For a differentiable function g : R → R whose root we want to solve for, the Newton
method update rule is:

$$x_{i+1} = x_i - \frac{g(x_i)}{g'(x_i)}$$

Setting $f(x) = x - \frac{g(x)}{g'(x)}$, the update rule is:

$$x_{i+1} = f(x_i)$$

and it solves the equation x = f (x) (solves for the fixed-point of f ), i.e., it solves the
equation:

$$x = x - \frac{g(x)}{g'(x)} \Rightarrow g(x) = 0$$
Thus, we see the same method pattern as we saw above for cos(x) (repeated application
of a function, starting with any initial value) enables us to solve for the root of g.
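
Here is a minimal standalone sketch of the Newton update viewed as fixed-point iteration, for the made-up example g(x) = x² − 2 (whose positive root is √2); it does not use the rl/iterate.py helpers that appear below.

def newton_fixed_point(g, g_prime, x0, tol=1e-10):
    x = x0
    while True:
        x_next = x - g(x) / g_prime(x)   # x_{i+1} = f(x_i) with f(x) = x - g(x)/g'(x)
        if abs(x_next - x) < tol:
            return x_next
        x = x_next

print(newton_fixed_point(lambda x: x * x - 2, lambda x: 2 * x, x0=1.0))   # ~1.41421356...
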
More broadly, what we are saying is that if we have a function f : X → X (for some arbi-
trary domain X ), under appropriate conditions (that we will state soon), f (f (. . . f (x0 ) . . .))
converges to a fixed-point of f , i.e., to the solution of the equation x = f (x) (no matter
what x0 ∈ X we start with). Now we are ready to state this formally. The statement of
the following theorem (due to Stefan Banach) is quite terse, so we will provide plenty of
explanation on how to interpret it and how to use it after stating the theorem (we skip the
proof of the theorem).
Theorem 3.3.1 (Banach Fixed-Point Theorem). Let X be a non-empty set equipped with a
complete metric d : X × X → R. Let f : X → X be such that there exists a L ∈ [0, 1) such that
d(f (x1 ), f (x2 )) ≤ L · d(x1 , x2 ) for all x1 , x2 ∈ X (this property of f is called a contraction, and
we refer to f as a contraction function). Then,
1. There exists a unique Fixed-Point x∗ ∈ X , i.e.,
x∗ = f (x∗ )

2. For any x0 ∈ X , and sequence [xi |i = 0, 1, 2, . . .] defined as xi+1 = f (xi ) for all i = 0, 1, 2, . . .,
$$\lim_{i \rightarrow \infty} x_i = x^*$$

3.
$$d(x^*, x_i) \leq \frac{L^i}{1 - L} \cdot d(x_1, x_0)$$
Equivalently,
$$d(x^*, x_{i+1}) \leq \frac{L}{1 - L} \cdot d(x_{i+1}, x_i)$$
$$d(x^*, x_{i+1}) \leq L \cdot d(x^*, x_i)$$

We realize this is quite terse and will now demystify the theorem in a simple, intuitive
manner. First we need to explain what complete metric means. Let’s start with the term
metric. A metric is simply a function d : X × X → R that satisfies the usual “distance”
properties (for any x1 , x2 , x3 ∈ X ):

1. d(x1 , x2 ) = 0 ⇔ x1 = x2 (meaning two different points have a distance strictly


greater than 0)
2. d(x1 , x2 ) = d(x2 , x1 ) (meaning distance is directionless)
3. d(x1 , x3 ) ≤ d(x1 , x2 ) + d(x2 , x3 ) (meaning the triangle inequality is satisfied)

The term complete is a bit of a technical detail on sequences not escaping the set X (that’s
required in the proof). Since we won’t be doing the proof and since this technical detail is
not so important for the intuition, we skip the formal definition of complete. A non-empty
set X equipped with the function d (and the technical detail of being complete) is known
as a complete metric space.
Now we move on to the key concept of contraction. A function f : X → X is said to
be a contraction function if two points in X get closer when they are mapped by f (the
statement: d(f (x1 ), f (x2 )) ≤ L · d(x1 , x2 ) for all x1 , x2 ∈ X , for some L ∈ [0, 1)).
The theorem basically says that for any contraction function f , not only is there a unique
fixed-point x∗ , but one can also arrive at x∗ by repeated application of f , starting with any initial
value x0 ∈ X :

f (f (. . . f (x0 ) . . .)) → x∗
We use the notation f i : X → X for i = 0, 1, 2, . . . as follows:

f i+1 (x) = f (f i (x)) for all i = 0, 1, 2, . . . , for all x ∈ X


f 0 (x) = x for all x ∈ X
With this notation, the computation of the fixed-point can be expressed as:

$$\lim_{i \rightarrow \infty} f^i(x_0) = x^* \quad \text{for all } x_0 \in X$$

The algorithm, in iterative form, is:

$$x_{i+1} = f(x_i) \quad \text{for all } i = 0, 1, 2, \ldots$$


We stop the algorithm when xi and xi+1 are close enough based on the distance-metric
d.

Banach Fixed-Point Theorem also gives us a statement on the speed of convergence re-
lating the distance between x∗ and any xi to the distance between any two successive xi .
This is a powerful theorem. All we need to do is identify the appropriate set X to work
with, identify the appropriate metric d to work with, and ensure that f is indeed a con-
traction function (with respect to d). This enables us to solve for the fixed-point of f with
the above-described iterative process of applying f repeatedly, starting with any arbitrary
value of x0 ∈ X .
We leave it to you as an exercise to verify that f (x) = cos(x) is a contraction function in
the domain X = R with metric d defined as d(x1 , x2 ) = |x1 − x2 |. Now let’s write some
code to implement the fixed-point algorithm we described above. Note that we implement
this for any generic type X to represent an arbitrary domain X .

from typing import Callable, Iterator, TypeVar

X = TypeVar('X')

def iterate(step: Callable[[X], X], start: X) -> Iterator[X]:
    state = start
    while True:
        yield state
        state = step(state)

The above function takes as input a function (step: Callable[[X], X]) and a starting
value (start: X), and repeatedly applies the function while yielding the values in the
form of an Iterator[X], i.e., as a stream of values. This produces an endless stream though.
We need a way to specify convergence, i.e., when successive values of the stream are “close
enough.”

def converge(values: Iterator[X], done: Callable[[X, X], bool]) -> Iterator[X]:
    a = next(values, None)
    if a is None:
        return
    yield a
    for b in values:
        yield b
        if done(a, b):
            return
        a = b

The above function converge takes as input the generated values from iterate (argu-
ment values: Iterator[X]) and a signal to indicate convergence (argument done: Callable[[X,
X], bool]), and produces the generated values until done is True. It is the user’s responsi-
bility to write the function done and pass it to converge. Now let’s use these two functions
to solve for x = cos(x).

import numpy as np

x = 0.0
values = converge(
    iterate(lambda y: np.cos(y), x),
    lambda a, b: np.abs(a - b) < 1e-3
)
for i, v in enumerate(values):
    print(f"{i}: {v:.4f}")

This prints a trace with the index of the stream and the value at that index as the function
cos is repeatedly applied. It terminates when two successive values are within 3 decimal
places of each other.

0: 0.0000
1: 1.0000
2: 0.5403
3: 0.8576
4: 0.6543
5: 0.7935
6: 0.7014
7: 0.7640
8: 0.7221
9: 0.7504
10: 0.7314
11: 0.7442
12: 0.7356
13: 0.7414
14: 0.7375
15: 0.7401
16: 0.7384
17: 0.7396
18: 0.7388

We encourage you to try other starting values (other than the one we have above: x0 =
0.0) and see the trace. We also encourage you to identify other functions f which are
contractions in an appropriate metric. The above fixed-point code is in the file rl/iterate.py.
In this file, you will find two more functions last and converged to produce the final value
of the given iterator when its values converge according to the done function.
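
For intuition, here is one plausible way last and converged could be written on top of converge; this is only a sketch and the actual implementations in rl/iterate.py may differ.

def last(values: Iterator[X]) -> X:
    *_, final = values   # exhaust the iterator, keeping only its final element
    return final

def converged(values: Iterator[X], done: Callable[[X, X], bool]) -> X:
    return last(converge(values, done))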

3.4. Bellman Policy Operator and Policy Evaluation Algorithm


Our first Dynamic Programming algorithm is called Policy Evaluation. The Policy Eval-
uation algorithm solves the problem of calculating the Value Function of a Finite MDP
evaluated with a fixed policy π (i.e., the Prediction problem for finite MDPs). We know
that this is equivalent to calculating the Value Function of the π-implied Finite MRP. To
avoid notation confusion, note that a superscript of π for a symbol means it refers to no-
tation for the π-implied MRP. The precise specification of the Prediction problem is as
follows:
Let the states of the MDP (and hence, of the π-implied MRP) be S = {s1 , s2 , . . . , sn },
and without loss of generality, let N = {s1 , s2 , . . . , sm } be the non-terminal states. We are
given a fixed policy π : N × A → [0, 1]. We are also given the π-implied MRP’s transition
probability function:

$$P_R^{\pi} : N \times D \times S \rightarrow [0, 1]$$

in the form of a data structure (since the states are finite, and the pairs of next state and
reward transitions from each non-terminal state are also finite). The Prediction problem is
to compute the Value Function of the MDP when evaluated with the policy π (equivalently,
the Value Function of the π-implied MRP), which we denote as V π : N → R.
We know from Chapters 1 and 2 that by extracting (from $P_R^{\pi}$) the transition probability

function P π : N × S → [0, 1] of the implicit Markov Process and the reward function

Rπ : N → R, we can perform the following calculation for the Value Function V π : N → R
(expressed as a column vector V π ∈ Rm ) to solve this Prediction problem:

$$V^{\pi} = (I_m - \gamma P^{\pi})^{-1} \cdot R^{\pi}$$

where $I_m$ is the m × m identity matrix, the column vector $R^{\pi} \in \mathbb{R}^m$ represents Rπ , and
$P^{\pi}$ is an m × m matrix representing P π (rows and columns corresponding to the non-
terminal states). However, when m is large, this calculation won’t scale. So, we look for a
terminal states). However, when m is large, this calculation won’t scale. So, we look for a
numerical algorithm that would solve (for V π ) the following MRP Bellman Equation (for
a larger number of finite states).

V π = Rπ + γP π · V π
We define the Bellman Policy Operator B π : Rm → Rm as:

B π (V ) = Rπ + γP π · V for any vector V in the vector space Rm (3.1)


So, the MRP Bellman Equation can be expressed as:

V π = B π (V π )
which means V π ∈ Rm is a Fixed-Point of the Bellman Policy Operator B π : Rm → Rm .
Note that the Bellman Policy Operator can be generalized to the case of non-finite MDPs
and V π is still a Fixed-Point for various generalizations of interest. However, since this
chapter focuses on developing algorithms for finite MDPs, we will work with the above
narrower (Equation (3.1)) definition. Also, for proofs of correctness of the DP algorithms
(based on Fixed-Point) in this chapter, we shall assume the discount factor γ < 1.
Note that B π is an affine transformation on vectors in Rm and should be thought of as
a generalization of a simple 1-D (R → R) affine transformation y = a + bx where the
multiplier b is replaced with the matrix γP π and the shift a is replaced with the column
vector Rπ .
We’d like to come up with a metric for which B π is a contraction function so we can
take advantage of Banach Fixed-Point Theorem and solve this Prediction problem by it-
erative applications of the Bellman Policy Operator B π . For any Value Function V ∈ Rm
(representing V : N → R), we shall express the Value for any state s ∈ N as V (s).
Our metric d : Rm × Rm → R shall be the L∞ norm defined as:

$$d(X, Y) = \Vert X - Y \Vert_{\infty} = \max_{s \in N} |(X - Y)(s)|$$

$B^{\pi}$ is a contraction function under the $L^{\infty}$ norm because for all X, Y ∈ Rm ,

$$\max_{s \in N} |(B^{\pi}(X) - B^{\pi}(Y))(s)| = \gamma \cdot \max_{s \in N} |(P^{\pi} \cdot (X - Y))(s)| \leq \gamma \cdot \max_{s \in N} |(X - Y)(s)|$$

So invoking Banach Fixed-Point Theorem proves the following Theorem:


Theorem 3.4.1 (Policy Evaluation Convergence Theorem). For a Finite MDP with |N | = m
and γ < 1, if V π ∈ Rm is the Value Function of the MDP when evaluated with a fixed policy
π : N × A → [0, 1], then V π is the unique Fixed-Point of the Bellman Policy Operator B π :
Rm → Rm , and
$$\lim_{i \rightarrow \infty} (B^{\pi})^i(V_0) \rightarrow V^{\pi} \quad \text{for all starting Value Functions } V_0 \in \mathbb{R}^m$$

This gives us the following iterative algorithm (known as the Policy Evaluation algorithm
for fixed policy π : N × A → [0, 1]):

• Start with any Value Function V0 ∈ Rm


• Iterating over i = 0, 1, 2, . . ., calculate in each iteration:

Vi+1 = B π (Vi ) = Rπ + γP π · Vi

We stop the algorithm when d(Vi , Vi+1 ) = maxs∈N |(Vi − Vi+1 )(s)| is adequately small.
It pays to emphasize that Banach Fixed-Point Theorem not only assures convergence to
the unique solution V π (no matter what Value Function V0 we start the algorithm with), it
also assures a reasonable speed of convergence (dependent on the choice of starting Value
Function V0 and the choice of γ). Now let’s write the code for Policy Evaluation.
DEFAULT_TOLERANCE = 1e-5

V = Mapping[NonTerminal[S], float]

def evaluate_mrp(
    mrp: FiniteMarkovRewardProcess[S],
    gamma: float
) -> Iterator[np.ndarray]:

    def update(v: np.ndarray) -> np.ndarray:
        return mrp.reward_function_vec + gamma * \
            mrp.get_transition_matrix().dot(v)

    v_0: np.ndarray = np.zeros(len(mrp.non_terminal_states))

    return iterate(update, v_0)

def almost_equal_np_arrays(
    v1: np.ndarray,
    v2: np.ndarray,
    tolerance: float = DEFAULT_TOLERANCE
) -> bool:
    return max(abs(v1 - v2)) < tolerance

def evaluate_mrp_result(
    mrp: FiniteMarkovRewardProcess[S],
    gamma: float
) -> V[S]:
    v_star: np.ndarray = converged(
        evaluate_mrp(mrp, gamma=gamma),
        done=almost_equal_np_arrays
    )
    return {s: v_star[i] for i, s in enumerate(mrp.non_terminal_states)}

The code should be fairly self-explanatory. Since the Policy Evaluation problem applies
to Finite MRPs, the function evaluate_mrp above takes as input mrp: FiniteMarkovRewardProcess[S]
and a gamma: float to produce an Iterator on Value Functions represented as np.ndarray
(for fast vector/matrix calculations). The function update in evaluate_mrp represents the
application of the Bellman Policy Operator B π . The function evaluate_mrp_result pro-
duces the Value Function for the given mrp and the given gamma, returning the last value
function on the Iterator (which terminates based on the almost_equal_np_arrays func-
tion, considering the maximum of the absolute value differences across all states). Note
that the return type of evaluate_mrp_result is V[S] which is an alias for Mapping[NonTerminal[S],
float], capturing the semantic of N → R. Note that evaluate_mrp is useful for debugging
(by looking at the trace of value functions in the execution of the Policy Evaluation algo-
rithm) while evaluate_mrp_result produces the desired output Value Function.
Note that although we defined the Bellman Policy Operator B π as operating on Value
Functions of the π-implied MRP, we can also view the Bellman Policy Operator B π as

operating on Value Functions of an MDP. To support this MDP view, we express Equation
(3.1) in terms of the MDP transitions/rewards specification, as follows:

B^π(V)(s) = Σ_{a∈A} π(s, a) · R(s, a) + γ · Σ_{a∈A} π(s, a) · Σ_{s′∈N} P(s, a, s′) · V(s′)  for all s ∈ N    (3.2)

If the number of non-terminal states of a given MRP is m, then the running time of each
iteration is O(m2 ). Note though that to construct an MRP from a given MDP and a given
policy, we have to perform O(m2 · k) operations, where k = |A|.

3.5. Greedy Policy


We had said earlier that we will be presenting 3 Dynamic Programming Algorithms. The
first (Policy Evaluation), as we saw in the previous section, solves the MDP Prediction
problem. The other two (that we will present in the next two sections) solve the MDP Control
problem. This section is a stepping stone from Prediction to Control. In this section, we
define a function that is motivated by the idea of improving a value function/improving a
policy with a “greedy” technique. Formally, the Greedy Policy Function

G : R^m → (N → A)

interpreted as a function mapping a Value Function V (represented as a vector) to a
deterministic policy π′_D : N → A, is defined as:

G(V)(s) = π′_D(s) = arg max_{a∈A} {R(s, a) + γ · Σ_{s′∈N} P(s, a, s′) · V(s′)}  for all s ∈ N    (3.3)

Note that for any specific s, if two or more actions a achieve the maximization of
R(s, a) + γ · Σ_{s′∈N} P(s, a, s′) · V(s′), then we use an arbitrary rule in breaking ties and assigning a
single action a as the output of the above arg max operation. We shall use Equation (3.3)
in our mathematical exposition but we require a different (but equivalent) expression for
G(V )(s) to guide us with our code since the interface for FiniteMarkovDecisionProcess
operates on PR , rather than R and P. The equivalent expression for G(V )(s) is as follows:
G(V)(s) = arg max_{a∈A} {Σ_{s′∈S} Σ_{r∈D} P_R(s, a, r, s′) · (r + γ · W(s′))}  for all s ∈ N    (3.4)

where W ∈ R^n is defined as:

W(s′) = V(s′) if s′ ∈ N, and W(s′) = 0 if s′ ∈ T = S − N

Note that in Equation (3.4), because we have to work with PR , we need to consider
transitions to all states s′ ∈ S (versus transition to all states s′ ∈ N in Equation (3.3)), and
so, we need to handle the transitions to states s′ ∈ T carefully (essentially by using the W
function as described above).
Now let’s write some code to create this “greedy policy” from a given value function,
guided by Equation (3.4).

import operator
def extended_vf(v: V[S], s: State[S]) -> float:
def non_terminal_vf(st: NonTerminal[S], v=v) -> float:
return v[st]
return s.on_non_terminal(non_terminal_vf, 0.0)
def greedy_policy_from_vf(
mdp: FiniteMarkovDecisionProcess[S, A],
vf: V[S],
gamma: float
) -> FiniteDeterministicPolicy[S, A]:
greedy_policy_dict: Dict[S, A] = {}
for s in mdp.non_terminal_states:
q_values: Iterator[Tuple[A, float]] = \
((a, mdp.mapping[s][a].expectation(
lambda s_r: s_r[1] + gamma * extended_vf(vf, s_r[0])
)) for a in mdp.actions(s))
greedy_policy_dict[s.state] = \
max(q_values, key=operator.itemgetter(1))[0]
return FiniteDeterministicPolicy(greedy_policy_dict)

As you can see above, the function greedy_policy_from_vf loops through all the non-
terminal states that serve as keys in greedy_policy_dict: Dict[S, A]. Within this loop,
we go through all the actions in A(s) and compute the Q-Value Q(s, a) as the sum (over all
(s′, r) pairs) of P_R(s, a, r, s′) · (r + γ · W(s′)), written as E_{(s′,r)∼P_R}[r + γ · W(s′)]. Finally,
we calculate arg max_a Q(s, a) for all non-terminal states s, and return it as a FiniteDeterministicPolicy
(which is our greedy policy).
Note that the extended_vf function represents the W : S → R function used in the right-hand-
side of Equation (3.4): it is the usual value function when its argument is a non-terminal
state and is the default value of 0 when its argument is a terminal state. We
shall use the extended_vf function in other Dynamic Programming algorithms later in this
chapter as they also involve the W : S → R function in the right-hand-side of their corre-
sponding governing equation.
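As a quick illustrative check (the states below are hypothetical, purely for illustration), extended_vf behaves as follows:

# Hypothetical illustration of extended_vf:
v: V[str] = {NonTerminal("A"): 1.5, NonTerminal("B"): -2.0}
assert extended_vf(v, NonTerminal("A")) == 1.5  # non-terminal state: look up its value
assert extended_vf(v, Terminal("C")) == 0.0     # terminal state: default value of 0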
The word “Greedy” is a reference to the term “Greedy Algorithm,” which means an al-
gorithm that takes heuristic steps guided by locally-optimal choices in the hope of moving
towards a global optimum. Here, the reference to Greedy Policy means if we have a policy
π and its corresponding Value Function V π (obtained say using Policy Evaluation algo-
rithm), then applying the Greedy Policy function G on V^π gives us a deterministic policy
π′_D : N → A that is hopefully "better" than π in the sense that V^{π′_D} is "greater" than V^π.
We shall now make this statement precise and show how to use the Greedy Policy Function
to perform Policy Improvement.

3.6. Policy Improvement


Terms such as “better” or “improvement” refer to either Value Functions or to Policies (in
the latter case, to Value Functions of an MDP evaluated with the policies). So what does
it mean to say a Value Function X : N → R is “better” than a Value Function Y : N → R?
Here’s the answer:

Definition 3.6.1 (Value Function Comparison). We say X ≥ Y for Value Functions X, Y :


N → R of an MDP if and only if:

X(s) ≥ Y (s) for all s ∈ N

If we are dealing with finite MDPs (with m non-terminal states), we'd represent the
Value Functions as vectors X, Y ∈ R^m, and say that X ≥ Y if and only if X(s) ≥ Y(s) for
all s ∈ N .
So whenever you hear terms like “Better Value Function” or “Improved Value Function,”
you should interpret it to mean that the Value Function is no worse for each of the states
(versus the Value Function it’s being compared to).
So then, what about the claim of π′_D = G(V^π) being "better" than π? The following

important theorem by Richard Bellman (Bellman 1957b) provides the clarification:


Theorem 3.6.1 (Policy Improvement Theorem). For a finite MDP, for any policy π,
V^{π′_D} = V^{G(V^π)} ≥ V^π

Proof. This proof is based on application of the Bellman Policy Operator on Value Func-
tions of the given MDP (note: this MDP view of the Bellman Policy Operator is expressed
in Equation (3.2)). We start by noting that applying the Bellman Policy Operator B^{π′_D} re-
peatedly, starting with the Value Function V^π, will converge to the Value Function V^{π′_D}.
Formally,

lim_{i→∞} (B^{π′_D})^i(V^π) = V^{π′_D}

So the proof is complete if we prove that:

(B^{π′_D})^{i+1}(V^π) ≥ (B^{π′_D})^i(V^π) for all i = 0, 1, 2, . . .

which means we get a non-decreasing sequence of Value Functions [(B^{π′_D})^i(V^π) | i = 0, 1, 2, . . .]
with repeated applications of B^{π′_D} starting with the Value Function V^π.
Let us prove this by induction. The base case (for i = 0) of the induction is to prove
that:

B^{π′_D}(V^π) ≥ V^π

Note that for the case of the deterministic policy π′_D and Value Function V^π, Equation
(3.2) simplifies to:

B^{π′_D}(V^π)(s) = R(s, π′_D(s)) + γ · Σ_{s′∈N} P(s, π′_D(s), s′) · V^π(s′)  for all s ∈ N

From Equation (3.3), we know that for each s ∈ N , π′_D(s) = G(V^π)(s) is the action that
maximizes {R(s, a) + γ · Σ_{s′∈N} P(s, a, s′) · V^π(s′)}. Therefore,

B^{π′_D}(V^π)(s) = max_{a∈A} {R(s, a) + γ · Σ_{s′∈N} P(s, a, s′) · V^π(s′)} = max_{a∈A} Q^π(s, a)  for all s ∈ N

Let's compare this equation against the Bellman Policy Equation for π (below):

V^π(s) = Σ_{a∈A} π(s, a) · Q^π(s, a)  for all s ∈ N

We see that V^π(s) is a weighted average of Q^π(s, a) (with weights equal to the probabilities
π(s, a) over choices of a) while B^{π′_D}(V^π)(s) is the maximum (over choices of a) of Q^π(s, a).
Therefore,

B^{π′_D}(V^π) ≥ V^π

This establishes the base case of the proof by induction. Now to complete the proof, all
we have to do is to prove:

If (B^{π′_D})^{i+1}(V^π) ≥ (B^{π′_D})^i(V^π), then (B^{π′_D})^{i+2}(V^π) ≥ (B^{π′_D})^{i+1}(V^π), for all i = 0, 1, 2, . . .

Since (B^{π′_D})^{i+1}(V^π) = B^{π′_D}((B^{π′_D})^i(V^π)), from the definition of the Bellman Policy Oper-
ator (Equation (3.1)), we can write the following two equations:

(B^{π′_D})^{i+2}(V^π)(s) = R(s, π′_D(s)) + γ · Σ_{s′∈N} P(s, π′_D(s), s′) · (B^{π′_D})^{i+1}(V^π)(s′)  for all s ∈ N

(B^{π′_D})^{i+1}(V^π)(s) = R(s, π′_D(s)) + γ · Σ_{s′∈N} P(s, π′_D(s), s′) · (B^{π′_D})^i(V^π)(s′)  for all s ∈ N

Subtracting each side of the second equation from the first equation yields:

(B^{π′_D})^{i+2}(V^π)(s) − (B^{π′_D})^{i+1}(V^π)(s) = γ · Σ_{s′∈N} P(s, π′_D(s), s′) · ((B^{π′_D})^{i+1}(V^π)(s′) − (B^{π′_D})^i(V^π)(s′))  for all s ∈ N

Since γ · P(s, π′_D(s), s′) consists of all non-negative values and since the induction step
assumes (B^{π′_D})^{i+1}(V^π)(s′) ≥ (B^{π′_D})^i(V^π)(s′) for all s′ ∈ N , the right-hand-side of this
equation is non-negative, meaning the left-hand-side of this equation is non-negative, i.e.,

(B^{π′_D})^{i+2}(V^π)(s) ≥ (B^{π′_D})^{i+1}(V^π)(s)  for all s ∈ N

This completes the proof by induction.

The way to understand the above proof is to think in terms of how each stage of further
application of B^{π′_D} improves the Value Function. Stage 0 is when you have the Value
Function V^π where we execute the policy π throughout the MDP. Stage 1 is when you
have the Value Function B^{π′_D}(V^π) where from each state s, we execute the policy π′_D for
the first time step following s and then execute the policy π for all further time steps. This
has the effect of improving the Value Function from Stage 0 (V^π) to Stage 1 (B^{π′_D}(V^π)).
Stage 2 is when you have the Value Function (B^{π′_D})^2(V^π) where from each state s, we
execute the policy π′_D for the first two time steps following s and then execute the policy π
for all further time steps. This has the effect of improving the Value Function from Stage
1 (B^{π′_D}(V^π)) to Stage 2 ((B^{π′_D})^2(V^π)). And so on … each stage applies policy π′_D instead
of policy π for one extra time step, which has the effect of improving the Value Function.
Note that "improve" means ≥ (really means that the Value Function doesn't get worse for
any of the states). These stages are simply the iterations of the Policy Evaluation algorithm
(using policy π′_D) with starting Value Function V^π, building a non-decreasing sequence of
Value Functions [(B^{π′_D})^i(V^π) | i = 0, 1, 2, . . .] that get closer and closer until they converge
to the Value Function V^{π′_D} that is ≥ V^π (hence, the term Policy Improvement).
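Before moving on, here is a small self-contained numerical check of the base case of the above proof, on a toy 2-state, 2-action MDP. The transition probabilities, rewards and policy below are ours, chosen purely for illustration; this is not part of the book's code base.

# Numerical check of the Policy Improvement base case on a toy MDP.
import numpy as np

gamma = 0.9
# P[a][s][s'] and R[a][s]: transition probabilities and expected rewards per action
P = {0: np.array([[0.8, 0.2], [0.3, 0.7]]),
     1: np.array([[0.5, 0.5], [0.9, 0.1]])}
R = {0: np.array([1.0, 0.5]),
     1: np.array([0.0, 2.0])}
pi = np.array([[0.5, 0.5], [0.5, 0.5]])  # a stochastic policy: pi[s, a]

# V^pi solves the MRP Bellman Equation (I - gamma * P^pi) V = R^pi
P_pi = sum(pi[:, a].reshape(-1, 1) * P[a] for a in (0, 1))
R_pi = sum(pi[:, a] * R[a] for a in (0, 1))
v_pi = np.linalg.solve(np.eye(2) - gamma * P_pi, R_pi)

# Q^pi(s, a), the greedy policy G(V^pi), and one application of B^{pi'_D} to V^pi
q = np.stack([R[a] + gamma * P[a].dot(v_pi) for a in (0, 1)], axis=1)
greedy = q.argmax(axis=1)        # pi'_D(s) = G(V^pi)(s)
b_improved = q.max(axis=1)       # B^{pi'_D}(V^pi)(s) = max_a Q^pi(s, a)

assert np.all(b_improved >= v_pi - 1e-12)   # the base case, verified numerically
print(v_pi, b_improved, greedy)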
The Policy Improvement Theorem yields our first Dynamic Programming algorithm
(called Policy Iteration) to solve the MDP Control problem. The Policy Iteration algorithm
is due to Ronald Howard (Howard 1960).

Figure 3.1.: Policy Iteration Loop

3.7. Policy Iteration Algorithm


The proof of the Policy Improvement Theorem has shown us how to start with the Value
Function V^π (for a policy π), perform a greedy policy improvement to create a policy
π′_D = G(V^π), and then perform Policy Evaluation (with policy π′_D) with starting Value
Function V^π, resulting in the Value Function V^{π′_D} that is an improvement over the Value
Function V^π we started with. Now note that we can do the same process again to go from
π′_D and V^{π′_D} to an improved policy π′′_D and associated improved Value Function V^{π′′_D}.
And we can keep going in this way to create further improved policies and associated
Value Functions, until there is no further improvement. This methodology of performing
Policy Improvement together with Policy Evaluation using the improved policy, in an iter-
ative manner (depicted in Figure 3.1), is known as the Policy Iteration algorithm (shown
below).

• Start with any Value Function V0 ∈ Rm


• Iterating over j = 0, 1, 2, . . ., calculate in each iteration:

Deterministic Policy π_{j+1} = G(V_j)

Value Function V_{j+1} = lim_{i→∞} (B^{π_{j+1}})^i(V_j)

We perform these iterations (over j) until V_{j+1} is identical to V_j (i.e., there is no further
improvement to the Value Function). When this happens, the following should hold:

V_j = (B^{G(V_j)})^i(V_j) = V_{j+1} for all i = 0, 1, 2, . . .

In particular, this equation should hold for i = 1:

V_j(s) = B^{G(V_j)}(V_j)(s) = R(s, G(V_j)(s)) + γ · Σ_{s′∈N} P(s, G(V_j)(s), s′) · V_j(s′)  for all s ∈ N

Figure 3.2.: Policy Iteration Convergence

From Equation (3.3), we know that for each s ∈ N , π_{j+1}(s) = G(V_j)(s) is the action that
maximizes {R(s, a) + γ · Σ_{s′∈N} P(s, a, s′) · V_j(s′)}. Therefore,

V_j(s) = max_{a∈A} {R(s, a) + γ · Σ_{s′∈N} P(s, a, s′) · V_j(s′)}  for all s ∈ N

But this in fact is the MDP Bellman Optimality Equation, which would mean that V_j =
V^∗, i.e., when V_{j+1} is identical to V_j, the Policy Iteration algorithm has converged to the
Optimal Value Function. The associated deterministic policy at the convergence of the
Policy Iteration algorithm (π_j : N → A) is an Optimal Policy because V^{π_j} = V_j = V^∗,
meaning that evaluating the MDP with the deterministic policy πj achieves the Optimal
Value Function (depicted in Figure 3.2). This means the Policy Iteration algorithm solves
the MDP Control problem. This proves the following Theorem:

Theorem 3.7.1 (Policy Iteration Convergence Theorem). For a Finite MDP with |N | = m
and γ < 1, the Policy Iteration algorithm converges to the Optimal Value Function V^∗ ∈ R^m along
with a Deterministic Optimal Policy π^∗_D : N → A, no matter which Value Function V_0 ∈ R^m we
start the algorithm with.

Now let’s write some code for Policy Iteration Algorithm. Unlike Policy Evaluation
which repeatedly operates on Value Functions (and returns a Value Function), Policy Itera-
tion repeatedly operates on a pair of Value Function and Policy (and returns a pair of Value
Function and Policy). In the code below, notice the type Tuple[V[S], FinitePolicy[S,
A]] that represents a pair of Value Function and Policy. The function policy_iteration
repeatedly applies the function update on a pair of Value Function and Policy. The update
function, after splitting its input vf_policy into vf: V[S] and pi: FinitePolicy[S, A],
creates an MRP (mrp: FiniteMarkovRewardProcess[S]) from the combination of the input
mdp and pi. Then it performs a policy evaluation on mrp (using the evaluate_mrp_result
function) to produce a Value Function policy_vf: V[S], and finally creates a greedy (im-
proved) policy named improved_pi from policy_vf (using the previously-written func-
tion greedy_policy_from_vf). Thus the function update performs a Policy Evaluation fol-
lowed by a Policy Improvement. Notice also that policy_iteration offers the option to
perform the linear-algebra-solver-based computation of Value Function for a given policy
(get_value_function_vec method of the mrp object), in case the state space is not too large.
policy_iteration returns an Iterator on pairs of Value Function and Policy produced by
this process of repeated Policy Evaluation and Policy Improvement. almost_equal_vf_pis
is the function to decide termination based on the distance between two successive Value
Functions produced by Policy Iteration. policy_iteration_result returns the final (opti-
mal) pair of Value Function and Policy (from the Iterator produced by policy_iteration),
based on the termination criterion of almost_equal_vf_pis.

DEFAULT_TOLERANCE = 1e-5

def policy_iteration(
mdp: FiniteMarkovDecisionProcess[S, A],
gamma: float,
matrix_method_for_mrp_eval: bool = False
) -> Iterator[Tuple[V[S], FinitePolicy[S, A]]]:
def update(vf_policy: Tuple[V[S], FinitePolicy[S, A]])\
-> Tuple[V[S], FiniteDeterministicPolicy[S, A]]:
vf, pi = vf_policy
mrp: FiniteMarkovRewardProcess[S] = mdp.apply_finite_policy(pi)
policy_vf: V[S] = {mrp.non_terminal_states[i]: v for i, v in
enumerate(mrp.get_value_function_vec(gamma))}\
if matrix_method_for_mrp_eval else evaluate_mrp_result(mrp, gamma)
improved_pi: FiniteDeterministicPolicy[S, A] = greedy_policy_from_vf(
mdp,
policy_vf,
gamma
)
return policy_vf, improved_pi
v_0: V[S] = {s: 0.0 for s in mdp.non_terminal_states}
pi_0: FinitePolicy[S, A] = FinitePolicy(
{s.state: Choose(mdp.actions(s)) for s in mdp.non_terminal_states}
)
return iterate(update, (v_0, pi_0))
def almost_equal_vf_pis(
x1: Tuple[V[S], FinitePolicy[S, A]],
x2: Tuple[V[S], FinitePolicy[S, A]]
) -> bool:
return max(
abs(x1[0][s] - x2[0][s]) for s in x1[0]
) < DEFAULT_TOLERANCE
def policy_iteration_result(
mdp: FiniteMarkovDecisionProcess[S, A],
gamma: float,
) -> Tuple[V[S], FiniteDeterministicPolicy[S, A]]:
return converged(policy_iteration(mdp, gamma), done=almost_equal_vf_pis)

If the number of non-terminal states of a given MDP is m and the number of actions
(|A|) is k, then the running time of Policy Improvement is O(m2 · k) and we’ve already
seen before that each iteration of Policy Evaluation is O(m2 · k).

3.8. Bellman Optimality Operator and Value Iteration Algorithm


By making a small tweak to the definition of Greedy Policy Function in Equation (3.3)
(changing the arg max to max), we define the Bellman Optimality Operator

B ∗ : Rm → Rm

as the following (non-linear) transformation of a vector (representing a Value Function)


in the vector space Rm
B^∗(V)(s) = max_{a∈A} {R(s, a) + γ · Σ_{s′∈N} P(s, a, s′) · V(s′)}  for all s ∈ N    (3.5)

We shall use Equation (3.5) in our mathematical exposition but we require a different
(but equivalent) expression for B ∗ (V )(s) to guide us with our code since the interface

for FiniteMarkovDecisionProcess operates on PR , rather than R and P. The equivalent
expression for B ∗ (V )(s) is as follows:
B^∗(V)(s) = max_{a∈A} {Σ_{s′∈S} Σ_{r∈D} P_R(s, a, r, s′) · (r + γ · W(s′))}  for all s ∈ N    (3.6)

where W ∈ R^n is defined (same as in the case of Equation (3.4)) as:

W(s′) = V(s′) if s′ ∈ N, and W(s′) = 0 if s′ ∈ T = S − N
Note that in Equation (3.6), because we have to work with PR , we need to consider
transitions to all states s′ ∈ S (versus transition to all states s′ ∈ N in Equation (3.5)), and
so, we need to handle the transitions to states s′ ∈ T carefully (essentially by using the W
function as described above).
For each s ∈ N , the action a ∈ A that produces the maximization in (3.5) is the action
prescribed by the deterministic policy πD in (3.3). Therefore, if we apply the Bellman
Policy Operator on any Value Function V ∈ Rm using the Greedy Policy G(V ), it should
be identical to applying the Bellman Optimality Operator. Therefore,

B G(V ) (V ) = B ∗ (V ) for all V ∈ Rm (3.7)


In particular, it’s interesting to observe that by specializing V to be the Value Function
π
V for a policy π, we get:
B G(V ) (V π ) = B ∗ (V π )
π

which is a succinct representation of the first stage of Policy Evaluation with an improved
policy G(V π ) (note how all three of Bellman Policy Operator, Bellman Optimality Opera-
tor and Greedy Policy Function come together in this equation).
Much like how the Bellman Policy Operator B π was motivated by the MDP Bellman
Policy Equation (equivalently, the MRP Bellman Equation), Bellman Optimality Operator
B ∗ is motivated by the MDP Bellman Optimality Equation (re-stated below):
V^∗(s) = max_{a∈A} {R(s, a) + γ · Σ_{s′∈N} P(s, a, s′) · V^∗(s′)}  for all s ∈ N

Therefore, we can express the MDP Bellman Optimality Equation succinctly as:

V ∗ = B ∗ (V ∗ )
which means V ∗ ∈ Rm is a Fixed-Point of the Bellman Optimality Operator B ∗ : Rm →
Rm .
Note that the definitions of the Greedy Policy Function and of the Bellman Optimality
Operator that we have provided can be generalized to non-finite MDPs, and consequently
we can generalize Equation (3.7) and the statement that V ∗ is a Fixed-Point of the Bellman
Optimality Operator would still hold. However, in this chapter, since we are focused on
developing algorithms for finite MDPs, we shall stick to the definitions we’ve provided for
the case of finite MDPs.
Much like how we proved that B π is a contraction function, we want to prove that B ∗
is a contraction function (under L∞ norm) so we can take advantage of Banach Fixed-
Point Theorem and solve the Control problem by iterative applications of the Bellman
Optimality Operator B ∗ . So we need to prove that for all X, Y ∈ Rm ,

max_{s∈N} |(B^∗(X) − B^∗(Y))(s)| ≤ γ · max_{s∈N} |(X − Y)(s)|

This proof is a bit harder than the proof we did for B π . Here we need to utilize two key
properties of B ∗ .

1. Monotonicity Property, i.e., for all X, Y ∈ R^m,

If X(s) ≥ Y(s) for all s ∈ N , then B^∗(X)(s) ≥ B^∗(Y)(s) for all s ∈ N

Observe that for each state s ∈ N and each action a ∈ A,

{R(s, a) + γ · Σ_{s′∈N} P(s, a, s′) · X(s′)} − {R(s, a) + γ · Σ_{s′∈N} P(s, a, s′) · Y(s′)} = γ · Σ_{s′∈N} P(s, a, s′) · (X(s′) − Y(s′)) ≥ 0

Therefore, for each state s ∈ N ,

B^∗(X)(s) − B^∗(Y)(s) = max_{a∈A} {R(s, a) + γ · Σ_{s′∈N} P(s, a, s′) · X(s′)} − max_{a∈A} {R(s, a) + γ · Σ_{s′∈N} P(s, a, s′) · Y(s′)} ≥ 0

2. Constant Shift Property, i.e., for all X ∈ Rm , c ∈ R,

B ∗ (X + c)(s) = B ∗ (X)(s) + γc for all s ∈ N

In the above statement, adding a constant (∈ R) to a Value Function (∈ Rm ) adds the


constant point-wise to all states of the Value Function (to all dimensions of the vector
representing the Value Function). In other words, a constant ∈ R might as well be
treated as a Value Function with the same (constant) value for all states. Therefore,

B^∗(X + c)(s) = max_{a∈A} {R(s, a) + γ · Σ_{s′∈N} P(s, a, s′) · (X(s′) + c)}
             = max_{a∈A} {R(s, a) + γ · Σ_{s′∈N} P(s, a, s′) · X(s′)} + γc = B^∗(X)(s) + γc

With these two properties of B ∗ in place, let’s prove that B ∗ is a contraction function.
For given X, Y ∈ Rm , assume:

max_{s∈N} |(X − Y)(s)| = c

We can rewrite this as:

X(s) − c ≤ Y (s) ≤ X(s) + c for all s ∈ N

Since B ∗ has the monotonicity property, we can apply B ∗ throughout the above double-
inequality.
B ∗ (X − c)(s) ≤ B ∗ (Y )(s) ≤ B ∗ (X + c)(s) for all s ∈ N

Since B ∗ has the constant shift property,

B ∗ (X)(s) − γc ≤ B ∗ (Y )(s) ≤ B ∗ (X)(s) + γc for all s ∈ N

In other words,

max_{s∈N} |(B^∗(X) − B^∗(Y))(s)| ≤ γc = γ · max_{s∈N} |(X − Y)(s)|

So invoking Banach Fixed-Point Theorem proves the following Theorem:


Theorem 3.8.1 (Value Iteration Convergence Theorem). For a Finite MDP with |N | = m
and γ < 1, if V ∗ ∈ Rm is the Optimal Value Function, then V ∗ is the unique Fixed-Point of the
Bellman Optimality Operator B ∗ : Rm → Rm , and

lim_{i→∞} (B^∗)^i(V_0) = V^∗ for all starting Value Functions V_0 ∈ R^m

This gives us the following iterative algorithm, known as the Value Iteration algorithm,
due to Richard Bellman (Bellman 1957a):

• Start with any Value Function V0 ∈ Rm


• Iterating over i = 0, 1, 2, . . ., calculate in each iteration:

Vi+1 (s) = B ∗ (Vi )(s) for all s ∈ N

We stop the algorithm when d(V_i, V_{i+1}) = max_{s∈N} |(V_i − V_{i+1})(s)| is adequately small.
It pays to emphasize that Banach Fixed-Point Theorem not only assures convergence to
the unique solution V ∗ (no matter what Value Function V0 we start the algorithm with), it
also assures a reasonable speed of convergence (dependent on the choice of starting Value
Function V0 and the choice of γ).
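A standard quantitative consequence of the γ-contraction property (not spelled out above, but immediate from the Banach Fixed-Point Theorem) is the geometric error bound:

max_{s∈N} |((B^∗)^i(V_0) − V^∗)(s)| ≤ γ^i · max_{s∈N} |(V_0 − V^∗)(s)|

So, for example, to shrink an initial error of c down to a tolerance ε, roughly log(ε/c)/log(γ) iterations suffice (about 130 iterations for γ = 0.9, c = 100, ε = 10^{-4}). The same bound, with B^π in place of B^∗, applies to the Policy Evaluation algorithm presented earlier.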

3.9. Optimal Policy from Optimal Value Function


Note that the Policy Iteration algorithm produces a policy together with a Value Function
in each iteration. So, in the end, when we converge to the Optimal Value Function Vj = V ∗
in iteration j, the Policy Iteration algorithm has a deterministic policy πj associated with
Vj such that:
V_j = V^{π_j} = V^∗

and we refer to π_j as the Optimal Policy π^∗, one that yields the Optimal Value Function
V^∗, i.e.,

V^{π^∗} = V^∗
But Value Iteration has no such policy associated with it since the entire algorithm is
devoid of a policy representation and operates only with Value Functions. So now the
question is: when Value Iteration converges to the Optimal Value Function Vi = V ∗ in
iteration i, how do we get hold of an Optimal Policy π ∗ such that:

V^{π^∗} = V_i = V^∗

The answer lies in the Greedy Policy function G. Equation (3.7) told us that:

B G(V ) (V ) = B ∗ (V ) for all V ∈ Rm

Specializing V to be V^∗, we get:

B^{G(V^∗)}(V^∗) = B^∗(V^∗)

But we know that V^∗ is the Fixed-Point of the Bellman Optimality Operator B^∗, i.e., B^∗(V^∗) =
V^∗. Therefore,

B^{G(V^∗)}(V^∗) = V^∗

The above equation says V^∗ is the Fixed-Point of the Bellman Policy Operator B^{G(V^∗)}.
However, we know that B^{G(V^∗)} has a unique Fixed-Point equal to V^{G(V^∗)}. Therefore,

V^{G(V^∗)} = V^∗

This says that evaluating the MDP with the deterministic greedy policy G(V ∗ ) (policy
created from the Optimal Value Function V ∗ using the Greedy Policy Function G) in fact
achieves the Optimal Value Function V ∗ . In other words, G(V ∗ ) is the (Deterministic)
Optimal Policy π ∗ we’ve been seeking.
Now let’s write the code for Value Iteration. The function value_iteration returns an
Iterator on Value Functions (of type V[S]) produced by the Value Iteration algorithm. It
uses the function update for application of the Bellman Optimality Operator. update pre-
pares the Q-Values for a state by looping through all the allowable actions for the state, and
then calculates the maximum of those Q-Values (over the actions). The Q-Value calcula-
tion is same as what we saw in greedy_policy_from_vf: E(s′ ,r)∼PR [r + γ · W (s′ )], using the
PR probabilities represented in the mapping attribute of the mdp object (essentially Equation
(3.6)). Note the use of the previously-written function extended_vf to handle the function
W : S → R that appears in the definition of Bellman Optimality Operator in Equation
(3.6). The function value_iteration_result returns the final (optimal) Value Function,
together with it’s associated Optimal Policy. It simply returns the last Value Function of
the Iterator[V[S]] returned by value_iteration, using the termination condition speci-
fied in almost_equal_vfs.

DEFAULT_TOLERANCE = 1e-5
def value_iteration(
mdp: FiniteMarkovDecisionProcess[S, A],
gamma: float
) -> Iterator[V[S]]:
def update(v: V[S]) -> V[S]:
return {s: max(mdp.mapping[s][a].expectation(
lambda s_r: s_r[1] + gamma * extended_vf(v, s_r[0])
) for a in mdp.actions(s)) for s in v}
v_0: V[S] = {s: 0.0 for s in mdp.non_terminal_states}
return iterate(update, v_0)
def almost_equal_vfs(
v1: V[S],
v2: V[S],
tolerance: float = DEFAULT_TOLERANCE
) -> bool:
return max(abs(v1[s] - v2[s]) for s in v1) < tolerance
def value_iteration_result(
mdp: FiniteMarkovDecisionProcess[S, A],
gamma: float
) -> Tuple[V[S], FiniteDeterministicPolicy[S, A]]:
opt_vf: V[S] = converged(
value_iteration(mdp, gamma),
done=almost_equal_vfs

)
opt_policy: FiniteDeterministicPolicy[S, A] = greedy_policy_from_vf(
mdp,
opt_vf,
gamma
)
return opt_vf, opt_policy

If the number of non-terminal states of a given MDP is m and the number of actions
(|A|) is k, then the running time of each iteration of Value Iteration is O(m2 · k).
We encourage you to play with the above implementations of Policy Evaluation, Policy
Iteration and Value Iteration (code in the file rl/dynamic_programming.py) by running it
on MDPs/Policies of your choice, and observing the traces of the algorithms.

3.10. Revisiting the Simple Inventory Example


Let’s revisit the simple inventory example. We shall consider the version with a space
capacity since we want an example of a FiniteMarkovDecisionProcess. It will help us test
our code for Policy Evaluation, Policy Iteration and Value Iteration. More importantly, it
will help us identify the mathematical structure of the optimal policy of ordering for this
store inventory problem. So let’s take another look at the code we wrote in Chapter 2 to
set up an instance of a SimpleInventoryMDPCap and a FiniteDeterministicPolicy (that we
can use for Policy Evaluation).
user_capacity = 2
user_poisson_lambda = 1.0
user_holding_cost = 1.0
user_stockout_cost = 10.0
si_mdp: FiniteMarkovDecisionProcess[InventoryState, int] =\
SimpleInventoryMDPCap(
capacity=user_capacity,
poisson_lambda=user_poisson_lambda,
holding_cost=user_holding_cost,
stockout_cost=user_stockout_cost
)
fdp: FiniteDeterministicPolicy[InventoryState, int] = \
FiniteDeterministicPolicy(
{InventoryState(alpha, beta): user_capacity - (alpha + beta)
for alpha in range(user_capacity + 1)
for beta in range(user_capacity + 1 - alpha)}
)

Now let’s write some code to evaluate si_mdp with the policy fdp.
from pprint import pprint
implied_mrp: FiniteMarkovRewardProcess[InventoryState] =\
si_mdp.apply_finite_policy(fdp)
user_gamma = 0.9
pprint(evaluate_mrp_result(implied_mrp, gamma=user_gamma))

This prints the following Value Function.

{NonTerminal(state=InventoryState(on_hand=0, on_order=0)): -35.510518165628724,


NonTerminal(state=InventoryState(on_hand=0, on_order=1)): -27.93217421014731,
NonTerminal(state=InventoryState(on_hand=0, on_order=2)): -28.345029758390766,
NonTerminal(state=InventoryState(on_hand=1, on_order=0)): -28.93217421014731,

NonTerminal(state=InventoryState(on_hand=1, on_order=1)): -29.345029758390766,
NonTerminal(state=InventoryState(on_hand=2, on_order=0)): -30.345029758390766}

Next, let’s run Policy Iteration.


opt_vf_pi, opt_policy_pi = policy_iteration_result(
si_mdp,
gamma=user_gamma
)
pprint(opt_vf_pi)
print(opt_policy_pi)

This prints the following Optimal Value Function and Optimal Policy.

{NonTerminal(state=InventoryState(on_hand=1, on_order=0)): -28.660960231637507,


NonTerminal(state=InventoryState(on_hand=0, on_order=2)): -27.991900091403533,
NonTerminal(state=InventoryState(on_hand=0, on_order=1)): -27.660960231637507,
NonTerminal(state=InventoryState(on_hand=0, on_order=0)): -34.894855781630035,
NonTerminal(state=InventoryState(on_hand=1, on_order=1)): -28.991900091403533,
NonTerminal(state=InventoryState(on_hand=2, on_order=0)): -29.991900091403533}

For State InventoryState(on_hand=0, on_order=0): Do Action 1


For State InventoryState(on_hand=0, on_order=1): Do Action 1
For State InventoryState(on_hand=0, on_order=2): Do Action 0
For State InventoryState(on_hand=1, on_order=0): Do Action 1
For State InventoryState(on_hand=1, on_order=1): Do Action 0
For State InventoryState(on_hand=2, on_order=0): Do Action 0

As we can see, the Optimal Policy is to not order if the Inventory Position (sum of On-
Hand and On-Order) is greater than 1 unit and to order 1 unit if the Inventory Position is
0 or 1. Finally, let’s run Value Iteration.
opt_vf_vi, opt_policy_vi = value_iteration_result(si_mdp, gamma=user_gamma)
pprint(opt_vf_vi)
print(opt_policy_vi)

You’ll see the output from Value Iteration matches the output produced from Policy
Iteration - this is a good validation of our code correctness. We encourage you to play
around with user_capacity, user_poisson_lambda, user_holding_cost, user_stockout_cost
and user_gamma (code in __main__ in rl/chapter3/simple_inventory_mdp_cap.py). As a
valuable exercise, using this code, discover the mathematical structure of the Optimal Pol-
icy as a function of the above inputs.
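As a starting point for that exercise, here is a hedged sketch (our own, not in the book's repository) that sweeps over a few values of the stockout cost and prints the resulting Optimal Policy, reusing the objects and functions defined above; the specific parameter values are arbitrary.

# Illustrative parameter sweep to inspect how the Optimal Policy changes with
# the stockout cost, reusing the API shown above.
for sc in [2.0, 5.0, 10.0, 20.0]:
    mdp = SimpleInventoryMDPCap(
        capacity=user_capacity,
        poisson_lambda=user_poisson_lambda,
        holding_cost=user_holding_cost,
        stockout_cost=sc
    )
    _, opt_pi = value_iteration_result(mdp, gamma=user_gamma)
    print(f"--- stockout_cost = {sc} ---")
    print(opt_pi)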

3.11. Generalized Policy Iteration


In this section, we dig into the structure of the Policy Iteration algorithm and show how
this structure can be generalized. Let us start by looking at a 2-dimensional layout of how
the Value Functions progress in Policy Iteration from the starting Value Function V0 to the
final Value Function V ∗ .

π_1 = G(V_0):  V_0 → B^{π_1}(V_0) → (B^{π_1})^2(V_0) → . . . → (B^{π_1})^i(V_0) → . . . → V^{π_1} = V_1

π_2 = G(V_1):  V_1 → B^{π_2}(V_1) → (B^{π_2})^2(V_1) → . . . → (B^{π_2})^i(V_1) → . . . → V^{π_2} = V_2

. . .

π_{j+1} = G(V_j):  V_j → B^{π_{j+1}}(V_j) → (B^{π_{j+1}})^2(V_j) → . . . → (B^{π_{j+1}})^i(V_j) → . . . → V^{π_{j+1}} = V^∗

Each row in the layout above represents the progression of the Value Function for a
specific policy. Each row starts with the creation of the policy (for that row) using the
Greedy Policy Function G, and the remainder of the row consists of successive applications
of the Bellman Policy Operator (using that row’s policy) until convergence to the Value
Function for that row’s policy. So each row starts with a Policy Improvement and the rest
of the row is a Policy Evaluation. Notice how the end of one row dovetails into the start
of the next row with application of the Greedy Policy Function G. It’s also important to
recognize that Greedy Policy Function as well as Bellman Policy Operator apply to all states
in N . So, in fact, the entire Policy Iteration algorithm has 3 nested loops. The outermost
loop is over the rows in this 2-dimensional layout (each iteration in this outermost loop
creates an improved policy). The loop within this outermost loop is over the columns in
each row (each iteration in this loop applies the Bellman Policy Operator, i.e. the iterations
of Policy Evaluation). The innermost loop is over each state in N since we need to sweep
through all states in updating the Value Function when the Bellman Policy Operator is
applied on a Value Function (we also need to sweep through all states in applying the
Greedy Policy Function to improve the policy).
A higher-level view of Policy Iteration is to think of Policy Evaluation and Policy Im-
provement going back and forth iteratively - Policy Evaluation takes a policy and creates
the Value Function for that policy, while Policy Improvement takes a Value Function and
creates a Greedy Policy from it (that is improved relative to the previous policy). This was
depicted in Figure 3.1. It is important to recognize that this loop of Policy Evaluation and
Policy Improvement works to make the Value Function and the Policy increasingly con-
sistent with each other, until we reach convergence when the Value Function and Policy
become completely consistent with each other (as was illustrated in Figure 3.2).
We’d also like to share a visual of Policy Iteration that is quite popular in much of the
literature on Dynamic Programming, originally appearing in Sutton and Barto’s RL book
(Richard S. Sutton and Barto 2018). It is the visual of Figure 3.3. It’s a somewhat fuzzy sort
of visual, but it has it’s benefits in terms of pedagogy of Policy Iteration. The idea behind
this image is that the lower line represents the “policy line” indicating the progression of
the policies as Policy Iteration algorithm moves along and the upper line represents the
“value function line” indicating the progression of the Value Functions as Policy Itera-
tion algorithm moves along. The arrows pointing towards the upper line (“value function
line”) represent a Policy Evaluation for a given policy π, yielding the point (Value Func-
tion) V π on the upper line. The arrows pointing towards the lower line (“policy line”)
represent a Greedy Policy Improvement from a Value Function V π , yielding the point
(policy) π ′ = G(V π ) on the lower line. The key concept here is that Policy Evaluation
(arrows pointing to upper line) and Policy Improvement (arrows pointing to lower line)
are “competing” - they “push in different directions” even as they aim to get the Value
Function and Policy to be consistent with each other. This concept of simultaneously try-
ing to compete and trying to be consistent might seem confusing and contradictory, so it
deserves a proper explanation. Things become clear by noting that there are actually two
notions of consistency between a Value Function V and Policy π.

Figure 3.3.: Progression Lines of Value Function and Policy in Policy Iteration (Image
Credit: Sutton-Barto’s RL Book)

1. The notion of the Value Function V being consistent with/close to the Value Function
V π of the policy π.
2. The notion of the Policy π being consistent with/close to the Greedy Policy G(V ) of
the Value Function V .

Policy Evaluation aims for the first notion of consistency, but in the process, makes it
worse in terms of the second notion of consistency. Policy Improvement aims for the sec-
ond notion of consistency, but in the process, makes it worse in terms of the first notion
of consistency. This also helps us understand the rationale for alternating between Policy
Evaluation and Policy Improvement so that neither of the above two notions of consistency
slip up too much (thanks to the alternating propping up of the two notions of consistency).
Also, note that as Policy Iteration progresses, the upper line and lower line get closer and
closer and the “pushing in different directions” looks more and more collaborative rather
than competing (the gaps in consistency become lesser and lesser). In the end, the two
lines intersect, when there is no more pushing to do for either of Policy Evaluation or Policy
Improvement since at convergence, π ∗ and V ∗ have become completely consistent.
Now we are ready to talk about a very important idea known as Generalized Policy Iter-
ation that is emphasized throughout Sutton and Barto’s RL book (Richard S. Sutton and
Barto 2018) as the perspective that unifies all variants of DP as well as RL algorithms.
Generalized Policy Iteration is the idea that we can evaluate the Value Function for a pol-
icy with any Policy Evaluation method, and we can improve a policy with any Policy Im-
provement method (not necessarily the methods used in the classical Policy Iteration DP
algorithm). In particular, we’d like to emphasize the idea that neither of Policy Evalua-
tion and Policy Improvement need to go fully towards the notion of consistency they are
respectively striving for. As a simple example, think of modifying Policy Evaluation (say

for a policy π) to not go all the way to V π , but instead just perform say 3 Bellman Policy
Evaluations. This means it would partially bridge the gap on the first notion of consistency
(getting closer to V π but not go all the way to V π ), but it would also mean not slipping
up too much on the second notion of consistency. As another example, think of updating
just 5 of the states (say in a large state space) with the Greedy Policy Improvement func-
tion (rather than the normal Greedy Policy Improvement function that operates on all the
states). This means it would partially bridge the gap on the second notion of consistency
(getting closer to G(V π ) but not go all the way to G(V π )), but it would also mean not slip-
ping up too much on the first notion of consistency. A concrete example of Generalized
Policy Iteration is in fact Value Iteration. In Value Iteration, we apply the Bellman Policy
Operator just once before moving on to Policy Improvement. In a 2-dimensional layout,
this is what Value Iteration looks like:

π_1 = G(V_0):  V_0 → B^{π_1}(V_0) = V_1

π_2 = G(V_1):  V_1 → B^{π_2}(V_1) = V_2

. . .

π_{j+1} = G(V_j):  V_j → B^{π_{j+1}}(V_j) = V^∗
So the greedy policy improvement step is unchanged, but Policy Evaluation is reduced
to just a single Bellman Policy Operator application. In fact, pretty much all control al-
gorithms in Reinforcement Learning can be viewed as special cases of Generalized Policy
Iteration. In some of the simple versions of Reinforcement Learning Control algorithms,
the Policy Evaluation step is done for just a single state (versus for all states in usual Policy
Iteration, or even in Value Iteration) and the Policy Improvement step is also done for just
a single state. So essentially these Reinforcement Learning Control algorithms are an al-
ternating sequence of single-state policy evaluation and single-state policy improvement
(where the single-state is the state produced by sampling or the state that is encountered
in a real-world environment interaction). Figure 3.4 illustrates Generalized Policy Itera-
tion as the shorter-length arrows (versus the longer-length arrows seen in Figure 3.3 for
the usual Policy Iteration algorithm). Note how these shorter-length arrows don’t go all
the way to either the “value function line” or the “policy line” but they do go some part
of the way towards the line they are meant to go towards at that stage in the algorithm.
We would go so far as to say that the Bellman Equations and the concept of General-
ized Policy Iteration are the two most important concepts to internalize in the study of
Reinforcement Learning, and we highly encourage you to think along the lines of these
two ideas when we present several algorithms later in this book. The importance of the
concept of Generalized Policy Iteration (GPI) might not be fully visible to you yet, but we
hope that GPI will be your mantra by the time you finish this book. For now, let’s just note
the key takeaway regarding GPI - it is any algorithm to solve MDP control that alternates
between some form of Policy Evaluation and some form of Policy Improvement. We will
bring up GPI several times later in this book.
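To make the GPI idea concrete in code, here is a hedged sketch of one such variant (sometimes called Modified Policy Iteration), written against the interfaces used earlier in this chapter; the function name and the choice of a fixed number of evaluation sweeps are ours, not the book's. Policy Evaluation is cut short at num_sweeps applications of B^π before each Policy Improvement.

# A sketch of Modified Policy Iteration (a GPI variant): Policy Evaluation is
# truncated to `num_sweeps` applications of the Bellman Policy Operator, instead
# of being run to convergence. Assumes the same context as this chapter's code
# (numpy imported as np, plus the classes/functions defined above).
def modified_policy_iteration(
    mdp: FiniteMarkovDecisionProcess[S, A],
    gamma: float,
    num_sweeps: int = 3
) -> Iterator[Tuple[V[S], FinitePolicy[S, A]]]:
    def update(vf_policy: Tuple[V[S], FinitePolicy[S, A]]
               ) -> Tuple[V[S], FiniteDeterministicPolicy[S, A]]:
        vf, pi = vf_policy
        mrp: FiniteMarkovRewardProcess[S] = mdp.apply_finite_policy(pi)
        v: np.ndarray = np.array([vf[s] for s in mrp.non_terminal_states])
        for _ in range(num_sweeps):   # truncated Policy Evaluation
            v = mrp.reward_function_vec + gamma * \
                mrp.get_transition_matrix().dot(v)
        partial_vf: V[S] = {s: v[i] for i, s in
                            enumerate(mrp.non_terminal_states)}
        return partial_vf, greedy_policy_from_vf(mdp, partial_vf, gamma)
    v_0: V[S] = {s: 0.0 for s in mdp.non_terminal_states}
    pi_0: FinitePolicy[S, A] = FinitePolicy(
        {s.state: Choose(mdp.actions(s)) for s in mdp.non_terminal_states}
    )
    return iterate(update, (v_0, pi_0))

With num_sweeps=1 this essentially reduces to Value Iteration, and as num_sweeps grows large it recovers the Policy Iteration algorithm described earlier in this chapter.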

3.12. Asynchronous Dynamic Programming


The classical Dynamic Programming algorithms we have described in this chapter are
qualified as Synchronous Dynamic Programming algorithms. The word synchronous refers
to two things:

Figure 3.4.: Progression Lines of Value Function and Policy in Generalized Policy Iteration
(Image Credit: Coursera Course on Fundamentals of RL)

1. All states’ values are updated in each iteration


2. The mathematical description of the algorithms corresponds to all the states’ value
updates to occur simultaneously. However, when implementing in code (in Python,
where computation is serial and not parallel), this “simultaneous update” would be
done by creating a new copy of the Value Function vector and sweeping through all
states to assign values to the new copy from the values in the old copy.

In practice, Dynamic Programming algorithms are typically implemented as Asynchronous


algorithms, where the above two constraints (all states updated simultaneously) are re-
laxed. The term asynchronous affords a lot of flexibility - we can update a subset of states in
each iteration, and we can update states in any order we like. A natural outcome of this re-
laxation of the synchronous constraint is that we can maintain just one vector for the value
function and update the values in-place. This has considerable benefits - an updated value
for a state is immediately available for updates of other states (note: in synchronous, with
the old and new value function vectors, one has to wait for the entire states sweep to be
over until an updated state value is available for another state’s update). In fact, in-place
updates of value function is the norm in practical implementations of algorithms to solve
the MDP Control problem.
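As a concrete illustration of in-place updates, here is a minimal, self-contained sketch of "Gauss-Seidel-style" Value Iteration on a plain-dict MDP specification (the dict format below is purely illustrative and is not the book's class hierarchy).

# In-place (asynchronous) Value Iteration sketch. transitions[s][a] is a list of
# (probability, next_state, reward) triples; terminal states are simply absent
# from `transitions` and contribute 0 to future value.
from typing import Dict, List, Mapping, Tuple

Transitions = Mapping[str, Mapping[str, List[Tuple[float, str, float]]]]

def in_place_value_iteration(
    transitions: Transitions,
    gamma: float,
    tolerance: float = 1e-5
) -> Dict[str, float]:
    v: Dict[str, float] = {s: 0.0 for s in transitions}   # single value-function vector
    while True:
        max_change = 0.0
        for s, actions in transitions.items():            # sweep states in any order
            new_v = max(
                sum(p * (r + gamma * v.get(s1, 0.0)) for p, s1, r in outcomes)
                for outcomes in actions.values()
            )
            max_change = max(max_change, abs(new_v - v[s]))
            v[s] = new_v   # updated value is immediately visible to later updates
        if max_change < tolerance:
            return v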
Another feature of practical asynchronous algorithms is that we can prioritize the order
in which state values are updated. There are many ways in which algorithms assign prior-
ities, and we’ll just highlight a simple but effective way of prioritizing state value updates.
It’s known as prioritized sweeping. We maintain a queue of the states, sorted by their “value
function gaps” g : N → R (illustrated below as an example for Value Iteration):
g(s) = |V(s) − max_{a∈A} {R(s, a) + γ · Σ_{s′∈N} P(s, a, s′) · V(s′)}|  for all s ∈ N

After each state’s value is updated with the Bellman Optimality Operator, we update
the Value Function Gap for all the states whose Value Function Gap does get changed as a
result of this state value update. These are exactly the states from which we have a prob-
abilistic transition to the state whose value just got updated. What this also means is that
we need to maintain the reverse transition dynamics in our data structure representation.
So, after each state value update, the queue of states is resorted (by their value function
gaps). We always pull out the state with the largest value function gap (from the top of
the queue), and update the value function for that state. This prioritizes updates of states
with the largest gaps, and it ensures that we quickly get to a point where all value function
gaps are low enough.
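Here is a minimal sketch of prioritized sweeping for the Control problem, again on an illustrative plain-dict MDP specification (same format as the in-place sketch above). Maintaining the reverse transition dynamics is the key data-structure requirement mentioned above.

# Prioritized sweeping sketch: repeatedly update the state with the largest
# "value function gap", then refresh the gaps of its predecessor states only.
from typing import Dict, List, Mapping, Set, Tuple

def prioritized_sweeping_vi(
    transitions: Mapping[str, Mapping[str, List[Tuple[float, str, float]]]],
    gamma: float,
    tolerance: float = 1e-5
) -> Dict[str, float]:
    def backup(s: str, v: Dict[str, float]) -> float:
        return max(
            sum(p * (r + gamma * v.get(s1, 0.0)) for p, s1, r in outcomes)
            for outcomes in transitions[s].values()
        )
    # reverse transition dynamics: predecessors[s'] = non-terminal states that can reach s'
    predecessors: Dict[str, Set[str]] = {}
    for s, actions in transitions.items():
        for outcomes in actions.values():
            for _, s1, _ in outcomes:
                predecessors.setdefault(s1, set()).add(s)
    v: Dict[str, float] = {s: 0.0 for s in transitions}
    gaps: Dict[str, float] = {s: abs(v[s] - backup(s, v)) for s in transitions}
    while max(gaps.values()) >= tolerance:
        s = max(gaps, key=gaps.get)      # state with the largest value function gap
        v[s] = backup(s, v)
        gaps[s] = 0.0
        for p in predecessors.get(s, set()):   # only these states' gaps can change
            gaps[p] = abs(v[p] - backup(p, v))
    return v

A production implementation would keep the gaps in a priority queue (heap) rather than taking a linear max over a dict, but the sketch above conveys the essential mechanics.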
Another form of Asynchronous Dynamic Programming worth mentioning here is Real-
Time Dynamic Programming (RTDP). RTDP means we run a Dynamic Programming algo-
rithm while the AI agent is experiencing real-time interaction with the environment. When
a state is visited during the real-time interaction, we make an update for that state’s value.
Then, as we transition to another state as a result of the real-time interaction, we update
that new state’s value, and so on. Note also that in RTDP, the choice of action is the real-
time action executed by the AI agent, which the environment responds to. This action
choice is governed by the policy implied by the value function for the encountered state
at that point in time in the real-time interaction.
Finally, we need to highlight that often special types of structures of MDPs can bene-
fit from specific customizations of Dynamic Programming algorithms (typically, Asyn-
chronous). One such specialization is when each state is encountered not more than once
in each random sequence of state occurrences when an AI agent plays out an MDP, and
when all such random sequences of the MDP terminate. This structure can be conceptu-
alized as a Directed Acyclic Graph wherein each non-terminal node in the Directed Acyclic
Graph (DAG) represents a pair of non-terminal state and action, and each terminal node in
the DAG represents a terminal state (the graph edges represent probabilistic transitions of
the MDP). In this specialization, the MDP Prediction and Control problems can be solved
in a fairly simple manner - by walking backwards on the DAG from the terminal nodes and
setting the Value Function of visited states (in the backward DAG walk) using the Bellman
Optimality Equation (for Control) or Bellman Policy Equation (for Prediction). Here we
don’t need the “iterate to convergence” approach of Policy Evaluation or Policy Iteration
or Value Iteration. Rather, all these Dynamic Programming algorithms essentially reduce
to a simple back-propagation of the Value Function on the DAG. This means, states are
visited (and their Value Functions set) in the order determined by the reverse sequence
of a Topological Sort on the DAG. We shall make this DAG back-propagation Dynamic
Programming algorithm clear for a special DAG structure - Finite-Horizon MDPs - where
all random sequences of the MDP terminate within a fixed number of time steps and each
time step has a separate (from other time steps) set of states. This special case of Finite-
Horizon MDPs is fairly common in Financial Applications and so, we cover it in detail in
the next section.

3.13. Finite-Horizon Dynamic Programming: Backward Induction

In this section, we consider a specialization of the DAG-structured MDPs described at the
end of the previous section - one that we shall refer to as Finite-Horizon MDPs, where each
sequence terminates within a fixed finite number of time steps T and each time step has

a separate (from other time steps) set of countable states. So, all states at time-step T
are terminal states and some states before time-step T could be terminal states. For all
t = 0, 1, . . . , T , denote the set of states for time step t as St , the set of terminal states for
time step t as Tt and the set of non-terminal states for time step t as Nt = St − Tt (note:
NT = ∅). As mentioned previously, when the MDP is not time-homogeneous, we augment
each state to include the index of the time step so that the augmented state at time step t
is (t, st ) for st ∈ St . The entire MDP’s (augmented) state space S is:

{(t, st )|t = 0, 1, . . . , T, st ∈ St }
We need a Python class to represent this augmented state space.
@dataclass(frozen=True)
class WithTime(Generic[S]):
state: S
time: int = 0

The set of terminal states T is:

{(t, st )|t = 0, 1, . . . , T, st ∈ Tt }
As usual, the set of non-terminal states is denoted as N = S − T .
We denote the set of rewards receivable by the AI agent at time t as Dt (countable subset
of R) and we denote the allowable actions for states in Nt as At . In a more generic setting,
as we shall represent in our code, each non-terminal state (t, s_t) has its own set of allowable
actions, denoted A(s_t). However, for ease of exposition, here we shall treat all non-terminal
states at a particular time step to have the same set of allowable actions At . Let us denote
the entire action space A of the MDP as the union of all the At over all t = 0, 1, . . . , T − 1.
The state-reward transition probability function

PR : N × A × D × S → [0, 1]
is given by:

P_R((t, s_t), a_t, r_{t′}, (t′, s_{t′})) = (P_R)_t(s_t, a_t, r_{t′}, s_{t′})  if t′ = t + 1 and s_{t′} ∈ S_{t′} and r_{t′} ∈ D_{t′}
P_R((t, s_t), a_t, r_{t′}, (t′, s_{t′})) = 0  otherwise

for all t = 0, 1, . . . T − 1, st ∈ Nt , at ∈ At , t′ = 0, 1, . . . , T where

(PR )t : Nt × At × Dt+1 × St+1 → [0, 1]


are the separate state-reward transition probability functions for each of the time steps
t = 0, 1, . . . , T − 1 such that
Σ_{s_{t+1}∈S_{t+1}} Σ_{r_{t+1}∈D_{t+1}} (P_R)_t(s_t, a_t, r_{t+1}, s_{t+1}) = 1

for all t = 0, 1, . . . , T − 1, st ∈ Nt , at ∈ At .
So it is convenient to represent a finite-horizon MDP with separate state-reward transi-
tion probability functions (PR )t for each time step. Likewise, it is convenient to represent
any policy of the MDP

π : N × A → [0, 1]

as:

π((t, st ), at ) = πt (st , at )
where

πt : Nt × At → [0, 1]
are the separate policies for each of the time steps t = 0, 1, . . . , T − 1
So essentially we interpret π as being composed of the sequence (π0 , π1 , . . . , πT −1 ).
Consequently, the Value Function for a given policy π (equivalently, the Value Function
for the π-implied MRP)

Vπ :N →R
can be conveniently represented in terms of a sequence of Value Functions

Vtπ : Nt → R
for each of time steps t = 0, 1, . . . , T − 1, defined as:

V π ((t, st )) = Vtπ (st ) for all t = 0, 1, . . . , T − 1, st ∈ Nt


Then, the Bellman Policy Equation can be written as:

V^π_t(s_t) = Σ_{s_{t+1}∈S_{t+1}} Σ_{r_{t+1}∈D_{t+1}} (P_R^{π_t})_t(s_t, r_{t+1}, s_{t+1}) · (r_{t+1} + γ · W^π_{t+1}(s_{t+1}))  for all t = 0, 1, . . . , T − 1, s_t ∈ N_t    (3.8)

where

W^π_t(s_t) = V^π_t(s_t) if s_t ∈ N_t, and W^π_t(s_t) = 0 if s_t ∈ T_t,  for all t = 1, 2, . . . , T

and where (P_R^{π_t})_t : N_t × D_{t+1} × S_{t+1} → [0, 1] for all t = 0, 1, . . . , T − 1 represent the π-implied
MRP's state-reward transition probability functions for the time steps, defined as:

(P_R^{π_t})_t(s_t, r_{t+1}, s_{t+1}) = Σ_{a_t∈A_t} π_t(s_t, a_t) · (P_R)_t(s_t, a_t, r_{t+1}, s_{t+1})  for all t = 0, 1, . . . , T − 1

So for a Finite MDP, this yields a simple algorithm to calculate Vtπ for all t by simply
decrementing down from t = T − 1 to t = 0 and using Equation (3.8) to calculate Vtπ for
all t = 0, 1, . . . , T − 1 from the known values of W^π_{t+1} (since we are decrementing in time

index t).
This algorithm is the adaptation of Policy Evaluation to the finite horizon case with this
simple technique of “stepping back in time” (known as Backward Induction). Let’s write
some code to implement this algorithm. We are given an MDP over the augmented (finite)
state space WithTime[S], and a policy π (also over the augmented state space WithTime[S]).
So, we can use the method apply_finite_policy in FiniteMarkovDecisionProcess[WithTime[S],
A] to obtain the π-implied MRP of type FiniteMarkovRewardProcess[WithTime[S]].

Our first task is to "unwrap" the state-reward probability transition function P_R^π of this
π-implied MRP into a time-indexed sequence of state-reward probability transition func-
tions (P_R^{π_t})_t, t = 0, 1, . . . , T − 1. This is accomplished by the following function unwrap_finite_horizon_MRP
(itertools.groupby groups the augmented states by their time step, and the function without_time
strips the time step from the augmented states when placing the states in (P_R^{π_t})_t, i.e.,
Sequence[RewardTransition[S]]).

from itertools import groupby


StateReward = FiniteDistribution[Tuple[State[S], float]]
RewardTransition = Mapping[NonTerminal[S], StateReward[S]]
def unwrap_finite_horizon_MRP(
process: FiniteMarkovRewardProcess[WithTime[S]]
) -> Sequence[RewardTransition[S]]:
def time(x: WithTime[S]) -> int:
return x.time
def single_without_time(
s_r: Tuple[State[WithTime[S]], float]
) -> Tuple[State[S], float]:
if isinstance(s_r[0], NonTerminal):
ret: Tuple[State[S], float] = (
NonTerminal(s_r[0].state.state),
s_r[1]
)
else:
ret = (Terminal(s_r[0].state.state), s_r[1])
return ret
def without_time(arg: StateReward[WithTime[S]]) -> StateReward[S]:
return arg.map(single_without_time)
return [{NonTerminal(s.state): without_time(
process.transition_reward(NonTerminal(s))
) for s in states} for _, states in groupby(
sorted(
(nt.state for nt in process.non_terminal_states),
key=time
),
key=time
)]

Now that we have the state-reward transition functions (P_R^{π_t})_t arranged in the form of
a Sequence[RewardTransition[S]], we are ready to perform backward induction to calcu-
late Vtπ . The following function evaluate accomplishes it with a straightforward use of
Equation (3.8), as described above. Note the use of the previously-written extended_vf
function, that represents the Wtπ : St → R function appearing on the right-hand-side of
Equation (3.8).

def evaluate(
steps: Sequence[RewardTransition[S]],
gamma: float
) -> Iterator[V[S]]:
v: List[V[S]] = []
for step in reversed(steps):
v.append({s: res.expectation(
lambda s_r: s_r[1] + gamma * (
extended_vf(v[-1], s_r[0]) if len(v) > 0 else 0.
)
) for s, res in step.items()})
return reversed(v)

If |Nt | is O(m), then the running time of this algorithm is O(m2 · T ). However, note that
it takes O(m2 · k · T ) to convert the MDP to the π-implied MRP (where |At | is O(k)).
Now we move on to the Control problem - to calculate the Optimal Value Function and
the Optimal Policy. Similar to the pattern seen so far, the Optimal Value Function

V∗ :N →R
can be conveniently represented in terms of a sequence of Value Functions

Vt∗ : Nt → R
for each of time steps t = 0, 1, . . . , T − 1, defined as:

V ∗ ((t, st )) = Vt∗ (st ) for all t = 0, 1, . . . , T − 1, st ∈ Nt


Thus, the Bellman Optimality Equation can be written as:

V^∗_t(s_t) = max_{a_t∈A_t} {Σ_{s_{t+1}∈S_{t+1}} Σ_{r_{t+1}∈D_{t+1}} (P_R)_t(s_t, a_t, r_{t+1}, s_{t+1}) · (r_{t+1} + γ · W^∗_{t+1}(s_{t+1}))}  for all t = 0, 1, . . . , T − 1, s_t ∈ N_t    (3.9)

where

W^∗_t(s_t) = V^∗_t(s_t) if s_t ∈ N_t, and W^∗_t(s_t) = 0 if s_t ∈ T_t,  for all t = 1, 2, . . . , T
The associated Optimal (Deterministic) Policy

(π^∗_D)_t : N_t → A_t

is defined as:

(π^∗_D)_t(s_t) = arg max_{a_t∈A_t} {Σ_{s_{t+1}∈S_{t+1}} Σ_{r_{t+1}∈D_{t+1}} (P_R)_t(s_t, a_t, r_{t+1}, s_{t+1}) · (r_{t+1} + γ · W^∗_{t+1}(s_{t+1}))}  for all t = 0, 1, . . . , T − 1, s_t ∈ N_t    (3.10)

So for a Finite MDP, this yields a simple algorithm to calculate Vt∗ for all t, by simply
decrementing down from t = T − 1 to t = 0, using Equation (3.9) to calculate Vt∗ , and
Equation (3.10) to calculate (π^∗_D)_t, for all t = 0, 1, . . . , T − 1, from the known values of W^∗_{t+1}
(since we are decrementing in time index t).
This algorithm is the adaptation of Value Iteration to the finite horizon case with this
simple technique of “stepping back in time” (known as Backward Induction). Let’s write
some code to implement this algorithm. We are given an MDP over the augmented (finite)
state space WithTime[S]. So this MDP is of type FiniteMarkovDecisionProcess[WithTime[S],
A]. Our first task is to "unwrap" the state-reward probability transition function P_R of
this MDP into a time-indexed sequence of state-reward probability transition functions
(P_R)_t, t = 0, 1, . . . , T − 1. This is accomplished by the following function unwrap_finite_horizon_MDP
(itertools.groupby groups the augmented states by their time step, and the function without_time
strips the time step from the augmented states when placing the states in (P_R)_t, i.e.,
Sequence[StateActionMapping[S, A]]).

from itertools import groupby
ActionMapping = Mapping[A, StateReward[S]]
StateActionMapping = Mapping[NonTerminal[S], ActionMapping[A, S]]
def unwrap_finite_horizon_MDP(
process: FiniteMarkovDecisionProcess[WithTime[S], A]
) -> Sequence[StateActionMapping[S, A]]:
def time(x: WithTime[S]) -> int:
return x.time
def single_without_time(
s_r: Tuple[State[WithTime[S]], float]
) -> Tuple[State[S], float]:
if isinstance(s_r[0], NonTerminal):
ret: Tuple[State[S], float] = (
NonTerminal(s_r[0].state.state),
s_r[1]
)
else:
ret = (Terminal(s_r[0].state.state), s_r[1])
return ret
def without_time(arg: ActionMapping[A, WithTime[S]]) -> \
ActionMapping[A, S]:
return {a: sr_distr.map(single_without_time)
for a, sr_distr in arg.items()}
return [{NonTerminal(s.state): without_time(
process.mapping[NonTerminal(s)]
) for s in states} for _, states in groupby(
sorted(
(nt.state for nt in process.non_terminal_states),
key=time
),
key=time
)]

Now that we have the state-reward transition functions (PR )t arranged in the form of a
Sequence[StateActionMapping[S, A]], we are ready to perform backward induction to cal-
culate Vt∗ . The following function optimal_vf_and_policy accomplishes it with a straight-
forward use of Equations (3.9) and (3.10), as described above.
from operator import itemgetter
def optimal_vf_and_policy(
steps: Sequence[StateActionMapping[S, A]],
gamma: float
) -> Iterator[Tuple[V[S], FiniteDeterministicPolicy[S, A]]]:
v_p: List[Tuple[V[S], FiniteDeterministicPolicy[S, A]]] = []
for step in reversed(steps):
this_v: Dict[NonTerminal[S], float] = {}
this_a: Dict[S, A] = {}
for s, actions_map in step.items():
action_values = ((res.expectation(
lambda s_r: s_r[1] + gamma * (
extended_vf(v_p[-1][0], s_r[0]) if len(v_p) > 0 else 0.
)
), a) for a, res in actions_map.items())
v_star, a_star = max(action_values, key=itemgetter(0))
this_v[s] = v_star
this_a[s.state] = a_star
v_p.append((this_v, FiniteDeterministicPolicy(this_a)))
return reversed(v_p)

If |N_t| is O(m) for all t and |A_t| is O(k), then the running time of this algorithm is O(m^2 · k · T).
Note that these algorithms for finite-horizon finite MDPs do not require any “iterations
to convergence” like we had for regular Policy Evaluation and Value Iteration. Rather, in
these algorithms we simply walk back in time and immediately obtain the Value Function
for each time step from the next time step’s Value Function (which is already known since
we walk back in time). This technique of “backpropagation of Value Function” goes by
the name of Backward Induction algorithms, and is quite commonplace in many Financial
applications (as we shall see later in this book). The above Backward Induction code is in
the file rl/finite_horizon.py.

3.14. Dynamic Pricing for End-of-Life/End-of-Season of a Product

Now we consider a rather important business application - Dynamic Pricing. We consider the problem of Dynamic Pricing for products that reach their end-of-life or the end of a season, after which we don't want to carry the product anymore. The price needs to be adjusted up or down dynamically depending on how much inventory of the product you have, how many days remain until end-of-life/end-of-season, and your expectations of customer demand as a function of price adjustments. To make things concrete, assume
you own a super-market and you are T days away from Halloween. You have just received
M Halloween masks from your supplier and you won’t be receiving any more inventory
during these final T days. You want to dynamically set the selling price of the Halloween
masks at the start of each day in a manner that maximizes your Expected Total Sales Revenue
for Halloween masks from today until Halloween (assume no one will buy Halloween
masks after Halloween).
Assume that for each of the T days, at the start of the day, you are required to select a
price for that day from one of N prices P1 , P2 , . . . , PN ∈ R, such that your selected price
will be the selling price for all masks on that day. Assume that the customer demand for
the number of Halloween masks on any day is governed by a Poisson probability distri-
bution with mean λi ∈ R if you select that day’s price to be Pi (where i is a choice among
1, 2, . . . , N ). Note that on any given day, the demand could exceed the number of Hal-
loween masks you have in the store, in which case the number of masks sold on that day
will be equal to the number of masks you had at the start of that day.
A state for this MDP is given by a pair (t, It ) where t ∈ {0, 1, . . . , T } denotes the time
index and It ∈ {0, 1, . . . , M } denotes the inventory at time t. Using our notation from the
previous section, St = {0, 1, . . . , M } for all t = 0, 1, . . . , T so that It ∈ St . Nt = St for all
t = 0, 1, . . . , T − 1 and NT = ∅. The action choices at time t can be represented by the
choice of integers from 1 to N . Therefore, At = {1, 2, . . . , N }.
Note that:
$$I_0 = M, \quad I_{t+1} = \max(0, I_t - d_t) \text{ for } 0 \le t < T$$

where dt is the random demand on day t governed by a Poisson distribution with mean λi if
the action (index of the price choice) on day t is i ∈ At . Also, note that the sales revenue on
day t is equal to min(It , dt ) · Pi . Therefore, the state-reward probability transition function
for time index $t$

$$(\mathcal{P}_R)_t : \mathcal{N}_t \times \mathcal{A}_t \times \mathcal{D}_{t+1} \times \mathcal{S}_{t+1} \rightarrow [0, 1]$$

is defined as:

$$(\mathcal{P}_R)_t(I_t, i, r_{t+1}, I_t - k) =
\begin{cases}
\frac{e^{-\lambda_i} \lambda_i^k}{k!} & \text{if } k < I_t \text{ and } r_{t+1} = k \cdot P_i \\
\sum_{j=I_t}^{\infty} \frac{e^{-\lambda_i} \lambda_i^j}{j!} & \text{if } k = I_t \text{ and } r_{t+1} = k \cdot P_i \\
0 & \text{otherwise}
\end{cases}
\quad \text{for all } 0 \le t < T$$
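As a quick numerical check of this transition function (a small sketch with made-up numbers, not part of the book's codebase), consider a day with inventory $I_t = 2$ and a price choice whose mean demand is $\lambda_i = 1.0$:

from scipy.stats import poisson

# Made-up illustration: inventory I_t = 2, mean demand lambda_i = 1.0
lam, inventory = 1.0, 2
p_sell_0 = poisson.pmf(0, lam)         # k = 0 < I_t: e^{-1} * 1^0 / 0! ~ 0.368
p_sell_1 = poisson.pmf(1, lam)         # k = 1 < I_t: e^{-1} * 1^1 / 1! ~ 0.368
p_sell_2 = 1 - poisson.cdf(1, lam)     # k = I_t: P[demand >= 2] ~ 0.264
print(p_sell_0 + p_sell_1 + p_sell_2)  # the three probabilities sum to 1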


Using the definition of $(\mathcal{P}_R)_t$ and the boundary condition $W^*_T(I_T) = 0$ for all $I_T \in \{0, 1, \ldots, M\}$, we can perform the backward induction algorithm to calculate $V^*_t$ and the associated optimal (deterministic) policy $(\pi^*_D)_t$ for all $0 \le t < T$.
Now let's write some code to represent this Dynamic Programming problem as a FiniteMarkovDecisionProcess and determine its optimal policy, i.e., the Optimal (Dynamic) Price at time step $t$ for any available level of inventory $I_t$. The type $\mathcal{N}_t$ is int and the type $\mathcal{A}_t$ is also int. So we create an MDP of type FiniteMarkovDecisionProcess[WithTime[int], int] (since the augmented state space is WithTime[int]). Our first task is to construct $\mathcal{P}_R$ of type:

Mapping[WithTime[int],
Mapping[int, FiniteDistribution[Tuple[WithTime[int], float]]]]

In the class ClearancePricingMDP below, $\mathcal{P}_R$ is manufactured in __init__ and is used to create the attribute mdp: FiniteMarkovDecisionProcess[WithTime[int], int]. Since $\mathcal{P}_R$ is independent of time, we first create a single-step (time-invariant) MDP single_step_mdp: FiniteMarkovDecisionProcess[int, int] (think of this as the building-block MDP), and then use the function finite_horizon_MDP (from file rl/finite_horizon.py) to create self.mdp from self.single_step_mdp. The constructor argument initial_inventory: int represents the initial inventory $M$. The constructor argument time_steps represents the number of time steps $T$. The constructor argument price_lambda_pairs represents $[(P_i, \lambda_i)|1 \le i \le N]$.

from scipy.stats import poisson
from rl.finite_horizon import finite_horizon_MDP

class ClearancePricingMDP:

    initial_inventory: int
    time_steps: int
    price_lambda_pairs: Sequence[Tuple[float, float]]
    single_step_mdp: FiniteMarkovDecisionProcess[int, int]
    mdp: FiniteMarkovDecisionProcess[WithTime[int], int]

    def __init__(
        self,
        initial_inventory: int,
        time_steps: int,
        price_lambda_pairs: Sequence[Tuple[float, float]]
    ):
        self.initial_inventory = initial_inventory
        self.time_steps = time_steps
        self.price_lambda_pairs = price_lambda_pairs
        distrs = [poisson(l) for _, l in price_lambda_pairs]
        prices = [p for p, _ in price_lambda_pairs]
        self.single_step_mdp: FiniteMarkovDecisionProcess[int, int] = \
            FiniteMarkovDecisionProcess({
                s: {i: Categorical(
                    {(s - k, prices[i] * k):
                     (distrs[i].pmf(k) if k < s else 1 - distrs[i].cdf(s - 1))
                     for k in range(s + 1)})
                    for i in range(len(prices))}
                for s in range(initial_inventory + 1)
            })
        self.mdp = finite_horizon_MDP(self.single_step_mdp, time_steps)

Now let’s write two methods for this class:

• get_vf_for_policy that produces the Value Function for a given policy $\pi$, by first creating the $\pi$-implied MRP from mdp, then unwrapping the MRP into a sequence of state-reward transition probability functions $(\mathcal{P}^{\pi_t}_R)_t$, and then performing backward induction using the previously-written function evaluate to calculate the Value Function.
• get_optimal_vf_and_policy that produces the Optimal Value Function and Optimal
Policy, by first unwrapping self.mdp into a sequence of state-reward transition prob-
ability functions (PR )t , and then performing backward induction using the previously-
written function optimal_vf_and_policy to calculate the Optimal Value Function and
Optimal Policy.
from rl.finite_horizon import evaluate, optimal_vf_and_policy

    def get_vf_for_policy(
        self,
        policy: FinitePolicy[WithTime[int], int]
    ) -> Iterator[V[int]]:
        mrp: FiniteMarkovRewardProcess[WithTime[int]] \
            = self.mdp.apply_finite_policy(policy)
        return evaluate(unwrap_finite_horizon_MRP(mrp), 1.)

    def get_optimal_vf_and_policy(self) \
            -> Iterator[Tuple[V[int], FiniteDeterministicPolicy[int, int]]]:
        return optimal_vf_and_policy(unwrap_finite_horizon_MDP(self.mdp), 1.)

Now let’s create a simple instance of ClearancePricingMDP for M = 12, T = 8 and 4 price
choices: “Full Price,” “30% Off,” “50% Off,” “70% Off” with respective mean daily demand
of 0.5, 1.0, 1.5, 2.5.
ii = 12
steps = 8
pairs = [(1.0, 0.5), (0.7, 1.0), (0.5, 1.5), (0.3, 2.5)]
cp: ClearancePricingMDP = ClearancePricingMDP(
    initial_inventory=ii,
    time_steps=steps,
    price_lambda_pairs=pairs
)

Now let us calculate its Value Function for a stationary policy that chooses "Full Price"
if inventory is less than 2, otherwise “30% Off” if inventory is less than 5, otherwise “50%
Off” if inventory is less than 8, otherwise “70% Off.” Since we have a stationary policy, we
can represent it as a single-step policy and combine it with the single-step MDP we had cre-
ated above (attribute single_step_mdp) to create a single_step_mrp: FiniteMarkovRewardProcess[int].
Then we use the function finite_horizon_mrp (from file rl/finite_horizon.py) to create the
entire (augmented state) MRP of type FiniteMarkovRewardProcess[WithTime[int]]. Fi-
nally, we unwrap this MRP into a sequence of state-reward transition probability functions
and perform backward induction to calculate the Value Function for this stationary policy.
Running the following code tells us that V0π (12) is about 4.91 (assuming full price is 1),
which is the Expected Revenue one would obtain over 8 days, starting with an inventory
of 12, and executing this stationary policy (under the assumed demand distributions as a
function of the price choices).

Figure 3.5.: Optimal Policy Heatmap

def policy_func(x: int) -> int:
    return 0 if x < 2 else (1 if x < 5 else (2 if x < 8 else 3))

stationary_policy: FiniteDeterministicPolicy[int, int] = \
    FiniteDeterministicPolicy({s: policy_func(s) for s in range(ii + 1)})

single_step_mrp: FiniteMarkovRewardProcess[int] = \
    cp.single_step_mdp.apply_finite_policy(stationary_policy)

vf_for_policy: Iterator[V[int]] = evaluate(
    unwrap_finite_horizon_MRP(finite_horizon_MRP(single_step_mrp, steps)),
    1.
)
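As a minimal usage sketch (assuming NonTerminal is imported from rl.markov_process, as elsewhere in the book's codebase), the first element of this iterator corresponds to time step 0, so $V^\pi_0(12)$ can be read off as follows:

from rl.markov_process import NonTerminal

v0 = next(iter(vf_for_policy))   # Value Function for time step 0
print(v0[NonTerminal(ii)])       # roughly 4.91 under the assumed demand distributions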

Now let us determine the Optimal Policy and Optimal Value Function for this instance of ClearancePricingMDP. Running cp.get_optimal_vf_and_policy() and evaluating the Optimal Value Function for time step 0 and inventory of 12, i.e., $V^*_0(12)$, gives us a value of 5.64, which is the Expected Revenue we'd obtain over the 8 days if we executed the Optimal Policy.
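A possible way to inspect this (again just a sketch, with NonTerminal imported as above, and following the same convention that the first yielded element is time step 0):

opt_vf_0, opt_policy_0 = next(iter(cp.get_optimal_vf_and_policy()))
print(opt_vf_0[NonTerminal(ii)])   # roughly 5.64
print(opt_policy_0)                # optimal price choice for each inventory level at time 0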
Now let us plot the Optimal Price as a function of time steps and inventory levels.

import matplotlib.pyplot as plt
from matplotlib import cm
import numpy as np

prices = [[pairs[policy.act(s).value][0] for s in range(ii + 1)]
          for _, policy in cp.get_optimal_vf_and_policy()]

heatmap = plt.imshow(np.array(prices).T, origin='lower')
plt.colorbar(heatmap, shrink=0.5, aspect=5)
plt.xlabel("Time Steps")
plt.ylabel("Inventory")
plt.show()

Figure 3.5 shows us the image produced by the above code. The light shade is “Full
Price,” the medium shade is “30% Off” and the dark shade is “50% Off.” This tells us
that on day 0, the Optimal Price is “30% Off” (corresponding to State 12, i.e., for starting
inventory M = I0 = 12). However, if the starting inventory I0 were less than 7, then the
Optimal Price is “Full Price.” This makes intuitive sense because the lower the inventory,
the less inclination we’d have to cut prices. We see that the thresholds for price cuts shift
as time progresses (as we move horizontally in the figure). For instance, on Day 5, we set
“Full Price” only if inventory has dropped below 3 (this would happen if we had a good
degree of sales on the first 5 days), we set “30% Off” if inventory is 3 or 4 or 5, and we set
“50% Off” if inventory is greater than 5. So even if we sold 6 units in the first 5 days, we’d
offer “50% Off” because we have only 3 days remaining now and 6 units of inventory left.
This makes intuitive sense. We see that the thresholds shift even further as we move to
Days 6 and 7. We encourage you to play with this simple application of Dynamic Pricing
by changing M, T, N, [(Pi , λi )|1 ≤ i ≤ N ] and studying how the Optimal Value Function
changes and more importantly, studying the thresholds of inventory (under optimality)
for various choices of prices and how these thresholds vary as time progresses.

3.15. Generalizations to Non-Tabular Algorithms

The Finite MDP algorithms covered in this chapter are called “tabular” algorithms. The
word “tabular” (for “table”) refers to the fact that the MDP is specified in the form of a
finite data structure and the Value Function is also represented as a finite “table” of non-
terminal states and values. These tabular algorithms typically make a sweep through all
non-terminal states in each iteration to update the Value Function. This is not possible for
large state spaces or infinite state spaces where we need some function approximation for
the Value Function. The good news is that we can modify each of these tabular algorithms
such that instead of sweeping through all the non-terminal states at each step, we simply
sample an appropriate subset of non-terminal states, calculate the values for these sam-
pled states with the appropriate Bellman calculations (just like in the tabular algorithms),
and then create/update a function approximation (for the Value Function) with the sam-
pled states’ calculated values. The important point is that the fundamental structure of the
algorithms and the fundamental principles (Fixed-Point and Bellman Operators) are still
the same when we generalize from these tabular algorithms to function approximation-
based algorithms. In Chapter 4, we cover generalizations of these Dynamic Programming
algorithms from tabular methods to function approximation methods. We call these algo-
rithms Approximate Dynamic Programming.
We finish this chapter by referring you to the various excellent papers and books by Dim-
itri Bertsekas - (Dimitri P. Bertsekas 1981), (Dimitri P. Bertsekas 1983), (Dimitri P. Bert-
sekas 2005), (Dimitri P. Bertsekas 2012), (D. P. Bertsekas and Tsitsiklis 1996) - for a com-
prehensive treatment of the variants of DP, including Asynchronous DP, Finite-Horizon
DP and Approximate DP.

3.16. Summary of Key Learnings from this Chapter
Before we end this chapter, we’d like to highlight the three highly important concepts we
learnt in this chapter:

• Fixed-Point of Functions and Banach Fixed-Point Theorem: The simple concept of


Fixed-Point of Functions that is profound in its applications, and the Banach Fixed-
Point Theorem that enables us to construct iterative algorithms to solve problems
with fixed-point formulations.
• Generalized Policy Iteration: The powerful idea of alternating between any method
for Policy Evaluation and any method for Policy Improvement, including methods
that are partial applications of Policy Evaluation or Policy Improvement. This gener-
alized perspective unifies almost all of the algorithms that solve MDP Control prob-
lems.
• Backward Induction: A straightforward method to solve finite-horizon MDPs by
simply backpropagating the Value Function from the horizon-end to the start.

4. Function Approximation and Approximate
Dynamic Programming

In Chapter 3, we covered Dynamic Programming algorithms where the MDP is specified


in the form of a finite data structure and the Value Function is represented as a finite “table”
of states and values. These Dynamic Programming algorithms swept through all states in
each iteration to update the value function. But when the state space is large (as is the case
in real-world applications), these Dynamic Programming algorithms won’t work because:

1. A “tabular” representation of the MDP (or of the Value Function) won’t fit within
storage limits.
2. Sweeping through all states and their transition probabilities would be time-prohibitive
(or simply impossible, in the case of infinite state spaces).

Hence, when the state space is very large, we need to resort to approximation of the
Value Function. The Dynamic Programming algorithms would need to be suitably mod-
ified to their Approximate Dynamic Programming (abbreviated as ADP) versions. The
good news is that it’s not hard to modify each of the (tabular) Dynamic Programming al-
gorithms such that instead of sweeping through all the states in each iteration, we simply
sample an appropriate subset of the states, calculate the values for those states (with the
same Bellman Operator calculations as for the case of tabular), and then create/update a
function approximation (for the Value Function) using the sampled states’ calculated val-
ues. Furthermore, if the set of transitions from a given state is large (or infinite), instead
of using the explicit probabilities of those transitions, we can sample from the transitions
probability distribution. The fundamental structure of the algorithms and the fundamen-
tal principles (Fixed-Point and Bellman Operators) would still be the same.
So, in this chapter, we do a quick review of function approximation, write some code
for a couple of standard function approximation methods, and then utilize these func-
tion approximation methods to develop Approximate Dynamic Programming algorithms
(in particular, Approximate Policy Evaluation, Approximate Value Iteration and Approx-
imate Backward Induction). Since you are reading this book, it’s highly likely that you are
already familiar with the simple and standard function approximation methods such as
linear function approximation and function approximation using neural networks super-
vised learning. So we shall go through the background on linear function approximation
and neural networks supervised learning in a quick and terse manner, with the goal of
developing some code for these methods that we can use not just for the ADP algorithms
for this chapter, but also for RL algorithms later in the book. Note also that apart from
approximation of State-Value Functions N → R and Action-Value Functions N × A → R,
these function approximation methods can also be used for approximation of Stochastic
Policies N × A → [0, 1] in Policy-based RL algorithms.

4.1. Function Approximation
In this section, we describe function approximation in a fairly generic setting (not specific
to approximation of Value Functions or Policies). We denote the predictor variable as x,
belonging to an arbitrary domain denoted X and the response variable as y ∈ R. We
treat x and y as unknown random variables and our goal is to estimate the probability
distribution function f of the conditional random variable y|x from data provided in the
form of a sequence of (x, y) pairs. We shall consider parameterized functions f with the
parameters denoted as w. The exact data type of w will depend on the specific form of
function approximation. We denote the estimated probability of y conditional on x as
f (x; w)(y). Assume we are given data in the form of a sequence of n (x, y) pairs, as follows:

[(xi , yi )|1 ≤ i ≤ n]

The notion of estimating the conditional probability $\mathbb{P}[y|x]$ is formalized by solving for $w = w^*$ such that:

$$w^* = \arg\max_w \{ \prod_{i=1}^n f(x_i; w)(y_i) \} = \arg\max_w \{ \sum_{i=1}^n \log f(x_i; w)(y_i) \}$$

In other words, we shall be operating in the framework of Maximum Likelihood Estima-


tion. We say that the data [(xi , yi )|1 ≤ i ≤ n] specifies the empirical probability distribution
D of y|x and the function f (parameterized by w) specifies the model probability distribution
M of y|x. With maximum likelihood estimation, we are essentially trying to reconcile the
model probability distribution M with the empirical probability distribution D. Hence,
maximum likelihood estimation is essentially minimization of a loss function defined as
the cross-entropy H(D, M ) = −ED [log M ] between the probability distributions D and
M . The term function approximation refers to the fact that the model probability distri-
bution M serves as an approximation to some true probability distribution that is only
partially observed through the empirical probability distribution D.
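To make this connection explicit: for the given finite data, the cross-entropy being minimized is the empirical expectation of the negative log of the model probabilities,

$$H(D, M) = -\mathbb{E}_D[\log M] = -\frac{1}{n} \sum_{i=1}^n \log f(x_i; w)(y_i)$$

so minimizing $H(D, M)$ over $w$ is exactly the same as maximizing the log-likelihood $\sum_{i=1}^n \log f(x_i; w)(y_i)$ shown above.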
Our framework will allow for incremental estimation wherein at each iteration t of the
incremental estimation (for t = 1, 2, . . .), data of the form

[(xt,i , yt,i )|1 ≤ i ≤ nt ]


is used to update the parameters from wt−1 to wt (parameters initialized at iteration
t = 0 to w0 ). This framework can be used to update the parameters incrementally with a
gradient descent algorithm, either stochastic gradient descent (where a single (x, y) pair
is used for each iteration’s gradient calculation) or mini-batch gradient descent (where an
appropriate subset of the available data is used for each iteration’s gradient calculation)
or simply re-using the entire data available for each iteration’s gradient calculation (and
consequent, parameters update). Moreover, the flexibility of our framework, allowing for
incremental estimation, is particularly important for Reinforcement Learning algorithms
wherein we update the parameters of the function approximation from the new data that
is generated from each state transition as a result of interaction with either the real envi-
ronment or a simulated environment.
Among other things, the estimate f (parameterized by w) gives us the model-expected
value of y conditional on x, i.e.
$$\mathbb{E}_M[y|x] = \mathbb{E}_{f(x;w)}[y] = \int_{-\infty}^{+\infty} y \cdot f(x; w)(y) \cdot dy$$
We refer to EM [y|x] as the function approximation’s prediction for a given predictor
variable x.
For the purposes of Approximate Dynamic Programming and Reinforcement Learning,
the function approximation’s prediction E[y|x] provides an estimate of the Value Function
for any state (x takes the role of the State, and y takes the role of the Return following that
State). In the case of function approximation for (stochastic) policies, x takes the role of
the State, y takes the role of the Action for that policy, and f (x; w) provides the probabil-
ity distribution of actions for state x (corresponding to that policy). It’s also worthwhile
pointing out that the broader theory of function approximations covers the case of multi-
dimensional y (where y is a real-valued vector, rather than scalar) - this allows us to solve
classification problems as well as regression problems. However, for ease of exposition
and for sufficient coverage of function approximation applications in this book, we only
cover the case of scalar y.
Now let us write some code that captures this framework. We write an abstract base class
FunctionApprox type-parameterized by X (to permit arbitrary data types X ), representing
f (x; w), with the following 3 key methods, each of which will work with inputs of generic
Iterable type (Iterable is any data type that we can iterate over, such as Sequence types
or Iterator type):

1. @abstractmethod solve: takes as input an Iterable of (x, y) pairs and solves for the
optimal internal parameters w∗ that minimizes the cross-entropy between the em-
pirical probability distribution of the input data of (x, y) pairs and the model prob-
ability distribution f (x; w). Some implementations of solve are iterative numerical
methods and would require an additional input of error_tolerance that specifies
the required precision of the best-fit parameters w∗ . When an implementation of
solve is an analytical solution not requiring an error tolerance, we specify the input
error_tolerance as None. The output of solve is the FunctionApprox f (x; w ∗ ) (i.e.,
corresponding to the solved parameters w∗ ).
2. update: takes as input an Iterable of (x, y) pairs and updates the parameters w defin-
ing f (x; w). The purpose of update is to perform an incremental (iterative) improve-
ment to the parameters w, given the input data of (x, y) pairs in the current iteration.
The output of update is the FunctionApprox corresponding to the updated parame-
ters. Note that we should be able to solve based on an appropriate series of incre-
mental updates (upto a specified error_tolerance).
3. @abstractmethod evaluate: takes as input an Iterable of x values and calculates
EM [y|x] = Ef (x;w) [y] for each of the input x values, and outputs these expected val-
ues in the form of a numpy.ndarray.

As we’ve explained, an incremental update to the parameters w involves calculating a


gradient and then using the gradient to adjust the parameters w. Hence the method update
is supported by the following two abstract methods.

• @abstractmethod objective_gradient: computes the gradient of an objective func-


tion (call it Obj(x, y)) of the FunctionApprox with respect to the parameters w in the
internal representation of the FunctionApprox. The gradient is output in the form of
a Gradient type. The second argument obj_deriv_out_fun of the objective_gradient
method represents the partial derivative of Obj with respect to an appropriate model-
computed value (call it $Out(x)$), i.e., $\frac{\partial Obj(x,y)}{\partial Out(x)}$, when evaluated at a Sequence of x values and a Sequence of y values (to be obtained from the first argument xy_vals_seq
of the objective_gradient method).

• @abstractmethod update_with_gradient: takes as input a Gradient and updates the
internal parameters using the gradient values (eg: gradient descent update to the
parameters), returning the updated FunctionApprox.
The update method is written with $\frac{\partial Obj(x_i, y_i)}{\partial Out(x_i)}$ defined as follows, for each training data point $(x_i, y_i)$:

$$\frac{\partial Obj(x_i, y_i)}{\partial Out(x_i)} = \mathbb{E}_M[y|x_i] - y_i$$
It turns out that for each concrete function approximation that we’d want to implement,
if the Objective Obj(xi , yi ) is the cross-entropy loss function, we can identify a model-
computed value Out(xi ) (either the output of the model or an intermediate computation
of the model) such that $\frac{\partial Obj(x_i, y_i)}{\partial Out(x_i)}$ is equal to the prediction error $\mathbb{E}_M[y|x_i] - y_i$ (for each training data point $(x_i, y_i)$) and we can come up with a numerical algorithm to compute $\nabla_w Out(x_i)$, so that by chain-rule, we have the required gradient:

$$\nabla_w Obj(x_i, y_i) = \frac{\partial Obj(x_i, y_i)}{\partial Out(x_i)} \cdot \nabla_w Out(x_i) = (\mathbb{E}_M[y|x_i] - y_i) \cdot \nabla_w Out(x_i)$$
The update method implements this chain-rule calculation, by setting obj_deriv_out_fun
to be the prediction error EM [y|xi ] − yi , delegating the calculation of ∇w Out(xi ) to the
concrete implementation of the abstract method objective_gradient.
Note that the Gradient class contains a single attribute of type FunctionApprox so that a
Gradient object can represent the gradient values in the form of the internal parameters of
the FunctionApprox attribute (since each gradient value is simply a partial derivative with
respect to an internal parameter).
We use the TypeVar F to refer to a concrete class that would implement the abstract in-
terface of FunctionApprox.
from abc import ABC, abstractmethod
import numpy as np

X = TypeVar('X')
F = TypeVar('F', bound='FunctionApprox')

class FunctionApprox(ABC, Generic[X]):

    @abstractmethod
    def objective_gradient(
        self: F,
        xy_vals_seq: Iterable[Tuple[X, float]],
        obj_deriv_out_fun: Callable[[Sequence[X], Sequence[float]], np.ndarray]
    ) -> Gradient[F]:
        pass

    @abstractmethod
    def evaluate(self, x_values_seq: Iterable[X]) -> np.ndarray:
        pass

    @abstractmethod
    def update_with_gradient(
        self: F,
        gradient: Gradient[F]
    ) -> F:
        pass

    def update(
        self: F,
        xy_vals_seq: Iterable[Tuple[X, float]]
    ) -> F:
        def deriv_func(x: Sequence[X], y: Sequence[float]) -> np.ndarray:
            return self.evaluate(x) - np.array(y)

        return self.update_with_gradient(
            self.objective_gradient(xy_vals_seq, deriv_func)
        )

    @abstractmethod
    def solve(
        self: F,
        xy_vals_seq: Iterable[Tuple[X, float]],
        error_tolerance: Optional[float] = None
    ) -> F:
        pass


@dataclass(frozen=True)
class Gradient(Generic[F]):
    function_approx: F

When concrete classes implementing FunctionApprox write the solve method in terms
of the update method, they will need to check if a newly updated FunctionApprox is “close
enough” to the previous FunctionApprox. So each of them will need to implement their
own version of “Are two FunctionApprox instances within a certain error_tolerance of each
other?” Hence, we need the following abstract method within:
    @abstractmethod
    def within(self: F, other: F, tolerance: float) -> bool:
        pass

Any concrete class that implements this abstract class FunctionApprox will need to implement these five abstract methods of FunctionApprox, based on the specific assumptions that the concrete class makes for $f$.
Next, we write some methods useful for classes that inherit from FunctionApprox. Firstly,
we write a method called iterate_updates that takes as input a stream (Iterator) of Iterable
of (x, y) pairs, and performs a series of incremental updates to the parameters w (each us-
ing the update method), with each update done for each Iterable of (x, y) pairs in the
input stream xy_seq: Iterator[Iterable[Tuple[X, float]]]. iterate_updates returns an
Iterator of FunctionApprox representing the successively updated FunctionApprox instances
as a consequence of the repeated invocations to update. Note the use of the rl.iterate.accumulate
function (a wrapped version of itertools.accumulate) that calculates accumulated results
(including intermediate results) on an Iterable, based on a provided function to govern
the accumulation. In the code below, the Iterable is the input stream xy_seq_stream and
the function governing the accumulation is the update method of FunctionApprox.
import rl.iterate as iterate

    def iterate_updates(
        self: F,
        xy_seq_stream: Iterator[Iterable[Tuple[X, float]]]
    ) -> Iterator[F]:
        return iterate.accumulate(
            xy_seq_stream,
            lambda fa, xy: fa.update(xy),
            initial=self
        )

Next, we write a method called rmse to calculate the Root-Mean-Squared-Error of the


predictions for x (using evaluate) relative to associated (supervisory) y, given as input an
Iterable of (x, y) pairs. This method will be useful in testing the goodness of a FunctionApprox
estimate.

    def rmse(
        self,
        xy_vals_seq: Iterable[Tuple[X, float]]
    ) -> float:
        x_seq, y_seq = zip(*xy_vals_seq)
        errors: np.ndarray = self.evaluate(x_seq) - np.array(y_seq)
        return np.sqrt(np.mean(errors * errors))

Finally, we write a method argmax that takes as input an Iterable of x values and returns
the x value that maximizes Ef (x;w) [y].

    def argmax(self, xs: Iterable[X]) -> X:
        args: Sequence[X] = list(xs)   # materialize xs once, since it could be an Iterator
        return args[np.argmax(self.evaluate(args))]

The above code for FunctionApprox and Gradient is in the file rl/function_approx.py.
rl/function_approx.py also contains the convenience methods __add__ (to add two FunctionApprox),
__mul__ (to multiply a FunctionApprox with a real-valued scalar), and __call__ (to treat
a FunctionApprox object syntactically as a function taking an x: X as input, essentially
a shorthand for evaluate on a single x value). __add__ and __mul__ are meant to per-
form element-wise addition and scalar-multiplication on the internal parameters w of the
Function Approximation (see Appendix F on viewing Function Approximations as Vec-
tor Spaces). Likewise, it contains the methods __add__ and __mul__ for the Gradient class
that simply delegates to the __add__ and __mul__ methods of the FunctionApprox within
Gradient, and it also contains the method zero that returns a Gradient which is uniformly
zero for each of the parameter values.
Now we are ready to cover a concrete but simple function approximation - the case of
linear function approximation.

4.2. Linear Function Approximation


We define a sequence of feature functions

ϕj : X → R for each j = 1, 2, . . . , m

and we define ϕ : X → Rm as:

ϕ(x) = (ϕ1 (x), ϕ2 (x), . . . , ϕm (x)) for all x ∈ X

We treat ϕ(x) as a column vector for all x ∈ X .


For linear function approximation, the internal parameters w are represented as a weights
column-vector w = (w1 , w2 , . . . , wm ) ∈ Rm . Linear function approximation is based on the
assumption of a Gaussian distribution for y|x with mean

$$\mathbb{E}_M[y|x] = \sum_{j=1}^m \phi_j(x) \cdot w_j = \phi(x)^T \cdot w$$

and constant variance $\sigma^2$, i.e.,

$$\mathbb{P}[y|x] = f(x; w)(y) = \frac{1}{\sqrt{2\pi\sigma^2}} \cdot e^{-\frac{(y - \phi(x)^T \cdot w)^2}{2\sigma^2}}$$

So, the cross-entropy loss function (ignoring constant terms associated with σ 2 ) for a
given set of data points [xi , yi |1 ≤ i ≤ n] is defined as:

$$\mathcal{L}(w) = \frac{1}{2n} \cdot \sum_{i=1}^n (\phi(x_i)^T \cdot w - y_i)^2$$

Note that this loss function is identical to the mean-squared-error of the linear (in w)
predictions ϕ(xi )T ·w relative to the response values yi associated with the predictor values
xi , over all 1 ≤ i ≤ n.
If we include L2 regularization (with λ as the regularization coefficient), then the reg-
ularized loss function is:

$$\mathcal{L}(w) = \frac{1}{2n} \cdot (\sum_{i=1}^n (\phi(x_i)^T \cdot w - y_i)^2) + \frac{1}{2} \cdot \lambda \cdot |w|^2$$

The gradient of L(w) with respect to w works out to:

$$\nabla_w \mathcal{L}(w) = \frac{1}{n} \cdot (\sum_{i=1}^n \phi(x_i) \cdot (\phi(x_i)^T \cdot w - y_i)) + \lambda \cdot w$$
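As a small numerical sketch of this gradient formula (the feature matrix, responses and $\lambda$ below are made up purely for illustration and are not part of the book's codebase):

import numpy as np

Phi = np.array([[1., 2.], [1., 3.], [1., 5.]])   # n = 3 rows of feature vectors phi(x_i)
y = np.array([2.1, 3.2, 4.9])                    # responses y_i
w = np.array([0.5, 0.5])                         # current weights
lam = 0.01                                       # regularization coefficient lambda

grad = Phi.T @ (Phi @ w - y) / len(y) + lam * w  # (1/n) sum_i phi(x_i)(phi(x_i)^T w - y_i) + lam w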

We had said previously that for each concrete function approximation that we’d want to
implement, if the Objective Obj(xi , yi ) is the cross-entropy loss function, we can identify
a model-computed value Out(xi ) (either the output of the model or an intermediate com-
putation of the model) such that $\frac{\partial Obj(x_i, y_i)}{\partial Out(x_i)}$ is equal to the prediction error $\mathbb{E}_M[y|x_i] - y_i$
(for each training data point (xi , yi )) and we can come up with a numerical algorithm to
compute ∇w Out(xi ), so that by chain-rule, we have the required gradient ∇w Obj(xi , yi )
(without regularization). In the case of this linear function approximation, the model-
computed value Out(xi ) is simply the model prediction for predictor variable xi , i.e.,

Out(xi ) = EM [y|xi ] = ϕ(xi )T · w


This is confirmed by noting that with Obj(xi , yi ) set to be the cross-entropy loss function
L(w) and Out(xi ) set to be the model prediction ϕ(xi )T ·w (for training data point (xi , yi )),

$$\frac{\partial Obj(x_i, y_i)}{\partial Out(x_i)} = \phi(x_i)^T \cdot w - y_i = \mathbb{E}_M[y|x_i] - y_i$$

$$\nabla_w Out(x_i) = \nabla_w(\phi(x_i)^T \cdot w) = \phi(x_i)$$


We can solve for w∗ by incremental estimation using gradient descent (change in w
proportional to the gradient estimate of L(w) with respect to w). If the (xt , yt ) data at
time t is:

$$[(x_{t,i}, y_{t,i}) | 1 \le i \le n_t],$$

then the gradient estimate $\mathcal{G}_{(x_t, y_t)}(w_t)$ at time $t$ is given by:

$$\mathcal{G}_{(x_t, y_t)}(w_t) = \frac{1}{n_t} \cdot (\sum_{i=1}^{n_t} \phi(x_{t,i}) \cdot (\phi(x_{t,i})^T \cdot w_t - y_{t,i})) + \lambda \cdot w_t$$
which can be interpreted as the mean (over the data in iteration t) of the feature vectors
ϕ(xt,i ) weighted by the (scalar) linear prediction errors ϕ(xt,i )T · wt − yt,i (plus regular-
ization term λ · wt ).
Then, the update to the weights vector w is given by:

$$w_{t+1} = w_t - \alpha_t \cdot \mathcal{G}_{(x_t, y_t)}(w_t)$$


where αt is the learning rate for the gradient descent at time t. To facilitate numerical
convergence, we require αt to be an appropriate function of time t. There are a number
of numerical algorithms to achieve the appropriate time-trajectory of αt . We shall go with
one such numerical algorithm - ADAM (Kingma and Ba 2014), which we shall use not
just for linear function approximation but later also for the deep neural network function
approximation. Before we write code for linear function approximation, we need to write
some helper code to implement the ADAM gradient descent algorithm.
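For reference, the ADAM update that the Weights class below implements is the standard one from (Kingma and Ba 2014): with gradient $g_t$ at step $t$, base learning rate $\alpha$, decay parameters $\beta_1, \beta_2$ (decay1 and decay2 in the code), a small constant $\epsilon$ (SMALL_NUM in the code) and caches $m_t, v_t$ (adam_cache1 and adam_cache2 in the code),

$$m_t = \beta_1 \cdot m_{t-1} + (1 - \beta_1) \cdot g_t, \quad v_t = \beta_2 \cdot v_{t-1} + (1 - \beta_2) \cdot g_t^2$$
$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \quad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}, \quad w_t = w_{t-1} - \alpha \cdot \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}$$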
We create an @dataclass Weights to represent and update the weights (i.e., internal pa-
rameters) of a function approximation. The Weights dataclass has 5 attributes: adam_gradient
that captures the ADAM parameters, including the base learning rate and the decay pa-
rameters, time that represents how many times the weights have been updated, weights
that represents the weight parameters of the function approximation as a numpy array (1-
D array for linear function approximation and 2-D array for each layer of deep neural net-
work function approximation), and the two ADAM cache parameters. The @staticmethod
create serves as a factory method to create a new instance of the Weights dataclass. The
update method of this Weights dataclass produces an updated instance of the Weights dat-
aclass that represents the updated weight parameters together with the incremented time
and the updated ADAM cache parameters. We will follow a programming design pat-
tern wherein we don’t update anything in-place - rather, we create a new object with up-
dated values (using the dataclasses.replace function). This ensures we don’t get unex-
pected/undesirable updates in-place, which are typically the cause of bugs in numerical
code. Finally, we write the within method which will be required to implement the within
method in the linear function approximation class as well as in the deep neural network
function approximation class.

SMALL_NUM = 1e-6

from dataclasses import replace


@dataclass(frozen=True)
class AdamGradient:
    learning_rate: float
    decay1: float
    decay2: float

    @staticmethod
    def default_settings() -> AdamGradient:
        return AdamGradient(
            learning_rate=0.001,
            decay1=0.9,
            decay2=0.999
        )


@dataclass(frozen=True)
class Weights:
    adam_gradient: AdamGradient
    time: int
    weights: np.ndarray
    adam_cache1: np.ndarray
    adam_cache2: np.ndarray

    @staticmethod
    def create(
        weights: np.ndarray,
        adam_gradient: AdamGradient = AdamGradient.default_settings(),
        adam_cache1: Optional[np.ndarray] = None,
        adam_cache2: Optional[np.ndarray] = None
    ) -> Weights:
        return Weights(
            adam_gradient=adam_gradient,
            time=0,
            weights=weights,
            adam_cache1=np.zeros_like(
                weights
            ) if adam_cache1 is None else adam_cache1,
            adam_cache2=np.zeros_like(
                weights
            ) if adam_cache2 is None else adam_cache2
        )

    def update(self, gradient: np.ndarray) -> Weights:
        time: int = self.time + 1
        new_adam_cache1: np.ndarray = self.adam_gradient.decay1 * \
            self.adam_cache1 + (1 - self.adam_gradient.decay1) * gradient
        new_adam_cache2: np.ndarray = self.adam_gradient.decay2 * \
            self.adam_cache2 + (1 - self.adam_gradient.decay2) * gradient ** 2
        corrected_m: np.ndarray = new_adam_cache1 / \
            (1 - self.adam_gradient.decay1 ** time)
        corrected_v: np.ndarray = new_adam_cache2 / \
            (1 - self.adam_gradient.decay2 ** time)
        new_weights: np.ndarray = self.weights - \
            self.adam_gradient.learning_rate * corrected_m / \
            (np.sqrt(corrected_v) + SMALL_NUM)
        return replace(
            self,
            time=time,
            weights=new_weights,
            adam_cache1=new_adam_cache1,
            adam_cache2=new_adam_cache2,
        )

    def within(self, other: Weights, tolerance: float) -> bool:
        return np.all(np.abs(self.weights - other.weights) <= tolerance).item()

With this Weights class, we are ready to write the dataclass LinearFunctionApprox for
linear function approximation, inheriting from the abstract base class FunctionApprox. It
has attributes feature_functions that represents ϕj : X → R for all j = 1, 2, . . . , m,
regularization_coeff that represents the regularization coefficient λ, weights which is an
instance of the Weights class we wrote above, and direct_solve (which we will explain
shortly). The static method create serves as a factory method to create a new instance of
LinearFunctionApprox. The method get_feature_values takes as input an x_values_seq:
Iterable[X] (representing a sequence or stream of x ∈ X ), and produces as output the
corresponding feature vectors ϕ(x) ∈ Rm for each x in the input. The feature vectors are
output in the form of a 2-D numpy array, with each feature vector ϕ(x) (for each x in the
input sequence) appearing as a row in the output 2-D numpy array (the number of rows
in this numpy array is the length of the input x_values_seq and the number of columns is
the number of feature functions). Note that often we want to include a bias term in our
linear function approximation, in which case we need to prepend the sequence of feature
functions we provide as input with an artificial feature function lambda _: 1. to represent
the constant feature with value 1. This will ensure we have a bias weight in addition to
each of the weights that serve as coefficients to the (non-artificial) feature functions.

The method evaluate (an abstract method in FunctionApprox) calculates the prediction $\mathbb{E}_M[y|x]$ for each input $x$ as $\phi(x)^T \cdot w = \sum_{j=1}^m \phi_j(x) \cdot w_j$. The method objective_gradient (from FunctionApprox) performs the calculation $\mathcal{G}_{(x_t, y_t)}(w_t)$ shown above: the mean of the
feature vectors ϕ(xt,i ) weighted by the (scalar) linear prediction errors ϕ(xt,i )T · wt − yt,i
(plus regularization term λ · wt ). The variable obj_deriv_out takes the role of the linear
prediction errors, when objective_gradient is invoked by the update method through the
method update_with_gradient. The method update_with_gradient (from FunctionApprox)
updates the weights using the calculated gradient along with the ADAM cache updates
(invoking the update method of the Weights class to ensure there are no in-place updates),
and returns a new LinearFunctionApprox object containing the updated weights.

from dataclasses import replace


@dataclass(frozen=True)
class LinearFunctionApprox(FunctionApprox[X]):

    feature_functions: Sequence[Callable[[X], float]]
    regularization_coeff: float
    weights: Weights
    direct_solve: bool

    @staticmethod
    def create(
        feature_functions: Sequence[Callable[[X], float]],
        adam_gradient: AdamGradient = AdamGradient.default_settings(),
        regularization_coeff: float = 0.,
        weights: Optional[Weights] = None,
        direct_solve: bool = True
    ) -> LinearFunctionApprox[X]:
        return LinearFunctionApprox(
            feature_functions=feature_functions,
            regularization_coeff=regularization_coeff,
            weights=Weights.create(
                adam_gradient=adam_gradient,
                weights=np.zeros(len(feature_functions))
            ) if weights is None else weights,
            direct_solve=direct_solve
        )

    def get_feature_values(self, x_values_seq: Iterable[X]) -> np.ndarray:
        return np.array(
            [[f(x) for f in self.feature_functions] for x in x_values_seq]
        )

    def objective_gradient(
        self,
        xy_vals_seq: Iterable[Tuple[X, float]],
        obj_deriv_out_fun: Callable[[Sequence[X], Sequence[float]], np.ndarray]
    ) -> Gradient[LinearFunctionApprox[X]]:
        x_vals, y_vals = zip(*xy_vals_seq)
        obj_deriv_out: np.ndarray = obj_deriv_out_fun(x_vals, y_vals)
        features: np.ndarray = self.get_feature_values(x_vals)
        gradient: np.ndarray = \
            features.T.dot(obj_deriv_out) / len(obj_deriv_out) \
            + self.regularization_coeff * self.weights.weights
        return Gradient(replace(
            self,
            weights=replace(
                self.weights,
                weights=gradient
            )
        ))

    def evaluate(self, x_values_seq: Iterable[X]) -> np.ndarray:
        return np.dot(
            self.get_feature_values(x_values_seq),
            self.weights.weights
        )

    def update_with_gradient(
        self,
        gradient: Gradient[LinearFunctionApprox[X]]
    ) -> LinearFunctionApprox[X]:
        return replace(
            self,
            weights=self.weights.update(
                gradient.function_approx.weights.weights
            )
        )

We also require the within method, that simply delegates to the within method of the
Weights class.

    def within(self, other: FunctionApprox[X], tolerance: float) -> bool:
        if isinstance(other, LinearFunctionApprox):
            return self.weights.within(other.weights, tolerance)
        else:
            return False

The only method that remains to be written now is the solve method. Note that for linear
function approximation, we can directly solve for w∗ if the number of feature functions m
is not too large. If the entire provided data is [(xi , yi )|1 ≤ i ≤ n], then the gradient estimate
based on this data can be set to 0 to solve for w∗ , i.e.,

$$\frac{1}{n} \cdot (\sum_{i=1}^n \phi(x_i) \cdot (\phi(x_i)^T \cdot w^* - y_i)) + \lambda \cdot w^* = 0$$

We denote $\Phi$ as the $n$ rows $\times$ $m$ columns matrix defined as $\Phi_{i,j} = \phi_j(x_i)$ and the column vector $Y \in \mathbb{R}^n$ defined as $Y_i = y_i$. Then we can write the above equation as:

$$\frac{1}{n} \cdot \Phi^T \cdot (\Phi \cdot w^* - Y) + \lambda \cdot w^* = 0$$
$$\Rightarrow (\Phi^T \cdot \Phi + n\lambda \cdot I_m) \cdot w^* = \Phi^T \cdot Y$$
$$\Rightarrow w^* = (\Phi^T \cdot \Phi + n\lambda \cdot I_m)^{-1} \cdot \Phi^T \cdot Y$$
where Im is the m × m identity matrix. Note that this direct linear-algebraic solution for
solving a square linear system of equations of size m is computationally feasible only if m
is not too large.
On the other hand, if the number of feature functions m is large, then we solve for w∗
by repeatedly calling update. The attribute direct_solve: bool in LinearFunctionApprox
specifies whether to perform a direct solution (linear algebra calculations shown above)
or to perform a sequence of iterative (incremental) updates to w using gradient descent.
The code below for the method solve does exactly this:
import itertools
import rl.iterate as iterate

    def solve(
        self,
        xy_vals_seq: Iterable[Tuple[X, float]],
        error_tolerance: Optional[float] = None
    ) -> LinearFunctionApprox[X]:
        if self.direct_solve:
            x_vals, y_vals = zip(*xy_vals_seq)
            feature_vals: np.ndarray = self.get_feature_values(x_vals)
            feature_vals_T: np.ndarray = feature_vals.T
            left: np.ndarray = np.dot(feature_vals_T, feature_vals) \
                + feature_vals.shape[0] * self.regularization_coeff * \
                np.eye(len(self.weights.weights))
            right: np.ndarray = np.dot(feature_vals_T, y_vals)
            ret = replace(
                self,
                weights=Weights.create(
                    adam_gradient=self.weights.adam_gradient,
                    weights=np.linalg.solve(left, right)
                )
            )
        else:
            tol: float = 1e-6 if error_tolerance is None else error_tolerance

            def done(
                a: LinearFunctionApprox[X],
                b: LinearFunctionApprox[X],
                tol: float = tol
            ) -> bool:
                return a.within(b, tol)

            ret = iterate.converged(
                self.iterate_updates(itertools.repeat(list(xy_vals_seq))),
                done=done
            )
        return ret

The above code is in the file rl/function_approx.py.
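Here is a minimal usage sketch of LinearFunctionApprox (the synthetic data below is made up purely for illustration): we fit $y \approx 2 + 3x$ from noisy samples using a constant (bias) feature and the feature $x$, with the default direct solve.

import numpy as np
from rl.function_approx import LinearFunctionApprox

x_data = np.linspace(0., 1., 21)
xy_pairs = [(x, 2. + 3. * x + np.random.normal(0., 0.1)) for x in x_data]

lfa = LinearFunctionApprox.create(
    feature_functions=[lambda _: 1., lambda x: x]
)
fitted = lfa.solve(xy_pairs)
print(fitted.weights.weights)   # should be close to [2., 3.]
print(fitted.rmse(xy_pairs))    # should be close to the noise level of 0.1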

4.3. Neural Network Function Approximation


Now we generalize the linear function approximation to accommodate non-linear func-
tions with a simple deep neural network, specifically a feed-forward fully-connected neu-
ral network. We work with the same notation ϕ(·) = (ϕ1 (·), ϕ2 (·), . . . , ϕm (·)) for feature
functions that we covered for the case of linear function approximation. Assume we have
L hidden layers in the neural network. Layers numbered l = 0, 1, . . . , L − 1 carry the
hidden layer neurons and layer l = L carries the output layer neurons.
A couple of things to note about our notation for vectors and matrices when performing
linear algebra operations: Vectors will be treated as column vectors (including gradient of
a scalar with respect to a vector). The gradient of a vector of dimension m with respect to
a vector of dimension n is expressed as a Jacobian matrix with m rows and n columns. We
use the notation dim(v) to refer to the dimension of a vector v.
We denote the input to layer l as vector il and the output to layer l as vector ol , for all
l = 0, 1, . . . , L. Denoting the predictor variable as x ∈ X , the response variable as y ∈ R,
and the neural network as model M to predict the expected value of y conditional on x,
we have:

i0 = ϕ(x) ∈ Rm and oL = EM [y|x] and il+1 = ol for all l = 0, 1, . . . , L − 1 (4.1)

We denote the parameters for layer l as the matrix wl with dim(ol ) rows and dim(il )
columns. Note that the number of neurons in layer l is equal to dim(ol ). Since we are

restricting ourselves to scalar y, dim(oL ) = 1 and so, the number of neurons in the output
layer is 1.
The neurons in layer l define a linear transformation from layer input il to a variable we
denote as sl . Therefore,

sl = wl · il for all l = 0, 1, . . . , L (4.2)


We denote the activation function of layer l as gl : R → R for all l = 0, 1, . . . , L. The acti-
vation function gl applies point-wise on each dimension of vector sl , so we take notational
liberty with gl by writing:

ol = gl (sl ) for all l = 0, 1, . . . , L (4.3)

Equations (4.1), (4.2) and (4.3) together define the calculation of the neural network
prediction oL (associated with the response variable y), given the predictor variable x.
This calculation is known as forward-propagation and will define the evaluate method of
the deep neural network function approximation class we shall soon write.
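To make the forward-propagation recursion concrete, here is a minimal sketch for a single input $x$, ignoring bias terms (the feature functions, layer sizes, weights and activation choices below are made up for illustration and are not the DNNApprox code we write later):

import numpy as np

phi = lambda x: np.array([1., x, x * x])        # phi(x) in R^3, so i_0 = phi(x)
g_hidden = lambda s: 1. / (1. + np.exp(-s))     # g_l for the hidden layer (sigmoid)
g_output = lambda s: s                          # g_L (identity, i.e., Gaussian y|x)

w = [np.random.randn(4, 3), np.random.randn(1, 4)]   # w_0 (one hidden layer), w_1 = w_L

i_l = phi(0.5)                                  # Equation (4.1): i_0 = phi(x)
for l, w_l in enumerate(w):
    s_l = w_l @ i_l                             # Equation (4.2): s_l = w_l . i_l
    o_l = g_output(s_l) if l == len(w) - 1 else g_hidden(s_l)   # Equation (4.3): o_l = g_l(s_l)
    i_l = o_l                                   # Equation (4.1): i_{l+1} = o_l
print(o_l)                                      # o_L = E_M[y|x] (a length-1 array here)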
Our goal is to derive an expression for the cross-entropy loss gradient ∇wl L for all
l = 0, 1, . . . , L. For ease of understanding, our following exposition will be expressed
in terms of the cross-entropy loss function for a single predictor variable input x ∈ X and
it’s associated single response variable y ∈ R (the code will generalize appropriately to
the cross-entropy loss function for a given set of data points [xi , yi |1 ≤ i ≤ n]).
We can reduce this problem of calculating the cross-entropy loss gradient to the problem
of calculating Pl = ∇sl L for all l = 0, 1, . . . , L, as revealed by the following chain-rule
calculation:

$$\nabla_{w_l} \mathcal{L} = (\nabla_{s_l} \mathcal{L})^T \cdot \nabla_{w_l} s_l = P_l^T \cdot \nabla_{w_l} s_l = P_l \cdot i_l^T \quad \text{for all } l = 0, 1, \ldots, L$$

Note that $P_l \cdot i_l^T$ represents the outer-product of the $dim(o_l)$-size vector $P_l$ and the $dim(i_l)$-size vector $i_l$, giving a matrix of size $dim(o_l) \times dim(i_l)$.
If we include $L^2$ regularization (with $\lambda_l$ as the regularization coefficient for layer $l$), then:

$$\nabla_{w_l} \mathcal{L} = P_l \cdot i_l^T + \lambda_l \cdot w_l \quad \text{for all } l = 0, 1, \ldots, L \tag{4.4}$$


Here’s the summary of our notation:

Notation | Description
$i_l$ | Vector Input to layer $l$ for all $l = 0, 1, \ldots, L$
$o_l$ | Vector Output of layer $l$ for all $l = 0, 1, \ldots, L$
$\phi(x)$ | Feature Vector for predictor variable $x$
$y$ | Response variable associated with predictor variable $x$
$w_l$ | Matrix of Parameters for layer $l$ for all $l = 0, 1, \ldots, L$
$g_l(\cdot)$ | Activation function for layer $l$ for $l = 0, 1, \ldots, L$
$s_l$ | $s_l = w_l \cdot i_l$, $o_l = g_l(s_l)$ for all $l = 0, 1, \ldots, L$
$P_l$ | $P_l = \nabla_{s_l} \mathcal{L}$ for all $l = 0, 1, \ldots, L$
$\lambda_l$ | Regularization coefficient for layer $l$ for all $l = 0, 1, \ldots, L$

Now that we have reduced the loss gradient calculation to calculation of Pl , we spend
the rest of this section deriving the analytical calculation of Pl . The following theorem tells
us that Pl has a recursive formulation that forms the core of the back-propagation algorithm
for a feed-forward fully-connected deep neural network.

Theorem 4.3.1. For all $l = 0, 1, \ldots, L-1$,

$$P_l = (w_{l+1}^T \cdot P_{l+1}) \circ g_l'(s_l)$$

where the symbol $\circ$ represents the Hadamard Product, i.e., point-wise multiplication of two vectors of the same dimension.

Proof. We start by applying the chain rule on $P_l$.

$$P_l = \nabla_{s_l} \mathcal{L} = (\nabla_{s_l} s_{l+1})^T \cdot \nabla_{s_{l+1}} \mathcal{L} = (\nabla_{s_l} s_{l+1})^T \cdot P_{l+1} \tag{4.5}$$

Next, note that:

$$s_{l+1} = w_{l+1} \cdot g_l(s_l)$$

Therefore,

$$\nabla_{s_l} s_{l+1} = w_{l+1} \cdot Diagonal(g_l'(s_l))$$

where the notation $Diagonal(v)$ for an $m$-dimensional vector $v$ represents an $m \times m$ diagonal matrix whose elements are the same (also in same order) as the elements of $v$. Substituting this in Equation (4.5) yields:

$$P_l = (w_{l+1} \cdot Diagonal(g_l'(s_l)))^T \cdot P_{l+1} = Diagonal(g_l'(s_l)) \cdot w_{l+1}^T \cdot P_{l+1} = g_l'(s_l) \circ (w_{l+1}^T \cdot P_{l+1}) = (w_{l+1}^T \cdot P_{l+1}) \circ g_l'(s_l)$$

Now all we need to do is to calculate PL = ∇sL L so that we can run this recursive
formulation for Pl , estimate the loss gradient ∇wl L for any given data (using Equation
(4.4)), and perform gradient descent to arrive at wl∗ for all l = 0, 1, . . . L.
Firstly, note that sL , oL , PL are all scalars, so let’s just write them as sL , oL , PL respec-
tively (without the bold-facing) to make it explicit in the derivation that they are scalars.
Specifically, the gradient

$$\nabla_{s_L} \mathcal{L} = \frac{\partial \mathcal{L}}{\partial s_L}$$

To calculate $\frac{\partial \mathcal{L}}{\partial s_L}$, we need to assume a functional form for $\mathbb{P}[y|s_L]$. We work with a fairly generic exponential functional form for the probability distribution function:
$$p(y | \theta, \tau) = h(y, \tau) \cdot e^{\frac{\theta \cdot y - A(\theta)}{d(\tau)}}$$

where θ should be thought of as the “center” parameter (related to the mean) of the
probability distribution and τ should be thought of as the “dispersion” parameter (re-
lated to the variance) of the distribution. h(·, ·), A(·), d(·) are general functions whose spe-
cializations define the family of distributions that can be modeled with this fairly generic
exponential functional form (note that this structure is adopted from the framework of
Generalized Linear Models).
For our neural network function approximation, we assume that τ is a constant, and we
set θ to be sL . So,
$$\mathbb{P}[y|s_L] = p(y|s_L, \tau) = h(y, \tau) \cdot e^{\frac{s_L \cdot y - A(s_L)}{d(\tau)}}$$

We require the scalar prediction of the neural network oL = gL (sL ) to be equal to


Ep [y|sL ]. So the question is: What function gL : R → R (in terms of the functional form
of p(y|sL , τ )) would satisfy the requirement of oL = gL (sL ) = Ep [y|sL ]? To answer this
question, we first establish the following Lemma:

Lemma 4.3.2.

$$\mathbb{E}_p[y|s_L] = A'(s_L)$$

Proof. Since

$$\int_{-\infty}^{\infty} p(y|s_L, \tau) \cdot dy = 1,$$

the partial derivative of the left-hand-side of the above equation with respect to $s_L$ is zero. In other words,

$$\frac{\partial \{\int_{-\infty}^{\infty} p(y|s_L, \tau) \cdot dy\}}{\partial s_L} = 0$$

Hence,

$$\frac{\partial \{\int_{-\infty}^{\infty} h(y, \tau) \cdot e^{\frac{s_L \cdot y - A(s_L)}{d(\tau)}} \cdot dy\}}{\partial s_L} = 0$$

Taking the partial derivative inside the integral, we get:

$$\int_{-\infty}^{\infty} h(y, \tau) \cdot e^{\frac{s_L \cdot y - A(s_L)}{d(\tau)}} \cdot \frac{y - A'(s_L)}{d(\tau)} \cdot dy = 0$$

$$\Rightarrow \int_{-\infty}^{\infty} p(y|s_L, \tau) \cdot (y - A'(s_L)) \cdot dy = 0$$

$$\Rightarrow \mathbb{E}_p[y|s_L] = A'(s_L)$$

So to satisfy oL = gL (sL ) = Ep [y|sL ], we require that

oL = gL (sL ) = A′ (sL ) (4.6)

The above equation is important since it tells us that the output layer activation function
gL (·) must be set to be the derivative of the A(·) function. In the theory of generalized linear
models, the derivative of the A(·) function serves as the canonical link function for a given
probability distribution of the response variable conditional on the predictor variable.
Now we are equipped to derive a simple expression for PL .

Theorem 4.3.3.

$$P_L = \frac{\partial \mathcal{L}}{\partial s_L} = \frac{o_L - y}{d(\tau)}$$

Proof. The Cross-Entropy Loss (Negative Log-Likelihood) for a single training data point $(x, y)$ is given by:

$$\mathcal{L} = -\log(h(y, \tau)) + \frac{A(s_L) - s_L \cdot y}{d(\tau)}$$

Therefore,

$$P_L = \frac{\partial \mathcal{L}}{\partial s_L} = \frac{A'(s_L) - y}{d(\tau)}$$

But from Equation (4.6), we know that A′ (sL ) = oL . Therefore,

$$P_L = \frac{\partial \mathcal{L}}{\partial s_L} = \frac{o_L - y}{d(\tau)}$$

At each iteration of gradient descent, we require an estimate of the loss gradient up to a


constant factor. So we can ignore the constant d(τ ) and simply say that PL = oL − y (up to
a constant factor). This is a rather convenient estimate of PL for a given data point (x, y)
since it represents the neural network prediction error for that data point. When presented
with a sequence of data points [(xt,i , yt,i )|1 ≤ i ≤ nt ] in iteration t, we simply average the
prediction errors across these presented data points. Then, beginning with this estimate of
PL , we can use the recursive formulation of Pl (Theorem 4.3.1) to calculate the gradient of
the loss function (Equation (4.4)) with respect to all the parameters of the neural network
(this is known as the back-propagation algorithm for a fully-connected feed-forward deep
neural network).
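As a minimal single-data-point sketch of this back-propagation recursion (Theorem 4.3.1 together with Equation (4.4), ignoring bias terms and regularization; all the concrete numbers, layer sizes and activation choices below are made up for illustration):

import numpy as np

sigmoid = lambda s: 1. / (1. + np.exp(-s))
sigmoid_deriv = lambda s: sigmoid(s) * (1. - sigmoid(s))      # g_l'(s_l) for the hidden layer

w = [np.random.randn(4, 3), np.random.randn(1, 4)]            # w_0, w_L
i_0, y = np.array([1., 0.5, 0.25]), 1.7                       # phi(x) and the response y

# Forward pass, remembering i_l and s_l for each layer
inputs, s_vals = [i_0], []
for l, w_l in enumerate(w):
    s_vals.append(w_l @ inputs[-1])
    o_l = s_vals[-1] if l == len(w) - 1 else sigmoid(s_vals[-1])   # identity output activation
    inputs.append(o_l)

# Backward pass: P_L = o_L - y (up to the constant d(tau)), then the recursion of Theorem 4.3.1
P = inputs[-1] - y
grads = [np.zeros_like(w_l) for w_l in w]
grads[-1] = np.outer(P, inputs[-2])                           # nabla_{w_L} = P_L . i_L^T
for l in reversed(range(len(w) - 1)):
    P = (w[l + 1].T @ P) * sigmoid_deriv(s_vals[l])           # P_l = (w_{l+1}^T . P_{l+1}) o g_l'(s_l)
    grads[l] = np.outer(P, inputs[l])                         # nabla_{w_l} = P_l . i_l^T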
Here are some common specializations of the functional form for the conditional proba-
bility distribution P[y|sL ], along with the corresponding canonical link function that serves
as the activation function gL of the output layer:

• Normal distribution $y \sim \mathcal{N}(\mu, \sigma^2)$:

$$s_L = \mu, \quad \tau = \sigma, \quad h(y, \tau) = \frac{e^{\frac{-y^2}{2\tau^2}}}{\sqrt{2\pi} \cdot \tau}, \quad d(\tau) = \tau, \quad A(s_L) = \frac{s_L^2}{2}$$

$$\Rightarrow o_L = g_L(s_L) = \mathbb{E}[y|s_L] = A'(s_L) = s_L$$


Hence, the output layer activation function gL is the identity function. This means
that the linear function approximation of the previous section is exactly the same as
a neural network with 0 hidden layers (just the output layer) and with the output
layer activation function equal to the identity function.
• Bernoulli distribution for binary-valued y, parameterized by p:
$$s_L = \log(\frac{p}{1-p}), \quad \tau = 1, \quad h(y, \tau) = 1, \quad d(\tau) = 1, \quad A(s_L) = \log(1 + e^{s_L})$$

$$\Rightarrow o_L = g_L(s_L) = \mathbb{E}[y|s_L] = A'(s_L) = \frac{1}{1 + e^{-s_L}}$$
Hence, the output layer activation function gL is the logistic function. This gener-
alizes to softmax gL when we generalize this framework to multivariate y, which in
turn enables us to classify inputs x into a finite set of categories represented by y as
one-hot-encodings.
• Poisson distribution for y parameterized by λ:

$$s_L = \log \lambda, \quad \tau = 1, \quad h(y, \tau) = \frac{1}{y!}, \quad d(\tau) = 1, \quad A(s_L) = e^{s_L}$$

$$\Rightarrow o_L = g_L(s_L) = \mathbb{E}[y|s_L] = A'(s_L) = e^{s_L}$$


Hence, the output layer activation function gL is the exponential function.

Now we are ready to write a class for function approximation with the deep neural
network framework described above. We assume that the activation functions gl (·) are
identical for all l = 0, 1, . . . , L − 1 (known as the hidden layers activation function) and
the activation function gL (·) is known as the output layer activation function. Note that
often we want to include a bias term in the linear transformations of the layers. To include a bias term in layer 0, just like in the case of LinearFunctionApprox, we prepend the sequence of feature functions we want to provide as input with an artificial feature function lambda _: 1. to represent the constant feature with value 1. This ensures we have a bias weight in layer 0 in addition to each of the weights (in layer 0) that serve as coefficients to the (non-artificial) feature functions. Moreover, we allow the specification of a bias boolean variable to enable a bias term in each of the layers $l = 1, 2, \ldots, L$.
Before we develop the code for forward-propagation and back-propagation, we write
a @dataclass to hold the configuration of a deep neural network (number of neurons in
the layers, the bias boolean variable, hidden layers activation function and output layer
activation function).
@dataclass(frozen=True)
class DNNSpec:
    neurons: Sequence[int]
    bias: bool
    hidden_activation: Callable[[np.ndarray], np.ndarray]
    hidden_activation_deriv: Callable[[np.ndarray], np.ndarray]
    output_activation: Callable[[np.ndarray], np.ndarray]
    output_activation_deriv: Callable[[np.ndarray], np.ndarray]

neurons is a sequence of length L specifying dim(O0 ), dim(O1 ), . . . , dim(OL−1 ) (note


dim(oL ) doesn’t need to be specified since we know dim(oL ) = 1). If bias is set to be True,
then dim(Il ) = dim(Ol−1 ) + 1 for all l = 1, 2, . . . L and so in the code below, when bias
is True, we’ll need to prepend the matrix representing Il with a vector consisting of all 1s
(to incorporate the bias term). Note that along with specifying the hidden and output
layers activation functions gl (·) defined as gl (sl ) = ol , we also specify the hidden layers
activation function derivative (hidden_activation_deriv) and the output layer activation
function derivative (output_activation_deriv) in the form of functions hl (·) defined as
hl (g(sl )) = hl (ol ) = gl′ (sl ) (as we know, this derivative is required in the back-propagation
calculation). We shall soon see that in the code, hl (·) is a more convenient specification
than the direct specification of gl′ (·).
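To make this derivative convention concrete, here is a small illustrative sketch (the layer sizes and choice of activations are made up for illustration) of activation functions specified in the form DNNSpec expects. The sigmoid function (the logistic link of the previous section) is a convenient example because its derivative is naturally expressed as a function of its output: g′(s) = o · (1 − o).

import numpy as np

def sigmoid(arg: np.ndarray) -> np.ndarray:
    return 1. / (1. + np.exp(-arg))

def sigmoid_deriv(res: np.ndarray) -> np.ndarray:
    # derivative g'(s) written as a function of the output o = g(s)
    return res * (1. - res)

# Illustrative DNNSpec with two hidden layers (sizes are made up) and an
# identity output activation (whose derivative, as a function of the output,
# is the constant 1).
example_spec = DNNSpec(
    neurons=[4, 4],
    bias=True,
    hidden_activation=sigmoid,
    hidden_activation_deriv=sigmoid_deriv,
    output_activation=lambda arg: arg,
    output_activation_deriv=lambda res: np.ones_like(res)
)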
Now we write the @dataclass DNNApprox that implements the abstract base class FunctionApprox.
It has attributes:

• feature_functions that represents ϕj : X → R for all j = 1, 2, . . . , m.


• dnn_spec that specifies the neural network configuration (instance of DNNSpec).
• regularization_coeff that represents the common regularization coefficient λ for
the weights across all layers.
• weights which is a sequence of Weights objects (to represent and update the weights
of all layers).

The method get_feature_values is identical to the case of LinearFunctionApprox, produc-
ing a matrix with number of rows equal to the number of x values in its input x_values_seq:
Iterable[X] and number of columns equal to the number of specified feature_functions.
The method forward_propagation implements the forward-propagation calculation that
was covered earlier (combining Equations (4.1) (potentially adjusted for the bias term, as
mentioned above), (4.2) and (4.3)). forward_propagation takes as input the same data

type as the input of get_feature_values (x_values_seq: Iterable[X]) and returns a list
with L + 2 numpy arrays. The last element of the returned list is a 1-D numpy array
representing the final output of the neural network: oL = EM [y|x] for each of the x values
in the input x_values_seq. The remaining L + 1 elements in the returned list are each 2-D
numpy arrays, consisting of il for all l = 0, 1, . . . L (for each of the x values provided as
input in x_values_seq).
The method evaluate (from FunctionApprox) returns the last element (oL = EM [y|x])
of the list returned by forward_propagation.
The method backward_propagation is the most important method of DNNApprox, calculat-
ing ∇wl Obj for all l = 0, 1, . . . , L, for some objective function Obj. We had said previously
that for each concrete function approximation that we’d want to implement, if the Objec-
tive Obj(xi , yi ) is the cross-entropy loss function, we can identify a model-computed value
Out(xi ) (either the output of the model or an intermediate computation of the model) such
that ∂Obj(xi , yi )/∂Out(xi ) is equal to the prediction error EM [y|xi ] − yi (for each training data point
(xi , yi )) and we can come up with a numerical algorithm to compute ∇w Out(xi ), so that
by chain-rule, we have the required gradient ∇w Obj(xi , yi ) (without regularization). In
the case of this DNN function approximation, the model-computed value Out(xi ) is sL .
Thus,

∂Obj(xi , yi )/∂Out(xi ) = ∂L/∂sL = PL = oL − yi = EM [y|xi ] − yi
backward_propagation takes two inputs:

1. fwd_prop: Sequence[np.ndarray] which represents the output of forward_propagation


except for the last element (which is the final output of the neural network), i.e., a
sequence of L + 1 2-D numpy arrays representing the inputs to layers l = 0, 1, . . . L
(for each of the Iterable of x-values provided as input to the neural network).
2. obj_deriv_out: np.ndarray, which represents the partial derivative of an arbitrary
objective function Obj with respect to an arbitrary model-produced value Out, eval-
uated at each of the Iterable of (x, y) pairs that are provided as training data.

If we generalize the objective function from the cross-entropy loss function L to an arbi-
trary objective function Obj and define Pl to be ∇sl Obj (generalized from ∇sl L), then the
output of backward_propagation would be equal to Pl · il^T (i.e., without the regularization
term) for all l = 0, 1, . . . L.
The first step in backward_propagation is to set PL (variable deriv in the code) equal to
obj_deriv_out (which in the case of cross-entropy loss as Obj and sL as Out, reduces to
the prediction error EM [y|xi ] − yi ). As we walk back through the layers of the DNN, the
variable deriv represents Pl = ∇sl Obj, evaluated for each of the values made available by
fwd_prop (note that deriv is updated in each iteration of the loop reflecting Theorem 4.3.1:
Pl = (wl+1^T · Pl+1 ) ◦ gl′ (sl )). Note also that the returned list back_prop is populated with
the result of Equation (4.4): ∇wl Obj = Pl · il^T.
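To make the shapes and the recursion concrete, here is a small standalone numpy sketch (not part of DNNApprox; the dimensions and data are made up, and bias terms are omitted) of this backward recursion for a single hidden layer with a sigmoid hidden activation and an identity output activation:

import numpy as np

n, d0, d1 = 5, 3, 2                       # n data points, 3 features, 2 hidden neurons
i0 = np.random.randn(n, d0)               # layer-0 inputs (features), one row per point
w0 = np.random.randn(d1, d0)              # hidden-layer weights, |o_0| x |i_0|
w1 = np.random.randn(1, d1)               # output-layer weights, 1 x |i_1|
y = np.random.randn(n)                    # made-up response values

# forward propagation
s0 = i0 @ w0.T                            # n x |o_0|
o0 = 1. / (1. + np.exp(-s0))              # sigmoid hidden activation
i1 = o0                                   # no bias term, so i_1 = o_0
o1 = (i1 @ w1.T)[:, 0]                    # identity output activation

# backward propagation
P1 = (o1 - y).reshape(1, -1)              # P_L = dObj/ds_L = prediction error, 1 x n
grad_w1 = P1 @ i1 / n                     # P_L . i_L^T (averaged over the n points)
P0 = (w1.T @ P1) * (o0 * (1 - o0)).T      # P_0 = (w_1^T . P_1) Hadamard g'(s_0)
grad_w0 = P0 @ i0 / n                     # P_0 . i_0^T (averaged over the n points)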
The method objective_gradient (from FunctionApprox) takes as input an Iterable of
(x, y) pairs and the ∂Obj/∂Out function, invokes the forward_propagation method (to be passed
as input to backward_propagation), then invokes backward_propagation, and finally adds on
the regularization term λ · wl to the output of backward_propagation to return the gradient
∇wl Obj for all l = 0, 1, . . . L.
The method update_with_gradient (from FunctionApprox) takes as input a gradient (e.g.,
∇wl Obj), updates the weights wl for all l = 0, 1, . . . , L along with the ADAM cache up-

dates (invoking the update method of the Weights class to ensure there are no in-place
updates), and returns a new instance of DNNApprox that contains the updated weights.
Finally, the method solve (from FunctionApprox) utilizes the method iterate_updates
(inherited from FunctionApprox) along with the method within to perform a best-fit of the
weights that minimizes the cross-entropy loss function (basically, a series of incremental
updates based on gradient descent).

from dataclasses import replace
import itertools
import rl.iterate as iterate

@dataclass(frozen=True)
class DNNApprox(FunctionApprox[X]):

    feature_functions: Sequence[Callable[[X], float]]
    dnn_spec: DNNSpec
    regularization_coeff: float
    weights: Sequence[Weights]

    @staticmethod
    def create(
        feature_functions: Sequence[Callable[[X], float]],
        dnn_spec: DNNSpec,
        adam_gradient: AdamGradient = AdamGradient.default_settings(),
        regularization_coeff: float = 0.,
        weights: Optional[Sequence[Weights]] = None
    ) -> DNNApprox[X]:
        if weights is None:
            inputs: Sequence[int] = [len(feature_functions)] + \
                [n + (1 if dnn_spec.bias else 0)
                 for i, n in enumerate(dnn_spec.neurons)]
            outputs: Sequence[int] = list(dnn_spec.neurons) + [1]
            wts = [Weights.create(
                weights=np.random.randn(output, inp) / np.sqrt(inp),
                adam_gradient=adam_gradient
            ) for inp, output in zip(inputs, outputs)]
        else:
            wts = weights
        return DNNApprox(
            feature_functions=feature_functions,
            dnn_spec=dnn_spec,
            regularization_coeff=regularization_coeff,
            weights=wts
        )

    def get_feature_values(self, x_values_seq: Iterable[X]) -> np.ndarray:
        return np.array(
            [[f(x) for f in self.feature_functions] for x in x_values_seq]
        )

    def forward_propagation(
        self,
        x_values_seq: Iterable[X]
    ) -> Sequence[np.ndarray]:
        """
        :param x_values_seq: a n-length iterable of input points
        :return: list of length (L+2) where the first (L+1) values
        each represent the 2-D input arrays (of size n x |i_l|),
        for each of the (L+1) layers (L of which are hidden layers),
        and the last value represents the output of the DNN (as a
        1-D array of length n)
        """
        inp: np.ndarray = self.get_feature_values(x_values_seq)
        ret: List[np.ndarray] = [inp]
        for w in self.weights[:-1]:
            out: np.ndarray = self.dnn_spec.hidden_activation(
                np.dot(inp, w.weights.T)
            )
            if self.dnn_spec.bias:
                inp = np.insert(out, 0, 1., axis=1)
            else:
                inp = out
            ret.append(inp)
        ret.append(
            self.dnn_spec.output_activation(
                np.dot(inp, self.weights[-1].weights.T)
            )[:, 0]
        )
        return ret

    def evaluate(self, x_values_seq: Iterable[X]) -> np.ndarray:
        return self.forward_propagation(x_values_seq)[-1]

    def backward_propagation(
        self,
        fwd_prop: Sequence[np.ndarray],
        obj_deriv_out: np.ndarray
    ) -> Sequence[np.ndarray]:
        """
        :param fwd_prop represents the result of forward propagation (without
        the final output), a sequence of L+1 2-D np.ndarrays of the DNN.
        :param obj_deriv_out represents the derivative of the objective
        function with respect to the linear predictor of the final layer.
        :return: list (of length L+1) of |o_l| x |i_l| 2-D arrays,
        i.e., same as the type of self.weights.weights
        This function computes the gradient (with respect to weights) of
        the objective where the output layer activation function
        is the canonical link function of the conditional distribution of y|x
        """
        deriv: np.ndarray = obj_deriv_out.reshape(1, -1)
        back_prop: List[np.ndarray] = [np.dot(deriv, fwd_prop[-1]) /
                                       deriv.shape[1]]
        # L is the number of hidden layers, n is the number of points
        # layer l deriv represents dObj/ds_l where s_l = i_l . weights_l
        # (s_l is the result of applying layer l without the activation func)
        for i in reversed(range(len(self.weights) - 1)):
            # deriv_l is a 2-D array of dimension |o_l| x n
            # The recursive formulation of deriv is as follows:
            # deriv_{l-1} = (weights_l^T inner deriv_l) Hadamard g'(s_{l-1}),
            # which is ((|i_l| x |o_l|) inner (|o_l| x n)) Hadamard
            # (|i_l| x n), which is (|i_l| x n) = (|o_{l-1}| x n)
            # Note: g'(s_{l-1}) is expressed as hidden layer activation
            # derivative as a function of o_{l-1} (=i_l).
            deriv = np.dot(self.weights[i + 1].weights.T, deriv) * \
                self.dnn_spec.hidden_activation_deriv(fwd_prop[i + 1].T)
            # If self.dnn_spec.bias is True, then i_l = o_{l-1} + 1, in which
            # case the first row of the calculated deriv is removed to yield
            # a 2-D array of dimension |o_{l-1}| x n.
            if self.dnn_spec.bias:
                deriv = deriv[1:]
            # layer l gradient is deriv_l inner fwd_prop[l], which is
            # of dimension (|o_l| x n) inner (n x |i_l|) = |o_l| x |i_l|
            back_prop.append(np.dot(deriv, fwd_prop[i]) / deriv.shape[1])
        return back_prop[::-1]

    def objective_gradient(
        self,
        xy_vals_seq: Iterable[Tuple[X, float]],
        obj_deriv_out_fun: Callable[[Sequence[X], Sequence[float]], float]
    ) -> Gradient[DNNApprox[X]]:
        x_vals, y_vals = zip(*xy_vals_seq)
        obj_deriv_out: np.ndarray = obj_deriv_out_fun(x_vals, y_vals)
        fwd_prop: Sequence[np.ndarray] = self.forward_propagation(x_vals)[:-1]
        gradient: Sequence[np.ndarray] = \
            [x + self.regularization_coeff * self.weights[i].weights
             for i, x in enumerate(self.backward_propagation(
                 fwd_prop=fwd_prop,
                 obj_deriv_out=obj_deriv_out
             ))]
        return Gradient(replace(
            self,
            weights=[replace(w, weights=g) for
                     w, g in zip(self.weights, gradient)]
        ))

    def solve(
        self,
        xy_vals_seq: Iterable[Tuple[X, float]],
        error_tolerance: Optional[float] = None
    ) -> DNNApprox[X]:
        tol: float = 1e-6 if error_tolerance is None else error_tolerance

        def done(
            a: DNNApprox[X],
            b: DNNApprox[X],
            tol: float = tol
        ) -> bool:
            return a.within(b, tol)

        return iterate.converged(
            self.iterate_updates(itertools.repeat(list(xy_vals_seq))),
            done=done
        )

    def within(self, other: FunctionApprox[X], tolerance: float) -> bool:
        if isinstance(other, DNNApprox):
            return all(w1.within(w2, tolerance)
                       for w1, w2 in zip(self.weights, other.weights))
        else:
            return False

All of the above code is in the file rl/function_approx.py.


A comprehensive treatment of function approximations using Deep Neural Networks
can be found in the Deep Learning book by Goodfellow, Bengio, Courville (Goodfellow,
Bengio, and Courville 2016).
Let us now write some code to create function approximations with LinearFunctionApprox
and DNNApprox, given a stream of data from a simple data model - one that has some noise
around a linear function. Here’s some code to create an Iterator of (x, y) pairs (where
x = (x1 , x2 , x3 )) for the data model:

y = 2 + 10x1 + 4x2 − 6x3 + N (0, 0.3)


def example_model_data_generator() -> Iterator[Tuple[Triple, float]]:
    coeffs: Aug_Triple = (2., 10., 4., -6.)
    d = norm(loc=0., scale=0.3)

    while True:
        pt: np.ndarray = np.random.randn(3)
        x_val: Triple = (pt[0], pt[1], pt[2])
        y_val: float = coeffs[0] + np.dot(coeffs[1:], pt) + \
            d.rvs(size=1)[0]
        yield (x_val, y_val)

Next we wrap this in an Iterator that returns a certain number of (x, y) pairs upon each
request for data points.

def data_seq_generator(
    data_generator: Iterator[Tuple[Triple, float]],
    num_pts: int
) -> Iterator[DataSeq]:
    while True:
        pts: DataSeq = list(islice(data_generator, num_pts))
        yield pts

Now let’s write a function to create a LinearFunctionApprox.


def feature_functions():
    return [lambda _: 1., lambda x: x[0], lambda x: x[1], lambda x: x[2]]

def adam_gradient():
    return AdamGradient(
        learning_rate=0.1,
        decay1=0.9,
        decay2=0.999
    )

def get_linear_model() -> LinearFunctionApprox[Triple]:
    ffs = feature_functions()
    ag = adam_gradient()
    return LinearFunctionApprox.create(
        feature_functions=ffs,
        adam_gradient=ag,
        regularization_coeff=0.,
        direct_solve=True
    )

Likewise, let’s write a function to create a DNNApprox with 1 hidden layer with 2 neu-
rons and a little bit of regularization since this deep neural network is somewhat over-
parameterized to fit the data generated from the linear data model with noise.
def get_dnn_model() -> DNNApprox[Triple]:
    ffs = feature_functions()
    ag = adam_gradient()

    def relu(arg: np.ndarray) -> np.ndarray:
        return np.vectorize(lambda x: x if x > 0. else 0.)(arg)

    def relu_deriv(res: np.ndarray) -> np.ndarray:
        return np.vectorize(lambda x: 1. if x > 0. else 0.)(res)

    def identity(arg: np.ndarray) -> np.ndarray:
        return arg

    def identity_deriv(res: np.ndarray) -> np.ndarray:
        return np.ones_like(res)

    ds = DNNSpec(
        neurons=[2],
        bias=True,
        hidden_activation=relu,
        hidden_activation_deriv=relu_deriv,
        output_activation=identity,
        output_activation_deriv=identity_deriv
    )
    return DNNApprox.create(
        feature_functions=ffs,
        dnn_spec=ds,
        adam_gradient=ag,
        regularization_coeff=0.05
    )

Now let’s write some code to do a direct_solve with the LinearFunctionApprox based
on the data from the data model we have set up.

Figure 4.1.: SGD Convergence

training_num_pts: int = 1000
test_num_pts: int = 10000
training_iterations: int = 200
data_gen: Iterator[Tuple[Triple, float]] = example_model_data_generator()
training_data_gen: Iterator[DataSeq] = data_seq_generator(
    data_gen,
    training_num_pts
)
test_data: DataSeq = list(islice(data_gen, test_num_pts))
direct_solve_lfa: LinearFunctionApprox[Triple] = \
    get_linear_model().solve(next(training_data_gen))
direct_solve_rmse: float = direct_solve_lfa.rmse(test_data)

Running the above code, we see that the Root-Mean-Squared-Error (direct_solve_rmse)


is indeed 0.3, matching the standard deviation of the noise in the linear data model (which
is used above to generate the training data as well as the test data).
Now let us perform stochastic gradient descent with instances of LinearFunctionApprox
and DNNApprox and examine the Root-Mean-Squared-Errors on the two function approxi-
mations as a function of number of iterations in the gradient descent.

linear_model_rmse_seq: Sequence[float] = \
    [lfa.rmse(test_data) for lfa in islice(
        get_linear_model().iterate_updates(training_data_gen),
        training_iterations
    )]
dnn_model_rmse_seq: Sequence[float] = \
    [dfa.rmse(test_data) for dfa in islice(
        get_dnn_model().iterate_updates(training_data_gen),
        training_iterations
    )]

The plot of linear_model_rmse_seq and dnn_model_rmse_seq is shown in Figure 4.1.
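In case you'd like to reproduce a chart like Figure 4.1 yourself, here is a minimal plotting sketch (assuming matplotlib is installed) using the two RMSE sequences computed above:

import matplotlib.pyplot as plt

plt.plot(range(training_iterations), linear_model_rmse_seq, label="Linear Model RMSE")
plt.plot(range(training_iterations), dnn_model_rmse_seq, label="DNN Model RMSE")
plt.xlabel("Iterations of Stochastic Gradient Descent")
plt.ylabel("Test Data RMSE")
plt.legend()
plt.show()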

4.4. Tabular as a form of FunctionApprox
Now we consider a simple case where we have a fixed and finite set of x-values X =
{x1 , x2 , . . . , xn }, and any data set of (x, y) pairs made available to us needs to have its
x-values from within this finite set X . The prediction E[y|x] for each x ∈ X needs to be
calculated only from the y-values associated with this x within the data set of (x, y) pairs.
In other words, the y-values in the data associated with other x should not influence the
prediction for x. Since we’d like the prediction for x to be E[y|x], it would make sense for
the prediction for a given x to be some sort of average of all the y-values associated with
x within the data set of (x, y) pairs seen so far. This simple case is referred to as Tabular
because we can store all x ∈ X together with their corresponding predictions E[y|x] in a
finite data structure (loosely referred to as a “table”).
So the calculation of the Tabular prediction of E[y|x] is particularly straightforward. What
is interesting though is the fact that Tabular prediction actually fits the interface of FunctionApprox
in terms of the following three methods that we have emphasized as the essence of FunctionApprox:

• the solve method, that would simply take the average of all the y-values associated
with each x in the given data set, and store those averages in a dictionary data struc-
ture.
• the update method, that would update the current averages in the dictionary data
structure, based on the new data set of (x, y) pairs that is provided.
• the evaluate method, that would simply look up the dictionary data structure for
the y-value averages associated with each x-value provided as input.

This view of Tabular prediction as a special case of FunctionApprox also permits us to cast
the tabular algorithms of Dynamic Programming and Reinforcement Learning as special
cases of the function approximation versions of the algorithms (using the Tabular class we
develop below).
So now let us write the code for @dataclass Tabular as an implementation of the abstract
base class FunctionApprox. The attributes of @dataclass Tabular are:

• values_map which is a dictionary mapping each x-value to the average of the y-values
associated with x that have been seen so far in the data.
• counts_map which is a dictionary mapping each x-value to the count of y-values as-
sociated with x that have been seen so far in the data. We need to track the count of
y-values associated with each x because this enables us to update values_map appro-
priately upon seeing a new y-value associated with a given x.
• count_to_weight_func which defines a function from number of y-values seen so far
(associated with a given x) to the weight assigned to the most recent y. This enables
us to do a weighted average of the y-values seen so far, controlling the emphasis to
be placed on more recent y-values relative to previously seen y-values (associated
with a given x).

The evaluate, objective_gradient, update_with_gradient, solve and within methods


should now be self-explanatory.
from dataclasses import field, replace

@dataclass(frozen=True)
class Tabular(FunctionApprox[X]):

    values_map: Mapping[X, float] = field(default_factory=lambda: {})
    counts_map: Mapping[X, int] = field(default_factory=lambda: {})
    count_to_weight_func: Callable[[int], float] = \
        field(default_factory=lambda: lambda n: 1.0 / n)

    def objective_gradient(
        self,
        xy_vals_seq: Iterable[Tuple[X, float]],
        obj_deriv_out_fun: Callable[[Sequence[X], Sequence[float]], float]
    ) -> Gradient[Tabular[X]]:
        x_vals, y_vals = zip(*xy_vals_seq)
        obj_deriv_out: np.ndarray = obj_deriv_out_fun(x_vals, y_vals)
        sums_map: Dict[X, float] = defaultdict(float)
        counts_map: Dict[X, int] = defaultdict(int)
        for x, o in zip(x_vals, obj_deriv_out):
            sums_map[x] += o
            counts_map[x] += 1
        return Gradient(replace(
            self,
            values_map={x: sums_map[x] / counts_map[x] for x in sums_map},
            counts_map=counts_map
        ))

    def evaluate(self, x_values_seq: Iterable[X]) -> np.ndarray:
        return np.array([self.values_map.get(x, 0.) for x in x_values_seq])

    def update_with_gradient(
        self,
        gradient: Gradient[Tabular[X]]
    ) -> Tabular[X]:
        values_map: Dict[X, float] = dict(self.values_map)
        counts_map: Dict[X, int] = dict(self.counts_map)
        for key in gradient.function_approx.values_map:
            counts_map[key] = counts_map.get(key, 0) + \
                gradient.function_approx.counts_map[key]
            weight: float = self.count_to_weight_func(counts_map[key])
            values_map[key] = values_map.get(key, 0.) - \
                weight * gradient.function_approx.values_map[key]
        return replace(
            self,
            values_map=values_map,
            counts_map=counts_map
        )

    def solve(
        self,
        xy_vals_seq: Iterable[Tuple[X, float]],
        error_tolerance: Optional[float] = None
    ) -> Tabular[X]:
        values_map: Dict[X, float] = {}
        counts_map: Dict[X, int] = {}
        for x, y in xy_vals_seq:
            counts_map[x] = counts_map.get(x, 0) + 1
            weight: float = self.count_to_weight_func(counts_map[x])
            values_map[x] = weight * y + (1 - weight) * values_map.get(x, 0.)
        return replace(
            self,
            values_map=values_map,
            counts_map=counts_map
        )

    def within(self, other: FunctionApprox[X], tolerance: float) -> bool:
        if isinstance(other, Tabular):
            return all(abs(self.values_map[s] - other.values_map.get(s, 0.))
                       <= tolerance for s in self.values_map)
        return False

Here’s a valuable insight: This Tabular setting is actually a special case of linear function
approximation by setting a feature function ϕi (·) for each xi as: ϕi (xi ) = 1 and ϕi (x) = 0

for each x ≠ xi (i.e., ϕi (·) is the indicator function for xi , and the Φ matrix is the identity
matrix), and the corresponding weights wi equal to the average of the y-values associated
with xi in the given data. This also means that count_to_weight_func plays the role of
the learning rate function (as a function of the number of iterations in stochastic gradient
descent).
When we implement Approximate Dynamic Programming (ADP) algorithms with the
abstract base class FunctionApprox (later in this chapter), using the Tabular class (for FunctionApprox)
enables us to specialize the ADP algorithm implementation to the Tabular DP algorithms
(that we covered in Chapter 3). Note that in the tabular DP algorithms, the set of finite
states take the role of X and the Value Function for a given state x = s takes the role of
the “predicted” y-value associated with x. We also note that in the tabular DP algorithms,
in each iteration of sweeping through all the states, the Value Function for a state x = s
is set to the current y value (not the average of all y-values seen so far). The current y-
value is simply the right-hand-side of the Bellman Equation corresponding to the tabular
DP algorithm. Consequently, when using Tabular class for tabular DP, we’d need to set
count_to_weight_func to be the function lambda _: 1 (this is because a weight of 1 for the
current y-value sets values_map[x] equal to the current y-value).
Likewise, when we implement RL algorithms (using the abstract base class FunctionApprox)
later in this book, using the Tabular class (for FunctionApprox) specializes the RL algo-
rithm implementation to Tabular RL. In Tabular RL, we average all the Returns seen so
far for a given state. If we choose to do a plain average (equal importance for all y-
values seen so far, associated with a given x), then in the Tabular class, we’d need to set
count_to_weight_func to be the function lambda n: 1. / n.
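Here is a small usage sketch (with made-up data) illustrating this plain-averaging behavior of Tabular - the prediction for each x is simply the running average of the y-values seen for that x:

tab = Tabular(count_to_weight_func=lambda n: 1. / n)
trained = tab.solve([("a", 1.0), ("a", 3.0), ("b", 10.0)])
print(trained.values_map)            # {'a': 2.0, 'b': 10.0}
print(trained.evaluate(["a", "b"]))  # [ 2. 10.]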
We want to emphasize that although tabular algorithms are just a special case of al-
gorithms with function approximation, we give special coverage in this book to tabular
algorithms because they help us conceptualize the core concepts in a simple (tabular) set-
ting without the distraction of some of the details and complications in the apparatus of
function approximation.
Now we are ready to write algorithms for Approximate Dynamic Programming (ADP).
Before we go there, it pays to emphasize that we have described and implemented a fairly
generic framework for gradient-based estimation of function approximations, given arbi-
trary training data. It can be used for arbitrary objective functions and arbitrary functional
forms/neural networks (beyond the concrete classes we implemented). We encourage
you to explore implementing and using this function approximation code for other types
of objectives and other types of functional forms/neural networks.

4.5. Approximate Policy Evaluation


The first Approximate Dynamic Programming (ADP) algorithm we cover is Approximate
Policy Evaluation, i.e., evaluating the Value Function for a Markov Reward Process (MRP).
Approximate Policy Evaluation is fundamentally the same as Tabular Policy Evaluation
in terms of repeatedly applying the Bellman Policy Operator B π on the Value Function
V : N → R. However, unlike Tabular Policy Evaluation algorithm, here the Value Func-
tion V (·) is set up and updated as an instance of FunctionApprox rather than as a table of
values for the non-terminal states. This is because unlike Tabular Policy Evaluation which
operates on an instance of a FiniteMarkovRewardProcess, Approximate Policy Evaluation
algorithm operates on an instance of MarkovRewardProcess. So we do not have an enumer-
ation of states of the MRP and we do not have the transition probabilities of the MRP.

This is typical in many real-world problems where the state space is either very large or is
continuous-valued, and the transitions could be too many or could be continuous-valued
transitions. So, here’s what we do to overcome these challenges:

• We specify a sampling probability distribution of non-terminal states (argument


non_terminal_states_distribution in the code below) from which we shall sample a
specified number (num_state_samples in the code below) of non-terminal states, and
construct a list of those sampled non-terminal states (nt_states in the code below)
in each iteration. The type of this probability distribution of non-terminal states is
aliased as follows (this type will be used not just for Approximate Dynamic Pro-
gramming algorithms, but also for Reinforcement Learning algorithms):

NTStateDistribution = Distribution[NonTerminal[S]]

• We sample pairs of (next state s′ , reward r) from a given non-terminal state s, and
calculate the expectation E[r + γ · V (s′ )] by averaging r + γ · V (s′ ) across the sam-
pled pairs. Note that the method expectation of a Distribution object performs a
sampled expectation. V (s′ ) is obtained from the function approximation instance
of FunctionApprox that is being updated in each iteration. The type of the function
approximation of the Value Function is aliased as follows (this type will be used not
just for Approximate Dynamic Programming algorithms, but also for Reinforcement
Learning Algorithms).

ValueFunctionApprox = FunctionApprox[NonTerminal[S]]

• The sampled list of non-terminal states s comprise our x-values and the associated
sampled expectations described above comprise our y-values. This list of (x, y) pairs
is used to update the approximation of the Value Function in each iteration (pro-
ducing a new instance of ValueFunctionApprox using its update method).

The entire code is shown below. The evaluate_mrp method produces an Iterator on
ValueFunctionApprox instances, and the code that calls evaluate_mrp can decide when/how
to terminate the iterations of Approximate Policy Evaluation.

from rl.iterate import iterate

def evaluate_mrp(
    mrp: MarkovRewardProcess[S],
    gamma: float,
    approx_0: ValueFunctionApprox[S],
    non_terminal_states_distribution: NTStateDistribution[S],
    num_state_samples: int
) -> Iterator[ValueFunctionApprox[S]]:

    def update(v: ValueFunctionApprox[S]) -> ValueFunctionApprox[S]:
        nt_states: Sequence[NonTerminal[S]] = \
            non_terminal_states_distribution.sample_n(num_state_samples)

        def return_(s_r: Tuple[State[S], float]) -> float:
            s1, r = s_r
            return r + gamma * extended_vf(v, s1)

        return v.update(
            [(s, mrp.transition_reward(s).expectation(return_))
             for s in nt_states]
        )

    return iterate(update, approx_0)

Notice the function extended_vf used to evaluate the Value Function for the next state
transitioned to. However, the next state could be terminal or non-terminal, and the Value
Function is only defined for non-terminal states. extended_vf utilizes the method on_non_terminal
we had written in Chapter 1 when designing the State class - it evaluates to the default
value of 0 for a terminal state (and evaluates the given ValueFunctionApprox for a non-
terminal state).

def extended_vf(vf: ValueFunctionApprox[S], s: State[S]) -> float:
    return s.on_non_terminal(vf, 0.0)

extended_vf will be useful not just for Approximate Dynamic Programming algorithms,
but also for Reinforcement Learning algorithms.
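Here is a hedged usage sketch of evaluate_mrp (si_mrp is assumed to be a MarkovRewardProcess instance constructed in an earlier chapter, e.g., a FiniteMarkovRewardProcess, and Choose is assumed to be the uniform-sampling distribution from rl.distribution): we use a Tabular approximation with count_to_weight_func set to lambda _: 1. (which, as discussed earlier, specializes the update to the Tabular Policy Evaluation update) and take a fixed number of iterations.

from itertools import islice
from rl.distribution import Choose

approx_vf_iter = evaluate_mrp(
    mrp=si_mrp,                                              # assumed MRP instance
    gamma=0.9,                                               # illustrative discount factor
    approx_0=Tabular(count_to_weight_func=lambda _: 1.),
    non_terminal_states_distribution=Choose(si_mrp.non_terminal_states),
    num_state_samples=100
)
final_vf = list(islice(approx_vf_iter, 100))[-1]             # Value Function after 100 iterations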

4.6. Approximate Value Iteration


Now that we’ve understood and coded Approximate Policy Evaluation (to solve the Pre-
diction problem), we can extend the same concepts to Approximate Value Iteration (to
solve the Control problem). The code below in value_iteration is almost the same as the
code above in evaluate_mrp, except that instead of a MarkovRewardProcess, here we have a
MarkovDecisionProcess, and instead of the Bellman Policy Operator update, here we have
the Bellman Optimality Operator update. Therefore, in the Value Function update, we
maximize the Q-value function (over all actions a) for each non-terminal state s. Also,
similar to evaluate_mrp, value_iteration produces an Iterator on ValueFunctionApprox
instances, and the code that calls value_iteration can decide when/how to terminate the
iterations of Approximate Value Iteration.

from rl.iterate import iterate

def value_iteration(
    mdp: MarkovDecisionProcess[S, A],
    gamma: float,
    approx_0: ValueFunctionApprox[S],
    non_terminal_states_distribution: NTStateDistribution[S],
    num_state_samples: int
) -> Iterator[ValueFunctionApprox[S]]:

    def update(v: ValueFunctionApprox[S]) -> ValueFunctionApprox[S]:
        nt_states: Sequence[NonTerminal[S]] = \
            non_terminal_states_distribution.sample_n(num_state_samples)

        def return_(s_r: Tuple[State[S], float]) -> float:
            s1, r = s_r
            return r + gamma * extended_vf(v, s1)

        return v.update(
            [(s, max(mdp.step(s, a).expectation(return_)
                     for a in mdp.actions(s)))
             for s in nt_states]
        )

    return iterate(update, approx_0)

4.7. Finite-Horizon Approximate Policy Evaluation


Next, we move on to Approximate Policy Evaluation in a finite-horizon setting, meaning
we perform Approximate Policy Evaluation with a backward induction algorithm, much
like how we did backward induction for finite-horizon Tabular Policy Evaluation. We will

of course make the same types of adaptations from Tabular to Approximate as we did in
the functions evaluate_mrp and value_iteration above.
In the backward_evaluate code below, the input argument mrp_f0_mu_triples is a list of
triples, with each triple corresponding to each non-terminal time step in the finite horizon.
Each triple consists of:

• An instance of MarkovRewardProcess - note that each time step has its own instance
of MarkovRewardProcess representation of transitions from non-terminal states s in a
time step t to the (state s′ , reward r) pairs in the next time step t + 1 (variable mrp in
the code below).
• An instance of ValueFunctionApprox to capture the approximate Value Function for
the time step (variable approx0 in the code below represents the initial ValueFunctionApprox
instances).
• A sampling probability distribution of non-terminal states in the time step (variable
mu in the code below).

The backward induction code below should be pretty self-explanatory. Note that in
backward induction, we don’t invoke the update method of FunctionApprox like we did in
the non-finite-horizon cases - here we invoke the solve method which internally performs
a series of updates on the FunctionApprox for a given time step (until we converge to within
a specified level of error_tolerance). In the non-finite-horizon cases, it was okay to sim-
ply do a single update in each iteration because we revisit the same set of states in further
iterations. Here, once we converge to an acceptable ValueFunctionApprox (using solve)
for a specific time step, we won’t be performing any more updates to the Value Function
for that time step (since we move on to the next time step, in reverse). backward_evaluate
returns an Iterator over ValueFunctionApprox objects, from time step 0 to the horizon time
step. We should point out that in the code below, we’ve taken special care to handle ter-
minal states (that occur either at the end of the horizon or can even occur before the end
of the horizon) - this is done using the extended_vf function we’d written earlier.

MRP_FuncApprox_Distribution = Tuple[MarkovRewardProcess[S],
                                    ValueFunctionApprox[S],
                                    NTStateDistribution[S]]

def backward_evaluate(
    mrp_f0_mu_triples: Sequence[MRP_FuncApprox_Distribution[S]],
    gamma: float,
    num_state_samples: int,
    error_tolerance: float
) -> Iterator[ValueFunctionApprox[S]]:
    v: List[ValueFunctionApprox[S]] = []

    for i, (mrp, approx0, mu) in enumerate(reversed(mrp_f0_mu_triples)):

        def return_(s_r: Tuple[State[S], float], i=i) -> float:
            s1, r = s_r
            return r + gamma * (extended_vf(v[i-1], s1) if i > 0 else 0.)

        v.append(
            approx0.solve(
                [(s, mrp.transition_reward(s).expectation(return_))
                 for s in mu.sample_n(num_state_samples)],
                error_tolerance
            )
        )

    return reversed(v)

4.8. Finite-Horizon Approximate Value Iteration
Now that we’ve understood and coded finite-horizon Approximate Policy Evaluation (to
solve the finite-horizon Prediction problem), we can extend the same concepts to finite-
horizon Approximate Value Iteration (to solve the finite-horizon Control problem). The
code below in back_opt_vf_and_policy is almost the same as the code above in backward_evaluate,
except that instead of MarkovRewardProcess, here we have MarkovDecisionProcess. For each
non-terminal time step, we maximize the Q-Value function (over all actions a) for each
non-terminal state s. back_opt_vf_and_policy returns an Iterator over pairs of ValueFunctionApprox
and DeterministicPolicy objects (representing the Optimal Value Function and the Opti-
mal Policy respectively), from time step 0 to the horizon time step.
from rl.distribution import Constant
from operator import itemgetter

MDP_FuncApproxV_Distribution = Tuple[
    MarkovDecisionProcess[S, A],
    ValueFunctionApprox[S],
    NTStateDistribution[S]
]

def back_opt_vf_and_policy(
    mdp_f0_mu_triples: Sequence[MDP_FuncApproxV_Distribution[S, A]],
    gamma: float,
    num_state_samples: int,
    error_tolerance: float
) -> Iterator[Tuple[ValueFunctionApprox[S], DeterministicPolicy[S, A]]]:
    vp: List[Tuple[ValueFunctionApprox[S], DeterministicPolicy[S, A]]] = []

    for i, (mdp, approx0, mu) in enumerate(reversed(mdp_f0_mu_triples)):

        def return_(s_r: Tuple[State[S], float], i=i) -> float:
            s1, r = s_r
            return r + gamma * (extended_vf(vp[i-1][0], s1) if i > 0 else 0.)

        this_v = approx0.solve(
            [(s, max(mdp.step(s, a).expectation(return_)
                     for a in mdp.actions(s)))
             for s in mu.sample_n(num_state_samples)],
            error_tolerance
        )

        def deter_policy(state: S) -> A:
            return max(
                ((mdp.step(NonTerminal(state), a).expectation(return_), a)
                 for a in mdp.actions(NonTerminal(state))),
                key=itemgetter(0)
            )[1]

        vp.append((this_v, DeterministicPolicy(deter_policy)))

    return reversed(vp)

4.9. Finite-Horizon Approximate Q-Value Iteration


The above code for Finite-Horizon Approximate Value Iteration extends the Finite-Horizon
Backward Induction Value Iteration algorithm of Chapter 3 by treating the Value Function
as a function approximation instead of an exact tabular representation. However, there is
an alternative (and arguably simpler and more effective) way to solve the Finite-Horizon
Control problem - we can perform backward induction on the optimal Action-Value (Q-
value) Function instead of the optimal (State-)Value Function. In general, the key advan-
tage of working with the optimal Action Value function is that it has all the information

necessary to extract the optimal State-Value function and the optimal Policy (since we just
need to perform a max / arg max over all the actions for any non-terminal state). This con-
trasts with the case of working with the optimal State-Value function which requires us
to also avail of the transition probabilities, rewards and discount factor in order to extract
the optimal policy. We shall see later that Reinforcement Learning algorithms for Control
work with Action-Value (Q-Value) Functions for this very reason.
Performing backward induction on the optimal Q-value function means that knowledge
of the optimal Q-value function for a given time step t immediately gives us the optimal
State-Value function and the optimal policy for the same time step t. This contrasts with
performing backward induction on the optimal State-Value function - knowledge of the
optimal State-Value function for a given time step t cannot give us the optimal policy for
the same time step t (for that, we need the optimal State-Value function for time step t + 1
and furthermore, we also need the t to t + 1 state/reward transition probabilities).
So now we develop an algorithm that works with a function approximation for the Q-
Value function and steps back in time similar to the backward induction we had performed
earlier for the (State-)Value function. Just like we defined an alias type ValueFunctionApprox
for the State-Value function, we define an alias type QValueFunctionApprox for the Action-
Value function, as follows:

QValueFunctionApprox = FunctionApprox[Tuple[NonTerminal[S], A]]

The code below in back_opt_qvf is quite similar to back_opt_vf_and_policy above. The


key difference is that we have QValueFunctionApprox in the input to the function rather
than ValueFunctionApprox to reflect the fact that we are approximating Q∗t : Nt × At → R
for all time steps t in the finite horizon. For each non-terminal time step, we express the
Q-value function (for a set of sample non-terminal states s and for all actions a) in terms
of the Q-value function approximation of the next time step. This is essentially the MDP
Action-Value Function Bellman Optimality Equation for the finite-horizon case (adapted
to function approximation). back_opt_qvf returns an Iterator over QValueFunctionApprox
(representing the Optimal Q-Value Function), from time step 0 to the horizon time step.
We can then obtain Vt∗ (Optimal State-Value Function) and πt∗ for each t by simply per-
forming a max / arg max over all actions a ∈ At of Q∗t (s, a) for any s ∈ Nt .

MDP_FuncApproxQ_Distribution = Tuple[
    MarkovDecisionProcess[S, A],
    QValueFunctionApprox[S, A],
    NTStateDistribution[S]
]

def back_opt_qvf(
    mdp_f0_mu_triples: Sequence[MDP_FuncApproxQ_Distribution[S, A]],
    gamma: float,
    num_state_samples: int,
    error_tolerance: float
) -> Iterator[QValueFunctionApprox[S, A]]:
    horizon: int = len(mdp_f0_mu_triples)
    qvf: List[QValueFunctionApprox[S, A]] = []

    for i, (mdp, approx0, mu) in enumerate(reversed(mdp_f0_mu_triples)):

        def return_(s_r: Tuple[State[S], float], i=i) -> float:
            s1, r = s_r
            next_return: float = max(
                qvf[i-1]((s1, a)) for a in
                mdp_f0_mu_triples[horizon - i][0].actions(s1)
            ) if i > 0 and isinstance(s1, NonTerminal) else 0.
            return r + gamma * next_return

        this_qvf = approx0.solve(
            [((s, a), mdp.step(s, a).expectation(return_))
             for s in mu.sample_n(num_state_samples) for a in mdp.actions(s)],
            error_tolerance
        )
        qvf.append(this_qvf)

    return reversed(qvf)
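As a concrete (hedged) sketch of this max / arg max extraction: given one of the QValueFunctionApprox objects returned by back_opt_qvf (call it qvf_t) and the MarkovDecisionProcess for the same time step (call it mdp_t), the Optimal State-Value and a greedy optimal action for a non-terminal state s can be obtained as follows.

from operator import itemgetter

def v_star_and_greedy_action(
    qvf_t: QValueFunctionApprox[S, A],
    mdp_t: MarkovDecisionProcess[S, A],
    s: NonTerminal[S]
) -> Tuple[float, A]:
    # max / arg max of Q*_t(s, a) over the actions available at state s
    return max(
        ((qvf_t((s, a)), a) for a in mdp_t.actions(s)),
        key=itemgetter(0)
    )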

We should also point out here that working with the optimal Q-value function (rather
than the optimal State-Value function) in the context of ADP prepares us nicely for RL
because RL algorithms typically work with the optimal Q-value function instead of the
optimal State-Value function.
All of the above code for Approximate Dynamic Programming (ADP) algorithms is in
the file rl/approximate_dynamic_programming.py. We encourage you to create instances
of MarkovRewardProcess and MarkovDecisionProcess (including finite-horizon instances)
and play with the above ADP code with different choices of function approximations,
non-terminal state sampling distributions, and number of samples. A simple but valuable
exercise is to reproduce the tabular versions of these algorithms by using the Tabular im-
plementation of FunctionApprox (note: the count_to_weight_func would then need to be
lambda _: 1.) in the above ADP functions.

4.10. How to Construct the Non-Terminal States Distribution


Each of the above ADP algorithms takes as input probability distribution(s) of non-terminal
states. You may be wondering how one constructs the probability distribution of non-
terminal states so you can feed it as input to any of these ADP algorithms. There is no
simple, crisp answer to this. But we will provide some general pointers in this section on
how to construct the probability distribution of non-terminal states.
Let us start with Approximate Policy Evaluation and Approximate Value Iteration al-
gorithms. They require as input the probability distribution of non-terminal states. For
Approximate Value Iteration algorithm, a natural choice would be to evaluate the Markov
Decision Process (MDP) with a uniform policy (equal probability for each action, from
any state) to construct the implied Markov Reward Process (MRP), and then infer the
stationary distribution of its Markov Process, using some special property of the Markov
Process (for instance, if it’s a finite-states Markov Process, we might be able to perform
the matrix calculations we covered in Chapter 1 to calculate the stationary distribution).
The stationary distribution would serve as the probability distribution of non-terminal
states to be used by the Approximate Value Iteration algorithm. For Approximate Policy
Evaluation algorithm, we do the same stationary distribution calculation with the given
MRP. If we cannot take advantage of any special properties of the given MDP/MRP, then
we can run a simulation with the simulate method in MarkovRewardProcess (inherited from
MarkovProcess) and create a SampledDistribution of non-terminal states based on the non-
terminal states reached by the sampling traces after a sufficiently large (but fixed) number
of time steps (this is essentially an estimate of the stationary distribution). If the above
choices are infeasible or computationally expensive, then a simple and neutral choice is to
use a uniform distribution over the states.
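Here is a hedged sketch (not from the book's codebase) of the simulation-based choice described above - estimating the distribution of non-terminal states by running simulation traces for a fixed number of time steps and recording the state reached at the end of each trace:

import itertools
from rl.distribution import SampledDistribution

def sampled_nt_states_distribution(
    mrp: MarkovRewardProcess[S],
    start_states: NTStateDistribution[S],
    num_steps: int
) -> NTStateDistribution[S]:
    def sample_final_state() -> NonTerminal[S]:
        last: NonTerminal[S] = start_states.sample()
        for st in itertools.islice(mrp.simulate(start_states), num_steps):
            if isinstance(st, NonTerminal):
                last = st
            else:
                break    # the trace reached a terminal state
        return last
    return SampledDistribution(sample_final_state)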
Next, we consider the backward induction ADP algorithms for finite-horizon MDPs/MRPs.
Our job here is to infer the distribution of non-terminal states for each time step in the finite

horizon. Sometimes you can take advantage of the mathematical structure of the under-
lying Markov Process to come up with an analytical expression (exact or approximate)
for the probability distribution of non-terminal states at any time step for the underly-
ing Markov Process of the MRP/implied-MRP. For instance, if the Markov Process is de-
scribed by a stochastic differential equation (SDE) and if we are able to solve the SDE,
we would know the analytical expression for the probability distribution of non-terminal
states. If we cannot take advantage of any such special properties, then we can generate
sampling traces by time-incrementally sampling from the state-transition probability dis-
tributions of each of the Markov Reward Processes at each time step (if we are solving
a Control problem, then we create implied-MRPs by evaluating the given MDPs with a
uniform policy). The states reached by these sampling traces at any fixed time step pro-
vide a SampledDistribution of non-terminal states for that time step. If the above choices
are infeasible or computationally expensive, then a simple and neutral choice is to use a
uniform distribution over the non-terminal states for each time step.
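Here is a hedged sketch of the time-incremental sampling approach described above (assuming the transition_reward method and SampledDistribution from earlier chapters): the distribution for time step t samples a start state and then samples one transition from each of the time-step MRPs 0, 1, . . . , t − 1.

from typing import Callable, Sequence
from rl.distribution import SampledDistribution

def per_time_step_nt_states_distributions(
    mrps: Sequence[MarkovRewardProcess[S]],     # one MRP per time step
    start_states: NTStateDistribution[S]
) -> Sequence[NTStateDistribution[S]]:
    def make_sampler(t: int) -> Callable[[], NonTerminal[S]]:
        def sample() -> NonTerminal[S]:
            s: NonTerminal[S] = start_states.sample()
            for step in range(t):
                next_s, _ = mrps[step].transition_reward(s).sample()
                if not isinstance(next_s, NonTerminal):
                    return s     # simplification: stop at the last non-terminal state
                s = next_s
            return s
        return sample
    return [SampledDistribution(make_sampler(t)) for t in range(len(mrps))]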
We will write some code in Chapter 6 to create a SampledDistribution of non-terminal
states for each time step of a finite-horizon problem by stitching together samples of state
transitions at each time step. If you are curious about this now, you can take a peek at the
code in rl/chapter7/asset_alloc_discrete.py.

4.11. Key Takeaways from this Chapter


• The Function Approximation interface involves two key methods - A) updating the
parameters of the Function Approximation based on training data available from
each iteration of a data stream, and B) evaluating the expectation of the response
variable whose conditional probability distribution is modeled by the Function Ap-
proximation. Linear Function Approximation and Deep Neural Network Function
Approximation are the two main Function Approximations we’ve implemented and
will be using in the rest of the book.
• Tabular satisfies the interface of Function Approximation, and can be viewed as a
special case of linear function approximation with feature functions set to be indica-
tor functions for each of the x values.
• All the Tabular DP algorithms can be generalized to ADP algorithms replacing tab-
ular Value Function updates with updates to Function Approximation parameters
(where the Function Approximation represents the Value Function). Sweep over all
states in the tabular case is replaced by sampling states in the ADP case. Expectation
calculations in Bellman Operators are handled in ADP as averages of the correspond-
ing calculations over transition samples (versus calculations using explicit transition
probabilities in the tabular algorithms).

Part II.

Modeling Financial Applications

5. Utility Theory

5.1. Introduction to the Concept of Utility


This chapter marks the beginning of Module II, where we cover a set of financial appli-
cations that can be solved with Dynamic Programming or Reinforcement Learning Algo-
rithms. A fundamental feature of many financial applications cast as Stochastic Control
problems is that the Rewards of the modeled MDP are Utility functions in order to capture
the tradeoff between financial returns and risk. So this chapter is dedicated to the topic
of Financial Utility. We begin with developing an understanding of what Utility means
from a broad Economic perspective, then zoom into the concept of Utility from a finan-
cial/monetary perspective, and finally show how Utility functions can be designed to cap-
ture individual preferences of “risk-taking-inclination” when it comes to specific financial
applications.
Utility Theory is a vast and important topic in Economics and we won’t cover it in detail
in this book - rather, we will focus on the aspects of Utility Theory that are relevant for
the Financial Applications we cover in this book. But it pays to have some familiarity with
the general concept of Utility in Economics. The term Utility (in Economics) refers to the
abstract concept of an individual’s preferences over choices of products or services or ac-
tivities (or more generally, over choices of certain abstract entities analyzed in Economics).
Let’s say you are offered 3 options to spend your Saturday afternoon: A) lie down on your
couch and listen to music, or B) baby-sit your neighbor’s kid and earn some money, or
C) play a game of tennis with your friend. We really don’t know how to compare these
3 options in a formal/analytical manner. But we tend to be fairly decisive (instinctively)
in picking among disparate options of this type. Utility Theory aims to formalize mak-
ing choices by assigning a real number to each presented choice, and then picking the
choice with the highest assigned number. The assigned real number for each choice rep-
resents the “value”/“worth” of the choice, noting that the “value”/“worth” is often an
implicit/instinctive value that needs to be made explicit. In our example, the numerical
value for each choice is not something concrete or precise like number of dollars earned
on a choice or the amount of time spent on a choice - rather it is a more abstract notion
of an individual’s “happiness” or “satisfaction” associated with a choice. In this exam-
ple, you might say you prefer option A) because you feel lazy today (so, no tennis) and
you care more about enjoying some soothing music after a long work-week than earning
a few extra bucks through baby-sitting. Thus, you are comparing different attributes like
money, relaxation and pleasure. This can get more complicated if your friend is offered
these options, and say your friend chooses option C). If you see your friend’s choice, you
might then instead choose option C) because you perceive the “collective value” (for you
and your friend together) to be highest if you both choose option C).
We won’t go any further on this topic of abstract Utility, but we hope the above example
provides the basic intuition for the broad notion of Utility in Economics as preferences
over choices by assigning a numerical value for each choice. For a deeper study of Utility
Theory, we refer you to The Handbook of Utility Theory (Barberà, Seidl, and Hammond
1998). In this book, we focus on the narrower notion of Utility of Money because money is

what we care about when it comes to financial applications. However, Utility of Money
is not so straightforward because different people respond to different levels of money in
different ways. Moreover, in many financial applications, Utility functions help us deter-
mine the tradeoff between financial return and risk, and this involves (challenging) as-
sessments of the likelihood of various outcomes. The next section develops the intuition
on these concepts.

5.2. A Simple Financial Example


To warm up to the concepts associated with Financial Utility Theory, let’s start with a
simple financial example. Consider a casino game where your financial gain/loss is based
on the outcome of tossing a fair coin (HEAD or TAIL outcomes). Let’s say you will be
paid $1000 if the coin shows HEAD on the toss, and let’s say you would be required to pay
$500 if the coin shows TAIL on the toss. Now the question is: How much would you be
willing to pay upfront to play this game? Your first instinct might be to say: “I’d pay $250
upfront to play this game because that’s my expected payoff, based on the probability of
the outcomes” (250 = 0.5(1000) + 0.5(−500)). But after you think about it carefully, you
might alter your answer to be: “I’d pay a little less than $250.” When pressed for why the
fair upfront cost for playing the game should be less than $250, you might say: “I need to
be compensated for taking the risk.”
What does the word “risk” mean? It refers to the degree of variation in the outcomes
($1000 versus -$500). But then why would you say you need to be compensated for being
exposed to this variation in outcomes? If -$500 makes you unhappy, $1000 should make
you happy, and so, shouldn’t we average out the happiness to the tune of $250? Well, not
quite. Let’s understand the word “happiness” (or call it “satisfaction”) - this is the notion
of utility of outcomes. Let’s say you did pay $250 upfront to play the game. Then the coin
toss outcome of HEAD is a net gain of $1000 - $250 = $750 and the coin toss outcome of
TAIL is a net gain of -$500 - $250 = -$750 (i.e., net loss of $750). Now let’s say the HEAD
outcome gain of $750 gives you “happiness” of say 100 units. If the TAIL outcome loss of
$750 gives you “unhappiness” of 100 units, then “happiness” and “unhappiness” levels
cancel out, and in that case, it would be fair to pay $250 upfront to play the game. But it
turns out that for most people, the “happiness”/“satisfaction” levels are asymmetric. If
the “happiness” for $750 gain is 100 units, then the “unhappiness” for $750 loss is typically
more than 100 units (let’s say for you it’s 120 units). This means you will pay an upfront
amount X (less than $250) such that the difference in Utilities of $1000 and X is exactly
the difference in the Utilities of X and -$500. Let’s say this X amounts to $180. The gap
of $70 ($250 - $180) represents your compensation for taking the risk, and it really comes
down to the asymmetry in your assignment of utility to the outcomes.
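To make this calculation concrete, here is a small illustrative sketch (the utility function U(x) = 1 − e^(−x/a) and the parameter a are purely made-up choices for illustration): the fair upfront amount X solves U(X) = 0.5 · U(1000) + 0.5 · U(−500), which is exactly the “equal difference in utilities” condition described above, and it comes out below the expected payoff of $250.

import numpy as np

a = 3000.                                      # assumed degree of risk-aversion
u = lambda x: 1 - np.exp(-x / a)               # assumed concave utility of outcome x
u_inv = lambda y: -a * np.log(1 - y)           # inverse of the utility function

expected_utility = 0.5 * u(1000.) + 0.5 * u(-500.)
x_fair = u_inv(expected_utility)               # fair upfront amount (below 250)
risk_premium = 250. - x_fair                   # compensation for taking the risk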
Note that the degree of asymmetry of utility (“happiness” versus “unhappiness” for
equal gains versus losses) is fairly individualized. Your utility assignment to outcomes
might be different from your friend’s. Your friend might be more asymmetric in assessing
utility of the two outcomes and might assign 100 units of “happiness” for the gain outcome
and 150 units of “unhappiness” for the loss outcome. So then your friend would pay an
upfront amount X lower than the amount of $180 you paid upfront to play this game.
Let’s say the X for your friend works out to $100, so his compensation for taking the risk is
$250 - $100 = $150, significantly more than your $70 of compensation for taking the same
risk.
Thus we see that each individual’s asymmetry in utility assignment to different out-

comes results in this psychology of “I need to be compensated for taking this risk.” We
refer to this individualized demand of “compensation for risk” as the attitude of Risk-
Aversion. It means that individuals have differing degrees of discomfort with taking risk,
and they want to be compensated commensurately for taking risk. The amount of compen-
sation they seek is called Risk-Premium. The more Risk-Averse an individual is, the more
Risk-Premium the individual seeks. In the example above, your friend was more risk-
averse than you. Your risk-premium was $70 and your friend’s risk-premium was $150.
But the most important concept that you are learning here is that the root-cause of Risk-
Aversion is the asymmetry in the assignment of utility to outcomes of opposite sign and
same magnitude. We have introduced this notion of “asymmetry of utility” in a simple,
intuitive manner with this example, but we will soon embark on developing the formal
theory for this notion, and introduce a simple and elegant mathematical framework for
Utility Functions, Risk-Aversion and Risk-Premium.
A quick note before we get into the mathematical framework - you might be thinking
that a typical casino would actually charge you a bit more than $250 upfront for playing
the above game (because the casino needs to make a profit, on an expected basis), and
people are indeed willing to pay this amount at a typical casino. So what about the risk-
aversion we talked about earlier? The crucial point here is that people who play at casinos
are looking for entertainment and excitement emanating purely from the psychological
aspects of experiencing risk. They are willing to pay money for this entertainment and
excitement, and this payment is separate from the cost of pure financial utility that we
described above. So if people knew the true odds of pure-chance games of the type we
described above and if people did not care for entertainment and excitement value of risk-
taking in these games, focusing purely on financial utility, then what they’d be willing to
pay upfront to play such a game will be based on the type of calculations we outlined
above (meaning for the example we described, they’d typically pay less than $250 upfront
to play the game).

5.3. The Shape of the Utility function


We seek a “valuation formula” for the amount we’d pay upfront to sign-up for situations
like the simple example above, where we have uncertain outcomes with varying payoffs
for the outcomes. Intuitively, we see that the amount we’d pay:

• Increases as the Mean of the outcome increases


• Decreases as the Variance of the outcome (i.e., Risk) increases
• Decreases as our Personal Risk-Aversion increases

The last two properties above enable us to establish the Risk-Premium. Now let us un-
derstand the nature of Utility as a function of financial outcomes. The key is to note that
Utility is a non-linear function of financial outcomes. We call this non-linear function
the Utility function - it represents the “happiness”/“satisfaction” as a function of money.
You should think of the concept of Utility in terms of Utility of Consumption of money, i.e.,
what exactly do the financial gains fetch you in your life or business. This is the idea of
“value” (utility) derived from consuming the financial gains (or the negative utility of
requisite financial recovery from monetary losses). So now let us look at another simple
example to illustrate the concept of Utility of Consumption, this time not of consumption
of money, but of consumption of cookies (to make the concept vivid and intuitive). Fig-
ure 5.1 shows two curves - we refer to the lower curve as the marginal satisfaction (utility)

Figure 5.1.: Utility Curve

curve and the upper curve as the accumulated satisfaction (utility) curve. Marginal Util-
ity refers to the incremental satisfaction we gain from an additional unit of consumption
and Accumulated Utility refers to the aggregate satisfaction obtained from a certain number
of units of consumption (in continuous-space, you can think of accumulated utility func-
tion as the integral, over consumption, of marginal utility function). In this example, we
are consuming (i.e., eating) cookies. The marginal satisfaction curve tells us that the first
cookie we eat provides us with 100 units of satisfaction (i.e., utility). The second cookie
provides us 80 units of satisfaction, which is intuitive because you are not as hungry after
eating the first cookie compared to before eating the first cookie. Also, the emotions of bit-
ing into the first cookie are extra positive because of the novelty of the experience. When
you get to your 5th cookie, although you are still enjoying the cookie, you don’t enjoy it as
nearly as much as the first couple of cookies. The marginal satisfaction curve shows this -
the 5th cookie provides us 30 units of satisfaction, and the 10th cookie provides us only 10
units of satisfaction. If we’d keep going, we might even find that the marginal satisfaction
turns negative (as in, one might feel too full or maybe even feel like throwing up).
So, we see that the marginal utility function is a decreasing function. Hence, accumu-
lated utility function is a concave function. The accumulated utility function is the Utility
of Consumption function (call it U ) that we’ve been discussing so far. Let us denote the
number of cookies eaten as x, and so the total “satisfaction” (utility) after eating x cookies
is referred to as U (x). In our financial examples, x would be the amount of money one has
at one’s disposal, and is typically an uncertain outcome, i.e., x is a random variable with
an associated probability distribution. The extent of asymmetry in utility assignments for
gains versus losses that we saw earlier manifests as extent of concavity of the U (·) function
(which as we’ve discussed earlier, determines the extent of Risk-Aversion).
Now let’s examine the concave nature of the Utility function for financial outcomes with
another illustrative example. Let’s say you have to pick between two situations:

• In Situation 1, you have a 10% probability of winning a million dollars (and 90%
probability of winning 0).


• In Situation 2, you have a 0.1% probability of winning a billion dollars (and 99.9%
probability of winning 0).

The expected winning in Situation 1 is $100,000 and the expected winning in Situation
2 is $1,000,000 (i.e., 10 times more than Situation 1). If you analyzed this naively as win-
ning expectation maximization, you’d choose Situation 2. But most people would choose
Situation 1. The reason for this is that the Utility of a billion dollars is nowhere close to
1000 times the utility of a million dollars (except for some very wealthy people perhaps).
In fact, the ratio of Utility of a billion dollars to Utility of a million dollars might be more
like 10. So, the choice of Situation 1 over Situation 2 is usually quite clear - it’s about Utility
expectation maximization. So if the Utility of 0 dollars is 0 units, the Utility of a million
dollars is say 1000 units, and the Utility of a billion dollars is say 10000 units (i.e., 10 times
that of a million dollars), then we see that the Utility of financial gains is a fairly concave
function.
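
Using the illustrative utility assignments above (0 units for 0 dollars, 1000 units for a million dollars, 10000 units for a billion dollars), a few lines of Python make the comparison concrete (this sketch is illustrative and not from the book's code repository):

# Compare expectation of winnings versus expectation of utility for the two
# situations, using the illustrative utility assignments stated in the text.
utility = {0: 0.0, 1_000_000: 1000.0, 1_000_000_000: 10000.0}

situation_1 = [(0.9, 0), (0.1, 1_000_000)]
situation_2 = [(0.999, 0), (0.001, 1_000_000_000)]

def expected_winnings(outcomes):
    return sum(p * x for p, x in outcomes)

def expected_utility(outcomes):
    return sum(p * utility[x] for p, x in outcomes)

print(expected_winnings(situation_1), expected_winnings(situation_2))  # 100000.0, 1000000.0
print(expected_utility(situation_1), expected_utility(situation_2))    # 100.0, 10.0

Situation 2 wins on expected winnings, but Situation 1 wins on expected utility, which is why most people choose Situation 1.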

5.4. Calculating the Risk-Premium


Note that the concave nature of the U (·) function implies that:

E[U (x)] < U (E[x])

We define Certainty-Equivalent Value xCE as:

xCE = U −1 (E[U (x)])

Certainty-Equivalent Value represents the certain amount we’d pay to consume an un-
certain outcome. This is the amount of $180 you were willing to pay to play the casino
game of the previous section.

Figure 5.2.: Certainty-Equivalent Value
Figure 5.2 illustrates this concept of Certainty-Equivalent Value in graphical terms. Next,
we define Risk-Premium in two different conventions:

• Absolute Risk-Premium πA :

$$\pi_A = \mathbb{E}[x] - x_{CE}$$

• Relative Risk-Premium πR :

$$\pi_R = \frac{\pi_A}{\mathbb{E}[x]} = \frac{\mathbb{E}[x] - x_{CE}}{\mathbb{E}[x]} = 1 - \frac{x_{CE}}{\mathbb{E}[x]}$$

Now we develop mathematical formalism to derive formulas for Risk-Premia πA and πR
in terms of the extent of Risk-Aversion and the extent of Risk itself. To lighten notation,
we refer to E[x] as x̄ and the Variance of x as σx². We write U (x) as the Taylor series expansion
around x̄ and ignore terms beyond quadratic in the expansion, as follows:

$$U(x) \approx U(\bar{x}) + U'(\bar{x}) \cdot (x - \bar{x}) + \frac{1}{2} \cdot U''(\bar{x}) \cdot (x - \bar{x})^2$$

Taking the expectation of U (x) in the above formula, we get:

$$\mathbb{E}[U(x)] \approx U(\bar{x}) + \frac{1}{2} \cdot U''(\bar{x}) \cdot \sigma_x^2$$
Next, we write the Taylor-series expansion for U (xCE ) around x̄ and ignore terms beyond
linear in the expansion, as follows:

$$U(x_{CE}) \approx U(\bar{x}) + U'(\bar{x}) \cdot (x_{CE} - \bar{x})$$

Since E[U (x)] = U (xCE ) (by definition of xCE ), the above two expressions are approximately
the same. Hence,

$$U'(\bar{x}) \cdot (x_{CE} - \bar{x}) \approx \frac{1}{2} \cdot U''(\bar{x}) \cdot \sigma_x^2 \quad (5.1)$$

From Equation (5.1), the Absolute Risk-Premium is:

$$\pi_A = \bar{x} - x_{CE} \approx -\frac{1}{2} \cdot \frac{U''(\bar{x})}{U'(\bar{x})} \cdot \sigma_x^2$$

We refer to the function:

$$A(x) = -\frac{U''(x)}{U'(x)}$$

as the Absolute Risk-Aversion function. Therefore,

$$\pi_A \approx \frac{1}{2} \cdot A(\bar{x}) \cdot \sigma_x^2$$
In multiplicative uncertainty settings, we focus on the variance of x/x̄ (which we denote
as σ²_{x/x̄}). So in multiplicative settings, we focus on the Relative Risk-Premium:

$$\pi_R = \frac{\pi_A}{\bar{x}} \approx -\frac{1}{2} \cdot \frac{U''(\bar{x}) \cdot \bar{x}}{U'(\bar{x})} \cdot \frac{\sigma_x^2}{\bar{x}^2} = -\frac{1}{2} \cdot \frac{U''(\bar{x}) \cdot \bar{x}}{U'(\bar{x})} \cdot \sigma_{x/\bar{x}}^2$$

We refer to the function

$$R(x) = -\frac{U''(x) \cdot x}{U'(x)}$$

as the Relative Risk-Aversion function. Therefore,

$$\pi_R \approx \frac{1}{2} \cdot R(\bar{x}) \cdot \sigma_{x/\bar{x}}^2$$

Now let’s take stock of what we’ve learning here. We’ve shown that Risk-Premium is
proportional to the product of:

• Extent of Risk-Aversion: either A(x̄) or R(x̄)


• Extent of uncertainty of outcome (i.e., Risk): either σx2 or σ 2x

We’ve expressed the extent of Risk-Aversion to be proportional to the negative ratio of:

• Concavity of the Utility function (at x̄): −U ′′ (x̄)


• Slope of the Utility function (at x̄): U ′ (x̄)

So for typical optimization problems in financial applications, we maximize E[U (x)]


(not E[x]), which in turn amounts to maximization of xCE = E[x] − πA . If we refer to E[x]
as our “Expected Return on Investment” (or simply “Return” for short) and πA as the
“risk-adjustment” due to risk-aversion and uncertainty of outcomes, then xCE can be con-
ceptualized as “risk-adjusted-return.” Thus, in financial applications, we seek to maximize
risk-adjusted-return xCE rather than just the return E[x]. It pays to emphasize here that
the idea of maximizing risk-adjusted-return is essentially the idea of maximizing expected
utility, and that the utility function is a representation of an individual’s risk-aversion.
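
To make the notions of Certainty-Equivalent Value and Risk-Premium concrete, here is a minimal numerical sketch (illustrative, not from the book's code repository). It uses the concave utility U (x) = √x, chosen purely because its inverse is available in closed form, and a lognormal outcome x with arbitrary parameters:

import numpy as np

# Monte Carlo estimates of x_CE = U_inverse(E[U(x)]) and the Risk-Premia,
# for the illustrative concave utility U(x) = sqrt(x) (so U_inverse(y) = y * y)
# and a lognormal random outcome x.
rng = np.random.default_rng(0)
x = np.exp(rng.normal(loc=0.0, scale=0.5, size=1_000_000))

expected_x = x.mean()
expected_utility = np.sqrt(x).mean()
x_ce = expected_utility * expected_utility   # Certainty-Equivalent Value
pi_a = expected_x - x_ce                     # Absolute Risk-Premium
pi_r = pi_a / expected_x                     # Relative Risk-Premium

print(expected_x, x_ce, pi_a, pi_r)          # x_ce < expected_x, so pi_a > 0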
Note that Linear Utility function U (x) = a + bx implies Risk-Neutrality (i.e., when one
doesn’t demand any compensation for taking risk). Next, we look at typically-used Utility
functions U (·) with:

• Constant Absolute Risk-Aversion (CARA)


• Constant Relative Risk-Aversion (CRRA)

5.5. Constant Absolute Risk-Aversion (CARA)


Consider the Utility function U : R → R, parameterized by a ∈ R, defined as:
$$U(x) = \begin{cases} \frac{1 - e^{-ax}}{a} & \text{for } a \neq 0 \\ x & \text{for } a = 0 \end{cases}$$

Firstly, note that U (x) is continuous with respect to a for all x ∈ R since:

$$\lim_{a \to 0} \frac{1 - e^{-ax}}{a} = x$$
Now let us analyze the function U (·) for any fixed a. We note that for all a ∈ R:

• U (0) = 0
• U ′ (x) = e−ax > 0 for all x ∈ R
• U ′′ (x) = −a · e−ax

This means U (·) is a monotonically increasing function passing through the origin, and
its curvature has the opposite sign as that of a (note: no curvature when a = 0).
So now we can calculate the Absolute Risk-Aversion function:

$$A(x) = \frac{-U''(x)}{U'(x)} = a$$
So we see that the Absolute Risk-Aversion function is the constant value a. Conse-
quently, we say that this Utility function corresponds to Constant Absolute Risk-Aversion
(CARA). The parameter a is referred to as the Coefficient of CARA. The magnitude of pos-
itive a signifies the degree of risk-aversion. a = 0 is the case of being Risk-Neutral. Nega-
tive values of a mean one is “risk-seeking,” i.e., one will pay to take risk (the opposite of
risk-aversion) and the magnitude of negative a signifies the degree of risk-seeking.
If the random outcome x ∼ N (µ, σ²), then using Equation (A.5) from Appendix A, we get:

$$\mathbb{E}[U(x)] = \begin{cases} \frac{1 - e^{-a\mu + \frac{a^2\sigma^2}{2}}}{a} & \text{for } a \neq 0 \\ \mu & \text{for } a = 0 \end{cases}$$

$$x_{CE} = \mu - \frac{a\sigma^2}{2}$$

$$\text{Absolute Risk Premium } \pi_A = \mu - x_{CE} = \frac{a\sigma^2}{2}$$

For optimization problems where we need to choose across probability distributions
where σ² is a function of µ, we seek the distribution that maximizes xCE = µ − aσ²/2. This
clearly illustrates the concept of "risk-adjusted-return" because µ serves as the "return"
and the risk-adjustment aσ²/2 is proportional to the product of risk-aversion a and risk (i.e.,
variance in outcomes) σ².
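
A quick numerical check of these CARA formulas is to estimate xCE by Monte Carlo for a normal outcome and compare it against µ − aσ²/2. The sketch below is illustrative and not from the book's code repository (the values of a, µ, σ are arbitrary):

import numpy as np

# Verify the CARA certainty-equivalent formula x_CE = mu - a * sigma^2 / 2
# for x ~ N(mu, sigma^2) and U(x) = (1 - exp(-a * x)) / a.
a, mu, sigma = 2.0, 0.1, 0.2
rng = np.random.default_rng(0)
x = rng.normal(loc=mu, scale=sigma, size=1_000_000)

expected_utility = ((1 - np.exp(-a * x)) / a).mean()
x_ce_monte_carlo = -np.log(1 - a * expected_utility) / a   # U_inverse(y) = -log(1 - a*y) / a
x_ce_closed_form = mu - a * sigma * sigma / 2

print(x_ce_monte_carlo, x_ce_closed_form)   # the two values should be very close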

5.6. A Portfolio Application of CARA


Let’s say we are given $1 to invest and hold for a horizon of 1 year. Let’s say our portfolio
investment choices are:

• A risky asset with Annual Return ∼ N (µ, σ 2 ), µ ∈ R, σ ∈ R+ .


• A riskless asset with Annual Return = r ∈ R.

Our task is to determine the allocation π (out of the given $1) to invest in the risky
asset (so, 1 − π is invested in the riskless asset) so as to maximize the Expected Utility of
Consumption of Portfolio Wealth in 1 year. Note that we allow π to be unconstrained, i.e.,
π can be any real number from −∞ to +∞. So, if π > 0, we buy the risky asset and if
π < 0, we “short-sell” the risky asset. Investing π in the risky asset means in 1 year, the
risky asset’s value will be a normal distribution N (π(1 + µ), π 2 σ 2 ). Likewise, if 1 − π > 0,
we lend 1 − π (and will be paid back (1 − π)(1 + r) in 1 year), and if 1 − π < 0, we borrow
1 − π (and need to pay back (1 − π)(1 + r) in 1 year).
Portfolio Wealth W in 1 year is given by:

W ∼ N (1 + r + π(µ − r), π 2 σ 2 )

We assume CARA Utility with a ̸= 0, so:

$$U(W) = \frac{1 - e^{-aW}}{a}$$
We know that maximizing E[U (W )] is equivalent to maximizing the Certainty-Equivalent
Value of Wealth W , which in this case (using the formula for xCE in the section on CARA)
is given by:

$$1 + r + \pi(\mu - r) - \frac{a\pi^2\sigma^2}{2}$$

This is a quadratic concave function of π for a > 0, and so, taking its derivative with
respect to π and setting it to 0 gives us the optimal investment fraction in the risky asset
(π∗) as follows:

$$\pi^* = \frac{\mu - r}{a\sigma^2}$$
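
A quick way to confirm this is to search over a grid of allocations π and check that the certainty-equivalent 1 + r + π(µ − r) − aπ²σ²/2 is maximized near π∗ = (µ − r)/(aσ²). The sketch below is illustrative and not from the book's code repository (the parameter values are arbitrary):

import numpy as np

# Grid-search the CARA certainty-equivalent of 1-year wealth over allocations pi
# and compare the maximizer against the closed-form pi_star = (mu - r) / (a * sigma^2).
a, mu, sigma, r = 2.0, 0.08, 0.2, 0.02
pi_grid = np.linspace(-2.0, 3.0, 5001)
ce = 1 + r + pi_grid * (mu - r) - a * pi_grid**2 * sigma**2 / 2

pi_star_grid = pi_grid[np.argmax(ce)]
pi_star_closed_form = (mu - r) / (a * sigma**2)

print(pi_star_grid, pi_star_closed_form)   # both should be approximately 0.75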

5.7. Constant Relative Risk-Aversion (CRRA)


Consider the Utility function U : R+ → R, parameterized by γ ∈ R, defined as:
$$U(x) = \begin{cases} \frac{x^{1-\gamma} - 1}{1-\gamma} & \text{for } \gamma \neq 1 \\ \log(x) & \text{for } \gamma = 1 \end{cases}$$

Firstly, note that U (x) is continuous with respect to γ for all x ∈ R+ since:

$$\lim_{\gamma \to 1} \frac{x^{1-\gamma} - 1}{1-\gamma} = \log(x)$$

Now let us analyze the function U (·) for any fixed γ. We note that for all γ ∈ R:

• U (1) = 0
• U ′ (x) = x−γ > 0 for all x ∈ R+
• U ′′ (x) = −γ · x−1−γ

This means U (·) is a monotonically increasing function passing through (1, 0), and
its curvature has the opposite sign as that of γ (note: no curvature when γ = 0).
So now we can calculate the Relative Risk-Aversion function:

$$R(x) = \frac{-U''(x) \cdot x}{U'(x)} = \gamma$$
So we see that the Relative Risk-Aversion function is the constant value γ. Consequently,
we say that this Utility function corresponds to Constant Relative Risk-Aversion (CRRA). The
parameter γ is referred to as the Coefficient of CRRA. The magnitude of positive γ signifies
the degree of risk-aversion. γ = 0 yields the Utility function U (x) = x − 1 and is the case
of being Risk-Neutral. Negative values of γ mean one is “risk-seeking,” i.e., one will pay
to take risk (the opposite of risk-aversion) and the magnitude of negative γ signifies the
degree of risk-seeking.
If the random outcome x is lognormal, with log(x) ∼ N (µ, σ²), then making a substitution
y = log(x), expressing E[U (x)] as E[U (e^y)], and using Equation (A.5) in Appendix A,
we get:

$$\mathbb{E}[U(x)] = \begin{cases} \frac{e^{\mu(1-\gamma) + \frac{\sigma^2}{2}(1-\gamma)^2} - 1}{1-\gamma} & \text{for } \gamma \neq 1 \\ \mu & \text{for } \gamma = 1 \end{cases}$$

$$x_{CE} = e^{\mu + \frac{\sigma^2}{2}(1-\gamma)}$$

$$\text{Relative Risk Premium } \pi_R = 1 - \frac{x_{CE}}{\mathbb{E}[x]} = 1 - e^{-\frac{\sigma^2\gamma}{2}}$$

For optimization problems where we need to choose across probability distributions
where σ² is a function of µ, we seek the distribution that maximizes log(xCE ) = µ + σ²(1 − γ)/2.
Just like in the case of CARA, this clearly illustrates the concept of "risk-adjusted-return"
because µ + σ²/2 serves as the "return" and the risk-adjustment γσ²/2 is proportional
to the product of risk-aversion γ and risk (i.e., variance in outcomes) σ².
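
As with CARA, we can sanity-check the formula for xCE by Monte Carlo. The sketch below is illustrative and not from the book's code repository (the values of γ, µ, σ are arbitrary):

import numpy as np

# Verify the CRRA certainty-equivalent formula x_CE = exp(mu + sigma^2 * (1 - gamma) / 2)
# for lognormal x (log(x) ~ N(mu, sigma^2)) and U(x) = (x^(1 - gamma) - 1) / (1 - gamma).
gamma, mu, sigma = 2.0, 0.05, 0.25
rng = np.random.default_rng(0)
x = np.exp(rng.normal(loc=mu, scale=sigma, size=1_000_000))

expected_utility = ((x ** (1 - gamma) - 1) / (1 - gamma)).mean()
# U_inverse(y) = (1 + (1 - gamma) * y) ** (1 / (1 - gamma))
x_ce_monte_carlo = (1 + (1 - gamma) * expected_utility) ** (1 / (1 - gamma))
x_ce_closed_form = np.exp(mu + sigma * sigma * (1 - gamma) / 2)

print(x_ce_monte_carlo, x_ce_closed_form)   # the two values should be very close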

5.8. A Portfolio Application of CRRA


This application of CRRA is a special case of Merton’s Portfolio Problem (Merton 1969)
that we shall cover in its full generality in Chapter 6. This section requires us to have some
basic familiarity with Stochastic Calculus (covered in Appendix C), specifically Ito Pro-
cesses and Ito’s Lemma. Here we consider the single-decision version of Merton’s Portfolio
Problem where our portfolio investment choices are:

• A risky asset, evolving in continuous time, with value denoted St at time t, whose
movements are defined by the Ito process:

dSt = µ · St · dt + σ · St · dzt

where µ ∈ R, σ ∈ R+ are given constants, and zt is 1-dimensional standard Brownian Motion.

• A riskless asset, growing continuously in time, with value denoted Rt at time t,
whose growth is defined by the ordinary differential equation:

dRt = r · Rt · dt

where r ∈ R is a given constant.

We are given $1 to invest over a period of 1 year. We are asked to maintain a constant
fraction of investment of wealth (denoted π ∈ R) in the risky asset at each time t (with
1−π as the fraction of investment in the riskless asset at each time t). Note that to maintain
a constant fraction of investment in the risky asset, we need to continuously rebalance the
portfolio of the risky asset and riskless asset. Our task is to determine the constant π that
maximizes the Expected Utility of Consumption of Wealth at the end of 1 year. We allow π
to be unconstrained, i.e., π can take any value from −∞ to +∞. Positive π means we have
a “long” position in the risky asset and negative π means we have a “short” position in the
risky asset. Likewise, positive 1 − π means we are lending money at the riskless interest
rate of r and negative 1 − π means we are borrowing money at the riskless interest rate of
r.
We denote the Wealth at time t as Wt . Without loss of generality, assume W0 = 1. Since
Wt is the portfolio wealth at time t, the value of the investment in the risky asset at time t

would need to be π ·Wt and the value of the investment in the riskless asset at time t would
need to be (1 − π) · Wt . Therefore, the change in the value of the risky asset investment
from time t to time t + dt is:

µ · π · Wt · dt + σ · π · Wt · dzt
Likewise, the change in the value of the riskless asset investment from time t to time
t + dt is:

r · (1 − π) · Wt · dt
Therefore, the infinitesimal change in portfolio wealth dWt from time t to time t + dt is
given by:

dWt = (r + π(µ − r)) · Wt · dt + π · σ · Wt · dzt


Note that this is an Ito process defining the stochastic evolution of portfolio wealth.
Applying Ito’s Lemma (see Appendix C) on log Wt gives us:

$$d(\log W_t) = \left((r + \pi(\mu - r)) \cdot W_t \cdot \frac{1}{W_t} - \frac{1}{2} \cdot \frac{\pi^2 \cdot \sigma^2 \cdot W_t^2}{W_t^2}\right) \cdot dt + \pi \cdot \sigma \cdot W_t \cdot \frac{1}{W_t} \cdot dz_t$$
$$= \left(r + \pi(\mu - r) - \frac{\pi^2\sigma^2}{2}\right) \cdot dt + \pi \cdot \sigma \cdot dz_t$$

Therefore,

$$\log W_t = \int_0^t \left(r + \pi(\mu - r) - \frac{\pi^2\sigma^2}{2}\right) \cdot du + \int_0^t \pi \cdot \sigma \cdot dz_u$$

Using the martingale property and Ito Isometry for the Ito integral $\int_0^t \pi \cdot \sigma \cdot dz_u$ (see Appendix C), we get:

$$\log W_1 \sim N\left(r + \pi(\mu - r) - \frac{\pi^2\sigma^2}{2}, \; \pi^2\sigma^2\right)$$
We assume CRRA Utility with γ ≠ 0, so:

$$U(W_1) = \begin{cases} \frac{W_1^{1-\gamma} - 1}{1-\gamma} & \text{for } \gamma \neq 1 \\ \log(W_1) & \text{for } \gamma = 1 \end{cases}$$

We know that maximizing E[U (W1 )] is equivalent to maximizing the Certainty-Equivalent
Value of W1 , hence also equivalent to maximizing the log of the Certainty-Equivalent Value
of W1 , which in this case (using the formula for xCE from the section on CRRA) is given
by:

$$r + \pi(\mu - r) - \frac{\pi^2\sigma^2}{2} + \frac{\pi^2\sigma^2(1-\gamma)}{2} = r + \pi(\mu - r) - \frac{\pi^2\sigma^2\gamma}{2}$$

This is a quadratic concave function of π for γ > 0, and so, taking its derivative with
respect to π and setting it to 0 gives us the optimal investment fraction in the risky asset
(π∗) as follows:

$$\pi^* = \frac{\mu - r}{\gamma\sigma^2}$$
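
The sketch below (an illustrative simulation under the stated lognormal evolution of W1 , not from the book's code repository; the parameter values are arbitrary) estimates E[U (W1 )] by Monte Carlo over a grid of constant allocations π and confirms that the estimate peaks near π∗ = (µ − r)/(γσ²):

import numpy as np

# Simulate log W_1 ~ N(r + pi*(mu - r) - pi^2*sigma^2/2, pi^2*sigma^2) for a grid
# of constant allocations pi, estimate E[U(W_1)] under CRRA utility, and compare
# the maximizing pi against the closed-form (mu - r) / (gamma * sigma^2).
mu, sigma, r, gamma = 0.08, 0.2, 0.02, 1.5
rng = np.random.default_rng(0)
z = rng.normal(size=200_000)                 # common random numbers across pi values
pi_grid = np.linspace(-1.0, 3.0, 401)

def expected_crra_utility(pi: float) -> float:
    log_w1 = r + pi * (mu - r) - pi**2 * sigma**2 / 2 + pi * sigma * z
    w1 = np.exp(log_w1)
    return ((w1 ** (1 - gamma) - 1) / (1 - gamma)).mean()

estimates = np.array([expected_crra_utility(p) for p in pi_grid])
print(pi_grid[np.argmax(estimates)], (mu - r) / (gamma * sigma**2))   # both near 1.0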

5.9. Key Takeaways from this Chapter
• An individual’s financial risk-aversion is represented by the concave nature of the
individual’s Utility as a function of financial outcomes.
• Risk-Premium (compensation an individual seeks for taking financial risk) is roughly
proportional to the individual’s financial risk-aversion and the measure of uncer-
tainty in financial outcomes.
• Risk-Adjusted-Return in finance should be thought of as the Certainty-Equivalent-
Value, whose Utility is the Expected Utility across uncertain (risky) financial out-
comes.

6. Dynamic Asset-Allocation and Consumption
This chapter covers the first of five financial applications of Stochastic Control covered in
this book. This financial application deals with the topic of investment management for
not just a financial company, but more broadly for any corporation or for any individual.
The nuances for specific companies and individuals can vary considerably but what is
common across these entities is the need to:

• Periodically decide how one’s investment portfolio should be split across various
choices of investment assets - the key being how much money to invest in more risky
assets (which have potential for high returns on investment) versus less risky assets
(that tend to yield modest returns on investment). This problem of optimally allocat-
ing capital across investment assets of varying risk-return profiles relates to the topic
of Utility Theory we covered in Chapter 5. However, in this chapter, we deal with
the further challenge of adjusting one’s allocation of capital across assets, as time
progresses. We refer to this feature as Dynamic Asset Allocation (the word dynamic
refers to the adjustment of capital allocation to adapt to changing circumstances)

• Periodically decide how much capital to leave in one’s investment portfolio versus
how much money to consume for one’s personal needs/pleasures (or for a corpora-
tion’s operational requirements) by extracting money from one’s investment portfo-
lio. Extracting money from one’s investment portfolio can mean potentially losing
out on investment growth opportunities, but the flip side of this is the Utility of Con-
sumption that a corporation/individual desires. Noting that ultimately our goal is
to maximize total utility of consumption over a certain time horizon, this decision
of investing versus consuming really amounts to the timing of consumption of one’s
money over the given time horizon.

Thus, this problem constitutes the dual and dynamic decisioning of asset-allocation and
consumption. To gain an intuitive understanding of the challenge of this dual dynamic
decisioning problem, let us consider this problem from the perspective of personal finance
in a simplified setting.

6.1. Optimization of Personal Finance


Personal Finances can be very simple for some people (earn a monthly salary, spend the
entire salary) and can be very complicated for some other people (eg: those who own
multiple businesses in multiple countries and have complex assets and liabilities). Here
we shall consider a situation that is relatively simple but includes sufficient nuances to
provide you with the essential elements of the general problem of dynamic asset-allocation
and consumption. Let’s say your personal finances consist of the following aspects:

• Receiving money: This could include your periodic salary, which typically remains
constant for a period of time, but can change if you get a promotion or if you get a
new job. This also includes money you liquidate from your investment portfolio, eg:
if you sell some stock, and decide not to re-invest in other investment assets. This
also includes interest you earn from your savings account or from some bonds you
might own. There are many other ways one can receive money, some fixed regular
payments and some uncertain in terms of payment quantity and timing, and we
won’t enumerate all the different ways of receiving money. We just want to highlight
here that receiving money at various points in time is one of the key financial aspects
in one’s life.
• Consuming money: The word "consume" refers to "spending." Note that one needs
to consume money periodically to satisfy basic needs like shelter, food and clothing.
The rent or mortgage you pay on your house is one example - it may be a fixed
amount every month, but if your mortgage rate is a floating rate, it is subject to vari-
ation. Moreover, if you move to a new house, the rent or mortgage can be different.
The money you spend on food and clothing also constitutes consuming money. This
can often be fairly stable from one month to the next, but if you have a newborn baby,
it might require additional expenses of the baby’s food, clothing and perhaps also
toys. Then there is consumption of money that is beyond the "necessities" - things
like eating out at a fancy restaurant on the weekend, taking a summer vacation, buy-
ing a luxury car or an expensive watch etc. One gains “satisfaction”/“happiness”
(i.e., Utility) from this consumption of money. The key point here is that we need to
periodically make a decision on how much to spend (consume money) on a weekly
or monthly basis. One faces a tension in the dynamic decision between consuming
money (that gives us Consumption Utility) and saving money (which is the money we
put in our investment portfolio in the hope of the money growing, so we can con-
sume potentially larger amounts of money in the future).
• Investing Money: Let us suppose there are a variety of investment assets you can in-
vest in - simple savings account giving small interest, exchange-traded stocks (rang-
ing from value stocks to growth stocks, with their respective risk-return tradeoffs),
real-estate (the house you bought and live in is indeed considered an investment
asset), commodities such as gold, paintings etc. We call the composition of money
invested in these assets as one’s investment portfolio (see Appendix B for a quick
introduction to Portfolio Theory). Periodically, we need to decide if one should play
safe by putting most of one’s money in a savings account, or if we should allocate
investment capital mostly in stocks, or if we should be more speculative and invest
in an early-stage startup or in a rare painting. Reviewing the composition and poten-
tially re-allocating capital (referred to as re-balancing one's portfolio) is the problem
of dynamic asset-allocation. Note also that we can put some of our received money
into our investment portfolio (meaning we choose to not consume that money right
away). Likewise, we can extract some money out of our investment portfolio so we
can consume money. The decisions of insertion and extraction of money into/from
our investment portfolio is essentially the dynamic money-consumption decision we
make, which goes together with the dynamic asset-allocation decision.

The above description has hopefully given you a flavor of the dual and dynamic deci-
sioning of asset-allocation and consumption. Ultimately, our personal goal is to maximize
the Expected Aggregated Utility of Consumption of Money over our lifetime (and per-
haps, also include the Utility of Consumption of Money for one’s spouse and children,

after one dies). Since investment portfolios are stochastic in nature and since we have to
periodically make decisions on asset-allocation and consumption, you can see that this has
all the ingredients of a Stochastic Control problem, and hence can be modeled as a Markov
Decision Process (albeit typically fairly complicated, since real-life finances have plenty of
nuances). Here’s a rough and informal sketch of what that MDP might look like (bear in
mind that we will formalize the MDP for simplified cases later in this chapter):

• States: The State can be quite complex in general, but mainly it consists of one’s age
(to keep track of the time to reach the MDP horizon), the quantities of money in-
vested in each investment asset, the valuation of the assets invested in, and poten-
tially also other aspects like one’s job/career situation (required to make predictions
of future salary possibilities).

• Actions: The Action is two-fold. Firstly, it’s the vector of investment amounts one
chooses to make at each time step (the time steps are at the periodicity at which we
review our investment portfolio for potential re-allocation of capital across assets).
Secondly, it’s the quantity of money one chooses to consume that is flexible/optional
(i.e., beyond the fixed payments like rent that we are committed to make).

• Rewards: The Reward is the Utility of Consumption of Money that we deemed as


flexible/optional - it corresponds to the second part of the Action.

• Model: The Model (probabilities of next state and reward, given current state and
action) can be fairly complex in most real-life situations. The hardest aspect is the
prediction of what might happen tomorrow in our life and career (we need this pre-
diction since it determines our future likelihood to receive money, consume money
and invest money). Moreover, the uncertain movements of investment assets would
need to be captured by our model.

Since our goal here was to simply do a rough and informal sketch, the above coverage of
the MDP is very hazy but we hope you get a sense for what the MDP might look like. Now
we are ready to take a simple special case of this MDP which does away with many of the
real-world frictions and complexities, yet retains the key features (in particular, the dual
dynamic decisioning aspect). This simple special case was the subject of Merton’s Portfo-
lio Problem (Merton 1969) which he formulated and solved in 1969 in a landmark paper.
A key feature of his formulation was that time is continuous and so, state (based on asset
prices) evolves as a continuous-time stochastic process, and actions (asset-allocation and
consumption) are made continuously. We cover the important parts of his paper in the
next section. Note that our coverage below requires some familiarity with Stochastic Cal-
culus (covered in Appendix C) and with the Hamilton-Jacobi-Bellman Equation (covered
in Appendix D), which is the continuous-time analog of Bellman’s Optimality Equation.

6.2. Merton’s Portfolio Problem and Solution


Now we describe Merton’s Portfolio problem and derive its analytical solution, which is
one of the most elegant solutions in Mathematical Economics. The solution structure will
provide tremendous intuition for how the asset-allocation and consumption decisions de-
pend on not just the state variables but also on the problem inputs.
We denote time as t and say that current time is t = 0. Assume that you have just retired
(meaning you won’t be earning any money for the rest of your life) and that you are going

to live for T more years (T is a fixed real number). So, in the language of the previous
section, you will not be receiving money for the rest of your life, other than the option of
extracting money from your investment portfolio. Also assume that you have no fixed
payments to make like mortgage, subscriptions etc. (assume that you have already paid
for a retirement service that provides you with your essential food, clothing and other ser-
vices). This means all of your money consumption is flexible/optional, i.e., you have a choice
of consuming any real non-negative number at any point in time. All of the above are big
(and honestly, unreasonable) assumptions but they help keep the problem simple enough
for analytical tractability. In spite of these over-simplified assumptions, the problem for-
mulation still captures the salient aspects of dual dynamic decisioning of asset-allocation
and consumption while eliminating the clutter of A) receiving money from external sources
and B) consuming money that is of a non-optional nature.
We define wealth at any time t (denoted Wt ) as the aggregate market value of your
investment assets. Note that since no external money is received and since all consumption
is optional, Wt is your “net-worth.” Assume there are a fixed number n of risky assets and
a single riskless asset. Assume that each risky asset has a known normal distribution of
returns. Now we make a couple of big assumptions for analytical tractability:

• You are allowed to buy or sell any fractional quantities of assets at any point in time
(i.e., in continuous time).
• There are no transaction costs with any of the buy or sell transactions in any of the
assets.

You start with wealth W0 at time t = 0. As mentioned earlier, the goal is to maximize
your expected lifetime-aggregated Utility of Consumption of money with the actions at
any point in time being two-fold: Asset-Allocation and Consumption (Consumption being
equal to the capital extracted from the investment portfolio at any point in time). Note
that since there is no external source of money and since all capital extracted from the
investment portfolio at any point in time is immediately consumed, you are never adding
capital to your investment portfolio. The growth of the investment portfolio can happen
only from growth in the market value of assets in your investment portfolio. Lastly, we
assume that the Consumption Utility function is Constant Relative Risk-Aversion (CRRA),
which we covered in Chapter 5.
For ease of exposition, we formalize the problem setting and derive Merton’s beautiful
analytical solution for the case of n = 1 (i.e., only 1 risky asset). The solution generalizes
in a straightforward manner to the case of n > 1 risky assets, so it pays to keep the notation
and explanations simple, emphasizing intuition rather than heavy technical details.
Since we are operating in continuous-time, the risky asset follows a stochastic process
(denoted S) - specifically an Ito process (introductory background on Ito processes and
Ito’s Lemma covered in Appendix C), as follows:

dSt = µ · St · dt + σ · St · dzt

where µ ∈ R, σ ∈ R+ are fixed constants (note that for n assets, we would instead work
with a vector for µ and a matrix for σ).
The riskless asset has no uncertainty associated with it and has a fixed rate of growth in
continuous-time, so the valuation of the riskless asset Rt at time t is given by:

dRt = r · Rt · dt

Assume r ∈ R is a fixed constant, representing the instantaneous riskless growth of
money. We denote the consumption of wealth (equal to extraction of money from the
investment portfolio) per unit time (at time t) as c(t, Wt ) ≥ 0 to make it clear that the con-
sumption (our decision at any time t) will in general depend on both time t and wealth Wt .
Note that we talk about “rate of consumption in time” because consumption is assumed
to be continuous in time. As mentioned earlier, we denote wealth at time t as Wt (note
that W is a stochastic process too). We assume that Wt > 0 for all t ≥ 0. This is a rea-
sonable assumption to make as it manifests in constraining the consumption (extraction
from investment portfolio) to ensure wealth remains positive. We denote the fraction of
wealth allocated to the risky asset at time t as π(t, Wt ). Just like consumption c, risky-asset
allocation fraction π is a function of time t and wealth Wt . Since there is only one risky
asset, the fraction of wealth allocated to the riskless asset at time t is 1 − π(t, Wt ). Unlike
the constraint c(t, Wt ) ≥ 0, π(t, Wt ) is assumed to be unconstrained. Note that c(t, Wt )
and π(t, Wt ) together constitute the decision (MDP action) at time t. To keep our notation
light, we shall write ct for c(t, Wt ) and πt for π(t, Wt ), but please do recognize through-
out the derivation that both are functions of wealth Wt at time t as well as of time t itself.
Finally, we assume that the Utility of Consumption function is defined as:

$$U(x) = \frac{x^{1-\gamma}}{1-\gamma}$$

for a risk-aversion parameter γ ≠ 1. This Utility function is essentially the CRRA Utility
function (ignoring the constant term $\frac{-1}{1-\gamma}$) that we covered in Chapter 5 for γ ≠ 1. γ is
the Coefficient of CRRA, equal to $\frac{-x \cdot U''(x)}{U'(x)}$. We will not cover the case of the CRRA Utility
function for γ = 1 (i.e., U (x) = log(x)), but we encourage you to work out the derivation
for U (x) = log(x) as an exercise.
Due to our assumption of no addition of money to our investment portfolio of the risky
asset St and riskless asset Rt and due to our assumption of no transaction costs of buy-
ing/selling any fractional quantities of risky as well as riskless assets, the time-evolution
for wealth should be conceptualized as a continuous adjustment of the allocation πt and
continuous extraction from the portfolio (equal to continuous consumption ct ).
Since the value of the risky asset investment at time t is πt · Wt , the change in the value
of the risky asset investment from time t to time t + dt is:

µ · πt · Wt · dt + σ · πt · Wt · dzt

Likewise, since the value of the riskless asset investment at time t is (1 − πt ) · Wt , the
change in the value of the riskless asset investment from time t to time t + dt is:

r · (1 − πt ) · Wt · dt

Therefore, the infinitesimal change in wealth dWt from time t to time t + dt is given by:

dWt = ((r + πt · (µ − r)) · Wt − ct ) · dt + πt · σ · Wt · dzt (6.1)

Note that this is an Ito process defining the stochastic evolution of wealth.
Our goal is to determine the optimal (π(t, Wt ), c(t, Wt )) at any time t to maximize:

$$\mathbb{E}\left[\int_t^T \frac{e^{-\rho(s-t)} \cdot c_s^{1-\gamma}}{1-\gamma} \cdot ds + \frac{e^{-\rho(T-t)} \cdot B(T) \cdot W_T^{1-\gamma}}{1-\gamma} \;\Big|\; W_t\right]$$

where ρ ≥ 0 is the utility discount rate to account for the fact that future utility of consump-
tion might be less than current utility of consumption, and B(·) is known as the “bequest”
function (think of this as the money you will leave for your family when you die at time
T ). We can solve this problem for arbitrary bequest B(T ) but for simplicity, we shall con-
sider B(T ) = ϵ^γ where 0 < ϵ ≪ 1, meaning "no bequest." We require the bequest to be ϵ^γ
rather than 0 for technical reasons, that will become apparent later.
We should think of this problem as a continuous-time Stochastic Control problem where
the MDP is defined as below:

• The State at time t is (t, Wt )


• The Action at time t is (πt , ct )
• The Reward per unit time at time t < T is:

$$U(c_t) = \frac{c_t^{1-\gamma}}{1-\gamma}$$

and the Reward at time T is:

$$B(T) \cdot U(W_T) = \epsilon^\gamma \cdot \frac{W_T^{1-\gamma}}{1-\gamma}$$

The Return at time t is the accumulated discounted Reward:

$$\int_t^T e^{-\rho(s-t)} \cdot \frac{c_s^{1-\gamma}}{1-\gamma} \cdot ds + \frac{e^{-\rho(T-t)} \cdot \epsilon^\gamma \cdot W_T^{1-\gamma}}{1-\gamma}$$
Our goal is to find the Policy : (t, Wt ) → (πt , ct ) that maximizes the Expected Return. Note
the important constraint that ct ≥ 0, but πt is unconstrained.
Our first step is to write out the Hamilton-Jacobi-Bellman (HJB) Equation (the analog
of the Bellman Optimality Equation in continuous-time). We denote the Optimal Value
Function as V ∗ such that the Optimal Value for wealth Wt at time t is V ∗ (t, Wt ). Note
that unlike Section 3.13 in Chapter 3 where we denoted the Optimal Value Function as
a time-indexed sequence Vt∗ (·), here we make t an explicit functional argument of V ∗ .
This is because in the continuous-time setting, we are interested in the time-differential
of the Optimal Value Function. Appendix D provides the derivation of the general HJB
formulation (Equation (D.1) in Appendix D) - this general HJB Equation specializes here
to the following:

$$\max_{\pi_t, c_t}\left\{\mathbb{E}_t\left[dV^*(t, W_t) + \frac{c_t^{1-\gamma}}{1-\gamma} \cdot dt\right]\right\} = \rho \cdot V^*(t, W_t) \cdot dt \quad (6.2)$$
Now use Ito’s Lemma on dV ∗ , remove the dzt term since it’s a martingale, and divide
throughout by dt to produce the HJB Equation in partial-differential form for any 0 ≤
t < T , as follows (the general form of this transformation appears as Equation (D.2) in
Appendix D):

$$\max_{\pi_t, c_t}\left\{\frac{\partial V^*}{\partial t} + \frac{\partial V^*}{\partial W_t} \cdot ((\pi_t(\mu - r) + r) \cdot W_t - c_t) + \frac{\partial^2 V^*}{\partial W_t^2} \cdot \frac{\pi_t^2 \cdot \sigma^2 \cdot W_t^2}{2} + \frac{c_t^{1-\gamma}}{1-\gamma}\right\} = \rho \cdot V^*(t, W_t) \quad (6.3)$$

This HJB Equation is subject to the terminal condition:

$$V^*(T, W_T) = \epsilon^\gamma \cdot \frac{W_T^{1-\gamma}}{1-\gamma}$$

Let us write Equation (6.3) more succinctly as:

$$\max_{\pi_t, c_t} \Phi(t, W_t; \pi_t, c_t) = \rho \cdot V^*(t, W_t) \quad (6.4)$$

It pays to emphasize again that we are working with the constraints Wt > 0, ct ≥ 0 for
0 ≤ t < T.
To find optimal πt∗ , c∗t , we take the partial derivatives of Φ(t, Wt ; πt , ct ) with respect to
πt and ct , and equate to 0 (first-order conditions for Φ). The partial derivative of Φ with
respect to πt is:
$$(\mu - r) \cdot \frac{\partial V^*}{\partial W_t} + \frac{\partial^2 V^*}{\partial W_t^2} \cdot \pi_t \cdot \sigma^2 \cdot W_t = 0$$

$$\Rightarrow \pi_t^* = \frac{-\frac{\partial V^*}{\partial W_t} \cdot (\mu - r)}{\frac{\partial^2 V^*}{\partial W_t^2} \cdot \sigma^2 \cdot W_t} \quad (6.5)$$

The partial derivative of Φ with respect to ct is:

$$-\frac{\partial V^*}{\partial W_t} + (c_t^*)^{-\gamma} = 0$$

$$\Rightarrow c_t^* = \left(\frac{\partial V^*}{\partial W_t}\right)^{-\frac{1}{\gamma}} \quad (6.6)$$
Now substitute πt∗ (from Equation (6.5)) and c∗t (from Equation (6.6)) in Φ(t, Wt ; πt , ct )
(in Equation (6.3)) and equate to ρ · V ∗ (t, Wt ). This gives us the Optimal Value Function
Partial Differential Equation (PDE):

$$\frac{\partial V^*}{\partial t} - \frac{(\mu - r)^2}{2\sigma^2} \cdot \frac{\left(\frac{\partial V^*}{\partial W_t}\right)^2}{\frac{\partial^2 V^*}{\partial W_t^2}} + \frac{\partial V^*}{\partial W_t} \cdot r \cdot W_t + \frac{\gamma}{1-\gamma} \cdot \left(\frac{\partial V^*}{\partial W_t}\right)^{\frac{\gamma-1}{\gamma}} = \rho \cdot V^*(t, W_t) \quad (6.7)$$

The boundary condition for this PDE is:

$$V^*(T, W_T) = \epsilon^\gamma \cdot \frac{W_T^{1-\gamma}}{1-\gamma}$$

The second-order conditions for Φ are satisfied under the assumptions: ct* > 0, Wt > 0,
∂²V*/∂Wt² < 0 for all 0 ≤ t < T (we will later show that these are all satisfied in the solution
we derive), and for concave U (·), i.e., γ > 0.
Next, we want to reduce the PDE (6.7) to an Ordinary Differential Equation (ODE) so
we can solve the (simpler) ODE. Towards this goal, we surmise a guess solution in
terms of a deterministic function (f ) of time:

$$V^*(t, W_t) = f(t)^\gamma \cdot \frac{W_t^{1-\gamma}}{1-\gamma} \quad (6.8)$$

Then,

$$\frac{\partial V^*}{\partial t} = \gamma \cdot f(t)^{\gamma-1} \cdot f'(t) \cdot \frac{W_t^{1-\gamma}}{1-\gamma} \quad (6.9)$$

$$\frac{\partial V^*}{\partial W_t} = f(t)^\gamma \cdot W_t^{-\gamma} \quad (6.10)$$

$$\frac{\partial^2 V^*}{\partial W_t^2} = -f(t)^\gamma \cdot \gamma \cdot W_t^{-\gamma-1} \quad (6.11)$$
Substituting the guess solution in the PDE, we get the simple ODE:

$$f'(t) = \nu \cdot f(t) - 1 \quad (6.12)$$

where

$$\nu = \frac{\rho - (1-\gamma) \cdot \left(\frac{(\mu-r)^2}{2\sigma^2\gamma} + r\right)}{\gamma}$$

We note that the bequest function B(T ) = ϵ^γ proves to be convenient in order to
fit the guess solution for t = T . This means the boundary condition for this ODE is:
f (T ) = ϵ. Consequently, this ODE together with this boundary condition has a simple
enough solution, as follows:

$$f(t) = \begin{cases} \frac{1 + (\nu\epsilon - 1) \cdot e^{-\nu(T-t)}}{\nu} & \text{for } \nu \neq 0 \\ T - t + \epsilon & \text{for } \nu = 0 \end{cases} \quad (6.13)$$

Substituting V ∗ (from Equation (6.8)) and its partial derivatives (from Equations (6.9),
(6.10) and (6.11)) in Equations (6.5) and (6.6), we get:

$$\pi^*(t, W_t) = \frac{\mu - r}{\sigma^2\gamma} \quad (6.14)$$

$$c^*(t, W_t) = \frac{W_t}{f(t)} = \begin{cases} \frac{\nu \cdot W_t}{1 + (\nu\epsilon - 1) \cdot e^{-\nu(T-t)}} & \text{for } \nu \neq 0 \\ \frac{W_t}{T - t + \epsilon} & \text{for } \nu = 0 \end{cases} \quad (6.15)$$

Finally, substituting the solution for f (t) (Equation (6.13)) in Equation (6.8), we get:

$$V^*(t, W_t) = \begin{cases} \frac{(1 + (\nu\epsilon - 1) \cdot e^{-\nu(T-t)})^\gamma}{\nu^\gamma} \cdot \frac{W_t^{1-\gamma}}{1-\gamma} & \text{for } \nu \neq 0 \\ \frac{(T - t + \epsilon)^\gamma \cdot W_t^{1-\gamma}}{1-\gamma} & \text{for } \nu = 0 \end{cases} \quad (6.16)$$

Note that f (t) > 0 for all 0 ≤ t < T (for all ν) ensures Wt > 0, ct* > 0, ∂²V*/∂Wt² < 0. This
ensures the constraints Wt > 0 and ct ≥ 0 are satisfied and the second-order conditions for
Φ are also satisfied. A very important lesson in solving Merton’s Portfolio problem is the
fact that the HJB Formulation is key and that this solution approach provides a template
for similar continuous-time stochastic control problems.
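
As a small numerical check of this derivation (an illustrative standalone sketch, not from the book's code repository; the values of ν, T , ϵ are arbitrary), we can integrate the ODE f′(t) = ν · f (t) − 1 backwards from the boundary condition f (T ) = ϵ and compare f (0) against the closed-form solution in Equation (6.13):

from math import exp

# Integrate the ODE f'(t) = nu * f(t) - 1 backwards from the boundary condition
# f(T) = epsilon with a simple Euler scheme, and compare f(0) against the
# closed-form solution of Equation (6.13). All parameter values are illustrative.
nu, big_t, epsilon = 0.3, 10.0, 1e-6
num_steps = 100_000
dt = big_t / num_steps

f_numerical = epsilon                      # boundary condition f(T) = epsilon
for _ in range(num_steps):                 # step backwards from t = T to t = 0
    f_numerical -= (nu * f_numerical - 1) * dt

f_closed_form = (1 + (nu * epsilon - 1) * exp(-nu * big_t)) / nu
print(f_numerical, f_closed_form)          # the two values should agree closely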

6.3. Developing Intuition for the Solution to Merton's Portfolio Problem

The solutions for π∗ (t, Wt ) and c∗ (t, Wt ) are surprisingly simple. π∗ (t, Wt ) is a constant,
i.e., it is independent of both of the state variables t and Wt . This means that no matter
what wealth we carry and no matter how close we are to the end of the horizon (i.e., no
matter what our age is), we should invest the same fraction of our wealth in the risky
asset (likewise for the case of n risky assets). The simplifying assumptions in Merton’s
Portfolio problem statement did play a part in the simplicity of the solution, but the fact
that π ∗ (t, Wt ) is a constant is still rather surprising. The simplicity of the solution means

that asset allocation is straightforward - we just need to keep re-balancing to maintain this
constant fraction of our wealth in the risky asset. We expect our wealth to grow over time
and so, the capital in the risky asset would also grow proportionately.
The form of the solution for π∗ (t, Wt ) is extremely intuitive - the excess return of the
risky asset (µ − r) shows up in the numerator, which makes sense, since one would expect
to invest a higher fraction of one’s wealth in the risky asset if it gives us a higher excess
return. It also makes sense that the volatility σ of the risky asset (squared) shows up in
the denominator (the greater the volatility, the less we’d allocate to the risky asset, since
we are typically risk-averse, i.e., γ > 0). Likewise, it makes sense that the coefficient of
CRRA γ shows up in the denominator since a more risk-averse individual (greater value
of γ) will want to invest less in the risky asset.
The Optimal Consumption Rate c∗ (t, Wt ) should be conceptualized in terms of the Opti-
mal Fractional Consumption Rate, i.e., the Optimal Consumption Rate c∗ (t, Wt ) as a fraction
of the Wealth Wt . Note that the Optimal Fractional Consumption Rate depends only on
t (it is equal to 1/f (t)). This means no matter what our wealth is, we should be extracting
a fraction of our wealth on a daily/monthly/yearly basis that is only dependent on our
age. Note also that if ϵ < 1/ν, the Optimal Fractional Consumption Rate increases as time
progresses. This makes intuitive sense because when we have many more years to live,
we’d want to consume less and invest more to give the portfolio more ability to grow, and
when we get close to our death, we increase our consumption (since the optimal is “to die
broke,” assuming no bequest).
Now let us understand how the Wealth process evolves. Let us substitute for π ∗ (t, Wt )
(from Equation (6.14)) and c∗ (t, Wt ) (from Equation (6.15)) in the Wealth process defined
in Equation (6.1). This yields the following Wealth process W ∗ when we asset-allocate
optimally and consume optimally:

$$dW_t^* = \left(r + \frac{(\mu - r)^2}{\sigma^2\gamma} - \frac{1}{f(t)}\right) \cdot W_t^* \cdot dt + \frac{\mu - r}{\sigma\gamma} \cdot W_t^* \cdot dz_t \quad (6.17)$$

The first thing to note about this Wealth process is that it is a lognormal process of the
form covered in Section C.7 of Appendix C. The lognormal volatility (fractional dispersion)
of this wealth process is constant (= (µ − r)/(σγ)). The lognormal drift (fractional drift) is
independent of the wealth but is dependent on time (= r + (µ − r)²/(σ²γ) − 1/f (t)). From the
solution of the general lognormal process derived in Section C.7 of Appendix C, we conclude
that:

$$\mathbb{E}[W_t] = W_0 \cdot e^{(r + \frac{(\mu-r)^2}{\sigma^2\gamma}) \cdot t} \cdot e^{-\int_0^t \frac{du}{f(u)}} = \begin{cases} W_0 \cdot e^{(r + \frac{(\mu-r)^2}{\sigma^2\gamma}) \cdot t} \cdot \left(1 - \frac{1 - e^{-\nu t}}{1 + (\nu\epsilon - 1) \cdot e^{-\nu T}}\right) & \text{if } \nu \neq 0 \\ W_0 \cdot e^{(r + \frac{(\mu-r)^2}{\sigma^2\gamma}) \cdot t} \cdot \left(1 - \frac{t}{T + \epsilon}\right) & \text{if } \nu = 0 \end{cases} \quad (6.18)$$
Since we assume no bequest, we should expect the Wealth process to keep growing up
to some point in time and then fall all the way down to 0 when time runs out (i.e., when
t = T ). We shall soon write the code for Equation (6.18) and plot the graph for this rise
and fall. An important point to note is that although the wealth process growth varies in
time (expected wealth growth rate = r + (µ − r)²/(σ²γ) − 1/f (t), as seen from Equation (6.17)),
the variation (in time) of the wealth process growth is only due to the fractional consumption
rate varying in time. If we ignore the fractional consumption rate (= 1/f (t)), then what
we get is the Expected Portfolio Annual Return of r + (µ − r)²/(σ²γ), which is a constant (does
not depend on either time t or on Wealth Wt ). Now let us write some code to calculate

the time-trajectories of Expected Wealth, Fractional Consumption Rate, Expected Wealth
Growth Rate and Expected Portfolio Annual Return.
The code should be pretty self-explanatory. We will just provide a few explanations of
variables in the code that may not be entirely obvious: portfolio_return calculates the
Expected Portfolio Annual Return, nu calculates the value of ν, f represents the function
f (t), wealth_growth_rate calculates the Expected Wealth Growth Rate as a function of time
t. The expected_wealth method assumes W0 = 1.

from dataclasses import dataclass
from math import exp

@dataclass(frozen=True)
class MertonPortfolio:
    mu: float
    sigma: float
    r: float
    rho: float
    horizon: float
    gamma: float
    epsilon: float = 1e-6

    def excess(self) -> float:
        return self.mu - self.r

    def variance(self) -> float:
        return self.sigma * self.sigma

    def allocation(self) -> float:
        # Optimal constant fraction of wealth in the risky asset, Equation (6.14)
        return self.excess() / (self.gamma * self.variance())

    def portfolio_return(self) -> float:
        # Expected Portfolio Annual Return r + (mu - r)^2 / (sigma^2 * gamma)
        return self.r + self.allocation() * self.excess()

    def nu(self) -> float:
        return (self.rho - (1 - self.gamma) * self.portfolio_return()) / \
            self.gamma

    def f(self, time: float) -> float:
        # f(t) from Equation (6.13)
        remaining: float = self.horizon - time
        nu = self.nu()
        if nu == 0:
            ret = remaining + self.epsilon
        else:
            ret = (1 + (nu * self.epsilon - 1) * exp(-nu * remaining)) / nu
        return ret

    def fractional_consumption_rate(self, time: float) -> float:
        # Optimal Fractional Consumption Rate 1 / f(t), from Equation (6.15)
        return 1 / self.f(time)

    def wealth_growth_rate(self, time: float) -> float:
        return self.portfolio_return() - self.fractional_consumption_rate(time)

    def expected_wealth(self, time: float) -> float:
        # E[W_t] from Equation (6.18), assuming W_0 = 1
        base: float = exp(self.portfolio_return() * time)
        nu = self.nu()
        if nu == 0:
            ret = base * (1 - time / (self.horizon + self.epsilon))
        else:
            ret = base * (1 - (1 - exp(-nu * time)) /
                          (1 + (nu * self.epsilon - 1) *
                           exp(-nu * self.horizon)))
        return ret

The above code is in the file rl/chapter7/merton_solution_graph.py. We highly encour-


age you to experiment by changing the various inputs in this code (T, µ, σ, r, ρ, γ) and
visualize how the results change. Doing this will help build tremendous intuition.
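
As a quick way to inspect these trajectories, here is a minimal usage sketch of the MertonPortfolio class above (the inputs are the same illustrative values used for Figure 6.1 below):

# Minimal usage sketch of the MertonPortfolio class defined above, with the
# inputs used for Figure 6.1: T = 20, mu = 10%, sigma = 10%, r = 2%, rho = 1%, gamma = 2.0
mp = MertonPortfolio(
    mu=0.1,
    sigma=0.1,
    r=0.02,
    rho=0.01,
    horizon=20.0,
    gamma=2.0
)
print(mp.allocation(), mp.portfolio_return())
for t in [0.0, 5.0, 10.0, 15.0, 19.0]:
    print(
        t,
        mp.fractional_consumption_rate(t),
        mp.wealth_growth_rate(t),
        mp.expected_wealth(t)
    )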
A rather interesting observation is that if r + (µ − r)²/(σ²γ) > 1/f (0) and ϵ < 1/ν, then the Frac-
tional Consumption Rate is initially less than the Expected Portfolio Annual Return and
over time, the Fractional Consumption Rate becomes greater than the Expected Portfolio
Annual Return. This illustrates how the optimal behavior is to consume modestly and in-
vest more when one is younger, then to gradually increase the consumption as one ages,
and finally to ramp up the consumption sharply when one is close to the end of one’s life.
Figure 6.1 shows the visual for this (along with the Expected Wealth Growth Rate) using
the above code for input values of: T = 20, µ = 10%, σ = 10%, r = 2%, ρ = 1%, γ = 2.0.

Figure 6.1.: Portfolio Return and Consumption Rate
Figure 6.2 shows the time-trajectory of the expected wealth based on Equation (6.18) for
the same input values as listed above. Notice how the Expected Wealth rises in a convex
shape for several years since the consumption during all these years is quite modest, and
then the shape of the Expected Wealth curve turns concave at about 12 years, peaks at
about 16 years (when Fractional Consumption Rate rises to equal Expected Portfolio An-
nual Return), and then falls precipitously in the last couple of years (as the Consumption
increasingly drains the Wealth down to 0).

Figure 6.2.: Expected Wealth Time-Trajectory

6.4. A Discrete-Time Asset-Allocation Example


In this section, we cover a discrete-time version of the problem that lends itself to analytical
tractability, much like Merton’s Portfolio Problem in continuous-time. We are given wealth
W0 at time 0. At each of discrete time steps labeled t = 0, 1, . . . , T − 1, we are allowed to
allocate the wealth Wt at time t to a portfolio of a risky asset and a riskless asset in an
unconstrained manner with no transaction costs. The risky asset yields a random return
∼ N (µ, σ 2 ) over each single time step (for a given µ ∈ R and a given σ ∈ R+ ). The riskless
asset yields a constant return denoted by r over each single time step (for a given r ∈ R).
We assume that there is no consumption of wealth at any time t < T , and that we liquidate
and consume the wealth WT at time T . So our goal is simply to maximize the Expected
Utility of Wealth at the final time step t = T by dynamically allocating xt ∈ R in the risky
asset and the remaining Wt − xt in the riskless asset for each t = 0, 1, . . . , T − 1. Assume
the single-time-step discount factor is γ and that the Utility of Wealth at the final time step
t = T is given by the following CARA function:

$$U(W_T) = \frac{1 - e^{-aW_T}}{a} \text{ for some fixed } a \neq 0$$

Thus, the problem is to maximize, for each t = 0, 1, . . . , T − 1, over choices of xt ∈ R,
the value:

$$\mathbb{E}\left[\gamma^{T-t} \cdot \frac{1 - e^{-aW_T}}{a} \;\Big|\; (t, W_t)\right]$$

Since γ^(T−t) and a are constants, this is equivalent to maximizing, for each t = 0, 1, . . . , T − 1,
over choices of xt ∈ R, the value:

$$\mathbb{E}\left[\frac{-e^{-aW_T}}{a} \;\Big|\; (t, W_t)\right] \quad (6.19)$$
We formulate this problem as a Continuous States and Continuous Actions discrete-time
finite-horizon MDP by specifying it’s State Transitions, Rewards and Discount Factor pre-
cisely. The problem then is to solve the MDP’s Control problem to find the Optimal Policy.
The terminal time for the finite-horizon MDP is T and hence, all the states at time t =
T are terminal states. We shall follow the notation of finite-horizon MDPs that we had
covered in Section 3.13 of Chapter 3. The State st ∈ St at any time step t = 0, 1, . . . , T
consists of the wealth Wt . The decision (Action) at ∈ At at any time step t = 0, 1, . . . , T − 1
is the quantity of investment in the risky asset (= xt ). Hence, the quantity of investment
in the riskless asset at time t will be Wt − xt . A deterministic policy at time t (for all
t = 0, 1, . . . T − 1) is denoted as πt , and hence, we write: πt (Wt ) = xt . Likewise, an optimal
deterministic policy at time t (for all t = 0, 1, . . . , T − 1) is denoted as πt∗ , and hence, we
write: πt∗ (Wt ) = x∗t .
Denote the random variable for the single-time-step return of the risky asset from time
t to time t + 1 as Yt ∼ N (µ, σ 2 ) for all t = 0, 1, . . . T − 1. So,

Wt+1 = xt · (1 + Yt ) + (Wt − xt ) · (1 + r) = xt · (Yt − r) + Wt · (1 + r) (6.20)
for all t = 0, 1, . . . , T − 1.
The MDP Reward is 0 for all t = 0, 1, . . . , T − 1. As a result of the simplified objective
(6.19) above, the MDP Reward for t = T is the following random quantity:

$$\frac{-e^{-aW_T}}{a}$$
We set the MDP discount factor to be γ = 1 (again, because of the simplified objective
(6.19) above).
We denote the Value Function at time t (for all t = 0, 1, . . . , T − 1) for a given policy
π = (π0 , π1 , . . . , πT −1 ) as:

$$V_t^\pi(W_t) = \mathbb{E}_\pi\left[\frac{-e^{-aW_T}}{a} \;\Big|\; (t, W_t)\right]$$

We denote the Optimal Value Function at time t (for all t = 0, 1, . . . , T − 1) as:

$$V_t^*(W_t) = \max_\pi V_t^\pi(W_t) = \max_\pi\left\{\mathbb{E}_\pi\left[\frac{-e^{-aW_T}}{a} \;\Big|\; (t, W_t)\right]\right\}$$

The Bellman Optimality Equation is:

$$V_t^*(W_t) = \max_{x_t} Q_t^*(W_t, x_t) = \max_{x_t}\left\{\mathbb{E}_{Y_t \sim N(\mu, \sigma^2)}[V_{t+1}^*(W_{t+1})]\right\}$$

for all t = 0, 1, . . . , T − 2, and

$$V_{T-1}^*(W_{T-1}) = \max_{x_{T-1}} Q_{T-1}^*(W_{T-1}, x_{T-1}) = \max_{x_{T-1}}\left\{\mathbb{E}_{Y_{T-1} \sim N(\mu, \sigma^2)}\left[\frac{-e^{-aW_T}}{a}\right]\right\}$$
where Q∗t is the Optimal Action-Value Function at time t for all t = 0, 1, . . . , T − 1.
We make an educated guess for the functional form of the Optimal Value Function as:

$$V_t^*(W_t) = -b_t \cdot e^{-c_t \cdot W_t} \quad (6.21)$$

where bt , ct are independent of the wealth Wt for all t = 0, 1, . . . , T − 1. Next, we express the
Bellman Optimality Equation using this functional form for the Optimal Value Function:

$$V_t^*(W_t) = \max_{x_t}\left\{\mathbb{E}_{Y_t \sim N(\mu, \sigma^2)}[-b_{t+1} \cdot e^{-c_{t+1} \cdot W_{t+1}}]\right\}$$

Using Equation (6.20), we can write this as:

$$V_t^*(W_t) = \max_{x_t}\left\{\mathbb{E}_{Y_t \sim N(\mu, \sigma^2)}[-b_{t+1} \cdot e^{-c_{t+1} \cdot (x_t \cdot (Y_t - r) + W_t \cdot (1+r))}]\right\}$$

The expectation of this exponential form (under the normal distribution) evaluates to:

$$V_t^*(W_t) = \max_{x_t}\left\{-b_{t+1} \cdot e^{-c_{t+1} \cdot (1+r) \cdot W_t - c_{t+1} \cdot (\mu-r) \cdot x_t + c_{t+1}^2 \cdot \frac{\sigma^2}{2} \cdot x_t^2}\right\} \quad (6.22)$$

Since Vt∗ (Wt ) = maxxt Q∗t (Wt , xt ), from Equation (6.22), we can infer the functional
form for Q∗t (Wt , xt ) in terms of bt+1 and ct+1 :

$$Q_t^*(W_t, x_t) = -b_{t+1} \cdot e^{-c_{t+1} \cdot (1+r) \cdot W_t - c_{t+1} \cdot (\mu-r) \cdot x_t + c_{t+1}^2 \cdot \frac{\sigma^2}{2} \cdot x_t^2} \quad (6.23)$$

Since the right-hand-side of the Bellman Optimality Equation (6.22) involves a max over
xt , we can say that the partial derivative of the term inside the max with respect to xt is 0.
This enables us to write the Optimal Allocation x∗t in terms of ct+1 , as follows:

$$-c_{t+1} \cdot (\mu - r) + \sigma^2 \cdot c_{t+1}^2 \cdot x_t^* = 0$$

$$\Rightarrow x_t^* = \frac{\mu - r}{\sigma^2 \cdot c_{t+1}} \quad (6.24)$$

Next we substitute this maximizing xt* in the Bellman Optimality Equation (Equation (6.22)):

$$V_t^*(W_t) = -b_{t+1} \cdot e^{-c_{t+1} \cdot (1+r) \cdot W_t - \frac{(\mu-r)^2}{2\sigma^2}}$$

But since

$$V_t^*(W_t) = -b_t \cdot e^{-c_t \cdot W_t}$$

we can write the following recursive equations for bt and ct :

$$b_t = b_{t+1} \cdot e^{-\frac{(\mu-r)^2}{2\sigma^2}}$$

$$c_t = c_{t+1} \cdot (1+r)$$

We can calculate bT−1 and cT−1 from the knowledge of the MDP Reward $\frac{-e^{-aW_T}}{a}$ (Utility
of Terminal Wealth) at time t = T , which will enable us to unroll the above recursions for
bt and ct for all t = 0, 1, . . . , T − 2.

$$V_{T-1}^*(W_{T-1}) = \max_{x_{T-1}}\left\{\mathbb{E}_{Y_{T-1} \sim N(\mu, \sigma^2)}\left[\frac{-e^{-aW_T}}{a}\right]\right\}$$

From Equation (6.20), we can write this as:

$$V_{T-1}^*(W_{T-1}) = \max_{x_{T-1}}\left\{\mathbb{E}_{Y_{T-1} \sim N(\mu, \sigma^2)}\left[\frac{-e^{-a(x_{T-1} \cdot (Y_{T-1} - r) + W_{T-1} \cdot (1+r))}}{a}\right]\right\}$$

Using the result in Equation (A.9) in Appendix A, we can write this as:

$$V_{T-1}^*(W_{T-1}) = \frac{-e^{-\frac{(\mu-r)^2}{2\sigma^2} - a \cdot (1+r) \cdot W_{T-1}}}{a}$$

Therefore,

$$b_{T-1} = \frac{e^{-\frac{(\mu-r)^2}{2\sigma^2}}}{a}$$

$$c_{T-1} = a \cdot (1+r)$$

Now we can unroll the above recursions for bt and ct for all t = 0, 1, . . . , T − 2 as:

$$b_t = \frac{e^{-\frac{(\mu-r)^2 \cdot (T-t)}{2\sigma^2}}}{a}$$

$$c_t = a \cdot (1+r)^{T-t}$$

Substituting the solution for ct+1 in Equation (6.24) gives us the solution for the Optimal
Policy:
$$\pi_t^*(W_t) = x_t^* = \frac{\mu - r}{\sigma^2 \cdot a \cdot (1+r)^{T-t-1}} \quad (6.25)$$
for all t = 0, 1, . . . , T −1. Note that the optimal action at time step t (for all t = 0, 1, . . . , T −
1) does not depend on the state Wt at time t (it only depends on the time t). Hence, the
optimal policy πt∗ (·) for a fixed time t is a constant deterministic policy function.
Substituting the solutions for bt and ct in Equation (6.21) gives us the solution for the
Optimal Value Function:
$$V_t^*(W_t) = \frac{-e^{-\frac{(\mu-r)^2(T-t)}{2\sigma^2}}}{a} \cdot e^{-a(1+r)^{T-t} \cdot W_t} \quad (6.26)$$
for all t = 0, 1, . . . , T − 1.
Substituting the solutions for bt+1 and ct+1 in Equation (6.23) gives us the solution for
the Optimal Action-Value Function:
$$Q_t^*(W_t, x_t) = \frac{-e^{-\frac{(\mu-r)^2(T-t-1)}{2\sigma^2}}}{a} \cdot e^{-a(1+r)^{T-t} \cdot W_t - a(\mu-r)(1+r)^{T-t-1} \cdot x_t + \frac{(a\sigma(1+r)^{T-t-1})^2}{2} \cdot x_t^2} \quad (6.27)$$
for all t = 0, 1, . . . , T − 1.
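
As a numerical sanity-check of this solution (an illustrative standalone sketch, not from the book's code repository; the values of a, µ, σ, r, T, t and the wealth are arbitrary), we can evaluate Q∗t (Wt , xt ) from Equation (6.27) on a grid of allocations xt and confirm that the maximizer matches Equation (6.25):

import numpy as np

# Evaluate the Optimal Action-Value Function of Equation (6.27) over a grid of
# allocations x_t and compare its maximizer against Equation (6.25).
a, mu, sigma, r = 1.0, 0.08, 0.2, 0.02
big_t, t, wealth = 5, 2, 1.0

def q_star(x: np.ndarray) -> np.ndarray:
    exponent = (
        -a * (1 + r) ** (big_t - t) * wealth
        - a * (mu - r) * (1 + r) ** (big_t - t - 1) * x
        + (a * sigma * (1 + r) ** (big_t - t - 1)) ** 2 * x ** 2 / 2
    )
    return -np.exp(-(mu - r) ** 2 * (big_t - t - 1) / (2 * sigma ** 2)) / a * np.exp(exponent)

x_grid = np.linspace(-2.0, 5.0, 7001)
x_star_grid = x_grid[np.argmax(q_star(x_grid))]
x_star_closed_form = (mu - r) / (sigma ** 2 * a * (1 + r) ** (big_t - t - 1))

print(x_star_grid, x_star_closed_form)   # both should be approximately 1.44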

6.5. Porting to Real-World


We have covered a continuous-time setting and a discrete-time setting with simplifying
assumptions that provide analytical tractability. The specific simplifying assumptions that
enabled analytical tractability were:

• Normal distribution of asset returns


• CRRA/CARA assumptions
• Frictionless markets/trading (no transaction costs, unconstrained and continuous
prices/allocation amounts/consumption)

But real-world problems involving dynamic asset-allocation and consumption are not
so simple and clean. We have arbitrary, more complex asset price movements. Utility
functions don’t fit into simple CRRA/CARA formulas. In practice, trading often occurs
in discrete space - asset prices, allocation amounts and consumption are often discrete
quantities. Moreover, when we change our asset allocations or liquidate a portion of our
portfolio to consume, we incur transaction costs. Furthermore, trading doesn’t always
happen in continuous-time - there are typically specific windows of time where one is
locked-out from trading or there are trading restrictions. Lastly, many investments are
illiquid (eg: real-estate) or simply not allowed to be liquidated until a certain horizon (eg:
retirement funds), which poses major constraints on extracting money from one’s portfolio
for consumption. So even though prices/allocation amounts/consumption might be close
to being continuous-variables, the other above-mentioned frictions mean that we don’t get
the benefits of calculus that we obtained in the simple examples we covered.
With the above real-world considerations, we need to tap into Dynamic Programming
- more specifically, Approximate Dynamic Programming since real-world problems have
large state spaces and large action spaces (even if these spaces are not continuous, they

tend to be close to continuous). Appropriate function approximation of the Value Function
is key to solving these problems. Implementing a full-blown real-world investment and
consumption management system is beyond the scope of this book, but let us implement
an illustrative example that provides sufficient understanding of how a full-blown real-
world example would be implemented. We have to keep things simple enough and yet
sufficiently general. So here is the setting we will implement for:

• One risky asset and one riskless asset.


• Finite number of time steps (discrete-time setting akin to Section 6.4).
• No consumption (i.e., no extraction from the investment portfolio) until the end of
the finite horizon, and hence, without loss of generality, we set the discount factor
equal to 1.
• Arbitrary distribution of return for the risky asset, allowing the distribution of re-
turns to vary over time (risky_return_distributions: Sequence[Distribution[float]]
in the code below).
• The return on the riskless asset, varying in time (riskless_returns: Sequence[float]
in the code below).
• Arbitrary Utility Function (utility_func: Callable[[float], float] in the code be-
low).
• Finite number of choices of investment amounts in the risky asset at each time step
(risky_alloc_choices: Sequence[float] in the code below).
• Arbitrary distribution of initial wealth W0 (initial_wealth_distribution:
Distribution[float] in the code below).

The code in the class AssetAllocDiscrete below is fairly self-explanatory. We use the
function back_opt_qvf covered in Section 4.9 of Chapter 4 to perform backward induc-
tion on the optimal Q-Value Function. Since the state space is continuous, the optimal
Q-Value Function is represented as a QValueFunctionApprox (specifically, as a DNNApprox).
Moreover, since we are working with a generic distribution of returns that govern the
state transitions of this MDP, we need to work with the methods of the abstract class
MarkovDecisionProcess (and not the class FiniteMarkovDecisionProcess). The method
backward_induction_qvf below makes the call to back_opt_qvf. Since the risky returns
distribution is arbitrary and since the utility function is arbitrary, we don’t have prior
knowledge of the functional form of the Q-Value function. Hence, the user of the class
AssetAllocDiscrete also needs to provide the set of feature functions (feature_functions
in the code below) and the specification of a deep neural network to represent the Q-
Value function (dnn_spec in the code below). The rest of the code below is mainly about
preparing the input mdp_f0_mu_triples to be passed to back_opt_qvf. As was explained
in Section 4.9 of Chapter 4, mdp_f0_mu_triples is a sequence (for each time step) of the
following triples:

• A MarkovDecisionProcess[float, float] object, which in the code below is prepared


by the method get_mdp. State is the portfolio wealth (float type) and Action is the
quantity of investment in the risky asset (also of float type). get_mdp creates a class
AssetAllocMDP that implements the abstract class MarkovDecisionProcess. To do so,
we need to implement the step method and the actions method. The step method
returns an instance of SampledDistribution, which is based on the sr_sampler_func
that returns a sample of the pair of next state (next time step’s wealth) and reward,
given the current state (current wealth) and action (current time step’s quantity of
investment in the risky asset).

• A QValueFunctionApprox[float, float] object, prepared by get_qvf_func_approx.
This method sets up a DNNApprox[Tuple[NonTerminal[float], float]] object that rep-
resents a neural-network function approximation for the optimal Q-Value Function.
So the input to this neural network would be a Tuple[NonTerminal[float], float]
representing a (state, action) pair.

• An NTStateDistribution[float] object prepared by get_states_distribution, which


returns a SampledDistribution[NonTerminal[float]] representing the distribution of
non-terminal states (distribution of portfolio wealth) at each time step.
The SampledDistribution[NonTerminal[float]] is prepared by states_sampler_func
that generates a sampling trace by sampling the state-transitions (portfolio wealth
transitions) from time 0 to the given time step in a time-incremental manner (in-
voking the sample method of the risky asset’s return Distributions and the sample
method of a uniform distribution over the action choices specified by risky_alloc_choices).
from rl.distribution import Distribution, SampledDistribution, Choose
from rl.function_approx import DNNSpec, AdamGradient, DNNApprox
from rl.approximate_dynamic_programming import back_opt_qvf, QValueFunctionApprox
from operator import itemgetter
import numpy as np

@dataclass(frozen=True)
class AssetAllocDiscrete:
    risky_return_distributions: Sequence[Distribution[float]]
    riskless_returns: Sequence[float]
    utility_func: Callable[[float], float]
    risky_alloc_choices: Sequence[float]
    feature_functions: Sequence[Callable[[Tuple[float, float]], float]]
    dnn_spec: DNNSpec
    initial_wealth_distribution: Distribution[float]

    def time_steps(self) -> int:
        return len(self.risky_return_distributions)

    def uniform_actions(self) -> Choose[float]:
        return Choose(self.risky_alloc_choices)

    def get_mdp(self, t: int) -> MarkovDecisionProcess[float, float]:
        distr: Distribution[float] = self.risky_return_distributions[t]
        rate: float = self.riskless_returns[t]
        alloc_choices: Sequence[float] = self.risky_alloc_choices
        steps: int = self.time_steps()
        utility_f: Callable[[float], float] = self.utility_func

        class AssetAllocMDP(MarkovDecisionProcess[float, float]):

            def step(
                self,
                wealth: NonTerminal[float],
                alloc: float
            ) -> SampledDistribution[Tuple[State[float], float]]:

                def sr_sampler_func(
                    wealth=wealth,
                    alloc=alloc
                ) -> Tuple[State[float], float]:
                    next_wealth: float = alloc * (1 + distr.sample()) \
                        + (wealth.state - alloc) * (1 + rate)
                    reward: float = utility_f(next_wealth) \
                        if t == steps - 1 else 0.
                    next_state: State[float] = Terminal(next_wealth) \
                        if t == steps - 1 else NonTerminal(next_wealth)
                    return (next_state, reward)

                return SampledDistribution(
                    sampler=sr_sampler_func,
                    expectation_samples=1000
                )

            def actions(self, wealth: NonTerminal[float]) -> Sequence[float]:
                return alloc_choices

        return AssetAllocMDP()

    def get_qvf_func_approx(self) -> \
            DNNApprox[Tuple[NonTerminal[float], float]]:

        adam_gradient: AdamGradient = AdamGradient(
            learning_rate=0.1,
            decay1=0.9,
            decay2=0.999
        )
        ffs: List[Callable[[Tuple[NonTerminal[float], float]], float]] = []
        for f in self.feature_functions:
            def this_f(pair: Tuple[NonTerminal[float], float], f=f) -> float:
                return f((pair[0].state, pair[1]))
            ffs.append(this_f)

        return DNNApprox.create(
            feature_functions=ffs,
            dnn_spec=self.dnn_spec,
            adam_gradient=adam_gradient
        )

    def get_states_distribution(self, t: int) -> \
            SampledDistribution[NonTerminal[float]]:

        actions_distr: Choose[float] = self.uniform_actions()

        def states_sampler_func() -> NonTerminal[float]:
            wealth: float = self.initial_wealth_distribution.sample()
            for i in range(t):
                distr: Distribution[float] = self.risky_return_distributions[i]
                rate: float = self.riskless_returns[i]
                alloc: float = actions_distr.sample()
                wealth = alloc * (1 + distr.sample()) + \
                    (wealth - alloc) * (1 + rate)
            return NonTerminal(wealth)

        return SampledDistribution(states_sampler_func)

    def backward_induction_qvf(self) -> \
            Iterator[QValueFunctionApprox[float, float]]:

        init_fa: DNNApprox[Tuple[NonTerminal[float], float]] = \
            self.get_qvf_func_approx()

        mdp_f0_mu_triples: Sequence[Tuple[
            MarkovDecisionProcess[float, float],
            DNNApprox[Tuple[NonTerminal[float], float]],
            SampledDistribution[NonTerminal[float]]
        ]] = [(
            self.get_mdp(i),
            init_fa,
            self.get_states_distribution(i)
        ) for i in range(self.time_steps())]

        num_state_samples: int = 300
        error_tolerance: float = 1e-6

        return back_opt_qvf(
            mdp_f0_mu_triples=mdp_f0_mu_triples,
            gamma=1.0,
            num_state_samples=num_state_samples,
            error_tolerance=error_tolerance
        )

The above code is in the file rl/chapter7/asset_alloc_discrete.py. We encourage you to
create a few different instances of AssetAllocDiscrete by varying its inputs (try different
return distributions, different utility functions, different action spaces). But how do we
know the code above is correct? We need a way to test it. A good test is to specialize the
inputs to fit the setting of Section 6.4 for which we have a closed-form solution to com-
pare against. So let us write some code to specialize the inputs to fit this setting. Since
the above code has been written with an educational motivation rather than an efficient-
computation motivation, the convergence of the backward induction ADP algorithm is
going to be slow. So we shall test it on a small number of time steps and provide some
assistance for fast convergence (using limited knowledge from the closed-form solution
in specifying the function approximation). We write code below to create an instance of
AssetAllocDiscrete with time steps T = 4, µ = 13%, σ = 20%, r = 7%, coefficient of
CARA a = 1.0. We set up risky_return_distributions as a sequence of identical Gaussian
distributions, riskless_returns as a sequence of identical riskless rate of returns, and
utility_func as a lambda parameterized by the coefficient of CARA a. We know from the
closed-form solution that the optimal allocation to the risky asset for each of time steps
t = 0, 1, 2, 3 is given by:
$$x^*_t = \frac{1.5}{1.07^{3-t}}$$
Therefore, we set risky_alloc_choices (action choices) in the range [1.0, 2.0] in incre-
ments of 0.1 to see if our code can hit the correct values within the 0.1 granularity of action
choices.
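As a quick check on this formula, here is a small standalone sketch (separate from the AssetAllocDiscrete code, using only the parameter values stated above) that evaluates the target allocations our backward induction should approximately reproduce:

# Standalone sketch: evaluate x*_t = (mu - r) / (a * sigma^2 * (1 + r)^(T-t-1))
# for T = 4, mu = 13%, sigma = 20%, r = 7%, a = 1.0 (the test parameters above)
mu, sigma, r, a, T = 0.13, 0.2, 0.07, 1.0, 4
base_alloc = (mu - r) / (a * sigma * sigma)  # = 1.5
for t in range(T):
    print(f"t = {t}: x* = {base_alloc / (1 + r) ** (T - t - 1):.3f}")
# prints approximately 1.224, 1.310, 1.402, 1.500 for t = 0, 1, 2, 3

These are the values we expect the ADP code to hit to within the 0.1 granularity of the action choices.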
To specify feature_functions and dnn_spec, we need to leverage the functional form of
the closed-form solution for the Action-Value function (i.e., Equation (6.27)). We observe
that we can write this as:

$$Q^*_t(W_t, x_t) = -sign(a) \cdot e^{-(\alpha_0 + \alpha_1 \cdot W_t + \alpha_2 \cdot x_t + \alpha_3 \cdot x_t^2)}$$

where

$$\alpha_0 = \frac{(\mu - r)^2 \cdot (T - t - 1)}{2\sigma^2} + \log(|a|)$$
$$\alpha_1 = a \cdot (1 + r)^{T-t}$$
$$\alpha_2 = a \cdot (\mu - r) \cdot (1 + r)^{T-t-1}$$
$$\alpha_3 = -\frac{(a \cdot \sigma \cdot (1 + r)^{T-t-1})^2}{2}$$

This means the function approximation for $Q^*_t$ can be set up with a neural network with no hidden layers, with the output layer activation function $g(S) = -sign(a) \cdot e^{-S}$, and with the feature functions:

$$\phi_1((W_t, x_t)) = 1$$
$$\phi_2((W_t, x_t)) = W_t$$
$$\phi_3((W_t, x_t)) = x_t$$
$$\phi_4((W_t, x_t)) = x_t^2$$
We set initial_wealth_distribution to be a normal distribution with a mean of init_wealth
(set equal to 1.0 below) and a standard deviation of init_wealth_stdev (set equal to a
small value of 0.1 below).

from rl.distribution import Gaussian

steps: int = 4
mu: float = 0.13
sigma: float = 0.2
r: float = 0.07
a: float = 1.0
init_wealth: float = 1.0
init_wealth_stdev: float = 0.1

excess: float = mu - r
var: float = sigma * sigma
base_alloc: float = excess / (a * var)

risky_ret: Sequence[Gaussian] = [Gaussian(mu=mu, sigma=sigma)
                                 for _ in range(steps)]
riskless_ret: Sequence[float] = [r for _ in range(steps)]
utility_function: Callable[[float], float] = lambda x: - np.exp(-a * x) / a
alloc_choices: Sequence[float] = np.linspace(
    2 / 3 * base_alloc,
    4 / 3 * base_alloc,
    11
)
feature_funcs: Sequence[Callable[[Tuple[float, float]], float]] = \
    [
        lambda _: 1.,
        lambda w_x: w_x[0],
        lambda w_x: w_x[1],
        lambda w_x: w_x[1] * w_x[1]
    ]
dnn: DNNSpec = DNNSpec(
    neurons=[],
    bias=False,
    hidden_activation=lambda x: x,
    hidden_activation_deriv=lambda y: np.ones_like(y),
    output_activation=lambda x: - np.sign(a) * np.exp(-x),
    output_activation_deriv=lambda y: -y
)
init_wealth_distr: Gaussian = Gaussian(
    mu=init_wealth,
    sigma=init_wealth_stdev
)

aad: AssetAllocDiscrete = AssetAllocDiscrete(
    risky_return_distributions=risky_ret,
    riskless_returns=riskless_ret,
    utility_func=utility_function,
    risky_alloc_choices=alloc_choices,
    feature_functions=feature_funcs,
    dnn_spec=dnn,
    initial_wealth_distribution=init_wealth_distr
)

Next, we perform the Q-Value backward induction, step through the returned iterator
(fetching the Q-Value function for each time step from t = 0 to t = T − 1), and evaluate
the Q-values at the init_wealth (for each time step) for all alloc_choices. Performing a
max and arg max over the alloc_choices at the init_wealth gives us the Optimal Value
function and the Optimal Policy for each time step for wealth equal to init_wealth.

from pprint import pprint

it_qvf: Iterator[QValueFunctionApprox[float, float]] = \
    aad.backward_induction_qvf()

for t, q in enumerate(it_qvf):
    print(f"Time {t:d}")
    print()
    opt_alloc: float = max(
        ((q((NonTerminal(init_wealth), ac)), ac) for ac in alloc_choices),
        key=itemgetter(0)
    )[1]
    val: float = max(q((NonTerminal(init_wealth), ac))
                     for ac in alloc_choices)
    print(f"Opt Risky Allocation = {opt_alloc:.3f}, Opt Val = {val:.3f}")
    print("Optimal Weights below:")
    for wts in q.weights:
        pprint(wts.weights)
    print()

This prints the following:

Time 0

Opt Risky Allocation = 1.200, Opt Val = -0.225


Optimal Weights below:
array([[ 0.13318188, 1.31299678, 0.07327264, -0.03000281]])

Time 1

Opt Risky Allocation = 1.300, Opt Val = -0.257


Optimal Weights below:
array([[ 0.08912411, 1.22479503, 0.07002802, -0.02645654]])

Time 2

Opt Risky Allocation = 1.400, Opt Val = -0.291


Optimal Weights below:
array([[ 0.03772409, 1.144612 , 0.07373166, -0.02566819]])

Time 3

Opt Risky Allocation = 1.500, Opt Val = -0.328


Optimal Weights below:
array([[ 0.00126822, 1.0700996 , 0.05798272, -0.01924149]])

Now let’s compare these results against the closed-form solution.


for t in range(steps):
    print(f"Time {t:d}")
    print()
    left: int = steps - t
    growth: float = (1 + r) ** (left - 1)
    alloc: float = base_alloc / growth
    val: float = - np.exp(- excess * excess * left / (2 * var)
                          - a * growth * (1 + r) * init_wealth) / a
    bias_wt: float = excess * excess * (left - 1) / (2 * var) + \
        np.log(np.abs(a))
    w_t_wt: float = a * growth * (1 + r)
    x_t_wt: float = a * excess * growth
    x_t2_wt: float = - var * (a * growth) ** 2 / 2
    print(f"Opt Risky Allocation = {alloc:.3f}, Opt Val = {val:.3f}")
    print(f"Bias Weight = {bias_wt:.3f}")
    print(f"W_t Weight = {w_t_wt:.3f}")
    print(f"x_t Weight = {x_t_wt:.3f}")
    print(f"x_t^2 Weight = {x_t2_wt:.3f}")
    print()

This prints the following:

Time 0

Opt Risky Allocation = 1.224, Opt Val = -0.225


Bias Weight = 0.135
W_t Weight = 1.311
x_t Weight = 0.074
x_t^2 Weight = -0.030

Time 1

Opt Risky Allocation = 1.310, Opt Val = -0.257


Bias Weight = 0.090
W_t Weight = 1.225
x_t Weight = 0.069
x_t^2 Weight = -0.026

Time 2

Opt Risky Allocation = 1.402, Opt Val = -0.291


Bias Weight = 0.045
W_t Weight = 1.145
x_t Weight = 0.064
x_t^2 Weight = -0.023

Time 3

Opt Risky Allocation = 1.500, Opt Val = -0.328


Bias Weight = 0.000
W_t Weight = 1.070
x_t Weight = 0.060
x_t^2 Weight = -0.020

As mentioned previously, this serves as a good test for the correctness of the implemen-
tation of AssetAllocDiscrete.
We need to point out here that the general case of dynamic asset allocation and consump-
tion for a large number of risky assets will involve a continuous-valued action space of
high dimension. This means ADP algorithms will have challenges in performing the
max / arg max calculation across this large and continuous action space. Even many of
the RL algorithms find it challenging to deal with very large action spaces. Sometimes we
can take advantage of the specifics of the control problem to overcome this challenge. But
in a general setting, these large/continuous action spaces require special types of RL algo-
rithms that are well suited to tackle such action spaces. One such class of RL algorithms
is Policy Gradient Algorithms that we shall learn in Chapter 12.

6.6. Key Takeaways from this Chapter
• A fundamental problem in Mathematical Finance is that of jointly deciding on A) op-
timal investment allocation (among risky and riskless investment assets) and B) op-
timal consumption, over a finite horizon. Merton, in his landmark paper from 1969,
provided an elegant closed-form solution under assumptions of continuous-time,
normal distribution of returns on the assets, CRRA utility, and frictionless transac-
tions.
• In a more general setting of the above problem, we need to model it as an MDP. If the
MDP is not too large and if the asset return distributions are known, we can employ
finite-horizon ADP algorithms to solve it. However, in typical real-world situations,
the action space can be quite large and the asset return distributions are unknown.
This points to RL, and specifically RL algorithms that are well suited to tackle large
action spaces (such as Policy Gradient Algorithms).

7. Derivatives Pricing and Hedging
In this chapter, we cover two applications of MDP Control regarding financial derivatives’
pricing and hedging (the word hedging refers to reducing or eliminating market risks as-
sociated with a derivative). The first application is to identify the optimal time/state to
exercise an American Option (a type of financial derivative) in an idealized market set-
ting (akin to the “frictionless” market setting of Merton’s Portfolio problem from Chapter
6). Optimal exercise of an American Option is the key to determining its fair price. The
second application is to identify the optimal hedging strategy for derivatives in real-world
situations (technically referred to as incomplete markets, a term we will define shortly). The
optimal hedging strategy of a derivative is the key to determining its fair price in the real-
world (incomplete market) setting. Both of these applications can be cast as Markov De-
cision Processes where the Optimal Policy gives the Optimal Hedging/Optimal Exercise
in the respective applications, leading to the fair price of the derivatives under considera-
tion. Casting these derivatives applications as MDPs means that we can tackle them with
Dynamic Programming or Reinforcement Learning algorithms, providing an interesting
and valuable alternative to the traditional methods of pricing derivatives.
In order to understand and appreciate the modeling of these derivatives applications as
MDPs, one requires some background in the classical theory of derivatives pricing. Unfor-
tunately, thorough coverage of this theory is beyond the scope of this book and we refer
you to Tomas Bjork’s book on Arbitrage Theory in Continuous Time (Björk 2005) for a
thorough understanding of this theory. We shall spend much of this chapter covering the
very basics of this theory, and in particular explaining the key technical concepts (such
as arbitrage, replication, risk-neutral measure, market-completeness etc.) in a simple and
intuitive manner. In fact, we shall cover the theory for the very simple case of discrete-
time with a single-period. While that is nowhere near enough to do justice to the rich
continuous-time theory of derivatives pricing and hedging, this is the best we can do in a
single chapter. The good news is that MDP-modeling of the two problems we want to solve
- optimal exercise of American Options and optimal hedging of derivatives in a real-world
(incomplete market) setting - doesn’t require one to have a thorough understanding of
the classical theory. Rather, an intuitive understanding of the key technical and economic
concepts should suffice, which we bring to life in the simple setting of discrete-time with
a single-period. We start this chapter with a quick introduction to derivatives, next we
describe the simple setting of a single-period with formal mathematical notation, cover-
ing the key concepts (arbitrage, replication, risk-neutral measure, market-completeness
etc.), state and prove the all-important fundamental theorems of asset pricing (only for
the single-period setting), and finally show how these two derivatives applications can be
cast as MDPs, along with the appropriate algorithms to solve the MDPs.

7.1. A Brief Introduction to Derivatives


If you are reading this book, you likely already have some familiarity with Financial Deriva-
tives (or at least have heard of them, given that derivatives were at the center of the 2008
financial crisis). In this section, we sketch an overview of financial derivatives and refer
you to the book by John Hull (Hull 2010) for a thorough coverage of Derivatives. The
term “Derivative” is based on the word “derived” - it refers to the fact that a derivative is
a financial instrument whose structure and hence, value is derived from the performance
of an underlying entity or entities (which we shall simply refer to as “underlying”). The
underlying can be pretty much any financial entity - it could be a stock, currency, bond,
basket of stocks, or something more exotic like another derivative. The term performance
also refers to something fairly generic - it could be the price of a stock or commodity, it
could be the interest rate a bond yields, it could be the average price of a stock over a time
interval, it could be a market-index, or it could be something more exotic like the implied
volatility of an option (which itself is a type of derivative). Technically, a derivative is a
legal contract between the derivative buyer and seller that either:

• Entitles the derivative buyer to cashflow (which we’ll refer to as derivative payoff ) at
future point(s) in time, with the payoff being contingent on the underlying’s perfor-
mance (i.e., the payoff is a precise mathematical function of the underlying’s perfor-
mance, eg: a function of the underlying’s price at a future point in time). This type
of derivative is known as a “lock-type” derivative.
• Provides the derivative buyer with choices at future points in time, upon making
which, the derivative buyer can avail of cashflow (i.e., payoff ) that is contingent on
the underlying’s performance. This type of derivative is known as an “option-type”
derivative (the word “option” refering to the choice or choices the buyer can make
to trigger the contingent payoff).

Although both “lock-type” and “option-type” derivatives can both get very complex
(with contracts running over several pages of legal descriptions), we now illustrate both
these types of derivatives by going over the most basic derivative structures. In the fol-
lowing descriptions, current time (when the derivative is bought/sold) is denoted as time
t = 0.

7.1.1. Forwards
The most basic form of Forward Contract involves specification of:

• A future point in time t = T (we refer to T as expiry of the forward contract).


• The fixed payment K to be made by the forward contract buyer to the seller at time
t = T.

In addition, the contract establishes that at time t = T , the forward contract seller needs
to deliver the underlying (say a stock with price St at time t) to the forward contract buyer.
This means at time t = T , effectively the payoff for the buyer is ST −K (likewise, the payoff
for the seller is K −ST ). This is because the buyer, upon receiving the underlying from the
seller, can immediately sell the underlying in the market for the price of ST and so, would
have made a gain of ST − K (note ST − K can be negative, in which case the payoff for the
buyer is negative).
The problem of forward contract “pricing” is to determine the fair value of K so that
the price of this forward contract derivative at the time of contract creation is 0. As time
t progresses, the underlying price might fluctuate, which would cause a movement away
from the initial price of 0. If the underlying price increases, the price of the forward would
naturally increase (and if the underlying price decreases, the price of the forward would
naturally decrease). This is an example of a “lock-type” derivative since neither the buyer
nor the seller of the forward contract needs to make any choices at time t = T . Rather, the
payoff for the buyer is determined directly by the formula ST − K and the payoff for the
seller is determined directly by the formula K − ST .

7.1.2. European Options


The most basic forms of European Options are European Call and Put Options. The most
basic European Call Option contract involves specification of:

• A future point in time t = T (we refer to T as the expiry of the Call Option).
• Underlying Price K known as Strike Price.

The contract gives the buyer (owner) of the European Call Option the right, but not the
obligation, to buy the underlying at time t = T at the price of K. Since the option owner
doesn’t have the obligation to buy, if the price ST of the underlying at time t = T ends
up being equal to or below K, the rational decision for the option owner would be to not
buy (at price K), which would result in a payoff of 0 (in this outcome, we say that the call
option is out-of-the-money). However, if ST > K, the option owner would make an instant
profit of ST − K by exercising her right to buy the underlying at the price of K. Hence, the
payoff in this case is ST − K (in this outcome, we say that the call option is in-the-money).
We can combine the two cases and say that the payoff is f (ST ) = max(ST −K, 0). Since the
payoff is always non-negative, the call option owner would need to pay for this privilege.
The amount the option owner would need to pay to own this call option is known as the
fair price of the call option. Identifying the value of this fair price is the highly celebrated
problem of Option Pricing (which you will learn more about as this chapter progresses).
A European Put Option is very similar to a European Call Option with the only differ-
ence being that the owner of the European Put Option has the right (but not the obliga-
tion) to sell the underlying at time t = T at the price of K. This means that the payoff is
f (ST ) = max(K − ST , 0). Payoffs for these Call and Put Options are known as “hockey-
stick” payoffs because if you plot the f (·) function, it is a flat line on the out-of-the-money
side and a sloped line on the in-the-money side. Such European Call and Put Options are
“Option-Type” (and not “Lock-Type”) derivatives since they involve a choice to be made
by the option owner (the choice of exercising the right to buy/sell at the Strike Price K).
However, it is possible to construct derivatives with the same payoff as these European
Call/Put Options by simply writing in the contract that the option owner will get paid
max(ST − K, 0) (in case of Call Option) or will get paid max(K − ST , 0) (in case of Put
Option) at time t = T . Such derivatives contracts do away with the option owner’s exer-
cise choice and hence, they are “Lock-Type” contracts. There is a subtle difference - setting
these derivatives up as “Option-Type” means the option owner might act “irrationally” -
the call option owner might mistakenly buy even if ST < K, or the call option owner might
for some reason forget/neglect to exercise her option even when ST > K. Setting up such
contracts as “Lock-Type” takes away the possibilities of these types of irrationalities from
the option owner.
A more general European Derivative involves an arbitrary function f (·) (generalizing
from the hockey-stick payoffs) and could be set up as “Option-Type” or “Lock-Type.”
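As a small illustration, the payoffs described above can be written as plain Python functions of the underlying's price at expiry. This is a standalone sketch, not part of the code developed later in this chapter, and the strike value used at the end is arbitrary:

# Standalone sketch of the payoff functions f(.) described above (illustrative only)
from typing import Callable

def forward_payoff(K: float) -> Callable[[float], float]:
    return lambda s: s - K           # forward contract buyer's payoff: S_T - K

def call_payoff(K: float) -> Callable[[float], float]:
    return lambda s: max(s - K, 0.)  # European Call "hockey-stick" payoff

def put_payoff(K: float) -> Callable[[float], float]:
    return lambda s: max(K - s, 0.)  # European Put "hockey-stick" payoff

f = call_payoff(100.)
print(f(120.), f(90.))  # 20.0 (in-the-money), 0.0 (out-of-the-money)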

7.1.3. American Options


The term “European” above refers to the fact that the option to exercise is available only at a
fixed point in time t = T . Even if it is set up as “Lock-Type,” the term “European” typically
means that the payoff can happen only at a fixed point in time t = T . This is in contrast to
American Options. The most basic forms of American Options are American Call and Put
Options. American Call and Put Options are essentially extensions of the corresponding
European Call and Put Options by allowing the buyer (owner) of the American Option to
exercise the option to buy (in the case of Call) or sell (in the case of Put) at any time t ≤ T .
The allowance of exercise at any time at or before the expiry time T can often be a tricky
financial decision for the option owner. At each point in time when the American Option
is in-the-money (i.e., positive payoff upon exercise), the option owner might be tempted to
exercise and collect the payoff but might as well be thinking that if she waits, the option
might become more in-the-money (i.e., prospect of a bigger payoff if she waits for a while).
Hence, it’s clear that an American Option is always of the “Option-Type” (and not “Lock-
Type”) since the timing of the decision (option) to exercise is very important in the case
of an American Option. This also means that the problem of pricing an American Option
(the fair price the buyer would need to pay to own an American Option) is much harder
than the problem of pricing an European Option.
So what purpose do derivatives serve? There are actually many motivations for different
market participants, but we’ll just list two key motivations. The first reason is to protect
against adverse market movements that might damage the value of one’s portfolio (this is
known as hedging). As an example, buying a put option can reduce or eliminate the risk
associated with ownership of the underlying. The second reason is operational or financial
convenience in trading to express a speculative view of market movements. For instance, if
one thinks a stock will increase in value by 50% over the next two years, instead of paying
say $100,000 to buy the stock (hoping to make $50,000 after two years), one can simply
buy a call option on $100,000 of the stock (paying the option price of say $5,000). If the
stock price indeed appreciates by 50% after 2 years, one makes $50,000 - $5,000 = $45,000.
Although one made $5000 less than the alternative of simply buying the stock, the fact that
one needs to pay $5000 (versus $50,000) to enter into the trade means the potential return
on investment is much higher.
Next, we embark on the journey of learning how to value derivatives, i.e., how to figure
out the fair price that one would be willing to buy or sell the derivative for at any point
in time. As mentioned earlier, the general theory of derivatives pricing is quite rich and
elaborate (based on continuous-time stochastic processes), and we don’t cover it in this
book. Instead, we provide intuition for the core concepts underlying derivatives pricing
theory in the context of a simple, special case - that of discrete-time with a single-period.
We formalize this simple setting in the next section.

7.2. Notation for the Single-Period Simple Setting


Our simple setting involves discrete time with a single-period from t = 0 to t = 1. Time
t = 0 has a single state which we shall refer to as the “Spot” state. Time t = 1 has n random
outcomes formalized by the sample space Ω = {ω1 , . . . , ωn }. The probability distribution
of this finite sample space is given by the probability mass function

µ : Ω → [0, 1]

such that
$$\sum_{i=1}^{n} \mu(\omega_i) = 1$$
This simple single-period setting involves $m + 1$ fundamental assets $A_0, A_1, \ldots, A_m$ where $A_0$ is a riskless asset (i.e., its price will evolve deterministically from $t = 0$ to $t = 1$) and $A_1, \ldots, A_m$ are risky assets. We denote the Spot Price (at $t = 0$) of $A_j$ as $S_j^{(0)}$ for all $j = 0, 1, \ldots, m$. We denote the Price of $A_j$ in random outcome $\omega_i$ as $S_j^{(i)}$ for all $j = 0, \ldots, m$ and $i = 1, \ldots, n$. Assume that all asset prices are real numbers, i.e., in $\mathbb{R}$ (negative prices are typically unrealistic, but we still allow them for simplicity of exposition). For convenience, we normalize the Spot Price (at $t = 0$) of the riskless asset $A_0$ to be 1. Therefore,

$$S_0^{(0)} = 1 \text{ and } S_0^{(i)} = 1 + r \text{ for all } i = 1, \ldots, n$$

where $r$ represents the constant riskless rate of growth. We should interpret this riskless rate of growth as the “time value of money” and $\frac{1}{1+r}$ as the riskless discount factor corresponding to the “time value of money.”

7.3. Portfolios, Arbitrage and Risk-Neutral Probability Measure


We define a portfolio as a vector $\theta = (\theta_0, \theta_1, \ldots, \theta_m) \in \mathbb{R}^{m+1}$, representing the number of units held in the assets $A_j, j = 0, 1, \ldots, m$. The Spot Value (at $t = 0$) of portfolio $\theta$, denoted by $V_\theta^{(0)}$, is:

$$V_\theta^{(0)} = \sum_{j=0}^{m} \theta_j \cdot S_j^{(0)} \quad (7.1)$$

The Value of portfolio $\theta$ in random outcome $\omega_i$ (at $t = 1$), denoted by $V_\theta^{(i)}$, is:

$$V_\theta^{(i)} = \sum_{j=0}^{m} \theta_j \cdot S_j^{(i)} \text{ for all } i = 1, \ldots, n \quad (7.2)$$
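To make this notation concrete, here is a minimal standalone sketch of Equations (7.1) and (7.2) for $m = 1$ risky asset and $n = 2$ outcomes; all numbers below are made up purely for illustration:

# Illustrative sketch of Equations (7.1) and (7.2) with made-up numbers
import numpy as np

r = 0.05
spot_prices = np.array([1.0, 100.0])          # S_j^(0) for j = 0 (riskless), 1 (risky)
outcome_prices = np.array([[1 + r, 80.0],      # S_j^(1) in outcome omega_1
                           [1 + r, 120.0]])    # S_j^(2) in outcome omega_2
theta = np.array([10.0, 0.5])                  # units held in A_0 and A_1

spot_value = theta @ spot_prices               # V_theta^(0), Equation (7.1)
outcome_values = outcome_prices @ theta        # V_theta^(i) for i = 1, 2, Equation (7.2)
print(spot_value, outcome_values)              # 60.0, [50.5, 70.5]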

Next, we cover an extremely important concept in Mathematical Economics/Finance -


the concept of Arbitrage. An Arbitrage Portfolio θ is one that “makes money from nothing.”
Formally, an arbitrage portfolio is a portfolio θ such that:

• $V_\theta^{(0)} \leq 0$

• $V_\theta^{(i)} \geq 0$ for all $i = 1, \ldots, n$

• There exists an $i \in \{1, \ldots, n\}$ such that $\mu(\omega_i) > 0$ and $V_\theta^{(i)} > 0$

Thus, with an Arbitrage Portfolio, we never end up (at t = 1) with less value than what we start with (at t = 0), and we end up with expected value strictly greater than what we start with. This is the formalism of the notion of arbitrage, i.e., “making money from
nothing.” Arbitrage allows market participants to make infinite returns. In an efficient
market, arbitrage would disappear as soon as it appears since market participants would
immediately exploit it and through the process of exploiting the arbitrage, immediately
eliminate the arbitrage. Hence, Finance Theory typically assumes “arbitrage-free” markets
(i.e., financial markets with no arbitrage opportunities).
Next, we describe another very important concept in Mathematical Economics/Finance
- the concept of a Risk-Neutral Probability Measure. Consider a Probability Distribution π :
Ω → [0, 1] such that

π(ωi ) = 0 if and only if µ(ωi ) = 0 for all i = 1, . . . , n

Then, $\pi$ is said to be a Risk-Neutral Probability Measure if:

$$S_j^{(0)} = \frac{1}{1+r} \cdot \sum_{i=1}^{n} \pi(\omega_i) \cdot S_j^{(i)} \text{ for all } j = 0, 1, \ldots, m \quad (7.3)$$

So for each of the m + 1 assets, the asset spot price (at t = 0) is the riskless rate-discounted
expectation (under π) of the asset price at t = 1. The term “risk-neutral” here is the same
as the term “risk-neutral” we used in Chapter 5, meaning it’s a situation where one doesn’t
need to be compensated for taking risk (the situation of a linear utility function). How-
ever, we are not saying that the market is risk-neutral - if that were the case, the market
probability measure µ would be a risk-neutral probability measure. We are simply defin-
ing π as a hypothetical construct under which each asset’s spot price is equal to the riskless
rate-discounted expectation (under π) of the asset’s price at t = 1. This means that under
the hypothetical π, there’s no return in excess of r for taking on the risk of probabilistic out-
comes at t = 1 (note: outcome probabilities are governed by the hypothetical π). Hence,
we refer to π as a risk-neutral probability measure. The purpose of this hypothetical con-
struct π is that it helps in the development of Derivatives Pricing and Hedging Theory, as
we shall soon see. The actual probabilities of outcomes in Ω are governed by µ, and not π.
Before we cover the two fundamental theorems of asset pricing, we need to cover an
important lemma that we will utilize in the proofs of the two fundamental theorems of
asset pricing.

Lemma 7.3.1. For any portfolio $\theta = (\theta_0, \theta_1, \ldots, \theta_m) \in \mathbb{R}^{m+1}$ and any risk-neutral probability measure $\pi : \Omega \rightarrow [0, 1]$,

$$V_\theta^{(0)} = \frac{1}{1+r} \cdot \sum_{i=1}^{n} \pi(\omega_i) \cdot V_\theta^{(i)}$$

Proof. Using Equations (7.1), (7.3) and (7.2), the proof is straightforward:

$$V_\theta^{(0)} = \sum_{j=0}^{m} \theta_j \cdot S_j^{(0)} = \sum_{j=0}^{m} \theta_j \cdot \frac{1}{1+r} \cdot \sum_{i=1}^{n} \pi(\omega_i) \cdot S_j^{(i)} = \frac{1}{1+r} \cdot \sum_{i=1}^{n} \pi(\omega_i) \cdot \sum_{j=0}^{m} \theta_j \cdot S_j^{(i)} = \frac{1}{1+r} \cdot \sum_{i=1}^{n} \pi(\omega_i) \cdot V_\theta^{(i)}$$

Now we are ready to cover the two fundamental theorems of asset pricing (sometimes,
also referred to as the fundamental theorems of arbitrage and the fundamental theorems of
finance!). We start with the first fundamental theorem of asset pricing, which associates
absence of arbitrage with existence of a risk-neutral probability measure.

7.4. First Fundamental Theorem of Asset Pricing (1st FTAP)


Theorem 7.4.1 (First Fundamental Theorem of Asset Pricing (1st FTAP)). Our simple set-
ting of discrete time with single-period will not admit arbitrage portfolios if and only if there exists
a Risk-Neutral Probability Measure.

Proof. First we prove the easy implication - if there exists a Risk-Neutral Probability Mea-
sure π, then we cannot have any arbitrage portfolios. Let’s review what it takes to have an
arbitrage portfolio θ = (θ0 , θ1 , . . . , θm ). The following are two of the three conditions to
be satisfied to qualify as an arbitrage portfolio θ (according to the definition of arbitrage
portfolio we gave above):
• $V_\theta^{(i)} \geq 0$ for all $i = 1, \ldots, n$

• There exists an $i \in \{1, \ldots, n\}$ such that $\mu(\omega_i) > 0$ ($\Rightarrow \pi(\omega_i) > 0$) and $V_\theta^{(i)} > 0$

But if these two conditions are satisfied, the third condition $V_\theta^{(0)} \leq 0$ cannot be satisfied because from Lemma (7.3.1), we know that:

$$V_\theta^{(0)} = \frac{1}{1+r} \cdot \sum_{i=1}^{n} \pi(\omega_i) \cdot V_\theta^{(i)}$$

which is strictly greater than 0, given the two conditions stated above. Hence, all three
conditions cannot be simultaneously satisfied which eliminates the possibility of arbitrage
for any portfolio θ.
Next, we prove the reverse (harder to prove) implication - if a risk-neutral probability
measure doesn’t exist, then there exists an arbitrage portfolio θ. We define V ⊂ Rm as the
set of vectors v = (v1 , . . . , vm ) such that

$$v_j = \frac{1}{1+r} \cdot \sum_{i=1}^{n} \mu(\omega_i) \cdot S_j^{(i)} \text{ for all } j = 1, \ldots, m$$

with V defined as spanning over all possible probability distributions $\mu : \Omega \rightarrow [0, 1]$. V is a bounded, closed, convex polytope in $\mathbb{R}^m$. By the definition of a risk-neutral probability measure, we can say that if a risk-neutral probability measure doesn't exist, the vector $(S_1^{(0)}, \ldots, S_m^{(0)}) \notin$ V. The Hyperplane Separation Theorem implies that there exists a non-zero vector $(\theta_1, \ldots, \theta_m)$ such that for any $v = (v_1, \ldots, v_m) \in$ V,

$$\sum_{j=1}^{m} \theta_j \cdot v_j > \sum_{j=1}^{m} \theta_j \cdot S_j^{(0)}$$

In particular, consider vectors $v$ corresponding to the corners of V, those for which the full probability mass is on a particular $\omega_i \in \Omega$, i.e.,

$$\sum_{j=1}^{m} \theta_j \cdot (\frac{1}{1+r} \cdot S_j^{(i)}) > \sum_{j=1}^{m} \theta_j \cdot S_j^{(0)} \text{ for all } i = 1, \ldots, n$$

Since this is a strict inequality, we will be able to choose a $\theta_0 \in \mathbb{R}$ such that:

$$\sum_{j=1}^{m} \theta_j \cdot (\frac{1}{1+r} \cdot S_j^{(i)}) > -\theta_0 > \sum_{j=1}^{m} \theta_j \cdot S_j^{(0)} \text{ for all } i = 1, \ldots, n$$

Therefore,

$$\frac{1}{1+r} \cdot \sum_{j=0}^{m} \theta_j \cdot S_j^{(i)} > 0 > \sum_{j=0}^{m} \theta_j \cdot S_j^{(0)} \text{ for all } i = 1, \ldots, n$$

This can be rewritten in terms of the Values of portfolio $\theta = (\theta_0, \theta_1, \ldots, \theta_m)$ at $t = 0$ and $t = 1$, as follows:

$$\frac{1}{1+r} \cdot V_\theta^{(i)} > 0 > V_\theta^{(0)} \text{ for all } i = 1, \ldots, n$$

Thus, we can see that all three conditions in the definition of arbitrage portfolio are satisfied and hence, $\theta = (\theta_0, \theta_1, \ldots, \theta_m)$ is an arbitrage portfolio.

Now we are ready to move on to the second fundamental theorem of asset pricing, which
associates replication of derivatives with a unique risk-neutral probability measure.

7.5. Second Fundamental Theorem of Asset Pricing (2nd FTAP)


Before we state and prove the 2nd FTAP, we need some definitions.
Definition 7.5.1. A Derivative D (in our simple setting of discrete-time with a single-
period) is specified as a vector payoff at time t = 1, denoted as:
$$(V_D^{(1)}, V_D^{(2)}, \ldots, V_D^{(n)})$$

where $V_D^{(i)}$ is the payoff of the derivative in random outcome $\omega_i$ for all $i = 1, \ldots, n$.

Definition 7.5.2. A Portfolio $\theta = (\theta_0, \theta_1, \ldots, \theta_m) \in \mathbb{R}^{m+1}$ is a Replicating Portfolio for derivative D if:

$$V_D^{(i)} = V_\theta^{(i)} = \sum_{j=0}^{m} \theta_j \cdot S_j^{(i)} \text{ for all } i = 1, \ldots, n \quad (7.4)$$

The negatives of the components (θ0 , θ1 , . . . , θm ) are known as the hedges for D since
they can be used to offset the risk in the payoff of D at t = 1.
Definition 7.5.3. An arbitrage-free market (i.e., a market devoid of arbitrage) is said to be
Complete if every derivative in the market has a replicating portfolio.
Theorem 7.5.1 (Second Fundamental Theorem of Asset Pricing (2nd FTAP)). A market (in
our simple setting of discrete-time with a single-period) is Complete if and only if there is a unique
Risk-Neutral Probability Measure.
Proof. We will first prove that in an arbitrage-free market, if every derivative has a replicat-
ing portfolio (i.e., the market is complete), then there is a unique risk-neutral probability
measure. We define n special derivatives (known as Arrow-Debreu securities), one for each
random outcome in Ω at t = 1. We define the time t = 1 payoff of Arrow-Debreu security
Dk (for each of k = 1, . . . , n) as follows:
$$V_{D_k}^{(i)} = \mathbb{I}_{i=k} \text{ for all } i = 1, \ldots, n$$

where I represents the indicator function. This means the payoff of derivative Dk is 1 for
random outcome ωk and 0 for all other random outcomes.
Since each derivative has a replicating portfolio, denote $\theta^{(k)} = (\theta_0^{(k)}, \theta_1^{(k)}, \ldots, \theta_m^{(k)})$ as the replicating portfolio for $D_k$ for each $k = 1, \ldots, n$. Therefore, for each $k = 1, \ldots, n$:

$$V_{\theta^{(k)}}^{(i)} = \sum_{j=0}^{m} \theta_j^{(k)} \cdot S_j^{(i)} = V_{D_k}^{(i)} = \mathbb{I}_{i=k} \text{ for all } i = 1, \ldots, n$$

Using Lemma (7.3.1), we can write the following equation for any risk-neutral probability measure $\pi$, for each $k = 1, \ldots, n$:

$$\sum_{j=0}^{m} \theta_j^{(k)} \cdot S_j^{(0)} = V_{\theta^{(k)}}^{(0)} = \frac{1}{1+r} \cdot \sum_{i=1}^{n} \pi(\omega_i) \cdot V_{\theta^{(k)}}^{(i)} = \frac{1}{1+r} \cdot \sum_{i=1}^{n} \pi(\omega_i) \cdot \mathbb{I}_{i=k} = \frac{1}{1+r} \cdot \pi(\omega_k)$$

We note that the above equation is satisfied for a unique $\pi : \Omega \rightarrow [0, 1]$, defined as:

$$\pi(\omega_k) = (1 + r) \cdot \sum_{j=0}^{m} \theta_j^{(k)} \cdot S_j^{(0)} \text{ for all } k = 1, \ldots, n$$

which implies that we have a unique risk-neutral probability measure.


Next, we prove the other direction of the 2nd FTAP. We need to prove that if there exists
a risk-neutral probability measure π and if there exists a derivative D with no replicating
portfolio, then we can construct a risk-neutral probability measure different than π.
Consider the following vectors in the vector space $\mathbb{R}^n$:

$$v = (V_D^{(1)}, \ldots, V_D^{(n)}) \text{ and } v_j = (S_j^{(1)}, \ldots, S_j^{(n)}) \text{ for all } j = 0, 1, \ldots, m$$

Since D does not have a replicating portfolio, v is not in the span of {v0 , v1 , . . . , vm }, which
means {v0 , v1 , . . . , vm } do not span Rn . Hence, there exists a non-zero vector u = (u1 , . . . , un ) ∈
Rn orthogonal to each of v0 , v1 , . . . , vm , i.e.,

$$\sum_{i=1}^{n} u_i \cdot S_j^{(i)} = 0 \text{ for all } j = 0, 1, \ldots, m \quad (7.5)$$

Note that $S_0^{(i)} = 1 + r$ for all $i = 1, \ldots, n$ and so,

$$\sum_{i=1}^{n} u_i = 0 \quad (7.6)$$

Define $\pi' : \Omega \rightarrow \mathbb{R}$ as follows (for some $\epsilon \in \mathbb{R}^+$):

$$\pi'(\omega_i) = \pi(\omega_i) + \epsilon \cdot u_i \text{ for all } i = 1, \ldots, n \quad (7.7)$$

To establish π ′ as a risk-neutral probability measure different than π, note:


• Since $\sum_{i=1}^{n} \pi(\omega_i) = 1$ and since $\sum_{i=1}^{n} u_i = 0$, $\sum_{i=1}^{n} \pi'(\omega_i) = 1$

• Construct $\pi'(\omega_i) > 0$ for each $i$ where $\pi(\omega_i) > 0$ by making $\epsilon > 0$ sufficiently small, and set $\pi'(\omega_i) = 0$ for each $i$ where $\pi(\omega_i) = 0$

• From Equations (7.7), (7.3) and (7.5), we have for each $j = 0, 1, \ldots, m$:

$$\frac{1}{1+r} \cdot \sum_{i=1}^{n} \pi'(\omega_i) \cdot S_j^{(i)} = \frac{1}{1+r} \cdot \sum_{i=1}^{n} \pi(\omega_i) \cdot S_j^{(i)} + \frac{\epsilon}{1+r} \cdot \sum_{i=1}^{n} u_i \cdot S_j^{(i)} = S_j^{(0)}$$

Together, the two FTAPs classify markets into:

• Market with arbitrage ⇔ No risk-neutral probability measure
• Complete (arbitrage-free) market ⇔ Unique risk-neutral probability measure
• Incomplete (arbitrage-free) market ⇔ Multiple risk-neutral probability measures

The next topic is derivatives pricing that is based on the concepts of replication of deriva-
tives and risk-neutral probability measures, and so is tied to the concepts of arbitrage and com-
pleteness.

7.6. Derivatives Pricing in Single-Period Setting


In this section, we cover the theory of derivatives pricing for our simple setting of discrete-
time with a single-period. To develop the theory of how to price a derivative, first we need
to define the notion of a Position.

Definition 7.6.1. A Position involving a derivative D is the combination of holding some


units in D and some units in the fundamental assets A0 , A1 , . . . , Am , which can be formally
represented as a vector γD = (α, θ0 , θ1 , . . . , θm ) ∈ Rm+2 where α denotes the units held in
derivative D and θj denotes the units held in Aj for all j = 0, 1, . . . , m.

Therefore, a Position is an extension of the Portfolio concept that includes a derivative.


Hence, we can naturally extend the definition of Portfolio Value to Position Value and we can
also extend the definition of Arbitrage Portfolio to Arbitrage Position.
We need to consider derivatives pricing in three market situations:

• When the market is complete


• When the market is incomplete
• When the market has arbitrage

7.6.1. Derivatives Pricing when Market is Complete


Theorem 7.6.1. For our simple setting of discrete-time with a single-period, if the market is complete, then any derivative D with replicating portfolio $\theta = (\theta_0, \theta_1, \ldots, \theta_m)$ has price at time $t = 0$ (denoted as value $V_D^{(0)}$):

$$V_D^{(0)} = V_\theta^{(0)} = \sum_{j=0}^{m} \theta_j \cdot S_j^{(0)} \quad (7.8)$$

Furthermore, if the unique risk-neutral probability measure is $\pi : \Omega \rightarrow [0, 1]$, then:

$$V_D^{(0)} = \frac{1}{1+r} \cdot \sum_{i=1}^{n} \pi(\omega_i) \cdot V_D^{(i)} \quad (7.9)$$

Proof. It seems quite reasonable that since $\theta$ is the replicating portfolio for D, the value of the replicating portfolio at time $t = 0$ (equal to $V_\theta^{(0)} = \sum_{j=0}^{m} \theta_j \cdot S_j^{(0)}$) should be the price (at $t = 0$) of derivative D. However, we will formalize the proof by first arguing that any candidate derivative price for D other than $V_\theta^{(0)}$ leads to arbitrage, thus dismissing those other candidate derivative prices, and then argue that with $V_\theta^{(0)}$ as the price of derivative D, we eliminate the possibility of an arbitrage position involving D.

Consider candidate derivative prices $V_\theta^{(0)} - x$ for any positive real number $x$. Position $(1, -\theta_0 + x, -\theta_1, \ldots, -\theta_m)$ has value $x \cdot (1+r) > 0$ in each of the random outcomes at $t = 1$. But this position has spot ($t = 0$) value of 0, which means this is an Arbitrage Position, rendering these candidate derivative prices invalid. Next consider candidate derivative prices $V_\theta^{(0)} + x$ for any positive real number $x$. Position $(-1, \theta_0 + x, \theta_1, \ldots, \theta_m)$ has value $x \cdot (1+r) > 0$ in each of the random outcomes at $t = 1$. But this position has spot ($t = 0$) value of 0, which means this is an Arbitrage Position, rendering these candidate derivative prices invalid as well. So every candidate derivative price other than $V_\theta^{(0)}$ is invalid. Now our goal is to establish $V_\theta^{(0)}$ as the derivative price of D by showing that we eliminate the possibility of an arbitrage position in the market involving D if $V_\theta^{(0)}$ is indeed the derivative price.

Firstly, note that $V_\theta^{(0)}$ can be expressed as the riskless rate-discounted expectation (under $\pi$) of the payoff of D at $t = 1$, i.e.,

$$V_\theta^{(0)} = \sum_{j=0}^{m} \theta_j \cdot S_j^{(0)} = \sum_{j=0}^{m} \theta_j \cdot \frac{1}{1+r} \cdot \sum_{i=1}^{n} \pi(\omega_i) \cdot S_j^{(i)} = \frac{1}{1+r} \cdot \sum_{i=1}^{n} \pi(\omega_i) \cdot \sum_{j=0}^{m} \theta_j \cdot S_j^{(i)} = \frac{1}{1+r} \cdot \sum_{i=1}^{n} \pi(\omega_i) \cdot V_D^{(i)} \quad (7.10)$$

Now consider an arbitrary portfolio $\beta = (\beta_0, \beta_1, \ldots, \beta_m)$. Define a position $\gamma_D = (\alpha, \beta_0, \beta_1, \ldots, \beta_m)$. Assuming the derivative price $V_D^{(0)}$ is equal to $V_\theta^{(0)}$, the Spot Value (at $t = 0$) of position $\gamma_D$, denoted $V_{\gamma_D}^{(0)}$, is:

$$V_{\gamma_D}^{(0)} = \alpha \cdot V_\theta^{(0)} + \sum_{j=0}^{m} \beta_j \cdot S_j^{(0)} \quad (7.11)$$

The Value of position $\gamma_D$ in random outcome $\omega_i$ (at $t = 1$), denoted $V_{\gamma_D}^{(i)}$, is:

$$V_{\gamma_D}^{(i)} = \alpha \cdot V_D^{(i)} + \sum_{j=0}^{m} \beta_j \cdot S_j^{(i)} \text{ for all } i = 1, \ldots, n \quad (7.12)$$

Combining the linearity in Equations (7.3), (7.10), (7.11) and (7.12), we get:

$$V_{\gamma_D}^{(0)} = \frac{1}{1+r} \cdot \sum_{i=1}^{n} \pi(\omega_i) \cdot V_{\gamma_D}^{(i)} \quad (7.13)$$

So the position spot value (at $t = 0$) is the riskless rate-discounted expectation (under $\pi$) of the position value at $t = 1$. For any $\gamma_D$ (containing any arbitrary portfolio $\beta$), with derivative price $V_D^{(0)}$ equal to $V_\theta^{(0)}$, if the following two conditions are satisfied:

• $V_{\gamma_D}^{(i)} \geq 0$ for all $i = 1, \ldots, n$

• There exists an $i \in \{1, \ldots, n\}$ such that $\mu(\omega_i) > 0$ ($\Rightarrow \pi(\omega_i) > 0$) and $V_{\gamma_D}^{(i)} > 0$

then:

$$V_{\gamma_D}^{(0)} = \frac{1}{1+r} \cdot \sum_{i=1}^{n} \pi(\omega_i) \cdot V_{\gamma_D}^{(i)} > 0$$

This eliminates any arbitrage possibility if D is priced at $V_\theta^{(0)}$.

To summarize, we have eliminated all candidate derivative prices other than $V_\theta^{(0)}$, and we have established the price $V_\theta^{(0)}$ as the correct price of D in the sense that we eliminate the possibility of an arbitrage position involving D if the price of D is $V_\theta^{(0)}$.

Finally, we note that with the derivative price $V_D^{(0)} = V_\theta^{(0)}$, from Equation (7.10), we have:

$$V_D^{(0)} = \frac{1}{1+r} \cdot \sum_{i=1}^{n} \pi(\omega_i) \cdot V_D^{(i)}$$

Now let us consider the special case of 1 risky asset ($m = 1$) and 2 random outcomes ($n = 2$), which we will show is a Complete Market. To lighten notation, we drop the subscript 1 on the risky asset price. Without loss of generality, we assume $S^{(1)} < S^{(2)}$. No-arbitrage requires:

$$S^{(1)} \leq (1+r) \cdot S^{(0)} \leq S^{(2)}$$

Assuming absence of arbitrage and invoking the 1st FTAP, there exists a risk-neutral probability measure $\pi$ such that:

$$S^{(0)} = \frac{1}{1+r} \cdot (\pi(\omega_1) \cdot S^{(1)} + \pi(\omega_2) \cdot S^{(2)})$$
$$\pi(\omega_1) + \pi(\omega_2) = 1$$

With 2 linear equations and 2 variables, this has a straightforward solution, as follows:

$$\pi(\omega_1) = \frac{S^{(2)} - (1+r) \cdot S^{(0)}}{S^{(2)} - S^{(1)}}$$
$$\pi(\omega_2) = \frac{(1+r) \cdot S^{(0)} - S^{(1)}}{S^{(2)} - S^{(1)}}$$

Conditions $S^{(1)} < S^{(2)}$ and $S^{(1)} \leq (1+r) \cdot S^{(0)} \leq S^{(2)}$ ensure that $0 \leq \pi(\omega_1), \pi(\omega_2) \leq 1$. Also note that this is a unique solution for $\pi(\omega_1), \pi(\omega_2)$, which means that the risk-neutral probability measure is unique, implying that this is a complete market.

We can use these probabilities to price a derivative D as:

$$V_D^{(0)} = \frac{1}{1+r} \cdot (\pi(\omega_1) \cdot V_D^{(1)} + \pi(\omega_2) \cdot V_D^{(2)})$$

Now let us try to form a replicating portfolio $(\theta_0, \theta_1)$ for D:

$$V_D^{(1)} = \theta_0 \cdot (1+r) + \theta_1 \cdot S^{(1)}$$
$$V_D^{(2)} = \theta_0 \cdot (1+r) + \theta_1 \cdot S^{(2)}$$

Solving this yields the Replicating Portfolio $(\theta_0, \theta_1)$ as follows:

$$\theta_0 = \frac{1}{1+r} \cdot \frac{V_D^{(1)} \cdot S^{(2)} - V_D^{(2)} \cdot S^{(1)}}{S^{(2)} - S^{(1)}} \text{ and } \theta_1 = \frac{V_D^{(2)} - V_D^{(1)}}{S^{(2)} - S^{(1)}} \quad (7.14)$$

Note that the derivative price can also be expressed as:

$$V_D^{(0)} = \theta_0 + \theta_1 \cdot S^{(0)}$$
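The following standalone sketch (with made-up values for $S^{(0)}, S^{(1)}, S^{(2)}, r$ and a call-style payoff) evaluates these formulas and confirms that risk-neutral pricing and replication-based pricing agree:

# Sketch: pricing a derivative in the 1-risky-asset, 2-outcome complete market
# using the risk-neutral probabilities above and Equation (7.14).
# All numbers below are made up purely for illustration.
r = 0.05
s0, s1, s2 = 100.0, 80.0, 120.0       # S^(0), and S^(1) < S^(2)
vd1, vd2 = 0.0, 20.0                   # call-style payoff max(S - 100, 0) in each outcome

pi1 = (s2 - (1 + r) * s0) / (s2 - s1)  # pi(omega_1)
pi2 = ((1 + r) * s0 - s1) / (s2 - s1)  # pi(omega_2)
price_rn = (pi1 * vd1 + pi2 * vd2) / (1 + r)

theta1 = (vd2 - vd1) / (s2 - s1)
theta0 = (vd1 * s2 - vd2 * s1) / ((1 + r) * (s2 - s1))
price_rep = theta0 + theta1 * s0

print(price_rn, price_rep)             # both approximately 11.905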
7.6.2. Derivatives Pricing when Market is Incomplete
Theorem (7.6.1) assumed a complete market, but what about an incomplete market? Re-
call that an incomplete market means some derivatives can’t be replicated. Absence of
a replicating portfolio for a derivative precludes usual no-arbitrage arguments. The 2nd
FTAP says that in an incomplete market, there are multiple risk-neutral probability mea-
sures which means there are multiple derivative prices (each consistent with no-arbitrage).
To develop intuition for derivatives pricing when the market is incomplete, let us con-
sider the special case of 1 risky asset (m = 1) and 3 random outcomes (n = 3), which we
will show is an Incomplete Market. To lighten notation, we drop the subscript 1 on the
risky asset price. Without loss of generality, we assume S (1) < S (2) < S (3) . No-arbitrage
requires:
$$S^{(1)} \leq S^{(0)} \cdot (1+r) \leq S^{(3)}$$

Assuming absence of arbitrage and invoking the 1st FTAP, there exists a risk-neutral probability measure $\pi$ such that:

$$S^{(0)} = \frac{1}{1+r} \cdot (\pi(\omega_1) \cdot S^{(1)} + \pi(\omega_2) \cdot S^{(2)} + \pi(\omega_3) \cdot S^{(3)})$$
$$\pi(\omega_1) + \pi(\omega_2) + \pi(\omega_3) = 1$$

So we have 2 equations and 3 variables, which implies there are multiple solutions for $\pi$. Each of these solutions for $\pi$ provides a valid price for a derivative D:

$$V_D^{(0)} = \frac{1}{1+r} \cdot (\pi(\omega_1) \cdot V_D^{(1)} + \pi(\omega_2) \cdot V_D^{(2)} + \pi(\omega_3) \cdot V_D^{(3)})$$

Now let us try to form a replicating portfolio $(\theta_0, \theta_1)$ for D:

$$V_D^{(1)} = \theta_0 \cdot (1+r) + \theta_1 \cdot S^{(1)}$$
$$V_D^{(2)} = \theta_0 \cdot (1+r) + \theta_1 \cdot S^{(2)}$$
$$V_D^{(3)} = \theta_0 \cdot (1+r) + \theta_1 \cdot S^{(3)}$$

3 equations and 2 variables implies there is no replicating portfolio for some D. This means this is an Incomplete Market.
So with multiple risk-neutral probability measures (and consequently, multiple deriva-
tive prices), how do we go about determining how much to buy/sell derivatives for? One
approach to handle derivative pricing in an incomplete market is the technique called Su-
perhedging, which provides upper and lower bounds for the derivative price. The idea
of Superhedging is to create a portfolio of fundamental assets whose Value dominates the
derivative payoff in all random outcomes at t = 1. Superhedging Price is the smallest pos-
sible Portfolio Spot (t = 0) Value among all such Derivative-Payoff-Dominating portfolios.
Without getting into too many details of the Superhedging technique (out of scope for this
book), we shall simply sketch the outline of this technique for our simple setting.
We note that for our simple setting of discrete-time with a single-period, this is a con-
strained linear optimization problem:

$$\min_\theta \sum_{j=0}^{m} \theta_j \cdot S_j^{(0)} \text{ such that } \sum_{j=0}^{m} \theta_j \cdot S_j^{(i)} \geq V_D^{(i)} \text{ for all } i = 1, \ldots, n \quad (7.15)$$

Let $\theta^* = (\theta_0^*, \theta_1^*, \ldots, \theta_m^*)$ be the solution to Equation (7.15). Let $SP$ be the Superhedging Price $\sum_{j=0}^{m} \theta_j^* \cdot S_j^{(0)}$.

After establishing feasibility, we define the Lagrangian $J(\theta, \lambda)$ as follows:

$$J(\theta, \lambda) = \sum_{j=0}^{m} \theta_j \cdot S_j^{(0)} + \sum_{i=1}^{n} \lambda_i \cdot (V_D^{(i)} - \sum_{j=0}^{m} \theta_j \cdot S_j^{(i)})$$

So there exists $\lambda = (\lambda_1, \ldots, \lambda_n)$ that satisfy the following KKT conditions:

$$\lambda_i \geq 0 \text{ for all } i = 1, \ldots, n$$
$$\lambda_i \cdot (V_D^{(i)} - \sum_{j=0}^{m} \theta_j^* \cdot S_j^{(i)}) = 0 \text{ for all } i = 1, \ldots, n \text{ (Complementary Slackness)}$$
$$\nabla_\theta J(\theta^*, \lambda) = 0 \Rightarrow S_j^{(0)} = \sum_{i=1}^{n} \lambda_i \cdot S_j^{(i)} \text{ for all } j = 0, 1, \ldots, m$$

This implies $\lambda_i = \frac{\pi(\omega_i)}{1+r}$ for all $i = 1, \ldots, n$ for a risk-neutral probability measure $\pi : \Omega \rightarrow [0, 1]$ ($\lambda$ can be thought of as “discounted probabilities”).

Define the Lagrangian Dual

$$L(\lambda) = \inf_\theta J(\theta, \lambda)$$

Then, the Superhedging Price

$$SP = \sum_{j=0}^{m} \theta_j^* \cdot S_j^{(0)} = \sup_\lambda L(\lambda) = \sup_\lambda \inf_\theta J(\theta, \lambda)$$

Complementary Slackness and some linear algebra over the space of risk-neutral probability measures $\pi : \Omega \rightarrow [0, 1]$ enables us to argue that:

$$SP = \sup_\pi \sum_{i=1}^{n} \frac{\pi(\omega_i)}{1+r} \cdot V_D^{(i)}$$
expectation of derivative payoff across each of the risk-neutral probability measures in the
incomplete market, which is quite an intuitive thing to do amidst multiple risk-neutral
probability measures.
Likewise, the Subhedging price $SB$ is defined as:

$$\max_\theta \sum_{j=0}^{m} \theta_j \cdot S_j^{(0)} \text{ such that } \sum_{j=0}^{m} \theta_j \cdot S_j^{(i)} \leq V_D^{(i)} \text{ for all } i = 1, \ldots, n$$

Similar arguments enable us to establish:

$$SB = \inf_\pi \sum_{i=1}^{n} \frac{\pi(\omega_i)}{1+r} \cdot V_D^{(i)}$$

This means the Subhedging Price is the highest lower-bound of the riskless rate-discounted
expectation of derivative payoff across each of the risk-neutral probability measures in the
incomplete market, which is quite an intuitive thing to do amidst multiple risk-neutral
probability measures.
So this technique provides a lower bound (SB) and an upper bound (SP ) for the
derivative price, meaning:

• A price outside these bounds leads to an arbitrage
• Valid prices must be established within these bounds

But often these bounds are not tight and so, not useful in practice.
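Before moving to the alternative approach, here is a standalone sketch of the superhedging and subhedging linear programs for a 1-risky-asset, 3-outcome market, using scipy.optimize.linprog purely for illustration; all the numbers below are made up:

# Sketch: superhedging (SP) and subhedging (SB) bounds via linear programming
# for a 1-risky-asset, 3-outcome market (made-up numbers, illustrative only).
import numpy as np
from scipy.optimize import linprog

r = 0.05
spot = np.array([1.0, 100.0])                     # S_j^(0) for the riskless and risky asset
outcomes = np.array([[1 + r, 80.0],                # S_j^(i) for i = 1, 2, 3
                     [1 + r, 100.0],
                     [1 + r, 120.0]])
payoff = np.maximum(outcomes[:, 1] - 100.0, 0.0)   # call-style payoff V_D^(i)

# Superhedge: min theta . spot  such that  outcomes @ theta >= payoff
sp = linprog(c=spot, A_ub=-outcomes, b_ub=-payoff, bounds=[(None, None)] * 2)
# Subhedge: max theta . spot  such that  outcomes @ theta <= payoff
sb = linprog(c=-spot, A_ub=outcomes, b_ub=payoff, bounds=[(None, None)] * 2)

print(sp.fun, -sb.fun)  # SP and SB (approximately 11.905 and 4.762 for these numbers)

Any no-arbitrage price for this payoff must lie between these two values, illustrating why the bounds alone may be too wide to pin down a single price.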
The alternative approach is to identify hedges that maximize Expected Utility of the
combination of the derivative along with its hedges, for an appropriately chosen mar-
ket/trader Utility Function (as covered in Chapter 5). The Utility function is a specifica-
tion of reward-versus-risk preference that effectively chooses the risk-neutral probability
measure (and hence, Price).
Consider a concave Utility function $U : \mathbb{R} \rightarrow \mathbb{R}$ applied to the Value in each random outcome $\omega_i, i = 1, \ldots, n$, at $t = 1$ (eg: $U(x) = \frac{1 - e^{-ax}}{a}$ where $a \in \mathbb{R}$ is the degree of risk-aversion). Let the real-world probabilities be given by $\mu : \Omega \rightarrow [0, 1]$. Denote $V_D = (V_D^{(1)}, \ldots, V_D^{(n)})$ as the payoff of Derivative D at $t = 1$. Let us say that you buy the
derivative D at t = 0 and will receive the random outcome-contingent payoff VD at t = 1.
Let x be the candidate derivative price for D, which means you will pay a cash quantity
of x at t = 0 for the privilege of receiving the payoff VD at t = 1. We refer to the candi-
date hedge as Portfolio θ = (θ0 , θ1 , . . . , θm ), representing the units held in the fundamental
assets.
Note that at t = 0, the cash quantity x you’d be paying to buy the derivative and the
cash quantity you’d be paying to buy the Portfolio θ should sum to 0 (note: either of these
cash quantities can be positive or negative, but they need to sum to 0 since “money can’t
just appear or disappear”). Formally,

$$x + \sum_{j=0}^{m} \theta_j \cdot S_j^{(0)} = 0 \quad (7.16)$$

Our goal is to solve for the appropriate values of x and θ based on an Expected Utility
consideration (that we are about to explain). Consider the Utility of the position consisting
of derivative D together with portfolio θ in random outcome ωi at t = 1:

$$U(V_D^{(i)} + \sum_{j=0}^{m} \theta_j \cdot S_j^{(i)})$$

So, the Expected Utility of this position at $t = 1$ is given by:

$$\sum_{i=1}^{n} \mu(\omega_i) \cdot U(V_D^{(i)} + \sum_{j=0}^{m} \theta_j \cdot S_j^{(i)}) \quad (7.17)$$

Noting that $S_0^{(0)} = 1$ and $S_0^{(i)} = 1 + r$ for all $i = 1, \ldots, n$, we can substitute for the value of $\theta_0 = -(x + \sum_{j=1}^{m} \theta_j \cdot S_j^{(0)})$ (obtained from Equation (7.16)) in the above Expected Utility expression (7.17), so as to rewrite this Expected Utility expression in terms of just $(\theta_1, \ldots, \theta_m)$ (call it $\theta_{1:m}$) as:

$$g(V_D, x, \theta_{1:m}) = \sum_{i=1}^{n} \mu(\omega_i) \cdot U(V_D^{(i)} - (1+r) \cdot x + \sum_{j=1}^{m} \theta_j \cdot (S_j^{(i)} - (1+r) \cdot S_j^{(0)}))$$

We define the Price of D as the “breakeven value” $x^*$ such that:

$$\max_{\theta_{1:m}} g(V_D, x^*, \theta_{1:m}) = \max_{\theta_{1:m}} g(0, 0, \theta_{1:m})$$

The core principle here (known as Expected-Utility-Indifference Pricing) is that introduc-
ing a t = 1 payoff of VD together with a derivative price payment of x∗ at t = 0 keeps the
Maximum Expected Utility unchanged.
The $(\theta_1^*, \ldots, \theta_m^*)$ that achieve $\max_{\theta_{1:m}} g(V_D, x^*, \theta_{1:m})$ and $\theta_0^* = -(x^* + \sum_{j=1}^{m} \theta_j^* \cdot S_j^{(0)})$ are the requisite hedges associated with the derivative price $x^*$. Note that the Price of $V_D$ will
NOT be the negative of the Price of −VD , hence these prices simply serve as bid prices or
ask prices, depending on whether one pays or receives the random outcomes-contingent
payoff VD .
To develop some intuition for what this solution looks like, let us now write some code
for the case of 1 risky asset (i.e., m = 1). To make things interesting, we will write code
for the case where the risky asset price at t = 1 (denoted S) follows a normal distribution
S ∼ N (µ, σ 2 ). This means we have a continuous (rather than discrete) set of values for
the risky asset price at t = 1. Since there are more than 2 random outcomes at time t = 1,
this is the case of an Incomplete Market. Moreover, we assume the CARA utility function:

$$U(y) = \frac{1 - e^{-a \cdot y}}{a}$$
where a is the CARA coefficient of risk-aversion.
We refer to the units of investment in the risky asset as α and the units of investment in
the riskless asset as β. Let S0 be the spot (t = 0) value of the risky asset (riskless asset
value at t = 0 is 1). Let f (S) be the payoff of the derivative D at t = 1. So, the price of
derivative D is the breakeven value x∗ such that:

$$\max_\alpha \mathbb{E}_{S \sim N(\mu, \sigma^2)}\left[\frac{1 - e^{-a \cdot (f(S) - (1+r) \cdot x^* + \alpha \cdot (S - (1+r) \cdot S_0))}}{a}\right] = \max_\alpha \mathbb{E}_{S \sim N(\mu, \sigma^2)}\left[\frac{1 - e^{-a \cdot \alpha \cdot (S - (1+r) \cdot S_0)}}{a}\right] \quad (7.18)$$
The maximizing value of α (call it α∗ ) on the left-hand-side of Equation (7.18) along
with β ∗ = −(x∗ + α∗ · S0 ) are the requisite hedges associated with the derivative price x∗ .
We set up a @dataclass MaxExpUtility with attributes to represent the risky asset spot
price S0 (risky_spot), the riskless rate r (riskless_rate), mean µ of S (risky_mean), stan-
dard deviation σ of S (risky_stdev), and the payoff function f (·) of the derivative (payoff_func).
@dataclass(frozen=True)
class MaxExpUtility:
    risky_spot: float  # risky asset price at t=0
    riskless_rate: float  # riskless asset price grows from 1 to 1+r
    risky_mean: float  # mean of risky asset price at t=1
    risky_stdev: float  # std dev of risky asset price at t=1
    payoff_func: Callable[[float], float]  # derivative payoff at t=1

Before we write code to solve the derivatives pricing and hedging problem for an incom-
plete market, let us write code to solve the problem for a complete market (as this will serve
as a good comparison against the incomplete market solution). For a complete market, the
risky asset has two random prices at t = 1: prices µ + σ and µ − σ, with probabilities of 0.5
each. As we’ve seen in Section 7.6.1, we can perfectly replicate a derivative payoff in this
complete market situation as it amounts to solving 2 linear equations in 2 unknowns (solu-
tion shown in Equation (7.14)). The number of units of the requisite hedges are simply the
negatives of the replicating portfolio units. The method complete_mkt_price_and_hedges
(of the MaxExpUtility class) shown below implements this solution, producing a dictio-
nary comprising the derivative price (price) and the hedge units α (alpha) and β (beta).

def complete_mkt_price_and_hedges(self) -> Mapping[str, float]:
    x = self.risky_mean + self.risky_stdev
    z = self.risky_mean - self.risky_stdev
    v1 = self.payoff_func(x)
    v2 = self.payoff_func(z)
    alpha = (v1 - v2) / (z - x)
    beta = - 1 / (1 + self.riskless_rate) * (v1 + alpha * x)
    price = - (beta + alpha * self.risky_spot)
    return {"price": price, "alpha": alpha, "beta": beta}

Next we write a helper method max_exp_util_for_zero (to handle the right-hand-side
of Equation (7.18)) that calculates the maximum expected utility for the special case of a
derivative with payoff equal to 0 in all random outcomes at t = 1, i.e., it calculates:

$$\max_\alpha \mathbb{E}_{S \sim N(\mu, \sigma^2)}\left[\frac{1 - e^{-a \cdot (-(1+r) \cdot c + \alpha \cdot (S - (1+r) \cdot S_0))}}{a}\right]$$
where c is cash paid at t = 0 (so, c = −(α · S0 + β)).
The method max_exp_util_for_zero accepts as input c: float (representing the cash c
paid at t = 0) and risk_aversion_param: float (representing the CARA coefficient of risk
aversion a). Referring to Section A.4.1 in Appendix A, we have a closed-form solution to
this maximization problem:

$$\alpha^* = \frac{\mu - (1+r) \cdot S_0}{a \cdot \sigma^2}$$
$$\beta^* = -(c + \alpha^* \cdot S_0)$$

Substituting $\alpha^*$ in the Expected Utility expression above gives the following maximum
value for the Expected Utility for this special case:

$$\frac{1 - e^{-a \cdot (-(1+r) \cdot c + \alpha^* \cdot (\mu - (1+r) \cdot S_0)) + \frac{(a \cdot \alpha^* \cdot \sigma)^2}{2}}}{a} = \frac{1 - e^{a \cdot (1+r) \cdot c - \frac{(\mu - (1+r) \cdot S_0)^2}{2\sigma^2}}}{a}$$
def max_exp_util_for_zero(
    self,
    c: float,
    risk_aversion_param: float
) -> Mapping[str, float]:
    ra = risk_aversion_param
    er = 1 + self.riskless_rate
    mu = self.risky_mean
    sigma = self.risky_stdev
    s0 = self.risky_spot
    alpha = (mu - s0 * er) / (ra * sigma * sigma)
    beta = - (c + alpha * self.risky_spot)
    max_val = (1 - np.exp(-ra * (-er * c + alpha * (mu - s0 * er))
                          + (ra * alpha * sigma) ** 2 / 2)) / ra
    return {"alpha": alpha, "beta": beta, "max_val": max_val}

Next we write a method max_exp_util that calculates the maximum expected utility for
the general case of a derivative with an arbitrary payoff $f(\cdot)$ at $t = 1$ (provided as input
pf: Callable[[float], float] below), i.e., it calculates:

$$\max_\alpha \mathbb{E}_{S \sim N(\mu, \sigma^2)}\left[\frac{1 - e^{-a \cdot (f(S) - (1+r) \cdot c + \alpha \cdot (S - (1+r) \cdot S_0))}}{a}\right]$$

Clearly, this has no closed-form solution since $f(\cdot)$ is an arbitrary payoff. The method
max_exp_util uses the scipy.integrate.quad function to calculate the expectation as an
integral of the CARA utility function of f (S) − (1 + r) · c + α · (S − (1 + r) · S0 ) multiplied
by the probability density of N (µ, σ 2 ), and then uses the scipy.optimize.minimize_scalar
function to perform the maximization over values of α.
from scipy.integrate import quad
from scipy.optimize import minimize_scalar

def max_exp_util(
    self,
    c: float,
    pf: Callable[[float], float],
    risk_aversion_param: float
) -> Mapping[str, float]:
    sigma2 = self.risky_stdev * self.risky_stdev
    mu = self.risky_mean
    s0 = self.risky_spot
    er = 1 + self.riskless_rate
    factor = 1 / np.sqrt(2 * np.pi * sigma2)
    integral_lb = self.risky_mean - self.risky_stdev * 6
    integral_ub = self.risky_mean + self.risky_stdev * 6

    def eval_expectation(alpha: float, c=c) -> float:

        def integrand(rand: float, alpha=alpha, c=c) -> float:
            payoff = pf(rand) - er * c \
                + alpha * (rand - er * s0)
            exponent = -(0.5 * (rand - mu) * (rand - mu) / sigma2
                         + risk_aversion_param * payoff)
            return (1 - factor * np.exp(exponent)) / risk_aversion_param

        return -quad(integrand, integral_lb, integral_ub)[0]

    res = minimize_scalar(eval_expectation)
    alpha_star = res["x"]
    max_val = - res["fun"]
    beta_star = - (c + alpha_star * s0)
    return {"alpha": alpha_star, "beta": beta_star, "max_val": max_val}

Finally, it’s time to put it all together - the method max_exp_util_price_and_hedge be-
low calculates the maximizing x∗ in Equation (7.18). First, we call max_exp_util_for_zero
(with c set to 0) to calculate the right-hand-side of Equation (7.18). Next, we create a wrap-
per function prep_func around max_exp_util, which is provided as input to scipy.optimize.root_scalar
to solve for $x^*$ in Equation (7.18). Plugging $x^*$ (opt_price in the code
below) in max_exp_util provides the hedges α∗ and β ∗ (alpha and beta in the code below).
from scipy.optimize import root_scalar

def max_exp_util_price_and_hedge(
    self,
    risk_aversion_param: float
) -> Mapping[str, float]:
    meu_for_zero = self.max_exp_util_for_zero(
        0.,
        risk_aversion_param
    )["max_val"]

    def prep_func(pr: float) -> float:
        return self.max_exp_util(
            pr,
            self.payoff_func,
            risk_aversion_param
        )["max_val"] - meu_for_zero

    lb = self.risky_mean - self.risky_stdev * 10
    ub = self.risky_mean + self.risky_stdev * 10
    payoff_vals = [self.payoff_func(x) for x in np.linspace(lb, ub, 1001)]
    lb_payoff = min(payoff_vals)
    ub_payoff = max(payoff_vals)
    opt_price = root_scalar(
        prep_func,
        bracket=[lb_payoff, ub_payoff],
        method="brentq"
    ).root
    hedges = self.max_exp_util(
        opt_price,
        self.payoff_func,
        risk_aversion_param
    )
    alpha = hedges["alpha"]
    beta = hedges["beta"]
    return {"price": opt_price, "alpha": alpha, "beta": beta}

The above code for the class MaxExpUtility is in the file rl/chapter8/max_exp_utility.py.
As ever, we encourage you to play with various choices of S0 , r, µ, σ, f to create instances
of MaxExpUtility, analyze the obtained prices/hedges, and plot some graphs to develop
intuition on how the results change as a function of the various inputs.
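For instance, a minimal usage sketch (assuming the MaxExpUtility class above is importable from rl/chapter8/max_exp_utility.py), with the inputs used in the discussion that follows, might look like this:

# Usage sketch: S0 = 100, r = 5%, mu = 110, sigma = 25, call option payoff with strike 105
meu = MaxExpUtility(
    risky_spot=100.0,
    riskless_rate=0.05,
    risky_mean=110.0,
    risky_stdev=25.0,
    payoff_func=lambda x: max(x - 105.0, 0)
)
print(meu.complete_mkt_price_and_hedges())
for a in (0.3, 0.6, 0.9):
    print(a, meu.max_exp_util_price_and_hedge(risk_aversion_param=a))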
Running this code for S0 = 100, r = 5%, µ = 110, σ = 25 when buying a call op-
tion (European since we have only one time period) with strike price = 105, the method
complete_mkt_price_and_hedges gives an option price of 11.43, risky asset hedge units of
-0.6 (i.e., we hedge the risk of owning the call option by short-selling 60% of the risky
asset) and riskless asset hedge units of 48.57 (i.e., we take the $60 proceeds of short-sale
less the $11.43 option price payment = $48.57 of cash and invest in a risk-free bank ac-
count earning 5% interest). As mentioned earlier, this is the perfect hedge if we had a
complete market (i.e., two random outcomes). Running this code for the same inputs for
an incomplete market (calling the method max_exp_util_price_and_hedge for risk-aversion
parameter values of a = 0.3, 0.6, 0.9) gives us the following results:

--- Risk Aversion Param = 0.30 ---
{'price': 23.279, 'alpha': -0.473, 'beta': 24.055}
--- Risk Aversion Param = 0.60 ---
{'price': 12.669, 'alpha': -0.487, 'beta': 35.998}
--- Risk Aversion Param = 0.90 ---
{'price': 8.865, 'alpha': -0.491, 'beta': 40.246}

We note that the call option price is quite high (23.28) when the risk-aversion is low at
a = 0.3 (relative to the complete market price of 11.43) but the call option price drops to
12.67 and 8.87 for a = 0.6 and a = 0.9 respectively. This makes sense since if you are more
risk-averse (high a), then you’d be less willing to take the risk of buying a call option and
hence, would want to pay less to buy the call option. Note how the risky asset short-sale
is significantly less (~47% - ~49%) compared to the risky asset short-sale of 60% in
the case of a complete market. The varying investments in the riskless asset (as a function
of the risk-aversion a) essentially account for the variation in option prices (as a function
of a). Figure 7.1 provides tremendous intuition on how the hedges work for the case of
a complete market and for the cases of an incomplete market with the 3 choices of risk-
aversion parameters. Note that we have plotted the negatives of the hedge portfolio values
at t = 1 so as to visualize them appropriately relative to the payoff of the call option. Note
that the hedge portfolio value is a linear function of the risky asset price at t = 1. Notice
how the slope and intercept of the hedge portfolio value changes for the 3 risk-aversion
scenarios and how they compare against the complete market hedge portfolio value.

Figure 7.1.: Hedges when buying a Call Option

Now let us consider the case of selling the same call option. In our code, the only change
we make is to make the payoff function lambda x: - max(x - 105.0, 0) instead of lambda
x: max(x - 105.0, 0) to reflect the fact that we are now selling the call option and so, our
payoff will be the negative of that of an owner of the call option.
With the same inputs of S0 = 100, r = 5%, µ = 110, σ = 25, and for the same risk-
aversion parameter values of a = 0.3, 0.6, 0.9, we get the following results:

--- Risk Aversion Param = 0.30 ---
{'price': -6.307, 'alpha': 0.527, 'beta': -46.395}
--- Risk Aversion Param = 0.60 ---
{'price': -32.317, 'alpha': 0.518, 'beta': -19.516}
--- Risk Aversion Param = 0.90 ---
{'price': -44.236, 'alpha': 0.517, 'beta': -7.506}

We note that the sale price demand for the call option is quite low (6.31) when the
risk-aversion is low at a = 0.3 (relative to the complete market price of 11.43) but the
sale price demand for the call option rises sharply to 32.32 and 44.24 for a = 0.6 and
a = 0.9 respectively. This makes sense since if you are more risk-averse (high a), then
you’d be less willing to take the risk of selling a call option and hence, would want to charge
more for the sale of the call option. Note how the risky asset hedge units are less (~52%
- 53%) compared to the risky asset hedge units (60%) in the case of a complete market.
The varying riskless borrowing amounts (as a function of the risk-aversion a) essentially
account for the variation in option prices (as a function of a). Figure 7.2 provides the visual
intuition on how the hedges work for the 3 choices of risk-aversion parameters (along with
the hedges for the complete market, for reference).
Note that each buyer and each seller might have a different level of risk-aversion, mean-
ing each of them would have a different buy price bid/different sale price ask. A transac-
tion can occur between a buyer and a seller (with potentially different risk-aversion levels)
if the buyer’s bid matches the seller’s ask.

Figure 7.2.: Hedges when selling a Call Option

7.6.3. Derivatives Pricing when Market has Arbitrage


Finally, we arrive at the case where the market has arbitrage. This is the case where there
is no risk-neutral probability measure and there can be multiple replicating portfolios
(which can lead to arbitrage). So this is the case where we are unable to price deriva-
tives. To provide intuition for the case of a market with arbitrage, we consider the special
case of 2 risky assets (m = 2) and 2 random outcomes (n = 2), which we will show is a
Market with Arbitrage. Without loss of generality, we assume $S_1^{(1)} < S_1^{(2)}$ and $S_2^{(1)} < S_2^{(2)}$.
Let us try to determine a risk-neutral probability measure $\pi$:

$$S_1^{(0)} = e^{-r} \cdot (\pi(\omega_1) \cdot S_1^{(1)} + \pi(\omega_2) \cdot S_1^{(2)})$$
$$S_2^{(0)} = e^{-r} \cdot (\pi(\omega_1) \cdot S_2^{(1)} + \pi(\omega_2) \cdot S_2^{(2)})$$
$$\pi(\omega_1) + \pi(\omega_2) = 1$$

3 equations and 2 variables implies that there is no risk-neutral probability measure $\pi$
for various sets of values of $S_1^{(1)}, S_1^{(2)}, S_2^{(1)}, S_2^{(2)}$. Let's try to form a replicating portfolio
$(\theta_0, \theta_1, \theta_2)$ for a derivative $D$:

$$V_D^{(1)} = \theta_0 \cdot e^r + \theta_1 \cdot S_1^{(1)} + \theta_2 \cdot S_2^{(1)}$$
$$V_D^{(2)} = \theta_0 \cdot e^r + \theta_1 \cdot S_1^{(2)} + \theta_2 \cdot S_2^{(2)}$$

2 equations and 3 variables implies that there are multiple replicating portfolios. Each
such replicating portfolio yields a price for $D$ as:

$$V_D^{(0)} = \theta_0 + \theta_1 \cdot S_1^{(0)} + \theta_2 \cdot S_2^{(0)}$$

Select two such replicating portfolios with different $V_D^{(0)}$. The combination of one of these
replicating portfolios with the negative of the other replicating portfolio is an Arbitrage
Portfolio because:

• They cancel off each other's portfolio value in each of the t = 1 states
• The combined portfolio value can be made to be negative at t = 0 (by appropriately
choosing the replicating portfolio to negate)

So this is a market that admits arbitrage (no risk-neutral probability measure).

7.7. Derivatives Pricing in Multi-Period/Continuous-Time Settings

Now that we have understood the key concepts of derivatives pricing/hedging for the
simple setting of discrete-time with a single-period, it’s time to do an overview of deriva-
tives pricing/hedging theory in the full-blown setting of multiple time-periods and in
continuous-time. While an adequate coverage of this theory is beyond the scope of this
book, we will sketch an overview in this section. Along the way, we will cover two deriva-
tives pricing applications that can be modeled as MDPs (and hence, tackled with Dynamic
Programming or Reinforcement Learning Algorithms).
The good news is that much of the concepts we learnt for the single-period setting carry
over to multi-period and continuous-time settings. The key difference in going over from
single-period to multi-period is that we need to adjust the replicating portfolio (i.e., ad-
just θ) at each time step. Other than this difference, the concepts of arbitrage, risk-neutral
probability measures, complete market etc. carry over. In fact, the two fundamental theo-
rems of asset pricing also carry over. It is indeed true that in the multi-period setting, no-
arbitrage is equivalent to the existence of a risk-neutral probability measure and market
completeness (i.e., replication of derivatives) is equivalent to having a unique risk-neutral
probability measure.

7.7.1. Multi-Period Complete-Market Setting


We learnt in the single-period setting that if the market is complete, there are two equiv-
alent ways to conceptualize derivatives pricing:

• Solve for the replicating portfolio (i.e., solve for the units in the fundamental assets
that would replicate the derivative payoff), and then calculate the derivative price as
the value of this replicating portfolio at t = 0.
• Calculate the probabilities of random-outcomes for the unique risk-neutral probabil-
ity measure, and then calculate the derivative price as the riskless rate-discounted
expectation (under this risk-neutral probability measure) of the derivative payoff.

It turns out that even in the multi-period setting, when the market is complete, we can
calculate the derivative price (not just at t = 0, but at any random outcome at any future
time) with either of the above two (equivalent) methods, as long as we appropriately
adjust the fundamental assets’ units in the replicating portfolio (depending on the random
outcome) as we move from one time step to the next. It is important to note that when we
alter the fundamental assets’ units in the replicating portfolio at each time step, we need to
respect the constraint that money cannot enter or leave the replicating portfolio (i.e., it is a
self-financing replicating portfolio with the replicating portfolio value remaining unchanged
in the process of altering the units in the fundamental assets). It is also important to note
that the alteration in units in the fundamental assets is dependent on the prices of the
fundamental assets (which are random outcomes as we move forward from one time step

to the next). Hence, the fundamental assets’ units in the replicating portfolio evolve as
random variables, while respecting the self-financing constraint. Therefore, the replicating
portfolio in a multi-period setting is often referred to as a Dynamic Self-Financing Replicating
Portfolio to reflect the fact that the replicating portfolio is adapting to the changing prices of
the fundamental assets. The negatives of the fundamental assets’ units in the replicating
portfolio form the hedges for the derivative.
To ensure that the market is complete in a multi-period setting, we need to assume that
the market is “frictionless” - that we can trade in real-number quantities in any fundamen-
tal asset and that there are no transaction costs for any trades at any time step. From a com-
putational perspective, we walk back in time from the final time step (call it t = T ) to t = 0,
and calculate the fundamental assets’ units in the replicating portfolio in a “backward re-
cursive manner.” As in the case of the single-period setting, each backward-recursive step
from outcomes at time t + 1 to a specific outcome at time t simply involves solving a linear
system of equations where each unknown is the replicating portfolio units in a specific
fundamental asset and each equation corresponds to the value of the replicating portfolio
at a specific outcome at time t + 1 (which is established recursively). The market is com-
plete if there is a unique solution to each linear system of equations (for each time t and
for each outcome at time t) in this backward-recursive computation. This gives us not just
the replicating portfolio (and consequently, hedges) at each outcome at each time step,
but also the price at each outcome at each time step (the price is equal to the value of the
calculated replicating portfolio at that outcome at that time step).
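To make one such backward-recursive step concrete, here is a minimal sketch (our own illustration, not code from the book's library) for the simplest case of one riskless asset growing by a factor of $1 + r$ per time step and one risky asset with two time-(t+1) outcomes; the function name replicate_at_node and all numeric inputs are hypothetical:

import numpy as np

def replicate_at_node(
    riskless_growth: float,  # growth factor of the riskless asset over one step (e.g., 1 + r)
    risky_price: float,      # risky asset price at this time-t outcome
    risky_up: float,         # risky asset price at the "up" time-(t+1) outcome
    risky_down: float,       # risky asset price at the "down" time-(t+1) outcome
    value_up: float,         # derivative value already computed at the "up" outcome
    value_down: float        # derivative value already computed at the "down" outcome
):
    # Unknowns: theta0 (units of the riskless asset), theta1 (units of the risky asset).
    # Each equation: replicating portfolio value at a time-(t+1) outcome equals the
    # (recursively established) derivative value at that outcome.
    a = np.array([[riskless_growth, risky_up],
                  [riskless_growth, risky_down]])
    b = np.array([value_up, value_down])
    theta0, theta1 = np.linalg.solve(a, b)
    # The derivative value at this node is the value of the replicating portfolio here
    node_value = theta0 + theta1 * risky_price
    return theta0, theta1, node_value

# One step of a 2-outcome market (numbers are illustrative only)
print(replicate_at_node(1.05, 100.0, 135.0, 85.0, 30.0, 0.0))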
Equivalently, we can do a backward-recursive calculation in terms of the risk-neutral
probability measures, with each risk-neutral probability measure giving us the transition
probabilities from an outcome at time step t to outcomes at time step t+1. Again, in a com-
plete market, it amounts to a unique solution of each of these linear system of equations.
For each of these linear system of equations, an unknown is a transition probability to a
time t + 1 outcome and an equation corresponds to a specific fundamental asset’s prices
at the time t + 1 outcomes. This calculation is popularized (and easily understood) in the
simple context of a Binomial Options Pricing Model. We devote Section 7.8 to coverage of
the original Binomial Options Pricing Model and model it as a Finite-State Finite-Horizon
MDP (and utilize the Finite-Horizon DP code developed in Chapter 3 to solve the MDP).

7.7.2. Continuous-Time Complete-Market Setting


To move on from multi-period to continuous-time, we simply make the time-periods smaller
and smaller, and take the limit of the time-period tending to zero. We need to preserve the
complete-market property as we do this, which means that we can trade in real-number
units without transaction costs in continuous-time. As we’ve seen before, operating in
continuous-time allows us to tap into stochastic calculus, which forms the foundation of
much of the rich theory of continuous-time derivatives pricing/hedging. With this very
rough and high-level overview, we refer you to Tomas Bjork’s book on Arbitrage Theory
in Continuous Time (Björk 2005) for a thorough understanding of this theory.
To provide a sneak-peek into this rich continuous-time theory, we’ve sketched in Ap-
pendix E the derivation of the famous Black-Scholes equation and its solution for the case
of European Call and Put Options.
So to summarize, we are in good shape to price/hedge in a multi-period and continuous-
time setting if the market is complete. But what if the market is incomplete (which is typ-
ical in a real-world situation)? Founded on the Fundamental Theorems of Asset Pricing
(which applies to multi-period and continuous-time settings as well), there is indeed con-

siderable literature on how to price in incomplete markets for multi-period/continuous-
time, which includes the superhedging approach as well as the Expected-Utility-Indifference
approach, that we had covered in Subsection 7.6.2 for the simple setting of discrete-time
with single-period. However, in practice, these approaches are not adopted as they fail to
capture real-world nuances adequately. Besides, most of these approaches lead to fairly
wide price bounds that are not particularly useful in practice. In Section 7.10, we extend
the Expected-Utility-Indifference approach that we had covered for the single-period setting
to the multi-period setting. It turns out that this approach can be modeled as an MDP,
with the adjustments to the hedge quantities at each time step as the actions of the MDP -
calculating the optimal policy gives us the optimal derivative hedging strategy and the as-
sociated optimal value function gives us the derivative price. This approach is applicable
to real-world situations and one can even incorporate all the real-world frictions in one’s
MDP to build a practical solution for derivatives trading (covered in Section 7.10).

7.8. Optimal Exercise of American Options cast as a Finite MDP


In this section, we tackle the pricing of American Options in a discrete-time, multi-period
setting, assuming the market is complete. To satisfy market completeness, we need to
assume that the market is “frictionless” - that we can trade in real-number quantities in
any fundamental asset and that there are no transaction costs for any trades at any time
step. In particular, we employ the Binomial Options Pricing Model to solve for the price
(and hedges) of American Options. The original Binomial Options Pricing Model was
developed to price (and hedge) options (including American Options) on an underly-
ing whose price evolves according to a lognormal stochastic process, with the stochastic
process approximated in the form of a simple discrete-time, finite-horizon, finite-states
process that enables enormous computational tractability. The lognormal stochastic pro-
cess is basically of the same form as the stochastic process of the underlying price in the
Black-Scholes model (covered in Appendix E). However, the underlying price process in
the Black-Scholes model is specified in the real-world probability measure whereas here
we specify the underlying price process in the risk-neutral probability measure. This is
because here we will employ the pricing method of riskless rate-discounted expectation
(under the risk-neutral probability measure) of the option payoff. Recall that in the single-
period setting, the underlying asset price’s expected rate of growth is calibrated to be equal
to the riskless rate r, under the risk-neutral probability measure. This calibration applies
even in the multi-period and continuous-time settings. For a continuous-time lognormal
stochastic process, the lognormal drift will hence be equal to r in the risk-neutral probabil-
ity measure (rather than µ in the real-world probability measure, as per the Black-Scholes
model). Precisely, the stochastic process S for the underlying price in the risk-neutral
probability measure is:

$$dS_t = r \cdot S_t \cdot dt + \sigma \cdot S_t \cdot dz_t \qquad (7.19)$$

where σ is the lognormal dispersion (often referred to as "lognormal volatility" - we will


simply call it volatility for the rest of this section). If you want to develop a thorough un-
derstanding of the broader topic of change of probability measures and how it affects the
drift term (beyond the scope of this book, but an important topic in continuous-time finan-
cial pricing theory), we refer you to the technical material on Radon-Nikodym Derivative
and Girsanov Theorem.
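To build intuition for Equation (7.19), here is a minimal sketch (our own illustration; the function name and numeric inputs are hypothetical) that samples the underlying price at a future time $t$ under the risk-neutral probability measure, using the lognormal solution $\log(\frac{S_t}{S_0}) \sim N((r - \frac{\sigma^2}{2}) \cdot t, \sigma^2 \cdot t)$ referenced below:

import numpy as np

def sample_risk_neutral_prices(
    s0: float, r: float, sigma: float, t: float, num_samples: int, seed: int = 0
) -> np.ndarray:
    # Exact solution of dS = r*S*dt + sigma*S*dz: log(S_t / S_0) ~ N((r - sigma^2/2)*t, sigma^2*t)
    z = np.random.default_rng(seed).standard_normal(num_samples)
    return s0 * np.exp((r - sigma * sigma / 2) * t + sigma * np.sqrt(t) * z)

# Under the risk-neutral measure, the discounted expected price equals the spot price
samples = sample_risk_neutral_prices(s0=100.0, r=0.05, sigma=0.25, t=1.0, num_samples=1_000_000)
print(np.exp(-0.05 * 1.0) * samples.mean())  # close to 100.0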

Figure 7.3.: Binomial Option Pricing Model (Binomial Tree)

The Binomial Options Pricing Model serves as a discrete-time, finite-horizon, finite-


states approximation to this continuous-time process, and is essentially an extension of
the single-period model we had covered earlier for the case of a single fundamental risky
asset. We’ve learnt previously that in the single-period case for a single fundamental risky
asset, in order to be a complete market, we need to have exactly two random outcomes.
We basically extend this “two random outcomes” pattern to each outcome at each time
step, by essentially growing out a “binary tree.” But there is a caveat - with a binary tree,
we end up with an exponential ($2^i$) number of outcomes after $i$ time steps. To contain the
exponential growth, we construct a “recombining tree,” meaning an “up move” followed
by a “down move” ends up in the same underlying price outcome as a “down move” fol-
lowed by an “up move” (as illustrated in Figure 7.3). Thus, we have i + 1 price outcomes
after i time steps in this “recombining tree.” We conceptualize the ascending-sorted se-
quence of i + 1 price outcomes as the (time step = i) states Si = {0, 1, . . . , i} (since the
price movements form a discrete-time, finite-states Markov Process). Since we are model-
ing a lognormal process, we model the discrete-time price moves as multiplicative to the
price. We denote Si,j as the price after i time steps in state j (for any i ∈ Z≥0 and for
any 0 ≤ j ≤ i). So the two random prices resulting from Si,j are Si+1,j+1 = Si,j · u and
Si+1,j = Si,j · d for some constants u and d (that are calibrated). The important point is
that u and d remain constant across time steps i and across states j at each time step i (as
seen in Figure 7.3).
Let q be the probability of the “up move” (typically, we use p to denote real-world prob-
ability and q to denote the risk-neutral probability) so that 1 − q is the probability of the
“down move.” Just like u and d, the value of q is kept constant across time steps i and across
states j at each time step i (as seen in Figure 7.3). q, u and d need to be calibrated so that
the probability distribution of log-price-ratios $\{\log(\frac{S_{i,0}}{S_{0,0}}), \log(\frac{S_{i,1}}{S_{0,0}}), \ldots, \log(\frac{S_{i,i}}{S_{0,0}})\}$ after
$i$ time steps (with each time step of interval $\frac{T}{n}$ for a given expiry time $T \in \mathbb{R}^+$ and a fixed
number of time steps $n \in \mathbb{Z}^+$) serves as a good approximation to $N((r - \frac{\sigma^2}{2}) \cdot \frac{iT}{n}, \frac{\sigma^2 \cdot iT}{n})$ (that
we know to be the risk-neutral probability distribution of $\log(\frac{S_{iT/n}}{S_0})$ in the continuous-time
process defined by Equation (7.19), as derived in Section C.7 in Appendix C), for all
$i = 0, 1, \ldots, n$. Note that the starting price $S_{0,0}$ of this discrete-time approximation process
is equal to the starting price $S_0$ of the continuous-time process.
This calibration of q, u and d can be done in a variety of ways and there are indeed several
variants of Binomial Options Pricing Models with different choices of how q, u and d are
calibrated. We shall implement the choice made in the original Binomial Options Pricing
Model that was proposed in a seminal paper by Cox, Ross, Rubinstein (Cox, Ross, and
Rubinstein 1979). Their choice is best understood in two steps:

• As a first step, ignore the drift term $r \cdot S_t \cdot dt$ of the lognormal process, and assume the
underlying price follows the martingale process $dS_t = \sigma \cdot S_t \cdot dz_t$. They chose $d$ to be
equal to $\frac{1}{u}$ and calibrated $u$ such that for any $i \in \mathbb{Z}_{\geq 0}$, for any $0 \leq j \leq i$, the variance of the
two equal-probability random outcomes $\log(\frac{S_{i+1,j+1}}{S_{i,j}}) = \log(u)$ and $\log(\frac{S_{i+1,j}}{S_{i,j}}) = \log(d) = -\log(u)$
is equal to the variance $\frac{\sigma^2 T}{n}$ of the normally-distributed random
variable $\log(\frac{S_{t+T/n}}{S_t})$ for any $t \geq 0$ (assuming the process $dS_t = \sigma \cdot S_t \cdot dz_t$). This
yields:

$$\log^2(u) = \frac{\sigma^2 T}{n} \Rightarrow u = e^{\sigma\sqrt{\frac{T}{n}}}$$
• As a second step, q needs to be calibrated to account for the drift term $r \cdot S_t \cdot dt$ in
the lognormal process under the risk-neutral probability measure. Specifically, $q$ is
adjusted so that for any $i \in \mathbb{Z}_{\geq 0}$, for any $0 \leq j \leq i$, the mean of the two random
outcomes $\frac{S_{i+1,j+1}}{S_{i,j}} = u$ and $\frac{S_{i+1,j}}{S_{i,j}} = \frac{1}{u}$ is equal to the mean $e^{\frac{rT}{n}}$ of the lognormally-distributed
random variable $\frac{S_{t+T/n}}{S_t}$ for any $t \geq 0$ (assuming the process $dS_t = r \cdot S_t \cdot dt + \sigma \cdot S_t \cdot dz_t$). This yields:

$$q \cdot u + \frac{1-q}{u} = e^{\frac{rT}{n}} \Rightarrow q = \frac{u \cdot e^{\frac{rT}{n}} - 1}{u^2 - 1} = \frac{e^{\frac{rT}{n} + \sigma\sqrt{\frac{T}{n}}} - 1}{e^{2\sigma\sqrt{\frac{T}{n}}} - 1}$$

This calibration for $u$ and $q$ ensures that as $n \rightarrow \infty$ (i.e., time step interval $\frac{T}{n} \rightarrow 0$),
the mean and variance of the binomial distribution after $i$ time steps match the mean
$(r - \frac{\sigma^2}{2}) \cdot \frac{iT}{n}$ and variance $\frac{\sigma^2 \cdot iT}{n}$ of the normally-distributed random variable $\log(\frac{S_{iT/n}}{S_0})$ in
the continuous-time process defined by Equation (7.19), for all $i = 0, 1, \ldots, n$. Note that
$\log(\frac{S_{i,j}}{S_{0,0}})$ follows a random walk Markov Process (reminiscent of the random walk examples
in Chapter 1) with each movement in state space scaled by a factor of $\log(u)$.
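As a quick numerical sanity-check of this calibration (our own sketch, with illustrative inputs), we can compute $u$ and $q$ and verify that the single-step mean of the price-ratio is exactly $e^{\frac{rT}{n}}$, while the single-step variance of the log-price-ratio is close to $\frac{\sigma^2 T}{n}$:

import numpy as np

sigma, r, T, n = 0.25, 0.05, 1.0, 300  # illustrative inputs
dt = T / n
u = np.exp(sigma * np.sqrt(dt))              # up-factor
q = (u * np.exp(r * dt) - 1) / (u * u - 1)   # risk-neutral up-probability

# Single-step mean of S_{i+1}/S_i: equals e^{r*dt} exactly, by construction
mean_ratio = q * u + (1 - q) / u
# Single-step variance of log(S_{i+1}/S_i): close to sigma^2*dt for large n
mean_log = (2 * q - 1) * np.log(u)
var_log = np.log(u) ** 2 - mean_log ** 2
print(mean_ratio, np.exp(r * dt))   # these two match
print(var_log, sigma * sigma * dt)  # these two are close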
Thus, we have the parameters u and q that fully specify the Binomial Options Pricing
Model. Now we get to the application of this model. We are interested in using this model
for optimal exercise (and hence, pricing) of American Options. This is in contrast to the
Black-Scholes Partial Differential Equation which only enabled us to price options with
a fixed payoff at a fixed point in time (eg: European Call and Put Options). Of course,
a special case of American Options is indeed European Options. It’s important to note
that here we are tackling the much harder problem of the ideal timing of exercise of an
American Option - the Binomial Options Pricing Model is well suited for this.
As mentioned earlier, we want to model the problem of Optimal Exercise of American
Options as a discrete-time, finite-horizon, finite-states MDP. We set the terminal time to be
t = T + 1, meaning all the states at time T + 1 are terminal states. Here we will utilize the
states and state transitions (probabilistic price movements of the underlying) given by the
Binomial Options Pricing Model as the states and state transitions in the MDP. The MDP
actions in each state will be binary - either exercise the option (and immediately move
to a terminal state) or don’t exercise the option (i.e., continue on to the next time step’s
random state, as given by the Binomial Options Pricing Model). If the exercise action is
chosen, the MDP reward is the option payoff. If the continue action is chosen, the reward
is 0. The discount factor $\gamma$ is $e^{-\frac{rT}{n}}$ since (as we've learnt in the single-period case), the
price (which translates here to the Optimal Value Function) is defined as the riskless rate-
discounted expectation (under the risk-neutral probability measure) of the option payoff.
In the multi-period setting, the overall discounting amounts to composition (multiplica-
tion) of each time step’s discounting (which is equal to γ) and the overall risk-neutral
probability measure amounts to the composition of each time step’s risk-neutral probabil-
ity measure (which is specified by the calibrated value q).
Now let’s write some code to determine the Optimal Exercise of American Options
(and hence, the price of American Options) by modeling this problem as a discrete-time,
finite-horizon, finite-states MDP. We create a dataclass OptimalExerciseBinTree whose
attributes are spot_price (specifying the current, i.e., time=0 price of the underlying),
payoff (specifying the option payoff, when exercised), expiry (specifying the time T to
expiration of the American Option), rate (specifying the riskless rate r), vol (specifying
the lognormal volatility σ), and num_steps (specifying the number n of time steps in the
binomial tree). Note that each time step is of interval Tn (which is implemented below in
the method dt). Note also that the payoff function is fairly generic taking two arguments -
the first argument is the time at which the option is exercised, and the second argument is
the underlying price at the time the option is exercised. Note that for a typical American
Call or Put Option, the payoff does not depend on time and the dependency on the un-
derlying price is the standard “hockey-stick” payoff that we are now fairly familiar with
(however, we designed the interface to allow for more general option payoff functions).
The set of states $\mathcal{S}_i$ at time step $i$ (for all $0 \leq i \leq T + 1$) is $\{0, 1, \ldots, i\}$ and the method
state_price below calculates the price in state $j$ at time step $i$ as:

$$S_{i,j} = S_{0,0} \cdot e^{(2j - i) \cdot \sigma \cdot \sqrt{\frac{T}{n}}}$$

Finally, the method get_opt_vf_and_policy calculates u (up_factor) and q (up_prob),


prepares the requisite state-reward transitions (conditional on current state and action)
to move from one time step to the next, and passes along the constructed time-sequenced
transitions to rl.finite_horizon.optimal_vf_and_policy (which we had written in Chap-
ter 3) to perform the requisite backward induction and return an Iterator on pairs of
V[int] and FiniteDeterministicPolicy[int, bool]. Note that the states at any time-step i
are the integers from 0 to i and hence, represented as int, and the actions are represented
as bool (True for exercise and False for continue). Note that we represent an early terminal
state (in case of option exercise before expiration of the option) as -1.
from rl.distribution import Constant, Categorical
from rl.finite_horizon import optimal_vf_and_policy
from rl.dynamic_programming import V
from rl.policy import FiniteDeterministicPolicy


@dataclass(frozen=True)
class OptimalExerciseBinTree:
    spot_price: float
    payoff: Callable[[float, float], float]
    expiry: float
    rate: float
    vol: float
    num_steps: int

    def dt(self) -> float:
        return self.expiry / self.num_steps

    def state_price(self, i: int, j: int) -> float:
        return self.spot_price * np.exp((2 * j - i) * self.vol *
                                        np.sqrt(self.dt()))

    def get_opt_vf_and_policy(self) -> \
            Iterator[Tuple[V[int], FiniteDeterministicPolicy[int, bool]]]:
        dt: float = self.dt()
        up_factor: float = np.exp(self.vol * np.sqrt(dt))
        up_prob: float = (np.exp(self.rate * dt) * up_factor - 1) / \
            (up_factor * up_factor - 1)
        return optimal_vf_and_policy(
            steps=[
                {NonTerminal(j): {
                    True: Constant(
                        (
                            Terminal(-1),
                            self.payoff(i * dt, self.state_price(i, j))
                        )
                    ),
                    False: Categorical(
                        {
                            (NonTerminal(j + 1), 0.): up_prob,
                            (NonTerminal(j), 0.): 1 - up_prob
                        }
                    )
                } for j in range(i + 1)}
                for i in range(self.num_steps + 1)
            ],
            gamma=np.exp(-self.rate * dt)
        )

Now we want to try out this code on an American Call Option and American Put Option.
We know that it is never optimal to exercise an American Call Option before the option
expiration. The reason for this is as follows: Upon early exercise (say at time τ < T ),
we borrow cash K (to pay for the purchase of the underlying) and own the underlying
(valued at Sτ ). So, at option expiration T , we owe cash K · er(T −τ ) and own the underlying
valued at ST , which is an overall value at time T of ST −K ·er(T −τ ) . We argue that this value
is always less than the value max(ST −K, 0) we’d obtain at option expiration T if we’d made
the choice to not exercise early. If the call option ends up in-the-money at option expiration
T (i.e., ST > K), then ST −K ·er(T −τ ) is less than the value ST −K we’d get by exercising at
option expiration T . If the call option ends up not being in-the-money at option expiration
T (i.e., ST ≤ K), then ST − K · er(T −τ ) < 0 which is less than the 0 payoff we’d obtain at
option expiration T . Hence, we are always better off waiting until option expiration (i.e. it
is never optimal to exercise a call option early, no matter how much in-the-money we get
before option expiration). Hence, the price of an American Call Option should be equal
to the price of a European Call Option with the same strike price and expiration time.

However, for an American Put Option, it is indeed sometimes optimal to exercise early
and hence, the price of an American Put Option is greater than the price of a European
Put Option with the same strike price and expiration time. Thus, it is interesting to ask the
question: For each time t < T , what is the threshold of underlying price St below which it
is optimal to exercise an American Put Option? It is interesting to view this threshold as
a function of time (we call this function the optimal exercise boundary of an American
Put Option). One would expect that this optimal exercise boundary rises as one gets closer
to the option expiration T . But exactly what shape does this optimal exercise boundary
have? We can answer this question by analyzing the optimal policy at each time step - we
just need to find the state k at each time step i such that the Optimal Policy πi∗ (·) evaluates
to True for all states j ≤ k (and evaluates to False for all states j > k). We write the
following method to calculate the Optimal Exercise Boundary:

def option_exercise_boundary(
    self,
    policy_seq: Sequence[FiniteDeterministicPolicy[int, bool]],
    is_call: bool
) -> Sequence[Tuple[float, float]]:
    dt: float = self.dt()
    ex_boundary: List[Tuple[float, float]] = []
    for i in range(self.num_steps + 1):
        ex_points = [j for j in range(i + 1)
                     if policy_seq[i].action_for[j] and
                     self.payoff(i * dt, self.state_price(i, j)) > 0]
        if len(ex_points) > 0:
            boundary_pt = min(ex_points) if is_call else max(ex_points)
            ex_boundary.append(
                (i * dt, self.state_price(i, boundary_pt))
            )
    return ex_boundary

option_exercise_boundary takes as input policy_seq which represents the sequence of


optimal policies $\pi_i^*$ for each time step $0 \leq i \leq T$, and produces as output the sequence of
pairs $(\frac{iT}{n}, B_i)$ where

$$B_i = \max_{j : \pi_i^*(j) = True} S_{i,j}$$

with the little detail that we only consider those states j for which the option payoff is
positive. For some time steps i, none of the states j qualify as πi∗ (j) = T rue, in which case
we don’t include that time step i in the output sequence.
To compare the results of American Call and Put Option Pricing on this Binomial Op-
tions Pricing Model against the corresponding European Options prices, we write the fol-
lowing method to implement the Black-Scholes closed-form solution (derived as Equa-
tions E.7 and E.8 in Appendix E):

from scipy.stats import norm

def european_price(self, is_call: bool, strike: float) -> float:
    sigma_sqrt: float = self.vol * np.sqrt(self.expiry)
    d1: float = (np.log(self.spot_price / strike) +
                 (self.rate + self.vol ** 2 / 2.) * self.expiry) \
        / sigma_sqrt
    d2: float = d1 - sigma_sqrt
    if is_call:
        ret = self.spot_price * norm.cdf(d1) - \
            strike * np.exp(-self.rate * self.expiry) * norm.cdf(d2)
    else:
        ret = strike * np.exp(-self.rate * self.expiry) * norm.cdf(-d2) - \
            self.spot_price * norm.cdf(-d1)
    return ret

Here’s some code to price an American Put Option (changing is_call to True will price
American Call Options):
from rl.gen_utils.plot_funcs import plot_list_of_curves

spot_price_val: float = 100.0
strike: float = 100.0
is_call: bool = False
expiry_val: float = 1.0
rate_val: float = 0.05
vol_val: float = 0.25
num_steps_val: int = 300

if is_call:
    opt_payoff = lambda _, x: max(x - strike, 0)
else:
    opt_payoff = lambda _, x: max(strike - x, 0)

opt_ex_bin_tree: OptimalExerciseBinTree = OptimalExerciseBinTree(
    spot_price=spot_price_val,
    payoff=opt_payoff,
    expiry=expiry_val,
    rate=rate_val,
    vol=vol_val,
    num_steps=num_steps_val
)

vf_seq, policy_seq = zip(*opt_ex_bin_tree.get_opt_vf_and_policy())
ex_boundary: Sequence[Tuple[float, float]] = \
    opt_ex_bin_tree.option_exercise_boundary(policy_seq, is_call)
time_pts, ex_bound_pts = zip(*ex_boundary)
label = ("Call" if is_call else "Put") + " Option Exercise Boundary"
plot_list_of_curves(
    list_of_x_vals=[time_pts],
    list_of_y_vals=[ex_bound_pts],
    list_of_colors=["b"],
    list_of_curve_labels=[label],
    x_label="Time",
    y_label="Underlying Price",
    title=label
)

european: float = opt_ex_bin_tree.european_price(is_call, strike)
print(f"European Price = {european:.3f}")
am_price: float = vf_seq[0][NonTerminal(0)]
print(f"American Price = {am_price:.3f}")
This prints as output:


European Price = 7.459
American Price = 7.971

So we can see that the price of this American Put Option is significantly higher than the
price of the corresponding European Put Option. The exercise boundary produced by this
code is shown in Figure 7.4. The locally-jagged nature of the exercise boundary curve is
because of the “diamond-like” local-structure of the underlying prices at the nodes in the
binomial tree. We can see that when the time to expiry is large, it is not optimal to exercise
unless the underlying price drops significantly. It is only when the time to expiry becomes
quite small that the optimal exercise boundary rises sharply towards the strike price value.
Figure 7.4.: Put Option Exercise Boundary

Changing is_call to True (and not changing any of the other inputs) prints as output:

European Price = 12.336
American Price = 12.328

This is a numerical validation of our proof above that it is never optimal to exercise an
American Call Option before option expiration.
The above code is in the file rl/chapter8/optimal_exercise_bin_tree.py. As ever, we en-
courage you to play with various choices of inputs to develop intuition for how American
Option Pricing changes as a function of the inputs (and how American Put Option Ex-
ercise Boundary changes). Note that you can specify the option payoff as any arbitrary
function of time and the underlying price.

7.9. Generalizing to Optimal-Stopping Problems


In this section, we generalize the problem of Optimal Exercise of American Options to
the problem of Optimal Stopping in Stochastic Calculus, which has several applications in
Mathematical Finance, including pricing of exotic derivatives. After defining the Optimal
Stopping problem, we show how this problem can be modeled as an MDP (generalizing
the MDP modeling of Optimal Exercise of American Options), which affords us the ability
to solve them with Dynamic Programming or Reinforcement Learning algorithms.
First we define the concept of Stopping Time. Informally, Stopping Time τ is a random
time (time as a random variable) at which a given stochastic process exhibits certain be-
havior. Stopping time is defined by a stopping policy to decide whether to continue or stop
a stochastic process based on the stochastic process’ current and past values. Formally, it
is a random variable τ such that the event {τ ≤ t} is in the σ-algebra Ft of the stochastic
process, for all t. This means the stopping decision (i.e., stopping policy) of whether τ ≤ t
only depends on information up to time t, i.e., we have all the information required to
make the stopping decision at any time t.

A simple example of Stopping Time is Hitting Time of a set A for a process X. Informally,
it is the first time when $X$ takes a value within the set $A$. Formally, Hitting Time $T_{X,A}$ is
defined as:

$$T_{X,A} = \min\{t \in \mathbb{R} \mid X_t \in A\}$$
A simple and common example of Hitting Time is the first time a process exceeds a
certain fixed threshold level. As an example, we might say we want to sell a stock when
the stock price exceeds $100. This $100 threshold constitutes our stopping policy, which
determines the stopping time (hitting time) in terms of when we want to sell the stock (i.e.,
exit owning the stock). Different people may have different criterion for exiting owning
the stock (your friend’s threshold might be $90), and each person’s criterion defines their
own stopping policy and hence, their own stopping time random variable.
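To make this concrete, here is a minimal simulation sketch (our own illustration; the function name first_hitting_time and the simulated price process are hypothetical) that records the first time a simulated price path enters the set $A = [100, \infty)$:

import numpy as np
from typing import Optional

def first_hitting_time(prices: np.ndarray, threshold: float) -> Optional[int]:
    # Index of the first time step at which the price enters A = [threshold, infinity),
    # or None if the set is never hit over the simulated horizon
    hits = np.where(prices >= threshold)[0]
    return int(hits[0]) if len(hits) > 0 else None

# One simulated path of a simple lognormal random walk starting at 90 (illustrative inputs)
rng = np.random.default_rng(0)
path = 90.0 * np.exp(np.cumsum(rng.normal(0.0, 0.02, size=250)))
print(first_hitting_time(path, 100.0))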
Now that we have defined Stopping Time, we are ready to define the Optimal Stopping
problem. Optimal Stopping for a stochastic process $X$ is a function $W(\cdot)$ whose domain is
the set of potential initial values of the stochastic process and whose co-domain is the set
of real numbers, defined as:

$$W(x) = \max_\tau \mathbb{E}[H(X_\tau) \mid X_0 = x]$$

where $\tau$ ranges over a set of stopping times of $X$ and $H(\cdot)$ is a function from the domain of the
stochastic process values to the set of real numbers.
Intuitively, you should think of Optimal Stopping as searching through many Stopping
Times (i.e., many Stopping Policies), and picking out the best Stopping Policy - the one
that maximizes the expected value of a function H(·) applied on the stochastic process at
the stopping time.
Unsurprisingly (noting the connection to Optimal Control in an MDP), W (·) is called
the Value function, and H is called the Reward function. Note that sometimes we can
have several stopping times that maximize E[H(Xτ )] and we say that the optimal stopping
time is the smallest stopping time achieving the maximum value. We mentioned above
that Optimal Exercise of American Options is a special case of Optimal Stopping. Let’s
understand this specialization better:

• X is the stochastic process for the underlying’s price in the risk-neutral probability
measure.
• x is the underlying security’s current price.
• τ is a set of exercise times, each exercise time corresponding to a specific policy of
option exercise (i.e., specific stopping policy).
• W (·) is the American Option price as a function of the underlying’s current price x.
• H(·) is the option payoff function (with riskless-rate discounting built into H(·)).

Now let us define Optimal Stopping problems as control problems in Markov Decision
Processes (MDPs).

• The MDP State at time t is Xt .


• The MDP Action is Boolean: Stop the Process or Continue the Process.
• The MDP Reward is always 0, except upon Stopping, when it is equal to H(Xτ ).
• The MDP Discount Factor γ is equal to 1.
• The MDP probabilistic-transitions are governed by the Stochastic Process X.

A specific policy corresponds to a specific stopping-time random variable $\tau$, the Optimal
Policy $\pi^*$ corresponds to the stopping-time $\tau^*$ that yields the maximum (over $\tau$) of
$\mathbb{E}[H(X_\tau)|X_0 = x]$, and the Optimal Value Function $V^*$ corresponds to the maximum value
of $\mathbb{E}[H(X_\tau)|X_0 = x]$.
For discrete time steps, the Bellman Optimality Equation is:

$$V^*(X_t) = \max(H(X_t), \mathbb{E}[V^*(X_{t+1}) \mid X_t])$$

Thus, we see that Optimal Stopping is the solution to the above Bellman Optimality
Equation (solving the Control problem of the MDP described above). For a finite number
of time steps, we can run a backward induction algorithm from the final time step back
to time step 0 (essentially a generalization of the backward induction we did with the
Binomial Options Pricing Model to determine Optimal Exercise of American Options).
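As a sketch of this backward induction in its generic form (our own illustration, not the book's library code), the following computes the optimal-stopping Value Function for a finite-horizon problem with a finite set of states at each time step and a known transition-probability function, assuming (as in the MDP formulation above) that $\gamma = 1$ and that any discounting is folded into $H(\cdot)$:

from typing import Callable, Dict, List, Sequence

def optimal_stopping_backward_induction(
    states_by_time: Sequence[Sequence[int]],              # states available at each time step
    reward: Callable[[int, int], float],                   # H(time step, state): payoff if we stop
    transition: Callable[[int, int], Dict[int, float]],    # (time step, state) -> {next state: prob}
) -> List[Dict[int, float]]:
    num_steps = len(states_by_time)
    vf: List[Dict[int, float]] = [{} for _ in range(num_steps)]
    # At the final time step, the only remaining choice is to stop and collect H
    for s in states_by_time[-1]:
        vf[-1][s] = reward(num_steps - 1, s)
    # Backward induction: V*(s) = max(H(s), E[V*(next state)])
    for t in reversed(range(num_steps - 1)):
        for s in states_by_time[t]:
            continue_value = sum(
                p * vf[t + 1][s1] for s1, p in transition(t, s).items()
            )
            vf[t][s] = max(reward(t, s), continue_value)
    return vf

The Binomial-Tree American-Option code earlier in this chapter is a special case of this pattern, with the states at time step $i$ being $\{0, 1, \ldots, i\}$.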
Many derivatives pricing problems (and indeed many problems in the broader space
of Mathematical Finance) can be cast as Optimal Stopping and hence can be modeled as
MDPs (as described above). The important point here is that this enables us to employ
Dynamic Programming or Reinforcement Learning algorithms to identify optimal stop-
ping policy for exotic derivatives (which typically yields a pricing algorithm for exotic
derivatives). When the state space is large (eg: when the payoff depends on several un-
derlying assets or when the payoff depends on the history of underlying’s prices, such
as Asian Options-payoff with American exercise feature), the classical algorithms used in
the finance industry for exotic derivatives pricing are not computationally tractable. This
points to the use of Reinforcement Learning algorithms which tend to be good at handling
large state spaces by effectively leveraging sampling and function approximation method-
ologies in the context of solving the Bellman Optimality Equation. Hence, we propose
Reinforcement Learning as a promising alternative technique to pricing of certain exotic
derivatives that can be cast as Optimal Stopping problems. We will discuss this more after
having covered Reinforcement Learning algorithms.

7.10. Pricing/Hedging in an Incomplete Market cast as an MDP


In Subsection 7.6.2, we developed a pricing/hedging approach based on Expected-Utility-
Indifference for the simple setting of discrete-time with single-period, when the market is
incomplete. In this section, we extend this approach to the case of discrete-time with multi-
period. In the single-period setting, the solution is rather straightforward as it amounts to
an unconstrained multi-variate optimization together with a single-variable root-solver.
Now when we extend this solution approach to the multi-period setting, it amounts to
a sequential/dynamic optimal control problem. Although this is far more complex than
the single-period setting, the good news is that we can model this solution approach for
the multi-period setting as a Markov Decision Process. This section will be dedicated to
modeling this solution approach as an MDP, which gives us enormous flexibility in captur-
ing the real-world nuances. Besides, modeling this approach as an MDP permits us to tap
into some of the recent advances in Deep Learning and Reinforcement Learning (i.e. Deep
Reinforcement Learning). Since we haven’t yet learnt about Reinforcement Learning al-
gorithms, this section won’t cover the algorithmic aspects (i.e., how to solve the MDP) - it
will simply cover how to model the MDP for the Expected-Utility-Indifference approach to
pricing/hedging derivatives in an incomplete market.
Before we get into the MDP modeling details, it pays to remind that in an incomplete
market, we have multiple risk-neutral probability measures and hence, multiple valid
derivative prices (each consistent with no-arbitrage). This means the market/traders need
to “choose” a suitable risk-neutral probability measure (which amounts to choosing one

out of the many valid derivative prices). In practice, this “choice” is typically made in ad-
hoc and inconsistent ways. Hence, our proposal of making this “choice” in a mathematically-
disciplined manner by noting that ultimately a trader is interested in maximizing the
“risk-adjusted return” of a derivative together with its hedges (by sequential/dynamic
adjustment of the hedge quantities). Once we take this view, it is reminiscent of the As-
set Allocation problem we covered in Chapter 6 and the maximization objective is based
on the specification of preference for trading risk versus return (which in turn, amounts
to specification of a Utility function). Therefore, similar to the Asset Allocation problem,
the decision at each time step is the set of adjustments one needs to make to the hedge
quantities. With this rough overview, we are now ready to formalize the MDP model
for this approach to multi-period pricing/hedging in an incomplete market. For ease of
exposition, we simplify the problem setup a bit, although the approach and model we de-
scribe below essentially applies to more complex, more frictionful markets as well. Our
exposition below is an adaptation of the treatment in the Deep Hedging paper by Buehler,
Gonon, Teichmann, Wood, Mohan, Kochems (Bühler et al. 2018).
Assume we have a portfolio of m derivatives and we refer to our collective position
across the portfolio of m derivatives as D. Assume each of these m derivatives expires
by time T (i.e., all of their contingent cashflows will transpire by time T ). We model the
problem as a discrete-time finite-horizon MDP with the terminal time at t = T + 1 (i.e., all
states at time t = T + 1 are terminal states). We require the following notation to model
the MDP:

• Denote the derivatives portfolio-aggregated Contingent Cashflows at time $t$ as $X_t \in \mathbb{R}$.
• Assume we have $n$ assets trading in the market that would serve as potential hedges for our derivatives position $D$.
• Denote the number of units held in the hedge positions at time $t$ as $\alpha_t \in \mathbb{R}^n$.
• Denote the cashflows per unit of hedges at time $t$ as $Y_t \in \mathbb{R}^n$.
• Denote the prices per unit of hedges at time $t$ as $P_t \in \mathbb{R}^n$.
• Denote the trading account value at time $t$ as $\beta_t \in \mathbb{R}$.

We will use the notation that we have previously used for discrete-time finite-horizon
MDPs, i.e., we will use time-subscripts in our notation.
We denote the State Space at time t (for all 0 ≤ t ≤ T +1) as St and a specific state at time t
as st ∈ St . Among other things, the key ingredients of st include: αt , Pt , βt , D. In practice,
st will include many other components (in general, any market information relevant to
hedge trading decisions). However, for simplicity (motivated by ease of articulation), we
assume st is simply the 4-tuple:

st := (αt , Pt , βt , D)
We denote the Action Space at time t (for all 0 ≤ t ≤ T ) as At and a specific action at time
t as at ∈ At . at represents the number of units of hedges traded at time t (i.e., adjustments
to be made to the hedges at each time step). Since there are n hedge positions (n assets
to be traded), $a_t \in \mathbb{R}^n$, i.e., $\mathcal{A}_t \subseteq \mathbb{R}^n$. Note that for each of the $n$ assets, its corresponding
component in at is positive if we buy the asset at time t and negative if we sell the asset at
time t. Any trading restrictions (eg: constraints on short-selling) will essentially manifest
themselves in terms of the exact definition of At as a function of st .
State transitions are essentially defined by the random movements of prices of the assets
that make up the potential hedges, i.e., P[Pt+1 |Pt ]. In practice, this is available either as an
explicit transition-probabilities model, or more likely available in the form of a simulator,

that produces an on-demand sample of the next time step’s prices, given the current time
step’s prices. Either way, the internals of P[Pt+1 |Pt ] are estimated from actual market data
and realistic trading/market assumptions. The practical details of how to estimate these
internals are beyond the scope of this book - it suffices to say here that this estimation is a
form of supervised learning, albeit fairly nuanced due to the requirement of capturing the
complexities of market-price behavior. For the following description of the MDP, simply
assume that we have access to P[Pt+1 |Pt ] in some form.
It is important to pay careful attention to the sequence of events at each time step $t = 0, \ldots, T$, described below (a minimal code sketch of one such time step follows the list):

1. Observe the state $s_t := (\alpha_t, P_t, \beta_t, D)$.
2. Perform action (trades) $a_t$, which produces trading account value change $= -a_t^T \cdot P_t$ (note: this is an inner-product in $\mathbb{R}^n$).
3. These trades incur transaction costs, for example equal to $\gamma \cdot abs(a_t^T) \cdot P_t$ for some $\gamma \in \mathbb{R}^+$ (note: $abs$, denoting absolute value, applies point-wise on $a_t^T \in \mathbb{R}^n$, and then we take its inner-product with $P_t \in \mathbb{R}^n$).
4. Update $\alpha_t$ as:
$$\alpha_{t+1} = \alpha_t + a_t$$
At termination, we need to force-liquidate, which establishes the constraint: $a_T = -\alpha_T$.
5. Realize end-of-time-step cashflows from the derivatives position $D$ as well as from the (updated) hedge positions. This is equal to $X_{t+1} + \alpha_{t+1}^T \cdot Y_{t+1}$ (note: $\alpha_{t+1}^T \cdot Y_{t+1}$ is an inner-product in $\mathbb{R}^n$).
6. Update trading account value $\beta_t$ as:
$$\beta_{t+1} = \beta_t - a_t^T \cdot P_t - \gamma \cdot abs(a_t^T) \cdot P_t + X_{t+1} + \alpha_{t+1}^T \cdot Y_{t+1}$$
7. MDP Reward $r_{t+1} = 0$ for all $t = 0, \ldots, T-1$ and $r_{T+1} = U(\beta_{T+1})$ for an appropriate concave Utility function (based on the extent of risk-aversion).
8. Hedge prices evolve from $P_t$ to $P_{t+1}$, based on the price-transition model $\mathbb{P}[P_{t+1}|P_t]$.
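Below is the promised minimal sketch of one such time step (our own illustration, not the book's library code; the names HedgingState and hedging_step, and the choice of proportional transaction costs, are hypothetical):

from dataclasses import dataclass
import numpy as np

@dataclass(frozen=True)
class HedgingState:
    alpha: np.ndarray   # units held in each of the n hedge positions (alpha_t)
    prices: np.ndarray  # prices per unit of the hedges (P_t)
    beta: float         # trading account value (beta_t)
    # the derivatives position D is left implicit in this sketch

def hedging_step(
    state: HedgingState,
    action: np.ndarray,       # a_t: units of hedges traded at this time step
    gamma: float,             # proportional transaction-cost rate (step 3)
    x_next: float,            # X_{t+1}: derivatives-position cashflow realized this step
    y_next: np.ndarray,       # Y_{t+1}: cashflows per unit of the hedges
    prices_next: np.ndarray,  # P_{t+1}: sampled from the price-transition model P[P_{t+1} | P_t]
) -> HedgingState:
    alpha_next = state.alpha + action                            # step 4
    beta_next = (state.beta
                 - action.dot(state.prices)                      # step 2: cost of the trades
                 - gamma * np.abs(action).dot(state.prices)      # step 3: transaction costs
                 + x_next + alpha_next.dot(y_next))              # steps 5 and 6: realized cashflows
    return HedgingState(alpha=alpha_next, prices=prices_next, beta=beta_next)

The MDP reward for such a step is 0, with $U(\beta_{T+1})$ awarded only after the final step, exactly as in item 7 above.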

Assume we now want to enter into an incremental position of derivatives-portfolio $D'$ in $m'$ derivatives. We denote the combined position as $D \cup D'$. We want to determine the
Price of the incremental position D′ , as well as the hedging strategy for D ∪ D′ .
Denote the Optimal Value Function at time t (for all 0 ≤ t ≤ T ) as Vt∗ : St → R. Pricing
of D′ is based on the principle that introducing the incremental position of D′ together
with a calibrated cash payment/receipt (Price of D′ ) at t = 0 should leave the Optimal
Value (at t = 0) unchanged. Precisely, the Price of D′ is the value x∗ such that

V0∗ ((α0 , P0 , β0 − x∗ , D ∪ D′ )) = V0∗ ((α0 , P0 , β0 , D))

This Pricing principle is known as the principle of Indifference Pricing. The hedging strategy
for D ∪ D′ at time t (for all 0 ≤ t < T ) is given by the associated Optimal Deterministic
Policy $\pi_t^*: \mathcal{S}_t \rightarrow \mathcal{A}_t$.

7.11. Key Takeaways from this Chapter


• The concepts of Arbitrage, Completeness and Risk-Neutral Probability Measure.
• The two fundamental theorems of Asset Pricing.

• Pricing of derivatives in a complete market in two equivalent ways: A) Based on
construction of a replicating portfolio, and B) Based on riskless rate-discounted ex-
pectation in the risk-neutral probability measure.
• Optimal Exercise of American Options (and its generalization to Optimal Stopping
problems) cast as an MDP Control problem.
• Pricing and Hedging of Derivatives in an Incomplete (real-world) Market cast as an
MDP Control problem.

8. Order-Book Trading Algorithms
In this chapter, we venture into the world of Algorithmic Trading and specifically, we cover
a couple of problems involving a trading Order Book that can be cast as Markov Decision
Processes, and hence tackled with Dynamic Programming or Reinforcement Learning. We
start the chapter by covering the basics of how trade orders are submitted and executed on
an Order Book, a structure that allows for efficient transactions between buyers and sellers
of a financial asset. Without loss of generality, we refer to the financial asset being traded
on the Order Book as a “stock” and the number of units of the asset as “shares.” Next we
will explain how a large trade can significantly shift the Order Book, a phenomenon known
as Price Impact. Finally, we will cover the two algorithmic trading problems that can be cast
as MDPs. The first problem is Optimal Execution of the sale of a large number of shares
of a stock so as to yield the maximum utility of sales proceeds over a finite horizon. This
involves breaking up the sale of the shares into appropriate pieces and selling those pieces
at the right times so as to achieve the goal of maximizing the utility of sales proceeds.
Hence, it is an MDP Control problem where the actions are the number of shares sold
at each time step. The second problem is Optimal Market-Making, i.e., the optimal bids
(willingness to buy a certain number of shares at a certain price) and asks (willingness to
sell a certain number of shares at a certain price) to be submitted on the Order Book. Again,
by optimal, we mean maximization of the utility of revenues generated by the market-
maker over a finite-horizon (market-makers generate revenue through the spread, i.e. the
gap between the bid and ask prices they offer). This is also an MDP Control problem
where the actions are the bid and ask prices along with the bid and ask shares at each
time step.
For a deeper study on the topics of Order Book, Price Impact, Order Execution, Market-
Making (and related topics), we refer you to the comprehensive treatment in Olivier Gueant’s
book (Gueant 2016).

8.1. Basics of Order Book and Price Impact


Some of the financial literature refers to the Order Book as Limit Order Book (abbreviated
as LOB) but we will stick with the lighter language - Order Book, abbreviated as OB. The
Order Book is essentially a data structure that facilitates matching stock buyers with stock
sellers (i.e., an electronic marketplace). Figure 8.1 depicts a simplified view of an order
book. In this order book market, buyers and sellers express their intent to trade by submit-
ting bids (intent to buy) and asks (intent to sell). These expressions of intent to buy or sell
are known as Limit Orders (abbreviated as LO). The word “limit” in Limit Order refers to
the fact that one is interested in buying only below a certain price level (and likewise, one
is interested in selling only above a certain price level). Each LO is comprised of a price P
and number of shares N . A bid, i.e., Buy LO (P, N ) states willingness to buy N shares at
a price less than or equal to P . Likewise, an ask, i.e., a Sell LO (P, N ) states willingness to
sell N shares at a price greater than or equal to P .
Note that multiple traders might submit LOs with the same price. The order book ag-
gregates the number of shares at each unique price, and the OB data structure is typically

Figure 8.1.: Trading Order Book (Image Credit: https://nms.kcl.ac.uk/rll/enrique-miranda/index.html)

presented for trading in the form of this aggregated view. Thus, the OB data structure can
be represented as two sorted lists of (Price, Size) pairs:

Buy LOs (Bids): [(P_i^{(b)}, N_i^{(b)}) | 0 ≤ i < m], with P_i^{(b)} > P_j^{(b)} for i < j

Sell LOs (Asks): [(P_i^{(a)}, N_i^{(a)}) | 0 ≤ i < n], with P_i^{(a)} < P_j^{(a)} for i < j
Note that the Buy LOs are arranged in descending order and the Sell LOs are arranged
in ascending order to signify the fact that the beginning of each list consists of the most
important (best-price) LOs.
Now let’s learn about some of the standard terminology:

• We refer to P_0^{(b)} as The Best Bid Price (lightened to Best Bid) to signify that it is the
  highest offer to buy and hence, the best price for a seller to transact with.
• Likewise, we refer to P_0^{(a)} as The Best Ask Price (lightened to Best Ask) to signify that it is
  the lowest offer to sell and hence, the best price for a buyer to transact with.
• (P_0^{(a)} + P_0^{(b)}) / 2 is referred to as The Mid Price (lightened to Mid).
• P_0^{(a)} − P_0^{(b)} is referred to as The Best Bid-Ask Spread (lightened to Spread).
• P_{n−1}^{(a)} − P_{m−1}^{(b)} is referred to as The Market Depth (lightened to Depth).
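As a quick worked example of these definitions: if the Buy LOs on an OB are at prices 99, 98, 97
and the Sell LOs are at prices 101, 102, 103, then the Best Bid is 99, the Best Ask is 101, the Mid
is 100, the Spread is 101 − 99 = 2, and the Depth is 103 − 97 = 6.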

Although an actual real-world trading order book has many other details, we believe this
simplified coverage is adequate for the purposes of core understanding of order book trad-
ing and to navigate the problems of optimal order execution and optimal market-making.
Apart from Limit Orders, traders can express their interest to buy/sell with another type
of order - a Market Order (abbreviated as MO). A Market Order (MO) states one’s intent
to buy/sell N shares at the best possible price(s) available on the OB at the time of MO sub-
mission. So, an LO is keen on price and not so keen on time (willing to wait to get the price
one wants) while an MO is keen on time (desire to trade right away) and not so keen on

price (will take whatever the best LO price is on the OB). So now let us understand the
actual transactions that occur between LOs and MOs (buy and sell interactions, and how
the OB changes as a result of these interactions). Firstly, we note that in normal trading
activity, a newly submitted sell LO’s price is typically above the price of the best buy LO
on the OB. But if a new sell LO's price is less than or equal to the best buy LO's price,
we say that the market has crossed (to mean that the range of bid prices and the
range of ask prices have intersected), which results in an immediate transaction that eats
into the OB’s Buy LOs.
Precisely, a new Sell LO (P, N ) potentially transacts with (and hence, removes) the best
Buy LOs on the OB.

Removal: [(P_i^{(b)}, min(N_i^{(b)}, max(0, N − Σ_{j=0}^{i−1} N_j^{(b)}))) | i : P_i^{(b)} ≥ P]    (8.1)

After this removal, it potentially adds the following LO to the asks side of the OB:

(P, max(0, N − Σ_{i : P_i^{(b)} ≥ P} N_i^{(b)}))    (8.2)

Likewise, a new Buy LO (P, N) potentially transacts with (and hence, removes) the
best Sell LOs on the OB.

Removal: [(P_i^{(a)}, min(N_i^{(a)}, max(0, N − Σ_{j=0}^{i−1} N_j^{(a)}))) | i : P_i^{(a)} ≤ P]    (8.3)

After this removal, it potentially adds the following to the bids side of the OB:

(P, max(0, N − Σ_{i : P_i^{(a)} ≤ P} N_i^{(a)}))    (8.4)

When a Market Order (MO) is submitted, things are simpler. A Sell Market Order of N
shares will remove the best Buy LOs on the OB.

Removal: [(P_i^{(b)}, min(N_i^{(b)}, max(0, N − Σ_{j=0}^{i−1} N_j^{(b)}))) | 0 ≤ i < m]    (8.5)

The sales proceeds for this MO are:

Σ_{i=0}^{m−1} P_i^{(b)} · min(N_i^{(b)}, max(0, N − Σ_{j=0}^{i−1} N_j^{(b)}))    (8.6)
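As a quick worked illustration of Equations (8.5) and (8.6) (with made-up numbers): if the two
best Buy LOs on the OB are (100, 40) and (99, 100), then a Sell MO of N = 120 shares transacts
40 shares at 100 and 80 shares at 99, for sales proceeds of 4000 + 7920 = 11920, compared to the
best possible sales proceeds of 120 · 100 = 12000.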

We note that if N is large, the sales proceeds for this MO can be significantly lower than
the best possible sales proceeds (= N · P_0^{(b)}), which happens only if N ≤ N_0^{(b)}. Note also
that if N is large, the new Best Bid Price (new value of P_0^{(b)}) can be significantly lower than
the Best Bid Price before the MO was submitted (because the MO "eats into" a significant
volume of Buy LOs on the OB). This "eating into" the Buy LOs on the OB and consequent
lowering of the Best Bid Price (and hence, Mid Price) is known as Price Impact of an MO
(more specifically, as the Temporary Price Impact of an MO). We use the word "temporary"
because subsequent to this "eating into" the Buy LOs of the OB (and the consequent "hole,"
i.e., large Bid-Ask Spread), market participants will submit "replenishment LOs" (both
Buy LOs and Sell LOs) on the OB. These replenishment LOs would typically mitigate the
Bid-Ask Spread and the eventual settlement of the Best Bid/Best Ask/Mid Prices consti-
tutes what we call Permanent Price Impact - which refers to the changes in OB Best Bid/Best
Ask/Mid prices relative to the corresponding prices before submission of the MO.
Likewise, a Buy Market Order of N shares will remove the best Sell LOs on the OB:

Removal: [(P_i^{(a)}, min(N_i^{(a)}, max(0, N − Σ_{j=0}^{i−1} N_j^{(a)}))) | 0 ≤ i < n]    (8.7)

The purchase bill for this MO is:

Σ_{i=0}^{n−1} P_i^{(a)} · min(N_i^{(a)}, max(0, N − Σ_{j=0}^{i−1} N_j^{(a)}))    (8.8)

If N is large, the purchase bill for this MO can be significantly higher than the best
possible purchase bill (= N · P_0^{(a)}), which happens only if N ≤ N_0^{(a)}. All that we wrote
above in terms of Temporary and Permanent Price Impact naturally applies in the opposite
direction for a Buy MO.
We refer to all of the above-described OB movements, including both temporary and
permanent Price Impacts broadly as Order Book Dynamics. There is considerable literature
on modeling Order Book Dynamics and some of these models can get fairly complex in
order to capture various real-world nuances. Much of this literature is beyond the scope
of this book. In this chapter, we will cover a few simple models for how a sell MO will
move the OB’s Best Bid Price (rather than a model for how it will move the entire OB). The
model for how a buy MO will move the OB’s Best Ask Price is naturally identical.
Now let’s write some code that models how LOs and MOs interact with the OB. We
write a class OrderBook that represents the Buy and Sell Limit Orders on the Order Book,
which are each represented as a sorted sequence of the type DollarsAndShares, which is a
dataclass we created to represent any pair of a dollar amount (dollars: float) and number
of shares (shares: int). Sometimes, we use DollarsAndShares to represent an LO (pair
of price and shares) as in the case of the sorted lists of Buy and Sell LOs. At other times, we
use DollarsAndShares to represent the pair of total dollars transacted and total shares trans-
acted when an MO is executed on the OB. The OrderBook maintains a price-descending
sequence (of type PriceSizePairs) of Buy LOs (descending_bids) and a price-ascending
sequence of Sell LOs (ascending_asks). We write the basic methods to get the
OrderBook’s highest bid price (method bid_price), lowest ask price (method ask_price),
mid price (method mid_price), spread between the highest bid price and lowest ask price
(method bid_ask_spread), and market depth (method market_depth).

from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple

@dataclass(frozen=True)
class DollarsAndShares:
    dollars: float
    shares: int

PriceSizePairs = Sequence[DollarsAndShares]

@dataclass(frozen=True)
class OrderBook:
    descending_bids: PriceSizePairs
    ascending_asks: PriceSizePairs

    def bid_price(self) -> float:
        return self.descending_bids[0].dollars

    def ask_price(self) -> float:
        return self.ascending_asks[0].dollars

    def mid_price(self) -> float:
        return (self.bid_price() + self.ask_price()) / 2

    def bid_ask_spread(self) -> float:
        return self.ask_price() - self.bid_price()

    def market_depth(self) -> float:
        return self.ascending_asks[-1].dollars - \
            self.descending_bids[-1].dollars

Next we want to write methods for LOs and MOs to interact with the OrderBook. Notice
that each of Equation (8.1) (new Sell LO potentially removing some of the beginning of
the Buy LOs on the OB), Equation (8.3) (new Buy LO potentially removing some of the
beginning of the Sell LOs on the OB), Equation (8.5) (Sell MO removing some of the
beginning of the Buy LOs on the OB) and Equation (8.7) (Buy MO removing some of the
beginning of the Sell LOs on the OB) all perform a common core function - they “eat into”
the most significant LOs (on the opposite side) on the OB. So we first write a @staticmethod
eat_book for this common function.
eat_book takes as input a ps_pairs: PriceSizePairs (representing one side of the OB)
and the number of shares: int to buy/sell. Notice eat_book’s return type: Tuple[DollarsAndShares,
PriceSizePairs]. The returned DollarsAndShares represents the pair of dollars transacted
and the number of shares transacted (with number of shares transacted being less than
or equal to the input shares). The returned PriceSizePairs represents the remainder
of ps_pairs after the transacted number of shares have eaten into the input ps_pairs.
eat_book first deletes (i.e. “eats up”) as much of the beginning of the ps_pairs: PriceSizePairs
data structure as it can (basically matching the input number of shares with an appropri-
ate number of shares at the beginning of the ps_pairs: PriceSizePairs input). Note that
the returned PriceSizePairs is a separate data structure, ensuring the immutability of the
input ps_pairs: PriceSizePairs.
@staticmethod
def eat_book(
    ps_pairs: PriceSizePairs,
    shares: int
) -> Tuple[DollarsAndShares, PriceSizePairs]:
    rem_shares: int = shares
    dollars: float = 0.
    for i, d_s in enumerate(ps_pairs):
        this_price: float = d_s.dollars
        this_shares: int = d_s.shares
        dollars += this_price * min(rem_shares, this_shares)
        if rem_shares < this_shares:
            # Requested shares fully transacted within this price level:
            # return the partially-consumed level plus the rest of the book
            return (
                DollarsAndShares(dollars=dollars, shares=shares),
                [DollarsAndShares(
                    dollars=this_price,
                    shares=this_shares - rem_shares
                )] + list(ps_pairs[i+1:])
            )
        else:
            rem_shares -= this_shares
    # The entire book side was consumed; possibly fewer shares transacted
    # than requested
    return (
        DollarsAndShares(dollars=dollars, shares=shares - rem_shares),
        []
    )

Now we are ready to write the method sell_limit_order which takes Sell LO Price and
Sell LO shares as input. As you can see in the code below, first it potentially removes
(if it “crosses”) an appropriate number of shares on the Buy LOs side of the OB (using
the @staticmethod eat_book), and then potentially adds an appropriate number of shares
at the Sell LO Price on the Sell LOs side of the OB. sell_limit_order returns a pair of
DollarsAndShares type and OrderBook type. The returned DollarsAndShares represents the
pair of dollars transacted and the number of shares transacted with the Buy LOs side of
the OB (with number of shares transacted being less than or equal to the input shares).
The returned OrderBook represents the new OB after potentially eating into the Buy LOs
side of the OB and then potentially adding some shares at the Sell LO Price on the Sell
LOs side of the OB. Note that the returned OrderBook is a newly-created data structure,
ensuring the immutability of self. We urge you to read the code below carefully as there
are many subtle details that are handled in the code.
from dataclasses import replace

def sell_limit_order(self, price: float, shares: int) -> \
        Tuple[DollarsAndShares, OrderBook]:
    index: Optional[int] = next((i for i, d_s
                                 in enumerate(self.descending_bids)
                                 if d_s.dollars < price), None)
    eligible_bids: PriceSizePairs = self.descending_bids \
        if index is None else self.descending_bids[:index]
    ineligible_bids: PriceSizePairs = [] if index is None else \
        self.descending_bids[index:]
    d_s, rem_bids = OrderBook.eat_book(eligible_bids, shares)
    new_bids: PriceSizePairs = list(rem_bids) + list(ineligible_bids)
    rem_shares: int = shares - d_s.shares
    if rem_shares > 0:
        new_asks: List[DollarsAndShares] = list(self.ascending_asks)
        index1: Optional[int] = next((i for i, d_s
                                      in enumerate(new_asks)
                                      if d_s.dollars >= price), None)
        if index1 is None:
            new_asks.append(DollarsAndShares(
                dollars=price,
                shares=rem_shares
            ))
        elif new_asks[index1].dollars != price:
            new_asks.insert(index1, DollarsAndShares(
                dollars=price,
                shares=rem_shares
            ))
        else:
            new_asks[index1] = DollarsAndShares(
                dollars=price,
                shares=new_asks[index1].shares + rem_shares
            )
        return d_s, OrderBook(
            ascending_asks=new_asks,
            descending_bids=new_bids
        )
    else:
        return d_s, replace(
            self,
            descending_bids=new_bids
        )

Next, we write the easier method sell_market_order which takes as input the number
of shares to be sold (as a market order). sell_market_order transacts with the appropri-
ate number of shares on the Buy LOs side of the OB (removing those many shares from
the Buy LOs side). It returns a pair of DollarsAndShares type and OrderBook type. The
returned DollarsAndShares represents the pair of dollars transacted and the number of
shares transacted (with number of shares transacted being less than or equal to the in-
put shares). The returned OrderBook represents the remainder of the OB after the trans-
acted number of shares have eaten into the Buy LOs side of the OB. Note that the returned
OrderBook is a newly-created data structure, ensuring the immutability of self.

def sell_market_order(
    self,
    shares: int
) -> Tuple[DollarsAndShares, OrderBook]:
    d_s, rem_bids = OrderBook.eat_book(
        self.descending_bids,
        shares
    )
    return (d_s, replace(self, descending_bids=rem_bids))

We won’t list the methods buy_limit_order and buy_market_order here as they are com-
pletely analogous (you can find the entire code for OrderBook in the file rl/chapter9/order_book.py).
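To give a flavor of that symmetry, here is a minimal sketch of what buy_market_order could look
like, simply mirroring sell_market_order on the asks side (this is our own illustrative version; the
code in the file may differ in details):

def buy_market_order(
    self,
    shares: int
) -> Tuple[DollarsAndShares, OrderBook]:
    # Mirror image of sell_market_order: a Buy MO eats into the Sell LOs
    # (ascending_asks) side of the OB
    d_s, rem_asks = OrderBook.eat_book(
        self.ascending_asks,
        shares
    )
    return (d_s, replace(self, ascending_asks=rem_asks))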
Now let us test out this code by creating a sample OrderBook and submitting some LOs and
MOs to interact with the OrderBook.

from numpy.random import poisson

bids: PriceSizePairs = [DollarsAndShares(
    dollars=x,
    shares=poisson(100. - (100 - x) * 10)
) for x in range(100, 90, -1)]
asks: PriceSizePairs = [DollarsAndShares(
    dollars=x,
    shares=poisson(100. - (x - 105) * 10)
) for x in range(105, 115, 1)]
ob0: OrderBook = OrderBook(descending_bids=bids, ascending_asks=asks)

The above code creates an OrderBook in the price range [91, 114] with a bid-ask spread
of 5. Figure 8.2 depicts this OrderBook visually.
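As a quick (hypothetical) sanity check of the summary methods on this starting book (the share
counts are random via poisson, but the price levels are deterministic), we would expect:

print(ob0.bid_price())       # 100
print(ob0.ask_price())       # 105
print(ob0.mid_price())       # 102.5
print(ob0.bid_ask_spread())  # 5
print(ob0.market_depth())    # 23 (= 114 - 91)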
Let’s submit a Sell LO that says we’d like to sell 40 shares as long as the transacted price
is greater than or equal to 107. Our Sell LO should simply get added to the Sell LOs side
of the OB.

d_s1, ob1 = ob0.sell_limit_order(107, 40)

The new OrderBook ob1 has 40 more shares at the price level of 107, as depicted in Figure
8.3.
Now let’s submit a Sell MO that says we’d like to sell 120 shares at the “best price.” Our
Sell MO should transact with 120 shares at the “best prices” of 100 and 99 (since the
OB does not have enough Buy LO shares at the price of 100).

d_s2, ob2 = ob1.sell_market_order(120)

The new OrderBook ob2 has 120 fewer shares on the Buy LOs side of the OB, as depicted
in Figure 8.4.
Now let’s submit a Buy LO that says we’d like to buy 80 shares as long as the transacted
price is less than or equal to 100. Our Buy LO should get added to the Buy LOs side of the
OB.

d_s3, ob3 = ob2.buy_limit_order(100, 80)

Figure 8.2.: Starting Order Book

Figure 8.3.: Order Book after Sell LO

Figure 8.4.: Order Book after Sell MO

Figure 8.5.: Order Book after Buy LO

Figure 8.6.: Order Book after 2nd Sell LO

The new OrderBook ob3 has re-introduced a Buy LO at the price level of 100 (now with
80 shares), as depicted in Figure 8.5.
Now let’s submit a Sell LO that says we’d like to sell 60 shares as long as the transacted
price is greater than or equal to 104. Our Sell LO should get added to the Sell LOs side of
the OB.
d_s4, ob4 = ob3.sell_limit_order(104, 60)

The new OrderBook ob4 has introduced a Sell LO at a price of 104 with 60 shares, as
depicted in Figure 8.6.
Now let’s submit a Buy MO that says we’d like to buy 150 shares at the “best price.” Our
Buy MO should transact with 150 shares at “best prices” on the Sell LOs side of the OB.
d_s5, ob5 = ob4.buy_market_order(150)

The new OrderBook ob5 has 150 fewer shares on the Sell LOs side of the OB, wiping out all
the shares at the price level of 104 and almost wiping out all the shares at the price level
of 105, as depicted in Figure 8.7.
This has served as a good test of our code (transactions working as we’d like) and
we encourage you to write more code of this sort to interact with the OrderBook, and
to produce graphs of evolution of the OrderBook as this will help develop stronger intu-
ition and internalize the concepts we’ve learnt above. All of the above code is in the file
rl/chapter9/order_book.py.
Now we are ready to get started with the problem of Optimal Execution of a large-sized
Market Order.

8.2. Optimal Execution of a Market Order


Imagine the following problem: You are a trader in a stock and your boss has instructed
that you exit from trading in this stock because this stock doesn’t meet your company’s

Figure 8.7.: Order Book after Buy MO

new investment objectives. You have to sell all of the N shares you own in this stock in
the next T hours, but you have been instructed to accomplish the sale by submitting only
Market Orders (not allowed to submit any Limit Orders because of the uncertainty in the
time of execution of the sale with a Limit Order). You can submit sell market orders (of
any size) at the start of each hour - so you have T opportunities to submit market orders
of any size. Your goal is to maximize the Expected Total Utility of sales proceeds for all N
shares over the T hours. Your task is to break up N into T appropriate chunks to maximize
the Expected Total Utility objective. If you attempt to sell the N shares too fast (i.e., too
many in the first few hours), as we’ve learnt above, each (MO) sale will eat a lot into the
Buy LOs on the OB (Temporary Price Impact) which would result in transacting at prices
below the best price (Best Bid Price). Moreover, you risk moving the Best Bid Price on the
OB significantly lower (Permanent Price Impact) that would affect the sales proceeds for
the next few sales you’d make. On the other hand, if you sell the N shares too slow (i.e., too
few in the first few hours), you might transact at good prices but then you risk running
out of time, which means you will have to dump a lot of shares as the deadline approaches,
which in turn would mean transacting at prices below the best price. Moreover, selling
too slow exposes you to more uncertainty in market price movements over a longer time
period, and more uncertainty in sales proceeds means the Expected Utility objective gets
hurt. Thus, the precise timing and sizes in the breakup of shares is vital. You will need
to have an estimate of the Temporary and Permanent Price Impact of your Market Orders,
which can help you identify the appropriate number of shares to sell at the start of each
hour.
Unsurprisingly, we can model this problem as a Markov Decision Process control prob-
lem where the actions at each time step (each hour, in this case) are the number of shares
sold at the time step and the rewards are the Utility of sales proceeds at each time step. To
keep things simple and intuitive, we shall model Price Impact of Market Orders in terms of
their effect on the Best Bid Price (rather than in terms of their effect on the entire OB). In
other words, we won’t be modeling the entire OB Price Dynamics, just the Best Bid Price

Dynamics. We shall refer to the OB activity of an MO immediately “eating into the Buy
LOs” (and hence, potentially transacting at prices lower than the best price) as the Tempo-
rary Price Impact. As mentioned earlier, this is followed by subsequent replenishment of
both Buy and Sell LOs on the OB (stabilizing the OB) - we refer to any eventual (end of
the hour) lowering of the Best Bid Price (relative to the Best Bid Price before the MO was
submitted) as the Permanent Price Impact. Modeling the temporary and permanent Price
Impacts separately helps us in deciding on the optimal actions (optimal shares to be sold
at the start of each hour).
Now we develop some formalism to describe this problem precisely. As mentioned
earlier, we make a number of simplifying assumptions in modeling the OB Dynamics for
ease of articulation (without diluting the most important concepts). We index discrete
time by t = 0, 1, . . . , T . We denote Pt as the Best Bid Price on the OB at the start of time
step t (for all t = 0, 1, . . . , T ) and Nt as the number of shares sold at time step t for all
t = 0, 1, . . . , T − 1. We denote the number of shares remaining to be sold at the start of
time step t as Rt for all t = 0, 1, . . . , T . Therefore,
R_t = N − Σ_{i=0}^{t−1} N_i for all t = 0, 1, . . . , T

Note that:

R0 = N
Rt+1 = Rt − Nt for all t = 0, 1, . . . , T − 1
Also note that we need to sell everything by time t = T and so:

NT −1 = RT −1 ⇒ RT = 0
The model of Best Bid Price Dynamics from one time step to the next is given by:
Pt+1 = ft (Pt , Nt , ϵt ) for all t = 0, 1, . . . , T − 1
where ft is an arbitrary function incorporating:
• The Permanent Price Impact of selling Nt shares.
• The Price-Impact-independent market-movement of the Best Bid Price from time t
to time t + 1.
• Noise ϵt , a source of randomness in Best Bid Price movements.
The sales proceeds from the sale at time step t, for all t = 0, 1, . . . , T − 1, is defined as:
Nt · Qt = Nt · (Pt − gt (Pt , Nt ))
where gt is a function modeling the Temporary Price Impact (i.e., the Nt MO “eating into”
the Buy LOs on the OB). Qt should be interpreted as the average Buy LO price transacted
against by the Nt MO at time t.
Lastly, we denote the Utility (of Sales Proceeds) function as U (·).
As mentioned previously, solving for the optimal number of shares to be sold at each
time step can be modeled as a discrete-time finite-horizon Markov Decision Process, which
we describe below in terms of the order of MDP activity at each time step t = 0, 1, . . . , T −1
(the MDP horizon is time T meaning all states at time T are terminal states). We follow
the notational style of finite-horizon MDPs that should now be familiar from previous
chapters.
Order of Events at time step t for all t = 0, 1, . . . , T − 1:

• Observe State st := (Pt , Rt ) ∈ St
• Perform Action at := Nt ∈ At
• Receive Reward rt+1 := U (Nt · Qt ) = U (Nt · (Pt − gt (Pt , Nt )))
• Experience Price Dynamics Pt+1 = ft (Pt , Nt , ϵt ) and set Rt+1 = Rt − Nt so as to
obtain the next state st+1 = (Pt+1 , Rt+1 ) ∈ St+1 .
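As a small worked illustration of this order of events: with T = 3 and N = 90, one (not necessarily
optimal) policy might sell N_0 = 40 in state (P_0, R_0 = 90), then N_1 = 30 in state (P_1, R_1 = 50),
and is then forced to sell N_2 = R_2 = 20 at the final decision step, so that R_3 = 0.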

Note that we have intentionally not specified if Pt , Rt , Nt are integers or real numbers, or
if constrained to be non-negative etc. Those precise specifications will be customized to the
nuances/constraints of the specific Optimal Order Execution problem we'd be solving. By
default, we shall assume that Pt ∈ R+ and Nt , Rt ∈ Z≥0 (as these represent realistic trading
situations), although we do consider special cases later in the chapter where Pt , Rt ∈ R
(unconstrained real numbers for analytical tractability).
The goal is to find the Optimal Policy π^* = (π_0^*, π_1^*, . . . , π_{T−1}^*), defined as
π_t^*((P_t, R_t)) = N_t, that maximizes:

E[Σ_{t=0}^{T−1} γ^t · U(N_t · Q_t)]

where γ is the discount factor to account for the fact that future utility of sales proceeds
can be modeled to be less valuable than today's.
Now let us write some code to solve this MDP. We write a class OptimalOrderExecution
which models a fairly generic MDP for Optimal Order Execution as described above, and
solves the Control problem with Approximate Value Iteration using the backward induc-
tion algorithm that we implemented in Chapter 4. Let us start by taking a look at the
attributes (inputs) in OptimalOrderExecution:

• shares refers to the total number of shares N to be sold over T time steps.
• time_steps refers to the number of time steps T .
• avg_exec_price_diff refers to the time-sequenced functions gt that return the reduc-
tion in the average price obtained by the Market Order at time t due to eating into the
Buy LOs. gt takes as input the type PriceAndShares that represents a pair of price:
float and shares: int (in this case, the price is Pt and the shares is the MO size Nt
at time t). As explained earlier, the sales proceeds at time t is: Nt · (Pt − gt (Pt , Nt )).
• price_dynamics refers to the time-sequenced functions ft that represent the price dy-
namics: Pt+1 ∼ ft (Pt , Nt ). ft outputs a probability distribution of prices for Pt+1 .
• utility_func refers to the Utility of Sales Proceeds function, incorporating any risk-
aversion.
• discount_factor refers to the discount factor γ.
• func_approx refers to the ValueFunctionApprox type to be used to approximate the
Value Function for each time step (since we are doing backward induction).
• initial_price_distribution refers to the probability distribution of prices P0 at time
0, which is used to generate the samples of states at each of the time steps (needed
in the approximate backward induction algorithm).
from dataclasses import dataclass
from typing import Callable, Sequence
from rl.distribution import Distribution
from rl.approximate_dynamic_programming import ValueFunctionApprox

@dataclass(frozen=True)
class PriceAndShares:
    price: float
    shares: int

@dataclass(frozen=True)
class OptimalOrderExecution:
    shares: int
    time_steps: int
    avg_exec_price_diff: Sequence[Callable[[PriceAndShares], float]]
    price_dynamics: Sequence[Callable[[PriceAndShares], Distribution[float]]]
    utility_func: Callable[[float], float]
    discount_factor: float
    func_approx: ValueFunctionApprox[PriceAndShares]
    initial_price_distribution: Distribution[float]

The two key things we need to perform the backward induction are:

• A method get_mdp that given a time step t, produces the MarkovDecisionProcess ob-
ject representing the transitions from time t to time t+1. The class OptimalExecutionMDP
within get_mdp implements the abstract methods step and actions of the abstract
class MarkovDecisionProcess. The code should be fairly self-explanatory - just a cou-
ple of things to point out here. Firstly, the input p_r: NonTerminal[PriceAndShares]
to the step method represents the state (Pt , Rt ) at time t, and the variable p_s: PriceAndShares
represents the pair of (Pt , Nt ), which serves as input to avg_exec_price_diff and
price_dynamics (attributes of OptimalOrderExecution). Secondly, note that the actions
method returns an Iterator on a single int at time t = T −1 because of the constraint
NT −1 = RT −1 .
• A method get_states_distribution that returns the probability distribution of states
(Pt , Rt ) at time t (of type SampledDistribution[NonTerminal[PriceAndShares]]). The
code here is similar to the get_states_distribution method of AssetAllocDiscrete
in Chapter 6 (essentially, walking forward from time 0 to time t by sampling from
the state-transition probability distribution and also sampling from uniform choices
over all actions at each time step).

def get_mdp(self, t: int) -> MarkovDecisionProcess[PriceAndShares, int]:
    utility_f: Callable[[float], float] = self.utility_func
    price_diff: Sequence[Callable[[PriceAndShares], float]] = \
        self.avg_exec_price_diff
    dynamics: Sequence[Callable[[PriceAndShares], Distribution[float]]] = \
        self.price_dynamics
    steps: int = self.time_steps

    class OptimalExecutionMDP(MarkovDecisionProcess[PriceAndShares, int]):

        def step(
            self,
            p_r: NonTerminal[PriceAndShares],
            sell: int
        ) -> SampledDistribution[Tuple[State[PriceAndShares],
                                       float]]:

            def sr_sampler_func(
                p_r=p_r,
                sell=sell
            ) -> Tuple[State[PriceAndShares], float]:
                p_s: PriceAndShares = PriceAndShares(
                    price=p_r.state.price,
                    shares=sell
                )
                next_price: float = dynamics[t](p_s).sample()
                next_rem: int = p_r.state.shares - sell
                next_state: PriceAndShares = PriceAndShares(
                    price=next_price,
                    shares=next_rem
                )
                reward: float = utility_f(
                    sell * (p_r.state.price - price_diff[t](p_s))
                )
                return (NonTerminal(next_state), reward)

            return SampledDistribution(
                sampler=sr_sampler_func,
                expectation_samples=100
            )

        def actions(self, p_s: NonTerminal[PriceAndShares]) -> \
                Iterator[int]:
            if t == steps - 1:
                return iter([p_s.state.shares])
            else:
                return iter(range(p_s.state.shares + 1))

    return OptimalExecutionMDP()

def get_states_distribution(self, t: int) -> \
        SampledDistribution[NonTerminal[PriceAndShares]]:

    def states_sampler_func() -> NonTerminal[PriceAndShares]:
        price: float = self.initial_price_distribution.sample()
        rem: int = self.shares
        for i in range(t):
            sell: int = Choose(range(rem + 1)).sample()
            price = self.price_dynamics[i](PriceAndShares(
                price=price,
                shares=rem
            )).sample()
            rem -= sell
        return NonTerminal(PriceAndShares(
            price=price,
            shares=rem
        ))

    return SampledDistribution(states_sampler_func)

Finally, we produce the Optimal Value Function and Optimal Policy for each time step
with the following method backward_induction_vf_and_pi:
from rl.approximate_dynamic_programming import back_opt_vf_and_policy

def backward_induction_vf_and_pi(
    self
) -> Iterator[Tuple[ValueFunctionApprox[PriceAndShares],
                    DeterministicPolicy[PriceAndShares, int]]]:
    mdp_f0_mu_triples: Sequence[Tuple[
        MarkovDecisionProcess[PriceAndShares, int],
        ValueFunctionApprox[PriceAndShares],
        SampledDistribution[NonTerminal[PriceAndShares]]
    ]] = [(
        self.get_mdp(i),
        self.func_approx,
        self.get_states_distribution(i)
    ) for i in range(self.time_steps)]
    num_state_samples: int = 10000
    error_tolerance: float = 1e-6
    return back_opt_vf_and_policy(
        mdp_f0_mu_triples=mdp_f0_mu_triples,
        gamma=self.discount_factor,
        num_state_samples=num_state_samples,
        error_tolerance=error_tolerance
    )

The above code is in the file rl/chapter9/optimal_order_execution.py. We encourage
you to create a few different instances of OptimalOrderExecution by varying its inputs (try
different temporary and permanent price impact functions, different utility functions, im-
pose a few constraints etc.). Note that the above code has been written with an educa-
tional motivation rather than an efficient-computation motivation, so the convergence of
the backward induction ADP algorithm is going to be slow. How do we know that the
above code is correct? Well, we need to create a simple special case that yields a closed-
form solution that we can compare the Optimal Value Function and Optimal Policy pro-
duced by OptimalOrderExecution against. This will be the subject of the following subsec-
tion.

8.2.1. Simple Linear Price Impact Model with no Risk-Aversion


Now we consider a special case of the above-described MDP - a simple linear Price Impact
model with no risk-aversion. Furthermore, for analytical tractability, we assume N, Nt , Pt
are all unconstrained continuous-valued (i.e., taking values ∈ R).
We assume simple linear price dynamics as follows:

P_{t+1} = f_t(P_t, N_t, ϵ_t) = P_t − α · N_t + ϵ_t
where α ∈ R and ϵt for all t = 0, 1, . . . , T − 1 are independent and identically distributed
(i.i.d.) with E[ϵt |Nt , Pt ] = 0. Therefore, the Permanent Price Impact (as an Expectation)
is α · Nt .
As for the Temporary Price Impact, we know that gt needs to be a non-decreasing func-
tion of Nt . We assume a simple linear form for gt as follows:

gt (Pt , Nt ) = β · Nt for all t = 0, 1, . . . , T − 1


for some constant β ∈ R≥0 . So, Qt = Pt − βNt . As mentioned above, we assume no
risk-aversion, i.e., the Utility function U (·) is assumed to be the identity function. Also, we
assume that the MDP discount factor γ = 1.
Note that all of these assumptions are far too simplistic and hence, an unrealistic model
of the real-world, but starting with this simple model helps build good intuition and en-
ables us to develop more realistic models by incrementally adding complexity/nuances
from this simple base model.
As ever, in order to solve the Control problem, we define the Optimal Value Function and
invoke the Bellman Optimality Equation. We shall use the standard notation for discrete-
time finite-horizon MDPs that we are now very familiar with.
Denote the Value Function for policy π at time t (for all t = 0, 1, . . . , T − 1) as:

V_t^π((P_t, R_t)) = E_π[Σ_{i=t}^{T−1} N_i · (P_i − β · N_i) | (P_t, R_t)]

Denote the Optimal Value Function at time t (for all t = 0, 1, . . . , T − 1) as:

V_t^*((P_t, R_t)) = max_π V_t^π((P_t, R_t))

The Optimal Value Function satisfies the finite-horizon Bellman Optimality Equation
for all t = 0, 1, . . . , T − 2, as follows:

V_t^*((P_t, R_t)) = max_{N_t} {N_t · (P_t − β · N_t) + E[V_{t+1}^*((P_{t+1}, R_{t+1}))]}

and

V_{T−1}^*((P_{T−1}, R_{T−1})) = N_{T−1} · (P_{T−1} − β · N_{T−1}) = R_{T−1} · (P_{T−1} − β · R_{T−1})

From the above, we can infer:

V_{T−2}^*((P_{T−2}, R_{T−2}))
= max_{N_{T−2}} {N_{T−2} · (P_{T−2} − β · N_{T−2}) + E[R_{T−1} · (P_{T−1} − β · R_{T−1})]}
= max_{N_{T−2}} {N_{T−2} · (P_{T−2} − β · N_{T−2}) + E[(R_{T−2} − N_{T−2}) · (P_{T−1} − β · (R_{T−2} − N_{T−2}))]}
= max_{N_{T−2}} {N_{T−2} · (P_{T−2} − β · N_{T−2}) + (R_{T−2} − N_{T−2}) · (P_{T−2} − α · N_{T−2} − β · (R_{T−2} − N_{T−2}))}

This simplifies to:

V_{T−2}^*((P_{T−2}, R_{T−2})) = max_{N_{T−2}} {R_{T−2} · P_{T−2} − β · R_{T−2}^2 + (α − 2β) · (N_{T−2}^2 − N_{T−2} · R_{T−2})}    (8.9)

For the case α ≥ 2β, noting that N_{T−2} ≤ R_{T−2}, we have the trivial solution:

N_{T−2}^* = 0 or N_{T−2}^* = R_{T−2}

Substituting either of these two values for N_{T−2}^* in the right-hand-side of Equation (8.9)
gives:

V_{T−2}^*((P_{T−2}, R_{T−2})) = R_{T−2} · (P_{T−2} − β · R_{T−2})

Continuing backwards in time in this manner (for the case α ≥ 2β) gives:

N_t^* = 0 or N_t^* = R_t for all t = 0, 1, . . . , T − 1

V_t^*((P_t, R_t)) = R_t · (P_t − β · R_t) for all t = 0, 1, . . . , T − 1

So the solution for the case α ≥ 2β is to sell all N shares at any one of the time steps
t = 0, 1, . . . , T − 1 (and none in the other time steps), and the Optimal Expected Total Sale
Proceeds is N · (P_0 − β · N).
For the case α < 2β, differentiating the term inside the max in Equation (8.9) with
respect to N_{T−2}, and setting it to 0 gives:

(α − 2β) · (2N_{T−2}^* − R_{T−2}) = 0 ⇒ N_{T−2}^* = R_{T−2} / 2

Substituting this solution for N_{T−2}^* in Equation (8.9) gives:

V_{T−2}^*((P_{T−2}, R_{T−2})) = R_{T−2} · P_{T−2} − R_{T−2}^2 · (α + 2β) / 4

Continuing backwards in time in this manner gives:

N_t^* = R_t / (T − t) for all t = 0, 1, . . . , T − 1

V_t^*((P_t, R_t)) = R_t · P_t − (R_t^2 / 2) · (2β + α · (T − t − 1)) / (T − t) for all t = 0, 1, . . . , T − 1
Rolling forward in time, we see that N_t^* = N/T, i.e., splitting the N shares uniformly across
the T time steps. Hence, the Optimal Policy is a constant deterministic function (i.e., inde-
pendent of the State). Note that a uniform split makes intuitive sense because Price Impact
and Market Movement are both linear and additive, and don't interact. This optimization
is essentially equivalent to minimizing Σ_{t=1}^T N_t^2 with the constraint Σ_{t=1}^T N_t = N. The
Optimal Expected Total Sales Proceeds is equal to:

N · P_0 − (N^2 / 2) · (α + (2β − α) / T)

Implementation Shortfall is the technical term used to refer to the reduction in Total Sales
Proceeds relative to the maximum possible sales proceeds (= N · P_0). So, in this simple
linear model, the Implementation Shortfall from Price Impact is (N^2 / 2) · (α + (2β − α) / T).
Note that the Implementation Shortfall is non-zero even if one had infinite time available (T → ∞)
for the case of α > 0. If Price Impact were purely temporary (α = 0, i.e., Price fully snapped
back), then the Implementation Shortfall is zero if one had infinite time available.
So now let’s customize the class OptimalOrderExecution to this simple linear price im-
pact model, and compare the Optimal Value Function and Optimal Policy produced by
OptimalOrderExecution against the above-derived closed-form solutions. We write code
below to create an instance of OptimalOrderExecution with time steps T = 5, total number
of shares to be sold N = 100, linear permanent price impact with α = 0.03, linear tempo-
rary price impact with β = 0.05, utility function as the identity function (no risk-aversion),
and discount factor γ = 1. We set the standard deviation for the price dynamics probabil-
ity distribution to 0 to speed up the calculation. Since we know the closed-form solution
for the Optimal Value Function, we provide some assistance to OptimalOrderExecution by
setting up a linear function approximation with two features: P_t · R_t and R_t^2. The task
of OptimalOrderExecution is to infer the correct coefficients of these features for each time
step. If the coefficients match that of the closed-form solution, it provides a great degree
of confidence that our code is working correctly.
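For these parameter choices, the closed-form solution derived above gives, at time 0 with
P_0 = 100 and R_0 = 100: N_0^* = 100/5 = 20 and V_0^* = 100 · 100 − (100^2/2) · (2 · 0.05 + 0.03 · 4)/5
= 10000 − 5000 · 0.044 = 9780, with feature weights 1 and −(2β + α · (T − 1))/(2T) = −0.022.
These are the values we should expect the code below to (approximately) reproduce.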

num_shares: int = 100
num_time_steps: int = 5
alpha: float = 0.03
beta: float = 0.05
init_price_mean: float = 100.0
init_price_stdev: float = 10.0

price_diff = [lambda p_s: beta * p_s.shares for _ in range(num_time_steps)]
dynamics = [lambda p_s: Gaussian(
    mu=p_s.price - alpha * p_s.shares,
    sigma=0.
) for _ in range(num_time_steps)]
ffs = [
    lambda p_s: p_s.state.price * p_s.state.shares,
    lambda p_s: float(p_s.state.shares * p_s.state.shares)
]
fa: FunctionApprox = LinearFunctionApprox.create(feature_functions=ffs)
init_price_distrib: Gaussian = Gaussian(
    mu=init_price_mean,
    sigma=init_price_stdev
)

ooe: OptimalOrderExecution = OptimalOrderExecution(
    shares=num_shares,
    time_steps=num_time_steps,
    avg_exec_price_diff=price_diff,
    price_dynamics=dynamics,
    utility_func=lambda x: x,
    discount_factor=1,
    func_approx=fa,
    initial_price_distribution=init_price_distrib
)
it_vf: Iterator[Tuple[ValueFunctionApprox[PriceAndShares],
                      DeterministicPolicy[PriceAndShares, int]]] = \
    ooe.backward_induction_vf_and_pi()

Next we evaluate this Optimal Value Function and Optimal Policy on a particular state
for all time steps, and compare that against the closed-form solution. The state we use for
evaluation is as follows:

state: PriceAndShares = PriceAndShares(
    price=init_price_mean,
    shares=num_shares
)

The code to evaluate the obtained Optimal Value Function and Optimal Policy on the
above state is as follows:

for t, (vf, pol) in enumerate(it_vf):
    print(f"Time {t:d}")
    print()
    opt_sale: int = pol.action_for(state)
    val: float = vf(NonTerminal(state))
    print(f"Optimal Sales = {opt_sale:d}, Opt Val = {val:.3f}")
    print()
    print("Optimal Weights below:")
    print(vf.weights.weights)
    print()

With 100,000 state samples for each time step and only 10 state transition samples (since
the standard deviation of ϵ is set to be very small), this prints the following:

Time 0

Optimal Sales = 20, Opt Val = 9779.976

Optimal Weights below:


[ 0.9999948 -0.02199718]

Time 1

Optimal Sales = 20, Opt Val = 9762.479

Optimal Weights below:


[ 0.9999935 -0.02374564]

Time 2

Optimal Sales = 20, Opt Val = 9733.324

Optimal Weights below:


[ 0.99999335 -0.02666098]

Time 3

Optimal Sales = 20, Opt Val = 9675.013

Optimal Weights below:


[ 0.99999316 -0.03249182]

Time 4

Optimal Sales = 20, Opt Val = 9500.000

Optimal Weights below:


[ 1. -0.05]

Now let’s compare these results against the closed-form solution.

for t in range(num_time_steps):
    print(f"Time {t:d}")
    print()
    left: int = num_time_steps - t
    opt_sale_anal: float = num_shares / num_time_steps
    wt1: float = 1
    wt2: float = -(2 * beta + alpha * (left - 1)) / (2 * left)
    val_anal: float = wt1 * state.price * state.shares + \
        wt2 * state.shares * state.shares
    print(f"Optimal Sales = {opt_sale_anal:.3f}, Opt Val = {val_anal:.3f}")
    print(f"Weight1 = {wt1:.3f}")
    print(f"Weight2 = {wt2:.3f}")
    print()

This prints the following:

Time 0

Optimal Sales = 20.000, Opt Val = 9780.000


Weight1 = 1.000

Weight2 = -0.022

Time 1

Optimal Sales = 20.000, Opt Val = 9762.500


Weight1 = 1.000
Weight2 = -0.024

Time 2

Optimal Sales = 20.000, Opt Val = 9733.333


Weight1 = 1.000
Weight2 = -0.027

Time 3

Optimal Sales = 20.000, Opt Val = 9675.000


Weight1 = 1.000
Weight2 = -0.033

Time 4

Optimal Sales = 20.000, Opt Val = 9500.000


Weight1 = 1.000
Weight2 = -0.050

We need to point out here that the general case of optimal order execution involving
modeling of the entire Order Book’s dynamics will have to deal with a large state space.
This means the ADP algorithm will suffer from the curse of dimensionality, which means
we will need to employ RL algorithms.

8.2.2. Paper by Bertsimas and Lo on Optimal Order Execution


A paper by Bertsimas and Lo on Optimal Order Execution (Bertsimas and Lo 1998) con-
sidered a special case of the simple Linear Impact model we sketched above. Specifically,
they assumed no risk-aversion (Utility function is identity function) and assumed that the
Permanent Price Impact parameter α is equal to the Temporary Price Impact Parameter β.
In the same paper, Bertsimas and Lo then extended this Linear Impact Model to include
dependence on a serially-correlated variable Xt as follows:

Pt+1 = Pt − (β · Nt + θ · Xt ) + ϵt

Xt+1 = ρ · Xt + ηt

Qt = Pt − (β · Nt + θ · Xt )
where ϵt and ηt are each independent and identically distributed random variables with
mean zero for all t = 0, 1, . . . , T − 1, ϵt and ηt are also independent of each other for all
t = 0, 1, . . . , T − 1. Xt can be thought of as a market factor affecting Pt linearly. Applying
the finite-horizon Bellman Optimality Equation on the Optimal Value Function (and the

same backward-recursive approach as before) yields:

N_t^* = R_t / (T − t) + h(t, β, θ, ρ) · X_t

V_t^*((P_t, R_t, X_t)) = R_t · P_t − (quadratic in (R_t, X_t) + constant)
Essentially, the serial-correlation predictability (ρ ≠ 0) alters the uniform-split strategy.
In the same paper, Bertsimas and Lo presented a more realistic model called Linear-
Percentage Temporary (abbreviated as LPT) Price Impact model, whose salient features in-
clude:

• Geometric random walk: consistent with real data, and avoids non-positive prices.
• Fractional Price Impact g_t(P_t, N_t)/P_t doesn't depend on P_t (this is validated by real data).
• Purely Temporary Price Impact, i.e., the price Pt snaps back after the Temporary
Price Impact (no Permanent effect of Market Orders on future prices).

The specific model is:

P_{t+1} = P_t · e^{Z_t}

X_{t+1} = ρ · X_t + η_t

Q_t = P_t · (1 − β · N_t − θ · X_t)

where Z_t are independent and identically distributed random variables with mean µ_Z
and variance σ_Z^2, η_t are independent and identically distributed random variables with
mean zero for all t = 0, 1, . . . , T − 1, and Z_t and η_t are independent of each other for all
t = 0, 1, . . . , T − 1. X_t can be thought of as a market factor affecting P_t multiplicatively. With
the same derivation methodology as before, we get the solution:

N_t^* = c_t^{(1)} + c_t^{(2)} · R_t + c_t^{(3)} · X_t

V_t^*((P_t, R_t, X_t)) = e^{µ_Z + σ_Z^2/2} · P_t · (c_t^{(4)} + c_t^{(5)} · R_t + c_t^{(6)} · X_t + c_t^{(7)} · R_t^2 + c_t^{(8)} · X_t^2 + c_t^{(9)} · R_t · X_t)

where c_t^{(k)}, 1 ≤ k ≤ 9, are constants (independent of P_t, R_t, X_t).
As an exercise, we recommend implementing the above (LPT) model by customizing
OptimalOrderExecution to compare the obtained Optimal Value Function and Optimal Pol-
icy against the closed-form solution (you can find the exact expressions for the c_t^{(k)}
coefficients in the Bertsimas and Lo paper).
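To help get started on that exercise, here is a hedged sketch of how the price-impact and
price-dynamics inputs to OptimalOrderExecution might be specified for the special case θ = 0
(so that the market factor X_t drops out and the state remains (P_t, R_t)); the names mu_z,
sigma_z, lpt_beta, lpt_price_diff and lpt_dynamics are our own illustrative choices, not names
from the book's codebase:

import numpy as np
from rl.distribution import SampledDistribution

mu_z: float = 0.0       # mean of Z_t
sigma_z: float = 0.02   # standard deviation of Z_t
lpt_beta: float = 0.05  # fractional temporary price impact per share

# Q_t = P_t * (1 - lpt_beta * N_t)  =>  g_t(P_t, N_t) = lpt_beta * P_t * N_t
lpt_price_diff = [
    (lambda p_s: lpt_beta * p_s.price * p_s.shares)
    for _ in range(num_time_steps)
]

# P_{t+1} = P_t * exp(Z_t) with Z_t ~ N(mu_z, sigma_z^2): purely temporary
# impact, so the number of shares sold does not enter the price dynamics
lpt_dynamics = [
    (lambda p_s: SampledDistribution(
        sampler=lambda p_s=p_s: p_s.price * np.exp(
            np.random.normal(loc=mu_z, scale=sigma_z)
        )
    ))
    for _ in range(num_time_steps)
]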

8.2.3. Incorporating Risk-Aversion and Real-World Considerations


Bertsimas and Lo ignored risk-aversion for the purpose of analytical tractability. Although
there was value in obtaining closed-form solutions, ignoring risk-aversion makes their
model unrealistic. We have discussed in detail in Chapter 5 about the fact that traders are
wary of the risk of uncertain revenues and would be willing to trade away some expected
revenues for lower variance of revenues. This calls for incorporating risk-aversion in the
maximization objective. Almgren and Chriss wrote an important paper (Almgren and
Chriss 2000) where they work in this Risk-Aversion framework. They consider our simple
linear price impact model and incorporate risk-aversion by maximizing E[Y] − λ · Var[Y]
where Y is the total (uncertain) sales proceeds Σ_{t=0}^{T−1} N_t · Q_t and λ controls the degree of
risk-aversion. The incorporation of risk-aversion affects the time-trajectory of N_t^*. Clearly,
if λ = 0, we get the usual uniform-split strategy: N_t^* = N/T. The other extreme assumption
is to minimize Var[Y], which yields N_0^* = N (sell everything immediately because the
only thing we want to avoid is uncertainty of sales proceeds).
and Chriss go on to derive the Efficient Frontier for this problem (analogous to the Effi-
cient Frontier Portfolio Theory we outline in Appendix B). They also derive solutions for
specific utility functions.
To model a real-world trading situation, the first step is to start with the MDP we de-
scribed earlier with an appropriate model for the price dynamics ft (·) and the temporary
price impact gt (·) (incorporating potential time-heterogeneity, non-linear price dynamics
and non-linear impact). The OptimalOrderExecution class we wrote above allows us to
incorporate all of the above. We can also model various real-world “frictions” such as
discrete prices, discrete number of shares, constraints on prices and number of shares, as
well as trading fees. To make the model truer to reality and more sophisticated, we can
introduce various market factors in the State which would invariably lead to bloating of
the State Space. We would also need to capture Cross-Asset Market Impact. As a further
step, we could represent the entire Order Book (or a compact summary of the size/shape
of the Order book) as part of the state, which leads to further bloating of the state space.
All of this makes ADP infeasible and one would need to employ Reinforcement Learning
algorithms. More importantly, we’d need to write a realistic Order Book Dynamics sim-
ulator capturing all of the above real-world considerations that an RL algorithm would
learn from. There are a lot of practical and technical details involved in writing a real-
world simulator and we won’t be covering those details in this book. It suffices for here
to say that the simulator would essentially be a sampling model that has learnt the Order
Book Dynamics from market data (supervised learning of the Order Book Dynamics). Us-
ing such a simulator and with a deep learning-based function approximation of the Value
Function, we can solve a practical Optimal Order Execution problem with Reinforcement
Learning. We refer you to a couple of papers for further reading on this:

• Paper by Nevmyvaka, Feng, Kearns in 2006 (Nevmyvaka, Feng, and Kearns 2006)
• Paper by Vyetrenko and Xu in 2019 (Vyetrenko and Xu 2019)

Designing real-world simulators for Order Book Dynamics and using Reinforcement
Learning for Optimal Order Execution is an exciting area for future research as well as
engineering design. We hope this section has provided sufficient foundations for you to
dig into this topic further.

8.3. Optimal Market-Making


Now we move on to the second problem of this chapter involving trading on an Order
Book - the problem of Optimal Market-Making. A market-maker is a company/individual
which/who regularly quotes bid and ask prices in a financial asset (which, without loss of
generality, we will refer to as a “stock”). The market-maker typically holds some inventory
in the stock, always looking to buy at one’s quoted bid price and sell at one’s quoted ask
price, thus looking to make money off the spread between one’s quoted ask price and
one’s quoted bid price. The business of a market-maker is similar to that of a car dealer
who maintains an inventory of cars and who will offer purchase and sales prices, looking
to make a profit off the price spread and ensuring that the inventory of cars doesn’t get
too big. In this section, we consider the business of a market-maker who quotes one’s

bid prices by submitting Buy LOs on an OB and quotes one’s ask prices by submitting
Sell LOs on the OB. Market-makers are known as liquidity providers in the market because
they make shares of the stock available for trading on the OB (both on the buy side and
sell side). In general, anyone who submits LOs can be thought of as a market liquidity
provider. Likewise, anyone who submits MOs can be thought of as a market liquidity taker
(because an MO takes shares out of the volume that was made available for trading on the
OB).
There is typically a fairly complex interplay between liquidity providers (including market-
makers) and liquidity takers. Modeling OB dynamics is about modeling this complex in-
terplay, predicting arrivals of MOs and LOs, in response to market events and in response
to observed activity on the OB. In this section, we view the OB from the perspective of a
single market-maker who aims to make money with Buy/Sell LOs of appropriate bid-ask
spread and with appropriate volume of shares (specified in their submitted LOs). The
market-maker is likely to be successful if she can do a good job of forecasting OB Dynam-
ics and dynamically adjusting her Buy/Sell LOs on the OB. The goal of the market-maker
is to maximize one’s Utility of Gains at the end of a suitable horizon of time.
The core intuition in the decision of how to set the price and shares in the market-maker’s
Buy and Sell LOs is as follows: If the market-maker’s bid-ask spread is too narrow, they
will have more frequent transactions but smaller gains per transaction (more likelihood of
their LOs being transacted against by an MO or an opposite-side LO). On the other hand,
if the market-maker’s bid-ask spread is too wide, they will have less frequent transactions
but larger gains per transaction (less likelihood of their LOs being transacted against by
an MO or an opposite-side LO). Also of great importance is the fact that a market-maker
needs to carefully manage potentially large inventory buildup (either on the long side
or the short side) so as to avoid scenarios of consequent unfavorable forced liquidation
upon reaching the horizon time. Inventory buildup can occur if the market participants
consistently transact against mostly one side of the market-maker’s submitted LOs. With
this high-level intuition, let us make these concepts of market-making precise. We start
by developing some notation to help articulate the problem of Optimal Market-Making
clearly. We will re-use some of the notation and terminology we had developed for the
problem of Optimal Order Execution. As ever, for ease of exposition, we will simplify the
setting for the Optimal Market-Making problem.
Assume there are a finite number of time steps indexed by t = 0, 1, . . . , T . Assume
the market-maker always shows a bid price and ask price (at each time t) along with the
associated bid shares and ask shares on the OB. Also assume, for ease of exposition, that
the market-maker can add or remove bid/ask shares from the OB costlessly. We use the
following notation:

• Denote W_t ∈ R as the market-maker's trading account value at time t.
• Denote I_t ∈ Z as the market-maker's inventory of shares at time t (assume I_0 =
  0). Note that the inventory can be positive or negative (negative means the market-
  maker is short a certain number of shares).
• Denote S_t ∈ R+ as the OB Mid Price at time t (assume a stochastic process for S_t).
• Denote P_t^{(b)} ∈ R+ as the market-maker's Bid Price at time t.
• Denote N_t^{(b)} ∈ Z+ as the market-maker's Bid Shares at time t.
• Denote P_t^{(a)} ∈ R+ as the market-maker's Ask Price at time t.
• Denote N_t^{(a)} ∈ Z+ as the market-maker's Ask Shares at time t.
• We refer to δ_t^{(b)} = S_t − P_t^{(b)} as the market-maker's Bid Spread (relative to OB Mid).
• We refer to δ_t^{(a)} = P_t^{(a)} − S_t as the market-maker's Ask Spread (relative to OB Mid).
• We refer to δ_t^{(b)} + δ_t^{(a)} = P_t^{(a)} − P_t^{(b)} as the market-maker's Bid-Ask Spread.
• Random variable X_t^{(b)} ∈ Z≥0 refers to the total number of the market-maker's Bid Shares
  that have been transacted against (by MOs or by Sell LOs) up to time t (X_t^{(b)} is often
  referred to as the cumulative "hits" up to time t, as in "the market-maker's buy offer
  has been hit").
• Random variable X_t^{(a)} ∈ Z≥0 refers to the total number of the market-maker's Ask
  Shares that have been transacted against (by MOs or by Buy LOs) up to time t (X_t^{(a)}
  is often referred to as the cumulative "lifts" up to time t, as in "the market-maker's
  sell offer has been lifted").

With this notation in place, we can write the trading account balance equation for all
t = 0, 1, . . . , T − 1 as follows:
W_{t+1} = W_t + P_t^{(a)} · (X_{t+1}^{(a)} − X_t^{(a)}) − P_t^{(b)} · (X_{t+1}^{(b)} − X_t^{(b)})    (8.10)

Note that since the inventory I_0 at time 0 is equal to 0, the inventory I_t at time t is given
by the equation:

I_t = X_t^{(b)} − X_t^{(a)}
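As a small worked example of Equation (8.10) (with made-up numbers): if at time t the market-maker
quotes P_t^{(b)} = 99 and P_t^{(a)} = 101, and over the time step 2 bid shares are hit while 1 ask share is
lifted, then W_{t+1} = W_t + 101 · 1 − 99 · 2 = W_t − 97 and the inventory changes by +2 − 1 = +1.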
The market-maker’s goal is to maximize (for an appropriately shaped concave utility
function U (·)) the sum of the trading account value at time T and the value of the inventory
of shares held at time T , i.e., we maximize:

E[U (WT + IT · ST )]
As we alluded to earlier, this problem can be cast as a discrete-time finite-horizon Markov
Decision Process (with discount factor γ = 1). Following the usual notation for discrete-
time finite-horizon MDPs, the order of activity for the MDP at each time step t = 0, 1, . . . , T −
1 is as follows:

• Observe State (S_t, W_t, I_t) ∈ S_t.
• Perform Action (P_t^{(b)}, N_t^{(b)}, P_t^{(a)}, N_t^{(a)}) ∈ A_t.
• Random number of bid shares hit at time step t (this is equal to X_{t+1}^{(b)} − X_t^{(b)}).
• Random number of ask shares lifted at time step t (this is equal to X_{t+1}^{(a)} − X_t^{(a)}).
• Update of W_t to W_{t+1}.
• Update of I_t to I_{t+1}.
• Stochastic evolution of S_t to S_{t+1}.
• Receive Reward R_{t+1}, where:

  R_{t+1} := 0 for 1 ≤ t + 1 ≤ T − 1
  R_{t+1} := U(W_T + I_T · S_T) for t + 1 = T

The goal is to find an Optimal Policy π^* = (π_0^*, π_1^*, . . . , π_{T−1}^*), where

π_t^*((S_t, W_t, I_t)) = (P_t^{(b)}, N_t^{(b)}, P_t^{(a)}, N_t^{(a)})

that maximizes:

E[Σ_{t=1}^T R_t] = E[R_T] = E[U(W_T + I_T · S_T)]

8.3.1. Avellaneda-Stoikov Continuous-Time Formulation
A landmark paper by Avellaneda and Stoikov (Avellaneda and Stoikov 2008) formulated
this optimal market-making problem in its continuous-time version. Their formulation
is conducive to analytical tractability and they came up with a simple, clean and intuitive
solution. In this subsection, we go over their formulation and in the next subsection, we
show the derivation of their solution. We adapt our discrete-time notation above to their
continuous-time setting.
[X_t^{(b)} | 0 ≤ t < T] and [X_t^{(a)} | 0 ≤ t < T] are assumed to be continuous-time Poisson
processes with the hit rate per unit of time and the lift rate per unit of time denoted as λ_t^{(b)} and
λ_t^{(a)} respectively. Hence, we can write the following:

dX_t^{(b)} ∼ Poisson(λ_t^{(b)} · dt)

dX_t^{(a)} ∼ Poisson(λ_t^{(a)} · dt)

λ_t^{(b)} = f^{(b)}(δ_t^{(b)})

λ_t^{(a)} = f^{(a)}(δ_t^{(a)})

for decreasing functions f^{(b)}(·) and f^{(a)}(·).

dW_t = P_t^{(a)} · dX_t^{(a)} − P_t^{(b)} · dX_t^{(b)}

I_t = X_t^{(b)} − X_t^{(a)} (note: I_0 = 0)

Since infinitesimal Poisson random variables dX_t^{(b)} (shares hit in time interval from t
to t + dt) and dX_t^{(a)} (shares lifted in time interval from t to t + dt) are Bernoulli random
variables (shares hit/lifted within time interval of duration dt will be 0 or 1), N_t^{(b)} and N_t^{(a)}
(number of shares in the submitted LOs for the infinitesimal time interval from t to t + dt)
can be assumed to be 1.
This simplifies the Action at time t to be just the pair:

(δ_t^{(b)}, δ_t^{(a)})

OB Mid Price Dynamics is assumed to be scaled Brownian motion:

dS_t = σ · dz_t

for some σ ∈ R+.
The Utility function is assumed to be U(x) = −e^{−γx} where γ > 0 is the risk-aversion pa-
rameter (this Utility function is essentially the CARA Utility function devoid of associated
constants).

8.3.2. Solving the Avellaneda-Stoikov Formulation

The following solution is as presented in the Avellaneda-Stoikov paper. We can express the Avellaneda-Stoikov continuous-time formulation as a Hamilton-Jacobi-Bellman (HJB) formulation (note: for reference, the general HJB formulation is covered in Appendix D). We denote the Optimal Value Function as $V^*(t, S_t, W_t, I_t)$. Note that unlike Section 3.13 in Chapter 3 where we denoted the Optimal Value Function as a time-indexed sequence $V_t^*(\cdot)$, here we make $t$ an explicit functional argument of $V^*$ and each of $S_t, W_t, I_t$ also as separate functional arguments of $V^*$ (instead of the typical approach of making the state, as a tuple, a single functional argument). This is because in the continuous-time setting, we are interested in the time-differential of the Optimal Value Function and we also want to represent the dependency of the Optimal Value Function on each of $S_t, W_t, I_t$ as explicit separate dependencies. Appendix D provides the derivation of the general HJB formulation (Equation (D.1) in Appendix D) - this general HJB Equation specializes here to the following:

$$\max_{\delta_t^{(b)}, \delta_t^{(a)}} \mathbb{E}[dV^*(t, S_t, W_t, I_t)] = 0 \text{ for } t < T$$

$$V^*(T, S_T, W_T, I_T) = -e^{-\gamma \cdot (W_T + I_T \cdot S_T)}$$

An infinitesimal change $dV^*$ to $V^*(t, S_t, W_t, I_t)$ is comprised of 3 components:

• Due to pure movement in time (dependence of $V^*$ on $t$).
• Due to randomness in OB Mid-Price (dependence of $V^*$ on $S_t$).
• Due to randomness in hitting/lifting the market-maker's Bid/Ask (dependence of $V^*$ on $\lambda_t^{(b)}$ and $\lambda_t^{(a)}$). Note that the probability of being hit in the interval from $t$ to $t + dt$ is $\lambda_t^{(b)} \cdot dt$ and the probability of being lifted in the interval from $t$ to $t + dt$ is $\lambda_t^{(a)} \cdot dt$, upon which the trading account value $W_t$ changes appropriately and the inventory $I_t$ increments/decrements by 1.

With this, we can expand $dV^*(t, S_t, W_t, I_t)$ and rewrite the HJB Equation as:

$$\begin{aligned}
\max_{\delta_t^{(b)}, \delta_t^{(a)}} \{ & \frac{\partial V^*}{\partial t} \cdot dt + \mathbb{E}[\sigma \cdot \frac{\partial V^*}{\partial S_t} \cdot dz_t + \frac{\sigma^2}{2} \cdot \frac{\partial^2 V^*}{\partial S_t^2} \cdot (dz_t)^2] \\
& + \lambda_t^{(b)} \cdot dt \cdot V^*(t, S_t, W_t - S_t + \delta_t^{(b)}, I_t + 1) \\
& + \lambda_t^{(a)} \cdot dt \cdot V^*(t, S_t, W_t + S_t + \delta_t^{(a)}, I_t - 1) \\
& + (1 - \lambda_t^{(b)} \cdot dt - \lambda_t^{(a)} \cdot dt) \cdot V^*(t, S_t, W_t, I_t) \\
& - V^*(t, S_t, W_t, I_t)\} = 0
\end{aligned}$$
Next, we want to convert the HJB Equation to a Partial Differential Equation (PDE). We can simplify the above HJB equation with a few observations:

• $\mathbb{E}[dz_t] = 0$.
• $\mathbb{E}[(dz_t)^2] = dt$.
• Organize the terms involving $\lambda_t^{(b)}$ and $\lambda_t^{(a)}$ better with some algebra.
• Divide throughout by $dt$.

$$\begin{aligned}
\max_{\delta_t^{(b)}, \delta_t^{(a)}} \{ & \frac{\partial V^*}{\partial t} + \frac{\sigma^2}{2} \cdot \frac{\partial^2 V^*}{\partial S_t^2} \\
& + \lambda_t^{(b)} \cdot (V^*(t, S_t, W_t - S_t + \delta_t^{(b)}, I_t + 1) - V^*(t, S_t, W_t, I_t)) \\
& + \lambda_t^{(a)} \cdot (V^*(t, S_t, W_t + S_t + \delta_t^{(a)}, I_t - 1) - V^*(t, S_t, W_t, I_t))\} = 0
\end{aligned}$$

Next, note that $\lambda_t^{(b)} = f^{(b)}(\delta_t^{(b)})$ and $\lambda_t^{(a)} = f^{(a)}(\delta_t^{(a)})$, and apply the max only on the relevant terms:

$$\begin{aligned}
& \frac{\partial V^*}{\partial t} + \frac{\sigma^2}{2} \cdot \frac{\partial^2 V^*}{\partial S_t^2} \\
& + \max_{\delta_t^{(b)}} \{f^{(b)}(\delta_t^{(b)}) \cdot (V^*(t, S_t, W_t - S_t + \delta_t^{(b)}, I_t + 1) - V^*(t, S_t, W_t, I_t))\} \\
& + \max_{\delta_t^{(a)}} \{f^{(a)}(\delta_t^{(a)}) \cdot (V^*(t, S_t, W_t + S_t + \delta_t^{(a)}, I_t - 1) - V^*(t, S_t, W_t, I_t))\} = 0
\end{aligned}$$

This combines with the boundary condition:

$$V^*(T, S_T, W_T, I_T) = -e^{-\gamma \cdot (W_T + I_T \cdot S_T)}$$

Next, we make an educated guess for the functional form of $V^*(t, S_t, W_t, I_t)$:

$$V^*(t, S_t, W_t, I_t) = -e^{-\gamma \cdot (W_t + \theta(t, S_t, I_t))} \tag{8.11}$$

to reduce the problem to a Partial Differential Equation (PDE) in terms of $\theta(t, S_t, I_t)$. Substituting this guessed functional form into the above PDE for $V^*(t, S_t, W_t, I_t)$ gives:

$$\begin{aligned}
& \frac{\partial \theta}{\partial t} + \frac{\sigma^2}{2} \cdot (\frac{\partial^2 \theta}{\partial S_t^2} - \gamma \cdot (\frac{\partial \theta}{\partial S_t})^2) \\
& + \max_{\delta_t^{(b)}} \{\frac{f^{(b)}(\delta_t^{(b)})}{\gamma} \cdot (1 - e^{-\gamma \cdot (\delta_t^{(b)} - S_t + \theta(t, S_t, I_t + 1) - \theta(t, S_t, I_t))})\} \\
& + \max_{\delta_t^{(a)}} \{\frac{f^{(a)}(\delta_t^{(a)})}{\gamma} \cdot (1 - e^{-\gamma \cdot (\delta_t^{(a)} + S_t + \theta(t, S_t, I_t - 1) - \theta(t, S_t, I_t))})\} = 0
\end{aligned}$$

The boundary condition is:

$$\theta(T, S_T, I_T) = I_T \cdot S_T$$
It turns out that $\theta(t, S_t, I_t + 1) - \theta(t, S_t, I_t)$ and $\theta(t, S_t, I_t) - \theta(t, S_t, I_t - 1)$ are equal to financially meaningful quantities known as Indifference Bid and Ask Prices.
Indifference Bid Price $Q^{(b)}(t, S_t, I_t)$ is defined as follows:

$$V^*(t, S_t, W_t - Q^{(b)}(t, S_t, I_t), I_t + 1) = V^*(t, S_t, W_t, I_t) \tag{8.12}$$

$Q^{(b)}(t, S_t, I_t)$ is the price to buy a single share with a guarantee of immediate purchase that results in the Optimum Expected Utility staying unchanged.
Likewise, Indifference Ask Price $Q^{(a)}(t, S_t, I_t)$ is defined as follows:

$$V^*(t, S_t, W_t + Q^{(a)}(t, S_t, I_t), I_t - 1) = V^*(t, S_t, W_t, I_t) \tag{8.13}$$

$Q^{(a)}(t, S_t, I_t)$ is the price to sell a single share with a guarantee of immediate sale that results in the Optimum Expected Utility staying unchanged.
For convenience, we abbreviate $Q^{(b)}(t, S_t, I_t)$ as $Q_t^{(b)}$ and $Q^{(a)}(t, S_t, I_t)$ as $Q_t^{(a)}$. Next, we express $V^*(t, S_t, W_t - Q_t^{(b)}, I_t + 1) = V^*(t, S_t, W_t, I_t)$ in terms of $\theta$:

$$-e^{-\gamma \cdot (W_t - Q_t^{(b)} + \theta(t, S_t, I_t + 1))} = -e^{-\gamma \cdot (W_t + \theta(t, S_t, I_t))}$$

$$\Rightarrow Q_t^{(b)} = \theta(t, S_t, I_t + 1) - \theta(t, S_t, I_t) \tag{8.14}$$

Likewise for $Q_t^{(a)}$, we get:

$$Q_t^{(a)} = \theta(t, S_t, I_t) - \theta(t, S_t, I_t - 1) \tag{8.15}$$
Using Equations (8.14) and (8.15), bring $Q_t^{(b)}$ and $Q_t^{(a)}$ into the PDE for $\theta$:

$$\frac{\partial \theta}{\partial t} + \frac{\sigma^2}{2} \cdot (\frac{\partial^2 \theta}{\partial S_t^2} - \gamma \cdot (\frac{\partial \theta}{\partial S_t})^2) + \max_{\delta_t^{(b)}} g(\delta_t^{(b)}) + \max_{\delta_t^{(a)}} h(\delta_t^{(a)}) = 0$$

$$\text{where } g(\delta_t^{(b)}) = \frac{f^{(b)}(\delta_t^{(b)})}{\gamma} \cdot (1 - e^{-\gamma \cdot (\delta_t^{(b)} - S_t + Q_t^{(b)})})$$

$$\text{and } h(\delta_t^{(a)}) = \frac{f^{(a)}(\delta_t^{(a)})}{\gamma} \cdot (1 - e^{-\gamma \cdot (\delta_t^{(a)} + S_t - Q_t^{(a)})})$$
To maximize $g(\delta_t^{(b)})$, differentiate $g$ with respect to $\delta_t^{(b)}$ and set to 0:

$$e^{-\gamma \cdot (\delta_t^{(b)*} - S_t + Q_t^{(b)})} \cdot (\gamma \cdot f^{(b)}(\delta_t^{(b)*}) - \frac{\partial f^{(b)}}{\partial \delta_t^{(b)}}(\delta_t^{(b)*})) + \frac{\partial f^{(b)}}{\partial \delta_t^{(b)}}(\delta_t^{(b)*}) = 0$$

$$\Rightarrow \delta_t^{(b)*} = S_t - P_t^{(b)*} = S_t - Q_t^{(b)} + \frac{1}{\gamma} \cdot \log{(1 - \gamma \cdot \frac{f^{(b)}(\delta_t^{(b)*})}{\frac{\partial f^{(b)}}{\partial \delta_t^{(b)}}(\delta_t^{(b)*})})} \tag{8.16}$$

To maximize $h(\delta_t^{(a)})$, differentiate $h$ with respect to $\delta_t^{(a)}$ and set to 0:

$$e^{-\gamma \cdot (\delta_t^{(a)*} + S_t - Q_t^{(a)})} \cdot (\gamma \cdot f^{(a)}(\delta_t^{(a)*}) - \frac{\partial f^{(a)}}{\partial \delta_t^{(a)}}(\delta_t^{(a)*})) + \frac{\partial f^{(a)}}{\partial \delta_t^{(a)}}(\delta_t^{(a)*}) = 0$$

$$\Rightarrow \delta_t^{(a)*} = P_t^{(a)*} - S_t = Q_t^{(a)} - S_t + \frac{1}{\gamma} \cdot \log{(1 - \gamma \cdot \frac{f^{(a)}(\delta_t^{(a)*})}{\frac{\partial f^{(a)}}{\partial \delta_t^{(a)}}(\delta_t^{(a)*})})} \tag{8.17}$$

Equations (8.16) and (8.17) are implicit equations for $\delta_t^{(b)*}$ and $\delta_t^{(a)*}$ respectively.
Now let us write the PDE in terms of the Optimal Bid and Ask Spreads:

$$\begin{aligned}
& \frac{\partial \theta}{\partial t} + \frac{\sigma^2}{2} \cdot (\frac{\partial^2 \theta}{\partial S_t^2} - \gamma \cdot (\frac{\partial \theta}{\partial S_t})^2) \\
& + \frac{f^{(b)}(\delta_t^{(b)*})}{\gamma} \cdot (1 - e^{-\gamma \cdot (\delta_t^{(b)*} - S_t + \theta(t, S_t, I_t + 1) - \theta(t, S_t, I_t))}) \\
& + \frac{f^{(a)}(\delta_t^{(a)*})}{\gamma} \cdot (1 - e^{-\gamma \cdot (\delta_t^{(a)*} + S_t + \theta(t, S_t, I_t - 1) - \theta(t, S_t, I_t))}) = 0
\end{aligned} \tag{8.18}$$

with boundary condition: $\theta(T, S_T, I_T) = I_T \cdot S_T$

How do we go about solving this? Here are the steps:

• First we solve PDE (8.18) for $\theta$ in terms of $\delta_t^{(b)*}$ and $\delta_t^{(a)*}$. In general, this would be a numerical PDE solution.
• Using Equations (8.14) and (8.15), and using the above-obtained $\theta$ in terms of $\delta_t^{(b)*}$ and $\delta_t^{(a)*}$, we get $Q_t^{(b)}$ and $Q_t^{(a)}$ in terms of $\delta_t^{(b)*}$ and $\delta_t^{(a)*}$.
• Then we substitute the above-obtained $Q_t^{(b)}$ and $Q_t^{(a)}$ (in terms of $\delta_t^{(b)*}$ and $\delta_t^{(a)*}$) in Equations (8.16) and (8.17).
• Finally, we solve the implicit equations for $\delta_t^{(b)*}$ and $\delta_t^{(a)*}$ (in general, numerically).

This completes the (numerical) solution to the Avellaneda-Stoikov continuous-time formulation for the Optimal Market-Making problem. Having been through all the heavy equations above, let's now spend some time on building intuition.
Define the Indifference Mid Price $Q_t^{(m)} = \frac{Q_t^{(b)} + Q_t^{(a)}}{2}$. To develop intuition for Indifference Prices, consider a simple case where the market-maker doesn't supply any bids or asks after time $t$. This means the trading account value $W_T$ at time $T$ must be the same as the trading account value at time $t$ and the inventory $I_T$ at time $T$ must be the same as the inventory $I_t$ at time $t$. This implies:

$$V^*(t, S_t, W_t, I_t) = \mathbb{E}[-e^{-\gamma \cdot (W_t + I_t \cdot S_T)}]$$

The process $dS_t = \sigma \cdot dz_t$ implies that $S_T \sim \mathcal{N}(S_t, \sigma^2 \cdot (T - t))$, and hence:

$$V^*(t, S_t, W_t, I_t) = -e^{-\gamma \cdot (W_t + I_t \cdot S_t - \frac{\gamma \cdot I_t^2 \cdot \sigma^2 \cdot (T - t)}{2})}$$

Hence,

$$V^*(t, S_t, W_t - Q_t^{(b)}, I_t + 1) = -e^{-\gamma \cdot (W_t - Q_t^{(b)} + (I_t + 1) \cdot S_t - \frac{\gamma \cdot (I_t + 1)^2 \cdot \sigma^2 \cdot (T - t)}{2})}$$

But from Equation (8.12), we know that:

$$V^*(t, S_t, W_t, I_t) = V^*(t, S_t, W_t - Q_t^{(b)}, I_t + 1)$$

Therefore,

$$-e^{-\gamma \cdot (W_t + I_t \cdot S_t - \frac{\gamma \cdot I_t^2 \cdot \sigma^2 \cdot (T - t)}{2})} = -e^{-\gamma \cdot (W_t - Q_t^{(b)} + (I_t + 1) \cdot S_t - \frac{\gamma \cdot (I_t + 1)^2 \cdot \sigma^2 \cdot (T - t)}{2})}$$

This implies:

$$Q_t^{(b)} = S_t - (2I_t + 1) \cdot \frac{\gamma \cdot \sigma^2 \cdot (T - t)}{2}$$

Likewise, we can derive:

$$Q_t^{(a)} = S_t - (2I_t - 1) \cdot \frac{\gamma \cdot \sigma^2 \cdot (T - t)}{2}$$
The formulas for the Indifference Mid Price and the Indifference Bid-Ask Price Spread are as follows:

$$Q_t^{(m)} = S_t - I_t \cdot \gamma \cdot \sigma^2 \cdot (T - t)$$

$$Q_t^{(a)} - Q_t^{(b)} = \gamma \cdot \sigma^2 \cdot (T - t)$$
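As a quick numerical illustration (with made-up parameter values, purely to build intuition): if $\gamma = 0.1$, $\sigma = 2$ (so $\sigma^2 = 4$), $T - t = 0.25$, $S_t = 100$ and the market-maker is long $I_t = 3$ shares, then

$$Q_t^{(m)} = 100 - 3 \cdot 0.1 \cdot 4 \cdot 0.25 = 99.7 \quad \text{and} \quad Q_t^{(a)} - Q_t^{(b)} = 0.1 \cdot 4 \cdot 0.25 = 0.1$$

so the indifference prices straddle 99.7 rather than 100, i.e., the long inventory pulls the market-maker's effective mid price 0.3 below the OB mid price.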
These results for the simple case of no-market-making-after-time-$t$ serve as approximations for our problem of optimal market-making. Think of $Q_t^{(m)}$ as a pseudo mid price for the market-maker, an adjustment to the OB mid price $S_t$ that takes into account the magnitude and sign of $I_t$. If the market-maker is long inventory ($I_t > 0$), then $Q_t^{(m)} < S_t$, which makes intuitive sense since the market-maker is interested in reducing her risk of inventory buildup and so, would be more inclined to sell than buy, leading her to show bid and ask prices whose average is lower than the OB mid price $S_t$. Likewise, if the market-maker is short inventory ($I_t < 0$), then $Q_t^{(m)} > S_t$, indicating an inclination to buy rather than sell.
Armed with this intuition, we come back to optimal market-making, observing from Equations (8.16) and (8.17):

$$P_t^{(b)*} < Q_t^{(b)} < Q_t^{(m)} < Q_t^{(a)} < P_t^{(a)*}$$

Visualize this ascending sequence of prices $[P_t^{(b)*}, Q_t^{(b)}, Q_t^{(m)}, Q_t^{(a)}, P_t^{(a)*}]$ as jointly sliding up/down (relative to OB mid price $S_t$) as a function of the inventory $I_t$'s magnitude and sign, and perceive $P_t^{(b)*}, P_t^{(a)*}$ in terms of their spreads to the pseudo mid price $Q_t^{(m)}$:

$$Q_t^{(m)} - P_t^{(b)*} = \frac{Q_t^{(a)} - Q_t^{(b)}}{2} + \frac{1}{\gamma} \cdot \log{(1 - \gamma \cdot \frac{f^{(b)}(\delta_t^{(b)*})}{\frac{\partial f^{(b)}}{\partial \delta_t^{(b)}}(\delta_t^{(b)*})})}$$

$$P_t^{(a)*} - Q_t^{(m)} = \frac{Q_t^{(a)} - Q_t^{(b)}}{2} + \frac{1}{\gamma} \cdot \log{(1 - \gamma \cdot \frac{f^{(a)}(\delta_t^{(a)*})}{\frac{\partial f^{(a)}}{\partial \delta_t^{(a)}}(\delta_t^{(a)*})})}$$

8.3.3. Analytical Approximation to the Solution to Avellaneda-Stoikov Formulation

The PDE (8.18) we derived above for $\theta$ and the associated implicit Equations (8.16) and (8.17) for $\delta_t^{(b)*}, \delta_t^{(a)*}$ are messy. So we make some assumptions, simplify, and derive analytical approximations (as presented in the Avellaneda-Stoikov paper). We start by assuming a fairly standard functional form for $f^{(b)}$ and $f^{(a)}$:

$$f^{(b)}(\delta) = f^{(a)}(\delta) = c \cdot e^{-k \cdot \delta}$$

This reduces Equations (8.16) and (8.17) to:

$$\delta_t^{(b)*} = S_t - Q_t^{(b)} + \frac{1}{\gamma} \cdot \log{(1 + \frac{\gamma}{k})} \tag{8.19}$$

$$\delta_t^{(a)*} = Q_t^{(a)} - S_t + \frac{1}{\gamma} \cdot \log{(1 + \frac{\gamma}{k})} \tag{8.20}$$

which means $P_t^{(b)*}$ and $P_t^{(a)*}$ are equidistant from $Q_t^{(m)}$. Substituting these simplified $\delta_t^{(b)*}, \delta_t^{(a)*}$ in Equation (8.18) reduces the PDE to:

$$\frac{\partial \theta}{\partial t} + \frac{\sigma^2}{2} \cdot (\frac{\partial^2 \theta}{\partial S_t^2} - \gamma \cdot (\frac{\partial \theta}{\partial S_t})^2) + \frac{c}{k + \gamma} \cdot (e^{-k \cdot \delta_t^{(b)*}} + e^{-k \cdot \delta_t^{(a)*}}) = 0 \tag{8.21}$$

with boundary condition $\theta(T, S_T, I_T) = I_T \cdot S_T$

Note that this PDE (8.21) involves $\delta_t^{(b)*}$ and $\delta_t^{(a)*}$. However, Equations (8.19), (8.20), (8.14) and (8.15) enable expressing $\delta_t^{(b)*}$ and $\delta_t^{(a)*}$ in terms of $\theta(t, S_t, I_t - 1)$, $\theta(t, S_t, I_t)$ and $\theta(t, S_t, I_t + 1)$. This gives us a PDE just in terms of $\theta$. Solving that PDE for $\theta$ would give us not only $V^*(t, S_t, W_t, I_t)$ but also $\delta_t^{(b)*}$ and $\delta_t^{(a)*}$ (using Equations (8.19), (8.20), (8.14) and (8.15)). To solve the PDE, we need to make a couple of approximations.

First we make a linear approximation for $e^{-k \cdot \delta_t^{(b)*}}$ and $e^{-k \cdot \delta_t^{(a)*}}$ in PDE (8.21) as follows:

$$\frac{\partial \theta}{\partial t} + \frac{\sigma^2}{2} \cdot (\frac{\partial^2 \theta}{\partial S_t^2} - \gamma \cdot (\frac{\partial \theta}{\partial S_t})^2) + \frac{c}{k + \gamma} \cdot (1 - k \cdot \delta_t^{(b)*} + 1 - k \cdot \delta_t^{(a)*}) = 0 \tag{8.22}$$

Combining the Equations (8.19), (8.20), (8.14) and (8.15) gives us:

$$\delta_t^{(b)*} + \delta_t^{(a)*} = \frac{2}{\gamma} \cdot \log{(1 + \frac{\gamma}{k})} + 2\theta(t, S_t, I_t) - \theta(t, S_t, I_t + 1) - \theta(t, S_t, I_t - 1)$$

With this expression for $\delta_t^{(b)*} + \delta_t^{(a)*}$, PDE (8.22) takes the form:

$$\begin{aligned}
& \frac{\partial \theta}{\partial t} + \frac{\sigma^2}{2} \cdot (\frac{\partial^2 \theta}{\partial S_t^2} - \gamma \cdot (\frac{\partial \theta}{\partial S_t})^2) + \frac{c}{k + \gamma} \cdot (2 - \frac{2k}{\gamma} \cdot \log{(1 + \frac{\gamma}{k})} \\
& - k \cdot (2\theta(t, S_t, I_t) - \theta(t, S_t, I_t + 1) - \theta(t, S_t, I_t - 1))) = 0
\end{aligned} \tag{8.23}$$

To solve PDE (8.23), we consider the following asymptotic expansion of $\theta$ in $I_t$:

$$\theta(t, S_t, I_t) = \sum_{n=0}^{\infty} \frac{I_t^n}{n!} \cdot \theta^{(n)}(t, S_t)$$

So we need to determine the functions $\theta^{(n)}(t, S_t)$ for all $n = 0, 1, 2, \ldots$

For tractability, we approximate this expansion to the first 3 terms:

$$\theta(t, S_t, I_t) \approx \theta^{(0)}(t, S_t) + I_t \cdot \theta^{(1)}(t, S_t) + \frac{I_t^2}{2} \cdot \theta^{(2)}(t, S_t)$$

We note that the Optimal Value Function $V^*$ can depend on $S_t$ only through the current Value of the Inventory (i.e., through $I_t \cdot S_t$), i.e., it cannot depend on $S_t$ in any other way. This means $V^*(t, S_t, W_t, 0) = -e^{-\gamma (W_t + \theta^{(0)}(t, S_t))}$ is independent of $S_t$. This means $\theta^{(0)}(t, S_t)$ is independent of $S_t$. So, we can write it as simply $\theta^{(0)}(t)$, meaning $\frac{\partial \theta^{(0)}}{\partial S_t}$ and $\frac{\partial^2 \theta^{(0)}}{\partial S_t^2}$ are equal to 0. Therefore, we can write the approximate expansion for $\theta(t, S_t, I_t)$ as:

$$\theta(t, S_t, I_t) = \theta^{(0)}(t) + I_t \cdot \theta^{(1)}(t, S_t) + \frac{I_t^2}{2} \cdot \theta^{(2)}(t, S_t) \tag{8.24}$$

Substituting this approximation Equation (8.24) for $\theta(t, S_t, I_t)$ in PDE (8.23), we get:

$$\begin{aligned}
& \frac{\partial \theta^{(0)}}{\partial t} + I_t \cdot \frac{\partial \theta^{(1)}}{\partial t} + \frac{I_t^2}{2} \cdot \frac{\partial \theta^{(2)}}{\partial t} + \frac{\sigma^2}{2} \cdot (I_t \cdot \frac{\partial^2 \theta^{(1)}}{\partial S_t^2} + \frac{I_t^2}{2} \cdot \frac{\partial^2 \theta^{(2)}}{\partial S_t^2}) \\
& - \frac{\gamma \sigma^2}{2} \cdot (I_t \cdot \frac{\partial \theta^{(1)}}{\partial S_t} + \frac{I_t^2}{2} \cdot \frac{\partial \theta^{(2)}}{\partial S_t})^2 + \frac{c}{k + \gamma} \cdot (2 - \frac{2k}{\gamma} \cdot \log{(1 + \frac{\gamma}{k})} + k \cdot \theta^{(2)}) = 0
\end{aligned} \tag{8.25}$$

with boundary condition:

$$\theta^{(0)}(T) + I_T \cdot \theta^{(1)}(T, S_T) + \frac{I_T^2}{2} \cdot \theta^{(2)}(T, S_T) = I_T \cdot S_T$$
Now we separately collect terms involving specific powers of $I_t$, each yielding a separate PDE:

• Terms devoid of $I_t$ (i.e., $I_t^0$)
• Terms involving $I_t$ (i.e., $I_t^1$)
• Terms involving $I_t^2$

We start by collecting terms involving $I_t$:

$$\frac{\partial \theta^{(1)}}{\partial t} + \frac{\sigma^2}{2} \cdot \frac{\partial^2 \theta^{(1)}}{\partial S_t^2} = 0 \text{ with boundary condition } \theta^{(1)}(T, S_T) = S_T$$

The solution to this PDE is:

$$\theta^{(1)}(t, S_t) = S_t \tag{8.26}$$

Next, we collect terms involving $I_t^2$:

$$\frac{\partial \theta^{(2)}}{\partial t} + \frac{\sigma^2}{2} \cdot \frac{\partial^2 \theta^{(2)}}{\partial S_t^2} - \gamma \cdot \sigma^2 \cdot (\frac{\partial \theta^{(1)}}{\partial S_t})^2 = 0 \text{ with boundary condition } \theta^{(2)}(T, S_T) = 0$$

Noting that $\theta^{(1)}(t, S_t) = S_t$, we solve this PDE as:

$$\theta^{(2)}(t, S_t) = -\gamma \cdot \sigma^2 \cdot (T - t) \tag{8.27}$$

Finally, we collect the terms devoid of $I_t$:

$$\frac{\partial \theta^{(0)}}{\partial t} + \frac{c}{k + \gamma} \cdot (2 - \frac{2k}{\gamma} \cdot \log{(1 + \frac{\gamma}{k})} + k \cdot \theta^{(2)}) = 0 \text{ with boundary condition } \theta^{(0)}(T) = 0$$

Noting that $\theta^{(2)}(t, S_t) = -\gamma \sigma^2 \cdot (T - t)$, we solve as:

$$\theta^{(0)}(t) = \frac{c}{k + \gamma} \cdot ((2 - \frac{2k}{\gamma} \cdot \log{(1 + \frac{\gamma}{k})}) \cdot (T - t) - \frac{k \gamma \sigma^2}{2} \cdot (T - t)^2) \tag{8.28}$$
This completes the PDE solution for $\theta(t, S_t, I_t)$ and hence, for $V^*(t, S_t, W_t, I_t)$. Lastly, we derive formulas for $Q_t^{(b)}, Q_t^{(a)}, Q_t^{(m)}, \delta_t^{(b)*}, \delta_t^{(a)*}$.
Using Equations (8.14) and (8.15), we get:

$$Q_t^{(b)} = \theta^{(1)}(t, S_t) + \frac{2I_t + 1}{2} \cdot \theta^{(2)}(t, S_t) = S_t - (2I_t + 1) \cdot \frac{\gamma \cdot \sigma^2 \cdot (T - t)}{2} \tag{8.29}$$

$$Q_t^{(a)} = \theta^{(1)}(t, S_t) + \frac{2I_t - 1}{2} \cdot \theta^{(2)}(t, S_t) = S_t - (2I_t - 1) \cdot \frac{\gamma \cdot \sigma^2 \cdot (T - t)}{2} \tag{8.30}$$

Using Equations (8.19) and (8.20), we get:

$$\delta_t^{(b)*} = \frac{(2I_t + 1) \cdot \gamma \cdot \sigma^2 \cdot (T - t)}{2} + \frac{1}{\gamma} \cdot \log{(1 + \frac{\gamma}{k})} \tag{8.31}$$

$$\delta_t^{(a)*} = \frac{(1 - 2I_t) \cdot \gamma \cdot \sigma^2 \cdot (T - t)}{2} + \frac{1}{\gamma} \cdot \log{(1 + \frac{\gamma}{k})} \tag{8.32}$$

$$\text{Optimal Bid-Ask Spread } \delta_t^{(b)*} + \delta_t^{(a)*} = \gamma \cdot \sigma^2 \cdot (T - t) + \frac{2}{\gamma} \cdot \log{(1 + \frac{\gamma}{k})} \tag{8.33}$$

$$\text{Optimal Pseudo-Mid } Q_t^{(m)} = \frac{Q_t^{(b)} + Q_t^{(a)}}{2} = \frac{P_t^{(b)*} + P_t^{(a)*}}{2} = S_t - I_t \cdot \gamma \cdot \sigma^2 \cdot (T - t) \tag{8.34}$$
Now let's get back to developing intuition. Think of $Q_t^{(m)}$ as inventory-risk-adjusted mid-price (adjustment to $S_t$). If the market-maker is long inventory ($I_t > 0$), $Q_t^{(m)} < S_t$, indicating inclination to sell rather than buy, and if the market-maker is short inventory, $Q_t^{(m)} > S_t$, indicating inclination to buy rather than sell. Think of the interval $[P_t^{(b)*}, P_t^{(a)*}]$ as being around the pseudo mid-price $Q_t^{(m)}$ (rather than thinking of it as being around the OB mid-price $S_t$). The interval $[P_t^{(b)*}, P_t^{(a)*}]$ moves up/down in tandem with $Q_t^{(m)}$ moving up/down (as a function of inventory $I_t$). Note from Equation (8.33) that the Optimal Bid-Ask Spread $P_t^{(a)*} - P_t^{(b)*}$ is independent of inventory $I_t$.
A useful view is:

$$P_t^{(b)*} < Q_t^{(b)} < Q_t^{(m)} < Q_t^{(a)} < P_t^{(a)*}$$

with the spreads as follows:

$$\text{Outer Spreads } P_t^{(a)*} - Q_t^{(a)} = Q_t^{(b)} - P_t^{(b)*} = \frac{1}{\gamma} \cdot \log{(1 + \frac{\gamma}{k})}$$

$$\text{Inner Spreads } Q_t^{(a)} - Q_t^{(m)} = Q_t^{(m)} - Q_t^{(b)} = \frac{\gamma \cdot \sigma^2 \cdot (T - t)}{2}$$
This completes the analytical approximation to the solution of the Avellaneda-Stoikov
continuous-time formulation of the Optimal Market-Making problem.
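To make these closed-form results concrete, here is a minimal Python sketch (our own illustrative code, not from the book's code repository) that computes the optimal bid and ask quotes implied by Equations (8.31)-(8.34), given the current state $(S_t, I_t)$, the time to horizon $T - t$, and assumed values for the parameters $\gamma, \sigma, k$:

from typing import Tuple
import math

def avellaneda_stoikov_quotes(
    mid_price: float,        # S_t: OB mid price
    inventory: int,          # I_t: current inventory
    time_to_horizon: float,  # T - t
    gamma: float,            # risk-aversion parameter
    sigma: float,            # volatility of the OB mid price
    k: float                 # decay parameter of f(delta) = c * exp(-k * delta)
) -> Tuple[float, float]:
    """Optimal bid/ask prices (P_t^(b)*, P_t^(a)*) from the analytical
    approximation: quotes are placed symmetrically around the
    inventory-risk-adjusted pseudo mid price Q_t^(m) of Equation (8.34),
    with half the inventory-independent spread of Equation (8.33) on each side."""
    risk_term = gamma * sigma * sigma * time_to_horizon
    pseudo_mid = mid_price - inventory * risk_term                  # Q_t^(m)
    half_spread = risk_term / 2 + math.log(1 + gamma / k) / gamma   # half of Eq. (8.33)
    return pseudo_mid - half_spread, pseudo_mid + half_spread

# Example usage (made-up parameter values):
bid, ask = avellaneda_stoikov_quotes(
    mid_price=100.0, inventory=3, time_to_horizon=0.25,
    gamma=0.1, sigma=2.0, k=1.5
)

Note that the half-spread is independent of inventory (consistent with Equation (8.33)); only the centering of the quotes (the pseudo mid price) shifts with $I_t$.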

8.3.4. Real-World Market-Making


Note that while the Avellaneda-Stoikov continuous-time formulation and solution is ele-
gant and intuitive, it is far from a real-world model. Real-world OB dynamics are time-
heterogeneous, non-linear and far more complex. Furthermore, there are all kinds of real-
world frictions we need to capture, such as discrete time, discrete prices/number of shares
in a bid/ask submitted by the market-maker, various constraints on prices and number of
shares in the bid/ask, and fees to be paid by the market-maker. Moreover, we need to cap-
ture various market factors in the State and in the OB Dynamics. This invariably leads to
the Curse of Dimensionality and Curse of Modeling. This takes us down a path that by now we have become all too familiar with - Reinforcement Learning algorithms. This means we
need a simulator that captures all of the above factors, features and frictions. Such a simu-
lator is basically a Market-Data-learnt Sampling Model of OB Dynamics. We won’t be cover-
ing the details of how to build such a simulator as that is outside the scope of this book (a
topic under the umbrella of supervised learning of market patterns and behaviors). Using
this simulator and neural-networks-based function approximation of the Value Function
(and/or of the Policy function), we can leverage the power of RL algorithms (to be cov-
ered in the following chapters) to solve the problem of optimal market-making in practice.
There are a number of papers written on how to build practical and useful market simula-
tors and using Reinforcement Learning for Optimal Market-Making. We refer you to two
such papers here:

• A paper from University of Liverpool (Spooner et al. 2018)
• A paper from J.P.Morgan Research (Ganesh et al. 2019)

This topic of development of models for OB Dynamics and RL algorithms for practical
market-making is an exciting area for future research as well as engineering design. We
hope this section has provided sufficient foundations for you to dig into this topic further.

8.4. Key Takeaways from this Chapter


• Foundations of Order Book, Limit Orders, Market Orders, Price Impact of large Mar-
ket Orders, and complexity of Order Book Dynamics.
• Casting Order Book trading problems such as Optimal Order Execution and Op-
timal Market-Making as Markov Decision Processes, developing intuition by de-
riving closed-form solutions for highly simplified assumptions (eg: Bertsimas-Lo,
Avellaneda-Stoikov formulations), developing a deeper understanding by imple-
menting a backward-induction ADP algorithm, and then moving on to develop RL
algorithms (and associated market simulator) to solve this problem in a real-world
setting to overcome the Curse of Dimensionality and Curse of Modeling.

Part III.

Reinforcement Learning Algorithms

9. Monte-Carlo and Temporal-Difference for
Prediction

9.1. Overview of the Reinforcement Learning approach


In Module I, we covered Dynamic Programming (DP) and Approximate Dynamic Pro-
gramming (ADP) algorithms to solve the problems of Prediction and Control. DP and
ADP algorithms assume that we have access to a model of the MDP environment (by model,
we mean the transitions defined by PR - notation from Chapter 2 - referring to probabil-
ities of next state and reward, given current state and action). However, in real-world
situations, we often do not have access to a model of the MDP environment and so, we’d
need to access the actual MDP environment directly. As an example, a robotics appli-
cation might not have access to a model of a certain type of terrain to learn to walk on,
and so we’d need to access the actual (physical) terrain. This means we’d need to interact
with the actual MDP environment. Note that the actual MDP environment doesn’t give
us transition probabilities - it simply serves up a new state and reward when we take an
action in a certain state. In other words, it gives us individual experiences of next state and
reward, rather than the actual probabilities of occurrence of next states and rewards. So,
the natural question to ask is whether we can infer the Optimal Value Function/Optimal
Policy without access to a model (in the case of Prediction - the question is whether we
can infer the Value Function for a given policy). The answer to this question is Yes and the
algorithms that achieve this are known as Reinforcement Learning algorithms.
But Reinforcement Learning is often a great option even in situations where we do have
a model. In typical real-world problems, the state space is large and the transitions struc-
ture is complex, so transition probabilities are either hard to compute or impossible to
store/compute (within practical storage/compute constraints). This means even if we
could theoretically estimate a model from interactions with the actual environment and
then run a DP/ADP algorithm, it’s typically intractable/infeasible in a typical real-world
problem. Moreover, a typical real-world environment is not stationary (meaning the prob-
abilities PR change over time) and so, the PR probabilities model would need to be re-
estimated periodically (making it cumbersome to do DP/ADP). All of this points to the
practical alternative of constructing a sampling model (a model that serves up samples of
next state and reward) - this is typically much more feasible than estimating a probabilities
model (i.e. a model of explicit transition probabilities). A sampling model can then be used
by a Reinforcement Learning algorithm (as we shall explain shortly).
So what we are saying is that practically we are left with one of the following two options:

1. The AI Agent interacts with the actual environment and doesn’t bother with either a
model of explicit transition probabilities (probabilities model) or a model of transition
samples (sampling model).
2. We create a sampling model (by learning from interaction with the actual environ-
ment) and treat this sampling model as a simulated environment (meaning, the AI
agent interacts with this simulated environment).

From the perspective of the AI agent, either way there is an environment interface that
will serve up (at each time step) a single experience of (next state, reward) pair when
the agent performs a certain action in a given state. So essentially, either way, our access
is simply to a stream of individual experiences of next state and reward rather than their
explicit probabilities. So, then the question is - at a conceptual level, how does RL go
about solving Prediction and Control problems with just this limited access (access to
only experiences and not explicit probabilities)? This will become clearer and clearer as we
make our way through Module III, but it would be a good idea now for us to briefly sketch
an intuitive overview of the RL approach (before we dive into the actual RL algorithms).
To understand the core idea of how RL works, we take you back to the start of the book
where we went over how a baby learns to walk. Specifically, we’d like you to develop in-
tuition for how humans and other animals learn to perform requisite tasks or behave in
appropriate ways, so as to get trained to make suitable decisions. Humans/animals don’t
build a model of explicit probabilities in their minds in a way that a DP/ADP algorithm
would require. Rather, their learning is essentially a sort of “trial and error” method -
they try an action, receive an experience (i.e., next state and reward) from their environ-
ment, then take a new action, receive another experience, and so on … and then over a
period of time, they figure out which actions might be leading to good outcomes (pro-
ducing good rewards) and which actions might be leading to poor outcomes (poor re-
wards). This learning process involves raising the priority of actions perceived as good,
and lowering the priority of actions perceived as bad. Humans/animals don’t quite link
their actions to the immediate reward - they link their actions to the cumulative rewards
(Returns) obtained after performing an action. Linking actions to cumulative rewards is
challenging because multiple actions have significantly overlapping rewards sequences,
and often rewards show up in a delayed manner. Indeed, learning by attributing good
versus bad outcomes to specific past actions is the powerful part of human/animal learn-
ing. Humans/animals are essentially estimating a Q-Value Function and are updating
their Q-Value function each time they receive a new experience (of essentially a pair of
next state and reward). Exactly how humans/animals manage to estimate Q-Value func-
tions efficiently is unclear (a big area of ongoing research), but RL algorithms have specific
techniques to estimate the Q-Value function in an incremental manner by updating the Q-
Value function in subtle ways after each experience of next state and reward received from
either the actual environment or simulated environment.
We should also point out another important feature of human/animal learning - it is the
fact that humans/animals are good at generalizing their inferences from experiences, i.e.,
they can interpolate and extrapolate the linkages between their actions and the outcomes
received from their environment. Technically, this translates to a suitable function approx-
imation of the Q-Value function. So before we embark on studying the details of various
RL algorithms, it’s important to recognize that RL overcomes complexity (specifically, the
Curse of Dimensionality and Curse of Modeling, as we have alluded to in previous chap-
ters) with a combination of:

1. Learning incrementally by updating the Q-Value function from individual experi-


ences of next state and reward received after performing actions in specific states.
2. Good generalization ability of the Q-Value function with a suitable function approxi-
mation (indeed, recent progress in capabilities of deep neural networks have helped
considerably).

This idea of solving the MDP Prediction and Control problems in this manner (learning
incrementally from a stream of data with appropriate generalization ability in the Q-Value function approximation) came from the Ph.D. thesis of Chris Watkins (Watkins 1989).
As mentioned before, we consider the RL book by Sutton and Barto (Richard S. Sutton
and Barto 2018) as the best source for a comprehensive study of RL algorithms as well as
the best source for all references associated with RL (hence, we don’t provide too many
references in this book).
As mentioned in previous chapters, most RL algorithms are founded on the Bellman
Equations and all RL Control algorithms are based on the fundamental idea of Generalized
Policy Iteration that we have explained in Chapter 2. But the exact ways in which the Bell-
man Equations and Generalized Policy Iteration idea are utilized in RL algorithms differ
from one algorithm to another, and they differ significantly from how the Bellman Equa-
tions/Generalized Policy Iteration idea is utilized in DP algorithms.
As has been our practice, we start with the Prediction problem (this chapter) and then
cover the Control problem (next chapter).

9.2. RL for Prediction


We re-use a lot of the notation we had developed in Module I. As a reminder, Prediction
is the problem of estimating the Value Function of an MDP for a given policy π. We know
from Chapter 2 that this is equivalent to estimating the Value Function of the π-implied
MRP. So in this chapter, we assume that we are working with an MRP (rather than an
MDP) and we assume that the MRP is available in the form of an interface that serves up
an individual experience of (next state, reward) pair, given current state. The interface
might be an actual environment or a simulated environment. We refer to the agent’s re-
ceipt of an individual experience of (next state, reward), given current state, as an atomic
experience. Interacting with this interface in succession (starting from a state S0 ) gives us
a trace experience consisting of alternating states and rewards as follows:

$$S_0, R_1, S_1, R_2, S_2, \ldots$$

Given a stream of atomic experiences or a stream of trace experiences, the RL Prediction


problem is to estimate the Value Function $V: \mathcal{N} \rightarrow \mathbb{R}$ of the MRP defined as:

$$V(s) = \mathbb{E}[G_t|S_t = s] \text{ for all } s \in \mathcal{N}, \text{ for all } t = 0, 1, 2, \ldots$$

where the Return $G_t$ for each $t = 0, 1, 2, \ldots$ is defined as:

$$G_t = \sum_{i=t+1}^{\infty} \gamma^{i-t-1} \cdot R_i = R_{t+1} + \gamma \cdot R_{t+2} + \gamma^2 \cdot R_{t+3} + \ldots = R_{t+1} + \gamma \cdot G_{t+1}$$

We use the above definition of Return even for a terminating trace experience (say ter-
minating at t = T , i.e., ST ∈ T ), by treating Ri = 0 for all i > T .
The RL prediction algorithms we will soon develop consume a stream of atomic experi-
ences or a stream of trace experiences to learn the requisite Value Function. So we want the
input to an RL Prediction algorithm to be either an Iterable of atomic experiences or an
Iterable of trace experiences. Now let’s talk about the representation (in code) of a single
atomic experience and the representation of a single trace experience. We take you back
to the code in Chapter 1 where we had set up a @dataclass TransitionStep that served as
a building block in the method simulate_reward in the abstract class MarkovRewardProcess.

@dataclass(frozen=True)
class TransitionStep(Generic[S]):
    state: NonTerminal[S]
    next_state: State[S]
    reward: float

TransitionStep[S] represents a single atomic experience. simulate_reward produces an


Iterator[TransitionStep[S]] (i.e., a stream of atomic experiences in the form of a sam-
pling trace) but in general, we can represent a single trace experience as an Iterable[TransitionStep[S]]
(i.e., a sequence or stream of atomic experiences). Therefore, we want the input to an RL
prediction algorithm to be either:

• an Iterable[TransitionStep[S]] representing a stream of atomic experiences


• an Iterable[Iterable[TransitionStep[S]]] representing a stream of trace experi-
ences

Let’s add a method reward_traces to MarkovRewardProcess that produces an Iterator


(stream) of the sampling traces produced by simulate_reward.1 So then we’d be able to use
the output of reward_traces as the Iterable[Iterable[TransitionStep[S]]] input to an RL
Prediction algorithm. Note that the input start_state_distribution is the specification of
the probability distribution of start states (state from which we start a sampling trace that
can be used as a trace experience).
def reward_traces(
    self,
    start_state_distribution: Distribution[NonTerminal[S]]
) -> Iterable[Iterable[TransitionStep[S]]]:
    while True:
        yield self.simulate_reward(start_state_distribution)

9.3. Monte-Carlo (MC) Prediction


Monte-Carlo (MC) Prediction is a very simple RL algorithm that performs supervised
learning to predict the expected return from any state of an MRP (i.e., it estimates the
Value Function of an MRP), given a stream of trace experiences. Note that we wrote the
abstract class FunctionApprox in Chapter 4 for supervised learning that takes data in the
form of (x, y) pairs where x is the predictor variable and y ∈ R is the response vari-
able. For the Monte-Carlo prediction problem, the x-values are the encountered states
across the stream of input trace experiences and the y-values are the associated returns on
the trace experiences (starting from the corresponding encountered state). The following
function (in the file rl/monte_carlo.py) mc_prediction takes as input an Iterable of trace
experiences, with each trace experience represented as an Iterable of TransitionSteps.
mc_prediction performs the requisite supervised learning in an incremental manner, by
calling the method iterate_updates of approx_0: ValueFunctionApprox[S] on an Iterator
of (state, return) pairs that are extracted from each trace experience. As a reminder, the
method iterate_updates calls the method update of FunctionApprox iteratively (in this
case, each call to update updates the ValueFunctionApprox for a single (state, return) data
point). mc_prediction produces as output an Iterator of ValueFunctionApprox[S], i.e., an
updated function approximation of the Value Function at the end of each trace experience
(note that function approximation updates can be done only at the end of trace experiences
because the trace experience returns are available only at the end of trace experiences).
1 reward_traces is defined in the file rl/markov_process.py.

import rl.markov_process as mp

def mc_prediction(
    traces: Iterable[Iterable[mp.TransitionStep[S]]],
    approx_0: ValueFunctionApprox[S],
    gamma: float,
    episode_length_tolerance: float = 1e-6
) -> Iterator[ValueFunctionApprox[S]]:
    episodes: Iterator[Iterator[mp.ReturnStep[S]]] = \
        (returns(trace, gamma, episode_length_tolerance) for trace in traces)
    f = approx_0
    yield f
    for episode in episodes:
        f = last(f.iterate_updates(
            [(step.state, step.return_)] for step in episode
        ))
        yield f

The core of the mc_prediction function above is the call to the returns function (de-
tailed below and available in the file rl/returns.py). returns takes as input: trace repre-
senting a trace experience (Iterable of TransitionStep), the discount factor gamma, and an
episodes_length_tolerance that determines how many time steps to cover in each trace ex-
perience when γ < 1 (as many steps as until γ steps falls below episodes_length_tolerance
or until the trace experience ends in a terminal state, whichever happens first). If γ = 1,
each trace experience needs to end in a terminal state (else the returns function will loop
forever).
The returns function calculates the returns Gt (accumulated discounted rewards) start-
ing from each state St in the trace experience.2 The key is to walk backwards from the end
of the trace experience to the start (so as to reuse the calculated returns while walking
backwards: Gt = Rt+1 + γ · Gt+1 ). Note the use of iterate.accumulate to perform this
backwards-walk calculation, which in turn uses the add_return method in TransitionStep
to create an instance of ReturnStep. The ReturnStep (as seen in the code below) class is de-
rived from the TransitionStep class and includes the additional attribute named return_.
We add a method called add_return in TransitionStep so we can augment the attributes
state, reward, next_state with the additional attribute return_ that is computed as reward
plus gamma times the return_ from the next state.3

@dataclass(frozen=True)
class TransitionStep(Generic[S]):
    state: NonTerminal[S]
    next_state: State[S]
    reward: float

    def add_return(self, gamma: float, return_: float) -> ReturnStep[S]:
        return ReturnStep(
            self.state,
            self.next_state,
            self.reward,
            return_=self.reward + gamma * return_
        )

@dataclass(frozen=True)
class ReturnStep(TransitionStep[S]):
    return_: float

2 returns is defined in the file rl/returns.py.
3 TransitionStep and the add_return method are defined in the file rl/markov_process.py.
import itertools
import math
import rl.iterate as iterate
import rl.markov_process as mp

def returns(
    trace: Iterable[mp.TransitionStep[S]],
    gamma: float,
    tolerance: float
) -> Iterator[mp.ReturnStep[S]]:
    trace = iter(trace)
    max_steps = round(math.log(tolerance) / math.log(gamma)) if gamma < 1 \
        else None
    if max_steps is not None:
        trace = itertools.islice(trace, max_steps * 2)
    *transitions, last_transition = list(trace)
    return_steps = iterate.accumulate(
        reversed(transitions),
        func=lambda next, curr: curr.add_return(gamma, next.return_),
        initial=last_transition.add_return(gamma, 0)
    )
    return_steps = reversed(list(return_steps))
    if max_steps is not None:
        return_steps = itertools.islice(return_steps, max_steps)
    return return_steps
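As a small worked illustration of this backwards-walk (with made-up numbers): for a three-step trace experience with rewards $R_1 = 1, R_2 = 2, R_3 = 3$ and $\gamma = 0.5$, the returns are computed in reverse as $G_2 = 3$, then $G_1 = 2 + 0.5 \cdot 3 = 3.5$, then $G_0 = 1 + 0.5 \cdot 3.5 = 2.75$, so each later return is computed once and reused rather than re-summing discounted rewards from scratch for every time step.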

We say that the trace experiences are episodic traces if each trace experience ends in a
terminal state to signify that each trace experience is an episode, after whose termination
we move on to the next episode. Trace experiences that do not terminate are known as
continuing traces. We say that an RL problem is episodic if the input trace experiences are
all episodic (likewise, we say that an RL problem is continuing if some of the input trace
experiences are continuing).
Assume that the probability distribution of returns conditional on a state is modeled by
a function approximation as a (state-conditional) normal distribution, whose mean (Value
Function) we denote as V (s; w) where s denotes a state for which the function approxima-
tion is being evaluated and w denotes the set of parameters in the function approximation
(eg: the weights in a neural network). Then, the loss function for supervised learning of
the Value Function is the sum of squares of differences between observed returns and the
Value Function estimate from the function approximation. For a state $S_t$ visited at time $t$ in a trace experience and its associated return $G_t$ on the trace experience, the contribution to the loss function is:

$$\mathcal{L}_{(S_t, G_t)}(w) = \frac{1}{2} \cdot (V(S_t; w) - G_t)^2 \tag{9.1}$$

Its gradient with respect to $w$ is:

$$\nabla_w \mathcal{L}_{(S_t, G_t)}(w) = (V(S_t; w) - G_t) \cdot \nabla_w V(S_t; w)$$

We know that the change in the parameters (adjustment to the parameters) is equal to the negative of the gradient of the loss function, scaled by the learning rate (let's denote the learning rate as $\alpha$). Then the change in parameters is:

$$\Delta w = \alpha \cdot (G_t - V(S_t; w)) \cdot \nabla_w V(S_t; w) \tag{9.2}$$


This is a standard formula for change in parameters in response to incremental atomic
data for supervised learning when the response variable has a conditional normal distri-
bution. But it’s useful to see this formula in an intuitive manner for this specialization

314
of incremental supervised learning to Reinforcement Learning parameter updates. We
should interpret the change in parameters ∆w as the product of three conceptual entities:

• Learning Rate α
• Return Residual of the observed return Gt relative to the estimated conditional ex-
pected return V (St ; w)
• Estimate Gradient of the conditional expected return V (St ; w) with respect to the pa-
rameters w

This interpretation of the change in parameters as the product of these three conceptual
entities: (Learning rate, Return Residual, Estimate Gradient) is important as this will be a
repeated pattern in many of the RL algorithms we will cover.
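To make this three-way decomposition concrete, here is a small illustrative sketch (our own, assuming a linear function approximation $V(s; w) = \phi(s)^\top w$ and a hand-rolled NumPy update rather than the book's FunctionApprox classes) of the Monte-Carlo parameter update of Equation (9.2):

import numpy as np

def mc_linear_update(
    w: np.ndarray,       # current parameters of V(s; w) = phi(s) . w
    phi_s: np.ndarray,   # feature vector phi(S_t) of the visited state
    return_g: float,     # observed trace experience return G_t
    alpha: float         # learning rate
) -> np.ndarray:
    """One Monte-Carlo update per Equation (9.2):
    Delta w = alpha * (G_t - V(S_t; w)) * gradient of V(S_t; w) w.r.t. w.
    For a linear approximation, that gradient is simply phi(S_t)."""
    estimate = float(np.dot(phi_s, w))        # V(S_t; w)
    return_residual = return_g - estimate     # G_t - V(S_t; w)
    return w + alpha * return_residual * phi_s

With one-hot (indicator) features for each non-terminal state, this update reduces to the Tabular update discussed next.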
Now we consider a simple case of Monte-Carlo Prediction where the MRP consists of a
finite state space with the non-terminal states N = {s1 , s2 , . . . , sm }. In this case, we rep-
resent the Value Function of the MRP in a data structure (dictionary) of (state, expected
return) pairs. This is known as “Tabular” Monte-Carlo (more generally as Tabular RL to
reflect the fact that we represent the calculated Value Function in a “table” , i.e., dictio-
nary). Note that in this case, Monte-Carlo Prediction reduces to a very simple calculation
wherein for each state, we simply maintain the average of the trace experience returns
from that state onwards (averaged over state visitations across trace experiences), and
the average is updated in an incremental manner. Recall from Section 4.4 of Chapter 4
that this is exactly what’s done in the Tabular class (in file rl/func_approx.py). We also
recall from Section 4.4 of Chapter 4 that Tabular implements the interface of the abstract
class FunctionApprox and so, we can perform Tabular Monte-Carlo Prediction by passing a
Tabular instance as the approx_0: FunctionApprox argument to the mc_prediction function
above. The implementation of the update method in Tabular is exactly as we desire: it per-
forms an incremental averaging of the trace experience returns obtained from each state
onwards (over a stream of trace experiences).
Let us denote $V_n(s_i)$ as the estimate of the Value Function for a state $s_i$ after the $n$-th occurrence of the state $s_i$ (when doing Tabular Monte-Carlo Prediction) and let $Y_i^{(1)}, Y_i^{(2)}, \ldots, Y_i^{(n)}$ be the trace experience returns associated with the $n$ occurrences of state $s_i$. Let us denote the count_to_weight_func attribute of Tabular as $f$. Then, the Tabular update at the $n$-th occurrence of state $s_i$ (with its associated return $Y_i^{(n)}$) is as follows:

$$V_n(s_i) = (1 - f(n)) \cdot V_{n-1}(s_i) + f(n) \cdot Y_i^{(n)} = V_{n-1}(s_i) + f(n) \cdot (Y_i^{(n)} - V_{n-1}(s_i)) \tag{9.3}$$

Thus, we see that the update (change) to the Value Function for a state si is equal to
(n)
f (n) (weight for the latest trace experience return Yi from state si ) times the difference
(n)
between the latest trace experience return Yi and the current Value Function estimate
Vn−1 (si ). This is a good perspective as it tells us how to adjust the Value Function estimate
in an intuitive manner. In the case of the default setting of count_to_weight_func as f (n) =
1
n , we get:

n−1 1 (n) 1 (n)


Vn (si ) = · Vn−1 (si ) + · Yi = Vn−1 (si ) + · (Yi − Vn−1 (si )) (9.4)
n n n
So if we have 9 occurrences of a state with an average trace experience return of 50 and if the 10th occurrence of the state gives a trace experience return of 60, then we consider $\frac{1}{10}$ of $60 - 50$ (equal to 1) and increase the Value Function estimate for the state from 50 to $50 + 1 = 51$. This illustrates how we move the Value Function estimate in the direction from the current estimate to the latest trace experience return, by a magnitude of $\frac{1}{n}$ of their gap.
Expanding the incremental updates across values of $n$ in Equation (9.3), we get:

$$\begin{aligned}
V_n(s_i) = & f(n) \cdot Y_i^{(n)} + (1 - f(n)) \cdot f(n-1) \cdot Y_i^{(n-1)} + \ldots \\
& + (1 - f(n)) \cdot (1 - f(n-1)) \cdots (1 - f(2)) \cdot f(1) \cdot Y_i^{(1)}
\end{aligned} \tag{9.5}$$

In the case of the default setting of count_to_weight_func as $f(n) = \frac{1}{n}$, we get:

$$V_n(s_i) = \frac{1}{n} \cdot Y_i^{(n)} + \frac{n-1}{n} \cdot \frac{1}{n-1} \cdot Y_i^{(n-1)} + \ldots + \frac{n-1}{n} \cdot \frac{n-2}{n-1} \cdots \frac{1}{2} \cdot \frac{1}{1} \cdot Y_i^{(1)} = \frac{\sum_{k=1}^{n} Y_i^{(k)}}{n} \tag{9.6}$$

which is an equally-weighted average of the trace experience returns from the state. From the Law of Large Numbers, we know that the sample average converges to the expected value, which is the core idea behind the Monte-Carlo method.
Note that the Tabular class as an implementation of the abstract class FunctionApprox
is not just a software design happenstance - there is a formal mathematical specialization
here that is vital to recognize. This tabular representation is actually a special case of linear
function approximation by setting a feature function $\phi_i(\cdot)$ for each $x_i$ as: $\phi_i(x_i) = 1$ and $\phi_i(x) = 0$ for each $x \neq x_i$ (i.e., $\phi_i(\cdot)$ is the indicator function for $x_i$, and the $\Phi$ matrix
of Chapter 4 reduces to the identity matrix). So we can conceptualize Tabular Monte-
Carlo Prediction as a linear function approximation with the feature functions equal to
the indicator functions for each of the non-terminal states and the linear-approximation
parameters wi equal to the Value Function estimates for the corresponding non-terminal
states.
With this perspective, more broadly, we can view Tabular RL as a special case of RL with
Linear Function Approximation of the Value Function. Moreover, the count_to_weight_func
attribute of Tabular plays the role of the learning rate (as a function of the number of it-
erations in stochastic gradient descent). This becomes clear if we write Equation (9.3) in
terms of parameter updates: write $V_n(s_i)$ as parameter value $w_i^{(n)}$ to denote the $n$-th update to parameter $w_i$ corresponding to state $s_i$, and write $f(n)$ as learning rate $\alpha_n$ for the $n$-th update to $w_i$.

$$w_i^{(n)} = w_i^{(n-1)} + \alpha_n \cdot (Y_i^{(n)} - w_i^{(n-1)})$$

So, the change in parameter $w_i$ for state $s_i$ is $\alpha_n$ times $Y_i^{(n)} - w_i^{(n-1)}$. We observe that $Y_i^{(n)} - w_i^{(n-1)}$ represents the negative gradient of the loss function for the data point $(s_i, Y_i^{(n)})$ in the case of linear function approximation with features as indicator variables (for each state). This is because the loss function for the data point $(s_i, Y_i^{(n)})$ is $\frac{1}{2} \cdot (Y_i^{(n)} - \sum_{j=1}^{m} \phi_j(s_i) \cdot w_j^{(n-1)})^2$ which reduces to $\frac{1}{2} \cdot (Y_i^{(n)} - w_i^{(n-1)})^2$, whose gradient in the direction of $w_i$ is $-(Y_i^{(n)} - w_i^{(n-1)})$ and 0 in the other directions (for $j \neq i$). So we see that Tabular updates are basically a special case of LinearFunctionApprox updates if we set the features to be indicator functions for each of the states (with count_to_weight_func playing the role of the learning rate).
Now that you recognize that count_to_weight_func essentially plays the role of the learn-
ing rate and governs the importance given to the latest trace experience return relative to past trace experience returns, we want to point out that real-world situations are not sta-
tionary in the sense that the environment typically evolves over a period of time and so, RL
algorithms have to appropriately adapt to the changing environment. The way to adapt
effectively is to have an element of “forgetfulness” of the past because if one learns about
the distant past far too strongly in a changing environment, our predictions (and eventu-
ally control) would not be effective. So, how does an RL algorithm “forget?” Well, one
can “forget” through an appropriate time-decay of the weights when averaging trace ex-
perience returns. If we set a constant learning rate α (in Tabular, this would correspond to
count_to_weight_func=lambda _: alpha), we’d obtain “forgetfulness” with lower weights
for old data points and higher weights for recent data points. This is because with a con-
stant learning rate α, Equation (9.5) reduces to:

(n) (n−1) (1)


Vn (si ) = α · Yi + (1 − α) · α · Yi + . . . + (1 − α)n−1 · α · Yi
Xn
(j)
= α · (1 − α)n−j · Yi
j=1

which means we have exponentially-decaying weights in the weighted average of the


trace experience returns for any given state.
Note that for 0 < α ≤ 1, the weights sum up to 1 as n tends to infinity, i.e.,

X
n
lim α · (1 − α)n−j = lim 1 − (1 − α)n = 1
n→∞ n→∞
j=1

It’s worthwhile pointing out that the Monte-Carlo algorithm we’ve implemented above
is known as Each-Visit Monte-Carlo to refer to the fact that we include each occurrence
of a state in a trace experience. So if a particular state appears 10 times in a given trace
experience, we have 10 (state, return) pairs that are used to make the update (for just that
state) at the end of that trace experience. This is in contrast to First-Visit Monte-Carlo in
which only the first occurrence of a state in a trace experience is included in the set of
(state, return) pairs used to make an update at the end of the trace experience. So First-
Visit Monte-Carlo needs to keep track of whether a state has already been visited in a
trace experience (repeat occurrences of states in a trace experience are ignored). We won’t
implement First-Visit Monte-Carlo in this book, and leave it to you as an exercise.
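As a hint for that exercise, here is one possible sketch of the idea (our own illustration, using a plain dictionary-based tabular representation of the Value Function rather than the book's Tabular class), operating on episodes given as pre-computed (state, return) pairs:

from typing import Dict, Iterable, Tuple

def first_visit_tabular_mc(
    episodes: Iterable[Iterable[Tuple[str, float]]]
) -> Dict[str, float]:
    """Tabular First-Visit MC: each inner iterable is one episode's sequence
    of (state s_t, return G_t) pairs; within an episode, only the first
    occurrence of each state contributes to that state's running average."""
    value: Dict[str, float] = {}
    count: Dict[str, int] = {}
    for episode in episodes:
        seen = set()
        for state, return_ in episode:
            if state in seen:
                continue              # ignore repeat occurrences in this episode
            seen.add(state)
            count[state] = count.get(state, 0) + 1
            v = value.get(state, 0.0)
            value[state] = v + (return_ - v) / count[state]  # incremental average
        # (Each-Visit MC would simply drop the `seen` check above.)
    return value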
Now let’s write some code to test our implementation of Monte-Carlo Prediction. To do
so, we go back to a simple finite MRP example from Chapter 1 - SimpleInventoryMRPFinite.
The following code creates an instance of the MRP and computes its exact Value Function
based on Equation (1.2).
from rl.chapter2.simple_inventory_mrp import SimpleInventoryMRPFinite
user_capacity = 2
user_poisson_lambda = 1.0
user_holding_cost = 1.0
user_stockout_cost = 10.0
user_gamma = 0.9
si_mrp = SimpleInventoryMRPFinite(
    capacity=user_capacity,
    poisson_lambda=user_poisson_lambda,
    holding_cost=user_holding_cost,
    stockout_cost=user_stockout_cost
)
si_mrp.display_value_function(gamma=user_gamma)

This prints the following:

{NonTerminal(state=InventoryState(on_hand=0, on_order=0)): -35.511,


NonTerminal(state=InventoryState(on_hand=0, on_order=1)): -27.932,
NonTerminal(state=InventoryState(on_hand=0, on_order=2)): -28.345,
NonTerminal(state=InventoryState(on_hand=1, on_order=0)): -28.932,
NonTerminal(state=InventoryState(on_hand=1, on_order=1)): -29.345,
NonTerminal(state=InventoryState(on_hand=2, on_order=0)): -30.345}

Next, we run Monte-Carlo Prediction by first generating a stream of trace experiences (in
the form of sampling traces) from the MRP, and then calling mc_prediction using Tabular
with equal-weights-learning-rate (i.e., default count_to_weight_func of lambda n: 1.0 /
n).

from rl.chapter2.simple_inventory_mrp import InventoryState
from rl.function_approx import Tabular
from rl.approximate_dynamic_programming import ValueFunctionApprox
from rl.distribution import Choose
from rl.iterate import last
from rl.monte_carlo import mc_prediction
from itertools import islice
from pprint import pprint

traces: Iterable[Iterable[TransitionStep[S]]] = \
    si_mrp.reward_traces(Choose(si_mrp.non_terminal_states))
it: Iterator[ValueFunctionApprox[InventoryState]] = mc_prediction(
    traces=traces,
    approx_0=Tabular(),
    gamma=user_gamma,
    episode_length_tolerance=1e-6
)
num_traces = 60000
last_func: ValueFunctionApprox[InventoryState] = last(islice(it, num_traces))
pprint({s: round(last_func.evaluate([s])[0], 3)
        for s in si_mrp.non_terminal_states})

This prints the following:

{NonTerminal(state=InventoryState(on_hand=1, on_order=1)): -29.341,


NonTerminal(state=InventoryState(on_hand=2, on_order=0)): -30.349,
NonTerminal(state=InventoryState(on_hand=0, on_order=0)): -35.52,
NonTerminal(state=InventoryState(on_hand=0, on_order=1)): -27.931,
NonTerminal(state=InventoryState(on_hand=0, on_order=2)): -28.355,
NonTerminal(state=InventoryState(on_hand=1, on_order=0)): -28.93}

We see that the Value Function computed by Tabular Monte-Carlo Prediction with 60000
trace experiences is within 0.01 of the exact Value Function, for each of the states.
This completes the coverage of our first RL Prediction algorithm: Monte-Carlo Predic-
tion. This has the advantage of being a very simple, easy-to-understand algorithm with an
unbiased estimate of the Value Function. But Monte-Carlo can be slow to converge to the
correct Value Function and another disadvantage of Monte-Carlo is that it requires entire
trace experiences (or long-enough trace experiences when γ < 1). The next RL Prediction
algorithm we cover (Temporal-Difference) overcomes these weaknesses.

9.4. Temporal-Difference (TD) Prediction
To understand Temporal-Difference (TD) Prediction, we start with its Tabular version as it is simple to understand (and then we can generalize to TD Prediction with Function Approximation). To understand Tabular TD prediction, we begin by taking another look at the Value Function update in Tabular Monte-Carlo (MC) Prediction with constant learning rate.

$$V(S_t) \leftarrow V(S_t) + \alpha \cdot (G_t - V(S_t)) \tag{9.7}$$


where St is the state visited at time step t in the current trace experience, Gt is the trace
experience return obtained from time step t onwards, and α denotes the learning rate
(based on count_to_weight_func attribute in the Tabular class). The key in moving from
MC to TD is to take advantage of the recursive structure of the Value Function as given by
the MRP Bellman Equation (Equation (1.1)). Although we only have access to individual
experiences of next state St+1 and reward Rt+1 , and not the transition probabilities of next
state and reward, we can approximate Gt as experience reward Rt+1 plus γ times V (St+1 )
(where St+1 is the experience’s next state). The idea is to build upon (the term we use is
bootstrap) the Value Function that is currently estimated. Clearly, this is a biased estimate of
the Value Function meaning the update to the Value Function for St will be biased. But the
bias disadvantage is outweighed by the reduction in variance (which we will discuss more
about later), by speedup in convergence (bootstrapping is our friend here), and by the fact
that we don’t actually need entire/long-enough trace experiences (again, bootstrapping is
our friend here). So, the update for Tabular TD Prediction is:

$$V(S_t) \leftarrow V(S_t) + \alpha \cdot (R_{t+1} + \gamma \cdot V(S_{t+1}) - V(S_t)) \tag{9.8}$$


Note how we’ve simply replaced Gt in Equation (9.7) (Tabular MC Prediction update)
with Rt+1 + γ · V (St+1 ).
To facilitate understanding, for the remainder of the book, we shall interpret V (St+1 )
as being equal to 0 if St+1 ∈ T (note: technically, this notation is incorrect because V (·)
is a function with domain N ). Likewise, we shall interpret the function approximation
notation V (St+1 ; w) as being equal to 0 if St+1 ∈ T .
We refer to Rt+1 + γ · V (St+1 ) as the TD target and we refer to δt = Rt+1 + γ · V (St+1 ) −
V (St ) as the TD Error. The TD Error is the crucial quantity since it represents the “sample
Bellman Error” and hence, the TD Error can be used to move V (St ) appropriately (as
shown in the above adjustment to V (St )), which in turn has the effect of bridging the TD
error (on an expected basis).
An important practical advantage of TD is that (unlike MC) we can use it in situations where we have incomplete trace experiences (this happens often in real-world situations where experiments get curtailed/disrupted) and also, we can use it in situations where we never reach a terminal state (continuing trace). The other appealing thing about TD is that it is learning (updating the Value Function) after each atomic experience (we call it continuous learning) versus MC's learning at the end of trace experiences. This also means that TD can be run on any stream of atomic experiences, not just atomic experiences that are part of a trace experience. This is a major advantage as we can chop up the available data and serve it in any order, freeing us from the order in which the data arrives.
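Before moving on to function approximation, here is a minimal sketch (our own illustration, not the book's rl/td.py implementation) of the Tabular TD update of Equation (9.8), applied to a stream of atomic experiences each given as a (state, reward, next_state, is_terminal) tuple:

from typing import Dict, Iterable, Tuple

def tabular_td0(
    atomic_experiences: Iterable[Tuple[str, float, str, bool]],
    gamma: float,
    alpha: float
) -> Dict[str, float]:
    """Tabular TD Prediction per Equation (9.8):
    V(s) <- V(s) + alpha * (r + gamma * V(s') - V(s)),
    with V(s') treated as 0 when s' is a terminal state."""
    v: Dict[str, float] = {}
    for state, reward, next_state, is_terminal in atomic_experiences:
        v_next = 0.0 if is_terminal else v.get(next_state, 0.0)
        td_error = reward + gamma * v_next - v.get(state, 0.0)
        v[state] = v.get(state, 0.0) + alpha * td_error
    return v

Note that the Value Function is updated after every single atomic experience, and the atomic experiences need not arrive in trace order.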
Now that we understand how TD Prediction works for the Tabular case, let’s consider
TD Prediction with Function Approximation. Here, each time we transition from a state
$S_t$ to state $S_{t+1}$ with reward $R_{t+1}$, we make an update to the parameters of the function approximation. To understand how the parameters of the function approximation update,
let’s consider the loss function for TD. We start with the single-state loss function for MC
(Equation (9.1)) and simply replace $G_t$ with $R_{t+1} + \gamma \cdot V(S_{t+1}; w)$ as follows:

$$\mathcal{L}_{(S_t, S_{t+1}, R_{t+1})}(w) = \frac{1}{2} \cdot (V(S_t; w) - (R_{t+1} + \gamma \cdot V(S_{t+1}; w)))^2 \tag{9.9}$$
Unlike MC, in the case of TD, we don’t take the gradient of this loss function. Instead we
“cheat” in the gradient calculation by ignoring the dependency of V (St+1 ; w) on w. This
“gradient with cheating” calculation is known as semi-gradient. Specifically, we pretend
that the only dependency of the loss function on w is through V (St ; w). Hence, the semi-
gradient calculation results in the following formula for change in parameters w:

$$\Delta w = \alpha \cdot (R_{t+1} + \gamma \cdot V(S_{t+1}; w) - V(S_t; w)) \cdot \nabla_w V(S_t; w) \tag{9.10}$$

This looks similar to the formula for parameters update in the case of MC (with Gt
replaced by Rt+1 + γ · V (St+1 ; w)). Hence, this has the same structure as MC in terms of
conceptualizing the change in parameters as the product of the following 3 entities:

• Learning Rate α
• TD Error δt = Rt+1 + γ · V (St+1 ; w) − V (St ; w)
• Estimate Gradient of the conditional expected return V (St ; w) with respect to the pa-
rameters w

Now let’s write some code to implement TD Prediction (with Function Approximation).
Unlike MC which takes as input a stream of trace experiences, TD works with a more
granular stream: a stream of atomic experiences. Note that a stream of trace experiences can
be broken up into a stream of atomic experiences, but we could also obtain a stream of
atomic experiences in other ways (not necessarily from a stream of trace experiences).
Thus, the TD prediction algorithm we write below (td_prediction) takes as input an
Iterable[TransitionStep[S]]. td_prediction produces an Iterator of ValueFunctionApprox[S],
i.e., an updated function approximation of the Value Function after each atomic experience
in the input atomic experiences stream. Similar to our implementation of MC, our imple-
mentation of TD is based on supervised learning on a stream of (x, y) pairs, but there are
two key differences:

1. The update of the ValueFunctionApprox is done after each atomic experience, versus
MC where the updates are done at the end of each trace experience.
2. The y-value depends on the Value Function estimate, as seen from the update Equa-
tion (9.10) above. This means we cannot use the iterate_updates method of FunctionApprox
that MC Prediction uses. Rather, we need to directly use the rl.iterate.accumulate
function (a wrapped version of itertools.accumulate). As seen in the code below,
the accumulation is performed on the input transitions: Iterable[TransitionStep[S]]
and the function governing the accumulation is the step function in the code below
that calls the update method of ValueFunctionApprox. Note that the y-values passed
to update involve a call to the estimated Value Function v for the next_state of each
transition. However, since the next_state could be Terminal or NonTerminal, and
since ValueFunctionApprox is valid only for non-terminal states, we use the extended_vf
function we had implemented in Chapter 4 to handle the cases of the next state being
Terminal or NonTerminal (with terminal states evaluating to the default value of 0).

import rl.iterate as iterate
import rl.markov_process as mp
from rl.approximate_dynamic_programming import ValueFunctionApprox
from rl.approximate_dynamic_programming import extended_vf

def td_prediction(
    transitions: Iterable[mp.TransitionStep[S]],
    approx_0: ValueFunctionApprox[S],
    gamma: float
) -> Iterator[ValueFunctionApprox[S]]:
    def step(
        v: ValueFunctionApprox[S],
        transition: mp.TransitionStep[S]
    ) -> ValueFunctionApprox[S]:
        return v.update([(
            transition.state,
            transition.reward + gamma * extended_vf(v, transition.next_state)
        )])

    return iterate.accumulate(transitions, step, initial=approx_0)

The above code is in the file rl/td.py.


Now let’s write some code to test our implementation of TD Prediction. We test on the
same SimpleInventoryMRPFinite that we had tested MC Prediction on. Let us see how close
we can get to the true Value Function (that we had calculated above while testing MC Pre-
diction). But first we need to write a function to construct a stream of atomic experiences
(Iterator[TransitionStep[S]]) from a given FiniteMarkovRewardProcess (below code is in
the file rl/chapter10/prediction_utils.py). Note the use of itertools.chain.from_iterable
to chain together a stream of trace experiences (obtained by calling method reward_traces)
into a stream of atomic experiences in the below function unit_experiences_from_episodes.

import itertools
from rl.distribution import Distribution, Choose
from rl.approximate_dynamic_programming import NTStateDistribution

def mrp_episodes_stream(
    mrp: MarkovRewardProcess[S],
    start_state_distribution: NTStateDistribution[S]
) -> Iterable[Iterable[TransitionStep[S]]]:
    return mrp.reward_traces(start_state_distribution)

def fmrp_episodes_stream(
    fmrp: FiniteMarkovRewardProcess[S]
) -> Iterable[Iterable[TransitionStep[S]]]:
    return mrp_episodes_stream(fmrp, Choose(fmrp.non_terminal_states))

def unit_experiences_from_episodes(
    episodes: Iterable[Iterable[TransitionStep[S]]],
    episode_length: int
) -> Iterable[TransitionStep[S]]:
    return itertools.chain.from_iterable(
        itertools.islice(episode, episode_length) for episode in episodes
    )

Effective use of Tabular TD Prediction requires us to create an appropriate learning


rate schedule by suitably lowering the learning rate as a function of the number of oc-
currences of a state in the atomic experiences stream (learning rate schedule specified
by count_to_weight_func attribute of Tabular class). We write below (code in the file
rl/function_approx.py) the following learning rate schedule:
$$\alpha_n = \frac{\alpha}{1 + \left(\frac{n-1}{H}\right)^{\beta}} \qquad (9.11)$$

where αn is the learning rate to be used at the n-th Value Function update for a given
state, α is the initial learning rate (i.e. α = α1 ), H (we call it “half life”) is the number of
updates for the learning rate to decrease to half the initial learning rate (if β is 1), and β is
the exponent controlling the curvature of the decrease in the learning rate. We shall often
set β = 0.5.

def learning_rate_schedule(
    initial_learning_rate: float,
    half_life: float,
    exponent: float
) -> Callable[[int], float]:
    def lr_func(n: int) -> float:
        return initial_learning_rate * (1 + (n - 1) / half_life) ** -exponent
    return lr_func
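
As a rough illustration of how this schedule decays (using the settings we choose for the test below), a quick calculation gives:

lr = learning_rate_schedule(
    initial_learning_rate=0.03,
    half_life=1000.0,
    exponent=0.5
)
# lr(1)    = 0.03 * (1 + 0/1000) ** -0.5    = 0.03
# lr(1001) = 0.03 * (1 + 1000/1000) ** -0.5 ~ 0.0212
# lr(4001) = 0.03 * (1 + 4000/1000) ** -0.5 ~ 0.0134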

With these functions available, we can now write code to test our implementation of
TD Prediction. We use the same instance si_mrp: SimpleInventoryMRPFinite that we had
created above when testing MC Prediction. We use the same number of episodes (60000)
we had used when testing MC Prediction. We set initial learning rate α = 0.03, half life
H = 1000 and exponent β = 0.5. We set the episode length (number of atomic experiences
in a single trace experience) to be 100 (about the same as with the settings we had for
testing MC Prediction). We use the same discount factor γ = 0.9.

import rl.iterate as iterate
import rl.td as td
import itertools
from pprint import pprint
from rl.chapter10.prediction_utils import fmrp_episodes_stream
from rl.chapter10.prediction_utils import unit_experiences_from_episodes
from rl.function_approx import learning_rate_schedule

episode_length: int = 100
initial_learning_rate: float = 0.03
half_life: float = 1000.0
exponent: float = 0.5
gamma: float = 0.9
episodes: Iterable[Iterable[TransitionStep[S]]] = \
    fmrp_episodes_stream(si_mrp)
td_experiences: Iterable[TransitionStep[S]] = \
    unit_experiences_from_episodes(
        episodes,
        episode_length
    )
learning_rate_func: Callable[[int], float] = learning_rate_schedule(
    initial_learning_rate=initial_learning_rate,
    half_life=half_life,
    exponent=exponent
)
td_vfs: Iterator[ValueFunctionApprox[S]] = td.td_prediction(
    transitions=td_experiences,
    approx_0=Tabular(count_to_weight_func=learning_rate_func),
    gamma=gamma
)
num_episodes = 60000
final_td_vf: ValueFunctionApprox[S] = \
    iterate.last(itertools.islice(td_vfs, episode_length * num_episodes))
pprint({s: round(final_td_vf(s), 3) for s in si_mrp.non_terminal_states})

This prints the following:

{NonTerminal(state=InventoryState(on_hand=0, on_order=0)): -35.529,
NonTerminal(state=InventoryState(on_hand=0, on_order=1)): -27.868,
NonTerminal(state=InventoryState(on_hand=0, on_order=2)): -28.344,
NonTerminal(state=InventoryState(on_hand=1, on_order=0)): -28.935,
NonTerminal(state=InventoryState(on_hand=1, on_order=1)): -29.386,
NonTerminal(state=InventoryState(on_hand=2, on_order=0)): -30.305}

Thus, we see that our implementation of TD prediction with the above settings fetches us
an estimated Value Function within 0.065 of the true Value Function after 60,000 episodes.
As ever, we encourage you to play with various settings for MC Prediction and TD pre-
diction to develop some intuition for how the results change as you change the settings.
You can play with the code in the file rl/chapter10/simple_inventory_mrp.py.

9.5. TD versus MC
It is often claimed that TD is the most significant and innovative idea in the development
of the field of Reinforcement Learning. The key to TD is that it blends the advantages
of Dynamic Programming (DP) and Monte-Carlo (MC). Like DP, TD updates the Value
Function estimate by bootstrapping from the Value Function estimate of the next state
experienced (essentially, drawing from Bellman Equation). Like MC, TD learns from ex-
periences without requiring access to transition probabilities (MC and TD updates are ex-
perience updates while DP updates are transition-probabilities-averaged-updates). So TD over-
comes curse of dimensionality and curse of modeling (computational limitation of DP),
and also has the advantage of not requiring entire trace experiences (practical limitation
of MC).
The TD idea has its origins in a seminal book by Harry Klopf (Klopf and Data Sciences
Laboratory 1972) that greatly influenced Richard Sutton and Andrew Barto to pursue the
TD idea further, after which they published several papers on TD, much of whose content
is covered in their RL book (Richard S. Sutton and Barto 2018).

9.5.1. TD learning akin to human learning


Perhaps the most attractive thing about TD (versus MC) is that it is akin to how humans
learn. Let us illustrate this point with how a soccer player learns to improve her game in the
process of playing many soccer games. Let’s simplify the soccer game to a “golden-goal”
soccer game, i.e., the game ends when a team scores a goal. The reward in such a soccer
game is +1 for scoring (and winning), 0 if the opponent scores, and also 0 for the entire
duration of the game before the goal is scored. The soccer player (who is learning) has her
State comprising of her position/velocity/posture etc., the other players’ positions/velocity
etc., the soccer ball’s position/velocity etc. The Actions of the soccer player are her physical
movements, including the ways to dribble/kick the ball. If the soccer player learns in an
MC style (a single episode is a single soccer game), then the soccer player analyzes (at the
end of the game) all possible states and actions that occurred during the game and assesses
how the actions in each state might have affected the final outcome of the game. You can
see how laborious and difficult this actions-reward linkage would be, and you might even
argue that it’s impossible to disentangle the effects of various actions during the game
on the goal that was eventually scored. In any case, you should recognize that this is
absolutely not how a soccer player would analyze and learn. Rather, a soccer player learns
during the game - she is continuously evaluating how her actions change the probability of

scoring the goal (which is essentially the Value Function). If a pass to her teammate did
not result in a goal but greatly increased the chances of scoring one, then passing the ball
to her teammate in that state is a good action, and its Q-value estimate gets boosted
immediately. She will likely try that action (or a similar action) again; actions with better
Q-values get prioritized, which drives towards better and quicker goal-scoring opportunities
and, likely, eventually a goal. Such goal-scoring (based on
active learning during the game, cutting out poor actions and promoting good actions)
would be hailed by commentators as “success from continuous and eager learning” on
the part of the soccer player. This is essentially TD learning.
If you think about career decisions and relationship decisions in our lives, MC-style
learning is quite infeasible because we simply don’t have sufficient “episodes” (for certain
decisions, our entire life might be a single episode), and waiting to analyze and adjust until
the end of an episode might be far too late in our lives. Rather, we learn and adjust our
evaluations of situations constantly in a TD-like manner. Think about various important
decisions we make in our lives and you will see that we learn by perpetual adjustment of
estimates and we are efficient in the use of limited experiences we obtain in our lives.

9.5.2. Bias, Variance and Convergence


Now let’s talk about bias and variance of the MC and TD prediction estimates, and their
convergence properties.
Say we are at state St at time step t on a trace experience, and Gt is the return from that
state St onwards on this trace experience. Gt is an unbiased estimate of the true value
function for state St , which is a big advantage for MC when it comes to convergence, even
with function approximation of the Value Function. On the other hand, the TD Target
Rt+1 + γ · V (St+1 ; w) is a biased estimate of the true value function for state St . There is
considerable literature on formal proofs of TD Prediction convergence and we won’t cover
it in detail here, but here’s a qualitative summary: Tabular TD Prediction converges to
the true value function in the mean for constant learning rate, and converges to the true
value function if the following stochastic approximation conditions are satisfied for the
learning rate schedule αn , n = 1, 2, . . ., where the index n refers to the n-th occurrence of
a particular state whose Value Function is being updated:

$$\sum_{n=1}^{\infty} \alpha_n = \infty \quad \text{and} \quad \sum_{n=1}^{\infty} \alpha_n^2 < \infty$$

The stochastic approximation conditions above are known as the Robbins-Monro sched-
ule and apply to a general class of iterative methods used for root-finding or optimization
when data is noisy. The intuition here is that the steps should be large enough (first con-
dition) to eventually overcome any unfavorable initial values or noisy data and yet the
steps should eventually become small enough (second condition) to ensure convergence.
Note that in Equation (9.11), exponent β = 1 satisfies the Robbins-Monro conditions. In
particular, our default choice of count_to_weight_func=lambda n: 1.0 / n in Tabular sat-
isfies the Robbins-Monro conditions, but our other common choice of constant learning
rate does not satisfy the Robbins-Monro conditions. However, we want to emphasize that
the Robbins-Monro conditions are typically not that useful in practice because it is not a
statement of speed of convergence and it is not a statement on closeness to the true optima
(in practice, the goal is typically simply to get fairly close to the true answer reasonably
quickly).
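
Still, as a quick sanity check (a sketch rather than a formal argument) of why the β = 1 case of Equation (9.11) satisfies these conditions: with $\alpha_n = \frac{\alpha}{1 + \frac{n-1}{H}} = \frac{\alpha H}{H + n - 1}$, we have
$$\sum_{n=1}^{\infty} \frac{\alpha H}{H + n - 1} = \infty \quad \text{(harmonic-like divergence)} \qquad \text{and} \qquad \sum_{n=1}^{\infty} \frac{\alpha^2 H^2}{(H + n - 1)^2} < \infty \quad \text{(comparison with } \textstyle\sum 1/n^2\text{)}$$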

The bad news with TD (due to the bias in its update) is that TD Prediction with function
approximation does not always converge to the true value function. Most TD Prediction
convergence proofs are for the Tabular case, however some proofs are for the case of linear
function approximation of the Value Function.
The flip side of MC’s bias advantage over TD is that the TD Target Rt+1 + γ · V (St+1 ; w)
has much lower variance than Gt because Gt depends on many random state transitions
and random rewards (on the remainder of the trace experience) whose variances accu-
mulate, whereas the TD Target depends on only the next random state transition St+1 and
the next random reward Rt+1 .
As for speed of convergence and efficiency in use of limited set of experiences data,
we still don’t have formal proofs on whether MC is better or TD is better. More impor-
tantly, because MC and TD have significant differences in their usage of data, nature of
updates, and frequency of updates, it is not even clear how to create a level-playing field
when comparing MC and TD for speed of convergence or for efficiency in usage of limited
experiences data. The typical comparisons between MC and TD are done with constant
learning rates, and practical experience indicates that TD learns faster than MC with
constant learning rates.
A popular simple problem in the literature (when comparing RL prediction algorithms)
is a random walk MRP with states {0, 1, 2, . . . , B} with 0 and B as the terminal states (think
of these as terminating barriers of a random walk) and the remaining states as the non-
terminal states. From any non-terminal state i, we transition to state i + 1 with probability
p and to state i − 1 with probability 1 − p. The reward is 0 upon each transition, except
if we transition from state B − 1 to terminal state B which results in a reward of 1. It’s
quite obvious that for p = 0.5 (symmetric random walk), the Value Function is given by:
V(i) = i/B for all 0 < i < B. We'd like to analyze how MC and TD converge, if at all,
to this Value Function, starting from a neutral initial Value Function of V (i) = 0.5 for all
0 < i < B. The following code sets up this random walk MRP.
from rl.distribution import Categorical

class RandomWalkMRP(FiniteMarkovRewardProcess[int]):
    barrier: int
    p: float

    def __init__(
        self,
        barrier: int,
        p: float
    ):
        self.barrier = barrier
        self.p = p
        super().__init__(self.get_transition_map())

    def get_transition_map(self) -> \
            Mapping[int, Categorical[Tuple[int, float]]]:
        d: Dict[int, Categorical[Tuple[int, float]]] = {
            i: Categorical({
                (i + 1, 0. if i < self.barrier - 1 else 1.): self.p,
                (i - 1, 0.): 1 - self.p
            }) for i in range(1, self.barrier)
        }
        return d

The above code is in the file rl/chapter10/random_walk_mrp.py. Next, we generate a
stream of trace experiences from the MRP, use the trace experiences stream to perform MC
Prediction, split the trace experiences stream into a stream of atomic experiences so as to
perform TD Prediction, run MC and TD Prediction with a variety of learning rate choices,
and plot the root-mean-squared-errors (RMSE) of the Value Function averaged across the
non-terminal states as a function of episode batches (i.e., visualize how the RMSE of the
Value Function evolves as the MC/TD algorithm progresses). This is done by calling the
function compare_mc_and_td which is in the file rl/chapter10/prediction_utils.py.

Figure 9.1.: MC and TD Convergence for Random Walk MRP
Figure 9.1 depicts the convergence for our implementations of MC and TD Prediction
for constant learning rates of α = 0.01 (darker curves) and α = 0.05 (lighter curves). We
produced this Figure by using data from 700 episodes generated from the random walk
MRP with barrier B = 10, p = 0.5 and discount factor γ = 1 (a single episode refers
to a single trace experience that terminates either at state 0 or at state B). We plotted
the RMSE after each batch of 7 episodes, hence each of the 4 curves shown in the Figure
has 100 RMSE data points plotted. Firstly, we clearly see that MC has significantly more
variance as evidenced by the choppy MC RMSE progression curves. Secondly, we note
that α = 0.01 is a fairly small learning rate and so, the progression of RMSE is quite slow
on the darker curves. On the other hand, notice the quick learning for α = 0.05 (lighter
curves). The MC RMSE curve is not just choppy; it's evident that it progresses quite quickly
in the first few episode batches (relative to the corresponding TD) but is slow after the
first few episode batches (relative to the corresponding TD). This results in TD reaching
fairly small RMSE quicker than the corresponding MC (this is especially stark for TD with
α = 0.05, i.e. the dashed lighter curve in the Figure). This behavior of TD outperforming
the comparable MC (with constant learning rate) is typical for MRP problems.
Lastly, it’s important to recognize that MC is not very sensitive to the initial Value Func-
tion while TD is more sensitive to the initial Value Function. We encourage you to play
with the initial Value Function for this random walk example and evaluate how it affects
MC and TD convergence speed.
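
As a starting point for such experimentation, here is a minimal sketch of constructing the random walk MRP and inspecting its true Value Function (the import path below simply mirrors the file named above and is an assumption on our part):

from rl.chapter10.random_walk_mrp import RandomWalkMRP  # assumed import path, matching the file above

this_mrp = RandomWalkMRP(barrier=10, p=0.5)
# For the symmetric random walk, the true Value Function is V(i) = i / 10 for i = 1, ..., 9
this_mrp.display_value_function(1.0)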
More generally, we encourage you to play with the compare_mc_and_td function on other
choices of MRP (ones we have created earlier in this book such as the inventory examples,
or make up your own MRPs) so you can develop good intuition for how MC and TD

Prediction algorithms converge for a variety of choices of learning rate schedules, initial
Value Function choices, choices of discount factor etc.

9.5.3. Fixed-Data Experience Replay on TD versus MC


We have talked a lot about how TD learns versus how MC learns. In this subsection, we turn
our focus to what TD learns and what MC learns, to shed light on a profound conceptual
difference between TD and MC. We illuminate this difference with a special setting - we are
given a fixed finite set of trace experiences (versus usual settings considered in this chap-
ter so far where we had an “endless” stream of trace experiences). The agent is allowed
to tap into this fixed finite set of trace experiences endlessly, i.e., the MC or TD Predic-
tion RL agent can indeed consume an endless stream of experiences, but all of that stream
of experiences must ultimately be sourced from the given fixed finite set of trace experi-
ences. This means we'd end up tapping into trace experiences (or their component atomic
experiences) repeatedly. We refer to this technique of re-using previously encountered
experiences data as Experience Replay. We will cover this Experience Replay technique in more
detail in Chapter 11, but for now, we shall uncover the key conceptual difference between
what MC and TD learn by running the algorithms on an Experience Replay of a fixed finite
set of trace experiences.
So let us start by setting up this experience replay with some code. Firstly, we represent
the given input data of the fixed finite set of trace experiences as the data type:
Sequence[Sequence[Tuple[S, float]]]
The outer Sequence refers to the sequence of trace experiences, and the inner Sequence
refers to the sequence of (state, reward) pairs in a trace experience (to represent the alter-
nating sequence of states and rewards in a trace experience). The first function we write
is to convert this data set into a:
Sequence[Sequence[TransitionStep[S]]]
which is consumable by MC and TD Prediction algorithms (since their interfaces work
with the TransitionStep[S] data type). The following function does this job:

def get_fixed_episodes_from_sr_pairs_seq(
    sr_pairs_seq: Sequence[Sequence[Tuple[S, float]]],
    terminal_state: S
) -> Sequence[Sequence[TransitionStep[S]]]:
    return [[TransitionStep(
        state=NonTerminal(s),
        reward=r,
        next_state=NonTerminal(trace[i+1][0])
        if i < len(trace) - 1 else Terminal(terminal_state)
    ) for i, (s, r) in enumerate(trace)] for trace in sr_pairs_seq]

We’d like MC Prediction to run on an endless stream of Sequence[TransitionStep[S]]


sourced from the fixed finite data set produced by get_fixed_episodes_from_sr_pairs_seq.
So we write the following function to generate an endless stream by repeatedly randomly
(uniformly) sampling from the fixed finite set of trace experiences:

import numpy as np

def get_episodes_stream(
    fixed_episodes: Sequence[Sequence[TransitionStep[S]]]
) -> Iterator[Sequence[TransitionStep[S]]]:
    num_episodes: int = len(fixed_episodes)
    while True:
        yield fixed_episodes[np.random.randint(num_episodes)]

As we know, TD works with atomic experiences rather than trace experiences. So we
need the following function to split the fixed finite set of trace experiences into a fixed finite
set of atomic experiences:
import itertools

def fixed_experiences_from_fixed_episodes(
    fixed_episodes: Sequence[Sequence[TransitionStep[S]]]
) -> Sequence[TransitionStep[S]]:
    return list(itertools.chain.from_iterable(fixed_episodes))

We’d like TD Prediction to run on an endless stream of TransitionStep[S] from the fixed
finite set of atomic experiences produced by fixed_experiences_from_fixed_episodes. So
we write the following function to generate an endless stream by repeatedly randomly
(uniformly) sampling from the fixed finite set of atomic experiences:
def get_experiences_stream(
    fixed_experiences: Sequence[TransitionStep[S]]
) -> Iterator[TransitionStep[S]]:
    num_experiences: int = len(fixed_experiences)
    while True:
        yield fixed_experiences[np.random.randint(num_experiences)]

Ok - now we are ready to run MC and TD Prediction algorithms on an experience replay


of the given input of a fixed finite set of trace experiences. It is quite obvious what the MC
Prediction algorithm would learn. MC Prediction is simply supervised learning of a data
set of states and their associated returns, and here we have a fixed finite set of states (across
the trace experiences) and the corresponding trace experience returns associated with each
of those states. Hence, MC Prediction should return a Value Function comprising of the
average returns seen in the fixed finite data set for each of the states in the data set. So let
us first write a function to explicitly calculate the average returns, and then we can confirm
that MC Prediction will give the same answer.
from rl.returns import returns
from rl.markov_process import ReturnStep

def get_return_steps_from_fixed_episodes(
    fixed_episodes: Sequence[Sequence[TransitionStep[S]]],
    gamma: float
) -> Sequence[ReturnStep[S]]:
    return list(itertools.chain.from_iterable(returns(episode, gamma, 1e-8)
                for episode in fixed_episodes))

def get_mean_returns_from_return_steps(
    returns_seq: Sequence[ReturnStep[S]]
) -> Mapping[NonTerminal[S], float]:
    def by_state(ret: ReturnStep[S]) -> S:
        return ret.state.state

    sorted_returns_seq: Sequence[ReturnStep[S]] = sorted(
        returns_seq,
        key=by_state
    )
    return {NonTerminal(s): np.mean([r.return_ for r in l])
            for s, l in itertools.groupby(
                sorted_returns_seq,
                key=by_state
            )}

To facilitate comparisons, we will do all calculations on the following simple hand-


entered input data set:

given_data: Sequence[Sequence[Tuple[str, float]]] = [
    [('A', 2.), ('A', 6.), ('B', 1.), ('B', 2.)],
    [('A', 3.), ('B', 2.), ('A', 4.), ('B', 2.), ('B', 0.)],
    [('B', 3.), ('B', 6.), ('A', 1.), ('B', 1.)],
    [('A', 0.), ('B', 2.), ('A', 4.), ('B', 4.), ('B', 2.), ('B', 3.)],
    [('B', 8.), ('B', 2.)]
]

The following code runs get_mean_returns_from_return_steps on this simple input data


set.
from pprint import pprint

gamma: float = 0.9
fixed_episodes: Sequence[Sequence[TransitionStep[str]]] = \
    get_fixed_episodes_from_sr_pairs_seq(
        sr_pairs_seq=given_data,
        terminal_state='T'
    )
returns_seq: Sequence[ReturnStep[str]] = \
    get_return_steps_from_fixed_episodes(
        fixed_episodes=fixed_episodes,
        gamma=gamma
    )
mean_returns: Mapping[NonTerminal[str], float] = \
    get_mean_returns_from_return_steps(returns_seq)
pprint(mean_returns)

This prints:

{NonTerminal(state='B'): 5.190378571428572,
NonTerminal(state='A'): 8.261809999999999}

Now let’s run MC Prediction with experience-replayed 100,000 trace experiences with
equal weighting for each of the (state, return) pairs, i.e., with the count_to_weight_func at-
tribute of Tabular set to the function lambda n: 1.0 / n:
import rl.monte_carlo as mc
import rl.iterate as iterate

def mc_prediction(
    episodes_stream: Iterator[Sequence[TransitionStep[S]]],
    gamma: float,
    num_episodes: int
) -> Mapping[NonTerminal[S], float]:
    return iterate.last(itertools.islice(
        mc.mc_prediction(
            traces=episodes_stream,
            approx_0=Tabular(),
            gamma=gamma,
            episode_length_tolerance=1e-10
        ),
        num_episodes
    )).values_map

num_mc_episodes: int = 100000
episodes: Iterator[Sequence[TransitionStep[str]]] = \
    get_episodes_stream(fixed_episodes)
mc_pred: Mapping[NonTerminal[str], float] = mc_prediction(
    episodes_stream=episodes,
    gamma=gamma,
    num_episodes=num_mc_episodes
)
pprint(mc_pred)

This prints:

{NonTerminal(state='A'): 8.262643843836214,
NonTerminal(state='B'): 5.191276907315868}

So, as expected, it ties out within the standard error for 100,000 trace experiences. Now
let’s move on to TD Prediction. Let’s run TD Prediction on experience-replayed 1,000,000
atomic experiences with a learning rate schedule having an initial learning rate of 0.01,
decaying with a half life of 10000, and with an exponent of 0.5.

import rl.td as td
from rl.function_approx import learning_rate_schedule, Tabular

def td_prediction(
    experiences_stream: Iterator[TransitionStep[S]],
    gamma: float,
    num_experiences: int
) -> Mapping[NonTerminal[S], float]:
    return iterate.last(itertools.islice(
        td.td_prediction(
            transitions=experiences_stream,
            approx_0=Tabular(count_to_weight_func=learning_rate_schedule(
                initial_learning_rate=0.01,
                half_life=10000,
                exponent=0.5
            )),
            gamma=gamma
        ),
        num_experiences
    )).values_map

num_td_experiences: int = 1000000
fixed_experiences: Sequence[TransitionStep[str]] = \
    fixed_experiences_from_fixed_episodes(fixed_episodes)
experiences: Iterator[TransitionStep[str]] = \
    get_experiences_stream(fixed_experiences)
td_pred: Mapping[NonTerminal[str], float] = td_prediction(
    experiences_stream=experiences,
    gamma=gamma,
    num_experiences=num_td_experiences
)
pprint(td_pred)

This prints:

{NonTerminal(state='A'): 9.899838136517303,
NonTerminal(state='B'): 7.444114569419306}

We note that this Value Function is vastly different from the Value Function produced by
MC Prediction. Is there a bug in our code, or perhaps a more serious conceptual problem?
Nope - there is neither a bug here nor a more serious problem. This is exactly what TD
Prediction on Experience Replay on a fixed finite data set is meant to produce. So, what
Value Function does this correspond to? It turns out that TD Prediction drives towards a
Value Function of an MRP that is implied by the fixed finite set of given experiences. By the
term implied, we mean the maximum likelihood estimate for the transition probabilities
PR , estimated from the given fixed finite data, i.e.,
$$\mathcal{P}_R(s, r, s') = \frac{\sum_{i=1}^{N} \mathbb{I}_{S_i=s, R_{i+1}=r, S_{i+1}=s'}}{\sum_{i=1}^{N} \mathbb{I}_{S_i=s}} \qquad (9.12)$$

where the fixed finite set of atomic experiences is $[(S_i, R_{i+1}, S_{i+1}) \mid 1 \le i \le N]$, and $\mathbb{I}$
denotes the indicator function.
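As a quick check of this formula against the data: in the given_data set above, state A is the source state of 7 atomic experiences, the atomic experience (A, 6., B) occurs exactly once, and (A, 4., B) occurs twice, so the data-implied MRP has $\mathcal{P}_R(A, 6, B) = \frac{1}{7}$ and $\mathcal{P}_R(A, 4, B) = \frac{2}{7}$.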
So let’s write some code to construct this MRP based on the above formula.
import collections
from rl.distribution import Categorical
from rl.markov_process import FiniteMarkovRewardProcess

def finite_mrp(
    fixed_experiences: Sequence[TransitionStep[S]]
) -> FiniteMarkovRewardProcess[S]:
    def by_state(tr: TransitionStep[S]) -> S:
        return tr.state.state

    d: Mapping[S, Sequence[Tuple[S, float]]] = \
        {s: [(t.next_state.state, t.reward) for t in l] for s, l in
         itertools.groupby(
             sorted(fixed_experiences, key=by_state),
             key=by_state
         )}
    mrp: Dict[S, Categorical[Tuple[S, float]]] = \
        {s: Categorical({x: y / len(l) for x, y in
                         collections.Counter(l).items()})
         for s, l in d.items()}
    return FiniteMarkovRewardProcess(mrp)

Now let's print its Value Function.
fmrp: FiniteMarkovRewardProcess[str] = finite_mrp(fixed_experiences)
fmrp.display_value_function(gamma)

This prints:

{NonTerminal(state='A'): 9.958, NonTerminal(state='B'): 7.545}

So our TD Prediction algorithm doesn’t exactly match the Value Function of the data-
implied MRP, but it gets close. It turns out that a variation of our TD Prediction algorithm
exactly matches the Value Function of the data-implied MRP. We won’t implement this
variation in this chapter, but will describe it briefly here. The variation is as follows:

• The Value Function is not updated after each atomic experience, rather the Value
Function is updated at the end of each batch of atomic experiences.

• Each batch of atomic experiences consists of a single occurrence of each atomic ex-
perience in the given fixed finite data set.
• The updates to the Value Function to be performed at the end of each batch are accu-
mulated in a buffer after each atomic experience and the buffer’s contents are used to
update the Value Function only at the end of the batch. Specifically, this means that
the right-hand-side of Equation (9.10) is calculated at the end of each atomic expe-
rience and these calculated values are accumulated in the buffer until the end of the
batch, at which point the buffer’s contents are used to update the Value Function.

This variant of the TD Prediction algorithm is known as Batch Updating and more broadly,
RL algorithms that update the Value Function at the end of a batch of experiences are referred
to as Batch Methods. This contrasts with Incremental Methods, which are RL algorithms
that update the Value Function after each atomic experience (in the case of TD) or at the
end of each trace experience (in the case of MC). The MC and TD Prediction algorithms
we implemented earlier in this chapter are Incremental Methods. We will cover Batch
Methods in detail in Chapter 11.
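
To make the Batch Updating variant described above concrete, here is one possible minimal tabular sketch (our own simplified representation using plain (state, reward, next_state) tuples rather than the book's TransitionStep type; all names are illustrative):

from typing import Dict, Sequence, Tuple

def batch_td_prediction_tabular(
    experiences: Sequence[Tuple[str, float, str]],  # fixed finite set of (state, reward, next_state)
    terminal_state: str,
    gamma: float,
    alpha: float,
    num_batches: int
) -> Dict[str, float]:
    vf: Dict[str, float] = {s: 0.0 for s, _, _ in experiences}
    for _ in range(num_batches):
        # accumulate the TD updates for this batch in a buffer
        buffer: Dict[str, float] = {s: 0.0 for s in vf}
        for s, r, s1 in experiences:
            next_v = 0.0 if s1 == terminal_state else vf.get(s1, 0.0)
            buffer[s] += alpha * (r + gamma * next_v - vf[s])
        # update the Value Function only at the end of the batch
        for s in vf:
            vf[s] += buffer[s]
    return vf

Run on the atomic experiences of the given_data set (with 'T' as the terminal state, γ = 0.9, a small α and enough batches), this should drive the Value Function toward that of the data-implied MRP (approximately 9.958 for A and 7.545 for B, as printed above).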
Although our TD Prediction algorithm is an Incremental Method, it did get fairly close
to the Value Function of the data-implied MRP. So let us ignore the nuance that our TD
Prediction algorithm didn’t exactly match the Value Function of the data-implied MRP
and instead focus on the fact that our MC Prediction algorithm and our TD Prediction al-
gorithm drove towards two very different Value Functions. The MC Prediction algorithm
learns a “fairly naive” Value Function - one that is based on the mean of the observed re-
turns (for each state) in the given fixed finite data. The TD Prediction algorithm is learning
something “deeper” - it is (implicitly) constructing an MRP based on the given fixed fi-
nite data (Equation (9.12)), and then (implicitly) calculating the Value Function of the
constructed MRP. The mechanics of the TD Prediction algorithms don’t actually construct
the MRP and calculate the Value Function of the MRP - rather, the TD Prediction algo-
rithm directly drives towards the Value Function of the data-implied MRP. However, the
fact that it gets to this "more nuanced" Value Function means that it is (implicitly) trying
to infer a transition structure from the given data, and hence, we say that it is learning
something “deeper” than what MC is learning. This has practical implications. Firstly,
this learning facet of TD means that it exploits any Markov property in the environment
and so, TD algorithms are more efficient (learn faster than MC) in Markov environments.
On the other hand, the naive nature of MC (not exploiting any Markov property in the
environment) is advantageous (more effective than TD) in non-Markov environments.
We encourage you to try Experience Replay on larger input data sets, and to code up
Batch Method variants of MC and TD prediction algorithms. As a starting point, the expe-
rience replay code for this chapter is in the file rl/chapter10/mc_td_experience_replay.py.

9.5.4. Bootstrapping and Experiencing


We summarize MC, TD and DP in terms of whether they bootstrap (or not) and in terms
of whether they experience interactions with an actual/simulated environment (or not).

• Bootstrapping: By "bootstrapping," we mean that an update to the Value Function
utilizes a current or prior estimate of the Value Function. MC does not bootstrap since
its Value Function updates use actual trace experience returns and not any current
or prior estimates of the Value Function. On the other hand, TD and DP do bootstrap.

• Experiencing: By “experiencing,” we mean that the algorithm uses experiences ob-
tained by interacting with an actual or simulated environment, rather than per-
forming expectation calculations with a model of transition probabilities (the latter
doesn’t require interactions with an environment and hence, doesn’t “experience”).
MC and TD do experience, while DP does not experience.

We illustrate this perspective of bootstrapping (or not) and experiencing (or not) with
some very popular diagrams that we are borrowing from lecture slides from David Silver’s
RL course and from teaching content prepared by Richard Sutton.
The first diagram is Figure 9.2, known as the MC backup diagram for an MDP (although
we are covering Prediction in this chapter, these concepts also apply to MDP Control).
The root of the tree is the state whose Value Function we want to update. The remaining
nodes of the tree are the future states that might be visited and future actions that might be
taken. The branching on the tree is due to the probabilistic transitions of the MDP and the
multiple choices of actions that might be taken at each time step. The nodes marked as “T”
are the terminal states. The highlighted path on the tree from the root node (current state)
to a terminal state indicates a particular trace experience used by the MC algorithm. The
highlighted path is the set of future states/actions used in updating the Value Function of
the current state (root node). We say that the Value Function is “backed up” along this
highlighted path (to mean that the Value Function update calculation propagates from the
bottom of the highlighted path to the top, since the trace experience return is calculated as
accumulated rewards from the bottom to the top, i.e., from the end of the trace experience
to the beginning of the trace experience). This is why we refer to such diagrams as backup
diagrams. Since MC "experiences", it only considers a single child node from any node
(rather than all the child nodes, which would be the case if we considered all probabilistic
transitions or considered all action choices). So the backup is narrow (doesn’t go wide
across the tree). Since MC does not “bootstrap,” it doesn’t use the Value Function estimate
from its child/grandchild node (next time step's state/action) - instead, it utilizes the
rewards at all future states/actions along the entire trace experience. So the backup works
deep into the tree (is not shallow as would be the case in “bootstrapping”). In summary,
the MC backup is narrow and deep.
The next diagram is Figure 9.3, known as the TD backup diagram for an MDP. Again,
the highlighting applies to the future states/actions used in updating the Value Function
of the current state (root node). The Value Function is “backed up” along this highlighted
portion of the tree. Since TD "experiences", it only considers a single child node from any
node (rather than all the child nodes, which would be the case if we considered all
probabilistic transitions or considered all action choices). So the backup is narrow (doesn't
go wide across the tree). Since TD "bootstraps", it uses the Value Function estimate from
its child/grandchild node (next time step's state/action) and doesn't utilize rewards at
states/actions beyond the next time step’s state/action. So the backup is shallow (doesn’t
work deep into the tree). In summary, the TD backup is narrow and shallow.
The next diagram is Figure 9.4, known as the DP backup diagram for an MDP. Again,
the highlighting applies to the future states/actions used in updating the Value Function
of the current state (root node). The Value Function is “backed up” along this highlighted
portion of the tree. Since DP does not “experience” and utilizes the knowledge of prob-
abilities of all next states and considers all choices of actions (in the case of Control), it
considers all child nodes (all choices of actions) and all grandchild nodes (all probabilis-
tic transitions to next states) from the root node (current state). So the backup goes wide
across the tree. Since DP "bootstraps", it uses the Value Function estimate from its
children/grandchildren nodes (next time step's states/actions) and doesn't utilize rewards at
states/actions beyond the next time step's states/actions. So the backup is shallow (doesn't
work deep into the tree). In summary, the DP backup is wide and shallow.

Figure 9.2.: MC Backup Diagram (Image Credit: David Silver's RL Course)
This perspective of shallow versus deep (for “bootstrapping” or not) and of narrow ver-
sus wide (for “experiencing” or not) is a great way to visualize and internalize the core
ideas within MC, TD and DP, and it helps us compare and contrast these methods in a
simple and intuitive manner. We must thank Rich Sutton for this excellent pedagogical
contribution. This brings us to the next diagram (Figure 9.5) which provides a unified
view of RL in a single picture. The top of this Figure shows methods that “bootstrap” (in-
cluding TD and DP) and the bottom of this Figure shows methods that do not “bootstrap”
(including MC and methods known as “Exhaustive Search” that go both deep into the
tree and wide across the tree - we shall cover some of these methods in a later chapter).
Therefore the vertical dimension of this Figure refers to the depth of the backup. The left of
this Figure shows methods that “experience” (including TD and MC) and the right of this
Figure shows methods that do not "experience" (including DP and "Exhaustive Search").
Therefore, the horizontal dimension of this Figure refers to the width of the backup.

Figure 9.3.: TD Backup Diagram (Image Credit: David Silver's RL Course)

Figure 9.4.: DP Backup Diagram (Image Credit: David Silver's RL Course)

Figure 9.5.: Unified View of RL (Image Credit: Sutton-Barto's RL Book)

9.6. TD(λ) Prediction

Now that we've seen the contrasting natures of TD and MC (and their respective pros and
cons), it's natural to wonder if we could design an RL Prediction algorithm that combines
the features of TD and MC and perhaps fetch us a blend of their respective benefits. It turns
out this is indeed possible, and is the subject of this section - an innovative approach to RL
Prediction known as TD(λ). λ is a continuous-valued parameter in the range [0, 1] such
that λ = 0 corresponds to the TD approach and λ = 1 corresponds to the MC approach.
Tuning λ between 0 and 1 allows us to span the spectrum from the TD approach to the
MC approach, essentially a blended approach known as the TD(λ) approach. The TD(λ)
approach for RL Prediction gives us the TD(λ) Prediction algorithm. To get to the TD(λ)
Prediction algorithm (in this section), we start with the TD Prediction algorithm we wrote
earlier, generalize it to a multi-time-step bootstrapping prediction algorithm, extend that
further to an algorithm known as the λ-Return Prediction algorithm, after which we shall
be ready to present the TD(λ) Prediction algorithm.

9.6.1. n-Step Bootstrapping Prediction Algorithm


In this subsection, we generalize the bootstrapping approach of TD to multi-time-step
bootstrapping (which we refer to as n-step bootstrapping). We start with Tabular Pre-
diction as it is very easy to explain and understand. To understand n-step bootstrapping,
let us take another look at the TD update equation for the Tabular case (Equation (9.8)):

V (St ) ← V (St ) + α · (Rt+1 + γ · V (St+1 ) − V (St ))


The basic idea was that we replaced Gt (in the case of the MC update) with the estimate
Rt+1 + γ · V (St+1 ), by using the current estimate of the Value Function of the state that is
1 time step ahead on the trace experience. It’s then natural to extend this idea to instead
use the current estimate of the Value Function for the state that is 2 time steps ahead on
the trace experience, which would yield the following update:

V (St ) ← V (St ) + α · (Rt+1 + γ · Rt+2 + γ 2 · V (St+2 ) − V (St ))


We can generalize this to an update that uses the current estimate of the Value Function
for the state that is n ≥ 1 time steps ahead on the trace experience, as follows:

V (St ) ← V (St ) + α · (Gt,n − V (St )) (9.13)


where Gt,n (known as n-step bootstrapped return) is defined as:

$$G_{t,n} = \sum_{i=t+1}^{t+n} \gamma^{i-t-1} \cdot R_i + \gamma^n \cdot V(S_{t+n}) = R_{t+1} + \gamma \cdot R_{t+2} + \gamma^2 \cdot R_{t+3} + \ldots + \gamma^{n-1} \cdot R_{t+n} + \gamma^n \cdot V(S_{t+n})$$

If the trace experience terminates at t = T , i.e., ST ∈ T , the above equation applies


only for t, n such that t + n < T . Essentially, each n-step bootstrapped return Gt,n is
an approximation of the full return Gt , by truncating Gt at n steps and adjusting for the
remainder with the Value Function estimate V (St+n ) for the state St+n . If t + n ≥ T , then
there is no need for a truncation and the n-step bootstrapped return Gt,n is equal to the
full return Gt .
It is easy to generalize this n-step bootstrapping Prediction algorithm to the case of Func-
tion Approximation for the Value Function. The update Equation (9.13) generalizes to:

∆w = α · (Gt,n − V (St ; w)) · ∇w V (St ; w) (9.14)


where the n-step bootstrapped return Gt,n is now defined in terms of the function ap-
proximation for the Value Function (rather than the tabular Value Function), as follows:

$$G_{t,n} = \sum_{i=t+1}^{t+n} \gamma^{i-t-1} \cdot R_i + \gamma^n \cdot V(S_{t+n}; w) = R_{t+1} + \gamma \cdot R_{t+2} + \gamma^2 \cdot R_{t+3} + \ldots + \gamma^{n-1} \cdot R_{t+n} + \gamma^n \cdot V(S_{t+n}; w)$$

The nuances we outlined above for when the trace experience terminates naturally apply
here as well.
Equation (9.14) looks similar to the parameters update equations for the MC and TD
Prediction algorithms we covered earlier, in terms of conceptualizing the change in pa-
rameters as the product of the following 3 entities:

• Learning Rate α
• n-step Bootstrapped Error Gt,n − V (St ; w)
• Estimate Gradient of the conditional expected return V (St ; w) with respect to the pa-
rameters w

n serves as a parameter taking us across the spectrum from TD to MC. n = 1 is the case
of TD while sufficiently large n is the case of MC. If a trace experience is of length T (i.e.,
ST ∈ T ), then n ≥ T will not have any bootstrapping (since the bootstrapping target goes
beyond the length of the trace experience) and hence, this makes it identical to MC.
We note that for large n, the update to the Value Function for state St visited at time
t happens in a delayed manner (after n steps, at time t + n), which is unlike the TD al-
gorithm we had developed earlier where the update happens at the very next time step.
We won’t be implementing this n-step bootstrapping Prediction algorithm and leave it as
an exercise for you to implement (re-using some of the functions/classes we have devel-
oped so far in this book). A key point to note for your implementation: The input won’t
be an Iterable of atomic experiences (like in the case of the TD Prediction algorithm we
implemented), rather it will be an Iterable of trace experiences (i.e., the input will be the
same as for our MC Prediction algorithm: Iterable[Iterable[TransitionStep[S]]]) since
we need multiple future rewards in the trace to perform an update to the current state.
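
For reference, here is one possible minimal tabular sketch of such an algorithm (our own simplified representation: an episode is a sequence of (state, reward) pairs as in the given_data format used earlier, and each state is updated as we sweep through a completed episode, bootstrapping from the current Value Function estimates):

from typing import Dict, Sequence, Tuple

def n_step_td_prediction_tabular(
    episodes: Sequence[Sequence[Tuple[str, float]]],  # each episode: [(S_0, R_1), (S_1, R_2), ...]
    n: int,
    gamma: float,
    alpha: float
) -> Dict[str, float]:
    vf: Dict[str, float] = {}
    for episode in episodes:
        T = len(episode)
        for t, (s, _) in enumerate(episode):
            vf.setdefault(s, 0.0)
            steps = min(n, T - t)  # truncate the reward sum at episode termination
            g = sum(gamma ** i * episode[t + i][1] for i in range(steps))
            if t + n < T:  # bootstrap with V(S_{t+n}) only if S_{t+n} is non-terminal
                g += gamma ** n * vf.setdefault(episode[t + n][0], 0.0)
            vf[s] += alpha * (g - vf[s])  # the update of Equation (9.13)
    return vf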

9.6.2. λ-Return Prediction Algorithm


Now we extend the n-step bootstrapping Prediction Algorithm to the λ-Return Prediction
Algorithm. The idea behind this extension is really simple: Since the target for each n (in
n-step bootstrapping) is Gt,n , a valid target can also be a weighted-average target:

$$\sum_{n=1}^{N} u_n \cdot G_{t,n} + u \cdot G_t \quad \text{where } u + \sum_{n=1}^{N} u_n = 1$$

Note that any of the un or u can be 0, as long as they all sum up to 1. The λ-Return target
is a special case of the weights un and u, and applies to episodic problems (i.e., where
every trace experience terminates). For a given state St with the episode terminating at
time T (i.e., ST ∈ T ), the weights for the λ-Return target are as follows:

$$u_n = (1 - \lambda) \cdot \lambda^{n-1} \text{ for all } n = 1, \ldots, T-t-1, \quad u_n = 0 \text{ for all } n \ge T-t, \quad u = \lambda^{T-t-1}$$

We denote the λ-Return target as $G_t^{(\lambda)}$, defined as:

$$G_t^{(\lambda)} = (1 - \lambda) \cdot \sum_{n=1}^{T-t-1} \lambda^{n-1} \cdot G_{t,n} + \lambda^{T-t-1} \cdot G_t \qquad (9.15)$$

Thus, the update Equation is:


$$\Delta w = \alpha \cdot (G_t^{(\lambda)} - V(S_t; w)) \cdot \nabla_w V(S_t; w) \qquad (9.16)$$

We note that for λ = 0, the λ-Return target reduces to the TD (1-step bootstrapping)
target and for λ = 1, the λ-Return target reduces to the MC target Gt . The λ parameter
gives us a smooth way of tuning from TD (λ = 0) to MC (λ = 1).
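As a quick check that these weights form a valid weighted average, note (using the geometric series sum, for λ ≠ 1):
$$(1 - \lambda) \cdot \sum_{n=1}^{T-t-1} \lambda^{n-1} + \lambda^{T-t-1} = (1 - \lambda) \cdot \frac{1 - \lambda^{T-t-1}}{1 - \lambda} + \lambda^{T-t-1} = 1$$
For λ = 0, only $u_1 = 1$ survives, recovering the TD target $G_{t,1}$; for λ = 1, only $u = 1$ survives, recovering the MC target $G_t$.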
Note that for λ > 0, Equation (9.16) tells us that the parameters w of the function
approximation can be updated only at the end of an episode (the term episode refers to
a terminating trace experience). Updating w according to Equation (9.16) for all states
St , t = 0, . . . , T − 1, at the end of each episode gives us the Offline λ-Return Prediction algo-
rithm. The term Offline refers to the fact that we have to wait till the end of an episode
to make an update to the parameters w of the function approximation (rather than mak-
ing parameter updates after each time step in the episode, which we refer to as an Online
algorithm). Online algorithms are appealing because the Value Function update for an
atomic experience could be utilized immediately by the updates for the next few atomic
experiences, and so it facilitates continuous/fast learning. So the natural question to ask
here is if we can turn the Offline λ-return Prediction algorithm outlined above to an Online
version. An online version is indeed possible (it’s known as the TD(λ) Prediction algo-
rithm) and is the topic of the remaining subsections of this section. But before we begin
the coverage of the (Online) TD(λ) Prediction algorithm, let’s wrap up this subsection
with an implementation of this Offline version (i.e., the λ-Return Prediction algorithm).

import rl.markov_process as mp
import numpy as np
from rl.approximate_dynamic_programming import ValueFunctionApprox

def lambda_return_prediction(
    traces: Iterable[Iterable[mp.TransitionStep[S]]],
    approx_0: ValueFunctionApprox[S],
    gamma: float,
    lambd: float
) -> Iterator[ValueFunctionApprox[S]]:
    func_approx: ValueFunctionApprox[S] = approx_0
    yield func_approx

    for trace in traces:
        gp: List[float] = [1.]
        lp: List[float] = [1.]
        predictors: List[NonTerminal[S]] = []
        partials: List[List[float]] = []
        weights: List[List[float]] = []
        trace_seq: Sequence[mp.TransitionStep[S]] = list(trace)
        for t, tr in enumerate(trace_seq):
            for i, partial in enumerate(partials):
                partial.append(
                    partial[-1] +
                    gp[t - i] * (tr.reward - func_approx(tr.state)) +
                    (gp[t - i] * gamma * extended_vf(func_approx, tr.next_state)
                     if t < len(trace_seq) - 1 else 0.)
                )
                weights[i].append(
                    weights[i][-1] * lambd if t < len(trace_seq) - 1
                    else lp[t - i]
                )
            predictors.append(tr.state)
            partials.append([tr.reward +
                             (gamma * extended_vf(func_approx, tr.next_state)
                              if t < len(trace_seq) - 1 else 0.)])
            weights.append([1. - (lambd if t < len(trace_seq) - 1 else 0.)])
            gp.append(gp[-1] * gamma)
            lp.append(lp[-1] * lambd)
        responses: Sequence[float] = [np.dot(p, w) for p, w in
                                      zip(partials, weights)]
        for p, r in zip(predictors, responses):
            func_approx = func_approx.update([(p, r)])
            yield func_approx

The above code is in the file rl/td_lambda.py.


Note that this λ-Return Prediction algorithm is not just Offline, it is also a highly ineffi-
cient algorithm because of the two loops within each trace experience. However, it serves
as a pedagogical benefit before moving on to the (efficient) Online TD(λ) Prediction al-
gorithm.

9.6.3. Eligibility Traces


Now we are ready to start developing the TD(λ) Prediction algorithm. The TD(λ) Predic-
tion algorithm is founded on the concept of Eligibility Traces. So we start by introducing
the concept of Eligibility traces (first for the Tabular case, then generalize to Function Ap-
proximations), then go over the TD(λ) Prediction algorithm (based on Eligibility traces),
and finally explain why the TD(λ) Prediction algorithm is essentially the Online version
of the Offline λ-Return Prediction algorithm we’ve implemented above.
We begin the story of Eligibility Traces with the concept of a (for lack of a better term)
Memory function. Assume that we have an event happening at specific points in time, say
at times t1 , t2 , . . . , tn ∈ R≥0 with t1 < t2 < . . . < tn , and we’d like to construct a Memory
function M : R≥0 → R≥0 such that the Memory function (at any point in time t) remembers
the number of times the event has occurred up to time t, but also has an element of “for-
getfulness” in the sense that recent occurrences of the event are remembered better than
older occurrences of the event. So the function M needs to have an element of memory-
decay in remembering the count of the occurrences of the events. In other words, we want
the function M to produce a time-decayed count of the event occurrences. We do this by
constructing the function M as follows (for some decay-parameter θ ∈ [0, 1]):



$$M(t) = \begin{cases} \mathbb{I}_{t=t_1} & \text{if } t \le t_1 \\ M(t_i) \cdot \theta^{t-t_i} + \mathbb{I}_{t=t_{i+1}} & \text{if } t_i < t \le t_{i+1} \text{ for any } 1 \le i < n \\ M(t_n) \cdot \theta^{t-t_n} & \text{otherwise (i.e., if } t > t_n\text{)} \end{cases} \qquad (9.17)$$

where I denotes the indicator function.


This means the memory function has an uptick of 1 each time the event occurs (at time
ti , for each i = 1, 2, . . . n), but then decays by a factor of θ∆t over any interval ∆t where
the event doesn’t occur. Thus, the memory function captures the notion of frequency of
the events as well as the recency of the events.
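For instance, with θ = 0.8 and events at times $t_1 = 1$ and $t_2 = 2$, Equation (9.17) gives $M(1) = 1$, $M(2) = 1 \cdot 0.8^{1} + 1 = 1.8$ and $M(3) = 1.8 \cdot 0.8^{1} = 1.44$ - a quick worked check of the uptick-and-decay behavior.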
Let’s write some code to plot this function in order to visualize it and gain some intuition.

import math
import matplotlib.pyplot as plt

def plot_memory_function(theta: float, event_times: List[float]) -> None:
    step: float = 0.01
    x_vals: List[float] = [0.0]
    y_vals: List[float] = [0.0]
    for t in event_times:
        rng: Sequence[int] = range(1, int(math.floor((t - x_vals[-1]) / step)))
        x_vals += [x_vals[-1] + i * step for i in rng]
        y_vals += [y_vals[-1] * theta ** (i * step) for i in rng]
        x_vals.append(t)
        y_vals.append(y_vals[-1] * theta ** (t - x_vals[-1]) + 1.0)
    plt.plot(x_vals, y_vals)
    plt.grid()
    plt.xticks([0.0] + event_times)
    plt.xlabel("Event Timings", fontsize=15)
    plt.ylabel("Memory Function Values", fontsize=15)
    plt.title("Memory Function (Frequency and Recency)", fontsize=25)
    plt.show()

Let’s run this for θ = 0.8 and an arbitrary sequence of event times:
theta = 0.8
event_times = [2.0, 3.0, 4.0, 7.0, 9.0, 14.0, 15.0, 21.0]
plot_memory_function(theta, event_times)

This produces the graph in Figure 9.6.

Figure 9.6.: Memory Function (Frequency and Recency)

The above code is in the file rl/chapter10/memory_function.py.
This memory function is actually quite useful as a model for a variety of modeling situa-
tions in the broader world of Applied Mathematics where we want to combine the notions
of frequency and recency. However, here we want to use this memory function as a way
to model Eligibility Traces for the tabular case, which in turn will give us the tabular TD(λ)
Prediction algorithm (online version of the offline tabular λ-Return Prediction algorithm
we covered earlier).
Now we are ready to define Eligibility Traces for the Tabular case. We assume a finite
state space with the set of non-terminal states N = {s1 , s2 , . . . , sm }. Eligibility Trace for
each state s ∈ N is defined as the Memory function M (·) with θ = γ · λ (i.e., the product

of the discount factor and the TD-λ parameter) and the event timings are the time steps at
which the state s occurs in a trace experience. Thus, we define Eligibility Traces for a given
trace experience at any time step t (of the trace experience) as a function Et : N → R≥0 as
follows:

E0 (s) = IS0 =s , for all s ∈ N

Et (s) = γ · λ · Et−1 (s) + ISt =s , for all s ∈ N , for all t = 1, 2, . . .

where I denotes the indicator function.


Then, the Tabular TD(λ) Prediction algorithm performs the following updates to the
Value Function at each time step t in each trace experience:

V (s) ← V (s) + α · (Rt+1 + γ · V (St+1 ) − V (St )) · Et (s), for all s ∈ N

Note the similarities and differences relative to the TD update we have seen earlier.
Firstly, this is an online algorithm since we make an update at each time step in a trace
experience. Secondly, we update the Value Function for all states at each time step (un-
like TD Prediction which updates the Value Function only for the particular state that is
visited at that time step). Thirdly, the change in the Value Function for each state s ∈ N
is proportional to the TD-Error δt = Rt+1 + γ · V (St+1 ) − V (St ), much like in the case of
the TD update. However, here the TD-Error is multiplied by the eligibility trace Et (s) for
each state s at each time step t. So, we can compactly write the update as:

V (s) ← V (s) + α · δt · Et (s), for all s ∈ N (9.18)

where α is the learning rate.


This is it - this is the Tabular TD(λ) Prediction algorithm! Now the question is - how is
this linked to the Tabular λ-Return Prediction algorithm? It turns out that if we made all
the updates of Equation (9.18) in an offline manner (at the end of each trace experience),
then the sum of the changes in the Value Function for any specific state s ∈ N over the
course of the entire trace experience is equal to the change in the Value Function for s
in the Tabular λ-Return Prediction algorithm as a result of its offline update for state s.
Concretely,

Theorem 9.6.1.

$$\sum_{t=0}^{T-1} \alpha \cdot \delta_t \cdot E_t(s) = \sum_{t=0}^{T-1} \alpha \cdot (G_t^{(\lambda)} - V(S_t)) \cdot \mathbb{I}_{S_t=s}, \text{ for all } s \in \mathcal{N}$$

where I denotes the indicator function.

Proof. We begin the proof with the following important identity:

$$\begin{aligned}
G_t^{(\lambda)} - V(S_t) &= -V(S_t) + (1-\lambda) \cdot \lambda^0 \cdot (R_{t+1} + \gamma \cdot V(S_{t+1})) \\
&\qquad + (1-\lambda) \cdot \lambda^1 \cdot (R_{t+1} + \gamma \cdot R_{t+2} + \gamma^2 \cdot V(S_{t+2})) \\
&\qquad + (1-\lambda) \cdot \lambda^2 \cdot (R_{t+1} + \gamma \cdot R_{t+2} + \gamma^2 \cdot R_{t+3} + \gamma^3 \cdot V(S_{t+3})) + \ldots \\
&= -V(S_t) + (\gamma\lambda)^0 \cdot (R_{t+1} + \gamma \cdot V(S_{t+1}) - \gamma\lambda \cdot V(S_{t+1})) \\
&\qquad + (\gamma\lambda)^1 \cdot (R_{t+2} + \gamma \cdot V(S_{t+2}) - \gamma\lambda \cdot V(S_{t+2})) \\
&\qquad + (\gamma\lambda)^2 \cdot (R_{t+3} + \gamma \cdot V(S_{t+3}) - \gamma\lambda \cdot V(S_{t+3})) + \ldots \\
&= (\gamma\lambda)^0 \cdot (R_{t+1} + \gamma \cdot V(S_{t+1}) - V(S_t)) \\
&\qquad + (\gamma\lambda)^1 \cdot (R_{t+2} + \gamma \cdot V(S_{t+2}) - V(S_{t+1})) \\
&\qquad + (\gamma\lambda)^2 \cdot (R_{t+3} + \gamma \cdot V(S_{t+3}) - V(S_{t+2})) + \ldots \\
&= \delta_t + \gamma\lambda \cdot \delta_{t+1} + (\gamma\lambda)^2 \cdot \delta_{t+2} + \ldots
\end{aligned} \qquad (9.19)$$

Now assume that a specific non-terminal state s appears at time steps $t_1, t_2, \ldots, t_n$. Then,

$$\begin{aligned}
\alpha \cdot \sum_{t=0}^{T-1} (G_t^{(\lambda)} - V(S_t)) \cdot \mathbb{I}_{S_t=s} &= \alpha \cdot \sum_{i=1}^{n} (G_{t_i}^{(\lambda)} - V(S_{t_i})) \\
&= \alpha \cdot \sum_{i=1}^{n} (\delta_{t_i} + \gamma\lambda \cdot \delta_{t_i+1} + (\gamma\lambda)^2 \cdot \delta_{t_i+2} + \ldots) \\
&= \sum_{t=0}^{T-1} \alpha \cdot \delta_t \cdot E_t(s)
\end{aligned}$$

If we set λ = 0 in this Tabular TD(λ) Prediction algorithm, we note that Et (s) reduces
to ISt =s and so, the Tabular TD(λ) prediction algorithm’s update for λ = 0 at each time
step t reduces to:

V (St ) ← V (St ) + α · δt
which is exactly the update of the Tabular TD Prediction algorithm. Therefore, TD
algorithms are often referred to as TD(0).
If we set λ = 1 in this Tabular TD(λ) Prediction algorithm with episodic traces (i.e., all
trace experiences terminating), Theorem 9.6.1 tells us that the sum of all changes in the
Value Function for any specific state s ∈ N over the course of the entire trace experience
(= $\sum_{t=0}^{T-1} \alpha \cdot \delta_t \cdot E_t(s)$) is equal to the change in the Value Function for s in the Every-Visit
MC Prediction algorithm as a result of its offline update for state s (= $\sum_{t=0}^{T-1} \alpha \cdot (G_t - V(S_t)) \cdot \mathbb{I}_{S_t=s}$).
Hence, TD(1) is considered to be "equivalent" to Every-Visit MC.
To clarify, TD(λ) Prediction is an online algorithm and hence, not exactly equivalent to
the offline λ-Return Prediction algorithm. However, if we modified the TD(λ) Prediction
algorithm to be offline, then they are equivalent. The offline version of TD(λ) Prediction

would not make the updates to the Value Function at each time step - rather, it would ac-
cumulate the changes to the Value Function (as prescribed by the TD(λ) update formula)
in a buffer, and then at the end of the trace experience, it would update the Value Function
with the contents of the buffer.
However, as explained earlier, online updates are desirable because the changes to the
Value Function at each time step can be immediately usable for the next time steps’ updates
and so, it promotes rapid learning without having to wait for a trace experience to end.
Moreover, online algorithms can be used in situations where we don’t have a complete
episode.
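
Before generalizing to function approximation, here is a minimal tabular sketch of the online update of Equation (9.18) (our own simplified representation: each episode is a sequence of (state, reward) pairs as in the given_data format used earlier, and all names are illustrative):

from typing import Dict, Iterable, Sequence, Tuple

def tabular_td_lambda_prediction(
    episodes: Iterable[Sequence[Tuple[str, float]]],  # each episode: [(S_0, R_1), (S_1, R_2), ...]
    gamma: float,
    lambd: float,
    alpha: float
) -> Dict[str, float]:
    vf: Dict[str, float] = {}
    for episode in episodes:
        el_tr: Dict[str, float] = {}  # eligibility traces E_t(s), reset at the start of each episode
        for t, (s, r) in enumerate(episode):
            vf.setdefault(s, 0.0)
            # value of the next state (0 if the episode terminates after this step)
            next_v = vf.setdefault(episode[t + 1][0], 0.0) if t < len(episode) - 1 else 0.0
            delta = r + gamma * next_v - vf[s]  # TD Error
            for state in el_tr:
                el_tr[state] *= gamma * lambd   # decay all traces by gamma * lambda
            el_tr[s] = el_tr.get(s, 0.0) + 1.0  # uptick of 1 for the state just visited
            for state, e in el_tr.items():      # update every state in proportion to its trace
                vf[state] += alpha * delta * e
    return vf

Running this with λ = 0 reduces to Tabular TD, while λ = 1 relates to Every-Visit MC in the sense of Theorem 9.6.1 (up to the online/offline distinction discussed above).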
With an understanding of Tabular TD(λ) Prediction in place, we can generalize TD(λ)
Prediction to the case of function approximation in a straightforward manner. In the case
of function approximation, the data type of eligibility traces will be the same data type
as that of the parameters w in the function approximation (so here we denote eligibility
traces at time t of a trace experience as simply Et rather than as a function of states as we
had done for the Tabular case above). We initialize E0 at the start of each trace experience
to ∇w V (S0 ; w). Then, for each time step t > 0, Et is calculated recursively in terms of the
previous time step’s value Et−1 , which is then used to update the parameters of the Value
Function approximation, as follows:

Et = γλ · Et−1 + ∇w V (St ; w)

∆w = α · (Rt+1 + γ · V (St+1 ; w) − V (St ; w)) · Et


The update to the parameters w can be expressed more succinctly as:

∆w = α · δt · Et
where δt now denotes the TD Error based on the function approximation for the Value
Function.
The idea of Eligibility Traces has its origins in a seminal book by Harry Klopf (Klopf
and Data Sciences Laboratory 1972) that greatly influenced Richard Sutton and Andrew
Barto to pursue the idea of Eligibility Traces further, after which they published several
papers on Eligibility Traces, much of whose content is covered in their RL book (Richard
S. Sutton and Barto 2018).

9.6.4. Implementation of the TD(λ) Prediction algorithm


You’d have observed that the TD(λ) update is not as simple as the MC and TD updates,
where we were able to use the FunctionApprox interface in a straightforward manner. For
TD(λ), it might appear that we can’t quite use the FunctionApprox interface and would
need to write custom code for its implementation. However, by noting that the FunctionApprox
method objective_gradient is quite generic and that FunctionApprox and Gradient support
the methods __add__ and __mul__ (vector space operations), we can in fact implement
TD(λ) Prediction in terms of the FunctionApprox interface.
The function td_lambda_prediction below takes as input an Iterable of trace experi-
ences (traces), an initial FunctionApprox (approx_0), and the γ and λ parameters. At the
start of each trace experience, we need to initialize the eligibility traces to 0. The data
type of the eligibility traces is the Gradient type and so we invoke the zero method for
Gradient(func_approx) in order to initialize the eligibility traces to 0. Then, at every time
step in every trace experience, we first set the predictor variable xt to be the state and the

response variable yt to be the TD target. Then we need to update the eligibility traces el_tr
and update the function approximation func_approx using the updated el_tr.
Thankfully, the __mul__ method of the Gradient class enables us to conveniently multiply
el_tr with γ · λ, and it also enables us to multiply the updated el_tr with the prediction
error $\mathbb{E}_M[y|x_t] - y_t = V(S_t; w) - (R_{t+1} + \gamma \cdot V(S_{t+1}; w))$ (in the code as func_approx(x)
- y), which is then used (as a Gradient type) to update the internal parameters of the
func_approx. The __add__ method of Gradient enables us to add ∇w V (St ; w) (as a Gradient
type) to el_tr * gamma * lambd. The only seemingly difficult part is calculating ∇w V (St ; w).
The FunctionApprox interface provides us with a method objective_gradient to calculate
the gradient of any specified objective (call it Obj(x, y)). But here we have to calculate
the gradient of the prediction of the function approximation. Thankfully, the interface of
objective_gradient is fairly generic and we actually have a choice of constructing Obj(x, y)
to be whatever function we want (not necessarily a minimizing Objective Function). We
specify Obj(x, y) in terms of the obj_deriv_out_func argument, which, as a reminder, represents
$\frac{\partial Obj(x,y)}{\partial Out(x)}$. Note that we have assumed a Gaussian distribution for the returns conditioned
on the state. So we can set Out(x) to be the function approximation's prediction
V(St; w) and we can set Obj(x, y) = Out(x), meaning obj_deriv_out_func ($\frac{\partial Obj(x,y)}{\partial Out(x)}$) is a
function returning the constant value of 1 (as seen in the code below).

from typing import Iterable, Iterator
import rl.markov_process as mp
import numpy as np
from rl.function_approx import Gradient
from rl.approximate_dynamic_programming import ValueFunctionApprox

def td_lambda_prediction(
    traces: Iterable[Iterable[mp.TransitionStep[S]]],
    approx_0: ValueFunctionApprox[S],
    gamma: float,
    lambd: float
) -> Iterator[ValueFunctionApprox[S]]:
    func_approx: ValueFunctionApprox[S] = approx_0
    yield func_approx

    for trace in traces:
        # eligibility traces initialized to 0 at the start of each trace experience
        el_tr: Gradient[ValueFunctionApprox[S]] = Gradient(func_approx).zero()
        for step in trace:
            x: NonTerminal[S] = step.state
            y: float = step.reward + gamma * \
                extended_vf(func_approx, step.next_state)
            # decay the traces by gamma * lambda and accumulate the gradient of the prediction
            el_tr = el_tr * (gamma * lambd) + func_approx.objective_gradient(
                xy_vals_seq=[(x, y)],
                obj_deriv_out_fun=lambda x1, y1: np.ones(len(x1))
            )
            # func_approx(x) - y is the negated TD Error, scaling the eligibility traces
            func_approx = func_approx.update_with_gradient(
                el_tr * (func_approx(x) - y)
            )
            yield func_approx

The above code is in the file rl/td_lambda.py.
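As an aside, the offline variant of TD(λ) Prediction described earlier (which accumulates the prescribed changes in a buffer and applies them only at the end of each trace experience) requires only a small modification of the loop above. The following is a minimal sketch of such a variant (our own illustration, not code from rl/td_lambda.py; the function name offline_td_lambda_prediction is hypothetical, and we assume the same Gradient, ValueFunctionApprox and extended_vf utilities used in td_lambda_prediction above):

from typing import Iterable, Iterator
import rl.markov_process as mp
import numpy as np
from rl.function_approx import Gradient
from rl.approximate_dynamic_programming import ValueFunctionApprox

def offline_td_lambda_prediction(
    traces: Iterable[Iterable[mp.TransitionStep[S]]],
    approx_0: ValueFunctionApprox[S],
    gamma: float,
    lambd: float
) -> Iterator[ValueFunctionApprox[S]]:
    func_approx: ValueFunctionApprox[S] = approx_0
    yield func_approx
    for trace in traces:
        el_tr: Gradient[ValueFunctionApprox[S]] = Gradient(func_approx).zero()
        # buffer accumulating the changes prescribed by the TD(lambda) update formula
        buffer: Gradient[ValueFunctionApprox[S]] = Gradient(func_approx).zero()
        for step in trace:
            x = step.state
            y = step.reward + gamma * extended_vf(func_approx, step.next_state)
            el_tr = el_tr * (gamma * lambd) + func_approx.objective_gradient(
                xy_vals_seq=[(x, y)],
                obj_deriv_out_fun=lambda x1, y1: np.ones(len(x1))
            )
            # accumulate rather than apply: func_approx is not updated inside the loop
            buffer = buffer + el_tr * (func_approx(x) - y)
        # apply all accumulated changes only at the end of the trace experience
        func_approx = func_approx.update_with_gradient(buffer)
        yield func_approx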


Let’s use the same instance si_mrp: SimpleInventoryMRPFinite that we had created above
when testing MC and TD Prediction. We use the same number of episodes (60000) we had
used when testing MC Prediction. Just like in the case of testing TD prediction, we set ini-
tial learning rate α = 0.03, half life H = 1000 and exponent β = 0.5. We set the episode
length (number of atomic experiences in a single trace experience) to be 100 (same as with
the settings we had for testing TD Prediction and consistent with MC Prediction as well).
We use the same discount factor γ = 0.9. Let’s set λ = 0.3.

import rl.iterate as iterate
import rl.td_lambda as td_lambda
import itertools
from pprint import pprint
from rl.chapter10.prediction_utils import fmrp_episodes_stream
from rl.function_approx import learning_rate_schedule

gamma: float = 0.9
episode_length: int = 100
initial_learning_rate: float = 0.03
half_life: float = 1000.0
exponent: float = 0.5
lambda_param = 0.3

episodes: Iterable[Iterable[TransitionStep[S]]] = \
    fmrp_episodes_stream(si_mrp)
curtailed_episodes: Iterable[Iterable[TransitionStep[S]]] = \
    (itertools.islice(episode, episode_length) for episode in episodes)
learning_rate_func: Callable[[int], float] = learning_rate_schedule(
    initial_learning_rate=initial_learning_rate,
    half_life=half_life,
    exponent=exponent
)
td_lambda_vfs: Iterator[ValueFunctionApprox[S]] = td_lambda.td_lambda_prediction(
    traces=curtailed_episodes,
    approx_0=Tabular(count_to_weight_func=learning_rate_func),
    gamma=gamma,
    lambd=lambda_param
)
num_episodes = 60000
final_td_lambda_vf: ValueFunctionApprox[S] = \
    iterate.last(itertools.islice(td_lambda_vfs, episode_length * num_episodes))
pprint({s: round(final_td_lambda_vf(s), 3) for s in si_mrp.non_terminal_states})

This prints the following:

{NonTerminal(state=InventoryState(on_hand=0, on_order=0)): -35.545,


NonTerminal(state=InventoryState(on_hand=0, on_order=1)): -27.97,
NonTerminal(state=InventoryState(on_hand=0, on_order=2)): -28.396,
NonTerminal(state=InventoryState(on_hand=1, on_order=0)): -28.943,
NonTerminal(state=InventoryState(on_hand=1, on_order=1)): -29.506,
NonTerminal(state=InventoryState(on_hand=2, on_order=0)): -30.339}

Thus, we see that our implementation of TD(λ) Prediction with the above settings fetches
us an estimated Value Function fairly close to the true Value Function. As ever, we encour-
age you to play with various settings for TD(λ) Prediction to develop an intuition for how
the results change as you change the settings, and particularly as you change the λ param-
eter. You can play with the code in the file rl/chapter10/simple_inventory_mrp.py.

9.7. Key Takeaways from this Chapter


• Bias-Variance tradeoff of TD versus MC.
• MC learns the statistical mean of the observed returns while TD learns something
“deeper” - it implicitly estimates an MRP from the observed data and produces the
Value Function of the implicitly-estimated MRP.
• Understanding TD versus MC versus DP from the perspectives of “bootstrapping”
and “experiencing” (Figure 9.5 provides a great view).

• “Equivalence” of λ-Return Prediction and TD(λ) Prediction, hence TD is equivalent
to TD(0) and MC is “equivalent” to TD(1).

10. Monte-Carlo and Temporal-Difference
for Control
In chapter 9, we covered MC and TD algorithms to solve the Prediction problem. In this
chapter, we cover MC and TD algorithms to solve the Control problem. As a reminder,
MC and TD algorithms are Reinforcement Learning algorithms that only have access to
an individual experience (at a time) of next state and reward when the AI agent performs
an action in a given state. The individual experience could be the result of an interaction
with an actual environment or could be served by a simulated environment (as explained
at the state of Chapter 9). It also pays to remind that RL algorithms overcome the Curse
of Dimensionality and the Curse of Modeling by incrementally updating (learning) an
appropriate function approximation of the Value Function from a stream of individual
experiences. Hence, large-scale Control problems that are typically seen in the real-world
are often tackled by RL.

10.1. Refresher on Generalized Policy Iteration (GPI)


We shall soon see that all RL Control algorithms are based on the fundamental idea of
Generalized Policy Iteration (introduced initially in Chapter 3), henceforth abbreviated as
GPI. The exact ways in which the GPI idea is utilized in RL algorithms differs from one
algorithm to another, and they differ significantly from how the GPI idea is utilized in DP
algorithms. So before we get into RL Control algorithms, it's important to ground ourselves in the
abstract concept of GPI. We now ask you to re-read Section 3.11 in Chapter 3.
To summarize, the key concept in GPI is that we can evaluate the Value Function for a
policy with any Policy Evaluation method, and we can improve a policy with any Policy
Improvement method (not necessarily the methods used in the classical Policy Iteration
DP algorithm). The word any does not simply mean alternative algorithms for Policy Eval-
uation and/or Policy Improvement - the word any also refers to the fact that we can do a
“partial” Policy Evaluation or a “partial” Policy Improvement. The word “partial” is used
quite generically here - any set of calculations that simply take us towards a complete Policy
Evaluation or towards a complete Policy Improvement qualify. This means GPI allows us
to switch from Policy Evaluation to Policy Improvements without doing a complete Pol-
icy Evaluation or complete Policy Improvement (for instance, we don’t have to take Policy
Evaluation calculations all the way to convergence). Figure 10.1 illustrates Generalized
Policy Iteration as the shorter-length arrows (versus the longer-length arrows seen in Fig-
ure 3.3 for the usual Policy Iteration algorithm). Note how these shorter-length arrows
don’t go all the way to either the “value function line” or the “policy line” but the shorter-
length arrows do go some part of the way towards the line they are meant to go towards
at that stage in the algorithm.
As has been our norm in the book so far, our approach to RL Control algorithms is to
first cover the simple case of Tabular RL Control algorithms to illustrate the core concepts
in a simple and intuitive manner. In many Tabular RL Control algorithms (especially Tab-
ular TD Control), GPI consists of the Policy Evaluation step for just a single state (versus

Figure 10.1.: Progression Lines of Value Function and Policy in Generalized Policy Itera-
tion (Image Credit: Coursera Course on Fundamentals of RL)

for all states in usual Policy Iteration) and the Policy Improvement step is also done for
just a single state. So essentially these RL Control algorithms are an alternating sequence
of single-state policy evaluation and single-state policy improvement (where the single-
state is the state produced by sampling or the state that is encountered in a real-world
environment interaction). Similar to the case of Prediction, we first cover Monte-Carlo
(MC) Control and then move on to Temporal-Difference (TD) Control.

10.2. GPI with Evaluation as Monte-Carlo


Let us think about how to do MC Control based on the GPI idea. The natural idea that
emerges is to do Policy Evaluation with MC (this is basically MC Prediction), followed by
greedy Policy Improvement, then MC Policy Evaluation with the improved policy, and so
on … . This seems like a reasonable idea, but there is a problem with doing greedy Policy
Improvement. The problem is that the Greedy Policy Improvement calculation (Equa-
tion 10.1) requires a model of the state transition probability function P and the reward
function R, which is not available in an RL interface.
$$\pi_D'(s) = \underset{a \in \mathcal{A}}{\arg\max} \{ \mathcal{R}(s, a) + \gamma \cdot \sum_{s' \in \mathcal{N}} \mathcal{P}(s, a, s') \cdot V^{\pi}(s') \} \text{ for all } s \in \mathcal{N} \tag{10.1}$$

However, we note that Equation 10.1 can be written more succinctly as:


$$\pi_D'(s) = \underset{a \in \mathcal{A}}{\arg\max} \, Q^{\pi}(s, a) \text{ for all } s \in \mathcal{N} \tag{10.2}$$

This view of Greedy Policy Improvement is valuable because instead of doing Policy
Evaluation for calculating V π (MC Prediction), we can instead do Policy Evaluation to cal-

culate Qπ (with MC Prediction for the Q-Value Function). With this modification to Policy
Evaluation, we can keep alternating between Policy Evaluation and Policy Improvement
until convergence to obtain the Optimal Value Function and Optimal Policy. Indeed, this
is a valid MC Control algorithm. However, this algorithm is not practical as each Policy
Evaluation (MC Prediction) typically takes very long to converge (as we have noted in
Chapter 9) and the number of iterations of Evaluation and Improvement until GPI con-
vergence will also be large. More importantly, this algorithm simply modifies the Policy
Iteration DP/ADP algorithm by replacing DP/ADP Policy Evaluation with MC Q-Value
Policy Evaluation - hence, we simply end up with a slower version of the Policy Iteration
DP/ADP algorithm. Instead, we seek an MC Control Algorithm that switches from Policy
Evaluation to Policy Improvement without requiring Policy Evaluation to converge (this
is essentially the GPI idea).
So the natural GPI idea here would be to do the usual MC Prediction updates (of the
Q-Value estimate) at the end of an episode, then improve the policy at the end of that
episode, then perform MC Prediction updates (with the improved policy) at the end of the
next episode, and so on … . Let’s see what this algorithm looks like. Equation 10.2 tells us
that all we need to perform the requisite greedy action (from the improved policy) at any
time step in any episode is an estimate of the Q-Value Function. For ease of understanding,
for now, let us just restrict ourselves to the case of Tabular Every-Visit MC Control with
equal weights for each of the Return data points obtained for any (state, action) pair. In
this case, we can simply perform the following two updates at the end of each episode for
each (St , At ) pair encountered in the episode (note that at each time step t, At is based on
the greedy policy derived from the current estimate of the Q-Value function):

$$Count(S_t, A_t) \leftarrow Count(S_t, A_t) + 1$$
$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \frac{1}{Count(S_t, A_t)} \cdot (G_t - Q(S_t, A_t)) \tag{10.3}$$
It’s important to note that Count(St , At ) is accumulated over the set of all episodes seen
thus far. Note that the estimate Q(St , At ) is not an estimate of the Q-Value Function for a
single policy - rather, it keeps updating as we encounter new greedy policies across the
set of episodes.
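To make the two updates in Equation (10.3) concrete, here is a minimal tabular sketch (our own illustration, using plain Python dictionaries rather than this book's Tabular class; the names q, counts and update_q_from_episode are hypothetical):

from typing import Dict, List, Tuple

def update_q_from_episode(
    q: Dict[Tuple[str, int], float],
    counts: Dict[Tuple[str, int], int],
    episode: List[Tuple[str, int, float]]  # (state, action, return G_t) triples
) -> None:
    # Every-Visit update: each occurrence of a (state, action) pair contributes its
    # trace experience return, with counts accumulated across all episodes seen so far
    for state, action, g in episode:
        counts[(state, action)] = counts.get((state, action), 0) + 1
        q_sa = q.get((state, action), 0.0)
        q[(state, action)] = q_sa + (g - q_sa) / counts[(state, action)]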
So is this now our first Tabular RL Control algorithm? Not quite - there is yet another
problem. This problem is more subtle and we illustrate the problem with a simple ex-
ample. Let’s consider a specific state (call it s) and assume that there are only two al-
lowable actions a1 and a2 for state s. Let’s say the true Q-Value Function for state s is:
Qtrue (s, a1 ) = 2, Qtrue (s, a2 ) = 5. Let’s say we initialize the Q-Value Function estimate as:
Q(s, a1 ) = Q(s, a2 ) = 0. When we encounter state s for the first time, the action to be taken
is arbitrary between a1 and a2 since they both have the same Q-Value estimate (meaning
both a1 and a2 yield the same max value for Q(s, a) among the two choices for a). Let’s
say we arbitrarily pick a1 as the action choice and let’s say for this first encounter of state s
(with the arbitrarily picked action a1 ), the return obtained is 3. So Q(s, a1 ) updates to the
value 3. So when the state s is encountered for the second time, we see that Q(s, a1 ) = 3
and Q(s, a2 ) = 0 and so, action a1 will be taken according to the greedy policy implied
by the estimate of the Q-Value Function. Let’s say we now obtain a return of -1, updating
Q(s, a1) to $\frac{3-1}{2} = 1$. When s is encountered for the third time, yet again action a1 will
be taken according to the greedy policy implied by the estimate of the Q-Value Function.
Let's say we now obtain a return of 2, updating Q(s, a1) to $\frac{3-1+2}{3} = \frac{4}{3}$. We see that as long
as the returns associated with a1 are not negative enough to make the estimate Q(s, a1 )

negative, a2 is “locked out” by a1 because the first few occurrences of a1 happen to yield
an average return greater than the initialization of Q(s, a2 ). Even if a2 was chosen, it is
possible that the first few occurrences of a2 yield an average return smaller than the av-
erage return obtained on the first few occurrences of a1 , in which case a2 could still get
locked-out prematurely.
This problem goes beyond MC Control and applies to the broader problem of RL Con-
trol - updates can get biased by initial random occurrences of returns (or return estimates),
which in turn could prevent certain actions from being sufficiently chosen (thus, disal-
lowing accurate estimates of the Q-Values for those actions). While we do want to ex-
ploit actions that seem to be fetching higher returns, we also want to adequately explore all
possible actions so we can obtain an accurate-enough estimate of their Q-Values. This is
essentially the Explore-Exploit dilemma of the famous Multi-Armed Bandit Problem. In
Chapter 13, we will cover the Multi-Armed Bandit problem in detail, along with a variety
of techniques to solve the Multi-Armed Bandit problem (which are essentially creative
ways of resolving the Explore-Exploit dilemma). We will see in Chapter 13 that a sim-
ple way of resolving the Explore-Exploit dilemma is with a method known as ϵ-greedy,
which essentially means we must be greedy (“exploit”) a certain (1 − ϵ) fraction of the
time and for the remaining (ϵ) fraction of the time, we explore all possible actions. The
term “certain fraction of the time” refers to probabilities of choosing actions, which means
an ϵ-greedy policy (generated from a Q-Value Function estimate) will be a stochastic pol-
icy. For the sake of simplicity, in this book, we will employ the ϵ-greedy method to resolve
the Explore-Exploit dilemma in all RL Control algorithms involving the Explore-Exploit
dilemma (although you must understand that we can replace the ϵ-greedy method by the
other methods we shall cover in Chapter 13 in any of the RL Control algorithms where we
run into the Explore-Exploit dilemma). So we need to tweak the Tabular MC Control al-
gorithm described above to perform Policy Improvement with the ϵ-greedy method. The
formal definition of the ϵ-greedy stochastic policy π ′ (obtained from the current estimate
of the Q-Value Function) for a Finite MDP (since we are focused on Tabular RL Control)
is as follows:

$$\text{Improved Stochastic Policy } \pi'(s, a) = \begin{cases} \frac{\epsilon}{|\mathcal{A}|} + 1 - \epsilon & \text{if } a = \underset{b \in \mathcal{A}}{\arg\max} \, Q(s, b) \\ \frac{\epsilon}{|\mathcal{A}|} & \text{otherwise} \end{cases}$$

where A denotes the set of allowable actions and ϵ ∈ [0, 1] is the specification of the
degree of exploration.
This says that with probability 1 − ϵ, we select the action that maximizes the Q-Value
Function estimate for a given state, and with probability ϵ, we uniform-randomly select
each of the allowable actions (including the maximizing action). Hence, the maximizing
action is chosen with probability $\frac{\epsilon}{|\mathcal{A}|} + 1 - \epsilon$. Note that if ϵ is zero, π′ reduces to the
deterministic greedy policy $\pi_D'$ that we had defined earlier. So the greedy policy can be
considered to be a special case of ϵ-greedy policy with ϵ = 0.
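As a quick numerical illustration (our own example): if |A| = 4 and ϵ = 0.2, then the maximizing action is chosen with probability $\frac{0.2}{4} + 0.8 = 0.85$ and each of the other three actions is chosen with probability $\frac{0.2}{4} = 0.05$, so the probabilities sum to $0.85 + 3 \times 0.05 = 1$.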


But we haven’t yet actually proved that an ϵ-greedy policy is indeed an improved policy.
We do this in the theorem below. Note that in the following theorem’s proof, we re-use the
notation and inductive-proof approach used in the Policy Improvement Theorem (Theo-
rem 3.6.1) in Chapter 3. So it would be a good idea to re-read the proof of Theorem 3.6.1
in Chapter 3 before reading the following theorem’s proof.

Theorem 10.2.1. For a Finite MDP, if π is a policy such that for all $s \in \mathcal{N}$, $\pi(s, a) \geq \frac{\epsilon}{|\mathcal{A}|}$ for all
$a \in \mathcal{A}$, then the ϵ-greedy policy π′ obtained from $Q^{\pi}$ is an improvement over π, i.e., $V^{\pi'}(s) \geq V^{\pi}(s)$ for all $s \in \mathcal{N}$.
Proof. We've previously learnt that for any policy π′, if we apply the Bellman Policy Operator $B^{\pi'}$ repeatedly (starting with $V^{\pi}$), we converge to $V^{\pi'}$. In other words,

$$\lim_{i \to \infty} (B^{\pi'})^i(V^{\pi}) = V^{\pi'}$$

So the proof is complete if we prove that:

$$(B^{\pi'})^{i+1}(V^{\pi}) \geq (B^{\pi'})^i(V^{\pi}) \text{ for all } i = 0, 1, 2, \ldots$$

In plain English, this says we need to prove that repeated application of $B^{\pi'}$ produces a
non-decreasing sequence of Value Functions $[(B^{\pi'})^i(V^{\pi}) \mid i = 0, 1, 2, \ldots]$.
We prove this by induction. The base case of the proof by induction is to show that:

$$B^{\pi'}(V^{\pi}) \geq V^{\pi}$$

$$
\begin{aligned}
B^{\pi'}(V^{\pi})(s) &= (\mathcal{R}^{\pi'} + \gamma \cdot \mathcal{P}^{\pi'} \cdot V^{\pi})(s) \\
&= \mathcal{R}^{\pi'}(s) + \gamma \cdot \sum_{s' \in \mathcal{N}} \mathcal{P}^{\pi'}(s, s') \cdot V^{\pi}(s') \\
&= \sum_{a \in \mathcal{A}} \pi'(s, a) \cdot (\mathcal{R}(s, a) + \gamma \cdot \sum_{s' \in \mathcal{N}} \mathcal{P}(s, a, s') \cdot V^{\pi}(s')) \\
&= \sum_{a \in \mathcal{A}} \pi'(s, a) \cdot Q^{\pi}(s, a) \\
&= \sum_{a \in \mathcal{A}} \frac{\epsilon}{|\mathcal{A}|} \cdot Q^{\pi}(s, a) + (1 - \epsilon) \cdot \max_{a \in \mathcal{A}} Q^{\pi}(s, a) \\
&\geq \sum_{a \in \mathcal{A}} \frac{\epsilon}{|\mathcal{A}|} \cdot Q^{\pi}(s, a) + (1 - \epsilon) \cdot \sum_{a \in \mathcal{A}} \frac{\pi(s, a) - \frac{\epsilon}{|\mathcal{A}|}}{1 - \epsilon} \cdot Q^{\pi}(s, a) \\
&= \sum_{a \in \mathcal{A}} \pi(s, a) \cdot Q^{\pi}(s, a) \\
&= V^{\pi}(s) \text{ for all } s \in \mathcal{N}
\end{aligned}
$$

The line with the inequality above is due to the fact that for any fixed $s \in \mathcal{N}$, $\max_{a \in \mathcal{A}} Q^{\pi}(s, a) \geq \sum_{a \in \mathcal{A}} w_a \cdot Q^{\pi}(s, a)$ (the maximum Q-Value is greater than or equal to a weighted average of all
Q-Values, for a given state), with the weights $w_a = \frac{\pi(s, a) - \frac{\epsilon}{|\mathcal{A}|}}{1 - \epsilon}$ such that $\sum_{a \in \mathcal{A}} w_a = 1$ and
$0 \leq w_a \leq 1$ for all $a \in \mathcal{A}$.
This completes the base case of the proof by induction.
The induction step is easy and is proved as a consequence of the monotonicity property
of the B π operator (for any π), which is defined as follows:

$$\text{Monotonicity Property of } B^{\pi}: \quad X \geq Y \Rightarrow B^{\pi}(X) \geq B^{\pi}(Y)$$

Note that we proved the monotonicity property of the $B^{\pi}$ operator in the chapter on Dynamic Programming. A straightforward application of this monotonicity property provides the induction step of the proof:

$$(B^{\pi'})^{i+1}(V^{\pi}) \geq (B^{\pi'})^i(V^{\pi}) \Rightarrow (B^{\pi'})^{i+2}(V^{\pi}) \geq (B^{\pi'})^{i+1}(V^{\pi}) \text{ for all } i = 0, 1, 2, \ldots$$

This completes the proof.

We note that for any ϵ-greedy policy π, we do ensure the condition that for all $s \in \mathcal{N}$,
$\pi(s, a) \geq \frac{\epsilon}{|\mathcal{A}|}$ for all $a \in \mathcal{A}$. So we just need to ensure that this condition holds true for
the initial choice of π (in the GPI with MC algorithm). An easy way to ensure this is to
choose the initial π to be a uniform choice over actions (for each state), i.e., for all $s \in \mathcal{N}$,
$\pi(s, a) = \frac{1}{|\mathcal{A}|}$ for all $a \in \mathcal{A}$.

10.3. GLIE Monte-Carlo Control


So to summarize, we’ve resolved two problems - firstly, we replaced the state-value func-
tion estimate with the action-value function estimate and secondly, we replaced greedy
policy improvement with ϵ-greedy policy improvement. So our MC Control algorithm
does GPI as follows:

• Do Policy Evaluation with the Q-Value Function with Q-Value updates at the end of
each episode.
• Do Policy Improvement with an ϵ-greedy Policy (readily obtained from the Q-Value
Function estimate at any time step for any episode).

So now we are ready to develop the details of the Monte-Carlo Control algorithm that we've
been seeking. For ease of understanding, we first cover the Tabular version and then
we will implement the generalized version with function approximation. Note that an
ϵ-greedy policy enables adequate exploration of actions, but we will also need to do ade-
quate exploration of states in order to achieve a suitable estimate of the Q-Value Function.
Moreover, as our Control algorithm proceeds and the Q-Value Function estimate gets bet-
ter and better, we reduce the amount of exploration and eventually (as the number of
episodes tends to infinity), we want to have ϵ (the degree of exploration) tend to zero. In fact,
this behavior has a catchy acronym associated with it, which we define below:
Definition 10.3.1. We refer to Greedy In The Limit with Infinite Exploration (abbreviated as
GLIE) as the behavior that has the following two properties:

1. All state-action pairs are explored infinitely many times, i.e., for all s ∈ N, for all a ∈ A, and with $Count_k(s, a)$ denoting the number of occurrences of the (s, a) pair after k episodes:
$$\lim_{k \to \infty} Count_k(s, a) = \infty$$

2. The policy converges to a greedy policy, i.e., for all s ∈ N, for all a ∈ A, and with $\pi_k(s, a)$ denoting the ϵ-greedy policy obtained from the Q-Value Function estimate after k episodes:
$$\lim_{k \to \infty} \pi_k(s, a) = \mathbb{I}_{a = \arg\max_{b \in \mathcal{A}} Q(s, b)}$$

A simple way by which our method of using the ϵ-greedy policy (for policy improve-
ment) can be made GLIE is by reducing ϵ as a function of number of episodes k as follows:
$$\epsilon_k = \frac{1}{k}$$
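In the code later in this chapter, this exploration schedule is supplied as a function of the episode count k (the epsilon_as_func_of_episodes argument), so the $\frac{1}{k}$ schedule above would simply be written as, for instance:

epsilon_as_func_of_episodes = lambda k: 1.0 / k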
So now we are ready to describe the Tabular MC Control algorithm we’ve been seeking.
We ensure that this algorithm has GLIE behavior and so, we refer to it as GLIE Tabular
Monte-Carlo Control. The following is the outline of the procedure for each episode (ter-
minating trace experience) in the algorithm:

• Generate the trace experience (episode) with actions sampled from the ϵ-greedy pol-
icy π obtained from the estimate of the Q-Value Function that is available at the start
of the trace experience. Also, sample the first state of the trace experience from a
uniform distribution of states in N . This ensures infinite exploration of both states
and actions. Let’s denote the contents of this trace experience as:

S0 , A 0 , R 1 , S 1 , A 1 , . . . , R T , S T

and define the trace experience return Gt associated with (St , At ) as:

$$G_t = \sum_{i=t+1}^{T} \gamma^{i-t-1} \cdot R_i = R_{t+1} + \gamma \cdot R_{t+2} + \gamma^2 \cdot R_{t+3} + \ldots + \gamma^{T-t-1} \cdot R_T$$

• For each state St and action At in the trace experience, perform the following updates
at the end of the trace experience:

$$Count(S_t, A_t) \leftarrow Count(S_t, A_t) + 1$$
$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \frac{1}{Count(S_t, A_t)} \cdot (G_t - Q(S_t, A_t))$$
• Let’s say this trace experience is the k-th trace experience in the sequence of trace
experiences. Then, at the end of the trace experience, set:
$$\epsilon \leftarrow \frac{1}{k}$$
We state the following important theorem without proof.
Theorem 10.3.1. The above-described GLIE Tabular Monte-Carlo Control algorithm converges to
the Optimal Action-Value function: Q(s, a) → Q∗ (s, a) for all s ∈ N , for all a ∈ A. Hence, GLIE
Tabular Monte-Carlo Control converges to an Optimal (Deterministic) Policy π ∗ .
The extension from Tabular to Function Approximation of the Q-Value Function is straight-
forward. The update (change) in the parameters w of the Q-Value Function Approxima-
tion Q(s, a; w) is as follows:

∆w = α · (Gt − Q(St , At ; w)) · ∇w Q(St , At ; w) (10.4)


where α is the learning rate in the stochastic gradient descent and Gt is the trace expe-
rience return from state St upon taking action At at time t on a trace experience.
Now let us write some code to implement the above description of GLIE Monte-Carlo
Control, generalized to handle Function Approximation of the Q-Value Function. As you
shall see in the code below, there are a couple of other generalizations from the algorithm
outline described above. Let us start by understanding the various arguments to the below
function glie_mc_control.

• mdp: MarkovDecisionProcess[S, A] - This represents the interface to an abstract Markov


Decision Process. Note that this interface doesn’t provide any access to the transi-
tion probabilities or reward function. The core functionality available through this
interface are the two @abstractmethods step and actions. The step method only al-
lows us to access an individual experience of the next state and reward pair given
the current state and action (since it returns an abstract Distribution object). The
actions method gives us the allowable actions for a given state.

• states: NTStateDistribution[S] - This represents an arbitrary distribution of the
non-terminal states, which in turn allows us to sample the starting state (from this
distribution) for each trace experience.
• approx_0: QValueFunctionApprox[S, A] - This represents the initial function approx-
imation of the Q-Value function (that is meant to be updated, in an immutable man-
ner, through the course of the algorithm).
• gamma: float - This represents the discount factor to be used in estimating the Q-
Value Function.
• epsilon_as_func_of_episodes: Callable[[int], float] - This represents the extent
of exploration (ϵ) as a function of the number of trace experiences done so far (al-
lowing us to generalize from our default choice of ϵ(k) = k1 ).
• episode_length_tolerance: float - This represents the tolerance that determines
the trace experience length T (the minimum T such that γ T < tolerance).

glie_mc_control produces a generator (Iterator) of Q-Value Function estimates at the


end of each trace experience. The code is fairly self-explanatory. The method simulate_actions
of mdp: MarkovDecisionProcess creates a single sampling trace (i.e., a trace experience).
At the end of each trace experience, the update method of FunctionApprox updates the Q-
Value Function (creates a new Q-Value Function without mutating the current Q-Value
Function) using each of the returns (and associated state-actions pairs) from the trace ex-
perience. The ϵ-greedy policy is derived from the Q-Value Function estimate by using the
function epsilon_greedy_policy that is shown below and is quite self-explanatory.

from rl.markov_decision_process import epsilon_greedy_policy, TransitionStep
from rl.approximate_dynamic_programming import QValueFunctionApprox
from rl.approximate_dynamic_programming import NTStateDistribution

def glie_mc_control(
    mdp: MarkovDecisionProcess[S, A],
    states: NTStateDistribution[S],
    approx_0: QValueFunctionApprox[S, A],
    gamma: float,
    epsilon_as_func_of_episodes: Callable[[int], float],
    episode_length_tolerance: float = 1e-6
) -> Iterator[QValueFunctionApprox[S, A]]:
    q: QValueFunctionApprox[S, A] = approx_0
    p: Policy[S, A] = epsilon_greedy_policy(q, mdp, 1.0)
    yield q

    num_episodes: int = 0
    while True:
        trace: Iterable[TransitionStep[S, A]] = \
            mdp.simulate_actions(states, p)
        num_episodes += 1
        for step in returns(trace, gamma, episode_length_tolerance):
            q = q.update([((step.state, step.action), step.return_)])
        p = epsilon_greedy_policy(
            q,
            mdp,
            epsilon_as_func_of_episodes(num_episodes)
        )
        yield q

The implementation of epsilon_greedy_policy is as follows:

from rl.policy import DeterministicPolicy, Policy, RandomPolicy

def greedy_policy_from_qvf(
    q: QValueFunctionApprox[S, A],
    actions: Callable[[NonTerminal[S]], Iterable[A]]
) -> DeterministicPolicy[S, A]:
    def optimal_action(s: S) -> A:
        _, a = q.argmax((NonTerminal(s), a) for a in actions(NonTerminal(s)))
        return a
    return DeterministicPolicy(optimal_action)

def epsilon_greedy_policy(
    q: QValueFunctionApprox[S, A],
    mdp: MarkovDecisionProcess[S, A],
    epsilon: float = 0.0
) -> Policy[S, A]:
    def explore(s: S, mdp=mdp) -> Iterable[A]:
        return mdp.actions(NonTerminal(s))
    return RandomPolicy(Categorical(
        {UniformPolicy(explore): epsilon,
         greedy_policy_from_qvf(q, mdp.actions): 1 - epsilon}
    ))

The above code is in the file rl/monte_carlo.py.


Note that epsilon_greedy_policy returns an instance of the class RandomPolicy. RandomPolicy
creates a policy that randomly selects one of several specified policies (in this case, we need
to select between the greedy policy of type DeterministicPolicy and the UniformPolicy).
The implementation of RandomPolicy is shown below and you can find its code in the file
rl/policy.py.

@dataclass(frozen=True)
class RandomPolicy(Policy[S, A]):
    policy_choices: Distribution[Policy[S, A]]

    def act(self, state: NonTerminal[S]) -> Distribution[A]:
        policy: Policy[S, A] = self.policy_choices.sample()
        return policy.act(state)

Now let us test glie_mc_control on the simple inventory MDP we wrote in Chapter 2.

from rl.chapter3.simple_inventory_mdp_cap import SimpleInventoryMDPCap

capacity: int = 2
poisson_lambda: float = 1.0
holding_cost: float = 1.0
stockout_cost: float = 10.0
gamma: float = 0.9

si_mdp: SimpleInventoryMDPCap = SimpleInventoryMDPCap(
    capacity=capacity,
    poisson_lambda=poisson_lambda,
    holding_cost=holding_cost,
    stockout_cost=stockout_cost
)

First let’s run Value Iteration so we can determine the true Optimal Value Function and
Optimal Policy

from rl.dynamic_programming import value_iteration_result

true_opt_vf, true_opt_policy = value_iteration_result(si_mdp, gamma=gamma)
print("True Optimal Value Function")
pprint(true_opt_vf)
print("True Optimal Policy")
print(true_opt_policy)

This prints:

True Optimal Value Function
{NonTerminal(state=InventoryState(on_hand=0, on_order=0)): -34.894855194671294,
NonTerminal(state=InventoryState(on_hand=0, on_order=1)): -27.66095964467877,
NonTerminal(state=InventoryState(on_hand=0, on_order=2)): -27.99189950444479,
NonTerminal(state=InventoryState(on_hand=1, on_order=0)): -28.66095964467877,
NonTerminal(state=InventoryState(on_hand=1, on_order=1)): -28.99189950444479,
NonTerminal(state=InventoryState(on_hand=2, on_order=0)): -29.991899504444792}
True Optimal Policy
For State InventoryState(on_hand=0, on_order=0): Do Action 1
For State InventoryState(on_hand=0, on_order=1): Do Action 1
For State InventoryState(on_hand=0, on_order=2): Do Action 0
For State InventoryState(on_hand=1, on_order=0): Do Action 1
For State InventoryState(on_hand=1, on_order=1): Do Action 0
For State InventoryState(on_hand=2, on_order=0): Do Action 0

Now let’s run GLIE MC Control with the following parameters:

from rl.function_approx import Tabular
from rl.distribution import Choose
from rl.chapter3.simple_inventory_mdp_cap import InventoryState

episode_length_tolerance: float = 1e-5
epsilon_as_func_of_episodes: Callable[[int], float] = lambda k: k ** -0.5
initial_learning_rate: float = 0.1
half_life: float = 10000.0
exponent: float = 1.0

initial_qvf_dict: Mapping[Tuple[NonTerminal[InventoryState], int], float] = {
    (s, a): 0. for s in si_mdp.non_terminal_states for a in si_mdp.actions(s)
}
learning_rate_func: Callable[[int], float] = learning_rate_schedule(
    initial_learning_rate=initial_learning_rate,
    half_life=half_life,
    exponent=exponent
)
qvfs: Iterator[QValueFunctionApprox[InventoryState, int]] = glie_mc_control(
    mdp=si_mdp,
    states=Choose(si_mdp.non_terminal_states),
    approx_0=Tabular(
        values_map=initial_qvf_dict,
        count_to_weight_func=learning_rate_func
    ),
    gamma=gamma,
    epsilon_as_func_of_episodes=epsilon_as_func_of_episodes,
    episode_length_tolerance=episode_length_tolerance
)

Now let’s fetch the final estimate of the Optimal Q-Value Function after num_episodes
have run, and extract from it the estimate of the Optimal State-Value Function and the
Optimal Policy.

from rl.distribution import Constant
from rl.dynamic_programming import V
import itertools
import rl.iterate as iterate

num_episodes = 10000
final_qvf: QValueFunctionApprox[InventoryState, int] = \
    iterate.last(itertools.islice(qvfs, num_episodes))

def get_vf_and_policy_from_qvf(
    mdp: FiniteMarkovDecisionProcess[S, A],
    qvf: QValueFunctionApprox[S, A]
) -> Tuple[V[S], FiniteDeterministicPolicy[S, A]]:
    opt_vf: V[S] = {
        s: max(qvf((s, a)) for a in mdp.actions(s))
        for s in mdp.non_terminal_states
    }
    opt_policy: FiniteDeterministicPolicy[S, A] = \
        FiniteDeterministicPolicy({
            s.state: qvf.argmax((s, a) for a in mdp.actions(s))[1]
            for s in mdp.non_terminal_states
        })
    return opt_vf, opt_policy

opt_vf, opt_policy = get_vf_and_policy_from_qvf(
    mdp=si_mdp,
    qvf=final_qvf
)
print(f"GLIE MC Optimal Value Function with {num_episodes:d} episodes")
pprint(opt_vf)
print(f"GLIE MC Optimal Policy with {num_episodes:d} episodes")
print(opt_policy)

This prints:

GLIE MC Optimal Value Function with 10000 episodes


{NonTerminal(state=InventoryState(on_hand=0, on_order=0)): -34.76212336633032,
NonTerminal(state=InventoryState(on_hand=0, on_order=1)): -27.90668364332291,
NonTerminal(state=InventoryState(on_hand=0, on_order=2)): -28.306190508518398,
NonTerminal(state=InventoryState(on_hand=1, on_order=0)): -28.548284937363526,
NonTerminal(state=InventoryState(on_hand=1, on_order=1)): -28.864409885059185,
NonTerminal(state=InventoryState(on_hand=2, on_order=0)): -30.23156422557605}
GLIE MC Optimal Policy with 10000 episodes
For State InventoryState(on_hand=0, on_order=0): Do Action 1
For State InventoryState(on_hand=0, on_order=1): Do Action 1
For State InventoryState(on_hand=0, on_order=2): Do Action 0
For State InventoryState(on_hand=1, on_order=0): Do Action 1
For State InventoryState(on_hand=1, on_order=1): Do Action 0
For State InventoryState(on_hand=2, on_order=0): Do Action 0

We see that this reasonably converges to the true Value Function (and reaches the true
Optimal Policy) as produced by Value Iteration.
The code above is in the file rl/chapter11/simple_inventory_mdp_cap.py. Also see the
helper functions in rl/chapter11/control_utils.py which you can use to run your own ex-
periments and tests for RL Control algorithms.

10.4. SARSA
Just like in the case of RL Prediction, the natural idea is to replace MC Control with TD
Control using the TD Target Rt+1 +γ ·Q(St+1 , At+1 ; w) as a biased estimate of Gt when up-
dating Q(St , At ; w). This means the parameters update in Equation (10.4) gets modified
to the following parameters update:

∆w = α · (Rt+1 + γ · Q(St+1 , At+1 ; w) − Q(St , At ; w)) · ∇w Q(St , At ; w) (10.5)

Unlike MC Control where updates are made at the end of each trace experience (i.e.,
episode), a TD control algorithm can update at the end of each atomic experience. This
means the Q-Value Function Approximation is updated after each atomic experience (con-
tinuous learning), which in turn means that the ϵ-greedy policy will be (automatically) up-
dated at the end of each atomic experience. At each time step t in a trace experience, the
current ϵ-greedy policy is used to sample At from St and is also used to sample At+1 from
St+1 . Note that in MC Control, the same ϵ-greedy policy is used to sample all the actions
from their corresponding states in the trace experience, and so in MC Control, we were
able to generate the entire trace experience with the currently available ϵ-greedy policy.
However, here in TD Control, we need to generate a trace experience incrementally since
the action to be taken from a state depends on the just-updated ϵ-greedy policy (that is
derived from the just-updated Q-Value Function).
Just like in the case of RL Prediction, the disadvantage of the TD Target being a biased
estimate of the return is compensated by a reduction in the variance of the return esti-
mate. Also, TD Control offers a better speed of convergence (as we shall soon illustrate).
Most importantly, TD Control offers the ability to use in situations where we have incom-
plete trace experiences (happens often in real-world situations where experiments gets
curtailed/disrupted) and also, we can use it in situations where we never reach a terminal
state (continuing trace).
Note that Equation (10.5) has the entities

• State St
• Action At
• Reward Rt+1
• State St+1
• Action At+1

which prompted this TD Control algorithm to be named SARSA (for State-Action-Reward-


State-Action). Following our convention from Chapter 2, we depict the SARSA algorithm
in Figure 10.2 with states as elliptical-shaped nodes, actions as rectangular-shaped nodes,
and the edges as samples from transition probability distribution and ϵ-greedy policy dis-
tribution.
Now let us write some code to implement the above-described SARSA algorithm. Let
us start by understanding the various arguments to the below function glie_sarsa.

• mdp: MarkovDecisionProcess[S, A] - This represents the interface to an abstract Markov


Decision Process. We want to remind that this interface doesn’t provide any access
to the transition probabilities or reward function. The core functionality available
through this interface are the two @abstractmethods step and actions. The step
method only allows us to access a sample of the next state and reward pair given
the current state and action (since it returns an abstract Distribution object). The
actions method gives us the allowable actions for a given state.
• states: NTStateDistribution[S] - This represents an arbitrary distribution of the
non-terminal states, which in turn allows us to sample the starting state (from this
distribution) for each trace experience.
• approx_0: QValueFunctionApprox[S, A] - This represents the initial function approx-
imation of the Q-Value function (that is meant to be updated, after each atomic ex-
perience, in an immutable manner, through the course of the algorithm).
• gamma: float - This represents the discount factor to be used in estimating the Q-
Value Function.

Figure 10.2.: Visualization of SARSA Algorithm

• epsilon_as_func_of_episodes: Callable[[int], float] - This represents the extent
of exploration (ϵ) as a function of the number of episodes.
• max_episode_length: int - This represents the number of time steps at which we
would curtail a trace experience and start a new one. As we’ve explained, TD Con-
trol doesn’t require complete trace experiences, and so we can do as little or as large
a number of time steps in a trace experience (max_episode_length gives us that con-
trol).

glie_sarsa produces a generator (Iterator) of Q-Value Function estimates at the end


of each atomic experience. The while True loops over trace experiences. The inner while
loops over time steps - each of these steps involves the following:

• Given the current state and action, we obtain a sample of the pair of next_state and
reward (using the sample method of the Distribution obtained from mdp.step(state,
action)).
• Obtain the next_action from next_state using the function epsilon_greedy_action
which utilizes the ϵ-greedy policy derived from the current Q-Value Function esti-
mate (referenced by q).
• Update the Q-Value Function based on Equation (10.5) (using the update method
of q: QValueFunctionApprox[S, A]). Note that this is an immutable update since we
produce an Iterable (generator) of the Q-Value Function estimate after each time
step.

Before the code for glie_sarsa, let’s understand the code for epsilon_greedy_action
which returns an action sampled from the ϵ-greedy policy probability distribution that
is derived from the Q-Value Function estimate, given as input a non-terminal state, a Q-
Value Function estimate, the set of allowable actions, and ϵ.

from operator import itemgetter
from rl.distribution import Categorical

def epsilon_greedy_action(
    q: QValueFunctionApprox[S, A],
    nt_state: NonTerminal[S],
    actions: Set[A],
    epsilon: float
) -> A:
    greedy_action: A = max(
        ((a, q((nt_state, a))) for a in actions),
        key=itemgetter(1)
    )[0]
    return Categorical(
        {a: epsilon / len(actions) +
         (1 - epsilon if a == greedy_action else 0.) for a in actions}
    ).sample()

def glie_sarsa(
    mdp: MarkovDecisionProcess[S, A],
    states: NTStateDistribution[S],
    approx_0: QValueFunctionApprox[S, A],
    gamma: float,
    epsilon_as_func_of_episodes: Callable[[int], float],
    max_episode_length: int
) -> Iterator[QValueFunctionApprox[S, A]]:
    q: QValueFunctionApprox[S, A] = approx_0
    yield q

    num_episodes: int = 0
    while True:
        num_episodes += 1
        epsilon: float = epsilon_as_func_of_episodes(num_episodes)
        state: NonTerminal[S] = states.sample()
        action: A = epsilon_greedy_action(
            q=q,
            nt_state=state,
            actions=set(mdp.actions(state)),
            epsilon=epsilon
        )
        steps: int = 0
        while isinstance(state, NonTerminal) and steps < max_episode_length:
            next_state, reward = mdp.step(state, action).sample()
            if isinstance(next_state, NonTerminal):
                next_action: A = epsilon_greedy_action(
                    q=q,
                    nt_state=next_state,
                    actions=set(mdp.actions(next_state)),
                    epsilon=epsilon
                )
                q = q.update([(
                    (state, action),
                    reward + gamma * q((next_state, next_action))
                )])
                action = next_action
            else:
                q = q.update([((state, action), reward)])
            yield q
            steps += 1
            state = next_state

The above code is in the file rl/td.py.


Let us test this on the simple inventory MDP we tested GLIE MC Control on (we use
the same si_mdp: SimpleInventoryMDPCap object and the same parameter values that were
set up earlier when testing GLIE MC Control).
from rl.chapter3.simple_inventory_mdp_cap import InventoryState

max_episode_length: int = 100
epsilon_as_func_of_episodes: Callable[[int], float] = lambda k: k ** -0.5
initial_learning_rate: float = 0.1
half_life: float = 10000.0
exponent: float = 1.0
gamma: float = 0.9

initial_qvf_dict: Mapping[Tuple[NonTerminal[InventoryState], int], float] = {
    (s, a): 0. for s in si_mdp.non_terminal_states for a in si_mdp.actions(s)
}
learning_rate_func: Callable[[int], float] = learning_rate_schedule(
    initial_learning_rate=initial_learning_rate,
    half_life=half_life,
    exponent=exponent
)
qvfs: Iterator[QValueFunctionApprox[InventoryState, int]] = glie_sarsa(
    mdp=si_mdp,
    states=Choose(si_mdp.non_terminal_states),
    approx_0=Tabular(
        values_map=initial_qvf_dict,
        count_to_weight_func=learning_rate_func
    ),
    gamma=gamma,
    epsilon_as_func_of_episodes=epsilon_as_func_of_episodes,
    max_episode_length=max_episode_length
)

Now let’s fetch the final estimate of the Optimal Q-Value Function after num_episodes *

max_episode_length updates of the Q-Value Function, and extract from it the estimate of
the Optimal State-Value Function and the Optimal Policy (using the function get_vf_and_policy_from_qvf
that we had written earlier).

import itertools
import rl.iterate as iterate

num_updates = num_episodes * max_episode_length
final_qvf: QValueFunctionApprox[InventoryState, int] = \
    iterate.last(itertools.islice(qvfs, num_updates))
opt_vf, opt_policy = get_vf_and_policy_from_qvf(
    mdp=si_mdp,
    qvf=final_qvf
)
print(f"GLIE SARSA Optimal Value Function with {num_updates:d} updates")
pprint(opt_vf)
print(f"GLIE SARSA Optimal Policy with {num_updates:d} updates")
print(opt_policy)

This prints:

GLIE SARSA Optimal Value Function with 1000000 updates


{NonTerminal(state=InventoryState(on_hand=0, on_order=0)): -35.05830797041331,
NonTerminal(state=InventoryState(on_hand=0, on_order=1)): -27.8507256742493,
NonTerminal(state=InventoryState(on_hand=0, on_order=2)): -27.735579652721434,
NonTerminal(state=InventoryState(on_hand=1, on_order=0)): -28.984534974043097,
NonTerminal(state=InventoryState(on_hand=1, on_order=1)): -29.325829885558885,
NonTerminal(state=InventoryState(on_hand=2, on_order=0)): -30.236704327526777}
GLIE SARSA Optimal Policy with 1000000 updates
For State InventoryState(on_hand=0, on_order=0): Do Action 1
For State InventoryState(on_hand=0, on_order=1): Do Action 1
For State InventoryState(on_hand=0, on_order=2): Do Action 0
For State InventoryState(on_hand=1, on_order=0): Do Action 1
For State InventoryState(on_hand=1, on_order=1): Do Action 0
For State InventoryState(on_hand=2, on_order=0): Do Action 0

We see that this reasonably converges to the true Value Function (and reaches the true
Optimal Policy) as produced by Value Iteration (whose results were displayed when we
tested GLIE MC Control).
The code above is in the file rl/chapter11/simple_inventory_mdp_cap.py. Also see the
helper functions in rl/chapter11/control_utils.py which you can use to run your own ex-
periments and tests for RL Control algorithms.
For Tabular GLIE MC Control, we stated a theorem for theoretical guarantee of con-
vergence to the true Optimal Value Function (and hence, true Optimal Policy). Is there
something analogous for Tabular GLIE SARSA? The answer is in the affirmative, with the
added condition that we reduce the learning rate according to the Robbins-Monro schedule.
We state the following theorem without proof.

Theorem 10.4.1. Tabular SARSA converges to the Optimal Action-Value function, Q(s, a) →
Q∗ (s, a) (hence, converges to an Optimal Deterministic Policy π ∗ ), under the following conditions:

• GLIE schedule of policies πt (s, a)

• Robbins-Monro schedule of step-sizes αt :

$$\sum_{t=1}^{\infty} \alpha_t = \infty$$

$$\sum_{t=1}^{\infty} \alpha_t^2 < \infty$$
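For example, the step-size schedule $\alpha_t = \frac{1}{t}$ satisfies both conditions, since $\sum_{t=1}^{\infty} \frac{1}{t} = \infty$ (the harmonic series diverges) while $\sum_{t=1}^{\infty} \frac{1}{t^2} = \frac{\pi^2}{6} < \infty$. By contrast, a constant step-size satisfies the first condition but not the second.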

Now let’s compare GLIE MC Control and GLIE SARSA. This comparison is analogous
to the comparison in Section 9.5.2 in Chapter 9 regarding their bias, variance and conver-
gence properties. GLIE SARSA carries a biased estimate of the Q-Value Function com-
pared to the unbiased estimate of GLIE MC Control. On the flip side, the TD Target
Rt+1 + γ · Q(St+1 , At+1 ; w) has much lower variance than Gt because Gt depends on many
random state transitions and random rewards (on the remainder of the trace experience)
whose variances accumulate, whereas the TD Target depends on only the next random
state transition St+1 and the next random reward Rt+1 . The bad news with GLIE SARSA
(due to the bias in its update) is that with function approximation, it does not always
converge to the Optimal Value Function/Policy.
As mentioned in Chapter 9, because MC and TD have significant differences in their
usage of data, nature of updates, and frequency of updates, it is not even clear how to
create a level-playing field when comparing MC and TD for speed of convergence or for
efficiency in usage of limited experiences data. The typical comparisons between MC and
TD are done with constant learning rates, and it has been observed that, in practice, GLIE
SARSA learns faster than GLIE MC Control with constant learning rates. We illustrate
this by running GLIE MC Control and GLIE SARSA on SimpleInventoryMDPCap, and plot
the root-mean-squared-errors (RMSE) of the Q-Value Function estimates as a function of
batches of episodes (i.e., visualize how the RMSE of the Q-Value Function evolves as the
two algorithms progress). This is done by calling the function compare_mc_sarsa_ql which
is in the file rl/chapter11/control_utils.py.
Figure 10.3 depicts the convergence for our implementations of GLIE MC Control and
GLIE SARSA for a constant learning rate of α = 0.05. We produced this Figure by using
data from 500 episodes generated from the same SimpleInventoryMDPCap object we had
created earlier (with same discount factor γ = 0.9). We plotted the RMSE after each batch
of 10 episodes, hence both curves shown in the Figure have 50 RMSE data points plotted.
Firstly, we clearly see that MC Control has significantly more variance as evidenced by
the choppy MC Control RMSE progression curve. Secondly, we note that the MC Con-
trol RMSE curve progresses quite quickly in the first few episode batches but is slow to
converge after the first few episode batches (relative to the progression of SARSA). This
results in SARSA reaching fairly small RMSE quicker than MC Control. This behavior of
GLIE SARSA outperforming the comparable GLIE MC Control (with constant learning
rate) is typical in most MDP Control problems.
Lastly, it’s important to recognize that MC Control is not very sensitive to the initial
Value Function while SARSA is more sensitive to the initial Value Function. We encourage
you to play with the initial Value Function for this SimpleInventoryMDPCap example and
evaluate how it affects the convergence speeds.
More generally, we encourage you to play with the compare_mc_sarsa_ql function on
other MDP choices (ones we have created earlier in this book, or make up your own MDPs)
so you can develop good intuition for how GLIE MC Control and GLIE SARSA algorithms
converge for a variety of choices of learning rate schedules, initial Value Function choices,
choices of discount factor etc.

Figure 10.3.: GLIE MC Control and GLIE SARSA Convergence for SimpleInventoryMDP-
Cap

10.5. SARSA(λ)
Much like how we extended TD Prediction to TD(λ) Prediction, we can extend SARSA to
SARSA(λ), which gives us a way to tune the spectrum from MC Control to SARSA using
the λ parameter. Recall that in order to develop TD(λ) Prediction from TD Prediction,
we first developed the n-step TD Prediction Algorithm, then the Offline λ-Return TD Al-
gorithm, and finally the Online TD(λ) Algorithm. We develop an analogous progression
from SARSA to SARSA(λ).
So the first thing to do is to extend SARSA to 2-step-bootstrapped SARSA, whose update
is as follows:

∆w = α · (Rt+1 + γ · Rt+2 + γ 2 · Q(St+2 , At+2 ; w) − Q(St , At ; w)) · ∇w Q(St , At ; w)

Generalizing this to n-step-bootstrapped SARSA, the update would then be as follows:

∆w = α · (Gt,n − Q(St , At ; w)) · ∇w Q(St , At ; w)


where the n-step-bootstrapped Return Gt,n is defined as:

$$G_{t,n} = \sum_{i=t+1}^{t+n} \gamma^{i-t-1} \cdot R_i + \gamma^n \cdot Q(S_{t+n}, A_{t+n}; w) = R_{t+1} + \gamma \cdot R_{t+2} + \ldots + \gamma^{n-1} \cdot R_{t+n} + \gamma^n \cdot Q(S_{t+n}, A_{t+n}; w)$$

Instead of Gt,n , a valid target is a weighted-average target:

$$\sum_{n=1}^{N} u_n \cdot G_{t,n} + u \cdot G_t \quad \text{where } u + \sum_{n=1}^{N} u_n = 1$$

Any of the un or u can be 0, as long as they all sum up to 1. The λ-Return target is a special
case of weights un and u, defined as follows:

$$u_n = (1 - \lambda) \cdot \lambda^{n-1} \text{ for all } n = 1, \ldots, T-t-1$$

$$u_n = 0 \text{ for all } n \geq T-t, \quad u = \lambda^{T-t-1}$$

We denote the λ-Return target as $G_t^{(\lambda)}$, defined as:

$$G_t^{(\lambda)} = (1 - \lambda) \cdot \sum_{n=1}^{T-t-1} \lambda^{n-1} \cdot G_{t,n} + \lambda^{T-t-1} \cdot G_t$$

Then, the Offline λ-Return SARSA Algorithm makes the following updates (performed
at the end of each trace experience) for each (St , At ) encountered in the trace experience:
$$\Delta w = \alpha \cdot (G_t^{(\lambda)} - Q(S_t, A_t; w)) \cdot \nabla_w Q(S_t, A_t; w)$$

Finally, we create the SARSA(λ) Algorithm, which is the online “version” of the above
λ-Return SARSA Algorithm. The calculations/updates at each time step t for each trace
experience are as follows:

δt = Rt+1 + γ · Q(St+1 , At+1 ; w) − Q(St , At ; w)


Et = γλ · Et−1 + ∇w Q(St , At ; w)
∆w = α · δt · Et
with the eligibility traces initialized at time 0 for each trace experience as $E_0 = \nabla_w Q(S_0, A_0; w)$.
Note that just like in SARSA, the ϵ-greedy policy improvement is automatic from the up-
dated Q-Value Function estimate after each time step.
We leave the implementation of SARSA(λ) in Python code as an exercise for you to do.
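As a hint for this exercise, the per-time-step updates can be sketched as follows (a minimal sketch, assuming the same Gradient and QValueFunctionApprox interfaces used in td_lambda_prediction and glie_sarsa above, with numpy imported as np; this is our own illustration, not a reference implementation):

# inside a glie_sarsa-style inner loop, with el_tr initialized to
# Gradient(q).zero() at the start of each trace experience
y: float = reward + gamma * q((next_state, next_action))
# decay-and-accumulate the eligibility traces with the gradient of Q(S_t, A_t; w)
el_tr = el_tr * (gamma * lambd) + q.objective_gradient(
    xy_vals_seq=[((state, action), y)],
    obj_deriv_out_fun=lambda x1, y1: np.ones(len(x1))
)
# q((state, action)) - y is the negated TD Error, so this applies alpha * delta_t * E_t
q = q.update_with_gradient(el_tr * (q((state, action)) - y))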

10.6. Off-Policy Control


All control algorithms face a tension between wanting to learn Q-Values contingent on
subsequent optimal behavior versus wanting to explore all actions. This almost seems con-
tradictory because the quest for exploration deters one from optimal behavior. Our ap-
proach so far of pursuing an ϵ-greedy policy (to be thought of as an almost optimal policy)
is a hack to resolve this tension. A cleaner approach is to use two separate policies for
the two separate goals of wanting to be optimal and wanting to explore. The first policy
is the one that we learn about (which eventually becomes the optimal policy) - we call
this policy the Target Policy (to signify the “target” of Control). The second policy is the
one that behaves in an exploratory manner, so we can obtain sufficient data for all actions,
enabling us to adequately estimate the Q-Value Function - we call this policy the Behavior
Policy.
In SARSA, at a given time step, we are in a current state S, take action A, after which we
obtain the reward R and next state S ′ , upon which we take the next action A′ . The action
A taken from the current state S is meant to come from an exploratory policy (behavior
policy) so that for each state S, we have adequate occurrences of all actions in order to ac-
curately estimate the Q-Value Function. The action A′ taken from the next state S ′ is meant
to come from the target policy as we aim for subsequent optimal behavior (Q∗ (S, A) requires

optimal behavior subsequent to taking action A). However, in the SARSA algorithm, the
behavior policy producing A from S and the target policy producing A′ from S ′ are in fact
the same policy - the ϵ-greedy policy. Algorithms such as SARSA in which the behavior
policy is the same as the target policy are referred to as On-Policy Algorithms to indicate
the fact that the behavior used to generate data (experiences) does not deviate from the
policy we are aiming for (target policy, which drives towards the optimal policy).
The separation of behavior policy and target policy as two separate policies gives us
algorithms that are known as Off-Policy Algorithms to indicate the fact that the behavior
policy is allowed to “deviate off” from the target policy. This separation enables us to
construct more general and more powerful RL algorithms. We will use the notation π for
the target policy and the notation µ for the behavior policy - therefore, we say that Off-
Policy algorithms estimate the Value Function for target policy π while following behavior
policy µ. Off-Policy algorithms can be very valuable in real-world situations where we can
learn the target policy π by observing humans or other AI agents who follow a behavior
policy µ. Another great practical benefit is to be able to re-use prior experiences that were
generated from old policies, say π1 , π2 , . . .. Yet another powerful benefit is that we can
learn about multiple target policies while following one behavior policy µ. Let's now make
the concept of Off-Policy Learning concrete by covering the most basic (and most famous)
Off-Policy Control Algorithm, which goes by the name of Q-Learning.

10.6.1. Q-Learning

The best way to understand the (Off-Policy) Q-Learning algorithm is to tweak SARSA
to make it Off-Policy. Instead of having both the action A and the next action A′ being
generated by the same ϵ-greedy policy, we generate (i.e., sample) action A (from state
S) using an exploratory behavior policy µ and we generate the next action A′ (from next
state S ′ ) using the target policy π. The behavior policy can be any policy as long as it is
exploratory enough to be able to obtain sufficient data for all actions (in order to obtain
an adequate estimate of the Q-Value Function). Note that in SARSA, when we roll over to
the next (new) time step, the new time step’s state S is set to be equal to the previous time
step’s next state S ′ and the new time step’s action A is set to be equal to the previous time
step’s next action A′ . However, in Q-Learning, we only set the new time step’s state S to
be equal to the previous time step’s next state S ′ . The action A for the new time step will
be generated using the behavior policy µ, and won’t be equal to the previous time step’s
next action A′ (that would have been generated using the target policy π).
This Q-Learning idea of two separate policies - behavior policy and target policy - is
fairly generic, and can be used in algorithms beyond solving the Control problem. How-
ever, here we are interested in Q-Learning for Control and so, we want to ensure that the
target policy eventually becomes the optimal policy. One straightforward way to accom-
plish this is to make the target policy equal to the deterministic greedy policy derived from
the Q-Value Function estimate at every step. Thus, the update for Q-Learning Control al-
gorithm is as follows:

∆w = α · δt · ∇w Q(St , At ; w)

where

δt = Rt+1 + γ · Q(St+1 , arg maxa∈A Q(St+1 , a; w); w) − Q(St , At ; w)
   = Rt+1 + γ · maxa∈A Q(St+1 , a; w) − Q(St , At ; w)

Figure 10.4.: Visualization of Q-Learning Algorithm

Following our convention from Chapter 2, we depict the Q-Learning algorithm in Figure
10.4 with states as elliptical-shaped nodes, actions as rectangular-shaped nodes, and the
edges as samples from transition probability distribution and action choices.
Although we have highlighted some attractive features of Q-Learning (on account of be-
ing Off-Policy), it turns out that Q-Learning when combined with function approximation
of the Q-Value Function leads to convergence issues (more on this later). However, Tab-
ular Q-Learning converges under the usual appropriate conditions. There is considerable
literature on convergence of Tabular Q-Learning and we won’t go over those convergence
theorems in this book - here it suffices to say that the convergence proofs for Tabular Q-
Learning require infinite exploration of all (state, action) pairs and appropriate stochastic
approximation conditions for step sizes.
Now let us write some code for Q-Learning. The function q_learning below is quite
similar to the function glie_sarsa we wrote earlier. Here are the differences:

• q_learning takes as input the argument policy_from_q: PolicyFromQType, which is


a function with two arguments - a Q-Value Function and a MarkovDecisionProcess
object - and returns the policy derived from the Q-Value Function. Thus, q_learning
takes as input a general behavior policy whereas glie_sarsa uses the ϵ-greedy pol-
icy as its behavior (and target) policy. However, you should note that the typical
descriptions of Q-Learning in the RL literature specialize the behavior policy to be
the ϵ-greedy policy (we’ve simply chosen to describe and implement Q-Learning in
its more general form of using an arbitrary user-specified behavior policy).
• glie_sarsa takes as input epsilon_as_func_of_episodes: Callable[[int], float] whereas
q_learning doesn’t require this argument (Q-Learning can converge even if its be-
havior policy has an unchanging ϵ, and any ϵ specification in q_learning would be
built into the policy_from_q argument).
• As explained above, in q_learning, the action from the state is obtained using the
specified behavior policy policy_from_q and the “next action” from the next_state is
implicitly obtained using the deterministic greedy policy derived from the Q-Value
Function estimate q. In glie_sarsa, both action and next_action were obtained from
the ϵ-greedy policy.
• As explained above, in q_learning, as we move to the next time step, we set state to
be equal to the previous time step’s next_state whereas in glie_sarsa, we not only
do this but we also set action to be equal to the previous time step’s next_action.
PolicyFromQType = Callable[
[QValueFunctionApprox[S, A], MarkovDecisionProcess[S, A]],
Policy[S, A]
]
def q_learning(
mdp: MarkovDecisionProcess[S, A],
policy_from_q: PolicyFromQType,
states: NTStateDistribution[S],
approx_0: QValueFunctionApprox[S, A],
gamma: float,
max_episode_length: int
) -> Iterator[QValueFunctionApprox[S, A]]:
q: QValueFunctionApprox[S, A] = approx_0
yield q
while True:
state: NonTerminal[S] = states.sample()
steps: int = 0
while isinstance(state, NonTerminal) and steps < max_episode_length:
policy: Policy[S, A] = policy_from_q(q, mdp)
action: A = policy.act(state).sample()
next_state, reward = mdp.step(state, action).sample()
next_return: float = max(
q((next_state, a))
for a in mdp.actions(next_state)
) if isinstance(next_state, NonTerminal) else 0.
q = q.update([((state, action), reward + gamma * next_return)])
yield q
steps += 1
state = next_state

The above code is in the file rl/td.py. Much like how we tested GLIE SARSA on SimpleInventoryMDPCap,
the code in the file rl/chapter11/simple_inventory_mdp_cap.py also tests Q-Learning on
SimpleInventoryMDPCap. We encourage you to leverage the helper functions in rl/chapter11/control_utils.py
to run your own experiments and tests for Q-Learning. In particular, the functions for
Q-Learning in rl/chapter11/control_utils.py employ the common practice of using the ϵ-
greedy policy as the behavior policy.
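For concreteness, here is one hedged way (not part of the rl library) to construct the policy_from_q argument as an ϵ-greedy behavior policy. We rely only on the fact, visible in the q_learning code above, that a policy object needs an act method returning a Distribution of actions; the names epsilon_greedy_policy_from_q and EpsilonGreedyFromQ below are purely illustrative.

from rl.distribution import Categorical

def epsilon_greedy_policy_from_q(epsilon: float) -> PolicyFromQType:
    def policy_from_q(q, mdp):
        class EpsilonGreedyFromQ:
            def act(self, state):
                acts = list(mdp.actions(state))
                greedy = max(acts, key=lambda a: q((state, a)))
                # every action gets epsilon/|A|; the greedy action gets the remaining 1 - epsilon
                return Categorical({
                    a: epsilon / len(acts) + (1. - epsilon if a == greedy else 0.)
                    for a in acts
                })
        return EpsilonGreedyFromQ()
    return policy_from_q

# Hypothetical usage with the q_learning function above:
# qvfs = q_learning(mdp, epsilon_greedy_policy_from_q(0.2), states, approx_0,
#                   gamma=0.9, max_episode_length=100)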

10.6.2. Windy Grid


Now we cover an interesting Control problem that is quite popular in the RL literature -
how to navigate a “Windy Grid.” We have added some bells and whistles to this problem
to make it more interesting. We want to evaluate SARSA and Q-Learning on this problem.
Here’s the detailed description of this problem:
We are given a grid comprising cells arranged in the form of m rows and n columns,
defined as G = {(i, j) | 0 ≤ i < m, 0 ≤ j < n}. A subset of G (denoted B) are uninhabitable
cells known as blocks. A subset of G − B (denoted T ) is known as the set of goal cells. We
have to find a least-cost path from each of the cells in G − B − T to any of the cells in
T . At each step, we are required to make a move to a non-block cell (we cannot remain
stationary). Right after we make our move, a random vertical wind could move us one
cell up or down, unless limited by a block or limited by the boundary of the grid.
Each column has its own random wind specification given by two parameters 0 ≤ p1 ≤
1 and 0 ≤ p2 ≤ 1 with p1 + p2 ≤ 1. The wind blows downwards with probability p1 ,
upwards with probability p2 , and there is no wind with probability 1 − p1 − p2 . If the wind
makes us bump against a block or against the boundary of the grid, we incur a cost of b ∈
R+ in addition to the usual cost of 1 for each move we make. Thus, here the cost includes
not just the time spent on making the moves, but also the cost of bumping against blocks
or against the boundary of the grid (due to the wind). Minimizing the expected total cost
amounts to finding our way to a goal state in a manner that combines minimization of the
number of moves with the minimization of the hurt caused by bumping (assume discount
factor of 1 when minimizing this expected total cost). If the wind causes us to bump
against a wall or against a boundary, we bounce and settle in the cell we moved to just
before being blown by the wind (note that the wind blows immediately after we make a
move). The wind will never move us by more than one cell between two successive moves,
and the wind is never horizontal. Note also that if we move to a goal cell, the process ends
immediately without any wind-blow following the movement to the goal cell. The random
wind for all the columns is specified as a sequence [(p1,j , p2,j ) | 0 ≤ j < n].
Let us model this problem of minimizing the expected total cost while reaching a goal
cell as a Finite Markov Decision Process.
State Space S = G − B, Non-Terminal States N = G − B − T , Terminal States are T .
We denote the set of all possible moves {UP, DOWN, LEFT, RIGHT} as:
A = {(1, 0), (−1, 0), (0, −1), (0, 1)}.
The actions A(s) for a given non-terminal state s ∈ N is defined as: {a | a ∈ A, s+a ∈ S}
where + denotes element-wise addition of integer 2-tuples.
For all (sr , sc ) ∈ N , for all (ar , ac ) ∈ A((sr , sc )), if (sr + ar , sc + ac ) ∈ T , then:

PR ((sr , sc ), (ar , ac ), −1, (sr + ar , sc + ac )) = 1

For all (sr , sc ) ∈ N , for all (ar , ac ) ∈ A((sr , sc )), if (sr + ar , sc + ac ) ∈ N , then:

PR ((sr , sc ), (ar , ac ), −1 − b, (sr + ar , sc + ac )) = p1,sc+ac · I[(sr + ar − 1, sc + ac ) ∉ S] + p2,sc+ac · I[(sr + ar + 1, sc + ac ) ∉ S]

PR ((sr , sc ), (ar , ac ), −1, (sr + ar − 1, sc + ac )) = p1,sc+ac · I[(sr + ar − 1, sc + ac ) ∈ S]

PR ((sr , sc ), (ar , ac ), −1, (sr + ar + 1, sc + ac )) = p2,sc+ac · I[(sr + ar + 1, sc + ac ) ∈ S]

PR ((sr , sc ), (ar , ac ), −1, (sr + ar , sc + ac )) = 1 − p1,sc+ac − p2,sc+ac
Discount Factor γ = 1
Now let’s write some code to model this problem with the above MDP spec, and run
Value Iteration, SARSA and Q-Learning as three different ways of solving this MDP Con-
trol problem.
We start with the problem specification in the form of a Python class WindyGrid and write
some helper functions before getting into the MDP creation and DP/RL algorithms.

'''
Cell specifies (row, column) coordinate
'''
Cell = Tuple[int, int]
CellSet = Set[Cell]
Move = Tuple[int, int]
'''
WindSpec specifies a random vertical wind for each column.
Each random vertical wind is specified by a (p1, p2) pair
where p1 specifies probability of Downward Wind (could take you
one step lower in row coordinate unless prevented by a block or
boundary) and p2 specifies probability of Upward Wind (could take
you one step higher in row coordinate unless prevented by a
block or boundary). If one bumps against a block or boundary, one
incurs a bump cost and doesn't move. The remaining probability
1 - p1 - p2 corresponds to No Wind.
'''
WindSpec = Sequence[Tuple[float, float]]
possible_moves: Mapping[Move, str] = {
    (-1, 0): 'D',
    (1, 0): 'U',
    (0, -1): 'L',
    (0, 1): 'R'
}
@dataclass(frozen=True)
class WindyGrid:
rows: int # number of grid rows
columns: int # number of grid columns
blocks: CellSet # coordinates of block cells
terminals: CellSet # coordinates of goal cells
wind: WindSpec # spec of vertical random wind for the columns
bump_cost: float # cost of bumping against block or boundary
@staticmethod
def add_move_to_cell(cell: Cell, move: Move) -> Cell:
return cell[0] + move[0], cell[1] + move[1]
def is_valid_state(self, cell: Cell) -> bool:
’’’
checks if a cell is a valid state of the MDP
’’’
return 0 <= cell[0] < self.rows and 0 <= cell[1] < self.columns \
and cell not in self.blocks
def get_all_nt_states(self) -> CellSet:
’’’
returns all the non-terminal states
’’’
return {(i, j) for i in range(self.rows) for j in range(self.columns)
if (i, j) not in set.union(self.blocks, self.terminals)}
def get_actions_and_next_states(self, nt_state: Cell) \
-> Set[Tuple[Move, Cell]]:
’’’
given a non-terminal state, returns the set of all possible
(action, next_state) pairs
’’’
temp: Set[Tuple[Move, Cell]] = {(a, WindyGrid.add_move_to_cell(
nt_state,
a
)) for a in possible_moves}
return {(a, s) for a, s in temp if self.is_valid_state(s)}

Next we write a method to calculate the transition probabilities. The code below should
be self-explanatory and mimics the description of the problem above and the mathematical
specification of the transition probabilities given above.

from rl.distribution import Categorical


def get_transition_probabilities(self, nt_state: Cell) \
-> Mapping[Move, Categorical[Tuple[Cell, float]]]:
’’’
given a non-terminal state, return a dictionary whose
keys are the valid actions (moves) from the given state
and the corresponding values are the associated probabilities
(following that move) of the (next_state, reward) pairs.
The probabilities are determined from the wind probabilities
of the column one is in after the move. Note that if one moves
to a goal cell (terminal state), then one ends up in that
goal cell with 100% probability (i.e., no wind exposure in a
goal cell).
’’’
d: Dict[Move, Categorical[Tuple[Cell, float]]] = {}
for a, (r, c) in self.get_actions_and_next_states(nt_state):
if (r, c) in self.terminals:
d[a] = Categorical({((r, c), -1.): 1.})
else:
down_prob, up_prob = self.wind[c]
stay_prob: float = 1. - down_prob - up_prob
d1: Dict[Tuple[Cell, float], float] = \
{((r, c), -1.): stay_prob}
if self.is_valid_state((r - 1, c)):
d1[((r - 1, c), -1.)] = down_prob
if self.is_valid_state((r + 1, c)):
d1[((r + 1, c), -1.)] = up_prob
d1[((r, c), -1. - self.bump_cost)] = \
down_prob * (1 - self.is_valid_state((r - 1, c))) + \
up_prob * (1 - self.is_valid_state((r + 1, c)))
d[a] = Categorical(d1)
return d

Next we write a method to create the MarkovDecisionProcess for the Windy Grid.

from rl.markov_decision_process import FiniteMarkovDecisionProcess


def get_finite_mdp(self) -> FiniteMarkovDecisionProcess[Cell, Move]:
’’’
returns the FiniteMarkovDecision object for this windy grid problem
’’’
return FiniteMarkovDecisionProcess(
{s: self.get_transition_probabilities(s) for s in
self.get_all_nt_states()}
)

Next we write methods for Value Iteration, SARSA and Q-Learning

from rl.markov_decision_process import FiniteDeterministicPolicy
from rl.dynamic_programming import value_iteration_result, V
from rl.chapter11.control_utils import glie_sarsa_finite_learning_rate
from rl.chapter11.control_utils import q_learning_finite_learning_rate
from rl.chapter11.control_utils import get_vf_and_policy_from_qvf
def get_vi_vf_and_policy(self) -> \
Tuple[V[Cell], FiniteDeterministicPolicy[Cell, Move]]:
’’’
Performs the Value Iteration DP algorithm returning the
Optimal Value Function (as a V[Cell]) and the Optimal Policy
(as a FiniteDeterministicPolicy[Cell, Move])
’’’
return value_iteration_result(self.get_finite_mdp(), gamma=1.)
def get_glie_sarsa_vf_and_policy(
self,
epsilon_as_func_of_episodes: Callable[[int], float],
learning_rate: float,
num_updates: int
) -> Tuple[V[Cell], FiniteDeterministicPolicy[Cell, Move]]:
qvfs: Iterator[QValueFunctionApprox[Cell, Move]] = \
glie_sarsa_finite_learning_rate(
fmdp=self.get_finite_mdp(),
initial_learning_rate=learning_rate,
half_life=1e8,
exponent=1.0,
gamma=1.0,
epsilon_as_func_of_episodes=epsilon_as_func_of_episodes,
max_episode_length=int(1e8)
)
final_qvf: QValueFunctionApprox[Cell, Move] = \
iterate.last(itertools.islice(qvfs, num_updates))
return get_vf_and_policy_from_qvf(
mdp=self.get_finite_mdp(),
qvf=final_qvf
)
def get_q_learning_vf_and_policy(
self,
epsilon: float,
learning_rate: float,
num_updates: int
) -> Tuple[V[Cell], FiniteDeterministicPolicy[Cell, Move]]:
qvfs: Iterator[QValueFunctionApprox[Cell, Move]] = \
q_learning_finite_learning_rate(
fmdp=self.get_finite_mdp(),
initial_learning_rate=learning_rate,
half_life=1e8,
exponent=1.0,
gamma=1.0,
epsilon=epsilon,
max_episode_length=int(1e8)
)
final_qvf: QValueFunctionApprox[Cell, Move] = \
iterate.last(itertools.islice(qvfs, num_updates))
return get_vf_and_policy_from_qvf(
mdp=self.get_finite_mdp(),
qvf=final_qvf
)

The above code is in the file rl/chapter11/windy_grid.py. Note that this file also contains
some helpful printing functions that pretty-prints the grid, along with the calculated Op-
timal Value Functions and Optimal Policies. The method print_wind_and_bumps prints the
column wind probabilities and the cost of bumping into a block/boundary. The method
print_vf_and_policy prints a given Value Function and a given Deterministic Policy - this
method can be used to print the Optimal Value Function and Optimal Policy produced by
Value Iteration, by SARSA and by Q-Learning. In the printing of a deterministic policy,
“X” represents a block, “T” represents a terminal cell, and the characters “L,” “R,” “D,”
“U” represent “Left,” “Right,” “Down,” “Up” moves respectively.
Now let’s run our code on a small instance of a Windy Grid.

wg = WindyGrid(
rows=5,
columns=5,
blocks={(0, 1), (0, 2), (0, 4), (2, 3), (3, 0), (4, 0)},
terminals={(3, 4)},
wind=[(0., 0.9), (0.0, 0.8), (0.7, 0.0), (0.8, 0.0), (0.9, 0.0)],
bump_cost=4.0
)
wg.print_wind_and_bumps()
vi_vf_dict, vi_policy = wg.get_vi_vf_and_policy()
print("Value Iteration\n")
wg.print_vf_and_policy(
vf_dict=vi_vf_dict,
policy=vi_policy
)
epsilon_as_func_of_episodes: Callable[[int], float] = lambda k: 1. / k
learning_rate: float = 0.03
num_updates: int = 100000
sarsa_vf_dict, sarsa_policy = wg.get_glie_sarsa_vf_and_policy(
epsilon_as_func_of_episodes=epsilon_as_func_of_episodes,
learning_rate=learning_rate,
num_updates=num_updates
)
print("SARSA\n")
wg.print_vf_and_policy(
vf_dict=sarsa_vf_dict,
policy=sarsa_policy
)
epsilon: float = 0.2
ql_vf_dict, ql_policy = wg.get_q_learning_vf_and_policy(
epsilon=epsilon,
learning_rate=learning_rate,
num_updates=num_updates
)
print("Q-Learning\n")
wg.print_vf_and_policy(
vf_dict=ql_vf_dict,
policy=ql_policy
)

This prints the following:

Column 0: Down Prob = 0.00, Up Prob = 0.90


Column 1: Down Prob = 0.00, Up Prob = 0.80
Column 2: Down Prob = 0.70, Up Prob = 0.00
Column 3: Down Prob = 0.80, Up Prob = 0.00
Column 4: Down Prob = 0.90, Up Prob = 0.00
Bump Cost = 4.00

Value Iteration

0 1 2 3 4
4 XXXXX 5.25 2.02 1.10 1.00
3 XXXXX 8.53 5.20 1.00 0.00
2 9.21 6.90 8.53 XXXXX 1.00
1 8.36 9.21 8.36 12.16 11.00
0 10.12 XXXXX XXXXX 17.16 XXXXX

0 1 2 3 4
4 X R R R D
3 X R R R T
2 R U U X U
1 R U L L U
0 U X X U X

SARSA

0 1 2 3 4
4 XXXXX 5.47 2.02 1.08 1.00
3 XXXXX 8.78 5.37 1.00 0.00
2 9.14 7.03 8.29 XXXXX 1.00
1 8.51 9.16 8.27 11.92 12.58
0 10.05 XXXXX XXXXX 16.48 XXXXX

0 1 2 3 4
4 X R R R D
3 X R R R T
2 R U U X U
1 R U L L U
0 U X X U X

Q-Learning

0 1 2 3 4
4 XXXXX 5.45 2.02 1.09 1.00
3 XXXXX 8.09 5.12 1.00 0.00
2 8.78 6.76 7.92 XXXXX 1.00
1 8.31 8.85 8.09 11.52 10.93
0 9.85 XXXXX XXXXX 16.16 XXXXX

0 1 2 3 4
4 X R R R D
3 X R R R T
2 R U U X U
1 R U L L U
0 U X X U X

Value Iteration should be considered as the benchmark since it calculates the Optimal
Value Function within the default tolerance of 1e-5. We see that both SARSA and Q-
Learning get fairly close to the Optimal Value Function after only 100,000 updates (i.e.,
100,000 moves across various episodes). We also see that both SARSA and Q-Learning
obtain the true Optimal Policy, consistent with Value Iteration.

Figure 10.5.: GLIE SARSA and Q-Learning Convergence for Windy Grid (Bump Cost = 4)


Now let’s explore SARSA and Q-Learning’s speed of convergence to the Optimal Value
Function.
We first run GLIE SARSA and Q-Learning for the above settings of bump cost = 4.0,
GLIE SARSA ϵ(k) = 1/k, Q-Learning ϵ = 0.2. Figure 10.5 depicts the trajectory of Root-
Mean-Squared-Error (RMSE) of the Q-Values relative to the Q-Values obtained by Value
Iteration. The RMSE is plotted as a function of progressive batches of 10 episodes. We can
see that GLIE SARSA and Q-Learning have roughly the same convergence trajectory.
Now let us set the bump cost to a very high value of 100,000. Figure 10.6 depicts the
convergence trajectory for bump cost of 100,000. We see that Q-Learning converges much
faster than GLIE SARSA (we kept GLIE SARSA ϵ(k) = 1/k and Q-Learning ϵ = 0.2).
So why does Q-Learning do better? Q-Learning has two advantages over GLIE SARSA
here: Firstly, its behavior policy explores at a constant rate of 20%, whereas GLIE
SARSA’s exploration declines to 10% after just the 10th episode. This means Q-Learning
gets sufficient data quicker than GLIE SARSA for the entire set of (state, action) pairs.
Secondly, Q-Learning’s target policy is greedy, versus GLIE SARSA’s declining-ϵ-greedy.
This means GLIE SARSA’s Optimal Q-Value estimation is compromised due to the exploration
of actions in its target policy (rather than a pure exploitation with max over actions,
as is the case with Q-Learning). Thus, the separation between behavior policy and target
policy in Q-Learning fetches it the best of both worlds and enables it to perform better than
GLIE SARSA in this example.
SARSA is a more “conservative” algorithm in the sense that if there is a risk of a large
negative reward close to the optimal path, SARSA will tend to avoid that dangerous op-
timal path and only slowly learn to use that optimal path when ϵ (exploration) reduces
sufficiently. Q-Learning, on the other hand, will tend to take that risk while exploring
and learns fast through “big failures.” This provides us with a guide on when to use

377
Figure 10.6.: GLIE SARSA and Q-Learning Convergence for Windy Grid (Bump Cost =
100,000)

SARSA and when to use Q-Learning. Roughly speaking, use SARSA if you are training
your AI agent with interaction with the actual environment where you care about time and
money consumed while doing the training with actual environment-interaction (eg: you
don’t want to risk damaging a robot by walking it towards an optimal path in the prox-
imity of physical danger). On the other hand, use Q-Learning if you are training your
AI agent with a simulated environment where large negative rewards don’t cause actual
time/money losses, but these large negative rewards help the AI agent learn quickly. In
a financial trading example, if you are training your RL agent in an actual trading envi-
ronment, you’d want to use SARSA as Q-Learning can potentially incur big losses while
SARSA (although slower in learning) will avoid real trading losses during the process of
learning. On the other hand, if you are training your RL agent in a simulated trading en-
vironment, Q-Learning is the way to go as it will learn fast by incurring “paper trading”
losses as part of the process of executing risky trades.
Note that Q-Learning (and Off-policy Learning in general) has higher per-sample vari-
ance than SARSA, which could lead to problems in convergence, especially when we em-
ploy function approximation for the Q-Value Function. Q-Learning has been shown to
be particularly problematic in converging when using neural networks for its Q-Value
function approximation.
The SARSA algorithm was introduced in a paper by Rummery and Niranjan (Rummery
and Niranjan 1994). The Q-Learning algorithm was introduced in the Ph.D. thesis of Chris
Watkins (Watkins 1989).

10.6.3. Importance Sampling


Now that we’ve got a good grip of Off-Policy Learning through the Q-Learning algorithm,
we show a very different (arguably simpler) method of doing Off-Policy Learning. This
method is known as Importance Sampling, a fairly general technique (beyond RL) for es-
timating properties of a particular probability distribution, while only having access to
samples of a different probability distribution. Specializing this technique to Off-Policy
Control, we estimate the Value Function for the target policy (probability distribution of
interest) while having access to samples generated from the probability distribution of the
behavior policy. Specifically, Importance Sampling enables us to calculate EX∼P [f (X)]
(where P is the probability distribution of interest), given samples from probability dis-
tribution Q, as follows:
EX∼P [f (X)] = ΣX P (X) · f (X)
             = ΣX Q(X) · (P (X)/Q(X)) · f (X)
             = EX∼Q [(P (X)/Q(X)) · f (X)]
In essence, the function values f (X) of the samples X are scaled by the ratio of the
probabilities P (X) and Q(X).
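The following small numerical sketch (our own illustration, not from the rl codebase) shows the identity above in action: we estimate EX∼P [f (X)] using only samples drawn from Q, by weighting each f (X) with P (X)/Q(X).

import numpy as np

rng = np.random.default_rng(0)
outcomes = np.array([0, 1, 2])
P = np.array([0.2, 0.5, 0.3])    # probability distribution of interest
Q = np.array([0.4, 0.4, 0.2])    # distribution we can actually sample from
f = lambda x: x * x

samples = rng.choice(outcomes, size=100_000, p=Q)
is_estimate = np.mean((P[samples] / Q[samples]) * f(samples))
true_value = np.sum(P * f(outcomes))     # 0.2*0 + 0.5*1 + 0.3*4 = 1.7
print(is_estimate, true_value)           # the two should be close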
Let’s employ this Importance Sampling method for Off-Policy Monte Carlo Prediction,
where we need to estimate the Value Function for policy π while only having access to
trace experience returns generated using policy µ. The idea is straightforward - we simply
weight the returns Gt according to the similarity between policies π and µ, by multiplying
importance sampling corrections along whole episodes. Let us define ρt as the product
of the ratio of action probabilities (on the two policies π and µ) from time t to time T − 1
(assume episode ends at time T ). Specifically,

ρt = (π(St , At )/µ(St , At )) · (π(St+1 , At+1 )/µ(St+1 , At+1 )) · · · (π(ST −1 , AT −1 )/µ(ST −1 , AT −1 ))
We’ve learnt in Chapter 9 that the learning rate α (treated as an update step-size) serves
as a weight to the update target (in the case of MC, the update target is the return Gt ). So
all we have to do is to scale the step-size α for the update for time t by ρt . Hence, the MC
Prediction update is tweaked to be the following when doing Off-Policy with Importance
Sampling:

∆w = α · ρt · (Gt − V (St ; w)) · ∇w V (St ; w)


For MC Control, we make the analogous tweak to the update for the Q-Value Function,
as follows:

∆w = α · ρt · (Gt − Q(St , At ; w)) · ∇w Q(St , At ; w)


Note that we cannot use this method if µ is zero when π is non-zero (since µ is in the
denominator).
A key disadvantage of Off-Policy MC with Importance Sampling is that it dramatically
increases the variance of the Value Function estimate. To contain the variance, we can use
TD targets (instead of trace experience returns) generated from µ to evaluate the Value
Function for π. For Off-Policy TD Prediction, we essentially weight TD target R + γ ·
V (S ′ ; w) with importance sampling. Here we only need a single importance sampling
correction, as follows:
∆w = α · (π(St , At )/µ(St , At )) · (Rt+1 + γ · V (St+1 ; w) − V (St ; w)) · ∇w V (St ; w)

For TD Control, we do the analogous update for the Q-Value Function:

∆w = α · (π(St , At )/µ(St , At )) · (Rt+1 + γ · Q(St+1 , At+1 ; w) − Q(St , At ; w)) · ∇w Q(St , At ; w)

This has much lower variance than MC importance sampling. A key advantage of TD
importance sampling is that policies only need to be similar over a single time step.
Since the modifications from On-Policy algorithms to Off-Policy algorithms based on
Importance Sampling are just a small tweak of scaling the update by importance sam-
pling corrections, we won’t implement the Off-Policy Importance Sampling algorithms in
Python code. However, we encourage you to implement the Prediction and Control MC
and TD Off-Policy algorithms (based on Importance Sampling) described above.
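As a hedged starting point for that exercise, here is a minimal sketch of Tabular Off-Policy TD(0) Prediction with the single-step importance sampling correction described above. The tuple-based representation of atomic experiences and the functions pi_prob and mu_prob (action probabilities under the target and behavior policies) are our own simplifying assumptions, not part of the rl library.

from typing import Callable, Dict, Iterable, Tuple, TypeVar

S = TypeVar('S')
A = TypeVar('A')

def off_policy_td_prediction_is(
    atomic_experiences: Iterable[Tuple[S, A, float, S, bool]],  # (s, a, r, s', done), generated by mu
    pi_prob: Callable[[S, A], float],    # target policy probability pi(s, a)
    mu_prob: Callable[[S, A], float],    # behavior policy probability mu(s, a)
    gamma: float,
    alpha: float
) -> Dict[S, float]:
    v: Dict[S, float] = {}
    for s, a, r, s1, done in atomic_experiences:
        rho = pi_prob(s, a) / mu_prob(s, a)              # importance sampling correction
        target = r + (0. if done else gamma * v.get(s1, 0.))
        v[s] = v.get(s, 0.) + alpha * rho * (target - v.get(s, 0.))
    return v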

10.7. Conceptual Linkage between DP and TD algorithms


It’s worthwhile placing RL algorithms in terms of their conceptual relationship to DP al-
gorithms. Let’s start with the Prediction problem, whose solution is based on the Bellman
Expectation Equation. Figure 10.8 depicts TD Prediction, which is the sample backup ver-
sion of Policy Evaluation, depicted in Figure 10.7 as a full backup DP algorithm.
Likewise, Figure 10.10 depicts SARSA, which is the sample backup version of Q-Policy
Iteration (Policy Iteration on Q-Value), depicted in Figure 10.9 as a full backup DP algo-
rithm.
Finally, Figure 10.12 depicts Q-Learning, which is the sample backup version of Q-Value
Iteration (Value Iteration on Q-Value), depicted in Figure 10.11 as a full backup DP algo-
rithm.
The table in Figure 10.13 summarizes these RL algorithms, along with their correspond-
ing DP algorithms, showing the expectation targets of the DP algorithms’ updates along
with the corresponding sample targets of the RL algorithms’ updates.

Figure 10.7.: Policy Evaluation (DP Algorithm with Full Backup)
Figure 10.8.: TD Prediction (RL Algorithm with Sample Backup)

Figure 10.9.: Q-Policy Iteration (DP Algorithm with Full Backup)

Figure 10.10.: SARSA (RL Algorithm with Sample Backup)

Figure 10.11.: Q-Value Iteration (DP Algorithm with Full Backup)

Figure 10.12.: Q-Learning (RL Algorithm with Sample Backup)

Full Backup (DP)                          Sample Backup (TD)

Policy Evaluation’s V (S) update:         TD Learning’s V (S) update:
  E[R + γV (S′) | S]                        sample R + γV (S′)
Q-Policy Iteration’s Q(S, A) update:      SARSA’s Q(S, A) update:
  E[R + γQ(S′, A′) | S, A]                  sample R + γQ(S′, A′)
Q-Value Iteration’s Q(S, A) update:       Q-Learning’s Q(S, A) update:
  E[R + γ maxa′ Q(S′, a′) | S, A]           sample R + γ maxa′ Q(S′, a′)

Figure 10.13.: Relationship between DP and RL algorithms

On/Off-Policy    Algorithm    Tabular    Linear    Non-Linear
On-Policy        MC           ✓          ✓         ✓
                 TD(0)        ✓          ✓         ✗
                 TD(λ)        ✓          ✓         ✗
Off-Policy       MC           ✓          ✓         ✓
                 TD(0)        ✓          ✗         ✗
                 TD(λ)        ✓          ✗         ✗

Figure 10.14.: Convergence of RL Prediction Algorithms

10.8. Convergence of RL Algorithms


Now we provide an overview of convergence of RL Algorithms. Let us start with RL Pre-
diction. Figure 10.14 provides the overview of RL Prediction. As you can see, Monte-Carlo
Prediction has convergence guarantees, whether On-Policy or Off-Policy, whether Tabular
or with Function Approximation (even with non-linear Function Approximation). How-
ever, Temporal-Difference Prediction can have convergence issues - the core reason for
this is that the TD update is not a true gradient update (as we’ve explained in Chapter 9,
it is a semi-gradient update). As you can see, although we have convergence guarantees
for On-Policy TD Prediction with linear function approximation, there is no convergence
guarantee for On-Policy TD Prediction with non-linear function approximation. The situ-
ation is even worse for Off-Policy TD Prediction - there is no convergence guarantee even
for linear function approximation.
We want to highlight a confluence pattern in RL Algorithms where convergence prob-
lems arise. As a rule of thumb, if we do all of the following three, then we run into con-
vergence problems.

• Bootstrapping, i.e., updating with a target that involves the current Value Function
estimate (as is the case with Temporal-Difference)
• Off-Policy
• Function Approximation of the Value Function

Hence, [Bootstrapping, Off-Policy, Function Approximation] is known as the Deadly


Triad, a term emphasized and popularized by Richard Sutton in a number of publications
and lectures. We should highlight that the Deadly Triad phenomenon is not a theorem -
rather, it should be viewed as a rough pattern and as a rule of thumb. So to achieve conver-
gence, we avoid at least one of the above three. We have seen that each of [Bootstrapping,
Off-Policy, Function Approximation] provides benefits, but when all three come together,
we run into convergence problems. The fundamental problem is that semi-gradient boot-
strapping does not follow the gradient of any objective function and this causes TD to
diverge when running off-policy and when using function approximations.
Function Approximation is typically unavoidable in real-world problems because of the
size of real-world problems. So we are looking at avoiding semi-gradient bootstrapping
or avoiding off-policy. Note that semi-gradient bootstrapping can be mitigated by tuning
the TD λ parameter to a high-enough value. However, if we want to get around this prob-
lem in a fundamental manner, we can avoid the core issue of semi-gradient bootstrapping
by instead doing a true gradient with a method known as Gradient Temporal-Difference (or
Gradient TD, for short). We will cover Gradient TD in detail in Chapter 11, but for now,

On/Off-Policy    Algorithm      Tabular    Linear    Non-Linear
On-Policy        MC             ✓          ✓         ✓
                 TD             ✓          ✓         ✗
                 Gradient TD    ✓          ✓         ✓
Off-Policy       MC             ✓          ✓         ✓
                 TD             ✓          ✗         ✗
                 Gradient TD    ✓          ✓         ✓

Figure 10.15.: Convergence of RL Prediction Algorithms, including Gradient TD

Algorithm              Tabular    Linear    Non-Linear
MC Control             ✓          (✓)       ✗
SARSA                  ✓          (✓)       ✗
Q-Learning             ✓          ✗         ✗
Gradient Q-Learning    ✓          ✓         ✗

Figure 10.16.: Convergence of RL Control Algorithms

we want to simply share that Gradient TD updates the value function approximation’s pa-
rameters with the actual gradient (not semi-gradient) of an appropriate loss function and
the gradient formula involves bootstrapping. Thus, it avails of the advantages of boot-
strapping without the disadvantages of semi-gradient (which we cheekily referred to as
“cheating” in Chapter 9). Figure 10.15 expands upon Figure 10.14 by incorporating con-
vergence properties of Gradient TD.
Now let’s move on to convergence of Control Algorithms. Figure 10.16 provides the
picture. (✓) means it doesn’t quite hit the Optimal Value Function, but bounces around
near the Optimal Value Function. Gradient Q-Learning is the adaptation of Q-Learning
with Gradient TD. So this method is Off-Policy, is bootstrapped, but avoids semi-gradient.
This enables it to converge for linear function approximations. However, it diverges when
used with non-linear function approximations. So, for Control, even with Gradient TD,
the deadly triad still exists for a combination of [Bootstrapping, Off-Policy, Non-Linear
Function Approximation]. In Chapter 11, we shall cover the DQN algorithm which is
an innovative and practically effective method for getting around the deadly triad for RL
Control.

10.9. Key Takeaways from this Chapter


• RL Control is based on the idea of Generalized Policy Iteration (GPI):
– Policy Evaluation with Q-Value Function (instead of State-Value Function V )
– Improved Policy needs to be exploratory, eg: ϵ-greedy
• On-Policy versus Off-Policy (eg: SARSA versus Q-Learning)
• Deadly Triad := [Bootstrapping, Off-Policy, Function Approximation]

11. Batch RL, Experience-Replay, DQN,
LSPI, Gradient TD

In Chapters 9 and 10, we covered the basic RL algorithms for Prediction and Control re-
spectively. Specifically, we covered the basic Monte-Carlo (MC) and Temporal-Difference
(TD) techniques. We want to highlight two key aspects of these basic RL algorithms:

1. The experiences data arrives in the form of a single unit of experience at a time (single
unit is a trace experience for MC and an atomic experience for TD), the unit of experience
is used by the algorithm for Value Function learning, and then that unit of experience
is not used later in the algorithm (essentially, that unit of experience, once consumed,
is not re-consumed for further learning later in the algorithm). It doesn’t have to be
this way - one can develop RL algorithms that re-use experience data - this approach
is known as Experience-Replay (in fact, we saw a glimpse of Experience-Replay in
Section 9.5.3 of Chapter 9).

2. Learning occurs in a granularly incremental manner, by updating the Value Function


after each unit of experience. It doesn’t have to be this way - one can develop RL
algorithms that take an entire batch of experiences (or in fact, all of the experiences
that one could possibly get), and learn the Value Function directly for that entire
batch of experiences. A key idea here is that if we know in advance what experi-
ences data we have (or will have), and if we collect and organize all of that data,
then we could directly (i.e., not incrementally) estimate the Value Function for that
experiences data set. This approach to RL is known as Batch RL (versus the basic
RL algorithms we covered in the previous chapters that can be termed as Incremental
RL).

Thus, we have a choice of doing Experience-Replay or not, and we have a choice of do-
ing Batch RL or Incremental RL. In fact, some of the interesting and practically effective
algorithms combine both the ideas of Experience-Replay and Batch RL. This chapter starts
with the coverage of Batch RL and Experience-Replay. Then, we cover some key algo-
rithms (including Deep Q-Networks and Least Squares Policy Iteration) that effectively
leverage Batch RL and/or Experience-Replay. Next, we look deeper into the issue of the
Deadly Triad (that we had alluded to in Chapter 10) by viewing Value Functions as Vec-
tors (we had done this in Chapter 3), understand Value Function Vector transformations
with a balance of geometric intuition and mathematical rigor, providing insights into con-
vergence issues for a variety of traditional loss functions used to develop RL algorithms.
Finally, this treatment of Value Functions as Vectors leads us in the direction of overcom-
ing the Deadly Triad by defining an appropriate loss function, calculating whose gradient
provides a more robust set of RL algorithms known as Gradient Temporal Difference (ab-
breviated as Gradient TD).

11.1. Batch RL and Experience-Replay
Let us understand Incremental RL versus Batch RL in the context of fixed finite experiences
data. To make things simple and easy to understand, we first focus on understanding the
difference for the case of MC Prediction (i.e., to calculate the Value Function of an MRP
using Monte-Carlo). In fact, we had covered this setting in Section 9.5.3 of Chapter 9.
To refresh this setting, specifically we have access to a fixed finite sequence/stream of
MRP trace experiences (i.e., Iterable[Iterable[TransitionStep[S]]]), which we know
can be converted to returns-augmented data of the form Iterable[Iterable[ReturnStep[S]]]
(using the returns function, defined in the file rl/returns.py). Flattening this data to
Iterable[ReturnStep[S]] and extracting from it the (state, return) pairs gives us the fixed, finite training data for MC
Prediction, that we denote as follows:

D = [(Si , Gi )|1 ≤ i ≤ n]
We’ve learnt in Chapter 9 that we can do an Incremental MC Prediction estimation
V (s; w) by updating w after each MRP trace experience with the gradient calculation
∇w L(w) for each data pair (Si , Gi ), as follows:
L(Si ,Gi ) (w) = (1/2) · (V (Si ; w) − Gi )2
∇w L(Si ,Gi ) (w) = (V (Si ; w) − Gi ) · ∇w V (Si ; w)
∆w = α · (Gi − V (Si ; w)) · ∇w V (Si ; w)
The Incremental MC Prediction algorithm performs n updates in sequence for data pairs
(Si , Gi ), i = 1, 2, . . . , n using the update method of FunctionApprox. We note that Incremen-
tal RL makes inefficient use of available training data D because we essentially “discard”
each of these units of training data after it’s used to perform an update. We want to make
efficient use of the given data with Batch RL. Batch MC Prediction aims to estimate the
MRP Value Function V (s; w∗ ) such that

w∗ = arg minw (1/(2n)) · Σi=1..n (V (Si ; w) − Gi )2
   = arg minw E(S,G)∼D [(1/2) · (V (S; w) − G)2 ]
This in fact is the solve method of FunctionApprox on training data D. This approach is
called Batch RL because we first collect and store the entire set (batch) of data D available
to us, and then we find the best possible parameters w∗ fitting this data D. Note that unlike
Incremental RL, here we are not updating the MRP Value Function estimate while the data
arrives - we simply store the data as it arrives and start the MRP Value Function estima-
tion procedure once we are ready with the entire (batch) data D in storage. As we know
from the implementation of the solve method of FunctionApprox, finding the best possi-
ble parameters w∗ from the batch D involves calling the update method of FunctionApprox
with repeated use of the available data pairs (S, G) in the stored data set D. Each of these
updates to the parameters w is as follows:

∆w = α · (1/n) · Σi=1..n (Gi − V (Si ; w)) · ∇w V (Si ; w)
Note that unlike Incremental MC where each update to w uses data from a single trace
experience, each update to w in Batch MC uses all of the trace experiences data (all of the
batch data). If we keep doing these updates repeatedly, we will ultimately converge to the
desired MRP Value Function V (s; w∗ ). The repeated use of the available data in D means
that we are doing Batch MC Prediction using Experience-Replay. So we see that this makes
more efficient use of the available training data D due to the re-use of the data pairs in D.
The code for this Batch MC Prediction algorithm (batch_mc_prediction, defined in the file
rl/monte_carlo.py) is shown below. From the input trace experiences (traces in the code below), we first create the set of
ReturnStep transitions that span across the set of all input trace experiences (return_steps
in the code below). This involves calculating the return associated with each state encoun-
tered in traces (across all trace experiences). From return_steps, we create the (state,
return) pairs that constitute the fixed, finite training data D, which is then passed to the
solve method of approx: ValueFunctionApprox[S].

import rl.markov_process as mp
from rl.returns import returns
from rl.approximate_dynamic_programming import ValueFunctionApprox
import itertools
def batch_mc_prediction(
traces: Iterable[Iterable[mp.TransitionStep[S]]],
approx: ValueFunctionApprox[S],
gamma: float,
episode_length_tolerance: float = 1e-6,
convergence_tolerance: float = 1e-5
) -> ValueFunctionApprox[S]:
’’’traces is a finite iterable’’’
return_steps: Iterable[mp.ReturnStep[S]] = \
itertools.chain.from_iterable(
returns(trace, gamma, episode_length_tolerance) for trace in traces
)
return approx.solve(
[(step.state, step.return_) for step in return_steps],
convergence_tolerance
)

Now let’s move on to Batch TD Prediction. Here we have fixed, finite experiences data
D available as:
D = [(Si , Ri , Si′ )|1 ≤ i ≤ n]
where (Ri , Si′ ) is the pair of reward and next state from a state Si . So, Experiences Data D is
presented in the form of a fixed, finite number of atomic experiences. This is represented
in code as an Iterable[TransitionStep[S]].
Just like Batch MC Prediction, here in Batch TD Prediction, we first collect and store the
data as it arrives, and once we are ready with the batch of data D in storage, we start the
MRP Value Function estimation procedure. The parameters w are updated with repeated
use of the atomic experiences in the stored data D. Each of these updates to the parameters
w is as follows:

∆w = α · (1/n) · Σi=1..n (Ri + γ · V (Si′ ; w) − V (Si ; w)) · ∇w V (Si ; w)

Note that unlike Incremental TD where each update to w uses data from a single atomic
experience, each update to w in Batch TD uses all of the atomic experiences data (all of
the batch data). The repeated use of the available data in D means that we are doing Batch
TD Prediction using Experience-Replay. So we see that this makes more efficient use of the
available training data D due to the re-use of the data pairs in D.
The code for this Batch TD Prediction algorithm (batch_td_prediction, defined in the file
rl/td.py) is shown below. We create a Sequence[TransitionStep] from the fixed, finite-length input atomic
experiences D (transitions in the code below), and call the update method of FunctionApprox
repeatedly, passing the data D (now in the form of a Sequence[TransitionStep]) to each
invocation of the update method (using the function itertools.repeat). This repeated in-
vocation of the update method is done by using the function iterate.accumulate. This is
done until convergence (convergence based on the done function in the code below), at
which point we return the converged FunctionApprox.
import rl.markov_process as mp
from rl.approximate_dynamic_programming import ValueFunctionApprox, extended_vf
import rl.iterate as iterate
import itertools
import numpy as np
def batch_td_prediction(
transitions: Iterable[mp.TransitionStep[S]],
approx_0: ValueFunctionApprox[S],
gamma: float,
convergence_tolerance: float = 1e-5
) -> ValueFunctionApprox[S]:
’’’transitions is a finite iterable’’’
def step(
v: ValueFunctionApprox[S],
tr_seq: Sequence[mp.TransitionStep[S]]
) -> ValueFunctionApprox[S]:
return v.update([(
tr.state, tr.reward + gamma * extended_vf(v, tr.next_state)
) for tr in tr_seq])
def done(
a: ValueFunctionApprox[S],
b: ValueFunctionApprox[S],
convergence_tolerance=convergence_tolerance
) -> bool:
return b.within(a, convergence_tolerance)
return iterate.converged(
iterate.accumulate(
itertools.repeat(list(transitions)),
step,
initial=approx_0
),
done=done
)

Likewise, we can do Batch TD(λ) Prediction. Here we are given a fixed, finite number
of trace experiences

D = [(Si,0 , Ri,1 , Si,1 , Ri,2 , Si,2 , . . . , Ri,Ti , Si,Ti )|1 ≤ i ≤ n]

For trace experience i, for each time step t in the trace experience, we calculate the eligibility
traces as follows:

Ei,t = γλ · Ei,t−1 + ∇w V (Si,t ; w) for all t = 1, 2, . . . , Ti − 1

with the eligibility traces initialized at time 0 for trace experience i as Ei,0 = ∇w V (Si,0 ; w).
Then, each update to the parameters w is as follows:

∆w = α · (1/n) · Σi=1..n (1/Ti ) · Σt=0..Ti −1 (Ri,t+1 + γ · V (Si,t+1 ; w) − V (Si,t ; w)) · Ei,t      (11.1)
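A minimal sketch of this Batch TD(λ) update of Equation (11.1), assuming a linear function approximation V (s; w) = ϕ(s)T · w so that ∇w V (s; w) = ϕ(s). The function name and the tuple-based representation of trace experiences below are our own assumptions, not part of the rl library.

import numpy as np
from typing import Callable, Sequence, Tuple, TypeVar

S = TypeVar('S')

def batch_td_lambda_update(
    traces: Sequence[Sequence[Tuple[S, float, S, bool]]],  # each time step: (s, r, s', done)
    phi: Callable[[S], np.ndarray],                        # feature vector of a state
    w: np.ndarray,
    gamma: float,
    lambd: float,
    alpha: float
) -> np.ndarray:
    '''Performs a single batch update of w using all of the trace experiences.'''
    delta_w = np.zeros_like(w)
    for trace in traces:
        eligibility = phi(trace[0][0])      # E_{i,0} = grad_w V(S_{i,0}; w) = phi(S_{i,0})
        acc = np.zeros_like(w)
        for t, (s, r, s1, done) in enumerate(trace):
            if t > 0:
                eligibility = gamma * lambd * eligibility + phi(s)
            v_next = 0. if done else phi(s1).dot(w)
            td_error = r + gamma * v_next - phi(s).dot(w)
            acc += td_error * eligibility
        delta_w += acc / len(trace)         # the 1/T_i factor
    return w + alpha * delta_w / len(traces)    # the alpha and 1/n factors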

11.2. A generic implementation of Experience-Replay


Before we proceed to more algorithms involving Experience-Replay and/or Batch RL, it is
vital to recognize that the concept of Experience-Replay stands on its own, independent of
its use in Batch RL. In fact, Experience-Replay is a much broader concept, beyond its use
in RL. The idea of Experience-Replay is that we have a stream of data coming in and instead
of consuming it in an algorithm as soon as it arrives, we store each unit of incoming data
in memory (which we shall call Experience-Replay-Memory, abbreviated as ER-Memory),
and use samples of data from ER-Memory (with replacement) for our algorithm’s needs.
Thus, we are routing the incoming stream of data to ER-Memory and sourcing data needed
for our algorithm from ER-Memory (by sampling with replacement). This enables re-use
of the incoming data stream. It also gives us flexibility to sample an arbitrary number of
data units at a time, so our algorithm doesn’t need to be limited to using a single unit of
data at a time. Lastly, we organize the data in ER-Memory in such a manner that we can
assign different sampling weights to different units of data, depending on the arrival time
of the data. This is quite useful for many algorithms that wish to give more importance to
recently arrived data and de-emphasize/forget older data.
Let us now write some code to implement all of these ideas described above. The
code below uses an arbitrary data type T, which means that the unit of data being han-
dled with Experience-Replay could be any data structure (specifically, not limited to the
TransitionStep data type that we care about for RL with Experience-Replay).
The attribute saved_transitions: List[T] is the data structure storing the incoming
units of data, with the most recently arrived unit of data at the end of the list (since
we append to the list). The attribute time_weights_func lets the user specify a function
from the reverse-time-stamp of a unit of data to the sampling weight to assign to that
unit of data (“reverse-time-stamp” means the most recently-arrived unit of data has a
time-index of 0, although physically it is stored at the end of the list, rather than at the
start). The attribute weights simply stores the sampling weights of all units of data in
saved_transitions, and the attribute weights_sum stores the sum of the weights (the at-
tributes weights and weights_sum are there purely for computational efficiency to avoid
too many calls to time_weights_func and avoidance of summing a long list of weights,
which is required to normalize the weights to sum up to 1).
add_data appends an incoming unit of data (transition: T) to self.saved_transitions
and updates self.weights and self.weights_sum. sample_mini_batch returns a sample
of specified size mini_batch_size, using the sampling weights in self.weights. We also
have a method replay that takes as input an Iterable of transitions and a mini_batch_size,
and returns an Iterator of mini_batch_sized data units. As long as the input transitions:
Iterable[T] is not exhausted, replay appends each unit of data in transitions to self.saved_transitions
and then yields a mini_batch_sized sample of data. Once transitions: Iterable[T] is ex-
hausted, it simply yields the samples of data. The Iterator generated by replay can be
piped to any algorithm that expects an Iterable of the units of data as input, essentially
enabling us to replace the pipe carrying an input data stream with a pipe carrying the data
stream sourced from ER-Memory.

from typing import Callable, Generic, Iterable, Iterator, List, Sequence, TypeVar
from rl.distribution import Categorical

T = TypeVar('T')
class ExperienceReplayMemory(Generic[T]):
saved_transitions: List[T]
time_weights_func: Callable[[int], float]
weights: List[float]
weights_sum: float
def __init__(
self,
time_weights_func: Callable[[int], float] = lambda _: 1.0,
):
self.saved_transitions = []
self.time_weights_func = time_weights_func
self.weights = []
self.weights_sum = 0.0
def add_data(self, transition: T) -> None:
self.saved_transitions.append(transition)
weight: float = self.time_weights_func(len(self.saved_transitions) - 1)
self.weights.append(weight)
self.weights_sum += weight
def sample_mini_batch(self, mini_batch_size: int) -> Sequence[T]:
num_transitions: int = len(self.saved_transitions)
return Categorical(
{tr: self.weights[num_transitions - 1 - i] / self.weights_sum
for i, tr in enumerate(self.saved_transitions)}
).sample_n(min(mini_batch_size, num_transitions))
def replay(
self,
transitions: Iterable[T],
mini_batch_size: int
) -> Iterator[Sequence[T]]:
for transition in transitions:
self.add_data(transition)
yield self.sample_mini_batch(mini_batch_size)
while True:
yield self.sample_mini_batch(mini_batch_size)

The code above is in the file rl/experience_replay.py. We encourage you to implement


Batch MC Prediction and Batch TD Prediction using this ExperienceReplayMemory class.
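As a hedged starting point, here is a sketch of Tabular TD(0) Prediction driven by mini-batches sampled from the ExperienceReplayMemory class above. The function name and the simple dictionary-based Value Function are our own simplifications (the book's versions would instead update a ValueFunctionApprox).

import itertools
from typing import Dict, Iterable, TypeVar
import rl.markov_process as mp
from rl.markov_process import NonTerminal
from rl.experience_replay import ExperienceReplayMemory

S = TypeVar('S')

def replay_tabular_td_prediction(
    transitions: Iterable[mp.TransitionStep[S]],
    gamma: float,
    alpha: float,
    mini_batch_size: int,
    num_mini_batches: int
) -> Dict[NonTerminal[S], float]:
    v: Dict[NonTerminal[S], float] = {}
    memory: ExperienceReplayMemory[mp.TransitionStep[S]] = ExperienceReplayMemory()
    # route the incoming stream into ER-Memory and consume sampled mini-batches instead
    for mini_batch in itertools.islice(
            memory.replay(transitions, mini_batch_size), num_mini_batches):
        for tr in mini_batch:
            v_next = v.get(tr.next_state, 0.)   # 0 for terminal or not-yet-seen states
            v[tr.state] = v.get(tr.state, 0.) + \
                alpha * (tr.reward + gamma * v_next - v.get(tr.state, 0.))
    return v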

11.3. Least-Squares RL Prediction


We’ve seen how Batch RL Prediction is an iterative process of weight updates until conver-
gence - the MRP Value Function is updated with repeated use of the fixed, finite (batch)
data that is made available. However, if we assume that the MRP Value Function approxi-
mation V (s; w) is a linear function approximation (linear in a set of feature functions of the
state space), then we can solve for the MRP Value Function with direct and simple linear
algebra operations (ie., without the need for iterative weight updates until convergence).
Let us see how.
We define a sequence of feature functions ϕj : N → R, j = 1, 2, . . . , m and we assume
that the parameters w form a weights vector w = (w1 , w2 , . . . , wm ) ∈ Rm . Therefore, the MRP
Value Function is approximated as:

V (s; w) = Σj=1..m ϕj (s) · wj = ϕ(s)T · w for all s ∈ N

where ϕ(s) ∈ Rm is the feature vector for state s.
The direct solution of the MRP Value Function using simple linear algebra operations is
known as Least-Squares (abbreviated as LS) solution. We start with Batch MC Prediction
for the case of linear function approximation, which is known as Least-Squares Monte-
Carlo (abbreviated as LSMC).

11.3.1. Least-Squares Monte-Carlo (LSMC)


For the case of linear function approximation, the loss function for Batch MC Prediction
with data [(Si , Gi )|1 ≤ i ≤ n] is:

L(w) = (1/(2n)) · Σi=1..n (Σj=1..m ϕj (Si ) · wj − Gi )2 = (1/(2n)) · Σi=1..n (ϕ(Si )T · w − Gi )2

We set the gradient of this loss function to 0, and solve for w∗ . This yields:

Σi=1..n ϕ(Si ) · (ϕ(Si )T · w∗ − Gi ) = 0

We can calculate the solution w∗ as A−1 · b, where the m × m Matrix A is accumulated at


each data pair (Si , Gi ) as:

A ← A + ϕ(Si ) · ϕ(Si )T (i.e., outer-product of ϕ(Si ) with itself)

and the m-Vector b is accumulated at each data pair (Si , Gi ) as:

b ← b + ϕ(Si ) · Gi

To implement this algorithm, we can simply call batch_mc_prediction that we had writ-
ten earlier by setting the argument approx as LinearFunctionApprox and by setting the at-
tribute direct_solve in approx: LinearFunctionApprox[S] as True. If you read the code
under direct_solve=True branch in the solve method, you will see that it indeed performs
the above-described linear algebra calculations. The inversion of the matrix A is O(m3 )
complexity. However, we can speed up the algorithm to be O(m2 ) with a different imple-
mentation - we can maintain the inverse of A after each (Si , Gi ) update to A by applying
the Sherman-Morrison formula for incremental inverse (Sherman and Morrison 1950).
The Sherman-Morrison incremental inverse for A is as follows:

(A + ϕ(Si ) · ϕ(Si )T )−1 = A−1 − (A−1 · ϕ(Si ) · ϕ(Si )T · A−1 ) / (1 + ϕ(Si )T · A−1 · ϕ(Si ))

with A−1 initialized to (1/ϵ) · Im , where Im is the m × m identity matrix, and ϵ ∈ R+ is a
small number provided as a parameter to the algorithm. 1/ϵ should be considered to be
a proxy for the step-size α, which is not required for least-squares algorithms. If ϵ is too
small, the sequence of inverses of A can be quite unstable, and if ϵ is too large, the learning
is slowed.
This brings down the computational complexity of this algorithm to O(m2 ). We won’t
implement the Sherman-Morrison incremental inverse for LSMC, but in the next subsec-
tion we shall implement it for Least-Squares Temporal Difference (LSTD).
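For readers who want to try it, here is a hedged sketch of what such an LSMC implementation might look like (our own illustrative function, not part of the rl library), operating directly on (state, return) pairs for a linear approximation V (s; w) = ϕ(s)T · w and maintaining A−1 with the Sherman-Morrison incremental inverse described above.

import numpy as np
from typing import Callable, Iterable, Sequence, Tuple, TypeVar

S = TypeVar('S')

def least_squares_mc(
    state_return_pairs: Iterable[Tuple[S, float]],
    feature_functions: Sequence[Callable[[S], float]],
    epsilon: float
) -> np.ndarray:
    '''Returns the weights w* = A^{-1} . b of the linear MRP Value Function approximation.'''
    m = len(feature_functions)
    a_inv: np.ndarray = np.eye(m) / epsilon     # A^{-1} initialized to (1/epsilon) * I_m
    b_vec: np.ndarray = np.zeros(m)
    for s, g in state_return_pairs:
        phi = np.array([f(s) for f in feature_functions])
        temp = a_inv.T.dot(phi)
        # Sherman-Morrison update of A^{-1} for the rank-one addition phi . phi^T
        a_inv = a_inv - np.outer(a_inv.dot(phi), temp) / (1 + phi.dot(temp))
        b_vec += phi * g
    return a_inv.dot(b_vec)

This mirrors the structure of the least_squares_td function shown in the next subsection, with ϕ(Si ) playing the role of both vectors in the rank-one update.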

11.3.2. Least-Squares Temporal-Difference (LSTD)
For the case of linear function approximation, the loss function for Batch TD Prediction
with data [(Si , Ri , Si′ )|1 ≤ i ≤ n] is:

L(w) = (1/(2n)) · Σi=1..n (ϕ(Si )T · w − (Ri + γ · ϕ(Si′ )T · w))2

We set the semi-gradient of this loss function to 0, and solve for w∗ . This yields:

Σi=1..n ϕ(Si ) · (ϕ(Si )T · w∗ − (Ri + γ · ϕ(Si′ )T · w∗ )) = 0

We can calculate the solution w∗ as A−1 · b, where the m × m Matrix A is accumulated at


each atomic experience (Si , Ri , Si′ ) as:

A ← A + ϕ(Si ) · (ϕ(Si ) − γ · ϕ(Si′ ))T (note the Outer-Product)

and the m-Vector b is accumulated at each atomic experience (Si , Ri , Si′ ) as:

b ← b + ϕ(Si ) · Ri

With Sherman-Morrison incremental inverse, we can reduce the computational complexity
from O(m3 ) to O(m2 ), as follows:

(A + ϕ(Si ) · (ϕ(Si ) − γ · ϕ(Si′ ))T )−1 = A−1 − (A−1 · ϕ(Si ) · (ϕ(Si ) − γ · ϕ(Si′ ))T · A−1 ) / (1 + (ϕ(Si ) − γ · ϕ(Si′ ))T · A−1 · ϕ(Si ))

with A−1 initialized to (1/ϵ) · Im , where Im is the m × m identity matrix, and ϵ ∈ R+ is a
small number provided as a parameter to the algorithm.
This algorithm is known as the Least-Squares Temporal-Difference (LSTD) algorithm
and is due to Bradtke and Barto (Bradtke and Barto 1996).
Now let’s write some code to implement this LSTD algorithm. The arguments transitions,
feature_functions, gamma and epsilon of the function least_squares_td below are quite
self-explanatory. This is a batch method with direct calculation of the estimated Value
Function from batch data (rather than iterative weight updates), so least_squares_td re-
turns the estimated Value Function of type LinearFunctionApprox[NonTerminal[S]], rather
than an Iterator over the updated function approximations (as was the case in Incremen-
tal RL algorithms).
The code below should be fairly self-explanatory. a_inv refers to A−1 which is updated
with the Sherman-Morrison incremental inverse method. b_vec refers to the b vector. phi1
refers to ϕ(Si ), phi2 refers to ϕ(Si ) − γ · ϕ(Si′ ) (except when Si′ is a terminal state, in which
case phi2 is simply ϕ(Si )). The temporary variable temp refers to (A−1 )T ·(ϕ(Si )−γ ·ϕ(Si′ ))
and is used both in the numerator and denominator in the Sherman-Morrison formula to
update A−1 .
from rl.function_approx import LinearFunctionApprox, Weights
import rl.markov_process as mp
import numpy as np
def least_squares_td(
transitions: Iterable[mp.TransitionStep[S]],
feature_functions: Sequence[Callable[[NonTerminal[S]], float]],
gamma: float,
epsilon: float
) -> LinearFunctionApprox[NonTerminal[S]]:
’’’ transitions is a finite iterable ’’’
num_features: int = len(feature_functions)
a_inv: np.ndarray = np.eye(num_features) / epsilon
b_vec: np.ndarray = np.zeros(num_features)
for tr in transitions:
phi1: np.ndarray = np.array([f(tr.state) for f in feature_functions])
if isinstance(tr.next_state, NonTerminal):
phi2 = phi1 - gamma * np.array([f(tr.next_state)
for f in feature_functions])
else:
phi2 = phi1
temp: np.ndarray = a_inv.T.dot(phi2)
a_inv = a_inv - np.outer(a_inv.dot(phi1), temp) / (1 + phi1.dot(temp))
b_vec += phi1 * tr.reward
opt_wts: np.ndarray = a_inv.dot(b_vec)
return LinearFunctionApprox.create(
feature_functions=feature_functions,
weights=Weights.create(opt_wts)
)

The code above is in the file rl/td.py.


Now let’s test this on transitions data sampled from the RandomWalkMRP example we had
constructed in Chapter 9. As a reminder, this MRP consists of a random walk across states
{0, 1, 2, . . . , B} with 0 and B as the terminal states (think of these as terminating barriers
of a random walk) and the remaining states as the non-terminal states. From any non-
terminal state i, we transition to state i + 1 with probability p and to state i − 1 with prob-
ability 1 − p. The reward is 0 upon each transition, except if we transition from state B − 1
to terminal state B which results in a reward of 1. The code for RandomWalkMRP is in the file
rl/chapter10/random_walk_mrp.py.
First we set up a RandomWalkMRP object with B = 20, p = 0.55 and calculate its true Value
Function (so we can later compare against Incremental TD and LSTD methods).
from rl.chapter10.random_walk_mrp import RandomWalkMRP
import numpy as np
this_barrier: int = 20
this_p: float = 0.55
random_walk: RandomWalkMRP = RandomWalkMRP(
barrier=this_barrier,
p=this_p
)
gamma = 1.0
true_vf: np.ndarray = random_walk.get_value_function_vec(gamma=gamma)

Let’s say we have access to only 10,000 transitions (each transition is an object of the type
TransitionStep). First we generate these 10,000 sampled transitions from the RandomWalkMRP
object we created above.
from rl.approximate_dynamic_programming import NTStateDistribution
from rl.markov_process import TransitionStep
from rl.distribution import Choose
import itertools
num_transitions: int = 10000
nt_states: Sequence[NonTerminal[int]] = random_walk.non_terminal_states
start_distribution: NTStateDistribution[int] = Choose(set(nt_states))
traces: Iterable[Iterable[TransitionStep[int]]] = \
random_walk.reward_traces(start_distribution)
transitions: Iterable[TransitionStep[int]] = \
itertools.chain.from_iterable(traces)

td_transitions: Iterable[TransitionStep[int]] = \
itertools.islice(transitions, num_transitions)

Before running LSTD, let’s run Incremental Tabular TD on the 10,000 transitions in
td_transitions and obtain the resultant Value Function (td_vf in the code below). Since
there are only 10,000 transitions, we use an aggressive initial learning rate of 0.5 to promote
fast learning, but we let this high learning rate decay quickly so the learning stabilizes.

from rl.function_approx import Tabular, learning_rate_schedule
from rl.td import td_prediction
import rl.iterate as iterate
initial_learning_rate: float = 0.5
half_life: float = 1000
exponent: float = 0.5
approx0: Tabular[NonTerminal[int]] = Tabular(
count_to_weight_func=learning_rate_schedule(
initial_learning_rate=initial_learning_rate,
half_life=half_life,
exponent=exponent
)
)
td_func: Tabular[NonTerminal[int]] = \
iterate.last(itertools.islice(
td_prediction(
transitions=td_transitions,
approx_0=approx0,
gamma=gamma
),
num_transitions
))
td_vf: np.ndarray = td_func.evaluate(nt_states)

Finally, we run the LSTD algorithm on 10,000 transitions. Note that the Value Function
of RandomWalkMRP, for p ̸= 0.5, is non-linear as a function of the integer states. So we use
non-linear features that can approximate arbitrary non-linear shapes - a good choice is the
set of (orthogonal) Laguerre Polynomials. In the code below, we use the first 5 Laguerre
Polynomials (i.e., up to a degree-4 polynomial) as the feature functions for the linear function
approximation of the Value Function. Then we invoke the LSTD algorithm we wrote above
to calculate the LinearFunctionApprox based on this batch of 10,000 transitions.

from rl.chapter12.laguerre import laguerre_state_features


from rl.function_approx import LinearFunctionApprox
num_polynomials: int = 5
features: Sequence[Callable[[NonTerminal[int]], float]] = \
laguerre_state_features(num_polynomials)
lstd_transitions: Iterable[TransitionStep[int]] = \
itertools.islice(transitions, num_transitions)
epsilon: float = 1e-4
lstd_func: LinearFunctionApprox[NonTerminal[int]] = \
least_squares_td(
transitions=lstd_transitions,
feature_functions=features,
gamma=gamma,
epsilon=epsilon
)
lstd_vf: np.ndarray = lstd_func.evaluate(nt_states)

Figure 11.1 depicts how the LSTD Value Function estimate (for 10,000 transitions) lstd_vf
compares against Incremental Tabular TD Value Function estimate (for 10,000 transitions)

Figure 11.1.: LSTD and Tabular TD Value Functions

td_vf and against the true value function true_vf (obtained using the linear-algebra-solver-
based calculation of the MRP Value Function). We encourage you to modify the parame-
ters used in the code above to see how it alters the results - specifically play around with
this_barrier, this_p, gamma, num_transitions, the learning rate trajectory for Incremental
Tabular TD, the number of Laguerre polynomials, and epsilon. The above code is in the
file rl/chapter12/random_walk_lstd.py.

11.3.3. LSTD(λ)
Likewise, we can do LSTD(λ) using Eligibility Traces. Here we are given a fixed, finite
number of trace experiences

D = [(Si,0 , Ri,1 , Si,1 , Ri,2 , Si,2 , . . . , Ri,Ti , Si,Ti )|1 ≤ i ≤ n]

Denote the Eligibility Traces of trace experience i at time t as Ei,t . Note that the eligibility
traces accumulate ∇w V (s; w) = ϕ(s) in each trace experience. When accumulating, the
previous time step’s eligibility traces are discounted by λγ. By setting the right-hand-side
of Equation (11.1) to 0 (i.e., setting the update to w over all atomic experiences data to 0),
we get:

Σ_{i=1}^{n} (1/Ti ) · Σ_{t=0}^{Ti −1} Ei,t · (ϕ(Si,t )T · w∗ − (Ri,t+1 + γ · ϕ(Si,t+1 )T · w∗ )) = 0

We can calculate the solution w∗ as A−1 · b, where the m × m Matrix A is accumulated at each atomic experience (Si,t , Ri,t+1 , Si,t+1 ) as:

A ← A + (1/Ti ) · Ei,t · (ϕ(Si,t ) − γ · ϕ(Si,t+1 ))T (note the Outer-Product)

On/Off Policy   Algorithm     Tabular   Linear   Non-Linear
On-Policy       MC            ✓         ✓        ✓
                LSMC          ✓         ✓        -
                TD            ✓         ✓        ✗
                LSTD          ✓         ✓        -
                Gradient TD   ✓         ✓        ✓
Off-Policy      MC            ✓         ✓        ✓
                LSMC          ✓         ✗        -
                TD            ✓         ✗        ✗
                LSTD          ✓         ✗        -
                Gradient TD   ✓         ✓        ✓

Figure 11.2.: Convergence of RL Prediction Algorithms

and the m-Vector b is accumulated at each atomic experience (Si,t , Ri,t+1 , Si,t+1 ) as:

b ← b + (1/Ti ) · Ei,t · Ri,t+1
With Sherman-Morrison incremental inverse, we can reduce the computational complex-
ity from O(m3 ) to O(m2 ).
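Purely as an illustration (this is not code from the book's accompanying rl library), the accumulation above might be sketched as follows for a batch of trace experiences. For brevity, this sketch accumulates A directly and solves A · w = b at the end, rather than maintaining A−1 incrementally via Sherman-Morrison; the function name least_squares_td_lambda is ours.

from typing import Callable, Iterable, Sequence, TypeVar
import numpy as np
import rl.markov_process as mp
from rl.markov_process import NonTerminal
from rl.function_approx import LinearFunctionApprox, Weights

S = TypeVar('S')

def least_squares_td_lambda(
    traces: Iterable[Sequence[mp.TransitionStep[S]]],
    feature_functions: Sequence[Callable[[NonTerminal[S]], float]],
    gamma: float,
    lambd: float,
    epsilon: float
) -> LinearFunctionApprox[NonTerminal[S]]:
    '''traces is a finite iterable of finite trace experiences'''
    m: int = len(feature_functions)
    a_mat: np.ndarray = np.eye(m) * epsilon      # A starts at epsilon * I_m
    b_vec: np.ndarray = np.zeros(m)
    for trace in traces:
        t_i: int = len(trace)
        el_tr: np.ndarray = np.zeros(m)          # eligibility trace E_{i,t}
        for tr in trace:
            phi1: np.ndarray = np.array([f(tr.state) for f in feature_functions])
            el_tr = gamma * lambd * el_tr + phi1     # discount by lambda*gamma, then accumulate phi
            if isinstance(tr.next_state, NonTerminal):
                phi2 = phi1 - gamma * np.array(
                    [f(tr.next_state) for f in feature_functions])
            else:
                phi2 = phi1
            a_mat += np.outer(el_tr, phi2) / t_i     # A accumulation (outer product)
            b_vec += el_tr * tr.reward / t_i         # b accumulation
    return LinearFunctionApprox.create(
        feature_functions=feature_functions,
        weights=Weights.create(np.linalg.solve(a_mat, b_vec))
    )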

11.3.4. Convergence of Least-Squares Prediction


Before we move on to Least-Squares for the Control problem, we want to point out that the
convergence behavior of Least-Squares Prediction algorithms is identical to their coun-
terpart Incremental RL Prediction algorithms, with the exception that Off-Policy LSMC
does not have convergence guarantees. Figure 11.2 shows the updated summary table for
convergence of RL Prediction algorithms (that we had displayed at the end of Chapter 10)
to now also include Least-Squares Prediction algorithms.
This ends our coverage of Least-Squares Prediction. Before we move on to Least-Squares
Control, we need to cover Incremental RL Control with Experience-Replay as it serves as
a stepping stone towards Least-Squares Control.

11.4. Q-Learning with Experience-Replay


In this section, we cover Off-Policy Incremental TD Control with Experience-Replay. Specif-
ically, we revisit the Q-Learning algorithm we covered in Chapter 10, but we tweak that
algorithm such that the transitions used to make the Q-Learning updates are sourced from
an experience-replay memory, rather than from a behavior policy derived from the cur-
rent Q-Value estimate. While investigating the challenges with Off-Policy TD methods
with deep learning function approximation, researchers identified two challenges:

1) The sequences of states made available to deep learning through trace experiences
are highly correlated, whereas deep learning algorithms are premised on data sam-
ples being independent.
2) The data distribution changes as the RL algorithm learns new behaviors, whereas
deep learning algorithms are premised on a fixed underlying distribution (i.e., sta-
tionary).

Experience-Replay serves to smooth the training data distribution over many past be-
haviors, effectively resolving the correlation issue as well as the non-stationary issue. Hence,
Experience-Replay is a powerful idea for Off-Policy TD Control. The idea of using Experience-
Replay for Off-Policy TD Control is due to the Ph.D. thesis of Long Lin (Lin 1993).
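The implementation below relies on a memory of past atomic experiences. Here is a rough sketch of the core idea behind such an experience-replay memory: store atomic experiences as they arrive and sample mini-batches from them, weighting recently-added items more heavily. The class below is purely illustrative; the actual ExperienceReplayMemory class in rl/experience_replay.py (used in the code that follows) is more full-featured and may differ in its details.

from typing import Callable, Generic, List, Sequence, TypeVar
import numpy as np

T = TypeVar('T')

class SimpleReplayMemory(Generic[T]):
    '''Illustrative stand-in for an experience-replay memory.'''

    def __init__(self, time_weights_func: Callable[[int], float]):
        self.data: List[T] = []
        self.time_weights_func = time_weights_func

    def add_data(self, item: T) -> None:
        self.data.append(item)

    def sample_mini_batch(self, size: int) -> Sequence[T]:
        n = len(self.data)
        # weight of item i depends on its age n - 1 - i (0 = most recently added)
        wts = np.array([self.time_weights_func(n - 1 - i) for i in range(n)])
        idx = np.random.choice(n, size=min(size, n), p=wts / wts.sum())
        return [self.data[i] for i in idx]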
To make this idea of Q-Learning with Experience-Replay clear, we make a few changes to the q_learning function we had written in Chapter 10, resulting in the following function q_learning_experience_replay:
from typing import Callable, Iterator, Sequence
from rl.markov_decision_process import MarkovDecisionProcess, TransitionStep
from rl.markov_process import NonTerminal
from rl.policy import Policy
from rl.approximate_dynamic_programming import QValueFunctionApprox
from rl.approximate_dynamic_programming import NTStateDistribution
from rl.experience_replay import ExperienceReplayMemory
PolicyFromQType = Callable[
[QValueFunctionApprox[S, A], MarkovDecisionProcess[S, A]],
Policy[S, A]
]
def q_learning_experience_replay(
mdp: MarkovDecisionProcess[S, A],
policy_from_q: PolicyFromQType,
states: NTStateDistribution[S],
approx_0: QValueFunctionApprox[S, A],
gamma: float,
max_episode_length: int,
mini_batch_size: int,
weights_decay_half_life: float
) -> Iterator[QValueFunctionApprox[S, A]]:
exp_replay: ExperienceReplayMemory[TransitionStep[S, A]] = \
ExperienceReplayMemory(
time_weights_func=lambda t: 0.5 ** (t / weights_decay_half_life),
)
q: QValueFunctionApprox[S, A] = approx_0
yield q
while True:
state: NonTerminal[S] = states.sample()
steps: int = 0
while isinstance(state, NonTerminal) and steps < max_episode_length:
policy: Policy[S, A] = policy_from_q(q, mdp)
action: A = policy.act(state).sample()
next_state, reward = mdp.step(state, action).sample()
exp_replay.add_data(TransitionStep(
state=state,
action=action,
next_state=next_state,
reward=reward
))
trs: Sequence[TransitionStep[S, A]] = \
exp_replay.sample_mini_batch(mini_batch_size)
q = q.update(
[(
(tr.state, tr.action),
tr.reward + gamma * (
max(q((tr.next_state, a))
for a in mdp.actions(tr.next_state))
if isinstance(tr.next_state, NonTerminal) else 0.)
) for tr in trs],
)
yield q
steps += 1
state = next_state

The key difference between the q_learning algorithm we wrote in Chapter 10 and this
q_learning_experience_replay algorithm is that here we have an experience-replay mem-
ory (using the ExperienceReplayMemory class we had implemented earlier). In the q_learning

algorithm, the (state, action, next_state, reward) 4-tuple comprising TransitionStep (that
is used to perform the Q-Learning update) was the result of action being sampled from
the behavior policy (derived from the current estimate of the Q-Value Function, eg: ϵ-
greedy), and then the next_state and reward being generated from the (state, action)
pair using the step method of mdp. Here in q_learning_experience_replay, we don’t use
this 4-tuple TransitionStep to perform the update - rather, we append this 4-tuple to the
ExperienceReplayMemory (using the add_data method), then we sample mini_batch_sized
TransitionSteps from the ExperienceReplayMemory (giving more sampling weightage to
the more recently added TransitionSteps), and use those 4-tuple TransitionSteps to per-
form the Q-Learning update. Note that these sampled TransitionSteps might be from
old behavior policies (derived from old estimates of the Q-Value estimate). The key is
that this algorithm re-uses atomic experiences that were previously prepared by the algo-
rithm, which also means that it re-uses behavior policies that were previously constructed
by the algorithm.
The argument mini_batch_size refers to the number of TransitionSteps to be drawn
from the ExperienceReplayMemory at each step. The argument weights_decay_half_life
refers to the half life of an exponential decay function for the weights used in the sampling
of the TransitionSteps (the most recently added TransitionStep has the highest weight).
With this understanding, the code should be self-explanatory.
The above code is in the file rl/td.py.

11.4.1. Deep Q-Networks (DQN) Algorithm


DeepMind developed an innovative and practically effective RL Control algorithm based
on Q-Learning with Experience-Replay - an algorithm they named Deep Q-Networks
(abbreviated as DQN). Apart from reaping the above-mentioned benefits of Experience-
Replay for Q-Learning with a Deep Neural Network approximating the Q-Value function,
they also benefited from employing a second Deep Neural Network (let us call the main
DNN the Q-Network, referring to its parameters as w, and the second DNN the target
network, referring to its parameters as w− ). The parameters w− of the target network are
infrequently updated to be made equal to the parameters w of the Q-network. The purpose
of the Q-Network is to evaluate the Q-Value of the current state s and the purpose of the
target network is to evaluate the Q-Value of the next state s′ , which in turn is used to obtain
the Q-Learning target (note that the Q-Value of the current state is Q(s, a; w) and the Q-
Learning target is r + γ · maxa′ Q(s′ , a′ ; w− ) for a given atomic experience (s, a, r, s′ )).
Deep Learning is premised on the fact that the supervised learning targets (response
values y corresponding to predictor values x) are pre-generated fixed values. This is not
the case in TD learning where the targets are dependent on the Q-Values. As Q-Values
are updated at each step, the targets also get updated, and this correlation between the
current state’s Q-Value estimate and the target value typically leads to oscillations or di-
vergence of the Q-Value estimate. By infrequently updating the parameters w− of the
target network (providing the target values) to be made equal to the parameters w of the
Q-network (which are updated at each iteration), the targets in the Q-Learning update
are essentially kept fixed. This goes a long way in resolving the core issue of correlation
between the current state’s Q-Value estimate and the target values, helping considerably
with convergence of the Q-Learning algorithm. Thus, DQN reaps the benefits of not just
Experience-Replay in Q-Learning (which we articulated earlier), but also the benefits of
having “fixed” targets. DQN utilizes a parameter C such that the updating of w− to be
made equal to w is done once every C updates to w (updates to w are based on the usual
Q-Learning update equation).
We won’t implement the full DQN algorithm in Python code - however, we sketch the outline of the algorithm as follows (a small toy illustration of the loop structure appears after the outline):
At each time t for each episode:

• Given state St , take action At according to ϵ-greedy policy extracted from Q-network
values Q(St , a; w).
• Given state St and action At , obtain reward Rt+1 and next state St+1 from the envi-
ronment.
• Append atomic experience (St , At , Rt+1 , St+1 ) in experience-replay memory D.
• Sample a random mini-batch of atomic experiences (si , ai , ri , s′i ) ∼ D.
• Using this mini-batch of atomic experiences, update the Q-network parameters w
with the Q-learning targets based on “frozen” parameters w− of the target network.
∆w = α · Σ_i (ri + γ · max_{a′i } Q(s′i , a′i ; w− ) − Q(si , ai ; w)) · ∇w Q(si , ai ; w)

• St ← St+1
• Once every C time steps, set w− ← w.
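The toy numerical sketch below illustrates the loop structure of the outline above. It is not an implementation of DQN: a linear one-hot "network" and a trivial made-up corridor environment stand in for the deep neural network and the real environment, and all hyperparameters (n_states, n_actions, alpha, eps, C, batch_size, number of steps) are arbitrary choices of ours.

import numpy as np
from collections import deque
import random

n_states, n_actions, gamma = 10, 2, 0.95

def features(s: int, a: int) -> np.ndarray:
    x = np.zeros(n_states * n_actions)
    x[s * n_actions + a] = 1.0           # one-hot features (stand-in for a DNN)
    return x

def env_step(s: int, a: int):
    s2 = min(max(s + (1 if a == 1 else -1), 0), n_states - 1)
    r = 1.0 if s2 == n_states - 1 else 0.0
    return s2, r

w = np.zeros(n_states * n_actions)       # Q-network parameters w
w_tgt = w.copy()                         # target-network parameters w^-
memory = deque(maxlen=10000)             # experience-replay memory D
alpha, eps, C, batch_size = 0.1, 0.1, 50, 32
s, t = 0, 0
for _ in range(5000):
    q_s = [w @ features(s, a) for a in range(n_actions)]
    a = random.randrange(n_actions) if random.random() < eps else int(np.argmax(q_s))
    s2, r = env_step(s, a)
    memory.append((s, a, r, s2))         # append atomic experience to D
    sample = random.sample(list(memory), min(batch_size, len(memory)))
    for (si, ai, ri, si2) in sample:     # Q-learning update with "frozen" targets from w^-
        target = ri + gamma * max(w_tgt @ features(si2, a2) for a2 in range(n_actions))
        w += alpha / len(sample) * (target - w @ features(si, ai)) * features(si, ai)
    s, t = s2, t + 1
    if t % C == 0:
        w_tgt = w.copy()                 # sync target network every C steps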

To learn more about the effectiveness of DQN for Atari games, see the Original DQN
Paper (Mnih et al. 2013) and the DQN Nature Paper (Mnih et al. 2015) that DeepMind
has published.
Now we are ready to cover Batch RL Control (specifically Least-Squares TD Control),
which combines the ideas of Least-Squares TD Prediction and Q-Learning with Experience-
Replay.

11.5. Least-Squares Policy Iteration (LSPI)


Having seen Least-Squares Prediction, the natural question is whether we can extend the
Least-Squares (batch with linear function approximation) methodology to solve the Con-
trol problem. For On-Policy MC Control and On-Policy TD Control, we take the usual
route of Generalized Policy Iteration (GPI) with:

1. Policy Evaluation as Least-Squares Q-Value Prediction. Specifically, the Q-Value for


a policy π is approximated as:

Qπ (s, a) ≈ Q(s, a; w) = ϕ(s, a)T · w for all s ∈ N , for all a ∈ A

with a direct linear-algebraic solve for the linear function approximation weights w
using batch experiences data generated using policy π.
2. ϵ-Greedy Policy Improvement.

In this section, we focus on Off-Policy Control with Least-Squares TD. This algorithm is
known as Least-Squares Policy Iteration, abbreviated as LSPI, developed by Lagoudakis
and Parr (Lagoudakis and Parr 2003). LSPI has been an important go-to algorithm in the
history of RL Control because of its simplicity and effectiveness. The basic idea of LSPI is
that it does Generalized Policy Iteration (GPI) in the form of Q-Learning with Experience-
Replay, with the key being that instead of doing the usual Q-Learning update after each
atomic experience, we do batch Q-Learning for the Policy Evaluation phase of GPI. We

spend the rest of this section describing LSPI in detail and then implementing it in Python
code.
The input to LSPI is a fixed finite data set D, consisting of a set of (s, a, r, s′ ) atomic
experiences, i.e., a set of rl.markov_decision_process.TransitionStep objects, and the task
of LSPI is to determine the Optimal Q-Value Function (and hence, Optimal Policy) based
on this experiences data set D using an experience-replayed, batch Q-Learning technique
described below. Assume D consists of n atomic experiences, indexed as i = 1, 2, . . . n,
with atomic experience i denoted as (si , ai , ri , s′i ).
In LSPI, each iteration of GPI involves access to:

• The experiences data set D.


• A Deterministic Target Policy (call it πD ), that is made available from the previous
iteration of GPI.

Given D and πD , the goal of each iteration of GPI is to solve for weights w∗ that mini-
mizes:
L(w) = Σ_{i=1}^{n} (Q(si , ai ; w) − (ri + γ · Q(s′i , πD (s′i ); w)))2
     = Σ_{i=1}^{n} (ϕ(si , ai )T · w − (ri + γ · ϕ(s′i , πD (s′i ))T · w))2

The solution for the weights w∗ is attained by setting the semi-gradient of L(w) to 0, i.e.,

Σ_{i=1}^{n} ϕ(si , ai ) · (ϕ(si , ai )T · w∗ − (ri + γ · ϕ(s′i , πD (s′i ))T · w∗ )) = 0    (11.2)

We can calculate the solution w∗ as A−1 · b, where the m × m Matrix A is accumulated for
each TransitionStep (si , ai , ri , s′i ) as:

A ← A + ϕ(si , ai ) · (ϕ(si , ai ) − γ · ϕ(s′i , πD (s′i )))T

and the m-Vector b is accumulated at each atomic experience (si , ai , ri , s′i ) as:

b ← b + ϕ(si , ai ) · ri

With Sherman-Morrison incremental inverse, we can reduce the computational complex-


ity from O(m3 ) to O(m2 ).
This solved w∗ defines an updated Q-Value Function as follows:

Q(s, a; w∗ ) = ϕ(s, a)T · w∗ = Σ_{j=1}^{m} ϕj (s, a) · wj∗

This defines an updated, improved deterministic policy πD′ (serving as the Deterministic Target Policy for the next iteration of GPI):

πD′ (s) = arg max_a Q(s, a; w∗ ) for all s ∈ N

This least-squares solution of w∗ (Prediction) is known as Least-Squares Temporal Dif-


ference for Q-Value, abbreviated as LSTDQ. Thus, LSPI is GPI with LSTDQ and greedy

policy improvements. Note how LSTDQ in each iteration re-uses the same data D, i.e.,
LSPI does experience-replay.
We should point out here that the LSPI algorithm we described above should be consid-
ered as the standard variant of LSPI. However, we can design several other variants of LSPI,
in terms of how the experiences data is sourced and used. Firstly, we should note that the
experiences data D essentially provides the behavior policy for Q-Learning (along with the
consequent reward and next state transition). In the standard variant we described above,
since D is provided from an external source, the behavior policy that generates this data
D must come from an external source. It doesn’t have to be this way - we could generate
the experiences data from a behavior policy derived from the Q-Value estimates produced
by LSTDQ (eg: ϵ-greedy policy). This would mean the experiences data used in the al-
gorithm is not a fixed, finite data set, rather a variable, incrementally-produced data set.
Even if the behavior policy was external, the data set D might not be a fixed finite data set
- rather, it could be made available as an on-demand, variable data stream. Furthermore,
in each iteration of GPI, we could use a subset of the experiences data made available un-
til that point of time (rather than the approach of the standard variant of LSPI that uses
all of the available experiences data). If we choose to sample a subset of the available ex-
periences data, we might give more sampling-weightage to the more recently generated
data. This would especially be the case if the experiences data was being generated from
a policy derived from the Q-Value estimates produced by LSTDQ. In this case, we would
leverage the ExperienceReplayMemory class we’d written earlier.
Next, we write code to implement the standard variant of LSPI we described above. First,
we write a function to implement LSTDQ. As described above, the inputs to LSTDQ are
the experiences data D (transitions in the code below) and a deterministic target policy
πD (target_policy in the code below). Since we are doing a linear function approxima-
tion, the input also includes a set of features, described as functions of state and action
(feature_functions in the code below). Lastly, the inputs also include the discount factor
γ and the numerical control parameter ϵ. The code below should be fairly self-explanatory,
as it is a straightforward extension of LSTD (implemented in function least_squares_td
earlier). The key differences are that this is an estimate of the Action-Value (Q-Value)
function, rather than the State-Value Function, and the target used in the least-squares
calculation is the Q-Learning target (produced by the target_policy).
def least_squares_tdq(
transitions: Iterable[TransitionStep[S, A]],
feature_functions: Sequence[Callable[[Tuple[NonTerminal[S], A]], float]],
target_policy: DeterministicPolicy[S, A],
gamma: float,
epsilon: float
) -> LinearFunctionApprox[Tuple[NonTerminal[S], A]]:
’’’transitions is a finite iterable’’’
num_features: int = len(feature_functions)
a_inv: np.ndarray = np.eye(num_features) / epsilon
b_vec: np.ndarray = np.zeros(num_features)
for tr in transitions:
phi1: np.ndarray = np.array([f((tr.state, tr.action))
for f in feature_functions])
if isinstance(tr.next_state, NonTerminal):
phi2 = phi1 - gamma * np.array([
f((tr.next_state, target_policy.action_for(tr.next_state.state)))
for f in feature_functions])
else:
phi2 = phi1
temp: np.ndarray = a_inv.T.dot(phi2)
a_inv = a_inv - np.outer(a_inv.dot(phi1), temp) / (1 + phi1.dot(temp))

b_vec += phi1 * tr.reward
opt_wts: np.ndarray = a_inv.dot(b_vec)
return LinearFunctionApprox.create(
feature_functions=feature_functions,
weights=Weights.create(opt_wts)
)

Now we are ready to write the standard variant of LSPI. The code below is a straight-
forward implementation of our description above, looping through the iterations of GPI,
yielding the Q-Value LinearFunctionApprox after each iteration of GPI.

def least_squares_policy_iteration(
transitions: Iterable[TransitionStep[S, A]],
actions: Callable[[NonTerminal[S]], Iterable[A]],
feature_functions: Sequence[Callable[[Tuple[NonTerminal[S], A]], float]],
initial_target_policy: DeterministicPolicy[S, A],
gamma: float,
epsilon: float
) -> Iterator[LinearFunctionApprox[Tuple[NonTerminal[S], A]]]:
’’’transitions is a finite iterable’’’
target_policy: DeterministicPolicy[S, A] = initial_target_policy
transitions_seq: Sequence[TransitionStep[S, A]] = list(transitions)
while True:
q: LinearFunctionApprox[Tuple[NonTerminal[S], A]] = \
least_squares_tdq(
transitions=transitions_seq,
feature_functions=feature_functions,
target_policy=target_policy,
gamma=gamma,
epsilon=epsilon,
)
target_policy = greedy_policy_from_qvf(q, actions)
yield q

The above code is in the file rl/td.py.

11.5.1. Saving your Village from a Vampire


Now we consider a Control problem we’d like to test the above LSPI algorithm on. We call
it the Vampire problem that can be described as a good old-fashioned bedtime story, as
follows:

A village is visited by a vampire every morning who uniform-randomly eats 1 villager


upon entering the village, then retreats to the hills, planning to come back the next
morning. The villagers come up with a plan. They will poison a certain number of
villagers each night until the vampire eats a poisoned villager the next morning, after
which the vampire dies immediately (due to the poison in the villager the vampire ate).
Unfortunately, all villagers who get poisoned also die the day after they are given the
poison. If the goal of the villagers is to maximize the expected number of villagers at
termination (termination is when either the vampire dies or all villagers die), what
should be the optimal poisoning strategy? In other words, if there are n villagers on
any day, how many villagers should be poisoned (as a function of n)?

It is straightforward to model this problem as an MDP. The State is the number of vil-
lagers at risk on any given night (if the vampire is still alive, the State is the number of
villagers and if the vampire is dead, the State is 0, which is the only Terminal State). The

Action is the number of villagers poisoned on any given night. The Reward is zero as long
as the vampire is alive, and is equal to the number of villagers remaining if the vampire
dies. Let us refer to the initial number of villagers as I. Thus,

S = {0, 1, . . . , I}, T = {0}

A(s) = {0, 1, . . . , s − 1} where s ∈ N

For all s ∈ N , for all a ∈ A(s):

PR (s, a, r, s′ ) = (s − a)/s if r = 0 and s′ = s − a − 1
PR (s, a, r, s′ ) = a/s if r = s − a and s′ = 0
PR (s, a, r, s′ ) = 0 otherwise

It is rather straightforward to solve this with Dynamic Programming (say, Value Itera-
tion) since we know the transition probabilities and rewards function and since the state
and action spaces are finite. However, in a situation where we don’t know the exact prob-
abilities with which the vampire operates, and we only had access to observations on spe-
cific days, we can attempt to solve this problem with Reinforcement Learning (assuming
we had access to observations of many vampires operating on many villages). In any case,
our goal here is to test LSPI using this vampire problem as an example. So we write some
code to first model this MDP as described above, solve it with value iteration (to obtain the
benchmark, i.e., true Optimal Value Function and true Optimal Policy to compare against),
then generate atomic experiences data from the MDP, and then solve this problem with
LSPI using this stream of generated atomic experiences.

from typing import Callable, Iterable, Iterator, List, Mapping, Sequence, Tuple
from rl.markov_decision_process import FiniteMarkovDecisionProcess, TransitionStep
from rl.markov_process import NonTerminal
from rl.distribution import Categorical, Choose
from rl.function_approx import LinearFunctionApprox
from rl.policy import DeterministicPolicy, FiniteDeterministicPolicy
from rl.dynamic_programming import value_iteration_result, V
from rl.chapter11.control_utils import get_vf_and_policy_from_qvf
from rl.td import least_squares_policy_iteration
from numpy.polynomial.laguerre import lagval
import itertools
import rl.iterate as iterate
import numpy as np
class VampireMDP(FiniteMarkovDecisionProcess[int, int]):
initial_villagers: int
def __init__(self, initial_villagers: int):
self.initial_villagers = initial_villagers
super().__init__(self.mdp_map())
def mdp_map(self) -> \
Mapping[int, Mapping[int, Categorical[Tuple[int, float]]]]:
return {s: {a: Categorical(
{(s - a - 1, 0.): 1 - a / s, (0, float(s - a)): a / s}
) for a in range(s)} for s in range(1, self.initial_villagers + 1)}
def vi_vf_and_policy(self) -> \
Tuple[V[int], FiniteDeterministicPolicy[int, int]]:
return value_iteration_result(self, 1.0)
def lspi_features(
self,
factor1_features: int,
factor2_features: int
) -> Sequence[Callable[[Tuple[NonTerminal[int], int]], float]]:

ret: List[Callable[[Tuple[NonTerminal[int], int]], float]] = []
ident1: np.ndarray = np.eye(factor1_features)
ident2: np.ndarray = np.eye(factor2_features)
for i in range(factor1_features):
def factor1_ff(x: Tuple[NonTerminal[int], int], i=i) -> float:
return lagval(
float((x[0].state - x[1]) ** 2 / x[0].state),
ident1[i]
)
ret.append(factor1_ff)
for j in range(factor2_features):
def factor2_ff(x: Tuple[NonTerminal[int], int], j=j) -> float:
return lagval(
float((x[0].state - x[1]) * x[1] / x[0].state),
ident2[j]
)
ret.append(factor2_ff)
return ret
def lspi_transitions(self) -> Iterator[TransitionStep[int, int]]:
states_distribution: Choose[NonTerminal[int]] = \
Choose(self.non_terminal_states)
while True:
state: NonTerminal[int] = states_distribution.sample()
action: int = Choose(range(state.state)).sample()
next_state, reward = self.step(state, action).sample()
transition: TransitionStep[int, int] = TransitionStep(
state=state,
action=action,
next_state=next_state,
reward=reward
)
yield transition
def lspi_vf_and_policy(self) -> \
Tuple[V[int], FiniteDeterministicPolicy[int, int]]:
transitions: Iterable[TransitionStep[int, int]] = itertools.islice(
self.lspi_transitions(),
20000
)
qvf_iter: Iterator[LinearFunctionApprox[Tuple[
NonTerminal[int], int]]] = least_squares_policy_iteration(
transitions=transitions,
actions=self.actions,
feature_functions=self.lspi_features(4, 4),
initial_target_policy=DeterministicPolicy(
lambda s: int(s / 2)
),
gamma=1.0,
epsilon=1e-5
)
qvf: LinearFunctionApprox[Tuple[NonTerminal[int], int]] = \
iterate.last(
itertools.islice(
qvf_iter,
20
)
)
return get_vf_and_policy_from_qvf(self, qvf)

The above code should be self-explanatory. The main challenge with LSPI is that we
need to construct feature functions of the state and action such that the Q-Value Function
is linear in those features. In this case, since we simply want to test the correctness of our
LSPI implementation, we define feature functions (in method lspi_features above) based
on our knowledge of the true optimal Q-Value Function from the Dynamic Programming

Figure 11.3.: True versus LSPI Optimal Value Function

solution. The atomic experiences comprising the experiences data D for LSPI to use is
generated with a uniform distribution of non-terminal states and a uniform distribution
of actions for a given state (in method lspi_transitions above).
Figure 11.3 shows the plot of the True Optimal Value Function (from Value Iteration)
versus the LSPI-estimated Optimal Value Function.
Figure 11.4 shows the plot of the True Optimal Policy (from Value Iteration) versus the
LSPI-estimated Optimal Policy.
The above code is in the file rl/chapter12/vampire.py. As ever, we encourage you to
modify some of the parameters in this code (including choices of feature functions, nature
and number of atomic transitions used, number of GPI iterations, choice of ϵ, and perhaps
even a different dynamic for the vampire behavior), and see how the results change.

11.5.2. Least-Squares Control Convergence


We wrap up this section by including the convergence behavior of LSPI in the summary ta-
ble for convergence of RL Control algorithms (that we had displayed at the end of Chapter
10). Figure 11.5 shows the updated summary table for convergence of RL Control algo-
rithms to now also include LSPI. Note that (✓) means it doesn’t quite hit the Optimal
Value Function, but bounces around near the Optimal Value Function. But this is better
than Q-Learning in the case of linear function approximation.

11.6. RL for Optimal Exercise of American Options


We learnt in Chapter 7 that the American Options Pricing problem is an Optimal Stopping
problem and can be modeled as an MDP so that solving the Control problem of the MDP
gives us the fair price of an American Option. We can solve it with Dynamic Programming
or Reinforcement Learning, as appropriate.

Figure 11.4.: True versus LSPI Optimal Policy

Algorithm             Tabular   Linear   Non-Linear
MC Control            ✓         (✓)      ✗
SARSA                 ✓         (✓)      ✗
Q-Learning            ✓         ✗        ✗
LSPI                  ✓         (✓)      -
Gradient Q-Learning   ✓         ✓        ✗

Figure 11.5.: Convergence of RL Control Algorithms

In the financial trading industry, it has traditionally not been a common practice to ex-
plicitly view the American Options Pricing problem as an MDP. Specialized algorithms
have been developed to price American Options. We now provide a quick overview of the
common practice in pricing American Options in the financial trading industry. Firstly,
we should note that the price of some American Options is equal to the price of the corre-
sponding European Option, for which we have a closed-form solution under the assump-
tion of a lognormal process for the underlying - this is the case for a plain-vanilla American
call option whose price (as we proved in Chapter 7) is equal to the price of a plain-vanilla
European call option. However, this is not the case for a plain-vanilla American put option.
Secondly, we should note that if the payoff of an American option is dependent on only the
current price of the underlying (and not on the past prices of the underlying) - in which
case, we say that the option payoff is not “history-dependent” - and if the dimension of the
state space is not large, then we can do a simple backward induction on a binomial tree (as
we showed in Chapter 7). In practice, a more detailed data structure such as a trinomial
tree or a lattice is often used for more accurate backward-induction calculations. However,
if the payoff is history-dependent (i.e., payoff depends on past prices of the underlying)
or if the payoff depends on the prices of several underlying assets, then the state space is
too large for backward induction to handle. In such cases, the standard approach in the
financial trading industry is to use the Longstaff-Schwartz pricing algorithm (Longstaff
and Schwartz 2001). We won’t cover the Longstaff-Schwartz pricing algorithm in detail in
this book - it suffices to share here that the Longstaff-Schwartz pricing algorithm combines
3 ideas:

• The Pricing is based on a set of sampling traces of the underlying prices.


• Function approximation of the continuation value for in-the-money states
• Backward-recursive determination of early exercise states

The goal of this section is to explain how to price American Options with Reinforcement
Learning, as an alternative to the Longstaff-Schwartz algorithm.

11.6.1. LSPI for American Options Pricing


A paper by Li, Szepesvari, Schuurmans (Li, Szepesvári, and Schuurmans 2009) showed
that LSPI can be an attractive alternative to the Longstaff-Schwartz algorithm in pricing
American Options. Before we dive into the details of pricing American Options with LSPI,
let’s review the MDP model for American Options Pricing.

• State is [Current Time, Relevant History of Underlying Security Prices].


• Action is Boolean: Exercise (i.e., Stop) or Continue.
• Reward is always 0, except upon Exercise (when the Reward is equal to the Payoff).
• State-transitions are based on the Underlying Securities’ Risk-Neutral Process.

The key is to create a linear function approximation of the state-conditioned continuation


value of the American Option (continuation value is the price of the American Option at the
current state, conditional on not exercising the option at the current state, i.e., continuing
to hold the option). Knowing the continuation value in any state enables us to compare
the continuation value against the exercise value (i.e., payoff), thus providing us with the
Optimal Stopping criteria (as a function of the state), which in turn enables us to determine
the Price of the American Option. Furthermore, we can customize the LSPI algorithm to
the nuances of the American Option Pricing problem, yielding a specialized version of

LSPI. The key customization comes from the fact that there are only two actions. The action
to exercise produces a (state-conditioned) reward (i.e., option payoff) and transition to a
terminal state. The action to continue produces no reward and transitions to a new state
at the next time step. Let us refer to these 2 actions as: a = c (continue the option) and
a = e (exercise the option).
Since we know the exercise value in any state, we only need to create a linear function
approximation for the continuation value, i.e., for the Q-Value Q(s, c) for all non-terminal
states s. If we denote the payoff in non-terminal state s as g(s), then Q(s, e) = g(s). So we write:

Q̂(s, a; w) = ϕ(s)T · w if a = c,  and  Q̂(s, a; w) = g(s) if a = e,  for all s ∈ N

for feature functions ϕ(·) = [ϕi (·)|i = 1, . . . , m], which are feature functions of only state
(and not action).
Each iteration of GPI in the LSPI algorithm starts with a deterministic target policy πD (·)
that is made available as a greedy policy derived from the previous iteration’s LSTDQ-
solved Q̂(s; a; w∗ ). The LSTDQ solution for w∗ is based on pre-generated training data
and with the Q-Learning target policy set to be πD . Since we learn the Q-Value function
for only a = c, the behavior policy µ generating experiences data for training is a constant
function µ(s) = c. Note also that for American Options, the reward for a = c is 0. So each
atomic experience for training is of the form (s, c, 0, s′ ). This means we can represent each
atomic experience for training as a 2-tuple (s, s′ ). This reduces the LSPI Semi-Gradient
Equation (11.2) to:
Σ_i ϕ(si ) · (ϕ(si )T · w∗ − γ · Q̂(s′i , πD (s′i ); w∗ )) = 0    (11.3)

We need to consider two cases for the term Q̂(s′i , πD (s′i ); w∗ ):

• C1: If s′i is non-terminal and πD (s′i ) = c (i.e., ϕ(s′i )T ·w ≥ g(s′i )): Substitute ϕ(s′i )T ·w∗
for Q̂(s′i , πD (s′i ); w∗ ) in Equation (11.3)
• C2: If s′i is a terminal state or πD (s′i ) = e (i.e., g(s′i ) > ϕ(s′i )T · w): Substitute g(s′i )
for Q̂(s′i , πD (s′i ); w∗ ) in Equation (11.3)

So we can rewrite Equation (11.3) using indicator notation I for cases C1, C2 as:
Σ_i ϕ(si ) · (ϕ(si )T · w∗ − IC1 · γ · ϕ(s′i )T · w∗ − IC2 · γ · g(s′i )) = 0

Factoring out w∗ , we get:


(Σ_i ϕ(si ) · (ϕ(si ) − IC1 · γ · ϕ(s′i ))T ) · w∗ = γ · Σ_i IC2 · ϕ(si ) · g(s′i )

This can be written in the familiar vector-matrix notation as A · w∗ = b, where:

A = Σ_i ϕ(si ) · (ϕ(si ) − IC1 · γ · ϕ(s′i ))T

b = γ · Σ_i IC2 · ϕ(si ) · g(s′i )

The m × m Matrix A is accumulated at each atomic experience (si , s′i ) as:

A ← A + ϕ(si ) · (ϕ(si ) − IC1 · γ · ϕ(s′i ))T
The m-Vector b is accumulated at each atomic experience (si , s′i ) as:

b ← b + γ · IC2 · ϕ(si ) · g(s′i )


With Sherman-Morrison incremental inverse of A, we can reduce the time-complexity
from O(m3 ) to O(m2 ).
This solved w∗ updates the Q-Value Function Approximation to Q̂(s, a; w∗ ). This defines an updated, improved deterministic policy πD′ (serving as the Deterministic Target Policy for the next iteration of GPI):

πD′ (s) = arg max_a Q̂(s, a; w∗ ) for all s ∈ N
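Separately from the code library accompanying the book, the accumulation above might be sketched as follows for a batch of (s, s′) atomic experiences. The state representation, the payoff function g, the is_terminal helper and the function name lstdq_american are all our own illustrative assumptions, not the book's API.

from typing import Callable, Iterable, Tuple, TypeVar
import numpy as np

S = TypeVar('S')

def lstdq_american(
    pairs: Iterable[Tuple[S, S]],          # (s, s') atomic experiences
    phi: Callable[[S], np.ndarray],        # feature vector phi(s) of length m
    payoff: Callable[[S], float],          # exercise value g(s)
    is_terminal: Callable[[S], bool],
    w_prev: np.ndarray,                    # weights defining the current target policy pi_D
    gamma: float,
    epsilon: float
) -> np.ndarray:
    m = len(w_prev)
    a_mat = np.eye(m) * epsilon            # A starts at epsilon * I_m
    b_vec = np.zeros(m)
    for s, s2 in pairs:
        p1 = phi(s)
        # Case C1: s' non-terminal and pi_D(s') = c, i.e. continuation value >= payoff
        if (not is_terminal(s2)) and phi(s2) @ w_prev >= payoff(s2):
            a_mat += np.outer(p1, p1 - gamma * phi(s2))
        # Case C2: s' terminal or pi_D(s') = e
        else:
            a_mat += np.outer(p1, p1)
            b_vec += gamma * p1 * payoff(s2)
    return np.linalg.solve(a_mat, b_vec)   # w* for this GPI iteration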

Li, Szepesvari, Schuurmans (Li, Szepesvári, and Schuurmans 2009) recommend in their
paper to use 7 feature functions, the first 4 Laguerre polynomials that are functions of the
underlying price and 3 functions of time. Precisely, the feature functions they recommend
are:
• ϕ0 (St ) = 1
• ϕ1 (St ) = e^(−Mt /2)
• ϕ2 (St ) = e^(−Mt /2) · (1 − Mt )
• ϕ3 (St ) = e^(−Mt /2) · (1 − 2Mt + Mt^2 /2)
• ϕ0^(t) (t) = sin(π(T − t)/(2T ))
• ϕ1^(t) (t) = log(T − t)
• ϕ2^(t) (t) = (t/T )^2

where Mt = St /K (St is the current underlying price and K is the American Option strike), t is the current time, and T is the expiration time (i.e., 0 ≤ t < T ).
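As an illustration, these seven feature functions might be coded as shown below, with each feature taking a (price, time) pair. The helper name american_option_lspi_features is ours and this is not part of the book's code library.

from typing import Callable, Sequence, Tuple
import numpy as np

def american_option_lspi_features(
    strike: float,
    expiry: float
) -> Sequence[Callable[[Tuple[float, float]], float]]:
    '''Each feature maps x = (S_t, t) to a float; assumes 0 <= t < expiry.'''
    def m(price: float) -> float:
        return price / strike              # M_t = S_t / K
    return [
        lambda x: 1.0,
        lambda x: np.exp(-m(x[0]) / 2),
        lambda x: np.exp(-m(x[0]) / 2) * (1 - m(x[0])),
        lambda x: np.exp(-m(x[0]) / 2) * (1 - 2 * m(x[0]) + m(x[0]) ** 2 / 2),
        lambda x: np.sin(np.pi * (expiry - x[1]) / (2 * expiry)),
        lambda x: np.log(expiry - x[1]),
        lambda x: (x[1] / expiry) ** 2
    ]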

11.6.2. Deep Q-Learning for American Options Pricing


LSPI is data-efficient and compute-efficient, but linearity is a limitation in the function
approximation. The alternative is (incremental) Q-Learning with neural network function
approximation, which we cover in this subsection. We employ the same set up as LSPI
(including Experience Replay) - specifically, the function approximation is required only
for continuation value. Precisely,
Q̂(s, a; w) = f (s; w) if a = c,  and  Q̂(s, a; w) = g(s) if a = e,  for all s ∈ N
where f (s; w) is the deep neural network function approximation.
The Q-Learning update for each atomic experience (si , s′i ) is:
∆w = α · (γ · Q̂(s′i , π(s′i ); w) − f (si ; w)) · ∇w f (si ; w)
When s′i is a non-terminal state, the update is:
∆w = α · (γ · max(g(s′i ), f (s′i ; w)) − f (si ; w)) · ∇w f (si ; w)
When s′i is a terminal state, the update is:
∆w = α · (γ · g(s′i ) − f (si ; w)) · ∇w f (si ; w)
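As a small illustration of these updates (not code from the book's library), with a linear f(s; w) = ϕ(s)T · w standing in for the deep neural network (so that ∇w f(s; w) = ϕ(s)), a single update step might be sketched as:

import numpy as np

def continuation_value_update(
    w: np.ndarray,          # current weights of f(s; w) = phi(s) . w
    phi_s: np.ndarray,      # phi(s_i)
    phi_s2: np.ndarray,     # phi(s'_i) (ignored when s'_i is terminal)
    g_s2: float,            # payoff g(s'_i)
    terminal: bool,         # whether s'_i is a terminal state
    gamma: float,
    alpha: float
) -> np.ndarray:
    if terminal:
        target = gamma * g_s2
    else:
        target = gamma * max(g_s2, float(phi_s2 @ w))   # max of exercise vs continuation value
    return w + alpha * (target - float(phi_s @ w)) * phi_s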

11.7. Value Function Geometry
Now we look deeper into the issue of the Deadly Triad (that we had alluded to in Chap-
ter 10) by viewing Value Functions as Vectors (we had done this in Chapter 3), under-
stand Value Function Vector transformations with a balance of geometric intuition and
mathematical rigor, providing insights into convergence issues for a variety of traditional
loss functions used to develop RL algorithms. As ever, the best way to understand Vector
transformations is to visualize them and so, we loosely refer to this topic as Value Function Ge-
ometry. The geometric intuition is particularly useful for linear function approximations.
To promote intuition, we shall present this content for linear function approximations of
the Value Function and stick to Prediction (rather than Control) although many of the
concepts covered in this section are well-extensible to non-linear function approximations
and to the Control problem.
This treatment was originally presented in the LSPI paper by Lagoudakis and Parr (Lagoudakis
and Parr 2003) and has been covered in detail in the RL book by Sutton and Barto (Richard
S. Sutton and Barto 2018). This treatment of Value Functions as Vectors leads us in the
direction of overcoming the Deadly Triad by defining an appropriate loss function, cal-
culating whose gradient provides a more robust set of RL algorithms known as Gradient
Temporal-Difference (abbreviated, as Gradient TD), which we shall cover in the next sec-
tion.
Along with visual intuition, it is important to write precise notation for Value Function
transformations and approximations. So we start with a set of formal definitions, keeping
the setting fairly simple and basic for ease of understanding.

11.7.1. Notation and Definitions


Assume our state space is finite without any terminal states, i.e. S = N = {s1 , s2 , . . . , sn }.
Assume our action space A consists of a finite number of actions. This coverage can be
extended to infinite/continuous spaces, but we shall stick to this simple setting in this sec-
tion. Also, as mentioned above, we restrict this coverage to the case of a fixed (potentially
stochastic) policy denoted as π : S × A → [0, 1]. This means we are restricting to the case
of the Prediction problem (although it’s possible to extend some of this coverage to the
case of Control).
We denote the Value Function for a policy π as V π : S → R. Consider the n-dimensional
vector space Rn , with each dimension corresponding to a state in S. Think of a Value Func-
tion (typically denoted V ): S → R as a vector in the Rn vector space. Each dimension’s
coordinate is the evaluation of the Value Function for that dimension’s state. The coordi-
nates of vector V π for policy π are: [V π (s1 ), V π (s2 ), . . . , V π (sn )]. Note that this treatment
is the same as the treatment in our coverage of Dynamic Programming in Chapter 3.
Our interest is in identifying an appropriate function approximation of the Value Func-
tion V π . For the function approximation, assume there are m feature functions ϕ1 , ϕ2 , . . . , ϕm :
S → R, with ϕ(s) ∈ Rm denoting the feature vector for any state s ∈ S. To keep things sim-
ple and to promote understanding of the concepts, we limit ourselves to linear function
approximations. For linear function approximation of the Value Function with weights
w = (w1 , w2 , . . . , wm ), we use the notation Vw : S → R, defined as:

Vw (s) = ϕ(s)T · w = Σ_{j=1}^{m} ϕj (s) · wj for all s ∈ S

Assuming independence of the feature functions, the m feature functions give us m in-
dependent vectors in the vector space Rn . Feature function ϕj gives us the vector [ϕj (s1 ), ϕj (s2 ), . . . , ϕj (sn )] ∈
Rn . These m vectors are the m columns of the n×m matrix Φ = [ϕj (si )], 1 ≤ i ≤ n, 1 ≤ j ≤
m. The span of these m independent vectors is an m-dimensional vector subspace within
this n-dimensional vector space, spanned by the set of all w = (w1 , w2 , . . . , wm ) ∈ Rm . The
vector Vw = Φ · w in this vector subspace has coordinates [Vw (s1 ), Vw (s2 ), . . . , Vw (sn )].
The vector Vw is fully specified by w (so we often say w to mean Vw ). Our interest is in
identifying an appropriate w ∈ Rm that represents an adequate linear function approxi-
mation Vw = Φ · w of the Value Function V π .
We denote the probability distribution of occurrence of states under policy π as µπ :
S → [0, 1]. In accordance with the notation we used in Chapter 2, R(s, a) refers to the
Expected Reward upon taking action a in state s, and P(s, a, s′ ) refers to the probability of
transition from state s to state s′ upon taking action a. Define
Rπ (s) = Σ_{a∈A} π(s, a) · R(s, a) for all s ∈ S

P π (s, s′ ) = Σ_{a∈A} π(s, a) · P(s, a, s′ ) for all s, s′ ∈ S

to denote the Expected Reward and state transition probabilities respectively of the π-
implied MRP.
Rπ refers to vector [Rπ (s1 ), Rπ (s2 ), . . . , Rπ (sn )] and P π refers to matrix [P π (si , si′ )], 1 ≤
i, i′ ≤ n. Denote γ < 1 (since there are no terminal states) as the MDP discount factor.

11.7.2. Bellman Policy Operator and Projection Operator


In Chapter 2, we introduced the Bellman Policy Operator B π for policy π operating on any
Value Function vector V . As a reminder,

B π (V ) = Rπ + γP π · V for any VF vector V ∈ Rn

Note that B π is a linear operator in vector space Rn . So we henceforth denote and treat
B π as an n × n matrix, representing the linear operator. We’ve learnt in Chapter 2 that V π
is the fixed point of B π . Therefore, we can write:

Bπ · V π = V π

This means, if we start with an arbitrary Value Function vector V and repeatedly apply
B π , by Fixed-Point Theorem, we will reach the fixed point V π . We’ve learnt in Chapter
2 that this is in fact the Dynamic Programming Policy Evaluation algorithm. Note that
Tabular Monte Carlo also converges to V π (albeit slowly).
Next, we introduce the Projection Operator ΠΦ for the subspace spanned by the column
vectors (feature functions) of Φ. We define ΠΦ (V ) as the vector in the subspace spanned
by the column vectors of Φ that represents the orthogonal projection of Value Function
vector V on the Φ subspace. To make this precise, we first define “distance” d(V1 , V2 )
between Value Function vectors V1 , V2 , weighted by µπ across the n dimensions of V1 , V2 .
Specifically,

d(V1 , V2 ) = Σ_{i=1}^{n} µπ (si ) · (V1 (si ) − V2 (si ))2 = (V1 − V2 )T · D · (V1 − V2 )

Figure 11.6.: Value Function Geometry (Image Credit: Sutton-Barto’s RL Book)

where D is the square diagonal matrix consisting of the diagonal elements µπ (si ), 1 ≤ i ≤
n.
With this “distance” metric, we define ΠΦ (V ) as the Value Function vector in the sub-
space spanned by the column vectors of Φ that is given by arg minw d(V , Vw ). This is a
weighted least squares regression with solution:

w∗ = (ΦT · D · Φ)−1 · ΦT · D · V

Since ΠΦ (V ) = Φ · w∗ , we henceforth denote and treat Projection Operator ΠΦ as the


following n × n matrix:

ΠΦ = Φ · (ΦT · D · Φ)−1 · ΦT · D
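As a small concrete illustration, for a made-up Φ and µπ (n = 3 states, m = 2 features), a few lines of numpy confirm that ΠΦ is indeed a projection (i.e., idempotent); the numbers below are arbitrary and for illustration only.

import numpy as np

phi_mat = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])   # n x m matrix Phi
mu = np.array([0.2, 0.5, 0.3])                              # state distribution mu^pi
d_mat = np.diag(mu)                                         # diagonal matrix D
proj = phi_mat @ np.linalg.inv(phi_mat.T @ d_mat @ phi_mat) @ phi_mat.T @ d_mat
assert np.allclose(proj @ proj, proj)                        # projection operator is idempotent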

11.7.3. Vectors of interest in the Φ subspace


In this section, we cover 4 Value Function vectors of interest in the Φ subspace, as candidate
linear function approximations of the Value Function V π . To lighten notation, we will refer
to the Φ-subspace Value Function vectors by their corresponding weights w. All 4 of these
Value Function vectors are depicted in Figure 11.6, an image we are borrowing from Sutton
and Barto’s RL book (Richard S. Sutton and Barto 2018). We spend the rest of this section
going over these 4 Value Function vectors in detail.
The first Value Function vector of interest in the Φ subspace is the Projection ΠΦ · V π ,
denoted as wπ = arg minw d(V π , Vw ). This is the linear function approximation of the
Value Function V π we seek because it is the Value Function vector in the Φ subspace that
is “closest” to V π . Monte-Carlo with linear function approximation will (slowly) con-
verge to wπ . Figure 11.6 provides the visualization. We’ve learnt that Monte-Carlo can be

slow to converge, so we seek function approximations in the Φ subspace that are based on
Temporal-Difference (TD), i.e., bootstrapped methods. The remaining three Value Func-
tion vectors in the Φ subspace are based on TD methods.
We denote the second Value Function vector of interest in the Φ subspace as wBE . The
acronym BE stands for Bellman Error. To understand this, consider the application of the
Bellman Policy Operator B π on a Value Function vector Vw in the Φ subspace. Applying
B π on Vw typically throws Vw out of the Φ subspace. The idea is to find a Value Function
vector Vw in the Φ subspace such that the “distance” between Vw and B π · Vw is mini-
mized, i.e. we minimize the “error vector” BE = B π · Vw − Vw (Figure 11.6 provides the
visualization). Hence, we say we are minimizing the Bellman Error (or simply that we are
minimizing BE), and we refer to wBE as the Value Function vector in the Φ subspace for
which BE is minimized. Formally, we define it as:

wBE = arg min_w d(B π · Vw , Vw )
    = arg min_w d(Vw , Rπ + γP π · Vw )
    = arg min_w d(Φ · w, Rπ + γP π · Φ · w)
    = arg min_w d(Φ · w − γP π · Φ · w, Rπ )
    = arg min_w d((Φ − γP π · Φ) · w, Rπ )

This is a weighted least-squares linear regression of Rπ against Φ − γP π · Φ with weights


µπ , whose solution is:

wBE = ((Φ − γP π · Φ)T · D · (Φ − γP π · Φ))−1 · (Φ − γP π · Φ)T · D · Rπ

The above formulation can be used to compute wBE if we know the model probabilities
P π and reward function Rπ . But often, in practice, we don’t know P π and Rπ , in which
case we seek model-free learning of wBE , specifically with a TD (bootstrapped) algorithm.
Let us refer to
(Φ − γP π · Φ)T · D · (Φ − γP π · Φ)

as matrix A and let us refer to

(Φ − γP π · Φ)T · D · Rπ

as vector b so that wBE = A−1 · b.


Following policy π, each time we perform an individual transition from s to s′ getting
reward r, we get a sample estimate of A and b. The sample estimate of A is the outer-
product of vector ϕ(s) − γ · ϕ(s′ ) with itself. The sample estimate of b is scalar r times
vector ϕ(s) − γ · ϕ(s′ ). We average these sample estimates across many such individual
transitions. However, this requires m (the number of features) to be not too large.
If m is large or if we are doing non-linear function approximation or off-policy, then we
seek a gradient-based TD algorithm. We defined wBE as the vector in the Φ subspace for
which the Bellman Error is minimized. But Bellman Error for a state is the expectation of
the TD error δ for that state when following policy π. So we want to do Stochastic Gradient

Descent with the gradient of the square of expected TD error, as follows:

∆w = −α · (1/2) · ∇w (Eπ [δ])2
   = −α · Eπ [r + γ · ϕ(s′ )T · w − ϕ(s)T · w] · ∇w Eπ [δ]
   = α · (Eπ [r + γ · ϕ(s′ )T · w] − ϕ(s)T · w) · (ϕ(s) − γ · Eπ [ϕ(s′ )])

This is called the Residual Gradient algorithm, due to Leemon Baird (Baird 1995). It requires
two independent samples of s′ transitioning from s. If we do have that, it converges to wBE
robustly (even for non-linear function approximations). But this algorithm is slow, and
doesn’t converge to a desirable place. Another issue is that wBE is not learnable if we
can only access the features, and not underlying states. These issues led researchers to
consider alternative TD algorithms.
We denote the third Value Function vector of interest in the Φ subspace as wT DE and
define it as the vector in the Φ subspace for which the expected square of the TD error δ
(when following policy π) is minimized. Formally,
wT DE = arg min_w Σ_{s∈S} µπ (s) Σ_{r,s′} Pπ (r, s′ |s) · (r + γ · ϕ(s′ )T · w − ϕ(s)T · w)2

To perform Stochastic Gradient Descent, we have to estimate the gradient of the expected
square of TD error by sampling. The weight update for each gradient sample in the
Stochastic Gradient Descent is:

∆w = −α · (1/2) · ∇w (r + γ · ϕ(s′ )T · w − ϕ(s)T · w)2
   = α · (r + γ · ϕ(s′ )T · w − ϕ(s)T · w) · (ϕ(s) − γ · ϕ(s′ ))

This algorithm is called Naive Residual Gradient, due to Leemon Baird (Baird 1995). Naive
Residual Gradient converges robustly, but again, not to a desirable place. So researchers
had to look even further.
This brings us to the fourth (and final) Value Function vector of interest in the Φ sub-
space. We denote this Value Function vector as wP BE . The acronym P BE stands for Pro-
jected Bellman Error. To understand this, first consider the composition of the Projection
Operator ΠΦ and the Bellman Policy Operator B π , i.e., ΠΦ · B π (we call this composed
operator as the Projected Bellman operator). Visualize the application of this Projected Bell-
man operator on a Value Function vector Vw in the Φ subspace. Applying B π on Vw typ-
ically throws Vw out of the Φ subspace and then further applying ΠΦ brings it back to
the Φ subspace (call this resultant Value Function vector Vw′ ). The idea is to find a Value
Function vector Vw in the Φ subspace for which the “distance” between Vw and Vw′ is
minimized, i.e. we minimize the “error vector” P BE = ΠΦ · B π · Vw − Vw (Figure 11.6
provides the visualization). Hence, we say we are minimizing the Projected Bellman Error
(or simply that we are minimizing P BE), and we refer to wP BE as the Value Function
vector in the Φ subspace for which P BE is minimized. It turns out that the minimum of
PBE is actually zero, i.e., Φ · wP BE is a fixed point of operator ΠΦ · B π . Let us write out
this statement formally. We know:

ΠΦ = Φ · (ΦT · D · Φ)−1 · ΦT · D

B π · V = Rπ + γP π · V

Therefore, the statement that Φ · wP BE is a fixed point of operator ΠΦ · B π can be written


as follows:

Φ · (ΦT · D · Φ)−1 · ΦT · D · (Rπ + γP π · Φ · wP BE ) = Φ · wP BE

Since the columns of Φ are assumed to be independent (full rank),

(ΦT · D · Φ)−1 · ΦT · D · (Rπ + γP π · Φ · wP BE ) = wP BE


ΦT · D · (Rπ + γP π · Φ · wP BE ) = ΦT · D · Φ · wP BE
ΦT · D · (Φ − γP π · Φ) · wP BE = ΦT · D · Rπ (11.4)

This is a square linear system of the form A · wP BE = b whose solution is:

wP BE = A−1 · b = (ΦT · D · (Φ − γP π · Φ))−1 · ΦT · D · Rπ
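As a concrete illustration of this closed form, the following numpy snippet (with a made-up 3-state π-implied MRP and 2 features, all numbers arbitrary) computes wP BE and verifies that Φ · wP BE is indeed a fixed point of ΠΦ · B π:

import numpy as np

gamma = 0.9
phi_mat = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])        # Phi (n = 3, m = 2)
p_pi = np.array([[0.5, 0.5, 0.0], [0.2, 0.5, 0.3], [0.0, 0.4, 0.6]])   # P^pi
r_pi = np.array([1.0, 0.0, 2.0])                                 # R^pi
mu = np.array([0.2, 0.5, 0.3])                                   # mu^pi
d_mat = np.diag(mu)

a_mat = phi_mat.T @ d_mat @ (phi_mat - gamma * p_pi @ phi_mat)   # A
b_vec = phi_mat.T @ d_mat @ r_pi                                 # b
w_pbe = np.linalg.solve(a_mat, b_vec)                            # w_PBE = A^{-1} . b

proj = phi_mat @ np.linalg.inv(phi_mat.T @ d_mat @ phi_mat) @ phi_mat.T @ d_mat
bellman = r_pi + gamma * p_pi @ (phi_mat @ w_pbe)                # B^pi applied to V_w
assert np.allclose(proj @ bellman, phi_mat @ w_pbe)              # fixed point of Pi_Phi . B^pi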

The above formulation can be used to compute wP BE if we know the model probabil-
ities P π and reward function Rπ . But often, in practice, we don’t know P π and Rπ , in
which case we seek model-free learning of wP BE , specifically with a TD (bootstrapped)
algorithm.
The question is how do we construct matrix

A = ΦT · D · (Φ − γP π · Φ)

and vector
b = ΦT · D · Rπ

without a model?
Following policy π, each time we perform an individual transition from s to s′ getting
reward r, we get a sample estimate of A and b. The sample estimate of A is the outer-
product of vectors ϕ(s) and ϕ(s)−γ ·ϕ(s′ ). The sample estimate of b is scalar r times vector
ϕ(s). We average these sample estimates across many such individual transitions. Note
that this algorithm is exactly the Least Squares Temporal Difference (LSTD) algorithm
we’ve covered earlier in this chapter. Thus, we now know that LSTD converges to wP BE ,
i.e., minimizes (in fact takes down to 0) P BE. If the number of features m is large or if we
are doing non-linear function approximation or Off-Policy, then we seek a gradient-based
TD algorithm. It turns out that our usual Semi-Gradient TD algorithm converges to wP BE
in the case of on-policy linear function approximation. Note that the update for the usual
Semi-Gradient TD algorithm in the case of on-policy linear function approximation is as
follows:
∆w = α · (r + γ · ϕ(s′ )T · w − ϕ(s)T · w) · ϕ(s)

This converges to wP BE because at convergence, we have: Eπ [∆w] = 0, which can be


expressed as:
ΦT · D · (Rπ + γP π · Φ · w − Φ · w) = 0

⇒ ΦT · D · (Φ − γP π · Φ) · w = ΦT · D · Rπ

which is satisfied for w = wP BE (as seen from Equation (11.4)).

11.8. Gradient Temporal-Difference (Gradient TD)
For on-policy linear function approximation, the semi-gradient TD algorithm gives us
wP BE . But to obtain wP BE in the case of non-linear function approximation or in the case
of Off-Policy, we need a different approach. The different approach is Gradient Temporal-
Difference (abbreviated, Gradient TD), the subject of this section.
The original Gradient TD algorithm, due to Sutton, Szepesvari, Maei (R. S. Sutton,
Szepesvári, and Maei 2008) is typically abbreviated as GTD. Researchers then came up
with a second-generation Gradient TD algorithm (R. S. Sutton et al. 2009), which is typi-
cally abbreviated as GTD-2. The same researchers also came up with a TD algorithm with
Gradient Correction (R. S. Sutton et al. 2009), which is typically abbreviated as TDC.
We now cover the TDC algorithm. For simplicity of articulation and ease of under-
standing, we restrict to the case of linear function approximation in our coverage of the
TDC algorithm below. However, do bear in mind that much of the concepts below extend
to non-linear function approximation (which is where we reap the benefits of Gradient
TD).
Our first task is to set up the appropriate loss function whose gradient will drive the
Stochastic Gradient Descent.

wP BE = arg min_w d(ΠΦ · B π · Vw , Vw ) = arg min_w d(ΠΦ · B π · Vw , ΠΦ · Vw )

So we define the loss function (denoting B π · Vw − Vw as δw ) as:

L(w) = (ΠΦ · δw )T · D · (ΠΦ · δw ) = δw T · ΠΦ T · D · ΠΦ · δw
     = δw T · (Φ · (ΦT · D · Φ)−1 · ΦT · D)T · D · (Φ · (ΦT · D · Φ)−1 · ΦT · D) · δw
     = δw T · (D · Φ · (ΦT · D · Φ)−1 · ΦT ) · D · (Φ · (ΦT · D · Φ)−1 · ΦT · D) · δw
     = (δw T · D · Φ) · (ΦT · D · Φ)−1 · (ΦT · D · Φ) · (ΦT · D · Φ)−1 · (ΦT · D · δw )
     = (ΦT · D · δw )T · (ΦT · D · Φ)−1 · (ΦT · D · δw )
We derive the TDC Algorithm based on ∇w L(w).

∇w L(w) = 2 · (∇w (ΦT · D · δw )T ) · (ΦT · D · Φ)−1 · (ΦT · D · δw )

We want to estimate this gradient from individual transitions data. So we express each of the 3 terms forming the product in the gradient expression above as expectations of functions of individual transitions s → (r, s′ ) under policy π. Denoting r + γ · ϕ(s′ )T · w − ϕ(s)T · w as
δ, we get:

ΦT · D · δw = E[δ · ϕ(s)]
∇w (ΦT · D · δw )T = E[(∇w δ) · ϕ(s)T ] = E[(γ · ϕ(s′ ) − ϕ(s)) · ϕ(s)T ]
ΦT · D · Φ = E[ϕ(s) · ϕ(s)T ]
Substituting, we get:

∇w L(w) = 2 · E[(γ · ϕ(s′ ) − ϕ(s)) · ϕ(s)T ] · (E[ϕ(s) · ϕ(s)T ])−1 · E[δ · ϕ(s)]

1
∆w = −α · · ∇w L(w)
2

418
= α · E[(ϕ(s) − γ · ϕ(s′ )) · ϕ(s)T ] · (E[ϕ(s) · ϕ(s)T ])−1 · E[δ · ϕ(s)]
= α · (E[ϕ(s) · ϕ(s)T ] − γ · E[ϕ(s′ ) · ϕ(s)T ]) · (E[ϕ(s) · ϕ(s)T ])−1 · E[δ · ϕ(s)]
= α · (E[δ · ϕ(s)] − γ · E[ϕ(s′ ) · ϕ(s)T ] · (E[ϕ(s) · ϕ(s)T ])−1 · E[δ · ϕ(s)])
= α · (E[δ · ϕ(s)] − γ · E[ϕ(s′ ) · ϕ(s)T ] · θ)
where θ = (E[ϕ(s) · ϕ(s)^T])^{-1} · E[δ · ϕ(s)] is the solution to the weighted least-squares
linear regression of B^π · V − V against Φ, with weights µ_π.
We can perform this gradient descent with a technique known as Cascade Learning,
which involves simultaneously updating both w and θ (with θ converging faster). The
updates are as follows:

∆w = α · δ · ϕ(s) − α · γ · ϕ(s′) · (ϕ(s)^T · θ)

∆θ = β · (δ − ϕ(s)^T · θ) · ϕ(s)

where β is the learning rate for θ. Note that ϕ(s)^T · θ operates as an estimate of the TD
error δ for the current state s.
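To make the Cascade Learning updates concrete, here is a minimal sketch (not part of the book's code library) of a single TDC update from one transition (s, r, s′) with linear function approximation; passing a zero feature vector for a terminal s′ is an assumption of this sketch:

import numpy as np

def tdc_update(w: np.ndarray, theta: np.ndarray, phi_s: np.ndarray,
               phi_s_next: np.ndarray, r: float, gamma: float,
               alpha: float, beta: float):
    # w: main weights; theta: auxiliary weights of the cascade
    delta = r + gamma * phi_s_next @ w - phi_s @ w              # TD error
    w = w + alpha * (delta * phi_s - gamma * phi_s_next * (phi_s @ theta))
    theta = theta + beta * (delta - phi_s @ theta) * phi_s      # theta tracks an estimate of delta
    return w, theta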
Repeating what we had said in Chapter 10, Gradient TD converges reliably for the Pre-
diction problem even when we are faced with the Deadly Triad of [Bootstrapping, Off-
Policy, Non-Linear Function Approximation]. The picture is less rosy for Control. Gra-
dient Q-Learning (Gradient TD for Off-Policy Control) converges reliably for both on-
policy and off-policy linear function approximations, but there are divergence issues for
non-linear function approximations. For Control problems with non-linear function ap-
proximations (especially, neural network approximations with off-policy learning), one
can leverage the approach of the DQN algorithm (Experience Replay with fixed Target
Network helps overcome the Deadly Triad).

11.9. Key Takeaways from this Chapter


• Batch RL makes efficient use of data.
• DQN uses Experience-Replay and fixed Q-learning targets, avoiding the pitfalls of
time-correlation and varying TD Target.
• LSTD is a direct (gradient-free) solution of Batch TD Prediction.
• LSPI is an Off-Policy, Experience-Replay Control Algorithm using LSTDQ for Policy
Evaluation.
• Optimal Exercise of American Options can be tackled with LSPI and Deep Q-Learning
algorithms.
• For Prediction, the 4 Value Function vectors of interest in the Φ subspace are wπ, wBE, wTDE, wPBE,
with wPBE as the key sought-after function approximation for the Value Function V^π.
• For Prediction, Gradient TD solves for wPBE efficiently and robustly in the case of
non-linear function approximation and in the case of Off-Policy.

12. Policy Gradient Algorithms
It’s time to take stock of what we have learnt so far to set up context for this chapter. So
far, we have covered a range of RL Control algorithms, all of which are based on General-
ized Policy Iteration (GPI). All of these algorithms perform GPI by learning the Q-Value
Function and improving the policy by identifying the action that fetches the best Q-Value
(i.e., action value) for each state. Notice that the way we implemented this best action iden-
tification is by sweeping through all the actions for each state. This works well only if the
set of actions for each state is reasonably small. But if the action space is large/continuous,
we have to resort to some sort of optimization method to identify the best action for each
state, which is potentially complicated and expensive.
In this chapter, we cover RL Control algorithms that take a vastly different approach.
These Control algorithms are still based on GPI, but the Policy Improvement of their GPI
is not based on consulting the Q-Value Function, as has been the case with Control al-
gorithms we covered in the previous two chapters. Rather, the approach in the class of
algorithms we cover in this chapter is to directly find the Policy that fetches the “Best Ex-
pected Returns.” Specifically, the algorithms of this chapter perform a Gradient Ascent
on “Expected Returns” with the gradient defined with respect to the parameters of a Pol-
icy function approximation. We shall work with a stochastic policy of the form π(s, a; θ),
with θ denoting the parameters of the policy function approximation π. So we are basically
learning this parameterized policy that selects actions without consulting a Value Func-
tion. Note that we might still engage a Value Function approximation (call it Q(s, a; w))
in our algorithm, but its role is only to help learn the policy parameters θ and not to
identify the action with the best action-value for each state. So the two function approx-
imations π(s, a; θ) and Q(s, a; w) collaborate to improve the policy using gradient ascent
(based on the gradient of “expected returns” with respect to θ). π(s, a; θ) is the primary
worker here (known as the Actor) and Q(s, a; w) is the support worker (known as the Critic).
The Critic parameters w are optimized by minimizing a suitable loss function defined in
terms of Q(s, a; w), while the Actor parameters θ are optimized by maximizing a suitable
“Expected Returns” function. Note that we still haven’t defined what this “Expected Re-
turns” function is (we will do so shortly), but we already see that this idea is appealing
for large/continuous action spaces where sweeping through actions is infeasible. We will
soon dig into the details of this new approach to RL Control (known as Policy Gradient, ab-
breviated as PG) - for now, it’s important to recognize the big picture that PG is basically
GPI with Policy Improvement done as a Policy Gradient Ascent.
The contrast between the RL Control algorithms covered in the previous two chapters
and the algorithms of this chapter actually is part of the following bigger-picture classifi-
cation of learning algorithms for Control:

• Value Function-based: Here we learn the Value Function (typically with a function
approximation for the Value Function) and the Policy is implicit, readily derived
from the Value Function (eg: ϵ-greedy).
• Policy-based: Here we learn the Policy (with a function approximation for the Pol-
icy), and there is no need to learn a Value Function.

• Actor-Critic: Here we primarily learn the Policy (with a function approximation for
the Policy, known as Actor), and secondarily learn the Value Function (with a func-
tion approximation for the Value Function, known as Critic).

PG Algorithms can be Policy-based or Actor-Critic, whereas the Control algorithms we


covered in the previous two chapters are Value Function-based.
In this chapter, we start by enumerating the advantages and disadvantages of Policy
Gradient Algorithms, state and prove the Policy Gradient Theorem (which provides the
fundamental calculation underpinning Policy Gradient Algorithms), then go on to ad-
dress how to lower the bias and variance in these algorithms, give an overview of special
cases of Policy Gradient algorithms that have found success in practical applications, and
finish with a description of Evolutionary Strategies that although technically not RL, re-
semble Policy Gradient algorithms and are quite effective in solving certain Control prob-
lems.

12.1. Advantages and Disadvantages of Policy Gradient


Algorithms

Let us start by enumerating the advantages of PG algorithms. We’ve already said that
PG algorithms are effective in large action spaces, especially high-dimensional or contin-
uous action spaces, because in such spaces selecting an action by deriving an improved
policy from an updating Q-Value function is intractable. A key advantage of PG is that it
naturally explores because the policy function approximation is configured as a stochastic
policy. Moreover, PG finds the best Stochastic Policy. This is not a factor for MDPs since
we know that there exists an optimal Deterministic Policy for any MDP but we often deal
with Partially-Observable MDPs (POMDPs) in the real-world, for which the set of optimal
policies might all be stochastic policies. We have an advantage in the case of MDPs as well
since PG algorithms naturally converge to the deterministic policy (the variance in the pol-
icy distribution will automatically converge to 0) whereas in Value Function-based algo-
rithms, we have to reduce the ϵ of the ϵ-greedy policy by-hand and the appropriate declin-
ing trajectory of ϵ is typically hard to figure out by manual tuning. In situations where the
policy function is a simpler function compared to the Value Function, we naturally benefit
from pursuing Policy-based algorithms than Value Function-based algorithms. Perhaps
the biggest advantage of PG algorithms is that prior knowledge of the functional form of
the Optimal Policy enables us to structure the known functional form in the function ap-
proximation for the policy. Lastly, PG offers numerical benefits as small changes in θ yield
small changes in π, and consequently small changes in the distribution of occurrences of
states. This results in stronger convergence guarantees for PG algorithms relative to Value
Function-based algorithms.
Now let’s understand the disadvantages of PG Algorithms. The main disadvantage of
PG Algorithms is that because they are based on gradient ascent, they typically converge to
a local optimum whereas Value Function-based algorithms converge to a global optimum.
Furthermore, the Policy Evaluation of PG is typically inefficient and can have high vari-
ance. Lastly, the Policy Improvements of PG happen in small steps and so, PG algorithms
are slow to converge.

12.2. Policy Gradient Theorem
In this section, we start by setting up some notation, and then state and prove the Policy
Gradient Theorem (abbreviated as PGT). The PGT provides the key calculation for PG
Algorithms.

12.2.1. Notation and Definitions


Denoting the discount factor as γ, we shall assume either episodic sequences with 0 ≤ γ ≤
1 or non-episodic (continuing) sequences with 0 ≤ γ < 1. We shall use our usual nota-
tion of discrete-time, countable-spaces, time-homogeneous MDPs although we can indeed
extend PGT and PG Algorithms to more general settings as well. We lighten the P(s, a, s′)
notation to P^a_{s,s′} and the R(s, a) notation to R^a_s because we want to save some space
in the very long equations in the derivation of PGT.
We denote the probability distribution of the starting state as p0 : N → [0, 1]. The policy
function approximation is denoted as π(s, a; θ) = P[At = a|St = s; θ].
The PG coverage is quite similar for non-discounted, non-episodic MDPs, by consider-
ing the average-reward objective, but we won’t cover it in this book.
Now we formalize the “Expected Returns” Objective J(θ):

J(θ) = E_π[Σ_{t=0}^{∞} γ^t · R_{t+1}]

Value Function V^π(s) and Action-Value Function Q^π(s, a) are defined as:

V^π(s) = E_π[Σ_{k=t}^{∞} γ^{k−t} · R_{k+1} | S_t = s] for all t = 0, 1, 2, . . .

Q^π(s, a) = E_π[Σ_{k=t}^{∞} γ^{k−t} · R_{k+1} | S_t = s, A_t = a] for all t = 0, 1, 2, . . .

J(θ), V π , Qπ are all measures of Expected Returns, so it pays to specify exactly how they
differ. J(θ) is the Expected Return when following policy π (that is parameterized by θ),
averaged over all states s ∈ N and all actions a ∈ A. The idea is to perform a gradient as-
cent with J(θ) as the objective function, with each step in the gradient ascent essentially
pushing θ (and hence, π) in a desirable direction, until J(θ) is maximized. V π (s) is the
Expected Return for a specific state s ∈ N when following policy π. Qπ (s, a) is the Ex-
pected Return for a specific state s ∈ N and specific action a ∈ A when following policy
π.
We define the Advantage Function as:

Aπ (s, a) = Qπ (s, a) − V π (s)

The advantage function captures how much more value a particular action provides
relative to the average value across actions (for a given state). The advantage function
plays an important role in reducing the variance in PG Algorithms.
Also, p(s → s′ , t, π) will be a key function for us in the PGT proof - it denotes the prob-
ability of going from state s to s′ in t steps by following policy π.
We express the “Expected Returns” Objective J(θ) as follows:

J(θ) = E_π[Σ_{t=0}^{∞} γ^t · R_{t+1}] = Σ_{t=0}^{∞} γ^t · E_π[R_{t+1}]
     = Σ_{t=0}^{∞} γ^t · Σ_{s∈N} (Σ_{S_0∈N} p_0(S_0) · p(S_0 → s, t, π)) · Σ_{a∈A} π(s, a; θ) · R^a_s
     = Σ_{s∈N} (Σ_{S_0∈N} Σ_{t=0}^{∞} γ^t · p_0(S_0) · p(S_0 → s, t, π)) · Σ_{a∈A} π(s, a; θ) · R^a_s

Definition 12.2.1.

J(θ) = Σ_{s∈N} ρ^π(s) · Σ_{a∈A} π(s, a; θ) · R^a_s

where

ρ^π(s) = Σ_{S_0∈N} Σ_{t=0}^{∞} γ^t · p_0(S_0) · p(S_0 → s, t, π)

is the key function (for PG) that we shall refer to as the Discounted-Aggregate State-Visitation
Measure. Note that ρ^π(s) is a measure over the set of non-terminal states, but is not a prob-
ability measure. Think of ρ^π(s) as weights reflecting the relative likelihood of occurrence
of states on a trace experience (adjusted for discounting, i.e., lesser importance to reach-
ing a state later on a trace experience). We can still talk about the distribution of states
under the measure ρ^π, but we say that this distribution is improper to convey the fact that
Σ_{s∈N} ρ^π(s) ≠ 1 (i.e., the distribution is not normalized). We talk about this improper dis-
tribution of states under the measure ρ^π so we can use (as a convenience) the “expected
value” notation for any random variable f : N → R under this improper distribution, i.e.,
we use the notation:

E_{s∼ρ^π}[f(s)] = Σ_{s∈N} ρ^π(s) · f(s)

Using this notation, we can re-write the above definition of J(θ) as:

J(θ) = E_{s∼ρ^π, a∼π}[R^a_s]
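As a quick sanity check on why ρ^π is not a probability measure: in the continuing (non-episodic) case, where every t-step transition lands in a non-terminal state so that Σ_{s∈N} p(S_0 → s, t, π) = 1 for each t, we get

Σ_{s∈N} ρ^π(s) = Σ_{t=0}^{∞} γ^t · Σ_{S_0∈N} p_0(S_0) · Σ_{s∈N} p(S_0 → s, t, π) = Σ_{t=0}^{∞} γ^t = 1/(1 − γ)

which is not equal to 1 for γ < 1, consistent with ρ^π being an improper (unnormalized) distribution.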

12.2.2. Statement of the Policy Gradient Theorem


The Policy Gradient Theorem (PGT) provides a powerful formula for the gradient of J(θ)
with respect to θ so we can perform Gradient Ascent. The key challenge is that J(θ) de-
pends not only on the selection of actions through policy π (parameterized by θ), but also
on the probability distribution of occurrence of states (also affected by π, and hence by
θ). With knowledge of the functional form of π on θ, it is not difficult to evaluate the
dependency of actions selection on θ, but evaluating the dependency of the probability
distribution of occurrence of states on θ is difficult since the environment only provides
atomic experiences at a time (and not probabilities of transitions). However, the PGT (be-
low) comes to our rescue because the gradient of J(θ) with respect to θ involves only
the gradient of π with respect to θ, and not the gradient of the probability distribution of
occurrence of states with respect to θ. Precisely, we have:
Theorem 12.2.1 (Policy Gradient Theorem).

∇_θ J(θ) = Σ_{s∈N} ρ^π(s) · Σ_{a∈A} ∇_θ π(s, a; θ) · Q^π(s, a)
As mentioned above, note that ρπ (s) (representing the discounting-adjusted probability
distribution of occurrence of states, ignoring normalizing factor turning the ρπ measure
into a probability measure) depends on θ but there’s no ∇θ ρπ (s) term in ∇θ J(θ).
Also note that:
∇θ π(s, a; θ) = π(s, a; θ) · ∇θ log π(s, a; θ)
∇θ log π(s, a; θ) is the Score function (Gradient of log-likelihood) that is commonly used
in Statistics.
Since ρπ is the Discounted-Aggregate State-Visitation Measure, we can sample-estimate ∇θ J(θ)
by calculating γ t ·(∇θ log π(St , At ; θ))·Qπ (St , At ) at each time step in each trace experience
(noting that the state occurrence probabilities and action occurrence probabilities are im-
plicit in the trace experiences, and ignoring the probability measure-normalizing factor),
and update the parameters θ (according to Stochastic Gradient Ascent) using each atomic
experience’s ∇θ J(θ) estimate.
We typically calculate the Score ∇θ log π(s, a; θ) using an analytically-convenient func-
tional form for the conditional probability distribution a|s (in terms of θ) so that the
derivative of the logarithm of this functional form is analytically tractable (this will be
clear in the next section when we consider a couple of examples of canonical functional
forms for a|s). In many PG Algorithms, we estimate Qπ (s, a) with a function approxima-
tion Q(s, a; w). We will later show how to avoid the estimate bias of Q(s, a; w).
Thus, the PGT enables a numerical estimate of ∇θ J(θ) which in turn enables Policy
Gradient Ascent.

12.2.3. Proof of the Policy Gradient Theorem


We begin the proof by noting that:

J(θ) = Σ_{S_0∈N} p_0(S_0) · V^π(S_0) = Σ_{S_0∈N} p_0(S_0) · Σ_{A_0∈A} π(S_0, A_0; θ) · Q^π(S_0, A_0)

We calculate ∇_θ J(θ) by its product parts π(S_0, A_0; θ) and Q^π(S_0, A_0).


∇_θ J(θ) = Σ_{S_0∈N} p_0(S_0) · Σ_{A_0∈A} ∇_θ π(S_0, A_0; θ) · Q^π(S_0, A_0)
         + Σ_{S_0∈N} p_0(S_0) · Σ_{A_0∈A} π(S_0, A_0; θ) · ∇_θ Q^π(S_0, A_0)

Now expand Q^π(S_0, A_0) as:

R^{A_0}_{S_0} + γ · Σ_{S_1∈N} P^{A_0}_{S_0,S_1} · V^π(S_1)    (Bellman Policy Equation)

∇_θ J(θ) = Σ_{S_0∈N} p_0(S_0) · Σ_{A_0∈A} ∇_θ π(S_0, A_0; θ) · Q^π(S_0, A_0)
         + Σ_{S_0∈N} p_0(S_0) · Σ_{A_0∈A} π(S_0, A_0; θ) · ∇_θ (R^{A_0}_{S_0} + γ · Σ_{S_1∈N} P^{A_0}_{S_0,S_1} · V^π(S_1))

Note: ∇_θ R^{A_0}_{S_0} = 0, so remove that term.

∇_θ J(θ) = Σ_{S_0∈N} p_0(S_0) · Σ_{A_0∈A} ∇_θ π(S_0, A_0; θ) · Q^π(S_0, A_0)
         + Σ_{S_0∈N} p_0(S_0) · Σ_{A_0∈A} π(S_0, A_0; θ) · ∇_θ (γ · Σ_{S_1∈N} P^{A_0}_{S_0,S_1} · V^π(S_1))

Now bring the ∇_θ inside the Σ_{S_1∈N} to apply only on V^π(S_1).

∇_θ J(θ) = Σ_{S_0∈N} p_0(S_0) · Σ_{A_0∈A} ∇_θ π(S_0, A_0; θ) · Q^π(S_0, A_0)
         + Σ_{S_0∈N} p_0(S_0) · Σ_{A_0∈A} π(S_0, A_0; θ) · γ · Σ_{S_1∈N} P^{A_0}_{S_0,S_1} · ∇_θ V^π(S_1)

Now bring Σ_{S_0∈N} and Σ_{A_0∈A} inside the Σ_{S_1∈N}.

∇_θ J(θ) = Σ_{S_0∈N} p_0(S_0) · Σ_{A_0∈A} ∇_θ π(S_0, A_0; θ) · Q^π(S_0, A_0)
         + γ · Σ_{S_1∈N} Σ_{S_0∈N} p_0(S_0) · (Σ_{A_0∈A} π(S_0, A_0; θ) · P^{A_0}_{S_0,S_1}) · ∇_θ V^π(S_1)

Note that Σ_{A_0∈A} π(S_0, A_0; θ) · P^{A_0}_{S_0,S_1} = p(S_0 → S_1, 1, π). So,

∇_θ J(θ) = Σ_{S_0∈N} p_0(S_0) · Σ_{A_0∈A} ∇_θ π(S_0, A_0; θ) · Q^π(S_0, A_0)
         + γ · Σ_{S_1∈N} Σ_{S_0∈N} p_0(S_0) · p(S_0 → S_1, 1, π) · ∇_θ V^π(S_1)

Now expand V^π(S_1) to Σ_{A_1∈A} π(S_1, A_1; θ) · Q^π(S_1, A_1).

∇_θ J(θ) = Σ_{S_0∈N} p_0(S_0) · Σ_{A_0∈A} ∇_θ π(S_0, A_0; θ) · Q^π(S_0, A_0)
         + γ · Σ_{S_1∈N} Σ_{S_0∈N} p_0(S_0) · p(S_0 → S_1, 1, π) · ∇_θ (Σ_{A_1∈A} π(S_1, A_1; θ) · Q^π(S_1, A_1))

We are now back to when we started calculating the gradient of Σ_a π · Q^π. Follow the same
process of calculating the gradient of π · Q^π by parts, then Bellman-expanding Q^π (to
calculate its gradient), and iterate.

∇_θ J(θ) = Σ_{S_0∈N} p_0(S_0) · Σ_{A_0∈A} ∇_θ π(S_0, A_0; θ) · Q^π(S_0, A_0)
         + γ · Σ_{S_1∈N} Σ_{S_0∈N} p_0(S_0) · p(S_0 → S_1, 1, π) · (Σ_{A_1∈A} ∇_θ π(S_1, A_1; θ) · Q^π(S_1, A_1) + . . .)

This iterative process leads us to:

∇_θ J(θ) = Σ_{t=0}^{∞} Σ_{S_t∈N} Σ_{S_0∈N} γ^t · p_0(S_0) · p(S_0 → S_t, t, π) · Σ_{A_t∈A} ∇_θ π(S_t, A_t; θ) · Q^π(S_t, A_t)

Bring Σ_{t=0}^{∞} inside Σ_{S_t∈N} and Σ_{S_0∈N}, and note that Σ_{A_t∈A} ∇_θ π(S_t, A_t; θ) · Q^π(S_t, A_t) is independent of t:

∇_θ J(θ) = Σ_{s∈N} Σ_{S_0∈N} Σ_{t=0}^{∞} γ^t · p_0(S_0) · p(S_0 → s, t, π) · Σ_{a∈A} ∇_θ π(s, a; θ) · Q^π(s, a)

Remember that Σ_{S_0∈N} Σ_{t=0}^{∞} γ^t · p_0(S_0) · p(S_0 → s, t, π) = ρ^π(s) (by definition). So,

∇_θ J(θ) = Σ_{s∈N} ρ^π(s) · Σ_{a∈A} ∇_θ π(s, a; θ) · Q^π(s, a)

Q.E.D.
This proof is borrowed from the Appendix of the famous paper by Sutton, McAllester,
Singh, Mansour on Policy Gradient Methods for Reinforcement Learning with Function
Approximation (R. Sutton et al. 2001).
Note that using the “Expected Value” notation under the improper distribution implied
by the Discounted-Aggregate State-Visitation Measure ρπ , we can write the statement of
PGT as:

∇_θ J(θ) = Σ_{s∈N} ρ^π(s) · Σ_{a∈A} π(s, a; θ) · (∇_θ log π(s, a; θ)) · Q^π(s, a)
         = E_{s∼ρ^π, a∼π}[(∇_θ log π(s, a; θ)) · Q^π(s, a)]
As explained earlier, since the state occurrence probabilities and action occurrence prob-
abilities are implicit in the trace experiences, we can sample-estimate ∇θ J(θ) by calculat-
ing γ t · (∇θ log π(St , At ; θ)) · Qπ (St , At ) at each time step in each trace experience, and
update the parameters θ (according to Stochastic Gradient Ascent) with this calculation.
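To make this concrete, a single such stochastic gradient ascent step on θ can be sketched as follows (a minimal illustration, not part of the book's code library; q_est stands for whatever estimate of Q^π(S_t, A_t) a specific algorithm uses, e.g., the trace experience return G_t in REINFORCE, or a Critic's value):

import numpy as np

def policy_gradient_ascent_step(theta: np.ndarray, score: np.ndarray, q_est: float,
                                gamma: float, t: int, alpha: float) -> np.ndarray:
    # score = grad_theta log pi(S_t, A_t; theta) for the atomic experience at step t
    return theta + alpha * (gamma ** t) * q_est * score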

12.3. Score function for Canonical Policy Functions


Now we illustrate how the Score function ∇θ π(s, a; θ) is calculated using an analytically-
convenient functional form for the conditional probability distribution a|s (in terms of θ)
so that the derivative of the logarithm of this functional form is analytically tractable. We
do this for a couple of canonical functional forms for a|s, one for finite action spaces and
one for single-dimensional continuous action spaces.

12.3.1. Canonical π(s, a; θ) for Finite Action Spaces


For finite action spaces, we often use the Softmax Policy. Assume θ is an m-vector (θ1 , . . . , θm )
and assume feature vector ϕ(s, a) is given by: (ϕ1 (s, a), . . . , ϕm (s, a)) for all s ∈ N , a ∈ A.
We weight actions using linear combinations of features, i.e., ϕ(s, a)T · θ, and we set the
action probabilities to be proportional to exponentiated weights, as follows:

π(s, a; θ) = e^{ϕ(s,a)^T · θ} / Σ_{b∈A} e^{ϕ(s,b)^T · θ} for all s ∈ N, a ∈ A

Then the score function is given by:

∇_θ log π(s, a; θ) = ϕ(s, a) − Σ_{b∈A} π(s, b; θ) · ϕ(s, b) = ϕ(s, a) − E_π[ϕ(s, ·)]

The intuitive interpretation is that the score function for an action a represents the “ad-
vantage” of the feature vector for action a over the mean feature vector (across all actions),
for a given state s.
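For concreteness, here is a minimal numpy sketch of this Softmax Policy and its score; the function name and the convention of a features matrix whose row b holds ϕ(s, b) are assumptions of this sketch, which is not part of the book's code library.

import numpy as np

def softmax_policy_and_score(features: np.ndarray, theta: np.ndarray, a: int):
    # features: shape (num_actions, m), with row b holding phi(s, b); theta: shape (m,)
    prefs = features @ theta                 # action preferences phi(s, b)^T . theta
    probs = np.exp(prefs - prefs.max())      # subtract max for numerical stability
    probs = probs / probs.sum()              # pi(s, . ; theta)
    score = features[a] - probs @ features   # phi(s, a) - E_pi[phi(s, .)]
    return probs, score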

12.3.2. Canonical π(s, a; θ) for Single-Dimensional Continuous Action Spaces
For single-dimensional continuous action spaces (i.e., A = R), we often use a Gaussian
distribution for the Policy. Assume θ is an m-vector (θ1 , . . . , θm ) and assume the state
features vector ϕ(s) is given by (ϕ1 (s), . . . , ϕm (s)) for all s ∈ N .
We set the mean of the gaussian distribution for the Policy as a linear combination of
state features, i.e., ϕ(s)T · θ, and we set the variance to be a fixed value, say σ 2 . We could
make the variance parameterized as well, but let’s work with fixed variance to keep things
simple.
The Gaussian policy selects an action a as follows:

a ∼ N(ϕ(s)^T · θ, σ^2) for a given s ∈ N

Then the score function is given by:

∇_θ log π(s, a; θ) = (a − ϕ(s)^T · θ) · ϕ(s) / σ^2
This is easily extensible to multi-dimensional continuous action spaces by considering
a multi-dimensional gaussian distribution for the Policy.
The intuitive interpretation is that the score function for an action a is proportional to
the feature vector for given state s scaled by the “advantage” of the action a over the mean
action (note: each a ∈ R).
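The corresponding calculation for the Gaussian policy is a one-liner. Below is a minimal sketch (an illustration, not part of the book's code library) that returns the score for a sampled action a:

import numpy as np

def gaussian_policy_score(phi_s: np.ndarray, theta: np.ndarray,
                          a: float, sigma: float) -> np.ndarray:
    # phi_s is the state feature vector phi(s); the policy mean is phi(s)^T . theta
    return (a - phi_s @ theta) * phi_s / (sigma * sigma)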
For each of the above two examples (finite action spaces and continuous action spaces),
think of the “features advantage” of an action as the compass for the Gradient Ascent.
The gradient estimate for an encountered action is proportional to the action’s “features
advantage” scaled by the action’s Value Function. The intuition is that the Gradient Ascent
encourages picking actions that are yielding more favorable outcomes (Policy Improvement)
so as to ultimately get to a point where the optimal action is selected for each state.

12.4. REINFORCE Algorithm (Monte-Carlo Policy Gradient)


Now we are ready to write our first Policy Gradient algorithm. As ever, the simplest al-
gorithm is a Monte-Carlo algorithm. In the case of Policy Gradient, a simple Monte-Carlo
calculation provides us with an algorithm known as REINFORCE, due to R.J.Williams
(Williams 1992), which we cover in this section.
We’ve already explained that we can calculate the Score function using an analytical
derivative of a specified functional form for π(St , At ; θ) for each atomic experience (St , At , Rt , St+1 ).
What remains is to obtain an estimate of Qπ (St , At ) for each atomic experience (St , At , Rt , St+1 ).
REINFORCE uses the trace experience return Gt for (St , At ), while following policy π, as
an unbiased sample of Qπ (St , At ). Thus, at every time step (i.e., at every atomic expe-
rience) in each episode, we estimate ∇θ J(θ) by calculating γ t · (∇θ log π(St , At ; θ)) · Gt
(noting that the state occurrence probabilities and action occurrence probabilities are im-
plicit in the trace experiences), and update the parameters θ at the end of each episode
(using each atomic experience’s ∇θ J(θ) estimate) according to Stochastic Gradient As-
cent as follows:

∆θ = α · γ t · (∇θ log π(St , At ; θ)) · Gt


where α is the learning rate.

This Policy Gradient algorithm is Monte-Carlo because it is not bootstrapped (com-
plete returns are used as an unbiased sample of Qπ , rather than a bootstrapped estimate).
In terms of our previously-described classification of RL algorithms as Value Function-
based or Policy-based or Actor-Critic, REINFORCE is a Policy-based algorithm since RE-
INFORCE does not involve learning a Value Function.
Now let’s write some code to implement the REINFORCE algorithm. In this chapter, we
will focus our Python code implementation of Policy Gradient algorithms to continuous
action spaces, although it should be clear based on the discussion so far that the Policy
Gradient approach applies to arbitrary action spaces (we’ve already seen an example of
the policy function parameterization for discrete action spaces). To keep things simple, the
function reinforce_gaussian below implements REINFORCE for the simple case of single-
dimensional continuous action space (i.e. A = R), although this can be easily extended to
multi-dimensional continuous action spaces. So in the code below, we work with a generic
state space given by TypeVar(’S’) and the action space is specialized to float (representing
R).
As seen earlier in the canonical example for single-dimensional continuous action space,
we assume a Gaussian distribution for the policy. Specifically, the policy is represented by
an arbitrary parameterized function approximation using the class FunctionApprox. As a
reminder, an instance of FunctionApprox represents a probability distribution function f
of the conditional random variable variable y|x where x belongs to an arbitrary domain
X and y ∈ R (probability of y conditional on x denoted as f (x; θ)(y) where θ denotes
the parameters of the FunctionApprox). Note that the evaluate method of FunctionApprox
takes as input an Iterable of x values and calculates g(x; θ) = Ef (x;θ) [y] for each of the x
values. In our case here, x represents non-terminal states in N and y represents actions
in R, so f (s; θ) denotes the probability distribution of actions, conditional on state s ∈ N ,
and g(s; θ) represents the Expected Value of (real-numbered) actions, conditional on state
s ∈ N . Since we have assumed the policy to be Gaussian,

π(s, a; θ) = (1 / √(2πσ^2)) · e^{−(a − g(s;θ))^2 / (2σ^2)}
To be clear, our code below works with the @abstractclass FunctionApprox (meaning
it is an arbitrary parameterized function approximation) with the assumption that the
probability distribution of actions given a state is Gaussian whose variance σ 2 is assumed
to be a constant. Assume we have m features for our function approximation, denoted as
ϕ(s) = (ϕ1 (s), . . . , ϕm (s)) for all s ∈ N .
σ is specified in the code below as policy_stdev. The input policy_mean_approx0: FunctionApprox[NonTerminal[S]]
specifies the function approximation we initialize the algorithm with (it is up to the user
of reinforce_gaussian to configure policy_mean_approx0 with the appropriate functional
form for the function approximation, the hyper-parameter values, and the initial values of
the parameters θ that we want to solve for).
The Gaussian policy (of the type GaussianPolicyFromApprox) selects an action a (given
state s) by sampling from the gaussian distribution defined by mean g(s; θ) and variance
σ2.
The score function is given by:

∇_θ log π(s, a; θ) = (a − g(s; θ)) · ∇_θ g(s; θ) / σ^2
The outer loop of while True: loops over trace experiences produced by the method
simulate_actions of the input mdp for a given input start_states_distribution (specify-

ing the initial states distribution p0 : N → [0, 1]), and the current policy π (that is param-
eterized by θ, which updates after each trace experience). The inner loop loops over an
Iterator of step: ReturnStep[S, float] objects produced by the returns method for each
trace experience.
The variable grad is assigned the value of the negative score for an encountered (St , At )
in a trace experience, i.e., it is assigned the value:

−∇_θ log π(S_t, A_t; θ) = (g(S_t; θ) − A_t) · ∇_θ g(S_t; θ) / σ^2
We negate the sign of the score because we are performing Gradient Ascent rather than
Gradient Descent (the FunctionApprox class has been written for Gradient Descent). The
variable scaled_grad multiplies the negative of the score (grad) with γ^t (gamma_prod) and
the return Gt (step.return_). The rest of the code should be self-explanatory.
reinforce_gaussian returns an Iterable of FunctionApprox representing the stream of
updated policies π(s, ·; θ), with each of these FunctionApprox being generated (using yield)
at the end of each trace experience.

from dataclasses import dataclass
from typing import Iterable, Iterator, Sequence, TypeVar

import numpy as np
from rl.distribution import Distribution, Gaussian
from rl.policy import Policy
from rl.markov_process import NonTerminal
from rl.markov_decision_process import MarkovDecisionProcess, TransitionStep
from rl.function_approx import FunctionApprox, Gradient
from rl.returns import returns

S = TypeVar('S')
@dataclass(frozen=True)
class GaussianPolicyFromApprox(Policy[S, float]):
function_approx: FunctionApprox[NonTerminal[S]]
stdev: float
def act(self, state: NonTerminal[S]) -> Gaussian:
return Gaussian(
mu=self.function_approx(state),
sigma=self.stdev
)
def reinforce_gaussian(
mdp: MarkovDecisionProcess[S, float],
policy_mean_approx0: FunctionApprox[NonTerminal[S]],
start_states_distribution: Distribution[NonTerminal[S]],
policy_stdev: float,
gamma: float,
episode_length_tolerance: float
) -> Iterator[FunctionApprox[NonTerminal[S]]]:
policy_mean_approx: FunctionApprox[NonTerminal[S]] = policy_mean_approx0
yield policy_mean_approx
while True:
policy: Policy[S, float] = GaussianPolicyFromApprox(
function_approx=policy_mean_approx,
stdev=policy_stdev
)
trace: Iterable[TransitionStep[S, float]] = mdp.simulate_actions(
start_states=start_states_distribution,
policy=policy
)
gamma_prod: float = 1.0
for step in returns(trace, gamma, episode_length_tolerance):
def obj_deriv_out(
states: Sequence[NonTerminal[S]],
actions: Sequence[float]
) -> np.ndarray:

return (policy_mean_approx.evaluate(states) -
np.array(actions)) / (policy_stdev * policy_stdev)
grad: Gradient[FunctionApprox[NonTerminal[S]]] = \
policy_mean_approx.objective_gradient(
xy_vals_seq=[(step.state, step.action)],
obj_deriv_out_fun=obj_deriv_out
)
scaled_grad: Gradient[FunctionApprox[NonTerminal[S]]] = \
grad * gamma_prod * step.return_
policy_mean_approx = \
policy_mean_approx.update_with_gradient(scaled_grad)
gamma_prod *= gamma
yield policy_mean_approx

The above code is in the file rl/policy_gradient.py.

12.5. Optimal Asset Allocation (Revisited)


In this chapter, we will test the PG algorithms we implement on the Optimal Asset Alloca-
tion problem of Chapter 6, specifically the setting of the class AssetAllocDiscrete covered
in Section 6.5. As a reminder, in this setting, we have a single risky asset and at each of a
fixed finite number of time steps, one has to make a choice of the quantity of current wealth
to invest in the risky asset (remainder in the riskless asset) with the goal of maximizing
the expected utility of wealth at the end of the finite horizon. Thus, this finite-horizon
MDP’s state at any time t is the pair (t, Wt ) where Wt ∈ R denotes the wealth at time t,
and the action at time t is the investment xt ∈ R in the risky asset.
We provided an ADP backward-induction solution to this problem in Chapter 4, im-
plemented with AssetAllocDiscrete (code in rl/chapter7/asset_alloc_discrete.py). Now
we want to solve it with PG algorithms, starting with REINFORCE. So we require a new
interface and hence, we implement a new class AssetAllocPG with appropriate tweaks to
AssetAllocDiscrete. The key change in the interface is that we have inputs policy_feature_funcs,
policy_mean_dnn_spec and policy_stdev (see code below). policy_feature_funcs repre-
sents the sequence of feature functions for the FunctionApprox representing the mean ac-
tion for a given state (i.e., g(s; θ) = Ef (s;θ) [a] where f represents the policy probability
distribution of actions for a given state). policy_mean_dnn_spec specifies the architecture
of a deep neural network a user would like to use for the FunctionApprox. policy_stdev
represents the fixed standard deviation σ of the policy probability distribution of actions
for any state. Unlike the backward-induction solution of AssetAllocDiscrete where we
had to model a separate MDP for each time step in the finite horizon (where the state for
each time step’s MDP is the wealth), here we model a single MDP across all time steps with
the state as the pair of time step index t and the wealth Wt . AssetAllocState = Tuple[int,
float] is the data type for the state (t, Wt ).
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple

from rl.distribution import Distribution
from rl.function_approx import DNNSpec

AssetAllocState = Tuple[int, float]

@dataclass(frozen=True)
class AssetAllocPG:
    risky_return_distributions: Sequence[Distribution[float]]
    riskless_returns: Sequence[float]
    utility_func: Callable[[float], float]
    policy_feature_funcs: Sequence[Callable[[AssetAllocState], float]]
    policy_mean_dnn_spec: DNNSpec
    policy_stdev: float
    initial_wealth_distribution: Distribution[float]

The method get_mdp below sets up this MDP (should be self-explanatory as the con-
struction is very similar to the construction of the single-step MDPs in AssetAllocDiscrete).

from rl.distribution import SampledDistribution


def time_steps(self) -> int:
return len(self.risky_return_distributions)
def get_mdp(self) -> MarkovDecisionProcess[AssetAllocState, float]:
steps: int = self.time_steps()
distrs: Sequence[Distribution[float]] = self.risky_return_distributions
rates: Sequence[float] = self.riskless_returns
utility_f: Callable[[float], float] = self.utility_func
class AssetAllocMDP(MarkovDecisionProcess[AssetAllocState, float]):
def step(
self,
state: NonTerminal[AssetAllocState],
action: float
) -> SampledDistribution[Tuple[State[AssetAllocState], float]]:
def sr_sampler_func(
state=state,
action=action
) -> Tuple[State[AssetAllocState], float]:
time, wealth = state.state
next_wealth: float = action * (1 + distrs[time].sample()) \
+ (wealth - action) * (1 + rates[time])
reward: float = utility_f(next_wealth) \
if time == steps - 1 else 0.
next_pair: AssetAllocState = (time + 1, next_wealth)
next_state: State[AssetAllocState] = \
Terminal(next_pair) if time == steps - 1 \
else NonTerminal(next_pair)
return (next_state, reward)
return SampledDistribution(sampler=sr_sampler_func)
def actions(self, state: NonTerminal[AssetAllocState]) \
-> Sequence[float]:
return []
return AssetAllocMDP()

The methods start_states_distribution and policy_mean_approx below create the SampledDistribution
of start states and the DNNApprox representing the mean action for a given state, respectively.
Finally, the reinforce method below simply collects all the ingredients and passes them along
to reinforce_gaussian to solve this asset allocation problem.

from rl.function_approx import AdamGradient, DNNApprox


def start_states_distribution(self) -> \
SampledDistribution[NonTerminal[AssetAllocState]]:
def start_states_distribution_func() -> NonTerminal[AssetAllocState]:
wealth: float = self.initial_wealth_distribution.sample()
return NonTerminal((0, wealth))
return SampledDistribution(sampler=start_states_distribution_func)
def policy_mean_approx(self) -> \
FunctionApprox[NonTerminal[AssetAllocState]]:
adam_gradient: AdamGradient = AdamGradient(
learning_rate=0.003,
decay1=0.9,
decay2=0.999
)
ffs: List[Callable[[NonTerminal[AssetAllocState]], float]] = []

for f in self.policy_feature_funcs:
def this_f(st: NonTerminal[AssetAllocState], f=f) -> float:
return f(st.state)
ffs.append(this_f)
return DNNApprox.create(
feature_functions=ffs,
dnn_spec=self.policy_mean_dnn_spec,
adam_gradient=adam_gradient
)
def reinforce(self) -> \
Iterator[FunctionApprox[NonTerminal[AssetAllocState]]]:
return reinforce_gaussian(
mdp=self.get_mdp(),
policy_mean_approx0=self.policy_mean_approx(),
start_states_distribution=self.start_states_distribution(),
policy_stdev=self.policy_stdev,
gamma=1.0,
episode_length_tolerance=1e-5
)

The above code is in the file rl/chapter13/asset_alloc_pg.py.


Let’s now test this out on an instance of the problem for which we have a closed-form
solution (so we can verify the REINFORCE solution against the closed-form solution). The
special instance is the setting covered in Section 6.4 of Chapter 6. From Equation (6.25),
we know that the optimal action in state (t, Wt ) is linear in a single feature defined as
(1 + r)t where r is the constant risk-free rate across time steps. So we need to set up the
function approximation policy_mean_dnn_spec: DNNSpec as linear in this single feature (no
hidden layers and identity function as the output layer activation function), and check if
the optimized weight (coefficient of this single feature) matches up with the closed-form
solution of Equation (6.25).
Let us use similar settings that we had used in Chapter 6 to test AssetAllocDiscrete. In
the code below, we create an instance of AssetAllocPG with time steps T = 5, µ = 13%, σ =
20%, r = 7%, coefficient of CARA a = 1.0. We set up risky_return_distributions as a
sequence of identical Gaussian distributions, riskless_returns as a sequence of identical
riskless rate of returns, and utility_func as a lambda parameterized by the coefficient of
CARA a. We set the probability distribution of wealth at time t = 0 (start of each trace
experience) as N (1.0, 0.1), and we set the constant standard deviation σ of the policy prob-
ability distribution of actions for a given state as 0.5.
steps: int = 5
mu: float = 0.13
sigma: float = 0.2
r: float = 0.07
a: float = 1.0
init_wealth: float = 1.0
init_wealth_stdev: float = 0.1
policy_stdev: float = 0.5

Next, we print the closed-form solution of the optimal action for states at each time step
(note: the closed-form solution for optimal action is independent of wealth Wt , and is only
dependent on t).
base_alloc: float = (mu - r) / (a * sigma * sigma)
for t in range(steps):
    alloc: float = base_alloc / (1 + r) ** (steps - t - 1)
    print(f"Time {t:d}: Optimal Risky Allocation = {alloc:.3f}")

This prints:

Time 0: Optimal Risky Allocation = 1.144
Time 1: Optimal Risky Allocation = 1.224
Time 2: Optimal Risky Allocation = 1.310
Time 3: Optimal Risky Allocation = 1.402
Time 4: Optimal Risky Allocation = 1.500

Next we set up an instance of AssetAllocPG with the above parameters. Note that the
policy_mean_dnn_spec argument to the constructor of AssetAllocPG is set up as a trivial
neural network with no hidden layers and the identity function as the output layer activa-
tion function. Note also that the policy_feature_funcs argument to the constructor is set
up with the single feature function (1 + r)t .
from rl.distribution import Gaussian
from rl.function_approx import DNNSpec
risky_ret: Sequence[Gaussian] = [Gaussian(mu=mu, sigma=sigma)
for _ in range(steps)]
riskless_ret: Sequence[float] = [r for _ in range(steps)]
utility_function: Callable[[float], float] = lambda x: - np.exp(-a * x) / a
policy_feature_funcs: Sequence[Callable[[AssetAllocState], float]] = \
    [
        lambda w_t: (1 + r) ** w_t[0]  # feature (1 + r)^t for state w_t = (t, W_t)
    ]
init_wealth_distr: Gaussian = Gaussian(mu=init_wealth, sigma=init_wealth_stdev)
policy_mean_dnn_spec: DNNSpec = DNNSpec(
neurons=[],
bias=False,
hidden_activation=lambda x: x,
hidden_activation_deriv=lambda y: np.ones_like(y),
output_activation=lambda x: x,
output_activation_deriv=lambda y: np.ones_like(y)
)
aad: AssetAllocPG = AssetAllocPG(
risky_return_distributions=risky_ret,
riskless_returns=riskless_ret,
utility_func=utility_function,
policy_feature_funcs=policy_feature_funcs,
policy_mean_dnn_spec=policy_mean_dnn_spec,
policy_stdev=policy_stdev,
initial_wealth_distribution=init_wealth_distr
)

Next, we invoke the method reinforce of this AssetAllocPG instance. In practice, we’d
have parameterized the standard deviation of the policy probability distribution just like
we parameterized the mean of the policy probability distribution, and we’d have updated
those parameters in a similar manner (the standard deviation would converge to 0, i.e.,
the policy would converge to the optimal deterministic policy given by the closed-form
solution). As an exercise, extend the function reinforce_gaussian to include a second
FunctionApprox for the standard deviation of the policy probability distribution and up-
date this FunctionApprox along with the updates to the mean FunctionApprox. However,
since we set the standard deviation of the policy probability distribution to be a constant
σ and since we use a Monte-Carlo method, the variance of the mean estimate of the policy
probability distribution is significantly high. So we take the average of the mean estimate
over several iterations (below we average the estimate from iteration 10000 to iteration
20000).
reinforce_policies: Iterator[FunctionApprox[
NonTerminal[AssetAllocState]]] = aad.reinforce()

num_episodes: int = 10000
averaging_episodes: int = 10000
policies: Sequence[FunctionApprox[NonTerminal[AssetAllocState]]] = \
list(itertools.islice(
reinforce_policies,
num_episodes,
num_episodes + averaging_episodes
))
for t in range(steps):
    opt_alloc: float = np.mean([p(NonTerminal((t, init_wealth)))
                                for p in policies])
    print(f"Time {t:d}: Optimal Risky Allocation = {opt_alloc:.3f}")

This prints:

Time 0: Optimal Risky Allocation = 1.215


Time 1: Optimal Risky Allocation = 1.300
Time 2: Optimal Risky Allocation = 1.392
Time 3: Optimal Risky Allocation = 1.489
Time 4: Optimal Risky Allocation = 1.593

So we see that the estimate of the mean action for the 5 time steps from our implemen-
tation of the REINFORCE method gets fairly close to the closed-form solution.
The above code is in the file rl/chapter13/asset_alloc_reinforce.py. As ever, we encour-
age you to tweak the parameters and explore how the results vary.
As an exercise, we encourage you to implement an extension of this problem. Along
with the risky asset allocation choice as the action at each time step, also include a con-
sumption quantity (wealth to be extracted at each time step, along the lines of Merton’s
Dynamic Portfolio Allocation and Consumption problem) as part of the action at each
time step. So the action at each time step would be a pair (c, a) where c is the quantity to
consume and a is the quantity to allocate to the risky asset. Note that the consumption
is constrained to be non-negative and at most the amount of wealth at any time step (a is
unconstrained). The reward at each time step is the Utility of Consumption.

12.6. Actor-Critic and Variance Reduction


As we’ve mentioned in the previous section, REINFORCE has high variance since it’s a
Monte-Carlo method. So it can take quite long for REINFORCE to converge. A simple
way to reduce the variance is to use a function approximation for the Q-Value Function
instead of using the trace experience return as an unbiased sample of the Q-Value Func-
tion. Variance reduction happens from the simple fact that a function approximation of the
Q-Value Function updates gradually (using gradient descent) and so, does not vary enor-
mously like the trace experience returns would. Let us denote the function approximation
of the Q-Value Function as Q(s, a; w) where w denotes the parameters of the function
approximation. We refer to Q(s, a; w) as the Critic and we refer to the π(s, a; θ) function
approximation as the Actor. The two function approximations π(s, a; θ) and Q(s, a; w) col-
laborate to improve the policy using gradient ascent (guided by the PGT, using Q(s, a; w)
in place of the true Q-Value Function Qπ (s, a)). π(s, a; θ) is called Actor because it is the
primary worker and Q(s, a; w) is called Critic because it is the support worker. The intu-
itive way to think about this is that the Actor updates policy parameters in a direction that
is suggested by the Critic.

Bear in mind though that the efficient way to use the Critic is in the spirit of GPI, i.e., we
don’t take Q(s, a; w) for the current policy (current θ) all the way to convergence (thinking
about updates of w for a given Policy as Policy Evaluation phase of GPI). Instead, we
switch between Policy Evaluation (updates of w) and Policy Improvement (updates of
θ) quite frequently. In fact, with a bootstrapped (TD) approach, we would update both
w and θ after each atomic experience. w is updated such that a suitable loss function
is minimized. This can be done using any of the usual Value Function approximation
methods we have covered previously, including:

• Monte-Carlo, i.e., w updated using trace experience returns Gt .


• Temporal-Difference, i.e., w updated using TD Targets.
• TD(λ), i.e., w updated using targets based on eligibility traces.
• It could even be LSTD if we assume a linear function approximation for the critic
Q(s, a; w).

This method of calculating the gradient of J(θ) can be thought of as Approximate Policy
Gradient due to the bias of the Critic Q(s, a; w) (serving as an approximation of Qπ (s, a)),
i.e.,
∇_θ J(θ) ≈ Σ_{s∈N} ρ^π(s) · Σ_{a∈A} ∇_θ π(s, a; θ) · Q(s, a; w)

Now let’s implement some code to perform Policy Gradient with the Critic updated
using Temporal-Difference (again, for the simple case of single-dimensional continuous
action space). In the function actor_critic_gaussian below, the key changes (from the
code in reinforce_gaussian) are:

• The Q-Value function approximation parameters w are updated after each atomic
experience as:

∆w = β · (Rt+1 + γ · Q(St+1 , At+1 ; w) − Q(St , At ; w)) · ∇w Q(St , At ; w)

where β is the learning rate for the Q-Value function approximation.


• The policy mean parameters θ are updated after each atomic experience as:

∆θ = α · γ t · (∇θ log π(St , At ; θ)) · Q(St , At ; w)

(instead of α · γ t · (∇θ log π(St , At ; θ)) · Gt ).

from rl.approximate_dynamic_programming import QValueFunctionApprox


from rl.approximate_dynamic_programming import NTStateDistribution
def actor_critic_gaussian(
mdp: MarkovDecisionProcess[S, float],
policy_mean_approx0: FunctionApprox[NonTerminal[S]],
q_value_func_approx0: QValueFunctionApprox[S, float],
start_states_distribution: NTStateDistribution[S],
policy_stdev: float,
gamma: float,
max_episode_length: float
) -> Iterator[FunctionApprox[NonTerminal[S]]]:
policy_mean_approx: FunctionApprox[NonTerminal[S]] = policy_mean_approx0
yield policy_mean_approx
q: QValueFunctionApprox[S, float] = q_value_func_approx0
while True:
steps: int = 0

gamma_prod: float = 1.0
state: NonTerminal[S] = start_states_distribution.sample()
action: float = Gaussian(
mu=policy_mean_approx(state),
sigma=policy_stdev
).sample()
while isinstance(state, NonTerminal) and steps < max_episode_length:
next_state, reward = mdp.step(state, action).sample()
if isinstance(next_state, NonTerminal):
next_action: float = Gaussian(
mu=policy_mean_approx(next_state),
sigma=policy_stdev
).sample()
q = q.update([(
(state, action),
reward + gamma * q((next_state, next_action))
)])
action = next_action
else:
q = q.update([((state, action), reward)])
def obj_deriv_out(
states: Sequence[NonTerminal[S]],
actions: Sequence[float]
) -> np.ndarray:
return (policy_mean_approx.evaluate(states) -
np.array(actions)) / (policy_stdev * policy_stdev)
grad: Gradient[FunctionApprox[NonTerminal[S]]] = \
policy_mean_approx.objective_gradient(
xy_vals_seq=[(state, action)],
obj_deriv_out_fun=obj_deriv_out
)
scaled_grad: Gradient[FunctionApprox[NonTerminal[S]]] = \
grad * gamma_prod * q((state, action))
policy_mean_approx = \
policy_mean_approx.update_with_gradient(scaled_grad)
yield policy_mean_approx
gamma_prod *= gamma
steps += 1
state = next_state

The above code is in the file rl/policy_gradient.py. We leave it to you as an exercise to


implement the update of Q(s, a; w) with TD(λ), i.e., with eligibility traces.
We can reduce the variance of this Actor-Critic method by subtracting a Baseline Func-
tion B(s) from Q(s, a; w) in the Policy Gradient estimate. This means we update the pa-
rameters θ as:

∆θ = α · γ t · ∇θ log π(St , At ; θ) · (Q(St , At ; w) − B(St ))


Note that the Baseline Function B(s) is only a function of state s (and not of action a).
This ensures that subtracting the Baseline Function B(s) does not add bias. This is because:
Σ_{s∈N} ρ^π(s) · Σ_{a∈A} ∇_θ π(s, a; θ) · B(s)
= Σ_{s∈N} ρ^π(s) · B(s) · ∇_θ (Σ_{a∈A} π(s, a; θ))
= Σ_{s∈N} ρ^π(s) · B(s) · ∇_θ 1
= 0

A good Baseline Function B(s) is a function approximation V (s; v) of the State-Value
Function V π (s). So then we can rewrite the Actor-Critic Policy Gradient algorithm using
an estimate of the Advantage Function, as follows:

A(s, a; w, v) = Q(s, a; w) − V (s; v)

With this, the approximation for ∇θ J(θ) is given by:


∇_θ J(θ) ≈ Σ_{s∈N} ρ^π(s) · Σ_{a∈A} ∇_θ π(s, a; θ) · A(s, a; w, v)

The function actor_critic_advantage_gaussian in the file rl/policy_gradient.py imple-


ments this algorithm, i.e., Policy Gradient with two Critics Q(s, a; w) and V (s; v), each
updated using Temporal-Difference (again, for the simple case of single-dimensional con-
tinuous action space). Specifically, in the code of actor_critic_advantage_gaussian:

• The Q-Value function approximation parameters w are updated after each atomic
experience as:

∆w = βw · (Rt+1 + γ · Q(St+1 , At+1 ; w) − Q(St , At ; w)) · ∇w Q(St , At ; w)

where βw is the learning rate for the Q-Value function approximation.


• The State-Value function approximation parameters v are updated after each atomic
experience as:

∆v = βv · (Rt+1 + γ · V (St+1 ; v) − V (St ; v)) · ∇v V (St ; v)

where βv is the learning rate for the State-Value function approximation.


• The policy mean parameters θ are updated after each atomic experience as:

∆θ = α · γ t · (∇θ log π(St , At ; θ)) · (Q(St , At ; w) − V (St ; v))
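For concreteness, the three updates above can be sketched for a single atomic experience as follows, assuming linear critics Q(s, a; w) = ϕ(s, a)^T · w and V(s; v) = ϕ(s)^T · v (the function and argument names here are assumptions of this sketch, not the signature of actor_critic_advantage_gaussian):

import numpy as np

def advantage_actor_critic_step(w, v, theta, phi_sa, phi_sa_next, phi_s, phi_s_next,
                                score, r, gamma, t, beta_w, beta_v, alpha):
    # phi_sa, phi_s: features of (S_t, A_t) and S_t; the *_next features are zero
    # vectors if S_{t+1} is terminal; score = grad_theta log pi(S_t, A_t; theta)
    w = w + beta_w * (r + gamma * phi_sa_next @ w - phi_sa @ w) * phi_sa   # Critic Q update
    v = v + beta_v * (r + gamma * phi_s_next @ v - phi_s @ v) * phi_s      # Critic V update
    advantage = phi_sa @ w - phi_s @ v                                     # Q(s,a;w) - V(s;v)
    theta = theta + alpha * (gamma ** t) * advantage * score               # Actor update
    return w, v, theta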

A simpler way is to use the TD Error of the State-Value Function as an estimate of the
Advantage Function. To understand this idea, let δ π denote the TD Error for the true State-
Value Function V π (s). Then,

δ π = r + γ · V π (s′ ) − V π (s)

Note that δ π is an unbiased estimate of the Advantage function Aπ (s, a). This is because

Eπ [δ π |s, a] = Eπ [r + γ · V π (s′ )|s, a] − V π (s) = Qπ (s, a) − V π (s) = Aπ (s, a)

So we can write Policy Gradient in terms of Eπ [δ π |s, a]:


∇_θ J(θ) = Σ_{s∈N} ρ^π(s) · Σ_{a∈A} ∇_θ π(s, a; θ) · E_π[δ^π | s, a]

In practice, we use a function approximation for the TD error, and sample:

δ(s, r, s′ ; v) = r + γ · V (s′ ; v) − V (s; v)

This approach requires only one set of critic parameters v, and we don’t have to worry
about the Action-Value Function Q.
Now let’s implement some code for this TD Error-based PG Algorithm (again, for the
simple case of single-dimensional continuous action space). In the function actor_critic_td_error_gaussian
below:

• The State-Value function approximation parameters v are updated after each atomic
experience as:

∆v = αv · (Rt+1 + γ · V (St+1 ; v) − V (St ; v)) · ∇v V (St ; v)

where αv is the learning rate for the State-Value function approximation.


• The policy mean parameters θ are updated after each atomic experience as:

∆θ = αθ · γ t · (∇θ log π(St , At ; θ)) · (Rt+1 + γ · V (St+1 ; v) − V (St ; v))

where αθ is the learning rate for the Policy Mean function approximation.
from rl.approximate_dynamic_programming import ValueFunctionApprox
def actor_critic_td_error_gaussian(
mdp: MarkovDecisionProcess[S, float],
policy_mean_approx0: FunctionApprox[NonTerminal[S]],
value_func_approx0: ValueFunctionApprox[S],
start_states_distribution: NTStateDistribution[S],
policy_stdev: float,
gamma: float,
max_episode_length: float
) -> Iterator[FunctionApprox[NonTerminal[S]]]:
policy_mean_approx: FunctionApprox[NonTerminal[S]] = policy_mean_approx0
yield policy_mean_approx
vf: ValueFunctionApprox[S] = value_func_approx0
while True:
steps: int = 0
gamma_prod: float = 1.0
state: NonTerminal[S] = start_states_distribution.sample()
while isinstance(state, NonTerminal) and steps < max_episode_length:
action: float = Gaussian(
mu=policy_mean_approx(state),
sigma=policy_stdev
).sample()
next_state, reward = mdp.step(state, action).sample()
if isinstance(next_state, NonTerminal):
td_target: float = reward + gamma * vf(next_state)
else:
td_target = reward
td_error: float = td_target - vf(state)
vf = vf.update([(state, td_target)])
def obj_deriv_out(
states: Sequence[NonTerminal[S]],
actions: Sequence[float]
) -> np.ndarray:
return (policy_mean_approx.evaluate(states) -
np.array(actions)) / (policy_stdev * policy_stdev)
grad: Gradient[FunctionApprox[NonTerminal[S]]] = \
policy_mean_approx.objective_gradient(
xy_vals_seq=[(state, action)],
obj_deriv_out_fun=obj_deriv_out
)
scaled_grad: Gradient[FunctionApprox[NonTerminal[S]]] = \
grad * gamma_prod * td_error
policy_mean_approx = \
policy_mean_approx.update_with_gradient(scaled_grad)
yield policy_mean_approx
gamma_prod *= gamma
steps += 1
state = next_state

The above code is in the file rl/policy_gradient.py.

Likewise, we can implement an Actor-Critic algorithm using Eligibility Traces (i.e., TD(λ))
for the State-Value Function Approximation and also for the Policy Mean Function Ap-
proximation. The updates after each atomic experience to parameters v of the State-Value
function approximation and parameters θ of the policy mean function approximation are
given by:

∆v = αv · (Rt+1 + γ · V (St+1 ; v) − V (St ; v)) · Ev


∆θ = αθ · (Rt+1 + γ · V (St+1 ; v) − V (St ; v)) · Eθ
where the Eligibility Traces Ev and Eθ are updated after each atomic experience as follows:

Ev ← γ · λv · Ev + ∇v V (St ; v)

Eθ ← γ · λθ · Eθ + γ t · ∇θ log π(St , At ; θ)
where λv and λθ are the TD(λ) parameters respectively for the State-Value Function Ap-
proximation and the Policy Mean Function Approximation.
We encourage you to implement in code this Actor-Critic algorithm using Eligibility
Traces.
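As a starting point for that exercise, here is a minimal sketch (not part of the book's code library) of one atomic-experience update, assuming a linear State-Value function approximation V(s; v) = ϕ(s)^T · v:

import numpy as np

def actor_critic_traces_step(v, theta, trace_v, trace_theta, phi_s, phi_s_next,
                             score, r, gamma, t, lambda_v, lambda_theta,
                             alpha_v, alpha_theta):
    # phi_s_next is the zero vector if the next state is terminal
    # score = grad_theta log pi(S_t, A_t; theta)
    td_error = r + gamma * phi_s_next @ v - phi_s @ v
    trace_v = gamma * lambda_v * trace_v + phi_s              # grad_v V(S_t; v) = phi(S_t) for linear V
    trace_theta = gamma * lambda_theta * trace_theta + (gamma ** t) * score
    v = v + alpha_v * td_error * trace_v
    theta = theta + alpha_theta * td_error * trace_theta
    return v, theta, trace_v, trace_theta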
Now let’s compare these methods on the AssetAllocPG instance we had created earlier
to test REINFORCE, i.e., for time steps T = 5, µ = 13%, σ = 20%, r = 7%, coefficient
of CARA a = 1.0, probability distribution of wealth at the start of each trace experience
as N (1.0, 0.1), and constant standard deviation σ of the policy probability distribution of
actions for a given state as 0.5. The __main__ code in rl/chapter13/asset_alloc_pg.py eval-
uates the mean action for the start state of (t = 0, W0 = 1.0) after each episode (over 50,000
episodes) for each of the above-implemented PG algorithms’ function approximation for
the policy mean. It then plots the progress of the evaluated mean action for the start state
over the 50,000 episodes (each point plotted as an average over a batch of 200 episodes),
along with the benchmark of the optimal action for the start state from the known closed-
form solution. Figure 12.1 shows the graph, validating the points we have made above on
bias and variance of these algorithms.
Actor-Critic methods were developed in the late 1970s and 1980s, but not paid attention
to in the 1990s. In the past two decades, there has been a revival of Actor-Critic methods.
For a more detailed coverage of Actor-Critic methods, see the paper by Degris, White,
Sutton (Degris, White, and Sutton 2012).

12.7. Overcoming Bias with Compatible Function Approximation


We’ve talked a lot about reducing variance for faster convergence of PG Algorithms. Specif-
ically, we’ve talked about the following proxies for Qπ (s, a) in the form of Actor-Critic
algorithms in order to reduce variance.

• Q(s, a; w)
• A(s, a; w, v) = Q(s, a; w) − V (s; v)
• δ(s, s′ , r; v) = r + γ · V (s′ ; v) − V (s; v)

However, each of the above proxies for Qπ (s, a) in PG algorithms have a bias. In this
section, we talk about how to overcome bias. The basis for overcoming bias is an important
Theorem known as the Compatible Function Approximation Theorem. We state and prove this
theorem, and then explain how we could use it in a PG algorithm.

Figure 12.1.: Progress of PG Algorithms

Theorem 12.7.1 (Compatible Function Approximation Theorem). Let w^*_θ denote the Critic
parameters w that minimize the following mean-squared-error for given policy parameters θ:

Σ_{s∈N} ρ^π(s) · Σ_{a∈A} π(s, a; θ) · (Q^π(s, a) − Q(s, a; w))^2

Assume that the data type of θ is the same as the data type of w and furthermore, assume that for
any policy parameters θ, the Critic gradient at w^*_θ is compatible with the Actor score function,
i.e.,

∇_w Q(s, a; w^*_θ) = ∇_θ log π(s, a; θ) for all s ∈ N, for all a ∈ A

Then the Policy Gradient using critic Q(s, a; w^*_θ) is exact:

∇_θ J(θ) = Σ_{s∈N} ρ^π(s) · Σ_{a∈A} ∇_θ π(s, a; θ) · Q(s, a; w^*_θ)

Proof. For a given θ, since w^*_θ minimizes the mean-squared-error as defined above, we have:

Σ_{s∈N} ρ^π(s) · Σ_{a∈A} π(s, a; θ) · (Q^π(s, a) − Q(s, a; w^*_θ)) · ∇_w Q(s, a; w^*_θ) = 0

But since ∇_w Q(s, a; w^*_θ) = ∇_θ log π(s, a; θ), we have:

Σ_{s∈N} ρ^π(s) · Σ_{a∈A} π(s, a; θ) · (Q^π(s, a) − Q(s, a; w^*_θ)) · ∇_θ log π(s, a; θ) = 0

Therefore,

Σ_{s∈N} ρ^π(s) · Σ_{a∈A} π(s, a; θ) · Q^π(s, a) · ∇_θ log π(s, a; θ)
= Σ_{s∈N} ρ^π(s) · Σ_{a∈A} π(s, a; θ) · Q(s, a; w^*_θ) · ∇_θ log π(s, a; θ)

But ∇_θ J(θ) = Σ_{s∈N} ρ^π(s) · Σ_{a∈A} π(s, a; θ) · Q^π(s, a) · ∇_θ log π(s, a; θ)

So, ∇_θ J(θ) = Σ_{s∈N} ρ^π(s) · Σ_{a∈A} π(s, a; θ) · Q(s, a; w^*_θ) · ∇_θ log π(s, a; θ)
            = Σ_{s∈N} ρ^π(s) · Σ_{a∈A} ∇_θ π(s, a; θ) · Q(s, a; w^*_θ)

Q.E.D.

This proof originally appeared in the famous paper by Sutton, McAllester, Singh, Man-
sour on Policy Gradient Methods for Reinforcement Learning with Function Approxima-
tion (R. Sutton et al. 2001).
This means if the compatibility assumption of the Theorem is satisfied, we can use the
critic function approximation Q(s, a; wθ∗ ) and still have exact Policy Gradient (i.e., no bias
due to using a function approximation for the Q-Value Function). However, note that in
practice, we invoke the spirit of GPI and don’t take Q(s, a; w) to convergence for the current
θ. Rather, we update both w and θ frequently, and this turns out to be good enough in
terms of lowering the bias.
A simple way to enable Compatible Function Approximation is to make Q(s, a; w) a lin-
ear function approximation, with the features of the linear function approximation equal
to the Score of the policy function approximation, as follows:

Q(s, a; w) = ∑_{i=1}^{m} (∂ log π(s, a; θ) / ∂θi) · wi for all s ∈ N, for all a ∈ A

which means the feature functions η(s, a) = (η1 (s, a), η2 (s, a), . . . , ηm (s, a)) of the linear
function approximation are given by:

ηi(s, a) = ∂ log π(s, a; θ) / ∂θi for all s ∈ N, for all a ∈ A, for all i = 1, . . . , m

This means the vector of feature functions η is identically equal to the Score. Note that although
here we assume Q(s, a; w) to be a linear function approximation, the policy function ap-
proximation π(s, a; θ) can be more flexible. All that is required is that θ consists of exactly
m parameters (matching the number of parameters m of w) and that each of the partial
derivatives ∂ log π(s, a; θ) / ∂θi lines up with a corresponding feature ηi(s, a) of the lin-
ear function approximation Q(s, a; w). This means that as θ updates (as a consequence
of Stochastic Gradient Ascent), π(s, a; θ) updates, and consequently the feature functions
η(s, a) = ∇θ log π(s, a; θ) update. This means the feature vector η(s, a) is not constant
for a given (s, a) pair. Rather, the feature vector η(s, a) for a given (s, a) pair varies in
accordance with θ varying.
If we assume the canonical function approximation for π(s, a; θ) for finite action spaces
that we had described in Section 12.3, then:
η(s, a) = ϕ(s, a) − ∑_{b∈A} π(s, b; θ) · ϕ(s, b) for all s ∈ N, for all a ∈ A

Note the dependency of feature vector η(s, a) on θ.
If we assume the canonical function approximation for π(s, a; θ) for single-dimensional
continuous action spaces that we had described in Section 12.3, then:

η(s, a) = (a − ϕ(s)ᵀ · θ) · ϕ(s) / σ² for all s ∈ N, for all a ∈ A
Note the dependency of feature vector η(s, a) on θ.
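As a concrete illustration of the finite-action (softmax) case above, the following sketch computes η(s, a) and the compatible linear Q(s, a; w); it assumes a hypothetical action-feature function phi(s, a) returning a 1-D ndarray, and is not the book's implementation:

import numpy as np
from typing import Callable, Sequence, TypeVar

S = TypeVar('S')
A = TypeVar('A')

def softmax_probs(
    s: S, actions: Sequence[A],
    phi: Callable[[S, A], np.ndarray],   # assumed action-feature function phi(s, a)
    theta: np.ndarray
) -> np.ndarray:
    prefs = np.array([phi(s, b).dot(theta) for b in actions])
    prefs -= prefs.max()                 # subtract max for numerical stability
    exps = np.exp(prefs)
    return exps / exps.sum()

def compatible_features(
    s: S, a: A, actions: Sequence[A],
    phi: Callable[[S, A], np.ndarray],
    theta: np.ndarray
) -> np.ndarray:
    # eta(s, a) = phi(s, a) - sum_b pi(s, b; theta) * phi(s, b), i.e. the policy's Score
    probs = softmax_probs(s, actions, phi, theta)
    avg_phi = sum(p * phi(s, b) for p, b in zip(probs, actions))
    return phi(s, a) - avg_phi

def compatible_q(s: S, a: A, actions: Sequence[A], phi, theta, w) -> float:
    # the compatible critic is linear in eta(s, a), so its features move as theta moves
    return compatible_features(s, a, actions, phi, theta).dot(w)

Note how the features returned by compatible_features depend on theta through the softmax probabilities, which is exactly the dependency on θ highlighted above.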
We note that any compatible linear function approximation Q(s, a; w) serves as an ap-
proximation of the advantage function because:

∑_{a∈A} π(s, a; θ) · Q(s, a; w) = ∑_{a∈A} π(s, a; θ) · (∑_{i=1}^{m} (∂ log π(s, a; θ) / ∂θi) · wi)
= ∑_{a∈A} ∑_{i=1}^{m} (∂π(s, a; θ) / ∂θi) · wi = ∑_{i=1}^{m} (∑_{a∈A} ∂π(s, a; θ) / ∂θi) · wi
= ∑_{i=1}^{m} (∂/∂θi (∑_{a∈A} π(s, a; θ))) · wi = ∑_{i=1}^{m} (∂1/∂θi) · wi = 0

Denoting ∇θ log π(s, a; θ) as the score column vector SC(s, a; θ) and assuming compat-
ible linear-approximation critic:
∇θ J(θ) = ∑_{s∈N} ∑_{a∈A} ρπ(s) · π(s, a; θ) · SC(s, a; θ) · (SC(s, a; θ)ᵀ · wθ∗)
= ∑_{s∈N} ∑_{a∈A} ρπ(s) · π(s, a; θ) · (SC(s, a; θ) · SC(s, a; θ)ᵀ) · wθ∗
= Es∼ρπ,a∼π[SC(s, a; θ) · SC(s, a; θ)ᵀ] · wθ∗

Note that Es∼ρπ ,a∼π [SC(s, a; θ)·SC(s, a; θ)T ] is the Fisher Information Matrix F IMρπ ,π (θ)
with respect to s ∼ ρπ , a ∼ π. Therefore, we can write ∇θ J(θ) more succinctly as:

∇θ J(θ) = F IMρπ ,π (θ) · wθ∗ (12.1)

Thus, we can update θ after each atomic experience at time step t by calculating the
gradient of J(θ) for the atomic experience as the outer product of SC(St , At ; θ) with itself
(which gives an m × m matrix), then multiplying this matrix with the vector w, and then scaling
by γ^t, i.e.,

∆θ = αθ · γ^t · SC(St, At; θ) · SC(St, At; θ)ᵀ · w

The update for w after each atomic experience is the usual Q-Value Function Approx-
imation update with Q-Value loss function gradient for the atomic experience calculated
as:

∆w = αw · (Rt+1 + γ · SC(St+1 , At+1 ; θ)T · w − SC(St , At ; θ)T · w) · SC(St , At ; θ)

This completes our coverage of the basic Policy Gradient Methods. Next, we cover a
couple of special Policy Gradient Methods that have worked well in practice - Natural
Policy Gradient and Deterministic Policy Gradient.

12.8. Policy Gradient Methods in Practice
12.8.1. Natural Policy Gradient
Natural Policy Gradient (abbreviated NPG) is due to a paper by Kakade (Kakade 2001)
that utilizes the idea of Natural Gradient first introduced by Amari (Amari 1998). We
won’t cover the theory of Natural Gradient in detail here, and refer you to the above two
papers instead. Here we give a high-level overview of the concepts, and describe the al-
gorithm.
The core motivation for Natural Gradient is that when the parameters space has a certain
underlying structure (as is the case with the parameters space of θ in the context of maxi-
mizing J(θ)), the usual gradient does not represent its steepest descent direction, but the
Natural Gradient does. The steepest descent direction of an arbitrary function f (θ) to be
minimized is defined as the vector ∆θ that minimizes f (θ + ∆θ) under the constraint that
the length |∆θ| is a constant. In general, the length |∆θ| is defined with respect to some
positive-definite matrix G(θ) governed by the underlying structure of the θ parameters
space, i.e.,
|∆θ|2 = (∆θ)T · G(θ) · ∆θ
We can show that under the length metric defined by the matrix G, the steepest descent
direction is:

∇^nat_θ f(θ) = G⁻¹(θ) · ∇θ f(θ)

We refer to this steepest descent direction ∇^nat_θ f(θ) as the Natural Gradient. We can
update the parameters θ in this Natural Gradient direction in order to achieve steepest
descent (according to the matrix G), as follows:

∆θ = −α · ∇^nat_θ f(θ)

Amari showed that for a supervised learning problem of estimating the conditional
probability distribution of y|x with a function approximation (i.e., where the loss function
is defined as the KL divergence between the data distribution and the model distribution),
the matrix G is the Fisher Information Matrix for y|x.
Kakade specialized this idea of Natural Gradient to the case of Policy Gradient (naming
it Natural Policy Gradient) with the objective function f (θ) equal to the negative of the
Expected Returns J(θ). This gives the Natural Policy Gradient ∇^nat_θ J(θ) defined as:

∇^nat_θ J(θ) = FIMρπ,π(θ)⁻¹ · ∇θ J(θ)

where F IMρπ ,π denotes the Fisher Information Matrix with respect to s ∼ ρπ , a ∼ π.


We’ve noted in the previous section that if we enable Compatible Function Approxi-
mation with a linear function approximation for Qπ (s, a), then we have Equation (12.1),
i.e.,

∇θ J(θ) = F IMρπ ,π (θ) · wθ∗


This means:


∇^nat_θ J(θ) = wθ∗

This compact result enables a simple algorithm for Natural Policy Gradient (NPG):

• After each atomic experience, update Critic parameters w with the critic loss gradi-
ent as:

∆w = αw · (Rt+1 + γ · SC(St+1 , At+1 ; θ)T · w − SC(St , At ; θ)T · w) · SC(St , At ; θ)

• After each atomic experience, update Actor parameters θ in the direction of w:

∆θ = αθ · w
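The two bullet-point updates above can be sketched as follows, assuming a hypothetical score(s, a, theta) = ∇θ log π(s, a; θ) helper and a stream of atomic experiences (this is a sketch, not the book's rl library interface); the only difference from the compatible-critic update of the previous section is that the actor step moves θ directly in the direction of w:

import numpy as np

def natural_policy_gradient_episode(
    episode,            # assumed: iterable of (s, a, r, s_next, a_next, done)
    score,              # assumed: (s, a, theta) -> grad_theta log pi(s, a; theta)
    theta: np.ndarray,  # actor parameters
    w: np.ndarray,      # critic parameters of the compatible linear Q(s, a; w)
    gamma: float, alpha_w: float, alpha_theta: float
) -> None:
    for s, a, r, s_next, a_next, done in episode:
        sc = score(s, a, theta)
        q = sc.dot(w)                                       # Q(s, a; w) = SC(s, a; theta) . w
        q_next = 0.0 if done else score(s_next, a_next, theta).dot(w)
        td_error = r + gamma * q_next - q
        w += alpha_w * td_error * sc                        # critic update
        theta += alpha_theta * w                            # natural gradient actor step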

12.8.2. Deterministic Policy Gradient


Deterministic Policy Gradient (abbreviated DPG) is a creative adaptation of Policy Gra-
dient wherein instead of a parameterized function approximation for a stochastic policy,
we have a parameterized function approximation for a deterministic policy for the case of
continuous action spaces. DPG is due to a paper by Silver, Lever, Heess, Degris, Wierstra,
Riedmiller (Silver et al. 2014). DPG is expressed in terms of the Expected Gradient of the
Q-Value Function and can be estimated much more efficiently than the usual (stochastic)
PG. (Stochastic) PG integrates over both the state and action spaces, whereas DPG inte-
grates over only the state space. As a result, computing (stochastic) PG would require
more samples if the action space has many dimensions.
In Actor-Critic DPG, the Actor is the function approximation for the deterministic policy
and the Critic is the function approximation for the Q-Value Function. The paper by Silver
et al. provides a Compatible Function Approximation Theorem for DPG to overcome Critic
approximation bias. The paper also shows that DPG is the limiting case of (Stochastic) PG,
as policy variance tends to 0. This means the usual machinery of PG (such as Actor-Critic,
Compatible Function Approximation, Natural Policy Gradient etc.) is also applicable to
DPG.
We use the notation a = πD(s; θ) to represent the (potentially multi-dimensional) continuous-valued action a equal to the value of a deterministic policy function approximation πD (parameterized by θ), when evaluated for a state s.
The core idea of DPG can be understood intuitively by orienting on the basics of GPI and
specifically, on Policy Improvement in GPI. For continuous action spaces, greedy policy
improvement (with an argmax over actions, for each state) is problematic. So a simple and
attractive alternative is to move the policy in the direction of the gradient of the Q-Value
Function (rather than globally maximizing the Q-Value Function, at each step). Specifi-
cally, for each state s that is encountered, the policy approximation parameters θ are up-
dated in proportion to ∇θ Q(s, πD (s; θ)). Note that the direction of policy improvement is
different for each state, and so the average direction of policy improvements is given by:

Es∼ρµ [∇θ Q(s, πD (s; θ))]


where ρµ is the same Discounted-Aggregate State-Visitation Measure we had defined for
PG (now for exploratory behavior policy µ).
Using chain-rule, the above expression can be written as:


Es∼ρµ[∇θ πD(s; θ) · ∇a QπD(s, a)|_{a=πD(s;θ)}]

Note that ∇θ πD (s; θ) is a Jacobian matrix as it takes the partial derivatives of a poten-
tially multi-dimensional action a = πD (s; θ) with respect to each parameter in θ. As we’ve
pointed out during the coverage of (stochastic) PG, when θ changes, policy πD changes,

which changes the state distribution ρπD . So it’s not clear that this calculation indeed guar-
antees improvement - it doesn’t take into account the effect of changing θ on ρπD . How-
ever, as was the case in PGT, Deterministic Policy Gradient Theorem (abbreviated DPGT)
ensures that there is no need to compute the gradient of ρπD with respect to θ, and that
the update described above indeed follows the gradient of the Expected Return objective
function. We formalize this now by stating the DPGT.
Analogous to the Expected Returns Objective defined for (stochastic) PG, we define the
Expected Returns Objective J(θ) for DPG as:


J(θ) = EπD[∑_{t=0}^{∞} γ^t · Rt+1]
= ∑_{s∈N} ρπD(s) · Rs^{πD(s;θ)}
= Es∼ρπD[Rs^{πD(s;θ)}]

where

ρπD(s) = ∑_{S0∈N} ∑_{t=0}^{∞} γ^t · p0(S0) · p(S0 → s, t, πD)

is the Discounted-Aggregate State-Visitation Measure when following deterministic policy πD(s; θ).
With a derivation similar to the proof of the PGT, we have the DPGT, as follows:

Theorem 12.8.1 (Deterministic Policy Gradient Theorem). Given an MDP with action space
Rk , with appropriate gradient existence conditions,
∇θ J(θ) = ∑_{s∈N} ρπD(s) · ∇θ πD(s; θ) · ∇a QπD(s, a)|_{a=πD(s;θ)}
= Es∼ρπD[∇θ πD(s; θ) · ∇a QπD(s, a)|_{a=πD(s;θ)}]

In practice, we use an Actor-Critic algorithm with a function approximation Q(s, a; w) for the Q-Value Function as the Critic. Since the policy approximated is Deterministic, we
need to address the issue of exploration - this is typically done with Off-Policy Control
wherein we employ an exploratory (stochastic) behavior policy, while the policy being
approximated (and learnt with DPG) is the target (deterministic) policy. The Expected
Return Objective is a bit different in the case of Off-Policy - it is the Expected Q-Value for
the target policy under state-occurrence probabilities while following the behavior policy,
and the Off-Policy Deterministic Policy Gradient is an approximate (not exact) formula.
We avoid importance sampling in the Actor because DPG doesn’t involve an integral over
actions, and we avoid importance sampling in the Critic by employing Q-Learning. As a
result, for Off-Policy Actor-Critic DPG, we update the Critic parameters w and the Actor
parameters θ after each atomic experience in a trace experience generated by the behavior
policy.

∆w = αw · (Rt+1 + γ · Q(St+1 , πD (St+1 ; θ); w) − Q(St , At ; w)) · ∇w Q(St , At ; w)

∆θ = αθ · ∇θ πD(St; θ) · ∇a Q(St, a; w)|_{a=πD(St;θ)}
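To make the two updates concrete, here is a minimal sketch for a single atomic experience. It assumes a linear deterministic policy πD(s; θ) = θᵀ · ϕ(s) and a deliberately simple (not compatible) linear critic Q(s, a; w) = ϕ(s) · ws + a · wa, so that ∇a Q = wa; features(s) is a hypothetical state-feature function and the behavior action a comes from an exploratory behavior policy:

import numpy as np

def dpg_updates(
    s, a: np.ndarray, r: float, s_next, done: bool,
    features,                # assumed: state -> feature vector phi(s), shape (m,)
    theta: np.ndarray,       # actor weights, shape (m, k): pi_D(s; theta) = theta.T @ phi(s)
    w_s: np.ndarray,         # critic weights on state features, shape (m,)
    w_a: np.ndarray,         # critic weights on the action, shape (k,)
    gamma: float, alpha_w: float, alpha_theta: float
) -> None:
    phi, phi_next = features(s), features(s_next)
    q = phi.dot(w_s) + a.dot(w_a)                        # Q(St, At; w)
    a_next = theta.T.dot(phi_next)                       # deterministic target action
    q_next = 0.0 if done else phi_next.dot(w_s) + a_next.dot(w_a)
    td_error = r + gamma * q_next - q
    w_s += alpha_w * td_error * phi                      # critic update (grad_w Q)
    w_a += alpha_w * td_error * a
    # actor update: Jacobian of pi_D w.r.t. theta chained with grad_a Q (= w_a here)
    theta += alpha_theta * np.outer(phi, w_a)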

Critic Bias can be resolved with a Compatible Function Approximation Theorem for
DPG (see Silver et al. paper for details). Instabilities caused by Bootstrapped Off-Policy
Learning with Function Approximation can be resolved with Gradient Temporal Differ-
ence (GTD).

12.9. Evolutionary Strategies


We conclude this chapter with a section on Evolutionary Strategies - a class of algorithms to
solve MDP Control problems. We want to highlight right upfront that Evolutionary Strate-
gies are technically not RL algorithms (for reasons we shall illuminate once we explain the
technique of Evolutionary Strategies). However, Evolutionary Strategies can sometimes
be quite effective in solving MDP Control problems and so, we give them appropriate
coverage as part of a wide range of approaches to solve MDP Control. We cover them in
this chapter because of their superficial resemblance to Policy Gradient Algorithms (again,
they are not RL algorithms and hence, not Policy Gradient algorithms).
Evolutionary Strategies (abbreviated as ES) actually refers to a technique/approach that
is best understood as a type of Black-Box Optimization. It was popularized in the 1970s
as Heuristic Search Methods. It is loosely inspired by natural evolution of living beings. We
focus on a subclass of ES known as Natural Evolutionary Strategies (abbreviated as NES).
The original setting for this approach was quite generic and not at all specific to solving
MDPs. Let us understand this generic setting first. Given an objective function F (ψ),
where ψ refers to parameters, we consider a probability distribution pθ (ψ) over ψ, where
θ refers to the parameters of the probability distribution. The goal in this generic setting
is to maximize the average objective Eψ∼pθ [F (ψ)].
We search for optimal θ with stochastic gradient ascent as follows:

∇θ(Eψ∼pθ[F(ψ)]) = ∇θ(∫_ψ pθ(ψ) · F(ψ) · dψ)
= ∫_ψ ∇θ(pθ(ψ)) · F(ψ) · dψ
= ∫_ψ pθ(ψ) · ∇θ(log pθ(ψ)) · F(ψ) · dψ
= Eψ∼pθ[∇θ(log pθ(ψ)) · F(ψ)]    (12.2)

Now let’s see how NES can be applied to solving MDP Control. We set F (·) to be the
(stochastic) Return of an MDP. ψ corresponds to the parameters of a deterministic policy
πψ : N → A. ψ ∈ Rm is drawn from an isotropic m-variate Gaussian distribution, i.e.,
Gaussian with mean vector θ ∈ Rm and fixed diagonal covariance matrix σ 2 Im where
σ ∈ R is kept fixed and Im is the m × m identity matrix. The average objective (Expected
Return) can then be written as:

Eψ∼pθ [F (ψ)] = Eϵ∼N (0,Im ) [F (θ + σ · ϵ)]

where ϵ ∈ Rm is the standard normal random variable generating ψ. Hence, from Equa-
tion (12.2), the gradient (∇θ ) of Expected Return can be written as:

Eψ∼pθ [∇θ (log pθ (ψ)) · F (ψ)]

= Eψ∼N(θ,σ²Im)[∇θ(−(ψ − θ)ᵀ · (ψ − θ) / (2σ²)) · F(ψ)]
= (1/σ) · Eϵ∼N(0,Im)[ϵ · F(θ + σ · ϵ)]
Now we come up with a sampling-based algorithm to solve the MDP. The above formula
helps estimate the gradient of Expected Return by sampling several ϵ (each ϵ represents a
Policy πθ+σ·ϵ ), and averaging ϵ · F (θ + σ · ϵ) across a large set (n) of ϵ samples.
Note that evaluating F (θ + σ · ϵ) involves playing an episode for a given sampled ϵ, and
obtaining that episode’s Return F (θ + σ · ϵ). Hence, we have n values of ϵ, n Policies πθ+σ·ϵ ,
and n Returns F (θ + σ · ϵ).
Given the gradient estimate, we update θ in this gradient direction, which in turn leads
to new samples of ϵ (new set of Policies πθ+σ·ϵ ), and the process repeats until Eϵ∼N (0,Im ) [F (θ+
σ · ϵ)] is maximized.
The key inputs to the algorithm are:

• Learning rate (SGD Step Size) α


• Standard Deviation σ
• Initial value of parameter vector θ0

With these inputs, for each iteration t = 0, 1, 2, . . ., the algorithm performs the following
steps:

• Sample ϵ1 , ϵ2 , . . . ϵn ∼ N (0, Im ).
• Compute Returns Fi ← F (θt + σ · ϵi ) for i = 1, 2, . . . , n.
• θt+1 ← θt + (α / (n·σ)) · ∑_{i=1}^{n} ϵi · Fi
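A minimal sketch of this NES loop follows, assuming a hypothetical episode_return(psi) function that plays one episode with the deterministic policy parameterized by psi and returns that episode's Return:

import numpy as np

def nes_optimize(
    episode_return,       # assumed: psi (ndarray of shape (m,)) -> episode Return (float)
    theta0: np.ndarray,   # initial parameter vector theta_0
    alpha: float,         # SGD step size
    sigma: float,         # standard deviation of the parameter perturbations
    n: int,               # number of sampled perturbations per iteration
    num_iterations: int
) -> np.ndarray:
    theta = theta0.copy()
    m = len(theta)
    for _ in range(num_iterations):
        epsilons = np.random.randn(n, m)                      # eps_i ~ N(0, I_m)
        returns = np.array([episode_return(theta + sigma * eps)
                            for eps in epsilons])             # F_i
        # gradient estimate: (1 / (n * sigma)) * sum_i eps_i * F_i
        theta += alpha / (n * sigma) * epsilons.T.dot(returns)
    return theta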

On the surface, this NES algorithm looks like PG because it’s not Value Function-based
(it’s Policy-based, like PG). Also, similar to PG, it uses a gradient to move the policy to-
wards optimality. But, ES does not interact with the environment (like PG/RL does).
ES operates at a high-level, ignoring the (state, action, reward) interplay. Specifically, it
does not aim to assign credit to actions in specific states. Hence, ES doesn’t have the core
essence of RL: Estimating the Q-Value Function for a Policy and using it to Improve the Policy.
Therefore, we don’t classify ES as Reinforcement Learning. Rather, we consider ES to be
an alternative approach to RL Algorithms.
What is the effectiveness of ES compared to RL? The traditional view has been that ES
won’t work on high-dimensional problems. Specifically, ES has been shown to be data-
inefficient relative to RL. This is because ES resembles simple hill-climbing based only on
finite differences along a few random directions at each step. However, ES is very simple to
implement (no Value Function approximation or back-propagation needed), and is highly
parallelizable. ES has the benefits of being indifferent to distribution of rewards and to ac-
tion frequency, and is tolerant of long horizons. A paper from OpenAI Research (Salimans
et al. 2017) shows techniques to make NES more robust and more data-efficient, and they
demonstrate that NES has more exploratory behavior than advanced PG algorithms.

12.10. Key Takeaways from this Chapter


• Policy Gradient Algorithms are based on GPI with Policy Improvement as a Stochas-
tic Gradient Ascent for “Expected Returns” Objective J(θ) where θ are the parame-
ters of the function approximation for the Policy.

• Policy Gradient Theorem gives us a simple formula for ∇θ J(θ) in terms of the score
of the policy function approximation (i.e., gradient of the log of the policy with re-
spect to the policy parameters θ).
• We can reduce variance in PG algorithms by using a critic and by using an estimate
of the advantage function for the Q-Value Function.
• Compatible Function Approximation Theorem enables us to overcome bias in PG
Algorithms.
• Natural Policy Gradient and Deterministic Policy Gradient are specialized PG algo-
rithms that have worked well in practice.
• Evolutionary Strategies are technically not RL, but they resemble PG Algorithms and
can sometimes be quite effective in solving MDP Control problems.

Part IV.

Finishing Touches

13. Multi-Armed Bandits: Exploration versus
Exploitation
We learnt in Chapter 10 that balancing exploration and exploitation is vital in RL Control
algorithms. While we want to exploit actions that seem to be fetching good returns, we
also want to adequately explore all possible actions so we can obtain an accurate-enough
estimate of their Q-Values. We had mentioned that this is essentially the Explore-Exploit
dilemma of the famous Multi-Armed Bandit Problem. The Multi-Armed Bandit prob-
lem provides a simple setting to understand the explore-exploit tradeoff and to develop
explore-exploit balancing algorithms. The approaches followed by the Multi-Armed Ban-
dit algorithms are then well-transportable to the more complex setting of RL Control.
In this Chapter, we start by specifying the Multi-Armed Bandit problem, followed by
coverage of a variety of techniques to solve the Multi-Armed Bandit problem (i.e., effec-
tively balancing exploration against exploitation). We’ve actually seen one of these algo-
rithms already for RL Control - following an ϵ-greedy policy, which naturally is applicable
to the simpler setting of Multi-Armed Bandits. We had mentioned in Chapter 10 that we
can simply replace the ϵ-greedy approach with any other algorithm for explore-exploit
tradeoff. In this chapter, we consider a variety of such algorithms, many of which are far
more sophisticated compared to the simple ϵ-greedy approach. However, we cover these
algorithms for the simple setting of Multi-Armed Bandits as it promotes understanding
and development of intuition. After covering a range of algorithms for Multi-Armed Ban-
dits, we consider an extended problem known as Contextual Bandits, that is a step between
the Multi-Armed Bandits problem and the RL Control problem (in terms of problem com-
plexity). Finally, we explain how the algorithms for Multi-Armed Bandits can be easily
transported to the more nuanced/extended setting of Contextual Bandits, and further ex-
tended to RL Control.

13.1. Introduction to the Multi-Armed Bandit Problem


At various points in past chapters, we’ve emphasized the importance of the Explore-Exploit
tradeoff in the context of RL Control - selecting actions for any given state that balances
the notions of exploration and exploitation. If you think about it, you will realize that
many situations in business (and in our lives!) present this explore-exploit dilemma on
choices one has to make. Exploitation involves making choices that seem to be best based on
past outcomes, while Exploration involves making choices one hasn’t yet tried (or not tried
sufficiently enough).
Exploitation has intuitive notions of “being greedy” and of being “short-sighted,” and
too much exploitation could lead to some regret of having missed out on unexplored
“gems.” Exploration has intuitive notions of “gaining information” and of being “long-
sighted,” and too much exploration could lead to some regret of having wasted time on
“duds.” This naturally leads to the idea of balancing exploration and exploitation so we
can combine information-gains and greedy-gains in the most optimal manner. The natural
question then is whether we can set up this problem of explore-exploit dilemma in a mathematically disciplined manner. Before we do that, let’s look at a few common examples of
the explore-exploit dilemma.

13.1.1. Some Examples of Explore-Exploit Dilemma


• Restaurant Selection: We like to go to our favorite restaurant (Exploitation) but we
also like to try out a new restaurant (Exploration).
• Online Banner Advertisements: We like to repeat the most successful advertisement
(Exploitation) but we also like to show a new advertisement (Exploration).
• Oil Drilling: We like to drill at the best known location (Exploitation) but we also
like to drill at a new location (Exploration).
• Learning to play a game: We like to play the move that has worked well for us so far
(Exploitation) but we also like to play a new experimental move (Exploration).

The term Multi-Armed Bandit (abbreviated as MAB) is a spoof name that stands for
“Many One-Armed Bandits” and the term One-Armed Bandit refers to playing a slot-machine
in a casino (that has a single lever to be pulled, that presumably addicts us and eventually
takes away all our money, hence the term “bandit”). Multi-Armed Bandit refers to the
problem of playing several slot machines (each of which has an unknown fixed payout
probability distribution) in a manner that we can make the maximum cumulative gains
by playing over multiple rounds (by selecting a single slot machine in a single round).
The core idea is that to achieve maximum cumulative gains, one would need to balance
the notions of exploration and exploitation, no matter which selection strategy one would
pursue.

13.1.2. Problem Definition


Definition 13.1.1. A Multi-Armed Bandit (MAB) comprises:

• A finite set of Actions A (known as the ”arms”).

• Each action (”arm”) a ∈ A is associated with a probability distribution over R (unknown to the AI Agent) denoted as Ra, defined as:

Ra (r) = P[r|a] for all r ∈ R

• A time-indexed sequence of AI Agent-selected actions At ∈ A for time steps t = 1, 2, . . ., and a time-indexed sequence of Environment-generated Reward random vari-
ables Rt ∈ R for time steps t = 1, 2, . . ., with Rt randomly drawn from the probability
distribution RAt .

The AI Agent’s goal is to maximize the following Expected Cumulative Rewards over a
certain number of time steps T :
E[∑_{t=1}^{T} Rt]

So the AI Agent has T selections of actions to make (in sequence), basing each of those
selections only on the rewards it has observed before that time step (specifically, the AI
Agent does not have knowledge of the probability distributions Ra ). Any selection strat-
egy to maximize the Expected Cumulative Rewards risks wasting time on “duds” while
exploring and also risks missing untapped “gems” while exploiting.

It is immediately observable that the Environment doesn’t have a notion of State. When
the AI Agent selects an arm, the Environment simply samples from the probability distri-
bution for that arm. However, the AI Agent might maintain relevant features of the history
(of actions taken and rewards obtained) as its State, which would help the AI Agent in
making the arm-selection (action) decision. The arm-selection action is then based on
a (Policy) function of the agent’s State. So, the agent’s arm-selection strategy is basically
this Policy. Thus, even though a MAB is not posed as an MDP, the agent could model it
as an MDP and solve it with an appropriate Planning or Learning algorithm. However,
many MAB algorithms don’t take this formal MDP approach. Instead, they rely on heuris-
tic methods that don’t aim to optimize - they simply strive for good Cumulative Rewards
(in Expectation). Note that even in a simple heuristic algorithm, At is a random variable
simply because it is a function of past (random) rewards.

13.1.3. Regret
The idea of Regret is quite fundamental in designing algorithms for MAB. In this section,
we illuminate this idea.
We define the Action Value Q(a) as the (unknown) mean reward of action a, i.e.,

Q(a) = E[r|a]

We define the Optimal Value V ∗ and Optimal Action a∗ (noting that there could be multiple
optimal actions) as:
V∗ = max_{a∈A} Q(a) = Q(a∗)

We define Regret lt as the opportunity loss at a single time step t, as follows:

lt = E[V ∗ − Q(At )]

We define the Total Regret LT as the total opportunity loss, as follows:

LT = ∑_{t=1}^{T} lt = ∑_{t=1}^{T} E[V∗ − Q(At)]

Maximizing the Expected Cumulative Rewards is the same as Minimizing Total Regret.

13.1.4. Counts and Gaps


Let Nt (a) be the (random) number of selections of an action a across the first t time steps.
Let us refer to E[Nt (a)] for a given action-selection strategy as the Count of an action a
over the first t steps, denoted as Countt (a). Let us refer to the Value difference between an
action a and the optimal action a∗ as the Gap for a, denoted as ∆a, i.e.,

∆a = V ∗ − Q(a)

We define Total Regret as the sum-product (over actions) of Counts and Gaps, as follows:

LT = ∑_{t=1}^{T} E[V∗ − Q(At)] = ∑_{a∈A} E[NT(a)] · (V∗ − Q(a)) = ∑_{a∈A} CountT(a) · ∆a

A good algorithm ensures small Counts for large Gaps. The core challenge though is that
we don’t know the Gaps.
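As a minimal numeric illustration of this sum-product (with made-up counts and gaps for a hypothetical 3-arm problem):

import numpy as np

# hypothetical expected counts of each arm over T steps, and the arms' gaps
expected_counts = np.array([900., 70., 30.])   # Count_T(a) for 3 arms
gaps = np.array([0.0, 0.5, 1.2])               # Delta_a (the optimal arm has gap 0)
total_regret = expected_counts.dot(gaps)       # sum over a of Count_T(a) * Delta_a
print(total_regret)                            # 70*0.5 + 30*1.2 = 71.0

An algorithm that pushed more of the 70 + 30 suboptimal selections onto the arm with the smaller gap would achieve lower Total Regret for the same amount of exploration.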

In this chapter, we implement (in code) a few different algorithms for the MAB problem.
So let’s invest in an abstract base class whose interface can be implemented by each of the
algorithms we develop. The code for this abstract base class MABBase is shown below. It’s
constructor takes 3 inputs:

• arm_distributions which is a Sequence of Distribution[float]s, one for each arm.


• time_steps which represents the number of time steps T
• num_episodes which represents the number of episodes we can run the algorithm on
(each episode having T time steps), in order to produce metrics to evaluate how well
the algorithm does in expectation (averaged across the episodes).

Each of the algorithms we’d like to write simply needs to implement the @abstractmethod
get_episode_rewards_actions which is meant to return a 1-D ndarray of actions taken by
the algorithm across the T time steps (for a single episode), and a 1-D ndarray of rewards
produced in response to those actions.

from abc import ABC, abstractmethod
from typing import Sequence, Tuple

from rl.distribution import Distribution
from numpy import ndarray

class MABBase(ABC):
def __init__(
self,
arm_distributions: Sequence[Distribution[float]],
time_steps: int,
num_episodes: int
) -> None:
self.arm_distributions: Sequence[Distribution[float]] = \
arm_distributions
self.num_arms: int = len(arm_distributions)
self.time_steps: int = time_steps
self.num_episodes: int = num_episodes
@abstractmethod
def get_episode_rewards_actions(self) -> Tuple[ndarray, ndarray]:
pass

We write the following self-explanatory methods for the abstract base class MABBase:

from numpy import mean, vstack, cumsum, full, bincount


def get_all_rewards_actions(self) -> Sequence[Tuple[ndarray, ndarray]]:
return [self.get_episode_rewards_actions()
for _ in range(self.num_episodes)]
def get_rewards_matrix(self) -> ndarray:
return vstack([x for x, _ in self.get_all_rewards_actions()])
def get_actions_matrix(self) -> ndarray:
return vstack([y for _, y in self.get_all_rewards_actions()])
def get_expected_rewards(self) -> ndarray:
return mean(self.get_rewards_matrix(), axis=0)
def get_expected_cum_rewards(self) -> ndarray:
return cumsum(self.get_expected_rewards())
def get_expected_regret(self, best_mean) -> ndarray:
return full(self.time_steps, best_mean) - self.get_expected_rewards()
def get_expected_cum_regret(self, best_mean) -> ndarray:
return cumsum(self.get_expected_regret(best_mean))
def get_action_counts(self) -> ndarray:
return vstack([bincount(ep, minlength=self.num_arms)
for ep in self.get_actions_matrix()])

def get_expected_action_counts(self) -> ndarray:
return mean(self.get_action_counts(), axis=0)

The above code is in the file rl/chapter14/mab_base.py.


Next, we cover some simple heuristic algorithms.

13.2. Simple Algorithms


We consider algorithms that estimate a Q-Value Q̂t (a) for each a ∈ A, as an approximation
to the true Q-Value Q(a). The subscript t in Q̂t refers to the fact that this is an estimate after
t time steps that takes into account all of the information available up to t time steps.
A natural way of estimating Q̂t (a) is by rewards-averaging, i.e.,

Q̂t(a) = (1 / Nt(a)) · ∑_{s=1}^{t} Rs · I_{As=a}

where I refers to the indicator function.

13.2.1. Greedy and ϵ-Greedy


First consider an algorithm that never explores (i.e., always exploits). This is known as the
Greedy Algorithm which selects the action with highest estimated value, i.e.,

At = arg max Q̂t−1 (a)


a∈A

As ever, arg max ties are broken with an arbitrary rule in prioritizing actions. We’ve
noted in Chapter 10 that such an algorithm can lock into a suboptimal action forever (sub-
optimal a is an action for which ∆a > 0). This results in CountT (a) being a linear function
of T for some suboptimal a, which means the Total Regret is a linear function of T (we
refer to this as Linear Total Regret).
Now let’s consider the ϵ-greedy algorithm, which explores forever. At each time-step t:

• With probability 1 − ϵ, select action equal to arg maxa∈A Q̂t−1 (a)


• With probability ϵ, select a random action (uniformly) from A

A constant value of ϵ ensures a minimum regret proportional to the mean gap, i.e.,
lt ≥ (ϵ / |A|) · ∑_{a∈A} ∆a

Hence, the ϵ-Greedy algorithm also has Linear Total Regret.

13.2.2. Optimistic Initialization


Next we consider a simple and practical idea: Initialize Q̂0 (a) to a high value for all a ∈ A
and update Q̂t by incremental-averaging. Starting with N0 (a) ≥ 0 for all a ∈ A, the
updates at each time step t are as follows:

Nt (At ) = Nt−1 (At ) + 1

Q̂t(At) = Q̂t−1(At) + (Rt − Q̂t−1(At)) / Nt(At)
The idea here is that by setting a high initial value for the estimate of Q-Values (which we
refer to as Optimistic Initialization), we encourage systematic exploration early on. Another
way of doing optimistic initialization is to set a high value for N0 (a) for all a ∈ A, which
likewise encourages systematic exploration early on. However, these optimistic initializa-
tion ideas only serve to promote exploration early on and eventually, one can still lock into
a suboptimal action. Specifically, the Greedy algorithm together with optimistic initializa-
tion cannot be prevented from having Linear Total Regret in the general case. Likewise,
the ϵ-Greedy algorithm together with optimistic initialization cannot be prevented from
having Linear Total Regret in the general case. But in practice, these simple ideas of doing
optimistic initialization work quite well.

13.2.3. Decaying ϵt -Greedy Algorithm


The natural question that emerges is whether it is possible to construct an algorithm with
Sublinear Total Regret in the general case. Along these lines, we consider an ϵ-Greedy
algorithm with ϵ decaying as time progresses. We call such an algorithm Decaying ϵt -
Greedy.
For any fixed c > 0, consider a decay schedule for ϵ1 , ϵ2 , . . . as follows:

d = min_{a|∆a>0} ∆a

ϵt = min(1, c·|A| / (d²·(t + 1)))
It can be shown that this decay schedule achieves Logarithmic Total Regret. However,
note that the above schedule requires advance knowledge of the gaps ∆a (which by def-
inition, is not known to the AI Agent). In practice, implementing some decay schedule
helps considerably. Let’s now write some code to implement Decaying ϵt -Greedy algo-
rithm along with Optimistic Initialization.
The class EpsilonGreedy shown below implements the interface of the abstract base class
MABBase. Its constructor inputs arm_distributions, time_steps and num_episodes are the
inputs we have seen before (used to pass to the constructor of the abstract base class
MABBase). epsilon and epsilon_half_life are the inputs used to specify the declining tra-
jectory of ϵt . epsilon refers to ϵ0 (initial value of ϵ) and epsilon_half_life refers to the half
life of an exponentially-decaying ϵt (used in the @staticmethod get_epsilon_decay_func).
count_init and mean_init refer to values of N0 and Q̂0 respectively. get_episode_rewards_actions
implements MABBase’s @abstractmethod interface, and its code below should be self-explanatory.

from typing import Callable, List, Sequence, Tuple
from operator import itemgetter

from rl.distribution import Distribution, Range, Bernoulli
from rl.chapter14.mab_base import MABBase
from numpy import ndarray, empty

class EpsilonGreedy(MABBase):
def __init__(
self,
arm_distributions: Sequence[Distribution[float]],
time_steps: int,
num_episodes: int,
epsilon: float,
epsilon_half_life: float = 1e8,
count_init: int = 0,

mean_init: float = 0.,
) -> None:
if epsilon < 0 or epsilon > 1 or \
epsilon_half_life <= 1 or count_init < 0:
raise ValueError
super().__init__(
arm_distributions=arm_distributions,
time_steps=time_steps,
num_episodes=num_episodes
)
self.epsilon_func: Callable[[int], float] = \
EpsilonGreedy.get_epsilon_decay_func(epsilon, epsilon_half_life)
self.count_init: int = count_init
self.mean_init: float = mean_init
@staticmethod
def get_epsilon_decay_func(
epsilon,
epsilon_half_life
) -> Callable[[int], float]:
def epsilon_decay(
t: int,
epsilon=epsilon,
epsilon_half_life=epsilon_half_life
) -> float:
return epsilon * 2 ** -(t / epsilon_half_life)
return epsilon_decay
def get_episode_rewards_actions(self) -> Tuple[ndarray, ndarray]:
counts: List[int] = [self.count_init] * self.num_arms
means: List[float] = [self.mean_init] * self.num_arms
ep_rewards: ndarray = empty(self.time_steps)
ep_actions: ndarray = empty(self.time_steps, dtype=int)
for i in range(self.time_steps):
max_action: int = max(enumerate(means), key=itemgetter(1))[0]
epsl: float = self.epsilon_func(i)
action: int = max_action if Bernoulli(1 - epsl).sample() else \
Range(self.num_arms).sample()
reward: float = self.arm_distributions[action].sample()
counts[action] += 1
means[action] += (reward - means[action]) / counts[action]
ep_rewards[i] = reward
ep_actions[i] = action
return ep_rewards, ep_actions

The above code is in the file rl/chapter14/epsilon_greedy.py.


Figure 13.1 shows the results of running the above code for 1000 time steps over 500
episodes, with N0 and Q̂0 both set to 0. This graph was generated (see __main__ in rl/chapter14/epsilon_greedy.py)
by creating 3 instances of EpsilonGreedy - the first with epsilon set to 0 (i.e., Greedy), the
second with epsilon set to 0.12 and epsilon_half_life set to a very high value (i.e, ϵ-
Greedy, with no decay for ϵ), and the third with epsilon set to 0.12 and epsilon_half_life
set to 150 (i.e., Decaying ϵt -Greedy). We can see that Greedy produces Linear Total Regret
since it locks to a suboptimal value. We can also see that ϵ-Greedy has higher total regret
than Greedy initially because of exploration, and then settles in with Linear Total Regret,
commensurate with the constant amount of exploration (ϵ = 0.12 in this case). Lastly, we
can see that Decaying ϵt -Greedy produces Sublinear Total Regret as the initial effort spent
in exploration helps identify the best action and as time elapses, the exploration keeps
reducing so as to keep reducing the single-step regret.
In the __main__ code in rl/chapter14/epsilon_greedy.py, we encourage you to experi-
ment with different arm_distributions, epsilon, epsilon_half_life, count_init (N0) and
mean_init (Q̂0), observe how the graphs change, and develop better intuition for these
simple algorithms.

Figure 13.1.: Total Regret Curves

13.3. Lower Bound


It should be clear by now that we strive for algorithms with Sublinear Total Regret for any
MAB problem (i.e., without any prior knowledge of the arm-reward distributions Ra ).
Intuitively, the performance of any algorithm is determined by the similarity between the
optimal arm’s reward-distribution and the other arms’ reward-distributions. Hard MAB
problems are those with similar-distribution arms with different means Q(a). This can be
formally described in terms of the KL Divergence KL(Ra||Ra∗) and gaps ∆a. Indeed, Lai
and Robbins (Lai and Robbins 1985) established a logarithmic lower bound for the Asymptotic
Total Regret, with a factor expressed in terms of the KL Divergence KL(Ra||Ra∗) and
gaps ∆a . Specifically,

Theorem 13.3.1 (Lai and Robbins Lower-Bound). Asymptotic Total Regret is at least logarith-
mic in the number of time steps, i.e., as T → ∞,
LT ≥ log T · ∑_{a|∆a>0} 1/∆a ≥ log T · ∑_{a|∆a>0} ∆a / KL(Ra||Ra∗)

This makes intuitive sense because it would be hard for an algorithm to have low to-
tal regret if the KL Divergence of arm reward-distributions (relative to the optimal arm’s
reward-distribution) are low (i.e., arms that look distributionally-similar to the optimal
arm) but the Gaps (Expected Rewards of Arms relative to Optimal Arm) are not small
- these are the MAB problem instances where the algorithm will have a hard time iso-
lating the optimal arm simply from reward samples (we’d get similar sampling reward-
distributions of arms), and suboptimal arm selections would inflate the Total Regret.

Figure 13.2.: Q-Value Distributions

13.4. Upper Confidence Bound Algorithms


Now we come to an important idea that is central to many algorithms for MAB. This idea
goes by the catchy name of Optimism in the Face of Uncertainty. As ever, this idea is best un-
derstood with intuition first, followed by mathematical rigor. To develop intuition, imag-
ine you are given 3 arms. You’d like to develop an estimate of Q(a) = E[r|a] for each of
the 3 arms a. After playing the arms a few times, you start forming beliefs in your mind of
what the Q(a) might be for each arm. Unlike the simple algorithms we’ve seen so far where
one averaged the sample rewards for each arm to maintain a Q̂(a) estimate for each a, here
we maintain the sampling distribution of the mean rewards (for each a) that represents
our (probabilistic) beliefs of what Q(a) might be for each arm a.
To keep things simple, let’s assume the sampling distribution of the mean reward is a
gaussian distribution (for each a), and so we maintain an estimate of µa and σa for each
arm a to represent the mean and standard deviation of the sampling distribution of mean
reward for a. µa would be calculated as the average of the sample rewards seen so far for
an arm a. σa would be calculated as the standard error of the mean reward estimate, i.e.,
the sample standard deviation of the rewards seen so far, divided by the square root of the
number of samples (for a given arm a). Let us say that after playing the arms a few times,
we arrive at the gaussian sampling distribution of mean reward for each of the 3 arms, as
illustrated in Figure 13.2. Let’s refer to the three arms as red, blue and green. The normal
distributions in Figure 13.2 show the red arm as the solid curve, the blue arm as the dashed
curve and the green arm as the dotted-and-dashed curve. The blue arm has the highest σa .
This could be either because the sample standard deviation is high or it could be because
we have played the blue arm just a few times (remember the square root of number of
samples appears in the denominator of the standard error calculation). Now looking at
this Figure, we have to decide which arm to select next. The intuition behind Optimism in
the Face of Uncertainty is that the more uncertain we are about the Q(a) for an arm a, the
more important it is to play that arm. This is because more uncertainty on Q(a) makes it
more likely to be the best arm (all else being equal on the arms). The rough heuristic then
would be to select the arm with the highest value of µa + c · σa across the arms (for some
fixed c ∈ R+ ). Thus, we are comparing (across actions) c standard errors higher than the
mean reward estimate (i.e., the upper-end of an appropriate confidence interval for the
mean reward). In this Figure, let’s say µa + c · σa is highest for the blue arm. So we play
the blue arm, and let’s say we get a somewhat low reward for the blue arm. This might
do two things to the blue arm’s sampling distribution - it can move blue’s µa lower and it
can also lower blue’s σa (simply due to the fact that the number of blue arm samples
has grown). With the new µa and σa for the blue arm, let’s say the updated sampling
distributions are as shown in Figure 13.3. With the blue arm’s sampling distribution of the
mean reward narrower, let’s say the red arm now has the highest µa + c · σa , and so we
play the red arm. This process goes on until the sampling distributions get narrow enough
to give us adequate confidence in the mean rewards for the actions (i.e., obtain confident
estimates of Q(a)) so we can home in on the action with highest Q(a).

Figure 13.3.: Q-Value Distributions
It pays to emphasize that Optimism in the Face of Uncertainty is a great approach to resolve
the Explore-Exploit dilemma because you gain regardless of whether the exploration due
to Optimism produces large rewards or not. If it does produce large rewards, you gain
immediately by collecting the large rewards. If it does not produce large rewards, you still
gain by acquiring the knowledge that certain actions (that you have explored) might not
be the best actions, which helps you in the long-run by focusing your attention on other
actions.
A formalization of the above intuition on Optimism in the Face of Uncertainty is the idea
of Upper Confidence Bounds (abbreviated as UCB). The idea of UCB is that along with an
estimate Q̂t (a) (for each a after t time steps), we also maintain an estimate Ût (a) represent-
ing the upper confidence interval width for the mean reward of a (after t time steps) such
that Q(a) < Q̂t (a) + Ût (a) with high probability. This naturally depends on the number
of times that a has been selected so far (call it Nt (a)). A small value of Nt (a) would imply
a large value of Ût (a) since the estimate of the mean reward would be fairly uncertain. On
the other hand, a large value of Nt (a) would imply a small value of Ût (a) since the esti-

462
mate of the mean reward would be fairly certain. We refer to Q̂t (a) + Ût (a) as the Upper
Confidence Bound (or simply UCB). The idea is to select the action that maximizes the UCB.
Formally, the action At+1 selected for the next (t + 1) time step is as follows:

At+1 = arg max_{a∈A} {Q̂t(a) + Ût(a)}

Next, we develop the famous UCB1 Algorithm. In order to do that, we tap into an
important result from Statistics known as Hoeffding’s Inequality.

13.4.1. Hoeffding’s Inequality


We state Hoeffding’s Inequality without proof.

Theorem 13.4.1 (Hoeffding’s Inequality). Let X1, . . . , Xn be independent and identically distributed random variables in the range [0, 1], and let

X̄n = (1/n) · ∑_{i=1}^{n} Xi

be the sample mean. Then for any u ≥ 0,

P[E[X̄n] > X̄n + u] ≤ e^(−2nu²)
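For example, with n = 100 samples and u = 0.2, the bound is e^(−2·100·0.2²) = e^(−8) ≈ 0.0003, i.e., it is very unlikely that the true mean E[X̄n] exceeds the sample mean X̄n by more than 0.2.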

We can apply Hoeffding’s Inequality to MAB problem instances whose rewards have
probability distributions with [0, 1]-support. Conditioned on selecting action a at time step
t, sample mean X̄n specializes to Q̂t (a), and we set n = Nt (a) and u = Ût (a). Therefore,

P[Q(a) > Q̂t(a) + Ût(a)] ≤ e^(−2·Nt(a)·Ût(a)²)

Next, we pick a small probability p for Q(a) exceeding UCB Q̂t (a) + Ût (a). Now solve
for Ût(a), as follows:

e^(−2·Nt(a)·Ût(a)²) = p ⇒ Ût(a) = √(−log p / (2·Nt(a)))
We reduce p as we observe more rewards, e.g., p = t^(−α) (for some fixed α > 0). This ensures
we select the optimal action as t → ∞. Thus,

Ût(a) = √(α·log t / (2·Nt(a)))

13.4.2. UCB1 Algorithm


This yields the UCB1 algorithm by Auer, Cesa-Bianchi, Fischer (Auer, Cesa-Bianchi, and
Fischer 2002) for arbitrary-distribution arms bounded in [0, 1]:
At+1 = arg max_{a∈A} {Q̂t(a) + √(α·log t / (2·Nt(a)))}

It has been shown that the UCB1 Algorithm achieves logarithmic total regret asymptoti-
cally. Specifically,

Theorem 13.4.2 (UCB1 Logarithmic Total Regret). As T → ∞,
LT ≤ ∑_{a|∆a>0} (4α·log T / ∆a + 2α·∆a / (α − 1))

Now let’s implement the UCB1 Algorithm in code. The class UCB1 below implements
the interface of the abstract base class MABBase. We’ve implemented the below code for
rewards range [0, B] (adjusting the above UCB1 formula appropriately from [0, 1] range to
[0, B] range). B is specified as the constructor input bounds_range. The constructor input
alpha corresponds to the parameter α specified above. get_episode_rewards_actions im-
plements MABBase’s @abstractmethod interface, and its code below should be self-explanatory.
from typing import List, Sequence, Tuple
from operator import itemgetter

from rl.distribution import Distribution
from rl.chapter14.mab_base import MABBase
from numpy import ndarray, empty, sqrt, log

class UCB1(MABBase):
def __init__(
self,
arm_distributions: Sequence[Distribution[float]],
time_steps: int,
num_episodes: int,
bounds_range: float,
alpha: float
) -> None:
if bounds_range < 0 or alpha <= 0:
raise ValueError
super().__init__(
arm_distributions=arm_distributions,
time_steps=time_steps,
num_episodes=num_episodes
)
self.bounds_range: float = bounds_range
self.alpha: float = alpha
def get_episode_rewards_actions(self) -> Tuple[ndarray, ndarray]:
ep_rewards: ndarray = empty(self.time_steps)
ep_actions: ndarray = empty(self.time_steps, dtype=int)
for i in range(self.num_arms):
ep_rewards[i] = self.arm_distributions[i].sample()
ep_actions[i] = i
counts: List[int] = [1] * self.num_arms
means: List[float] = [ep_rewards[j] for j in range(self.num_arms)]
for i in range(self.num_arms, self.time_steps):
ucbs: Sequence[float] = [means[j] + self.bounds_range *
sqrt(0.5 * self.alpha * log(i) /
counts[j])
for j in range(self.num_arms)]
action: int = max(enumerate(ucbs), key=itemgetter(1))[0]
reward: float = self.arm_distributions[action].sample()
counts[action] += 1
means[action] += (reward - means[action]) / counts[action]
ep_rewards[i] = reward
ep_actions[i] = action
return ep_rewards, ep_actions

The above code is in the file rl/chapter14/ucb1.py. The code in __main__ sets up a
UCB1 instance with 6 arms, each having a binomial distribution with n = 10 and p =
{0.4, 0.8, 0.1, 0.5, 0.9, 0.2} for the 6 arms. When run with 1000 time steps, 500 episodes
and α = 4, we get the Total Regret Curve as shown in Figure 13.4.
We encourage you to modify the code in __main__ to model other distributions for the
arms, examine the results obtained, and develop more intuition for the UCB1 Algorithm.

Figure 13.4.: UCB1 Total Regret Curve

13.4.3. Bayesian UCB


The algorithms we have covered so far have not made any assumptions about the rewards
distributions Ra (except for the range of the rewards). Now we assume that the rewards
distributions are restricted to a family of analytically-tractable probability distributions,
which enables us to make analytically-favorable inferences about the rewards distribu-
tions. Let us refer to the sequence of distributions [Ra |a ∈ A] as R. To be clear, the
AI Agent (algorithm) does not have knowledge of R and aims to estimate R from the
rewards data obtained upon performing actions. Bayesian Bandit Algorithms (abbrevi-
ated as Bayesian Bandits) achieve this by maintaining an estimate of the probability distri-
bution over R based on rewards data seen for each of the selected arms. The idea is to
compute the posterior distribution P[R|Ht ] by exploiting prior knowledge of P[R], where
Ht = A1, R1, A2, R2, . . . , At, Rt is the history. Note that the prior distribution P[R] and the
posterior distribution P[R|Ht ] are probability distributions over probability distributions
(since each Ra in R is a probability distribution). This posterior distribution is then used
to guide exploration. This leads to two types of algorithms:

• Upper Confidence Bounds (Bayesian UCB), which we give an example of below.


• Probability Matching, which we cover in the next section in the form of Thompson
Sampling.

We get a better performance if our prior knowledge of P[R] is accurate. A simple exam-
ple of Bayesian UCB is to model independent Gaussian distributions. Assume the reward
distribution is Gaussian: Ra (r) = N (r; µa , σa2 ) for all a ∈ A, where µa and σa2 denote the
mean and variance respectively of the Gaussian reward distribution of a. The idea is to
compute a Gaussian posterior over µa , σa2 , as follows:
P[µa, σa²|Ht] ∝ P[µa, σa²] · ∏_{t|At=a} N(Rt; µa, σa²)

This posterior calculation can be performed in an incremental manner by updating P[µAt, σ²At|Ht]
after each time step t (observing Rt after selecting action At ). This incremental calcula-
tion with Bayesian updates to hyperparameters (parameters controlling the probability
distributions of µa and σa2 ) is described in detail in Section G.1 in Appendix G.
Given this posterior distribution for µa and σa2 for all a ∈ A after each time step t, we
select the action that maximizes the Expectation of “c standard-errors above mean”, i.e.,

At+1 = arg max_{a∈A} E_{P[µa,σa²|Ht]}[µa + c · σa / √(Nt(a))]
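As a minimal sketch (not the book's code), one could approximate this arg max by Monte-Carlo sampling from a Normal-Inverse-Gamma posterior for each arm, using hypothetical per-arm hyperparameters (θ, n, α, β) in the spirit of the Thompson Sampling implementation later in this chapter:

import numpy as np

def bayesian_ucb_select(
    posteriors,        # assumed: per-arm hyperparameter tuples (theta, n, alpha, beta)
    counts,            # Nt(a) for each arm
    c: float = 2.0,    # number of standard errors above the mean
    num_samples: int = 1000
) -> int:
    scores = []
    for (theta, n, alpha, beta), count in zip(posteriors, counts):
        # sample sigma^2 from an Inverse-Gamma(alpha, beta) posterior
        sigma2 = 1.0 / np.random.gamma(shape=alpha, scale=1.0 / beta, size=num_samples)
        # sample mu from N(theta, sigma^2 / n), conditioned on each sampled sigma^2
        mu = np.random.normal(loc=theta, scale=np.sqrt(sigma2 / n))
        # Monte-Carlo estimate of E[mu + c * sigma / sqrt(Nt(a))]
        scores.append(np.mean(mu + c * np.sqrt(sigma2) / np.sqrt(max(count, 1))))
    return int(np.argmax(scores))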

13.5. Probability Matching


As mentioned in the previous section, calculating the posterior distribution P[R|Ht ] after
each time step t also enables a different approach known as Probability Matching. The
idea behind Probability Matching is to select an action a probabilistically in proportion
to the probability that a might be the optimal action (based on the rewards data seen so
far). Before describing Probability Matching formally, we illustrate the idea with a simple
example to develop intuition.
Let us say we have only two actions a1 and a2 . For simplicity, let us assume that the
posterior distribution P[Ra1 |Ht ] has only two distribution outcomes (call them Ra11 and
Ra21 ) and that the posterior distribution P[Ra2 |Ht ] also has only two distribution outcomes
(call them Ra12 and Ra22 ). Typically, there will be an infinite (continuum) of distribution
outcomes for P[R|Ht ] - here we assume only two distribution outcomes for each of the
actions’ estimated conditional probability of rewards purely for simplicity so as to con-
vey the intuition behind Probability Matching. Assume that P[Ra1 = Ra11 |Ht ] = 0.7 and
P[Ra1 = Ra21 |Ht ] = 0.3, and that Ra11 has mean 5.0 and Ra21 has mean 10.0. Assume that
P[Ra2 = Ra12 |Ht ] = 0.2 and P[Ra2 = Ra22 |Ht ] = 0.8, and that Ra12 has mean 2.0 and Ra22
has mean 7.0.
Probability Matching calculates at each time step t how often does each action a have the
maximum E[r|a] among all actions, across all the probabilistic outcomes for the posterior
distribution P[R|Ht ], and then selects that action a probabilistically in proportion to this
calculation. Let’s do this probability calculation for our simple case of two actions and two
probabilistic outcomes each for the posterior distribution for each action. So here, we have
4 probabilistic outcomes when considering the two actions jointly, as follows:

• Outcome 1: Ra11 (with probability 0.7) and Ra12 (with probability 0.2). Thus, Out-
come 1 has probability 0.7 * 0.2 = 0.14. In Outcome 1, a1 has the maximum E[r|a]
among all actions since Ra11 has mean 5.0 and Ra12 has mean 2.0.
• Outcome 2: Ra11 (with probability 0.7) and Ra22 (with probability 0.8). Thus, Out-
come 2 has probability 0.7 * 0.8 = 0.56. In Outcome 2, a2 has the maximum E[r|a]
among all actions since Ra11 has mean 5.0 and Ra22 has mean 7.0.
• Outcome 3: Ra21 (with probability 0.3) and Ra12 (with probability 0.2). Thus, Out-
come 3 has probability 0.3 * 0.2 = 0.06. In Outcome 3, a1 has the maximum E[r|a]
among all actions since Ra21 has mean 10.0 and Ra12 has mean 2.0.
• Outcome 4: Ra21 (with probability 0.3) and Ra22 (with probability 0.8). Thus, Out-
come 4 has probability 0.3 * 0.8 = 0.24. In Outcome 4, a1 has the maximum E[r|a]
among all actions since Ra21 has mean 10.0 and Ra22 has mean 7.0.

Thus, a1 has the maximum E[r|a] among the two actions in Outcomes 1, 3 and 4, amount-
ing to a total outcomes probability of 0.14 + 0.06 + 0.24 = 0.44, and a2 has the maximum

E[r|a] among the two actions only in Outcome 2, which has an outcome probability of 0.56.
Therefore, in the next time step (t + 1), the Probability Matching method will select action
a1 with probability 0.44 and a2 with probability 0.56.
Generalizing this Probability Matching method to an arbitrary number of actions and
to an arbitrary number of probabilistic outcomes for the conditional reward distributions
for each action, we can write the probabilistic selection of actions at time step t + 1 as:

P[At+1|Ht] = P_{Dt∼P[R|Ht]}[EDt[r|At+1] > EDt[r|a] for all a ≠ At+1]    (13.1)

where Dt refers to a particular random outcome of a distribution of rewards for each action, drawn from the posterior distribution P[R|Ht]. As ever, ties between actions are
broken with an arbitrary rule prioritizing actions.
Note that the Probability Matching method is also based on the principle of Optimism
in the Face of Uncertainty because an action with more uncertainty in it’s mean reward is
more likely to have the highest mean reward among all actions (all else being equal), and
hence deserves to be selected more frequently.
We see that the Probability Matching approach is mathematically disciplined in driving
towards cumulative reward maximization while balancing exploration and exploitation.
However, the right-hand-side of Equation 13.1 can be difficult to compute analytically
from the posterior distributions. We resolve this difficulty with a sampling approach to
Probability Matching known as Thompson Sampling.

13.5.1. Thompson Sampling


We can reformulate the right-hand-side of Equation 13.1 as follows:

P[At+1|Ht] = P_{Dt∼P[R|Ht]}[EDt[r|At+1] > EDt[r|a] for all a ≠ At+1]
= E_{Dt∼P[R|Ht]}[I_{At+1 = arg max_{a∈A} EDt[r|a]}]

where I refers to the indicator function. This reformulation in terms of an Expectation is convenient because we can estimate the Expectation by sampling various Dt probability
distributions and for each sample of Dt , we simply check if an action has the best mean
reward (compared to other actions) under the distribution Dt . This sampling-based ap-
proach to Probability Matching is known as Thompson Sampling. Specifically, Thompson
Sampling performs the following calculations for time step t + 1:

• Compute the posterior distribution P[R|Ht] by performing Bayesian updates of the hyperparameters that govern the estimated probability distributions of the parame-
ters of the reward distributions for each action.
• Sample a joint (across actions) rewards distribution Dt from the posterior distribution
P[R|Ht ].
• Calculate a sample Action-Value function with sample Dt as:

Q̂t (a) = EDt [r|a]

• Select the action (for time step t + 1) that maximizes this sample Action-Value func-
tion:
At+1 = arg max_{a∈A} Q̂t(a)

It turns out that Thompson Sampling achieves the Lai-Robbins lower bound for Loga-
rithmic Total Regret. To learn more about Thompson Sampling, we refer you to the excel-
lent tutorial on Thompson Sampling by Russo, Van Roy, Kazerouni, Osband, Wen (Russo et
al. 2018).
Now we implement Thompson Sampling by assuming a Gaussian distribution of re-
wards for each action. The posterior distributions for each action are produced by perform-
ing Bayesian updates of the hyperparameters that govern the estimated Gaussian-Inverse-
Gamma Probability Distributions of the parameters of the Gaussian reward distributions
for each action. Section G.1 of Appendix G describes the Bayesian updates of the hyper-
parameters θ, α, β, and the code below implements this update in the variable bayes in
method get_episode_rewards_actions (this method implements the @abstractmethod in-
terface of abstract base class MABBase). The sample mean rewards are obtained by invoking
the sample method of Gaussian and Gamma classes, and assigned to the variable mean_draws.
The variable theta refers to the hyperparameter θ, the variable alpha refers to the hyper-
parameter α, and the variable beta refers to the hyperparameter β. The rest of the code in
the method get_episode_rewards_actions should be self-explanatory.

from rl.distribution import Gaussian, Gamma
from operator import itemgetter
from numpy import ndarray, empty, sqrt

class ThompsonSamplingGaussian(MABBase):

    def __init__(
        self,
        arm_distributions: Sequence[Gaussian],
        time_steps: int,
        num_episodes: int,
        init_mean: float,
        init_stdev: float
    ) -> None:
        super().__init__(
            arm_distributions=arm_distributions,
            time_steps=time_steps,
            num_episodes=num_episodes
        )
        self.theta0: float = init_mean
        self.n0: int = 1
        self.alpha0: float = 1
        self.beta0: float = init_stdev * init_stdev

    def get_episode_rewards_actions(self) -> Tuple[ndarray, ndarray]:
        ep_rewards: ndarray = empty(self.time_steps)
        ep_actions: ndarray = empty(self.time_steps, dtype=int)
        bayes: List[Tuple[float, int, float, float]] = \
            [(self.theta0, self.n0, self.alpha0, self.beta0)] * self.num_arms
        for i in range(self.time_steps):
            mean_draws: Sequence[float] = [Gaussian(
                mu=theta,
                sigma=1 / sqrt(n * Gamma(alpha=alpha, beta=beta).sample())
            ).sample() for theta, n, alpha, beta in bayes]
            action: int = max(enumerate(mean_draws), key=itemgetter(1))[0]
            reward: float = self.arm_distributions[action].sample()
            theta, n, alpha, beta = bayes[action]
            bayes[action] = (
                (reward + n * theta) / (n + 1),
                n + 1,
                alpha + 0.5,
                beta + 0.5 * n / (n + 1) * (reward - theta) * (reward - theta)
            )
            ep_rewards[i] = reward
            ep_actions[i] = action
        return ep_rewards, ep_actions

Figure 13.5.: Thompson Sampling (Gaussian) Total Regret Curve

The above code is in the file rl/chapter14/ts_gaussian.py. The code in __main__ sets
up a ThompsonSamplingGaussian instance with 6 arms, each having a Gaussian rewards
distribution. When run with 1000 time steps and 500 episodes, we get the Total Regret
Curve as shown in Figure 13.5.
We encourage you to modify the code in __main__ to try other mean and variance set-
tings for the Gaussian reward distributions of the arms, examine the results obtained, and
develop more intuition for Thompson Sampling for Gaussians.
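As a concrete illustration, here is a hypothetical usage sketch of the class above (the arm means/standard deviations and the prior settings are invented for illustration; the actual settings in the book's __main__ may differ):

from rl.distribution import Gaussian

arms = [Gaussian(mu=m, sigma=2.0) for m in [2.0, 3.0, 5.0, 6.0, 8.0, 9.0]]  # 6 arms
ts_gaussian = ThompsonSamplingGaussian(
    arm_distributions=arms,
    time_steps=1000,
    num_episodes=500,
    init_mean=0.0,     # prior mean hyperparameter
    init_stdev=10.0    # wide prior, encouraging early exploration
)
rewards, actions = ts_gaussian.get_episode_rewards_actions()  # one simulated episode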
Now we implement Thompson Sampling by assuming a Bernoulli distribution of re-
wards for each action. The posterior distributions for each action are produced by per-
forming Bayesian updates of the hyperparameters that govern the estimated Beta Prob-
ability Distributions of the parameters of the Bernoulli reward distributions for each ac-
tion. Section G.2 of Appendix G describes the Bayesian updates of the hyperparameters
α and β, and the code below implements this update in the variable bayes in method
get_episode_rewards_actions (this method implements the @abstractmethod interface of
abstract base class MABBase). The sample mean rewards are obtained by invoking the
sample method of the Beta class, and assigned to the variable mean_draws. The variable
alpha refers to the hyperparameter α and the variable beta refers to the hyperparame-
ter β. The rest of the code in the method get_episode_rewards_actions should be self-
explanatory.
from rl.distribution import Bernoulli, Beta
from operator import itemgetter
from numpy import ndarray, empty

class ThompsonSamplingBernoulli(MABBase):

    def __init__(
        self,
        arm_distributions: Sequence[Bernoulli],
        time_steps: int,
        num_episodes: int
    ) -> None:
        super().__init__(
            arm_distributions=arm_distributions,
            time_steps=time_steps,
            num_episodes=num_episodes
        )

    def get_episode_rewards_actions(self) -> Tuple[ndarray, ndarray]:
        ep_rewards: ndarray = empty(self.time_steps)
        ep_actions: ndarray = empty(self.time_steps, dtype=int)
        bayes: List[Tuple[int, int]] = [(1, 1)] * self.num_arms
        for i in range(self.time_steps):
            mean_draws: Sequence[float] = \
                [Beta(alpha=alpha, beta=beta).sample() for alpha, beta in bayes]
            action: int = max(enumerate(mean_draws), key=itemgetter(1))[0]
            reward: float = float(self.arm_distributions[action].sample())
            alpha, beta = bayes[action]
            bayes[action] = (alpha + int(reward), beta + int(1 - reward))
            ep_rewards[i] = reward
            ep_actions[i] = action
        return ep_rewards, ep_actions

Figure 13.6.: Thompson Sampling (Bernoulli) Total Regret Curve

The above code is in the file rl/chapter14/ts_bernoulli.py. The code in __main__ sets
up a ThompsonSamplingBernoulli instance with 6 arms, each having a Bernoulli rewards
distribution. When run with 1000 time steps and 500 episodes, we get the Total Regret
Curve as shown in Figure 13.6.
We encourage you to modify the code in __main__ to try other mean settings for the
Bernoulli reward distributions of the arms, examine the results obtained, and develop
more intuition for Thompson Sampling for Bernoullis.
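For completeness, here is an analogous hypothetical usage sketch for the Bernoulli case (the arm probabilities are invented for illustration, and we assume the Bernoulli class in rl.distribution is constructed from its success probability):

from rl.distribution import Bernoulli

arms = [Bernoulli(p) for p in [0.1, 0.2, 0.4, 0.5, 0.7, 0.8]]  # 6 arms (assumed constructor)
ts_bernoulli = ThompsonSamplingBernoulli(
    arm_distributions=arms,
    time_steps=1000,
    num_episodes=500
)
rewards, actions = ts_bernoulli.get_episode_rewards_actions()  # one simulated episode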

13.6. Gradient Bandits
Now we cover a MAB algorithm that is similar to Policy Gradient for MDPs. This MAB
algorithm’s action selection is randomized and the action selection probabilities are con-
structed through Gradient Ascent (much like Stochastic Policy Gradient for MDPs). This
MAB Algorithm and its variants are cheekily referred to as Gradient Bandits. Our coverage
below follows the coverage of Gradient Bandit algorithm in the RL book by Sutton and
Barto (Richard S. Sutton and Barto 2018).
The basic idea is that we have m Score parameters (to be optimized), one for each action,
denoted as {sa |a ∈ A} that define the action-selection probabilities, which in turn defines
an Expected Reward Objective function to be maximized, as follows:
$$J(s_{a_1}, \ldots, s_{a_m}) = \sum_{a \in \mathcal{A}} \pi(a) \cdot \mathbb{E}[r|a]$$

where π : A → [0, 1] refers to the function for action-selection probabilities, that is defined as follows:

$$\pi(a) = \frac{e^{s_a}}{\sum_{b \in \mathcal{A}} e^{s_b}} \text{ for all } a \in \mathcal{A}$$

The Score parameters are meant to represent the relative value of actions based on the
rewards seen until a certain time step, and are adjusted appropriately after each time step
(using Gradient Ascent). Note that π(·) is a Softmax function of the Score parameters.
Gradient Ascent moves the Score parameters $s_a$ (and hence, the action probabilities $\pi(a)$) in the direction of the gradient of the objective function $J(s_{a_1}, \ldots, s_{a_m})$ with respect to $(s_{a_1}, \ldots, s_{a_m})$. To construct this gradient of $J(\cdot)$, we calculate $\frac{\partial J}{\partial s_a}$ for each $a \in \mathcal{A}$, as follows:

$$\begin{aligned}
\frac{\partial J}{\partial s_a} &= \frac{\partial \left(\sum_{a' \in \mathcal{A}} \pi(a') \cdot \mathbb{E}[r|a']\right)}{\partial s_a} \\
&= \sum_{a' \in \mathcal{A}} \mathbb{E}[r|a'] \cdot \frac{\partial \pi(a')}{\partial s_a} \\
&= \sum_{a' \in \mathcal{A}} \pi(a') \cdot \mathbb{E}[r|a'] \cdot \frac{\partial \log \pi(a')}{\partial s_a} \\
&= \mathbb{E}_{a' \sim \pi, r \sim \mathcal{R}_{a'}}\left[r \cdot \frac{\partial \log \pi(a')}{\partial s_a}\right]
\end{aligned}$$

We know from standard softmax-function calculus that:

$$\frac{\partial \log \pi(a')}{\partial s_a} = \frac{\partial \left(\log \frac{e^{s_{a'}}}{\sum_{b \in \mathcal{A}} e^{s_b}}\right)}{\partial s_a} = \mathbb{I}_{a=a'} - \pi(a)$$

Therefore, $\frac{\partial J}{\partial s_a}$ can be re-written as:

$$\frac{\partial J}{\partial s_a} = \mathbb{E}_{a' \sim \pi, r \sim \mathcal{R}_{a'}}\left[r \cdot (\mathbb{I}_{a=a'} - \pi(a))\right]$$

At each time step t, we approximate the gradient with the $(A_t, R_t)$ sample as:

$$R_t \cdot (\mathbb{I}_{a=A_t} - \pi_t(a)) \text{ for all } a \in \mathcal{A}$$

$\pi_t(a)$ is the probability of selecting action $a$ at time step $t$, derived from the Score $s_t(a)$ at time step $t$.
We can reduce the variance of this estimate with a baseline $B$ that is independent of $a$, as follows:

$$(R_t - B) \cdot (\mathbb{I}_{a=A_t} - \pi_t(a)) \text{ for all } a \in \mathcal{A}$$

This doesn't introduce any bias in the estimate of the gradient of $J(\cdot)$ because:

$$\begin{aligned}
\mathbb{E}_{a' \sim \pi}\left[B \cdot (\mathbb{I}_{a=a'} - \pi(a))\right] &= \mathbb{E}_{a' \sim \pi}\left[B \cdot \frac{\partial \log \pi(a')}{\partial s_a}\right] \\
&= B \cdot \sum_{a' \in \mathcal{A}} \pi(a') \cdot \frac{\partial \log \pi(a')}{\partial s_a} \\
&= B \cdot \sum_{a' \in \mathcal{A}} \frac{\partial \pi(a')}{\partial s_a} \\
&= B \cdot \frac{\partial \left(\sum_{a' \in \mathcal{A}} \pi(a')\right)}{\partial s_a} \\
&= B \cdot \frac{\partial 1}{\partial s_a} \\
&= 0
\end{aligned}$$
We can use $B = \bar{R}_t = \frac{1}{t}\sum_{s=1}^{t} R_s$ (the average of all rewards obtained until time step $t$). So, the update to the Scores $s_t(a)$ for all $a \in \mathcal{A}$ is:

$$s_{t+1}(a) = s_t(a) + \alpha \cdot (R_t - \bar{R}_t) \cdot (\mathbb{I}_{a=A_t} - \pi_t(a))$$

It should be noted that this Gradient Bandit algorithm and its variants are simply special cases of policy gradient-based RL algorithms.
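To make the update concrete, here is a one-step numeric sketch (all numbers are invented for illustration): with three actions at equal Scores, the selected action's reward exceeds the baseline, so its selection probability rises above 1/3.

from math import exp

scores = [0.0, 0.0, 0.0]          # s_t(a) for 3 actions
probs = [1 / 3, 1 / 3, 1 / 3]     # softmax of equal scores
chosen, reward, baseline, alpha = 0, 2.0, 1.0, 0.1

new_scores = [
    s + alpha * (reward - baseline) * ((1.0 if a == chosen else 0.0) - p)
    for a, (s, p) in enumerate(zip(scores, probs))
]
exps = [exp(s) for s in new_scores]
new_probs = [e / sum(exps) for e in exps]
print(new_scores)  # approximately [0.0667, -0.0333, -0.0333]
print(new_probs)   # the chosen action's probability rises above 1/3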
Now let’s write some code to implement this Gradient Algorithm. Apart from the usual
constructor inputs arm_distributions, time_steps and num_episodes that are passed along
to the constructor of the abstract base class MABBase, GradientBandits’ constructor also
takes as input learning_rate (specifying the initial learning rate) and learning_rate_decay
(specifying the speed at which the learning rate decays), which influence how the variable
step_size is set at every time step. The variable scores represents st (a) for all a ∈ A and
the variable probs represents πt (a) for all a ∈ A. The rest of the code below should be
self-explanatory, based on the above description of the calculations.

from rl.distribution import Distribution, Categorical
from operator import itemgetter
from numpy import ndarray, empty, exp

class GradientBandits(MABBase):

    def __init__(
        self,
        arm_distributions: Sequence[Distribution[float]],
        time_steps: int,
        num_episodes: int,
        learning_rate: float,
        learning_rate_decay: float
    ) -> None:
        super().__init__(
            arm_distributions=arm_distributions,
            time_steps=time_steps,
            num_episodes=num_episodes
        )
        self.learning_rate: float = learning_rate
        self.learning_rate_decay: float = learning_rate_decay

    def get_episode_rewards_actions(self) -> Tuple[ndarray, ndarray]:
        ep_rewards: ndarray = empty(self.time_steps)
        ep_actions: ndarray = empty(self.time_steps, dtype=int)
        scores: List[float] = [0.] * self.num_arms
        avg_reward: float = 0.
        for i in range(self.time_steps):
            max_score: float = max(scores)
            exp_scores: Sequence[float] = [exp(s - max_score) for s in scores]
            sum_exp_scores = sum(exp_scores)
            probs: Sequence[float] = [s / sum_exp_scores for s in exp_scores]
            action: int = Categorical(
                {i: p for i, p in enumerate(probs)}
            ).sample()
            reward: float = self.arm_distributions[action].sample()
            avg_reward += (reward - avg_reward) / (i + 1)
            step_size: float = self.learning_rate * \
                (i / self.learning_rate_decay + 1) ** -0.5
            for j in range(self.num_arms):
                scores[j] += step_size * (reward - avg_reward) * \
                    ((1 if j == action else 0) - probs[j])
            ep_rewards[i] = reward
            ep_actions[i] = action
        return ep_rewards, ep_actions

Figure 13.7.: Gradient Algorithm Total Regret Curve

The above code is in the file rl/chapter14/gradient_bandits.py. The code in __main__ sets
up a GradientBandits instance with 6 arms, each having a Gaussian reward distribution.
When run with 1000 time steps and 500 episodes, we get the Total Regret Curve as shown
in Figure 13.7.
We encourage you to modify the code in __main__ to try other mean and standard de-
viation settings for the Gaussian reward distributions of the arms, examine the results
obtained, and develop more intuition for this Gradient Algorithm.

Figure 13.8.: Gaussian Horse Race - Total Regret Curves

13.7. Horse Race


We’ve implemented several algorithms for the MAB problem. Now it’s time for a compe-
tition between them, that we will call a Horse Race. In this Horse Race, we will compare
the Total Regret across the algorithms, and we will also examine the number of times the
different arms get pulled by the various algorithms. We expect a good algorithm to have small total regret, to pull the arms with high Gaps only a few times, and to pull the arms with low (and zero) Gaps a large number of times.
The code in the file rl/chapter14/plot_mab_graphs.py has a function to run a horse race
for Gaussian arms with the following algorithms:

• Greedy with Optimistic Initialization


• ϵ-Greedy
• Decaying ϵt -Greedy
• Thompson Sampling
• Gradient Bandit

Running this horse race for 7 Gaussian arms with 500 time steps, 500 episodes and the
settings as specified in the file rl/chapter14/plot_mab_graphs.py, we obtain Figure 13.8 for
the Total Regret Curves for each of these algorithms.
Figure 13.9 shows the number of times each arm is pulled (for each of these algorithms).
The X-axis is sorted by the mean of the reward distributions of the arms. For each arm,
the left-to-right order of the arm-pulls count is the order in which the 5 MAB algorithms
are listed above. As we can see, the arms with low means are pulled only a few times and
the arms with high means are pulled often.
Figure 13.9.: Gaussian Horse Race - Arms Count

The file rl/chapter14/plot_mab_graphs.py also has a function to run a horse race for Bernoulli arms with the following algorithms:

• Greedy with Optimistic Initialization
• ϵ-Greedy
• Decaying ϵt -Greedy
• UCB1
• Thompson Sampling
• Gradient Bandit

Running this horse race for 9 Bernoulli arms with 500 time steps, 500 episodes and the
settings as specified in the file rl/chapter14/plot_mab_graphs.py, we obtain Figure 13.10
for the Total Regret Curves for each of these algorithms.
Figure 13.11 shows the number of times each arm is pulled (for each of the algorithms).
The X-axis is sorted by the mean of the reward distributions of the arms. For each arm,
the left-to-right order of the arm-pulls count is the order in which the 6 MAB algorithms
are listed above. As we can see, the arms with low means are pulled only a few times and
the arms with high means are pulled often.
Figure 13.10.: Bernoulli Horse Race - Total Regret Curves

Figure 13.11.: Bernoulli Horse Race - Arms Count

We encourage you to experiment with the code in rl/chapter14/plot_mab_graphs.py: try different arm distributions, try different input parameters for each of the algorithms, plot the graphs, and try to explain the relative performance of the algorithms (perhaps by writing some more diagnostics code). This will help build tremendous intuition on the pros and cons of these algorithms.

13.8. Information State Space MDP

We had mentioned earlier in this chapter that although a MAB problem is not posed as an MDP, the AI Agent could maintain relevant features of the history (of actions taken and rewards obtained) as its State, which would help the AI Agent in making the arm-selection (action) decision. So the AI Agent treats the MAB problem as an MDP and the arm-selection action is essentially a (Policy) function of the agent's State. One can then arrive at the Optimal arm-selection strategy by solving the Control problem of this MDP with an appropriate Planning or Learning algorithm. The representation of State as relevant features of history is known as Information State (to indicate that the agent captures all of
the relevant information known so far in the State of the modeled MDP). Before we explain
this Information State Space MDP approach in more detail, it pays to develop an intuitive
understanding of the Value of Information.
The key idea is that Exploration enables the agent to acquire information, which in turn
enables the agent to make more informed decisions as far as its future arm-selection strategy is concerned. The natural question to ask then is whether we can quantify the value
of this information that can be acquired by Exploration. In other words, how much would
a decision-maker be willing to pay to acquire information (through exploration), prior to
making a decision? Vaguely speaking, the decision-maker should be paying an amount
equal to the gains in long-term (accumulated) reward that can be obtained upon getting
the information, less the sacrifice of excess immediate reward one would have obtained
had one exploited rather than explored. We can see that this approach aims to settle the
explore-exploit trade-off in a mathematically rigorous manner by establishing the Value of
Information. Note that information gain is higher in a more uncertain situation (all else be-
ing equal). Therefore, it makes sense to explore uncertain situations more. By formalizing
the value of information, we can trade-off exploration and exploitation optimally.
Now let us formalize the approach of treating a MAB as an Information State Space
MDP. After each time step of a MAB, we construct an Information State s̃, which comprises
relevant features of the history until that time step. Essentially, s̃ summarizes all of the
information accumulated so far that is pertinent to be able to predict the reward distri-
bution for each action. Each action a causes a transition to a new information state s̃′ (by
adding information about the reward obtained after performing action a), with probabil-
ity P̃(s̃, a, s̃′ ). Note that this probability depends on the reward probability function Ra of
the MAB. Moreover, the MAB reward r obtained upon performing action a constitutes the
Reward of the Information State Space MDP for that time step. Putting all this together,
we have an MDP M̃ in information state space as follows:

• Denote the Information State Space of M̃ as S̃.


• The Action Space of M̃ is the action space of the given MAB: A.
• The State Transition Probability function of M̃ is P̃.
• The Reward function of M̃ is given by the Reward probability function Ra of the
MAB.
• Discount Factor γ = 1.

The key point to note is that since Ra is unknown to the AI Agent in the MAB problem,
the State Transition Probability function and the Reward function of the Information State
Space MDP M̃ are unknown to the AI Agent. However, at any given time step, the AI Agent
can utilize the information within s̃ to form an estimate of Ra , which in turn gives estimates
of the State Transition Probability function and the Reward function of the Information
State Space MDP M̃ .
Note that M̃ will typically be a fairly complex MDP over an infinite number of infor-
mation states, and hence is not easy to solve. However, since it is after all an MDP, we
can use Dynamic Programming or Reinforcement Learning algorithms to arrive at the
Optimal Policy, which prescribes the optimal MAB action to take at that time step. If a
Dynamic Programming approach is taken, then after each time step, as new information
arrives (in the form of the MAB reward in response to the action taken), the estimates of
the State Transition probability function and the Reward function change, meaning the In-
formation State Space MDP to be solved changes, and consequently the Action-Selection
strategy for the MAB problem (prescribed by the Optimal Policy of the Information State
Space MDP) changes. A common approach is to treat the Information State Space MDP as
a Bayes-Adaptive MDP. Specifically, if we have m arms $a_1, \ldots, a_m$, the state $\tilde{s}$ is modeled as $(\tilde{s}_{a_1}, \ldots, \tilde{s}_{a_m})$ such that $\tilde{s}_a$ for any $a \in \mathcal{A}$ represents a posterior probability distribution over $\mathcal{R}_a$, which is Bayes-updated after observing the reward upon each pull of the arm $a$. This
Bayes-Adaptive MDP can be tackled with the highly-celebrated Dynamic Programming
method known as Gittins Index, which was introduced in a 1979 paper by Gittins (Gittins
1979). The Gittins Index approach finds the Bayes-optimal explore-exploit trade-off with
respect to the prior distribution.
To grasp the concept of Information State Space MDP, let us consider a Bernoulli Ban-
dit problem with m arms with arm a’s reward probability distribution Ra given by the
Bernoulli distribution B(µa ), where µa ∈ [0, 1] (i.e., reward = 1 with probability µa , and
reward = 0 with probability 1 − µa ). If we denote the m arms by a1 , a2 , . . . , am , then the
information state is s̃ = (αa1 , βa1 , αa2 , βa2 . . . , αam , βam ), where αa is the number of pulls
of arm a (so far) for which the reward was 1 and βa is the number of pulls of arm a (so
far) for which the reward was 0. Note that by the Law of Large Numbers, in the long run, $\frac{\alpha_a}{\alpha_a + \beta_a} \to \mu_a$.

We can treat this as a Bayes-adaptive MDP as follows: We model the prior distribution
over Ra as the Beta Distribution Beta(αa , βa ) over the unknown parameter µa . Each time
arm a is pulled, we update the posterior for Ra as:

• Beta(αa + 1, βa ) if r = 1
• Beta(αa , βa + 1) if r = 0

Note that the component (αa , βa ) within the information state provides the model Beta(αa , βa )
as the probability distribution over µa . Moreover, note that each state transition (updating
either αa or βa by 1) is essentially a Bayesian model update (Section G.2 in Appendix G
provides details of Bayesian updates to a Beta distribution over a Bernoulli parameter).
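The following is a minimal sketch (not from the book's codebase) of this information-state transition for the Bernoulli Bandit: the information state is a pair (αa, βa) per arm, and each observed reward performs the Beta (Bayesian) update described above.

from typing import Dict, Tuple

def update_information_state(
    info_state: Dict[str, Tuple[int, int]],
    arm: str,
    reward: int   # 1 or 0
) -> Dict[str, Tuple[int, int]]:
    # Beta(alpha, beta) posterior update for the pulled arm
    alpha, beta = info_state[arm]
    new_state = dict(info_state)
    new_state[arm] = (alpha + reward, beta + (1 - reward))
    return new_state

# Example with two arms and uniform Beta(1, 1) priors
s = {"a1": (1, 1), "a2": (1, 1)}
s = update_information_state(s, "a1", 1)   # pulled a1, observed reward 1
s = update_information_state(s, "a2", 0)   # pulled a2, observed reward 0
print(s)  # {'a1': (2, 1), 'a2': (1, 2)}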
Note that in general, an exact solution to a Bayes-adaptive MDP is typically intractable.
In 2014, Guez, Heess, Silver, Dayan (Guez et al. 2014) came up with a Simulation-based
Search method, which involves a forward search in information state space using simula-
tions from current information state, to solve a Bayes-adaptive MDP.

13.9. Extending to Contextual Bandits and RL Control


A Contextual Bandit problem is a natural extension of the MAB problem, by introduc-
ing the concept of Context that has an influence on the rewards probability distribution
for each arm. Before we provide a formal definition of a Contextual Bandit problem, we
will provide an intuitive explanation with a canonical example. Consider the problem of
showing a banner advertisement on a web site where there is a choice of displaying one
among m different advertisements at a time. If the user clicks on the advertisement, there
is a reward of 1 (if the user doesn’t click, the reward is 0). The selection of the adver-
tisement to display is the arm-selection (out of m arms, i.e., advertisements). This seems
like a standard MAB problem, except that on a web site, we don’t have a single user. In
each round, a random user (among typically millions of users) appears. Each user will
have their own characteristics of how they would respond to advertisements, meaning the
rewards probability distribution for each arm would depend on the user. We refer to the
user characteristics (as relevant to their likelihood to respond to specific advertisements)
as the Context. This means, the Context influences the rewards probability distribution for
each arm. This is known as the Contextual Bandit problem, which we formalize below:

Definition 13.9.1. A Contextual Bandit comprises:

• A finite set of Actions A (known as the “arms”).
• A probability distribution C over Contexts, defined as:

C(c) = P[c] for all Contexts c

• Each pair of a context c and an action (“arm”) a ∈ A is associated with a probability distribution over R (unknown to the AI Agent) denoted as $\mathcal{R}_c^{a}$, defined as:

$$\mathcal{R}_c^{a}(r) = \mathbb{P}[r|c, a] \text{ for all } r \in \mathbb{R}$$

• A time-indexed sequence of Environment-generated random Contexts Ct for time steps t = 1, 2, . . ., a time-indexed sequence of AI Agent-selected actions At ∈ A for time steps t = 1, 2, . . ., and a time-indexed sequence of Environment-generated Reward random variables Rt ∈ R for time steps t = 1, 2, . . ., such that for each time step t, Ct is first randomly drawn from the probability distribution C, after which the AI Agent selects the action At, after which Rt is randomly drawn from the probability distribution $\mathcal{R}_{C_t}^{A_t}$.

The AI Agent’s goal is to maximize the following Expected Cumulative Rewards over a certain number of time steps T:

$$\mathbb{E}\left[\sum_{t=1}^{T} R_t\right]$$

Each of the algorithms we’ve covered for the MAB problem can be easily extended to
the Contextual Bandit problem. The key idea in the extension of the MAB algorithms is
that we have to take into account the Context, when dealing with the rewards probability
distribution. In the MAB problem, the algorithms deal with a finite set of reward distribu-
tions, one for each of the actions. Here in the Contextual Bandit problem, the algorithms
work with function approximations for the rewards probability distributions where each
function approximation takes as input a pair of (Context, Action).
We won’t cover the details of the extensions of all MAB Algorithms to Contextual Bandit
algorithms. Rather, we simply sketch a simple Upper-Confidence-Bound algorithm for the
Contextual Bandit problem to convey a sense of how to extend the MAB algorithms to the
Contextual Bandit problem. Assume that the sampling distribution of the mean reward
for each (Context, Action) pair is a gaussian distribution, and so we maintain two function
approximations µ(c, a; w) and σ(c, a; v) to represent the mean and standard deviation of
the sampling distribution of mean reward for any context c and any action a. It’s important
to note that for MAB, we simply maintained a finite set of estimates µa and σa , i.e., two
parameters for each action a. Here we replace µa with function approximation µ(c, a; w)
and we replace σa with function approximation σ(c, a; v). After the receipt of a reward
from the Environment, the parameters w and v are appropriately updated. We essentially
perform supervised learning in an incremental manner when updating these parameters
of the function approximations. Note that σ(c, a; v) represents a function approximation
for the standard error of the mean reward for a given context c and given action a. A simple
Upper-Confidence-Bound algorithm would then select the action for a given context Ct at
time step t that maximizes µ(Ct , a; w) + α · σ(Ct , a; v) over all choices of a ∈ A, for some
fixed α. Thus, we are comparing (across actions) α standard errors higher than the mean
reward estimate (i.e., the upper-end of an appropriate confidence interval for the mean
reward) for Context Ct .
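A minimal sketch of this action selection is shown below (an illustrative assumption, not the book's implementation); mu and sigma stand in for the function approximations µ(c, a; w) and σ(c, a; v), here passed in as arbitrary callables.

from typing import Callable, Sequence, TypeVar

A = TypeVar('A')
C = TypeVar('C')

def contextual_ucb_action(
    context: C,
    actions: Sequence[A],
    mu: Callable[[C, A], float],      # estimated mean reward for (context, action)
    sigma: Callable[[C, A], float],   # estimated standard error of that mean
    alpha: float                      # number of standard errors in the bonus
) -> A:
    # Select the action with the highest upper-confidence value for this context
    return max(actions, key=lambda a: mu(context, a) + alpha * sigma(context, a))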
We want to highlight that many authors refer to the Context in Contextual Bandits as
State. We desist from using the term State in Contextual Bandits since we want to reserve
the term State to refer to the concept of “transitions” (as is the case in MDPs). Note that
the Context does not “transition” to the next Context in the next time step in Contextual
Bandits problems. Rather, the Context is drawn at random independently at each time step
from the Context probability distribution C. This is in contrast to the State in MDPs which
transitions to the next state at the next time step based on the State Transition probability
function of the MDP.
We finish this chapter by simply pointing out that the approaches of the MAB algorithms
can be further extended to resolve the Explore-Exploit dilemma in RL Control. From the
perspective of this extension, it pays to emphasize that MAB algorithms that fall under the
category of Optimism in the Face of Uncertainty can be roughly split into:

• Those that estimate the Q-Values (i.e., estimate E[r|a] from observed data) and the
uncertainty of the Q-Values estimate. When extending to RL Control, we estimate
the Q-Value Function for the (unknown) MDP and the uncertainty of the Q-Value
Function estimate. Note that when moving from MAB to RL Control, the Q-Values
are no longer simply the Expected Reward for a given action - rather, they are the Ex-
pected Return (i.e., accumulated rewards) from a given state and a given action. This
extension from Expected Reward to Expected Return introduces significant com-
plexity in the calculation of the uncertainty of the Q-Value Function estimate.
• Those that estimate the Model of the MDP, i.e., estimate of the State-Reward Transi-
tion Probability function PR of the MDP, and the uncertainty of the PR estimate. This
includes extension of Bayesian Bandits, Thompson Sampling and Bayes-Adaptive
MDP (for Information State Space MDP) where we replace P[R|Ht ] in the case of
Bandits with P[PR |Ht ] in the case of RL Control. Some of these algorithms sam-
ple from the estimated PR , and learn the Optimal Value Function/Optimal Policy
from the samples. Some other algorithms are Planning-oriented. Specifically, the
Planning-oriented approach is to run a Planning method (eg: Policy Iteration, Value
Iteration) using the estimated PR , then generate more data using the Optimal Pol-
icy (produced by the Planning method), use the generated data to improve the PR
estimate, then run the Planning method again to come up with the Optimal Policy
(for the MDP based on the improved PR estimate), and loop on in this manner until
convergence. As an example of this Planning-oriented approach, we refer you to the
paper on RMax Algorithm (Brafman and Tennenholtz 2001) to learn more.

13.10. Key Takeaways from this Chapter


• The Multi-Armed Bandit problem provides a simple setting to understand and ap-
preciate the nuances of the Explore-Exploit dilemma that we typically need to resolve
within RL Control algorithms.
• In this chapter, we covered the following broad approaches to resolve the Explore-
Exploit dilemma:
– Naive Exploration, eg: ϵ-greedy
– Optimistic Initialization
– Optimism in the Face of Uncertainty, eg: UCB, Bayesian UCB
– Probability Matching, eg: Thompson Sampling
– Gradient Bandit Algorithms
– Information State Space MDPs (incorporating value of Information), typically
solved by treating as Bayes-Adaptive MDPs
• The above MAB algorithms are well-extensible to Contextual Bandits and RL Con-
trol.

14. Blending Learning and Planning
After coverage of the issue of Exploration versus Exploitation in the last chapter, in this
chapter, we cover the topic of Planning versus Learning (and how to blend the two ap-
proaches) in the context of solving MDP Prediction and Control problems. In this chapter,
we also provide some coverage of the much-celebrated Monte-Carlo Tree-Search (abbre-
viated as MCTS) algorithm and its spiritual origin - the Adaptive Multi-Stage Sampling
(abbreviated as AMS) algorithm. MCTS and AMS are examples of Planning algorithms
tackled with sampling/RL-based techniques.

14.1. Planning versus Learning


In the language of AI, we use the terms Planning and Learning to refer to two different
approaches to solve an AI problem. Let us understand these terms from the perspective
of solving MDP Prediction and Control. Let us zoom out and look at the big picture. The
AI Agent has access to an MDP Environment E. In the process of interacting with the MDP
Environment E, the AI Agent receives experiences data. Each unit of experience data is
in the form of a (next state, reward) pair for the current state and action. The AI Agent’s
goal is to estimate the requisite Value Function/Policy through this process of interaction
with the MDP Environment E (for Prediction, the AI Agent estimates the Value Function
for a given policy and for Control, the AI Agent estimates the Optimal Value Function and
the Optimal Policy). The AI Agent can go about this in one of two ways:

1. By interacting with the MDP Environment E, the AI Agent can build a Model of the
Environment (call it M ) and then use that model to estimate the requisite Value Func-
tion/Policy. We refer to this as the Model-Based approach. Solving Prediction/Control
using a Model of the Environment (i.e., Model-Based approach) is known as Planning
the solution. The term Planning comes from the fact that the AI Agent projects (with
the help of the model M ) probabilistic scenarios of future states/rewards for vari-
ous choices of actions from specific states, and solves for the requisite Value Func-
tion/Policy based on the model-projected future outcomes.
2. By interacting with the MDP Environment E, the AI Agent can directly estimate
the requisite Value Function/Policy, without bothering to build a Model of the En-
vironment. We refer to this as the Model-Free approach. Solving Prediction/Control
without using a model (i.e., Model-Free approach) is known as Learning the solu-
tion. The term Learning comes from the fact that the AI Agent “learns” the requisite
Value Function/Policy directly from experiences data obtained by interacting with
the MDP Environment E (without requiring any model).

Let us now dive a bit deeper into both these approaches to understand them better.

14.1.1. Planning the solution of Prediction/Control


In the first approach (Planning the solution of Prediction/Control), we first need to “build
a model.” By “model”, we refer to the State-Reward Transition Probability Function PR.
By “building a model,” we mean estimating PR from experiences data obtained by in-
teracting with the MDP Environment E. How does the AI Agent do this? Well, this is a
matter of estimating the conditional probability density function of pairs of (next state, re-
ward), conditioned on a particular pair of (state, action). This is an exercise in Supervised
Learning, where the y-values are (next state, reward) pairs and the x-values are (state,
action) pairs. We covered how to do Supervised Learning in Chapter 4. Also, note that
Equation (9.12) in Chapter 9 provides a simple tabular calculation to estimate the PR func-
tion for an MRP from a fixed, finite set of atomic experiences of (state, reward, next state)
triples. Following this Equation, we had written the function finite_mrp to construct a
FiniteMarkovRewardProcess (which includes a tabular PR function of explicit probabilities
of transitions), given as input a Sequence[TransitionStep[S]] (i.e., fixed, finite set of MRP
atomic experiences). This approach can be easily extended to estimate the PR function for
an MDP. Ok - now we have a model M in the form of an estimated PR . The next thing
to do in this approach of Planning the solution of Prediction/Control is to use the model
M to estimate the requisite Value Function/Policy. There are two broad approaches to do
this:

1. By constructing PR as an explicit representation of probabilities of transitions, the


AI Agent can utilize one of the Dynamic Programming Algorithms (eg: Policy Eval-
uation, Policy Iteration, Value Iteration) or a Tree-search method (by growing out a
tree of future states/rewards/actions from a given state/action, eg: the MCTS/AMS
algorithms we will cover later in this chapter). Note that in this approach, there is
no need to interact with an MDP Environment since a model of transition probabilities
are available that can be used to project any (probabilistic) future outcome (for any
choice of action) that is desired to estimate the requisite Value Function/Policy.
2. By treating PR as a sampling model, by which we mean that the AI agent uses PR
as simply an (on-demand) interface to sample an individual pair of (next state, re-
ward) from a given (state, action) pair. This means the AI Agent treats this sampling
model view of PR as a Simulated MDP Environment (let us refer to this Simulated
MDP Environment as S). Note that S serves as a proxy interaction-interface to the
actual MDP Environment E. A significant advantage of interacting with S instead
of E is that we can sample infinitely many times without any of the real-world in-
teraction constraints that an actual MDP Environment E poses. Think about a robot
learning to walk on an actual street versus learning to walk on a simulator of the
street’s activities. Furthermore, the user could augment his/her views on top of an
experiences-data-learnt simulator. For example, the user might say that the experi-
ences data obtained by interacting with E doesn’t include certain types of scenarios
but the user might have knowledge of how those scenarios would play out, thus cre-
ating a “human-knowledge-augmented simulator” (more on this in Chapter 15). By
interacting with the simulated MDP Environment S (instead of the actual MDP En-
vironment E), the AI Agent can use any of the RL Algorithms we covered in Module
III of this book to estimate the requisite Value Function/Policy. Since this approach
uses a model M (albeit a sampling model) and since this approach uses RL, we re-
fer to this approach as Model-Based RL. To summarize this approach, the AI Agent
first learns (supervised learning) a model M as an approximation of the actual MDP
Environment E, and then the AI Agent plans the solution to Prediction/Control by
using the model M in the form of a simulated MDP Environment S which an RL al-
gorithm interacts with. Here the Planning/Learning terminology often gets confus-
ing to new students of this topic since this approach is supervised learning followed
by planning (the planning being done with a Reinforcement Learning algorithm interacting with the learnt simulator).

Figure 14.1.: Planning with a Supervised-Learnt Model

Figure 14.1 depicts the above-described approach of Planning the solution of Predic-
tion/Control. We start with an arbitrary Policy that is used to interact with the Environ-
ment E (upward-pointing arrow in the Figure). These interactions generate Experiences,
which are used to perform Supervised Learning (rightward-pointing arrow in the Figure)
to learn a model M . This model M is used to plan the requisite Value Function/Policy
(leftward-pointing arrow in the Figure). The Policy produced through this process of
Planning is then used to further interact with the Environment E, which in turn generates
a fresh set of Experiences, which in turn are used to update the Model M (incremental su-
pervised learning), which in turn is used to plan an updated Value Function/Policy, and
so the cycle repeats.
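As a concrete illustration of the model-estimation (supervised learning) step described above, here is a minimal tabular sketch (an assumption for illustration, not the book's rl library code) that estimates PR by counting (next state, reward) frequencies for each (state, action) pair:

from collections import Counter, defaultdict
from typing import Dict, List, Tuple

def estimate_model(
    experiences: List[Tuple[str, str, float, str]]   # (state, action, reward, next_state)
) -> Dict[Tuple[str, str], Dict[Tuple[str, float], float]]:
    counts: Dict[Tuple[str, str], Counter] = defaultdict(Counter)
    for s, a, r, s1 in experiences:
        counts[(s, a)][(s1, r)] += 1
    # Normalize counts into conditional probabilities of (next_state, reward)
    return {
        sa: {sr: c / sum(ctr.values()) for sr, c in ctr.items()}
        for sa, ctr in counts.items()
    }

data = [("s0", "a", 1.0, "s1"), ("s0", "a", 1.0, "s1"), ("s0", "a", 0.0, "s2")]
print(estimate_model(data)[("s0", "a")])  # probabilities 2/3 and 1/3 for the two outcomes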

14.1.2. Learning the solution of Prediction/Control

In the second approach (Learning the solution of Prediction/Control), we don’t bother to


build a model. Rather, the AI Agent directly estimates the requisite Value Function/Policy
from the experiences data obtained by interacting with the Actual MDP Environment E.
The AI Agent does this by using any of the RL algorithms we covered in Module III of this
book. Since this approach is “model-free,” we refer to this approach as Model-Free RL.

14.1.3. Advantages and Disadvantages of Planning versus Learning

In the previous two subsections, we covered the two different approaches to solving Pre-
diction/Control, either by Planning (subsection 14.1.1) or by Learning (subsection 14.1.2).
Let us now talk about their advantages and disadvantages.
Planning involves constructing a Model, so its natural advantage is being able to construct a model (from experiences data) with efficient and robust supervised learning meth-
ods. The other key advantage of Planning is that we can reason about Model Uncertainty.
Specifically, when we learn the Model M using supervised learning, we typically obtain
the standard errors for estimation of model parameters, which can then be used to create
confidence intervals for the Value Function and Policy planned using the model. Further-
more, since modeling real-world problems tends to be rather difficult, it is valuable to cre-
ate a family of models with differing assumptions, with different functional forms, with
differing parameterizations etc., and reason about how the Value Function/Policy would
disperse as a function of this range of models. This is quite beneficial in typical real-world
problems since it enables us to do Prediction/Control in a robust manner.
The disadvantage of Planning is that we have two sources of approximation error - the
first from supervised learning in estimating the model M , and the second from construct-
ing the Value Function/Policy (given the model). The Learning approach (without resort-
ing to a model, i.e., Model-Free RL) is thus advantageous in not having the first source of
approximation error (i.e., Model Error).

14.1.4. Blending Planning and Learning

In this subsection, we show a rather creative and practically powerful approach to solve
real-world Prediction and Control problems. We basically extend Figure 14.1 to Figure
14.2. As you can see in Figure 14.2, the change is that there is a downward-pointing ar-
row from the Experiences node to the Policy node. This downward-pointing arrow refers
to Model-Free Reinforcement Learning, i.e., learning the Value Function/Policy directly from
experiences obtained by interacting with Environment E, i.e., Model-Free RL. This means
we obtain the requisite Value Function/Policy through the collaborative approach of Plan-
ning (using the model M ) and Learning (using Model-Free RL).
Note that when Planning is based on RL using experiences obtained by interacting
with the Simulated Environment S (based on Model M ), then we obtain the requisite
Value Function/Policy from two sources of experiences (from E and S) that are com-
bined and provided to an RL Algorithm. This means we simultaneously do Model-Based
RL and Model-Free RL. This is creative and powerful because it blends the best of both
worlds - Planning (with Model-Based RL) and Learning (with Model-Free RL). Apart
from Model-Free RL and Model-Based RL being blended here to obtain a more accurate
Value Function/Policy, the Model is simultaneously being updated with incremental su-
pervised learning (rightward-pointing arrow in Figure 14.2) as new experiences are being
generated as a result of the Policy interacting with the Environment E (upward-pointing
arrow in Figure 14.2).
This framework of blending Planning and Learning was created by Richard Sutton, who named it Dyna (Richard S. Sutton 1991).
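The snippet below is a minimal tabular Dyna-Q-style sketch of this blend (an illustrative assumption, not the book's rl library code): every real transition produces a direct Q-Learning update (Learning), updates a simple memorized model, and then triggers several simulated updates from that model (Planning). The env_step function, returning a (reward, next state) pair, is a hypothetical stand-in for the Environment E.

import random
from collections import defaultdict

def dyna_q_step(Q, model, env_step, state, actions,
                epsilon=0.1, alpha=0.1, gamma=1.0, n_planning=10):
    # Learning: act (epsilon-greedily) in the real environment, update Q
    a = random.choice(actions) if random.random() < epsilon else \
        max(actions, key=lambda x: Q[(state, x)])
    r, s1 = env_step(state, a)
    Q[(state, a)] += alpha * (r + gamma * max(Q[(s1, x)] for x in actions) - Q[(state, a)])
    # Model update: remember the latest observed transition for (state, action)
    model[(state, a)] = (r, s1)
    # Planning: replay n_planning simulated transitions sampled from the model
    for _ in range(n_planning):
        (s, b), (rr, ss) = random.choice(list(model.items()))
        Q[(s, b)] += alpha * (rr + gamma * max(Q[(ss, x)] for x in actions) - Q[(s, b)])
    return s1

# Q and model would be initialized as: Q = defaultdict(float); model = {}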

Figure 14.2.: Blending Planning and Learning

14.2. Decision-Time Planning
In the next two sections of this chapter, we cover a couple of Planning methods that are
sampling-based (experiences obtained by interacting with a sampling model) and use RL
techniques to solve for the requisite Value Function/Policy from the model-sampled expe-
riences. We cover the famous Monte-Carlo Tree-Search (MCTS) algorithm, followed by
an algorithm which is MCTS’ spiritual origin - the Adaptive Multi-Stage Sampling (AMS)
algorithm.
Both these algorithms are examples of Decision-Time Planning. The term Decision-Time
Planning requires some explanation. When it comes to Planning (with a model), there are
two possibilities:

• Background Planning: This refers to a planning method where the AI Agent pre-
computes the requisite Value Function/Policy for all states, and when it is time for
the AI Agent to perform the requisite action for a given state, it simply has to refer to
the pre-calculated policy and apply that policy to the given state. Essentially, in the
background, the AI Agent is constantly improving the requisite Value Function/Policy,
irrespective of which state the AI Agent is currently required to act on. Hence, the
term Background Planning.
• Decision-Time Planning: This approach contrasts with Background Planning. In
this approach, when the AI Agent has to identify the best action to take for a specific
state that the AI Agent currently encounters, the calculations for that best-action-
identification happens only when the AI Agent reaches that state. This is appropriate
in situations when there are such a large number of states in the state space that Back-
ground Planning is infeasible. However, for Decision-Time Planning to be effective,
the AI Agent needs to have sufficient time to be able to perform the calculations to
identify the action to take upon reaching a given state. This is feasible in games like
Chess where there is indeed some time for the AI Agent to make its move upon en-
countering a specific state of the chessboard (the move response doesn’t need to be
immediate). However, this is not feasible for a self-driving car, where the decision to
accelerate/brake or to steer must be immediate (this requires Background Planning).

Hence, with Decision-Time Planning, the AI Agent focuses all of the available computa-
tion and memory resources for the sole purpose of identifying the best action for a partic-
ular state (the state that has just been reached by the AI Agent). Decision-Time Planning is
typically successful because of this focus on a single state and consequently, on the states
that are most likely to be reached within the next few time steps (essentially, avoiding any
wasteful computation on states that are unlikely to be reached from the given state).
Decision-Time Planning typically looks much deeper than just a single time step ahead
(DP algorithms only look a single time step ahead) and evaluates action choices leading
to many different state and reward possibilities over the next several time steps. Searching
deeper than a single time step ahead is required because these Decision-Time Planning
algorithms typically work with imperfect Q-Values.
Decision-Time Planning methods sometimes go by the name Heuristic Search. Heuristic
Search refers to the method of growing out a tree of future states/actions/rewards from
the given state (which serves as the root of the tree). In classical Heuristic Search, an
approximate Value Function is calculated at the leaves of the tree and the Value Function
is then backed up to the root of the tree. Knowing the backed-up Q-Values at the root of
the tree enables the calculation of the best action for the root state. Modern methods of
Heuristic Search are very efficient in how the Value Function is approximated and backed
up. Monte-Carlo Tree-Search (MCTS) is one such efficient method that we cover in the
next section.

14.3. Monte-Carlo Tree-Search (MCTS)


Monte-Carlo Tree-Search (abbreviated as MCTS) is a Heuristic Search method that in-
volves growing out a Search Tree from the state for which we seek the best action (hence,
it is a Decision-Time Planning algorithm). MCTS was popularized in 2016 by DeepMind’s
AlphaGo algorithm (Silver et al. 2016). MCTS was first introduced by Remi Coulom for
game trees (Coulom 2006).
For every state in the Search Tree, we maintain the Q-Values for all actions from that
state. The basic idea is to form several sampling traces from the root of the tree (i.e., from
the given state) to terminal states. Each such sampling trace threads through the Search
Tree to a leaf node of the tree, and then extends beyond the tree from the leaf node to a
terminal state. This separation of the two pieces of each sampling trace is important - the
first piece within the tree, and the second piece outside the tree. Of particular importance
is the fact that the first piece of various sampling traces will pass through states (within
the Search Tree) that are quite likely to be reached from the given state (at the root node).
MCTS benefits from states in the tree being revisited several times, as it enables more ac-
curate Q-Values for those states (and consequently, a more accurate Q-Value at the root of
the tree, from backing-up of Q-Values). Moreover, these sampling traces prioritize actions
with good Q-Values. Prioritizing actions with good Q-Values has to be balanced against
actions that haven’t been tried sufficiently, and this is essentially the explore-exploit trade-
off that we covered in detail in Chapter 13.
Each sampling trace round of MCTS consists of four steps:

• Selection: Starting from the root node R (given state), we successively select children
nodes all the way until a leaf node L. This involves selecting actions based on a tree
policy, and selecting next states by sampling from the model of state transitions. The
trees in Figure 14.3 show states colored as white and actions colored as gray. This
Figure shows the Q-Values for a 2-player game (eg: Chess) where the reward is 1 at
termination for a win, 0 at termination for a loss, and 0 throughout the time the game
is in play. So the Q-Values in the Figure are displayed at each node in the form of
Wins as a fraction of Games Played that passed through the node (Games through
a node means the number of sampling traces that have run through the node). So
the label “1/6” for one of the State nodes (under “Selection,” the first image in the
Figure) means that we’ve had 6 sampling traces from the root node that have passed
through this State node labeled “1/6,” and 1 of those games was won by us. For
Actions nodes (gray nodes), the labels correspond to Opponent Wins as a fraction of
Games through the Action node. So the label “2/3” for one of the Action leaf nodes
means that we’ve had 3 sampling traces from the root node that have passed through
this Action leaf node, and 2 of those resulted in wins for the opponent (i.e., 1 win for
us).
• Expansion: On some rounds, the tree is expanded from L by adding a child node C
to it. In the Figure, we see that L is the Action leaf node labeled as “3/3” and we add
a child node C (state) to it labeled “0/0” (because we don’t yet have any sampling
traces running through this added state C).
• Simulation: From L (or from C if this round involved adding C), we complete the
sampling trace (that started from R and ran through L) all the way to a terminal state
T . This entire sampling trace from R to T is known as a single Monte-Carlo Simulation Trace, in which actions are selected according to the tree policy when within the tree, and according to a rollout policy beyond the tree (the term “rollout” refers to “rolling out” a simulation from the leaf node to termination). The tree policy is based on an Explore-Exploit tradeoff using estimated Q-Values, and the rollout policy is typically a simple policy such as a uniform policy.
• Backpropagation: The return generated by the sampling trace is backed up (“backpropagated”) to update (or initialize) the Q-Values at the nodes in the tree that are part of the sampling trace. Note that in the Figure, the rolled-out simulation trace resulted in a win for the opponent (loss for us). So the backed-up Q-Values reflect an extra win for the opponent (on the gray nodes, i.e., action nodes) and an extra loss for us (on the white nodes, i.e., state nodes).

Figure 14.3.: Monte-Carlo Tree-Search (This Wikipedia image is being used under the Creative Commons license CC BY-SA 4.0)

The Selection Step in MCTS involves picking a child node (action) with “most promise,”
for each state in the sampling trace of the Selection Step. This means prioritizing actions
with higher Q-Value estimates. However, this needs to be balanced against actions that
haven’t been tried sufficiently (i.e., those actions whose Q-Value estimates have consider-
able uncertainty). This is our usual Explore v/s Exploit tradeoff that we covered in detail
in Chapter 13. The Explore v/s Exploit formula for games was first provided by Kocsis
and Szepesvari (Kocsis and Szepesvári 2006). This formula is known as Upper Confidence
Bound 1 for Trees (abbreviated as UCT). Most current MCTS Algorithms are based on some
variant of UCT. UCT is based on the UCB1 formula of Auer, Cesa-Bianchi, Fischer (Auer,
Cesa-Bianchi, and Fischer 2002).
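The following is a minimal sketch (an illustrative assumption, not the AlphaGo or book implementation) of a UCT-style child selection within the tree policy:

import math
from dataclasses import dataclass, field
from typing import Dict

@dataclass
class Node:
    visits: int = 0
    total_return: float = 0.0
    children: Dict[object, "Node"] = field(default_factory=dict)   # action -> child node

def uct_choice(node: Node, c: float = math.sqrt(2)) -> object:
    # Pick the child (action) maximizing mean return plus an exploration bonus
    def score(child: Node) -> float:
        if child.visits == 0:
            return float("inf")   # force every action to be tried at least once
        exploit = child.total_return / child.visits
        explore = c * math.sqrt(math.log(node.visits) / child.visits)
        return exploit + explore
    return max(node.children, key=lambda a: score(node.children[a]))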

14.4. Adaptive Multi-Stage Sampling


It’s not well known that MCTS and UCT concepts first appeared in the Adaptive Multi-
Stage Sampling algorithm by Chang, Fu, Hu, Marcus (Chang et al. 2005). Adaptive Multi-
Stage Sampling (abbreviated as AMS) is a generic sampling-based algorithm to solve
finite-horizon Markov Decision Processes (although the paper describes how to extend
this algorithm for infinite-horizon MDPs). We consider AMS to be the “spiritual origin”
of MCTS/UCT, and hence we dedicate this section to coverage of AMS.
AMS is a planning algorithm - a sampling model is provided for the next state (condi-
tional on given state and action), and a model of Expected Reward (conditional on given
state and action) is also provided. AMS overcomes the curse of dimensionality by sam-
pling the next state. The key idea in AMS is to adaptively select actions based on a suit-
able tradeoff between Exploration and Exploitation. AMS was the first algorithm to apply
the theory of Multi-Armed Bandits to derive a provably convergent algorithm for solving
finite-horizon MDPs. Moreover, it performs far better than the typical backward-induction
approach to solving finite-horizon MDPs, in cases where the state space is very large and
the action space is fairly small.
We use the same notation we used in section 3.13 of Chapter 3 for Finite-Horizon MDPs
(time steps t = 0, 1, . . . T ). We assume that the state space St for time step t is very large
for all t = 0, 1, . . . , T − 1 (the state space ST for time step T consists of all terminal states).
We assume that the action space At for time step t is fairly small for all t = 0, 1, . . . , T − 1.
We denote the probability distribution for the next state, conditional on the current state
and action (for time step t) as the function Pt : (St × At ) → (St+1 → [0, 1]), defined as:

Pt (st , at )(st+1 ) = P[St+1 = st+1 |(St = st , At = at )]

As mentioned above, for all t = 0, 1, . . . , T − 1, AMS has access to only a sampling model
of Pt , that can be used to fetch a sample of the next state from St+1 . We also assume
that we are given the Expected Reward function Rt : St × At → R for each time step
t = 0, 1, . . . , T − 1 defined as:

Rt (st , at ) = E[Rt+1 |(St = st , At = at )]

We denote the Discount Factor as γ.


The problem is to calculate an approximation to the Optimal Value Function Vt∗ (st ) for
all st ∈ St for all t = 0, 1, . . . , T −1. Using only samples from the state-transition probability
distribution functions Pt and the Expected Reward functions Rt , AMS aims to do better
than backward induction for the case where St is very large and At is small for all t =
0, 1, . . . T − 1.
The AMS algorithm is based on a fixed allocation of the number of action selections
for each state at each time step. Denote the number of action selections for each state
at time step t as Nt . We ensure that each action at ∈ At is selected at least once, hence
$N_t \geq |\mathcal{A}_t|$. While the algorithm is running, we denote $N_t^{s_t,a_t}$ to be the number of selections of a particular action $a_t$ (for a given state $s_t$) until that point in the algorithm.

Denote $\hat{V}_t^{N_t}(s_t)$ as the AMS Algorithm's approximation of $V_t^*(s_t)$, utilizing all of the $N_t$ action selections. For a given state $s_t$, for each selection of an action $a_t$, one next state is sampled from the probability distribution $\mathcal{P}_t(s_t, a_t)$ (over the state space $\mathcal{S}_{t+1}$). For a fixed $s_t$ and fixed $a_t$, let us denote the $j$-th sample of the next state (for $j = 1, \ldots, N_t^{s_t,a_t}$) as $s_{t+1}^{(s_t,a_t,j)}$. Each such next-state sample $s_{t+1}^{(s_t,a_t,j)} \sim \mathcal{P}_t(s_t, a_t)$ leads to a recursive call to $\hat{V}_{t+1}^{N_{t+1}}(s_{t+1}^{(s_t,a_t,j)})$ in order to calculate the approximation $\hat{Q}_t(s_t, a_t)$ of the Optimal Action Value Function $Q_t^*(s_t, a_t)$ as:

$$\hat{Q}_t(s_t, a_t) = \mathcal{R}_t(s_t, a_t) + \gamma \cdot \frac{\sum_{j=1}^{N_t^{s_t,a_t}} \hat{V}_{t+1}^{N_{t+1}}(s_{t+1}^{(s_t,a_t,j)})}{N_t^{s_t,a_t}}$$

Now let us understand how the Nt action selections are done for a given state st . First
we select each of the actions in At exactly once. This is a total of |At | action selections.
Each of the remaining Nt − |At | action selections (indexed as i ranging from |At | to Nt − 1)
is made based on the action that maximizes the following UCT formula (thus balancing exploration and exploitation):

$$\hat{Q}_t(s_t, a_t) + \sqrt{\frac{2 \log i}{N_t^{s_t,a_t}}} \quad (14.1)$$

When all $N_t$ action selections are made for a given state $s_t$, $V_t^*(s_t) = \max_{a_t \in \mathcal{A}_t} Q_t^*(s_t, a_t)$ is approximated as:

$$\hat{V}_t^{N_t}(s_t) = \sum_{a_t \in \mathcal{A}_t} \frac{N_t^{s_t,a_t}}{N_t} \cdot \hat{Q}_t(s_t, a_t) \quad (14.2)$$

Now let's write a Python class to implement AMS. We start by writing its constructor.
For convenience, we assume each of the state spaces St (for t = 0, 1, . . . , T ) is the same
(denoted as S) and the allowable actions are the same across all time steps (denoted as
A).

from rl.distribution import Distribution

A = TypeVar('A')
S = TypeVar('S')

class AMS(Generic[S, A]):

    def __init__(
        self,
        actions_funcs: Sequence[Callable[[S], Set[A]]],
        state_distr_funcs: Sequence[Callable[[S, A], Distribution[S]]],
        expected_reward_funcs: Sequence[Callable[[S, A], float]],
        num_samples: Sequence[int],
        gamma: float
    ) -> None:
        self.num_steps: int = len(actions_funcs)
        self.actions_funcs: Sequence[Callable[[S], Set[A]]] = \
            actions_funcs
        self.state_distr_funcs: Sequence[Callable[[S, A], Distribution[S]]] = \
            state_distr_funcs
        self.expected_reward_funcs: Sequence[Callable[[S, A], float]] = \
            expected_reward_funcs
        self.num_samples: Sequence[int] = num_samples
        self.gamma: float = gamma

Let us understand the inputs to the constructor __init__.

• actions_funcs consists of a Sequence (for all of t = 0, 1, . . . , T − 1) of functions, each


mapping a state in St to a set of actions within A, that we denote as At (st ) (i.e.,
Callable[[S], Set[A]]).
• state_distr_funcs represents Pt for all t = 0, 1, . . . , T − 1 (accessible only as a sam-
pling model since the return type of each Callable is Distribution).
• expected_reward_funcs represents Rt for all t = 0, 1, . . . , T − 1.
• num_samples represents Nt (the number of actions selections for each state st ) for all
t = 0, 1, . . . , T − 1.
• gamma represents the discount factor γ.

self.num_steps represents the number of time steps T .


Next we write the method optimal_vf_and_policy to compute V̂tNt (st ) and the associ-
ated recommended action for state st (note the type of the output, representing this pair
as Tuple[float, A]).

In the code below, val_sums builds up the sum $\sum_{j=1}^{N_t^{s_t,a_t}} \hat{V}_{t+1}^{N_{t+1}}(s_{t+1}^{(s_t,a_t,j)})$, and counts represents $N_t^{s_t,a_t}$. Before the for loop, we initialize val_sums by selecting each action $a_t \in \mathcal{A}_t(s_t)$ exactly once. Then, for each iteration i of the for loop (for i ranging from $|\mathcal{A}_t(s_t)|$ to $N_t - 1$), we calculate the Upper-Confidence Value (ucb_vals in the code below) for each of the actions $a_t \in \mathcal{A}_t(s_t)$ using the UCT formula of Equation (14.1), and pick an action $a_t^*$ that maximizes ucb_vals. After the termination of the for loop, optimal_vf_and_policy returns the Optimal Value Function approximation for $s_t$ based on Equation (14.2) and the recommended action for $s_t$ as the action that maximizes $\hat{Q}_t(s_t, a_t)$.

import numpy as np
from operator import itemgetter
from typing import Dict, Mapping, Tuple

# method of the AMS class above
def optimal_vf_and_policy(self, t: int, s: S) -> Tuple[float, A]:
    actions: Set[A] = self.actions_funcs[t](s)
    state_distr_func: Callable[[S, A], Distribution[S]] = \
        self.state_distr_funcs[t]
    expected_reward_func: Callable[[S, A], float] = \
        self.expected_reward_funcs[t]
    rewards: Mapping[A, float] = {a: expected_reward_func(s, a)
                                  for a in actions}
    # initialize by selecting each action exactly once (one sampled transition each)
    val_sums: Dict[A, float] = {a: (self.optimal_vf_and_policy(
        t + 1,
        state_distr_func(s, a).sample()
    )[0] if t < self.num_steps - 1 else 0.) for a in actions}
    counts: Dict[A, int] = {a: 1 for a in actions}
    for i in range(len(actions), self.num_samples[t]):
        # UCT formula of Equation (14.1)
        ucb_vals: Mapping[A, float] = \
            {a: rewards[a] + self.gamma * val_sums[a] / counts[a] +
             np.sqrt(2 * np.log(i) / counts[a]) for a in actions}
        max_actions: Sequence[A] = [a for a, u in ucb_vals.items()
                                    if u == max(ucb_vals.values())]
        a_star: A = np.random.default_rng().choice(max_actions)
        val_sums[a_star] += (self.optimal_vf_and_policy(
            t + 1,
            state_distr_func(s, a_star).sample()
        )[0] if t < self.num_steps - 1 else 0.)
        counts[a_star] += 1
    # Equation (14.2) for the value, and the Q-maximizing action as the recommendation
    return (
        sum(counts[a] / self.num_samples[t] *
            (rewards[a] + self.gamma * val_sums[a] / counts[a])
            for a in actions),
        max(
            [(a, rewards[a] + self.gamma * val_sums[a] / counts[a])
             for a in actions],
            key=itemgetter(1)
        )[0]
    )

The above code is in the file rl/chapter15/ams.py. The __main__ in this file tests the
AMS algorithm for the simple case of the Dynamic Pricing problem that we had covered
in Section 3.14 of Chapter 3, although the Dynamic Pricing problem itself is not a problem
where AMS would do better than backward induction (since its state space is not very
large). We encourage you to play with our implementation of AMS by constructing a
finite-horizon MDP with a large state space (and small-enough action space). An example
of such a problem is Optimal Stopping (in particular, pricing of American Options) that
we had covered in Chapter 7.
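To make the interface concrete, here is a minimal, hypothetical usage sketch (separate from the Dynamic Pricing __main__ in rl/chapter15/ams.py): it wires the AMS class above to a made-up 3-step problem with integer states and two actions. The toy_* functions and all numbers are invented purely for illustration, and we assume rl.distribution.Categorical can be built from an outcome-to-probability mapping, as in the book's other code.

from typing import Set

from rl.distribution import Categorical, Distribution

def toy_actions(s: int) -> Set[int]:
    return {-1, 1}                             # two actions available in every state

def toy_transition(s: int, a: int) -> Distribution[int]:
    # noisy move: mostly s + a, sometimes stay, occasionally s - a
    return Categorical({s + a: 0.7, s: 0.2, s - a: 0.1})

def toy_reward(s: int, a: int) -> float:
    return float(s + a)                        # made-up expected reward

steps = 3
ams = AMS(                                     # AMS is the class defined above
    actions_funcs=[toy_actions] * steps,
    state_distr_funcs=[toy_transition] * steps,
    expected_reward_funcs=[toy_reward] * steps,
    num_samples=[30] * steps,
    gamma=0.9
)
print(ams.optimal_vf_and_policy(t=0, s=0))     # (approximate optimal value, recommended action)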
Now let’s analyze the running-time complexity of AMS. Let N = max (N0 , N1 , . . . , NT −1 ).
At each time step t, the algorithm makes at most N recursive calls, and so the running-

493
time complexity is O(N T ). Note that since we need to select every action at least once for
every state at every time step, N ≥ |A|, meaning the running-time complexity is at least
|A|T . Compare this against the running-time complexity of backward induction, which
is O(|S|2 · |A| · T ). So, AMS is more efficient when S is very large (which is typical in
many real-world problems). In their paper, Chang, Fu, Hu, Marcus proved that the Value
Function approximation V̂0N0 is asymptotically unbiased, i.e.,

lim lim . . . lim E[V̂0N0 (s0 )] = V0∗ (s0 ) for all s0 ∈ S


N0 →∞ N1 →∞ NT −1 →∞

They also proved that the worst-possible bias is bounded by a quantity that converges to
PT −1 ln Nt
zero at the rate of O( t=0 Nt ). Specifically,

X
T −1
ln Nt
0 ≤ V0∗ (s0 ) − E[V̂0N0 (s0 )] ≤ O( ) for all s0 ∈ S
Nt
t=0

14.5. Summary of Key Learnings from this Chapter


• Planning versus Learning, and how to blend Planning and Learning.
• Monte-Carlo Tree-Search (MCTS): An example of a Planning algorithm based on
Tree-Search and based on sampling/RL techniques.
• Adaptive Multi-Stage Sampling (AMS): The spiritual origin of MCTS - it is an effi-
cient algorithm for finite-horizon MDPs with very large state space and fairly small
action space.

15. Summary and Real-World Considerations
The purpose of this chapter is two-fold: Firstly to summarize the key learnings from this
book, and secondly to provide some commentary on how to take the learnings from this
book into practice (to solve real-world problems). On the latter, we specifically focus on
the challenges one faces in the real-world - modeling difficulties, problem-size difficulties,
operational challenges, data challenges (access, cleaning, organization), product manage-
ment challenges (eg: addressing the gap between the technical problem being solved and
the business problem to be solved), and also change-management challenges as one shifts
an enterprise from legacy systems to an AI system.

15.1. Summary of Key Learnings from this Book


In Module I, we covered the Markov Decision Process framework, the Bellman Equations,
Dynamic Programming algorithms, Function Approximation, and Approximate Dynamic
Programming.
Module I started with Chapter 1, where we first introduced the very important Markov
Property, a concept that enables us to reason effectively and compute efficiently in practical
systems involving sequential uncertainty. Such systems are best approached through the
very simple framework of Markov Processes, involving probabilistic state transitions. Next,
we developed the framework of Markov Reward Processes (MRP), the MRP Value Func-
tion, and the MRP Bellman Equation, which expresses the MRP Value Function recursively.
We showed how this MRP Bellman Equation can be solved with simple linear-algebraic-
calculations when the state space is finite and not too large.
In Chapter 2, we developed the framework of Markov Decision Processes (MDP). A key
learning from this Chapter is that an MDP evaluated with a fixed Policy is equivalent to an
MRP. Calculating the Value Function of an MDP evaluated with a fixed Policy (i.e. calcu-
lating the Value Function of an MRP) is known as the Prediction problem. We developed
the 4 forms of the MDP Bellman Policy Equations (which are essentially equivalent to the
MRP Bellman Equation). Next, we defined the Control problem as the calculation of the
Optimal Value Function (and an associated Optimal Policy) of an MDP. Correspondingly,
we developed the 4 forms of the MDP Bellman Optimality Equation. We stated and proved
an important theorem on the existence of an Optimal Policy, and of each Optimal Policy
achieving the Optimal Value Function. We finished this Chapter with some commentary
on variants and extensions of MDPs. Here we introduced the two Curses in the context of
solving MDP Prediction and Control - the Curse of Dimensionality and the Curse of Mod-
eling, which can be battled with appropriate approximation of the Value Function and
with appropriate sampling from the state-reward transition probability function. Here,
we also covered Partially-Observable Markov Decision Processes (POMDP), which refers
to situations where all components of the State are not observable (quite typical in the
real-world). Often, we pretend a POMDP is an MDP as the MDP framework fetches us
computational tractability. Modeling a POMDP as an MDP is indeed a very challenging
endeavor in the real-world. However, sometimes partial state-observability cannot be ig-

nored, and in such situations, we have to employ (computationally expensive) algorithms
to solve the POMDP.
In Chapter 3, we first covered the foundation of the classical Dynamic Programming
(DP) algorithms - the Banach Fixed-Point Theorem, which gives us a simple method for
iteratively solving for a fixed-point of a contraction function. Next, we constructed the
Bellman Policy Operator and showed that it’s a contraction function, meaning we can take
advantage of the Banach Fixed-Point Theorem, yielding a DP algorithm to solve the Predic-
tion problem, referred to as the Policy Evaluation algorithm. Next, we introduced the no-
tions of a Greedy Policy and Policy Improvement, which yields a DP algorithm known as
Policy Iteration to solve the Control problem. Next, we constructed the Bellman Optimal-
ity Operator and showed that it’s a contraction function, meaning we can take advantage
of the Banach Fixed-Point Theorem, yielding a DP algorithm to solve the Control problem,
referred to as the Value Iteration algorithm. Next, we introduced the all-important concept
of Generalized Policy Iteration (GPI) - the powerful idea of alternating between any method
for Policy Evaluation and any method for Policy Improvement, including methods that
are partial applications of Policy Evaluation or Policy Improvement. This generalized per-
spective unifies almost all of the algorithms that solve MDP Control problems (including
Reinforcement Learning algorithms). We finished this chapter with coverage of Backward
Induction algorithms to solve Prediction and Control problems for finite-horizon MDPs
- Backward Induction is a simple technique to backpropagate the Value Function from
horizon-end to the start. It is important to note that the DP algorithms in this chapter ap-
ply to MDPs with a finite number of states and that these algorithms are computationally
feasible only if the state space is not too large (the next chapter extends these DP algo-
rithms to handle large state spaces, including infinite state spaces).
In Chapter 4, we first covered a refresher on Function Approximation by developing
the calculations first for linear function approximation and then for feed-forward fully-
connected deep neural networks. We also explained that a Tabular prediction can be
viewed as a special form of function approximation (since it satisfies the interface we de-
signed for Function Approximation). With this apparatus for Function Approximation,
we extended the DP algorithms of the previous chapter to Approximate Dynamic Pro-
gramming (ADP) algorithms in a rather straightforward manner. In fact, DP algorithms
can be viewed as special cases of ADP algorithms by setting the function approximation
to be Tabular. Essentially, we replace tabular Value Function updates with updates to
Function Approximation parameters (where the Function Approximation represents the
Value Function). The sweeps over all states in the tabular (DP) algorithms are replaced by
sampling states in the ADP algorithms, and expectation calculations in Bellman Operators
are handled in ADP as averages of the corresponding calculations over transition samples
(versus calculations using explicit transition probabilities in the DP algorithms).
Module II was about Modeling Financial Applications as MDPs. We started Module
II with a basic coverage of Utility Theory in Chapter 5. The concept of Utility is vital
since Utility of cashflows is the appropriate Reward in the MDP for many financial appli-
cations. In this chapter, we explained that an individual’s financial risk-aversion is rep-
resented by the concave nature of the individual’s Utility as a function of financial out-
comes. We showed that the Risk-Premium (compensation an individual seeks for taking
financial risk) is roughly proportional to the individual’s financial risk-aversion and also
proportional to the measure of uncertainty in financial outcomes. Risk-Adjusted-Return
in finance should be thought of as the Certainty-Equivalent-Value, whose Utility is the
Expected Utility across uncertain (risky) financial outcomes. We finished this chapter by
covering the Constant Absolute Risk-Aversion (CARA) and the Constant Relative Risk-

Aversion (CRRA) Utility functions, along with simple asset allocation examples for each
of CARA and CRRA Utility functions.
In Chapter 6, we covered the problem of Dynamic Asset-Allocation and Consumption.
This is a fundamental problem in Mathematical Finance of jointly deciding on A) optimal
investment allocation (among risky and riskless investment assets) and B) optimal con-
sumption, over a finite horizon. We first covered Merton’s landmark paper from 1969 that
provided an elegant closed-form solution under assumptions of continuous-time, normal
distribution of returns on the assets, CRRA utility, and frictionless transactions. In a more
general setting of this problem, we need to model it as an MDP. If the MDP is not too large
and if the asset return distributions are known, we can employ finite-horizon ADP algo-
rithms to solve it. However, in typical real-world situations, the action space can be quite
large and the asset return distributions are unknown. This points to RL, and specifically
RL algorithms that are well suited to tackle large action spaces (such as Policy Gradient
Algorithms).
In Chapter 7, we covered the problem of pricing and hedging of derivative securities.
We started with the fundamental concepts of Arbitrage, Market-Completeness and Risk-
Neutral Probability Measure. Based on these concepts, we stated and proved the two fun-
damental theorems of Asset Pricing for the simple case of a single discrete time-step. These
theorems imply that the pricing of derivatives in an arbitrage-free and complete market
can be done in two equivalent ways: A) Based on construction of a replicating portfolio,
and B) Based on riskless rate-discounted expectation in the risk-neutral probability mea-
sure. Finally, we covered two financial trading problems that can be cast as MDPs. The
first problem is the Optimal Exercise of American Options (and its generalization to Op-
timal Stopping problems). The second problem is the Pricing and Hedging of Derivatives
in an Incomplete (real-world) Market.
In Chapter 8, we covered problems involving trading optimally on an Order Book. We
started with developing an understanding of the core ingredients of an Order Book: Limit
Orders, Market Orders, Order Book Dynamics, and Price Impact. The rest of the chapter
covered two important problems that can be cast as MDPs. These are the problems of Op-
timal Order Execution and Optimal Market-Making. For each of these two problems, we
derived closed-form solutions under highly simplified assumptions (eg: Bertsimas-Lo,
Avellaneda-Stoikov formulations), which helps develop intuition. Since these problems
are modeled as finite-horizon MDPs, we can implement backward-induction ADP algo-
rithms to solve them. However, in practice, we need to develop Reinforcement Learning
algorithms (and associated market simulators) to solve these problems in real-world set-
tings to overcome the Curse of Dimensionality and Curse of Modeling.
Module III covered Reinforcement Learning algorithms. Module III starts by motivat-
ing the case for Reinforcement Learning (RL). In the real-world, we typically do not have
access to a model of state-reward transition probabilities. Typically, we simply have ac-
cess to an environment that serves up the next state and reward, given the current state and
action, at each step in the AI Agent’s interaction with the environment. The environment
could be the actual environment or could be a simulated environment (the latter from a
learnt model of the environment). RL algorithms for Prediction/Control learn the requi-
site Value Function/Policy by obtaining sufficient data (atomic experiences) from interaction
with the environment. This is a sort of “trial and error” learning, through a process of pri-
oritizing actions that seem to fetch good rewards, and deprioritizing actions that seem to
fetch poor rewards. Specifically, RL algorithms are in the business of learning an approx-
imate Q-Value Function, an estimate of the Expected Return for any given action in any
given state. The success of RL algorithms depends not only on their ability to learn the

Q-Value Function in an incremental manner through interactions with the environment,
but also on their ability to perform good generalization of the Q-Value Function with ap-
propriate function approximation (often using deep neural networks, in which case we
term it as Deep RL). Most RL algorithms are founded on the Bellman Equations and all
RL Control algorithms are based on the fundamental idea of Generalized Policy Iteration.
In Chapter 9, we covered RL Prediction algorithms. Specifically, we covered Monte-
Carlo (MC) and Temporal-Difference (TD) algorithms for Prediction. A key learning from
this Chapter was the Bias-Variance tradeoff in MC versus TD. Another key learning was
that while MC Prediction learns the statistical mean of the observed returns, TD Prediction
learns something “deeper” - TD implicitly estimates an MRP from the observed data and
produces the Value Function of the implicitly-estimated MRP. We emphasized viewing TD
versus MC versus DP from the perspectives of “bootstrapping” and “experiencing.” We
finished this Chapter by covering λ-Return Prediction and TD(λ) Prediction algorithms,
which give us a way to tradeoff bias versus variance (along the spectrum of MC to TD) by
tuning the λ parameter. TD is equivalent to TD(0) and MC is “equivalent” to TD(1).
In Chapter 10, we covered RL Control algorithms. We re-emphasized that RL Control
is based on the idea of Generalized Policy Iteration (GPI). We explained that Policy Eval-
uation is done for the Q-Value Function (instead of the State-Value Function), and that
the Improved Policy needs to be exploratory, eg: ϵ-greedy. Next we described an im-
portant concept - Greedy in the Limit with Infinite Exploration (GLIE). Our first RL Control
algorithm was GLIE Monte-Carlo Control. Next, we covered two important TD Control
algorithms: SARSA (which is On-Policy) and Q-Learning (which is Off-Policy). We briefly
covered Importance Sampling, which is a different way of doing Off-Policy algorithms.
We wrapped up this Chapter with some commentary on the convergence of RL Predic-
tion and RL Control algorithms. We highlighted a strong pattern of situations when we
run into convergence issues - it is when all three of [Bootstrapping, Function Approxima-
tion, Off-Policy] are done together. We’ve seen how each of these three is individually
beneficial, but when the three come together, it's “too much of a good thing”, bringing
about convergence issues. The confluence of these three is known as the Deadly Triad (an
example of this would be Q-Learning with Function Approximation).
In Chapter 11, we covered the more nuanced RL Algorithms, going beyond the plain-
vanilla MC and TD algorithms we covered in Chapters 9 and 10. We started this Chapter by
introducing the novel ideas of Batch RL and Experience-Replay. Next, we covered the Least-
Squares Monte-Carlo (LSMC) Prediction algorithm and the Least-Squares Temporal-Difference
(LSTD) algorithm, which is a direct (gradient-free) solution of Batch TD. Next, we covered
the very important Deep Q-Networks (DQN) algorithm, which uses Experience-Replay
and fixed Q-learning targets, in order to avoid the pitfalls of time-correlation and varying
TD Target. Next, we covered the Least-Squares Policy Iteration (LSPI) algorithm, which is
an Off-Policy, Experience-Replay Control Algorithm using LSTDQ for Policy Evaluation.
Then we showed how Optimal Exercise of American Options can be tackled with LSPI and
Deep Q-Learning algorithms. In the second half of this Chapter, we looked deeper into the
issue of the Deadly Triad by viewing Value Functions as Vectors so as to understand Value
Function Vector transformations with a balance of geometric intuition and mathematical
rigor, providing insights into convergence issues for a variety of traditional loss functions
used to develop RL algorithms. Finally, this treatment of Value Functions as Vectors led
us in the direction of overcoming the Deadly Triad by defining an appropriate loss func-
tion, calculating whose gradient provides a more robust set of RL algorithms known as
Gradient Temporal Difference (Gradient TD).
In Chapter 12, we covered Policy Gradient (PG) algorithms, which are based on GPI

with Policy Improvement as a Stochastic Gradient Ascent for an Expected Returns Objective
using a policy function approximation. We started with the Policy Gradient Theorem that
gives us a simple formula for the gradient of the Expected Returns Objective in terms of the
score of the policy function approximation. Our first PG algorithm was the REINFORCE
algorithm, a Monte-Carlo Policy Gradient algorithm with no bias but high variance. We
showed how to tackle the Optimal Asset Allocation problem with REINFORCE. Next, we
showed how we can reduce variance in PG algorithms by using a critic and by using an es-
timate of the advantage function in place of the Q-Value Function. Next, we showed how
to overcome bias in PG Algorithms based on the Compatible Function Approximation Theo-
rem. Finally, we covered two specialized PG algorithms that have worked well in practice -
Natural Policy Gradient and Deterministic Policy Gradient. We also provided some cover-
age of Evolutionary Strategies, which are technically not RL algorithms, but they resemble
PG Algorithms and can sometimes be quite effective in solving MDP Control problems.
In Module IV, we provided some finishing touches by covering the topic of Exploration
versus Exploitation and the topic of Blending Learning and Planning in some detail. In
Chapter 13, we provided significant coverage of algorithms for the Multi-Armed Ban-
dit (MAB) problem, which provides a simple setting to understand and appreciate the
nuances of the Explore versus Exploit dilemma that we typically need to resolve within
RL Control algorithms. We started with simple methods such as Naive Exploration (eg:
ϵ-greedy) and Optimistic Initialization. Next, we covered methods based on the broad
approach of Optimism in the Face of Uncertainty (eg: Upper-Confidence Bounds). Next,
we covered the powerful and practically effective method of Probability Matching (eg:
Thompson Sampling). Then we also covered Gradient Bandit Algorithms and a disci-
plined approach to balancing exploration and exploitation by forming Information State
Space MDPs (incorporating value of Information), typically solved by treating as Bayes-
Adaptive MDPs. Finally, we noted that the above MAB algorithms are well-extensible to
Contextual Bandits and RL Control.
In Chapter 14, we covered the issue of Planning versus Learning, and showed how
to blend Planning and Learning. Next, we covered Monte-Carlo Tree-Search (MCTS),
which is a Planning algorithm based on Tree-Search and based on sampling/RL tech-
niques. Lastly, we covered Adaptive Multi-Stage Sampling (AMS), that we consider to
be the spiritual origin of MCTS - it is an efficient algorithm for finite-horizon MDPs with
very large state space and fairly small action space.

15.2. RL in the Real-World


Although this is an academic book on the Foundations of RL, we (the authors of this book)
have significant experience in leveraging the power of Applied Mathematics to solve prob-
lems in the real-world. So we devote this section to the various nuances of applying
RL in the real-world.
The most important point we’d like to make is that in order to develop models and al-
gorithms that will be effective in the real-world, one should not only have a deep technical
understanding but also a deep understanding of the business domain. If the busi-
ness domain is Financial Trading, one needs to be well-versed in the practical details of the
specific market one is working in, and the operational details and transactional frictions
involved in trading. These details need to be carefully captured in the MDP one is con-
structing. These details have ramifications on the choices made in defining the state space
and the action space. More importantly, defining the reward function is typically not an

obvious choice at all - it requires considerable thought and typically one would need to
consult with the business head to identify what exactly is the objective function in run-
ning the business, eg: the precise definition of the Utility function. One should also bear
in mind that a typical real-world problem is actually a Partially Observable Markov Deci-
sion Process (POMDP) rather than an MDP. In the pursuit of computational tractability,
one might approximate the POMDP as an MDP but in order to do so, one requires strong
understanding of the business domain. However, sometimes partial state-observability
cannot be ignored, and in such situations, we have to employ (computationally expen-
sive) algorithms to solve the POMDP. Indeed, controlling state space explosion is one of
the biggest challenges in the real-world. Much of the effort in modeling an MDP is to de-
fine a state space that finds the appropriate balance between capturing the key aspects of
the real-world problem and attaining computational tractability.
Now we’d like to share the approach we usually take when encountering a new prob-
lem, like one of the Financial Applications we covered in Module II. Our first stab at the
problem is to create a simpler version of the problem that lends itself to analytical tractabil-
ity, exploring ways to develop a closed-form solution (like we obtained for some of the
Financial Applications in Module II). This typically requires removing some of the fric-
tions and constraints of the real-world problem. For Financial Applications, this might
involve assuming no transaction costs, perhaps assuming continuous trading, perhaps as-
suming no liquidity constraints. There are multiple advantages of deriving a closed-form
solution with simplified assumptions. Firstly, the closed-form solution immediately pro-
vides tremendous intuition as it shows the analytical dependency of the Optimal Value
Function/Optimal Policy on the inputs and parameters of the problem. Secondly, when
we eventually obtain the solution to the full-fledged model, we can test the solution by
creating a special case of the full-fledged model that reduces to the simplified model for
which we have a closed-form solution. Thirdly, the expressions within the closed-form
solution provide us with some guidance on constructing appropriate features for function
approximation when solving the full-fledged model.
The next stage would be to bring in some of the real-world frictions and constraints,
and attempt to solve the problem with Dynamic Programming (or Approximate Dynamic
Programming). This means we need to construct a model of state-reward transition prob-
abilities. Such a model would be estimated from real-world data obtained from interac-
tion with the actual environment. However, often we find that Dynamic Programming
(or Approximate Dynamic Programming) is not an option due to the Curse of Modeling
(i.e., hard to build a model of transition probabilities). This leaves us with the eventual
go-to option of pursuing a Reinforcement Learning technique. In most real-world prob-
lems, we’d employ RL not with actual environment interactions, but with simulated en-
vironment interactions. This means we need to build a sampling model estimated from
real-world data obtained from interactions with the actual environment. In fact, in many
real-world problems, we’d want to augment the data-learnt simulator with human knowl-
edge/assumptions (specifically information that might not be readily obtained from elec-
tronic data that a human expert might be knowledgeable about). Having a simulator of the
environment is very valuable because we can run it indefinitely and also because we can
create a variety of scenarios (with different settings/assumptions) to run the simulator in.
Deep Learning-based function approximations have been quite successful in the context
of Reinforcement Learning algorithms (we refer to this as Deep Reinforcement Learning).
Lastly, it pays to re-emphasize that the learnings from Chapter 14 are very important for
real-world problems. In particular, the idea of blending model-based RL with model-free
RL (Figure 14.2) is an attractive option for real-world applications because the real-world

is typically not stationary and hence, models need to be updated continuously.
Given the plethora of choices for different types of RL algorithms, it is indeed difficult
to figure out which RL algorithm would be most suitable for a given real-world problem.
As ever, we recommend starting with a simple algorithm such as the MC and TD meth-
ods we used in Chapters 9 and 10. Although the simple algorithms may not be powerful
enough for many real-world applications, they are a good place to start to try out on a
smaller size of the actual problem - these simple RL algorithms are very easy to imple-
ment, reason about and debug. However, the most important advice we can give you is
that after having understood the various nuances of the specific real-world problem you
want to solve, you should aim to construct an RL algorithm that is customized for your
problem. One must recognize that the set of RL algorithms is not a fixed menu to choose
from. Rather, there are various pieces of RL algorithms that are open to modification. In
fact, we can combine different aspects of different algorithms to suit our specific needs for
a given real-world problem. We not only make choices on features in function approxima-
tions and on hyper-parameters, we also make choices on the exact design of the algorithm
method, eg: how exactly we’d like to do Off-Policy Learning, or how exactly we’d like to
do the Policy Evaluation component of Generalized Policy Iteration in our Control algo-
rithm. In practice, we’ve found that we often end up with the more advanced algorithms
due to the typical real-world problem complexity or state-space/action-space size. There
is no silver bullet here, and one has to try various algorithms to see which one works best
for the given problem. However, it pays to share that the algorithms that have worked
well for us in real-world problems are Least-Squares Policy Iteration, Gradient Temporal-
Difference, Deep Q-Networks and Natural Policy Gradient. We have always paid attention
to Richard Sutton’s mantra of avoiding the Deadly Triad. We recommend the excellent
paper by Hasselt, Doron, Strub, Hessel, Sonnerat, Modayil (Hasselt et al. 2018) to under-
stand the nuances of the Deadly Triad in the context of Deep Reinforcement Learning.
It’s important to recognize that the code we developed in this book is for educational
purposes and we barely made an attempt to make the code performant. In practice, this
type of educational code won’t suffice - we need to develop highly performant code and
make the code parallelizable wherever possible. This requires an investment in a suitable
distributed system for storage and compute, so the RL algorithms can be trained in an
efficient manner.
When it comes to making an RL algorithm successful in a real-world application, the
design and implementation of the model and the algorithm is only a small piece of the
overall puzzle. Indeed, one needs to build an entire ecosystem of data management, soft-
ware engineering, model training infrastructure, model deployment platform, tools for
easy debugging, measurements/instrumentation and explainability of results. Moreover,
it is vital to have a strong Product Management practice in order to ensure that the algo-
rithm is serving the needs of the overall product being built. Indeed, the goal is to build a
successful product, not just a model and an algorithm. A key challenge in many organiza-
tions is to replace a legacy system or a manual system with a modern solution (eg: with
an RL-based solution). This requires investment in a culture change in the organization
so that all stakeholders are supportive, otherwise the change management will be very
challenging.
When the product carrying the RL algorithm runs in production, it is vital to evaluate
whether the real-world problem is actually being solved effectively by defining, evaluating
and reporting the appropriate success metrics. If those metrics are found to be inadequate,
we need the appropriate feedback system in the organization to investigate why the prod-
uct (and perhaps the model) is not delivering the requisite results. It could be that we

have designed a model which is not quite the right fit for the real-world problem, in which
case we improve the model in the next iteration of this feedback system. It often takes sev-
eral iterations of evaluating the success metrics, providing feedback, and improving the
model (and sometimes the algorithm) in order to achieve adequate results. An important
point to note is that, in practice, we rarely need to solve all the way to an optimum -
typically, being close to the optimum is good enough to achieve the requisite success metrics.
A Product Manager must constantly question whether we are solving the right problem,
and whether we are investing our efforts in the most important aspects of the problem (eg:
ask if it suffices to be reasonably close to optimum).
Lastly, one must recognize that typically in the real-world, we are plagued with noisy
data, incomplete data and sometimes plain wrong data. The design of the model needs
to take this into account. Also, there is no such thing as the “perfect model” - in practice,
a model is simply a crude approximation of reality. It should be assumed by default that
we have bad data and that we have an imperfect model. Hence, it is important to build a
system that can reason about uncertainties in data and about uncertainties with the model.
A book can simply not do justice to explaining the various nuances and complications
that arise in developing and deploying an RL-based solution in the real-world. Here we
have simply scratched the surface of the various issues that arise. You would truly un-
derstand and appreciate these nuances and complications only by stepping into the real-
world and experiencing it for yourself. However, it is important to first be grounded in
the foundations of RL, which is what we hope you got from this book.

Appendix

A. Moment Generating Function and its
Applications
The purpose of this Appendix is to introduce the Moment Generating Function (MGF) and
demonstrate it’s utility in several applications in Applied Mathematics.

A.1. The Moment Generating Function (MGF)


The Moment Generating Function (MGF) of a random variable x (discrete or continuous)
is defined as a function fx : R → R+ such that:

fx (t) = Ex [etx ] for all t ∈ R (A.1)


Let us denote the $n$-th derivative of $f_x$ as $f_x^{(n)}: \mathbb{R} \to \mathbb{R}$ for all $n \in \mathbb{Z}_{\geq 0}$ ($f_x^{(0)}$ is defined to be simply the MGF $f_x$).

$$f_x^{(n)}(t) = E_x[x^n \cdot e^{tx}] \text{ for all } n \in \mathbb{Z}_{\geq 0}, \text{ for all } t \in \mathbb{R} \tag{A.2}$$

$$f_x^{(n)}(0) = E_x[x^n] \tag{A.3}$$

$$f_x^{(n)}(1) = E_x[x^n \cdot e^x] \tag{A.4}$$


Equation (A.3) tells us that $f_x^{(n)}(0)$ gives us the $n$-th moment of $x$. In particular, $f_x^{(1)}(0) = f_x'(0)$ gives us the mean and $f_x^{(2)}(0) - (f_x^{(1)}(0))^2 = f_x''(0) - (f_x'(0))^2$ gives us the variance.

Note that this holds true for any distribution for x. This is rather convenient since all we
need is the functional form for the distribution of x. This would lead us to the expression
for the MGF (in terms of t). Then, we take derivatives of this MGF and evaluate those
derivatives at 0 to obtain the moments of x.
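As a small illustration of this recipe (a sketch, not from the book's codebase, using the exponential distribution, whose MGF is $\frac{\lambda}{\lambda - t}$ for $t < \lambda$), we can differentiate the MGF symbolically and evaluate at 0 to recover the first two moments:

import sympy as sp

t, lam = sp.symbols('t lam', positive=True)
mgf = lam / (lam - t)            # MGF of an exponential(lam) random variable, valid for t < lam

first_moment = sp.diff(mgf, t, 1).subs(t, 0)    # E[x]   = 1/lam
second_moment = sp.diff(mgf, t, 2).subs(t, 0)   # E[x^2] = 2/lam^2
variance = sp.simplify(second_moment - first_moment ** 2)   # 1/lam^2

print(first_moment, second_moment, variance)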
Equation (A.4) helps us calculate the often-appearing expectation Ex [xn · ex ]. In fact,
Ex [ex ] and Ex [x · ex ] are very common in several areas of Applied Mathematics. Again,
note that this holds true for any distribution for x.
The MGF should be thought of as an alternative specification of a random variable (al-
ternative to specifying its Probability Distribution). This alternative specification is very
valuable because it can sometimes provide better analytical tractability than working with
the Probability Density Function or Cumulative Distribution Function (as an example, see
the below section on the MGF for linear functions of independent random variables).

A.2. MGF for Linear Functions of Random Variables


Consider $m$ independent random variables $x_1, x_2, \ldots, x_m$. Let $\alpha_0, \alpha_1, \ldots, \alpha_m \in \mathbb{R}$. Now consider the random variable

$$x = \alpha_0 + \sum_{i=1}^m \alpha_i x_i$$

The Probability Density Function of x is complicated to calculate as it involves convolu-
tions. However, observe that the MGF fx of x is given by:

$$f_x(t) = E[e^{t(\alpha_0 + \sum_{i=1}^m \alpha_i x_i)}] = e^{\alpha_0 t} \cdot \prod_{i=1}^m E[e^{t\alpha_i x_i}] = e^{\alpha_0 t} \cdot \prod_{i=1}^m f_{\alpha_i x_i}(t) = e^{\alpha_0 t} \cdot \prod_{i=1}^m f_{x_i}(\alpha_i t)$$

This means the MGF of $x$ can be calculated as $e^{\alpha_0 t}$ times the product of the MGFs of $\alpha_i x_i$ (or of $\alpha_i$-scaled MGFs of $x_i$) for all $i = 1, 2, \ldots, m$. This gives us a much more tractable way to work with the probability distribution of $x$ analytically (compared to the convolution approach).
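Here is a quick Monte-Carlo check of this product formula (a sketch with arbitrarily chosen distributions, coefficients and evaluation point, none of which come from the text):

import numpy as np

rng = np.random.default_rng(1)
t = 0.4                                            # an arbitrary point at which to compare the MGFs
alpha0, alpha1, alpha2 = 0.2, 0.5, -0.3            # arbitrary illustrative coefficients
x1 = rng.exponential(scale=1.5, size=1_000_000)    # two independent random variables
x2 = rng.uniform(-1.0, 1.0, size=1_000_000)        # (any distributions would do)

x = alpha0 + alpha1 * x1 + alpha2 * x2

lhs = np.mean(np.exp(t * x))                       # f_x(t) estimated directly
rhs = np.exp(alpha0 * t) * np.mean(np.exp(alpha1 * t * x1)) * \
      np.mean(np.exp(alpha2 * t * x2))             # e^{alpha0 t} * f_x1(alpha1 t) * f_x2(alpha2 t)
print(lhs, rhs)                                    # approximately equal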

A.3. MGF for the Normal Distribution


Here we assume that the random variable $x$ follows a normal distribution. Let $x \sim N(\mu, \sigma^2)$.

$$f_{x \sim N(\mu,\sigma^2)}(t) = E_{x \sim N(\mu,\sigma^2)}[e^{tx}]$$
$$= \int_{-\infty}^{+\infty} \frac{1}{\sqrt{2\pi}\sigma} \cdot e^{-\frac{(x-\mu)^2}{2\sigma^2}} \cdot e^{tx} \cdot dx$$
$$= \int_{-\infty}^{+\infty} \frac{1}{\sqrt{2\pi}\sigma} \cdot e^{-\frac{(x-(\mu+t\sigma^2))^2}{2\sigma^2}} \cdot e^{\mu t + \frac{\sigma^2 t^2}{2}} \cdot dx$$
$$= e^{\mu t + \frac{\sigma^2 t^2}{2}} \cdot E_{x \sim N(\mu+t\sigma^2,\sigma^2)}[1]$$
$$= e^{\mu t + \frac{\sigma^2 t^2}{2}} \tag{A.5}$$

$$f'_{x \sim N(\mu,\sigma^2)}(t) = E_{x \sim N(\mu,\sigma^2)}[x \cdot e^{tx}] = (\mu + \sigma^2 t) \cdot e^{\mu t + \frac{\sigma^2 t^2}{2}} \tag{A.6}$$

$$f''_{x \sim N(\mu,\sigma^2)}(t) = E_{x \sim N(\mu,\sigma^2)}[x^2 \cdot e^{tx}] = ((\mu + \sigma^2 t)^2 + \sigma^2) \cdot e^{\mu t + \frac{\sigma^2 t^2}{2}} \tag{A.7}$$

$$f'_{x \sim N(\mu,\sigma^2)}(0) = E_{x \sim N(\mu,\sigma^2)}[x] = \mu$$
$$f''_{x \sim N(\mu,\sigma^2)}(0) = E_{x \sim N(\mu,\sigma^2)}[x^2] = \mu^2 + \sigma^2$$
$$f'_{x \sim N(\mu,\sigma^2)}(1) = E_{x \sim N(\mu,\sigma^2)}[x \cdot e^x] = (\mu + \sigma^2) e^{\mu + \frac{\sigma^2}{2}}$$
$$f''_{x \sim N(\mu,\sigma^2)}(1) = E_{x \sim N(\mu,\sigma^2)}[x^2 \cdot e^x] = ((\mu + \sigma^2)^2 + \sigma^2) e^{\mu + \frac{\sigma^2}{2}}$$
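A quick Monte-Carlo sanity check of these expressions (a sketch with arbitrary illustrative values of $\mu$ and $\sigma$), comparing sample estimates of $E[e^x]$ and $E[x \cdot e^x]$ against Equations (A.5) and (A.6) at $t = 1$:

import numpy as np

mu, sigma = 0.3, 0.8                          # arbitrary illustrative parameters
rng = np.random.default_rng(0)
x = rng.normal(mu, sigma, size=1_000_000)

# E[e^x] should be exp(mu + sigma^2 / 2)   (Equation (A.5) at t = 1)
print(np.mean(np.exp(x)), np.exp(mu + sigma ** 2 / 2))

# E[x * e^x] should be (mu + sigma^2) * exp(mu + sigma^2 / 2)   (Equation (A.6) at t = 1)
print(np.mean(x * np.exp(x)), (mu + sigma ** 2) * np.exp(mu + sigma ** 2 / 2))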

A.4. Minimizing the MGF


Now let us consider the problem of minimizing the MGF. The problem is to:

$$\min_{t \in \mathbb{R}} f_x(t) = \min_{t \in \mathbb{R}} E_x[e^{tx}]$$

This problem of minimizing $E_x[e^{tx}]$ shows up a lot in various places in Applied Mathematics when dealing with exponential functions (eg: when optimizing the Expectation of a Constant Absolute Risk-Aversion (CARA) Utility function $U(y) = \frac{1 - e^{-\gamma y}}{\gamma}$ where $\gamma$ is the coefficient of risk-aversion and where $y$ is a parameterized function of a random variable $x$).
Let us denote t∗ as the value of t that minimizes the MGF. Specifically,

$$t^* = \arg\min_{t \in \mathbb{R}} f_x(t) = \arg\min_{t \in \mathbb{R}} E_x[e^{tx}]$$

A.4.1. Minimizing the MGF when x follows a normal distribution


Here we consider the fairly typical case where x follows a normal distribution. Let x ∼
N (µ, σ 2 ). Then we have to solve the problem:
$$\min_{t \in \mathbb{R}} f_{x \sim N(\mu,\sigma^2)}(t) = \min_{t \in \mathbb{R}} E_{x \sim N(\mu,\sigma^2)}[e^{tx}] = \min_{t \in \mathbb{R}} e^{\mu t + \frac{\sigma^2 t^2}{2}}$$

From Equation (A.6) above, we have:

$$f'_{x \sim N(\mu,\sigma^2)}(t) = (\mu + \sigma^2 t) \cdot e^{\mu t + \frac{\sigma^2 t^2}{2}}$$

Setting this to 0 yields:

$$(\mu + \sigma^2 t^*) \cdot e^{\mu t^* + \frac{\sigma^2 t^{*2}}{2}} = 0$$

which leads to:

$$t^* = \frac{-\mu}{\sigma^2} \tag{A.8}$$

From Equation (A.7) above, we have:

$$f''_{x \sim N(\mu,\sigma^2)}(t) = ((\mu + \sigma^2 t)^2 + \sigma^2) \cdot e^{\mu t + \frac{\sigma^2 t^2}{2}} > 0 \text{ for all } t \in \mathbb{R}$$

which confirms that $t^*$ is a minimum. Substituting $t = t^*$ in $f_{x \sim N(\mu,\sigma^2)}(t) = e^{\mu t + \frac{\sigma^2 t^2}{2}}$ yields:

$$\min_{t \in \mathbb{R}} f_{x \sim N(\mu,\sigma^2)}(t) = e^{\mu t^* + \frac{\sigma^2 t^{*2}}{2}} = e^{\frac{-\mu^2}{2\sigma^2}} \tag{A.9}$$
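Here is a small numerical cross-check of Equations (A.8) and (A.9) (a sketch with arbitrary illustrative parameter values), minimizing the normal MGF directly:

import numpy as np
from scipy.optimize import minimize_scalar

mu, sigma = 0.5, 2.0                                    # arbitrary illustrative parameters
mgf = lambda t: np.exp(mu * t + sigma ** 2 * t ** 2 / 2)

res = minimize_scalar(mgf)
print(res.x, -mu / sigma ** 2)                          # minimizer vs. Equation (A.8)
print(res.fun, np.exp(-mu ** 2 / (2 * sigma ** 2)))     # minimum value vs. Equation (A.9)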

A.4.2. Minimizing the MGF when x is a symmetric binary distribution


Here we consider the case where x follows a binary distribution: x takes values µ + σ and
µ − σ with probability 0.5 each. Let us refer to this distribution as x ∼ B(µ + σ, µ − σ).
Note that the mean and variance of x under B(µ + σ, µ − σ) are µ and σ 2 respectively. So
we have to solve the problem:

$$\min_{t \in \mathbb{R}} f_{x \sim B(\mu+\sigma,\mu-\sigma)}(t) = \min_{t \in \mathbb{R}} E_{x \sim B(\mu+\sigma,\mu-\sigma)}[e^{tx}] = \min_{t \in \mathbb{R}} 0.5(e^{(\mu+\sigma)t} + e^{(\mu-\sigma)t})$$


fx∼B(µ+σ,µ−σ) (t) = 0.5((µ + σ) · e(µ+σ)t + (µ − σ) · e(µ−σ)t )
Note that unless µ ∈ open interval (−σ, σ) (i.e., absolute value of mean is less than standard

deviation), fx∼B(µ+σ,µ−σ) (t) will not be 0 for any value of t. Therefore, for this minimiza-
tion to be non-trivial, we will henceforth assume µ ∈ (−σ, σ). With this assumption in

place, setting fx∼B(µ+σ,µ−σ) (t) to 0 yields:
∗ ∗
(µ + σ) · e(µ+σ)t + (µ − σ) · e(µ−σ)t = 0

507
which leads to:
1 σ−µ
t∗ = ln ( )
2σ µ+σ
Note that
′′
fx∼B(µ+σ,µ−σ) (t) = 0.5((µ + σ)2 · e(µ+σ)t + (µ − σ)2 · e(µ−σ)t ) > 0 for all t ∈ R

which confirms that t∗ is a minima.


Substituting t = t∗ in fx∼B(µ+σ,µ−σ) (t) = 0.5(e(µ+σ)t + e(µ−σ)t ) yields:

∗ ∗ σ − µ µ+σ σ − µ µ−σ
min fx∼B(µ+σ,µ−σ) (t) = 0.5(e(µ+σ)t + e(µ−σ)t ) = 0.5(( ) 2σ + ( ) 2σ )
t∈R µ+σ µ+σ
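And a similar numerical cross-check for the binary case (again a sketch, with arbitrary illustrative values satisfying $|\mu| < \sigma$):

import numpy as np
from scipy.optimize import minimize_scalar

mu, sigma = 0.3, 1.0                      # illustrative values with |mu| < sigma
f = lambda t: 0.5 * (np.exp((mu + sigma) * t) + np.exp((mu - sigma) * t))

res = minimize_scalar(f)
t_star = np.log((sigma - mu) / (mu + sigma)) / (2 * sigma)
print(res.x, t_star)                      # numerical minimizer vs. closed-form t*
print(res.fun, f(t_star))                 # the minimum values agree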

B. Portfolio Theory
In this Appendix, we provide a quick and terse introduction to Portfolio Theory. While this
topic is not a direct pre-requisite for the topics we cover in the chapters, we believe one
should have some familiarity with the risk versus reward considerations when construct-
ing portfolios of financial assets, and know of the important results. To keep this Appendix
brief, we will provide the minimal content required to understand the essence of the key
concepts. We won’t be doing rigorous proofs. We will also ignore details pertaining to
edge-case/irregular-case conditions so as to focus on the core concepts.

B.1. Setting and Notation


In this section, we go over the core setting of Portfolio Theory, along with the requisite
notation.
Assume there are n assets in the economy and that their mean returns are represented
in a column vector R ∈ Rn . We denote the covariance of returns of the n assets by an n × n
non-singular matrix V .
We consider arbitrary portfolios p comprised of investment quantities in these n assets
that are normalized to sum up to 1. Denoting column vector Xp ∈ Rn as the investment
quantities in the n assets for portfolio p, we can write this normalization of the investment quantities in vector notation as:

$$X_p^T \cdot 1_n = 1$$

where $1_n \in \mathbb{R}^n$ is a column vector consisting of all 1's.
We shall drop the subscript p in Xp whenever the reference to portfolio p is clear.

B.2. Portfolio Returns


• A single portfolio’s mean return is X T · R ∈ R.
• A single portfolio’s variance of return is the quadratic form X T · V · X ∈ R.
• Covariance between portfolios p and q is the bilinear form XpT · V · Xq ∈ R.
• Covariance of the n assets with a single portfolio is the vector V · X ∈ Rn .

B.3. Derivation of Efficient Frontier Curve


An asset which has no variance in terms of how its value evolves in time is known as a
risk-free asset. The Efficient Frontier is defined for a world with no risk-free assets. The
Efficient Frontier is the set of portfolios with minimum variance of return for each level
of portfolio mean return (we refer to a portfolio in the Efficient Frontier as an Efficient
Portfolio). Hence, to determine the Efficient Frontier, we solve for X so as to minimize
portfolio variance X T · V · X subject to constraints:

X T · 1n = 1

X T · R = rp
where rp is the mean return for Efficient Portfolio p. We set up the Lagrangian and solve
to express X in terms of R, V, rp . Substituting for X gives us the efficient frontier parabola
of Efficient Portfolio Variance $\sigma_p^2$ as a function of its mean $r_p$:

$$\sigma_p^2 = \frac{a - 2br_p + cr_p^2}{ac - b^2}$$
where

• $a = R^T \cdot V^{-1} \cdot R$
• $b = R^T \cdot V^{-1} \cdot 1_n$
• $c = 1_n^T \cdot V^{-1} \cdot 1_n$

B.4. Global Minimum Variance Portfolio (GMVP)


The global minimum variance portfolio (GMVP) is the portfolio at the tip of the efficient
frontier parabola, i.e., the portfolio with the lowest possible variance among all portfolios
on the Efficient Frontier. Here are the relevant characteristics for the GMVP:

• It has mean $r_0 = \frac{b}{c}$.
• It has variance $\sigma_0^2 = \frac{1}{c}$.
• It has investment proportions $X_0 = \frac{V^{-1} \cdot 1_n}{c}$.

GMVP is positively correlated with all portfolios and with all assets. GMVP's covariance with all portfolios and with all assets is a constant value equal to $\sigma_0^2 = \frac{1}{c}$ (which is also equal to its own variance).
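To make these quantities concrete, here is a small numerical sketch (all inputs are randomly generated and purely hypothetical) that computes $a, b, c$, the frontier variance at a chosen mean return, and the GMVP quantities, and checks the constant-covariance property just stated:

import numpy as np

# Hypothetical inputs: random mean returns R and a random positive-definite covariance V
rng = np.random.default_rng(42)
n = 5
R = rng.uniform(0.02, 0.10, n)
M = rng.normal(size=(n, n))
V = M @ M.T + n * np.eye(n)
ones = np.ones(n)
V_inv = np.linalg.inv(V)

a = R @ V_inv @ R
b = R @ V_inv @ ones
c = ones @ V_inv @ ones

r_p = 0.06                                                       # a target mean return
sigma2_p = (a - 2 * b * r_p + c * r_p ** 2) / (a * c - b ** 2)   # efficient-frontier variance at r_p

r_0, sigma2_0 = b / c, 1.0 / c                                   # GMVP mean and variance
X_0 = V_inv @ ones / c                                           # GMVP investment proportions

# GMVP's covariance with any normalized portfolio equals 1/c (its own variance)
X_q = rng.uniform(size=n)
X_q /= X_q.sum()                                                 # an arbitrary normalized portfolio
print(X_0 @ V @ X_q, sigma2_0)                                   # both print 1/c
print(sigma2_p >= sigma2_0)                                      # GMVP has the lowest frontier variance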

B.5. Orthogonal Efficient Portfolios


For every efficient portfolio p (other than GMVP), there exists a unique orthogonal efficient
portfolio z (i.e. Covariance(p, z) = 0) with finite mean

$$r_z = \frac{a - br_p}{b - cr_p}$$

z always lies on the opposite side of p on the (efficient frontier) parabola. If we treat
the Efficient Frontier as a curve of mean (y-axis) versus variance (x-axis), the straight line
from p to GMVP intersects the mean axis (y-axis) at rz . If we treat the Efficient Frontier
as a curve of mean (y-axis) versus standard deviation (x-axis), the tangent to the efficient
frontier at p intersects the mean axis (y-axis) at rz . Moreover, all portfolios on one side of
the efficient frontier are positively correlated with each other.

B.6. Two-fund Theorem


The X vector (normalized investment quantities in assets) of any efficient portfolio is a
linear combination of the X vectors of two other efficient portfolios. Notationally,

Xp = αXp1 + (1 − α)Xp2 for some scalar α

Figure B.1.: Efficient Frontier for 16 Assets

Varying α from −∞ to +∞ basically traces the entire efficient frontier. So to construct all
efficient portfolios, we just need to identify two canonical efficient portfolios. One of them
is GMVP. The other is a portfolio we call Special Efficient Portfolio (SEP) with:

• Mean $r_1 = \frac{a}{b}$.
• Variance $\sigma_1^2 = \frac{a}{b^2}$.
• Investment proportions $X_1 = \frac{V^{-1} \cdot R}{b}$.

The orthogonal portfolio to SEP has mean $r_z = \frac{a - b \cdot \frac{a}{b}}{b - c \cdot \frac{a}{b}} = 0$.

B.7. An example of the Efficient Frontier for 16 assets


Figure B.1 shows a plot of the mean daily returns versus the standard deviation of daily
returns collected over a 3-year period for 16 assets. The curve is the Efficient Frontier for
these 16 assets. Note the special portfolios GMVP and SEP on the Efficient Frontier. This
curve was generated from the code at rl/appendix2/efficient_frontier.py. We encourage
you to play with different choices (and count) of assets, and to also experiment with dif-
ferent time ranges as well as to try weekly and monthly returns.

B.8. CAPM: Linearity of Covariance Vector w.r.t. Mean Returns


Important Theorem: The covariance vector of individual assets with a portfolio (note:
covariance vector = V · X ∈ Rn ) can be expressed as an exact linear function of the in-
dividual assets’ mean returns vector if and only if the portfolio is efficient. If the efficient
portfolio is p (and its orthogonal portfolio z), then:
$$R = r_z 1_n + \frac{r_p - r_z}{\sigma_p^2}(V \cdot X_p) = r_z 1_n + (r_p - r_z)\beta_p$$

where $\beta_p = \frac{V \cdot X_p}{\sigma_p^2} \in \mathbb{R}^n$ is the vector of slope coefficients of regressions where the explanatory variable is the portfolio mean return $r_p \in \mathbb{R}$ and the $n$ dependent variables are the asset mean returns $R \in \mathbb{R}^n$.
The linearity of βp w.r.t. mean returns R is famously known as the Capital Asset Pricing
Model (CAPM).
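Here is a small self-contained numerical sketch of this theorem (random, purely hypothetical inputs; the efficient weights below are obtained from the first-order condition $2VX = \lambda 1_n + \gamma R$ of the minimization in Section B.3, with $\lambda, \gamma$ chosen to satisfy the two constraints):

import numpy as np

rng = np.random.default_rng(7)
n = 5
R = rng.uniform(0.02, 0.10, n)               # hypothetical mean returns
M = rng.normal(size=(n, n))
V = M @ M.T + n * np.eye(n)                  # hypothetical positive-definite covariance
ones = np.ones(n)
V_inv = np.linalg.inv(V)
a, b, c = R @ V_inv @ R, R @ V_inv @ ones, ones @ V_inv @ ones

r_p = 0.07                                   # target mean return for an efficient portfolio p
# Efficient weights X_p = V^{-1}(lam * 1_n + gam * R), with (lam, gam) set by the constraints
lam, gam = np.linalg.solve(np.array([[c, b], [b, a]]), np.array([1.0, r_p]))
X_p = V_inv @ (lam * ones + gam * R)
sigma2_p = X_p @ V @ X_p
r_z = (a - b * r_p) / (b - c * r_p)          # mean of the orthogonal efficient portfolio z

beta_p = V @ X_p / sigma2_p
print(np.allclose(R, r_z + (r_p - r_z) * beta_p))   # True: the CAPM identity holds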

B.9. Useful Corollaries of CAPM


• If $p$ is SEP, $r_z = 0$, which would mean:
$$R = r_p \beta_p = \frac{r_p}{\sigma_p^2} \cdot V \cdot X_p$$

• So, in this case, covariance vector V ·Xp and βp are just scalar multiples of asset mean
vector.
• The investment proportion X in a given individual asset changes monotonically
along the efficient frontier.
• Covariance V · X is also monotonic along the efficient frontier.
• But β is not monotonic, which means that for every individual asset, there is a unique
pair of efficient portfolios that result in maximum and minimum βs for that asset.

B.10. Cross-Sectional Variance


• The cross-sectional variance in βs (variance in βs across assets for a fixed efficient
portfolio) is zero when the efficient portfolio is GMVP and is also zero when the
efficient portfolio has infinite mean.
• The cross-sectional variance in $\beta$s is maximum for the two efficient portfolios with means $r_0 + \sigma_0^2\sqrt{|A|}$ and $r_0 - \sigma_0^2\sqrt{|A|}$, where $A$ is the $2 \times 2$ symmetric matrix consisting of $a, b, b, c$.
• These two portfolios lie symmetrically on opposite sides of the efficient frontier (their $\beta$s are equal and of opposite signs), and are the only two orthogonal efficient portfolios with the same variance ($= 2\sigma_0^2$).

B.11. Efficient Set with a Risk-Free Asset


If we have a risk-free asset with return rF , then V is singular. So we first form the Efficient
Frontier without the risk-free asset. The Efficient Set (including the risk-free asset) is de-
fined as the tangent to this Efficient Frontier (without the risk-free asset) from the point
(0, rF ) when the Efficient Frontier is considered to be a curve of mean returns (y-axis)
against standard deviation of returns (x-axis).
Let’s say the tangent touches the Efficient Frontier at the point (Portfolio) T and let it’s
return be rT . Then:

• If rF < r0 , rT > rF .
• If rF > r0 , rT < rF .
• All portfolios on this efficient set are perfectly correlated.

C. Introduction to and Overview of
Stochastic Calculus Basics
In this Appendix, we provide a quick introduction to the Basics of Stochastic Calculus. To
be clear, Stochastic Calculus is a vast topic requiring an entire graduate-level course to de-
velop a good understanding. We shall only be scratching the surface of Stochastic Calculus
and even with the very basics of this subject, we will focus more on intuition than rigor,
and familiarize you with just the most important results relevant to this book. For an ade-
quate treatment of Stochastic Calculus relevant to Finance, we recommend Steven Shreve’s
two-volume discourse Stochastic Calculus for Finance I (Shreve 2003) and Stochastic Cal-
culus for Finance II (Shreve 2004). For a broader treatment of Stochastic Calculus, we
recommend Bernt Oksendal’s book on Stochastic Differential Equations (Øksendal 2003).

C.1. Simple Random Walk


The best way to get started with Stochastic Calculus is to first get familiar with key proper-
ties of a simple random walk viewed as a discrete-time, countable state-space, time-homogeneous
Markov Process. The state space is the set of integers Z. Denoting the random state at time
t as Zt , the state transitions are defined in terms of the independent and identically dis-
tributed (i.i.d.) random variables Yt for all t = 0, 1, . . .

Zt+1 = Zt + Yt and P[Yt = 1] = P[Yt = −1] = 0.5 for all t = 0, 1, . . .


A quick point on notation: We refer to the random state at time t as Zt (i.e., as a random
variable at time t), whereas we refer to the Markov Process for this simple random walk
as Z (i.e., without any subscript).
Since the random variables $\{Y_t | t = 0, 1, \ldots\}$ are i.i.d., the increments $Z_{t_{i+1}} - Z_{t_i}$ (for $i = 0, 1, \ldots, n-1$) in the random walk states for any set of time steps $t_0 < t_1 < \ldots < t_n$ have the following properties:

• Independent Increments: Increments $Z_{t_1} - Z_{t_0}, Z_{t_2} - Z_{t_1}, \ldots, Z_{t_n} - Z_{t_{n-1}}$ are independent of each other.
• Martingale (i.e., Zero-Drift) Property: Expected Value of any Increment is 0.

$$E[Z_{t_{i+1}} - Z_{t_i}] = \sum_{j=t_i}^{t_{i+1}-1} E[Z_{j+1} - Z_j] = 0 \text{ for all } i = 0, 1, \ldots, n-1$$

• Variance of any Increment equals Time Steps of the Increment:

$$E[(Z_{t_{i+1}} - Z_{t_i})^2] = E[(\sum_{j=t_i}^{t_{i+1}-1} Y_j)^2] = \sum_{j=t_i}^{t_{i+1}-1} E[Y_j^2] + 2\sum_{j=t_i}^{t_{i+1}-1} \sum_{k=j+1}^{t_{i+1}-1} E[Y_j] \cdot E[Y_k] = t_{i+1} - t_i$$

for all $i = 0, 1, \ldots, n-1$.

Moreover, we have an important property that Quadratic Variation equals Time Steps.
Quadratic Variation over the time interval $[t_i, t_{i+1}]$ for all $i = 0, 1, \ldots, n-1$ is defined as:

$$\sum_{j=t_i}^{t_{i+1}-1} (Z_{j+1} - Z_j)^2$$

Since $(Z_{j+1} - Z_j)^2 = Y_j^2 = 1$ for all $j = t_i, t_i+1, \ldots, t_{i+1}-1$, the Quadratic Variation

$$\sum_{j=t_i}^{t_{i+1}-1} (Z_{j+1} - Z_j)^2 = t_{i+1} - t_i \text{ for all } i = 0, 1, \ldots, n-1$$

It pays to emphasize the important conceptual difference between the Variance of Incre-
ment property and the Quadratic Variation property. The Variance of Increment property
is a statement about the expectation of the square of the Zti+1 − Zti increment whereas the
Quadratic Variation property is a statement of certainty (note: there is no E[· · · ] in this
statement) about the sum of squares of atomic increments Yj over the discrete-steps time-
interval [ti , ti+1 ]. The Quadratic Variation property owes to the fact that P[Yt2 = 1] = 1 for
all t = 0, 1, . . ..
We can view the Quadratic Variations of a Process $X$ over all discrete-step time intervals $[0, t]$ as a Process denoted $[X]$, defined as:

$$[X]_t = \sum_{j=0}^{t-1} (X_{j+1} - X_j)^2$$

Thus, for the simple random walk Markov Process $Z$, we have the succinct formula $[Z]_t = t$ for all $t$ (i.e., this Quadratic Variation process is a deterministic process).
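These properties are easy to see empirically; here is a small simulation sketch (illustrative only) of the simple random walk, checking the quadratic variation and the increment variance:

import numpy as np

rng = np.random.default_rng(0)
T = 10_000
Y = rng.choice([1, -1], size=T)              # i.i.d. increments with P[+1] = P[-1] = 0.5
Z = np.concatenate(([0], np.cumsum(Y)))      # random walk states Z_0, Z_1, ..., Z_T

print(np.sum(np.diff(Z) ** 2))               # quadratic variation [Z]_T: exactly T on every run
print(Z[-1])                                 # increment Z_T - Z_0: random, mean 0, variance T

final = np.cumsum(rng.choice([1, -1], size=(5_000, T)), axis=1)[:, -1]
print(final.mean(), final.var())             # across many paths: mean ~ 0, variance ~ T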

C.2. Brownian Motion as Scaled Random Walk


Now let us take our simple random walk process Z, and simultaneously A) speed up time
and B) scale down the size of the atomic increments Yt . Specifically, define for any fixed
positive integer n:

$$z_t^{(n)} = \frac{1}{\sqrt{n}} \cdot Z_{nt} \text{ for all } t = 0, \frac{1}{n}, \frac{2}{n}, \ldots$$

It's easy to show that the above properties of the simple random walk process hold for the $z^{(n)}$ process as well. Now consider the continuous-time process $z$ defined as:

$$z_t = \lim_{n \to \infty} z_t^{(n)} \text{ for all } t \in \mathbb{R}_{\geq 0}$$

This continuous-time process z with z0 = 0 is known as standard Brownian Motion. z


retains the same properties as those of the simple random walk process that we have listed
above (independent increments, martingale, increment variance equal to time interval,
and quadratic variation equal to time interval). Also, by Central Limit Theorem,

zt |zs ∼ N (zs , t − s) for any 0 ≤ s < t


We denote dzt as the increment in z over the infinitesimal time interval [t, t + dt].

dzt ∼ N (0, dt)

C.3. Continuous-Time Stochastic Processes
Brownian motion z is our first example of a Continuous-Time Stochastic Process. Now let
us define a general continuous-time stochastic process, although for the sake of simplic-
ity, we shall restrict ourselves to one-dimensional real-valued continuous-time stochastic
processes.

Definition C.3.1. A One-dimensional Real-Valued Continuous-Time Stochastic Process denoted


X is defined as a collection of real-valued random variables {Xt |t ∈ [0, T ]} (for some fixed
T ∈ R, with index t interpreted as continuous-time) defined on a common probability
space (Ω, F, P), where Ω is a sample space, F is a σ-algebra and P is a probability measure
(so, Xt : Ω → R for each t ∈ [0, T ]).

We can view a stochastic process X as an R-valued function of two variables:

• t ∈ [0, T ]
• ω∈Ω

As a two-variable function, if we fix t, then we get the random variable Xt : Ω → R for


time t and if we fix ω, then we get a single R-valued outcome for each random variable
across time (giving us a sample trace across time, denoted X(ω)).
Now let us come back to Brownian motion, viewed as a Continuous-Time Stochastic
Process.

C.4. Properties of Brownian Motion sample traces


• Sample traces $z(\omega)$ of Brownian motion $z$ are continuous.
• Sample traces $z(\omega)$ are almost always non-differentiable, meaning the random variable
$$\lim_{h \to 0} \frac{z_{t+h} - z_t}{h}$$
is almost always infinite. The intuition is that $\frac{z_{t+h} - z_t}{h}$ has a standard deviation of $\frac{1}{\sqrt{h}}$, which goes to $\infty$ as $h$ goes to 0.
• Sample traces $z(\omega)$ have infinite total variation, meaning the random variable
$$\int_S^T |dz_t| = \infty \text{ (almost always)}$$

The quadratic variation property can be expressed as:

$$\int_S^T (dz_t)^2 = T - S$$

This means each sample random trace of brownian motion has quadratic variation equal
to the time interval of the trace. The quadratic variation of z expressed as a process [z] has
the deterministic value of t at time t. Expressed in infinitesimal terms, we say that:

(dzt )2 = dt
This formula generalizes to:

$$(dz_t^{(1)}) \cdot (dz_t^{(2)}) = \rho \cdot dt$$

where $z^{(1)}$ and $z^{(2)}$ are two different Brownian motions with correlation between the random variables $z_t^{(1)}$ and $z_t^{(2)}$ equal to $\rho$ for all $t > 0$.
You should intuitively interpret the formula $(dz_t)^2 = dt$ (and its generalization) as
a deterministic statement, and in fact this statement is used as an algebraic convenience
in Brownian motion-based stochastic calculus, forming the core of Ito Isometry and Ito’s
Lemma (which we cover shortly, but first we need to define the Ito Integral).

C.5. Ito Integral


We want to define a stochastic process Y from a stochastic process X as follows:
$$Y_t = \int_0^t X_s \cdot dz_s$$

In the interest of focusing on intuition rather than rigor, we skip the technical details
of filtrations and adapted processes that make the above integral sensible. Instead, we
simply say that this integral makes sense only if random variable Xs for any time s is
disallowed from depending on $z_{s'}$ for any $s' > s$ (i.e., the stochastic process $X$ cannot peek into the future) and that the time-integral $\int_0^t X_s^2 \cdot ds$ is finite for all $t \geq 0$. So we shall roll
forward with the assumption that the stochastic process Y is defined as the above-specified
integral (known as the Ito Integral) of a stochastic process X with respect to Brownian
motion. The equivalent notation is:

dYt = Xt · dzt

We state without proof the following properties of the Ito Integral stochastic process Y :

• $Y$ is a martingale, i.e., $E[(Y_t - Y_s)|Y_s] = 0$ (i.e., $E[Y_t|Y_s] = Y_s$) for all $0 \leq s < t$.
• Ito Isometry: $E[Y_t^2] = \int_0^t E[X_s^2] \cdot ds$.
• Quadratic Variance formula: $[Y]_t = \int_0^t X_s^2 \cdot ds$.

Note that we have generalized the notation $[X]$ for discrete-time processes to continuous-time processes, defined as $[X]_t = \int_0^t (dX_s)^2$ for any continuous-time stochastic process. Ito Isometry generalizes to:

$$E[(\int_S^T X_t^{(1)} \cdot dz_t^{(1)})(\int_S^T X_t^{(2)} \cdot dz_t^{(2)})] = \int_S^T E[X_t^{(1)} \cdot X_t^{(2)}] \cdot \rho \cdot dt$$

where X (1) and X (2) are two different stochastic processes, and z (1) and z (2) are two
(1) (2)
different brownian motions with correlation between the random variables zt and zt
equal to ρ for all t > 0.
Likewise, the Quadratic Variation formula generalizes to:

∫_S^T (Xt^(1) · dzt^(1)) · (Xt^(2) · dzt^(2)) = ∫_S^T Xt^(1) · Xt^(2) · ρ · dt
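As a quick Monte Carlo check of the Ito Isometry stated above (a standalone sketch with the illustrative choice Xs = zs, not code from the book’s repository): for Xs = zs we have E[Yt²] = ∫_0^t E[zs²] · ds = ∫_0^t s · ds = t²/2, and a left-endpoint discretization of the Ito Integral reproduces this value.

import numpy as np

rng = np.random.default_rng(0)
t, steps, paths = 1.0, 500, 20_000
dt = t / steps
dz = rng.normal(0.0, np.sqrt(dt), size=(paths, steps))
z_left = np.cumsum(dz, axis=1) - dz   # z at the left endpoint of each interval (z_0 = 0)
y_t = np.sum(z_left * dz, axis=1)     # Ito Integral of z dz, approximated as a left-endpoint sum
print(np.mean(y_t ** 2), t ** 2 / 2)  # the two numbers should be close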

C.6. Ito’s Lemma
We can extend the above Ito Integral to an Ito process Y as defined below:

dYt = µt · dt + σt · dzt

We require the same conditions for the stochastic process σ as we required above for X
in the definition of the Ito Integral. Moreover, we require that ∫_0^t |µs| · ds is finite for all t ≥ 0.
In the context of this Ito process Y described above, we refer to µ as the drift process and
we refer to σ as the dispersion process.
Now, consider a twice-differentiable function f : [0, T ] × R → R. We define a stochastic
process whose (random) value at time t is f(t, Yt). Let’s write its Taylor series with respect
to the variables t and Yt .

df(t, Yt) = ∂f(t, Yt)/∂t · dt + ∂f(t, Yt)/∂Yt · dYt + (1/2) · ∂²f(t, Yt)/∂Yt² · (dYt)² + . . .

Substituting for dYt and lightening notation, we get:

df(t, Yt) = ∂f/∂t · dt + ∂f/∂Yt · (µt · dt + σt · dzt) + (1/2) · ∂²f/∂Yt² · (µt · dt + σt · dzt)² + . . .

Next, we use the rules: (dt)2 = 0, dt · dzt = 0, (dzt )2 = dt to get Ito’s Lemma:

df(t, Yt) = (∂f/∂t + µt · ∂f/∂Yt + (σt²/2) · ∂²f/∂Yt²) · dt + σt · ∂f/∂Yt · dzt    (C.1)

Ito’s Lemma describes the stochastic process of a function (f ) of an Ito Process (Y )


in terms of the partial derivatives of f , and in terms of the drift (µ) and dispersion (σ)
processes that define Y .
If we generalize Y to be an n-dimensional stochastic process (as a column vector) with
µt as an n-dimensional (stochastic) column vector, σt as an n × m (stochastic) matrix, and
zt as an m-dimensional vector of m independent standard brownian motions (as follows):

dYt = µt · dt + σt · dzt

then we get the multi-variate version of Ito’s Lemma, as follows:

df(t, Yt) = (∂f/∂t + (∇Y f)^T · µt + (1/2) · Tr[σt^T · (∆Y f) · σt]) · dt + (∇Y f)^T · σt · dzt    (C.2)

where the symbol ∇ represents the gradient of a function, the symbol ∆ represents the
Hessian of a function, and the symbol T r represents the Trace of a matrix.
Next, we cover two common Ito processes, and use Ito’s Lemma to solve the Stochastic
Differential Equation represented by these Ito Processes:

C.7. A Lognormal Process
Consider a stochastic process x described in the form of the following Ito process:

dxt = µ(t) · xt · dt + σ(t) · xt · dzt

Note that here z is standard (one-dimensional) Brownian motion, and µ, σ are deter-
ministic functions of time t. This is solved easily by defining an appropriate function of xt
and applying Ito’s Lemma, as follows:

yt = log(xt )

Applying Ito’s Lemma on yt with respect to xt , we get:

dyt = (µ(t) · xt · (1/xt) − (σ²(t) · xt²/2) · (1/xt²)) · dt + σ(t) · xt · (1/xt) · dzt
    = (µ(t) − σ²(t)/2) · dt + σ(t) · dzt

So,
yT = yS + ∫_S^T (µ(t) − σ²(t)/2) · dt + ∫_S^T σ(t) · dzt

xT = xS · e^{∫_S^T (µ(t) − σ²(t)/2) · dt + ∫_S^T σ(t) · dzt}

xT |xS follows a lognormal distribution, i.e.,


yT = log(xT) ∼ N(log(xS) + ∫_S^T (µ(t) − σ²(t)/2) · dt, ∫_S^T σ²(t) · dt)

E[xT|xS] = xS · e^{∫_S^T µ(t)·dt}

E[xT²|xS] = xS² · e^{∫_S^T (2µ(t) + σ²(t))·dt}

Variance[xT|xS] = E[xT²|xS] − (E[xT|xS])² = xS² · e^{∫_S^T 2µ(t)·dt} · (e^{∫_S^T σ²(t)·dt} − 1)

The special case of µ(t) = µ (constant) and σ(t) = σ (constant) is a very common Ito
process used all over Finance/Economics (for its simplicity, tractability as well as practi-
cality), and is known as Geometric Brownian Motion, to reflect the fact that the stochastic
increment of the process (σ · xt · dzt ) is multiplicative to the level of the process xt . If we
consider this special case, we get:

yT = log(xT) ∼ N(log(xS) + (µ − σ²/2) · (T − S), σ² · (T − S))

E[xT|xS] = xS · e^{µ·(T−S)}

Variance[xT|xS] = xS² · e^{2µ·(T−S)} · (e^{σ²·(T−S)} − 1)
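As a sanity check of these Geometric Brownian Motion formulas (a standalone sketch with illustrative parameter values, not code from the book’s repository), we can simulate xT using the exact solution xT = xS · e^{(µ − σ²/2)·(T−S) + σ·(zT − zS)} and compare the sample mean and variance with the expressions above:

import numpy as np

rng = np.random.default_rng(0)
x_s, mu, sigma, horizon, paths = 100.0, 0.05, 0.2, 2.0, 1_000_000  # horizon = T - S
z = rng.normal(0.0, np.sqrt(horizon), size=paths)                  # z_T - z_S
x_t = x_s * np.exp((mu - sigma ** 2 / 2) * horizon + sigma * z)

print(np.mean(x_t), x_s * np.exp(mu * horizon))  # sample mean vs E[x_T | x_S]
print(np.var(x_t),
      x_s ** 2 * np.exp(2 * mu * horizon) * (np.exp(sigma ** 2 * horizon) - 1))  # vs Variance[x_T | x_S]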

C.8. A Mean-Reverting Process
Now we consider a stochastic process x described in the form of the following Ito process:

dxt = µ(t) · xt · dt + σ(t) · dzt


As in the process of the previous section, z is standard (one-dimensional) Brownian
motion, and µ, σ are deterministic functions of time t. This is solved easily by defining an
appropriate function of xt and applying Ito’s Lemma, as follows:
yt = xt · e^{−∫_0^t µ(u)·du}

Applying Ito’s Lemma on yt with respect to xt , we get:

dyt = (−xt · µ(t) · e^{−∫_0^t µ(u)·du} + µ(t) · xt · e^{−∫_0^t µ(u)·du}) · dt + σ(t) · e^{−∫_0^t µ(u)·du} · dzt
    = σ(t) · e^{−∫_0^t µ(u)·du} · dzt

So the process y is a martingale. Using Ito Isometry, we get:


yT ∼ N(yS, ∫_S^T σ²(t) · e^{−∫_0^t 2µ(u)·du} · dt)

Therefore,

xT ∼ N(xS · e^{∫_S^T µ(t)·dt}, e^{∫_0^T 2µ(t)·dt} · ∫_S^T σ²(t) · e^{−∫_0^t 2µ(u)·du} · dt)
We call this process “mean-reverting” because with negative µ(t), the process is “pulled”
to a baseline level of 0, at a speed whose expectation is proportional to −µ(t) and propor-
tional to the distance from the baseline (so we say the process reverts to a baseline of 0
and the strength of mean-reversion is greater if the distance from the baseline is greater).
If µ(t) is positive, then we say that the process is “mean-diverting” to signify that it gets
pulled away from the baseline level of 0.
The special case of µ(t) = µ (constant) and σ(t) = σ (constant) is a fairly common Ito
process (again for its simplicity, tractability as well as practicality), and is known as the
Ornstein-Uhlenbeck Process with the mean (baseline) level set to 0. If we consider this
special case, we get:

xT ∼ N(xS · e^{µ·(T−S)}, (σ²/(2µ)) · (e^{2µ·(T−S)} − 1))
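Here is an analogous standalone simulation sketch of this special case (constant µ < 0 and constant σ, with illustrative parameter values, not code from the book’s repository), using a simple Euler-Maruyama discretization and checking the sample mean and variance of xT against the Normal distribution stated above:

import numpy as np

rng = np.random.default_rng(0)
x_s, mu, sigma, horizon = 1.0, -2.0, 0.5, 1.0   # horizon = T - S
steps, paths = 1_000, 100_000
dt = horizon / steps

x = np.full(paths, x_s)
for _ in range(steps):
    # Euler-Maruyama step for dx = mu * x * dt + sigma * dz
    x = x + mu * x * dt + sigma * rng.normal(0.0, np.sqrt(dt), size=paths)

print(np.mean(x), x_s * np.exp(mu * horizon))                             # mean of x_T
print(np.var(x), sigma ** 2 / (2 * mu) * (np.exp(2 * mu * horizon) - 1))  # variance of x_T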

D. The Hamilton-Jacobi-Bellman (HJB)
Equation
In this Appendix, we provide a quick coverage of the Hamilton-Jacobi-Bellman (HJB)
Equation, which is the continuous-time version of the Bellman Optimality Equation. Al-
though much of this book covers Markov Decision Processes in a discrete-time setting, we
do cover some classical Mathematical Finance Stochastic Control formulations in continuous-
time. To understand these formulations, one must first understand the HJB Equation,
which is the purpose of this Appendix. As is the norm in the Appendices in this book,
we will compromise on some of the rigor and emphasize the intuition to develop basic
familiarity with HJB.

D.1. HJB as a continuous-time version of Bellman Optimality Equation
In order to develop the continuous-time setting, we shall consider a (not necessarily time-
homogeneous) process where the set of states at time t is denoted as St and the set of allowable actions for each state at time t is denoted as At. Since time is continuous,
Rewards are represented as a Reward Rate function R such that for any state st ∈ St and
for any action at ∈ At , R(t, st , at ) · dt is the Expected Reward in the time interval (t, t + dt],
conditional on state st and action at (note the functional dependency of R on t since we
will be integrating R over time). Instead of the discount factor γ as in the case of discrete-
time MDPs, here we employ a discount rate (akin to interest-rate discounting) ρ ∈ R≥0 so
that the discount factor over any time interval (t, t + dt] is e−ρ·dt .
We denote the Optimal Value Function as V ∗ such that the Optimal Value for state st ∈ St
at time t is V ∗ (t, st ). Note that unlike Section 3.13 in Chapter 3 where we denoted the Op-
timal Value Function as a time-indexed sequence Vt∗ (st ), here we make t an explicit func-
tional argument of V ∗ . This is because in the continuous-time setting, we are interested in
the time-differential of the Optimal Value Function.
Now let us write the Bellman Optimality Equation in its continuous-time version, i.e.,
let us consider the process V ∗ over the time interval [t, t + dt] as follows:

V∗(t, st) = max_{at∈At} {R(t, st, at) · dt + E(t,st,at)[e^{−ρ·dt} · V∗(t + dt, st+dt)]}

Multiplying throughout by e−ρt and re-arranging, we get:

max_{at∈At} {e^{−ρt} · R(t, st, at) · dt + E(t,st,at)[e^{−ρ(t+dt)} · V∗(t + dt, st+dt) − e^{−ρt} · V∗(t, st)]} = 0

⇒ max_{at∈At} {e^{−ρt} · R(t, st, at) · dt + E(t,st,at)[d{e^{−ρt} · V∗(t, st)}]} = 0

⇒ max_{at∈At} {e^{−ρt} · R(t, st, at) · dt + E(t,st,at)[e^{−ρt} · (dV∗(t, st) − ρ · V∗(t, st) · dt)]} = 0

Multiplying throughout by eρt and re-arranging, we get:

ρ · V∗(t, st) · dt = max_{at∈At} {E(t,st,at)[dV∗(t, st)] + R(t, st, at) · dt}    (D.1)

For a finite-horizon problem terminating at time T , the above equation is subject to ter-
minal condition:
V ∗ (T, sT ) = T (sT )
for some terminal reward function T (·).
Equation (D.1) is known as the Hamilton-Jacobi-Bellman Equation - the continuous-
time analog of the Bellman Optimality Equation. In the literature, it is often written in a
more compact form that essentially takes the above form and “divides throughout by dt.”
This requires a few technical details involving the stochastic differentiation operator. To
keep things simple, we shall stick to the HJB formulation of Equation (D.1).

D.2. HJB with State Transitions as an Ito Process


Although we have expressed the HJB Equation for V ∗ , we cannot do anything useful with
it unless we know the state transition probabilities (all of which are buried inside the calcu-
lation of E(t,st ,at ) [·] in the HJB Equation). In continuous-time, the state transition probabil-
ities are modeled as a stochastic process for states (or of its features). Let us assume that states are real-valued vectors, i.e., state st ∈ Rn at any time t ≥ 0 and that the transitions
for s are given by an Ito process, as follows:

dst = µ(t, st , at ) · dt + σ(t, st , at ) · dzt


where the function µ (drift function) gives an Rn-valued process, the function σ (dispersion function) gives an Rn×m-valued process, and z is an m-dimensional process consisting of m independent standard Brownian motions.
Now we can apply multivariate Ito’s Lemma (Equation (C.2) from Appendix C) for V ∗
as a function of t and st (we lighten notation by writing µt and σt instead of µ(t, st , at )
and σ(t, st , at )):

dV∗(t, st) = (∂V∗/∂t + (∇s V∗)^T · µt + (1/2) · Tr[σt^T · (∆s V∗) · σt]) · dt + (∇s V∗)^T · σt · dzt
Substituting this expression for dV ∗ (t, st ) in Equation (D.1), noting that

E(t,st ,at ) [(∇s V ∗ )T · σt · dzt ] = 0

and dividing throughout by dt, we get:

ρ · V∗(t, st) = max_{at∈At} {∂V∗/∂t + (∇s V∗)^T · µt + (1/2) · Tr[σt^T · (∆s V∗) · σt] + R(t, st, at)}    (D.2)

For a finite-horizon problem terminating at time T , the above equation is subject to ter-
minal condition:
V ∗ (T, sT ) = T (sT )
for some terminal reward function T (·).

E. Black-Scholes Equation and its Solution
for Call/Put Options
In this Appendix, we sketch the derivation of the much-celebrated Black-Scholes equation
and its solution for Call and Put Options (Black and Scholes 1973). As is the norm in
the Appendices in this book, we will compromise on some of the rigor and emphasize the
intuition to develop basic familiarity with concepts in continuous-time derivatives pricing
and hedging.

E.1. Assumptions
The Black-Scholes Model is about pricing and hedging of a derivative on a single under-
lying asset (henceforth, simply known as “underlying”). The model makes several sim-
plifying assumptions for analytical convenience. Here are the assumptions:

• The underlying (whose price we denote as St at time t) follows a special case of the
lognormal process we covered in Section C.7 of Appendix C, where the drift µ(t) is
a constant (call it µ ∈ R) and the dispersion σ(t) is also a constant (call it σ ∈ R+ ):

dSt = µ · St · dt + σ · St · dzt (E.1)

This process is often referred to as Geometric Brownian Motion to reflect the fact that
the stochastic increment of the process (σ · St · dzt ) is multiplicative to the level of
the process St .
• The derivative has a known payoff at time t = T , as a function f : R+ → R of the
underlying price ST at time T .
• Apart from the underlying, the market also includes a riskless asset (which should
be thought of as lending/borrowing money at a constant infinitesimal rate of annual
return equal to r). The riskless asset (denote its price as Rt at time t) movements
can thus be described as:
dRt = r · Rt · dt
• Assume that we can trade in any real-number quantity in the underlying as well as in
the riskless asset, in continuous-time, without any transaction costs (i.e., the typical
“frictionless” market assumption).

E.2. Derivation of the Black-Scholes Equation


We denote the price of the derivative at any time t for any price St of the underlying as
V (t, St ). Thus, V (T, ST ) is equal to the payoff f (ST ). Applying Ito’s Lemma on V (t, St )
(see Equation (C.1) in Appendix C), we get:

dV(t, St) = (∂V/∂t + µ · St · ∂V/∂St + (σ²/2) · St² · ∂²V/∂St²) · dt + σ · St · ∂V/∂St · dzt    (E.2)

Now here comes the key idea: create a portfolio comprising the derivative and the underlying so as to eliminate the incremental uncertainty arising from the Brownian motion increment dzt. It’s clear from the coefficients of dzt in Equations (E.1) and (E.2) that this can be accomplished with a portfolio comprising ∂V/∂St units of the underlying and -1 units of the derivative (i.e., by selling a derivative contract written on a single unit of the underlying). Let us refer to the value of this portfolio as Πt at time t. Thus,

Πt = −V(t, St) + ∂V/∂St · St    (E.3)
Over an infinitesimal time-period [t, t + dt], the change in the portfolio value Πt is given
by:

dΠt = −dV(t, St) + ∂V/∂St · dSt
Substituting for dSt and dV (t, St ) from Equations (E.1) and (E.2), we get:

dΠt = (−∂V/∂t − (σ²/2) · St² · ∂²V/∂St²) · dt    (E.4)

Thus, we have eliminated the incremental uncertainty arising from dzt and hence, this
is a riskless portfolio. To ensure the market remains free of arbitrage, the infinitesimal rate
of annual return for this riskless portfolio must be the same as that for the riskless asset,
i.e., must be equal to r. Therefore,

dΠt = r · Πt · dt (E.5)
From Equations (E.4) and (E.5), we infer that:

−∂V/∂t − (σ²/2) · St² · ∂²V/∂St² = r · Πt
Substituting for Πt from Equation (E.3), we get:

−∂V/∂t − (σ²/2) · St² · ∂²V/∂St² = r · (−V(t, St) + ∂V/∂St · St)
Re-arranging, we arrive at the famous Black-Scholes equation:

∂V/∂t + (σ²/2) · St² · ∂²V/∂St² + r · St · ∂V/∂St − r · V(t, St) = 0    (E.6)
∂t 2 ∂St ∂St
A few key points to note here:

1. The Black-Scholes equation is a partial differential equation (PDE) in t and St, and it is valid for any derivative with arbitrary payoff f(ST) at a fixed time t = T; the derivative price function V(t, St) needs to be twice differentiable with respect to St and once differentiable with respect to t.
2. The infinitesimal change in the portfolio value (= dΠt) incorporates only the infinitesimal changes in the prices of the underlying and the derivative, and not the changes in the units held in the underlying and the derivative (meaning the portfolio is assumed to be self-financing). The portfolio composition does change continuously though, since the units held in the underlying at time t need to be ∂V/∂St, which in general would change as time evolves and as the price St of the underlying changes. Note that −∂V/∂St represents the hedge units in the underlying at any time t for any underlying price St, which nullifies the risk of changes to the derivative price V(t, St).
3. The drift µ of the underlying price movement (interpreted as expected annual rate of
return of the underlying) does not appear in the Black-Scholes Equation and hence,
the price of any derivative will be independent of the expected rate of return of the
underlying. Note though the prominent appearance of σ (referred to as the underly-
ing volatility) and the riskless rate of return r in the Black-Scholes equation.

E.3. Solution of the Black-Scholes Equation for Call/Put Options


The Black–Scholes PDE can be solved numerically using standard methods such as finite-
differences. It turns out we can solve this PDE as an exact formula (closed-form solution)
for the case of European call and put options, whose payoff functions are max(ST − K, 0)
and max(K −ST , 0) respectively, where K is the option strike. We shall denote the call and
put option prices at time t for underlying price of St as C(t, St ) and P (t, St ) respectively
(as specializations of V (t, St )). We derive the solution below for call option pricing, with
put option pricing derived similarly. Note that we could simply use the put-call parity:
C(t, St ) − P (t, St ) = St − K · e−r·(T −t) to obtain the put option price from the call option
price. The put-call parity holds because buying a call option and selling a put option produces a combined payoff of ST − K at time T, which is replicated by owning the underlying and borrowing K · e^{−r·(T−t)} at time t, a combined position whose value at time t is St − K · e^{−r·(T−t)}.
To derive the formula for C(t, St ), we perform the following change-of-variables trans-
formation:

τ = T − t
x = log(St/K) + (r − σ²/2) · τ
u(τ, x) = C(t, St) · e^{rτ}
This reduces the Black-Scholes PDE into the Heat Equation:

∂u/∂τ = (σ²/2) · ∂²u/∂x²
The terminal condition C(T, ST ) = max(ST − K, 0) transforms into the Heat Equation’s
initial condition:

u(0, x) = K · (e^{max(x,0)} − 1)
Using the standard convolution method for solving this Heat Equation with initial con-
dition u(0, x), we obtain the Green’s Function Solution:
u(τ, x) = 1/(σ·√(2πτ)) · ∫_{−∞}^{+∞} u(0, y) · e^{−(x−y)²/(2σ²τ)} · dy

With some manipulations, this yields:

u(τ, x) = K · e^{x + σ²τ/2} · N(d1) − K · N(d2)

where N (·) is the standard normal cumulative distribution function:
N(z) = 1/√(2π) · ∫_{−∞}^z e^{−y²/2} · dy

and d1 , d2 are the quantities:

d1 = (x + σ²τ) / (σ·√τ)

d2 = d1 − σ·√τ
Substituting for τ, x, u(τ, x) with t, St , C(t, St ), we get:

C(t, St ) = St · N (d1 ) − K · e−r·(T −t) · N (d2 ) (E.7)


where

d1 = (log(St/K) + (r + σ²/2) · (T − t)) / (σ·√(T − t))

d2 = d1 − σ·√(T − t)
The put option price is:

P (t, St ) = K · e−r·(T −t) · N (−d2 ) − St · N (−d1 ) (E.8)
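Equations (E.7) and (E.8) are straightforward to implement. Below is a minimal standalone Python sketch (not taken from the book’s code repository; the numerical inputs are illustrative only) that computes both prices and checks put-call parity:

from math import erf, exp, log, sqrt

def std_normal_cdf(z: float) -> float:
    # Standard normal cumulative distribution function N(z)
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def black_scholes_call_put(s: float, k: float, r: float, sigma: float, tau: float):
    # s: underlying price S_t, k: strike K, r: riskless rate, sigma: volatility, tau: T - t
    d1 = (log(s / k) + (r + sigma ** 2 / 2) * tau) / (sigma * sqrt(tau))
    d2 = d1 - sigma * sqrt(tau)
    call = s * std_normal_cdf(d1) - k * exp(-r * tau) * std_normal_cdf(d2)   # Equation (E.7)
    put = k * exp(-r * tau) * std_normal_cdf(-d2) - s * std_normal_cdf(-d1)  # Equation (E.8)
    return call, put

c, p = black_scholes_call_put(s=100.0, k=105.0, r=0.03, sigma=0.25, tau=0.5)
print(c, p)
print(c - p, 100.0 - 105.0 * exp(-0.03 * 0.5))  # these two numbers agree (put-call parity)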

F. Function Approximations as Affine Spaces
F.1. Vector Space
A Vector space is defined as a commutative group V under an addition operation (written
as +), together with multiplication of elements of V with elements of a field K (known
as scalars), expressed as a binary in-fix operation ∗ : K × V → V, with the following
properties:

• a ∗ (b ∗ v) = (a ∗ b) ∗ v, for all a, b ∈ K, for all v ∈ V.


• 1 ∗ v = v for all v ∈ V where 1 denotes the multiplicative identity of K.
• a ∗ (v1 + v2 ) = a ∗ v1 + a ∗ v2 for all a ∈ K, for all v1 , v2 ∈ V.
• (a + b) ∗ v = a ∗ v + b ∗ v for all a, b ∈ K, for all v ∈ V.

F.2. Function Space


The set F of all functions from an arbitrary generic domain X to a vector space co-domain
V (over scalars field K) constitutes a vector space (known as function space) over the
scalars field K with addition operation (+) defined as:

(f + g)(x) = f (x) + g(x) for all f, g ∈ F , for all x ∈ X


and scalar multiplication operation (∗) defined as:

(a ∗ f )(x) = a ∗ f (x) for all f ∈ F , for all a ∈ K, for all x ∈ X


Hence, addition and scalar multiplication for a function space are defined point-wise.

F.3. Linear Map of Vector Spaces


A linear map of Vector Spaces is a function h : V → W where V is a vector space over a
scalars field K and W is a vector space over the same scalars field K, having the following
two properties:

• h(v1 + v2) = h(v1) + h(v2) for all v1, v2 ∈ V (i.e., application of h commutes with the addition operation).
• h(a ∗ v) = a ∗ h(v) for all v ∈ V, for all a ∈ K (i.e., application of h commutes with the scalar multiplication operation).

Then the set of all linear maps with domain V and co-domain W constitute a function
space (restricted to just this subspace of all linear maps, rather than the space of all V → W
functions) that we denote as L(V, W).
The specialization of the function space of linear maps to the space L(V, K) (i.e., spe-
cializing the vector space W to the scalars field K) is known as the dual vector space and
is denoted as V ∗ .

F.4. Affine Space
An Affine Space is defined as a set A associated with a vector space V and a binary in-fix
operation ⊕ : A × V → A, with the following properties:

• For all a ∈ A, a ⊕ 0 = a, where 0 is the zero vector in V (this is known as the right
identity property).
• For all v1 , v2 ∈ V, for all a ∈ A, (a ⊕ v1 ) ⊕ v2 = a ⊕ (v1 + v2 ) (this is known as the
associativity property).
• For each a ∈ A, the mapping fa : V → A defined as fa (v) = a ⊕ v for all v ∈ V is a
bijection (i.e., one-to-one and onto mapping).

The elements of an affine space are called points and the elements of the vector space
associated with an affine space are called translations. The idea behind affine spaces is that
unlike a vector space, an affine space doesn’t have a notion of a zero element and one cannot
add two points in the affine space. Instead one adds a translation (from the associated vector
space) to a point (from the affine space) to yield another point (in the affine space). The
term translation is used to signify that we “translate” (i.e. shift) a point to another point
in the affine space with the shift being effected by a translation in the associated vector
space. This means there is a notion of “subtracting” one point of the affine space from
another point of the affine space (denoted with the operation ⊖), yielding a translation in
the associated vector space.
A simple way to visualize an affine space is by considering the simple example of the
affine space of all 3-D points on the plane defined by the equation z = 1, i.e., the set of
all points (x, y, 1) for all x ∈ R, y ∈ R. The associated vector space is the set of all 3-D
points on the plane defined by the equation z = 0, i.e., the set of all points (x, y, 0) for
all x ∈ R, y ∈ R (with the usual addition and scalar multiplication operations). We see
that any point (x, y, 1) on the affine space is translated to the point (x + x′ , y + y ′ , 1) by the
translation (x′ , y ′ , 0) in the vector space. Note that the translation (0, 0, 0) (zero vector)
results in the point (x, y, 1) remaining unchanged. Note that translations (x′ , y ′ , 0) and
(x′′ , y ′′ , 0) applied one after the other is the same as the single translation (x′ +x′′ , y ′ +y ′′ , 0).
Finally, note that for any fixed point (x, y, 1), we have a bijective mapping from the vector
space z = 0 to the affine space z = 1 that maps any translation (x′ , y ′ , 0) to the point
(x + x′ , y + y ′ , 1).
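To make this example concrete, here is a tiny standalone numpy sketch (purely illustrative, not code from the book’s repository) of the plane z = 1 as an affine space whose translations come from the plane z = 0:

import numpy as np

def translate(point: np.ndarray, translation: np.ndarray) -> np.ndarray:
    # The ⊕ operation: add a translation (from the plane z = 0) to a point (on the plane z = 1)
    return point + translation

p = np.array([2.0, 3.0, 1.0])    # a point of the affine space z = 1
v1 = np.array([1.0, -1.0, 0.0])  # a translation from the vector space z = 0
v2 = np.array([0.5, 2.0, 0.0])

print(translate(translate(p, v1), v2))  # (p ⊕ v1) ⊕ v2
print(translate(p, v1 + v2))            # p ⊕ (v1 + v2), the same point (associativity)
print(translate(p, v1) - p)             # "subtracting" two points recovers the translation v1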

F.5. Linear Map of Affine Spaces


A linear map of Affine Spaces is a function h : A → B where A is an affine space associated
with a vector space V and B is an affine space associated with the same vector space V,
having the following property:

h(a ⊕ v) = h(a) ⊕ v for all a ∈ A, for all v ∈ V

F.6. Function Approximations


We represent function approximations by parameterized functions f : X × D[R] → R
where X is the input domain and D[R] is the parameters domain. The notation D[Y ]
refers to a generic container data type D over a component generic data type Y . The data

type D is specified as a generic container data type because we consider generic function
approximations here. A specific family of function approximations will customize to a
specific container data type for D (eg: linear function approximations will customize D to
a Sequence data type, a feed-forward deep neural network will customize D to a Sequence
of 2-dimensional arrays). We are interested in viewing Function Approximations as points
in an appropriate Affine Space. To explain this, we start by viewing parameters as points
in an Affine Space.

F.6.1. D[R] as an Affine Space P


When performing Stochastic Gradient Descent or Batch Gradient Descent, parameters
p ∈ D[R] of a function approximation f : X × D[R] → R are updated using an appro-
priate linear combination of gradients of f with respect to p (at specific values of x ∈ X ).
Hence, the parameters domain D[R] can be treated as an affine space (call it P) whose
associated vector space (over scalars field R) is the set of gradients of f with respect to
parameters p ∈ D[R] (denoted as ∇p f (x, p)), evaluated at specific values of x ∈ X , with
addition operation defined as element-wise real-numbered addition and scalar multipli-
cation operation defined as element-wise multiplication with real-numbered scalars. We
refer to this Affine Space P as the Parameters Space and we refer to its associated vector
space (of gradients) as the Gradient Space G. Since each point in P and each translation in
G is an element in D[R], the ⊕ operation is element-wise real-numbered addition.
We define the gradient function

G : X → (P → G)

as:
G(x)(p) = ∇p f (x, p)

for all x ∈ X , for all p ∈ P.

F.6.2. Representational Space R


We consider a function I : P → (X → R) defined as I(p) = g : X → R for all p ∈ P such
that g(x) = f (x, p) for all x ∈ X . The Range of this function I forms an affine space R
whose associated vector space is the Gradient Space G, with the ⊕ operation defined as:

I(p) ⊕ v = I(p ⊕ v) for all p ∈ P, v ∈ G

We refer to this affine space R as the Representational Space to signify the fact that the ⊕
operation for R simply “delegates” to the ⊕ operation for P and so, the parameters p ∈ P
basically serve as the internal representation of the function approximation I(p) : X → R.
This “delegation” from R to P implies that I is a linear map from Parameters Space P to
Representational Space R.
Notice that the __add__ method of the Gradient class in rl/function_approx.py is over-
loaded. One of the __add__ methods corresponds to vector addition of two gradients in the
Gradient Space G. The other __add__ method corresponds to the ⊕ operation adding a gra-
dient (treated as a translation in the vector space of gradients) to a function approximation
(treated as a point in the affine space of function approximations).

F.7. Stochastic Gradient Descent
Stochastic Gradient Descent is a function

SGD : X × R → (P → P)
representing a mapping from (predictor, response) data to a “parameters-update” func-
tion (in order to improve the function approximation), defined as:

SGD(x, y)(p) = p ⊕ (α ∗ ((y − f (x, p)) ∗ G(x)(p)))


for all x ∈ X , y ∈ R, p ∈ P, where α ∈ R+ represents the learning rate (step size of SGD).
For a fixed data pair (x, y) ∈ X × R, with prediction error function e : P → R defined
as e(p) = y − f (x, p), the (SGD-based) parameters change function

U :P→G
is defined as:

U (p) = SGD(x, y)(p) ⊖ p = α ∗ (e(p) ∗ G(x)(p))


for all p ∈ P.
So, we can conceptualize the parameters change function U as the product of:

• Learning rate α ∈ R+
• Prediction error function e : P → R
• Gradient operator G(x) : P → G

Note that the product of functions e and G(x) above is element-wise in their common
domain P = D[R], resulting in the scalar (R) multiplication of vectors in G.
Updating vector p to vector p ⊕ U (p) in the Parameters Space P results in updating
function I(p) : X → R to function I(p ⊕ U (p)) : X → R in the Representational Space R.
This is rather convenient since we can view the ⊕ operation for the Parameters Space P as
effectively the ⊕ operation in the Representational Space R.

F.8. SGD Update for Linear Function Approximations


In this section, we restrict to linear function approximations, i.e., for all x ∈ X ,
f (x, p) = Φ(x)T · p
where p ∈ Rm = P and Φ : X → Rm represents the feature functions (note: Φ(x)T · p is
the usual inner-product in Rm ).
Then the gradient function G : X → (Rm → Rm ) can be written as:
G(x)(p) = ∇p (Φ(x)T · p) = Φ(x)
for all x ∈ X , for all p ∈ Rm .
When SGD-updating vector p to vector p⊕(α∗((y −Φ(x)T ·p)∗Φ(x))) in the Parameters
Space P = Rm , applying the linear map I : Rm → R correspondingly updates functions
in R. Concretely, a linear function approximation g : X → R defined as g(z) = Φ(z)T · p
for all z ∈ X updates correspondingly to the function g (x,y) : X → R defined as g (x,y) (z) =
Φ(z)T · p + α · (y − Φ(x)T · p) · (Φ(z)T · Φ(x)) for all z ∈ X .
It’s useful to note that the change in the evaluation at z ∈ X is simply the product of the following three factors (see the sketch after this list):

• Learning rate α ∈ R+
• Prediction Error y − Φ(x)T · p ∈ R for the updating data (x, y) ∈ X × R
• Inner-product of the feature vector Φ(x) ∈ Rm of the updating input value x ∈ X
and the feature vector Φ(z) ∈ Rm of the evaluation input value z ∈ X .
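To make this concrete, here is a small self-contained numpy sketch (an illustration under the notation of this section, not the FunctionApprox/Gradient code in rl/function_approx.py), showing that the change in the evaluation at z from one SGD step equals the product of the three factors listed above:

import numpy as np

def sgd_step(p: np.ndarray, phi_x: np.ndarray, y: float, alpha: float) -> np.ndarray:
    # One SGD update p -> p + alpha * (y - Phi(x)^T p) * Phi(x) for f(x, p) = Phi(x)^T p
    return p + alpha * (y - phi_x.dot(p)) * phi_x

rng = np.random.default_rng(0)
m, alpha = 5, 0.1
p = rng.normal(size=m)       # current parameters
phi_x = rng.normal(size=m)   # feature vector Phi(x) of the updating input x
phi_z = rng.normal(size=m)   # feature vector Phi(z) of an evaluation input z
y = 2.0                      # observed response for x

p_new = sgd_step(p, phi_x, y, alpha)
change_at_z = phi_z.dot(p_new) - phi_z.dot(p)
product_of_factors = alpha * (y - phi_x.dot(p)) * phi_z.dot(phi_x)
print(change_at_z, product_of_factors)  # equal up to floating-point rounding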

G. Conjugate Priors for Gaussian and
Bernoulli Distributions
The setting for this Appendix is that we receive data incrementally as x1 , x2 , . . . and we as-
sume a certain probability distribution (eg: Gaussian, Bernoulli) for each xi , i = 1, 2, . . ..
We utilize an appropriate conjugate prior for the assumed data distribution so that we
can derive the posterior distribution for the parameters of the assumed data distribution.
We can then say that for any n ∈ Z+ , the conjugate prior is the probability distribution
for the parameters of the assumed data distribution, conditional on the first n data points
(x1 , x2 , . . . xn ) and the posterior is the probability distribution for the parameters of the
assumed distribution, conditional on the first n + 1 data points (x1 , x2 , . . . , xn+1 ). This
amounts to performing Bayesian updates on the hyperparameters upon receipt of each
incremental data xi (hyperparameters refer to the parameters of the prior and posterior
distributions). In this appendix, we shall not cover the derivations of the posterior dis-
tribution from the prior distribution and the data distribution. We shall simply state the
results (references for derivations can be found on the Conjugate Prior Wikipedia Page).

G.1. Conjugate Prior for Gaussian Distribution


Here we assume that each data point is Gaussian-distributed in R. So when we receive the
n-th data point xn , we assume:

xn ∼ N (µ, σ 2 )
and we assume both µ and σ 2 are unknown random variables with Gaussian-Inverse-
Gamma Probability Distribution Conjugate Prior for µ and σ 2 , i.e.,

µ|x1, . . . , xn ∼ N(θn, σ²/n)

σ²|x1, . . . , xn ∼ IG(αn, βn)
where IG(αn , βn ) refers to the Inverse Gamma distribution with parameters αn and βn .
This means 1/σ²|x1, . . . , xn follows a Gamma distribution with parameters αn and βn, i.e., the probability density of 1/σ² at a value y ∈ R+ is:

(β^α · y^(α−1) · e^(−β·y)) / Γ(α)
where Γ(·) is the Gamma Function.
θn , αn , βn are hyperparameters determining the probability distributions of µ and σ 2 ,
conditional on data x1 , . . . , xn .
Then, the posterior distribution is given by:

µ|x1, . . . , xn+1 ∼ N((n·θn + xn+1)/(n + 1), σ²/(n + 1))

σ²|x1, . . . , xn+1 ∼ IG(αn + 1/2, βn + n·(xn+1 − θn)²/(2·(n + 1)))
This means upon receipt of the data point xn+1 , the hyperparameters can be updated
as:

θn+1 = (n·θn + xn+1)/(n + 1)

αn+1 = αn + 1/2

βn+1 = βn + n·(xn+1 − θn)²/(2·(n + 1))
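These updates are easy to carry out incrementally. Below is a minimal standalone sketch (the class name and the initial hyperparameter values are illustrative only, not from the book’s code repository) that applies the stated hyperparameter updates as Gaussian data points arrive:

from dataclasses import dataclass

@dataclass
class GaussianConjugatePrior:
    # Hyperparameters theta_n, alpha_n, beta_n after observing n data points
    theta: float
    alpha: float
    beta: float
    n: int = 0

    def update(self, x: float) -> None:
        # Bayesian update upon receipt of data point x_{n+1}, per the equations above
        n = self.n
        self.beta += n * (x - self.theta) ** 2 / (2 * (n + 1))  # uses the old theta_n
        self.theta = (n * self.theta + x) / (n + 1)
        self.alpha += 0.5
        self.n += 1

prior = GaussianConjugatePrior(theta=0.0, alpha=1.0, beta=1.0)
for x in [1.2, 0.7, 1.5, 0.9]:
    prior.update(x)
print(prior)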

G.2. Conjugate Prior for Bernoulli Distribution


Here we assume that each data point is Bernoulli-distributed. So when we receive the n-th
data point xn , we assume xn = 1 with probability p and xn = 0 with probability 1 − p. We
assume p is an unknown random variable with Beta Distribution Conjugate Prior for p, i.e.,

p|x1 , . . . , xn ∼ Beta(αn , βn )
where Beta(αn, βn) refers to the Beta distribution with parameters αn and βn, i.e., the probability density of p at a value y ∈ [0, 1] is:

(Γ(α + β) / (Γ(α) · Γ(β))) · y^(α−1) · (1 − y)^(β−1)
where Γ(·) is the Gamma Function.
αn , βn are hyperparameters determining the probability distribution of p, conditional
on data x1 , . . . , xn .
Then, the posterior distribution is given by:

p|x1 , . . . , xn+1 ∼ Beta(αn + Ixn+1 =1 , βn + Ixn+1 =0 )


where I refers to the indicator function.
This means upon receipt of the data point xn+1 , the hyperparameters can be updated
as:

αn+1 = αn + Ixn+1 =1
βn+1 = βn + Ixn+1 =0
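The corresponding Bernoulli/Beta update is even simpler; here is an analogous standalone sketch (again with illustrative names and starting values, not from the book’s code repository):

from dataclasses import dataclass

@dataclass
class BetaConjugatePrior:
    # Hyperparameters alpha_n, beta_n of the Beta prior on the Bernoulli parameter p
    alpha: float
    beta: float

    def update(self, x: int) -> None:
        # x is the Bernoulli data point x_{n+1} (either 0 or 1)
        if x == 1:
            self.alpha += 1
        else:
            self.beta += 1

prior = BetaConjugatePrior(alpha=1.0, beta=1.0)
for x in [1, 0, 1, 1, 0]:
    prior.update(x)
print(prior)  # BetaConjugatePrior(alpha=4.0, beta=3.0)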

Bibliography

Almgren, Robert, and Neil Chriss. 2000. “Optimal Execution of Portfolio Transactions.”
Journal of Risk, 5–39.
Amari, S. 1998. “Natural Gradient Works Efficiently in Learning.” Neural Computation 10
(2): 251–76.
Auer, Peter, Nicolò Cesa-Bianchi, and Paul Fischer. 2002. “Finite-Time Analysis of the
Multiarmed Bandit Problem.” Machine Learning 47 (2): 235–56. https://doi.org/10.
1023/A:1013689704352.
Avellaneda, Marco, and Sasha Stoikov. 2008. “High-Frequency Trading in a Limit Or-
der Book.” Quantitative Finance 8 (3): 217–24. http://www.informaworld.com/10.1080/
14697680701381228.
Åström, K. J. 1965. “Optimal Control of Markov Processes with Incomplete State In-
formation.” Journal of Mathematical Analysis and Applications 10 (1): 174–205. https:
//doi.org/10.1016/0022-247X(65)90154-X.
Baird, Leemon. 1995. “Residual Algorithms: Reinforcement Learning with Function Ap-
proximation.” In Machine Learning Proceedings 1995, edited by Armand Prieditis and
Stuart Russell, 30–37. San Francisco (CA): Morgan Kaufmann. https://doi.org/10.1016/B978-1-55860-377-6.50013-X.
Barberà, Salvador, Christian Seidl, and Peter J. Hammond, eds. 1998. Handbook of Utility
Theory: Handbook of Utility Theory / Barberà, Salvador. - Boston, Mass. [U.a.] : Kluwer,
1998- Vol. 1. Dordrecht [u.a.]: Kluwer. http://gso.gbv.de/DB=2.1/CMD?ACT=SRCHA&SRT=
YOP&IKT=1016&TRM=ppn+266386229&sourceid=fbw_bibsonomy.
Bellman, Richard. 1957a. “A Markovian Decision Process.” Journal of Mathematics and
Mechanics 6 (5): 679–84. http://www.jstor.org/stable/24900506.
———. 1957b. Dynamic Programming. 1st ed. Princeton, NJ, USA: Princeton University
Press.
Bertsekas, D. P., and J. N. Tsitsiklis. 1996. Neuro-Dynamic Programming. Belmont, MA:
Athena Scientific.
Bertsekas, Dimitri P. 1981. “Distributed Dynamic Programming.” In 1981 20th IEEE Con-
ference on Decision and Control Including the Symposium on Adaptive Processes, 774–79.
https://doi.org/10.1109/CDC.1981.269319.
———. 1983. “Distributed Asynchronous Computation of Fixed Points.” Mathematical
Programming 27: 107–20.
———. 2005. Dynamic Programming and Optimal Control, Volume 1, 3rd Edition. Athena
Scientific.
———. 2012. Dynamic Programming and Optimal Control, Volume 2: Approximate Dynamic
Programming. Athena Scientific.
Bertsimas, Dimitris, and Andrew W. Lo. 1998. “Optimal Control of Execution Costs.”
Journal of Financial Markets 1 (1): 1–50.
Björk, Tomas. 2005. Arbitrage Theory in Continuous Time. 2. ed., reprint. Oxford [u.a.]: Ox-
ford Univ. Press. http://gso.gbv.de/DB=2.1/CMD?ACT=SRCHA&SRT=YOP&IKT=1016&TRM=
ppn+505893878&sourceid=fbw_bibsonomy.
Black, Fisher, and Myron S. Scholes. 1973. “The Pricing of Options and Corporate Liabili-
ties.” Journal of Political Economy 81 (3): 637–54.
Bradtke, Steven J., and Andrew G. Barto. 1996. “Linear Least-Squares Algorithms for
Temporal Difference Learning.” Mach. Learn. 22 (1-3): 33–57. http://dblp.uni-trier.
de/db/journals/ml/ml22.html#BradtkeB96.
Brafman, Ronen I., and Moshe Tennenholtz. 2001. “R-MAX - a General Polynomial Time
Algorithm for Near-Optimal Reinforcement Learning.” In IJCAI, edited by Bernhard
Nebel, 953–58. Morgan Kaufmann. http://dblp.uni-trier.de/db/conf/ijcai/ijcai2001.

html#BrafmanT01.
Bühler, Hans, Lukas Gonon, Josef Teichmann, and Ben Wood. 2018. “Deep Hedging.”
http://arxiv.org/abs/1802.03042.
Chang, Hyeong Soo, Michael C. Fu, Jiaqiao Hu, and Steven I. Marcus. 2005. “An Adaptive
Sampling Algorithm for Solving Markov Decision Processes.” Operations Research 53
(1): 126–39. http://dblp.uni-trier.de/db/journals/ior/ior53.html#ChangFHM05.
Coulom, Rémi. 2006. “Efficient Selectivity and Backup Operators in Monte-Carlo Tree
Search.” In Computers and Games, edited by H. Jaap van den Herik, Paolo Ciancarini,
and H. H. L. M. Donkers, 4630:72–83. Lecture Notes in Computer Science. Springer.
http://dblp.uni-trier.de/db/conf/cg/cg2006.html#Coulom06.
Cox, J., S. Ross, and M. Rubinstein. 1979. “Option Pricing: A Simplified Approach.” Jour-
nal of Financial Economics 7: 229–63.
Degris, Thomas, Martha White, and Richard S. Sutton. 2012. “Off-Policy Actor-Critic.”
CoRR abs/1205.4839. http://arxiv.org/abs/1205.4839.
Gagniuc, Paul A. 2017. Markov Chains: From Theory to Implementation and Experimentation.
John Wiley & Sons.
Ganesh, Sumitra, Nelson Vadori, Mengda Xu, Hua Zheng, Prashant P. Reddy, and Manuela
Veloso. 2019. “Reinforcement Learning for Market Making in a Multi-Agent Dealer
Market.” CoRR abs/1911.05892. http://dblp.uni-trier.de/db/journals/corr/corr1911.
html#abs-1911-05892.
Gittins, John C. 1979. “Bandit Processes and Dynamic Allocation Indices.” Journal of the
Royal Statistical Society. Series B (Methodological), 148–77.
Goodfellow, Ian, Yoshua Bengio, and Aaron Courville. 2016. Deep Learning. MIT Press.
Gueant, Olivier. 2016. The Financial Mathematics of Market Liquidity: From Optimal Execution
to Market Making. Chapman; Hall/CRC Financial Mathematics Series.
Guez, Arthur, Nicolas Heess, David Silver, and Peter Dayan. 2014. “Bayes-Adaptive
Simulation-Based Search with Value Function Approximation.” In NIPS, edited by
Zoubin Ghahramani, Max Welling, Corinna Cortes, Neil D. Lawrence, and Kilian Q.
Weinberger, 451–59. http://dblp.uni-trier.de/db/conf/nips/nips2014.html#GuezHSD14.
Hasselt, Hado van, Yotam Doron, Florian Strub, Matteo Hessel, Nicolas Sonnerat, and
Joseph Modayil. 2018. “Deep Reinforcement Learning and the Deadly Triad.” CoRR
abs/1812.02648. http://arxiv.org/abs/1812.02648.
Howard, R. A. 1960. Dynamic Programming and Markov Processes. Cambridge, MA: MIT
Press.
Hull, John C. 2010. Options, Futures, and Other Derivatives. Seventh. Pearson.
Kakade, Sham M. 2001. “A Natural Policy Gradient.” In NIPS, edited by Thomas G. Di-
etterich, Suzanna Becker, and Zoubin Ghahramani, 1531–38. MIT Press. http://dblp.
uni-trier.de/db/conf/nips/nips2001.html#Kakade01.
Kingma, Diederik P., and Jimmy Ba. 2014. “Adam: A Method for Stochastic Optimiza-
tion.” CoRR abs/1412.6980. http://dblp.uni-trier.de/db/journals/corr/corr1412.
html#KingmaB14.
Klopf, A. H., and Air Force Cambridge Research Laboratories (U.S.). Data Sciences Labo-
ratory. 1972. Brain Function and Adaptive Systems–a Heterostatic Theory. Special Reports.
Data Sciences Laboratory, Air Force Cambridge Research Laboratories, Air Force Sys-
tems Command, United States Air Force. https://books.google.com/books?id=C2hztwEACAAJ.
Kocsis, L., and Cs. Szepesvári. 2006. “Bandit Based Monte-Carlo Planning.” In ECML,
282–93.
Krishnamurthy, Vikram. 2016. Partially Observed Markov Decision Processes: From Filtering to
Controlled Sensing. Cambridge University Press. https://doi.org/10.1017/CBO9781316471104.

Lagoudakis, Michail G., and Ronald Parr. 2003. “Least-Squares Policy Iteration.” J. Mach.
Learn. Res. 4: 1107–49. http://dblp.uni-trier.de/db/journals/jmlr/jmlr4.html#LagoudakisP03.
Lai, T. L., and H. Robbins. 1985. “Asymptotically Efficient Adaptive Allocation Rules.”
Advances in Applied Mathematics 6: 4–22.
Li, Y., Cs. Szepesvári, and D. Schuurmans. 2009. “Learning Exercise Policies for American
Options.” In AISTATS, 5:352–59. http://www.ics.uci.edu/~aistats/.
Lin, Long J. 1993. “Reinforcement Learning for Robots Using Neural Networks.” PhD
thesis, Pittsburg: CMU.
Longstaff, Francis A., and Eduardo S. Schwartz. 2001. “Valuing American Options by
Simulation: A Simple Least-Squares Approach.” Review of Financial Studies 14 (1): 113–
47. https://doi.org/10.1093/rfs/14.1.113.
Merton, Robert C. 1969. “Lifetime Portfolio Selection Under Uncertainty: The Continuous-
Time Case.” The Review of Economics and Statistics 51 (3): 247–57. https://doi.org/10.
2307/1926560.
Mnih, Volodymyr, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou,
Daan Wierstra, and Martin Riedmiller. 2013. “Playing Atari with Deep Reinforcement
Learning.” http://arxiv.org/abs/1312.5602.
Mnih, Volodymyr, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G.
Bellemare, Alex Graves, et al. 2015. “Human-Level Control Through Deep Reinforce-
ment Learning.” Nature 518 (7540): 529–33. https://doi.org/10.1038/nature14236.
Nevmyvaka, Yuriy, Yi Feng, and Michael J. Kearns. 2006. “Reinforcement Learning for
Optimized Trade Execution.” In ICML, edited by William W. Cohen and Andrew W.
Moore, 148:673–80. ACM International Conference Proceeding Series. ACM. http:
//dblp.uni-trier.de/db/conf/icml/icml2006.html#NevmyvakaFK06.
Puterman, Martin L. 2014. Markov Decision Processes: Discrete Stochastic Dynamic Program-
ming. John Wiley & Sons.
Rummery, G. A., and M. Niranjan. 1994. “On-Line Q-Learning Using Connectionist Sys-
tems.” CUED/F-INFENG/TR-166. Engineering Department, Cambridge University.
Russo, Daniel J., Benjamin Van Roy, Abbas Kazerouni, Ian Osband, and Zheng Wen. 2018.
“A Tutorial on Thompson Sampling.” Foundations and Trends® in Machine Learning 11
(1): 1–96. https://doi.org/10.1561/2200000070.
Salimans, Tim, Jonathan Ho, Xi Chen, Szymon Sidor, and Ilya Sutskever. 2017. “Evo-
lution Strategies as a Scalable Alternative to Reinforcement Learning.” arXiv Preprint
arXiv:1703.03864.
Sherman, Jack, and Winifred J. Morrison. 1950. “Adjustment of an Inverse Matrix Corre-
sponding to a Change in One Element of a Given Matrix.” The Annals of Mathematical
Statistics 21 (1): 124–27. https://doi.org/10.1214/aoms/1177729893.
Shreve, Steven E. 2003. Stochastic Calculus for Finance I: The Binomial Asset Pricing Model:
Binomial Asset Pricing Model. New York, NY: Springer-Verlag.
———. 2004. Stochastic Calculus for Finance II: Continuous-Time Models. New York: Springer.
Silver, David, Aja Huang, Chris J. Maddison, Arthur Guez, Laurent Sifre, George van den
Driessche, Julian Schrittwieser, et al. 2016. “Mastering the Game of Go with Deep
Neural Networks and Tree Search.” Nature 529 (7587): 484–89. https://doi.org/10.
1038/nature16961.
Silver, David, Guy Lever, Nicolas Heess, Thomas Degris, Daan Wierstra, and Martin A.
Riedmiller. 2014. “Deterministic Policy Gradient Algorithms.” In ICML, 32:387–95.
JMLR Workshop and Conference Proceedings. JMLR.org. http://dblp.uni-trier.de/
db/conf/icml/icml2014.html#SilverLHDWR14.
Spooner, Thomas, John Fearnley, Rahul Savani, and Andreas Koukorinis. 2018. “Market

Making via Reinforcement Learning.” CoRR abs/1804.04216. http://dblp.uni-trier.
de/db/journals/corr/corr1804.html#abs-1804-04216.
Sutton, R., D. Mcallester, S. Singh, and Y. Mansour. 2001. “Policy Gradient Methods for
Reinforcement Learning with Function Approximation.” MIT Press.
Sutton, R. S., H. R. Maei, D. Precup, S. Bhatnagar, D. Silver, Cs. Szepesvári, and E. Wiewiora.
2009. “Fast Gradient-Descent Methods for Temporal-Difference Learning with Linear
Function Approximation.” In ICML, 993–1000.
Sutton, R. S., Cs. Szepesvári, and H. R. Maei. 2008. “A Convergent O(n) Algorithm for Off-
Policy Temporal-Difference Learning with Linear Function Approximation.” In NIPS,
1609–16.
Sutton, Richard S. 1991. “Dyna, an Integrated Architecture for Learning, Planning, and Re-
acting.” SIGART Bull. 2 (4): 160–63. http://dblp.uni-trier.de/db/journals/sigart/
sigart2.html#Sutton91.
Sutton, Richard S., and Andrew G. Barto. 2018. Reinforcement Learning: An Introduction.
Second. The MIT Press. http://incompleteideas.net/book/the-book-2nd.html.
Vyetrenko, Svitlana, and Shaojie Xu. 2019. “Risk-Sensitive Compact Decision Trees for Au-
tonomous Execution in Presence of Simulated Market Response.” CoRR abs/1906.02312.
http://dblp.uni-trier.de/db/journals/corr/corr1906.html#abs-1906-02312.
Watkins, C. J. C. H. 1989. “Learning from Delayed Rewards.” PhD thesis, King’s College,
Oxford.
Williams, R. J. 1992. “Simple Statistical Gradient-Following Algorithms for Connectionist
Reinforcement Learning.” Machine Learning 8: 229–56.
Øksendal, Bernt. 2003. Stochastic Differential Equations. 6. ed. Universitext. Berlin ; Heidel-
berg [u.a.]: Springer. http://aleph.bib.uni-mannheim.de/F/?func=find-b&request=106106503&
find_code=020&adjacent=N&local_base=MAN01PUBLIC&x=0&y=0.

