Decision Making Under Uncertainty and Reinforcement Learning


Intelligent Systems Reference Library 223

Christos Dimitrakakis
Ronald Ortner

Decision Making
Under Uncertainty
and Reinforcement
Learning
Theory and Algorithms
Intelligent Systems Reference Library

Volume 223

Series Editors
Janusz Kacprzyk, Polish Academy of Sciences, Warsaw, Poland
Lakhmi C. Jain, KES International, Shoreham-by-Sea, UK
The aim of this series is to publish a Reference Library, including novel advances
and developments in all aspects of Intelligent Systems in an easily accessible and
well structured form. The series includes reference works, handbooks, compendia,
textbooks, well-structured monographs, dictionaries, and encyclopedias. It contains
well integrated knowledge and current information in the field of Intelligent Systems.
The series covers the theory, applications, and design methods of Intelligent Systems.
Virtually all disciplines such as engineering, computer science, avionics, business,
e-commerce, environment, healthcare, physics and life science are included. The list
of topics spans all the areas of modern intelligent systems such as: Ambient intelli-
gence, Computational intelligence, Social intelligence, Computational neuroscience,
Artificial life, Virtual society, Cognitive systems, DNA and immunity-based systems,
e-Learning and teaching, Human-centred computing and Machine ethics, Intelligent
control, Intelligent data analysis, Knowledge-based paradigms, Knowledge manage-
ment, Intelligent agents, Intelligent decision making, Intelligent network security,
Interactive entertainment, Learning paradigms, Recommender systems, Robotics
and Mechatronics including human-machine teaming, Self-organizing and adap-
tive systems, Soft computing including Neural systems, Fuzzy systems, Evolu-
tionary computing and the Fusion of these paradigms, Perception and Vision, Web
intelligence and Multimedia.
Indexed by SCOPUS, DBLP, zbMATH, SCImago.
All books published in the series are submitted for consideration in Web of Science.
Christos Dimitrakakis · Ronald Ortner

Decision Making Under Uncertainty
and Reinforcement Learning
Theory and Algorithms
Christos Dimitrakakis
Informatique Department
Université de Neuchâtel
Neuchâtel, Switzerland

Ronald Ortner
Mathematik und Informationstechnologie
Montanuniversität Leoben
Leoben, Austria

ISSN 1868-4394 ISSN 1868-4408 (electronic)


Intelligent Systems Reference Library
ISBN 978-3-031-07612-1 ISBN 978-3-031-07614-5 (eBook)
https://doi.org/10.1007/978-3-031-07614-5

© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature
Switzerland AG 2022
This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether
the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse
of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and
transmission or information storage and retrieval, electronic adaptation, computer software, or by similar
or dissimilar methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication
does not imply, even in the absence of a specific statement, that such names are exempt from the relevant
protective laws and regulations and therefore free for general use.
The publisher, the authors, and the editors are safe to assume that the advice and information in this book
are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or
the editors give a warranty, expressed or implied, with respect to the material contained herein or for any
errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional
claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Switzerland AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
Preface

The purpose of this book is to collect the fundamental results for decision making
under uncertainty in one place. In particular, the aim is to give a unified account of
algorithms and theory for sequential decision making problems, including reinforce-
ment learning. Starting from elementary statistical decision theory, we progress to
the reinforcement learning problem and various solution methods. The end of the
book focuses on the current state of the art in models and approximation algorithms.
The problem of decision making under uncertainty can be broken down into two
parts. First, how do we learn about the world? This involves both the problem of
modeling our initial uncertainty about the world and that of drawing conclusions
from evidence and our initial belief. Secondly, given what we currently know about
the world, how should we decide what to do, taking into account future events and
observations that may change our conclusions?
Typically, this will involve creating long-term plans covering possible future even-
tualities. That is, when planning under uncertainty, we also need to take into account
what possible future knowledge could be generated when implementing our plans.
Intuitively, executing plans which involve trying out new things should give more
information, but it is hard to tell whether this information will be beneficial. The
choice between doing something which is already known to produce good results and
experiment with something new is known as the exploration–exploitation dilemma,
and it is at the root of the interaction between learning and planning.
Part I of the book, Chaps. 1–4, focuses on decision making under uncertainty in
non-sequential settings. This includes scenarios such as hypothesis testing, where
the decision maker must choose a single action given the available evidence. Most
of the development is given through the prism of Bayesian inference and decision
theory, where the decision maker has a subjective belief (expressed as a probability
distribution) over what is true. Part II of the book, Chaps. 5–8, introduces sequential
problems and the formalism of Markov decision processes. The remaining chapters
are devoted to the problem of reinforcement learning, which is one of the most general
sequential decision problems under uncertainty. Finally, we have added a number of
theoretical and practical exercises that will hopefully aid the reader to understand
the material.

Neuchâtel, Switzerland Christos Dimitrakakis


Leoben, Austria Ronald Ortner
Acknowledgements

Many thanks go to all the students of the Decision making under uncertainty and
Advanced topics in reinforcement learning and decision making classes over the
years for bearing with early drafts of this book. A big thank you goes to Nikolaos
Tziortziotis, whose code is used in some of the examples in the book. Thanks also
to Aristide Tossou and Hannes Eriksson for proof-reading various chapters. Finally,
a lot of the code examples in the book were run using the parallel package by Tange
[1].

Reference
1. Tange, O.: GNU Parallel: the command-line power tool. USENIX Mag. 36(1), 42–47 (2011)

Contents

1 Introduction
  1.1 Uncertainty and Probability
  1.2 The Exploration–Exploitation Trade-Off
  1.3 Decision Theory and Reinforcement Learning
  References
2 Subjective Probability and Utility
  2.1 Subjective Probability
    2.1.1 Relative Likelihood
    2.1.2 Subjective Probability Assumptions
    2.1.3 Assigning Unique Probabilities*
    2.1.4 Conditional Likelihoods
    2.1.5 Probability Elicitation
  2.2 Updating Beliefs: Bayes' Theorem
  2.3 Utility Theory
    2.3.1 Rewards and Preferences
    2.3.2 Preferences Among Distributions
    2.3.3 Utility
    2.3.4 Measuring Utility*
    2.3.5 Convex and Concave Utility Functions
  2.4 Exercises
  Reference
3 Decision Problems
  3.1 Introduction
  3.2 Rewards that Depend on the Outcome of an Experiment
    3.2.1 Formalisation of the Problem Setting
    3.2.2 Decision Diagrams
    3.2.3 Statistical Estimation*
  3.3 Bayes Decisions
    3.3.1 Convexity of the Bayes-Optimal Utility*
  3.4 Statistical and Strategic Decision Making
    3.4.1 Alternative Notions of Optimality
    3.4.2 Solving Minimax Problems*
    3.4.3 Two-Player Games
  3.5 Decision Problems with Observations
    3.5.1 Maximizing Utility When Making Observations
    3.5.2 Bayes Decision Rules
    3.5.3 Decision Problems in Classification
    3.5.4 Calculating Posteriors
  3.6 Summary
  3.7 Exercises
    3.7.1 Problems with No Observations
    3.7.2 Problems with Observations
    3.7.3 An Insurance Problem
    3.7.4 Medical Diagnosis
  References
4 Estimation
  4.1 Introduction
  4.2 Sufficient Statistics
    4.2.1 Sufficient Statistics
    4.2.2 Exponential Families
  4.3 Conjugate Priors
    4.3.1 Bernoulli-Beta Conjugate Pair
    4.3.2 Conjugates for the Normal Distribution
    4.3.3 Conjugates for Multivariate Distributions
  4.4 Credible Intervals
  4.5 Concentration Inequalities
    4.5.1 Chernoff-Hoeffding Bounds
  4.6 Approximate Bayesian Approaches
    4.6.1 Monte Carlo Inference
    4.6.2 Approximate Bayesian Computation
    4.6.3 Analytic Approximations of the Posterior
    4.6.4 Maximum Likelihood and Empirical Bayes Methods
  References
5 Sequential Sampling
  5.1 Gains From Sequential Sampling
    5.1.1 An Example: Sampling with Costs
  5.2 Optimal Sequential Sampling Procedures
    5.2.1 Multi-stage Problems
    5.2.2 Backwards Induction for Bounded Procedures
    5.2.3 Unbounded Sequential Decision Procedures
    5.2.4 The Sequential Probability Ratio Test
    5.2.5 Wald's Theorem
  5.3 Martingales
  5.4 Markov Processes
  5.5 Exercises
6 Experiment Design and Markov Decision Processes
  6.1 Introduction
  6.2 Bandit Problems
    6.2.1 An Example: Bernoulli Bandits
    6.2.2 Decision-Theoretic Bandit Process
  6.3 Markov Decision Processes and Reinforcement Learning
    6.3.1 Value Functions
  6.4 Finite Horizon, Undiscounted Problems
    6.4.1 Direct Policy Evaluation
    6.4.2 Backwards Induction Policy Evaluation
    6.4.3 Backwards Induction Policy Optimization
  6.5 Infinite-Horizon
    6.5.1 Examples
    6.5.2 Markov Chain Theory for Discounted Problems
    6.5.3 Optimality Equations
    6.5.4 MDP Algorithms for Infinite Horizon and Discounted Rewards
  6.6 Optimality Criteria
  6.7 Summary
  6.8 Further Reading
  6.9 Exercises
    6.9.1 MDP Theory
    6.9.2 Automatic Algorithm Selection
    6.9.3 Scheduling
    6.9.4 General Questions
  References
7 Simulation-Based Algorithms
  7.1 Introduction
    7.1.1 The Robbins-Monro Approximation
    7.1.2 The Theory of the Approximation
  7.2 Dynamic Problems
    7.2.1 Monte Carlo Policy Evaluation and Iteration
    7.2.2 Monte Carlo Updates
    7.2.3 Temporal Difference Methods
    7.2.4 Stochastic Value Iteration Methods
  7.3 Discussion
  7.4 Exercises
  References
8 Approximate Representations
  8.1 Introduction
    8.1.1 Fitting a Value Function
    8.1.2 Fitting a Policy
    8.1.3 Features
    8.1.4 Estimation Building Blocks
    8.1.5 The Value Estimation Step
    8.1.6 Policy Estimation
  8.2 Approximate Policy Iteration (API)
    8.2.1 Error Bounds for Approximate Value Functions
    8.2.2 Rollout-Based Policy Iteration Methods
    8.2.3 Least Squares Methods
  8.3 Approximate Value Iteration
    8.3.1 Approximate Backwards Induction
    8.3.2 State Aggregation
    8.3.3 Representative State Approximation
    8.3.4 Bellman Error Methods
  8.4 Policy Gradient
    8.4.1 Stochastic Policy Gradient
    8.4.2 Practical Considerations
  8.5 Examples
  8.6 Further Reading
  8.7 Exercises
  References
9 Bayesian Reinforcement Learning
  9.1 Introduction
    9.1.1 Acting in Unknown MDPs
    9.1.2 Updating the Belief
  9.2 Finding Bayes-Optimal Policies
    9.2.1 The Expected MDP Heuristic
    9.2.2 The Maximum MDP Heuristic
    9.2.3 Bayesian Policy Gradient
    9.2.4 The Belief-Augmented MDP
    9.2.5 Branch and Bound
    9.2.6 Bounds on the Expected Utility
    9.2.7 Estimating Lower Bounds on the Value Function with Backwards Induction
    9.2.8 Further Reading
  9.3 Bayesian Methods in Continuous Spaces
    9.3.1 Linear-Gaussian Transition Models
    9.3.2 Approximate Dynamic Programming
  9.4 Partially Observable Markov Decision Processes
    9.4.1 Solving Known POMDPs
    9.4.2 Solving Unknown POMDPs
  9.5 Relations Between Different Settings
  9.6 Exercises
  References
10 Distribution-Free Reinforcement Learning
  10.1 Introduction
  10.2 Finite Stochastic Bandit Problems
    10.2.1 The UCB1 Algorithm
    10.2.2 Non i.i.d. Rewards
  10.3 Reinforcement Learning in MDPs
    10.3.1 An Upper-Confidence Bound Algorithm
    10.3.2 Bibliographical Remarks
  References
11 Conclusion
Appendix: Symbols
Index
Chapter 1
Introduction

1.1 Uncertainty and Probability

A lot of this book is grounded in the essential methods of probability, in particular
using it to represent uncertainty. While probability is a simple mathematical con-
struction, philosophically it has had at least three different meanings. In the classical
sense, a probability distribution is a description for a truly random event. In the sub-
jectivist sense, probability is merely an expression of our uncertainty, which is not
necessarily due to randomness. Finally, in the algorithmic sense, probability is linked
to the length of a program that generates a particular output.
In all cases, we are dealing with a set Ω of possible outcomes: the result of
a random experiment, the underlying state of the world and the program output
respectively. In all cases, we use probability to model our uncertainty over Ω.

Classical Probability
A random experiment is performed, with a given set Ω of possible outcomes. An
example is the 2-slit experiment in physics, where a particle is generated which
can go through either one of two slits. According to our current understanding
of quantum mechanics, it is impossible to predict which slit the particle will go
through. Herein, the set Ω consists of two possible events corresponding to the
particle passing through one or the other slit.

In the 2-slit experiment, the probabilities of either event can actually be accurately
calculated through quantum theory. However, which slit the particle will go through
is fundamentally unpredictable. Such quantum experiments are the only ones that are
currently thought of as truly random (though some people disagree about that too).
Any other procedure, such as tossing a coin or casting a die, is inherently deterministic
and only appears random due to our difficulty in predicting the outcome. That is,
modelling a coin toss as a random process is usually the best approximation we can
make in practice, given our uncertainty about the complex dynamics involved. This
gives rise to the concept of subjective probability as a general technique to model
uncertainty.

Subjective Probability
Here Ω can conceptually not only describe the outcomes of some experiment,
but also a set of possible worlds or realities. This set can be quite large and
include anything imaginable. For example, it may include worlds where dragons
are real. However, in practice one only cares about certain aspects of the world,
such as whether in this world, you will win the lottery if you buy a ticket. We
can interpret the probability of a world in Ω as our degree of belief that it
corresponds to reality.

In such a setting there is an actual, true world ω ∗ ∈ Ω, which could have been set
by Nature to an arbitrary value deterministically. However, we do not know which
element in Ω is the true world, and the probability reflects our lack of knowledge
(rather than any inherent randomness about the selection of ω ∗ ).
No matter which view we espouse, we must always take into account our uncer-
tainty when making decisions. When the problem we are dealing with is sequential,
we are taking actions, obtaining new observations, and then taking further actions. As
we gather more information, we learn more about the world. However, the things we
learn about depend on what actions we take. For example, if we always take the same
route to work, then we learn how much time this route takes on different days and
times of the week. However, we don’t obtain information about the time other routes
take. So, we potentially miss out on better choices than the one we usually follow.
This phenomenon gives rise to the so-called exploration-exploitation trade-off.

1.2 The Exploration–Exploitation Trade-Off

Consider the problem of selecting a restaurant to go to during a vacation. The best
restaurant you have found so far is Les Epinards. The food there is usually to your
taste and satisfactory. However, a well-known recommendations website suggests
that King’s Arm is really good! It is tempting to try it out. But there is the risk
involved that the King’s Arm is much worse than Les Epinards. On the other hand,
it could also be much better. What should you do?
It all depends on how much information you have about either restaurant, and how
many more days you’ll stay in town. If this is your last day, then it’s probably a better
idea to go to Les Epinards, unless you are expecting King’s Arm to be significantly
better. However, if you are going to stay there longer, trying out King’s Arm is a
good bet. If you are lucky, you will be getting much better food for the remaining
time, while otherwise you will have missed only one good meal out of many, making
the potential risk quite small.

Thus, one must decide whether to exploit knowledge about the world to gain a
known reward, or to explore the world to learn something new. This will potentially
give you less reward immediately, but the knowledge itself can usually be put to use
in the future.
This exploration-exploitation trade-off only arises when data collection is inter-
active. If we are simply given a set of data which is to be used to decide upon a course
of action, but our decision does not affect the data we shall collect in the future, then
things are much simpler. However, a lot of real-world human decision making as
well as modern applications in data science involve such trade-offs. Decision theory
offers precise mathematical models and algorithms for such problems.
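
To make the trade-off concrete, here is a minimal simulation sketch of the restaurant story above, using an ε-greedy rule: with small probability we explore, otherwise we exploit the restaurant with the best observed average. The meal-quality distributions and the value of ε are illustrative assumptions of this sketch, not something specified in the text.

```python
import random

# Hypothetical meal-quality distributions; the numbers are illustrative assumptions.
def les_epinards():
    return random.gauss(7.0, 1.0)

def kings_arm():
    return random.gauss(8.0, 2.0)   # possibly better, but unknown to the diner

restaurants = [les_epinards, kings_arm]
counts = [0, 0]      # visits per restaurant
means = [0.0, 0.0]   # running average quality per restaurant
epsilon = 0.1        # probability of exploring

random.seed(0)
for day in range(200):
    if 0 in counts or random.random() < epsilon:
        choice = random.randrange(2)                      # explore
    else:
        choice = max(range(2), key=lambda i: means[i])    # exploit
    reward = restaurants[choice]()
    counts[choice] += 1
    means[choice] += (reward - means[choice]) / counts[choice]   # incremental mean

print("visits:", counts, "estimated quality:", [round(m, 2) for m in means])
```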

1.3 Decision Theory and Reinforcement Learning

Decision theory deals with the formalization and solution of decision problems.
Given a number of alternatives, what would be the rational choice in a particular
situation depending on one’s goals and desires? In order to answer this question we
need to develop a good concept of rational behavior. This will serve two purposes.
Firstly, this can provide an explanation for the behavior of animals and humans.
Secondly, it is useful for developing models and algorithms for automated decision
making in complex tasks.
A particularly interesting problem in this setting is reinforcement learning. This
problem arises when the environment is unknown, and the learner has to make deci-
sions solely through interaction, which only gives limited feedback. Thus, the learn-
ing agent does not have access to detailed instructions on which task to perform,
nor on how to do it. Instead, it performs actions, which affect the environment and
obtains some observations (i.e., sensory input) and feedback, usually in the form of
rewards which represent the agent’s desires. The learning problem is then formu-
lated as the problem of learning how to act to maximize total reward. In biological
systems, reward is intrinsically hardwired to signals associated with basic needs. In
artificial systems, we can choose the reward signals so as to reinforce behaviour that
achieves the designer’s goals.
Reinforcement learning is a fundamental problem in artificial intelligence, since
frequently we can tell robots, computers, or cars only what we would like them to
achieve, but we do not know the best way to achieve it. We would like to simply
give them a description of our goals and then let them explore the environment on
their own to find a good solution. Since the world is (at least partially) unknown, the
learner always has to deal with the exploration-exploitation trade-off.
Animals and humans also learn through imitation, exploration, and shaping their
behavior according to reward signals to finally achieve their goals. In fact, it has been
known since the 1990s that there is some connection between some reinforcement
learning algorithms and mechanisms in the basal ganglia [1–3].
Decision theory is closely related to other fields, such as logic, statistics, game
theory, and optimization. Those fields have slightly different underlying objectives,
even though they may share the same formalism. In the field of optimization, we
are not only interested in optimal planning in complex environments but also in
how to make robust plans given some uncertainty about the environment. Artificial
intelligence research is concerned with modelling the environments and developing
algorithms that are able to learn by interaction with the environment or from demon-
stration by teachers. Economics and game theory deal with the problem of modeling
the behavior of rational agents and with designing mechanisms (such as markets)
that will give incentives to agents to behave in a certain way.
Beyond pure research, there are also many applications connected to decision the-
ory. Commercial applications arise e.g. in advertising where one wishes to model the
preferences and decision making of individuals. Decision problems also arise in secu-
rity. There are many decision problems, especially in cryptographic and biometric
authentication, but also in detecting and responding to intrusions in networked com-
puter systems. Finally, in the natural sciences, especially in biology and medicine,
decision theory offers a way to automatically design and run experiments and to
optimally construct clinical trials.

Outline

1. Subjective probability and utility: the notion of subjective probability; eliciting
priors; the concept of utility; expected utility
2. Decision problems: maximizing expected utility; maximin utility; regret
3. Estimation: estimation as conditioning; families of distributions closed under con-
ditioning; conjugate priors; concentration inequalities; PAC and high-probability
bounds; Markov Chain Monte Carlo; ABC estimation
4. Sequential sampling and optimal stopping: sequential sampling problems; the
cost of sampling; optimal stopping; martingales.
5. Reinforcement learning I—Markov decision processes: belief and information
state; bandit problems; Markov decision processes; backwards induction; value
iteration; policy iteration; temporal differences; linear programming
6. Reinforcement learning II—Stochastic and approximation algorithms: Sarsa; Q-
learning; stochastic value iteration; TD(λ)
7. Reinforcement learning III—Function approximation: features and the curse of
dimensionality; approximate value iteration; approximate policy iteration; policy
gradient
8. Reinforcement learning IV—Bayesian reinforcement learning: bounds on utility;
Thompson sampling; stochastic branch and bound; sparse sampling; partially
observable MDPs
9. Reinforcement learning V—Distribution-free reinforcement learning: stochastic
and metric bandits; UCRL; bounds for Thompson sampling

References

1. Yin, H.H., Knowlton, B.J.: The role of the basal ganglia in habit formation. Nat. Rev. Neurosci.
7(6), 464 (2006)
2. Barto, A.G.: Adaptive Critics and the Basal Ganglia, pp. 215–232. MIT Press (1995)
3. Schultz, W., Dayan, P., Read Montague, P.: A neural substrate of prediction and reward. Science
275(5306), 1593–1599 (1997)
Chapter 2
Subjective Probability and Utility

2.1 Subjective Probability

In order to make decisions, we need to be able to make predictions about the possible
outcomes of each decision. Usually, we have uncertainty about what those outcomes
are. This can be due to stochasticity, which is frequently used to model games of
chance and inherently unpredictable physical phenomena. It can also be due to partial
information, a characteristic of many natural problems. For example, it might be hard
to know at any single moment how much change you have in your wallet, whether
you will be able to catch the next bus, or to remember where you left your keys.
In either case, this uncertainty can be expressed as a subjective belief. This does not
have to correspond to reality. For example, some people believe, quite inaccurately,
that if a coin comes up tails for a long time, it is quite likely to come up heads very
soon. Or, you might quite happily believe your keys are in your pocket, only to realise
that you left them at home as soon as you arrive at the office.
In this book, we assume the view that subjective beliefs can be modelled as
probabilities. This allows us to treat uncertainty due to stochasticity and due to partial
information in a unified framework. In doing so, we have to define for each problem
a space of possible outcomes Ω and specify an appropriate probability distribution.

2.1.1 Relative Likelihood

Let us start with the simple example of guessing whether a tossed coin will come up
head or tails. In this case the sample space Ω would correspond to every possible
way the coin can land. Since we are only interested in predicting which face will
be up, let A ⊂ Ω be all those cases where the coin comes up heads, and B ⊂ Ω be
the set of tosses where it comes up tails. Here A ∩ B = ∅, but there may be some
other events such as the coin becoming lost, so it does not necessarily hold that

A ∪ B = Ω. Nevertheless, we only care about whether A is more likely than B. As
said, this likelihood may be based only on subjective beliefs. We can express that via
the concept of relative likelihood:

The relative likelihood of two events A and B


• If A is more likely than B, then we write A ≻ B, or equivalently B ≺ A.
• If A is as likely as B, then we write A ∼ B.

We also use ⪰ and ⪯ for at least as likely as and for no more likely than.

Let us now speak more generally about the case where we have defined an appro-
priate σ-field F on Ω. Then each element Ai ∈ F will be a subset of Ω. We now
wish to define relative likelihood relations for the elements Ai ∈ F.1
As we would like to use the language of probability to talk about likelihoods, we
need to define a probability measure that agrees with our given relations. A probability
measure P : F → [0, 1] is said to agree with a relation ⪯, if it has the property
that P(A) ≤ P(B) if and only if A ⪯ B, for all A, B ∈ F. In general, there are
many possible measures that can agree with a given relation, cf. Example 2.1 below.
However, it could also be that a given relational structure is incompatible with any
possible probability measure. We also consider the question under which assumptions
a likelihood relation corresponds to a unique probability measure.

2.1.2 Subjective Probability Assumptions

We would like our beliefs to satisfy some intuitive properties about what statements
we can make concerning the relative likelihood of events. As we will see, these
assumptions are also necessary to guarantee the existence of a corresponding prob-
ability measure. First of all, it must always be possible to say whether one event is
more likely than the other, i.e., our beliefs must be complete. Consequently, we are
not allowed to claim ignorance.
Assumption 2.1.1 (SP1) For any pair of events A, B ∈ F, one has either A ≻ B,
A ≺ B, or A ∼ B.
Another important assumption is a principle of consistency: Informally, if we believe
that every possible event Ai that leads to A is less likely than a unique corresponding
event Bi that leads to an outcome B, then we should always conclude that A is less
likely than B.

1 More formally, we can define three classes: C≻, C≺, C∼ ⊂ F² such that a pair (Ai, Aj) ∈ C_R if
and only if it satisfies the relation Ai R Aj, where R ∈ {≻, ≺, ∼}. These three classes form a partition
of F² under the subjective probability assumptions we will introduce in the next section.

Assumption 2.1.2 (SP2) Let A = A1 ∪ A2, B = B1 ∪ B2 with A1 ∩ A2 = B1 ∩
B2 = ∅. If Ai ⪯ Bi for i = 1, 2, then A ⪯ B.
We also require the simple technical assumption that any event A ∈ F is at least as
likely as the empty event ∅, which never happens.
Assumption 2.1.3 (SP3) For all A it holds that ∅ ⪯ A. Further, ∅ ≺ Ω.
As it turns out, these assumptions are sufficient for proving the following
theorems [1]. The first theorem tells us that our belief must be consistent with respect
to transitivity.
Theorem 2.1.4 (Transitivity) Under Assumptions 2.1.1, 2.1.2, and 2.1.3, for all
events A, B, C: If A ⪯ B and B ⪯ C, then A ⪯ C.
The second theorem says that if two events have a certain relation, then their negations
have the converse relation.
Theorem 2.1.5 (Complement) For any A, B: A ⪯ B iff Aᶜ ⪰ Bᶜ.
Finally, note that if A ⊂ B, then it must be the case that whenever A happens, B
must happen and hence B must be at least as likely as A.

Theorem 2.1.6 (Fundamental property of relative likelihoods) If A ⊂ B then
A ⪯ B. Furthermore, ∅ ⪯ A ⪯ Ω for any event A.

Since we are dealing with σ-fields, we need to introduce additional assumptions
for infinite sequences of events. While these are not necessary if the field F is finite,
it is good to include them for generality.
Assumption 2.1.7 (SP4) If A1 ⊃ A2 ⊃ · · · is a decreasing sequence of events in F
and B ∈ F is such that Ai ⪰ B for all i, then ⋂_{i=1}^∞ Ai ⪰ B.
As a consequence, we obtain the following dual theorem:
Theorem 2.1.8 If A1 ⊂ A2 ⊂ · · · is an increasing sequence of events in F and
B ∈ F is such that Ai ⪯ B for all i, then ⋃_{i=1}^∞ Ai ⪯ B.
We are now able to state a theorem for the unions of infinite sequences of disjoint
events.
Theorem 2.1.9 If (Ai)_{i=1}^∞ and (Bi)_{i=1}^∞ are infinite sequences of disjoint events in F
such that Ai ⪯ Bi for all i, then ⋃_{i=1}^∞ Ai ⪯ ⋃_{i=1}^∞ Bi.
As said, our goal is to express likelihood via probability. Accordingly, we say
that likelihood is induced by a probability measure P if for all A, B: A ≻ B iff
P(A) > P(B), and A ∼ B iff P(A) = P(B). In this case the following theorem
guarantees that the stipulated assumptions are always satisfied.
Theorem 2.1.10 Let P be a probability measure over Ω. Then
(i) P(A) > P(B), P(A) < P(B), or P(A) = P(B) for all A, B.
(ii) Consider (possibly infinite) partitions {Ai}i, {Bi}i of A, B, respectively. If
P(Ai) ≤ P(Bi) for all i, then P(A) ≤ P(B).
(iii) For any A, P(∅) ≤ P(A) and P(∅) < P(Ω).

Proof Part (i) is trivial, as P : F → [0, 1]. Part (ii) holds due to P(A) = P(⋃_i Ai) =
∑_i P(Ai) ≤ ∑_i P(Bi) = P(B). Part (iii) follows from P(∅) = 0, P(A) ≥ 0, and
P(Ω) = 1. □
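
As a sanity check of Theorem 2.1.10, the sketch below (my own illustration, not code from the book) enumerates all events of a three-element sample space, induces the relation ⪯ from an arbitrary probability assignment, and verifies SP1, SP2 (for disjoint decompositions) and SP3.

```python
from itertools import chain, combinations

omega = ["a", "b", "c"]
p = {"a": 0.5, "b": 0.3, "c": 0.2}   # an arbitrary probability on Omega

# All events of the (finite) field: every subset of omega.
F = [frozenset(c) for c in chain.from_iterable(combinations(omega, r) for r in range(len(omega) + 1))]
P = {A: sum(p[x] for x in A) for A in F}

# SP1: any two events are comparable under the induced relation.
assert all(P[A] <= P[B] or P[B] <= P[A] for A in F for B in F)

# SP2: A = A1 ∪ A2, B = B1 ∪ B2, A1 ∩ A2 = B1 ∩ B2 = ∅ and P(Ai) <= P(Bi) imply P(A) <= P(B).
for A1 in F:
    for A2 in F:
        if A1 & A2:
            continue
        for B1 in F:
            for B2 in F:
                if B1 & B2:
                    continue
                if P[A1] <= P[B1] and P[A2] <= P[B2]:
                    assert P[A1 | A2] <= P[B1 | B2]

# SP3: the empty event is never more likely than any event, and strictly less likely than Omega.
empty, full = frozenset(), frozenset(omega)
assert all(P[empty] <= P[A] for A in F) and P[empty] < P[full]
print("SP1, SP2 and SP3 hold for the relation induced by P.")
```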

2.1.3 Assigning Unique Probabilities*

In many cases, and particularly when F is a finite field, there is a large number of
probability distributions agreeing with our relative likelihoods. Choosing one specific
probability over another does not seem easy. The following example underscores this
ambiguity.

Example 2.1 Consider F = {∅, A, Aᶜ, Ω} for some A with ∅ ≠ A ⊆ Ω and
assume A ≻ Aᶜ. Consequently, P(A) > 1/2, which is however insufficient for
assigning a specific value to P(A).

In some cases we would like to assign unique probabilities to events in order


to facilitate computations. Intuitively, we may only be able to tell whether some
event A is more likely than some other event B. However, we can create a new, uni-
formly distributed random variable x on [0, 1] and determine for each value α ∈ [0, 1]
whether A is more or less likely than the event x > α. Since we need to compare both
A and B with all such events, the distribution we obtain is unique. For the sake of
completeness we start with the definition of the uniform distribution over the interval
[0, 1].
Definition 2.1.11 (Uniform distribution) Let λ(A) denote the length of any interval
A ⊆ [0, 1]. Then x : Ω → [0, 1] has a uniform distribution on [0, 1] if, for any
subintervals A, B of [0, 1],

(x ∈ A) ⪯ (x ∈ B) iff λ(A) ≤ λ(B),

where (x ∈ A) denotes the event that x(ω) ∈ A. Then (x ∈ A) ⪯ (x ∈ B) means
that the event x ∈ A is not more likely than the event x ∈ B.
This means that any larger interval is more likely than any smaller interval. Now we
shall connect the uniform distribution to the original sample space Ω by assuming
that there is some function with uniform distribution.

Assumption 2.1.12 (SP5) It is possible to construct a random variable x : Ω →


[0, 1] with a uniform distribution in [0, 1].

Constructing the Probability Distribution


We can now use the uniform distribution to create a unique probability measure that
agrees with our likelihood relation. First, we have to map each event in Ω to an
equivalent event in [0, 1], which can be done according to the following theorem.

Theorem 2.1.13 (Equivalent event) For any event A ∈ F, there exists some α ∈
[0, 1] such that A ∼ (x ∈ [0, α]).

This means that we can now define the probability of an event A by matching it
to a specific equivalent event on [0, 1].

Definition 2.1.14 (Probability of A) Given any event A, define P(A) to be the α
with A ∼ (x ∈ [0, α]).

Hence
A ∼ (x ∈ [0, P(A)]),

which is sufficient to show the following theorem.

Theorem 2.1.15 (Relative likelihood and probability) If assumptions SP1–SP5 are
satisfied, then the probability measure P defined in Definition 2.1.14 is unique.
Furthermore, for any two events A, B it holds that A ⪯ B iff P(A) ≤ P(B).

2.1.4 Conditional Likelihoods

So far we have only considered the problem of forming opinions about which events
are more likely a priori. However, we also need to have a way to incorporate evidence
which may adjust our opinions. For example, while we ordinarily may think that
A ≻ B, we may have additional information D, given which we think the opposite
is true. We can formalise this through the notion of conditional likelihoods.

Example 2.2 Say that A is the event that it rains in Gothenburg, Sweden tomorrow.
We know that Gothenburg is quite rainy due to its oceanic climate, so we set A ≻ Aᶜ.
Now, let us try and incorporate some additional information. Let D denote the fact
that good weather is forecast, so that given D good weather will appear more probable
than rain, formally (A | D) ≺ (Aᶜ | D).

Conditional Likelihoods
Define (A | D) ⪯ (B | D) to mean that B is at least as likely as A when it is known
that D holds.

Assumption 2.1.16 (CP) For any events A, B, D,

(A | D) ⪯ (B | D) iff A ∩ D ⪯ B ∩ D.

Theorem 2.1.17 If a likelihood relation ⪯ satisfies assumptions SP1–SP5 as well
as CP, then there exists a probability measure P such that: For any A, B, D such
that P(D) > 0,

(A | D) ⪯ (B | D) iff P(A | D) ≤ P(B | D).

It turns out that there are very few ways that a conditional probability definition
can satisfy all of our assumptions. The usual definition is the following.
Definition 2.1.18 (Conditional probability)

P(A | D) ≜ P(A ∩ D) / P(D).

This definition effectively answers the question of how much evidence for A
we have, now that we have observed D. This is expressed as the ratio between the
combined event A ∩ D, also known as the joint probability of A and D, and the
marginal probability of D itself. The intuition behind the definition becomes clearer
once we rewrite it as P(A ∩ D) = P(A | D) P(D). Then conditional probability is
effectively used as a way to factorise joint probabilities.

2.1.5 Probability Elicitation

Probability elicitation is the problem of quantifying the subjective probabilities of
a particular individual. One of the simplest and most direct methods is to simply
ask. However, because we cannot simply ask somebody to completely specify a
probability distribution, we can ask for this distribution iteratively as indicated by
the procedure outlined in the following example.
Example 2.3 (Temperature prediction) Let τ be the temperature tomorrow at noon
in Gothenburg. What are your estimates?

Eliciting the prior/forming the subjective probability measure P

• Select temperature x0 s.t. (τ ≤ x0) ∼ (τ > x0).
• Select temperature x1 s.t. (τ ≤ x1 | τ ≤ x0) ∼ (τ > x1 | τ ≤ x0).

By repeating this procedure recursively we are able to slowly build the com-
plete distribution, quantile by quantile.

Note that, necessarily, P(τ ≤ x0) = P(τ > x0) =: p0. Since P(τ ≤ x0) +
P(τ > x0) = P(τ ≤ x0 ∪ τ > x0) = P(τ ∈ R) = 1, it follows that p0 = 1/2. Sim-
ilarly, P(τ ≤ x1 | τ ≤ x0) = P(τ > x1 | τ ≤ x0) = 1/4.
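
The recursive halving of Example 2.3 can be mechanised by bisection, provided we have access to an oracle answering comparison queries of the form 'is (τ ≤ x) at least as likely as (τ > x), given what we already fixed?'. In the sketch below (my own illustration, not the authors' code) the oracle is simulated by a hypothetical underlying belief, a normal distribution over tomorrow's temperature; the range and the parameters are arbitrary assumptions.

```python
import statistics

# Hypothetical subjective belief used only to answer the comparison queries.
belief = statistics.NormalDist(mu=12.0, sigma=4.0)   # degrees Celsius, illustrative

def left_at_least_as_likely(x, lo, hi):
    # Given tau in (lo, hi], is (tau <= x) at least as likely as (tau > x)?
    p_left = belief.cdf(x) - belief.cdf(lo)
    p_right = belief.cdf(hi) - belief.cdf(x)
    return p_left >= p_right

def split_point(lo, hi, tol=1e-3):
    # Bisection: find x with (tau <= x) ~ (tau > x), conditionally on tau in (lo, hi].
    a, b = lo, hi
    while b - a > tol:
        mid = (a + b) / 2
        if left_at_least_as_likely(mid, lo, hi):
            b = mid
        else:
            a = mid
    return (a + b) / 2

lo, hi = -30.0, 50.0
x0 = split_point(lo, hi)    # median: P(tau <= x0) = 1/2
x1 = split_point(lo, x0)    # P(tau <= x1 | tau <= x0) = 1/2, i.e. the lower quartile
x2 = split_point(x0, hi)    # upper quartile
print(round(x0, 2), round(x1, 2), round(x2, 2))
```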

Exercise 2.1.19 Propose another way to arrive at a prior probability distribution.


For example, define a procedure for eliciting a single probability distribution from a
group of people without any interaction between the participants.

2.2 Updating Beliefs: Bayes’ Theorem

Although we always start with a particular belief, this belief must be adjusted when
we receive new evidence. In probabilistic inference, the updated beliefs are simply
the probability of future events conditioned on observed events. This idea is captured
neatly by Bayes’ theorem, which links the prior probability P(Ai ) of events to their
posterior probability P(Ai | B) given some event B and the probability P(B | Ai )
of observing the evidence B given events Ai .
Theorem 2.2.1 (Bayes' theorem) Let A1, A2, . . . be a (possibly infinite) sequence
of disjoint events such that ⋃_{i=1}^n Ai = Ω and P(Ai) > 0 for all i. Let B be another
event with P(B) > 0. Then

P(Ai | B) = P(B | Ai) P(Ai) / ∑_{j=1}^n P(B | Aj) P(Aj).

Proof By definition, P(Ai | B) = P(Ai ∩ B)/P(B) and P(Ai ∩ B) = P(B | Ai) P(Ai), so

P(Ai | B) = P(B | Ai) P(Ai) / P(B).     (2.1)

As ⋃_{j=1}^n Aj = Ω, we have B = ⋃_{j=1}^n (B ∩ Aj). Since the Aj are disjoint, so are
the B ∩ Aj. As P is a probability, the union property gives

P(B) = P(⋃_{j=1}^n (B ∩ Aj)) = ∑_{j=1}^n P(B ∩ Aj) = ∑_{j=1}^n P(B | Aj) P(Aj),

which plugged into (2.1) completes the proof. □


A Simple Exercise in Updating Beliefs
Example 2.4 (The weather forecast) Form a subjective probability for the proba-
bility that it rains.
A1 : Rain.
A2 : No rain.
First, choose P(A1 ) and set P(A2 ) = 1 − P(A1 ). Now assume that there is a weather
forecasting station that predicts no rain for tomorrow. However, you know the fol-
lowing facts about the station: On the days when it rains, half of the time the station
had predicted it was not going to rain. On days when it doesn’t rain, the station had
said no rain 9 times out of 10.
(Solution) Let B denote the event that the station predicts no rain. According to our
information, the probability that the station predicts no rain when it actually rains is
P(B | A1) = 1/2. On the other hand, P(B | A2) = 0.9. Combining these with Bayes'
theorem, we obtain

P(A1 | B) = P(B | A1) P(A1) / (P(B | A1) P(A1) + P(B | A2) [1 − P(A1)])
          = (1/2) P(A1) / (0.9 − 0.4 P(A1)).   □
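
The computation in Example 2.4 is easy to reproduce numerically. The sketch below (my own illustration; the prior value 0.6 is an arbitrary choice, since the example deliberately leaves P(A1) open) evaluates the posterior with the formula just derived.

```python
def posterior_rain(prior_rain, p_norain_forecast_given_rain=0.5, p_norain_forecast_given_norain=0.9):
    """Posterior probability of rain given a 'no rain' forecast, via Bayes' theorem."""
    joint_rain = p_norain_forecast_given_rain * prior_rain
    joint_norain = p_norain_forecast_given_norain * (1.0 - prior_rain)
    return joint_rain / (joint_rain + joint_norain)

# With a subjective prior P(A1) = 0.6 (an arbitrary choice for illustration):
print(posterior_rain(0.6))   # 0.5*0.6 / (0.9 - 0.4*0.6) = 0.4545...
```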

2.3 Utility Theory

While probability can be used to describe how likely an event is, utility can be
used to describe how desirable it is. More concretely, our subjective probabilities
are numerical representations of our beliefs and the information available to us.
They can be taken to represent our “internal model” of the world. By analogy, our
utilities are numerical representations of our tastes and preferences. That is, even if
the consequences of our actions are not directly known to us, we assume that we act
to maximise our utility, in some sense.

2.3.1 Rewards and Preferences

Rewards
Consider that we have to choose a reward r from a set R of possible rewards. While
the elements of R may be arbitrary, we shall in general find that we prefer some
rewards to others. In fact, some elements of R may not even be desirable. As an
example, R might be a set of tickets to different musical events, or a set of financial
commodities.
Preferences
Example 2.5 (Musical event tickets) We have a set of tickets R, and we must choose
the ticket r ∈ R that we prefer most. Here are two possible scenarios:
• R is a set of tickets to different music events at the same time, at equally good
halls with equally good seats and the same price. Here preferences simply coincide
with the preferences for a certain type of music or an artist.
• If R is a set of tickets to different events at different times, at different quality
halls with different quality seats and different prices, preferences may depend on
all the factors.

Example 2.6 (Route selection) We have a set of alternative routes and must pick
one. If R contains two routes of the same quality, one short and one long, we will
probably prefer the shorter one. If the longer route is more scenic our preferences
may be different.

Preferences Among Rewards


We will treat preferences in a similar manner as we have treated likelihoods. That is,
we will define a linear ordering among possible rewards.
Let a, b ∈ R be two rewards. When we prefer a to b, we write a ≻∗ b. Conversely,
when we like a less than b we write a ≺∗ b. If we like a as much as b, we write
a ∼∗ b. We also use a ⪰∗ b and a ⪯∗ b when we like a at least as much as b and
we don't like a any more than b, respectively.
We make the following assumptions about the preference relations.
Assumption 2.3.1 (i) For any a, b ∈ R, one of the following holds: a ≻∗ b, a ≺∗
b, or a ∼∗ b.
(ii) For any a, b, c ∈ R, if a ⪰∗ b and b ⪰∗ c, then a ⪰∗ c.
The first assumption means that we must always be able to decide between any
two rewards. It may seem that it does not always hold in practice, since humans are
frequently indecisive. However, without the second assumption, it is still possible to
create preference relations that are cyclic.
Example 2.7 (Counter-example for transitive preferences) Consider vector rewards
in R = R², with ri = (ai, bi), and fix some ε > 0. Our preference relation satisfies:
• ri ≻∗ rj if bi ≥ bj + ε.
• ri ≻∗ rj if ai > aj and |bi − bj| < ε.
This may correspond for example to an employer deciding to hire one out of several
candidates depending on their experience (ai) and their school grades (bi). Since
grades are not very reliable, if two people have similar grades, then we prefer the
one with more experience. However, that may lead to a cycle. Consider a sequence
of candidates i = 1, . . . , n, such that bi = bi+1 − δ, with δ < ε and ai > ai+1. Then
clearly, we must always prefer ri to ri+1. However, if δ(n − 1) > ε, we will prefer
rn to r1.
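
The cycle in Example 2.7 can be checked numerically. The following sketch (my own, with arbitrary numbers satisfying δ < ε and δ(n − 1) > ε) constructs such a sequence of candidates and verifies that each one is preferred to the next, while the last is preferred to the first.

```python
epsilon, delta, n = 1.0, 0.3, 5

# Candidate i has experience a[i] (decreasing in i) and grade b[i] = b[i+1] - delta.
a = [float(n - i) for i in range(n)]        # a1 > a2 > ... > an
b = [10.0 + i * delta for i in range(n)]    # grades increase by delta < epsilon

def prefers(i, j):
    """True if candidate i is strictly preferred to candidate j under the two rules."""
    if b[i] >= b[j] + epsilon:
        return True
    return a[i] > a[j] and abs(b[i] - b[j]) < epsilon

assert all(prefers(i, i + 1) for i in range(n - 1))   # r1 > r2 > ... > rn (similar grades, more experience)
assert prefers(n - 1, 0)                              # but rn > r1, since delta*(n-1) > epsilon
print("cyclic preferences confirmed")
```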

2.3.2 Preferences Among Distributions

When We Cannot Select Rewards Directly


In most problems, we cannot choose the rewards directly. Rather, we must make
some decision, and then obtain a reward depending on this decision. Since we may be
uncertain about the outcome of a decision, we can specify our uncertainty regarding
the rewards obtained by a decision in terms of a probability distribution.

Example 2.8 (Route selection) Assume that you have to pick between two routes
P1 , P2 . Your preferences are such that shorter time routes are preferred over longer
ones. For simplicity, let R = {10, 15, 30, 35} be the possible times it might take to
reach your destination. Route P1 takes 10 min when the road is clear, but 30 min
when the traffic is heavy. The probability of heavy traffic on P1 is 0.5. On the other
hand, route P2 takes 15 min when the road is clear, but 35 min when the traffic is
heavy. The probability of heavy traffic on P2 is 0.2.
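
A quick numerical companion to Example 2.8 (a sketch using only the numbers given in the example): the best-case, worst-case and expected travel times of the two routes, which are exactly the quantities compared in the discussion below.

```python
# Travel-time distributions from Example 2.8: (minutes, probability).
P1 = [(10, 0.5), (30, 0.5)]
P2 = [(15, 0.8), (35, 0.2)]

def expected(dist):
    return sum(t * p for t, p in dist)

for name, dist in [("P1", P1), ("P2", P2)]:
    times = [t for t, _ in dist]
    print(name, "best:", min(times), "worst:", max(times), "mean:", expected(dist))
# P1 wins on both the best case (10 vs 15) and the worst case (30 vs 35),
# but P2 has the lower expected time (19 vs 20).
```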

Preferences Among Probability Distributions


As seen in the previous example, we frequently have to define preferences between
probability distributions, rather than over rewards. To represent our preferences, we
can use the same notation as before. Let P1, P2 be two distributions on (R, FR). If
we prefer P1 to P2, we write P1 ≻∗ P2. If we like P1 less than P2, write P1 ≺∗ P2.
If we like P1 as much as P2, we write P1 ∼∗ P2. Finally, as before we also use ⪯∗
and ⪰∗ to denote negations of the strict preference relations ≻∗ and ≺∗.
What would be a good principle for choosing between the two routes in
Example 2.8? Clearly route P1 gives both the lowest best-case time and the low-
est worst-case time. It thus appears as though both an extremely cautious person
(who assumes the worst-case) and an extreme optimist (who assumes the best case)
would say P1 ≻∗ P2. However, the average time taken in P2 is only 19 min versus
20 min for P1 . Thus, somebody who only took the average time into account would
prefer P2 . In the following sections, we will develop one of the most fundamental
methodologies for choices under uncertainty, based on the idea of utilities.

2.3.3 Utility

The concept of utility allows us to create a unifying framework, such that given
a particular set of rewards and probability distributions on them, we can define
preferences among distributions via their expected utility. The first step is to define
utility as a way to define a preference relation among rewards.

Definition 2.3.2 (Utility) A utility function U : R → R is said to agree with the
preference relation ⪰∗, if for all rewards a, b ∈ R

a ⪰∗ b iff U(a) ≥ U(b).

The above definition is very similar to how we defined relative likelihood in terms
of probability. For a given utility function, its expectation for a distribution over
rewards is defined as follows.

Table 2.1 A simple gambling problem

r                       U(r)    P      Q
Did not enter           0       1      0
Paid 1 CU and lost      −1      0      0.99
Paid 1 CU and won 10    9       0      0.01

Definition 2.3.3 (Expected utility) Given a utility function U , the expected utility
of a distribution P on R is

E_P(U) = ∫_R U(r) dP(r).

We make the assumption that the utility function is such that the expected utility
remains consistent with the preference relations between all probability distributions
we are choosing between.

Assumption 2.3.4 (The expected utility hypothesis) Given a preference relation ≽∗
over R and a corresponding utility function U, the utility of any probability measure
P on R is equal to the expected utility of the reward under P. Consequently,

P ≽∗ Q iff E_P(U) ≥ E_Q(U). (2.2)

Example 2.9 Consider the following decision problem. You have the option of enter-
ing a lottery for 1 currency unit (CU). The prize is 10 CU and the probability of
winning is 0.01. This can be formalised by making it a choice between two proba-
bility distributions: P, where you do not enter the lottery, and Q, which represents
entering the lottery.
Calculating the expected utility obviously gives 0 for not entering and E(U | Q) = Σ_r U(r) Q(r) = −0.9 for entering the lottery, cf. Table 2.1.
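The calculation from Table 2.1 can be sketched directly:

# Sketch: expected utility of the lottery in Example 2.9 (Table 2.1).
U = {"not enter": 0, "paid and lost": -1, "paid and won": 9}
P = {"not enter": 1.0, "paid and lost": 0.0, "paid and won": 0.0}    # do not enter
Q = {"not enter": 0.0, "paid and lost": 0.99, "paid and won": 0.01}  # enter

def expected_utility(dist):
    return sum(dist[r] * U[r] for r in U)

print(expected_utility(P), expected_utility(Q))  # 0.0 and -0.9 (up to float rounding)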

Monetary Rewards
Frequently, rewards come in the form of money. In general, it is assumed that people
prefer to have more money than less money. However, while the utility of monetary
rewards is generally assumed to be increasing, it is not necessarily linear. For example,
1,000 Euros are probably worth more to somebody with only 100 Euros in the bank
than to somebody with 100,000 Euros. Hence, it seems reasonable to assume that
the utility of money is concave.
The following examples show the consequences of the expected utility hypothesis.

Example 2.10 Choose between the following two gambles:


1. The reward is 500,000 with certainty.
2. The reward is 2,500,000 with probability 0.10. It is 500,000 with probability 0.89,
and 0 with probability 0.01.

Example 2.11 Choose between the following two gambles:


1. The reward is 500,000 with probability 0.11, or 0 with probability 0.89.
2. The reward is 2,500,000 with probability 0.1, or 0 with probability 0.9.

Exercise 2.3.5 Show that under the expected utility hypothesis, if gamble 1 is pre-
ferred in Example 2.10, gamble 1 must also be preferred in Example 2.11 for
any utility function.

In practice, you may find that your preferences are not aligned with what this
exercise suggests. This implies that either your decisions do not conform to the
expected utility hypothesis, or that you are not internalising the given probabilities.
We will explore this further in the following example.
The St. Petersburg Paradox
The following simple example illustrates the fact that, internally, most humans do
not behave in ways that are compatible with linear utility for money.

Example 2.12 (The St. Petersburg Paradox, Bernoulli 1713) A coin is tossed repeatedly until the coin comes up heads. The player then obtains 2^n currency units, where
n ∈ {1, 2, . . .} is the number of times the coin was thrown. The coin is assumed to
be fair, meaning that the probability of heads is always 1/2.
How many currency units k would you be willing to pay to play this game once?

As the probability to stop at round n is 2^{−n}, the expected monetary gain of the
game is

Σ_{n=1}^{∞} 2^n 2^{−n} = ∞.

Were your utility function linear, you would be willing to pay any finite amount k
to play, as the expected utility for playing the game for any finite k is

Σ_{n=1}^{∞} U(2^n − k) 2^{−n} = ∞.

It would be safe to assume that very few readers would be prepared to pay an
arbitrarily high amount to play this game. One way to explain this is that the utility
function is not linear; for example, it could be logarithmic. If we also assume that the player has an initial capital C from which k has to be paid, the expected utility of playing becomes

EU = Σ_{n=1}^{∞} ln(C + 2^n − k) 2^{−n}.

Then for C = 10, 000 the maximum bet would be 14. For C = 100 it would be 6,
while for C = 10 it is just 4.
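A small numerical sketch of this calculation is given below. It assumes that the player accepts a bet k whenever the expected utility of playing is at least ln C, the utility of not playing; the series is truncated where its terms become negligible. Depending on the exact convention used, the break-even bet may differ by a unit or two from the values quoted above.

import math

def expected_log_utility(C, k, n_max=60):
    """Expected utility of paying k to play, with capital C and utility ln."""
    return sum(math.log(C + 2 ** n - k) * 2 ** -n for n in range(1, n_max + 1))

def max_acceptable_bet(C):
    """Largest integer bet k whose expected utility is at least ln(C)."""
    k = 0
    while k + 1 < C + 2 and expected_log_utility(C, k + 1) >= math.log(C):
        k += 1
    return k

for C in (10_000, 100, 10):
    print(C, max_acceptable_bet(C))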
There is another reason why one may not pay an arbitrary amount to play this
game. The player may not fully internalise the fact (or rather, the promise) that
the coin is unbiased. Another explanation would be that it is not really believed
that the bank can pay an unbounded amount of money. Indeed, if the bank can
only pay amounts up to N , in the linear expected utility scenario, for a coin with
probability p of coming heads, we have

Σ_{n=1}^{N} 2^n p^{n−1} (1 − p) = 2(1 − p) · (1 − (2p)^N) / (1 − 2p).

For large N and p = 0.45, it turns out that you should only expect a payoff of about 10
currency units. Similarly, for a fair coin and N = 1024 you should only pay around
10 units as well. These are possible subjective beliefs that an individual might have
that would influence their behaviour when dealing with a formally specified decision
problem.
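A quick numerical check of the closed form above (with a hypothetical truncation at N = 50 rounds and p = 0.45; p must differ from 1/2 for the closed form to apply):

# Sketch: expected payoff when the bank only covers N rounds, cf. the formula above.
def truncated_payoff(p, N):
    return sum(2 ** n * p ** (n - 1) * (1 - p) for n in range(1, N + 1))

def closed_form(p, N):
    return 2 * (1 - p) * (1 - (2 * p) ** N) / (1 - 2 * p)

p, N = 0.45, 50
print(truncated_payoff(p, N), closed_form(p, N))   # both approximately 10.9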

2.3.4 Measuring Utility*

Since we cannot even rely on linear utility for money, we need to ask ourselves how
we can measure the utility of different rewards. There are a number of ways, including
trying to infer it from the actions of people. The simplest approach is to simply ask
them to make even money bets. No matter what approach we use, however, we need
to make some assumptions about the utility structure. This includes whether or not
we should accept that the expected utility hypothesis holds for the observed human
behaviour.
Experimental Measurement of Utility

Example 2.13 Let ⟨a, b⟩ denote a lottery ticket that yields a or b CU with equal
probability. Consider the following sequence:
1. Find x1 such that receiving x1 CU with certainty is equivalent to receiving ⟨a, b⟩.
2. Find x2 such that receiving x2 CU with certainty is equivalent to receiving ⟨a, x1⟩.
3. Find x3 such that receiving x3 CU with certainty is equivalent to receiving ⟨x1, b⟩.
4. Find x4 such that receiving x4 CU with certainty is equivalent to receiving ⟨x2, x3⟩.

The above example algorithm allows us to measure the utility of money under the
assumption that the expected utility hypothesis holds. Note that if x1 ≠ x4, then the

Fig. 2.1 Linear (x), convex (e^x − 1) and concave (ln(x + 1)) functions

expected utility hypothesis is violated, since it implies U(x1) = U(x4) = 1/2 (U(a) + U(b)).
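Under the expected utility hypothesis, the elicitation procedure of Example 2.13 is internally consistent. A quick sketch with an assumed utility U(x) = √x (chosen purely for illustration) confirms that x1 = x4:

# Sketch of Example 2.13 with an assumed utility U(x) = sqrt(x).
# certainty_equivalent(p, q) returns the x whose utility equals the expected
# utility of the even-odds lottery <p, q>, i.e. U(x) = (U(p) + U(q)) / 2.
import math

def U(x):
    return math.sqrt(x)

def U_inv(u):
    return u ** 2

def certainty_equivalent(p, q):
    return U_inv((U(p) + U(q)) / 2)

a, b = 0.0, 100.0
x1 = certainty_equivalent(a, b)
x2 = certainty_equivalent(a, x1)
x3 = certainty_equivalent(x1, b)
x4 = certainty_equivalent(x2, x3)
print(x1, x4)   # 25.0 25.0 -- consistent, as the expected utility hypothesis requires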

2.3.5 Convex and Concave Utility Functions

As previously mentioned, utility functions of monetary rewards seem to be concave.


In general, we would say that a concave utility function implies risk aversion and
a convex one risk taking. Intuitively, a risk averse person prefers a fixed amount of
money to a random amount of money with the same expected value. A risk taker
prefers to gamble. Let’s start with the definition of a convex function, where in the
following we assume that Ω is a convex subset of R^n, that is, Ω contains with any
two points x, y also the line segment between x and y.
Definition 2.3.6 A function g : Ω → R is convex on A ⊂ Ω if for any points x, y ∈
A and any α ∈ [0, 1]

αg(x) + (1 − α)g(y) ≥ g (αx + (1 − α)y) .

An important property of convex functions is that they are bounded from above
by linear segments connecting their points. This property is formally given below.

Theorem 2.3.7 (Jensen’s inequality) Let g be a convex function on Ω and P be a


measure with P(Ω) = 1. Then for any x ∈ Ω such that E(x) and E[g(x)] exist, it
holds that
E[g(x)] ≥ g[E(x)].

Example 2.14 If the utility function is convex, then we would prefer obtaining a
random reward x rather than a fixed reward y = E(x). Thus, a convex utility function
implies risk-taking. This is illustrated by Fig. 2.1, which shows a linear function, x,
a convex function, e^x − 1, and a concave function, ln(x + 1).
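A small Monte Carlo illustration of Jensen's inequality, using the functions of Fig. 2.1 and x uniform on [0, 1] (so E(x) = 1/2):

# Monte Carlo check of Jensen's inequality for the functions in Fig. 2.1.
import math, random

random.seed(0)
xs = [random.random() for _ in range(100_000)]   # x ~ Uniform[0, 1]

for name, g in [("convex  e^x - 1", lambda x: math.exp(x) - 1),
                ("concave ln(x+1)", lambda x: math.log(x + 1))]:
    Eg = sum(g(x) for x in xs) / len(xs)    # E[g(x)]
    gE = g(sum(xs) / len(xs))               # g(E[x])
    print(f"{name}: E[g(x)] = {Eg:.3f}, g(E[x]) = {gE:.3f}")
# convex:  E[g(x)] ~ 0.718 >= g(E[x]) ~ 0.649
# concave: E[g(x)] ~ 0.386 <= g(E[x]) ~ 0.405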

Definition 2.3.8 A function g is concave on A ⊂ Ω if for any points x, y ∈ A and


any α ∈ [0, 1]
αg(x) + (1 − α)g(y) ≤ g(αx + (1 − α)y).

For concave functions, the inverse of Jensen’s inequality holds (i.e., with ≥
replaced with ≤). If the utility function is concave, then we choose a gamble giving
a fixed reward E[x] rather than one giving a random reward x. Consequently, a con-
cave utility function implies risk aversion. The act of buying insurance can be related
to the concavity of our utility function. Consider the following example, where we
assume individuals are risk-averse, but insurance companies are risk-neutral.

Example 2.15 (Insurance) Let d be the insurance cost, h our insurance cover, ε the
probability of needing the cover, and U an increasing utility function (for monetary
values). Then we are going to buy insurance if the utility of losing d with certainty
is greater than the expected utility of losing h with probability ε:

U(−d) > ε U(−h) + (1 − ε) U(0). (2.3)

The insurance company has a linear utility and fixes the premium d high enough
so that
d > εh. (2.4)

Consequently, we see from (2.4) that U(−εh) ≥ U(−d), as U is an increasing
function. From (2.3) we obtain U(−εh) > ε U(−h) + (1 − ε) U(0). Now U(−εh)
is the utility of our expected monetary loss, while the right hand side is our
expected utility. Consequently, if the inequality holds, our utility function is (at least
locally) concave.
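A minimal numerical sketch of Example 2.15. The wealth level W and the utility U(x) = √(W + x) are assumptions made purely for illustration; the premium d exceeds the expected loss εh, yet the risk-averse individual still buys the cover.

# Sketch: insurance decision under a concave (risk-averse) utility.
import math

W = 60.0     # assumed initial wealth (not part of the book's example)
h = 50.0     # insurance cover
eps = 0.01   # probability of needing the cover
d = 0.6      # premium, chosen so that d > eps * h (= 0.5)

def U(x):
    return math.sqrt(W + x)   # assumed increasing, concave utility

buy = U(-d)                                # lose the premium with certainty
no_buy = eps * U(-h) + (1 - eps) * U(0)    # risk losing h with probability eps
print(buy > no_buy)   # True: the risk-averse agent prefers to insure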

2.4 Exercises

Exercise 2.4.1 Preferences are transitive if they are induced by a utility function
U : R → R such that a ≻∗ b iff U(a) > U(b). Give an example of a utility function,
not necessarily mapping the rewards R to R, and a binary relation > such that
transitivity can be violated. Back your example with a thought experiment.

Exercise 2.4.2 Assuming that U is increasing and absolutely continuous, consider


the following experiment:

1. You specify an amount a, then observe random value Y .


2. If Y ≥ a, you receive Y currency units.
3. If Y < a, you receive a random amount X with known distribution (independent
of Y ).
Show that we should choose a such that U (a) = E[U (X )].

Exercise 2.4.3 (Usefulness of probability and utility)


1. Would it be useful to separate randomness from uncertainty? What would be
desirable properties of an alternative concept to probability?
2. Give an example of how the expected utility assumption might be violated.

Exercise 2.4.4 Consider two urns, each containing red and blue balls. The first urn
contains an equal number of red and blue balls. The second urn contains a randomly
chosen proportion X of red balls, i.e., the probability of drawing a red ball from that
urn is X .
1. Suppose that you were to select an urn, and then choose a random ball from
that urn. If the ball is red, you win 1 CU, otherwise nothing. Show that if your
utility function is increasing with monetary gain, you should prefer the first urn
iff E(X) < 1/2.
2. Suppose that you were to select an urn, and then choose n random balls from
that urn and that urn only. Each time you draw a red ball, you gain 1 CU. After
you draw a ball, you put it back in the urn. Assume that the utility U is strictly
concave and suppose that E(X) = 1/2. Show that you should always select balls
from the first urn.
Hint: Show that for the second urn, E(U | x) is concave for 0 ≤ x ≤ 1 (this can be
done by showing d²/dx² E(U | x) < 0). In fact,

d²/dx² E(U | x) = n(n − 1) Σ_{k=0}^{n−2} [U(k) − 2U(k + 1) + U(k + 2)] (n−2 choose k) x^k (1 − x)^{n−2−k}.

Then apply Jensen’s inequality.

Exercise 2.4.5 (Defining likelihood relations via Probability measures) Show that
a probability measure P on (Ω, F) satisfies the following:
1. For any events A, B ∈ F either P(A) > P(B), P(B) > P(A) or P(A) = P(B).
2. If Ai, Bi are partitions of A, B such that P(Ai) ≤ P(Bi) for all i, then
P(A) ≤ P(B).
3. For any event A, P(∅) ≤ P(A) and P(∅) < P(Ω).

Exercise 2.4.6 (Definition of conditional probability) Recall that P(A | B) ≜ P(A ∩ B)/P(B) is only a definition. Think of a plausible alternative for a definition of
conditional probability. Note that the conditional probability P(A | B) is just a new
probability measure that shall satisfy the following basic properties of a probability

measure: (a) null probability: P(∅ | B) = 0, (b) total probability: P(Ω | B) = 1,


(c) union of disjoint subsets: P(A1 ∪ A2 | B) = P(A1 | B) + P(A2 | B), (d) con-
ditional probability: P(A | D) ≤ P(B | D) if and only if P(A ∩ D) ≤ P(B ∩ D).

Exercise 2.4.7 (Alternatives to the expected utility hypothesis) The expected utility
hypothesis states that we prefer decision P over Q if and only if our expected utility
under the distribution P is larger than that under Q, i.e., E P (U ) ≥ E Q (U ). Under
what conditions do you think this is a reasonable hypothesis? Can you come up with
a different rule for making decisions under uncertainty? Would it still satisfy the total
order and transitivity properties of preference relations? In other words, could you
still unambiguously say for any P, Q whether you prefer P to Q? If you had three
choices, P, Q, W , and you preferred P to Q and Q to W, would you always prefer P
to W ?

Exercise 2.4.8 (Rational Arthur-Merlin games) You are Arthur, and you wish to pay
Merlin to do a very difficult computation for you. More specifically, you perform a
query q ∈ Q and obtain an answer r ∈ R from Merlin. After he gives you the answer,
you give Merlin a random amount of money m, depending on r, q. In particular, it
is assumed that there exists a unique correct answer r∗ = f(q) and that E(m | r, q) =
Σ_m m P(m | r, q) is maximized by r∗, i.e., for any r ≠ r∗,

E(m | r∗, q) > E(m | r, q).

Assume that Merlin knows P and the function f . Is this sufficient to incentivize Mer-
lin to respond with the correct answer? If not, what other assumptions or knowledge
are required?

Exercise 2.4.9 Assume that you need to travel over the weekend. You wish to decide
whether to take the train or the car. Assume that the train and the car trip cost exactly
the same amount of money. The train trip takes 2 h. If it does not rain, then the car
trip takes 1.5 h. However, if it rains the road becomes both more slippery and more
crowded and so the average trip time is 2.5 h. Assume that your utility function is
equal to the negative amount of time spent travelling: U (t) = −t.
1. Let it be Friday. What is the expected utility of taking the car on Sunday? What
is the expected utility of taking the train on Sunday? What is the Bayes-optimal
decision, assuming you will travel on Sunday?
2. Consider two stations H1 and H2 that predict rain on Saturday and Sunday with
probabilities 0.4, 0.6 (H1 ) and 0.9, 0.1 (H2 ), respectively. Let it be a rainy Sat-
urday, which we denote by event A. What is your posterior probability over the
two weather stations, given that it has rained, i.e., P(Hi | A)? What is the new
marginal probability of rain on Sunday, i.e., P(B | A)? What is now the expected
utility of taking the car versus taking the train on Sunday? What is the Bayes-
optimal decision?

Exercise 2.4.10 Consider the previous example with a nonlinear utility function.
1. One example is U (t) = 1/t, which is a convex utility function. How would you
interpret the utility in that case? Without performing the calculations, can you
tell in advance whether your optimal decision can change? Verify your answer
by calculating the expected utility of the two possible choices.
2. How would you model a problem where the objective involves arriving in time
for a particular appointment?

Reference

1. DeGroot, M.H.: Optimal Statistical Decisions. Wiley (1970)


Chapter 3
Decision Problems

3.1 Introduction

In this chapter we describe how to formalize statistical decision problems. These


involve making decisions whose utility depends on an unknown state of the world.
In this setting, it is common to assume that the state of the world is a fundamental
property that is not influenced by our decisions. However, we can calculate a proba-
bility distribution for the state of the world, using a prior belief and some data, where
the data we obtain may depend on our decisions.
A classical application of this framework is parameter estimation. Therein, we
stipulate the existence of a parametrized law of nature, and we wish to choose
a best-guess set of parameters for the law through measurements and some prior
information. An example would be determining the gravitational attraction constant
from observations of planetary movements. These measurements are always obtained
through experiments, the automatic design of which will be covered in later chapters.
The decisions we make will necessarily depend on both our prior belief and the
data we obtain. In the last section of this chapter we will examine how sensitive our
decisions are to the prior, and how we can choose it so that our decisions are robust.

3.2 Rewards that Depend on the Outcome


of an Experiment

Consider the problem of choosing one of two different types of tickets in a raffle.
Each type of ticket gives you the chance to win a different prize. The first is a bicycle
and the second is a tea set. The winning ticket for each prize is drawn uniformly from
the respective sold tickets. Thus, the raffle guarantees that somebody will win each
prize. If most people opt for the bicycle, your chance of actually winning it by buying
a single ticket is much smaller. However, if you prefer winning a bicycle to winning


the tea set, it is not clear what choice you should make in the raffle. The above is the
quintessential example for problems where the reward that we obtain depends not
only on our decisions, but also on the outcome of an experiment.
This problem can be viewed more generally for scenarios where the reward you
receive depends not only on your own choice, but also on some other, unknown fact
in the world. This may be something completely uncontrollable, and hence you only
can make an informed guess.
More formally, given a set of possible actions A, we must make a decision a ∈ A
before knowing the outcome ω of an experiment with outcomes in Ω. After the
experiment is performed, we obtain a reward r ∈ R which depends on both the
outcome ω of the experiment and our decision. As discussed in the previous chapter,
our preferences for some rewards over others are determined by a utility function
U : R → R, such that we prefer r to r′ if and only if U(r) ≥ U(r′). Now, however,
we cannot choose rewards directly. Another example, which will be used throughout
this section, is the following.

Example 3.1 (Taking the umbrella) We must decide whether to take an umbrella to
work. Our reward depends on whether we get wet and the amount of objects that we
carry. We would rather not get wet and not carry too many things, which can be made
more precise by choosing an appropriate utility function. For example, we might put
a value of −1 for carrying the umbrella and a value of −10 for getting wet. In this
example, the only events of interest are whether it rains or not.

3.2.1 Formalisation of the Problem Setting

The elements we need to formulate the problem setting are a random variable, a
decision variable, a reward function mapping the random and the decision variable
to a reward, and a utility function that says how much we prefer each reward.

Assumption 3.2.1 (Outcomes) There exists a probability measure P on (Ω, FΩ )


such that the probability of the random outcome ω being in A ∈ FΩ is

P(ω ∈ A) = P(A).

The probability measure P is completely independent of any decision that we


make.
Assumption 3.2.2 (Utilities) Given a set of rewards R, our preferences satisfy
Assumptions 2.1.1, 2.1.2, and 2.1.3, i.e., preferences are transitive, all rewards are
comparable, and there exists a utility function U , measurable with respect to FR
such that U(r) ≥ U(r′) iff r ≽∗ r′.
Since the random outcome ω does not depend on our decision a, we must find
a way to connect the two. This can be formalized via a reward function, so that the

reward that we obtain (whether we get wet or not) depends on both our decision (to
take the umbrella) and the random outcome (whether it rains).
Definition 3.2.1 (Reward function) A reward function ρ : Ω × A → R defines the
reward we obtain if we select a ∈ A and the experimental outcome is ω ∈ Ω:
r = ρ(ω, a)
When we discussed the problem of choosing between distributions in Sect. 2.3.2,
we had directly defined probability distributions on the set of rewards. We can
now formulate our problem in that setting. First, we define a set of distributions
{Pa | a∈ A} on the reward space (R, FR ), such that the decision a amounts to choos-
ing a particular distribution Pa on the rewards.
Example 3.2 (Rock paper scissors) Consider the simple game of rock paper scissors,
where your opponent plays a move at the same time as you, so that you cannot
influence his move. The opponent’s moves are Ω = {ωR , ωP , ωS } for rock, paper,
scissors, respectively, which also corresponds to your decision set A = {aR , aP , aS }.
The reward set is R = {Win, Draw, Lose}.
You have studied your opponent for some time and you believe that he is most
likely to play rock (P(ωR) = 3/6), somewhat likely to play paper (P(ωP) = 2/6), and less
likely to play scissors (P(ωS) = 1/6).
What is the probability of each reward, for each decision you make? Taking
the example of aR , we see that you win if the opponent plays scissors with
probability 1/6, you lose if the opponent plays paper (2/6), and you draw if he plays rock
(3/6). Consequently, we can convert the outcome probabilities to reward probabilities
for every decision:
PaR (Win) = 1/6, PaR (Draw) = 3/6, PaR (Lose) = 2/6,
PaP (Win) = 3/6, PaP (Draw) = 2/6, PaP (Lose) = 1/6,
PaS (Win) = 2/6, PaS (Draw) = 1/6, PaS (Lose) = 3/6.

Of course, what you play depends on your own utility function. If you prefer winning
over drawing or losing, you could for example have the utility function U(Win) = 1,
U(Draw) = 0, U(Lose) = −1. Then, since E_a U = Σ_{ω∈Ω} U(ω, a) P(ω), we have

E_{aR} U = −1/6,  E_{aP} U = 2/6,  E_{aS} U = −1/6,

so that based on your belief, choosing paper is best.
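The computation of Example 3.2 is easy to reproduce; a short sketch:

# Sketch of Example 3.2: converting outcome probabilities into expected utilities.
P = {"R": 3/6, "P": 2/6, "S": 1/6}          # belief about the opponent's move
U = {"Win": 1, "Draw": 0, "Lose": -1}
beats = {"R": "S", "P": "R", "S": "P"}      # key beats value

def reward(opponent, action):
    if action == opponent:
        return "Draw"
    return "Win" if beats[action] == opponent else "Lose"

for a in ("R", "P", "S"):
    eu = sum(P[w] * U[reward(w, a)] for w in P)
    print(a, round(eu, 3))
# R -0.167, P 0.333, S -0.167  -> paper maximizes expected utility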


The above example illustrates that every decision that we make creates a corre-
sponding probability distribution on the rewards. While the outcome of the experi-
ment is independent of the decision, the distribution of rewards is effectively chosen
by our decision.

[Figure 3.1: (a) The combined decision problem; (b) The separated decision problem]

Fig. 3.1 Decision diagrams for the combined and separated formulation of the decision problem.
Squares denote decision variables, diamonds denote utilities. All other variables are denoted by
circles. Arrows denote the flow of dependency

Expected utility
The expected utility of any decision a ∈ A under P is:
 
E_{Pa}(U) = ∫_R U(r) dPa(r) = ∫_Ω U[ρ(ω, a)] dP(ω)

From now on, we shall use the simple notation

U(P, a) ≜ E_{Pa} U

to denote the expected utility of a under distribution P.

Instead of viewing the decision as effectively choosing a distribution over rewards


(Fig. 3.1a) we can separate the random part of the process from the deterministic
part (Fig. 3.1b) by considering a measure P on the space of outcomes Ω, such that
the reward depends on both a and the outcome ω ∈ Ω through the reward function
ρ(ω, a). The optimal decision is of course always the a ∈ A maximizing E(U | Pa ).
However, this structure allows us to clearly distinguish the controllable from the
random part of the rewards.

The probability measure induced by decisions


For every a ∈ A, the function ρ : Ω × A → R induces a probability distribu-
tion Pa on R. In fact, for any B ∈ FR :

Pa(B) ≜ P(ρ(ω, a) ∈ B) = P({ω | ρ(ω, a) ∈ B})



Table 3.1 Rewards, utilities, expected utility for 20% probability of rain

ρ(ω, a)        a1                        a2
ω1             Dry, carrying umbrella    Wet
ω2             Dry, carrying umbrella    Dry

U[ρ(ω, a)]     a1                        a2
ω1             −1                        −10
ω2             −1                        0
E_P(U | a)     −1                        −2

The above equation requires that the following technical assumption is satisfied.
As usual, we employ the expected utility hypothesis (Assumption 2.3.4). Thus, we
should choose the decision that results in the highest expected utility.
Assumption 3.2.3 The sets {ω | ρ(ω, a) ∈ B} belong to FΩ . That is, ρ is FΩ -
measurable for any a.
The dependency structure of this problem in either formulation can be visualized
in the decision diagram shown in Fig. 3.1.
Example 3.3 (Continuation of Example 3.1) You are going to work, and it might
rain. The forecast said that the probability of rain (ω1 ) was 20%. What do you do?
• a1 : Take the umbrella.
• a2 : Risk it!
The reward of a given outcome and decision combination, as well as the respective
utility is given in Table 3.1.

3.2.2 Decision Diagrams

Decision diagrams, also known as decision networks or influence diagrams, are used
to show dependencies between different variables. As illustrated in the examples
shown in Fig. 3.1, if an arrow points from a variable x to a variable y it means that
y depends on x. In other words, y has x as an input. In general, decision diagrams
include the following types of nodes:
• Choice nodes (denoted by squares) are nodes whose values can be directly chosen
by the decision maker. In general there may be more than one decision maker
involved.
• Value nodes (denoted by diamonds) are the nodes that the decision maker is inter-
ested in influencing. That is, the utility of the decision maker is always a function
of the value nodes.
• Circle nodes are used to denote all other types of variables. These include deter-
ministic or stochastic variables.

• Line style and colour represent quantities that are part of the problem, but are not
observed by one or more of the decision makers. These are usually called latent
variables.
Let us take a look at the example of Fig. 3.1b: the reward is a function of both
ω and a, i.e., r = ρ(ω, a), while ω depends only on the probability distribution P.
Typically, there must be a path from a choice node to a value node, otherwise nothing
the decision maker can do will influence the utility. Nodes belonging to or observed
by different players will usually be denoted by different lines or colors. In Fig. 3.1b,
ω, which is not observed, is shown with a dashed line.

3.2.3 Statistical Estimation*

Statistical decision problems arise particularly often in parameter estimation, such


as estimating the covariance matrix of a Gaussian random variable. In this setting,
the unknown outcome of the experiment ω is called a parameter, while the set of
outcomes Ω is called the parameter space. Classical statistical estimation involves
selecting a single parameter value on the basis of observations. This requires us to
specify a preference for different types of estimation errors, and is distinct from the
standard Bayesian approach to estimation, which calculates a full distribution over
all possible parameters.
A simple example is estimating the distribution of votes in an election from a
small sample. Depending on whether we are interested in predicting the vote share of
individual parties or the most likely winner of the election, we can use a distribution
over vote shares (possibly estimated through standard Bayesian methodology) to
decide on a share or the winner.

Example 3.4 (Voting) Assume you wish to estimate the number of votes for different
candidates in an election. The unknown parameters of the problem mainly include:
the percentage of likely voters in the population, the probability that a likely voter is
going to vote for each candidate. One simple way to estimate this is by polling.
Consider a nation with k political parties. Let ω = (ω1 , . . . , ωk ) ∈ [0, 1]k be the
voting proportions for each party. We wish to make a guess a ∈ [0, 1]k . How should
we guess, given a distribution P(ω)? How should we select U and ρ? This depends
on what our goal is, when we make the guess.
If we wish to give a reasonable estimate about the votes of all the k parties, we can
use the squared error: First, set the error vector r = (ω1 − a1, . . . , ωk − ak) ∈ [−1, 1]^k.
Then we set U(r) ≜ −‖r‖², where ‖r‖² = Σ_i |ωi − ai|².
If on the other hand, we just want to predict the winner of the election, then the
actual percentages of all individual parties are not important. In that case, we can set
r = 1 if arg maxi ωi = arg maxi ai and 0 otherwise, and U (r ) = r .

Losses and risks


In such problems, it is common to specify a loss instead of a utility. This is
usually the negative utility:
Definition 3.2.2 (Loss)

ℓ(ω, a) = −U[ρ(ω, a)]

Given the above, instead of the expected utility, we consider the expected
loss, or risk.
Definition 3.2.3 (Risk)

κ(P, a) = ∫_Ω ℓ(ω, a) dP(ω)

Of course, the optimal decision is the a ∈ A minimizing κ.

3.3 Bayes Decisions

The decision which maximizes the expected utility under a particular distribution P,
is called the Bayes-optimal decision, or simply the Bayes decision. The probability
distribution P is supposed to reflect all our uncertainty about the problem. Note that
in the following we usually drop the reward function ρ from the decision problem
and consider utility functions U that map directly from Ω × A to R.
Definition 3.3.1 (Bayes-optimal utility) Consider an outcome (or parameter) space
Ω, a decision space A, and a utility function U : Ω × A → R. For any probability
distribution P on Ω, the Bayes-optimal utility U ∗ (P) is defined as the smallest upper
bound on U (P, a) over all decisions a ∈ A. That is,

U∗(P) = sup_{a∈A} U(P, a). (3.3.1)

The maximization over decisions is usually not easy. However, there exist a few
cases where it is relatively simple. The first of those is when the utility function is
the negative squared error.
Example 3.5 (Quadratic loss) Consider Ω = R^k and A = R^k. The utility function
that, for any ω ∈ Ω and a ∈ A, is defined as

U(ω, a) = −‖ω − a‖²

is called quadratic loss.



Quadratic loss is a very important special case of utility functions, as it is easy to


calculate the optimal solution. This is illustrated by the following theorem.

Theorem 3.3.1 Let P be a measure on Ω and U : Ω × A → R be the quadratic


loss defined in Example 3.5. Then the decision

a = E P (ω)

maximizes the expected utility U(P, a), under the technical assumption that
∂/∂a ‖ω − a‖² is measurable with respect to FΩ.

Proof The expected utility of decision a is given by



U(P, a) = −∫_Ω ‖ω − a‖² dP(ω).

Taking derivatives, due to the measurability assumption, we can swap the order of
differentiation and integration and obtain

∂/∂a ∫_Ω ‖ω − a‖² dP(ω) = ∫_Ω ∂/∂a ‖ω − a‖² dP(ω)
                        = 2 ∫_Ω (a − ω) dP(ω)
                        = 2 ∫_Ω a dP(ω) − 2 ∫_Ω ω dP(ω)
                        = 2a − 2E(ω).

Setting the derivative equal to 0 and noting that the utility is concave, we see that the
expected utility is maximized for a = E P (ω). 

Another simple example is the absolute error, where U(ω, a) = −|ω − a|. The
solution in this case differs significantly from the squared error. As can be seen from
Fig. 3.2a, for absolute loss, the optimal decision is to choose the a that is closest to
the most likely ω. Figure 3.2b illustrates the finding of Theorem 3.3.1.
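The two cases can also be compared numerically. The sketch below uses the same setting as Fig. 3.2: ω ∈ {0, 1}, a ∈ [0, 1], and P(ω = 0) given; under quadratic loss the maximizer is E(ω) (Theorem 3.3.1), while under absolute loss it is the most likely ω.

# Sketch: Bayes decisions under absolute vs quadratic error, as in Fig. 3.2.
def best_decision(p0, utility, grid_size=1001):
    """Grid search for the a in [0, 1] maximizing expected utility when
    P(omega = 0) = p0 and omega takes values in {0, 1}."""
    grid = [i / (grid_size - 1) for i in range(grid_size)]
    def expected_u(a):
        return p0 * utility(0, a) + (1 - p0) * utility(1, a)
    return max(grid, key=expected_u)

absolute = lambda w, a: -abs(w - a)
quadratic = lambda w, a: -(w - a) ** 2

for p0 in (0.1, 0.25, 0.75):
    print(p0, best_decision(p0, absolute), best_decision(p0, quadratic))
# Absolute error picks the most likely omega (a = 1.0 for p0 < 0.5, a = 0.0 for p0 > 0.5);
# quadratic error picks E(omega) = 1 - p0.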

3.3.1 Convexity of the Bayes-Optimal Utility*

Although finding the optimal decision for an arbitrary utility U and distribution P
may be difficult, fortunately the Bayes-optimal utility has some nice properties which
enable it to be approximated rather well. In particular, for any decision, the expected
utility is linear with respect to our belief P. Consequently, the Bayes-optimal utility
is convex with respect to P. This firstly implies that there is a unique “worst” dis-
tribution P, against which we cannot do very well. Secondly, we can approximate

Fig. 3.2 Expected utility curves for different values of P(ω = 0) in {0.1, 0.25, 0.5, 0.75}, as the decision a varies in [0, 1]. (a) Absolute error. (b) Quadratic error

the Bayes-utility very well for all possible distributions by generalizing from a small
number of distributions. In order to define linearity and convexity, we first introduce
the concept of a mixture of distributions.
Consider two probability measures P, Q on (Ω, FΩ ). These define two alter-
native distributions for ω. For any P, Q and α ∈ [0, 1], we define the mixture of
distributions

Zα ≜ αP + (1 − α)Q (3.3.2)

to mean the probability measure such that Z α (A) = αP(A) + (1 − α)Q(A) for any
A ∈ FΩ . For any fixed choice a, the expected utility varies linearly with α:

Remark 3.3.1 (Linearity of the expected utility) If Z α is as defined in (3.3.2), then,


for any a ∈ A,
U (Z α , a) = α U (P, a) + (1 − α) U (Q, a).

Proof

U(Zα, a) = ∫_Ω U(ω, a) dZα(ω)
         = α ∫_Ω U(ω, a) dP(ω) + (1 − α) ∫_Ω U(ω, a) dQ(ω)
         = α U(P, a) + (1 − α) U(Q, a). □

However, if we consider Bayes-optimal decisions, this is no longer true, because


the optimal decision depends on the distribution. In fact, the utility of Bayes-optimal
decisions is convex, as the following theorem shows.

Theorem 3.3.2 For probability measures P, Q on Ω and any α ∈ [0, 1],

U ∗ [Z α ] ≤ α U ∗ (P) + (1 − α) U ∗ (Q).

Proof From the definition of the expected utility (3.3.1), for any decision a ∈ A,

U (Z α , a) = α U (P, a) + (1 − α) U (Q, a).

Hence, by definition (3.3.1) of the Bayes-utility, we have

U∗(Zα) = sup_{a∈A} U(Zα, a) = sup_{a∈A} [α U(P, a) + (1 − α) U(Q, a)].

As sup_x [f(x) + g(x)] ≤ sup_x f(x) + sup_x g(x), we obtain

U∗[Zα] ≤ α sup_{a∈A} U(P, a) + (1 − α) sup_{a∈A} U(Q, a) = α U∗(P) + (1 − α) U∗(Q). □

As we have proven, the expected utility is linear with respect to Zα. Thus, for any
fixed action a we obtain a line such as those shown in Fig. 3.3. By Theorem 3.3.2, the
Bayes-optimal utility is convex. Furthermore, the expected-utility line of the optimal
decision for any Zα is tangent to the Bayes-optimal utility at the point (Zα, U∗(Zα)).
If we take a decision that is optimal with respect to some Z, but the distribution is in
fact Q ≠ Z, then we are not far from the optimal, if Q, Z are close and U∗ is smooth.
Consequently, we can trivially lower bound the Bayes utility by examining any
arbitrary finite set of decisions Â ⊆ A. That is,

U∗(P) ≥ max_{a∈Â} U(P, a)

for any probability distribution P on Ω. In addition, we can upper-bound the Bayes


utility as follows. Take any two distributions P1 , P2 over Ω. Then, the upper bound

U∗(αP1 + (1 − α)P2) ≤ α U∗(P1) + (1 − α) U∗(P2)

holds due to convexity. The two bounds suggest an algorithm for successive approximation of the Bayes-optimal utility, by looking for the largest gap between the lower and the upper bounds.

Fig. 3.3 A strictly convex Bayes utility
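Both bounds are easy to illustrate numerically. The sketch below uses a two-point outcome space with quadratic utility (an assumption made purely for illustration), for which the Bayes-optimal utility is U∗(P) = −p(1 − p) with p = P(ω = 1).

# Sketch: lower and upper bounds on the Bayes-optimal utility (two-point example).
def U(p, a):        # expected quadratic utility, omega in {0, 1}, P(omega = 1) = p
    return -((1 - p) * a ** 2 + p * (1 - a) ** 2)

def U_star(p):      # Bayes-optimal utility, attained at a = p
    return -p * (1 - p)

decisions = [0.0, 0.25, 0.5, 0.75, 1.0]          # a small finite set of decisions
p1, p2, alpha = 0.2, 0.9, 0.3
p_mix = alpha * p1 + (1 - alpha) * p2

lower = max(U(p_mix, a) for a in decisions)              # lower bound from fixed decisions
upper = alpha * U_star(p1) + (1 - alpha) * U_star(p2)    # upper bound from convexity
print(lower, "<=", U_star(p_mix), "<=", upper)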

3.4 Statistical and Strategic Decision Making

In this section we consider more general strategies that do not simply pick a single
fixed decision. Further, we will consider other criteria than maximizing expected
utility and minimizing risk, respectively.
Strategies Instead of choosing a specific decision, we could instead choose to randomize our decision somehow. In other words, instead of our choices being specific
decisions, we can choose among distributions over decisions. For example, instead
of choosing to eat lasagna or beef, we choose to throw a coin and eat lasagna if the
coin comes heads and beef otherwise. Accordingly, in the following we will consider strategies that are probability measures on A, the set of which we will denote
by ∆(A).

Definition 3.4.1 (Strategy) A strategy σ ∈ ∆(A) is a probability distribution


over A and determines for each a in A the probability with which a is chosen.

Interestingly, for the type of problems that we have considered so far, even if we
expand our choices to the set of all possible probability measures on A, there always
is one decision (rather than a strategy) which is optimal.

Theorem 3.4.1 Consider any statistical decision problem with probability measure
P on outcomes Ω and with utility function U : Ω × A → R. Further let a ∗ ∈ A
such that U (P, a ∗ ) ≥ U (P, a) for all a ∈ A. Then for any probability measure
σ on A,
U (P, a ∗ ) ≥ U (P, σ).
Proof

U(P, σ) = ∫_A U(P, a) dσ(a)
        ≤ ∫_A U(P, a∗) dσ(a)
        = U(P, a∗) ∫_A dσ(a)
        = U(P, a∗) □

This theorem should not be applied naively. It only states that if we know P
then the expected utility of the best fixed/deterministic decision a ∗ ∈ A cannot be
increased by randomizing between decisions. For example, it does not make sense to
apply this theorem to cases where P itself is completely or partially unknown (e.g.,
when P is chosen by somebody else and its value remains hidden to us).

3.4.1 Alternative Notions of Optimality

There are some situations where maximizing expected utility with respect to the
distribution on outcomes is unnatural. Two simple examples are the following.
Maximin and minimax policies If there is no information about ω available, it may
be reasonable to take a pessimistic approach and select an a∗ that maximizes the utility
in the worst-case ω. The respective maximin value

U_* = max_a min_ω U(ω, a) = min_ω U(ω, a∗)

can essentially be seen as how much utility we would be able to obtain, if we were
to make a decision a first, and nature were to select an adversarial decision ω later.

On the other hand, the minimax value is

U^* = min_ω max_a U(ω, a) = max_a U(ω∗, a),

where ω∗ ≜ arg min_ω max_a U(ω, a) is the worst-case choice nature could make, if
we were to select our own decision a after its own choice was revealed to us.
To illustrate this, let us consider the following example.
Example 3.6 You consider attending an open air concert. The weather forecast
reports 50% probability of rain. Going to the concert (a1 ) will give you a lot of
pleasure if it doesn’t rain (ω2 ), but in case of rain (ω1 ) you actually would have
preferred to stay at home (a2). Since in general you prefer nice weather to rain, you
prefer ω2 to ω1 also in case you decide not to go. The reward of a given outcome-
decision combination, as well as the respective utility is given in Table 3.2. We see
that a1 maximizes expected utility. However, under a worst-case assumption this is
not the case, i.e., the maximin solution is a2 .
Note that by definition

U^* ≥ U(ω∗, a∗) ≥ U_*. (3.4.1)

Maximin/minimax problems are a special case of problems in game theory, in partic-


ular two-player zero-sum games. The minimax problem can be seen as a game where
the maximizing player plays first, and the minimizing player second. If U^* = U_*,
then the game is said to have a value, which implies that if both players play
optimally, it doesn't matter which player moves first. More details about these
types of problems will be given in Sect. 3.4.2.
Regret Instead of calculating the expected utility for each possible decision, we
could instead calculate how much utility we would have obtained if we had made
the best decision in hindsight. Consider, for example the problem in Table 3.2.
There the optimal action is either a1 or a2 , depending on whether we accept the
probability P over Ω, or adopt a worst-case approach. However, after we make a
specific decision, we can always look at the best decision we could have made given
the actual outcome ω.
Definition 3.4.2 (Regret) The regret of a decision is how much we lose compared to the best
decision in hindsight, that is,

L(ω, a) ≜ max_{a′∈A} U(ω, a′) − U(ω, a).

As an example let us revisit Example 3.6. Given the regret of each decision-outcome
pair, we can determine the decision minimizing expected regret E(L | P, a) and min-
imizing maximum regret maxω L(ω, a), analogously to expected utility and minimax
utility. Table 3.3 shows that the choice minimizing regret either in expectation or in
the minimax sense is a1 (going to the concert). Note that this is a different outcome
than before when we considered utility, which shows that the concept of regret may
result in different decisions.

Table 3.2 Utility function, expected utility and maximin utility of Example 3.6

U(ω, a)          a1      a2
ω1               −1       0
ω2               10       1
E(U | P, a)      4.5      0.5
min_ω U(ω, a)    −1       0

Table 3.3 Regret, in expectation and minimax for Example 3.6

L(ω, a)          a1      a2
ω1                1       0
ω2                0       9
E(L | P, a)      0.5     4.5
max_ω L(ω, a)     1       9
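The quantities in Tables 3.2 and 3.3 can be recomputed in a few lines:

# Sketch: expected utility, maximin, and regret for Example 3.6 (Tables 3.2, 3.3).
U = {("w1", "a1"): -1, ("w1", "a2"): 0, ("w2", "a1"): 10, ("w2", "a2"): 1}
P = {"w1": 0.5, "w2": 0.5}                       # 50% probability of rain
actions, outcomes = ("a1", "a2"), ("w1", "w2")

regret = {(w, a): max(U[w, b] for b in actions) - U[w, a]
          for w in outcomes for a in actions}

for a in actions:
    print(a,
          "E[U] =", sum(P[w] * U[w, a] for w in outcomes),
          "min_w U =", min(U[w, a] for w in outcomes),
          "E[L] =", sum(P[w] * regret[w, a] for w in outcomes),
          "max_w L =", max(regret[w, a] for w in outcomes))
# a1 maximizes expected utility and minimizes (expected and worst-case) regret;
# a2 is the maximin-utility choice.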

3.4.2 Solving Minimax Problems*

We now view minimax problems as two player games, where one player chooses a
and the other player chooses ω. The decision diagram for this problem is given in
Fig. 3.4, where the dashed line indicates that, from the point of view of the decision
maker, nature’s choice is unobserved before she makes her own decision. A simul-
taneous two-player game is a game where both players act without knowing each
other’s decision, see Fig. 3.5. From the point of view of the player that chooses a,
this is equivalent to assuming that ω is hidden, as shown in Fig. 3.4. There are other
variations of such games, however. For example, their moves may be revealed after
they have played. This is important in the case where the game is played repeatedly.
However, what is usually revealed is not the belief ξ, which is something assumed
to be internal to player one, but ω, the actual decision made by the first player. In
other cases, we might have that U itself is not known, and we only observe U (ω, a)
for the choices made.
Minimax utility, regret and loss In the following we again consider strategies
as defined in Definition 3.4.1. If the decision maker knows the outcome, then the
additional flexibility by randomizing over the actions does not help. As we showed

Fig. 3.4 Simultaneous two-player stochastic game. The first player (nature) chooses ω, and the second player (the decision maker) chooses a. Then the second player obtains utility U(ω, a)

Fig. 3.5 Simultaneous two-player stochastic game. The first player (nature) chooses ξ, and the second player (the decision maker) chooses σ. Then ω ∼ ξ and a ∼ σ and the second player obtains utility U(ω, a)

for the general case of a distribution over Ω, a simple decision is as good as any
randomized strategy:

Remark 3.4.1 For each ω, there is some a such that

U(ω, a) = max_{σ∈∆(A)} U(ω, σ). (3.4.2)

What follows are some rather trivial remarks connecting regret with utility in
various cases.

Remark 3.4.2

L(ω, σ) = Σ_a σ(a) L(ω, a) ≥ 0,

with equality iff σ is ω-optimal.

Proof

L(ω, σ) = max_{σ′} U(ω, σ′) − U(ω, σ) = max_{σ′} U(ω, σ′) − Σ_a σ(a) U(ω, a)
        = Σ_a σ(a) [max_{σ′} U(ω, σ′) − U(ω, a)] ≥ 0.

The equality in case of optimality is obvious. □

Remark 3.4.3

L(ω, σ) = max_a U(ω, a) − U(ω, σ)

Proof As (3.4.2) shows, for any fixed ω, the best decision is always deterministic,
so that
 
Σ_{a′} σ(a′) L(ω, a′) = Σ_{a′} σ(a′) [max_{a∈A} U(ω, a) − U(ω, a′)]
                      = max_{a∈A} U(ω, a) − Σ_{a′} σ(a′) U(ω, a′). □

Table 3.4 Even-bet utility

U      ω1    ω2
a1      1    −1
a2      0     0

Table 3.5 Even-bet regret

L      ω1    ω2
a1      0     1
a2      1     0

Remark 3.4.4 L(ω, σ) = −U(ω, σ) iff max_a U(ω, a) = 0.

Proof If

max_{σ′} U(ω, σ′) − U(ω, σ) = −U(ω, σ)

then

max_{σ′} U(ω, σ′) = max_a U(ω, a) = 0.

The converse holds as well, which finishes the proof. □


The following example demonstrates that in general the minimax regret will be
achieved by a randomized strategy.
Example 3.7 (An even-money bet) Consider the decision problem described
in Table 3.4. The respective regret is given in Table 3.5. The maximum regret of a
strategy σ can be written as

max_ω L(ω, σ) = max_ω Σ_a σ(a) L(ω, a) = max_ω Σ_i σ(a_i) I{ω ≠ ω_i},

since L(ω, a_i) = 0 when ω = ω_i and 1 otherwise. Note that max_ω L(ω, σ) ≥ 1/2 and that
equality is obtained iff σ(a_1) = σ(a_2) = 1/2, giving minimax regret L∗ = 1/2.

3.4.3 Two-Player Games

In this section we give a few more details about the connections between minimax
theory and the theory of two-player games. In particular, we extend the actions of
nature to ∆(Ω), the probability distributions over Ω, and as before consider strategies
in ∆(A).

For two distributions σ, ξ on A and Ω, we define our expected utility as



U(ξ, σ) ≜ Σ_{ω∈Ω} Σ_{a∈A} U(ω, a) ξ(ω) σ(a).

Then we define the maximin policy σ ∗ to satisfy

min_ξ U(ξ, σ∗) = U_* ≜ max_σ min_ξ U(ξ, σ).

The minimax prior ξ ∗ satisfies

max_σ U(ξ∗, σ) = U^* ≜ min_ξ max_σ U(ξ, σ),

where the solution exists as long as A and Ω are finite, which we will assume in the
following.

Expected regret
We can now define the expected regret for a given pair of distributions ξ, σ as
  
L(ξ, σ) = max_{σ′} Σ_ω ξ(ω) [U(ω, σ′) − U(ω, σ)]
        = max_{σ′} U(ξ, σ′) − U(ξ, σ).

Not all minimax and maximin policies result in the same value. The following
theorem gives a condition under which the game does have a value.
Theorem 3.4.2 If there exist distributions¹ ξ∗, σ∗ and C ∈ R such that

U(ξ∗, σ) ≤ C ≤ U(ξ, σ∗) ∀ξ, σ

then

U^* = U_* = U(ξ∗, σ∗) = C.

Proof Since C ≤ U (ξ, σ ∗ ) for all ξ we have

C ≤ min_ξ U(ξ, σ∗) ≤ max_σ min_ξ U(ξ, σ) = U_*.

Similarly,

C ≥ max_σ U(ξ∗, σ) ≥ min_ξ max_σ U(ξ, σ) = U^*.

¹ These distributions may be singular, that is, they may be concentrated in one point. For example,
σ∗ is singular if σ∗(a) = 1 for some a and σ∗(a′) = 0 for all a′ ≠ a.

By (3.4.1) it follows that


C ≥ U^* ≥ U_* ≥ C. □

Theorem 3.4.2 gives a sufficient condition for a game having a value. In fact, the
type of games we have been looking at so far are called bilinear games. For these, a
solution always exists and there are efficient methods for finding it.
Definition 3.4.3 A bilinear game is a tuple (U, Ξ, Σ, Ω, A) with U : Ξ × Σ → R
such that all ξ ∈ Ξ are arbitrary distributions on Ω and all σ ∈ Σ are arbitrary
distributions on A with

U(ξ, σ) ≜ E(U | ξ, σ) = Σ_{ω,a} U(ω, a) σ(a) ξ(ω).

Theorem 3.4.3 For a bilinear game, U^* = U_*. In addition, the following three conditions are equivalent:
1. σ∗ is maximin, ξ∗ is minimax, and U^* = C.
2. U(ξ, σ∗) ≥ C ≥ U(ξ∗, σ) for all ξ, σ.
3. U(ω, σ∗) ≥ C ≥ U(ξ∗, a) for all ω, a.

3.4.3.1 Linear Programming Formulation

While general games may be hard, bilinear games are easy, in the sense that minimax
solutions can be found with well-known algorithms. One such method is linear
programming. The problem
max_σ min_ξ U(ξ, σ),

where ξ, σ are distributions over finite domains, can be converted to finding σ cor-
responding to the greatest lower bound vσ ∈ R on the utility. Using matrix notation,
set U to be the matrix such that U ω,a = U (ω, a), and consider the vectors σa = σ(a)
and ξ ω = ξ(ω). Then the problem can be written as:


max { v_σ | (Uσ)_j ≥ v_σ ∀j, Σ_i σ_i = 1, σ_i ≥ 0 ∀i }

Equivalently, we can find ξ with the least upper bound:


min { v_ξ | (ξ⊤U)_i ≤ v_ξ ∀i, Σ_j ξ_j = 1, ξ_j ≥ 0 ∀j },

where everything has been written in matrix form. In fact, one can show that vξ = vσ ,
thus obtaining Theorem 3.4.3.
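This formulation maps directly onto standard LP solvers. The sketch below uses scipy.optimize.linprog (an assumption about the available tooling) to solve the maximin problem for a given utility matrix; applied to the negative of the regret matrix of Example 3.7 it recovers the minimax-regret strategy σ = (1/2, 1/2).

# Sketch: solving max_sigma min_xi U(xi, sigma) by linear programming.
import numpy as np
from scipy.optimize import linprog

def maximin_strategy(U):
    """U[j, i] = U(omega_j, a_i). Returns (sigma, value) of the maximin problem."""
    n_omega, n_actions = U.shape
    # Variables: x = (sigma_1, ..., sigma_n, v); maximize v  <=>  minimize -v.
    c = np.concatenate([np.zeros(n_actions), [-1.0]])
    # Constraints (U sigma)_j >= v  <=>  -(U sigma)_j + v <= 0 for every outcome j.
    A_ub = np.hstack([-U, np.ones((n_omega, 1))])
    b_ub = np.zeros(n_omega)
    A_eq = np.concatenate([np.ones(n_actions), [0.0]]).reshape(1, -1)
    b_eq = np.array([1.0])
    bounds = [(0, None)] * n_actions + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[:n_actions], res.x[-1]

# Negative regret matrix of Example 3.7 (rows: omega_1, omega_2; columns: a_1, a_2).
L = np.array([[0.0, 1.0], [1.0, 0.0]])
sigma, value = maximin_strategy(-L)
print(sigma, -value)   # approximately [0.5, 0.5] and minimax regret 0.5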

To understand the connection of two-person games with Bayesian decision theory,


take another look at Fig. 3.3, seeing the risk as negative expected utility, or as the
opponent’s gain. Each of the decision lines represents nature’s gain as she chooses
different prior distributions, while we keep our policy σ fixed. The bottom horizontal
line that would be tangent to the Bayes-optimal utility curve would be minimax: if
nature were to change priors, it would not increase its gain, since the line is horizontal.
On the other hand, if we were to choose a different tangent line, we would only
increase nature’s gain (and decrease our utility).

3.5 Decision Problems with Observations

So far we have only examined problems where the outcomes were drawn from some
fixed distribution. This distribution constituted our subjective belief about what the
unknown parameter is. Now, we examine the case where we can obtain some obser-
vations that depend on the unknown ω before we make our decision, cf. Fig. 3.6.
These observations should give us more information about ω before making a deci-
sion. Intuitively, we should be able to make decisions by simply considering the
posterior distribution.
In this setting, we once more aim to take some decision a ∈ A so as to maximize
expected utility. As before, we have a prior distribution ξ on some parameter ω ∈ Ω,
representing what we know about ω. Consequently, the expected utility of any fixed
decision a is going to be Eξ (U | a).
However, now we may obtain more information about ω before making a final
decision. In particular, each ω corresponds to a model of the world Pω , which is
a probability distribution over some observation space S, such that Pω (X ) is the
probability that the observation is in X ⊂ S. The set of parameters Ω thus defines a
family of models
P ≜ {Pω | ω ∈ Ω}.

Now, consider the case where we take an observation x from the true model Pω∗
before making a decision. We can represent the dependency of our decision on the
observation by making our decision a function of x.

Fig. 3.6 Statistical decision problem with observations

Definition 3.5.1 (Policy) A policy π : S → A maps any observation to a decision.2

Given a policy π, its expected utility is given by


  
U(ξ, π) ≜ E_ξ {U[ω, π(x)]} = ∫_Ω ∫_S U[ω, π(x)] dP_ω(x) dξ(ω).

This is the standard Bayesian framework for decision making. It may be slightly
more intuitive in some cases to use the notation ψ(x | ω), in order to emphasize that
this is a conditional distribution. However, there is no technical difference between
the two notations.
When the set of policies includes all constant policies, then there is a policy π ∗
at least as good as the best fixed decision a ∗ . This is formalized in the following
remark.

Remark 3.5.1 Let Π denote a set of policies π : S → A. If for each a ∈ A there is
a π ∈ Π such that π(x) = a ∀x ∈ S, then max_{π∈Π} E_ξ(U | π) ≥ max_{a∈A} E_ξ(U | a).

Proof The proof follows by setting Π0 to be the set of constant policies. The result
follows since Π0 ⊂ Π. □

We conclude this section with a simple example about deciding whether or not to
go to a restaurant, given some expert opinions.

Example 3.8 Consider the problem of deciding whether or not to go to a particular


restaurant. Let Ω = [0, 1] with ω = 0 meaning the food is in general horrible and
ω = 1 meaning the restaurant is great. Let x1 , . . . , xn be n expert opinions in S =
{0, 1} about the restaurant, where 1 means that the restaurant is recommended by the
expert and 0 means that it is not recommended. Under our model, the probability of
observing xi = 1 when the quality of the restaurant is ω is given by Pω (1) = ω and
conversely Pω (0) = 1 − ω. The probability of observing a particular sequence x of
length n is³

Pω(x) = ω^s (1 − ω)^{n−s}

with s = Σ_{i=1}^n x_i.

2 For that reason, policies are also sometimes called decision functions or decision rules in the

literature.
3 We obtain a different probability of observations under the binomial model, but the resulting

posterior, and hence the policy, is the same.



3.5.1 Maximizing Utility When Making Observations

Statistical procedures based on the assumption that a distribution can be assigned


to any parameter in a statistical decision problem are called Bayesian statistical
methods. The scope of these methods has been the subject of much discussion in the
statistical literature, see e.g. [1].
In the following, we shall look at different expressions for the expected utility. We
shall overload the utility operator U for various cases: when the parameter is fixed,
when the parameter is random, when the decision is fixed, and when the decision
depends on the observation x and thus is random as well.

Expected utility of a fixed decision a with ω ∼ ξ


We first consider the expected utility of taking a fixed decision a ∈ A, when
P(ω ∈ B) = ξ(B). This is the case we have dealt with so far with

U(ξ, a) ≜ E_ξ(U | a) = ∫_Ω U(ω, a) dξ(ω).

Expected utility of a policy π with fixed ω ∈ Ω


Next we assume that ω is fixed, but instead of selecting a decision directly, we
select a decision that depends on the random observation x, which is distributed
according to Pω on S. We do this by defining a policy π : S → A. Then

U(ω, π) = ∫_S U(ω, π(x)) dP_ω(x). (3.5.1)

Expected utility of a policy π with ω ∼ ξ


Finally, we generalize to the case where ω is distributed with measure ξ. Note that
the expectation of the previous expression (3.5.1) by definition can be written
as

U(ξ, π) = ∫_Ω U(ω, π) dξ(ω),   U∗(ξ) ≜ sup_π U(ξ, π) = U(ξ, π∗).

3.5.2 Bayes Decision Rules

We wish to construct the Bayes decision rule, that is, the policy with maximal
ξ-expected utility. However, doing so by examining all possible policies is cumber-
some, because (usually) there are many more policies than decisions. It is however,
easy to find the Bayes decision for each possible observation. This is because it is
usually possible to rewrite the expected utility of a policy in terms of the posterior
distribution. While this is trivial to do when the outcome and observation spaces are
finite, it can be extended to the general case as shown in the following theorem.

Theorem 3.5.1 If U is non-negative or bounded, then we can reverse the integration


order in the normal form
 
U(ξ, π) = E{U[ω, π(x)]} = ∫_Ω ∫_S U[ω, π(x)] dP_ω(x) dξ(ω)

to obtain the utility in extensive form as


 
U(ξ, π) = ∫_S ∫_Ω U[ω, π(x)] dξ(ω | x) dP_ξ(x), (3.5.2)

where P_ξ(x) = ∫_Ω P_ω(x) dξ(ω).

Proof To prove this when U is non-negative, we shall use Tonelli's theorem. First
we need to construct an appropriate product measure. Let p(x | ω) ≜ dP_ω(x)/dν(x) be the
Radon-Nikodym derivative of P_ω with respect to some dominating measure ν on S.
Similarly, let p(ω) ≜ dξ(ω)/dμ(ω) be the corresponding derivative for ξ. Now, the utility
can be written as

U(ξ, π) = ∫_Ω ∫_S U[ω, π(x)] p(x | ω) p(ω) dν(x) dμ(ω)
        = ∫_Ω ∫_S h(ω, x) dν(x) dμ(ω)

with h(ω, x) ≜ U[ω, π(x)] p(x | ω) p(ω). Clearly, if U is non-negative, then so is
h(ω, x). Now we are ready to apply Tonelli's theorem to get
 
U(ξ, π) = ∫_S ∫_Ω h(ω, x) dμ(ω) dν(x)
        = ∫_S ∫_Ω U[ω, π(x)] p(x | ω) p(ω) dμ(ω) dν(x)
        = ∫_S [∫_Ω U[ω, π(x)] p(ω | x) dμ(ω)] p(x) dν(x)
        = ∫_S [∫_Ω U[ω, π(x)] p(ω | x) dμ(ω)] (dP_ξ(x)/dν(x)) dν(x)
        = ∫_S ∫_Ω U[ω, π(x)] dξ(ω | x) dP_ξ(x).

When U is bounded in [a, b], it suffices to consider U′ = U − a, which is non-negative. □

We can construct an optimal policy π ∗ as follows. For any specific observed x ∈ S,


we set π ∗ (x) to

π∗(x) ≜ arg max_{a∈A} E_ξ(U | x, a) = arg max_{a∈A} ∫_Ω U(ω, a) dξ(ω | x).

So now we can plug π ∗ in the extensive form to obtain


     
U [ω, π ∗ (x)] dξ(ω | x) d Pξ (x) = max U [ω, a] dξ(ω | x) d Pξ (x).
S Ω S a Ω

Consequently, there is no need to completely specify the policy before we have


seen x. Obviously, this would create problems when S is large.

Bayes’ decision rule


The optimal decision given x is the optimal decision with respect to the posterior
ξ(ω | x).
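Continuing Example 3.8, a small sketch of such a Bayes decision rule. The uniform prior over a grid of ω values and the utilities U(ω, go) = 2ω − 1, U(ω, stay) = 0 are assumptions made for illustration only.

# Sketch: Bayes decision for the restaurant example, on a discretized parameter grid.
n_grid = 101
omegas = [i / (n_grid - 1) for i in range(n_grid)]
prior = [1.0 / n_grid] * n_grid                      # assumed uniform prior on omega

def posterior(xs):
    """xs: list of expert opinions in {0, 1}; returns xi(omega | x) on the grid."""
    s, n = sum(xs), len(xs)
    weights = [p * (w ** s) * ((1 - w) ** (n - s)) for w, p in zip(omegas, prior)]
    total = sum(weights)
    return [w / total for w in weights]

def bayes_decision(xs):
    post = posterior(xs)
    utility = {"go": lambda w: 2 * w - 1, "stay": lambda w: 0.0}   # assumed utilities
    expected = {a: sum(p * u(w) for w, p in zip(omegas, post))
                for a, u in utility.items()}
    return max(expected, key=expected.get), expected

print(bayes_decision([1, 1, 1, 0, 1]))   # mostly positive reviews -> "go"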

The following definitions summarize terminology that we have mostly implicitly


introduced before.

Definition 3.5.2 (Prior distribution) The distribution ξ is called the prior distribu-
tion of ω.

Definition 3.5.3 (Marginal distribution) The distribution Pξ is called the (prior)


marginal distribution of x.

Definition 3.5.4 (Posterior distribution) The conditional distribution ξ(· | x) is


called the posterior distribution of ω.

3.5.3 Decision Problems in Classification

Classification is the problem of deciding which class y ∈ Y some particular obser-


vation xt ∈ X belongs to. From a decision-theoretic viewpoint, the problem can be
seen at three different levels. In the first, we are given a classification model in terms
of a probability distribution, and we simply wish to classify optimally given the
model. In the second, we are given a family of models, a prior distribution on the
family as well as a training data set, and we wish to classify optimally according to
our belief. In the last form of the problem, we are given a set of policies π : X → Y
and we must choose the one with highest expected performance. The two latter forms
of the problem are equivalent when the set of policies contains all Bayes decision
rules for a specific model family.

3.5.3.1 Deciding the Class Given a Probabilistic Model

In the simple form of the problem, we are already given a classifier P that can calculate probabilities P(yt | xt), and we simply must decide upon some class at ∈ Y, so as to maximize a specific utility function. One standard utility function is the prediction accuracy

    Ut ≜ I{yt = at} .

The probability P(yt | xt) is the posterior probability of the class given the observation xt. If we wish to maximize expected utility, we can simply choose

    at ∈ arg max_{a∈Y} P(yt = a | xt).

This defines a particular, simple policy. In fact, for two-class problems with Y = {0, 1}, such a rule can often be visualized as a decision boundary in X, on whose one side we decide for class 0 and on whose other side for class 1.
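As a small illustration (with assumed numbers and a hypothetical utility matrix U[y, a] that is not part of the text), the following sketch shows that the expected-utility-maximizing class coincides with the posterior argmax only for the accuracy utility:

```python
import numpy as np

p_y = np.array([0.6, 0.4])    # assumed posterior P(y_t | x_t) over two classes
U = np.array([[ 1.0, 0.0],    # hypothetical utility U[y, a]:
              [-2.0, 1.0]])   # deciding a=0 when y=1 is heavily penalized

a_accuracy = p_y.argmax()         # optimal under U_t = I{y_t = a_t}
a_general = (p_y @ U).argmax()    # optimal under the general utility
print(a_accuracy, a_general)      # prints 0 1
```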

3.5.3.2 Deciding the Class Given a Model Family

In the general form of the problem, we are given a training data set S = {(x1, y1), . . . , (xn, yn)}, a set of classification models {Pω | ω ∈ Ω}, and a prior distribution ξ on Ω. For each model, we can easily calculate Pω(y1, . . . , yn | x1, . . . , xn). Consequently, we can calculate the posterior distribution

    ξ(ω | S) = Pω(y1, . . . , yn | x1, . . . , xn) ξ(ω) / Σ_{ω′∈Ω} Pω′(y1, . . . , yn | x1, . . . , xn) ξ(ω′)

and the posterior marginal label probability

    Pξ|S(yt | xt) ≜ Pξ(yt | xt, S) = Σ_{ω∈Ω} Pω(yt | xt) ξ(ω | S).

We can then define the simple policy

    at ∈ arg max_{a∈Y} Σ_{ω∈Ω} Pω(yt = a | xt) ξ(ω | S),

which is known as Bayes' rule.
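A small sketch of this rule for a finite model family (the two models and the posterior weights below are assumed toy values): average the per-model class probabilities under ξ(ω | S) and pick the most probable label.

```python
import numpy as np

xi_post = np.array([0.3, 0.7])              # assumed xi(omega | S)
models = [lambda x: np.array([0.8, 0.2]),   # P_{omega_1}(y | x), toy model
          lambda x: np.array([0.3, 0.7])]   # P_{omega_2}(y | x), toy model

def bayes_rule(x):
    p_y = sum(w * m(x) for w, m in zip(xi_post, models))  # P_{xi|S}(y | x)
    return p_y.argmax()

print(bayes_rule(x=None))   # the toy models ignore x; prints 1
```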

3.5.3.3 The Bayes-Optimal Policy Under Parametrization Constraints*

In some cases, we are restricted to functionally simple policies, which do not contain any Bayes rules as defined above. For example, we might be limited to linear functions of x. Let π : X → Y be such a rule and let Π be the set of allowed policies. Given a family of models and a set of training data, we wish to calculate the policy that maximizes our expected utility. For a given ω, we can indeed compute

    U(ω, π) = Σ_{x,y} U(y, π(x)) Pω(y | x) Pω(x),

where we assume an i.i.d. model, i.e., xt | ω ∼ Pω(x) independently of previous observations. Note that to select the optimal rule π ∈ Π we also need to know Pω(x). For the case where ω is unknown and we have a posterior ξ(ω | S), given a training data set S as before, the Bayesian framework is easily extensible and gives

    U(ξ(· | S), π) = Σ_{ω} ξ(ω | S) Σ_{x,y} U(y, π(x)) Pω(y | x) Pω(x).

The respective maximization is in general not trivial. However, if our policies in Π are parametrized, we can employ optimization algorithms such as gradient ascent to find a maximum. In particular, if we sample ω ∼ ξ(· | S), then

    ∇π Σ_{x,y} U(y, π(x)) Pω(y | x) Pω(x)

is an unbiased (stochastic) estimate of ∇π U(ξ(· | S), π).
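The following numpy sketch illustrates the sampled-gradient idea under assumed toy quantities (two models on a finite observation grid, accuracy utility). Since a deterministic rule is not differentiable, the policy here is smoothed into a logistic one, π_θ(1 | x) = σ(θ0 + θ1 x), which is one simple way to parametrize Π; everything else (models, posterior, learning rate) is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.linspace(-2, 2, 21)                                  # finite observation grid
P_x = {0: np.full(21, 1/21), 1: np.full(21, 1/21)}          # P_omega(x)
P_y1 = {0: 1/(1+np.exp(-2*X)), 1: 1/(1+np.exp(-(X-0.5)))}   # P_omega(y=1|x)
posterior = np.array([0.3, 0.7])                            # assumed xi(omega | S)

def grad_U(theta, w):
    """Gradient of sum_{x,y} U(y, pi(x)) P_w(y|x) P_w(x) for the smoothed
    policy and the accuracy utility U(y, a) = I{y = a}."""
    p1 = 1/(1+np.exp(-(theta[0] + theta[1]*X)))
    coef = P_x[w] * (2*P_y1[w] - 1) * p1*(1-p1)   # chain rule through the sigmoid
    return np.array([coef.sum(), (coef*X).sum()])

theta = np.zeros(2)
for _ in range(2000):
    w = rng.choice(2, p=posterior)        # sample omega ~ xi(. | S)
    theta += 0.5 * grad_U(theta, w)       # stochastic gradient ascent
print(theta)
```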

3.5.3.4 Fairness in Classification Problems*

Any policy, when applied to large-scale, real world problems, has certain externali-
ties. This implies that considering only the decision maker’s utility is not sufficient.
One such issue is fairness.
This concerns desirable properties of policies applied to a population of individ-
uals. For example, college admissions should be decided on variables that inform
us about individual merit, but fairness may also require taking into account the fact
that certain communities are inherently disadvantaged. At the same time, a person
should not feel that someone else in a similar situation obtained an unfair advantage.
All this must be taken into account while still caring about optimizing the decision
maker’s utility function. As another example, consider mortgage decisions: while
lenders should take into account the creditworthiness of individuals in order to make
a profit, society must ensure that they do not unduly discriminate against socially
vulnerable groups.
Recent work in fairness for statistical decision making in the classification setting
has considered two main notions of fairness. The first uses (conditional) indepen-
dence constraints between a sensitive variable (such as ethnicity) and other variables
(such as decisions made). The second type ensures that decisions are meritocratic,
so that better individuals are favoured. Here smoothness4 can be used to assure that
similar people are treated similarly, which helps to avoid cronyism. While a thor-
ough discussion of fairness is beyond the scope of this book, it is useful to note that
some of these concepts are impossible to strictly achieve simultaneously, but may be
approximately satisfied by careful design of the policy. The recent work in [2–7] and [8] treats this topic in much greater depth.

3.5.4 Calculating Posteriors

Posterior distributions for multiple observations


We now consider how we can re-write the posterior distribution over Ω incrementally. Assume that we have a prior ξ on Ω and observe x^n ≜ (x1, . . . , xn). For the observation probability, we write:

Observation probability given history x^{n−1} and parameter ω

    Pω(xn | x^{n−1}) = Pω(x^n) / Pω(x^{n−1})

Then we obtain the following recursion for the posterior.

Posterior recursion

    ξ(ω | x^n) = Pω(x^n) ξ(ω) / Pξ(x^n) = Pω(xn | x^{n−1}) ξ(ω | x^{n−1}) / Pξ(xn | x^{n−1})

Here Pξ(x^n) = ∫_Ω Pω(x^n) dξ(ω) is the marginal distribution, and Pξ(xn | x^{n−1}) = Pξ(x^n) / Pξ(x^{n−1}).
4 For example, through Lipschitz conditions on the policy.
Posterior distributions for multiple independent observations

Now we consider the case where, given the parameter ω, the next observation does not depend on the history: If Pω(xn | x^{n−1}) = Pω(xn), then Pω(x^n) = ∏_{k=1}^n Pω(xk). Then the recursion looks as follows.

Posterior recursion with conditional independence

    ξn(ω) ≜ ξ0(ω | x^n) = Pω(x^n) ξ0(ω) / Pξ0(x^n)
          = ξn−1(ω | xn) = Pω(xn) ξn−1(ω) / Pξn−1(xn),

where ξt is the belief at time t. Here Pξn(·) = ∫_Ω Pω(·) dξn(ω) is the marginal distribution with respect to the n-th posterior.

Conditional independence allows us to write the posterior update as an identical recursion at each time t. We shall take advantage of that when we look at conjugate prior distributions in Chap. 4. For such models, the recursion involves a particularly simple parameter update.
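A minimal sketch of this recursion for a finite parameter set (the Bernoulli family and the data below are assumed toy values): each observation reweights the current belief by the likelihood and renormalizes by the marginal.

```python
import numpy as np

omegas = np.array([0.3, 0.5, 0.7])   # finite Omega of Bernoulli parameters
xi = np.ones(3) / 3                  # prior xi_0(omega)

def update(xi, x):
    """One step: xi_n(omega) = P_omega(x_n) xi_{n-1}(omega) / P_{xi_{n-1}}(x_n)."""
    lik = omegas**x * (1 - omegas)**(1 - x)
    post = lik * xi
    return post / post.sum()

for x in [1, 0, 1, 1, 1]:
    xi = update(xi, x)
print(xi)    # belief has shifted towards omega = 0.7
```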

3.6 Summary

In this chapter, we introduced a general framework for making decisions a ∈ A


whose optimality depends on an unknown outcome or parameter ω. We saw that,
when our knowledge about ω ∈ Ω is in terms of a probability distribution ξ on Ω,
then the utility of the Bayes-optimal decision is convex with respect to ξ.
In some cases, observations x ∈ X may affect our belief, leading to a
posterior ξ(· | x). This requires us to introduce the notion of a policy π : X → A
mapping observations to decisions. While it is possible to construct a complete policy
by computing U (ξ, π) for all policies (normal form) and maximizing, it is frequently
simpler to just wait until we observe x and compute U [ξ(· | x), a] for all decisions
(extensive form).
In minimax settings, we can consider a fixed but unknown parameter ω or a fixed
but unknown prior ξ. This links statistical decision theory to game theory.
3.7 Exercises

The first part of the exercises considers problems where we are simply given some
distribution over Ω. In the second part, the distribution is a posterior distribution that
depends on observations x.

3.7.1 Problems with No Observations

For the following exercises, we consider a set of worlds Ω and a decision set A, as
well as the following utility function U : Ω × A → R:

U (ω, a) = sinc(ω − a),

where sinc(x) = sin(x)/x. If ω is known and A = Ω = R then obviously the optimal


decision is a = ω, as sinc(x) ≤ sinc(0) = 1. However, we consider the case where

Ω = A = {−2.5, . . . , −0.5, 0, 0.5, . . . , 2.5} .

Exercise 3.7.1 Assume ω is drawn from ξ with  ξ(ω) = 1/11 for all ω ∈ Ω. Cal-
culate and plot the expected utility U (ξ, a) = ω ξ(ω)U (ω, a) for each a. Report
maxa U (ξ, a).

Exercise 3.7.2 Assume ω ∈ Ω is arbitrary (but deterministically selected). Calcu-


late the utility U (a) = minω U (ω, a) for each a. Report max(U ).

Exercise 3.7.3 Again assume ω ∈ Ω is arbitrary (but deterministically selected). We now allow for stochastic policies π on A. Then the expected utility is U(ω, π) = Σ_a U(ω, a) π(a).

(a) Calculate and plot the expected utility when π(a) = 1/11 for all a, reporting
values for all ω.
(b) Find
        max_π min_ξ U(ξ, π).
    Hint: Use the linear programming formulation, adding a constant to the utility matrix U so that all elements are non-negative.

Exercise 3.7.4 Consider the definition of rules that, for some ε > 0, select a decision a maximizing

    P({ω | U(ω, a) > sup_{d′∈A} U(ω, d′) − ε}).

Prove that this is indeed a statistical decision problem, i.e., it corresponds to maximizing the expectation of some utility function.
3.7.2 Problems with Observations

For the following exercises we consider a set of worlds Ω and a decision set A, as
well as the following utility function U : Ω × A → R:

U (ω, a) = −|ω − a|2

In addition, we consider a family of distributions on a sample space S = {0, 1}^n,

    F ≜ { fω | ω ∈ Ω } ,

such that fω is the binomial probability mass function with parameter ω. Consider the parameter set

    Ω = {0, 0.1, . . . , 0.9, 1} .

Let ξ be the uniform distribution on Ω, such that ξ(ω) = 1/11 for all ω ∈ Ω. Further,
let the decision set be A = [0, 1].

Exercise 3.7.5 What is the decision a∗ maximizing U(ξ, a) = Σ_ω ξ(ω) U(ω, a), and what is U(ξ, a∗)?

Exercise 3.7.6 In the same setting, we now observe the sequence x = (x1 , x2 , x3 ) =
(1, 0, 1).
1. Plot the posterior distribution ξ(ω | x) and compare it to the posterior we would
obtain if our prior on ω was ξ  = Beta(2, 2).
2. Find the decision a∗ maximizing the a posteriori expected utility

       Eξ(U | a, x) = Σ_ω U(ω, a) ξ(ω | x).

3. Consider n = 2, i.e., S = {0, 1}². Calculate the Bayes-optimal expected utility in extensive form:

       Eξ(U | π∗) = Σ_{x∈S} φ(x) Σ_ω U[ω, π∗(x)] ξ(ω | x) = Σ_{x∈S} φ(x) max_a Σ_ω U[ω, a] ξ(ω | x),

   where φ(x) = Σ_ω fω(x) ξ(ω) is the prior marginal distribution of x and π∗ : S → A is the Bayes-optimal decision rule.
   Hint: You can simplify the computation somewhat, since you only need to calculate the probability of Σ_t xt. This is not necessary to solve the problem though.

Exercise 3.7.7 In the same setting, we consider nature to be adversarial. Once more, we observe x = (1, 0, 1). Assume that nature can choose a prior among a set of priors Ξ = {ξ1, ξ2}. Let ξ1(ω) = 1/11 and ξ2(ω) = ω/5.5 for each ω.

1. Calculate and plot the value for deterministic decisions a:

       min_{ξ∈Ξ} Eξ(U | a, x).

2. Find the minimax prior ξ∗

       min_{ξ∈Ξ} max_{a∈A} Eξ(U | a).

   Hint: Apart from the adversarial prior selection, this is very similar to the previous exercise.

3.7.3 An Insurance Problem

Consider Example 2.15 of Chap. 2. Therein, an insurer is covering customers by asking for a premium d > 0. In the event of an accident, which happens with probability ε ∈ [0, 1], the insurer pays out h > 0. The problem of the insurer is, given ε, h, what to set the premium d to so as to maximize its expected utility. We assume that the insurer's utility is linear.
We now consider customers with some baseline income level x ∈ S. For simplicity, we assume that the only possible income levels are in S = {15, 20, 25, . . . , 60}, with K = 10 possible incomes. Let V : R → R denote the utility function of a customer. Customers who are interested in the insurance product will buy if and only if

    V(x − d) > ε V(x − h) + (1 − ε) V(x).

We make the simplifying assumption that the utility function is the same for all customers and has the following form:

    V(x) = ln x            if x ≥ 1,
    V(x) = 1 − (x − 2)²    otherwise.

Customers who are not interested in the insurance product will not buy it, no matter what the price.
There is some unknown probability distribution Pω(x) over the income level, such that the probability of n people having incomes x^n = (x1, . . . , xn) is Pω(x^n) = ∏_{i=1}^n Pω(xi). We have two data sources for this. The first is a model of the general population not working in the high-tech industry, ω1, and the second is a model of employees in the high-tech industry, ω2. The models are summarized in Table 3.6 below. Together, these two models form a family of distributions P = {Pω | ω ∈ Ω} with Ω = {ω1, ω2}.
Table 3.6 Income level distribution for the two models

Income level        15   20   25   30   35   40   45   50   55   60
Pω(x) in %:  ω1      5   10   12   13   11   10    8   10   11   10
             ω2      8    4    1    6   11   14   16   15   13   12

The goal is to find a premium d that maximizes the expected utility of the insurance
company. We assume that the company is liquid enough that utility is linear. In
the following, we consider four different cases. For simplicity, you can let A = S
throughout this exercise.
Exercise 3.7.8 Show that the expected utility for a given ω is the expected gain from a buying customer times the probability that an interested customer will have an income x such that she would buy our insurance, that is,

    U(ω, d) = (d − εh) Σ_{x∈S} Pω(x) I{V(x − d) > ε V(x − h) + (1 − ε) V(x)} .

Let h = 150 and ε = 10^{−3}. Plot the expected utility for the two possible ω for varying d. What is the optimal price level if the incomes of all interested customers are distributed according to ω1? What is the optimal price level if they are distributed according to ω2?
Exercise 3.7.9 According to our intuition, customers interested in our product are
much more likely to come from the high-tech industry than from the general pop-
ulation. For that reason, we have a prior probability ξ(ω1 ) = 1/4 and ξ(ω2 ) = 3/4
over the parameters Ω of the family P. More specifically, and in keeping with our
previous assumptions, we formulate the following model:

ω∗ ∼ ξ
x T | ω ∗ = ω ∼ Pω

That is, the data is drawn from one unknown model ω ∗ ∈ Ω. This can be thought
of as an experiment where nature randomly selects ω ∗ with probability ξ and then
generates the data from the corresponding model Pω∗ . Plot the expected utility under
this prior as the premium d varies. What is the optimal expected utility and premium?

Exercise 3.7.10 Instead of fully relying on our prior, the company decides to perform a random survey of 1000 people who are asked whether they would be interested in the insurance product (as long as the price is low enough). If interested, they are also asked about their income level. Assume that only 126 people were interested, with income levels as given in Table 3.7. Each column of the table shows the stated income and the number of people reporting it.
Table 3.7 Survey results

Income     15   20   25   30   35   40   45   50   55   60
Number      7    8    7   10   15   16   13   19   17   14

Let x^n = (x1, x2, . . . , xn) be the data collected. Assuming that the responses are truthful, calculate the posterior probability ξ(ω | x^n), assuming that the only possible models of income distribution are the two models ω1, ω2 used in the previous exercises. Plot the expected utility under the posterior distribution as d varies. What is the maximum expected utility that can be obtained?

Exercise 3.7.11 Having only two possible models is somewhat limiting, especially
since neither of them might correspond to the income distribution of people interested
in our insurance product. How could this problem be rectified? Describe your idea
and implement it. When would you expect this to work better?

3.7.4 Medical Diagnosis

Many patients arriving at an emergency room suffer from chest pain. This may indi-
cate acute coronary syndrome (ACS). Patients suffering from ACS that go untreated
may die with probability5 2% in the next few days. Successful diagnosis lowers the
short-term mortality rate to 0.2%. Consequently, a prompt diagnosis is essential.
Statistics of patients Approximately 50% of patients presenting with chest pain turn out to suffer from ACS (either acute myocardial infarction or unstable angina pectoris). Approximately 10% suffer from lung cancer. Of ACS sufferers in general, 2/3 are smokers and 1/3 non-smokers. Only 1/4 of non-ACS sufferers are smokers. In addition, 90% of lung cancer patients are smokers. Only 1/4 of non-cancer patients are smokers.

Assumption 3.7.1 A patient may suffer from none, either or both conditions.

Assumption 3.7.2 When the smoking history of the patient is known, the develop-
ment of cancer or ACS are independent.

Tests One can perform an ECG to test for ACS. An ECG test has sensitivity of 66.6%
(i.e., it correctly detects 2/3 of all patients that suffer from ACS), and a specificity of
75% (i.e., 1/4 of patients that do not have ACS, still test positive). An X-ray can
diagnose lung cancer with a sensitivity of 90% and a specificity of 90%.

Assumption 3.7.3 Repeated applications of a test produce the same result for the same patient, i.e., randomness is only due to patient variability.

5 The following figures are not really accurate, as they are liberally adapted from different studies.
References 57

Assumption 3.7.4 The existence of lung cancer does not affect the probability that
the ECG will be positive. Conversely, the existence of ACS does not affect the
probability that the X-ray will be positive.

The main problem we want to solve is how to perform experiments or tests so as to diagnose the patient using as few resources as possible, while making sure the patient lives. This is a problem in experiment design. We start from the simplest case and look at a couple of examples where we only observe the results of some tests. We then examine the case where we can select which tests to perform.

Exercise 3.7.12 Now consider the case where you have the choice which tests to perform. First, you observe S, i.e., whether or not the patient is a smoker. Then you select a test d1 ∈ {X-ray, ECG} to make. Finally, you decide whether or not to treat for ACS, that is, you choose d2 ∈ {heart treatment, no treatment}.
An untreated ACS patient may die with probability 2%, while a treated one with probability 0.2%. Treating a non-ACS patient results in death with probability 0.1%.
1. Draw a decision diagram, where:
• S is an observed random variable taking values in {0, 1}.
• A is a hidden variable taking values in {0, 1}.
• C is a hidden variable taking values in {0, 1}.
• d1 is a choice variable, taking values in {X-ray, ECG}.
• r1 is a result variable, taking values in {0, 1}, corresponding to negative and
positive tests results.
• d2 is a choice variable, which depends on the test result of d1 and on S.
• r2 is a result variable, taking values in {0, 1} corresponding to the patient dying
(0), or living (1).
2. Let d1 = X-ray, and assume the patient suffers from ACS, i.e., A = 1. How is
the posterior distributed?
3. What is the optimal decision rule for this problem?

References

1. Savage, L.J.: The Foundations of Statistics. Dover Publications (1972)


2. Kearns, M., Roth, A.: The Ethical Algorithm: The Science of Socially Aware Algorithm Design.
Oxford University Press, USA (2019)
3. Dwork, C., Hardt, M., Pitassi, T., Reingold, O., Zemel, R.: Fairness through awareness. In:
Proceedings of the 3rd Innovations in Theoretical Computer Science Conference, pp. 214–226.
ACM (2012)
4. Chouldechova, A.: Fair prediction with disparate impact: a study of bias in recidivism prediction
instruments. Technical Report (2016). arXiv:1610.07524
5. Corbett-Davies, S., Pierson, E., Feller, A., Goel, S., Huq, A.: Algorithmic decision making and
the cost of fairness. Technical Report (2017). arXiv:1701.08230
6. Kleinberg, J., Mullainathan, S., Raghavan, M.: Inherent trade-offs in the fair determination of
risk scores. Technical Report (2016). arXiv:1609.05807
7. Kilbertus, N., Rojas-Carulla, M., Parascandolo, G., Hardt, M., Janzing, D., Schölkopf, B.: Avoid-
ing discrimination through causal reasoning. Technical Report (2017). arXiv:1706.02744
8. Dimitrakakis, C., Liu, Y., Parkes, D., Radanovic, G.: Subjective fairness: fairness is in the eye
of the beholder. Technical Report (2017). arXiv:1706.00119
Chapter 4
Estimation

4.1 Introduction

In the previous chapter, we have seen how to make optimal decisions with respect
to a given utility function and belief. One important question is how to compute
an updated belief from observations and a prior belief. More generally, we wish to
examine how much information we can obtain about an unknown parameter from
observations, and how to bound the respective estimation error. While most of this
chapter will focus on the Bayesian framework for estimating parameters, we shall also
look at tools for making conclusions about the value of parameters without making
specific assumptions about the data distribution, i.e., without providing specific prior
information.
In the Bayesian setting, we calculate posterior distributions of parameters given data. The basic problem can be stated as follows. Let P ≜ {Pω | ω ∈ Ω} be a family of probability measures on (S, F_S) and ξ be our prior probability measure on (Ω, F_Ω). Given some data x ∼ Pω∗, with ω∗ ∈ Ω, how can we estimate ω∗? The Bayesian approach is to estimate the posterior distribution ξ(· | x), instead of guessing a single ω∗. In general, the posterior measure is a function ξ(· | x) : F_Ω → [0, 1], with

    ξ(B | x) = ∫_B Pω(x) dξ(ω) / ∫_Ω Pω(x) dξ(ω).        (4.1.1)

The posterior distribution allows us to quantify our uncertainty about the unknown ω ∗ .
This in turn enables us to take decisions that take uncertainty into account.
The first question we are concerned with in this chapter is how to calculate this
posterior for any value of x in practice. If x is a complex object, this may be com-
putationally difficult. In fact, the posterior distribution can also be a complicated
function. However, there exist distribution families and priors such that this calcula-
tion is very easy, in the sense that the functional form of the posterior depends upon a

small number of parameters. This happens when a summary of the data that contains
all necessary information can be calculated easily. Formally, this is captured via the
concept of a sufficient statistic.

4.2 Sufficient Statistics

Sometimes we want to summarize the data we have observed. This can happen when
the data is a long sequence of simple observations x n = (x1 , . . . , xn ). It may also
be useful to do so when we have a single observation x, such as a high-resolution
image. For some applications, it may be sufficient to only calculate a really simple
function of the data, such as the sample mean.
Definition 4.2.1 (Sample mean) The sample mean x̄n : R^n → R of a sequence x^n = (x1, . . . , xn) with xt ∈ R is defined as

    x̄n ≜ (1/n) Σ_{t=1}^n xt .

The mean is an example of what is called a statistic, that is, a function of the observations to some vector space. In the following, we are interested in statistics that can replace the complete original data in our calculations without losing any information. Such statistics are called sufficient.

4.2.1 Sufficient Statistics

We consider the standard probabilistic setting. Let S be a sample space and Ω be a


parameter space defining a family of measures on S, that is,

P = {Pω | ω ∈ Ω} .

In addition, we must also define an appropriate prior distribution ξ on the parameter


space Ω. Then the definition of a sufficient statistic in the Bayesian sense1 is as
follows.
Definition 4.2.2 Let Ξ be a set of prior distributions on Ω, which indexes a family P = {Pω | ω ∈ Ω} of distributions on S. A statistic φ : S → Z, where Z is a vector space,² is a sufficient statistic for ⟨P, Ξ⟩, if

    ξ(· | x) = ξ(· | x′)        (4.2.1)

for any prior ξ ∈ Ξ and any x, x′ ∈ S such that φ(x) = φ(x′).

1 There is an alternative definition, which replaces equality of posterior distributions with point-wise equality on the family members, i.e., Pω(x) = Pω(x′) for all ω. This is a stronger definition, as it implies the Bayesian one we use here.
2 Typically Z ⊂ R^k for finite-dimensional statistics.


This means that the statistic is sufficient if, whenever we obtain the same value of the statistic for two different datasets x, x′, then the resulting posterior distribution over the parameters is identical, independent of the prior distribution. In other words, the value of the statistic is sufficient for computing the posterior. Interestingly, a sufficient statistic always implies the following factorization for members of the family.

Theorem 4.2.1 A statistic φ : S → Z is sufficient for ⟨P, Ξ⟩ iff there exist functions u : S → (0, ∞) and v : Z × Ω → [0, ∞) such that for all x ∈ S, ω ∈ Ω:

    Pω(x) = u(x) v[φ(x), ω]

Proof In the following proof we assume arbitrary Ω. The case when Ω is finite is technically simpler and is left as an exercise. Let us first assume the existence of u, v satisfying the equation. Then for any B ∈ F_Ω we have

    ξ(B | x) = ∫_B u(x) v[φ(x), ω] dξ(ω) / ∫_Ω u(x) v[φ(x), ω] dξ(ω)
             = ∫_B v[φ(x), ω] dξ(ω) / ∫_Ω v[φ(x), ω] dξ(ω).

For x′ with φ(x) = φ(x′), it follows that ξ(B | x) = ξ(B | x′), so ξ(· | x) = ξ(· | x′) and φ satisfies the definition of a sufficient statistic.
Conversely, assume that φ is a sufficient statistic. Let μ be a dominating measure on Ω so that we can define the densities p(ω) ≜ dξ(ω)/dμ(ω) and

    p(ω | x) ≜ dξ(ω | x)/dμ(ω) = Pω(x) p(ω) / ∫_Ω Pω(x) dξ(ω),

whence

    Pω(x) = (p(ω | x) / p(ω)) ∫_Ω Pω(x) dξ(ω).

Since φ is sufficient, there is by definition some function g : Z × Ω → [0, ∞) such that p(ω | x) = g[φ(x), ω]. This must be the case, since sufficiency means that p(ω | x) = p(ω | x′) whenever φ(x) = φ(x′). For this to occur the existence of such a g is necessary. Consequently, we can factorize Pω as

    Pω(x) = u(x) v[φ(x), ω],

where u(x) = ∫_Ω Pω(x) dξ(ω) and v[φ(x), ω] = g[φ(x), ω] / p(ω). □
In the factorization of Theorem 4.2.1, u is the only factor that depends directly on x.
Interestingly, it does not appear in the posterior calculation at all. So, the posterior
only depends on x through the statistic.

Example 4.1 Suppose x^n = (x1, . . . , xn) is a random sample from a distribution with two possible outcomes 0 and 1, where ω is the probability of outcome 1. (This is usually known as a Bernoulli distribution, which we will have a closer look at in Sect. 4.3.1.) Then the probability of observing a certain sequence x^n is given by

    Pω(x^n) = ∏_{t=1}^n Pω(xt) = ω^{sn} (1 − ω)^{n−sn},

where sn = Σ_{t=1}^n xt is the number of times 1 has been observed until time n. Here the statistic φ(x^n) = sn satisfies (4.2.1) with u(x) = 1, while Pω(x^n) only depends on the data through the statistic sn = φ(x^n).

Another example is when we have a finite set of models. Then the sufficient statistic is always a finite-dimensional vector.

Lemma 4.2.1 Let P = {Pθ | θ ∈ Θ} be a family, where each model Pθ is a probability measure on X and Θ contains n models. If p ∈ Δ^n is a vector representing our prior distribution, i.e., ξ(θ) = pθ, then the finite-dimensional vector with entries qθ = pθ Pθ(x) is a sufficient statistic.

Proof Simply note that the posterior distribution in this case is

    ξ(θ | x) = qθ / Σ_{θ′} qθ′ ,

that is, the values qθ are sufficient to compute the posterior. □

From the proof it is clear that also the vector with entries wθ = qθ / Σ_{θ′} qθ′ is a sufficient statistic.

4.2.2 Exponential Families

Even when dealing with an infinite set of models in some cases the posterior dis-
tributions can be computed efficiently. Many well-known distributions such as the
Gaussian, Bernoulli and Dirichlet distribution are members of the exponential family
of distributions. All those distributions are factorizable in the manner shown below,
while at the same time they have fixed-dimension sufficient statistics.
Definition 4.2.3 A distribution family P = {Pω | ω ∈ Ω} with Pω being a probability function (or density) on the sample space S is said to be an exponential family if there are suitable functions a : Ω → R, b : S → R, gi : Ω → R and hi : S → R (1 ≤ i ≤ k) such that for any x ∈ S, ω ∈ Ω:

    Pω(x) = a(ω) b(x) exp( Σ_{i=1}^k gi(ω) hi(x) )

Among families of distributions satisfying certain smoothness conditions, only exponential families have a fixed-dimension sufficient statistic. Because of this, exponential family distributions admit so-called parametric conjugate prior distribution families. These have the property that any posterior distribution calculated will remain within the conjugate family. Frequently, because of the simplicity of the statistic used, calculation of the conjugate posterior parameters is very simple.

4.3 Conjugate Priors

In this section, we examine some well-known conjugate families. First, we give sufficient conditions for the existence of a conjugate family of priors for a given distribution family and statistic. While this section can be used as a reference, the reader may wish to initially only look at the first few example families.
The following remark gives sufficient conditions for the existence of a finite-dimensional sufficient statistic.
Remark 4.3.1 If a family P of distributions on S has a sufficient statistic φ : S → Z of fixed dimension for any x ∈ S, then there exists a conjugate family of priors Ξ = {ξα | α ∈ A} with a set A of possible parameters for the prior distribution, such that:
1. Pω(x) is proportional to some ξα ∈ Ξ, that is,

       ∀x ∈ S ∃ξα ∈ Ξ, c > 0 : ∫_B Pω(x) dξα(ω) = c ξα(B), ∀B ∈ F_Ω.

2. The family is closed under multiplication, that is,

       ∀ξ1, ξ2 ∈ Ξ ∃ξα ∈ Ξ, c > 0 : ξα = c ξ1 ξ2.

While conjugate families exist for statistics with unbounded dimension, here we shall
focus on finite-dimensional families. We will start with the simplest example, the
Bernoulli-Beta pair.
4.3.1 Bernoulli-Beta Conjugate Pair

The Bernoulli distribution is a discrete distribution that is ideal for modelling indepen-
dent random trials with just two different outcomes (typically ‘success’ and ‘failure’)
and a fixed probability of success.
Definition 4.3.1 (Bernoulli distribution) The Bernoulli distribution is discrete with outcomes S = {0, 1}, a parameter ω ∈ [0, 1], and probability function

    Pω(x = u) = ω^u (1 − ω)^{1−u} = ω if u = 1, and 1 − ω if u = 0.

If x is distributed according to a Bernoulli distribution with parameter ω, we write x ∼ Bern(ω).
The structure of the graphical model in Fig. 4.1 shows the dependencies between
the different variables of the model.
When considering n independent trials of a Bernoulli distribution, the set of possible outcomes is S = {0, 1}^n. Then Pω(x^n) = ∏_{t=1}^n Pω(xt) is the probability of observing the exact sequence x^n under the Bernoulli model. However, in many cases we are interested in the probability of observing a particular number of successes (outcome 1) and failures (outcome 0) and do not care about the actual order. For summarizing we need to count the actual sequences in which out of n trials we have k successes. The actual number is given by the binomial coefficient, defined as

    C(n, k) ≜ n! / (k! (n − k)!),        k, n ∈ N, n ≥ k.

Now we are ready to define the binomial distribution, that is, a scaled product-Bernoulli distribution for multiple independent outcomes where we want to measure the probability of a particular number of successes or failures. Thus, the Bernoulli is a distribution on a sequence of outcomes, while the binomial is a distribution on the total number of successes. That is, let s = Σ_{t=1}^n xt be the total number of successes observed up to time n. Then we are interested in the probability that there are exactly k successes out of n trials.
Definition 4.3.2 (Binomial distribution) The binomial distribution with parameters ω and n has outcomes S = {0, 1, . . . , n}. Its probability function is given by

    Pω(s = k) = C(n, k) ω^k (1 − ω)^{n−k}.

Fig. 4.1 Bernoulli graphical model
If s is drawn from a binomial distribution with parameters ω, n, we write s ∼ Binom(ω, n).
Now let us return to the Bernoulli distribution. If the parameter ω is known, then observations are independent of each other. However, this is not the case when ω is unknown. For example, if Ω = {ω1, ω2}, then the probability of observing a sequence x^n is given by

    P(x^n) = Σ_{ω∈Ω} P(x^n | ω) P(ω) = ∏_{t=1}^n P(xt | ω1) P(ω1) + ∏_{t=1}^n P(xt | ω2) P(ω2),

which in general is different from ∏_{t=1}^n P(xt). For the general case where Ω = [0, 1] the question is whether there is a prior distribution that can succinctly describe our uncertainty about the parameter. Indeed, there is, and it is called the Beta distribution. It is defined on the interval [0, 1] and has two parameters that determine the density of the observations. Because the Bernoulli distribution has a parameter in [0, 1], the outcomes of the Beta can be used to specify a prior on the parameters of the Bernoulli distribution.
Definition 4.3.3 (Beta distribution) The Beta distribution has outcomes ω ∈ Ω = [0, 1] and parameters α0, α1 > 0, which we will summarize in a vector α = (α1, α0). Its probability density function is given by

    p(ω | α) = [Γ(α0 + α1) / (Γ(α0) Γ(α1))] ω^{α1−1} (1 − ω)^{α0−1},        (4.3.1)

where Γ(α) = ∫_0^∞ u^{α−1} e^{−u} du is the Gamma function. If ω is distributed according to a Beta distribution with parameters α1, α0, we write ω ∼ Beta(α1, α0).
We note that the Gamma function is an extension of the factorial, so that for n ∈ N it holds that Γ(n + 1) = n!. That way, the first term in (4.3.1) corresponds to a generalized binomial coefficient. The dependencies between the parameters are shown in the graphical model of Fig. 4.2. A Beta distribution with parameter α has expectation

    E(ω | α) = α1 / (α0 + α1)

and variance

    V(ω | α) = α1 α0 / [(α1 + α0)² (α1 + α0 + 1)].

Fig. 4.2 Beta graphical model
Fig. 4.3 Four example Beta densities, α = (1, 1), (2, 1), (10, 20), (1/10, 1/2)

Figure 4.3 shows the density of a Beta distribution for four different parameter
vectors. When α0 = α1 = 1, the distribution is equivalent to a uniform one.

4.3.1.1 Beta Prior for Bernoulli Distributions

As already indicated, we can encode our uncertainty about the unknown but fixed parameter ω ∈ [0, 1] of a Bernoulli distribution using a Beta distribution. We start with a Beta distribution with parameter α that defines our prior ξ0 via ξ0(B) ≜ ∫_B p(ω | α) dω. Then the posterior probability is given by

    p(ω | x^n, α) = ∏_{t=1}^n Pω(xt) p(ω | α) / ∫_Ω ∏_{t=1}^n Pω(xt) p(ω | α) dω ∝ ω^{sn,1+α1−1} (1 − ω)^{sn,0+α0−1},

where sn,1 = Σ_{t=1}^n xt and sn,0 = n − sn,1 are the total number of 1s and 0s, respectively. As can be seen, this again has the form of a Beta distribution (Fig. 4.4).

Beta-Bernoulli model
Let ω be drawn from a Beta distribution with parameters α1, α0, and x^n = (x1, . . . , xn) be a sample drawn independently from a Bernoulli distribution with parameter ω, i.e.,

    ω ∼ Beta(α1, α0),    x^n | ω ∼ Bern^n(ω).

Then the posterior distribution of ω given the sample is also Beta, that is,

    ω | x^n ∼ Beta(α1′, α0′)  with  α1′ = α1 + sn,1,  α0′ = α0 + sn,0.

Fig. 4.4 Beta-Bernoulli graphical model
Example 4.2 The parameter ω ∈ [0, 1] of a randomly selected coin can be modelled as a Beta distribution peaking around 1/2. Usually one assumes that coins are fair. However, not all coins are exactly the same. Thus, it is possible that each coin deviates slightly from fairness. We can use a Beta distribution to model how likely (we think) different values ω of coin parameters are.
To demonstrate how belief changes, we perform the following simple experiment.
We repeatedly toss a coin and wish to form an accurate belief about how biased
the coin is, under the assumption that the outcomes are Bernoulli with a fixed
parameter ω. Our initial belief, ξ0 , is modelled as a Beta distribution on the parameter
space Ω = [0, 1], with parameters α0 = α1 = 100. This places a strong prior on the
coin being close to fair. However, we still allow for the possibility that the coin is
biased.
Figure 4.5 shows a sequence of beliefs at times 0, 10, 100, 1000 respectively, from
a coin with bias ω = 0.6. Due to the strength of our prior, after 10 observations, the
situation has not changed much and the belief ξ10 is very close to the initial one.
However, after 100 observations our belief has now shifted towards 0.6, the true bias
of the coin. After a total of 1000 observations, our belief is centered very close to 0.6,
and is now much more concentrated, reflecting the fact that we are almost certain
about the value of ω.
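The experiment is easy to reproduce with the Beta-Bernoulli update; the following sketch uses the same assumed values (prior Beta(100, 100), coin bias ω = 0.6) and prints the posterior mean and a 90% interval after 0, 10, 100 and 1000 tosses.

```python
import numpy as np
from scipy.stats import beta

rng = np.random.default_rng(0)
alpha1, alpha0 = 100, 100            # prior pseudo-counts for 1s and 0s
x = rng.binomial(1, 0.6, size=1000)  # tosses of a coin with omega = 0.6

for n in [0, 10, 100, 1000]:
    s1 = x[:n].sum()                 # s_{n,1}
    s0 = n - s1                      # s_{n,0}
    post = beta(alpha1 + s1, alpha0 + s0)
    print(n, post.mean(), post.interval(0.9))
```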

Fig. 4.5 Changing beliefs ξ0, ξ10, ξ100, ξ1000 as we observe tosses from a coin with probability ω = 0.6 of heads

4.3.2 Conjugates for the Normal Distribution

The well-known normal distribution is also endowed with suitable conjugate priors.
We first give the definition of the normal distribution, then consider the cases where
we wish to estimate its mean, its variance, or both at the same time.
Definition 4.3.4 (Normal distribution) The normal distribution is a continuous distribution with outcomes in R. It has two parameters, the mean ω ∈ R and the variance σ² ∈ R+, or alternatively the precision r ∈ R+, where σ² = r^{−1}. Its probability density function is given by

    f(x | ω, r) = √(r/2π) exp(−(r/2)(x − ω)²).

When x is distributed according to a normal distribution with parameters ω, r^{−1}, we write x ∼ N(ω, r^{−1}).
For a respective sample x^n of size n, we write x^n ∼ N^n(ω, r^{−1}). Independent samples satisfy the independence condition

    f(x^n | ω, r) = ∏_{t=1}^n f(xt | ω, r) = (r/2π)^{n/2} exp(−(r/2) Σ_{t=1}^n (xt − ω)²).

The dependency graph in Fig. 4.6 shows the dependencies between the parame-
ters of a normal distribution and the observations xt . In this graph, only a single
sample xt is shown, and it is implied that all xt are independent of each other
given r, ω.
Transformations of normal samples. The importance of the normal distribution stems from the fact that many actual distributions turn out to be approximately normal. Another interesting property of the normal distribution concerns transformations of normal samples. For example, if x^n is drawn from a normal distribution with mean ω and precision r, then Σ_{t=1}^n xt ∼ N(nω, n r^{−1}). Finally, if the samples xt are drawn from the standard normal distribution, i.e., xt ∼ N(0, 1), then Σ_{t=1}^n xt² has a χ²-distribution with n degrees of freedom (cf. also the discussion of the Gamma distribution below).

Fig. 4.6 Normal graphical model
4.3.2.1 Normal Distribution with Known Precision, Unknown Mean

The simplest normal estimation problem occurs when we only need to estimate
the mean and assume that the variance (or equivalently, the precision) is known.
For Bayesian estimation, it is convenient to assume that the mean ω is drawn from
another normal distribution with known mean. This gives a conjugate pair and thus
results in a posterior normal distribution for the mean as well (Fig. 4.7).

Normal-Normal conjugate pair

Let ω be drawn from a normal distribution with mean μ and precision τ, and x^n = (x1, . . . , xn) be a sample drawn independently from a normal distribution with mean ω and precision r, that is,

    x^n | ω, r ∼ N^n(ω, r^{−1}),    ω | τ ∼ N(μ, τ^{−1}).

Then the posterior distribution of ω given the sample is also normal, that is,

    ω | x^n ∼ N(μ′, τ′^{−1})  with  μ′ = (τμ + n r x̄n) / (τ + n r),   τ′ = τ + n r,

where x̄n ≜ (1/n) Σ_{t=1}^n xt.

It can be seen that the updated estimate for the mean is shifted towards the empir-
ical mean x̄n , and the precision increases linearly with the number of samples.
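A short sketch of this update with assumed values for the prior (μ, τ) and the known precision r:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, tau = 0.0, 1.0                  # prior mean and precision (assumed)
r = 4.0                             # known observation precision (assumed)
x = rng.normal(1.5, 1/np.sqrt(r), size=50)   # data with true mean 1.5

n, xbar = len(x), x.mean()
tau_post = tau + n * r
mu_post = (tau * mu + n * r * xbar) / tau_post
print(mu_post, tau_post)            # posterior mean near 1.5, precision has grown
```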

4.3.2.2 Normal with Unknown Precision and Known Mean

To model normal distributions with known mean, but unknown precision (or equiv-
alently, unknown variance), we first have to introduce the Gamma distribution that
we will use to represent our uncertainty about the precision.

Definition 4.3.5 (Gamma distribution) The Gamma distribution is a continuous distribution with outcomes in [0, ∞). It has two parameters α > 0, β > 0 and probability density function

    f(r | α, β) = (β^α / Γ(α)) r^{α−1} e^{−βr},

Fig. 4.7 Normal with unknown mean, graphical model

where Γ is the Gamma function. We write r ∼ Gamma(α, β) for a random


variable r that is distributed according to a Gamma distribution.

The graphical model of a Gamma distribution is shown in Fig. 4.8. The parameters α, β determine the shape and the rate (inverse scale) of the distribution, respectively. This is illustrated in Fig. 4.9, which also depicts some special cases showing that the Gamma distribution is a generalization of a number of other standard distributions. For example, for α = 1, β > 0 one obtains an exponential distribution with parameter β. Its probability density function is

    f(x | β) = β e^{−βx},

and like the Gamma distribution it has support [0, ∞), i.e., x ≥ 0. For n ∈ N and α = n/2, β = 1/2 one obtains a χ²-distribution with n degrees of freedom.
Now we return to our problem of estimating the precision of a normal distribution
with known mean, using the Gamma distribution to represent uncertainty about the
precision (Fig. 4.10).

Normal-Gamma model
Let r be drawn from a Gamma distribution with parameters α, β, while x n is
a sample drawn independently from a normal distribution with mean ω and
precision r , i.e.,

x n | r ∼ N n (ω, r −1 ), r | α, β ∼ Gamma(α, β).

Fig. 4.8 Gamma graphical model

Fig. 4.9 Example Gamma densities for (α, β) = (1, 1), (1, 2), (2, 2), (4, 1/2)
Fig. 4.10 Normal-Gamma graphical model for normal distributions with unknown precision
Then the posterior distribution of r given the sample is also Gamma, that is,

    r | x^n ∼ Gamma(α′, β′)  with  α′ = α + n/2,   β′ = β + (1/2) Σ_{t=1}^n (xt − ω)².
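A corresponding sketch of the Normal-Gamma update with a known mean (all numbers assumed):

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, beta_ = 2.0, 2.0              # assumed prior Gamma(alpha, beta)
omega = 0.0                          # known mean
x = rng.normal(omega, 0.5, size=100) # data with true precision r = 4

alpha_post = alpha + len(x) / 2
beta_post = beta_ + 0.5 * np.sum((x - omega)**2)
print(alpha_post / beta_post)        # posterior mean of r, close to 4
```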

4.3.2.3 Normal with Unknown Precision and Unknown Mean

Finally, let us turn our attention to the general problem of estimating both the mean
and the precision of a normal distribution. We will use the same prior distributions
for the mean and precision as in the case when just one of them was unknown. It
will be assumed that the precision is independent of the mean, while the mean has a
normal distribution given the precision (Fig. 4.11).

Fig. 4.11 Graphical model for a normal distribution with unknown mean and precision
Normal with unknown mean and precision

Let x^n be a sample from a normal distribution with unknown mean ω and precision r, whose prior joint distribution satisfies

    ω | r, τ, μ ∼ N(μ, (τ r)^{−1}),    r | α, β ∼ Gamma(α, β).

Then the posterior distribution is

    ω | x^n, r, τ, μ ∼ N(μ′, (τ′ r)^{−1})  with  r | x^n, α, β ∼ Gamma(α′, β′),

where

    μ′ = (τ μ + n x̄n) / (τ + n),    τ′ = τ + n,
    α′ = α + n/2,    β′ = β + (1/2) Σ_{t=1}^n (xt − x̄n)² + τ n (x̄n − μ)² / (2(τ + n)).
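The joint update is again a one-line-per-parameter computation; the sketch below uses assumed prior values (μ, τ, α, β) and toy data:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, tau, alpha, beta_ = 0.0, 1.0, 2.0, 2.0     # assumed prior parameters
x = rng.normal(1.0, 0.5, size=200)             # data with omega = 1, r = 4

n, xbar = len(x), x.mean()
mu_post = (tau * mu + n * xbar) / (tau + n)
tau_post = tau + n
alpha_post = alpha + n / 2
beta_post = (beta_ + 0.5 * np.sum((x - xbar)**2)
             + tau * n * (xbar - mu)**2 / (2 * (tau + n)))
print(mu_post, alpha_post / beta_post)         # posterior guesses for omega and r
```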

While ω | r is normally distributed, the marginal distribution of ω is not normal. In fact, it can be shown that it is a Student t-distribution. In the following, we describe the marginal distribution of a new observation x, which is a generalized Student t-distribution.

The marginal distribution of x. For a normal distribution with mean ω and precision r, we have

    f(x | ω, r) ∝ √r · exp(−(r/2)(x − ω)²).

For a prior ω | r ∼ N(μ, (τ r)^{−1}) and r ∼ Gamma(α, β), as before, the joint distribution for mean and precision is given by

    ξ(ω, r) ∝ √r · exp(−(τ r/2)(ω − μ)²) r^{α−1} e^{−βr},

as ξ(ω, r) = ξ(ω | r) ξ(r). Now we can write the marginal density of new observations as

    pξ(x) = ∫ f(x | ω, r) dξ(ω, r)
          ∝ ∫_0^∞ ∫_{−∞}^∞ r · exp(−(r/2)(x − ω)² − (τ r/2)(ω − μ)²) r^{α−1} e^{−βr} dω dr
          = ∫_0^∞ r^α e^{−βr} ( ∫_{−∞}^∞ exp(−(r/2)[(x − ω)² + τ(ω − μ)²]) dω ) dr
          = ∫_0^∞ r^α e^{−βr} exp(−(τ r / (2(τ + 1)))(x − μ)²) √(2π / (r(1 + τ))) dr
          ∝ ∫_0^∞ r^{α−1/2} exp(−r [β + τ(x − μ)² / (2(τ + 1))]) dr
          ∝ [β + τ(x − μ)² / (2(τ + 1))]^{−(α+1/2)},

which is indeed a generalized Student t-density in x.

4.3.3 Conjugates for Multivariate Distributions

The binomial distribution as well as the normal distribution can be extended to mul-
tiple dimensions. Fortunately, multivariate extensions exist for their corresponding
conjugate priors, too.

4.3.3.1 Multinomial-Dirichlet Conjugates

The multinomial distribution is the extension of the binomial distribution to an arbi-


trary number of outcomes. It is a common model for independent random trials
with a finite number of possible outcomes, such as repeated dice throws, multi-class
classification problems, etc.
As in the binomial distribution we perform independent trials, but now consider a
more general outcome set S = {1, . . . , K } for each trial. Denoting by n k the number
of times one obtains outcome k, the multinomial distribution gives the probability of
observing a given vector (n 1 , . . . , n K ) after a total of n trials.

Definition 4.3.6 (Multinomial distribution) The multinomial distribution is discrete with parameters n, K ∈ N and a vector parameter ω ∈ Δ^K, i.e., ωk ≥ 0 with Σ_k ωk = 1, where each ωk represents the probability of obtaining outcome k in a single trial. The set of possible outcome counts after n trials is the set of vectors n = (nk)_{k=1}^K such that Σ_k nk = n, and the probability function is given by

    P(n | ω) = (n! / ∏_{k=1}^K nk!) ∏_{k=1}^K ωk^{nk}.

The dependencies between the variables are shown in Fig. 4.12.


Fig. 4.12 Multinomial graphical model

4.3.3.2 The Dirichlet Distribution

The Dirichlet distribution is the multivariate extension of the Beta distribution that
will turn out to be a natural candidate for a prior on the multinomial distribution.

Definition 4.3.7 (Dirichlet distribution) The Dirichlet distribution is a continuous distribution with outcomes ω ∈ Ω = Δ^K and a parameter vector α ∈ R₊^K. Its probability density function is

    f(ω | α) = [Γ(Σ_{i=1}^K αi) / ∏_{k=1}^K Γ(αk)] ∏_{k=1}^K ωk^{αk−1}.

We write ω ∼ Dir(α) for a random variable ω distributed according to a Dirichlet distribution.

The parameter α determines the density of the observations, as shown in Fig. 4.13.
The Dirichlet distribution is conjugate to the multinomial distribution in the same
way that the Beta distribution is conjugate to the Bernoulli/binomial distribution.

Multinomial distribution with unknown parameter

Assume that the parameter ω of a multinomial distribution is generated from a Dirichlet distribution as illustrated in Fig. 4.14. If we observe x^n = (x1, . . . , xn), and our prior is given by Dir(α), so that our initial belief is ξ0(ω) ≜ f(ω | α), the resulting posterior after n observations is

    ξn(ω) ∝ ∏_{k=1}^K ωk^{nk+αk−1},

where nk = Σ_{t=1}^n I{xt = k}.
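In code, the Dirichlet-multinomial update amounts to adding the observed counts to the prior parameters (toy data assumed):

```python
import numpy as np

K = 3
alpha = np.ones(K)                        # Dir(1, ..., 1) prior
x = np.array([0, 2, 2, 1, 2, 0, 2])       # observations in {0, ..., K-1}
counts = np.bincount(x, minlength=K)      # n_k
alpha_post = alpha + counts               # posterior is Dir(alpha + counts)
print(alpha_post, alpha_post / alpha_post.sum())   # parameters and posterior mean
```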

Fig. 4.13 Dirichlet graphical model

Fig. 4.14 Dirichlet-multinomial graphical model
4.3.3.3 Multivariate Normal Conjugate Families

The last conjugate pair we shall discuss is that for multivariate normal distribu-
tions. Similarly to the extension of the Bernoulli distribution to the multinomial,
and the corresponding extension of the Beta to the Dirichlet, the normal priors can
be extended to the multivariate case. The prior of the mean becomes a multivariate
normal distribution, while that of the precision becomes a Wishart distribution.
Definition 4.3.8 (Multivariate normal distribution) The multivariate normal distribution is a continuous distribution with outcome space S = R^K. Its parameters are a mean vector ω ∈ R^K and a precision matrix³ R ∈ R^{K×K} that is positive-definite, that is, xᵀRx > 0 for x ≠ 0. The probability density function of the multivariate normal distribution is given by

    f(x | ω, R) = (2π)^{−K/2} |R|^{1/2} · exp(−(1/2)(x − ω)ᵀ R (x − ω)),

where |R| denotes the matrix determinant. When taking n independent samples from a fixed multivariate normal distribution the probability of observing a sequence x^n is given by

    f(x^n | ω, R) = ∏_{t=1}^n f(x_t | ω, R).

The graphical model of the multivariate normal distribution is given in Fig. 4.15.
For the definition of the Wishart distribution we first have to recall the definition
of a matrix trace.
Definition 4.3.9 The trace of an n × n square matrix A with entries aij is defined as

    trace(A) ≜ Σ_{i=1}^n aii .

Definition 4.3.10 (Wishart distribution) The Wishart distribution is a matrix distribution on R^{K×K} with n > K − 1 degrees of freedom and precision matrix T ∈ R^{K×K}. Its probability density function is given by

    f(V | n, T) ∝ |T|^{n/2} |V|^{(n−K−1)/2} e^{−(1/2) trace(T V)}

for positive-definite V ∈ R^{K×K}.

Fig. 4.15 Multivariate normal graphical model

Fig. 4.16 Normal-Wishart graphical model

3 As before, the precision is the inverse of the covariance.

Construction of the Wishart distribution. Let x^n be a sequence of n samples drawn independently from a multivariate normal distribution with mean ω ∈ R^K and precision matrix T ∈ R^{K×K}, that is, x^n ∼ N^n(ω, T^{−1}). Let x̄_n be the respective empirical mean, and define the covariance matrix S = Σ_{t=1}^n (x_t − x̄_n)(x_t − x̄_n)ᵀ. Then S has a Wishart distribution with n − 1 degrees of freedom and precision matrix T, and we write S ∼ Wish(n − 1, T).

Normal-Wishart conjugate prior (Fig. 4.16)

Theorem 4.3.1 Let x^n be a sample from a multivariate normal distribution on R^K with unknown mean ω ∈ R^K and precision R ∈ R^{K×K} whose joint prior distribution satisfies

    ω | R ∼ N(μ, (τ R)^{−1}),    R ∼ Wish(α, T),

with τ > 0, α > K − 1, T > 0. Then the posterior distribution is

    ω | R ∼ N( (τμ + n x̄_n)/(τ + n), [(τ + n) R]^{−1} ),
    R ∼ Wish( α + n, T + S + (τ n/(τ + n)) (μ − x̄_n)(μ − x̄_n)ᵀ ),

where S = Σ_{t=1}^n (x_t − x̄_n)(x_t − x̄_n)ᵀ.
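A sketch of the posterior parameter computation in Theorem 4.3.1 for K = 2 dimensions, with assumed prior values μ, τ, α, T:

```python
import numpy as np

rng = np.random.default_rng(0)
K = 2
mu, tau = np.zeros(K), 1.0
alpha, T = K + 1.0, np.eye(K)                  # assumed Wishart prior

x = rng.multivariate_normal([1.0, -1.0], 0.25 * np.eye(K), size=100)
n, xbar = len(x), x.mean(axis=0)
S = (x - xbar).T @ (x - xbar)                  # scatter matrix

mu_post = (tau * mu + n * xbar) / (tau + n)
tau_post, alpha_post = tau + n, alpha + n
T_post = T + S + (tau * n / (tau + n)) * np.outer(mu - xbar, mu - xbar)
print(mu_post, alpha_post)
```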
4.4 Credible Intervals

In general, according to our current belief ξ there is a certain subjective probability


that some unknown parameter ω takes a certain value. However, we are not always
interested in the precise probability distribution itself. Instead, we can use the com-
plete distribution to describe an interval that we think contains the true value of the
unknown parameter. In Bayesian parlance, this is called a credible interval.
Definition 4.4.1 (Credible interval) Given some probability measure ξ on Ω representing our belief and some interval (or set) A ⊂ Ω,

    ξ(A) = ∫_A dξ = P(ω ∈ A | ξ)

is our subjective belief that the unknown parameter ω is in A. If ξ(A) = s, then we say that A is an s-credible interval (or set), or an interval of size (or measure) s.
As an example, for prior distributions on R one can construct an s-credible interval by finding ωl, ωu ∈ R such that

    ξ([ωl, ωu]) = s.

Note that ωl, ωu are not unique and any choice satisfying the condition is valid. However, typically the interval is chosen so as to exclude the tails (extremes) of the distribution and is centered at the maximum. Figure 4.17 shows the 90% credible interval for the Bernoulli parameter of Example 4.2 after 1000 observations, that is, the measure of A under ξ is ξ(A) = 0.9. We see that the true parameter ω = 0.6 lies slightly outside it.
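For a Beta posterior, an equal-tailed s-credible interval can be read off from the quantile function; the sketch below uses assumed posterior parameters roughly matching the coin example:

```python
from scipy.stats import beta

s = 0.9
posterior = beta(100 + 620, 100 + 380)   # assumed Beta posterior after 1000 tosses
omega_l = posterior.ppf((1 - s) / 2)
omega_u = posterior.ppf((1 + s) / 2)
print(omega_l, omega_u)                  # xi([omega_l, omega_u]) = 0.9
```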
Reliability of Credible Intervals
Let φ, ξ0 be probability measures on the parameter set Ω with ξ0 being our prior
belief and φ the actual distribution of ω ∈ Ω. Each ω defines a measure Pω on the

Fig. 4.17 90% credible interval after 1000 observations from a Bernoulli with ω = 0.6

observation set S. We would like to construct for any n ∈ N a credible interval An ⊂ Ω that has measure s = ξn(An) after n observations. If we denote by Q ≜ ∫_Ω Pω dφ(ω) the marginal distribution on S, then the probability that the credible interval An does not include ω is

    Q({x^n ∈ S^n | ω ∉ An}).

The probability that the true value of ω will be within a particular credible interval
depends on how well the prior ξ0 matches the true distribution from which the
parameter ω was drawn. This is illustrated in the following experimental setup,
where we check how often a 50% credible interval fails.

Experimental testing of a credible interval

Given a probability family P = {Pω | ω ∈ Ω}.
Nature chooses distribution φ over Ω.
Choose distribution ξ0 over Ω.
for i = 1, . . . , N do
  Draw ωi ∼ φ.
  Draw x^n | ωi ∼ Pωi.
  for t = 1, . . . , n do
    Calculate ξt(·) = ξ0(· | x^t).
    Construct At such that ξt(At) = 0.5.
    Check failure and set εt,i = I{ωi ∉ At}.
  end for
end for
Average over all i, i.e., set ε̄t = (1/N) Σ_{i=1}^N εt,i for each t.

We performed this experiment constructing credible intervals for a Bernoulli parameter ω using N = 1000 trials and n = 100 observations per trial. Figure 4.18 illustrates what happens when we choose ξ0 = φ. We see that the credible interval is always centered around our initial mean guess and is quite tight. Figure 4.19 shows the failure rate, i.e., how often the credible interval At around our estimated mean does not contain the actual value of ωi. Since the measure of our interval At is always ξt(At) = 1/2, we expect our error probability to be 1/2, and this is borne out by the experimental results.
Fig. 4.18 50% credible intervals for a prior Beta(10, 10), ξ0 matching the distribution of ω

Fig. 4.19 Failure rate of 50% credible intervals for a prior Beta(10, 10), ξ0 matching the distribution of ω

On the other hand, Fig. 4.20 illustrates what happens when ξ0 ≠ φ. In fact, we used ωi = 0.6 for all trials i. Formally, this is a Dirac distribution concentrated at ω = 0.6, usually notated as φ(ω) = δ(ω − 0.6). We see that the credible interval is always centered around our initial mean guess and that it is always quite tight. Figure 4.21 shows the average number of failures. We see that initially, due to the fact that our prior is different from the true distribution, we make more mistakes than in the previous case. However, eventually, our prior is swamped by the data and the error rate converges to 50%.
Fig. 4.20 50% credible intervals for a prior Beta(10, 10), when ξ0 does not match the distribution of ω = 0.6

Fig. 4.21 Failure rate of 50% credible interval for a prior Beta(10, 10), when ξ0 does not match the distribution of ω = 0.6

4.5 Concentration Inequalities

While Bayesian ideas are useful, as they allow us to express our subjective beliefs
about a particular unknown quantity, they nevertheless are difficult to employ when
we have no good intuition about what prior to use. One could look at the Bayesian
estimation problem as a minimax game between us and nature and consider bounds
with respect to the worst possible prior distribution. However, even in that case we
must select a family of distributions and priors.
In this section we will examine guarantees we can give about any calculation we
make from observations with minimal assumptions about the distribution generating
these observations. These findings are fundamental, in the sense that they rely on a
very general phenomenon, called concentration of measure. As a consequence, they
are much stronger than results such as the central limit theorem (which we will not
cover in this textbook).
Here we shall focus on the most common application of calculating the sample
mean, as given in Definition 4.2.1. We have seen that e.g. for the Beta-Bernoulli
conjugate prior, it is a simple enough matter to compute a posterior distribution. From
that, we can obtain a credible interval on the expected value of the unknown Bernoulli
distribution. However, we would like to do the same for arbitrary distributions on
[0, 1], rather than just the Bernoulli. We shall now give an overview of a set of tools
that can be used to do this.

Theorem 4.5.1 (Markov's inequality) If X ∼ P with P a distribution on [0, ∞), then

    P(X ≥ u) ≤ EX / u,        (4.5.1)

where P(X ≥ u) = P({x | x ≥ u}).

Proof By definition of the expectation, for any u

    EX = ∫_0^∞ x dP(x) = ∫_0^u x dP(x) + ∫_u^∞ x dP(x) ≥ 0 + ∫_u^∞ u dP(x) = u P(X ≥ u).  □

Consequently, if x̄n is the empirical mean after n observations, for a random variable X with expectation EX = μ, we can use Markov's inequality to obtain P(|x̄n − μ| ≥ ε) ≤ E|x̄n − μ| / ε. In particular, for X ∈ [0, 1], we obtain the bound

    P(|x̄n − μ| ≥ ε) ≤ 1/ε.

Unfortunately, this bound does not improve for a larger number of observations n.
However, we can get significantly better bounds through various transformations,
using Markov’s inequality as a building block in other inequalities. The first of those
is Chebyshev’s inequality.
Theorem 4.5.2 (Chebyshev's inequality) Let X be a random variable with expectation EX = μ and variance V X = σ². Then, for all k > 0,

P(|X − μ| ≥ kσ) ≤ 1/k².   (4.5.2)

Proof First note that for monotonic f,

P(X ≥ u) = P(f(X) ≥ f(u)),   (4.5.3)

as {x | x ≥ u} = {x | f(x) ≥ f(u)}. Then we can derive

P(|X − μ| ≥ kσ) = P(|X − μ|/(kσ) ≥ 1) = P(|X − μ|²/(k²σ²) ≥ 1)   (by (4.5.3))
               ≤ E[(X − μ)²]/(k²σ²) = 1/k².   (by (4.5.1))  □
Chebyshev’s inequality can in turn be used to obtain confidence bounds that
improve with the number of samples n.

Example 4.3 (Application to sample mean) It is easy to show that the sample mean
x̄n has expectation μ and variance σ²/n, and we obtain from (4.5.2)

P(|x̄n − μ| ≥ kσ/√n) ≤ 1/k².

Setting ε = kσ/√n we get k = √n ε/σ and hence

P(|x̄n − μ| ≥ ε) ≤ σ²/(nε²).

4.5.1 Chernoff-Hoeffding Bounds

The inequality derived in Example 4.3 can be quite loose. In fact, one can prove
tighter bounds for the estimation of an expected value by a different application of
the Markov inequality, due to Chernoff.

Main idea of Chernoff bounds.
Let Sn = Σ_{t=1}^n xt, with xt ∼ P independently, i.e., x^n ∼ P^n. By definition,
from Markov's inequality we obtain for any θ > 0

P(Sn ≥ u) = P(exp(θSn) ≥ exp(θu)) ≤ exp(−θu) E exp(θSn) = exp(−θu) ∏_{t=1}^n E exp(θxt).   (4.5.4)

Theorem 4.5.3 (Hoeffding's inequality, [1]) For t = 1, 2, . . . , n let xt ∼ Pt be independent random variables with xt ∈ [at, bt] and Ext = μt. Then

P(x̄n − μ ≥ ε) ≤ exp(−2n²ε² / Σ_{t=1}^n (bt − at)²),   (4.5.5)

where μ = (1/n) Σ_{t=1}^n μt.

Proof Applying (4.5.4) to random variables xt − μt so that Sn = n(x̄n − μ) and setting u = nε gives

P(x̄n − μ ≥ ε) = P(Sn ≥ nε) ≤ exp(−θnε) ∏_{t=1}^n E exp(θ(xt − μt)).   (4.5.6)

Applying Jensen's inequality directly to the expectation does not help. However, we
can use convexity in another way: Let f(z) be the following linear upper bound on
exp(θz) on the interval [a, b]:

f(z) ≜ [(b − z)/(b − a)] exp(θa) + [(z − a)/(b − a)] exp(θb) ≥ exp(θz).

Then obviously E exp(θz) ≤ E f(z) for z ∈ [a, b]. We can use this to bound the term
inside the product in (4.5.6), setting z = xt − μt:

exp(θ(xt − μt)) ≤ [exp(−θμt)/(bt − at)] [(bt − μt) exp(θat) + (μt − at) exp(θbt)].

Bounding the expectation of this term by taking derivatives with respect to θ and
computing the second order Taylor expansion gives

E exp(θ(xt − μt)) ≤ exp(θ²(bt − at)²/8).

We conclude by plugging in this bound in (4.5.6):

P(x̄n − μ ≥ ε) ≤ exp(−θnε + (θ²/8) Σ_{t=1}^n (bt − at)²).

This is minimized for θ = 4nε / Σ_{t=1}^n (bt − at)², which proves the required result.  □
We can apply this inequality directly to the sample mean example and obtain for xt ∈ [0, 1]

P(|x̄n − μ| ≥ ε) ≤ 2 exp(−2nε²).
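To get a sense of how much tighter this is than the Chebyshev bound of Example 4.3, the following sketch (in Python; an illustration added here, not part of the original text) computes the interval half-width ε implied by each bound for variables in [0, 1] and a target error probability δ, using σ² ≤ 1/4 for Chebyshev.

import numpy as np

def chebyshev_eps(n, delta, var=0.25):
    # Solve var / (n * eps^2) = delta for eps; var <= 1/4 for variables in [0, 1].
    return np.sqrt(var / (n * delta))

def hoeffding_eps(n, delta):
    # Solve 2 * exp(-2 * n * eps^2) = delta for eps.
    return np.sqrt(np.log(2.0 / delta) / (2.0 * n))

if __name__ == "__main__":
    delta = 0.05
    for n in (10, 100, 1000, 10000):
        print(n, chebyshev_eps(n, delta), hoeffding_eps(n, delta))

Both widths shrink at the 1/√n rate, but the Hoeffding width has a much smaller constant and depends only logarithmically on δ, whereas the Chebyshev width grows like 1/√δ.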

4.6 Approximate Bayesian Approaches

Unfortunately, exact computation of the posterior distributions is only possible in


special cases. In this section, we give a brief overview of some classic methods for
approximate Bayesian inference. The first, Monte Carlo methods, rely on stochastic
approximations of the posterior distributions where at least the likelihood function is
computable. The second, approximate Bayesian computation, extends Monte Carlo
methods to the case where the probability function is incomputable or not available
at all. In the third, which includes variational Bayes methods, we replace distribu-
tions with an analytic approximation. Finally, in empirical Bayes methods, some
parameters are replaced by an empirical estimate.

4.6.1 Monte Carlo Inference

Monte Carlo inference has been a cornerstone of approximate Bayesian statistics
ever since computing power was sufficient for such methods to become practical.
Let us begin with a simple example, that of estimating expectations.

Definition 4.6.1 Let f : S → [0, 1] and P be a measure on S. Then

E_P f ≜ ∫_S f(x) dP(x).

Estimating the expectation E_P f is relatively easy as long as we can generate
samples from P. Then a simple average provides an estimate that can be computed
fast and whose error can be bounded by Hoeffding's inequality.

Corollary 4.6.1 Let x^n = (x1, . . . , xn) be a sample of size n with xt ∼ P and f : S → [0, 1]. Setting f̂n ≜ (1/n) Σ_t f(xt), it holds that

P({x^n ∈ S^n | |f̂n − E f| ≥ ε}) ≤ 2 exp(−2nε²).
This technique can also be used to calculate posterior distributions.


Example 4.4 (Calculation of posterior distributions) Assume a probability family
P = {Pω | ω ∈ Ω} and a prior distribution ξ on Ω such that we can draw ω ∼ ξ.
The posterior distribution is given by (4.1.1), where we can write the numerator as

∫_B Pω(x) dξ(ω) = ∫_Ω I{ω ∈ B} Pω(x) dξ(ω) = Eξ[I{ω ∈ B} Pω(x)].   (4.6.1)

Similarly, the denominator of (4.1.1) can be written as Eξ[Pω(x)]. If Pω is bounded,
then the error can be bounded too.
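As a concrete illustration of (4.6.1), here is a minimal sketch in Python that estimates ξ(ω ∈ B | x) by drawing samples from the prior and taking the ratio of the two Monte Carlo averages. The Bernoulli likelihood, the Beta(2, 2) prior and the set B = {ω > 1/2} are assumptions made for this example only.

import numpy as np

rng = np.random.default_rng(0)

def mc_posterior_prob(x, B, n_samples=100_000):
    # Monte Carlo estimate of xi(omega in B | x) for a Bernoulli model with a Beta(2, 2) prior.
    omega = rng.beta(2.0, 2.0, size=n_samples)      # draws from the prior xi
    k, n = x.sum(), len(x)
    lik = omega**k * (1.0 - omega)**(n - k)         # P_omega(x) for each draw
    numer = np.mean(np.where(B(omega), lik, 0.0))   # estimates E_xi[I{omega in B} P_omega(x)]
    denom = np.mean(lik)                            # estimates E_xi[P_omega(x)]
    return numer / denom

if __name__ == "__main__":
    x = rng.binomial(1, 0.7, size=50)               # data generated with omega = 0.7
    print(mc_posterior_prob(x, lambda w: w > 0.5))  # posterior mass on {omega > 1/2}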
An extension of this approach involves Markov chain Monte Carlo (MCMC)
methods. These are sequential sampling procedures where data are sampled itera-
tively. At the t-th iteration, we obtain a sample xt ∼ Q t , where Q t depends on the
previous sample xt−1 . Even if one can guarantee that Q t → P, there is no easy way
to determine a priori when the procedure has converged. For more details see for
example [2].

4.6.2 Approximate Bayesian Computation

The main problem approximate Bayesian computation (ABC) aims to solve is how
to weight the evidence we have for or against different models. The assumption is
that we have a family of models {Mω | ω ∈ Ω}, from which we can generate data.
However, there is no easy way to calculate the probability of any model having
generated the data. On the other hand, like in the standard Bayesian setting, we can
start with a prior ξ over Ω and, given some data x ∈ S, we wish to calculate the
posterior ξ(ω | x). ABC methods generally rely on what is called an approximate
statistic in order to weight the relative likelihood of models given the data.
An approximate statistic φ : S → S′ maps the data to some lower dimensional
space S′. Then it is possible to compare different data points in terms of how similar
their statistics are. For this, one also defines some distance D : S′ × S′ → R+.
ABC methods are useful in two specific situations. The first is when the family
of models that we consider has an intractable likelihood. This means that calculating
Mω (x) is prohibitively expensive. The second is in some applications which admit a
class of parametrized simulators which have no probabilistic description. Then one
reasonable approach is to find the best simulator in the class and apply it to the actual
problem.
The simplest algorithm in this context is ABC Rejection Sampling, shown as
Algorithm 4.1. Here, we repeatedly sample a model from the prior distribution, and
then generate data x̂ from the model. If the sampled data is ε-close to the original data
in terms of the statistic, we accept the sample as an approximate posterior sample.

Algorithm 4.1 ABC Rejection Sampling from ξ(ω | x).
input prior ξ, data x, generative model family {Mω | ω ∈ Ω}, statistic φ, error bound ε.
repeat
  ω̂ ∼ ξ
  x̂ ∼ Mω̂
until D[φ(x), φ(x̂)] ≤ ε
Return ω̂

For an overview of ABC methods see [3, 4]. Early ABC methods were developed
for applications such as econometric modelling, e.g. [5], where detailed simulators
but no useful analytical probabilistic models were available. ABC methods have also
been used for inference in dynamical systems, e.g. [6], and the reinforcement learning
problem [7, 8].
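Algorithm 4.1 is short enough to implement directly. The sketch below is only an illustration under assumed choices: a Gaussian simulator family with unknown mean, the sample mean as the statistic φ, and the absolute difference as the distance D; none of these choices are prescribed by the text.

import numpy as np

rng = np.random.default_rng(1)

def abc_rejection_sample(x, prior_sample, simulate, statistic, distance, eps):
    # One draw from the ABC approximation of xi(omega | x), following Algorithm 4.1.
    phi_x = statistic(x)
    while True:
        omega = prior_sample()            # omega-hat ~ xi
        x_hat = simulate(omega, len(x))   # x-hat ~ M_omega-hat
        if distance(phi_x, statistic(x_hat)) <= eps:
            return omega

if __name__ == "__main__":
    x = rng.normal(2.0, 1.0, size=100)    # observed data (illustrative)
    samples = [
        abc_rejection_sample(
            x,
            prior_sample=lambda: rng.normal(0.0, 5.0),
            simulate=lambda w, n: rng.normal(w, 1.0, size=n),
            statistic=np.mean,
            distance=lambda a, b: abs(a - b),
            eps=0.1,
        )
        for _ in range(200)
    ]
    print(np.mean(samples), np.std(samples))   # concentrates around the true mean 2.0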

4.6.3 Analytic Approximations of the Posterior

Another type of approximation involves substituting complex distributions with


members from a simpler family. For example, one could replace a multimodal poste-
rior distribution ξ(ω | x) with a Gaussian. A more principled approximation would
involve selecting a distribution that is the closest with respect to some divergence or
distance. In particular, we would like to approximate the target distribution ξ(ω | x)
with some other distribution Q θ (ω) in a family {Q θ | θ ∈ Θ} such that the distance
between ξ(ω | x) and Q θ (ω) is minimal. For measuring the latter one can use the total
variation or the Wasserstein distance. However, the most popular algorithms employ
the KL-divergence

D(Q ∥ P) ≜ ∫_Ω ln (dQ/dP) dQ.

As the KL-divergence is asymmetric, its use results in two distinct approximation


methods, variational Bayes and expectation propagation.
Variational approximation. In this formulation, we wish to minimize the KL-divergence

D(Qθ ∥ ξ|x) = ∫_Ω ln (dQθ/dξ|x) dQθ,

where ξ|x is shorthand for the distribution ξ(ω | x). An efficient method for minimizing this divergence is rewriting it as

D(Qθ ∥ ξ|x) = −∫_Ω ln (dξ|x/dQθ) dQθ
            = −∫_Ω ln (dξ(·, x)/dQθ) dQθ + ln ξ(x),

where ξ(·, x) is shorthand for the joint distribution ξ(ω, x) for a fixed value of x. As the
second term does not depend on θ, we can find the best element of the family by
computing

max_{θ∈Θ} ∫_Ω ln (dξ(·, x)/dQθ) dQθ,

where the term we are maximizing can also be seen as a lower bound on the marginal
log-likelihood.
Expectation propagation. The other direction requires us to minimize the divergence

D(ξ|x ∥ Qθ) = ∫_Ω ln (dξ|x/dQθ) dξ|x.

A respective algorithm in the case of data terms that are independent given the
parameter is expectation propagation [9]. There, the approximation has a factored
form and is iteratively updated, with each term minimizing the KL-divergence while
keeping the remaining terms fixed.

4.6.4 Maximum Likelihood and Empirical Bayes Methods

When it is not necessary to have a full posterior distribution, some parameter may
be estimated point-wise. One simple such approach is maximum likelihood. In the
simplest case, we replace the posterior distribution ξ(ω | x) with a point estimate
corresponding to the parameter value that maximizes the likelihood, that is,

ω_ML ∈ arg max_ω Pω(x).

Alternatively, the maximum a posteriori parameter may be obtained by

ω_MAP ∈ arg max_ω ξ(ω | x).

In the latter case, even though we cannot compute the full function ξ(ω | x), we can
still maximize (perhaps locally) for ω.
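To make the difference between the two point estimates concrete, the following sketch computes ω_ML and ω_MAP on a grid for an assumed Bernoulli model with a Beta(α, β) prior; the model, the prior, and the grid search are illustrative choices, not prescribed by the text.

import numpy as np

def ml_and_map(x, alpha=5.0, beta=5.0, grid_size=10_001):
    # Grid-based omega_ML and omega_MAP for Bernoulli data with a Beta(alpha, beta) prior.
    omega = np.linspace(1e-6, 1 - 1e-6, grid_size)
    k, n = x.sum(), len(x)
    log_lik = k * np.log(omega) + (n - k) * np.log(1.0 - omega)
    log_prior = (alpha - 1) * np.log(omega) + (beta - 1) * np.log(1.0 - omega)
    return omega[np.argmax(log_lik)], omega[np.argmax(log_lik + log_prior)]

if __name__ == "__main__":
    rng = np.random.default_rng(2)
    x = rng.binomial(1, 0.9, size=20)
    print(ml_and_map(x))   # the MAP estimate is pulled towards the prior mean 1/2

With few observations the MAP estimate is shrunk towards the prior mean, while both estimates coincide once the data swamp the prior.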

More generally, there might be some parameters φ for which we actually can
compute a posterior distribution, and some other parameters ω for which we can
not. Then we can perform Bayesian inference for the φ parameters and maximum-
likelihood for the ω parameters. This generally falls within the domain of Empirical
Bayes methods, pioneered by [10]. These replace some parameters by an empirical
estimate not necessarily corresponding to the maximum likelihood. These methods
are quite diverse [10–14] and unfortunately beyond the scope of this book.

References

1. Hoeffding, W.: Probability inequalities for sums of bounded random variables. J. Am. Stat.
Assoc. 58(301), 13–30 (1963)
2. Robert, C.P., Casella, G.: Monte Carlo Statistical Methods. Springer Texts in Statistics.
Springer (1999)
3. Csilléry, K., Blum, M.G.B., Gaggiotti, O.E., François, O.: Approximate Bayesian computation
(ABC) in practice. Trends Ecol. Evol. 25(7), 410–418 (2010)
4. Marin, J.-M., Pudlo, P., Robert, C.P., Ryder, R.J.: Approximate Bayesian computational meth-
ods. Stat. Comput. 22(6), 1167–1180 (2012)
5. Geweke, J.F.: Using simulation methods for Bayesian econometric models: Inference, devel-
opment, and communication. Econ. Rev. 18(1), 1–73 (1999)
6. Toni, T., Welch, D., Strelkowa, N., Ipsen, A., Stumpf, M.P.H.: Approximate Bayesian com-
putation scheme for parameter inference and model selection in dynamical systems. J. Royal
Soc. Interf. 6(31), 187–202 (2009)
7. Dimitrakakis, C., Tziortziotis, N.: ABC reinforcement learning. In: Proceedings of the 30th
International Conference on Machine Learning, ICML 2013, pp. 684–692. JMLR.org (2013)
8. Dimitrakakis, C., Tziortziotis, N.: Usable ABC reinforcement learning. In: NIPS 2014 Work-
shop: ABC in Montreal (2014)
9. Minka, T.P.: Expectation propagation for approximate Bayesian inference. In: UAI ’01: Pro-
ceedings of the 17th Conference in Uncertainty in Artificial Intelligence, pp. 362–369. Morgan
Kaufmann (2001)
10. Robbins, H.: An empirical Bayes approach to statistics. In: Neyman, J. (ed.) Proceedings of
the Third Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Contri-
butions to the Theory of Statistics. University of California Press, Berkeley, CA (1955)
11. Laird, N.M., Louis, T.A.: Empirical Bayes confidence intervals based on bootstrap samples. J.
Am. Stat. Assoc. 82(399), 739–750 (1987)
12. Lwin, T., Maritz, J.S.: Empirical Bayes approach to multiparameter estimation: with special
reference to multinomial distribution. Ann. Inst. Stat. Math. 41(1), 81–99 (1989)
13. Robbins, H.: The empirical Bayes approach to statistical decision problems. Ann. Math. Stat.
35(1), 1–20 (1964)
14. Deely, J.J., Lindley, D.V.: Bayes empirical Bayes. J. Am. Stat. Assoc. 76(376), 833–841 (1981)
Chapter 5
Sequential Sampling

5.1 Gains From Sequential Sampling

So far, we have mainly considered decision problems where the sample size was fixed.
However, frequently the sample size can also be part of the decision. Since normally
larger sample sizes give us more information, in this case the decision problem is only
interesting when obtaining new samples has a cost. Consider the following example.

Example 5.1 Consider that you have 100 produced items and you want to determine
whether there are fewer than 10 faulty items among them. If testing has some cost,
it pays off to think about whether it is possible to do without testing all 100 items.
Indeed, this is possible by the following simple online testing scheme: You test one
item after another until you either have discovered 10 faulty items or 91 good items.
In either case you have the correct answer at considerably lower cost than when
testing all items.
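A small simulation makes the savings concrete. The following sketch (illustrative; the true number of faulty items is an assumption) runs the online testing scheme of Example 5.1 and reports the average number of tests needed.

import numpy as np

rng = np.random.default_rng(3)

def tests_needed(n_items=100, n_faulty_true=12, faulty_threshold=10):
    # Test items one by one; stop after finding `faulty_threshold` faulty items
    # or n_items - faulty_threshold + 1 good ones (here: 10 faulty or 91 good).
    good_threshold = n_items - faulty_threshold + 1
    items = np.zeros(n_items, dtype=bool)
    items[:n_faulty_true] = True          # True marks a faulty item
    rng.shuffle(items)
    faulty = good = 0
    for t, is_faulty in enumerate(items, start=1):
        faulty += is_faulty
        good += not is_faulty
        if faulty >= faulty_threshold or good >= good_threshold:
            return t
    return n_items

if __name__ == "__main__":
    print(np.mean([tests_needed() for _ in range(10_000)]))  # typically well below 100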

A sequential sample from some unknown distribution P is generated as follows.


First, let us fix notation and assume that each new sample xi we obtain belongs to
some alphabet X, so that at time t, we have observed x1, . . . , xt ∈ X^t. It is also
convenient to define the set of all sequences in the alphabet X as X* ≜ ∪_{t=0}^∞ X^t.

The distribution P defines a probability on X so that xt+1 may depend on the
previous samples x1 , . . . , xt in an arbitrary manner. At any time t, we can either
stop sampling or obtain one more observation xt+1 . A sample obtained in this way
is called a sequential sample. More formally, we give the following definition:
Definition 5.1.1 (Sequential sampling) A sequential sampling procedure on a probability space¹ (X*, B(X*), P) involves a stopping function πs : X* → {0, 1}, such
that we stop sampling at time t if and only if πs(x^t) = 1; otherwise we obtain a new
sample xt+1 | x^t ∼ P(· | x^t).

1 This is simply a sample space and associated algebra, together with a probability measure.


Thus, the sample obtained depends both on P and the sampling procedure πs . In
our setting, we don’t just want to sample sequentially, but also to take some action
after sampling is complete. For that reason, we can generalize the above definition
to sequential decision procedures.
Definition 5.1.2 (Sequential decision procedure) A sequential decision procedure
π = (πs , πd ) is a tuple composed of
1. a stopping rule πs : X ∗ → {0, 1} and
2. a decision rule πd : X ∗ → A.
The stopping rule πs specifies whether, at any given time, we should stop and make
a decision in A or take one more sample. That is, stop if

πs (x t ) = 1,

otherwise observe xt+1 . Once we have stopped (i.e. πs (x t ) = 1), we choose the
decision
πd (x t ).

Deterministic stopping rules If the stopping rule πs is deterministic, then for
any t, there exists some stopping set Bt ⊂ X^t such that

πs(x^t) = 1 if x^t ∈ Bt, and πs(x^t) = 0 if x^t ∉ Bt.   (5.1)

As with any Bayesian decision problem, it is sufficient to consider only deterministic
decision rules.
We are interested in sequential sampling problems especially when there is a
reason for us to stop sampling early enough, like in the case when we incur a cost
with each sample we take. A detailed example is given in the following section.

5.1.1 An Example: Sampling with Costs

We once more consider problems where we have some observations x1 , x2 , . . ., with


xt ∈ X , which are drawn from some distribution with parameter θ ∈ Θ, or more pre-
cisely from a family P = {Pθ | θ ∈ Θ}, such that each (X , B (X ) , Pθ ) is a prob-
ability space for all θ ∈ Θ. Since we take repeated observations, the probability
of a sequence x n = x1 , . . . , xn under an i.i.d. model θ is Pθn (x n ). We have a prior
probability measure ξ on B (Θ) for the unknown parameter, and we wish to take
an action a ∈ A that maximizes the expected utility according to a utility function
u : Θ × A → R.
In the classical case, we obtain a complete sample of fixed size n, x n = (x1 , . . . , xn )
and calculate a posterior measure ξ(· | x n ). We then take the decision maximizing

the expected utility according to our posterior. Now consider the case of sampling
with costs, such that a sample of size n results in a cost of cn. For that reason we
define a new utility function U which depends on the number of observations we
have.

Samples with costs

U(θ, a, x^n) = u(θ, a) − cn,   (5.2)
Eξ(U | a, x^n) = ∫_Θ u(θ, a) dξ(θ | x^n) − cn.   (5.3)

In the remainder of this section, we shall consider the following simple decision
problem, where we need to make a decision about the value of an unknown parameter.
As we get more data, we have a better chance of discovering the right parameter.
However, there is always a small chance of getting no information.
Example 5.2 Consider the following decision problem, where the goal is to distin-
guish between two possible hypotheses θ1 , θ2 , with corresponding decisions a1 , a2 .
We have three possible observations {1, 2, 3}, with 1, 2 being more likely under the
first and second hypothesis, respectively. However, the third observation gives us no
information about the hypothesis, as its probability is the same under θ1 and θ2 . In
this problem γ is the probability that we obtain an uninformative sample.
• Parameters: Θ = {θ1 , θ2 }.
• Decisions: A = {a1 , a2 }.
• Observation distribution f i (k) = Pθi (xt = k) for all t with

f 1 (1) = 1 − γ, f 1 (2) = 0, f 1 (3) = γ, (5.4)


f 2 (1) = 0, f 2 (2) = 1 − γ, f 2 (3) = γ. (5.5)

• Local utility: u(θi , a j ) = 0, for i = j and b < 0 otherwise.


• Prior: Pξ (θ = θ1 ) = ξ = 1 − Pξ (θ = θ2 ).
• Observation cost per sample: c.
At any step t, you have the option of continuing for one more step, or stopping and
taking an action in A. The question is what is the policy for sampling and selecting
an action that maximizes expected utility?
In this problem, it is immediately possible to distinguish θ1 from θ2 when you
observe xt = 1 or xt = 2. However, the values xt = 3 provide no information.
The expected utility of stopping if you have only observed 3’s after t steps is ξb − ct.
In fact, if your posterior parameter after t steps is ξt , then the expected utility of stop-
ping is b min {ξt , 1 − ξt } − ct. In general, you should expect ξt to approach 0 or 1
with high probability, and hence taking more samples is better. However, if we pay

utility −c for each additional sample, there is a point of diminishing returns, after
which it will not be worthwhile to take any more samples.
We first investigate the setting where the number of observations is fixed. In
particular, the value of the optimal procedure taking n observations is defined to be
the expectation of the maximal a posteriori utility given x^n, i.e.,

V(n) = Σ_{x^n} P^n_ξ(x^n) max_a Eξ(U | x^n, a),

where P^n_ξ = ∫_Θ P^n_θ dξ(θ) is the marginal distribution over n observations. For this
specific example, it is easy to calculate the value of the procedure that takes n observations, by noting the following facts.
(a) The probability of observing xt = 3 for all t = 1, . . . , n is γ n . Then we must
rely on our prior ξ to make a decision.
(b) If we observe any other sequence, we know the value of θ.
Consequently, the total value V (n) of the optimal procedure taking n observations is

V(n) = ξbγ^n − cn.

Based on this, we now want to find the optimal number of samples n. Since V
is a smooth function, an approximate maximizer can be found by viewing n as a
continuous variable.² Taking derivatives, we get

n* = (1/log γ) log(c/(ξb log γ)),
V(n*) = (c/log γ) (1 − log(c/(ξb log γ))).

The results of applying this procedure are illustrated in Fig. 5.1. Here we can see
that, for two different choices of priors, the optimal number of samples is different.
In both cases, there is a clear choice for how many samples to take, when we must
fix the number of samples before seeing any data.
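The computation behind Fig. 5.1 is easy to reproduce. The following sketch evaluates V(n) = ξbγ^n − cn over integer n and compares the best integer with the continuous approximation n*, using the parameter values of the figure.

import numpy as np

def V(n, xi, b, c, gamma):
    # Value of the fixed-sample-size procedure: xi * b * gamma^n - c * n.
    return xi * b * gamma**n - c * n

def n_star(xi, b, c, gamma):
    # Continuous approximation obtained by setting the derivative of V to zero.
    return np.log(c / (xi * b * np.log(gamma))) / np.log(gamma)

if __name__ == "__main__":
    gamma, b, c = 0.9, -10.0, 1e-2
    for xi in (0.1, 0.5):
        ns = np.arange(0, 200)
        best = ns[np.argmax(V(ns, xi, b, c, gamma))]
        print(xi, best, n_star(xi, b, c, gamma))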
However, we may not be constrained to fix the number of samples a priori. As
illustrated in Example 5.1, many times it is a good idea to adaptively decide when to
stop taking samples. This is illustrated by the following sequential procedure. Since
we already know that there is an optimal a priori number of steps n ∗ , we can choose
to look at all possible stopping times that are smaller or equal to n ∗ .

2 In the end, we can find the optimal maximizer by looking at the nearest two integers to the value
found.

Fig. 5.1 Illustration of P1, the procedure taking a fixed number of samples n. The value V(n) of taking exactly n observations under two different beliefs (ξ = 0.1 and ξ = 0.5), for γ = 0.9, b = −10, c = 10⁻²

P2. A sequential procedure stopping after at most n* steps

• If t < n*, use the stopping rule πs(x^t) = 1 iff xt ≠ 3.
• If t = n*, then stop.
• Our posterior after stopping is ξ(θ | x^n), where both x^n and the number of
observations n are random.

Since the probability of xt = 3 is always the same for both θ1 and θ2, we have

Eξ(n) = E(n | θ = θ1) = E(n | θ = θ2) ≤ n*.

We can calculate the expected number of observations as

Eξ(n) = Eξ(n | θ = θ1) = Σ_{t=1}^{n*} t Pξ(n = t | θ = θ1)
      = Σ_{t=1}^{n*−1} t γ^{t−1}(1 − γ) + n* γ^{n*−1} = (1 − γ^{n*})/(1 − γ),

using the formula for the geometric series. Consequently, the value of this proce-
dure is

V̄(n*) = Eξ(U | n = n*) Pξ(n = n*) + Eξ(U | n < n*) Pξ(n < n*)
       = ξbγ^{n*} − c Eξ(n),

and from the definition of n* we obtain

V̄(n*) = c/(γ − 1) + (c/log γ)(1 + c/(ξb(1 − γ))).

Fig. 5.2 The value of three strategies (fixed, bounded, unbounded) for ξ = 1/2, b = −10, c = 10⁻² and varying γ. Higher values of γ imply a longer time before the true θ is known

As you can see, there is a non-zero probability that n = n ∗ , at which time we will
have not resolved the true value of θ. In that case, we are still not better off than at
the very beginning of the procedure, when we had no observations. If our utility is
linear with the number of steps, it thus makes sense that we should still continue.
For that reason, we should consider unbounded procedures.
The unbounded procedure for our example is simply to use the stopping rule
πs(x^t) = 1 iff xt ≠ 3. Since we only obtain information whenever xt ≠ 3, and that
information is enough to fully decide θ, once we observe xt ≠ 3, we can make a
decision that has value 0, as we can guess correctly. So, the value of the unbounded
sequential procedure is just V* = −c Eξ(n), where

Eξ(n) = Σ_{t=1}^∞ t Pξ(n = t) = Σ_{t=1}^∞ t γ^{t−1}(1 − γ) = 1/(1 − γ),

again using the formula for the geometric series.


In the given example, it is clear that bounded procedures are (in expectation) better
than fixed-sampling procedures, as seen in Fig. 5.2. In turn, the unbounded procedure
is (in expectation) better than the bounded procedure. Of course, an unbounded
procedure may end up costing much more than taking a decision without observing
any data, as it disregards the costs up to time t. This relates to the economic idea of
sunk costs: since our utility is additive in terms of the cost, our optimal decision now
should not depend on previously accrued costs.
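The three curves of Fig. 5.2 can be reproduced directly from the expressions derived above: V(n*) for the fixed procedure, V̄(n*) = ξbγ^{n*} − c(1 − γ^{n*})/(1 − γ) for the bounded one, and V* = −c/(1 − γ) for the unbounded one. A short sketch:

import numpy as np

def strategy_values(xi, b, c, gamma):
    # Fixed, bounded and unbounded values for the example of Sect. 5.1.1.
    n_star = np.log(c / (xi * b * np.log(gamma))) / np.log(gamma)
    v_fixed = xi * b * gamma**n_star - c * n_star
    v_bounded = xi * b * gamma**n_star - c * (1 - gamma**n_star) / (1 - gamma)
    v_unbounded = -c / (1 - gamma)
    return v_fixed, v_bounded, v_unbounded

if __name__ == "__main__":
    for gamma in (0.5, 0.7, 0.9, 0.99):
        print(gamma, strategy_values(xi=0.5, b=-10.0, c=1e-2, gamma=gamma))

Over this range of γ the ordering V(n*) ≤ V̄(n*) ≤ V* is visible, matching Fig. 5.2.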

5.2 Optimal Sequential Sampling Procedures

We now turn our attention to the general case. While it is easy to define the optimal
stopping rule and decision in this simple example, how can we actually do the same

thing for arbitrary problems? The following section characterizes optimal sequential
sampling procedures and gives an algorithm for constructing them.
Once more, consider a distribution family P = {Pθ | θ ∈ Θ} and a prior ξ over
B (Θ). For a decision set A, a utility function U : Θ × A → R, and sampling
costs c, the utility of a sequential decision procedure π is the local utility at the
end of the procedure, minus the sampling cost. In expectation, this can be written as

U(ξ, π) = Eξ{u[θ, π(x^n)] − nc}.

Here the cost is inside the expectation, since the number of samples we take is
random. Summing over all the possible stopping times n, and taking Bn ⊂ X ∗ as the
set of observations for which we stop, we have
U(ξ, π) = Σ_{n=1}^∞ ∫_{Bn} Eξ[U(θ, π(x^n)) | x^n] dPξ(x^n) − Σ_{n=1}^∞ Pξ(Bn) nc
        = Σ_{n=1}^∞ ∫_{Bn} ∫_Θ U[θ, π(x^n)] dξ(θ | x^n) dPξ(x^n) − Σ_{n=1}^∞ Pξ(Bn) nc,

where Pξ is the marginal distribution under ξ. Although it may seem difficult to
evaluate this, it can be done by a simple dynamic programming technique called
backwards induction. We first give the algorithm for the case of bounded procedures
(i.e. procedures that must stop after a particular time) and later for unbounded ones.
Definition 5.2.1 (Bounded sequential decision procedure) A sequential decision
procedure is T -bounded for a positive integer T if Pξ (n ≤ T ) = 1. The procedure
is called bounded if it is T -bounded for some T .

We can analyse such a procedure by recursively analysing procedures for larger T ,


starting from the final point of the process and working our way backwards. Consider
a π that is T -bounded. Then we know that we shall take at most T samples. If the
process ends at stage T , we will have observed some sequence x T , which gives rise
to a posterior ξ(θ | x T ). Since we must stop at T , we must choose a maximizing
expected utility at that stage, that is,

Eξ[U | x^T, a] = ∫_Θ U(θ, a) dξ(θ | x^T).

Since we need not take another sample, the respective value (maximal expected
utility) of that stage is

V^0[ξ(· | x^T)] ≜ max_{a∈A} U(ξ(· | x^T), a),

where we introduce the notation V n to denote the expected utility, given that we are
stopping after at most n steps.

Fig. 5.3 An example of a sequential decision problem with two stages. The initial belief is ξ and
there are two possible subsequent beliefs, depending on whether we observe xt = 0 or xt = 1. At
each stage we pay c

More generally, we need to consider the effect on subsequent decisions. Consider


the following simple two-stage problem as an example. Let X = {0, 1} and ξ be the
prior on the θ parameter of Bern(θ). We wish to either decide immediately on a
parameter θ, or take one more observation, at cost c, before deciding. The problem
we consider has two stages, as illustrated in Fig. 5.3.
In this example, we begin with a prior ξ at the first stage. There are two possible
outcomes for the second stage.
1. If we observe x1 = 0 then our value is V 0 [ξ(· | x1 = 0)].
2. If we observe x1 = 1 then our value is V 0 [ξ(· | x1 = 1)].
At the first stage, we can:
1. Stop with value V^0(ξ).
2. Pay a sampling cost c for value V^0[ξ(· | x1)] with Pξ(x1) = ∫_Θ Pθ(x1) dξ(θ).
So the expected value of continuing for one more step is

V^1(ξ) ≜ ∫_X V^0[ξ(· | x1)] dPξ(x1).

Thus, the overall value for this problem is:

max{ V^0(ξ), Σ_{x1=0}^{1} V^0[ξ(· | x1)] Pξ(x1) − c }.

The above is simply the maximum of the value of stopping immediately (V 0 ), and
the value of continuing for at most one more step (V 1 ). This procedure can be applied
recursively for multi-stage problems, as explained below.

5.2.1 Multi-stage Problems

For simplicity, we use ξn to denote a posterior ξ(· | x^n), omitting the specific
value of x^n. For any specific ξn, there is a range of possible next beliefs ξn+1,

Fig. 5.4 A partial view of the multi-stage process

depending on what the value of the next observation xn is. This is illustrated in
Fig. 5.4, by extension from the previous two-stage example. The immediate value

V 0 (ξt ) = sup u(θ, a) dξt (θ)
a∈A Θ

is the expected value if we stop immediately at time t. The next-step value

Eξn V^0(ξn+1) = ∫_X V^0[ξn(· | xn)] dξn(xn)

is the expected value of the next step, ignoring the cost. Finally, the optimal value
at the n-th step is just the maximum of the value of stopping immediately and the
next-step value, that is,

V^1(ξn) = max{ V^0(ξn), Eξn V^0(ξn+1) − c }.

This procedure can be generalized over all steps 1, 2, . . . , T , to obtain a general


procedure.

5.2.2 Backwards Induction for Bounded Procedures

The main idea expressed in the previous section is to start from the last stage of
our decision problem, where the utility is known, and then move backwards. At
each stage, we know the probability of reaching different points in the next stage,
as well as their values. Consequently, we can compute the value of any point in the
current stage as well. This idea is formalised below, via the algorithm of backwards
induction.

Theorem 5.2.1 (Backwards induction) The utility of a T -bounded optimal proce-


dure with prior ξ0 is V T (ξ0 ) and is given by the recursion

V^{j+1}(ξn) = max{ V^0(ξn), Eξn V^j(ξn+1) − c }   (5.6)



for every belief ξn in the set of beliefs that arise from the prior ξ0 , with j = T − n.

The proof of this theorem follows by induction. However, we shall prove a more
general version in Chap. 6. Equation 5.6 essentially gives a recursive calculation of the
value of the T -bounded optimal procedure. To evaluate it, we first need to calculate
all possible beliefs ξ1, . . . , ξT. For each belief ξT, we calculate V^0(ξT). We then
move backwards, and calculate V^0(ξT−1) and V^1(ξT−1). Proceeding backwards, for
n = T − 1, T − 2, . . . , 1, we calculate V^{j+1}(ξn) for all beliefs ξn with j = T − n.
The value of the procedure also determines the optimal sampling strategy, as shown
by the following theorem.
Theorem 5.2.2 The optimal T -bounded procedure stops at time t if the value of
stopping at t is better than that of continuing, i.e. if

V 0 (ξt ) ≥ V T −t (ξt ).

This procedure chooses an action a maximizing Eξt U(θ, a), otherwise takes one more sample.
Finally, longer procedures (i.e. procedures that allow for stopping later) are always
better than shorter ones, as shown by the following theorem.
Theorem 5.2.3 For any probability measure ξ on Θ,

V n (ξ) ≤ V n+1 (ξ). (5.7)

That is, the procedure that stops after at most n steps is never better than the procedure
that stops after at most n + 1 time steps. To obtain an intuition of why this is the
case, consider the example of Sect. 5.1.1. In that example, if we have a sequence
of 3s, then we obtain no information. Consequently, when we compare the value of
a plan taking at most n samples with that of a plan taking at most n + 1 samples, we
see that the latter plan is better for the event where we obtain n 3s, but has the same
value for all other events.
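To see the recursion in action, here is a minimal sketch applying it to the example of Sect. 5.1.1, where a belief either stays at ξ (after an uninformative observation, with probability γ) or jumps to certainty, which has value 0. The function and the parameter values are illustrative.

def bounded_value(xi, gamma, b, c, T):
    # V^T(xi) for the two-hypothesis example, via
    # V^{j+1}(xi) = max{ V^0(xi), E[V^j(next belief)] - c }.
    def V0(x):
        return b * min(x, 1.0 - x)          # immediate value (b < 0)

    v = V0(xi)                              # V^0
    for _ in range(T):
        # Revealed beliefs (xi' = 0 or 1) have value 0, so only the
        # uninformative branch, reached with probability gamma, contributes.
        v = max(V0(xi), gamma * v - c)
    return v

if __name__ == "__main__":
    for T in (0, 1, 10, 50, 200):
        print(T, bounded_value(xi=0.5, gamma=0.9, b=-10.0, c=1e-2, T=T))

The values increase with T, in line with Theorem 5.2.3, and approach the unbounded value −c/(1 − γ) computed in Sect. 5.1.1.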

5.2.3 Unbounded Sequential Decision Procedures

Given the monotonicity of the value of bounded procedures (5.7), one may well
ask what is the value of unbounded procedures, i.e. procedures that may never stop
sampling. The value of an unbounded sampling and decision procedure π under
prior ξ is


U(ξ, π) = ∫_{X*} (V^0[ξ(· | x^n)] − cn) dP^π_ξ(x^n) = E^π_ξ (V^0[ξ(· | x^n)] − cn),

where Pξπ (x n ) is the probability that we observe samples x n and stop under the
marginal distribution defined by ξ and π, while n is the random number of samples
taken by π. As before, this is random because the observations x are random; π itself
can be deterministic.
Definition 5.2.2 (Regular procedure) Given a decision procedure π, let B>k (π) ⊂
X ∗ be the set of sequences such that π takes more than k samples. Then π is regular
if U (ξ, π) ≥ V 0 (ξ) and if, for all n ∈ N, and for all x n ∈ B>n (π)

U [ξ(· | x n ), π] ≥ V 0 [ξ(· | x n )] − cn, (5.8)

i.e., the expected utility given for any sample that starts with x n where we don’t stop,
is greater than that of stopping at n.
In other words, if π specifies that at least one observation should be taken, then the
value of π is greater than the value of choosing a decision without any observa-
tion. Furthermore, whenever π specifies that another observation should be taken,
the expected value of continuing must be larger than the value of stopping. If the
procedure is not regular, then there may be stages where the procedure specifies that
sampling should be continued, though the value may not increase by doing so.
Theorem 5.2.4 If π is not regular, then there exists a regular π′ such that U(ξ, π′) ≥ U(ξ, π).

Proof First, consider the case that π is not regular because U(ξ, π) ≤ V^0(ξ). Then
π′ can be the regular procedure which chooses a ∈ A without any observations.
Now consider the case that U(ξ, π) > V^0(ξ) and that π specifies at least one
sample should be taken. Let π′ be the procedure which stops as soon as the observed
x^n does not satisfy (5.8).
If π stops, then both sides of (5.8) are equal, as the value of stopping immediately
is at least as high as that of continuing. Consequently, π′ stops no later than π for
any x^n. Finally, let

Bk(π) = {x ∈ X* | n = k}

be the set of observations such that exactly k samples are taken by rule π. Given that
π does stop, we have

U(ξ, π′) = Σ_{k=1}^∞ ∫_{Bk(π′)} { V^0[ξ(· | x^k)] − ck } dPξ(x^k)
         ≥ Σ_{k=1}^∞ ∫_{Bk(π′)} U[ξ(· | x^k), π] dPξ(x^k)
         = Σ_{k=1}^∞ E^π_ξ{U | Bk(π′)} Pξ(Bk(π′)) = E^π_ξ U = U(ξ, π),

where the inequality follows from the fact that π is not regular.  □

5.2.4 The Sequential Probability Ratio Test

Sometimes we wish to collect just enough data in order to be able to confirm or


disprove a particular hypothesis. More specifically, we have a set of parameters Θ,
and we need to pick the right one. However, rather than simply using an existing set
of data, we are collecting data sequentially, and we need to decide when to stop and
select a model. In this case, each one of our decisions ai corresponds to choosing the
model θi , and we have a utility function that favours our picking the correct model.
As before, data collection has some cost, which we must balance against the expected
utility of picking a parameter.
As an illustration, consider a problem where we must decide for one out of two
possible parameters θ1 , θ2 . At each step, we can either take another sample from the
unknown Pθ (xt ), or decide for one or the other of the parameters.
Example 5.3 (A two-point sequential decision problem.) Consider a problem where
there are two parameters and two final actions which select one of the two parameter
values, such that:
• Observations xt ∈ X
• Distribution family: P = {Pθ | θ ∈ Θ}
• Probability space (X ∗ , B (X ∗ ) , Pθ ).
• Parameter set Θ = {θ1 , θ2 }.
• Action set A = {a1 , a2 }.
• Prior ξ = P(θ = θ1 ).
• Sampling cost c > 0.
The actions we take upon stopping can be interpreted as guessing the parameter.
When we guess wrong, we suffer a cost, as seen in Table 5.1.
As will be the case for all our sequential decision problems, we only need to
consider our current belief ξ, and its possible evolution, when making a decision. To
obtain some intuition about this procedure, we are going to analyse this problem by
examining what the optimal decision is under all possible beliefs ξ.
Under some belief ξ, the immediate value (i.e. the value we obtain if we stop
immediately) is simply

V 0 (ξ) = max {λ1 ξ, λ2 (1 − ξ)} . (5.9)

The worst-case immediate value, i.e. the minimum, is attained when both terms are
equal. Consequently, setting λ1 ξ = λ2 (1 − ξ) gives ξ = λ2 /(λ1 + λ2 ). Intuitively,

Table 5.1 The local utility function, with λ1, λ2 < 0

U(θ, d)    a1     a2
θ1         0      λ1
θ2         λ2     0

Fig. 5.5 The value of the optimal continuation V versus stopping V^0

this is the worst-case belief, as the uncertainty it induces leaves us unable to choose
between either hypothesis. Replacing in (5.9) gives a lower bound for the value for
any belief:
V^0(ξ) ≥ λ1λ2/(λ1 + λ2).

Let Π denote the set of procedures π which take at least one observation and
define

V(ξ) = sup_{π∈Π} U(ξ, π).

Then the ξ-expected utility V*(ξ) must satisfy

V*(ξ) = max{ V^0(ξ), V(ξ) }.
As we showed in Sect. 3.3.1, V is a convex function of ξ. Now let

Ξ0 ≜ { ξ | V^0(ξ) ≥ V(ξ) }

be the set of priors where it is optimal to terminate sampling. It follows that Ξ \ Ξ0,
the set of priors where we must not terminate sampling, is a convex set.
Figure 5.5 illustrates the above arguments, by plotting the immediate value against
the optimal continuation after taking one more sample. For the worst-case belief, we
must always continue sampling. When we are absolutely certain about the model, then
it's always better to stop immediately. There are two points where the curves intersect.
Together, these define three subsets of beliefs: On the left, if ξ < ξL, we decide for
one parameter, θ2. On the right, if ξ > ξH, we decide for the other parameter, θ1.
Otherwise, we continue sampling. This is the main idea of the sequential probability
ratio test, explained below.

The sequential probability ratio test (SPRT)

Figure 5.5 offers a graphical illustration of when it is better to take one more sample
in this setting. In particular, if ξ ∈ (ξL, ξH), then it is optimal to take at least one more
sample. Otherwise, it is optimal to make an immediate decision with value V^0(ξ).
This has a nice interpretation as a standard tool in statistics: the sequential probability ratio test. First note that our posterior at time t can be written as

ξt = ξ Pθ1(x^t) / [ξ Pθ1(x^t) + (1 − ξ) Pθ2(x^t)].

Then, for any posterior, the optimal procedure is:


• If ξL < ξt < ξH, take one more sample.
• If ξL ≥ ξt, stop and choose a2.
• If ξH ≤ ξt, stop and choose a1.
We can now restate the optimal procedure in terms of a probability ratio, i.e. we
should always take another observation as long as

ξ(1 − ξH) / [(1 − ξ)ξH] < Pθ2(x^t)/Pθ1(x^t) < ξ(1 − ξL) / [(1 − ξ)ξL].

If the first inequality is violated, we choose a1 . If the second inequality is violated,


we choose a2 . So, there is an equivalence between SPRT and optimal sampling
procedures, when the optimal policy is to continue sampling whenever our belief is
within a specific interval.
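The resulting test is only a few lines of code. The sketch below is an illustration under assumed choices (two Bernoulli hypotheses and thresholds ξL = 0.05, ξH = 0.95); it tracks the log-likelihood ratio and stops once the posterior leaves (ξL, ξH).

import numpy as np

rng = np.random.default_rng(4)

def sprt(sample, p1, p2, xi=0.5, xi_low=0.05, xi_high=0.95, max_steps=10_000):
    # SPRT for theta_1: P(x=1)=p1 versus theta_2: P(x=1)=p2.
    # Returns the chosen hypothesis and the number of samples used.
    log_ratio = 0.0                        # log P_theta2(x^t) / P_theta1(x^t)
    a = np.log(xi * (1 - xi_high) / ((1 - xi) * xi_high))
    b = np.log(xi * (1 - xi_low) / ((1 - xi) * xi_low))
    for t in range(1, max_steps + 1):
        x = sample()
        log_ratio += np.log((p2 if x else 1 - p2) / (p1 if x else 1 - p1))
        if log_ratio <= a:
            return "theta_1", t
        if log_ratio >= b:
            return "theta_2", t
    return "undecided", max_steps

if __name__ == "__main__":
    # Data generated from theta_1 (p = 0.6); the test should usually pick theta_1.
    print(sprt(lambda: rng.binomial(1, 0.6), p1=0.6, p2=0.4))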

5.2.5 Wald’s Theorem

An important tool in the analysis of SPRT as well as other procedures that stop at
random times is the following theorem by Wald.

Theorem 5.2.5 (Wald's theorem) Let z1, z2, . . . be a sequence of i.i.d. random variables with measure G, such that E zi = m for all i. Then for any sequential procedure
with E n < ∞:

E Σ_{i=1}^n zi = m E n.   (5.10)

Proof In the following proof, we omit explicit references to the implied sequential
procedure.
E Σ_{i=1}^n zi = Σ_{k=1}^∞ ∫_{Bk} Σ_{i=1}^k zi dG^k(z^k)
             = Σ_{k=1}^∞ Σ_{i=1}^k ∫_{Bk} zi dG^k(z^k)
             = Σ_{i=1}^∞ Σ_{k=i}^∞ ∫_{Bk} zi dG^k(z^k)
             = Σ_{i=1}^∞ ∫_{B≥i} zi dG^i(z^i)
             = Σ_{i=1}^∞ E(zi) P(n ≥ i) = m E n.  □
We now consider an application of this theorem to the SPRT. Let zi = log[Pθ2(xi)/Pθ1(xi)].
Consider the equivalent formulation of the SPRT which uses

a < Σ_{i=1}^n zi < b

as the test. Using Wald’s theorem and the previous properties and assuming c ≈ 0,
we obtain the following approximately optimal values for a, b:

a ≈ log c − log[I1 λ2(1 − ξ)/ξ],   b ≈ log(1/c) − log[I2 λ1 ξ/(1 − ξ)],

where I1 = −E(z | θ = θ1) and I2 = E(z | θ = θ2) denote the information, better known
as the KL-divergence. If the cost c is very small, then the information terms vanish
and we can approximate the values by log c and log(1/c).

5.3 Martingales

Martingales are a fundamentally important concept in the analysis of stochastic


processes where the expectation at time t + 1 only depends on the state of the process
at time t.
An example of a martingale sequence is when xt is the amount of money you have
at a given time, and where at each time-step t you are making a gamble such that
you lose or gain 1 currency unit with equal probability. Then, at any step t, it holds
that E(xt+1 | xt ) = xt . This concept can be generalized to two random processes xt
and yt , which are dependent.
Definition 5.3.1 Let x n ∈ S n be a sequence of observations with distribution Pn ,
and yn : S n → R be a random variable. Then the sequence {yn } is a martingale with
respect to {xn} if for all n the expectation

E(yn) = ∫_{S^n} yn(x^n) dPn(x^n)

exists and

E(yn+1 | x^n) = yn

holds with probability 1. If {yn } is a martingale with respect to itself, i.e. yi (x) = x,
then we call it simply a martingale.
It is also useful to consider the following generalizations of martingale sequences.

Definition 5.3.2 A sequence {yn } is a super-martingale if E(yn+1 | x n ) ≤ yn and a


sub-martingale if E(yn+1 | x n ) ≥ yn , w.p. 1.
At a first glance, it might appear that martingales are not very frequently encoun-
tered, apart from some niche applications. However, we can always construct a
martingale from any sequence of random variables as follows.
Definition 5.3.3 (Doob martingale) Consider a function f : S^m → R and some
associated random variables x^m = (x1, . . . , xm). Then, for any n ≤ m, assuming the
expectation E(f | x^n) = ∫_{S^{m−n}} f(x^m) dP(xn+1, . . . , xm | x^n) exists, we can construct the random variable

yn(x^n) = E[f | x^n].

Then E(yn+1 | x n ) = yn , and so yn is a martingale sequence with respect to xn .


Another interesting type of martingale sequence is the martingale difference
sequence. These are particularly important as they are related to some useful concentration bounds.

Definition 5.3.4 A sequence {yn} is a martingale difference sequence with respect
to {xn} if

E(yn+1 | x^n) = 0

with probability 1.

For bounded difference sequences, the following well-known concentration bound


holds.
Theorem 5.3.1 Let bk be a random variable depending on x^{k−1} and {yk} be a
martingale difference sequence with respect to {xk}, such that yk ∈ [bk, bk + ck]
w.p. 1. Then, defining sk ≜ Σ_{i=1}^k yi, it holds that

P(sn ≥ t) ≤ exp(−2t² / Σ_{i=1}^n ci²).   (5.11)

This allows us to bound the probability that the difference sequence deviates from
zero. Since there are only few problems where the default random variables are
difference sequences, this theorem is most commonly used by defining a new random
variable sequence that is a difference sequence.

5.4 Markov Processes

A more general type of sequence of random variables than martingales are Markov
processes. Informally speaking, a Markov process is a sequence of variables {xn }
such that the next value xt+1 only depends on the current value xt .
Definition 5.4.1 (Markov Process) Let (S, B (S)) be a measurable space. If {xn } is
a sequence of random variables xn : S → X such that

P(xt ∈ A | xt−1 , . . . , x1 ) = P(xt ∈ A | xt−1 ), ∀A ∈ B (X ) ,

i.e., xt is independent of xt−2 , . . . given xt−1 , then {xn } is a Markov process,


and xt is called the state of the Markov process at time t. If P(xt ∈ A | xt−1 =
x) = τ(A | x), where τ : B(S) × S → [0, 1] is the transition kernel, then {xn} is a
stationary Markov process.
Note that the sequence of posterior parameters obtained in Bayesian inference
is a Markov process.

5.5 Exercises

Exercise 5.1 Consider a stationary Markov process with state space S and whose
transition kernel is a matrix τ . At time t, we are at state xt = s and we can either,
1: Terminate and receive reward b(s), or 2: Pay c(s) and continue to a random state
xt+1 drawn from the distribution τ(· | s).
Assuming b, c > 0 and τ are known, design a backwards induction algorithm
that optimizes the utility function

U(x1, . . . , xT) = b(xT) − Σ_{t=1}^{T−1} c(xt).

Finally, show that the expected utility of the optimal policy starting from any state
must be bounded.

Exercise 5.2 Consider the problem of classification with features x ∈ X and labels
y ∈ Y, where each label costs c > 0. Assume a Bayesian model with some parameter
space Θ on which we have a prior distribution ξ0 . Let ξt be the posterior distribution
after t examples (x1 , y1 ), . . . , (xt , yt ).
Let our expected utility be the expected accuracy (i.e., the marginal probability
of correctly guessing the right label over all possible models) of the Bayes-optimal
classifier π : X → Y, minus the cost paid, i.e.,

Et(U) ≜ max_π ∫_Θ ∫_X Pθ(π(x) | x) dPθ(x) dξt(θ) − ct.

Show that the Bayes-optimal classification accuracy (ignoring the label cost)
after t observations can be rewritten as

∫_X max_{y∈Y} ∫_Θ Pθ(y | x) dξt(θ | x) dPt(x),

where Pt and Et denote marginal distributions under the belief ξt . Write the expres-
sion for the expected gain in accuracy when obtaining one more sample and label.
Implement the above for a model family of your choice. Two simple options are
the following. The first is a finite model family composed of two different classifiers
Pθ (y | x). The second is the family of discrete classifier models with a Dirichlet
product prior, i.e. where X = {1, . . . , n}, and each different x ∈ X corresponds to a
different multinomial distribution over Y. In both cases, you can assume a common
(and known) data distribution P(x), in which case ξt (θ | x) = ξt (θ).

Fig. 5.6 Illustrative results for an implementation of Exercise 5.2 on a discrete classifier model

Figure 5.6 shows the performance for a family of discrete classifier models
with |X | = 4. It shows the expected classifier performance (based on the poste-
rior marginal), the actual performance on a small test set, as well as the cumulative
predicted performance gain. As you can see, even though the expected performance
gain is zero in some cases, cumulatively it reaches the actual performance of the
classifier. You should be able to produce a similar figure for your own setup.
Chapter 6
Experiment Design and Markov Decision
Processes

6.1 Introduction

This chapter introduces the very general formalism of Markov decision processes
(MDPs) that allows representation of various sequential decision making problems.
Thus a Markov decision process can be used to model stochastic path problems,
stopping problems as well as problems in reinforcement learning, experiment design,
and control.
We begin by taking a look at the problem of experimental design. A typical
question is how to best allocate treatments with unknown efficacy to patients in
an adaptive manner, so that the best treatment is found, or so as to maximize the
number of patients that are treated successfully. The problem, originally considered
number of patients that are treated successfully. The problem, originally considered
by Chernoff [1, 2], informally can be stated as follows.
We have a number of treatments of unknown efficacy, i.e., some of them work
better than the others. We observe patients one at a time. When a new patient arrives,
we must choose which treatment to administer. Afterwards, we observe whether the
patient improves or not. Given that the treatment effects are initially unknown, how
can we maximize the number of cured patients? Alternatively, how can we discover
the best treatment? The two different problems are formalized below.

Example 6.1 [Adaptive treatment allocation] Consider K treatments to be administered to T volunteers. To each volunteer only a single treatment can be assigned.
At the t-th trial, we treat one volunteer with some treatment at ∈ {1, . . . , K}. We then
obtain a reward rt = 1 if the patient is healed and 0 otherwise. We wish to choose
treatments maximizing the utility U = Σ_t rt. This corresponds to maximizing the
number of patients that get healed over time.

Example 6.2 [Adaptive hypothesis testing] An alternative goal would be to do a
clinical trial, in order to find the best possible treatment. For simplicity, consider the
problem of trying to find out whether a particular treatment is better or not than a
placebo. We are given a hypothesis set Ω, with each ω ∈ Ω corresponding to different

models for the effect of the treatment and the placebo. Since we don’t know what
is the right model, we place a prior ξ0 on Ω. We can perform T experiments, after
which we must decide whether or not the treatment is significantly better than the
placebo. To model this, we define a decision set D = {d0 , d1 } and a utility function
U : D × Ω → R which models the effect of each decision d given different versions
of reality ω. One hypothesis ω ∈ Ω is true. To identify the correct hypothesis, we
can choose from a set of K possible experiments to be performed over T trials. At
the t-th trial, we choose experiment at ∈ {1, . . . , K } and observe outcome xt ∈ X ,
with xt ∼ Pω drawn from the true hypothesis. Our posterior is

ξt (ω)  ξ0 (ω | a1 , . . . , at , x1 , . . . , xt ).

The reward is rt = 0 for t < T and

r_T = max_{d∈D} EξT(U | d).

Our utility can again be expressed as a sum over individual rewards, U = Σ_{t=1}^T rt.

Both formalizations correspond to so-called bandit problems which we take a


closer look at in the following section.

6.2 Bandit Problems

The simplest bandit problem is the stochastic n-armed bandit. We are faced with K
different one-armed bandit machines, such as those found in casinos. At time t, we
have to choose one action (i.e., a machine, in the bandit context usually termed arm)
at ∈ A = {1, . . . , K }. Each time t we play a machine, we receive a reward rt with
fixed expected value ωi = E(rt | at = i). Unfortunately, we do not know the ωi , and
consequently the best arm is also unknown. The question is how to choose arms in
order to maximize the total expected reward.
Definition 6.2.1 (The stochastic n-armed bandit problem) Given a set of arms A =
{1, . . . , K } each giving reward according to a fixed reward distribution with unknown
mean, select a sequence of actions at ∈ A, so as to maximize expected utility, where
the utility is
U = Σ_{t=0}^{T−1} γ^t rt,

T ∈ (0, ∞] is the horizon, and γ ∈ (0, 1] a discount factor. The reward rt is stochastic
and only depends on the action chosen at step t with expectation E(rt | at = i) = ωi.
For selecting actions, we want to specify some policy or decision rule. Such a rule
shall only depend on the sequence of previously taken actions and observed rewards.

Usually, the policy π : A∗ × R∗ → A is a deterministic mapping from the space of


all sequences of actions and rewards to actions. That is, for every observation and
action history a1 , . . . , at−1 , r1 , . . . , rt−1 it suggests a single action at . More generally,
it could also be a stochastic policy, that specifies a mapping to action distributions.
We use the notation
π(at | a t−1 , r t−1 )

for stochastic history-dependent bandit policies, i.e., the probability of choosing


action at given the history until time t.
One idea to approach the bandit problem is to apply the Bayesian decision-
theoretic framework we have developed earlier to maximize utility in expectation.
More specifically, given the horizon T ∈ (0, ∞] and the discount factor γ ∈ (0, 1],
we define our utility from time t to be
Ut ≜ Σ_{k=1}^{T−t} γ^k rt+k.

As a first step we need to define a suitable family of probability measures P, indexed


by parameter ω ∈ Ω describing the reward distribution of each bandit, together with
a prior distribution ξ on Ω. Since ω is unknown, we cannot maximize the expected
utility with respect to it. However, we can always maximize expected utility with
respect to our belief ξ. That is, we replace the ill-defined problem of maximizing util-
ity in an unknown model with that of maximizing expected utility given a distribution
over possible models. The problem can be written in a simple form as

max_π E^π_ξ Ut = max_π ∫_Ω E^π_ω Ut dξ(ω).   (6.2.1)

The following figure summarizes the bandit problem formulated in the Bayesian
setting.

Decision-theoretic statement of the bandit problem

• Let A be the set of arms.
• Define a family of distributions P = {Pω,i | ω ∈ Ω, i ∈ A} on R.
• Assume the i.i.d. model rt | ω, at = i ∼ Pω,i.
• Define prior ξ on Ω.
• Select a policy π : A* × R* → A maximizing

E^π_ξ U = E^π_ξ Σ_{t=0}^{T−1} γ^t rt.

There are two main difficulties with this approach. The first one is that the choice of
the family and the prior distribution is effectively part of the problem formulation and
can severely influence the solution. This issue can be resolved by either specifying a
subjective prior distribution, or by selecting a prior distribution that has good worst-
case guarantees. The second difficulty concerns the computation of the policy that
maximizes expected utility given a prior and a family. This is a hard problem, because
in general such policies are history dependent and the set of all possible histories is
exponential in the horizon T .

6.2.1 An Example: Bernoulli Bandits

As a simple illustration, consider the case when the reward for choosing one of the
n actions is either 0 or 1 with some fixed yet unknown probability depending on the
chosen action. This can be modelled in the standard Bayesian framework using the
Beta-Bernoulli conjugate prior. More specifically, we can formalize the problem as
follows.
Consider n Bernoulli distributions with unknown parameters ωi (i = 1, . . . , n)
such that

rt | at = i ∼ Bern(ωi ), E(rt | at = i) = ωi . (6.2.2)

Each Bernoulli distribution thus corresponds to the distribution of rewards obtained


from each bandit that we can play. In order to apply the statistical decision theoretic
framework, we have to quantify our uncertainty about the parameters ω in terms of
a probability distribution.
We model our belief for each bandit’s parameter ωi as a Beta distribution
Beta(αi , βi ), with density f (ω | αi , βi ) so that


n
ξ(ω1 , . . . , ωn ) = f (ωi | αi , βi ).
i=1

Recall that the posterior of a Beta prior is also a Beta. Let

Nt,i ≜ Σ_{k=1}^t I{ak = i}

be the number of times we played arm i and

r̂t,i ≜ (1/Nt,i) Σ_{k=1}^t rk · I{ak = i}

be the empirical reward of arm i at time t. We can set r̂t,i = 0 when Nt,i = 0. Then,
the posterior distribution for the parameter of arm i is

ξt = Beta(αi + Nt,i r̂t,i , βi + Nt,i (1 − r̂t,i )).

In order to evaluate a policy we need to be able to predict the expected utility we


obtain. The latter only depends on our current belief, and the state of our belief
corresponds to the state of the bandit problem. This means that everything we know
about the problem at time t can be summarized by ξt . For Bernoulli bandits, a
sufficient statistic for our belief is the number of times we played each bandit and
the total reward from each bandit. Thus, our state at time t is entirely described by
our priors α, β (the initial state) and the vectors

Nt = (Nt,1 , . . . , Nt,n )
r̂t = (r̂t,1 , . . . , r̂t,n ).

Accordingly, as r_t ∈ {0, 1}, the possible states of our belief given some prior are in
ℕ^{2n}. At any time t, we can calculate the probability of observing r_t = 1 if we pull
arm i as

    ξ_t(r_t = 1 | a_t = i) = (α_i + N_{t,i} r̂_{t,i}) / (α_i + β_i + N_{t,i}).

So, not only can we predict the immediate reward based on our current belief, but we
can also predict all possible next beliefs: the next state is well-defined and depends
only on the current state and observation. As we shall see later, this type of decision
problem can be modelled as a Markov decision process (Definition 6.3.1). For now,
we shall take a closer look at the general bandit process itself.
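To make the sufficient statistic concrete, the following minimal Python sketch (our own illustration; the class name and the uniform Beta(1, 1) default prior are assumptions, not part of the text) maintains the posterior parameters and the predictive probability for each arm.

    # A minimal sketch of the Beta-Bernoulli belief update.  The belief state of
    # arm i is summarized by its posterior parameters (alpha_i, beta_i), i.e.,
    # the prior parameters plus the counts of observed successes and failures.
    import numpy as np

    class BernoulliBanditBelief:
        def __init__(self, n_arms, alpha0=1.0, beta0=1.0):
            self.alpha = np.full(n_arms, alpha0)   # prior "successes"
            self.beta = np.full(n_arms, beta0)     # prior "failures"

        def update(self, arm, reward):
            """Posterior update after pulling `arm` and observing `reward` in {0, 1}."""
            self.alpha[arm] += reward
            self.beta[arm] += 1 - reward

        def predictive(self, arm):
            """Probability of observing reward 1 from `arm` under the current belief."""
            return self.alpha[arm] / (self.alpha[arm] + self.beta[arm])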

6.2.2 Decision-Theoretic Bandit Process

The basic view of the bandit process is to consider only the decision maker’s
actions at , obtained rewards rt and the latent parameter ω, as shown in Fig. 6.2a.
With this basic framework, we can now define the general decision-theoretic bandit
process, which also includes the states of belief ξt of the decision maker.
Definition 6.2.2 Let A be a set of arms (now not necessarily finite). Let Ω be a
set of possible parameter values, indexing a family of probability measures
P = {P_{ω,a} | ω ∈ Ω, a ∈ A} on R. There is some ω ∈ Ω such that, whenever we take action
a_t = i, we observe reward r_t ∈ R ⊂ ℝ with probability measure

    P_{ω,i}(R) ≜ P_ω(r_t ∈ R | a_t = i),    R ⊆ ℝ.



Fig. 6.1 A partial view of the multi-stage process. Here, the probability that we obtain
r = 1 if we take action a_t = i is simply P_{ξ_t,i}({1}). [Figure: a tree in which the current
belief ξ_t branches over actions a_t and rewards r ∈ {0, 1} into the possible next beliefs ξ_{t+1}.]

Let ξ_0 be a prior distribution on Ω and let the posterior distributions for B ⊆ Ω be
defined as

    ξ_{t+1}(B) ∝ ∫_B P_{ω,a_t}(r_t) dξ_t(ω).

The next belief ξ_{t+1} is random, since it depends on the random quantity r_t. In fact, the
probability of the next reward lying in some set R for a_t = i is given by the marginal
distribution

    P_{ξ_t,i}(R) ≜ ∫_Ω P_{ω,i}(R) dξ_t(ω).

Finally, as ξ_{t+1} deterministically depends on ξ_t, a_t, r_t, the probability of obtaining
a particular next belief is the same as the probability of obtaining the corresponding
rewards leading to the next belief. In more detail, we can write

    P(ξ_{t+1} = ξ' | ξ_t, a_t) = ∫_R I{ξ_t(· | a_t, r_t = r) = ξ'} dP_{ξ_t,a_t}(r).

In practice, although multiple reward sequences may lead to the same beliefs, we
frequently ignore that possibility for simplicity. Then the process obtains a tree-like
structure. A solution to the problem of which action to select is given by the following
backwards induction algorithm for bandits, similar to the backwards induction
algorithm given in Sect. 5.2.2, i.e.,

    U*(ξ_t) = max_{a_t} [ E(r_t | ξ_t, a_t) + Σ_{ξ_{t+1}} P(ξ_{t+1} | ξ_t, a_t) U*(ξ_{t+1}) ].

If you look at this structure, you can see that the next belief only depends on the
current belief, action, and reward, i.e., it satisfies the Markov property, as seen in
Fig. 6.1.
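As an illustration of this backwards induction over belief states, consider the following Python sketch for Bernoulli bandits with γ = 1 (a hypothetical implementation; the function name and the uniform Beta(1, 1) priors are assumptions). The belief state is the tuple of per-arm (α, β) posterior parameters, and the recursion follows the equation above.

    # Backwards induction over belief states for Bernoulli bandits (undiscounted).
    from functools import lru_cache

    @lru_cache(maxsize=None)
    def U_star(belief, steps_left):
        """Optimal expected utility of belief state `belief` with `steps_left` pulls left."""
        if steps_left == 0:
            return 0.0
        best = -float("inf")
        for i, (alpha, beta) in enumerate(belief):
            p1 = alpha / (alpha + beta)            # predictive P(r = 1 | belief, arm i)
            # Next beliefs after observing r = 1 or r = 0 on arm i.
            b1 = belief[:i] + ((alpha + 1, beta),) + belief[i + 1:]
            b0 = belief[:i] + ((alpha, beta + 1),) + belief[i + 1:]
            value = p1 * (1.0 + U_star(b1, steps_left - 1)) \
                    + (1.0 - p1) * U_star(b0, steps_left - 1)
            best = max(best, value)
        return best

    # Two-armed Bernoulli bandit, horizon 4, uniform Beta(1, 1) priors:
    print(U_star(((1, 1), (1, 1)), 4))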
In reality, the reward depends only on the action and the unknown ω, as can be
seen in Fig. 6.2a. This is the point of view of an external observer. If we want to

[Figure: three graphical models of the bandit process, (a) the basic process, (b) the Bayesian
model, (c) the lifted process, as described in the caption below.]

Fig. 6.2 Three views of the bandit process. The figure (a) shows the basic bandit process from the
view of an external observer. The decision maker selects at and then obtains reward rt , while the
parameter ω is hidden. The process is repeated for t = 1, . . . , T . The Bayesian model is shown in
(b) and the resulting process in (c). While ω is not known, at each time step t we maintain a belief
ξt on Ω. The reward distribution is then defined through our belief. In (b), we can see the complete
process, where the dependency on ω is clear. In (c), we marginalize out ω and obtain a model where
the transitions only depend on the current belief and action

add the decision maker’s internal belief to the graph, we obtain Fig. 6.2b. From the
point of view of the decision maker, the distribution of ω only depends on his current
belief. Consequently, the distribution of rewards also only depends on the current
belief, as we can marginalize over ω. This gives rise to the decision-theoretic bandit
process shown in Fig. 6.2c.
A decision-theoretic bandit process can be modelled more generally as a Markov
decision process, a setting we shall consider more generally in the following section.
It turns out that backwards induction, as well as other efficient algorithms, can provide
optimal solutions for Markov decision processes, too.

6.3 Markov Decision Processes and Reinforcement Learning

The bandit setting is one of the simplest instances of so-called reinforcement learning
problems. Informally speaking, these are problems of learning how to act in an
unknown environment, only through interaction with the environment and limited
reinforcement signals. The learning agent interacts with the environment by choosing
actions that give certain observations and rewards. The goal is usually to maximize
some measure of the accumulated reward. For example, we can consider a mouse
running through a maze, where the reward is finding one or all pieces of cheese
hidden in the maze.

The reinforcement learning problem


Learn how to act in an unknown environment, only by interaction and
reinforcement.

Generally, we assume that the environment μ that we are acting in has an under-
lying state st ∈ S, which changes in discrete time steps t. At each step, the agent
obtains an observation xt ∈ X and chooses an action at ∈ A. We usually assume
that the environment is such that its next state st+1 only depends on its current
state st and the last action at taken by the agent. In addition, the agent observes a
reward signal rt , and its goal is to maximize the total reward during its lifetime.
When the environment μ is unknown, this problem is hard even in seemingly
simple settings like the multi-armed bandit, where the underlying state never changes.
In many real-world applications, the problem is even harder, as the state often is not
directly observed. Instead, we may have to rely on the observables xt , which give
only partial information about the true underlying state st .
Reinforcement learning problems typically fall into one of the following three
categories: (1) Markov decision processes (MDPs), where the state st is observed
directly, i.e., xt = st ; (2) partially observable MDPs (POMDPs), where the state is
hidden, i.e., xt is only probabilistically dependent on the state st ; and (3) stochastic
Markov games, where the next state also depends on the move of other agents. While
all of these problem descriptions are different, in the Bayesian setting they all can be
reformulated as MDPs by constructing an appropriate belief state, similarly to the
decision theoretic formulation of the bandit problem.
In this chapter, we shall confine our attention to Markov decision processes.
Hence, we shall not discuss the case where we cannot observe the state directly, or
consider the existence of other agents.
Definition 6.3.1 (Markov decision process) A Markov decision process (MDP)
μ is a tuple μ = ⟨S, A, P, R⟩, where S is the state space and A is the action
space. The transition distribution P = {p(· | s, a) | s ∈ S, a ∈ A} is a collection
of probability measures on S, indexed in S × A, and the reward distribution
R = {ρ(· | s, a) | s ∈ S, a ∈ A} is a collection of probability measures on ℝ, such
that

    p(S | s, a) = P_μ(s_{t+1} ∈ S | s_t = s, a_t = a),
    ρ(R | s, a) = P_μ(r_t ∈ R | s_t = s, a_t = a).

Usually, an initial state s_0 (or more generally, an initial distribution from which
s_0 is sampled) is also specified.
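For concreteness, a finite MDP can be stored as a transition tensor and a reward matrix; the following NumPy sketch (the array names and shapes are our assumptions, not the book's code) illustrates one such representation, which the algorithms later in this chapter can operate on.

    # A possible in-memory representation of a finite MDP.
    import numpy as np

    n_states, n_actions = 4, 2
    rng = np.random.default_rng(0)

    P = rng.random((n_states, n_actions, n_states))
    P /= P.sum(axis=-1, keepdims=True)       # each P[s, a, :] is a distribution over next states
    r = rng.random((n_states, n_actions))    # expected reward r(s, a)

    assert np.allclose(P.sum(axis=-1), 1.0)  # sanity check: transition rows sum to one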
In the following, we usually consider only MDPs with finite state and action spaces
S, A. By definition, rewards and transition probabilities of an MDP are time-invariant,
although one could assume more general models. In any case, however, rewards and
transitions in an MDP μ shall satisfy the following Markov property (Fig. 6.3).

Fig. 6.3 The structure of a Markov decision process. [Figure: a graphical model of the MDP μ
in which the state s_t and action a_t determine the reward r_t and the next state s_{t+1}.]

Markov property of the reward and state distribution

    P_μ(s_{t+1} ∈ S | s_1, a_1, . . . , s_t, a_t) = P_μ(s_{t+1} ∈ S | s_t, a_t),    (Transition distribution)
    P_μ(r_t ∈ R | s_1, a_1, . . . , s_t, a_t) = P_μ(r_t ∈ R | s_t, a_t),            (Reward distribution)

where S ⊂ S and R ⊂ ℝ are state and reward subsets, respectively.

Dependencies of rewards. In the following, for a given MDP μ we use

rμ (s, a) = Eμ (rt+1 | st = s, at = a)

to denote the expected reward for a state-action-pair s, a. Similarly, we write


pμ (· | s, a) for the transition probability distribution under s, a.
Sometimes, however, it is more convenient to have rewards that depend on the next
state as well, i.e.,

    r_μ(s, a, s') = E_μ(r_{t+1} | s_t = s, a_t = a, s_{t+1} = s'),

though this complicates the notation considerably, since now the reward is obtained
on the next time step. However, we can always replace this with the expected reward
for a given state-action pair, i.e.,

    r_μ(s, a) = E_μ(r_{t+1} | s_t = s, a_t = a) = Σ_{s'∈S} p_μ(s' | s, a) r_μ(s, a, s').

In a simpler setting, rewards only depend on the current state, so that we can write

rμ (s) = Eμ (rt | st = s).
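The reduction from next-state-dependent rewards to state-action rewards is a one-line expectation; a small NumPy sketch (with hypothetical arrays) is given below.

    # Reducing r(s, a, s') to r(s, a) by averaging over the transition kernel.
    import numpy as np

    n_states, n_actions = 4, 2
    rng = np.random.default_rng(1)
    P = rng.random((n_states, n_actions, n_states))
    P /= P.sum(axis=-1, keepdims=True)                    # p(s' | s, a)
    R_sas = rng.random((n_states, n_actions, n_states))   # r(s, a, s')

    r_sa = np.einsum('ijk,ijk->ij', P, R_sas)             # r(s, a) = sum_s' p(s'|s,a) r(s,a,s')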



Policies. A policy π (sometimes also called decision function) specifies which action
to take. One can think of a policy as implemented through an algorithm or an embod-
ied agent.

Policies
A policy π defines a conditional distribution on actions given the history:

    P^π(a_t | s_t, . . . , s_1, a_{t−1}, . . . , a_1)    (history-dependent policy)
    P^π(a_t | s_t)                                       (Markov policy)

In general, policies map histories to actions. In certain cases, however, there are
optimal policies that are Markov. This is for example the case with additive utility
functions U : R∗ → R, which map the sequence of all possible rewards to a real
number as specified in the following definition.
Definition 6.3.2 (Additive utility) The utility function U : R* → ℝ is defined as

    U(r_0, r_1, . . . , r_T) = Σ_{k=0}^T γ^k r_k,

where T is the horizon, after which the agent is no longer interested in rewards, and
γ ∈ (0, 1] is the discount factor, which discounts future rewards. It is convenient
to introduce a special notation for the utility starting from time t, i.e., the sum of
rewards from that time on:

    U_t ≜ Σ_{k=0}^{T−t} γ^k r_{t+k}.

At any time t, the agent wants to find a policy π maximizing the expected total
future reward (expected utility)

    E^π_μ U_t = E^π_μ Σ_{k=0}^{T−t} γ^k r_{t+k}.

This is identical to the expected utility framework we have seen so far, with
the only difference that now the reward space is a sequence of numerical rewards
and that we are acting within a dynamical system with state space S. In fact, it is
a good idea to think about the value of different states of the system under certain
policies in the same way that one thinks about how good different positions are in a
board game like chess.

6.3.1 Value Functions

Value functions represent the expected utility of a given state or state-action pair for
a specific policy. They are useful as shorthand notation and as the basis for algorithm
development. The most basic one is the state value function.

State value function


    V^π_{μ,t}(s) ≜ E^π_μ(U_t | s_t = s)

The state value function for a particular policy π in an MDP μ can be interpreted
as how much utility you should expect if you follow the policy starting from state s
at time t.

State-action value function

    Q^π_{μ,t}(s, a) ≜ E^π_μ(U_t | s_t = s, a_t = a)

The state-action value function for a particular policy π in an MDP μ can be


interpreted as how much utility you should expect if you play action a at state s at
time t, and then follow the policy π. It is also useful to define the optimal policy
and optimal value functions for a given MDP μ. Using a star to indicate optimal
quantities, an optimal policy π ∗ dominates all other policies π everywhere in S, that
is,
    V^{π*}_{μ,t}(s) ≥ V^π_{μ,t}(s)    ∀π, t, s.

The optimal value function V* is the value function of an optimal policy π*, i.e.,

    V^*_{μ,t}(s) ≜ V^{π*}_{μ,t}(s),    Q^*_{μ,t}(s, a) ≜ Q^{π*}_{μ,t}(s, a).

Finding the optimal policy when μ is known


When the MDP μ is known, the expected utility of any policy can be calculated.
As the number of policies is exponential in the number of states, it is however not
advisable to determine the optimal policy by calculating the utility of every possible
policy. More suitable approaches include iterative (offline) methods. These either
try to estimate the optimal value function directly, or iteratively improve a policy
until it is optimal. Another type of method tries to find an optimal policy in an online
fashion. That is, the optimal actions are estimated only for states which can be visited
from the current state. However, all these algorithms share the same main ideas.

6.4 Finite Horizon, Undiscounted Problems

The conceptually simplest type of problems are finite horizon problems where T < ∞
and γ = 1. The first thing we shall try is to evaluate a given policy π for a given
MDP, that is, compute V^π_{μ,t}(s) for all states s and t = 0, 1, . . . , T. There are a number
of algorithms that can be used for that purpose.
of algorithms that can be used for that purpose.

6.4.1 Direct Policy Evaluation

By definition,

    V^π_{μ,t}(s) ≜ E^π_μ(U_t | s_t = s) = Σ_{k=0}^{T−t} E^π_μ(r_{t+k} | s_t = s)
                 = Σ_{k=t}^{T} Σ_{s'∈S} E^π_μ(r_k | s_k = s') P^π_μ(s_k = s' | s_t = s),      (6.4.1)

and one can try to compute the value function by (6.4.1), using that

    P^π_μ(s_k = s' | s_t = s) = Σ_{s''∈S} P(s_k = s' | s_{k−1} = s'', s_t = s) P(s_{k−1} = s'' | s_t = s).

However, the computational cost of this direct policy evaluation is quite high, as it
results in a total of |S|3 T operations if the value function is to be computed for all
time steps up to T .

6.4.2 Backwards Induction Policy Evaluation

Noting that
    V^π_{μ,t}(s) ≜ E^π_μ(U_t | s_t = s)
                 = E^π_μ(r_t | s_t = s) + E^π_μ(U_{t+1} | s_t = s)
                 = E^π_μ(r_t | s_t = s) + Σ_{s'∈S} p_μ(s' | s, π(s)) V^π_{μ,t+1}(s')

provides a recursion that can be used for the backwards induction algorithm shown as
Algorithm 6.1, that is similar to the backwards induction algorithm we have already
seen for sequential sampling and bandit problems. However, here in a first step we
are only evaluating a given policy π rather than finding the optimal one.

Algorithm 6.1 Backwards induction policy evaluation


input μ, policy π, horizon T
for s ∈ S do
    Initialize V̂_T(s) = E^π_μ(r_T | s_T = s).
end for
for t = T − 1, . . . , 0 do
    for s ∈ S do
        V̂_t(s) = E^π_μ(r_t | s_t = s) + Σ_{s'∈S} p_μ(s' | s, π(s)) V̂_{t+1}(s')
    end for
end for

Remark 6.4.1 The backwards induction algorithm gives estimates V̂t (s) satisfying
    V̂_t(s) = V^π_{μ,t}(s).
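A possible NumPy implementation of Algorithm 6.1 (a sketch under the array conventions introduced earlier, with P[s, a, s'], expected rewards r[s, a] and a deterministic Markov policy pi[s]; it is not the book's code) is the following.

    import numpy as np

    def backwards_induction_evaluation(P, r, pi, T):
        """Evaluate the fixed policy pi by backwards induction over the horizon T."""
        n_states = P.shape[0]
        V = np.zeros((T + 1, n_states))
        V[T] = r[np.arange(n_states), pi]          # V_T(s) = E(r_T | s_T = s)
        for t in range(T - 1, -1, -1):
            for s in range(n_states):
                a = pi[s]
                V[t, s] = r[s, a] + P[s, a] @ V[t + 1]
        return V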

6.4.3 Backwards Induction Policy Optimization

Backwards induction policy optimization as given in Algorithm 6.2 is the first


non-naive algorithm for finding an optimal policy for the sequential problem with
horizon T . It is basically identical to the backwards induction algorithms we have
seen in Chap. 5 for sequential sampling and the bandit problem.

Algorithm 6.2 Finite-horizon backwards induction


input μ, horizon T
Initialize V̂_T(s) := max_a r_μ(s, a) for all s ∈ S.
for t = T − 1, . . . , 1 do
    for s ∈ S do
        π_t(s) := arg max_a { r_μ(s, a) + Σ_{s'∈S} p_μ(s' | s, a) V̂_{t+1}(s') }
        V̂_t(s) := r_μ(s, π_t(s)) + Σ_{s'∈S} p_μ(s' | s, π_t(s)) V̂_{t+1}(s')
    end for
end for
return π = (π_t)_{t=1}^T
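Under the same array conventions, Algorithm 6.2 can be sketched as follows (again an illustration, not a definitive implementation).

    import numpy as np

    def backwards_induction(P, r, T):
        """Finite-horizon backwards induction: returns a time-dependent policy and values."""
        n_states, n_actions = r.shape
        V = np.zeros((T + 1, n_states))
        policy = np.zeros((T + 1, n_states), dtype=int)
        V[T] = r.max(axis=1)                        # V_T(s) = max_a r(s, a)
        policy[T] = r.argmax(axis=1)                # greedy final action (implicit in Algorithm 6.2)
        for t in range(T - 1, 0, -1):
            # Q[s, a] = r(s, a) + sum_s' p(s'|s,a) V_{t+1}(s')
            Q = r + P @ V[t + 1]
            policy[t] = Q.argmax(axis=1)
            V[t] = Q.max(axis=1)
        return policy, V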

Theorem 6.4.1 For T-horizon problems, backwards induction is optimal, i.e.,

    V̂_t(s) = V^*_{μ,t}(s).

Proof First we show that V̂_t(s) ≥ V^*_{μ,t}(s). For t = T we evidently have V̂_T(s) =
max_a r(s, a) = V^*_{μ,T}(s). We proceed by induction, assuming that V̂_{t+1}(s) ≥ V^*_{μ,t+1}(s)
holds. Then for any policy π

    V̂_t(s) = max_a { r(s, a) + Σ_{s'∈S} p_μ(s' | s, a) V̂_{t+1}(s') }
            ≥ max_a { r(s, a) + Σ_{s'∈S} p_μ(s' | s, a) V^*_{μ,t+1}(s') }    (by induction assumption)
            ≥ max_a { r(s, a) + Σ_{s'∈S} p_μ(s' | s, a) V^π_{μ,t+1}(s') }    (by optimality)
            ≥ V^π_{μ,t}(s).

Choosing π = π* to be the optimal policy, this completes the induction proof. Finally,
we note that for the policy π returned by backwards induction we have

    V^*_{μ,t}(s) ≥ V^π_{μ,t}(s) = V̂_t(s) ≥ V^*_{μ,t}(s).                          □

6.5 Infinite-Horizon

When problems have no fixed horizon, they can usually be modelled as infinite
horizon problems, sometimes with the help of a terminal state, whose visit terminates
the problem, or of discounted rewards, which indicate that we care less about rewards
further in the future. When reward discounting is exponential, these problems can be
seen as undiscounted problems with a random, geometrically distributed horizon.
For problems with no discounting and no terminal states there are some complications
in the definition of optimal policy. However, we defer discussion of such problems
to Chap. 10.

6.5.1 Examples

We begin with some examples, which will help elucidate the concept of terminal
states and infinite horizon.

6.5.1.1 Shortest-Path Problems

First we consider shortest path problems, where the aim is to find the shortest path to a
particular goal. Although the process terminates when the goal is reached, not all poli-
cies may be able to reach the goal, and so the process may never terminate. We shall
consider two types of shortest path problems, deterministic and stochastic. Although
conceptually different, both problems have essentially the same complexity.

Fig. 6.4 Deterministic shortest path example. [Figure: a maze in which every reachable cell
is labelled with its distance to the goal state X.] Properties: γ = 1, T → ∞; r_t = −1 unless
s_t = X, in which case r_t = 0; P_μ(s_{t+1} = X | s_t = X) = 1; A = {North, South, East, West};
transitions are deterministic and walls block.

Consider an agent moving in a maze, aiming to get to some terminal goal


state X , as for example seen in Fig. 6.4. That is, when reaching this state, the agent
stays in X for all further time steps and receives a reward of 0. In general, the agent
can move deterministically in the four cardinal directions, and receives a negative
reward at each time step. Consequently, the optimal policy is to move to X as quickly
as possible.
Solving the shortest path problem with deterministic transitions can be done sim-
ply by recursively defining the distance of states to X. Thus, first the distance of X to
X is set to 0. Then for states s with distance d to X and with a neighbor state s' with
no assigned distance yet, one assigns s' the distance d + 1 to X. This is illustrated
in Fig. 6.4, where for all reachable states the distance to X is indicated. The respec-
tive optimal policy at each step simply moves to a neighbor state with the smallest
distance to X. Its reward starting in any state s is simply the negative distance from
s to X.
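The recursive distance computation is simply a breadth-first search from the goal; a small Python sketch (with a hypothetical grid encoding) is shown below.

    # Label every reachable free cell with its distance to the goal X.
    from collections import deque

    def distances_to_goal(free_cells, goal):
        """free_cells: set of (row, col) traversable cells; goal: the cell X."""
        dist = {goal: 0}
        queue = deque([goal])
        while queue:
            r, c = queue.popleft()
            for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
                if (nr, nc) in free_cells and (nr, nc) not in dist:
                    dist[(nr, nc)] = dist[(r, c)] + 1
                    queue.append((nr, nc))
        return dist   # the optimal value of a reachable state s is simply -dist[s]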
Stochastic shortest path problem with a pit. Now let us assume the shortest path
problem with stochastic dynamics. That is, at each time step there is a small proba-
bility ω that we move in a random direction. To make this more interesting, we can
add a pit O, that is, a terminal state giving a one-time negative reward of −100 (and
0 reward for all further steps) as seen in Fig. 6.5.
Figure 6.6 shows that randomness changes the solution significantly in this envi-
ronment. When ω is relatively small, it is worthwhile (in expectation) for the agent
to pass by the pit, even though there is a risk of falling in and getting a reward of
−100. In the example given, even starting from the third row in the first column, the
agent prefers taking the short-cut. If ω is sufficiently high, however, the optimal policy
avoids approaching the pit. Still, the agent prefers jumping into the pit (getting a
large one-time negative reward) to staying at the safe left bottom of the maze forever
(with a reward of −1 at each step).

Fig. 6.5 Stochastic shortest path example. [Figure: the same maze, now with a pit O adjacent
to the goal X.] Properties: γ = 1, T → ∞; r_t = −1, but r_t = 0 at X and −100 (once, reward 0
afterwards) at O; P_μ(s_{t+1} = X | s_t = X) = 1; P_μ(s_{t+1} = O | s_t = O) = 1;
A = {North, South, East, West}; moves go in a random direction with probability ω; walls block.

Fig. 6.6 Pit maze solutions for two values of ω: (a) ω = 0.1, (b) ω = 0.5, (c) value scale
(ranging from about −120 to 0). [Figure: heat maps of the resulting value functions.]

6.5.2 Markov Chain Theory for Discounted Problems

Many problems have no natural terminal state, but continue ad infinitum. A popular
model to guarantee that the utility is still bounded is to exponentially discount future
rewards. This also has an economic interpretation. On the one hand, discounting
takes into account the effects of inflation; on the other hand, money available now
may be more useful than money one obtains in the future. Both these effects diminish
the value of money over time. In this section we consider some basics of MDPs with
infinite horizon and discounted rewards, when the utility is given by
    U_t = lim_{T→∞} Σ_{k=t}^T γ^k r_k,    γ ∈ (0, 1).

For simplicity, in the following we assume that rewards only depend on the current
state instead of both state and action. It can easily be verified that the results presented
below can be adapted to the latter case. Henceforth, we will also often drop the
dependence on μ in the notation, if the considered MDP μ is clear from the context.
As we assume finite state and action spaces S, A as well as a time-invariant transition
kernel we may use the following simplified vector notation:
• r = (r (s))s∈S is the reward vector in R|S| .
• We will use P π to denote the transition matrix in R|S|×|S| for policy π, i.e.,

    P^π(s, s') = Σ_a p(s' | s, a) P^π(a | s).

• v π = (Eπ (Ut | st = s))s∈S is a vector in R|S| representing the value of policy π.

Definition 6.5.1 A policy π is stationary if π(a_t = a | s_t = s) = π(a_{t'} = a | s_{t'} = s)
for all time steps t, t', states s and actions a.

For infinite-horizon discounted MDPs, considering stationary policies is sufficient
in the sense that there is always a stationary policy maximizing utility. This
can be proven by induction, using arguments similar to other proofs given here. For
a detailed proof see Sect. 5.5 of [3]. The remainder of this chapter collects material
from [3]. We now present a set of important results that show how to express MDP
quantities like value vectors using linear algebra.
Remark 6.5.1 It holds that


    v^π = Σ_{t=0}^∞ γ^t (P^π)^t r.

Proof Let rt be the random reward at step t when starting in state s and following
policy π. Then
    v^π(s) = E^π( Σ_{t=0}^∞ γ^t r_t | s_0 = s )
           = Σ_{t=0}^∞ γ^t E^π(r_t | s_0 = s)
           = Σ_{t=0}^∞ γ^t Σ_{s'∈S} P^π(s_t = s' | s_0 = s) E(r_t | s_t = s')
           = Σ_{t=0}^∞ γ^t (P^π)^t r,

as the entries of (P^π)^t are precisely the t-step transition probabilities, and for any
distribution vector p over S, we have E_p r_t = p^⊤ r.                              □

One can show that the expected discounted total reward of a policy is equal to
the expected undiscounted total reward with a geometrically distributed horizon (see
Exercise 6.1). Accordingly, an MDP with discounted rewards is equivalent to one
where there is no discounting but a stopping probability (1 − γ) at every step.
The value of a particular policy can be expressed as a linear equation. This is an
important result, that has led to a number of successful algorithms that employ linear
algebra.
Theorem 6.5.1 For any stationary policy π, v π is the unique solution of

v = r + γ P π v. (6.5.1)

In addition, the solution is


v π = (I − γ P π )−1 r,

where I is the identity matrix.
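In code, Theorem 6.5.1 suggests evaluating a stationary policy by solving a linear system; the following NumPy sketch (an illustration with assumed array names: a state-indexed reward vector r and the policy transition matrix P_pi) does exactly that.

    import numpy as np

    def evaluate_policy(P_pi, r, gamma):
        """Exact policy evaluation: solve (I - gamma * P_pi) v = r."""
        n = P_pi.shape[0]
        # Solving the system is preferable to forming the inverse explicitly.
        return np.linalg.solve(np.eye(n) - gamma * P_pi, r)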


For the proof of Theorem 6.5.1 we make a few preparations first. For a matrix
A ∈ ℝ^{n×n} with entries a_{ij} we define the norm

    ‖A‖ ≜ max_i Σ_j |a_{i,j}|

and the spectral radius

    σ(A) ≜ lim_{t→∞} ‖A^t‖^{1/t}.

It holds that ‖A‖ ≥ σ(A), and if A is a probability matrix then ‖A‖ = σ(A) = 1.


Further, the following theorem holds.

Theorem 6.5.2 If σ( A) < 1, then (I − A)−1 exists and is given by


    (I − A)^{−1} = lim_{T→∞} Σ_{t=0}^T A^t.                                       (6.5.2)

Proof of Theorem 6.5.1 First note that from (6.5.1) one obtains r = (I − γP^π)v.
Since ‖γP^π‖ = γ · ‖P^π‖ = γ < 1, the inverse

    (I − γP^π)^{−1} = lim_{T→∞} Σ_{t=0}^T (γP^π)^t

exists by Theorem 6.5.2. It follows that

    v = (I − γP^π)^{−1} r = Σ_{t=0}^∞ γ^t (P^π)^t r = v^π,

where the last step is by Remark 6.5.1.                                           □


It is important to note that the entries of the matrix X = (I − γP^π)^{−1} are
the expected discounted cumulative visits to each state s', starting from
state s and following policy π. That is,

    x(s, s') = E^π_μ [ Σ_{t=0}^∞ γ^t I{s_t = s'} | s_0 = s ] = Σ_{t=0}^∞ γ^t P^π_μ(s_t = s' | s_0 = s).

This interpretation is quite useful, as many algorithms rely on an estimation of X for


approximating value functions.

6.5.3 Optimality Equations

Let us now look at the backwards induction algorithms in terms of operators. We


first introduce the one-step backwards induction operator for a fixed policy and the
Bellman operator, which for the optimal policy coincides with the former operator.
In the following, we denote by V the space of value functions, which is a Banach
space (i.e., a complete, normed vector space) equipped with the norm

    ‖v‖ = max { |v(s)| : s ∈ S }.

Definition 6.5.2 (Policy and Bellman operator) Let v ∈ V. Then the linear operator
of a policy π is given by

    L_π v ≜ r + γ P^π v.

Further, the (non-linear) Bellman operator is defined as

    L v ≜ max_π {r + γ P^π v}.

We now show that the Bellman operator satisfies the following monotonicity
properties with respect to an arbitrary value function v. Further, if a value function
v is optimal, then it satisfies the Bellman optimality equation

    v = L v.

In the following, we use the notation v ≥ v' in short for v(s) ≥ v'(s) for all s.

Theorem 6.5.3 Let v* ≜ max_π v^π. Then for any v ∈ V:


(1) If v ≥ L v, then v ≥ v ∗ .
(2) If v ≤ L v, then v ≤ v ∗ .
(3) If v = L v, then v = maxπ v π .

Proof We first prove (1). A simple proof by induction over n shows that for any π
and any n


    v ≥ r + γ P^π v ≥ Σ_{t=0}^{n−1} γ^t (P^π)^t r + γ^n (P^π)^n v.

Since v^π = Σ_{t=0}^∞ γ^t (P^π)^t r, it follows that

    v − v^π ≥ γ^n (P^π)^n v − Σ_{t=n}^∞ γ^t (P^π)^t r.

Let e be the vector with e(s) = 1 for all s. Then, as rewards are assumed to be in
[0, 1],

    Σ_{k=n}^∞ γ^k (P^π)^k r ≤ γ^n e / (1 − γ),

which for n → ∞ approaches 0. It follows that for any π

    v ≥ v^π,

and hence also v ≥ v*, which completes the proof of (1). A similar argument shows
(2), which together with (1) then implies (3).                                    □

A similar theorem can also be proven for the repeated application of the Bellman
operator.

Theorem 6.5.4 1. Let v, v' ∈ V with v' ≥ v. Then L v' ≥ L v.
2. Consider a sequence (v_n)_{n≥0} of value functions with arbitrary v_0 ∈ V and
v_{n+1} = L v_n for n ≥ 0. If there is an N with L v_N ≤ v_N, then L v_{N+k} ≤ v_{N+k}
for all k ≥ 0, and similarly for ≥.

Proof Let π' ∈ arg max_π {r + γ P^π v}. Then

    L v = r + γ P^{π'} v ≤ r + γ P^{π'} v' ≤ max_π {r + γ P^π v'} = L v',

where the first inequality is due to the fact that Pv' ≥ Pv for any transition
matrix P. For the second part, note that we have L v_N ≤ v_N by assumption and
hence L^k L v_N ≤ L^k v_N by the first part of the theorem. It follows that

    L v_{N+k} = v_{N+k+1} = L^k L v_N ≤ L^k v_N = v_{N+k}.                        □

Thus, if one starts with an initial value v 0 ≤ v for all v ∈ V, then repeated
application of the Bellman operator (known as value iteration as introduced in the
following section) converges monotonically to v ∗ . For example, if rewards are ≥ 0,
one may set v 0 = 0 and v n = L n v 0 is always a lower bound on the optimal value
function.
We eventually want to show that repeated application of the Bellman operator
always converges to the optimal value, independent of the initial value v 0 . As a
preparation, we need the following definition and the subsequent theorem.
Definition 6.5.3 For a Banach space X (i.e., a complete, normed linear space) we
say that T : X → X is a contraction mapping if there is γ ∈ [0, 1) such that ‖T u −
T v‖ ≤ γ‖u − v‖ for all u, v ∈ X.

Theorem 6.5.5 (Banach fixed-point theorem) Given a Banach space X and a con-
traction mapping T : X → X , it holds that:
1. There is a unique u∗ ∈ X with T u∗ = u∗ .
2. For any u0 ∈ X the sequence (un )n≥0 defined by un+1 = T un = T n+1 u0 con-
verges to u∗ .

Proof First note that for any m ≥ 1


    ‖u_{n+m} − u_n‖ ≤ Σ_{k=0}^{m−1} ‖u_{n+k+1} − u_{n+k}‖ = Σ_{k=0}^{m−1} ‖T^{n+k} u_1 − T^{n+k} u_0‖
                   ≤ Σ_{k=0}^{m−1} γ^{n+k} ‖u_1 − u_0‖ = γ^n (1 − γ^m)/(1 − γ) ‖u_1 − u_0‖.

Therefore, for each ε > 0, there is n such that ‖u_{n+m} − u_n‖ ≤ ε. Since X is a Banach
space, the sequence has a limit u* ∈ X.

Next, we show that this u∗ is indeed a fixed point of T . Indeed, we have

    ‖T u* − u*‖ ≤ ‖T u* − u_n‖ + ‖u_n − u*‖.                                      (6.5.3)

As ‖T u* − u_n‖ = ‖T u* − T u_{n−1}‖ ≤ γ‖u* − u_{n−1}‖, both terms on the right hand
side of (6.5.3) approach 0 for n → ∞, whence we have ‖T u* − u*‖ = 0 and T u* = u*.
We conclude by showing uniqueness. If u', u'' are both fixed points, then by the
contraction property

    ‖u' − u''‖ = ‖T u' − T u''‖ ≤ γ‖u' − u''‖,

whence it follows that ‖u' − u''‖ = 0 and u' = u''.                               □

Theorem 6.5.6 For γ ∈ [0, 1) the Bellman operator L is a contraction mapping


in V.

Proof Let v, v' ∈ V. Consider s ∈ S such that L v(s) ≥ L v'(s), and let

    a*_s ∈ arg max_{a∈A} { r(s) + γ Σ_{s'∈S} p(s' | s, a) v(s') }.

Then, as a*_s is optimal for v(s), but not necessarily for v'(s), we have

    0 ≤ L v(s) − L v'(s) ≤ γ Σ_{s'∈S} p(s' | s, a*_s) v(s') − γ Σ_{s'∈S} p(s' | s, a*_s) v'(s')
      = γ Σ_{s'∈S} p(s' | s, a*_s) [v(s') − v'(s')]
      ≤ γ Σ_{s'∈S} p(s' | s, a*_s) ‖v − v'‖ = γ‖v − v'‖.

Repeating the argument for s such that L v(s) ≤ L v'(s), we obtain

    |L v(s) − L v'(s)| ≤ γ‖v − v'‖.

It follows that ‖L v − L v'‖ = max_s |L v(s) − L v'(s)| ≤ γ‖v − v'‖.              □

We note that it is easy to adapt this proof to show that L_π is a contraction, too.

Theorem 6.5.7 1. There is a unique v ∗ ∈ V such that L v ∗ = v ∗ and v ∗ = maxπ v π .


2. For any stationary policy π, there is a unique v ∈ V such that Lπ v = v and
v = vπ .

Proof As the Bellman operator L is a contraction by Theorem 6.5.6, application of


Theorem 6.5.5 shows that there is a unique v ∗ ∈ V such that L v ∗ = v ∗ . This is also

the optimal value function due to Theorem 6.5.3. The second part of the theorem
follows from the first part when considering only a single policy π (which then is
optimal). 

6.5.4 MDP Algorithms for Infinite Horizon and Discounted


Rewards

We now take a look at the basic algorithms for obtaining an optimal policy when
the Markov decision process is known. Value iteration is a simple extension of the
backwards induction algorithm to the infinite horizon case. Alternatively, policy iter-
ation evaluates and improves a sequence of Markov policies. We also discuss two
variants of these methods, namely modified policy iteration, which is somewhere in
between value and policy iteration, and temporal-difference policy iteration, which is
related to classical reinforcement learning algorithms such as Sarsa and Q-Learning.
Another basic technique is linear programming, which is useful in theoretical anal-
yses as well as in some special practical cases. While for the sake of simplicity,
we stick to our assumption that rewards depend only on the state, the algorithms
described below can easily be extended to the case when the reward also depends on
the action.

6.5.4.1 Value Iteration

Value iteration corresponds to repeated application of the Bellman operator intro-


duced in the previous section.

Algorithm 6.3 Value iteration


input μ
Initialization: Choose some v_0 ∈ V.
for k = 1, 2, . . . , n do
    for s ∈ S do
        v_k(s) = max_a { r(s) + γ Σ_{s'∈S} p(s' | s, a) v_{k−1}(s') }
    end for
end for
return v_n and π_n = arg max_π {r + γ P^π v_n}
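A possible NumPy implementation of value iteration (a sketch assuming the conventions P[s, a, s'] and a state-dependent reward vector r[s]; the stopping rule follows Theorem 6.5.8 below) is given here.

    import numpy as np

    def value_iteration(P, r, gamma, epsilon=1e-3):
        n_states = P.shape[0]
        v = np.zeros(n_states)
        while True:
            # Q[s, a] = r(s) + gamma * sum_s' p(s'|s,a) v(s')
            Q = r[:, None] + gamma * (P @ v)
            v_new = Q.max(axis=1)
            # Stopping criterion of Theorem 6.5.8 (2a): the greedy policy is then epsilon-optimal.
            if np.max(np.abs(v_new - v)) < epsilon * (1 - gamma) / (2 * gamma):
                return v_new, Q.argmax(axis=1)
            v = v_new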

Value iteration is a direct extension of the backwards induction algorithm for


infinite horizon and finite state space S. Since we know that stationary policies are
optimal, there is no need to maintain the values and actions for all time steps. At each
step, it is sufficient to merely keep the previous value v n−1 . The following theorem
guarantees that value iteration converges to the optimal value and provides bounds
for the error of each single estimate v n .

Theorem 6.5.8 Value iteration satisfies:


1. lim_{n→∞} ‖v_n − v*‖ = 0.
2. For each ε > 0 there exists N_ε < ∞ such that for all n ≥ N_ε:
   (a) ‖v_{n+1} − v_n‖ ≤ ε(1 − γ)/2γ.
   (b) ‖v_n − v*‖ < ε/2.
   (c) The policy π_n is ε-optimal, i.e., v^{π_n}(s) ≥ v*(s) − ε for all states s.

Proof Statements 1 and 2(a) follow from Theorems 6.5.5 and 6.5.6 from the previous
section. Now note that

    ‖v^{π_n} − v*‖ ≤ ‖v^{π_n} − v_n‖ + ‖v_n − v*‖.

For the first term we have by Theorems 6.5.1 and 6.5.6

    ‖v^{π_n} − v_n‖ = ‖L_{π_n} v^{π_n} − v_{n+1}‖
                    ≤ ‖L_{π_n} v^{π_n} − L v_{n+1}‖ + ‖L v_{n+1} − v_{n+1}‖
                    = ‖L_{π_n} v^{π_n} − L_{π_n} v_{n+1}‖ + ‖L v_{n+1} − L v_n‖
                    ≤ γ ‖v^{π_n} − v_{n+1}‖ + γ ‖v_{n+1} − v_n‖.

A similar argument gives a respective bound for the second term ‖v_n − v*‖. Then,
rearranging we obtain

    ‖v^{π_n} − v_{n+1}‖ ≤ γ/(1 − γ) ‖v_{n+1} − v_n‖,    ‖v_n − v*‖ ≤ γ/(1 − γ) ‖v_{n+1} − v_n‖,

and 2(b) as well as 2(c) follow from 2(a).                                        □

Theorem 6.5.8 also bounds the error when stopping value iteration once the
change in the estimated value function from one iteration to the next falls below
a small threshold. As we have already discussed in the context of Theorem 6.5.4 in
the previous section, initializing v_0 = 0 guarantees that value iteration converges
monotonically. The following theorem shows that value iteration converges
with an error of O(γ^n) in this case.

Theorem 6.5.9 Let v_0 = 0 and assume that rewards are bounded in [0, 1]. Then

    ‖v_n − v*‖ ≤ γ^n/(1 − γ),    and    ‖v^{π_n} − v*‖ ≤ 2γ^n/(1 − γ).

Proof For the first part, note that

    ‖v_0 − v*‖ = ‖v*‖ ≤ 1/(1 − γ) = γ^0/(1 − γ).

Proceeding by induction over n proves the first claim, as by the contraction property
of Theorem 6.5.6 we have

    ‖v_{n+1} − v*‖ = ‖L v_n − L v*‖ ≤ γ‖v_n − v*‖.

The second part can be shown similarly to 2(c) of Theorem 6.5.8.                  □
The second part can be shown similarly to 2(c) of Theorem 6.5.8. 


Although value iteration converges exponentially fast, the convergence depends
on the discount factor γ. When γ is very close to one, convergence can be extremely
slow. In fact, [4] showed that the number of iterations is of the order 1/(1 − γ)
for bounded accuracy of the input data. Omitting logarithmic factors, the overall
complexity is Õ(|S|²|A|L(1 − γ)^{−1}), where L is the total number of bits used to
represent the input.¹

¹ Thus, the result is of weakly polynomial complexity, due to the dependence on the input size description.

6.5.4.2 Policy Iteration

Unlike value iteration, policy iteration attempts to iteratively improve a given policy,
rather than a value function. Starting with an arbitrary policy π0 , at each iteration it
first computes the value of the current policy. In finite MDPs this policy evaluation
step can be performed with either linear algebra or backwards induction. In a second
step called policy improvement the policy is updated by choosing the policy that is
greedy with respect to the value function computed in the evaluation step.

Algorithm 6.4 Policy iteration


input μ
Initialization: Choose some policy π_0. Set n = 0.
repeat
    v_n = v^{π_n}                              // policy evaluation
    π_{n+1} = arg max_π {r + γ P^π v_n}        // policy improvement
    n = n + 1
until π_{n+1} = π_n
return π_{n+1}, v_n
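A corresponding NumPy sketch of policy iteration with exact evaluation (again an illustration under the assumed conventions P[s, a, s'] and a state-dependent reward vector r[s]) follows.

    import numpy as np

    def policy_iteration(P, r, gamma):
        n_states = P.shape[0]
        pi = np.zeros(n_states, dtype=int)
        while True:
            # Policy evaluation: v = (I - gamma * P_pi)^{-1} r
            P_pi = P[np.arange(n_states), pi]
            v = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r)
            # Policy improvement: greedy policy with respect to v (ties broken by argmax)
            pi_new = (r[:, None] + gamma * (P @ v)).argmax(axis=1)
            if np.array_equal(pi_new, pi):
                return pi, v
            pi = pi_new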

The following theorem shows that the policies generated by policy iteration are
monotonically improving.
Theorem 6.5.10 For value vectors v n , v n+1 generated by policy iteration it holds
that v n ≤ v n+1 .
Proof From the policy improvement step,

    r + γ P^{π_{n+1}} v_n ≥ r + γ P^{π_n} v_n = v_n,

where the equality is due to the policy evaluation step for π_n. Rearranging, we get
r ≥ (I − γ P^{π_{n+1}}) v_n and hence

    (I − γ P^{π_{n+1}})^{−1} r ≥ v_n.

By Theorem 6.5.1 and the policy evaluation step for π_{n+1} the left hand side equals
v_{n+1}, which completes the proof.                                               □
Theorem 6.5.10 can be used to show that policy iteration terminates after a finite
number of steps.
Corollary 6.5.1 Policy iteration terminates after a finite number of iterations and
returns an optimal policy.
Proof There is only a finite number of policies, and since policies in policy iteration
are monotonically improving, the algorithm must stop after finitely many iterations.
Finally, the last iteration satisfies

    v_n = max_π {r + γ P^π v_n},

that is, v n solves the optimality equation and the claim follows by Theorem 6.5.3.
As even in finite MDPs there are |A|^{|S|} policies, Corollary 6.5.1 only guarantees
exponential-time convergence in the number of states. However, the complexity of
policy iteration can be shown to be actually strongly polynomial [5], with the number
of required iterations being Õ(|S|²|A|(1 − γ)^{−1}), again omitting logarithmic factors.
As the behaviour of policy iteration seems to be quite different from value iteration,
one is interested in algorithms that lie between policy iteration and value iteration.
We will have a look at two such algorithms in the following two subsections.

6.5.4.3 Modified Policy Iteration

Modified policy iteration tries to speed up policy iteration by performing an m-step


update in the policy evaluation step. For m = 1, the algorithm is identical to value
iteration, while for m → ∞ the algorithm corresponds to policy iteration. Modified
policy iteration can perform much better than either pure value iteration or pure
policy iteration.

Algorithm 6.5 Modified policy iteration


input μ, parameter m
Initialization: Choose some v_0 ∈ V.
for k = 1, 2, . . . , n do
    π_k = arg max_π {r + γ P^π v_{k−1}}        // policy improvement
    v_k = L_{π_k}^m v_{k−1}                    // partial policy evaluation
end for
return π_n, v_n
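A sketch of modified policy iteration under the same array conventions (the m-step evaluation replaces the exact linear solve) could look as follows.

    import numpy as np

    def modified_policy_iteration(P, r, gamma, m, n_iterations):
        """Assumes n_iterations >= 1; state-dependent rewards r[s]."""
        n_states = P.shape[0]
        v = np.zeros(n_states)
        for _ in range(n_iterations):
            pi = (r[:, None] + gamma * (P @ v)).argmax(axis=1)   # policy improvement
            P_pi = P[np.arange(n_states), pi]
            for _ in range(m):                                   # partial policy evaluation
                v = r + gamma * (P_pi @ v)
        return pi, v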

6.5.4.4 A Geometric View

It is perhaps interesting to see the problem from a geometric perspective. This also
gives rise to the so-called temporal-difference algorithms which will be considered
below. First, we define the difference operator, which is the difference between a
value function vector v and its transformation via the Bellman operator.
Definition 6.5.4 The difference operator is defined as

    Bv ≜ max_π {r + (γ P^π − I)v} = L v − v.

Accordingly, the Bellman optimality equation can be rewritten as

    Bv = 0.

Defining the set of greedy policies with respect to a value vector v ∈ V as

    Π_v ≜ arg max_π {r + (γ P^π − I)v},

we can show the following inequality between two value function vectors.
Theorem 6.5.11 For any v, v' ∈ V and π ∈ Π_v,

    Bv' ≥ Bv + (γ P^π − I)(v' − v).

Proof By definition, Bv' ≥ r + (γ P^π − I)v', while Bv = r + (γ P^π − I)v. Sub-
tracting the latter from the former gives the result.                             □

The inequality in Theorem 6.5.11 is similar to the convexity of the Bayes-optimal


utility (3.3.2). Geometrically, we can see from a look at Fig. 6.7 that applying the
Bellman operator to a value function always improves it, yet may have a negative
effect on the other value function. If the number of policies is finite, then the figure is
also a good illustration of the policy iteration algorithm, where each value function
improvement results in a new point on the horizontal axis, and the choice of the best
improvement (highest line) for that point. In fact, we can write the policy iteration
algorithm in terms of the difference operator.

Theorem 6.5.12 Let (v n )n≥0 be the sequence of value vectors obtained from policy
iteration. Then for any π ∈ Πvn ,

v n+1 = v n − (γ P π − I)−1 Bv n . (6.5.4)

Proof By definition, we have for π ∈ Πvn



Fig. 6.7 The graph shows the effect of the difference operator on the optimal value
function v* as well as on two arbitrary value functions v_1, v_2. Each line is the
improvement effected by the greedy policy π*, π_1, π_2 with respect to each value
function v*, v_1, v_2. [Figure: a plot of Bv against v, with one line per greedy policy.]

    v_{n+1} = (I − γ P^π)^{−1} r − v_n + v_n
            = (I − γ P^π)^{−1} [r − (I − γ P^π)v_n] + v_n.

Since r − (I − γ P^π)v_n = Bv_n, the claim follows.                               □

6.5.4.5 Temporal-Difference Policy Iteration

Similarly to modified policy iteration, temporal-difference policy iteration replaces


the next-step value with an approximation v n of the value v πn of the policy at step n.
Informally, this approximation is chosen so as to reduce the discrepancy of the value
function over time.
More precisely, at the nth iteration of the algorithm, we use a policy improvement
step to obtain the next policy πn+1 given the current approximation v n , i.e.,

Lπn+1 v n = L v n .

In order to update the value from v_n to v_{n+1} we rely on the temporal difference,
defined as

    d_n(s, s') = r(s) + γ v_n(s') − v_n(s).

The temporal difference error can be seen as the difference in the estimate when we
move from state s to state s'. In fact, it is easy to see that if the value function estimate
satisfies v_n = v^{π_n}, then the expected error is zero, as

    Σ_{s'∈S} d_n(s, s') p(s' | s, π_n(s)) = r(s) + γ Σ_{s'∈S} v_n(s') p(s' | s, π_n(s)) − v_n(s).

Note the similarity to the difference operator in modified policy iteration. The idea
of temporal-difference policy iteration is to adjust the current value v n , using the
temporal differences mixed over an infinite number of steps, that is, we set

    v_{n+1} = v_n + τ_n,    where
    τ_n(s) = E^{π_n} [ Σ_{t=0}^∞ (γλ)^t d_n(s_t, s_{t+1}) | s_0 = s ].

The parameter λ is a simple way to weight the different temporal difference errors.
If λ → 0, the error τ n is dominated by the short-term discrepancies in our value
function, while for λ → 1 also the terms far in the future matter. In the end, the value
function is adjusted in the direction of this error.
Putting all of those steps together, the algorithm looks as follows.

Algorithm 6.6 Temporal-Difference Policy Iteration


input μ, parameter λ ∈ [0, 1]
Initialization: Choose some v_0 ∈ V. Set n = 0.
repeat
    π_{n+1} = arg max_π {r + γ P^π v_n}        // policy improvement
    v_{n+1} = v_n + τ_n                        // temporal difference update
    n = n + 1
until π_{n+1} = π_n
return π_n, v_n

It can be shown that v n+1 is the unique fixed point of the equation

    D_n v ≜ (1 − λ) L_{π_{n+1}} v_n + λ L_{π_{n+1}} v.

That is, if we repeatedly apply the above operator to some vector v, then we approach
a fixed point v ∗ = Dn v ∗ . It is interesting to see what happens at the two extreme
choices of λ in this case. For λ = 1, this becomes standard policy iteration, as the
fixed point satisfies v ∗ = Lπn+1 v ∗ so that v ∗ must be the value of policy πn+1 . For
λ = 0, one obtains standard value iteration, as the fixed point is reached in one
step and is v ∗ = Lπn+1 v n , i.e., the approximate value of the one-step greedy policy.
In general, the new value vector is moved only partially towards the direction of the
Bellman update, depending on how we choose λ.

6.5.4.6 Linear Programming

Perhaps surprisingly, we can also solve an MDP through linear programming, refor-
mulating the maximization problem as a linear optimization problem with linear
constraints. As a first step we recall that there is an easy way to determine whether

a particular v is an upper bound on the optimal value function v ∗ , since if v ≥ L v


then v ≥ v ∗ by Theorem 6.5.3. In order to transform this into a linear program, we
must first define a scalar function to minimize. We can do this by selecting some
arbitrary distribution y ∈ Δ(S) on the state space. Then, denoting the vector of next
state probabilities p(s' | s, a) by p_{s,a}, we can write the following linear program.

Primal linear program

    min_v y^⊤ v,

such that for all s ∈ S, a ∈ A

    v(s) − γ p_{s,a}^⊤ v ≥ r(s).

Note that the inequality condition is equivalent to v ≥ L v, so that the smallest


v that satisfies the inequality will be the optimal value function that satisfies the
Bellman equation.
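As an illustration, the primal program can be handed to any LP solver; the following sketch uses scipy.optimize.linprog (the use of SciPy and the array conventions P[s, a, s'], r[s] are our assumptions).

    import numpy as np
    from scipy.optimize import linprog

    def solve_mdp_lp(P, r, gamma, y=None):
        """Minimize y^T v subject to v(s) - gamma * p_{s,a}^T v >= r(s) for all (s, a)."""
        n_states, n_actions = P.shape[0], P.shape[1]
        if y is None:
            y = np.full(n_states, 1.0 / n_states)
        # Rewrite each constraint as (gamma * p_{s,a} - e_s)^T v <= -r(s).
        A_ub = np.zeros((n_states * n_actions, n_states))
        b_ub = np.zeros(n_states * n_actions)
        for s in range(n_states):
            for a in range(n_actions):
                row = gamma * P[s, a].copy()
                row[s] -= 1.0
                A_ub[s * n_actions + a] = row
                b_ub[s * n_actions + a] = -r[s]
        res = linprog(c=y, A_ub=A_ub, b_ub=b_ub, bounds=[(None, None)] * n_states)
        return res.x   # the optimal value function v*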
It also pays to look at the dual linear program, that aims to find the maximal
cumulative discounted state-action visits x(s, a) that are consistent with the transition
kernel of the process.

Dual linear program



    max_x Σ_{s∈S} Σ_{a∈A} x(s, a) r(s, a),

such that x(s, a) ≥ 0 and for all s' ∈ S

    Σ_{a∈A} x(s', a) − γ Σ_{s∈S} Σ_{a∈A} p(s' | s, a) x(s, a) = y(s')

with y ∈ Δ(S).

In this case, the respective vector x ∈ R|S×A| can be interpreted as the discounted
sum of state-action visits, as shown in the following theorem.
Theorem 6.5.13 For any policy π,

    x_π(s, a) = E^π [ Σ_{t≥0} γ^t I{s_t = s, a_t = a} | s_0 ∼ y ]

is a feasible solution to the dual problem. On the other hand, if x is a feasible solution
to the dual problem, then Σ_a x(s, a) > 0. Defining a respective randomized policy

    π_x(a | s) = x(s, a) / Σ_{a'∈A} x(s, a'),

x_{π_x} = x is a feasible solution of the dual program.


The equality condition of the dual program ensures that x is consistent with
the transition kernel of the MDP. Consequently, the program can be seen as a search
among all possible cumulative state-action distributions to find the one giving the
highest total reward.

6.6 Optimality Criteria

While we have concentrated on discounted rewards under infinite horizon, we con-


clude this chapter with an overview of different optimality criteria. As already indi-
cated previously, the following two views of discounted reward processes are equiv-
alent.

Infinite horizon, discounted


Discount rewards by discount factor γ with 0 < γ < 1. Then

    U_t = Σ_{k=0}^∞ γ^k r_{t+k}    and    E U_t = Σ_{k=0}^∞ γ^k E r_{t+k}.

Geometric horizon, undiscounted


At each step t, the process terminates with probability 1 − γ. That is,

    U_t^T = Σ_{k=0}^{T−t} r_{t+k}    with T ∼ Geom(1 − γ),    so that    E U_t = Σ_{k=0}^∞ γ^k E r_{t+k}.

In the following, let Π be the set of all history-dependent (possibly randomized)


policies. Similar to before, we write v πγ (s) for the total discounted reward of policy
π ∈ Π when starting in state s, now emphasizing the dependence on the discount
factor γ.
Definition 6.6.1 π* is discount optimal for γ ∈ [0, 1) if

    v^{π*}_γ(s) ≥ v^π_γ(s)    ∀s ∈ S, π ∈ Π.

The average reward (gain) criterion. Beside the expected total reward defined as
V_t^{π,T} ≜ E^π U_t^T, the expected average reward is a natural criterion.

Definition 6.6.2 The gain g of a policy π starting from state s is defined as

    g^π(s) ≜ lim_{T→∞} (1/T) V^{π,T}(s),

that is, the expected average reward the policy obtains when starting in s.

If the limit in Definition 6.6.2 does not exist, one may consider the limits

    g_+^π(s) ≜ lim sup_{T→∞} (1/T) V^{π,T}(s),    g_−^π(s) ≜ lim inf_{T→∞} (1/T) V^{π,T}(s).

Definition 6.6.3 π* is gain optimal if

    g^{π*}(s) ≥ g^π(s)    ∀s ∈ S, π ∈ Π.

Definition 6.6.4 π* is overtaking optimal if

    lim inf_{T→∞} [ V^{π*,T}(s) − V^{π,T}(s) ] ≥ 0    ∀s ∈ S, π ∈ Π.

The existence of overtaking optimal policies is not guaranteed. The following


criterion is a weaker one.

Definition 6.6.5 π* is average-overtaking optimal if

    lim inf_{T→∞} (1/T) [ V^{π*,T}(s) − V_+^π(s) ] ≥ 0    ∀s ∈ S, π ∈ Π,

where V_+^π(s) ≜ E^π [ Σ_{t=1}^∞ max{r_t, 0} | s_t = s ].

Definition 6.6.6 π* is n-discount optimal for n ∈ {−1, 0, 1, . . .} if

    lim inf_{γ↑1} (1 − γ)^{−n} [ v^{π*}_γ(s) − v^π_γ(s) ] ≥ 0    ∀s ∈ S, π ∈ Π.

Definition 6.6.7 A policy π* is Blackwell optimal if ∀s, ∃γ*(s) such that

    v^{π*}_γ(s) − v^π_γ(s) ≥ 0,    ∀π ∈ Π, γ*(s) ≤ γ < 1.

Lemma 6.6.1 If a policy is m-discount optimal then it is n-discount optimal for all
n ≤ m.

Lemma 6.6.2 Gain optimality is equivalent to (−1)-discount optimality.



The different optimality criteria summarized here are treated in detail in


Chapter 5 of [3].

6.7 Summary

Markov decision processes can be used to represent shortest path problems, stop-
ping problems, experiment design problems, multi-armed bandit and more general
reinforcement learning problems.
Bandit problems are the simplest type of Markov decision process, since they
have a single fixed, never-changing state. However, to solve them, one can construct
a Markov decision process in belief space within a Bayesian framework. It is then
possible to apply backwards induction to find the optimal policy.
Backwards induction is applicable more generally to arbitrary Markov decision
processes. For the case of infinite-horizon problems, it is referred to as value iteration,
as it converges to a fixed point. It is tractable when either the state space S or the
horizon T are small (finite).
When the horizon is infinite, policy iteration can also be used to find optimal
policies. It is different from value iteration in that at every step, it fully evaluates
a policy before the improvement step, while value iteration only performs a partial
evaluation. In fact, at the nth iteration, value iteration has calculated the value of an
n-step policy.
We can arbitrarily mix between the two extremes of policy iteration and value
iteration in two ways. Firstly, we can perform a k-step partial evaluation, which
is called modified policy iteration. When k = 1 this coincides with value iteration,
while for k → ∞, one obtains policy iteration. Secondly, we can adjust our value
function by using a temporal difference error of values in future time steps. Again,
we can mix liberally between policy iteration and value iteration by focusing on
errors far in the future (policy iteration) or on short-term errors (value iteration).
Finally, it is possible to solve MDPs through linear programming, reformulating
the MDP as a linear optimization problem with constraints. In the primal formulation,
we attempt to find a minimal upper bound on the optimal value function. In the dual
formulation, our goal is to find a distribution on state-action visits that maximizes
expected utility and is consistent with the MDP model.

6.8 Further Reading

For further information on the MDP formulation of bandit problems in the decision
theoretic setting see the last chapter of [6], which was explored in more detail in Duff’s
PhD thesis [7]. When the number of (information) states in the bandit problem is
finite, [8] has proven that it is possible to formulate simple so-called index policies.
However, this is not generally applicable. Easily computable, near-optimal heuristic

strategies for bandit problems will be presented in Chap. 10. The decision-theoretic
solution to the unknown MDP problem is given in Chap. 9.
Further theoretical background on MDPs, including many of the theorems in
Sect. 6.5, can be found in [3]. Chap. 2 of [9] gives a quick overview of MDP theory
from the operator perspective. The introductory reinforcement learning book of [10]
also explains the basic Markov decision process framework.

6.9 Exercises

6.9.1 MDP Theory

Exercise 6.1 Show that the expected discounted total reward of any given policy is
equal to the expected undiscounted total reward with a finite, but random horizon T .
In particular, let T be distributed according to a geometric distribution on {1, 2, . . .}
with parameter 1 − γ. Then show that

    E [ lim_{T→∞} Σ_{k=0}^T γ^k r_k ] = E [ Σ_{k=0}^T r_k | T ∼ Geom(1 − γ) ].

6.9.2 Automatic Algorithm Selection

Consider the problem of selecting algorithms for finding solutions to a sequence of


problems. Assume you have n algorithms to choose from. At time t, you get a task
and choose one of the algorithms. Assume that the algorithms are randomized, so
that each algorithm will find a solution with some unknown probability. Our aim is
to maximize the expected total number of solutions found. Consider the following
specific cases of this problem.

Exercise 6.2 Assume that the probability that the ith algorithm successfully solves
the tth task is always pi . Furthermore, tasks are in no way distinguishable from
each other. In each case, let p_i ∈ {0.1, . . . , 0.9} and assume a prior distribution
ξ_i(p_i) = 1/9 for all i, with a complete belief ξ(p) = Π_i ξ_i(p_i). Then formulate the
problem as a decision-theoretic n-armed bandit problem with reward at time t being
r_t = 1 if the task is solved and r_t = 0 if it is not. Independently of
whether the task at time t is solved, at the next time step the next
problem is to be solved. The aim is to find a policy π mapping from the history of
observations to the set of algorithms that maximizes the total reward to time T in
expectation, i.e.,


    E_{ξ,π} U_0^T = E_{ξ,π} Σ_{t=1}^T r_t.

1. Characterize the essential difference between maximizing U_0^0, U_0^1, U_0^2.


2. For n = 3 algorithms, calculate the maximum expected utility

    max_π E_{ξ,π} U_0^T

using backwards induction for T ∈ {0, 1, 2, 3, 4} and report the expected utility
in each case. Hint: Use the decision-theoretic bandit formulation to dynamically
construct a Markov decision process which you can solve with backwards induc-
tion. See also the extensive form of the utility from (3.5.2).
3. Now utilize the backwards induction algorithm developed in the previous step in
a problem where we receive a sequence of N tasks to solve and our utility is


    U_0^N = Σ_{t=1}^N r_t.

At each step t ≤ N, find the optimal action by calculating E_{ξ,π} U_t^{t+T} for T ∈
{0, 1, 2, 3, 4}. Hint: At each step you can update your posterior distribution using the
same routine you use to update your prior distribution. You only need to consider
T < N − t.
4. Develop a simple heuristic algorithm for your choice and compare its utility with
the utility of backwards induction. Perform 103 simulations, each experiment run-
ning for N = 103 time-steps and average the results. How does the performance
improve? Hint: If the program runs too slowly go only up to T = 3.

6.9.3 Scheduling

You are controlling a set of n processing nodes of a processing network that is part
of a big CPU farm. At time t, you may be given a job of class xt ∈ X to execute.
Assume these are identically and independently drawn such that P(xt = k) = pk for
all t, k. With some probability p0 , you are not given a job to execute at the next step.
If you do have a new job, then you can either:
(a) Ignore the job.
(b) Send the job to some node i. If the node is already active, then the previous job
is lost.
Not all the nodes and jobs are equal. Some nodes are better at processing certain
types of jobs. If the ith node is running a job of type k ∈ X , then it has a probability

φi,k ∈ [0, 1] of finishing it within that time step. Then the node becomes free and can
accept a new job.
For this problem, assume that there are n = 3 nodes and k = 2 types of jobs and
that the completion probabilities are given by the matrix
    Φ = [ 0.3  0.1
          0.2  0.2
          0.1  0.3 ].

Further, we assume p0 = 0.1, p1 = 0.4, p2 = 0.5 to be the probabilities of not get-


ting any job, and the probabilities of the two job types, respectively. We wish to find
the policy π maximizing the expected total reward given the MDP model μ, that is,


    E_{μ,π} Σ_{t=0}^∞ γ^t r_t,                                                    (6.9.1)

where γ = 0.9 and we get a reward of 1 every time a job is completed.


More precisely, at each time step t, the following events happen:
1. A new job xt appears.
2. Each node either continues processing, or completes its current job and becomes
free. You get a reward rt equal to the number of nodes that complete their jobs
within this step.
3. You decide whether to ignore the new job or add it to one of the nodes. If you add
a job, then it immediately starts running for the duration of the time step. (If the
job queue is empty then obviously you cannot add a job to a node.)

Exercise 6.3 Solve the following problems:


1. Identify the state and action space of this problem and formulate it as a Markov
decision process. Hint: Use independence of the nodes to construct the MDP
parameters.
2. Solve the problem using value iteration, using the stopping criterion indicated in
Theorem 6.5.8 (2c), where ε = 0.1. Indicate the number of iterations needed to
stop.
3. Solve the problem using policy iteration. Indicate the number of iterations needed
to stop. Hint: You can either modify the value iteration algorithm to perform policy
evaluation, using the same ε, or you can use the linear formulation. If you use
the latter, take care with the inverse!
4. Now consider an alternative version of the problem, where we suffer a penalty of
0.1 (i.e., we get a negative reward) for each time-step that each node is busy. Are
the solutions different?
5. Finally consider a version of the problem, where we suffer a penalty of 10 each
time we cancel an executing job. Are the solutions different?
6. Plot the value function for the optimal policy in each setting.

Hint: To verify that your algorithms work, test them first on a smaller MDP
with known solutions. For example, check out http://webdocs.cs.ualberta.ca/~sutton/book/ebook/node35.html.

6.9.4 General Questions

Exercise 6.4
1. What in your view are the fundamental advantages and disadvantages of modelling
problems as Markov decision processes?
2. Is the algorithm selection problem of Exercise 6.2 solvable with policy iteration?
If so, how?
3. What are the fundamental similarities and differences between the decision-
theoretic finite-horizon bandit setting and the infinite-horizon MDP setting?

References

1. Chernoff, H.: Sequential design of experiments. Ann. Math. Stat. 30(3), 755–770 (1959)
2. Chernoff, H.: Sequential models for clinical trials. In: Proceedings of the Fifth Berkeley Sym-
posium on Mathematical Statistics and Probability, vol. 4, pp. 805–812. University of California
Press (1966)
3. Puterman, M.L.: Markov Decision Processes: Discrete Stochastic Dynamic Programming.
Wiley, New Jersey, US (1994)
4. Tseng, P.: Solving H-horizon, stationary Markov decision problems in time proportional to
log(H). Oper. Res. Lett. 9(5), 287–297 (1990)
5. Ye, Y.: The simplex and policy-iteration methods are strongly polynomial for the Markov
decision problem with a fixed discount rate. Math. Oper. Res. 36(4), 593–603 (2011)
6. DeGroot, M.H.: Optimal Statistical Decisions. Wiley (1970)
7. O’Gordon Duff, M.: Optimal Learning: Computational Procedures for Bayes-adaptive Markov
Decision Processes. Ph.D. thesis, University of Massachusetts at Amherst (2002)
8. Gittins, J.C.: Multi-armed Bandit Allocation Indices. Wiley, New Jersey, US (1989)
9. Bertsekas, D.P., Tsitsiklis, J.N.: Neuro-Dynamic Programming. Athena Scientific (1996)
10. Sutton, R.S., Barto, A.G.: Reinforcement Learning: An Introduction. MIT Press (1998)
Chapter 7
Simulation-Based Algorithms

7.1 Introduction

In this chapter, we consider the general problem of reinforcement learning in dynamic


environments. Up to now, we have only examined a solution method for bandit
problems, which are only a special case. The Bayesian decision-theoretic solution as
presented in Chap. 6 is to reduce the bandit problem to a Markov decision process
which can then be solved with backwards induction.
On the other hand, we also have seen that MDPs can be used to describe envi-
ronments in more general reinforcement learning problems. If the underlying MDP
is fully known, then a number of standard algorithms—some of which we have
introduced in Chap. 6—can be employed to find the optimal policy.
However, in the actual reinforcement learning problem, the model of the environ-
ment is unknown. Thus, the main focus of this chapter will be how to simultaneously
learn about the underlying process and act to maximize utility in an approximate way.
This can be done through approximate dynamic programming, where we replace the
actual unknown parameters of the underlying process with estimates. The estimates
can be improved by drawing samples from the environment, either by acting within
the real environment or using a simulator. Although the algorithms following either
approach may not perform as well as the Bayes-optimal solution, they are
computationally cheap, so they are worth investigating in practice.
It is important to note that while the algorithms presented in this chapter may
converge eventually to an optimal policy, their online performance can be quite bad.
That is, they may not accumulate a lot of reward while still learning. In that sense,
they are not offering optimal online solutions to the reinforcement learning problem.
For online algorithms with respective guarantees we refer to Chap. 10.


7.1.1 The Robbins-Monro Approximation

As already outlined, we are interested in reinforcement learning algorithms that


replace the unknown actual process in which the learner is acting with an estimate
that will eventually converge to the true process. Accordingly, the actions taken by
the learner shall be nearly-optimal with respect to the estimate.
To approximate the process, one can use the general idea of Robbins-Monro
stochastic approximation [1]. This entails maintaining a point estimate of the parameter
we want to approximate and performing random steps that on average move towards
the solution, in a way to be made more precise later. Stochastic approximation
actually defines a large class of procedures, containing stochastic gradient descent
as a special case.
Algorithm 7.1 shows an algorithm for the multi-armed bandit problem that uses a
Robbins-Monro approximation. The parameters include a particular policy π which
defines a probability distribution over the arms given the observed history, a set of
initial estimates r̂i,0 for the mean of each arm i, and a sequence of step sizes (αt )t .
The algorithm chooses an arm according to the policy π (line 3) after which
the estimate r̂t,i of the chosen arm i is updated (lines 4–6). As we shall see later,
this particular update rule can be seen as trying to minimize the expected squared
error between the estimated reward and the random reward obtained by each arm.
Consequently, the variance of the reward of each arm plays an important role.
The step-sizes αt are usually chosen to decrease with time t to guarantee conver-
gence. Obviously, αt can be chosen so that the r̂i,t are simply the mean estimates of
the expected value of the reward for each arm i, which is a natural choice.
Concerning an appropriate choice of the policy used by the Robbins-Monro
algorithm, note that all arms should be chosen often enough, so that one has good
estimates for the expected reward of every arm. One simple way to achieve that is
to play the apparently best arm most of the time, but to sometimes select an arm at
random. This is called ε-greedy action selection.
Definition 7.1.1 (ε-greedy policy) Let Â*_t = arg max_{i∈A} r̂_{t,i} denote the set of empirically best arms at step t. Then the ε-greedy policy π̂*_ε with parameters (ε_t)_t is given by

π̂*_ε ≜ (1 − ε_t) Unif(Â*_t) + ε_t Unif(A).

In general it makes sense to let ε_t depend on t, so that the randomness can be chosen
to decrease over time when estimates converge to their true values.
When using ε-greedy in our Robbins-Monro bandit algorithm, the two main
parameters to choose are the amount of randomness ε_t and the step-size α_t in the estimation.
Both of them have a significant effect on the performance of the algorithm.
Although as indicated above, in general it makes sense to vary these parameters with
time, it is perhaps instructive to look at what happens for fixed values of ε and α.
Figure 7.1 shows the average reward obtained if we keep the step size α or the
randomness ε fixed, respectively, with initial estimates r̂_{0,i} = 0.

Algorithm 7.1 Robbins-Monro bandit algorithm


1: input Set of arms A = {1, . . . , K }, step-sizes (αt )t , initial estimates (r̂i,0 )i , policy π.
2: for t = 1, . . . , T do
3: Choose arm at = i with probability π(i | a1 , . . . , at−1 , r1 , . . . , rt−1 ).
4: Observe reward rt .
5: r̂_{t,i} = α_t r_t + (1 − α_t) r̂_{t−1,i} // estimation step
6: r̂_{t,j} = r̂_{t−1,j} for all j ≠ i // other arms remain unchanged
7: end for
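As a concrete illustration, the following Python sketch is one possible implementation of Algorithm 7.1 with ε-greedy action selection. The Gaussian reward distributions, the arm means, and the fixed values of ε and α are assumptions made only for this example.

import numpy as np

# Sketch of Algorithm 7.1 with an epsilon-greedy policy.  The Gaussian rewards,
# arm means and fixed epsilon/alpha values are assumptions of this example.
def robbins_monro_bandit(means, T=10_000, epsilon=0.1, alpha=0.1, seed=0):
    rng = np.random.default_rng(seed)
    K = len(means)
    r_hat = np.zeros(K)                       # initial estimates r_hat_{0,i} = 0
    total_reward = 0.0
    for t in range(T):
        if rng.random() < epsilon:            # explore: uniform over all arms
            i = rng.integers(K)
        else:                                 # exploit: an empirically best arm
            i = int(np.argmax(r_hat))
        r = rng.normal(means[i], 1.0)         # observe reward r_t
        r_hat[i] = alpha * r + (1 - alpha) * r_hat[i]   # estimation step
        total_reward += r
    return r_hat, total_reward / T

estimates, avg_reward = robbins_monro_bandit([0.3, 0.5, 0.7])
print(estimates, avg_reward)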

Fig. 7.1 Average reward over time. (a) For fixed ε_t = 0.1, the step size is α ∈ {0.01, 0.1, 0.5}. (b) For fixed step size α = 0.1, the exploration rate varies in ε ∈ {0.0, 0.01, 0.1}

For fixed ε, we find that larger values of α tend to give better results eventually,
while smaller values have a better initial performance. This is a natural trade-off,
since large α appears to “learn” fast, but it also “forgets” quickly. That is, for a large
α the estimates mostly depend upon the last few rewards that have been observed.
Things are not so clear-cut for the choice of ε. We see that the choice of ε = 0
is significantly worse than ε = 0.1. That appears to suggest that there is an optimal
level of exploration. Ideally, we should use the decision-theoretic solution seen in
Chap. 6, but usually a good heuristic method for choosing ε will be good enough.

7.1.2 The Theory of the Approximation

An important question when working with approximation algorithms such as the
Robbins-Monro bandit algorithm of the previous section is whether the estimates used
converge to the correct values, so that the algorithm itself eventually converges to an
optimal policy. Note that as already mentioned we want to focus on asymptotic
convergence and are not interested in how much reward we obtain in the learning
phase.

This section briefly reviews some basic results of stochastic approximation theory.
Complete proofs can be found in [2]. We first consider the core problem of stochastic
approximation itself. In particular, we shall cast the approximation problem as a
minimization problem. That is, given some unknown parameter of the environment
ν ∈ Rn and an estimate νt ∈ Rn of ν at time t, the estimate is assumed to be chosen
such that for a certain function f : Rn → R the estimate νt minimizes f .
Then with respect to the learning algorithm the goal is that the sequence of values
(νt )t generated by the algorithm converges to some ν ∗ that is a local minimum, or a
stationary point for f . For strictly convex f , this would also be a global minimum.
In the following, we will examine algorithms that compute estimates νt according
to the update
νt+1 = νt + αt z t+1 , (7.1.1)

where αt ∈ R is a step-size and z t ∈ Rn a random direction. This process can be


shown to converge to a stationary point of f under certain assumptions. Sufficient
conditions include continuity and smoothness properties of f and the update direc-
tions z t . In particular, we shall assume the following about the function f that we
wish to minimize.
Assumption 7.1.1 Let h_t denote the history (ν_t, z_t, α_t, . . . , ν_0, z_0, α_0) up to
step t. Then the function f : R^n → R satisfies:
(i) f(x) ≥ 0 for all x ∈ R^n.
(ii) (Lipschitz derivative) f is continuously differentiable (i.e., the derivative ∇f
exists and is continuous) and ∃L > 0 such that

‖∇f(x) − ∇f(y)‖ ≤ L ‖x − y‖, ∀ x, y ∈ R^n.

(iii) (Pseudo-gradient) ∃c > 0 such that

c ‖∇f(ν_t)‖² ≤ −∇f(ν_t)^⊤ E(z_{t+1} | h_t), ∀ t.

(iv) ∃K_1, K_2 > 0 such that

E(‖z_{t+1}‖² | h_t) ≤ K_1 + K_2 ‖∇f(ν_t)‖², ∀ t.

The basic condition (ii) ensures that the function is well-behaved, so that gradient-
following methods will easily find the minimum. Condition (iii) combines two
assumptions in one. Firstly, the expected direction of the update always decreases

f , and secondly that the squared norm of the gradient is not too large relative to
the size of the update. Finally, condition (iv) ensures that the update is bounded in
expectation relative to the gradient. One can see how putting together the last two
conditions ensures that the expected direction of the update is correct and its norm
is bounded.

Theorem 7.1.1 Consider an algorithm with update (7.1.1) such that the step sizes
αt ≥ 0 satisfy
Σ_{t=0}^{∞} α_t = ∞,    Σ_{t=0}^{∞} α_t² < ∞.

Then under Assumption 7.1.1 the following holds with probability 1:


 
1. The sequence ( f(ν_t) )_t converges.
2. limt→∞ ∇ f (νt ) = 0.
3. Every limit point ν ∗ of νt satisfies ∇ f (ν ∗ ) = 0.

The conditions of Theorem 7.1.1 are not necessary. Alternative sufficient conditions
relying on contraction properties are for example discussed in detail in [2]. The
following example illustrates the impact of the choice of step size schedule (αt )t on
convergence.

Example 7.1 (Estimating the mean of a Gaussian distribution) Consider a sequence


of observations x_t sampled from a Gaussian distribution with mean 1/2 and variance 1,
that is, xt ∼ N (0.5, 1). We try to estimate the mean using updates according to
(7.1.1) with the update direction given by

z t+1 = xt+1 − νt .

Now consider three different step-size schedules: The first one, α_t = 1/t, satisfies
both assumptions of Theorem 7.1.1 on the step-size. The second one, α_t = 1/√t,
decreases too slowly, while the third one, α_t = t^(−3/2), approaches zero too fast.
Figure 7.2 demonstrates the convergence, or lack thereof, of the estimates ν_t to
the expected value. In fact, the schedule t^(−3/2) converges to a value quite far away
from the expected value, while the slow schedule 1/√t oscillates.
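The behaviour of the three schedules can be checked with a short simulation. The following sketch applies update (7.1.1) to samples from N(0.5, 1); the horizon and the random seed are arbitrary choices for this illustration.

import numpy as np

# Sketch of the Robbins-Monro update (7.1.1) for estimating the mean of
# N(0.5, 1) under the three step-size schedules of Example 7.1.
rng = np.random.default_rng(0)
schedules = {
    "1/t":       lambda t: 1.0 / t,
    "1/sqrt(t)": lambda t: 1.0 / np.sqrt(t),
    "t^(-3/2)":  lambda t: t ** (-1.5),
}
for name, alpha in schedules.items():
    nu = 0.0                                  # initial estimate nu_0
    for t in range(1, 1001):
        x = rng.normal(0.5, 1.0)              # observation x_t
        nu += alpha(t) * (x - nu)             # update with direction z_{t+1} = x_{t+1} - nu_t
    print(f"{name:>10}: final estimate {nu:.3f}")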

Fig. 7.2 Estimation of the expectation of x_t ∼ N(0.5, 1) using three step-size schedules: 1/t, 1/√t, and t^(−3/2)

7.2 Dynamic Problems

It is possible to extend the ideas outlined in the previous section to the dynamic
Markov decision process setting. We simply need to have a policy that is greedy with
respect to our estimates as well as a way to update our estimates so that they converge
to the actual MDP we are acting in. The additional challenge of the dynamic setting
is that our policy now affects which sequences of states we observe. That is, while
in the bandit problem we could freely select an arm to pull, sampling an arbitrary
state is not possible as easily (unless some simulation of the MDP is available).
Otherwise, the algorithmic structure remains basically the same and is described in
Algorithm 7.2.

Algorithm 7.2 Generic reinforcement learning algorithm


1: input state space S , action space A, reward set R ⊆ R, parameter set Θ, initial parameters
θ0 ∈ Θ, update-rule f : Θ × S² × A × R → Θ, policy π : S × Θ → Δ(A)
2: for t = 1, . . . , T do
3: Take action at ∼ π(· | θt , st ).
4: Observe reward rt+1 , state st+1 .
5: Update estimate θt+1 = f (θt , st , at , rt+1 , st+1 ).
6: end for

In general, the policy chooses an action at according to a distribution that depends


on the current state st and a parameter θt that e.g. could describe a posterior distri-
bution over MDPs, or a distribution over other parameters. Concerning the choice of
policy one can e.g. use the Bayes-optimal policy with respect to the parameter θ or
some heuristic policy.

Fig. 7.3 The chain task. The numbers 0.2, 0, 0, 0, 1 indicate the rewards of the five states, from left to right

Table 7.1 The chain task’s value function for γ = 0.95
s          s1       s2       s3       s4       s5
V*(s)      7.6324   7.8714   8.4490   9.2090   10.2090
Q*(s, 1)   7.4962   7.4060   7.5504   7.7404   8.7404
Q*(s, 2)   7.6324   7.8714   8.4490   9.2090   10.2090

Example 7.2 (The chain task) The chain task has two actions and five states, as
shown in Fig. 7.3. The starting state is the leftmost state, where the reward is 0.2.
The reward is 1.0 in the rightmost state, and zero otherwise. The first action (dashed,
blue) takes you to the right, while the second action (solid, red) takes you to the
leftmost state, both with probability 0.8. With a probability of 0.2 both actions have
the opposite effect. The value function of the chain task for a discount factor γ = 0.95
starting in the leftmost state is shown in Table 7.1.

The chain task is a very simple but well-known task used to test the efficacy
of reinforcement learning algorithms. In particular, it is useful for analysing how
algorithms solve the exploration-exploitation trade-off, since in the short run (i.e.,
small discount factor γ) simply staying in the leftmost state is advantageous. However
if the discount factor is sufficiently large, algorithms should be incentivized to more
fully explore the state space. A variant of this task with action-dependent rewards
can be found in [3].
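For reference, the chain task can be written down explicitly and solved by standard value iteration. The following sketch assumes that the reward depends on the current state only; under this convention the resulting values agree with Table 7.1 (up to the labelling of the two actions).

import numpy as np

# Sketch of the chain task (Example 7.2), assuming state-dependent rewards:
# one action moves right w.p. 0.8, the other returns to the leftmost state
# w.p. 0.8; with probability 0.2 each action has the opposite effect.
n, gamma = 5, 0.95
r = np.array([0.2, 0.0, 0.0, 0.0, 1.0])
P = np.zeros((2, n, n))                       # P[a, s, s']
for s in range(n):
    right = min(s + 1, n - 1)
    P[0, s, right] += 0.8; P[0, s, 0] += 0.2  # action 0: move right
    P[1, s, 0] += 0.8; P[1, s, right] += 0.2  # action 1: return to the left

V = np.zeros(n)
for _ in range(1000):                         # value iteration
    Q = r + gamma * P @ V                     # Q[a, s]
    V = Q.max(axis=0)
print(np.round(V, 4))                         # compare with Table 7.1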

7.2.1 Monte Carlo Policy Evaluation and Iteration

To make things as easy as possible, let us assume that we have a way to start the
environment from any arbitrary state. That would be the case if the environment had
a reset action, or if we were simply running an accurate simulation.
We shall begin with the simplest possible problem, that of estimating the expected
utility of each state for a specific policy. This can be performed with Monte Carlo
policy evaluation as shown in Algorithm 7.3. The idea is to estimate the value function
for every state by approximating the expectation with the sum of rewards obtained
over multiple trajectories starting from each state. Although this is very simple, Monte
Carlo policy evaluation is a very useful method that is employed by several more
complex algorithms as a subroutine. It can also be used in non-Markovian environments
such as partially observable or multi-agent settings.

Algorithm 7.3 Monte Carlo policy evaluation


input policy π, number of iterations K , horizon T
for s ∈ S do
for k = 1, . . . , K do
Initialize U (k) := 0.
Choose initial state s1 = s.
for t = 0, . . . , T do
Choose action at ∼ π(· | h t ) for history h t .
Observe reward rt and next state st+1 .
Set U (k) := U (k) + rt .
end for
end for
Calculate estimate v_K(s) = (1/K) Σ_{k=1}^{K} U^(k).
end for
return v K
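A minimal sketch of Algorithm 7.3 for the uniformly random policy on the chain task is given below; the number of rollouts K and the horizon T are arbitrary choices, and the reward is again assumed to depend on the current state only.

import numpy as np

# Sketch of Monte Carlo policy evaluation (Algorithm 7.3) for the uniformly
# random policy on the chain task: K rollouts of horizon T from every state.
rng = np.random.default_rng(0)
n = 5
r = np.array([0.2, 0.0, 0.0, 0.0, 1.0])

def step(s, a):
    right = min(s + 1, n - 1)
    if a == 0:                                # move right w.p. 0.8
        return right if rng.random() < 0.8 else 0
    return 0 if rng.random() < 0.8 else right # reset w.p. 0.8

def mc_evaluate(K=200, T=50):
    v = np.zeros(n)
    for s0 in range(n):
        returns = []
        for _ in range(K):
            s, U = s0, 0.0
            for _ in range(T):
                U += r[s]                     # accumulate the reward
                s = step(s, rng.integers(2))  # uniformly random action
            returns.append(U)
        v[s0] = np.mean(returns)              # estimate v_K(s0)
    return v

print(np.round(mc_evaluate(), 2))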

Remark 7.2.1 Using Hoeffding’s inequality (4.5.5), one can show for each state s
that the error of the estimate v_K(s) is bounded by √( ln(2|S|/δ) / (2K) ) with probability at
least 1 − δ/|S|. Hence, a union bound of the form P(A_1 ∪ A_2 ∪ . . . ∪ A_n) ≤ Σ_i P(A_i)
shows that with probability 1 − δ this error bound holds for all states.

Stochastic policy evaluation as shown in Algorithm 7.4 is a generalization of
Monte Carlo policy evaluation. Choosing the parameters α_k = 1/k and initial values
v(s) = 0, the two algorithms are the same.

Algorithm 7.4 Stochastic policy evaluation


1: input policy π, initial values v 0 , parameters αk , number of iterations K , horizon T
2: for s ∈ S do
3: for k = 1, . . . , K do
4: Initialize U^(k) := 0 and choose initial state s1 = s.
5: for t = 0, . . . , T do
6: Choose action at ∼ π(· | h t ) for history h t .
7: Observe reward rt and next state st+1 .
8: Set U (k) := U (k) + rt .
9: end for
10: Update estimate v k (s) = v k−1 (s) + αk (U (k) − v k−1 (s)).
11: end for
12: end for
13: return v K

7.2.2 Monte Carlo Updates

It is also possible to update the value of all encountered states after each visit. This
algorithm is called every-visit Monte Carlo, whose evaluation of a given state trajectory
s1, . . . , sT and thereby earned rewards r1, . . . , rT is shown in Algorithm 7.5. In
general, an estimate of the value function will be computed over several iterations as
in the algorithms of the previous section. The parameters α_k we had before are here
replaced by the values 1/n(s) that depend on the state and the respective number of
visits to it. However, the type of estimate computed by every-visit Monte Carlo can be

Algorithm 7.5 Every-visit Monte Carlo update


1: input initial values v, state visit counts n(s), trajectory s1 , . . . , sT , rewards r1 , . . . , r T
2: Initialize U := 0.
3: for t = 1, . . . , T do
4: U = U + rt .
5: n(st ) = n(st ) + 1
6: v(st) = v(st) + (1/n(st)) (Ut − v(st))
7: end for
8: return v

biased, as the update is going to be proportional to the number of steps spent in the
respective state. In order to avoid the bias, we can instead only perform the update
on each first visit to every state. This eliminates the dependence between states and
is called the first-visit Monte Carlo algorithm.

Algorithm 7.6 First-visit Monte Carlo update


1: input Initial values v, state visit counts n(s), trajectory s1 , . . . , sT , rewards r1 , . . . , r T
2: Initialize U := 0.
3: for t = 1, . . . , T do
4: U = U + rt
5: if st ∉ {s1 , . . . , st−1 } then
6: n(st) = n(st) + 1
7: v(st) = v(st) + (1/n(st)) (Ut − v(st))
8: end if
9: end for
10: return v

Figure 7.4 shows the difference between the two Monte Carlo evaluation methods
on the chain task for the optimal policy. In particular, it shows the L1 error between
the value function of the optimal policy and the respective Monte Carlo estimates.

Fig. 7.4 Error ‖v_t − V^π‖ as the number of iterations increases, for first-visit and every-visit Monte Carlo estimation

7.2.3 Temporal Difference Methods

The kind of update we have now seen in several algorithms is of the form

v(st ) = v(st ) + α(Ut − v(st )), (7.2.1)

where Ut is the utility sampled along some trajectory that was generated when fol-
lowing policy π. The difference between the observed utility and the assumed value
so far can be seen as an error term, which is used for correction towards the true
value.
The main idea of temporal difference learning methods is to replace the utility
term by the immediate reward(s) in the next step(s) and estimate the utility for the
subsequent steps by the current value of the next state.
That is, in the simplest case (now for technical reasons considering discounted
rewards) the error considered is the so-called temporal difference error

d_t = r_t + γ v(s_{t+1}) − v(s_t),    (7.2.2)

so that the update is


v(st ) = v(st ) + αdt .

It can be shown that the error term used in (7.2.1) can be written as a sum over the
temporal difference errors

Ut − v(st ) = γ −t d ,
≥t

provided that v is assumed to be fixed. Otherwise the sum at least provides an


approximation.

7.2.3.1 Temporal Difference Algorithm with Eligibility Traces

Instead of the estimate rt + γv(st+1 ) we used for the utility before we can more
generally consider the return over the next m steps starting in step t, that is,
Ut,m := rt + γrt+1 + . . . + γ m−1 rt+m−1 + γ m v(st+m ). The TD(λ) algorithm now
uses a mixture

U_t^λ := (1 − λ) Σ_{m≥1} λ^(m−1) U_{t,m}

of all these different estimates, where the parameter λ ∈ [0, 1] determines the impor-
tance of estimates with higher m similar to a discount factor and the factor (1 − λ) is
used for normalization. Note that for λ = 0 one obtains the estimate corresponding
to using the temporal difference error (7.2.2). If there is a terminal state then for
λ = 1 an update with respect to Utλ can be shown to correspond to a Monte Carlo
update. In general, doing the update according to (7.2.1) with respect to Utλ , one can
write the update using the temporal difference error as follows.

TD(λ) update


v(s_t) = v(s_t) + α Σ_{ℓ≥t} (γλ)^(ℓ−t) d_ℓ.

Unfortunately, as this update uses all future state and reward observations, it is
only possible to implement it offline. However, by considering a so-called backward
view, an online version, shown as Algorithm 7.7, can be obtained.

Algorithm 7.7 Online TD(λ)


1: input initial values v, step size α, trace parameter λ, trajectory s1 , . . . , sT , rewards r1 , . . . , rT
2: e = 0
3: for t = 1, . . . , T do
4: dt = rt + γ v(st+1) − v(st) // temporal difference error (7.2.2)
5: e(st) = e(st) + 1 // eligibility increase
6: for s ∈ S do
7: v(s) = v(s) + α e(s) dt // update eligible states
8: end for
9: e = λ e // eligibility discount
10: end for
11: return v

The main idea is to backpropagate changes in future states to previously encountered


states. Thereby we wish to modify older states less than more recent states. This is
accomplished by using the eligibility traces e. These are initialized to be 0 for all
states. Each visit increases the eligibility of a state by 1, and there is a general discount
of λ for all states in each step. Thus, the eligibility takes into account how often states
are visited and discards visits that are not so recent. Accordingly, the eligibility of
each state is used as a factor for the temporal difference error in the state updates.

Fig. 7.5 Eligibility traces, replacing and cumulative
Figure 7.5 is an illustration of how eligibility traces operate in the two differ-
ent formulations: replacing versus cumulative. While replacing traces start from 1
whenever a state is visited again, cumulative traces increase with every visit.
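The following sketch implements the backward view with accumulating (cumulative) traces. The 5-state random-walk environment, the trace decay γλ, and the parameter values are assumptions made for this illustration only.

import numpy as np

# Sketch of backward-view TD(lambda) with accumulating eligibility traces.
# The random-walk environment and the decay factor gamma*lambda are
# assumptions of this example.
rng = np.random.default_rng(0)
n, gamma, lam, alpha = 5, 0.95, 0.8, 0.05

def step(s):                                   # a fixed policy folded into the dynamics
    s_next = min(s + 1, n - 1) if rng.random() < 0.5 else max(s - 1, 0)
    reward = 1.0 if s_next == n - 1 else 0.0
    return reward, s_next

v = np.zeros(n)
for _ in range(500):                           # episodes
    e, s = np.zeros(n), 0                      # reset eligibility traces
    for _ in range(100):                       # steps per episode
        r_t, s_next = step(s)
        d = r_t + gamma * v[s_next] - v[s]     # temporal difference error (7.2.2)
        e[s] += 1.0                            # cumulative trace: increase on visit
        v += alpha * d * e                     # update all eligible states
        e *= gamma * lam                       # trace decay
        s = s_next
print(np.round(v, 2))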

7.2.4 Stochastic Value Iteration Methods

In the previous sections we have seen the idea of updating value information not
on all states but only on the currently observed one. This approach can not only
be applied to the setting when one wants to estimate the value of a given policy,
but also when we want to learn an optimal policy. For example, standard value
iteration as introduced in Sect. 6.5.4.1 performs a sweep over the complete state
space at each iteration. However, in principle one could perform the update over an
arbitrary sequence of states. These could be generated when following a particular
fixed policy or more generally a reinforcement learning algorithm. This leads to the
idea of simulation-based value iteration.
In order for this to work in general, the state sequences used usually must satisfy certain
technical requirements. One sufficient condition is for example that the policies that
generate the state sequences must be proper for episodic problems with a terminal
state. That is, they all reach a terminating state with probability 1. For discounted
non-episodic problems, this can be achieved by using a geometric distribution for a
restart in a state chosen uniformly at random. This ensures that all policies will be
proper. Alternatively, one could also select starting states with an arbitrary schedule,
as long as it is guaranteed that all states are visited infinitely often in the limit.

7.2.4.1 Simulation-Based Value Iteration

Before considering the general case when the underlying MDP model μ is unknown,
we start with the obviously simpler case where we can obtain data of the MDP from
simulation. That is, for each state-action pair (s, a), we can request independent sam-
ples from the respective transition probability distribution pμ (·|s, a). Algorithm 7.8
shows a generic simulation-based value iteration algorithm. In this algorithm, at
every step there is a probability ε that the agent moves to a random state. This can be
seen as restarting the MDP from a uniform starting state distribution Unif(S).

Algorithm 7.8 Simulation-based value iteration


1: input μ-simulator
2: Initialize s1 ∈ S , v 0 ∈ V .
3: for t = 1, 2, . . . , T do 
4: πt(st) = arg max_a [ rμ(st) + γ Σ_{s'∈S} pμ(s' | st, a) v_{t−1}(s') ]
5: vt(st) = rμ(st) + γ Σ_{s'∈S} pμ(s' | st, πt(st)) v_{t−1}(s')
6: st+1 ∼ (1 − ε) · pμ(· | st, πt(st)) + ε · Unif(S)
7: end for
8: return πT , VT

Figure 7.6 shows the error in value function estimation in the chain task
(Example 7.2) when using simulation-based value iteration. In general, it is advis-
able to use an initial value v 0 that is an upper bound on the optimal value function,
if such a value is known. This is due to the fact that in that case convergence of
simulation-based value iteration is always guaranteed, as long as the policy that we
are using is proper.1 This is confirmed by the results. In general, the estimation error
is highly dependent upon the initial value function estimate v_0 and the exploration
parameter ε. It is interesting to see that uniform sweeps (i.e., ε = 1) result in the lowest
estimation error.

7.2.4.2 Q-Learning

Simulation-based value iteration can be suitably modified for the general reinforce-
ment learning setting when neither model nor simulator for the underlying environ-
ment are available. Here instead of relying on a model of the environment, we replace
arbitrary random sweeps of the state-space with the actual state sequence observed
in the real environment. We also use this sequence as a simple way to estimate the
transition probabilities. The replacement of the true MDP model by an estimate is a
natural idea which leads to a whole family of stochastic value iteration algorithms.
The most important of these is Q-learning, which uses a trivial empirical MDP model.

1 As mentioned, in the case of discounted non-episodic problems, this amounts to a geometric


stopping time distribution, after which the state is drawn from the initial state distribution.

Fig. 7.6 Simulation-based value iteration with (a) pessimistic initial estimates (v_0 = 0) and (b) optimistic initial estimates (v_0 = 20 = 1/(1 − γ)) for varying ε. Errors indicate ‖v_n − V*‖_1

Q-learning, shown as Algorithm 7.9, is not only one of the most well-known but also
one of the simplest algorithms in reinforcement learning.

Algorithm 7.9 Q-learning


1: input exploration parameters εt, step sizes αt
2: Initialize s1 ∈ S, q1 ∈ Q, v0 ∈ V.
3: for t = 1, 2, . . . , T do
4: vt(s) = maxa∈A qt(s, a) // update value function
5: π̂t*(s) = arg maxa∈A qt(s, a) // compute greedy policy wrt qt
6: Choose at ∼ π̂*_{εt}(st). // play εt-greedy policy
7: Observe st+1 ∼ pμ(st+1 | st, at).
8: qt+1(st, at) = (1 − αt) qt(st, at) + αt [rμ(st) + γ vt(st+1)]
9: end for
10: return πT, vT
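A minimal sketch of tabular Q-learning with ε-greedy exploration on the chain task is given below. The constant step size and exploration rate are assumptions of this illustration, so the values of Table 7.1 are only recovered approximately.

import numpy as np

# Sketch of tabular Q-learning (Algorithm 7.9) with an epsilon-greedy policy
# on the chain task, assuming state-dependent rewards and constant alpha/eps.
rng = np.random.default_rng(0)
n, gamma, alpha, eps = 5, 0.95, 0.1, 0.1
r = np.array([0.2, 0.0, 0.0, 0.0, 1.0])

def step(s, a):
    right = min(s + 1, n - 1)
    if a == 0:                                 # move right w.p. 0.8
        return right if rng.random() < 0.8 else 0
    return 0 if rng.random() < 0.8 else right  # reset w.p. 0.8

q = np.zeros((n, 2))
s = 0
for t in range(200_000):
    a = rng.integers(2) if rng.random() < eps else int(np.argmax(q[s]))
    s_next = step(s, a)
    q[s, a] += alpha * (r[s] + gamma * q[s_next].max() - q[s, a])
    s = s_next
print(np.round(q.max(axis=1), 2))              # roughly the V* row of Table 7.1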

It is straightforward to verify that Q-learning is just a version of simulation-based


value iteration, where at each step t, given the partial observation (st , at , st+1 ) the
transition model of the underlying MDP is approximated by

p̂(s' | st, at) = 1 if s' = st+1, and p̂(s' | st, at) = 0 if s' ≠ st+1.

In addition, since we cannot arbitrarily select states in the real environment, we
replace the state-exploring parameter ε with a time-dependent exploration parameter
ε_t for the policy we employ on the real environment. Even though this model is very
simplistic, it still works relatively well in practice.
Figure 7.7 shows the performance of the basic Q-learning algorithm for the chain
task (Example 7.2) in terms of value function error and regret. In this particular
implementation, we used a polynomially decreasing exploration parameter ε_t and
step size α_t. Both of these depend on the number of visits to a particular state, which
leads to a more efficient performance.

Fig. 7.7 Q-learning with v_0 = 1/(1 − γ), ε_t = 1/n(s_t), α_t = α n(s_t)^(−2/3) for state-visit counts n(s). (a) Error. (b) Regret
Of course, one could get any algorithm in between pure Q-learning and pure
stochastic value iteration. In fact, variants of the Q-learning algorithm using eligi-
bility traces (see Sect. 7.2.3.1) can be formulated in the same way.

7.2.4.3 Generalized Stochastic Value Iteration

Finally, we present an algorithm which can be considered as a generalization of
the algorithms we have seen. In particular, generalized stochastic value iteration,
shown as Algorithm 7.10, is an online reinforcement learning algorithm that includes

simulation-based value iteration and Q-learning as special cases. As in Q-learning and
other algorithms we have seen before, there are parameters ε_t for the amount of
exploration as well as step size parameters α_t. In addition, the algorithm uses
state-action distributions σt and an initial MDP estimator μ̂0 . The MDP estimators
μ̂ used and updated by the algorithm are supposed to include both an estimate of the
transition probabilities pμ̂(s' | s, a) and the expected reward2 rμ̂(s).

Algorithm 7.10 Generalized stochastic value iteration


1: input exploration parameters εt, step sizes αt, state-action distributions σt, initial MDP estimate μ̂0
2: Initialize s1 ∈ S , q 1 ∈ Q, v 0 ∈ V .
3: for t = 1, 2, . . . , T do
4: v t (s) = maxa∈A q t (s, a)
5: π̂t∗ (s) = arg maxa∈A q t (s, a)
6: Choose at ∼ π̂∗t (st ).
7: Observe st+1 , rt+1 .
8: Update MDP estimate μ̂t = μ̂t−1 | st , at , st+1 , rt+1 .
9: for s ∈ S , a ∈ A do
10: With probability σt(s, a) update

qt+1(s, a) = (1 − αt) qt(s, a) + αt [ rμ̂t(s) + γ Σ_{s'∈S} pμ̂t(s' | s, a) vt(s') ],

11: otherwise set qt+1(s, a) = qt(s, a).


12: end for
13: end for
14: return πT , VT

It is instructive to examine special cases for these parameters. When σt = 1,
αt = 1, and μ̂t = μ, we obtain standard value iteration. Q-learning is obtained when
setting σt(s, a) = I{st = s, at = a} and

pμ̂t(st+1 = s' | st = s, at = a) = I{st+1 = s'}.

Finally, for σt(s, a) = et(s, a) we obtain a stochastic eligibility-trace Q-learning
algorithm similar to TD(λ).

2More generally, rewards could of course depend not only on state, but the combination of state
and action.

7.3 Discussion

Most of the algorithms we have seen in this chapter are quite simple and thus clearly
demonstrate the principle of learning by reinforcement. However, they do not aim to
solve the reinforcement learning problem in an optimal way. They have been mostly
used for finding near-optimal policies given access to samples from a simulator, for
example in the case of learning to play Atari games [4]. However, even in this case,
a crucial issue is how much data is needed to be able to approach optimal behavior.
Convergence. Even though it is quite simple, Q-learning can be shown to converge
to an optimal policy in various settings. Tsitsiklis [5] has provided an asymptotic
proof based on stochastic approximation theory with less restrictive assumptions
than the original paper of [6]. Later [7] proved finite sample convergence results
under strong mixing assumptions on the MDP. Moreover, for modifications such as
delayed Q-learning [8] stronger results on the sample complexity can be shown.
That is, Õ(|S||A|) samples are sufficient to determine an ε-optimal policy with high
probability.
Exploration. Another extension of Q-learning is using a population of value function
estimates using random initial values and weighted bootstrapping. This was first
introduced by [9, 10] and evaluated on bandit tasks. Recently, this idea has also been
exploited in the context of deep neural networks by [11] for representations of value
functions in the setting of full reinforcement learning. We will examine this more
closely in Chap. 8.
Bootstrapping and subsampling use a single set of empirical data to obtain an
empirical measure of uncertainty about statistics of the data. We wish to do the same
thing for value functions, based on data from one or more trajectories. Informally, this
variant maintains a collection of Q-value estimates, each one of which is trained on
different segments3 of the data, with possible overlaps. In order to achieve efficient
exploration, a random Q-estimate is selected at every episode or every few steps.
This results in a bootstrap analogue of Thompson sampling. Figure 7.8 shows the use
of weighted bootstrap estimates for the Double Chain problem introduced by [3].

7.4 Exercises

Exercise 7.1 According to Theorem 6.5.1 the value function of a policy π for an
MDP μ with state reward function r can be written as the solution of the linear
equation v^π_μ = (I − γ P^π_μ)^(−1) r, where the term Φ^π_μ ≜ (I − γ P^π_μ)^(−1) can be seen as a
feature matrix. However, simulation-based algorithms like Sarsa only approximate
the value function directly rather than Φμπ . This means that if the reward function
changes, they have to be restarted from scratch. Is there a way to improve on this?4

3 If not dealing with bandit problems, it is important to do this with trajectories.


4 This exercise stems from a discussion with Peter Auer in 2012 about this problem.

Fig. 7.8 Cumulative regret of weighted Bootstrap Q-learning for various values of bootstrap replicates (1 is equivalent to plain Q-learning). Generally speaking, an increased amount of replicates leads to improved exploration performance

1. Develop and test a simulation-based algorithm (such as Sarsa) for estimating Φμπ ,
and prove its asymptotic convergence. Hint: Focus on the fact that you’d like to
estimate a value function for all possible reward functions.
2. Consider a model-based approach building an empirical transition kernel Pμ̂π .
How good are your value function estimates in the first versus the second
approach? Why would you expect either one to be better?
3. Can the same idea be extended to Q-learning?

References

1. Robbins, H., Monro, S.: A stochastic approximation method. Ann. Math. Stat. 22(3), 400–407
(1951)
2. Bertsekas, D.P., Tsitsiklis, J.N.: Neuro-Dynamic Programming. Athena Scientific (1996)
3. Dearden, R., Friedman, N., Russell, S.J.: Bayesian Q-learning. In: Proceedings of the Fifteenth
National Conference on Artificial Intelligence and Tenth Innovative Applications of Artificial
Intelligence Conference, AAAI 98, IAAI98, pp. 761–768. AAAI Press, The MIT Press (1998)
4. Mnih, V., Kavukcuoglu, K., Silver, D., et al.: Human-level control through deep reinforcement
learning. Nature 518(7540), 529–533 (2015)
5. Tsitsiklis, J.N.: Asynchronous stochastic approximation and Q-learning. Mach. Learn. 16(3),
185–202 (1994)
6. Watkins, C.J.C.H, Dayan, P: Q-learning. Mach. Learn. 8(3–4), 279–292 (1992)
7. Kearns, M., Singh, S.: Finite sample convergence rates for Q-learning and indirect algorithms.
In: Advances in Neural Information Processing Systems, vol. 11, pp. 996–1002. The MIT Press
(1999)
8. Strehl, A.L., Li, L., Wiewiora, E., Langford, J., Littman, M.L.: Pac model-free reinforcement
learning. In: Machine Learning, Proceedings of the 23rd International Conference, ICML 2006,
pp. 881–888. ACM (2006)

9. Dimitrakakis, C.: Nearly optimal exploration-exploitation decision thresholds. In: Artificial


Neural Networks–ICANN 2006, 16th International Conference, Proceedings, Part I
10. Dimitrakakis, C.: Ensembles for Sequence Learning. Ph.D. thesis, École Polytechnique
Fédérale de Lausanne (2006)
11. Osband, I., Blundell, C., Pritzel, A., Van Roy, B.: Deep exploration via bootstrapped DQN. In:
Advances in Neural Information Processing Systems, vol. 29, pp. 4026–4034 (2016)
Chapter 8
Approximate Representations

8.1 Introduction

In this chapter, we consider methods for approximating value functions, policies,
or transition kernels. This is in particular useful when the state or policy space is
large, so that one has to use some parametrization that may not include the true
value function, policy, or transition kernel. In general, we shall assume the existence
of either some approximate value function space V or some approximate policy
space Π, which are the sets of allowed value functions and policies, respectively. For
the purposes of this chapter, we will assume that we have access to some simulator or
approximate model of the transition probabilities, wherever necessary. Model-based
reinforcement learning where the transition probabilities are explicitly estimated will
be examined in the next two chapters.
As an introduction, let us start with the case where we have a value function
space V and some value function u ∈ V that is our best approximation of the optimal
value function. Then we can define the greedy policy with respect to u as follows:
Definition 8.1.1 (u-greedy policy and value function)

π*_u ∈ arg max_{π∈Π} L_π u,    v*_u = L u,

where the policies in Π map states to action distributions.


Although the greedy policies need not be stochastic, here we are explicitly consid-
ering stochastic policies, because this sometimes facilitates finding a good approxi-
mation. If u is the optimal value function V ∗ , then the greedy policy is going to be
optimal.
More generally, when we are trying to approximate a value function, we usually
are constrained to look for it in a parametrized set of value functions V_Θ, where Θ is
the parameter space. Hence, it might be the case that the optimal value function may
not lie within V_Θ. Similarly, the policies that we can use lie in a space Π_Θ, which

may not include the greedy policy itself. This is usually because it is not possible to
represent all possible value functions and policies in complex problems.
Usually, we are not aiming at a uniformly good approximation to a value function
or policy. Instead, we define φ, a distribution on S, which specifies on which parts
of the state space we want to have a good approximation by placing higher weight
on the most important states. Frequently, φ only has a finite support, meaning that
we only measure the approximation error over a finite set of representative states
Ŝ ⊆ S. In the sequel, we shall always define the quality of an approximate value or
policy with respect to φ.
In the remainder of this chapter, we shall examine a number of approximate
dynamic programming algorithms. What all of these algorithms have in common
is the requirement to calculate an approximate value function or policy. The next
two sections give an overview of the basic problem of fitting an approximate value
function or policy to a target.

8.1.1 Fitting a Value Function

Let us begin by considering the problem of finding the value function v θ ∈ V that
best matches a target value function u that is not necessarily in V . This can be done
by minimizing the difference between the target value u and the approximation vθ ,
that is,

‖v_θ − u‖_φ = ∫_S |v_θ(s) − u(s)| dφ(s)

with respect to some measure φ on S. If u = V ∗ , i.e., the optimal value function,


then we end up getting the best possible value function with respect to the
distribution φ. We can formalize the idea for fitting an approximate value function
to a target as follows.

Approximate value function problem

Given V_Θ = {v_θ | θ ∈ Θ},

find θ* ∈ arg min_{θ∈Θ} ‖v_θ − u‖_φ,

where ‖ · ‖_φ ≜ ∫_S | · | dφ.

Unfortunately, this minimization problem can be difficult to solve in general. A


particularly simple case is when the set of approximate functions is small enough
for the minimization to be performed via enumeration.
Example 8.1 (Fitting a finite number of value functions) Consider a finite space of
value functions V = {v 1 , v 2 , v 3 }, which we wish to fit to a target value function u.
Fig. 8.1 Fitting a value function in V = {v1, v2, v3} to a target value function u, over a finite number of points: (a) the target function and the three candidates, (b) the errors at the chosen points, (c) the total error of each candidate. While none of the three candidates is a perfect fit, we clearly see that v1 has the lowest cumulative error over the measured set of points

In this particular scenario, v1(x) = sin(0.1x), v2(x) = sin(0.5x), v3(x) = sin(x),
while

u(x) = 0.5 sin(0.1x) + 0.3 sin(0.5x) + 0.1 sin(x) + 0.1 sin(10x).

Clearly, none of the given functions is a perfect fit. In addition, finding the best overall
fit requires minimizing an integral. So, for this problem we choose a random set of
points X = {xt } on which to evaluate the fit, with φ(xt ) = 1 for every point xt ∈ X .
This is illustrated in Fig. 8.1, which shows the error of the functions at the selected
points, as well as their cumulative error.

In the example above, the approximation space V does not have a member that is
sufficiently close to the target value function. It could be that a larger function space
contains a better approximation. However, it may be difficult to find the best fit in an
arbitrary set V.

8.1.2 Fitting a Policy

The problem of fitting a policy is not significantly different from that of fitting a value
function, especially when the action space is continuous. Once more, we define an
appropriate normed vector space so that it makes sense to talk about the normed
difference between two policies π, π′ with respect to some measure φ on the states,
more precisely defined as

‖π − π′‖_φ = ∫_S ‖π(· | s) − π′(· | s)‖ dφ(s),

where the norm within the integral is usually the L1 norm. For a finite action space,
this corresponds to ‖π(· | s) − π′(· | s)‖ = Σ_{a∈A} |π(a | s) − π′(a | s)|,
but certainly other norms may be used and are sometimes more convenient. The
optimization problem corresponding to fitting an approximate policy from a set of
policies Π_Θ to a target policy π is shown below.

The policy approximation problem

Given Π_Θ = {π_θ | θ ∈ Θ},

find θ* ∈ arg min_{θ∈Θ} ‖π_θ − π*_u‖_φ,

where π*_u = arg max_{π∈Π} L_π u.

Once more, the minimization problem may not be trivial, but there are some cases
where it is particularly easy. One of these is when the policies can be efficiently
enumerated, as in the example below.
Example 8.2 (Fitting a finite space of policies) For simplicity, consider the space of
deterministic policies with a binary action space A = {0, 1}. Then each policy can be
represented as a simple mapping π : S → {0, 1}, corresponding to a binary partition
of the state space. In this example, the state space is the 2-dimensional unit cube,
S = [0, 1]2 . Figure 8.2 shows an example policy, where the light red and light green
areas represent taking action 1 and 0, respectively. The measure φ has support only
on the crosses and circles, which indicate the action taken at that location. Consider
a policy space Π consisting of just four policies. Each set of two policies is indicated
by the magenta (dashed) and blue (dotted) lines in Fig. 8.2. Each line corresponds to
two possible policies, one selecting action 1 in the high region, and the other selecting
action 0 instead. In terms of our error metric, the best policy is the one that makes
the fewest mistakes. Consequently, the best policy in this set is the one that uses the
blue line and plays action 1 (red) in the top-right region.
Fig. 8.2 An example policy. The red areas indicate taking action 1, and the green areas action 0. The φ measure has finite support, indicated by the crosses and circles. The blue and magenta lines indicate two possible policies that separate the state space with a hyperplane

8.1.3 Features

Frequently, when dealing with large, or complicated spaces, it pays off to project
(observed) states and actions onto a feature space X . In that way, we can make
problems much more manageable. Generally speaking, a feature mapping is defined
as follows.

Feature mapping

For X ⊂ R^n, a feature mapping f : S × A → X can be written in vector form as

f(s, a) = ( f1(s, a), . . . , fn(s, a) )^⊤.

Obviously, one can define feature mappings f : S → X for states only in a


similar manner.

What sort of functions should we use? A common idea is to use a set of smooth
symmetric functions, such as usual radial basis functions.
Example 8.3 (Radial Basis Functions) Let d be a metric on S × A and define
a set of characteristic state-action pairs {(si, ai) | i = 1, . . . , n}. These can act as
centers for a set of radial basis functions, defined as follows:

fi(s, a) ≜ exp {−d[(s, a), (si, ai)]}

These functions are sometimes called kernels. A one-dimensional example of Gaussian radial basis functions is shown in Fig. 8.3.
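A minimal sketch of such a feature mapping on a one-dimensional state space is given below; the centres, the squared-distance metric and the bandwidth are assumptions made for this illustration.

import numpy as np

# Sketch of a radial basis feature mapping on a one-dimensional state space.
# The centres, the squared-distance metric and the bandwidth are hypothetical.
centres = np.linspace(0.0, 1.0, 5)            # characteristic states s_i
bandwidth = 0.1

def features(s):
    # f_i(s) = exp(-d(s, s_i)), here with d the squared distance scaled by the bandwidth
    return np.exp(-((s - centres) ** 2) / bandwidth)

print(np.round(features(0.3), 3))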
Fig. 8.3 Radial basis functions

Another common type of feature is given by binary functions. These effectively
discretize a continuous space through either a cover or a partition.

Definition 8.1.2 A collection of sets G is a cover of X iff ∪_{S∈G} S ⊃ X.

Definition 8.1.3 A collection of sets G is a partition of X iff
1. If S ≠ R ∈ G, then S ∩ R = ∅.
2. ∪_{S∈G} S = X.

In reinforcement learning, feature functions corresponding to partitions are usu-


ally referred to as tilings.
Example 8.4 (Tilings) Let G = {X 1 , . . . , X n } be a partition of S × A. Then the f i
can be defined by

fi(s, a) ≜ I {(s, a) ∈ Xi} .

Multiple tilings create a cover and can be used without many difficulties with most
discrete reinforcement learning algorithms, cf. Sutton and Barto [1].

8.1.4 Estimation Building Blocks

Now that we have looked at the basic problems in approximate regimes, let us look
at some methods for obtaining useful approximations. First of all, we introduce
some basic concepts such as look-ahead and rollout policies for estimating value
functions. Then we formulate value function approximation and policy estimation as
an optimization problem. These are going to be used in the remaining sections. For
example, Sect. 8.2 introduces the well known approximate policy iteration algorithm,
which combines those two steps into approximate policy evaluation and approximate
policy improvement.

8.1.4.1 Look-ahead Policies

Given an approximate value function u, the transition model Pμ of the MDP


and the expected rewards rμ , we can always find the improving policy given in
Definition 8.1.1 via the following single-step look-ahead.

Single-step look-ahead

Let q(i, a) ≜ rμ(i, a) + γ Σ_{j∈S} Pμ(j | i, a) u(j). Then the single-step look-ahead policy is defined as

π_q(a | i) > 0 iff a ∈ arg max_{a'∈A} q(i, a').

(Note that there may be more than one maximizing action.)

We are however not necessarily limited to the first-step. By looking T steps


forward into the future we can improve both our value function and policy estimates.

T -step look-ahead

Define u_k recursively as:

u_0 = u,
q_k(i, a) = rμ(i, a) + γ Σ_{j∈S} Pμ(j | i, a) u_{k−1}(j),
u_k(i) = max {q_k(i, a) | a ∈ A}.

Then the T-step look-ahead policy is defined by

π_{q_T}(a | i) > 0 iff a ∈ arg max_{a'∈A} q_T(i, a').

In fact, taking u = 0, this recursion is identical to solving the k-horizon problem
and in the limit we obtain a solution to the original problem. In the general case, our
value function estimation error is bounded by γ^k ‖u − V*‖.
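The recursion can be implemented directly when a model is available. The following sketch uses a small randomly generated MDP, which is purely hypothetical, together with the trivial initial estimate u = 0.

import numpy as np

# Sketch of T-step look-ahead given an explicit model (P, r) and an
# approximate value function u.  The small random MDP is hypothetical.
rng = np.random.default_rng(0)
n_s, n_a, gamma, T = 4, 2, 0.9, 3
P = rng.dirichlet(np.ones(n_s), size=(n_a, n_s))   # P[a, i, j]
r = rng.random((n_s, n_a))                         # r(i, a)
u = np.zeros(n_s)                                  # u_0 = u

for k in range(T):
    q = r + gamma * np.einsum('aij,j->ia', P, u)   # q_k(i, a)
    u = q.max(axis=1)                              # u_k(i)
pi = q.argmax(axis=1)                              # T-step look-ahead policy
print(pi, np.round(u, 3))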

8.1.4.2 Rollout Policies

As we have seen in Sect. 7.2.2 one way to obtain an approximate value function of
an arbitrary policy π is to use Monte Carlo estimation, that is, to simulate several

sequences of state-action-reward tuples by running the policy on the MDP. More


specifically, we have the following rollout estimate.

Rollout estimate of the q-factor

In particular, from each state i, we take K i rollouts to estimate:

q(i, a) = (1/K_i) Σ_{k=1}^{K_i} Σ_{t=0}^{T_k−1} r(s_{t,k}, a_{t,k}),    (8.1.1)

where s_{t,k}, a_{t,k} ∼ P^π_μ(· | s_0 = i, a_0 = a), and T_k ∼ G(1 − γ).

This results in a set of samples of q-factors. The next problem is to find a parametric
policy πθ that approximates the greedy policy with respect to our samples, πq∗ . For
a finite number of actions, this can be seen as a classification problem [2]. For
continuous actions, it becomes a regression problem. As indicated before we define
a distribution φ on a set of representative states Ŝ over which we wish to perform
the minimization.

Rollout policy estimation

Given some distribution φ on a set Ŝ of representative states and a set of samples
q(i, a), giving us the greedy policy

π*_q(a | i) > 0 iff a ∈ arg max_{a'∈A} q(i, a'),

and a parametrized policy space {π_θ | θ ∈ Θ}, the goal is to determine

min_{θ∈Θ} ‖π_θ − π*_q‖_φ.

8.1.5 The Value Estimation Step

We can now attempt to fit a parametric approximation to a given state or state-action


value function. This is often better than simply maintaining a set of rollout estimates
from individual states (or state-action pairs), as it might enable us to generalize
over the complete state space. A simple parametrization for the value function is
to use a generalized linear model on a set of features. Then the value function is a
linear function of the features with parameters θ. More precisely, we can define the
following model for the case where we have a feature mapping on states.

Generalized linear model using state features (or kernel)

Given a feature mapping f : S → Rn and parameters θ ∈ Rn , compute the


approximation

v_θ(s) = Σ_{i=1}^{n} θ_i f_i(s).

Choosing the model representation is only the first step. We now have to use it to
represent a specific value function. In order to do this, as before we first pick a set
of representative states Ŝ to fit our value function v θ to v. This type of estimation
can be seen as a regression problem, where the observations are value function
measurements at different states.

Fitting a value function to a target

Let φ be a distribution over representative states Ŝ. For some constants κ, p > 0,
we define the weighted prediction error per state as

c_s(θ) = φ(s) ‖v_θ(s) − u(s)‖^κ_p,

where the total prediction error is c(θ) = Σ_{s∈Ŝ} c_s(θ).
The goal is to find a θ minimizing c(θ).

Minimizing this error can be done using gradient descent, which is a general
algorithm for finding local minima of smooth cost functions. Generally, minimizing a
real-valued cost function c(θ) with gradient descent involves an algorithm iteratively
approximating the value minimizing c:

θ^(n+1) = θ^(n) − α_n ∇c(θ^(n))

Under certain conditions1 on the step-size parameter α_n, lim_{n→∞} c(θ^(n)) = min_θ c(θ).

Example 8.5 (Gradient descent for p = 2, κ = 2) For p = 2, κ = 2 the square root
and the exponent κ cancel out and we obtain

∇_θ c_s = φ(s) ∇_θ [v_θ(s) − u(s)]² = 2 φ(s) [v_θ(s) − u(s)] ∇_θ v_θ(s),

where ∇_θ v_θ(s) = f(s). Taking partial derivatives ∂/∂θ_j leads to the update rule

θ_j^(n+1) = θ_j^(n) − 2 α φ(s) [v_{θ^(n)}(s) − u(s)] f_j(s).

1 See also Sect. 7.1.1.
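The update of Example 8.5 can be implemented directly. The sketch below fits a linear value function with radial basis features to a hypothetical target u on a set of representative states; the features, target, weights φ and step size are assumptions made for this illustration.

import numpy as np

# Sketch of fitting a linear value function v_theta(s) = theta . f(s) to a
# target u on representative states by gradient descent (p = 2, kappa = 2).
S_hat = np.linspace(0.0, 1.0, 20)                  # representative states
u = np.sin(3 * S_hat)                              # hypothetical target value function
phi = np.full(len(S_hat), 1.0 / len(S_hat))        # state distribution phi
centres = np.linspace(0.0, 1.0, 6)
F = np.exp(-((S_hat[:, None] - centres[None, :]) ** 2) / 0.05)   # row = f(s)

theta, alpha = np.zeros(len(centres)), 0.1
for _ in range(5000):
    v = F @ theta                                  # v_theta at all states
    theta -= alpha * 2 * F.T @ (phi * (v - u))     # gradient step on c(theta)
print("weighted squared error:", np.round(np.sum(phi * (F @ theta - u) ** 2), 4))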

However, the value function is not necessarily self-consistent, meaning that we
do not have the identity v_θ(s) = r(s) + γ ∫_S v_θ(s') dP(s' | s, a). For that reason, we
can instead choose a parameter that tries to make the parametrized value function
self-consistent by minimizing the Bellman error.

Minimizing the Bellman error

For a given φ and P̂, the goal is to find θ minimizing

b(θ) ≜ ‖ r(s, a) + γ ∫_S v_θ(s') dP̂(s' | s, a) − v_θ(s) ‖_φ.    (8.1.2)

Here P̂ is not necessarily the true transition kernel. It can be a model or an empir-
ical approximation (in which case the integral would only be over the empirical
support). The summation itself is performed with respect to the measure φ.

In this chapter, we will look at two methods for approximately minimizing the
Bellman error. The first, least-squares policy iteration, is a batch algorithm for
approximate policy iteration and finds the least-squares solution to the problem using the
empirical transition kernel. The second is a gradient-based method, which is flexible
enough to use either an explicit model of the MDP or the empirical transition kernel.
It is also possible to simultaneously approximate some value function u and
minimize the Bellman error by considering the minimization problem

min_θ c(θ) + λ b(θ),

whereby the Bellman error acts as a regularizer ensuring that our approximation is
indeed as consistent as possible.
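As an illustration, the following sketch minimizes a squared, φ-weighted version of the Bellman error (8.1.2) for a linear value function under a fixed policy, so that rewards and kernel depend on the state only. The features, rewards and kernel P̂ are hypothetical, and this direct weighted least-squares fit is not the least-squares policy iteration algorithm itself.

import numpy as np

# Sketch of minimizing a squared, phi-weighted Bellman error for a linear
# value function v_theta = F theta under a fixed policy.  All quantities
# (features, rewards, kernel) are hypothetical.
rng = np.random.default_rng(0)
n_s, gamma = 6, 0.9
F = rng.random((n_s, 3))                           # state features
P_hat = rng.dirichlet(np.ones(n_s), size=n_s)      # P_hat[s, s']
r = rng.random(n_s)
phi = np.full(n_s, 1.0 / n_s)

# Bellman residual: r + gamma * P_hat @ F @ theta - F @ theta = r - A @ theta
A = (np.eye(n_s) - gamma * P_hat) @ F
W = np.diag(phi)
theta = np.linalg.solve(A.T @ W @ A, A.T @ W @ r)  # weighted least squares
print("phi-weighted residual:", np.round(np.sum(phi * np.abs(r - A @ theta)), 4))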

8.1.6 Policy Estimation

A natural parametrization for policies is to use a generalized linear model on a set


of features. Then a policy can be described as a linear function of the features with
parameters θ, together with an appropriate link function. More precisely, we can
define the following model.

Generalized linear model using features (or kernel)

Given a feature mapping f : S × A → R^n, parameters θ ∈ R^n, and a link function ℓ : R → R_+, the parametrized policy π_θ is defined as

π_θ(a | s) = g_θ(s, a) / h_θ(s),

where g_θ(s, a) = ℓ( Σ_{i=1}^{n} θ_i f_i(s, a) ) and h_θ(s) = Σ_{b∈A} g_θ(s, b).

The link function ℓ ensures that the denominator is positive, and the policy is
a distribution over actions. An alternative method would be to directly constrain
the policy parameters so the result is always a distribution, but this would result
in a constrained optimization problem. A typical choice for the link function is
ℓ(x) = exp(x), which results in the softmax family of policies.
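A minimal sketch of such a softmax policy is given below; the particular feature mapping (a one-hot action block scaled by a scalar state) and the parameter values are hypothetical.

import numpy as np

# Sketch of the generalized linear (softmax) policy with link ell(x) = exp(x).
# The feature mapping and parameters are hypothetical.
def features(s, a, n_actions=3):
    f = np.zeros(n_actions)
    f[a] = s                                       # one-hot action block scaled by the state
    return f

def pi_theta(theta, s, n_actions=3):
    g = np.array([np.exp(theta @ features(s, a, n_actions))
                  for a in range(n_actions)])      # g_theta(s, a)
    return g / g.sum()                             # pi_theta(a | s) = g / h

theta = np.array([0.5, -0.2, 1.0])
print(np.round(pi_theta(theta, 2.0), 3))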
In order to fit a policy, we first pick a set of representative states Ŝ and then find a
policy πθ that approximates a target policy π, which is typically the greedy policy with
respect to some value function. In order to do so, we can define an appropriate cost
function and then estimate the optimal parameters via some arbitrary optimization
method.

Fitting a policy through a cost function

Given a target policy π and a cost function

c_s(θ) = φ(s) ‖π_θ(· | s) − π(· | s)‖^κ_p,

the goal is to find the parameter θ minimizing c(θ) = Σ_{s∈Ŝ} c_s(θ).

Once more, we can use gradient descent to minimize the cost function. We
obtain different results for different norms, but the three cases of main interest are
p = 1, p = 2, and p → ∞. We present the first one here, and leave the others as an
exercise.
Example 8.6 (The case p = 1, κ = 1) The derivative can be written as

∇_θ c_s = φ(s) Σ_{a∈A} ∇_θ |π_θ(a | s) − π(a | s)|,
∇_θ |π_θ(a | s) − π(a | s)| = ∇_θ π_θ(a | s) sgn[π_θ(a | s) − π(a | s)].

The policy derivative in turn is

∇_θ π_θ(a | s) = [ h(s) ∇_θ g(s, a) − ∇_θ h(s) g(s, a) ] / h(s)²,

with ∇_θ h(s) = ( Σ_{b∈A} f_i(s, b) )_i and ∇_θ g(s, a) = f(s, a). Taking partial derivatives ∂/∂θ_j leads to the update rule

θ_j^{(n+1)} = θ_j^{(n)} − α_n φ(s) [ π_{θ^{(n)}}(a | s) Σ_{b∈A} f_j(s, b) − f_j(s, a) ].

Alternative cost functions. It is often a good idea to add a penalty term of the form
‖θ‖_q to the cost function, constraining the parameters to be small. The purpose of
this is to prevent overfitting of the parameters for a small number of observations.

8.2 Approximate Policy Iteration (API)

The main idea of approximate policy iteration is to replace the exact Bellman
operator L with an approximate version Lˆ to obtain an approximate optimal policy
and a respective approximate optimal value.
Just as in standard policy iteration introduced in Sect. 6.5.4.2, there is a policy
improvement step and a policy evaluation step. In the policy improvement step,
we simply try to get as close as possible to the best possible improvement using a
restricted set of policies and an approximation of the Bellman operator. Similarly, in
the policy evaluation step, we try to get as close as possible to the actual value of the
improved policy using a respective set of value functions.

Algorithm 8.1 Generic approximate policy iteration algorithm

Input: initial value function approximation v_0, approximate Bellman operator L̂, approximate value estimator V̂, policy space Π̂, value function space V̂, norms ‖·‖_φ and ‖·‖_ψ
for k = 1, . . . do
  π_k = arg min_{π∈Π̂} ‖ L̂_π v_{k−1} − L v_{k−1} ‖_φ   // policy improvement
  v_k = arg min_{v∈V̂} ‖ v − V̂_μ^{π_k} ‖_ψ   // policy evaluation
end for

At the k-th iteration of the policy improvement step the approximate value v k−1
of the previous policy πk−1 is used to obtain an improved policy πk . However, note
that we may not be able to implement the policy arg maxπ Lπ v k−1 for two reasons.
Firstly, the policy space Π̂ may not include all possible policies. Secondly, the Bellman operator is in general also only an approximation. In the policy evaluation step, we aim at finding the function v_k that is the closest to the true value function of policy π_k. However, even if the value function space V̂ is rich enough, the minimization is done over a norm that integrates over a finite subset of the state space.
The following section discusses the effect of those errors on the convergence of
approximate policy iteration.

8.2.1 Error Bounds for Approximate Value Functions

If the approximate value function u is close to V ∗ then the greedy policy with respect
to u is close to optimal. For a finite state and action space, the following holds.
Theorem 8.2.1 Consider a finite MDP μ with discount factor γ < 1 and a vector u ∈ V such that ‖u − V*_μ‖_∞ = ε. If π is the u-greedy policy, then

‖V^π_μ − V*_μ‖_∞ ≤ 2γε / (1 − γ).

In addition, there exists ε₀ > 0 such that if ε < ε₀, then π is optimal.

Proof Recall that L is the one-step Bellman operator and L_π is the one-step policy operator on the value function. Then (skipping the index for μ)

‖V^π − V*‖_∞ = ‖L_π V^π − V*‖_∞
             ≤ ‖L_π V^π − L_π u‖_∞ + ‖L_π u − V*‖_∞
             ≤ γ ‖V^π − u‖_∞ + ‖L u − V*‖_∞

by contraction, and by the fact that π is u-greedy,

             ≤ γ ‖V^π − V*‖_∞ + γ ‖V* − u‖_∞ + γ ‖u − V*‖_∞
             ≤ γ ‖V^π − V*‖_∞ + 2γε.

Rearranging proves the first part.
For the second part, note that the state and action sets are finite. Consequently, the set of policies is finite. Thus, there is some ε₀ > 0 such that the best sub-optimal policy is ε₀ away from the optimal policy in value. So, if ε < ε₀, the obtained policy must be optimal. □

Building on this result, we can prove a simple bound for approximate policy
iteration, assuming uniform error bounds on the approximation of the value of a policy
as well as on the approximate Bellman operator. Even though these assumptions are
quite strong, we still only obtain the following rather weak asymptotic convergence
result.2

2 For δ = 0, this is identical to the result for ε-equivalent MDPs by Even-Dar and Mansour [3].

Theorem 8.2.2 (Bertsekas and Tsitsiklis [4], Proposition 6.2) Assume that there are ε, δ such that for all k the iterates v_k, π_k satisfy

‖v_k − V^{π_k}‖_∞ ≤ ε,
‖L_{π_{k+1}} v_k − L v_k‖_∞ ≤ δ.

Then

lim sup_{k→∞} ‖V^{π_k} − V*‖_∞ ≤ (δ + 2γε) / (1 − γ)².

8.2.2 Rollout-Based Policy Iteration Methods

As suggested by Bertsekas and Tsitsiklis [4], one idea for estimating the value func-
tion is to simply perform rollouts, while the policy itself is estimated in parametric
form. The first practical algorithm in this direction was Rollout Sampling Approx-
imate Policy Iteration by Dimitrakakis and Lagoudakis [5]. The main idea is to
concentrate rollouts in interesting parts of the state space, so as to maximize the
expected amount of improvement we can obtain with a given rollout budget.

Algorithm 8.2 Rollout Sampling Approximate Policy Iteration


for k = 1, . . . do
Select a set of representative states Ŝk .
for n = 1, . . . do
Select a state sn ∈ Ŝk maximizing Un (s) and perform a rollout obtaining {st,k , at,k }.
If â ∗ (sn ) is optimal w.p. 1 − δ, add sn to Ŝk (δ) and remove it from Ŝk .
end for
Calculate q k ≈ Q πk from the rollouts, cf. Eq. (8.1.1).
Train a classifier πθk+1 on the set of states Ŝk (δ) with actions â ∗ (s).
end for

If we have collected data, we can use the empirical state distribution to select starting states. In general, rollouts give us estimates q_k, which are used to select states for further rollouts. That is, we compute for each state s the action

â*_s ≜ arg max_a q_k(s, a).

Then we select a state sn maximizing the upper bound value Un (s) defined via
Δ̂_k(s) ≜ q_k(s, â*_s) − max_{a ≠ â*_s} q_k(s, a),

U_n(s) ≜ Δ̂_k(s) − max_{a ≠ â*_s} q_k(s, a) + 1/(1 + c(s)),

where c(s) is the number of rollouts from state s. If the sampling of a state s stops whenever

Δ̂_k(s) ≥ √[ 2 / (c(s)(1 − γ)²) · ln( (|A| − 1)/δ ) ],

then we are certain that the optimal action has been identified with probability 1 − δ
for that state, due to Hoeffding’s inequality. Unfortunately, guaranteeing a policy
improvement for the complete state space is impossible, even with strong assump-
tions.3

8.2.3 Least Squares Methods

When considering quadratic error, it is tempting to use linear methods, such as least-squares methods, which are very efficient. This requires formulating the problem in linear form, using a feature mapping that projects individual states (or state-action pairs) onto a high-dimensional space. Then the value function can be represented as a linear function of the parameters and this mapping, which minimizes a squared error over the observed trajectories.
To get an intuition for these methods, recall from Theorem 6.5.1 that the solution
of

v = r + γ P μ,π v

is the value function of π and can be obtained via

v = (I − γ P μ,π )−1 r.

Here we consider the setting where we do not have access to the transition matrix, but
instead have some observations of transition (st , at , st+1 ). In addition, our state space
can be continuous (e.g., S ⊂ Rn ), so that the transition matrix becomes a general
transition kernel. Consequently, the set of value functions V becomes a Hilbert space,
while in the discrete setting a value function is simply a point in Rn .

3 First, note that if we need to identify the optimal action for k states, then the above stopping rule
has an overall error probability of kδ. In addition, even if we assume that value functions are smooth,
it will be impossible to identify the boundary in the state space where the optimal policy should
switch actions [6].

In general, we deal with this case via projections. We project from the infinite-
dimensional Hilbert space to one with finite dimension on a subset of states: namely,
the ones that we have observed. We also replace the transition kernel with the empir-
ical transition matrix on the observed states.

Parametrization. Let us first deal with parametrizing a linear value function. Setting v = Φθ, where Φ is a feature matrix and θ is a parameter vector, we have

Φθ = r + γ P_{μ,π} Φθ,
θ = [ (I − γ P_{μ,π}) Φ ]^{−1} r.

This simple linear parametrization of a value function is perfectly usable in a discrete


MDP setting where the transition matrix is given. However, otherwise making use of
this parametrization is not so straightforward. The first problem is how to define the
transition matrix itself, since there is an infinite number of states. A simple solution
to this problem is to only define the matrix on the observed sequence of states,
and furthermore, so that the probability of transition to a particular state is 1 if that
transition has been observed. This makes the matrix off-diagonal. More precisely,
the construction is as follows.

Empirical construction. Given a set of data points {(s_t, a_t, r_t) | t = 1, . . . , n}, we define:
1. the empirical reward vector r = (r_t)_t,
2. the feature matrix Φ = (φ_t)_t, with φ_t = f(s_t, a_t), and
3. the empirical n × n transition matrix P_{μ,π}(s_t, s_{t'}) = I{t' = t + 1}.
Generally the value function space generated by the features and the linear parametrization does not allow us to obtain exact value functions. For this reason, instead of considering the inverse A^{−1} of the matrix A = (I − γ P_{μ,π})Φ, we use the pseudo-inverse defined as

A^† ≜ (Aᵀ A)^{−1} Aᵀ.

If the inverse exists, then it is equal to the pseudo-inverse. However, in our setting,
the matrix can be low rank, in which case we instead obtain the matrix minimizing
the squared error, which in turn can be used to obtain a good estimate for the param-
eters. This immediately leads to the Least Squares Temporal Difference (LSTD)
algorithm [7], which estimates an approximate value function for some policy π
given some data D and a feature mapping f .
State-action value functions. As estimating a state-value function is not directly
useful for obtaining an improved policy without a model, we can instead estimate a
state-action value function as follows:
q = r + γ P_{μ,π} q,
Φθ = r + γ P_{μ,π} Φθ,
θ = [ (I − γ P_{μ,π}) Φ ]^{−1} r.

However, this approach has two drawbacks. The first is that it is difficult to get an unbiased estimate of θ. The second is that when we apply the Bellman operator to q, the result may lie outside the space spanned by the features. For this reason, we instead consider the least-squares projection Φ(ΦᵀΦ)^{−1}Φᵀ, i.e.,

q = Φ(ΦᵀΦ)^{−1}Φᵀ [ r + γ P_{μ,π} q ].

Replacing q = Φθ leads to the estimate

θ = [ Φᵀ(I − γ P_{μ,π})Φ ]^{−1} Φᵀ r.

In practice, of course, we do not have the transitions P_{μ,π} but estimate them from data. Note that for any deterministic policy π and a set of T data points (s_t, a_t, r_t, s'_t)_{t=1}^T, we have

(P_{μ,π} Φ)(s, a) = Σ_{s'} P(s' | s, a) φ(s', π(s'))
                 ≈ (1/T) Σ_{t=1}^T P̂(s'_t | s_t, a_t) φ(s'_t, π(s'_t)),

where for P̂ one can take the simple empirical transition matrix mentioned previously. This equation can be used to maintain q-factors with q(s, a) = f(s, a)ᵀθ to obtain an empirical estimate of the Bellman operator as summarized in Algorithm 8.3.

Algorithm 8.3 LSTDQ-Least Squares Temporal Differences

input data D = {(s_t, a_t, r_t, s'_t) | t = 1, . . . , T}, feature mapping f, policy π
A = Σ_{t=1}^T f(s_t, a_t) [ f(s_t, a_t) − γ f(s'_t, π(s'_t)) ]ᵀ
b = Σ_{t=1}^T f(s_t, a_t) r_t
θ = A^{−1} b
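A minimal Python sketch consistent with this computation is given below; it assumes a feature map f(s, a) returning a NumPy vector and a deterministic policy pi(s), and the small ridge term guarding against a singular A is our own addition.

import numpy as np

def lstdq(data, f, pi, gamma=0.95, ridge=1e-6):
    # data: list of transitions (s, a, r, s_next) collected from the MDP
    n = len(f(data[0][0], data[0][1]))
    A = ridge * np.eye(n)
    b = np.zeros(n)
    for (s, a, r, s_next) in data:
        phi = f(s, a)
        phi_next = f(s_next, pi(s_next))
        A += np.outer(phi, phi - gamma * phi_next)   # accumulate f(s,a)[f(s,a) - gamma f(s', pi(s'))]^T
        b += r * phi                                 # accumulate f(s,a) r
    return np.linalg.solve(A, b)                     # theta with q(s, a) ~ f(s, a)^T theta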

The algorithm can be easily extended to approximate policy iteration, resulting


in the well-known Least Squares Policy Iteration (LSPI) [8] algorithm shown in
Algorithm 8.4. The idea is to repeatedly estimate the value function for improved
policies using a least squares estimate, and then compute the greedy policy for each
estimate.

Algorithm 8.4 LSPI-Least Squares Policy Iteration

input data D = {(s_t, a_t, r_t, s'_t) | t = 1, . . . , T}, feature mapping f
Set π_0 arbitrarily.
for k = 1, . . . do
  θ^{(k)} = LSTDQ(D, f, π_{k−1})
  π^{(k)} = π*(θ^{(k)}), the greedy policy with respect to q_{θ^{(k)}}(s, a) = f(s, a)ᵀθ^{(k)}
end for
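Using the lstdq sketch given above, the LSPI loop might be written as follows; the finite action list and the convergence test are our own choices, not prescribed by the algorithm.

import numpy as np

def lspi(data, f, actions, gamma=0.95, n_iters=20):
    theta = np.zeros(len(f(data[0][0], data[0][1])))
    for _ in range(n_iters):
        # greedy (deterministic) policy with respect to the current q-estimate
        pi = lambda s, th=theta: max(actions, key=lambda a: f(s, a) @ th)
        theta_new = lstdq(data, f, pi, gamma)
        if np.allclose(theta, theta_new, atol=1e-6):
            break
        theta = theta_new
    return theta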

8.3 Approximate Value Iteration

Approximate algorithms can also be defined for backwards induction. The general
algorithmic structure remains the same as exact backwards induction, however the
exact steps are replaced by approximations. Applying approximate value iteration
may be necessary for two reasons. Firstly, it may not be possible to update the value
function for all states. Secondly, the set of available value function representations
may be not complex enough to capture the true value function.

8.3.1 Approximate Backwards Induction

The first algorithm is approximate backwards induction. Let us start with the basic
backwards induction algorithm:
 

V*_t(s) = max_{a∈A} [ r(s, a) + γ Σ_{s'} V*_{t+1}(s') P_μ(s' | s, a) ]

This is essentially the same both for finite and infinite-horizon problems. If we have
to pick the value function from a set of functions V, we can use the following value
function approximation.
Let our estimate at time t be v t ∈ V, with V being a set of (possibly parametrized)
functions. Let V̂t be our one-step update given the value function approximation at
the next step, v t+1 . Then v t will be the closest approximation in that set.

Iterative approximation

V̂_t(s) = max_{a∈A} [ r(s, a) + γ Σ_{s'} P_μ(s' | s, a) v_{t+1}(s') ]

v_t = arg min_{v∈V} ‖ v − V̂_t ‖

The above minimization can for example be performed by gradient descent. Con-
sider the case where v is a parametrized function from a set of parametrized value
functions V with parameters θ. Then it is sufficient to maintain the parameters θ (t)
at any time t. These can be updated with a gradient scheme at every step. In the
online case, our next-step estimates can be obtained by gradient descent using a step
size sequence (αt )t .

Online gradient estimation

 
 
θ_{t+1} = θ_t − α_t ∇_θ ‖ v_t − V̂_t ‖

This gradient descent algorithm can also be made stochastic, if we sample s  from
the probability distribution P μ (s  | s, a) used in the iterative approximation. The
next sections give some examples.

8.3.2 State Aggregation

In state aggregation, multiple different states with identical properties (with respect to
rewards and transition probabilities) are identified in order to obtain a new aggregated
state in an aggregated MDP with smaller state space. Unfortunately, it is very rarely
the case that aggregated states are really indistinguishable with respect to rewards and
transition probabilities. Nevertheless, as we can see in the example below, aggregation
can significantly simplify computation through the reduction of the size of the state
space.

Example 8.7 (Aggregated value function) A simple method for aggregation is to set the value of every state in an aggregate set to be the same. More precisely, let G = {S_1, . . . , S_n} be a partition of S, with θ ∈ R^n, and let f_k(s_t) = I{s_t ∈ S_k}. Then the approximate value function is

v(s) = θ(k), if s ∈ S_k, i.e., if f_k(s) ≠ 0,

where θ(k) is the k-th co-ordinate of θ.

In the above example, the value of every state corresponds to the value of the k-th
set in the partition. Of course, this is only a very rough approximation if the sets Sk
are very large. However, this is a convenient approach to use for gradient descent
updates, as only one parameter needs to be updated at every step.

Online gradient estimate for aggregated value functions

Consider the case ‖·‖ = ‖·‖₂². For s_t ∈ S_k and some step size sequence (α_t)_t:

θ_{t+1}(k) = (1 − α_t) θ_t(k) + α_t max_{a∈A} [ r(s_t, a) + γ Σ_{j∈Ŝ} P(j | s_t, a) θ_tᵀ f(j) ],

while θ_{t+1}(k) = θ_t(k) for s_t ∉ S_k, where Ŝ is a subset of states.
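The following Python sketch performs one such online update for a finite MDP; the arrays encoding the MDP and the partition map are placeholders introduced here for illustration, not part of the text.

import numpy as np

def aggregated_update(theta, s, partition, P, r, gamma, alpha):
    # partition[s] gives the index k of the aggregate set S_k containing state s,
    # P[a, s, j] are transition probabilities and r[s, a] the rewards.
    k = partition[s]
    values = theta[partition]                      # approximate value of every state
    targets = [r[s, a] + gamma * P[a, s] @ values for a in range(r.shape[1])]
    theta[k] = (1 - alpha) * theta[k] + alpha * max(targets)
    return theta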

Of course, whenever we perform the estimation online, we are limited to esti-


mation on the sequence of states st that we visit. Consequently, estimation on other
states may not be very good. It is indeed possible that we suffer from convergence
problems as we alternate between estimating the values of different states in the
aggregation. In the online algorithm, we posited the existence of a subset of states
that we can use to perform the gradient update. We can formalize this through the
notion of a representative state approximation.

8.3.3 Representative State Approximation

A more refined approach is to choose some representative states and try to approx-
imate the value function of all other states as a convex combination of the value of
the representative states.

Representative state approximation

Let Ŝ be a set of n representative states, θ ∈ R^n a parameter vector, and f a feature mapping with

Σ_{i=1}^n f_i(s) = 1,    ∀s ∈ S.

The feature mapping is used to perform the convex combination. Usually, f i (s)
is larger for representative states i which are “closer” to s. In general, the feature
mapping is fixed, and we want to find a set of parameters for the values of the
representative states. At time t, for each representative state i, we obtain a new
estimate of its value function and plug it back in.
Fig. 8.4 Error ‖u − V‖ in the representative state approximation for two different random MDP structures as we increase the number of sampled states. The first is the chain environment from Example 7.2, extended to 100 states. The second involves randomly generated MDPs with two actions and 100 states. (The plot shows the error against the number of samples, with one curve for the chain and one for the random MDPs.)

Representative state update

For i ∈ Ŝ:

θ_{t+1}(i) = max_{a∈A} [ r(i, a) + γ ∫ v_t(s) dP(s | i, a) ]    (8.3.1)

with

v_t(s) = Σ_{i=1}^n f_i(s) θ_t(i).

When the integration in (8.3.1) is not possible, we may instead approximate the
expectation with a Monte Carlo method. One particular problem with this method
arises when the transition kernel is very sparse. Then we are basing our estimates on
approximate values of other states, which may be very far from any other represen-
tative state. This is illustrated in Fig. 8.4, which presents the value function error for
the chain environment of Example 7.2 and random MDPs. Due to the linear structure
of the chain environment, the states are far from each other. In contrast, the random
MDPs are generally both quite dense and the state distribution for any particular
policy mixes rather fast. Thus, states in the former tend to have very different values
and in the latter very similar ones.
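To make the Monte Carlo variant of update (8.3.1) concrete, here is a sketch in Python; the sampler sample_next_state, the reward function r and the number of samples are assumptions made for illustration.

import numpy as np

def representative_state_update(theta, rep_states, f, r, sample_next_state,
                                actions, gamma, n_samples=100):
    # One sweep of update (8.3.1), with the integral over next states
    # replaced by a sample average; f(s) returns the convex-combination weights.
    theta_new = theta.copy()
    for idx, i in enumerate(rep_states):
        q_values = []
        for a in actions:
            v_next = np.mean([f(sample_next_state(i, a)) @ theta
                              for _ in range(n_samples)])
            q_values.append(r(i, a) + gamma * v_next)
        theta_new[idx] = max(q_values)
    return theta_new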

8.3.4 Bellman Error Methods

The problems with the representative state update can be alleviated through Bellman
error minimization. The idea here is to obtain a value function that is as consistent

as possible. The basic Bellman error minimization is given by

min_θ ‖ v_θ − L v_θ ‖.

This is different from the approximate backwards induction algorithm we saw previ-
ously, since the same parameter θ appears in both terms inside the norm. Furthermore,
if the norm has support in all of the state space and the approximate value function
space contains the actual set of value functions then the minimum is 0 and we obtain
the optimal value function.

Gradient update

For the L2-norm, we have the following α-step update:

‖ v_θ − L v_θ ‖ = Σ_{s∈Ŝ} D_θ(s)²,    D_θ(s) = v_θ(s) − max_{a∈A} [ r(s, a) + γ ∫_S v_θ(j) dP(j | s, a) ].

Then the gradient update becomes θ_{t+1} = θ_t − α D_{θ_t}(s_t) ∇_θ D_{θ_t}(s_t), where

∇_θ D_{θ_t}(s_t) = ∇_θ v_{θ_t}(s_t) − γ ∫_S ∇_θ v_{θ_t}(j) dP(j | s_t, a*_t)

with a*_t ≜ arg max_{a∈A} [ r(s_t, a) + γ ∫_S v_θ(j) dP(j | s_t, a) ].

We can also construct a q-factor approximation for the case where no model
is available. This can be simply done by replacing P with the empirical transition
observed at time t.

8.4 Policy Gradient

In the previous section, we have seen how to use gradient methods for value function
approximation. It is also possible to use these methods to estimate policies—the
only necessary ingredients are a policy representation and a way to evaluate a policy.
The representation is usually parametric, but non-parametric representations are also
possible. A common choice for parametrized policies is to use a feature function
f : S × A → Rk and a linear parametrization with parameters θ ∈ Rk leading to
the following Softmax distribution:

π(a | s) = exp(F(s, a)) / Σ_{a'∈A} exp(F(s, a')),    F(s, a) ≜ θᵀ f(s, a)    (8.4.1)

As usual, we would like to find a policy maximizing expected utility. Policy gradient
algorithms employ gradient ascent on the expected utility to find a locally maximizing
policy. Here we focus on the discounted reward criterion with discount factor γ, where
a policy’s expected utility is defined with respect to a starting state distribution y so
that

E^π_y(U) = Σ_s y(s) V^π(s) = Σ_s y(s) Σ_h P^π(h | s_1 = s) U(h),    (8.4.2)

where U (h) is the utility of a trajectory h. This definition leads to a number of simple
expressions for the gradient of the expected utility, including what is known as the
policy gradient theorem of [9]. For notational simplicity, we omit the subscript θ for
the policy π.
Theorem 8.4.1 Assuming that the reward only depends on the state, for any θ-parametrized policy space Π_Θ, the gradient of the utility from starting state distribution y can be equivalently written in the three following forms:

∇_θ E^π_y U = γ yᵀ (I − γ P^π_μ)^{−1} ∇_θ P^π_μ (I − γ P^π_μ)^{−1} r    (8.4.3)
            = Σ_s x^{π,γ}_{μ,y}(s) Σ_a ∇_θ π(a | s) Q^π_μ(s, a)    (8.4.4)
            = Σ_h U(h) P^π_μ(h) ∇_θ ln P^π_μ(h),    (8.4.5)

where (as in Sect. 6.5.4.6) we use

x^{π,γ}_{μ,y}(s) = Σ_{s'} Σ_{t=0}^∞ γ^t P^π_μ(s_t = s | s_0 = s') y(s')

to denote the γ-discounted sum of state visits. Further, h ∈ (S × A)∗ denotes a state-
action history, Pπμ (h) its probability under the policy π in the MDP μ, and U (h) is
the utility of history h.

Proof We begin by proving the claim (8.4.3). Note that

E^π_μ U = yᵀ (I − γ P^π_μ)^{−1} r,

where y is a starting state distribution vector and P^π_μ is the transition matrix resulting from applying policy π to μ. Computing the derivative using matrix calculus gives

∇_θ E U = yᵀ ∇_θ (I − γ P^π_μ)^{−1} r,

as the only term involving θ is π. The derivative of the matrix inverse can be written as

∇_θ (I − γ P^π_μ)^{−1} = −(I − γ P^π_μ)^{−1} ∇_θ (I − γ P^π_μ) (I − γ P^π_μ)^{−1}
                       = γ (I − γ P^π_μ)^{−1} ∇_θ P^π_μ (I − γ P^π_μ)^{−1},

which concludes the proof of (8.4.3).
We proceed by expanding the ∇_θ P^π_μ term, thus obtaining a formula that only has a derivative for π:

∂/∂θ_i P^π_μ(s' | s) = Σ_a P_μ(s' | s, a) ∂/∂θ_i π(a | s).

Defining the state visitation matrix X ≜ (I − γ P^π_μ)^{−1} we have, rewriting (8.4.3):

∇_θ E^π_μ U = γ yᵀ X ∇_θ P^π_μ X r.

We are now ready to prove claim (8.4.4). Define the expected state visitation from the starting distribution to be xᵀ ≜ yᵀ X, so that we obtain

∇_θ E^π_μ U = γ xᵀ ∇_θ P^π_μ X r
            = γ Σ_s x(s) Σ_{a,s'} P_μ(s' | s, a) ∇_θ π(a | s) V^π_μ(s')
            = γ Σ_s x(s) Σ_a ∇_θ π(a | s) Σ_{s'} P_μ(s' | s, a) V^π_μ(s')
            = γ Σ_s x(s) Σ_a ∇_θ π(a | s) Q^π_μ(s, a).

The last claim (8.4.5) is straightforward. Indeed,

∇_θ E U = Σ_h U(h) ∇_θ P^π_μ(h) = Σ_h U(h) P^π_μ(h) ∇_θ ln P^π_μ(h),    (8.4.6)

as ∇_θ ln P^π_μ(h) = (1/P^π_μ(h)) ∇_θ P^π_μ(h). □

8.4.1 Stochastic Policy Gradient

For finite MDPs, we can obtain x π from the state occupancy matrix (6.5.2) by left
multiplication with the initial state distribution y. However, in the context of gradient
methods, it makes more sense to use a stochastic estimate of x π to calculate the
gradient, since

∇_θ E^π U = E^π_y [ Σ_{s,t} γ^t I{s_t = s} Σ_a ∇_θ π(a | s) Q^π(s, a) ].

For the discounted reward criterion, we can easily obtain unbiased samples through
geometric stopping (see Exercise 6.1).
Importance sampling
The last formulation is especially useful as it allows us to use importance sampling
to compute the gradient even on data obtained for different policies, which in general
is more data efficient. First note that for any history h ∈ (S × A)*, we have

P^π_μ(h) = Π_{t=1}^T P_μ(s_t | s^{t−1}, a^{t−1}) π(a_t | s^t, a^{t−1})    (8.4.7)

without any Markovian assumptions on the model or policy. We can now rewrite (8.4.6) in terms of the expectation with respect to an alternative policy π' as

∇ E^π_μ U = E^{π'}_μ [ U(h) ∇ ln P^π_μ(h) · P^π_μ(h) / P^{π'}_μ(h) ]
          = E^{π'}_μ [ U(h) ∇ ln P^π_μ(h) Π_{t=1}^T π(a_t | s^t, a^{t−1}) / π'(a_t | s^t, a^{t−1}) ],

since the μ-dependent terms in (8.4.7) cancel out. In practice the expectation would be approximated through sampling trajectories h. Note that

∇ ln P^π_μ(h) = Σ_t ∇ ln π(a_t | s^t, a^{t−1}) = Σ_t ∇π(a_t | s^t, a^{t−1}) / π(a_t | s^t, a^{t−1}).

Overall, importance sampling allows us to perform stochastic gradient updates for a parametrized policy π using data collected from an arbitrary previous policy π'.
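As a rough sketch of such an estimator, the Python code below averages likelihood-ratio gradients over trajectories collected under a behaviour policy; it assumes Markov policies, with pi(a, s, theta) and pi_behaviour(a, s) returning action probabilities and grad_log_pi returning ∇_θ ln π_θ(a | s). These names and signatures are our own choices for illustration.

import numpy as np

def is_policy_gradient(trajectories, pi, pi_behaviour, grad_log_pi, theta):
    # trajectories: list of trajectories, each a list of (s, a, r) tuples
    grad = np.zeros_like(theta)
    for traj in trajectories:
        utility = sum(r for (_, _, r) in traj)
        ratio = 1.0
        score = np.zeros_like(theta)
        for (s, a, _) in traj:
            ratio *= pi(a, s, theta) / pi_behaviour(a, s)   # importance weight
            score += grad_log_pi(a, s, theta)               # sum_t grad log pi(a_t | s_t)
        grad += utility * score * ratio
    return grad / len(trajectories)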

8.4.2 Practical Considerations

The first design choice in any gradient algorithm is how to parametrize the policy. For
the discrete case, a common parametrization is to have a separate and independent
parameter for each state-action pair, i.e., θs,a = π(a|s). This leads to a particularly
simple expression for the second form (8.4.4), which is ∂/∂θs,a Eπμ U = y(s) Q(s, a).
However, it is easy to see that in this case the parametrization will lead to all param-
eters increasing if rewards are positive. This can be avoided by either a Softmax
parametrization or by subtracting a bias term (e.g. the value of the state Vμπ (s)) from
the derivative. Nevertheless, this parametrization implies stochastic discrete policies.
We could also suitably parametrize continuous policies. For example, if A ⊂ Rn ,
we can consider a linear policy. Most of the derivation carries over to Euclidean state-
action spaces. In particular, the second form (8.4.4) is also suitable for deterministic
policies.
Finally, in practice, we may not need to accurately calculate the expectations
involved in the gradient. Sample trajectories are sufficient to update the gradient in
a meaningful way, especially for the third form (8.4.5), as we can naturally sample
from the distribution of trajectories. However, the fact that this form doesn’t need
a Markovian assumption also means that it cannot take advantage of Markovian
environments.
Policy gradient methods are useful, especially in cases where the environment
model or value function is extremely complicated, while the optimal policy itself
might be quite simple. The main difficulty lies in obtaining an appropriate estimate
of the gradient itself, but convergence to a local maximum is generally good as long
as we are adjusting the parameters in a gradient-related direction (in the sense of
Assumption 7.1.1iii).

8.5 Examples

Let us now consider two well-known problems with a 2-dimensional continuous


state space and a discrete set of actions. The first is the inverted pendulum problem
in the version of Lagoudakis and Parr [8], where a controller must balance a rod
upside-down. The state information is the rotational velocity and position of the
pendulum. The second example is the mountain car problem, where we must drive
an underpowered vehicle to the top of a hill [1]. The state information is the velocity
and location of the car. In both problems, there are three actions: “push left”, “push
right” and “do nothing”.
Let us first consider the effect of model and features in representing the value
function of the inverted pendulum problem. Figure 8.5 shows value function approx-
imations for policy evaluation under a uniformly random policy for different choices
of model and features. Here we need to fit an approximate value function to samples
of the utility obtained from different states. The quality of the approximation depends
on both the model and the features used. The first choice of features is simply the
raw state representation, while the second is a 16 × 16 uniform RBF tiling. The two models are very simple: the first uses a linear-Gaussian model4 assumption on observation noise (LG), and the second is a k-nearest neighbour (kNN) model.
As can be seen from the figure the linear model results in a smooth approximation,
but is inadequate for modelling the value function in the original 2-dimensional state
space. However, a high-dimensional non-linear projection using RBF kernels results
in a smooth and accurate value function representation. Non-parametric models such
as k-nearest neighbours behave rather well under either state representation.

4 Essentially, this is a linear model of the form s_{t+1} | s_t = s, a_t = a ∼ N(μ_a s, Σ_a), where μ_a has a normal prior and Σ_a a Wishart prior.
Fig. 8.5 Estimated value function of a uniformly random policy on the 2-dimensional state space of the pendulum problem. Results are shown for a k-nearest neighbour model (kNN) with k = 3 and a Bayesian linear-Gaussian model (LG), for either the case when the model uses the plain state information (state) or a 256-dimensional RBF embedding (RBF). (Four panels, LG/state, kNN/state, LG/RBF and kNN/RBF, each plotting V over the state coordinates s1, s2.)

For finding the optimal value function we must additionally consider the question
of which algorithm to use. In Fig. 8.6 we see the effect of choosing either approxi-
mate value iteration (AVI) or representative state representations and value iteration
(RSVI) for the inverted pendulum and mountain car.

8.6 Further Reading

Among value function approximation methods, the two most well known are fitted Q-
iteration [10] and fitted value iteration, which has been analysed in [11]. Minimizing
the Bellman error [12–14] is generally a good way to ensure that approximate value
iteration is stable.
In approximate policy iteration methods, one needs to approximate both the value
function and policy. An empirical approximation of the value function is maintained
in rollout sampling policy iteration [5, 6]. However, one can employ least-squares
methods [7, 8, 15] for example.
The general technique of state aggregation [16, 17] is applicable to a variety of
reinforcement learning algorithms.

Fig. 8.6 Estimated optimal value function for the pendulum and mountain car problems. Results are shown for approximate value iteration (AVI) with a Bayesian linear-Gaussian model, and a representative state representation (RSVI) with an RBF embedding. Both the embedding and the states where the value function is approximated are a 16 × 16 uniform grid over the state space. (Four panels, AVI/Pendulum, AVI/MountainCar, RSVI/Pendulum and RSVI/Mountaincar, each plotting V over s1, s2.)

While the more general question of selecting appropriate features is open, there has been some progress in the domain of feature reinforcement learning [18]. In general, learning internal representations (i.e.,
features) has been a prominent aspect of neural network research [19]. Even if it is
unclear to what extent recently proposed approximation architectures that employ
deep learning actually learn useful representations, they have been successfully used
in combination with simple reinforcement learning algorithms [20]. Another inter-
esting direction is to establish links between features and approximately sufficient
statistics [21, 22].
Finally, the policy gradient theorem in the state visitation form was first proposed
by Sutton et al. [9], while Williams [23] was the first to use the log-ratio trick in
Eq. (8.4.5) in reinforcement learning. To our knowledge, the analytical gradient has
not actually been applied (or indeed, described) in prior literature. Extensions of the
policy gradient idea are also natural. They have also been used in a Bayesian setting
by Ghavamzadeh and Engel [14], while the natural gradient has been proposed by
Kakade [24]. A survey of policy gradient methods can be found in [25].

8.7 Exercises

Exercise 8.1 (Enlarging the function space) Consider the problem in Example 8.1.
What would be a simple way to extend the space of value functions from the three
given candidates to an infinite number of value functions? How could we get a good
fit?

Exercise 8.2 (Enlarging the policy) Consider Example 8.2. This represents an
example of linear deterministic policies. In which two ways can this policy space be
extended and how?

Exercise 8.3 Find the derivative for minimizing the cost function in (8.1.2) for the
following two cases:
1. p = 2, κ = 2.
2. p → ∞, κ = 1.

References

1. Sutton, R.S., Barto, A.G.: Reinforcement Learning: An Introduction. MIT Press (1998)
2. Lagoudakis, M.G., Parr, R.: Reinforcement learning as classification: leveraging modern clas-
sifiers. In: Machine Learning, Proceedings of the 20th International Conference (ICML 2003),
pp. 424–431. AAAI Press (2003)
3. Even-Dar, E., Mansour, Y: Approximate equivalence of Markov decision processes. In: Com-
putational Learning Theory and Kernel Machines, 16th Annual Conference on Computational
Learning Theory and 7th Kernel Workshop, COLT/Kernel 2003. Lecture notes in Computer
Science, vol. 2777, pp. 581–594. Springer (2003)
4. Bertsekas, D.P., Tsitsiklis, J.N.: Neuro-Dynamic Programming. Athena Scientific (1996)
5. Dimitrakakis, C., Lagoudakis, M.G.: Rollout sampling approximate policy iteration. Mach.
Learn. 72(3), 157–171 (2008)
6. Dimitrakakis, C., Lagoudakis, M.G.: Algorithms and bounds for rollout sampling approximate
policy iteration. In: Girgin, S., Loth, M., Munos, R., Preux, P., Ryabko, D. (eds.) Recent
Advances in Reinforcement Learning, 8th European Workshop, EWRL 2008. Lecture Notes
in Computer Science, vol. 5323, pp. 27–40. Springer (2008)
7. Bradtke, S.J., Barto, A.G.: Linear least-squares algorithms for temporal difference learning. Mach. Learn. 22(1), 33–57 (1996)
8. Lagoudakis, M.G., Parr, R.: Least-squares policy iteration. J. Mach. Learn. Res. 4, 1107–1149 (2003)
9. Sutton, R.S., McAllester, D.A., Singh, S.P., Mansour, Y.: Policy gradient methods for reinforce-
ment learning with function approximation. In: Advances in Neural Information Processing
Systems 12, pp. 1057–1063. The MIT Press (1999)
10. Antos, A., Munos, R., Szepesvari, C.: Fitted Q-iteration in continuous action-space MDPs. In:
Advances in Neural Information Processing Systems, vol. 20, pp. 9–16 (2008)
11. Munos, R., Szepesvári, C.: Finite-time bounds for fitted value iteration. J. Mach. Learn. Res.
9, 815–857 (2008)
12. Antos, A., Szepesvari, C., Munos, R.: Learning near-optimal policies with bellman-residual
minimization based fitted policy iteration and a single sample path. Mach. Learn. 71(1), 89–129
(2008)

13. Dimitrakakis, C.: Monte-carlo utility estimates for bayesian reinforcement learning. In: Pro-
ceedings of the 52nd IEEE Conference on Decision and Control, CDC 2013, pp. 7303–7308.
IEEE (2013)
14. Ghavamzadeh, M., Engel, Y.: Bayesian policy gradient algorithms. In: Advances in Neural
Information Processing Systems, vol. 19, pp. 457–464. MIT Press (2006)
15. Boyan, J.A.: Technical update: least-squares temporal difference learning. Mach. Learn. 49(2), 233–246 (2002)
16. Singh, S.P., Jaakkola, T.S., Jordan, M.I.: Reinforcement learning with soft state aggregation. Adv. Neural Inf. Process. Syst. 7, 361–368 (1995)
17. Bernstein, A.: Adaptive state aggregation for reinforcement learning. Master’s thesis, Technion
Israel Institute of Technology (2007)
18. Hutter, M.: Feature reinforcement learning: part I: unstructured MDPs. J. Artif. General Intell. 1, 3–24 (2009)
19. Rumelhart, D.E., Hinton, G.E., Williams, R.J.: Learning internal representations by error prop-
agation. In: Rumelhart, D.E., McClelland, J.L., et al. (eds.) Parallel Distributed Processing.
Foundations, vol. 1, pp. 318–362. MIT Press, Cambridge (1987)
20. Mnih, V., Kavukcuoglu, K., Silver, D., et al.: Human-level control through deep reinforcement
learning. Nature 518(7540), 529–533 (2015)
21. Dimitrakakis, C., Tziortziotis, N.: ABC reinforcement learning. In: Proceedings of the 30th
International Conference on Machine Learning, ICML 2013, pp. 684–692. JMLR.org (2013)
22. Dimitrakakis, C., Tziortziotis, N.: Usable ABC reinforcement learning. In: NIPS 2014 Work-
shop: ABC in Montreal (2014)
23. Williams, R.J.: Simple statistical gradient-following algorithms for connectionist reinforcement learning. Mach. Learn. 8(3–4), 229–256 (1992)
24. Kakade, S.: A natural policy gradient. Adv. Neural Inf. Process. Syst. 14, 1531–1538 (2002)
25. Peters, J., Schaal, S.: Policy gradient methods for robotics. In: 2006 IEEE/RSJ International
Conference on Intelligent Robots and Systems, IROS 2006, pp. 2219–2225. IEEE (2006)
Chapter 9
Bayesian Reinforcement Learning

9.1 Introduction

In this chapter, we return to the setting of subjective probability and utility by for-
malizing the reinforcement learning problem as a Bayesian decision problem and
solving it directly. In the Bayesian setting, we are acting in an unknown environ-
ment, and we represent our subjective belief about the environment in the form of a
probability distribution. We shall first consider the case of acting in unknown MDPs,
which is the focus of the reinforcement learning problem. We will examine a few dif-
ferent heuristics for maximizing expected utility in the Bayesian setting and contrast
them with tractable approximations to the Bayes-optimal solution. We then present
extensions of these ideas to continuous domains. Finally, we draw connections of
this setting to partially observable MDPs.

9.1.1 Acting in Unknown MDPs

The reinforcement learning problem can be formulated as the problem of learning


to act in an unknown environment, only by interaction and reinforcement. All of
these elements of the definition are important. First and foremost, it is a learning
problem: We have only partial prior knowledge about the environment we are acting
in. This knowledge is arrived at via interaction with the environment. We do not have
a fixed set of data to work with, but we must actively explore the environment to
understand how it works. Finally, there is a reinforcement signal that punishes some
behaviours and rewards others. We can formulate some of these problems as Markov
decision processes.
Let us begin with the case where the environment can be represented as an
MDP μ. That is, at each time step t, we observe the environment’s state st ∈ S, take an
action at ∈ A and receive reward rt ∈ R. Consequently, the state and our action fully

determine the distribution of the immediate reward, as well as that of the next state, as described in Definition 6.3.1.

Fig. 9.1 The unknown Markov decision process. ξ is our prior over the unknown μ, which is not directly observed. However, we always observe the result of our action a_t in terms of reward r_t and next state s_{t+1}

For a specific MDP μ the probability of the imme-
diate reward is given by Pμ (rt | st , at ), with expectation r̄μ (s, a)  Eμ (rt | st = s,
at = a), while the next state distribution is given by Pμ (st+1 | st , at ). If these quan-
tities are known, or if we can at least draw samples from these distributions, it is
possible to employ (approximate) dynamic programming to obtain the optimal pol-
icy and value function for the MDP.
More precisely, when μ is known, we wish to find a policy π : S → A maximiz-
ing the utility in expectation. This requires us to solve the maximization problem max_π E^π_μ U, where the utility is an additive function of rewards, U = Σ_{t=1}^T r_t. This
can be accomplished using standard algorithms, such as value or policy iteration.
However, knowing μ is contrary to the problem definition.
In Chap. 7 we have seen a number of stochastic approximation algorithms which
allow us to learn the optimal policy for a given MDP eventually. However, these
generally give few guarantees on the performance of the policy while learning.
A good way of learning the optimal policy in an MDP should trade off explor-
ing the environment to obtain further knowledge and simultaneously exploiting this
knowledge.
Within the subjective probabilistic framework, there is a natural formalization
for learning optimal behavior in an MDP. We define a prior belief ξ on the set
of MDPs M, and then find the policy that maximizes the expected utility with
respect to the prior Eπξ (U ). The structure of the unknown MDP process is shown in
Fig. 9.1. We have previously seen two simpler sequential decision problems in the
Bayesian setting. The first was the simple optimal stopping procedure in Sect. 5.2.2,
which introduced the backwards induction algorithm. The second was the optimal
experiment design problem, which resulted in the bandit Markov decision process
of Sect. 6.2. Now we want to formulate the reinforcement learning problem as a
Bayesian maximization problem.
Let ξ be a prior over M and Π be a set of policies. Then the expected utility of
the optimal policy, over some fixed starting state distribution, is

U*_ξ ≜ max_{π∈Π} E(U | π, ξ) = max_{π∈Π} ∫_M E(U | π, μ) dξ(μ).    (9.1.1)

Solving this optimization problem and hence finding the optimal policy is however
not easy, as in general the optimal policy π must incorporate the information it
obtained while interacting with the MDP. Formally, this means that it must map from
histories to actions. For any such history-dependent policy, the action we take at
step t must depend on what we observed in previous steps 1, . . . , t − 1. Consequently,
an optimal policy must also specify actions to be taken in all future time steps and
accordingly take into account the learning that will take place up to each future
time step. Thus, in some sense, the value of information is automatically taken into
account in this model. This is illustrated in the following example.

Example 9.1 Consider two MDPs μ1 , μ2 with a single state (i.e., S = {1}) and
actions A = {1, 2}. In the MDP μi , whenever you take action at = i you obtain
reward rt = 1, otherwise you obtain reward 0. If we only consider policies that do
not take into account the history so far, the expected utility of such a policy π taking
action i with probability π(i) is

π
Eξ U = T ξ(μi ) π(i)
i

for horizon T . Consequently, if the prior ξ is not uniform, the optimal policy selects
the action corresponding to the MDP with the highest prior probability. Then, the
maximal expected utility is
T max ξ(μi ).
i

However, observing the reward after choosing the first action, we can determine the
true MDP. Consequently, an improved policy is the following: First select the best
action with respect to the prior, and then switch to the best action for the MDP we
have identified to be the true one. Then, our utility improves to

max_i ξ(μ_i) + (T − 1).

As we have to consider quite general policies in this setting, it is useful to differ-


entiate between different policy types. We use Π to denote the set of all policies. We
use Πk to denote the set of k-order Markov policies π, that only take into account
the previous k steps, that is,

π(a_t | s^t, a^{t−1}, r^{t−1}) = π(a_t | s^t_{t−k+1}, a^{t−1}_{t−k}, r^{t−1}_{t−k}),

where here and in the following we use the notation s^t to abbreviate (s_1, . . . , s_t) and s_t^{t+k} for (s_t, . . . , s_{t+k}), and accordingly a^t, r^t, a_t^{t+k}, and r_t^{t+k}. Important special
cases are the set of blind policies Π0 and the set of memoryless policies Π1 . The
set Π̄k ⊂ Πk contains all stationary policies in Πk . Finally, policies may be indexed
by some parameter set Θ, in which case the set of parameterized policies is given
by ΠΘ .

Let us now turn to the problem of learning an optimal policy. Learning means that
observations we make will affect our belief, so that we will first take a closer look at
this belief update. Given that, we shall examine methods for exact and approximate
methods of policy optimization.

9.1.2 Updating the Belief

Strictly speaking, in order to update our belief, we must condition the prior dis-
tribution on all the information. This includes the sequence of observations up to
this point in time, including the states s^t, actions a^{t−1}, and rewards r^{t−1}, as well as the policy π that we followed. Let D_t = (s^t, a^{t−1}, r^{t−1}) be the observed data up to time t. Then the posterior measure for any measurable subset B of the set of all MDPs M is

ξ(B | D_t, π) = ∫_B P^π_μ(D_t) dξ(μ) / ∫_M P^π_μ(D_t) dξ(μ).

However, as we shall see in the following remark, we can usually1 ignore the policy
itself when calculating the posterior.

Remark 9.1.1 The dependence on the policy can be removed, since the posterior
is the same for all policies that put non-zero mass on the observed data. Indeed, for

D_t ∼ P^π_μ it is easy to see that for all π' ≠ π such that P^{π'}_μ(D_t) > 0, it holds that

ξ(B | D_t, π) = ξ(B | D_t, π').

The proof is left as an exercise for the reader. In the specific case of MDPs, the poste-
rior calculation is easy to perform incrementally. This also more clearly demonstrates
why there is no dependence on the policy. Let ξt be the (random) posterior at time t.
Then the next-step belief is given by

ξ_{t+1}(B) ≜ ξ(B | D_{t+1}) = ∫_B P^π_μ(D_{t+1}) dξ(μ) / ∫_M P^π_μ(D_{t+1}) dξ(μ)
           = ∫_B P_μ(s_{t+1}, r_t | s_t, a_t) π(a_t | s^t, a^{t−1}, r^{t−1}) dξ(μ | D_t) / ∫_M P_μ(s_{t+1}, r_t | s_t, a_t) π(a_t | s^t, a^{t−1}, r^{t−1}) dξ(μ | D_t)
           = ∫_B P_μ(s_{t+1}, r_t | s_t, a_t) dξ_t(μ) / ∫_M P_μ(s_{t+1}, r_t | s_t, a_t) dξ_t(μ).

1The exception involves any type of inference where Pπμ (Dt ) is not directly available. This includes
methods of approximate Bayesian computation [1], that use trajectories from past policies for
approximation. See [2] for an example of this in reinforcement learning.

The above calculation is easy to perform for arbitrarily complex MDPs when the
set M is finite. The posterior calculation is also simple under certain conjugate
priors, such as the Dirichlet-multinomial prior for transition distributions.
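As a concrete example, for a finite MDP with unknown transitions and a product of Dirichlet priors, the posterior update reduces to counting observed transitions. The following sketch (our own minimal illustration, ignoring reward uncertainty) maintains such a belief and can also sample MDPs from it.

import numpy as np

class DirichletBelief:
    # Independent Dirichlet posteriors over P(. | s, a) for a finite MDP.
    def __init__(self, n_states, n_actions, prior_count=1.0):
        self.counts = np.full((n_states, n_actions, n_states), prior_count)

    def update(self, s, a, s_next):
        # conjugate posterior update: increment the transition count
        self.counts[s, a, s_next] += 1.0

    def expected_mdp(self):
        # mean transition kernel under the current belief
        return self.counts / self.counts.sum(axis=2, keepdims=True)

    def sample_mdp(self, rng=None):
        # draw one transition kernel from the posterior
        rng = rng or np.random.default_rng()
        n_states, n_actions, _ = self.counts.shape
        return np.array([[rng.dirichlet(self.counts[s, a])
                          for a in range(n_actions)]
                         for s in range(n_states)])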

9.2 Finding Bayes-Optimal Policies

The problem of policy optimization in the Bayesian case is much harder than when
the MDP is known. This is simply because we have to consider history dependent
policies, which makes the policy space much larger.
In this section, we first consider two simple heuristics for finding sub-optimal poli-
cies. Then we show how policy gradient methods can be extended to the Bayesian
case to obtain Bayes-optimal policies within a parametrized policy class. We pro-
ceed considering finite look-ahead backwards induction to approximate the Bayes-
optimal policy. Generally, backwards induction in this setting requires building an
exponential-size tree. However, upper and lower bounds on the value function can
be used to create a branch and bound algorithm to improve efficiency. We end this
section introducing two methods to construct such bounds, and discuss their relation
to one of the best-known Bayesian methods, posterior sampling.

9.2.1 The Expected MDP Heuristic

One simple heuristic is to simply calculate the expected MDP μ̂(ξ) ≜ E_ξ μ for the current belief ξ. In particular, the transition kernel of the expected MDP is simply the expected transition kernel:

P_{μ̂(ξ)}(s' | s, a) = ∫_M P_μ(s' | s, a) dξ(μ).

Then we simply calculate the optimal memoryless policy for μ̂(ξ), that is,

π*(μ̂(ξ)) ∈ arg max_{π∈Π₁} V^π_{μ̂(ξ)},

where Π₁ = {π ∈ Π | P^π(a_t | s^t, a^{t−1}) = P^π(a_t | s_t)} is the set of Markov policies. The policy π*(μ̂(ξ)) is executed on the real MDP. Algorithm 9.1 shows the pseudocode for this heuristic. One important detail is that the policy update schedule only generates the k-th policy at step T_k. This is useful to ensure policies remain consistent, as small changes in the mean MDP may create a large change in the resulting policy. It is natural to have T_k − T_{k−1} in the order of 1/(1 − γ) for discounted problems,
or simply the length of the episode for episodic problems. In the undiscounted case,
switching policies whenever sufficient information has been obtained to significantly change the belief gives good performance guarantees, as we shall see in Chap. 10.

Fig. 9.2 The two MDPs μ1, μ2 and the expected MDP μ̂(ξ) from Example 9.2, each drawn as a small transition diagram with rewards 0 and 1

Algorithm 9.1 The expected MDP heuristic


input prior ξ0 , update schedule {Tk }
for k = 1, . . . do
π_k ≈ π*(μ̂(ξ)) ∈ arg max_{π∈Π₁} V^π_{μ̂(ξ)}
for t = Tk−1 + 1, . . . , Tk do
Observe st .
Update belief ξt (·) = ξt−1 (· | st , at−1 , rt−1 , st−1 ).
Take action at ∼ πk (· | st ).
Observe reward rt .
end for
end for

Unfortunately, the policy returned by this heuristic may be far from the Bayes-
optimal policy in Π1 , as shown by the following example.

Example 9.2 (Counterexample to Algorithm 9.1 based on an example by Remi


Munos) As illustrated in Fig. 9.2, let M = {μ1 , μ2 } be the set of MDPs, and the
belief is ξ(μ1 ) = θ, ξ(μ2 ) = 1 − θ. All transitions are deterministic, and there are
two actions, the blue and the red action. We see that in the expected MDP the state
with reward 1 is reachable, and that μ̂(ξ) ∉ M. One can compute that even when T → ∞, the μ̂(ξ)-optimal policy is not optimal in Π₁, if

γθ(1 − θ) / (1 − γ) < 1/(1 − γθ) + 1/(1 − γ(1 − θ)).
Fig. 9.3 The MDP μ_i from Example 9.3: action a = 0 yields reward ε, action a = i yields reward 1, and every other action a ∈ {1, . . . , n} \ {i} yields reward 0

9.2.2 The Maximum MDP Heuristic

An alternative idea is to simply pick the maximum-probability MDP, as shown in


Algorithm 9.2. This at least guarantees that the MDP for which one chooses the
optimal policy is actually within the set of MDPs. However, it may still be the case
that the resulting policy is sub-optimal, as shown by the following example.

Algorithm 9.2 The maximum MDP heuristic


input prior ξ0 , update schedule {Tk }
for k = 1, . . . do
π_k ≈ arg max_π E^π_{μ*(ξ_{T_k})} U.
for t = 1 + Tk−1 , . . . , Tk do
Observe st .
Update belief ξt (·) = ξt−1 (· | st , at−1 , rt−1 , st−1 ).
Take action at ∼ πk (· | st ).
Observe reward rt .
end for
end for

Example 9.3 (Counterexample to Algorithm 9.2) As illustrated in Fig. 9.3 let


M = {μ_i | i = 1, . . . , n} be the set of MDPs where A = {0, . . . , n}. In all MDPs, action 0 gives a reward of ε. In each MDP μ_i, action i gives reward 1, and all remaining actions give a reward of 0. For any action a, the MDP terminates after an action is chosen and the reward received. Now if ξ(μ_i) < ε for all i, then it is optimal to choose action 0, while Algorithm 9.2 would pick the sub-optimal action arg max_i ξ(μ_i).

9.2.3 Bayesian Policy Gradient

Policy gradient (see Sect. 8.4) can also be performed in the Bayesian setting. For
this, we must restrict our policies to a parametrized policy space so that we can
differentiate them:
ΠΘ  {πθ | θ ∈ Θ}

The general idea is to find the policy parameter maximizing expected utility under
the current belief, i.e. solve the problem

max_θ E^π_ξ[U],

where the utility is defined with respect to some starting state distribution (see Eq. 8.4.2). A policy gradient algorithm would simply move in the direction of the gradient, which can be computed as

∇_θ E^π_ξ[U] = ∇_θ ∫_M E^π_μ[U] dξ(μ) = ∫_M ∇_θ E^π_μ[U] dξ(μ) ≈ (1/n) Σ_{i=1}^n ∇_θ E^π_{μ^{(i)}}[U],    μ^{(i)} ∼ ξ(μ).

Here, the integral is approximated by sampling MDPs from the belief, and ∇θ Eπμ(i) [U ]
is the standard policy gradient for a given MDP μ(i) . Approximations to the gradient
can be computed via rollouts.
An interesting question is how to define the policy parametrization. In order for a
policy to be adaptive it must take into account the complete history of observations.
This can be achieved even for simple policies, as long as we have a statistic mapping
histories to a rich enough representation. This is detailed in the following example.
Example 9.4 (Policies defined on a statistic φ : H × A → Rk×|A| ) Define history
h t = s1 , a1 , r1 , . . . , st and a history-dependent sigmoid policy:

π_θ(a | h_t) ∝ exp( φ(h_t, a)ᵀ θ )
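A hedged sketch of the sampled-MDP approximation of this gradient is given below; sample_mdp and policy_gradient_in_mdp stand for a belief sampler and any per-MDP gradient estimator (e.g. a rollout-based one as in Sect. 8.4), both assumed to be supplied by the user.

import numpy as np

def bayesian_policy_gradient(theta, sample_mdp, policy_gradient_in_mdp, n_mdps=10):
    # Approximate grad_theta E_xi[U] by averaging per-MDP policy gradients
    # over MDPs sampled from the current belief xi.
    grads = [policy_gradient_in_mdp(sample_mdp(), theta) for _ in range(n_mdps)]
    return np.mean(grads, axis=0)

def gradient_step(theta, grad, step_size=0.01):
    # move the policy parameters in the direction of the estimated gradient
    return theta + step_size * grad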

9.2.4 The Belief-Augmented MDP

The most direct way to actually solve the Bayesian reinforcement learning problem
of (9.1.1) is to cast it as a yet another MDP. We have already seen how this can
be done with bandit problems in Sect. 6.2.2, but we shall now see that the general
methodology is also applicable to MDPs.
Fig. 9.4 The belief-augmented MDP: a the complete MDP model, with states s_t, beliefs ξ_t, and actions a_t; b compact form of the model in terms of hyper-states ψ_t

We are given an initial belief ξ0 on a set of MDPs M. Each μ ∈ M is a tuple


(S, A, Pμ , ρ), with state space S, action space A, transition kernel Pμ and reward
function ρ : S × A → R. Let st , at , rt be the state, action, and reward observed in
the original MDP and ξt be our belief over MDPs μ ∈ M at step t. Note that the
marginal next-state distribution is

P(s_{t+1} ∈ S | ξ_t, s_t, a_t) ≜ ∫_M P_μ(s_{t+1} ∈ S | s_t, a_t) dξ_t(μ),

while the next belief deterministically depends on the next state, i.e.,

ξt+1 (·)  ξt (· | st+1 , st , at ).

We now construct an augmented Markov decision process (Ψ, A, P, ρ ) with the


state space Ψ = S × Ξ being the product of the original MDP states S and possible
beliefs Ξ . The transition distribution is given by

P(ψt+1 | ψt , at ) = P(ξt+1 , st+1 | ξt , st , at )


= P(ξt+1 | ξt , st+1 , st , at )P(st+1 | ξt , st , at ),

where P(ξt+1 | ξt , st+1 , st , at ) is the singular distribution centred on the posterior


distribution ξt (· | st+1 , st , at ).
This construction is illustrated in Fig. 9.4. The optimal policy for the augmented
MDP is the ξ-optimal policy in the original MDP. The augmented MDP has a pseudo-
tree structure (since belief states might repeat), as shown in the following example.

Example 9.5 Consider a set of MDPs M with A = {1, 2}, S = {1, 2}. In general
for any hyper-state ψt = (st , ξt ) each possible action-state transition results in one
specific new hyper-state. This is illustrated for the specific example in the following
diagram.
[Diagram: from a hyper-state ψ_t, each choice of action a_t ∈ {1, 2} followed by a next state s_{t+1} ∈ {1, 2} leads to one of four distinct successor hyper-states ψ¹_{t+1}, . . . , ψ⁴_{t+1}.]
When the branching factor is very large, or when we need to deal with very large
tree depths, it becomes necessary to approximate the MDP structure.

9.2.5 Branch and Bound

Branch and bound is a general technique for solving large problems. It can be applied
in all cases where upper and lower bounds on the value of solution sets can be
found. For Bayesian reinforcement learning, we can consider upper and lower bounds
q + and q − on Q ∗ in the belief-augmented MDP (BAMDP). That is,

q + (ψ, a) ≥ Q ∗ (ψ, a) ≥ q − (ψ, a)


v⁺(ψ) = max_{a∈A} q⁺(ψ, a),    v⁻(ψ) = max_{a∈A} q⁻(ψ, a).

Let us now consider an incremental expansion of the BAMDP so that, starting


from some hyperstate ψt , we create the BAMDP tree for all subsequent states
ψt+1 , ψt+2 , . . .. For any leaf node ψt  = (ξt  , st  ) in the tree, we can define upper
and lower value function bounds (cf. Eq. 9.2.1 below) via

v⁻(ψ_{t'}) = V^{π(ξ_{t'})}_{ξ_{t'}}(s_{t'}),    v⁺(ψ_{t'}) = V⁺_{ξ_{t'}}(s_{t'}),

where π(ξ_{t'}) can be any approximately optimal policy for ξ_{t'}. Using backwards induc-
tion, we can calculate tighter upper q + and lower bounds q − for all non-leaf hyper-
states by
  
q⁺(ψ_t, a_t) = Σ_{ψ_{t+1}} P(ψ_{t+1} | ψ_t, a_t) [ ρ(ψ_t, a_t) + γ v⁺(ψ_{t+1}) ],
q⁻(ψ_t, a_t) = Σ_{ψ_{t+1}} P(ψ_{t+1} | ψ_t, a_t) [ ρ(ψ_t, a_t) + γ v⁻(ψ_{t+1}) ].

We can then use the upper bounds to expand the tree (i.e., to select actions in the
tree that maximize v + ) while the lower bounds can be used to select the final policy.

Sub-optimal branches can be discarded once their upper bounds become lower than
the lower bound of some other branch.
Remark 9.2.1 If q⁻(ψ, a) ≥ q⁺(ψ, a') then a' is sub-optimal at ψ.
However, such an algorithm is only possible to implement when the number of
possible MDPs and states are finite. We can generalize this to the infinite case by
applying stochastic branch and bound methods [3, 4]. This involves estimating upper
and lower bounds on the values of leaf nodes through Monte Carlo sampling.

9.2.6 Bounds on the Expected Utility

Bounds on the Bayes-expected utility can serve as a guideline when trying to find
a good policy. Accordingly, in this and the following section we aim at obtaining
respective upper and lower bounds. First, note that given a belief ξ and a policy π,
the respective conditional expected utility is defined as follows.

Definition 9.2.1 (Bayesian value function π for a belief ξ)

Vξπ (s)  Eπξ (U | s)

It is easy to see that the Bayes value function of a policy is simply the expected value function under ξ:

V^π_ξ(s) = ∫_M E^π_μ(U | s) dξ(μ) = ∫_M V^π_μ(s) dξ(μ).

However, the Bayes-optimal value function is not equal to the expected value
function of the optimal policy for each MDP. In fact, the Bayes-value of any policy
is a natural lower bound on the Bayes-optimal value function, as the Bayes-optimal
policy is the maximum by definition. We can however use the expected optimal value
function as an upper bound on the Bayes-optimal value, that is,

V^*_ξ ≜ sup_π E^π_ξ(U) = sup_π ∫_M E^π_μ(U) dξ(μ)
     ≤ ∫_M sup_π V^π_μ dξ(μ) = ∫_M V^*_μ dξ(μ) ≜ V^+_ξ.

Given the previous development, it is easy to see that the following inequalities
always hold, giving us upper and lower bounds on the value function:

V^π_ξ ≤ V^*_ξ ≤ V^+_ξ,   ∀π   (9.2.1)



Fig. 9.5 A geometric view of the bounds. Here we plot the expected value of two policies, π1 , π2
and the policy π ∗ (ξ1 ) that is optimal for ξ1 , as well as the Bayes-optimal value function Vξ∗

These bounds are geometrically demonstrated in Fig. 9.5. They are entirely anal-
ogous to the Bayes bounds of Sect. 3.3.1, with the only difference being that we are
now considering complete policies rather than simple decisions.
Thompson sampling and upper bounds. The upper bound on the value function can be easily approximated through Monte Carlo sampling, as shown in Algorithm 9.4 (cf. also the respective policy evaluation Algorithm 9.3). In fact, for K = 1, this method is equivalent to Thompson sampling [5], which was first used in the context of Bayesian reinforcement learning by [6]. In Thompson sampling, we sample a single MDP from the belief and then act optimally with respect to it, until some exploration condition is met. This is a good exploration heuristic with formal performance guarantees for bandit problems [7, 8]. However, obtaining lower bounds requires estimating good policies. An algorithm for doing this through backwards induction will be explained in the following.

Algorithm 9.3 Bayesian Monte Carlo policy evaluation

input policy π, belief ξ
for k = 1, . . . , K do
  μ_k ∼ ξ
  v_k = V^π_{μ_k}
end for
u = (1/K) Σ_{k=1}^K v_k
return u

Algorithm 9.4 Bayesian Monte Carlo upper bound

input belief ξ
for k = 1, . . . , K do
  μ_k ∼ ξ
  v_k = V^*_{μ_k}
end for
u* = (1/K) Σ_{k=1}^K v_k
return u*
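A minimal Python sketch of Algorithms 9.3 and 9.4 follows (ours); `sample_mdp_from_belief`, `policy_value` and `optimal_value` are hypothetical helpers standing in for the belief sampler and an MDP solver.

```python
import numpy as np

# Sketch of Algorithms 9.3 and 9.4.  `sample_mdp_from_belief` draws an MDP
# mu ~ xi, `policy_value` computes V^pi_mu, and `optimal_value` computes V*_mu
# (e.g. by value iteration on the sampled MDP).

def mc_policy_evaluation(sample_mdp_from_belief, policy_value, policy, K, rng):
    """Monte Carlo estimate of the Bayes value of `policy` (Algorithm 9.3)."""
    values = [policy_value(sample_mdp_from_belief(rng), policy) for _ in range(K)]
    return np.mean(values)

def mc_upper_bound(sample_mdp_from_belief, optimal_value, K, rng):
    """Monte Carlo estimate of the upper bound V+_xi (Algorithm 9.4).

    For K = 1 this reduces to the value used by Thompson sampling: a single
    MDP is drawn and treated as if it were the true model.
    """
    values = [optimal_value(sample_mdp_from_belief(rng)) for _ in range(K)]
    return np.mean(values)
```

Any lower-bound estimate from Algorithm 9.3 with a concrete policy can then be compared against this upper bound, as in (9.2.1).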

9.2.7 Estimating Lower Bounds on the Value Function with Backwards Induction

A lower bound on the value function is useful to tell us how tight our upper bounds
are. It is possible to obtain one by evaluating any arbitrary policy. So, tighter lower
bounds can be obtained by finding better policies, something that was explored
by [9, 10].
In particular, we can consider the problem of finding the best memoryless policy.
This involves two approximations. Firstly, approximating our belief over MDPs with
a sample over a finite set of n MDPs. Secondly, assuming that the belief is nearly
constant over time, and performing backwards induction on those n MDPs simultane-
ously. While this greedy procedure might not find the optimal memoryless policy, it
still improves the lower bounds considerably (Fig. 9.6).
The central step of backwards induction over multiple MDPs is summarized by
the following equation, which simply involves calculating the expected utility of a
particular policy over all MDPs:

Q^π_{ξ,t}(s, a) ≜ ∫_M ( r̄_μ(s, a) + γ ∫_S V^π_{μ,t+1}(s′) dP_μ(s′ | s, a) ) dξ(μ)   (9.2.2)

The algorithm greedily performs backwards induction as shown in Algorithm 9.5.


However, this is not an optimal procedure, since the belief at any time-step t is not
constant. Indeed, even though the policy is memoryless, ξ(μ | s_t, π) ≠ ξ(μ | s_t, π′).
This is because the probability of being at a particular state is different under different
policies and at different time-steps (e.g. if you consider periodic MDPs). For the same
reason, this type of backwards induction may not converge as value iteration does, but can
exhibit cyclic convergence similar to the cyclic equilibria in Markov games [11].

[Figure 9.6: plot of the expected utility over all states (y-axis; naive bound, tighter bound, and upper bound curves) against increasing certainty of the belief (x-axis).]
Fig. 9.6 Illustration of the improved bounds. The naive and tighter bound refer to the lower bound obtained by calculating the value of the policy that is optimal for the expected MDP and that obtained by calculating the value of the MMBI policy, respectively. The upper bound is V^+_ξ. The horizontal axis refers to our belief: at the left edge, our belief is uniform over all MDPs, while at the right edge, we are certain about the true MDP

Algorithm 9.5 Multi-MDP backwards induction (MMBI)

1: input M, ξ, γ, T
2: Set V^π_{μ,T+1}(s) = 0 for all s ∈ S.
3: for t = T, T − 1, . . . , 0 do
4:   for s ∈ S, a ∈ A do
5:     Calculate Q^π_{ξ,t}(s, a) from (9.2.2) using {V^π_{μ,t+1}}.
6:   end for
7:   for s ∈ S do
8:     Choose π_t(s) ∈ arg max_{a∈A} Q_{ξ,t}(s, a).
9:     for μ ∈ M do
10:      Set V^π_{μ,t}(s) = Q^π_{μ,t}(s, π_t(s)).
11:    end for
12:  end for
13: end for
14: return π, Q_ξ
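A minimal Python sketch of MMBI for finite MDPs follows (ours), assuming the belief is approximated by uniform weights over a finite sample of MDPs given as reward and transition arrays.

```python
import numpy as np

def mmbi(rewards, transitions, gamma, T):
    """Multi-MDP backwards induction (Algorithm 9.5), sketched for a uniform
    belief over a finite sample of MDPs.

    rewards:     array of shape (K, S, A), mean rewards r_mu(s, a) per sampled MDP
    transitions: array of shape (K, S, A, S), P_mu(s' | s, a) per sampled MDP
    Returns the memoryless policy pi[t, s] and the Bayes Q-values Q[t, s, a].
    """
    K, S, A = rewards.shape
    V = np.zeros((K, S))                  # V^pi_{mu, t+1}, zero at horizon T+1
    pi = np.zeros((T + 1, S), dtype=int)
    Q_xi = np.zeros((T + 1, S, A))
    for t in range(T, -1, -1):
        # Q^pi_{mu,t}(s,a) = r_mu(s,a) + gamma * sum_{s'} P_mu(s'|s,a) V^pi_{mu,t+1}(s')
        Q_mu = rewards + gamma * np.einsum('ksan,kn->ksa', transitions, V)
        Q_xi[t] = Q_mu.mean(axis=0)       # expectation over the (uniform) belief
        pi[t] = Q_xi[t].argmax(axis=1)    # greedy memoryless action per state
        V = Q_mu[np.arange(K)[:, None], np.arange(S)[None, :], pi[t][None, :]]
    return pi, Q_xi
```

For K = 1 the procedure reduces to solving the single sampled MDP, which is exactly what Thompson sampling would execute.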

In practice, we maintain a belief over an infinite set of MDPs, such as the class of all discrete MDPs with a certain number of states and actions. To apply this algorithm in this case, we can sample a finite number of MDPs from the current belief and then find the optimal policy for this sample, as shown in Algorithm 9.6. For K = 1, this is also equivalent to Thompson sampling. However, as we can see in Fig. 9.7, Algorithm 9.6 performs better when the number of samples is increased.

Algorithm 9.6 Monte Carlo Bayesian Reinforcement Learning

At time t, sample K MDPs μ_1, . . . , μ_K from ξ_t.
Calculate the best memoryless policy π ≈ arg max_{π∈Π_1} Σ_{k=1}^K V^π_{μ_k} with respect to the sample.
Execute π until a termination condition is met.

[Figure 9.7: plot of regret (y-axis) against the number of samples n (x-axis) for MCBRL and the Exploit heuristic.]
Fig. 9.7 Comparison of the regret between the expected MDP heuristic and sampling with Multi-MDP backwards induction for the chain environment (Example 7.2). The error bars show the standard error of the average regret
9.2.8 Further Reading

One of the first treatments of Bayesian reinforcement learning is due to [12]. Although
the idea was well-known in the statistical community [13], the first incursion of the
idea of Bayes-adaptive policies in reinforcement learning was achieved by Duff’s
thesis [14]. Most recent advances in Bayes-adaptive policies involve the use of intel-
ligent methods for exploring the tree, such as sparse sampling [31] and Monte Carlo
tree search [15].
Instead of sampling MDPs, one could sample beliefs, which leads to a finite hyper-
state approximation of the complete belief MDP. One such approach is BEETLE [9,
16], which examines a set of possible future beliefs and approximates the value of
each belief with a lower bound. In essence, it then creates the set of policies which
are optimal with respect to these bounds.
Another idea is to take advantage of the expectation-maximization view of rein-
forcement learning [32]. This makes it possible to apply a host of different probabilistic infer-
ence algorithms. This approach was investigated by [17].

9.3 Bayesian Methods in Continuous Spaces

Formally, Bayesian reinforcement learning in continuous state spaces is not


significantly different from the discrete case. Typically, we assume that the agent
acts within a fully observable discrete-time Markov decision process, with a metric
state space S, for example S ⊂ Rd . The action space A itself can be either discrete
or continuous. The transition kernel can be defined as a collection of probability
measures on the continuous state space, indexed by (s, a) as
212 9 Bayesian Reinforcement Learning

P_μ(S | s, a) ≜ P_μ(s_{t+1} ∈ S | s_t = s, a_t = a),   S ⊂ S.

There are a number of transition models one can consider for the continuous case.
For the purposes of this textbook, we shall limit ourselves to the relatively simple
case of linear-Gaussian models.

9.3.1 Linear-Gaussian Transition Models

The simplest type of transition model for an MDP defined on a continuous state
space is a linear-Gaussian model, which also results in a closed form posterior calcu-
lation due to the conjugate prior. While typically the real system dynamics may not
be linear, one can often find some mapping f : S → X to a k-dimensional vector
space X such that the dynamics of the transformed state x_t ≜ f(s_t) at time t may
be well-approximated by a linear system. Then the next state st+1 is given by the
output of a function g : X × A → S of the transformed state, the action, and some
additive noise εt , i.e.,
st+1 = g(x t , at ) + εt .

When g is linear and ε_t is normally distributed, this corresponds to a multivariate linear-Gaussian model. In particular, we can parametrize g with a set of k × k design matrices {A_i | i ∈ A}, such that g(x_t, a_t) = A_{a_t} x_t. We can also define a set of covariance matrices {V_i | i ∈ A} for the noise distribution and define the next state distribution by

s_{t+1} | x_t = x, a_t = i ∼ N(A_i x, V_i).

That is, the next state is drawn from a normal distribution with mean Ai x and
covariance matrix V i .
In order to model our uncertainty with a (subjective) prior distribution ξ, we have
to specify the model structure. Fortunately, in this particular case, a conjugate prior
exists in the form of the matrix-normal distribution for A and the inverse-Wishart
distribution for V . Given V i , the distribution for Ai is matrix-normal, while the
marginal distribution of V i is inverse-Wishart. More specifically, the dependencies
are as follows:

A_i | (V_i = V̂) ∼ φ(A_i | M, C, V̂),   (9.3.1)
V_i ∼ ψ(V_i | W, n),   (9.3.2)

where φ is the prior distribution on dynamics matrices, conditional on the covariance and two prior parameters: M, which is the prior mean, and C, which is the prior output (dependent variable) covariance. Finally, ψ is the marginal prior on covariance matrices, which is an inverse-Wishart distribution with parameters W and n. The analytical form of the distributions is given by:

φ(A_i | M, C, V) ∝ exp( −½ trace[ (A_i − M)^⊤ V_i^{-1} (A_i − M) C^{-1} ] ),

ψ(V_i | W, n) ∝ |V_i^{-1} W|^{n/2} exp( −½ trace(V_i^{-1} W) ).
Essentially, the considered setting is an extension of the univariate Bayesian linear
regression model (see for example [13]) to the multivariate case via vectorization
of the mean matrix. Since the prior is conjugate, it is relatively simple to calculate
posterior values of the parameters after each observation. While we omit the details,
a full description of inference using this model is given by [18].
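As an illustration only (our sketch and notation, not the exact formulation of [18]), the following code performs the conjugate posterior update for the parameters of a single action, following the standard multivariate Bayesian linear regression update with a matrix-normal inverse-Wishart prior; the parametrization (M0, C0, W0, n0) and all names are ours.

```python
import numpy as np

def posterior_update(X, Y, M0, C0, W0, n0):
    """Conjugate update for s_{t+1} = A x + eps, eps ~ N(0, V), for one action.

    X: (N, k) matrix of transformed states x_t; Y: (N, k) matrix of next states.
    Prior: A | V ~ MatrixNormal(M0, V, C0),  V ~ InverseWishart(W0, n0).
    Returns posterior parameters (Mn, Cn, Wn, nn) of the same form.
    """
    C0_inv = np.linalg.inv(C0)
    S = X.T @ X + C0_inv                     # posterior input-side precision
    Cn = np.linalg.inv(S)
    Mn = (Y.T @ X + M0 @ C0_inv) @ Cn        # posterior mean of A
    Wn = W0 + Y.T @ Y + M0 @ C0_inv @ M0.T - Mn @ S @ Mn.T
    nn = n0 + X.shape[0]
    return Mn, Cn, Wn, nn
```

A sampled model for Thompson-style planning could then be drawn by first sampling V from the inverse-Wishart posterior and then A from the matrix-normal with row covariance V and column covariance Cn.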
Further reading. More complex transition models include the non-parametric extension of the above model, namely Gaussian processes (GP) [33]. For an n-dimensional state space, one typically applies independent GPs for predicting each state coordinate, i.e., f_i : R^n → R. As this completely decouples the state dimensions, it is best to consider a joint model, but this requires various approximations (cf. e.g., [19]).
A well-known method for model-based Gaussian process reinforcement learning is
GP-Rmax of [34], which has been recently shown by [20] to be KWIK-learnable.2
Another straightforward extension of linear models are piecewise linear models,
which can be described in a Bayesian non-parametric framework [21]. This avoids
the computational complexity that is introduced when using GPs.

9.3.2 Approximate Dynamic Programming

Bayesian methods are also frequently used as part of a dynamic programming


approach. Typically, this requires maintaining a distribution over value functions
in some sense. For continuous state spaces particularly, one can e.g. assume that the
value function v is drawn from a Gaussian process. However, to perform inference
we also need to specify some generative model for the observations.
Temporal differences. Engel et al. [22] consider temporal differences from a
Bayesian perspective in conjunction with a GP model, so that the rewards are dis-
tributed as

r_t | v, s_t, s_{t+1} ∼ N( v(s_t) − γ v(s_{t+1}), σ )

for any time step t. This essentially gives a simple model for sequences of rewards
and states P(r T |v, s T ). We can now write the posterior as ξ(v | r T , s T ) ∝ P(r T |
v, s T ) ξ(v), where the dependence ξ(v|s T ) is suppressed. This model was later
updated by [23] using the reward distribution

r_t | v, s_t, s_{t+1} ∼ N( v(s_t) − γ v(s_{t+1}), N(s_t, s_{t+1}) ),

2Informally, a class is KWIK-learnable if the number of mistakes made by the algorithm is poly-
nomially bounded in the problem parameters. In the context of reinforcement learning this would
be the number of steps for which no guarantee of utility can be provided.

where N(s, s′) ≜ ΔU(s) − γΔU(s′) with ΔU(s) ≜ U(s) − v(s) denoting the distri-
bution of the residual, i.e., the utility when starting from s minus its expectation. The
correlation between U (s) and U (s  ) is captured via N , and the residuals are modelled
as a Gaussian process. While the model is still an approximation, it is equivalent to
performing GP regression using Monte Carlo samples of the discounted return.
Bayesian finite-horizon dynamic programming for deterministic systems. Instead
of using an approximate model, [24] employ a series of GPs, each for one dynamic
programming stage, under the assumption that the dynamics are deterministic and
the rewards are Gaussian-distributed. It is possible to extend this approach to the case
of non-deterministic transitions, at the cost of requiring additional approximations.
However, since a lot of real-world problems do in fact have deterministic dynamics,
the approach is consistent.
Bayesian least-squares temporal differences. Tziortziotis and Dimitrakakis [25]
instead consider a model for the value function itself, where the random quantity is
the empirical transition matrix P̂ rather than the reward (which can be assumed to
be known):
P̂v | v, P ∼ N (Pv, β I )

This model makes a different trade-off in its distributional assumptions. It allows


us to model the uncertainty about P in a Bayesian manner, but instead of explicitly
modelling this as a distribution on P itself, we are modelling a distribution on the
resulting Bellman operator.
Gradient methods. As we saw in Sect. 9.2.3, if we are able to sample from the
posterior distribution, we can leverage stochastic gradient descent methods to extend
any gradient algorithm for reinforcement learning with a given model to the Bayesian
setting. In the continuous MDP setting, [26] used Gaussian process models to sample
from the posterior. However, other methods could be used, including neural networks
or bootstrapping.

9.4 Partially Observable Markov Decision Processes

In most real world applications the state st of the system at time t cannot be observed
directly. Instead, we obtain some observation xt , which depends on the state of the
system. While this does give us some information about the system state, it is in
general not sufficient to pinpoint it exactly. This idea can be formalized as a partially
observable Markov decision process (POMDP).
Definition 9.4.1 (POMDP) A partially observable Markov decision process
(POMDP) μ is a tuple (X , S, A, P, y) where X is an observation space, S is a
state space, A is an action space, P is a conditional distribution on observations,
states and rewards and y is a starting state distribution. The reward, observation and
next state are Markov with respect to the current state and action:

Pμ (st+1 , rt , xt | st , at , . . .) = P(st+1 | st , at )P(xt | st )P(rt | st )

Here P(s_{t+1} | s_t, a_t) is the transition distribution, giving the probabilities of next states given the current state and action. P(x_t | s_t) is the observation distribution, giving the probabilities of different observations given the current state. Finally, P(r_t | s_t) is the reward distribution, which we make dependent only on the current state for simplicity.

Partially observable Markov decision process

The following graphical model illustrates the dependencies in a POMDP. [Graphical model not reproduced: the states s_t form a Markov chain driven by the actions a_t, and each state emits an observation x_t and a reward r_t.]

9.4.1 Solving Known POMDPs

When we know a POMDP’s parameters, that is to say, when we know the transition,
observation and reward distributions, the problem is formally the same as solving
an unknown MDP. In particular, we can similarly define a belief state summarizing
our knowledge. This takes the form of a probability distribution on the hidden state
variable st rather than on the model μ. If μ defines starting state probabilities, then
the belief is not subjective, as it only relies on the actual POMDP parameters. The
transition distribution on states given our belief is as follows.

Belief ξ
For any distribution ξ on S, we define

ξ(s_{t+1} | a_t, μ) ≜ ∫_S P_μ(s_{t+1} | s_t, a_t) dξ(s_t).

When there is no ambiguity, we shall use ξ to denote arbitrary marginal distri-


butions on states and state sequences given the belief ξ.

When the model μ is given, calculating a belief update is not particularly difficult,
but we must take care to properly use the time index t. Starting from Bayes’ theorem,
it is easy to derive the belief update from ξt to ξt+1 as follows.

Belief update

ξ_{t+1}(s_{t+1} | μ) ≜ ξ_t(s_{t+1} | x_{t+1}, r_{t+1}, a_t, μ)
                    = P_μ(x_{t+1}, r_{t+1} | s_{t+1}) ξ_t(s_{t+1} | a_t, μ) / ξ_t(x_{t+1} | a_t, μ),

ξ_t(s_{t+1} | a_t, μ) = ∫_S P_μ(s_{t+1} | s_t, a_t) dξ_t(s_t),

ξ_t(x_{t+1} | a_t, μ) = ∫_S P_μ(x_{t+1} | s_{t+1}) dξ_t(s_{t+1} | a_t, μ).

A particularly attractive setting is when the model is finite. Then the sufficient
statistic also has finite dimension and all updates are in closed form.
Remark 9.4.1 If S, A, X are finite, then we can define a sequence of vectors p_t ∈ [0, 1]^{|S|} and matrices A_t as

p_t(j) = P(x_t | s_t = j),
A_t(i, j) = P(s_{t+1} = j | s_t = i, a_t).

Writing b_t(i) for ξ_t(s_t = i), we can then use Bayes' theorem to obtain

b_{t+1} = diag(p_{t+1}) A_t^⊤ b_t / ( p_{t+1}^⊤ A_t^⊤ b_t ).
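A compact implementation of this filter for finite state and observation spaces might look as follows (our sketch; arrays are indexed as in Remark 9.4.1, and the example numbers are made up).

```python
import numpy as np

def belief_update(b, A, P_obs, x_next):
    """One step of the discrete POMDP belief filter of Remark 9.4.1.

    b:      current belief over states, shape (S,)
    A:      transition matrix for the chosen action, A[i, j] = P(s'=j | s=i, a)
    P_obs:  observation matrix, P_obs[j, x] = P(x | s=j)
    x_next: index of the observation received after acting
    """
    p = P_obs[:, x_next]              # p(j) = P(x_next | s'=j)
    unnormalised = p * (A.T @ b)      # predict with A, then weight by likelihood
    return unnormalised / unnormalised.sum()

# Tiny usage example with two states and two observations (made-up numbers).
A = np.array([[0.9, 0.1], [0.2, 0.8]])
P_obs = np.array([[0.7, 0.3], [0.1, 0.9]])
b = np.array([0.5, 0.5])
print(belief_update(b, A, P_obs, x_next=1))
```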

9.4.2 Solving Unknown POMDPs

Solving a POMDP that is unknown is a much harder problem. The basic update
equation for a joint belief on both possible state and possible model is given by

ξ(μ, s t | x t , a t ) ∝ Pμ (x t | s t )Pμ (s t | a t ) ξ(μ).

Unfortunately, even for the simplest possible case of two possible models μ1 , μ2
and binary observations, there is no finite-dimensional representation of the belief at
time t.
Strategies for solving unknown POMDPs include solving the full Bayesian deci-
sion problem, but this requires exponential inference and planning for exact solu-
tions [27]. For this reason, one usually uses approximations.
One very simple approximation involves replacing a POMDP with a variable order Markov decision process, for which inference has only logarithmic computational complexity [28]. The variable order model assumes that the observation probabilities can be decomposed in terms of finite-length contexts. Of course, the memory complexity is still linear. This approach has been used by [15] in combination with a Monte Carlo planner for online decision making with promising results.
In general, finding optimal policies in POMDPs is hard even for restricted classes
of policies [35]. However, approximations [29] and stochastic methods as well as
policy search methods [30, 32] work quite well in practice.

9.5 Relations Between Different Settings

Markov decision processes can be used to model a wide range of different prob-
lems. This section informally (and perhaps sometimes inaccurately) describes the
relationship between different MDP settings we have dealt with so far.
When the state and action spaces are finite, the optimal policy for both finite
horizon and infinite horizon discounted MDPs can be computed in polynomial time
using algorithms like backwards induction or policy iteration (see Chap. 6). However,
in other cases obtaining the optimal policy is far from trivial. In the reinforcement
learning setting, the MDP is not known and must be estimated while acting in it. If the
goal is to maximize expected utility under a prior, the problem becomes a BAMDP.
The BAMDP can be seen as a special case of a POMDP, where the underlying latent
variable is the MDP parameter, which has a fixed value, rather than the state. For that
reason it is generally assumed that BAMDPs have the same complexity as POMDPs.
Note however, that POMDPs with linear-Gaussian dynamics can be solved with the
exact same controller as linear-Gaussian MDPs, by replacing the actual state with
its expected value.
The POMDP problem with discrete state space is similar to a continuous MDP.
This is because we can construct an MDP that uses the POMDP belief state as its

state. This belief state is finite-dimensional and continuous. If the MDP state space
is continuous, then it is in general not possible to decide whether a given policy
is optimal in finite time. However, it is possible to check whether a policy is ε-
optimal under certain regularity conditions on the state space structure. However,
if the POMDP state space is discrete, there is only a finite number of possible next
belief states for any belief state, and the number of policies is also finite for a finite
horizon. Thus, the relationship between these classes is summarized below:

d-MDP ⊆ d-BAMDP ⊆ d-POMDP ⊆ c-MDP

While it is known that d-MDP is P-complete and d-POMDP is PSPACE-complete,


it is unclear if d-BAMDP is in a simpler class, such as NP.3
Finally, the above settings can be generalized to multi-player games. In particular,
an MDP with many players and a different reward function for each player is a
stochastic game (SG). When the game is zero sum, planning is conjectured to remain
in P (although we have not seen a formal proof of this at the time of writing). For
non-zero-sum games, computation of a Nash equilibrium has PPAD complexity.
Partially observable stochastic games (POSG) are in general in an exponential (or
higher) complexity class, even though inference may be simple depending on the
type of game.

9.6 Exercises

Exercise 9.1 Consider the algorithms we have seen in Chap. 8. Are any of those
applicable to belief-augmented MDPs? Outline a strategy for applying one of those
algorithms to the problem. What would be the biggest obstacle we would have to
overcome in your specific example?
Exercise 9.2 Prove Remark 9.1.1.
Exercise 9.3 A practical case of Bayesian reinforcement learning in discrete space
is when we have an independent belief over the transition probabilities of each state-
action pair. Consider the case where we have n states and k actions. Similar to the
product-prior in the bandit case in Sect. 6.2, we assign a probability (density) ξ_{s,a} to the probability vector θ^{(s,a)} ∈ Δ^n. We can then define our joint belief on the (nk) × n matrix Θ to be

ξ(Θ) = ∏_{s∈S, a∈A} ξ_{s,a}(θ^{(s,a)}).

(i) Derive the updates for a product-Dirichlet prior on transitions.


(ii) Derive the updates for a product-Normal-Gamma prior on rewards.
(iii) What would be the meaning of using a Normal-Wishart prior on rewards?

3 The complexity hierarchy satisfies P ⊆ NP ⊆ PSPACE ⊆ EXP, with P ⊂ EXP.



Exercise 9.4 Consider the Gaussian process model of Eq. (9.3.2). What is the
implicit assumption made about the transition model? If this assumption is satis-
fied, what does the corresponding posterior distribution represent?

References

1. Csilléry, K., Blum, M.G.B., Gaggiotti, O.E., François, O.: Approximate Bayesian computation
(ABC) in practice. Trends Ecol. Evol. 25(7), 410–418 (2010)
2. Dimitrakakis, C., Tziortziotis, N.: ABC reinforcement learning. In Proceedings of the 30th
International Conference on Machine Learning, ICML 2013, pp. 684–692 (2013). (JMLR.org)
3. Dimitrakakis, C.: Complexity of stochastic branch and bound methods for belief tree search
in Bayesian reinforcement learning. In: 2nd International Conference on Agents and Artificial
Intelligence (ICAART 2010), pp. 259–264. Springer, Valencia, Spain (2010)
4. Dimitrakakis, C.: Tree exploration for Bayesian RL exploration. In: 2008 International Confer-
ences on Computational Intelligence for Modelling, Control and Automation (CIMCA 2008),
Intelligent Agents, Web Technologies and Internet Commerce (IAWTIC 2008), Innovation in
Software Engineering (ISE 2008), pp. 1029–1034. IEEE Computer Society (2008)
5. Thompson, W.R.: On the likelihood that one unknown probability exceeds another in view of
the evidence of two samples. Biometrika 25(3–4), 285–294 (1933)
6. Strens, M.J.A.: A Bayesian framework for reinforcement learning. In: Proceedings of the Sev-
enteenth International Conference on Machine Learning (ICML 2000), pp. 943–950. Morgan
Kaufmann (2000)
7. Kaufmann, E., Korda, N., Munos, R.: Thompson sampling: an asymptotically optimal finite-
time analysis. In: Algorithmic Learning Theory-23rd International Conference, ALT 2012.
Proceedings. Lecture Notes in Computer Science, vol. 7568, pp. 199–213. Springer (2012)
8. Osband, I., Russo, D., Van Roy, B.: (More) efficient reinforcement learning via posterior sam-
pling. In: Advances in Neural Information Processing Systems, vol. 26, pp. 3003–3011 (2013)
9. Poupart, P., Vlassis, N.A., Hoey, J., Regan, K.: An analytic solution to discrete Bayesian rein-
forcement learning. In: Machine Learning, Proceedings of the 23rd International Conference
(ICML 2006), pp. 697–704. ACM (2006)
10. Dimitrakakis, C.: Robust Bayesian reinforcement learning through tight lower bounds. In: San-
ner, S., Hutter, M. (eds.) Recent Advances in Reinforcement Learning–9th European Workshop,
EWRL 2011. Lecture Notes in Computer Science, vol. 7188, pp. 177–188. Springer (2011)
11. Zinkevich, M., Greenwald, A., Littman, M.L.: Cyclic equilibria in Markov games. In: Advances
in Neural Information Processing Systems, vol. 18, pp. 1641–1648 (2006)
12. Bellman, R.E.: A problem in the sequential design of experiments. Sankhya 16, 221–229 (1957)
13. DeGroot, M.H.: Optimal Statistical Decisions. Wiley (1970)
14. Duff, M.O.: Optimal learning computational procedures for Bayes-adaptive Markov decision
processes. Ph.D. thesis, University of Massachusetts at Amherst (2002)
15. Veness, J., Ng, K.S., Hutter, M., Silver, D.: A Monte Carlo AIXI approximation. Technical
Report 0909.0801 (2009). (arXiv)
16. Poupart, P., Vlassis, N.: Model-based Bayesian reinforcement learning in partially observable
domains. In: International Symposium on Artificial Intelligence and Mathematics, ISAIM 2008
(2008)
17. Furmston, T, Barber, D.: Variational methods for reinforcement learning. In: Proceedings of
the 13th International Conference on Artificial Intelligence and Statistics (AISTATS 2010), pp.
241–248 (2010)
18. Minka, T.P.: Bayesian linear regression. Technical Report, Microsoft research (2000)

19. Álvarez, M., Luengo, D., Titsias, M., Lawrence, N.: Efficient multioutput Gaussian processes
through variational inducing kernels. In: Proceedings of the 13th International Conference on
Artificial Intelligence and Statistics (AISTATS 2010), pp. 25–32 (2010)
20. Grande, R.C., Walsh, T.J., How, J.P.: Sample efficient reinforcement learning with gaussian
processes. In: Proceedings of the 31th International Conference on Machine Learning, ICML
2014, pp. 1332–1340 (2014). (JMLR.org)
21. Tziortziotis, N., Dimitrakakis, C., Blekas, K.: Cover tree Bayesian reinforcement learning. J.
Mach. Learn. Res. 15(1), 2313–2335 (2014)
22. Engel, Y., Mannor, S., Meir, R.: Bayes meets Bellman: the Gaussian process approach to
temporal difference learning. In: Machine Learning, Proceedings of the 20th International
Conference (ICML 2003), pp. 154–161. AAAI Press (2003)
23. Engel, Y., Mannor, S., Meir, R.: Reinforcement learning with gaussian processes. In: Machine
Learning, Proceedings of the 22nd International Conference (ICML 2005), pp. 201–208. ACM
(2005)
24. Deisenroth, M.P., Rasmussen, C.E., Peters, J.: Gaussian process dynamic programming. Neu-
rocomputing 72(7–9), 508–1524 (2009)
25. Tziortziotis, N., Dimitrakakis, C.: Bayesian inference for least squares temporal difference
regularization. In: Machine Learning and Knowledge Discovery in Databases-European Con-
ference, ECML PKDD 2017, Proceedings Part II. Lecture Notes in Computer Science, vol.
10535, pp. 126–141. Springer (2017)
26. Ghavamzadeh, M., Engel, Y.: Bayesian policy gradient algorithms. In: Advances in Neural
Information Processing Systems, vol. 19, pp. 457–464. MIT Press (2006)
27. Ross, S., Chaib-draa, B., Pineau, J.: Bayes-adaptive POMDPs. In: Advances in Neural Infor-
mation Processing Systems, vol. 20, pp. 1225–1232 (2008)
28. Dimitrakakis, C.: Bayesian variable order Markov models. In: Proceedings of the 13th Inter-
national Conference on Artificial Intelligence and Statistics (AISTATS 2010), pp. 161–168
(2010)
29. Spaan, M.T.J., Vlassis, N.: Perseus: randomized point-based value iteration for POMDPs. J.
Artif. Intell. Res. 24(1), 195–220 (2005)
30. Baxter, J., Bartlett, P.L.: Reinforcement learning in POMDP’s via direct gradient ascent. In:
Proceedings of the 17th International Conference on Machine Learning, ICML 2000, pp. 41–
48. Morgan Kaufmann, San Francisco, CA (2000)
31. Wang, T., Lizotte, D., Bowling, M., Schuurmans, D.: Bayesian sparse sampling for on-line
reward optimization. In: Machine Learning, Proceedings of the 22nd International Conference
(ICML 2005). ACM (2005)
32. Toussaint, M., Harmeling, S., Storkey, A.: Probabilistic inference for solving (PO)MDPs. Technical Report EDI-INF-RR-0934, University of Edinburgh, School of Informatics (2006)
33. Rasmussen, C.E., Williams, C.K.I.: Gaussian Processes for Machine Learning. MIT Press
(2006)
34. Jung, T., Stone, P.: Gaussian processes for sample-efficient reinforcement learning with RMAX-
like exploration. In: Machine Learning and Knowledge Discovery in Databases, European
Conference, ECML PKDD 2010. Lecture Notes in Computer Science, vol. 6321, pp. 601–616.
Springer (2010)
35. Vlassis, N., Littman, M.L., Barber, D.: On the computational complexity of stochastic controller
optimization in POMDPs. ACM Trans. Comput. Theory 4(4), 12:1–12:8 (2012)
Chapter 10
Distribution-Free Reinforcement
Learning

10.1 Introduction

The Bayesian framework requires specifying a prior distribution. For many reasons,
we may frequently be unable to do that. In addition, as we have seen, the Bayes-
optimal solution is often intractable. In this chapter we shall take a look at algorithms
that do not require specifying a prior distribution. Instead, they employ the heuristic of
“optimism under uncertainty” to select policies. This idea is very similar to heuristic
search algorithms, such as A∗ [1]. All these algorithms assume the best possible
model that is consistent with the observations so far and choose the optimal policy
in this “optimistic” model. Intuitively, this means that for each possible policy we
maintain an upper bound on the value/utility we can reasonably expect from it. In
general we want this upper bound to
1. be as tight as possible (i.e., to be close to the true value),
2. still hold with high probability.
We begin with an introduction to these ideas in bandit problems, when the objective
is to maximize total reward. We then expand this discussion to structured bandit
problems, which have many applications in optimization. Finally, we look at the
case of maximizing total reward in unknown MDPs.

10.2 Finite Stochastic Bandit Problems

First of all, let us briefly recall the stochastic bandit setting, which we already have
considered in Sect. 6.2. The learner in discrete time steps t = 1, 2, . . . chooses an
arm at from a given set A = {1, . . . , K } of K arms. The rewards rt the learner
obtains in return are random and assumed to be independent as well as bounded,
e.g., rt ∈ [0, 1]. The expected reward r (i) = E(rt |at = i) for choosing any arm i


is unknown to the learner, who aims to maximize the total reward Σ_{t=1}^T r_t after a certain number of T time steps.
Let r ∗  maxi r (i) be the highest expected reward that can be achieved. Obvi-
ously, the optimal policy π ∗ in each time step chooses the arm giving the highest
expected reward r ∗ . The learner who does not know which arm is optimal will choose
at each time step t an arm at from A, or more generally, a probability distribution
over the arms from which at then is drawn. It is important to notice that maximizing
the total reward is equivalent to minimizing total regret with respect to that policy.
Definition 10.2.1 (Total regret) The (total) regret of a policy π relative to the optimal fixed policy π* after T steps is

L_T(π) ≜ Σ_{t=1}^T ( r_t^* − r_t^π ),

where r_t^π is the reward obtained by the policy π at step t and r_t^* ≜ r_t^{π*}. Accordingly, the expected (total) regret is

E L_T(π) ≜ T r^* − E^π Σ_{t=1}^T r_t.

The regret compares the collected rewards to those of the best fixed policy. Comparing
instead to the best rewards obtained by the arms at each time would be too hard due
to their randomness.
We note that the notion of regret we consider here is usually called pseudo-regret,
while the term expected regret often refers to the just mentioned comparison to the
actual best rewards, cf. [8] for a more detailed discussion.

10.2.1 The UCB1 Algorithm

It makes sense for a learning algorithm to use the empirical average rewards obtained
for each arm so far.

Empirical average

r̂_{t,i} ≜ (1/N_{t,i}) Σ_{k=1}^t r_{k,i} I{a_k = i},   where   N_{t,i} ≜ Σ_{k=1}^t I{a_k = i},

and rk,i denotes the (random) reward the learner receives upon choosing arm i
at step k.

Simply always choosing the arm with the best empirical average reward so far is
not the best idea, because you might get stuck with a sub-optimal arm: If the optimal
arm underperforms at the beginning, so that its empirical average is far below the
true mean of a suboptimal arm, it will never be chosen again. A better strategy is to
choose arms optimistically. Intuitively, as long as an arm has a significant chance of
being the best, you play it every now and then. One simple way to implement this is
shown in the following UCB1 algorithm of Auer et al. [2].

Algorithm 10.1 UCB1

input A
Choose each arm once to obtain an initial estimate.
for t = 1, . . . do
  Choose arm a_t = arg max_{i∈A} { r̂_{t−1,i} + √( 2 ln t / N_{t−1,i} ) }.
end for


Thus, the algorithm adds a bonus value of order O(√(ln t / N_{t,i})) to the empirical value of each arm, thus forming an upper confidence bound. This upper confidence bound value is such that the true mean reward of each arm will lie below it with high probability by the Hoeffding bound (4.5.5).
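A direct implementation of UCB1 might look as follows (our sketch; `pull` is a hypothetical function returning a reward in [0, 1] for the chosen arm).

```python
import math
import random

def ucb1(pull, K, T):
    """Run UCB1 for T steps on K arms; `pull(i)` returns a reward in [0, 1]."""
    counts = [0] * K
    means = [0.0] * K
    # Choose each arm once to obtain an initial estimate.
    for i in range(K):
        means[i] = pull(i)
        counts[i] = 1
    total = sum(means)
    for t in range(K + 1, T + 1):
        # Pick the arm maximising the empirical mean plus the confidence bonus.
        a = max(range(K), key=lambda i: means[i] + math.sqrt(2 * math.log(t) / counts[i]))
        r = pull(a)
        counts[a] += 1
        means[a] += (r - means[a]) / counts[a]   # incremental mean update
        total += r
    return total, means

# Example: two Bernoulli arms with success probabilities 0.4 and 0.6.
probs = [0.4, 0.6]
total, means = ucb1(lambda i: float(random.random() < probs[i]), K=2, T=10000)
print(total, means)
```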

Theorem 10.2.1 (Auer et al. [2]) The expected regret of UCB1 after T time steps is at most

E L_T(UCB1) ≤ Σ_{i: r(i) < r^*} ( 8 ln T / (r^* − r(i)) + 5 (r^* − r(i)) ).

Proof By Wald's identity (5.10) the expected regret can be written as

E L_T = E Σ_{t=1}^T (r^* − r_t) = Σ_i (r^* − r(i)) E N_{T,i},   (10.2.1)

so that we focus on bounding E N_{T,i}. Thus, let i be an arbitrary suboptimal arm, for which we shall consider when it will be chosen by the algorithm. Write B_{t,s} = √((2 ln t)/s) for the bonus value at step t after s observations. Note that for fixed values of t, s, s_i ∈ N, under the assumption that N_{t,i} = s_i and (the count of the optimal action) N_{t,*} = s, we have by the Hoeffding bound (4.5.5) that

P( r̂_{t,i} ≥ r(i) + B_{t,s_i} ) ≤ e^{−4 ln t} = t^{−4},
P( r̂_{t,*} ≤ r^* − B_{t,s} ) ≤ e^{−4 ln t} = t^{−4}.

Accordingly we may assume that (taking care of the contribution of the error prob-
abilities to E Nt,i below)

r̂_{t,i} < r(i) + B_{t,N_{t,i}},   (10.2.2)

r^* < r̂_{t,*} + B_{t,N_{t,*}}.   (10.2.3)


Now note that for s ≥ (8 ln T)/(r^* − r(i))² it holds that

2B_{t,s} ≤ (r^* − r(i)),   (10.2.4)




so that after arm i has been chosen ⌈(8 ln T)/(r^* − r(i))²⌉ times we get from (10.2.2), (10.2.4), and (10.2.3) that

r̂_{t,i} + B_{t,N_{t,i}} < r(i) + 2B_{t,N_{t,i}} ≤ r^* < r̂_{t,*} + B_{t,N_{t,*}}.


This shows that after ⌈(8 ln T)/(r^* − r(i))²⌉ samples from arm i, the algorithm won't choose it again. Taking into account the error probabilities for (10.2.2) and (10.2.3), arm i may be played once at each step t whenever either equation does not hold. Summing over all possible values for t, N_{t,i} and N_{t,*} this shows that

E N_{T,i} ≤ 8 ln T / (r^* − r(i))² + Σ_{τ≥1} Σ_{s≤τ} Σ_{s_i≤τ} 2τ^{−4}.

Combining this with (10.2.1) and noting that the sum converges to a value < 4, proves the regret bound. □

The UCB1 algorithm is actually not the first algorithm employing optimism in the face of uncertainty to deal with the exploration-exploitation dilemma, nor the first that uses confidence intervals for that purpose. This idea goes back to the seminal work of Lai and Robbins [3] that used the same approach, however in a more complicated form. In particular, the whole history is used for computing the arm to choose. The derived bounds of Lai and Robbins [3] show that after T steps each suboptimal arm is played at most (1/D_KL + o(1)) log T times in expectation, where D_KL measures the distance between the reward distributions of the optimal and the suboptimal arm by the Kullback-Leibler divergence, and o(1) → 0 as T → ∞. This bound was also shown to be asymptotically optimal by Lai and Robbins [3]. A lower bound logarithmic in T for any finite T that is close to matching the bound of Theorem 10.2.1 can be found in Mannor and Tsitsiklis [4]. Improvements that get closer to the lower bound (and are still based on the UCB1 idea) can be found in Auer and Ortner [5], while the gap has been finally closed by Lattimore [6].
For so-called distribution-independent bounds that do not depend on problem parameters like the "gaps" r^* − r(i), see e.g. Audibert and Bubeck [7]. In general, these bounds cannot be logarithmic in T anymore, as the gaps may be of order 1/√T, resulting in bounds that are O(√(KT)), just like in the nonstochastic setting that we will take a look at next.

10.2.2 Non i.i.d. Rewards

The stochastic setting just considered is only one among several variants of the multi-
armed bandit problem. While it is impossible to cover them all, we give a brief overview of
the most common scenarios and refer to Bubeck and Cesa-Bianchi [8] for a more
complete overview.
What is common to most variants of the classic stochastic setting is that the
assumption of receiving i.i.d. rewards when sampling a fixed arm is loosened. The
most extreme case is the so-called nonstochastic, sometimes also termed adversarial
bandit setting, where the reward sequence for each arm is assumed to be fixed in adver-
sarial
advance (and thus not random at all). In this case, the reward is maximized when bandit
choosing in each time step the arm that maximizes the reward at this step. Obvi-
ously, since the reward sequences can be completely arbitrary, no learner can stand
a chance to perform well with respect to this optimal policy. Thus, one confines
oneself to consider the regret with respect to the best fixed arm in hindsight, that is,
T
arg maxi t=1 rt,i where rt,i is the reward of arm i at step t. It is still not clear that
this is not too
√ much to ask for, but it turns out that one can achieve regret bounds
of order O( K T ) in this setting. Clearly, algorithms that choose arms deterministi-
cally can always be tricked by an adversarial reward sequence. However, algorithms
that at each time step choose an arm from a suitable distribution over the arms (that is
updated according to the collected rewards), can be shown to give the mentioned opti-
mal regret bound. A prominent exponent of these algorithms is the Exp3 algorithm
of Auer et al. [9], that uses an exponential weighting scheme.
In the contextual bandit setting the learner receives some additional side information called the context. The reward for choosing an arm is assumed to depend on
the context as well as on the chosen arm and can be either stochastic or adversarial.
The learner usually competes against the best policy that maps contexts to arms.
There is a notable amount of literature dealing with various settings that are usually
also interesting for applications like web advertisement where user data takes the
role of provided side information. For an overview see e.g. Chap. 4 of Bubeck and
Cesa-Bianchi [8] or Part V of Lattimore and Szepesvári [10].
In other settings the i.i.d. assumption about the rewards of a fixed arm is replaced
by more general assumptions, such as that underlying each arm there is a Markov
chain and rewards depend on the state of the Markov chain when sampling the
arm. This is called the restless bandits problem, that is already quite close to the
general reinforcement learning setting with an underlying Markov decision process (see Sect. 10.3.1). Regret bounds in this setting can be shown to be Õ(√T) even if
at each time step the learner can observe only the state of the arm he chooses, see
Ortner et al. [11].

10.3 Reinforcement Learning in MDPs

Taking a step further from the bandit problems of the previous sections we now
want to consider a more general reinforcement learning setting where the learner
operates on an unknown underlying MDP. Note that the stochastic bandit problem
corresponds to a single state MDP.
Thus, consider an MDP μ with state space S, action space A, and let r (s, a) ∈
[0, 1] and P(·|s, a) be the mean reward and the transition probability distribution
on S for each state s ∈ S and each action a ∈ A, respectively. For the moment
we assume that S and A are finite. As we have seen in Sect. 6.6 there are various
optimality criteria for MDPs. In the spirit of the bandit problems considered so far
we consider undiscounted rewards and examine the regret after any T steps with
respect to an optimal policy.
Since the optimal T -step policy in general will be non-stationary and different for
different horizons T and different initial states, we will compare to a gain optimal
policy π ∗ as introduced in Definition 6.6.3. Further, we assume that the MDP is
communicating. That is, for any two states s, s′ there is a policy π_{s,s′} that with positive probability reaches s′ when starting in s. This assumption allows the learner to recover
when making a mistake. Note that in MDPs that are not communicating one wrong
step may lead to a suboptimal region of the state space that cannot be left anymore,
which makes competing to an optimal policy in a learning setting impossible. For
communicating MDPs we can define the diameter to be the maximal expected time
it takes to connect any two states.

Definition 10.3.1 Let T(π, s, s′) be the expected number of steps it takes to reach state s′ when starting in s and playing policy π. Then the diameter is defined as

D ≜ max_{s,s′} min_π T(π, s, s′).

Given that our rewards are assumed to be bounded in [0, 1], intuitively, when we
make one wrong step in some state s, in the long run we won’t lose more than D.
After all, in D steps we can go back to s and continue optimally.
Under the assumption that the MDP is communicating, the gain g ∗ can be shown
to be independent of the initial state, that is, g ∗ (s) = g ∗ for all states s. Accordingly,
we define the T -step regret of a learning algorithm as


L_T ≜ Σ_{t=1}^T ( g^* − r_t ),

where rt is the reward collected by the algorithm at step t. Note that in general (and
depending on the initial state) the value T g ∗ we compare to will differ from the
optimal T -step reward. However, this difference can be shown to be upper bounded
by the diameter and is therefore negligible when considering the regret.

10.3.1 An Upper-Confidence Bound Algorithm

Now we aim at extending the idea underlying the UCB1 algorithm to the general
reinforcement learning setting. Again, we would like to have for each (stationary)
policy π an upper bound on the gain that is reasonable to expect. Note that simply
taking each policy to be the arm of a bandit problem does not work well. First, to
approach the true gain of a chosen policy, it will not be sufficient to choose it just
once. It would be necessary to follow each policy for a sufficiently high number of
consecutive steps. Without knowledge of some characteristics of the underlying MDP
like mixing times, it might be however difficult to determine how long a policy shall
be played. Further, due to the large number of stationary policies, which is |A||S| ,
the regret bounds that would result from such an approach would be exponential in
the number of states.
Thus, we rather maintain confidence regions for the rewards and transition prob-
abilities of each state-action pair s, a. Then, at each step t, these confidence regions
implicitly also define a confidence region for the true underlying MDP μ∗ , that is, a
set Mt of plausible MDPs. For suitably chosen confidence intervals for the rewards
and transition probabilities one can obtain that

P(μ ∉ M_t) < δ.   (10.3.1)

Given this confidence region M_t, one can define the optimistic value for any policy π to be

g^π_+(M_t) ≜ max{ g^π_μ | μ ∈ M_t }.   (10.3.2)

Note that similar to the bandit setting this estimate is optimistic for each policy, as due to (10.3.1) it holds that g^π_+(M_t) ≥ g^π_μ with high probability. Analogously to UCB1 we would like to make an optimistic choice among the possible policies, that is, we choose a policy π that maximizes g^π_+(M_t).
However, unlike in the bandit setting where we immediately receive a sample
from the reward of the chosen arm, in the MDP setting we only obtain information
about the reward in the current state. Thus, we should not play the chosen optimistic
policy just for one but a sufficiently large number of steps. An easy way is to play
policies in episodes of increasing length, such that sooner or later each action is
played for a sufficient number of steps in each state. Summarized, we obtain (the
outline of) an algorithm as shown below.

UCRL2 [12] outline

In episodes k = 1, 2, . . .
• At the first step t_k of episode k, update the confidence region M_{t_k}.
• Compute an optimistic policy π̃_k ∈ arg max_π g^π_+(M_{t_k}).
• Execute π̃_k, observe rewards and transitions until t_{k+1}.

10.3.1.1 Technical Details for UCRL2

To make the algorithm complete, we have to fill in some technical details. In the
following, let S be the number of states and A the number of actions of the underlying
MDP μ. Further, the algorithm takes a confidence parameter δ > 0.
The confidence region. Concerning the confidence regions, for the rewards it is sufficient to use confidence intervals similar to those for UCB1. For the transition probabilities we consider all those transition probability distributions to be plausible whose ‖·‖₁-distance to the empirical distribution P̂_t(· | s, a) is small. That is, the confidence region M_t at step t used to compute the optimistic policy can be defined as the set of MDPs with mean rewards r(s, a) and transition probabilities P(· | s, a) such that

|r(s, a) − r̂(s, a)| ≤ √( 7 log(2SAt/δ) / (2 N_t(s, a)) ),

‖P(· | s, a) − P̂_t(· | s, a)‖₁ ≤ √( 14 S log(2At/δ) / N_t(s, a) ),

where r̂(s, a) and P̂_t(· | s, a) are the estimates for the rewards and the transition probabilities, and N_t(s, a) denotes the number of samples of action a in state s at time step t.
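For concreteness, a small helper computing these two widths for a single state-action pair might look like this (our sketch, mirroring the inequalities above).

```python
import numpy as np

def confidence_widths(N, S, A, t, delta):
    """Confidence widths used by UCRL2 for one state-action pair.

    N:     number of samples of the pair (s, a) so far (at least 1)
    S, A:  number of states and actions; t: current time step; delta: confidence parameter
    Returns (reward_width, transition_l1_width).
    """
    conf_r = np.sqrt(7 * np.log(2 * S * A * t / delta) / (2 * N))
    conf_p = np.sqrt(14 * S * np.log(2 * A * t / delta) / N)
    return conf_r, conf_p
```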
One can show via a bound due to Weissman et al. [13] that given n samples of the transition probability distribution P(· | s, a), one has

P( ‖P(· | s, a) − P̂_t(· | s, a)‖₁ ≥ ε ) ≤ 2^S exp(−n ε² / 2).
Using this together with standard Hoeffding bounds for the reward estimates, it
can be shown that the confidence region contains the true underlying MDP with high
probability.

Lemma 10.3.1

P(μ* ∈ M_t) > 1 − δ / (15 t⁶).
Episode lengths. Concerning the termination of episodes, as already mentioned, we
would like to have episodes that are long enough so that we do not suffer large regret
when playing a suboptimal policy. Intuitively, it only pays off to recompute the opti-
mistic policy when the estimates or confidence intervals have changed sufficiently.
One option is e.g. to terminate an episode when the confidence interval for one state-
action pair has shrunk by some factor. Even simpler, one can terminate an episode
when a state-action pair has been sampled often (compared to the samples one had
before the episode has started), e.g. when one has doubled the number of visits in
some state-action pair. This also allows us to bound the total number of episodes up to
step T .

Lemma 10.3.2 If an episode of UCRL2 is terminated when the number of visits in some state-action pair has been doubled, the total number of episodes up to step T is upper bounded by S A log₂(8T/(SA)).

This episode termination criterion also allows us to bound the sum over all fractions of the form v_k(s, a)/√(N_k(s, a)), where v_k(s, a) is the number of times action a has been chosen in state s during episode k, while N_k(s, a) is the respective count of visits before
episode k. The evaluation of this sum will turn out to be important in the regret
analysis below to bound the sum over all confidence intervals over the visited state-
action pairs.

Lemma 10.3.3

Σ_k Σ_{s,a} v_k(s, a)/√(N_k(s, a)) ≤ (√2 + 1) √(S A T).

Calculating the optimistic policy. It is important to note that the computation of the optimistic policy can be performed efficiently by using a modification of value iteration. Intuitively, for each policy π the optimistic value g^π_+(M_t) maximizes the gain over all possible values in the confidence intervals for the rewards and the transition probabilities for π. This is an optimization problem over a compact space that can be easily solved. More precisely, in order to find arg max_π g^π_+(M_t), for each considered policy one additionally has to determine the precise values for rewards and transition probabilities within the confidence region. This corresponds to finding the optimal policy in an MDP with compact action space, which can be solved by an extension of value iteration that in each iteration now not only maximizes over the original action space but also within the confidence region of the respective action. Noting that g^π_+(M_t) is maximized when the rewards are set to their upper confidence values, this results in the following value iteration scheme:
1. Set the optimistic rewards r̃(s, a) to the upper confidence values for all states s and all actions a.
2. Set u_0(s) := 0 for all s.
3. For i = 0, 1, 2, . . . set

   u_{i+1}(s) := max_a { r̃(s, a) + max_{P ∈ P(s,a)} Σ_{s′} P(s′) u_i(s′) },   (10.3.3)

   where P(s, a) is the set of all plausible transition probabilities for choosing action a in state s. The inner maximization over P(s, a) is sketched in code below.
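The inner maximization over P ∈ P(s, a) admits a simple solution: move as much probability mass as the ℓ₁-constraint allows onto the states with the highest current value u_i. The sketch below is ours, in the spirit of the extended value iteration of [12]; the argument names are hypothetical.

```python
import numpy as np

def optimistic_transition(p_hat, conf_p, u):
    """Choose P in the L1-ball of radius conf_p around p_hat maximizing P . u.

    p_hat:  empirical transition distribution, shape (S,)
    conf_p: L1 confidence width for this state-action pair
    u:      current value vector u_i, shape (S,)
    """
    p = p_hat.copy()
    order = np.argsort(u)                          # states sorted by increasing value
    best = order[-1]
    p[best] = min(1.0, p_hat[best] + conf_p / 2)   # shift mass to the best state
    # Remove the excess mass from the worst-valued states first.
    for s in order:
        excess = p.sum() - 1.0
        if excess <= 0:
            break
        p[s] = max(0.0, p[s] - excess)
    return p
```

Calling this routine for every state-action pair inside each iteration of (10.3.3) yields the extended value iteration used to compute the optimistic policy π̃.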
Similarly to the value iteration algorithm in Sect. 6.5.4.1, this scheme can be
shown to converge. More precisely, one can show that max_s {u_{i+1}(s) − u_i(s)} − min_s {u_{i+1}(s) − u_i(s)} → 0 and also

u_{i+1}(s) → u_i(s) + g^{π̃}_+ for all s.   (10.3.4)

After convergence the maximizing actions constitute the optimistic policy π̃ , and the
maximizing transition probabilities are the respective optimistic transition values P̃.
One can also show that the so-called span maxs u i (s) − mins u i (s) of the con-
verged value vector u i is upper bounded by the diameter. This follows by optimality
of the vector u_i. Intuitively, if the span were larger than D, one could increase
the collected reward in the lower value state s − by going (as fast as possible) to the
higher value state s + . Note that this argument uses the fact that the true MDP is
plausible w.h.p., so that we may take the true transitions to get from s − to s + .

Lemma 10.3.4 Let u_i(s) be the converged value vector. Then

max_s u_i(s) − min_s u_i(s) ≤ D.

10.3.1.2 Analysis of UCRL2

In this section we derive the following regret bound for UCRL2.

Theorem 10.3.1 ([12]) In an MDP with S states, A actions, and diameter D, with probability of at least 1 − δ the regret of UCRL2 after any T steps is bounded by

const · D S √( A T log(T/δ) ).

Proof The main idea of the proof is that by Lemma 10.3.1 we have that

g̃*_k ≜ g^{π̃_k}_+(M_{t_k}) ≥ g^* ≥ g^{π̃_k},   (10.3.5)

so that the regret in each step is upper bounded by the width of the confidence interval for g^{π̃_k}, that is, by g̃*_k − g^{π̃_k}. In what follows we need to break down this confidence interval to the confidence intervals we have for rewards and transition probabilities.
In the following, we consider that the true MDP μ is always contained in the confidence regions M_t considered by the algorithm. Using Lemma 10.3.1 it is not difficult to show that with probability at least 1 − δ/(12 T^{5/4}) the regret accumulated due to μ ∉ M_t at some step t is bounded by √T.
Further, note that the random fluctuation of the rewards can be easily bounded by Hoeffding's inequality (4.5.5), that is, if s_t and a_t denote the state and action at step t, we have

Σ_{t=1}^T r_t ≥ Σ_t r(s_t, a_t) − √( (5/8) T log(8T/δ) )

with probability at least 1 − δ/(12 T^{5/4}).
Therefore, writing v_k(s, a) for the number of times action a has been chosen in state s in episode k, we have Σ_t r(s_t, a_t) = Σ_k Σ_{s,a} v_k(s, a) r(s, a), so that by (10.3.5) we can bound the regret by


Σ_{t=1}^T (g^* − r_t) ≤ Σ_k Σ_{s,a} v_k(s, a) ( g̃*_k − r(s, a) ) + √T + √( (5/8) T log(8T/δ) )   (10.3.6)

with probability at least 1 − 2δ/(12 T^{5/4}).


Thus, let us consider an arbitrary but fixed episode k, and consider the regret

Σ_{s,a} v_k(s, a) ( g̃*_k − r(s, a) )

the algorithm accumulates in this episode. Let conf^r_k(s, a) and conf^p_k(s, a) be the width of the confidence intervals for rewards and transition probabilities in episode k. First, we simply have

Σ_{s,a} v_k(s, a) ( g̃*_k − r(s, a) ) ≤ Σ_{s,a} v_k(s, a) ( g̃*_k − r̃_k(s, a) ) + Σ_{s,a} v_k(s, a) ( r̃_k(s, a) − r(s, a) ),   (10.3.7)

where the second term is bounded by

|r̃_k(s, a) − r̂_k(s, a)| + |r̂_k(s, a) − r(s, a)| ≤ 2 conf^r_k(s, a)

w.h.p. by Lemma 10.3.1, so that

Σ_{s,a} v_k(s, a) ( r̃_k(s, a) − r(s, a) ) ≤ 2 Σ_{s,a} v_k(s, a) · conf^r_k(s, a).   (10.3.8)

For the first term in (10.3.7) we use that after convergence of the value vector u_i we have by (10.3.3) and (10.3.4)

g̃*_k − r̃_k(s, π̃_k(s)) = Σ_{s′} P̃_k(s′ | s, π̃_k(s)) · u_i(s′) − u_i(s).

Then noting that v_k(s, a) = 0 for a ≠ π̃_k(s) and using vector/matrix notation it follows that

Σ_{s,a} v_k(s, a) ( g̃*_k − r̃_k(s, π̃_k(s)) )
  = Σ_{s,a} v_k(s, a) ( Σ_{s′} P̃_k(s′ | s, π̃_k(s)) · u_i(s′) − u_i(s) )
  = v_k ( P̃_k − I ) u
  = v_k ( P̃_k − P_k + P_k − I ) w_k
  = v_k ( P̃_k − P_k ) w_k + v_k ( P_k − I ) w_k,   (10.3.9)

where P_k is the true transition matrix (in μ) of the optimistic policy π̃_k in episode k, and w_k is a renormalization of the vector u (with entries u_i(s)) where w_k(s) := u_i(s) − ½ (min_s u_i(s) + max_s u_i(s)), so that ‖w_k‖_∞ ≤ D/2 by Lemma 10.3.4.
Since ‖P̃_k − P_k‖₁ ≤ ‖P̃_k − P̂_k‖₁ + ‖P̂_k − P_k‖₁, the first term of (10.3.9) is bounded as

v_k ( P̃_k − P_k ) w_k ≤ ‖v_k ( P̃_k − P_k )‖₁ · ‖w_k‖_∞ ≤ 2 Σ_{s,a} v_k(s, a) conf^p_k(s, a) D.   (10.3.10)

The second term can be rewritten as a martingale difference sequence

v_k ( P_k − I ) w_k = Σ_{t=t_k}^{t_{k+1}−1} ( P(· | s_t, a_t) w_k − w_k(s_t) )
                   = Σ_{t=t_k}^{t_{k+1}−1} ( P(· | s_t, a_t) w_k − w_k(s_{t+1}) ) + w_k(s_{t_{k+1}}) − w_k(s_{t_k}),

so that its sum over all episodes can be bounded by the Azuma-Hoeffding inequality (5.11) and Lemma 10.3.2, that is,

Σ_k v_k ( P_k − I ) w_k ≤ D √( (5/2) T log(8T/δ) ) + D S A log₂(8T/(SA))   (10.3.11)

with probability at least 1 − δ/(12 T^{5/4}).


Summing (10.3.8) and (10.3.10) over all episodes, by definition of the confidence intervals and Lemma 10.3.3 we have

Σ_k Σ_{s,a} v_k(s, a) conf^r_k(s, a) + 2D Σ_k Σ_{s,a} v_k(s, a) conf^p_k(s, a)
  ≤ const · D √( S log(AT/δ) ) Σ_k Σ_{s,a} v_k(s, a)/√(N_k(s, a))
  ≤ const · D √( S log(AT/δ) ) √(SAT).   (10.3.12)

Thus, combining (10.3.7)–(10.3.12) we obtain that

Σ_k Σ_{s,a} v_k(s, a) ( g̃*_k − r(s, a) ) ≤ const · D √( S log(AT/δ) ) √(SAT)   (10.3.13)

with probability at least 1 − δ/(12 T^{5/4}).
Finally, by (10.3.6) and (10.3.13) the regret of UCRL2 is upper bounded by const · D √( S log(AT/δ) ) √(SAT) with probability at least 1 − 3 Σ_{T≥2} δ/(12 T^{5/4}) ≥ 1 − δ. □

The following is a corresponding lower bound on the regret, which shows that the upper bound of Theorem 10.3.1 is optimal in $T$ and $A$.
Theorem 10.3.2 (Jaksch et al. [12]) For any algorithm and any natural numbers $T$, $S$, $A > 1$, and $D \ge \log_A S$, there is an MDP with $S$ states, $A$ actions, and diameter $D$ such that the expected regret after $T$ steps is
$$\Omega\big(\sqrt{DSAT}\,\big).$$
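Up to constants and logarithmic factors, the upper bound of Theorem 10.3.1 is of order $DS\sqrt{AT}$, while the lower bound above is of order $\sqrt{DSAT}$, so the two differ by roughly a factor $\sqrt{DS}$. The following throwaway computation (with placeholder constants, not the ones from the theorems) just makes this scaling visible:

    import numpy as np

    def upper_shape(D, S, A, T, delta):
        return D * S * np.sqrt(A * T * np.log(A * T / delta))   # shape of the upper bound

    def lower_shape(D, S, A, T):
        return np.sqrt(D * S * A * T)                           # shape of the lower bound

    D, S, A, T, delta = 10, 20, 5, 10**7, 0.05
    ratio = upper_shape(D, S, A, T, delta) / lower_shape(D, S, A, T)
    print(ratio, np.sqrt(D * S))          # the ratio scales like sqrt(D*S), up to logs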

Similar to the distribution-dependent regret bound of Theorem 10.2.1 for UCB1, one can derive a logarithmic bound on the expected regret of UCRL2.

Theorem 10.3.3 (Jaksch et al. [12]) In an MDP with $S$ states, $A$ actions, and diameter $D$, the expected regret of UCRL2 is
$$O\!\left(\frac{D^2 S^2 A \log(T)}{\Delta}\right),$$
where $\Delta := g^* - \max_\pi\big\{ g^\pi : g^\pi < g^* \big\}$ is the gap between the optimal gain and the second largest gain.
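A trivial illustration of this definition of $\Delta$ with hypothetical gain values:

    gains = [0.90, 0.85, 0.90, 0.70]      # hypothetical gains g^pi of candidate policies
    g_star = max(gains)
    Delta = g_star - max(g for g in gains if g < g_star)
    print(Delta)                           # 0.05: distance to the second largest gain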

10.3.2 Bibliographical Remarks

Similar to UCB1, which was based on the work of Lai and Robbins [3], UCRL2 is not the first optimistic algorithm with theoretical guarantees. For instance, the index policies of Burnetas and Katehakis [14] and Tewari and Bartlett [15] choose actions optimistically by using confidence bounds for the estimates in the current state. However, the logarithmic regret bounds are derived only for ergodic MDPs, in which each policy visits each state with probability 1.
Another important predecessor based on the principle of optimism in the face of uncertainty is R-Max [16], which assumes that the maximal possible reward is received in each insufficiently visited state. UCRL2 offers a refinement of this idea to motivate exploration. Sample complexity bounds as derived for R-Max can also be obtained for UCRL2, cf. [12].
The gap between the lower bound of Theorem 10.3.2 and the bound for UCRL2 has not been closed so far. There have been various attempts in that direction for different algorithms inspired by Thompson sampling [17] or UCB1 [18]. However, all of the claimed proofs seem to contain some issues that remain unresolved to date.
The situation is settled in the simpler episodic setting, where after every $H$ steps there is a restart. Here there are matching upper and lower bounds of order $\sqrt{HSAT}$ on the regret, see [19].

In the discounted setting, the MBIE algorithm of Strehl and Littman [20, 21] is a precursor of UCRL2 that is based on the same ideas. While regret bounds are also available for MBIE, these are not easily comparable to Theorem 10.3.1, as the regret is measured along the trajectory of the algorithm, while the regret considered for UCRL2 is with respect to the trajectory an optimal policy would have taken. In general, regret in the discounted setting seems to be a less satisfactory concept. However, sample complexity bounds in the discounted setting for a UCRL2 variant have been given by Lattimore and Hutter [22].
Last but not least, we would like to refer any reader interested in the material of
this chapter to the recent book of Lattimore and Szepesvári [10] that deals with the
whole range of topics from simple bandits to reinforcement learning in MDPs in
much more detail.

References

1. Hart, P.E., Nilsson, N.J., Raphael, B.: A formal basis for the heuristic determination of mini-
mum cost paths. IEEE Trans. Syst. Sci. Cybern. 4(2), 100–107 (1968)
2. Auer, P., Cesa-Bianchi, N., Fischer, P.: Finite time analysis of the multiarmed bandit problem.
Mach. Learn. 47(2–3), 235–256 (2002)
3. Lai, T.L., Robbins, H.: Asymptotically efficient adaptive allocation rules. Adv. Appl. Math. 6(1), 4–22 (1985)
4. Mannor, S., Tsitsiklis, J.N.: The sample complexity of exploration in the multi-armed bandit
problem. J. Mach. Learn. Res. 5, 623–648 (2004)
5. Auer, P., Ortner, R.: UCB revisited: improved regret bounds for the stochastic multi-armed
bandit problem. Period. Math. Hung. 61(1–2), 55–65 (2010)
6. Lattimore, T.: Optimally confident UCB: Improved regret for finite-armed bandits. Technical
Report 1507.07880 (2015). (arXiv)
7. Audibert, J.-Y., Bubeck, S.: Minimax policies for adversarial and stochastic bandits. In: Proceedings of the 22nd Annual Conference on Learning Theory (COLT 2009), pp. 217–226 (2009)
8. Bubeck, S., Cesa-Bianchi, N.: Regret analysis of stochastic and nonstochastic multi-armed
bandit problems. Found. Trends Mach. Learn. 5(1), 1–122 (2012)
9. Auer, P., Cesa-Bianchi, N., Freund, Y., Schapire, R.E.: The nonstochastic multiarmed bandit
problem. SIAM J. Comput. 32(1), 48–77 (2002)
10. Lattimore, T., Szepesvári, C.: Bandit Algorithms. Cambridge University Press (2020)
11. Ortner, R., Ryabko, D., Auer, P., Munos, R.: Regret bounds for restless Markov bandits. Theor.
Comput. Sci. 558, 62–76 (2014)
12. Jaksch, T., Ortner, R., Auer, P.: Near-optimal regret bounds for reinforcement learning. J.
Mach. Learn. Res. 11, 1563–1600 (2010)
13. Weissman, T., Ordentlich, E., Seroussi, G., Verdu, S., Weinberger, M.J.: Inequalities for the L1 deviation of the empirical distribution. Technical Report HPL-2003-97 (R.1), Hewlett-Packard Labs (2003)
14. Burnetas, A.N., Katehakis, M.N.: Optimal adaptive policies for Markov decision processes.
Math. Oper. Res. 22(1), 222–255 (1997)
15. Tewari, A., Bartlett, P.: Optimistic linear programming gives logarithmic regret for irreducible
MDPs. In: Advances in Neural Information Processing Systems, vol. 20, pp. 1505–1512. MIT
Press (2008)
16. Brafman, R.I., Tennenholtz, M.: R-MAX - a general polynomial time algorithm for near-optimal reinforcement learning. J. Mach. Learn. Res. 3, 213–231 (2003)

17. Agrawal, S., Jia, R.: Optimistic posterior sampling for reinforcement learning: worst-case
regret bounds. In: Advances in Neural Information Processing Systems, vol. 30, pp. 1184–
1194 (2017)
18. Ortner, R.: Regret bounds for reinforcement learning via Markov chain concentration. J. Artif.
Intell. Res. 67, 115–128 (2020)
19. Azar, M.G., Osband, I., Munos, R.: Minimax regret bounds for reinforcement learning. In:
Proceedings of the 34th International Conference on Machine Learning, ICML 2017, pp.
263–272 (2017)
20. Strehl, A.L., Littman, M.L.: A theoretical analysis of model-based interval estimation. In:
Machine Learning, Proceedings of the 22nd International Conference, ICML 2005, pp. 857–
864. ACM (2005)
21. Strehl, A.L., Littman, M.L.: An analysis of model-based interval estimation for Markov deci-
sion processes. J. Comput. Syst. Sci. 74(8), 1309–1331 (2008)
22. Lattimore, T., Hutter, M.: Near-optimal PAC bounds for discounted MDPs. Theor. Comput.
Sci. 558, 125–143 (2014)
Chapter 11
Conclusion

This book touched upon the basic principles of decision making under uncertainty
in the context of reinforcement learning. While one of the main streams of thought
is Bayesian decision theory, we also discussed the basics of approximate dynamic
programming and stochastic approximation as applied to reinforcement learning
problems.
Consciously, however, we have avoided going into a number of topics related to
reinforcement learning and decision theory, some of which would need a book of
their own to be properly addressed. Even though it was fun writing the book, we at
some point had to decide to stop and consolidate the material we had, sometimes
culling partially developed material in favour of a more concise volume.
Firstly, we haven’t explicitly considered many models that can be used for rep-
resenting transition distributions, value functions or policies, beyond the simplest
ones, as we felt that this would detract from the main body of the text. Textbooks
for the latest fashion are always going to be abundant, and we hope that this book
provides a sufficient basis to enable the use of any current methods. There are also
a large number of areas which have not been covered at all. In particular, while we
touched upon the setting of two-player games and its connection to robust statistical
decisions, we have not examined problems which are also relevant to sequential deci-
sion making, such as Markov games and Bayesian games. In relation to this, while
early in the book we discuss risk aversion and risk seeking, we have not discussed
specific sequential decision making algorithms for such problems. Furthermore, even
though we discuss the problem of preference elicitation, we do not discuss specific
algorithms for it or the related problem of inverse reinforcement learning. Another
topic which went unmentioned, but which may become more important in the future,
is hierarchical reinforcement learning as well as options, which allow constructing
long-term actions (such as “go to the supermarket”) from primitive actions (such as
“open the door”). Finally, even though we have mentioned the basic framework of

regret minimization, we focused on the standard reinforcement learning problem, and ignored adversarial settings and problems with varying amounts of side information.
It is important to note that the book almost entirely elides social aspects of decision
making. In practice, any algorithm that is going to be used to make autonomous decisions is going to have a societal impact. In such cases, the algorithm designer must
guard against negative externalities, such as hurting disadvantaged groups, violating
privacy, or environmental damage. However, as a lot of these issues are context
dependent, we urge the reader to consult recent work in economics, algorithmic
fairness and differential privacy.
Appendix
Symbols

Table A.1 Logic symbols

Symbol   Definition
∧        Logical and
∨        Logical or
⇒        Implies
⇔        If and only if
∃        There exists
∀        For every
s.t.     Such that

Table A.2 List of set theory symbols

{x_k}          A set indexed by k
{x : x R y}    The set of x satisfying relation x R y
N              Set of natural numbers
Z              Set of integers
R              Set of real numbers
Ω              The universe set (or sample space)
∅              The empty set
Δ^n            The n-dimensional simplex
Δ(A)           The collection of distributions over a set A
B(A)           The Borel σ-algebra induced by a set A
A^n            The product set ∏_{i=1}^n A
A^*            ⋃_{n=0}^∞ A^n, the set of all sequences from set A
x ∈ A          x belongs to A
A ⊂ B          A is a (strict) subset of B
A ⊆ B          A is a (non-strict) subset of B
B \ A          Set difference
B △ A          Symmetric set difference
Aᶜ             Set complement
A ∪ B          Set union
A ∩ B          Set intersection

Table A.3 Analysis and linear algebra symbols

x^⊤            The transpose of a vector x
|A|            The determinant of a matrix A
‖x‖_p          The p-norm of a vector, (Σ_i |x_i|^p)^{1/p}
‖f‖_p          The p-norm of a function, (∫ |f(x)|^p dx)^{1/p}
‖A‖_p          The operator norm of a matrix, max{‖Ax‖_p : ‖x‖_p = 1}
∂f(x)/∂x_i     Partial derivative with respect to x_i
∇f             Gradient vector of partial derivatives with respect to the vector x

Table A.4 Miscellaneous statistics symbols

N(μ, Σ)        Normal distribution with mean μ and covariance Σ
Bern(ω)        Bernoulli distribution with parameter ω
Binom(ω, t)    Binomial distribution with parameter ω over t trials
Gamma(α, β)    Gamma distribution with shape α and scaling β
Dir(α)         Dirichlet distribution with prior mass α
Unif(A)        Uniform distribution on the set A
Beta(α, β)     Beta distribution with parameters (α, β)
Geom(ω)        Geometric distribution with parameter ω
Wish(n − 1, T) Wishart distribution with n degrees of freedom and parameter matrix T
φ : X → Y      Statistic mapping from observations to a vector space