Decision Making Under Uncertainty and Reinforcement Learning

April 8, 2021
Contents
1 Introduction
  1.1 Uncertainty and probability
  1.2 The exploration-exploitation trade-off
  1.3 Decision theory and reinforcement learning
  1.4 Acknowledgements
3 Decision problems
  3.1 Introduction
  3.2 Rewards that depend on the outcome of an experiment
    3.2.1 Formalisation of the problem setting
    3.2.2 Decision diagrams
    3.2.3 Statistical estimation*
  3.3 Bayes decisions
    3.3.1 Convexity of the Bayes-optimal utility*
  3.4 Statistical and strategic decision making
    3.4.1 Alternative notions of optimality
    3.4.2 Solving minimax problems*
    3.4.3 Two-player games
  3.5 Decision problems with observations
    3.5.1 Decision problems in classification
    3.5.2 Calculating posteriors
  3.6 Summary
  3.7 Exercises
    3.7.1 Problems with no observations
    3.7.2 Problems with observations
    3.7.3 An insurance problem
    3.7.4 Medical diagnosis
4 Estimation
  4.1 Introduction
  4.2 Sufficient statistics
    4.2.1 Sufficient statistics
    4.2.2 Exponential families
  4.3 Conjugate priors
    4.3.1 Bernoulli-Beta conjugate pair
    4.3.2 Conjugates for the normal distribution
    4.3.3 Conjugates for multivariate distributions
  4.4 Credible intervals
  4.5 Concentration inequalities
    4.5.1 Chernoff-Hoeffding bounds
  4.6 Approximate Bayesian approaches
    4.6.1 Monte-Carlo inference
    4.6.2 Approximate Bayesian Computation
    4.6.3 Analytic approximations of the posterior
    4.6.4 Maximum Likelihood and Empirical Bayes methods
5 Sequential sampling
  5.1 Gains from sequential sampling
    5.1.1 An example: sampling with costs
  5.2 Optimal sequential sampling procedures
    5.2.1 Multi-stage problems
    5.2.2 Backwards induction for bounded procedures
    5.2.3 Unbounded sequential decision procedures
    5.2.4 The sequential probability ratio test
    5.2.5 Wald’s theorem
  5.3 Martingales
  5.4 Markov processes
  5.5 Exercises
11 Conclusion
A Symbols
D Index
Chapter 1
Introduction
The purpose of this book is to collect the fundamental results for decision
making under uncertainty in one place. In particular, the aim is to give a uni-
fied account of algorithms and theory for sequential decision making problems,
including reinforcement learning. Starting from elementary statistical decision
theory, we progress to the reinforcement learning problem and various solution
methods. The end of the book focuses on the current state-of-the-art in models
and approximation algorithms.
The problem of decision making under uncertainty can be broken down into
two parts. First, how do we learn about the world? This involves both the
problem of modeling our initial uncertainty about the world, and that of draw-
ing conclusions from evidence and our initial belief. Secondly, given what we
currently know about the world, how should we decide what to do, taking into
account future events and observations that may change our conclusions?
Typically, this will involve creating long-term plans covering possible future
eventualities. That is, when planning under uncertainty, we also need to take
into account what possible future knowledge could be generated when imple-
menting our plans. Intuitively, executing plans which involve trying out new
things should give more information, but it is hard to tell whether this infor-
mation will be beneficial. The choice between doing something which is already
known to produce good results and experimenting with something new is known
as the exploration-exploitation dilemma, and it is at the root of the interaction
between learning and planning.
Classical Probability
A random experiment is performed, with a given set Ω of possible outcomes.
An example is the 2-slit experiment in physics, where a particle is generated
which can go through either one of two slits. According to our current
understanding of quantum mechanics, it is impossible to predict which slit
the particle will go through. Herein, the set Ω consists of two possible
events corresponding to the particle passing through one or the other slit.
Subjective Probability
Here Ω need not only describe the outcomes of some experiment; conceptually,
it can also be a set of possible worlds or realities. This set can be quite
large and include anything imaginable. For example, it may include worlds
where dragons are real. However, in practice one only cares about certain
aspects of the world, such as whether in this world, you will win the lottery
if you buy a ticket. We can interpret the probability of a world in Ω as our
degree of belief that it corresponds to reality.
Thus, one must decide whether to exploit knowledge about the world to
gain a known reward, or to explore the world to learn something new. This
will potentially give you less reward immediately, but the knowledge itself can
usually be put to use in the future.
This exploration-exploitation trade-off only arises when data collection is
interactive. If we are simply given a set of data which is to be used to decide
upon a course of action, but our decision does not affect the data we shall collect
in the future, then things are much simpler. However, a lot of real-world human
decision making as well as modern applications in data science involve such
trade-offs. Decision theory offers precise mathematical models and algorithms
for such problems.
Outline
1.4 Acknowledgements
Many thanks go to all the students of the Decision making under uncertainty
and Advanced topics in reinforcement learning and decision making classes over
the years for bearing with early drafts of this book. A big “thank you” goes
to Nikolaos Tziortziotis, whose code is used in some of the examples in the
book. Finally, thanks to Aristide Tossou and Hannes Eriksson for proof-reading
various chapters. Lastly, many of the code examples in the book were run
using the parallel package by Tange [2011].
Chapter 2

Subjective probability and utility
Let us now speak more generally about the case where we have defined an
appropriate σ-field F on Ω. Then each element Ai ∈ F will be a subset of Ω.
We now wish to define relative likelihood relations for the elements Ai ∈ F.1
1 More formally, we can define three classes C≻ , C≺ , C∼ ⊂ F² such that a pair (Ai , Aj ) ∈
CR if and only if it satisfies the relation Ai R Aj , where R ∈ {≻, ≺, ∼}. These three classes
form a partition of F² under the subjective probability assumptions we will introduce in the following.
Assumption 2.1.1 (SP1). For any pair of events A, B ∈ F, one has either
A ≻ B, A ≺ B, or A ∼ B.
As it turns out, these assumptions are sufficient for proving the following
theorems [DeGroot, 1970]. The first theorem tells us that our belief must be
consistent with respect to transitivity.
Theorem 2.1.1 (Transitivity). Under Assumptions 2.1.1, 2.1.2, and 2.1.3, for
all events A, B, C: If A ≼ B and B ≼ C, then A ≼ C.
The second theorem says that if two events have a certain relation, then
their negations have the converse relation.
Definition 2.1.1 (Uniform distribution). Let λ(A) denote the length of any
interval A ⊆ [0, 1]. Then x : Ω → [0, 1] has a uniform distribution on [0, 1] if,
for any subintervals A, B of [0, 1], (x ∈ A) ≼ (x ∈ B) if and only if λ(A) ≤ λ(B).
Example 2. Say that A is the event that it rains in Gothenburg, Sweden tomorrow.
We know that Gothenburg is quite rainy due to its oceanic climate, so we set A ≽ A∁.
Now, let us try to incorporate some additional information. Let D denote the fact
that good weather is forecast, so that given D good weather will appear more probable
than rain, formally (A | D) ≼ (A∁ | D).
Conditional likelihoods
Define (A | D) ≼ (B | D) to mean that B is at least as likely as A when it
is known that D holds.
Assumption 2.1.6 (CP). For any events A, B, D,
(A | D) ≼ (B | D) iff A ∩ D ≼ B ∩ D.
Theorem 2.1.9. If a likelihood relation ≼ satisfies assumptions SP1–SP5 as
well as CP, then there exists a probability measure P such that: For any A, B, D
such that P (D) > 0,
(A | D) ≼ (B | D) iff P (A | D) ≤ P (B | D).
It turns out that there are very few ways that a conditional probability def-
inition can satisfy all of our assumptions. The usual definition is the following.
Definition 2.1.3 (Conditional probability).

P (A | D) ≜ P (A ∩ D) / P (D).
This definition effectively answers the question of how much evidence for A
we have, now that we have observed D. This is expressed as the ratio between
the combined event A ∩ D, also known as the joint probability of A and D, and
the marginal probability of D itself. The intuition behind the definition becomes
clearer once we rewrite it as P (A ∩ D) = P (A | D) P (D). Then conditional
probability is effectively used as a way to factorise joint probabilities.
P (Ai | B) = P (B | Ai ) P (Ai ) / Σ_{j=1}^n P (B | Aj ) P (Aj ).
Example 4 (The weather forecast). Form a subjective probability for the probability
that it rains.
A1 : Rain.
A2 : No rain.
First, choose P (A1 ) and set P (A2 ) = 1 − P (A1 ). Now assume that there is a weather
forecasting station that predicts no rain for tomorrow. However, you know the fol-
lowing facts about the station: On the days when it rains, half of the time the station
had predicted it was not going to rain. On days when it doesn’t rain, the station had
said no rain 9 times out of 10.
Solution. Let B denote the event that the station predicts no rain. According
to our information, the probability that there is rain when the prediction said no rain is

P (A1 | B) = P (B | A1 ) P (A1 ) / (P (B | A1 ) P (A1 ) + P (B | A2 ) [1 − P (A1 )])
           = (1/2) P (A1 ) / (0.9 − 0.4 P (A1 )).
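To see how the posterior behaves for different prior probabilities of rain, here is a minimal Python sketch of the same calculation (the prior value 0.3 below is only an illustrative choice, not part of the example):

```python
def posterior_rain(prior_rain):
    """Posterior probability of rain given a 'no rain' forecast (event B)."""
    p_b_given_rain = 0.5     # P(B | A1): forecast said no rain although it rained
    p_b_given_no_rain = 0.9  # P(B | A2): forecast said no rain and it did not rain
    joint_rain = p_b_given_rain * prior_rain
    joint_no_rain = p_b_given_no_rain * (1 - prior_rain)
    return joint_rain / (joint_rain + joint_no_rain)

# With a prior P(A1) = 0.3, the posterior drops to about 0.19 after the forecast.
print(posterior_rain(0.3))
```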
Preferences
Example 5 (Musical event tickets). We have a set of tickets R, and we must choose
the ticket r ∈ R we prefer most. Here are two possible scenarios:
R is a set of tickets to different music events at the same time, at equally
good halls with equally good seats and the same price. Here preferences simply
coincide with the preferences for a certain type of music or an artist.
If R is a set of tickets to different events at different times, at different quality
halls with different quality seats and different prices, preferences may depend
on all the factors.
Example 6 (Route selection). We have a set of alternative routes and must pick one.
If R contains two routes of the same quality, one short and one long, we will probably
prefer the shorter one. If the longer route is more scenic our preferences may be
different.
The first assumption means that we must always be able to decide between
any two rewards. It may seem that it does not always hold in practice, since
humans are frequently indecisive. However, without the second assumption, it
is still possible to create preference relations that are cyclic.
Example 8 (Route selection). Assume that you have to pick between two routes
P1 , P2 . Your preferences are such that shorter time routes are preferred over longer
ones. For simplicity, let R = {10, 15, 30, 35} be the possible times it might take to
reach your destination. Route P1 takes 10 minutes when the road is clear, but 30
minutes when the traffic is heavy. The probability of heavy traffic on P1 is 0.5. On the
other hand, route P2 takes 15 minutes when the road is clear, but 35 minutes when
the traffic is heavy. The probability of heavy traffic on P2 is 0.2.
What would be a good principle for choosing between the two routes in
Example 8? Clearly route P1 gives both the lowest best-case time and the
lowest worst-case time. It thus appears as though both an extremely cautious
person (who assumes the worst-case) and an extreme optimist (who assumes the
best case) would say P1 ≻∗ P2 . However, the average time taken in P2 is only 19
minutes versus 20 minutes for P1 . Thus, somebody who only took the average
time into account would prefer P2 . In the following sections, we will develop one
of the most fundamental methodologies for choices under uncertainty, based on
the idea of utilities.
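As a small illustration of these three criteria, the following sketch evaluates the routes of Example 8; taking utility to be the negative travel time is an assumption made only for this illustration:

```python
# Each route is a list of (probability, travel time in minutes) pairs (Example 8).
routes = {
    "P1": [(0.5, 10), (0.5, 30)],
    "P2": [(0.8, 15), (0.2, 35)],
}

for name, outcomes in routes.items():
    best = min(t for _, t in outcomes)           # extreme optimist's criterion
    worst = max(t for _, t in outcomes)          # extremely cautious (worst-case) criterion
    average = sum(p * t for p, t in outcomes)    # expected travel time
    print(name, best, worst, average)
# P1: best 10, worst 30, average 20.0;  P2: best 15, worst 35, average 19.0
```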
2.3.3 Utility
The concept of utility allows us to create a unifying framework, such that given
a particular set of rewards and probability distributions on them, we can define
preferences among distributions via their expected utility. The first step is to
define a utility function on rewards that induces a preference relation among them.
We make the assumption that the utility function is such that the expected
utility remains consistent with the preference relations between all probability
distributions we are choosing between.
Example 9. Consider the following decision problem. You have the option of entering
a lottery for 1 currency unit (CU). The prize is 10 CU and the probability of winning
is 0.01. This can be formalised by making it a choice between two probability dis-
tributions: P , where you do not enter the lottery, and Q, which represents entering
the lottery. Calculating the expected utility obviously gives 0 for not entering and
E(U | Q) = Σ_r U (r) Q(r) = −0.9 for entering the lottery, cf. Table 2.1.

r                        U (r)    P      Q
did not enter              0      1      0
paid 1 CU and lost        −1      0      0.99
paid 1 CU and won 10       9      0      0.01

Table 2.1: Rewards, utilities and probabilities for the lottery example.
Monetary rewards
Frequently, rewards come in the form of money. In general, it is assumed that
people prefer to have more money than less money. However, while the utility
of monetary rewards is generally assumed to be increasing, it is not necessarily
linear. For example, 1,000 Euros are probably worth more to somebody with
only 100 Euros in the bank than to somebody with 100,000 Euros. Hence, it
seems reasonable to assume that the utility of money is concave.
The following examples show the consequences of the expected utility hy-
pothesis.
Exercise 2. Show that under the expected utility hypothesis, if gamble 1 is preferred
in the Example 10, gamble 1 must also be preferred in the Example 11 for any utility
function.
In practice, you may find that your preferences are not aligned with what
this exercise suggests. This implies that either your decisions do not conform
to the expected utility hypothesis, or that you are not internalising the given
probabilities. We will explore this further in the following example.
Example 12 (The St. Petersburg Paradox, Bernoulli 1713). A coin is tossed repeat-
edly until it comes up heads. The player then obtains 2^n currency units, where
n ∈ {1, 2, . . .} is the number of times the coin was thrown. The coin is assumed to be
fair, meaning that the probability of heads is always 1/2.
How many currency units k would you be willing to pay to play this game once?
Were your utility function linear, you would be willing to pay any finite amount k
to play, as the expected utility for playing the game for any finite k is
Σ_{n=1}^∞ U (2^n − k) 2^{−n} = ∞.
It would be safe to assume that very few readers would be prepared to pay an
arbitrarily high amount to play this game. One way to explain this is that the
utility function is not linear. An alternative would be to assume a logarithmic
utility function. For example, if we also assume that the player has an initial
capital C from which k has to be paid, the expected utility of playing would be

E U = Σ_{n=1}^∞ ln(C + 2^n − k) 2^{−n} .
Then for C = 10, 000 the maximum bet would be 14. For C = 100 it would be
6, while for C = 10 it is just 4.
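These figures can be checked numerically. The sketch below truncates the infinite sum at a large n (the terms decay geometrically, so this is a harmless approximation) and searches for the largest bet k that a player with logarithmic utility and capital C would accept; the exact value may differ by a unit or so depending on the convention used at indifference:

```python
import math

def expected_log_utility(C, k, n_max=200):
    # Truncated version of sum_n ln(C + 2^n - k) 2^{-n}; later terms are negligible.
    return sum(math.log(C + 2.0**n - k) * 2.0**(-n) for n in range(1, n_max + 1))

def max_bet(C):
    # Largest integer bet k for which playing is no worse than keeping the capital C.
    k = 0
    while k + 1 <= C and expected_log_utility(C, k + 1) >= math.log(C):
        k += 1
    return k

for C in [10, 100, 10000]:
    print(C, max_bet(C))
```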
There is another reason why one may not pay an arbitrary amount to play
this game. The player may not fully internalise the fact (or rather, the promise)
that the coin is unbiased. Another explanation would be that it is not really
believed that the bank can pay an unbounded amount of money. Indeed, if the
bank can only pay amounts up to 2^N , in the linear expected utility scenario, for
a coin with probability p of coming heads, we have

Σ_{n=1}^N 2^n p^{n−1} (1 − p) = 2(1 − p) (1 − (2p)^N ) / (1 − 2p).

For large N and p = 0.45, it turns out that you should only expect a payoff of
about 10 currency units. Similarly, for a fair coin and a bank that can pay at most
2^N = 1024 units, you should only pay around 10 units as well. These are possible
subjective beliefs that an individual might have that would influence their behaviour
when dealing with a formally specified decision problem.
Example 13. Let ⟨a, b⟩ denote a lottery ticket that yields a or b CU with equal
probability. Consider the following sequence:
1. Find x1 such that receiving x1 CU with certainty is equivalent to receiving ⟨a, b⟩.
2. Find x2 such that receiving x2 CU with certainty is equivalent to receiving ⟨a, x1 ⟩.
Example 14. If the utility function is convex, then we would prefer obtaining a ran-
dom reward x rather than a fixed reward y = E(x). Thus, a convex utility function
implies risk-taking. This is illustrated by Figure 2.1, which shows a linear function, x,
a convex function, e^x − 1, and a concave function, ln(x + 1).
For concave functions, the reverse of Jensen's inequality holds (i.e., with ≥
replaced with ≤). If the utility function is concave, then we choose a gamble
giving a fixed reward E[x] rather than one giving a random reward x. Con-
sequently, a concave utility function implies risk aversion. The act of buying
insurance can be related to the concavity of our utility function. Consider the
following example, where we assume individuals are risk-averse, but insurance
companies are risk-neutral.
Figure 2.1: The linear function x, the convex function e^x − 1, and the concave function ln(x + 1), plotted for x ∈ [0, 1].
Figure 2.2: Here are two very simple decision diagrams. In the first, the DM
selects the decision a, which determines the reward r, which determines the
utility U . In the second, the DM only chooses a distribution P , which determines
the rewards.
Example 15 (Insurance). Let x be the insurance cost, h our insurance cover, ε the
probability of needing the cover, and U an increasing utility function (for monetary
values). Then we are going to buy insurance if the utility of losing x with certainty is
greater than the expected utility of losing h with probability ε.
The company has a linear utility, and fixes the premium x high enough so that
x > εh, i.e., so that its expected profit from each contract is positive.
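A minimal sketch of the customer's decision, with a logarithmic (hence concave, risk-averse) utility and illustrative numbers that are not taken from the text:

```python
import math

def buys_insurance(V, x, d, h, eps):
    # Buy iff losing the premium d with certainty is preferred to
    # risking the loss h with probability eps.
    return V(x - d) > eps * V(x - h) + (1 - eps) * V(x)

V = math.log  # a concave utility of wealth
print(buys_insurance(V, x=100.0, d=2.0, h=90.0, eps=0.02))   # True
print(buys_insurance(V, x=100.0, d=10.0, h=90.0, eps=0.02))  # False: premium too high
```

In the first case the premium d = 2 exceeds the expected loss εh = 1.8, so a risk-neutral insurer profits in expectation while the risk-averse customer still prefers to buy.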
Choice nodes, denoted by squares. These are nodes whose values the
decision maker can directly choose. Sometimes there is more than one
decision maker involved.
Value nodes, denoted by diamonds. These are the nodes that the decision
maker is interested in influencing. The utility of the decision maker is
always a function of the value nodes.
Circle nodes are used to denote all other types of variables. These include
deterministic, stochastic, known or unknown variables.
While a full theory of graphical models and decision diagrams is beyond the
scope of this book, we will make use of them frequently to visualise dependencies
between variables in simple problems.
2.4 Exercises
Exercise 3. Preferences are transitive if they are induced by a utility function U :
R → R such that a ≻∗ b iff U (a) > U (b). Give an example of a utility function, not
necessarily mapping the rewards R to R, and a binary relation > such that transitivity
can be violated. Back your example with a thought experiment.
Exercise 6. Consider two urns, each containing red and blue balls. The first urn
contains an equal number of red and blue balls. The second urn contains a randomly
chosen proportion X of red balls, i.e., the probability of drawing a red ball from that
urn is X.
1. Suppose that you were to select an urn, and then choose a random ball from
that urn. If the ball is red, you win 1 CU, otherwise nothing. Show that if your
utility function is increasing with monetary gain, you should prefer the first urn
iff E(X) < 1/2.
2. Suppose that you were to select an urn, and then choose n random balls from
that urn and that urn only. Each time you draw a red ball, you gain 1 CU.
After you draw a ball, you put it back in the urn. Assume that the utility U
is strictly concave and suppose that E(X) = 1/2. Show that you should always
select balls from the first urn.
Hint: Show that for the second urn, E(U | x) is concave for 0 ≤ x ≤ 1 (this can be
done by showing d²/dx² E(U | x) < 0). In fact,

d²/dx² E(U | x) = n(n − 1) Σ_{k=0}^{n−2} (n−2 choose k) [U (k) − 2U (k + 1) + U (k + 2)] x^k (1 − x)^{n−2−k} .
measure that shall satisfy the following basic properties of a probability measure:
(a) null probability: P (∅ | B) = 0, (b) total probability: P (Ω | B) = 1, (c) union of
disjoint subsets: P (A1 ∪A2 | B) = P (A1 | B)+P (A2 | B), (d) conditional probability:
P (A | D) ≤ P (B | D) if and only if P (A ∩ D) ≤ P (B ∩ D).
Exercise 10 (Rational Arthur-Merlin games). You are Arthur, and you wish to pay
Merlin to do a very difficult computation for you. More specifically, you perform a
query q ∈ Q and obtain an answer r ∈ R from Merlin. After he gives you the answer,
you give Merlin a random amount of money m, depending on r, q. In particular, it
is assumed that there exists a unique correct answer r∗ = f (q) and that E(m | r, q) =
Σ_m m P (m | r, q) is maximised by r∗ , i.e., for any r ≠ r∗ we have E(m | r∗ , q) > E(m | r, q).
Assume that Merlin knows P and the function f . Is this sufficient to incentivise Merlin
to respond with the correct answer? If not, what other assumptions or knowledge are
required?
Exercise 11. Assume that you need to travel over the weekend. You wish to decide
whether to take the train or the car. Assume that the train and the car trip cost exactly
the same amount of money. The train trip takes 2 hours. If it does not rain, then
the car trip takes 1.5 hours. However, if it rains the road becomes both more slippery
and more crowded and so the average trip time is 2.5 hours. Assume that your utility
function is equal to the negative amount of time spent travelling: U (t) = −t.
1. Let it be Friday. What is the expected utility of taking the car on Sunday? What
is the expected utility of taking the train on Sunday? What is the Bayes-optimal
decision, assuming you will travel on Sunday?
2. Let it be a rainy Saturday, which we denote by event A. What is your posterior
probability over the two weather stations, given that it has rained, i.e., P (Hi |
A)? What is the new marginal probability of rain on Sunday, i.e., P (B | A)?
What is now the expected utility of taking the car versus taking the train on
Sunday? What is the Bayes-optimal decision?
Exercise 12. Consider the previous example with a nonlinear utility function.
1. One example is U (t) = 1/t, which is a convex utility function. How would you
interpret the utility in that case? Without performing the calculations, can you
tell in advance whether your optimal decision can change? Verify your answer
by calculating the expected utility of the two possible choices.
2. How would you model a problem where the objective involves arriving in time
for a particular appointment?
Chapter 3
Decision problems
3.1 Introduction
In this chapter we describe how to formalise statistical decision problems. These
involve making decisions whose utility depends on an unknown state of the
world. In this setting, it is common to assume that the state of the world is
a fundamental property that is not influenced by our decisions. However, we
can calculate a probability distribution for the state of the world, using a prior
belief and some data, where the data we obtain may depend on our decisions.
A classical application of this framework is parameter estimation. Therein,
we stipulate the existence of a parameterised law of nature, and we wish to
choose a best-guess set of parameters for the law through measurements and
some prior information. An example would be determining the gravitational
attraction constant from observations of planetary movements. These measure-
ments are always obtained through experiments, the automatic design of which
will be covered in later chapters.
The decisions we make will necessarily depend on both our prior belief and
the data we obtain. In the last section of this chapter, we will examine how sensitive
our decisions are to the prior, and how we can choose it so that our decisions
are robust.
more precise by choosing an appropriate utility function. For example, we might put
a value of −1 for carrying the umbrella and a value of −10 for getting wet. In this
example, the only events of interest are whether it rains or not.
P(ω ∈ A) = P (A).
Since the random outcome ω does not depend on our decision a, we must
find a way to connect the two. This can be formalised via a reward function, so
that the reward that we obtain (whether we get wet or not) depends on both
our decision (to take the umbrella) and the random outcome (whether it rains).
r = ρ(ω, a).
Example 17 (Rock paper scissors). Consider the simple game of rock paper scissors,
where your opponent plays a move at the same time as you, so that you cannot
influence his move. The opponent’s moves are Ω = {ωR , ωP , ωS } for rock, paper,
scissors, respectively, which also corresponds to your decision set A = {aR , aP , aS }.
The reward set is R = {Win, Draw, Lose}.
You have studied your opponent for some time and you believe that he is most
likely to play rock P (ωR ) = 3/6, somewhat likely to play paper P (ωP ) = 2/6, and less
likely to play scissors: P (ωS ) = 1/6.
What is the probability of each reward, for each decision you make? Taking the
example of aR , we see that you win if the opponent plays scissors with probability
1/6, you lose if the opponent plays paper (2/6), and you draw if he plays rock (3/6).
Figure 3.1: Decision diagrams for the combined and separated formulation of the
decision problem. Squares denote decision variables, diamonds denote utilities.
All other variables are denoted by circles. Arrows denote the flow of dependency.
In the same way, we obtain a reward distribution for every decision.
Of course, what you play depends on your own utility function. If you prefer winning
over drawing or losing, you could for example have the utility function U (Win) = 1,
U (Draw) = 0, U (Lose) = −1. Then, since Ea U = Σ_{ω∈Ω} U (ω, a) Pa (ω), we have

E_{aR} U = −1/6,    E_{aP} U = 2/6,    E_{aS} U = −1/6,

so expected utility is maximised by playing paper.
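The same calculation, written out as a short Python sketch (this only restates the numbers of the example):

```python
# Opponent model P(ω) and utilities for rock-paper-scissors (Example 17).
P = {"rock": 3/6, "paper": 2/6, "scissors": 1/6}
U = {"Win": 1, "Draw": 0, "Lose": -1}
beats = {"rock": "scissors", "paper": "rock", "scissors": "paper"}

def outcome(my_move, their_move):
    if my_move == their_move:
        return "Draw"
    return "Win" if beats[my_move] == their_move else "Lose"

# Expected utility of each decision a: E_a U = sum_ω U(ρ(ω, a)) P(ω).
for a in P:
    eu = sum(U[outcome(a, w)] * P[w] for w in P)
    print(a, round(eu, 3))   # rock: -1/6, paper: 2/6, scissors: -1/6
```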
The above example illustrates that every decision that we make creates a
corresponding probability distribution on the rewards. While the outcome of
the experiment is independent of the decision, the distribution of rewards is
effectively chosen by our decision.
Expected utility
The expected utility of any decision a ∈ A under P is:
E_{Pa} (U ) = ∫_R U (r) dPa (r) = ∫_Ω U [ρ(ω, a)] dP (ω),

U (P, a) ≜ E_{Pa} U.
The above equation requires that the following technical assumption is sat-
isfied. As usual, we employ the expected utility hypothesis (Assumption 2.3.2).
Thus, we should choose the decision that results in the highest expected utility.
Assumption 3.2.3. The sets {ω | ρ(ω, a) ∈ B} belong to FΩ . That is, ρ is
FΩ -measurable for any a.
The dependency structure of this problem in either formulation can be vi-
sualised in the decision diagram shown in Figure 3.1.
Example 18 (Continuation of Example 16). You are going to work, and it might rain.
The forecast said that the probability of rain (ω1 ) was 20%. What do you do?
a1 : Take the umbrella.
a2 : Risk it!
The reward of a given outcome and decision combination, as well as the respective
utility is given in Table 3.1.
ρ(ω, a)        a1                        a2
ω1             dry, carrying umbrella    wet
ω2             dry, carrying umbrella    dry

U [ρ(ω, a)]    a1     a2
ω1             −1    −10
ω2             −1      0
EP (U | a)     −1     −2

Table 3.1: Rewards, utilities, expected utility for 20% probability of rain.
Value nodes, denoted by diamonds. These are the nodes that the decision
maker is interested in influencing. The utility of the decision maker is
always a function of the value nodes.
Circle nodes are used to denote all other types of variables. These include
deterministic, stochastic, known or unknown variables.
The nodes are connected via directed edges. These denote the dependencies
between nodes. For example, in Figure 3.1(b), the reward is a function of both
ω and a, i.e., r = ρ(ω, a), while ω depends only on the probability distribution
P . Typically, there must be a path from a choice node to a value node, otherwise
nothing the decision maker can do will influence its utility. Nodes belonging to
or observed by different players will usually be denoted by different lines or
colors. In Figure 3.1(b), ω, which is not observed, is shown in a lighter color.
Example 19 (Voting). Assume you wish to estimate the number of votes for different
candidates in an election. The unknown parameters of the problem mainly include:
the percentage of likely voters in the population, the probability that a likely voter is
going to vote for each candidate. One simple way to estimate this is by polling.
Consider a nation with k political parties. Let ω = (ω1 , . . . , ωk ) ∈ [0, 1]k be the
voting proportions for each party. We wish to make a guess a ∈ [0, 1]k . How should
we guess, given a distribution P (ω)? How should we select U and ρ? This depends on
what our goal is, when we make the guess.
If we wish to give a reasonable estimate about the votes of all the k parties, we can
use the squared error: First, set the error vector r = (ω1 − a1 , . . . , ωk − ak ) ∈ [−1, 1]^k .
Then we set U (r) ≜ −‖r‖² , where ‖r‖² = Σ_i |ωi − ai |² .
If on the other hand, we just want to predict the winner of the election, then the
actual percentages of all individual parties are not important. In that case, we can set
r = 1 if arg max_i ωi = arg max_i ai and 0 otherwise, and U (r) = r.
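To see how the two utility functions can disagree, here is a small sketch with invented numbers (none of these figures come from the text):

```python
import numpy as np

omega = np.array([0.42, 0.38, 0.20])   # hypothetical true voting proportions
a = np.array([0.40, 0.41, 0.19])       # our guess

# Squared-error utility: U(r) = -||r||^2 with r = omega - a.
u_squared = -np.sum((omega - a) ** 2)

# Winner-prediction utility: 1 if we name the correct winner, 0 otherwise.
u_winner = float(np.argmax(omega) == np.argmax(a))

print(u_squared)   # -0.0014: the guess is numerically very close
print(u_winner)    # 0.0: but it predicts the wrong winner
```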
Given the above, instead of the expected utility, we consider the ex-
pected loss, or risk.
The maximisation over decisions is usually not easy. However, there exist
a few cases where it is relatively simple. The first of those is when the utility
function is the negative squared error.
The decision a = EP (ω)
maximises the expected utility U (P, a), under the technical assumption that
∂/∂a ‖ω − a‖² is measurable with respect to FR .
Taking derivatives, due to the measurability assumption, we can swap the order
of differentiation and integration and obtain
∂/∂a ∫_Ω ‖ω − a‖² dP (ω) = ∫_Ω ∂/∂a ‖ω − a‖² dP (ω)
                         = 2 ∫_Ω (a − ω) dP (ω)
                         = 2 ∫_Ω a dP (ω) − 2 ∫_Ω ω dP (ω)
                         = 2a − 2 E(ω).
Setting the derivative equal to 0 and noting that the utility is concave, we see
that the expected utility is maximised for a = EP (ω).
Another simple example is the negative absolute error, where U (ω, a) = −|ω − a|. The
solution in this case differs significantly from the squared error. As can be
seen from Figure 3.2(a), for absolute loss, the optimal decision is to choose the
a that is closest to the most likely ω. Figure 3.2(b) illustrates the finding of
Theorem 3.3.1.
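A quick numerical illustration of the two cases, for a two-point outcome space Ω = {0, 1}; the probability 0.75 is an arbitrary choice for the sketch:

```python
import numpy as np

p1 = 0.75                       # P(ω = 1); P(ω = 0) = 1 - p1
grid = np.linspace(0, 1, 1001)  # candidate decisions a in [0, 1]

eu_quadratic = -((1 - p1) * grid**2 + p1 * (1 - grid)**2)
eu_absolute = -((1 - p1) * np.abs(grid) + p1 * np.abs(1 - grid))

print(grid[np.argmax(eu_quadratic)])  # 0.75 = E(ω), as in Theorem 3.3.1
print(grid[np.argmax(eu_absolute)])   # 1.0, the value closest to the most likely ω
```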
Given two probability measures P, Q on Ω and α ∈ [0, 1], we write

Zα ≜ αP + (1 − α)Q    (3.3.2)

to mean the probability measure such that Zα (A) = αP (A) + (1 − α)Q(A) for
any A ∈ FΩ . For any fixed choice a, the expected utility varies linearly with α:
Remark 3.3.1 (Linearity of the expected utility). If Zα is as defined in (3.3.2),
then, for any a ∈ A,

U (Zα , a) = α U (P, a) + (1 − α) U (Q, a).
Figure 3.2: Expected utility curves for different values of P (ω = 0), as the decision a varies in [0, 1]; panel (a) shows the absolute error, panel (b) the quadratic error.
Proof.

U (Zα , a) = ∫_Ω U (ω, a) dZα (ω)
           = α ∫_Ω U (ω, a) dP (ω) + (1 − α) ∫_Ω U (ω, a) dQ(ω)
           = α U (P, a) + (1 − α) U (Q, a).
U ∗ [Zα ] ≤ α U ∗ (P ) + (1 − α) U ∗ (Q).
Figure 3.3: Expected utility lines of fixed decisions and the Bayes-optimal utility U ∗ (P ), as the prior P varies.
Proof. From the definition of the expected utility (3.3.1), for any decision a ∈ A,
U (Zα , a) = α U (P, a) + (1 − α) U (Q, a) ≤ α U ∗ (P ) + (1 − α) U ∗ (Q). Taking the
supremum over a on the left-hand side gives the result.

As we have proven, the expected utility is linear with respect to Zα . Thus, for
any fixed action a we obtain a line as those shown in Fig. 3.3. By Theorem 3.3.2,
the Bayes-optimal utility is convex. Furthermore, the minimising decision for
any Zα is tangent to the Bayes-optimal utility at the point (Zα , U ∗ (Zα )). If
we take a decision that is optimal with respect to some Z, but the distribution
is in fact Q ≠ Z, then we are not far from the optimal, if Q, Z are close and
U ∗ is smooth. Consequently, we can trivially lower bound the Bayes utility by
examining any arbitrary finite set of decisions Â ⊆ A. That is,

U ∗ (P ) ≥ max_{a∈Â} U (P, a),

while a corresponding upper bound
holds due to convexity. The two bounds suggest an algorithm for successive ap-
proximation of the Bayes-optimal utility, by looking for the largest gap between
the lower and the upper bounds.
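A sketch of the two bounds for a small made-up decision problem (the utility matrix below is purely illustrative); the lower bound uses a finite subset of decisions and the upper bound uses convexity between the two extreme priors:

```python
import numpy as np

# Utility matrix U[ω, a] for two outcomes and three decisions (invented numbers).
U = np.array([[0.0, -0.4, -1.0],    # ω = 0
              [-1.0, -0.5, 0.0]])   # ω = 1

def expected_U(p1, a):
    # U(P, a) for the prior with P(ω = 1) = p1.
    return (1 - p1) * U[0, a] + p1 * U[1, a]

p = 0.3
U_star = max(expected_U(p, a) for a in range(U.shape[1]))   # exact Bayes-optimal utility

A_hat = [0, 2]                                   # a finite subset of decisions
lower = max(expected_U(p, a) for a in A_hat)     # U*(P) >= lower

# Convexity: U*((1 - p) δ_0 + p δ_1) <= (1 - p) U*(δ_0) + p U*(δ_1).
upper = (1 - p) * U[0].max() + p * U[1].max()

print(lower, U_star, upper)   # lower <= U*(P) <= upper
```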
This theorem should not be applied naively. It only states that if we know
P then the expected utility of the best fixed/deterministic decision a∗ ∈ A
cannot be increased by randomising between decisions. For example, it does
not make sense to apply this theorem to cases where P itself is completely
or partially unknown (e.g., when P is chosen by somebody else and its value
remains hidden to us).
U (ω, a)          a1     a2
ω1                −1      0
ω2                10      1
E(U | P, a)      4.5    0.5
minω U (ω, a)     −1      0

Table 3.2: Utility function, expected utility and maximin utility of Example 21.
can essentially be seen as how much utility we would be able to obtain, if we were
to make a decision a first, and nature were to select an adversarial decision ω
later.
On the other hand, the minimax value is U ∗ ≜ min_ω max_a U (ω, a) = max_a U (ω∗ , a),
where ω∗ ≜ arg min_ω max_a U (ω, a) is the worst-case choice nature could make,
if we were to select our own decision a after its choice was revealed to us.
To illustrate this, let us consider the following example.
Example 21. You consider attending an open air concert. The weather forecast re-
ports 50% probability of rain. Going to the concert (a1 ) will give you a lot of pleasure
if it doesn’t rain (ω2 ), but in case of rain (ω1 ) you actually would have preferred to stay
at home (a2 ). Since in general you prefer nice weather to rain, you prefer ω2 to ω1
even if you decide not to go. The reward of a given outcome-decision combination, as
well as the respective utility is given in Table 3.2. We see that a1 maximises expected
utility. However, under a worst-case assumption this is not the case, i.e., the maximin
solution is a2 .
U ∗ ≥ U (ω ∗ , a∗ ) ≥ U∗ . (3.4.1)
L(ω, a)          a1     a2
ω1                1      0
ω2                0      9
E(L | P, a)      0.5    4.5
maxω L(ω, a)      1      9

Table 3.3: Regret for Example 21.
Regret Instead of calculating the expected utility for each possible decision,
we could instead calculate how much utility we would have obtained if we had
made the best decision in hindsight. Consider, for example the problem in
Table 3.2. There the optimal action is either a1 or a2 , depending on whether
we accept the probability P over Ω, or adopt a worst-case approach. However,
after we make a specific decision, we can always look at the best decision we
could have made given the actual outcome ω.
L(ω, a) ≜ max_{a′} U (ω, a′ ) − U (ω, a).
As an example let us revisit Example 21. Given the regret of each decision-
outcome pair, we can determine the decision minimising expected regret E(L |
P, a) and minimising maximum regret maxω L(ω, a), analogously to expected
utility and minimax utility. Table 3.3 shows that the choice minimising regret
either in expectation or in the minimax sense is a1 (going to the concert). Note
that this is a different outcome than before when we considered utility, which
shows that the concept of regret may result in different decisions.
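The regret table can be computed mechanically from the utility table; a minimal sketch using the numbers of Example 21:

```python
import numpy as np

# Utility U[ω, a] from Table 3.2 and the prior P(ω1) = P(ω2) = 0.5.
U = np.array([[-1.0, 0.0],    # ω1: rain
              [10.0, 1.0]])   # ω2: no rain
P = np.array([0.5, 0.5])

# Regret L(ω, a) = max_a' U(ω, a') - U(ω, a).
L = U.max(axis=1, keepdims=True) - U

expected_regret = P @ L             # [0.5, 4.5]: minimised by a1
worst_case_regret = L.max(axis=0)   # [1.0, 9.0]: also minimised by a1
print(L)
print(expected_regret, worst_case_regret)
```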
Figure 3.4: Simultaneous two-player stochastic game. The first player (nature)
chooses ω, and the second player (the decision maker) chooses a. Then the
second player obtains utility U (ω, a).
Figure 3.5: Simultaneous two-player stochastic game. The first player (nature)
chooses ξ, and the second player (the decision maker) chooses σ. Then ω ∼ ξ
and a ∼ σ and the second player obtains utility U (ω, a).
shown in Figure 3.4. There are other variations of such games, however. For
example, their moves may be revealed after they have played. This is important
in the case where the game is played repeatedly. However, what is usually
revealed is not the belief ξ, which is something assumed to be internal to player
one, but ω, the actual decision made by the first player. In other cases, we
might have that U itself is not known, and we only observe U (ω, a) for the
choices made.
Minimax utility, regret and loss In the following we again consider strate-
gies as defined in Definition 3.4.1. If the decision maker knows the outcome,
then the additional flexibility gained by randomising over actions does not help.
As we showed for the general case of a distribution over Ω, a simple decision is
as good as any randomised strategy:
Remark 3.4.1. For each ω, there is some a ∈ A such that U (ω, a) ≥ U (ω, σ) for all strategies σ.    (3.4.2)
What follows are some rather trivial remarks connecting regret with utility
in various cases.
Remark 3.4.2. L(ω, σ) = Σ_a σ(a) L(ω, a) ≥ 0.
Proof.

L(ω, σ) = max_{σ′} U (ω, σ′ ) − U (ω, σ) = max_{σ′} U (ω, σ′ ) − Σ_a σ(a) U (ω, a)
        = Σ_a σ(a) [max_{σ′} U (ω, σ′ ) − U (ω, a)] ≥ 0.
Remark 3.4.3. L(ω, σ) = max_a U (ω, a) − U (ω, σ).
U         ω1     ω2
a1         1     −1
a2         0      0

Table 3.4: Utility function for Example 22.
Proof. As (3.4.2) shows, for any fixed ω, the best decision is always deterministic, so that

Σ_{a′} σ(a′ ) L(ω, a′ ) = Σ_{a′} σ(a′ ) [max_{a∈A} U (ω, a) − U (ω, a′ )]
                        = max_{a∈A} U (ω, a) − Σ_{a′} σ(a′ ) U (ω, a′ ).
Example 22 (An even-money bet). Consider the decision problem described in Table 3.4. The respective loss is given in Table 3.5.

L         ω1     ω2
a1         0      1
a2         1      0

Table 3.5: The loss (regret) for Example 22.

The maximum regret of a strategy σ can be written as

max_ω L(ω, σ) = max_ω Σ_i σ(ai ) L(ω, ai ) = max_ω Σ_i σ(ai ) I {ω ≠ ωi } ,

since L(ω, ai ) = 1 when ω ≠ ωi and 0 otherwise. Note that max_ω L(ω, σ) ≥ 1/2 and that
equality is obtained iff σ(a1 ) = σ(a2 ) = 1/2, giving minimax regret L∗ = 1/2.
where the solution exists as long as A and Ω are finite, which we will assume
in the following.
Expected regret
We can now define the expected regret for a given pair of distributions ξ, σ
as
L(ξ, σ) = max_{σ′} Σ_ω ξ(ω) [U (ω, σ′ ) − U (ω, σ)] = max_{σ′} U (ξ, σ′ ) − U (ξ, σ).
The minimax and maximin values do not always coincide. The following
theorem gives a condition under which the game does have a value.
Theorem 3.4.2. If there exist distributions ξ∗ , σ∗ and C ∈ R such that

U (ξ∗ , σ) ≤ C ≤ U (ξ, σ∗ )  for all ξ, σ,

then

U ∗ = U∗ = U (ξ∗ , σ∗ ) = C.

Proof. Since C ≤ U (ξ, σ∗ ) for all ξ we have

C ≤ min_ξ U (ξ, σ∗ ) ≤ max_σ min_ξ U (ξ, σ) = U∗ .

Similarly,

C ≥ max_σ U (ξ∗ , σ) ≥ min_ξ max_σ U (ξ, σ) = U ∗ .

Since U∗ ≤ U ∗ always holds, the two inequalities give U∗ = U ∗ = C, and the value is attained at (ξ∗ , σ∗ ).
where everything has been written in matrix form. In fact, one can show that
vξ = vσ , thus obtaining Theorem 3.4.3.
To understand the connection of two-person games with Bayesian decision
theory, take another look at Figure 3.3, seeing the risk as negative expected
utility, or as the opponent’s gain. Each of the decision lines represents nature’s
gain as she chooses different prior distributions, while we keep our policy σ fixed.
The bottom horizontal line that would be tangent to the Bayes-optimal utility
curve would be minimax: if nature were to change priors, it would not increase
its gain, since the line is horizontal. On the other hand, if we were to choose a
different tangent line, we would only increase nature’s gain (and decrease our
utility).
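For finite Ω and A the maximin strategy can be found by linear programming, as suggested by the matrix formulation above. The following sketch is only an illustration (it uses scipy and the utility matrix of Example 21; it is not the book's own code): it maximises the guaranteed value v subject to the strategy σ achieving at least v against every ω.

```python
import numpy as np
from scipy.optimize import linprog

# Utility matrix U[ω, a]: rows are outcomes, columns are decisions (Example 21).
U = np.array([[-1.0, 0.0],
              [10.0, 1.0]])
n_omega, n_a = U.shape

# Variables x = (σ(a_1), ..., σ(a_n), v).  Maximise v, i.e. minimise -v.
c = np.zeros(n_a + 1)
c[-1] = -1.0

# Constraints v - Σ_a σ(a) U(ω, a) <= 0 for every ω.
A_ub = np.hstack([-U, np.ones((n_omega, 1))])
b_ub = np.zeros(n_omega)

# Σ_a σ(a) = 1, σ(a) >= 0, v unrestricted.
A_eq = np.array([[1.0] * n_a + [0.0]])
b_eq = np.array([1.0])
bounds = [(0, 1)] * n_a + [(None, None)]

res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
sigma, value = res.x[:n_a], res.x[-1]
print(sigma, value)   # here the maximin strategy is the pure action a2, with value 0
```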
P ≜ {Pω | ω ∈ Ω} .
Now, consider the case where we take an observation x from the true model Pω∗
before making a decision. We can represent the dependency of our decision on
the observation by making our decision a function of x.
This is the standard Bayesian framework for decision making. It may be slightly
more intuitive in some cases to use the notation ψ(x | ω), in order to emphasize
that this is a conditional distribution. However, there is no technical difference
between the two notations.
When the set of policies includes all constant policies, then there is a policy
π ∗ at least as good as the best fixed decision a∗ . This is formalized in the
following remark.
Remark 3.5.1. Let Π denote a set of policies π : S → A. If for each a ∈ A there
is a π ∈ Π such that π(x) = a ∀x ∈ S, then maxπ∈Π Eξ (U | π) ≥ maxa∈A Eξ (U |
a).
Proof. The proof follows by setting Π0 to be the set of constant policies. The
result follows since Π0 ⊂ Π.
The expected utility of a policy π under the prior ξ can be written as

U (ξ, π) = ∫_Ω U (ω, π) dξ(ω),    U ∗ (ξ) ≜ sup_π U (ξ, π) = U (ξ, π∗ ).
We wish to construct the Bayes decision rule, that is, the policy with maximal
ξ-expected utility. However, doing so by examining all possible policies is cum-
bersome, because (usually) there are many more policies than decisions. It is
however, easy to find the Bayes decision for each possible observation. This is
because it is usally possible to rewrite the expected utility of a policy in terms
of the posterior distribution. While this is trivial to do when the outcome and
observation spaces are finite, it can be extended to the general case as shown in
the following theorem.
U (ξ, π) = E {U [ω, π(x)]} = ∫_Ω ∫_S U [ω, π(x)] dPω (x) dξ(ω),

which can be rewritten in terms of the posterior as

U (ξ, π) = ∫_S ∫_Ω U [ω, π(x)] dξ(ω | x) dPξ (x),    (3.5.2)

where Pξ (x) = ∫_Ω Pω (x) dξ(ω).
U (ξ, π) = ∫_Ω ∫_S U [ω, π(x)] p(x | ω) p(ω) dν(x) dµ(ω)
         = ∫_Ω ∫_S h(ω, x) dν(x) dµ(ω),

where h(ω, x) ≜ U [ω, π(x)] p(x | ω) p(ω).
prior distribution on the family as well as a training data set, and we wish to
classify optimally according to our belief. In the last form of the problem, we
are given a set of policies π : X → Y and we must choose the one with highest
expected performance. The two latter forms of the problem are equivalent when
the set of policies contains all Bayes decision rules for a specific model family.
Ut ≜ I {yt = at } .
The probability P (yt | xt ) is the posterior probability of the class given the
observation xt . If we wish to maximise expected utility, we can simply choose
at ∈ arg max_y P (yt = y | xt ).
This defines a particular, simple policy. In fact, for two-class problems with
Y = {0, 1}, such a rule can often be visualised as a decision boundary in X , on
one side of which we decide for class 0 and on the other side for class 1.
ξ(ω | S) = Pω (y1 , . . . , yn | x1 , . . . , xn ) ξ(ω) / Σ_{ω′∈Ω} Pω′ (y1 , . . . , yn | x1 , . . . , xn ) ξ(ω′ ),

which we can then use to calculate the policy that maximises our expected utility. For a given ω, we
can indeed compute

U (ω, π) = Σ_{x,y} U (y, π(x)) Pω (y | x) Pω (x).
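A schematic version of this classification computation for a toy family of two models (everything below, including the models themselves, is an invented illustration rather than an example from the text):

```python
import numpy as np

# Two candidate models ω ∈ {0, 1} for binary features x and labels y.
# Each model specifies P_ω(y = 1 | x); the marginal of x is assumed identical
# across models, so it cancels in the posterior.
p_y1_given_x = {0: np.array([0.2, 0.9]),   # model ω = 0
                1: np.array([0.6, 0.4])}   # model ω = 1
prior = np.array([0.5, 0.5])

# Posterior over models given a small training set S of (x, y) pairs.
S = [(0, 0), (1, 1), (1, 1), (0, 0)]
likelihood = np.array([
    np.prod([p_y1_given_x[w][x] if y == 1 else 1 - p_y1_given_x[w][x] for x, y in S])
    for w in (0, 1)
])
posterior = prior * likelihood
posterior /= posterior.sum()

# Bayes classification rule: for each x, choose the label with highest posterior probability.
for x in (0, 1):
    p_y1 = np.dot(posterior, [p_y1_given_x[w][x] for w in (0, 1)])
    print(x, int(p_y1 > 0.5), round(p_y1, 3))
```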
Any policy, when applied to large-scale, real world problems, has certain exter-
nalities. This implies that considering only the decision maker’s utility is not
sufficient. One such issue is fairness.
This concerns desirable properties of policies applied to a population of in-
dividuals. For example, college admissions should be decided on variables that
inform about merit, but fairness may also require taking into account the fact
that certain communities are inherently disadvantaged. At the same time, a
person should not feel that someone else in a similar situation obtained an un-
fair advantage. All this must be taken into account while still caring about
optimizing the decision maker’s utility function. As another example, consider
mortgage decisions: while lenders should take into account the creditworthiness
of individuals in order to make a profit, society must ensure that they do not
unduly discriminate against socially vulnerable groups.
Recent work in fairness for statistical decision making in the classifica-
tion setting has considered two main notions of fairness. The first uses (con-
ditional) independence constraints between a sensitive variable (such as eth-
nicity) and other variables (such as decisions made). The second type en-
sures that decisions are meritocratic, so that better individuals are favoured,
but also smoothness4 in order to avoid elitism. While a thorough discus-
sion of fairness is beyond the scope of this book, it is useful to note that
some of these concepts are impossible to strictly achieve simultaneously, but
may be approximately satisfied by careful design of the policy. The recent
work by Dwork et al. [2012], Chouldechova [2016], Corbett-Davies et al. [2017],
Kleinberg et al. [2016], Kilbertus et al. [2017], Dimitrakakis et al. [2017] goes
much more deeply into this topic.
4 More precisely, Lipschitz conditions on the policy.
Pω (xn | x^{n−1} ) = Pω (x^n ) / Pω (x^{n−1} ),

where x^n = (x1 , . . . , xn ) denotes the first n observations.
Posterior recursion

ξn (ω) ≜ ξ0 (ω | x^n ) = Pω (x^n ) ξ0 (ω) / Pξ0 (x^n )
       = ξn−1 (ω | xn ) = Pω (xn | x^{n−1} ) ξn−1 (ω) / Pξn−1 (xn | x^{n−1} ),
where ξt is the belief at time t. Here Pξn (· | ·) = ∫_Ω Pω (· | ·) dξn (ω) is the
marginal distribution with respect to the n-th posterior.
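For a finite Ω and independent observations, the recursion is easy to implement; the sketch below (with an invented family of Bernoulli models) updates the belief one observation at a time and checks that the result matches the batch posterior:

```python
import numpy as np

# A finite family of Bernoulli models and a uniform prior ξ_0.
omegas = np.array([0.2, 0.5, 0.8])
xi = np.ones(len(omegas)) / len(omegas)

data = [1, 1, 0, 1, 1]

for x_n in data:
    # ξ_n(ω) ∝ P_ω(x_n | x^{n-1}) ξ_{n-1}(ω); for i.i.d. models this is just P_ω(x_n).
    likelihood = omegas if x_n == 1 else 1 - omegas
    xi = likelihood * xi
    xi /= xi.sum()              # normalising by the marginal P_{ξ_{n-1}}(x_n)

# Batch computation ξ_0(ω | x^n) for comparison.
batch = omegas**4 * (1 - omegas)
batch /= batch.sum()
print(xi, batch)                # identical up to floating-point error
```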
3.6 Summary
In this chapter, we introduced a general framework for making decisions a ∈ A
whose optimality depends on an unknown outcome or parameter ω. We saw that,
3.7 Exercises
The first part of the exercises considers problems where we are simply given
some distribution over Ω. In the second part, the distribution is a posterior
distribution that depends on observations x.
Exercise 13. Assume ω is drawn from ξ with ξ(ω) = 1/11 for all ω ∈ Ω. Calculate and
plot the expected utility U (ξ, a) = Σ_ω ξ(ω) U (ω, a) for each a. Report max_a U (ξ, a).
(a) Calculate and plot the expected utility when π(a) = 1/11 for all a, reporting
values for all ω.
(b) Find max_π min_ξ U (ξ, π).
Hint: Use the linear programming formulation, adding a constant to the utility
matrix U so that all elements are non-negative.
Exercise 16. Consider the definition of rules that, for some ε > 0, select an a maximising

P ({ω : U (ω, a) > sup_{d′∈A} U (ω, d′ ) − ε}) .
Prove that this is indeed a statistical decision problem, i.e., it corresponds to max-
imising the expectation of some utility function.
F ≜ {fω | ω ∈ Ω} ,

such that fω is the binomial probability mass function with parameter ω (with
the number of draws n being implied). Consider the parameter set Ω = {0, 0.1, 0.2, . . . , 1}.
Let ξ be the uniform distribution on Ω, such that ξ(ω) = 1/11 for all ω ∈ Ω.
Further, let the decision set be A = [0, 1].
Exercise 17. What is the decision a∗ maximising U (ξ, a) = Σ_ω ξ(ω) U (ω, a) and what
is U (ξ, a∗ )?
Exercise 18. In the same setting, we now observe the sequence x = (x1 , x2 , x3 ) =
(1, 0, 1).
1. Plot the posterior distribution ξ(ω | x) and compare it to the posterior we would
obtain if our prior on ω was ξ ′ = Beta(2, 2).
2. Find the decision a∗ maximising the a posteriori expected utility
Eξ (U | a, x) = Σ_ω U (ω, a) ξ(ω | x).
Exercise 19. In the same setting, we consider nature to be adversarial. Once more,
we observe x = (1, 0, 1). Assume that nature can choose a prior among a set of priors
Ξ = {ξ1 , ξ2 }. Let ξ1 (ω) = 1/11 and ξ2 (ω) = ω/5.5 for each ω.
1. Calculate and plot the value for deterministic decisions a: min_{ξ∈Ξ} Eξ (U | a, x).
2. Find min_{ξ∈Ξ} max_{a∈A} Eξ (U | a).
Hint: Apart from the adversarial prior selection, this is very similar to the previous
exercise.
We now consider customers with some baseline income level x ∈ S. For sim-
plicity, we assume that the only possible income levels are S = {15, 20, 25, . . . , 60},
with K = 10 possible incomes. Let V : R → R denote the utility function of a
customer. Customers who are interested in the insurance product will buy it if
and only if:
V (x − d) > εV (x − h) + (1 − ε)V (x). (3.7.1)
We make the simplifying assumption that the utility function is the same for
all customers, and that it has the following form:
V (x) = ln x if x ≥ 1, and V (x) = 1 − (x − 2)² otherwise. (3.7.2)

Customers who are not interested in the insurance product will not buy it no
matter what the price.
There is some unknown probability distribution Pω (x) over the income level,
such that the probability of n people having incomes x^T = (x1 , . . . , xn ) is
Pω (x^T ) = Π_{i=1}^n Pω (xi ). We have two data sources for this. The first is a model
of the general population ω1 not working in the high-tech industry, and the second
is a model of employees in the high-tech industry, ω2 . The models are summarised
in the table below. Together, these two models form a family of distributions.

Income level     15   20   25   30   35   40   45   50   55   60
Pω1 (x) (%)       5   10   12   13   11   10    8   10   11   10
Pω2 (x) (%)       8    4    1    6   11   14   16   15   13   12
Exercise 20 (50). Show that the expected utility for a given ω is the expected gain
from a buying customer, times the probability that an interested customer will have
an income x such that they would buy our insurance.
U (ω, d) = (d − εh) Σ_{x∈S} Pω (x) I {V (x − d) > εV (x − h) + (1 − ε)V (x)} . (3.7.3)
Let h = 150 and ε = 10^{−3} . Plot the expected utility for varying d, for the two
possible ω. What is the optimal price level if the incomes of all interested customers
are distributed according to ω1 ? What is the optimal price level if they are distributed
according to ω2 ?
Exercise 21 (20). According to our intuition, customers interested in our product are
much more likely to come from the high-tech industry than from the general population.
For that reason, we have a prior probability ξ(ω1 ) = 1/4 and ξ(ω2 ) = 3/4 over the
parameters Ω of the family P. More specifically, and in keeping with our previous
ω∗ ∼ ξ (3.7.4)
x^T | ω = ω∗ ∼ Pω∗ (3.7.5)
That is, the data is drawn from one unknown model ω ∗ ∈ Ω. This can be thought
of as an experiment where nature randomly selects ω ∗ with probability ξ and then
generates the data from the corresponding model Pω∗ . Plot the expected utility under
this prior as the premium d varies. What is the optimal expected utility and premium?
Exercise 22 (20). Instead of fully relying on our prior, the company decides to per-
form a random survey of 1000 people. We asked whether they would be interested in
the insurance product (as long as the price is low enough). If they were interested,
we asked them what their income is. Only 126 people were interested, with income
levels given in Table 3.7. Each column of the table shows the stated income and
the number of people reporting it. Let xT = {x1 , x2 , . . .} be the set of data we have
Income    15   20   25   30   35   40   45   50   55   60
Number     7    8    7   10   15   16   13   19   17   14

Table 3.7: Reported incomes of the 126 interested respondents.
collected. Assuming that the responses are truthful, calculate the posterior prob-
ability ξ(ω | xT ), assuming that the only possible models of income distribution are
the two models ω1 , ω2 used in the previous exercises. Plot the expected utility under
the posterior distribution as d varies. What is the maximum expected utility we can
obtain?
Exercise 23 (30? – Bonus exercise ). Having only two possible models is somewhat
limiting, especially since neither of them might correspond to the income distribution
of people interested in our insurance product. How could this problem be rectified?
Describe the idea and implement it. When would you expect this to work better?
Assumption 3.7.2. When the smoking history of the patient is known, the
development of cancer or ACS are independent.
Tests. One can perform an ECG to test for ACS. An ECG test has sensitivity
of 66.6% (i.e. it correctly detects 2⁄3 of all patients that suffer from ACS), and a
specificity of 75% (i.e. 1⁄4 of patients that do not have ACS, still test positive).
An X-ray can diagnose lung cancer with a sensitivity of 90% and a specificity
of 90%.
Assumption 3.7.3. Repeated applications of a test produce the same result for
the same patient, i.e. that randomness is only due to patient variability.
Assumption 3.7.4. The existence of lung cancer does not affect the probability
that the ECG will be positive. Conversely, the existence of ACS does not affect
the probability that the X-ray will be positive.
The main problem we want to solve is how to perform experiments or tests, so as to:
diagnose the patient,
use as few resources as possible,
and make sure the patient lives.
This is a problem in experiment design. We start from the simplest case, and
look at a couple of examples where we only observe the results of some tests. We
then examine the case where we can select which tests to perform.
Exercise 24. In this exercise, we only worry about making inferences from different
tests results.
1. What does the above description imply about the dependencies between the
patient condition, smoking and test results? Draw a belief network for the
above problem, with the following events (i.e. variables that can be either true
or false)
A: ACS
C: Lung cancer.
S: Smoking
E: Positive ECG result.
X: Positive X-ray result.
2. What is the probability that the patient suffers from ACS if S = true?
3. What is the probability that the patient suffers from ACS if the ECG result is
negative?
4. What is the probability that the patient suffers from ACS if the X-ray result is
negative and the patient is a smoker?
Exercise 25. Now consider the case where you have the choice between tests to
perform. First, you observe S, whether or not the patient is a smoker. Then, you select
a test to make: d1 ∈ {X-ray, ECG}. Finally, you decide whether or not to treat for
ACS: d2 ∈ {heart treatment, no treatment}. An untreated ACS patient may die
with probability 2%, while a treated one dies with probability 0.2%. Treating a non-ACS
patient results in death with probability 0.1%.
Chapter 4

Estimation
4.1 Introduction
In the previous chapter, we have seen how to make optimal decisions with
respect to a given utility function and belief. One important question is how to
compute an updated belief from observations and a prior belief. More generally,
we wish to examine how much information we can obtain about an unknown
parameter from observations, and how to bound the respective estimation error.
While most of this chapter will focus on the Bayesian framework for estimating
parameters, we shall also look at tools for making conclusions about the value
of parameters without making specific assumptions about the data distribution,
i.e., without providing specific prior information.
In the Bayesian setting, we calculate posterior distributions of parameters
given data. The basic problem can be stated as follows. Let P ≜ {Pω | ω ∈ Ω}
be a family of probability measures on (S, FS ) and ξ be our prior probability
measure on (Ω, FΩ ). Given some data x ∼ Pω∗ , with ω ∗ ∈ Ω, how can we
estimate ω ∗ ? The Bayesian approach is to estimate the posterior distribution
ξ(· | x), instead of guessing a single ω ∗ . In general, the posterior measure is a
function ξ(· | x) : FΩ → [0, 1], with
ξ(B | x) = ∫_B Pω (x) dξ(ω) / ∫_Ω Pω (x) dξ(ω).
The posterior distribution allows us to quantify our uncertainty about the un-
known ω ∗ . This in turn enables us to take decisions that take uncertainty into
account.
The first question we are concerned with in this chapter is how to calculate
this posterior for any value of x in practice. If x is a complex object, this may
be computationally difficult. In fact, the posterior distribution can also be a
complicated function. However, there exist distribution families and priors such
that this calculation is very easy, in the sense that the functional form of the
posterior depends upon a small number of parameters. This happens when a
summary of the data that contains all necessary information can be calculated
easily. Formally, this is captured via the concept of a sufficient statistic for the family P = {Pω | ω ∈ Ω}.
This means that the statistic is sufficient if, whenever we obtain the same
value of the statistic for two different datasets x, x′ , then the resulting posterior
distribution over the parameters is identical, independent of the prior distri-
bution. In other words, the value of the statistic is sufficient for computing
the posterior. Interestingly, a sufficient statistic always implies the following factorisation for members of the family: Pω(x) = u(x) v[φ(x), ω] for some functions u, v.
Proof. In the following proof we assume arbitrary Ω. The case when Ω is finite
is technically simpler and is left as an exercise. Let us first assume the existence
of u, v satisfying the equation. Then for any B ∈ FΩ we have
$$\xi(B \mid x) = \frac{\int_B u(x)\, v[\phi(x), \omega]\, d\xi(\omega)}{\int_\Omega u(x)\, v[\phi(x), \omega]\, d\xi(\omega)} = \frac{\int_B v[\phi(x), \omega]\, d\xi(\omega)}{\int_\Omega v[\phi(x), \omega]\, d\xi(\omega)}.$$
For x′ with φ(x) = φ(x′ ), it follows that ξ(B | x) = ξ(B | x′ ), so ξ(· | x) = ξ(· |
x′ ) and φ satisfies the definition of a sufficient statistic.
Conversely, assume that φ is a sufficient statistic. Let µ be a dominating measure so that we can define the densities p(ω) ≜ dξ(ω)/dµ(ω) and p(ω | x) ≜ dξ(ω | x)/dµ(ω), whence
$$P_\omega(x) = \frac{p(\omega \mid x)}{p(\omega)} \int_\Omega P_\omega(x)\, d\xi(\omega).$$
Since φ is sufficient, p(ω | x) depends on x only through φ(x), so the factorisation holds with v[φ(x), ω] = p(ω | x)/p(ω) and u(x) = ∫Ω Pω(x) dξ(ω).
Another example is when we have a finite set of models. Then the sufficient
statistic is always a finite-dimensional vector.
Lemma 4.2.1. Let P = {Pθ | θ ∈ Θ} be a family, where each model Pθ is
a probability measure on X and Θ contains n models. If p ∈ ∆n is a vector
representing our prior distribution, i.e., ξ(θ) = pθ , then the finite-dimensional
vector with entries qθ = pθ Pθ (x) is a sufficient statistic.
Proof. Simply note that the posterior distribution in this case is
$$\xi(\theta \mid x) = \frac{q_\theta}{\sum_{\theta'} q_{\theta'}},$$
which depends on the data x only through the vector q.
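To make the lemma concrete, here is a minimal sketch in Python (not from the book) computing the posterior over a finite family of Bernoulli models; the candidate parameters, the uniform prior and the observed sequence are hypothetical choices.

```python
# A minimal sketch (not from the book): posterior over a finite family of
# Bernoulli models, illustrating Lemma 4.2.1.
import numpy as np

def finite_posterior(prior, likelihoods):
    """Return the posterior xi(theta | x) given prior p_theta and P_theta(x)."""
    q = prior * likelihoods          # q_theta = p_theta * P_theta(x)
    return q / q.sum()               # normalise: xi(theta | x) = q_theta / sum q

# Example: three Bernoulli models with parameters 0.2, 0.5, 0.8 and a uniform prior.
theta = np.array([0.2, 0.5, 0.8])
prior = np.ones(3) / 3
x = np.array([1, 1, 0, 1])           # observed sequence
likelihoods = np.prod(theta[:, None] ** x * (1 - theta[:, None]) ** (1 - x), axis=1)
print(finite_posterior(prior, likelihoods))
```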
While conjugate families exist for statistics with unbounded dimension, here
we shall focus on finite-dimensional families. We will start with the simplest
example, the Bernoulli-Beta pair.
[Figure 4.1: graphical model with parameter ω and observation x.]
The structure of the graphical model in Figure 4.1 shows the dependencies
between the different variables of the model.
When considering n independent trials of a Bernoulli distribution, the set of possible outcomes is S = {0, 1}^n. Then $P_\omega(x^n) = \prod_{t=1}^n P_\omega(x_t)$ is the probability of observing the exact sequence x^n under the Bernoulli model. However, in many cases we are interested in the probability of observing a particular number of successes (outcome 1) and failures (outcome 0) and do not care about the actual order. For this we need to count the number of sequences in which, out of n trials, we have k successes. This number is given by the binomial coefficient, defined as
$$\binom{n}{k} = \frac{n!}{k!\,(n-k)!}, \qquad k, n \in \mathbb{N},\ n \ge k.$$
Now we are ready to define the binomial distribution, which is a scaled product-Bernoulli distribution for multiple independent outcomes, where we want to measure the probability of a particular number of successes or failures. Thus, the Bernoulli is a distribution on a sequence of outcomes, while the binomial is a distribution on the total number of successes. That is, let $s = \sum_{t=1}^n x_t$ be the total number of successes observed up to time n. Then we are interested in the probability that there are exactly k successes out of n trials.
Definition 4.3.2 (Binomial distribution). The binomial distribution with pa-
rameters ω and n has outcomes S = {0, 1, . . . , n}. Its probability function is
given by
$$P_\omega(s = k) = \binom{n}{k} \omega^k (1 - \omega)^{n-k}. \qquad (4.3.1)$$
If s is drawn from a binomial distribution with parameters ω, n, we write s ∼
Binom(ω, n).
Now let us return to the Bernoulli distribution. If the parameter ω is known,
then observations are independent of each other. However, this is not the case
when ω is unknown. For example, if Ω = {ω1 , ω2 }, then the probability of
observing a sequence xn is given by
$$P(x^n) = \sum_{\omega \in \Omega} P(x^n \mid \omega)\, P(\omega) = \prod_{t=1}^n P(x_t \mid \omega_1)\, P(\omega_1) + \prod_{t=1}^n P(x_t \mid \omega_2)\, P(\omega_2),$$
which in general is different from $\prod_{t=1}^n P(x_t)$. For the general case where Ω = [0, 1], the question is whether there is a prior distribution that can succinctly describe our uncertainty about the parameter. Indeed, there is, and it is called the Beta distribution. It is defined on the interval [0, 1] and has two parameters that determine its density. Because the Bernoulli distribution has a parameter in [0, 1], the outcomes of the Beta can be used to specify a prior on the parameters of the Bernoulli distribution.
Definition 4.3.3 (Beta distribution). The Beta distribution has outcomes ω ∈
Ω = [0, 1] and parameters α0 , α1 > 0, which we will summarize in a vector
α = (α1 , α0 ). Its probability density function is given by
$$p(\omega \mid \alpha) = \frac{\Gamma(\alpha_0 + \alpha_1)}{\Gamma(\alpha_0)\,\Gamma(\alpha_1)}\, \omega^{\alpha_1 - 1} (1 - \omega)^{\alpha_0 - 1}, \qquad (4.3.2)$$
where $\Gamma(\alpha) = \int_0^\infty u^{\alpha-1} e^{-u}\, du$ is the Gamma function (see also Appendix C.1.2).
If ω is distributed according to a Beta distribution with parameters α1 , α0 , we
write ω ∼ Beta(α1 , α0 ).
We note that the Gamma function is an extension of the factorial (see also Appendix C.1.2), so that for n ∈ ℕ it holds that Γ(n + 1) = n!. That way, the first term in (4.3.2) corresponds to a generalized binomial coefficient. The dependencies between the parameters are shown in the graphical model of Figure 4.2.
A Beta distribution with parameter α has expectation
$$E(\omega \mid \alpha) = \frac{\alpha_1}{\alpha_0 + \alpha_1}$$
and variance
$$V(\omega \mid \alpha) = \frac{\alpha_1 \alpha_0}{(\alpha_1 + \alpha_0)^2 (\alpha_1 + \alpha_0 + 1)}.$$
Figure 4.3 shows the density of a Beta distribution for four different param-
eter vectors. When α0 = α1 = 1, the distribution is equivalent to a uniform
one.
[Figure 4.3: densities p(ω | α) of the Beta distribution for α = (1, 1), (2, 1), (10, 20), (1/10, 1/2).]
[Graphical model for the Beta-Bernoulli model: the hyperparameter α generates the parameter ω, which generates the observations xt.]
Then the posterior distribution of ω given the sample is also Beta, that is, $\omega \mid x^n \sim \text{Beta}\!\left(\alpha_1 + \sum_{t=1}^n x_t,\ \alpha_0 + n - \sum_{t=1}^n x_t\right)$.
Example 25. The parameter ω ∈ [0, 1] of a randomly selected coin can be modelled as a Beta distribution peaking around 1/2. Usually one assumes that coins are fair.
However, not all coins are exactly the same. Thus, it is possible that each coin deviates
slightly from fairness. We can use a Beta distribution to model how likely (we think)
different values ω of coin parameters are.
To demonstrate how belief changes, we perform the following simple experiment.
We repeatedly toss a coin and wish to form an accurate belief about how biased the
coin is, under the assumption that the outcomes are Bernoulli with a fixed parameter
ω. Our initial belief, ξ0 , is modelled as a Beta distribution on the parameter space
Ω = [0, 1], with parameters α0 = α1 = 100. This places a strong prior on the coin
being close to fair. However, we still allow for the possibility that the coin is biased.
[Figure 4.5: Changing beliefs ξ0, ξ10, ξ100, ξ1000 as we observe tosses from a coin with probability ω = 0.6 of heads.]
Figure 4.5 shows a sequence of beliefs at times 0, 10, 100, 1000 respectively, from a coin with bias ω = 0.6. Due to the strength of our prior, after 10 observations,
the situation has not changed much and the belief ξ10 is very close to the initial one.
However, after 100 observations our belief has now shifted towards 0.6, the true bias
of the coin. After a total of 1000 observations, our belief is centered very close to 0.6,
and is now much more concentrated, reflecting the fact that we are almost certain
about the value of ω.
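A small simulation in the same spirit can be written in a few lines. The sketch below assumes the standard Beta-Bernoulli update described above (the posterior parameters are the prior parameters plus the success and failure counts); the random seed and the choice to print summary statistics rather than plot densities are arbitrary.

```python
# A sketch of the coin experiment: update a Beta(100, 100) prior with tosses
# from a coin of bias 0.6 and report how the posterior concentrates.
import numpy as np

rng = np.random.default_rng(0)
omega_true = 0.6                     # true coin bias
alpha1, alpha0 = 100.0, 100.0        # strong prior centred on a fair coin

for n in [10, 100, 1000]:
    x = rng.binomial(1, omega_true, size=n)
    a1, a0 = alpha1 + x.sum(), alpha0 + n - x.sum()
    mean = a1 / (a1 + a0)
    var = a1 * a0 / ((a1 + a0) ** 2 * (a1 + a0 + 1))
    print(f"n={n}: posterior Beta({a1:.0f}, {a0:.0f}), mean={mean:.3f}, std={var**0.5:.3f}")
```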
The dependency graph in Figure 4.6 shows the dependencies between the
parameters of a normal distribution and the observations xt . In this graph,
only a single sample xt is shown, and it is implied that all xt are independent
of each other given r, ω.
[Figure 4.6: graphical model with precision r, mean ω and a single observation xt.]
[Graphical model for the normal distribution with unknown mean: hyperparameters µ and τ specify the prior on ω, which together with the known precision r generates the observations xt.]
Then the posterior distribution of ω given the sample is also normal, that
is,
$$\omega \mid x^n \sim \mathcal{N}(\mu', \tau'^{-1}) \quad \text{with} \quad \mu' = \frac{\tau\mu + n r \bar{x}_n}{\tau'}, \quad \tau' = \tau + nr,$$
where $\bar{x}_n \triangleq \frac{1}{n}\sum_{t=1}^n x_t$.
It can be seen that the updated estimate for the mean is shifted towards the empirical mean x̄n, and the precision increases linearly with the number of samples.
$$f(r \mid \alpha, \beta) = \frac{\beta^\alpha}{\Gamma(\alpha)}\, r^{\alpha-1} e^{-\beta r},$$
[Figure: densities f(r | α, β) of the Gamma distribution for (α, β) = (1, 1), (1, 2), (2, 2), (4, 1/2).]
For α = 1 one obtains the exponential distribution, with density
$$f(x \mid \beta) = \beta e^{-\beta x}, \qquad (4.3.3)$$
which, like the Gamma distribution, has support [0, ∞). For n ∈ ℕ, α = n/2 and β = 1/2 one obtains a χ²-distribution with n degrees of freedom.
Now we return to our problem of estimating the precision of a normal dis-
tribution with known mean, using the Gamma distribution to represent uncer-
tainty about the precision.
Normal-Gamma model
[Graphical model: hyperparameters α, β generate the precision r, which together with the mean ω generates the observations xt.]
Then the posterior distribution of r given the sample is also Gamma, that
is,
$$r \mid x^n \sim \text{Gamma}(\alpha', \beta') \quad \text{with} \quad \alpha' = \alpha + \frac{n}{2}, \quad \beta' = \beta + \frac{1}{2}\sum_{t=1}^n (x_t - \omega)^2.$$
Finally, let us turn our attention to the general problem of estimating both the
mean and the precision of a normal distribution. We will use the same prior
distributions for the mean and precision as in the case when just one of them
was unknown. It will be assumed that the precision is independent of the mean,
while the mean has a normal distribution given the precision.
Figure 4.11: Graphical model for a normal distribution with unknown mean
and precision.
Then the posterior is of the same form, with ω | r, x^n ∼ N(µ′, (τ′ r)^{-1}) and r | x^n ∼ Gamma(α′, β′), where
$$\mu' = \frac{\tau\mu + n\bar{x}_n}{\tau + n}, \qquad \tau' = \tau + n,$$
$$\alpha' = \alpha + \frac{n}{2}, \qquad \beta' = \beta + \frac{1}{2}\sum_{t=1}^n (x_t - \bar{x}_n)^2 + \frac{\tau n (\bar{x}_n - \mu)^2}{2(\tau + n)}.$$
Recall that the density of a single observation is
$$f(x \mid \omega, r) \propto \sqrt{r}\, \exp\!\left(-\frac{r}{2}(x - \omega)^2\right).$$
For a prior ω | r ∼ N(µ, (τ r)^{-1}) and r ∼ Gamma(α, β), as before, the joint distribution for mean and precision is given by
$$\xi(\omega, r) \propto \sqrt{r}\, \exp\!\left(-\frac{\tau r}{2}(\omega - \mu)^2\right) r^{\alpha-1} e^{-\beta r},$$
as ξ(ω, r) = ξ(ω | r) ξ(r). Now we can write the marginal density of new observations as
$$
\begin{aligned}
p_\xi(x) &= \int f(x \mid \omega, r)\, d\xi(\omega, r) \\
&\propto \int_0^\infty \int_{-\infty}^\infty \sqrt{r}\, \exp\!\left(-\frac{r}{2}(x-\omega)^2\right) \exp\!\left(-\frac{\tau r}{2}(\omega - \mu)^2\right) r^{\alpha - 1} e^{-\beta r}\, d\omega\, dr \\
&= \int_0^\infty r^{\alpha - \frac{1}{2}} e^{-\beta r} \int_{-\infty}^\infty \exp\!\left(-\frac{r}{2}(x-\omega)^2 - \frac{\tau r}{2}(\omega - \mu)^2\right) d\omega\, dr \\
&= \int_0^\infty r^{\alpha - \frac{1}{2}} e^{-\beta r} \int_{-\infty}^\infty \exp\!\left(-\frac{r}{2}\left[(x-\omega)^2 + \tau(\omega - \mu)^2\right]\right) d\omega\, dr \\
&= \int_0^\infty r^{\alpha - \frac{1}{2}} e^{-\beta r} \exp\!\left(-\frac{\tau r}{2(\tau+1)}(\mu - x)^2\right) \sqrt{\frac{2\pi}{r(1+\tau)}}\, dr.
\end{aligned}
$$
Multinomial-Dirichlet conjugates
The multinomial distribution is the extension of the binomial distribution to an
arbitrary number of outcomes. It is a common model for independent random
trials with a finite number of possible outcomes, such as repeated dice throws,
multi-class classification problems, etc.
As in the binomial distribution we perform independent trials, but now con-
sider a more general outcome set S = {1, . . . , K} for each trial. Denoting by nk
the number of times one obtains outcome k, the multinomial distribution gives
the probability of observing a given vector (n1 , . . . , nK ) after a total of n trials.
$$P(n \mid \omega) = \frac{n!}{\prod_{k=1}^K n_k!} \prod_{k=1}^K \omega_k^{n_k}. \qquad (4.3.4)$$
The Dirichlet distribution is a conjugate prior for the multinomial distribution, in the same way that the Beta distribution is conjugate to the Bernoulli/binomial distribution.
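As a brief illustration, the sketch below assumes the standard Dirichlet-multinomial update (a Dirichlet(α) prior combined with observed counts n yields a Dirichlet(α + n) posterior); the prior and counts are made-up numbers rather than anything from the text.

```python
# A minimal sketch of the Dirichlet-multinomial conjugate update, assuming the
# standard result that Dirichlet(alpha) prior + multinomial counts n gives a
# Dirichlet(alpha + n) posterior.
import numpy as np

alpha = np.array([1.0, 1.0, 1.0])        # Dirichlet prior over K = 3 outcomes
counts = np.array([12, 7, 1])            # observed outcome counts (n_1, ..., n_K)
alpha_post = alpha + counts              # conjugate posterior parameters
posterior_mean = alpha_post / alpha_post.sum()
print(alpha_post, posterior_mean)
```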
For the definition of the Wishart distribution we first have to recall the
definition of a matrix trace.
Definition 4.3.9. The trace of an n × n square matrix A with entries aij is
defined as
$$\operatorname{trace}(A) \triangleq \sum_{i=1}^n a_{ii}.$$
Figure 4.17: 90% credible interval after 1000 observations from a Bernoulli with
ω = 0.6.
A credible interval [ωl, ωu] at level s is an interval whose posterior probability is s, i.e., ξ([ωl, ωu]) = s.
Note that ωl , ωu are not unique and any choice satisfying the condition is valid.
However, typically the interval is chosen so as to exclude the tails (extremes)
of the distribution and centered in the maximum. Figure 4.17 shows the
90% credible interval for the Bernoulli parameter of Example 24 after 1000
observations, that is, the measure of A under ξ is ξ(A) = 0.9. We see that the
true parameter ω = 0.6 lies slightly outside it.
Q({x^n ∈ S^n | ω ∉ A_n}).
The probability that the true value of ω will be within a particular cred-
ible interval depends on how well the prior ξ0 matches the true distribution
from which the parameter ω was drawn. This is illustrated in the following
experimental setup, where we check how often a 50% credible interval fails.
Figure 4.18: 50% credible intervals for a prior ξ0 = Beta(10, 10) matching the distribution of ω.
Figure 4.19: Failure rate of 50% credible intervals for a prior ξ0 = Beta(10, 10) matching the distribution of ω.
Figure 4.20: 50% credible intervals for a prior Beta(10, 10), when ξ0 does not match the distribution of ω = 0.6.
Figure 4.21: Failure rate of 50% credible intervals for a prior Beta(10, 10), when ξ0 does not match the distribution of ω = 0.6.
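The following sketch reproduces the flavour of this experiment, assuming a central 50% credible interval computed from the Beta posterior quantiles (one of the many valid choices of interval); the sample size, number of runs and the mismatched value ω = 0.6 mirror the figures only loosely.

```python
# A sketch of the coverage experiment: draw omega (from the prior or fixed),
# observe Bernoulli samples, and count how often the central 50% credible
# interval of the posterior fails to contain omega.
import numpy as np
from scipy.stats import beta

rng = np.random.default_rng(0)
a0 = b0 = 10.0                        # prior Beta(10, 10)
n, runs = 100, 2000
for matched in [True, False]:
    failures = 0
    for _ in range(runs):
        omega = rng.beta(a0, b0) if matched else 0.6
        x = rng.binomial(1, omega, size=n)
        a, b = a0 + x.sum(), b0 + n - x.sum()
        lo, hi = beta.ppf(0.25, a, b), beta.ppf(0.75, a, b)
        failures += not (lo <= omega <= hi)
    print("matched" if matched else "mismatched", failures / runs)
```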
For a non-negative random variable X and u > 0, Markov's inequality states that
$$P(X \ge u) \le \frac{\mathbb{E} X}{u}, \qquad (4.5.1)$$
where $P(X \ge u) = P(\{x \mid x \ge u\})$.
Example 26 (Application to sample mean). It is easy to show that the sample mean x̄n has expectation µ and variance σ²/n, and we obtain from (4.5.2)
$$P\!\left(|\bar{x}_n - \mu| \ge \frac{k\sigma}{\sqrt{n}}\right) \le \frac{1}{k^2}.$$
Setting $\epsilon = k\sigma/\sqrt{n}$ we get $k = \epsilon\sqrt{n}/\sigma$ and hence
$$P\!\left(|\bar{x}_n - \mu| \ge \epsilon\right) \le \frac{\sigma^2}{\epsilon^2 n}.$$
$$P(\bar{x}_n - \mu \ge \epsilon) = P(S_n \ge u) \le e^{-\theta n \epsilon} \prod_{t=1}^n \mathbb{E}\, e^{\theta(x_t - \mu)}. \qquad (4.5.6)$$
Applying Jensen’s inequality directly to the expectation does not help. However,
we can use convexity in another way. Let f (x) be the linear upper bound on
eθx on the interval [a, b], i.e.
$$f(x) \triangleq \frac{b-x}{b-a}\, e^{\theta a} + \frac{x-a}{b-a}\, e^{\theta b} \ge e^{\theta x}.$$
Then obviously E eθx ≤ E f (x) for x ∈ [a, b]. Applying this to the expectation
term (4.5.6) above we obtain
$$\mathbb{E}\, e^{\theta(x_t - \mu_t)} \le \frac{e^{-\theta\mu_t}}{b_t - a_t}\left[(b_t - \mu_t)\, e^{\theta a_t} + (\mu_t - a_t)\, e^{\theta b_t}\right].$$
Taking derivatives with respect to θ and computing the second order Taylor
expansion, we get
$$\mathbb{E}\, e^{\theta(x_t - \mu_t)} \le e^{\frac{1}{8}\theta^2 (b_t - a_t)^2}$$
and hence
$$P(\bar{x}_n - \mu \ge \epsilon) \le e^{-\theta n \epsilon + \frac{1}{8}\theta^2 \sum_{t=1}^n (b_t - a_t)^2}.$$
This is minimised for $\theta = 4n\epsilon / \sum_{t=1}^n (b_t - a_t)^2$, which proves the required result.
We can apply this inequality directly to the sample mean example and obtain
for xt ∈ [0, 1]
$$P(|\bar{x}_n - \mu| \ge \epsilon) \le 2 e^{-2n\epsilon^2}.$$
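As a quick sanity check, the sketch below compares the empirical deviation frequency of the sample mean of uniform [0, 1] variables with the bound 2 exp(−2nε²); the particular n, ε and number of runs are arbitrary choices.

```python
# Compare the empirical frequency of |sample mean - mu| >= eps with the
# Hoeffding bound 2 exp(-2 n eps^2) for variables in [0, 1].
import numpy as np

rng = np.random.default_rng(0)
n, eps, runs = 100, 0.1, 20000
mu = 0.5
deviations = 0
for _ in range(runs):
    x = rng.random(n)                 # uniform on [0, 1], mean 0.5
    deviations += abs(x.mean() - mu) >= eps
print("empirical:", deviations / runs, "bound:", 2 * np.exp(-2 * n * eps ** 2))
```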
modelling [e.g., Geweke, 1999], where detailed simulators but no useful analyt-
ical probabilistic models were available. ABC methods have also been used for
inference in dynamical systems [e.g., Toni et al., 2009] and the reinforcement
learning problem [Dimitrakakis and Tziortziotis, 2013, 2014].
where ξx is shorthand for the joint distribution ξ(ω, x) for a fixed value of x.
As the second term does not depend on θ, we can find the best element of the
family by computing
$$\max_{\theta \in \Theta} \int_\Omega \ln \frac{d\xi_x}{dQ_\theta}\, dQ_\theta, \qquad (4.6.4)$$
where the term we are maximising can also be seen as a lower bound on the
marginal log-likelihood.
In the latter case, even though we cannot compute the full function ξ(ω | x), we
can still maximise (perhaps locally) for ω.
More generally, there might be some parameters φ for which we actually can
compute a posterior distribution. Then we can still use the same approaches
maximising either of
$$P_\omega(x) = \int_\Phi P_{\omega,\phi}(x)\, d\xi(\phi \mid x), \qquad \xi(\omega \mid x) = \int_\Phi \xi(\omega \mid \phi, x)\, d\xi(\phi \mid x).$$
Sequential sampling
Example 28. Consider that you have 100 produced items and you want to determine
whether there are fewer than 10 faulty items among them. If testing has some cost,
it pays off to think about whether it is possible to do without testing all 100 items.
Indeed, this is possible by the following simple online testing scheme: You test one
item after another until you either have discovered 10 faulty items or 91 good items.
In either case you have the correct answer at considerably lower cost than when testing
all items.
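The following sketch simulates this online testing scheme, assuming each item is faulty independently with some made-up probability, and reports the average number of tests used instead of the full 100.

```python
# Simulate the online testing scheme of Example 28: stop as soon as we have
# seen 10 faulty or 91 good items, and measure how many tests were needed.
import numpy as np

rng = np.random.default_rng(0)

def tests_needed(items):
    faulty = good = 0
    for t, broken in enumerate(items, start=1):
        faulty += broken
        good += 1 - broken
        if faulty >= 10 or good >= 91:
            return t
    return len(items)

counts = []
for _ in range(10000):
    items = rng.random(100) < 0.08    # each item faulty with probability 0.08
    counts.append(tests_needed(items))
print("average number of tests:", np.mean(counts), "instead of 100")
```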
Thus, the sample obtained depends both on P and the sampling proce-
dure πs . In our setting, we don’t just want to sample sequentially, but also to
take some action after sampling is complete. For that reason, we can generalise
the above definition to sequential decision procedures.
A sequential decision procedure consists of:
1. a stopping rule πs : X ∗ → {0, 1}, and
2. a decision rule πd : X ∗ → A.
The stopping rule πs specifies whether, at any given time, we should stop and
make a decision in A or take one more sample. That is, stop if
πs (xt ) = 1,
otherwise observe xt+1 . Once we have stopped (i.e. πs (xt ) = 1), we choose the
decision
πd (xt ).
In the remainder of this section, we shall consider the following simple deci-
sion problem, where we need to make a decision about the value of an unknown
parameter. As we get more data, we have a better chance of discovering the right
parameter. However, there is always a small chance of getting no information.
Example 29. Consider the following decision problem, where the goal is to distinguish
between two possible hypotheses θ1 , θ2 , with corresponding decisions a1 , a2 . We have
three possible observations {1, 2, 3}, with 1, 2 being more likely under the first and
second hypothesis, respectively. However, the third observation gives us no information
about the hypothesis, as its probability is the same under θ1 and θ2 . In this problem
γ is the probability that we obtain an uninformative sample.
Parameters: Θ = {θ1 , θ2 }.
Decisions: A = {a1 , a2 }.
Observation distribution fi (k) = Pθi (xt = k) for all t with
f1 (1) = 1 − γ, f1 (2) = 0, f1 (3) = γ, (5.1.4)
f2 (1) = 0, f2 (2) = 1 − γ, f2 (3) = γ. (5.1.5)
Figure 5.1: Illustration of P1, the procedure taking a fixed number of samples n. The value V(n) of taking exactly n observations under two different beliefs (ξ = 0.1 and ξ = 0.5), for γ = 0.9, b = −10, c = 10^{-2}.
The results of applying this procedure are illustrated in Figure 5.1. Here we
can see that, for two different choices of priors, the optimal number of samples
is different. In both cases, there is a clear choice for how many samples to take,
when we must fix the number of samples before seeing any data.
However, we may not be constrained to fix the number of samples a priori.
As illustrated in Example 28, many times it is a good idea to adaptively decide
when to stop taking samples. This is illustrated by the following sequential
procedure. Since we already know that there is an optimal a priori number of steps n∗, we can choose to look at all possible stopping times that are no larger than n∗.
using the formula for the geometric series (see equation C.1.4). Consequently,
the value of this procedure is
Figure 5.2: The value of three strategies (fixed, bounded, unbounded) for ξ = 1/2, b = −10, c = 10^{-2} and varying γ. Higher values of γ imply a longer time before the true θ is known.
Here the cost is inside the expectation, since the number of samples we take is
random. Summing over all the possible stopping times n, and taking Bn ⊂ X ∗
as the set of observations for which we stop, we have:
$$U(\xi, \pi) = \sum_{n=1}^\infty \int_{B_n} \mathbb{E}_\xi[U(\theta, \pi(x^n)) \mid x^n]\, dP_\xi(x^n) - \sum_{n=1}^\infty P_\xi(B_n)\, nc \qquad (5.2.2)$$
$$\phantom{U(\xi, \pi)} = \sum_{n=1}^\infty \int_{B_n} \int_\Theta U[\theta, \pi(x^n)]\, d\xi(\theta \mid x^n)\, dP_\xi(x^n) - \sum_{n=1}^\infty P_\xi(B_n)\, nc \qquad (5.2.3)$$
at most T samples. If the process ends at stage T , we will have observed some
sequence xT , which gives rise to a posterior ξ(θ | xT ). Since we must stop at T ,
we must choose a maximising expected utility at that stage:
$$\mathbb{E}_\xi[U \mid x^T, a] = \int_\Theta U(\theta, a)\, d\xi(\theta \mid x^T).$$
Since we need not take another sample, the respective value (maximal expected
utility) of that stage is
$$V^0[\xi(\cdot \mid x^T)] \triangleq \max_{a \in A} U(\xi(\cdot \mid x^T), a),$$
where we introduce the notation $V^n$ to denote the expected utility, given that we are stopping after at most n steps.
More generally, we need to consider the effect on subsequent decisions. Con-
sider the following simple two-stage problem as an example. Let X = {0, 1}
and ξ be the prior on the θ parameter of Bern(θ). We wish to either decide
immediately on a parameter θ, or take one more observation, at cost c, before
deciding. The problem we consider has two stages, as illustrated in Figure 5.3.
Figure 5.3: An example of a sequential decision problem with two stages. The
initial belief is ξ and there are two possible subsequent beliefs, depending on
whether we observe xt = 0 or xt = 1. At each stage we pay c.
In this example, we begin with a prior ξ at the first stage. There are two
possible outcomes for the second stage.
1. If we observe x1 = 0 then our value is V 0 [ξ(· | x1 = 0)].
2. If we observe x1 = 1 then our value is V 0 [ξ(· | x1 = 1)].
At the first stage, we can:
1. Stop with value V 0 (ξ).
2. Pay a sampling cost c for value $V^0[\xi(\cdot \mid x_1)]$, where $P_\xi(x_1) = \int_\Theta P_\theta(x_1)\, d\xi(\theta)$.
So the expected value of continuing for one more step is
$$V^1(\xi) \triangleq \int_X V^0[\xi(\cdot \mid x_1)]\, dP_\xi(x_1).$$
Thus, the overall value for this problem is
$$\max\left\{ V^0(\xi),\ \sum_{x_1=0}^{1} V^0[\xi(\cdot \mid x_1)]\, P_\xi(x_1) - c \right\}.$$
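The two-stage computation can be carried out explicitly once a concrete utility is fixed. The sketch below assumes, purely for illustration, a Beta(a, b) prior on θ and quadratic utility U(θ, a) = −(θ − a)², so that V⁰(ξ) equals minus the posterior variance (attained by reporting the posterior mean); the prior parameters and the sampling cost c are arbitrary.

```python
# Two-stage sequential decision: compare stopping now with paying c for one
# more Bernoulli observation, under a Beta prior and quadratic utility.
def beta_var(a, b):
    return a * b / ((a + b) ** 2 * (a + b + 1))

def stage_values(a, b, c):
    v0 = -beta_var(a, b)                              # value of stopping now
    p1 = a / (a + b)                                  # P_xi(x1 = 1)
    v1 = p1 * -beta_var(a + 1, b) + (1 - p1) * -beta_var(a, b + 1)
    return v0, v1 - c                                 # stop vs. continue (cost paid)

a, b, c = 1.0, 1.0, 0.01                              # uniform prior, sampling cost c
v_stop, v_continue = stage_values(a, b, c)
print("stop:", v_stop, "continue:", v_continue,
      "-> sample" if v_continue > v_stop else "-> stop")
```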
[Figure: from belief ξn, observing xn+1 = 1 or xn+1 = 0 leads to one of two successor beliefs ξn+1, paying c in either case.]
for every belief ξn in the set of beliefs that arise from the prior ξ0 , with j = T −n.
$$B_k(\pi) = \{x \in X^* \mid n = k\} \qquad (5.2.7)$$
be the set of observations such that exactly k samples are taken by rule π, and let $\bar{B}_k(\pi) = \{x \in X^* \mid n \le k\}$ be the set of observations such that at most k samples are taken by rule π. Then
$$
\begin{aligned}
U(\xi, \pi') &= \sum_{k=1}^\infty \int_{B_k(\pi')} \left\{ V^0[\xi(\cdot \mid x^k)] - ck \right\} dP_\xi(x^k) \\
&\ge \sum_{k=1}^\infty \int_{B_k(\pi')} U[\xi(\cdot \mid x^k), \pi]\, dP_\xi(x^k) \\
&= \sum_{k=1}^\infty \mathbb{E}^\pi_\xi\{U \mid B_k(\pi')\}\, P_\xi(B_k(\pi')) = \mathbb{E}^\pi_\xi U = U(\xi, \pi).
\end{aligned}
$$
U(θ, d)   a1    a2
θ1        0     λ1
θ2        λ2    0
As will be the case for all our sequential decision problems, we only need to
consider our current belief ξ, and its possible evolution, when making a decision.
To obtain some intuition about this procedure, we are going to analyse this
problem by examining what the optimal decision is under all possible beliefs ξ.
Under some belief ξ, the immediate value (i.e. the value we obtain if we stop
immediately), is simply:
The worst-case immediate value, i.e. the minimum, is attained when both
terms are equal. Consequently, setting λ1 ξ = λ2 (1 − ξ) gives ξ = λ2 /(λ1 + λ2 ).
Intuitively, this is the worst-case belief, as the uncertainty it induces leaves us
unable to choose between either hypothesis. Replacing in (5.2.9) gives a lower
bound for the value for any belief.
$$V^0(\xi) \ge \frac{\lambda_1 \lambda_2}{\lambda_1 + \lambda_2}.$$
Let Π denote the set of procedures π which take at least one observation
and define:
$$V'(\xi) = \sup_{\pi \in \Pi} U(\xi, \pi). \qquad (5.2.10)$$
Then the ξ-expected utility $V^*(\xi)$ must satisfy
$$V^*(\xi) = \max\{V^0(\xi),\ V'(\xi)\}. \qquad (5.2.11)$$
[Figure 5.5: the immediate value V0∗(ξ) and the value V1∗(ξ) of continuing, plotted against the belief ξ, with thresholds ξL and ξH around the worst-case belief λ2/(λ1 + λ2).]
Figure 5.5 illustrates the above arguments, by plotting the immediate value
against the optimal continuation after taking one more sample. For the worst-
case belief, we must always continue sampling. When we are absolutely certain
about the model, then it’s always better to stop immediately. There are two
points where the curves intersect. Together, these define three subsets of beliefs: on the left, if ξ < ξL, we decide for one of the two parameters; on the right, if ξ > ξH, we decide for the other; otherwise, we continue sampling. This is the main idea of the sequential probability ratio test, explained below.
$$\xi_t = \frac{\xi\, P_{\theta_1}(x^t)}{\xi\, P_{\theta_1}(x^t) + (1-\xi)\, P_{\theta_2}(x^t)}.$$
Proof.
$$
\begin{aligned}
\mathbb{E}\sum_{i=1}^n z_i &= \sum_{k=1}^\infty \int_{B_k} \sum_{i=1}^k z_i\, dG_k(z^k) \\
&= \sum_{k=1}^\infty \sum_{i=1}^k \int_{B_k} z_i\, dG_k(z^k) \\
&= \sum_{i=1}^\infty \sum_{k=i}^\infty \int_{B_k} z_i\, dG_k(z^k) \\
&= \sum_{i=1}^\infty \int_{B_{\ge i}} z_i\, dG_i(z^i) \\
&= \sum_{i=1}^\infty \mathbb{E}(z_i)\, P(n \ge i) = m\, \mathbb{E}\, n.
\end{aligned}
$$
$$a < \sum_{i=1}^n z_i < b$$
as the test. Using Wald's theorem and the previous properties, and assuming c ≈ 0, we obtain the following approximately optimal values for a, b:
$$a \approx \log c - \log\frac{I_1 \lambda_2 (1-\xi)}{\xi}, \qquad b \approx \log\frac{1}{c} - \log\frac{I_2 \lambda_1 \xi}{1-\xi}. \qquad (5.2.14)$$
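The sketch below runs an SPRT of this form for two simple hypotheses given by Bernoulli distributions; the parameter values and the thresholds a, b are arbitrary illustrative choices rather than the approximately optimal ones derived above.

```python
# Illustrative sequential probability ratio test for two Bernoulli hypotheses:
# keep sampling while the log-likelihood ratio stays inside (a, b).
import numpy as np

rng = np.random.default_rng(0)
theta1, theta2 = 0.6, 0.4             # the two hypotheses
a, b = np.log(0.05), np.log(20.0)     # stop when the log-likelihood ratio exits (a, b)

def sprt(true_theta):
    llr, n = 0.0, 0
    while a < llr < b:
        x = rng.binomial(1, true_theta)
        n += 1
        # z_i = log P_theta1(x_i) - log P_theta2(x_i)
        llr += np.log(theta1 if x else 1 - theta1) - np.log(theta2 if x else 1 - theta2)
    return ("accept theta1" if llr >= b else "accept theta2"), n

print(sprt(true_theta=0.6))
print(sprt(true_theta=0.4))
```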
5.3 Martingales
Martingales are a fundamentally important concept in the analysis of stochastic processes in which the conditional expectation of the process at time t + 1, given its history, equals its value at time t.
An example of a martingale sequence is when xt is the amount of money you
have at a given time, and where at each time-step t you are making a gamble
such that you lose or gain 1 currency unit with equal probability. Then, at any
step t, it holds that E(xt+1 | xt ) = xt . This concept can be generalised to two
random processes xt and yt , which are dependent.
exists and
E(yn+1 | xn ) = yn (5.3.2)
holds with probability 1. If {yn } is a martingale with respect to itself, i.e.
yi (x) = x, then we call it simply a martingale.
At a first glance, it might appear that martingales are not very frequently
encountered, apart from some niche applications. However, we can always con-
struct a martingale from any sequence of random variables as follows.
yn (xn ) = E[f | xn ].
This allows us to bound the probability that the difference sequence deviates from zero. Since there are only few problems where the random variables of interest already form a difference sequence, this theorem is most commonly applied by defining a new sequence of random variables that is a difference sequence.
5.5 Exercises
Exercise 26. Consider a stationary Markov process with state space S and whose transition kernel is a matrix τ. At time t, we are at state xt = s and we can either (1) terminate and receive reward b(s), or (2) pay c(s) and continue to a random state xt+1 drawn from the distribution τ(· | s).
Assuming b, c > 0 and τ are known, design a backwards induction algorithm that
optimises the utility function
$$U(x_1, \ldots, x_T) = b(x_T) - \sum_{t=1}^{T-1} c(x_t).$$
Finally, show that the expected utility of the optimal policy starting from any
state must be bounded.
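One possible sketch of a solution (not the book's reference solution) is the following backwards induction, which at each stage compares the value of stopping with the expected value of continuing; the chain, rewards and costs are made-up numbers.

```python
# Backwards induction for the optimal stopping problem of Exercise 26.
import numpy as np

def optimal_stopping(b, c, tau, T):
    """Return V[t][s], the optimal expected utility at time t in state s."""
    n = len(b)
    V = np.zeros((T + 1, n))
    V[T] = b                                   # at the horizon we must stop
    for t in range(T - 1, -1, -1):
        continue_value = -c + tau @ V[t + 1]   # pay c(s), then move according to tau
        V[t] = np.maximum(b, continue_value)   # stop or continue, whichever is better
    return V

b = np.array([1.0, 5.0, 2.0])
c = np.array([0.1, 0.1, 0.1])
tau = np.array([[0.5, 0.3, 0.2],
                [0.2, 0.6, 0.2],
                [0.3, 0.3, 0.4]])
print(optimal_stopping(b, c, tau, T=10)[0])
```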
Exercise 27. Consider the problem of classification with features x ∈ X and labels
y ∈ Y, where each label costs c > 0. Assume a Bayesian model with some parameter
space Θ on which we have a prior distribution ξ0 . Let ξt be the posterior distribution
after t examples (x1 , y1 ), . . . , (xt , yt ).
Let our expected utility be the expected accuracy (i.e., the marginal probability
of correctly guessing the right label over all possible models) of the Bayes-optimal
classifier π : X → Y minus the cost paid:
$$\mathbb{E}_t(U) \triangleq \max_\pi \int_\Theta \int_X P_\theta(\pi(x) \mid x)\, dP_\theta(x)\, d\xi_t(\theta) - ct$$
[Figure: expected, actual, and predicted performance as a function of the number of examples.]
Experiment design and Markov decision processes
6.1 Introduction
This chapter introduces the very general formalism of Markov decision processes
(MDPs) that allows representation of various sequential decision making prob-
lems. Thus a Markov decision process can be used to model stochastic path
problems, stopping problems as well as problems in reinforcement learning, ex-
periment design, and control.
We begin by taking a look at the problem of experimental design. One
instance of this problem occurs when considering how to best allocate treat-
ments with unknown efficacy to patients in an adaptive manner, so that the
best treatment is found, or so as to maximise the number of patients that
are treated successfully. The problem, originally considered by Chernoff [1959,
1966], informally can be stated as follows.
We have a number of treatments of unknown efficacy, i.e., some of them
work better than the others. We observe patients one at a time. When a new
patient arrives, we must choose which treatment to administer. Afterwards, we
observe whether the patient improves or not. Given that the treatment effects
are initially unknown, how can we maximise the number of cured patients? Al-
ternatively, how can we discover the best treatment? The two different problems
are formalised below.
Example 31 (Adaptive treatment allocation). Consider k treatments to be administered to T volunteers. To each volunteer only a single treatment can be assigned. At the t-th trial, we treat one volunteer with some treatment at ∈ {1, . . . , k}. We then obtain a reward rt = 1 if the patient is healed and 0 otherwise. We wish to choose actions maximising the utility $U = \sum_t r_t$. This would correspond to maximising the number of patients that get healed over time.
$$\xi_t(\omega) \triangleq \xi_0(\omega \mid a_1, \ldots, a_t, x_1, \ldots, x_t).$$
Our utility can again be expressed as a sum over individual rewards, $U = \sum_{t=1}^T r_t$, where T ∈ (0, ∞] is the horizon and γ ∈ (0, 1] is a discount factor. The reward rt is stochastic, and only depends on the current action, with expectation E(rt | at = i) = ωi.
In order to select the actions, we must specify some policy or decision rule.
Such a rule can only depend on the sequence of previously taken actions and
observed rewards. Usually, the policy π : A∗ × R∗ → A is a deterministic
mapping from the space of all sequences of actions and rewards to actions.
That is, for every observation and action history a1 , r1 , . . . , at−1 , rt−1 it suggests
a single action at . More generally, it could also be a stochastic policy, that
specifies a mapping to action distributions. We use the notation
The following figure summarises the statement of the bandit problem in the
Bayesian setting.
There are two main difficulties with this approach. The first is specifying
the family and the prior distribution: this is effectively part of the problem
formulation and can severely influence the solution. The second is calculating
the policy that maximises expected utility given a prior and a family. The first
problem can be resolved by either specifying a subjective prior distribution,
or by selecting a prior distribution that has good worst-case guarantees. The
second problem is hard to solve, because in general, such policies are history
dependent and the set of all possible histories is exponential in the horizon T .
be the empirical reward of arm i at time t. We can set r̂t,i = 0 when Nt,i = 0.
Then, the posterior distribution for the parameter of arm i is
Since rt ∈ {0, 1}, the possible states of our belief given some prior are in ℕ^{2n}.
In order to evaluate a policy we need to be able to predict the expected
utility we obtain. The latter only depends on our current belief, and the state
of our belief corresponds to the state of the bandit problem. This means that
everything we know about the problem at time t can be summarised by ξt . For
Bernoulli bandits, a sufficient statistic for our belief is the number of times we
played each bandit and the total reward from each bandit. Thus, our state at
time t is entirely described by our priors α, β (the initial state) and the vectors
The next belief is random, since it depends on the random quantity rt . In fact,
the probability of the next reward lying in some set R if at = a is given by the
marginal distribution
$$P_{\xi_t, a}(R) \triangleq \int_\Omega P_{\omega,a}(R)\, d\xi_t(\omega). \qquad (6.2.9)$$
In practice, although multiple reward sequences may lead to the same beliefs,
we frequently ignore that possibility for simplicity. Then the process becomes a
tree. A solution to the problem of which action to select is given by a backwards
induction algorithm similar to the one given in Section 5.2.2:
$$U^*(\xi_t) = \max_{a_t}\left[ \mathbb{E}(r_t \mid \xi_t, a_t) + \sum_{\xi_{t+1}} P(\xi_{t+1} \mid \xi_t, a_t)\, U^*(\xi_{t+1}) \right]. \qquad (6.2.11)$$
The above equation is the backwards induction algorithm for bandits. If you
look at this structure, you can see that the next belief only depends on the
current belief, action, and reward, i.e., it satisfies the Markov property, as seen
in Figure 6.1.
Figure 6.1: A partial view of the multi-stage process. Here, the probability that
we obtain r = 1 if we take action at = i is simply Pξt ,i ({1}).
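To make the recursion concrete, the following sketch performs backwards induction over belief states for a two-armed Bernoulli bandit with independent Beta(1, 1) priors and a tiny horizon; the priors and horizon are arbitrary choices, and the listing is only an illustration of (6.2.11), not the book's own algorithm.

```python
# Backwards induction over belief states for a two-armed Bernoulli bandit.
# The belief state is the tuple of (successes, failures) counts per arm.
from functools import lru_cache

N_ARMS, HORIZON = 2, 4

@lru_cache(maxsize=None)
def u_star(counts, t):
    """Optimal expected utility from belief 'counts' with t steps remaining."""
    if t == 0:
        return 0.0
    best = -1.0
    for a in range(N_ARMS):
        s, f = counts[a]
        p = (s + 1) / (s + f + 2)                      # posterior mean of arm a (Beta(1,1) prior)
        succ = list(counts); succ[a] = (s + 1, f)      # next belief if r = 1
        fail = list(counts); fail[a] = (s, f + 1)      # next belief if r = 0
        value = p * (1 + u_star(tuple(succ), t - 1)) + (1 - p) * u_star(tuple(fail), t - 1)
        best = max(best, value)
    return best

initial = ((0, 0), (0, 0))
print("optimal expected utility:", u_star(initial, HORIZON))
```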
[Figure 6.2, three panels: (a) the basic process, (b) the Bayesian model, (c) the lifted process.]
Figure 6.2: Three views of the bandit process. The figure shows the basic bandit
process from the view of an external observer. The decision maker selects at
and then obtains reward rt , while the parameter ω is hidden. The process is
repeated for t = 1, . . . , T . The Bayesian model is shown in (b) and the resulting
process in (c). While ω is not known, at each time step t we maintain a belief
ξt on Ω. The reward distribution is then defined through our belief. In (b), we
can see the complete process, where the dependency on ω is clear. In (c), we
marginalise out ω and obtain a model where the transitions only depend on the
current belief and action.
If we want to add the decision maker’s internal belief to the graph, we obtain
Figure 6.2(b). From the point of view of the decision maker, the distribution of
ω only depends on his current belief. Consequently, the distribution of rewards
also only depends on the current belief, as we can marginalise over ω. This
gives rise to the decision-theoretic bandit process shown in Figure 6.2(c). In the
following section, we shall consider Markov decision processes more generally.
P (S | s, a) = Pµ (st+1 ∈ S | st = s, at = a) (6.3.1)
ρ(R | s, a) = Pµ (rt ∈ R | st = s, at = a). (6.3.2)
Usually, an initial state s0 (or more generally, an initial distribution from which
s0 is sampled) is specified.
[Graphical model for a Markov decision process: the state st and action at determine the reward rt and the next state st+1.]
Pµ (st+1 ∈ S | s1 , a1 , . . . , st , at ) = Pµ (st+1 ∈ S | st , at ),
(Transition distribution)
Pµ (rt ∈ R | s1 , a1 , . . . , st , at ) = Pµ (rt ∈ R | st , at ),
(Reward distribution)
Policies. A policy π (sometimes also called a decision function) specifies which
action to take. One can think of a policy as implemented through an algorithm
or an embodied agent, who is interested in maximising expected utility.
Policies
A policy π defines a conditional distribution on actions given the history:
where T is the horizon, after which the agent is no longer interested in rewards, and γ ∈ (0, 1] is the discount factor, which discounts future rewards. It is convenient to introduce a special notation for the utility starting from time t, i.e., the sum of rewards from that time on:
$$U_t \triangleq \sum_{k=0}^{T-t} \gamma^k r_{t+k}. \qquad (6.3.8)$$
At any time t, the agent wants to find a policy π maximising the expected total future reward
$$\mathbb{E}^\pi_\mu U_t = \mathbb{E}^\pi_\mu \sum_{k=0}^{T-t} \gamma^k r_{t+k}. \qquad \text{(expected utility)}$$
This is identical to the expected utility framework we have seen so far, with the only difference that the reward is now a sequence of numerical rewards and that we are acting within a dynamical system with state space S.
In fact, it is a good idea to think about the value of different states of the system
under certain policies in the same way that one thinks about how good different
positions are in chess.
The state value function for a particular policy π in an MDP µ can be inter-
preted as how much utility you should expect if you follow the policy starting
from state s at time t.
Note that the last term can be calculated easily through marginalisation, i.e.,
$$P^\pi_\mu(s_{t+1} = i \mid s_t = s) = \sum_{a \in A} P_\mu(s_{t+1} = i \mid s_t = s, a_t = a)\, P_\pi(a_t = a \mid s_t = s).$$
Direct policy evaluation Using (6.4.2) we can define a simple algorithm for
evaluating a policy’s value function, as shown in Algorithm 2.
Lemma 6.4.1. For each state s, the value Vt(s) computed by Algorithm 2 satisfies $V_t(s) = V^\pi_{\mu,t}(s)$.
$$V_t(s) = \sum_{k=t}^{T} \sum_{j \in S} P^\pi_\mu(s_k = j \mid s_t = s)\, \mathbb{E}^\pi_\mu(r_k \mid s_k = j).$$
$$\hat{V}_t(s) = \mathbb{E}^\pi_\mu(r_t \mid s_t = s) + \sum_{j \in S} P^\pi_\mu(s_{t+1} = j \mid s_t = s)\, \hat{V}_{t+1}(j). \qquad (6.4.6)$$
Theorem 6.4.1. The backwards induction algorithm gives estimates V̂t(s) satisfying
$$\hat{V}_t(s) = V^\pi_{\mu,t}(s). \qquad (6.4.7)$$
Proof. For t = T, the result is obvious. We prove the remainder by induction. We assume that $\hat{V}_{t+1}(s) = V^\pi_{\mu,t+1}(s)$ and show that (6.4.7) follows. Indeed, from the recursion (6.4.6) we have
$$\hat{V}_t(s) = \mathbb{E}^\pi_\mu(r_t \mid s_t = s) + \sum_{j \in S} P^\pi_\mu(s_{t+1} = j \mid s_t = s)\, \hat{V}_{t+1}(j) = \mathbb{E}^\pi_\mu(r_t \mid s_t = s) + \sum_{j \in S} P^\pi_\mu(s_{t+1} = j \mid s_t = s)\, V^\pi_{\mu,t+1}(j),$$
where the second equality is by the induction hypothesis, the remaining equalities are by the definition of the utility, and the last by definition of $V^\pi_{\mu,t}$.
6.5 Infinite-horizon
When problems have no fixed horizon, they usually can be modelled as infinite
horizon problems, sometimes with help of a terminal state, whose visit termi-
nates the problem, or discounted rewards, which indicate that we care less about
rewards further in the future. When reward discounting is exponential, these
problems can be seen as undiscounted problems with random and geometrically
distributed horizon. For problems with no discounting and no terminal states there are some complications in the definition of the optimal policy. However, we defer discussion of such problems to Chapter 10.
6.5.1 Examples
We begin with some examples, which will help elucidate the concept of terminal
states and infinite horizon. The first is shortest path problems, where the aim is
to find the shortest path to a particular goal. Although the process terminates
when the goal is reached, not all policies may be able to reach the goal, and so
the process may never terminate.
Shortest-path problems
We shall consider two types of shortest path problems, deterministic and stochas-
tic. Although conceptually very different, both problems have essentially the
same complexity.
[Figure 6.4: a maze in which each reachable state is labelled with its distance to the goal state X.]
Properties: γ = 1, T → ∞; rt = −1 unless st = X, in which case rt = 0; Pµ(st+1 = X | st = X) = 1; A = {North, South, East, West}; transitions are deterministic and walls block.
Solving the shortest path problem with deterministic transitions can be done
simply by recursively defining the distance of states to X. Thus, first the dis-
tance of X to X is set to 0. Then for states s with distance d to X and with
a neighbor state s′ with no assigned distance yet, one assigns s′ the distance
d + 1 to X. This is illustrated in Figure 6.4, where for all reachable states the
distance to X is indicated. The respective optimal policy at each step simply
moves to a neighbor state with the smallest distance to X. Its reward starting
in any state s is simply the negative distance from s to X.
Stochastic shortest path problem with a pit. Now let us assume the
shortest path problem with stochastic dynamics. That is, at each time-step
there is a small probability ω that we move to a random direction. To make
this more interesting, we can add a pit O, that is a terminal state giving a
one-time negative reward of −100 (and 0 reward for all further steps) as seen
in Figure 6.5
[Figure 6.5: the stochastic shortest path problem with a pit O and goal X, together with the resulting value function.]
Properties: γ = 1, T → ∞; rt = −1, but rt = 0 at X and −100 (once, with reward 0 afterwards) at O; Pµ(st+1 = X | st = X) = 1; Pµ(st+1 = O | st = O) = 1; A = {North, South, East, West}; moves in a random direction with probability ω; walls block.
Continuing problems
Many problems have no natural terminal state, but are continuing ad infini-
tum. Frequently, we model those problems using a utility that discounts future rewards exponentially. This way, we can guarantee that the utility is bounded. In addition, exponential discounting also makes economic sense. This is partially because of the effects of inflation, and partially because money available now may be more useful than money one obtains in the future. Both these effects diminish the value of money over time. As an example, consider the
following inventory management problem.
Example 33 (Inventory management). There are K storage locations, and each location i can store ni items. At each time-step there is a probability φi that a client wants to buy an item from location i, where $\sum_i \phi_i \le 1$. If there is an item available when this happens, you gain reward 1. There are two types of actions: one orders u units of stock, paying c(u); the other moves u units of stock from one location i to another location j, paying ψij(u).
For simplicity, in the following we assume that rewards only depend on the cur-
rent state instead of both state and action. It can easily be verified that the
results presented below can be adapted to the latter case. More importantly, we
also assume that the state and action spaces S, A are finite, and that the tran-
sition kernel of the MDP is time-invariant. This allows us to use the following
simplified vector notation:
$v^\pi = (\mathbb{E}^\pi(U_t \mid s_t = s))_{s \in S}$ is a vector in $\mathbb{R}^{|S|}$ representing the value of policy π.
Sometimes we will use p(j | s, a) as a shorthand for Pµ(st+1 = j | st = s, at = a).
$P_{\mu,\pi}$ is the transition matrix in $\mathbb{R}^{|S| \times |S|}$ for policy π, such that
$$P_{\mu,\pi}(i, j) = \sum_a p(j \mid i, a)\, P_\pi(a \mid i).$$
Proof. Let rt be the random reward at step t when starting in state s and following policy π. Then
$$
\begin{aligned}
V^\pi(s) &= \mathbb{E}^\pi\left(\sum_{t=0}^\infty \gamma^t r_t \,\Big|\, s_0 = s\right) \\
&= \sum_{t=0}^\infty \gamma^t\, \mathbb{E}^\pi(r_t \mid s_0 = s) \\
&= \sum_{t=0}^\infty \gamma^t \sum_{i \in S} P^\pi(s_t = i \mid s_0 = s)\, \mathbb{E}(r_t \mid s_t = i).
\end{aligned}
$$
v = r + γPµ,π v. (6.5.2)
Proof of Theorem 6.5.1. First note that by manipulating (6.5.3), one obtains r = (I − γPµ,π)v^π. Since ‖γPµ,π‖ < 1 · ‖Pµ,π‖ = 1, the inverse
$$(I - \gamma P_{\mu,\pi})^{-1} = \lim_{n\to\infty} \sum_{t=0}^n (\gamma P_{\mu,\pi})^t$$
exists.
It is important to note that the entries of the matrix X = (I − γPµ,π)^{-1} are the expected number of discounted cumulative visits to each state s, starting from state s′ and following policy π. More specifically,
$$x(s, s') = \mathbb{E}^\pi_\mu\left\{\sum_{t=0}^\infty \gamma^t\, P^\pi_\mu(s_t = s \mid s_0 = s')\right\}. \qquad (6.5.5)$$
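As a small illustration, the sketch below evaluates a fixed policy exactly by solving the linear system v = r + γPv, i.e. v = (I − γP)^{-1}r; the transition matrix and rewards are made-up numbers.

```python
# Exact policy evaluation for a fixed policy: v = (I - gamma * P)^{-1} r.
import numpy as np

gamma = 0.9
P = np.array([[0.9, 0.1, 0.0],       # transition matrix P_{mu,pi} of the policy
              [0.0, 0.8, 0.2],
              [0.1, 0.0, 0.9]])
r = np.array([0.0, 0.0, 1.0])        # expected reward in each state
v = np.linalg.solve(np.eye(3) - gamma * P, r)
print(v)
```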
Definition 6.5.2 (Policy and Bellman operator). The linear operator of a pol-
icy π is
$$L_\pi v \triangleq r + \gamma P_\pi v. \qquad (6.5.6)$$
The (non-linear) Bellman operator in the space of value functions V is defined as:
$$L v \triangleq \sup_\pi \{r + \gamma P_\pi v\}, \qquad v \in V. \qquad (6.5.7)$$
We now show that the Bellman operator satisfies the following monotonicity
properties with respect to an arbitrary value vector v.
Theorem 6.5.3. Let $v^* \triangleq \sup_\pi v^\pi$. Then for any bounded r, it holds that for v ∈ V:
(1) If v ≥ L v, then v ≥ v∗.
(2) If v ≤ L v, then v ≤ v∗.
Proof. We first prove (1). A simple proof by induction over n shows that for any π
$$v \ge r + \gamma P_\pi v \ge \sum_{k=0}^{n-1} \gamma^k P_\pi^k r + \gamma^n P_\pi^n v.$$
Since $v^\pi = \sum_{t=0}^\infty \gamma^t P_\pi^t r$, it follows that
$$v - v^\pi \ge \gamma^n P_\pi^n v - \sum_{k=n}^\infty \gamma^k P_\pi^k r.$$
The first term on the right-hand side can be bounded by an arbitrary ǫ/2 for large enough n. Also note that
$$-\sum_{k=n}^\infty \gamma^k P_\pi^k r \ge -\frac{\gamma^n e}{1-\gamma},$$
with e being a unit vector, so this can be bounded by ǫ/2 as well. So for any π and ǫ > 0,
$$v \ge v^\pi - \epsilon,$$
so
$$v \ge \sup_\pi v^\pi.$$
The proof of (2) is analogous and yields $v \le v^\pi + \epsilon$.
A similar theorem can also be proven for the repeated application of the
Bellman operator
Theorem 6.5.4. Let V be the set of value vectors with Bellman operator L .
Then:
where the first inequality is due to the fact that P v ≥ P v ′ for any P . For the
second part,
$$L v_{N+k} = v_{N+k+1} = L^k L v_N \le L^k v_N = v_{N+k},$$
un+1 = T un = T n+1 u0
converges to u∗ .
Proof. The first claim follows from the contraction property. If u1, u2 are both fixed points then
$$\|u_1 - u_2\| = \|T u_1 - T u_2\| \le \gamma \|u_1 - u_2\|,$$
which implies $\|u_1 - u_2\| = 0$.
For any m ≥ 1,
$$\|u_{n+m} - u_n\| \le \sum_{k=0}^{m-1} \|u_{n+k+1} - u_{n+k}\| = \sum_{k=0}^{m-1} \|T^{n+k} u_1 - T^{n+k} u_0\| \le \sum_{k=0}^{m-1} \gamma^{n+k} \|u_1 - u_0\| = \frac{\gamma^n(1-\gamma^m)}{1-\gamma}\, \|u_1 - u_0\|.$$
Then, for any ǫ > 0, there is n such that kun+m − un k ≤ ǫ. Since X is a Banach
space, the sequence has a limit u∗ , which is unique.
Theorem 6.5.6. For γ ∈ [0, 1) the Bellman operator L is a contraction map-
ping in V.
Proof. Let v, v ′ ∈ V. Consider s ∈ S such that L v(s) ≥ L v ′ (s), and let
$$a^*_s \in \arg\max_{a \in A}\left\{ r(s) + \gamma \sum_{j \in S} p_\mu(j \mid s, a)\, v(j) \right\}.$$
Then, using the fact in (a) that $a^*_s$ is optimal for v, but not necessarily for v′, we have:
$$0 \le L v(s) - L v'(s) \overset{(a)}{\le} \gamma\sum_{j\in S} p(j \mid s, a^*_s)\, v(j) - \gamma\sum_{j\in S} p(j \mid s, a^*_s)\, v'(j) = \gamma \sum_{j \in S} p(j \mid s, a^*_s)\, [v(j) - v'(j)] \le \gamma \sum_{j\in S} p(j \mid s, a^*_s)\, \|v - v'\| = \gamma\, \|v - v'\|.$$
Taking the supremum over all possible s, the required result follows.
It is easy to show the same result for the Lπ operator, as a corollary to the
above theorem.
Theorem 6.5.7. For any MDP µ with discrete S, bounded r, and γ ∈ [0, 1), the operator Lπ is a contraction mapping in V.
Value iteration
Value iteration is a version of backwards induction for infinite-horizon discounted MDPs with a finite state space. Since the horizon is infinite, the algorithm also requires a stopping condition. This is typically done by comparing the change in the value function from one step to the next with a small threshold, as in the sketch below.
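A minimal sketch of value iteration for a small finite MDP is given below; the toy transition probabilities, rewards and threshold are arbitrary choices, and this is an illustration rather than the book's own listing.

```python
# Value iteration: repeatedly apply the Bellman operator until the value
# function stops changing by more than a small threshold.
import numpy as np

gamma, eps = 0.9, 1e-6
P = np.array([[[0.8, 0.2], [0.1, 0.9]],    # P[a, s, s']
              [[0.5, 0.5], [0.3, 0.7]]])
r = np.array([[0.0, 1.0],                  # r[s, a]
              [0.5, 0.0]])

v = np.zeros(2)
while True:
    q = r + gamma * np.einsum('ast,t->sa', P, v)   # Q(s, a) = r(s, a) + gamma * sum_s' P v
    v_new = q.max(axis=1)                          # apply the Bellman operator
    if np.max(np.abs(v_new - v)) < eps:
        break
    v = v_new
print("v* ~", v_new, "greedy policy:", q.argmax(axis=1))
```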
The termination condition of value iteration has been left unspecified. How-
ever, the theorem above shows that if we terminate when (6.5.8) is true, then
our error will be bounded by ǫ. However, better termination conditions can be
obtained.
Now let us prove how fast value iteration converges.
Theorem 6.5.9. Value iteration converges with error in O(γ^n). More specifically, for r ∈ [0, 1] and v0 = 0,
$$\|v_n - v^*\| \le \frac{\gamma^n}{1-\gamma}, \qquad \|V^{\pi_n}_\mu - v^*\| \le \frac{2\gamma^n}{1-\gamma}. \qquad (6.5.9)$$
Proof. The first part follows from the contraction property (Theorem 6.5.6):
Now divide by γ n to obtain the final result. The second part follows from ele-
mentary analysis. If a function f (x) is maximised at x∗ , while g(y) is maximised
at y ∗ and |f − g| ≤ ǫ, then |f (x∗ ) − f (y ∗ )| ≤ 2ǫ. Since πn maximises vn and vn
is γ n /(1 − γ)-close to v ∗ , the result follows.
Policy iteration
Unlike value iteration, policy iteration attempts to iteratively improve a given
policy, rather than a value function. At each iteration, it calculates the value
of the current policy and then calculates the policy that is greedy with respect
to this value function. For finite MDPs, the policy evaluation step can be
performed with either linear algebra or backwards induction, while the policy
improvement step is trivial. The algorithm described below can be extended to
the case when the reward also depends on the action, by replacing r with the
policy-dependent reward vector rπ .
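For comparison, here is a sketch of policy iteration on the same kind of toy MDP, alternating exact policy evaluation (by linear algebra) with greedy improvement; again, the MDP is made up and the listing is only illustrative.

```python
# Policy iteration: evaluate the current policy exactly, then improve it
# greedily, stopping once the policy no longer changes.
import numpy as np

gamma = 0.9
P = np.array([[[0.8, 0.2], [0.1, 0.9]],    # P[a, s, s']
              [[0.5, 0.5], [0.3, 0.7]]])
r = np.array([[0.0, 1.0],                  # r[s, a]
              [0.5, 0.0]])

policy = np.zeros(2, dtype=int)            # initial policy: action 0 everywhere
while True:
    # policy evaluation: v = (I - gamma * P_pi)^{-1} r_pi
    P_pi = P[policy, np.arange(2)]         # row s is P[policy[s], s, :]
    r_pi = r[np.arange(2), policy]
    v = np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)
    # policy improvement: greedy with respect to v
    q = r + gamma * np.einsum('ast,t->sa', P, v)
    new_policy = q.argmax(axis=1)
    if np.array_equal(new_policy, policy):
        break
    policy = new_policy
print("optimal policy:", policy, "value:", v)
```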
Theorem 6.5.10. Let vn , vn+1 be the value vectors generated by policy itera-
tion. Then vn ≤ vn+1 .
Proof. We have
$$r + \gamma P_{\pi_{n+1}} v_n \ge r + \gamma P_{\pi_n} v_n = v_n,$$
where the equality is due to the policy evaluation step for πn . Rearranging, we
get r ≥ (I − γPπn+1 )vn and hence
(I − γPπn+1 )−1 r ≥ vn ,
noting that the inverse is positive. Since the left side equals vn+1 by the policy
evaluation step for πn+1 , the theorem follows.
We can use the fact that the policies are monotonically improving to show
that policy iteration will terminate after a finite number of steps.
Proof. There is only a finite number of policies, and since policies in policy
iteration are monotonically improving, the algorithm must stop after finitely
many iterations. Finally, the last iteration satisfies
However, it is easy to see that the number of policies is $|A|^{|S|}$, thus the above corollary only guarantees exponential-time convergence in the number of states. It is also known that the complexity of policy iteration is strongly polynomial for any fixed γ [Ye, 2011], with the number of iterations required being
$$\frac{|S|^2 (|A|-1)}{1-\gamma} \cdot \ln\frac{|S|^2}{1-\gamma}.$$
Policy iteration seems to have very different behaviour from value iteration.
In fact, one can obtain families of algorithms that lie at the extreme ends of
the spectrum between policy iteration and value iteration. The first member
of this family is modified policy iteration, and the second member is temporal
difference policy iteration.
Modified policy iteration can perform much better than either pure value
iteration or pure policy iteration.
A geometric view
It is perhaps interesting to see the problem from a geometric perspective. This
also gives rise to the so-called “temporal-difference” set of algorithms. First, we
define the difference operator, which is the difference between a value function
vector v and its transformation via the Bellman operator.
Definition 6.5.3. The difference operator is defined as $B v \triangleq L v - v$.
Essentially, it is the change in the value function vector when we apply the
Bellman operator. Thus the Bellman optimality equation can be rewritten as
Bv = 0. (6.5.13)
Now let us define the set of greedy policies with respect to a value vector v ∈ V
to be:
$$\Pi_v \triangleq \arg\max_{\pi \in \Pi}\{r + (\gamma P_\pi - I)v\}.$$
We can now show the following inequality between the two different value func-
tion vectors.
Theorem 6.5.11. For any v, v ′ ∈ V and π ∈ Πv
Figure 6.7: The difference operator. The graph shows the effect of the operator
for the optimal value function v ∗ , and two arbitrary value functions, v1 , v2 . Each
line is the improvement effected by the greedy policy π ∗ , π1 , π2 with respect to
each value function v ∗ , v1 , v2 .
Theorem 6.5.12. Let {vn } be the sequence of value vectors obtained from policy
iteration. Then for any π ∈ Πvn ,
vn+1 = vn − (γPπ − I)−1 Bvn . (6.5.15)
Proof. By definition, we have for π ∈ Πvn
vn+1 = (I − γPπ )−1 r − vn + vn
= (I − γPπ )−1 [r − (I − γPπ )vn ] + vn .
Since r − (I − γPπ )vn = Bvn the claim follows.
Note the similarity to the difference operator in modified policy iteration. The idea of temporal-difference policy iteration is to adjust the current value vn using the temporal differences mixed over an infinite number of steps:
$$\tau_n(i) = \mathbb{E}^{\pi_n}\left[\sum_{t=0}^\infty (\gamma\lambda)^t\, d_n(s_t, s_{t+1}) \,\Big|\, s_0 = i\right], \qquad (6.5.18)$$
$$v_{n+1} = v_n + \tau_n. \qquad (6.5.19)$$
Here the λ parameter is a simple way to mix together the different temporal
difference errors. If λ → 1, our error will be dominated by the terms far in
the future, while if λ → 0, our error τn , will be dominated by the short-term
discrepancies in our value function. In the end, we shall adjust our value function
in the direction of this error.
Putting all of those steps together, we obtain the following algorithm:
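One way such a scheme can be realised is sketched below, under the assumption that the expected temporal-difference correction in (6.5.18) can be written as τn = (I − γλP_π)^{-1}(L vn − vn) for the greedy policy π; for λ = 1 this reduces to policy iteration and for λ = 0 to value iteration. The toy MDP is made up and the listing is not the book's own algorithm.

```python
# A lambda-mixed policy/value iteration sketch: compute the greedy policy for
# v_n, form tau_n = (I - gamma*lambda*P_pi)^{-1} (L v_n - v_n), and update.
import numpy as np

gamma, lam = 0.9, 0.5
P = np.array([[[0.8, 0.2], [0.1, 0.9]],    # P[a, s, s']
              [[0.5, 0.5], [0.3, 0.7]]])
r = np.array([0.0, 1.0])                   # reward r(s), depending on the state only

v = np.zeros(2)
for _ in range(50):
    q = r[:, None] + gamma * np.einsum('ast,t->sa', P, v)
    policy = q.argmax(axis=1)              # greedy policy with respect to v_n
    bellman_residual = q.max(axis=1) - v   # B v_n = L v_n - v_n
    P_pi = P[policy, np.arange(2)]
    tau = np.linalg.solve(np.eye(2) - gamma * lam * P_pi, bellman_residual)
    v = v + tau
print(v, policy)
```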
That is, if we repeatedly apply the above operator to some vector v, then
at some point we shall obtain a fixed point v ∗ = Dn v ∗ . It is interesting to
see what happens at the two extreme choices of λ in this case. For λ = 1,
this becomes identical to standard policy iteration, as the fixed point satisfies
v ∗ = Lπn+1 v ∗ , so then v ∗ must be the value of policy πn+1 . For λ = 0, one
obtains standard value iteration, as the fixed point is reached under one step
and is simply v ∗ = Lπn+1 vn , i.e. the approximate value of the one-step greedy
policy. In other words, the new value vector is moved only partially towards the
direction of the Bellman update, depending on how we choose λ.
Linear programming
Perhaps surprisingly, we can also solve Markov decision processes through linear
programming. The main idea is to reformulate the maximisation problem as
a linear optimisation problem with linear constraints. The first step in our
procedure is to recall that there is an easy way to determine whether a particular
v is an upper bound on the optimal value function v ∗ , since if
$v \ge L v$, then $v \ge v^*$ (by Theorem 6.5.3). Now consider an arbitrary distribution on the states $y \in \Delta^{|S|}$. Then we can write the following linear program.
$$\min_v\ y^\top v, \quad \text{such that} \quad v(s) - \gamma\, p_{s,a}^\top v \ge r(s, a) \qquad \forall a \in A,\ s \in S,$$
where we use $p_{s,a}$ to denote the vector of next-state probabilities p(j | s, a).
Note that the inequality condition is equivalent to v ≥ L v. Consequently,
the problem is to find the smallest v that satisfies this inequality. When A, S
are finite, it is easy to see that this will be the optimal value function and the
Bellman equation is satisfied.
It also pays to look at the dual linear program, which is in terms of a
maximisation. This time, instead of finding the minimal upper bound on the
value function, we find the maximal cumulative discounted state-action visits
x(s, a) that are consistent with the transition kernel of the process.
with y ∈ ∆|S| .
The equality condition ensures that x is consistent with the transition kernel
of the Markov decision process. Consequently, the program can be seen as search
among all possible cumulative state-action distributions to find the one giving
the highest total reward.
Consider µ such that there exists a policy π∗ for which $V^{\pi^*}$ exists and
$$\lim_{T\to\infty} V^{\pi^*,T} = V^{\pi^*} \ge \limsup_{T\to\infty} V^{\pi,T}.$$
$$g^\pi_+(s) \triangleq \limsup_{T\to\infty} \frac{1}{T} V^{\pi,T}(s), \qquad g^\pi_-(s) \triangleq \liminf_{T\to\infty} \frac{1}{T} V^{\pi,T}(s). \qquad (6.6.7)$$
6.7 Summary
Markov decision processes can represent shortest path problems, stopping prob-
lems, experiment design problems, multi-armed bandit problems and reinforce-
ment learning problems.
Bandit problems are the simplest type of Markov decision process, since they
have a fixed, never-changing state. However, to solve them, one can construct
a Markov decision processes in belief space, within a Bayesian framework. It is
then possible to apply backwards induction to find the optimal policy.
Backwards induction is applicable more generally to arbitrary Markov de-
cision processes. For the case of infinite-horizon problems, it is referred to as
value iteration, as it converges to a fixed point. It is tractable when either the
state space S or the horizon T are small (finite).
When the horizon is infinite, policy iteration can also be used to find optimal
policies. It is different from value iteration in that at every step, it fully evaluates
a policy before the improvement step, while value iteration only performs a
partial evaluation. In fact, at the n-th iteration, value iteration has calculated
the value of an n-step policy.
We can arbitrarily mix between the two extremes of policy iteration and
value iteration in two ways. Firstly, we can perform a k-step partial evaluation.
When k = 1, we obtain value iteration, and when k → ∞, we obtain policy iter-
ation. The generalised algorithm is called modified policy iteration. Secondly, we can adjust our value function by using a temporal-difference error of values at future time steps. Again, we can mix liberally between policy iteration and value iteration by focusing on errors far in the future (policy iteration) or on short-term errors (value iteration).
Finally, it is possible to solve MDPs through linear programming. This is
done by reformulating the problem as a linear optimisation with constraints.
In the primal formulation, we attempt to find a minimal upper bound on the
optimal value function. In the dual formulation, our goal is to find a distribution
on state-action visitations that maximises expected utility and is consistent with
the MDP model.
6.9 Exercises
6.9.1 Medical diagnosis
Exercise 28 (Continuation of exercise 24). Now consider the case where you have the choice between tests to perform. First, you observe S, whether or not the patient is a smoker. Then, you select a test to make: d1 ∈ {X-ray, ECG}. Finally, you decide whether or not to treat for ACS: d2 ∈ {heart treatment, no treatment}. An untreated ACS patient may die with probability 2%, while a treated one with probability 0.2%. Treating a non-ACS patient results in death with probability 0.1%.
1. Draw a decision diagram, where:
S is an observed random variable taking values in {0, 1}.
A is a hidden variable taking values in {0, 1}.
C is a hidden variable taking values in {0, 1}.
d1 is a choice variable, taking values in {X-ray, ECG}.
r1 is a result variable, taking values in {0, 1}, corresponding to negative and positive test results.
d2 is a choice variable, which depends on the test results, d1 and on S.
r2 is a result variable, taking values in {0, 1} corresponding to the patient
dying (0), or living (1).
2. Let d1 = X-ray, and assume the patient suffers from ACS, i.e. A = 1. How is
the posterior distributed?
3. What is the optimal decision rule for this problem?
Exercise 30 (120). In this case, we assume that the probability that the i-th algo-
rithm successfully solves the t-th task is always pi . Furthermore, tasks are in no way
distinguishable from each other. In each case, assume that pi ∈ {0.1, . . . , 0.9} and a prior distribution ξi(pi) = 1/9 for all i, with a complete belief $\xi(p) = \prod_i \xi_i(p_i)$, and
formulate the problem as a decision-theoretic n-armed bandit problem with reward
at time t being rt = 1 if the task is solved and rt = 0 if the problem is not solved.
Whether or not the task at time t is solved or not, at the next time-step we go to
the next problem. Our aim is to find a policy π mapping from the history of observa-
tions to selection of algorithms such that we maximise the total reward to time T in
expectation
$$\mathbb{E}_{\xi,\pi} U_0^T = \mathbb{E}_{\xi,\pi} \sum_{t=1}^T r_t.$$
using backwards induction for T ∈ {0, 1, 2, 3, 4} and report the expected utility
in each case. Hint: Use the decision-theoretic bandit formulation to dynami-
cally construct a Markov decision process which you can solve with backwards
induction. See also the extensive decision rule utility from exercise set 3.
3. Now utilise the backwards induction algorithm developed in the previous step
in a problem where we receive a sequence of N tasks to solve and our utility is
$$U_0^N = \sum_{t=1}^N r_t$$
At each step t ≤ N, find the optimal action by calculating $\mathbb{E}_{\xi,\pi} U_t^{t+T}$ for T ∈ {0, 1, 2, 3, 4} and take it. Hint: At each step you can update your belief using the same routine you use to update your prior distribution. You only need to consider T < N − t.
4. Develop a simple heuristic algorithm of your choice and compare its utility
with the utility of the backwards induction. Perform $10^3$ simulations, each
experiment running for $N = 10^3$ time-steps, and average the results. How does
the performance compare? Hint: If the program runs too slowly, go only up to
T = 3.
6.9.4 Scheduling
You are controlling a small processing network that is part of a big CPU farm.
You in fact control a set of n processing nodes. At time t, you may be given a
job of class xt ∈ X to execute. Assume these are identically and independently
drawn such that P(xt = k) = pk for all t, k. With some probability p0 , you are
not given a job to execute at the next step. If you do have a new job, then you
can either:
(a) Ignore the job, or
(b) Send the job to some node i. If the node is already active, then the previous
job is lost.
Not all the nodes and jobs are equal. Some nodes are better at processing
certain types of jobs. If the i-th node is running a job of type k ∈ X, then it has
a probability of finishing it within that time step equal to φi,k ∈ [0, 1]. Then
the node becomes free, and can accept a new job.
For this problem, assume that there are n = 3 nodes and k = 2 types of jobs
and that the completion probabilities are given by the following matrix:
\[
\Phi = \begin{pmatrix} 0.3 & 0.1 \\ 0.2 & 0.2 \\ 0.1 & 0.3 \end{pmatrix}
\tag{6.9.1}
\]
with γ = 0.9 and where we get a reward of 1 every time a job is completed.
More precisely, at each time step t, the following events happen:
1. A new job xt may arrive: with probability p0 there is no new job, otherwise
its class is k with probability pk.
2. Each node either continues processing, or completes its current job and
becomes free. You get a reward rt equal to the number of nodes that
complete their jobs within this step.
3. You decide whether to ignore the new job or add it to one of the nodes.
If you add a job, then it immediately starts running for the duration of
the time step. (If the job queue is empty then you cannot add a job to a
node, obviously)
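To get started on this exercise, the following is a minimal simulator sketch under the dynamics described above; the arrival probabilities p = (p0, p1, p2) and the random policy are illustrative assumptions, since the exercise leaves them open.

import numpy as np

rng = np.random.default_rng(0)
Phi = np.array([[0.3, 0.1],
                [0.2, 0.2],
                [0.1, 0.3]])        # completion probabilities from (6.9.1)
p = np.array([0.2, 0.5, 0.3])       # assumed P(no job), P(class 0), P(class 1)
gamma = 0.9

def step(jobs, action, new_job):
    """jobs[i] is the class run by node i, or None if the node is free.
    action is a node index or None (ignore the new job)."""
    reward = 0
    for i, k in enumerate(jobs):                    # busy nodes may finish
        if k is not None and rng.random() < Phi[i, k]:
            jobs[i] = None
            reward += 1
    if new_job is not None and action is not None:
        jobs[action] = new_job                      # overwriting loses the old job
    return jobs, reward

jobs, ret = [None, None, None], 0.0
for t in range(1000):
    u = rng.random()
    new_job = None if u < p[0] else (0 if u < p[0] + p[1] else 1)
    action = int(rng.integers(3)) if new_job is not None else None   # random policy
    jobs, reward = step(jobs, action, new_job)
    ret += gamma ** t * reward
print("discounted return of the random policy:", ret)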
Chapter 7

Simulation-based algorithms
7.1 Introduction
In this chapter, we consider the general problem of reinforcement learning in
dynamic environments. Up to now, we have only examined a solution method
for bandit problems, which are only a special case. The Bayesian decision-
theoretic solution is to reduce the bandit problem to a Markov decision process
which can then be solved with backwards induction.
We also have seen that Markov decision processes can be used to describe en-
vironments in more general reinforcement learning problems. When our knowl-
edge of the MDP describing these problems is perfect, then we can employ a
number of standard algorithms to find the optimal policy. However, in the ac-
tual reinforcement learning problem, the model of the environment is unknown.
Nevertheless, as we shall see later, both of these ideas can be combined to solve the
general reinforcement learning problem.
The main focus of this chapter is how to simultaneously learn about the
underlying process and act to maximise utility in an approximate way. This
can be done through approximate dynamic programming, where we replace the
actual unknown dynamics of the Markov decision process with estimates. The
estimates can be improved by drawing samples from the environment, either by
acting within the real environment or using a simulator. In both cases we end
up with a number of algorithms that can be used for reinforcement learning.
Although they may not perform as well as the Bayes-optimal solution, these algorithms
have a low enough computational complexity that they are worth investigating
in practice.
It is important to note that the algorithms in this chapter can be quite far
from optimal. They may converge eventually to an optimal policy, but they
may not accumulate a lot of reward while still learning. In that sense, they
are not solving the full reinforcement learning problem because their online
performance can be quite low.
For simplicity, we shall first return to the example of bandit problems. As
before, we have n actions corresponding to probability distributions Pi on the
real numbers {Pi | i = 1, . . . , n}, and our aim is to maximise the total reward (in
expectation). Had we known the distributions, we could simply always take the max-
imising action, as the expected reward of the i-th action can be easily calculated
from Pi and the reward only depends on our current action.
As the Pi are unknown, we must use a history-dependent policy. In the
remainder of this section, we shall examine algorithms which asymptotically
converge to the optimal policy (which, in the case of bandits, corresponds to
always pulling the best arm), but for which we cannot always guarantee
a good initial behaviour.
steps that on average move towards the solution, in a way to be made more
precise later. The stochastic approximation actually defines a large class of
procedures, and it contains stochastic gradient descent as a special case.
The main two parameters of the algorithm are the amount of randomness
in the ǫ-greedy action selection and the step size α in the estimation. Both of
them have a significant effect on the performance of the algorithm. Although we
could vary them with time, it is perhaps instructive to look at what happens for
fixed values of ǫ, α. Figure 7.1 shows the average reward obtained if we keep
the step size α or the randomness ǫ fixed, respectively, with initial estimates
µ0,i = 0.
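The experiment can be reproduced in a few lines; the sketch below assumes Gaussian rewards with illustrative means and uses the constant-parameter updates discussed above.

import numpy as np

rng = np.random.default_rng(0)
means = np.array([0.3, 0.5, 0.7])       # assumed true mean rewards
eps, alpha = 0.1, 0.01                   # fixed exploration rate and step size
mu = np.zeros(len(means))                # initial estimates mu_{0,i} = 0

total = 0.0
for t in range(10_000):
    if rng.random() < eps:
        a = int(rng.integers(len(means)))      # explore
    else:
        a = int(np.argmax(mu))                 # exploit the current estimates
    r = rng.normal(means[a], 1.0)
    mu[a] += alpha * (r - mu[a])               # constant step-size update
    total += r
print("average reward:", total / 10_000)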
Figure 7.1: Average reward obtained over $10^4$ steps. (a) For the case of fixed ǫt = 0.1, the step size is α ∈ {0.01, 0.1, 0.5}. (b) For the case of fixed α, the exploration rate is ǫ ∈ {0, 0.01, 0.1}.
For a fixed ǫ, we find that larger values of α tend to give a better result
eventually, while smaller values have a better initial performance. This is a
natural trade-off, since large α appears to “learn” fast, but it also “forgets”
quickly. That is, for a large α, our estimates mostly depend upon the last few
rewards observed.
Things are not so clear-cut for the choice of ǫ. We see that the choice of
ǫ = 0 is significantly worse than ǫ = 0.1. So, this appears to suggest that there
is an optimal level of exploration. How should that be determined? Ideally, we
should be able to use the decision-theoretic solution seen earlier, but perhaps
a good heuristic way of choosing ǫ may be good enough.
zt+1 = xt+1 − µt .
The first one, $\alpha_t = 1/t$, satisfies both assumptions. The second one, $\alpha_t = 1/\sqrt{t}$, reduces too slowly, and the third one, $\alpha_t = t^{-3/2}$, approaches zero too fast.
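The two assumptions referred to here are presumably the standard stochastic approximation (Robbins-Monro) conditions on the step sizes,
\[
\sum_{t=1}^{\infty} \alpha_t = \infty,
\qquad
\sum_{t=1}^{\infty} \alpha_t^2 < \infty ;
\]
indeed, $\alpha_t = 1/\sqrt{t}$ violates the second condition, while $\alpha_t = t^{-3/2}$ violates the first.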
Figure 7.2: The estimate µt over time for the step-size schedules $\alpha_t = 1/t$, $1/\sqrt{t}$ and $t^{-3/2}$.
Example 35 (The chain task). The chain task has two actions and five states, as
shown in Fig. 7.3. The reward in the leftmost state is 0.2 and 1.0 in the rightmost
state, and zero otherwise. The first action (dashed, blue) takes you to the right,
while the second action (solid, red) takes you to the first state. However, there is a
probability 0.2 with which the actions have the opposite effects. The value function
of the chain task for a discount factor γ = 0.95 is shown in Table 7.1.
The chain task is a very simple, but well-known task, used to test the efficacy
of reinforcement learning algorithms. In particular, it is useful for analysing how
algorithms solve the exploration-exploitation trade-off, since in the short run
simply moving to the leftmost state is advantageous. For a long enough horizon
or large enough discount factor, algorithms should be incentivised to more fully
explore the state space. A variant of this task, with action-dependent rewards
(but otherwise equivalent) was used by [Dearden et al., 1998].
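A minimal sketch constructing the chain task as arrays, following the description above; the choice that pushing right in the rightmost state leaves the state unchanged is an assumption of the sketch.

import numpy as np

n_states, slip = 5, 0.2
P = np.zeros((2, n_states, n_states))           # P[a, s, s']
for s in range(n_states):
    right = min(s + 1, n_states - 1)            # intended effect of "go right"
    for a, (intended, other) in enumerate([(right, 0), (0, right)]):
        P[a, s, intended] += 1 - slip           # intended effect of the action
        P[a, s, other] += slip                  # actions swapped with probability 0.2
r = np.zeros(n_states)
r[0], r[-1] = 0.2, 1.0                          # rewards of the leftmost/rightmost states

assert np.allclose(P.sum(axis=2), 1.0)          # rows are probability distributions
# Value iteration or policy iteration can now be run on (P, r) with gamma = 0.95.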
1 This actually depends on what the exact setting is. If the environment is a simulation,
then we could try and start from an arbitrary state, but in the reinforcement learning setting
this is not the case.
Figure 7.3: The chain task, with rewards 0.2, 0, 0, 0, 1 in states s1, . . . , s5.

Table 7.1: Value function of the chain task for discount factor γ = 0.95.
s           s1        s2        s3        s4        s5
V ∗ (s)     7.6324    7.8714    8.4490    9.2090    10.2090
Q∗ (s, 1)   7.4962    7.4060    7.5504    7.7404    8.7404
Q∗ (s, 2)   7.6324    7.8714    8.4490    9.2090    10.2090
where $r_t^{(k)}$ is the sequence of rewards obtained from the k-th trajectory.
Remark 7.2.1. With probability 1 − δ, the estimate V̂ of the Monte Carlo eval-
uation algorithm satisfies
\[
\|V^\pi_{\mu,0} - \hat V\|_\infty := \max_s |V^\pi_{\mu,0}(s) - \hat V(s)| \le \sqrt{\frac{\ln(2|S|/\delta)}{2K}} .
\]
Proof. From Hoeffding's inequality (4.5.5) we have for any state s that
\[
\mathbb{P}\left( |V^\pi_{\mu,0}(s) - \hat V(s)| \ge \sqrt{\frac{\ln(2|S|/\delta)}{2K}} \right) \le \frac{\delta}{|S|} .
\]
Consequently, using a union bound of the form $\mathbb{P}(A_1 \cup A_2 \cup \ldots \cup A_n) \le \sum_i \mathbb{P}(A_i)$
gives the required result.
The main advantage of Monte-Carlo policy evaluation is that it can be used
in very general settings. It can be used not only in Markovian environments
such as MDPs, but also in partially observable and multi-agent settings.
For αk = 1/k and iterating over all S, this is the same as Monte-Carlo policy
evaluation.
In order to avoid the bias, we must instead look at only the first visit to
every state. This eliminates the dependence between states and is called the
first visit Monte-Carlo update.
Figure 7.4: Error $\|v_t - V^\pi\|$ as the number of iterations n increases, for first- and every-visit Monte Carlo estimation.
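A minimal sketch of first-visit Monte-Carlo policy evaluation; the episode-sampling function is a hypothetical stand-in for running the policy on the environment or a simulator.

import numpy as np

def first_visit_mc(sample_episode, n_states, gamma, n_episodes):
    """Estimate V^pi by averaging, for each state, the return following
    its first visit in each sampled trajectory."""
    totals = np.zeros(n_states)
    counts = np.zeros(n_states)
    for _ in range(n_episodes):
        episode = sample_episode()            # list of (state, reward) pairs
        G, first_returns = 0.0, {}
        for s, rew in reversed(episode):      # accumulate returns backwards
            G = rew + gamma * G
            first_returns[s] = G              # overwritten until the first visit remains
        for s, G in first_returns.items():
            totals[s] += G
            counts[s] += 1
    return totals / np.maximum(counts, 1)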
Using the temporal difference error $d(s_t, s_{t+1}) = r(s_t) + \gamma v(s_{t+1}) - v(s_t)$, we
can now rewrite the full stochastic update in terms of the temporal-difference
error:
\[
v_{k+1}(s) = v_k(s) + \alpha \sum_{t} \gamma^t d_t, \qquad d_t \triangleq d(s_t, s_{t+1})
\tag{7.2.2}
\]
We have now converted the full stochastic update into an incremental update
that is nevertheless equivalent to the old update. Let us see how we can gener-
alise this to the case where we have a mixture of temporal differences.
TD(λ)
Recall the temporal difference update when the MDP is given in analytic
form:
\[
v_{n+1}(i) = v_n(i) + \tau_n(i), \qquad \tau_n(i) \triangleq \mathbb{E}_{\pi_n,\mu}\left[ \sum_{t=0}^{\infty} (\gamma\lambda)^t d_n(s_t, s_{t+1}) \,\middle|\, s_0 = i \right] .
\]
Figure 7.5: Comparison of replacing and cumulative traces.
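An online version of this update can be implemented with eligibility traces; the sketch below supports both the cumulative and the replacing variant and assumes a hypothetical environment interface with reset() and step().

import numpy as np

def td_lambda(env, policy, n_states, gamma, lam, alpha, n_steps, trace="cumulative"):
    v = np.zeros(n_states)
    e = np.zeros(n_states)                    # eligibility traces
    s = env.reset()
    for _ in range(n_steps):
        s_next, rew = env.step(policy(s))
        d = rew + gamma * v[s_next] - v[s]    # temporal-difference error
        e *= gamma * lam                      # decay all traces
        if trace == "replacing":
            e[s] = 1.0
        else:
            e[s] += 1.0                       # cumulative (accumulating) trace
        v += alpha * d * e                    # credit all recently visited states
        s = s_next
    return v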
In the following figures, we can see the error in value function estimation
in the chain task when using simulation-based value iteration. It is always a
better idea to use an initial value v0 that is an upper bound on the optimal value
function, if such a value is known. This is due to the fact that in that case,
convergence is always guaranteed when using simulation-based value iteration,
as long as the policy that we are using is proper.2
Figure 7.6: Value function estimation error (log scale) against t for simulation-based value iteration on the chain task (two panels), for parameter values 1.0, 0.5, 0.1 and 0.01.
As can be seen in Figure 7.6, the value function estimation error of simulation-
based value iteration is highly dependent upon the initial value function esti-
mate v0 and the exploration parameter ǫ. It is interesting to see that uniform sweeps
(ǫ = 1) result in the lowest estimation error in terms of the value function L1
norm.
Q-learning
Simulation-based value iteration can be suitably modified for the actual rein-
forcement learning problem. Instead of relying on a model of the environment,
we replace arbitrary random sweeps of the state-space with the actual state se-
quence observed in the real environment. We also use this sequence as a simple
way to estimate the transition probabilities.
2 In the case of discounted non-episodic problems, this amounts to a geometric stopping
time distribution, after which the state is drawn from the initial state distribution.
Algorithm 17 Q-learning
1: Input µ, S, ǫt, αt.
2: Initialise s1 ∈ S, q0 ∈ V.
3: for t = 1, 2, . . . do
4:    at ∼ π̂ ∗ǫt (a | st, qt)   (ǫt-greedy with respect to qt)
5:    Observe st+1 ∼ Pµ(st+1 | st, at) and reward r(st).
6:    qt+1(st, at) = (1 − αt) qt(st, at) + αt [r(st) + γ vt(st+1)], where vt(s) = maxa∈A qt(s, a).
7: end for
8: Return the final q-value function and the greedy policy with respect to it.
\[
q_t(s, a) = r(s, a) + \gamma \sum_{s'} P_\mu(s' \mid s, a)\, v_{t-1}(s'), \qquad
\pi_t(s) = \arg\max_{a} q_t(s, a), \qquad
v_t(s) = \max_{a} q_t(s, a) .
\]
The result is Q-learning (Algorithm 17), one of the most well-known and
simplest algorithms in reinforcement learning. In light of the previous theory,
it can be seen as a stochastic value iteration algorithm, where at every step t,
given the partial observation (st , at , st+1 ) you have an approximate transition
model for the MDP which is as follows:
\[
P(s' \mid s_t, a_t) = \begin{cases} 1, & \text{if } s_{t+1} = s' \\ 0, & \text{if } s_{t+1} \ne s' \end{cases}
\tag{7.2.5}
\]
Even though this model is very simplistic, it still seems to work relatively well in
practice, and the algorithm is simple to implement. In addition, since we cannot
arbitrarily select states in the real environment, we replace the state-exploring
parameter ǫ with a time-dependent exploration parameter ǫt for the policy we
employ on the real environment.
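A minimal sketch of tabular Q-learning with visit-count-dependent parameters of the kind used in the experiment below; the environment interface (reset/step) is an assumption of the sketch.

import numpy as np

def q_learning(env, n_states, n_actions, gamma, n_steps,
               eps=lambda n: 1.0 / n, alpha=lambda n: n ** (-2.0 / 3.0)):
    q = np.zeros((n_states, n_actions))
    visits = np.zeros(n_states)
    rng = np.random.default_rng(0)
    s = env.reset()
    for _ in range(n_steps):
        visits[s] += 1
        if rng.random() < eps(visits[s]):          # epsilon_t-greedy exploration
            a = int(rng.integers(n_actions))
        else:
            a = int(np.argmax(q[s]))
        s_next, rew = env.step(a)
        step_size = alpha(visits[s])
        q[s, a] = (1 - step_size) * q[s, a] + step_size * (rew + gamma * np.max(q[s_next]))
        s = s_next
    return q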
Figure 7.7: Q-learning on the Chain task with $v_0 = 1/(1-\gamma)$, $\epsilon_t = 1/n_{s_t}$ and $\alpha_t = \alpha\, n_{s_t}^{-2/3}$ for $\alpha \in \{0.01, 0.05, 0.1, 0.5, 1.0\}$. (a) Value function error. (b) Regret.
Figure 7.7 shows the performance of the basic Q-learning algorithm for the
Chain task, in terms of value function error and regret. In this particular
implementation, we used a polynomially decreasing exploration parameter ǫt
and step size αt . Both of these depend on the number of visits to a particular
state and so perform more efficient Q-learning.
Of course, one could get any algorithm in between pure Q-learning and pure
stochastic value iteration. In fact, variants of the Q-learning algorithm using
eligibility traces (see Section 7.2.4) can be formulated in this way.
It is instructive to examine special cases for these parameters. For the case
when σt = 1, αt = 1, and when µ̂t = µ, we obtain standard value iteration. For
the case when σt(s, a) = I {st = s ∧ at = a} and µ̂t is the simple empirical model
of (7.2.5), we recover Q-learning.
7.3 Discussion
Most of these algorithms are quite simple, and so clearly demonstrate the prin-
ciple of learning by reinforcement. However, they do not aim to solve the rein-
forcement learning problem optimally. They have been mostly of use for finding
near-optimal policies given access to samples from a simulator, as used for ex-
ample to learn to play Atari games Mnih et al. [2015]. However, even in this
case, a crucial issue is how much data is needed in the first place to approach
optimal play. The second issue is using such methods for online reinforcement
learning, i.e. in order to maximise expected utility while still learning.
7.4 Exercises
Exercise 33 (180). This is a continuation of exercise 28. Create a reinforcement
learning version of the diagnostic model from exercise 28. In comparison to that
exercise, here the doctor is allowed to take zero, one, or two diagnostic actions.
View the treatment of each patient as a single episode and design an appropriate
state and action space to apply the standard MDP framework: note that all episodes
run for at least 2 steps, and there is a different set of actions available at each state:
the initial state only has diagnostic actions, while any treatment action terminates the
episode and returns a result.
1. Define the state and action space for each state.
2. Create a simulation of this problem, according to the probabilities mentioned in
Exercise 28.
3. Apply a simulation-based algorithm such as Q-learning to this problem. How
much time does it take to perform well? Can you improve it so as to take into
account the problem structure?
Exercise 34. It is well-known that the value function of a policy π for an MDP
µ with state reward function r can be written as the solution of a linear equation
Vµπ = (I −γPµπ )−1 r, where the term Φπµ , (I −γPµπ )−1 can be seen as a feature matrix.
However, Sarsa and other simulation-based algorithms only approximate the value
function directly rather than Φπµ . This means that, if the reward function changes,
they have to be restarted from scratch. Is there a way to rectify this?4
1. Develop and test a simulation-based algorithm (such as Sarsa) for estimating
Φπµ, and prove its asymptotic convergence. Hint: focus on the fact that you'd
like to estimate a value function for all possible reward functions.
2. Consider a model-based approach, where we build an empirical transition kernel
Pµπ. How good are our value function estimates in the first versus the second
approach? Why would you expect either one to be better?
3. Can the same idea be extended to Q-learning?
4 This exercise stems from a discussion with Peter Auer in 2012 about this problem.
Chapter 8

Approximate representations
8.1 Introduction
In this chapter, we consider methods for approximating value functions, policies,
or transition kernels. This is particularly useful when the state or policy space
is large, so that one has to use some parametrisation that may not include the
true value function, policy, or transition kernel. In general, we shall assume the
existence of either some approximate value function space VΘ or some approxi-
mate policy space ΠΘ , which are the set of allowed value functions and policies,
respectively. For the purposes of this chapter, we will assume that we have
access to some simulator or approximate model of the transition probabilities,
wherever necessary. Model-based reinforcement learning where the transition
probabilities are explicitly estimated will be examined in the next two chapters.
As an introduction, let us start with the case where we have a value function
space V and some value function v ∈ V that is our best approximation of the
optimal value function. Then we can define the greedy policy with respect to v
as follows:
Definition 8.1.1 (v-greedy policy and value function).
\[
\pi^*_u \in \arg\max_{\pi \in \Pi} L_\pi u, \qquad v^*_u = L u,
\]
can be done by minimising the difference between the target value u and the
approximation vθ , that is,
\[
\|v_\theta - u\|_\phi = \int_S |v_\theta(s) - u(s)| \, d\phi(s)
\tag{8.1.1}
\]
S
Clearly, none of the given functions is a perfect fit. In addition, finding the best overall
fit requires minimising an integral. So, for this problem we choose a random set of
points X = {xt } on which to evaluate the fit, with φ(xt ) = 1 for every point xt ∈ X.
This is illustrated in Figure 8.1, which shows the error of the functions at the selected
points, as well as their cumulative error.
In the example above, the approximation space V does not have a member
that is sufficiently close to the target value function. It could be that a larger
function space contains a better approximation. However, it may be difficult to
find the best fit in an arbitrary set V.
where the norm within the integral is usually the L1 norm. For a finite action
space, this corresponds to $\|\pi(\cdot \mid s) - \pi'(\cdot \mid s)\| = \sum_{a \in A} |\pi(a \mid s) - \pi'(a \mid s)|$, but
Figure 8.1: (a) The target function u and the three candidates v1, v2, v3. (b) The errors at the chosen points. (c) The total error of each candidate.
certainly other norms may be used and are sometimes more convenient. The
optimisation problem corresponding to fitting an approximate policy from a set
of policies ΠΘ to a target policy π is shown below.
\[
\min_{\theta \in \Theta} \left\| \pi_\theta - \pi^*_u \right\|_\phi , \qquad \text{where } \pi^*_u = \arg\max_{\pi \in \Pi} L_\pi u .
\]
Once more, the minimisation problem may not be trivial, but there are
some cases where it is particularly easy. One of these is when the policies can
be efficiently enumerated, as in the example below.
Example 38 (Fitting a finite space of policies). For simplicity, consider the space of
deterministic policies with a binary action space A = {0, 1}. Then each policy can be
represented as a simple mapping π : S → {0, 1}, corresponding to a binary partition
of the state space. In this example, the state space is the 2-dimensional unit cube,
S = [0, 1]2 . Figure 8.2 shows an example policy, where the light red and light green
areas represent taking action 1 and 0, respectively. The measure φ has support only
Figure 8.2: An example policy. The red areas indicate taking action 1, and
the green areas action 0. The φ measure has finite support, indicated by the
crosses and circles. The blue and magenta lines indicate two possible policies
that separate the state space with a hyperplane.
on the crosses and circles, which indicate the action taken at that location. Consider a
policy space Π consisting of just four policies. Each set of two policies is indicated by
the magenta (dashed) and blue (dotted) lines in Figure 8.2. Each line corresponds to
two possible policies, one selecting action 1 in the high region, and the other selecting
action 0 instead. In terms of our error metric, the best policy is the one that makes
the fewest mistakes. Consequently, the best policy in this set is to use the blue line and
play action 1 (red) in the top-right region.
8.1.3 Features
Frequently, when dealing with large, or complicated spaces, it pays off to project
(observed) states and actions onto a feature space X . In that way, we can make
problems much more manageable. Generally speaking, a feature mapping is
defined as follows.
Feature mapping
For X ⊂ Rn, a feature mapping f : S × A → X can be written in vector
form as
\[
f(s, a) = \begin{pmatrix} f_1(s, a) \\ \vdots \\ f_n(s, a) \end{pmatrix} .
\]
Obviously, one can define feature mappings f : S → X for states only in a
similar manner.
Example 39 (Radial Basis Functions). Let d be a metric on S × A and define the set
of centroids {(si , ai ) | i = 1, . . . , n}. Then we define each element of f as:
Figure: an example radial basis function over $[0, 1]^2$.
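Since the precise form of the basis functions is left open here, the following sketch uses the common Gaussian choice $f_i(s) = \exp(-\|s - s_i\|^2 / 2\beta^2)$ over state centroids on a grid; both this functional form and the bandwidth β are assumptions of the sketch.

import numpy as np

def rbf_features(centroids, beta):
    """Return a feature map s -> (f_1(s), ..., f_n(s)) of Gaussian RBFs."""
    def f(s):
        d2 = np.sum((centroids - np.asarray(s)) ** 2, axis=1)  # squared distances
        return np.exp(-d2 / (2 * beta ** 2))
    return f

grid = np.linspace(0.0, 1.0, 4)
centroids = np.array([[x, y] for x in grid for y in grid])      # 4x4 grid of centres
f = rbf_features(centroids, beta=0.25)
print(f([0.3, 0.7]))                                            # a 16-dimensional feature vector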
1. If $S \ne R \in \mathcal{G}$, then $S \cap R = \emptyset$.
2. $\bigcup_{S \in \mathcal{G}} S = X$.
Multiple tilings create a cover and can be used without many difficulties with
most discrete reinforcement learning algorithms, cf. Sutton and Barto [1998].
Look-ahead policies
Given an approximate value function u, the transition model Pµ of the MDP
and the expected rewards rµ , we can always find the improving policy given in
Def. 8.1.1 via the following single-step look-ahead.
Single-step look-ahead
Let $q(i, a) \triangleq r_\mu(i, a) + \gamma \sum_{j \in S} P_\mu(j \mid i, a)\, u(j)$. Then the single-step look-ahead policy is defined as $\pi(i) \in \arg\max_{a \in A} q(i, a)$.
T -step look-ahead
Define uk recursively as:
\[
u_0 = u, \qquad
q_k(i, a) = r_\mu(i, a) + \gamma \sum_{j \in S} P_\mu(j \mid i, a)\, u_{k-1}(j), \qquad
u_k(i) = \max_{a \in A} q_k(i, a) .
\]
Rollout policies
As we have seen in Section 7.2.2 one way to obtain an approximate value
function of an arbitrary policy π is to use Monte Carlo estimation, that is, to
simulate several sequences of state-action-reward tuples by running the policy
on the MDP. More specifically, we have the following rollout estimate.
\[
\min_{\theta} \left\| \pi_\theta - \pi^*_q \right\|_\phi .
\]
Choosing the model representation is only the first step. We now have to
use it to represent a specific value function. In order to do this, as before we
first pick a set of representative states Ŝ to fit our value function vθ to v. This
type of estimation can be seen as a regression problem, where the observations
are value function measurements at different states.
where the total prediction error is $c(\theta) = \sum_{s \in \hat S} c_s(\theta)$. The goal is to find
a θ minimising c(θ).
Minimising this error can be done using gradient descent, which is a gen-
eral algorithm for finding local minima of smooth cost functions. Generally,
minimising a real-valued cost function c(θ) with gradient descent involves an
algorithm iteratively approximating the value minimising c:
where ∇θ vθ (s) = f (s). Taking partial derivatives ∂/∂θj leads to the update rule
\[
\theta_j^{(n+1)} = \theta_j^{(n)} - 2\alpha\, \phi(s)\, [v_{\theta^{(n)}}(s) - u(s)]\, f_j(s) .
\tag{8.1.5}
\]
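A minimal sketch of this update for a linear value function $v_\theta(s) = \theta^\top f(s)$; the feature map f, target u, measure φ and the set of representative states are hypothetical inputs.

import numpy as np

def fit_value_function(states, f, u, phi, alpha, n_sweeps, n_features):
    theta = np.zeros(n_features)
    for _ in range(n_sweeps):
        for s in states:
            features = f(s)
            v = theta @ features                                   # v_theta(s)
            theta -= 2 * alpha * phi(s) * (v - u(s)) * features    # update (8.1.5)
    return theta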
whereby the Bellman error acts as a regulariser ensuring that our approximation
is indeed as consistent as possible.
\[
\pi_\theta(a \mid s) = \frac{g_\theta(s, a)}{h_\theta(s)},
\qquad \text{where } g_\theta(s, a) = \ell\!\left( \sum_{i=1}^{n} \theta_i f_i(s, a) \right)
\text{ and } h_\theta(s) = \sum_{b \in A} g_\theta(s, b) .
\]
The link function ℓ ensures that the denominator is positive, and the policy is
a distribution over actions. An alternative method would be to directly constrain
the policy parameters so the result is always a distribution, but this would result
in a constrained optimisation problem. A typical choice for the link function is
ℓ(x) = exp(x), which results in the softmax family of policies.
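A minimal sketch of this parametrisation with ℓ(x) = exp(x) and linear features; the feature map f and parameter vector θ are hypothetical inputs.

import numpy as np

def softmax_policy(theta, f, s, actions):
    """Return the probability vector pi_theta(. | s) over the given actions."""
    scores = np.array([theta @ f(s, a) for a in actions])
    scores -= scores.max()                 # subtract the maximum for numerical stability
    g = np.exp(scores)                     # g_theta(s, a) with l(x) = exp(x)
    return g / g.sum()                     # h_theta(s) is the normalising sum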
In order to fit a policy, we first pick a set of representative states Ŝ and
then find a policy πθ that approximates a target policy π, which is typically the
greedy policy with respect to some value function. In order to do so, we can
define an appropriate cost function and then estimate the optimal parameters
via some arbitrary optimisation method.
Once more, we can use gradient descent to minimise the cost function. We
obtain different results for different norms, but the three cases of main interest
are p = 1, p = 2, and p → ∞. We present the first one here, and leave the others
as an exercise.
At the k-th iteration of the policy improvement step the approximate value
vk−1 of the previous policy πk−1 is used to obtain an improved policy πk . How-
ever, note that we may not be able to implement the policy arg maxπ Lπ vk−1
for two reasons. Firstly, the policy space Π̂ may not include all possible policies.
Secondly, the Bellman operator is in general also only an approximation. In the
policy evaluation step, we aim at finding the function vk that is the closest to
the true value function of policy πk . However, even if the value function space
V̂ is rich enough, the minimisation is done over a norm that integrates over a
finite subset of the state space. The following section discusses the effect of
those errors on the convergence of approximate policy iteration.
Theorem 8.2.1. Consider a finite MDP µ with discount factor γ < 1 and a
vector u ∈ V such that $\|u - V^*_\mu\|_\infty = \epsilon$. If π is the u-greedy policy then
\[
\|V^\pi_\mu - V^*_\mu\|_\infty \le \frac{2\gamma\epsilon}{1-\gamma} .
\]
Proof. Recall that L is the one-step Bellman operator and Lπ is the one-step
policy operator on the value function. Then (skipping the index for µ)
\begin{align*}
\|V^\pi - V^*\|_\infty &= \|L_\pi V^\pi - V^*\|_\infty \\
&\le \|L_\pi V^\pi - L_\pi u\|_\infty + \|L_\pi u - V^*\|_\infty \\
&\le \gamma \|V^\pi - u\|_\infty + \|L u - V^*\|_\infty \\
&\le \gamma \|V^\pi - V^*\|_\infty + \gamma \|V^* - u\|_\infty + \gamma \|u - V^*\|_\infty \\
&\le \gamma \|V^\pi - V^*\|_\infty + 2\gamma\epsilon,
\end{align*}
and rearranging gives the claimed bound.
Building on this result, we can prove a simple bound for approximate policy
iteration, assuming uniform error bounds on the approximation of the value of
a policy as well as on the approximate Bellman operator. Even though these
assumptions are quite strong, we still only obtain the following rather weak
asymptotic convergence result.2
Theorem 8.2.2 (Bertsekas and Tsitsiklis [1996], Proposition 6.2). Assume that
there are ǫ, δ such that, for all k, the iterates vk, πk satisfy
\[
\|v_k - V^{\pi_k}\|_\infty \le \epsilon,
\qquad
\|L_{\pi_{k+1}} v_k - L v_k\|_\infty \le \delta .
\]
Then
\[
\limsup_{k \to \infty} \|V^{\pi_k} - V^*\|_\infty \le \frac{\delta + 2\gamma\epsilon}{(1-\gamma)^2} .
\tag{8.2.1}
\]
2 For δ = 0, this is identical to the result for ǫ-equivalent MDPs by Even-Dar and Mansour
[2003].
If we have collected data, we can use the empirical state distribution to select
starting states. In general, rollouts give us estimates qk , which are used to select
states for further rollouts. That is, we compute for each state s actions
Then we select a state sn maximising the upper bound value Un (s) defined via
where c(s) is the number of rollouts from state s. If the sampling of a state s
stops whenever
\[
\hat\Delta_k(s) \ge \sqrt{ \frac{2}{c(s)(1-\gamma)^2} \ln \frac{|A| - 1}{\delta} },
\tag{8.2.2}
\]
then we are certain that the optimal action has been identified with probability
1−δ for that state, due to Hoeffding’s inequality. Unfortunately, guaranteeing a
policy improvement for the complete state space is impossible, even with strong
assumptions.3
3 First, note that if we need to identify the optimal action for k states, then the above
stopping rule has an overall error probability of kδ. In addition, even if we assume that value
functions are smooth, it will be impossible to identify the boundary in the state space where
the optimal policy should switch actions [Dimitrakakis and Lagoudakis, 2008a].
v = r + γPµ,π v
v = (I − γPµ,π )−1 r.
Here we consider the setting where we do not have access to the transition ma-
trix, but instead have some observations of transition (st , at , st+1 ). In addition,
our state space can be continuous (e.g., S ⊂ Rn ), so that the transition matrix
becomes a general transition kernel. Consequently, the set of value functions V
becomes a Hilbert space, while it previously was a Euclidean subset.
In general, we deal with this case via projections. We project from the
infinite-dimensional Hilbert space to one with finite dimension on a subset of
states: namely, the ones that we have observed. We also replace the transition
kernel with the empirical transition matrix on the observed states.
\[
\Phi\theta = r + \gamma P_{\mu,\pi} \Phi\theta,
\qquad
\theta = \left[ (I - \gamma P_{\mu,\pi}) \Phi \right]^{-1} r .
\]
Generally the value function space generated by the features and the linear
parametrisation does not allow us to obtain exact value functions. For this
reason, instead of considering the inverse A−1 of the matrix A = (I − γPµ,π )Φ
we use the pseudo-inverse defined as
\[
A^{\dagger} \triangleq A^\top \left( A A^\top \right)^{-1} .
\]
If the inverse exists, then it is equal to the pseudo-inverse. However, in our
setting, the matrix can be low rank, in which case we instead obtain the matrix
minimising the squared error, which in turn can be used to obtain a good esti-
mate for the parameters. This immediately leads to the Least Squares Temporal
Difference algorithm [Bradtke and Barto, 1996, LSTD], which estimates an ap-
proximate value function for some policy π given some data D and a feature
mapping f .
In practice, of course, we do not have the transitions Pµ,π but estimate them
from data. Note that for any deterministic policy π and a set of T data points
(st , at , rt , s′t )Tt=1 , we have
\[
P_{\mu,\pi} \Phi = \sum_{s'} P(s' \mid s, a)\, \Phi(s', \pi(s'))
\approx \frac{1}{T} \sum_{t=1}^{T} \hat P(s'_t \mid s_t, a_t)\, \Phi(s'_t, \pi(s'_t)) .
\]
This can be used to maintain q-factors with $q(s, a) = f(s, a)^\top \theta$, giving an
empirical estimate of the Bellman operator.
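A minimal sketch of an LSTD-style estimator of $q(s, a) = f(s, a)^\top \theta$ for a fixed policy π from a batch of transitions; the feature map f and the policy are hypothetical inputs, and the pseudo-inverse plays the role discussed above.

import numpy as np

def lstd_q(data, f, pi, gamma, n_features):
    """data is an iterable of transitions (s, a, r, s_next)."""
    A = np.zeros((n_features, n_features))
    b = np.zeros(n_features)
    for s, a, rew, s_next in data:
        x = f(s, a)
        x_next = f(s_next, pi(s_next))
        A += np.outer(x, x - gamma * x_next)   # empirical analogue of (I - gamma P) Phi
        b += rew * x
    return np.linalg.pinv(A) @ b               # least-squares parameters theta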
This is essentially the same both for finite and infinite-horizon problems. If
we have to pick the value function from a set of functions V, we can use the
following value function approximation.
Let our estimate at time t be vt ∈ V, with V being a set of (possibly
parametrised) functions. Let V̂t be our one-step update given the value function
approximation at the next step, vt+1 . Then vt will be the closest approximation
in that set.
Iterative approximation
\[
\hat V_t(s) = \max_{a \in A} \left\{ r(s, a) + \gamma \sum_{s'} P_\mu(s' \mid s, a)\, v_{t+1}(s') \right\}
\]
In the above example, the value of every state corresponds to the value of
the k-th set in the partition. Of course, this is only a very rough approximation
if the sets Sk are very large. However, this is a convenient approach to use for
gradient descent updates, as only one parameter needs to be updated at every
step.
on other states may not be very good. It is indeed possible that we suffer from
convergence problems as we alternate between estimating the values of different
states in the aggregation.
with
\[
v_t(s) = \sum_{i=1}^{n} f_i(s)\, \theta_t(i) .
\tag{8.3.4}
\]
Figure 8.4: Error $\|u - V\|$ in the representative state approximation for two different
MDP structures as we increase the number of sampled states. The first is
the chain environment from Example 35, extended to 100 states. The second
involves randomly generated MDPs with two actions and 100 states.
Gradient update
For the L2 norm, we have
\[
\|v_\theta - L v_\theta\|^2 = \sum_{s \in \hat S} D_\theta(s)^2,
\qquad
D_\theta(s) = v_\theta(s) - \max_{a \in A}\left[ r(s, a) + \gamma \int_S v_\theta(j)\, dP(j \mid s, a) \right] .
\tag{8.3.6}
\]
Then the gradient update becomes $\theta_{t+1} = \theta_t - \alpha D_{\theta_t}(s_t) \nabla_\theta D_{\theta_t}(s_t)$, where
\[
\nabla_\theta D_{\theta_t}(s_t) = \nabla_\theta v_{\theta_t}(s_t) - \gamma \int_S \nabla_\theta v_{\theta_t}(j)\, dP(j \mid s_t, a^*_t),
\]
with $a^*_t \triangleq \arg\max_{a \in A} \left\{ r(s_t, a) + \gamma \int_S v_{\theta_t}(j)\, dP(j \mid s_t, a) \right\}$.
We can also construct a Q-factor approximation for the case where no model
is available. This can be simply done by replacing P with the empirical transi-
tion observed at time t.
\[
\pi(a \mid s) = \frac{e^{F(s,a)}}{\sum_{a' \in A} e^{F(s,a')}},
\qquad F(s, a) \triangleq \theta^\top f(s, a) .
\tag{8.4.1}
\]
where y is a starting state distribution vector and Pµπ is the transition matrix
resulting from applying policy π to µ. Computing the derivative using matrix
calculus gives
∇θ E U = y ⊤ ∇θ (I − γPµπ )−1 r,
as the only term involving θ is π. The derivative of the matrix inverse can be
written as
We are now ready to prove claim (8.4.3). Define the expected state visitation
from the starting distribution to be $x \triangleq y^\top X$, so that we obtain
For the discounted reward criterion, we can easily obtain unbiased samples
through geometric stopping (see Exercise 29).
Importance sampling
The last formulation is especially useful as it allows us to use importance sam-
pling to compute the gradient even on data obtained for different policies, which
in general is more data efficient. First note that for any history h ∈ (S × A)∗ ,
we have
\[
P^\pi_\mu(h) = \prod_{t=1}^{T} P_\mu(s_t \mid s^{t-1}, a^{t-1})\, P_\pi(a_t \mid s^t, a^{t-1})
\tag{8.4.7}
\]
without any Markovian assumptions on the model or policy. We can now rewrite
(8.4.5) in terms of the expectation with respect to an alternative policy π′ as
\[
\nabla \mathbb{E}^\pi_\mu U
= \mathbb{E}^{\pi'}_\mu\!\left[ U(h)\, \nabla \ln P^\pi_\mu(h)\, \frac{P^\pi_\mu(h)}{P^{\pi'}_\mu(h)} \right]
= \mathbb{E}^{\pi'}_\mu\!\left[ U(h)\, \nabla \ln P^\pi_\mu(h) \prod_{t=1}^{T} \frac{\pi(a_t \mid s^t, a^{t-1})}{\pi'(a_t \mid s^t, a^{t-1})} \right],
\]
since the µ-dependent terms in (8.4.7) cancel out. In practice the expectation
would be approximated through sampling trajectories h. Note that
\[
\nabla \ln P^\pi_\mu(h) = \sum_t \nabla \ln \pi(a_t \mid s^t, a^{t-1}) = \sum_t \frac{\nabla \pi(a_t \mid s^t, a^{t-1})}{\pi(a_t \mid s^t, a^{t-1})} .
\]
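A minimal sketch of this estimate from trajectories collected with a behaviour policy π′; the helpers for the policy probabilities and the score ∇ ln π are hypothetical, and the utility is taken here to be the undiscounted sum of rewards.

import numpy as np

def policy_gradient_is(trajectories, theta, pi_prob, behaviour_prob, grad_log_pi):
    """trajectories is a list of histories [(s_0, a_0, r_0), (s_1, a_1, r_1), ...]."""
    grad = np.zeros_like(theta)
    for h in trajectories:
        utility = sum(rew for _, _, rew in h)
        weight = np.prod([pi_prob(theta, s, a) / behaviour_prob(s, a) for s, a, _ in h])
        score = sum(grad_log_pi(theta, s, a) for s, a, _ in h)   # grad log P^pi(h)
        grad += utility * weight * score
    return grad / len(trajectories)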
8.5 Examples
Let us now consider two well-known problems with a 2-dimensional continu-
ous state space and a discrete set of actions. The first is the inverted pendu-
lum problem in the version of Lagoudakis and Parr [2003b], where a controller
must balance a rod upside-down. The state information is the rotational ve-
locity and position of the pendulum. The second example is the mountain car
problem, where we must drive an underpowered vehicle to the top of a hill
[Sutton and Barto, 1998]. The state information is the velocity and location of
the car. In both problems, there are three actions: “push left”, “push right”
and “do nothing”.
Let us first consider the effect of model and features in representing the
value function of the inverted pendulum problem. Figure 8.5 shows value func-
tion approximations for policy evaluation under a uniformly random policy for
different choices of model and features. Here we need to fit an approximate value
function to samples of the utility obtained from different states. The quality
of the approximation depends on both the model and the features used. The
first choice of features is simply the raw state representation, while the second is
a 16 × 16 uniform RBF tiling. The two models are very simple: the first uses a
linear-Gaussian model assumption on the observation noise (LG), and the
second is a k-nearest neighbour (kNN) model.
As can be seen from the figure the linear model results in a smooth ap-
proximation, but is inadequate for modelling the value function in the original
2-dimensional state space. However, a high-dimensional non-linear projection
using RBF kernels results in a smooth and accurate value function representa-
tion. Non-parametric models such as k-nearest neighbours behave rather well
under either state representation.
For finding the optimal value function we must additionally consider the
question of which algorithm to use. In Figure 8.6 we see the effect of choosing
either approximate value iteration (AVI) or representative state representations
and value iteration (RSVI) for the inverted pendulum and mountain car.
Figure 8.5: Value function approximations for policy evaluation under a uniformly
random policy on the pendulum problem, for different combinations of model (LG,
kNN) and features (raw state, RBF).
Figure 8.6: Estimated optimal value function for the pendulum problem. Re-
sults are shown for approximate value iteration (AVI) with a Bayesian linear-
Gaussian model, and a representative state representation (RSVI) with an RBF
embedding. Both the embedding and the states where the value function is
approximated are a 16 × 16 uniform grid over the state space.
8.7 Exercises
Exercise 35 (Enlarging the function space.). Consider the problem in Example 37.
What would be a simple way to extend the space of value functions from the three
given candidates to an infinite number of value functions? How could we get a good
fit?
Exercise 36 (Enlarging the policy space.). Consider Example 38. This is an ex-
ample of a space of linear deterministic policies. In which two ways can this policy space be
extended, and how?
Exercise 37. Find the derivative for minimising the cost function in (8.1.6) for the
following two cases:
1. p = 2, κ = 2.
2. p → ∞, κ = 1.
Chapter 9

Bayesian reinforcement learning
9.1 Introduction
In this chapter, we return to the setting of subjective probability and utility by
formalising the reinforcement learning problem as a Bayesian decision problem
and solving it directly. In the Bayesian setting, we are acting in an MDP which
is not known, but we have a subjective belief about what the environment
is. We shall first consider the case of acting in unknown MDPs, which is the
focus of the reinforcement learning problem. We will examine a few different
heuristics for maximising expected utility in the Bayesian setting and contrast
them with tractable approximations to the Bayes-optimal solution. Further, we
shall present extensions of these ideas to continuous domains, and finally also
connections to partially observable MDPs will be considered.
Figure 9.1: The unknown Markov decision process. ξ is our prior over the
unknown µ, which is not directly observed. However, we always observe the
result of our action at in terms of reward rt and next state st+1.
The goal is to maximise the expected utility with respect to the prior, $\mathbb{E}^\pi_\xi(U)$. The structure of the unknown MDP process
is shown in Figure 9.1. We have previously seen two simpler sequential decision
problems in the Bayesian setting. The first was the simple optimal stopping pro-
cedure in Section 5.2.2, which introduced the backwards induction algorithm.
The second was the optimal experiment design problem, which resulted in the
bandit Markov decision process of Section 6.2. Now we want to formulate the
reinforcement learning problem as a Bayesian maximisation problem.
Let ξ be a prior over M and Π be a set of policies. Then the expected utility
of the optimal policy is
\[
U^*_\xi \triangleq \max_{\pi \in \Pi} \mathbb{E}(U \mid \pi, \xi) = \max_{\pi \in \Pi} \int_M \mathbb{E}(U \mid \pi, \mu) \, d\xi(\mu) .
\tag{9.2.1}
\]
Solving this optimisation problem and hence finding the optimal policy is how-
ever not easy, as in general the optimal policy π must incorporate the informa-
tion it obtained while interacting with the MDP. Formally, this means that it
must map from histories to actions. For any such history-dependent policy, the
action we take at step t must depend on what we observed in previous steps
1, . . . , t − 1. Consequently, an optimal policy must also specify actions to be
taken in all future time steps and accordingly take into account the learning
that will take place up to each future time step. Thus, in some sense, the
value of information is automatically taken into account in this model. This is
illustrated in the following example.
Example 44. Consider two MDPs µ1 , µ2 with a single state (i.e., S = {1}) and
actions A = {1, 2}. In the MDP µi , whenever you take action at = i you obtain
reward rt = 1, otherwise you obtain reward 0. If we only consider policies that do
not take into account the history so far, the expected utility of such a policy π taking
action i with probability π(i) is
\[
\mathbb{E}^\pi_\xi U = T \sum_i \xi(\mu_i)\, \pi(i)
\]
for horizon T . Consequently, if the prior ξ is not uniform, the optimal policy selects
the action corresponding to the MDP with the highest prior probability. Then, the
maximal expected utility is
\[
T \max_i \xi(\mu_i) .
\]
However, observing the reward after choosing the first action, we can determine the
true MDP. Consequently, an improved policy is the following: First select the best
198 CHAPTER 9. BAYESIAN REINFORCEMENT LEARNING
action with respect to the prior, and then switch to the best action for the MDP we
have identified to be the true one. Then, our utility improves to $\max_i \xi(\mu_i) + (T - 1)$,
where here and in the following we use the notation $s^t$ to abbreviate $(s_1, \ldots, s_t)$
and $s_t^{t+k}$ for $(s_t, \ldots, s_{t+k})$, and accordingly $a^t$, $r^t$, $a_t^{t+k}$, and $r_t^{t+k}$. Important
special cases are the set of blind policies Π0 and the set of memoryless poli-
cies Π1. The set Π̄k ⊂ Πk contains all stationary policies in Πk, that is, policies
π for which
\[
\pi(a \mid s_t^{t+k-1}, a_t^{t+k-2}) = \pi(a \mid s^k, a^{k-1})
\]
for all t. Finally, policies may be indexed by some parameter set Θ, in which
case the set of parameterised policies is given by ΠΘ.
Let us now turn to the problem of learning an optimal policy. Learning
means that observations we make will affect our belief, so that we will first take
a closer look at this belief update. Given that, we shall examine methods for
exact and approximate methods of policy optimisation.
However, as we shall see in the following remark, we can usually1 ignore the
policy itself when calculating the posterior.
Remark 9.2.1. The dependence on the policy can be removed, since the posterior
is the same for all policies that put non-zero mass on the observed data. Indeed,
for $D_t \sim P^\pi_\mu$ it is easy to see that for all $\pi' \ne \pi$ such that $P^{\pi'}_\mu(D_t) > 0$, it holds that
$\xi(B \mid D_t, \pi) = \xi(B \mid D_t, \pi')$.
The proof is left as an exercise for the reader. In the specific case of MDPs,
the posterior calculation is easy to perform incrementally. This also more clearly
1 The exception involves any type of inference where $P^\pi_\mu(D_t)$ is not directly available. This
includes methods of approximate Bayesian computation [Csilléry et al., 2010], that use tra-
jectories from past policies for approximation. See Dimitrakakis and Tziortziotis [2013] for an
example of this in reinforcement learning.
The above calculation is easy to perform for arbitrarily complex MDPs when the
set M is finite. The posterior calculation is also simple under certain conjugate
priors, such as the Dirichlet-multinomial prior for transition distributions.
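A minimal sketch of this incremental update with independent Dirichlet priors over the transition distribution of each state-action pair; the symmetric prior parameter 0.5 is an arbitrary illustrative choice.

import numpy as np

class DirichletTransitionModel:
    def __init__(self, n_states, n_actions, prior=0.5):
        # alpha[s, a, s'] are the Dirichlet parameters of the belief over P(. | s, a)
        self.alpha = np.full((n_states, n_actions, n_states), prior)

    def update(self, s, a, s_next):
        self.alpha[s, a, s_next] += 1.0          # posterior = prior + transition counts

    def expected_mdp(self):
        # mean transition probabilities, e.g. for the expected-MDP heuristic below
        return self.alpha / self.alpha.sum(axis=2, keepdims=True)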
where $\Pi_1 = \{\pi \in \Pi \mid P_\pi(a_t \mid s^t, a^{t-1}) = P_\pi(a_t \mid s_t)\}$ is the set of Markov policies.
The policy $\pi^*(\hat\mu(\xi))$ is executed on the real MDP. Algorithm 23 shows the
pseudocode for this heuristic. One important detail is that we are only gen-
erating the k-th policy at step Tk . This is sometimes useful to ensure policies
remain consistent, as small changes in the mean MDP may create a large change
in the resulting policy. It is natural to have Tk − Tk−1 of the order of 1/(1 − γ) for
discounted problems, or simply the length of the episode for episodic problems.
In the undiscounted case, switching policies whenever sufficient information has
been obtained to significantly change the belief gives good performance guaran-
tees, as we shall see in Chapter 10.
Unfortunately, the policy returned by this heuristic may be far from the
Bayes-optimal policy in Π1 , as shown by the following example.
Figure 9.2: The two MDPs and the expected MDP from Example 45.
However, the Bayes-optimal value function is not equal to the expected value
function of the optimal policy for each MDP. In fact, the Bayes-value of any
policy is a natural lower bound on the Bayes-optimal value function, as the
Bayes-optimal policy is the maximum by definition. We can however use the
expected optimal value function as an upper bound on the Bayes-optimal value:
\[
V^*_\xi \triangleq \sup_\pi \mathbb{E}^\pi_\xi(U) = \sup_\pi \int_M \mathbb{E}^\pi_\mu(U) \, d\xi(\mu)
\le \int_M \sup_\pi V^\pi_\mu \, d\xi(\mu) = \int_M V^*_\mu \, d\xi(\mu) \triangleq V^+_\xi .
\]
Given the previous development, it is easy to see that the following inequal-
ities always hold, giving us upper and lower bounds on the value function:
Vξπ ≤ Vξ∗ ≤ Vξ+ , ∀π. (9.2.3)
These bounds are geometrically demonstrated in Fig. 9.4. They are entirely
analogous to the Bayes bounds of Sec. 3.3.1, with the only difference being that
we are now considering complete policies rather than simple decisions.
Figure 9.4: Geometric illustration of the bounds $V^\pi_\xi \le V^*_\xi \le V^+_\xi$ as functions of the belief ξ.
Figure 9.5: Illustration of the improved bounds. The naive and the tighter lower bound
are obtained by calculating the value of the policy that is optimal for the expected
MDP and the value of the MMBI policy, respectively. The upper bound is Vξ+. The
horizontal axis refers to our belief: at the left edge, our belief is uniform over all
MDPs, while at the right edge, we are certain about the true MDP.
tighter lower bounds can be obtained by finding better policies, something that
was explored by Dimitrakakis [2011].
In particular, we can consider the problem of finding the best memoryless
policy. This involves two approximations. Firstly, approximating our belief over
MDPs with a sample over a finite set of n MDPs. Secondly, assuming that the
belief is nearly constant over time, and performing backwards induction on those n
MDPs simultaneously. While this greedy procedure might not find the optimal
memoryless policy, it still improves the lower bounds considerably.
The central step, backwards induction over multiple MDPs, is summarised by
the following equation, which simply involves calculating the expected utility of
a particular policy over all MDPs:
\[
Q^\pi_{\xi,t}(s, a) \triangleq \int_M \left[ \bar r_\mu(s, a) + \gamma \int_S V^\pi_{\mu,t+1}(s') \, dP_\mu(s' \mid s, a) \right] d\xi(\mu)
\tag{9.2.4}
\]
2012, Osband et al., 2013], it is not optimal. In fact, as we can see in Fig-
ure 9.6, Algorithm 28 performs better when the number of samples is increased.
(Figure: (a) The complete MDP model, with states st, beliefs ξt, hyper-states ψt and actions at. (b) Compact form of the model.)
Figure 9.6: Comparison of the regret between the expected MDP heuristic and
sampling with Multi-MDP backwards induction for the Chain environment. The
error bars show the standard error of the average regret.
tuple (S, A, Pµ , ρ), with state space S, action space A, transition kernel Pµ and
reward function ρ : S × A → R. Let st , at , rt be the state, action, and reward
observed in the original MDP and ξt be our belief over MDPs µ ∈ M at step t.
Note that the marginal next-state distribution is
Z
P (st+1 ∈ S | ξt , st , at ) , Pµ (st+1 ∈ S | st , at ) dξt (µ), (9.2.5)
M
while the next belief deterministically depends on the next state, i.e.,
Example 47. Consider a set of MDPs M with A = {1, 2}, S = {1, 2}. In general, for
any hyper-state ψt = (st, ξt), each possible action-state transition results in one specific
new hyper-state. This is illustrated for the specific example in the following diagram.
(Diagram: from the hyper-state ψt, each action at and next state st+1 leads to one of
four possible hyper-states $\psi^1_{t+1}, \ldots, \psi^4_{t+1}$.)
When the branching factor is very large, or when we need to deal with very large tree
depths, it becomes necessary to approximate the MDP structure.
where π(ξ′t) can be any approximately optimal policy for ξ′t. Using backwards
induction, we can calculate tighter upper bounds q+ and lower bounds q− for all non-
leaf hyper-states by
\[
q^+(\psi_t, a_t) = \sum_{\psi_{t+1}} P(\psi_{t+1} \mid \psi_t, a_t) \left[ \rho(\psi_t, a_t) + \gamma v^+(\psi_{t+1}) \right],
\qquad
q^-(\psi_t, a_t) = \sum_{\psi_{t+1}} P(\psi_{t+1} \mid \psi_t, a_t) \left[ \rho(\psi_t, a_t) + \gamma v^-(\psi_{t+1}) \right] .
\]
We can then use the upper bounds to expand the tree (i.e., to select actions in
the tree that maximise v + ) while the lower bounds can be used to select the
final policy. Sub-optimal branches can be discarded once their upper bounds
become lower than the lower bound of some other branch.
Remark 9.2.2. If q − (ψ, a) ≥ q + (ψ, a′ ) then a′ is sub-optimal at ψ.
Pµ (S | s, a) , Pµ (st+1 ∈ S | st = s, at = a), S ⊂ S.
There are a number of transition models one can consider for the continuous
case. For the purposes of this textbook, we shall limit ourselves to the relatively
simple case of linear-Gaussian models.
\[
\psi(V_i \mid W, n) \propto \left| V^{-1} W \right|^{n/2} e^{-\frac{1}{2}\operatorname{trace}(V^{-1} W)} .
\]
Essentially, the considered setting is an extension of the univariate Bayesian
linear regression model (see for example DeGroot [1970]) to the multivariate case
via vectorisation of the mean matrix. Since the prior is conjugate, it is relatively
simple to calculate posterior values of the parameters after each observation.
While we omit the details, a full description of inference using this model is
given by Minka [2001b].
where N (s, s′ ) , ∆U (s) − γ∆U (s′ ) with ∆U (s) , U (s) − v(s) denoting the
distribution of the residual, i.e., the utility when starting from s minus its
expectation. The correlation between U (s) and U (s′ ) is captured via N , and
the residuals are modelled as a Gaussian process. While the model is still an
approximation, it is equivalent to performing GP regression using Monte-Carlo
samples of the discounted return.
\[
\nabla_\pi \mathbb{E}^\pi_\xi U = \int_M \nabla_\pi U(\mu, \pi) \, d\xi(\mu) .
\]
In most real world applications the state st of the system at time t cannot be
observed directly. Instead, we obtain some observation xt , which depends on
the state of the system. While this does give us some information about the
system state, it is in general not sufficient to pinpoint it exactly. This idea can
be formalised as a partially observable Markov decision process (POMDP).
Here P(st+1 | st, at) is the transition distribution, giving the probabilities of
next states given the current state and action. P(xt | st) is the observation
distribution, giving the probabilities of different observations given the current
state. Finally, P(rt | st) is the reward distribution, which we make dependent
only on the current state for simplicity. Different dependencies are possible, but
they are all equivalent to the one given here.
(Figure: graphical model of the partially observable MDP, with hidden states st, observations xt, rewards rt, actions at and belief ξ.)
Belief ξ
For any distribution ξ on S, we define
\[
\xi(s_{t+1} \mid a_t, \mu) \triangleq \int_S P_\mu(s_{t+1} \mid s_t, a_t) \, d\xi(s_t) .
\tag{9.4.1}
\]
Belief update
A particularly attractive setting is when the model is finite. Then the suffi-
cient statistic also has finite dimension and all updates are in closed form.
Remark 9.4.1. If S, A, X are finite, then we can define a sequence of vectors
pt ∈ ∆|S| and matrices At as
pt (j) = P (xt | st = j),
At (i, j) = P (st+1 = j | st = i, at ).
Then, writing bt(i) for ξt(st = i), we can use Bayes' theorem to obtain
\[
b_{t+1} = \frac{\operatorname{diag}(p_{t+1})\, A_t^\top b_t}{p_{t+1}^\top A_t^\top b_t} .
\]
4 However, we have not seen a formal proof of this at the time of writing.
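A minimal sketch of this update for a finite POMDP; here A[a] is the transition matrix for action a with A[a][i, j] = P(st+1 = j | st = i, at = a), and O[j, x] = P(xt = x | st = j) is a hypothetical observation matrix playing the role of pt.

import numpy as np

def belief_update(b, a, x, A, O):
    """Return b_{t+1} given belief b_t, action a and observation x."""
    predicted = A[a].T @ b               # predictive distribution over the next state
    unnormalised = O[:, x] * predicted   # weight by the observation likelihood
    return unnormalised / unnormalised.sum()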
9.6 Exercises
Exercise 38. Consider the algorithms we have seen in Chapter 8. Are any of those
applicable to belief-augmented MDPs? Outline a strategy for applying one of those
algorithms to the problem. What would be the biggest obstacle we would have to
overcome in your specific example?
Exercise 41. Consider the Gaussian process model of eq. (9.3.2). What is the implicit
assumption made about the transition model? If this assumption is satisfied, what
does the corresponding posterior distribution represent?
Chapter 10

Distribution-free reinforcement learning
10.1 Introduction
The Bayesian framework requires specifying a prior distribution. For many
reasons, we may frequently be unable to do that. In addition, as we have seen,
the Bayes-optimal solution is often intractable. In this chapter we shall take a
look at algorithms that do not require specifying a prior distribution. Instead,
they employ the heuristic of “optimism under uncertainty” to select policies.
This idea is very similar to heuristic search algorithms, such as A∗ [Hart et al.,
1968]. All these algorithms assume the best possible model that is consistent
with the observations so far and choose the optimal policy in this “optimistic”
model. Intuitively, this means that for each possible policy we maintain an
upper bound on the value/utility we can reasonably expect from it. In general
we want this upper bound to
1. be as tight as possible (i.e., to be close to the true value),
2. still hold with high probability.
We begin with an introduction to these ideas in bandit problems, when the
objective is to maximise total reward. We then expand this discussion to struc-
tured bandit problems, which have many applications in optimisation. Finally,
we look at the case of maximising total reward in unknown MDPs.
The regret compares the collected rewards to those of the best fixed policy.
Comparing instead to the best rewards obtained by the arms at each time would
be too hard due to their randomness.
Empirical average
\[
\hat r_{t,i} \triangleq \frac{1}{N_{t,i}} \sum_{k=1}^{t} r_{k,i}\, \mathbb{I}\{a_k = i\},
\qquad \text{where } N_{t,i} \triangleq \sum_{k=1}^{t} \mathbb{I}\{a_k = i\}
\]
and rk,i denotes the (random) reward the learner receives upon choosing
arm i at step k.
Simply always choosing the arm with the best empirical average reward so far
is not the best idea, because you might get stuck with a sub-optimal arm: If the
optimal arm underperforms at the beginning, so that its empirical average is far
below the true mean of a suboptimal arm, it will never be chosen again. A better
strategy is to choose arms optimistically. Intuitively, as long as an arm has a
significant chance of being the best, you play it every now and then. One simple
way to implement this is shown in the following UCB1 algorithm [Auer et al.,
2002a].
Algorithm 29 UCB1
Input A
Choose each arm once to obtain an initial estimate.
for t = 1, . . . do
    Choose arm $a_t = \arg\max_{i \in A} \left\{ \hat r_{t-1,i} + \sqrt{\frac{2 \ln t}{N_{t-1,i}}} \right\}$.
end for
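A minimal sketch of UCB1 on Bernoulli arms; the arm means are illustrative and rewards are assumed to lie in [0, 1], as required by the Hoeffding bound used below.

import numpy as np

rng = np.random.default_rng(0)
means = np.array([0.3, 0.5, 0.7])               # assumed true arm means
K = len(means)
counts = np.ones(K)                              # each arm pulled once initially
sums = rng.binomial(1, means).astype(float)      # rewards of the initial pulls

for t in range(K + 1, 10_000):
    ucb = sums / counts + np.sqrt(2 * np.log(t) / counts)   # optimistic index
    a = int(np.argmax(ucb))
    rew = rng.binomial(1, means[a])
    sums[a] += rew
    counts[a] += 1
print("empirical means:", sums / counts)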
Thus, the algorithm adds a bonus value of order $O(\sqrt{\ln t / N_{t,i}})$ to the em-
pirical value of each arm, thus forming an upper confidence bound. This upper
confidence bound value is such that the true mean reward of each arm will lie
below it with high probability by the Hoeffding bound (4.5.5).
Theorem 10.2.1 (Auer et al. [2002a]). The expected regret of UCB1 after T
time steps is at most
\[
\mathbb{E}\, L_T(\text{UCB1}) \le \sum_{i : r(i) < r^*} \frac{8 \ln T}{r^* - r(i)} + 5 \sum_i \left( r^* - r(i) \right) .
\]
Accordingly we may assume that (taking care of the contribution of the error
probabilities to E Nt,i below)
Combining this with (10.2.1) and noting that the sum converges to a value < 4,
proves the regret bound.
The UCB1 algorithm is actually not the first algorithm employing optimism
in the face of uncertainty to deal with the exploration-exploitation dilemma,
nor the first that uses confidence intervals for that purpose. This idea goes back
to the seminal work of Lai and Robbins [1985] that used the same approach,
however in a more complicated form. In particular, the whole history is used
for computing the arm to choose. The derived bounds of Lai and Robbins [1985]
show that after T steps each suboptimal arm is played at most $\left(\frac{1}{D_{\mathrm{KL}}} + o(1)\right) \log T$
times in expectation, where DKL measures the distance between the reward dis-
tributions of the optimal and the suboptimal arm by the Kullback-Leibler di-
vergence, and o(1) → 0 as T → ∞. This bound was also shown to be asymptot-
ically optimal [Lai and Robbins, 1985]. A lower bound logarithmic in T for any
finite T that is close to matching the bound of Theorem 10.2.1 can be found in
[Mannor and Tsitsiklis, 2004]. Improvements that get closer to the lower bound
(and are still based on the UCB1 idea) can be found in [Auer and Ortner, 2010],
while the gap has been finally closed by Lattimore [2015].
The stochastic setting just considered is only one among several variants of the
multi-armed bandit problem. While it is impossible to cover them all, we give
a brief description of the most common scenarios and refer to Bubeck and Cesa-Bianchi
[2012] for a more complete overview.
What is common to most variants of the classic stochastic setting is that the
assumption of receiving i.i.d. rewards when sampling a fixed arm is loosened.
The most extreme case is the so-called nonstochastic, sometimes also termed
adversarial, bandit setting, where the reward sequence for each arm is assumed
to be fixed in advance (and thus not random at all). In this case, the reward is
maximised when choosing in each time step the arm that maximises the reward
at this step. Obviously, since the reward sequences can be completely arbitrary,
no learner can stand a chance to perform well with respect to this optimal
policy. Thus, one confines oneself to consider the regret with respect to the best
fixed arm in hindsight, that is, $\arg\max_i \sum_{t=1}^{T} r_{t,i}$, where rt,i is the reward of
arm i at step t. Even this may be too much to ask for, but it
turns out that one can achieve regret bounds of order $O(\sqrt{KT})$ in this setting.
Clearly, algorithms that choose arms deterministically can always be tricked by an adversarial reward sequence. However, algorithms that at each time step choose an arm from a suitable distribution over the arms (which is updated according to the collected rewards) can be shown to achieve the mentioned optimal regret bound. A prominent representative of these algorithms is the Exp3 algorithm of Auer et al. [2002b], which uses an exponential weighting scheme.
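As an illustration, the following Python sketch implements one common variant of Exp3 with a fixed exploration parameter gamma and importance-weighted reward estimates; the exact parameterisation of Auer et al. [2002b] may differ in details, and the reward(t, i) interface is a placeholder.

import math
import random

def exp3(reward, num_arms, horizon, gamma=0.1):
    # Exp3-style sketch: sample an arm from a mixture of exponential weights
    # and uniform exploration, then update the chosen arm's weight with an
    # importance-weighted estimate of its reward (assumed to lie in [0, 1]).
    weights = [1.0] * num_arms
    total_reward = 0.0
    for t in range(horizon):
        total_w = sum(weights)
        probs = [(1.0 - gamma) * w / total_w + gamma / num_arms for w in weights]
        arm = random.choices(range(num_arms), weights=probs)[0]
        r = reward(t, arm)
        weights[arm] *= math.exp(gamma * (r / probs[arm]) / num_arms)
        total_reward += r
    return total_reward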
In the contextual bandit setting the learner receives some additional side information called the context. The reward for choosing an arm is assumed to depend on the context as well as on the chosen arm, and can be either stochastic or adversarial. The learner usually competes against the best policy that maps contexts to arms. There is a notable amount of literature dealing with various settings of this kind, which are also interesting for applications like web advertisement, where user data takes the role of the provided side information. For an overview see e.g. Chapter 4 of [Bubeck and Cesa-Bianchi, 2012] or Part V of [Lattimore and Szepesvári, 2020].
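To make the interaction protocol concrete, here is a small Python sketch with finitely many contexts, in which the learner keeps separate empirical estimates per (context, arm) pair and plays epsilon-greedy. This is purely illustrative and not one of the algorithms analysed in the references above; contexts(t) and pull(x, a) are placeholder interfaces.

import random
from collections import defaultdict

def contextual_bandit_loop(contexts, pull, num_arms, horizon, eps=0.1):
    # At each step: observe a context, choose an arm, receive a reward that
    # depends on both. Here: epsilon-greedy over per-(context, arm) means.
    counts = defaultdict(int)
    means = defaultdict(float)
    for t in range(horizon):
        x = contexts(t)                       # side information at step t
        if random.random() < eps:
            a = random.randrange(num_arms)
        else:
            a = max(range(num_arms), key=lambda i: means[(x, i)])
        r = pull(x, a)
        counts[(x, a)] += 1
        means[(x, a)] += (r - means[(x, a)]) / counts[(x, a)]
    return means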
In other settings the i.i.d. assumption about the rewards of a fixed arm is replaced by more general assumptions, such as that underlying each arm there is a Markov chain and rewards depend on the state of this Markov chain when sampling the arm. This is called the restless bandits problem, which is already quite close to the general reinforcement learning setting with an underlying Markov decision process (see Section 10.3.1 below). Regret bounds of order $\tilde{O}(\sqrt{T})$ can be shown in this setting, even if at each time step the learner can only observe the state of the arm it chooses, see [Ortner et al., 2014].
Given that our rewards are assumed to be bounded in [0, 1], intuitively, when
we make one wrong step in some state s, in the long run we won’t lose more
than D. After all, in D steps we can go back to s and continue optimally.
Under the assumption that the MDP is communicating, the gain g ∗ can be
shown to be independent of the initial state, that is, g ∗ (s) = g ∗ for all states s.
Accordingly, we define the T -step regret of a learning algorithm as
$$L_T \;\triangleq\; \sum_{t=1}^{T} \big( g^* - r_t \big),$$
where rt is the reward collected by the algorithm at step t. Note that in general
(and depending on the initial state) the value T g ∗ we compare to will differ
from the optimal T -step reward. However, this difference can be shown to be
upper bounded by the diameter and is therefore negligible when considering the
regret.
However, simply taking each policy to be an arm of a bandit problem does not work well. First, to approach the true gain of a chosen policy it is not sufficient to choose it just once; it would be necessary to follow each policy for a sufficiently large number of consecutive steps. Without knowledge of some characteristics of the underlying MDP, like mixing times, it may however be difficult to determine for how long a policy should be played. Further, due to the large number of stationary policies, which is $|A|^{|S|}$, the regret bounds resulting from such an approach would be exponential in the number of states.
Thus, we rather maintain confidence regions for the rewards and transition
probabilities of each state-action pair s, a. Then, at each step t, these confidence
regions implicitly also define a confidence region for the true underlying MDP
µ∗ , that is, a set Mt of plausible MDPs. For suitably chosen confidence intervals
for the rewards and transition probabilities one can obtain that
P(µ∗ ∈
/ Mt ) < δ. (10.3.1)
Given this confidence region Mt , one can define the optimistic value for any
policy π to be
π
g+ (Mt ) , max gµπ µ ∈ Mt . (10.3.2)
Note that, similar to the bandit setting, this estimate is optimistic for each policy, as due to (10.3.1) it holds that $g_+^{\pi}(M_t) \ge g_{\mu^*}^{\pi}$ with high probability. Analogously to UCB1 we would like to make an optimistic choice among the possible policies, that is, we choose a policy $\pi$ that maximises $g_+^{\pi}(M_t)$.
However, unlike in the bandit setting, where we immediately receive a sample of the reward of the chosen arm, in the MDP setting we only obtain information about the reward in the current state. Thus, we should not play the chosen optimistic policy for just one step but for a sufficiently large number of steps. An easy way to achieve this is to play policies in episodes of increasing length, so that sooner or later each action is played for a sufficient number of steps in each state. In summary, we obtain (the outline of) an algorithm as sketched below.
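A schematic Python rendering of this outline is given below. The environment interface (reset() and step(a) returning the next state and reward) is an assumption made for illustration, the helpers confidence_region and optimistic_policy are passed in by the caller and stand for the constructions described next, and the episode-stopping rule (end an episode once the count of the currently played state-action pair has doubled) is the one used by Jaksch et al. [2010].

def ucrl2_outline(env, num_states, num_actions, horizon, delta,
                  confidence_region, optimistic_policy):
    # Schematic episode structure: maintain visit counts, build a set of
    # plausible MDPs from the data, compute an optimistic policy for it and
    # play that policy until some state-action count has doubled.
    N = [[0] * num_actions for _ in range(num_states)]   # total visit counts
    t, s = 1, env.reset()
    total_reward = 0.0
    while t <= horizon:
        M = confidence_region(N, t, delta)          # plausible MDPs M_t
        policy = optimistic_policy(M)               # extended value iteration
        v = [[0] * num_actions for _ in range(num_states)]  # counts in episode
        while t <= horizon and v[s][policy[s]] < max(1, N[s][policy[s]]):
            a = policy[s]
            s_next, r = env.step(a)
            v[s][a] += 1
            total_reward += r
            t, s = t + 1, s_next
        for si in range(num_states):
            for ai in range(num_actions):
                N[si][ai] += v[si][ai]
    return total_reward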
The confidence region. Concerning the confidence regions, for the rewards it is sufficient to use confidence intervals similar to those for UCB1. For the transition probabilities we consider as plausible all those transition probability distributions
that are close in $\|\cdot\|_1$-norm to the empirical distribution $\hat{P}_t(\cdot \mid s, a)$. That is, the confidence region $M_t$ at step $t$ used to compute the optimistic policy can be defined as the set of MDPs with mean rewards $r(s,a)$ and transition probabilities $P(\cdot \mid s,a)$ such that
$$\big| r(s,a) - \hat{r}(s,a) \big| \;\le\; \sqrt{\frac{7 \log(2SAt/\delta)}{2 N_t(s,a)}},$$
$$\big\| P(\cdot \mid s,a) - \hat{P}_t(\cdot \mid s,a) \big\|_1 \;\le\; \sqrt{\frac{14 S \log(2At/\delta)}{N_t(s,a)}},$$
where r̂(s, a) and P̂t (· | s, a) are the estimates for the rewards and the transition
probabilities, and Nt (s, a) denotes the number of samples of action a in state s
at time step t.
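In code, the two interval widths above can be computed directly; the following Python helper is a straightforward transcription (N_sa, S, A, t and delta denote the quantities defined in the text):

import math

def confidence_widths(N_sa, t, S, A, delta):
    # Widths of the reward and transition confidence intervals for one
    # state-action pair that has been sampled N_sa times up to step t.
    n = max(1, N_sa)
    conf_r = math.sqrt(7.0 * math.log(2 * S * A * t / delta) / (2 * n))
    conf_p = math.sqrt(14.0 * S * math.log(2 * A * t / delta) / n)
    return conf_r, conf_p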
One can show via a bound due to Weissman et al. [2003] that, given $n$ samples of the transition probability distribution $P(\cdot \mid s,a)$, one has
$$P\Big( \big\| P(\cdot \mid s,a) - \hat{P}_t(\cdot \mid s,a) \big\|_1 \ge \varepsilon \Big) \;\le\; 2^S \exp\Big( -\tfrac{n \varepsilon^2}{2} \Big).$$
Using this together with standard Hoeffding bounds for the reward estimates,
it can be shown that the confidence region contains the true underlying MDP
with high probability.
Lemma 10.3.1.
$$P\big(\mu^* \in M_t\big) \;>\; 1 - \frac{\delta}{15\, t^6}.$$
Given the confidence region $M_t$, the optimistic policy $\tilde{\pi}$ and its gain can be computed by the following extended value iteration scheme:
1. Set the optimistic rewards $\tilde{r}(s,a)$ to the upper confidence values for all states $s$ and all actions $a$.
2. Initialise $u_0(s) := 0$ for all states $s$.
3. For $i = 0, 1, 2, \ldots$ set
$$u_{i+1}(s) := \max_a \Big\{ \tilde{r}(s,a) + \max_{P \in \mathcal{P}(s,a)} \sum_{s'} P(s')\, u_i(s') \Big\}, \tag{10.3.3}$$
where $\mathcal{P}(s,a)$ is the set of all plausible transition probabilities for choosing action $a$ in state $s$.
Similarly to the value iteration algorithm in Section 6.5.4, this scheme can be shown to converge. More precisely, one can show that $\max_s \{u_{i+1}(s) - u_i(s)\} - \min_s \{u_{i+1}(s) - u_i(s)\} \to 0$ and also
$$u_{i+1}(s) \;\to\; u_i(s) + g_+^{\tilde{\pi}} \quad \text{for all } s. \tag{10.3.4}$$
After convergence the maximising actions constitute the optimistic policy $\tilde{\pi}$, and the maximising transition probabilities are the respective optimistic transition values $\tilde{P}$.
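The inner maximisation over plausible transition distributions in (10.3.3) can be carried out explicitly. The Python sketch below follows the construction of Jaksch et al. [2010]: shift as much probability mass as the L1 constraint allows onto the state with the largest value u(s), removing the surplus from the lowest-value states.

def inner_max(p_hat, u, conf_p):
    # max over P with ||P - p_hat||_1 <= conf_p of sum_s P(s) u(s).
    order = sorted(range(len(u)), key=lambda s: u[s], reverse=True)
    best = order[0]
    p = list(p_hat)
    p[best] = min(1.0, p_hat[best] + conf_p / 2.0)
    surplus = sum(p) - 1.0
    for s in reversed(order):                 # lowest-value states first
        if surplus <= 0:
            break
        if s == best:
            continue
        reduction = min(p[s], surplus)
        p[s] -= reduction
        surplus -= reduction
    return sum(p[s] * u[s] for s in range(len(u)))

# In (10.3.3) one then sets, for each state s,
#   u_next[s] = max over a of r_tilde[s][a] + inner_max(P_hat[s][a], u, conf_p[s][a]).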
One can also show that the so-called span $\max_s u_i(s) - \min_s u_i(s)$ of the converged value vector $u_i$ is upper bounded by the diameter $D$. This follows from the optimality of the vector $u_i$: intuitively, if the span were larger than $D$, one could increase the collected reward in the lowest-value state $s_-$ by going (as fast as possible) to the highest-value state $s_+$. Note that this argument uses the fact that the true MDP is plausible with high probability, so that we may take the true transitions to get from $s_-$ to $s_+$.
Analysis of UCRL2
In this section we derive a regret bound for UCRL2, which is of order $\tilde{O}\big(DS\sqrt{AT}\big)$.
Proof. The main idea of the proof is that by Lemma 10.3.1 we have
$$\tilde{g}_k^* \;\triangleq\; g_+^{\tilde{\pi}_k}(M_{t_k}) \;\ge\; g^* \;\ge\; g^{\tilde{\pi}_k}, \tag{10.3.5}$$
so that the regret in each step is upper bounded by the width of the confidence interval for $g^{\tilde{\pi}_k}$, that is, by $\tilde{g}_k^* - g^{\tilde{\pi}_k}$. In what follows we need to break down this confidence interval into the confidence intervals we have for rewards and transition probabilities.
In the following, we assume that the true MDP $\mu^*$ is always contained in the confidence regions $M_t$ considered by the algorithm. Using Lemma 10.3.1 it is not difficult to show that, with probability at least $1 - \frac{\delta}{12 T^{5/4}}$, the regret accumulated due to $\mu^* \notin M_t$ at some step $t$ is bounded by $\sqrt{T}$.
Further, note that the random fluctuation of the rewards can easily be bounded by Hoeffding's inequality (4.5.5): if $s_t$ and $a_t$ denote the state and action at step $t$, we have
$$\sum_{t=1}^{T} r_t \;\ge\; \sum_{t=1}^{T} r(s_t, a_t) \;-\; \sqrt{\tfrac{5}{8}\, T \log\tfrac{8T}{\delta}}$$
with high probability.
Now fix an episode $k$, let $v_k(s,a)$ denote the number of times action $a$ is chosen in state $s$ during episode $k$, and consider the regret the algorithm accumulates in this episode. Let $\mathrm{conf}^r_k(s,a)$ and $\mathrm{conf}^p_k(s,a)$ be the widths of the confidence intervals for rewards and transition probabilities in episode $k$. First, we simply have
$$\sum_{s,a} v_k(s,a)\big(\tilde{g}_k^* - r(s,a)\big) \;\le\; \sum_{s,a} v_k(s,a)\big(\tilde{g}_k^* - \tilde{r}_k(s,a)\big) \;+\; \sum_{s,a} v_k(s,a)\big(\tilde{r}_k(s,a) - r(s,a)\big), \tag{10.3.7}$$
where for the second term we can use that, since the true MDP lies in the confidence region,
$$\sum_{s,a} v_k(s,a)\big(\tilde{r}_k(s,a) - r(s,a)\big) \;\le\; \sum_{s,a} v_k(s,a)\Big( \big|\tilde{r}_k(s,a) - \hat{r}_k(s,a)\big| + \big|\hat{r}_k(s,a) - r(s,a)\big| \Big) \;\le\; 2 \sum_{s,a} v_k(s,a)\,\mathrm{conf}^r_k(s,a). \tag{10.3.8}$$
For the first term in (10.3.7) we use that after convergence of the value vector $u_i$ we have, by (10.3.3) and (10.3.4),
$$\tilde{g}_k^* - \tilde{r}_k\big(s, \tilde{\pi}_k(s)\big) \;=\; \sum_{s'} \tilde{P}_k\big(s' \mid s, \tilde{\pi}_k(s)\big)\, u_i(s') \;-\; u_i(s).$$
Then, noting that $v_k(s,a) = 0$ for $a \neq \tilde{\pi}_k(s)$ and using vector/matrix notation, it follows that
$$\begin{aligned}
\sum_{s,a} v_k(s,a)\big(\tilde{g}_k^* - \tilde{r}_k(s, \tilde{\pi}_k(s))\big)
&= \sum_{s,a} v_k(s,a)\Big( \sum_{s'} \tilde{P}_k\big(s' \mid s, \tilde{\pi}_k(s)\big)\, u_i(s') - u_i(s) \Big) \\
&= v_k\big(\tilde{P}_k - I\big)\, u \\
&= v_k\big(\tilde{P}_k - P_k + P_k - I\big)\, w_k \\
&= v_k\big(\tilde{P}_k - P_k\big)\, w_k + v_k\big(P_k - I\big)\, w_k,
\end{aligned} \tag{10.3.9}$$
where $P_k$ is the true transition matrix (in $\mu^*$) of the optimistic policy $\tilde{\pi}_k$ in episode $k$, and $w_k$ is a renormalisation of the vector $u$ (with entries $u_i(s)$) defined by $w_k(s) := u_i(s) - \frac{1}{2}\big(\min_s u_i(s) + \max_s u_i(s)\big)$, so that $\|w_k\|_\infty \le \frac{D}{2}$ by Lemma 10.3.4.
Since $\|\tilde{P}_k - P_k\|_1 \le \|\tilde{P}_k - \hat{P}_k\|_1 + \|\hat{P}_k - P_k\|_1$, the first term of (10.3.9) is bounded as
$$v_k\big(\tilde{P}_k - P_k\big)\, w_k \;\le\; \big\| v_k\big(\tilde{P}_k - P_k\big) \big\|_1 \cdot \big\| w_k \big\|_\infty \;\le\; 2 \sum_{s,a} v_k(s,a)\, \mathrm{conf}^p_k(s,a)\, D. \tag{10.3.10}$$
The sum of the second term of (10.3.9) over all episodes can be bounded by the Azuma-Hoeffding inequality (5.3.4) and Lemma 10.3.2, that is,
$$\sum_k v_k\big(P_k - I\big)\, w_k \;\le\; D \sqrt{\tfrac{5}{2}\, T \log\tfrac{8T}{\delta}} \;+\; D S A \log_2\tfrac{8T}{SA} \tag{10.3.11}$$
with probability at least $1 - \frac{\delta}{12 T^{5/4}}$.
Summing (10.3.8) and (10.3.10) over all episodes, by definition of the confidence intervals and Lemma 10.3.3 we have
$$\sum_k \sum_{s,a} v_k(s,a)\, \mathrm{conf}^r_k(s,a) \;+\; 2D \sum_k \sum_{s,a} v_k(s,a)\, \mathrm{conf}^p_k(s,a)
\;\le\; \mathrm{const} \cdot D \sqrt{S \log(AT/\delta)}\, \sum_k \sum_{s,a} \frac{v_k(s,a)}{\sqrt{N_k(s,a)}}
\;\le\; \mathrm{const} \cdot D \sqrt{S \log(AT/\delta)}\, \sqrt{SAT}. \tag{10.3.12}$$
The R-Max algorithm assumes that in each insufficiently visited state the learner receives the maximal possible reward; UCRL2 offers a refinement of this idea to motivate exploration. Sample complexity bounds as derived for R-Max can also be obtained for UCRL2, cf. [Jaksch et al., 2010].
The gap between the lower bound of Theorem 10.3.2 and the bound for UCRL2 has not been closed so far. There have been various attempts in that direction for different algorithms inspired by Thompson sampling [Agrawal and Jia, 2017] or UCB1 [Ortner, 2020]. However, all of the claimed proofs seem to contain issues that remain unresolved to date.
The situation is settled in the simpler episodic setting, where after every $H$ steps there is a restart. Here there are matching upper and lower bounds of order $\sqrt{HSAT}$ on the regret, see [Azar et al., 2017].
In the discounted setting, the MBIE algorithm of Strehl and Littman [2005,
2008] is a precursor of UCRL2 that is based on the same ideas. While there
are regret bounds available also for MBIE, these are not easily comparable to
Theorem 10.2.1, as the regret is measured along the trajectory of the algorithm,
while the regret considered for UCRL2 is with respect to the trajectory an opti-
mal policy would have taken. In general, regret in the discounted setting seems
to be a less satisfactory concept. However, sample complexity bounds in the dis-
counted setting for a UCRL2 variant have been given in [Lattimore and Hutter,
2014].
Last but not least, we would like to refer any reader interested in the material
of this chapter to the recent book of Lattimore and Szepesvári [2020] that deals
with the whole range of topics from simple bandits to reinforcement learning in
MDPs in much more detail.
Chapter 11
Conclusion
This book touched upon the basic principles of decision making under uncer-
tainty in the context of reinforcement learning. While one of the main streams of
thought is Bayesian decision theory, we also discussed the basics of approximate
dynamic programming and stochastic approximation as applied to reinforcement
learning problems.
Consciously, however, we have avoided going into a number of topics related
to reinforcement learning and decision theory, some of which would need a book
of their own to be properly addressed. Even though it was fun writing the
book, we at some point had to decide to stop and consolidate the material we
had, sometimes culling partially developed material in favour of a more concise
volume.
Firstly, we haven’t explicitly considered many models that can be used for
representing transition distributions, value functions or policies, beyond the
simplest ones, as we felt that this would detract from the main body of the
text. Textbooks for the latest fashion are always going to be abundant, and we
hope that this book provides a sufficient basis to enable the use of any current
methods. There are also a large number of areas which have not been covered
at all. In particular, while we touched upon the setting of two-player games and
its connection to robust statistical decisions, we have not examined problems
which are also relevant to sequential decision making, such as Markov games
and Bayesian games. In relation to this, while early in the book we discuss
risk aversion and risk seeking, we have not discussed specific sequential decision
making algorithms for such problems. Furthermore, even though we discuss the
problem of preference elicitation, we do not discuss specific algorithms for it
or the related problem of inverse reinforcement learning. Another topic which
went unmentioned, but which may become more important in the future, is
hierarchical reinforcement learning as well as options, which allow constructing
long-term actions (such as “go to the supermarket”) from primitive actions
(such as “open the door”). Finally, even though we have mentioned the basic
framework of regret minimisation, we focused on the standard reinforcement
learning problem, and ignored adversarial settings and problems with varying
amounts of side information.
It is important to note that the book almost entirely elides social aspects of
decision making. In practice, any algorithm that is going to be used to make
autonomous decisions is going to have a societal impact. In such cases, the
algorithm designer must guard against negative externalities, such as hurting
disadvantaged groups, violating privacy, or environmental damage. However, as
a lot of these issues are context dependent, we urge the reader to consult recent
work in economics, algorithmic fairness and differential privacy.
Appendix A
Symbols
, definition
∧ logical and
∨ logical or
⇒ implies
⇔ if and only if
∃ there exists
∀ for every
s.t. such that
Beta(α, β) Beta distribution with parameters (α, β)
Geom(ω) Geometric distribution with parameter ω
Wish(n −
Appendix B
Probability concepts
$A \triangleq \{x \mid x \text{ has property } Y\}.$
Example 48.
$$B(c, r) \triangleq \{x \in \mathbb{R}^n \mid \|x - c\| \le r\}$$
describes the set of points enclosed in an $n$-dimensional sphere of radius $r$ with center $c \in \mathbb{R}^n$.
Ω1 × · · · × Ωn = {(s1 , . . . , sn ) | si ∈ Ωi , i = 1, . . . , n} (B.1.1)
experiment we are interested in. At the extreme, the sample space and corre-
sponding outcomes may completely describe everything there is to know about
the experiment.
Experiments
The set of possible experimental outcomes of an experiment is called the
sample space Ω.
The following example considers the case where three different statisticians
care about three different types of outcomes of an experiment where a drug is
given to a patient. The first is interested in whether the patient recovers, the
second in whether the drug has side-effects, while the third is interested in both.
Just as weighing two apples separately and adding the totals gives you the same answer as weighing both apples together, so the total probability of either of two mutually exclusive events equals the sum of their individual probabilities. However, sets are
complex beasts and formally we wish to define exactly when we can measure
them.
Many times the natural outcome space Ω that we wish to consider is ex-
tremely complex, but we only care about whether a specific event occurs or not.
For example, when we toss a coin in the air, the natural outcome is the com-
plete trajectory that the coin follows and its final resting position. However, we
might only care about whether the coin lands heads or not. Then, the event of
the coin landing “heads” is defined as all the trajectories that the coin follows
which result in it landing heads. These trajectories form a subset A ⊂ Ω.
Probabilities will always be defined on subsets of the outcome space. These
subsets are termed events. The probability of events will simply be a function
on sets, and more specifically a measure. The following gives some intuition and
formal definitions about what this means.
Probability of a set
If A is a subset of Ω, the probability of A is a measure of the chances that
the outcome of the experiment will be an element of A.
Which sets?
Example 50. Let X be uniformly distributed on [0, 1]. By definition, this means that
the probability that X is in [0, p] is equal to p for all p ∈ [0, 1]. However, even for this
simple distribution, it might be difficult to define the probability of all events.
What is the probability that X will be in [0, 1/4)?
What is the probability that X will be in [1/4, 1]?
What is the probability that X will be a rational number?
[Figure: an apartment with three rooms A, B and C, with coins lying on the floor and a red carpet. For each room the figure indicates the floor area (A: 4 × 5 = 20m², B: 6 × 4 = 24m², C: 2 × 5 = 10m²), the number of coins (A: 3, B: 4, C: 5) and the length of red carpet (A: 0m, B: 0.5m, C: 4.5m).]
Definition B.2.1 (A field on Ω). A family F of sets, such that for each A ∈ F,
one also has A ⊆ Ω, is called a field on Ω if and only if
1. Ω ∈ F
2. if A ∈ F, then A∁ ∈ F.
3. For any $A_1, A_2, \ldots, A_n$ such that $A_i \in \mathcal{F}$, it holds that $\bigcup_{i=1}^{n} A_i \in \mathcal{F}$.
From the above definition, it is easy to see that Ai ∩ Aj is also in the field.
Since many times our family may contain an infinite number of sets, we also
want to extend the above to countably infinite unions.
Definition B.2.2 (A σ-field on Ω). A family F of sets, such that for each A ∈ F, one also has A ⊆ Ω, is called a σ-field (σ-algebra) on Ω if and only if
1. Ω ∈ F
2. if A ∈ F, then A∁ ∈ F.
3. For any sequence $A_1, A_2, \ldots$ such that $A_i \in \mathcal{F}$, it holds that $\bigcup_{i=1}^{\infty} A_i \in \mathcal{F}$.
It is easy to verify that the F given in the apartment example satisfies these
properties. In general, for any finite Ω, it is easy to find a family F containing
all possible events in Ω. Things become trickier when Ω is infinite. Can we
define an algebra F that contains all events? In general no, but we can define
an algebra on the so-called Borel sets of Ω, defined in B.2.3.
A measure λ on (Ω, F) is a function from F to [0, ∞] such that
1. λ(∅) = 0.
2. For any sequence of disjoint sets $A_1, A_2, \ldots \in \mathcal{F}$, it holds that $\lambda\big(\bigcup_i A_i\big) = \sum_i \lambda(A_i)$.
It is easy to verify that the floor area, the number of coins, and the length of the red carpet are all measures. In fact, the area and the length correspond to what is called a Lebesgue measure (see Section B.2.3 for a precise definition) and the number of coins to a counting measure.
A probability measure P on (Ω, F) is a measure that additionally satisfies
1. P(Ω) = 1
2. P(∅) = 0
The common value of the inner and outer measure is called the Lebesgue measure, $\bar{\lambda}(A) = \lambda^*(A)$.
However, the basic probability on Ω does not tell us anything about the probability of some event A given that some other event B has occurred. Sometimes these events are mutually exclusive, meaning that when B happens, A cannot be true; other times B implies A; and sometimes they are independent. To quantify exactly how knowledge of whether B has occurred affects what we know about A, we need the notion of conditional probability.
[Figure: the sample space Ω of possible patient states, with the events A1 (recovery) and A2 (side effects) drawn as overlapping regions containing an outcome ω.]
The union bound is extremely important, and one of the basic proof methods
in many applications of probability.
Finally, let us consider the general case of multiple disjoint events, shown
in Figure B.4. When B is decomposed in a set of disjoint events {Bi }, we can
write
$$P(B) = P\Big(\bigcup_i B_i\Big) = \sum_i P(B_i) \tag{B.3.2}$$
and
$$P(A \cap B) = P\Big(\bigcup_i (A \cap B_i)\Big) = \sum_i P(A \cap B_i), \tag{B.3.3}$$
for any other set A. An interesting special case occurs when $B = \Omega$, in which case $P(A) = P(A \cap \Omega)$, since $A \subseteq \Omega$ for any $A$ in the algebra. This results in the marginalisation or sum rule of probability:
$$P(A) = P\Big(\bigcup_i (A \cap B_i)\Big) = \sum_i P(A \cap B_i), \qquad \bigcup_i B_i = \Omega. \tag{B.3.4}$$
$P(A_1) = 1 \times h = h$, $\quad P(A_2) = w \times 1 = w$, and the two events are independent. Independent events are particularly important.
[Figure: the events A1 (recovery) and A2 (side effects) shown as overlapping rectangular regions inside the sample space Ω.]
Proof. From (B.3.5), $P(A_i \mid B) = P(A_i \cap B)/P(B)$ and also $P(A_i \cap B) = P(B \mid A_i) P(A_i)$. Thus
$$P(A_i \mid B) = \frac{P(B \mid A_i)\, P(A_i)}{P(B)},$$
and we continue by analysing the denominator $P(B)$. First, due to $\bigcup_{i=1}^{\infty} A_i = \Omega$ we have $B = \bigcup_{j=1}^{\infty} (B \cap A_j)$. Since the $A_i$ are disjoint, so are the $B \cap A_i$. Then from the union property of probability distributions we have
$$P(B) = P\Big(\bigcup_{j=1}^{\infty} (B \cap A_j)\Big) = \sum_{j=1}^{\infty} P(B \cap A_j) = \sum_{j=1}^{\infty} P(B \mid A_j)\, P(A_j),$$
which completes the proof.
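A small numerical check of the theorem in Python, with made-up numbers for a partition A1, A2, A3 of Ω and an observed event B:

# Hypothetical prior probabilities P(Ai) and likelihoods P(B | Ai).
prior = [0.5, 0.3, 0.2]
likelihood = [0.9, 0.5, 0.1]

p_b = sum(l * p for l, p in zip(likelihood, prior))            # P(B)
posterior = [l * p / p_b for l, p in zip(likelihood, prior)]   # P(Ai | B)
assert abs(sum(posterior) - 1.0) < 1e-12                       # posteriors sum to 1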
The distribution of X
Exercise 42. Ω is the set of 52 playing cards. X(s) is the value of each card (1 for the ace and 10 for the face cards). What is the probability of drawing a card s with X(s) > 7?
Properties
Discrete distributions
$X : \Omega \to \{x_1, \ldots, x_n\}$ takes $n$ discrete values ($n$ can be infinite). The probability function of $X$ is $f(x_i) = P(X = x_i)$, $i = 1, \ldots, n$.
Continuous distributions
$X$ has a continuous distribution if there exists a probability density function $f$ s.t. for all $B \subseteq \mathbb{R}$:
$$P_X(B) = \int_B f(x)\, dx.$$
Discrete distributions
P(X1 = x1 , . . . , Xm = xm ) = f (x1 , . . . , xm ),
where $f$ is the joint probability function, with $x_i \in V_i$.
Continuous distributions
For $B \subseteq \mathbb{R}^m$,
$$P\big\{(X_1, \ldots, X_m) \in B\big\} = \int_B f(x_1, \ldots, x_m)\, dx_1 \cdots dx_m$$
We introduce the common notation $\int \cdots \, d\mu(x)$, where $\mu$ is a measure. Let $g : \Omega \to \mathbb{R}$ be a real function. Then for any subset $B \subseteq \Omega$ we can write the following.
Lebesgue-Stieltjes notation
If $P$ is a probability measure on $(\Omega, \mathcal{F})$ and $B \subseteq \Omega$, and $g$ is $\mathcal{F}$-measurable, the probability that $g(x)$ takes a value in $B$ can be written equivalently as
$$P(g \in B) = P_g(B) = \int_B g(x)\, dP(x) = \int_B g\, dP. \tag{B.4.3}$$
Marginal distribution
The marginal distribution of $X_1, \ldots, X_k$ from a set of variables $X_1, \ldots, X_m$ is
$$P(X_1, \ldots, X_k) \triangleq \int P(X_1, \ldots, X_k, X_{k+1} = x_{k+1}, \ldots, X_m = x_m)\; d\mu(x_{k+1}, \ldots, x_m). \tag{B.4.4}$$
Independence
If $X_i$ is independent of $X_j$ for all $i \neq j$:
$$P(X_1, \ldots, X_m) = \prod_{i=1}^{m} P(X_i), \qquad f(x_1, \ldots, x_m) = \prod_{i=1}^{m} g_i(x_i). \tag{B.4.5}$$
B.4.6 Moments
There are some simple properties of the random variable under consideration which are frequently of interest in statistics. Two of these properties are the expectation and the variance.
Expectation
The expectation of $X$ is $E[X] = \int t\, dP_X(t)$. Furthermore,
$$E[g(X)] = \int g(t)\, dP_X(t)$$
for any function $g$ for which the integral is defined.
B.5 Divergences
Divergences are a natural way to measure how different two distributions are.
The problem with the empirical distribution is that it does not capture the uncertainty we have about what the real distribution is. For that reason, it should be used with care, even though it does converge to the true distribution in the limit. A clever way to construct a measure of uncertainty is to perform sub-sampling, that is, to create k random samples of size n′ < n from the original sample. Each sample will correspond to a different random empirical distribution. Sub-sampling is performed without replacement (i.e. within each sample, each observation xi is used at most once). When sampling with replacement and n′ = n, the method is called bootstrapping.
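A minimal Python sketch of the two resampling schemes just described (the data values are made up):

import random

def subsample(data, k, m):
    # k sub-samples of size m < len(data), drawn without replacement.
    return [random.sample(data, m) for _ in range(k)]

def bootstrap(data, k):
    # k bootstrap samples: same size as the data, drawn with replacement.
    n = len(data)
    return [random.choices(data, k=n) for _ in range(k)]

data = [0.1, 0.5, 0.3, 0.9, 0.7, 0.2]
bootstrap_means = [sum(b) / len(b) for b in bootstrap(data, 1000)]
subsample_means = [sum(b) / len(b) for b in subsample(data, 1000, 4)]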
B.8 Exercises
Exercise 43. Show that
$$A \cap (B \cup D) = (A \cap B) \cup (A \cap D),$$
and that
$$(A \cup B)^\complement = A^\complement \cap B^\complement, \qquad (A \cap B)^\complement = A^\complement \cup B^\complement.$$
Exercise 44 (10). Prove that any probability measure P has the following properties:
1. $P(A^\complement) = 1 - P(A)$.
2. If $A \subset B$ then $P(A) \le P(B)$.
3. For any sequence of events $A_1, A_2, \ldots$
$$P\Big(\bigcup_{i=1}^{\infty} A_i\Big) \;\le\; \sum_{i=1}^{\infty} P(A_i) \qquad \text{(union bound)}$$
Hint: Recall that if $A_1, \ldots, A_n$ are disjoint then $P\big(\bigcup_{i=1}^{n} A_i\big) = \sum_{i=1}^{n} P(A_i)$ and that $P(\emptyset) = 0$.
If $X_1, \ldots, X_n$ is a sequence of Bernoulli random variables with parameter $p$, then $\sum_{i=1}^{n} X_i$ has a binomial distribution with parameters $n, p$.
Exercise 46 (10). In a few sentences, describe your views on the usefulness of probability.
Is it the only formalism that can describe both random events and uncertainty?
Would it be useful to separate randomness from uncertainty?
What would be desirable properties of an alternative concept?
Appendix C
Useful results
$$M = \sup_{x \in A} f(x).$$
In other words, there exists no smaller upper bound than M. When the function f has a maximum, the supremum is identical to the maximum.
C.1.1 Series
Definition C.1.3 (The geometric series). The sum $\sum_{k=0}^{n} x^k$ is called the geometric series and, for $x \neq 1$, has the property
$$\sum_{k=0}^{n} x^k = \frac{x^{n+1} - 1}{x - 1}. \tag{C.1.4}$$
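The identity follows from a one-line telescoping argument; letting $n \to \infty$ for $|x| < 1$ gives the limit commonly used for discounted sums:
$$(x - 1) \sum_{k=0}^{n} x^k = \sum_{k=0}^{n} x^{k+1} - \sum_{k=0}^{n} x^k = x^{n+1} - 1, \qquad \sum_{k=0}^{\infty} x^k = \frac{1}{1 - x} \quad \text{for } |x| < 1.$$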
For a positive real number $t$ (or a complex number with positive real part), the gamma function is defined as
$$\Gamma(t) = \int_0^{\infty} x^{t-1} e^{-x}\, dx. \tag{C.1.6}$$
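Integration by parts yields the familiar recursion, which for integer arguments recovers the factorial:
$$\Gamma(t + 1) = \int_0^{\infty} x^{t} e^{-x}\, dx = \big[-x^{t} e^{-x}\big]_0^{\infty} + t \int_0^{\infty} x^{t-1} e^{-x}\, dx = t\, \Gamma(t), \qquad \Gamma(n) = (n-1)! \ \text{ for } n \in \mathbb{N}.$$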
Appendix D
Index
Index
., 106
Adaptive hypothesis testing, 110
Adaptive treatment allocation, 110
adversarial bandit, 221
approximate
    policy iteration, 177
backwards induction, 114
bandit
    adversarial, 221
    contextual, 221
    nonstochastic, 221
bandit problems, 111
    stochastic, 111
Bayes rule, 54
Bayes' theorem, 21
belief state, 113
Beta distribution, 71
binomial coefficient, 70
bootstrapping, 253
Borel σ-algebra, 243
branch and bound, 206
classification, 53
clinical trial, 110
concave function, 27
conditional probability, 246
conditionally independent, 247
contextual bandit, 221
covariance, 208
covariance matrix, 252
decision boundary, 54
decision procedure
    sequential, 92
design matrices, 208
difference operator, 134
discount factor, 111, 118
distribution
    χ2, 74
    Bernoulli, 70
    Beta, 71
    binomial, 70
    exponential, 76
    Gamma, 75
    marginal, 94
    normal, 73
divergences, 252
empirical distribution, 253
every-visit Monte-Carlo, 156
expectation, 251
experimental design, 110
exploration vs exploitation, 11
fairness, 55
first visit
    Monte-Carlo update, 156
Gamma function, 71
Gaussian processes, 208
gradient descent, 175
    stochastic, 149
Hoeffding inequality, 155
horizon, 118
inequality
    Chebyshev, 85
    Hoeffding, 86
    Markov, 85
inf, see infimum
infimum, 258
Jensen's inequality, 27
KL-Divergence, 252
KL-divergence, 89
likelihood
    conditional, 19
    relative, 16
linear programming, 136
sample mean, 66
series
    geometric, 96, 258
simulation, 154
Bibliography
Mauricio Álvarez, David Luengo, Michalis Titsias, and Neil Lawrence. Efficient
multioutput Gaussian processes through variational inducing kernels. In Pro-
ceedings of the Thirteenth International Conference on Artificial Intelligence
and Statistics (AISTATS 2010), pages 25–32, 2010.
Jean-Yves Audibert and Sébastien Bubeck. Minimax policies for adversarial and
stochastic bandits. In Proceedings of the 22nd Annual Conference on Learning Theory (COLT 2009), pages 217–226, 2009.
Peter Auer and Ronald Ortner. UCB revisited: improved regret bounds for
the stochastic multi-armed bandit problem. Period. Math. Hungar., 61(1-2):
55–65, 2010.
Peter Auer, Nicolò Cesa-Bianchi, and Paul Fischer. Finite time analysis of the
multiarmed bandit problem. Machine Learning, 47(2/3):235–256, 2002a.
Andrew G Barto. Adaptive critics and the basal ganglia. Models of information
processing in the basal ganglia, page 215, 1995.
Christian P. Robert and George Casella. Monte Carlo Statistical Methods. Springer Texts in Statistics. Springer, 1999.
Sam Corbett-Davies, Emma Pierson, Avi Feller, Sharad Goel, and Aziz Huq.
Algorithmic decision making and the cost of fairness. Technical Report
1701.08230, arXiv, 2017.
Morris H. DeGroot. Optimal Statistical Decisions. John Wiley & Sons, 1970.
Christos Dimitrakakis, Yang Liu, David Parkes, and Goran Radanovic. Sub-
jective fairness: Fairness is in the eye of the beholder. Technical Report
1706.00119, arXiv, 2017.
Cynthia Dwork, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Richard
Zemel. Fairness through awareness. In Proceedings of the 3rd Innovations in
Theoretical Computer Science Conference, pages 214–226. ACM, 2012.
Yaakov Engel, Shie Mannor, and Ron Meir. Bayes meets Bellman: The Gaussian process approach to temporal difference learning. In ICML 2003, 2003.
Yaakov Engel, Shie Mannor, and Ron Meir. Reinforcement learning with Gaussian processes. In Proceedings of the 22nd international conference on Ma-
chine learning, pages 201–208. ACM, 2005.
J. C. Gittins. Multi-armed Bandit Allocation Indices. John Wiley & Sons, New
Jersey, US, 1989.
Robert Grande, Thomas Walsh, and Jonathan How. Sample efficient reinforce-
ment learning with Gaussian processes. In International Conference on Ma-
chine Learning, pages 1332–1340, 2014.
Peter E Hart, Nils J Nilsson, and Bertram Raphael. A formal basis for the
heuristic determination of minimum cost paths. IEEE transactions on Sys-
tems Science and Cybernetics, 4(2):100–107, 1968.
Wassily Hoeffding. Probability inequalities for sums of bounded random vari-
ables. Journal of the American Statistical Association, 58(301):13–30, March
1963.
M. Hutter. Feature reinforcement learning: Part I: Unstructured MDPs. Journal
of Artificial General Intelligence, 1:3–24, 2009.
Thomas Jaksch, Ronald Ortner, and Peter Auer. Near-optimal regret bounds
for reinforcement learning. Journal of Machine Learning Research, 11:1563–
1600, 2010.
Tobias Jung and Peter Stone. Gaussian processes for sample-efficient reinforce-
ment learning with RMAX-like exploration. In ECML/PKDD 2010, pages
601–616, 2010.
Sham Kakade. A natural policy gradient. Advances in neural information
processing systems, 2:1531–1538, 2002.
Emilie Kaufmann, Nathaniel Korda, and Rémi Munos. Thompson sampling:
An optimal finite time analysis. In ALT-2012, 2012.
Michael Kearns and Satinder Singh. Finite sample convergence rates for Q-
learning and indirect algorithms. In Advances in Neural Information Process-
ing Systems, volume 11, pages 996–1002. The MIT Press, 1999.
Niki Kilbertus, Mateo Rojas-Carulla, Giambattista Parascandolo, Moritz Hardt,
Dominik Janzing, and Bernhard Schölkopf. Avoiding discrimination through
causal reasoning. Technical Report 1706.02744, arXiv, 2017.
Jon Kleinberg, Sendhil Mullainathan, and Manish Raghavan. Inherent trade-
offs in the fair determination of risk scores. Technical Report 1609.05807,
arXiv, 2016.
AN Kolmogorov and SV Fomin. Elements of the theory of functions and func-
tional analysis. Dover Publications, 1999.
M. Lagoudakis and R. Parr. Reinforcement learning as classification: Leveraging
modern classifiers. In ICML, page 424, 2003a.
M. G. Lagoudakis and R. Parr. Least-squares policy iteration. The Journal of
Machine Learning Research, 4:1107–1149, 2003b.
Tze Leung Lai and Herbert Robbins. Asymptotically efficient adaptive alloca-
tion rules. Adv. in Appl. Math., 6:4–22, 1985.
Nam M. Laird and Thomas A. Louis. Empirical Bayes confidence intervals
based on bootstrap samples. Journal of the American Statistical Association,
82(399):739–750, 1987.
Tor Lattimore. Optimally confident UCB: Improved regret for finite-armed ban-
dits, 2015. arXiv preprint arXiv:1507.07880.
Tor Lattimore and Marcus Hutter. Near-optimal PAC bounds for discounted
MDPs. Theor. Comput. Sci., 558:125–143, 2014.
Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Ve-
ness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidje-
land, Georg Ostrovski, et al. Human-level control through deep reinforcement
learning. Nature, 518(7540):529–533, 2015.
R. Munos and C. Szepesvári. Finite-time bounds for fitted value iteration. The
Journal of Machine Learning Research, 9:815–857, 2008.
Ronald Ortner. Regret bounds for reinforcement learning via Markov chain concentration. J. Artif. Intell. Res., 67:115–128, 2020. doi: 10.1613/jair.1.11316. URL https://doi.org/10.1613/jair.1.11316.
Ronald Ortner, Daniil Ryabko, Peter Auer, and Rémi Munos. Regret bounds for
restless Markov bandits. Theor. Comput. Sci., 558:62–76, 2014. doi: 10.1016/j.tcs.2014.09.026. URL http://dx.doi.org/10.1016/j.tcs.2014.09.026.
Ian Osband, Daniel Russo, and Benjamin Van Roy. (More) efficient reinforce-
ment learning via posterior sampling. In NIPS, 2013.
Ian Osband, Charles Blundell, Alexander Pritzel, and Benjamin Van Roy. Deep
exploration via bootstrapped DQN. In Advances in Neural Information Pro-
cessing Systems, pages 4026–4034, 2016.
Jan Peters and Stefan Schaal. Policy gradient methods for robotics. In In-
telligent Robots and Systems, 2006 IEEE/RSJ International Conference on,
pages 2219–2225. IEEE, 2006.
Tao Wang, Daniel Lizotte, Michael Bowling, and Dale Schuurmans. Bayesian
sparse sampling for on-line reward optimization. In ICML ’05, pages 956–
963, New York, NY, USA, 2005. ACM. ISBN 1-59593-180-5. doi: http://doi.acm.org/10.1145/1102351.1102472.
Henry H Yin and Barbara J Knowlton. The role of the basal ganglia in habit
formation. Nature Reviews Neuroscience, 7(6):464, 2006.