Decision Making Under Uncertainty and Reinforcement Learning

April 8, 2021
Contents
1 Introduction
  1.1 Uncertainty and probability
  1.2 The exploration-exploitation trade-off
  1.3 Decision theory and reinforcement learning
  1.4 Acknowledgements
3 Decision problems
  3.1 Introduction
  3.2 Rewards that depend on the outcome of an experiment
    3.2.1 Formalisation of the problem setting
    3.2.2 Decision diagrams
    3.2.3 Statistical estimation*
  3.3 Bayes decisions
    3.3.1 Convexity of the Bayes-optimal utility*
  3.4 Statistical and strategic decision making
    3.4.1 Alternative notions of optimality
    3.4.2 Solving minimax problems*
    3.4.3 Two-player games
  3.5 Decision problems with observations
    3.5.1 Decision problems in classification
    3.5.2 Calculating posteriors
  3.6 Summary
  3.7 Exercises
    3.7.1 Problems with no observations
    3.7.2 Problems with observations
    3.7.3 An insurance problem
    3.7.4 Medical diagnosis
4 Estimation
  4.1 Introduction
  4.2 Sufficient statistics
    4.2.1 Sufficient statistics
    4.2.2 Exponential families
  4.3 Conjugate priors
    4.3.1 Bernoulli-Beta conjugate pair
    4.3.2 Conjugates for the normal distribution
    4.3.3 Conjugates for multivariate distributions
  4.4 Credible intervals
  4.5 Concentration inequalities
    4.5.1 Chernoff-Hoeffding bounds
  4.6 Approximate Bayesian approaches
    4.6.1 Monte-Carlo inference
    4.6.2 Approximate Bayesian Computation
    4.6.3 Analytic approximations of the posterior
    4.6.4 Maximum Likelihood and Empirical Bayes methods
5 Sequential sampling
  5.1 Gains from sequential sampling
    5.1.1 An example: sampling with costs
  5.2 Optimal sequential sampling procedures
    5.2.1 Multi-stage problems
    5.2.2 Backwards induction for bounded procedures
    5.2.3 Unbounded sequential decision procedures
    5.2.4 The sequential probability ratio test
    5.2.5 Wald’s theorem
  5.3 Martingales
  5.4 Markov processes
  5.5 Exercises
11 Conclusion
A Symbols
D Index
Chapter 1
Introduction
The purpose of this book is to collect the fundamental results for decision
making under uncertainty in one place. In particular, the aim is to give a uni-
fied account of algorithms and theory for sequential decision making problems,
including reinforcement learning. Starting from elementary statistical decision
theory, we progress to the reinforcement learning problem and various solution
methods. The end of the book focuses on the current state-of-the-art in models
and approximation algorithms.
The problem of decision making under uncertainty can be broken down into
two parts. First, how do we learn about the world? This involves both the
problem of modeling our initial uncertainty about the world, and that of draw-
ing conclusions from evidence and our initial belief. Secondly, given what we
currently know about the world, how should we decide what to do, taking into
account future events and observations that may change our conclusions?
Typically, this will involve creating long-term plans covering possible future
eventualities. That is, when planning under uncertainty, we also need to take
into account what possible future knowledge could be generated when imple-
menting our plans. Intuitively, executing plans which involve trying out new
things should give more information, but it is hard to tell whether this infor-
mation will be beneficial. The choice between doing something which is already
known to produce good results and experimenting with something new is known
as the exploration-exploitation dilemma, and it is at the root of the interaction
between learning and planning.
Classical Probability
A random experiment is performed, with a given set Ω of possible outcomes.
An example is the 2-slit experiment in physics, where a particle is generated
which can go through either one of two slits. According to our current
understanding of quantum mechanics, it is impossible to predict which slit
the particle will go through. Herein, the set Ω consists of two possible
events corresponding to the particle passing through one or the other slit.
Subjective Probability
Here Ω need not only describe the outcomes of some experiment; conceptually,
it can also be a set of possible worlds or realities. This set can be quite
large and include anything imaginable. For example, it may include worlds
where dragons are real. However, in practice one only cares about certain
aspects of the world, such as whether in this world, you will win the lottery
if you buy a ticket. We can interpret the probability of a world in Ω as our
degree of belief that it corresponds to reality.
Thus, one must decide whether to exploit knowledge about the world to
gain a known reward, or to explore the world to learn something new. This
will potentially give you less reward immediately, but the knowledge itself can
usually be put to use in the future.
This exploration-exploitation trade-off only arises when data collection is
interactive. If we are simply given a set of data which is to be used to decide
upon a course of action, but our decision does not affect the data we shall collect
in the future, then things are much simpler. However, a lot of real-world human
decision making as well as modern applications in data science involve such
trade-offs. Decision theory offers precise mathematical models and algorithms
for such problems.
Outline
1.4 Acknowledgements
Many thanks go to all the students of the Decision making under uncertainty
and Advanced topics in reinforcement learning and decision making classes over
the years for bearing with early drafts of this book. A big “thank you” goes
to Nikolaos Tziortziotis, whose code is used in some of the examples in the
book. Finally, thanks to Aristide Tossou and Hannes Eriksson for proof-reading
various chapters. Lastly, many of the code examples in the book were run
using the parallel package by Tange [2011].
Chapter 2

Subjective probability and utility
Let us now speak more generally about the case where we have defined an
appropriate σ-field F on Ω. Then each element Ai ∈ F will be a subset of Ω.
We now wish to define relative likelihood relations for the elements Ai ∈ F.1
1 More formally, we can define three classes C≻ , C≺ , C∼ ⊂ F² such that a pair (Ai , Aj ) ∈
CR if and only if it satisfies the relation Ai R Aj , where R ∈ {≻, ≺, ∼}. These three classes
form a partition of F² under the subjective probability assumptions we will introduce in the following.
Assumption 2.1.1 (SP1). For any pair of events A, B ∈ F, one has either
A ≻ B, A ≺ B, or A ∼ B.
As it turns out, these assumptions are sufficient for proving the following
theorems [DeGroot, 1970]. The first theorem tells us that our belief must be
consistent with respect to transitivity.
Theorem 2.1.1 (Transitivity). Under Assumptions 2.1.1, 2.1.2, and 2.1.3, for
all events A, B, C: If A ≼ B and B ≼ C, then A ≼ C.
The second theorem says that if two events have a certain relation, then
their negations have the converse relation.
Definition 2.1.1 (Uniform distribution). Let λ(A) denote the length of any
interval A ⊆ [0, 1]. Then x : Ω → [0, 1] has a uniform distribution on [0, 1] if,
for any subintervals A, B of [0, 1], (x ∈ A) ≼ (x ∈ B) if and only if λ(A) ≤ λ(B).
Example 2. Say that A is the event that it rains in Gothenburg, Sweden tomorrow.
We know that Gothenburg is quite rainy due to its oceanic climate, so we set A ≽ A∁.
Now, let us try to incorporate some additional information. Let D denote the fact
that good weather is forecast, so that given D good weather will appear more probable
than rain, formally (A | D) ≼ (A∁ | D).
Conditional likelihoods
Define (A | D) ≼ (B | D) to mean that B is at least as likely as A when it
is known that D holds.
Assumption 2.1.6 (CP). For any events A, B, D,
(A | D) ≼ (B | D) iff A ∩ D ≼ B ∩ D.
Theorem 2.1.9. If a likelihood relation ≼ satisfies assumptions SP1–SP5 as
well as CP, then there exists a probability measure P such that: For any A, B, D
such that P (D) > 0,
(A | D) ≼ (B | D) iff P (A | D) ≤ P (B | D).
It turns out that there are very few ways that a conditional probability def-
inition can satisfy all of our assumptions. The usual definition is the following.
Definition 2.1.3 (Conditional probability).

P (A | D) ≜ P (A ∩ D) / P (D).
This definition effectively answers the question of how much evidence for A
we have, now that we have observed D. This is expressed as the ratio between
the combined event A ∩ D, also known as the joint probability of A and D, and
the marginal probability of D itself. The intuition behind the definition becomes
clearer once we rewrite it as P (A ∩ D) = P (A | D) P (D). Then conditional
probability is effectively used as a way to factorise joint probabilities.
P (Ai | B) = P (B | Ai ) P (Ai ) / Σ_{j=1}^n P (B | Aj ) P (Aj ).
Example 4 (The weather forecast). Form a subjective probability for the probability
that it rains.
A1 : Rain.
A2 : No rain.
First, choose P (A1 ) and set P (A2 ) = 1 − P (A1 ). Now assume that there is a weather
forecasting station that predicts no rain for tomorrow. However, you know the fol-
lowing facts about the station: On the days when it rains, half of the time the station
had predicted it was not going to rain. On days when it doesn’t rain, the station had
said no rain 9 times out of 10.
Solution. Let B denote the event that the station predicts no rain. According
to our information, the probability that there is rain when the prediction said no rain is

P (A1 | B) = P (B | A1 ) P (A1 ) / (P (B | A1 ) P (A1 ) + P (B | A2 ) [1 − P (A1 )])
           = (1/2) P (A1 ) / (0.9 − 0.4 P (A1 )).
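To see how the posterior behaves for different prior probabilities of rain, here is a minimal Python sketch of the same calculation (the prior value 0.3 below is only an illustrative choice, not part of the example):

```python
def posterior_rain(prior_rain):
    """Posterior probability of rain given a 'no rain' forecast (event B)."""
    p_b_given_rain = 0.5     # P(B | A1): forecast said no rain although it rained
    p_b_given_no_rain = 0.9  # P(B | A2): forecast said no rain and it did not rain
    joint_rain = p_b_given_rain * prior_rain
    joint_no_rain = p_b_given_no_rain * (1 - prior_rain)
    return joint_rain / (joint_rain + joint_no_rain)

# With a prior P(A1) = 0.3, the posterior drops to about 0.19 after the forecast.
print(posterior_rain(0.3))
```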
Preferences
Example 5 (Musical event tickets). We have a set of tickets R, and we must choose
the ticket r ∈ R we prefer most. Here are two possible scenarios:
R is a set of tickets to different music events at the same time, at equally
good halls with equally good seats and the same price. Here preferences simply
coincide with the preferences for a certain type of music or an artist.
If R is a set of tickets to different events at different times, at different quality
halls with different quality seats and different prices, preferences may depend
on all the factors.
Example 6 (Route selection). We have a set of alternative routes and must pick one.
If R contains two routes of the same quality, one short and one long, we will probably
prefer the shorter one. If the longer route is more scenic our preferences may be
different.
The first assumption means that we must always be able to decide between
any two rewards. It may seem that it does not always hold in practice, since
humans are frequently indecisive. However, without the second assumption, it
is still possible to create preference relations that are cyclic.
Example 8 (Route selection). Assume that you have to pick between two routes
P1 , P2 . Your preferences are such that shorter time routes are preferred over longer
ones. For simplicity, let R = {10, 15, 30, 35} be the possible times it might take to
reach your destination. Route P1 takes 10 minutes when the road is clear, but 30
minutes when the traffic is heavy. The probability of heavy traffic on P1 is 0.5. On the
other hand, route P2 takes 15 minutes when the road is clear, but 35 minutes when
the traffic is heavy. The probability of heavy traffic on P2 is 0.2.
What would be a good principle for choosing between the two routes in
Example 8? Clearly route P1 gives both the lowest best-case time and the
lowest worst-case time. It thus appears as though both an extremely cautious
person (who assumes the worst-case) and an extreme optimist (who assumes the
best case) would say P1 ≻∗ P2 . However, the average time taken in P2 is only 19
minutes versus 20 minutes for P1 . Thus, somebody who only took the average
time into account would prefer P2 . In the following sections, we will develop one
of the most fundamental methodologies for choices under uncertainty, based on
the idea of utilities.
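As a small illustration of these three criteria, the following sketch evaluates the routes of Example 8; taking utility to be the negative travel time is an assumption made only for this illustration:

```python
# Each route is a list of (probability, travel time in minutes) pairs (Example 8).
routes = {
    "P1": [(0.5, 10), (0.5, 30)],
    "P2": [(0.8, 15), (0.2, 35)],
}

for name, outcomes in routes.items():
    best = min(t for _, t in outcomes)           # extreme optimist's criterion
    worst = max(t for _, t in outcomes)          # extremely cautious (worst-case) criterion
    average = sum(p * t for p, t in outcomes)    # expected travel time
    print(name, best, worst, average)
# P1: best 10, worst 30, average 20.0;  P2: best 15, worst 35, average 19.0
```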
2.3.3 Utility
The concept of utility allows us to create a unifying framework, such that given
a particular set of rewards and probability distributions on them, we can define
preferences among distributions via their expected utility. The first step is to
define a utility function on rewards that induces a preference relation among them.
We make the assumption that the utility function is such that the expected
utility remains consistent with the preference relations between all probability
distributions we are choosing between.
Example 9. Consider the following decision problem. You have the option of entering
a lottery for 1 currency unit (CU). The prize is 10 CU and the probability of winning
is 0.01. This can be formalised by making it a choice between two probability dis-
tributions: P , where you do not enter the lottery, and Q, which represents entering
the lottery. Calculating the expected utility obviously gives 0 for not entering and
E(U | Q) = Σ_r U (r) Q(r) = −0.9 for entering the lottery, cf. Table 2.1.

r                        U (r)    P      Q
did not enter              0      1      0
paid 1 CU and lost        −1      0      0.99
paid 1 CU and won 10       9      0      0.01

Table 2.1: Rewards, utilities and probabilities for the lottery example.
Monetary rewards
Frequently, rewards come in the form of money. In general, it is assumed that
people prefer to have more money than less money. However, while the utility
of monetary rewards is generally assumed to be increasing, it is not necessarily
linear. For example, 1,000 Euros are probably worth more to somebody with
only 100 Euros in the bank than to somebody with 100,000 Euros. Hence, it
seems reasonable to assume that the utility of money is concave.
The following examples show the consequences of the expected utility hy-
pothesis.
Exercise 2. Show that under the expected utility hypothesis, if gamble 1 is preferred
in the Example 10, gamble 1 must also be preferred in the Example 11 for any utility
function.
In practice, you may find that your preferences are not aligned with what
this exercise suggests. This implies that either your decisions do not conform
to the expected utility hypothesis, or that you are not internalising the given
probabilities. We will explore this further in the following example.
Example 12 (The St. Petersburg Paradox, Bernoulli 1713). A coin is tossed repeat-
edly until it comes up heads. The player then obtains 2^n currency units, where
n ∈ {1, 2, . . .} is the number of times the coin was thrown. The coin is assumed to be
fair, meaning that the probability of heads is always 1/2.
How many currency units k would you be willing to pay to play this game once?
Were your utility function linear, you would be willing to pay any finite amount k
to play, as the expected utility for playing the game for any finite k is
Σ_{n=1}^∞ U (2^n − k) 2^{−n} = ∞.
It would be safe to assume that very few readers would be prepared to pay an
arbitrarily high amount to play this game. One way to explain this is that the
utility function is not linear. An alternative would be to assume a logarithmic
utility function. For example, if we also assume that the player has an initial
capital C from which k has to be paid, the expected utility of playing would be

E U = Σ_{n=1}^∞ ln(C + 2^n − k) 2^{−n} .
Then for C = 10, 000 the maximum bet would be 14. For C = 100 it would be
6, while for C = 10 it is just 4.
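These figures can be checked numerically. The sketch below truncates the infinite sum at a large n (the terms decay geometrically, so this is a harmless approximation) and searches for the largest bet k that a player with logarithmic utility and capital C would accept; the exact value may differ by a unit or so depending on the convention used at indifference:

```python
import math

def expected_log_utility(C, k, n_max=200):
    # Truncated version of sum_n ln(C + 2^n - k) 2^{-n}; later terms are negligible.
    return sum(math.log(C + 2.0**n - k) * 2.0**(-n) for n in range(1, n_max + 1))

def max_bet(C):
    # Largest integer bet k for which playing is no worse than keeping the capital C.
    k = 0
    while k + 1 <= C and expected_log_utility(C, k + 1) >= math.log(C):
        k += 1
    return k

for C in [10, 100, 10000]:
    print(C, max_bet(C))
```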
There is another reason why one may not pay an arbitrary amount to play
this game. The player may not fully internalise the fact (or rather, the promise)
that the coin is unbiased. Another explanation would be that it is not really
believed that the bank can pay an unbounded amount of money. Indeed, if the
bank can only pay amounts up to 2^N , in the linear expected utility scenario, for
a coin with probability p of coming heads, we have

Σ_{n=1}^N 2^n p^{n−1} (1 − p) = 2(1 − p) (1 − (2p)^N ) / (1 − 2p).

For large N and p = 0.45, it turns out that you should only expect a payoff of
about 10 currency units. Similarly, for a fair coin and a bank that can pay at most
2^N = 1024 units, you should only pay around 10 units as well. These are possible
subjective beliefs that an individual might have that would influence their behaviour
when dealing with a formally specified decision problem.
Example 13. Let ⟨a, b⟩ denote a lottery ticket that yields a or b CU with equal
probability. Consider the following sequence:
1. Find x1 such that receiving x1 CU with certainty is equivalent to receiving ⟨a, b⟩.
2. Find x2 such that receiving x2 CU with certainty is equivalent to receiving ⟨a, x1 ⟩.
Example 14. If the utility function is convex, then we would prefer obtaining a ran-
dom reward x rather than a fixed reward y = E(x). Thus, a convex utility function
implies risk-taking. This is illustrated by Figure 2.1, which shows a linear function, x,
a convex function, e^x − 1, and a concave function, ln(x + 1).
For concave functions, the reverse of Jensen's inequality holds (i.e., with ≥
replaced with ≤). If the utility function is concave, then we choose a gamble
giving a fixed reward E[x] rather than one giving a random reward x. Con-
sequently, a concave utility function implies risk aversion. The act of buying
insurance can be related to the concavity of our utility function. Consider the
following example, where we assume individuals are risk-averse, but insurance
companies are risk-neutral.
Figure 2.1: The linear function x, the convex function e^x − 1, and the concave function ln(x + 1), plotted for x ∈ [0, 1].
Figure 2.2: Here are two very simple decision diagrams. In the first, the DM
selects the decision a, which determines the reward r, which determines the
utility U . In the second, the DM only chooses a distribution P , which determines
the rewards.
Example 15 (Insurance). Let x be the insurance cost, h our insurance cover, ε the
probability of needing the cover, and U an increasing utility function (for monetary
values). Then we are going to buy insurance if the utility of losing x with certainty is
greater than the expected utility of losing h with probability ε.
The company has a linear utility, and fixes the premium x high enough so that
x > εh, i.e., so that its expected profit from each contract is positive.
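A minimal sketch of the customer's decision, with a logarithmic (hence concave, risk-averse) utility and illustrative numbers that are not taken from the text:

```python
import math

def buys_insurance(V, x, d, h, eps):
    # Buy iff losing the premium d with certainty is preferred to
    # risking the loss h with probability eps.
    return V(x - d) > eps * V(x - h) + (1 - eps) * V(x)

V = math.log  # a concave utility of wealth
print(buys_insurance(V, x=100.0, d=2.0, h=90.0, eps=0.02))   # True
print(buys_insurance(V, x=100.0, d=10.0, h=90.0, eps=0.02))  # False: premium too high
```

In the first case the premium d = 2 exceeds the expected loss εh = 1.8, so a risk-neutral insurer profits in expectation while the risk-averse customer still prefers to buy.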
Choice nodes, denoted by squares. These are nodes whose values the
decision maker can directly choose. Sometimes there is more than one
decision maker involved.
Value nodes, denoted by diamonds. These are the nodes that the decision
maker is interested in influencing. The utility of the decision maker is
always a function of the value nodes.
Circle nodes are used to denote all other types of variables. These include
deterministic, stochastic, known or unknown variables.
While a full theory of graphical models and decision diagrams is beyond the
scope of this book, we will make use of them frequently to visualise dependencies
between variables in simple problems.
2.4 Exercises
Exercise 3. Preferences are transitive if they are induced by a utility function U :
R → R such that a ≻∗ b iff U (a) > U (b). Give an example of a utility function, not
necessarily mapping the rewards R to R, and a binary relation > such that transitivity
can be violated. Back your example with a thought experiment.
Exercise 6. Consider two urns, each containing red and blue balls. The first urn
contains an equal number of red and blue balls. The second urn contains a randomly
chosen proportion X of red balls, i.e., the probability of drawing a red ball from that
urn is X.
1. Suppose that you were to select an urn, and then choose a random ball from
that urn. If the ball is red, you win 1 CU, otherwise nothing. Show that if your
utility function is increasing with monetary gain, you should prefer the first urn
iff E(X) < 1/2.
2. Suppose that you were to select an urn, and then choose n random balls from
that urn and that urn only. Each time you draw a red ball, you gain 1 CU.
After you draw a ball, you put it back in the urn. Assume that the utility U
is strictly concave and suppose that E(X) = 1/2. Show that you should always
select balls from the first urn.
Hint: Show that for the second urn, E(U | x) is concave for 0 ≤ x ≤ 1 (this can be
done by showing d²/dx² E(U | x) < 0). In fact,

d²/dx² E(U | x) = n(n − 1) Σ_{k=0}^{n−2} (n−2 choose k) [U (k) − 2U (k + 1) + U (k + 2)] x^k (1 − x)^{n−2−k} .
measure that shall satisfy the following basic properties of a probability measure:
(a) null probability: P (∅ | B) = 0, (b) total probability: P (Ω | B) = 1, (c) union of
disjoint subsets: P (A1 ∪A2 | B) = P (A1 | B)+P (A2 | B), (d) conditional probability:
P (A | D) ≤ P (B | D) if and only if P (A ∩ D) ≤ P (B ∩ D).
Exercise 10 (Rational Arthur-Merlin games). You are Arthur, and you wish to pay
Merlin to do a very difficult computation for you. More specifically, you perform a
query q ∈ Q and obtain an answer r ∈ R from Merlin. After he gives you the answer,
you give Merlin a random amount of money m, depending on r, q. In particular, it
is assumed that there exists a unique correct answer r∗ = f (q) and that E(m | r, q) =
Σ_m m P (m | r, q) is maximised by r∗ , i.e., for any r ≠ r∗ we have E(m | r∗ , q) > E(m | r, q).
Assume that Merlin knows P and the function f . Is this sufficient to incentivise Merlin
to respond with the correct answer? If not, what other assumptions or knowledge are
required?
Exercise 11. Assume that you need to travel over the weekend. You wish to decide
whether to take the train or the car. Assume that the train and the car trip cost exactly
the same amount of money. The train trip takes 2 hours. If it does not rain, then
the car trip takes 1.5 hours. However, if it rains the road becomes both more slippery
and more crowded and so the average trip time is 2.5 hours. Assume that your utility
function is equal to the negative amount of time spent travelling: U (t) = −t.
1. Let it be Friday. What is the expected utility of taking the car on Sunday? What
is the expected utility of taking the train on Sunday? What is the Bayes-optimal
decision, assuming you will travel on Sunday?
2. Let it be a rainy Saturday, which we denote by event A. What is your posterior
probability over the two weather stations, given that it has rained, i.e., P (Hi |
A)? What is the new marginal probability of rain on Sunday, i.e., P (B | A)?
What is now the expected utility of taking the car versus taking the train on
Sunday? What is the Bayes-optimal decision?
Exercise 12. Consider the previous example with a nonlinear utility function.
1. One example is U (t) = 1/t, which is a convex utility function. How would you
interpret the utility in that case? Without performing the calculations, can you
tell in advance whether your optimal decision can change? Verify your answer
by calculating the expected utility of the two possible choices.
2. How would you model a problem where the objective involves arriving in time
for a particular appointment?
Chapter 3
Decision problems
3.1 Introduction
In this chapter we describe how to formalise statistical decision problems. These
involve making decisions whose utility depends on an unknown state of the
world. In this setting, it is common to assume that the state of the world is
a fundamental property that is not influenced by our decisions. However, we
can calculate a probability distribution for the state of the world, using a prior
belief and some data, where the data we obtain may depend on our decisions.
A classical application of this framework is parameter estimation. Therein,
we stipulate the existence of a parameterised law of nature, and we wish to
choose a best-guess set of parameters for the law through measurements and
some prior information. An example would be determining the gravitational
attraction constant from observations of planetary movements. These measure-
ments are always obtained through experiments, the automatic design of which
will be covered in later chapters.
The decisions we make will necessarily depend on both our prior belief and
the data we obtain. In the last section of this chapter, we will examine how sensitive
our decisions are to the prior, and how we can choose it so that our decisions
are robust.
more precise by choosing an appropriate utility function. For example, we might put
a value of −1 for carrying the umbrella and a value of −10 for getting wet. In this
example, the only events of interest are whether it rains or not.
P(ω ∈ A) = P (A).
Since the random outcome ω does not depend on our decision a, we must
find a way to connect the two. This can be formalised via a reward function, so
that the reward that we obtain (whether we get wet or not) depends on both
our decision (to take the umbrella) and the random outcome (whether it rains).
r = ρ(ω, a).
Example 17 (Rock paper scissors). Consider the simple game of rock paper scissors,
where your opponent plays a move at the same time as you, so that you cannot
influence his move. The opponent’s moves are Ω = {ωR , ωP , ωS } for rock, paper,
scissors, respectively, which also corresponds to your decision set A = {aR , aP , aS }.
The reward set is R = {Win, Draw, Lose}.
You have studied your opponent for some time and you believe that he is most
likely to play rock P (ωR ) = 3/6, somewhat likely to play paper P (ωP ) = 2/6, and less
likely to play scissors: P (ωS ) = 1/6.
What is the probability of each reward, for each decision you make? Taking the
example of aR , we see that you win if the opponent plays scissors with probability
1/6, you lose if the opponent plays paper (2/6), and you draw if he plays rock (3/6).
Figure 3.1: Decision diagrams for the combined and separated formulation of the
decision problem. Squares denote decision variables, diamonds denote utilities.
All other variables are denoted by circles. Arrows denote the flow of dependency.
In the same way, we obtain a reward distribution for every decision.
Of course, what you play depends on your own utility function. If you prefer winning
over drawing or losing, you could for example have the utility function U (Win) = 1,
U (Draw) = 0, U (Lose) = −1. Then, since Ea U = Σ_{ω∈Ω} U (ω, a) Pa (ω), we have

E_{aR} U = −1/6,    E_{aP} U = 2/6,    E_{aS} U = −1/6,

so expected utility is maximised by playing paper.
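The same calculation, written out as a short Python sketch (this only restates the numbers of the example):

```python
# Opponent model P(ω) and utilities for rock-paper-scissors (Example 17).
P = {"rock": 3/6, "paper": 2/6, "scissors": 1/6}
U = {"Win": 1, "Draw": 0, "Lose": -1}
beats = {"rock": "scissors", "paper": "rock", "scissors": "paper"}

def outcome(my_move, their_move):
    if my_move == their_move:
        return "Draw"
    return "Win" if beats[my_move] == their_move else "Lose"

# Expected utility of each decision a: E_a U = sum_ω U(ρ(ω, a)) P(ω).
for a in P:
    eu = sum(U[outcome(a, w)] * P[w] for w in P)
    print(a, round(eu, 3))   # rock: -1/6, paper: 2/6, scissors: -1/6
```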
The above example illustrates that every decision that we make creates a
corresponding probability distribution on the rewards. While the outcome of
the experiment is independent of the decision, the distribution of rewards is
effectively chosen by our decision.
Expected utility
The expected utility of any decision a ∈ A under P is:
E_{Pa} (U ) = ∫_R U (r) dPa (r) = ∫_Ω U [ρ(ω, a)] dP (ω),

U (P, a) ≜ E_{Pa} U.
The above equation requires that the following technical assumption is sat-
isfied. As usual, we employ the expected utility hypothesis (Assumption 2.3.2).
Thus, we should choose the decision that results in the highest expected utility.
Assumption 3.2.3. The sets {ω | ρ(ω, a) ∈ B} belong to FΩ . That is, ρ is
FΩ -measurable for any a.
The dependency structure of this problem in either formulation can be vi-
sualised in the decision diagram shown in Figure 3.1.
Example 18 (Continuation of Example 16). You are going to work, and it might rain.
The forecast said that the probability of rain (ω1 ) was 20%. What do you do?
a1 : Take the umbrella.
a2 : Risk it!
The reward of a given outcome and decision combination, as well as the respective
utility is given in Table 3.1.
ρ(ω, a)        a1                        a2
ω1             dry, carrying umbrella    wet
ω2             dry, carrying umbrella    dry

U [ρ(ω, a)]    a1     a2
ω1             −1    −10
ω2             −1      0
EP (U | a)     −1     −2

Table 3.1: Rewards, utilities, expected utility for 20% probability of rain.
Value nodes, denoted by diamonds. These are the nodes that the decision
maker is interested in influencing. The utility of the decision maker is
always a function of the value nodes.
Circle nodes are used to denote all other types of variables. These include
deterministic, stochastic, known or unknown variables.
The nodes are connected via directed edges. These denote the dependencies
between nodes. For example, in Figure 3.1(b), the reward is a function of both
ω and a, i.e., r = ρ(ω, a), while ω depends only on the probability distribution
P . Typically, there must be a path from a choice node to a value node, otherwise
nothing the decision maker can do will influence its utility. Nodes belonging to
or observed by different players will usually be denoted by different lines or
colors. In Figure 3.1(b), ω, which is not observed, is shown in a lighter color.
Example 19 (Voting). Assume you wish to estimate the number of votes for different
candidates in an election. The unknown parameters of the problem mainly include:
the percentage of likely voters in the population, the probability that a likely voter is
going to vote for each candidate. One simple way to estimate this is by polling.
Consider a nation with k political parties. Let ω = (ω1 , . . . , ωk ) ∈ [0, 1]k be the
voting proportions for each party. We wish to make a guess a ∈ [0, 1]k . How should
we guess, given a distribution P (ω)? How should we select U and ρ? This depends on
what our goal is, when we make the guess.
If we wish to give a reasonable estimate about the votes of all the k parties, we can
use the squared error: First, set the error vector r = (ω1 − a1 , . . . , ωk − ak ) ∈ [−1, 1]^k .
Then we set U (r) ≜ −‖r‖² , where ‖r‖² = Σ_i |ωi − ai |² .
If on the other hand, we just want to predict the winner of the election, then the
actual percentages of all individual parties are not important. In that case, we can set
r = 1 if arg max_i ωi = arg max_i ai and 0 otherwise, and U (r) = r.
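To see how the two utility functions can disagree, here is a small sketch with invented numbers (none of these figures come from the text):

```python
import numpy as np

omega = np.array([0.42, 0.38, 0.20])   # hypothetical true voting proportions
a = np.array([0.40, 0.41, 0.19])       # our guess

# Squared-error utility: U(r) = -||r||^2 with r = omega - a.
u_squared = -np.sum((omega - a) ** 2)

# Winner-prediction utility: 1 if we name the correct winner, 0 otherwise.
u_winner = float(np.argmax(omega) == np.argmax(a))

print(u_squared)   # -0.0014: the guess is numerically very close
print(u_winner)    # 0.0: but it predicts the wrong winner
```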
Given the above, instead of the expected utility, we consider the ex-
pected loss, or risk.
The maximisation over decisions is usually not easy. However, there exist
a few cases where it is relatively simple. The first of those is when the utility
function is the negative squared error.
The decision a = EP (ω)
maximises the expected utility U (P, a), under the technical assumption that
∂/∂a ‖ω − a‖² is measurable with respect to FR .
Taking derivatives, due to the measurability assumption, we can swap the order
of differentiation and integration and obtain
∂/∂a ∫_Ω ‖ω − a‖² dP (ω) = ∫_Ω ∂/∂a ‖ω − a‖² dP (ω)
                         = 2 ∫_Ω (a − ω) dP (ω)
                         = 2 ∫_Ω a dP (ω) − 2 ∫_Ω ω dP (ω)
                         = 2a − 2 E(ω).
Setting the derivative equal to 0 and noting that the utility is concave, we see
that the expected utility is maximised for a = EP (ω).
Another simple example is the negative absolute error, where U (ω, a) = −|ω − a|. The
solution in this case differs significantly from the squared error. As can be
seen from Figure 3.2(a), for absolute loss, the optimal decision is to choose the
a that is closest to the most likely ω. Figure 3.2(b) illustrates the finding of
Theorem 3.3.1.
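A quick numerical illustration of the two cases, for a two-point outcome space Ω = {0, 1}; the probability 0.75 is an arbitrary choice for the sketch:

```python
import numpy as np

p1 = 0.75                       # P(ω = 1); P(ω = 0) = 1 - p1
grid = np.linspace(0, 1, 1001)  # candidate decisions a in [0, 1]

eu_quadratic = -((1 - p1) * grid**2 + p1 * (1 - grid)**2)
eu_absolute = -((1 - p1) * np.abs(grid) + p1 * np.abs(1 - grid))

print(grid[np.argmax(eu_quadratic)])  # 0.75 = E(ω), as in Theorem 3.3.1
print(grid[np.argmax(eu_absolute)])   # 1.0, the value closest to the most likely ω
```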
Given two probability measures P, Q on Ω and α ∈ [0, 1], we write

Zα ≜ αP + (1 − α)Q    (3.3.2)

to mean the probability measure such that Zα (A) = αP (A) + (1 − α)Q(A) for
any A ∈ FΩ . For any fixed choice a, the expected utility varies linearly with α:
Remark 3.3.1 (Linearity of the expected utility). If Zα is as defined in (3.3.2),
then, for any a ∈ A,

U (Zα , a) = α U (P, a) + (1 − α) U (Q, a).
Figure 3.2: Expected utility curves for different values of P (ω = 0), as the decision a varies in [0, 1]; panel (a) shows the absolute error, panel (b) the quadratic error.
Proof.

U (Zα , a) = ∫_Ω U (ω, a) dZα (ω)
           = α ∫_Ω U (ω, a) dP (ω) + (1 − α) ∫_Ω U (ω, a) dQ(ω)
           = α U (P, a) + (1 − α) U (Q, a).
U ∗ [Zα ] ≤ α U ∗ (P ) + (1 − α) U ∗ (Q).
Figure 3.3: Expected utility lines of fixed decisions and the Bayes-optimal utility U ∗ (P ), as the prior P varies.
Proof. From the definition of the expected utility (3.3.1), for any decision a ∈ A,
U (Zα , a) = α U (P, a) + (1 − α) U (Q, a) ≤ α U ∗ (P ) + (1 − α) U ∗ (Q). Taking the
supremum over a on the left-hand side gives the result.

As we have proven, the expected utility is linear with respect to Zα . Thus, for
any fixed action a we obtain a line as those shown in Fig. 3.3. By Theorem 3.3.2,
the Bayes-optimal utility is convex. Furthermore, the minimising decision for
any Zα is tangent to the Bayes-optimal utility at the point (Zα , U ∗ (Zα )). If
we take a decision that is optimal with respect to some Z, but the distribution
is in fact Q ≠ Z, then we are not far from the optimal, if Q, Z are close and
U ∗ is smooth. Consequently, we can trivially lower bound the Bayes utility by
examining any arbitrary finite set of decisions Â ⊆ A. That is,

U ∗ (P ) ≥ max_{a∈Â} U (P, a),

while a corresponding upper bound
holds due to convexity. The two bounds suggest an algorithm for successive ap-
proximation of the Bayes-optimal utility, by looking for the largest gap between
the lower and the upper bounds.
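A sketch of the two bounds for a small made-up decision problem (the utility matrix below is purely illustrative); the lower bound uses a finite subset of decisions and the upper bound uses convexity between the two extreme priors:

```python
import numpy as np

# Utility matrix U[ω, a] for two outcomes and three decisions (invented numbers).
U = np.array([[0.0, -0.4, -1.0],    # ω = 0
              [-1.0, -0.5, 0.0]])   # ω = 1

def expected_U(p1, a):
    # U(P, a) for the prior with P(ω = 1) = p1.
    return (1 - p1) * U[0, a] + p1 * U[1, a]

p = 0.3
U_star = max(expected_U(p, a) for a in range(U.shape[1]))   # exact Bayes-optimal utility

A_hat = [0, 2]                                   # a finite subset of decisions
lower = max(expected_U(p, a) for a in A_hat)     # U*(P) >= lower

# Convexity: U*((1 - p) δ_0 + p δ_1) <= (1 - p) U*(δ_0) + p U*(δ_1).
upper = (1 - p) * U[0].max() + p * U[1].max()

print(lower, U_star, upper)   # lower <= U*(P) <= upper
```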
This theorem should not be applied naively. It only states that if we know
P then the expected utility of the best fixed/deterministic decision a∗ ∈ A
cannot be increased by randomising between decisions. For example, it does
not make sense to apply this theorem to cases where P itself is completely
or partially unknown (e.g., when P is chosen by somebody else and its value
remains hidden to us).
U (ω, a)          a1     a2
ω1                −1      0
ω2                10      1
E(U | P, a)      4.5    0.5
minω U (ω, a)     −1      0

Table 3.2: Utility function, expected utility and maximin utility of Example 21.
can essentially be seen as how much utility we would be able to obtain, if we were
to make a decision a first, and nature were to select an adversarial decision ω
later.
On the other hand, the minimax value is U ∗ ≜ min_ω max_a U (ω, a) = max_a U (ω∗ , a),
where ω∗ ≜ arg min_ω max_a U (ω, a) is the worst-case choice nature could make,
if we were to select our own decision a after its choice was revealed to us.
To illustrate this, let us consider the following example.
Example 21. You consider attending an open air concert. The weather forecast re-
ports 50% probability of rain. Going to the concert (a1 ) will give you a lot of pleasure
if it doesn’t rain (ω2 ), but in case of rain (ω1 ) you actually would have preferred to stay
at home (a2 ). Since in general you prefer nice weather to rain, you prefer ω2 to ω1
even if you decide not to go. The reward of a given outcome-decision combination, as
well as the respective utility is given in Table 3.2. We see that a1 maximises expected
utility. However, under a worst-case assumption this is not the case, i.e., the maximin
solution is a2 .
U ∗ ≥ U (ω ∗ , a∗ ) ≥ U∗ . (3.4.1)
L(ω, a)          a1     a2
ω1                1      0
ω2                0      9
E(L | P, a)      0.5    4.5
maxω L(ω, a)      1      9

Table 3.3: Regret for Example 21.
Regret Instead of calculating the expected utility for each possible decision,
we could instead calculate how much utility we would have obtained if we had
made the best decision in hindsight. Consider, for example the problem in
Table 3.2. There the optimal action is either a1 or a2 , depending on whether
we accept the probability P over Ω, or adopt a worst-case approach. However,
after we make a specific decision, we can always look at the best decision we
could have made given the actual outcome ω.
L(ω, a) ≜ max_{a′} U (ω, a′ ) − U (ω, a).
As an example let us revisit Example 21. Given the regret of each decision-
outcome pair, we can determine the decision minimising expected regret E(L |
P, a) and minimising maximum regret maxω L(ω, a), analogously to expected
utility and minimax utility. Table 3.3 shows that the choice minimising regret
either in expectation or in the minimax sense is a1 (going to the concert). Note
that this is a different outcome than before when we considered utility, which
shows that the concept of regret may result in different decisions.
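The regret table can be computed mechanically from the utility table; a minimal sketch using the numbers of Example 21:

```python
import numpy as np

# Utility U[ω, a] from Table 3.2 and the prior P(ω1) = P(ω2) = 0.5.
U = np.array([[-1.0, 0.0],    # ω1: rain
              [10.0, 1.0]])   # ω2: no rain
P = np.array([0.5, 0.5])

# Regret L(ω, a) = max_a' U(ω, a') - U(ω, a).
L = U.max(axis=1, keepdims=True) - U

expected_regret = P @ L             # [0.5, 4.5]: minimised by a1
worst_case_regret = L.max(axis=0)   # [1.0, 9.0]: also minimised by a1
print(L)
print(expected_regret, worst_case_regret)
```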
Figure 3.4: Simultaneous two-player stochastic game. The first player (nature)
chooses ω, and the second player (the decision maker) chooses a. Then the
second player obtains utility U (ω, a).
Figure 3.5: Simultaneous two-player stochastic game. The first player (nature)
chooses ξ, and the second player (the decision maker) chooses σ. Then ω ∼ ξ
and a ∼ σ and the second player obtains utility U (ω, a).
shown in Figure 3.4. There are other variations of such games, however. For
example, their moves may be revealed after they have played. This is important
in the case where the game is played repeatedly. However, what is usually
revealed is not the belief ξ, which is something assumed to be internal to player
one, but ω, the actual decision made by the first player. In other cases, we
might have that U itself is not known, and we only observe U (ω, a) for the
choices made.
Minimax utility, regret and loss In the following we again consider strate-
gies as defined in Definition 3.4.1. If the decision maker knows the outcome,
then the additional flexibility gained by randomising over actions does not help.
As we showed for the general case of a distribution over Ω, a simple decision is
as good as any randomised strategy:
Remark 3.4.1. For each ω, there is some a ∈ A such that U (ω, a) ≥ U (ω, σ) for all strategies σ.    (3.4.2)
What follows are some rather trivial remarks connecting regret with utility
in various cases.
Remark 3.4.2. L(ω, σ) = Σ_a σ(a) L(ω, a) ≥ 0.
Proof.

L(ω, σ) = max_{σ′} U (ω, σ′ ) − U (ω, σ) = max_{σ′} U (ω, σ′ ) − Σ_a σ(a) U (ω, a)
        = Σ_a σ(a) [max_{σ′} U (ω, σ′ ) − U (ω, a)] ≥ 0.
Remark 3.4.3. L(ω, σ) = max_a U (ω, a) − U (ω, σ).
U         ω1     ω2
a1         1     −1
a2         0      0

Table 3.4: Utility function for Example 22.
Proof. As (3.4.2) shows, for any fixed ω, the best decision is always deterministic, so that

Σ_{a′} σ(a′ ) L(ω, a′ ) = Σ_{a′} σ(a′ ) [max_{a∈A} U (ω, a) − U (ω, a′ )]
                        = max_{a∈A} U (ω, a) − Σ_{a′} σ(a′ ) U (ω, a′ ).
Example 22 (An even-money bet). Consider the decision problem described in Table 3.4. The respective loss is given in Table 3.5.

L         ω1     ω2
a1         0      1
a2         1      0

Table 3.5: The loss (regret) for Example 22.

The maximum regret of a strategy σ can be written as

max_ω L(ω, σ) = max_ω Σ_i σ(ai ) L(ω, ai ) = max_ω Σ_i σ(ai ) I {ω ≠ ωi } ,

since L(ω, ai ) = 1 when ω ≠ ωi and 0 otherwise. Note that max_ω L(ω, σ) ≥ 1/2 and that
equality is obtained iff σ(a1 ) = σ(a2 ) = 1/2, giving minimax regret L∗ = 1/2.
where the solution exists as long as A and Ω are finite, which we will assume
in the following.
Expected regret
We can now define the expected regret for a given pair of distributions ξ, σ
as
L(ξ, σ) = max_{σ′} Σ_ω ξ(ω) [U (ω, σ′ ) − U (ω, σ)] = max_{σ′} U (ξ, σ′ ) − U (ξ, σ).
The minimax and maximin values do not always coincide. The following
theorem gives a condition under which the game does have a value.
Theorem 3.4.2. If there exist distributions ξ∗ , σ∗ and C ∈ R such that

U (ξ∗ , σ) ≤ C ≤ U (ξ, σ∗ )  for all ξ, σ,

then

U ∗ = U∗ = U (ξ∗ , σ∗ ) = C.

Proof. Since C ≤ U (ξ, σ∗ ) for all ξ we have

C ≤ min_ξ U (ξ, σ∗ ) ≤ max_σ min_ξ U (ξ, σ) = U∗ .

Similarly,

C ≥ max_σ U (ξ∗ , σ) ≥ min_ξ max_σ U (ξ, σ) = U ∗ .

Since U∗ ≤ U ∗ always holds, the two inequalities give U∗ = U ∗ = C, and the value is attained at (ξ∗ , σ∗ ).
where everything has been written in matrix form. In fact, one can show that
vξ = vσ , thus obtaining Theorem 3.4.3.
To understand the connection of two-person games with Bayesian decision
theory, take another look at Figure 3.3, seeing the risk as negative expected
utility, or as the opponent’s gain. Each of the decision lines represents nature’s
gain as she chooses different prior distributions, while we keep our policy σ fixed.
The bottom horizontal line that would be tangent to the Bayes-optimal utility
curve would be minimax: if nature were to change priors, it would not increase
its gain, since the line is horizontal. On the other hand, if we were to choose a
different tangent line, we would only increase nature’s gain (and decrease our
utility).
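For finite Ω and A the maximin strategy can be found by linear programming, as suggested by the matrix formulation above. The following sketch is only an illustration (it uses scipy and the utility matrix of Example 21; it is not the book's own code): it maximises the guaranteed value v subject to the strategy σ achieving at least v against every ω.

```python
import numpy as np
from scipy.optimize import linprog

# Utility matrix U[ω, a]: rows are outcomes, columns are decisions (Example 21).
U = np.array([[-1.0, 0.0],
              [10.0, 1.0]])
n_omega, n_a = U.shape

# Variables x = (σ(a_1), ..., σ(a_n), v).  Maximise v, i.e. minimise -v.
c = np.zeros(n_a + 1)
c[-1] = -1.0

# Constraints v - Σ_a σ(a) U(ω, a) <= 0 for every ω.
A_ub = np.hstack([-U, np.ones((n_omega, 1))])
b_ub = np.zeros(n_omega)

# Σ_a σ(a) = 1, σ(a) >= 0, v unrestricted.
A_eq = np.array([[1.0] * n_a + [0.0]])
b_eq = np.array([1.0])
bounds = [(0, 1)] * n_a + [(None, None)]

res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
sigma, value = res.x[:n_a], res.x[-1]
print(sigma, value)   # here the maximin strategy is the pure action a2, with value 0
```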
P ≜ {Pω | ω ∈ Ω} .
Now, consider the case where we take an observation x from the true model Pω∗
before making a decision. We can represent the dependency of our decision on
the observation by making our decision a function of x.
This is the standard Bayesian framework for decision making. It may be slightly
more intuitive in some cases to use the notation ψ(x | ω), in order to emphasize
that this is a conditional distribution. However, there is no technical difference
between the two notations.
When the set of policies includes all constant policies, then there is a policy
π ∗ at least as good as the best fixed decision a∗ . This is formalized in the
following remark.
Remark 3.5.1. Let Π denote a set of policies π : S → A. If for each a ∈ A there
is a π ∈ Π such that π(x) = a ∀x ∈ S, then maxπ∈Π Eξ (U | π) ≥ maxa∈A Eξ (U |
a).
Proof. The proof follows by setting Π0 to be the set of constant policies. The
result follows since Π0 ⊂ Π.
The expected utility of a policy π under the prior ξ can be written as

U (ξ, π) = ∫_Ω U (ω, π) dξ(ω),    U ∗ (ξ) ≜ sup_π U (ξ, π) = U (ξ, π∗ ).
We wish to construct the Bayes decision rule, that is, the policy with maximal
ξ-expected utility. However, doing so by examining all possible policies is cum-
bersome, because (usually) there are many more policies than decisions. It is
however, easy to find the Bayes decision for each possible observation. This is
because it is usally possible to rewrite the expected utility of a policy in terms
of the posterior distribution. While this is trivial to do when the outcome and
observation spaces are finite, it can be extended to the general case as shown in
the following theorem.
U (ξ, π) = E {U [ω, π(x)]} = ∫_Ω ∫_S U [ω, π(x)] dPω (x) dξ(ω),

which can be rewritten in terms of the posterior as

U (ξ, π) = ∫_S ∫_Ω U [ω, π(x)] dξ(ω | x) dPξ (x),    (3.5.2)

where Pξ (x) = ∫_Ω Pω (x) dξ(ω).
U (ξ, π) = ∫_Ω ∫_S U [ω, π(x)] p(x | ω) p(ω) dν(x) dµ(ω)
         = ∫_Ω ∫_S h(ω, x) dν(x) dµ(ω),

where h(ω, x) ≜ U [ω, π(x)] p(x | ω) p(ω).
prior distribution on the family as well as a training data set, and we wish to
classify optimally according to our belief. In the last form of the problem, we
are given a set of policies π : X → Y and we must choose the one with highest
expected performance. The two latter forms of the problem are equivalent when
the set of policies contains all Bayes decision rules for a specific model family.
Ut ≜ I {yt = at } .
The probability P (yt | xt ) is the posterior probability of the class given the
observation xt . If we wish to maximise expected utility, we can simply choose
at ∈ arg max_y P (yt = y | xt ).
This defines a particular, simple policy. In fact, for two-class problems with
Y = {0, 1}, such a rule can often be visualised as a decision boundary in X , on
one side of which we decide for class 0 and on the other side for class 1.
ξ(ω | S) = Pω (y1 , . . . , yn | x1 , . . . , xn ) ξ(ω) / Σ_{ω′∈Ω} Pω′ (y1 , . . . , yn | x1 , . . . , xn ) ξ(ω′ ),

which we can then use to calculate the policy that maximises our expected utility. For a given ω, we
can indeed compute

U (ω, π) = Σ_{x,y} U (y, π(x)) Pω (y | x) Pω (x).
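A schematic version of this classification computation for a toy family of two models (everything below, including the models themselves, is an invented illustration rather than an example from the text):

```python
import numpy as np

# Two candidate models ω ∈ {0, 1} for binary features x and labels y.
# Each model specifies P_ω(y = 1 | x); the marginal of x is assumed identical
# across models, so it cancels in the posterior.
p_y1_given_x = {0: np.array([0.2, 0.9]),   # model ω = 0
                1: np.array([0.6, 0.4])}   # model ω = 1
prior = np.array([0.5, 0.5])

# Posterior over models given a small training set S of (x, y) pairs.
S = [(0, 0), (1, 1), (1, 1), (0, 0)]
likelihood = np.array([
    np.prod([p_y1_given_x[w][x] if y == 1 else 1 - p_y1_given_x[w][x] for x, y in S])
    for w in (0, 1)
])
posterior = prior * likelihood
posterior /= posterior.sum()

# Bayes classification rule: for each x, choose the label with highest posterior probability.
for x in (0, 1):
    p_y1 = np.dot(posterior, [p_y1_given_x[w][x] for w in (0, 1)])
    print(x, int(p_y1 > 0.5), round(p_y1, 3))
```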
Any policy, when applied to large-scale, real world problems, has certain exter-
nalities. This implies that considering only the decision maker’s utility is not
sufficient. One such issue is fairness.
This concerns desirable properties of policies applied to a population of in-
dividuals. For example, college admissions should be decided on variables that
inform about merit, but fairness may also require taking into account the fact
that certain communities are inherently disadvantaged. At the same time, a
person should not feel that someone else in a similar situation obtained an un-
fair advantage. All this must be taken into account while still caring about
optimizing the decision maker’s utility function. As another example, consider
mortgage decisions: while lenders should take into account the creditworthiness
of individuals in order to make a profit, society must ensure that they do not
unduly discriminate against socially vulnerable groups.
Recent work in fairness for statistical decision making in the classifica-
tion setting has considered two main notions of fairness. The first uses (con-
ditional) independence constraints between a sensitive variable (such as eth-
nicity) and other variables (such as decisions made). The second type en-
sures that decisions are meritocratic, so that better individuals are favoured,
but also smoothness4 in order to avoid elitism. While a thorough discus-
sion of fairness is beyond the scope of this book, it is useful to note that
some of these concepts are impossible to strictly achieve simultaneously, but
may be approximately satisfied by careful design of the policy. The recent
work by Dwork et al. [2012], Chouldechova [2016], Corbett-Davies et al. [2017],
Kleinberg et al. [2016], Kilbertus et al. [2017], Dimitrakakis et al. [2017] goes
much more deeply into this topic.
4 More precisely, Lipschitz conditions on the policy.
Pω (xn | x^{n−1} ) = Pω (x^n ) / Pω (x^{n−1} ),

where x^n = (x1 , . . . , xn ) denotes the first n observations.
Posterior recursion

ξn (ω) ≜ ξ0 (ω | x^n ) = Pω (x^n ) ξ0 (ω) / Pξ0 (x^n )
       = ξn−1 (ω | xn ) = Pω (xn | x^{n−1} ) ξn−1 (ω) / Pξn−1 (xn | x^{n−1} ),
where ξt is the belief at time t. Here Pξn (· | ·) = ∫_Ω Pω (· | ·) dξn (ω) is the
marginal distribution with respect to the n-th posterior.
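For a finite Ω and independent observations, the recursion is easy to implement; the sketch below (with an invented family of Bernoulli models) updates the belief one observation at a time and checks that the result matches the batch posterior:

```python
import numpy as np

# A finite family of Bernoulli models and a uniform prior ξ_0.
omegas = np.array([0.2, 0.5, 0.8])
xi = np.ones(len(omegas)) / len(omegas)

data = [1, 1, 0, 1, 1]

for x_n in data:
    # ξ_n(ω) ∝ P_ω(x_n | x^{n-1}) ξ_{n-1}(ω); for i.i.d. models this is just P_ω(x_n).
    likelihood = omegas if x_n == 1 else 1 - omegas
    xi = likelihood * xi
    xi /= xi.sum()              # normalising by the marginal P_{ξ_{n-1}}(x_n)

# Batch computation ξ_0(ω | x^n) for comparison.
batch = omegas**4 * (1 - omegas)
batch /= batch.sum()
print(xi, batch)                # identical up to floating-point error
```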
3.6 Summary
In this chapter, we introduced a general framework for making decisions a ∈ A
whose optimality depends on an unknown outcome or parameter ω. We saw that,
3.7 Exercises
The first part of the exercises considers problems where we are simply given
some distribution over Ω. In the second part, the distribution is a posterior
distribution that depends on observations x.
Exercise 13. Assume ω is drawn from ξ with ξ(ω) = 1/11 for all ω ∈ Ω. Calculate and
plot the expected utility U (ξ, a) = Σ_ω ξ(ω) U (ω, a) for each a. Report max_a U (ξ, a).
(a) Calculate and plot the expected utility when π(a) = 1/11 for all a, reporting
values for all ω.
(b) Find max_π min_ξ U (ξ, π).
Hint: Use the linear programming formulation, adding a constant to the utility
matrix U so that all elements are non-negative.
Exercise 16. Consider the definition of rules that, for some ε > 0, select an a maximising

P ({ω : U (ω, a) > sup_{d′∈A} U (ω, d′ ) − ε}) .
Prove that this is indeed a statistical decision problem, i.e., it corresponds to max-
imising the expectation of some utility function.
F ≜ {fω | ω ∈ Ω} ,

such that fω is the binomial probability mass function with parameter ω (with
the number of draws n being implied). Consider the parameter set Ω = {0, 0.1, 0.2, . . . , 1}.
Let ξ be the uniform distribution on Ω, such that ξ(ω) = 1/11 for all ω ∈ Ω.
Further, let the decision set be A = [0, 1].
Exercise 17. What is the decision a∗ maximising U (ξ, a) = Σ_ω ξ(ω) U (ω, a) and what
is U (ξ, a∗ )?
Exercise 18. In the same setting, we now observe the sequence x = (x1 , x2 , x3 ) =
(1, 0, 1).
1. Plot the posterior distribution ξ(ω | x) and compare it to the posterior we would
obtain if our prior on ω was ξ ′ = Beta(2, 2).
2. Find the decision a∗ maximising the a posteriori expected utility
Eξ (U | a, x) = Σ_ω U (ω, a) ξ(ω | x).
Exercise 19. In the same setting, we consider nature to be adversarial. Once more,
we observe x = (1, 0, 1). Assume that nature can choose a prior among a set of priors
Ξ = {ξ1 , ξ2 }. Let ξ1 (ω) = 1/11 and ξ2 (ω) = ω/5.5 for each ω.
1. Calculate and plot the value for deterministic decisions a: min_{ξ∈Ξ} Eξ (U | a, x).
2. Find min_{ξ∈Ξ} max_{a∈A} Eξ (U | a).
Hint: Apart from the adversarial prior selection, this is very similar to the previous
exercise.
We now consider customers with some baseline income level x ∈ S. For sim-
plicity, we assume that the only possible income levels are S = {15, 20, 25, . . . , 60},
with K = 10 possible incomes. Let V : R → R denote the utility function of a
customer. Customers who are interested in the insurance product will buy it if
and only if:
V (x − d) > εV (x − h) + (1 − ε)V (x). (3.7.1)
We make the simplifying assumption that the utility function is the same for
all customers, and that it has the following form:
V (x) = ln x if x ≥ 1, and V (x) = 1 − (x − 2)² otherwise. (3.7.2)

Customers who are not interested in the insurance product will not buy it no
matter what the price.
There is some unknown probability distribution Pω (x) over the income level,
such that the probability of n people having incomes x^T = (x1 , . . . , xn ) is
Pω (x^T ) = Π_{i=1}^n Pω (xi ). We have two data sources for this. The first is a model
of the general population ω1 not working in the high-tech industry, and the second
is a model of employees in the high-tech industry, ω2 . The models are summarised
in the table below. Together, these two models form a family of distributions.

Income level     15   20   25   30   35   40   45   50   55   60
Pω1 (x) (%)       5   10   12   13   11   10    8   10   11   10
Pω2 (x) (%)       8    4    1    6   11   14   16   15   13   12
Exercise 20 (50). Show that the expected utility for a given ω is the expected gain
from a buying customer, times the probability that an interested customer will have
an income x such that they would buy our insurance.
U (ω, d) = (d − εh) Σ_{x∈S} Pω (x) I {V (x − d) > εV (x − h) + (1 − ε)V (x)} . (3.7.3)
Let h = 150 and ε = 10^{−3} . Plot the expected utility for varying d, for the two
possible ω. What is the optimal price level if the incomes of all interested customers
are distributed according to ω1 ? What is the optimal price level if they are distributed
according to ω2 ?
Exercise 21 (20). According to our intuition, customers interested in our product are
much more likely to come from the high-tech industry than from the general population.
For that reason, we have a prior probability ξ(ω1 ) = 1/4 and ξ(ω2 ) = 3/4 over the
parameters Ω of the family P. More specifically, and in keeping with our previous
ω∗ ∼ ξ (3.7.4)
x^T | ω = ω∗ ∼ Pω∗ (3.7.5)
That is, the data is drawn from one unknown model ω ∗ ∈ Ω. This can be thought
of as an experiment where nature randomly selects ω ∗ with probability ξ and then
generates the data from the corresponding model Pω∗ . Plot the expected utility under
this prior as the premium d varies. What is the optimal expected utility and premium?
Exercise 22 (20). Instead of fully relying on our prior, the company decides to per-
form a random survey of 1000 people. We asked whether they would be interested in
the insurance product (as long as the price is low enough). If they were interested,
we asked them what their income is. Only 126 people were interested, with income
levels given in Table 3.7. Each column of the table shows the stated income and
the number of people reporting it. Let xT = {x1 , x2 , . . .} be the set of data we have
Income    15   20   25   30   35   40   45   50   55   60
Number     7    8    7   10   15   16   13   19   17   14

Table 3.7: Reported incomes of the 126 interested respondents.
collected. Assuming that the responses are truthful, calculate the posterior prob-
ability ξ(ω | xT ), assuming that the only possible models of income distribution are
the two models ω1 , ω2 used in the previous exercises. Plot the expected utility under
the posterior distribution as d varies. What is the maximum expected utility we can
obtain?
Exercise 23 (30? – Bonus exercise ). Having only two possible models is somewhat
limiting, especially since neither of them might correspond to the income distribution
of people interested in our insurance product. How could this problem be rectified?
Describe the idea and implement it. When would you expect this to work better?
Assumption 3.7.2. When the smoking history of the patient is known, the
development of cancer or ACS are independent.
Tests. One can perform an ECG to test for ACS. An ECG test has sensitivity
of 66.6% (i.e. it correctly detects 2⁄3 of all patients that suffer from ACS), and a
specificity of 75% (i.e. 1⁄4 of patients that do not have ACS, still test positive).
An X-ray can diagnose lung cancer with a sensitivity of 90% and a specificity
of 90%.
Assumption 3.7.3. Repeated applications of a test produce the same result for
the same patient, i.e. that randomness is only due to patient variability.
Assumption 3.7.4. The existence of lung cancer does not affect the probability
that the ECG will be positive. Conversely, the existence of ACS does not affect
the probability that the X-ray will be positive.
The main problem we want to solve is how to perform experiments or tests, so as to:
diagnose the patient,
use as few resources as possible,
and make sure the patient lives.
This is a problem in experiment design. We start from the simplest case, and
look at a couple of examples where we only observe the results of some tests. We
then examine the case where we can select which tests to perform.
Exercise 24. In this exercise, we only worry about making inferences from different
tests results.
1. What does the above description imply about the dependencies between the
patient condition, smoking and test results? Draw a belief network for the
above problem, with the following events (i.e. variables that can be either true
or false)
A: ACS
C: Lung cancer.
S: Smoking
E: Positive ECG result.
X: Positive X-ray result.
2. What is the probability that the patient suffers from ACS if S = true?
3. What is the probability that the patient suffers from ACS if the ECG result is
negative?
4. What is the probability that the patient suffers from ACS if the X-ray result is
negative and the patient is a smoker?
Exercise 25. Now consider the case where you have the choice between tests to
perform. First, you observe S, whether or not the patient is a smoker. Then, you select
a test to make: d1 ∈ {X-ray, ECG}. Finally, you decide whether or not to treat for
ACS: d2 ∈ {heart treatment, no treatment}. An untreated ACS patient may die
with probability 2%, while a treated one dies with probability 0.2%. Treating a non-ACS
patient results in death with probability 0.1%.
Chapter 4

Estimation
4.1 Introduction
In the previous chapter, we have seen how to make optimal decisions with
respect to a given utility function and belief. One important question is how to
compute an updated belief from observations and a prior belief. More generally,
we wish to examine how much information we can obtain about an unknown
parameter from observations, and how to bound the respective estimation error.
While most of this chapter will focus on the Bayesian framework for estimating
parameters, we shall also look at tools for making conclusions about the value
of parameters without making specific assumptions about the data distribution,
i.e., without providing specific prior information.
In the Bayesian setting, we calculate posterior distributions of parameters
given data. The basic problem can be stated as follows. Let P ≜ {Pω | ω ∈ Ω}
be a family of probability measures on (S, FS ) and ξ be our prior probability
measure on (Ω, FΩ ). Given some data x ∼ Pω∗ , with ω ∗ ∈ Ω, how can we
estimate ω ∗ ? The Bayesian approach is to estimate the posterior distribution
ξ(· | x), instead of guessing a single ω ∗ . In general, the posterior measure is a
function ξ(· | x) : FΩ → [0, 1], with
ξ(B | x) = ∫_B Pω (x) dξ(ω) / ∫_Ω Pω (x) dξ(ω).
The posterior distribution allows us to quantify our uncertainty about the un-
known ω ∗ . This in turn enables us to take decisions that take uncertainty into
account.
The first question we are concerned with in this chapter is how to calculate
this posterior for any value of x in practice. If x is a complex object, this may
be computationally difficult. In fact, the posterior distribution can also be a
complicated function. However, there exist distribution families and priors such
that this calculation is very easy, in the sense that the functional form of the
posterior depends upon a small number of parameters. This happens when a
summary of the data that contains all necessary information can be calculated
easily. Formally, this is captured via the concept of a sufficient statistic for the family P = {Pω | ω ∈ Ω}.
This means that the statistic is sufficient if, whenever we obtain the same
value of the statistic for two different datasets x, x′ , then the resulting posterior
distribution over the parameters is identical, independent of the prior distri-
bution. In other words, the value of the statistic is sufficient for computing
the posterior. Interestingly, a sufficient statistic always implies the following factorisation for members of the family: Pω(x) = u(x) v[φ(x), ω] for some functions u, v.
Proof. In the following proof we assume arbitrary Ω. The case when Ω is finite
is technically simpler and is left as an exercise. Let us first assume the existence
of u, v satisfying the equation. Then for any B ∈ FΩ we have
$$\xi(B \mid x) = \frac{\int_B u(x)\, v[\phi(x), \omega]\, d\xi(\omega)}{\int_\Omega u(x)\, v[\phi(x), \omega]\, d\xi(\omega)} = \frac{\int_B v[\phi(x), \omega]\, d\xi(\omega)}{\int_\Omega v[\phi(x), \omega]\, d\xi(\omega)}.$$
For x′ with φ(x) = φ(x′ ), it follows that ξ(B | x) = ξ(B | x′ ), so ξ(· | x) = ξ(· |
x′ ) and φ satisfies the definition of a sufficient statistic.
Conversely, assume that φ is a sufficient statistic. Let µ be a dominating measure so that we can define the densities p(ω) ≜ dξ(ω)/dµ(ω) and p(ω | x) ≜ dξ(ω | x)/dµ(ω), whence
$$P_\omega(x) = \frac{p(\omega \mid x)}{p(\omega)} \int_\Omega P_\omega(x)\, d\xi(\omega).$$
Since φ is sufficient, p(ω | x) depends on x only through φ(x), so the factorisation holds with v[φ(x), ω] = p(ω | x)/p(ω) and u(x) = ∫Ω Pω(x) dξ(ω).
Another example is when we have a finite set of models. Then the sufficient
statistic is always a finite-dimensional vector.
Lemma 4.2.1. Let P = {Pθ | θ ∈ Θ} be a family, where each model Pθ is
a probability measure on X and Θ contains n models. If p ∈ ∆n is a vector
representing our prior distribution, i.e., ξ(θ) = pθ , then the finite-dimensional
vector with entries qθ = pθ Pθ (x) is a sufficient statistic.
Proof. Simply note that the posterior distribution in this case is
$$\xi(\theta \mid x) = \frac{q_\theta}{\sum_{\theta'} q_{\theta'}},$$
which depends on the data x only through the vector q.
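To make the lemma concrete, here is a minimal sketch in Python (not from the book) computing the posterior over a finite family of Bernoulli models; the candidate parameters, the uniform prior and the observed sequence are hypothetical choices.

```python
# A minimal sketch (not from the book): posterior over a finite family of
# Bernoulli models, illustrating Lemma 4.2.1.
import numpy as np

def finite_posterior(prior, likelihoods):
    """Return the posterior xi(theta | x) given prior p_theta and P_theta(x)."""
    q = prior * likelihoods          # q_theta = p_theta * P_theta(x)
    return q / q.sum()               # normalise: xi(theta | x) = q_theta / sum q

# Example: three Bernoulli models with parameters 0.2, 0.5, 0.8 and a uniform prior.
theta = np.array([0.2, 0.5, 0.8])
prior = np.ones(3) / 3
x = np.array([1, 1, 0, 1])           # observed sequence
likelihoods = np.prod(theta[:, None] ** x * (1 - theta[:, None]) ** (1 - x), axis=1)
print(finite_posterior(prior, likelihoods))
```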
While conjugate families exist for statistics with unbounded dimension, here
we shall focus on finite-dimensional families. We will start with the simplest
example, the Bernoulli-Beta pair.
[Figure 4.1: graphical model with parameter ω and observation x.]
The structure of the graphical model in Figure 4.1 shows the dependencies
between the different variables of the model.
When considering n independent trials of a Bernoulli distribution, the set of possible outcomes is S = {0, 1}^n. Then $P_\omega(x^n) = \prod_{t=1}^n P_\omega(x_t)$ is the probability of observing the exact sequence x^n under the Bernoulli model. However, in many cases we are interested in the probability of observing a particular number of successes (outcome 1) and failures (outcome 0) and do not care about the actual order. For this we need to count the number of sequences in which, out of n trials, we have k successes. This number is given by the binomial coefficient, defined as
$$\binom{n}{k} = \frac{n!}{k!\,(n-k)!}, \qquad k, n \in \mathbb{N},\ n \ge k.$$
Now we are ready to define the binomial distribution, which is a scaled product-Bernoulli distribution for multiple independent outcomes, where we want to measure the probability of a particular number of successes or failures. Thus, the Bernoulli is a distribution on a sequence of outcomes, while the binomial is a distribution on the total number of successes. That is, let $s = \sum_{t=1}^n x_t$ be the total number of successes observed up to time n. Then we are interested in the probability that there are exactly k successes out of n trials.
Definition 4.3.2 (Binomial distribution). The binomial distribution with pa-
rameters ω and n has outcomes S = {0, 1, . . . , n}. Its probability function is
given by
$$P_\omega(s = k) = \binom{n}{k} \omega^k (1 - \omega)^{n-k}. \qquad (4.3.1)$$
If s is drawn from a binomial distribution with parameters ω, n, we write s ∼
Binom(ω, n).
Now let us return to the Bernoulli distribution. If the parameter ω is known,
then observations are independent of each other. However, this is not the case
when ω is unknown. For example, if Ω = {ω1 , ω2 }, then the probability of
observing a sequence xn is given by
$$P(x^n) = \sum_{\omega \in \Omega} P(x^n \mid \omega)\, P(\omega) = \prod_{t=1}^n P(x_t \mid \omega_1)\, P(\omega_1) + \prod_{t=1}^n P(x_t \mid \omega_2)\, P(\omega_2),$$
which in general is different from $\prod_{t=1}^n P(x_t)$. For the general case where Ω = [0, 1], the question is whether there is a prior distribution that can succinctly describe our uncertainty about the parameter. Indeed, there is, and it is called the Beta distribution. It is defined on the interval [0, 1] and has two parameters that determine its density. Because the Bernoulli distribution has a parameter in [0, 1], the outcomes of the Beta can be used to specify a prior on the parameters of the Bernoulli distribution.
Definition 4.3.3 (Beta distribution). The Beta distribution has outcomes ω ∈
Ω = [0, 1] and parameters α0 , α1 > 0, which we will summarize in a vector
α = (α1 , α0 ). Its probability density function is given by
$$p(\omega \mid \alpha) = \frac{\Gamma(\alpha_0 + \alpha_1)}{\Gamma(\alpha_0)\,\Gamma(\alpha_1)}\, \omega^{\alpha_1 - 1} (1 - \omega)^{\alpha_0 - 1}, \qquad (4.3.2)$$
where $\Gamma(\alpha) = \int_0^\infty u^{\alpha-1} e^{-u}\, du$ is the Gamma function (see also Appendix C.1.2).
If ω is distributed according to a Beta distribution with parameters α1 , α0 , we
write ω ∼ Beta(α1 , α0 ).
We note that the Gamma function is an extension of the factorial (see also Appendix C.1.2), so that for n ∈ ℕ it holds that Γ(n + 1) = n!. That way, the first term in (4.3.2) corresponds to a generalized binomial coefficient. The dependencies between the parameters are shown in the graphical model of Figure 4.2.
A Beta distribution with parameter α has expectation
$$E(\omega \mid \alpha) = \frac{\alpha_1}{\alpha_0 + \alpha_1}$$
and variance
$$V(\omega \mid \alpha) = \frac{\alpha_1 \alpha_0}{(\alpha_1 + \alpha_0)^2 (\alpha_1 + \alpha_0 + 1)}.$$
Figure 4.3 shows the density of a Beta distribution for four different param-
eter vectors. When α0 = α1 = 1, the distribution is equivalent to a uniform
one.
[Figure 4.3: densities p(ω | α) of the Beta distribution for α = (1, 1), (2, 1), (10, 20), (1/10, 1/2).]
[Graphical model for the Beta-Bernoulli model: the hyperparameter α generates the parameter ω, which generates the observations xt.]
Then the posterior distribution of ω given the sample is also Beta, that is, $\omega \mid x^n \sim \text{Beta}\!\left(\alpha_1 + \sum_{t=1}^n x_t,\ \alpha_0 + n - \sum_{t=1}^n x_t\right)$.
Example 25. The parameter ω ∈ [0, 1] of a randomly selected coin can be modelled as a Beta distribution peaking around 1/2. Usually one assumes that coins are fair.
However, not all coins are exactly the same. Thus, it is possible that each coin deviates
slightly from fairness. We can use a Beta distribution to model how likely (we think)
different values ω of coin parameters are.
To demonstrate how belief changes, we perform the following simple experiment.
We repeatedly toss a coin and wish to form an accurate belief about how biased the
coin is, under the assumption that the outcomes are Bernoulli with a fixed parameter
ω. Our initial belief, ξ0 , is modelled as a Beta distribution on the parameter space
Ω = [0, 1], with parameters α0 = α1 = 100. This places a strong prior on the coin
being close to fair. However, we still allow for the possibility that the coin is biased.
[Figure 4.5: Changing beliefs ξ0, ξ10, ξ100, ξ1000 as we observe tosses from a coin with probability ω = 0.6 of heads.]
Figure 4.5 shows a sequence of beliefs at times 0, 10, 100, 1000 respectively, from a coin with bias ω = 0.6. Due to the strength of our prior, after 10 observations,
the situation has not changed much and the belief ξ10 is very close to the initial one.
However, after 100 observations our belief has now shifted towards 0.6, the true bias
of the coin. After a total of 1000 observations, our belief is centered very close to 0.6,
and is now much more concentrated, reflecting the fact that we are almost certain
about the value of ω.
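A small simulation in the same spirit can be written in a few lines. The sketch below assumes the standard Beta-Bernoulli update described above (the posterior parameters are the prior parameters plus the success and failure counts); the random seed and the choice to print summary statistics rather than plot densities are arbitrary.

```python
# A sketch of the coin experiment: update a Beta(100, 100) prior with tosses
# from a coin of bias 0.6 and report how the posterior concentrates.
import numpy as np

rng = np.random.default_rng(0)
omega_true = 0.6                     # true coin bias
alpha1, alpha0 = 100.0, 100.0        # strong prior centred on a fair coin

for n in [10, 100, 1000]:
    x = rng.binomial(1, omega_true, size=n)
    a1, a0 = alpha1 + x.sum(), alpha0 + n - x.sum()
    mean = a1 / (a1 + a0)
    var = a1 * a0 / ((a1 + a0) ** 2 * (a1 + a0 + 1))
    print(f"n={n}: posterior Beta({a1:.0f}, {a0:.0f}), mean={mean:.3f}, std={var**0.5:.3f}")
```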
The dependency graph in Figure 4.6 shows the dependencies between the
parameters of a normal distribution and the observations xt . In this graph,
only a single sample xt is shown, and it is implied that all xt are independent
of each other given r, ω.
[Figure 4.6: graphical model with precision r, mean ω and a single observation xt.]
[Graphical model for the normal distribution with unknown mean: hyperparameters µ and τ specify the prior on ω, which together with the known precision r generates the observations xt.]
Then the posterior distribution of ω given the sample is also normal, that
is,
$$\omega \mid x^n \sim \mathcal{N}(\mu', \tau'^{-1}) \quad \text{with} \quad \mu' = \frac{\tau\mu + n r \bar{x}_n}{\tau'}, \quad \tau' = \tau + nr,$$
where $\bar{x}_n \triangleq \frac{1}{n}\sum_{t=1}^n x_t$.
It can be seen that the updated estimate for the mean is shifted towards the empirical mean x̄n, and the precision increases linearly with the number of samples.
$$f(r \mid \alpha, \beta) = \frac{\beta^\alpha}{\Gamma(\alpha)}\, r^{\alpha-1} e^{-\beta r},$$
[Figure: densities f(r | α, β) of the Gamma distribution for (α, β) = (1, 1), (1, 2), (2, 2), (4, 1/2).]
For α = 1 one obtains the exponential distribution, with density
$$f(x \mid \beta) = \beta e^{-\beta x}, \qquad (4.3.3)$$
which, like the Gamma distribution, has support [0, ∞). For n ∈ ℕ, α = n/2 and β = 1/2 one obtains a χ²-distribution with n degrees of freedom.
Now we return to our problem of estimating the precision of a normal dis-
tribution with known mean, using the Gamma distribution to represent uncer-
tainty about the precision.
Normal-Gamma model
[Graphical model: hyperparameters α, β generate the precision r, which together with the mean ω generates the observations xt.]
Then the posterior distribution of r given the sample is also Gamma, that
is,
$$r \mid x^n \sim \text{Gamma}(\alpha', \beta') \quad \text{with} \quad \alpha' = \alpha + \frac{n}{2}, \quad \beta' = \beta + \frac{1}{2}\sum_{t=1}^n (x_t - \omega)^2.$$
Finally, let us turn our attention to the general problem of estimating both the
mean and the precision of a normal distribution. We will use the same prior
distributions for the mean and precision as in the case when just one of them
was unknown. It will be assumed that the precision is independent of the mean,
while the mean has a normal distribution given the precision.
Figure 4.11: Graphical model for a normal distribution with unknown mean
and precision.
Then the posterior is of the same form, with ω | r, x^n ∼ N(µ′, (τ′ r)^{-1}) and r | x^n ∼ Gamma(α′, β′), where
$$\mu' = \frac{\tau\mu + n\bar{x}_n}{\tau + n}, \qquad \tau' = \tau + n,$$
$$\alpha' = \alpha + \frac{n}{2}, \qquad \beta' = \beta + \frac{1}{2}\sum_{t=1}^n (x_t - \bar{x}_n)^2 + \frac{\tau n (\bar{x}_n - \mu)^2}{2(\tau + n)}.$$
Recall that the density of a single observation is
$$f(x \mid \omega, r) \propto \sqrt{r}\, \exp\!\left(-\frac{r}{2}(x - \omega)^2\right).$$
For a prior ω | r ∼ N(µ, (τ r)^{-1}) and r ∼ Gamma(α, β), as before, the joint distribution for mean and precision is given by
$$\xi(\omega, r) \propto \sqrt{r}\, \exp\!\left(-\frac{\tau r}{2}(\omega - \mu)^2\right) r^{\alpha-1} e^{-\beta r},$$
as ξ(ω, r) = ξ(ω | r) ξ(r). Now we can write the marginal density of new observations as
$$
\begin{aligned}
p_\xi(x) &= \int f(x \mid \omega, r)\, d\xi(\omega, r) \\
&\propto \int_0^\infty \int_{-\infty}^\infty \sqrt{r}\, \exp\!\left(-\frac{r}{2}(x-\omega)^2\right) \exp\!\left(-\frac{\tau r}{2}(\omega - \mu)^2\right) r^{\alpha - 1} e^{-\beta r}\, d\omega\, dr \\
&= \int_0^\infty r^{\alpha - \frac{1}{2}} e^{-\beta r} \int_{-\infty}^\infty \exp\!\left(-\frac{r}{2}(x-\omega)^2 - \frac{\tau r}{2}(\omega - \mu)^2\right) d\omega\, dr \\
&= \int_0^\infty r^{\alpha - \frac{1}{2}} e^{-\beta r} \int_{-\infty}^\infty \exp\!\left(-\frac{r}{2}\left[(x-\omega)^2 + \tau(\omega - \mu)^2\right]\right) d\omega\, dr \\
&= \int_0^\infty r^{\alpha - \frac{1}{2}} e^{-\beta r} \exp\!\left(-\frac{\tau r}{2(\tau+1)}(\mu - x)^2\right) \sqrt{\frac{2\pi}{r(1+\tau)}}\, dr.
\end{aligned}
$$
Multinomial-Dirichlet conjugates
The multinomial distribution is the extension of the binomial distribution to an
arbitrary number of outcomes. It is a common model for independent random
trials with a finite number of possible outcomes, such as repeated dice throws,
multi-class classification problems, etc.
As in the binomial distribution we perform independent trials, but now con-
sider a more general outcome set S = {1, . . . , K} for each trial. Denoting by nk
the number of times one obtains outcome k, the multinomial distribution gives
the probability of observing a given vector (n1 , . . . , nK ) after a total of n trials.
$$P(n \mid \omega) = \frac{n!}{\prod_{k=1}^K n_k!} \prod_{k=1}^K \omega_k^{n_k}. \qquad (4.3.4)$$
The Dirichlet distribution is a conjugate prior for the multinomial distribution, in the same way that the Beta distribution is conjugate to the Bernoulli/binomial distribution.
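As a brief illustration, the sketch below assumes the standard Dirichlet-multinomial update (a Dirichlet(α) prior combined with observed counts n yields a Dirichlet(α + n) posterior); the prior and counts are made-up numbers rather than anything from the text.

```python
# A minimal sketch of the Dirichlet-multinomial conjugate update, assuming the
# standard result that Dirichlet(alpha) prior + multinomial counts n gives a
# Dirichlet(alpha + n) posterior.
import numpy as np

alpha = np.array([1.0, 1.0, 1.0])        # Dirichlet prior over K = 3 outcomes
counts = np.array([12, 7, 1])            # observed outcome counts (n_1, ..., n_K)
alpha_post = alpha + counts              # conjugate posterior parameters
posterior_mean = alpha_post / alpha_post.sum()
print(alpha_post, posterior_mean)
```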
For the definition of the Wishart distribution we first have to recall the
definition of a matrix trace.
Definition 4.3.9. The trace of an n × n square matrix A with entries aij is
defined as
$$\operatorname{trace}(A) \triangleq \sum_{i=1}^n a_{ii}.$$
Figure 4.17: 90% credible interval after 1000 observations from a Bernoulli with
ω = 0.6.
A credible interval [ωl, ωu] at level s is an interval whose posterior probability is s, i.e., ξ([ωl, ωu]) = s.
Note that ωl , ωu are not unique and any choice satisfying the condition is valid.
However, typically the interval is chosen so as to exclude the tails (extremes)
of the distribution and centered in the maximum. Figure 4.17 shows the
90% credible interval for the Bernoulli parameter of Example 24 after 1000
observations, that is, the measure of A under ξ is ξ(A) = 0.9. We see that the
true parameter ω = 0.6 lies slightly outside it.
Q({x^n ∈ S^n | ω ∉ A_n}).
The probability that the true value of ω will be within a particular cred-
ible interval depends on how well the prior ξ0 matches the true distribution
from which the parameter ω was drawn. This is illustrated in the following
experimental setup, where we check how often a 50% credible interval fails.
Figure 4.18: 50% credible intervals for a prior ξ0 = Beta(10, 10) matching the distribution of ω.
Figure 4.19: Failure rate of 50% credible intervals for a prior ξ0 = Beta(10, 10) matching the distribution of ω.
Figure 4.20: 50% credible intervals for a prior Beta(10, 10), when ξ0 does not match the distribution of ω = 0.6.
Figure 4.21: Failure rate of 50% credible intervals for a prior Beta(10, 10), when ξ0 does not match the distribution of ω = 0.6.
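The following sketch reproduces the flavour of this experiment, assuming a central 50% credible interval computed from the Beta posterior quantiles (one of the many valid choices of interval); the sample size, number of runs and the mismatched value ω = 0.6 mirror the figures only loosely.

```python
# A sketch of the coverage experiment: draw omega (from the prior or fixed),
# observe Bernoulli samples, and count how often the central 50% credible
# interval of the posterior fails to contain omega.
import numpy as np
from scipy.stats import beta

rng = np.random.default_rng(0)
a0 = b0 = 10.0                        # prior Beta(10, 10)
n, runs = 100, 2000
for matched in [True, False]:
    failures = 0
    for _ in range(runs):
        omega = rng.beta(a0, b0) if matched else 0.6
        x = rng.binomial(1, omega, size=n)
        a, b = a0 + x.sum(), b0 + n - x.sum()
        lo, hi = beta.ppf(0.25, a, b), beta.ppf(0.75, a, b)
        failures += not (lo <= omega <= hi)
    print("matched" if matched else "mismatched", failures / runs)
```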
For a non-negative random variable X and u > 0, Markov's inequality states that
$$P(X \ge u) \le \frac{\mathbb{E} X}{u}, \qquad (4.5.1)$$
where $P(X \ge u) = P(\{x \mid x \ge u\})$.
Example 26 (Application to sample mean). It is easy to show that the sample mean x̄n has expectation µ and variance σ²/n, and we obtain from (4.5.2)
$$P\!\left(|\bar{x}_n - \mu| \ge \frac{k\sigma}{\sqrt{n}}\right) \le \frac{1}{k^2}.$$
Setting $\epsilon = k\sigma/\sqrt{n}$ we get $k = \epsilon\sqrt{n}/\sigma$ and hence
$$P\!\left(|\bar{x}_n - \mu| \ge \epsilon\right) \le \frac{\sigma^2}{\epsilon^2 n}.$$
$$P(\bar{x}_n - \mu \ge \epsilon) = P(S_n \ge u) \le e^{-\theta n \epsilon} \prod_{t=1}^n \mathbb{E}\, e^{\theta(x_t - \mu)}. \qquad (4.5.6)$$
Applying Jensen’s inequality directly to the expectation does not help. However,
we can use convexity in another way. Let f (x) be the linear upper bound on
eθx on the interval [a, b], i.e.
$$f(x) \triangleq \frac{b-x}{b-a}\, e^{\theta a} + \frac{x-a}{b-a}\, e^{\theta b} \ge e^{\theta x}.$$
Then obviously E eθx ≤ E f (x) for x ∈ [a, b]. Applying this to the expectation
term (4.5.6) above we obtain
$$\mathbb{E}\, e^{\theta(x_t - \mu_t)} \le \frac{e^{-\theta\mu_t}}{b_t - a_t}\left[(b_t - \mu_t)\, e^{\theta a_t} + (\mu_t - a_t)\, e^{\theta b_t}\right].$$
Taking derivatives with respect to θ and computing the second order Taylor
expansion, we get
$$\mathbb{E}\, e^{\theta(x_t - \mu_t)} \le e^{\frac{1}{8}\theta^2 (b_t - a_t)^2}$$
and hence
$$P(\bar{x}_n - \mu \ge \epsilon) \le e^{-\theta n \epsilon + \frac{1}{8}\theta^2 \sum_{t=1}^n (b_t - a_t)^2}.$$
This is minimised for $\theta = 4n\epsilon / \sum_{t=1}^n (b_t - a_t)^2$, which proves the required result.
We can apply this inequality directly to the sample mean example and obtain
for xt ∈ [0, 1]
$$P(|\bar{x}_n - \mu| \ge \epsilon) \le 2 e^{-2n\epsilon^2}.$$
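As a quick sanity check, the sketch below compares the empirical deviation frequency of the sample mean of uniform [0, 1] variables with the bound 2 exp(−2nε²); the particular n, ε and number of runs are arbitrary choices.

```python
# Compare the empirical frequency of |sample mean - mu| >= eps with the
# Hoeffding bound 2 exp(-2 n eps^2) for variables in [0, 1].
import numpy as np

rng = np.random.default_rng(0)
n, eps, runs = 100, 0.1, 20000
mu = 0.5
deviations = 0
for _ in range(runs):
    x = rng.random(n)                 # uniform on [0, 1], mean 0.5
    deviations += abs(x.mean() - mu) >= eps
print("empirical:", deviations / runs, "bound:", 2 * np.exp(-2 * n * eps ** 2))
```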
modelling [e.g., Geweke, 1999], where detailed simulators but no useful analyt-
ical probabilistic models were available. ABC methods have also been used for
inference in dynamical systems [e.g., Toni et al., 2009] and the reinforcement
learning problem [Dimitrakakis and Tziortziotis, 2013, 2014].
where ξx is shorthand for the joint distribution ξ(ω, x) for a fixed value of x.
As the second term does not depend on θ, we can find the best element of the
family by computing
$$\max_{\theta \in \Theta} \int_\Omega \ln \frac{d\xi_x}{dQ_\theta}\, dQ_\theta, \qquad (4.6.4)$$
where the term we are maximising can also be seen as a lower bound on the
marginal log-likelihood.
In the latter case, even though we cannot compute the full function ξ(ω | x), we
can still maximise (perhaps locally) for ω.
More generally, there might be some parameters φ for which we actually can
compute a posterior distribution. Then we can still use the same approaches
maximising either of
$$P_\omega(x) = \int_\Phi P_{\omega,\phi}(x)\, d\xi(\phi \mid x), \qquad \xi(\omega \mid x) = \int_\Phi \xi(\omega \mid \phi, x)\, d\xi(\phi \mid x).$$
Sequential sampling
Example 28. Consider that you have 100 produced items and you want to determine
whether there are fewer than 10 faulty items among them. If testing has some cost,
it pays off to think about whether it is possible to do without testing all 100 items.
Indeed, this is possible by the following simple online testing scheme: You test one
item after another until you either have discovered 10 faulty items or 91 good items.
In either case you have the correct answer at considerably lower cost than when testing
all items.
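The following sketch simulates this online testing scheme, assuming each item is faulty independently with some made-up probability, and reports the average number of tests used instead of the full 100.

```python
# Simulate the online testing scheme of Example 28: stop as soon as we have
# seen 10 faulty or 91 good items, and measure how many tests were needed.
import numpy as np

rng = np.random.default_rng(0)

def tests_needed(items):
    faulty = good = 0
    for t, broken in enumerate(items, start=1):
        faulty += broken
        good += 1 - broken
        if faulty >= 10 or good >= 91:
            return t
    return len(items)

counts = []
for _ in range(10000):
    items = rng.random(100) < 0.08    # each item faulty with probability 0.08
    counts.append(tests_needed(items))
print("average number of tests:", np.mean(counts), "instead of 100")
```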
Thus, the sample obtained depends both on P and the sampling proce-
dure πs . In our setting, we don’t just want to sample sequentially, but also to
take some action after sampling is complete. For that reason, we can generalise
the above definition to sequential decision procedures.
A sequential decision procedure consists of:
1. a stopping rule πs : X ∗ → {0, 1}, and
2. a decision rule πd : X ∗ → A.
The stopping rule πs specifies whether, at any given time, we should stop and
make a decision in A or take one more sample. That is, stop if
πs (xt ) = 1,
otherwise observe xt+1 . Once we have stopped (i.e. πs (xt ) = 1), we choose the
decision
πd (xt ).
In the remainder of this section, we shall consider the following simple deci-
sion problem, where we need to make a decision about the value of an unknown
parameter. As we get more data, we have a better chance of discovering the right
parameter. However, there is always a small chance of getting no information.
Example 29. Consider the following decision problem, where the goal is to distinguish
between two possible hypotheses θ1 , θ2 , with corresponding decisions a1 , a2 . We have
three possible observations {1, 2, 3}, with 1, 2 being more likely under the first and
second hypothesis, respectively. However, the third observation gives us no information
about the hypothesis, as its probability is the same under θ1 and θ2 . In this problem
γ is the probability that we obtain an uninformative sample.
Parameters: Θ = {θ1 , θ2 }.
Decisions: A = {a1 , a2 }.
Observation distribution fi (k) = Pθi (xt = k) for all t with
f1 (1) = 1 − γ, f1 (2) = 0, f1 (3) = γ, (5.1.4)
f2 (1) = 0, f2 (2) = 1 − γ, f2 (3) = γ. (5.1.5)
Figure 5.1: Illustration of P1, the procedure taking a fixed number of samples n. The value V(n) of taking exactly n observations under two different beliefs (ξ = 0.1 and ξ = 0.5), for γ = 0.9, b = −10, c = 10^{-2}.
The results of applying this procedure are illustrated in Figure 5.1. Here we
can see that, for two different choices of priors, the optimal number of samples
is different. In both cases, there is a clear choice for how many samples to take,
when we must fix the number of samples before seeing any data.
However, we may not be constrained to fix the number of samples a priori.
As illustrated in Example 28, many times it is a good idea to adaptively decide
when to stop taking samples. This is illustrated by the following sequential
procedure. Since we already know that there is an optimal a priori number of steps n∗, we can choose to look at all possible stopping times that are no larger than n∗.
using the formula for the geometric series (see equation C.1.4). Consequently,
the value of this procedure is
Figure 5.2: The value of three strategies (fixed, bounded, unbounded) for ξ = 1/2, b = −10, c = 10^{-2} and varying γ. Higher values of γ imply a longer time before the true θ is known.
Here the cost is inside the expectation, since the number of samples we take is
random. Summing over all the possible stopping times n, and taking Bn ⊂ X ∗
as the set of observations for which we stop, we have:
$$U(\xi, \pi) = \sum_{n=1}^\infty \int_{B_n} \mathbb{E}_\xi[U(\theta, \pi(x^n)) \mid x^n]\, dP_\xi(x^n) - \sum_{n=1}^\infty P_\xi(B_n)\, nc \qquad (5.2.2)$$
$$\phantom{U(\xi, \pi)} = \sum_{n=1}^\infty \int_{B_n} \int_\Theta U[\theta, \pi(x^n)]\, d\xi(\theta \mid x^n)\, dP_\xi(x^n) - \sum_{n=1}^\infty P_\xi(B_n)\, nc \qquad (5.2.3)$$
at most T samples. If the process ends at stage T , we will have observed some
sequence xT , which gives rise to a posterior ξ(θ | xT ). Since we must stop at T ,
we must choose a maximising expected utility at that stage:
$$\mathbb{E}_\xi[U \mid x^T, a] = \int_\Theta U(\theta, a)\, d\xi(\theta \mid x^T).$$
Since we need not take another sample, the respective value (maximal expected
utility) of that stage is
$$V^0[\xi(\cdot \mid x^T)] \triangleq \max_{a \in A} U(\xi(\cdot \mid x^T), a),$$
where we introduce the notation $V^n$ to denote the expected utility, given that we are stopping after at most n steps.
More generally, we need to consider the effect on subsequent decisions. Con-
sider the following simple two-stage problem as an example. Let X = {0, 1}
and ξ be the prior on the θ parameter of Bern(θ). We wish to either decide
immediately on a parameter θ, or take one more observation, at cost c, before
deciding. The problem we consider has two stages, as illustrated in Figure 5.3.
Figure 5.3: An example of a sequential decision problem with two stages. The
initial belief is ξ and there are two possible subsequent beliefs, depending on
whether we observe xt = 0 or xt = 1. At each stage we pay c.
In this example, we begin with a prior ξ at the first stage. There are two
possible outcomes for the second stage.
1. If we observe x1 = 0 then our value is V 0 [ξ(· | x1 = 0)].
2. If we observe x1 = 1 then our value is V 0 [ξ(· | x1 = 1)].
At the first stage, we can:
1. Stop with value V 0 (ξ).
2. Pay a sampling cost c for value $V^0[\xi(\cdot \mid x_1)]$, where $P_\xi(x_1) = \int_\Theta P_\theta(x_1)\, d\xi(\theta)$.
So the expected value of continuing for one more step is
$$V^1(\xi) \triangleq \int_X V^0[\xi(\cdot \mid x_1)]\, dP_\xi(x_1).$$
Thus, the overall value for this problem is
$$\max\left\{ V^0(\xi),\ \sum_{x_1=0}^{1} V^0[\xi(\cdot \mid x_1)]\, P_\xi(x_1) - c \right\}.$$
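The two-stage computation can be carried out explicitly once a concrete utility is fixed. The sketch below assumes, purely for illustration, a Beta(a, b) prior on θ and quadratic utility U(θ, a) = −(θ − a)², so that V⁰(ξ) equals minus the posterior variance (attained by reporting the posterior mean); the prior parameters and the sampling cost c are arbitrary.

```python
# Two-stage sequential decision: compare stopping now with paying c for one
# more Bernoulli observation, under a Beta prior and quadratic utility.
def beta_var(a, b):
    return a * b / ((a + b) ** 2 * (a + b + 1))

def stage_values(a, b, c):
    v0 = -beta_var(a, b)                              # value of stopping now
    p1 = a / (a + b)                                  # P_xi(x1 = 1)
    v1 = p1 * -beta_var(a + 1, b) + (1 - p1) * -beta_var(a, b + 1)
    return v0, v1 - c                                 # stop vs. continue (cost paid)

a, b, c = 1.0, 1.0, 0.01                              # uniform prior, sampling cost c
v_stop, v_continue = stage_values(a, b, c)
print("stop:", v_stop, "continue:", v_continue,
      "-> sample" if v_continue > v_stop else "-> stop")
```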
[Figure: from belief ξn, observing xn+1 = 1 or xn+1 = 0 leads to one of two successor beliefs ξn+1, paying c in either case.]
for every belief ξn in the set of beliefs that arise from the prior ξ0 , with j = T −n.
$$B_k(\pi) = \{x \in X^* \mid n = k\} \qquad (5.2.7)$$
be the set of observations such that exactly k samples are taken by rule π, and let $\bar{B}_k(\pi) = \{x \in X^* \mid n \le k\}$ be the set of observations such that at most k samples are taken by rule π. Then
$$
\begin{aligned}
U(\xi, \pi') &= \sum_{k=1}^\infty \int_{B_k(\pi')} \left\{ V^0[\xi(\cdot \mid x^k)] - ck \right\} dP_\xi(x^k) \\
&\ge \sum_{k=1}^\infty \int_{B_k(\pi')} U[\xi(\cdot \mid x^k), \pi]\, dP_\xi(x^k) \\
&= \sum_{k=1}^\infty \mathbb{E}^\pi_\xi\{U \mid B_k(\pi')\}\, P_\xi(B_k(\pi')) = \mathbb{E}^\pi_\xi U = U(\xi, \pi).
\end{aligned}
$$
U(θ, d)   a1    a2
θ1        0     λ1
θ2        λ2    0
As will be the case for all our sequential decision problems, we only need to
consider our current belief ξ, and its possible evolution, when making a decision.
To obtain some intuition about this procedure, we are going to analyse this
problem by examining what the optimal decision is under all possible beliefs ξ.
Under some belief ξ, the immediate value (i.e. the value we obtain if we stop
immediately), is simply:
The worst-case immediate value, i.e. the minimum, is attained when both
terms are equal. Consequently, setting λ1 ξ = λ2 (1 − ξ) gives ξ = λ2 /(λ1 + λ2 ).
Intuitively, this is the worst-case belief, as the uncertainty it induces leaves us
unable to choose between either hypothesis. Replacing in (5.2.9) gives a lower
bound for the value for any belief.
$$V^0(\xi) \ge \frac{\lambda_1 \lambda_2}{\lambda_1 + \lambda_2}.$$
Let Π denote the set of procedures π which take at least one observation
and define:
$$V'(\xi) = \sup_{\pi \in \Pi} U(\xi, \pi). \qquad (5.2.10)$$
Then the ξ-expected utility $V^*(\xi)$ must satisfy
$$V^*(\xi) = \max\{V^0(\xi),\ V'(\xi)\}. \qquad (5.2.11)$$
[Figure 5.5: the immediate value V0∗(ξ) and the value V1∗(ξ) of continuing, plotted against the belief ξ, with thresholds ξL and ξH around the worst-case belief λ2/(λ1 + λ2).]
Figure 5.5 illustrates the above arguments, by plotting the immediate value
against the optimal continuation after taking one more sample. For the worst-
case belief, we must always continue sampling. When we are absolutely certain
about the model, then it’s always better to stop immediately. There are two
points where the curves intersect. Together, these define three subsets of beliefs: on the left, if ξ < ξL, we decide for one of the two parameters; on the right, if ξ > ξH, we decide for the other; otherwise, we continue sampling. This is the main idea of the sequential probability ratio test, explained below.
$$\xi_t = \frac{\xi\, P_{\theta_1}(x^t)}{\xi\, P_{\theta_1}(x^t) + (1-\xi)\, P_{\theta_2}(x^t)}.$$
Proof.
$$
\begin{aligned}
\mathbb{E}\sum_{i=1}^n z_i &= \sum_{k=1}^\infty \int_{B_k} \sum_{i=1}^k z_i\, dG_k(z^k) \\
&= \sum_{k=1}^\infty \sum_{i=1}^k \int_{B_k} z_i\, dG_k(z^k) \\
&= \sum_{i=1}^\infty \sum_{k=i}^\infty \int_{B_k} z_i\, dG_k(z^k) \\
&= \sum_{i=1}^\infty \int_{B_{\ge i}} z_i\, dG_i(z^i) \\
&= \sum_{i=1}^\infty \mathbb{E}(z_i)\, P(n \ge i) = m\, \mathbb{E}\, n.
\end{aligned}
$$
$$a < \sum_{i=1}^n z_i < b$$
as the test. Using Wald's theorem and the previous properties, and assuming c ≈ 0, we obtain the following approximately optimal values for a, b:
$$a \approx \log c - \log\frac{I_1 \lambda_2 (1-\xi)}{\xi}, \qquad b \approx \log\frac{1}{c} - \log\frac{I_2 \lambda_1 \xi}{1-\xi}. \qquad (5.2.14)$$
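The sketch below runs an SPRT of this form for two simple hypotheses given by Bernoulli distributions; the parameter values and the thresholds a, b are arbitrary illustrative choices rather than the approximately optimal ones derived above.

```python
# Illustrative sequential probability ratio test for two Bernoulli hypotheses:
# keep sampling while the log-likelihood ratio stays inside (a, b).
import numpy as np

rng = np.random.default_rng(0)
theta1, theta2 = 0.6, 0.4             # the two hypotheses
a, b = np.log(0.05), np.log(20.0)     # stop when the log-likelihood ratio exits (a, b)

def sprt(true_theta):
    llr, n = 0.0, 0
    while a < llr < b:
        x = rng.binomial(1, true_theta)
        n += 1
        # z_i = log P_theta1(x_i) - log P_theta2(x_i)
        llr += np.log(theta1 if x else 1 - theta1) - np.log(theta2 if x else 1 - theta2)
    return ("accept theta1" if llr >= b else "accept theta2"), n

print(sprt(true_theta=0.6))
print(sprt(true_theta=0.4))
```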
5.3 Martingales
Martingales are a fundamentally important concept in the analysis of stochastic processes in which the conditional expectation of the process at time t + 1, given its history, equals its value at time t.
An example of a martingale sequence is when xt is the amount of money you
have at a given time, and where at each time-step t you are making a gamble
such that you lose or gain 1 currency unit with equal probability. Then, at any
step t, it holds that E(xt+1 | xt ) = xt . This concept can be generalised to two
random processes xt and yt , which are dependent.
exists and
E(yn+1 | xn ) = yn (5.3.2)
holds with probability 1. If {yn } is a martingale with respect to itself, i.e.
yi (x) = x, then we call it simply a martingale.
At a first glance, it might appear that martingales are not very frequently
encountered, apart from some niche applications. However, we can always con-
struct a martingale from any sequence of random variables as follows.
yn (xn ) = E[f | xn ].
This allows us to bound the probability that the difference sequence deviates from zero. Since there are only few problems where the random variables of interest already form a difference sequence, this theorem is most commonly applied by defining a new sequence of random variables that is a difference sequence.
5.5 Exercises
Exercise 26. Consider a stationary Markov process with state space S and whose transition kernel is a matrix τ. At time t, we are at state xt = s and we can either (1) terminate and receive reward b(s), or (2) pay c(s) and continue to a random state xt+1 drawn from the distribution τ(· | s).
Assuming b, c > 0 and τ are known, design a backwards induction algorithm that
optimises the utility function
$$U(x_1, \ldots, x_T) = b(x_T) - \sum_{t=1}^{T-1} c(x_t).$$
Finally, show that the expected utility of the optimal policy starting from any
state must be bounded.
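One possible sketch of a solution (not the book's reference solution) is the following backwards induction, which at each stage compares the value of stopping with the expected value of continuing; the chain, rewards and costs are made-up numbers.

```python
# Backwards induction for the optimal stopping problem of Exercise 26.
import numpy as np

def optimal_stopping(b, c, tau, T):
    """Return V[t][s], the optimal expected utility at time t in state s."""
    n = len(b)
    V = np.zeros((T + 1, n))
    V[T] = b                                   # at the horizon we must stop
    for t in range(T - 1, -1, -1):
        continue_value = -c + tau @ V[t + 1]   # pay c(s), then move according to tau
        V[t] = np.maximum(b, continue_value)   # stop or continue, whichever is better
    return V

b = np.array([1.0, 5.0, 2.0])
c = np.array([0.1, 0.1, 0.1])
tau = np.array([[0.5, 0.3, 0.2],
                [0.2, 0.6, 0.2],
                [0.3, 0.3, 0.4]])
print(optimal_stopping(b, c, tau, T=10)[0])
```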
Exercise 27. Consider the problem of classification with features x ∈ X and labels
y ∈ Y, where each label costs c > 0. Assume a Bayesian model with some parameter
space Θ on which we have a prior distribution ξ0 . Let ξt be the posterior distribution
after t examples (x1 , y1 ), . . . , (xt , yt ).
Let our expected utility be the expected accuracy (i.e., the marginal probability
of correctly guessing the right label over all possible models) of the Bayes-optimal
classifier π : X → Y minus the cost paid:
$$\mathbb{E}_t(U) \triangleq \max_\pi \int_\Theta \int_X P_\theta(\pi(x) \mid x)\, dP_\theta(x)\, d\xi_t(\theta) - ct$$
[Figure: expected, actual, and predicted performance as a function of the number of examples.]
Experiment design and Markov decision processes
6.1 Introduction
This chapter introduces the very general formalism of Markov decision processes
(MDPs) that allows representation of various sequential decision making prob-
lems. Thus a Markov decision process can be used to model stochastic path
problems, stopping problems as well as problems in reinforcement learning, ex-
periment design, and control.
We begin by taking a look at the problem of experimental design. One
instance of this problem occurs when considering how to best allocate treat-
ments with unknown efficacy to patients in an adaptive manner, so that the
best treatment is found, or so as to maximise the number of patients that
are treated successfully. The problem, originally considered by Chernoff [1959,
1966], informally can be stated as follows.
We have a number of treatments of unknown efficacy, i.e., some of them
work better than the others. We observe patients one at a time. When a new
patient arrives, we must choose which treatment to administer. Afterwards, we
observe whether the patient improves or not. Given that the treatment effects
are initially unknown, how can we maximise the number of cured patients? Al-
ternatively, how can we discover the best treatment? The two different problems
are formalised below.
Example 31 (Adaptive treatment allocation). Consider k treatments to be administered to T volunteers. To each volunteer only a single treatment can be assigned. At the t-th trial, we treat one volunteer with some treatment at ∈ {1, . . . , k}. We then obtain a reward rt = 1 if the patient is healed and 0 otherwise. We wish to choose actions maximising the utility $U = \sum_t r_t$. This would correspond to maximising the number of patients that get healed over time.
$$\xi_t(\omega) \triangleq \xi_0(\omega \mid a_1, \ldots, a_t, x_1, \ldots, x_t).$$
Our utility can again be expressed as a sum over individual rewards, $U = \sum_{t=1}^T r_t$, where T ∈ (0, ∞] is the horizon and γ ∈ (0, 1] is a discount factor. The reward rt is stochastic, and only depends on the current action, with expectation E(rt | at = i) = ωi.
In order to select the actions, we must specify some policy or decision rule.
Such a rule can only depend on the sequence of previously taken actions and
observed rewards. Usually, the policy π : A∗ × R∗ → A is a deterministic
mapping from the space of all sequences of actions and rewards to actions.
That is, for every observation and action history a1 , r1 , . . . , at−1 , rt−1 it suggests
a single action at . More generally, it could also be a stochastic policy, that
specifies a mapping to action distributions. We use the notation
The following figure summarises the statement of the bandit problem in the
Bayesian setting.
There are two main difficulties with this approach. The first is specifying
the family and the prior distribution: this is effectively part of the problem
formulation and can severely influence the solution. The second is calculating
the policy that maximises expected utility given a prior and a family. The first
problem can be resolved by either specifying a subjective prior distribution,
or by selecting a prior distribution that has good worst-case guarantees. The
second problem is hard to solve, because in general, such policies are history
dependent and the set of all possible histories is exponential in the horizon T .
be the empirical reward of arm i at time t. We can set r̂t,i = 0 when Nt,i = 0.
Then, the posterior distribution for the parameter of arm i is
Since rt ∈ {0, 1}, the possible states of our belief given some prior are in ℕ^{2n}.
In order to evaluate a policy we need to be able to predict the expected
utility we obtain. The latter only depends on our current belief, and the state
of our belief corresponds to the state of the bandit problem. This means that
everything we know about the problem at time t can be summarised by ξt . For
Bernoulli bandits, a sufficient statistic for our belief is the number of times we
played each bandit and the total reward from each bandit. Thus, our state at
time t is entirely described by our priors α, β (the initial state) and the vectors
The next belief is random, since it depends on the random quantity rt . In fact,
the probability of the next reward lying in some set R if at = a is given by the
marginal distribution
$$P_{\xi_t, a}(R) \triangleq \int_\Omega P_{\omega,a}(R)\, d\xi_t(\omega). \qquad (6.2.9)$$
In practice, although multiple reward sequences may lead to the same beliefs,
we frequently ignore that possibility for simplicity. Then the process becomes a
tree. A solution to the problem of which action to select is given by a backwards
induction algorithm similar to the one given in Section 5.2.2:
$$U^*(\xi_t) = \max_{a_t}\left[ \mathbb{E}(r_t \mid \xi_t, a_t) + \sum_{\xi_{t+1}} P(\xi_{t+1} \mid \xi_t, a_t)\, U^*(\xi_{t+1}) \right]. \qquad (6.2.11)$$
The above equation is the backwards induction algorithm for bandits. If you
look at this structure, you can see that the next belief only depends on the
current belief, action, and reward, i.e., it satisfies the Markov property, as seen
in Figure 6.1.
Figure 6.1: A partial view of the multi-stage process. Here, the probability that
we obtain r = 1 if we take action at = i is simply Pξt ,i ({1}).
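To make the recursion concrete, the following sketch performs backwards induction over belief states for a two-armed Bernoulli bandit with independent Beta(1, 1) priors and a tiny horizon; the priors and horizon are arbitrary choices, and the listing is only an illustration of (6.2.11), not the book's own algorithm.

```python
# Backwards induction over belief states for a two-armed Bernoulli bandit.
# The belief state is the tuple of (successes, failures) counts per arm.
from functools import lru_cache

N_ARMS, HORIZON = 2, 4

@lru_cache(maxsize=None)
def u_star(counts, t):
    """Optimal expected utility from belief 'counts' with t steps remaining."""
    if t == 0:
        return 0.0
    best = -1.0
    for a in range(N_ARMS):
        s, f = counts[a]
        p = (s + 1) / (s + f + 2)                      # posterior mean of arm a (Beta(1,1) prior)
        succ = list(counts); succ[a] = (s + 1, f)      # next belief if r = 1
        fail = list(counts); fail[a] = (s, f + 1)      # next belief if r = 0
        value = p * (1 + u_star(tuple(succ), t - 1)) + (1 - p) * u_star(tuple(fail), t - 1)
        best = max(best, value)
    return best

initial = ((0, 0), (0, 0))
print("optimal expected utility:", u_star(initial, HORIZON))
```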
[Figure 6.2, three panels: (a) the basic process, (b) the Bayesian model, (c) the lifted process.]
Figure 6.2: Three views of the bandit process. The figure shows the basic bandit
process from the view of an external observer. The decision maker selects at
and then obtains reward rt , while the parameter ω is hidden. The process is
repeated for t = 1, . . . , T . The Bayesian model is shown in (b) and the resulting
process in (c). While ω is not known, at each time step t we maintain a belief
ξt on Ω. The reward distribution is then defined through our belief. In (b), we
can see the complete process, where the dependency on ω is clear. In (c), we
marginalise out ω and obtain a model where the transitions only depend on the
current belief and action.
If we want to add the decision maker’s internal belief to the graph, we obtain
Figure 6.2(b). From the point of view of the decision maker, the distribution of
ω only depends on his current belief. Consequently, the distribution of rewards
also only depends on the current belief, as we can marginalise over ω. This
gives rise to the decision-theoretic bandit process shown in Figure 6.2(c). In the
following section, we shall consider Markov decision processes more generally.
P (S | s, a) = Pµ (st+1 ∈ S | st = s, at = a) (6.3.1)
ρ(R | s, a) = Pµ (rt ∈ R | st = s, at = a). (6.3.2)
Usually, an initial state s0 (or more generally, an initial distribution from which
s0 is sampled) is specified.
[Graphical model for a Markov decision process: the state st and action at determine the reward rt and the next state st+1.]
Pµ (st+1 ∈ S | s1 , a1 , . . . , st , at ) = Pµ (st+1 ∈ S | st , at ),
(Transition distribution)
Pµ (rt ∈ R | s1 , a1 , . . . , st , at ) = Pµ (rt ∈ R | st , at ),
(Reward distribution)
Policies. A policy π (sometimes also called a decision function) specifies which
action to take. One can think of a policy as implemented through an algorithm
or an embodied agent, who is interested in maximising expected utility.
Policies
A policy π defines a conditional distribution on actions given the history:
where T is the horizon, after which the agent is no longer interested in rewards, and γ ∈ (0, 1] is the discount factor, which discounts future rewards. It is convenient to introduce a special notation for the utility starting from time t, i.e., the sum of rewards from that time on:
$$U_t \triangleq \sum_{k=0}^{T-t} \gamma^k r_{t+k}. \qquad (6.3.8)$$
At any time t, the agent wants to find a policy π maximising the expected total future reward
$$\mathbb{E}^\pi_\mu U_t = \mathbb{E}^\pi_\mu \sum_{k=0}^{T-t} \gamma^k r_{t+k}. \qquad \text{(expected utility)}$$
This is identical to the expected utility framework we have seen so far, with the only difference that the reward is now a sequence of numerical rewards and that we are acting within a dynamical system with state space S.
In fact, it is a good idea to think about the value of different states of the system
under certain policies in the same way that one thinks about how good different
positions are in chess.
The state value function for a particular policy π in an MDP µ can be inter-
preted as how much utility you should expect if you follow the policy starting
from state s at time t.
Note that the last term can be calculated easily through marginalisation, i.e.,
$$P^\pi_\mu(s_{t+1} = i \mid s_t = s) = \sum_{a \in A} P_\mu(s_{t+1} = i \mid s_t = s, a_t = a)\, P_\pi(a_t = a \mid s_t = s).$$
Direct policy evaluation Using (6.4.2) we can define a simple algorithm for
evaluating a policy’s value function, as shown in Algorithm 2.
Lemma 6.4.1. For each state s, the value Vt(s) computed by Algorithm 2 satisfies $V_t(s) = V^\pi_{\mu,t}(s)$.
$$V_t(s) = \sum_{k=t}^{T} \sum_{j \in S} P^\pi_\mu(s_k = j \mid s_t = s)\, \mathbb{E}^\pi_\mu(r_k \mid s_k = j).$$
$$\hat{V}_t(s) = \mathbb{E}^\pi_\mu(r_t \mid s_t = s) + \sum_{j \in S} P^\pi_\mu(s_{t+1} = j \mid s_t = s)\, \hat{V}_{t+1}(j). \qquad (6.4.6)$$
Theorem 6.4.1. The backwards induction algorithm gives estimates V̂t(s) satisfying
$$\hat{V}_t(s) = V^\pi_{\mu,t}(s). \qquad (6.4.7)$$
Proof. For t = T, the result is obvious. We prove the remainder by induction. We assume that $\hat{V}_{t+1}(s) = V^\pi_{\mu,t+1}(s)$ and show that (6.4.7) follows. Indeed, from the recursion (6.4.6) we have
$$\hat{V}_t(s) = \mathbb{E}^\pi_\mu(r_t \mid s_t = s) + \sum_{j \in S} P^\pi_\mu(s_{t+1} = j \mid s_t = s)\, \hat{V}_{t+1}(j) = \mathbb{E}^\pi_\mu(r_t \mid s_t = s) + \sum_{j \in S} P^\pi_\mu(s_{t+1} = j \mid s_t = s)\, V^\pi_{\mu,t+1}(j),$$
where the second equality is by the induction hypothesis, the remaining equalities are by the definition of the utility, and the last by definition of $V^\pi_{\mu,t}$.
6.5 Infinite-horizon
When problems have no fixed horizon, they usually can be modelled as infinite
horizon problems, sometimes with help of a terminal state, whose visit termi-
nates the problem, or discounted rewards, which indicate that we care less about
rewards further in the future. When reward discounting is exponential, these
problems can be seen as undiscounted problems with random and geometrically
distributed horizon. For problems with no discounting and no terminal states there are some complications in the definition of the optimal policy. However, we defer discussion of such problems to Chapter 10.
6.5.1 Examples
We begin with some examples, which will help elucidate the concept of terminal
states and infinite horizon. The first is shortest path problems, where the aim is
to find the shortest path to a particular goal. Although the process terminates
when the goal is reached, not all policies may be able to reach the goal, and so
the process may never terminate.
Shortest-path problems
We shall consider two types of shortest path problems, deterministic and stochas-
tic. Although conceptually very different, both problems have essentially the
same complexity.
[Figure 6.4: a maze in which each reachable state is labelled with its distance to the goal state X.]
Properties: γ = 1, T → ∞; rt = −1 unless st = X, in which case rt = 0; Pµ(st+1 = X | st = X) = 1; A = {North, South, East, West}; transitions are deterministic and walls block.
Solving the shortest path problem with deterministic transitions can be done
simply by recursively defining the distance of states to X. Thus, first the dis-
tance of X to X is set to 0. Then for states s with distance d to X and with
a neighbor state s′ with no assigned distance yet, one assigns s′ the distance
d + 1 to X. This is illustrated in Figure 6.4, where for all reachable states the
distance to X is indicated. The respective optimal policy at each step simply
moves to a neighbor state with the smallest distance to X. Its reward starting
in any state s is simply the negative distance from s to X.
Stochastic shortest path problem with a pit. Now let us assume the
shortest path problem with stochastic dynamics. That is, at each time-step
there is a small probability ω that we move to a random direction. To make
this more interesting, we can add a pit O, that is a terminal state giving a
one-time negative reward of −100 (and 0 reward for all further steps) as seen
in Figure 6.5
[Figure 6.5: the stochastic shortest path problem with a pit O and goal X, together with the resulting value function.]
Properties: γ = 1, T → ∞; rt = −1, but rt = 0 at X and −100 (once, with reward 0 afterwards) at O; Pµ(st+1 = X | st = X) = 1; Pµ(st+1 = O | st = O) = 1; A = {North, South, East, West}; moves in a random direction with probability ω; walls block.
Continuing problems
Many problems have no natural terminal state, but are continuing ad infini-
tum. Frequently, we model those problems using a utility that discounts future rewards exponentially. This way, we can guarantee that the utility is bounded. In addition, exponential discounting also makes economic sense. This is partially because of the effects of inflation, and partially because money available now may be more useful than money one obtains in the future. Both these effects diminish the value of money over time. As an example, consider the
following inventory management problem.
Example 33 (Inventory management). There are K storage locations, and each location i can store ni items. At each time-step there is a probability φi that a client wants to buy an item from location i, where $\sum_i \phi_i \le 1$. If there is an item available when this happens, you gain reward 1. There are two types of actions: one orders u units of stock, paying c(u); the other moves u units of stock from one location i to another location j, paying ψij(u).
For simplicity, in the following we assume that rewards only depend on the cur-
rent state instead of both state and action. It can easily be verified that the
results presented below can be adapted to the latter case. More importantly, we
also assume that the state and action spaces S, A are finite, and that the tran-
sition kernel of the MDP is time-invariant. This allows us to use the following
simplified vector notation:
$v^\pi = (\mathbb{E}^\pi(U_t \mid s_t = s))_{s \in S}$ is a vector in $\mathbb{R}^{|S|}$ representing the value of policy π.
Sometimes we will use p(j | s, a) as a shorthand for Pµ(st+1 = j | st = s, at = a).
$P_{\mu,\pi}$ is the transition matrix in $\mathbb{R}^{|S| \times |S|}$ for policy π, such that
$$P_{\mu,\pi}(i, j) = \sum_a p(j \mid i, a)\, P_\pi(a \mid i).$$
Proof. Let rt be the random reward at step t when starting in state s and following policy π. Then
$$
\begin{aligned}
V^\pi(s) &= \mathbb{E}^\pi\left(\sum_{t=0}^\infty \gamma^t r_t \,\Big|\, s_0 = s\right) \\
&= \sum_{t=0}^\infty \gamma^t\, \mathbb{E}^\pi(r_t \mid s_0 = s) \\
&= \sum_{t=0}^\infty \gamma^t \sum_{i \in S} P^\pi(s_t = i \mid s_0 = s)\, \mathbb{E}(r_t \mid s_t = i).
\end{aligned}
$$
v = r + γPµ,π v. (6.5.2)
Proof of Theorem 6.5.1. First note that by manipulating (6.5.3), one obtains r = (I − γPµ,π)v^π. Since ‖γPµ,π‖ < 1 · ‖Pµ,π‖ = 1, the inverse
$$(I - \gamma P_{\mu,\pi})^{-1} = \lim_{n\to\infty} \sum_{t=0}^n (\gamma P_{\mu,\pi})^t$$
exists.
It is important to note that the entries of the matrix X = (I − γPµ,π)^{-1} are the expected number of discounted cumulative visits to each state s, starting from state s′ and following policy π. More specifically,
$$x(s, s') = \mathbb{E}^\pi_\mu\left\{\sum_{t=0}^\infty \gamma^t\, P^\pi_\mu(s_t = s \mid s_0 = s')\right\}. \qquad (6.5.5)$$
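As a small illustration, the sketch below evaluates a fixed policy exactly by solving the linear system v = r + γPv, i.e. v = (I − γP)^{-1}r; the transition matrix and rewards are made-up numbers.

```python
# Exact policy evaluation for a fixed policy: v = (I - gamma * P)^{-1} r.
import numpy as np

gamma = 0.9
P = np.array([[0.9, 0.1, 0.0],       # transition matrix P_{mu,pi} of the policy
              [0.0, 0.8, 0.2],
              [0.1, 0.0, 0.9]])
r = np.array([0.0, 0.0, 1.0])        # expected reward in each state
v = np.linalg.solve(np.eye(3) - gamma * P, r)
print(v)
```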
Definition 6.5.2 (Policy and Bellman operator). The linear operator of a pol-
icy π is
$$L_\pi v \triangleq r + \gamma P_\pi v. \qquad (6.5.6)$$
The (non-linear) Bellman operator in the space of value functions V is defined as:
$$L v \triangleq \sup_\pi \{r + \gamma P_\pi v\}, \qquad v \in V. \qquad (6.5.7)$$
We now show that the Bellman operator satisfies the following monotonicity
properties with respect to an arbitrary value vector v.
Theorem 6.5.3. Let $v^* \triangleq \sup_\pi v^\pi$. Then for any bounded r, it holds that for v ∈ V:
(1) If v ≥ L v, then v ≥ v∗.
(2) If v ≤ L v, then v ≤ v∗.
Proof. We first prove (1). A simple proof by induction over n shows that for any π
$$v \ge r + \gamma P_\pi v \ge \sum_{k=0}^{n-1} \gamma^k P_\pi^k r + \gamma^n P_\pi^n v.$$
Since $v^\pi = \sum_{t=0}^\infty \gamma^t P_\pi^t r$, it follows that
$$v - v^\pi \ge \gamma^n P_\pi^n v - \sum_{k=n}^\infty \gamma^k P_\pi^k r.$$
The first term on the right-hand side can be bounded by an arbitrary ǫ/2 for large enough n. Also note that
$$-\sum_{k=n}^\infty \gamma^k P_\pi^k r \ge -\frac{\gamma^n e}{1-\gamma},$$
with e being a unit vector, so this can be bounded by ǫ/2 as well. So for any π and ǫ > 0,
$$v \ge v^\pi - \epsilon,$$
so
$$v \ge \sup_\pi v^\pi.$$
The proof of (2) is analogous and yields $v \le v^\pi + \epsilon$.
A similar theorem can also be proven for the repeated application of the
Bellman operator
Theorem 6.5.4. Let V be the set of value vectors with Bellman operator L .
Then:
where the first inequality is due to the fact that P v ≥ P v ′ for any P . For the
second part,
$$L v_{N+k} = v_{N+k+1} = L^k L v_N \le L^k v_N = v_{N+k},$$
un+1 = T un = T n+1 u0
converges to u∗ .
Proof. The first claim follows from the contraction property. If u1, u2 are both fixed points then
$$\|u_1 - u_2\| = \|T u_1 - T u_2\| \le \gamma \|u_1 - u_2\|,$$
which implies $\|u_1 - u_2\| = 0$.
For any m ≥ 1,
$$\|u_{n+m} - u_n\| \le \sum_{k=0}^{m-1} \|u_{n+k+1} - u_{n+k}\| = \sum_{k=0}^{m-1} \|T^{n+k} u_1 - T^{n+k} u_0\| \le \sum_{k=0}^{m-1} \gamma^{n+k} \|u_1 - u_0\| = \frac{\gamma^n(1-\gamma^m)}{1-\gamma}\, \|u_1 - u_0\|.$$
Then, for any ǫ > 0, there is n such that kun+m − un k ≤ ǫ. Since X is a Banach
space, the sequence has a limit u∗ , which is unique.
Theorem 6.5.6. For γ ∈ [0, 1) the Bellman operator L is a contraction map-
ping in V.
Proof. Let v, v ′ ∈ V. Consider s ∈ S such that L v(s) ≥ L v ′ (s), and let
$$a^*_s \in \arg\max_{a \in A}\left\{ r(s) + \gamma \sum_{j \in S} p_\mu(j \mid s, a)\, v(j) \right\}.$$
Then, using the fact in (a) that $a^*_s$ is optimal for v, but not necessarily for v′, we have:
$$0 \le L v(s) - L v'(s) \overset{(a)}{\le} \gamma\sum_{j\in S} p(j \mid s, a^*_s)\, v(j) - \gamma\sum_{j\in S} p(j \mid s, a^*_s)\, v'(j) = \gamma \sum_{j \in S} p(j \mid s, a^*_s)\, [v(j) - v'(j)] \le \gamma \sum_{j\in S} p(j \mid s, a^*_s)\, \|v - v'\| = \gamma\, \|v - v'\|.$$
Taking the supremum over all possible s, the required result follows.
It is easy to show the same result for the Lπ operator, as a corollary to the
above theorem.
Theorem 6.5.7. For any MDP µ with discrete S, bounded r, and γ ∈ [0, 1), the operator Lπ is a contraction mapping in V.
Value iteration
Value iteration is a version of backwards induction for infinite-horizon discounted MDPs with a finite state space. Since the horizon is infinite, the algorithm also requires a stopping condition. This is typically done by comparing the change in the value function from one step to the next with a small threshold, as in the sketch below.
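A minimal sketch of value iteration for a small finite MDP is given below; the toy transition probabilities, rewards and threshold are arbitrary choices, and this is an illustration rather than the book's own listing.

```python
# Value iteration: repeatedly apply the Bellman operator until the value
# function stops changing by more than a small threshold.
import numpy as np

gamma, eps = 0.9, 1e-6
P = np.array([[[0.8, 0.2], [0.1, 0.9]],    # P[a, s, s']
              [[0.5, 0.5], [0.3, 0.7]]])
r = np.array([[0.0, 1.0],                  # r[s, a]
              [0.5, 0.0]])

v = np.zeros(2)
while True:
    q = r + gamma * np.einsum('ast,t->sa', P, v)   # Q(s, a) = r(s, a) + gamma * sum_s' P v
    v_new = q.max(axis=1)                          # apply the Bellman operator
    if np.max(np.abs(v_new - v)) < eps:
        break
    v = v_new
print("v* ~", v_new, "greedy policy:", q.argmax(axis=1))
```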
The termination condition of value iteration has been left unspecified. How-
ever, the theorem above shows that if we terminate when (6.5.8) is true, then
our error will be bounded by ǫ. However, better termination conditions can be
obtained.
Now let us prove how fast value iteration converges.
Theorem 6.5.9. Value iteration converges with error in O(γ^n). More specifically, for r ∈ [0, 1] and v0 = 0,
$$\|v_n - v^*\| \le \frac{\gamma^n}{1-\gamma}, \qquad \|V^{\pi_n}_\mu - v^*\| \le \frac{2\gamma^n}{1-\gamma}. \qquad (6.5.9)$$
Proof. The first part follows from the contraction property (Theorem 6.5.6):
Now divide by γ n to obtain the final result. The second part follows from ele-
mentary analysis. If a function f (x) is maximised at x∗ , while g(y) is maximised
at y ∗ and |f − g| ≤ ǫ, then |f (x∗ ) − f (y ∗ )| ≤ 2ǫ. Since πn maximises vn and vn
is γ n /(1 − γ)-close to v ∗ , the result follows.
Policy iteration
Unlike value iteration, policy iteration attempts to iteratively improve a given
policy, rather than a value function. At each iteration, it calculates the value
of the current policy and then calculates the policy that is greedy with respect
to this value function. For finite MDPs, the policy evaluation step can be
performed with either linear algebra or backwards induction, while the policy
improvement step is trivial. The algorithm described below can be extended to
the case when the reward also depends on the action, by replacing r with the
policy-dependent reward vector rπ .
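For comparison, here is a sketch of policy iteration on the same kind of toy MDP, alternating exact policy evaluation (by linear algebra) with greedy improvement; again, the MDP is made up and the listing is only illustrative.

```python
# Policy iteration: evaluate the current policy exactly, then improve it
# greedily, stopping once the policy no longer changes.
import numpy as np

gamma = 0.9
P = np.array([[[0.8, 0.2], [0.1, 0.9]],    # P[a, s, s']
              [[0.5, 0.5], [0.3, 0.7]]])
r = np.array([[0.0, 1.0],                  # r[s, a]
              [0.5, 0.0]])

policy = np.zeros(2, dtype=int)            # initial policy: action 0 everywhere
while True:
    # policy evaluation: v = (I - gamma * P_pi)^{-1} r_pi
    P_pi = P[policy, np.arange(2)]         # row s is P[policy[s], s, :]
    r_pi = r[np.arange(2), policy]
    v = np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)
    # policy improvement: greedy with respect to v
    q = r + gamma * np.einsum('ast,t->sa', P, v)
    new_policy = q.argmax(axis=1)
    if np.array_equal(new_policy, policy):
        break
    policy = new_policy
print("optimal policy:", policy, "value:", v)
```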
Theorem 6.5.10. Let vn , vn+1 be the value vectors generated by policy itera-
tion. Then vn ≤ vn+1 .
Proof. We have
$$r + \gamma P_{\pi_{n+1}} v_n \ge r + \gamma P_{\pi_n} v_n = v_n,$$
where the equality is due to the policy evaluation step for πn . Rearranging, we
get r ≥ (I − γPπn+1 )vn and hence
(I − γPπn+1 )−1 r ≥ vn ,
noting that the inverse is positive. Since the left side equals vn+1 by the policy
evaluation step for πn+1 , the theorem follows.
We can use the fact that the policies are monotonically improving to show
that policy iteration will terminate after a finite number of steps.
Proof. There is only a finite number of policies, and since policies in policy
iteration are monotonically improving, the algorithm must stop after finitely
many iterations. Finally, the last iteration satisfies
However, it is easy to see that the number of policies is $|A|^{|S|}$, thus the above corollary only guarantees exponential-time convergence in the number of states. It is also known that the complexity of policy iteration is strongly polynomial for any fixed γ [Ye, 2011], with the number of iterations required being
$$\frac{|S|^2 (|A|-1)}{1-\gamma} \cdot \ln\frac{|S|^2}{1-\gamma}.$$
Policy iteration seems to have very different behaviour from value iteration.
In fact, one can obtain families of algorithms that lie at the extreme ends of
the spectrum between policy iteration and value iteration. The first member
of this family is modified policy iteration, and the second member is temporal
difference policy iteration.
Modified policy iteration can perform much better than either pure value
iteration or pure policy iteration.
A geometric view
It is perhaps interesting to see the problem from a geometric perspective. This
also gives rise to the so-called “temporal-difference” set of algorithms. First, we
define the difference operator, which is the difference between a value function
vector v and its transformation via the Bellman operator.
Definition 6.5.3. The difference operator is defined as $B v \triangleq L v - v$.
Essentially, it is the change in the value function vector when we apply the
Bellman operator. Thus the Bellman optimality equation can be rewritten as
Bv = 0. (6.5.13)
Now let us define the set of greedy policies with respect to a value vector v ∈ V
to be:
$$\Pi_v \triangleq \arg\max_{\pi \in \Pi}\{r + (\gamma P_\pi - I)v\}.$$
We can now show the following inequality between the two different value func-
tion vectors.
Theorem 6.5.11. For any v, v ′ ∈ V and π ∈ Πv
Figure 6.7: The difference operator. The graph shows the effect of the operator
for the optimal value function v ∗ , and two arbitrary value functions, v1 , v2 . Each
line is the improvement effected by the greedy policy π ∗ , π1 , π2 with respect to
each value function v ∗ , v1 , v2 .
Theorem 6.5.12. Let {vn } be the sequence of value vectors obtained from policy
iteration. Then for any π ∈ Πvn ,
vn+1 = vn − (γPπ − I)−1 Bvn . (6.5.15)
Proof. By definition, we have for π ∈ Πvn
vn+1 = (I − γPπ )−1 r − vn + vn
= (I − γPπ )−1 [r − (I − γPπ )vn ] + vn .
Since r − (I − γPπ )vn = Bvn the claim follows.
Note the similarity to the difference operator in modified policy iteration. The idea of temporal-difference policy iteration is to adjust the current value vn using the temporal differences mixed over an infinite number of steps:
$$\tau_n(i) = \mathbb{E}^{\pi_n}\left[\sum_{t=0}^\infty (\gamma\lambda)^t\, d_n(s_t, s_{t+1}) \,\Big|\, s_0 = i\right], \qquad (6.5.18)$$
$$v_{n+1} = v_n + \tau_n. \qquad (6.5.19)$$
Here the λ parameter is a simple way to mix together the different temporal
difference errors. If λ → 1, our error will be dominated by the terms far in
the future, while if λ → 0, our error τn , will be dominated by the short-term
discrepancies in our value function. In the end, we shall adjust our value function
in the direction of this error.
Putting all of those steps together, we obtain the following algorithm:
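One way such a scheme can be realised is sketched below, under the assumption that the expected temporal-difference correction in (6.5.18) can be written as τn = (I − γλP_π)^{-1}(L vn − vn) for the greedy policy π; for λ = 1 this reduces to policy iteration and for λ = 0 to value iteration. The toy MDP is made up and the listing is not the book's own algorithm.

```python
# A lambda-mixed policy/value iteration sketch: compute the greedy policy for
# v_n, form tau_n = (I - gamma*lambda*P_pi)^{-1} (L v_n - v_n), and update.
import numpy as np

gamma, lam = 0.9, 0.5
P = np.array([[[0.8, 0.2], [0.1, 0.9]],    # P[a, s, s']
              [[0.5, 0.5], [0.3, 0.7]]])
r = np.array([0.0, 1.0])                   # reward r(s), depending on the state only

v = np.zeros(2)
for _ in range(50):
    q = r[:, None] + gamma * np.einsum('ast,t->sa', P, v)
    policy = q.argmax(axis=1)              # greedy policy with respect to v_n
    bellman_residual = q.max(axis=1) - v   # B v_n = L v_n - v_n
    P_pi = P[policy, np.arange(2)]
    tau = np.linalg.solve(np.eye(2) - gamma * lam * P_pi, bellman_residual)
    v = v + tau
print(v, policy)
```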
That is, if we repeatedly apply the above operator to some vector v, then
at some point we shall obtain a fixed point v ∗ = Dn v ∗ . It is interesting to
see what happens at the two extreme choices of λ in this case. For λ = 1,
this becomes identical to standard policy iteration, as the fixed point satisfies
v ∗ = Lπn+1 v ∗ , so then v ∗ must be the value of policy πn+1 . For λ = 0, one
obtains standard value iteration, as the fixed point is reached under one step
and is simply v ∗ = Lπn+1 vn , i.e. the approximate value of the one-step greedy
policy. In other words, the new value vector is moved only partially towards the
direction of the Bellman update, depending on how we choose λ.
Linear programming
Perhaps surprisingly, we can also solve Markov decision processes through linear
programming. The main idea is to reformulate the maximisation problem as
a linear optimisation problem with linear constraints. The first step in our
procedure is to recall that there is an easy way to determine whether a particular
v is an upper bound on the optimal value function v ∗ , since if
$v \ge L v$, then $v \ge v^*$ (by Theorem 6.5.3). Now consider an arbitrary distribution on the states $y \in \Delta^{|S|}$. Then we can write the following linear program.
$$\min_v\ y^\top v, \quad \text{such that} \quad v(s) - \gamma\, p_{s,a}^\top v \ge r(s, a) \qquad \forall a \in A,\ s \in S,$$
where we use $p_{s,a}$ to denote the vector of next-state probabilities p(j | s, a).
Note that the inequality condition is equivalent to v ≥ L v. Consequently,
the problem is to find the smallest v that satisfies this inequality. When A, S
are finite, it is easy to see that this will be the optimal value function and the
Bellman equation is satisfied.
It also pays to look at the dual linear program, which is in terms of a
maximisation. This time, instead of finding the minimal upper bound on the
value function, we find the maximal cumulative discounted state-action visits
x(s, a) that are consistent with the transition kernel of the process.
with y ∈ ∆|S| .
The equality condition ensures that x is consistent with the transition kernel
of the Markov decision process. Consequently, the program can be seen as search
among all possible cumulative state-action distributions to find the one giving
the highest total reward.
Consider µ such that there exists a policy π∗ for which $V^{\pi^*}$ exists and
$$\lim_{T\to\infty} V^{\pi^*,T} = V^{\pi^*} \ge \limsup_{T\to\infty} V^{\pi,T}.$$
$$g^\pi_+(s) \triangleq \limsup_{T\to\infty} \frac{1}{T} V^{\pi,T}(s), \qquad g^\pi_-(s) \triangleq \liminf_{T\to\infty} \frac{1}{T} V^{\pi,T}(s). \qquad (6.6.7)$$
6.7 Summary
Markov decision processes can represent shortest path problems, stopping prob-
lems, experiment design problems, multi-armed bandit problems and reinforce-
ment learning problems.
Bandit problems are the simplest type of Markov decision process, since they
have a fixed, never-changing state. However, to solve them, one can construct
a Markov decision processes in belief space, within a Bayesian framework. It is
then possible to apply backwards induction to find the optimal policy.
Backwards induction is applicable more generally to arbitrary Markov de-
cision processes. For the case of infinite-horizon problems, it is referred to as
value iteration, as it converges to a fixed point. It is tractable when either the
state space S or the horizon T are small (finite).
When the horizon is infinite, policy iteration can also be used to find optimal
policies. It is different from value iteration in that at every step, it fully evaluates
a policy before the improvement step, while value iteration only performs a
partial evaluation. In fact, at the n-th iteration, value iteration has calculated
the value of an n-step policy.
We can arbitrarily mix between the two extremes of policy iteration and
value iteration in two ways. Firstly, we can perform a k-step partial evaluation.
When k = 1, we obtain value iteration, and when k → ∞, we obtain policy iter-
ation. The generalised algorithm is called modified policy iteration. Secondly, we can adjust our value function by using a temporal-difference error of values at future time steps. Again, we can mix liberally between policy iteration and value iteration by focusing on errors far in the future (policy iteration) or on short-term errors (value iteration).
Finally, it is possible to solve MDPs through linear programming. This is
done by reformulating the problem as a linear optimisation with constraints.
In the primal formulation, we attempt to find a minimal upper bound on the
optimal value function. In the dual formulation, our goal is to find a distribution
on state-action visitations that maximises expected utility and is consistent with
the MDP model.
6.9 Exercises
6.9.1 Medical diagnosis
Exercise 28 (Continuation of exercise 24). Now consider the case where you have the choice between tests to perform. First, you observe S, whether or not the patient is a smoker. Then, you select a test to make: d1 ∈ {X-ray, ECG}. Finally, you decide whether or not to treat for ACS: d2 ∈ {heart treatment, no treatment}. An untreated ACS patient may die with probability 2%, while a treated one with probability 0.2%. Treating a non-ACS patient results in death with probability 0.1%.
1. Draw a decision diagram, where:
S is an observed random variable taking values in {0, 1}.
A is a hidden variable taking values in {0, 1}.
C is a hidden variable taking values in {0, 1}.
d1 is a choice variable, taking values in {X-ray, ECG}.
r1 is a result variable, taking values in {0, 1}, corresponding to negative and positive test results.
d2 is a choice variable, which depends on the test results, d1 and on S.
r2 is a result variable, taking values in {0, 1} corresponding to the patient
dying (0), or living (1).
2. Let d1 = X-ray, and assume the patient suffers from ACS, i.e. A = 1. How is
the posterior distributed?
3. What is the optimal decision rule for this problem?
Exercise 30 (120). In this case, we assume that the probability that the i-th algo-
rithm successfully solves the t-th task is always pi . Furthermore, tasks are in no way
distinguishable from each other. In each case, assume that pi ∈ {0.1, . . . , 0.9} and a prior distribution ξi(pi) = 1/9 for all i, with a complete belief $\xi(p) = \prod_i \xi_i(p_i)$, and
formulate the problem as a decision-theoretic n-armed bandit problem with reward
at time t being rt = 1 if the task is solved and rt = 0 if the problem is not solved.
Whether or not the task at time t is solved or not, at the next time-step we go to
the next problem. Our aim is to find a policy π mapping from the history of observa-
tions to selection of algorithms such that we maximise the total reward to time T in
expectation
$$\mathbb{E}_{\xi,\pi} U_0^T = \mathbb{E}_{\xi,\pi} \sum_{t=1}^T r_t.$$
using backwards induction for T ∈ {0, 1, 2, 3, 4} and report the expected utility
in each case. Hint: Use the decision-theoretic bandit formulation to dynami-
cally construct a Markov decision process which you can solve with backwards
induction. See also the extensive decision rule utility from exercise set 3.
3. Now utilise the backwards induction algorithm developed in the previous step
in a problem where we receive a sequence of N tasks to solve and our utility is
$$U_0^N = \sum_{t=1}^N r_t$$
At each step t ≤ N, find the optimal action by calculating $\mathbb{E}_{\xi,\pi} U_t^{t+T}$ for T ∈ {0, 1, 2, 3, 4} and take it. Hint: At each step you can update your belief using the same routine you use to update your prior distribution. You only need to consider T < N − t.
4. Develop a simple heuristic algorithm of your choice and compare its utility
with the utility of the backwards induction. Perform $10^3$ simulations, each
experiment running for $N = 10^3$ time-steps, and average the results. How does
the performance compare? Hint: If the program runs too slowly, go only up to
T = 3.
6.9.4 Scheduling
You are controlling a small processing network that is part of a big CPU farm.
You in fact control a set of n processing nodes. At time t, you may be given a
job of class xt ∈ X to execute. Assume these are identically and independently
drawn such that P(xt = k) = pk for all t, k. With some probability p0 , you are
not given a job to execute at the next step. If you do have a new job, then you
can either:
(a) Ignore the job, or
(b) Send the job to some node i. If the node is already active, then the previous
job is lost.
Not all the nodes and jobs are equal. Some nodes are better at processing
certain types of jobs. If the i-th node is running a job of type k ∈ X, then it has
a probability of finishing it within that time step equal to φi,k ∈ [0, 1]. Then
the node becomes free, and can accept a new job.
For this problem, assume that there are n = 3 nodes and k = 2 types of jobs
and that the completion probabilities are given by the following matrix:
\[
\Phi = \begin{pmatrix} 0.3 & 0.1 \\ 0.2 & 0.2 \\ 0.1 & 0.3 \end{pmatrix}
\tag{6.9.1}
\]
with γ = 0.9 and where we get a reward of 1 every time a job is completed.
More precisely, at each time step t, the following events happen:
1. A new job xt may arrive: with probability p0 there is no new job, otherwise
its class is k with probability pk.
2. Each node either continues processing, or completes its current job and
becomes free. You get a reward rt equal to the number of nodes that
complete their jobs within this step.
3. You decide whether to ignore the new job or add it to one of the nodes.
If you add a job, then it immediately starts running for the duration of
the time step. (If the job queue is empty then you cannot add a job to a
node, obviously)
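To get started on this exercise, the following is a minimal simulator sketch under the dynamics described above; the arrival probabilities p = (p0, p1, p2) and the random policy are illustrative assumptions, since the exercise leaves them open.

import numpy as np

rng = np.random.default_rng(0)
Phi = np.array([[0.3, 0.1],
                [0.2, 0.2],
                [0.1, 0.3]])        # completion probabilities from (6.9.1)
p = np.array([0.2, 0.5, 0.3])       # assumed P(no job), P(class 0), P(class 1)
gamma = 0.9

def step(jobs, action, new_job):
    """jobs[i] is the class run by node i, or None if the node is free.
    action is a node index or None (ignore the new job)."""
    reward = 0
    for i, k in enumerate(jobs):                    # busy nodes may finish
        if k is not None and rng.random() < Phi[i, k]:
            jobs[i] = None
            reward += 1
    if new_job is not None and action is not None:
        jobs[action] = new_job                      # overwriting loses the old job
    return jobs, reward

jobs, ret = [None, None, None], 0.0
for t in range(1000):
    u = rng.random()
    new_job = None if u < p[0] else (0 if u < p[0] + p[1] else 1)
    action = int(rng.integers(3)) if new_job is not None else None   # random policy
    jobs, reward = step(jobs, action, new_job)
    ret += gamma ** t * reward
print("discounted return of the random policy:", ret)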
Chapter 7

Simulation-based algorithms
7.1 Introduction
In this chapter, we consider the general problem of reinforcement learning in
dynamic environments. Up to now, we have only examined a solution method
for bandit problems, which are only a special case. The Bayesian decision-
theoretic solution is to reduce the bandit problem to a Markov decision process
which can then be solved with backwards induction.
We also have seen that Markov decision processes can be used to describe en-
vironments in more general reinforcement learning problems. When our knowl-
edge of the MDP describing these problems is perfect, then we can employ a
number of standard algorithms to find the optimal policy. However, in the ac-
tual reinforcement learning problem, the model of the environment is unknown.
Nevertheless, as we shall see later, both of these ideas can be combined to solve the
general reinforcement learning problem.
The main focus of this chapter is how to simultaneously learn about the
underlying process and act to maximise utility in an approximate way. This
can be done through approximate dynamic programming, where we replace the
actual unknown dynamics of the Markov decision process with estimates. The
estimates can be improved by drawing samples from the environment, either by
acting within the real environment or using a simulator. In both cases we end
up with a number of algorithms that can be used for reinforcement learning.
Although they may not perform as well as the Bayes-optimal solution, these algorithms
have a low enough computational complexity that they are worth investigating
in practice.
It is important to note that the algorithms in this chapter can be quite far
from optimal. They may converge eventually to an optimal policy, but they
may not accumulate a lot of reward while still learning. In that sense, they
are not solving the full reinforcement learning problem because their online
performance can be quite low.
For simplicity, we shall first return to the example of bandit problems. As
before, we have n actions corresponding to probability distributions Pi on the
real numbers {Pi | i = 1, . . . , n}, and our aim is to maximise the total reward (in
expectation). Had we known the distributions, we could simply always take the max-
imising action, as the expected reward of the i-th action can be easily calculated
from Pi and the reward only depends on our current action.
As the Pi are unknown, we must use a history-dependent policy. In the
remainder of this section, we shall examine algorithms which asymptotically
converge to the optimal policy (which, in the case of bandits, corresponds to
always pulling the best arm), but for which we cannot always guarantee
a good initial behaviour.
steps that on average move towards the solution, in a way to be made more
precise later. The stochastic approximation actually defines a large class of
procedures, and it contains stochastic gradient descent as a special case.
The main two parameters of the algorithm are the amount of randomness
in the ǫ-greedy action selection and the step size α in the estimation. Both of
them have a significant effect on the performance of the algorithm. Although we
could vary them with time, it is perhaps instructive to look at what happens for
fixed values of ǫ, α. Figure 7.1 shows the average reward obtained if we keep
the step size α or the randomness ǫ fixed, respectively, with initial estimates
µ0,i = 0.
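The experiment can be reproduced in a few lines; the sketch below assumes Gaussian rewards with illustrative means and uses the constant-parameter updates discussed above.

import numpy as np

rng = np.random.default_rng(0)
means = np.array([0.3, 0.5, 0.7])       # assumed true mean rewards
eps, alpha = 0.1, 0.01                   # fixed exploration rate and step size
mu = np.zeros(len(means))                # initial estimates mu_{0,i} = 0

total = 0.0
for t in range(10_000):
    if rng.random() < eps:
        a = int(rng.integers(len(means)))      # explore
    else:
        a = int(np.argmax(mu))                 # exploit the current estimates
    r = rng.normal(means[a], 1.0)
    mu[a] += alpha * (r - mu[a])               # constant step-size update
    total += r
print("average reward:", total / 10_000)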
Figure 7.1: Average reward obtained over $10^4$ steps. (a) For the case of fixed ǫt = 0.1, the step size is α ∈ {0.01, 0.1, 0.5}. (b) For the case of fixed α, the exploration rate is ǫ ∈ {0, 0.01, 0.1}.
For a fixed ǫ, we find that larger values of α tend to give a better result
eventually, while smaller values have a better initial performance. This is a
natural trade-off, since large α appears to “learn” fast, but it also “forgets”
quickly. That is, for a large α, our estimates mostly depend upon the last few
rewards observed.
Things are not so clear-cut for the choice of ǫ. We see that the choice of
ǫ = 0 is significantly worse than ǫ = 0.1. So, this appears to suggest that there
is an optimal level of exploration. How should that be determined? Ideally, we
should be able to use the decision-theoretic solution seen earlier, but perhaps
a good heuristic way of choosing ǫ may be good enough.
zt+1 = xt+1 − µt .
The first one, $\alpha_t = 1/t$, satisfies both assumptions. The second one, $\alpha_t = 1/\sqrt{t}$, reduces too slowly, and the third one, $\alpha_t = t^{-3/2}$, approaches zero too fast.
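The two assumptions referred to here are presumably the standard stochastic approximation (Robbins-Monro) conditions on the step sizes,
\[
\sum_{t=1}^{\infty} \alpha_t = \infty,
\qquad
\sum_{t=1}^{\infty} \alpha_t^2 < \infty ;
\]
indeed, $\alpha_t = 1/\sqrt{t}$ violates the second condition, while $\alpha_t = t^{-3/2}$ violates the first.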
Figure 7.2: The estimate µt over time for the step-size schedules $\alpha_t = 1/t$, $1/\sqrt{t}$ and $t^{-3/2}$.
Example 35 (The chain task). The chain task has two actions and five states, as
shown in Fig. 7.3. The reward in the leftmost state is 0.2 and 1.0 in the rightmost
state, and zero otherwise. The first action (dashed, blue) takes you to the right,
while the second action (solid, red) takes you to the first state. However, there is a
probability 0.2 with which the actions have the opposite effects. The value function
of the chain task for a discount factor γ = 0.95 is shown in Table 7.1.
The chain task is a very simple, but well-known task, used to test the efficacy
of reinforcement learning algorithms. In particular, it is useful for analysing how
algorithms solve the exploration-exploitation trade-off, since in the short run
simply moving to the leftmost state is advantageous. For a long enough horizon
or large enough discount factor, algorithms should be incentivised to more fully
explore the state space. A variant of this task, with action-dependent rewards
(but otherwise equivalent) was used by [Dearden et al., 1998].
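A minimal sketch constructing the chain task as arrays, following the description above; the choice that pushing right in the rightmost state leaves the state unchanged is an assumption of the sketch.

import numpy as np

n_states, slip = 5, 0.2
P = np.zeros((2, n_states, n_states))           # P[a, s, s']
for s in range(n_states):
    right = min(s + 1, n_states - 1)            # intended effect of "go right"
    for a, (intended, other) in enumerate([(right, 0), (0, right)]):
        P[a, s, intended] += 1 - slip           # intended effect of the action
        P[a, s, other] += slip                  # actions swapped with probability 0.2
r = np.zeros(n_states)
r[0], r[-1] = 0.2, 1.0                          # rewards of the leftmost/rightmost states

assert np.allclose(P.sum(axis=2), 1.0)          # rows are probability distributions
# Value iteration or policy iteration can now be run on (P, r) with gamma = 0.95.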
1 This actually depends on what the exact setting is. If the environment is a simulation,
then we could try and start from an arbitrary state, but in the reinforcement learning setting
this is not the case.
Figure 7.3: The chain task, with rewards 0.2, 0, 0, 0, 1 in states s1, . . . , s5.

Table 7.1: Value function of the chain task for discount factor γ = 0.95.
s           s1        s2        s3        s4        s5
V ∗ (s)     7.6324    7.8714    8.4490    9.2090    10.2090
Q∗ (s, 1)   7.4962    7.4060    7.5504    7.7404    8.7404
Q∗ (s, 2)   7.6324    7.8714    8.4490    9.2090    10.2090
where $r_t^{(k)}$ is the sequence of rewards obtained from the k-th trajectory.
Remark 7.2.1. With probability 1 − δ, the estimate V̂ of the Monte Carlo eval-
uation algorithm satisfies
\[
\|V^\pi_{\mu,0} - \hat V\|_\infty := \max_s |V^\pi_{\mu,0}(s) - \hat V(s)| \le \sqrt{\frac{\ln(2|S|/\delta)}{2K}} .
\]
Proof. From Hoeffding's inequality (4.5.5) we have for any state s that
\[
\mathbb{P}\left( |V^\pi_{\mu,0}(s) - \hat V(s)| \ge \sqrt{\frac{\ln(2|S|/\delta)}{2K}} \right) \le \frac{\delta}{|S|} .
\]
Consequently, using a union bound of the form $\mathbb{P}(A_1 \cup A_2 \cup \ldots \cup A_n) \le \sum_i \mathbb{P}(A_i)$
gives the required result.
The main advantage of Monte-Carlo policy evaluation is that it can be used
in very general settings. It can be used not only in Markovian environments
such as MDPs, but also in partially observable and multi-agent settings.
For αk = 1/k and iterating over all S, this is the same as Monte-Carlo policy
evaluation.
In order to avoid the bias, we must instead look at only the first visit to
every state. This eliminates the dependence between states and is called the
first visit Monte-Carlo update.
Figure 7.4: Error $\|v_t - V^\pi\|$ as the number of iterations n increases, for first- and every-visit Monte Carlo estimation.
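A minimal sketch of first-visit Monte-Carlo policy evaluation; the episode-sampling function is a hypothetical stand-in for running the policy on the environment or a simulator.

import numpy as np

def first_visit_mc(sample_episode, n_states, gamma, n_episodes):
    """Estimate V^pi by averaging, for each state, the return following
    its first visit in each sampled trajectory."""
    totals = np.zeros(n_states)
    counts = np.zeros(n_states)
    for _ in range(n_episodes):
        episode = sample_episode()            # list of (state, reward) pairs
        G, first_returns = 0.0, {}
        for s, rew in reversed(episode):      # accumulate returns backwards
            G = rew + gamma * G
            first_returns[s] = G              # overwritten until the first visit remains
        for s, G in first_returns.items():
            totals[s] += G
            counts[s] += 1
    return totals / np.maximum(counts, 1)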
Using the temporal difference error $d(s_t, s_{t+1}) = r(s_t) + \gamma v(s_{t+1}) - v(s_t)$, we
can now rewrite the full stochastic update in terms of the temporal-difference
error:
\[
v_{k+1}(s) = v_k(s) + \alpha \sum_{t} \gamma^t d_t, \qquad d_t \triangleq d(s_t, s_{t+1})
\tag{7.2.2}
\]
We have now converted the full stochastic update into an incremental update
that is nevertheless equivalent to the old update. Let us see how we can gener-
alise this to the case where we have a mixture of temporal differences.
TD(λ)
Recall the temporal difference update when the MDP is given in analytic
form:
\[
v_{n+1}(i) = v_n(i) + \tau_n(i), \qquad \tau_n(i) \triangleq \mathbb{E}_{\pi_n,\mu}\left[ \sum_{t=0}^{\infty} (\gamma\lambda)^t d_n(s_t, s_{t+1}) \,\middle|\, s_0 = i \right] .
\]
Figure 7.5: Comparison of replacing and cumulative traces.
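An online version of this update can be implemented with eligibility traces; the sketch below supports both the cumulative and the replacing variant and assumes a hypothetical environment interface with reset() and step().

import numpy as np

def td_lambda(env, policy, n_states, gamma, lam, alpha, n_steps, trace="cumulative"):
    v = np.zeros(n_states)
    e = np.zeros(n_states)                    # eligibility traces
    s = env.reset()
    for _ in range(n_steps):
        s_next, rew = env.step(policy(s))
        d = rew + gamma * v[s_next] - v[s]    # temporal-difference error
        e *= gamma * lam                      # decay all traces
        if trace == "replacing":
            e[s] = 1.0
        else:
            e[s] += 1.0                       # cumulative (accumulating) trace
        v += alpha * d * e                    # credit all recently visited states
        s = s_next
    return v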
In the following figures, we can see the error in value function estimation
in the chain task when using simulation-based value iteration. It is always a
better idea to use an initial value v0 that is an upper bound on the optimal value
function, if such a value is known. This is due to the fact that in that case,
convergence is always guaranteed when using simulation-based value iteration,
as long as the policy that we are using is proper.2
Figure 7.6: Value function estimation error (log scale) against t for simulation-based value iteration on the chain task (two panels), for parameter values 1.0, 0.5, 0.1 and 0.01.
As can be seen in Figure 7.6, the value function estimation error of simulation-
based value iteration is highly dependent upon the initial value function esti-
mate v0 and the exploration parameter ǫ. It is interesting to see that uniform sweeps
(ǫ = 1) result in the lowest estimation error in terms of the value function L1
norm.
Q-learning
Simulation-based value iteration can be suitably modified for the actual rein-
forcement learning problem. Instead of relying on a model of the environment,
we replace arbitrary random sweeps of the state-space with the actual state se-
quence observed in the real environment. We also use this sequence as a simple
way to estimate the transition probabilities.
2 In the case of discounted non-episodic problems, this amounts to a geometric stopping
time distribution, after which the state is drawn from the initial state distribution.
Algorithm 17 Q-learning
1: Input µ, S, ǫt, αt.
2: Initialise s1 ∈ S, q0 ∈ V.
3: for t = 1, 2, . . . do
4:    at ∼ π̂ ∗ǫt (a | st, qt)   (ǫt-greedy with respect to qt)
5:    Observe st+1 ∼ Pµ(st+1 | st, at) and reward r(st).
6:    qt+1(st, at) = (1 − αt) qt(st, at) + αt [r(st) + γ vt(st+1)], where vt(s) = maxa∈A qt(s, a).
7: end for
8: Return the final q-value function and the greedy policy with respect to it.
\[
q_t(s, a) = r(s, a) + \gamma \sum_{s'} P_\mu(s' \mid s, a)\, v_{t-1}(s'), \qquad
\pi_t(s) = \arg\max_{a} q_t(s, a), \qquad
v_t(s) = \max_{a} q_t(s, a) .
\]
The result is Q-learning (Algorithm 17), one of the most well-known and
simplest algorithms in reinforcement learning. In light of the previous theory,
it can be seen as a stochastic value iteration algorithm, where at every step t,
given the partial observation (st , at , st+1 ) you have an approximate transition
model for the MDP which is as follows:
\[
P(s' \mid s_t, a_t) = \begin{cases} 1, & \text{if } s_{t+1} = s' \\ 0, & \text{if } s_{t+1} \ne s' \end{cases}
\tag{7.2.5}
\]
Even though this model is very simplistic, it still seems to work relatively well in
practice, and the algorithm is simple to implement. In addition, since we cannot
arbitrarily select states in the real environment, we replace the state-exploring
parameter ǫ with a time-dependent exploration parameter ǫt for the policy we
employ on the real environment.
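A minimal sketch of tabular Q-learning with visit-count-dependent parameters of the kind used in the experiment below; the environment interface (reset/step) is an assumption of the sketch.

import numpy as np

def q_learning(env, n_states, n_actions, gamma, n_steps,
               eps=lambda n: 1.0 / n, alpha=lambda n: n ** (-2.0 / 3.0)):
    q = np.zeros((n_states, n_actions))
    visits = np.zeros(n_states)
    rng = np.random.default_rng(0)
    s = env.reset()
    for _ in range(n_steps):
        visits[s] += 1
        if rng.random() < eps(visits[s]):          # epsilon_t-greedy exploration
            a = int(rng.integers(n_actions))
        else:
            a = int(np.argmax(q[s]))
        s_next, rew = env.step(a)
        step_size = alpha(visits[s])
        q[s, a] = (1 - step_size) * q[s, a] + step_size * (rew + gamma * np.max(q[s_next]))
        s = s_next
    return q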
Figure 7.7: Q-learning on the Chain task with $v_0 = 1/(1-\gamma)$, $\epsilon_t = 1/n_{s_t}$ and $\alpha_t = \alpha\, n_{s_t}^{-2/3}$ for $\alpha \in \{0.01, 0.05, 0.1, 0.5, 1.0\}$. (a) Value function error. (b) Regret.
Figure 7.7 shows the performance of the basic Q-learning algorithm for the
Chain task, in terms of value function error and regret. In this particular
implementation, we used a polynomially decreasing exploration parameter ǫt
and step size αt . Both of these depend on the number of visits to a particular
state and so perform more efficient Q-learning.
Of course, one could get any algorithm in between pure Q-learning and pure
stochastic value iteration. In fact, variants of the Q-learning algorithm using
eligibility traces (see Section 7.2.4) can be formulated in this way.
It is instructive to examine special cases for these parameters. For the case
when σt = 1, αt = 1, and when µ̂t = µ, we obtain standard value iteration. For
the case when σt(s, a) = I {st = s ∧ at = a} and µ̂t is the simple empirical model
of (7.2.5), we recover Q-learning.
7.3 Discussion
Most of these algorithms are quite simple, and so clearly demonstrate the prin-
ciple of learning by reinforcement. However, they do not aim to solve the rein-
forcement learning problem optimally. They have been mostly of use for finding
near-optimal policies given access to samples from a simulator, as used for ex-
ample to learn to play Atari games Mnih et al. [2015]. However, even in this
case, a crucial issue is how much data is needed in the first place to approach
optimal play. The second issue is using such methods for online reinforcement
learning, i.e. in order to maximise expected utility while still learning.
7.4 Exercises
Exercise 33 (180). This is a continuation of exercise 28. Create a reinforcement
learning version of the diagnostic model from exercise 28. In comparison to that
exercise, here the doctor is allowed to take zero, one, or two diagnostic actions.
View the treatment of each patient as a single episode and design an appropriate
state and action space to apply the standard MDP framework: note that all episodes
run for at least 2 steps, and there is a different set of actions available at each state:
the initial state only has diagnostic actions, while any treatment action terminates the
episode and returns a result.
1. Define the state and action space for each state.
2. Create a simulation of this problem, according to the probabilities mentioned in
Exercise 28.
3. Apply a simulation-based algorithm such as Q-learning to this problem. How
much time does it take to perform well? Can you improve it so as to take into
account the problem structure?
Exercise 34. It is well-known that the value function of a policy π for an MDP
µ with state reward function r can be written as the solution of a linear equation
Vµπ = (I −γPµπ )−1 r, where the term Φπµ , (I −γPµπ )−1 can be seen as a feature matrix.
However, Sarsa and other simulation-based algorithms only approximate the value
function directly rather than Φπµ . This means that, if the reward function changes,
they have to be restarted from scratch. Is there a way to rectify this?4
1. Develop and test a simulation-based algorithm (such as Sarsa) for estimating
Φπµ, and prove its asymptotic convergence. Hint: focus on the fact that you'd
like to estimate a value function for all possible reward functions.
2. Consider a model-based approach, where we build an empirical transition kernel
Pµπ. How good are our value function estimates in the first versus the second
approach? Why would you expect either one to be better?
3. Can the same idea be extended to Q-learning?
4 This exercise stems from a discussion with Peter Auer in 2012 about this problem.
Chapter 8

Approximate representations
8.1 Introduction
In this chapter, we consider methods for approximating value functions, policies,
or transition kernels. This is particularly useful when the state or policy space
is large, so that one has to use some parametrisation that may not include the
true value function, policy, or transition kernel. In general, we shall assume the
existence of either some approximate value function space VΘ or some approxi-
mate policy space ΠΘ , which are the set of allowed value functions and policies,
respectively. For the purposes of this chapter, we will assume that we have
access to some simulator or approximate model of the transition probabilities,
wherever necessary. Model-based reinforcement learning where the transition
probabilities are explicitly estimated will be examined in the next two chapters.
As an introduction, let us start with the case where we have a value function
space V and some value function v ∈ V that is our best approximation of the
optimal value function. Then we can define the greedy policy with respect to v
as follows:
Definition 8.1.1 (v-greedy policy and value function).
\[
\pi^*_u \in \arg\max_{\pi \in \Pi} L_\pi u, \qquad v^*_u = L u,
\]
can be done by minimising the difference between the target value u and the
approximation vθ , that is,
\[
\|v_\theta - u\|_\phi = \int_S |v_\theta(s) - u(s)| \, d\phi(s)
\tag{8.1.1}
\]
S
Clearly, none of the given functions is a perfect fit. In addition, finding the best overall
fit requires minimising an integral. So, for this problem we choose a random set of
points X = {xt } on which to evaluate the fit, with φ(xt ) = 1 for every point xt ∈ X.
This is illustrated in Figure 8.1, which shows the error of the functions at the selected
points, as well as their cumulative error.
In the example above, the approximation space V does not have a member
that is sufficiently close to the target value function. It could be that a larger
function space contains a better approximation. However, it may be difficult to
find the best fit in an arbitrary set V.
where the norm within the integral is usually the L1 norm. For a finite action
space, this corresponds to $\|\pi(\cdot \mid s) - \pi'(\cdot \mid s)\| = \sum_{a \in A} |\pi(a \mid s) - \pi'(a \mid s)|$, but
Figure 8.1: (a) The target function u and the three candidates v1, v2, v3. (b) The errors at the chosen points. (c) The total error of each candidate.
certainly other norms may be used and are sometimes more convenient. The
optimisation problem corresponding to fitting an approximate policy from a set
of policies ΠΘ to a target policy π is shown below.
\[
\min_{\theta \in \Theta} \left\| \pi_\theta - \pi^*_u \right\|_\phi , \qquad \text{where } \pi^*_u = \arg\max_{\pi \in \Pi} L_\pi u .
\]
Once more, the minimisation problem may not be trivial, but there are
some cases where it is particularly easy. One of these is when the policies can
be efficiently enumerated, as in the example below.
Example 38 (Fitting a finite space of policies). For simplicity, consider the space of
deterministic policies with a binary action space A = {0, 1}. Then each policy can be
represented as a simple mapping π : S → {0, 1}, corresponding to a binary partition
of the state space. In this example, the state space is the 2-dimensional unit cube,
S = [0, 1]2 . Figure 8.2 shows an example policy, where the light red and light green
areas represent taking action 1 and 0, respectively. The measure φ has support only
Figure 8.2: An example policy. The red areas indicate taking action 1, and
the green areas action 0. The φ measure has finite support, indicated by the
crosses and circles. The blue and magenta lines indicate two possible policies
that separate the state space with a hyperplane.
on the crosses and circles, which indicate the action taken at that location. Consider a
policy space Π consisting of just four policies. Each set of two policies is indicated by
the magenta (dashed) and blue (dotted) lines in Figure 8.2. Each line corresponds to
two possible policies, one selecting action 1 in the high region, and the other selecting
action 0 instead. In terms of our error metric, the best policy is the one that makes
the fewest mistakes. Consequently, the best policy in this set is to use the blue line and
play action 1 (red) in the top-right region.
8.1.3 Features
Frequently, when dealing with large, or complicated spaces, it pays off to project
(observed) states and actions onto a feature space X . In that way, we can make
problems much more manageable. Generally speaking, a feature mapping is
defined as follows.
Feature mapping
For X ⊂ Rn, a feature mapping f : S × A → X can be written in vector
form as
\[
f(s, a) = \begin{pmatrix} f_1(s, a) \\ \vdots \\ f_n(s, a) \end{pmatrix} .
\]
Obviously, one can define feature mappings f : S → X for states only in a
similar manner.
Example 39 (Radial Basis Functions). Let d be a metric on S × A and define the set
of centroids {(si , ai ) | i = 1, . . . , n}. Then we define each element of f as:
Figure: an example radial basis function over $[0, 1]^2$.
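Since the precise form of the basis functions is left open here, the following sketch uses the common Gaussian choice $f_i(s) = \exp(-\|s - s_i\|^2 / 2\beta^2)$ over state centroids on a grid; both this functional form and the bandwidth β are assumptions of the sketch.

import numpy as np

def rbf_features(centroids, beta):
    """Return a feature map s -> (f_1(s), ..., f_n(s)) of Gaussian RBFs."""
    def f(s):
        d2 = np.sum((centroids - np.asarray(s)) ** 2, axis=1)  # squared distances
        return np.exp(-d2 / (2 * beta ** 2))
    return f

grid = np.linspace(0.0, 1.0, 4)
centroids = np.array([[x, y] for x in grid for y in grid])      # 4x4 grid of centres
f = rbf_features(centroids, beta=0.25)
print(f([0.3, 0.7]))                                            # a 16-dimensional feature vector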
1. If $S \ne R \in \mathcal{G}$, then $S \cap R = \emptyset$.
2. $\bigcup_{S \in \mathcal{G}} S = X$.
Multiple tilings create a cover and can be used without many difficulties with
most discrete reinforcement learning algorithms, cf. Sutton and Barto [1998].
Look-ahead policies
Given an approximate value function u, the transition model Pµ of the MDP
and the expected rewards rµ , we can always find the improving policy given in
Def. 8.1.1 via the following single-step look-ahead.
Single-step look-ahead
Let $q(i, a) \triangleq r_\mu(i, a) + \gamma \sum_{j \in S} P_\mu(j \mid i, a)\, u(j)$. Then the single-step look-ahead policy is defined as $\pi(i) \in \arg\max_{a \in A} q(i, a)$.
T -step look-ahead
Define uk recursively as:
\[
u_0 = u, \qquad
q_k(i, a) = r_\mu(i, a) + \gamma \sum_{j \in S} P_\mu(j \mid i, a)\, u_{k-1}(j), \qquad
u_k(i) = \max_{a \in A} q_k(i, a) .
\]
Rollout policies
As we have seen in Section 7.2.2 one way to obtain an approximate value
function of an arbitrary policy π is to use Monte Carlo estimation, that is, to
simulate several sequences of state-action-reward tuples by running the policy
on the MDP. More specifically, we have the following rollout estimate.
\[
\min_{\theta} \left\| \pi_\theta - \pi^*_q \right\|_\phi .
\]
Choosing the model representation is only the first step. We now have to
use it to represent a specific value function. In order to do this, as before we
first pick a set of representative states Ŝ to fit our value function vθ to v. This
type of estimation can be seen as a regression problem, where the observations
are value function measurements at different states.
where the total prediction error is $c(\theta) = \sum_{s \in \hat S} c_s(\theta)$. The goal is to find
a θ minimising c(θ).
Minimising this error can be done using gradient descent, which is a gen-
eral algorithm for finding local minima of smooth cost functions. Generally,
minimising a real-valued cost function c(θ) with gradient descent involves an
algorithm iteratively approximating the value minimising c:
where ∇θ vθ (s) = f (s). Taking partial derivatives ∂/∂θj leads to the update rule
\[
\theta_j^{(n+1)} = \theta_j^{(n)} - 2\alpha\, \phi(s)\, [v_{\theta^{(n)}}(s) - u(s)]\, f_j(s) .
\tag{8.1.5}
\]
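A minimal sketch of this update for a linear value function $v_\theta(s) = \theta^\top f(s)$; the feature map f, target u, measure φ and the set of representative states are hypothetical inputs.

import numpy as np

def fit_value_function(states, f, u, phi, alpha, n_sweeps, n_features):
    theta = np.zeros(n_features)
    for _ in range(n_sweeps):
        for s in states:
            features = f(s)
            v = theta @ features                                   # v_theta(s)
            theta -= 2 * alpha * phi(s) * (v - u(s)) * features    # update (8.1.5)
    return theta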
whereby the Bellman error acts as a regulariser ensuring that our approximation
is indeed as consistent as possible.
\[
\pi_\theta(a \mid s) = \frac{g_\theta(s, a)}{h_\theta(s)},
\qquad \text{where } g_\theta(s, a) = \ell\!\left( \sum_{i=1}^{n} \theta_i f_i(s, a) \right)
\text{ and } h_\theta(s) = \sum_{b \in A} g_\theta(s, b) .
\]
The link function ℓ ensures that the denominator is positive, and the policy is
a distribution over actions. An alternative method would be to directly constrain
the policy parameters so the result is always a distribution, but this would result
in a constrained optimisation problem. A typical choice for the link function is
ℓ(x) = exp(x), which results in the softmax family of policies.
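A minimal sketch of this parametrisation with ℓ(x) = exp(x) and linear features; the feature map f and parameter vector θ are hypothetical inputs.

import numpy as np

def softmax_policy(theta, f, s, actions):
    """Return the probability vector pi_theta(. | s) over the given actions."""
    scores = np.array([theta @ f(s, a) for a in actions])
    scores -= scores.max()                 # subtract the maximum for numerical stability
    g = np.exp(scores)                     # g_theta(s, a) with l(x) = exp(x)
    return g / g.sum()                     # h_theta(s) is the normalising sum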
In order to fit a policy, we first pick a set of representative states Ŝ and
then find a policy πθ that approximates a target policy π, which is typically the
greedy policy with respect to some value function. In order to do so, we can
define an appropriate cost function and then estimate the optimal parameters
via some arbitrary optimisation method.
Once more, we can use gradient descent to minimise the cost function. We
obtain different results for different norms, but the three cases of main interest
are p = 1, p = 2, and p → ∞. We present the first one here, and leave the others
as an exercise.
At the k-th iteration of the policy improvement step the approximate value
vk−1 of the previous policy πk−1 is used to obtain an improved policy πk . How-
ever, note that we may not be able to implement the policy arg maxπ Lπ vk−1
for two reasons. Firstly, the policy space Π̂ may not include all possible policies.
Secondly, the Bellman operator is in general also only an approximation. In the
policy evaluation step, we aim at finding the function vk that is the closest to
the true value function of policy πk . However, even if the value function space
V̂ is rich enough, the minimisation is done over a norm that integrates over a
finite subset of the state space. The following section discusses the effect of
those errors on the convergence of approximate policy iteration.
Theorem 8.2.1. Consider a finite MDP µ with discount factor γ < 1 and a
vector u ∈ V such that $\|u - V^*_\mu\|_\infty = \epsilon$. If π is the u-greedy policy then
\[
\|V^\pi_\mu - V^*_\mu\|_\infty \le \frac{2\gamma\epsilon}{1-\gamma} .
\]
Proof. Recall that L is the one-step Bellman operator and Lπ is the one-step
policy operator on the value function. Then (skipping the index for µ)
\begin{align*}
\|V^\pi - V^*\|_\infty &= \|L_\pi V^\pi - V^*\|_\infty \\
&\le \|L_\pi V^\pi - L_\pi u\|_\infty + \|L_\pi u - V^*\|_\infty \\
&\le \gamma \|V^\pi - u\|_\infty + \|L u - V^*\|_\infty \\
&\le \gamma \|V^\pi - V^*\|_\infty + \gamma \|V^* - u\|_\infty + \gamma \|u - V^*\|_\infty \\
&\le \gamma \|V^\pi - V^*\|_\infty + 2\gamma\epsilon,
\end{align*}
and rearranging gives the claimed bound.
Building on this result, we can prove a simple bound for approximate policy
iteration, assuming uniform error bounds on the approximation of the value of
a policy as well as on the approximate Bellman operator. Even though these
assumptions are quite strong, we still only obtain the following rather weak
asymptotic convergence result.2
Theorem 8.2.2 (Bertsekas and Tsitsiklis [1996], Proposition 6.2). Assume that
there are ǫ, δ such that, for all k, the iterates vk, πk satisfy
\[
\|v_k - V^{\pi_k}\|_\infty \le \epsilon,
\qquad
\|L_{\pi_{k+1}} v_k - L v_k\|_\infty \le \delta .
\]
Then
\[
\limsup_{k \to \infty} \|V^{\pi_k} - V^*\|_\infty \le \frac{\delta + 2\gamma\epsilon}{(1-\gamma)^2} .
\tag{8.2.1}
\]
2 For δ = 0, this is identical to the result for ǫ-equivalent MDPs by Even-Dar and Mansour
[2003].
If we have collected data, we can use the empirical state distribution to select
starting states. In general, rollouts give us estimates qk , which are used to select
states for further rollouts. That is, we compute for each state s actions
Then we select a state sn maximising the upper bound value Un (s) defined via
where c(s) is the number of rollouts from state s. If the sampling of a state s
stops whenever
\[
\hat\Delta_k(s) \ge \sqrt{ \frac{2}{c(s)(1-\gamma)^2} \ln \frac{|A| - 1}{\delta} },
\tag{8.2.2}
\]
then we are certain that the optimal action has been identified with probability
1−δ for that state, due to Hoeffding’s inequality. Unfortunately, guaranteeing a
policy improvement for the complete state space is impossible, even with strong
assumptions.3
3 First, note that if we need to identify the optimal action for k states, then the above
stopping rule has an overall error probability of kδ. In addition, even if we assume that value
functions are smooth, it will be impossible to identify the boundary in the state space where
the optimal policy should switch actions [Dimitrakakis and Lagoudakis, 2008a].
v = r + γPµ,π v
v = (I − γPµ,π )−1 r.
Here we consider the setting where we do not have access to the transition ma-
trix, but instead have some observations of transition (st , at , st+1 ). In addition,
our state space can be continuous (e.g., S ⊂ Rn ), so that the transition matrix
becomes a general transition kernel. Consequently, the set of value functions V
becomes a Hilbert space, while it previously was a Euclidean subset.
In general, we deal with this case via projections. We project from the
infinite-dimensional Hilbert space to one with finite dimension on a subset of
states: namely, the ones that we have observed. We also replace the transition
kernel with the empirical transition matrix on the observed states.
\[
\Phi\theta = r + \gamma P_{\mu,\pi} \Phi\theta,
\qquad
\theta = \left[ (I - \gamma P_{\mu,\pi}) \Phi \right]^{-1} r .
\]
Generally the value function space generated by the features and the linear
parametrisation does not allow us to obtain exact value functions. For this
reason, instead of considering the inverse A−1 of the matrix A = (I − γPµ,π )Φ
we use the pseudo-inverse defined as
\[
A^{\dagger} \triangleq A^\top \left( A A^\top \right)^{-1} .
\]
If the inverse exists, then it is equal to the pseudo-inverse. However, in our
setting, the matrix can be low rank, in which case we instead obtain the matrix
minimising the squared error, which in turn can be used to obtain a good esti-
mate for the parameters. This immediately leads to the Least Squares Temporal
Difference algorithm [Bradtke and Barto, 1996, LSTD], which estimates an ap-
proximate value function for some policy π given some data D and a feature
mapping f .
In practice, of course, we do not have the transitions Pµ,π but estimate them
from data. Note that for any deterministic policy π and a set of T data points
(st , at , rt , s′t )Tt=1 , we have
\[
P_{\mu,\pi} \Phi = \sum_{s'} P(s' \mid s, a)\, \Phi(s', \pi(s'))
\approx \frac{1}{T} \sum_{t=1}^{T} \hat P(s'_t \mid s_t, a_t)\, \Phi(s'_t, \pi(s'_t)) .
\]
This can be used to maintain q-factors with $q(s, a) = f(s, a)^\top \theta$, giving an
empirical estimate of the Bellman operator.
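A minimal sketch of an LSTD-style estimator of $q(s, a) = f(s, a)^\top \theta$ for a fixed policy π from a batch of transitions; the feature map f and the policy are hypothetical inputs, and the pseudo-inverse plays the role discussed above.

import numpy as np

def lstd_q(data, f, pi, gamma, n_features):
    """data is an iterable of transitions (s, a, r, s_next)."""
    A = np.zeros((n_features, n_features))
    b = np.zeros(n_features)
    for s, a, rew, s_next in data:
        x = f(s, a)
        x_next = f(s_next, pi(s_next))
        A += np.outer(x, x - gamma * x_next)   # empirical analogue of (I - gamma P) Phi
        b += rew * x
    return np.linalg.pinv(A) @ b               # least-squares parameters theta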
This is essentially the same both for finite and infinite-horizon problems. If
we have to pick the value function from a set of functions V, we can use the
following value function approximation.
Let our estimate at time t be vt ∈ V, with V being a set of (possibly
parametrised) functions. Let V̂t be our one-step update given the value function
approximation at the next step, vt+1 . Then vt will be the closest approximation
in that set.
Iterative approximation
\[
\hat V_t(s) = \max_{a \in A} \left\{ r(s, a) + \gamma \sum_{s'} P_\mu(s' \mid s, a)\, v_{t+1}(s') \right\}
\]
In the above example, the value of every state corresponds to the value of
the k-th set in the partition. Of course, this is only a very rough approximation
if the sets Sk are very large. However, this is a convenient approach to use for
gradient descent updates, as only one parameter needs to be updated at every
step.
on other states may not be very good. It is indeed possible that we suffer from
convergence problems as we alternate between estimating the values of different
states in the aggregation.
with
\[
v_t(s) = \sum_{i=1}^{n} f_i(s)\, \theta_t(i) .
\tag{8.3.4}
\]
Figure 8.4: Error $\|u - V\|$ in the representative state approximation for two different
MDP structures as we increase the number of sampled states. The first is
the chain environment from Example 35, extended to 100 states. The second
involves randomly generated MDPs with two actions and 100 states.
Gradient update
For the L2 norm, we have
\[
\|v_\theta - L v_\theta\|^2 = \sum_{s \in \hat S} D_\theta(s)^2,
\qquad
D_\theta(s) = v_\theta(s) - \max_{a \in A}\left[ r(s, a) + \gamma \int_S v_\theta(j)\, dP(j \mid s, a) \right] .
\tag{8.3.6}
\]
Then the gradient update becomes $\theta_{t+1} = \theta_t - \alpha D_{\theta_t}(s_t) \nabla_\theta D_{\theta_t}(s_t)$, where
\[
\nabla_\theta D_{\theta_t}(s_t) = \nabla_\theta v_{\theta_t}(s_t) - \gamma \int_S \nabla_\theta v_{\theta_t}(j)\, dP(j \mid s_t, a^*_t),
\]
with $a^*_t \triangleq \arg\max_{a \in A} \left\{ r(s_t, a) + \gamma \int_S v_{\theta_t}(j)\, dP(j \mid s_t, a) \right\}$.
We can also construct a Q-factor approximation for the case where no model
is available. This can be simply done by replacing P with the empirical transi-
tion observed at time t.
\[
\pi(a \mid s) = \frac{e^{F(s,a)}}{\sum_{a' \in A} e^{F(s,a')}},
\qquad F(s, a) \triangleq \theta^\top f(s, a) .
\tag{8.4.1}
\]
where y is a starting state distribution vector and Pµπ is the transition matrix
resulting from applying policy π to µ. Computing the derivative using matrix
calculus gives
∇θ E U = y ⊤ ∇θ (I − γPµπ )−1 r,
as the only term involving θ is π. The derivative of the matrix inverse can be
written as
We are now ready to prove claim (8.4.3). Define the expected state visitation
from the starting distribution to be $x \triangleq y^\top X$, so that we obtain
For the discounted reward criterion, we can easily obtain unbiased samples
through geometric stopping (see Exercise 29).
Importance sampling
The last formulation is especially useful as it allows us to use importance sam-
pling to compute the gradient even on data obtained for different policies, which
in general is more data efficient. First note that for any history h ∈ (S × A)∗ ,
we have
\[
P^\pi_\mu(h) = \prod_{t=1}^{T} P_\mu(s_t \mid s^{t-1}, a^{t-1})\, P_\pi(a_t \mid s^t, a^{t-1})
\tag{8.4.7}
\]
without any Markovian assumptions on the model or policy. We can now rewrite
(8.4.5) in terms of the expectation with respect to an alternative policy π′ as
\[
\nabla \mathbb{E}^\pi_\mu U
= \mathbb{E}^{\pi'}_\mu\!\left[ U(h)\, \nabla \ln P^\pi_\mu(h)\, \frac{P^\pi_\mu(h)}{P^{\pi'}_\mu(h)} \right]
= \mathbb{E}^{\pi'}_\mu\!\left[ U(h)\, \nabla \ln P^\pi_\mu(h) \prod_{t=1}^{T} \frac{\pi(a_t \mid s^t, a^{t-1})}{\pi'(a_t \mid s^t, a^{t-1})} \right],
\]
since the µ-dependent terms in (8.4.7) cancel out. In practice the expectation
would be approximated through sampling trajectories h. Note that
\[
\nabla \ln P^\pi_\mu(h) = \sum_t \nabla \ln \pi(a_t \mid s^t, a^{t-1}) = \sum_t \frac{\nabla \pi(a_t \mid s^t, a^{t-1})}{\pi(a_t \mid s^t, a^{t-1})} .
\]
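A minimal sketch of this estimate from trajectories collected with a behaviour policy π′; the helpers for the policy probabilities and the score ∇ ln π are hypothetical, and the utility is taken here to be the undiscounted sum of rewards.

import numpy as np

def policy_gradient_is(trajectories, theta, pi_prob, behaviour_prob, grad_log_pi):
    """trajectories is a list of histories [(s_0, a_0, r_0), (s_1, a_1, r_1), ...]."""
    grad = np.zeros_like(theta)
    for h in trajectories:
        utility = sum(rew for _, _, rew in h)
        weight = np.prod([pi_prob(theta, s, a) / behaviour_prob(s, a) for s, a, _ in h])
        score = sum(grad_log_pi(theta, s, a) for s, a, _ in h)   # grad log P^pi(h)
        grad += utility * weight * score
    return grad / len(trajectories)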
8.5 Examples
Let us now consider two well-known problems with a 2-dimensional continu-
ous state space and a discrete set of actions. The first is the inverted pendu-
lum problem in the version of Lagoudakis and Parr [2003b], where a controller
must balance a rod upside-down. The state information is the rotational ve-
locity and position of the pendulum. The second example is the mountain car
problem, where we must drive an underpowered vehicle to the top of a hill
[Sutton and Barto, 1998]. The state information is the velocity and location of
the car. In both problems, there are three actions: “push left”, “push right”
and “do nothing”.
Let us first consider the effect of model and features in representing the
value function of the inverted pendulum problem. Figure 8.5 shows value func-
tion approximations for policy evaluation under a uniformly random policy for
different choices of model and features. Here we need to fit an approximate value
function to samples of the utility obtained from different states. The quality
of the approximation depends on both the model and the features used. The
first choice of features is simply the raw state representation, while the second is
a 16 × 16 uniform RBF tiling. The two models are very simple: the first uses a
linear-Gaussian model assumption on the observation noise (LG), and the
second is a k-nearest neighbour (kNN) model.
As can be seen from the figure the linear model results in a smooth ap-
proximation, but is inadequate for modelling the value function in the original
2-dimensional state space. However, a high-dimensional non-linear projection
using RBF kernels results in a smooth and accurate value function representa-
tion. Non-parametric models such as k-nearest neighbours behave rather well
under either state representation.
For finding the optimal value function we must additionally consider the
question of which algorithm to use. In Figure 8.6 we see the effect of choosing
either approximate value iteration (AVI) or representative state representations
and value iteration (RSVI) for the inverted pendulum and mountain car.
Figure 8.5: Value function approximations for policy evaluation under a uniformly
random policy on the pendulum problem, for different combinations of model (LG,
kNN) and features (raw state, RBF).
Figure 8.6: Estimated optimal value function for the pendulum problem. Re-
sults are shown for approximate value iteration (AVI) with a Bayesian linear-
Gaussian model, and a representative state representation (RSVI) with an RBF
embedding. Both the embedding and the states where the value function is
approximated are a 16 × 16 uniform grid over the state space.
8.7 Exercises
Exercise 35 (Enlarging the function space.). Consider the problem in Example 37.
What would be a simple way to extend the space of value functions from the three
given candidates to an infinite number of value functions? How could we get a good
fit?
Exercise 36 (Enlarging the policy space.). Consider Example 38. This is an ex-
ample of a space of linear deterministic policies. In which two ways can this policy space be
extended, and how?
Exercise 37. Find the derivative for minimising the cost function in (8.1.6) for the
following two cases:
1. p = 2, κ = 2.
2. p → ∞, κ = 1.
Chapter 9

Bayesian reinforcement learning
9.1 Introduction
In this chapter, we return to the setting of subjective probability and utility by
formalising the reinforcement learning problem as a Bayesian decision problem
and solving it directly. In the Bayesian setting, we are acting in an MDP which
is not known, but we have a subjective belief about what the environment
is. We shall first consider the case of acting in unknown MDPs, which is the
focus of the reinforcement learning problem. We will examine a few different
heuristics for maximising expected utility in the Bayesian setting and contrast
them with tractable approximations to the Bayes-optimal solution. Further, we
shall present extensions of these ideas to continuous domains, and finally also
connections to partially observable MDPs will be considered.
Figure 9.1: The unknown Markov decision process. ξ is our prior over the
unknown µ, which is not directly observed. However, we always observe the
result of our action at in terms of reward rt and next state st+1.
The goal is to maximise the expected utility with respect to the prior, $\mathbb{E}^\pi_\xi(U)$. The structure of the unknown MDP process
is shown in Figure 9.1. We have previously seen two simpler sequential decision
problems in the Bayesian setting. The first was the simple optimal stopping pro-
cedure in Section 5.2.2, which introduced the backwards induction algorithm.
The second was the optimal experiment design problem, which resulted in the
bandit Markov decision process of Section 6.2. Now we want to formulate the
reinforcement learning problem as a Bayesian maximisation problem.
Let ξ be a prior over M and Π be a set of policies. Then the expected utility
of the optimal policy is
\[
U^*_\xi \triangleq \max_{\pi \in \Pi} \mathbb{E}(U \mid \pi, \xi) = \max_{\pi \in \Pi} \int_M \mathbb{E}(U \mid \pi, \mu) \, d\xi(\mu) .
\tag{9.2.1}
\]
Solving this optimisation problem and hence finding the optimal policy is how-
ever not easy, as in general the optimal policy π must incorporate the informa-
tion it obtained while interacting with the MDP. Formally, this means that it
must map from histories to actions. For any such history-dependent policy, the
action we take at step t must depend on what we observed in previous steps
1, . . . , t − 1. Consequently, an optimal policy must also specify actions to be
taken in all future time steps and accordingly take into account the learning
that will take place up to each future time step. Thus, in some sense, the
value of information is automatically taken into account in this model. This is
illustrated in the following example.
Example 44. Consider two MDPs µ1 , µ2 with a single state (i.e., S = {1}) and
actions A = {1, 2}. In the MDP µi , whenever you take action at = i you obtain
reward rt = 1, otherwise you obtain reward 0. If we only consider policies that do
not take into account the history so far, the expected utility of such a policy π taking
action i with probability π(i) is
\[
\mathbb{E}^\pi_\xi U = T \sum_i \xi(\mu_i)\, \pi(i)
\]
for horizon T . Consequently, if the prior ξ is not uniform, the optimal policy selects
the action corresponding to the MDP with the highest prior probability. Then, the
maximal expected utility is
\[
T \max_i \xi(\mu_i) .
\]
However, observing the reward after choosing the first action, we can determine the
true MDP. Consequently, an improved policy is the following: First select the best
198 CHAPTER 9. BAYESIAN REINFORCEMENT LEARNING
action with respect to the prior, and then switch to the best action for the MDP we
have identified to be the true one. Then, our utility improves to $\max_i \xi(\mu_i) + (T - 1)$,
where here and in the following we use the notation $s^t$ to abbreviate $(s_1, \ldots, s_t)$
and $s_t^{t+k}$ for $(s_t, \ldots, s_{t+k})$, and accordingly $a^t$, $r^t$, $a_t^{t+k}$, and $r_t^{t+k}$. Important
special cases are the set of blind policies Π0 and the set of memoryless poli-
cies Π1. The set Π̄k ⊂ Πk contains all stationary policies in Πk, that is, policies
π for which
\[
\pi(a \mid s_t^{t+k-1}, a_t^{t+k-2}) = \pi(a \mid s^k, a^{k-1})
\]
for all t. Finally, policies may be indexed by some parameter set Θ, in which
case the set of parameterised policies is given by ΠΘ.
Let us now turn to the problem of learning an optimal policy. Learning
means that observations we make will affect our belief, so that we will first take
a closer look at this belief update. Given that, we shall examine methods for
exact and approximate methods of policy optimisation.
However, as we shall see in the following remark, we can usually1 ignore the
policy itself when calculating the posterior.
Remark 9.2.1. The dependence on the policy can be removed, since the posterior
is the same for all policies that put non-zero mass on the observed data. Indeed,
for $D_t \sim P^\pi_\mu$ it is easy to see that for all $\pi' \ne \pi$ such that $P^{\pi'}_\mu(D_t) > 0$, it holds that
$\xi(B \mid D_t, \pi) = \xi(B \mid D_t, \pi')$.
The proof is left as an exercise for the reader. In the specific case of MDPs,
the posterior calculation is easy to perform incrementally. This also more clearly
1 The exception involves any type of inference where $P^\pi_\mu(D_t)$ is not directly available. This
includes methods of approximate Bayesian computation [Csilléry et al., 2010], that use tra-
jectories from past policies for approximation. See Dimitrakakis and Tziortziotis [2013] for an
example of this in reinforcement learning.
The above calculation is easy to perform for arbitrarily complex MDPs when the
set M is finite. The posterior calculation is also simple under certain conjugate
priors, such as the Dirichlet-multinomial prior for transition distributions.
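A minimal sketch of this incremental update with independent Dirichlet priors over the transition distribution of each state-action pair; the symmetric prior parameter 0.5 is an arbitrary illustrative choice.

import numpy as np

class DirichletTransitionModel:
    def __init__(self, n_states, n_actions, prior=0.5):
        # alpha[s, a, s'] are the Dirichlet parameters of the belief over P(. | s, a)
        self.alpha = np.full((n_states, n_actions, n_states), prior)

    def update(self, s, a, s_next):
        self.alpha[s, a, s_next] += 1.0          # posterior = prior + transition counts

    def expected_mdp(self):
        # mean transition probabilities, e.g. for the expected-MDP heuristic below
        return self.alpha / self.alpha.sum(axis=2, keepdims=True)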
where $\Pi_1 = \{\pi \in \Pi \mid P_\pi(a_t \mid s^t, a^{t-1}) = P_\pi(a_t \mid s_t)\}$ is the set of Markov policies.
The policy $\pi^*(\hat\mu(\xi))$ is executed on the real MDP. Algorithm 23 shows the
pseudocode for this heuristic. One important detail is that we are only gen-
erating the k-th policy at step Tk . This is sometimes useful to ensure policies
remain consistent, as small changes in the mean MDP may create a large change
in the resulting policy. It is natural to have Tk − Tk−1 of the order of 1/(1 − γ) for
discounted problems, or simply the length of the episode for episodic problems.
In the undiscounted case, switching policies whenever sufficient information has
been obtained to significantly change the belief gives good performance guaran-
tees, as we shall see in Chapter 10.
Unfortunately, the policy returned by this heuristic may be far from the
Bayes-optimal policy in Π1 , as shown by the following example.
Figure 9.2: The two MDPs and the expected MDP from Example 45.
However, the Bayes-optimal value function is not equal to the expected value
function of the optimal policy for each MDP. In fact, the Bayes-value of any
policy is a natural lower bound on the Bayes-optimal value function, as the
Bayes-optimal policy is the maximum by definition. We can however use the
expected optimal value function as an upper bound on the Bayes-optimal value:
\[
V^*_\xi \triangleq \sup_\pi \mathbb{E}^\pi_\xi(U) = \sup_\pi \int_M \mathbb{E}^\pi_\mu(U) \, d\xi(\mu)
\le \int_M \sup_\pi V^\pi_\mu \, d\xi(\mu) = \int_M V^*_\mu \, d\xi(\mu) \triangleq V^+_\xi .
\]
Given the previous development, it is easy to see that the following inequal-
ities always hold, giving us upper and lower bounds on the value function:
Vξπ ≤ Vξ∗ ≤ Vξ+ , ∀π. (9.2.3)
These bounds are geometrically demonstrated in Fig. 9.4. They are entirely
analogous to the Bayes bounds of Sec. 3.3.1, with the only difference being that
we are now considering complete policies rather than simple decisions.
Figure 9.4: Geometric illustration of the bounds $V^\pi_\xi \le V^*_\xi \le V^+_\xi$ as functions of the belief ξ.
Figure 9.5: Illustration of the improved bounds. The naive and the tighter lower bound
are obtained by calculating the value of the policy that is optimal for the expected
MDP and the value of the MMBI policy, respectively. The upper bound is Vξ+. The
horizontal axis refers to our belief: at the left edge, our belief is uniform over all
MDPs, while at the right edge, we are certain about the true MDP.
tighter lower bounds can be obtained by finding better policies, something that
was explored by Dimitrakakis [2011].
In particular, we can consider the problem of finding the best memoryless
policy. This involves two approximations. Firstly, approximating our belief over
MDPs with a sample over a finite set of n MDPs. Secondly, assuming that the
belief is nearly constant over time, and performing backwards induction on those n
MDPs simultaneously. While this greedy procedure might not find the optimal
memoryless policy, it still improves the lower bounds considerably.
The central step, backwards induction over multiple MDPs, is summarised by
the following equation, which simply involves calculating the expected utility of
a particular policy over all MDPs:
\[
Q^\pi_{\xi,t}(s, a) \triangleq \int_M \left[ \bar r_\mu(s, a) + \gamma \int_S V^\pi_{\mu,t+1}(s') \, dP_\mu(s' \mid s, a) \right] d\xi(\mu)
\tag{9.2.4}
\]
2012, Osband et al., 2013], it is not optimal. In fact, as we can see in Fig-
ure 9.6, Algorithm 28 performs better when the number of samples is increased.
(Figure: (a) The complete MDP model, with states st, beliefs ξt, hyper-states ψt and actions at. (b) Compact form of the model.)
Figure 9.6: Comparison of the regret between the expected MDP heuristic and
sampling with Multi-MDP backwards induction for the Chain environment. The
error bars show the standard error of the average regret.
tuple (S, A, Pµ , ρ), with state space S, action space A, transition kernel Pµ and
reward function ρ : S × A → R. Let st , at , rt be the state, action, and reward
observed in the original MDP and ξt be our belief over MDPs µ ∈ M at step t.
Note that the marginal next-state distribution is
Z
P (st+1 ∈ S | ξt , st , at ) , Pµ (st+1 ∈ S | st , at ) dξt (µ), (9.2.5)
M
while the next belief deterministically depends on the next state, i.e.,
Example 47. Consider a set of MDPs M with A = {1, 2}, S = {1, 2}. In general, for
any hyper-state ψt = (st, ξt), each possible action-state transition results in one specific
new hyper-state. This is illustrated for the specific example in the following diagram.
(Diagram: from the hyper-state ψt, each action at and next state st+1 leads to one of
four possible hyper-states $\psi^1_{t+1}, \ldots, \psi^4_{t+1}$.)
When the branching factor is very large, or when we need to deal with very large tree
depths, it becomes necessary to approximate the MDP structure.
where π(ξ′t) can be any approximately optimal policy for ξ′t. Using backwards
induction, we can calculate tighter upper bounds q+ and lower bounds q− for all non-
leaf hyper-states by
\[
q^+(\psi_t, a_t) = \sum_{\psi_{t+1}} P(\psi_{t+1} \mid \psi_t, a_t) \left[ \rho(\psi_t, a_t) + \gamma v^+(\psi_{t+1}) \right],
\qquad
q^-(\psi_t, a_t) = \sum_{\psi_{t+1}} P(\psi_{t+1} \mid \psi_t, a_t) \left[ \rho(\psi_t, a_t) + \gamma v^-(\psi_{t+1}) \right] .
\]
We can then use the upper bounds to expand the tree (i.e., to select actions in
the tree that maximise v + ) while the lower bounds can be used to select the
final policy. Sub-optimal branches can be discarded once their upper bounds
become lower than the lower bound of some other branch.
Remark 9.2.2. If q − (ψ, a) ≥ q + (ψ, a′ ) then a′ is sub-optimal at ψ.
Pµ (S | s, a) , Pµ (st+1 ∈ S | st = s, at = a), S ⊂ S.
There are a number of transition models one can consider for the continuous
case. For the purposes of this textbook, we shall limit ourselves to the relatively
simple case of linear-Gaussian models.
\[
\psi(V_i \mid W, n) \propto \left| V^{-1} W \right|^{n/2} e^{-\frac{1}{2}\operatorname{trace}(V^{-1} W)} .
\]
Essentially, the considered setting is an extension of the univariate Bayesian
linear regression model (see for example DeGroot [1970]) to the multivariate case
via vectorisation of the mean matrix. Since the prior is conjugate, it is relatively
simple to calculate posterior values of the parameters after each observation.
While we omit the details, a full description of inference using this model is
given by Minka [2001b].
where N (s, s′ ) , ∆U (s) − γ∆U (s′ ) with ∆U (s) , U (s) − v(s) denoting the
distribution of the residual, i.e., the utility when starting from s minus its
expectation. The correlation between U (s) and U (s′ ) is captured via N , and
the residuals are modelled as a Gaussian process. While the model is still an
approximation, it is equivalent to performing GP regression using Monte-Carlo
samples of the discounted return.
\[
\nabla_\pi \mathbb{E}^\pi_\xi U = \int_M \nabla_\pi U(\mu, \pi) \, d\xi(\mu) .
\]
In most real world applications the state st of the system at time t cannot be
observed directly. Instead, we obtain some observation xt , which depends on
the state of the system. While this does give us some information about the
system state, it is in general not sufficient to pinpoint it exactly. This idea can
be formalised as a partially observable Markov decision process (POMDP).
Here P(st+1 | st, at) is the transition distribution, giving the probabilities of
next states given the current state and action. P(xt | st) is the observation
distribution, giving the probabilities of different observations given the current
state. Finally, P(rt | st) is the reward distribution, which we make dependent
only on the current state for simplicity. Different dependencies are possible, but
they are all equivalent to the one given here.
(Figure: graphical model of the partially observable MDP, with hidden states st, observations xt, rewards rt, actions at and belief ξ.)
Belief ξ
For any distribution ξ on S, we define
\[
\xi(s_{t+1} \mid a_t, \mu) \triangleq \int_S P_\mu(s_{t+1} \mid s_t, a_t) \, d\xi(s_t) .
\tag{9.4.1}
\]
Belief update
A particularly attractive setting is when the model is finite. Then the suffi-
cient statistic also has finite dimension and all updates are in closed form.
Remark 9.4.1. If S, A, X are finite, then we can define a sequence of vectors
pt ∈ ∆|S| and matrices At as
pt (j) = P (xt | st = j),
At (i, j) = P (st+1 = j | st = i, at ).
Then, writing bt(i) for ξt(st = i), we can use Bayes' theorem to obtain
\[
b_{t+1} = \frac{\operatorname{diag}(p_{t+1})\, A_t^\top b_t}{p_{t+1}^\top A_t^\top b_t} .
\]
4 However, we have not seen a formal proof of this at the time of writing.
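A minimal sketch of this update for a finite POMDP; here A[a] is the transition matrix for action a with A[a][i, j] = P(st+1 = j | st = i, at = a), and O[j, x] = P(xt = x | st = j) is a hypothetical observation matrix playing the role of pt.

import numpy as np

def belief_update(b, a, x, A, O):
    """Return b_{t+1} given belief b_t, action a and observation x."""
    predicted = A[a].T @ b               # predictive distribution over the next state
    unnormalised = O[:, x] * predicted   # weight by the observation likelihood
    return unnormalised / unnormalised.sum()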
9.6 Exercises
Exercise 38. Consider the algorithms we have seen in Chapter 8. Are any of those
applicable to belief-augmented MDPs? Outline a strategy for applying one of those
algorithms to the problem. What would be the biggest obstacle we would have to
overcome in your specific example?
Exercise 41. Consider the Gaussian process model of eq. (9.3.2). What is the implicit
assumption made about the transition model? If this assumption is satisfied, what
does the corresponding posterior distribution represent?
Chapter 10

Distribution-free reinforcement learning
10.1 Introduction
The Bayesian framework requires specifying a prior distribution. For many
reasons, we may frequently be unable to do that. In addition, as we have seen,
the Bayes-optimal solution is often intractable. In this chapter we shall take a
look at algorithms that do not require specifying a prior distribution. Instead,
they employ the heuristic of “optimism under uncertainty” to select policies.
This idea is very similar to heuristic search algorithms, such as A∗ [Hart et al.,
1968]. All these algorithms assume the best possible model that is consistent
with the observations so far and choose the optimal policy in this “optimistic”
model. Intuitively, this means that for each possible policy we maintain an
upper bound on the value/utility we can reasonably expect from it. In general
we want this upper bound to
1. be as tight as possible (i.e., to be close to the true value),
2. still hold with high probability.
We begin with an introduction to these ideas in bandit problems, when the
objective is to maximise total reward. We then expand this discussion to struc-
tured bandit problems, which have many applications in optimisation. Finally,
we look at the case of maximising total reward in unknown MDPs.
The regret compares the collected rewards to those of the best fixed policy.
Comparing instead to the best rewards obtained by the arms at each time would
be too hard due to their randomness.
Empirical average
\[
\hat r_{t,i} \triangleq \frac{1}{N_{t,i}} \sum_{k=1}^{t} r_{k,i}\, \mathbb{I}\{a_k = i\},
\qquad \text{where } N_{t,i} \triangleq \sum_{k=1}^{t} \mathbb{I}\{a_k = i\}
\]
and rk,i denotes the (random) reward the learner receives upon choosing
arm i at step k.
Simply always choosing the arm with the best empirical average reward so far
is not the best idea, because you might get stuck with a sub-optimal arm: If the
optimal arm underperforms at the beginning, so that its empirical average is far
below the true mean of a suboptimal arm, it will never be chosen again. A better
strategy is to choose arms optimistically. Intuitively, as long as an arm has a
significant chance of being the best, you play it every now and then. One simple
way to implement this is shown in the following UCB1 algorithm [Auer et al.,
2002a].
Algorithm 29 UCB1
Input A
Choose each arm once to obtain an initial estimate.
for t = 1, . . . do
    Choose arm $a_t = \arg\max_{i \in A} \left\{ \hat r_{t-1,i} + \sqrt{\frac{2 \ln t}{N_{t-1,i}}} \right\}$.
end for
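A minimal sketch of UCB1 on Bernoulli arms; the arm means are illustrative and rewards are assumed to lie in [0, 1], as required by the Hoeffding bound used below.

import numpy as np

rng = np.random.default_rng(0)
means = np.array([0.3, 0.5, 0.7])               # assumed true arm means
K = len(means)
counts = np.ones(K)                              # each arm pulled once initially
sums = rng.binomial(1, means).astype(float)      # rewards of the initial pulls

for t in range(K + 1, 10_000):
    ucb = sums / counts + np.sqrt(2 * np.log(t) / counts)   # optimistic index
    a = int(np.argmax(ucb))
    rew = rng.binomial(1, means[a])
    sums[a] += rew
    counts[a] += 1
print("empirical means:", sums / counts)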
Thus, the algorithm adds a bonus value of order $O(\sqrt{\ln t / N_{t,i}})$ to the em-
pirical value of each arm, thus forming an upper confidence bound. This upper
confidence bound value is such that the true mean reward of each arm will lie
below it with high probability by the Hoeffding bound (4.5.5).
Theorem 10.2.1 (Auer et al. [2002a]). The expected regret of UCB1 after T
time steps is at most
\[
\mathbb{E}\, L_T(\text{UCB1}) \le \sum_{i : r(i) < r^*} \frac{8 \ln T}{r^* - r(i)} + 5 \sum_i \left( r^* - r(i) \right) .
\]
Accordingly we may assume that (taking care of the contribution of the error
probabilities to E Nt,i below)
Combining this with (10.2.1) and noting that the sum converges to a value < 4,
proves the regret bound.
The UCB1 algorithm is actually not the first algorithm employing optimism
in the face of uncertainty to deal with the exploration-exploitation dilemma,
nor the first that uses confidence intervals for that purpose. This idea goes back
to the seminal work of Lai and Robbins [1985] that used the same approach,
however in a more complicated form. In particular, the whole history is used
for computing the arm to choose. The derived bounds of Lai and Robbins [1985]
show that after T steps each suboptimal arm is played at most $\left(\frac{1}{D_{\mathrm{KL}}} + o(1)\right) \log T$
times in expectation, where DKL measures the distance between the reward dis-
tributions of the optimal and the suboptimal arm by the Kullback-Leibler di-
vergence, and o(1) → 0 as T → ∞. This bound was also shown to be asymptot-
ically optimal [Lai and Robbins, 1985]. A lower bound logarithmic in T for any
finite T that is close to matching the bound of Theorem 10.2.1 can be found in
[Mannor and Tsitsiklis, 2004]. Improvements that get closer to the lower bound
(and are still based on the UCB1 idea) can be found in [Auer and Ortner, 2010],
while the gap has been finally closed by Lattimore [2015].
The stochastic setting just considered is only one among several variants of the
multi-armed bandit problem. While it is impossible to cover them all, we give
a brief description of the most common scenarios and refer to Bubeck and Cesa-Bianchi
[2012] for a more complete overview.
What is common to most variants of the classic stochastic setting is that the
assumption of receiving i.i.d. rewards when sampling a fixed arm is loosened.
The most extreme case is the so-called nonstochastic, sometimes also termed
adversarial, bandit setting, where the reward sequence for each arm is assumed
to be fixed in advance (and thus not random at all). In this case, the reward is
maximised when choosing in each time step the arm that maximises the reward
at this step. Obviously, since the reward sequences can be completely arbitrary,
no learner can stand a chance to perform well with respect to this optimal
policy. Thus, one confines oneself to consider the regret with respect to the best
fixed arm in hindsight, that is, $\arg\max_i \sum_{t=1}^{T} r_{t,i}$, where rt,i is the reward of
arm i at step t. Even this may be too much to ask for, but it
turns out that one can achieve regret bounds of order $O(\sqrt{KT})$ in this setting.
Clearly, algorithms that choose arms deterministically can always be tricked by an adversarial reward sequence. However, algorithms that at each time step choose an arm from a suitable distribution over the arms (which is updated according to the collected rewards) can be shown to achieve the mentioned optimal regret bound. A prominent representative of these algorithms is the Exp3 algorithm of Auer et al. [2002b], which uses an exponential weighting scheme.
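As an illustration, the following Python sketch implements one common variant of Exp3 with a fixed exploration parameter gamma and importance-weighted reward estimates; the exact parameterisation of Auer et al. [2002b] may differ in details, and the reward(t, i) interface is a placeholder.

import math
import random

def exp3(reward, num_arms, horizon, gamma=0.1):
    # Exp3-style sketch: sample an arm from a mixture of exponential weights
    # and uniform exploration, then update the chosen arm's weight with an
    # importance-weighted estimate of its reward (assumed to lie in [0, 1]).
    weights = [1.0] * num_arms
    total_reward = 0.0
    for t in range(horizon):
        total_w = sum(weights)
        probs = [(1.0 - gamma) * w / total_w + gamma / num_arms for w in weights]
        arm = random.choices(range(num_arms), weights=probs)[0]
        r = reward(t, arm)
        weights[arm] *= math.exp(gamma * (r / probs[arm]) / num_arms)
        total_reward += r
    return total_reward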
In the contextual bandit setting the learner receives some additional side information called the context. The reward for choosing an arm is assumed to depend on the context as well as on the chosen arm, and can be either stochastic or adversarial. The learner usually competes against the best policy that maps contexts to arms. There is a notable amount of literature dealing with various settings of this kind, which are also interesting for applications like web advertisement, where user data takes the role of the provided side information. For an overview see e.g. Chapter 4 of [Bubeck and Cesa-Bianchi, 2012] or Part V of [Lattimore and Szepesvári, 2020].
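To make the interaction protocol concrete, here is a small Python sketch with finitely many contexts, in which the learner keeps separate empirical estimates per (context, arm) pair and plays epsilon-greedy. This is purely illustrative and not one of the algorithms analysed in the references above; contexts(t) and pull(x, a) are placeholder interfaces.

import random
from collections import defaultdict

def contextual_bandit_loop(contexts, pull, num_arms, horizon, eps=0.1):
    # At each step: observe a context, choose an arm, receive a reward that
    # depends on both. Here: epsilon-greedy over per-(context, arm) means.
    counts = defaultdict(int)
    means = defaultdict(float)
    for t in range(horizon):
        x = contexts(t)                       # side information at step t
        if random.random() < eps:
            a = random.randrange(num_arms)
        else:
            a = max(range(num_arms), key=lambda i: means[(x, i)])
        r = pull(x, a)
        counts[(x, a)] += 1
        means[(x, a)] += (r - means[(x, a)]) / counts[(x, a)]
    return means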
In other settings the i.i.d. assumption about the rewards of a fixed arm is replaced by more general assumptions, such as that underlying each arm there is a Markov chain and rewards depend on the state of this Markov chain when sampling the arm. This is called the restless bandits problem, which is already quite close to the general reinforcement learning setting with an underlying Markov decision process (see Section 10.3.1 below). Regret bounds of order $\tilde{O}(\sqrt{T})$ can be shown in this setting, even if at each time step the learner can only observe the state of the arm it chooses, see [Ortner et al., 2014].
Given that our rewards are assumed to be bounded in [0, 1], intuitively, when
we make one wrong step in some state s, in the long run we won’t lose more
than D. After all, in D steps we can go back to s and continue optimally.
Under the assumption that the MDP is communicating, the gain g ∗ can be
shown to be independent of the initial state, that is, g ∗ (s) = g ∗ for all states s.
Accordingly, we define the T -step regret of a learning algorithm as
$$L_T \;\triangleq\; \sum_{t=1}^{T} \big( g^* - r_t \big),$$
where rt is the reward collected by the algorithm at step t. Note that in general
(and depending on the initial state) the value T g ∗ we compare to will differ
from the optimal T -step reward. However, this difference can be shown to be
upper bounded by the diameter and is therefore negligible when considering the
regret.
However, simply taking each policy to be an arm of a bandit problem does not work well. First, to approach the true gain of a chosen policy it is not sufficient to choose it just once; it would be necessary to follow each policy for a sufficiently large number of consecutive steps. Without knowledge of some characteristics of the underlying MDP, like mixing times, it may however be difficult to determine for how long a policy should be played. Further, due to the large number of stationary policies, which is $|A|^{|S|}$, the regret bounds resulting from such an approach would be exponential in the number of states.
Thus, we rather maintain confidence regions for the rewards and transition
probabilities of each state-action pair s, a. Then, at each step t, these confidence
regions implicitly also define a confidence region for the true underlying MDP
µ∗ , that is, a set Mt of plausible MDPs. For suitably chosen confidence intervals
for the rewards and transition probabilities one can obtain that
P(µ∗ ∈
/ Mt ) < δ. (10.3.1)
Given this confidence region Mt , one can define the optimistic value for any
policy π to be
π
g+ (Mt ) , max gµπ µ ∈ Mt . (10.3.2)
Note that, similar to the bandit setting, this estimate is optimistic for each policy, as due to (10.3.1) it holds that $g_+^{\pi}(M_t) \ge g_{\mu^*}^{\pi}$ with high probability. Analogously to UCB1 we would like to make an optimistic choice among the possible policies, that is, we choose a policy $\pi$ that maximises $g_+^{\pi}(M_t)$.
However, unlike in the bandit setting, where we immediately receive a sample of the reward of the chosen arm, in the MDP setting we only obtain information about the reward in the current state. Thus, we should not play the chosen optimistic policy for just one step but for a sufficiently large number of steps. An easy way to achieve this is to play policies in episodes of increasing length, so that sooner or later each action is played for a sufficient number of steps in each state. In summary, we obtain (the outline of) an algorithm as sketched below.
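A schematic Python rendering of this outline is given below. The environment interface (reset() and step(a) returning the next state and reward) is an assumption made for illustration, the helpers confidence_region and optimistic_policy are passed in by the caller and stand for the constructions described next, and the episode-stopping rule (end an episode once the count of the currently played state-action pair has doubled) is the one used by Jaksch et al. [2010].

def ucrl2_outline(env, num_states, num_actions, horizon, delta,
                  confidence_region, optimistic_policy):
    # Schematic episode structure: maintain visit counts, build a set of
    # plausible MDPs from the data, compute an optimistic policy for it and
    # play that policy until some state-action count has doubled.
    N = [[0] * num_actions for _ in range(num_states)]   # total visit counts
    t, s = 1, env.reset()
    total_reward = 0.0
    while t <= horizon:
        M = confidence_region(N, t, delta)          # plausible MDPs M_t
        policy = optimistic_policy(M)               # extended value iteration
        v = [[0] * num_actions for _ in range(num_states)]  # counts in episode
        while t <= horizon and v[s][policy[s]] < max(1, N[s][policy[s]]):
            a = policy[s]
            s_next, r = env.step(a)
            v[s][a] += 1
            total_reward += r
            t, s = t + 1, s_next
        for si in range(num_states):
            for ai in range(num_actions):
                N[si][ai] += v[si][ai]
    return total_reward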
The confidence region. Concerning the confidence regions, for the rewards it is sufficient to use confidence intervals similar to those for UCB1. For the transition probabilities we consider as plausible all those transition probability distributions
that are close in $\|\cdot\|_1$-norm to the empirical distribution $\hat{P}_t(\cdot \mid s, a)$. That is, the confidence region $M_t$ at step $t$ used to compute the optimistic policy can be defined as the set of MDPs with mean rewards $r(s,a)$ and transition probabilities $P(\cdot \mid s,a)$ such that
$$\big| r(s,a) - \hat{r}(s,a) \big| \;\le\; \sqrt{\frac{7 \log(2SAt/\delta)}{2 N_t(s,a)}},$$
$$\big\| P(\cdot \mid s,a) - \hat{P}_t(\cdot \mid s,a) \big\|_1 \;\le\; \sqrt{\frac{14 S \log(2At/\delta)}{N_t(s,a)}},$$
where r̂(s, a) and P̂t (· | s, a) are the estimates for the rewards and the transition
probabilities, and Nt (s, a) denotes the number of samples of action a in state s
at time step t.
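In code, the two interval widths above can be computed directly; the following Python helper is a straightforward transcription (N_sa, S, A, t and delta denote the quantities defined in the text):

import math

def confidence_widths(N_sa, t, S, A, delta):
    # Widths of the reward and transition confidence intervals for one
    # state-action pair that has been sampled N_sa times up to step t.
    n = max(1, N_sa)
    conf_r = math.sqrt(7.0 * math.log(2 * S * A * t / delta) / (2 * n))
    conf_p = math.sqrt(14.0 * S * math.log(2 * A * t / delta) / n)
    return conf_r, conf_p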
One can show via a bound due to Weissman et al. [2003] that, given $n$ samples of the transition probability distribution $P(\cdot \mid s,a)$, one has
$$P\Big( \big\| P(\cdot \mid s,a) - \hat{P}_t(\cdot \mid s,a) \big\|_1 \ge \varepsilon \Big) \;\le\; 2^S \exp\Big( -\tfrac{n \varepsilon^2}{2} \Big).$$
Using this together with standard Hoeffding bounds for the reward estimates,
it can be shown that the confidence region contains the true underlying MDP
with high probability.
Lemma 10.3.1.
$$P\big(\mu^* \in M_t\big) \;>\; 1 - \frac{\delta}{15\, t^6}.$$
Given the confidence region $M_t$, the optimistic policy $\tilde{\pi}$ and its gain can be computed by the following extended value iteration scheme:
1. Set the optimistic rewards $\tilde{r}(s,a)$ to the upper confidence values for all states $s$ and all actions $a$.
2. Initialise $u_0(s) := 0$ for all states $s$.
3. For $i = 0, 1, 2, \ldots$ set
$$u_{i+1}(s) := \max_a \Big\{ \tilde{r}(s,a) + \max_{P \in \mathcal{P}(s,a)} \sum_{s'} P(s')\, u_i(s') \Big\}, \tag{10.3.3}$$
where $\mathcal{P}(s,a)$ is the set of all plausible transition probabilities for choosing action $a$ in state $s$.
Similarly to the value iteration algorithm in Section 6.5.4, this scheme can be shown to converge. More precisely, one can show that $\max_s \{u_{i+1}(s) - u_i(s)\} - \min_s \{u_{i+1}(s) - u_i(s)\} \to 0$ and also
$$u_{i+1}(s) \;\to\; u_i(s) + g_+^{\tilde{\pi}} \quad \text{for all } s. \tag{10.3.4}$$
After convergence the maximising actions constitute the optimistic policy $\tilde{\pi}$, and the maximising transition probabilities are the respective optimistic transition values $\tilde{P}$.
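The inner maximisation over plausible transition distributions in (10.3.3) can be carried out explicitly. The Python sketch below follows the construction of Jaksch et al. [2010]: shift as much probability mass as the L1 constraint allows onto the state with the largest value u(s), removing the surplus from the lowest-value states.

def inner_max(p_hat, u, conf_p):
    # max over P with ||P - p_hat||_1 <= conf_p of sum_s P(s) u(s).
    order = sorted(range(len(u)), key=lambda s: u[s], reverse=True)
    best = order[0]
    p = list(p_hat)
    p[best] = min(1.0, p_hat[best] + conf_p / 2.0)
    surplus = sum(p) - 1.0
    for s in reversed(order):                 # lowest-value states first
        if surplus <= 0:
            break
        if s == best:
            continue
        reduction = min(p[s], surplus)
        p[s] -= reduction
        surplus -= reduction
    return sum(p[s] * u[s] for s in range(len(u)))

# In (10.3.3) one then sets, for each state s,
#   u_next[s] = max over a of r_tilde[s][a] + inner_max(P_hat[s][a], u, conf_p[s][a]).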
One can also show that the so-called span $\max_s u_i(s) - \min_s u_i(s)$ of the converged value vector $u_i$ is upper bounded by the diameter $D$. This follows from the optimality of the vector $u_i$: intuitively, if the span were larger than $D$, one could increase the collected reward in the lowest-value state $s_-$ by going (as fast as possible) to the highest-value state $s_+$. Note that this argument uses the fact that the true MDP is plausible with high probability, so that we may take the true transitions to get from $s_-$ to $s_+$.
Analysis of UCRL2
In this section we derive a regret bound for UCRL2, which is of order $\tilde{O}\big(DS\sqrt{AT}\big)$.
Proof. The main idea of the proof is that by Lemma 10.3.1 we have
$$\tilde{g}_k^* \;\triangleq\; g_+^{\tilde{\pi}_k}(M_{t_k}) \;\ge\; g^* \;\ge\; g^{\tilde{\pi}_k}, \tag{10.3.5}$$
so that the regret in each step is upper bounded by the width of the confidence interval for $g^{\tilde{\pi}_k}$, that is, by $\tilde{g}_k^* - g^{\tilde{\pi}_k}$. In what follows we need to break down this confidence interval into the confidence intervals we have for rewards and transition probabilities.
In the following, we assume that the true MDP $\mu^*$ is always contained in the confidence regions $M_t$ considered by the algorithm. Using Lemma 10.3.1 it is not difficult to show that, with probability at least $1 - \frac{\delta}{12 T^{5/4}}$, the regret accumulated due to $\mu^* \notin M_t$ at some step $t$ is bounded by $\sqrt{T}$.
Further, note that the random fluctuation of the rewards can easily be bounded by Hoeffding's inequality (4.5.5): if $s_t$ and $a_t$ denote the state and action at step $t$, we have
$$\sum_{t=1}^{T} r_t \;\ge\; \sum_{t=1}^{T} r(s_t, a_t) \;-\; \sqrt{\tfrac{5}{8}\, T \log\tfrac{8T}{\delta}}$$
with high probability.
Now fix an episode $k$, let $v_k(s,a)$ denote the number of times action $a$ is chosen in state $s$ during episode $k$, and consider the regret the algorithm accumulates in this episode. Let $\mathrm{conf}^r_k(s,a)$ and $\mathrm{conf}^p_k(s,a)$ be the widths of the confidence intervals for rewards and transition probabilities in episode $k$. First, we simply have
$$\sum_{s,a} v_k(s,a)\big(\tilde{g}_k^* - r(s,a)\big) \;\le\; \sum_{s,a} v_k(s,a)\big(\tilde{g}_k^* - \tilde{r}_k(s,a)\big) \;+\; \sum_{s,a} v_k(s,a)\big(\tilde{r}_k(s,a) - r(s,a)\big), \tag{10.3.7}$$
where for the second term we can use that, since the true MDP lies in the confidence region,
$$\sum_{s,a} v_k(s,a)\big(\tilde{r}_k(s,a) - r(s,a)\big) \;\le\; \sum_{s,a} v_k(s,a)\Big( \big|\tilde{r}_k(s,a) - \hat{r}_k(s,a)\big| + \big|\hat{r}_k(s,a) - r(s,a)\big| \Big) \;\le\; 2 \sum_{s,a} v_k(s,a)\,\mathrm{conf}^r_k(s,a). \tag{10.3.8}$$
For the first term in (10.3.7) we use that after convergence of the value vector $u_i$ we have, by (10.3.3) and (10.3.4),
$$\tilde{g}_k^* - \tilde{r}_k\big(s, \tilde{\pi}_k(s)\big) \;=\; \sum_{s'} \tilde{P}_k\big(s' \mid s, \tilde{\pi}_k(s)\big)\, u_i(s') \;-\; u_i(s).$$
Then, noting that $v_k(s,a) = 0$ for $a \neq \tilde{\pi}_k(s)$ and using vector/matrix notation, it follows that
$$\begin{aligned}
\sum_{s,a} v_k(s,a)\big(\tilde{g}_k^* - \tilde{r}_k(s, \tilde{\pi}_k(s))\big)
&= \sum_{s,a} v_k(s,a)\Big( \sum_{s'} \tilde{P}_k\big(s' \mid s, \tilde{\pi}_k(s)\big)\, u_i(s') - u_i(s) \Big) \\
&= v_k\big(\tilde{P}_k - I\big)\, u \\
&= v_k\big(\tilde{P}_k - P_k + P_k - I\big)\, w_k \\
&= v_k\big(\tilde{P}_k - P_k\big)\, w_k + v_k\big(P_k - I\big)\, w_k,
\end{aligned} \tag{10.3.9}$$
where $P_k$ is the true transition matrix (in $\mu^*$) of the optimistic policy $\tilde{\pi}_k$ in episode $k$, and $w_k$ is a renormalisation of the vector $u$ (with entries $u_i(s)$) defined by $w_k(s) := u_i(s) - \frac{1}{2}\big(\min_s u_i(s) + \max_s u_i(s)\big)$, so that $\|w_k\|_\infty \le \frac{D}{2}$ by Lemma 10.3.4.
Since $\|\tilde{P}_k - P_k\|_1 \le \|\tilde{P}_k - \hat{P}_k\|_1 + \|\hat{P}_k - P_k\|_1$, the first term of (10.3.9) is bounded as
$$v_k\big(\tilde{P}_k - P_k\big)\, w_k \;\le\; \big\| v_k\big(\tilde{P}_k - P_k\big) \big\|_1 \cdot \big\| w_k \big\|_\infty \;\le\; 2 \sum_{s,a} v_k(s,a)\, \mathrm{conf}^p_k(s,a)\, D. \tag{10.3.10}$$
The sum of the second term of (10.3.9) over all episodes can be bounded by the Azuma-Hoeffding inequality (5.3.4) and Lemma 10.3.2, that is,
$$\sum_k v_k\big(P_k - I\big)\, w_k \;\le\; D \sqrt{\tfrac{5}{2}\, T \log\tfrac{8T}{\delta}} \;+\; D S A \log_2\tfrac{8T}{SA} \tag{10.3.11}$$
with probability at least $1 - \frac{\delta}{12 T^{5/4}}$.
Summing (10.3.8) and (10.3.10) over all episodes, by definition of the confidence intervals and Lemma 10.3.3 we have
$$\sum_k \sum_{s,a} v_k(s,a)\, \mathrm{conf}^r_k(s,a) \;+\; 2D \sum_k \sum_{s,a} v_k(s,a)\, \mathrm{conf}^p_k(s,a)
\;\le\; \mathrm{const} \cdot D \sqrt{S \log(AT/\delta)}\, \sum_k \sum_{s,a} \frac{v_k(s,a)}{\sqrt{N_k(s,a)}}
\;\le\; \mathrm{const} \cdot D \sqrt{S \log(AT/\delta)}\, \sqrt{SAT}. \tag{10.3.12}$$
The R-Max algorithm assumes that in each insufficiently visited state the learner receives the maximal possible reward; UCRL2 offers a refinement of this idea to motivate exploration. Sample complexity bounds as derived for R-Max can also be obtained for UCRL2, cf. [Jaksch et al., 2010].
The gap between the lower bound of Theorem 10.3.2 and the bound for UCRL2 has not been closed so far. There have been various attempts in that direction for different algorithms inspired by Thompson sampling [Agrawal and Jia, 2017] or UCB1 [Ortner, 2020]. However, all of the claimed proofs seem to contain issues that remain unresolved to date.
The situation is settled in the simpler episodic setting, where after every $H$ steps there is a restart. Here there are matching upper and lower bounds of order $\sqrt{HSAT}$ on the regret, see [Azar et al., 2017].
In the discounted setting, the MBIE algorithm of Strehl and Littman [2005,
2008] is a precursor of UCRL2 that is based on the same ideas. While there
are regret bounds available also for MBIE, these are not easily comparable to
Theorem 10.2.1, as the regret is measured along the trajectory of the algorithm,
while the regret considered for UCRL2 is with respect to the trajectory an opti-
mal policy would have taken. In general, regret in the discounted setting seems
to be a less satisfactory concept. However, sample complexity bounds in the dis-
counted setting for a UCRL2 variant have been given in [Lattimore and Hutter,
2014].
Last but not least, we would like to refer any reader interested in the material
of this chapter to the recent book of Lattimore and Szepesvári [2020] that deals
with the whole range of topics from simple bandits to reinforcement learning in
MDPs in much more detail.
Chapter 11
Conclusion
This book touched upon the basic principles of decision making under uncer-
tainty in the context of reinforcement learning. While one of the main streams of
thought is Bayesian decision theory, we also discussed the basics of approximate
dynamic programming and stochastic approximation as applied to reinforcement
learning problems.
Consciously, however, we have avoided going into a number of topics related
to reinforcement learning and decision theory, some of which would need a book
of their own to be properly addressed. Even though it was fun writing the
book, we at some point had to decide to stop and consolidate the material we
had, sometimes culling partially developed material in favour of a more concise
volume.
Firstly, we haven’t explicitly considered many models that can be used for
representing transition distributions, value functions or policies, beyond the
simplest ones, as we felt that this would detract from the main body of the
text. Textbooks for the latest fashion are always going to be abundant, and we
hope that this book provides a sufficient basis to enable the use of any current
methods. There are also a large number of areas which have not been covered
at all. In particular, while we touched upon the setting of two-player games and
its connection to robust statistical decisions, we have not examined problems
which are also relevant to sequential decision making, such as Markov games
and Bayesian games. In relation to this, while early in the book we discuss
risk aversion and risk seeking, we have not discussed specific sequential decision
making algorithms for such problems. Furthermore, even though we discuss the
problem of preference elicitation, we do not discuss specific algorithms for it
or the related problem of inverse reinforcement learning. Another topic which
went unmentioned, but which may become more important in the future, is
hierarchical reinforcement learning as well as options, which allow constructing
long-term actions (such as “go to the supermarket”) from primitive actions
(such as “open the door”). Finally, even though we have mentioned the basic
framework of regret minimisation, we focused on the standard reinforcement
learning problem, and ignored adversarial settings and problems with varying
amounts of side information.
It is important to note that the book almost entirely elides social aspects of
decision making. In practice, any algorithm that is going to be used to make
autonomous decisions is going to have a societal impact. In such cases, the
algorithm designer must guard against negative externalities, such as hurting
disadvantaged groups, violating privacy, or environmental damage. However, as
a lot of these issues are context dependent, we urge the reader to consult recent
work in economics, algorithmic fairness and differential privacy.
Appendix A
Symbols
, definition
∧ logical and
∨ logical or
⇒ implies
⇔ if and only if
∃ there exists
∀ for every
s.t. such that
Beta(α, β) Beta distribution with parameters (α, β)
Geom(ω) Geometric distribution with parameter ω
Wish(n −
Appendix B
Probability concepts
$A \triangleq \{x \mid x \text{ has property } Y\}.$
Example 48.
$$B(c, r) \triangleq \{x \in \mathbb{R}^n \mid \|x - c\| \le r\}$$
describes the set of points enclosed in an $n$-dimensional sphere of radius $r$ with center $c \in \mathbb{R}^n$.
Ω1 × · · · × Ωn = {(s1 , . . . , sn ) | si ∈ Ωi , i = 1, . . . , n} (B.1.1)
experiment we are interested in. At the extreme, the sample space and corre-
sponding outcomes may completely describe everything there is to know about
the experiment.
Experiments
The set of possible experimental outcomes of an experiment is called the
sample space Ω.
The following example considers the case where three different statisticians
care about three different types of outcomes of an experiment where a drug is
given to a patient. The first is interested in whether the patient recovers, the
second in whether the drug has side-effects, while the third is interested in both.
Just as weighing two apples separately and adding the totals gives you the same answer as weighing both apples together, so the total probability of either of two mutually exclusive events equals the sum of their individual probabilities. However, sets are
complex beasts and formally we wish to define exactly when we can measure
them.
Many times the natural outcome space Ω that we wish to consider is ex-
tremely complex, but we only care about whether a specific event occurs or not.
For example, when we toss a coin in the air, the natural outcome is the com-
plete trajectory that the coin follows and its final resting position. However, we
might only care about whether the coin lands heads or not. Then, the event of
the coin landing “heads” is defined as all the trajectories that the coin follows
which result in it landing heads. These trajectories form a subset A ⊂ Ω.
Probabilities will always be defined on subsets of the outcome space. These
subsets are termed events. The probability of events will simply be a function
on sets, and more specifically a measure. The following gives some intuition and
formal definitions about what this means.
Probability of a set
If A is a subset of Ω, the probability of A is a measure of the chances that
the outcome of the experiment will be an element of A.
Which sets?
Example 50. Let X be uniformly distributed on [0, 1]. By definition, this means that
the probability that X is in [0, p] is equal to p for all p ∈ [0, 1]. However, even for this
simple distribution, it might be difficult to define the probability of all events.
What is the probability that X will be in [0, 1/4)?
What is the probability that X will be in [1/4, 1]?
What is the probability that X will be a rational number?
[Figure: an apartment with three rooms A, B and C, with coins lying on the floor and a red carpet. For each room the figure indicates the floor area (A: 4 × 5 = 20m², B: 6 × 4 = 24m², C: 2 × 5 = 10m²), the number of coins (A: 3, B: 4, C: 5) and the length of red carpet (A: 0m, B: 0.5m, C: 4.5m).]
Definition B.2.1 (A field on Ω). A family F of sets, such that for each A ∈ F,
one also has A ⊆ Ω, is called a field on Ω if and only if
1. Ω ∈ F
2. if A ∈ F, then A∁ ∈ F.
3. For any $A_1, A_2, \ldots, A_n$ such that $A_i \in \mathcal{F}$, it holds that $\bigcup_{i=1}^{n} A_i \in \mathcal{F}$.
From the above definition, it is easy to see that Ai ∩ Aj is also in the field.
Since many times our family may contain an infinite number of sets, we also
want to extend the above to countably infinite unions.
Definition B.2.2 (A σ-field on Ω). A family F of sets, such that for each A ∈ F, one also has A ⊆ Ω, is called a σ-field (σ-algebra) on Ω if and only if
1. Ω ∈ F
2. if A ∈ F, then A∁ ∈ F.
3. For any sequence $A_1, A_2, \ldots$ such that $A_i \in \mathcal{F}$, it holds that $\bigcup_{i=1}^{\infty} A_i \in \mathcal{F}$.
It is easy to verify that the F given in the apartment example satisfies these
properties. In general, for any finite Ω, it is easy to find a family F containing
all possible events in Ω. Things become trickier when Ω is infinite. Can we
define an algebra F that contains all events? In general no, but we can define
an algebra on the so-called Borel sets of Ω, defined in B.2.3.
A measure λ on (Ω, F) is a function from F to [0, ∞] such that
1. λ(∅) = 0.
2. For any sequence of disjoint sets $A_1, A_2, \ldots \in \mathcal{F}$, it holds that $\lambda\big(\bigcup_i A_i\big) = \sum_i \lambda(A_i)$.
It is easy to verify that the floor area, the number of coins, and the length of the red carpet are all measures. In fact, the area and the length correspond to what is called a Lebesgue measure (see Section B.2.3 for a precise definition) and the number of coins to a counting measure.
A probability measure P on (Ω, F) is a measure that additionally satisfies
1. P(Ω) = 1
2. P(∅) = 0
The common value of the inner and outer measure is called the Lebesgue measure, $\bar{\lambda}(A) = \lambda^*(A)$.
However, the basic probability on Ω does not tell us anything about the probability of some event A given that some other event B has occurred. Sometimes these events are mutually exclusive, meaning that when B happens, A cannot be true; other times B implies A; and sometimes they are independent. To quantify exactly how knowledge of whether B has occurred affects what we know about A, we need the notion of conditional probability.
[Figure: the sample space Ω of possible patient states, with the events A1 (recovery) and A2 (side effects) drawn as overlapping regions containing an outcome ω.]
The union bound is extremely important, and one of the basic proof methods
in many applications of probability.
Finally, let us consider the general case of multiple disjoint events, shown
in Figure B.4. When B is decomposed in a set of disjoint events {Bi }, we can
write
$$P(B) = P\Big(\bigcup_i B_i\Big) = \sum_i P(B_i) \tag{B.3.2}$$
and
$$P(A \cap B) = P\Big(\bigcup_i (A \cap B_i)\Big) = \sum_i P(A \cap B_i), \tag{B.3.3}$$
for any other set A. An interesting special case occurs when $B = \Omega$, in which case $P(A) = P(A \cap \Omega)$, since $A \subseteq \Omega$ for any $A$ in the algebra. This results in the marginalisation or sum rule of probability:
$$P(A) = P\Big(\bigcup_i (A \cap B_i)\Big) = \sum_i P(A \cap B_i), \qquad \bigcup_i B_i = \Omega. \tag{B.3.4}$$
$P(A_1) = 1 \times h = h$, $\quad P(A_2) = w \times 1 = w$, and the two events are independent. Independent events are particularly important.
[Figure: the events A1 (recovery) and A2 (side effects) shown as overlapping rectangular regions inside the sample space Ω.]
Proof. From (B.3.5), $P(A_i \mid B) = P(A_i \cap B)/P(B)$ and also $P(A_i \cap B) = P(B \mid A_i) P(A_i)$. Thus
$$P(A_i \mid B) = \frac{P(B \mid A_i)\, P(A_i)}{P(B)},$$
and we continue by analysing the denominator $P(B)$. First, due to $\bigcup_{i=1}^{\infty} A_i = \Omega$ we have $B = \bigcup_{j=1}^{\infty} (B \cap A_j)$. Since the $A_i$ are disjoint, so are the $B \cap A_i$. Then from the union property of probability distributions we have
$$P(B) = P\Big(\bigcup_{j=1}^{\infty} (B \cap A_j)\Big) = \sum_{j=1}^{\infty} P(B \cap A_j) = \sum_{j=1}^{\infty} P(B \mid A_j)\, P(A_j),$$
which completes the proof.
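A small numerical check of the theorem in Python, with made-up numbers for a partition A1, A2, A3 of Ω and an observed event B:

# Hypothetical prior probabilities P(Ai) and likelihoods P(B | Ai).
prior = [0.5, 0.3, 0.2]
likelihood = [0.9, 0.5, 0.1]

p_b = sum(l * p for l, p in zip(likelihood, prior))            # P(B)
posterior = [l * p / p_b for l, p in zip(likelihood, prior)]   # P(Ai | B)
assert abs(sum(posterior) - 1.0) < 1e-12                       # posteriors sum to 1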
The distribution of X
Exercise 42. Ω is the set of 52 playing cards. X(s) is the value of each card (1 for the ace and 10 for the face cards). What is the probability of drawing a card s with X(s) > 7?
Properties
Discrete distributions
$X : \Omega \to \{x_1, \ldots, x_n\}$ takes $n$ discrete values ($n$ can be infinite). The probability function of $X$ is $f(x_i) = P(X = x_i)$, $i = 1, \ldots, n$.
Continuous distributions
$X$ has a continuous distribution if there exists a probability density function $f$ s.t. for all $B \subseteq \mathbb{R}$:
$$P_X(B) = \int_B f(x)\, dx.$$
Discrete distributions
P(X1 = x1 , . . . , Xm = xm ) = f (x1 , . . . , xm ),
where $f$ is the joint probability function, with $x_i \in V_i$.
Continuous distributions
For $B \subseteq \mathbb{R}^m$,
$$P\big\{(X_1, \ldots, X_m) \in B\big\} = \int_B f(x_1, \ldots, x_m)\, dx_1 \cdots dx_m$$
We introduce the common notation $\int \cdots \, d\mu(x)$, where $\mu$ is a measure. Let $g : \Omega \to \mathbb{R}$ be a real function. Then for any subset $B \subseteq \Omega$ we can write the following.
Lebesgue-Stieltjes notation
If $P$ is a probability measure on $(\Omega, \mathcal{F})$ and $B \subseteq \Omega$, and $g$ is $\mathcal{F}$-measurable, the probability that $g(x)$ takes a value in $B$ can be written equivalently as
$$P(g \in B) = P_g(B) = \int_B g(x)\, dP(x) = \int_B g\, dP. \tag{B.4.3}$$
Marginal distribution
The marginal distribution of $X_1, \ldots, X_k$ from a set of variables $X_1, \ldots, X_m$ is
$$P(X_1, \ldots, X_k) \triangleq \int P(X_1, \ldots, X_k, X_{k+1} = x_{k+1}, \ldots, X_m = x_m)\; d\mu(x_{k+1}, \ldots, x_m). \tag{B.4.4}$$
Independence
If $X_i$ is independent of $X_j$ for all $i \neq j$:
$$P(X_1, \ldots, X_m) = \prod_{i=1}^{m} P(X_i), \qquad f(x_1, \ldots, x_m) = \prod_{i=1}^{m} g_i(x_i). \tag{B.4.5}$$
B.4.6 Moments
There are some simple properties of the random variable under consideration which are frequently of interest in statistics. Two of these properties are the expectation and the variance.
Expectation
The expectation of $X$ is $E[X] = \int t\, dP_X(t)$. Furthermore,
$$E[g(X)] = \int g(t)\, dP_X(t)$$
for any function $g$ for which the integral is defined.
B.5 Divergences
Divergences are a natural way to measure how different two distributions are.
The problem with the empirical distribution is that it does not capture the uncertainty we have about what the real distribution is. For that reason, it should be used with care, even though it does converge to the true distribution in the limit. A clever way to construct a measure of uncertainty is to perform sub-sampling, that is, to create k random samples of size n′ < n from the original sample. Each sample will correspond to a different random empirical distribution. Sub-sampling is performed without replacement (i.e. within each sample, each observation xi is used at most once). When sampling with replacement and n′ = n, the method is called bootstrapping.
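A minimal Python sketch of the two resampling schemes just described (the data values are made up):

import random

def subsample(data, k, m):
    # k sub-samples of size m < len(data), drawn without replacement.
    return [random.sample(data, m) for _ in range(k)]

def bootstrap(data, k):
    # k bootstrap samples: same size as the data, drawn with replacement.
    n = len(data)
    return [random.choices(data, k=n) for _ in range(k)]

data = [0.1, 0.5, 0.3, 0.9, 0.7, 0.2]
bootstrap_means = [sum(b) / len(b) for b in bootstrap(data, 1000)]
subsample_means = [sum(b) / len(b) for b in subsample(data, 1000, 4)]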
B.8 Exercises
Exercise 43. Show that
$$A \cap (B \cup D) = (A \cap B) \cup (A \cap D),$$
and that
$$(A \cup B)^\complement = A^\complement \cap B^\complement, \qquad (A \cap B)^\complement = A^\complement \cup B^\complement.$$
Exercise 44 (10). Prove that any probability measure P has the following properties:
1. $P(A^\complement) = 1 - P(A)$.
2. If $A \subset B$ then $P(A) \le P(B)$.
3. For any sequence of events $A_1, A_2, \ldots$
$$P\Big(\bigcup_{i=1}^{\infty} A_i\Big) \;\le\; \sum_{i=1}^{\infty} P(A_i) \qquad \text{(union bound)}$$
Hint: Recall that if $A_1, \ldots, A_n$ are disjoint then $P\big(\bigcup_{i=1}^{n} A_i\big) = \sum_{i=1}^{n} P(A_i)$ and that $P(\emptyset) = 0$.
If $X_1, \ldots, X_n$ is a sequence of Bernoulli random variables with parameter $p$, then $\sum_{i=1}^{n} X_i$ has a binomial distribution with parameters $n, p$.
Exercise 46 (10). In a few sentences, describe your views on the usefulness of probability.
Is it the only formalism that can describe both random events and uncertainty?
Would it be useful to separate randomness from uncertainty?
What would be desirable properties of an alternative concept?
Appendix C
Useful results
$$M = \sup_{x \in A} f(x).$$
In other words, there exists no smaller upper bound than M. When the function f has a maximum, the supremum is identical to the maximum.
C.1.1 Series
Definition C.1.3 (The geometric series). The sum $\sum_{k=0}^{n} x^k$ is called the geometric series and, for $x \neq 1$, has the property
$$\sum_{k=0}^{n} x^k = \frac{x^{n+1} - 1}{x - 1}. \tag{C.1.4}$$
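The identity follows from a one-line telescoping argument; letting $n \to \infty$ for $|x| < 1$ gives the limit commonly used for discounted sums:
$$(x - 1) \sum_{k=0}^{n} x^k = \sum_{k=0}^{n} x^{k+1} - \sum_{k=0}^{n} x^k = x^{n+1} - 1, \qquad \sum_{k=0}^{\infty} x^k = \frac{1}{1 - x} \quad \text{for } |x| < 1.$$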
For a positive real number $t$ (or a complex number with positive real part), the gamma function is defined as
$$\Gamma(t) = \int_0^{\infty} x^{t-1} e^{-x}\, dx. \tag{C.1.6}$$
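Integration by parts yields the familiar recursion, which for integer arguments recovers the factorial:
$$\Gamma(t + 1) = \int_0^{\infty} x^{t} e^{-x}\, dx = \big[-x^{t} e^{-x}\big]_0^{\infty} + t \int_0^{\infty} x^{t-1} e^{-x}\, dx = t\, \Gamma(t), \qquad \Gamma(n) = (n-1)! \ \text{ for } n \in \mathbb{N}.$$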
Appendix D
Index
Index
., 106
Adaptive hypothesis testing, 110
Adaptive treatment allocation, 110
adversarial bandit, 221
approximate
    policy iteration, 177
backwards induction, 114
bandit
    adversarial, 221
    contextual, 221
    nonstochastic, 221
bandit problems, 111
    stochastic, 111
Bayes rule, 54
Bayes' theorem, 21
belief state, 113
Beta distribution, 71
binomial coefficient, 70
bootstrapping, 253
Borel σ-algebra, 243
branch and bound, 206
classification, 53
clinical trial, 110
concave function, 27
conditional probability, 246
conditionally independent, 247
contextual bandit, 221
covariance, 208
covariance matrix, 252
decision boundary, 54
decision procedure
    sequential, 92
design matrices, 208
difference operator, 134
discount factor, 111, 118
distribution
    χ2, 74
    Bernoulli, 70
    Beta, 71
    binomial, 70
    exponential, 76
    Gamma, 75
    marginal, 94
    normal, 73
divergences, 252
empirical distribution, 253
every-visit Monte-Carlo, 156
expectation, 251
experimental design, 110
exploration vs exploitation, 11
fairness, 55
first visit
    Monte-Carlo update, 156
Gamma function, 71
Gaussian processes, 208
gradient descent, 175
    stochastic, 149
Hoeffding inequality, 155
horizon, 118
inequality
    Chebyshev, 85
    Hoeffding, 86
    Markov, 85
inf, see infimum
infimum, 258
Jensen's inequality, 27
KL-Divergence, 252
KL-divergence, 89
likelihood
    conditional, 19
    relative, 16
linear programming, 136
sample mean, 66
series
    geometric, 96, 258
simulation, 154
Bibliography
Mauricio Álvarez, David Luengo, Michalis Titsias, and Neil Lawrence. Efficient
multioutput Gaussian processes through variational inducing kernels. In Pro-
ceedings of the Thirteenth International Conference on Artificial Intelligence
and Statistics (AISTATS 2010), pages 25–32, 2010.
Jean-Yves Audibert and Sébastien Bubeck. Minimax policies for adversarial and
stochastic bandits. In Proceedings of the 22nd Annual Conference on Learning Theory (COLT 2009), pages 217–226, 2009.
Peter Auer and Ronald Ortner. UCB revisited: improved regret bounds for
the stochastic multi-armed bandit problem. Period. Math. Hungar., 61(1-2):
55–65, 2010.
Peter Auer, Nicolò Cesa-Bianchi, and Paul Fischer. Finite time analysis of the
multiarmed bandit problem. Machine Learning, 47(2/3):235–256, 2002a.
Andrew G Barto. Adaptive critics and the basal ganglia. Models of information
processing in the basal ganglia, page 215, 1995.
Christian P. Robert and George Casella. Monte Carlo Statistical Methods. Springer Texts in Statistics. Springer, 1999.
Sam Corbett-Davies, Emma Pierson, Avi Feller, Sharad Goel, and Aziz Huq.
Algorithmic decision making and the cost of fairness. Technical Report
1701.08230, arXiv, 2017.
Morris H. DeGroot. Optimal Statistical Decisions. John Wiley & Sons, 1970.
Christos Dimitrakakis, Yang Liu, David Parkes, and Goran Radanovic. Sub-
jective fairness: Fairness is in the eye of the beholder. Technical Report
1706.00119, arXiv, 2017.
Cynthia Dwork, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Richard
Zemel. Fairness through awareness. In Proceedings of the 3rd Innovations in
Theoretical Computer Science Conference, pages 214–226. ACM, 2012.
Yaakov Engel, Shie Mannor, and Ron Meir. Bayes meets Bellman: The Gaussian process approach to temporal difference learning. In ICML 2003, 2003.
Yaakov Engel, Shie Mannor, and Ron Meir. Reinforcement learning with Gaussian processes. In Proceedings of the 22nd international conference on Ma-
chine learning, pages 201–208. ACM, 2005.
J. C. Gittins. Multi-armed Bandit Allocation Indices. John Wiley & Sons, New
Jersey, US, 1989.
Robert Grande, Thomas Walsh, and Jonathan How. Sample efficient reinforce-
ment learning with Gaussian processes. In International Conference on Ma-
chine Learning, pages 1332–1340, 2014.
Peter E Hart, Nils J Nilsson, and Bertram Raphael. A formal basis for the
heuristic determination of minimum cost paths. IEEE transactions on Sys-
tems Science and Cybernetics, 4(2):100–107, 1968.
Wassily Hoeffding. Probability inequalities for sums of bounded random vari-
ables. Journal of the American Statistical Association, 58(301):13–30, March
1963.
M. Hutter. Feature reinforcement learning: Part I: Unstructured MDPs. Journal
of Artificial General Intelligence, 1:3–24, 2009.
Thomas Jaksch, Ronald Ortner, and Peter Auer. Near-optimal regret bounds
for reinforcement learning. Journal of Machine Learning Research, 11:1563–
1600, 2010.
Tobias Jung and Peter Stone. Gaussian processes for sample-efficient reinforce-
ment learning with RMAX-like exploration. In ECML/PKDD 2010, pages
601–616, 2010.
Sham Kakade. A natural policy gradient. Advances in neural information
processing systems, 2:1531–1538, 2002.
Emilie Kaufmann, Nathaniel Korda, and Rémi Munos. Thompson sampling:
An optimal finite time analysis. In ALT-2012, 2012.
Michael Kearns and Satinder Singh. Finite sample convergence rates for Q-
learning and indirect algorithms. In Advances in Neural Information Process-
ing Systems, volume 11, pages 996–1002. The MIT Press, 1999.
Niki Kilbertus, Mateo Rojas-Carulla, Giambattista Parascandolo, Moritz Hardt,
Dominik Janzing, and Bernhard Schölkopf. Avoiding discrimination through
causal reasoning. Technical Report 1706.02744, arXiv, 2017.
Jon Kleinberg, Sendhil Mullainathan, and Manish Raghavan. Inherent trade-
offs in the fair determination of risk scores. Technical Report 1609.05807,
arXiv, 2016.
AN Kolmogorov and SV Fomin. Elements of the theory of functions and func-
tional analysis. Dover Publications, 1999.
M. Lagoudakis and R. Parr. Reinforcement learning as classification: Leveraging
modern classifiers. In ICML, page 424, 2003a.
M. G. Lagoudakis and R. Parr. Least-squares policy iteration. The Journal of
Machine Learning Research, 4:1107–1149, 2003b.
Tze Leung Lai and Herbert Robbins. Asymptotically efficient adaptive alloca-
tion rules. Adv. in Appl. Math., 6:4–22, 1985.
Nam M. Laird and Thomas A. Louis. Empirical Bayes confidence intervals
based on bootstrap samples. Journal of the American Statistical Association,
82(399):739–750, 1987.
Tor Lattimore. Optimally confident UCB: Improved regret for finite-armed ban-
dits, 2015. arXiv preprint arXiv:1507.07880.
Tor Lattimore and Marcus Hutter. Near-optimal PAC bounds for discounted
MDPs. Theor. Comput. Sci., 558:125–143, 2014.
Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Ve-
ness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidje-
land, Georg Ostrovski, et al. Human-level control through deep reinforcement
learning. Nature, 518(7540):529–533, 2015.
R. Munos and C. Szepesvári. Finite-time bounds for fitted value iteration. The
Journal of Machine Learning Research, 9:815–857, 2008.
Ronald Ortner. Regret bounds for reinforcement learning via Markov chain concentration. J. Artif. Intell. Res., 67:115–128, 2020. doi: 10.1613/jair.1.11316. URL https://doi.org/10.1613/jair.1.11316.
Ronald Ortner, Daniil Ryabko, Peter Auer, and Rémi Munos. Regret bounds for
restless Markov bandits. Theor. Comput. Sci., 558:62–76, 2014. doi: 10.1016/j.tcs.2014.09.026. URL http://dx.doi.org/10.1016/j.tcs.2014.09.026.
Ian Osband, Daniel Russo, and Benjamin Van Roy. (More) efficient reinforce-
ment learning via posterior sampling. In NIPS, 2013.
Ian Osband, Charles Blundell, Alexander Pritzel, and Benjamin Van Roy. Deep
exploration via bootstrapped DQN. In Advances in Neural Information Pro-
cessing Systems, pages 4026–4034, 2016.
Jan Peters and Stefan Schaal. Policy gradient methods for robotics. In In-
telligent Robots and Systems, 2006 IEEE/RSJ International Conference on,
pages 2219–2225. IEEE, 2006.
Tao Wang, Daniel Lizotte, Michael Bowling, and Dale Schuurmans. Bayesian
sparse sampling for on-line reward optimization. In ICML ’05, pages 956–
963, New York, NY, USA, 2005. ACM. ISBN 1-59593-180-5. doi: http://doi.acm.org/10.1145/1102351.1102472.
Henry H Yin and Barbara J Knowlton. The role of the basal ganglia in habit
formation. Nature Reviews Neuroscience, 7(6):464, 2006.