Decision Making Under Uncertainty and Reinforcement Learning
Christos Dimitrakakis
Ronald Ortner
Theory and Algorithms
Intelligent Systems Reference Library
Volume 223
Series Editors
Janusz Kacprzyk, Polish Academy of Sciences, Warsaw, Poland
Lakhmi C. Jain, KES International, Shoreham-by-Sea, UK
© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature
Switzerland AG 2022
This Springer imprint is published by the registered company Springer Nature Switzerland AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
Preface
The purpose of this book is to collect the fundamental results for decision making
under uncertainty in one place. In particular, the aim is to give a unified account of
algorithms and theory for sequential decision making problems, including reinforcement
learning. Starting from elementary statistical decision theory, we progress to
the reinforcement learning problem and various solution methods. The end of the
book focuses on the current state of the art in models and approximation algorithms.
The problem of decision making under uncertainty can be broken down into two
parts. First, how do we learn about the world? This involves both the problem of
modeling our initial uncertainty about the world and that of drawing conclusions
from evidence and our initial belief. Secondly, given what we currently know about
the world, how should we decide what to do, taking into account future events and
observations that may change our conclusions?
Typically, this will involve creating long-term plans covering possible future eventualities.
That is, when planning under uncertainty, we also need to take into account
what possible future knowledge could be generated when implementing our plans.
Intuitively, executing plans which involve trying out new things should give more
information, but it is hard to tell whether this information will be beneficial. The
choice between doing something which is already known to produce good results and
experimenting with something new is known as the exploration–exploitation dilemma,
and it is at the root of the interaction between learning and planning.
Part I of the book, Chaps. 1–4, focuses on decision making under uncertainty in
non-sequential settings. This includes scenarios such as hypothesis testing, where
the decision maker must choose a single action given the available evidence. Most
of the development is given through the prism of Bayesian inference and decision
theory, where the decision maker has a subjective belief (expressed as a probability
distribution) over what is true. Part II of the book, Chaps. 5–8, introduces sequential
problems and the formalism of Markov decision processes. The remaining chapters
are devoted to the problem of reinforcement learning, which is one of the most general
sequential decision problems under uncertainty. Finally, we have added a number of
theoretical and practical exercises that will hopefully aid the reader to understand
the material.
Many thanks go to all the students of the Decision making under uncertainty and
Advanced topics in reinforcement learning and decision making classes over the
years for bearing with early drafts of this book. A big thank you goes to Nikolaos
Tziortziotis, whose code is used in some of the examples in the book. Thanks also
to Aristide Tossou and Hannes Eriksson for proof-reading various chapters. Finally,
many of the coded examples in the book were run using the parallel package by Tange
[1].
Reference
1. Tange, O.: GNU Parallel: the command-line power tool. USENIX Mag. 36(1), 42–47 (2011)
Contents
1 Introduction
1.1 Uncertainty and Probability
1.2 The Exploration–Exploitation Trade-Off
1.3 Decision Theory and Reinforcement Learning
References
2 Subjective Probability and Utility
2.1 Subjective Probability
2.1.1 Relative Likelihood
2.1.2 Subjective Probability Assumptions
2.1.3 Assigning Unique Probabilities*
2.1.4 Conditional Likelihoods
2.1.5 Probability Elicitation
2.2 Updating Beliefs: Bayes’ Theorem
2.3 Utility Theory
2.3.1 Rewards and Preferences
2.3.2 Preferences Among Distributions
2.3.3 Utility
2.3.4 Measuring Utility*
2.3.5 Convex and Concave Utility Functions
2.4 Exercises
Reference
3 Decision Problems
3.1 Introduction
3.2 Rewards that Depend on the Outcome of an Experiment
3.2.1 Formalisation of the Problem Setting
3.2.2 Decision Diagrams
3.2.3 Statistical Estimation*
Classical Probability
A random experiment is performed, with a given set Ω of possible outcomes. An
example is the 2-slit experiment in physics, where a particle is generated which
can go through either one of two slits. According to our current understanding
of quantum mechanics, it is impossible to predict which slit the particle will go
through. Herein, the set Ω consists of two possible events corresponding to the
particle passing through one or the other slit.
In the 2-slit experiment, the probabilities of either event can actually be calculated
accurately through quantum theory. However, which slit the particle will go through
is fundamentally unpredictable. Such quantum experiments are the only ones that are
currently thought of as truly random (though some people disagree about that too).
Any other procedure, such as tossing a coin or casting a die, is inherently deterministic
and only appears random due to our difficulty in predicting the outcome. That is,
modelling a coin toss as a random process is usually the best approximation we can
make in practice, given our uncertainty about the complex dynamics involved. This
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022
C. Dimitrakakis and R. Ortner, Decision Making Under Uncertainty
and Reinforcement Learning, Intelligent Systems Reference Library 223,
https://doi.org/10.1007/978-3-031-07614-5_1
Subjective Probability
Here Ω can conceptually not only describe the outcomes of some experiment,
but also a set of possible worlds or realities. This set can be quite large and
include anything imaginable. For example, it may include worlds where dragons
are real. However, in practice one only cares about certain aspects of the world,
such as whether in this world, you will win the lottery if you buy a ticket. We
can interpret the probability of a world in Ω as our degree of belief that it
corresponds to reality.
In such a setting there is an actual, true world ω ∗ ∈ Ω, which could have been set
by Nature to an arbitrary value deterministically. However, we do not know which
element in Ω is the true world, and the probability reflects our lack of knowledge
(rather than any inherent randomness about the selection of ω ∗ ).
No matter which view we espouse, we must always take into account our uncertainty
when making decisions. When the problem we are dealing with is sequential,
we are taking actions, obtaining new observations, and then taking further actions. As
we gather more information, we learn more about the world. However, the things we
learn about depend on what actions we take. For example, if we always take the same
route to work, then we learn how much time this route takes on different days and
times of the week. However, we don’t obtain information about the time other routes
take. So, we potentially miss out on better choices than the one we usually follow.
This phenomenon gives rise to the so-called exploration-exploitation trade-off.
Thus, one must decide whether to exploit knowledge about the world to gain a
known reward, or to explore the world to learn something new. This will potentially
give you less reward immediately, but the knowledge itself can usually be put to use
in the future.
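The trade-off can be made concrete with a small simulation of the commuting example. The sketch below is an illustration only: the ε-greedy rule, the route names, and the travel-time distributions are assumptions introduced here, not taken from the text. With probability ε the commuter tries a random route (exploration); otherwise they take the route with the lowest average time observed so far (exploitation).

```python
import random

def epsilon_greedy_commute(mean_times, epsilon, days, seed=0):
    """Simulate route choice; returns (visit counts, average observed times)."""
    rng = random.Random(seed)
    routes = list(mean_times)
    counts = {r: 0 for r in routes}
    totals = {r: 0.0 for r in routes}

    def average(route):
        # Untried routes get an optimistic average of 0, so each is tried once.
        return totals[route] / counts[route] if counts[route] else 0.0

    for _ in range(days):
        if rng.random() < epsilon:
            route = rng.choice(routes)          # explore: pick any route
        else:
            route = min(routes, key=average)    # exploit: best average so far
        # Hypothetical noise model: Gaussian around the route's mean time.
        observed = rng.gauss(mean_times[route], 3.0)
        counts[route] += 1
        totals[route] += observed
    return counts, {r: average(r) for r in routes}

# Hypothetical mean travel times in minutes.
MEAN_TIMES = {"usual": 30.0, "alternative": 25.0}
counts, averages = epsilon_greedy_commute(MEAN_TIMES, epsilon=0.1, days=1000)
```

A commuter who never deviates from the usual route would learn nothing about the alternative; here a small amount of exploration is enough to discover that the alternative is faster on average.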
This exploration-exploitation trade-off only arises when data collection is interactive.
If we are simply given a set of data which is to be used to decide upon a course
of action, but our decision does not affect the data we shall collect in the future, then
things are much simpler. However, a lot of real-world human decision making as
well as modern applications in data science involve such trade-offs. Decision theory
offers precise mathematical models and algorithms for such problems.
Decision theory deals with the formalization and solution of decision problems.
Given a number of alternatives, what would be the rational choice in a particular
situation depending on one’s goals and desires? In order to answer this question we
need to develop a good concept of rational behavior. This will serve two purposes.
Firstly, this can provide an explanation for the behavior of animals and humans.
Secondly, it is useful for developing models and algorithms for automated decision
making in complex tasks.
A particularly interesting problem in this setting is reinforcement learning. This
problem arises when the environment is unknown, and the learner has to make decisions
solely through interaction, which only gives limited feedback. Thus, the learning
agent does not have access to detailed instructions on which task to perform,
nor on how to do it. Instead, it performs actions, which affect the environment, and
obtains some observations (i.e., sensory input) and feedback, usually in the form of
rewards which represent the agent’s desires. The learning problem is then formulated
as the problem of learning how to act to maximize total reward. In biological
systems, reward is intrinsically hardwired to signals associated with basic needs. In
artificial systems, we can choose the reward signals so as to reinforce behaviour that
achieves the designer’s goals.
Reinforcement learning is a fundamental problem in artificial intelligence, since
frequently we can tell robots, computers, or cars only what we would like them to
achieve, but we do not know the best way to achieve it. We would like to simply
give them a description of our goals and then let them explore the environment on
their own to find a good solution. Since the world is (at least partially) unknown, the
learner always has to deal with the exploration-exploitation trade-off.
Animals and humans also learn through imitation, exploration, and shaping their
behavior according to reward signals to finally achieve their goals. In fact, it has been
known since the 1990s that there is some connection between some reinforcement
learning algorithms and mechanisms in the basal ganglia [1–3].
Decision theory is closely related to other fields, such as logic, statistics, game
theory, and optimization. Those fields have slightly different underlying objectives,
even though they may share the same formalism. In the field of optimization, we
are not only interested in optimal planning in complex environments but also in
how to make robust plans given some uncertainty about the environment. Artificial
intelligence research is concerned with modelling the environments and developing
algorithms that are able to learn by interaction with the environment or from demonstration
by teachers. Economics and game theory deal with the problem of modeling
the behavior of rational agents and with designing mechanisms (such as markets)
that will give incentives to agents to behave in a certain way.
Beyond pure research, there are also many applications connected to decision theory.
Commercial applications arise e.g. in advertising where one wishes to model the
preferences and decision making of individuals. Decision problems also arise in security.
There are many decision problems, especially in cryptographic and biometric
authentication, but also in detecting and responding to intrusions in networked computer
systems. Finally, in the natural sciences, especially in biology and medicine,
decision theory offers a way to automatically design and run experiments and to
optimally construct clinical trials.
Outline
References
1. Yin, H.H., Knowlton, B.J.: The role of the basal ganglia in habit formation. Nat. Rev. Neurosci.
7(6), 464 (2006)
2. Barto, A.G.: Adaptive Critics and the Basal Ganglia, pp. 215–232. MIT Press (1995)
3. Schultz, W., Dayan, P., Montague, P.R.: A neural substrate of prediction and reward. Science
275(5306), 1593–1599 (1997)
Chapter 2
Subjective Probability and Utility
In order to make decisions, we need to be able to make predictions about the possible
outcomes of each decision. Usually, we have uncertainty about what those outcomes
are. This can be due to stochasticity, which is frequently used to model games of
chance and inherently unpredictable physical phenomena. It can also be due to partial
information, a characteristic of many natural problems. For example, it might be hard
to know at any single moment how much change you have in your wallet, whether
you will be able to catch the next bus, or to remember where you left your keys.
In either case, this uncertainty can be expressed as a subjective belief. This does not
have to correspond to reality. For example, some people believe, quite inaccurately,
that if a coin comes up tails for a long time, it is quite likely to come up heads very
soon. Or, you might quite happily believe your keys are in your pocket, only to realise
that you left them at home as soon as you arrive at the office.
In this book, we assume the view that subjective beliefs can be modelled as
probabilities. This allows us to treat uncertainty due to stochasticity and due to partial
information in a unified framework. In doing so, we have to define for each problem
a space of possible outcomes Ω and specify an appropriate probability distribution.
Let us start with the simple example of guessing whether a tossed coin will come up
heads or tails. In this case the sample space Ω would correspond to every possible
way the coin can land. Since we are only interested in predicting which face will
be up, let A ⊂ Ω be all those cases where the coin comes up heads, and B ⊂ Ω be
the set of tosses where it comes up tails. Here A ∩ B = ∅, but there may be some
other events such as the coin becoming lost, so it does not necessarily hold that
A ∪ B = Ω.
We also use ≿ and ≾ for at least as likely as and for no more likely than.
Let us now speak more generally about the case where we have defined an appropriate
σ-field F on Ω. Then each element A_i ∈ F will be a subset of Ω. We now
wish to define relative likelihood relations for the elements A_i ∈ F.¹
As we would like to use the language of probability to talk about likelihoods, we
need to define a probability measure that agrees with our given relations. A probability
measure P : F → [0, 1] is said to agree with a relation A ≾ B, if it has the property
that P(A) ≤ P(B) if and only if A ≾ B, for all A, B ∈ F. In general, there are
many possible measures that can agree with a given relation, cf. Example 2.1 below.
However, it could also be that a given relational structure is incompatible with any
possible probability measure. We also consider the question under which assumptions
a likelihood relation corresponds to a unique probability measure.
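For a finite Ω, agreement between a measure and a relation can be checked by brute force. The following is a minimal sketch under assumptions made here for illustration: Ω has two outcomes, F is the full power set, and the relation encodes the belief that heads is strictly more likely than tails. The function names and the `RANK` table are hypothetical.

```python
from itertools import combinations

def powerset(outcomes):
    """All subsets of a finite outcome set, as frozensets."""
    items = list(outcomes)
    return [frozenset(c) for r in range(len(items) + 1)
            for c in combinations(items, r)]

def measure(point_probs, event):
    """P(event) for a measure given by the probabilities of single outcomes."""
    return sum(point_probs[w] for w in event)

def agrees(point_probs, no_more_likely, events):
    """True if P(A) <= P(B) holds exactly when A is no more likely than B."""
    return all(
        (measure(point_probs, a) <= measure(point_probs, b)) == no_more_likely(a, b)
        for a in events for b in events
    )

EVENTS = powerset(["heads", "tails"])

# Assumed belief: empty set < {tails} < {heads} < whole space.
RANK = {frozenset(): 0, frozenset({"tails"}): 1,
        frozenset({"heads"}): 2, frozenset({"heads", "tails"}): 3}

def relation(a, b):
    return RANK[a] <= RANK[b]

agrees_06 = agrees({"heads": 0.6, "tails": 0.4}, relation, EVENTS)
agrees_09 = agrees({"heads": 0.9, "tails": 0.1}, relation, EVENTS)
agrees_05 = agrees({"heads": 0.5, "tails": 0.5}, relation, EVENTS)
```

Both biased measures agree with the relation, while the fair-coin measure does not. This illustrates that a relation on a finite field typically does not pin down a unique measure.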
We would like our beliefs to satisfy some intuitive properties about what statements
we can make concerning the relative likelihood of events. As we will see, these
assumptions are also necessary to guarantee the existence of a corresponding probability
measure. First of all, it must always be possible to say whether one event is
more likely than the other, i.e., our beliefs must be complete. Consequently, we are
not allowed to claim ignorance.
Assumption 2.1.1 (SP1) For any pair of events A, B ∈ F, one has either A ≻ B,
A ≺ B, or A ≂ B.
Another important assumption is a principle of consistency: Informally, if we believe
that every possible event Ai that leads to A is less likely than a unique corresponding
event Bi that leads to an outcome B, then we should always conclude that A is less
likely than B.
¹ More formally, we can define three classes C_≻, C_≺, C_≂ ⊂ F² such that a pair (A_i, A_j) ∈ C_R if
and only if it satisfies the relation A_i R A_j, where R ∈ {≻, ≺, ≂}. These three classes form a partition
of F² under the subjective probability assumptions we will introduce in the next section.
In many cases, and particularly when F is a finite field, there is a large number of
probability distributions agreeing with our relative likelihoods. Choosing one specific
probability over another does not seem easy. The following example underscores this
ambiguity.
Theorem 2.1.13 (Equivalent event) For any event A ∈ F, there exists some α ∈ [0, 1]
such that A ≂ (x ∈ [0, α]).
This means that we can now define the probability of an event A by matching it
to a specific equivalent event on [0, 1].
Hence
A ≂ (x ∈ [0, P(A)]),
So far we have only considered the problem of forming opinions about which events
are more likely a priori. However, we also need to have a way to incorporate evidence
which may adjust our opinions. For example, while we ordinarily may think that
A ≿ B, we may have additional information D, given which we think the opposite
is true. We can formalise this through the notion of conditional likelihoods.
Example 2.2 Say that A is the event that it rains in Gothenburg, Sweden tomorrow.
We know that Gothenburg is quite rainy due to its oceanic climate, so we set A ≿ A^c,
where A^c denotes the complement of A (no rain).
Now, let us try and incorporate some additional information. Let D denote the fact
that good weather is forecast, so that given D good weather will appear more probable
than rain, formally (A^c | D) ≻ (A | D).
Conditional Likelihoods
Define (A | D) ≾ (B | D) to mean that B is at least as likely as A when it is known
that D holds.
(A | D) ≾ (B | D) iff A ∩ D ≾ B ∩ D.
It turns out that there are very few ways that a conditional probability definition
can satisfy all of our assumptions. The usual definition is the following.
Definition 2.1.18 (Conditional probability)
\[ P(A \mid D) \triangleq \frac{P(A \cap D)}{P(D)}. \]
This definition effectively answers the question of how much evidence for A
we have, now that we have observed D. This is expressed as the ratio between the
probability of the combined event A ∩ D, also known as the joint probability of A and D,
and the marginal probability of D itself. The intuition behind the definition becomes clearer
once we rewrite it as P(A ∩ D) = P(A | D) P(D). Then conditional probability is
effectively used as a way to factorise joint probabilities.
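As a small worked instance of the definition, consider a single roll of a fair six-sided die. The die, the events `A` and `D`, and the helper names are assumptions chosen for illustration; exact fractions are used so that the factorisation P(A ∩ D) = P(A | D) P(D) can be checked without rounding.

```python
from fractions import Fraction

# Hypothetical example: one roll of a fair six-sided die.
OMEGA = frozenset(range(1, 7))

def prob(event):
    """Uniform probability of an event (a subset of OMEGA)."""
    return Fraction(len(event & OMEGA), len(OMEGA))

def cond_prob(a, d):
    """P(A | D) = P(A ∩ D) / P(D), defined only when P(D) > 0."""
    if prob(d) == 0:
        raise ValueError("conditioning event has probability zero")
    return prob(a & d) / prob(d)

A = frozenset({2, 4, 6})   # the roll is even
D = frozenset({4, 5, 6})   # the roll is at least 4

posterior = cond_prob(A, D)         # (2/6) / (3/6) = 2/3
joint = cond_prob(A, D) * prob(D)   # recovers P(A ∩ D) = 1/3
```

Observing D raises the probability of an even roll from 1/2 to 2/3, and multiplying back by P(D) recovers the joint probability.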
By repeating this procedure recursively we are able to slowly build the complete
distribution, quantile by quantile.
Although we always start with a particular belief, this belief must be adjusted when
we receive new evidence. In probabilistic inference, the updated beliefs are simply
the probability of future events conditioned on observed events. This idea is captured
neatly by Bayes’ theorem, which links the prior probability P(Ai ) of events to their
posterior probability P(Ai | B) given some event B and the probability P(B | Ai )
of observing the evidence B given events Ai .
Theorem 2.2.1 (Bayes’ theorem) Let A_1, A_2, . . . be a (possibly infinite) sequence
of disjoint events such that $\bigcup_{i=1}^{n} A_i = \Omega$ and P(A_i) > 0 for all i. Let B be another
event with P(B) > 0. Then
\[ P(A_i \mid B) = \frac{P(B \mid A_i)\, P(A_i)}{\sum_{j=1}^{n} P(B \mid A_j)\, P(A_j)}. \]
As $\bigcup_{j=1}^{n} A_j = \Omega$, we have $B = \bigcup_{j=1}^{n} (B \cap A_j)$. Since the A_j are disjoint, so are
the B ∩ A_j. As P is a probability, the union property gives
\[ P(B) = P\Bigl(\bigcup_{j=1}^{n} (B \cap A_j)\Bigr) = \sum_{j=1}^{n} P(B \cap A_j) = \sum_{j=1}^{n} P(B \mid A_j)\, P(A_j), \]
had predicted it was not going to rain. On days when it doesn’t rain, the station had
said no rain 9 times out of 10.
(Solution) Let B denote the event that the station predicts no rain. According to our
information, the probability that the station predicted no rain on a day when it rains is
P(B | A_1) = 1/2. On the other hand, P(B | A_2) = 0.9. Combining these with Bayes’
theorem, we obtain
\[ P(A_1 \mid B) = \frac{P(B \mid A_1)\, P(A_1)}{P(B \mid A_1)\, P(A_1) + P(B \mid A_2)\, [1 - P(A_1)]} = \frac{\tfrac{1}{2}\, P(A_1)}{0.9 - 0.4\, P(A_1)}. \]
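The final expression can be evaluated for any prior probability of rain. The sketch below simply codes the formula above; the function name and the prior values in the loop are hypothetical choices for illustration, while the two likelihoods 1/2 and 0.9 are those of the solution.

```python
def posterior_rain(prior_rain, p_norain_given_rain=0.5, p_norain_given_dry=0.9):
    """P(A1 | B): probability of rain given that no rain was predicted."""
    evidence = (p_norain_given_rain * prior_rain
                + p_norain_given_dry * (1.0 - prior_rain))   # P(B)
    return p_norain_given_rain * prior_rain / evidence

# Hypothetical priors P(A1), chosen only to show how the posterior moves.
for prior in (0.1, 0.5, 0.9):
    print(f"P(A1) = {prior:.1f}  ->  P(A1 | B) = {posterior_rain(prior):.3f}")
```

For every prior strictly between 0 and 1 the posterior is smaller than the prior, since a prediction of no rain is more likely on dry days (0.9) than on rainy ones (0.5).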
While probability can be used to describe how likely an event is, utility can be
used to describe how desirable it is. More concretely, our subjective probabilities
are numerical representations of our beliefs and the information available to us.
They can be taken to represent our “internal model” of the world. By analogy, our
utilities are numerical representations of our tastes and preferences. That is, even if
the consequences of our actions are not directly known to us, we assume that we act
to maximise our utility, in some sense.
Rewards
Consider that we have to choose a reward r from a set R of possible rewards. While
the elements of R may be arbitrary, we shall in general find that we prefer some
rewards to others. In fact, some elements of R may not even be desirable. As an
example, R might be a set of tickets to different musical events, or a set of financial
commodities.
Preferences
Example 2.5 (Musical event tickets) We have a set of tickets R, and we must choose
the ticket r ∈ R we prefer best. Here are two possible scenarios:
• R is a set of tickets to different music events at the same time, at equally good
halls with equally good seats and the same price. Here preferences simply coincide
with the preferences for a certain type of music or an artist.
• If R is a set of tickets to different events at different times, at different quality
halls with different quality seats and different prices, preferences may depend on
all the factors.
Example 2.6 (Route selection) We have a set of alternative routes and must pick
one. If R contains two routes of the same quality, one short and one long, we will
probably prefer the shorter one. If the longer route is more scenic our preferences
may be different.
Example 2.8 (Route selection) Assume that you have to pick between two routes
P1 , P2 . Your preferences are such that shorter time routes are preferred over longer
ones. For simplicity, let R = {10, 15, 30, 35} be the possible times it might take to
reach your destination. Route P1 takes 10 min when the road is clear, but 30 min
when the traffic is heavy. The probability of heavy traffic on P1 is 0.5. On the other
hand, route P2 takes 15 min when the road is clear, but 35 min when the traffic is
heavy. The probability of heavy traffic on P2 is 0.2.
2.3.3 Utility
The concept of utility allows us to create a unifying framework, such that given
a particular set of rewards and probability distributions on them, we can define
preferences among distributions via their expected utility. The first step is to define
utility as a way to define a preference relation among rewards.
The above definition is very similar to how we defined relative likelihood in terms
of probability. For a given utility function, its expectation for a distribution over
rewards is defined as follows.
Definition 2.3.3 (Expected utility) Given a utility function U, the expected utility
of a distribution P on R is
\[ \mathbb{E}_P(U) = \int_R U(r) \, dP(r). \]
We make the assumption that the utility function is such that the expected utility
remains consistent with the preference relations between all probability distributions
we are choosing between.
\[ P \succsim^{*} Q \quad \text{iff} \quad \mathbb{E}_P(U) \ge \mathbb{E}_Q(U). \tag{2.2} \]
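To see the hypothesis at work, it can be applied to the two routes of Example 2.8. The sketch assumes a utility equal to the negative travel time in minutes; this particular utility function is an illustrative assumption, since the example only states that shorter times are preferred. For a finite reward set the integral reduces to a sum.

```python
def expected_utility(distribution, utility):
    """Sum of U(r) P(r) over a finite reward distribution {r: P(r)}."""
    return sum(p * utility(r) for r, p in distribution.items())

# Travel-time distributions from Example 2.8 (minutes -> probability).
P1 = {10: 0.5, 30: 0.5}
P2 = {15: 0.8, 35: 0.2}

def utility(minutes):
    # Assumed utility: each minute of travel costs one unit.
    return -minutes

eu1 = expected_utility(P1, utility)   # -(0.5 * 10 + 0.5 * 30) = -20
eu2 = expected_utility(P2, utility)   # -(0.8 * 15 + 0.2 * 35) = -19
preferred = "P2" if eu2 > eu1 else "P1"
```

Under this assumed utility, P2 has the higher expected utility and is therefore preferred, even though P1 offers the shortest possible trip. A different utility function over travel times could reverse the ranking.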
Example 2.9 Consider the following decision problem. You have the option of entering
a lottery for 1 currency unit (CU). The prize is 10 CU and the probability of
winning is 0.01. This can be formalised by making it a choice between two probability
distributions: P, where you do not enter the lottery, and Q, which represents
entering the lottery.
Calculating the expected utility obviously gives 0 for not entering and
E(U | Q) = Σ_r U(r) Q(r) = −0.9 for entering the lottery, cf. Table 2.1.
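The value −0.9 can be reproduced in a few lines. The sketch assumes a linear utility U(r) = r over net rewards, so that winning yields 10 − 1 = 9 CU and losing yields −1 CU; this encoding of the rewards is an assumption made here.

```python
# Net rewards in CU -> probability, assuming a linear utility U(r) = r.
P = {0: 1.0}                    # do not enter the lottery
Q = {10 - 1: 0.01, -1: 0.99}    # win the prize minus the ticket, or lose the ticket

def expected_utility(distribution, utility=lambda r: r):
    """Sum of U(r) times its probability over a finite distribution."""
    return sum(prob * utility(r) for r, prob in distribution.items())

eu_P = expected_utility(P)   # 0
eu_Q = expected_utility(Q)   # 9 * 0.01 - 1 * 0.99 = -0.9
```

Since the expected utility of Q is below that of P, an agent with linear utility would not enter the lottery.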
Monetary Rewards
Frequently, rewards come in the form of money. In general, it is assumed that people
prefer to have more money than less money. However, while the utility of monetary
rewards is generally assumed to be increasing, it is not necessarily linear. For example,
1,000 Euros are probably worth more to somebody with only 100 Euros in the bank
than to somebody with 100,000 Euros. Hence, it seems reasonable to assume that
the utility of money is concave.
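The bank-balance comparison can be quantified once a concrete concave utility is fixed. The sketch below assumes a logarithmic utility of total wealth, which is one common choice and not prescribed by the text.

```python
import math

def utility_gain(wealth, amount, utility=math.log):
    """Increase in utility from receiving `amount` on top of `wealth`."""
    return utility(wealth + amount) - utility(wealth)

gain_poor = utility_gain(100, 1_000)       # ln(1100) - ln(100) = ln(11)
gain_rich = utility_gain(100_000, 1_000)   # ln(101000) - ln(100000) = ln(1.01)
```

Under this assumption the same 1,000 Euros increase utility by ln(11) for the first person but only by ln(1.01) for the second.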
The following examples show the consequences of the expected utility hypothesis.
Exercise 2.3.5 Show that under the expected utility hypothesis, if gamble 1 is preferred
in Example 2.10, gamble 1 must also be preferred in Example 2.11 for
any utility function.
In practice, you may find that your preferences are not aligned with what this
exercise suggests. This implies that either your decisions do not conform to the
expected utility hypothesis, or that you are not internalising the given probabilities.
We will explore this further in the following example.
The St. Petersburg Paradox
The following simple example illustrates the fact that, internally, most humans do
not behave in ways that are compatible with linear utility for money.
Example 2.12 (The St. Petersburg Paradox, Bernoulli 1713) A coin is tossed repeatedly
until the coin comes up heads. The player obtains 2^n currency units, where
n ∈ {1, 2, . . .} is the number of times the coin was thrown. The coin is assumed to
be fair, meaning that the probability of heads is always 1/2.
How many currency units k would you be willing to pay to play this game once?
As the probability to stop at round n is 2^{−n}, the expected monetary gain of the
game is
\[ \sum_{n=1}^{\infty} 2^{n} \, 2^{-n} = \infty. \]
Were your utility function linear, you would be willing to pay any finite amount k
to play, as the expected utility for playing the game for any finite k is
\[ \sum_{n=1}^{\infty} U(2^{n} - k) \, 2^{-n} = \infty. \]
It would be safe to assume that very few readers would be prepared to pay an
arbitrarily high amount to play this game. One way to explain this is that the utility
function is not linear. An alternative would be to assume a logarithmic utility function.
For example, if we also assume that the player has an initial capital C from which k
has to be paid, the utility function would satisfy
\[ \mathbb{E}\,U = \sum_{n=1}^{\infty} \ln(C + 2^{n} - k) \, 2^{-n}. \]
Then for C = 10,000 the maximum bet would be 14. For C = 100 it would be 6,
while for C = 10 it is just 4.
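One way to explore such thresholds numerically is to search for the stake k at which the expected logarithmic utility of playing equals the utility ln(C) of not playing. The sketch below is a hedged illustration: the truncation of the series at 200 rounds, the bisection search, and the function names are choices made here. The thresholds it returns depend on conventions the text leaves open (for instance continuous versus integer bets), so they need not match the quoted figures exactly.

```python
import math

def expected_log_utility(capital, stake, max_rounds=200):
    """Truncated sum of ln(C + 2^n - k) 2^{-n} over n = 1..max_rounds."""
    return sum(math.log(capital + 2.0**n - stake) * 2.0**-n
               for n in range(1, max_rounds + 1))

def indifference_stake(capital, tol=1e-9):
    """Largest stake k whose expected log utility is at least ln(C)."""
    target = math.log(capital)
    lo, hi = 0.0, capital + 2.0 - 1e-9   # keeps C + 2 - k strictly positive
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if expected_log_utility(capital, mid) >= target:
            lo = mid    # playing is still at least as good as not playing
        else:
            hi = mid
    return lo

stake_large = indifference_stake(10_000)
stake_small = indifference_stake(10)
```

The bisection relies on the expected utility being decreasing in the stake, so there is a single crossing point. Richer players are willing to stake more, but the threshold grows only slowly with C.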
There is another reason why one may not pay an arbitrary amount to play this
game. The player may not fully internalise the fact (or rather, the promise) that
the coin is unbiased. Another explanation would be that it is not really believed
that the bank can pay an unbounded amount of money. Indeed, if the bank can
only pay amounts up to N , in the linear expected utility scenario, for a coin with
probability p of coming heads, we have
\[ \sum_{n=1}^{N} 2^{n} p^{n-1} (1 - p) = 2 (1 - p) \, \frac{1 - (2p)^{N}}{1 - 2p}. \]
For large N and p = 0.45, it turns out that you should only expect a payoff of about 10
currency units. Similarly, for a fair coin and N = 1024 you should only pay around
10 units as well. These are possible subjective beliefs that an individual might have
that would influence its behaviour when dealing with a formally specified decision
problem.
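The truncated sum and its closed form are easy to compare numerically. The sketch follows the formula exactly as written, so that round n is reached with probability p^{n−1}(1 − p); the function names and the choice of 200 rounds as "large N" are assumptions made here. Reading the cap N = 1024 = 2^10 as a limit of 10 rounds is likewise an interpretation, not something the text states.

```python
def capped_expected_payoff(p, n_rounds):
    """Direct sum of 2^n p^(n-1) (1 - p) for n = 1..n_rounds."""
    return sum(2.0**n * p**(n - 1) * (1.0 - p) for n in range(1, n_rounds + 1))

def capped_expected_payoff_closed_form(p, n_rounds):
    """Closed form 2 (1 - p) (1 - (2p)^N) / (1 - 2p), valid for p != 1/2."""
    return 2.0 * (1.0 - p) * (1.0 - (2.0 * p)**n_rounds) / (1.0 - 2.0 * p)

large_n = capped_expected_payoff(0.45, 200)                      # near 2 * 0.55 / 0.1 = 11
large_n_closed = capped_expected_payoff_closed_form(0.45, 200)   # same value via the closed form
fair_10 = capped_expected_payoff(0.5, 10)                        # each round contributes 1
```

For p = 0.45 the sum converges to 11 as N grows, in line with a payoff of roughly 10 units. For a fair coin the closed form is undefined (0/0), but the direct sum shows that each round contributes exactly 1, giving 10 for ten rounds.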
Since we cannot even rely on linear utility for money, we need to ask ourselves how
we can measure the utility of different rewards. There are a number of ways, including
trying to infer it from the actions of people. The simplest approach is to simply ask
them to make even money bets. No matter what approach we use, however, we need
to make some assumptions about the utility structure. This includes whether or not
we should accept that the expected utility hypothesis holds for the observed human
behaviour.
Experimental Measurement of Utility
Example 2.13 Let $\langle a, b \rangle$ denote a lottery ticket that yields $a$ or $b$ CU with equal
probability. Consider the following sequence:
1. Find $x_1$ such that receiving $x_1$ CU with certainty is equivalent to receiving $\langle a, b \rangle$.
2. Find $x_2$ such that receiving $x_2$ CU with certainty is equivalent to receiving $\langle a, x_1 \rangle$.
3. Find $x_3$ such that receiving $x_3$ CU with certainty is equivalent to receiving $\langle x_1, b \rangle$.
4. Find $x_4$ such that receiving $x_4$ CU with certainty is equivalent to receiving $\langle x_2, x_3 \rangle$.
The above example algorithm allows us to measure the utility of money under the
assumption that the expected utility hypothesis holds. Note that if x1 = x4 , then the
[Fig. 2.1: a linear function $x$, a convex function $e^x - 1$, and a concave function $\ln(x + 1)$, plotted for $x \in [0, 1]$]
An important property of convex functions is that they are bounded from above
by linear segments connecting their points. This property is formally given below.
Example 2.14 If the utility function is convex, then we would prefer obtaining a
random reward $x$ rather than a fixed reward $y = E(x)$. Thus, a convex utility function
implies risk-taking. This is illustrated by Fig. 2.1, which shows a linear function, $x$,
a convex function, $e^x - 1$, and a concave function, $\ln(x + 1)$.
For concave functions, the inverse of Jensen’s inequality holds (i.e., with ≥
replaced with ≤). If the utility function is concave, then we choose a gamble giving
a fixed reward E[x] rather than one giving a random reward x. Consequently, a con-
cave utility function implies risk aversion. The act of buying insurance can be related
to the concavity of our utility function. Consider the following example, where we
assume individuals are risk-averse, but insurance companies are risk-neutral.
Example 2.15 (Insurance) Let $d$ be the insurance cost, $h$ our insurance cover, $\epsilon$ the
probability of needing the cover, and $U$ an increasing utility function (for monetary
values). Then we are going to buy insurance if the utility of losing $d$ with certainty
is greater than the utility of losing $h$ with probability $\epsilon$.
The insurance company has a linear utility and fixes the premium $d$ high enough
so that
\[ d > \epsilon h. \tag{2.4} \]
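To make the argument concrete, here is a small sketch under explicit assumptions: utility is evaluated on final wealth (initial wealth minus losses) rather than on the loss alone, the risk-averse individual has logarithmic utility, and the numbers are hypothetical. The premium is chosen above the expected loss, so a risk-neutral insurer profits on average, yet the risk-averse individual still prefers to insure.

```python
import math

def prefers_insurance(utility, wealth, premium, cover, prob_loss):
    """True if paying the premium for sure beats facing the loss uninsured."""
    insured = utility(wealth - premium)
    uninsured = (prob_loss * utility(wealth - cover)
                 + (1 - prob_loss) * utility(wealth))
    return insured > uninsured

# Hypothetical numbers: wealth 100, possible loss 50 with probability 0.1,
# premium 6, which exceeds the expected loss 0.1 * 50 = 5.
concave_buys = prefers_insurance(math.log, 100, 6, 50, 0.1)
linear_buys = prefers_insurance(lambda w: w, 100, 6, 50, 0.1)
```

With these values the concave (logarithmic) utility leads to buying insurance, while a linear utility does not.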
2.4 Exercises
Exercise 2.4.1 Preferences are transitive if they are induced by a utility function
$U : \mathcal{R} \to \mathbb{R}$ such that $a \succ^* b$ iff $U(a) > U(b)$. Give an example of a utility function,
not necessarily mapping the rewards $\mathcal{R}$ to $\mathbb{R}$, and a binary relation $>$ such that
transitivity can be violated. Back your example with a thought experiment.
Exercise 2.4.4 Consider two urns, each containing red and blue balls. The first urn
contains an equal number of red and blue balls. The second urn contains a randomly
chosen proportion X of red balls, i.e., the probability of drawing a red ball from that
urn is X .
1. Suppose that you were to select an urn, and then choose a random ball from
that urn. If the ball is red, you win 1 CU, otherwise nothing. Show that if your
utility function is increasing with monetary gain, you should prefer the first urn
iff $E(X) < \frac{1}{2}$.
2. Suppose that you were to select an urn, and then choose $n$ random balls from
that urn and that urn only. Each time you draw a red ball, you gain 1 CU. After
you draw a ball, you put it back in the urn. Assume that the utility $U$ is strictly
concave and suppose that $E(X) = \frac{1}{2}$. Show that you should always select balls
from the first urn.
Hint: Show that for the second urn, $E(U \mid x)$ is concave for $0 \le x \le 1$ (this can be
done by showing $\frac{d^2}{dx^2} E(U \mid x) < 0$). In fact,
\[ \frac{d^2}{dx^2} E(U \mid x) = n(n-1) \sum_{k=0}^{n-2} \left[ U(k) - 2U(k+1) + U(k+2) \right] \binom{n-2}{k} x^k (1-x)^{n-2-k}. \]
Exercise 2.4.5 (Defining likelihood relations via probability measures) Show that
a probability measure $P$ on $(\Omega, \mathcal{F})$ satisfies the following:
1. For any events $A, B \in \mathcal{F}$, either $P(A) > P(B)$, $P(B) > P(A)$ or $P(A) = P(B)$.
2. If $A_i$, $B_i$ are partitions of $A$, $B$ such that $P(A_i) \le P(B_i)$ for all $i$, then
$P(A) \le P(B)$.
3. For any event $A$, $P(\emptyset) \le P(A)$ and $P(\emptyset) < P(\Omega)$.
Exercise 2.4.7 (Alternatives to the expected utility hypothesis) The expected utility
hypothesis states that we prefer decision P over Q if and only if our expected utility
under the distribution $P$ is larger than that under $Q$, i.e., $E_P(U) \ge E_Q(U)$. Under
what conditions do you think this is a reasonable hypothesis? Can you come up with
a different rule for making decisions under uncertainty? Would it still satisfy the total
order and transitivity properties of preference relations? In other words, could you
still unambiguously say for any P, Q whether you prefer P to Q? If you had three
choices, P, Q, W , and you preferred P to Q and Q to W, would you always prefer P
to W ?
Exercise 2.4.8 (Rational Arthur-Merlin games) You are Arthur, and you wish to pay
Merlin to do a very difficult computation for you. More specifically, you perform a
query q ∈ Q and obtain an answer r ∈ R from Merlin. After he gives you the answer,
you give Merlin a random amount of money $m$, depending on $r$, $q$. In particular, it
is assumed that there exists a unique correct answer $r^* = f(q)$ and that
$E(m \mid r, q) = \sum_m m \, P(m \mid r, q)$ is maximized by $r^*$, i.e., for any $r \ne r^*$,
\[ E(m \mid r^*, q) > E(m \mid r, q). \]
Assume that Merlin knows P and the function f . Is this sufficient to incentivize Mer-
lin to respond with the correct answer? If not, what other assumptions or knowledge
are required?
Exercise 2.4.9 Assume that you need to travel over the weekend. You wish to decide
whether to take the train or the car. Assume that the train and the car trip cost exactly
the same amount of money. The train trip takes 2 h. If it does not rain, then the car
trip takes 1.5 h. However, if it rains the road becomes both more slippery and more
crowded and so the average trip time is 2.5 h. Assume that your utility function is
equal to the negative amount of time spent travelling: U (t) = −t.
1. Let it be Friday. What is the expected utility of taking the car on Sunday? What
is the expected utility of taking the train on Sunday? What is the Bayes-optimal
decision, assuming you will travel on Sunday?
2. Consider two stations H1 and H2 that predict rain on Saturday and Sunday with
probabilities 0.4, 0.6 (H1 ) and 0.9, 0.1 (H2 ), respectively. Let it be a rainy Sat-
urday, which we denote by event A. What is your posterior probability over the
two weather stations, given that it has rained, i.e., P(Hi | A)? What is the new
marginal probability of rain on Sunday, i.e., P(B | A)? What is now the expected
utility of taking the car versus taking the train on Sunday? What is the Bayes-
optimal decision?
Exercise 2.4.10 Consider the previous example with a nonlinear utility function.
1. One example is U (t) = 1/t, which is a convex utility function. How would you
interpret the utility in that case? Without performing the calculations, can you
tell in advance whether your optimal decision can change? Verify your answer
by calculating the expected utility of the two possible choices.
2. How would you model a problem where the objective involves arriving in time
for a particular appointment?
3.1 Introduction
Consider the problem of choosing one of two different types of tickets in a raffle.
Each type of ticket gives you the chance to win a different prize. The first is a bicycle
and the second is a tea set. The winning ticket for each prize is drawn uniformly from
the respective sold tickets. Thus, the raffle guarantees that somebody will win either
prize. If most people opt for the bicycle, your chance of actually winning it by buying
a single ticket is much smaller. However, if you prefer winning a bicycle to winning
the tea set, it is not clear what choice you should make in the raffle. The above is the
quintessential example for problems where the reward that we obtain depends not
only on our decisions, but also on the outcome of an experiment.
This problem can be viewed more generally for scenarios where the reward you
receive depends not only on your own choice, but also on some other, unknown fact
in the world. This may be something completely uncontrollable, and hence you can
only make an informed guess.
More formally, given a set of possible actions A, we must make a decision a ∈ A
before knowing the outcome ω of an experiment with outcomes in Ω. After the
experiment is performed, we obtain a reward r ∈ R which depends on both the
outcome ω of the experiment and our decision. As discussed in the previous chapter,
our preferences for some rewards over others are determined by a utility function
$U : \mathcal{R} \to \mathbb{R}$, such that we prefer $r$ to $r'$ if and only if $U(r) \ge U(r')$. Now, however,
we cannot choose rewards directly. Another example, which will be used throughout
this section, is the following.
Example 3.1 (Taking the umbrella) We must decide whether to take an umbrella to
work. Our reward depends on whether we get wet and the number of objects that we
carry. We would rather not get wet and not carry too many things, which can be made
more precise by choosing an appropriate utility function. For example, we might put
a value of −1 for carrying the umbrella and a value of −10 for getting wet. In this
example, the only events of interest are whether it rains or not.
The elements we need to formulate the problem setting are a random variable, a
decision variable, a reward function mapping the random and the decision variable
to a reward, and a utility function that says how much we prefer each reward.
P(ω ∈ A) = P(A).
reward that we obtain (whether we get wet or not) depends on both our decision (to
take the umbrella) and the random outcome (whether it rains).
Definition 3.2.1 (Reward function) A reward function ρ : Ω × A → R defines the
reward we obtain if we select a ∈ A and the experimental outcome is ω ∈ Ω:
r = ρ(ω, a)
When we discussed the problem of choosing between distributions in Sect. 2.3.2,
we had directly defined probability distributions on the set of rewards. We can
now formulate our problem in that setting. First, we define a set of distributions
$\{P_a \mid a \in A\}$ on the reward space $(\mathcal{R}, \mathcal{F}_{\mathcal{R}})$, such that the decision $a$ amounts to
choosing a particular distribution $P_a$ on the rewards.
Example 3.2 (Rock paper scissors) Consider the simple game of rock paper scissors,
where your opponent plays a move at the same time as you, so that you cannot
influence his move. The opponent’s moves are Ω = {ωR , ωP , ωS } for rock, paper,
scissors, respectively, which also corresponds to your decision set A = {aR , aP , aS }.
The reward set is R = {Win, Draw, Lose}.
You have studied your opponent for some time and you believe that he is most
likely to play rock, $P(\omega_R) = 3/6$, somewhat likely to play paper, $P(\omega_P) = 2/6$, and less
likely to play scissors, $P(\omega_S) = 1/6$.
What is the probability of each reward, for each decision you make? Taking
the example of aR , we see that you win if the opponent plays scissors with
probability 1/6, you lose if the opponent plays paper (2/6), and you draw if he plays rock
(3/6). Consequently, we can convert the outcome probabilities to reward probabilities
for every decision:
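\begin{align*}
P_{a_R}(\mathrm{Win}) &= 1/6, & P_{a_R}(\mathrm{Draw}) &= 3/6, & P_{a_R}(\mathrm{Lose}) &= 2/6, \\
P_{a_P}(\mathrm{Win}) &= 3/6, & P_{a_P}(\mathrm{Draw}) &= 2/6, & P_{a_P}(\mathrm{Lose}) &= 1/6, \\
P_{a_S}(\mathrm{Win}) &= 2/6, & P_{a_S}(\mathrm{Draw}) &= 1/6, & P_{a_S}(\mathrm{Lose}) &= 3/6.
\end{align*}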
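Of course, what you play depends on your own utility function. If you prefer winning
over drawing or losing, you could for example have the utility function $U(\mathrm{Win}) = 1$,
$U(\mathrm{Draw}) = 0$, $U(\mathrm{Lose}) = -1$. Then, since $E_a U = \sum_{\omega \in \Omega} U(\omega, a) P_a(\omega)$, we have
\[ E_{a_R} U = -1/6, \qquad E_{a_P} U = 2/6, \qquad E_{a_S} U = -1/6, \]
so that playing paper has the highest expected utility.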
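The conversion from outcome probabilities to reward probabilities, and then to expected utilities, can be sketched as follows. The move names, dictionary layout and function names are illustrative choices; the belief and the utilities are those of the example. Exact fractions are used so that the results can be compared directly with the values above.

```python
from fractions import Fraction

MOVES = ["rock", "paper", "scissors"]
BEATS = {"rock": "scissors", "paper": "rock", "scissors": "paper"}

# Belief about the opponent's move, as in the example.
belief = {"rock": Fraction(3, 6), "paper": Fraction(2, 6), "scissors": Fraction(1, 6)}
utility = {"Win": 1, "Draw": 0, "Lose": -1}

def reward(opponent, mine):
    """Reward function rho(omega, a): outcome of the game for our move."""
    if mine == opponent:
        return "Draw"
    return "Win" if BEATS[mine] == opponent else "Lose"

def reward_distribution(mine):
    """Distribution P_a over rewards induced by our move and the belief."""
    dist = {"Win": Fraction(0), "Draw": Fraction(0), "Lose": Fraction(0)}
    for opponent, prob in belief.items():
        dist[reward(opponent, mine)] += prob
    return dist

def expected_utility(mine):
    return sum(utility[r] * p for r, p in reward_distribution(mine).items())

best_move = max(MOVES, key=expected_utility)
```

Here `best_move` is `"paper"`, with expected utility 2/6.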
Fig. 3.1 Decision diagrams for the combined and separated formulation of the decision problem.
Squares denote decision variables, diamonds denote utilities. All other variables are denoted by
circles. Arrows denote the flow of dependency
Expected utility
The expected utility of any decision $a \in A$ under $P$ is:
\[ E_{P_a}(U) = \int_{\mathcal{R}} U(r) \, dP_a(r) = \int_{\Omega} U[\rho(\omega, a)] \, dP(\omega), \]
which we also denote by $U(P, a) \triangleq E_{P_a} U$.
Table 3.1 Rewards, utilities, expected utility for 20% probability of rain

ρ(ω, a)        a1                        a2
ω1             Dry, carrying umbrella    Wet
ω2             Dry, carrying umbrella    Dry

U[ρ(ω, a)]     a1                        a2
ω1             −1                        −10
ω2             −1                        0
E_P(U | a)     −1                        −2
The above equation requires that the following technical assumption is satisfied.
Assumption 3.2.3 The sets $\{\omega \mid \rho(\omega, a) \in B\}$ belong to $\mathcal{F}_\Omega$. That is, $\rho$ is
$\mathcal{F}_\Omega$-measurable for any $a$.
As usual, we employ the expected utility hypothesis (Assumption 2.3.4). Thus, we
should choose the decision that results in the highest expected utility.
The dependency structure of this problem in either formulation can be visualized
in the decision diagram shown in Fig. 3.1.
Example 3.3 (Continuation of Example 3.1) You are going to work, and it might
rain. The forecast said that the probability of rain (ω1 ) was 20%. What do you do?
• a1 : Take the umbrella.
• a2 : Risk it!
The reward of a given outcome and decision combination, as well as the respective
utility, is given in Table 3.1.
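The expected utilities in the last row of Table 3.1 follow directly from the utilities and the forecast. A minimal sketch, with illustrative names for outcomes and decisions:

```python
P = {"rain": 0.2, "no_rain": 0.8}  # forecast: 20% probability of rain

# U[outcome][decision], taken from Table 3.1.
U = {
    "rain":    {"umbrella": -1, "risk_it": -10},
    "no_rain": {"umbrella": -1, "risk_it": 0},
}

def expected_utility(decision):
    return sum(P[w] * U[w][decision] for w in P)

bayes_decision = max(["umbrella", "risk_it"], key=expected_utility)
```

Taking the umbrella gives expected utility −1, risking it gives −2, so the umbrella is the better choice under this forecast.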
Decision diagrams, also known as decision networks or influence diagrams, are used
to show dependencies between different variables. As illustrated in the examples
shown in Fig. 3.1, if an arrow points from a variable x to a variable y it means that
y depends on x. In other words, y has x as an input. In general, decision diagrams
include the following types of nodes:
• Choice nodes (denoted by squares) are nodes whose values can be directly chosen
by the decision maker. In general there may be more than one decision maker
involved.
• Value nodes (denoted by diamonds) are the nodes that the decision maker is inter-
ested in influencing. That is, the utility of the decision maker is always a function
of the value nodes.
• Circle nodes are used to denote all other types of variables. These include deter-
ministic or stochastic variables.
• Line style and colour mark quantities that are part of the problem, but are not
observed by one or more of the decision makers. These are usually called latent
variables.
Let us take a look at the example of Fig. 3.1b. There, the reward is a function of both
$\omega$ and $a$, i.e., $r = \rho(\omega, a)$, while $\omega$ depends only on the probability distribution $P$.
Typically, there must be a path from a choice node to a value node, otherwise nothing
the decision maker can do will influence the utility. Nodes belonging to or observed
by different players will usually be denoted by different lines or colors. In Fig. 3.1b,
ω, which is not observed, is shown with a dashed line.
Example 3.4 (Voting) Assume you wish to estimate the number of votes for different
candidates in an election. The unknown parameters of the problem mainly include:
the percentage of likely voters in the population, the probability that a likely voter is
going to vote for each candidate. One simple way to estimate this is by polling.
Consider a nation with $k$ political parties. Let $\omega = (\omega_1, \ldots, \omega_k) \in [0, 1]^k$ be the
voting proportions for each party. We wish to make a guess $a \in [0, 1]^k$. How should
we guess, given a distribution $P(\omega)$? How should we select $U$ and $\rho$? This depends
on what our goal is, when we make the guess.
If we wish to give a reasonable estimate about the votes of all the $k$ parties, we can
use the squared error: First, set the error vector $r = (\omega_1 - a_1, \ldots, \omega_k - a_k) \in [-1, 1]^k$.
Then we set $U(r) \triangleq -\|r\|^2$, where $\|r\|^2 = \sum_i |\omega_i - a_i|^2$.
If, on the other hand, we just want to predict the winner of the election, then the
actual percentages of all individual parties are not important. In that case, we can set
$r = 1$ if $\arg\max_i \omega_i = \arg\max_i a_i$ and $0$ otherwise, and $U(r) = r$.
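The two choices of utility can be sketched as follows. The vote shares and guesses are hypothetical, and ties in the arg max are broken in favour of the first index, which is an assumption of this sketch.

```python
def squared_error_utility(omega, guess):
    """U = -||omega - guess||^2, rewarding accurate estimates for all parties."""
    return -sum((w - g) ** 2 for w, g in zip(omega, guess))

def winner_utility(omega, guess):
    """U = 1 if the guessed winner matches the true winner, else 0."""
    def argmax(values):
        return max(range(len(values)), key=lambda i: values[i])
    return 1 if argmax(omega) == argmax(guess) else 0

omega = (0.5, 0.3, 0.2)          # hypothetical true vote shares
close_guess = (0.45, 0.35, 0.2)  # small errors, correct winner
wrong_winner = (0.3, 0.5, 0.2)   # larger errors, wrong winner
```

A guess can score well under one utility and poorly under the other, which is why the goal has to be fixed before choosing $U$ and $\rho$.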
Given the above, instead of the expected utility, we consider the expected
loss, or risk.
Definition 3.2.3 (Risk)
\[ \kappa(P, a) = \int_{\Omega} \ell(\omega, a) \, dP(\omega) \]
3.3 Bayes Decisions
The decision which maximizes the expected utility under a particular distribution P,
is called the Bayes-optimal decision, or simply the Bayes decision. The probability
distribution P is supposed to reflect all our uncertainty about the problem. Note that
in the following we usually drop the reward function ρ from the decision problem
and consider utility functions U that map directly from Ω × A to R.
Definition 3.3.1 (Bayes-optimal utility) Consider an outcome (or parameter) space
$\Omega$, a decision space $A$, and a utility function $U : \Omega \times A \to \mathbb{R}$. For any probability
distribution $P$ on $\Omega$, the Bayes-optimal utility $U^*(P)$ is defined as the smallest upper
bound on $U(P, a)$ over all decisions $a \in A$. That is,
\[ U^*(P) = \sup_{a \in A} U(P, a). \]
The maximization over decisions is usually not easy. However, there exist a few
cases where it is relatively simple. The first of those is when the utility function is
the negative squared error.
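Example 3.5 (Quadratic loss) Consider $\Omega = \mathbb{R}^k$ and $A = \mathbb{R}^k$. The utility function
that, for any point $\omega \in \mathbb{R}^k$, is defined as
\[ U(\omega, a) = -\|\omega - a\|^2 \]
is the negative squared error. For this utility, the decision
\[ a = E_P(\omega) \]
maximizes the expected utility $U(P, a)$, under the technical assumption that
$\partial/\partial a \, \|\omega - a\|^2$ is measurable with respect to $\mathcal{F}_{\mathbb{R}}$.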
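Taking derivatives, due to the measurability assumption, we can swap the order of
differentiation and integration and obtain
\begin{align*}
\frac{\partial}{\partial a} \int_{\Omega} \|\omega - a\|^2 \, dP(\omega)
&= \int_{\Omega} \frac{\partial}{\partial a} \|\omega - a\|^2 \, dP(\omega) \\
&= 2 \int_{\Omega} (a - \omega) \, dP(\omega) \\
&= 2 \int_{\Omega} a \, dP(\omega) - 2 \int_{\Omega} \omega \, dP(\omega) \\
&= 2a - 2E(\omega).
\end{align*}
Setting the derivative equal to 0 and noting that the utility is concave, we see that the
expected utility is maximized for $a = E_P(\omega)$.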
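A small numerical check of this result, for a hypothetical discrete distribution on the real line (so $k = 1$): the decision that maximizes the expected negative squared error over a fine grid coincides with the mean. The distribution and the grid are assumptions made only for this illustration.

```python
outcomes = [0.0, 1.0, 4.0]  # hypothetical support of P
probs = [0.5, 0.3, 0.2]     # hypothetical probabilities

def expected_utility(a):
    """E_P[-(omega - a)^2] for the discrete distribution above."""
    return sum(p * -((w - a) ** 2) for w, p in zip(outcomes, probs))

mean = sum(p * w for w, p in zip(outcomes, probs))

# Grid of candidate decisions from 0.00 to 4.00 in steps of 0.01.
candidates = [i / 100 for i in range(401)]
best_on_grid = max(candidates, key=expected_utility)
```

Here the mean is 1.1, and `best_on_grid` agrees with it up to floating-point error.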
Another simple example is the absolute error, where $U(\omega, a) = -|\omega - a|$. The
solution in this case differs significantly from the squared error. As can be seen from
Fig. 3.2a, for absolute loss, the optimal decision is to choose the a that is closest to
the most likely ω. Figure 3.2b illustrates the finding of Theorem 3.3.1.
Although finding the optimal decision for an arbitrary utility U and distribution P
may be difficult, fortunately the Bayes-optimal utility has some nice properties which
enable it to be approximated rather well. In particular, for any decision, the expected
utility is linear with respect to our belief P. Consequently, the Bayes-optimal utility
is convex with respect to $P$. This firstly implies that there is a unique “worst” distribution
$P$, against which we cannot do very well. Secondly, we can approximate
the Bayes-utility very well for all possible distributions by generalizing from a small
number of distributions. In order to define linearity and convexity, we first introduce
the concept of a mixture of distributions.
[Fig. 3.2, panel (b) Quadratic error: expected utility $U$ against the decision $a$, with curves labelled 0.1, 0.25, 0.5, 0.75]
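Consider two probability measures $P$, $Q$ on $(\Omega, \mathcal{F}_\Omega)$. These define two alternative
distributions for $\omega$. For any $P$, $Q$ and $\alpha \in [0, 1]$, we define the mixture of
distributions
\[ Z_\alpha \triangleq \alpha P + (1 - \alpha) Q \tag{3.3.2} \]
to mean the probability measure such that $Z_\alpha(A) = \alpha P(A) + (1 - \alpha) Q(A)$ for any
$A \in \mathcal{F}_\Omega$. For any fixed choice $a$, the expected utility varies linearly with $\alpha$:
\[ U(Z_\alpha, a) = \alpha \, U(P, a) + (1 - \alpha) \, U(Q, a). \]
Proof
\begin{align*}
U(Z_\alpha, a) &= \int_{\Omega} U(\omega, a) \, dZ_\alpha(\omega) \\
&= \alpha \int_{\Omega} U(\omega, a) \, dP(\omega) + (1 - \alpha) \int_{\Omega} U(\omega, a) \, dQ(\omega) \\
&= \alpha \, U(P, a) + (1 - \alpha) \, U(Q, a).
\end{align*}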
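\[ U^*(Z_\alpha) \le \alpha \, U^*(P) + (1 - \alpha) \, U^*(Q). \]
Proof From the definition of the expected utility (3.3.1), for any decision $a \in A$,
\begin{align*}
U^*(Z_\alpha) &= \sup_{a \in A} U(Z_\alpha, a) \\
&= \sup_{a \in A} \left[ \alpha \, U(P, a) + (1 - \alpha) \, U(Q, a) \right] \\
&\le \alpha \sup_{a \in A} U(P, a) + (1 - \alpha) \sup_{a \in A} U(Q, a) \\
&= \alpha \, U^*(P) + (1 - \alpha) \, U^*(Q).
\end{align*}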
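Both properties can be checked numerically on the umbrella problem of Example 3.3, where a distribution over outcomes is described by a single number, the probability of rain. The sketch below is illustrative: it takes $P$ to be "certainly dry" and $Q$ to be "certainly rain", and the helper names are assumptions.

```python
ACTIONS = ("umbrella", "risk_it")
U = {
    "rain":    {"umbrella": -1.0, "risk_it": -10.0},
    "no_rain": {"umbrella": -1.0, "risk_it": 0.0},
}

def expected_utility(p_rain, a):
    """U(P, a) when P assigns probability p_rain to rain."""
    return p_rain * U["rain"][a] + (1 - p_rain) * U["no_rain"][a]

def bayes_utility(p_rain):
    """U*(P): the best expected utility over all decisions."""
    return max(expected_utility(p_rain, a) for a in ACTIONS)

def mixture(alpha, p, q):
    """Rain probability under Z_alpha = alpha * P + (1 - alpha) * Q."""
    return alpha * p + (1 - alpha) * q

p, q = 0.0, 1.0
alphas = [i / 10 for i in range(11)]
linear_ok = all(
    abs(expected_utility(mixture(al, p, q), a)
        - (al * expected_utility(p, a) + (1 - al) * expected_utility(q, a))) < 1e-12
    for al in alphas for a in ACTIONS
)
convex_ok = all(
    bayes_utility(mixture(al, p, q))
    <= al * bayes_utility(p) + (1 - al) * bayes_utility(q) + 1e-12
    for al in alphas
)
```

For example, at $\alpha = 0.5$ the Bayes-optimal utility is $-1$, which lies below the chord value $-0.5$.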
As we have proven, the expected utility is linear with respect to $Z_\alpha$. Thus, for any
fixed action $a$ we obtain a line such as those shown in Fig. 3.3. By Theorem 3.3.2, the
Bayes-optimal utility is convex. Furthermore, the maximizing decision for any $Z_\alpha$ is
tangent to the Bayes-optimal utility at the point $(Z_\alpha, U^*(Z_\alpha))$. If we take a decision
that is optimal with respect to some $Z$, but the distribution is in fact $Q \ne Z$, then we
are not far from the optimal, if $Q$, $Z$ are close and $U^*$ is smooth. Consequently, we
can trivially lower bound the Bayes utility by examining any arbitrary finite set of
decisions $\hat{A} \subseteq A$. That is,
\[ U^*(P) \ge \max_{a \in \hat{A}} U(P, a) \]
holds due to convexity. The two bounds suggest an algorithm for successive approxi-
mation of the Bayes-optimal utility, by looking for the largest gap between the lower
and the upper bounds.
3.4 Statistical and Strategic Decision Making
In this section we consider more general strategies that do not simply pick a single
fixed decision. Further, we will consider other criteria than maximizing expected
utility and minimizing risk, respectively.
Strategies Instead of choosing a specific decision, we could choose to randomize
our decision somehow. In other words, instead of our choices being specific
decisions, we can choose among distributions over decisions. For example, instead
of choosing to eat lasagna or beef, we choose to throw a coin and eat lasagna if the
coin comes heads and beef otherwise. Accordingly, in the following we will consider
strategies that are probability measures on $A$, the set of which we will denote
by $\Delta(A)$.
Interestingly, for the type of problems that we have considered so far, even if we
expand our choices to the set of all possible probability measures on A, there always
is one decision (rather than a strategy) which is optimal.
Theorem 3.4.1 Consider any statistical decision problem with probability measure
$P$ on outcomes $\Omega$ and with utility function $U : \Omega \times A \to \mathbb{R}$. Further let $a^* \in A$ be
such that $U(P, a^*) \ge U(P, a)$ for all $a \in A$. Then for any probability measure
$\sigma$ on $A$,
\[ U(P, a^*) \ge U(P, \sigma). \]
Proof
\begin{align*}
U(P, \sigma) &= \int_A U(P, a) \, d\sigma(a) \\
&\le \int_A U(P, a^*) \, d\sigma(a) \\
&= U(P, a^*) \int_A d\sigma(a) \\
&= U(P, a^*)
\end{align*}
This theorem should not be applied naively. It only states that if we know $P$
then the expected utility of the best fixed/deterministic decision a ∗ ∈ A cannot be
increased by randomizing between decisions. For example, it does not make sense to
apply this theorem to cases where P itself is completely or partially unknown (e.g.,
when P is chosen by somebody else and its value remains hidden to us).
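The theorem can be illustrated with a finite decision set: the expected utility of a randomized strategy is a weighted average of the expected utilities of the individual decisions, so it cannot exceed the largest of them. The sketch below samples random strategies for a hypothetical vector of values $U(P, a)$; the numbers and the random seed are assumptions.

```python
import random

# Hypothetical expected utilities U(P, a) of three decisions under a known P.
decision_utilities = [-1.0, -2.0, 0.5]
best_fixed = max(decision_utilities)

def strategy_utility(sigma):
    """U(P, sigma) = sum over a of sigma(a) * U(P, a)."""
    return sum(s * u for s, u in zip(sigma, decision_utilities))

rng = random.Random(0)  # fixed seed so the check is reproducible

def random_strategy():
    """A random probability vector over the decisions."""
    weights = [rng.random() for _ in decision_utilities]
    total = sum(weights)
    return [w / total for w in weights]

never_better = all(
    strategy_utility(random_strategy()) <= best_fixed + 1e-12
    for _ in range(1000)
)
```

The check only covers the case of a known $P$, in line with the caveat above.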
There are some situations where maximizing expected utility with respect to the
distribution on outcomes is unnatural. Two simple examples are the following.
Maximin and minimax policies If there is no information about $\omega$ available, it may
be reasonable to take a pessimistic approach and select $a^*$ that maximizes the utility
in the worst-case $\omega$. The respective maximin value
\[ U_* \triangleq \max_a \min_\omega U(\omega, a) = \min_\omega U(\omega, a^*) \]
can essentially be seen as how much utility we would be able to obtain, if we were
to make a decision $a$ first, and nature were to select an adversarial decision $\omega$ later.
Conversely, the minimax value is
\[ U^* \triangleq \min_\omega \max_a U(\omega, a) = \max_a U(\omega^*, a), \]
where $\omega^* \triangleq \arg\min_\omega \max_a U(\omega, a)$ is the worst-case choice nature could make, if
we were to select our own decision $a$ after its own choice was revealed to us.
To illustrate this, let us consider the following example.
Example 3.6 You consider attending an open air concert. The weather forecast
reports 50% probability of rain. Going to the concert (a1 ) will give you a lot of
pleasure if it doesn’t rain (ω2 ), but in case of rain (ω1 ) you actually would have
preferred to stay at home ($a_2$). Since in general you prefer nice weather to rain, you
prefer $\omega_2$ to $\omega_1$ even if you decide not to go. The reward of a given
outcome-decision combination, as well as the respective utility, is given in Table 3.2. We see
that a1 maximizes expected utility. However, under a worst-case assumption this is
not the case, i.e., the maximin solution is a2 .
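Note that by definition
\[ U^* \ge U(\omega^*, a^*) \ge U_*. \tag{3.4.1} \]
The regret of a decision $a$ under outcome $\omega$ is
\[ L(\omega, a) \triangleq \max_{a'} U(\omega, a') - U(\omega, a). \]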
As an example let us revisit Example 3.6. Given the regret of each decision-outcome
pair, we can determine the decision minimizing expected regret E(L | P, a) and min-
imizing maximum regret maxω L(ω, a), analogously to expected utility and minimax
utility. Table 3.3 shows that the choice minimizing regret either in expectation or in
the minimax sense is a1 (going to the concert). Note that this is a different outcome
than before when we considered utility, which shows that the concept of regret may
result in different decisions.
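The comparison between the criteria in Example 3.6 can be reproduced with a few lines. The outcome and decision names are illustrative; the utilities are those of Table 3.2 and the regret follows the definition $L(\omega, a) = \max_{a'} U(\omega, a') - U(\omega, a)$.

```python
ACTIONS = ["go", "stay"]
P = {"rain": 0.5, "no_rain": 0.5}
U = {
    "rain":    {"go": -1, "stay": 0},
    "no_rain": {"go": 10, "stay": 1},
}

# Regret of each decision-outcome pair.
L = {w: {a: max(U[w].values()) - U[w][a] for a in ACTIONS} for w in U}

def expectation(table, a):
    return sum(P[w] * table[w][a] for w in P)

max_expected_utility = max(ACTIONS, key=lambda a: expectation(U, a))
maximin_utility = max(ACTIONS, key=lambda a: min(U[w][a] for w in U))
min_expected_regret = min(ACTIONS, key=lambda a: expectation(L, a))
minimax_regret = min(ACTIONS, key=lambda a: max(L[w][a] for w in L))
```

Going to the concert is selected by expected utility, expected regret and minimax regret, while the maximin utility criterion selects staying at home.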
Table 3.2 Utility function, expected utility and maximin utility of Example 3.6

U(ω, a)          a1      a2
ω1               −1      0
ω2               10      1
E(U | P, a)      4.5     0.5
min_ω U(ω, a)    −1      0
We now view minimax problems as two player games, where one player chooses a
and the other player chooses ω. The decision diagram for this problem is given in
Fig. 3.4, where the dashed line indicates that, from the point of view of the decision
maker, nature’s choice is unobserved before she makes her own decision. A simul-
taneous two-player game is a game where both players act without knowing each
other’s decision, see Fig. 3.5. From the point of view of the player that chooses a,
this is equivalent to assuming that ω is hidden, as shown in Fig. 3.4. There are other
variations of such games, however. For example, their moves may be revealed after
they have played. This is important in the case where the game is played repeatedly.
However, what is usually revealed is not the belief ξ, which is something assumed
to be internal to player one, but ω, the actual decision made by the first player. In
other cases, we might have that U itself is not known, and we only observe U (ω, a)
for the choices made.
Minimax utility, regret and loss In the following we again consider strategies
as defined in Definition 3.4.1. If the decision maker knows the outcome, then the
additional flexibility by randomizing over the actions does not help. As we showed
for the general case of a distribution over Ω, a simple decision is as good as any
randomized strategy:
What follows are some rather trivial remarks connecting regret with utility in
various cases.
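Remark 3.4.2
\[ L(\omega, \sigma) = \sum_a \sigma(a) L(\omega, a) \ge 0, \]
Proof
\begin{align*}
L(\omega, \sigma) &= \max_{\sigma'} U(\omega, \sigma') - U(\omega, \sigma) = \max_{\sigma'} U(\omega, \sigma') - \sum_a \sigma(a) \, U(\omega, a) \\
&= \sum_a \sigma(a) \left[ \max_{\sigma'} U(\omega, \sigma') - U(\omega, a) \right] \ge 0
\end{align*}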
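Remark 3.4.3
\[ L(\omega, \sigma) = \max_a U(\omega, a) - U(\omega, \sigma) \]
Proof As (3.4.2) shows, for any fixed $\omega$, the best decision is always deterministic,
so that
\begin{align*}
\sum_{a'} \sigma(a') L(\omega, a') &= \sum_{a'} \sigma(a') \left[ \max_{a \in A} U(\omega, a) - U(\omega, a') \right] \\
&= \max_{a \in A} U(\omega, a) - \sum_{a'} \sigma(a') \, U(\omega, a').
\end{align*}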
then
\[ \max_{\sigma'} U(\omega, \sigma') = \max_a U(\omega, a) = 0. \]
In this section we give a few more details about the connections between minimax
theory and the theory of two-player games. In particular, we extend the actions of
nature to $\Delta(\Omega)$, the probability distributions over $\Omega$ and as before consider strategies
in $\Delta(A)$.
where the solution exists as long as A and Ω are finite, which we will assume in the
following.
Expected regret
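We can now define the expected regret for a given pair of distributions $\xi$, $\sigma$ as
\begin{align*}
L(\xi, \sigma) &= \max_{\sigma'} \sum_{\omega} \xi(\omega) \left[ U(\omega, \sigma') - U(\omega, \sigma) \right] \\
&= \max_{\sigma'} U(\xi, \sigma') - U(\xi, \sigma).
\end{align*}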
Not all minimax and maximin policies result in the same value. The following
theorem gives a condition under which the game does have a value.
Theorem 3.4.2 If there exist distributions¹ $\xi^*$, $\sigma^*$ and $C \in \mathbb{R}$ such that
\[ U(\xi^*, \sigma) \le C \le U(\xi, \sigma^*) \quad \forall \xi, \sigma \]
then
\[ U^* = U_* = U(\xi^*, \sigma^*) = C. \]
Similarly,
\[ C \ge \max_{\sigma} U(\xi^*, \sigma) \ge \min_{\xi} \max_{\sigma} U(\xi, \sigma) = U^*. \]
¹ These distributions may be singular, that is, they may be concentrated in one point. For example,
$\sigma^*$ is singular, if $\sigma^*(a) = 1$ for some $a$ and $\sigma^*(a') = 0$ for all $a' \ne a$.
Theorem 3.4.2 gives a sufficient condition for a game having a value. In fact, the
type of games we have been looking at so far are called bilinear games. For these, a
solution always exists and there are efficient methods for finding it.
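Definition 3.4.3 A bilinear game is a tuple $(U, \Xi, \Sigma, \Omega, A)$ with $U : \Xi \times \Sigma \to \mathbb{R}$
such that all $\xi \in \Xi$ are arbitrary distributions on $\Omega$ and all $\sigma \in \Sigma$ are arbitrary
distributions on $A$ with
\[ U(\xi, \sigma) \triangleq E(U \mid \xi, \sigma) = \sum_{\omega, a} U(\omega, a) \, \sigma(a) \, \xi(\omega). \]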
Theorem 3.4.3 For a bilinear game, $U^* = U_*$. In addition, the following three
conditions are equivalent:
1. $\sigma^*$ is maximin, $\xi^*$ is minimax, and $U^* = C$.
2. $U(\xi, \sigma^*) \ge C \ge U(\xi^*, \sigma)$ for all $\xi$, $\sigma$.
3. $U(\omega, \sigma^*) \ge C \ge U(\xi^*, a)$ for all $\omega$, $a$.
While general games may be hard, bilinear games are easy, in the sense that minimax
solutions can be found with well-known algorithms. One such method is linear
programming. The problem
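\[ \max_{\sigma} \min_{\xi} U(\xi, \sigma), \]
where $\xi$, $\sigma$ are distributions over finite domains, can be converted to finding $\sigma$
corresponding to the greatest lower bound $v_\sigma \in \mathbb{R}$ on the utility. Using matrix notation,
set $\mathbf{U}$ to be the matrix such that $\mathbf{U}_{\omega,a} = U(\omega, a)$, and consider the vectors $\sigma_a = \sigma(a)$
and $\xi_\omega = \xi(\omega)$. Then the problem can be written as:
\[ \max \Big\{ v_\sigma \;\Big|\; (\mathbf{U}\sigma)_j \ge v_\sigma \ \forall j, \ \sum_i \sigma_i = 1, \ \sigma_i \ge 0 \ \forall i \Big\} \]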
where everything has been written in matrix form. In fact, one can show that vξ = vσ ,
thus obtaining Theorem 3.4.3.
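As a rough illustration of the maximization over $\sigma$, the sketch below does not call a linear programming solver. Instead it searches a coarse grid on the probability simplex for the strategy whose worst-case row value $\min_j (\mathbf{U}\sigma)_j$ is largest. This brute-force substitute is an assumption of the sketch and is only practical for very small games. The game is rock paper scissors with utilities $+1$, $0$, $-1$ for win, draw and loss. Because the expected utility is linear in $\xi$, it suffices to take the minimum over nature's pure choices $\omega$.

```python
from fractions import Fraction
from itertools import product

# U[omega][a]: our utility when the opponent plays omega and we play a,
# with moves ordered (rock, paper, scissors).
U = [
    [0, 1, -1],   # opponent plays rock
    [-1, 0, 1],   # opponent plays paper
    [1, -1, 0],   # opponent plays scissors
]

def worst_case_value(sigma):
    """min over omega of (U sigma)_omega, the guaranteed expected utility."""
    return min(sum(U[w][a] * sigma[a] for a in range(3)) for w in range(3))

def simplex_grid(steps):
    """All probability vectors over three actions with entries i / steps."""
    for i, j in product(range(steps + 1), repeat=2):
        if i + j <= steps:
            yield (Fraction(i, steps), Fraction(j, steps),
                   Fraction(steps - i - j, steps))

best_sigma = max(simplex_grid(12), key=worst_case_value)
best_value = worst_case_value(best_sigma)
```

On this grid the search returns the uniform strategy with value 0. For larger games a proper linear programming solver would be the appropriate tool.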
3.5 Decision Problems with Observations
So far we have only examined problems where the outcomes were drawn from some
fixed distribution. This distribution constituted our subjective belief about what the
unknown parameter is. Now, we examine the case where we can obtain some obser-
vations that depend on the unknown ω before we make our decision, cf. Fig. 3.6.
These observations should give us more information about ω before making a deci-
sion. Intuitively, we should be able to make decisions by simply considering the
posterior distribution.
In this setting, we once more aim to take some decision a ∈ A so as to maximize
expected utility. As before, we have a prior distribution ξ on some parameter ω ∈ Ω,
representing what we know about ω. Consequently, the expected utility of any fixed
decision a is going to be Eξ (U | a).
However, now we may obtain more information about ω before making a final
decision. In particular, each $\omega$ corresponds to a model of the world $P_\omega$, which is
a probability distribution over some observation space $S$, such that $P_\omega(X)$ is the
probability that the observation is in $X \subset S$. The set of parameters $\Omega$ thus defines a
family of models
\[ \mathcal{P} \triangleq \{P_\omega \mid \omega \in \Omega\}. \]
Now, consider the case where we take an observation $x$ from the true model $P_{\omega^*}$
before making a decision. We can represent the dependency of our decision on the
observation by making our decision a function of x.
This is the standard Bayesian framework for decision making. It may be slightly
more intuitive in some cases to use the notation $\psi(x \mid \omega)$, in order to emphasize that
this is a conditional distribution. However, there is no technical difference between
the two notations.
When the set of policies includes all constant policies, then there is a policy π ∗
at least as good as the best fixed decision a ∗ . This is formalized in the following
remark.
Proof The proof follows by setting $\Pi_0$ to be the set of constant policies. The result
follows since $\Pi_0 \subset \Pi$.
We conclude this section with a simple example about deciding whether or not to
go to a restaurant, given some expert opinions.
2 For that reason, policies are also sometimes called decision functions or decision rules in the
literature.
3 We obtain a different probability of observations under the binomial model, but the resulting
We wish to construct the Bayes decision rule, that is, the policy with maximal
$\xi$-expected utility. However, doing so by examining all possible policies is cumbersome,
because (usually) there are many more policies than decisions. It is, however,
easy to find the Bayes decision for each possible observation. This is because it is
usually possible to rewrite the expected utility of a policy in terms of the posterior
distribution. While this is trivial to do when the outcome and observation spaces are
finite, it can be extended to the general case as shown in the following theorem.
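For finite outcome and observation spaces, the claim can be checked by brute force. The sketch below reuses the umbrella utilities of Table 3.1 together with a hypothetical observation model (a "cloudy" or "clear" sky whose probabilities depend on whether it will rain). It then compares the policy obtained by maximizing posterior expected utility separately for each observation with the best policy found by enumerating all policies. All names and the likelihood values are assumptions.

```python
from itertools import product

OBSERVATIONS = ["cloudy", "clear"]
ACTIONS = ["umbrella", "risk_it"]

prior = {"rain": 0.2, "no_rain": 0.8}
# Hypothetical observation model P_omega(x).
likelihood = {
    "rain":    {"cloudy": 0.9, "clear": 0.1},
    "no_rain": {"cloudy": 0.3, "clear": 0.7},
}
U = {
    "rain":    {"umbrella": -1.0, "risk_it": -10.0},
    "no_rain": {"umbrella": -1.0, "risk_it": 0.0},
}

def policy_utility(policy):
    """Expected utility of a policy (a map from observations to actions)."""
    return sum(prior[w] * likelihood[w][x] * U[w][policy[x]]
               for w in prior for x in OBSERVATIONS)

def posterior(x):
    """Posterior over outcomes given observation x, by Bayes' theorem."""
    joint = {w: prior[w] * likelihood[w][x] for w in prior}
    total = sum(joint.values())
    return {w: j / total for w, j in joint.items()}

def bayes_decision(x):
    """Decision maximizing posterior expected utility for observation x."""
    post = posterior(x)
    return max(ACTIONS, key=lambda a: sum(post[w] * U[w][a] for w in post))

pointwise_policy = {x: bayes_decision(x) for x in OBSERVATIONS}

# Enumerate all |A|^|S| deterministic policies and pick the best one.
all_policies = [dict(zip(OBSERVATIONS, choice))
                for choice in product(ACTIONS, repeat=len(OBSERVATIONS))]
best_policy = max(all_policies, key=policy_utility)
```

Here only four policies exist, but the number of policies grows exponentially with the number of observations, while the pointwise construction needs one maximization per observation.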
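Proof To prove this when $U$ is non-negative, we shall use Tonelli's theorem. First
we need to construct an appropriate product measure. Let $p(x \mid \omega) \triangleq \frac{dP_\omega(x)}{d\nu(x)}$ be the
Radon-Nikodym derivative of $P_\omega$ with respect to some dominating measure $\nu$ on $S$.
Similarly, let $p(\omega) \triangleq \frac{d\xi(\omega)}{d\mu(\omega)}$ be the corresponding derivative for $\xi$. Now, the utility
can be written as
\begin{align*}
U(\xi, \pi) &= \int_{\Omega} \int_{S} U[\omega, \pi(x)] \, p(x \mid \omega) \, p(\omega) \, d\nu(x) \, d\mu(\omega) \\
&= \int_{\Omega} \int_{S} h(\omega, x) \, d\nu(x) \, d\mu(\omega)
\end{align*}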
Definition 3.5.2 (Prior distribution) The distribution ξ is called the prior distribu-
tion of ω.
In the simple form of the problem, we are already given a classifier P that can calculate
probabilities P(yt | xt ), and we simply must decide upon some class at ∈ Y, so as to
maximize a specific utility function. One standard utility function is the prediction
accuracy
$$U_t \triangleq \mathbb{I}\{y_t = a_t\}.$$
The probability P(yt | xt ) is the posterior probability of the class given the observation
xt . If we wish to maximize expected utility, we can simply choose
$$a_t \in \arg\max_{a \in \mathcal{Y}} P(y_t = a \mid x_t).$$
This defines a particular, simple policy. In fact, for two-class problems with Y =
{0, 1}, such a rule can often be visualized as a decision boundary in X , on one side of
which we decide for class 0 and on the other side for class 1.
In the general form of the problem, we are given a training data set S =
{(x1 , y1 ), . . . , (xn , yn )}, a set of classification models {Pω | ω ∈ Ω}, and a prior distri-
bution ξ on Ω. For each model, we can easily calculate Pω (y1 , . . . , yn | x1 , . . . , xn ).
Consequently, we can calculate the posterior distribution
$$\xi(\omega \mid S) = \frac{P_\omega(y_1, \ldots, y_n \mid x_1, \ldots, x_n)\, \xi(\omega)}{\sum_{\omega' \in \Omega} P_{\omega'}(y_1, \ldots, y_n \mid x_1, \ldots, x_n)\, \xi(\omega')}.$$
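As a rough illustration of this posterior calculation, the following Python sketch evaluates ξ(ω | S) for a finite model set. The logistic models, their slopes, the toy data and the assumption that labels are independent given the inputs are all hypothetical choices made for this example; they are not taken from the text.

```python
import math

def logistic_model(w):
    """Hypothetical model P_w(y | x): logistic regression with fixed slope w."""
    def prob(y, x):
        p1 = 1.0 / (1.0 + math.exp(-w * x))  # P_w(y = 1 | x)
        return p1 if y == 1 else 1.0 - p1
    return prob

def posterior_over_models(models, prior, xs, ys):
    """Return the list of posterior probabilities xi(omega | S)."""
    log_joint = []
    for prob, xi in zip(models, prior):
        # log P_omega(y_1..y_n | x_1..x_n), assuming independent labels
        log_lik = sum(math.log(prob(y, x)) for x, y in zip(xs, ys))
        log_joint.append(log_lik + math.log(xi))
    # Subtract the maximum before exponentiating to avoid underflow.
    peak = max(log_joint)
    weights = [math.exp(v - peak) for v in log_joint]
    z = sum(weights)  # plays the role of the denominator in the formula
    return [w / z for w in weights]

slopes = [-1.0, 0.0, 1.0]           # hypothetical parameter set Omega
models = [logistic_model(w) for w in slopes]
prior = [1 / 3, 1 / 3, 1 / 3]       # uniform prior xi
xs = [-2.0, -1.0, 1.0, 2.0]         # toy observations
ys = [0, 0, 1, 1]                   # toy labels
posterior = posterior_over_models(models, prior, xs, ys)
```

Working in log space is a design choice for numerical stability; for short data sets the product of likelihoods could be used directly.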
In some cases, we are restricted to functionally simple policies, which do not contain
any Bayes rules as defined above. For example, we might be limited to linear functions
of x. Let π : X → Y be such a rule and let Π be the set of allowed policies. Given
a family of models and a set of training data, we wish to calculate the policy that
maximizes our expected utility. For a given ω, we can indeed compute
$$U(\omega, \pi) = \sum_{x, y} U(y, \pi(x))\, P_\omega(y \mid x)\, P_\omega(x),$$
Any policy, when applied to large-scale, real world problems, has certain externali-
ties. This implies that considering only the decision maker’s utility is not sufficient.
One such issue is fairness.
This concerns desirable properties of policies applied to a population of individ-
uals. For example, college admissions should be decided on variables that inform
us about individual merit, but fairness may also require taking into account the fact
that certain communities are inherently disadvantaged. At the same time, a person
should not feel that someone else in a similar situation obtained an unfair advantage.
All this must be taken into account while still caring about optimizing the decision
maker’s utility function. As another example, consider mortgage decisions: while
lenders should take into account the creditworthiness of individuals in order to make
a profit, society must ensure that they do not unduly discriminate against socially
vulnerable groups.
Recent work in fairness for statistical decision making in the classification setting
has considered two main notions of fairness. The first uses (conditional) indepen-
dence constraints between a sensitive variable (such as ethnicity) and other variables
(such as decisions made). The second type ensures that decisions are meritocratic,
so that better individuals are favoured. Here smoothness4 can be used to assure that
similar people are treated similarly, which helps to avoid cronyism. While a thor-
ough discussion of fairness is beyond the scope of this book, it is useful to note that
some of these concepts are impossible to strictly achieve simultaneously, but may be
approximately satisfied by careful design of the policy. The recent work by [2–7]
and [8] goes much more deeply into this topic.
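Posterior recursion
$$\xi_n(\omega) \triangleq \xi_0(\omega \mid x^n) = \frac{P_\omega(x^n)\, \xi_0(\omega)}{P_{\xi_0}(x^n)}
= \xi_{n-1}(\omega \mid x_n) = \frac{P_\omega(x_n)\, \xi_{n-1}(\omega)}{P_{\xi_{n-1}}(x_n)},$$
where ξt is the belief at time t. Here
$P_{\xi_n}(\cdot \mid \cdot) = \int_\Omega P_\omega(\cdot \mid \cdot)\, d\xi_n(\omega)$ is the
marginal distribution with respect to the n-th posterior.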
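The recursion can be checked numerically on a small example. The sketch below assumes a finite parameter grid, a uniform initial belief and Bernoulli observations, all chosen only for illustration, and compares the belief obtained by updating one observation at a time with the one obtained from the whole sequence at once.

```python
def bayes_update(belief, likelihood):
    """One step xi_{n-1} -> xi_n on a finite parameter set."""
    joint = [b * l for b, l in zip(belief, likelihood)]
    marginal = sum(joint)  # the normalising marginal probability
    return [j / marginal for j in joint]

omegas = [i / 10 for i in range(1, 10)]    # hypothetical grid 0.1, ..., 0.9
prior = [1 / len(omegas)] * len(omegas)    # uniform xi_0
data = [1, 0, 1, 1, 0, 1]                  # toy Bernoulli observations

def bernoulli_lik(x):
    """P_omega(x) for every omega on the grid."""
    return [w if x == 1 else 1 - w for w in omegas]

# Sequential: fold in one observation at a time.
sequential = prior
for x in data:
    sequential = bayes_update(sequential, bernoulli_lik(x))

# Batch: use the likelihood of the whole sequence at once.
s = sum(data)
batch_lik = [w ** s * (1 - w) ** (len(data) - s) for w in omegas]
batch = bayes_update(prior, batch_lik)

max_gap = max(abs(a - b) for a, b in zip(sequential, batch))
```

Up to floating-point rounding, `sequential` and `batch` coincide, which is what the recursion states.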
3.6 Summary
3.7 Exercises
The first part of the exercises considers problems where we are simply given some
distribution over Ω. In the second part, the distribution is a posterior distribution that
depends on observations x.
For the following exercises, we consider a set of worlds Ω and a decision set A, as
well as the following utility function U : Ω × A → R:
Exercise 3.7.1 Assume ω is drawn from ξ with ξ(ω) = 1/11 for all ω ∈ Ω. Calculate
and plot the expected utility $U(\xi, a) = \sum_\omega \xi(\omega) U(\omega, a)$ for each a. Report
$\max_a U(\xi, a)$.
(a) Calculate and plot the expected utility when π(a) = 1/11 for all a, reporting
values for all ω.
(b) Find
$$\max_\pi \min_\xi U(\xi, \pi).$$
Hint: Use the linear programming formulation, adding a constant to the utility
matrix U so that all elements are non-negative.
Exercise 3.7.4 Consider the definition of rules that, for some ε > 0, select a maximizing
$$P\left\{\omega \;\middle|\; U(\omega, a) > \sup_{d' \in A} U(\omega, d') - \epsilon\right\}.$$
Prove that this is indeed a statistical decision problem, i.e., it corresponds to maxi-
mizing the expectation of some utility function.
For the following exercises we consider a set of worlds Ω and a decision set A, as
well as the following utility function U : Ω × A → R:
$$\mathcal{F} \triangleq \{ f_\omega \mid \omega \in \Omega \},$$
such that f ω is the binomial probability mass function with parameters ω. Consider
the parameter set
Ω = {0, 0.1, . . . , 0.9, 1} .
Let ξ be the uniform distribution on Ω, such that ξ(ω) = 1/11 for all ω ∈ Ω. Further,
let the decision set be A = [0, 1].
Exercise 3.7.5 What is the decision a ∗ maximizing $U(\xi, a) = \sum_\omega \xi(\omega) U(\omega, a)$
and what is U (ξ, a ∗ )?
Exercise 3.7.6 In the same setting, we now observe the sequence x = (x1 , x2 , x3 ) =
(1, 0, 1).
1. Plot the posterior distribution ξ(ω | x) and compare it to the posterior we would
obtain if our prior on ω was ξ = Beta(2, 2).
2. Find the decision a ∗ maximizing the a posteriori expected utility
$$\mathbb{E}_\xi(U \mid a, x) = \sum_\omega U(\omega, a)\, \xi(\omega \mid x).$$
where $\phi(x) = \sum_\omega f_\omega(x)\, \xi(\omega)$ is the prior marginal distribution of x and
δ ∗ : S → A is the Bayes-optimal decision rule.
Hint: You can simplify the computational complexity somewhat, since you only
need to calculate the probability of $\sum_t x_t$. This is not necessary to solve the
problem though.
Exercise 3.7.7 In the same setting, we consider nature to be adversarial. Once more,
we observe x = (1, 0, 1). Assume that nature can choose a prior among a set of priors
Ξ = {ξ1 , ξ2 }. Let ξ1 (ω) = 1/11 and ξ2 (ω) = ω/5.5 for each ω.
$$\min_{\xi \in \Xi} \mathbb{E}_\xi(U \mid a, x).$$
Hint: Apart from the adversarial prior selection, this is very similar to the previous
exercise.
We make the simplifying assumption that the utility function is the same for all
customers and has the following form:
$$V(x) = \begin{cases} \ln x, & x \ge 1 \\ 1 - (x - 2)^2, & \text{otherwise.} \end{cases}$$
Customers who are not interested in the insurance product will not buy it, no matter
what the price.
There is some unknown probability distribution Pω (x) over the income level,
such that the probability of n people having incomes $x^n = (x_1, \ldots, x_n)$ is
$P_\omega(x^n) = \prod_{i=1}^n P_\omega(x_i)$. We have two data sources for this. The first is a
model of the general population ω1 not working in the high-tech industry, and the second
is a model of employees in the high-tech industry, ω2 . The models are summarized in
Table 3.6 below.
Together, these two models form a family of distributions P = {Pω | ω ∈ Ω} with
Ω = {ω1 , ω2 }.
The goal is to find a premium d that maximizes the expected utility of the insurance
company. We assume that the company is liquid enough that utility is linear. In
the following, we consider four different cases. For simplicity, you can let A = S
throughout this exercise.
Exercise 3.7.8 Show that the expected utility for a given ω is the expected gain
from a buying customer times the probability that an interested customer will have
an income x such that she would buy our insurance, that is,
$$U(\omega, d) = (d - h) \sum_{x \in S} P_\omega(x)\, \mathbb{I}\{V(x - d) > \epsilon V(x - h) + (1 - \epsilon) V(x)\}.$$
Let h = 150 and ε = 10^{−3}. Plot the expected utility for the two possible ω for
varying d. What is the optimal price level if the incomes of all interested customers
are distributed according to ω1 ? What is the optimal price level if they are distributed
according to ω2 ?
Exercise 3.7.9 According to our intuition, customers interested in our product are
much more likely to come from the high-tech industry than from the general pop-
ulation. For that reason, we have a prior probability ξ(ω1 ) = 1/4 and ξ(ω2 ) = 3/4
over the parameters Ω of the family P. More specifically, and in keeping with our
previous assumptions, we formulate the following model:
$$\omega^* \sim \xi, \qquad x^T \mid \omega^* = \omega \sim P_\omega.$$
That is, the data is drawn from one unknown model ω ∗ ∈ Ω. This can be thought
of as an experiment where nature randomly selects ω ∗ with probability ξ and then
generates the data from the corresponding model Pω∗ . Plot the expected utility under
this prior as the premium d varies. What is the optimal expected utility and premium?
Exercise 3.7.10 Instead of fully relying on our prior, the company decides to perform
a random survey of 1000 people who are asked whether they would be interested in
the insurance product (as long as the price is low enough). If interested, they are also
asked about their income level. Assume that only 126 people were interested, with
income levels as given in Table 3.7. Each row column of the table shows the stated
income and the number of people reporting it.
Let $x^n = \{x_1, x_2, \ldots, x_n\}$ be the set of data collected. Assuming that the
responses are truthful, calculate the posterior probability $\xi(\omega \mid x^n)$, assuming that
the only possible models of income distribution are the two models ω1 , ω2 used in
the previous exercises. Plot the expected utility under the posterior distribution as d
varies. What is the maximum expected utility that can be obtained?
Exercise 3.7.11 Having only two possible models is somewhat limiting, especially
since neither of them might correspond to the income distribution of people interested
in our insurance product. How could this problem be rectified? Describe your idea
and implement it. When would you expect this to work better?
Many patients arriving at an emergency room suffer from chest pain. This may indi-
cate acute coronary syndrome (ACS). Patients suffering from ACS that go untreated
may die with probability5 2% in the next few days. Successful diagnosis lowers the
short-term mortality rate to 0.2%. Consequently, a prompt diagnosis is essential.
Statistics of patients Approximately 50% of patients presenting with chest pain turn
out to suffer from ACS (either acute myocardial infarction or unstable angina pectoris).
Approximately 10% suffer from lung cancer. Of ACS sufferers in general, 2/3 are smokers
and 1/3 non-smokers. Only 1/4 of non-ACS sufferers are smokers. In addition, 90%
of lung cancer patients are smokers. Only 1/4 of non-cancer patients are smokers.
Assumption 3.7.1 A patient may suffer from none, either or both conditions.
Assumption 3.7.2 When the smoking history of the patient is known, the develop-
ment of cancer or ACS are independent.
Tests One can perform an ECG to test for ACS. An ECG test has sensitivity of 66.6%
(i.e., it correctly detects 2/3 of all patients that suffer from ACS), and a specificity of
75% (i.e., 1/4 of patients that do not have ACS, still test positive). An X-ray can
diagnose lung cancer with a sensitivity of 90% and a specificity of 90%.
Assumption 3.7.3 Repeated applications of a test produce the same result for the
same patient, i.e., that randomness is only due to patient variability.
5 The following figures are not really accurate, as they are liberally adapted from different studies.
Assumption 3.7.4 The existence of lung cancer does not affect the probability that
the ECG will be positive. Conversely, the existence of ACS does not affect the
probability that the X-ray will be positive.
Exercise 3.7.12 Now consider the case where you have the choice which tests to
perform. First, you observe S, i.e., whether or not the patient is a smoker. Then
you select a test d1 ∈ {X-ray, ECG} to make. Finally, you decide whether or not to
treat for ACS, that is, you choose d2 ∈ {heart treatment, no treatment}.
An untreated ACS patient may die with probability 2%, while a treated one with
probability 0.2%. Treating a non-ACS patient results in death with probability 0.1%.
1. Draw a decision diagram, where:
• S is an observed random variable taking values in {0, 1}.
• A is a hidden variable taking values in {0, 1}.
• C is a hidden variable taking values in {0, 1}.
• d1 is a choice variable, taking values in {X-ray, ECG}.
• r1 is a result variable, taking values in {0, 1}, corresponding to negative and
positive test results.
• d2 is a choice variable, which depends on the test result of d1 and on S.
• r2 is a result variable, taking values in {0, 1} corresponding to the patient dying
(0), or living (1).
2. Let d1 = X-ray, and assume the patient suffers from ACS, i.e., A = 1. How is
the posterior distributed?
3. What is the optimal decision rule for this problem?
References
6. Kleinberg, J., Mullainathan, S., Raghavan, M.: Inherent trade-offs in the fair determination of
risk scores. Technical Report (2016). arXiv:1609.05807
7. Kilbertus, N., Rojas-Carulla, M., Parascandolo, G., Hardt, M., Janzing, D., Schölkopf, B.: Avoid-
ing discrimination through causal reasoning. Technical Report (2017). arXiv:1706.02744
8. Dimitrakakis, C., Liu, Y., Parkes, D., Radanovic, G.: Subjective fairness: fairness is in the eye
of the beholder. Technical Report (2017). arXiv:1706.00119
Chapter 4
Estimation
4.1 Introduction
In the previous chapter, we have seen how to make optimal decisions with respect
to a given utility function and belief. One important question is how to compute
an updated belief from observations and a prior belief. More generally, we wish to
examine how much information we can obtain about an unknown parameter from
observations, and how to bound the respective estimation error. While most of this
chapter will focus on the Bayesian framework for estimating parameters, we shall also
look at tools for making conclusions about the value of parameters without making
specific assumptions about the data distribution, i.e., without providing specific prior
information.
In the Bayesian setting, we calculate posterior distributions of parameters given
data. The basic problem can be stated as follows. Let P ≜ {Pω | ω ∈ Ω} be a family
of probability measures on (S, FS ) and ξ be our prior probability measure on
(Ω, FΩ ). Given some data x ∼ Pω∗ , with ω ∗ ∈ Ω, how can we estimate ω ∗ ? The
Bayesian approach is to estimate the posterior distribution ξ(· | x), instead of guessing
a single ω ∗ . In general, the posterior measure is a function ξ(· | x) : FΩ → [0, 1],
with
$$\xi(B \mid x) = \frac{\int_B P_\omega(x)\, d\xi(\omega)}{\int_\Omega P_\omega(x)\, d\xi(\omega)}. \tag{4.1.1}$$
The posterior distribution allows us to quantify our uncertainty about the unknown ω ∗ .
This in turn enables us to take decisions that take uncertainty into account.
The first question we are concerned with in this chapter is how to calculate this
posterior for any value of x in practice. If x is a complex object, this may be com-
putationally difficult. In fact, the posterior distribution can also be a complicated
function. However, there exist distribution families and priors such that this calcula-
tion is very easy, in the sense that the functional form of the posterior depends upon a
small number of parameters. This happens when a summary of the data that contains
all necessary information can be calculated easily. Formally, this is captured via the
concept of a sufficient statistic.
Sometimes we want to summarize the data we have observed. This can happen when
the data is a long sequence of simple observations x n = (x1 , . . . , xn ). It may also
be useful to do so when we have a single observation x, such as a high-resolution
image. For some applications, it may be sufficient to only calculate a really simple
function of the data, such as the sample mean.
Definition 4.2.1 (Sample mean) The sample mean x̄n : Rn → R of a sequence
x n = (x1 , . . . , xn ) with xt ∈ R is defined as
$$\bar{x}_n \triangleq \frac{1}{n} \sum_{t=1}^n x_t.$$
The mean is an example of what is called a statistic, that is, a function of the
observations to some vector space. In the following, we are interested in statistics
that can replace the complete original data in our calculations without losing any
information. Such statistics are called sufficient.
P = {Pω | ω ∈ Ω} .
1 There is an alternative definition, which replaces equality of posterior distributions with point-wise
equality on the family members, i.e., Pω (x) = Pω (x′) for all ω. This is a stronger definition, as it
implies the Bayesian one we use here.
2 Typically Z ⊂ Rk for finite-dimensional statistics.
4.2 Sufficient Statistics 61
Proof In the following proof we assume arbitrary Ω. The case when Ω is finite is
technically simpler and is left as an exercise. Let us first assume the existence of u, v
satisfying the equation. Then for any B ∈ FΩ we have
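$$\xi(B \mid x) = \frac{\int_B u(x)\, v[\phi(x), \omega]\, d\xi(\omega)}{\int_\Omega u(x)\, v[\phi(x), \omega]\, d\xi(\omega)}
= \frac{\int_B v[\phi(x), \omega]\, d\xi(\omega)}{\int_\Omega v[\phi(x), \omega]\, d\xi(\omega)}.$$
For x, x′ with φ(x) = φ(x′), it follows that ξ(B | x) = ξ(B | x′), so ξ(· | x) = ξ(· | x′)
and φ satisfies the definition of a sufficient statistic.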
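Conversely, assume that φ is a sufficient statistic. Let μ be a dominating measure
on Ω so that we can define the densities $p(\omega) \triangleq \frac{d\xi(\omega)}{d\mu(\omega)}$ and
$$p(\omega \mid x) \triangleq \frac{d\xi(\omega \mid x)}{d\mu(\omega)} = \frac{P_\omega(x)\, p(\omega)}{\int_\Omega P_\omega(x)\, d\xi(\omega)},$$
whence
$$P_\omega(x) = \frac{p(\omega \mid x)}{p(\omega)} \int_\Omega P_\omega(x)\, d\xi(\omega).$$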
In the factorization of Theorem 4.2.1, u is the only factor that depends directly on x.
Interestingly, it does not appear in the posterior calculation at all. So, the posterior
only depends on x through the statistic.
$$P_\omega(x^n) = \prod_{t=1}^n P_\omega(x_t) = \omega^{s_n} (1 - \omega)^{n - s_n},$$
where $s_n = \sum_{t=1}^n x_t$ is the number of times 1 has been observed until time n. Here
the statistic $\phi(x^n) = s_n$ satisfies (4.2.1) with u(x) = 1, while $P_\omega(x^n)$ only depends
on the data through the statistic $s_n = \phi(x^n)$.
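To make this concrete, the short sketch below compares the Bernoulli likelihood of two different sequences that share the same number of ones. The sequences and the value of ω are arbitrary illustrative choices.

```python
def bernoulli_likelihood(xs, omega):
    """P_omega(x^n) computed directly as a product over observations."""
    p = 1.0
    for x in xs:
        p *= omega if x == 1 else 1.0 - omega
    return p

def likelihood_from_statistic(s_n, n, omega):
    """The same quantity computed from the statistic s_n alone."""
    return omega ** s_n * (1.0 - omega) ** (n - s_n)

x_a = [1, 1, 0, 0, 1]   # two toy sequences with the same number of ones
x_b = [0, 1, 1, 1, 0]
omega = 0.3
direct_a = bernoulli_likelihood(x_a, omega)
direct_b = bernoulli_likelihood(x_b, omega)
via_stat = likelihood_from_statistic(sum(x_a), len(x_a), omega)
```

All three values agree up to rounding, so any posterior computed from either sequence is the same.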
Another example is when we have a finite set of models. Then the sufficient
statistic is always a finite-dimensional vector.
Lemma 4.2.1 Let P = {Pθ | θ ∈ Θ} be a family, where each model Pθ is a probability
measure on X and Θ contains n models. If $p \in \Delta^n$ is a vector representing
our prior distribution, i.e., ξ(θ) = pθ , then the finite-dimensional vector with entries
qθ = pθ Pθ (x) is a sufficient statistic.
From the proof it is clear that also the vector with entries
$w_\theta = q_\theta / \sum_{\theta'} q_{\theta'}$ is a sufficient statistic.
Even when dealing with an infinite set of models, the posterior distributions can in
some cases be computed efficiently. Many well-known distributions such as the
Gaussian, Bernoulli and Dirichlet distributions are members of the exponential family
of distributions. All those distributions are factorizable in the manner shown below,
while at the same time they have fixed-dimension sufficient statistics.
4.3 Conjugate Priors 63
While conjugate families exist for statistics with unbounded dimension, here we shall
focus on finite-dimensional families. We will start with the simplest example, the
Bernoulli-Beta pair.
The Bernoulli distribution is a discrete distribution that is ideal for modelling indepen-
dent random trials with just two different outcomes (typically ‘success’ and ‘failure’)
and a fixed probability of success.
Definition 4.3.1 (Bernoulli distribution) The Bernoulli distribution is discrete with
outcomes S = {0, 1}, a parameter ω ∈ [0, 1], and probability function
$$P_\omega(x = u) = \omega^u (1 - \omega)^{1 - u} = \begin{cases} \omega, & \text{if } u = 1 \\ 1 - \omega, & \text{if } u = 0. \end{cases}$$
Now we are ready to define the binomial distribution, which is a scaled product-Bernoulli
distribution for multiple independent outcomes where we want to measure
the probability of a particular number of successes or failures. Thus, the Bernoulli is a
distribution on a sequence of outcomes, while the binomial is a distribution on the
total number of successes. That is, let $s = \sum_{t=1}^n x_t$ be the total number of successes
observed up to time n. Then we are interested in the probability that there are exactly
k successes out of n trials.
Definition 4.3.2 (Binomial distribution) The binomial distribution with parameters
ω and n has outcomes S = {0, 1, . . . , n}. Its probability function is given by
$$P_\omega(s = k) = \binom{n}{k} \omega^k (1 - \omega)^{n - k}.$$
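As a quick sanity check of this probability function, the following sketch evaluates it for all k and verifies that the values sum to one and have mean nω. The particular values of n and ω are arbitrary.

```python
import math

def binomial_pmf(k, n, omega):
    """Probability of exactly k successes in n independent Bernoulli trials."""
    return math.comb(n, k) * omega ** k * (1.0 - omega) ** (n - k)

n, omega = 10, 0.3                       # illustrative parameter values
pmf = [binomial_pmf(k, n, omega) for k in range(n + 1)]
total = sum(pmf)                         # should be 1 up to rounding
mean = sum(k * p for k, p in enumerate(pmf))  # should be n * omega
```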
$$P(x^n) = \sum_{\omega \in \Omega} P(x^n \mid \omega) P(\omega)
= P(\omega_1) \prod_{t=1}^n P(x_t \mid \omega_1) + P(\omega_2) \prod_{t=1}^n P(x_t \mid \omega_2),$$
which in general is different from $\prod_{t=1}^n P(x_t)$. For the general case where Ω = [0, 1]
the question is whether there is a prior distribution that can succinctly describe our
uncertainty about the parameter. Indeed, there is, and it is called the Beta distribution.
It is defined on the interval [0, 1] and has two parameters that determine the density
of the observations. Because the Bernoulli distribution has a parameter in [0, 1], the
outcomes of the Beta can be used to specify a prior on the parameters of the Bernoulli
distribution.
Definition 4.3.3 (Beta distribution) The Beta distribution has outcomes ω ∈ Ω =
[0, 1] and parameters α0 , α1 > 0, which we will summarize in a vector α = (α1 , α0 ).
Its probability density function is given by
$$p(\omega \mid \alpha) = \frac{\Gamma(\alpha_0 + \alpha_1)}{\Gamma(\alpha_0)\Gamma(\alpha_1)}\, \omega^{\alpha_1 - 1} (1 - \omega)^{\alpha_0 - 1}, \tag{4.3.1}$$
where $\Gamma(\alpha) = \int_0^\infty u^{\alpha - 1} e^{-u}\, du$ is the Gamma function. If ω is distributed according
to a Beta distribution with parameters α1 , α0 , we write ω ∼ Beta(α1 , α0 ).
We note that the Gamma function is an extension of the factorial function n!, so that for
n ∈ N it holds that Γ (n + 1) = n!. That way, the first term in (4.3.1) corresponds to
a generalized binomial coefficient. The dependencies between the parameters are
shown in the graphical model of Fig. 4.2. A Beta distribution with parameter α has
expectation
$$\mathbb{E}(\omega \mid \alpha) = \frac{\alpha_1}{\alpha_0 + \alpha_1}$$
and variance
$$\mathbb{V}(\omega \mid \alpha) = \frac{\alpha_1 \alpha_0}{(\alpha_1 + \alpha_0)^2 (\alpha_1 + \alpha_0 + 1)}.$$
[Figure: Beta density p(ω | α) plotted against ω ∈ [0, 1]]
Figure 4.3 shows the density of a Beta distribution for four different parameter
vectors. When α0 = α1 = 1, the distribution is equivalent to a uniform one.
As already indicated, we can encode our uncertainty about the unknown but fixed
parameter ω ∈ [0, 1] of a Bernoulli distribution using a Beta distribution. We start
with a Beta distribution with parameter α that defines our prior ξ0 via
$\xi_0(B) \triangleq \int_B p(\omega \mid \alpha)\, d\omega$. Then the posterior probability is given by
$$p(\omega \mid x^n, \alpha) = \frac{\prod_{t=1}^n P_\omega(x_t)\, p(\omega \mid \alpha)}{\int_\Omega \prod_{t=1}^n P_\omega(x_t)\, p(\omega \mid \alpha)\, d\omega}
\propto \omega^{s_{n,1} + \alpha_1 - 1} (1 - \omega)^{s_{n,0} + \alpha_0 - 1},$$
where $s_{n,1} = \sum_{t=1}^n x_t$ and $s_{n,0} = n - s_{n,1}$ are the total number of 1s and 0s,
respectively. As can be seen, this again has the form of a Beta distribution (Fig. 4.4).
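Beta-Bernoulli model
Let ω be drawn from a Beta distribution with parameters α1 , α0 , and x n =
(x1 , . . . , xn ) be a sample drawn independently from a Bernoulli distribution
with parameter ω, i.e.,
$$\omega \sim \mathrm{Beta}(\alpha_1, \alpha_0), \qquad x^n \mid \omega \sim \mathrm{Bernoulli}^n(\omega).$$
Then the posterior distribution of ω given the sample is also Beta, that is,
$$\omega \mid x^n \sim \mathrm{Beta}(\alpha_1', \alpha_0') \quad \text{with} \quad
\alpha_1' = \alpha_1 + \sum_{t=1}^n x_t, \quad \alpha_0' = \alpha_0 + n - \sum_{t=1}^n x_t.$$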
Example 4.2 The parameter ω ∈ [0, 1] of a randomly selected coin can be modelled
as a Beta distribution peaking around 1/2. Usually one assumes that coins are fair.
However, not all coins are exactly the same. Thus, it is possible that each coin
deviates slightly from fairness. We can use a Beta distribution to model how likely
(we think) different values ω of coin parameters are.
To demonstrate how belief changes, we perform the following simple experiment.
We repeatedly toss a coin and wish to form an accurate belief about how biased
the coin is, under the assumption that the outcomes are Bernoulli with a fixed
parameter ω. Our initial belief, ξ0 , is modelled as a Beta distribution on the parameter
space Ω = [0, 1], with parameters α0 = α1 = 100. This places a strong prior on the
coin being close to fair. However, we still allow for the possibility that the coin is
biased.
Figure 4.5 shows a sequence of beliefs at times 0, 10, 100, 1000 respectively, from
a coin with bias ω = 0.6. Due to the strength of our prior, after 10 observations, the
situation has not changed much and the belief ξ10 is very close to the initial one.
However, after 100 observations our belief has now shifted towards 0.6, the true bias
of the coin. After a total of 1000 observations, our belief is centered very close to 0.6,
and is now much more concentrated, reflecting the fact that we are almost certain
about the value of ω.
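The experiment can be reproduced approximately with a few lines of Python. The sketch below is only a simulation under the stated setup (prior parameters α0 = α1 = 100, true bias 0.6); the random seed is an arbitrary choice, so the exact numbers will differ from those behind the figure.

```python
import random

def beta_mean_std(a1, a0):
    """Mean and standard deviation of a Beta(a1, a0) belief."""
    mean = a1 / (a1 + a0)
    var = a1 * a0 / ((a1 + a0) ** 2 * (a1 + a0 + 1))
    return mean, var ** 0.5

random.seed(0)                  # fixed seed, chosen arbitrarily
true_omega = 0.6                # bias of the simulated coin
a1, a0 = 100.0, 100.0           # prior Beta(alpha_1, alpha_0)
checkpoints = {10, 100, 1000}
summary = {0: beta_mean_std(a1, a0)}
for t in range(1, 1001):
    x = 1 if random.random() < true_omega else 0
    a1 += x                     # one more observed 1
    a0 += 1 - x                 # one more observed 0
    if t in checkpoints:
        summary[t] = beta_mean_std(a1, a0)

for t in sorted(summary):
    mean, std = summary[t]
    print(f"t={t:4d}  mean={mean:.3f}  std={std:.3f}")
```

The printed standard deviation shrinks as observations accumulate, while the mean moves away from 1/2 towards the bias of the simulated coin.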
[Fig. 4.5: beliefs ξt over ω ∈ [0, 1] at times t = 0, 10, 100, 1000]
The well-known normal distribution is also endowed with suitable conjugate priors.
We first give the definition of the normal distribution, then consider the cases where
we wish to estimate its mean, its variance, or both at the same time.
Definition 4.3.4 (Normal distribution) The normal distribution is a continuous dis-
tribution with outcomes in R. It has two parameters, the mean ω ∈ R and the variance
σ 2 ∈ R+ , or alternatively the precision r ∈ R+ , where σ 2 = r −1 . Its probability den-
sity function is given by
$$f(x \mid \omega, r) = \sqrt{\frac{r}{2\pi}}\, \exp\left(-\frac{r}{2}(x - \omega)^2\right).$$
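For n independent samples $x^n = (x_1, \ldots, x_n)$, the joint density is the product
$$f(x^n \mid \omega, r) = \prod_{t=1}^n f(x_t \mid \omega, r) = \left(\frac{r}{2\pi}\right)^{n/2} \exp\left(-\frac{r}{2} \sum_{t=1}^n (x_t - \omega)^2\right).$$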
The dependency graph in Fig. 4.6 shows the dependencies between the parame-
ters of a normal distribution and the observations xt . In this graph, only a single
sample xt is shown, and it is implied that all xt are independent of each other
given r, ω.
Transformations of normal samples. The importance of the normal distribution
stems from the fact that many actual distributions turn out to be approximately normal.
Another interesting property of the normal distribution concerns transformations
of normal samples. For example, if $x^n$ is drawn from a normal distribution with
mean ω and precision r , then $\sum_{t=1}^n x_t \sim N(n\omega, n r^{-1})$. Finally, if the samples $x_t$ are
drawn from the standard normal distribution, i.e., $x_t \sim N(0, 1)$, then $\sum_{t=1}^n x_t^2$ has
a χ²-distribution with n degrees of freedom (cf. also the discussion of the Gamma
distribution below).
The simplest normal estimation problem occurs when we only need to estimate
the mean and assume that the variance (or equivalently, the precision) is known.
For Bayesian estimation, it is convenient to assume that the mean ω is drawn from
another normal distribution with known mean. This gives a conjugate pair and thus
results in a posterior normal distribution for the mean as well (Fig. 4.7).
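$$x^n \mid \omega, r \sim N^n(\omega, r^{-1}), \qquad \omega \mid \tau \sim N(\mu, \tau^{-1}).$$
Then the posterior distribution of ω given the sample is also normal, that is,
$$\omega \mid x^n \sim N(\mu', \tau'^{-1}) \quad \text{with} \quad \mu' = \frac{\tau \mu + n r \bar{x}_n}{\tau'}, \quad \tau' = \tau + n r,$$
where $\bar{x}_n \triangleq \frac{1}{n} \sum_{t=1}^n x_t$.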
It can be seen that the updated estimate for the mean is shifted towards the empir-
ical mean x̄n , and the precision increases linearly with the number of samples.
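The update is simple enough to write down directly. The following sketch implements the two formulas for μ′ and τ′; the function name and the numerical values are illustrative assumptions.

```python
def normal_mean_posterior(mu, tau, r, xs):
    """Posterior N(mu', 1/tau') of the mean when the precision r is known."""
    n = len(xs)
    if n == 0:
        return mu, tau              # no data: the belief is unchanged
    x_bar = sum(xs) / n             # empirical mean
    tau_post = tau + n * r          # precision grows linearly with n
    mu_post = (tau * mu + n * r * x_bar) / tau_post
    return mu_post, tau_post

# Toy example: prior N(0, 1), known precision r = 2, three observations.
mu_post, tau_post = normal_mean_posterior(mu=0.0, tau=1.0, r=2.0,
                                          xs=[1.0, 2.0, 3.0])
```

Here the posterior mean 12/7 lies between the prior mean 0 and the empirical mean 2, weighted by the respective precisions.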
To model normal distributions with known mean, but unknown precision (or equiv-
alently, unknown variance), we first have to introduce the Gamma distribution that
we will use to represent our uncertainty about the precision.
The graphical model of a Gamma distribution is shown in Fig. 4.8. The parameters
α, β determine the shape and (inverse) scale of the distribution, respectively. This is
illustrated in Fig. 4.9, which also depicts some special cases that show that the Gamma
distribution is a generalization of a number of other standard distributions. For example,
for α = 1, β > 0 one obtains an exponential distribution with parameter β. Its
probability density function is
$$f(x \mid \beta) = \beta e^{-\beta x},$$
and like the Gamma distribution it has support in [0, ∞), i.e., x ≥ 0. For n ∈ N and
α = n/2, β = 1/2 one obtains a χ²-distribution with n degrees of freedom.
Now we return to our problem of estimating the precision of a normal distribution
with known mean, using the Gamma distribution to represent uncertainty about the
precision (Fig. 4.10).
Normal-Gamma model
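Let r be drawn from a Gamma distribution with parameters α, β, while x n is
a sample drawn independently from a normal distribution with mean ω and
precision r , i.e.,
$$r \mid \alpha, \beta \sim \mathrm{Gamma}(\alpha, \beta), \qquad x^n \mid r \sim N^n(\omega, r^{-1}).$$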
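Then the posterior distribution of r given the sample is also Gamma, that is,
$$r \mid x^n \sim \mathrm{Gamma}(\alpha', \beta') \quad \text{with} \quad \alpha' = \alpha + \frac{n}{2}, \quad \beta' = \beta + \frac{1}{2} \sum_{t=1}^n (x_t - \omega)^2.$$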
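A minimal sketch of this update follows. The function name and the data are hypothetical, and the final line assumes the rate parametrization suggested by the exponential special case f(x | β) = βe^{−βx}, under which a Gamma(α, β) variable has mean α/β.

```python
def gamma_precision_posterior(alpha, beta, omega, xs):
    """Posterior Gamma(alpha', beta') of the precision for known mean omega."""
    n = len(xs)
    alpha_post = alpha + n / 2
    beta_post = beta + 0.5 * sum((x - omega) ** 2 for x in xs)
    return alpha_post, beta_post

# Toy example: prior Gamma(1, 1), known mean 0, four observations.
alpha_post, beta_post = gamma_precision_posterior(1.0, 1.0, 0.0,
                                                  [1.0, -1.0, 2.0, -2.0])
# Posterior mean of r, assuming beta acts as a rate parameter.
precision_estimate = alpha_post / beta_post
```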
Finally, let us turn our attention to the general problem of estimating both the mean
and the precision of a normal distribution. We will use the same prior distributions
for the mean and precision as in the case when just one of them was unknown. It
will be assumed that the precision is independent of the mean, while the mean has a
normal distribution given the precision (Fig. 4.11).
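where
$$\mu' = \frac{\tau \mu + n \bar{x}_n}{\tau + n}, \quad \tau' = \tau + n, \quad
\alpha' = \alpha + \frac{n}{2}, \quad
\beta' = \beta + \frac{1}{2} \sum_{t=1}^n (x_t - \bar{x}_n)^2 + \frac{\tau n (\bar{x}_n - \mu)^2}{2(\tau + n)}.$$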
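The four update equations translate directly into code. The sketch below assumes at least one observation; the function name and the toy numbers are illustrative.

```python
def normal_gamma_posterior(mu, tau, alpha, beta, xs):
    """Posterior parameters (mu', tau', alpha', beta') after observing xs."""
    n = len(xs)                       # assumed to be at least 1
    x_bar = sum(xs) / n
    scatter = sum((x - x_bar) ** 2 for x in xs)
    mu_post = (tau * mu + n * x_bar) / (tau + n)
    tau_post = tau + n
    alpha_post = alpha + n / 2
    beta_post = (beta + 0.5 * scatter
                 + tau * n * (x_bar - mu) ** 2 / (2 * (tau + n)))
    return mu_post, tau_post, alpha_post, beta_post

# Toy example: mu = 0, tau = 1, alpha = 1, beta = 1 and data (1, 2, 3).
params = normal_gamma_posterior(0.0, 1.0, 1.0, 1.0, [1.0, 2.0, 3.0])
```

For this example the empirical mean is 2 and the sum of squared deviations is 2, so the updated parameters are μ′ = 1.5, τ′ = 4, α′ = 2.5 and β′ = 1 + 1 + 1.5 = 3.5.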
For a prior $\omega \mid r \sim N(\mu, (\tau r)^{-1})$ and $r \sim \mathrm{Gamma}(\alpha, \beta)$, as before, the joint
distribution for mean and precision is given by
$$\xi(\omega, r) \propto \sqrt{r} \cdot \exp\left(-\frac{\tau r}{2}(\omega - \mu)^2\right) r^{\alpha - 1} e^{-\beta r},$$
as ξ(ω, r ) = ξ(ω | r ) ξ(r ). Now we can write the marginal density of new observations
as
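$$\begin{aligned}
p_\xi(x) &= \int f(x \mid \omega, r)\, d\xi(\omega, r) \\
&\propto \int_0^\infty \int_{-\infty}^\infty \sqrt{r} \cdot \exp\left(-\frac{r}{2}(x - \omega)^2\right) \exp\left(-\frac{\tau r}{2}(\omega - \mu)^2\right) r^{\alpha - 1} e^{-\beta r}\, d\omega\, dr \\
&= \int_0^\infty r^{\alpha - \frac{1}{2}} e^{-\beta r} \int_{-\infty}^\infty \exp\left(-\frac{r}{2}(x - \omega)^2 - \frac{\tau r}{2}(\omega - \mu)^2\right) d\omega\, dr \\
&= \int_0^\infty r^{\alpha - \frac{1}{2}} e^{-\beta r} \left( \int_{-\infty}^\infty \exp\left(-\frac{r}{2}\left[(x - \omega)^2 + \tau(\omega - \mu)^2\right]\right) d\omega \right) dr \\
&= \int_0^\infty r^{\alpha - \frac{1}{2}} e^{-\beta r} \exp\left(-\frac{\tau r}{2(\tau + 1)}(\mu - x)^2\right) \sqrt{\frac{2\pi}{r(1 + \tau)}}\, dr.
\end{aligned}$$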
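The last equality evaluates the inner integral over ω by completing the square. As a sketch of that intermediate step (the shorthand m is introduced here only for convenience and is not part of the original notation):

```latex
\begin{aligned}
(x - \omega)^2 + \tau(\omega - \mu)^2
  &= (1 + \tau)(\omega - m)^2 + \frac{\tau}{1 + \tau}(x - \mu)^2,
  \qquad m \triangleq \frac{x + \tau\mu}{1 + \tau}, \\
\int_{-\infty}^{\infty} \exp\left(-\frac{r(1 + \tau)}{2}(\omega - m)^2\right) d\omega
  &= \sqrt{\frac{2\pi}{r(1 + \tau)}}.
\end{aligned}
```

The first identity can be verified by expanding both sides in powers of ω; the second is the normalizing constant of a normal density with precision r(1 + τ).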
The binomial distribution as well as the normal distribution can be extended to mul-
tiple dimensions. Fortunately, multivariate extensions exist for their corresponding
conjugate priors, too.
$$P(n \mid \omega) = \frac{n!}{\prod_{k=1}^K n_k!} \prod_{k=1}^K \omega_k^{n_k}.$$
The Dirichlet distribution is the multivariate extension of the Beta distribution that
will turn out to be a natural candidate for a prior on the multinomial distribution.
The parameter α determines the density of the observations, as shown in Fig. 4.13.
The Dirichlet distribution is conjugate to the multinomial distribution in the same
way that the Beta distribution is conjugate to the Bernoulli/binomial distribution.
$$\xi_t(\omega) \propto \prod_{k=1}^K \omega_k^{n_k + \alpha_k - 1},$$
where $n_k = \sum_{t=1}^n \mathbb{I}\{x_t = k\}$.
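In code, this update amounts to adding the observed counts to the prior parameters. The sketch below assumes outcomes are labelled 1, ..., K; the function name and the data are illustrative.

```python
def dirichlet_posterior(alpha, xs):
    """Posterior Dirichlet parameters alpha_k + n_k for outcomes in 1..K."""
    counts = [0] * len(alpha)
    for x in xs:
        counts[x - 1] += 1      # n_k: number of times outcome k occurred
    return [a + c for a, c in zip(alpha, counts)]

# Toy example: uniform Dirichlet(1, 1, 1) prior and five observations.
alpha_post = dirichlet_posterior([1.0, 1.0, 1.0], [1, 3, 3, 2, 3])
total = sum(alpha_post)
posterior_mean = [a / total for a in alpha_post]   # expected omega_k
```

For these data the counts are (1, 1, 3), so the posterior parameters are (2, 2, 4) and the posterior mean is (0.25, 0.25, 0.5).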
The last conjugate pair we shall discuss is that for multivariate normal distribu-
tions. Similarly to the extension of the Bernoulli distribution to the multinomial,
and the corresponding extension of the Beta to the Dirichlet, the normal priors can
be extended to the multivariate case. The prior of the mean becomes a multivariate
normal distribution, while that of the precision becomes a Wishart distribution.
Definition 4.3.8 (Multivariate normal distribution) The multivariate normal distribution
is a continuous distribution with outcome space $S = \mathbb{R}^K$. Its parameters are
a mean vector $\omega \in \mathbb{R}^K$ and a precision matrix³ $R \in \mathbb{R}^{K \times K}$ that is positive-definite,
that is, $x^\top R x > 0$ for $x \neq 0$. The probability density function of the multivariate
normal distribution is given by
$$f(x \mid \omega, R) = (2\pi)^{-K/2}\, |R|^{1/2} \exp\left(-\frac{1}{2}(x - \omega)^\top R (x - \omega)\right),$$
where |R| denotes the matrix determinant. When taking n independent samples from
a fixed multivariate normal distribution the probability of observing a sequence $x^n$
is given by
$$f(x^n \mid \omega, R) = \prod_{t=1}^n f(x_t \mid \omega, R).$$
The graphical model of the multivariate normal distribution is given in Fig. 4.15.
For the definition of the Wishart distribution we first have to recall the definition
of a matrix trace.
Definition 4.3.9 The trace of an n × n square matrix A with entries ai j is defined
as
$$\mathrm{trace}(A) \triangleq \sum_{i=1}^n a_{ii}.$$
[Figure: graphical model with nodes T, R, xt , μ, ω]
for positive-definite V ∈ R K ×K .
ξ([ωl , ωu ]) = s.
Note that ωl , ωu are not unique and any choice satisfying the condition is valid.
However, typically the interval is chosen so as to exclude the tails (extremes) of
the distribution and to be centered on the maximum. Figure 4.17 shows the 90% credible
interval for the Bernoulli parameter of Example 4.1 after 1000 observations, that is,
the measure of A under ξ is ξ(A) = 0.9. We see that the true parameter ω = 0.6 lies
slightly outside it.
Reliability of Credible Intervals
Let φ, ξ0 be probability measures on the parameter set Ω with ξ0 being our prior
belief and φ the actual distribution of ω ∈ Ω. Each ω defines a measure Pω on the
The probability that the true value of ω will be within a particular credible interval
depends on how well the prior ξ0 matches the true distribution from which the
parameter ω was drawn. This is illustrated in the following experimental setup,
where we check how often a 50% credible interval fails.
Fig. 4.18 50% credible intervals for a prior Beta(10, 10), ξ0 matching the distribution of ω
(curves: UCB and LCB; horizontal axis t, vertical axis ω)
Fig. 4.19 Failure rate of 50% credible intervals for a prior Beta(10, 10), ξ0 matching the distribution
of ω (horizontal axis t, vertical axis failure rate)
On the other hand, Fig. 4.20 illustrates what happens when ξ0 ≠ φ. In fact,
we used ωi = 0.6 for all trials i. Formally, this is a Dirac distribution concentrated
at ω = 0.6, usually denoted φ(ω) = δ(ω − 0.6). We see that the credible interval
is always centered around our initial mean guess and that it is always quite tight.
Figure 4.21 shows the average number of failures. We see that initially, because
our prior differs from the true distribution, we make more mistakes than
in the previous case. However, eventually our prior is swamped by the data and the
error rate converges to 50%.
Fig. 4.20 50% credible intervals for a prior Beta(10, 10), when ξ0 does not match the distribution
of ω = 0.6 (curves: UCB and LCB; horizontal axis t, vertical axis ω)
Fig. 4.21 Failure rate of 50% credible interval for a prior Beta(10, 10), when ξ0 does not match
the distribution of ω = 0.6 (horizontal axis t, vertical axis average number of failures)
While Bayesian ideas are useful, as they allow us to express our subjective beliefs
about a particular unknown quantity, they nevertheless are difficult to employ when
we have no good intuition about what prior to use. One could look at the Bayesian
estimation problem as a minimax game between us and nature and consider bounds
with respect to the worst possible prior distribution. However, even in that case we
must select a family of distributions and priors.
4.5 Concentration Inequalities 81
In this section we will examine guarantees we can give about any calculation we
make from observations with minimal assumptions about the distribution generating
these observations. These findings are fundamental, in the sense that they rely on a
very general phenomenon, called concentration of measure. As a consequence, they
are much stronger than results such as the central limit theorem (which we will not
cover in this textbook).
Here we shall focus on the most common application of calculating the sample
mean, as given in Definition 4.2.1. We have seen that e.g. for the Beta-Bernoulli
conjugate prior, it is a simple enough matter to compute a posterior distribution. From
that, we can obtain a credible interval on the expected value of the unknown Bernoulli
distribution. However, we would like to do the same for arbitrary distributions on
[0, 1], rather than just the Bernoulli. We shall now give an overview of a set of tools
that can be used to do this.
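\[
\begin{aligned}
E X &= \int_0^\infty x \, dP(x) \\
    &= \int_0^u x \, dP(x) + \int_u^\infty x \, dP(x) \\
    &\ge 0 + \int_u^\infty u \, dP(x) \\
    &= u \, P(X \ge u).
\end{aligned}
\]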
Consequently, if x̄n is the empirical mean after n observations, for a random variable
X with expectation EX = μ, we can use Markov’s inequality to obtain
$P(|\bar{x}_n - \mu| \ge \epsilon) \le E|\bar{x}_n - \mu| / \epsilon$. In particular, for
X ∈ [0, 1], we obtain the bound
\[ P(|\bar{x}_n - \mu| \ge \epsilon) \le \frac{1}{\epsilon}. \]
Unfortunately, this bound does not improve for a larger number of observations n.
However, we can get significantly better bounds through various transformations,
using Markov’s inequality as a building block in other inequalities. The first of those
is Chebyshev’s inequality.
Theorem 4.5.2 (Chebyshev’s inequality) Let X be a random variable with expectation
EX = μ and variance $V X = \sigma^2$. Then, for all k > 0,
\[ P(|X - \mu| \ge k\sigma) \le \frac{1}{k^2}. \tag{4.5.2} \]
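Example 4.3 (Application to sample mean) It is easy to show that the sample mean
x̄n has expectation μ and variance $\sigma^2 / n$ and we obtain from (4.5.2)
\[ P\left( |\bar{x}_n - \mu| \ge \frac{k\sigma}{\sqrt{n}} \right) \le \frac{1}{k^2}. \]
Setting $\epsilon = k\sigma / \sqrt{n}$ we get $k = \epsilon \sqrt{n} / \sigma$ and hence
\[ P(|\bar{x}_n - \mu| \ge \epsilon) \le \frac{\sigma^2}{\epsilon^2 n}. \]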
The inequality derived in Example 4.3 can be quite loose. In fact, one can prove
tighter bounds for the estimation of an expected value by a different application of
the Markov inequality, due to Chernoff.
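To get a feeling for how loose the Chebyshev bound of Example 4.3 can be, the following minimal sketch compares it with an empirical deviation frequency for Bernoulli samples. The function names, the Bernoulli setting, and the number of trials are illustrative assumptions, not part of the text.

```python
import random


def chebyshev_bound(sigma2, n, eps):
    """Chebyshev bound sigma^2 / (eps^2 n) on P(|mean - mu| >= eps), capped at 1."""
    return min(1.0, sigma2 / (n * eps ** 2))


def empirical_deviation_rate(p, n, eps, trials=2000, seed=0):
    """Fraction of simulated Bernoulli(p) sample means deviating from p by at least eps."""
    rng = random.Random(seed)
    failures = 0
    for _ in range(trials):
        mean = sum(rng.random() < p for _ in range(n)) / n
        if abs(mean - p) >= eps:
            failures += 1
    return failures / trials


if __name__ == "__main__":
    p, n, eps = 0.6, 100, 0.1
    sigma2 = p * (1 - p)  # variance of a single Bernoulli(p) observation
    print("Chebyshev bound:", chebyshev_bound(sigma2, n, eps))
    print("Empirical rate: ", empirical_deviation_rate(p, n, eps))
```

In runs of this kind the empirical frequency is typically far below the bound, which is what motivates the sharper exponential bounds.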
Main idea of Chernoff bounds.
Let $S_n = \sum_{t=1}^{n} x_t$, with $x_t \sim P$ independently, i.e., $x^n \sim P^n$. By definition,
from Markov’s inequality we obtain for any θ > 0
where $\mu = \frac{1}{n} \sum_{t=1}^{n} \mu_t$.
\[ P(\bar{x}_n - \mu \ge \epsilon) = P(S_n \ge u) \le e^{-\theta n \epsilon} \prod_{t=1}^{n} E e^{\theta (x_t - \mu_t)}. \tag{4.5.6} \]
Applying Jensen’s inequality directly to the expectation does not help. However, we
can use convexity in another way: Let f (z) be the following linear upper bound on
$e^{\theta z}$ on the interval [a, b]:
\[ f(z) \triangleq \frac{b - z}{b - a} e^{\theta a} + \frac{z - a}{b - a} e^{\theta b} \ge e^{\theta z}. \]
Then obviously $E e^{\theta z} \le E f(z)$ for z ∈ [a, b]. We can use this to bound the term
inside the product in (4.5.6), applying it with $z = x_t \in [a_t, b_t]$ and factoring out
$e^{-\theta \mu_t}$:
\[ E e^{\theta (x_t - \mu_t)} \le \frac{e^{-\theta \mu_t}}{b_t - a_t} \left[ (b_t - \mu_t) e^{\theta a_t} + (\mu_t - a_t) e^{\theta b_t} \right]. \]
Bounding this term by taking derivatives with respect to θ and
computing the second order Taylor expansion gives
The main problem approximate Bayesian computation (ABC) aims to solve is how
to weight the evidence we have for or against different models. The assumption is
that we have a family of models {Mω | ω ∈ Ω}, from which we can generate data.
However, there is no easy way to calculate the probability of any model having
generated the data. On the other hand, as in the standard Bayesian setting, we can
start with a prior ξ over Ω and, given some data x ∈ W, we wish to calculate the
posterior ξ(ω | x). ABC methods generally rely on what is called an approximate
statistic in order to weight the relative likelihood of models given the data.
An approximate statistic φ : S → S′ maps the data to some lower dimensional
space S′. Then it is possible to compare different data points in terms of how similar
their statistics are. For this, one also defines some distance D : S′ × S′ → R+ .
ABC methods are useful in two specific situations. The first is when the family
of models that we consider has an intractable likelihood. This means that calculating
Mω (x) is prohibitively expensive. The second is in some applications which admit a
class of parametrized simulators which have no probabilistic description. Then one
reasonable approach is to find the best simulator in the class and apply it to the actual
problem.
The simplest algorithm in this context is ABC Rejection Sampling shown as
Algorithm 4.1. Here, we repeatedly sample a model from the prior distribution, and
then generate data x̂ from the model. If the sampled data is ε-close to the original data
in terms of the statistic, we accept the sample as an approximate posterior sample.
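The rejection scheme just described can be sketched as follows. This is a minimal illustration rather than a transcription of Algorithm 4.1: the Bernoulli model family, the uniform prior on [0, 1], the sample mean as the statistic, and the absolute difference as the distance are all assumptions made for the example.

```python
import random


def abc_rejection(data, n_samples, eps, seed=0):
    """Draw approximate posterior samples of a Bernoulli parameter by ABC rejection.

    Assumptions (hypothetical, for illustration only): uniform prior on [0, 1],
    statistic = sample mean, distance = absolute difference of the statistics.
    """
    rng = random.Random(seed)

    def statistic(xs):
        return sum(xs) / len(xs)

    target = statistic(data)
    accepted = []
    while len(accepted) < n_samples:
        omega = rng.random()  # sample a model from the prior
        simulated = [1 if rng.random() < omega else 0 for _ in data]  # generate data
        if abs(statistic(simulated) - target) <= eps:  # eps-close in the statistic
            accepted.append(omega)
    return accepted


if __name__ == "__main__":
    observed = [1] * 30 + [0] * 20
    posterior_samples = abc_rejection(observed, n_samples=300, eps=0.05)
    print("approximate posterior mean:", sum(posterior_samples) / len(posterior_samples))
```

Smaller values of eps give samples closer to the true posterior, at the price of a lower acceptance rate.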
For an overview of ABC methods see [3, 4]. Early ABC methods were developed
for applications such as econometric modelling e.g. [5], where detailed simulators
but no useful analytical probabilistic models were available. ABC methods have also
been used for inference in dynamical systems e.g. [6] and the reinforcement learning
problem [7, 8].
where ξ|x is shorthand for the distribution ξ(ω | x). An efficient method for mini-
mizing this divergence is rewriting it as
4.6 Approximate Bayesian Approaches 87
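\[
\begin{aligned}
D(Q_\theta \,\|\, \xi_{|x}) &= -\int_\Omega \ln \frac{d\xi_{|x}}{dQ_\theta} \, dQ_\theta \\
&= -\int_\Omega \ln \frac{d\xi_x}{dQ_\theta} \, dQ_\theta + \ln \xi(x),
\end{aligned}
\]
where $\xi_x$ is shorthand for the joint distribution ξ(ω, x) for a fixed value of x. As the
second term does not depend on θ, we can find the best element of the family by
computing
\[ \max_{\theta \in \Theta} \int_\Omega \ln \frac{d\xi_x}{dQ_\theta} \, dQ_\theta, \]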
where the term we are maximizing can also be seen as a lower bound on the marginal
log-likelihood.
Expectation propagation. The other direction requires us to minimize the divergence
\[ D(\xi_{|x} \,\|\, Q_\theta) = \int_\Omega \ln \frac{d\xi_{|x}}{dQ_\theta} \, d\xi_{|x}. \]
A respective algorithm in the case of data terms that are independent given the
parameter is expectation propagation [9]. There, the approximation has a factored
form and is iteratively updated, with each term minimizing the KL-divergence while
keeping the remaining terms fixed.
When it is not necessary to have a full posterior distribution, some parameter may
be estimated point-wise. One simple such approach is maximum likelihood. In the
simplest case, we replace the posterior distribution ξ(ω | x) with a point estimate
corresponding to the parameter value that maximizes the likelihood, that is,
\[ \omega^*_{\mathrm{ML}} \in \arg\max_{\omega} P_\omega(x). \]
In the latter case, even though we cannot compute the full function ξ(ω | x), we can
still maximize (perhaps locally) for ω.
More generally, there might be some parameters φ for which we actually can
compute a posterior distribution, and some other parameters ω for which we
cannot. Then we can perform Bayesian inference for the φ parameters and
maximum-likelihood for the ω parameters. This generally falls within the domain of Empirical
Bayes methods, pioneered by [10]. These replace some parameters by an empirical
estimate not necessarily corresponding to the maximum likelihood. These methods
are quite diverse [10–14] and unfortunately beyond the scope of this book.
References
1. Hoeffding, W.: Probability inequalities for sums of bounded random variables. J. Am. Stat.
Assoc. 58(301), 13–30 (1963)
2. Casella, G., Fienberg, S., Olkin, I. (eds.) Monte Carlo Statistical Methods. Springer Texts in
Statistics. Springer (1999)
3. Csilléry, K., Blum, M.G.B., Gaggiotti, O.E., François, O.: Approximate Bayesian computation
(ABC) in practice. Trends Ecol. Evol. 25(7), 410–418 (2010)
4. Marin, J.-M., Pudlo, P., Robert, C.P., Ryder, R.J.: Approximate Bayesian computational meth-
ods. Stat. Comput. 22(6), 1167–1180 (2012)
5. Geweke, J.F.: Using simulation methods for Bayesian econometric models: Inference, devel-
opment, and communication. Econ. Rev. 18(1), 1–73 (1999)
6. Toni, T., Welch, D., Strelkowa, N., Ipsen, A., Stumpf, M.P.H.: Approximate Bayesian com-
putation scheme for parameter inference and model selection in dynamical systems. J. Royal
Soc. Interf. 6(31), 187–202 (2009)
7. Dimitrakakis, C., Tziortziotis, N.: ABC reinforcement learning. In: Proceedings of the 30th
International Conference on Machine Learning, ICML 2013, pp. 684–692. JMLR.org (2013)
8. Dimitrakakis, C., Tziortziotis, N.: Usable ABC reinforcement learning. In: NIPS 2014 Work-
shop: ABC in Montreal (2014)
9. Minka, T.P.: Expectation propagation for approximate Bayesian inference. In: UAI ’01: Pro-
ceedings of the 17th Conference in Uncertainty in Artificial Intelligence, pp. 362–369. Morgan
Kaufmann (2001)
10. Robbins, H.: An empirical Bayes approach to statistics. In: Neyman, J. (ed.) Proceedings of
the Third Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Contri-
butions to the Theory of Statistics. University of California Press, Berkeley, CA (1955)
11. Laird, N.M., Louis, T.A.: Empirical Bayes confidence intervals based on bootstrap samples. J.
Am. Stat. Assoc. 82(399), 739–750 (1987)
12. Lwin, T., Maritz, J.S.: Empirical Bayes approach to multiparameter estimation: with special
reference to multinomial distribution. Ann. Inst. Stat. Math. 41(1), 81–99 (1989)
13. Robbins, H.: The empirical Bayes approach to statistical decision problems. Ann. Math. Stat.
35(1), 1–20 (1964)
14. Deely, J.J., Lindley, D.V.: Bayes empirical Bayes. J. Am. Stat. Assoc. 76(376), 833–841 (1981)
Chapter 5
Sequential Sampling
So far, we have mainly considered decision problems where the sample size was fixed.
However, frequently the sample size can also be part of the decision. Since normally
larger sample sizes give us more information, in this case the decision problem is only
interesting when obtaining new samples has a cost. Consider the following example.
Example 5.1 Consider that you have 100 produced items and you want to determine
whether there are fewer than 10 faulty items among them. If testing has some cost,
it pays off to think about whether it is possible to do without testing all 100 items.
Indeed, this is possible by the following simple online testing scheme: You test one
item after another until you either have discovered 10 faulty items or 91 good items.
In either case you have the correct answer, never at a higher cost and often at a
lower cost than when testing all items.
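The online testing scheme of Example 5.1 is easy to write down explicitly. The sketch below is a minimal illustration; the function name and the representation of items as a list of booleans are assumptions made for the example.

```python
def sequential_test(items, max_faulty=10):
    """Test items one by one until the question "fewer than max_faulty faulty?" is settled.

    items: list of booleans, True meaning the item is faulty.
    Returns (fewer_than_max_faulty, number_of_items_tested).
    """
    good_needed = len(items) - max_faulty + 1  # 91 good items out of 100 settle the question
    faulty = good = 0
    for tested, is_faulty in enumerate(items, start=1):
        if is_faulty:
            faulty += 1
        else:
            good += 1
        if faulty >= max_faulty:
            return False, tested
        if good >= good_needed:
            return True, tested
    raise ValueError("cannot happen: one of the two counts must reach its threshold")


if __name__ == "__main__":
    print(sequential_test([True] * 10 + [False] * 90))  # stops after 10 tests
    print(sequential_test([False] * 100))               # stops after 91 tests
```

The number of tests depends on the order in which the items are examined: the scheme needs at least 10 and at most 100 tests.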
1 This is simply a sample space and associated algebra, together with a probability measure.
Thus, the sample obtained depends both on P and the sampling procedure πs . In
our setting, we don’t just want to sample sequentially, but also to take some action
after sampling is complete. For that reason, we can generalize the above definition
to sequential decision procedures.
Definition 5.1.2 (Sequential decision procedure) A sequential decision procedure
π = (πs , πd ) is a tuple composed of
1. a stopping rule $\pi_s : X^* \to \{0, 1\}$ and
2. a decision rule $\pi_d : X^* \to A$.
The stopping rule πs specifies whether, at any given time, we should stop and make
a decision in A or take one more sample. That is, stop if
\[ \pi_s(x^t) = 1, \]
otherwise observe $x_{t+1}$. Once we have stopped (i.e. $\pi_s(x^t) = 1$), we choose the
decision
\[ \pi_d(x^t). \]
the expected utility according to our posterior. Now consider the case of sampling
with costs, such that a sample of size n results in a cost of cn. For that reason we
define a new utility function U which depends on the number of observations we
have.
In the remainder of this section, we shall consider the following simple decision
problem, where we need to make a decision about the value of an unknown parameter.
As we get more data, we have a better chance of discovering the right parameter.
However, there is always a small chance of getting no information.
Example 5.2 Consider the following decision problem, where the goal is to distin-
guish between two possible hypotheses θ1 , θ2 , with corresponding decisions a1 , a2 .
We have three possible observations {1, 2, 3}, with 1, 2 being more likely under the
first and second hypothesis, respectively. However, the third observation gives us no
information about the hypothesis, as its probability is the same under θ1 and θ2 . In
this problem γ is the probability that we obtain an uninformative sample.
• Parameters: Θ = {θ1 , θ2 }.
• Decisions: A = {a1 , a2 }.
• Observation distribution f i (k) = Pθi (xt = k) for all t with
utility −c for each additional sample, there is a point of diminishing returns, after
which it will not be worthwhile to take any more samples.
We first investigate the setting where the number of observations is fixed. In
particular, the value of the optimal procedure taking n observations is defined to be
the expected utility obtained when maximizing the a posteriori utility given $x^n$, i.e.,
\[ V(n) = \sum_{x^n} P_\xi^n(x^n) \max_a E_\xi(U \mid x^n, a), \]
where $P_\xi^n = \int_\Theta P_\theta^n \, d\xi(\theta)$ is the marginal distribution over n observations. For this
specific example, it is easy to calculate the value of the procedure that takes n obser-
vations, by noting the following facts.
(a) The probability of observing xt = 3 for all t = 1, . . . , n is γ n . Then we must
rely on our prior ξ to make a decision.
(b) If we observe any other sequence, we know the value of θ.
Consequently, the total value V (n) of the optimal procedure taking n observations is
Based on this, we now want to find the optimal number of samples n. Since V
is a smooth function, an approximate maximizer can be found by viewing n as a
continuous variable.2 Taking derivatives, we get
\[ n^* = \frac{1}{\log \gamma} \log \frac{c}{\xi b \log \gamma}, \]
\[ V(n^*) = \frac{c}{\log \gamma} \left[ 1 - \log \frac{c}{\xi b \log \gamma} \right]. \]
The results of applying this procedure are illustrated in Fig. 5.1. Here we can see
that, for two different choices of priors, the optimal number of samples is different.
In both cases, there is a clear choice for how many samples to take, when we must
fix the number of samples before seeing any data.
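As a small numerical illustration, the sketch below computes the continuous maximizer and then compares the two nearest integers. It assumes that the value of taking n samples is V(n) = ξ b γ^n − c n, which is our reading of the example (it is the function whose derivative yields the expression for n∗ above) rather than a formula quoted from the text; all function names are hypothetical.

```python
import math


def fixed_sample_value(n, xi, b, c, gamma):
    """Assumed value of taking exactly n samples: xi * b * gamma^n - c * n."""
    return xi * b * gamma ** n - c * n


def continuous_optimum(xi, b, c, gamma):
    """Maximizer of the assumed V(n) when n is treated as a continuous variable."""
    return math.log(c / (xi * b * math.log(gamma))) / math.log(gamma)


def best_integer_n(xi, b, c, gamma):
    """Compare the two integers nearest to the continuous optimum."""
    n_star = continuous_optimum(xi, b, c, gamma)
    candidates = {max(0, math.floor(n_star)), max(0, math.ceil(n_star))}
    return max(candidates, key=lambda n: fixed_sample_value(n, xi, b, c, gamma))


if __name__ == "__main__":
    xi, b, c, gamma = 0.5, -10.0, 0.01, 0.9
    print("continuous optimum:", continuous_optimum(xi, b, c, gamma))
    print("best integer n:    ", best_integer_n(xi, b, c, gamma))
```

Checking only the two neighbouring integers suffices here because the assumed V(n) is concave in n when b < 0 and 0 < γ < 1.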
However, we may not be constrained to fix the number of samples a priori. As
illustrated in Example 5.1, many times it is a good idea to adaptively decide when to
stop taking samples. This is illustrated by the following sequential procedure. Since
we already know that there is an optimal a priori number of steps n ∗ , we can choose
to look at all possible stopping times that are smaller or equal to n ∗ .
2 In the end, we can find the optimal maximizer by looking at the nearest two integers to the value
found.
5.1 Gains From Sequential Sampling 93
[Fig. 5.1: the value V (n) plotted against the number of samples n (logarithmic axis), for
b = −10, c = 10^−2, and priors ξ = 0.1 and ξ = 0.5]
Since the probability of xt = 3 is always the same for both θ1 and θ2 , we have
\[ E_\xi(n) = E(n \mid \theta = \theta_1) = E(n \mid \theta = \theta_2) < n^*. \]
using the formula for the geometric series. Consequently, the value of this procedure is
\[
\begin{aligned}
\bar{V}(n^*) &= E_\xi(U \mid n = n^*) \, P_\xi(n = n^*) + E_\xi(U \mid n < n^*) \, P_\xi(n < n^*) \\
&= \xi b \gamma^{n^*} - c \, E_\xi(n),
\end{aligned}
\]
[Figure: the value V plotted against γ for the fixed, bounded, and unbounded procedures]
As you can see, there is a non-zero probability that n = n ∗ , at which time we will
have not resolved the true value of θ. In that case, we are still not better off than at
the very beginning of the procedure, when we had no observations. If our utility is
linear with the number of steps, it thus makes sense that we should still continue.
For that reason, we should consider unbounded procedures.
The unbounded procedure for our example is simply to use the stopping rule
$\pi_s(x^t) = 1$ iff $x_t \neq 3$. Since we only obtain information whenever $x_t \neq 3$, and that
information is enough to fully decide θ, once we observe $x_t \neq 3$, we can make a
decision that has value 0, as we can guess correctly. So, the value of the unbounded
sequential procedure is just $V^* = -c \, E_\xi(n)$, where
\[ E_\xi(n) = \sum_{t=1}^{\infty} t \, P_\xi(n = t) = \sum_{t=1}^{\infty} t \gamma^{t-1} (1 - \gamma) = \frac{1}{1 - \gamma}. \]
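The expected number of samples of the unbounded procedure can be checked by simulation. The sketch below is illustrative only: it simulates the stopping time directly (each observation is uninformative with probability γ) and compares its average with 1/(1 − γ); the function names and the number of trials are assumptions.

```python
import random


def sample_stopping_time(gamma, rng):
    """Number of samples taken until the first informative observation."""
    n = 1
    while rng.random() < gamma:  # uninformative observation: keep sampling
        n += 1
    return n


def mean_stopping_time(gamma, trials=20000, seed=0):
    """Monte Carlo estimate of the expected number of samples."""
    rng = random.Random(seed)
    return sum(sample_stopping_time(gamma, rng) for _ in range(trials)) / trials


def unbounded_value(c, gamma):
    """Value -c / (1 - gamma) of the unbounded procedure."""
    return -c / (1.0 - gamma)


if __name__ == "__main__":
    gamma, c = 0.9, 0.01
    print("simulated E(n):", mean_stopping_time(gamma))
    print("1 / (1 - gamma):", 1.0 / (1.0 - gamma))
    print("value:", unbounded_value(c, gamma))
```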
We now turn our attention to the general case. While it is easy to define the optimal
stopping rule and decision in this simple example, how can we actually do the same
thing for arbitrary problems? The following section characterizes optimal sequential
sampling procedures and gives an algorithm for constructing them.
5.2 Optimal Sequential Sampling Procedures
Once more, consider a distribution family P = {Pθ | θ ∈ Θ} and a prior ξ over
B (Θ). For a decision set A, a utility function U : Θ × A → R, and sampling
costs c, the utility of a sequential decision procedure π is the local utility at the
end of the procedure, minus the sampling cost. In expectation, this can be written as
Here the cost is inside the expectation, since the number of samples we take is
random. Summing over all the possible stopping times n, and taking Bn ⊂ X ∗ as the
set of observations for which we stop, we have
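\[
\begin{aligned}
U(\xi, \pi) &= \sum_{n=1}^{\infty} \int_{B_n} E_\xi[U(\theta, \pi(x^n)) \mid x^n] \, dP_\xi(x^n) - \sum_{n=1}^{\infty} P_\xi(B_n) \, n c \\
&= \sum_{n=1}^{\infty} \int_{B_n} \int_\Theta U[\theta, \pi(x^n)] \, d\xi(\theta \mid x^n) \, dP_\xi(x^n) - \sum_{n=1}^{\infty} P_\xi(B_n) \, n c,
\end{aligned}
\]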
Since we need not take another sample, the respective value (maximal expected
utility) of that stage is
where we introduce the notation V n to denote the expected utility, given that we are
stopping after at most n steps.
Fig. 5.3 An example of a sequential decision problem with two stages. The initial belief is ξ and
there are two possible subsequent beliefs, ξ(· | x1 = 0) and ξ(· | x1 = 1), depending on whether we
observe xt = 0 or xt = 1. At each stage we pay c
The above is simply the maximum of the value of stopping immediately (V 0 ), and
the value of continuing for at most one more step (V 1 ). This procedure can be applied
recursively for multi-stage problems, as explained below.
[Fig. 5.4: the belief ξn branches into two successor beliefs, depending on whether
x n+1 = 0 or x n+1 = 1; each step costs c]
depending on what the value of the next observation xn is. This is illustrated in
Fig. 5.4, by extension from the previous two-stage example. The immediate value
\[ V^0(\xi_t) = \sup_{a \in A} \int_\Theta u(\theta, a) \, d\xi_t(\theta) \]
is the expected value of the next step, ignoring the cost. Finally, the optimal value
at the n-th step is just the maximum of the value of stopping immediately and the
next-step value, that is,
The main idea expressed in the previous section is to start from the last stage of
our decision problem, where the utility is known, and then move backwards. At
each stage, we know the probability of reaching different points in the next stage,
as well as their values. Consequently, we can compute the value of any point in the
current stage as well. This idea is formalised below, via the algorithm of backwards
induction.
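The backwards recursion can be sketched for a small problem as follows. The setting is an assumption made for illustration: two hypotheses with Bernoulli observations (P1 and P2 are hypothetical probabilities of observing 1), utility 0 for a correct decision and B < 0 for a wrong one, and a cost C per sample. Beliefs are indexed by the counts of observed ones and zeros, so each reachable belief is evaluated once.

```python
from functools import lru_cache

P1, P2 = 0.3, 0.7   # hypothetical P(x = 1) under theta_1 and theta_2
B, C = -10.0, 0.1   # hypothetical utility of a wrong decision, and cost per sample


def belief(xi0, ones, zeros):
    """Posterior probability of theta_1 after observing the given counts."""
    w1 = xi0 * P1 ** ones * (1 - P1) ** zeros
    w2 = (1 - xi0) * P2 ** ones * (1 - P2) ** zeros
    return w1 / (w1 + w2)


def bounded_value(xi0, horizon):
    """Value of the optimal procedure that takes at most `horizon` samples."""

    @lru_cache(maxsize=None)
    def v(ones, zeros):
        xi = belief(xi0, ones, zeros)
        stop = B * min(xi, 1 - xi)            # value of deciding immediately
        if ones + zeros == horizon:           # last stage: we have to stop
            return stop
        p_one = xi * P1 + (1 - xi) * P2       # marginal probability of observing 1
        cont = p_one * v(ones + 1, zeros) + (1 - p_one) * v(ones, zeros + 1) - C
        return max(stop, cont)                # stop or continue, whichever is better

    return v(0, 0)


if __name__ == "__main__":
    for horizon in range(6):
        print(horizon, bounded_value(0.5, horizon))
```

The recursion is written top-down with memoization; it evaluates the last stage first and propagates values backwards, which is the order backwards induction prescribes.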
for every belief ξn in the set of beliefs that arise from the prior ξ0 , with j = T − n.
The proof of this theorem follows by induction. However, we shall prove a more
general version in Chap. 6. Equation 5.6 essentially gives a recursive calculation of the
value of the T -bounded optimal procedure. To evaluate it, we first need to calculate
all possible beliefs ξ1 , . . . , ξT . For each belief ξT , we calculate V 0 (ξT ). We then
move backwards, and calculate V 0 (ξT −1 ) and V 1 (ξT −1 ). Proceeding backwards, for
n = T − 1, T − 2, . . . , 1, we calculate V T +1 (ξn ) for all beliefs ξn with j = T − n.
The value of the procedure also determines the optimal sampling strategy, as shown
by the following theorem.
Theorem 5.2.2 The optimal T -bounded procedure stops at time t if the value of
stopping at t is at least as good as that of continuing, i.e. if
\[ V^0(\xi_t) \ge V^{T-t}(\xi_t). \]
In that case it chooses an action a maximizing $E_{\xi_t} U(\theta, a)$; otherwise it takes one more sample.
Finally, longer procedures (i.e. procedures that allow for stopping later) are always
better than shorter ones, as shown by the following theorem.
Theorem 5.2.3 For any probability measure ξ on Θ,
\[ V^n(\xi) \le V^{n+1}(\xi). \tag{5.7} \]
That is, the procedure that stops after at most n steps is never better than the procedure
that stops after at most n + 1 time steps. To obtain an intuition of why this is the
case, consider the example of Sect. 5.1.1. In that example, if we have a sequence
of 3s, then we obtain no information. Consequently, when we compare the value of
a plan taking at most n samples with that of a plan taking at most n + 1 samples, we
see that the latter plan is better for the event where we obtain n 3s, but has the same
value for all other events.
Given the monotonicity of the value of bounded procedures (5.7), one may well
ask what is the value of unbounded procedures, i.e. procedures that may never stop
sampling. The value of an unbounded sampling and decision procedure π under
prior ξ is
where Pξπ (x n ) is the probability that we observe samples x n and stop under the
marginal distribution defined by ξ and π, while n is the random number of samples
taken by π. As before, this is random because the observations x are random; π itself
can be deterministic.
Definition 5.2.2 (Regular procedure) Given a decision procedure π, let B>k (π) ⊂
X ∗ be the set of sequences such that π takes more than k samples. Then π is regular
if U (ξ, π) ≥ V 0 (ξ) and if, for all n ∈ N, and for all x n ∈ B>n (π)
i.e., the expected utility given for any sample that starts with x n where we don’t stop,
is greater than that of stopping at n.
In other words, if π specifies that at least one observation should be taken, then the
value of π is greater than the value of choosing a decision without any observa-
tion. Furthermore, whenever π specifies that another observation should be taken,
the expected value of continuing must be larger than the value of stopping. If the
procedure is not regular, then there may be stages where the procedure specifies that
sampling should be continued, though the value may not increase by doing so.
Theorem 5.2.4 If π is not regular, then there exists a regular π′ such that U (ξ, π′) ≥
U (ξ, π).
Proof First, consider the case that π is not regular because U (ξ, π) ≤ V 0 (ξ). Then
π′ can be the regular procedure which chooses a ∈ A without any observations.
Now consider the case that U (ξ, π) > V 0 (ξ) and that π specifies at least one
sample should be taken. Let π′ be the procedure which stops as soon as the observed
$x^n$ does not satisfy (5.8).
If π stops, then both sides of (5.8) are equal, as the value of stopping immediately
is at least as high as that of continuing. Consequently, π′ stops no later than π for
any $x^n$. Finally, let
\[ B_k(\pi) = \{ x \in X^* \mid n = k \} \]
be the set of observations such that exactly k samples are taken by rule π. Given that
π′ does stop, we have
\[
\begin{aligned}
U(\xi, \pi') &= \sum_{k=1}^{\infty} \int_{B_k(\pi')} \left\{ V^0[\xi(\cdot \mid x^k)] - ck \right\} dP_\xi(x^k) \\
&\ge \sum_{k=1}^{\infty} \int_{B_k(\pi')} U[\xi(\cdot \mid x^k), \pi] \, dP_\xi(x^k) \\
&= \sum_{k=1}^{\infty} E^\pi_\xi \{ U \mid B_k(\pi') \} \, P^\pi_\xi(B_k(\pi')) = E^\pi_\xi U = U(\xi, \pi),
\end{aligned}
\]
where the inequality follows from the fact that π is not regular.
The worst-case immediate value, i.e. the minimum, is attained when both terms are
equal. Consequently, setting λ1 ξ = λ2 (1 − ξ) gives ξ = λ2 /(λ1 + λ2 ). Intuitively,
this is the worst-case belief, as the uncertainty it induces leaves us unable to choose
between either hypothesis. Replacing in (5.9) gives a lower bound for the value for
any belief:
\[ V^0(\xi) \ge \frac{\lambda_1 \lambda_2}{\lambda_1 + \lambda_2}. \]
[Figure: the value plotted against the belief ξ, with curves labelled V 0 and V 1 and the level
λ1 λ2 /(λ1 + λ2 ) marked]
Let Π denote the set of procedures π which take at least one observation and
define
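\[ V(\xi) = \sup_{\pi \in \Pi} U(\xi, \pi). \]
\[ \Xi_0 \triangleq \{ \xi \mid V^0(\xi) \ge V(\xi) \} \]
\[ \xi_t = \frac{\xi P_{\theta_1}(x^t)}{\xi P_{\theta_1}(x^t) + (1 - \xi) P_{\theta_2}(x^t)}. \]
\[ \frac{\xi (1 - \xi_T)}{(1 - \xi) \xi_T} < \frac{P_{\theta_2}(x^t)}{P_{\theta_1}(x^t)} < \frac{\xi (1 - \xi_L)}{(1 - \xi) \xi_L}. \]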
An important tool in the analysis of SPRT as well as other procedures that stop at
random times is the following theorem by Wald.
Proof In the following proof, we omit explicit references to the implied sequential
procedure.
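\[
\begin{aligned}
E \sum_{i=1}^{n} z_i &= \sum_{k=1}^{\infty} \int_{B_k} \sum_{i=1}^{k} z_i \, dG^k(z^k) \\
&= \sum_{k=1}^{\infty} \sum_{i=1}^{k} \int_{B_k} z_i \, dG^k(z^k) \\
&= \sum_{i=1}^{\infty} \sum_{k=i}^{\infty} \int_{B_k} z_i \, dG^k(z^k) \\
&= \sum_{i=1}^{\infty} \int_{B_{\ge i}} z_i \, dG^i(z^i) \\
&= \sum_{i=1}^{\infty} E(z_i) \, P(n \ge i) = m \, E n.
\end{aligned}
\]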
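The identity E Σ z_i = m E n derived above can be checked numerically for a simple stopping rule. The sketch below is an illustration under assumed settings: the z_i are ±1 steps with P(z_i = 1) = p, so that m = 2p − 1, and sampling stops when the running sum leaves the open interval (a, b). Names and parameter values are hypothetical.

```python
import random


def run_until_exit(p, a, b, rng):
    """Add +1/-1 steps until the running sum leaves (a, b); return (sum, steps)."""
    total, n = 0, 0
    while a < total < b:
        total += 1 if rng.random() < p else -1
        n += 1
    return total, n


def wald_check(p=0.6, a=-5, b=5, trials=20000, seed=0):
    """Return Monte Carlo estimates of E[sum of z_i] and of m * E[n]."""
    rng = random.Random(seed)
    sum_total = sum_n = 0
    for _ in range(trials):
        total, n = run_until_exit(p, a, b, rng)
        sum_total += total
        sum_n += n
    m = 2 * p - 1  # mean of a single step
    return sum_total / trials, m * sum_n / trials


if __name__ == "__main__":
    lhs, rhs = wald_check()
    print("E[sum z_i] ~", lhs, "   m E[n] ~", rhs)
```

Both estimates should agree up to Monte Carlo error, even though the number of steps n is random and depends on the observations.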
We now consider an application of this theorem to the SPRT. Let
$z_i = \log \frac{P_{\theta_2}(x_i)}{P_{\theta_1}(x_i)}$.
Consider the equivalent formulation of the SPRT which uses
\[ a < \sum_{i=1}^{n} z_i < b \]
as the test. Using Wald’s theorem and the previous properties and assuming c ≈ 0,
we obtain the following approximately optimal values for a, b:
\[ a \approx \log c - \log \frac{I_1 \lambda_2 (1 - \xi)}{\xi}, \qquad b \approx \log \frac{1}{c} - \log \frac{I_2 \lambda_1 \xi}{1 - \xi}, \]
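A test of this form is straightforward to implement once the thresholds a and b are given. The sketch below assumes Bernoulli observations with hypothetical parameters p1 and p2 under θ1 and θ2, and takes a and b as inputs rather than computing the approximately optimal values. The convention that a large sum favours θ2 follows from the definition of z_i as the log-ratio of P_θ2 to P_θ1.

```python
import math


def sprt(samples, p1, p2, a, b):
    """Run the test a < sum of z_i < b on binary samples.

    z_i = log(P_theta2(x_i) / P_theta1(x_i)); sampling continues while the sum
    stays strictly between a and b. Returns (decision, samples_used), where the
    decision is "theta1", "theta2", or None if the data ran out first.
    """
    total = 0.0
    n = 0
    for n, x in enumerate(samples, start=1):
        l1 = p1 if x == 1 else 1 - p1
        l2 = p2 if x == 1 else 1 - p2
        total += math.log(l2 / l1)
        if total <= a:
            return "theta1", n  # evidence favours theta_1
        if total >= b:
            return "theta2", n  # evidence favours theta_2
    return None, n


if __name__ == "__main__":
    print(sprt([1] * 10, p1=0.3, p2=0.7, a=-3.0, b=3.0))
    print(sprt([0] * 10, p1=0.3, p2=0.7, a=-3.0, b=3.0))
```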
5.3 Martingales
exists and
\[ E(y_{n+1} \mid x^n) = y_n \]
holds with probability 1. If {yn } is a martingale with respect to itself, i.e. yi (x) = x,
then we call it simply a martingale.
It is also useful to consider the following generalizations of martingale sequences.
This allows us to bound the probability that the difference sequence deviates from
zero. Since there are only few problems where the default random variables are
difference sequences, use of this theorem is most common by defining a new random
variable sequence that is a difference sequence.
A more general type of sequence of random variables than martingales are Markov
processes. Informally speaking, a Markov process is a sequence of variables {xn }
such that the next value xt+1 only depends on the current value xt .
Definition 5.4.1 (Markov Process) Let (S, B (S)) be a measurable space. If {xn } is
a sequence of random variables xn : S → X such that
5.5 Exercises
Exercise 5.1 Consider a stationary Markov process with state space S and whose
transition kernel is a matrix τ . At time t, we are at state xt = s and we can either
(1) terminate and receive reward b(s), or (2) pay c(s) and continue to a random state
xt+1 drawn from the distribution τ (· | s).
Assuming b, c > 0 and τ are known, design a backwards induction algorithm
that optimizes the utility function
\[ U(x_1, \ldots, x_T) = b(x_T) - \sum_{t=1}^{T-1} c(x_t). \]
Finally, show that the expected utility of the optimal policy starting from any state
must be bounded.
Exercise 5.2 Consider the problem of classification with features x ∈ X and labels
y ∈ Y, where each label costs c > 0. Assume a Bayesian model with some parameter
space Θ on which we have a prior distribution ξ0 . Let ξt be the posterior distribution
after t examples (x1 , y1 ), . . . , (xt , yt ).
Let our expected utility be the expected accuracy (i.e., the marginal probability
of correctly guessing the right label over all possible models) of the Bayes-optimal
classifier π : X → Y, minus the cost paid, i.e.,
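\[ E_t(U) \triangleq \max_{\pi} \int_\Theta \int_X P_\theta(\pi(x) \mid x) \, dP_\theta(x) \, d\xi_t(\theta) - ct. \]
Show that the Bayes-optimal classification accuracy (ignoring the label cost)
after t observations can be rewritten as
\[ \int_X \max_{y \in Y} \int_\Theta P_\theta(y \mid x) \, d\xi_t(\theta \mid x) \, dP_t(x), \]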
where Pt and Et denote marginal distributions under the belief ξt . Write the expres-
sion for the expected gain in accuracy when obtaining one more sample and label.
Implement the above for a model family of your choice. Two simple options are
the following. The first is a finite model family composed of two different classifiers
Pθ (y | x). The second is the family of discrete classifier models with a Dirichlet
product prior, i.e. where X = {1, . . . , n}, and each different x ∈ X corresponds to a
different multinomial distribution over Y. In both cases, you can assume a common
(and known) data distribution P(x), in which case ξt (θ | x) = ξt (θ).
Fig. 5.6 Illustrative results for an implementation of Exercise 5.2 on a discrete classifier model
(curves: expected, actual, predicted; logarithmic horizontal axis)
Figure 5.6 shows the performance for a family of discrete classifier models
with |X | = 4. It shows the expected classifier performance (based on the poste-
rior marginal), the actual performance on a small test set, as well as the cumulative
predicted performance gain. As you can see, even though the expected performance
gain is zero in some cases, cumulatively it reaches the actual performance of the
classifier. You should be able to produce a similar figure for your own setup.
Chapter 6
Experiment Design and Markov Decision
Processes
6.1 Introduction
This chapter introduces the very general formalism of Markov decision processes
(MDPs) that allows representation of various sequential decision making problems.
Thus a Markov decision process can be used to model stochastic path problems,
stopping problems as well as problems in reinforcement learning, experiment design,
and control.
We begin by taking a look at the problem of experimental design. A typical
question is how to best allocate treatments with unknown efficacy to patients in
an adaptive manner, so that the best treatment is found, or so as to maximize the
number of patients that are treated successfully. The problem, originally considered
by Chernoff [1, 2], informally can be stated as follows.
We have a number of treatments of unknown efficacy, i.e., some of them work
better than the others. We observe patients one at a time. When a new patient arrives,
we must choose which treatment to administer. Afterwards, we observe whether the
patient improves or not. Given that the treatment effects are initially unknown, how
can we maximize the number of cured patients? Alternatively, how can we discover
the best treatment? The two different problems are formalized below.
placebo. We are given a hypothesis set Ω, with each ω ∈ Ω corresponding to different
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022 109
C. Dimitrakakis and R. Ortner, Decision Making Under Uncertainty
and Reinforcement Learning, Intelligent Systems Reference Library 223,
https://doi.org/10.1007/978-3-031-07614-5_6
models for the effect of the treatment and the placebo. Since we don’t know what
is the right model, we place a prior ξ0 on Ω. We can perform T experiments, after
which we must decide whether or not the treatment is significantly better than the
placebo. To model this, we define a decision set D = {d0 , d1 } and a utility function
U : D × Ω → R which models the effect of each decision d given different versions
of reality ω. One hypothesis ω ∈ Ω is true. To identify the correct hypothesis, we
can choose from a set of K possible experiments to be performed over T trials. At
the t-th trial, we choose experiment at ∈ {1, . . . , K } and observe outcome xt ∈ X ,
with xt ∼ Pω drawn from the true hypothesis. Our posterior is
\[ \xi_t(\omega) \triangleq \xi_0(\omega \mid a_1, \ldots, a_t, x_1, \ldots, x_t). \]
Our utility can again be expressed as a sum over individual rewards, $U = \sum_{t=1}^{T} r_t$.
The simplest bandit problem is the stochastic n-armed bandit. We are faced with K
different one-armed bandit machines, such as those found in casinos. At time t, we
have to choose one action (i.e., a machine, in the bandit context usually termed arm)
at ∈ A = {1, . . . , K }. Each time t we play a machine, we receive a reward rt with
fixed expected value ωi = E(rt | at = i). Unfortunately, we do not know the ωi , and
consequently the best arm is also unknown. The question is how to choose arms in
order to maximize the total expected reward.
Definition 6.2.1 (The stochastic n-armed bandit problem) Given a set of arms A =
{1, . . . , K } each giving reward according to a fixed reward distribution with unknown
mean, select a sequence of actions at ∈ A, so as to maximize expected utility, where
the utility is
\[ U = \sum_{t=0}^{T-1} \gamma^t r_t, \]
T ∈ (0, ∞] is the horizon, and γ ∈ (0, 1] is a discount factor. The reward rt is stochastic
and only depends on the action chosen at step t, with expectation E(rt | at = i) = ωi .
For selecting actions, we want to specify some policy or decision rule. Such a rule
shall only depend on the sequence of previously taken actions and observed rewards.
The following figure summarizes the bandit problem formulated in the Bayesian
setting.
\[ \mathbb{E}^{\pi}_{\xi} U = \mathbb{E}^{\pi}_{\xi} \sum_{t=0}^{T-1} \gamma^t r_t. \]
There are two main difficulties with this approach. The first one is that the choice of
the family and the prior distribution is effectively part of the problem formulation and
can severely influence the solution. This issue can be resolved by either specifying a
subjective prior distribution, or by selecting a prior distribution that has good worst-
case guarantees. The second difficulty concerns the computation of the policy that
maximizes expected utility given a prior and a family. This is a hard problem, because
in general such policies are history dependent and the number of possible histories
grows exponentially with the horizon T .
As a simple illustration, consider the case when the reward for choosing one of the
n actions is either 0 or 1 with some fixed yet unknown probability depending on the
chosen action. This can be modelled in the standard Bayesian framework using the
Beta-Bernoulli conjugate prior. More specifically, we can formalize the problem as
follows.
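Consider n Bernoulli distributions with unknown parameters ωi (i = 1, . . . , n)
such that
\[ \xi(\omega_1, \ldots, \omega_n) = \prod_{i=1}^{n} f(\omega_i \mid \alpha_i, \beta_i), \]
where $f(\cdot \mid \alpha_i, \beta_i)$ denotes the density of the Beta distribution with parameters $\alpha_i, \beta_i$.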
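Let
\[ N_{t,i} \triangleq \sum_{k=1}^{t} \mathbb{I}\{a_k = i\} \]
be the number of times arm i has been played up to time t, and let
\[ \hat{r}_{t,i} \triangleq \frac{1}{N_{t,i}} \sum_{k=1}^{t} r_k \, \mathbb{I}\{a_k = i\} \]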
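be the empirical reward of arm i at time t. We can set r̂t,i = 0 when Nt,i = 0. Then,
the posterior distribution for the parameter of arm i is
\[ \xi_t(\omega_i) = \mathrm{Beta}\big(\alpha_i + N_{t,i}\hat{r}_{t,i},\; \beta_i + N_{t,i}(1 - \hat{r}_{t,i})\big). \]
The belief at time t is thus determined by the vectors
\[ N_t = (N_{t,1}, \ldots, N_{t,n}), \qquad \hat{r}_t = (\hat{r}_{t,1}, \ldots, \hat{r}_{t,n}). \]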
Accordingly, as rt ∈ {0, 1}, the possible states of our belief given some prior are in
$\mathbb{N}^{2n}$. At any time t, we can calculate the probability of observing rt = 1 if we pull
arm i as
\[ \xi_t(r_t = 1 \mid a_t = i) = \frac{\alpha_i + N_{t,i}\hat{r}_{t,i}}{\alpha_i + \beta_i + N_{t,i}}. \]
So, not only can we predict the immediate reward based on our current belief, but we
can also predict all possible next beliefs: the next state is well-defined and depends
only on the current state and observation. As we shall see later, this type of decision
problem can be modelled as a Markov decision process (Definition 6.3.1). For now,
we shall take a closer look at the general bandit process itself.
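The belief update described above can be sketched in a few lines of Python. This is only an illustrative sketch: the class name `ArmBelief` and its field names are assumptions made for this example and do not appear in the text. It stores the prior parameters together with the counts Nt,i and Nt,i r̂t,i for a single arm.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ArmBelief:
    """Beta(alpha, beta) prior for one arm plus its sufficient statistics."""
    alpha: float        # prior pseudo-count of successes
    beta: float         # prior pseudo-count of failures
    pulls: int = 0      # N_{t,i}: number of times the arm was played
    successes: int = 0  # N_{t,i} * r_hat_{t,i}: number of rewards equal to 1

    def predictive(self) -> float:
        """Probability that the next reward of this arm equals 1."""
        return (self.alpha + self.successes) / (self.alpha + self.beta + self.pulls)

    def update(self, reward: int) -> "ArmBelief":
        """Return the next belief after observing a reward in {0, 1}."""
        if reward not in (0, 1):
            raise ValueError("reward must be 0 or 1")
        return ArmBelief(self.alpha, self.beta, self.pulls + 1, self.successes + reward)


belief = ArmBelief(alpha=1.0, beta=1.0)
for r in (1, 1, 0):
    belief = belief.update(r)
print(belief.predictive())  # (1 + 2) / (1 + 1 + 3) = 0.6
```

Since the next belief depends only on the current belief and the observed reward, `update` needs no access to the earlier history.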
The basic view of the bandit process is to consider only the decision maker’s
actions at , obtained rewards rt and the latent parameter ω, as shown in Fig. 6.2a.
With this basic framework, we can now define the general decision-theoretic bandit
process, which also includes the states of belief ξt of the decision maker.
Definition 6.2.2 Let A be a set of arms (now not necessarily finite). Let Ω be a
set of possible parameter values, indexing a family of probability measures
$\mathcal{P} = \{P_{\omega,a} \mid \omega \in \Omega, a \in A\}$. There is some ω ∈ Ω such that, whenever we take action
at = i, we observe reward $r_t \in \mathcal{R} \subset \mathbb{R}$ with probability measure $P_{\omega,i}$.
The next belief ξt+1 is random, since it depends on the random quantity rt . In fact, the
probability of the next reward lying in some set R for at = i is given by the marginal
distribution
\[ P_{\xi_t,i}(R) \triangleq \int_{\Omega} P_{\omega,i}(R) \, d\xi_t(\omega). \]
In practice, although multiple reward sequences may lead to the same beliefs, we
frequently ignore that possibility for simplicity. Then the process obtains a tree-like
structure. A solution to the problem of which action to select is given by the following
backwards induction algorithm for bandits, similar to the backwards induction
algorithm given in Sect. 5.2.2, i.e.,
\[ U^*(\xi_t) = \max_{a_t} \Big\{ \mathbb{E}(r_t \mid \xi_t, a_t) + \sum_{\xi_{t+1}} P(\xi_{t+1} \mid \xi_t, a_t) \, U^*(\xi_{t+1}) \Big\}. \]
If you look at this structure, you can see that the next belief only depends on the
current belief, action, and reward, i.e., it satisfies the Markov property, as seen in
Fig. 6.1.
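To make the recursion concrete, the following Python sketch applies backwards induction to the Beta-Bernoulli case. It rests on assumptions that are not part of the text: rewards are in {0, 1}, there is no discounting, and beliefs are represented by tuples of Beta parameters. The function name `bandit_value` is hypothetical.

```python
from functools import lru_cache


def bandit_value(alpha, beta, horizon):
    """Expected total reward of the Bayes-optimal policy over `horizon` pulls.

    alpha, beta: Beta parameters of the current belief, one entry per arm.
    """

    @lru_cache(maxsize=None)
    def U(a, b, steps_left):
        if steps_left == 0:
            return 0.0
        best = float("-inf")
        for i in range(len(a)):
            p = a[i] / (a[i] + b[i])                  # predictive P(r = 1 | belief, arm i)
            a_win = a[:i] + (a[i] + 1,) + a[i + 1:]   # next belief after reward 1
            b_lose = b[:i] + (b[i] + 1,) + b[i + 1:]  # next belief after reward 0
            value = (p * (1.0 + U(a_win, b, steps_left - 1))
                     + (1.0 - p) * U(a, b_lose, steps_left - 1))
            best = max(best, value)
        return best

    return U(tuple(alpha), tuple(beta), horizon)


print(bandit_value((1, 1), (1, 1), 2))  # two arms with uniform priors, two pulls
```

Caching on the belief merges reward sequences that lead to the same belief, so the recursion visits each reachable belief only once per remaining horizon instead of expanding the full tree.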
In reality, the reward depends only on the action and the unknown ω, as can be
seen in Fig. 6.2a. This is the point of view of an external observer. If we want to
add the decision maker’s internal belief to the graph, we obtain Fig. 6.2b. From the
point of view of the decision maker, the distribution of ω only depends on his current
belief. Consequently, the distribution of rewards also only depends on the current
belief, as we can marginalize over ω. This gives rise to the decision-theoretic bandit
process shown in Fig. 6.2c.
[Figure panels: (a) The basic process, (b) The Bayesian model, (c) The lifted process]
Fig. 6.2 Three views of the bandit process. The figure (a) shows the basic bandit process from the
view of an external observer. The decision maker selects at and then obtains reward rt , while the
parameter ω is hidden. The process is repeated for t = 1, . . . , T . The Bayesian model is shown in
(b) and the resulting process in (c). While ω is not known, at each time step t we maintain a belief
ξt on Ω. The reward distribution is then defined through our belief. In (b), we can see the complete
process, where the dependency on ω is clear. In (c), we marginalize out ω and obtain a model where
the transitions only depend on the current belief and action
A decision-theoretic bandit process can be modelled as a Markov decision process,
a more general setting that we shall consider in the following section.
It turns out that backwards induction, as well as other efficient algorithms, can provide
optimal solutions for Markov decision processes, too.
The bandit setting is one of the simplest instances of so-called reinforcement learning
problems. Informally speaking, these are problems of learning how to act in an
unknown environment, only through interaction with the environment and limited
reinforcement signals. The learning agent interacts with the environment by choosing
actions that give certain observations and rewards. The goal is usually to maximize
some measure of the accumulated reward. For example, we can consider a mouse
running through a maze, where the reward is finding one or all pieces of cheese
hidden in the maze.
Generally, we assume that the environment μ that we are acting in has an under-
lying state st ∈ S, which changes in discrete time steps t. At each step, the agent
obtains an observation xt ∈ X and chooses an action at ∈ A. We usually assume
that the environment is such that its next state st+1 only depends on its current
state st and the last action at taken by the agent. In addition, the agent observes a
reward signal rt , and its goal is to maximize the total reward during its lifetime.
When the environment μ is unknown, this problem is hard even in seemingly
simple settings like the multi-armed bandit, where the underlying state never changes.
In many real-world applications, the problem is even harder, as the state often is not
directly observed. Instead, we may have to rely on the observables xt , which give
only partial information about the true underlying state st .
Reinforcement learning problems typically fall into one of the following three
categories: (1) Markov decision processes (MDPs), where the state st is observed
directly, i.e., xt = st ; (2) partially observable MDPs (POMDPs), where the state is
hidden, i.e., xt is only probabilistically dependent on the state st ; and (3) stochastic
Markov games, where the next state also depends on the move of other agents. While
all of these problem descriptions are different, in the Bayesian setting they all can be
reformulated as MDPs by constructing an appropriate belief state, similarly to the
decision theoretic formulation of the bandit problem.
In this chapter, we shall confine our attention to Markov decision processes.
Hence, we shall not discuss the case where we cannot observe the state directly, or
consider the existence of other agents.
Definition 6.3.1 (Markov decision process) A Markov decision process (MDP)
μ is a tuple μ = ⟨S, A, P, R⟩, where S is the state space and A is the action
space. The transition distribution P = { p(· | s, a) | s ∈ S, a ∈ A} is a collection
of probability measures on S, indexed in S × A, and the reward distribution
R = {ρ(· | s, a) | s ∈ S, a ∈ A} is a collection of probability measures on ℝ, such
that p(· | s, a) and ρ(· | s, a) are the distributions of the next state and of the reward,
respectively, when action a is taken in state s.
Usually, also an initial state s0 (or more generally, an initial distribution from which
s0 is sampled) is specified.
In the following, we usually assume only MDPs with finite state and action spaces
S, A. By definition, rewards and transition probabilities of an MDP are time-invariant,
although one could assume more general models. In any case, however, rewards and
transitions in an MDP μ shall satisfy the following Markov property (Fig. 6.3).
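Rewards may also be allowed to depend on the next state, i.e.,
\[ r_\mu(s, a, s') = \mathbb{E}_\mu(r_{t+1} \mid s_t = s, a_t = a, s_{t+1} = s'), \]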
though this complicates the notation considerably, since now the reward is obtained
on the next time step. However, we can always replace this with the expected reward
for a given state-action pair, i.e.,
\[ r_\mu(s, a) = \mathbb{E}_\mu(r_{t+1} \mid s_t = s, a_t = a) = \sum_{s' \in S} p_\mu(s' \mid s, a) \, r_\mu(s, a, s'). \]
In a simpler setting, rewards only depend on the current state, so that we can write
$r_\mu(s)$ for the expected reward in state s.
Policies. A policy π (sometimes also called decision function) specifies which action
to take. One can think of a policy as implemented through an algorithm or an embodied
agent.
A policy π defines a conditional distribution on actions given the history.
In general, policies map histories to actions. In certain cases, however, there are
optimal policies that are Markov. This is for example the case with additive utility
functions U : R∗ → R, which map the sequence of all possible rewards to a real
number as specified in the following definition.
Definition 6.3.2 (Additive utility) The utility function U : R∗ → R is defined as
\[ U(r_0, r_1, \ldots, r_T) = \sum_{k=0}^{T} \gamma^k r_k, \]
where T is the horizon, after which the agent is no longer interested in rewards, and
γ ∈ (0, 1] is the discount factor, which discounts future rewards. It is convenient
to introduce a special notation for the utility starting from time t, i.e., the sum of
rewards from that time on:
\[ U_t \triangleq \sum_{k=0}^{T-t} \gamma^k r_{t+k}. \]
At any time t, the agent wants to find a policy π maximizing the expected total
future reward (expected utility)
\[ \mathbb{E}^{\pi}_{\mu} U_t = \mathbb{E}^{\pi}_{\mu} \sum_{k=0}^{T-t} \gamma^k r_{t+k}. \]
This is identical to the expected utility framework we have seen so far, with
the only differences being that the rewards now form a sequence of numerical values
and that we are acting within a dynamical system with state space S. In fact, it is
a good idea to think about the value of different states of the system under certain
policies in the same way that one thinks about how good different positions are in a
board game like chess.
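As a small worked instance of the utility Ut, the following Python sketch computes the discounted sum of a given reward sequence. The function name `utility_from` and the example rewards are assumptions made for illustration only.

```python
def utility_from(rewards, t, gamma=1.0):
    """Return U_t = sum_{k=0}^{T-t} gamma^k * r_{t+k} for rewards r_0, ..., r_T."""
    total = 0.0
    # Accumulate backwards, using U_t = r_t + gamma * U_{t+1}.
    for r in reversed(rewards[t:]):
        total = r + gamma * total
    return total


rewards = [1.0, 0.0, 2.0]
print(utility_from(rewards, 0, gamma=0.5))  # 1 + 0.5 * 0 + 0.25 * 2 = 1.5
print(utility_from(rewards, 1, gamma=0.5))  # 0 + 0.5 * 2 = 1.0
```

The backwards accumulation uses the same identity, Ut = rt + γ Ut+1, that underlies the backwards induction algorithms for value functions.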
Value functions represent the expected utility of a given state or state-action pair for
a specific policy. They are useful as shorthand notation and as the basis for algorithm
development. The most basic one is the state value function.
The state value function for a particular policy π in an MDP μ can be interpreted
as how much utility you should expect if you follow the policy starting from state s
at time t.
The optimal value function V ∗ is the value function of an optimal policy π ∗ , i.e.,
\[ V^*_{\mu,t}(s) \triangleq V^{\pi^*}_{\mu,t}(s), \qquad Q^*_{\mu,t}(s, a) \triangleq Q^{\pi^*}_{\mu,t}(s, a). \]
The conceptually simplest type of problems are finite horizon problems where T <
∞ and γ = 1. The first thing we shall try is to evaluate a given policy π for a given
MDP, that is, compute $V^{\pi}_{\mu,t}(s)$ for all states s and t = 0, 1, . . . , T . There are a number
of algorithms that can be used for that purpose.
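By definition,
\[ \begin{aligned}
V^{\pi}_{\mu,t}(s) &\triangleq \mathbb{E}^{\pi}_{\mu}(U_t \mid s_t = s) = \sum_{k=0}^{T-t} \mathbb{E}^{\pi}_{\mu}(r_{t+k} \mid s_t = s) \\
&= \sum_{k=t}^{T} \sum_{s' \in S} \mathbb{E}^{\pi}_{\mu}(r_k \mid s_k = s') \, \mathbb{P}^{\pi}_{\mu}(s_k = s' \mid s_t = s),
\end{aligned} \tag{6.4.1} \]
and one can try to compute the value function by (6.4.1), using that
\[ \mathbb{P}^{\pi}_{\mu}(s_k = s' \mid s_t = s) = \sum_{s'' \in S} \mathbb{P}^{\pi}_{\mu}(s_k = s' \mid s_{k-1} = s'', s_t = s) \, \mathbb{P}^{\pi}_{\mu}(s_{k-1} = s'' \mid s_t = s). \]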
However, the computational cost of this direct policy evaluation is quite high, as it
results in a total of $|S|^3 T$ operations if the value function is to be computed for all
time steps up to T .
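Noting that
\[ \begin{aligned}
V^{\pi}_{\mu,t}(s) &\triangleq \mathbb{E}^{\pi}_{\mu}(U_t \mid s_t = s) \\
&= \mathbb{E}^{\pi}_{\mu}(r_t \mid s_t = s) + \mathbb{E}^{\pi}_{\mu}(U_{t+1} \mid s_t = s) \\
&= \mathbb{E}^{\pi}_{\mu}(r_t \mid s_t = s) + \sum_{s' \in S} p_\mu(s' \mid s, \pi(s)) \, V^{\pi}_{\mu,t+1}(s')
\end{aligned} \]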
provides a recursion that can be used for the backwards induction algorithm shown as
Algorithm 6.1, which is similar to the backwards induction algorithm we have already
seen for sequential sampling and bandit problems. However, here in a first step we
are only evaluating a given policy π rather than finding the optimal one.
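The recursion can be turned into a short program. The following Python sketch is not the book's Algorithm 6.1; it is an illustration under stated assumptions: states and actions are indexed by integers, `P[s][a][s2]` holds transition probabilities, `R[s][a]` holds expected rewards, the policy is deterministic and stationary, and γ = 1. The two-state MDP at the end is made up for the example.

```python
def evaluate_policy(P, R, policy, horizon):
    """Backwards induction for a fixed deterministic policy, with gamma = 1.

    Returns V with V[t][s] = expected utility from time t on, for t = 0..horizon.
    """
    n_states = len(P)
    V = [[0.0] * n_states for _ in range(horizon + 1)]
    for s in range(n_states):
        V[horizon][s] = R[s][policy[s]]  # last step: only the immediate reward
    for t in range(horizon - 1, -1, -1):
        for s in range(n_states):
            a = policy[s]
            V[t][s] = R[s][a] + sum(P[s][a][s2] * V[t + 1][s2] for s2 in range(n_states))
    return V


# Hypothetical MDP: action 0 stays, action 1 switches state; staying in state 1 pays 1.
P = [[[1, 0], [0, 1]], [[0, 1], [1, 0]]]
R = [[0, 0], [1, 0]]
V = evaluate_policy(P, R, policy=[1, 0], horizon=2)
print(V[0])  # values at time 0 for both states
```

Each time step only needs the values of the following step, which is what makes this cheaper than evaluating (6.4.1) directly.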
Remark 6.4.1 The backwards induction algorithm gives estimates $\hat{V}_t(s)$ satisfying
\[ \hat{V}_t(s) = V^{\pi}_{\mu,t}(s). \]
Proof First we show that $\hat{V}_t(s) \ge V^{*}_{\mu,t}(s)$. For t = T we evidently have
$\hat{V}_T(s) = \max_a r(s, a) = V^{*}_{\mu,T}(s)$. We proceed by induction, assuming that
$\hat{V}_{t+1}(s) \ge V^{*}_{\mu,t+1}(s)$ holds. Then for any policy π
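\[ \begin{aligned}
\hat{V}_t(s) &= \max_a \Big\{ r(s, a) + \sum_{s' \in S} p_\mu(s' \mid s, a) \, \hat{V}_{t+1}(s') \Big\} \\
&\ge \max_a \Big\{ r(s, a) + \sum_{s' \in S} p_\mu(s' \mid s, a) \, V^{*}_{\mu,t+1}(s') \Big\} && \text{(by induction assumption)} \\
&\ge \max_a \Big\{ r(s, a) + \sum_{s' \in S} p_\mu(s' \mid s, a) \, V^{\pi}_{\mu,t+1}(s') \Big\} && \text{(by optimality)} \\
&\ge V^{\pi}_{\mu,t}(s).
\end{aligned} \]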
Choosing π = π ∗ to be the optimal policy, this completes the induction proof. Finally,
we note that for the policy π returned by backwards induction we have
\[ V^{*}_{\mu,t}(s) \ge V^{\pi}_{\mu,t}(s) = \hat{V}_t(s) \ge V^{*}_{\mu,t}(s). \]
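The maximization used in the proof can also be sketched in Python. The sketch below is an illustration rather than the book's algorithm listing; the array layout (`P[s][a][s2]`, `R[s][a]`), the function name `backwards_induction`, and the two-state example are assumptions.

```python
def backwards_induction(P, R, horizon):
    """Finite-horizon backwards induction with gamma = 1.

    Returns (V, policy), where V[t][s] is the optimal value and policy[t][s]
    a maximizing action at time t in state s, for t = 0..horizon.
    """
    n_states, n_actions = len(P), len(P[0])
    V = [[0.0] * n_states for _ in range(horizon + 1)]
    policy = [[0] * n_states for _ in range(horizon + 1)]
    for t in range(horizon, -1, -1):
        for s in range(n_states):
            best_a, best_q = 0, float("-inf")
            for a in range(n_actions):
                q = R[s][a]
                if t < horizon:  # add the expected value of the next step
                    q += sum(P[s][a][s2] * V[t + 1][s2] for s2 in range(n_states))
                if q > best_q:
                    best_a, best_q = a, q
            V[t][s], policy[t][s] = best_q, best_a
    return V, policy


# Hypothetical MDP: action 0 stays, action 1 switches state; staying in state 1 pays 1.
P = [[[1, 0], [0, 1]], [[0, 1], [1, 0]]]
R = [[0, 0], [1, 0]]
V, policy = backwards_induction(P, R, horizon=2)
print(V[0], policy[0])
```

Note that the returned policy depends on the time step, which is typical for finite horizon problems.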
6.5 Infinite-Horizon
When problems have no fixed horizon, they usually can be modelled as infinite
horizon problems, sometimes with the help of a terminal state, whose visit terminates
the problem, or discounted rewards, which indicate that we care less about rewards
further in the future. When reward discounting is exponential, these problems can be
seen as undiscounted problems with random and geometrically distributed horizon.
For problems with no discounting and no terminal states there are some complications
in the definition of optimal policy. However, we defer discussion of such problems
to Chap. 10.
6.5.1 Examples
We begin with some examples, which will help elucidate the concept of terminal
states and infinite horizon.
First we consider shortest path problems, where the aim is to find the shortest path to a
particular goal. Although the process terminates when the goal is reached, not all poli-
cies may be able to reach the goal, and so the process may never terminate. We shall
consider two types of shortest path problems, deterministic and stochastic. Although
conceptually different, both problems have essentially the same complexity.
[Figure: maze for the deterministic shortest path problem, with goal state X and numbered cells]
Properties
• γ = 1, T → ∞.
• rt = −1 unless st = X, in which case rt = 0.
• Pμ (st+1 = X | st = X) = 1.
• A = {North, South, East, West}.
• Transitions are deterministic and walls block.
Properties
• γ = 1, T → ∞.
• Pμ (st+1 = X | st = X) = 1.
• Pμ (st+1 = O | st = O) = 1.
[Figure panels: (a) ω = 0.1, (b) ω = 0.5, (c) value]
Many problems have no natural terminal state, but continue ad infinitum. A popular
model to guarantee that the utility is still bounded is to exponentially discount future
rewards. This also has an economic interpretation. On the one hand, discounting
takes into account the effects of inflation; on the other hand, money available now
may be more useful than money one obtains in the future. Both these effects diminish
the value of money over time. In this section we consider some basics of MDPs with
infinite horizon and discounted rewards, when the utility is given by
\[ U_t = \lim_{T \to \infty} \sum_{k=t}^{T} \gamma^k r_k, \qquad \gamma \in (0, 1). \]
For simplicity, in the following we assume that rewards only depend on the current
state instead of both state and action. It can easily be verified that the results presented
below can be adapted to the latter case. Henceforth, we will also often drop the
dependence on μ in the notation, if the considered MDP μ is clear from the context.
As we assume finite state and action spaces S, A as well as a time-invariant transition
kernel we may use the following simplified vector notation:
• $r = (r(s))_{s \in S}$ is the reward vector in $\mathbb{R}^{|S|}$.
• We will use $P_\pi$ to denote the transition matrix in $\mathbb{R}^{|S| \times |S|}$ for policy π, i.e.,
\[ P_\pi(s, s') = \sum_{a} p(s' \mid s, a) \, \mathbb{P}^{\pi}(a \mid s). \]
Proof Let rt be the random reward at step t when starting in state s and following
policy π. Then
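\[ \begin{aligned}
v^{\pi}(s) &= \mathbb{E}^{\pi}\Big( \sum_{t=0}^{\infty} \gamma^t r_t \,\Big|\, s_0 = s \Big) \\
&= \sum_{t=0}^{\infty} \gamma^t \, \mathbb{E}^{\pi}(r_t \mid s_0 = s) \\
&= \sum_{t=0}^{\infty} \gamma^t \sum_{s' \in S} \mathbb{P}^{\pi}(s_t = s' \mid s_0 = s) \, \mathbb{E}(r_t \mid s_t = s') \\
&= \sum_{t=0}^{\infty} \gamma^t \, (P_\pi^t r)(s),
\end{aligned} \]
as the entries of $P_\pi^t$ are precisely the t-step transition probabilities, and for any
distribution vector p over S, we have $\mathbb{E}_p r_t = p^\top r$.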
One can show that the expected discounted total reward of a policy is equal to
the expected undiscounted total reward with a geometrically distributed horizon (see
Exercise 6.1). Accordingly, an MDP with discounted rewards is equivalent to one
where there is no discounting but a stopping probability (1 − γ) at every step.
The value of a particular policy can be expressed as the solution of a linear equation.
This is an important result that has led to a number of successful algorithms that
employ linear algebra.
Theorem 6.5.1 For any stationary policy π, $v^{\pi}$ is the unique solution of
\[ v = r + \gamma P_\pi v. \tag{6.5.1} \]
\[ (I - A)^{-1} = \lim_{T \to \infty} \sum_{t=0}^{T} A^t. \tag{6.5.2} \]
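Proof of Theorem 6.5.1 First note that from (6.5.1) one obtains $r = (I - \gamma P_\pi) v$.
Since $\|\gamma P_\pi\| = \gamma \cdot \|P_\pi\| = \gamma < 1$, the inverse
\[ (I - \gamma P_\pi)^{-1} = \lim_{T \to \infty} \sum_{t=0}^{T} (\gamma P_\pi)^t \]
exists by (6.5.2). Hence $v = (I - \gamma P_\pi)^{-1} r = \sum_{t=0}^{\infty} \gamma^t P_\pi^t r = v^{\pi}$ is the unique solution of (6.5.1).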
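The series representation suggests a simple numerical check. The following Python sketch approximates v π by truncating the series and then measures how well the result satisfies (6.5.1). The function names, the truncation length, and the two-state example are assumptions made for illustration; in practice one would more likely solve the linear system directly with a linear algebra library.

```python
def evaluate_by_series(P_pi, r, gamma, n_terms=200):
    """Approximate v_pi = sum_t gamma^t P_pi^t r by a truncated series."""
    n = len(r)
    term = list(r)  # current term gamma^t P_pi^t r, starting with t = 0
    v = list(r)
    for _ in range(n_terms):
        term = [gamma * sum(P_pi[s][s2] * term[s2] for s2 in range(n)) for s in range(n)]
        v = [v[s] + term[s] for s in range(n)]
    return v


def residual(P_pi, r, gamma, v):
    """Largest violation of the linear equation v = r + gamma P_pi v."""
    n = len(r)
    return max(
        abs(v[s] - r[s] - gamma * sum(P_pi[s][s2] * v[s2] for s2 in range(n)))
        for s in range(n)
    )


# Hypothetical chain that alternates between two states; only state 0 pays a reward.
P_pi = [[0.0, 1.0], [1.0, 0.0]]
r = [1.0, 0.0]
v = evaluate_by_series(P_pi, r, gamma=0.5)
print(v, residual(P_pi, r, 0.5, v))  # v is close to [4/3, 2/3]
```

For this chain the linear equation can be solved by hand: v(0) = 1 + 0.5 v(1) and v(1) = 0.5 v(0), which gives v(0) = 4/3 and v(1) = 2/3.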
Definition 6.5.2 (Policy and Bellman operator) Let v ∈ V. Then the linear operator
of a policy π is given by
\[ \mathcal{L}_\pi v \triangleq r + \gamma P_\pi v. \]
The Bellman operator is given by
\[ \mathcal{L} v \triangleq \max_{\pi} \{ r + \gamma P_\pi v \}. \]
We now show that the Bellman operator satisfies the following monotonicity
properties with respect to an arbitrary value function v. Further, if a value function
v is optimal, then it satisfies the Bellman optimality equation
\[ v = \mathcal{L} v. \]
In the following, we use the notation $v \ge v'$ in short for $v(s) \ge v'(s)$ for all s.
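Proof We first prove (1). A simple proof by induction over n shows that for any π
and any n
\[ v \ge r + \gamma P_\pi v \ge \sum_{t=0}^{n-1} \gamma^t P_\pi^t r + \gamma^n P_\pi^n v. \]
Since $v^{\pi} = \sum_{t=0}^{\infty} \gamma^t P_\pi^t r$, it follows that
\[ v - v^{\pi} \ge \gamma^n P_\pi^n v - \sum_{t=n}^{\infty} \gamma^t P_\pi^t r. \]
Let e be the vector with e(s) = 1 for all s. Then, as rewards are assumed to be in
[0, 1],
\[ \sum_{k=n}^{\infty} \gamma^k P_\pi^k r \le \frac{\gamma^n e}{1 - \gamma}. \]
Letting n → ∞, both terms on the right-hand side of the previous inequality vanish, so that
\[ v \ge v^{\pi}, \]
and hence also v ≥ v ∗ , which completes the proof of (1). A similar argument shows
(2), which together with (1) then implies (3).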
A similar theorem can also be proven for the repeated application of the Bellman
operator.
\[ \mathcal{L} v = r + \gamma P_\pi v \le r + \gamma P_\pi v' \le \max_{\pi'} \{ r + \gamma P_{\pi'} v' \} = \mathcal{L} v', \]
where the first inequality is due to the fact that $P v' \ge P v$ for any transition
matrix P. For the second part, note that we have $\mathcal{L} v_N \le v_N$ by assumption and
hence $\mathcal{L}^k \mathcal{L} v_N \le \mathcal{L}^k v_N$ by the first part of the theorem. It follows that
\[ \mathcal{L} v_{N+k} = v_{N+k+1} = \mathcal{L}^k \mathcal{L} v_N \le \mathcal{L}^k v_N = v_{N+k}. \]
Thus, if one starts with an initial value $v_0 \le v$ for all v ∈ V, then repeated
application of the Bellman operator (known as value iteration as introduced in the
following section) converges monotonically to $v^*$. For example, if rewards are ≥ 0,
one may set $v_0 = 0$ and $v_n = \mathcal{L}^n v_0$ is always a lower bound on the optimal value
function.
We eventually want to show that repeated application of the Bellman operator
always converges to the optimal value, independent of the initial value v 0 . As a
preparation, we need the following definition and the subsequent theorem.
Definition 6.5.3 For a Banach space X (i.e., a complete, normed linear space) we
say that T : X → X is a contraction mapping, if there is γ ∈ [0, 1) such that
$\|T u - T v\| \le \gamma \|u - v\|$ for all u, v ∈ X .
Theorem 6.5.5 (Banach fixed-point theorem) Given a Banach space X and a contraction
mapping T : X → X , it holds that:
1. There is a unique u∗ ∈ X with T u∗ = u∗ .
2. For any u0 ∈ X the sequence (un )n≥0 defined by $u_{n+1} = T u_n = T^{n+1} u_0$ converges
to u∗ .
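\[ \begin{aligned}
\|u_{n+m} - u_n\| &\le \sum_{k=0}^{m-1} \|u_{n+k+1} - u_{n+k}\| = \sum_{k=0}^{m-1} \|T^{n+k} u_1 - T^{n+k} u_0\| \\
&\le \sum_{k=0}^{m-1} \gamma^{n+k} \|u_1 - u_0\| = \frac{\gamma^n (1 - \gamma^m)}{1 - \gamma} \|u_1 - u_0\|.
\end{aligned} \]

\[ \|T u^* - u^*\| \le \|T u^* - u_n\| + \|u_n - u^*\|. \tag{6.5.3} \]

\[ \|u - u'\| = \|T u - T u'\| \le \gamma \|u - u'\|, \]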
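Then, as $a_s^*$ is optimal for v(s), but not necessarily for $v'(s)$, we have
\[ \begin{aligned}
0 \le \mathcal{L} v(s) - \mathcal{L} v'(s) &\le \gamma \sum_{s' \in S} p(s' \mid s, a_s^*) \, v(s') - \gamma \sum_{s' \in S} p(s' \mid s, a_s^*) \, v'(s') \\
&= \gamma \sum_{s' \in S} p(s' \mid s, a_s^*) \, [v(s') - v'(s')] \\
&\le \gamma \sum_{s' \in S} p(s' \mid s, a_s^*) \, \|v - v'\| = \gamma \|v - v'\|.
\end{aligned} \]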
We note that it is easy to adapt this proof to show that $\mathcal{L}_\pi$ is a contraction, too.
the optimal value function due to Theorem 6.5.3. The second part of the theorem
follows from the first part when considering only a single policy π (which then is
optimal).
We now take a look at the basic algorithms for obtaining an optimal policy when
the Markov decision process is known. Value iteration is a simple extension of the
backwards induction algorithm to the infinite horizon case. Alternatively, policy
iteration evaluates and improves a sequence of Markov policies. We also discuss two
variants of these methods, namely modified policy iteration, which is somewhere in
between value and policy iteration, and temporal-difference policy iteration, which is
related to classical reinforcement learning algorithms such as Sarsa and Q-Learning.
Another basic technique is linear programming, which is useful in theoretical
analyses as well as in some special practical cases. While, for the sake of simplicity,
we stick to our assumption that rewards depend only on the state, the algorithms
described below can easily be extended to the case when the reward also depends on
the action.
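A minimal Python sketch of value iteration follows, assuming state-dependent rewards `r[s]`, transition probabilities `P[s][a][s2]`, and a stopping rule based on the change between successive iterates. The function name, the tolerance, and the two-state example are assumptions made for illustration, not the book's algorithm listing.

```python
def value_iteration(P, r, gamma, tol=1e-10, max_iter=100_000):
    """Repeatedly apply the Bellman operator, starting from v_0 = 0.

    Returns the final value vector and a policy that is greedy with respect to it.
    """
    n_states, n_actions = len(P), len(P[0])

    def backup(v, s, a):
        return r[s] + gamma * sum(P[s][a][s2] * v[s2] for s2 in range(n_states))

    v = [0.0] * n_states
    for _ in range(max_iter):
        new_v = [max(backup(v, s, a) for a in range(n_actions)) for s in range(n_states)]
        change = max(abs(x - y) for x, y in zip(new_v, v))
        v = new_v
        if change < tol:  # stop once successive iterates are close
            break
    policy = [max(range(n_actions), key=lambda a: backup(v, s, a)) for s in range(n_states)]
    return v, policy


# Hypothetical MDP: action 0 stays, action 1 switches state; only state 1 pays a reward.
P = [[[1, 0], [0, 1]], [[0, 1], [1, 0]]]
r = [0.0, 1.0]
v, policy = value_iteration(P, r, gamma=0.5)
print(v, policy)  # v is close to [1, 2]
```

In this example the optimal behaviour is to move to state 1 and stay there, so v(1) = 1 / (1 − 0.5) = 2 and v(0) = 0.5 · 2 = 1.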
Proof Statements 1 and 2(a) follow from Theorems 6.5.5 and 6.5.6 from the previous
section. Now note that
\[ \|v^{\pi_n} - v^*\| \le \|v^{\pi_n} - v_n\| + \|v_n - v^*\|. \]
A similar argument gives a respective bound for the second term $\|v_n - v^*\|$. Then,
rearranging we obtain
\[ \|v^{\pi_n} - v_{n+1}\| \le \frac{\gamma}{1 - \gamma} \|v_{n+1} - v_n\|, \qquad \|v_n - v^*\| \le \frac{\gamma}{1 - \gamma} \|v_{n+1} - v_n\|, \]
Theorem 6.5.8 also bounds the error incurred when value iteration is stopped as soon
as the change in the estimated value function from one iteration to the next falls below
a small preset threshold. As we have already discussed in the context of Theorem 6.5.4 in
the previous section, initializing v 0 = 0 guarantees that value iteration converges
monotonically. The following theorem shows that value iteration converges
with an error of $O(\gamma^n)$ in this case.
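Theorem 6.5.9 Let v 0 = 0 and assume that rewards are bounded in [0, 1]. Then
\[ \|v_n - v^*\| \le \frac{\gamma^n}{1 - \gamma}, \quad \text{and} \quad \|v^{\pi_n} - v^*\| \le \frac{2\gamma^n}{1 - \gamma}. \]
Proof For n = 0, we have
\[ \|v_0 - v^*\| = \|v^*\| \le \frac{1}{1 - \gamma} = \frac{\gamma^0}{1 - \gamma}. \]
Proceeding by induction over n proves the first claim, as by the contraction property
of Theorem 6.5.6 we have
\[ \|v_{n+1} - v^*\| = \|\mathcal{L} v_n - \mathcal{L} v^*\| \le \gamma \|v_n - v^*\|. \]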
Unlike value iteration, policy iteration attempts to iteratively improve a given policy,
rather than a value function. Starting with an arbitrary policy π0 , at each iteration it
first computes the value of the current policy. In finite MDPs this policy evaluation
step can be performed with either linear algebra or backwards induction. In a second
step called policy improvement the policy is updated by choosing the policy that is
greedy with respect to the value function computed in the evaluation step.
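The two steps can be sketched in Python as follows. This is an illustrative sketch under assumptions that are not part of the text: rewards depend only on the state, policies are deterministic, and the evaluation step solves the linear system with a small hand-written Gauss-Jordan routine (`solve`) so that the example needs no external library. All names and the two-state example are hypothetical.

```python
def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [list(A[i]) + [b[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda i: abs(M[i][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for i in range(n):
            if i != col:
                factor = M[i][col] / M[col][col]
                M[i] = [x - factor * y for x, y in zip(M[i], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]


def policy_iteration(P, r, gamma):
    """Alternate policy evaluation and greedy improvement until the policy is stable."""
    n_states, n_actions = len(P), len(P[0])
    policy = [0] * n_states
    while True:
        # Policy evaluation: solve (I - gamma P_pi) v = r.
        A = [[(1.0 if s == s2 else 0.0) - gamma * P[s][policy[s]][s2]
              for s2 in range(n_states)] for s in range(n_states)]
        v = solve(A, r)

        def q(s, a):
            return r[s] + gamma * sum(P[s][a][s2] * v[s2] for s2 in range(n_states))

        # Policy improvement: act greedily, keeping the current action on ties.
        new_policy = []
        for s in range(n_states):
            best = max(range(n_actions), key=lambda a: q(s, a))
            if q(s, policy[s]) >= q(s, best) - 1e-12:
                best = policy[s]
            new_policy.append(best)
        if new_policy == policy:
            return v, policy
        policy = new_policy


# Hypothetical MDP: action 0 stays, action 1 switches state; only state 1 pays a reward.
P = [[[1, 0], [0, 1]], [[0, 1], [1, 0]]]
r = [0.0, 1.0]
v, policy = policy_iteration(P, r, gamma=0.5)
print(v, policy)
```

Keeping the current action when it is already greedy avoids cycling between equally good policies, so the loop stops as soon as no strict improvement is possible.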
The following theorem shows that the policies generated by policy iteration are
monotonically improving.
Theorem 6.5.10 For value vectors v n , v n+1 generated by policy iteration it holds
that v n ≤ v n+1 .
Proof From the policy improvement step
\[ r + \gamma P_{\pi_{n+1}} v_n \ge r + \gamma P_{\pi_n} v_n = v_n, \]
where the equality is due to the policy evaluation step for πn . Rearranging, we get
$r \ge (I - \gamma P_{\pi_{n+1}}) v_n$ and hence
\[ (I - \gamma P_{\pi_{n+1}})^{-1} r \ge v_n. \]
By Theorem 6.5.1 and the policy evaluation step for πn+1 the left hand side equals
v n+1 , which completes the proof.
1 Thus, the result is weakly polynomial complexity, due to the dependence on the input size description.
Theorem 6.5.10 can be used to show that policy iteration terminates after a finite
number of steps.
Corollary 6.5.1 Policy iteration terminates after a finite number of iterations and
returns an optimal policy.
Proof There is only a finite number of policies, and since policies in policy iteration
are monotonically improving, the algorithm must stop after finitely many iterations.
Finally, the last iteration satisfies
\[ v_n = \max_{\pi} \{ r + \gamma P_\pi v_n \}, \]
that is, v n solves the optimality equation and the claim follows by Theorem 6.5.3.
As even in finite MDPs, there are $|A|^{|S|}$ policies, Corollary 6.5.1 only guarantees
exponential-time convergence in the number of states. However, the complexity of
policy iteration can be shown to be actually strongly polynomial [5] with the number
of required iterations being $\tilde{O}(|S|^2 |A| (1 - \gamma)^{-1})$, again omitting logarithmic factors.
As the behaviour of policy iteration seems to be quite different from value iteration,
one is interested in algorithms that lie between policy iteration and value iteration.
We will have a look at two such algorithms in the following two subsections.
It is perhaps interesting to see the problem from a geometric perspective. This also
gives rise to the so-called temporal-difference algorithms which will be considered
below. First, we define the difference operator, which is the difference between a
value function vector v and its transformation via the Bellman operator.
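Definition 6.5.4 The difference operator is defined as
\[ B v \triangleq \max_{\pi} \{ r + (\gamma P_\pi - I) v \} = \mathcal{L} v - v. \]
With this notation, the Bellman optimality equation $v = \mathcal{L} v$ can be rewritten as
\[ B v = 0. \]
Writing $\Pi_v$ for the set of policies that are greedy with respect to v,
we can show the following inequality between two value function vectors.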
Theorem 6.5.11 For any $v, v' \in V$ and $\pi \in \Pi_v$
\[ B v' \ge B v + (\gamma P_\pi - I)(v' - v). \]
Theorem 6.5.12 Let (v n )n≥0 be the sequence of value vectors obtained from policy
iteration. Then for any $\pi \in \Pi_{v_n}$,
\[ v_{n+1} = v_n + (I - \gamma P_\pi)^{-1} B v_n. \]
[Figure: the difference operator Bv plotted against v, showing the greedy policies π ∗ , π1 , π2 with respect to each value function v ∗ , v 1 , v 2 ]
\[ \begin{aligned}
v_{n+1} &= (I - \gamma P_\pi)^{-1} r - v_n + v_n \\
&= (I - \gamma P_\pi)^{-1} [r - (I - \gamma P_\pi) v_n] + v_n.
\end{aligned} \]

\[ \mathcal{L}_{\pi_{n+1}} v_n = \mathcal{L} v_n. \]
In order to update the value from v n to v n+1 we rely on the temporal difference
defined as
\[ d_n(s, s') = r(s) + \gamma v_n(s') - v_n(s). \]
The temporal difference error can be seen as the difference in the estimate when we
move from state s to state $s'$. In fact, it is easy to see that if the value function estimate
satisfies $v_n = v^{\pi_n}$, then the expected error is zero, as
\[ \sum_{s' \in S} d_n(s, s') \, p(s' \mid s, \pi_n(s)) = \sum_{s' \in S} [r(s) + \gamma v_n(s')] \, p(s' \mid s, \pi_n(s)) - v_n(s). \]
Note the similarity to the difference operator in modified policy iteration. The idea
of temporal-difference policy iteration is to adjust the current value v n , using the
temporal differences mixed over an infinite number of steps, that is, we set
$v_{n+1} = v_n + \tau_n$, where
\[ \tau_n(s) = \mathbb{E}_{\pi_n} \Big[ \sum_{t=0}^{\infty} (\gamma\lambda)^t \, d_n(s_t, s_{t+1}) \,\Big|\, s_0 = s \Big]. \]
The parameter λ is a simple way to weight the different temporal difference errors.
If λ → 0, the error τ n is dominated by the short-term discrepancies in our value
function, while for λ → 1 also the terms far in the future matter. In the end, the value
function is adjusted in the direction of this error.
Putting all of those steps together, the algorithm looks as follows.
It can be shown that v n+1 is the unique fixed point of the equation
\[ D_n v \triangleq (1 - \lambda) \mathcal{L}_{\pi_{n+1}} v_n + \lambda \mathcal{L}_{\pi_{n+1}} v. \]
That is, if we repeatedly apply the above operator to some vector v, then we approach
a fixed point $v^* = D_n v^*$. It is interesting to see what happens at the two extreme
choices of λ in this case. For λ = 1, this becomes standard policy iteration, as the
fixed point satisfies $v^* = \mathcal{L}_{\pi_{n+1}} v^*$ so that v ∗ must be the value of policy πn+1 . For
λ = 0, one obtains standard value iteration, as the fixed point is reached in one
step and is $v^* = \mathcal{L}_{\pi_{n+1}} v_n$, i.e., the approximate value of the one-step greedy policy.
In general, the new value vector is moved only partially towards the direction of the
Bellman update, depending on how we choose λ.
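The role of λ can be checked numerically by iterating the operator Dn until it stops changing. The Python sketch below is an illustration only: it assumes that the transition matrix of the greedy policy πn+1 is given as `P_next`, that rewards depend only on the state, and that a fixed number of sweeps suffices (the operator contracts with factor γλ). The function name and the example are hypothetical.

```python
def lambda_update(P_next, r, gamma, lam, v_n, n_sweeps=500):
    """Iterate D_n v = (1 - lam) * L_pi v_n + lam * L_pi v towards its fixed point.

    L_pi v = r + gamma * P_next v, where P_next is the transition matrix of pi_{n+1}.
    """
    n = len(r)

    def L_pi(v):
        return [r[s] + gamma * sum(P_next[s][s2] * v[s2] for s2 in range(n)) for s in range(n)]

    target = L_pi(v_n)  # L_pi v_n stays fixed during the sweeps
    v = list(v_n)
    for _ in range(n_sweeps):
        Lv = L_pi(v)
        v = [(1 - lam) * target[s] + lam * Lv[s] for s in range(n)]
    return v


# Hypothetical greedy policy: state 0 moves to state 1, state 1 stays; only state 1 pays.
P_next = [[0.0, 1.0], [0.0, 1.0]]
r = [0.0, 1.0]
v_n = [0.0, 0.0]
print(lambda_update(P_next, r, 0.5, 0.0, v_n))  # one Bellman backup of v_n
print(lambda_update(P_next, r, 0.5, 1.0, v_n))  # close to the value of the policy, [1, 2]
```

With λ = 0 the result is the single backup of v n, as in value iteration; with λ = 1 it is the full value of πn+1, as in policy iteration; intermediate values of λ interpolate between the two.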
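Perhaps surprisingly, we can also solve an MDP through linear programming,
reformulating the maximization problem as a linear optimization problem with linear
constraints. As a first step we recall that there is an easy way to determine whether
a particular v is an upper bound on the optimal value function: any v with
$v \ge \mathcal{L} v$ satisfies $v \ge v^*$. This suggests the linear program
\[ \min_{v} \; y^\top v \quad \text{such that} \quad v(s) - \gamma \, p_{s,a}^\top v \ge r(s) \quad \text{for all } s \in S, \; a \in A, \]
with $y \in \Delta(S)$.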
In this case, the respective vector $x \in \mathbb{R}^{|S \times A|}$ can be interpreted as the discounted
sum of state-action visits, as shown in the following theorem.
Theorem 6.5.13 For any policy π,
\[ x_\pi(s, a) = \mathbb{E}^{\pi} \Big[ \sum_{t \ge 0} \gamma^t \, \mathbb{I}\{s_t = s, a_t = a\} \,\Big|\, s_0 \sim y \Big] \]
\[ \pi_x(a \mid s) = \frac{x(s, a)}{\sum_{a' \in A} x(s, a')}, \]
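\[ U_t^T = \sum_{k=0}^{T-t} r_{t+k} \quad \text{with } T \sim \mathrm{Geom}(1 - \gamma), \quad \text{so that} \quad \mathbb{E}\, U_t = \sum_{k=0}^{\infty} \gamma^k \, \mathbb{E}\, r_{t+k}. \]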
The average reward (gain) criterion. Besides the expected total reward, defined as
$V_t^{\pi,T} \triangleq \mathbb{E}^{\pi} U_t^T$, the expected average reward is a natural criterion.
\[ g^{\pi}(s) \triangleq \lim_{T \to \infty} \frac{1}{T} V^{\pi,T}(s), \]
that is, the expected average reward the policy obtains when starting in s.
If the limit in Definition 6.6.2 does not exist, one may consider the limits
\[ g_+^{\pi}(s) \triangleq \limsup_{T \to \infty} \frac{1}{T} V^{\pi,T}(s), \qquad g_-^{\pi}(s) \triangleq \liminf_{T \to \infty} \frac{1}{T} V^{\pi,T}(s). \]
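\[ \liminf_{T \to \infty} \frac{1}{T} V^{\pi^*,T}(s) - V_+^{\pi}(s) \ge 0 \quad \forall s \in S, \; \pi \in \Pi, \]
where $V_+^{\pi}(s) \triangleq \mathbb{E}^{\pi} \big( \sum_{t=1}^{\infty} \max\{r_t, 0\} \mid s_t = s \big)$.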
Lemma 6.6.1 If a policy is m-discount optimal then it is n-discount optimal for all
n ≤ m.
6.7 Summary
Markov decision processes can be used to represent shortest path problems, stopping
problems, experiment design problems, multi-armed bandits, and more general
reinforcement learning problems.
Bandit problems are the simplest type of Markov decision process, since they
have a single fixed, never-changing state. However, to solve them, one can construct
a Markov decision process in belief space within a Bayesian framework. It is then
possible to apply backwards induction to find the optimal policy.
Backwards induction is applicable more generally to arbitrary Markov decision
processes. For the case of infinite-horizon problems, it is referred to as value iteration,
as it converges to a fixed point. It is tractable when either the state space S or the
horizon T are small (finite).
When the horizon is infinite, policy iteration can also be used to find optimal
policies. It is different from value iteration in that at every step, it fully evaluates
a policy before the improvement step, while value iteration only performs a partial
evaluation. In fact, at the nth iteration, value iteration has calculated the value of an
n-step policy.
We can arbitrarily mix between the two extremes of policy iteration and value
iteration in two ways. Firstly, we can perform a k-step partial evaluation, which
is called modified policy iteration. When k = 1 this coincides with value iteration,
while for k → ∞, one obtains policy iteration. Secondly, we can adjust our value
function by using a temporal difference error of values in future time steps. Again,
we can mix liberally between policy iteration and value iteration by focusing on
errors far in the future (policy iteration) or on short-term errors (value iteration).
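The k-step partial evaluation can be sketched in a few lines. This is an illustrative sketch rather than the book's algorithm: it assumes a tabular MDP stored as nested lists, with `P[a][s][j]` the probability of moving from s to j under action a and `r[s][a]` the reward. These names and the layout are hypothetical.

```python
def greedy(P, r, v, gamma):
    """Return a deterministic policy that is greedy with respect to v."""
    n_states, n_actions = len(r), len(r[0])

    def q(s, a):
        return r[s][a] + gamma * sum(P[a][s][j] * v[j] for j in range(n_states))

    return [max(range(n_actions), key=lambda a: q(s, a)) for s in range(n_states)]


def modified_policy_iteration(P, r, gamma, k, n_iter=200):
    """Alternate greedy improvement with k sweeps of partial policy evaluation."""
    n_states = len(r)
    v = [0.0] * n_states
    for _ in range(n_iter):
        pi = greedy(P, r, v, gamma)  # policy improvement
        for _ in range(k):  # k = 1 reproduces a value iteration backup
            v = [r[s][pi[s]]
                 + gamma * sum(P[pi[s]][s][j] * v[j] for j in range(n_states))
                 for s in range(n_states)]
    return v, pi


# Hypothetical two-state example: action 0 stays, action 1 switches state,
# and only staying in state 1 is rewarded.
P = [[[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]]]
r = [[0.0, 0.0], [1.0, 0.0]]
v1, pi1 = modified_policy_iteration(P, r, gamma=0.9, k=1)
v5, pi5 = modified_policy_iteration(P, r, gamma=0.9, k=5)
```

In this toy problem both settings reach the same values, v(0) = 9 and v(1) = 10; a larger k simply spends more work on evaluating each intermediate policy.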
Finally, it is possible to solve MDPs through linear programming, reformulating
the MDP as a linear optimization problem with constraints. In the primal formulation,
we attempt to find a minimal upper bound on the optimal value function. In the dual
formulation, our goal is to find a distribution on state-action visits that maximizes
expected utility and is consistent with the MDP model.
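For concreteness, the two programs can be written out as follows. This is a sketch in one standard notation, which is an assumption and may differ from the one used in Sect. 6.5: y is any strictly positive weighting of the states (for instance an initial state distribution), r(s, a) the reward, P(j | s, a) the transition kernel, and x(s, a) the state-action visit variables.

```latex
% Primal: the smallest upper bound on the optimal value function
\min_{v} \; \sum_{s \in S} y(s)\, v(s)
\quad \text{s.t.} \quad
v(s) \ge r(s, a) + \gamma \sum_{j \in S} P(j \mid s, a)\, v(j)
\qquad \forall s \in S,\ a \in A.

% Dual: a discounted state-action visit measure consistent with the MDP
\max_{x \ge 0} \; \sum_{s \in S} \sum_{a \in A} r(s, a)\, x(s, a)
\quad \text{s.t.} \quad
\sum_{a \in A} x(j, a) - \gamma \sum_{s \in S} \sum_{a \in A} P(j \mid s, a)\, x(s, a) = y(j)
\qquad \forall j \in S.
```

The dual constraint states that the visit mass entering each state j, from the initial weighting and from discounted transitions, equals the mass leaving it through its actions.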
For further information on the MDP formulation of bandit problems in the decision
theoretic setting see the last chapter of [6], which was explored in more detail in Duff’s
PhD thesis [7]. When the number of (information) states in the bandit problem is
finite, [8] has proven that it is possible to formulate simple so-called index policies.
However, this is not generally applicable. Easily computable, near-optimal heuristic
strategies for bandit problems will be presented in Chap. 10. The decision-theoretic
solution to the unknown MDP problem is given in Chap. 9.
Further theoretical background on MDPs, including many of the theorems in
Sect. 6.5, can be found in [3]. Chap. 2 of [9] gives a quick overview of MDP theory
from the operator perspective. The introductory reinforcement learning book of [10]
also explains the basic Markov decision process framework.
6.9 Exercises
Exercise 6.1 Show that the expected discounted total reward of any given policy is
equal to the expected undiscounted total reward with a finite, but random horizon T .
In particular, let T be distributed according to a geometric distribution on {1, 2, . . .}
with parameter 1 − γ. Then show that
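$$\mathbb{E}\left(\lim_{T \to \infty} \sum_{k=0}^{T} \gamma^k r_k\right) = \mathbb{E}\left(\sum_{k=0}^{T} r_k \,\middle|\, T \sim \mathrm{Geom}(1-\gamma)\right).$$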
Exercise 6.2 Assume that the probability that the ith algorithm successfully solves
the tth task is always $p_i$. Furthermore, tasks are in no way distinguishable from
each other. In each case, let $p_i \in \{0.1, \ldots, 0.9\}$ and assume a prior distribution
$\xi_i(p_i) = 1/9$ for all i with a complete belief $\xi(p) = \prod_i \xi_i(p_i)$. Then formulate the
problem as a decision-theoretic n-armed bandit problem with reward at time t being
$r_t = 1$ if the task is solved and $r_t = 0$ if the task is not solved. Regardless
of whether the task at time t is solved, at the next time-step the next
problem is to be solved. The aim is to find a policy π mapping from the history of
observations to the set of algorithms that maximizes the total reward to time T in
expectation, i.e.,
$$\mathbb{E}_{\xi,\pi} U_0^T = \mathbb{E}_{\xi,\pi} \sum_{t=1}^{T} r_t.$$
using backwards induction for T ∈ {0, 1, 2, 3, 4} and report the expected utility
in each case. Hint: Use the decision-theoretic bandit formulation to dynamically
construct a Markov decision process which you can solve with backwards induction.
See also the extensive form of the utility from (3.5.2).
3. Now utilize the backwards induction algorithm developed in the previous step in
a problem where we receive a sequence of N tasks to solve and our utility is
$$U_0^N = \sum_{t=1}^{N} r_t.$$
At each step t ≤ N, find the optimal action by calculating $\mathbb{E}_{\xi,\pi} U_t^{t+T}$ for T ∈
{0, 1, 2, 3, 4}. Hint: At each step you can update your prior distribution using the
same routine you use to update your prior distribution. You only need consider
T < N − t.
4. Develop a simple heuristic algorithm for your choice and compare its utility with
the utility of backwards induction. Perform $10^3$ simulations, each experiment running
for $N = 10^3$ time-steps and average the results. How does the performance
improve? Hint: If the program runs too slowly go only up to T = 3.
6.9.3 Scheduling
You are controlling a set of n processing nodes of a processing network that is part
of a big CPU farm. At time t, you may be given a job of class xt ∈ X to execute.
Assume these are identically and independently drawn such that P(xt = k) = pk for
all t, k. With some probability p0 , you are not given a job to execute at the next step.
If you do have a new job, then you can either:
(a) Ignore the job.
(b) Send the job to some node i. If the node is already active, then the previous job
is lost.
Not all the nodes and jobs are equal. Some nodes are better at processing certain
types of jobs. If the ith node is running a job of type k ∈ X , then it has a probability
φi,k ∈ [0, 1] of finishing it within that time step. Then the node becomes free and can
accept a new job.
For this problem, assume that there are n = 3 nodes and k = 2 types of jobs and
that the completion probabilities are given by the matrix
$$\Phi = \begin{bmatrix} 0.3 & 0.1 \\ 0.2 & 0.2 \\ 0.1 & 0.3 \end{bmatrix}.$$
Hint: To verify that your algorithms work, test them first on a smaller MDP
with known solutions. For example, check out
http://webdocs.cs.ualberta.ca/~sutton/book/ebook/node35.html.
Exercise 6.4
1. What in your view are the fundamental advantages and disadvantages of modelling
problems as Markov decision processes?
2. Is the algorithm selection problem of Exercise 6.2 solvable with policy iteration?
If so, how?
3. What are the fundamental similarities and differences between the decision-
theoretic finite-horizon bandit setting and the infinite-horizon MDP setting?
References
1. Chernoff, H.: Sequential design of experiments. Ann. Math. Stat. 30(3), 755–770 (1959)
2. Chernoff, H.: Sequential models for clinical trials. In: Proceedings of the Fifth Berkeley Sym-
posium on Mathematical Statistics and Probability, vol. 4, pp. 805–812. University of California
Press (1966)
3. Puterman, M.L.: Markov Decision Processes: Discrete Stochastic Dynamic Programming.
Wiley, New Jersey, US (1994)
4. Tseng, P.: Solving H-horizon, stationary Markov decision problems in time proportional to
log(H). Oper. Res. Lett. 9(5), 287–297 (1990)
5. Ye, Y.: The simplex and policy-iteration methods are strongly polynomial for the Markov
decision problem with a fixed discount rate. Math. Oper. Res. 36(4), 593–603 (2011)
6. DeGroot, M.H.: Optimal Statistical Decisions. Wiley (1970)
7. Duff, M.O.: Optimal learning computational procedures for Bayes-adaptive Markov
decision processes. Ph.D. thesis, University of Massachusetts at Amherst (2002)
8. Gittins, J.C.: Multi-armed Bandit Allocation Indices. Wiley, New Jersey, US (1989)
9. Bertsekas, D.P., Tsitsiklis, J.N.: Neuro-Dynamic Programming. Athena Scientific (1996)
10. Sutton, R.S., Barto, A.G.: Reinforcement Learning: An Introduction. MIT Press (1998)
Chapter 7
Simulation-Based Algorithms
7.1 Introduction
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022
C. Dimitrakakis and R. Ortner, Decision Making Under Uncertainty
and Reinforcement Learning, Intelligent Systems Reference Library 223,
https://doi.org/10.1007/978-3-031-07614-5_7
In general it makes sense to let $\epsilon_t$ depend on t, so that the randomness can be chosen
to decrease over time when estimates converge to their true values.
When using $\epsilon$-greedy in our Robbins-Monro bandit algorithm, the two main
parameters to choose are the amount of randomness $\epsilon_t$ and the step-size $\alpha_t$ in the
estimation. Both of them have a significant effect on the performance of the algorithm.
Although as indicated above, in general it makes sense to vary these parameters with
time, it is perhaps instructive to look at what happens for fixed values of $\epsilon$ and α.
Figure 7.1 shows the average reward obtained, if we keep the step size α or the
randomness $\epsilon$ fixed, respectively, with initial estimates $\hat{r}_{0,i} = 0$.
Fig. 7.1 The plots show average reward r over time t (up to $10^4$ steps) in two panels: (a) fixed $\epsilon$,
(b) fixed α. For fixed $\epsilon_t = 0.1$, the step size is α ∈ {0.01, 0.1, 0.5}. For fixed step size α = 0.1
we vary the exploration rate in $\epsilon$ ∈ {0.0, 0.01, 0.1}
For fixed $\epsilon$, we find that larger values of α tend to give better results eventually,
while smaller values have a better initial performance. This is a natural trade-off,
since large α appears to “learn” fast, but it also “forgets” quickly. That is, for a large
α the estimates mostly depend upon the last few rewards that have been observed.
Things are not so clear-cut for the choice of $\epsilon$. We see that the choice of $\epsilon = 0$
is significantly worse than $\epsilon = 0.1$. That appears to suggest that there is an optimal
level of exploration. Ideally, we should use the decision-theoretic solution seen in
Chap. 6, but usually a good heuristic method for choosing $\epsilon$ will be good enough.
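To make the roles of the two parameters concrete, here is a minimal sketch of an ε-greedy bandit learner with a fixed step size. It assumes Bernoulli arms with hypothetical success probabilities `means`; the function name and arguments are illustrative and not taken from the book's algorithm listing.

```python
import random


def epsilon_greedy_bandit(means, epsilon, alpha, horizon, seed=0):
    """Run epsilon-greedy on Bernoulli arms with a fixed-step-size estimate update."""
    rng = random.Random(seed)
    estimates = [0.0] * len(means)  # initial estimates set to zero
    total = 0.0
    for _ in range(horizon):
        if rng.random() < epsilon:
            arm = rng.randrange(len(means))  # explore: uniformly random arm
        else:
            arm = max(range(len(means)), key=lambda i: estimates[i])  # exploit
        reward = 1.0 if rng.random() < means[arm] else 0.0
        estimates[arm] += alpha * (reward - estimates[arm])  # Robbins-Monro step
        total += reward
    return total / horizon, estimates


avg_reward, estimates = epsilon_greedy_bandit(
    means=[0.2, 0.8], epsilon=0.1, alpha=0.1, horizon=5000)
```

Calling the function for several values of `epsilon` and `alpha` gives a rough way to explore the trade-offs discussed above.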
This section briefly reviews some basic results of stochastic approximation theory.
Complete proofs can be found in [2]. We first consider the core problem of stochastic
approximation itself. In particular, we shall cast the approximation problem as a
minimization problem. That is, given some unknown parameter of the environment
$\nu \in \mathbb{R}^n$ and an estimate $\nu_t \in \mathbb{R}^n$ of ν at time t, the estimate is assumed to be chosen
such that for a certain function $f : \mathbb{R}^n \to \mathbb{R}$ the estimate $\nu_t$ minimizes f.
Then with respect to the learning algorithm the goal is that the sequence of values
(νt )t generated by the algorithm converges to some ν ∗ that is a local minimum, or a
stationary point for f . For strictly convex f , this would also be a global minimum.
In the following, we will examine algorithms that compute estimates νt according
to the update
νt+1 = νt + αt z t+1 , (7.1.1)
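(ii) $\|\nabla f(x) - \nabla f(y)\| \le L \|x - y\|, \quad \forall\, x, y \in \mathbb{R}^n.$
(iv) $\mathbb{E}\left(\|z_{t+1}\|^2 \mid h_t\right) \le K_1 + K_2 \|\nabla f(\mu_t)\|^2, \quad \forall\, t.$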
The basic condition (ii) ensures that the function is well-behaved, so that gradient-
following methods will easily find the minimum. Condition (iii) combines two
assumptions in one. Firstly, the expected direction of the update always decreases
f , and secondly that the squared norm of the gradient is not too large relative to
the size of the update. Finally, condition (iv) ensures that the update is bounded in
expectation relative to the gradient. One can see how putting together the last two
conditions ensures that the expected direction of the update is correct and its norm
is bounded.
Theorem 7.1.1 Consider an algorithm with update (7.1.1) such that the step sizes
αt ≥ 0 satisfy
$$\sum_{t=0}^{\infty} \alpha_t = \infty, \qquad \sum_{t=0}^{\infty} \alpha_t^2 < \infty.$$
The conditions of Theorem 7.1.1 are not necessary. Alternative sufficient conditions
relying on contraction properties are for example discussed in detail in [2]. The
following example illustrates the impact of the choice of step size schedule (αt )t on
convergence.
$$z_{t+1} = x_{t+1} - \nu_t.$$
Now consider three different step-size schedules: The first one, $\alpha_t = 1/t$, satisfies
both assumptions of Theorem 7.1.1 on the step-size. The second one, $\alpha_t = 1/\sqrt{t}$,
decreases too slowly, while the third one, $\alpha_t = t^{-3/2}$, approaches zero too fast.
Figure 7.2 demonstrates the convergence, or lack thereof, of the estimates $\nu_t$ to
the expected value. In fact, the schedule $t^{-3/2}$ converges to a value quite far away
from the expected value, while the slow schedule $1/\sqrt{t}$ oscillates.
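The three schedules are easy to compare in simulation. The sketch below applies the update (7.1.1) with z_{t+1} = x_{t+1} − ν_t to synthetic data. The Gaussian samples with mean 0.5, the seed and the names are assumptions made purely for illustration.

```python
import random


def run_schedule(step_size, samples, nu0=0.0):
    """Apply nu_{t+1} = nu_t + alpha_t * (x_{t+1} - nu_t) over all samples."""
    nu = nu0
    for t, x in enumerate(samples, start=1):
        nu += step_size(t) * (x - nu)
    return nu


rng = random.Random(0)
samples = [rng.gauss(0.5, 1.0) for _ in range(10000)]  # hypothetical data

schedules = {
    "1/t": lambda t: 1.0 / t,          # satisfies both step-size conditions
    "1/sqrt(t)": lambda t: t ** -0.5,  # squared step sizes are not summable
    "t^(-3/2)": lambda t: t ** -1.5,   # step sizes themselves are summable
}
final = {name: run_schedule(alpha, samples) for name, alpha in schedules.items()}
```

With $\alpha_t = 1/t$ the recursion reproduces the running sample mean exactly, which is why this schedule is the natural reference point.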
It is possible to extend the ideas outlined in the previous section to the dynamic
Markov decision process setting. We simply need to have a policy that is greedy with
respect to our estimates as well as a way to update our estimates so that they converge
to the actual MDP we are acting in. The additional challenge of the dynamic setting
is that our policy now affects which sequences of states we observe. That is, while
in the bandit problem we could freely select an arm to pull, sampling an arbitrary
state is not as easy (unless some simulation of the MDP is available).
Otherwise, the algorithmic structure remains basically the same and is described in
Algorithm 7.2.
[Fig. 7.3: the chain task. State rewards from left to right: 0.2, 0, 0, 0, 1.]
Example 7.2 (The chain task) The chain task has two actions and five states, as
shown in Fig. 7.3. The starting state is the leftmost state, where the reward is 0.2.
The reward is 1.0 in the rightmost state, and zero otherwise. The first action (dashed,
blue) takes you to the right, while the second action (solid, red) takes you to the
leftmost state, both with probability 0.8. With a probability of 0.2 both actions have
the opposite effect. The value function of the chain task for a discount factor γ = 0.95
starting in the leftmost state is shown in Table 7.1.
The chain task is a very simple but well-known task used to test the efficacy
of reinforcement learning algorithms. In particular, it is useful for analysing how
algorithms solve the exploration-exploitation trade-off, since in the short run (i.e.,
small discount factor γ) simply staying in the leftmost state is advantageous. However
if the discount factor is sufficiently large, algorithms should be incentivized to more
fully explore the state space. A variant of this task with action-dependent rewards
can be found in [3].
To make things as easy as possible, let us assume that we have a way to start the
environment from any arbitrary state. That would be the case if the environment had
a reset action, or if we were simply running an accurate simulation.
We shall begin with the simplest possible problem, that of estimating the expected
utility of each state for a specific policy. This can be performed with Monte Carlo
policy evaluation as shown in Algorithm 7.3. The idea is to estimate the value function
for every state by approximating the expectation with the average, over multiple
trajectories starting from each state, of the sum of rewards obtained. Although this is
very simple, Monte Carlo policy evaluation is a very useful method that is employed
by several more complex algorithms as a subroutine. It can also be used in
non-Markovian environments such as partially observable or multi-agent settings.
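The idea can be sketched on the chain task of Example 7.2. The code below is an illustration under stated assumptions, not a transcription of Algorithm 7.3: rewards depend on the current state, action 0 aims one step to the right, action 1 aims for the leftmost state, and with probability 0.2 the opposite effect occurs. Returns are discounted and truncated at a finite `horizon`, and all names are hypothetical.

```python
import random

REWARD = [0.2, 0.0, 0.0, 0.0, 1.0]  # state rewards of the chain task


def step(state, action, rng, slip=0.2):
    """Assumed chain dynamics: action 0 aims right, action 1 aims for state 0."""
    if rng.random() < slip:
        action = 1 - action  # the opposite effect occurs
    return min(state + 1, len(REWARD) - 1) if action == 0 else 0


def mc_policy_evaluation(policy, gamma, n_rollouts, horizon, seed=0):
    """Estimate v(s) for every start state by averaging truncated returns."""
    rng = random.Random(seed)
    values = []
    for start in range(len(REWARD)):
        total = 0.0
        for _ in range(n_rollouts):
            s, ret = start, 0.0
            for t in range(horizon):
                ret += gamma ** t * REWARD[s]  # collect the discounted reward
                s = step(s, policy[s], rng)
            total += ret
        values.append(total / n_rollouts)
    return values


always_right = [0] * len(REWARD)
values = mc_policy_evaluation(always_right, gamma=0.95, n_rollouts=200, horizon=200)
```

Each state requires its own batch of rollouts, which is why the ability to reset the environment to an arbitrary state is assumed here.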
Remark 7.2.1 Using Hoeffding’s inequality (4.5.5), one can show for each
state s that the error of the estimate $v_K(s)$ is bounded by $\sqrt{\ln(2|S|/\delta)/(2K)}$ with
probability at least $1 - \delta/|S|$. Hence, a union bound of the form
$P(A_1 \cup A_2 \cup \ldots \cup A_n) \le \sum_i P(A_i)$
shows that with probability 1 − δ this error bound holds for all states.
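The bound can be turned into a quick sample-size calculation. The sketch below assumes, as the usual form of Hoeffding's inequality requires, that each sampled return lies in an interval of unit length such as [0, 1]. The function names are hypothetical.

```python
import math


def mc_error_bound(n_states, n_rollouts, delta):
    """Error level that holds for all states with probability at least 1 - delta."""
    return math.sqrt(math.log(2 * n_states / delta) / (2 * n_rollouts))


def rollouts_needed(n_states, epsilon, delta):
    """Smallest number of rollouts per state for which the bound is at most epsilon."""
    return math.ceil(math.log(2 * n_states / delta) / (2 * epsilon ** 2))


K = rollouts_needed(n_states=5, epsilon=0.1, delta=0.05)
bound = mc_error_bound(n_states=5, n_rollouts=K, delta=0.05)
```

The number of rollouts grows only logarithmically in the number of states and in 1/δ, but quadratically in 1/ε.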
It is also possible to update the value of all encountered states after each visit. This
algorithm is called every-visit Monte Carlo, whose evaluation of a given state trajectory
$s_1, \ldots, s_T$ and thereby earned rewards $r_1, \ldots, r_T$ is shown in Algorithm 7.5. In
general, an estimate of the value function will be computed over several iterations as
in the algorithms of the previous section. The parameters $\alpha_k$ we had before are here
replaced by the values $1/n_t(s)$ that depend on the state and the respective visits in the
latter. However, the type of estimate computed by every-visit Monte Carlo can be
biased, as the update is going to be proportional to the number of steps spent in the
respective state. In order to avoid the bias, we can instead only perform the update
on each first visit to every state. This eliminates the dependence between states and
is called the first-visit Monte Carlo algorithm.
Figure 7.4 shows the difference between the two Monte Carlo evaluation methods
on the chain task for the optimal policy. In particular, it shows the $L_1$ error between
the value function of the optimal policy and the respective Monte Carlo estimates.
[Fig. 7.4: error $\|v_t - V^\pi\|$ (log scale) against iterations (up to $10^4$) for the two Monte Carlo methods.]
The kind of update we have now seen in several algorithms is of the form
$$v(s_t) = v(s_t) + \alpha \bigl(U_t - v(s_t)\bigr), \tag{7.2.1}$$
where $U_t$ is the utility sampled along some trajectory that was generated when following
policy π. The difference between the observed utility and the assumed value
so far can be seen as an error term, which is used for correction towards the true
value.
The main idea of temporal difference learning methods is to replace the utility
term by the immediate reward(s) in the next step(s) and estimate the utility for the
subsequent steps by the current value of the next state.
That is, in the simplest case (now for technical reasons considering discounted
rewards) the error considered is the so-called temporal difference error
$$d_t = r_t + \gamma v(s_{t+1}) - v(s_t), \tag{7.2.2}$$
It can be shown that the error term used in (7.2.1) can be written as a sum over the
temporal difference errors
$$U_t - v(s_t) = \sum_{\ell \ge t} \gamma^{\ell - t} d_\ell.$$
Instead of the estimate $r_t + \gamma v(s_{t+1})$ we used for the utility before, we can more
generally consider the return over the next m steps starting in step t, that is,
$U_{t,m} := r_t + \gamma r_{t+1} + \ldots + \gamma^{m-1} r_{t+m-1} + \gamma^m v(s_{t+m})$. The TD(λ) algorithm now
uses a mixture
$$U_t^\lambda := (1 - \lambda) \sum_{m \ge 1} \lambda^{m-1} U_{t,m}$$
of all these different estimates, where the parameter λ ∈ [0, 1] determines the importance
of estimates with higher m similar to a discount factor and the factor (1 − λ) is
used for normalization. Note that for λ = 0 one obtains the estimate corresponding
to using the temporal difference error (7.2.2). If there is a terminal state then for
λ = 1 an update with respect to $U_t^\lambda$ can be shown to correspond to a Monte Carlo
update. In general, doing the update according to (7.2.1) with respect to $U_t^\lambda$, one can
write the update using the temporal difference error as follows.
TD(λ) update
$$v(s_t) = v(s_t) + \alpha \sum_{\ell = t}^{\infty} (\gamma\lambda)^{\ell - t} d_\ell.$$
Unfortunately, as this update uses all future state and reward observations, it is
only possible to implement it offline. However, by considering a so-called backward
view, one can obtain an online version, shown as Algorithm 7.7.
[Fig. 7.5: eligibility traces over 100 time steps, for the replacing and cumulative formulations.]
are visited and discards visits that are not so recent. Accordingly, the eligibility of
each state is used as a factor for the temporal difference error in the state updates.
Figure 7.5 is an illustration of how eligibility traces operate in the two different
formulations: replacing versus cumulative. While replacing traces start from 1
whenever a state is visited again, cumulative traces increase with every visit.
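A backward-view update with both kinds of traces can be sketched as follows. This is a minimal illustration and not a transcription of Algorithm 7.7: it assumes a tabular value function, a fixed step size, and a recorded trajectory in which `rewards[t]` is the reward obtained in state `trajectory[t]`. All names are hypothetical.

```python
def td_lambda(trajectory, rewards, n_states, gamma, lam, alpha, replacing=True):
    """Backward-view TD(lambda) over one observed state sequence."""
    v = [0.0] * n_states
    trace = [0.0] * n_states
    for t in range(len(trajectory) - 1):
        s, s_next = trajectory[t], trajectory[t + 1]
        d = rewards[t] + gamma * v[s_next] - v[s]  # temporal difference error
        trace = [gamma * lam * e for e in trace]   # decay all eligibilities
        trace[s] = 1.0 if replacing else trace[s] + 1.0  # mark the current visit
        v = [v_i + alpha * d * e_i for v_i, e_i in zip(v, trace)]
    return v


# Hypothetical deterministic example: the state alternates 0, 1, 0, 1, ...
# with reward 1 in state 0 and reward 0 in state 1.
trajectory = [t % 2 for t in range(5001)]
rewards = [1.0 if s == 0 else 0.0 for s in trajectory]
v_replacing = td_lambda(trajectory, rewards, 2, gamma=0.5, lam=0.8, alpha=0.1)
v_cumulative = td_lambda(trajectory, rewards, 2, gamma=0.5, lam=0.8, alpha=0.1,
                         replacing=False)
```

In this example the true values solve v(0) = 1 + 0.5 v(1) and v(1) = 0.5 v(0), giving v(0) = 4/3 and v(1) = 2/3, and both trace variants approach them.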
In the previous sections we have seen the idea of updating value information not
on all states but only on the currently observed one. This approach can not only
be applied to the setting when one wants to estimate the value of a given policy,
but also when we want to learn an optimal policy. For example, standard value
iteration as introduced in Sect. 6.5.4.1 performs a sweep over the complete state
space at each iteration. However, in principle one could perform the update over an
arbitrary sequence of states. These could be generated when following a particular
fixed policy or more generally a reinforcement learning algorithm. This leads to the
idea of simulation-based value iteration.
For this to work in general, the state sequences used must usually satisfy certain
technical requirements. One sufficient condition, for example, is that for episodic
problems with a terminal state the policies generating the state sequences are proper.
That is, they all reach a terminating state with probability 1. For discounted
non-episodic problems, this can be achieved by using a geometric distribution for a
restart in a state chosen uniformly at random. This ensures that all policies will be
proper. Alternatively, one could also select starting states with an arbitrary schedule,
as long as it is guaranteed that all states are visited infinitely often in the limit.
Before considering the general case when the underlying MDP model μ is unknown,
we start with the obviously simpler case where we can obtain data of the MDP from
simulation. That is, for each state-action pair (s, a), we can request independent samples
from the respective transition probability distribution $p_\mu(\cdot \mid s, a)$. Algorithm 7.8
shows a generic simulation-based value iteration algorithm. In this algorithm, at
every step, with probability $\epsilon$ the agent moves to a random state. This can be
seen as restarting the MDP from a uniform starting state distribution Unif(S).
Figure 7.6 shows the error in value function estimation in the chain task
(Example 7.2) when using simulation-based value iteration. In general, it is advisable
to use an initial value $v_0$ that is an upper bound on the optimal value function,
if such a value is known. This is due to the fact that in that case convergence of
simulation-based value iteration is always guaranteed, as long as the policy that we
are using is proper.¹ This is confirmed by the results. In general, the estimation error
is highly dependent upon the initial value function estimate $v_0$ and the exploration
parameter $\epsilon$. It is interesting to see that uniform sweeps (i.e., $\epsilon = 1$) result in the lowest
estimation error.
7.2.4.2 Q-Learning
Simulation-based value iteration can be suitably modified for the general reinforcement
learning setting when neither model nor simulator for the underlying environment
are available. Here instead of relying on a model of the environment, we replace
arbitrary random sweeps of the state-space with the actual state sequence observed
in the real environment. We also use this sequence as a simple way to estimate the
transition probabilities. The replacement of the true MDP model by an estimate is a
natural idea which leads to a whole family of stochastic value iteration algorithms.
The most important of these is Q-learning, which uses a trivial empirical MDP model.
Fig. 7.6 Simulation-based value iteration with pessimistic initial estimates ($v_0 = 0$) and optimistic
initial estimates ($v_0 = 20 = 1/(1 - \gamma)$) for varying $\epsilon$. Errors indicate $\|v_n - V^*\|_1$. Both panels
plot the error (log scale) against t for $\epsilon \in \{1.0, 0.5, 0.1, 0.01, 1 - \gamma\}$
Q-learning, shown as Algorithm 7.9, is not only one of the best-known but also
one of the simplest algorithms in reinforcement learning.
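The core of the method fits in a few lines. The following is a generic tabular sketch with ε-greedy action selection and a fixed step size, which may differ in details from Algorithm 7.9 (for instance in how rewards and step sizes are defined). The environment interface `step(s, a, rng)` and the toy problem are hypothetical.

```python
import random


def q_learning(step, n_states, n_actions, gamma, alpha, epsilon, n_steps, seed=0):
    """Tabular Q-learning; step(s, a, rng) must return (reward, next_state)."""
    rng = random.Random(seed)
    q = [[0.0] * n_actions for _ in range(n_states)]
    s = 0
    for _ in range(n_steps):
        if rng.random() < epsilon:
            a = rng.randrange(n_actions)  # explore
        else:
            a = max(range(n_actions), key=lambda b: q[s][b])  # exploit
        reward, s_next = step(s, a, rng)
        target = reward + gamma * max(q[s_next])  # one-sample Bellman backup
        q[s][a] += alpha * (target - q[s][a])
        s = s_next
    return q


def toy_step(s, a, rng):
    """Hypothetical two-state problem: action 0 stays, action 1 switches state."""
    reward = 1.0 if (s == 1 and a == 0) else 0.0
    return reward, (s if a == 0 else 1 - s)


q = q_learning(toy_step, n_states=2, n_actions=2, gamma=0.9, alpha=0.5,
               epsilon=0.3, n_steps=20000)
```

In this deterministic toy problem the optimal action values are Q(1, stay) = 10, Q(0, switch) = 9 and Q(0, stay) = Q(1, switch) = 8.1, which the estimates approach as long as every state-action pair keeps being visited.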
[Figure: two panels plotting (a) Error and (b) Regret against t (×10), with curves for the parameter values 1.0, 0.5, 0.1, 0.05 and 0.01.]
step size αt . Both of these depend on the number of visits to a particular state, which
leads to a more efficient performance.
Of course, one could get any algorithm in between pure Q-learning and pure
stochastic value iteration. In fact, variants of the Q-learning algorithm using eligibility
traces (see Sect. 7.2.3.1) can be formulated in the same way.
2 More generally, rewards could of course depend not only on the state, but on the
combination of state and action.
7.3 Discussion
Most of the algorithms we have seen in this chapter are quite simple and thus clearly
demonstrate the principle of learning by reinforcement. However, they do not aim to
solve the reinforcement learning problem in an optimal way. They have been mostly
used for finding near-optimal policies given access to samples from a simulator, for
example in the case of learning to play Atari games [4]. However, even in this case,
a crucial issue is how much data is needed to be able to approach optimal behavior.
Convergence. Even though it is quite simple, Q-learning can be shown to converge
to an optimal policy in various settings. Tsitsiklis [5] has provided an asymptotic
proof based on stochastic approximation theory with less restrictive assumptions
than the original paper of [6]. Later [7] proved finite sample convergence results
under strong mixing assumptions on the MDP. Moreover, for modifications such as
delayed Q-learning [8] stronger results on the sample complexity can be shown.
That is, Õ(|S||A|) samples are sufficient to determine an $\epsilon$-optimal policy with high
probability.
Exploration. Another extension of Q-learning is using a population of value function
estimates using random initial values and weighted bootstrapping. This was first
introduced by [9, 10] and evaluated on bandit tasks. Recently, this idea has also been
exploited in the context of deep neural networks by [11] for representations of value
functions in the setting of full reinforcement learning. We will examine this more
closely in Chap. 8.
Bootstrapping and subsampling use a single set of empirical data to obtain an
empirical measure of uncertainty about statistics of the data. We wish to do the same
thing for value functions, based on data from one or more trajectories. Informally, this
variant maintains a collection of Q-value estimates, each one of which is trained on
different segments3 of the data, with possible overlaps. In order to achieve efficient
exploration, a random Q-estimate is selected at every episode or every few steps.
This results in a bootstrap analogue of Thompson sampling. Figure 7.8 shows the use
of weighted bootstrap estimates for the Double Chain problem introduced by [3].
7.4 Exercises
Exercise 7.1 According to Theorem 6.5.1 the value function of a policy π for an
MDP μ with state reward function r can be written as the solution of the linear
equation $v_\mu^\pi = (I - \gamma P_\mu^\pi)^{-1} r$, where the term $\Phi_\mu^\pi \triangleq (I - \gamma P_\mu^\pi)^{-1}$ can be seen as a
feature matrix. However, simulation-based algorithms like Sarsa only approximate
the value function directly rather than $\Phi_\mu^\pi$. This means that if the reward function
changes, they have to be restarted from scratch. Is there a way to improve on this?⁴
1. Develop and test a simulation-based algorithm (such as Sarsa) for estimating Φμπ ,
and prove its asymptotic convergence. Hint: Focus on the fact that you’d like to
estimate a value function for all possible reward functions.
2. Consider a model-based approach building an empirical transition kernel Pμ̂π .
How good are your value function estimates in the first versus the second
approach? Why would you expect either one to be better?
3. Can the same idea be extended to Q-learning?
References
1. Robbins, H., Monro, S.: A stochastic approximation method. Ann. Math. Stat. 22(3), 400–407
(1951)
2. Bertsekas, D.P., Tsitsiklis, J.N.: Neuro-Dynamic Programming. Athena Scientific (1996)
3. Dearden, R., Friedman, N., Russell, S.J.: Bayesian Q-learning. In: Proceedings of the Fifteenth
National Conference on Artificial Intelligence and Tenth Innovative Applications of Artificial
Intelligence Conference, AAAI 98, IAAI98, pp. 761–768. AAAI Press, The MIT Press (1998)
4. Mnih, V., Kavukcuoglu, K., Silver, D., et al.: Human-level control through deep reinforcement
learning. Nature 518(7540), 529–533 (2015)
5. Tsitsiklis, J.N.: Asynchronous stochastic approximation and Q-learning. Mach. Learn. 16(3),
185–202 (1994)
6. Watkins, C.J.C.H., Dayan, P.: Q-learning. Mach. Learn. 8(3–4), 279–292 (1992)
7. Kearns, M., Singh, S.: Finite sample convergence rates for Q-learning and indirect algorithms.
In: Advances in Neural Information Processing Systems, vol. 11, pp. 996–1002. The MIT Press
(1999)
8. Strehl, A.L., Li, L., Wiewiora, E., Langford, J., Littman, M.L.: PAC model-free reinforcement
learning. In: Machine Learning, Proceedings of the 23rd International Conference, ICML 2006,
pp. 881–888. ACM (2006)
8.1 Introduction
may not include the greedy policy itself. This is usually because it is not possible to
represent all possible value functions and policies in complex problems.
Usually, we are not aiming at a uniformly good approximation to a value function
or policy. Instead, we define φ, a distribution on S, which specifies on which parts
of the state space we want to have a good approximation by placing higher weight
on the most important states. Frequently, φ only has a finite support, meaning that
we only measure the approximation error over a finite set of representative states
Ŝ ⊆ S. In the sequel, we shall always define the quality of an approximate value or
policy with respect to φ.
In the remainder of this chapter, we shall examine a number of approximate
dynamic programming algorithms. What all of these algorithms have in common
is the requirement to calculate an approximate value function or policy. The next
two sections give an overview of the basic problem of fitting an approximate value
function or policy to a target.
Let us begin by considering the problem of finding the value function $v_\theta \in V$ that
best matches a target value function u that is not necessarily in V. This can be done
by minimizing the difference between the target value u and the approximation $v_\theta$,
that is,
$$\|v_\theta - u\|_\phi = \int_S |v_\theta(s) - u(s)| \, d\phi(s).$$
Given $V = \{v_\theta \mid \theta \in \Theta\}$, find
$$\theta^* \in \arg\min_{\theta \in \Theta} \|v_\theta - u\|_\phi,$$
where $\|\cdot\|_\phi \triangleq \int_S |\cdot| \, d\phi$.
Fig. 8.1 Fitting a value function in $V = \{v_1, v_2, v_3\}$ to a target value function u, over a finite
number of points: (a) the target function and the three candidates; (b) the errors at the chosen
points; (c) the total error of each candidate. While none of the three candidates is a perfect fit,
we clearly see that $v_1$ has the lowest cumulative error over the measured set of points
Clearly, none of the given functions is a perfect fit. In addition, finding the best overall
fit requires minimizing an integral. So, for this problem we choose a random set of
points X = {xt } on which to evaluate the fit, with φ(xt ) = 1 for every point xt ∈ X .
This is illustrated in Fig. 8.1, which shows the error of the functions at the selected
points, as well as their cumulative error.
In the example above, the approximation space V does not have a member that is
sufficiently close to the target value function. It could be that a larger function space
contains a better approximation. However, it may be difficult to find the best fit in an
arbitrary set V.
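When the candidate set is finite and φ has finite support, the fit reduces to evaluating a weighted sum for each candidate and taking the minimum. The sketch below illustrates this. The target, the candidates and the evaluation points are made up for the example and are not those of Fig. 8.1.

```python
def weighted_l1_error(candidate, target, points, weights):
    """Finite-support version of the phi-weighted L1 distance."""
    return sum(w * abs(candidate(x) - target(x)) for x, w in zip(points, weights))


def best_fit(candidates, target, points, weights):
    """Return the name of the candidate with the smallest error, and all errors."""
    errors = {name: weighted_l1_error(f, target, points, weights)
              for name, f in candidates.items()}
    return min(errors, key=errors.get), errors


def target(x):
    """Hypothetical target value function."""
    return x * x


candidates = {  # hypothetical approximation space with three members
    "zero": lambda x: 0.0,
    "linear": lambda x: x,
    "shifted": lambda x: x * x + 0.1,
}
points = [0.0, 0.5, 1.0, 2.0]
weights = [1.0] * len(points)  # phi(x_t) = 1 for every chosen point
best, errors = best_fit(candidates, target, points, weights)
```

For richer approximation spaces the enumeration is replaced by an optimization over θ, but the objective being minimized is the same.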
The problem of fitting a policy is not significantly different from that of fitting a value
function, especially when the action space is continuous. Once more, we define an
appropriate normed vector space so that it makes sense to talk about the normed
difference between two policies π, π′ with respect to some measure φ on the states,
more precisely defined as
$$\|\pi - \pi'\|_\phi = \int_S \|\pi(\cdot \mid s) - \pi'(\cdot \mid s)\| \, d\phi(s).$$
Once more, the minimization problem may not be trivial, but there are some cases
where it is particularly easy. One of these is when the policies can be efficiently
enumerated, as in the example below.
Example 8.2 (Fitting a finite space of policies) For simplicity, consider the space of
deterministic policies with a binary action space A = {0, 1}. Then each policy can be
represented as a simple mapping π : S → {0, 1}, corresponding to a binary partition
of the state space. In this example, the state space is the 2-dimensional unit cube,
S = [0, 1]2 . Figure 8.2 shows an example policy, where the light red and light green
areas represent taking action 1 and 0, respectively. The measure φ has support only
on the crosses and circles, which indicate the action taken at that location. Consider
a policy space consisting of just four policies. Each set of two policies is indicated
by the magenta (dashed) and blue (dotted) lines in Fig. 8.2. Each line corresponds to
two possible policies, one selecting action 1 in the high region, and the other selecting
action 0 instead. In terms of our error metric, the best policy is the one that makes
the fewest mistakes. Consequently, the best policy in this set is to use the blue line and
play action 1 (red) in the top-right region.
[Fig. 8.2: policies that separate the state space with a hyperplane; axes $s_1$ and $s_2$.]
8.1.3 Features
Frequently, when dealing with large, or complicated spaces, it pays off to project
(observed) states and actions onto a feature space X . In that way, we can make
problems much more manageable. Generally speaking, a feature mapping is defined
as follows.
Feature mapping
What sort of functions should we use? A common idea is to use a set of smooth
symmetric functions, such as the usual radial basis functions.
Example 8.3 (Radial Basis Functions) Let d be a metric on $S \times A$ and define
a set of characteristic state-action pairs $\{(s_i, a_i) \mid i = 1, \ldots, n\}$. These can act as
centers for a set of radial basis functions, defined as follows:
Another common type of functions are binary functions. These effectively dis-
cretize a continuous space through either a cover or a partition.
2. $\bigcup_{S \in G} S = X$.
Multiple tilings create a cover and can be used without many difficulties with most
discrete reinforcement learning algorithms, cf. Sutton and Barto [1].
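As a rough sketch of how several tilings yield binary features, the function below maps a state in $[0, 1)^2$ to one active tile per tiling. The function name `tile_features`, the grid sizes and the diagonal offset scheme are assumptions made for illustration, not taken from the text.

```python
def tile_features(s, n_tiles=4, n_tilings=2):
    """Binary features for s in [0, 1)^2 from n_tilings shifted uniform grids.

    Each tiling is a partition of the state space, so exactly one of its
    tiles is active; together the tilings form a cover.
    """
    features = [0] * (n_tilings * n_tiles * n_tiles)
    for k in range(n_tilings):
        # Shift tiling k diagonally by a fraction of one tile width.
        offset = k / (n_tilings * n_tiles)
        i = min(int((s[0] + offset) * n_tiles), n_tiles - 1)
        j = min(int((s[1] + offset) * n_tiles), n_tiles - 1)
        features[k * n_tiles * n_tiles + i * n_tiles + j] = 1
    return features
```

A tabular algorithm can then be run on the active tile indices, with the value of a state given by the sum of the parameters of its active tiles.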
Now that we have looked at the basic problems in approximate regimes, let us look
at some methods for obtaining useful approximations. First of all, we introduce
some basic concepts such as look-ahead and rollout policies for estimating value
functions. Then we formulate value function approximation and policy estimation as
an optimization problem. These are going to be used in the remaining sections. For
example, Sect. 8.2 introduces the well known approximate policy iteration algorithm,
which combines those two steps into approximate policy evaluation and approximate
policy improvement.
Single-step look-ahead
Let $q(i, a) \triangleq r_\mu(i, a) + \gamma \sum_{j \in S} P_\mu(j \mid i, a)\, u(j)$. Then the single-step look-ahead
policy is defined as
T -step look-ahead
\[ u_0 = u, \qquad q_k(i, a) = r_\mu(i, a) + \gamma \sum_{j \in S} P_\mu(j \mid i, a)\, u_{k-1}(j), \]
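The recursion above can be sketched for a small finite MDP as follows. The function name `lookahead_q` and the nested-list layout of `r` and `P` are illustrative choices, and the backup $u_k(i) = \max_a q_k(i, a)$ used between steps is an assumption, since that part of the definition is not shown here.

```python
def lookahead_q(r, P, u, gamma, T):
    """Return q_T, where q_k(i, a) = r[i][a] + gamma * sum_j P[i][a][j] * u_{k-1}[j].

    r[i][a] is the reward, P[i][a][j] the transition probability, u the
    initial value estimate u_0, and T >= 1 the look-ahead depth.
    """
    n_states, n_actions = len(r), len(r[0])
    u_k = list(u)
    q = None
    for _ in range(T):
        q = [
            [
                r[i][a] + gamma * sum(P[i][a][j] * u_k[j] for j in range(n_states))
                for a in range(n_actions)
            ]
            for i in range(n_states)
        ]
        # Assumed backup between steps: u_k(i) = max_a q_k(i, a).
        u_k = [max(q_i) for q_i in q]
    return q
```

A look-ahead policy would then pick, in each state `i`, an action maximizing `q[i][a]`.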
As we have seen in Sect. 7.2.2 one way to obtain an approximate value function of
an arbitrary policy π is to use Monte Carlo estimation, that is, to simulate several
174 8 Approximate Representations
\[ q(i, a) = \frac{1}{K_i} \sum_{k=1}^{K_i} \sum_{t=0}^{T_k - 1} r(s_{t,k}, a_{t,k}), \tag{8.1.1} \]
This results in a set of samples of q-factors. The next problem is to find a parametric
policy $\pi_\theta$ that approximates the greedy policy with respect to our samples, $\pi^*_q$. For
a finite number of actions, this can be seen as a classification problem [2]. For
continuous actions, it becomes a regression problem. As indicated before, we define
a distribution $\phi$ on a set of representative states $\hat{S}$ over which we wish to perform
the minimization.
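A minimal sketch of the rollout estimate in (8.1.1) is given below. It assumes a simulator `step(s, a, rng)` returning `(reward, next_state)` and a fixed rollout length `horizon` in place of the per-rollout lengths $T_k$; both, like the function name `rollout_q`, are hypothetical.

```python
import random

def rollout_q(step, policy, state, action, horizon, n_rollouts, rng):
    """Average the summed rewards of n_rollouts rollouts starting with (state, action).

    The first action is fixed; afterwards actions come from `policy`.
    """
    total = 0.0
    for _ in range(n_rollouts):
        s, a = state, action
        for _ in range(horizon):
            reward, s = step(s, a, rng)
            total += reward
            a = policy(s)
    return total / n_rollouts
```

Calling this for every action in a representative state gives the q-factor samples from which a greedy action label can be read off.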
Choosing the model representation is only the first step. We now have to use it to
represent a specific value function. In order to do this, as before we first pick a set
of representative states $\hat{S}$ to fit our value function $v_\theta$ to $v$. This type of estimation
can be seen as a regression problem, where the observations are value function
measurements at different states.
Let φ be a distribution over representative states Ŝ. For some constants κ, p > 0,
we define the weighted prediction error per state as
Minimizing this error can be done using gradient descent, which is a general
algorithm for finding local minima of smooth cost functions. Generally, minimizing a
real-valued cost function c(θ) with gradient descent involves an algorithm iteratively
approximating the value minimizing c:
Under certain conditions$^1$ on the step-size parameter $\alpha_n$, $\lim_{n \to \infty} c(\theta^{(n)}) = \min_\theta c(\theta)$.
where $\nabla_\theta v_\theta(s) = f(s)$. Taking partial derivatives $\partial/\partial\theta_j$ leads to the update rule
\[ \theta_j^{(n+1)} = \theta_j^{(n)} - 2\alpha\, \phi(s)\, [v_{\theta^{(n)}}(s) - u(s)]\, f_j(s). \]
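The update rule can be tried out on a toy regression problem. The sketch below assumes a linear value function $v_\theta(s) = \theta^\top f(s)$ and sweeps repeatedly over the representative states; the function name `fit_linear_value` and the default step size and sweep count are arbitrary illustrative choices.

```python
def fit_linear_value(features, targets, weights, alpha=0.1, n_sweeps=1000):
    """Fit theta so that theta . f(s) approximates u(s) on representative states.

    features[i] is f(s_i), targets[i] is u(s_i), weights[i] is phi(s_i).
    Each step applies theta_j <- theta_j - 2*alpha*phi(s)*[v_theta(s) - u(s)]*f_j(s).
    """
    theta = [0.0] * len(features[0])
    for _ in range(n_sweeps):
        for f, u, phi in zip(features, targets, weights):
            error = sum(t * x for t, x in zip(theta, f)) - u
            theta = [t - 2 * alpha * phi * error * x for t, x in zip(theta, f)]
    return theta
```

For instance, with features $f(s) = (1, s)$ and targets $u(s) = 1 + 2s$ on three states, the parameters approach $(1, 2)$.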
Here P̂ is not necessarily the true transition kernel. It can be a model or an empir-
ical approximation (in which case the integral would only be over the empirical
support). The summation itself is performed with respect to the measure φ.
In this chapter, we will look at two methods for approximately minimizing the
Bellman error. The first, least-squares policy iteration, is a batch algorithm for
approximate policy iteration and finds the least-squares solution to the problem using the
empirical transition kernel. The second is a gradient-based method, which is flexible
enough to use either an explicit model of the MDP or the empirical transition kernel.
It is also possible to simultaneously approximate some value function u and
minimize the Bellman error by considering the minimization problem
whereby the Bellman error acts as a regularizer ensuring that our approximation is
indeed as consistent as possible.
\[ \pi_\theta(a \mid s) = \frac{g_\theta(s, a)}{h_\theta(s)}, \]
where $g_\theta(s, a) = \ell\left(\sum_{i=1}^n \theta_i f_i(s, a)\right)$ and $h_\theta(s) = \sum_{b \in A} g_\theta(s, b)$.
The link function $\ell$ ensures that the denominator is positive, and the policy is
a distribution over actions. An alternative method would be to directly constrain
the policy parameters so the result is always a distribution, but this would result
in a constrained optimization problem. A typical choice for the link function is
$\ell(x) = \exp(x)$, which results in the softmax family of policies.
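For the exponential link, the policy can be computed as in the sketch below. The names `softmax_policy` and `feature_fn` are illustrative; subtracting the maximum score before exponentiating is a standard numerical safeguard that leaves the resulting probabilities unchanged.

```python
import math

def softmax_policy(theta, feature_fn, state, actions):
    """Return [pi_theta(a | state) for a in actions] under the exponential link."""
    scores = [
        sum(t * x for t, x in zip(theta, feature_fn(state, a))) for a in actions
    ]
    m = max(scores)  # shift for numerical stability; cancels in the ratio
    g = [math.exp(z - m) for z in scores]
    h = sum(g)
    return [g_a / h for g_a in g]
```

With all parameters at zero the policy is uniform, and increasing the parameter attached to an action's feature increases that action's probability.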
In order to fit a policy, we first pick a set of representative states Ŝ and then find a
policy πθ that approximates a target policy π, which is typically the greedy policy with
respect to some value function. In order to do so, we can define an appropriate cost
function and then estimate the optimal parameters via some arbitrary optimization
method.
Once more, we can use gradient descent to minimize the cost function. We
obtain different results for different norms, but the three cases of main interest are
p = 1, p = 2, and p → ∞. We present the first one here, and leave the others as an
exercise.
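Example 8.6 (The case p = 1, κ = 1) The derivative can be written as
\[ \nabla_\theta c_s = \phi(s) \sum_{a \in A} \nabla_\theta |\pi_\theta(a \mid s) - \pi(a \mid s)|, \]
where
\[ \nabla_\theta |\pi_\theta(a \mid s) - \pi(a \mid s)| = \nabla_\theta \pi_\theta(a \mid s)\, \operatorname{sgn}[\pi_\theta(a \mid s) - \pi(a \mid s)]. \]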
Alternative cost functions. It is often a good idea to add a penalty term of the form
$\|\theta\|_q$ to the cost function, constraining the parameters to be small. The purpose of
this is to prevent overfitting of the parameters for a small number of observations.
The main idea of approximate policy iteration is to replace the exact Bellman
operator $\mathcal{L}$ with an approximate version $\hat{\mathcal{L}}$ to obtain an approximate optimal policy
and a respective approximate optimal value.
Just as in standard policy iteration introduced in Sect. 6.5.4.2, there is a policy
improvement step and a policy evaluation step. In the policy improvement step,
we simply try to get as close as possible to the best possible improvement using a
restricted set of policies and an approximation of the Bellman operator. Similarly, in
the policy evaluation step, we try to get as close as possible to the actual value of the
improved policy using a respective set of value functions.
At the k-th iteration of the policy improvement step the approximate value $v_{k-1}$
of the previous policy $\pi_{k-1}$ is used to obtain an improved policy $\pi_k$. However, note
that we may not be able to implement the policy $\arg\max_\pi \mathcal{L}_\pi v_{k-1}$ for two reasons.
Firstly, the policy space $\hat{\Pi}$ may not include all possible policies. Secondly, the
Bellman operator is in general also only an approximation. In the policy evaluation step,
we aim at finding the function $v_k$ that is the closest to the true value function of
policy πk . However, even if the value function space V̂ is rich enough, the mini-
8.2 Approximate Policy Iteration (API) 179
mization is done over a norm that integrates over a finite subset of the state space.
The following section discusses the effect of those errors on the convergence of
approximate policy iteration.
If the approximate value function u is close to $V^*$ then the greedy policy with respect
to u is close to optimal. For a finite state and action space, the following holds.
Theorem 8.2.1 Consider a finite MDP $\mu$ with discount factor $\gamma < 1$ and a vector
$u \in \mathcal{V}$ such that $\|u - V^*_\mu\|_\infty = \epsilon$. If $\pi$ is the u-greedy policy then
\[ \|V^\pi_\mu - V^*_\mu\|_\infty \le \frac{2\gamma\epsilon}{1 - \gamma}. \]
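Proof Recall that $\mathcal{L}$ is the one-step Bellman operator and $\mathcal{L}_\pi$ is the one-step policy
operator on the value function. Then (skipping the index for $\mu$)
\begin{align*}
\|V^\pi - V^*\|_\infty &= \|\mathcal{L}_\pi V^\pi - V^*\|_\infty \\
&\le \|\mathcal{L}_\pi V^\pi - \mathcal{L}_\pi u\|_\infty + \|\mathcal{L}_\pi u - V^*\|_\infty \\
&\le \gamma \|V^\pi - u\|_\infty + \|\mathcal{L} u - V^*\|_\infty
\end{align*}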
Building on this result, we can prove a simple bound for approximate policy
iteration, assuming uniform error bounds on the approximation of the value of a policy
as well as on the approximate Bellman operator. Even though these assumptions are
quite strong, we still only obtain the following rather weak asymptotic convergence
result.2
2 For $\delta = 0$, this is identical to the result for $\epsilon$-equivalent MDPs by Even-Dar and Mansour [3].
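Theorem 8.2.2 (Bertsekas and Tsitsiklis [4], Proposition 6.2) Assume that there are
$\epsilon, \delta$ such that, for all k the iterates $v_k, \pi_k$ satisfy
\[ \|v_k - V^{\pi_k}\|_\infty \le \epsilon, \qquad \|\mathcal{L}_{\pi_{k+1}} v_k - \mathcal{L} v_k\|_\infty \le \delta. \]
Then
\[ \limsup_{k \to \infty} \|V^{\pi_k} - V^*\|_\infty \le \frac{\delta + 2\gamma\epsilon}{(1 - \gamma)^2}. \]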
As suggested by Bertsekas and Tsitsiklis [4], one idea for estimating the value func-
tion is to simply perform rollouts, while the policy itself is estimated in parametric
form. The first practical algorithm in this direction was Rollout Sampling Approx-
imate Policy Iteration by Dimitrakakis and Lagoudakis [5]. The main idea is to
concentrate rollouts in interesting parts of the state space, so as to maximize the
expected amount of improvement we can obtain with a given rollout budget.
If we have collected data, we can use the empirical state distribution to select starting
states. In general, rollouts give us estimates $q_k$, which are used to select states for
further rollouts. That is, we compute for each state s actions
Then we select a state sn maximizing the upper bound value Un (s) defined via
where c(s) is the number of rollouts from state s. If the sampling of a state s stops
whenever
\[ \hat{\Delta}_k(s) \ge \sqrt{\frac{2}{c(s)(1 - \gamma)^2} \ln \frac{|A| - 1}{\delta}}, \]
then we are certain that the optimal action has been identified with probability $1 - \delta$
for that state, due to Hoeffding’s inequality. Unfortunately, guaranteeing a policy
improvement for the complete state space is impossible, even with strong assumptions.$^3$
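To see how such a stopping rule behaves, the sketch below evaluates a Hoeffding-style threshold as a function of the number of rollouts. The square-root form of the threshold and the function names `stopping_threshold` and `should_stop` are assumptions made for this illustration; `gap` stands for the empirical difference between the best and second-best action value estimates in a state.

```python
import math

def stopping_threshold(n_rollouts, n_actions, gamma, delta):
    """Assumed threshold: sqrt(2 / (c (1 - gamma)^2) * ln((|A| - 1) / delta))."""
    return math.sqrt(
        2.0 / (n_rollouts * (1.0 - gamma) ** 2) * math.log((n_actions - 1) / delta)
    )

def should_stop(gap, n_rollouts, n_actions, gamma, delta):
    """Stop sampling a state once the empirical gap exceeds the threshold."""
    return gap >= stopping_threshold(n_rollouts, n_actions, gamma, delta)
```

The threshold shrinks at rate $1/\sqrt{c(s)}$, so states with a large gap between actions are resolved after few rollouts, while states with nearly tied actions keep consuming the budget.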
When considering quadratic error, it is tempting to use linear methods, such as least
squares methods, which are very efficient. This requires formulating the problem in
linear form, using a feature mapping that projects individual states (or state-action
pairs) onto a high-dimensional space. Then the value function can be represented as
a linear function of the parameters and this mapping, which minimizes a squared error
over the observed trajectories.
To get an intuition for these methods, recall from Theorem 6.5.1 that the solution
of
\[ v = r + \gamma P_{\mu,\pi} v \]
is
\[ v = (I - \gamma P_{\mu,\pi})^{-1} r. \]
Here we consider the setting where we do not have access to the transition matrix, but
instead have some observations of transitions $(s_t, a_t, s_{t+1})$. In addition, our state space
can be continuous (e.g., $S \subset \mathbb{R}^n$), so that the transition matrix becomes a general
transition kernel. Consequently, the set of value functions $\mathcal{V}$ becomes a Hilbert space,
while in the discrete setting a value function is simply a point in $\mathbb{R}^n$.
3 First, note that if we need to identify the optimal action for k states, then the above stopping rule
has an overall error probability of kδ. In addition, even if we assume that value functions are smooth,
it will be impossible to identify the boundary in the state space where the optimal policy should
switch actions [6].
In general, we deal with this case via projections. We project from the infinite-
dimensional Hilbert space to one with finite dimension on a subset of states: namely,
the ones that we have observed. We also replace the transition kernel with the empir-
ical transition matrix on the observed states.
Parametrization. Let us first deal with parametrizing a linear value function. Setting
$v = \Phi\theta$, where $\Phi$ is a feature matrix and $\theta$ is a parameter vector, we have
\[ \Phi\theta = r + \gamma P_{\mu,\pi} \Phi\theta, \]
\[ \theta = [(I - \gamma P_{\mu,\pi})\Phi]^{-1} r. \]
If the inverse exists, then it is equal to the pseudo-inverse. However, in our setting,
the matrix can be low rank, in which case we instead obtain the matrix minimizing
the squared error, which in turn can be used to obtain a good estimate for the param-
eters. This immediately leads to the Least Squares Temporal Difference (LSTD)
algorithm [7], which estimates an approximate value function for some policy π
given some data D and a feature mapping f .
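The following is a minimal sketch of one common form of the LSTD estimate, in which $\theta$ solves $A\theta = b$ with $A = \sum_t f(s_t)\,(f(s_t) - \gamma f(s'_t))^\top$ and $b = \sum_t f(s_t)\, r_t$. It may differ in detail from the algorithm the text refers to; the function names `solve` and `lstd` and the tuple layout of the data are assumptions, and no pseudo-inverse fallback is included for singular $A$.

```python
def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b_i] for row, b_i in zip(A, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda i: abs(M[i][c]))
        M[c], M[p] = M[p], M[c]
        for i in range(n):
            if i != c:
                factor = M[i][c] / M[c][c]
                M[i] = [x - factor * y for x, y in zip(M[i], M[c])]
    return [M[i][n] / M[i][i] for i in range(n)]

def lstd(transitions, feature_fn, gamma):
    """Estimate theta from (s, r, s_next) samples collected under a fixed policy."""
    k = len(feature_fn(transitions[0][0]))
    A = [[0.0] * k for _ in range(k)]
    b = [0.0] * k
    for s, r, s_next in transitions:
        f, f_next = feature_fn(s), feature_fn(s_next)
        for i in range(k):
            b[i] += f[i] * r
            for j in range(k):
                A[i][j] += f[i] * (f[j] - gamma * f_next[j])
    return solve(A, b)
```

With one-hot features this reduces to solving the empirical Bellman equations exactly on the observed states.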
State-action value functions. As estimating a state-value function is not directly
useful for obtaining an improved policy without a model, we can instead estimate a
state-action value function as follows:
\[ q = r + \gamma P_{\mu,\pi}\, q \]
\[ \Phi\theta = r + \gamma P_{\mu,\pi} \Phi\theta \]
\[ \theta = [(I - \gamma P_{\mu,\pi})\Phi]^{-1} r \]
However, this approach has two drawbacks. The first is that it is difficult to get
an unbiased estimate of $\theta$. The second is that when we apply the Bellman operator
to q, the result may lie outside the space spanned by the features. For this reason, we
instead consider the least-squares projection $\Phi(\Phi^\top\Phi)^{-1}\Phi^\top$, i.e.,
\[ q = \Phi(\Phi^\top\Phi)^{-1}\Phi^\top \left[ r + \gamma P_{\mu,\pi}\, q \right]. \]
In practice, of course, we do not have the transitions $P_{\mu,\pi}$ but estimate them
from data. Note that for any deterministic policy $\pi$ and a set of T data points
$(s_t, a_t, r_t, s'_t)_{t=1}^T$, we have
\begin{align*}
P_{\mu,\pi} \Phi &= \sum_{s'} P(s' \mid s, a)\, \Phi(s', \pi(s')) \\
&\approx \frac{1}{T} \sum_{t=1}^T \hat{P}(s'_t \mid s_t, a_t)\, \Phi(s'_t, \pi(s'_t)),
\end{align*}
where for $\hat{P}$ one can take the simple empirical transition matrix mentioned previously.
This equation can be used to maintain q-factors with $q(s, a) = f(s, a)^\top \theta$ to obtain
an empirical estimate of the Bellman operator as summarized in Algorithm 8.3.
Approximate algorithms can also be defined for backwards induction. The general
algorithmic structure remains the same as in exact backwards induction, but the
exact steps are replaced by approximations. Applying approximate value iteration
may be necessary for two reasons. Firstly, it may not be possible to update the value
function for all states. Secondly, the set of available value function representations
may not be complex enough to capture the true value function.
The first algorithm is approximate backwards induction. Let us start with the basic
backwards induction algorithm:
\[ V^*_t(s) = \max_{a \in A} \left\{ r(s, a) + \gamma \sum_{s'} P_\mu(s' \mid s, a)\, V^*_{t+1}(s') \right\} \]
This is essentially the same both for finite and infinite-horizon problems. If we have
to pick the value function from a set of functions V, we can use the following value
function approximation.
Let our estimate at time t be $v_t \in \mathcal{V}$, with $\mathcal{V}$ being a set of (possibly parametrized)
functions. Let $\hat{V}_t$ be our one-step update given the value function approximation at
the next step, $v_{t+1}$. Then $v_t$ will be the closest approximation in that set.
Iterative approximation
\[ \hat{V}_t(s) = \max_{a \in A} \left\{ r(s, a) + \gamma \sum_{s'} P_\mu(s' \mid s, a)\, v_{t+1}(s') \right\} \]
\[ v_t = \arg\min_{v \in \mathcal{V}} \|v - \hat{V}_t\| \]
8.3 Approximate Value Iteration 185
The above minimization can for example be performed by gradient descent.
Consider the case where v is a parametrized function from a set of parametrized value
functions $\mathcal{V}$ with parameters $\theta$. Then it is sufficient to maintain the parameters $\theta_t$
at any time t. These can be updated with a gradient scheme at every step. In the
online case, our next-step estimates can be obtained by gradient descent using a step
size sequence $(\alpha_t)_t$:
\[ \theta_{t+1} = \theta_t - \alpha_t \nabla_\theta \|v_t - \hat{V}_t\| \]
This gradient descent algorithm can also be made stochastic, if we sample $s'$ from
the probability distribution $P_\mu(s' \mid s, a)$ used in the iterative approximation. The
next sections give some examples.
In state aggregation, multiple different states with identical properties (with respect to
rewards and transition probabilities) are identified in order to obtain a new aggregated
state in an aggregated MDP with smaller state space. Unfortunately, it is very rarely
the case that aggregated states are really indistinguishable with respect to rewards and
transition probabilities. Nevertheless, as we can see in the example below, aggregation
can significantly simplify computation through the reduction of the size of the state
space.
\[ v(s) = \theta(k), \quad \text{if } s \in S_k,\ k \ne 0, \]
In the above example, the value of every state corresponds to the value of the k-th
set in the partition. Of course, this is only a very rough approximation if the sets Sk
are very large. However, this is a convenient approach to use for gradient descent
updates, as only one parameter needs to be updated at every step.
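Consider the case $\|\cdot\| = \|\cdot\|_2^2$. For $s_t \in S_k$ and some step size sequence $(\alpha_t)_t$:
\[ \theta_{t+1}(k) = (1 - \alpha_t)\, \theta_t(k) + \alpha_t \max_{a \in A} \left\{ r(s_t, a) + \gamma \sum_{j \in \hat{S}} P(j \mid s_t, a)\, \theta_t(f_k(j)) \right\}, \]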
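A single aggregated update of this kind can be sketched as follows. The list `cluster` (mapping each state to the index of its aggregate set) plays the role of the feature mapping, and the function name `aggregation_update` and the nested-list layout of `r` and `P` are illustrative assumptions.

```python
def aggregation_update(theta, s, r, P, cluster, gamma, alpha):
    """One update of the parameter of the aggregate set containing state s.

    theta[k] is the shared value of set k, r[s][a] the reward, and
    P[s][a][j] the probability of moving from s to j under action a.
    """
    k = cluster[s]
    backup = max(
        r[s][a] + gamma * sum(p * theta[cluster[j]] for j, p in enumerate(P[s][a]))
        for a in range(len(r[s]))
    )
    new_theta = list(theta)
    # Only the parameter of the visited aggregate set changes.
    new_theta[k] = (1 - alpha) * theta[k] + alpha * backup
    return new_theta
```

Because only one entry of `theta` changes per step, the cost of an update depends on the number of successor states rather than on the number of aggregate sets.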
A more refined approach is to choose some representative states and try to approx-
imate the value function of all other states as a convex combination of the value of
the representative states.
The feature mapping is used to perform the convex combination. Usually, f i (s)
is larger for representative states i which are “closer” to s. In general, the feature
mapping is fixed, and we want to find a set of parameters for the values of the
representative states. At time t, for each representative state i, we obtain a new
estimate of its value function and plug it back in.
[Fig. 8.4: value function error $\|u - V\|$ (vertical axis) against the number of samples (horizontal axis). Caption fragment: "Example 7.2, extended to 100 states. The second involves randomly generated MDPs with two actions and 100 states".]
For $i \in \hat{S}$:
\[ \theta_{t+1}(i) = \max_{a \in A} \left\{ r(i, a) + \gamma \int v_t(s) \, dP(s \mid i, a) \right\} \tag{8.3.1} \]
with
\[ v_t(s) = \sum_{i=1}^n f_i(s)\, \theta_t(i). \]
When the integration in (8.3.1) is not possible, we may instead approximate the
expectation with a Monte Carlo method. One particular problem with this method
arises when the transition kernel is very sparse. Then we are basing our estimates on
approximate values of other states, which may be very far from any other represen-
tative state. This is illustrated in Fig. 8.4, which presents the value function error for
the chain environment of Example 7.2 and random MDPs. Due to the linear structure
of the chain environment, the states are far from each other. In contrast, the random
MDPs are generally both quite dense and the state distribution for any particular
policy mixes rather fast. Thus, states in the former tend to have very different values
and in the latter very similar ones.
The problems with the representative state update can be alleviated through Bellman
error minimization. The idea here is to obtain a value function that is as consistent
\[ \min_\theta \|v_\theta - \mathcal{L} v_\theta\|. \]
This is different from the approximate backwards induction algorithm we saw previ-
ously, since the same parameter θ appears in both terms inside the norm. Furthermore,
if the norm has support in all of the state space and the approximate value function
space contains the actual set of value functions then the minimum is 0 and we obtain
the optimal value function.
Gradient update
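Then the gradient update becomes $\theta_{t+1} = \theta_t - \alpha D_{\theta_t}(s_t) \nabla_\theta D_{\theta_t}(s_t)$, where
\[ \nabla_\theta D_{\theta_t}(s_t) = \nabla_\theta v_{\theta_t}(s_t) - \int_S \nabla_\theta v_{\theta_t}(j) \, dP(j \mid s_t, a^*_t) \]
with $a^*_t \triangleq \arg\max_{a \in A} \left\{ r(s_t, a) + \gamma \int_S v_\theta(j) \, dP(j \mid s_t, a) \right\}$.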
We can also construct a q-factor approximation for the case where no model
is available. This can be simply done by replacing P with the empirical transition
observed at time t.
In the previous section, we have seen how to use gradient methods for value function
approximation. It is also possible to use these methods to estimate policies—the
only necessary ingredients are a policy representation and a way to evaluate a policy.
The representation is usually parametric, but non-parametric representations are also
possible. A common choice for parametrized policies is to use a feature function
$f : S \times A \to \mathbb{R}^k$ and a linear parametrization with parameters $\theta \in \mathbb{R}^k$ leading to
the following Softmax distribution:
\[ \pi(a \mid s) = \frac{e^{F(s, a)}}{\sum_{a' \in A} e^{F(s, a')}}, \qquad F(s, a) \triangleq \theta^\top f(s, a) \tag{8.4.1} \]
8.4 Policy Gradient 189
As usual, we would like to find a policy maximizing expected utility. Policy gradient
algorithms employ gradient ascent on the expected utility to find a locally maximizing
policy. Here we focus on the discounted reward criterion with discount factor γ, where
a policy’s expected utility is defined with respect to a starting state distribution y so
that
\[ E^\pi_y(U) = \sum_s y(s)\, V^\pi(s) = \sum_s y(s) \sum_h P^\pi(h \mid s_1 = s)\, U(h), \tag{8.4.2} \]
where U (h) is the utility of a trajectory h. This definition leads to a number of simple
expressions for the gradient of the expected utility, including what is known as the
policy gradient theorem of [9]. For notational simplicity, we omit the subscript θ for
the policy π.
Theorem 8.4.1 Assuming that the reward only depends on the state, for any
$\theta$-parametrized policy space $\Pi$, the gradient of the utility from starting state
distribution y can be equivalently written in the three following forms:
to denote the $\gamma$-discounted sum of state visits. Further, $h \in (S \times A)^*$ denotes a
state-action history, $P^\pi_\mu(h)$ its probability under the policy $\pi$ in the MDP $\mu$, and $U(h)$ is
the utility of history h.
where y is a starting state distribution vector and $P^\pi_\mu$ is the transition matrix resulting
from applying policy $\pi$ to $\mu$. Computing the derivative using matrix calculus gives
\[ \nabla_\theta E\, U = y^\top \nabla_\theta (I - \gamma P^\pi_\mu)^{-1} r, \]
as the only term involving $\theta$ is $\pi$. The derivative of the matrix inverse can be written
as
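\[ \nabla_\theta E^\pi_\mu U = \gamma\, y^\top X\, \nabla_\theta P^\pi_\mu\, X r \]
We are now ready to prove claim (8.4.4). Define the expected state visitation from
the starting distribution to be $x^\top \triangleq y^\top X$, so that we obtain
\begin{align*}
\nabla_\theta E^\pi_\mu U &= \gamma\, x^\top \nabla_\theta P^\pi_\mu\, X r \\
&= \gamma \sum_s x(s) \sum_{a, s'} P_\mu(s' \mid s, a)\, \nabla_\theta \pi(a \mid s)\, V^\pi_\mu(s') \\
&= \gamma \sum_s x(s) \sum_a \nabla_\theta \pi(a \mid s) \sum_{s'} P_\mu(s' \mid s, a)\, V^\pi_\mu(s') \\
&= \gamma \sum_s x(s) \sum_a \nabla_\theta \pi(a \mid s)\, Q^\pi_\mu(s, a).
\end{align*}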
as $\nabla_\theta \ln P^\pi_\mu(h) = \frac{1}{P^\pi_\mu(h)} \nabla_\theta P^\pi_\mu(h)$.
For finite MDPs, we can obtain x π from the state occupancy matrix (6.5.2) by left
multiplication with the initial state distribution y. However, in the context of gradient
methods, it makes more sense to use a stochastic estimate of x π to calculate the
gradient, since
\[ \nabla_\theta E^\pi U = E^\pi_y \sum_{s, t} \gamma^t\, \mathbb{I}\{s_t = s\} \sum_a \nabla_\theta \pi(a \mid s)\, Q^\pi(s, a). \]
For the discounted reward criterion, we can easily obtain unbiased samples through
geometric stopping (see Exercise 6.1).
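Geometric stopping can be sketched as follows: the trajectory continues with probability $\gamma$ at every step, so it stops at time t with probability $(1 - \gamma)\gamma^t$ and the returned state is a draw from the normalized discounted state visitation. The simulator signature `step(s, a, rng)` and the function name `sample_discounted_state` are hypothetical.

```python
import random

def sample_discounted_state(step, policy, s0, gamma, rng):
    """Return the state at which a gamma-geometric stopping time halts the trajectory."""
    s = s0
    # Continue with probability gamma, otherwise stop and report the current state.
    while rng.random() < gamma:
        s = step(s, policy(s), rng)
    return s
```

Averaging the inner sum of the gradient over states sampled in this way, and rescaling by $1/(1 - \gamma)$, gives a stochastic estimate of the discounted sum over time steps.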
Importance sampling
The last formulation is especially useful as it allows us to use importance sampling
to compute the gradient even on data obtained for different policies, which in general
is more data efficient. First note that for any history $h \in (S \times A)^*$, we have
\[ P^\pi_\mu(h) = \prod_{t=1}^T P_\mu(s_t \mid s^{t-1}, a^{t-1})\, P^\pi(a_t \mid s^t, a^{t-1}) \tag{8.4.7} \]
without any Markovian assumptions on the model or policy. We can now rewrite
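(8.4.6) in terms of the expectation with respect to an alternative policy $\pi'$ as
\begin{align*}
\nabla E^\pi_\mu U &= E^{\pi'}_\mu \left[ U(h)\, \nabla \ln P^\pi_\mu(h)\, \frac{P^\pi_\mu(h)}{P^{\pi'}_\mu(h)} \right] \\
&= E^{\pi'}_\mu \left[ U(h)\, \nabla \ln P^\pi_\mu(h) \prod_{t=1}^T \frac{\pi(a_t \mid s^t, a^{t-1})}{\pi'(a_t \mid s^t, a^{t-1})} \right],
\end{align*}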
since the μ-dependent terms in (8.4.7) cancel out. In practice the expectation would
be approximated through sampling trajectories h. Note that
\[ \nabla \ln P^\pi_\mu(h) = \sum_t \nabla \ln \pi(a_t \mid s^t, a^{t-1}) = \sum_t \frac{\nabla \pi(a_t \mid s^t, a^{t-1})}{\pi(a_t \mid s^t, a^{t-1})}. \]
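The importance-weighted estimate can be sketched for a tabular softmax policy. Several simplifications are assumed here and are not part of the text: the target policy is Markov with preferences `theta[s][a]`, the behaviour policy is given as a table of probabilities `behaviour[s][a]`, and the expectation is replaced by an average over sampled histories. The function names are illustrative.

```python
import math

def softmax(prefs):
    """Softmax over a list of action preferences."""
    m = max(prefs)
    e = [math.exp(x - m) for x in prefs]
    z = sum(e)
    return [x / z for x in e]

def is_policy_gradient(trajectories, theta, behaviour):
    """Estimate the gradient of expected utility with respect to theta.

    trajectories: list of (history, utility), history = [(s, a), ...],
    sampled by following the behaviour policy.
    """
    grad = [[0.0] * len(row) for row in theta]
    for history, utility in trajectories:
        weight = 1.0  # product of pi(a_t | s_t) / pi'(a_t | s_t)
        score = [[0.0] * len(row) for row in theta]  # gradient of ln P(h)
        for s, a in history:
            pi_s = softmax(theta[s])
            weight *= pi_s[a] / behaviour[s][a]
            for b in range(len(pi_s)):
                # d/d theta[s][b] of ln pi(a | s) for a softmax policy
                score[s][b] += (1.0 if b == a else 0.0) - pi_s[b]
        for s in range(len(theta)):
            for b in range(len(theta[s])):
                grad[s][b] += utility * weight * score[s][b] / len(trajectories)
    return grad
```

In a one-state problem with two actions, where only the second action yields utility 1, histories drawn in the behaviour policy's proportions reproduce the exact gradient of $\pi(1)$ with respect to the preferences.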
The first design choice in any gradient algorithm is how to parametrize the policy. For
the discrete case, a common parametrization is to have a separate and independent
parameter for each state-action pair, i.e., $\theta_{s,a} = \pi(a \mid s)$. This leads to a particularly
simple expression for the second form (8.4.4), which is $\partial/\partial\theta_{s,a}\, E^\pi_\mu U = y(s)\, Q(s, a)$.
However, it is easy to see that in this case the parametrization will lead to all param-
eters increasing if rewards are positive. This can be avoided by either a Softmax
parametrization or by subtracting a bias term (e.g. the value of the state Vμπ (s)) from
the derivative. Nevertheless, this parametrization implies stochastic discrete policies.
We could also suitably parametrize continuous policies. For example, if $A \subset \mathbb{R}^n$,
we can consider a linear policy. Most of the derivation carries over to Euclidean
state-action spaces. In particular, the second form (8.4.4) is also suitable for deterministic
policies.
Finally, in practice, we may not need to accurately calculate the expectations
involved in the gradient. Sample trajectories are sufficient to update the gradient in
a meaningful way, especially for the third form (8.4.5), as we can naturally sample
from the distribution of trajectories. However, the fact that this form doesn’t need
a Markovian assumption also means that it cannot take advantage of Markovian
environments.
Policy gradient methods are useful, especially in cases where the environment
model or value function is extremely complicated, while the optimal policy itself
might be quite simple. The main difficulty lies in obtaining an appropriate estimate
of the gradient itself, but convergence to a local maximum is generally good as long
as we are adjusting the parameters in a gradient-related direction (in the sense of
Assumption 7.1.1iii).
8.5 Examples
4 Essentially, this is a linear model of the form $s_{t+1} \mid s_t = s, a_t = a \sim \mathcal{N}(\mu_a s, \Sigma_a)$, where $\mu_a$
has a normal prior and $\Sigma_a$ a Wishart prior.
8.6 Further Reading 193
[Fig. 8.5 panels (surviving titles): LG, RBF; kNN, RBF. Axes: $s_1$, $s_2$ (state) and $V$ (value).]
Fig. 8.5 Estimated value function of a uniformly random policy on the 2-dimensional state-space
of the pendulum problem. Results are shown for a k-nearest neighbour model (kNN) with k = 3
and a Bayesian linear-Gaussian model (LG) for either the case when the model uses the plain state
information (state) or a 256-dimensional RBF embedding (RBF)
For finding the optimal value function we must additionally consider the question
of which algorithm to use. In Fig. 8.6 we see the effect of choosing either approxi-
mate value iteration (AVI) or representative state representations and value iteration
(RSVI) for the inverted pendulum and mountain car.
Among value function approximation methods, the two most well known are fitted Q-
iteration [10] and fitted value iteration, which has been analysed in [11]. Minimizing
the Bellman error [12–14] is generally a good way to ensure that approximate value
iteration is stable.
In approximate policy iteration methods, one needs to approximate both the value
function and policy. An empirical approximation of the value function is maintained
in rollout sampling policy iteration [5, 6]. Alternatively, one can employ least-squares
methods [7, 8, 15].
[Fig. 8.6 panels (surviving titles): RSVI, Pendulum; RSVI, Mountaincar. Axes: $s_1$, $s_2$ (state) and $V$ (value).]
Fig. 8.6 Estimated optimal value function for the pendulum problem. Results are shown for
approximate value iteration (AVI) with a Bayesian linear-Gaussian model, and a representative state
representation (RSVI) with an RBF embedding. Both the embedding and the states where the value
function is approximated are a 16 × 16 uniform grid over the state space
The general technique of state aggregation [16, 17] is applicable to a variety of
reinforcement learning algorithms. While the more general question of selecting
appropriate features is open, there has been some progress in the domain of
feature reinforcement learning [18]. In general, learning internal representations (i.e.,
features) has been a prominent aspect of neural network research [19]. Even if it is
unclear to what extent recently proposed approximation architectures that employ
deep learning actually learn useful representations, they have been successfully used
in combination with simple reinforcement learning algorithms [20]. Another inter-
esting direction is to establish links between features and approximately sufficient
statistics [21, 22].
Finally, the policy gradient theorem in the state visitation form was first proposed
by Sutton et al. [9], while Williams [23] was the first to use the log-ratio trick in
Eq. (8.4.5) in reinforcement learning. To our knowledge, the analytical gradient has
not actually been applied (or indeed, described) in prior literature. Extensions of the
policy gradient idea are also natural. They have also been used in a Bayesian setting
by Ghavamzadeh and Engel [14], while the natural gradient has been proposed by
Kakade [24]. A survey of policy gradient methods can be found in [25].
8.7 Exercises
Exercise 8.1 (Enlarging the function space) Consider the problem in Example 8.1.
What would be a simple way to extend the space of value functions from the three
given candidates to an infinite number of value functions? How could we get a good
fit?
Exercise 8.2 (Enlarging the policy) Consider Example 8.2. This represents an
example of linear deterministic policies. In which two ways can this policy space be
extended and how?
Exercise 8.3 Find the derivative for minimizing the cost function in (8.1.2) for the
following two cases:
1. p = 2, κ = 2.
2. p → ∞, κ = 1.
References
1. Sutton, R.S., Barto, A.G.: Reinforcement Learning: An Introduction. MIT Press (1998)
2. Lagoudakis, M.G., Parr, R.: Reinforcement learning as classification: leveraging modern clas-
sifiers. In: Machine Learning, Proceedings of the 20th International Conference (ICML 2003),
pp. 424–431. AAAI Press (2003)
3. Even-Dar, E., Mansour, Y.: Approximate equivalence of Markov decision processes. In:
Computational Learning Theory and Kernel Machines, 16th Annual Conference on Computational
Learning Theory and 7th Kernel Workshop, COLT/Kernel 2003. Lecture Notes in Computer
Science, vol. 2777, pp. 581–594. Springer (2003)
4. Bertsekas, D.P., Tsitsiklis, J.N.: Neuro-Dynamic Programming. Athena Scientific (1996)
5. Dimitrakakis, C., Lagoudakis, M.G.: Rollout sampling approximate policy iteration. Mach.
Learn. 72(3), 157–171 (2008)
6. Dimitrakakis, C., Lagoudakis, M.G.: Algorithms and bounds for rollout sampling approximate
policy iteration. In: Girgin, S., Loth, M., Munos, R., Preux, P., Ryabko, D. (eds.) Recent
Advances in Reinforcement Learning, 8th European Workshop, EWRL 2008. Lecture Notes
in Computer Science, vol. 5323, pp. 27–40. Springer (2008)
7. Bradtke, S.J., Barto, A.G.: Linear least-squares algorithms for temporal difference
learning. Mach. Learn. 22(1), 33–57 (1996)
8. Lagoudakis, M.G., Parr, R.: Least-squares policy iteration. J. Mach. Learn. Res. 4,
1107–1149 (2003)
9. Sutton, R.S., McAllester, D.A., Singh, S.P., Mansour, Y.: Policy gradient methods for reinforce-
ment learning with function approximation. In: Advances in Neural Information Processing
Systems 12, pp. 1057–1063. The MIT Press (1999)
10. Antos, A., Munos, R., Szepesvári, C.: Fitted Q-iteration in continuous action-space MDPs. In:
Advances in Neural Information Processing Systems, vol. 20, pp. 9–16 (2008)
11. Munos, R., Szepesvári, C.: Finite-time bounds for fitted value iteration. J. Mach. Learn. Res.
9, 815–857 (2008)
12. Antos, A., Szepesvári, C., Munos, R.: Learning near-optimal policies with Bellman-residual
minimization based fitted policy iteration and a single sample path. Mach. Learn. 71(1), 89–129
(2008)
13. Dimitrakakis, C.: Monte-Carlo utility estimates for Bayesian reinforcement learning. In:
Proceedings of the 52nd IEEE Conference on Decision and Control, CDC 2013, pp. 7303–7308.
IEEE (2013)
14. Ghavamzadeh, M., Engel, Y.: Bayesian policy gradient algorithms. In: Advances in Neural
Information Processing Systems, vol. 19, pp. 457–464. MIT Press (2006)
15. Boyan, J.A.: Technical update: least-squares temporal difference learning. Mach. Learn.
49(2), 233–246 (2002)
16. Singh, S.P., Jaakkola, T.S., Jordan, M.I.: Reinforcement learning with soft
state aggregation. Adv. Neural Inf. Process. Syst. 7, 361–368 (1995)
17. Bernstein, A.: Adaptive state aggregation for reinforcement learning. Master’s thesis, Technion
Israel Institute of Technology (2007)
18. Hutter, M.: Feature reinforcement learning: part I: unstructured MDPs. J. Artif. General
Intell. 1, 3–24 (2009)
19. Rumelhart, D.E., Hinton, G.E., Williams, R.J.: Learning internal representations by error prop-
agation. In: Rumelhart, D.E., McClelland, J.L., et al. (eds.) Parallel Distributed Processing.
Foundations, vol. 1, pp. 318–362. MIT Press, Cambridge (1987)
20. Mnih, V., Kavukcuoglu, K., Silver, D., et al.: Human-level control through deep reinforcement
learning. Nature 518(7540), 529–533 (2015)
21. Dimitrakakis, C., Tziortziotis, N.: ABC reinforcement learning. In: Proceedings of the 30th
International Conference on Machine Learning, ICML 2013, pp. 684–692. JMLR.org (2013)
22. Dimitrakakis, C., Tziortziotis, N.: Usable ABC reinforcement learning. In: NIPS 2014 Work-
shop: ABC in Montreal (2014)
23. Williams, R.J.: Simple statistical gradient-following algorithms for connectionist
reinforcement learning. Mach. Learn. 8(3–4), 229–256 (1992)
24. Kakade, S.: A natural policy gradient. Adv. Neural Inf. Process. Syst. 14, 1531–1538 (2002)
25. Peters, J., Schaal, S.: Policy gradient methods for robotics. In: 2006 IEEE/RSJ International
Conference on Intelligent Robots and Systems, IROS 2006, pp. 2219–2225. IEEE (2006)
Chapter 9
Bayesian Reinforcement Learning
9.1 Introduction
In this chapter, we return to the setting of subjective probability and utility by for-
malizing the reinforcement learning problem as a Bayesian decision problem and
solving it directly. In the Bayesian setting, we are acting in an unknown environ-
ment, and we represent our subjective belief about the environment in the form of a
probability distribution. We shall first consider the case of acting in unknown MDPs,
which is the focus of the reinforcement learning problem. We will examine a few dif-
ferent heuristics for maximizing expected utility in the Bayesian setting and contrast
them with tractable approximations to the Bayes-optimal solution. We then present
extensions of these ideas to continuous domains. Finally, we draw connections of
this setting to partially observable MDPs.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022
C. Dimitrakakis and R. Ortner, Decision Making Under Uncertainty
and Reinforcement Learning, Intelligent Systems Reference Library 223,
https://doi.org/10.1007/978-3-031-07614-5_9
In an MDP, the current state $s_t$ and action $a_t$
determine the distribution of the immediate reward, as well as that of the next state,
as described in Definition 6.3.1. For a specific MDP μ the probability of the immediate
reward is given by $P_\mu(r_t \mid s_t, a_t)$, with expectation
$\bar{r}_\mu(s, a) \triangleq E_\mu(r_t \mid s_t = s, a_t = a)$, while the next state
distribution is given by $P_\mu(s_{t+1} \mid s_t, a_t)$. If these quantities are known,
or if we can at least draw samples from these distributions, it is
possible to employ (approximate) dynamic programming to obtain the optimal policy
and value function for the MDP.
More precisely, when μ is known, we wish to find a policy $\pi : S \to A$ maximizing
the utility in expectation. This requires us to solve the maximization problem
$\max_\pi E^\pi_\mu U$, where the utility is an additive function of rewards,
$U = \sum_{t=1}^{T} r_t$. This
can be accomplished using standard algorithms, such as value or policy iteration.
However, knowing μ is contrary to the problem definition.
In Chap. 7 we have seen a number of stochastic approximation algorithms which
allow us to learn the optimal policy for a given MDP eventually. However, these
generally give few guarantees on the performance of the policy while learning.
A good way of learning the optimal policy in an MDP should trade off explor-
ing the environment to obtain further knowledge and simultaneously exploiting this
knowledge.
Within the subjective probabilistic framework, there is a natural formalization
for learning optimal behavior in an MDP. We define a prior belief ξ on the set
of MDPs M, and then find the policy that maximizes the expected utility with
respect to the prior, $E^\pi_\xi(U)$. The structure of the unknown MDP process is shown in
Fig. 9.1. We have previously seen two simpler sequential decision problems in the
Bayesian setting. The first was the simple optimal stopping procedure in Sect. 5.2.2,
which introduced the backwards induction algorithm. The second was the optimal
experiment design problem, which resulted in the bandit Markov decision process
of Sect. 6.2. Now we want to formulate the reinforcement learning problem as a
Bayesian maximization problem.
Let ξ be a prior over M and Π be a set of policies. Then the expected utility of
the optimal policy, over some fixed starting state distribution, is
$$U^*_\xi \triangleq \max_{\pi \in \Pi} E(U \mid \pi, \xi) = \max_{\pi \in \Pi} \int_{\mathcal{M}} E(U \mid \pi, \mu) \, d\xi(\mu). \tag{9.1.1}$$
Solving this optimization problem and hence finding the optimal policy is however
not easy, as in general the optimal policy π must incorporate the information it
obtained while interacting with the MDP. Formally, this means that it must map from
histories to actions. For any such history-dependent policy, the action we take at
step t must depend on what we observed in previous steps 1, . . . , t − 1. Consequently,
an optimal policy must also specify actions to be taken in all future time steps and
accordingly take into account the learning that will take place up to each future
time step. Thus, in some sense, the value of information is automatically taken into
account in this model. This is illustrated in the following example.
Example 9.1 Consider two MDPs μ1 , μ2 with a single state (i.e., S = {1}) and
actions A = {1, 2}. In the MDP μi , whenever you take action at = i you obtain
reward rt = 1, otherwise you obtain reward 0. If we only consider policies that do
not take into account the history so far, the expected utility of such a policy π taking
action i with probability π(i) is
$$E^\pi_\xi U = T \sum_i \xi(\mu_i) \, \pi(i)$$
for horizon T. Consequently, if the prior ξ is not uniform, the optimal policy selects
the action corresponding to the MDP with the highest prior probability. Then, the
maximal expected utility is
$$T \max_i \xi(\mu_i).$$
However, observing the reward after choosing the first action, we can determine the
true MDP. Consequently, an improved policy is the following: First select the best
action with respect to the prior, and then switch to the best action for the MDP we
have identified to be the true one. Then, our utility improves to
$\max_i \xi(\mu_i) + (T - 1)$: the first step earns reward 1 with probability
$\max_i \xi(\mu_i)$, and each of the remaining $T - 1$ steps earns reward 1.
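The comparison in Example 9.1 can be checked numerically. The following is a minimal sketch, assuming exactly two MDPs and deterministic rewards as in the example; the function names are hypothetical and chosen only for illustration. It computes the best utility of a history-independent policy and, by enumerating the two possible true MDPs, the utility of the policy that switches after the first observation.

```python
def memoryless_utility(prior, T):
    """Best expected utility without using the history: T * max_i xi(mu_i)."""
    return T * max(prior)


def enumerate_adaptive(prior, T):
    """Expected utility of 'try the prior-best action, then switch if needed'.

    Assumes two MDPs (indices 0 and 1), where MDP i pays reward 1 for
    action i and 0 otherwise, as in Example 9.1.
    """
    first = 0 if prior[0] >= prior[1] else 1
    expected = 0.0
    for true_mdp in (0, 1):
        total = 0.0
        action = first
        for t in range(T):
            reward = 1.0 if action == true_mdp else 0.0
            total += reward
            if t == 0:
                # Reward 1 confirms the first guess; reward 0 reveals the other MDP.
                action = first if reward == 1.0 else 1 - first
        expected += prior[true_mdp] * total
    return expected


if __name__ == "__main__":
    prior, T = [0.7, 0.3], 10
    print(memoryless_utility(prior, T), enumerate_adaptive(prior, T))
```

With a prior of (0.7, 0.3) and T = 10, the adaptive policy gains the difference between 0.7 + 9 and 7, which is the value of the information obtained in the first step.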
where here and in the following we use the notation $s^t$ to abbreviate $(s_1, \ldots, s_t)$
and $s_t^{t+k}$ for $(s_t, \ldots, s_{t+k})$, and accordingly $a^t$, $r^t$, $a_t^{t+k}$, and
$r_t^{t+k}$. Important special cases are the set of blind policies $\Pi_0$ and the set of
memoryless policies $\Pi_1$. The set $\bar{\Pi}_k \subset \Pi_k$ contains all stationary
policies in $\Pi_k$. Finally, policies may be indexed
by some parameter set Θ, in which case the set of parameterized policies is given
by $\Pi_\Theta$.
Let us now turn to the problem of learning an optimal policy. Learning means that
observations we make will affect our belief, so we will first take a closer look at
this belief update. Given that, we shall examine exact and approximate
methods of policy optimization.
Strictly speaking, in order to update our belief, we must condition the prior
distribution on all the information. This includes the sequence of observations up to
this point in time, including the states $s^t$, actions $a^{t-1}$, and rewards $r^{t-1}$,
as well as the policy π that we followed. Let $D_t = (s^t, a^{t-1}, r^{t-1})$ be the
observed data up to time t. Then the posterior measure for any measurable subset B
of the set of all MDPs M is
$$\xi(B \mid D_t, \pi) = \frac{\int_B P^\pi_\mu(D_t) \, d\xi(\mu)}{\int_{\mathcal{M}} P^\pi_\mu(D_t) \, d\xi(\mu)}.$$
However, as we shall see in the following remark, we can usually¹ ignore the policy
itself when calculating the posterior.
Remark 9.1.1 The dependence on the policy can be removed, since the posterior
is the same for all policies that put non-zero mass on the observed data. Indeed, for
$D_t \sim P^\pi_\mu$ it is easy to see that for all $\pi' \neq \pi$ such that
$P^{\pi'}_\mu(D_t) > 0$, it holds that
$$\xi(B \mid D_t, \pi) = \xi(B \mid D_t, \pi').$$
The proof is left as an exercise for the reader. In the specific case of MDPs, the
posterior calculation is easy to perform incrementally. This also more clearly demonstrates
why there is no dependence on the policy. Let $\xi_t$ be the (random) posterior at time t.
Then the next-step belief is given by
$$
\begin{aligned}
\xi_{t+1}(B) \triangleq \xi(B \mid D_{t+1})
&= \frac{\int_B P^\pi_\mu(D_{t+1}) \, d\xi(\mu)}{\int_{\mathcal{M}} P^\pi_\mu(D_{t+1}) \, d\xi(\mu)} \\
&= \frac{\int_B P_\mu(s_{t+1}, r_t \mid s_t, a_t) \, \pi(a_t \mid s^t, a^{t-1}, r^{t-1}) \, d\xi(\mu \mid D_t)}{\int_{\mathcal{M}} P_\mu(s_{t+1}, r_t \mid s_t, a_t) \, \pi(a_t \mid s^t, a^{t-1}, r^{t-1}) \, d\xi(\mu \mid D_t)} \\
&= \frac{\int_B P_\mu(s_{t+1}, r_t \mid s_t, a_t) \, d\xi_t(\mu)}{\int_{\mathcal{M}} P_\mu(s_{t+1}, r_t \mid s_t, a_t) \, d\xi_t(\mu)}.
\end{aligned}
$$
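For a finite set of candidate MDPs, the last line of this derivation reduces to reweighting each model by the likelihood of the observed transition. The sketch below illustrates this under two assumptions that are not in the text: rewards are Bernoulli, and the reward and next state are independent given the current state and action. The function name `update_belief` and the `(P, R)` model layout are hypothetical.

```python
import numpy as np


def update_belief(belief, models, s, a, r, s_next):
    """One-step posterior update over a finite set of MDPs.

    belief : array of shape (m,), current weights xi_t(mu)
    models : list of (P, R) pairs, P[s, a, s'] transition probabilities,
             R[s, a] probability of reward 1 (Bernoulli rewards, an assumption)
    """
    likelihood = np.empty(len(models))
    for i, (P, R) in enumerate(models):
        p_reward = R[s, a] if r == 1 else 1.0 - R[s, a]
        # P_mu(s', r | s, a), assuming reward and next state are independent
        likelihood[i] = P[s, a, s_next] * p_reward
    unnormalized = likelihood * belief
    evidence = unnormalized.sum()
    if evidence == 0.0:
        raise ValueError("observation has zero probability under the belief")
    return unnormalized / evidence


if __name__ == "__main__":
    P0 = np.array([[[0.9, 0.1]], [[0.5, 0.5]]])  # shape (S=2, A=1, S=2)
    P1 = np.array([[[0.1, 0.9]], [[0.5, 0.5]]])
    R = np.full((2, 1), 0.5)
    prior = np.array([0.5, 0.5])
    print(update_belief(prior, [(P0, R), (P1, R)], s=0, a=0, r=1, s_next=1))
```

The policy does not appear anywhere in the update, which mirrors the cancellation of the policy term in the derivation.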
¹ The exception involves any type of inference where $P^\pi_\mu(D_t)$ is not directly
available. This includes methods of approximate Bayesian computation [1], which use
trajectories from past policies for approximation. See [2] for an example of this in
reinforcement learning.
The above calculation is easy to perform for arbitrarily complex MDPs when the
set M is finite. The posterior calculation is also simple under certain conjugate
priors, such as the Dirichlet-multinomial prior for transition distributions.
9.2 Finding Bayes-Optimal Policies
The problem of policy optimization in the Bayesian case is much harder than when
the MDP is known. This is simply because we have to consider history-dependent
policies, which makes the policy space much larger.
In this section, we first consider two simple heuristics for finding sub-optimal
policies. Then we show how policy gradient methods can be extended to the Bayesian
case to obtain Bayes-optimal policies within a parametrized policy class. We then
consider finite look-ahead backwards induction to approximate the Bayes-optimal
policy. Generally, backwards induction in this setting requires building an
exponential-size tree. However, upper and lower bounds on the value function can
be used to create a branch and bound algorithm to improve efficiency. We end this
section by introducing two methods to construct such bounds, and discuss their relation
to one of the best-known Bayesian methods, posterior sampling.
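One simple heuristic is to simply calculate the expected MDP
$\bar{\mu}(\xi) \triangleq E_\xi \mu$ for the current belief ξ. In particular, the
transition kernel of the expected MDP is simply the expected transition kernel:
$$P_{\bar{\mu}(\xi)}(s' \mid s, a) = \int_{\mathcal{M}} P_\mu(s' \mid s, a) \, d\xi(\mu).$$
The heuristic then uses a memoryless policy that is optimal for the expected MDP,
$$\pi^*(\bar{\mu}(\xi)) \in \arg\max_{\pi \in \Pi_1} V^\pi_{\bar{\mu}(\xi)}.$$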
Fig. 9.2 The two MDPs and the expected MDP from Example 9.2
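The expected-MDP heuristic is straightforward to sketch when the belief is a weight vector over finitely many MDPs. The code below is illustrative only: it assumes a discounted infinite-horizon criterion and also averages the mean rewards (the text specifies only the transition kernel), and the names `expected_mdp` and `value_iteration` are hypothetical.

```python
import numpy as np


def expected_mdp(belief, P_list, R_list):
    """Average transition kernels P[s, a, s'] and mean rewards R[s, a] under the belief."""
    P_bar = np.tensordot(belief, np.stack(P_list), axes=1)
    R_bar = np.tensordot(belief, np.stack(R_list), axes=1)
    return P_bar, R_bar


def value_iteration(P, R, gamma=0.95, tol=1e-8, max_iter=10000):
    """Standard value iteration; returns the value function and a greedy memoryless policy."""
    V = np.zeros(P.shape[0])
    for _ in range(max_iter):
        V_new = (R + gamma * P @ V).max(axis=1)
        converged = np.max(np.abs(V_new - V)) < tol
        V = V_new
        if converged:
            break
    policy = (R + gamma * P @ V).argmax(axis=1)
    return V, policy


if __name__ == "__main__":
    # Single state, two actions; MDP i rewards action i (as in Example 9.1).
    P = np.ones((1, 2, 1))
    R0, R1 = np.array([[1.0, 0.0]]), np.array([[0.0, 1.0]])
    P_bar, R_bar = expected_mdp(np.array([0.7, 0.3]), [P, P], [R0, R1])
    print(value_iteration(P_bar, R_bar, gamma=0.5))
```

Because the returned policy is optimal only for the averaged model, it ignores how the belief will change with future observations.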
Unfortunately, the policy returned by this heuristic may be far from the Bayes-
optimal policy in Π1 , as shown by the following example.
Policy gradient (see Sect. 8.4) can also be performed in the Bayesian setting. For
this, we must restrict our policies to a parametrized policy space so that we can
differentiate them:
$$\Pi_\Theta \triangleq \{\pi_\theta \mid \theta \in \Theta\}.$$
The general idea is to find the policy parameter maximizing expected utility under
the current belief, i.e. solve the problem
$$\max_\theta E^{\pi_\theta}_\xi [U],$$
where the utility is defined with respect to some starting state distribution (see
Eq. 8.4.2). A policy gradient algorithm would simply move in the direction of the
gradient, which can be computed as
$$
\begin{aligned}
\nabla_\theta E^{\pi_\theta}_\xi [U]
&= \nabla_\theta \int_{\mathcal{M}} E^{\pi_\theta}_\mu [U] \, d\xi(\mu) \\
&= \int_{\mathcal{M}} \nabla_\theta E^{\pi_\theta}_\mu [U] \, d\xi(\mu) \\
&\approx \frac{1}{n} \sum_{i=1}^{n} \nabla_\theta E^{\pi_\theta}_{\mu^{(i)}} [U], \qquad \mu^{(i)} \sim \xi(\mu).
\end{aligned}
$$
Here, the integral is approximated by sampling MDPs from the belief, and
$\nabla_\theta E^{\pi_\theta}_{\mu^{(i)}} [U]$ is the standard policy gradient for a given
MDP $\mu^{(i)}$. Approximations to the gradient can be computed via rollouts.
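To make the sampled gradient concrete, here is a minimal sketch for the simplest possible case: single-state MDPs (bandits) with Bernoulli rewards, independent Beta beliefs over the mean rewards, and a blind softmax policy. All of these modelling choices, and the function names, are assumptions made for illustration. In this case the per-MDP gradient has a closed form, so no rollouts are needed.

```python
import numpy as np


def softmax(theta):
    z = np.exp(theta - theta.max())
    return z / z.sum()


def utility_gradient(theta, mean_rewards, T):
    """Exact gradient of E[U] = T * sum_a pi(a) r(a) for one sampled MDP."""
    pi = softmax(theta)
    return T * pi * (mean_rewards - pi @ mean_rewards)


def bayesian_policy_gradient(alpha, beta, T=10, n_samples=32, steps=200, lr=0.1, seed=0):
    """Gradient ascent on the Bayes-expected utility, averaging over sampled MDPs.

    alpha, beta : Beta belief parameters for each arm's mean reward (an assumption).
    """
    rng = np.random.default_rng(seed)
    theta = np.zeros(len(alpha))
    for _ in range(steps):
        # Draw n MDPs mu^(i) ~ xi; each is just a vector of mean rewards here.
        sampled = rng.beta(alpha, beta, size=(n_samples, len(alpha)))
        grad = np.mean([utility_gradient(theta, m, T) for m in sampled], axis=0)
        theta = theta + lr * grad
    return softmax(theta)


if __name__ == "__main__":
    print(bayesian_policy_gradient(np.array([8.0, 2.0, 2.0]), np.array([2.0, 8.0, 8.0])))
```

Since the policy here ignores the history, the result is only the best blind policy under the belief; an adaptive policy would need a history-dependent parametrization.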
An interesting question is how to define the policy parametrization. In order for a
policy to be adaptive it must take into account the complete history of observations.
This can be achieved even for simple policies, as long as we have a statistic mapping
histories to a rich enough representation. This is detailed in the following example.
Example 9.4 (Policies defined on a statistic $\phi : H \times A \to \mathbb{R}^{k \times |A|}$)
Define history $h_t = (s_1, a_1, r_1, \ldots, s_t)$ and a history-dependent sigmoid policy:
$$\pi_\theta(a \mid h_t) \propto \exp\bigl(\phi(h_t, a)^\top \theta\bigr)$$
The most direct way to actually solve the Bayesian reinforcement learning problem
of (9.1.1) is to cast it as yet another MDP. We have already seen how this can
be done with bandit problems in Sect. 6.2.2, but we shall now see that the general
methodology is also applicable to MDPs.
while the next belief deterministically depends on the next state, i.e.,
Example 9.5 Consider a set of MDPs M with A = {1, 2}, S = {1, 2}. In general
for any hyper-state ψt = (st , ξt ) each possible action-state transition results in one
specific new hyper-state. This is illustrated for the specific example in the following
diagram.
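[Diagram: starting from the hyper-state $\psi_t$, each of the four pairs of action
$a_t \in \{1, 2\}$ and next state $s_{t+1} \in \{1, 2\}$ leads to a distinct hyper-state
$\psi^1_{t+1}, \ldots, \psi^4_{t+1}$.]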
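The branching in Example 9.5 can be sketched in code. The following assumes, purely for illustration, that the belief is a product of Dirichlet distributions over transition probabilities (the conjugate prior mentioned earlier) represented by a count array, and that rewards are known. The function name `successors` is hypothetical.

```python
import numpy as np


def successors(state, action, counts):
    """Enumerate the hyper-states reachable from (state, belief) under one action.

    counts[s, a, s'] are Dirichlet parameters of the belief (an assumption).
    Returns a list of (probability, next_state, next_counts) triples; the next
    belief is a deterministic function of the observed next state.
    """
    alpha = counts[state, action]
    probs = alpha / alpha.sum()  # marginal next-state probabilities under the belief
    children = []
    for s_next, p in enumerate(probs):
        new_counts = counts.copy()
        new_counts[state, action, s_next] += 1  # conjugate posterior update
        children.append((p, s_next, new_counts))
    return children


if __name__ == "__main__":
    counts = np.ones((2, 2, 2))  # uniform prior, |S| = |A| = 2
    for a in range(2):
        for p, s_next, _ in successors(0, a, counts):
            print(f"action {a} -> state {s_next} with probability {p}")
```

With two actions and two states this yields four successor hyper-states, and repeating the expansion to depth d yields a tree with 4^d leaves, which is why approximations become necessary.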
When the branching factor is very large, or when we need to deal with very large
tree depths, it becomes necessary to approximate the MDP structure.
Branch and bound is a general technique for solving large problems. It can be applied
in all cases where upper and lower bounds on the value of solution sets can be
found. For Bayesian reinforcement learning, we can consider upper and lower bounds
$q^+$ and $q^-$ on $Q^*$ in the belief-augmented MDP (BAMDP). That is,
$$v^-(\psi_t) = V^{\pi(\xi_t)}_{\xi_t}(s_t), \qquad v^+(\psi_t) = V^+_{\xi_t}(s_t),$$
where $\pi(\xi_t)$ can be any approximately optimal policy for $\xi_t$. Using backwards
induction, we can calculate tighter upper bounds $q^+$ and lower bounds $q^-$ for all
non-leaf hyper-states by
$$q^+(\psi_t, a_t) = \sum_{\psi_{t+1}} P(\psi_{t+1} \mid \psi_t, a_t) \bigl[\rho(\psi_t, a_t) + \gamma \, v^+(\psi_{t+1})\bigr],$$
$$q^-(\psi_t, a_t) = \sum_{\psi_{t+1}} P(\psi_{t+1} \mid \psi_t, a_t) \bigl[\rho(\psi_t, a_t) + \gamma \, v^-(\psi_{t+1})\bigr].$$
We can then use the upper bounds to expand the tree (i.e., to select actions in the
tree that maximize v + ) while the lower bounds can be used to select the final policy.
Sub-optimal branches can be discarded once their upper bounds become lower than
the lower bound of some other branch.
Remark 9.2.1 If $q^-(\psi, a) \geq q^+(\psi, a')$ then $a'$ is sub-optimal at ψ.
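The backup of the bounds and the pruning rule can be sketched as follows. This is an illustrative fragment rather than a full tree search: the data layout (lists of successor triples, a dictionary of bounds) and the function names are hypothetical, and the pruning test uses a strict inequality so that an action is never discarded because of its own bounds.

```python
def backup_bounds(successors, reward, gamma):
    """Back up bounds for one action at a hyper-state.

    successors : list of (probability, v_lower, v_upper) for each next hyper-state
    reward     : expected immediate reward rho(psi, a)
    """
    q_lower = sum(p * (reward + gamma * lo) for p, lo, _ in successors)
    q_upper = sum(p * (reward + gamma * hi) for p, _, hi in successors)
    return q_lower, q_upper


def prune_actions(bounds):
    """Keep actions whose upper bound is not below the best lower bound.

    bounds : dict mapping action -> (q_lower, q_upper)
    """
    best_lower = max(lo for lo, _ in bounds.values())
    return {a for a, (_, hi) in bounds.items() if hi >= best_lower}


if __name__ == "__main__":
    print(backup_bounds([(0.5, 0.0, 1.0), (0.5, 1.0, 2.0)], reward=1.0, gamma=0.5))
    print(prune_actions({1: (0.5, 0.9), 2: (0.1, 0.4), 3: (0.3, 0.6)}))
```

In the second call, action 2 is discarded because its upper bound 0.4 is below the lower bound 0.5 of action 1, while action 3 must be kept.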
However, such an algorithm is only possible to implement when the numbers of
possible MDPs and states are finite. We can generalize this to the infinite case by
applying stochastic branch and bound methods [3, 4]. This involves estimating upper
and lower bounds on the values of leaf nodes through Monte Carlo sampling.
Bounds on the Bayes-expected utility can serve as a guideline when trying to find
a good policy. Accordingly, in this and the following section we aim at obtaining
respective upper and lower bounds. First, note that given a belief ξ and a policy π,
the respective conditional expected utility is defined as follows.
It is easy to see that the Bayes value function of a policy is simply the expected value
function under ξ:
$$V^\pi_\xi(s) = \int_{\mathcal{M}} E^\pi_\mu(U \mid s) \, d\xi(\mu) = \int_{\mathcal{M}} V^\pi_\mu(s) \, d\xi(\mu).$$
However, the Bayes-optimal value function is not equal to the expected value
function of the optimal policy for each MDP. In fact, the Bayes-value of any policy
is a natural lower bound on the Bayes-optimal value function, as the Bayes-optimal
policy is the maximum by definition. We can however use the expected optimal value
function as an upper bound on the Bayes-optimal value, that is,
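$$
\begin{aligned}
V^*_\xi \triangleq \sup_\pi E^\pi_\xi(U) &= \sup_\pi \int_{\mathcal{M}} E^\pi_\mu(U) \, d\xi(\mu) \\
&\leq \int_{\mathcal{M}} \sup_\pi V^\pi_\mu \, d\xi(\mu) = \int_{\mathcal{M}} V^*_\mu \, d\xi(\mu) \triangleq V^+_\xi.
\end{aligned}
$$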
Given the previous development, it is easy to see that the following inequalities
always hold, giving us upper and lower bounds on the value function:
$$V^\pi_\xi \leq V^*_\xi \leq V^+_\xi \quad \text{for all } \pi.$$
Fig. 9.5 A geometric view of the bounds. Here we plot the expected value of two policies, π1 , π2
and the policy π ∗ (ξ1 ) that is optimal for ξ1 , as well as the Bayes-optimal value function Vξ∗
These bounds are geometrically demonstrated in Fig. 9.5. They are entirely anal-
ogous to the Bayes bounds of Sect. 3.3.1, with the only difference being that we are
now considering complete policies rather than simple decisions.
Thompson sampling and upper bounds. The upper bound on the value function
can be easily approximated through Monte Carlo sampling, as shown in Algorithm 9.3
(cf. also the respective policy evaluation Algorithm 9.4). In fact, for K = 1,
this method is equivalent to Thompson sampling [5], which was first used in the
context of Bayesian reinforcement learning by [6]. In Thompson sampling, we sample
a single MDP from the belief and then act optimally with respect to this, until some
exploration condition is met. This is a good exploration heuristic with formal
performance guarantees for bandit problems [7, 8]. However, obtaining lower bounds
requires estimating good policies. An algorithm for doing this through backwards
induction will be explained in the following.
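The Monte Carlo estimates of both bounds can be sketched as below. This is not a transcription of Algorithms 9.3 and 9.4, which are not reproduced here; it is a hedged illustration assuming a Dirichlet belief over transitions (given as a count array), known rewards, and a discounted criterion. All function names are hypothetical.

```python
import numpy as np


def optimal_values(P, R, gamma, iters=500):
    """Approximate V*_mu by value iteration on one sampled MDP."""
    V = np.zeros(P.shape[0])
    for _ in range(iters):
        V = (R + gamma * P @ V).max(axis=1)
    return V


def policy_values(P, R, policy, gamma):
    """Exact V^pi_mu for a memoryless deterministic policy, via a linear solve."""
    idx = np.arange(P.shape[0])
    P_pi, R_pi = P[idx, policy], R[idx, policy]
    return np.linalg.solve(np.eye(len(idx)) - gamma * P_pi, R_pi)


def sample_mdp(counts, rng):
    """Draw one transition kernel from a product-Dirichlet belief (an assumption)."""
    S, A, _ = counts.shape
    return np.array([[rng.dirichlet(counts[s, a]) for a in range(A)] for s in range(S)])


def monte_carlo_bounds(counts, R, policy, gamma=0.9, K=50, seed=0):
    """Estimate the lower bound V^pi_xi and the upper bound V^+_xi from K sampled MDPs."""
    rng = np.random.default_rng(seed)
    samples = [sample_mdp(counts, rng) for _ in range(K)]
    upper = np.mean([optimal_values(P, R, gamma) for P in samples], axis=0)
    lower = np.mean([policy_values(P, R, policy, gamma) for P in samples], axis=0)
    return lower, upper


if __name__ == "__main__":
    counts = np.ones((3, 2, 3))
    R = np.array([[0.0, 1.0], [0.5, 0.2], [1.0, 0.0]])
    print(monte_carlo_bounds(counts, R, policy=np.zeros(3, dtype=int)))
```

With K = 1, acting greedily with respect to the single sampled MDP corresponds to the Thompson sampling step described above.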
A lower bound on the value function is useful to tell us how tight our upper bounds
are. It is possible to obtain one by evaluating any arbitrary policy. So, tighter lower
bounds can be obtained by finding better policies, something that was explored
by [9, 10].
In particular, we can consider the problem of finding the best memoryless policy.
This involves two approximations. Firstly, approximating our belief over MDPs with
a sample over a finite set of n MDPs. Secondly, assuming that the belief is nearly
constant over time, and performing backwards induction on those n MDPs
simultaneously. While this greedy procedure might not find the optimal memoryless
policy, it still improves the lower bounds considerably (Fig. 9.6).
The central step of backwards induction over multiple MDPs is summarized by
the following equation, which simply involves calculating the expected utility of a
particular policy over all MDPs:
$$Q^\pi_{\xi,t}(s, a) \triangleq \int_{\mathcal{M}} \Bigl\{ \bar{r}_\mu(s, a) + \gamma \int_S V^\pi_{\mu,t+1}(s') \, dP_\mu(s' \mid s, a) \Bigr\} \, d\xi(\mu) \tag{9.2.2}$$
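A compact sketch of this backwards induction, for a finite sample of MDPs stored as arrays, is given below. It follows Eq. (9.2.2) under the stated approximation that the belief stays fixed over the horizon; the array layout and the function name `mmbi` are assumptions made for illustration.

```python
import numpy as np


def mmbi(belief, P, R, horizon, gamma=1.0):
    """Backwards induction over several MDPs with a fixed belief.

    belief : (n,) weights over the sampled MDPs
    P      : (n, S, A, S) transition kernels
    R      : (n, S, A) mean rewards
    Returns the greedy memoryless policy for each step and the per-MDP values at t = 0.
    """
    n, S, A, _ = P.shape
    V = np.zeros((n, S))  # V_{mu, T+1} = 0
    policy = np.zeros((horizon, S), dtype=int)
    for t in reversed(range(horizon)):
        # Per-MDP action values: r_mu(s, a) + gamma * sum_s' P_mu(s'|s, a) V_mu(s')
        Q_mu = R + gamma * np.einsum("nsap,np->nsa", P, V)
        # Eq. (9.2.2): average the per-MDP values under the belief
        Q_xi = np.tensordot(belief, Q_mu, axes=1)
        policy[t] = Q_xi.argmax(axis=1)
        # Each MDP keeps its own value of the common policy
        V = Q_mu[:, np.arange(S), policy[t]]
    return policy, V


if __name__ == "__main__":
    P = np.ones((2, 1, 2, 1))
    R = np.array([[[1.0, 0.0]], [[0.0, 1.0]]])
    print(mmbi(np.array([0.7, 0.3]), P, R, horizon=3))
```

The key point is that a single policy is chosen per step from the belief-averaged values, while the value function is tracked separately for each MDP.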
Fig. 9.6 Illustration of the improved bounds. The naive and tighter bound refers to the lower bound
obtained by calculating the value of the policy that is optimal for the expected MDP and that obtained
by calculating the value of the MMBI policy respectively. The upper bound is $V^+_\xi$. The horizontal
axis refers to our belief: At the left edge, our belief is uniform over all MDPs, while on the right
edge, we are certain about the true MDP
In practice, we maintain a belief over an infinite set of MDPs, such as the class of
all discrete MDPs with a certain number of states and actions. To apply this algorithm
in this case, we can sample a finite number of MDPs from the current belief and
then find the optimal policy for this sample, as shown in Algorithm 9.6. For K = 1,
this is also equivalent to Thompson sampling. However, as we can see in Fig. 9.7,
Algorithm 9.6 performs better when the number of samples is increased.
9.3 Bayesian Methods in Continuous Spaces 211
regret
environment (Example 7.2). 400
The error bars show the
standard error of the average 350
regret
300
250
2 4 6 8 10 12 14 16
n
9.2.8 Further Reading
One of the first treatments of Bayesian reinforcement learning is due to [12]. Although
the idea was well-known in the statistical community [13], the first incursion of the
idea of Bayes-adaptive policies in reinforcement learning was achieved by Duff’s
thesis [14]. Most recent advances in Bayes-adaptive policies involve the use of intel-
ligent methods for exploring the tree, such as sparse sampling [31] and Monte Carlo
tree search [15].
Instead of sampling MDPs, one could sample beliefs, which leads to a finite hyper-
state approximation of the complete belief MDP. One such approach is BEETLE [9,
16], which examines a set of possible future beliefs and approximates the value of
each belief with a lower bound. In essence, it then creates the set of policies which
are optimal with respect to these bounds.
Another idea is to take advantage of the expectation-maximization view of rein-
forcement learning [32]. This allows to apply a host of different probabilistic infer-
ence algorithms. This approach was investigated by [17].
Pμ (S | s, a) Pμ (st+1 ∈ S | st = s, at = a), S ⊂ S.
There are a number of transition models one can consider for the continuous case.
For the purposes of this textbook, we shall limit ourselves to the relatively simple
case of linear-Gaussian models.
The simplest type of transition model for an MDP defined on a continuous state
space is a linear-Gaussian model, which also results in a closed form posterior
calculation due to the conjugate prior. While typically the real system dynamics may not
be linear, one can often find some mapping $f : S \to X$ to a k-dimensional vector
space X such that the dynamics of the transformed state $x_t \triangleq f(s_t)$ at time t may
be well-approximated by a linear system. Then the next state $s_{t+1}$ is given by the
output of a function $g : X \times A \to S$ of the transformed state, the action, and some
additive noise $\varepsilon_t$, i.e.,
$$s_{t+1} = g(x_t, a_t) + \varepsilon_t.$$
That is, the next state is drawn from a normal distribution with mean $A_i x$ and
covariance matrix $V_i$.
In order to model our uncertainty with a (subjective) prior distribution ξ, we have
to specify the model structure. Fortunately, in this particular case, a conjugate prior
exists in the form of the matrix-normal distribution for A and the inverse-Wishart
distribution for V. Given $V_i$, the distribution for $A_i$ is matrix-normal, while the
marginal distribution of $V_i$ is inverse-Wishart. More specifically, the dependencies
are as follows:
$$A_i \mid V_i = V \sim \phi(A_i \mid M, C, V), \tag{9.3.1}$$
$$V_i \sim \psi(V_i \mid W, n), \tag{9.3.2}$$
$$\psi(V_i \mid W, n) \propto \left| V^{-1} W / 2 \right|^{n/2} e^{-\frac{1}{2} \operatorname{trace}(V^{-1} W)}.$$
Essentially, the considered setting is an extension of the univariate Bayesian linear
regression model (see for example [13]) to the multivariate case via vectorization
of the mean matrix. Since the prior is conjugate, it is relatively simple to calculate
posterior values of the parameters after each observation. While we omit the details,
a full description of inference using this model is given by [18].
Further reading. More complex transition models include the non-parametric
extension of the above model, namely Gaussian processes (GP) [33]. For an n-dimensional
state space, one typically applies independent GPs for predicting each state
coordinate, i.e., $f_i : \mathbb{R}^n \to \mathbb{R}$. As this completely decouples the state
dimensions, it is best to consider a joint model, but this requires various
approximations (cf. e.g., [19]).
A well-known method for model-based Gaussian process reinforcement learning is
GP-Rmax of [34], which has been recently shown by [20] to be KWIK-learnable.²
Another straightforward extension of linear models are piecewise linear models,
which can be described in a Bayesian non-parametric framework [21]. This avoids
the computational complexity that is introduced when using GPs.
for any time step t. This essentially gives a simple model for sequences of rewards
and states $P(r^T \mid v, s^T)$. We can now write the posterior as
$\xi(v \mid r^T, s^T) \propto P(r^T \mid v, s^T) \, \xi(v)$, where the dependence
$\xi(v \mid s^T)$ is suppressed. This model was later
updated by [23] using the reward distribution
$$r_t \mid v, s_t, s_{t+1} \sim \mathcal{N}\bigl(v(s_t) - \gamma v(s_{t+1}), \, N(s_t, s_{t+1})\bigr),$$
where $N(s, s') \triangleq \Delta U(s) - \gamma \Delta U(s')$ with
$\Delta U(s) \triangleq U(s) - v(s)$ denoting the distribution of the residual, i.e.,
the utility when starting from s minus its expectation. The
correlation between $U(s)$ and $U(s')$ is captured via N, and the residuals are modelled
as a Gaussian process. While the model is still an approximation, it is equivalent to
performing GP regression using Monte Carlo samples of the discounted return.
² Informally, a class is KWIK-learnable if the number of mistakes made by the algorithm is
polynomially bounded in the problem parameters. In the context of reinforcement learning this would
be the number of steps for which no guarantee of utility can be provided.
Bayesian finite-horizon dynamic programming for deterministic systems. Instead
of using an approximate model, [24] employ a series of GPs, each for one dynamic
programming stage, under the assumption that the dynamics are deterministic and
the rewards are Gaussian-distributed. It is possible to extend this approach to the case
of non-deterministic transitions, at the cost of requiring additional approximations.
However, since a lot of real-world problems do in fact have deterministic dynamics,
the approach is consistent.
Bayesian least-squares temporal differences. Tziortziotis and Dimitrakakis [25]
instead consider a model for the value function itself, where the random quantity is
the empirical transition matrix $\hat{P}$ rather than the reward (which can be assumed to
be known):
$$\hat{P} v \mid v, P \sim \mathcal{N}(P v, \beta I)$$
9.4 Partially Observable Markov Decision Processes
In most real-world applications the state $s_t$ of the system at time t cannot be observed
directly. Instead, we obtain some observation $x_t$, which depends on the state of the
system. While this does give us some information about the system state, it is in
general not sufficient to pinpoint it exactly. This idea can be formalized as a partially
observable Markov decision process (POMDP).
Definition 9.4.1 (POMDP) A partially observable Markov decision process
(POMDP) μ is a tuple (X , S, A, P, y) where X is an observation space, S is a
state space, A is an action space, P is a conditional distribution on observations,
states and rewards and y is a starting state distribution. The reward, observation and
next state are Markov with respect to the current state and action:
Here $P(s_{t+1} \mid s_t, a_t)$ is the transition distribution, giving the probabilities of next
states given the current state and action. $P(x_t \mid s_t)$ is the observation distribution,
giving the probabilities of different observations given the current state. Finally,
$P(r_t \mid s_t)$ is the reward distribution, which we make dependent only on the current
state for simplicity.
Partially observable Markov decision process
When we know a POMDP’s parameters, that is to say, when we know the transition,
observation and reward distributions, the problem is formally the same as solving
an unknown MDP. In particular, we can similarly define a belief state summarizing
our knowledge. This takes the form of a probability distribution on the hidden state
variable st rather than on the model μ. If μ defines starting state probabilities, then
the belief is not subjective, as it only relies on the actual POMDP parameters. The
transition distribution on states given our belief is as follows.
Belief ξ
For any distribution ξ on S, we define
$$\xi(s_{t+1} \mid a_t, \mu) \triangleq \int_S P_\mu(s_{t+1} \mid s_t, a_t) \, d\xi(s_t).$$
When the model μ is given, calculating a belief update is not particularly difficult,
but we must take care to properly use the time index t. Starting from Bayes’ theorem,
it is easy to derive the belief update from ξt to ξt+1 as follows.
Belief update
A particularly attractive setting is when the model is finite. Then the sufficient
statistic also has finite dimension and all updates are in closed form.
Remark 9.4.1 If S, A, X are finite, then we can define a sequence of vectors
$p_t \in \Delta^{|S|}$ and matrices $A_t$ as
$$p_t(j) = P(x_t \mid s_t = j), \qquad A_t(i, j) = P(s_{t+1} = j \mid s_t = i, a_t).$$
Then, writing $b_t(i)$ for $\xi_t(s_t = i)$, we can use Bayes' theorem to obtain
$$b_{t+1} = \frac{\operatorname{diag}(p_{t+1}) \, A_t^\top b_t}{p_{t+1}^\top A_t^\top b_t}.$$
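For finite S and X, the update in Remark 9.4.1 is a prediction step through the transition matrix followed by reweighting with the observation likelihood and normalization. A minimal sketch, with the hypothetical function name `pomdp_belief_update` and the convention that row i of the matrix holds the next-state distribution from state i:

```python
import numpy as np


def pomdp_belief_update(b, A, p_next):
    """Belief update for a known finite POMDP.

    b      : (S,) current belief over hidden states
    A      : (S, S) matrix with A[i, j] = P(s_{t+1} = j | s_t = i, a_t)
    p_next : (S,) likelihoods p_{t+1}(j) = P(x_{t+1} | s_{t+1} = j)
    """
    predicted = A.T @ b  # P(s_{t+1} = j | history, a_t)
    unnormalized = p_next * predicted
    evidence = unnormalized.sum()
    if evidence == 0.0:
        raise ValueError("observation has zero probability under the belief")
    return unnormalized / evidence


if __name__ == "__main__":
    A = np.array([[0.9, 0.1], [0.2, 0.8]])
    print(pomdp_belief_update(np.array([0.5, 0.5]), A, np.array([0.7, 0.1])))
```

The transpose is needed because the rows of the matrix index the current state, so the predicted next-state distribution is obtained by summing over rows.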
Solving a POMDP that is unknown is a much harder problem. The basic update
equation for a joint belief on both possible state and possible model is given by
Unfortunately, even for the simplest possible case of two possible models $\mu_1$, $\mu_2$
and binary observations, there is no finite-dimensional representation of the belief at
time t.
Strategies for solving unknown POMDPs include solving the full Bayesian decision
problem, but this requires exponential inference and planning for exact
solutions [27]. For this reason, one usually uses approximations.
One very simple approximation involves replacing a POMDP with a variable
order Markov decision process, for which inference has only logarithmic
computational complexity [28]. The variable order model assumes that the observation
probabilities can be decomposed in terms of finite-length contexts. Of course, the
memory complexity is still linear. This approach has been used by [15] in combination
with a Monte Carlo planner for online decision making with promising results.
In general, finding optimal policies in POMDPs is hard even for restricted classes
of policies [35]. However, approximations [29] and stochastic methods as well as
policy search methods [30, 32] work quite well in practice.
9.5 Relations Between Different Settings
Markov decision processes can be used to model a wide range of different problems.
This section informally (and perhaps sometimes inaccurately) describes the
relationship between different MDP settings we have dealt with so far.
When the state and action spaces are finite, the optimal policy for both finite
horizon and infinite horizon discounted MDPs can be computed in polynomial time
using algorithms like backwards induction or policy iteration (see Chap. 6). However,
in other cases obtaining the optimal policy is far from trivial. In the reinforcement
learning setting, the MDP is not known and must be estimated while acting in it. If the
goal is to maximize expected utility under a prior, the problem becomes a BAMDP.
The BAMDP can be seen as a special case of a POMDP, where the underlying latent
variable is the MDP parameter, which has a fixed value, rather than the state. For that
reason it is generally assumed that BAMDPs have the same complexity as POMDPs.
Note however, that POMDPs with linear-Gaussian dynamics can be solved with the
exact same controller as linear-Gaussian MDPs, by replacing the actual state with
its expected value.
The POMDP problem with discrete state space is similar to a continuous MDP.
This is because we can construct an MDP that uses the POMDP belief state as its
state. This belief state is finite-dimensional and continuous. If the MDP state space
is continuous, then it is in general not possible to decide whether a given policy
is optimal in finite time. It is nevertheless possible to check whether a policy is
$\epsilon$-optimal under certain regularity conditions on the state space structure. However,
if the POMDP state space is discrete, there is only a finite number of possible next
belief states for any belief state, and the number of policies is also finite for a finite
horizon. Thus, the relationship between these classes is summarized below:
9.6 Exercises
Exercise 9.1 Consider the algorithms we have seen in Chap. 8. Are any of those
applicable to belief-augmented MDPs? Outline a strategy for applying one of those
algorithms to the problem. What would be the biggest obstacle we would have to
overcome in your specific example?
Exercise 9.2 Prove Remark 9.1.1.
Exercise 9.3 A practical case of Bayesian reinforcement learning in discrete space
is when we have an independent belief over the transition probabilities of each state-
action pair. Consider the case where we have n states and k actions. Similar to the
product-prior in the bandit case in Sect. 6.2, we assign a probability (density) $\xi_{s,a}$ to
the probability vector $\theta^{(s,a)} \in \Delta^n$. We can then define our joint belief on the
$(nk) \times n$ matrix Θ to be
$$\xi(\Theta) = \prod_{s \in S, a \in A} \xi_{s,a}(\theta^{(s,a)}).$$
Exercise 9.4 Consider the Gaussian process model of Eq. (9.3.2). What is the
implicit assumption made about the transition model? If this assumption is satis-
fied, what does the corresponding posterior distribution represent?
References
1. Csilléry, K., Blum, M.G.B., Gaggiotti, O.E., François, O.: Approximate Bayesian computation
(ABC) in practice. Trends Ecol. Evol. 25(7), 410–418 (2010)
2. Dimitrakakis, C., Tziortziotis, N.: ABC reinforcement learning. In: Proceedings of the 30th
International Conference on Machine Learning, ICML 2013, pp. 684–692. JMLR.org (2013)
3. Dimitrakakis, C.: Complexity of stochastic branch and bound methods for belief tree search
in Bayesian reinforcement learning. In: 2nd International Conference on Agents and Artificial
Intelligence (ICAART 2010), pp. 259–264. Springer, Valencia, Spain (2010)
4. Dimitrakakis, C.: Tree exploration for Bayesian RL exploration. In: 2008 International Confer-
ences on Computational Intelligence for Modelling, Control and Automation (CIMCA 2008),
Intelligent Agents, Web Technologies and Internet Commerce (IAWTIC 2008), Innovation in
Software Engineering (ISE 2008), pp. 1029–1034. IEEE Computer Society (2008)
5. Thompson, W.R.: On the likelihood that one unknown probability exceeds another in view of
the evidence of two samples. Biometrika 25(3–4), 285–294 (1933)
6. Strens, M.J.A.: A Bayesian framework for reinforcement learning. In: Proceedings of the Sev-
enteenth International Conference on Machine Learning (ICML 2000), pp. 943–950. Morgan
Kaufmann (2000)
7. Kaufmann, E., Korda, N., Munos, R.: Thompson sampling: an asymptotically optimal finite-
time analysis. In: Algorithmic Learning Theory-23rd International Conference, ALT 2012.
Proceedings. Lecture Notes in Computer Science, vol. 7568, pp. 199–213. Springer (2012)
8. Osband, I., Russo, D., Van Roy, B.: (More) efficient reinforcement learning via posterior sam-
pling. In: Advances in Neural Information Processing Systems, vol. 26, pp. 3003–3011 (2013)
9. Poupart, P., Vlassis, N.A., Hoey, J., Regan, K.: An analytic solution to discrete Bayesian rein-
forcement learning. In: Machine Learning, Proceedings of the 23rd International Conference
(ICML 2006), pp. 697–704. ACM (2006)
10. Dimitrakakis, C.: Robust Bayesian reinforcement learning through tight lower bounds. In: San-
ner, S., Hutter, M. (eds.) Recent Advances in Reinforcement Learning–9th European Workshop,
EWRL 2011. Lecture Notes in Computer Science, vol. 7188, pp. 177–188. Springer (2011)
11. Zinkevich, M., Greenwald, A., Littman, M.L.: Cyclic equilibria in Markov games. In: Advances
in Neural Information Processing Systems, vol. 18, pp. 1641–1648 (2006)
12. Bellman, R.E.: A problem in the sequential design of experiments. Sankhya 16,
221–229 (1957)
13. DeGroot, M.H.: Optimal Statistical Decisions. Wiley (1970)
14. Duff, M.O.: Optimal learning computational procedures for Bayes-adaptive Markov decision
processes. Ph.D. thesis, University of Massachusetts at Amherst (2002)
15. Veness, J., Ng, K.S., Hutter, M., Silver, D.: A Monte Carlo AIXI approximation. Technical
Report 0909.0801 (2009). (arXiv)
16. Poupart, P., Vlassis, N.: Model-based Bayesian reinforcement learning in partially observable
domains. In: International Symposium on Artificial Intelligence and Mathematics, ISAIM 2008
(2008)
17. Furmston, T., Barber, D.: Variational methods for reinforcement learning. In: Proceedings of
the 13th International Conference on Artificial Intelligence and Statistics (AISTATS 2010), pp.
241–248 (2010)
18. Minka, T.P.: Bayesian linear regression. Technical Report, Microsoft research (2000)
19. Álvarez, M., Luengo, D., Titsias, M., Lawrence, N.: Efficient multioutput Gaussian processes
through variational inducing kernels. In: Proceedings of the 13th International Conference on
Artificial Intelligence and Statistics (AISTATS 2010), pp. 25–32 (2010)
20. Grande, R.C., Walsh, T.J., How, J.P.: Sample efficient reinforcement learning with Gaussian
processes. In: Proceedings of the 31st International Conference on Machine Learning, ICML
2014, pp. 1332–1340 (2014). (JMLR.org)
21. Tziortziotis, N., Dimitrakakis, C., Blekas, K.: Cover tree Bayesian reinforcement learning. J.
Mach. Learn. Res. 15(1), 2313–2335 (2014)
22. Engel, Y., Mannor, S., Meir, R.: Bayes meets Bellman: the Gaussian process approach to
temporal difference learning. In: Machine Learning, Proceedings of the 20th International
Conference (ICML 2003), pp. 154–161. AAAI Press (2003)
23. Engel, Y., Mannor, S., Meir, R.: Reinforcement learning with gaussian processes. In: Machine
Learning, Proceedings of the 22nd International Conference (ICML 2005), pp. 201–208. ACM
(2005)
24. Deisenroth, M.P., Rasmussen, C.E., Peters, J.: Gaussian process dynamic programming.
Neurocomputing 72(7–9), 1508–1524 (2009)
25. Tziortziotis, N., Dimitrakakis, C.: Bayesian inference for least squares temporal difference
regularization. In: Machine Learning and Knowledge Discovery in Databases-European Con-
ference, ECML PKDD 2017, Proceedings Part II. Lecture Notes in Computer Science, vol.
10535, pp. 126–141. Springer (2017)
26. Ghavamzadeh, M., Engel, Y.: Bayesian policy gradient algorithms. In: Advances in Neural
Information Processing Systems, vol. 19, pp. 457–464. MIT Press (2006)
27. Ross, S., Chaib-draa, B., Pineau, J.: Bayes-adaptive POMDPs. In: Advances in Neural Infor-
mation Processing Systems, vol. 20, pp. 1225–1232 (2008)
28. Dimitrakakis, C.: Bayesian variable order Markov models. In: Proceedings of the 13th Inter-
national Conference on Artificial Intelligence and Statistics (AISTATS 2010), pp. 161–168
(2010)
29. Spaan, M.T.J., Vlassis, N.: Perseus: randomized point-based value iteration for POMDPs. J.
Artif. Intell. Res. 24(1), 195–220 (2005)
30. Baxter, J., Bartlett, P.L.: Reinforcement learning in POMDP’s via direct gradient ascent. In:
Proceedings of the 17th International Conference on Machine Learning, ICML 2000, pp. 41–
48. Morgan Kaufmann, San Francisco, CA (2000)
31. Wang, T., Lizotte, D., Bowling, M., Schuurmans, D.: Bayesian sparse sampling for on-line
reward optimization. In: Machine Learning, Proceedings of the 22nd International Conference
(ICML 2005). ACM (2005)
32. Toussaint, M., Harmeling, S., Storkey, A.: Probabilistic inference for solving (PO)MDPs.
Technical Report EDI-INF-RR-0934, University of Edinburgh, School of Informatics (2006)
33. Rasmussen, C.E., Williams, C.K.I.: Gaussian Processes for Machine Learning. MIT Press
(2006)
34. Jung, T., Stone, P.: Gaussian processes for sample-efficient reinforcement learning with RMAX-
like exploration. In: Machine Learning and Knowledge Discovery in Databases, European
Conference, ECML PKDD 2010. Lecture Notes in Computer Science, vol. 6321, pp. 601–616.
Springer (2010)
35. Vlassis, N., Littman, M.L., Barber, D.: On the computational complexity of stochastic controller
optimization in POMDPs. ACM Trans. Comput. Theory 4(4), 12:1–12:8 (2012)
Chapter 10
Distribution-Free Reinforcement
Learning
10.1 Introduction
The Bayesian framework requires specifying a prior distribution. For many reasons,
we may frequently be unable to do that. In addition, as we have seen, the Bayes-
optimal solution is often intractable. In this chapter we shall take a look at algorithms
that do not require specifying a prior distribution. Instead, they employ the heuristic of
“optimism under uncertainty” to select policies. This idea is very similar to heuristic
search algorithms, such as A∗ [1]. All these algorithms assume the best possible
model that is consistent with the observations so far and choose the optimal policy
in this “optimistic” model. Intuitively, this means that for each possible policy we
maintain an upper bound on the value/utility we can reasonably expect from it. In
general we want this upper bound to
1. be as tight as possible (i.e., to be close to the true value),
2. still hold with high probability.
We begin with an introduction to these ideas in bandit problems, when the objective
is to maximize total reward. We then expand this discussion to structured bandit
problems, which have many applications in optimization. Finally, we look at the
case of maximizing total reward in unknown MDPs.
10.2 Finite Stochastic Bandit Problems
First of all, let us briefly recall the stochastic bandit setting, which we have already
considered in Sect. 6.2. The learner in discrete time steps t = 1, 2, . . . chooses an
arm at from a given set A = {1, . . . , K } of K arms. The rewards rt the learner
obtains in return are random and assumed to be independent as well as bounded,
e.g., rt ∈ [0, 1]. The expected reward r (i) = E(rt |at = i) for choosing any arm i
is unknown to the learner, who aims to maximize the total reward $\sum_{t=1}^{T} r_t$ after a
certain number of T time steps.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022
C. Dimitrakakis and R. Ortner, Decision Making Under Uncertainty
and Reinforcement Learning, Intelligent Systems Reference Library 223,
https://doi.org/10.1007/978-3-031-07614-5_10
Let $r^* \triangleq \max_i r(i)$ be the highest expected reward that can be achieved. Obviously,
the optimal policy π ∗ in each time step chooses the arm giving the highest
expected reward r ∗ . The learner who does not know which arm is optimal will choose
at each time step t an arm at from A, or more generally, a probability distribution
over the arms from which at then is drawn. It is important to notice that maximizing
the total reward is equivalent to minimizing total regret with respect to that policy.
Definition 10.2.1 (Total regret) The (total) regret of a policy π relative to the optimal
fixed policy π ∗ after T steps is
$$L_T(\pi) \triangleq \sum_{t=1}^{T} \left( r_t^* - r_t^{\pi} \right),$$
where $r_t^{\pi}$ is the reward obtained by the policy π at step t and $r_t^* \triangleq r_t^{\pi^*}$. Accordingly,
the expected (total) regret is
$$\mathbb{E}\, L_T(\pi) \triangleq T r^* - \mathbb{E}_{\pi} \sum_{t=1}^{T} r_t.$$
The regret compares the collected rewards to those of the best fixed policy. Comparing
instead to the best rewards obtained by the arms at each time would be too hard due
to their randomness.
We note that the notion of regret we consider here is usually called pseudo-regret,
while the term expected regret often refers to the just mentioned comparison to the
actual best rewards, cf. [8] for a more detailed discussion.
It makes sense for a learning algorithm to use the empirical average rewards obtained
for each arm so far.
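Empirical average
$$\hat{r}_{t,i} \triangleq \frac{1}{N_{t,i}} \sum_{k=1}^{t} r_{k,i}\, \mathbb{I}\{a_k = i\}, \quad \text{where } N_{t,i} \triangleq \sum_{k=1}^{t} \mathbb{I}\{a_k = i\}$$
and $r_{k,i}$ denotes the (random) reward the learner receives upon choosing arm i
at step k.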
Simply always choosing the arm with the best empirical average reward so far is
not the best idea, because you might get stuck with a sub-optimal arm: If the optimal
arm underperforms at the beginning, so that its empirical average is far below the
true mean of a suboptimal arm, it will never be chosen again. A better strategy is to
choose arms optimistically. Intuitively, as long as an arm has a significant chance of
being the best, you play it every now and then. One simple way to implement this is
shown in the following UCB1 algorithm of Auer et al. [2].
Thus, the algorithm adds a bonus value of order $O(\sqrt{\ln t / N_{t,i}})$ to the empirical
value of each arm, thus forming an upper confidence bound. This upper confidence
bound value is such that the true mean reward of each arm will lie below it with high
probability by the Hoeffding bound (4.5.5).
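To make the optimistic rule concrete, the following Python sketch implements a UCB1-style index policy. It is a minimal illustration under stated assumptions rather than a reproduction of the algorithm box of Auer et al. [2]: the constant 2 inside the square root is the one commonly associated with UCB1, every arm is played once for initialisation, and the names `ucb1`, `pull` and `bernoulli_bandit` are hypothetical helpers introduced only for this example.

```python
import math
import random


def ucb1(pull, num_arms, horizon):
    """Run a UCB1-style index policy for `horizon` steps.

    `pull(i)` is a hypothetical callback returning a reward in [0, 1] for arm i.
    Returns the empirical means and the pull counts of all arms.
    """
    counts = [0] * num_arms    # N_{t,i}: number of times arm i was chosen
    means = [0.0] * num_arms   # empirical average reward of arm i
    for t in range(1, horizon + 1):
        if t <= num_arms:
            arm = t - 1        # initialisation: play every arm once
        else:
            # Optimistic index: empirical mean plus a bonus of order sqrt(ln t / N).
            # The constant 2 is an assumption of this sketch.
            arm = max(
                range(num_arms),
                key=lambda i: means[i] + math.sqrt(2.0 * math.log(t) / counts[i]),
            )
        reward = pull(arm)
        counts[arm] += 1
        means[arm] += (reward - means[arm]) / counts[arm]  # running average
    return means, counts


def bernoulli_bandit(probabilities, seed=0):
    """Hypothetical test environment: arm i pays 1 with probability probabilities[i]."""
    rng = random.Random(seed)
    return lambda i: 1.0 if rng.random() < probabilities[i] else 0.0
```

A call such as `ucb1(bernoulli_bandit([0.2, 0.8]), 2, 2000)` returns the estimates and counts after 2000 steps. Since the bonus shrinks only for arms that are actually played, a suboptimal arm keeps being tried occasionally, but far less often than the better arm.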
Theorem 10.2.1 (Auer et al. [2]) The expected regret of UCB1 after T time steps is
at most
$$\mathbb{E}\, L_T(\mathrm{UCB1}) \le \sum_{i : r(i) < r^*} \frac{8 \ln T}{r^* - r(i)} + 5 \sum_{i} \left( r^* - r(i) \right).$$
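To get a feeling for the size of this bound, one can evaluate it for given mean rewards. The short Python sketch below does just that; the function name `ucb1_regret_bound` and its arguments are hypothetical and serve only as an illustration of the formula.

```python
import math


def ucb1_regret_bound(mean_rewards, horizon):
    """Evaluate the bound: sum over suboptimal arms of 8 ln T / gap, plus 5 * (sum of gaps)."""
    best = max(mean_rewards)
    gaps = [best - r for r in mean_rewards]
    # Only suboptimal arms (positive gap) contribute to the logarithmic term.
    log_term = sum(8.0 * math.log(horizon) / gap for gap in gaps if gap > 0)
    return log_term + 5.0 * sum(gaps)
```

For two arms with mean rewards 0.5 and 0.6 and T = 10000 the bound evaluates to roughly 737, which shows that small gaps make the logarithmic term large even though it grows slowly in T.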
$$\mathbb{E}\, L_T = \mathbb{E} \sum_{t=1}^{T} (r^* - r_t) = \sum_{i} (r^* - r(i))\, \mathbb{E}\, N_{T,i}, \qquad (10.2.1)$$
Accordingly we may assume that (taking care of the contribution of the error prob-
abilities to E Nt,i below)
Combining this with (10.2.1) and noting that the sum converges to a value < 4,
proves the regret bound.
The UCB1 algorithm is actually not the first algorithm employing optimism in the
face of uncertainty to deal with the exploration-exploitation dilemma, nor the first that
uses confidence intervals for that purpose. This idea goes back to the seminal work
of Lai and Robbins [3] that used the same approach, however in a more complicated
form. In particular, the whole history is used for computing the arm to choose. The
derived bounds of Lai and Robbins [3] show that after T steps each suboptimal arm
is played at most $\left( \frac{1}{D_{\mathrm{KL}}} + o(1) \right) \log T$ times in expectation, where $D_{\mathrm{KL}}$ measures the
distance between the reward distributions of the optimal and the suboptimal arm by
the Kullback-Leibler divergence, and o(1) → 0 as T → ∞. This bound was also
shown to be asymptotically optimal by Lai and Robbins [3]. A lower bound logarithmic
in T for any finite T that is close to matching the bound of Theorem 10.2.1 can be
found in Mannor and Tsitsiklis [4]. Improvements that get closer to the lower bound
(and are still based on the UCB1 idea) can be found in Auer and Ortner [5], while
the gap has been finally closed by Lattimore [6].
For so-called distribution-independent bounds that do not depend on problem
parameters like the “gaps” r ∗ − r (i), see e.g. Audibert and Bubeck [7]. In general,
these bounds cannot be logarithmic in T anymore, as the gaps may be of order $1/\sqrt{T}$,
resulting in bounds that are $O(\sqrt{KT})$, just like in the nonstochastic setting that we
will take a look at next.
The stochastic setting just considered is only one among several variants of the multi-armed
bandit problem. While it is impossible to cover them all, we give a brief overview of
the most common scenarios and refer to Bubeck and Cesa-Bianchi [8] for a more
complete overview.
What is common to most variants of the classic stochastic setting is that the
assumption of receiving i.i.d. rewards when sampling a fixed arm is loosened. The
most extreme case is the so-called nonstochastic, sometimes also termed adversarial
bandit setting, where the reward sequence for each arm is assumed to be fixed in
advance (and thus not random at all). In this case, the reward is maximized when
choosing in each time step the arm that maximizes the reward at this step. Obviously,
since the reward sequences can be completely arbitrary, no learner can stand
a chance to perform well with respect to this optimal policy. Thus, one confines
oneself to consider the regret with respect to the best fixed arm in hindsight, that is,
$\arg\max_i \sum_{t=1}^{T} r_{t,i}$, where $r_{t,i}$ is the reward of arm i at step t. It is still not clear that
this is not too much to ask for, but it turns out that one can achieve regret bounds
of order $O(\sqrt{KT})$ in this setting. Clearly, algorithms that choose arms deterministically
can always be tricked by an adversarial reward sequence. However, algorithms
that at each time step choose an arm from a suitable distribution over the arms (that is
updated according to the collected rewards), can be shown to give the mentioned opti-
mal regret bound. A prominent exponent of these algorithms is the Exp3 algorithm
of Auer et al. [9], that uses an exponential weighting scheme.
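The exponential weighting idea can be sketched in a few lines of Python. The sketch below follows the commonly stated form of Exp3 with a mixing parameter `gamma`; the parameter value, the renormalisation of the weights, and the names `exp3` and `reward_fn` are assumptions made for this illustration and are not taken from the text.

```python
import math
import random


def exp3(reward_fn, num_arms, horizon, gamma=0.1, seed=0):
    """Exp3-style exponential weighting.

    `reward_fn(t, i)` is a hypothetical callback giving the reward in [0, 1]
    of arm i at step t. Only the reward of the chosen arm is observed.
    Returns the collected reward and the final normalised weights.
    """
    rng = random.Random(seed)
    weights = [1.0] * num_arms
    total_reward = 0.0
    for t in range(horizon):
        total = sum(weights)
        # Mix the weight-proportional distribution with uniform exploration.
        probs = [(1.0 - gamma) * w / total + gamma / num_arms for w in weights]
        arm = rng.choices(range(num_arms), weights=probs)[0]
        reward = reward_fn(t, arm)
        total_reward += reward
        # Importance-weighted estimate of the reward of the chosen arm.
        estimate = reward / probs[arm]
        weights[arm] *= math.exp(gamma * estimate / num_arms)
        # Rescale to avoid overflow; the induced distribution is unchanged.
        top = max(weights)
        weights = [w / top for w in weights]
    total = sum(weights)
    return total_reward, [w / total for w in weights]
```

Because the arm is drawn at random, an adversary that fixes the reward sequences in advance cannot anticipate the choices, which is the property deterministic rules lack.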
In the contextual bandit setting the learner receives some additional side information
called the context. The reward for choosing an arm is assumed to depend on
the context as well as on the chosen arm and can be either stochastic or adversarial.
The learner usually competes against the best policy that maps contexts to arms.
There is a notable amount of literature dealing with various settings that are usually
also interesting for applications like web advertisement where user data takes the
role of provided side information. For an overview see e.g. Chap. 4 of Bubeck and
Cesa-Bianchi [8] or Part V of Lattimore and Szepesvári [10].
In other settings the i.i.d. assumption about the rewards of a fixed arm is replaced
by more general assumptions, such as that underlying each arm there is a Markov
chain and rewards depend on the state of the Markov chain when sampling the
arm. This is called the restless bandits problem, which is already quite close to the
general reinforcement learning setting with an underlying Markov decision process
(see Sect. 10.3.1). Regret bounds in this setting can be shown to be $\tilde{O}(\sqrt{T})$ even if
at each time step the learner can observe only the state of the arm he chooses, see
Ortner et al. [11].
10.3 Reinforcement Learning in MDPs
Taking a step further from the bandit problems of the previous sections we now
want to consider a more general reinforcement learning setting where the learner
operates on an unknown underlying MDP. Note that the stochastic bandit problem
corresponds to a single state MDP.
Thus, consider an MDP μ with state space S, action space A, and let r (s, a) ∈
[0, 1] and P(·|s, a) be the mean reward and the transition probability distribution
on S for each state s ∈ S and each action a ∈ A, respectively. For the moment
we assume that S and A are finite. As we have seen in Sect. 6.6 there are various
optimality criteria for MDPs. In the spirit of the bandit problems considered so far
we consider undiscounted rewards and examine the regret after any T steps with
respect to an optimal policy.
Since the optimal T -step policy in general will be non-stationary and different for
different horizons T and different initial states, we will compare to a gain optimal
policy π ∗ as introduced in Definition 6.6.3. Further, we assume that the MDP is
communicating. That is, for any two states s, s′ there is a policy $\pi_{s,s'}$ that with positive
probability reaches s′ when starting in s. This assumption allows the learner to recover
when making a mistake. Note that in MDPs that are not communicating one wrong
step may lead to a suboptimal region of the state space that cannot be left anymore,
which makes competing with an optimal policy in a learning setting impossible. For
communicating MDPs we can define the diameter to be the maximal expected time
it takes to connect any two states.
Definition 10.3.1 Let T (π, s, s′) be the expected number of steps it takes to reach
state s′ when starting in s and playing policy π . Then the diameter is defined as
$$D \triangleq \max_{s \neq s'} \min_{\pi} T(\pi, s, s').$$
Given that our rewards are assumed to be bounded in [0, 1], intuitively, when we
make one wrong step in some state s, in the long run we won’t lose more than D.
After all, in D steps we can go back to s and continue optimally.
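For a small known MDP the diameter, that is, the largest over all pairs of distinct states of the smallest expected time to get from one to the other, can be computed numerically. The following Python sketch does this by value iteration on the expected hitting times for each target state. It is an illustration under assumptions: the representation `P[s][a][s2]` of the transition probabilities and the name `diameter` are hypothetical, and the iteration only converges if the MDP is communicating.

```python
def diameter(P, tol=1e-10, max_iter=100000):
    """Return max over s != s2 of the minimal expected number of steps from s to s2.

    P[s][a][s2] is the probability of moving from s to s2 under action a
    (hypothetical representation used only in this sketch).
    """
    n = len(P)
    worst = 0.0
    for target in range(n):
        # h[s]: current estimate of the minimal expected hitting time of `target`.
        h = [0.0] * n
        for _ in range(max_iter):
            new_h = [
                0.0 if s == target else
                1.0 + min(
                    sum(p * h[s2] for s2, p in enumerate(row)) for row in P[s]
                )
                for s in range(n)
            ]
            change = max(abs(a - b) for a, b in zip(new_h, h))
            h = new_h
            if change < tol:
                break
        worst = max(worst, max(h))
    return worst
```

In a deterministic cycle over three states with a single action, for instance, the function returns 2, since the longest detour between two states takes two steps.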
Under the assumption that the MDP is communicating, the gain g ∗ can be shown
to be independent of the initial state, that is, g ∗ (s) = g ∗ for all states s. Accordingly,
we define the T -step regret of a learning algorithm as
$$L_T \triangleq \sum_{t=1}^{T} \left( g^* - r_t \right),$$
where rt is the reward collected by the algorithm at step t. Note that in general (and
depending on the initial state) the value T g ∗ we compare to will differ from the
optimal T -step reward. However, this difference can be shown to be upper bounded
by the diameter and is therefore negligible when considering the regret.
Now we aim at extending the idea underlying the UCB1 algorithm to the general
reinforcement learning setting. Again, we would like to have for each (stationary)
policy π an upper bound on the gain that is reasonable to expect. Note that simply
taking each policy to be the arm of a bandit problem does not work well. First, to
approach the true gain of a chosen policy, it will not be sufficient to choose it just
once. It would be necessary to follow each policy for a sufficiently high number of
consecutive steps. Without knowledge of some characteristics of the underlying MDP
like mixing times, it might be however difficult to determine how long a policy shall
be played. Further, due to the large number of stationary policies, which is $|A|^{|S|}$,
the regret bounds that would result from such an approach would be exponential in
the number of states.
Thus, we rather maintain confidence regions for the rewards and transition prob-
abilities of each state-action pair s, a. Then, at each step t, these confidence regions
implicitly also define a confidence region for the true underlying MDP μ∗ , that is, a
set Mt of plausible MDPs. For suitably chosen confidence intervals for the rewards
and transition probabilities one can obtain that
$$P(\mu^* \notin M_t) < \delta. \qquad (10.3.1)$$
Given this confidence region Mt , one can define the optimistic value for any
policy π to be
$$g_+^{\pi}(M_t) \triangleq \max \left\{ g_{\mu}^{\pi} \mid \mu \in M_t \right\}. \qquad (10.3.2)$$
Note that similar to the bandit setting this estimate is optimistic for each policy,
as due to (10.3.1) it holds that $g_+^{\pi}(M_t) \ge g_{\mu}^{\pi}$ with high probability. Analogously to
UCB1 we would like to make an optimistic choice among the possible policies, that
is, we choose a policy π that maximizes $g_+^{\pi}(M_t)$.
However, unlike in the bandit setting where we immediately receive a sample
from the reward of the chosen arm, in the MDP setting we only obtain information
about the reward in the current state. Thus, we should not play the chosen optimistic
policy just for one but a sufficiently large number of steps. An easy way is to play
policies in episodes of increasing length, such that sooner or later each action is
played for a sufficient number of steps in each state. Summarized, we obtain (the
outline of) an algorithm as shown below.
To make the algorithm complete, we have to fill in some technical details. In the
following, let S be the number of states and A the number of actions of the underlying
MDP μ. Further, the algorithm takes a confidence parameter δ > 0.
The confidence region. Concerning the confidence regions, for the rewards it is
sufficient to use confidence intervals similar to those for UCB1. For the transition
probabilities we consider all those transition probability distributions to be plausible
that are close in $\|\cdot\|_1$-norm to the empirical distribution $\hat{P}_t(\cdot \mid s, a)$. That is,
the confidence region $M_t$ at step t used to compute the optimistic policy can be
defined as the set of MDPs with mean rewards r (s, a) and transition probabilities
P(· | s, a) such that
$$\left| r(s,a) - \hat{r}(s,a) \right| \le \sqrt{\frac{7 \log(2SAt/\delta)}{2 N_t(s,a)}},$$
$$\left\| P(\cdot \mid s,a) - \hat{P}_t(\cdot \mid s,a) \right\|_1 \le \sqrt{\frac{14 S \log(2At/\delta)}{N_t(s,a)}},$$
where r̂ (s, a) and P̂ t (· | s, a) are the estimates for the rewards and the transition
probabilities, and Nt (s, a) denotes the number of samples of action a in state s at
time step t.
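The two confidence radii are easy to compute from the visit counts. The Python sketch below evaluates them with the constants 7 and 14 used above; replacing the count by max(1, N) to avoid a division by zero for unvisited pairs is an assumption of this sketch, and the function names are hypothetical.

```python
import math


def reward_radius(n_visits, t, S, A, delta):
    """Radius of the reward confidence interval for one state-action pair."""
    n = max(1, n_visits)  # assumption: guard against unvisited pairs
    return math.sqrt(7.0 * math.log(2.0 * S * A * t / delta) / (2.0 * n))


def transition_radius(n_visits, t, S, A, delta):
    """Radius of the L1 confidence ball around the empirical transition distribution."""
    n = max(1, n_visits)  # assumption: guard against unvisited pairs
    return math.sqrt(14.0 * S * math.log(2.0 * A * t / delta) / n)
```

Both radii shrink at rate one over the square root of the number of samples, so quadrupling the number of visits halves the radius at a fixed step t. The transition radius additionally grows with the square root of the number of states.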
One can show via a bound due to Weissman et al. [13] that given n samples of
the transition probability distribution P(· | s, a), one has
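$$P\left( \left\| P(\cdot \mid s,a) - \hat{P}_t(\cdot \mid s,a) \right\|_1 \ge \varepsilon \right) \le 2^S \exp\left( -\frac{n \varepsilon^2}{2} \right).$$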
Using this together with standard Hoeffding bounds for the reward estimates, it
can be shown that the confidence region contains the true underlying MDP with high
probability.
Lemma 10.3.1
$$P(\mu^* \in M_t) > 1 - \frac{\delta}{15 t^6}.$$
Episode lengths. Concerning the termination of episodes, as already mentioned, we
would like to have episodes that are long enough so that we do not suffer large regret
when playing a suboptimal policy. Intuitively, it only pays off to recompute the opti-
mistic policy when the estimates or confidence intervals have changed sufficiently.
One option is e.g. to terminate an episode when the confidence interval for one state-
action pair has shrunk by some factor. Even simpler, one can terminate an episode
when a state-action pair has been sampled often (compared to the samples one had
before the episode has started), e.g. when one has doubled the number of visits in
some state-action pair. This also allows one to bound the total number of episodes up to
step T .
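The doubling criterion can be illustrated by splitting a given sequence of visited state-action pairs into episodes. In the Python sketch below an episode ends as soon as the number of visits of some pair within the episode reaches its number of visits before the episode (or 1 for a pair not visited before). The function name `episode_starts` and the input format are hypothetical.

```python
from collections import defaultdict


def episode_starts(visits):
    """Return the step indices at which a new episode starts.

    `visits` is a sequence of hashable state-action pairs in the order
    in which they were visited (hypothetical input format).
    """
    visits = list(visits)
    counts_before = defaultdict(int)  # N_k(s, a): visits before the current episode
    counts_in = defaultdict(int)      # v_k(s, a): visits within the current episode
    starts = [0]
    for step, pair in enumerate(visits):
        counts_in[pair] += 1
        if counts_in[pair] >= max(1, counts_before[pair]):
            # The count of `pair` has doubled: close the episode.
            for key, v in counts_in.items():
                counts_before[key] += v
            counts_in.clear()
            if step + 1 < len(visits):
                starts.append(step + 1)
    return starts
```

For a single state-action pair visited 16 times the episodes start at steps 0, 1, 2, 4 and 8, so the number of episodes grows only logarithmically with the number of steps.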
This episode termination criterion also allows one to bound the sum over all fractions
of the form $v_k(s,a) / \sqrt{N_k(s,a)}$, where $v_k(s,a)$ is the number of times action a has been chosen
in state s during episode k, while Nk (s, a) is the respective count of visits before
episode k. The evaluation of this sum will turn out to be important in the regret
analysis below to bound the sum over all confidence intervals over the visited state-
action pairs.
Lemma 10.3.3
$$\sum_{k} \sum_{s,a} \frac{v_k(s,a)}{\sqrt{N_k(s,a)}} \le \left( \sqrt{2} + 1 \right) \sqrt{SAT}.$$
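The inequality of the lemma can be checked numerically on a small example. The Python sketch below evaluates the left-hand side for given per-episode visit counts. Using max(1, N) in the denominator for pairs without earlier visits is an assumption of this sketch, and the name `weighted_visit_sum` is hypothetical.

```python
import math


def weighted_visit_sum(episode_counts):
    """Sum of v_k(s,a) / sqrt(N_k(s,a)) over all episodes k and pairs (s,a).

    `episode_counts` is a list with one dict per episode mapping a
    state-action pair to its number of visits v_k(s,a) in that episode.
    """
    totals = {}  # N_k(s, a): visits before the current episode
    acc = 0.0
    for v_k in episode_counts:
        for pair, v in v_k.items():
            acc += v / math.sqrt(max(1, totals.get(pair, 0)))
        # Update the counts only after the whole episode has been processed.
        for pair, v in v_k.items():
            totals[pair] = totals.get(pair, 0) + v
    return acc
```

With a single state-action pair and episode lengths 1, 1, 2, 4, 8 as produced by the doubling criterion over T = 16 steps, the sum is about 8.24, which is below the value (√2 + 1) · 4 ≈ 9.66 given by the lemma for S = A = 1.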
where P(s, a) is the set of all plausible transition probabilities for choosing
action a in state s.
Similarly to the value iteration algorithm in Sect. 6.5.4.1, this scheme can be
shown to converge. More precisely, one can show that
$\max_s \{u_{i+1}(s) - u_i(s)\} - \min_s \{u_{i+1}(s) - u_i(s)\} \to 0$ and also
$$u_{i+1}(s) \to u_i(s) + g_+^{\tilde{\pi}} \quad \text{for all } s. \qquad (10.3.4)$$
After convergence the maximizing actions constitute the optimistic policy π̃ , and the
maximizing transition probabilities are the respective optimistic transition values P̃.
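The iteration can be sketched in Python as follows. This is only an illustration under assumptions: it follows the description of extended value iteration in Jaksch et al. [12], where the maximization over the plausible transition probabilities moves probability mass of at most half the radius to the state with the highest current value and takes it away from the states with the lowest values. The function names and the inputs `r_upper` (optimistic rewards), `p_hat` (empirical distributions) and `radius` (radii of the L1 balls) are hypothetical.

```python
def optimistic_transition(p_hat, radius, u):
    """Maximise sum_s p(s) u(s) over distributions p with ||p - p_hat||_1 <= radius."""
    order = sorted(range(len(p_hat)), key=lambda s: u[s], reverse=True)
    p = list(p_hat)
    best = order[0]
    p[best] = min(1.0, p_hat[best] + radius / 2.0)
    surplus = sum(p) - 1.0
    for s in reversed(order):      # remove mass from the lowest values first
        if surplus <= 0.0:
            break
        reduce = min(p[s], surplus)
        p[s] -= reduce
        surplus -= reduce
    return p


def extended_value_iteration(r_upper, p_hat, radius, tol=1e-6, max_iter=10000):
    """Return an optimistic policy and an estimate of its optimistic gain."""
    n = len(r_upper)
    u = [0.0] * n
    policy, diffs = [0] * n, [0.0] * n
    for _ in range(max_iter):
        new_u, policy = [], []
        for s in range(n):
            values = []
            for a in range(len(r_upper[s])):
                p = optimistic_transition(p_hat[s][a], radius[s][a], u)
                values.append(r_upper[s][a] + sum(q * v for q, v in zip(p, u)))
            best_a = max(range(len(values)), key=values.__getitem__)
            policy.append(best_a)
            new_u.append(values[best_a])
        diffs = [x - y for x, y in zip(new_u, u)]
        # Shifting by a constant keeps u bounded and does not change the differences.
        low = min(new_u)
        u = [x - low for x in new_u]
        if max(diffs) - min(diffs) < tol:   # span of the increments is small
            break
    return policy, 0.5 * (max(diffs) + min(diffs))
```

The stopping rule mirrors the convergence statement above: once the increments of the value vector are nearly the same in all states, their common value approximates the optimistic gain.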
One can also show that the so-called span $\max_s u_i(s) - \min_s u_i(s)$ of the converged
value vector $u_i$ is upper bounded by the diameter. This follows by optimality
of the vector $u_i$. Intuitively, if the span were larger than D one could increase
the collected reward in the lower value state $s^-$ by going (as fast as possible) to the
higher value state $s^+$. Note that this argument uses the fact that the true MDP is
plausible w.h.p., so that we may take the true transitions to get from $s^-$ to $s^+$.
Theorem 10.3.1 ([12]) In an MDP with S states, A actions, and diameter D with
probability of at least 1 − δ the regret of UCRL2 after any T steps is bounded by
$$\mathrm{const} \cdot D S \sqrt{A T \log\left( \frac{T}{\delta} \right)}.$$
Proof The main idea of the proof is that by Lemma 10.3.1 we have that
$$\tilde{g}_k^* \triangleq g_+^{\tilde{\pi}_k}(M_{t_k}) \ge g^* \ge g^{\tilde{\pi}_k}, \qquad (10.3.5)$$
so that the regret in each step is upper bounded by the width of the confidence interval
for $g^{\tilde{\pi}_k}$, that is, by $\tilde{g}_k^* - g^{\tilde{\pi}_k}$. In what follows we need to break down this confidence
interval to the confidence intervals we have for rewards and transition probabilities.
In the following, we assume that the true MDP μ is always contained in the
confidence regions $M_t$ considered by the algorithm. Using Lemma 10.3.1 it is not
difficult to show that with probability at least $1 - \frac{\delta}{12 T^{5/4}}$ the regret accumulated due
to $\mu \notin M_t$ at some step t is bounded by $\sqrt{T}$.
Further, note that the random fluctuation of the rewards can be easily bounded
by Hoeffding’s inequality (4.5.5), that is, if st and at denote the state and action at
step t, we have
$$\sum_{t=1}^{T} r_t \ge \sum_{t} r(s_t, a_t) - \sqrt{\tfrac{5}{8}\, T \log \tfrac{8T}{\delta}}$$
$$\sum_{t=1}^{T} (g^* - r_t) \le \sum_{k} \sum_{s,a} v_k(s,a) \left( \tilde{g}_k^* - r(s,a) \right) + \sqrt{T} + \sqrt{\tfrac{5}{8}\, T \log \tfrac{8T}{\delta}} \qquad (10.3.6)$$
$$|\tilde{r}_k(s,a) - \hat{r}_k(s,a)| + |\hat{r}_k(s,a) - r(s,a)| \le 2\, \mathrm{conf}^{r}_k(s,a)$$
For the first term in (10.3.7) we use that after convergence of the value vector u i
we have by (10.3.3) and (10.3.4)
$$\tilde{g}_k^* - \tilde{r}_k(s, \tilde{\pi}_k(s)) = \sum_{s'} \tilde{P}_k(s' \mid s, \tilde{\pi}_k(s)) \cdot u_i(s') - u_i(s).$$
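Then noting that $v_k(s,a) = 0$ for $a \neq \tilde{\pi}_k(s)$ and using vector/matrix notation it
follows that
$$\begin{aligned}
\sum_{s,a} v_k(s,a) \left( \tilde{g}_k^* - \tilde{r}_k(s, \tilde{\pi}_k(s)) \right)
&= \sum_{s,a} v_k(s,a) \left( \sum_{s'} \tilde{P}_k(s' \mid s, \tilde{\pi}_k(s)) \cdot u_i(s') - u_i(s) \right) \\
&= v_k \left( \tilde{P}_k - I \right) u \\
&= v_k \left( \tilde{P}_k - P_k + P_k - I \right) w_k \\
&= v_k \left( \tilde{P}_k - P_k \right) w_k + v_k \left( P_k - I \right) w_k, \qquad (10.3.9)
\end{aligned}$$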
where $P_k$ is the true transition matrix (in μ) of the optimistic policy $\tilde{\pi}_k$ in episode k,
and $w_k$ is a renormalization of the vector u (with entries $u_i(s)$) where
$w_k(s) := u_i(s) - \frac{1}{2} \left( \min_s u_i(s) + \max_s u_i(s) \right)$, so that
$\|w_k\|_\infty \le \frac{D}{2}$ by Lemma 10.3.4.
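Since $\|\tilde{P}_k - P_k\|_1 \le \|\tilde{P}_k - \hat{P}_k\|_1 + \|\hat{P}_k - P_k\|_1$, the first term of (10.3.9) is
bounded as
$$v_k \left( \tilde{P}_k - P_k \right) w_k \le \left\| v_k \left( \tilde{P}_k - P_k \right) \right\|_1 \cdot \|w_k\|_\infty \le 2 \sum_{s,a} v_k(s,a)\, \mathrm{conf}^{p}_k(s,a)\, D. \qquad (10.3.10)$$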
so that its sum over all episodes can be bounded by the Azuma-Hoeffding inequality
(5.11) and Lemma 10.3.2, that is,
$$\sum_{k} v_k \left( P_k - I \right) w_k \le D \sqrt{\tfrac{5}{2}\, T \log \tfrac{8T}{\delta}} + D S A \log_2 \tfrac{8T}{SA}. \qquad (10.3.11)$$
The following is a corresponding lower bound on the regret that shows that the
upper bound of Theorem 10.3.1 is optimal in T and A.
Theorem 10.3.2 (Jaksch et al. [12]) For any algorithm and any natural numbers T ,
S, A > 1, and $D \ge \log_A S$ there is an MDP with S states, A actions, and diameter D
such that the expected regret after T steps is
$$\Omega\left( \sqrt{DSAT} \right).$$
Similar to the distribution dependent regret bound of Theorem 10.2.1 for UCB1,
one can derive a logarithmic bound on the expected regret of UCRL2.
Theorem 10.3.3 (Jaksch et al. [12]) In an MDP with S states, A actions, and diameter D
the expected regret of UCRL2 is
$$O\left( \frac{D^2 S^2 A \log(T)}{\Delta} \right),$$
where $\Delta \triangleq g^* - \max_{\pi} \{ g^{\pi} : g^{\pi} < g^* \}$ is the gap between the optimal gain and the
second largest gain.
Similar to UCB1 that was based on the work of Lai and Robbins [3], UCRL2 is not
the first optimistic algorithm with theoretical guarantees. For example, the index policies of
Burnetas and Katehakis [14] and Tewari and Bartlett [15] choose actions optimistically
by using confidence bounds for the estimates in the current state. However, the
logarithmic regret bounds are derived only for ergodic MDPs in which each policy
visits each state with probability 1.
Another important predecessor that is based on the principle of optimism in the
face of uncertainty is R-Max [16], which assumes that the maximal possible reward is
received in each state that has not been visited sufficiently often. UCRL2 offers a refinement of this idea to
motivate exploration. Sample complexity bounds as derived for R-Max can also be
obtained for UCRL2, cf. [12].
The gap between the lower bound of Theorem 10.3.2 and the bound for UCRL2
has not been closed so far. There have been various attempts in that direction for different
algorithms inspired by Thompson sampling [17] or UCB1 [18]. However, all of the
claimed proofs seem to contain some issues that remain unresolved to date.
The situation is settled in the simpler episodic setting, where after any H steps
there is a restart. Here there are matching upper and lower bounds of order $\sqrt{HSAT}$
on the regret, see [19].
In the discounted setting, the MBIE algorithm of Strehl and Littman [20, 21] is a
precursor of UCRL2 that is based on the same ideas. While there are regret bounds
available also for MBIE, these are not easily comparable to Theorem 10.3.1, as the
regret is measured along the trajectory of the algorithm, while the regret considered
for UCRL2 is with respect to the trajectory an optimal policy would have taken.
In general, regret in the discounted setting seems to be a less satisfactory concept.
However, sample complexity bounds in the discounted setting for a UCRL2 variant
have been given in Lattimore and Hutter [22].
Last but not least, we would like to refer any reader interested in the material of
this chapter to the recent book of Lattimore and Szepesvári [10] that deals with the
whole range of topics from simple bandits to reinforcement learning in MDPs in
much more detail.
References
1. Hart, P.E., Nilsson, N.J., Raphael, B.: A formal basis for the heuristic determination of mini-
mum cost paths. IEEE Trans. Syst. Sci. Cybern. 4(2), 100–107 (1968)
2. Auer, P., Cesa-Bianchi, N., Fischer, P.: Finite time analysis of the multiarmed bandit problem.
Mach. Learn. 47(2–3), 235–256 (2002)
3. Lai, T.L., Robbins, H.: Asymptotically efficient adaptive allocation rules. Adv.
Appl. Math. 6(1), 4–22 (1985)
4. Mannor, S., Tsitsiklis, J.N.: The sample complexity of exploration in the multi-armed bandit
problem. J. Mach. Learn. Res. 5, 623–648 (2004)
5. Auer, P., Ortner, R.: UCB revisited: improved regret bounds for the stochastic multi-armed
bandit problem. Period. Math. Hung. 61(1–2), 55–65 (2010)
6. Lattimore, T.: Optimally confident UCB: Improved regret for finite-armed bandits. Technical
Report 1507.07880 (2015). (arXiv)
7. Audibert, J.-Y., Bubeck, S.: Minimax policies for adversarial and stochastic bandits. In:
Proceedings of the 22nd Annual Conference on Learning Theory (COLT 2009), pp. 217–226
(2009)
8. Bubeck, S., Cesa-Bianchi, N.: Regret analysis of stochastic and nonstochastic multi-armed
bandit problems. Found. Trends Mach. Learn. 5(1), 1–122 (2012)
9. Auer, P., Cesa-Bianchi, N., Freund, Y., Schapire, R.E.: The nonstochastic multiarmed bandit
problem. SIAM J. Comput. 32(1), 48–77 (2002)
10. Lattimore, T., Szepesvári, C.: Bandit Algorithms. Cambridge University Press (2020)
11. Ortner, R., Ryabko, D., Auer, P., Munos, R.: Regret bounds for restless Markov bandits. Theor.
Comput. Sci. 558, 62–76 (2014)
12. Jaksch, T., Ortner, R., Auer, P.: Near-optimal regret bounds for reinforcement learning. J.
Mach. Learn. Res. 11, 1563–1600 (2010)
13. Weissman, T., Ordentlich, E., Seroussi, G., Verdu, S., Weinberger, M.J.: Inequalities for the
L 1 deviation of the empirical distribution. Technical Report HPL-2003-97 (R.1), Hewlett-
Packard Labs (2003)
14. Burnetas, A.N., Katehakis, M.N.: Optimal adaptive policies for Markov decision processes.
Math. Oper. Res. 22(1), 222–255 (1997)
15. Tewari, A., Bartlett, P.: Optimistic linear programming gives logarithmic regret for irreducible
MDPs. In: Advances in Neural Information Processing Systems, vol. 20, pp. 1505–1512. MIT
Press (2008)
16. Brafman, R.I., Tennenholtz, M.: R-MAX-A general polynomial time algorithm for near-
optimal reinforcement learning. J. Mach. Learn. Res. 3, 213–231 (2003)
17. Agrawal, S., Jia, R.: Optimistic posterior sampling for reinforcement learning: worst-case
regret bounds. In: Advances in Neural Information Processing Systems, vol. 30, pp. 1184–
1194 (2017)
18. Ortner, R.: Regret bounds for reinforcement learning via Markov chain concentration. J. Artif.
Intell. Res. 67, 115–128 (2020)
19. Azar, M.G., Osband, I., Munos, R.: Minimax regret bounds for reinforcement learning. In:
Proceedings of the 34th International Conference on Machine Learning, ICML 2017, pp.
263–272 (2017)
20. Strehl, A.L., Littman, M.L.: A theoretical analysis of model-based interval estimation. In:
Machine Learning, Proceedings of the 22nd International Conference, ICML 2005, pp. 857–
864. ACM (2005)
21. Strehl, A.L., Littman, M.L.: An analysis of model-based interval estimation for Markov
decision processes. J. Comput. Syst. Sci. 74(8), 1309–1331 (2008)
22. Lattimore, T., Hutter, M.: Near-optimal PAC bounds for discounted MDPs. Theor. Comput.
Sci. 558, 125–143 (2014)
Chapter 11
Conclusion
This book touched upon the basic principles of decision making under uncertainty
in the context of reinforcement learning. While one of the main streams of thought
is Bayesian decision theory, we also discussed the basics of approximate dynamic
programming and stochastic approximation as applied to reinforcement learning
problems.
However, we have deliberately avoided going into a number of topics related to
reinforcement learning and decision theory, some of which would need a book of
their own to be properly addressed. Even though it was fun writing the book, at
some point we had to decide to stop and consolidate the material we had, sometimes
culling partially developed material in favour of a more concise volume.
Firstly, we have not explicitly considered many models that can be used for
representing transition distributions, value functions or policies, beyond the simplest
ones, as we felt that this would detract from the main body of the text. Textbooks
on the latest fashionable methods are always going to be abundant, and we hope that
this book provides a sufficient basis for using any current method. There are also
a large number of areas which have not been covered at all. In particular, while we
touched upon the setting of two-player games and its connection to robust statistical
decisions, we have not examined other problems that are relevant to sequential
decision making, such as Markov games and Bayesian games. Similarly, while
we discuss risk aversion and risk seeking early in the book, we have not discussed
specific sequential decision making algorithms for such problems. Furthermore, even
though we discuss the problem of preference elicitation, we do not discuss specific
algorithms for it or the related problem of inverse reinforcement learning. Another
topic that went unmentioned, but which may become more important in the future,
is hierarchical reinforcement learning, together with options, which allow constructing
long-term actions (such as “go to the supermarket”) from primitive actions (such as
“open the door”). Finally, even though we have mentioned the basic framework of
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022
C. Dimitrakakis and R. Ortner, Decision Making Under Uncertainty
and Reinforcement Learning, Intelligent Systems Reference Library 223,
https://doi.org/10.1007/978-3-031-07614-5_11
© The Editor(s) (if applicable) and The Author(s), under exclusive license to
Springer Nature Switzerland AG 2022
C. Dimitrakakis and R. Ortner, Decision Making Under Uncertainty
and Reinforcement Learning, Intelligent Systems Reference Library 223,
https://doi.org/10.1007/978-3-031-07614-5
Appendix: Symbols
x∈A x belongs to A
A⊂B A is a (strict) subset of B
A⊆B A is a (non-strict) subset of B
B\A Set difference
B△A Symmetric set difference
A∁ Set complement
A∪B Set union
A∩B Set intersection
Index

I
Importance sampling, 191
Inequality
  Chebyshev, 82
  Hoeffding, 83
  Jensen, 20
  Markov, 81

K
KL-divergence, 86

L
Least squares, 181
Likelihood
  conditional, 11
  relative, 7
Linear programming, 137
Loss, 31
  quadratic, 31

M
Markov decision process, 115, 116, 142
  belief-augmented, 204
  partially observable, 214
  variable order, 217
Markov process, 105
Martingale, 104
Matrix determinant, 75
Maximin, 36
Minimax, 37
Mixture of distributions, 33
Monte Carlo update
  every-visit, 155
  first-visit, 155
Multinomial, 73
Multivariate-normal, 75

O
Observation distribution, 215

P
Policy, 44, 110, 118
  blind, 199
  ε-greedy, 148
  history-dependent, 118
  k-order Markov, 199
  Markov, 118
  maximin, 41
  memoryless, 199
  optimal, 119
  stationary, 125
  stochastic, 167
Policy approximation, 170
Policy estimation, 176
Policy evaluation, 120
  backwards induction, 120
  Bayesian Monte Carlo, 208
  Monte Carlo, 153
Policy gradient, 188
  Bayesian, 204
  stochastic, 190
Policy iteration, 133
  approximate, 178
  modified, 134
  temporal-difference, 136
Policy optimization
  backwards induction, 121
Posterior sampling, see Thompson sampling
Preference, 14
Probability
  subjective, 7
Pseudo-inverse, 182

Q
Q-learning, 159

R
Regret, 37
  total, 222
Reward, 14
Reward distribution, 116, 215
Risk, 31
Robbins-Monro approximation, 147

S
Sample mean, 60
Sampling
  sequential, 89
Sequential probability ratio test, 102
Series
  geometric, 93
Softmax, 177
Spectral radius, 126
Standard normal, 68
State aggregation, 185
Stationary Markov process, 105
Statistic, 60
  sufficient, 60
Stopping function, 89
Stopping set, 90
Strategy, 35
Student t-distribution, 72

T
Temporal difference, 135, 136
  error, 136, 156
Thompson sampling, 208, 210
Trace, 75
Transition distribution, 116, 215

U
Un-bounded procedures, 94
Upper confidence bound, 223
Utility, 16, 118
  Bayes-optimal, 31
  function, 20

V
Value, 92
Value function
  approximate, 168
  optimal, 119
  state, 119
  state-action, 119
Value iteration, 131
  approximate, 184
  generalized stochastic, 161

W
Wald’s theorem, 102