Bandit Algorithms
This is the (free) online edition. The content is the same as the print edition, published
by Cambridge University Press, except that minor typos are corrected here. There are
also font and other typographical differences that mean the page numbers do not
match between the versions.
Contents
Contents page ii
1 Introduction 8
1.1 The Language of Bandits 10
1.2 Applications 13
1.3 Notes 16
1.4 Bibliographic Remarks 16
2 Foundations of Probability ( ) 18
2.1 Probability Spaces and Random Elements 18
2.2 σ-Algebras and Knowledge 26
2.3 Conditional Probabilities 28
2.4 Independence 29
2.5 Integration and Expectation 30
2.6 Conditional Expectation 33
2.7 Notes 37
2.8 Bibliographic Remarks 41
2.9 Exercises 42
4 Stochastic Bandits 56
4.1 Core Assumptions 56
4.2 The Learning Objective 57
4.3 Knowledge and Environment Classes 57
5 Concentration of Measure 73
5.1 Tail Probabilities 73
5.2 The Inequalities of Markov and Chebyshev 74
5.3 The Cramér-Chernoff Method and Subgaussian Random Variables 76
5.4 Notes 78
5.5 Bibliographical Remarks 80
5.6 Exercises 80
Part IV Lower Bounds for Bandits with Finitely Many Arms 176
32 Ranking 387
32.1 Click Models 388
32.2 Policy 391
32.3 Regret Analysis 393
32.4 Notes 397
32.5 Bibliographic Remarks 399
32.6 Exercises 400
Bibliography 548
Index 582
Preface
Multi-armed bandits have now been studied for nearly a century. While research
in the beginning was quite meandering, there is now a large community publishing
hundreds of articles every year. Bandit algorithms are also finding their way into
practical applications in industry, especially in on-line platforms where data is
readily available and automation is the only way to scale.
We had hoped to write a comprehensive book, but the literature is now so vast
that many topics have been excluded. In the end we settled on the more modest
goal of equipping our readers with enough expertise to explore the specialised
literature by themselves, and to adapt existing algorithms to their applications.
This latter point is important. Problems in theory are all alike; every application is
different. A practitioner seeking to apply a bandit algorithm needs to understand
which assumptions in the theory are important and how to modify the algorithm
when the assumptions change. We hope this book can provide that understanding.
What is covered in the book is covered in some depth. The focus is on the
mathematical analysis of algorithms for bandit problems, but this is not a
traditional mathematics book, where lemmas are followed by proofs, theorems
and more lemmas. We worked hard to include guiding principles for designing
algorithms and intuition for their analysis. Many algorithms are accompanied by
empirical demonstrations that further aid intuition.
We expect our readers to be familiar with basic analysis and calculus and
some linear algebra. The book uses the notation of measure-theoretic probability
theory, but does not rely on any deep results. A dedicated chapter is included to
introduce the notation and provide intuitions for the basic results we need. This
chapter is unusual for an introduction to measure theory in that it emphasises the
reasons to use σ-algebras beyond the standard technical justifications. We hope
this will convince the reader that measure theory is an important and intuitive
tool. Some chapters use techniques from information theory and convex analysis,
and we devote a short chapter to each.
Most chapters are short and should be readable in an afternoon or presented in
a single lecture. Some components of the book contain content that is not really
about bandits. These can be skipped by knowledgeable readers, or otherwise
referred to when necessary. They are marked with a ( ) because ‘Skippy the
Kangaroo’ skips things.1 The same mark is used for those parts that contain
useful, but perhaps overly specific information for the first-time reader. Later
parts will not build on these chapters in any substantial way. Most chapters end
with a list of notes and exercises. These are intended to deepen intuition and
highlight the connections between various subsections and the literature. There
is a table of notation at the end of this preface.
Thanks
We’re indebted to our many collaborators and feel privileged that there are
too many of you to name. The University of Alberta, Indiana University and
DeepMind have all provided outstanding work environments and supported the
completion of this book. The book has benefited enormously from the proofreading
efforts of a large number of our friends and colleagues. We are sorry for all the
mistakes introduced after your hard work. Alphabetically, they are: Aaditya
Ramdas, Abbas Mehrabian, Aditya Gopalan, Ambuj Tewari, András György,
Arnoud den Boer, Branislav Kveton, Brendan Patch, Chao Tao, Chao Qin,
Christoph Dann, Claire Vernade, Emilie Kaufmann, Eugene Ji, Gellért Weisz,
Gergely Neu, Johannes Kirschner, Julian Zimmert, Kwang-Sung Jun, Lalit Jain,
Laurent Orseau, Marcus Hutter, Michal Valko, Omar Rivasplata, Pierre Menard,
Ramana Kumar, Roman Pogodin, Ronald Ortner, Ronan Fruit, Ruihao Zhu,
Shuai Li, Toshiyuki Tanaka, Wei Chen, Yoan Russac, Yufei Yi and Zhu Xiaohu.
We are especially grateful to Gábor Balázs and Wouter Koolen, who both read
almost the entire book. Thanks to Lauren Cowles and Cambridge University
Press for providing free books for our proofreaders, tolerating the delays and
for supporting a freely available PDF version. Réka Szepesvári is responsible for
converting some of our primary school figures to their current glory. Last of all,
our families have endured endless weekends of editing and multiple false promises
of ‘done by Christmas’. Rosina and Beáta, it really is done now!
1 Taking inspiration from Tor’s grandfather-in-law, John Dillon [Anderson et al., 1977].
Notation
Some sections are marked with special symbols, which are listed and described
below.
Something important.
An experiment.
Landau Notation
We make frequent use of the Bachmann–Landau notation. Both were nineteenth
century mathematicians who could have never expected their notation to be
adopted so enthusiastically by computer scientists. Given functions f, g : N → [0, ∞),
f (n) = Ω(g(n)) ⇔ lim inf_{n→∞} f (n)/g(n) > 0 ,
f (n) = ω(g(n)) ⇔ lim inf_{n→∞} f (n)/g(n) = ∞ .
Bandits
At action in round t
k number of arms/actions
n time horizon
Xt reward in round t
Yt loss in round t
π a policy
ν a bandit
µi mean reward of arm i
Sets
∅ empty set
N, N+ natural numbers, N = {0, 1, 2, . . .} and N+ = N \ {0}
R real numbers
R̄ R ∪ {−∞, ∞}
[n] {1, 2, 3, . . . , n − 1, n}
2A the power set of set A (the set of all subsets of A)
A∗ set of finite sequences over A, A∗ = ∪_{i=0}^∞ Ai
B2d d-dimensional unit ball, {x ∈ Rd : ‖x‖2 ≤ 1}
Pd probability simplex, {x ∈ [0, 1]d+1 : ‖x‖1 = 1}
Linear Algebra
e1 , . . . , e d standard basis vectors of the d-dimensional Euclidean space
0, 1 vectors whose elements are all zeros and all ones, respectively
det(A) determinant of matrix A
trace(A) trace of matrix A
im(A) image of matrix A
ker(A) kernel of matrix A
span(v1 , . . . , vd ) span of vectors v1 , . . . , vd
λmin (G) minimum eigenvalue of matrix G
⟨x, y⟩ inner product, ⟨x, y⟩ = Σi xi yi
‖x‖p p-norm of vector x
‖x‖2G x⊤ Gx for positive definite G ∈ Rd×d and x ∈ Rd
Topological
cl(A) closure of set A
int(A) interior of set A
∂A boundary of a set A, ∂A = cl(A) \ int(A)
co(A) convex hull of A
aff(A) affine hull of A
ri(A) relative interior of A
Part I
Bandits, Probability and
Concentration
1 Introduction
Bandit algorithms also play a role in Monte Carlo Tree Search, an algorithm made
famous by the recent success of AlphaGo.
Finally, the mathematical formulation of bandit problems leads to a rich
structure with connections to other branches of mathematics. In writing this
book (and previous papers), we have read books on convex analysis/optimisation,
Brownian motion, probability theory, concentration analysis, statistics, differential
geometry, information theory, Markov chains, computational complexity and more.
What fun!
A combination of all these factors has led to an enormous growth in research
over the last two decades. Google Scholar reports less than 1000, then 2700 and
7000 papers when searching for the phrase ‘bandit algorithm’ for the periods of
2001–5, 2006–10, and 2011–15, respectively, and the trend just seems to have
strengthened since then, with 5600 papers coming up for the period of 2016 to
the middle of 2018. Even if these numbers are somewhat overblown, they are
indicative of a rapidly growing field. This could be a fashion, or maybe there is
something interesting happening here. We think that the latter is true.
A Classical Dilemma
Imagine you are playing a two-armed bandit machine and you already pulled
each lever five times, resulting in the following pay-offs (in dollars):
left pay-offs: 0, 10, 0, 0, 10
right pay-offs: 10, 0, 0, 0, 0
In the literature, actions are often also called ‘arms’. We talk about k-armed
bandits when the number of actions is k, and about multi-armed bandits
when the number of arms is at least two and the actual number is immaterial
to the discussion. If there are multi-armed bandits, there are also one-armed
bandits, which are really two-armed bandits where the pay-off of one of the
arms is a known fixed deterministic number.
Of course the learner cannot peek into the future when choosing their
actions, which means that At should only depend on the history Ht−1 =
(A1 , X1 , . . . , At−1 , Xt−1 ). A policy is a mapping from histories to actions: A
learner adopts a policy to interact with an environment. An environment is a
mapping from history sequences ending in actions to rewards. Both the learner
and the environment may randomise their decisions, but this detail is not so
important for now. The most common objective of the learner is to choose actions
that lead to the largest possible cumulative reward over all n rounds, which is
Σ_{t=1}^n Xt .
The fundamental challenge in bandit problems is that the environment is
unknown to the learner. All the learner knows is that the true environment
lies in some set E called the environment class. Most of this book is about
designing policies for different kinds of environment classes, though in some cases
the framework is extended to include side observations as well as actions and
rewards.
The next question is how to evaluate a learner. We discuss several performance
measures throughout the book, but most of our efforts are devoted to
understanding the regret. There are several ways to define this quantity. To avoid
getting bogged down in details, we start with a somewhat informal definition.
Definition 1.1. The regret of the learner relative to a policy π (not necessarily
that followed by the learner) is the difference between the total expected reward
using policy π for n rounds and the total expected reward collected by the learner
over n rounds. The regret relative to a set of policies Π is the maximum regret
relative to any policy π ∈ Π in the set.
The set Π is often called the competitor class. Another way of saying all this
is that the regret measures the performance of the learner relative to the best
policy in the competitor class. We usually measure the regret relative to a set of
policies Π that is large enough to include the optimal policy for all environments
in E. In this case, the regret measures the loss suffered by the learner relative to
the optimal policy.
Example 1.2. Suppose the action set is A = {1, 2, . . . , k}. An environment is
called a stochastic Bernoulli bandit if the reward Xt ∈ {0, 1} is binary valued
and there exists a vector µ ∈ [0, 1]k such that the probability that Xt = 1 given
the learner chose action At = a is µa . The class of stochastic Bernoulli bandits is
the set of all such bandits, which are characterised by their mean vectors. If you
knew the mean vector associated with the environment, then the optimal policy
is to play the fixed action a∗ = argmaxa∈A µa . This means that for this problem
the natural competitor class is the set of k constant policies Π = {π1 , . . . , πk },
where πi chooses action i in every round. The regret over n rounds becomes
" n #
X
Rn = n max µa − E Xt ,
a∈A
t=1
where the expectation is with respect to the randomness in the environment and
policy. The first term in this expression is the maximum expected reward using
any policy. The second term is the expected reward collected by the learner.
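To make the regret formula concrete, the following is a minimal Monte Carlo sketch (ours, not from the book) that estimates Rn for a two-armed stochastic Bernoulli bandit when the learner follows the naive policy of choosing an arm uniformly at random in every round. The mean vector and all names are illustrative only.

import random

def expected_regret(mu, n, runs=2000, seed=0):
    """Estimate R_n = n * max_a mu_a - E[sum_t X_t] for the policy that
    picks an arm uniformly at random in every round (illustrative sketch)."""
    rng = random.Random(seed)
    best = n * max(mu)
    total_reward = 0
    for _ in range(runs):
        for _ in range(n):
            a = rng.randrange(len(mu))                         # uniformly random action
            total_reward += 1 if rng.random() < mu[a] else 0   # Bernoulli reward
    return best - total_reward / runs

mu = [0.5, 0.6]   # hypothetical mean vector of a two-armed Bernoulli bandit
print(expected_regret(mu, n=1000))   # close to 1000 * (0.6 - 0.55) = 50

For this naive policy the regret grows linearly with n; much of the book is about policies whose regret grows more slowly.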
For a fixed policy and competitor class, the regret depends on the environment.
The environments where the regret is large are those where the learner is behaving
worse. Of course the ideal case is that the regret be small for all environments.
The worst-case regret is the maximum regret over all possible environments.
One of the core questions in the study of bandits is to understand the growth
rate of the regret as n grows. A good learner achieves sublinear regret. Letting Rn
denote the regret over n rounds, this means that Rn = o(n) or equivalently that
limn→∞ Rn /n = 0. Of course one can ask for more. Under what circumstances is
Rn = O(√n) or Rn = O(log(n))? And what are the leading constants? How does
the regret depend on the specific environment in which the learner finds itself?
We will discover eventually that for the environment class in Example 1.2, the
worst-case regret for any policy is at least Ω(√n) and that there exist policies for
which Rn = O(√n).
In the simplest setting the rewards are stochastic and stationary: the environment
is restricted to generate the reward in response to each action
from a distribution that is specific to that action and independent of the previous
action choices and rewards. The environment class in Example 1.2 satisfies these
conditions, but there are many alternatives. For example, the rewards could follow
a Gaussian distribution rather than Bernoulli. This relatively mild difference does
not change the nature of the problem in a significant way. A more drastic change
is to assume the action set A is a subset of Rd and that the mean reward for
choosing some action a ∈ A follows a linear model, Xt = ⟨a, θ⟩ + ηt for θ ∈ Rd
and ηt a standard Gaussian (zero mean, unit variance). The unknown quantity
in this case is θ, and the environment class corresponds to its possible values
(E = Rd ).
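As a quick illustration of this reward model, here is a hedged sketch (ours, not from the book) of how an environment could generate rewards in the linear case; θ and the actions below are made-up values.

import random

def linear_reward(action, theta, rng):
    """Sample X_t = <a, theta> + eta_t with eta_t a standard Gaussian
    (illustrative sketch of the linear reward model)."""
    mean = sum(a_i * th_i for a_i, th_i in zip(action, theta))
    return mean + rng.gauss(0.0, 1.0)

rng = random.Random(0)
theta = [0.2, -0.5]                                 # unknown parameter (made up)
actions = [(1.0, 0.0), (0.0, 1.0), (0.7, 0.7)]      # a small action set in R^2
print([round(linear_reward(a, theta, rng), 3) for a in actions])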
For some applications, the assumption that the rewards are stochastic and
stationary may be too restrictive. The world mostly appears deterministic, even
if it is hard to predict and often chaotic looking. Of course, stochasticity has
been enormously successful in explaining patterns in data, and this may be
sufficient reason to keep it as the modelling assumption. But what if the stochastic
assumptions fail to hold? What if they are violated for a single round? Or just for
one action, at some rounds? Will our best algorithms suddenly perform poorly?
Or will the algorithms developed be robust to smaller or larger deviations from
the modelling assumptions?
An extreme idea is to drop all assumptions on how the rewards are generated,
except that they are chosen without knowledge of the learner’s actions and lie
in a bounded set. If these are the only assumptions, we get what is called the
setting of adversarial bandits. The trick to say something meaningful in this
setting is to restrict the competitor class. The learner is not expected to find
the best sequence of actions, which may be like finding a needle in a haystack.
Instead, we usually choose Π to be the set of constant policies and demand that
the learner is not much worse than any of these. By defining the regret in this
way, the stationarity assumption is transported into the definition of regret rather
than constraining the environment.
Of course there are all shades of grey between these two extremes. Sometimes
we consider the case where the rewards are stochastic, but not stationary. Or
one may analyse the robustness of an algorithm for stochastic bandits to small
adversarial perturbations. Another idea is to isolate exactly which properties of
the stochastic assumption are really exploited by a policy designed for stochastic
bandits. This kind of inverse analysis can help explain the strong performance of
policies when facing environments that clearly violate the assumptions they were
designed for.
we want to keep the regret small across all possible environments. One way to
convert a multi-objective criterion into a single number is to take averages. This
corresponds to the Bayesian viewpoint where the objective is to minimise the
average cumulative regret with respect to a prior on the environment class.
Maximising the sum of rewards is not always the objective. Sometimes the
learner just wants to find a near-optimal policy after n rounds, but the actual
rewards accumulated over those rounds are unimportant. We will see examples
of this shortly.
1.2 Applications
After this short preview, and as an appetiser before the hard work, we briefly
describe the formalisations of a variety of applications.
A/B Testing
The designers of a company website are trying to decide whether the ‘buy it now’
button should be placed at the top of the product page or at the bottom. In
the old days, they would commit to a trial of each version by splitting incoming
users into two groups of 10 000. Each group would be shown a different version
of the site, and a statistician would examine the data at the end to decide which
version was better. One problem with this approach is the non-adaptivity of the
test. For example, if the effect size is large, then the trial could be stopped early.
One way to apply bandits to this problem is to view the two versions of the
site as actions. Each time t a user makes a request, a bandit algorithm is used
to choose an action At ∈ A = {SiteA, SiteB}, and the reward is Xt = 1 if the
user purchases the product and Xt = 0 otherwise.
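For illustration only, the sketch below (ours, not from the book) runs a simple ε-greedy learner on this two-action problem. ε-greedy is just one of many possible bandit algorithms, not one prescribed here, and the purchase probabilities are hypothetical.

import random

def ab_test_bandit(rate_a, rate_b, n=10_000, eps=0.1, seed=1):
    """Epsilon-greedy choice between two site versions; rate_a and rate_b are
    hypothetical purchase probabilities (illustrative sketch)."""
    rng = random.Random(seed)
    counts = [0, 0]        # times each version was shown
    means = [0.0, 0.0]     # empirical purchase rates
    purchases = 0
    for _ in range(n):
        if rng.random() < eps or 0 in counts:
            a = rng.randrange(2)                       # explore
        else:
            a = 0 if means[0] >= means[1] else 1       # exploit the better-looking version
        p = rate_a if a == 0 else rate_b
        x = 1 if rng.random() < p else 0               # purchase (reward 1) or not (reward 0)
        counts[a] += 1
        means[a] += (x - means[a]) / counts[a]         # running average update
        purchases += x
    return purchases, counts

print(ab_test_bandit(0.05, 0.04))

Unlike the fixed split into two groups of 10 000, such a learner gradually shifts traffic towards the version that appears to convert better.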
Advert Placement
In advert placement, each round corresponds to a user visiting a website, and
the set of actions A is the set of all available adverts. One could treat this as
a standard multi-armed bandit problem, where in each round a policy chooses
At ∈ A, and the reward is Xt = 1 if the user clicked on the advert and Xt = 0
otherwise. This might work for specialised websites where the adverts are all
likely to be appropriate. But for a company like Amazon, the advertising should
be targeted. A user that recently purchased rock-climbing shoes is much more
likely to buy a harness than another user. Clearly an algorithm should take this
into account.
The standard way to incorporate this additional knowledge is to use the
information about the user as context. In its simplest formulation, this might
mean clustering users and implementing a separate bandit algorithm for each
cluster. Much of this book is devoted to the question of how to use side information
to improve the performance of a learner.
This is a good place to emphasise that the world is messy. The set of available
adverts is changing from round to round. The feedback from the user can be
delayed for many rounds. Finally, the real objective is rarely just to maximise
clicks. Other metrics such as user satisfaction, diversity, freshness and fairness,
just to mention a few, are important too. These are the kinds of issues that make
implementing bandit algorithms in the real world a challenge. This book will not
address all these issues in detail. Instead we focus on the foundations and hope
this provides enough understanding that you can invent solutions for whatever
peculiar challenges arise in your problem.
Recommendation Services
Netflix has to decide which movies to place most prominently in your ‘Browse’
page. Like in advert placement, users arrive at the page sequentially, and the
reward can be measured as some function of (a) whether or not you watched a
movie and (b) whether or not you rated it positively. There are many challenges.
First of all, Netflix shows a long list of movies, so the set of possible actions
is combinatorially large. Second, each user watches relatively few movies, and
individual users are different. This suggests approaches such as low-rank matrix
factorisation (a popular approach in ‘collaborative filtering’). But notice this is
not an offline problem. The learning algorithm gets to choose what users see and
this affects the data. If the users are never recommended the AlphaGo movie,
then few users will watch it, and the amount of data about this film will be
scarce.
Network Routing
Another problem with an interesting structure is network routing, where the
learner tries to direct internet traffic through the shortest path on a network. In
each round the learner receives the start/end destinations for a packet of data.
The set of actions is the set of all paths starting and ending at the appropriate
points on some known graph. The feedback in this case is the time it takes for
the packet to be received at its destination, and the reward is the negation of
this value. Again the action set is combinatorially large. Even relatively small
graphs have an enormous number of paths. The routing problem can obviously
be applied to more physical networks such as transportation systems used in
operations research.
Dynamic Pricing
In dynamic pricing, a company is trying to automatically optimise the price of
some product. Users arrive sequentially, and the learner sets the price. The user
will only purchase the product if the price is lower than their valuation. What
makes this problem interesting is (a) the learner never actually observes the
valuation of the product, only the binary signal that the price was too low/too
high, and (b) there is a monotonicity structure in the pricing. If a user purchased
an item priced at $10, then they would surely purchase it for $5, but whether or
not it would sell when priced at $11 is uncertain. Also, the set of possible actions
is close to continuous.
Waiting Problems
Every day you travel to work, either by bus or by walking. Once you get on the
bus, the trip only takes 5 minutes, but the timetable is unreliable, and the bus
arrival time is unknown and stochastic. Sometimes the bus doesn’t come at all.
Walking, on the other hand, takes 30 minutes along a beautiful river away from
the road. The problem is to devise a policy for choosing how long to wait at
the bus stop before giving up and walking to minimise the time to get to your
workplace. Walk too soon, and you miss the bus and gain little information. But
waiting too long also comes at a price.
While waiting for a bus is not a problem we all face, there are other applications
of this setting. For example, deciding the amount of inactivity required before
putting a hard drive into sleep mode or powering off a car engine at traffic lights.
The statistical part of the waiting problem concerns estimating the cumulative
distribution function of the bus arrival times from data. The twist is that the
data is censored on the days you chose to walk before the bus arrived, which
is a problem analysed in the subfield of statistics called survival analysis. The
interplay between the statistical estimation problem and the challenge of balancing
exploration and exploitation is what makes this and the other problems studied
in this book interesting.
Resource Allocation
A large part of operations research is focussed on designing strategies for allocating
scarce resources. When the dynamics of demand or supply are uncertain, the
problem has elements reminiscent of a bandit problem. Allocating too few
resources reveals only partial information about the true demand, but allocating
too many resources is wasteful. Of course, resource allocation is broad, and many
problems exhibit structure that is not typical of bandit problems, like the need
for long-term planning.
Tree Search
The UCT algorithm is a tree search algorithm commonly used in perfect-
information game-playing algorithms. The idea is to iteratively build a search
tree where in each iteration the algorithm takes three steps: (1) chooses a path
from the root to a leaf; (2) expands the leaf (if possible); (3) performs a Monte
Carlo roll-out to the end of the game. The contribution of a bandit algorithm is in
selecting the path from the root to the leaves. At each node in the tree, a bandit
algorithm is used to select the child based on the series of rewards observed
through that node so far. The resulting algorithm can be analysed theoretically,
but more importantly has demonstrated outstanding empirical performance in
game-playing problems.
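To give a flavour of the selection step only, here is a hedged sketch (ours, not from the book) of a UCB1-style rule for choosing a child of a tree node based on the rewards observed through that node so far; the statistics are made up and this is not the full UCT algorithm.

import math

def select_child(children):
    """children is a list of (total_reward, visit_count) pairs for a node's
    children; return the index maximising a UCB1-style score (illustrative)."""
    total_visits = sum(n for _, n in children)
    def score(stats):
        reward, n = stats
        if n == 0:
            return float('inf')            # visit unexplored children first
        return reward / n + math.sqrt(2 * math.log(total_visits) / n)
    return max(range(len(children)), key=lambda i: score(children[i]))

print(select_child([(3.0, 10), (2.0, 4), (0.0, 0)]))   # picks the unvisited child (index 2)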
1.3 Notes
1 The reader may find it odd that at one point we identified environments with
maps from histories to rewards, while we used the language that a learner
‘adopts a policy’ (a map from histories to actions). The reason is part historical
and part because policies and their design are at the center of the book, while
the environment strategies will mostly be kept fixed (and relatively simple).
On this note, strategy is also a word that is sometimes used interchangeably with
policy.
There is a long line of work on bandits, with the earliest being Robbins [1952]. Another early pioneer is Herman Chernoff,
who wrote papers with titles like ‘Sequential Decisions in the Control of a
Spaceship’ [Bather and Chernoff, 1967].
Besides these seminal papers, there are already a number of books on bandits
that may serve as useful additional reading. The most recent (and also most
related) is by Bubeck and Cesa-Bianchi [2012] and is freely available online. This is
an excellent book and is warmly recommended. The main difference between their
book and ours is that (a) we have the benefit of seven years of additional research
in a fast-moving field and (b) our longer page limit permits more depth. Another
relatively recent book is Prediction, Learning and Games by Cesa-Bianchi and
Lugosi [2006]. This is a wonderful book, and quite comprehensive. But its scope
is ‘all of’ online learning, which is so broad that bandits are not covered in great
depth. We should mention there is also a recent book on bandits by Slivkins
[2019]. Conveniently it covers some topics not covered in this book (notably
Lipschitz bandits and bandits with knapsacks). The reverse is also true, which
should not be surprising since our book is currently 400 pages longer. There are
also four books on sequential design and multi-armed bandits in the Bayesian
setting, which we will address only a little. These are based on relatively old
material, but are still useful references for this line of work and are well worth
reading [Chernoff, 1959, Berry and Fristedt, 1985, Presman and Sonin, 1990,
Gittins et al., 2011].
Without trying to be exhaustive, here are a few articles applying bandit
algorithms; a recent survey is by Bouneffouf and Rish [2019]. The papers
themselves will contain more useful pointers to the vast literature. We mentioned
AlphaGo already [Silver et al., 2016]. The tree search algorithm that drives its
search uses a bandit algorithm at each node [Kocsis and Szepesvári, 2006]. Le et al.
[2014] apply bandits to wireless monitoring, where the problem is challenging
due to the large action space. Lei et al. [2017] design specialised contextual
bandit algorithms for just-in-time adaptive interventions in mobile health: in
the typical application the user is prompted with the intention of inducing a
long-term beneficial behavioural change. See also the article by Greenewald et al.
[2017]. Rafferty et al. [2018] apply Thompson sampling to educational software
and note the trade-off between knowledge and reward. Sadly, as of 2015, bandit
algorithms had still not been used in clinical trials, as explicitly mentioned
by Villar et al. [2015]. Microsoft offers a ‘Decision Service’ that uses bandit
algorithms to automate decision-making [Agarwal et al., 2016].
2 Foundations of Probability ( )
The thrill of gambling comes from the fact that the bet is placed on future
outcomes that are uncertain at the time of the gamble. A central question in
gambling is the fair value of a game. This can be difficult to answer for all but
the simplest games. As an illustrative example, imagine the following moderately
complex game: I throw a dice. If the result is four, I throw two more dice; otherwise
I throw one dice only. Looking at each newly thrown dice (one or two), I repeat
the same, for a total of three rounds. Afterwards, I pay you the sum of the values
on the faces of the dice. How much are you willing to pay to play this game with
me?
Many examples of practical interest exhibit a complex random interdependency
between outcomes. The cornerstone of modern probability as proposed by
Kolmogorov aims to remove this complexity by separating the randomness from
the mechanism that produces the outcome.
Instead of rolling the dice one by one, imagine that sufficiently many dice were
rolled before the game has even started. For our game we need to roll seven
dice, because this is the maximum number that might be required (one in the
first round, two in the second round and four in the third round; see Fig. 2.1).
Figure 2.1 The initial phase of a gambling game with a random number of dice rolls.
Depending on the outcome of a dice roll, one or two dice are rolled for a total of three
rounds. The number of dice used will then be random in the range of three to seven.
After all the dice are rolled, the game can be emulated by ordering the dice and
revealing the outcomes sequentially. Then the value of the first dice in the chosen
ordering is the outcome of the dice in the first round. If we see a four, we look at
the next two dice in the ordering; otherwise we look at the single next dice.
By taking this approach, we get a simple calculus for the probabilities of all
kinds of events. Rather than directly calculating the likelihood of each pay-off,
we first consider the probability of any single outcome of the dice. Since there
are seven dice, the set of all possible outcomes is Ω = {1, . . . , 6}7 . Because
all outcomes are equally probable, the probability of any ω ∈ Ω is (1/6)7 . The
probability of the game pay-off taking value v can then be evaluated by calculating
the total probability assigned to all those outcomes ω ∈ Ω that would result
in the value of v. In principle, this is trivial to do thanks to the separation of
everything that is probabilistic from the rest. The set Ω is called the outcome
space, and its elements are the outcomes. Fig. 2.2 illustrates this idea. Random
outcomes are generated on the left, while on the right, various mechanisms are
used to arrive at values; some of these values may be observed and some not.
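The separation just described can be made concrete with a short sketch (ours, not from the book): all randomness is confined to drawing ω ∈ Ω = {1, . . . , 6}^7, and the pay-off is a deterministic function X : Ω → R that reveals the pre-rolled dice in order. Averaging X over many draws of ω then estimates the fair value E[X] of the game.

import random

def game_value(omega):
    """Deterministic mechanism X(omega): omega is a tuple of 7 pre-rolled dice.
    One die is revealed in round 1; each revealed die showing 4 triggers two
    dice in the next round, any other value triggers one; three rounds total."""
    used, active, total = 0, 1, 0
    for _ in range(3):                   # three rounds
        next_active = 0
        for _ in range(active):
            value = omega[used]          # reveal the next pre-rolled die
            used += 1
            total += value
            next_active += 2 if value == 4 else 1
        active = next_active
    return total

random.seed(0)
n = 100_000
estimate = sum(game_value(tuple(random.randint(1, 6) for _ in range(7)))
               for _ in range(n)) / n
print(round(estimate, 2))                # Monte Carlo estimate of E[X]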
There will be much benefit from being a little more formal about how we
come up with the value of our artificial game. For this, note that the process by
which the game gets its value is a function X that maps Ω to the reals (simply,
X : Ω → R). We find it ironic that functions of this type (from the outcome
space to subsets of the reals) are called random variables. They are neither
random nor variables in a programming language sense. The randomness is in
the argument that X is acting on, producing randomly changing results. Later
we will put a little more structure on random variables, but for now it suffices to
think of them as maps from the outcome space to the reals.
Figure 2.2 A key idea in probability theory is the separation of sources of randomness
from game mechanisms. A mechanism creates values from the elementary random
outcomes, some of which are visible for observers, while others may remain hidden.
with any predicate (an expression evaluating to true or false) where U, V, . . . are
functions with domain Ω.
What properties should P satisfy? Since Ω is the set of all possible outcomes,
it seems reasonable to expect that P is defined for Ω and P(Ω) = 1 and since ∅
contains no outcomes, P(∅) = 0 is also expected to hold. Furthermore, probabilities
should be non-negative so P(A) ≥ 0 for any A ⊂ Ω on which P is defined. Let
Ac = Ω \ A be the complement of A. Then we should expect that P is defined
for A exactly when it is defined for Ac and P(Ac ) = 1 − P(A) (negation rule).
Finally, if A, B are disjoint so that A ∩ B = ∅ and P(A), P(B) and P(A ∪ B) are
all defined, then P(A ∪ B) = P(A) + P(B). This is called the finite additivity
property.
Let F be the set of subsets of Ω on which P is defined. It would seem silly if
A ∈ F and Ac ∉ F, since P(Ac ) could simply be defined by P(Ac ) = 1 − P(A).
Similarly, if P is defined on disjoint sets A and B, then it makes sense if A∪B ∈ F.
We will also require the additivity property to hold (i) regardless of whether
the sets are disjoint and (ii) even for countably infinitely many sets. If {Ai }i
is a collection of sets and Ai ∈ F for all i ∈ N, then ∪i Ai ∈ F, and if these
sets are pairwise disjoint, P(∪i Ai ) = Σi P(Ai ). A set of subsets that satisfies all
these properties is called a σ-algebra, which is pronounced ‘sigma-algebra’ and
sometimes also called a σ-field (see Note 1).
Definition 2.1 (σ-algebra and probability measures). A set F ⊆ 2Ω is a σ-
algebra if Ω ∈ F and Ac ∈ F for all A ∈ F and ∪i Ai ∈ F for all {Ai }i with
Ai ∈ F for all i ∈ N. That is, it should include the whole outcome space and
be closed under complementation and countable unions. A function P : F → R
is a probability measure if P(Ω) = 1 and for all A ∈ F, P(A) ≥ 0 and
P(Ac ) = 1 − P(A) and P(∪i Ai ) = Σi P(Ai ) for all countable collections of disjoint
sets {Ai }i with Ai ∈ F for all i. If F is a σ-algebra and G ⊂ F is also a σ-algebra,
then we say G is a sub-σ-algebra of F. If P is a measure defined on F, then
the restriction of P to G is a measure P|G on G defined by P|G (A) = P(A) for
all A ∈ G.
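For a finite outcome space, the closure properties in Definition 2.1 can be checked mechanically. The sketch below (ours, and valid only for finite collections, where countable unions reduce to finite ones) verifies them for a small candidate σ-algebra over the outcomes of a single dice and equips it with a probability measure.

def is_sigma_algebra(omega, collection):
    """Check that a finite collection of subsets of omega contains omega and is
    closed under complementation and (finite) unions."""
    sets = {frozenset(A) for A in collection}
    omega = frozenset(omega)
    if omega not in sets:
        return False
    for A in sets:
        if omega - A not in sets:            # closed under complement
            return False
        for B in sets:
            if A | B not in sets:            # closed under union
                return False
    return True

Omega = {1, 2, 3, 4, 5, 6}                   # outcomes of one dice
H = [set(), {1, 2, 3}, {4, 5, 6}, Omega]     # candidate sigma-algebra
print(is_sigma_algebra(Omega, H))            # True

P = {frozenset(A): len(A) / 6 for A in H}    # P(A) = |A| / 6 on H
print(P[frozenset({1, 2, 3})])               # 0.5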
At this stage, the reader may rightly wonder about why we introduced the notion
of sub-σ-algebras. The answer should become clear quite soon. The elements
of F are called measurable sets. They are measurable in the sense that P
assigns values to them. The pair (Ω, F) alone is called a measurable space,
while the triplet (Ω, F, P) is called a probability space. If the condition that
P(Ω) = 1 is lifted, then P is called a measure. If the condition that P(A) ≥ 0
is also lifted, then P is called a signed measure. For measures and signed
measures, it would be unusual to use the symbol P, which is mostly reserved for
probabilities. Probability measures are also called probability distributions,
or just distributions.
Random variables lead to new probability measures. In particular, in the
example above PX (A) = P(X −1 (A)) is a probability measure defined for all the
subsets A of R for which P(X −1 (A)) is defined. More generally, for a random
Thus, random vectors are random elements where the range space is
(Rk , B(Rk )), and random vectors are random variables when k = 1. Random
elements generalise random variables and vectors to functions that do not
take values in Rk . The push-forward measure (or law) can be defined for
any random element. Furthermore, random variables and vectors work nicely
together. If X1 , . . . , Xk are k random variables on the same domain (Ω, F),
then X(ω) = (X1 (ω), . . . , Xk (ω)) is an Rk -valued random vector, and vice versa
(Exercise 2.2). Multiple random variables X1 , . . . , Xk from the same measurable
space can thus be viewed as a random vector X = (X1 , . . . , Xk ).
Given a map X : Ω → X between measurable spaces (Ω, F) and (X , G), we let
σ(X) = {X −1 (A) : A ∈ G} be the σ-algebra generated by X. The map X is
F/G-measurable if and only if σ(X) ⊆ F. By checking the definitions one can
show that σ(X) is a sub-σ-algebra of F and in fact is the smallest sub-σ-algebra
for which X is measurable. If G = σ(A) itself is generated by a set system
A ⊂ 2X , then to check the F/G-measurability of X, it suffices to check whether
X −1 (A) = {X −1 (A) : A ∈ A} is a subset of F. The reason this is sufficient is
because σ(X −1 (A)) = X −1 (σ(A)), and by definition the latter is σ(X). In fact,
to check whether a map is measurable, either one uses the composition rule or
checks X −1 (A) ⊂ F for a ‘generator’ A of G.
Random elements can be combined to produce new random elements by
composition. One can show that if f is F/G-measurable and g is G/H-measurable
for σ-algebras F, G and H over appropriate spaces, then their composition g ◦ f
is F/H-measurable (Exercise 2.1). This is used most often for Borel functions,
which is a special name for B(Rm )/B(Rn )-measurable functions from Rm to
Rn . These functions are also called Borel measurable. The reader will find it
pleasing that all familiar functions are Borel. First and foremost, all continuous
functions are Borel, which includes elementary operations such as addition and
multiplication. Continuity is far from essential, however. In fact one is hard-
pressed to construct a function that is not Borel. This means the usual operations
are ‘safe’ when working with random variables.
Indicator Functions
Given an arbitrary set Ω and A ⊆ Ω, the indicator function of A is
IA : Ω → {0, 1} given by
IA (ω) = 1 if ω ∈ A , and IA (ω) = 0 otherwise .
Why So Complicated?
You may be wondering why we did not define P on the power set of Ω, which
is equivalent to declaring that all sets are measurable. In many cases this is a
perfectly reasonable thing to do, including the example game where nothing
prevents us from defining F = 2Ω . However, beyond this example, there are two
justifications not to have F = 2Ω , the first technical and the second conceptual.
The technical reason is highlighted by the following surprising theorem
according to which there does not exist a uniform probability distribution on
Ω = [0, 1] if F is chosen to be the power set of Ω (a uniform probability distribution
over [0, 1], if it existed, would have the property of assigning to every interval its
length). In other words, if you want to be able to define the uniform measure,
then F cannot be too large. By contrast, the uniform measure can be defined
over the Borel σ-algebra, though proving this is not elementary.
Theorem 2.3. Let Ω = [0, 1], and F be the power set of Ω. Then there does not
exist a measure P on (Ω, F) such that P([a, b]) = b − a for all 0 ≤ a ≤ b ≤ 1.
The main conceptual reason for not taking F = 2Ω is that σ-algebras can then
be used to represent information. This is especially useful in the study
of bandits where the learner is interacting with an environment and is slowly
gaining knowledge. One useful way to represent this is by using a sequence of
nested σ-algebras, as we explain in the next section. One might also be worried
that the Borel σ-algebra does not contain enough measurable sets. Rest assured
that this is not a problem and you will not easily find a non-measurable set. For
completeness, an example of a non-measurable set will still be given in the notes,
along with a little more discussion on this topic.
A second technical reason to prefer the measure-theoretic approach to
probabilities is that this approach allows for the unification of distributions
on discrete spaces and densities on continuous ones (the uninitiated reader will
find the definitions of these later). This unification can be necessary when dealing
with random variables that combine elements of both, e.g. a random variable
that is zero with probability 1/2 and otherwise behaves like a standard Gaussian.
Random variables like this give rise to so-called “mixed continuous and discrete
distributions”, which seem to require special treatment in a naive approach
to probabilities, yet they are nothing out of the ordinary under the
measure-theoretic approach.
which represents the joint distribution for the values of a dice (X ∈ [6]) and coin
(Y ∈ [2]). The formula describes some constraints on the probabilistic interactions
between the outputs of X and Y , but says nothing about their domain. In a way,
the domain is an unimportant detail. Nevertheless, one must ask whether or not
an appropriate domain exists at all. More generally, one may ask whether an
appropriate probability space exists given some constraints on the joint law of a
collection X1 , . . . , Xk of random variables. For this to make sense, the constraints
should not contradict each other, which means there is a probability measure
µ on B(Rk ) such that µ satisfies the postulated constraints. But then we can
choose Ω = Rk , F = B(Rk ), P = µ and Xi : Ω → R to be the ith coordinate
map: Xi (ω) = ωi . The push-forward of P under X = (X1 , . . . , Xk ) is µ, which by
definition is compatible with the constraints.
A more specific question is whether for a particular set of constraints on the
joint law there exists a measure µ compatible with the constraints. Very often the
constraints are specified for elements of the cartesian product of finitely many
σ-algebras, like in Eq. (2.1). If (Ω1 , F1 ), . . . , (Ωn , Fn ) are measurable spaces, then
the cartesian product of F1 , . . . Fn is
F1 × · · · × Fn = {A1 × · · · × An : A1 ∈ F1 , . . . , An ∈ Fn } ⊆ 2Ω1 ×···×Ωn .
Elements of this set are known as measurable rectangles in Ω1 × · · · × Ωn .
Theorem 2.4 (Carathéodory’s extension theorem). Let (Ω1 , F1 ), . . . , (Ωn , Fn )
be measurable spaces and µ̄ : F1 × · · · × Fn → [0, 1] be a function such that
(a) µ̄(Ω1 × · · · × Ωn ) = 1; and
(b) µ̄(∪_{k=1}^∞ Ak ) = Σ_{k=1}^∞ µ̄(Ak ) for all sequences of disjoint sets with Ak ∈
F1 × · · · × Fn whose union is also in F1 × · · · × Fn .
Let Ω = Ω1 × · · · × Ωn and F = σ(F1 × · · · × Fn ). Then there exists a unique
probability measure µ on (Ω, F) such that µ agrees with µ̄ on F1 × · · · × Fn .
The theorem is applied by letting Ωk = R and Fk = B(R). Then the values of
a measure on all cartesian products uniquely determines its value everywhere.
It is not true that F1 ×F2 = σ(F1 ×F2 ). Take, for example, F1 = F2 = 2{1,2} .
Then, |F1 × F2 | = 1 + 3 × 3 = 10 (because ∅ × X = ∅), while, since
F1 × F2 includes the singletons of 2{1,2}×{1,2} , σ(F1 × F2 ) = 2{1,2}×{1,2} .
Hence, six sets are missing from F1 × F2 . For example, {(1, 1), (2, 2)} ∈
σ(F1 × F2 ) \ F1 × F2 .
than two terms in the product. The n-fold product σ-algebra of F is denoted by
F ⊗n .
Figure 2.3 The factorisation problem asks whether there exists a (measurable) function
f that makes the diagram commute.
Except for a technical assumption on (Y, H), the following result shows that Y is a
measurable function of X if and only if Y is σ(X)/H-measurable. The technical
assumption mentioned requires (Y, H) to be a Borel space, which is true of all
probability spaces considered in this book, including (Rk , B(Rk )). We leave the
exact definition of Borel spaces to the next chapter.
Lemma 2.5 (Factorisation lemma). Assume that (Y, H) is a Borel space. Then Y
is σ(X)-measurable (σ(Y ) ⊆ σ(X)) if and only if there exists a G/H-measurable
map f : X → Y such that Y = f ◦ X.
In this sense σ(X) contains all the information that can be extracted from X
via measurable functions. This is not the same as saying that Y can be deduced
from X if and only if Y is σ(X)-measurable because the set of X → Y maps
can be much larger than the set of G/H-measurable functions. When G is coarse,
there are not many G/H-measurable functions with the extreme case occurring
when G = {X , ∅}. In cases like this, the intuition that σ(X) captures all there
is to know about X is not true anymore (Exercise 2.6). The issue is that σ(X)
does not only depend on X, but also on the σ-algebra of (X , G) and that if G is
coarse-grained, then σ(X) can also be coarse-grained and not many functions
will be σ(X)-measurable. If X is a random variable, then by definition X = R
Filtrations
In the study of bandits and other online settings, information is revealed to the
learner sequentially. Let X1 , . . . , Xn be a collection of random variables on a
common measurable space (Ω, F). We imagine a learner is sequentially observing
the values of these random variables. First X1 , then X2 and so on. The learner
needs to make a prediction, or act, based on the available observations. Say, a
prediction or an act must produce a real-valued response. Then, having observed
X1:t = (X1 , . . . , Xt ), the set of maps f ◦ X1:t where f : Rt → R is Borel, captures
all the possible ways the learner can respond. By Lemma 2.5, this set contains
exactly the σ(X1:t )/B(R)-measurable maps. Thus, if we need to reason about
the set of Ω → R maps available after observing X1:t , it suffices to concentrate
on the σ-algebra Ft = σ(X1:t ). Conveniently, Ft is independent of the space of
possible responses, and being a subset of F, it also hides details about the range
space of X1:t . It is easy to check that F0 ⊆ F1 ⊆ F2 ⊆ · · · ⊆ Fn ⊆ F, which
means that more and more functions are becoming Ft -measurable as t increases,
which corresponds to increasing knowledge (note that F0 = {∅, Ω}, and the set
of F0 -measurable functions is the set of constant functions on Ω).
Bringing these a little further, we will often find it useful to talk about increasing
sequences of σ-algebras without constructing them in terms of random variables
as above. Given a measurable space (Ω, F), a filtration is a sequence (Ft )nt=0 of
sub-σ-algebras of F where Ft ⊆ Ft+1 for all t < n. We also allow n = ∞, and in
this case we define F∞ = σ(∪_{t=0}^∞ Ft )
to be the smallest σ-algebra containing the union of all Ft . Filtrations can also
be defined in continuous time, but we have no need for that here. A sequence
of random variables (Xt )nt=1 is adapted to filtration F = (Ft )nt=0 if Xt is Ft -
measurable for each t. We also say in this case that (Xt )t is F-adapted. The
same nomenclature applies if n is infinite. Finally, (Xt )t is F-predictable if Xt
is Ft−1 -measurable for each t ∈ [n]. Intuitively we may think of an F-predictable
process X = (Xt )t as one that has the property that Xt can be known (or
‘predicted’) based on Ft−1 , while an F-adapted process is one that has the property
that Xt can be known based on Ft only. Since Ft−1 ⊆ Ft , a predictable process
is also adapted. A filtered probability space is the tuple (Ω, F, F, P), where
(Ω, F, P) is a probability space and F = (Ft )t is a filtration of F.
the case quite often, explaining why this simple formula has quite a status in
probability and statistics. Exercise 2.8 asks the reader to verify this law.
2.4 Independence
How is this related to knowledge? Assuming that P (B) > 0, dividing both sides
by P (B) and using the definition of conditional probability, we get that the above
is equivalent to
P (A | B) = P (A) . (2.4)
that ∫Ω X1 dP and ∫Ω X2 dP are defined, ∫Ω (α1 X1 + α2 X2 ) dP is defined and
satisfies
∫Ω (α1 X1 + α2 X2 ) dP = α1 ∫Ω X1 dP + α2 ∫Ω X2 dP . (2.5)
These two properties together tell us that whenever X(ω) = Σ_{i=1}^n αi I {ω ∈ Ai }
for some n, αi ∈ R and Ai ∈ F, i = 1, . . . , n, then
∫Ω X dP = Σi αi P (Ai ) . (2.6)
The meaning of U ≤ V for random variables U, V is that U (ω) ≤ V (ω) for all
ω ∈ Ω. The supremum on the right-hand side could be infinite, in which case we
say the integral of X is not defined. Whenever the integral of X is defined, we
say that X is integrable or, if the identity of the measure P is unclear, that X
is integrable with respect to P.
Integrals for arbitrary random variables are defined by decomposing the
random variable into positive and negative parts. Let X : Ω → R be any
measurable function. Then define X + (ω) = X(ω)I {X(ω) > 0} and X − (ω) =
−X(ω)I {X(ω) < 0} so that X(ω) = X + (ω) − X − (ω). Now X + and X − are
both non-negative random variables called the positive and negative parts of
X. Provided that both X + and X − are integrable, we define
∫Ω X dP = ∫Ω X + dP − ∫Ω X − dP .
unique measure on B(R) such that λ((a, b)) = b − a for any a ≤ b. In this scenario,
if f : R → R is a Borel-measurable function, then we can write the Lebesgue
integral of f with respect to the Lebesgue measure as
∫R f dλ .
There exist functions that are Riemann integrable and not Lebesgue
integrable, and also the other way around (although examples of the former
are more exotic than the latter).
In general E [XY ] ≠ E [X] E [Y ] (Exercise 2.18). Finally, an important simple
result connects expectations of non-negative random variables to their tail
probabilities.
For a non-negative random variable X, E[X] = ∫_0^∞ P (X > x) dx, provided that
either the right-hand side or the left-hand side exists.
Example 2.9. Let (Ω, F, P) model the outcomes of an unloaded dice: Ω = [6],
F = 2Ω and P(A) = |A|/6. Define two random variables X and Y by
Y (ω) = I {ω > 3} and X(ω) = ω. Suppose we are interested in the expectation
of X given a specific value of Y . Arguing intuitively, we might notice that Y = 1
means that the unobserved X must be either 4, 5 or 6, and that each of these
outcomes is equally likely, and so the expectation of X given Y = 1 should
be (4 + 5 + 6)/3 = 5. Similarly, the expectation of X given Y = 0 should be
(1 + 2 + 3)/3 = 2. If we want a concise summary, we can just write that ‘the
expectation of X given Y ’ is 5Y + 2(1 − Y ). Notice how this is a random variable
itself.
should be equal to Y , but how to define it? The mean of a Bernoulli random
variable is equal to its bias so the definition of conditional probability shows that
for 0 ≤ p < q ≤ 1,
= E[IY −1 (A) X] .
This can be viewed as putting a set of linear constraints on E[X | Y ] with one
constraint for each measurable A ⊆ Y. By treating E[X | Y ] as an unknown
σ(Y )-measurable random variable, we can attempt to solve this linear system. As
it turns out, this can always be done: the linear constraints and the measurability
restriction on E [X | Y ] completely determine E[X | Y ] except for a set of measure
zero. Notice that both conditions only depend on σ(Y ) ⊆ F. The abstract
definition of conditional expectation takes these properties as the definition and
replaces the role of Y with a sub-σ-algebra.
Definition 2.10 (Conditional expectation). Let (Ω, F, P) be a probability space
and X : Ω → R be random variable and H be a sub-σ-algebra of F. The
conditional expectation of X given H is denoted by E[X | H] and defined to be
any H-measurable random variable on Ω such that for all H ∈ H,
∫H E[X | H] dP = ∫H X dP . (2.9)
The reader may find it odd that E[X | Y ] is a random variable on Ω rather
than the range of Y . Lemma 2.5 and the fact that E[X | σ(Y )] is σ(Y )-
measurable shows there exists a measurable function f : (R, B(R)) →
(R, B(R)) such that E[X | σ(Y )](ω) = (f ◦ Y )(ω) (see Fig. 2.4). In this sense
E[X | Y ](ω) only depends on Y (ω), and occasionally we write E[X | Y ](y).
Continuing Example 2.9, we have σ(Y ) = {{1, 2, 3}, {4, 5, 6}, ∅, Ω}. Denote this
set-system by H for brevity. The condition
that E[X | H] is H-measurable can only be satisfied if E[X | H](ω) is constant on
{1, 2, 3} and {4, 5, 6}. Then (2.9) immediately implies that
E [X | H] (ω) = 2 if ω ∈ {1, 2, 3} , and E [X | H] (ω) = 5 if ω ∈ {4, 5, 6} .
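The same answer can be obtained by brute force. The sketch below (ours, not from the book) builds E[X | H] for the dice example by averaging X over the atoms {1, 2, 3} and {4, 5, 6} of H and then checks the defining property (2.9) on each atom.

from fractions import Fraction

Omega = range(1, 7)                          # outcomes of an unloaded dice
P = {w: Fraction(1, 6) for w in Omega}       # P({w}) = 1/6

def X(w):
    return w                                 # X(omega) = omega

atoms = [{1, 2, 3}, {4, 5, 6}]               # atoms of H = sigma(Y)

cond_exp = {}
for atom in atoms:
    mass = sum(P[w] for w in atom)
    value = sum(X(w) * P[w] for w in atom) / mass   # average of X on the atom
    for w in atom:
        cond_exp[w] = value

print(cond_exp)    # E[X | H] is 2 on {1, 2, 3} and 5 on {4, 5, 6}

# Check (2.9): the integral of E[X | H] over each atom equals the integral of X.
for atom in atoms:
    assert sum(cond_exp[w] * P[w] for w in atom) == sum(X(w) * P[w] for w in atom)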
While the definition of conditional expectations given above is non-constructive
and E[X | H] is uniquely defined only up to events of P-measure zero, none of
this should be of a significant concern. First, we will rarely need closed-form
expressions for conditional expectations, but we rather need how they relate to
other expectations, conditional or not. This is also the reason why it should not
be concerning that they are only determined up to zero probability events: usually,
conditional expectations appear in other expectations or in statements that are
concerned with how probable some event is, making the difference between the
different ‘versions’ of conditional expectations disappear.
We close the section by summarising some additional important properties of
conditional expectations. These follow from the definition directly, and the reader
is invited to prove them in Exercise 2.20.
Theorem 2.12. Let (Ω, F, P) be a probability space, G, G1 , G2 ⊂ F be sub-σ-
algebras of F and X, Y integrable random variables on (Ω, F, P). The following
hold true:
The above list of abstract properties will be used over and over again. We
encourage the reader to study the list carefully and convince yourself that
all items are intuitive. Playing around with discrete random variables can
be invaluable for this. Eventually it will all become second nature.
2.7 Notes
mostly focus on finite time results or take expectations before taking limits
when dealing with asymptotics).
9 You might be surprised that we have not mentioned densities. For most of
us, our first exposure to probability on continuous spaces was by studying the
normal distribution and its density
p(x) = exp(−x²/2) / √(2π) , (2.10)
which can be integrated over intervals to obtain the probability that a Gaussian
random variable will take a value in that interval. The reader should notice
that p : R → R is Borel measurable and that the Gaussian measure associated
with this density is P on (R, B(R)) defined by
P(A) = ∫A p dλ .
Here the integral is with respect to the Lebesgue measure λ on (R, B(R)). The
notion of a density can be generalised beyond this simple setup. Let P and Q
be measures (not necessarily probability measures) on arbitrary measurable
space (Ω, F). The Radon–Nikodym derivative of P with respect to Q is
an F-measurable random variable dP/dQ : Ω → [0, ∞) such that
P (A) = ∫A (dP/dQ) dQ for all A ∈ F . (2.11)
We can also write this in the form ∫ IA dP = ∫ IA (dP/dQ) dQ for A ∈ F, from which we
may realise that for any P -integrable random variable X, ∫ X dP = ∫ X (dP/dQ) dQ
must also hold. This is often called the change-of-measure formula. Another
word for the Radon–Nikodym derivative dP/dQ is the density of P with respect to
Q. It is not hard to find examples where the density does not exist. We say that
P is absolutely continuous with respect to Q if Q(A) = 0 =⇒ P (A) = 0
for all A ∈ F. When dP/dQ exists, it follows immediately that P is absolutely
continuous with respect to Q by Eq. (2.11). Except for some pathological cases,
it turns out that this is both necessary and sufficient for the existence of dP/dQ.
The measure Q is σ-finite if there exists a countable covering {Ai } of Ω with
F-measurable sets such that Q(Ai ) < ∞ for each i.
Theorem 2.13. Let P, Q be measures on a common measurable space (Ω, F)
and assume that Q is σ-finite. Then the density of P with respect to Q, dP/dQ,
exists if and only if P is absolutely continuous with respect to Q. Furthermore,
dP/dQ is uniquely defined up to a Q-null set, so that any f1 , f2 satisfying (2.11)
agree Q-almost everywhere.
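As a small discrete illustration (our own example, not from the text): let Ω = {1, 2, 3}, F = 2Ω, Q({ω}) = 1/3 for every ω, and P({1}) = P({2}) = 1/2, P({3}) = 0. Then P is absolutely continuous with respect to Q, and dP/dQ takes the values 3/2, 3/2 and 0 on the three outcomes, since P(A) = Σ_{ω∈A} (dP/dQ)(ω) Q({ω}) for every A ∈ F, which is exactly Eq. (2.11) in this discrete case.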
Much of this chapter draws inspiration from David Pollard’s A user’s guide to
measure theoretic probability [Pollard, 2002]. We like this book because the author
takes a rigourous approach, but still explains the ‘why’ and ‘how’ with great
care. The book gets quite advanced quite fast, concentrating on the big picture
rather than getting lost in the details. Other useful references include the book by
Billingsley [2008], which has many good exercises and is quite comprehensive in
terms of its coverage of the ‘basics’. These books are both quite detailed. For an
outstanding shorter introduction to measure-theoretic probability, see the book
by Williams [1991], which has an enthusiastic style and a pleasant bias towards
martingales. We also like the book by Kallenberg [2002], which is recommended
for the mathematically inclined readers who already have a good understanding of
the basics. The author has put a major effort into organising the material so that
redundancy is minimised and generality is maximised. This reorganisation resulted
in quite a few original proofs, and the book is comprehensive. The factorisation
lemma (Lemma 2.5) is stated in the book by Kallenberg [2002] (Lemma 1.13
there). Kallenberg calls this lemma the ‘functional representation’ lemma and
attributes it to Joseph Doob. Theorem 2.4 is a corollary of Carathéodory’s
extension theorem, which says that probability measures defined on semi-rings of
sets have a unique extension to the generated σ-algebra. The remaining results can
be found in either of the three books mentioned above. Theorem 2.14 appears as
theorem 8.9.3 in the two-volume book by Bogachev [2007]. Finally, for something
older and less technical, we recommend the philosophical essays on probability
by Pierre Laplace, which was recently reprinted [Laplace, 2012].
2.9 Exercises
2.5 Let G ⊆ 2Ω be a non-empty collection of sets and define σ(G) as the smallest
σ-algebra that contains G. By ‘smallest’ we mean that F ∈ 2Ω is smaller than
F 0 ∈ 2Ω if F ⊂ F 0 .
(a) Show that σ(G) exists and contains exactly those sets A that are in every
σ-algebra that contains G.
(b) Suppose (Ω′, F) is a measurable space and let X : Ω′ → Ω be F/G-measurable.
Show that X is also F/σ(G)-measurable. (We often use this result to simplify
the job of checking whether a random variable satisfies some measurability
property).
(c) Prove that if A ∈ F where F is a σ-algebra, then I {A} is F-measurable.
2.9 Consider the standard probability space (Ω, F, P) generated by two standard,
unbiased, six-sided dice that are thrown independently of each other. Thus,
Ω = {1, . . . , 6}², F = 2^Ω and P(A) = |A|/6² for any A ∈ F, so that Xi(ω) = ωi
represents the outcome of throwing dice i ∈ {1, 2}.
(a) Show that the events ‘X1 < 2’ and ‘X2 is even’ are independent of each
other.
(b) More generally, show that any two events A ∈ σ(X1) and B ∈ σ(X2) are
independent of each other.
(a) Let (Ω, F, P) be a probability space. Show that ∅ and Ω (which are events)
are independent of any other event. What is the intuitive meaning of this?
(b) Continuing the previous part, show that any event A ∈ F with P (A) ∈ {0, 1}
is independent of any other event.
(c) What can we conclude about an event A ∈ F that is independent of its
complement, Ac = Ω \ A? Does your conclusion make intuitive sense?
(d) What can we conclude about an event A ∈ F that is independent of itself?
Does your conclusion make intuitive sense?
(e) Consider the probability space generated by two independent flips of unbiased
coins with the smallest possible σ-algebra. Enumerate all pairs of events
A, B such that A and B are independent of each other.
(f) Consider the probability space generated by the independent rolls of two
unbiased three-sided dice. Call the possible outcomes of the individual dice
rolls 1, 2 and 3. Let Xi be the random variable that corresponds to the
outcome of the ith dice roll (i ∈ {1, 2}). Show that the events {X1 ≤ 2} and
{X1 = X2 } are independent of each other.
(g) The probability space of the previous example is one where the
probability measure is uniform on a finite outcome space (which happens to
have a product structure). Now consider any n-element, finite outcome space
with the uniform measure. Show that A and B are independent of each other
if and only if the cardinalities |A|, |B|, |A ∩ B| satisfy n|A ∩ B| = |A| · |B|.
(h) Continuing with the previous problem, show that if n is prime, then no
non-trivial events are independent (an event A is trivial if P (A) ∈ {0, 1}).
(i) Construct an example showing that pairwise independence does not imply
mutual independence.
(a) Let X be a constant random element (that is, X(ω) = x for all ω in the
outcome space Ω on which X is defined). Show that X is independent of
any other random variable.
(b) Show that the above continues to hold if X is almost surely constant (that
is, P (X = x) = 1 for an appropriate value x).
(c) Show that two events are independent if and only if their indicator random
variables are independent (that is, A, B are independent if and only if
X(ω) = I {ω ∈ A} and Y (ω) = I {ω ∈ B} are independent of each other).
(d) Generalise the result of the previous item to pairwise and mutual
independence for collections of events and their indicator random variables.
2.12 Our goal in this exercise is to show that X is integrable if and only if |X|
is integrable. This is broken down into multiple steps. The first issue is to deal
with the measurability of |X|. While a direct calculation can also show this, it
may be worthwhile to follow a more general path:
Hint For (b) recall Exercise 2.1. For (c) examine the relationship between
|X| and (X)+ and (X)− .
2.13 (Infinite-valued integrals) Can we consistently extend the definition of
integrals so that for non-negative random variables, the integral is always defined
(it may be infinite)? Defend your view by either constructing an example (if you
are arguing against) or by proving that your definition is consistent with the
requirements we have for integrals.
3 Stochastic Processes and Markov Chains
The measure-theoretic probability in the previous chapter covers almost all the
definitions required. Occasionally, however, infinite sequences of random variables
arise, and for these a little more machinery is needed. We expect most readers
will skip this chapter on the first reading, perhaps referring to it when necessary.
Before one can argue about the properties of infinite sequences of random
variables, it must be demonstrated that such sequences exist under certain
constraints on their joint distributions. For example, does there exist an infinite
sequence of random variables such that any finite subset of the random variables
are independent and distributed like a standard Gaussian? The first theorem
provides conditions under which questions like this can be answered positively.
This allows us to write, for example, ‘let (X_n)_{n=1}^∞ be an infinite sequence of
independent standard Gaussian random variables’ and be comfortable knowing
there exists a probability space on which these random variables can be defined.
To state the theorem, we need the concept of Borel spaces.
Two measurable spaces (X , F) and (Y, G) are said to be isomorphic if there
exists a bijective function f : X → Y such that f is F/G-measurable and f −1 is
G/F-measurable. A Borel space is a measurable space (X , F) that is isomorphic
to (A, B(A)) with A ∈ B(R) a Borel measurable subset of the reals. This is not
a very strong assumption. For example, (Rn , B(Rn )) is a Borel space, along with
all of its measurable subsets.
We give a sketch of the proof because, although it is not really relevant for
the material in this book, it illustrates the general picture and dispels some of
the mystique about what is really going on. Exercise 3.1 asks you to provide the
missing steps from the proof.
Proof sketch of Theorem 3.1 For simplicity we consider only the case that
S = ([0, 1], B([0, 1])) and µ is the Lebesgue measure. For any x ∈ [0, 1], let
F1(x), F2(x), . . . be the binary expansion of x, which is the unique binary-valued
sequence satisfying x = Σ_{i=1}^∞ 2^{−i} Fi(x) (with a fixed convention to break
ties for dyadic rationals). Arrange these functions into an infinite grid by filling
successive diagonals:

F1, F2, F4, F7, · · ·
F3, F5, F8, · · ·
F6, F9, · · ·
F10, · · ·
⋮
Letting X_{m,t} be the tth entry in the mth row of this grid, we define
X_m = Σ_{t=1}^∞ 2^{−t} X_{m,t}, and again one can easily check that with this choice the sequence
X1, X2, . . . is independent and the law of each X_m is µ, which is uniform.
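The construction can also be mimicked numerically. The sketch below (our own illustration; all names are hypothetical) extracts the binary digits of a single uniform sample and redistributes them along the diagonals of the grid above to produce several approximately uniform, approximately independent numbers; floating-point precision limits how many digits and rows are meaningful.

```python
import random

def binary_digits(x, num_digits):
    """Return the first binary digits F_1(x), F_2(x), ... of x in [0, 1)."""
    out = []
    for _ in range(num_digits):
        x *= 2
        bit = int(x)
        out.append(bit)
        x -= bit
    return out

def split_uniform(x, rows=3, digits_per_row=8):
    """Redistribute the digits of x along the diagonals of the grid
    (F_1; F_2, F_3; F_4, F_5, F_6; ...) to build `rows` numbers X_1, X_2, ..."""
    bits = binary_digits(x, 60)        # a double only has about 52 meaningful bits
    grid = [[] for _ in range(rows)]
    idx, diag = 0, 1
    while any(len(row) < digits_per_row for row in grid):
        for j in range(diag):          # j-th entry of the current diagonal goes to row j
            if j < rows and len(grid[j]) < digits_per_row:
                grid[j].append(bits[idx])
            idx += 1
        diag += 1
    return [sum(b * 2.0 ** -(t + 1) for t, b in enumerate(row)) for row in grid]

if __name__ == "__main__":
    print(split_uniform(random.Random(0).random()))
```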
A Markov chain is an infinite sequence of random elements (X_t)_{t=1}^∞ where the
conditional distribution of Xt+1 given X1 , . . . , Xt is the same as the conditional
distribution of Xt+1 given Xt . The sequence has the property that given the last
element, the history is irrelevant to ‘predict’ the future. Such random sequences
appear throughout probability theory and have many applications besides. The
theory is too rich to explain in detail, so we give the basics and point towards the
literature for more details at the end. The focus here is mostly on the definition
and existence of Markov chains.
Let (X , F) and (Y, G) be measurable spaces. A probability kernel or Markov
kernel from (X , F) to (Y, G) is a function K : X × G → [0, 1] such that
(a) K(x, ·) is a measure for all x ∈ X ; and
(b) K(·, A) is F-measurable for all A ∈ G.
The idea here is that K describes a stochastic transition. Having arrived at x, a
process’s next state is sampled as Y ∼ K(x, ·). Occasionally, we will use the notation
Kx (A) or K(A | x) rather than K(x, A).
If K1 is a (X , F) → (Y, G) probability kernel and K2 is a (Y, G) → (Z, H)
probability kernel, then the product kernel K1 ⊗ K2 is the probability kernel
from (X , F) → (Y × Z, G ⊗ H) defined by
(K1 ⊗ K2)(x, A) = ∫_Y ∫_Z I_A((y, z)) K2(y, dz) K1(x, dy) .
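Computationally (a sketch under our own conventions, not the book's notation), a probability kernel can be represented as a function mapping a state to a sampler, the product kernel corresponds to sampling the two coordinates in sequence, and a homogeneous Markov chain is obtained by iterating a single kernel:

```python
import random

def gaussian_kernel(x, rng):
    """A probability kernel from R to R: K(x, .) is the Gaussian N(x/2, 1)."""
    return rng.gauss(x / 2.0, 1.0)

def product_sample(x, k1, k2, rng):
    """Sample from the product kernel (K1 (x) K2)(x, .): first y ~ K1(x, .),
    then z ~ K2(y, .), and return the pair (y, z)."""
    y = k1(x, rng)
    z = k2(y, rng)
    return y, z

def markov_chain(x0, kernel, n, rng):
    """Sample a homogeneous Markov chain X_1, ..., X_n started at x0, where the
    conditional distribution of X_{t+1} given X_t = x is kernel(x, .)."""
    xs, x = [], x0
    for _ in range(n):
        x = kernel(x, rng)
        xs.append(x)
    return xs

if __name__ == "__main__":
    rng = random.Random(0)
    print(product_sample(1.0, gaussian_kernel, gaussian_kernel, rng))
    print(markov_chain(0.0, gaussian_kernel, 5, rng))
```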
The word ‘homogeneous’ refers to the fact that the probability kernel does
not change with time. Accordingly, sometimes one writes ‘time homogeneous’
instead of homogeneous. The reader can no doubt see how to define a Markov
chain where µ depends on t, though doing so is purely cosmetic since the
state space can always be augmented to include a time component.
Note that if µ(· | x) = µ0(·) for all x ∈ X, then Theorem 3.3 is yet another
way to prove the existence of an infinite sequence of independent and identically
distributed random variables. The basic questions in Markov chains revolve around
understanding the evolution of Xt in terms of the probability kernel. For example,
assuming that Ωt = Ω1 for all t ∈ N+ , does the law of Xt converge to some fixed
distribution as t → ∞, and if so, how fast is this convergence? For now we make
do with the definitions, but in the special case that X is finite, we will discuss
some of these topics much later in Chapters 37 and 38.
(a) E[Xt | Ft−1 ] = Xt−1 almost surely for all t ∈ {2, 3, . . .}; and
(b) Xt is integrable.
The time index t need not run over N+ . Very often t starts at zero instead.
Example 3.5. A gambler repeatedly throws a coin, winning a dollar for each
heads and losing a dollar for each tails. Their total winnings over time is a
martingale. To model this situation, let Y1 , Y2 , . . . be a sequence of independent
Rademacher random variables, which means that P(Yt = 1) = P(Yt = −1) = 1/2.
The winnings after t rounds is St = Σ_{s=1}^t Ys, which is a martingale adapted to
the filtration (F_t)_{t=1}^∞ given by Ft = σ(Y1, . . . , Yt). The definition of super/sub-
martingales (the direction of the inequality) can be remembered by noting
that the definition favours the casino, not the gambler.
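To make Example 3.5 concrete, here is a minimal simulation sketch (function and parameter names are ours) that estimates E[St] for a few values of t; the estimates stay near zero, as the martingale property implies.

```python
import random

def simulate_winnings(t, num_runs=100_000, seed=0):
    """Estimate E[S_t] for the gambler's random walk S_t = Y_1 + ... + Y_t,
    where each Y_s is an independent Rademacher variable (+1 or -1 w.p. 1/2)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_runs):
        s = sum(1 if rng.random() < 0.5 else -1 for _ in range(t))
        total += s
    return total / num_runs

if __name__ == "__main__":
    for t in (1, 10, 100):
        print(t, simulate_winnings(t))  # all estimates should be close to 0
```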
Suppose the gambler at the end of round t can decide to stop (δt = 1) or continue (δt = 0)
based on the information available to them. Denoting by τ = min{t : δt = 1}
the time when the gambler stops, the question is whether by a clever choice of
(δt )t∈N , E [Sτ ] can be made positive. Here, (δt )t∈N , a sequence of binary, F-adapted
random variables, is called a stopping rule, while τ is a stopping time with
respect to F.
Note that the stopping rule is not allowed to inject additional randomness
beyond what is already there in F.
Definition 3.6. Let F = (Ft )t∈N be a filtration. A random variable τ with values
in N ∪ {∞} is a stopping time with respect to F if I {τ ≤ t} is Ft -measurable
for all t ∈ N. The σ-algebra at stopping time τ is
Fτ = {A ∈ F∞ : A ∩ {τ ≤ t} ∈ Ft for all t} .
where the second inequality uses the definition of the stopping time and the non-
negativity of the supermartingale. Rearranging shows that P (An ) ≤ E[X0 ]/ε for
all n ∈ N. Since A1 ⊆ A2 ⊆ . . ., it follows that P (supt∈N Xt ≥ ε) = P (∪n∈N An ) ≤
E[X0 ]/ε.
Markov’s inequality (which we will cover in the next chapter) combined with
the definition of a supermartingale shows that
P(Xn ≥ ε) ≤ E[X0]/ε . (3.2)
In fact, in the above we have effectively applied Markov’s inequality to the
random variable Xτ (the need for the proof arises when the conditions of
Doob’s optional sampling theorem are not met). The maximal inequality is
a strict improvement over Eq. (3.2) by replacing Xn with supt∈N Xt at no
cost whatsoever.
Theorem 3.10. Let (X_t)_{t=0}^n be a submartingale with Xt ≥ 0 almost surely for
all t. Then, for any ε > 0,

P( max_{t∈{0,1,...,n}} Xt ≥ ε ) ≤ E[Xn]/ε .
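A quick Monte Carlo sanity check of Theorem 3.10 (our own sketch, using the non-negative submartingale Xt = |St| built from a simple symmetric random walk, which is a submartingale because | · | is convex):

```python
import random

def check_maximal_inequality(n=50, eps=15.0, runs=100_000, seed=1):
    """Estimate both sides of Theorem 3.10 for X_t = |S_t|, where S_t is a
    simple symmetric random walk (a martingale, so |S_t| is a submartingale)."""
    rng = random.Random(seed)
    exceed = 0
    abs_end = 0.0
    for _ in range(runs):
        s, running_max = 0, 0
        for _ in range(n):
            s += 1 if rng.random() < 0.5 else -1
            running_max = max(running_max, abs(s))
        exceed += running_max >= eps
        abs_end += abs(s)
    lhs = exceed / runs               # P(max_t X_t >= eps)
    rhs = (abs_end / runs) / eps      # E[X_n] / eps
    return lhs, rhs

if __name__ == "__main__":
    lhs, rhs = check_maximal_inequality()
    print(f"P(max |S_t| >= eps) ~ {lhs:.4f} <= E[|S_n|]/eps ~ {rhs:.4f}")
```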
3.4 Notes
The theorem implies the useful relation that PX,Y = PY ⊗K (cf. Exercise 3.9)
where recall that for a random variable Z, PZ denotes its pushforward under
P. To make the origin of K clear, we often write PX|Y instead of K. With this,
the above equality becomes PX,Y = PY ⊗ PX|Y , which can be viewed as the
converse of the Ionescu–Tulcea theorem (Theorem 3.3). Sometimes this is called
the chain rule of probability measures.
You can also condition on a σ-algebra G ⊂ F, in which case K is a probability
kernel from (Ω, G) to X . The condition that X be Borel is sufficient, but not
necessary. Some conditions are required, however. An example where no regular
version exists can be found in [Halmos, 1976, p210]. Regular versions play
a role in the following useful theorem for decomposing random variables on
product spaces.
In many applications G = σ(Y), in which case the theorem says that
E[f(X, Y) | Y] = ∫_X f(x, Y) K(dx | Y) almost surely. Proofs of both theorems
appear in chapter 6 of Kallenberg [2002].
There are many places to find the construction of a stochastic process. Like before,
we recommend Kallenberg [2002] for readers who want to refresh their memory
and Billingsley [2008] for a more detailed account. For Markov chains the recent
book by Levin and Peres [2017] provides a wonderful introduction. After reading
that, you might like the tome by Meyn and Tweedie [2012]. Theorem 3.1 can be
found as theorem 3.19 in the book by Kallenberg [2002], where the reader can also
find its proof. Theorem 3.2 is credited to Percy John Daniell by Kallenberg [2002]
(see Aldrich 2007). More general versions of this theorem exist. Readers looking
for these should look up Kolmogorov’s extension theorem [Kallenberg, 2002,
theorem 6.16]. The theorem of Ionescu–Tulcea (Theorem 3.3) is attributed to him
[Ionescu Tulcea, 1949–50] with a modern proof in the book by Kallenberg [2002,
theorem 6.17]. There are lots of minor variants of the optional stopping theorem,
most of which can be found in any probability book featuring martingales. The
most historically notable source is by the man himself [Doob, 1953]. A more
modern book that also gives the maximal inequalities is the book on optimal
stopping by Peskir and Shiryaev [2006].
3.6 Exercises
3.6 (Limits of increasing stopping times are stopping times) Let (τ_n)_{n=1}^∞
be an almost surely increasing sequence of F-stopping times on a probability space
(Ω, F, P) with filtration F = (F_n)_{n=1}^∞, which means that τn(ω) ≤ τn+1(ω) for all
n ≥ 1 almost surely. Prove that τ(ω) = lim_{n→∞} τn(ω) is an F-stopping time.
4 Stochastic Bandits
The goal of this chapter is to formally introduce stochastic bandits. The model
introduced here provides the foundation for the remaining chapters that treat
stochastic bandits. While the topic seems a bit mundane, it is important to be
clear about the assumptions and definitions. The chapter also introduces and
motivates the learning objectives, and especially the regret. Besides the definitions,
the main result in this chapter is the regret decomposition, which is presented in
Section 4.5.
A mathematician might ask whether there even exists a probability space carrying
these random elements such that (a) and (b) hold. Specific constructions showing
this in the affirmative are given in Section 4.6. These constructions are also
valuable because they teach us important lessons about equivalent models. For
now, however, we move on.
Even if the horizon is known in advance and we commit to maximising the expected
value of Sn , there is still the problem that the bandit instance ν = (Pa : a ∈ A) is
unknown. A policy that maximises the expectation of Sn for one bandit instance
may behave quite badly on another. The learner usually has partial information
about ν, which we represent by defining a set of bandits E for which ν ∈ E is
guaranteed. The set E is called the environment class. We distinguish between
structured and unstructured bandits.
Unstructured Bandits
An environment class E is unstructured if A is finite and there exist sets of
distributions Ma for each a ∈ A such that

E = {ν = (Pa : a ∈ A) : Pa ∈ Ma for all a ∈ A} ,

or, in short, E = ×_{a∈A} Ma. The product structure means that by playing action
a the learner cannot deduce anything about the distributions of actions b 6= a.
Some typical choices of unstructured bandits are listed in Table 4.1. Of course,
these are not the only choices, and the reader can no doubt find ways to construct
more, e.g. by allowing some arms to be Bernoulli and some Gaussian, or by letting
the rewards be exponentially distributed, Gumbel distributed, or drawn from
your favourite (non-)parametric family.
The Bernoulli, Gaussian and uniform distributions are often used as examples
for illustrating some specific property of learning in stochastic bandit problems.
The Bernoulli distribution is actually a natural choice. Think of applications like
maximising click-through rates in a web-based environment. A bandit problem
is often called a ‘distribution bandit’, where ‘distribution’ is replaced by the
underlying distribution from which the pay-offs are sampled. Some examples
are: Gaussian bandit, Bernoulli bandit or subgaussian bandit. Similarly we say
‘bandits with X’, where ‘X’ is a property of the underlying distribution from
which the pay-offs are sampled. For example, we can talk about bandits with
finite variance, meaning the bandit environment where the a priori knowledge of
the learner is that all pay-off distributions are such that their underlying variance
is finite.
Some environment classes, like Bernoulli bandits, are parametric, while others,
like subgaussian bandits, are non-parametric. The distinction is the number of
degrees of freedom needed to describe an element of the environment class. When
the number of degrees of freedom is finite, it is parametric, and otherwise it is
non-parametric. Of course, if a learner is designed for a specific environment class
E, then we might expect that it has good performance on all bandits ν ∈ E. Some
environment classes are subsets of other classes. For example, Bernoulli bandits
are a special case of bandits with a finite variance, or bandits with bounded
support. Something to keep in mind is that we expect that it will be harder to
achieve a good performance in a larger class. In a way, the theory of finite-armed
stochastic bandits tries to quantify this expectation in a rigorous fashion.
Bounded support   E^k_{[a,b]} = {(Pi)i : Supp(Pi) ⊆ [a, b]}
Subgaussian       E^k_{SG}(σ²) = {(Pi)i : Pi is σ-subgaussian for all i}

Table 4.1 Typical environment classes for stochastic bandits. Supp(P) is the (topological)
support of distribution P. The kurtosis of a random variable X is a measure of its tail
behaviour and is defined by E[(X − E[X])⁴]/V[X]². Subgaussian distributions have similar
properties to the Gaussian and will be defined in Chapter 5.
Structured Bandits
Environment classes that are not unstructured are called structured. Relaxing the
requirement that the environment class is a product set makes structured bandit
problems much richer than the unstructured set-up. The following examples
illustrate the flexibility.
Example 4.1. Let A = {1, 2} and E = {(B(θ), B(1 − θ)) : θ ∈ [0, 1]}. In this
environment class, the learner does not know the mean of either arm, but can
learn the mean of both arms by playing just one. The knowledge of this structure
dramatically changes the difficulty of learning in this problem.
In this environment class, the reward of an action is Gaussian, and its mean is given
by the inner product between the action and some unknown parameter. Notice
that even if A is extremely large, the learner can deduce the true environment
by playing just d actions that span Rd .
In Chapter 1 we informally defined the regret as being the deficit suffered by the
learner relative to the optimal policy. Let ν = (Pa : a ∈ A) be a stochastic bandit
and define
µa(ν) = ∫_{−∞}^∞ x dPa(x) .

Then let µ∗(ν) = max_{a∈A} µa(ν) be the largest mean of all the arms.
We assume throughout that µa (ν) exists and is finite for all actions and
that argmaxa∈A µa (ν) is non-empty. The latter assumption could be relaxed
by carefully adapting all arguments using nearly optimal actions, but in
practice this is never required.
If the context is clear, we will often drop the dependence on ν and π in various
quantities. For example, we write Rn = nµ∗ − E[Σ_{t=1}^n Xt]. Similarly, the
limits in sums and maxima are abbreviated when we think you can work
out the ranges of symbols in a unique way, e.g. µ∗ = maxi µi.
The regret is always non-negative, and for every bandit ν, there exists a policy
π for which the regret vanishes.
which is only defined by assuming (or proving) that the regret is a measurable
function with respect to F. An advantage of the Bayesian approach is that
having settled on a prior and horizon, the problem of finding a policy that
minimises the Bayesian regret is just an optimisation problem. Most of this
book is devoted to analysing the ‘frequentist’ regret in Eq. (4.1), which does not
integrate over all environments as Eq. (4.4) does. Bayesian methods are covered
in Chapters 34 to 36, where we also discuss the strengths and weaknesses of the
Bayesian approach.
We now present a lemma that forms the basis of almost every proof for
stochastic bandits. Let ν = (Pa : a ∈ A) be a stochastic bandit and define
∆a (ν) = µ∗ (ν) − µa (ν), which is called the suboptimality gap or action gap
or immediate regret of action a. Further, let
Ta(t) = Σ_{s=1}^t I{As = a}
be the number of times action a was chosen by the learner after the end of round
t. In general, Ta (n) is random, which may seem surprising if we think about a
deterministic policy that chooses the same action for any fixed history. So why
is Ta (n) random in this case? The reason is that for all rounds t except for
the first, the action At depends on the rewards observed in rounds 1, 2, . . . , t − 1,
which are random, hence At will also inherit their randomness. We are now ready
to state the second and last lemmas of the chapter. In the statement of the lemma,
we use our convention that the dependence of the various quantities involved on
the policy π and the environment ν is suppressed.
Lemma 4.5 (Regret decomposition lemma). For any policy π and stochastic
bandit environment ν with A finite or countable and horizon n ∈ N, the regret
Rn of policy π in ν satisfies
Rn = Σ_{a∈A} ∆a E[Ta(n)] . (4.5)
The lemma decomposes the regret in terms of the loss due to using each of the
arms. It is useful because it tells us that to keep the regret small, the learner
should try to minimise the weighted sum of expected action counts, where the
weights are the respective suboptimality gaps, (∆a )a∈A .
Lemma 4.5 tells us that a learner should aim to use an arm with a larger
suboptimality gap proportionally fewer times.
Proof of Lemma 4.5 Since Rn is based on summing over rounds, and the right-
hand side of the lemma statement is based on summing over actions, to convert one
sum into the other one, we introduce indicators. In particular, note that for any
fixed t we have Σ_{a∈A} I{At = a} = 1. Hence Sn = Σ_t Xt = Σ_t Σ_a Xt I{At = a},
and thus

Rn = nµ∗ − E[Sn] = Σ_{a∈A} Σ_{t=1}^n E[(µ∗ − Xt) I{At = a}] . (4.6)

Since the conditional distribution of Xt given A1, X1, . . . , At is P_{At}, the tower rule gives
E[(µ∗ − Xt) I{At = a}] = ∆a E[I{At = a}] = ∆a P(At = a). The result is completed by
plugging this into Eq. (4.6) and using the definition of Ta(n).
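As a numerical sanity check of Lemma 4.5 (our own sketch with hypothetical names), the following code simulates a deliberately naive uniformly random policy on a Gaussian bandit and compares the direct estimate of the regret with the decomposition Σa ∆a E[Ta(n)]:

```python
import random

def run_once(means, n, rng):
    """Play a uniformly random policy for n rounds on a Gaussian bandit.
    Returns the total reward and the per-arm play counts."""
    k = len(means)
    counts = [0] * k
    total_reward = 0.0
    for _ in range(n):
        a = rng.randrange(k)            # a (deliberately naive) random policy
        counts[a] += 1
        total_reward += rng.gauss(means[a], 1.0)
    return total_reward, counts

def compare_decomposition(means=(0.5, 0.3, 0.1), n=200, runs=20_000, seed=0):
    rng = random.Random(seed)
    mu_star = max(means)
    gaps = [mu_star - m for m in means]
    sum_reward, sum_counts = 0.0, [0.0] * len(means)
    for _ in range(runs):
        reward, counts = run_once(means, n, rng)
        sum_reward += reward
        for i, c in enumerate(counts):
            sum_counts[i] += c
    regret_direct = n * mu_star - sum_reward / runs
    regret_decomp = sum(g * c / runs for g, c in zip(gaps, sum_counts))
    return regret_direct, regret_decomp  # the two estimates should agree

if __name__ == "__main__":
    print(compare_decomposition())
```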
The argument fails when A is uncountable because you cannot introduce the
sum over actions. Of course the solution is to use an integral, but for this we need
to assume (A, G) is a measurable space. Given a bandit ν and policy π define
measure G on (A, G) by

G(U) = E[ Σ_{t=1}^n I{At ∈ U} ] ,
where the expectation is taken with respect to the measure on outcomes induced
by the interaction of π and ν.
Lemma 4.6. Provided that everything is well defined and appropriately measurable,

Rn = E[ Σ_{t=1}^n ∆_{At} ] = ∫_A ∆a dG(a) .
For those worried about how to ensure everything is well defined, see Section 4.7.
In most cases the underlying probability space that supports the random rewards
and actions is never mentioned. Occasionally, however, it becomes convenient to
choose a specific probability space, which we call the canonical bandit model.
Finite Horizon
Let n ∈ N be the horizon. A policy and bandit interact to produce the outcome,
which is the tuple of random variables Hn = (A1 , X1 , . . . , An , Xn ). The first step
towards constructing a probability space that carries these random variables
is to choose the measurable space. For each t ∈ [n], let Ωt = ([k] × R)^t ⊂ R^{2t}
and Ft = B(Ωt ). The random variables A1 , X1 , . . . , An , Xn that make up the
outcome are defined by their coordinate projections:
At (a1 , x1 , . . . , an , xn ) = at and Xt (a1 , x1 , . . . , an , xn ) = xt .
The probability measure on (Ωn , Fn ) depends on both the environment and the
policy. Our informal definition of a policy is not quite sufficient now.
Definition 4.7. A policy π is a sequence (π_t)_{t=1}^n, where πt is a probability
kernel from (Ωt−1 , Ft−1 ) to ([k], 2[k] ). Since [k] is discrete, we adopt the notational
convention that for i ∈ [k],
πt (i | a1 , x1 , . . . , at−1 , xt−1 ) = πt ({i} | a1 , x1 , . . . , at−1 , xt−1 ) .
Let ν = (P_i)_{i=1}^k be a stochastic bandit where each Pi is a probability measure
on (R, B(R)). We want to define a probability measure on (Ωn , Fn ) that respects
our understanding of the sequential nature of the interaction between the learner
and a stationary stochastic bandit. Since we only care about the law of the
random variables (Xt ) and (At ), the easiest way to enforce this is to directly list
our expectations, which are
(a) the conditional distribution of action At given A1 , X1 , . . . , At−1 , Xt−1 is
πt ( · | A1 , X1 , . . . , At−1 , Xt−1 ) almost surely.
(b) the conditional distribution of reward Xt given A1 , X1 , . . . , At is PAt almost
surely.
The sufficiency of these assumptions is asserted by the following proposition,
which we ask you to prove in Exercise 4.2.
Proposition 4.8. Suppose that P and Q are probability measures on an arbitrary
measurable space (Ω, F) and A1 , X1 , . . . , An , Xn are random variables on Ω, where
At ∈ [k] and Xt ∈ R. If both P and Q satisfy (a) and (b), then the law of the
outcome under P is the same as under Q:
PA1 ,X1 ,...,An ,Xn = QA1 ,X1 ,...,An ,Xn .
Next we construct a probability measure on (Ωn , Fn ) that satisfies (a) and
(b). To emphasise that what follows is intuitively not complicated, imagine that
Xt ∈ {0, 1} is Bernoulli, which means the set of possible outcomes is finite and
we can define the measure in terms of a distribution. Let pi (0) = Pi ({0}) and
pi (1) = 1 − pi (0) and define
pνπ(a1, x1, . . . , an, xn) = ∏_{t=1}^n π(at | a1, x1, . . . , at−1, xt−1) p_{at}(xt) .
The reader can check that pνπ is a distribution on ([k] × {0, 1})^n and that the
associated measure satisfies (a) and (b) above. Making this argument rigorous
when (Pi ) are not discrete requires the use of Radon–Nikodym derivatives. Let λ
be a σ-finite measure on (R, B(R)) for which Pi is absolutely continuous with
respect to λ for all i. Next, let pi = dPi/dλ be the Radon–Nikodym derivative of
Pi with respect to λ, which is a function pi : R → R such that ∫_B pi dλ = Pi(B)
for all B ∈ B(R). Letting ρ be the counting measure with ρ(B) = |B|, the density
pνπ : Ωn → R can now be defined with respect to the product measure (ρ × λ)^n
by
pνπ(a1, x1, . . . , an, xn) = ∏_{t=1}^n π(at | a1, x1, . . . , at−1, xt−1) p_{at}(xt) . (4.7)
The reader can again check (more abstractly) that (a) and (b) are satisfied by
the probability measure Pνπ defined by
Pνπ(B) = ∫_B pνπ(ω) (ρ × λ)^n(dω) for all B ∈ Fn .
It is important to emphasise that this choice of (Ωn , Fn , Pνπ ) is not unique. Instead,
all that this shows is that a suitable probability space does exist. Furthermore, if
some quantity of interest depends on the law of Hn , by Proposition 4.8, there is
no loss in generality in choosing (Ωn , Fn , Pνπ ) as the probability space.
A choice of λ such that Pi ≪ λ for all i always exists, since λ = Σ_{i=1}^k Pi
satisfies this condition. For direct calculations, another choice is usually
more convenient, e.g. the counting measure when (Pi) are discrete and the
Lebesgue measure for continuous (Pi).
There is another way to define the probability space, which can be useful.
Define a collection of independent random variables (Xsi )s∈[n],i∈[k] such that the
law of Xti is Pi . By Theorem 2.4 these random variables may be defined on
(Ω, F), where Ω = R^{nk} and F = B(R^{nk}). Then let Xt = X_{t,At}, where the actions
At are Ft−1 -measurable with Ft−1 = σ(A1 , X1 , . . . , At−1 , Xt−1 ). We call this the
random table model. Yet another way is to define (X_{s,i})_{s,i} as above but let
Xt = X_{T_{At}(t), At}. This corresponds to sampling a stack of rewards for each arm
at the beginning of the game, giving rise to the reward-stack model. Each time
the learner chooses an action, they receive the reward on top of the stack. All of
these models are convenient from time to time. The important thing is that it
does not matter which model we choose because the quantity of ultimate interest
(usually the regret) only depends on the law of A1 , X1 , . . . , An , Xn , and this is
the same for all choices.
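As an illustration of the reward-stack model (a sketch of our own; the policy interface is hypothetical), all rewards are sampled up front as a table and the learner consumes the top of the chosen arm's stack in each round:

```python
import random

def play_reward_stack(policy, dists, n, seed=0):
    """Run a policy on the reward-stack model.

    policy(history) -> arm index; history is a list of (arm, reward) pairs.
    dists is a list of one-argument samplers (taking an RNG), one per arm.
    """
    rng = random.Random(seed)
    k = len(dists)
    # Pre-sample a stack of n rewards for every arm (the 'table').
    stacks = [[dists[i](rng) for _ in range(n)] for i in range(k)]
    counts = [0] * k
    history = []
    for _ in range(n):
        a = policy(history)
        x = stacks[a][counts[a]]   # take the next unused reward of arm a
        counts[a] += 1
        history.append((a, x))
    return history

if __name__ == "__main__":
    dists = [lambda r: r.gauss(0.0, 1.0), lambda r: r.gauss(0.5, 1.0)]
    alternating_policy = lambda history: len(history) % 2   # alternate the two arms
    print(play_reward_stack(alternating_policy, dists, n=6)[:3])
```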
Infinite Horizon
We never need the canonical bandit model for the case that n = ∞. It is comforting
to know, however, that there does exist a probability space (Ω, F, Pνπ ) and infinite
sequences of random variables X1 , X2 , . . . and A1 , A2 , . . . satisfying (a) and (b).
The result follows directly from the theorem of Ionescu–Tulcea (Theorem 3.3).
4.7 The Canonical Bandit Model for Uncountable Action Sets ( )
For uncountable action sets, a little more machinery is necessary to make things
rigorous. The first requirement is that the action set must be a measurable
space (A, G) and the collection of distributions ν = (Pa : a ∈ A) that defines a
bandit environment must be a probability kernel from (A, G) to (R, B(R)). A
policy is a sequence (π_t)_{t=1}^n, where πt is a probability kernel from (Ω_{t−1}, F_{t−1})
to (A, G) with

Ωt = ∏_{s=1}^t (A × R) and Ft = ⨂_{s=1}^t (G ⊗ B(R)) .
We did not define Pνπ in terms of a density because there may not exist a
common dominating measure for either (Pa : a ∈ A) or the policy. When
such measures exist, as they usually do, then Pνπ may be defined in terms
of a density in the same manner as the previous section.
You will check in Exercise 4.5 that the assumptions on ν and π in this section
are sufficient to ensure the quantities in Lemma 4.6 are well defined and that
Proposition 4.8 continues to hold in this setting without modification. Finally, in
none of the definitions above do we require that n be finite.
4.8 Notes
1 It is not obvious why the expected value is a good summary of the reward
distribution. Decision makers who base their decisions on expected values are
called risk-neutral. In the example shown on the figure above, a risk-averse
decision maker may actually prefer the distribution labelled as A because
occasionally distribution B may incur a very small (even negative) reward.
Risk-seeking decision makers, if they exist at all, would prefer distributions
with occasional large rewards to distributions that give mediocre rewards only.
While R̂n is influenced by the noise Xt − µAt in the rewards, the pseudo-regret
filters this out, which arguably makes it a better basis for measuring the ‘skill’
of a bandit policy. As these random regret measures tend to be highly skewed,
using variance to assess risk suffers not only from the problem of penalising
upside risk, but also from failing to capture the skew of the distribution.
5 What happens if the distributions of the arms are changing with time?
Such bandits are unimaginatively called non-stationary bandits. With no
assumptions, there is not much to be done. Because of this, it is usual to
assume the distributions change infrequently or drift slowly. We’ll eventually
see that techniques for stationary bandits can be adapted to this set-up (see
Chapter 31).
6 The rigorous models introduced in Sections 4.6 and 4.7 are easily extended to
more sophisticated settings. For example, the environment sometimes produces
side information as well as rewards or the set of available actions may change
with time. You are asked to formalise an example in Exercise 4.6.
and with an approach where the regret itself is redefined. VaR is considered in
the context of a specific bandit policy family by Audibert et al. [2007, 2009].
4.10 Exercises
By the definition of the canonical probability space and the product of probability
kernels,
Pνπ◦(B) = Σ_{a1=1}^k ∫_R · · · Σ_{an=1}^k ∫_R I_B(hn) ν_{an}(dxn) π_n◦(an | h_{n−1}) · · · ν_{a1}(dx1) π_1◦(a1)
= Σ_{π∈Π} p(π) Σ_{a1=1}^k ∫_R · · · Σ_{an=1}^k ∫_R I_B(hn) ν_{an}(dxn) πn(an | h_{n−1}) · · · ν_{a1}(dx1) π1(a1)
= Σ_{π∈Π} p(π) Pνπ(B) ,
4.6 (Canonical model for contextual bandit) Let A and C be finite sets.
A stochastic contextual bandit is like a normal stochastic bandit, but in each
round the learner first observes a context Ct ∈ C. They then choose an action
At ∈ A and receive a reward Xt ∼ PAt ,Ct .
(a) Let n be fixed and π = (π_t)_{t=1}^n be any policy. Prove there exists a retirement
policy π′ = (π′_t)_{t=1}^n such that for all ν ∈ E,

Rn(π′, ν) ≤ Rn(π, ν) .
(b) Let M1 = {B(µ1) : µ1 ∈ [0, 1]} and suppose that π = (π_t)_{t=1}^∞ is a retirement
policy. Prove there exists a bandit ν ∈ E such that

lim sup_{n→∞} Rn(π, ν)/n > 0 .
4.11 (Failure of follow-the-leader (i)) Consider a Bernoulli bandit with
two arms and means µ1 = 0.5 and µ2 = 0.6.
(a) Using a horizon of n = 100, run 1000 simulations of your implementation
of follow-the-leader on the Bernoulli bandit above and record the (random)
pseudo-regret, nµ∗ − Σ_{t=1}^n µ_{At}, in each simulation.
(b) Plot the results using a histogram. Your figure should resemble Fig. 4.2.
(c) Explain the results in the figure.
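A minimal simulation sketch for this exercise might look as follows (our own code; the tie-breaking rule and names are arbitrary choices). Follow-the-leader plays each arm once and then always plays the arm with the largest empirical mean.

```python
import random

def follow_the_leader(means=(0.5, 0.6), n=100, seed=0):
    """One run of follow-the-leader on a Bernoulli bandit.
    Returns the pseudo-regret n*mu_star - sum_t mu_{A_t}."""
    rng = random.Random(seed)
    k = len(means)
    counts, sums = [0] * k, [0.0] * k
    pseudo_regret = 0.0
    mu_star = max(means)
    for t in range(n):
        if t < k:
            a = t                                             # play each arm once first
        else:
            a = max(range(k), key=lambda i: sums[i] / counts[i])  # ties: lowest index
        x = 1.0 if rng.random() < means[a] else 0.0
        counts[a] += 1
        sums[a] += x
        pseudo_regret += mu_star - means[a]
    return pseudo_regret

if __name__ == "__main__":
    regrets = [follow_the_leader(seed=s) for s in range(1000)]
    print("mean pseudo-regret over 1000 runs:", sum(regrets) / len(regrets))
```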
Figure 4.2 Histogram of regret for follow-the-leader over 1000 trials on a Bernoulli bandit
with means µ1 = 0.5, µ2 = 0.6.
Figure 4.3 The expected regret of follow-the-leader over 1000 trials on a Bernoulli bandit
with means µ1 = 0.5, µ2 = 0.6 and horizons ranging from n = 100 to n = 1000.
5 Concentration of Measure
Before we can start designing and analysing algorithms, we need one more tool
from probability theory, called concentration of measure. Recall that the
optimal action is the one with the largest mean. Since the mean pay-offs are
initially unknown, they must be learned from data. How long does it take to
learn about the mean reward of an action? In this section, after introducing
the notion of tail probabilities, we look at ways of obtaining upper bounds on
them. The main point is to introduce subgaussian random variables and the
Cramér–Chernoff exponential tail inequalities, which will play a central role in
the design and analysis of the various bandit algorithms.
µ̂ = (1/n) Σ_{i=1}^n Xi ,

V[µ̂] = E[(µ̂ − µ)²] = σ²/n , (5.1)
which means that we expect the squared distance between µ and µ̂ to shrink as
n grows large at a rate of 1/n and scale linearly with the variance of X. While
the expected squared error is important, it does not tell us very much about the
distribution of the error. To do this we usually analyse the probability that µ̂
overestimates or underestimates µ by more than some value ε > 0. Precisely, how
Figure 5.1 The figure shows a probability density, with the tails shaded indicating the
regions where X is at least ε away from the mean µ.
optimised. This is a bit cumbersome, and thus instead we present the continuous
analog of this, known as the Cramér-Chernoff method.
To calibrate our expectations on what improvement to expect relative to
Chebyshev’s inequality, let us start by recalling the central limit theorem
(CLT). Let Sn = Σ_{t=1}^n (Xt − µ). The CLT says that, under no additional
assumptions beyond the existence of the variance, the limiting distribution of
Sn/√(nσ²) as n → ∞ is a Gaussian with mean zero and unit variance. If Z is a
standard Gaussian random variable, then

P(Z ≥ u) = ∫_u^∞ (1/√(2π)) exp(−x²/2) dx .

For u > 0 this integral can be bounded by

∫_u^∞ (1/√(2π)) exp(−x²/2) dx ≤ ∫_u^∞ (x/(u√(2π))) exp(−x²/2) dx = √(1/(2πu²)) exp(−u²/2) , (5.3)

which gives

P(µ̂ ≥ µ + ε) = P(Sn/√(σ²n) ≥ ε√(n/σ²)) ≈ P(Z ≥ ε√(n/σ²)) ≤ √(σ²/(2πnε²)) exp(−nε²/(2σ²)) . (5.4)
The asymptotic nature of the CLT makes it unsuitable for designing bandit
algorithms. In the next section, we derive finite-time analogs, which are only
possible by making additional assumptions.
5.3 The Cramér-Chernoff Method and Subgaussian Random Variables
For the sake of moving rapidly towards bandits, we start with a straightforward
and relatively fundamental assumption on the distribution of X, known as the
subgaussian assumption.
Definition 5.2 (Subgaussianity). A random variable X is σ-subgaussian if for
all λ ∈ R, it holds that E[exp(λX)] ≤ exp(λ²σ²/2).
An alternative way to express the subgaussianity condition uses the moment-
generating function of X, which is a function MX : R → R defined by
MX (λ) = E [exp(λX)]. The condition in the definition can be written as
ψX(λ) = log MX(λ) ≤ λ²σ²/2 for all λ ∈ R .
The function ψX is called the cumulant-generating function. It is not hard
to see that MX (or ψX ) need not exist for all random variables over the whole
range of real numbers. For example, if X is exponentially distributed and λ ≥ 1,
then
E[exp(λX)] = ∫_0^∞ exp(−x) exp(λx) dx = ∞ ,

where exp(−x) is the density of the (standard) exponential distribution.
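For orientation, a standard computation (not spelled out in the excerpt above, but easily verified) shows that a centred Gaussian with variance σ² is σ-subgaussian, and that Definition 5.2 holds with equality in this case:

```latex
% Claim (standard fact, stated here as an aside): if X ~ N(0, sigma^2), then
% E[exp(lambda X)] = exp(lambda^2 sigma^2 / 2), so X is sigma-subgaussian.
\begin{aligned}
\mathbb{E}[\exp(\lambda X)]
  &= \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi\sigma^2}}
     \exp\Big(\lambda x - \frac{x^2}{2\sigma^2}\Big)\,\mathrm{d}x \\
  &= \exp\Big(\frac{\lambda^2\sigma^2}{2}\Big)
     \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi\sigma^2}}
     \exp\Big(-\frac{(x-\lambda\sigma^2)^2}{2\sigma^2}\Big)\,\mathrm{d}x
   = \exp\Big(\frac{\lambda^2\sigma^2}{2}\Big).
\end{aligned}
```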
The following theorem explains the origin of the term ‘subgaussian’. The tails
of a σ-subgaussian random variable decay approximately as fast as that of a
Gaussian with zero mean and the same variance.
Theorem 5.3. If X is σ-subgaussian, then for any ε ≥ 0,
P(X ≥ ε) ≤ exp(−ε²/(2σ²)) . (5.5)
Proof We take a generic approach called the Cramér–Chernoff method. Let
λ > 0 be some constant to be tuned later. Then
P(X ≥ ε) = P(exp(λX) ≥ exp(λε))
≤ E[exp(λX)] exp(−λε)   (Markov's inequality)
≤ exp(λ²σ²/2 − λε) .   (Def. of subgaussianity)

Choosing λ = ε/σ² completes the proof.
A similar inequality holds for the left tail. By using the union bound
P(A ∪ B) ≤ P(A) + P(B), we also find that P(|X| ≥ ε) ≤ 2 exp(−ε²/(2σ²)).
An equivalent form of these bounds is

P(X ≥ √(2σ² log(1/δ))) ≤ δ and P(|X| ≥ √(2σ² log(2/δ))) ≤ δ .
This form is often more convenient, especially the latter, which for small δ
shows that with overwhelming probability X takes values in the interval

(−√(2σ² log(2/δ)), √(2σ² log(2/δ))) .
The proof of the lemma is left to the reader (Exercise 5.7). Combining
Lemma 5.4 and Theorem 5.3 leads to a straightforward bound on the tails
of µ̂ − µ.
For x > 0, it holds that exp(−x) ≤ 1/(ex), which shows that the above
inequality is stronger than what we obtained via Chebyshev’s inequality except
when ε is very small. It is exponentially smaller if nε² is large relative to σ². The
deviation form of the above result says that, under the conditions of the result,
for any δ ∈ (0, 1), with probability at least 1 − δ,

µ ≤ µ̂ + √(2σ² log(1/δ)/n) . (5.6)
Symmetrically, it also follows that with probability at least 1 − δ,
µ ≥ µ̂ − √(2σ² log(1/δ)/n) . (5.7)
Again, one can use a union bound to derive a two-sided inequality.
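In code, the two-sided version of these bounds yields a simple confidence interval for the mean of σ-subgaussian observations; the sketch below uses names of our own choosing.

```python
import math

def subgaussian_confidence_interval(samples, sigma=1.0, delta=0.05):
    """Two-sided (1 - delta) confidence interval for the mean, assuming each
    observation is sigma-subgaussian around the unknown mean (Eqs. (5.6)/(5.7)
    combined by a union bound, i.e. log(2/delta) in place of log(1/delta))."""
    n = len(samples)
    mean = sum(samples) / n
    width = math.sqrt(2 * sigma**2 * math.log(2 / delta) / n)
    return mean - width, mean + width

if __name__ == "__main__":
    import random
    rng = random.Random(0)
    data = [rng.gauss(0.3, 1.0) for _ in range(1000)]
    print(subgaussian_confidence_interval(data))   # should contain 0.3 with high probability
```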
(b) If X has mean zero and |X| ≤ B almost surely for B ≥ 0, then X is
B-subgaussian.
(c) If X has mean zero and X ∈ [a, b] almost surely, then X is (b − a)/2-
subgaussian.
For random variables that are not centred (E [X] 6= 0), we abuse notation
by saying that X is σ-subgaussian if the noise X − E [X] is σ-subgaussian.
A distribution is called σ-subgaussian if a random variable drawn from that
distribution is σ-subgaussian. Subgaussianity is really a property of both a
random variable and the measure on the space on which it is defined, so the
nomenclature is doubly abused.
5.4 Notes
2 Theorem 5.3 shows that subgaussian random variables have tails that decay
almost as fast as a Gaussian. A version of the converse is also possible. That
is, if a centered random variable has tails that behave in a similar way to a Gaussian,
then it is subgaussian. In particular, the following holds: let X be a centered
random variable (E[X] = 0) with P (|X| ≥ ε) ≤ 2 exp(−ε2 /2). Then X is
√5-subgaussian:

E[exp(λX)] = E[ Σ_{i=0}^∞ λ^i X^i / i! ] ≤ 1 + Σ_{i=2}^∞ λ^i E[|X|^i] / i!
≤ 1 + Σ_{i=2}^∞ ∫_0^∞ P(|X| ≥ (i!)^{1/i} x^{1/i} / λ) dx   (Exercise 2.19)
≤ 1 + 2 Σ_{i=2}^∞ ∫_0^∞ exp(−(i!)^{2/i} x^{2/i} / (2λ²)) dx   (by assumption)
= 1 + √(2π) λ (exp(λ²/2)(1 + erf(λ/√2)) − 1)   (by Mathematica)
≤ exp(5λ²/2) .
This bound is surely loose. At the same time, there is little room for
improvement: if X has density p(x) = |x| exp(−x²/2)/2, then P(|X| ≥ ε) =
exp(−ε²/2). And yet X is at best √2-subgaussian, so some degree of slack is
required (see Exercise 5.4).
3 We saw in (5.4) that if X1 , X2 , . . . , Xn are independent standard Gaussian
random variables and µ̂ = (1/n) Σ_{t=1}^n Xt, then

P(µ̂ ≥ ε) ≤ √(σ²/(2πnε²)) exp(−nε²/(2σ²)) .
The above is called Hoeffding’s inequality. For details see Exercise 5.11.
There are many variants of this result that provide tighter bounds when X
satisfies certain additional distributional properties like small variance (see
Exercise 5.14).
5 The Cramér–Chernoff method is applicable beyond the subgaussian case, even
when the moment-generating function is not defined globally. One example
where this occurs is when X1, X2, . . . , Xn are independent standard Gaussian
and Y = Σ_{i=1}^n Xi². Then Y has a χ²-distribution with n degrees of freedom.
An easy calculation shows that MY(λ) = (1 − 2λ)^{−n/2} for λ ∈ [0, 1/2) and
MY (λ) is undefined for λ ≥ 1/2. By the Cramér–Chernoff method, we have
5.6 Exercises
There are too many candidate exercises to list. We heartily recommend all the
exercises in chapter 2 of the book by Boucheron et al. [2013].
5.1 (Variance of average) Let X1 , X2 , . . . , Xn be a sequence of independent
and identically distributed random variables with mean µ and variance σ 2 < ∞.
Let µ̂ = (1/n) Σ_{t=1}^n Xt and show that V[µ̂] = E[(µ̂ − µ)²] = σ²/n.
5.3 Prove that the Gaussian tail probability bound on the right-hand side
of Eq. (5.4) is smaller than the bound obtained with Chebyshev’s inequality
Eq. (5.2). In what regime is the improvement most dramatic and in what regime
are both bounds trivial?
5.4 Let X be a random variable on R with density with respect to the Lebesgue
measure of p(x) = |x| exp(−x²/2)/2. Show the following:
(a) P(|X| ≥ ε) = exp(−ε²/2).
(b) X is not (√2 − ε)-subgaussian for any ε > 0.
(a) Show that lim_{n→∞} Pn(x) = e^{−λ} λ^x/x!, which is a Poisson distribution with
parameter λ.
(b) Explain why this does not contradict the CLT, and discuss the implications
of the Berry–Esseen theorem.
(c) In what way does this show that the CLT is indeed a poor approximation in
some cases?
(d) Based on Monte Carlo simulations, plot the distribution of X1 + · · · + Xn
for n = 30 and some well-chosen values of λ. Compare the distribution to
what you would get from the CLT. What can you conclude?
otherwise.
(c) Show that when X is a centered Bernoulli random variable with parameter
p (that is, P(X = −p) = 1 − p and P(X = 1 − p) = p), then ψ*_X(ε) = ∞
when ε is such that p + ε > 1 and ψ*_X(ε) = d(p + ε, p) otherwise, where
The name “large deviation” originates from the following: rewriting the tail probabilities in
terms of the partial sum Sn = X1 + · · · + Xn, we see that the inequality in (5.9)
bounds the probability of a deviation of Sn from its mean (which is zero by
assumption) at a scale of Θ(n): P(µ̂n ≥ ε) = P(Sn ≥ nε). In contrast, the
central limit theorem (CLT) gives the (limiting) probability of a deviation
of Sn from its mean at the scale of Θ(√n): P(µ̂n √n ≥ ε) = P(Sn ≥ √n ε).
Compared to √n ε, nε is thought of as a “large” deviation. The deviation
probabilities at this scale can decay to zero faster than what the CLT
predicts, as also showcased in the last part of the last exercise. But what
happens at intermediate scales? That is, when deviations are of size n^α ε with
1/2 < α < 1? This is studied under the name of moderate deviations.
As it turns out, in this case, the ruthless use of the large deviation formula
gives correct answers. The reader who wants to learn more about large
deviation theory can check out the lecture notes by Swart [2017].
5.11 (Hoeffding’s lemma) Suppose that X is zero mean and X ∈ [a, b] almost
surely for constants a < b.
Hint For part (a), it suffices to prove that ψX(λ) ≤ λ²(b − a)²/4. By Taylor's
theorem, for some λ0 between 0 and λ, ψX(λ) = ψX(0) + ψ'X(0)λ + ψ''X(λ0)λ²/2.
To bound the last term, introduce the distribution Pλ for λ ∈ R arbitrary:
Pλ(dz) = e^{−ψX(λ)} e^{λz} P(dz). Show that ψ''X(λ) = V[Z], where Z ∼ Pλ. Now,
since Z ∈ [a, b] with probability one, argue (without relying on E[Z]) that
V[Z] ≤ (b − a)²/4.
Readers looking for a hint to parts (b), (c) and (e) in the previous exercise
might like to look at the papers by Berend and Kontorovich [2013] and
Ostrovsky and Sirota [2014]. The result that the subgaussianity constant of
X ∼ B(p) is upper bounded by Q(p) is known as the Kearns–Saul inequality
and is due to Kearns and Saul [1998].
(ii) In light of the central limit theorem, explain why the answer you got in (i)
was not 1.
Hint For Part (d.i) use large deviation theory (Exercise 5.10).
5.14 (Bernstein’s inequality) Let X1 , . . . , Xn be a sequence of independent
random variables with Xt − E[Xt] ≤ b almost surely, and let S = Σ_{t=1}^n (Xt − E[Xt])
and v = Σ_{t=1}^n V[Xt].
(a) Show that g(x) = 1/2 + x/3! + x²/4! + · · · = (exp(x) − 1 − x)/x² is increasing.
(b) Let X be a random variable with E[X] = 0 and X ≤ b almost surely. Show
that E[exp(X)] ≤ 1 + g(b)V[X].
(c) Prove that (1 + α) log(1 + α) − α ≥ 3α²/(6 + 2α) for all α ≥ 0. Prove that this is
the best possible approximation in the sense that the 2 in the denominator
cannot be increased.
(d) Let ε > 0 and α = bε/v and prove that

P(S ≥ ε) ≤ exp(−(v/b²)((1 + α) log(1 + α) − α)) (5.10)
≤ exp(−ε²/(2v(1 + bε/(3v)))) . (5.11)
Note that the right-hand side of this inequality is the same as that shown in
Eq. (5.10).
The bound in Eq. (5.10) is called Bennett’s inequality and the one
in Eq. (5.11) is called Bernstein’s inequality. There are several
generalisations, the most notable of which is the martingale version that
slightly relaxes the independence assumption and which was presented in
Part (f). Martingale techniques appear in Chapter 20. Another useful variant
(under slightly different conditions) replaces the actual variance with the
empirical variance. This is useful when the variance is unknown. For more,
see the papers by Audibert et al. [2007], Mnih et al. [2008], Maurer and
Pontil [2009].
Hint Use the Cramér–Chernoff method and the fact that exp(x) ≤ 1 + x + x²
for all x ≤ 1 and exp(x) ≥ 1 + x for all x.
Let (Mt) be the martingale defined by Mt = Σ_{s=1}^t (Xs − µs). The inequalities
in Exercise 5.15 can be viewed as a kind of Bernstein's inequality because
they bound the tail of the martingale (Mt) in terms of the predictable
variation of the martingale (Mt), which is V = Σ_{t=1}^n E_{t−1}[(Xt − µt)²].
The main difference relative to well-known results is that the analysis has
stopped early. The next step is usually to choose η to minimise the bound
in some sense. Either by assuming bounds on the predictable variation,
union bounding or using the method of mixtures [de la Peña et al., 2008].
These techniques are covered in Chapter 20. Note, optimising η directly is
not possible because the bounds hold for any fixed η, but minimising the
right-hand side inside the probability with respect to η would lead to a
random η. For more martingale results with this flavour, see the notes by
McDiarmid [1998].
Hint Use Jensen’s inequality to show that exp(λE[Z]) ≤ E[exp(λZ)], and then
provide a naive bound on the moment-generating function of Z.
5.19 (Almost surely bounded sums) Let X1, X2, . . . , Xn be a sequence of non-negative
random variables adapted to a filtration (F_t)_{t=0}^n such that Σ_{t=1}^n Xt ≤ 1
almost surely. Prove that for all x > 1,

P( Σ_{t=1}^n E[Xt | F_{t−1}] ≥ x ) ≤ fn(x) = { ((n − x)/(n − 1))^{n−1} , if x < n ; 0 , if x ≥ n . }
Hint This problem does not use the techniques introduced in the chapter.
Prove that Bernoulli random variables are the worst case and use backwards
induction. Although this result is new to our knowledge, a weaker version was
derived by Kirschner and Krause [2018] for the analysis of information-directed
sampling. The bound is tight in the sense that there exists a sequence of random
variables and filtration for which equality holds.
Part II
Stochastic Bandits with Finitely Many Arms
Over the next few chapters, we introduce the fundamental algorithms and
tools of analysis for unstructured stochastic bandits with finitely many actions.
The keywords here are finite, unstructured and stochastic. The first of these
just means that the number of actions available is finite. The second is more
ambiguous, but roughly means that choosing one action yields no information
about the mean pay-off of the other arms. A bandit is stochastic if the sequence
of rewards associated with each action is independent and identically distributed
according to some distribution. This latter assumption will be relaxed in Part III.
There are several reasons to study this class of bandit problems. First, their
simplicity makes them relatively easy to analyse and permits a deep understanding
of the trade-off between exploration and exploitation. Second, many of the
algorithms designed for finite-armed bandits, and the principle underlying them,
can be generalised to other settings. Finally, finite-armed bandits already have
applications – notably as a replacement to A/B testing, as discussed in the
introduction.
6 The Explore-Then-Commit Algorithm
average reward received from arm i after round t, which is written formally as
µ̂i(t) = (1/Ti(t)) Σ_{s=1}^t I{As = i} Xs ,
where Ti(t) = Σ_{s=1}^t I{As = i} is the number of times action i has been played
after round t. The ETC policy is given in Algorithm 1 below.
1: Input m.
2: In round t choose action
At = { (t mod k) + 1 , if t ≤ mk ; argmax_i µ̂i(mk) , if t > mk }
(ties in the argmax are broken arbitrarily)
Algorithm 1: Explore-then-commit.
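A minimal, self-contained implementation sketch of Algorithm 1 for Gaussian rewards with unit variance (our own code, with illustrative names; the book's algorithm is the pseudocode above):

```python
import random

def explore_then_commit(means, m, n, seed=0):
    """Explore-then-commit on a Gaussian bandit with unit variance.
    Each of the k arms is explored m times in a round-robin fashion; afterwards
    the policy commits to the arm with the largest empirical mean. Returns the
    (random) pseudo-regret n*mu_star - sum_t mu_{A_t}."""
    rng = random.Random(seed)
    k = len(means)
    sums = [0.0] * k
    mu_star = max(means)
    pseudo_regret = 0.0
    for t in range(1, n + 1):
        if t <= m * k:
            a = (t - 1) % k                      # round-robin exploration (0-indexed arms)
            sums[a] += rng.gauss(means[a], 1.0)  # only exploration data is used later
        else:
            a = max(range(k), key=lambda i: sums[i] / m)  # commit; ties -> lowest index
        pseudo_regret += mu_star - means[a]
    return pseudo_regret

if __name__ == "__main__":
    runs = [explore_then_commit((0.0, -0.5), m=25, n=1000, seed=s) for s in range(2000)]
    print("estimated expected regret:", sum(runs) / len(runs))
```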
Theorem 6.1. When ETC is interacting with any 1-subgaussian bandit and
1 ≤ m ≤ n/k,
Rn ≤ m Σ_{i=1}^k ∆i + (n − mk) Σ_{i=1}^k ∆i exp(−m ∆i²/4) .
Proof Assume without loss of generality that the first arm is optimal, which
means that µ1 = µ∗ = maxi µi . By the decomposition given in Lemma 4.5, the
regret can be written as
Rn = Σ_{i=1}^k ∆i E[Ti(n)] . (6.1)
In the first mk rounds, the policy is deterministic, choosing each action exactly
m times. Subsequently it chooses a single action maximising the average reward
during exploration. Thus,
bound for bandits with bounded reward range, we take the liberty of also calling
bounds like that in Eq. (6.7) worst-case bounds.
The bound in (6.6) is close to optimal (see Part IV), but there is a caveat. The
choice of m that defines the policy and leads to this bound depends on both the
suboptimality gap and the horizon. While the horizon is sometimes known in
advance, it is seldom reasonable to assume knowledge of the suboptimality gap.
You will show in Exercise 6.5 that there is a choice of m depending only on n, for
which Rn = O(n2/3 ) regardless of the value of ∆. Alternatively, the number of
plays before commitment can be made data dependent, which means the learner
plays arms alternately until it decides based on its observations to commit to
a single arm for the remainder (Exercise 6.5). ETC also has the property that
its immediate expected regret per time step is monotonically decreasing as time
goes by, though not in a nice smooth fashion. This monotone decrease is a highly
desirable property. In later chapters we will see policies where the
decrease is smoother.
Experiment 6.1 Fig. 6.1 shows the expected regret of ETC when playing a
Gaussian bandit with k = 2 and means µ1 = 0 and µ2 = −∆. The horizon is set
to n = 1000, and the suboptimality gap ∆ is varied between 0 and 1. Each data
point is the average of 10^5 simulations, which makes the error bars invisible. The
results show that the theoretical upper bound provided by Theorem 6.1 is quite
close to the actual performance.
Figure 6.1 The expected regret of ETC and the upper bound in Eq. (6.6).
6.2 Notes
ETC has a long history. Robbins [1952] considered ‘certainty equivalence with
forcing’, which chooses the arm with the largest sample mean except at a fixed
set of times Ti ⊂ N when arm i is chosen for i ∈ [k]. By choosing the set
of times carefully, it is shown that this policy enjoys sublinear regret. While
ETC performs all the exploration at the beginning, Robbins’s policy spreads
the exploration over time. This is advantageous if the horizon is not known,
but disadvantageous otherwise. Anscombe [1963] considered exploration and
commitment in the context of medical trials or other experimental set-ups. He
already largely solves the problem in the Gaussian case and highlights many of
the important considerations. Besides this, the article is beautifully written and
well worth reading. Strategies based on exploration and commitment are simple
to implement and analyse. They can also generalise well to more complex settings.
For example, Langford and Zhang [2008] consider this style of policy under the
name ‘epoch-greedy’ for contextual bandits (the idea of exploring then exploiting
in epochs, or intervals, is essentially what Robbins [1952] suggested). We’ll return
to contextual bandits in Chapter 18. Abbasi-Yadkori et al. [2009], Abbasi-Yadkori
[2009b] and Rusmevichientong and Tsitsiklis [2010] consider ETC-style policies
under the respective names of ‘forced exploration’ and ‘phased exploration and
greedy exploitation’ (PEGE) in the context of linear bandits (which we shall meet
in Chapter 19). Other names include ‘forced sampling’, ‘explore-first’, ‘explore-
then-exploit’. Garivier et al. [2016b] have shown that ETC policies are necessarily
suboptimal in the limit of infinite data in a way that is made precise in Chapter 16.
This comment also applies to elimination-based strategies, which are described in
Exercise 6.8. The history of ε-greedy is unclear, but it is a popular and widely
used algorithm in reinforcement learning [Sutton and Barto, 1998]. Auer
et al. [2002a] analyse the regret of ε-greedy with slowly decreasing exploration
probabilities. There are other kinds of randomised exploration as well, including
Thompson sampling [1933] and Boltzmann exploration analysed recently by
Cesa-Bianchi et al. [2017].
6.4 Exercises
6.2 (Minimax regret) Show that Eq. (6.6) implies the regret of an optimally
tuned ETC for subgaussian two-armed bandits satisfies Rn ≤ ∆ + C√n, where
C > 0 is a universal constant.
6.3 (High-probability bounds (i)) Assume that k = 2, and let δ ∈ (0, 1).
Modify the ETC algorithm to depend on δ and prove a bound on the pseudo-regret
R̄n = nµ∗ − Σ_{t=1}^n µ_{At} of ETC that holds with probability 1 − δ. The
algorithm is allowed to use the action suboptimality gaps.
6.4 (High-probability bounds (ii)) Repeat the previous exercise, but now
prove a high-probability bound on the random regret R̂n = nµ∗ − Σ_{t=1}^n Xt.
Compare this to the bound derived for the pseudo-regret in the previous exercise.
What can you conclude?
6.5 (Adaptive commitment times) Suppose that ETC interacts with a two-
armed 1-subgaussian bandit ν ∈ E with means µ1 , µ2 ∈ R and ∆ν = |µ1 − µ2 |.
(a) Find a choice of m that only depends on the horizon n and not ∆ such that
there exists a constant C > 0 such that for any n and for any ν ∈ E, the
regret Rn (ν) of Algorithm 1 is bounded by
Furthermore, show that there is no C > 0 such that for any problem instance
ν and n ≥ 1, Rn(ν) ≤ ∆ν + Cn^{2/3} holds.
(b) Now suppose the commitment time is allowed to be data dependent, which
means the algorithm explores each arm alternately until some condition is
met and then commits to a single arm for the remainder. Design a condition
such that the regret of the resulting algorithm can be bounded by
Rn(ν) ≤ ∆ν + C log(n)/∆ν , (6.8)
where C is a universal constant. Your condition should only depend on the
observed rewards and the time horizon. It should not depend on µ1 , µ2 or
∆ν .
(c) Show that any algorithm for which (6.8) holds also satisfies Rn(ν) ≤
∆ν + C√(n log(n)) for any n ≥ 1 and ν ∈ E and a suitably chosen universal
constant C > 0.
(d) As for (b), but now the objective is to design a condition such that for any
n ≥ 1 and ν ∈ E, the regret of the resulting algorithm is bounded by
Rn(ν) ≤ ∆ν + C log(max{e, n∆ν²})/∆ν . (6.9)
(e) Show that any algorithm for which (6.9) holds also satisfies, for any
n ≥ 1 and ν ∈ E, Rn(ν) ≤ ∆ν + C√n for a suitably chosen universal constant
C > 0.
Hint For (a) start from Rn ≤ m∆ + n∆ exp(−m∆²/2) and show an upper
bound on the second term which is independent of ∆. Then, choose m. For
(b) think about the simplest stopping policy and then make it robust by using
confidence intervals. Tune the failure probability. For (c) note that the regret
can never be larger than n∆.
6.6 (Doubling trick) The purpose of this exercise is to analyse a meta-
algorithm based on the so-called doubling trick that converts a policy depending
on the horizon to a policy with similar guarantees that does not. Let E be an
arbitrary set of bandits. Suppose you are given a policy π = π(n) designed for E
that accepts the horizon n as a parameter and has a regret guarantee of
According to Besson and Kaufmann [2018], the doubling trick was first
applied to bandits by Auer et al. [1995]. Note, nowhere in this exercise did
we use that the bandit is stochastic. Nothing changes in the adversarial or
contextual settings studied later in the book.
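To make the construction in Exercise 6.6 concrete, here is a minimal Python sketch of the doubling-trick wrapper: the horizon-dependent policy is restarted on epochs of geometrically growing length. The interface (`make_policy`, `act`, `update`) is a hypothetical placeholder, not notation used in this book.

```python
def doubling_trick(make_policy, bandit, total_rounds):
    """Run a horizon-dependent policy without knowing the true horizon.

    `make_policy(horizon)` is assumed to return a fresh policy object with
    hypothetical methods `act()` -> arm and `update(arm, reward)`.
    `bandit(arm)` returns the reward of the chosen arm. The policy is
    restarted on epochs of lengths 1, 2, 4, ..., so that a regret guarantee
    of order n^p is typically preserved up to a constant factor (the subject
    of this exercise).
    """
    t = 0
    epoch_length = 1
    while t < total_rounds:
        horizon = min(epoch_length, total_rounds - t)
        policy = make_policy(horizon)   # fresh instance tuned for this epoch
        for _ in range(horizon):
            arm = policy.act()
            policy.update(arm, bandit(arm))
            t += 1
        epoch_length *= 2
```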
6.7 (ε-greedy) For this exercise assume the rewards are 1-subgaussian and
there are k ≥ 2 arms. The ε-greedy algorithm depends on a sequence of
parameters ε1 , ε2 , . . .. First it chooses each arm once and subsequently chooses
At = argmaxi µ̂i (t − 1) with probability 1 − εt and otherwise chooses an arm
uniformly at random.
(a) Prove that if $\varepsilon_t = \varepsilon > 0$, then $\displaystyle\lim_{n\to\infty}\frac{R_n}{n} = \frac{\varepsilon}{k}\sum_{i=1}^k \Delta_i$.
(b) Let $\Delta_{\min} = \min\{\Delta_i : \Delta_i > 0\}$ and let $\varepsilon_t = \min\left\{1, \frac{Ck}{t\Delta_{\min}^2}\right\}$, where C > 0 is
a sufficiently large universal constant. Prove that there exists a universal
C′ > 0 such that
$$R_n \le C'\sum_{i=1}^k\left(\Delta_i + \frac{\Delta_i}{\Delta_{\min}^2}\log\max\left\{e, \frac{n\Delta_{\min}^2}{k}\right\}\right).$$
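A minimal Python sketch of the ε-greedy policy described in Exercise 6.7 follows; `pull` is a hypothetical reward oracle, and the decreasing schedule of part (b) is indicated only in a comment since its constant C is unspecified.

```python
import math
import random

def epsilon_greedy(pull, k, n, epsilon_schedule):
    """Sketch of epsilon-greedy: play each arm once, then with probability
    epsilon_schedule(t) explore uniformly, otherwise play the arm with the
    largest empirical mean. `pull(i)` returns a reward for arm i."""
    sums = [0.0] * k
    counts = [0] * k
    for i in range(k):                  # initial round: play each arm once
        sums[i] += pull(i)
        counts[i] += 1
    for t in range(k + 1, n + 1):
        if random.random() < epsilon_schedule(t):
            arm = random.randrange(k)                               # explore
        else:
            arm = max(range(k), key=lambda i: sums[i] / counts[i])  # exploit
        sums[arm] += pull(arm)
        counts[arm] += 1

# Constant schedule as in part (a); for part (b) one would use something like
# epsilon_t = min(1, C * k / (t * delta_min ** 2)) for a large enough C.
epsilon_greedy(lambda i: random.gauss(-0.1 * i, 1.0), k=3, n=1000,
               epsilon_schedule=lambda t: 0.1)
```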
Algorithm 2: Phased elimination for finite-armed bandits
(b) Show that if i ∈ [k] and ℓ ≥ 1 are such that $\Delta_i \ge 2^{-\ell}$, then
$$P\left(i \in A_{\ell+1},\, 1 \in A_\ell,\, i \in A_\ell\right) \le \exp\left(-\frac{m_\ell(\Delta_i - 2^{-\ell})^2}{4}\right).$$
(c) Let $\ell_i = \min\{\ell \ge 1 : 2^{-\ell} \le \Delta_i/2\}$. Choose $m_\ell$ in such a way that
$P(\text{exists } \ell : 1 \notin A_\ell) \le 1/n$ and $P(i \in A_{\ell_i+1}) \le 1/n$.
(d) Show that your algorithm has regret at most
$$R_n \le C\sum_{i:\Delta_i>0}\left(\Delta_i + \frac{1}{\Delta_i}\log(n)\right),$$
(f) Show that with an appropriate universal constant C′ > 0, the regret satisfies
$$R_n \le \sum_i \Delta_i + C'\sqrt{nk\log(k)}\,.$$
Algorithm 2 is due to Auer and Ortner [2010]. The log(k) term in Part (f) can
be removed by modifying the algorithm to use the refined confidence intervals
in Chapter 9, but we would not recommend this for the reasons discussed
in Section 9.2 of that chapter. You could also use a more sophisticated
confidence level [Lattimore, 2018].
Figure 6.2 Expected regret for ETC over $10^5$ trials on a Gaussian bandit with means
µ1 = 0, µ2 = −1/10
6.9 (Empirical study) In this exercise you will investigate the empirical
behaviour of ETC on a two-armed Gaussian bandit with means µ1 = 0 and
µ2 = −∆. Let
$$\bar R_n = \sum_{t=1}^n \Delta_{A_t}\,,$$
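A possible starting point for this empirical study is sketched below in Python: it simulates ETC with exploration parameter m on the two-armed Gaussian bandit above and estimates the mean and standard deviation of the random regret over independent trials. The parameter values and the plotting are illustrative choices only.

```python
import random
import statistics

def etc_random_regret(m, n, delta, rng):
    """One run of ETC on a two-armed Gaussian bandit with means 0 and -delta.

    Returns the random regret sum_t Delta_{A_t}: each arm is explored m times,
    then ETC commits to the arm with the larger empirical mean."""
    mean1 = sum(rng.gauss(0.0, 1.0) for _ in range(m)) / m
    mean2 = sum(rng.gauss(-delta, 1.0) for _ in range(m)) / m
    regret = m * delta                 # exploration cost of the suboptimal arm
    if mean2 > mean1:                  # committed to the wrong arm
        regret += (n - 2 * m) * delta
    return regret

rng = random.Random(0)
n, delta, trials = 1000, 1 / 10, 1000  # far fewer than the 10^5 trials in the figures
regrets = [etc_random_regret(m=50, n=n, delta=delta, rng=rng) for _ in range(trials)]
print(statistics.mean(regrets), statistics.stdev(regrets))
```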
Figure 6.3 Standard deviation of the regret for ETC over $10^5$ trials on a Gaussian bandit
with means µ1 = 0, µ2 = −1/10
7 The Upper Confidence Bound
Algorithm
The upper confidence bound (UCB) algorithm offers several advantages over the
explore-then-commit (ETC) algorithm introduced in the last chapter.
happen too often because the additional data provided by playing a suboptimal
arm means that the upper confidence bound for this arm will eventually fall
below that of the optimal arm.
In order to make this argument more precise, we need to define the upper
confidence bound. Let $(X_t)_{t=1}^n$ be a sequence of independent 1-subgaussian random
variables with mean µ and $\hat\mu = \frac{1}{n}\sum_{t=1}^n X_t$. By Eq. (5.6),
$$P\left(\mu \ge \hat\mu + \sqrt{\frac{2\log(1/\delta)}{n}}\right) \le \delta \qquad \text{for all } \delta \in (0, 1)\,. \tag{7.1}$$
Great care is required when comparing (7.1) and (7.2) because in the former the
number of samples is the constant n, but in the latter it is a random variable
Ti (t − 1). By and large, however, this is merely an annoying technicality, and the
intuition remains that δ is approximately an upper bound on the probability of
the event that the above quantity is an underestimate of the true mean. More
details are given in Exercise 7.1.
At last we have everything we need to state a version of the UCB algorithm,
which takes as input the number of arms and the error probability δ.
1: Input k and δ
2: for t ∈ 1, . . . , n do
3: Choose action At = argmaxi UCBi (t − 1, δ)
4: Observe reward Xt and update upper confidence bounds
5: end for
Algorithm 3: UCB(δ).
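For concreteness, here is a short Python sketch of Algorithm 3 for 1-subgaussian rewards, using the index of empirical mean plus exploration bonus $\sqrt{2\log(1/\delta)/T_i(t-1)}$ referenced as (7.2); unplayed arms get an infinite index so that each arm is chosen once at the start. The helper `pull` is a hypothetical reward oracle.

```python
import math
import random

def ucb_delta(pull, k, n, delta):
    """Sketch of UCB(delta) (Algorithm 3) for 1-subgaussian rewards."""
    sums = [0.0] * k
    counts = [0] * k

    def index(i):
        if counts[i] == 0:
            return math.inf   # forces the initial round-robin over the arms
        return sums[i] / counts[i] + math.sqrt(2 * math.log(1 / delta) / counts[i])

    for _ in range(n):
        arm = max(range(k), key=index)   # arm with the largest upper confidence bound
        sums[arm] += pull(arm)
        counts[arm] += 1
    return counts

n = 1000
ucb_delta(lambda i: random.gauss([0.0, -0.5][i], 1.0), k=2, n=n, delta=1 / n**2)
```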
Although there are many versions of the UCB algorithm, we often do not
distinguish them by name and hope the context is clear. For the rest of this
chapter, we’ll usually call UCB(δ) just UCB.
The value inside the argmax is called the index of arm i. Generally speaking,
an index algorithm chooses the arm in each round that maximises some value
(the index), which usually only depends on the current time step and the samples
from that arm. In the case of UCB, the index is the sum of the empirical mean
of rewards experienced so far and the exploration bonus, which is also known
as the confidence width.
Besides the slightly vague ‘optimism guarantees optimality or learning’ intuition
we gave before, it is worth exploring other intuitions for the choice of index. At
a very basic level, an algorithm should explore arms more often if they are (a)
promising because µ̂i (t − 1) is large or (b) not well explored because Ti (t − 1) is
small. As one can plainly see, the definition in Eq. (7.2) exhibits this behaviour.
This explanation is not completely satisfying, however, because it does not explain
why the form of the functions is just so.
A more refined explanation comes from thinking of what we expect of any
reasonable algorithm. Suppose at the start of round t the first arm has been
played much more frequently than the rest. If we did a good job designing our
algorithm, we would hope this is the optimal arm, and because it has been played
so often, we expect that µ̂1 (t − 1) ≈ µ1 . To confirm the hypothesis that arm 1 is
optimal, the algorithm had better be highly confident that other arms are indeed
worse. This leads quite naturally to the idea of using upper confidence bounds.
The learner can be reasonably certain that arm i is worse than arm 1 if
$$\hat\mu_i(t-1) + \sqrt{\frac{2\log(1/\delta)}{T_i(t-1)}} \le \mu_1 \approx \hat\mu_1(t-1) + \sqrt{\frac{2\log(1/\delta)}{T_1(t-1)}}\,, \tag{7.3}$$
where δ is called the confidence level and quantifies the degree of certainty.
This means that choosing the arm with the largest upper confidence bound leads
to a situation where arms are only chosen if their true mean could reasonably be
larger than those of arms that have been played often. That this rule is indeed a
good one depends on two factors. The first is whether the width of the confidence
interval at a given confidence level can be significantly decreased, and the second
is whether the confidence level is chosen in a reasonable fashion. For now, we
will take a leap of faith and assume that the width of confidence intervals for
subgaussian bandits cannot be significantly improved from what we use here
(we shall see that this holds in later chapters), and concentrate on choosing the
confidence level now.
As already alluded to, one of the main difficulties is that the number of samples
Ti (t − 1) in the index (7.2) is a random variable, and so our concentration results
cannot be immediately applied. For this reason we will see that (at least naively)
δ should be chosen a bit smaller than 1/n.
Theorem 7.1. Consider UCB as shown in Algorithm 3 on a stochastic k-armed
1-subgaussian bandit problem. For any horizon n, if δ = 1/n2 , then
$$R_n \le 3\sum_{i=1}^k \Delta_i + \sum_{i:\Delta_i>0}\frac{16\log(n)}{\Delta_i}\,.$$
Before the proof we need a little more notation. Let $(X_{ti})_{t\in[n],\,i\in[k]}$ be a collection
of independent random variables with the law of $X_{ti}$ equal to $P_i$. Then define
$\hat\mu_{is} = \frac{1}{s}\sum_{u=1}^s X_{ui}$ to be the empirical mean based on the first s samples. We
make use of the third model in Section 4.6 by assuming that the reward in round
t is
$$X_t = X_{T_{A_t}(t)A_t}\,.$$
Then we define $\hat\mu_i(t) = \hat\mu_{iT_i(t)}$ to be the empirical mean of the ith arm after round
t. The proof of Theorem 7.1 relies on the basic regret decomposition identity,
$$R_n = \sum_{i=1}^k \Delta_i E[T_i(n)]\,. \qquad \text{(Lemma 4.5)}$$
The theorem will follow by showing that E [Ti (n)] is not too large for suboptimal
arms i. The key observation is that after the initial period where the algorithm
chooses each action once, action i can only be chosen if its index is higher than
that of an optimal arm. This can only happen if at least one of the following is
true:
(a) The index of action i is larger than the true mean of a specific optimal arm.
(b) The index of a specific optimal arm is smaller than its true mean.
Since with reasonably high probability the index of any arm is an upper bound
on its mean, we don’t expect the index of the optimal arm to be below its
mean. Furthermore, if the suboptimal arm i is played sufficiently often, then its
exploration bonus becomes small and simultaneously the empirical estimate of
its mean converges to the true value, putting an upper bound on the expected
total number of times when its index stays above the mean of the optimal arm.
The proof that follows is typical for the analysis of algorithms like UCB, and
hence we provide quite a bit of detail so that readers can later construct their
own proofs.
Proof of Theorem 7.1 Without loss of generality, we assume the first arm is
optimal so that µ1 = µ∗ . As noted above,
$$R_n = \sum_{i=1}^k \Delta_i E[T_i(n)]\,. \tag{7.4}$$
The theorem will be proven by bounding E[Ti (n)] for each suboptimal arm i. We
make use of a relatively standard idea, which is to decouple the randomness from
the behaviour of the UCB algorithm. Let Gi be the ‘good’ event defined by
$$G_i = \left\{\mu_1 < \min_{t\in[n]} \mathrm{UCB}_1(t, \delta)\right\} \cap \left\{\hat\mu_{iu_i} + \sqrt{\frac{2}{u_i}\log\frac{1}{\delta}} < \mu_1\right\},$$
The next step is to complete our promise by showing that Ti (n) ≤ ui on Gi and
that P (Gci ) is small. Let us first assume that Gi holds and show that Ti (n) ≤ ui ,
which we do by contradiction. Suppose that Ti (n) > ui . Then arm i was played
more than ui times over the n rounds, and so there must exist a round t ∈ [n]
where Ti (t − 1) = ui and At = i. Using the definition of Gi ,
$$\begin{aligned}
\mathrm{UCB}_i(t-1,\delta) &= \hat\mu_i(t-1) + \sqrt{\frac{2\log(1/\delta)}{T_i(t-1)}} && \text{(definition of } \mathrm{UCB}_i(t-1,\delta)\text{)}\\
&= \hat\mu_{iu_i} + \sqrt{\frac{2\log(1/\delta)}{u_i}} && \text{(since } T_i(t-1) = u_i\text{)}\\
&< \mu_1 && \text{(definition of } G_i\text{)}\\
&< \mathrm{UCB}_1(t-1,\delta)\,. && \text{(definition of } G_i\text{)}
\end{aligned}$$
The first of these sets is decomposed using the definition of UCB1 (t, δ),
$$\left\{\mu_1 \ge \min_{t\in[n]} \mathrm{UCB}_1(t,\delta)\right\} \subset \left\{\mu_1 \ge \min_{s\in[n]}\left(\hat\mu_{1s} + \sqrt{\frac{2\log(1/\delta)}{s}}\right)\right\} = \bigcup_{s\in[n]}\left\{\mu_1 \ge \hat\mu_{1s} + \sqrt{\frac{2\log(1/\delta)}{s}}\right\}.$$
Then using a union bound and the concentration bound for sums of independent
subgaussian random variables in Corollary 5.5, we obtain:
$$P\left(\mu_1 \ge \min_{t\in[n]}\mathrm{UCB}_1(t,\delta)\right) \le P\left(\bigcup_{s\in[n]}\left\{\mu_1 \ge \hat\mu_{1s} + \sqrt{\frac{2\log(1/\delta)}{s}}\right\}\right) \le \sum_{s=1}^n P\left(\mu_1 \ge \hat\mu_{1s} + \sqrt{\frac{2\log(1/\delta)}{s}}\right) \le n\delta\,. \tag{7.7}$$
The next step is to bound the probability of the second set in (7.6). Assume that
ui is chosen large enough that
$$\Delta_i - \sqrt{\frac{2\log(1/\delta)}{u_i}} \ge c\Delta_i \tag{7.8}$$
since Ti (n) ≤ n. Then, using the assumption that δ = 1/n2 and this choice of ui
leads via (7.9) to
$$E[T_i(n)] \le u_i + 1 + n^{1-2c^2/(1-c)^2} = \frac{2\log(n^2)}{(1-c)^2\Delta_i^2} + 1 + n^{1-2c^2/(1-c)^2}\,. \tag{7.10}$$
All that remains is to choose c ∈ (0, 1). The second term will contribute a
polynomial dependence on n unless 2c2 /(1 − c)2 ≥ 1. However, if c is chosen too
close to 1, then the first term blows up. Somewhat arbitrarily we choose c = 1/2,
which leads to
$$E[T_i(n)] \le 3 + \frac{16\log(n)}{\Delta_i^2}\,.$$
The result follows by substituting the above display in Eq. (7.4).
As we saw for the ETC strategy, the regret bound in Theorem 7.1 depends
on the reciprocal of the gaps, which may be meaningless when even a single
suboptimal action has a very small suboptimality gap. As before, one can also
prove a sublinear regret bound that does not depend on the reciprocal of the
gaps.
Theorem 7.2. If δ = 1/n², then the regret of UCB, as defined in Algorithm 3,
on any environment $\nu \in \mathcal{E}^k_{\mathrm{SG}}(1)$, is bounded by
$$R_n \le 8\sqrt{nk\log(n)} + 3\sum_{i=1}^k \Delta_i\,.$$
Proof Let ∆ > 0 be some value to be tuned subsequently, and recall from the
proof of Theorem 7.1 that for each suboptimal arm i, we can bound
$$E[T_i(n)] \le 3 + \frac{16\log(n)}{\Delta_i^2}\,.$$
Therefore, using the basic regret decomposition again (Lemma 4.5), we have
$$\begin{aligned}
R_n = \sum_{i=1}^k \Delta_i E[T_i(n)] &= \sum_{i:\Delta_i<\Delta}\Delta_i E[T_i(n)] + \sum_{i:\Delta_i\ge\Delta}\Delta_i E[T_i(n)]\\
&\le n\Delta + \sum_{i:\Delta_i\ge\Delta}\left(3\Delta_i + \frac{16\log(n)}{\Delta_i}\right) \le n\Delta + \frac{16k\log(n)}{\Delta} + 3\sum_i \Delta_i\\
&\le 8\sqrt{nk\log(n)} + 3\sum_{i=1}^k \Delta_i\,,
\end{aligned}$$
where the first inequality follows because $\sum_{i:\Delta_i<\Delta} T_i(n) \le n$ and the last line by
choosing $\Delta = \sqrt{16k\log(n)/n}$.
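To see where the final inequality comes from, note that with this choice of ∆ the two ∆-dependent terms are balanced:
$$n\Delta + \frac{16k\log(n)}{\Delta} = n\sqrt{\frac{16k\log(n)}{n}} + \frac{16k\log(n)}{\sqrt{16k\log(n)/n}} = 2\sqrt{16nk\log(n)} = 8\sqrt{nk\log(n)}\,.$$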
The additive $\sum_i \Delta_i$ term is unavoidable because no reasonable algorithm can
avoid playing each arm once (try to work out what would happen if it did not).
In any case, this term does not grow with the horizon n and is typically negligible.
Figure 7.1 Experiment showing universality of UCB relative to fixed instances of ETC
7.2 Notes
1 The choice of δ = 1/n2 led to an easy analysis, but comes with two
disadvantages. First of all, it turns out that a slightly smaller value of δ
improves the regret (and empirical performance). Secondly, the dependence on
n means the horizon must be known in advance, which is often not reasonable.
Both of these issues are resolved in the next chapter, where δ is chosen to be
smaller and to depend on the current round t rather than n. Nonetheless – as
promised – Algorithm 3 with δ = 1/n2 does achieve a regret bound similar to
the ETC strategy, but without requiring knowledge of the gaps.
2 The assumption that the rewards generated by each arm are independent can
be relaxed significantly. All of the results would go through by assuming there
exists a mean reward vector µ ∈ Rk such that
Eq. (7.11) is just saying that the conditional mean of the reward in round t
only depends on the chosen action. Eq. (7.12) ensures that the tails of Xt are
conditionally subgaussian. That everything still goes through is proven using
martingale techniques, which we develop in detail in Chapter 20.
3 So is the optimism principle universal? Does it always lead to policies with
strong guarantees in more complicated settings? Unfortunately the answer turns
out to be no. The optimism principle usually leads to reasonable algorithms
when (i) any action gives feedback about the quality of that action and (ii) no
action gives feedback about the value of other actions. When (i) is violated, even
sublinear regret may not be guaranteed. When (ii) is violated, an optimistic
algorithm may avoid actions that lead to large information gain and low reward,
even when this trade-off is optimal. An example where this occurs is provided
in Chapter 25 on linear bandits. Optimism can work in more complex models as
well, but sometimes fails to appropriately balance exploration and exploitation.
4 When thinking about future outcomes, humans and some animals often have
higher expectations than are warranted by past experience or conditions of the
environment. This phenomenon, a form of cognitive bias, is known as the
optimism bias in the psychology and behavioural economics literature and is
in fact ‘one of the most consistent, prevalent, and robust biases documented in
psychology and behavioral economics’ [Sharot, 2011a]. While much has been
written about this bias in these fields, and one of the current explanations
of why the optimism bias is so prevalent is that it helps exploration, to our
best knowledge, the connection to the deeper mathematical justification of
optimism, pursued here and in other parts of this book, has so far escaped the
attention of researchers in all the relevant fields.
7.3 Bibliographical Remarks
The use of confidence bounds and the idea of optimism first appeared in the work
by Lai and Robbins [1985]. They analysed the asymptotics for various parametric
bandit problems (see the next chapter for more details on this). The first version
of UCB is by Lai [1987]. Other early work is by Katehakis and Robbins [1995],
who gave a very straightforward analysis for the Gaussian case, and Agrawal
[1995], who noticed that all that was needed is an appropriate sequence of
upper confidence bounds on the unknown means. In this way, their analysis is
significantly more general than what we have done here. These researchers also
focused on the asymptotics, which at the time was the standard approach in
the statistics literature. The UCB algorithm was independently discovered by
Kaelbling [1993], although with no regret analysis or clear advice on how to tune
the confidence parameter. The version of UCB discussed here is most similar to
that analysed by Auer et al. [2002a] under the name UCB1, but that algorithm
used t rather than n in the confidence level (see the next chapter). Like us, they
prove a finite-time regret bound. However, rather than considering 1-subgaussian
environments, Auer et al. [2002a] considers bandits where the pay-offs are confined
to the [0, 1] interval, which are ensured to be 1/2-subgaussian. See Exercise 7.2
for hints on what must change in this situation. The basic structure of the proof
of our Theorem 7.1 is essentially the same as that of theorem 1 of Auer et al.
[2002a]. The worst-case bound in Theorem 7.2 appeared in the book by Bubeck
and Cesa-Bianchi [2012], which also popularised the subgaussian set-up. We did
not have time to discuss the situation where the subgaussian constant is unknown.
There have been several works exploring this direction. If the variance is unknown,
but the noise is bounded, then one can replace the subgaussian concentration
bounds with an empirical Bernstein inequality [Audibert et al., 2007]. For details,
see Exercise 7.6. If the noise has heavy tails, then a more serious modification is
required, as discussed in Exercise 7.7 and the note that follows.
We found the article by Sharot [2011a] on optimism bias from the psychology
literature quite illuminating. Readers looking to dive deeper into this literature
may enjoy the book by the same author [Sharot, 2011b]. Optimism bias is also
known as ‘unrealistic optimism’, a term that is most puzzling to us – what bias
is ever realistic? The background of this is explained by Jefferson et al. [2017].
7.4 Exercises
(b) Now relax the assumption that T is independent from (Xt )t . Let Et =
I {T = t} be the event that T = t and Ft = σ(X1 , . . . , Xt ) be the σ-algebra
generated by the first t samples. Let δ ∈ (0, 1) and show there exists a T
such that for all t ∈ {1, 2, 3, . . .} it holds that Et is Ft -measurable and
$$P\left(\hat\mu - \mu \ge \sqrt{\frac{2\log(1/\delta)}{T}}\right) = 1\,.$$
Hint For part (b) above, you may find it useful to apply the law of the iterated
logarithm, which says if X1 , X2 , . . . is a sequence of independent and identically
distributed random variables with zero mean and unit variance, then
$$\limsup_{n\to\infty}\frac{\sum_{t=1}^n X_t}{\sqrt{2n\log\log n}} = 1 \qquad \text{almost surely}\,.$$
This result is especially remarkable because it relies on no assumptions other
than zero mean and unit variance. You might wonder if Eq. (7.13) might continue
to hold if log(T (T + 1)/δ) were replaced by log(log(T )/δ). It almost does, but
the proof of this fact is more sophisticated. For more details, see the paper by
Garivier [2013] or Exercise 20.9.
7.2 (Relaxing the subgaussian assumption) In this chapter, we assumed
the pay-off distributions were 1-subgaussian. The purpose of this exercise is to
relax this assumption.
(a) First suppose that σ² > 0 is a known constant and that $\nu \in \mathcal{E}^k_{\mathrm{SG}}(\sigma^2)$. Modify
the UCB algorithm and state and prove an analogue of Theorems 7.1 and 7.2
for this case.
(b) Now suppose that ν = (Pi )ki=1 is chosen so that Pi is σi -subgaussian where
(σi2 )ki=1 are known. Modify the UCB algorithm and state and prove an
analogue of Theorems 7.1 and 7.2 for this case.
(c) If you did things correctly, the regret bound in the previous part should not
depend on the values of {σi2 : ∆i = 0}. Explain why not.
where g and f should be as small as possible (there are trade-offs – try and come
up with a natural choice).
7.4 (Phased UCB (i)) Fix a 1-subgaussian k-armed bandit environment and a
horizon n. Consider the version of UCB that works in phases of exponentially
increasing length of 1, 2, 4, . . .. In each phase, the algorithm uses the action that
would have been chosen by UCB at the beginning of the phase (see Algorithm 4
below).
(a) State and prove a bound on the regret for this version of UCB.
(b) Compare your result with Theorem 7.1.
(c) How would the result change if the ℓth phase had a length of $\alpha^\ell$ with
α > 1?
1: Input k and δ
2: Choose each arm once
3: for ℓ = 1, 2, . . . do
4:   Compute $A_\ell = \operatorname{argmax}_i \mathrm{UCB}_i(t-1, \delta)$
5:   Choose arm $A_\ell$ exactly $2^\ell$ times
6: end for
Algorithm 4: A phased version of UCB.
7.5 (Phased UCB (ii)) Let α > 1 and consider the version of UCB that first
plays each arm once. Thereafter it operates in the same way as UCB, but rather
than playing the chosen arm just once, it plays it until the number of plays of
that arm is a factor of α larger (see Algorithm 5 below).
(a) State and prove a bound on the regret for version of UCB with α = 2
(doubling counts).
(b) Compare with the result of the previous exercise and with Theorem 7.1.
What can you conclude?
(c) Repeat the analysis for α > 1. What is the role of α?
(d) Implement these algorithms and compare them empirically to UCB(δ).
1: Input k and δ
2: Choose each arm once
3: for ℓ = 1, 2, . . . do
4:   Let $t_\ell = t$
5:   Compute $A_\ell = \operatorname{argmax}_i \mathrm{UCB}_i(t_\ell - 1, \delta)$
6:   Choose arm $A_\ell$ until round t such that $T_{A_\ell}(t) \ge \alpha T_{A_\ell}(t_\ell - 1)$
7: end for
Algorithm 5: A phased version of UCB.
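As a starting point for part (d) of the previous exercise, the following Python sketch implements Algorithm 5 for a generic α > 1; the index matches UCB(δ), and `pull` is again a hypothetical reward oracle.

```python
import math
import random

def phased_ucb(pull, k, n, delta, alpha=2.0):
    """Sketch of Algorithm 5: UCB(delta) whose chosen arm is only re-selected
    once its play count has grown by a factor of alpha."""
    sums = [0.0] * k
    counts = [0] * k
    t = 0
    for i in range(k):                            # choose each arm once
        sums[i] += pull(i)
        counts[i] += 1
        t += 1
    while t < n:
        def index(i):
            return sums[i] / counts[i] + math.sqrt(2 * math.log(1 / delta) / counts[i])
        arm = max(range(k), key=index)
        target = math.ceil(alpha * counts[arm])   # play until the count reaches this
        while t < n and counts[arm] < target:
            sums[arm] += pull(arm)
            counts[arm] += 1
            t += 1
    return counts

phased_ucb(lambda i: random.gauss([0.0, -0.3, -0.6][i], 1.0), k=3, n=2000, delta=1e-6)
```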
The algorithms of the last two exercises may seem ridiculous. Why would
you wait before updating empirical estimates and choosing a new action?
There are at least two reasons:
(a) It can happen that the algorithm does not observe its rewards
immediately, but rather they appear asynchronously after some delay.
Alternatively many bandits algorithms may be operating simultaneously
and the results must be communicated at some cost.
(b) If the feedback model has a more complicated structure than what we
examined so far, then even computing the upper confidence bound just
once can be quite expensive. In these circumstances, it’s comforting to
know that the loss of performance by updating the statistics only rarely
is not too severe.
(a) Show that $\hat\sigma^2 = \frac{1}{n}\sum_{t=1}^n (X_t - \mu)^2 - (\hat\mu - \mu)^2$.
(b) Show that $V[(X_t - \mu)^2] \le b^2\sigma^2$.
(c) Use Bernstein's inequality (Exercise 5.14) to show that
$$P\left(\hat\sigma^2 \ge \sigma^2 + \sqrt{\frac{2b^2\sigma^2}{n}\log\frac{1}{\delta}} + \frac{2b^2}{3n}\log\frac{1}{\delta}\right) \le \delta\,.$$
(d) Suppose that $\nu = (\nu_i)_{i=1}^k$ is a bandit where $\mathrm{Supp}(\nu_i) \subset [0, b]$ and the variance
of the ith arm is $\sigma_i^2$ (with our earlier notation, $\nu \in \mathcal{E}^k_{[0,b]}$). Design a policy
$$R_n \le C\sum_{i:\Delta_i>0}\left(\Delta_i + \left(b + \frac{\sigma_i^2}{\Delta_i}\right)\log(n)\right), \tag{7.14}$$
If you did things correctly, then the policy you derived in Exercise 7.6
should resemble UCB-V by Audibert et al. [2007]. The proof of the empirical
Bernstein also appears there or (with slightly better constants) in the papers
by Mnih et al. [2008] and Maurer and Pontil [2009].
(a) Show that if $m = \min\left\{\left\lfloor\frac{n}{2}\right\rfloor,\ \left\lceil 8\log\frac{e^{1/8}}{\delta}\right\rceil\right\}$ and the $A_i$ are chosen as equally
sized as possible, then
$$P\left(\hat\mu_M + \sqrt{\frac{192\sigma^2}{n}\log\frac{e^{1/8}}{\delta}} \le \mu\right) \le \delta\,.$$
$$R_n \le C\sum_{i:\Delta_i>0}\left(\Delta_i + \frac{\sigma^2\log(n)}{\Delta_i}\right),$$
The algorithm analysed in the previous chapter is not anytime. This shortcoming
is resolved via a slight modification and a refinement of the analysis. The improved
analysis leads to constant factors in the dominant logarithmic term that match a
lower bound provided later in Chapter 16.
The algorithm studied is shown in Algorithm 6. It differs from the one analysed
in the previous chapter (Algorithm 3) only by the choice of the confidence level,
which is dictated by the analysis of its regret.
1: Input k
2: Choose each arm once
3: Subsequently choose
$$A_t = \operatorname{argmax}_i \left(\hat\mu_i(t-1) + \sqrt{\frac{2\log f(t)}{T_i(t-1)}}\right),$$
where $f(t) = 1 + t\log^2(t)$.
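In code, only the index changes relative to the UCB(δ) sketch from the previous chapter; a minimal Python sketch of the new, horizon-free index is shown below.

```python
import math

def anytime_ucb_index(mean, count, t):
    """Index of Algorithm 6 in round t >= 2 for an arm with `count` >= 1 plays
    and empirical mean `mean`: the fixed confidence level log(1/delta) of
    UCB(delta) is replaced by log f(t) with f(t) = 1 + t * log(t)**2."""
    f_t = 1 + t * math.log(t) ** 2
    return mean + math.sqrt(2 * math.log(f_t) / count)
```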
The regret bound for Algorithm 6 is more complicated than the bound for
Algorithm 3 (see Theorem 7.1). The dominant terms in the two results have the
same order, but the gain here is that in this result the leading constant, governing
the asymptotic rate of growth of regret, is smaller.
Theorem 8.1. For any 1-subgaussian bandit, the regret of Algorithm 6 satisfies
$$R_n \le \sum_{i:\Delta_i>0} \inf_{\varepsilon\in(0,\Delta_i)} \Delta_i\left(1 + \frac{5}{\varepsilon^2} + \frac{2\left(\log f(n) + \sqrt{\pi\log f(n)} + 1\right)}{(\Delta_i - \varepsilon)^2}\right). \tag{8.1}$$
Furthermore,
$$\limsup_{n\to\infty}\frac{R_n}{\log(n)} \le \sum_{i:\Delta_i>0}\frac{2}{\Delta_i}\,. \tag{8.2}$$
Even more concretely, there exists some universal constant C > 0 such that
$$R_n \le C\sum_{i:\Delta_i>0}\left(\Delta_i + \frac{\log(n)}{\Delta_i}\right), \tag{8.3}$$
Taking the limit of the ratio of the bound in (8.3) and log(n) does not result
in the same constant as in the theorem, which is the main justification for
introducing the more complicated regret bound. You will see in Chapter 15
that the asymptotic bound on the regret given in (8.2) is unimprovable in a
strong sense.
We start with a useful lemma to bound the number of times the index of a
suboptimal arm will be larger than some threshold above its mean.
Lemma 8.2. Let $X_1, \dots, X_n$ be a sequence of independent 1-subgaussian random
variables, $\hat\mu_t = \frac{1}{t}\sum_{s=1}^t X_s$, $\varepsilon > 0$, $a > 0$ and
$$\kappa = \sum_{t=1}^n \mathbb{I}\left\{\hat\mu_t + \sqrt{\frac{2a}{t}} \ge \varepsilon\right\}, \qquad \kappa' = u + \sum_{t=\lceil u\rceil}^n \mathbb{I}\left\{\hat\mu_t + \sqrt{\frac{2a}{t}} \ge \varepsilon\right\},$$
where $u = 2a\varepsilon^{-2}$. Then it holds that $E[\kappa] \le E[\kappa'] \le 1 + \frac{2}{\varepsilon^2}\left(a + \sqrt{\pi a} + 1\right)$.
The intuition for this result is as follows. Since the $X_i$ are 1-subgaussian and
independent, we have $E[\hat\mu_t] = 0$, so we cannot expect $\hat\mu_t + \sqrt{2a/t}$ to be smaller
than $\varepsilon$ until $t$ is at least $2a/\varepsilon^2$. The lemma confirms that this is the right order
as an estimate for $E[\kappa]$.
Proof By Corollary 5.5 we have
$$E[\kappa] \le E[\kappa'] = u + \sum_{t=\lceil u\rceil}^n P\left(\hat\mu_t + \sqrt{\frac{2a}{t}} \ge \varepsilon\right) \le u + \sum_{t=\lceil u\rceil}^n \exp\left(-\frac{t\left(\varepsilon - \sqrt{\frac{2a}{t}}\right)^2}{2}\right) \le 1 + u + \int_u^\infty \exp\left(-\frac{t\left(\varepsilon - \sqrt{\frac{2a}{t}}\right)^2}{2}\right)dt = 1 + \frac{2}{\varepsilon^2}\left(a + \sqrt{\pi a} + 1\right),$$
where the final equality follows by making the substitution $s = \varepsilon\sqrt{t} - \sqrt{2a}$ and
substituting the value of u from the lemma statement.
Proof of Theorem 8.1 As usual, the starting point is the fundamental regret
decomposition (Lemma 4.5),
$$R_n = \sum_{i:\Delta_i>0}\Delta_i E[T_i(n)]\,.$$
The rest of the proof revolves around bounding E[Ti (n)]. Let i be a suboptimal
arm. The main idea is to decompose Ti (n) into two terms. The first measures the
number of times the index of the optimal arm is less than µ1 − ε. The second term
measures the number of times that At = i and its index is larger than µ1 − ε.
$$T_i(n) = \sum_{t=1}^n \mathbb{I}\{A_t = i\} \le \sum_{t=1}^n \mathbb{I}\left\{\hat\mu_1(t-1) + \sqrt{\frac{2\log f(t)}{T_1(t-1)}} \le \mu_1 - \varepsilon\right\} + \sum_{t=1}^n \mathbb{I}\left\{\hat\mu_i(t-1) + \sqrt{\frac{2\log f(t)}{T_i(t-1)}} \ge \mu_1 - \varepsilon \text{ and } A_t = i\right\}. \tag{8.4}$$
The proof of the first part of the theorem is completed by bounding the expectation
of each of these two sums. Starting with the first, we again use Corollary 5.5:
" ( s )#
2 log f (t)
n
X
E I µ̂1 (t − 1) + ≤ µ1 − ε
t=1
T1 (t − 1)
r !
2 log f (t)
Xn X n
≤ P µ̂1s + ≤ µ1 − ε
t=1 s=1
s
q 2
2 log f (t)
Xn X n s s +ε
≤ exp −
t=1 s=1
2
1 X sε2 5
Xn n
≤ exp − ≤ 2.
t=1
f (t) s=1 2 ε
The first inequality follows from the union bound over all possible values of
T1 (t − 1). The last inequality is an algebraic exercise (Exercise 8.1). The function
f (t) was chosen precisely so this bound would hold. For the second term in (8.4)
The first part of the theorem follows by substituting the results of the previous
two displays into (8.4). The second part follows by choosing ε = log−1/4 (n) and
taking the limit as n tends to infinity.
8.2 Notes
1 The improvement to the constants comes from making the confidence interval
slightly smaller, which is made possible by a more careful analysis. The main
trick is the observation that we do not need to show that µ̂1s ≥ µ1 for all s
with high probability, but instead that µ̂1s ≥ µ1 − ε for small ε.
2 The choice of f (t) = 1 + t log2 (t) looks quite odd. With a slightly messier
calculation we could have chosen f (t) = t logα (t) for any α > 0. If the rewards
are actually Gaussian, then a more careful concentration analysis allows one
to choose f (t) = t or even some slightly slower-growing function [Katehakis
and Robbins, 1995, Lattimore, 2016a, Garivier et al., 2016b].
3 The asymptotic regret is often indicative of finite-time performance. The reader
is advised to be cautious, however. The lower-order terms obscured by the
asymptotics can be dominant in all practical regimes.
Lai and Robbins [1985] designed policies for which Eq. (8.2) holds. They also
proved a lower bound showing that no ‘reasonable’ policy can improve on this
bound for any problem, where ‘reasonable’ means that they suffer subpolynomial
regret on all problems (see Part IV). The policy proposed by Lai and Robbins
[1985] was based on upper confidence bounds, but was not a variant of UCB. The
asymptotics for variants of the policy presented here were given first by Lai [1987],
Katehakis and Robbins [1995] and Agrawal [1995]. None of these articles gave
finite-time bounds like what was presented here. When the reward distributions
lie in an exponential family, then asymptotic and finite-time bounds with the
same flavor to what is presented here are given by Cappé et al. [2013]. There are
now a huge variety of asymptotically optimal policies in a wide range of settings.
Burnetas and Katehakis [1996] study the general case and give conditions for
a version of UCB to be asymptotically optimal. Honda and Takemura [2010,
2011] analyse an algorithm called DMED, proving asymptotic optimality for noise
models where the support is bounded or semi-bounded. Kaufmann et al. [2012b]
prove asymptotic optimality for Thompson sampling (see Chapter 36) when
the rewards are Bernoulli, which is generalised to single-parameter exponential
families by Korda et al. [2013]. Kaufmann [2018] proves asymptotic optimality
for the BayesUCB class of algorithms for single-parameter exponential families.
Ménard and Garivier [2017] prove asymptotic optimality and minimax optimality
for exponential families (more discussion in Chapter 9).
8.4 Exercises
8.1 Do the algebra needed at the end of the proof of Theorem 8.1. Precisely,
show that
$$\sum_{t=1}^n \frac{1}{f(t)}\sum_{s=1}^n \exp\left(-\frac{s\varepsilon^2}{2}\right) \le \frac{5}{\varepsilon^2}\,,$$
8.3 (One-armed bandits (ii)) Consider the setting of Exercise 8.2 and define
a policy by
$$A_t = \begin{cases} 1 & \text{if } \hat\mu_1(t-1) + \sqrt{\frac{2\log f(t)}{T_1(t-1)}} \ge 0\,,\\[2pt] 2 & \text{otherwise}\,.\end{cases} \tag{8.5}$$
Suppose that ν = (P1 , P2 ) where P1 = N (µ1 , 1) and P2 = N (0, 1). Prove that
for the modified policy,
$$\limsup_{n\to\infty}\frac{R_n(\nu)}{\log(n)} \le \begin{cases} 0 & \text{if } \mu_1 \ge 0\,,\\[2pt] -\frac{2}{\mu_1} & \text{if } \mu_1 < 0\,.\end{cases}$$
Hint Follow the analysis for UCB, but carefully adapt the proof by using the
fact that the index of the second arm is always zero.
The strategy proposed in the above exercise is based on the idea that
optimism is used to overcome uncertainty in the estimates of the quality of
an arm, but for one-armed bandits the mean of the second arm is known in
advance.
We proved that the variants of UCB analysed in the last two chapters have a
worst-case regret of $R_n = O(\sqrt{kn\log(n)})$. Further, in Exercise 6.8 you showed
that an elimination algorithm achieves $R_n = O(\sqrt{kn\log(k)})$. By modifying the
confidence levels of the algorithm it is possible to remove the log factor entirely.
Building on UCB, the directly named ‘minimax optimal strategy in the stochastic
case’ (MOSS) algorithm was the first to make this modification and is presented
below. MOSS again depends on prior knowledge of the horizon, a requirement
that may be relaxed, as we explain in the notes.
The term minimax is used because, except for constant factors, the worst-
case bound proven in this chapter cannot be improved on by any algorithm.
The lower bounds are deferred to Part IV.
1: Input n and k
2: Choose each arm once
3: Subsequently choose
$$A_t = \operatorname{argmax}_i \left(\hat\mu_i(t-1) + \sqrt{\frac{4}{T_i(t-1)}\log^+\left(\frac{n}{kT_i(t-1)}\right)}\right),$$
where $\log^+(x) = \log\max\{1, x\}$.
Algorithm 7: MOSS.
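In code, MOSS differs from the earlier UCB sketches only in its exploration bonus; a minimal Python sketch of the index:

```python
import math

def moss_index(mean, count, n, k):
    """MOSS index (Algorithm 7) for an arm with `count` >= 1 plays, empirical
    mean `mean`, horizon n and k arms; log_plus(x) = log(max(1, x))."""
    log_plus = math.log(max(1.0, n / (k * count)))
    return mean + math.sqrt(4 * log_plus / count)
```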
Theorem 9.1. For any 1-subgaussian bandit, the regret of Algorithm 7 satisfies
$$R_n \le 39\sqrt{kn} + \sum_{i=1}^k \Delta_i\,.$$
Before the proof we state and prove a strengthened version of Corollary 5.5.
The bound in Eq. (9.1) is the same as the bound on P (Sn ≥ ε) that appears
in a simple reformulation of Corollary 5.5, so this new result is strictly stronger.
Proof From the definition of subgaussian random variables and Lemma 5.4,
$$E\left[\exp(\lambda S_n)\right] \le \exp\left(\frac{n\sigma^2\lambda^2}{2}\right).$$
The novel step is the first inequality, which follows from Doob's submartingale
inequality (Theorem 3.10) and the fact that $\exp(\lambda S_t)$ is a submartingale
with respect to the filtration generated by $X_1, X_2, \dots, X_n$ (Exercise 9.1).
Before the proof of Theorem 9.1, we need one more lemma to bound the
probability that the index of the optimal arm ever drops too far below the actual
mean of the optimal arm. The proof of this lemma relies on a tool called the
peeling device, which is an important technique in probability theory and has
many applications beyond bandits. For example, it can be used to prove the
celebrated law of the iterated logarithm.
$$\le \sum_{j=0}^\infty \exp\left(-\frac{\left(\sqrt{2^{j+2}\log^+\frac{1}{2^{j+1}\delta}} + 2^j\Delta\right)^2}{2^{j+2}}\right).$$
The first inequality follows from a union bound over a geometric grid. The second
step is straightforward but important because it sets up to apply Theorem 9.2.
The rest is purely algebraic:
$$\sum_{j=0}^\infty \exp\left(-\frac{\left(\sqrt{2^{j+2}\log^+\frac{1}{2^{j+1}\delta}} + 2^j\Delta\right)^2}{2^{j+2}}\right) \le \delta\sum_{j=0}^\infty 2^{j+1}\exp\left(-\Delta^2 2^{j-2}\right) \le \frac{8\delta}{e\Delta^2} + \delta\int_0^\infty 2^{s+1}\exp\left(-\Delta^2 2^{s-2}\right)ds \le \frac{15\delta}{\Delta^2}\,.$$
Above, the first inequality follows since $(a+b)^2 \ge a^2 + b^2$ for $a, b \ge 0$, and
the second-last step follows by noting that the integrand is unimodal and
has a maximum value of $8\delta/(e\Delta^2)$. For such functions $f$, one has the bound
$\sum_{j=a}^b f(j) \le \max_{s\in[a,b]} f(s) + \int_a^b f(s)\,ds$.
Proof of Theorem 9.1 As usual, we assume without loss of generality that the
first arm is optimal, so µ1 = µ∗ . Arguing that the optimal arm is sufficiently
optimistic with high probability is no longer satisfactory because in this refined
analysis, the probability that an arm is played linearly often needs to depend
on its suboptimality gap. A way around this difficulty is to make an argument
in terms of the expected amount of optimism. Define a random variable ∆ that
measures how far the index of the optimal arm drops below its true mean:
$$\Delta = \left(\mu_1 - \min_{s\le n}\left(\hat\mu_{1s} + \sqrt{\frac{4}{s}\log^+\left(\frac{n}{ks}\right)}\right)\right)^{\!+}.$$
Arms with suboptimality gaps much larger than ∆ will not be played too often,
while arms with suboptimality gaps smaller than ∆ may be played linearly often,
but ∆ is sufficiently small in expectation that this price is small. Using the basic
regret decomposition (Lemma 4.5) and splitting the actions based on whether or
not their suboptimality gap is smaller or larger than 2∆ leads to
$$\begin{aligned}
R_n = \sum_{i:\Delta_i>0}\Delta_i E[T_i(n)] &\le E\left[2n\Delta + \sum_{i:\Delta_i>2\Delta}\Delta_i T_i(n)\right]\\
&\le E\left[2n\Delta + 8\sqrt{kn} + \sum_{i:\Delta_i>\max\{2\Delta,\,8\sqrt{k/n}\}}\Delta_i T_i(n)\right].
\end{aligned}$$
The first term is easily bounded using Proposition 2.8 and Lemma 9.3:
$$E[2n\Delta] = 2nE[\Delta] = 2n\int_0^\infty P(\Delta \ge x)\,dx \le 2n\int_0^\infty \min\left\{1, \frac{15k}{nx^2}\right\}dx \le 16\sqrt{kn}\,.$$
For the second term, for each suboptimal arm i define
$$\kappa_i = \sum_{s=1}^n \mathbb{I}\left\{\hat\mu_{is} + \sqrt{\frac{4}{s}\log^+\left(\frac{n}{ks}\right)} \ge \mu_i + \Delta_i/2\right\}.$$
The reason for choosing κi in this way is that for arms i with ∆i > 2∆, it holds
that the index of the optimal arm is always larger than µi + ∆i /2, so κi is an
upper bound on the number of times arm i is played, $T_i(n)$. If $\Delta_i \ge 8\sqrt{k/n}$,
then the expectation of $\Delta_i\kappa_i$ is bounded using Lemma 8.2 by
$$\begin{aligned}
\Delta_i E[\kappa_i] &\le \frac{1}{\Delta_i} + \Delta_i E\left[\sum_{s=1}^n \mathbb{I}\left\{\hat\mu_{is} + \sqrt{\frac{4}{s}\log^+\left(\frac{n\Delta_i^2}{k}\right)} \ge \mu_i + \Delta_i/2\right\}\right]\\
&\le \frac{1}{\Delta_i} + \Delta_i + \frac{8}{\Delta_i}\left(2\log^+\left(\frac{n\Delta_i^2}{k}\right) + \sqrt{2\pi\log^+\left(\frac{n\Delta_i^2}{k}\right)} + 1\right)\\
&\le \frac{1}{8}\sqrt{\frac{n}{k}} + \Delta_i + 4\sqrt{\frac{n}{k}}\log 8 + \left(2\sqrt{\pi\log 8} + 1\right)\sqrt{\frac{n}{k}} \le \Delta_i + 15\sqrt{\frac{n}{k}}\,,
\end{aligned}$$
where the first inequality follows by replacing the $s$ in the logarithm with $1/\Delta_i^2$
and adding the $\Delta_i \times 1/\Delta_i^2$ correction term to compensate for the first $\Delta_i^{-2}$
rounds where this fails to hold. Then we use Lemma 8.2 and the monotonicity of
$x \mapsto x^{-1-p}\log^+(ax^2)$ for $p \in [0, 1]$, positive $a$ and $x \ge e/\sqrt{a}$. The last inequality
follows by naively bounding $1/8 + 4\log 8 + 2\sqrt{\pi\log 8} + 1 \le 15$. Then
$$E\left[\sum_{i:\Delta_i>\max\{2\Delta,\,8\sqrt{k/n}\}}\Delta_i T_i(n)\right] \le E\left[\sum_{i:\Delta_i>8\sqrt{k/n}}\Delta_i \kappa_i\right] \le \sum_{i:\Delta_i>8\sqrt{k/n}}\left(\Delta_i + 15\sqrt{\frac{n}{k}}\right) \le 15\sqrt{nk} + \sum_{i=1}^k \Delta_i\,.$$
Combining all the results, we have $R_n \le 39\sqrt{kn} + \sum_{i=1}^k \Delta_i$.
A rigorous proof of this claim is quite delicate, but we encourage readers to try
to understand why it holds intuitively.
Instability
There is a hidden cost of pushing too hard to reduce the expected regret, which
is that the distribution of the regret is less well-behaved. Consider a two-armed
Gaussian bandit with suboptimality gap ∆. The random (pseudo) regret is
$\hat R_n = \sum_{t=1}^n \Delta_{A_t}$, which for a carefully tuned algorithm has a roughly bimodal
distribution:
$$\hat R_n \approx \begin{cases} n\Delta & \text{with probability } \delta\,,\\[2pt] \frac{1}{\Delta}\log\frac{1}{\delta} & \text{otherwise}\,,\end{cases}$$
where δ is a parameter of the policy that determines the likelihood that the
optimal arm is misidentified. Integrating, one has
$$R_n = E[\hat R_n] = O\left(n\Delta\delta + \frac{1}{\Delta}\log\frac{1}{\delta}\right),$$
The choice of δ that minimises the expected regret depends on ∆ and is
approximately $1/(n\Delta^2)$. With this choice, the regret is
$$R_n = O\left(\frac{1}{\Delta}\left(1 + \log(n\Delta^2)\right)\right).$$
Of course ∆ is not known in advance, but it can be estimated online so that the
above bound is actually realisable by an adaptive policy that does not know ∆
in advance (Exercise 9.3). Let F be the (informal) event that R̂n = Ω(n∆). The
problem is that when δ = 1/(n∆2 ) is chosen to minimise the expected regret,
then the second moment due to failure is
$$E[\mathbb{I}_F \hat R_n^2] = \Omega(n)\,.$$
On the other hand, by choosing $\delta = (n\Delta)^{-2}$, the regret increases only slightly to
$$R_n = O\left(\frac{1}{\Delta}\left(\frac{1}{n} + \log(n^2\Delta^2)\right)\right).$$
The second moment of the regret due to failure, however, is $E[\mathbb{I}_F \hat R_n^2] = O(1)$.
9.3 Notes
Exercise 9.3 that it has a distribution-free regret of $O(\sqrt{nk\log(k)})$. An
algorithm that does almost the same thing in disguise is called 'improved
algorithm that does almost the same thing in disguise is called ‘improved
UCB’, which operates in phases and eliminates arms for which the upper
confidence bound drops below a lower confidence bound for some arm [Auer
and Ortner, 2010]. This algorithm was the topic of Exercise 6.8.
3 Overcoming the failure of MOSS to be instance optimal without sacrificing
minimax optimality is possible by using an adaptive confidence level that tunes
the amount of optimism to match the instance. One of the authors has proposed
two ways to do this, using one of the following indices:
$$\hat\mu_i(t-1) + \sqrt{\frac{2(1+\varepsilon)}{T_i(t-1)}\log\left(\frac{n}{t}\right)}\,, \quad\text{or} \tag{9.3}$$
$$\hat\mu_i(t-1) + \sqrt{\frac{2}{T_i(t-1)}\log\left(\frac{n}{\sum_{j=1}^k \min\left\{T_i(t-1), \sqrt{T_i(t-1)T_j(t-1)}\right\}}\right)}\,.$$
The first of these algorithms is called the ‘optimally confident UCB’ [Lattimore,
2015b] while the second is AdaUCB [Lattimore, 2018]. Both algorithms are
minimax optimal up to constant factors and never worse than UCB. The
latter is also asymptotically optimal. If the horizon is unknown, then AdaUCB
can be modified by replacing n with t. It remains a challenge to provide a
straightforward analysis for these algorithms.
9.5 Exercises
9.2 (Problem-dependent bound) Let $\Delta_{\min} = \min_{i:\Delta_i>0}\Delta_i$. Show there exists
(a) Show that for all 1-subgaussian bandits, this new policy suffers regret at
most
$$R_n \le C\sum_{i:\Delta_i>0}\left(\Delta_i + \frac{1}{\Delta_i}\log^+(n\Delta_i^2)\right),$$
(g) Let $h : (0, \infty) \to (1, \infty)$ be a concave increasing function such that
$\sqrt{\log(h(a))/h(a)} \le c/a$ for some constant $c > 0$, and let $f(t) = \sqrt{2t\log h(1/(t\delta))} + t\Delta$.
Show that
$$P\left(\text{exists } t : \sum_{s=1}^t X_s \ge f(t)\right) \le \frac{2c\delta}{\sqrt{\pi}\Delta^2}\,.$$
(h) Show that $h(a) = 1 + \sqrt{(1+a)\log(1+a)}$ satisfies the requirements of the
previous part with c = 11/10.
(i) Use your results to modify MOSS for the case when the rewards are Gaussian.
Compare the algorithms empirically.
(j) Prove for your modified algorithm that
$$\limsup_{n\to\infty}\frac{R_n}{\log(n)} \le \sum_{i:\Delta_i>0}\frac{2}{\Delta_i}\,.$$
Hint The above exercise has several challenging components and assumes
prior knowledge of Brownian motion and its interpretation in terms of the heat
equation. We recommend the book by Lerche [1986] as a nice reference on hitting
times for Brownian motion against concave barriers. The equation you derived in
Part (d) is called the Bachelier–Lévy formula , and the technique for doing
so is the method of images. The use of this theory in bandits was introduced
by one of the authors [Lattimore, 2018], which readers might find useful when
working through these questions.
9.5 (Asymptotic optimality and subgaussian noise) In the last exercise,
you modified MOSS to show asymptotic optimality when the noise is Gaussian.
This is also possible for subgaussian noise. Follow the advice in the notes of this
chapter to adapt MOSS so that for all 1-subgaussian bandits, it holds that
$$\limsup_{n\to\infty}\frac{R_n}{\log(n)} \le \sum_{i:\Delta_i>0}\frac{2}{\Delta_i}\,,$$
while maintaining the property that $R_n \le C\sqrt{kn}$ for a universal constant C > 0.
10 The Upper Confidence Bound
Algorithm: Bernoulli Noise ( )
In previous chapters we assumed that the noise of the rewards was σ-subgaussian
for some known σ > 0. This has the advantage of simplicity and relative generality,
but stronger assumptions are sometimes justified and often lead to stronger results.
In this chapter the rewards are assumed to be Bernoulli, which just means that
Xt ∈ {0, 1}. This is a fundamental setting found in many applications. For
example, in click-through prediction, the user either clicks on the link or not. A
Bernoulli bandit is characterised by the mean pay-off vector µ ∈ [0, 1]k and the
reward observed in round t is Xt ∼ B(µAt ).
The Bernoulli distribution is 1/2-subgaussian regardless of its mean
(Exercise 5.12). Hence the results of the previous chapters are applicable, and an
appropriately tuned UCB enjoys logarithmic regret. The additional knowledge
that the rewards are Bernoulli is not being fully exploited by these algorithms,
however. The reason is essentially that the variance of a Bernoulli random
variable depends on its mean, and when the variance is small, the empirical mean
concentrates faster, a fact that should be used to make the confidence intervals
smaller.
The first step when designing a new optimistic algorithm is to construct confidence
sets for the unknown parameters. For Bernoulli bandits, this corresponds to
analysing the concentration of the empirical mean for sums of Bernoulli random
variables. For this, the following definition will prove useful: for p, q ∈ [0, 1], the
relative entropy between Bernoulli distributions with means p and q is
$$d(p, q) = p\log\frac{p}{q} + (1-p)\log\frac{1-p}{1-q}\,,$$
where singularities are defined by taking limits: d(0, q) = log(1/(1 − q)) and
d(1, q) = log(1/q) for q ∈ [0, 1], and d(p, 0) = 0 if p = 0 and ∞ otherwise, and
d(p, 1) = 0 if p = 1 and ∞ otherwise.
(a) The functions d(·, q) and d(p, ·) are convex and have unique minimisers at q
and p, respectively.
(b) $d(p, q) \ge 2(p - q)^2$ (Pinsker's inequality).
(c) If $p \le q - \varepsilon \le q$, then $d(p, q-\varepsilon) \le d(p, q) - d(q-\varepsilon, q) \le d(p, q) - 2\varepsilon^2$.
Proof We assume that p, q ∈ (0, 1). The corner cases are easily checked
separately. Part (a): d(·, q) is the sum of the negative binary entropy function
h(p) = p log p + (1 − p) log(1 − p) and a linear function. The second derivative
of h is h00 (p) = 1/p + 1/(1 − p), which is positive, and hence h is convex. For
fixed p the function d(p, ·) is the sum of h(p) and convex functions p log(1/q) and
(1 − p) log(1/(1 − q)). Hence d(p, ·) is convex. The minimiser property follows
because d(p, q) > 0 unless p = q in which case d(p, p) = d(q, q) = 0. A more
general version of (b) is given in Chapter 15. A proof of the simple version here
follows by considering the function g(x) = d(p, p + x) − 2x2 , which obviously
satisfies g(0) = 0. The proof is finished by showing that this is the unique
minimiser of g over the interval [−p, 1 − p]. The details are left to Exercise 10.1.
For (c), notice that
$$h(p) = d(p, q-\varepsilon) - d(p, q) = p\log\frac{q}{q-\varepsilon} + (1-p)\log\frac{1-q}{1-q+\varepsilon}\,.$$
It is easy to see then that h is linear and increasing in its argument. Therefore,
since p ≤ q − ε, we have $h(p) \le h(q-\varepsilon) = -d(q-\varepsilon, q)$,
as required for the first inequality of (c). The second inequality follows by using
the result in (b).
The next lemma controls the concentration of the sample mean of a sequence
of independent and identically distributed Bernoulli random variables.
Proof We will again use the Cramér–Chernoff method. Let λ > 0 be some
constant to be chosen later. Then,
$$\begin{aligned}
P(\hat\mu \ge \mu + \varepsilon) &= P\left(\exp\left(\lambda\sum_{t=1}^n (X_t - \mu)\right) \ge \exp(\lambda n\varepsilon)\right)\\
&\le \frac{E\left[\exp\left(\lambda\sum_{t=1}^n (X_t - \mu)\right)\right]}{\exp(\lambda n\varepsilon)}\\
&= \left(\mu\exp(\lambda(1-\mu-\varepsilon)) + (1-\mu)\exp(-\lambda(\mu+\varepsilon))\right)^n.
\end{aligned}$$
Choosing $\lambda = \log\frac{(\mu+\varepsilon)(1-\mu)}{\mu(1-\mu-\varepsilon)}$, which minimises the right-hand side, gives
$$\begin{aligned}
P(\hat\mu \ge \mu + \varepsilon)
&\le \left(\mu\left(\frac{(\mu+\varepsilon)(1-\mu)}{\mu(1-\mu-\varepsilon)}\right)^{1-\mu-\varepsilon} + (1-\mu)\left(\frac{(\mu+\varepsilon)(1-\mu)}{\mu(1-\mu-\varepsilon)}\right)^{-\mu-\varepsilon}\right)^n\\
&= \left(\frac{\mu}{\mu+\varepsilon}\left(\frac{(\mu+\varepsilon)(1-\mu)}{\mu(1-\mu-\varepsilon)}\right)^{1-\mu-\varepsilon}\right)^n\\
&= \exp\left(-n\,d(\mu+\varepsilon, \mu)\right).
\end{aligned}$$
The bound on the left tail is proven identically.
Using Pinsker’s inequality, it follows that P (µ̂ ≥ µ + ε) , P (µ̂ ≤ µ − ε) ≤
exp(−2nε2 ), which is the same as what can be obtained from Hoeffding’s lemma
(see (5.8)). Solving exp(−2nε2 ) = δ, we recover the usual 1 − δ confidence upper
bound. In fact, this cannot be improved when µ ≈ 1/2, but the Chernoff bound
is much stronger when µ is close to either zero or one. Can we invert the Chernoff
tail bound to get confidence intervals that get tighter automatically as µ (or µ̂)
approaches zero or one? The following corollary shows how to do this.
Corollary 10.4. Let µ, µ̂, n be as above. Then, for any a ≥ 0,
P (d(µ̂, µ) ≥ a, µ̂ ≤ µ) ≤ exp(−na) , (10.3)
and P (d(µ̂, µ) ≥ a, µ̂ ≥ µ) ≤ exp(−na) . (10.4)
Furthermore, defining
U (a) = max{u ∈ [0, 1] : d(µ̂, u) ≤ a} ,
and L(a) = min{u ∈ [0, 1] : d(µ̂, u) ≤ a} .
Then, P (µ ≥ U (a)) ≤ exp(−na) and P (µ ≤ L(a)) ≤ exp(−na).
Proof First, we prove (10.3). Note that d(·, µ) is decreasing on [0, µ], and thus,
for 0 ≤ a ≤ d(0, µ), {d(µ̂, µ) ≥ a, µ̂ ≤ µ} = {µ̂ ≤ µ − x, µ̂ ≤ µ} = {µ̂ ≤ µ − x},
where x is the unique solution to d(µ − x, µ) = a on [0, µ]. Hence, by Eq. (10.2)
of Lemma 10.3, P (d(µ̂, µ) ≥ a, µ̂ ≤ µ) ≤ exp(−na). When a ≥ d(0, µ), the
inequality trivially holds. The proof of (10.4) is entirely analogous and hence
is omitted. For the second part of the corollary, fix a and let U = U (a).
First, notice that U ≥ µ̂ and d(µ̂, ·) is strictly increasing on [µ̂, 1]. Hence,
{µ ≥ U } = {µ ≥ U, µ ≥ µ̂} = {d(µ̂, µ) ≥ d(µ̂, U ), µ ≥ µ̂} = {d(µ̂, µ) ≥ a, µ ≥ µ̂},
where the last equality follows by d(µ̂, U ) = a, which holds by the definition
of U . Taking probabilities and using the first part of the corollary shows that
P (µ ≥ U ) ≤ exp(−na). The statement concerning L = L(a) follows with a similar
reasoning.
Note that for δ ∈ (0, 1), U = U (log(1/δ)/n) and L = L(log(1/δ)/n) are upper
and lower confidence bounds for µ. Although the relative entropy has no closed-
form inverse, the optimisation problem that defines U and L can be solved to a
high degree of accuracy using Newton’s method (the relative entropy d is convex
in its second argument). The advantage of this confidence interval relative to
the one derived from Hoeffding’s bound is now clear. As µ̂ approaches one, the
width of the interval U (a) − µ̂ approaches zero,
whereas the width of the interval
provided by Hoeffding's bound stays at $\sqrt{\log(1/\delta)/(2n)}$. The same holds for
µ̂ − L(a) as µ̂ → 0.
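Since d(µ̂, ·) is increasing on [µ̂, 1] and decreasing on [0, µ̂], the bounds U(a) and L(a) can also be computed by a simple one-dimensional search; the Python sketch below uses plain bisection rather than Newton's method, and the function names are ours, not notation from the book.

```python
import math

def kl_bernoulli(p, q):
    """Relative entropy d(p, q) between Bernoulli(p) and Bernoulli(q)."""
    eps = 1e-12  # numerical guard; the true singular values are +infinity
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def kl_upper_bound(mu_hat, a, iters=50):
    """U(a) = max{u in [0, 1] : d(mu_hat, u) <= a}, found by bisection on
    [mu_hat, 1], where d(mu_hat, .) is increasing."""
    lo, hi = mu_hat, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if kl_bernoulli(mu_hat, mid) <= a:
            lo = mid
        else:
            hi = mid
    return lo

# e.g. mu_hat = 3/4 with n = 10 and delta = 0.1 gives a ~= 0.23
print(kl_upper_bound(0.75, math.log(1 / 0.1) / 10))
```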
Example 10.5. Fig. 10.1 shows a plot of d(3/4, x) and the lower bound given
by Pinsker’s inequality. The approximation degrades as |x − 3/4| grows large,
especially for x > 3/4. As explained in Corollary 10.4, the graph of d(µ̂, ·) can
be used to derive confidence bounds by solving for d(µ̂, x) = a = log(1/δ)/n.
Assuming µ̂ = 3/4 is observed, a confidence level of 90 per cent with n = 10,
a ≈ 0.23. The confidence interval can be read out from the figure by finding
those values where the horizontal dashed black line intersects the solid blue line.
The resulting confidence interval will be highly asymmetric. Note that in this
scenario, the lower confidence bounds produced by both Hoeffding’s inequality
and Chernoff’s bound are similar, while the upper bound provided by Hoeffding’s
bound is vacuous.
Figure 10.1 The relative entropy d(3/4, x) for x ∈ [0, 1] together with the Pinsker lower
bound 2(x − 3/4)²; the dashed line marks the level a = 0.23.
The difference between KL-UCB and UCB is that Chernoff’s bound is used to
define the upper confidence bound instead of Lemma 5.5.
1: Input k
2: Choose each arm once
3: Subsequently choose
$$A_t = \operatorname{argmax}_i \max\left\{\tilde\mu \in [0, 1] : d(\hat\mu_i(t-1), \tilde\mu) \le \frac{\log f(t)}{T_i(t-1)}\right\},$$
where $f(t) = 1 + t\log^2(t)$.
Algorithm 8: KL-UCB.
Furthermore, $\displaystyle\limsup_{n\to\infty}\frac{R_n}{\log(n)} \le \sum_{i:\Delta_i>0}\frac{\Delta_i}{d(\mu_i, \mu^*)}\,.$
Comparing the regret in Theorem 10.6 to what would be obtained when using
UCB from Chapter 8, which for subgaussian constant σ = 1/2 satisfies
$$\limsup_{n\to\infty}\frac{R_n}{\log(n)} \le \sum_{i:\Delta_i>0}\frac{1}{2\Delta_i}\,.$$
second shows that the index of any other arm is not often much larger than the
same value. These results mirror those given for UCB, but things are complicated
by the non-symmetric and hard-to-invert divergence function.
For the next results, we define d(p, q) = d(p, q)I {p ≤ q}.
where the second inequality follows, since by the definition of τ , if t > τ , then
the index of the optimal arm is at least as large as µ1 − ε2 . The third inequality
follows from the definition of κ as in the proof of Theorem 8.1. The final inequality
follows from Lemmas 10.7 and 10.8. The first claim of the theorem is completed
by substituting the above into the standard regret decomposition
$$R_n = \sum_{i=1}^k \Delta_i E[T_i(n)]\,.$$
10.3 Notes
1 The new concentration inequality (Lemma 10.3) holds more generally for
any sequence of independent and identically distributed random variables
X1 , X2 , . . . , Xn for which Xt ∈ [0, 1] almost surely. Therefore all results in
this section also hold if the assumption that the noise is Bernoulli is relaxed
to the case where it is simply supported in [0, 1] (or other bounded sets by
shifting/scaling).
2 Expanding on the previous note, all that is required is a bound on the moment-
generating function for random variables X where, X ∈ [0, 1] almost surely.
Garivier and Cappé [2011, Lemma 9] noted that f (x) = exp(λx) − x(exp(λ) −
1) − 1 is negative on [0, 1], and so
E [exp(λX)] ≤ E [X(exp(λ) − 1) + 1] = µ exp(λ) + 1 − µ ,
which is precisely the moment-generating function of the Bernoulli distribution
with mean µ. Then the remainder of the proof of Lemma 10.3 goes through
unchanged. This shows that for any bandit ν = (Pi )i with Supp(Pi ) ∈ [0, 1] for
all i the regret of the policy in Algorithm 8 satisfies
Rn X ∆i
lim sup ≤ .
n→∞ log(n) d(µi , µ∗ )
i:∆i >0
3 The bounds obtained using the argument in the previous note are not quite
tight. Specifically, one can show there exists an algorithm such that for all
bandits ν = (Pi)i with Pi, the reward distribution of the ith arm, supported on
[0, 1],
$$\limsup_{n\to\infty}\frac{R_n}{\log(n)} = \sum_{i:\Delta_i>0}\frac{\Delta_i}{d_i}\,, \quad\text{where}$$
appealing to the central limit theorem. The answer is no. First, the quality of
the approximation in Eq. (10.5) does not depend on n, so asymptotically it is
not true that the Bernoulli bandit behaves like a Gaussian bandit with variances
tuned to match. The reason is that as n tends to infinity, the confidence level
should be chosen so that the risk of failure also tends to zero. But the central
limit theorem does not provide information about the tails with probability
mass less than O(n−1/2 ). See Note 1 in Chapter 5.
5 The analysis in this chapter is easily generalised to a wide range of alternative
noise models. You will do this for single-parameter exponential families in
Exercises 10.4, 10.5 and 34.5.
6 Chernoff credits Lemma 10.3 to his friend Herman Rubin [Chernoff, 2014], but
the name seems to have stuck.
Several authors have worked on Bernoulli bandits, and the asymptotics have
been well understood since the article by Lai and Robbins [1985]. The earliest
version of the algorithm presented in this chapter is due to Lai [1987], who
provided asymptotic analysis. The finite-time analysis of KL-UCB was given by
two groups simultaneously (and published in the same conference) by Garivier
and Cappé [2011] and Maillard et al. [2011] (see also the combined journal article:
Cappé et al. 2013). Two alternatives are the DMED [Honda and Takemura, 2010]
and IMED [Honda and Takemura, 2015] algorithms. These works go after the
problem of understanding the asymptotic regret for the more general situation
where the rewards lie in a bounded interval (see Note 3). The latter work
covers even the semi-bounded case where the rewards are almost surely upper-
bounded. Both algorithms are asymptotically optimal. Ménard and Garivier
[2017] combined MOSS and KL-UCB to derive an algorithm that is minimax
optimal and asymptotically optimal for single-parameter exponential families.
While the subgaussian and Bernoulli examples are very fundamental, there has
also been work on more generic set-ups where the unknown reward distribution for
each arm is known to lie in some class F. The article by Burnetas and Katehakis
[1996] gives the most generic (albeit, asymptotic) results. These generic set-ups
remain wide open for further work.
10.5 Exercises
Hint Consider the function g(x) = d(p, p + x) − 2x2 over the [−p, 1 − p] interval.
By taking derivatives, show that g ≥ 0.
10.2 (Asymptotic optimality) Prove the asymptotic claim in Theorem 10.6.
Hint Choose ε1 , ε2 to decrease slowly with n and use the first part of the
theorem.
Hint Read Note 2 at the end of this chapter. Let g(·, µ) be the cumulant-
generating function of the µ-parameter Bernoulli distribution. For X ∼ B(µ),
λ ∈ R, g(λ, µ) = log E [exp(λX)]. Show that g(λ, ·) is concave. Next, use this and
the tower rule to show that E [exp(λn(µ̂ − µ))] ≤ g(λ, µ)n .
The bound of the previous exercise is most useful when all µt are either all
close to zero or they are all close to one. When half of the {µt } are close to
zero and the other half close to one, then the bound degrades to Hoeffding’s
bound.
$$\lim_{n\to\infty}\frac{R_n(\pi, \nu)}{\log(n)} \le \sum_{i:\Delta_i>0}\frac{\Delta_i}{d_{i,\inf}}\,,$$
where $\mu(\theta) = \int_{\mathbb{R}} x\,dP_\theta(x)$ is the mean of $P_\theta$ and $d_{i,\inf} = \inf\{d(\theta, \phi) : \mu(\phi) > \mu^*, \phi \in \Theta\}$,
with $d(\theta, \phi)$ the relative entropy between $P_\theta$ and $P_\phi$.
Hint Readers not familiar with exponential families should skip ahead to
Section 34.3.1 and then do Exercise 34.5. For the exercise, repeat the proof of
Theorem 10.6, adapting as necessary. See also the paper by Cappé et al. [2013].
Hint This is a subtle problem. You should adapt the algorithm so that if there
are ties in the upper confidence bounds, then an arm with the largest number of
plays is chosen. A solution is available. Korda et al. [2013] analysed
R Thompson
sampling in this setting. Their result only holds when θ 7→ R xpθ (x)dh(x) is
invertible, which does not always hold.
10.6 (Comparison to UCB) In this exercise, you compare KL-UCB and UCB
empirically.
(a) Implement Algorithm 8 and Algorithm 6, where the latter algorithm should
be tuned for 1/2-subgaussian bandits so that
$$A_t = \operatorname{argmax}_{i\in[k]}\left(\hat\mu_i(t-1) + \sqrt{\frac{\log(f(t))}{2T_i(t-1)}}\right).$$
(b) Let n = 10000 and k = 2. Plot the expected regret of each algorithm as a
function of ∆ when µ1 = 1/2 and µ2 = 1/2 + ∆.
(c) Repeat the above experiment with µ1 = 1/10 and µ1 = 9/10.
(d) Discuss your results.
Part III
Adversarial Bandits with
Finitely Many Arms
Statistician George E. P. Box is famous for writing that ‘all models are wrong,
but some are useful’. In the stochastic bandit model the reward is sampled from
a distribution that depends only on the chosen action. It does not take much
thought to realise this model is almost always wrong. At the macroscopic level
typically considered in bandit problems, there is not much that is stochastic
about the world. And even if there were, it is hard to rule out the existence of
other factors influencing the rewards.
The quotation suggests we should not care whether or not the stochastic bandit
model is right, only whether it is useful. In science, models are used for predicting
the outcomes of future experiments, and their usefulness is measured by the
quality of the predictions. But how can this be applied to bandit problems? What
predictions can be made based on bandit models? In this respect, we postulate
the following:
A model can fail in two fundamentally different ways. It can be too specific,
imposing assumptions so detached from reality that a catastrophic mismatch
between actual and predicted performance may arise. The second mode of failure
occurs when a model is too general, which makes the algorithms designed to do
well on the bandit model overly cautious, which can harm performance.
Not all assumptions are equally important. It is a critical assumption in
stochastic bandits that the mean reward of individual arms does not change
(significantly) over time. On the other hand, the assumption that a single, arm-
dependent distribution generates the rewards for a given arm plays a relatively
insignificant role. The reader is encouraged to think of cases when the constancy
of arm distributions plays no role, and also of cases when it does – furthermore, to
decide to what extent the algorithms can tolerate deviations from the assumption
that the means of arms stay the same. Stochastic bandits where the means of
the arms are changing over time are called non-stationary and are the topic of
Chapter 31.
If a highly specialised model is actually correct, then the resulting algorithms
usually dominate algorithms derived for a more general model. This is a general
manifestation of the bias-variance trade-off, well known in supervised learning
and statistics. The holy grail is to find algorithms that work ‘optimally’ across
a range of models. The reader should think about examples from the previous
chapters that illustrate these points.
The usefulness of the stochastic model depends on the setting. In particular,
the designer of the bandit algorithm must carefully evaluate whether stochasticity,
stability of the mean and independence are reasonable assumptions. For some
applications, the answer will probably be yes, while in others the practitioner
may seek something more robust. This latter situation is the topic of the next
few chapters.
Adversarial Bandits
The adversarial bandit model abandons almost all the assumptions on how
the rewards are generated, so much so that the environment is often called the
adversary. The adversary has a great deal of power in this model, including the
ability to examine the code of the proposed algorithms and choose the rewards
accordingly. All that is kept from the previous chapters is that the objective will
be framed in terms of how well a policy is able to compete with the best action
in hindsight.
At first sight, it seems remarkable that one can say anything at all about such
a general model. And yet it turns out that this model is not much harder than
the stochastic bandit problem. Why this holds and how to design algorithms that
achieve these guarantees will be explained in the following chapters.
To give you a glimmer of hope, imagine playing the following simple bandit
game with a friend. The horizon is n = 1, and you have two actions. The game
proceeds as follows:
1 You tell your friend your strategy for choosing an action.
2 Your friend secretly chooses rewards x1 ∈ {0, 1} and x2 ∈ {0, 1}.
3 You implement your strategy to select A ∈ {1, 2} and receive reward xA .
4 The regret is R = max{x1 , x2 } − xA .
Clearly, if your friend chooses x1 = x2 , then your regret is zero no matter what.
Now let’s suppose you implement the deterministic strategy A = 1. Then your
friend can choose x1 = 0 and x2 = 1, and your regret is R = 1. The trick to
improve on this is to randomise. If you tell your friend, ‘I will choose A = 1 with
probability one half’, then the best she can do is choose x1 = 1 and x2 = 0 (or the reverse), and your expected regret is E[R] = 1/2. You are forgiven if you did not
settle on this solution yourself because we did not tell you that a strategy may
be randomised. With such a short horizon, you cannot do better than this, but
for longer games the relative advantage of the adversary decreases, as we shall
see soon.
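The calculation can also be checked with a few lines of Python. The simulation below is illustrative only (not from the original text): the adversary best-responds to the announced mixing probability, and the empirical regret matches the values above.

import numpy as np

def expected_regret(p1, n_samples=100_000, seed=0):
    """Expected regret of 'play arm 1 with probability p1' in the one-round game,
    against an adversary who best-responds to the announced strategy."""
    # Best response: put reward 1 on the arm the learner plays least often.
    x = (0.0, 1.0) if p1 >= 0.5 else (1.0, 0.0)
    rng = np.random.default_rng(seed)
    plays_arm1 = rng.random(n_samples) < p1
    reward = np.where(plays_arm1, x[0], x[1])
    return max(x) - reward.mean()

print(expected_regret(1.0))   # deterministic play: regret close to 1
print(expected_regret(0.5))   # uniform randomisation: regret close to 1/2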
In the next two chapters, we investigate the k-armed adversarial model in detail,
providing both algorithms and regret analysis. Like the stochastic model, the
adversarial model has many generalisations, which we’ll visit in future chapters.
Bibliographic Remarks
The quote by George Box was used several times with different phrasings [Box,
1976, 1979]. The adversarial framework has its roots in game theory, with familiar
names like Hannan [1957] and Blackwell [1954] producing some of the early
work. The non-statistical approach has enjoyed enormous popularity since the
1990’s and has been adopted wholeheartedly by the theoretical computer science
community [Vovk, 1990, Littlestone and Warmuth, 1994, and many, many others].
The earliest work on adversarial bandits is by Auer et al. [1995]. There is now a
big literature on adversarial bandits, which we will cover in more depth in the
chapters that follow. There has been a lot of effort to move away from stochastic
assumptions. An important aspect of this is to define a sense of regularity for
individual sequences. We refer the reader to some of the classic papers by Martin-
Löf [1966] and Levin [1973] and the more recent paper by Ivanenko and Labkovsky
[2013].
11 The Exp3 Algorithm
Adversary secretly chooses rewards x ∈ [0, 1]^{n×k}.
For rounds t = 1, 2, . . . , n:
  Learner selects a distribution P_t ∈ P_{k−1} and samples A_t from P_t.
  Learner observes reward X_t = x_{tA_t}.
The regret of policy π facing rewards x ∈ [0, 1]^{n×k} is
\[ R_n(\pi, x) = \max_{i\in[k]} \sum_{t=1}^n x_{ti} - \mathbb{E}\left[ \sum_{t=1}^n X_t \right], \tag{11.1} \]
where the expectation is over the randomness of the learner's actions. The
arguments π and x are omitted from the regret when they are clear from context.
The only source of randomness in the regret comes from the randomness in
the actions of the learner. Of course the interaction with the environment
means the action chosen in round t may depend on the actions chosen in rounds s < t as well as
the rewards observed in those rounds. As we noted, unlike the case of stochastic
bandits, here there is no measurability restriction on the learner's policy π.
This is actually by choice; see Note 12 for details.
The main question is whether or not there exist policies π for which Rn∗ (π) is
sublinear in n. In Exercise 11.2 you will show that for deterministic policies
Rn∗ (π) ≥ n(1 − 1/k), which follows by constructing a bandit so that xtAt = 0 for
all t and x_{ti} = 1 for i ≠ A_t. Because of this, sublinear worst-case regret is only
possible by using a randomised policy.
Readers familiar with game theory will not be surprised by the need for
randomisation. The interaction between learner and adversarial bandit can be
framed as a two-player zero-sum game between the learner and environment.
The moves for the environment are the possible reward sequences, and for
the player they are the policies. The pay-off for the environment/learner is
the regret and its negation respectively. Since the player goes first, the only
way to avoid being exploited is to choose a randomised policy.
While stochastic and adversarial bandits seem quite different, it turns out that the
optimal worst-case regret is the same up to constant factors and that lower bounds
for adversarial bandits are invariably derived in the same manner as for stochastic
bandits (see Part IV). In this chapter, we present a simple algorithm for which
the worst-case regret is suboptimal by just a logarithmic factor. First, however,
we explore the differences and similarities between stochastic and adversarial
environments.
We already noted that deterministic strategies will have linear regret for
some adversarial bandit. Since strategies in Part II like UCB and ‘Explore-then-
Commit’ were deterministic, they are not well suited for the adversarial setting.
This immediately implies that policies that are good for stochastic bandits can
be very suboptimal in the adversarial setting. What about the other direction?
Will an adversarial bandit strategy have small expected regret in the stochastic
setting? Let π be an adversarial bandit policy and ν = (ν1 , . . . , νk ) be a stochastic
bandit with Supp(νi ) ⊆ [0, 1] for all i. Next, let Xti be sampled from νi for each
i ∈ [k] and t ∈ [n], and assume these random variables are mutually independent.
By Jensen’s inequality and convexity of the maximum function, we have
" n #
X
Rn (π, ν) = max E (Xti − XtAt )
i∈[k]
t=1
" n
#
X
≤ E max (Xti − XtAt )
i∈[k]
t=1
where the regret in the first line is the stochastic regret (using the random table
model), and in the last it is the adversarial regret. Therefore the worst-case
stochastic regret is upper-bounded by the worst-case adversarial regret. Going
the other way, the above inequality also implies that the worst-case regret for
adversarial problems is lower-bounded by the worst-case regret on stochastic
problems with rewards bounded in [0, 1]. In Chapter 15, we prove that the worst-case
regret for stochastic Bernoulli bandits is at least c√(nk), where c > 0 is a
universal constant (Exercise 15.4). And so for the same universal constant, the
minimax regret for adversarial bandits satisfies
\[ R_n^* = \inf_\pi \sup_{x\in[0,1]^{n\times k}} R_n(\pi, x) \ge c\sqrt{nk}. \]
There is a little subtlety here. In order to define the expectations in the stochastic
regret, the policy should be appropriately measurable. This can be resolved by
noting that lower bounds can be proven using Bernoulli bandits. For details, see
again Note 12.
In what follows, we assume that for all t and i, Pti > 0 almost surely. As we
shall see later, this will be true for all policies considered in this chapter. The
importance-weighted estimator of xti is
\[ \hat X_{ti} = \frac{\mathbb{I}\{A_t = i\}\, X_t}{P_{ti}}. \tag{11.3} \]
Let Et [·] = E[· | A1 , X1 , . . . , At , Xt ] denote the conditional expectation given the
history up to time t. The conditional mean of $\hat X_{ti}$ satisfies
\[ \mathbb{E}_{t-1}[\hat X_{ti}] = x_{ti}, \tag{11.4} \]
which means that $\hat X_{ti}$ is an unbiased estimate of $x_{ti}$ conditioned on the history
observed after t − 1 rounds. To see why Eq. (11.4) holds, let $A_{ti} = \mathbb{I}\{A_t = i\}$ so
that $X_t A_{ti} = x_{ti} A_{ti}$ and
\[ \hat X_{ti} = x_{ti}\,\frac{A_{ti}}{P_{ti}}. \]
Now, $\mathbb{E}_{t-1}[A_{ti}] = P_{ti}$, and since $P_{ti}$ is $\sigma(A_1, X_1, \ldots, A_{t-1}, X_{t-1})$-measurable,
\[ \mathbb{E}_{t-1}[\hat X_{ti}] = \mathbb{E}_{t-1}\left[ x_{ti}\,\frac{A_{ti}}{P_{ti}} \right] = \frac{x_{ti}}{P_{ti}}\,\mathbb{E}_{t-1}[A_{ti}] = \frac{x_{ti}}{P_{ti}}\,P_{ti} = x_{ti}. \]
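The unbiasedness can also be verified numerically. The following small Python check is illustrative only (not from the original text); the reward vector and sampling distribution are arbitrary.

import numpy as np

rng = np.random.default_rng(1)
k = 3
x_t = np.array([0.9, 0.5, 0.2])   # fixed (adversarial) rewards in round t
P_t = np.array([0.6, 0.3, 0.1])   # conditional distribution of A_t given the past

n_samples = 1_000_000
A = rng.choice(k, size=n_samples, p=P_t)    # samples of A_t
X = x_t[A]                                  # observed rewards X_t = x_{t A_t}

# X_hat[i] averages I{A_t = i} X_t / P_{ti} over the simulated samples.
X_hat = np.array([np.mean((A == i) * X / P_t[i]) for i in range(k)])
print(np.round(X_hat, 3))   # close to x_t = [0.9, 0.5, 0.2]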
Being unbiased is a good start, but the variance of an estimator is also important.
For an arbitrary random variable U, the conditional variance $\mathbb{V}_{t-1}[U]$ is the random
variable
\[ \mathbb{V}_{t-1}[U] = \mathbb{E}_{t-1}\big[(U - \mathbb{E}_{t-1}[U])^2\big]. \]
So $\mathbb{V}_{t-1}[\hat X_{ti}]$ is a random variable that measures the variance of $\hat X_{ti}$ conditioned
on the past. Calculating the conditional variance using the definition of $\hat X_{ti}$ and
Eq. (11.4) shows that
\[ \mathbb{V}_{t-1}[\hat X_{ti}] = \mathbb{E}_{t-1}[\hat X_{ti}^2] - x_{ti}^2 = \mathbb{E}_{t-1}\left[ \frac{A_{ti}\, x_{ti}^2}{P_{ti}^2} \right] - x_{ti}^2 = \frac{x_{ti}^2 (1 - P_{ti})}{P_{ti}}. \tag{11.5} \]
This can be extremely large when Pti is small and xti is bounded away from zero.
In the notes and exercises, we shall see to what extent this can cause trouble.
The estimator in (11.3) is the first that comes to mind, but there are alternatives.
For example,
\[ \hat X_{ti} = 1 - \frac{\mathbb{I}\{A_t = i\}}{P_{ti}}\,(1 - X_t). \tag{11.6} \]
This estimator is still unbiased. Rewriting the formula in terms of $y_{ti} = 1 - x_{ti}$
and $Y_t = 1 - X_t$ and $\hat Y_{ti} = 1 - \hat X_{ti}$ leads to
\[ \hat Y_{ti} = \frac{\mathbb{I}\{A_t = i\}}{P_{ti}}\,Y_t. \]
This is the same as (11.3) except that Yt has replaced Xt . The terms yti , Yt and
Ŷti should be interpreted as losses. Had we started with losses to begin with, then
this would have been the estimator that first came to mind. For obvious reasons,
the estimator in Eq. (11.6) is called the loss-based importance-weighted
estimator. The conditional variance of this estimator is essentially the same as
Eq. (11.5):
\[ \mathbb{V}_{t-1}[\hat X_{ti}] = \mathbb{V}_{t-1}[\hat Y_{ti}] = \frac{y_{ti}^2(1 - P_{ti})}{P_{ti}}. \]
The only difference is that the variance now depends on $y_{ti}^2$ rather than $x_{ti}^2$. Which
is better depends on the rewards for arm i, with smaller rewards suggesting the
superiority of the first estimator and larger rewards (or small losses) suggesting
the superiority of the second estimator. Can we change the estimator (either one
of them) so that it is more accurate for actions whose reward is close to some
specific value v? Of course! Just change the estimator so that v is subtracted
from the observed reward (or loss), then use the importance-sampling formula,
and subsequently add back v. The problem is that the optimal value of v depends
on the unknown quantity being estimated. Also note that the dependence of the
variance on Pti is the same for both estimators, and since the rewards are bounded,
it is this term that usually contributes most significantly. In Exercise 11.5, we ask
you to show that all unbiased estimators in this setting are importance-weighted
estimators.
Although the two estimators seem quite similar, it should be noted that the
first estimator takes values in [0, ∞) while the second takes values in (−∞, 1].
Soon we will see that this difference has a big impact on the usefulness of
these estimators when used in the Exp3 algorithm.
The simplest algorithm for adversarial bandits is called Exp3, which stands
for ‘exponential-weight algorithm for exploration and exploitation’. The reason
for this name will become clear after the explanation of the algorithm. Let
$\hat S_{ti} = \sum_{s=1}^t \hat X_{si}$ be the total estimated reward by the end of round t, where $\hat X_{si}$ is
given in Eq. (11.6). It seems natural to play actions with larger estimated reward
with higher probability. While there are many ways to map Ŝti into probabilities,
a simple and popular choice is called exponential weighting, which for tuning
parameter η > 0 sets
\[ P_{ti} = \frac{\exp\big(\eta \hat S_{t-1,i}\big)}{\sum_{j=1}^k \exp\big(\eta \hat S_{t-1,j}\big)}. \tag{11.7} \]
The parameter η is called the learning rate. When the learning rate is large, Pt
concentrates about the arm with the largest estimated reward and the resulting
algorithm exploits aggressively. For small learning rates, Pt is more uniform,
and the algorithm explores more frequently. Note that as Pt concentrates, the
variance of the importance-weighted estimators for poorly performing arms
increases dramatically. There are many ways to tune the learning rate, including
allowing it to vary with time. In this chapter we restrict our attention to the
simplest case by choosing η to depend only on the number of actions k and the
horizon n. Since the algorithm depends on η, this means that the horizon must
be known in advance, a requirement that can be relaxed (see Note 10).
1: Input: n, k, η
2: Set Ŝ_{0i} = 0 for all i
3: for t = 1, . . . , n do
4:   Calculate the sampling distribution P_t:
     \[ P_{ti} = \frac{\exp\big(\eta \hat S_{t-1,i}\big)}{\sum_{j=1}^k \exp\big(\eta \hat S_{t-1,j}\big)} \]
5:   Sample A_t ∼ P_t and observe reward X_t
6:   Update Ŝ_{ti} = Ŝ_{t−1,i} + 1 − I{A_t = i}(1 − X_t)/P_{ti} for all i
7: end for
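The pseudocode translates directly into a short program. The following numpy sketch is illustrative only (not the authors' code); it implements the steps above with the loss-based estimator (11.6), and the two-armed Bernoulli reward matrix at the end is an arbitrary example.

import numpy as np

def exp3(x, eta, rng=None):
    """Run Exp3 on a reward matrix x of shape (n, k) with learning rate eta.
    Returns the chosen actions and the realised rewards."""
    if rng is None:
        rng = np.random.default_rng()
    n, k = x.shape
    S_hat = np.zeros(k)                          # cumulative estimated rewards
    actions, rewards = [], []
    for t in range(n):
        w = np.exp(eta * (S_hat - S_hat.max()))  # stable exponential weighting
        P = w / w.sum()
        a = rng.choice(k, p=P)
        X = x[t, a]
        # Estimator (11.6): hat X_{ti} = 1 - I{A_t = i} (1 - X_t) / P_{ti}.
        S_hat += 1.0
        S_hat[a] -= (1.0 - X) / P[a]
        actions.append(a)
        rewards.append(X)
    return np.array(actions), np.array(rewards)

# Example: two Bernoulli arms with means 0.5 and 0.55.
rng = np.random.default_rng(0)
n, k = 10_000, 2
x = rng.binomial(1, [0.5, 0.55], size=(n, k)).astype(float)
A, X = exp3(x, np.sqrt(np.log(k) / (n * k)), rng)
print("regret:", x.sum(axis=0).max() - X.sum())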
Theorem 11.1 asserts that for any x ∈ [0, 1]^{n×k}, Exp3 with learning rate η = √(log(k)/(nk)) satisfies R_n(π, x) ≤ 2√(nk log(k)). To prove this, define $R_{ni} = \mathbb{E}\big[\sum_{t=1}^n (x_{ti} - X_t)\big]$, which is the expected regret relative to using action i in all the rounds.
result will follow by bounding Rni for all i, including the optimal arm. For the
remainder of the proof, let i be some fixed arm. By the unbiasedness property of
the importance-weighted estimator X̂ti ,
\[ \mathbb{E}[\hat S_{ni}] = \sum_{t=1}^n x_{ti} \qquad\text{and also}\qquad \mathbb{E}_{t-1}[X_t] = \sum_{i=1}^k P_{ti}\, x_{ti} = \sum_{i=1}^k P_{ti}\, \mathbb{E}_{t-1}[\hat X_{ti}]. \tag{11.8} \]
The tower rule says that for any random variable X, E[Et−1 [X]] = E[X], which
together with the linearity of expectation and Eq. (11.8) means that
" n k #
h i XX h i
Rni = E Ŝni − E Pti X̂ti = E Ŝni − Ŝn , (11.9)
t=1 i=1
where the last equality serves as the definition of $\hat S_n = \sum_t \sum_i P_{ti}\hat X_{ti}$. To bound
the right-hand side of Eq. (11.9), let
\[ W_t = \sum_{j=1}^k \exp\big(\eta \hat S_{tj}\big). \]
Since $W_0 = k$ and $\exp(\eta \hat S_{ni}) \le W_n$, we have
\[ \exp\big(\eta \hat S_{ni}\big) \le W_n = W_0 \prod_{t=1}^n \frac{W_t}{W_{t-1}} = k \prod_{t=1}^n \frac{W_t}{W_{t-1}}. \tag{11.10} \]
Furthermore,
\[ \frac{W_t}{W_{t-1}} = \sum_{j=1}^k \frac{\exp\big(\eta \hat S_{t-1,j}\big)}{W_{t-1}}\, \exp\big(\eta \hat X_{tj}\big) = \sum_{j=1}^k P_{tj}\, \exp\big(\eta \hat X_{tj}\big). \tag{11.11} \]
Using the inequalities $\exp(x) \le 1 + x + x^2$ (valid for all $x \le 1$) and $1 + x \le \exp(x)$,
\[ \frac{W_t}{W_{t-1}} \le 1 + \eta \sum_{j=1}^k P_{tj}\hat X_{tj} + \eta^2 \sum_{j=1}^k P_{tj}\hat X_{tj}^2 \le \exp\left( \eta \sum_{j=1}^k P_{tj}\hat X_{tj} + \eta^2 \sum_{j=1}^k P_{tj}\hat X_{tj}^2 \right). \tag{11.12} \]
Notice that this was only possible because X̂tj is defined by Eq. (11.6), which
ensures that X̂tj ≤ 1 and would not have been true had we used Eq. (11.3).
Combining Eq. (11.12) and Eq. (11.10),
\[ \exp\big(\eta \hat S_{ni}\big) \le k \exp\left( \eta \hat S_n + \eta^2 \sum_{t=1}^n \sum_{j=1}^k P_{tj}\hat X_{tj}^2 \right). \]
Taking the logarithm of both sides, dividing by η > 0 and reordering gives
\[ \hat S_{ni} - \hat S_n \le \frac{\log(k)}{\eta} + \eta \sum_{t=1}^n \sum_{j=1}^k P_{tj}\hat X_{tj}^2. \tag{11.13} \]
As noted earlier, the expectation of the left-hand side is Rni . The first term on
the right-hand side is a constant, which leaves us to bound the expectation of the
second term. Letting ytj = 1 − xtj and Yt = 1 − Xt and expanding the definition
of $\hat X_{tj}^2$ leads to
\[
\mathbb{E}\left[ \sum_{j=1}^k P_{tj}\hat X_{tj}^2 \right]
= \mathbb{E}\left[ \sum_{j=1}^k P_{tj}\left( 1 - \frac{\mathbb{I}\{A_t = j\}\, y_{tj}}{P_{tj}} \right)^2 \right]
= \mathbb{E}\left[ \sum_{j=1}^k P_{tj}\left( 1 - 2\,\frac{\mathbb{I}\{A_t = j\}\, y_{tj}}{P_{tj}} + \frac{\mathbb{I}\{A_t = j\}\, y_{tj}^2}{P_{tj}^2} \right) \right]
\]
\[
= \mathbb{E}\left[ 1 - 2 Y_t + \mathbb{E}_{t-1}\left[ \sum_{j=1}^k \frac{\mathbb{I}\{A_t = j\}\, y_{tj}^2}{P_{tj}} \right] \right]
= \mathbb{E}\left[ 1 - 2 Y_t + \sum_{j=1}^k y_{tj}^2 \right]
= \mathbb{E}\left[ (1 - Y_t)^2 + \sum_{j \ne A_t} y_{tj}^2 \right]
\le k.
\]
Substituting this bound into Eq. (11.13) and taking expectations shows that
\[ R_{ni} \le \frac{\log(k)}{\eta} + \eta n k = 2\sqrt{nk\log(k)}, \]
where the equality follows by substituting $\eta = \sqrt{\log(k)/(nk)}$, which was chosen
to optimise this bound.
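For completeness, the elementary calculation behind this choice of η is the following (spelled out here; it is not part of the original text):
\[ \frac{d}{d\eta}\left( \frac{\log(k)}{\eta} + \eta nk \right) = -\frac{\log(k)}{\eta^2} + nk = 0 \quad\Longleftrightarrow\quad \eta = \sqrt{\frac{\log(k)}{nk}}, \]
and at this value of η,
\[ \frac{\log(k)}{\eta} + \eta nk = \sqrt{nk\log(k)} + \sqrt{nk\log(k)} = 2\sqrt{nk\log(k)}. \]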
The former of these inequalities is an ansatz derived from the first-order Taylor
expansion of exp(x) about x = 0. The latter, however, is not the second-order
Taylor expansion, which would be 1 + x + x2 /2. The problem is that the second-
order Taylor series is not an upper bound on exp(x) for x ≤ 1, but only for
x ≤ 0:
\[ \exp(x) \le 1 + x + \frac{1}{2}x^2 \quad\text{for all } x \le 0. \tag{11.14} \]
But it is nearly an upper bound, and this can be exploited to improve the bound
in Theorem 11.1. The mentioned upper and lower bounds on exp(x) are shown
in Fig. 11.3, from which it is quite obvious that the bound in Eq. (11.14) is
significantly tighter when x ≤ 0.
Let us now put Eq. (11.14) to use in proving the following improved version of
Theorem 11.1, for which the regret is smaller by a factor of √2. The algorithm is
unchanged except for a slightly increased learning rate.
Figure 11.3 The gaps exp(x) − (1 + x), exp(x) − (1 + x + x²) and exp(x) − (1 + x + x²/2) as functions of x ∈ [−0.5, 0.5].
Theorem 11.2 asserts that Exp3 with the larger learning rate η = √(2 log(k)/(nk)) satisfies R_n(π, x) ≤ √(2nk log(k)) for all x ∈ [0, 1]^{n×k}. The idea of the proof is to write $W_t/W_{t-1} = e^{\eta} \sum_j P_{tj}\exp\big(\eta(\hat X_{tj} - 1)\big)$, where the equality is from Eq. (11.11), and then apply Eq. (11.14), which is possible because $\hat X_{tj} - 1 \le 0$. We see that here we need to bound $\sum_j P_{tj}(\hat X_{tj} - 1)^2$. Let $\hat Y_{tj} = 1 - \hat X_{tj}$. Then
\[ P_{tj}(\hat X_{tj} - 1)^2 = P_{tj}\hat Y_{tj}\hat Y_{tj} = \mathbb{I}\{A_t = j\}\, y_{tj}\hat Y_{tj} \le \hat Y_{tj}, \]
which, following the same argument as before, leads to
\[ \hat S_{ni} - \hat S_n \le \frac{\log(k)}{\eta} + \frac{\eta}{2} \sum_{t=1}^n \sum_{j=1}^k \hat Y_{tj}. \tag{11.15} \]
The result is completed by taking expectations of both sides, using
\[ \mathbb{E}\left[\sum_{t,j} \hat Y_{tj}\right] = \mathbb{E}\left[\sum_{t,j} \mathbb{E}_{t-1}[\hat Y_{tj}]\right] = \mathbb{E}\left[\sum_{t,j} y_{tj}\right] \le nk, \]
and substituting the learning rate.
The reader may wonder about the somewhat ad hoc proof. The best we
can do for now is to point out a few things about the proof. It is natural to
replace the true rewards with the estimated ones. Then, to prove a regret
bound in terms of the estimated rewards, an alternative to the proof is
to start with the trivial inequality that states that for any vector x = (x_j) and positive quantity η, the inequality $x_i \le \frac{1}{\eta}\log\sum_j \exp(\eta x_j)$ holds.
Applying this with $x = (\hat S_{nj})_j$ gives
\[ \hat S_{ni} \le \frac{1}{\eta}\log\Big(\sum_j \exp(\eta \hat S_{nj})\Big) = \frac{1}{\eta}\log(W_n), \]
and the proof then proceeds by bounding $\log(W_n)$ as above.
11.5 Notes
1 Exp3 is nearly optimal in the sense that its expected regret cannot be improved
significantly in the worst case. The distribution of its regret, however, is very far
from optimal. Define the random regret to be the random variable measuring
the actual deficit of the learner relative to the best arm in hindsight:
\[ \hat R_n = \underbrace{\max_{i\in[k]} \sum_{t=1}^n x_{ti} - \sum_{t=1}^n X_t}_{\text{in terms of rewards}} = \underbrace{\sum_{t=1}^n Y_t - \min_{i\in[k]} \sum_{t=1}^n y_{ti}}_{\text{in terms of losses}}. \]
In Exercise 11.6 you will show that for all large enough n and reasonable
choices of η, there exists a bandit such that the random regret of Exp3 satisfies
P(R̂n ≥ n/4) > 1/131. In the same exercise, you should explain why this does
not contradict the upper bound. That Exp3 has such a high variance is a
serious limitation, which we address in the next chapter.
2 What happens when the range of the rewards is unbounded? This has been
studied by Allenberg et al. [2006], where some (necessarily much weaker)
positive results are presented.
3 In the full information setting, the learner observes the whole vector
xt ∈ [0, 1]k at the end of round t, but the reward is still xtAt . This setting is
also called prediction with expert advice. Exponential weighting is still
a good idea, but the estimated rewards can now be replaced by the actual
rewards. The resulting algorithm is sometimes called Hedge or the exponential
weights algorithm. The proof as written goes through in almost the same way,
Even if Π only consists of constant sequences, there still does not exist a policy
guaranteeing sublinear regret. The reason is simple. Consider the two candidate
choices of f1 , . . . , fn . In the first choice, ft (a1 , . . . , at ) = I {a1 = 1}, and in
the second we have ft (a1 , . . . , at ) = I {a1 = 2}. Clearly the learner must suffer
linear regret in at least one of these two reactive bandit environments. The
problem is that the learner’s decision in the first round determines the rewards
available in all subsequent rounds, and there is no time for learning. By making
Exponential weighting has been a standard tool in online learning since the
papers by Vovk [1990] and Littlestone and Warmuth [1994]. Exp3 and several
variations were introduced by Auer et al. [1995], which was also the first paper to
study bandits in the adversarial framework. The algorithm and analysis presented
here differ slightly because we do not add any additional exploration, while the
version of Exp3 in that paper explores uniformly with low probability. The fact
that additional exploration is not required was observed by Stoltz [2005].
11.7 Exercises
11.2 (Linear regret for deterministic policies) Show that for any
deterministic policy π there exists an environment x ∈ [0, 1]n×k such that
Rn (π, x) ≥ n(1 − 1/k). What does your result say about the policies designed in
Part II?
11.3 (Maximum and expectations) Show that the first inequality in (11.2)
holds: Moving the maximum inside the expectation increases the value of the
expectation.
At first sight this definition seems like the right thing because it measures what
you actually care about. Unfortunately, however, it gives the adversary too much
power. Show that for any policy π (randomised or not), there exists a x ∈ [0, 1]n×k
such that
\[ R_n^{\mathrm{track}}(\pi, x) \ge n\left(1 - \frac{1}{k}\right). \]
11.5 (Unbiased estimators are importance weighted) Let P ∈ Pk−1
be a probability vector with nonzero components and let A ∼ P . Suppose
X̂ : [k] × R → R is a function such that for all x ∈ Rk ,
\[ \mathbb{E}[\hat X(A, x_A)] = \sum_{i=1}^k P_i\, \hat X(i, x_i) = x_1. \]
Show that there exists an a ∈ ℝ^k such that ⟨a, P⟩ = 0 and, for all i and z in their
respective domains,
\[ \hat X(i, z) = a_i + \frac{\mathbb{I}\{i = 1\}\, z}{P_1}. \]
11.6 (Variance of Exp3) In this exercise, you will show that if η ∈ [n−p , 1]
for some p ∈ (0, 1), then for sufficiently large n, there exists a bandit on which
Exp3 has a constant probability of suffering linear regret. We work with losses so
that given a bandit y ∈ [0, 1]n×k , the learner samples At from Pt given by
\[ P_{ti} = \frac{\exp\big(-\eta \sum_{s=1}^{t-1} \hat Y_{si}\big)}{\sum_{j=1}^k \exp\big(-\eta \sum_{s=1}^{t-1} \hat Y_{sj}\big)}, \]
where Ŷti = Ati yti /Pti . Let α ∈ [1/4, 1/2] be a constant to be tuned subsequently
and define a two-armed adversarial bandit in terms of its losses by
\[ y_{t1} = \begin{cases} 0 & \text{if } t \le n/2 \\ 1 & \text{otherwise} \end{cases} \qquad\text{and}\qquad y_{t2} = \begin{cases} \alpha & \text{if } t \le n/2 \\ 0 & \text{otherwise.} \end{cases} \]
For simplicity you may assume that n is even.
Figure 11.4 Exp3 instability: Box and whisker plot of the distribution of the regret of
Exp3 for different values of α over a horizon of n = 104 with m = 500 repetitions for the
example of Exercise 11.6. The boxes represent the quartiles of the empirical distribution,
the diamond shows the average; the median is equal to the upper quartile (and thus
cannot be seen), while the dots show values outside of the “interquartile range”.
Show for t ≤ 1 + n/2 that $P_{t2} = q_{T_2(t-1)}(\alpha)$, where $T_2(t) = \sum_{s=1}^t A_{s2}$.
(b) Show that for sufficiently large n there exists an α ∈ [1/4, 1/2] and s ∈ N
such that
\[ q_s(\alpha) = \frac{1}{8n} \qquad\text{and}\qquad \sum_{u=0}^{s-1} \frac{1}{q_u(\alpha)} \le \frac{n}{8}. \]
Using Markov's inequality and the bound on the expected regret of Exp3, show that
\[ \mathbb{P}\big(\hat R_n \ge n/4\big) \le \frac{4\,\mathbb{E}[\hat R_n]}{n} = O(n^{-1/2}). \]
Explain the apparent contradiction.
(f) Validate the theoretical results of this exercise experimentally: implement Exp3 with the suggested loss sequence to reproduce Fig. 11.4. The learning rate is set to the value computed in Theorem 11.2, η = √(2 log(k)/(nk)). Compare the figure with the theoretical results: is there agreement between theory and the empirical results?
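A possible starting point for part (f) is sketched below. It is illustrative only (not the authors' code): α is fixed to a single value rather than the grid used for Fig. 11.4, and 100 repetitions are used instead of 500.

import numpy as np

def exp3_losses(y, eta, rng):
    """Exp3 on a loss matrix y of shape (n, k); returns the random regret."""
    n, k = y.shape
    L_hat = np.zeros(k)                  # cumulative importance-weighted losses
    total_loss = 0.0
    for t in range(n):
        w = np.exp(-eta * (L_hat - L_hat.min()))
        P = w / w.sum()
        a = rng.choice(k, p=P)
        total_loss += y[t, a]
        L_hat[a] += y[t, a] / P[a]       # loss estimate hat Y_{ta} = A_{ta} y_{ta} / P_{ta}
    return total_loss - y.sum(axis=0).min()

n, k, alpha = 10_000, 2, 0.25
y = np.zeros((n, k))
y[n // 2:, 0] = 1.0                      # arm 1: loss 0, then loss 1
y[: n // 2, 1] = alpha                   # arm 2: loss alpha, then loss 0
eta = np.sqrt(2 * np.log(k) / (n * k))   # learning rate from Theorem 11.2
rng = np.random.default_rng(0)
regrets = [exp3_losses(y, eta, rng) for _ in range(100)]
print("mean regret:", np.mean(regrets), " max regret:", np.max(regrets))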
Hint The performance of UCB depends greatly on which version you use. For
best results, remember that Bernoulli distributions are 1/2-subgaussian or use
the KL-UCB algorithm from Chapter 10.
Figure 11.5 Expected regret for Exp3 for different learning rates over n = 10⁵ rounds on a Bernoulli bandit with means µ1 = 0.5 and µ2 = 0.55.
12 The Exp3-IX Algorithm
In the last chapter, we proved a sublinear bound on the expected regret of Exp3,
but with a dishearteningly large variance. The objective of this chapter is to
modify Exp3 so that the regret stays small in expectation and is simultaneously
well concentrated about its mean. Such results are called high-probability
bounds. By slightly modifying the algorithm, we show that for each δ ∈ (0, 1),
there exists an algorithm such that with probability at least 1 − δ,
\[ \hat R_n = \max_{a\in\mathcal{A}} \sum_{t=1}^n (y_{tA_t} - y_{ta}) = O\left( \sqrt{nk\log\frac{k}{\delta}} \right). \]
The poor behaviour of Exp3 occurs because the variance of the importance-
weighted estimators can become very large. In this chapter we modify the reward
estimates to control the variance at the price of introducing some bias.
We start by summarising what we know about the behaviour of the random regret
of Exp3. Because we want to use the loss-based estimator, it is more convenient
to switch to losses, which we do for the remainder of the chapter. Rewriting
Eq. (11.15) in terms of losses,
\[ \hat L_n - \hat L_{ni} \le \frac{\log(k)}{\eta} + \frac{\eta}{2} \sum_{j=1}^k \hat L_{nj}, \tag{12.1} \]
where L̂n and L̂ni are defined using the loss estimator Ŷtj by
\[ \hat L_n = \sum_{t=1}^n \sum_{j=1}^k P_{tj}\hat Y_{tj} \qquad\text{and}\qquad \hat L_{ni} = \sum_{t=1}^n \hat Y_{ti}. \]
Eq. (12.1) holds no matter how the loss estimators are chosen, provided
they satisfy 0 ≤ Ŷti ≤ 1/Pti for all t and i. Of course, the left-hand side of
Eq. (12.1) is not close to the regret unless Ŷti is a reasonable estimator of
the loss yti .
We also need to define the sum of losses observed by the learner and for each
fixed action, which are
\[ \tilde L_n = \sum_{t=1}^n y_{tA_t} \qquad\text{and}\qquad L_{ni} = \sum_{t=1}^n y_{ti}. \]
Like in the previous chapter, we need to define the (random) regret with respect
to a given arm i as follows:
\[ \hat R_{ni} = \sum_{t=1}^n x_{ti} - \sum_{t=1}^n X_t = \tilde L_n - L_{ni}. \tag{12.2} \]
By substituting the above definitions into Eq. (12.1) and rearranging, the regret
with respect to arm i is bounded by
\[ \hat R_{ni} \le \frac{\log(k)}{\eta} + (\tilde L_n - \hat L_n) + (\hat L_{ni} - L_{ni}) + \frac{\eta}{2}\sum_{j=1}^k \hat L_{nj}. \tag{12.3} \]
This means the random regret can be bounded by controlling $\tilde L_n - \hat L_n$, $\hat L_{nj} - L_{nj}$
and L̂nj for each j. As promised we now modify the loss estimate. Let γ > 0 be
a small constant to be chosen later and define the biased estimator
\[ \hat Y_{ti} = \frac{\mathbb{I}\{A_t = i\}\, Y_t}{P_{ti} + \gamma}. \tag{12.4} \]
First, note that Ŷti still satisfies 0 ≤ Ŷti ≤ 1/Pti , so (12.3) is still valid. As γ
increases, the predictable variance decreases, but the bias increases. The optimal
choice of γ depends on finding the sweet spot, which we will do once the dust
has settled in the analysis. When Eq. (12.4) is used in the exponential update in
Exp3, the resulting algorithm is called Exp3-IX (Algorithm 10). The suffix ‘IX’
stands for implicit exploration, a name justified by the following argument. A
simple calculation shows that
\[ \mathbb{E}_{t-1}[\hat Y_{ti}] = \frac{P_{ti}\, y_{ti}}{P_{ti} + \gamma} = y_{ti} - \frac{\gamma\, y_{ti}}{P_{ti} + \gamma} \le y_{ti}. \]
Since small losses correspond to large rewards, the estimator is optimistically
biased. The effect is a smoothing of Pt so that actions with large losses for which
Exp3 would assign negligible probability are still chosen occasionally. In fact, the
smaller P_ti is, the larger the bias. As a result, Exp3-IX will explore more than
the standard Exp3 algorithm (see Exercise 12.5).
The reason for calling the exploration implicit is because the algorithm
explores more as a consequence of modifying the reward estimates, rather
than directly altering P_t.
1: Input: n, k, η, γ
2: Set L̂_{0i} = 0 for all i
3: for t = 1, . . . , n do
4:   Calculate the sampling distribution P_t:
     \[ P_{ti} = \frac{\exp\big(-\eta \hat L_{t-1,i}\big)}{\sum_{j=1}^k \exp\big(-\eta \hat L_{t-1,j}\big)} \]
5:   Sample A_t ∼ P_t and observe loss Y_t = y_{tA_t}
6:   Update L̂_{ti} = L̂_{t−1,i} + I{A_t = i} Y_t/(P_{ti} + γ) for all i
7: end for
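A minimal numpy sketch of Exp3-IX as described above follows. It is illustrative only (not the authors' code); the Bernoulli loss matrix is an arbitrary example and the tuning matches η₁ and γ = η/2 from Theorem 12.1 below.

import numpy as np

def exp3_ix(y, eta, gamma, rng=None):
    """Run Exp3-IX on a loss matrix y of shape (n, k); returns the random regret."""
    if rng is None:
        rng = np.random.default_rng()
    n, k = y.shape
    L_hat = np.zeros(k)                        # cumulative biased loss estimates
    total_loss = 0.0
    for t in range(n):
        w = np.exp(-eta * (L_hat - L_hat.min()))   # stable exponential weights
        P = w / w.sum()
        a = rng.choice(k, p=P)
        Y = y[t, a]
        total_loss += Y
        L_hat[a] += Y / (P[a] + gamma)         # IX estimator (12.4)
    return total_loss - y.sum(axis=0).min()

# Example: five Bernoulli loss sequences, arm 0 slightly better.
n, k = 10_000, 5
rng = np.random.default_rng(0)
y = rng.binomial(1, 0.5, size=(n, k)).astype(float)
y[:, 0] = rng.binomial(1, 0.4, size=n)
eta = np.sqrt(2 * np.log(k + 1) / (n * k))
print("random regret:", exp3_ix(y, eta, eta / 2, rng))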
We now prove the following theorem bounding the random regret of Exp3-IX
with high probability.
Theorem 12.1. Let δ ∈ (0, 1) and define
\[ \eta_1 = \sqrt{\frac{2\log(k+1)}{nk}} \qquad\text{and}\qquad \eta_2 = \sqrt{\frac{\log(k) + \log\frac{k+1}{\delta}}{nk}}. \]
The following statements hold:
1 If Exp3-IX is run with parameters η = η1 and γ = η/2, then
\[ \mathbb{P}\left( \hat R_n \ge \sqrt{8nk\log(k+1)} + \sqrt{\frac{nk}{2\log(k+1)}}\,\log\frac{1}{\delta} + \log\frac{k+1}{\delta} \right) \le \delta. \tag{12.5} \]
2 If Exp3-IX is run with parameters η = η2 and γ = η/2, then
\[ \mathbb{P}\left( \hat R_n \ge 2\sqrt{\big(2\log(k+1) + \log(1/\delta)\big)nk} + \log\frac{k+1}{\delta} \right) \le \delta. \tag{12.6} \]
The proof follows by bounding each of the terms in Eq. (12.3), which we do
via a series of lemmas. The first of these lemmas is a new concentration bound.
To state the lemma, we recall two useful notions. Given a filtration $\mathcal{F} = (\mathcal{F}_t)_{t=0}^n$, a sequence $(Z_t)_{t=1}^n$ is F-adapted if $Z_t$ is $\mathcal{F}_t$-measurable for all t ∈ [n], and F-predictable if $Z_t$ is $\mathcal{F}_{t-1}$-measurable for all t ∈ [n].
Lemma 12.2. Let $\mathcal{F} = (\mathcal{F}_t)_{t=0}^n$ be a filtration and for i ∈ [k] let $(\tilde Y_{ti})_t$ be F-adapted
such that:
1 for any S ⊂ [k] with |S| > 1, $\mathbb{E}\big[\prod_{i\in S} \tilde Y_{ti} \,\big|\, \mathcal{F}_{t-1}\big] \le 0$; and
2 $\mathbb{E}\big[\tilde Y_{ti} \,\big|\, \mathcal{F}_{t-1}\big] = y_{ti}$ for all t ∈ [n] and i ∈ [k].
Furthermore, let (αti )ti and (λti )ti be real-valued F-predictable random sequences
such that for all t, i it holds that 0 ≤ αti Ỹti ≤ 2λti . Then, for all δ ∈ (0, 1),
\[ \mathbb{P}\left( \sum_{t=1}^n \sum_{i=1}^k \alpha_{ti}\left( \frac{\tilde Y_{ti}}{1 + \lambda_{ti}} - y_{ti} \right) \ge \log\frac{1}{\delta} \right) \le \delta. \]
The proof relies on the Cramér–Chernoff method and is deferred until the
end of the chapter. Condition 1 states that the variables {Ỹti }i are negatively
correlated, and it helps us save a factor of k. Equipped with this result, we can
easily bound the terms L̂ni − Lni .
Lemma 12.3 (Concentration – variance). Let δ ∈ (0, 1). With probability at least
1 − δ, the following inequalities hold simultaneously:
\[ \max_{i\in[k]} \big(\hat L_{ni} - L_{ni}\big) \le \frac{\log\frac{k+1}{\delta}}{2\gamma} \qquad\text{and}\qquad \sum_{i=1}^k \big(\hat L_{ni} - L_{ni}\big) \le \frac{\log\frac{k+1}{\delta}}{2\gamma}. \tag{12.7} \]
Proof Fix δ′ ∈ (0, 1) to be chosen later and let $A_{ti} = \mathbb{I}\{A_t = i\}$ as before. Then
\[ \sum_{i=1}^k (\hat L_{ni} - L_{ni}) = \sum_{t=1}^n \sum_{i=1}^k \left( \frac{A_{ti}\, y_{ti}}{P_{ti} + \gamma} - y_{ti} \right) = \frac{1}{2\gamma} \sum_{t=1}^n \sum_{i=1}^k 2\gamma \left( \frac{1}{1 + \frac{\gamma}{P_{ti}}}\,\frac{A_{ti}\, y_{ti}}{P_{ti}} - y_{ti} \right). \]
Introduce $\lambda_{ti} = \gamma/P_{ti}$, $\tilde Y_{ti} = A_{ti} y_{ti}/P_{ti}$ and $\alpha_{ti} = 2\gamma$. Notice that the conditions of
Lemma 12.2 are now satisfied. In particular, for any S ⊂ [k] with |S| > 1, it holds
that $\prod_{i\in S} A_{ti} = 0$ and hence $\prod_{i\in S} \tilde Y_{ti} = 0$. Therefore,
\[ \mathbb{P}\left( \sum_{i=1}^k (\hat L_{ni} - L_{ni}) \ge \frac{\log(1/\delta')}{2\gamma} \right) \le \delta'. \tag{12.8} \]
Proof Let $A_{ti} = \mathbb{I}\{A_t = i\}$ as before. Writing $Y_t = \sum_j A_{tj} y_{tj}$, we calculate
\[ Y_t - \sum_{j=1}^k P_{tj}\hat Y_{tj} = \sum_{j=1}^k \left( 1 - \frac{P_{tj}}{P_{tj} + \gamma} \right) A_{tj}\, y_{tj} = \gamma \sum_{j=1}^k \frac{A_{tj}\, y_{tj}}{P_{tj} + \gamma} = \gamma \sum_{j=1}^k \hat Y_{tj}. \]
Therefore $\tilde L_n - \hat L_n = \gamma \sum_{j=1}^k \hat L_{nj}$, as required.
Combining Eq. (12.3) with these results shows that
\[ \hat R_n \le \frac{\log(k)}{\eta} + (\tilde L_n - \hat L_n) + \max_{i\in[k]}\big(\hat L_{ni} - L_{ni}\big) + \frac{\eta}{2}\sum_{j=1}^k \hat L_{nj} = \frac{\log(k)}{\eta} + \max_{i\in[k]}\big(\hat L_{ni} - L_{ni}\big) + \left(\frac{\eta}{2} + \gamma\right)\sum_{j=1}^k \hat L_{nj}. \]
12.3 Notes
where the first equality follows from Proposition 2.8. The result is completed
using either of the high-probability bounds in Theorem 12.1 and straightforward
integration. We leave the details to the reader in Exercise 12.7.
3 The analysis presented here uses a fixed learning rate that depends on the horizon. Replacing η and γ with $\eta_t = \sqrt{\log(k)/(kt)}$ and $\gamma_t = \eta_t/2$ leads to an anytime algorithm with about the same regret [Kocák et al., 2014, Neu, 2015a].
4 There is another advantage of the modified importance-weighted estimators
used by Exp3-IX, which leads to an improved regret in the special case that
one of the arms has small losses. Specifically, it is possible to show that
\[ R_n = O\left( \sqrt{k \min_{i\in[k]} L_{ni}\, \log(k)} \right). \]
In the worst case, Lni is linear in n and the usual bound is recovered. But
if the optimal arm enjoys a small cumulative loss, then the above can be a
big improvement over the bounds given in Theorem 12.1. Bounds of this
kind are called first-order bounds. We refer the interested reader to the
papers by Allenberg et al. [2006], Abernethy et al. [2012] and Neu [2015b] and
Exercise 28.14.
5 Another situation where one might hope to have a smaller regret is when the
rewards/losses for each arm do not deviate too far from their averages. Define
the quadratic variation by
\[ Q_n = \sum_{t=1}^n \lVert x_t - \mu \rVert^2, \qquad\text{where } \mu = \frac{1}{n}\sum_{t=1}^n x_t. \]
Hazan and Kale [2011] gave an algorithm for which $R_n = O(k^2\sqrt{Q_n})$, which can
be better than the worst-case bound of Exp3 or Exp3-IX when the quadratic
variation is very small. The factor of k 2 is suboptimal and can be removed
using a careful instantiation of the mirror descent algorithm [Bubeck et al.,
2018]. We do not cover this exact algorithm in this book, but the techniques
based on mirror descent are presented in Chapter 28.
6 An alternative to the algorithm presented here is to mix the probability
distribution computed using exponential weights with the uniform distribution,
while biasing the estimates. This leads to the Exp3.P algorithm due to Auer
et al. [2002b], who considered the case where δ is given and derived a bound that
is similar to Eq. (12.6) of Theorem 12.1. With an appropriate modification of
their proof, it is possible to derive a weaker bound similar to Eq. (12.5), where
the knowledge of δ is not needed by the algorithm. This has been explored by
Beygelzimer et al. [2010] in the context of a related algorithm, which will be
considered in Chapter 18. One advantage of this approach is that it generalises
to the case where the loss estimators are sometimes negative, a situation that
can arise in more complicated settings. For technical details, we advise the
reader to work through Exercise 12.3.
The Exp3-IX algorithm is due to Kocák et al. [2014], who also introduced the
biased loss estimators. The focus of that paper was to improve algorithms for
more complex models with potentially large action sets and side information,
though their analysis can still be applied to the model studied in this chapter. The
observation that this algorithm also leads to high-probability bounds appeared in
a follow-up paper by Neu [2015a]. High-probability bounds for adversarial bandits
were first provided by Auer et al. [2002b] and explored in a more generic way by
Abernethy and Rakhlin [2009]. The idea to reduce the variance of importance-
weighted estimators is not new and seems to have been applied in various forms
[Uchibe and Doya, 2004, Wawrzynski and Pacut, 2007, Ionides, 2008, Bottou
et al., 2013]. All of these papers are based on truncating the estimators, which
makes the resulting estimator less smooth. Surprisingly, the variance-reduction
technique used in this chapter seems to be recent [Kocák et al., 2014].
12.5 Exercises
(a) For any δ ∈ (0, 1), with probability at least 1 − δ, $\hat L_{ni} - L_{ni} < \frac{1}{\gamma}\log(1/\delta)$.
(b) For any δ ∈ (0, 1), with probability at least 1 − δ, $\sum_i \hat L_{ni} - \sum_i L_{ni} < \frac{1}{\gamma}\log(1/\delta)$.
Hint The source for this exercise is theorem 1 of the paper by Neu [2015a].
You can also read ahead and use the techniques from Exercise 28.13.
12.3 (Exp3.P) In this exercise we ask you to analyse the Exp3.P algorithm,
which as we mentioned in the notes is another way to obtain high probability
bounds. The idea is to modify Exp3 by biasing the estimators and introducing
some forced exploration. Let Ŷti = Ati yti /Pti − β/Pti be a biased version of the
loss-based importance-weighted estimator that was used in the previous chapter.
Define $\hat L_{ti} = \sum_{s=1}^t \hat Y_{si}$ and consider the policy that samples $A_t \sim P_t$, where
\[ P_{ti} = (1-\gamma)\tilde P_{ti} + \frac{\gamma}{k} \qquad\text{with}\qquad \tilde P_{ti} = \frac{\exp\big(-\eta \hat L_{t-1,i}\big)}{\sum_{j=1}^k \exp\big(-\eta \hat L_{t-1,j}\big)}. \]
(a) Let δ ∈ (0, 1) and i ∈ [k]. Show that with probability 1 − δ, the random
regret R̂ni against i (cf. (12.2)) satisfies
\[ \hat R_{ni} < n\gamma + (1-\gamma)\sum_{t=1}^n \sum_{j=1}^k \tilde P_{tj}\big(\hat Y_{tj} - y_{ti}\big) + \beta \sum_{t=1}^n \frac{1}{P_{tA_t}} + \sqrt{\frac{n\log(1/\delta)}{2}}. \]
(e) Suppose that γ = kη and η = β. Apply the result of Exercise 5.15 to show
that for any δ ∈ (0, 1), the following hold:
\[ \mathbb{P}\left( \sum_{t=1}^n \frac{1}{P_{tA_t}} \ge 2nk + \frac{k}{\gamma}\log\frac{1}{\delta} \right) \le \delta. \]
\[ \mathbb{P}\left( \sum_{t=1}^n \big(\hat Y_{ti} - y_{ti}\big) \ge \frac{1}{\beta}\log\frac{1}{\delta} \right) \le \delta. \]
(f) Combining the previous steps, show that there exists a universal constant
C > 0 such that for any δ ∈ (0, 1), for an appropriate choice of η, γ and β,
with probability at least 1 − δ it holds that the random regret R̂n of Exp3.P
satisfies
\[ \hat R_n \le C\sqrt{nk\log(k/\delta)}. \]
(g) In which step did you use the modified estimators?
(h) Show a bound where the algorithm parameters η, γ, β can only depend on
n, k, but not on δ.
(i) Compare the bounds with the analogous bounds for Exp3-IX in Theorem 12.1.
\[ P_{ta} = \frac{\exp\big(-\eta \sum_{s=1}^{t-1} \tilde Z_{sa}\big)}{\sum_{b=1}^k \exp\big(-\eta \sum_{s=1}^{t-1} \tilde Z_{sb}\big)}. \]
Let Et−1 [·] = E [· | Ft−1 ]. Assume the following hold for all a ∈ [k]:
Let $A^* = \operatorname{argmin}_{a\in[k]} \sum_{t=1}^n Z_{ta}$ and $R_n = \sum_{t=1}^n \sum_{a=1}^k P_{ta}\big(Z_{ta} - Z_{tA^*}\big)$.
\[ \text{(A)} \le \frac{\log(k)}{\eta} + \eta \sum_{t=1}^n \sum_{a=1}^k P_{ta}\hat Z_{ta}^2 + 3\sum_{t=1}^n \sum_{a=1}^k P_{ta}\beta_{ta}. \]
\[ \text{(B)} \le 2\sum_{t=1}^n \sum_{a=1}^k P_{ta}\beta_{ta} + \frac{\log(1/\delta)}{\eta}. \]
(e) Conclude that for any δ ≤ 1/(k + 1), with probability at least 1 − (k + 1)δ,
\[ R_n \le \frac{3\log(1/\delta)}{\eta} + \eta \sum_{t=1}^n \sum_{a=1}^k P_{ta}\hat Z_{ta}^2 + 5\sum_{t=1}^n \sum_{a=1}^k P_{ta}\beta_{ta}. \]
Hint This is a long and challenging exercise. You may find it helpful to use
the result in Exercise 5.15. The solution is also available.
12.5 (Implementation) Consider the Bernoulli bandit with k = 5 arms and
n = 10⁴ with means µ1 = 1/2 and µi = 1/2 − ∆ for i > 1. Plot the regret of Exp3
and Exp3-IX for ∆ ∈ [0, 1/2]. You should get a plot similar to that of Fig. 12.1.
Does the result surprise you?
Figure 12.1 The regret of Exp3 and Exp3-IX as a function of ∆ in the setting of Exercise 12.5.
12.7 (Expected regret of Exp3-IX) In this exercise, you will complete the
steps explained in Note 2 to prove a bound on the expected regret of Exp3-IX.
(a) Find a choice of η and universal constant C > 0 such that
\[ R_n \le C\sqrt{kn\log(k)}. \]
(b) What happens as η grows? Write a bound on the expected regret of Exp3-IX
in terms of η and k and n.
Figure 12.2 Box and whisker plot of the regret of Exp3-IX for the same setting as those
used to produce Fig. 11.4. For details of the experimental settings, see the text of
Exercise 11.6.
Part IV
Lower Bounds for Bandits
with Finitely Many Arms
1 An upper bound does not tell you much about what you could be missing
out on. The only way to demonstrate that your algorithm really is (close to)
optimal is to prove a lower bound showing that no algorithm can do better.
2 The second reason is that lower bounds are often more informative in the
sense that it usually turns out to be easier to get the lower bound right than
the upper bound. History shows a list of algorithms with steadily improving
guarantees until eventually someone hits upon the idea for which the upper
bound matches some known lower bound.
3 Finally, thinking about lower bounds forces you to understand what is hard
about the problem. This is so useful that the best place to start when attacking
a new problem is usually to try and prove lower bounds. Too often we have
not heeded our own advice and started trying to design an algorithm, only
to discover later that had we tackled the lower bound first, then the right
algorithm would have fallen in our laps with almost no effort at all.
So what is the form of a typical lower bound? In the chapters that follow, we
will see roughly two flavours. The first is the worst-case lower bound, which
corresponds to a claim of the form
‘For any policy you give me, I will give you an instance of a bandit problem ν on which
the regret is at least L’ .
Results of this kind have an adversarial flavour, which makes them suitable for
understanding the robustness of a policy. The second type is a lower bound on
the regret of an algorithm for specific instances. These bounds have a different
form that usually reads like the following:
‘If you give me a reasonable policy, then its regret on any instance ν is at least L(ν)’ .
The statement only holds for some policies – the ‘reasonable’ ones, whatever that
means. But the guarantee is also more refined because the bound controls the regret
for these policies on every instance by a function that depends on the instance.
This kind of bound will allow us to show that the instance-dependent bounds
for stochastic bandits of $O\big(\sum_{i:\Delta_i>0} (\Delta_i + \log(n)/\Delta_i)\big)$ are not improvable. The
inclusion of the word ‘reasonable’ is unfortunately necessary. For every bandit
instance ν there is a policy that just chooses the optimal action in ν. Such policies
are not reasonable because they have linear regret for bandits with a different
optimal arm. There are a number of ways to define ‘reasonable’ in a way that is
simultaneously rigorous and, well, reasonable.
The contents of this part is roughly as follows. First we introduce the definition
of worst-case regret and discuss the line of attack for proving lower bounds
(Chapter 13). The next chapter takes us on a brief excursion into information
theory, where we explain the necessary mathematical tools (Chapter 14). Readers
familiar with information theory could skim this chapter. The final three chapters
are devoted to applying information theory to prove lower bounds on the regret
for both stochastic and adversarial bandits.
13 Lower Bounds: Basic Ideas
A policy is called minimax optimal for E if Rn (π, E) = Rn∗ (E). The value Rn∗ (E)
is of interest by itself. A small value of Rn∗ (E) indicates that the underlying bandit
problem is less challenging in the worst-case sense. A core activity in bandit
theory is to understand what makes Rn∗ (E) large or small, often focusing on its
behaviour as a function of the number of rounds n.
Theorem 13.1. Let $\mathcal{E}^k$ be the set of k-armed Gaussian bandits with unit variance
and means µ ∈ [0, 1]^k. Then there exists a universal constant c > 0 such that for
all k > 1 and n ≥ k, it holds that $R_n^*(\mathcal{E}^k) \ge c\sqrt{kn}$.
We will prove this theorem in Chapter 15, but first we give an informal
justification.
The two requirements are clearly conflicting. The first makes us want to choose
instances with means µ, µ0 ∈ [0, 1]k that are far from each other, while the second
requirement makes us want to choose them to be close to each other. The lower
bound will follow by optimising this trade-off.
Let us start to make things concrete by choosing bandits ν = (Pi )ki=1 and
ν 0 = (Pi0 )ki=1 , where Pi = N (µi , 1) and Pi0 = N (µ0i , 1) are Gaussian and
µ, µ0 ∈ [0, 1]k . We will also assume that n is larger than k by some suitably
large constant factor. In order to prove a lower bound, it suffices to show that for
every strategy π, there exists a choice of µ and µ0 such that
\[ \max\big\{ R_n(\pi, \nu),\, R_n(\pi, \nu') \big\} \ge c\sqrt{kn}, \]
where the expectation is taken with respect to the induced measure on the
sequence of outcomes when π interacts with ν. Now we need to choose µ0 to
satisfy the two requirements above. Since we want ν and ν 0 to be hard to
distinguish and yet have different optimal actions, we should make µ′ as close to
µ as possible, except in a coordinate where π expects to explore the least. To this end, let
If we are prepared to ignore the fact that Ti (n) is a random variable and take for
granted the claims in the first part of the chapter, then with this choice of ∆,
13.2 Notes
The bound on Gaussian tails used in Eq. (13.1) is derived from §7.1.13 of the
reference book by Abramowitz and Stegun [1964], which bounds
\[ \frac{\exp(-x^2)}{x + \sqrt{x^2 + 2}} \le \int_x^\infty \exp(-t^2)\,dt \le \frac{\exp(-x^2)}{x + \sqrt{x^2 + 4/\pi}} \qquad\text{for all } x \ge 0. \tag{13.4} \]
13.4 Exercises
Hint Think about simple policies (not necessarily good ones) and use the
definition.
14 Foundations of Information Theory
( )
Alice wants to communicate with Bob. She wants to tell Bob the outcome of a
sequence of n independent random variables sampled from a known distribution Q.
Alice and Bob agree to communicate using a binary code that is fixed in advance
in such a way that the expected message length is minimised. The entropy of Q
is the expected number of bits necessary per random variable using the optimal
code as n tends to infinity. The relative entropy between distributions P and Q
is the price in terms of expected message length that Alice and Bob have to pay
if they believe the random variables are sampled from Q when in fact they are
sampled from P .
Let P be a measure on [N ] with σ-algebra 2[N ] and X : [N ] → [N ] be the
identity random variable, X(ω) = ω. Alice observes a realisation of X and wants
to communicate the result to Bob using a binary code that they agree upon
in advance. For example, when N = 4, they might agree on the following code:
1 → 00, 2 → 01, 3 → 10, 4 → 11. Then if Alice observes a 3, she sends Bob a
message containing 10. For our purposes, a code is a function c : [N ] → {0, 1} ,
∗
Of course c must be injective so that no two numbers (or symbols) have the
same code. We also require that c be prefix free, which means that no code is a
prefix of any other. This is justified by supposing that Alice would like to tell
Bob about multiple samples. Then Bob needs to know where the message for one
symbol starts and ends.
Using a prefix code is not the only way to enforce unique decodability, but
all uniquely decodable codes have equivalent prefix codes (see Note 1).
where the argmin is taken over valid codes and `(·) is a function that returns
the length of a code. The optimisation problem in (14.1) can be solved using
Huffman coding, and the optimal value satisfies
\[ H_2(P) \le \sum_{i=1}^N p_i\, \ell(c^*(i)) \le H_2(P) + 1, \tag{14.2} \]
where $H_2(P) = \sum_{i\in[N]: p_i > 0} p_i \log_2\frac{1}{p_i}$ is the entropy of P measured in bits.
When pi = 1/N is uniform, the naive idea of using a code of uniform length is
recovered, but for non-uniform distributions, the code adapts to assign shorter
codes to symbols with larger probability. It is worth pointing out that the sum
is only over outcomes that occur with non-zero probability, which is motivated
by observing that limx→0+ x log(1/x) = 0 or by thinking of the entropy as an
expectation of the log probability with respect to P , and expectations should not
change when the value of the random variable is perturbed on a measure zero set.
It turns out that H2 (P ) is not just an approximation on the expected length of
the Huffman code, but is itself a fundamental quantity. Imagine that Alice wants
to transmit a long string of symbols sampled from P . She could use a Huffman
code to send Bob each symbol one at a time, but this introduces rounding errors
that accumulate as the message length grows. There is another scheme called
arithmetic coding for which the average number of bits per symbol approaches
H2 (P ) and the source coding theorem says that this is unimprovable.
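A small numerical illustration of Eq. (14.2) follows (it is not part of the original text). The Shannon code with lengths ⌈log₂(1/pᵢ)⌉ satisfies Kraft's inequality and hence corresponds to some prefix code; the optimal (Huffman) code is at least as good, so its expected length also lies between H₂(P) and H₂(P) + 1.

import math

p = [0.5, 0.25, 0.125, 0.125]
H2 = sum(pi * math.log2(1 / pi) for pi in p if pi > 0)
shannon_length = sum(pi * math.ceil(math.log2(1 / pi)) for pi in p if pi > 0)
print(H2, shannon_length)   # both equal 1.75 here, since every p_i is a power of 2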
The definition of entropy using base 2 makes sense from the perspective of
sending binary messages. Mathematically, however, it is more convenient to define
the entropy using the natural logarithm,
\[ H(P) = \sum_{i\in[N]: p_i > 0} p_i \log\frac{1}{p_i}. \tag{14.3} \]
This is nothing more than a scaling of H_2. Measuring information using base-2
logarithms has a unit of bits, and for the natural logarithm the unit is nats.
By slightly abusing terminology, we will also call H(P ) the entropy of P .
Suppose that Alice and Bob agree to use a code that is optimal when X is sampled
from distribution Q. Unbeknownst to them, however, X is actually sampled from
distribution P . The relative entropy between P and Q measures how much longer
the messages are expected to be using the optimal code for Q than what would be
obtained using the optimal code for P . Letting pi = P (X = i) and qi = Q(X = i),
assuming Shannon coding, working out the math while dropping ⌈·⌉ leads to the
definition of relative entropy as
\[ D(P, Q) = \sum_{i\in[N]: p_i>0} p_i \log\frac{1}{q_i} - \sum_{i\in[N]: p_i>0} p_i \log\frac{1}{p_i} = \sum_{i\in[N]: p_i>0} p_i \log\frac{p_i}{q_i}. \tag{14.4} \]
From the coding interpretation, one conjectures that D(P, Q) ≥ 0. Indeed, this
is easy to verify using Jensen’s inequality. Still poking around the definition,
what happens when qi = 0 and pi = 0? This means that symbol i is superfluous
and the value of D(P, Q) should not be impacted by introducing superfluous
symbols. And again, it is not by the definition of the expectations. We also see
that the sufficient and necessary condition for D(P, Q) < ∞ is that for each i
with qi = 0, we also have pi = 0. The condition we discovered is equivalent to
saying that P is absolutely continuous with respect to Q. Note that absolute
continuity only implies a finite relative entropy when X takes on finitely many
values (Exercise 14.2).
This brings us back to defining relative entropy between probability measures
P and Q on arbitrary measurable spaces (Ω, F). When the support of P is
uncountable, defining the entropy via communication is hard because infinitely
many symbols are needed to describe some outcomes. This seems to be a
fundamental difficulty. Luckily, the impasse gets resolved automatically if we
only consider relative entropy. While we cannot communicate the outcome, for
any finite discretisation of the possible outcomes, the discretised values can be
communicated finitely, and all our definitions will work. Formally, a discretisation
to [N] is specified by an F/2^{[N]}-measurable map X : Ω → [N]. Then the entropy
of P relative to Q can be defined as
\[ D(P, Q) = \sup_{N\in\mathbb{N}^+} \sup_X D(P_X, Q_X), \tag{14.5} \]
where the inner supremum is over F/2^{[N]}-measurable maps X and P_X, Q_X denote the laws of X under P and Q.
Theorem 14.1. Let (Ω, F) be a measurable space, and let P and Q be measures
on this space. Then,
\[ D(P, Q) = \begin{cases} \int \log\frac{dP}{dQ}(\omega)\, dP(\omega), & \text{if } P \ll Q; \\ \infty, & \text{otherwise}. \end{cases} \]
Note that the relative entropy between P and Q can still be infinite even when
P ≪ Q. Note also that in the case of discrete measures, the above expression
reduces to (14.4). For calculating relative entropies one often uses densities: if λ is a common dominating σ-finite measure for P and Q (that is,
P ≪ λ and Q ≪ λ both hold), then letting $p = \frac{dP}{d\lambda}$ and $q = \frac{dQ}{d\lambda}$, if also P ≪ Q,
\[ D(P, Q) = \int p \log\frac{p}{q}\, d\lambda. \tag{14.6} \]
This is probably the best-known expression for relative entropy and is often used
as a definition. Note that for probability measures, a common dominating σ-finite
measure can always be found. For example, λ = P + Q always dominates both
P and Q.
Relative entropy is a kind of ‘distance’ measure between distributions P and
Q. In particular, D(P, Q) = 0 whenever P = Q, and otherwise D(P, Q) > 0.
However, strictly speaking, the relative entropy is not a distance because it
is not symmetric and does not satisfy the triangle inequality. Nevertheless, it serves
the same purpose.
The relative entropy between many standard distributions is often quite easy
to compute. For example, the relative entropy between two Gaussians with means
µ1, µ2 ∈ ℝ and common variance σ² is
\[ D\big(\mathcal{N}(\mu_1, \sigma^2), \mathcal{N}(\mu_2, \sigma^2)\big) = \frac{(\mu_1 - \mu_2)^2}{2\sigma^2}. \]
The dependence on the difference in means and the variance is consistent with
our intuition. If µ1 is close to µ2 , then the ‘difference’ between the distributions
should be small, but if the variance is very small, then there is little overlap, and
the difference is large. The relative entropy between two Bernoulli distributions
with means p, q ∈ [0, 1] is
\[ D\big(\mathcal{B}(p), \mathcal{B}(q)\big) = p\log\frac{p}{q} + (1-p)\log\frac{1-p}{1-q}, \]
where 0 log(·) = 0. Due to its frequent appearance at various places, D(B(p), B(q))
gets the honour of being abbreviated to d(p, q), which we have met before in
Definition 10.1.
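As a quick numerical sanity check of these two formulas (not part of the original text), one can compare the closed forms against direct numerical integration and summation:

import numpy as np

def kl_gaussian(mu1, mu2, sigma):
    """Relative entropy between N(mu1, sigma^2) and N(mu2, sigma^2)."""
    return (mu1 - mu2) ** 2 / (2 * sigma ** 2)

def kl_bernoulli(p, q):
    """d(p, q), using the convention that terms with zero probability vanish."""
    out = 0.0
    if p > 0:
        out += p * np.log(p / q)
    if p < 1:
        out += (1 - p) * np.log((1 - p) / (1 - q))
    return out

# Gaussian case: numerically integrate p(x) log(p(x)/q(x)) over a wide grid.
mu1, mu2, sigma = 0.0, 1.0, 2.0
x = np.linspace(-30.0, 30.0, 600_001)
p = np.exp(-((x - mu1) ** 2) / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))
q = np.exp(-((x - mu2) ** 2) / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))
numeric = np.sum(p * np.log(p / q)) * (x[1] - x[0])
print(kl_gaussian(mu1, mu2, sigma), numeric)   # both close to 0.125

print(kl_bernoulli(0.3, 0.5))                  # d(0.3, 0.5) is about 0.0823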
We are nearing the end of our whirlwind tour of relative entropy. It remains
to state the key lemma that connects the relative entropy to the hardness of
hypothesis testing.
Theorem 14.2 (Bretagnolle–Huber inequality). Let P and Q be probability measures on the same measurable space (Ω, F), and let A ∈ F be an arbitrary event. Then,
\[ P(A) + Q(A^c) \ge \frac{1}{2}\exp\big(-D(P, Q)\big), \tag{14.7} \]
where $A^c = \Omega \setminus A$ is the complement of A.
The proof may be found at the end of the chapter, but first some interpretation
and a simple application. Suppose that D(P, Q) is small; then P is close to Q
in some sense. Since P is a probability measure, we have P (A) + P (Ac ) = 1.
If Q is close to P , then we might expect that P (A) + Q(Ac ) should be large.
The purpose of the theorem is to quantify just how large. Note that if P is not
absolutely continuous with respect to Q, then D(P, Q) = ∞, and the result is
vacuous. Also note that the result is symmetric. We could replace D(P, Q) with
D(Q, P ), which sometimes leads to a stronger result because the relative entropy
is not symmetric.
Returning to the hypothesis-testing problem described in the previous chapter,
let X be normally distributed with unknown mean µ ∈ {0, ∆} and variance
σ 2 > 0. We want to bound the quality of a rule for deciding what is the real mean
from a single observation. The decision rule is characterised by a measurable
set A ⊆ R on which the predictor guesses µ = ∆ (it predicts µ = 0 on the
complement of A). Let P = N (0, σ 2 ) and Q = N (∆, σ 2 ). Then the probability
of an error under P is P (A), and the probability of error under Q is Q(Ac ). The
reader surely knows what to do next. By Theorem 14.2, we have
\[ P(A) + Q(A^c) \ge \frac{1}{2}\exp\big(-D(P, Q)\big) = \frac{1}{2}\exp\left( -\frac{\Delta^2}{2\sigma^2} \right). \]
If we assume that the signal-to-noise ratio is small, ∆²/σ² ≤ 1, then
\[ P(A) + Q(A^c) \ge \frac{1}{2}\exp\left( -\frac{1}{2} \right) \ge \frac{3}{10}, \]
which implies max {P (A), Q(Ac )} ≥ 3/20. This means that no matter how we
chose our decision rule, we simply do not have enough data to make a decision
for which the probability of error on either P or Q is smaller than 3/20.
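The example can be checked numerically (this check is not part of the original text). For the likelihood-ratio test that predicts µ = ∆ on A = {x > ∆/2}, the total error is P(A) + Q(Aᶜ) = 2Φ(−∆/(2σ)), which indeed always exceeds the lower bound ½ exp(−∆²/(2σ²)):

import math

def total_error(delta, sigma):
    # 2 * Phi(-delta / (2 * sigma)), with Phi computed via the error function.
    return 1 + math.erf(-delta / (2 * sigma) / math.sqrt(2))

def lower_bound(delta, sigma):
    return 0.5 * math.exp(-delta ** 2 / (2 * sigma ** 2))

for delta in (0.5, 1.0, 2.0):
    print(delta, total_error(delta, 1.0), lower_bound(delta, 1.0))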
Proof Let ν be a common dominating measure for P and Q (for example, ν = P + Q), and let $p = \frac{dP}{d\nu}$ and $q = \frac{dQ}{d\nu}$. By Eq. (14.6), $D(P, Q) = \int p \log\frac{p}{q}\, d\nu$. For brevity, when writing integrals with respect to ν in this proof, we will drop dν. Thus, we will write, for example, $\int p \log(p/q)$ for the above integral.
Instead of (14.7), we prove the stronger result that
\[ \int p \wedge q \ge \frac{1}{2}\exp\big(-D(P, Q)\big). \tag{14.8} \]
This indeed is sufficient since $\int p \wedge q = \int_A p \wedge q + \int_{A^c} p \wedge q \le \int_A p + \int_{A^c} q = P(A) + Q(A^c)$. We start with an inequality attributed to French mathematician
Lucien Le Cam, which lower-bounds the left-hand side of Eq. (14.8). The inequality
states that
\[ \int p \wedge q \ge \frac{1}{2}\left( \int \sqrt{pq} \right)^2. \tag{14.9} \]
How do we get this inequality? Starting from the right-hand side above, using
pq = (p ∧ q)(p ∨ q) and Cauchy–Schwarz we get
\[ \left( \int \sqrt{pq} \right)^2 = \left( \int \sqrt{(p\wedge q)(p\vee q)} \right)^2 \le \int p\wedge q \int p\vee q. \]
Now, using $p\wedge q + p\vee q = p + q$, the proof is finished by substituting
$\int p\vee q = 2 - \int p\wedge q \le 2$ and dividing both sides by two. It remains to lower-bound
the right-hand side of (14.9). For this, we use Jensen’s inequality. First, we write
(·)2 as exp(2 log(·)) and then move the log inside the integral:
\[
\left( \int \sqrt{pq} \right)^2 = \exp\left( 2\log \int \sqrt{pq} \right) = \exp\left( 2\log \int_{p>0} p\sqrt{\frac{q}{p}} \right)
\ge \exp\left( 2\int_{p>0} p\, \frac{1}{2}\log\frac{q}{p} \right) = \exp\left( -\int_{pq>0} p\log\frac{p}{q} \right) = \exp\left( -\int p\log\frac{p}{q} \right) = \exp\big(-D(P, Q)\big).
\]
In the fourth and the last step, we used that since P ≪ Q, q = 0 implies p = 0,
and so p > 0 implies q > 0, and eventually pq > 0. The result is completed by
chaining the inequalities.
14.3 Notes
Furthermore, for any sequence $(\ell_i)_{i=1}^\infty$ satisfying $\sum_{i=1}^\infty 2^{-\ell_i} \le 1$, there exists a prefix
code $c : \mathbb{N}^+ \to \{0,1\}^*$ such that $\ell(c(i)) = \ell_i$. The second part justifies our
restriction to prefix codes rather than uniquely decodable codes in the definition
of the entropy.
2 The supremum in the definition given in Eq. (14.5) may often be taken over
a smaller set. Precisely, let (X , G) be a measurable space and suppose that
G = σ(F) where F is a field. Note that a field is defined by the same axioms
as a σ-algebra except that being closed under countable unions is replaced by
the condition that it be closed under finite unions. Then, for measures P and
Q on (X, G), it holds that
\[ D(P, Q) = \sup_{n\in\mathbb{N}^+} \sup_X D(P_X, Q_X), \]
where the supremum is over F/2^{[n]}-measurable functions X. This result is known
as Dobrushin’s theorem.
3 In the proof of Theorem 14.2 we used the inequality $P(A) + Q(A^c) \ge \int p \wedge q$.
Looking at the proof, it is not hard to see that the inequality becomes an equality
when A = {p ≤ q} = {q/p ≥ 1}. Readers familiar with statistical decision
theory may recognise that this is a special case of the Neyman–Pearson
lemma which states that the most powerful test among all statistical tests in a
simple hypothesis testing problem at a given significance level is the likelihood
ratio test. Exercise 14.13 explores this connection.
4 How tight is Theorem 14.2? We remarked already that D(P, Q) = 0 if and only
if P = Q. But in this case, Theorem 14.2 only gives
\[ 1 = P(A) + Q(A^c) \ge \frac{1}{2}\exp\big(-D(P, Q)\big) = \frac{1}{2}, \]
which does not seem so strong. From where does the weakness arise? The
answer is in Le Cam’s inequality, Eq. (14.9), which can be refined by
\[ \left( \int \sqrt{pq} \right)^2 \le \int p\wedge q \int p\vee q = \int p\wedge q \left( 2 - \int p\wedge q \right). \]
Figure 14.2 Tightening the inequality of Le Cam: the curves x/2, 1 − √(1 − x) and 1 − √(log(1/x)/2) are shown for x ∈ [0, 1]. Here, x = exp(− D(P, Q)). Higher values are better as the figure shows lower bounds on P(A) + Q(Aᶜ).
the total variation distance, the Hellinger distance is actually a distance (it is
symmetric and satisfies the triangle inequality), but the χ²-‘distance’ is not. It is
possible to show (Tsybakov [2008], chapter 2) that
\[ \delta(P, Q)^2 \le h(P, Q)^2 \le D(P, Q) \le \chi^2(P, Q). \tag{14.16} \]
All the inequalities are tight for some choices of P and Q, but the examples
do not chain together, as evidenced by Pinsker’s inequality, which shows that
δ(P, Q)² ≤ D(P, Q)/2 (which is also tight for some P and Q).
7 The entropy for distribution P was defined as H(P ) in Eq. (14.3). If X is a
random variable, then H(X) is defined to be the entropy of the law of X. This
is a convenient notation because it allows one to write H(f (X)) and H(XY )
and similar expressions.
There are many references for information theory. Most well known (and
comprehensive) is the book by Cover and Thomas [2012]. Another famous book
is the elementary and enjoyable introduction by MacKay [2003]. The approach
we have taken for defining and understanding the relative entropy is inspired by
an excellent shorter book by Gray [2011]. Theorem 14.1 connects our definition of
relative entropies to densities (the ‘classic definition’). It can be found in §5.2 of
the aforementioned book. Dobrushin’s theorem is due to him [Dobrushin, 1959].
An alternative source is lemma 5.2.2 in the book of Gray [2011]. Theorem 14.2 is
due to Bretagnolle and Huber [1979]. We also recommend the book by Tsybakov
[2008] as a good source for learning about information theoretic lower bounds in
statistical settings.
14.5 Exercises
14.3 Prove the inequality in Eq. (14.10) for prefix free codes c.
14.5 (Entropy inequalities) Prove that each of the inequalities in Eq. (14.16)
is tight.
14.7 (Relative entropy for Gaussian distributions) For each i ∈ {1, 2},
let µ_i ∈ ℝ, σ_i² > 0 and $P_i = \mathcal{N}(\mu_i, \sigma_i^2)$. Show that
\[ D(P_1, P_2) = \frac{1}{2}\left( \log\frac{\sigma_2^2}{\sigma_1^2} + \frac{\sigma_1^2}{\sigma_2^2} - 1 \right) + \frac{(\mu_1 - \mu_2)^2}{2\sigma_2^2}. \]
(a) a probability measure on (ℝ, B(ℝ)) that is not absolutely continuous with respect to λ; and
(b) a probability measure P on (ℝ, B(ℝ)) that is absolutely continuous with respect to λ with D(P, Q) = ∞, where Q = N(0, 1) is the standard Gaussian measure.
14.9 (Data processing inequality) Let P and Q be measures on (Ω, F), and
let G be a sub-σ-algebra of F and PG and QG be the restrictions of P and Q to
(Ω, G). Show that D(PG , QG ) ≤ D(P, Q).
that
\[ D(P, Q) = \sum_{t=1}^n \mathbb{E}_P\Big[ D\big(P_t(\cdot \mid X_1, \ldots, X_{t-1}),\, Q_t(\cdot \mid X_1, \ldots, X_{t-1})\big) \Big]. \tag{14.17} \]
Hint This is a rather technical exercise. You will likely need to apply a
monotone class argument [Kallenberg, 2002, theorem 1.1]. For the definition
of a regular version, see [Kallenberg, 2002, theorem 5.3] or Theorem 3.11.
Briefly, Pt is a probability kernel from (Rt−1 , B(Rt−1 )) to (R, B(R)) such that
Pt (A | x1 , . . . , xt−1 ) = P (Xt ∈ A | X1 , . . . , Xt−1 ) with P -probability one for all
A ∈ B(R).
14.12 (Chain rule (cont.)) Let P and Q be measures on (Rn , B(Rn )), and
for t ∈ [n], let X_t(x) = x_t be the coordinate projection from ℝ^n to ℝ. Then let Pt
and Qt be regular versions of Xt given X1 , . . . , Xt−1 under P and Q, respectively.
Let τ be a stopping time adapted to the filtration generated by X1 , . . . , Xn with
τ ∈ [n] almost surely. Show that
" τ #
X
D(P|Fτ , Q|Fτ ) = EP D(Pt (· | X1 , . . . , Xt−1 ), Qt (· | X1 , . . . , Xt−1 )) .
t=1
(a) If a test f is such that f(x) = P if p(x) > ηq(x) and f(x) = Q if p(x) < ηq(x), and α = e(f, P), then e(f, Q) = min{e(f′, Q) : e(f′, P) ≤ α}. Here, p = dP/dν and q = dQ/dν, where ν is a common dominating measure of P and Q.
(b) If f, as postulated in the previous part, exists, and f′ is a test such that e(f′, Q) = e(f, Q) and e(f′, P) ≤ α, then e(f′, P) = α. Furthermore, f′ = f holds except perhaps over the union of the set {x : p(x) = ηq(x)} and a set that has zero measure under both P and Q.
15 Minimax Lower Bounds
After the short excursion into information theory, let us return to the world
of k-armed stochastic bandits. In what follows, we fix the horizon n > 0 and
the number of actions k > 1. This chapter has two components. The first is
an exact calculation of the relative entropy between measures in the canonical
bandit model for a fixed policy and different bandits. In the second component,
we prove a minimax lower bound that formalises the intuitive arguments given in
Chapter 13.
The following result will be used repeatedly. Some generalisations are provided
in the exercises.
Lemma 15.1 (Divergence decomposition). Let ν = (P1, . . . , Pk) be the reward distributions associated with one k-armed bandit, and let ν′ = (P1′, . . . , Pk′) be the reward distributions associated with another k-armed bandit. Fix some policy π and let Pν = Pνπ and Pν′ = Pν′π be the probability measures on the canonical bandit model (Section 4.6) induced by the n-round interconnection of π and ν (respectively, π and ν′). Then,
D(Pν, Pν′) = Σ_{i=1}^k Eν[Ti(n)] D(Pi, Pi′) .   (15.1)
Proof Assume that D(Pi, Pi′) < ∞ for all i ∈ [k]. It follows that Pi ≪ Pi′. Define λ = Σ_{i=1}^k (Pi + Pi′), which is the measure defined by λ(A) = Σ_{i=1}^k (Pi(A) + Pi′(A)) for any measurable set A. Theorem 14.1 shows that, as long as dPν/dPν′ < +∞,
D(Pν, Pν′) = Eν[ log( dPν/dPν′ ) ] .
Recalling that ρ is the counting measure over [k], we find that the Radon–Nikodym derivative of Pν with respect to the product measure (ρ × λ)^n is given in Eq. (4.7) as
pνπ(a1, x1, . . . , an, xn) = Π_{t=1}^n πt(at | a1, x1, . . . , at−1, xt−1) p_{a_t}(xt) .
Hence,
log( dPν/dPν′ )(a1, x1, . . . , an, xn) = Σ_{t=1}^n log( p_{a_t}(xt) / p′_{a_t}(xt) ) ,
where we used the chain rule for Radon–Nikodym derivatives and the fact that the terms involving the policy cancel. Taking expectations of both sides,
Eν[ log( dPν/dPν′ )(A1, X1, . . . , An, Xn) ] = Σ_{t=1}^n Eν[ log( p_{A_t}(Xt) / p′_{A_t}(Xt) ) ] ,
and
Eν[ log( p_{A_t}(Xt) / p′_{A_t}(Xt) ) ] = Eν[ Eν[ log( p_{A_t}(Xt) / p′_{A_t}(Xt) ) | At ] ] = Eν[ D(P_{A_t}, P′_{A_t}) ] ,
where in the second equality we used that under Pν(· | At), the distribution of Xt is dP_{A_t} = p_{A_t} dλ. Plugging back into the previous display,
Eν[ log( dPν/dPν′ )(A1, X1, . . . , An, Xn) ] = Σ_{t=1}^n Eν[ log( p_{A_t}(Xt) / p′_{A_t}(Xt) ) ]
  = Σ_{t=1}^n Eν[ D(P_{A_t}, P′_{A_t}) ] = Σ_{i=1}^k Eν[ Σ_{t=1}^n I{At = i} D(Pi, Pi′) ] = Σ_{i=1}^k Eν[Ti(n)] D(Pi, Pi′) ,
which is the claimed identity.
Recall that E_N^k(1) is the class of Gaussian bandits with unit variance, which can be parameterised by their mean vector µ ∈ R^k. Given µ ∈ R^k, let νµ be the Gaussian bandit for which the ith arm has reward distribution N(µi, 1).
Theorem 15.2. Let k > 1 and n ≥ k − 1. Then, for any policy π, there exists a mean vector µ ∈ [0, 1]^k such that
Rn(π, νµ) ≥ (1/27) √( (k − 1)n ) .
Figure 15.1 The idea of the minimax lower bound. Given a policy and one environment,
the evil antagonist picks another environment so that the policy will suffer a large regret
in at least one environment.
Since νµ ∈ E_N^k(1), it follows that the minimax regret for E_N^k(1) is lower-bounded by the right-hand side of the above display as soon as n ≥ k − 1:
Rn*(E_N^k(1)) ≥ (1/27) √( (k − 1)n ) .
The idea of the proof is illustrated in Fig. 15.1.
Proof Fix a policy π. Let ∆ ∈ [0, 1/2] be some constant to be chosen later. As
suggested in Chapter 13, we start with a Gaussian bandit with unit variance
and mean vector µ = (∆, 0, 0, . . . , 0). This environment and π give rise to the
distribution Pνµ ,π on the canonical bandit model (Hn , Fn ). For brevity we will
use Pµ in place of Pνµ ,π , and expectations under Pµ will be denoted by Eµ . To
choose the second environment, let i = argmin_{j>1} Eµ[Tj(n)] be a suboptimal arm that is played least often in expectation under Pµ, so that Eµ[Ti(n)] ≤ n/(k − 1), and let
µ′ = (∆, 0, 0, . . . , 0, 2∆, 0, . . . , 0) ,
where specifically µ′_i = 2∆. Therefore, µj = µ′_j except at index i, and the optimal arm in νµ is the first arm, while in νµ′ arm i is optimal. We abbreviate Pµ′ = Pνµ′,π.
Lemma 4.5 and a simple calculation lead to
Rn(π, νµ) ≥ (n∆/2) Pµ(T1(n) ≤ n/2)   and   Rn(π, νµ′) > (n∆/2) Pµ′(T1(n) > n/2) .
Then, applying the Bretagnolle–Huber inequality from the previous chapter
(Theorem 14.2),
Rn(π, νµ) + Rn(π, νµ′) > (n∆/2) ( Pµ(T1(n) ≤ n/2) + Pµ′(T1(n) > n/2) )
  ≥ (n∆/4) exp( −D(Pµ, Pµ′) ) .   (15.2)
It remains to upper-bound D(Pµ, Pµ′). For this, we use Lemma 15.1 and the definitions of µ and µ′ to get
D(Pµ, Pµ′) = Eµ[Ti(n)] D(N(0, 1), N(2∆, 1)) = Eµ[Ti(n)] (2∆)²/2 ≤ 2n∆²/(k − 1) .
Hence,
Rn(π, νµ) + Rn(π, νµ′) ≥ (n∆/4) exp( −2n∆²/(k − 1) ) .
The result is completed by choosing ∆ = √( (k − 1)/(4n) ) ≤ 1/2, where the inequality follows from the assumptions in the theorem statement. The final steps are lower-bounding exp(−1/2) and using 2 max(a, b) ≥ a + b.
15.3 Notes
1 We used the Gaussian noise model because the KL divergences are so easily calculated in this case, but all that we actually used was that D(Pi, Pi′) = O((µi − µi′)²) when the gap between the means ∆ = µi − µi′ is small. While this is certainly not true for all distributions, it very often is. Why is that? Let {Pµ : µ ∈ R} be some parametric family of distributions on Ω and assume that distribution Pµ has mean µ. Assuming the densities are twice differentiable and that everything is sufficiently nice that integrals and derivatives can be exchanged (as is almost always the case), we can use a Taylor expansion about
µ to show that
D(Pµ, Pµ+∆) ≈ ∂/∂∆ D(Pµ, Pµ+∆)|_{∆=0} ∆ + (1/2) ∂²/∂∆² D(Pµ, Pµ+∆)|_{∆=0} ∆²
  = ∂/∂∆ ∫_Ω log( dPµ/dPµ+∆ ) dPµ |_{∆=0} ∆ + (1/2) I(µ)∆²
  = −∫_Ω ∂/∂∆ log( dPµ+∆/dPµ ) |_{∆=0} dPµ ∆ + (1/2) I(µ)∆²
  = −∫_Ω ∂/∂∆ ( dPµ+∆/dPµ ) |_{∆=0} dPµ ∆ + (1/2) I(µ)∆²
  = −∂/∂∆ ∫_Ω ( dPµ+∆/dPµ ) dPµ |_{∆=0} ∆ + (1/2) I(µ)∆²
  = −∂/∂∆ ∫_Ω dPµ+∆ |_{∆=0} ∆ + (1/2) I(µ)∆²
  = (1/2) I(µ)∆² ,
where I(µ), introduced in the second line, is called the Fisher information
of the family (Pµ )µ at µ. Note that if λ is a common dominating measure for
(Pµ+∆ ) for ∆ small, dPµ+∆ = pµ+∆ dλ and we can write
I(µ) = −∫ ∂²/∂∆² log(pµ+∆) |_{∆=0} pµ dλ ,
which is the form that is usually given in elementary texts. The upshot of all this is that D(Pµ, Pµ+∆) for ∆ small is indeed quadratic in ∆, with the scaling provided by I(µ), and as a result the worst-case regret is always O(√(nk)), provided the class of distributions considered is sufficiently rich and not too bizarre.
2 We have now shown a lower bound that is Ω(√(nk)), while many of the upper bounds were O(log(n)). There is no contradiction because the logarithmic bounds depended on the inverse suboptimality gaps, which may be very large.
3 Our lower bound was only proven for n ≥ k − 1. In Exercise 15.3, we ask you
to show that when n < k − 1, there exists a bandit such that
Rn ≥ n(2k − n − 1)/(2k) > n/2 .
4 The method used to prove Theorem 15.2 can be viewed as a generalisation
and strengthening of Le Cam’s method in statistics. Recall that Eq. (15.2)
establishes that for any µ and µ′,
inf_π sup_ν Rn(π, ν) ≥ (n∆/8) exp( −D(Pµ, Pµ′) ) .
To explain Le Cam’s method, we need a little notation. Let X be an outcome
space, P a set of measures on X and θ : P → Θ, where (Θ, d) is a metric space.
15.4 Bibliographic Remarks
The first work on lower bounds that we know of was the remarkably precise
minimax analysis of two-armed Bernoulli bandits by Vogel [1960]. The Bretagnolle–
Huber inequality (Theorem 14.2) was first used for bandits by Bubeck et al.
[2013b]. As mentioned in the notes, the use of this inequality for proving lower
bounds is known as Le Cam’s method in statistics [Le Cam, 1973]. The proof
of Theorem 15.2 uses the same ideas as Gerchinovitz and Lattimore [2016],
while the alternative proof in Exercise 15.2 is essentially due to Auer et al.
[1995], who analysed the more difficult case where the rewards are Bernoulli (see
Exercise 15.4). Yu [1997] describes some alternatives to Le Cam’s method for the
passive, statistical setting. These alternatives can be (and often are) adapted to
the sequential setting.
15.5 Exercises
(b) Using the previous part, Jensen's inequality and the identity Σ_{i=1}^k E0[Ti(n)] = n, show that
Σ_{i=1}^k Ei[Ti(n)] ≤ n + c Σ_{i=1}^k √( nk E0[Ti(n)] ) ≤ n + ckn .
The method used in this exercise is borrowed from Auer et al. [2002b] and
is closely related to the lower-bound technique known as Assouad’s method
in statistics [Yu, 1997].
15.3 (Lower bound for small horizons) Let k > 1 and n < k. Prove that
for any policy π there exists a Gaussian bandit with unit variance and means
µ ∈ [0, 1]k such that Rn (π, νµ ) ≥ n(2k − n − 1)/(2k) > n/2.
15.4 (Lower bounds for Bernoulli bandits) Recall from Table 4.1 that E_B^k is the set of k-armed Bernoulli bandits. Show that there exists a universal constant c > 0 such that for any 2 ≤ k ≤ n, it holds that:
Rn*(E_B^k) = inf_π sup_{ν∈E_B^k} Rn(π, ν) ≥ c√(nk) .
Hint Use the fact that KL divergence is upper bounded by the χ-squared
distance (Eq. (14.16)).
15.5 In Chapter 9 we proved that if π is the MOSS policy and ν ∈ E_SG^k(1), then
Rn(π, ν) ≤ C√(kn) + Σ_{i:∆i>0} ∆i ,
where C > 0 is a universal constant. Prove that the dependence on the sum
cannot be eliminated.
Hint You will have to use that Ti (t) is an integer for all t.
15.6 (Lower bound for explore-then-commit) Let ETCnm be the explore-
then-commit policy with inputs n and m respectively (Algorithm 1). Prove that
Hint Use an appropriately adjusted form of the chain rule for relative entropy
from Exercise 14.11.
16 Instance-Dependent Lower Bounds
In the last chapter, we proved a lower bound on the minimax regret for subgaussian
bandits with suboptimality gaps in [0, 1]. Such bounds serve as a useful measure
of the robustness of a policy, but are often excessively conservative. This chapter
is devoted to understanding instance-dependent lower bounds, which try to
capture the optimal performance of a policy on a specific bandit instance.
Because the regret is a multi-objective criterion, an algorithm designer might try to design algorithms that perform well on one kind of instance or another.
An extreme example is the policy that chooses At = 1 for all t, which suffers
zero regret when the first arm is optimal and linear regret otherwise. This is a
harsh trade-off, with the price for reducing the regret from logarithmic to zero
on just a few instances being linear regret on the remainder. Surprisingly, this
is the nature of the game in bandits. One can assign a measure of difficulty to
each instance such that policies performing overly well relative to this measure
on some instances pay a steep price on others. The situation is illustrated in
Fig. 16.1.
[Figure 16.1: regret (vertical axis, ranging from √n to n) plotted against instances (horizontal axis), with curves labelled 'minimax optimal', 'instance optimal' and 'over-specialised'.]
Figure 16.1 On the x-axis, the instances are ordered according to the measure of difficulty,
and the y-axis shows the regret (on some scale). In the previous chapter, we proved
that no policy can be entirely below the horizontal ‘minimax optimal’ line. The results
in this chapter show that if the regret of a policy is below the ‘instance optimal’ line at
any point, then it must have regret above the shaded region for other instances, as illustrated by the 'over-specialised' policy.
16.1 Asymptotic Bounds
In finite time, the situation is a little messy, but if one pushes these ideas to
the limit, then for many classes of bandits one can define a precise notion of
instance-dependent optimality.
where ∆i is the suboptimality gap of the ith arm in ν and µ∗ is the mean of the
optimal arm.
Proof Let µi be the mean of the ith arm in ν and di = dinf (Pi , µ∗ , Mi ). The
result will follow from Lemma 4.5, and by showing that for any suboptimal arm
i it holds that
Eνπ [Ti (n)] 1
lim inf ≥ .
n→∞ log(n) di
Fix a suboptimal arm i, let ε > 0 be arbitrary and let ν′ = (Pj′)_{j=1}^k ∈ E be a bandit with Pj′ = Pj for j ≠ i and Pi′ ∈ Mi such that D(Pi, Pi′) ≤ di + ε and µ(Pi′) > µ*, which exists by the definition of di. Let µ′ ∈ R^k be the vector of means of the distributions of ν′. By Lemma 15.1, we have D(Pνπ, Pν′π) ≤ Eνπ[Ti(n)](di + ε), and by Theorem 14.2, for any event A,
Pνπ(A) + Pν′π(A^c) ≥ (1/2) exp( −D(Pνπ, Pν′π) ) ≥ (1/2) exp( −Eνπ[Ti(n)](di + ε) ) .
Now choose A = {Ti(n) > n/2}, and let Rn = Rn(π, ν) and Rn′ = Rn(π, ν′). Then,
Rn + Rn′ ≥ (n/2) ( Pνπ(A)∆i + Pν′π(A^c)(µi′ − µ*) )
  ≥ (n/2) min{∆i, µi′ − µ*} ( Pνπ(A) + Pν′π(A^c) )
  ≥ (n/4) min{∆i, µi′ − µ*} exp( −Eνπ[Ti(n)](di + ε) ) .
Rearranging and taking the limit inferior leads to
Eνπ[Ti(n)] ≥ (1/(di + ε)) log( n min{∆i, µi′ − µ*} / (4(Rn + Rn′)) ) .
M                                         P            dinf(P, µ*, M)
{N(µ, σ²) : µ ∈ R}                        N(µ, σ²)     (µ − µ*)² / (2σ²)
{N(µ, σ²) : µ ∈ R, σ² ∈ (0, ∞)}           N(µ, σ²)     (1/2) log( 1 + (µ − µ*)²/σ² )
{B(µ) : µ ∈ [0, 1]}                       B(µ)         µ log(µ/µ*) + (1 − µ) log((1 − µ)/(1 − µ*))
{U(a, b) : a, b ∈ R}                      U(a, b)      log( 1 + 2((a + b)/2 − µ*)²/(b − a) )
Table 16.1 Expressions for dinf for different parametric families when the mean of P is less than µ*.
Lemma 16.3. Let ν = (Pi) and ν′ = (Pi′) be k-armed stochastic bandits that differ only in the distribution of the reward for action i ∈ [k]. Assume that i is suboptimal in ν and uniquely optimal in ν′. Let λ = µi(ν′) − µi(ν). Then, for any policy π,
Eνπ[Ti(n)] ≥ ( log( min{λ − ∆i(ν), ∆i(ν)} / 4 ) + log(n) − log( Rn(ν) + Rn(ν′) ) ) / D(Pi, Pi′) .   (16.4)
The lemma holds for finite n and any ν and can be used to derive finite-
time instance-dependent lower bounds for any environment class E that is rich
enough. The following result provides a finite-time instance-dependent bound
for Gaussian bandits where the asymptotic notion of consistency is replaced by
an assumption that the minimax regret is not too large. This assumption alone
is enough to show that no policy that is remotely close to minimax optimal can
be much better than UCB on any instance.
E(ν) = {ν′ ∈ E_N^k : µi(ν′) ∈ [µi, µi + 2∆i]} .
Suppose C > 0 and p ∈ (0, 1) are constants and π is a policy such that
Rn(π, ν′) ≤ Cn^p for all n and ν′ ∈ E(ν). Then, for any ε ∈ (0, 1],
Rn(π, ν) ≥ ( 2/(1 + ε)² ) Σ_{i:∆i>0} ( (1 − p) log(n) + log( ε∆i/(8C) ) )^+ / ∆i .   (16.5)
Plugging this into the basic regret decomposition identity (Lemma 4.5) gives the
result.
When p = 1/2, the leading term in this lower bound is approximately half that
of the asymptotic bound. This effect may be real. The class of policies considered
is larger than in the asymptotic lower bound, and so there is the possibility that
the policy that is best tuned for a given environment achieves a smaller regret.
16.3 Notes
1 We mentioned that for most classes E there is a policy satisfying Eq. (16.3).
Its form is derived from the lower bound, and by making some additional
assumptions on the underlying distributions. For details, see the article
by Burnetas and Katehakis [1996], which is also the original source of
Theorem 16.2.
2 The analysis in this chapter only works for unstructured classes. Without this
assumption a policy can potentially learn about the reward from one arm
by playing other arms and this greatly reduces the regret. Lower bounds for
structured bandits are more delicate and will be covered on a case-by-case
basis in subsequent chapters.
3 The classes analysed in Table 16.1 are all parametric, which makes the
calculation possible analytically. There has been relatively little analysis
in the non-parametric case, but we know of three exceptions for which we
simply refer the reader to the appropriate source. The first is the class of
distributions with bounded support: M = {P : Supp(P ) ⊆ [0, 1]}, which has
been analysed exactly [Honda and Takemura, 2010]. The second is the class
of distributions with semi-bounded support, M = {P : Supp(P ) ⊆ (−∞, 1]}
[Honda and Takemura, 2015]. The third is the class of distributions with
bounded kurtosis, M = {P : KurtX∼P [X] ≤ κ} [Lattimore, 2017].
16.5 Exercises
(c) The results from parts (a) and (b) seem to contradict the heuristic analysis
in Note 1 at the end of Chapter 15. Explain.
16.5 (Minimax lower bound) Use Lemma 16.3 to prove Theorem 15.2, possibly
with different constants.
16.6 (Refining the lower-order terms) Let k = 2, and for ν ∈ E_N^2 let ∆(ν) = max{∆1(ν), ∆2(ν)}. Suppose that π is a policy such that for all ν ∈ E_N^2 with ∆(ν) ≤ 1, it holds that
Rn(π, ν) ≤ C log(n)/∆(ν) .   (16.6)
(a) Give an example of a policy satisfying Eq. (16.6).
(b) Assume that i = 2 is suboptimal for ν and let α ≥ 1 be such that Eνπ[T2(n)] = (1/(2∆(ν)²)) log(α). Let ν′ be the alternative environment where µ1(ν′) = µ1(ν) and µ2(ν′) = µ1(ν) + 2∆(ν). Show that
exp( −D(Pνπ, Pν′π) ) = 1/α .
(c) Let A be the event that T2(n) ≥ n/2. Show that
Pνπ(A) ≤ 2C log(n)/(n∆(ν)²)   and   Pν′π(A^c) ≥ 1/(2α) − 2C log(n)/(n∆(ν)²) .
(d) Show that
Rn(π, ν′) ≥ (n∆(ν)/2) ( 1/(2α) − 2C log(n)/(n∆(ν)²) ) .
(e) Show that α ≥ n∆(ν)²/(8C log(n)) and conclude that
Rn(π, ν) ≥ (1/(2∆(ν))) log( n∆(ν)²/(8C log(n)) ) .
In Exercise 7.6 you showed that there exists a bandit policy π such that for some universal constant C > 0 and for any k-armed bandit ν ∈ E_{[0,b]}^k with rewards taking values in [0, b], the regret Rn(π, ν) of π on ν after n rounds satisfies
Rn(π, ν) ≤ C Σ_{i:∆i>0} ( ∆i + ( b + σi²/∆i ) log(n) ) ,
i:∆i >0
where ∆i = ∆i (ν) is the action gap of action i and σi2 = σi2 (ν) is the
variance of the reward of arm i. In particular, this is the inequality shown in
Eq. (7.14). The next exercise asks you to show that the appearance of both
σ2
b and ∆ii is necessary in this bound.
16.7 (Sharpness of Eq. (7.14)) Let k > 1, b > 0 and c > 0 be arbitrary. Show that there is no policy π for which either
lim sup_{n→∞} Rn(π, ν)/log(n) ≤ cb   for all ν ∈ E_{[0,b]}^k ,   (16.7)
or
lim sup_{n→∞} Rn(π, ν)/log(n) ≤ c Σ_{i:∆i>0} σi²(ν)/∆i(ν)   for all ν ∈ E_{[0,b]}^k .   (16.8)
The intuition underlying this result is the following: Eq. (16.7) cannot hold
because this would mean that for some policy, the regret is logarithmic
with a constant independent of the gaps, while intuitively, if the variance
is constant, the coefficient of the logarithmic regret must increase as the gaps get close to zero. Similarly, Eq. (16.8) cannot hold either because we expect a logarithmic regret with a coefficient proportional to the inverse gap even as the variance goes to zero, as the case of Bernoulli bandits shows. This exercise
is due to Audibert et al. [2007].
Prove that
lim sup_{n→∞} sup_{ν∈E} log( V[R̂n(π, ν)] ) / ( (1 − p) log(n) ) ≥ 1 ,
where R̂n(π, ν) = nµ*(ν) − Σ_{t=1}^n µ_{A_t}(ν).
17 High-Probability Lower Bounds
The lower bounds proven in the last two chapters were for stochastic bandits.
In this chapter, we prove high probability lower bounds for both stochastic and
adversarial bandits. Recall that for adversarial bandit x ∈ [0, 1]n×k , the random
regret is
n
X
R̂n = max xti − xtAt
i∈[k]
t=1
We also gave a version of the algorithm that depended on δ ∈ (0, 1) for which
with probability at least 1 − δ,
R̂n = O( √( kn log(k/δ) ) ) .   (17.2)
On the other hand, if the high-probability bound only holds for a single δ, as in
(17.2), then it seems hard to do much better than
Rn ≤ nδ + O( √( kn log(k/δ) ) ) ,
which with the best choice of δ leads to a bound of O( √( kn log(n) ) ).
For simplicity, we start with the stochastic setting before explaining how to
convert the arguments to the adversarial model. There is no randomness in the
expected regret, so in order to derive a high-probability bound, we define the
random pseudo-regret by
R̄n = Σ_{i=1}^k Ti(n)∆i ,
Theorem 17.1. Let n ≥ 1 and k ≥ 2 and B > 0, and let π be a policy such that for any ν ∈ E^k,
Rn(π, ν) ≤ B √( (k − 1)n ) .   (17.4)
Corollary 17.2. Let n ≥ 1 and k ≥ 2. Then, for any policy π and δ ∈ (0, 1) such that
nδ ≤ √( n(k − 1) log( 1/(4δ) ) ) ,   (17.6)
Proof We prove the result by contradiction. Assume that the conclusion does
not hold for π and let δ ∈ (0, 1) satisfy (17.6). Then, for any bandit problem
ν ∈ E^k, the expected regret of π is bounded by
Rn(π, ν) ≤ nδ + √( (n(k − 1)/2) log( 1/(4δ) ) ) ≤ √( 2n(k − 1) log( 1/(4δ) ) ) .
Therefore, π satisfies the conditions of Theorem 17.1 with B = √( 2 log(1/(4δ)) ),
which implies that there exists some bandit problem ν ∈ E k such that (17.7)
holds, contradicting our assumption.
Corollary 17.3. Let k ≥ 2 and p ∈ (0, 1) and B > 0. Then, there does not exist a policy π such that for all n ≥ 1, δ ∈ (0, 1) and ν ∈ E^k,
P( R̄n(π, ν) ≥ B √( (k − 1)n ) log^p(1/δ) ) < δ .
Proof We proceed by contradiction. Suppose that such a policy exists. Choosing
δ sufficiently small and n sufficiently large ensures that
(1/B) log( 1/(4δ) ) ≥ B log^p(1/δ)   and   (1/B) √( n(k − 1) log( 1/(4δ) ) ) ≤ n .
We suspect there exists a policy π and universal constant B > 0 such that for all ν ∈ E^k and δ ∈ (0, 1),
P( R̄n(π, ν) ≥ B √(kn) log(1/δ) ) ≤ δ .
17.2 Adversarial Bandits
We now explain how to translate the ideas in the previous section to the adversarial model. Let π = (πt)_{t=1}^n be a fixed policy, and recall that for x ∈ [0, 1]^{n×k}, the random regret is
R̂n = max_{i∈[k]} Σ_{t=1}^n ( x_{ti} − x_{tA_t} ) .
Let Fx be the cumulative distribution function of the law of R̂n when policy π
interacts with the adversarial bandit x ∈ [0, 1]n×k .
The proof is a bit messy, but is not completely without interest. For the sake of
brevity, we explain only the high-level ideas and refer you elsewhere for the gory
details. There are two difficulties in translating the arguments in the previous
section to the adversarial model. First, in the adversarial model, we need the
rewards to be bounded in [0, 1]. The second difficulty is we now analyse the
adversarial regret rather than the random pseudo-regret. Given a measure Q, let
X ∈ [0, 1]n×k and (At )nt=1 be a collection of random variables on a probability
space (Ω, F, PQ ) such that
(a) PQ (X ∈ B) = Q(B) for all B ∈ B([0, 1]n×k ); and
(b) PQ(At | A1, X1, . . . , At−1, Xt−1) = πt(At | A1, X1, . . . , At−1, Xt−1) almost surely, where Xs = X_{sA_s}.
Then the regret is a random variable R̂n : Ω → R defined by
R̂n = max_{i∈[k]} Σ_{t=1}^n ( X_{ti} − X_{tA_t} ) .
Suppose we sample X ∈ [0, 1]^{n×k} from a distribution Q on ([0, 1]^{n×k}, B([0, 1]^{n×k})).
Claim 17.5. Suppose that X ∼ Q, where Q is a measure on [0, 1]n×k with the
Borel σ-algebra and that EQ [1 − FX (u)] ≥ δ. Then there exists an x ∈ [0, 1]n×k
such that 1 − Fx (u) ≥ δ.
The next step is to choose Q and argue that EQ[1 − FX(u)] ≥ δ for sufficiently large u. To do this, we need a truncated normal distribution. Define the clipping function
clip_{[0,1]}(x) = 1 if x > 1,  0 if x < 0,  and x otherwise.
Let σ and ∆ be positive constants to be chosen later and (ηt)_{t=1}^n a sequence of independent random variables with ηt ∼ N(1/2, σ²). For each i ∈ [k], let Qi be the distribution of X ∈ [0, 1]^{n×k}, where
X_{tj} = clip_{[0,1]}(ηt + ∆) if j = 1,  clip_{[0,1]}(ηt + 2∆) if j = i and i ≠ 1,  and clip_{[0,1]}(ηt) otherwise.
Notice that under any Qi for fixed t, the random variables Xt1 , . . . , Xtk are not
independent, but for fixed j, the random variables X1j , . . . , Xnj are independent
and identically distributed. Let PQi be the law of X1 , A1 , . . . , An , Xn when policy
π interacts with adversarial bandit sampled from X ∼ Qi .
Claim 17.6. If σ > 0 and ∆ = σ √( ((k − 1)/(2n)) log( 1/(8δ) ) ), then there exists an arm i such that
P_{Qi}( Ti(n) < n/2 ) ≥ 2δ .
The proof of this claim follows along the same lines as the theorems in the
previous section. All that changes is the calculation of the relative entropy. The
last step is to relate Ti (n) to the random regret. In the stochastic model, this was
The following claim upper-bounds the number of rounds in which clipping occurs
with high probability.
Combining Claim 17.6 and Claim 17.7 with Eq. (17.8) shows there exists an
arm i such that
P_{Qi}( R̂n ≥ n∆/4 ) ≥ δ ,
which by the definition of ∆ and Claim 17.5 implies Theorem 17.4.
17.3 Notes
1 The adversarial bandits used in Section 17.2 had the interesting property that
the same arm has the best reward in every round (not just the best mean).
This cannot be exploited by an algorithm, however, because it only gets a
single observation in each round.
2 In Theorem 17.4, we did not make any assumptions on the algorithm. If we had assumed the algorithm enjoyed an expected regret bound of Rn ≤ B√(kn), then we could conclude that for each sufficiently small δ ∈ (0, 1) there exists an adversarial bandit such that
P( R̂n ≥ (c/B) √(kn) log( 1/(2δ) ) ) ≥ δ ,
which shows that our high-probability upper bounds for Exp3-IX are nearly
tight.
17.4 Bibliographic Remarks
The results in this chapter are by Gerchinovitz and Lattimore [2016], who also
provide lower bounds on what is achievable when the loss matrix exhibits nice
structure such as low variance or similarity between losses of the arms.
17.5 Exercises
18 Contextual Bandits
Whenever you design a new benchmark, there are several factors to consider.
Competing with a poor benchmark does not make sense, since even an
algorithm that perfectly matches the benchmark will perform poorly. At
the same time, competing with a better benchmark can be harder from a
learning perspective, and this penalty must be offset against the benefits.
While contextual bandits can be studied in both the adversarial and stochastic
frameworks, in this chapter we focus on the k-armed adversarial model. As usual,
the adversary secretly chooses (xt )nt=1 , where xt ∈ [0, 1]k with xti the reward
associated with arm i in round t. The adversary also secretly chooses a sequence
of contexts (ct)_{t=1}^n, where ct ∈ C with C a set of possible contexts. In each round, the learner observes ct, chooses an action At and receives reward x_{tA_t}. The interaction protocol is shown in Fig. 18.1.
Figure 18.1  For rounds t = 1, 2, . . . , n:
  Learner observes context ct ∈ C, where C is an arbitrary fixed set of contexts.
  Learner selects distribution Pt ∈ P_{k−1} and samples At from Pt.
  Learner observes reward Xt = x_{tA_t}.
A natural way to define the regret is to compare the rewards collected by
the learner with the rewards collected by the best context-dependent policy in
hindsight:
Rn = E[ Σ_{c∈C} max_{i∈[k]} Σ_{t∈[n]: ct=c} ( x_{ti} − Xt ) ] .   (18.1)
If the set of possible contexts is finite, then a simple approach is to use a separate
instance of Exp3 for each context. Let
Rn^c = E[ max_{i∈[k]} Σ_{t∈[n]: ct=c} ( x_{ti} − Xt ) ]
be the regret due to context c ∈ C. When using a separate instance of Exp3 for
each context, we can use the results of Chapter 11 to bound
Rn^c ≤ 2 √( k log(k) Σ_{t=1}^n I{ct = c} ) ,   (18.2)
where the sum inside the square root counts the number of times context c ∈ C is
observed. Because this is not known in advance, it is important to use an anytime
version of Exp3 for which the above regret bound holds without needing to tune
a learning rate that depends on the number of times the context is observed (see
Exercise 28.13). Substituting (18.2) into the regret leads to
Rn = Σ_{c∈C} Rn^c ≤ 2 Σ_{c∈C} √( k log(k) Σ_{t=1}^n I{ct = c} ) .   (18.3)
The worst case is when all contexts are observed equally often, in which case we have
Rn ≤ 2 √( nk|C| log(k) ) .   (18.4)
Jensen's inequality applied to Eq. (18.3) shows that this really is the worst case (Exercise 18.1).
The regret in Eq. (18.4) is different from the regret studied in Chapter 11. If we ignore the context and run the standard Exp3 algorithm, then we would have
E[ Σ_{t=1}^n Xt ] ≥ max_{i∈[k]} Σ_{t=1}^n x_{ti} − 2 √( kn log(k) ) .
When the context set C is large, using one bandit algorithm per context will
almost always be a poor choice because the additional precision is wasted unless
the amount of data is enormous. Fortunately, however, it is seldom the case that
the context set is both large and unstructured. To illustrate a common situation,
we return to the movie recommendation theme, where the actions are movies
and the context contains user information such as age, gender and recent movie
preferences. In this case, the context space is combinatorially large, but there
is a lot of structure inherited from the fact that the space of movies is highly
structured and users with similar demographics are more likely to have similar
preferences.
We start by rewriting Eq. (18.1) in an equivalent form. Let Φ be the set of all
functions from C to [k]. Then,
Rn = E[ max_{φ∈Φ} Σ_{t=1}^n ( x_{tφ(ct)} − Xt ) ] .   (18.5)
The discussion above suggests that a slightly smaller set Φ may lead to more
reward. In what follows, we describe some of the most common ideas of how to
do this.
Figure 18.2 Prediction with expert advice. The experts, upon seeing a foot, give advice on which socks would fit it best. If the owner of the foot is happy, the recommendation system earns a cookie!
Partitions
Let P ⊂ 2^C be a partition of C, which means that sets (or parts) in P are disjoint and ∪_{P∈P} P = C. Then define Φ to be the set of functions from C to [k] that are
constant on each part in P. In this case, we can run a version of Exp3 for each
part, which means the regret depends on the number of parts |P| rather than on
the number of contexts.
Similarity Functions
Let s : C × C → [0, 1] be a function measuring the similarity between pairs of
contexts on the [0, 1]-scale. Then let Φ be the set of functions φ : C → [k] such
that the average dissimilarity
(1/|C|²) Σ_{c,d∈C} (1 − s(c, d)) I{φ(c) ≠ φ(d)}
is below a user-tuned threshold θ ∈ (0, 1). It is not clear anymore that we can
control the regret (18.5) using some simple meta-algorithm on Exp3, but keeping
the regret small is still a meaningful objective.
should think more generally about some subset Φ of functions without considering
the internal structure of Φ. In fact, once Φ has been chosen, the contexts play
very little role. All we need in each round is the output of each function.
This framework assumes the experts are oblivious in the sense that their
predictions do not depend on the actions of the learner.
18.3 Exp4
The number 4 in Exp4 is not just an increased version number, but indicates
the four e’s in the long name of the algorithm, which is exponential weighting
for exploration and exploitation with experts. The idea of the algorithm is
very simple. Since exponential weighting worked so well in the standard bandit
problem, we aim to adopt it to the problem at hand. However, since the goal is to
compete with the best expert in hindsight, it is not the actions that we will score,
but the experts. Exp4 thus maintains a probability distribution Qt over experts
and uses this to come up with the next action in the obvious way, by first choosing
an expert Mt at random from Qt and then following the chosen expert's advice to choose At ∼ E^{(t)}_{Mt}. The reader is invited to check for themselves that this is the same as sampling At from Pt = Qt E^{(t)}, where Qt is treated as a row vector. Once
the action is chosen, one can use their favorite reward estimation procedure to
estimate the rewards for all the actions, which is then used to estimate how much
total reward the individual experts would have made so far. The reward estimates
are then used to update Qt using exponential weighting. The pseudocode of Exp4
is given in Algorithm 11.
Algorithm 11: Exp4
1: Input: n, k, M, η, γ
2: Set Q1 = (1/M, . . . , 1/M) ∈ [0, 1]^{1×M} (a row vector)
3: for t = 1, . . . , n do
4:   Receive advice E^{(t)}
5:   Choose the action At ∼ Pt, where Pt = Qt E^{(t)}
6:   Receive the reward Xt = x_{tA_t}
7:   Estimate the action rewards: X̂_{ti} = 1 − I{At = i}(1 − Xt)/(P_{ti} + γ)
8:   Propagate the rewards to the experts: X̃_t = E^{(t)} X̂_t
9:   Update the distribution Qt using exponential weighting:
       Q_{t+1,i} = exp(η X̃_{ti}) Q_{ti} / Σ_j exp(η X̃_{tj}) Q_{tj}   for all i ∈ [M]
The algorithm uses O(M ) memory and O(M + k) computation per round
(when sampling in two steps). Hence it is only practical when both M and k are
reasonably small.
We restrict our attention to the case when γ = 0, which is the original algorithm.
The version where γ > 0 is called Exp4-IX and its analysis is left for Exercise 18.3.
18.4 Regret Analysis
Theorem 18.1. Let γ = 0 and η = √( 2 log(M)/(nk) ), and denote by Rn the expected regret of Exp4 defined in Algorithm 11 after n rounds. Then,
Rn ≤ √( 2nk log(M) ) .   (18.7)
After translating the notation, the proof of the following lemma can be extracted
from the analysis of Exp3 in the proof of Theorem 11.2 (Exercise 18.2).
Lemma 18.2. For any m* ∈ [M], it holds that
Σ_{t=1}^n X̃_{tm*} − Σ_{t=1}^n Σ_{m=1}^M Q_{tm} X̃_{tm} ≤ log(M)/η + (η/2) Σ_{t=1}^n Σ_{m=1}^M Q_{tm} (1 − X̃_{tm})² .
Proof of Theorem 18.1 Let Ft = σ(E^{(1)}, A1, E^{(2)}, A2, . . . , At−1, E^{(t)}) and abbreviate Et[·] = E[· | Ft]. Let m* be the index of the best-performing expert in hindsight:
m* = argmax_{m∈[M]} Σ_{t=1}^n E_m^{(t)} x_t ,   (18.8)
which is not random by the assumption that the experts are oblivious. Applying Lemma 18.2 shows that
Σ_{t=1}^n X̃_{tm*} − Σ_{t=1}^n Σ_{m=1}^M Q_{tm} X̃_{tm} ≤ log(M)/η + (η/2) Σ_{t=1}^n Σ_{m=1}^M Q_{tm} (1 − X̃_{tm})² .   (18.9)
Rn ≤ log(M)/η + (η/2) E[ Σ_{t=1}^n Σ_{m=1}^M Q_{tm} (1 − X̃_{tm})² ] .   (18.11)
Like in Chapter 11, it is more convenient to work with losses. Let Ŷ_{ti} = 1 − X̂_{ti}, y_{ti} = 1 − x_{ti} and Ỹ_{tm} = 1 − X̃_{tm}. Note that Ỹ_t = E^{(t)} Ŷ_t and recall the notation A_{ti} = I{At = i}, which means that Ŷ_{ti} = A_{ti} y_{ti} / P_{ti} and
Et[ Ỹ_{tm}² ] = Et[ ( E^{(t)}_{mA_t} y_{tA_t} / P_{tA_t} )² ] = Σ_{i=1}^k ( E^{(t)}_{mi} y_{ti} )² / P_{ti} ≤ Σ_{i=1}^k E^{(t)}_{mi} / P_{ti} .   (18.12)
In Exercise 18.7, you will show that if all experts make identical recommendations, then E_t* = t and that no matter how the experts behave,
E_n* ≤ n min(k, M) .   (18.13)
In this sense E_n*/n can be viewed as the effective number of experts, which depends on the degree of disagreement in the experts' recommendations. By
modifying the algorithm to use a time varying learning rate, one can prove the
following theorem.
Theorem 18.3. Assume the same conditions as in Theorem 18.1, except let ηt = √( log(M)/E_t* ). Then there exists a universal constant C > 0 such that
Rn ≤ C √( E_n* log(M) ) .   (18.14)
The proof of Theorem 18.3 is not hard and is left to Exercise 18.4. The bound
tells us that Exp4 with the suggested learning rate is able to adapt to the degree of
disagreement between the experts, which seems like quite an encouraging result.
As a further benefit, the learning rate does not depend on the horizon so the
algorithm is anytime.
18.5 Notes
1 The most important concept in this chapter is that there are trade-offs when
choosing the competitor class. A large class leads to a more meaningful definition
of the regret, but also increases the regret. This is similar to what we have
observed in stochastic bandits. Tuning an algorithm for a restricted environment
class usually allows faster learning, but the resulting algorithms can fail when
interacting with an environment that does not belong to the restricted class.
2 The Exp4 algorithm serves as a tremendous building block for other bandit
problems by defining your own experts. An example is the application of Exp4
to non-stationary bandits that we explore in Chapter 31, which is one of the
rare cases where Exp4 can be computed efficiently with a combinatorially large
number of experts. When Exp4 does not have an efficient implementation, it
often provides a good starting place to derive regret bounds without worrying
about computation (for an example, see Exercise 18.5).
3 The bandits with expert advice framework is clearly more general than
contextual bandits. With the terminology of the bandits with expert advice
framework, the contextual bandit problem arises when the experts are given
by static C → [k] maps.
4 A significant challenge is that a naive implementation of Exp4 has running
time O(M + k) per round, which can be enormous if either M or k is large. In
general there is no solution to this problem, but in some cases the computation
can be reduced significantly. One situation where this is possible is when the
learner has access to an optimisation oracle that for any context/reward
sequence returns the expert that would collect the most reward in this sequence
(this is equivalent to solving the offline problem Eq. (18.8)). In Chapter 30
we show how to use an offline optimisation oracle to learn efficiently in
combinatorial bandit problems. The idea is to solve a randomly perturbed
optimisation problem (leading to the so-called follow-the-perturbed-leader class
of algorithms) and then show that the randomness in the outputs provides
sufficient exploration. However, as we shall see there, these algorithms will
have some extra information, which makes estimating the rewards possible.
5 In the stochastic contextual bandit problem, it is assumed that the
context/reward pairs form a sequence of independent and identically distributed
random variables. Let Φ be a set of functions from C to [k] and suppose the
learner has access to an optimisation oracle capable of finding
argmax_{φ∈Φ} Σ_{s=1}^t x_{sφ(c_s)}
problem. The distribution is constrained so that the importance weights will not
be too large, while the regret estimates averaged over the chosen distribution
will stay small. To reduce the computation cost, this distribution is updated
periodically with the length of the interval between the updates exponentially
growing. The significance of this result is that it reduces contextual bandits
to (cost-sensitive) empirical risk minimisation (ERM), which means that any
advance in solving cost-sensitive ERM problems automatically translates to
bandits.
6 The development of efficient algorithms for ERM is a major topic in supervised
learning. Note that ERM can be NP-hard even in simple cases like linear
classification [Shalev-Shwartz and Ben-David, 2014, §8.7].
7 The bound on the regret stated in Theorem 18.3 is data dependent. Note
that in adversarial bandits the data and instance are the same thing, while
in stochastic bandits the instance determines the probability distributions
associated with each arm and the data corresponds to samples from those
distributions. In any case a data/instance-dependent bound should usually be
preferred if it is tight enough to imply the worst-case optimal bounds.
8 There are many points we have not developed in detail. One is high-probability
bounds, which we saw in Chapter 12 and can also be derived here. We also
have not mentioned lower bounds. The degree to which the bounds are tight
depends on whether or not there is additional structure in the experts. In later
chapters we will see examples when the results are essentially tight, but there
are also cases when they are not.
9 Theorem 18.3 is the first result where we used a time-varying learning rate.
As we shall see in later chapters, time-varying learning rates are a powerful
way to make online algorithms adapt to specific characteristics of the problem
instance.
18.6 Bibliographic Remarks
For a good account on the history of contextual bandits, see the article by Tewari
and Murphy [2017]. The Exp4 algorithm was introduced by Auer et al. [2002b],
and Theorem 18.1 essentially matches theorem 7.1 of their paper (the constant
in Theorem 18.1 is slightly smaller). McMahan and Streeter [2009] noticed that
neither the number of experts nor the size of the action set are what really matters
for the regret, but rather the extent to which the experts tend to agree. McMahan
and Streeter [2009] also introduced the idea of finding the distribution to be played
to be maximally ‘similar’ to Pt (i) while ensuring sufficient exploration of each of
the experts. The idea of explicitly optimising a probability distribution with these
objectives in mind is at the heart of several subsequent works [e.g. Agarwal et al.,
2014]. While Theorem 18.3 is inspired by this work, the result appears to be new
and goes beyond the work of McMahan and Streeter [2009] because it shows
that all one needs is to adapt the learning rate based on the degree of agreement
amongst the experts. Neu [2015a] proves high-probability bounds for Exp4-IX.
You can follow in his footsteps by solving Exercise 18.3. Another way to get
high-probability bounds is to generalise Exp3.P, which was done by Beygelzimer
et al. [2011]. As we mentioned in Note 5, there exist efficient algorithms for
stochastic contextual bandit problems when a suitable optimisation oracle is
available [Agarwal et al., 2014]. An earlier attempt to address the problem of
reducing contextual bandits to cost-sensitive ERM is by Dudı́k et al. [2011]. The
adversarial case of static experts is considered by Syrgkanis et al. [2016], who
prove suboptimal (worse than √n) regret bounds under various conditions for
follow-the-perturbed-leader for the transductive setting when the contexts are
available at the start. The case when the contexts are independent and identically
distributed, but the reward is adversarial is studied by Lazaric and Munos [2009]
for the finite expert case, while Rakhlin and Sridharan [2016] considers the case
when an ERM oracle is available. The paper of Rakhlin and Sridharan [2016] also
considers the more realistic case when only an approximation oracle is available
for the ERM problem. What is notable about this work is that they demonstrate
regret bounds with a moderate blow-up, but without changing the definition
of the regret. Kakade et al. [2008] consider contextual bandit problems with
adversarial context-loss sequences, where all but one action suffers a loss of one in
every round. This can also be seen as an instance of multi-class classification
with bandit feedback where labels to be predicted are identified with actions
and the only feedback received is whether the label predicted was correct, with
the goal of making as few mistakes as possible. Since minimising the regret is in
general hard in this non-convex setting, just like most of the machine learning
literature on classification, Kakade et al. [2008] provide results in the form of
mistake bounds for linear classifiers where the baseline is not the number of
mistakes of the best linear classifier, but is a convex upper bound on it. The
recent book by Shalev-Shwartz and Ben-David [2014] lists some hardness results
for ERM. For a more comprehensive treatment of computation in learning theory,
the reader can consult the book by Kearns and Vazirani [1994].
18.7 Exercises
18.3 In this exercise you will prove an analogue of Theorem 12.1 for Exp4-IX.
In the contextual setting, the random regret is
R̂n = max_{m∈[M]} Σ_{t=1}^n ( E_m^{(t)} x_t − Xt ) .
Hint The key idea is to modify the analysis of Exp3 to handle decreasing
learning rates. Of course you can do this directly yourself, or you can peek ahead
to Chapter 28, and specifically Exercises 28.12 and 28.13.
18.5 Let x1 , . . . , xn be a sequence of reward vectors chosen in advance by
an adversary with xt ∈ [0, 1]k . Furthermore, let o1 , . . . , on be a sequence of
observations, also chosen in advance by an adversary with ot ∈ [O] for some
fixed O ∈ N+ . Then let H be the set of functions φ : [O]m → [k] where m ∈ N+ .
In each round the learner observes ot and should choose an action At based on
o1 , A1 , X1 , . . . , ot−1 , At−1 , Xt−1 , ot , and the regret is
Rn = max_{φ∈H} Σ_{t=1}^n ( x_{tφ(o_t, o_{t−1}, . . . , o_{t−m})} − x_{tA_t} ) ,
where ot = 1 for t ≤ 0. This means the learner is competing with the best
predictor in hindsight that uses only the last m observations. Prove there exists
an algorithm such that
E[Rn] ≤ √( 2knO^m log(k) ) .
ξ on C and the rewards are (Xt )nt=1 , where the conditional law of Xt given Ct
and At is P_{C_t A_t}. The mean reward when choosing action i ∈ [k] having observed context c ∈ C is µ(c, i) = ∫ x dP_{ci}(x). Let Φ be a subset of functions from C to
[k]. The regret is
Rn = n sup_{φ∈Φ} µ(φ) − E[ Σ_{t=1}^n Xt ] ,
where µ(φ) = ∫ µ(c, φ(c)) dξ(c). Consider a variation of explore-then-commit,
which explores uniformly at random for the first m rounds. Then define
µ̂(φ) = (k/m) Σ_{t=1}^m I{At = φ(Ct)} Xt .
where X̂ti = kI {At = φ(Ct )} Xt . When no maximiser exists you may assume
that µ̂(φ̂∗ ) ≥ supφ∈Φ µ̂(φ) − ε for any ε > 0 of your choice. Show that when Φ
is finite, then for appropriately tuned m the expected regret of this algorithm satisfies
Rn = O( n^{2/3} ( k log(|Φ|) )^{1/3} ) .
18.9 Consider a stochastic contextual bandit problem with the same set-up as
the previous exercise and k = 2 arms. As before, let Φ be a set of functions from
C to [k]. Design a policy such that
Rn = n max_{φ∈Φ} µ(φ) − E[ Σ_{t=1}^n Xt ] ≤ C √( ndk log(n/d) ) ,
Xt = r(Ct , At ) + ηt ,
where r : C × [k] → R is called the reward function and ηt is the noise, which
we will assume is conditionally 1-subgaussian. Precisely, let
The noise could have been chosen to be σ-subgaussian for any known σ², but like in earlier chapters, we save ourselves some ink by fixing its value to σ² = 1.
Remember from Chapter 5 that subgaussian random variables have zero mean,
so the assumption also implies that E [ηt | Ft ] = 0 and E [Xt | Ft ] = r(Ct , At ).
If r was given, then the action in round t with the largest expected return is A*_t ∈ argmax_{a∈[k]} r(Ct, a). Notice that this action is now a random variable because it depends on the context Ct. The loss due to the lack of knowledge of r makes the learner incur the (expected) regret
Rn = E[ Σ_{t=1}^n max_{a∈[k]} r(Ct, a) − Σ_{t=1}^n Xt ] .
Like in the adversarial setting, there is one big caveat in this definition of the
regret. Since we did not make any restrictions on how the contexts are chosen, it
could be that choosing a low-rewarding action in the first round might change
the contexts observed in subsequent rounds. Then the learner could potentially
achieve an even higher cumulative reward by choosing a ‘suboptimal’ arm initially.
As a consequence, this definition of the regret is most meaningful when the actions
of the learner do not (greatly) affect subsequent contexts.
One way to eventually learn an optimal policy is to estimate r(c, a) for each
(c, a) ∈ C × [k] pair. As in the adversarial setting, this is ineffective when the
number of context-action pairs is large. In particular, the worst-case regret over all possible contextual problems with M contexts and mean reward in [0, 1] is at least Ω(√(nMk)). While this may not look bad, M is often astronomical (for example, 2^100). The argument that gives rise to the mentioned lower bound
relies on designing a problem where knowledge of r(c, ·) for context c provides
no useful information about r(c0 , ·) for some different context c0 . Fortunately,
in most interesting applications, the set of contexts is highly structured, which
is often captured by the fact that r(·, ·) changes ‘smoothly’ as a function of its
arguments.
A simple, yet interesting assumption to capture further information about the dependence of rewards on context is to assume that the learner has access to a map ψ : C × [k] → R^d, and for an unknown parameter vector θ* ∈ R^d, it holds that
r(c, a) = ⟨θ*, ψ(c, a)⟩ .   (19.1)
book. If the visitors and books are assigned to finitely many categories, indicator
variables of all possible combinations of these categories could be used to create
the feature map. Of course, many other possibilities exist. For example, you can
train a neural network (deep or not) on historical data to predict the revenue
and use the nonlinear map that we obtained by removing the last layer of the
neural network. The subspace Ψ spanned by the feature vectors {ψ(c, a)}c,a in
Rd is called the feature space.
If ‖·‖ is a norm on R^d, then an assumption on ‖θ*‖ implies smoothness of r. In particular, from Hölder's inequality,
|r(c, a) − r(c′, a′)| ≤ ‖θ*‖ ‖ψ(c, a) − ψ(c′, a′)‖_* ,
where ‖·‖_* denotes the dual of ‖·‖. Restrictions on ‖θ*‖ have a similar effect
to assuming that the dimensionality d is finite. In fact, one may push this to
the extreme and allow d to be infinite, an approach that can buy tremendous
flexibility and makes the linearity assumption less limiting.
Stochastic linear bandits arise from realising that under Eq. (19.1), all that
matters is the feature vector that results from choosing a given action and not
the ‘identity’ of the action itself. This justifies studying the following simplified
model: in round t, the learner is given the decision set At ⊂ Rd , from which it
chooses an action At ∈ At and receives reward
Xt = ⟨θ*, At⟩ + ηt ,
where ηt is 1-subgaussian given A1 , A1 , X1 , . . . , At−1 , At−1 , Xt−1 , At and At . The
random (pseudo-)regret and regret are defined by
R̂n = Σ_{t=1}^n max_{a∈A_t} ⟨θ*, a − At⟩ ,
Rn = E[ R̂n ] = E[ Σ_{t=1}^n max_{a∈A_t} ⟨θ*, a⟩ − Σ_{t=1}^n Xt ] ,
problems in directed graphs and choosing spanning trees) can be written as linear
optimisation problems over some combinatorial set A obtained from considering
incidence vectors often associated with some graph. Some of these topics will be
covered later in Chapter 30.
19.2 Stochastic Linear Bandits
be an upper bound on the mean pay-off ⟨θ*, a⟩ of a. The UCB algorithm that uses the confidence set Ct at time t then selects
At = argmax_{a∈A_t} UCBt(a) .   (19.3)
At first sight it is not at all obvious what Ct should look like. After all, it is a
subset of Rd , not just an interval like the confidence intervals about the empirical
estimate of the mean reward for a single action that we saw in the previous
chapters. While we specify the analytic form of a possible construction for Ct here,
there are some details in choosing some of the parameters in this construction. As
they are both delicate and important, we dedicate the next chapter to discussing
them.
Following the idea for UCB, we need an analogue for the empirical estimate
of the unknown quantity, which in this case is θ∗ . There are several principles
one might use for deriving such an estimate. For now we use the regularised least-squares estimator
θ̂t = argmin_{θ∈R^d} ( Σ_{s=1}^t ( Xs − ⟨θ, As⟩ )² + λ‖θ‖₂² ) ,   (19.4)
the solution of which is
θ̂t = V_t^{−1} Σ_{s=1}^t As Xs ,   (19.5)
where
V0 = λI   and   Vt = V0 + Σ_{s=1}^t As As^⊤ .   (19.6)
The impatient reader who is puzzled by the form of Et may briefly think of the case when ηs ∼ N(0, σ²), A1, . . . , At−1 are deterministic and span R^d, so that we can take λ = 0. In this case, one easily computes that with V = Vt−1, Z = V^{1/2}(θ̂t−1 − θ*) ∼ N(0, I), so that ‖Z‖² is the sum of the squares of d independent standard normal random variables and thus follows the χ²-distribution (with d degrees of freedom), from which one can find the appropriate value of βt−1. As we shall see, the expression one can get from this calculation will, more or less, be still correct in the general case.
19.3 Regret Analysis
We prove a regret bound for LinUCB under the assumption that the confidence
intervals indeed contain the true parameter with high probability and boundedness
conditions on the action set and rewards.
Assumption 19.1. The following hold:
(a) 1 ≤ β1 ≤ β2 ≤ · · · ≤ βn.
(b) max_{t∈[n]} sup_{a,b∈A_t} ⟨θ*, a − b⟩ ≤ 1.
(c) ‖a‖₂ ≤ L for all a ∈ ∪_{t=1}^n A_t.
(d) There exists a δ ∈ (0, 1) such that with probability 1 − δ, for all t ∈ [n], θ* ∈ Ct, where Ct satisfies Eq. (19.7).
Corollary 19.3. Under the conditions of Assumption 19.1, the expected regret of LinUCB with δ = 1/n is bounded by
Rn ≤ Cd √n log(nL) ,
The proof of Theorem 19.2 depends on the following lemma, often called the
elliptical potential lemma.
Σ_{t=1}^n ( 1 ∧ ‖a_t‖²_{V_{t−1}^{−1}} ) ≤ 2 log( det Vn / det V0 ) ≤ 2d log( ( trace V0 + nL² ) / ( d det(V0)^{1/d} ) ) .
We now argue that this last expression is log( det Vn / det V0 ). For t ≥ 1, we have
Vt = Vt−1 + a_t a_t^⊤ = V_{t−1}^{1/2} ( I + V_{t−1}^{−1/2} a_t a_t^⊤ V_{t−1}^{−1/2} ) V_{t−1}^{1/2} ,
and hence
det(Vt) = det(Vt−1) det( I + V_{t−1}^{−1/2} a_t a_t^⊤ V_{t−1}^{−1/2} ) = det(Vt−1) ( 1 + ‖a_t‖²_{V_{t−1}^{−1}} ) ,
where the second equality follows because the matrix I + yy^⊤ has eigenvalues 1 + ‖y‖₂² and 1, as well as the fact that the determinant of a matrix is the product of its eigenvalues. Putting things together, we see that
det(Vn) = det(V0) Π_{t=1}^n ( 1 + ‖a_t‖²_{V_{t−1}^{−1}} ) ,   (19.9)
which is equivalent to the first inequality that we wanted to prove. To get the second inequality, note that by the inequality of arithmetic and geometric means,
det(Vn) = Π_{i=1}^d λi ≤ ( trace Vn / d )^d ≤ ( ( trace V0 + nL² ) / d )^d ,
Proof of Theorem 19.2 By part (d) of Assumption 19.1, it suffices to prove the bound on the event that θ* ∈ Ct for all rounds t ∈ [n]. Let A*_t = argmax_{a∈A_t} ⟨θ*, a⟩ be an optimal action for round t and rt be the instantaneous regret in round t defined by
rt = ⟨θ*, A*_t − At⟩ .
Let θ̃t ∈ Ct be the parameter in the confidence set for which ⟨θ̃t, At⟩ = UCBt(At). Then, using the fact that θ* ∈ Ct and the definition of the algorithm leads to
rt = ⟨θ*, A*_t − At⟩ ≤ ⟨θ̃t, At⟩ − ⟨θ*, At⟩ = ⟨θ̃t − θ*, At⟩ .
Using the Cauchy–Schwarz inequality and the assumption that θ* ∈ Ct and the facts that θ̃t ∈ Ct and Ct ⊆ Et leads to
The result is completed using Lemma 19.4, which depends on part (c) of
Assumption 19.1.
19.3.1 Computation
An obvious question is whether or not the optimisation problem in Eq. (19.3) can
be solved efficiently. First note that the computation of At can also be written as
This is a bilinear optimisation problem over the set At × Ct . In general, not much
can be said about the computational efficiency of solving this problem. There are
two notable special cases, however.
(a) Suppose that a(θ) = argmaxa∈At hθ, ai can be computed efficiently for any θ
and that Ct = co(φ1 , . . . , φm ) is the convex hull of a finite set. Then At can
be computed by finding a(φ1 ), . . . , a(φm ) and choosing At = a(φi ), where i
maximises hφi , a(φi )i.
(b) Assume that Ct = Et is the ellipsoid given in Eq. (19.7) and At is a small
finite set. Then the action At from Eq. (19.12) can be found using
At = argmax_{a∈A_t} ( ⟨θ̂t−1, a⟩ + √(βt) ‖a‖_{V_{t−1}^{−1}} ) ,   (19.13)
which may be solved by simply iterating over the arms and calculating
the term inside the argmax. Further implementation issues are explored in
Exercise 19.8.
19.4 Notes
The trick is to rewrite all computations in terms of the kernel function so that
ψ(c, a) is neither computed, nor stored. The second issue is that the claim
made in Theorem 19.2 depends on the dimension d and becomes vacuous when
d is large or infinite. This dependence arises from Lemma 19.4. It is possible to
modify this result by replacing d with a data-dependent quantity that measures
the ‘effective dimension’ of the image of the data under φ. The final challenge
is to define an appropriate confidence set. See the bibliographic remarks for
further details and references.
Empirically this choice is never worse than the value suggested in Eq. (19.8)
and sometimes better, typically by a modest amount.
4 The application of Cauchy–Schwarz in Eq. (19.11) often loses a logarithm, as it does, for example, when rt = √(1/t). Recently, however, a lower bound for contextual linear bandits has been derived by constructing a sequence for which this Cauchy–Schwarz is tight, as well as Lemma 19.4 [Li et al., 2019b].
5 In the worst case, the bound in Theorem 19.2 is tight up to logarithmic factors.
More details are in Chapter 24, which is devoted to lower bounds for stochastic
linear bandits. The environments for which the lower bound nearly matches the
upper bound have action sets that are either infinite or exponentially large in
the dimension. When |At | ≤ k for all rounds t, there are algorithms for which
the regret is
q
3
Rn = O dn log (nk) .
The special case where the action set does not change with time is treated in
Chapter 22, where references to the literature are also provided.
6 The calculation in Eq. (19.13) shows that LinUCB has more than just a passing
resemblance to the UCB algorithm introduced in Chapter 7. The term hθ̂t−1 , ai
may be interpreted as an empirical estimate of the reward from choosing action
a, and √(βt) ‖a‖_{V_{t−1}^{−1}} is a bonus term that ensures sufficient exploration. If the
penalty term vanishes (λ = 0) and At = {e1 , . . . , ed } for all t ∈ [n], then θ̂i
becomes the empirical mean of action ei , and the matrix Vt is diagonal, with
its ith diagonal entry being the number of times action ei is used up to and
including round t. Then the bonus term has order
s
p βt
βt kei kV −1 = ,
t−1 Ti (t − 1)
where Ti (t − 1) is the number of times action ei has been chosen before the tth
round. So UCB for finite-armed bandits is recovered by choosing βt = 2 log(·),
where the term inside the logarithm can be chosen in a variety of ways as
discussed in earlier chapters. Notice now that the simple analysis given in this chapter leads to a regret bound of O( √( dn log(·) ) ), which is quite close to the
19.5 Bibliographic Remarks
Stochastic linear bandits were introduced by Abe and Long [1999]. The first paper
to consider algorithms based on the optimism principle for linear bandits is by
Auer [2002], who considered the case when the number of actions is finite. The
core ideas of the analysis of optimistic algorithms (and more) is already present
in this paper. An algorithm based on confidence ellipsoids is described in the
papers by Dani et al. [2008], Rusmevichientong and Tsitsiklis [2010] and Abbasi-
Yadkori et al. [2011]. The regret analysis presented here, and the discussion of the
computational questions, is largely based on the former of these works, which also
stresses that an expected regret of Õ(d√n) can be achieved regardless of the shape
of the decision sets At as long as the means are guaranteed to lie in a bounded
interval. Rusmevichientong and Tsitsiklis [2010] consider both optimistic and
explore-then-commit strategies, which they call ‘phased exploration and greedy
exploitation’ (PEGE). They focus on the case where At is the unit ball or some
other compact set with a smooth boundary and show that PEGE is optimal up to
logarithmic factors. The observation that explore-then-commit works for the unit
ball (and other action sets with a smooth boundary) was independently made
by Abbasi-Yadkori et al. [2009], further expanded in [Abbasi-Yadkori, 2009a].
Generalised linear models are credited to Nelder and Wedderburn [1972]. We
mentioned already that LinUCB was generalised to this model by Filippi et al.
[2010]. A more computationally efficient algorithm has recently been proposed by
Jun et al. [2017]. Nonlinear structured bandits where the pay-off function belongs
to a known set have also been studied [Anantharam et al., 1987, Russo and Van
Roy, 2013, Lattimore and Munos, 2014]. Kernelised versions of UCB have been
given by Srinivas et al. [2010], Abbasi-Yadkori [2012] and Valko et al. [2013b].
We mentioned early in the chapter that making assumptions on the norm θ∗ is
related to smoothness of the reward function with smoother functions leading
to stronger guarantees. For an example of where this is done, see the paper on
‘spectral bandits’ by Valko et al. [2014] and Exercise 19.7.
19.6 Exercises
19.1 (Least-squares solution) Prove that the solution given in Eq. (19.5) is
indeed the minimiser of Eq. (19.4).
19.2 (Action selection with ellipsoidal confidence sets) Show that the
action selection in LinUCB can indeed be done as shown in Eq. (19.13) when
Ct = Et is an ellipsoid given in Eq. (19.7).
19.3 (Elliptical potentials: You cannot have more than O(d) big
intervals) Let V0 = λI and a1 , . . . , an ∈ Rd be a sequence of vectors with
Pt
kat k2 ≤ L for all t ∈ [n]. Then let Vt = V0 + s=1 as a>s and show that the
number of times kat kV −1 ≥ 1 is at most
t−1
3d L2
log 1 + .
log(2) λ log(2)
19.6 Exercises 248
The proof of Theorem 19.2 depended on part (b) of Assumption 19.1, which
asserts that the mean rewards are bounded by one. Suppose we replace this
assumption with the relaxation that there exists a B > 0 such that
max sup hθ∗ , a − bi ≤ B .
t∈[n] a,b∈At
Then, Exercise 19.3 allows you to bound the number of rounds when
kxt kV −1 ≥ 1, and in these rounds the naive bound of rt ≤ B is used. For
t−1
the remaining rounds, the analysis of Theorem 19.2 goes through unaltered.
As a consequence we see that the dependence on B is an additive constant
term that does not grow with the horizon.
19.4 (Computation cost savings with fixed large action set) When
the action set At = A is fixed and |A| = k, the total computation cost of
LinUCB after n rounds is O(kd2 n) in n rounds if the advice on implementation
of Exercise 19.8 is used. This can be reduced to O(k log(n) + d2 n) with almost
no increase of the regret, a significant reduction when k d2 . For this, LinUCB
should be modified to work in phases, where in a given fixed it uses the same
action computed in the usual way at the beginning of the phase. A phase ends
when log det Vt (λ) increases by log(1 + ε).
kxk2A det A
(a) Prove that if 0 ≺ B A then supx6=0 kxk2B
≤ det B .
(b) Let Assumption 19.1 hold. Let R̂n (βn ) be the regret bound of LinUCB
stated in Theorem 19.2. Show that with probability 1 − δ the random pseudo-
regret R̂n of the phased version of LinUCB, as described above, satisfies
R̂n ≤ R̂n ((1 + ε)βn ).
(a) Construct an algorithm whose regret Rn after n rounds is O((Lk log k)1/3 n2/3 ).
(b) Show that the minimax optimal regret is of the order Ω((Lk)1/3 n2/3 ).
(c) Generalise the result to the case when C = [0, 1]d and in the definition
of Lipschitzness we use the Euclidean norm. Show the dependence on the
dimension in the lower and upper bounds. Discuss the influence of the choice
of the norm.
This exercise is inspired by the work of Perchet and Rigollet [2013], who focus
on improving the regret bound by adaptive discretisation when a certain
margin condition holds. There are many variations of the problem of the
previous exercise. For starters, the domain of contexts could be more general:
one may consider higher-order smoothness, continuous action and context
spaces. What is the role of the context distribution? In some applications,
the context distribution can be estimated for free, in which case you might
assume the context distribution is known. How to take a known context
distribution into account? To whet your appetite, if the context distribution
is concentrated on a handful of contexts, the discretisation should respect
which contexts the distribution is concentrated on. Instead of discretisation,
one may also consider function approximation. An interesting approach that
goes beyond discretisation is by Combes et al. [2017] (see also Magureanu
et al. 2014). The approach in these papers is to derive an asymptotic,
instance-dependent lower bound, which is then used to guide the algorithm
(much like in the track-and-stop algorithm in Section 33.2). An open problem
is to design algorithms that are simultaneously near minimax optimal and
asymptotically optimal. As described in Part II, this problem is now settled
for finite-armed stochastic bandits, the only case where we can say this in
the whole literature of bandits.
19.6 (Generalised linear bandits) In this exercise you will design and
analyse an algorithm for the generalised linear bandit problem mentioned in
Note 7. Let Θ be a convex compact subset of Rd and assume that θ∗ ∈ Θ. The
only difference relative to the standard model is that the reward is
Xt = µ(hθ∗ , At i) + ηt ,
That c1 > 0 is assumed implies that µ is increasing on the relevant area of its
domain. Like in the standard model, for each t ≥ 1, ηt is 1-subgaussian given
A1 , X1 , . . . , At−1 , Xt−1 , At , and you may as well assume that rewards and feature
vectors are bounded:
max µ(hθ∗ , ai) − µ(hθ∗ , bi) ≤ 1 and max kak2 ≤ L and kθ∗ k2 ≤ m2 .
a,b∈∪n
t=1 At a∈∪n
t=1 At
Recall that λ is the regularisation parameter in the definition of Vt (see Eq. (19.6))
19.6 Exercises 250
and let
t
X
t
X
gt (θ) = λθ + µ(hθ, As i)As , Lt (θ) =
gt (θ) − Xs As
.
s=1 s=1 Vt−1
Prove that on the event that θ∗ ∈ Ct , for A∗t = argmaxa∈At µ(hθ∗ , ai),
1/2
2c2 βt−1
rt = µ(hθ∗ , A∗t i) − µ(hθ∗ , At i) ≤ kAt kV −1 .
c1 t−1
Pn
(d) Prove that with probability at least 1 − δ, the random regret R̂n = t=1 rt
is bounded by
s
c2 nL2
R̂n ≤ 8ndβn log 1 + .
c1 d
Hint For (a), you should peek into the future and use Theorem 20.4. The
mean value theorem will help with Part (b).
19.7 (Spectral bandits) The regret of LinUCB can be improved considerably
if an appropriate norm of θ∗ is known to be small. In this exercise you will
investigate this phenomenon. Suppose that V0 is positive definite with eigenvalues
Pt
λ1 , . . . , λd , respective eigenvectors v1 , . . . , vd , and Vt = V0 + s=1 As A> s . All
other quantities are left unchanged, but the alternative value of V0 means that θ̂t
is heavily regularised in the direction of each vi for which λi is large. Without
loss of generality, assume that (λi )di=1 is increasing and let λ = λ1 be the smallest
eigenvalue. Define the ‘effective dimension’ by
n
deff = max i ∈ [d] : (i − 1)λi ≤ ∈ [d] .
log(1 + nL2 /λ)
det(Vt ) nL2
(a) Prove that log det(V 0)
≤ 2deff log 1 + λ .
(b) Let m > 0 be a user-defined constant and
s
1/2 1 det(Vt )
βt = m + 2 log + log
δ det(V0 )
and let Ct = {θ : kθ − θ̂t−1 k2Vt−1 ≤ βt−1 }. Assume that kθ∗ kV0 ≤ m and
prove that θ∗ ∈ Ct for all t with probability at least 1 − δ.
19.6 Exercises 251
(c) Prove that if kθ∗ kV0 ≤ m, then with probability at least 1 − δ, the random
regret of LinUCB in this setting is bounded by
s
nL2
R̂n ≤ 8βn deff log 1 + .
λ
(d) Show that with an appropriate choice of δ,
√
E[R̂n ] = O deff n log(nL2 ) , (19.16)
where the last equality suppresses dependence on m and λ = λ1 .
(e) The result of the previous display explains the definition of the ‘effective
dimension’ deff . When V0 = I, which gives rise to uniform regularisation,
√
Corollary 19.3 states that for kθ∗ k ≤ m, E[R̂n ] = O(d n log(nL2 )). Given
Eq. (19.16), for a fixed n, deff can be thought of as replacing the dimension
d when V0 is chosen as any positive definite matrix with V0 λI. It follows
then that when deff ≤ d, the upper bound for non-uniform regularisation will
be smaller. Given this, explain the potential pros and cons of non-uniform
regularisation. What happens when kθ∗ kV0 ≤ m fails to hold? Show a bound
on the degradation of the expected regret as a function of max(0, kθ∗ kV0 −m).
Hint For Part (b), you should peek into the next chapter and modify
Theorem 20.5.
19.8 (Implementation) If the action set is the same in every round, then the
assumptions are satisfied for the various versions of UCB discussed in Chapters 7
19.6 Exercises 252
30 2,000
20
0
0 0.2 0.4 0.6 0.8 1 0 200 400 600 800 1,000
∆ k
Figure 19.1 The plot on the left compares the regret of UCB (Algorithm 6) and LinUCB
on a Gaussian bandit with k = 2, n = 1000 and varying suboptimality gaps ∆. The plot
on the right compares the same algorithms on a linear bandit with actions uniformly
distributed on the sphere and with d = 5 and n = 5000. The parameter θ is also
uniformly generated on the sphere.
In the last chapter, we derived a regret bound for a version of the upper confidence
bound algorithm that depended on a particular kind of confidence set. The purpose
of this chapter is to justify these choices.
Suppose a bandit algorithm has chosen actions A1 , . . . , At ∈ Rd and received
the rewards X1 , . . . , Xt with Xs = hθ∗ , As i + ηs where ηs is zero-mean noise.
Recall from the previous chapter that the penalised least-squares estimate of θ∗
is the minimiser of
t
X
Lt (θ) = (Xs − hθ, As i)2 + λkθk22 ,
s=1
None of these assumptions is plausible in the bandit setting, but the simplification
eases the analysis and provides insight.
The assumption that λ = 0 means that in this section, θ̂t is just the ordinary
least squares estimator of θ. The requirement that Vt be non-singular means
that (As )ts=1 must span Rd , and so t must be at least d.
Confidence Bounds for Least Squares Estimators 254
s=1 s=1
* t
+ t
X X
= x, Vt−1 As η s = x, Vt−1 As ηs .
s=1 s=1
Since (ηs )s are independent and 1-subgaussian, by Lemma 5.4 and Theorem 5.3,
v
D E uu X t
2
1
P x, θ̂t − θ∗ ≥ t2 x, Vt−1 As log ≤ δ.
s=1
δ
Pt
2
A little linear algebra shows that s=1 x, Vt−1 As = kxk2V −1 and so
t
s !
2 1
P hθ̂t − θ∗ , xi ≥ 2kxkV −1 log ≤ δ. (20.2)
t δ
If we only care about confidence bounds for one or a few vectors x, we could stop
here. For large action sets (with more than Ω(2d ) actions), one approach is to
convert this bound to a bound on kθ̂t − θ∗ kVt . To begin this process, notice that
1/2
1/2 Vt (θ̂t − θ∗ )
kθ̂t − θ∗ kVt = hθ̂t − θ∗ , Vt Xi , where X = .
kθ̂t − θ∗ kVt
The problem is that X is random, while we have only proven (20.2) for
deterministic x. The standard way of addressing problems like this is to use
a covering argument. First we identify a finite set Cε ⊂ Rd such that whatever
value X takes, there exists some x ∈ Cε that is ε-close to X. Then a union
bound and a triangle inequality allows one to finish. By its definition, we have
kXk22 = X > X = 1, which means that X ∈ S d−1 = {x ∈ Rd : kxk2 = 1}. Using
that X ∈ S d−1 , we see it suffices to cover S d−1 . The following lemma provides
the necessary guarantees on the size of the covering set.
Lemma 20.1. There exists a set Cε ⊂ Rd with |Cε | ≤ (3/ε)d such that for all
x ∈ S d−1 there exists a y ∈ Cε with kx − yk2 ≤ ε.
The proof of this lemma requires a bit work, but nothing really deep is needed.
This work is deferred to Exercises 20.3 and 20.4. Let Cε be the covering set given
by the lemma, and define event
( s )
D E |Cε |
1/2
E = exists x ∈ Cε : Vt x, θ̂t − θ∗ ≥ 2 log .
δ
1/2
Using the fact that kVt xkV −1 = kxk2 = 1, and a union bound combined with
t
Eq. (20.2) shows that P (E) ≤ δ. When E does not occur, Cauchy–Schwarz shows
20.1 Martingales and the Method of Mixtures 255
that
D E
1/2
kθ̂t − θ∗ kVt = max Vt x, θ̂t − θ∗
x∈S d−1
hD E D Ei
1/2 1/2
= max min Vt (x − y), θ̂t − θ∗ + Vt y, θ̂t − θ∗
x∈S d−1 y∈Cε
" s #
|Cε |
< max min kθ̂t − θ∗ kVt kx − yk2 + 2 log
x∈S d−1 y∈Cε δ
s
|Cε |
≤ εkθ̂t − θ∗ kVt + 2 log .
δ
Rearranging yields
s
1 |Cε |
kθ̂t − θ∗ kVt < 2 log .
1−ε δ
Now there is a tension in the choice of ε > 0. The term in the denominator
suggests that ε should be small, but by Lemma 20.1 the cardinality of Cε grows
rapidly as ε tends to zero. By lazily choosing ε = 1/2,
s !
1
P kθ̂t − θ∗ kVt ≥ 2 2 d log(6) + log ≤ δ. (20.3)
δ
Except for constants and other minor differences, this turns out to be about as
good as you can get. Unfortunately, however, this analysis only works because
Vt was assumed to be deterministic. When the actions are chosen by a bandit
algorithm, this assumption does not hold, and the ideas need to be modified.
We now remove the limiting assumptions in the previous section. Of course some
conditions are still required. For the remainder of this section the following is
assumed:
1 There exists a θ∗ ∈ Rd such that Xt = hθ∗ , At i + ηt for all t ≥ 1.
2 The noise is conditionally 1-subgaussian:
2
α
for all α ∈ R and t ≥ 1, E [exp(αηt ) | Ft−1 ] ≤ exp a.s. , (20.4)
2
where Ft−1 is such that A1 , X1 , . . . , At−1 , Xt−1 , At are Ft−1 -measurable.
3 In addition, we assume that λ > 0.
The inclusion of At in the definition of Ft−1 allows the noise to depend on past
choices, including the most recent action. This is often essential, as the case of
Bernoulli rewards shows. We have now dropped the assumption that (At )∞ t=1 are
fixed in advance.
20.1 Martingales and the Method of Mixtures 256
The assumption that λ > 0 ensures that Vt (λ) is invertible and allows us
to relax the requirement that the actions span Rd . Notice also that in this
section, we allow the interaction sequence to be infinitely long.
Sadly, we do not know how to bound this expectation. Can we still somehow use
the Cramér–Chernoff method? We take inspiration from looking at the special
Pt
case of λ = 0 one last time, assuming that Vt = s=1 As A>
s is invertible. Let
t
X
St = η s As .
s=1
Pt
Recall that θ̂t = Vt−1 s=1 Xs As = θ∗ + Vt−1 St . Hence,
1 2 1 2 1 2
kθ̂t − θ∗ kVt = kSt kV −1 = max hx, St i − kxkVt .
2 2 t x∈Rd 2
The point of the second equality is to separate the martingale (St )t from Vt at
the price of the introduction of a maximum. This second equality is a special
case of (Fenchel) duality. As we shall see later in Chapter 26, for sufficiently
nice convex functions f one can show that with an appropriate function f ∗ ,
for any x ∈ Rd from the domain of f , f (x) = supu∈Rd hu, xi − f ∗ (u). The
advantage of this is that for any fixed u, x appears in a linear fashion.
The next lemma shows that the exponential of the term inside the maximum
is a supermartingale even when λ > 0.
Lemma 20.2. For all x ∈ Rd the process Mt (x) = exp(hx, St i − 12 kxk2Vt (λ) ) is an
F-adapted non-negative supermartingale with M0 (x) ≤ 1.
Proof of Lemma 20.2 That Mt (x) is Ft -measurable for all t and that it
is nonnegative are immediate from the definition. We need to show that
E[Mt (x) | Ft−1 ] ≤ Mt−1 (x) almost surely. The fact that (ηt ) is conditionally
1-subgaussian means that
! !
hx, At i
2 kxk2At A>
E [exp (ηt hx, At i) | Ft−1 ] ≤ exp = exp t
a.s.
2 2
20.1 Martingales and the Method of Mixtures 257
Hence
1 2
E[Mt (x) | Ft−1 ] = E exp hx, St i − kxkVt Ft−1
2
1 2
= Mt−1 (x)E exp ηt hx, At i − kxkAt A> Ft−1
2 t
For simplicity, consider now again the case when λ = 0. Combining the lemma
and the linearisation idea almost works. The Cramér–Chernoff method leads to
1 1
P kθ̂t − θ∗ k2Vt ≥ log(1/δ) = P exp max hx, St i − kxk2Vt ≥ 1/δ
2 x∈Rd 2
1
≤ δE exp max hx, St i − kxk2Vt
x∈Rd 2
= δE max Mt (x) . (20.5)
x∈Rd
Lemma 20.2 shows that E[Mt (x)] ≤ 1. This seems quite promising, but
the presence of the maximum is a setback because E[maxx∈Rd Mt (x)] ≥
maxx∈Rd E[Mt (x)], which is the wrong direction to be used above. This means
we cannot directly use the lemma to bound Eq. (20.5). There are two ways to
proceed. The first is to use a covering argument over possible near-maximisers
of x, which eventually works. A more elegant way is to take inspiration from
Eq. (20.5) and use Laplace’s method for approximating integrals of well-behaved
exponentials, as we now explain.
for some large value of s > 0. From a Taylor expansion, we may write
q
f (x) = f (x0 ) − (x − x0 )2 + R(x) ,
2
where R(x) = o((x − x0 )2 ). Under appropriate technical assumptions,
Z b
sq(x − x0 )2
Is ∼ exp(sf (x0 )) exp − dx as s → ∞ .
a 2
20.1 Martingales and the Method of Mixtures 258
1 1
s=1 s=3
0 0
−5 0 5 −5 0 5
Figure 20.1 The plots depict Laplace’s approximation with f (x) = cos(x) exp(−x2 /20),
which is maximised at x0 = 0 and has q = −f 00 (x0 ) = 11/10. The solid line is a plot of
exp(sf (x))/ exp(sf (x0 )), and the dotted line is exp(−sq(x − x0 )2 ).
The following theorem is the key result from which the confidence set will be
derived.
Theorem 20.5. Let δ ∈ (0, 1). Then, with probability at least 1 − δ, it holds that
for all t ∈ N,
s
√ 1 det Vt (λ)
kθ̂t − θ∗ kVt (λ) < λkθ∗ k2 + 2 log + log .
δ λd
Proof We only have to compare kSt kVt (λ)−1 and kθ̂t − θ∗ kVt (λ) :
kθ̂t − θ∗ kVt (λ) = kVt (λ)−1 St + (Vt (λ)−1 Vt − I)θ∗ kVt (λ)
≤ kSt kVt (λ)−1 + (θ∗> (Vt (λ)−1 Vt − I)Vt (λ)(Vt (λ)−1 Vt − I)θ∗ )1/2
= kSt kVt (λ)−1 + λ1/2 (θ∗> (I − Vt (λ)−1 Vt )θ∗ )1/2
≤ kSt kVt (λ)−1 + λ1/2 kθ∗ k ,
Now we turn to studying M̄t . Completing the square in the definition of M̄t
we get
1 1 1 1
hx, St i − kxk2Vt − kxk2H = kSt k2(H+Vt )−1 − kx − (H + Vt )−1 St k2H+Vt .
2 2 2 2
The first term kSt k2(H+Vt )−1 does not depend on x and can be moved outside the
20.2 Notes 260
integral, which leaves a quadratic ‘Gaussian’ term that may be integrated exactly
and results in
1/2
det(H) 1 2
M̄t = exp kSt k(H+Vt )−1 . (20.8)
det(H + Vt ) 2
The result follows by substituting this expression into Eq. (20.7) and
rearranging.
20.2 Notes
1 Recall from the previous chapter that when kAt k2 ≤ L is assumed, then
d d
det Vt (λ) Vt (λ) nL2
≤ trace ≤ 1+ . (20.9)
λd λd λd
In general, the log determinant form should be preferred when confidence
intervals are used as part of an algorithm, but the right-hand side has a
concrete form that can be useful when stating regret bounds.
2 Plugging the bounds of the previous note into Theorem 20.5 and choosing
λ = 1 gives the confidence set
( s )
1 nL 2
Ct = θ ∈ Rd : kθ̂t−1 − θkVt−1 (1) < m2 + 2 log + d log 1 + .
δ d
p
The dependence of the radius on n, d and δ, up to constants and a log(n)
factor, is the same as what we got in the fixed design case (cf. Eq. (20.3)),
which suggests that Theorem 20.5 can be quite tight. By considering the case
when each basis vector {e1 , . . . , ed } is played m times, then D = kθ̂t − θk2Vt
is distributed like a chi-squared distribution with d degrees of freedom. From
this, we see that the first term under the square root with the coefficient two
is stemming from variance of the noise, while the term that involved d log(n) √
is the bias (the expected value of D). In particular, this shows that the d
factor cannot be avoided.
3 If either of the above confidence sets is used (either the one from the theorem,
or that from Eq. (20.3)) to derive confidence bounds for the prediction error
h√θ̂t − θ, xi at some fixed x ∈ Rd , we get a confidence width that scales with
d (e.g., Eq. (19.13)), unlike the confidence width in Eq. (20.2), which is
independent of d. It follows that if one is interested in high-probability bounds
for the mean at a fixed input x, one should avoid going through a confidence
set for the whole parameter vector. What this leaves open is whether a bound
like in Eq. (20.2) is possible at a fixed input x, but with a sequential design.
In Exercise 20.2 you will answer this question in the negative. First note that
when the actions are chosen using a fixed design, integrating Eq. (20.2) shows
20.3 Bibliographic Remarks 261
that E[hθ̂t − θ∗ , xi2 /kxk2V −1 ] = O(1). In the exercise, you will show that there
t
exists a sequential design such that
h i
E hθ̂t − θ∗ , xi2 /kxk2V −1 = Ω(d) ,
t
√
showing that for some sequential designs the factor d is necessary. It remains
an interesting open question to design confidence bounds for sequential design
for fixed x that adapts to the amount of dependence in the design.
4 Supermartingales arise naturally in proofs relying on the Cramér–Chernoff
method. Just one example is the proof of Lemma 12.2. One could rewrite
most of the proofs involving sums of random variables relying on the Cramér–
Chernoff method in a way that it would become clear that the proof hinges on
the supermartingale property of an appropriate sequence.
Bounds like those given in Theorem 20.5 are called self-normalised bounds [de la
Peña et al., 2008]. The method of mixtures goes back to the work by Robbins
and Siegmund [1970]. In practice, the improvement provided by the method of
mixtures relative to the covering arguments is quite large. A historical account
of martingale methods in sequential analysis is by Lai [2009]. A simple proof of
Lemma 20.1 appears as lemma 2.5 in the book by van de Geer [2000]. Calculating
covering numbers (or related packing numbers) is a whole field by itself, with
open questions even in the most obvious examples. The main reference is by
Rogers [1964], which by now is a little old, but still interesting.
20.4 Exercises
20.1 (Lower bounds for fixed design) Let n = md for integer m and
A1 , . . . , An be a fixed design where each basis vector in {e1 , . . . , ed } is played
exactly m times. Then let (ηt )nt=1 be a sequence of independent standard Gaussian
random variables and Xt = hθ∗ , At i + ηt . Finally, let θ̂n be the ordinary least
squares estimator of θ∗ ∈ Rd . Show that
h i
E kθ̂n − θ∗ k2Vn = d .
20.2 (Lower bounds for sequential design) Let n ≥ 2d and (ηt )nt=1 be a
sequence of independent standard Gaussian random variables. Find a sequence of
20.4 Exercises 262
Pn
random vectors (At )nt=1 , with At ∈ Rd such that Vn = t=1 At A> t is invertible
almost surely and At is σ(A1 , η1 , . . . , At−1 , ηt−1 )-measurable for all t and
h i
E hθ̂n , 1i2 /k1k2V −1 ≥ cd ,
n
Pn
where c > 0 is a universal constant and Sn = t=1 ηt At and θ̂n = Vn−1 Sn .
The definitions can be repeated for pseudo-metric spaces. Let X be a set and
d : X × X → [0, ∞) be a function that is symmetric, satisfies the triangle
inequality and for which d(x, x) = 0 for all x ∈ X. Note that d(x, y) = 0 is
allowed for distinct x and y, so d need not be a metric. The basic results
concerning covering and packing stated in the next exercise remain valid
with this more general definition. In applications we often need the logarithm
of the covering and packing numbers, which are called the metric entropy
of X at scale ε. As we shall see, these are often close no matter whether we
consider packing or covering.
where (∗) holds under the assumption that εB ⊂ A and that A is convex
and for U, V ⊂ Rd , c ∈ R, U + V = {u + v : u ∈ U , v ∈ V } and
cU = {cu : u ∈ U };
20.4 Exercises 263
(d) Fix ε > 0. Then N (ε) < +∞ if and only if A is bounded. The same holds
for M (ε).
20.4 Use the results of the previous exercise to prove Lemma 20.1.
Hint Use the ‘sections’ lemma [Kallenberg, 2002, Lemma 1.26] to established
that M̄t is Ft -measurable.
20.6 (Hoeffding–Azuma) Let X1 , . . . , Xn be a sequence of random variables
adapted to a filtration F = (Ft )t . Suppose that |Xt | ∈ [at , bt ] almost surely for
arbitrary fixed sequences (at ) and (bt ) with at ≤ bt for all t ∈ [n]. Show that for
any ε > 0,
!
2n2 ε2
Xn
P (Xt − E[Xt | Ft−1 ]) ≥ ε ≤ exp − Pn .
t=1 (bt − at )
2
t=1
Hint It may help to recall Hoeffding’s lemma from Note 4 in Chapter 5, which
states that for a random variable X ∈ [a, b], the moment-generating function
satisfies
The utility of this result comes from the fact that very often the range of
some adapted sequence is itself random and could be arbitrarily large with
low probability (when A does not hold). A reference for the above result is
the survey by McDiarmid [1998].
such that
Pn
Let Sn = t=1 Xt . Show that
v !
u √
u tσ 2 + 1
P exists t : |St | ≥ t2σ 2 (t + 1) log
≤ δ.
δ
(c) Use the previous result to show that for any δ ∈ (0, 1),
s !
2n 1 1
P exists n : Sn ≥ inf log + log ≤ δ.
ε>0 (1 − ε2 ) δ εΛn f (Λn (1 + ε))
R∞
(d) Find an f such that 0
f (λ)dλ = 1 and f (λ) ≥ 0 for all λ ∈ R and
1 1
log = (1 + o(1)) log log
λf (λ) λ
as λ → 0.
(e) Use the previous results to show that
!
Sn
P lim sup p ≤1 = 1.
n→∞ 2n log log(n)
The last part of the previous exercises is one-half of the statement of the
law of iterated logarithm, which states that
Sn
lim sup p =1 almost surely .
n→∞ 2n log log(n)
In other words, the magnitude of the largest fluctuations of the partial sum
√
(Sn )n is almost surely of the order 2n log log n as n → ∞.
This result
p appeared
√ in a paper by the authors and others with the constant
c = 4 2/π/ erf( 2) ≈ 3.43 [Lattimore et al., 2018].
Hint Use Cramér–Chernoff method and observe that (exp(Lt (θ∗ )))∞
t=1 is a
martingale.
The quantities pθ0 (Xs )/pθ (Xs ) are called likelihood ratios. That the
product of likelihood ratios forms a martingale is a cornerstone result of
classical parametric statistics. The sequential form that appears in the above
exercise is based on Lemma 2 of Lai and Robbins [1985], who cite Robbins
and Siegmund [1972] as the original source.
21 Optimal Design for Least Squares
Estimators
The least squares estimator used here is not regularised. This eases the
calculations, and the lack of regularisation will not harm us in future
applications.
Eq. (20.2) from Chapter 20 shows that for any a ∈ Rd and δ ∈ (0, 1),
s !
2 1
P hθ̂ − θ∗ , ai ≥ 2kakV −1 log ≤ δ. (21.1)
δ
For our purposes, both a1 , . . . , an and a will be actions from some (possibly
infinite) set A ⊂ Rd and the question of interest is finding the shortest sequence
of exploratory actions a1 , . . . , an such that the confidence bound in the previous
display is smaller than some threshold for all a ∈ A. To solve this exactly
is likely an intractable exercise in integer programming. Finding an accurate
approximation turns out to be efficient for a broad class of action sets, however.
P
Let π : A → [0, 1] be a distribution on A so that a∈A π(a) = 1 and V (π) ∈ Rd×d
and g(π) ∈ R be given by
X
V (π) = π(a)aa> , g(π) = max kak2V (π)−1 . (21.2)
a∈A
a∈A
21.1 The Kiefer–Wolfowitz Theorem 267
By Eq. (21.3), the total number of actions required to ensure a confidence width
of no more than ε is bounded by
X X π(a)g(π)
1 g(π)
1
n= na = 2
log ≤ | Supp(π)| + 2 log .
ε δ ε δ
a∈Supp(π) a∈Supp(π)
The set Supp(π) is sometimes called the core set. The following theorem
characterises the size of the core set and the minimum of g.
(a) π ∗ is a minimiser of g.
(b) π ∗ is a maximiser of f (π) = log det V (π).
(c) g(π ∗ ) = d.
Proof We give the proof for finite A. The general case follows by passing to the
limit (Exercise 21.3). When it is convenient, distributions π on A are treated as
vectors in R|A| . You will show in Exercises 21.1 and 21.2 that f is concave and
that
and the concavity of f . That (a) =⇒ (c) is now trivial. To prove the second
part of the theorem, let π ∗ be a minimiser of g, which by the previous part is a
maximiser of f . Let S = Supp(π ∗ ), and suppose that |S| > d(d + 1)/2. Since the
dimension of the subspace of d × d symmetric matrices is d(d + 1)/2, there must
be a non-zero function v : A → R with Supp(v) ⊆ S such that
X
v(a)aa> = 0 . (21.6)
a∈S
Notice that for any a ∈ S, the first-order optimality conditions ensure that
kak2V (π∗ )−1 = d (Exercise 21.5). Hence
X X
d v(a) = v(a)kak2V (π∗ )−1 = 0 ,
a∈S a∈S
where the last equality follows from Eq. (21.6). Let π(t) = π ∗ + tv and let
P
τ = max{t > 0 : π(t) ∈ PA }, which exists since v 6= 0 and a∈S v(a) = 0
and Supp(v) ⊆ S. By Eq. (21.6), V (π(t)) = V (π ∗ ), and hence f (π(τ )) = f (π ∗ ),
which means that π(τ ) also maximises f . The claim follows by checking that
| Supp(π(T ))| < | Supp(π ∗ )| and then using induction.
Geometric Interpretation
There is a geometric interpretation of the D-optimal design problem. Let π be a
P
D-optimal design for A and V = a∈A π(a)aa> and
E = x ∈ Rd : kxk2V −1 ≤ d ,
which is a centered ellipsoid. By Theorem 21.1, it holds that A ⊂ E with the core
set lying on the boundary (see Fig. 21.1). As you might guess from the figure,
the ellipsoid E is the minimum volume centered ellipsoid containing A. This is
known to be unique and the optimisation problem that characterises it is in fact
the dual of the log determinant problem that determines the D-optimal design.
21.2 Notes 269
Figure 21.1 The minimum volume centered ellipsoid containing a point cloud. The points
on the boundary are the core set. The ellipse is E = {x : kxk2V (π)−1 = d}, where π is an
optimal design.
21.2 Notes
1 The letter ‘d’ in D-optimal design comes from the determinant in the objective.
The ‘g’ in G-optimal design stands for ‘globally optimal’. The names were coined
by Kiefer and Wolfowitz, though both problems appeared in the literature
before them.
2 In applications we seldom need an exact solution to the design problem. Finding
a distribution π such that g(π) ≤ (1 + ε)g(π ∗ ) will increase the regret of our
algorithms by a factor of just (1 + ε)1/2 .
3 The computation of an optimal design for finite action sets is a convex problem
for which there are numerous efficient approximation algorithms. The Frank–
Wolfe algorithm is one such algorithm, which can be used to find a near-optimal
solution for modestly sized problems. The algorithm starts with an initial π0
and updates according to
where ak = argmaxa∈A kak2V (πk )−1 and the step size is chosen to optimise f
along the line connecting πk and δak .
1 2
d kak kV (πk )−1 − 1
γk = argmaxγ∈[0,1] f ((1 − γ)πk + γδak ) = . (21.8)
kak k2V (πk )−1 − 1
4 If the action set is infinite, then approximately optimal designs can sometimes
still be found efficiently. Unfortunately the algorithms in the infinite case tend
to be much ‘heavier’ and less practical.
5 The smallest ellipsoid containing some set K ⊂ Rd is called the minimum
volume enclosing ellipsoid (MVEE) of K. As remarked, the D-optimal
design problem is equivalent to finding an MVEE of A with the added constraint
that the ellipsoid must be centred – or equivalently, finding the MVEE of the
symmetrised set A ∪ {−a : a ∈ A}. The MVEE of a convex set is also called
John’s ellipsoid, which has many applications in optimisation and beyond.
6 In Exercise 21.6, you will generalise Kiefer–Wolfowitz theorem to sets that do
not span Rd . When A is compact and dim(span(A)) = m ∈ [d], then there
exists a distribution π ∗ supported on at most m(m + 1)/2 points of A and for
which g(π ∗ ) = m = inf π g(π).
21.4 Exercises
Hint For square matrix A let adj(A) be the transpose of the cofactor matrix
of A. Use the facts that the inverse of a matrix A is A−1 = adj(A)> / det(A) and
21.4 Exercises 271
Hint Consider t 7→ log det(H + tZ) for Z symmetric, and show that this is a
concave function.
21.3 (Kiefer–Wolfowitz for compact sets) Generalise the proof of
Theorem 21.1 to compact action sets.
21.5 Let π ∗ be a G-optimal design and a ∈ Supp(π ∗ ). Prove that kak2V (π∗ )−1 = d.
21.6 Prove that if A is compact and dim(span(A)) = m ∈ [d], then there exists
a distribution π ∗ over A supported on at most m(m + 1)/2 points and for which
g(π ∗ ) = m.
Hint The easiest pure way to do this is to implement the Frank–Wolfe algorithm
described in Note 3. All quantities can be updated incrementally using rank-one
update formulas, and this will lead to a significant speedup. You might like to
read the third chapter of the book by Todd [2016] and experiment with the
proposed variants.
22 Stochastic Linear Bandits with
Finitely Many Arms
The optimal design problem from the previous chapter has immediate applications
to stochastic linear bandits. In Chapter 19, we developed a linear version of the
√
upper confidence bound algorithm that achieves a regret of Rn = O(d n log(n)).
The only required assumptions were that the sequence of available action sets
were bounded. In this short chapter, we consider a more restricted setting where:
1 the set of actions available in round t is A ⊂ Rd and |A| = k for some natural
number k;
2 the reward is Xt = hθ∗ , At i + ηt where ηt is conditionally 1-subgaussian:
E[exp(ληt )|A1 , η1 , . . . , At−1 ] ≤ exp(λ2 /2) almost surely for all λ ∈ R; and
The key difference relative to Chapter 19 is that now the set of actions is finite
and does not change with time. Under these conditions, it becomes possible to
design a policy such that
p
Rn = O dn log(nk) .
For moderately sized k, this bound improves the regret by a factor of d1/2 , which
in some regimes is large enough to be worth the effort. The policy is an instance
of phase-based elimination algorithms. As usual, at the end of a phase, arms that
are likely to be suboptimal with a gap exceeding the current target are eliminated.
In fact, this elimination is the only way the data collected in a phase is being
used. In particular, the actions to be played during a phase are chosen based
entirely on the data from previous phases: the data collected in the present phase
do not influence which actions are played. This decoupling allows us to make use
of the tighter confidence bounds available in the fixed design setting, as discussed
in the previous chapter. The choice of policy within each phase uses the solution
to an optimal design problem to minimise the number of required samples to
eliminate arms that are far from optimal.
Input A ⊂ Rd and δ
Step 0 Set ` = 1 and let A1 = A
Step 1 Let t` = t be the current timestep and find G-optimal design π` ∈ P(A` )
with Supp(π` ) ≤ d(d + 1)/2 that maximises
X
log det V (π` ) subject to π` (a) = 1
a∈A`
p
where C > 0 is a universal constant. If δ = O(1/n), then E[Rn ] ≤ C nd log(kn)
for an appropriately chosen universal constant C > 0.
The proof of this theorem follows relatively directly from the high-probability
correctness of the confidence intervals used to eliminate low-rewarding arms. We
leave the details to the reader in Exercise 22.1.
22.1 Notes
1 The assumption that the action set does not change is crucial for Algorithm 12.
Several complicated algorithms have been proposed and analysed for the case
where At is allowed to change from round to round under the assumption that
|At | ≤ k for all rounds. For these algorithms, it has been proven that
q
Rn = O nd log3 (nk) . (22.1)
intervals derived in Chapter 20. Once the dust has settled, you should find the
regret is
p
Rn = O d n log(n) .
3 One advantage of Algorithm 12 is that it behaves well even when the linear
model is misspecified. Suppose the reward is Xt = hθ, At i + ηt + f (At ), where
ηt is noise as usual and f : A → R is some function with kf k∞ ≤ ε. Then the
regret of Algorithm 12 can be shown to be
p √
Rn = O dn log(nk) + nε d log(n) .
The algorithms achieving Eq. (22.1) for changing action sets are SupLinRel [Auer,
2002] and SupLinUCB [Chu et al., 2011]. Both introduce phases to decouple
the dependence of the design on the outcomes. Unfortunately the analysis of
these algorithms is long and technical, which prohibited us from presenting
the ideas here. These algorithms are also not the most practical relative to
LinUCB (Chapter 19) or Thompson sampling (Chapter 36). Of course this does
not diminish the theoretical breakthrough. Phased elimination algorithms have
appeared in many places, but the most similar to the algorithm presented here
is the work on spectral bandits by Valko et al. [2014] (and we have also met
them briefly in earlier chapters on finite-armed bandits). None of the works just
mentioned used the Kiefer–Wolfowitz theorem. This idea is apparently new, but
it is based on the literature on adversarial linear bandits where John’s ellipsoid
has been used to define exploration policies [Bubeck et al., 2012]. For more details
on adversarial linear bandits, read on to Part VI.
Ghosh et al. [2017] address misspecified (stochastic) linear bandits with a fixed
action set. In misspecified linear bandits, the reward is nearly a linear function of
the feature vectors associated with the actions. Ghosh et al. [2017] demonstrate
that in the favourable case when one can cheaply test linearity, an algorithm
that first runs a test and then switches to either
√ a linear bandit or a finite-armed
√
bandit based on the outcome will achieve ( k ∧ d) n regret up to log factors.
We will return to misspecified linear bandits a few more times in the book.
22.3 Exercises
(a) Use Theorem 21.1 to show that the length of the `th phase is bounded by
2d k`(` + 1) d(d + 1)
T` ≤ 2 log + .
ε` δ 2
(b) Let a∗ ∈ argmaxa∈A hθ∗ , ai be the optimal arm and use Eq. (21.1) to show
that
δ
P (exists phase ` such that a∗ ∈
/ A` ) ≤ .
k
(c) For action a define `a = min{` : 2ε` < ∆a } to be the first phase where the
suboptimality gap of arm a is smaller than 2ε` . Show that
δ
P (a ∈ A`a ) ≤ .
k
(d) Show that with probability at least 1 − δ the regret is bounded by
s
k log(n)
Rn ≤ C dn log ,
δ
where C > 0 is a universal constant.
(e) Show that this implies Theorem 22.1 for the given choice of δ.
Like in the standard stochastic linear bandit setting, at the beginning of round t,
the learner receives a decision set At ⊂ Rd . They then choose an action At ∈ At
and receive a reward
Xt = hθ∗ , At i + ηt , (23.1)
where (ηt )t is zero-mean noise and θ∗ ∈ Rd is an unknown vector. The only
difference in the sparse setting is that the parameter vector θ∗ is assumed to have
many zero entries. For θ ∈ Rd let
d
X
kθk0 = I {θi 6= 0} ,
i=1
The details are a bit more complicated than just the conditioning, but the
main point is that the usual assumptions imposed on the covariance matrix of
the actions for passive learning are never satisfied when the actions are chosen
by a good bandit policy. The reason is simple. Bandit algorithms want to choose
the optimal action as often as possible, which means the covariance matrix will
have an eigenvector that points (approximately) towards the optimal action with
a large corresponding eigenvalue. We need some approach that does not rely on
such strong assumptions.
As a warm-up, consider the case where the action set is the d-dimensional
hypercube: At = A = [−1, 1]d . To reduce clutter, we denote the true parameter
vector by θ = θ∗ . The hypercube is notable as an action set because it enjoys
perfect separability. For each dimension i ∈ [d], the value of Ati ∈ [−1, 1] can
be chosen independently of Atj for j 6= i. Because of this, the optimal action is
a∗ = sign(θ), where
1 , if θi > 0 ;
sign(θ)i = sign(θi ) = 0 , if θi = 0 ;
−1 , if θ < 0 .
i
23.2 Elimination on the Hypercube 278
So learning the optimal action amounts to learning the sign of θi for each
dimension. A disadvantage of this structure is that in the worst case the sign
of each θi must be learned independently, which in Chapter 24 we show leads
√
to a worst-case regret of Rn = Ω(d n). On the positive side, the separability
means that θi can be estimated in each dimension independently while paying
absolutely no price for this experimentation when θi = 0. It turns out that this
√
allows us to design a policy for which Rn = O(kθk0 n), even without knowing
the value of kθk0 .
Let Gt = σ(A1 , X1 , . . . , At , Xt ) be the σ-algebra containing information up to
time t − 1 (this differs from Ft , which also includes information about the action
chosen). Now suppose that (Ati )di=1 are chosen to be conditionally independent
given Gt−1 , and further assume for some specific i ∈ [d] that Ati is sampled from
a Rademacher distribution so that P (Ati = 1 | Gt−1 ) = P (Ati = −1 | Gt−1 ) = 1/2.
Then
d
X
E[Ati Xt | Gt−1 ] = E Ati Atj θj + ηt Gt−1
j=1
X
= θi E[A2ti | Gt−1 ] + θj E[Atj Ati | Gt−1 ] + E[Ati ηt | Gt−1 ]
j6=i
= θi ,
where the first equality is the definition of Xt = hθ, At i+ηt , the second by linearity
of expectation and the third by the conditional independence of (Ati )i and the
fact that E[Ati | Gt−1 ] = 0 and E[A2ti | Gt−1 ] = 1. This looks quite promising, but
we should also check the variance. Using our assumptions that (ηt ) is conditionally
1-subgaussian and that hθ, ai ≤ 1 for all actions a, we have
2
V[Ati Xt | Gt−1 ] = E[A2ti Xt2 | Gt−1 ] − θi2 = E[(hθ, At i + ηt ) | Gt−1 ] − θi2 ≤ 2 .
(23.2)
And now we have cause for celebration. The value of θi can be estimated by
choosing Ati to be a Rademacher random variable independent of the choices in
other dimensions. All the policy does is treat all dimensions independently. For a
particular dimension (say i), it explores by choosing Ati ∈ {−1, 1} uniformly at
random until its estimate is sufficiently accurate to commit to either Ati = 1 or
Ati = −1 for all future rounds. How long this takes depends on |θi |, but note that
if |θi | is small, then the price of exploring is also limited. The policy that results
from this idea is called selective explore-then-commit (Algorithm 13, SETC).
Theorem 23.2. There exists a universal constants C, C 0 > 0 such that the regret
of SETC satisfies
X log(n) p
Rn ≤ 3kθk1 + C and Rn ≤ 3kθk1 + C 0 kθk0 n log(n) .
|θi |
i:θi 6=0
1: Input n and d
2: Set E1i = 1 and C1i = R for all i ∈ [d]
3: for t = 1, . . . , n do
4: For each i ∈ [d] sample Bti ∼ Rademacher
5: Choose action:
Bti if 0 ∈ Cti
(∀i) Ati = 1 if Cti ⊂ (0, ∞]
−1 if C ⊂ [−∞, 0) .
ti
s=1
Ti (t)
h i
(∀i) Ct+1,i = θ̂ti − Wti , θ̂ti + Wti
Eq. (23.2), we should be hopeful that the confidence intervals used by the
algorithm are sufficiently large to contain the true θi with high probability, but
this still needs to be proven.
Lemma 23.3. Define τi = n ∧ max{t : Eti = 1}, and let Fi = I {θi ∈ / Cτi +1,i } be
the event that θi is not in the confidence interval constructed at time τi . Then
P (Fi ) ≤ 1/n.
The proof of Lemma 23.3 is left until after the proof of Theorem 23.2.
Proof of Theorem 23.2 Recalling the definition of the regret and using the
fact that the optimal action is a∗ = sign(θ), we have the following regret
23.2 Elimination on the Hypercube 280
decomposition:
n
" # d
" n #!
X X X
Rn = maxhθ, ai − E hθ, At i = n|θi | − E Ati θi . (23.3)
a∈A
t=1 i=1 t=1
| {z }
Rni
Since τi is the first round t when 0 ∈ / Ct+1,i it follows that if Fi does not occur,
then θi ∈ Cτi ,i and 0 ∈ Cτi ,i . Thus the width of the confidence interval Cτi ,i must
be at least |θi |, and so
s
1 1 √
2Wτi −1,i = 4 + log n 2τi − 1 ≥ |θi | ,
τi − 1 (τi − 1)2
which after rearranging shows for some universal constant C > 0 that
C log(n)
I {Fic } (τi − 1) ≤ 1 + .
θi2
C log(n)
Rni ≤ 2n|θi |P (Fi ) + |θi | + .
|θi |
Using Lemma 23.3 to bound P (Fi ) and substituting into the decomposition
Eq. (23.3) completes the proof of the first part. The second part is left as a treat
for you (Exercise 23.2).
P
Proof of Lemma 23.3 Let Sti = j6=i Atj θj and Zti = Ati ηt +Ati Sti . For t ≤ τi ,
1X
t
θ̂ti − θi = Zsi .
t s=1
23.3 Online to Confidence Set Conversion 281
√
The next step is to show that Zti is conditionally 2-subgaussian for t ≤ τi :
The first inequality used the fact that ηt is conditionally 1-subgaussian. The second-
to-last inequality follows because Ati is conditionally Rademacher for t ≤ τi ,
which is 1-subgaussian by Hoeffding’s lemma (5.11). The final inequality follows
because Sti ≤ kAt k∞ kθk1 ≤ 1. The result follows by applying the concentration
bound from Exercise 20.8.
A new plan is needed to relax the assumption that the action set is a hypercube.
The idea is to modify the ellipsoidal confidence set used in Chapter 19 to have a
smaller radius. We will see that modifying the algorithm in Chapter 19 to use
√
the smaller confidence intervals improves the regret to Rn = O( dpn log(n)).
Without assumptions
√ on the action set, one cannot hope to have a regret
smaller than O( dn). To see this, recall that d-armed bandits can be
represented as linear bandits with At = {e1 , . . . , ed }. For these problems,
Theorem 15.2 shows
√ that for any policy there exists a d-armed bandit for
which Rn = Ω( dn). Checking the proof reveals that when adapted to the
linear setting the parameter vector is 2-sparse.
The construction that follows makes use of a kind of duality between online
prediction and confidence sets. While we will only apply the idea to the sparse
linear case, the approach is generic.
The prediction problem considered is online linear prediction under the
squared loss. This is also known as online linear regression. The learner
interacts with an environment in a sequential manner where in each round
t ∈ N+ :
The online learning literature has a number of powerful techniques for this
learning problem. Later we will give a specific result for the sparse case when
Θ = {x : kxk0 ≤ m0 }, but first we show how to use such a learning algorithm
to construct a confidence set. Take any learner for online linear regression, and
assume the environment generates Xt in a stochastic manner like in linear bandits:
Xt = hθ∗ , At i + ηt . (23.7)
Combining Eqs. (23.5) to (23.7) with elementary algebra,
n
X n
X
Qt = (X̂t − hθ∗ , At i)2 = ρn (θ∗ ) + 2 ηt (X̂t − hθ∗ , At i)
t=1 t=1
n
X
≤ Bn + 2 ηt (X̂t − hθ∗ , At i) , (23.8)
t=1
where the first equality serves as the definition of Qt . Let us now take stock for a
moment. If we could somehow remove the dependence on the noise ηt in the right-
hand side, then we could define a confidence set consisting of all θ that satisfy
the equation. Of course the noise has zero mean and is conditionally independent
of its multiplier, so the expectation of this term is zero. The fluctuations can be
controlled with high probability using a little concentration analysis. Let
t
X
Zt = ηs (X̂s − hθ∗ , As i) .
s=1
Since X̂t is chosen based on information available at the beginning of the round,
X̂t is Ft−1 -measurable, and so
for all λ ∈ R, E[exp(λ(Zt − Zt−1 )) | Ft−1 ] ≤ exp(λ2 σt2 /2) ,
where σt2 = (X̂t − hθ∗ , At i)2 . The uniform self-normalised tail bound
(Theorem 20.4) with λ = 1 implies that,
s !
1 + Qt
P exists t ≥ 0 such that |Zt | ≥ (1 + Qt ) log ≤ δ.
δ2
23.3 Online to Confidence Set Conversion 283
Provided this low-probability event does not occur, then from Eq. (23.8) we have
s
1 + Qt
Qt ≤ Bt + 2 (1 + Qt ) log . (23.9)
δ2
While both sides depend on Qt , the left-hand side grows linearly, while the
right-hand side grows sublinearly in Qt . This means that the largest value of Qt
that satisfies the above inequality is finite. A tedious calculation then shows this
value must be less than
√ √
8 + 1 + Bt
βt (δ) = 1 + 2Bt + 32 log . (23.10)
δ
By piecing together the parts, we conclude that with probability at least 1 − δ
the following holds for all t:
t
X
Qt = (X̂s − hθ∗ , As i)2 ≤ βt (δ) .
s=1
We could define Ct+1 to be the set of all θ such that the above holds with θ∗
replaced by θ, but there is one additionally subtlety, which is that the resulting
Pt
confidence interval may be unbounded (think about the case that s=1 As A> s is
not invertible). In Chapter 19 we overcame this problem by regularising the least
squares estimator. Since we have assumed that kθ∗ k2 ≤ m2 , the previous display
implies that
t
X
kθ∗ k22 + (X̂s − hθ∗ , As i)2 ≤ m22 + βt (δ) .
s=1
performing an algebraic calculation that we leave to the reader (see Exercise 23.5),
one can see that
t
X t
X
kθk22 + (X̂s − hθ, As i)2 = kθ − θ̂t k2Vt + (X̂s − hθ̂t , As i)2 + kθ̂t k22 . (23.11)
s=1 s=1
Using this, the confidence set can be rewritten in the familiar form of an ellipsoid:
( t
)
X
2 2 2 2 2
Ct+1 = θ ∈ R : kθ − θ̂t kVt ≤ m2 + βt (δ) − kθ̂t k2 −
d
(X̂s − hθ̂t , As i) .
s=1
23.4 Sparse Online Linear Prediction 284
It is not obvious that Ct+1 is not empty because the radius could be negative.
Theorem 23.4 shows, however, that with high probability θ∗ ∈ Ct+1 . At last we
have established all the conditions required for Theorem 19.2, which implies the
following theorem bounding the regret of Algorithm 14:
Theorem 23.5. With probability at least 1 − δ the pseudo-regret of OLR-UCB
satisfies
r n
R̂n ≤ 8dn (m22 + βn−1 (δ)) log 1 + .
d
Theorem 23.6. There exists a strategy π for the learner such that for any
θ ∈ Rd , the regret ρn (θ) of π against any strategic environment such that
maxt∈[n] kAt k2 ≤ L and maxt∈[n] |Xt | ≤ X satisfies
n o
ρn (θ) ≤ cX 2 kθk0 log(e + n1/2 L) + Cn log 1 + kθk 1
kθk0 + (1 + X 2 )Cn ,
where c > 0 is some universal constant and Cn = 2 + log2 log(e + n1/2 L).
Note that Cn = O(log log(n)), so by dropping the dependence on X and L, we
have
sup ρn (θ) = O(m0 log(n)) .
θ:kθk0 ≤m0 ,kθk2 ≤L
As a final catch, the rewards (Xt ) in sparse linear bandits with subgaussian noise
23.5 Notes 285
are not necessarily bounded. However, the subgaussian property implies that with
probability 1 − δ, |ηt | ≤ log(2/δ). By choosing δ = 1/n2 and Assumption 23.1,
we have
2
1
P max |Xt | ≥ 1 + log 2n ≤ .
t∈[n] n
Putting all the pieces together shows that the expected regret of OLR-UCB when
using the predictor provided by Theorem 23.6 and when kθk0 ≤ m0 satisfies
p
Rn = O dnm0 log(n)2 .
23.5 Notes
23.7 Exercises
23.2 (Minimax bound for SETC) Prove the second part of Theorem 23.2.
Hint One way is to use the doubling trick, but a more careful approach will
lead to a more practical algorithm.
23.4 Complete the calculation to derive Eq. (23.10) from Eq. (23.9).
Lower bounds for linear bandits turn out to be more nuanced than those for the
classical finite-armed bandit. The difference is that for linear bandits the shape
of the action set plays a role in the form of the regret, not just the distribution
of the noise. This should not come as a big surprise because the stochastic
finite-armed bandit problem can be modeled as a linear bandit with actions
being the standard basis vectors, A = {e1 , . . . , ek }. In this case the actions are
orthogonal, which means that samples from one action do not give information
about the rewards for other actions. Other action sets such as the unit ball
(A = B2d = {x ∈ Rd : kxk2 ≤ 1}) do not share this property. For example, if
d = 2 and A = B2d and an algorithm chooses actions e1 = (1, 0) and e2 = (0, 1)
many times, then it can deduce the reward it would obtain from choosing any
other action.
All results of this chapter have a worst-case flavour showing what is (not)
achievable in general, or under a sparsity constraint, or if the realisable assumption
is not satisfied. The analysis uses the information-theoretic tools introduced in
Part IV combined with careful choices of action sets. The hard part is guessing
what is the worst case, which is followed by simply turning the crank on the
usual machinery.
In all lower bounds, we use a simple model with Gaussian noise. For action
At ∈ A ⊆ Rd the reward is Xt = µ(At ) + ηt where ηt ∼ N (0, 1) is a sequence of
independent standard Gaussian noise and µ : A → R is the mean reward. We
will usually assume there exists a θ ∈ Rd such that µ(a) = ha, θi. We write Pµ to
indicate the measure on outcomes induced by the interaction of the fixed policy
and the Gaussian bandit paramterised by µ. Because we are now proving lower
bounds, it becomes necessary to be explicit about the dependence of the regret
on A and µ or θ. The regret of a policy is:
" n
#
X
Rn (A, µ) = n max µ(a) − Eµ Xt ,
a∈A
t=1
24.1 Hypercube
The first lower bound is for the hypercube action set and shows that the upper
bounds in Chapter 19 cannot be improved in general.
Theorem 24.1. Let A = [−1, 1]d and Θ = {−n−1/2 , n−1/2 }d . Then, for any
policy, there exists a vector θ ∈ Θ such that:
exp(−2) √
Rn (A, θ) ≥ d n.
8
Proof By the relative entropy identities in Exercise 15.8.(b) and Exercise 14.7,
we have for θ, θ0 ∈ Θ that
" n #
X
D(Pθ , Pθ0 ) = Eθ D(N (hAt , θi, 1), N (hAt , θ i, 1))
0
t=1
1X
n
= Eθ hAt , θ − θ0 i2 . (24.1)
2 t=1
Now let i ∈ [d] and θ ∈ Θ be fixed, and let θj0 = θj for j 6= i and θi0 = −θi . Then,
by the Bretagnolle–Huber inequality (Theorem 14.2) and Eq. (24.1),
!
1 1X 1
n
pθi + pθ0 i ≥ exp − Eθ [hAt , θ − θ0 i2 ] ≥ exp (−2) . (24.2)
2 2 t=1 2
where the first line follows since the optimal action satisfies a∗i = sign(θi ) for
i ∈ [d], the first inequality follows from a simple case-based analysis showing that
(sign(θi ) − Ati )θi ≥ |θi |I {sign(Ati ) 6= sign(θi )}, the second inequality is Markov’s
inequality (see Lemma 5.1), and the last inequality follows from the choice of
θ.
Except for logarithmic factors, this shows that the algorithm of Chapter 19
is near optimal for this action set. The same proof works when A = {−1, 1}d
is restricted to the corners of the hypercube, which is a finite-armed
p linear
bandit. In Chapter 22, we gave a policy with regret Rn = O( nd log(nk)),
where k = |A|. There is no contradiction because the action set in the above
proof has k = |A| = 2d elements.
Lower-bounding the minimax regret when the action set is the unit ball presents
an additional challenge relative to the hypercube. The product structure of
the hypercube means that the actions of the learner in one dimension do not
constraint their choices in other dimensions. For the unit ball, this is not true, and
this complicates the analysis. Nevertheless, a small modification of the technique
allows us to prove a similar bound.
and the assumption that d ≤ 2n. The inequality in Eq. (24.3) follows from the
chain rule for the relative entropy up to a stopping time (Exercise 15.7). Eq. (24.4)
is true by the definition of τi and Eq. (24.5) by the assumption that d ≤ 2n.
Then,
√ r
4 3n∆ n
Eθ [Ui (1)] + Eθ0 [Ui (−1)] ≥ Eθ0 [Ui (1) + Ui (−1)] −
d d
" # √ r √ r
4 3n∆ n 2n 4 3n∆ n
τ
τi X 2 i
n
= 2Eθ0 + Ati − ≥ − = .
d t=1
d d d d d d
Let Bt = V At ∈ [k]p represent the vector of ‘base’ actions chosen by the learner
in each of the p bandits in round t. The optimal action in the ith bandit is
(i)
b∗i (θ) = argmaxb∈[k] θb .
The regret can be decomposed into the regrets in the p ‘base bandit’ problems (a
form of separability, again):
p
" n #
X X
Rn (θ) = ∆ Eθ I {Bti 6= bi } .
∗
i=1 t=1
| {z }
Rni (θ)
24.4 Misspecified Models 292
Then let ε = supa∈A |hθ, ai − µ(a)| be the maximum error. It would be very
pleasant to have an algorithm such that
" n #
X √ √
Rn (A, µ) = n max µ(a) − E µ(At ) = Õ(min{d n + εn, kn}) . (24.7)
a∈A
t=1
Unfortunately, it turns out that results of this kind are not achievable. To show
this, we will prove a generic bound for the classical finite-armed bandit problem
and afterwards show how this implies the impossibility of an adaptive bound like
the above.
Theorem 24.4. Let A = [k], and for µ ∈ [0, 1]k the reward is Xt = µAt + ηt and
the regret is
" n #
X
Rn (µ) = n max µi − Eµ µAt .
i∈A
t=1
24.4 Misspecified Models 293
Define Θ, Θ0 ⊂ Rk by
Θ = µ ∈ [0, 1]k : µi = 0 for i > 1 Θ0 = µ ∈ [0, 1]k .
p
If V ∈ R is such that 2(k − 1) ≤ V ≤ n(k − 1) exp(−2)/8 and supµ∈Θ Rn (µ) ≤
V , then
n(k − 1)
sup Rn (µ0 ) ≥ exp(−2) .
µ0 ∈Θ0 8V
Pn
Proof Recall that Ti (n) = t=1 I {At = i} is the number of times arm i is
played after all n rounds. Let µ ∈ Θ be given by µ1 = ∆ = (k − 1)/V ≤ 1/2. The
regret is then decomposed as:
k
X
Rn (µ) = ∆ Eµ [Ti (n)] ≤ V .
i=2
Pk
Rearranging shows that i=2 Eµ [Ti (n)] ≤ ∆,
V
and so by the pigeonhole principle
there exists an i > 1 such that
V 1
Eµ [Ti (n)] ≤ = 2.
(k − 1)∆ ∆
Then, define µ0 ∈ Θ0 by
∆
if j = 1
µ0j = 2∆ if j = i
0 otherwise .
Next, by Theorem 14.2 and Lemma 15.1, for any event A, we have
1 1 1
Pµ (A) + Pµ0 (Ac ) ≥ exp (D(Pµ , Pµ0 )) = exp −2∆2 E[Ti (n)] ≥ exp (−2) .
2 2 2
By choosing A = {T1 (n) ≤ n/2} we have
n∆ n(k − 1)
Rn (µ) + Rn (µ0 ) ≥ exp(−2) = exp(−2) .
4 4V
p
Therefore, by the assumption that Rn (µ) ≤ V ≤ n(k − 1) exp(−2)/8 we have
n(k − 1)
Rn (µ0 ) ≥ exp(−2) .
8V
As promised, we now relate this to the misspecified linear bandits. Suppose
that d = 1 (an absurd case) and that there are k arms A = {a1 , a2 , . . . , ak } ⊂ R1 ,
where a1 = (1) and ai = (0) for i > 1. Clearly, if θ > 0 and µ(ai ) = hai , θi, then
the problem can be modelled as a finite-armed bandit with means µ ∈ Θ ⊂ [0, 1]k .
In the general case, we just have a finite-armed bandit with µ ∈ Θ0 . If in the first
√
case we have Rn (A, µ) = O( n), then the theorem shows for large enough n that
√
sup Rn (A, µ) = Ω(k n) .
µ∈Θ0
24.5 Notes 294
It follows that Eq. (24.7) is a pipe dream. To our knowledge, it is still an open
question of what is possible on this front. We speculate that for k ≥ d2 , there is
a policy for which
√ √ k√
Rn (A, θ) = Õ min d n + εn d, n .
d
24.5 Notes
1 The worst-case bound demonstrates the near optimality of the OFUL algorithm
for a specific action set. It is an open question to characterise the optimal
regret for a wide range of action sets. We will return to these issues in the next
part of the book, where we discuss adversarial linear bandits.
2 We return to misspecified bandits in the notes and exercises of Chapter 29,
where algorithms from the adversarial linear bandit framework are applied to
this problem in special cases. √
In many applications, the number of actions is so
√
large that Rn = Õ(d n + εn d) should be considered acceptable. There exist
algorithms achieving this bound, which for large k is essentially not improvable
in the worst case [Lattimore and Szepesvári, 2019b]. For small k,√recent work√by
Foster and Rakhlin [2020] shows that one can achieve Rn = Õ( dkn + εn k).
24.7 Exercises
24.1 Complete the missing steps to prove the inequality in Eq. (24.6).
25 Asymptotic Lower Bounds for
Stochastic Linear Bandits
The lower bounds in the previous chapter were derived by analysing the worst
case for specific action sets and/or constraints on the unknown parameter. In this
chapter, we focus on the asymptotics and aim to understand the influence of the
action set on the regret. We start with a lower bound, and argue that the lower
bound can be achieved. We finish by arguing that the optimistic algorithms (and
Thompson sampling) will perform arbitrarily worse than what can be achieved
by non-optimistic algorithms.
where the dependence on the policy is omitted for readability and Eθ [·] is the
expectation with respect to the measure on outcomes induced by the interaction
of the policy and the linear bandit determined by θ. Like the asymptotic lower
bounds in the classical finite-armed case (Chapter 16), the results of this chapter
are proven only for consistent policies. Recall that a policy is consistent in some
class of bandits E if the regret is sub-polynomial for any bandit in that class.
Here this means that
The main objective of the chapter is to prove the following theorem on the
behaviour of any consistent policy and discuss the implications.
lim inf n→∞ λmin (Ḡn )/ log(n) > 0. Furthermore, for any a ∈ A, it holds that
∆2a
lim sup log(n)kak2Ḡ−1 ≤ .
n→∞ n 2
The reader should recognise kak2Ḡ−1 as the key term in the width of the
n
confidence interval for the least squares estimator (Chapter 20). This is quite
intuitive. The theorem is saying that any consistent algorithm must prove
statistically that all suboptimal arms are indeed suboptimal by making the
size of the confidence interval smaller than the suboptimality gap. Before the
proof of this result, we give a corollary that characterises the asymptotic regret
that must be endured by any consistent policy.
Theorem 25.3. Let A ⊂ Rd be a finite set that spans Rd . Then there exists a
policy such that
Rn (A, θ)
lim sup ≤ c(A, θ) ,
n→∞ log(n)
Proof of Theorem 25.1 The proof of the first part is simply omitted (see the
reference below for details). It follows along similar lines to what follows, essentially
that if Gn is not sufficiently large in every direction, then some alternative
parameter is not sufficiently identifiable. Let a∗ = argmaxa∈A ha, θi be the optimal
action, which we assumed to be unique. Let θ0 ∈ Rd be an alternative parameter
to be chosen subsequently, and let P and P0 be the measures on the sequence
of outcomes A1 , X1 , . . . , An , Xn induced by the interaction between the policy
and the bandit determined by θ and θ0 respectively. Let E[·] and E0 [·] be the
expectation operators of P and P0 , respectively. By Theorem 14.2 and Lemma 15.1,
25.1 An Asymptotic Lower Bound for Fixed Action Sets 297
where we introduced
ka − a∗ k2Ḡ−1 ka − a∗ k2H Ḡ
ρn (H) = n nH
.
ka − a∗ k4H
25.1 An Asymptotic Lower Bound for Fixed Action Sets 298
Therefore, by choosing E to be the event that Ta∗ (n) < n/2 and using (25.3) and
(25.2), we have
(∆a + ε)2 nε
ρ (H) ≥ log ,
2ka − a∗ k2Ḡ−1 4Rn + 4Rn0
n
n
The definition of consistency means that Rn and Rn0 are both sub-polynomial,
which implies that the second term in the previous expression tends to zero for
large n and so by sending ε to zero,
ρn (H) 2
lim inf 2 ≥ 2. (25.4)
n→∞ log(n)ka − a kḠ−1
∗ ∆a
n
Ḡn . Such a point must exist, since matrices in this sequence have unit spectral
−1
ka − a kH > 0:
∗
ka − a∗ k2Ḡ−1
ka − a∗ k2H = lim0 n
> 0,
n∈S kḠ−1
n k
where the last inequality follows from the assumption in (25.5) and the first part
of the theorem. Therefore,
ka − a∗ k2Ḡ−1 ka − a∗ k2H Ḡ
1 < lim inf ρn (H) ≤ lim inf n nH
= 1,
n∈S 0 n∈S ka − a∗ k4H
which is a contradiction, and hence (25.5) does not hold. Thus,
∆2a
lim sup log(n)ka − a∗ k2Ḡ−1 ≤ .
n→∞ n 2
We leave the proof of the corollary as an exercise for the reader. Essentially,
though, any consistent algorithm must choose its actions so that in expectation
∆2a
ka − a∗ k2Ḡ−1 ≤ (1 + o(1)) .
n 2 log(n)
25.2 Clouds Looming for Optimism 299
Now, since a∗ will be chosen linearly, often it is easily shown for suboptimal a
that limn→∞ ka − a∗ kḠ−1
n
/kakḠ−1
n
→ 1. This leads to the required constraint on
the actions of the algorithm, and the optimisation problem in the corollary is
derived by minimising the regret subject to this constraint.
X ∆2a
α(a)∆a subject to kak2H(α)−1 ≤ for all a ∈ A with ∆a > 0 ,
2
a∈A
P
where H(α) = a∈A α(a)aa> . Clearly we should choose α(a1 ) arbitrarily large,
then a computation shows that
0 0
lim H(α)−1 = .
α(a1 )→∞ 1
0 α(a3 )ε2 γ 2 +α(a2 )
25.2 Clouds Looming for Optimism 300
where Ct ⊂ Rd is a confidence set that we assume contains the true θ with high
probability. So far this does not greatly restrict the class of algorithms that we
might call optimistic. We now assume that there exists a constant c > 0 such
that
n p o
Ct ⊆ θ̃ : kθ̂t − θ̃kVt ≤ c log(n) ,
Pt
where Vt = s=1 As A> s . So now we ask how often we can expect the optimistic
algorithm to choose action a2 = e2 in the example described above. Since we
have assumed θ ∈ Ct with high probability, we have that
maxha1 , θ̃i ≥ 1 .
θ̃∈Ct
which means that a2 will not be chosen more than 1 + 4c2 log(n) times. So if
γ = Ω(c2 ), then the optimistic algorithm will not choose a2 sufficiently often
and a simple computation shows it must choose a3 at least Ω(log(n)/ε2 ) times
25.3 Notes 301
and suffers regret of Ω(log(n)/ε). The key take away from this is that optimistic
algorithms do not choose actions that are statistically suboptimal, but for linear
bandits it can be optimal to choose these actions more often to gain information
about other actions.
25.3 Notes
1 All algorithms known to match the lower bound in Theorem 25.3 are based
on (or inspired by) solving the optimisation problem that defines c(A, θ) with
estimated value θ. Unfortunately, these algorithms are not especially practical
in finite time. As far as we know, none are simultaneously near-optimal in a
minimax sense. Constructing a practical asymptotically optimal algorithm for
linear bandits is a fascinating open problem.
2 In Chapter 36 we will introduce the randomised Bayesian algorithm called
Thompson sampling algorithm for finite-armed and linear bandits. While
Thompson sampling is often empirically superior to UCB, it does not overcome
the issues described here.
The theorems of this chapter are by the authors: Lattimore and Szepesvári [2017].
The example in Section 25.2 first appeared in a paper by Soare et al. [2014],
which deals with the problem of best-arm identification for linear bandits (for an
introduction to best-arm identification, see Chapter 33). The optimisation-based
algorithms that match the lower bound are by Lattimore and Szepesvári [2017],
Ok et al. [2018], Combes et al. [2017] and Hao et al. [2020], with the latter
handling also the contextual case with finitely many contexts.
25.5 Exercises
The convex hull co(A) is also defined for an arbitrary set A ⊂ Rd and is still the
smallest convex set that contains A (see (c) in Figure 26.1). For the rest of the
section, we let A ⊆ Rd be convex. Let R̄ = R ∪ {−∞, ∞} be the extended real
number system and define operations involving infinities in the natural way (see
notes).
The term ‘epi’ originates in greek and it means upon or over: The epigraph of
a function is the set of points that sit on the top of the function’s graph.
The domain of an extended real-valued function on Rd is dom(f ) = {x ∈ Rd :
f (x) < ∞}. For S ⊂ Rd , a function f : S → R̄ is identified with the function
f¯ : Rd → R̄, which coincides with f on S and is defined to take the value ∞
outside of S. It follows that if f : S → R, then dom(f ) = S. A convex function is
proper if its range does not include −∞ and its domain is nonempty.
For the rest of the chapter, we will write ‘let f be a convex’ to mean that
f : Rd → R̄ is a proper convex function.
26.1 Convex Sets and Functions 306
Figure 26.1 (a) is a convex set. (b) is a non-convex set. (c) is the convex hull of a
non-convex set. (d) is a convex function. (e) is non-convex, but all local minimums are
global. (f) is not convex.
Some authors use Eq. (26.1) as the definition of a convex function along
with a specification that the domain is convex: If A ⊆ Rd is convex, then
f : A → R is convex if it satisfies Eq. (26.1), with f (x) = ∞ assumed for
x∈/ A.
The reader is invited to prove that all convex functions are continuous on the
interior of their domain (Exercise 26.1).
A function is strictly convex if the inequality in Eq. (26.1) is always strict.
The Fenchel dual of a function f is f ∗ (u) = supx hx, ui − f (x), which is convex
because the maximum of convex functions is convex. The Fenchel dual has many
nice properties. Most important for us is that for sufficiently nice functions, ∇f ∗
is the inverse of ∇f (Theorem 26.6). Another useful property is that when f is
a proper convex function and its epigraph is closed, then f = f ∗∗ , where f ∗∗
denotes the bidual of f : f ∗∗ = (f ∗ )∗ . The Fenchel dual is also called the convex
26.2 Jensen’s Inequality 307
One of the most important results for convex functions is Jensen’s inequality:
Theorem 26.2 (Jensen’s inequality). Let f : Rd → R̄ be a measurable convex
function and X be an Rd -valued random element on some probability space such
that E[X] exists and X ∈ dom(f ) holds almost surely. Then E[f (X)] ≥ f (E[X]).
If we allowed Lebesgue integrals to take on the value of ∞, the condition that
X is almost surely an element of the domain of f could be removed and the
result would still be true. Indeed, in this case we would immediately conclude
that E[f (X)] = ∞ and Jensen’s inequality would trivially hold.
The basic inequality of (26.1) is trivially
a special case of Jensen’s inequality. Jensen’s f (x)
inequality is so central to convexity that it
can actually be used as the definition (a Pn
(x̄, pk f (xk ))
function is convex if and only if it satisfies k=1
f (x)
f (y) + hx − y, ∇f (y)i
Df (x, y)
y x
Figure 26.2 The Bregman divergence Df (x, y) is the difference between f (x) and the
Taylor series approximation of f at y. When f is convex the, linear approximation is a
lower bound on the function and the Bregman divergence is positive.
and
d
X d
X d
X
Df (x, y) = (xi log(xi ) − xi ) − (yi log yi − yi ) − log(yi )(xi − yi )
i=1 i=1 i=1
d
X d
X
xi
= xi log + (yi − xi ) .
i=1
yi i=1
Notice that if x, y ∈ Pd−1 are in the unit simplex, then Df (x, y) is the relative
entropy between probability vectors x and y. The function f is called the
unnormalised negentropy, which will feature heavily in many of the chapters
26.4 Legendre Functions 309
that follow. When y 6> 0, the Bregman divergence is infinite if there exists an
P
i such that yi = 0 and xi > 0. Otherwise, Df (x, y) = i:xi >0 xi log(xi /yi ) +
Pd
i=1 (yi − xi ).
(a) C is non-empty; −2
(b) f is differentiable and strictly convex on 0 1 2 3 4
C; and √
(c) limn→∞ k∇f (xn )k2 = ∞ for any Figure 26.3 f (x) = − x: the
archetypical Legendre function
sequence (xn )n with xn ∈ C for all n
and limn→∞ xn = x and some x ∈ ∂C.
The intuition is that the set {(x, f (x)) : x ∈ dom(A)} is a ‘dish’ with ever-
steepening edges towards the boundary of the domain. Legendre functions have
some very convenient properties:
The next result formalises the ‘dish’ intuition by showing the directional
derivative along any straight path from a point in the interior to the boundary
blows up. You should supply the proof of the following results in Exercise 26.6.
Example 26.9. Let f be the Legendre function given by f (x) = 12 kxk22 , which
has domain dom(f ) = Rd . Then, f ∗ (x) = f (x) and ∇f and ∇f ∗ are the identity
functions.
26.4 Legendre Functions 310
Pd √
Example 26.10. Let f (x) = −2 i=1 xi when xi ≥ 0 for all i and ∞
otherwise, which has dom(f ) = [0, ∞)d and int(dom(f )) = (0, ∞)d . The gradient
√
is ∇f (x) = −1/ x, which blows up (in norm) on any sequence (xn ) approaching
√
∂ int(dom(f )) = {x ∈ [0, ∞)d : xi = 0 for some i ∈ [d]}. Here, x stands for
√
the vector ( xi )i . In what follows we will often use the underlying convention
of extending univariate functions to vector by applying them componentwise.
Note that k∇f (x)k → 0 as kxk → ∞: ‘∞’ is not part of the boundary of dom(f ).
Strict convexity is also obvious so f is Legendre. In Exercise 26.8, we ask you
to calculate the Bregman divergences with respect to f and f ∗ and verify the
results of Theorem 26.6.
P
Example 26.11. Let f (x) = i xi log(xi ) − xi be the unnormalised negentropy,
which we met in Example 26.5. Similarly to the previous example, dom(f ) =
[0, ∞)d , int(dom(f )) = (0, ∞)d and ∂ int(dom(f )) = {x ∈ [0, ∞)d : xi =
0 for some i ∈ [d]}. The gradient is ∇f (x) = log(x), and thus k∇f (x)k → ∞ as
x → ∂ int(dom(f )). Strict convexity also holds, hence f is Legendre. You already
met the Bregman divergence Df (x, y), which turned out to be the relative entropy
when x, y belong to the simplex. Exercise 26.9 asks you to calculate the dual of f
(can you guess what this function will be?) and the Bregman divergence induced
by f ∗ and to verify Theorem 26.6.
A = int(dom(f )), x, y ∈ A, and let z ∈ [x, y] be the point such that Df (x, y) =
1 2
2 kx − yk∇2 f (z) . Then, for all u ∈ R ,
d
Df (x, y) η
hx − y, ui − ≤ kuk2(∇2 f (z))−1 .
η 2
Although the Bregman divergence is not symmetric, the right-hand side does
not depend on the order of x and y in the Bregman divergence except that
z ∈ [x, y] may be different.
26.5 Optimisation
−∇f (x)
better point
−∇f (x∗ )
Figure 26.4 Illustration of first-order optimality conditions. The point at the top is not a
minimiser because the hyperplane with normal as gradient does not support the convex
set. The point at the right is a minimiser.
x∗ ∈ argminx∈A f (x) ⇐⇒
∀x ∈ A ∩ dom(f ) : hx − x∗ , ∇f (x∗ )i ≥ 0 . (26.3)
The part that concerns the Legendre objective f follows by noting that by
Corollary 26.8, x∗ ∈ int(dom(f )) combined with that by Theorem 26.6(a),
int(dom(f )) = dom(∇f ).
26.6 Projections
26.7 Notes
The main source for these notes is the excellent book by Rockafellar [2015]. The
basic definitions are in part I. The Fenchel dual is analysed in part III while
Legendre functions are found in part V. Convex optimisation is a huge topic. The
standard text is by Boyd and Vandenberghe [2004].
26.9 Exercises
26.4 For each of the real-valued functions below, decide whether or not it is
Legendre on the given domain:
(ui − vi )2
d
X
Df ∗ (u, v) = − .
i=1
ui vi2
26.9 Exercises 315
26.10 Let f be Legendre. Show that f˜ given by f˜(x) = f (x) + hx, ui is also
Legendre for any u ∈ Rd .
26.12 Let α ∈ [0, 1/d] and A = Pd−1 ∩ [α, 1]d and f be the unnormalised
negentropy function. Let y ∈ [0, ∞)d and x = argminx∈A Df (x, y) and assume
that y1 ≤ y2 ≤ · · · ≤ yd . Let m be the smallest value such that
d
X
ym (1 − (m − 1)α) ≥ α yj .
j=m
Show that
(
α if i < m
xi = Pd
(1 − (m − 1)α)yi / j=m yj otherwise .
26.14 Prove Theorem 26.3 and show that Part (c) does not hold in general
when f is not differentiable at y.
Hint For the first part, simply apply Taylor’s theorem. For the second part,
26.9 Exercises 316
The model for adversarial linear bandits is as follows. The learner is given an
action set A ⊂ Rd and the number of rounds n. As usual in the adversarial setting,
it is convenient to switch to losses. An instance of the adversarial problem is a
sequence of loss vectors y1 , . . . , yn taking values in Rd . In each round t ∈ [n], the
learner selects a possibly random action At ∈ A and observes a loss Yt = hAt , yt i.
The learner does not observe the loss vector yt . The regret of the learner after n
rounds is
" n # n
X X
Rn = E Yt − min ha, yt i .
a∈A
t=1 t=1
where η > 0 is the learning rate. To control the variance of the loss estimates,
it will be useful to mix this distribution with an exploration distribution π
P
(π : A → [0, 1] and a∈A π(a) = 1). The mixture distribution is
Pt (a) = (1 − γ)P̃t (a) + γπ(a) ,
where γ is a constant mixing factor to be chosen later. The algorithm then simply
samples its action At from Pt :
A t ∼ Pt .
Recall that Yt = hAt , yt i is the observed loss after taking action At . We need a
.
way to estimate yt (a) = ha, yt i. The idea is to use least squares to estimate yt with
Ŷt = Rt At Yt , where Rt ∈ Rd×d is selected so that Ŷt is an unbiased estimate of yt
given the history. Then the loss for a given action is estimated by Ŷt (a) = ha, Ŷt i.
To find the choice of Rt that makes Ŷt unbiased, let Et [·] = E [·|Pt ] and calculate
!
X
Et [Ŷt ] = Rt Et [At At ]yt = Rt
>
Pt (a)aa>
yt .
a∈A
| {z }
Qt
Pt (a) = γπ(a) + (1 − γ) P P .
a0 ∈A exp −η s=1 Ŷs (a )
t−1 0
4: Sample action At ∼ Pt
5: Observe loss Yt = hAt , yt i and compute loss estimates:
Ŷt = Q−1
t At Yt and Ŷt (a) = ha, Ŷt i .
6: end for
Algorithm 15: Exp3 for linear bandits.
27.2 Regret Analysis 319
Theorem 27.1. Assume that A is non-empty and let k = |A|. For any exploration
distribution π, for some parameters η and γ, for all (yt )t with yt ∈ L, the regret
of Algorithm 15 satisfies
p
Rn ≤ 2 (2g(π) + d)n log(k) , (27.1)
The utility of (27.1) is that at times, calculating the distribution that minimises
g(π) or sampling from it may be difficult, in which case, one may employ a
distribution that trades off computation with the regret.
Proof Assume that the learning rate η is chosen so that for each round t the
loss estimates satisfy
Then, by adopting the proof of Theorem 11.1 (see Exercise 27.1), the regret is
bounded by
" #
log k Xn X
2
Rn ≤ + 2γn + η E Pt (a)Ŷt (a) . (27.3)
η t=1 a∈A
Note that we cannot use the proof that leads to the tighter constant (η getting
replaced by η/2 in the second term above) because we would loose too much in
other parts of the proof by guaranteeing that the loss estimates are bounded
by one (see below). To get a regrethP bound, it remains
i to set γ and η so that
2
(27.2) is satisfied and to bound E a Pt (a)Ŷt (a) . We start with the latter. Let
P 2
Mt = a Pt (a)Ŷt (a). By the definition of the loss estimate,
P 2 2 > −1
which means that Mt = a Pt (a)Ŷt (a) = Yt At Qt At ≤ At Qt At =
> −1
!
X
E[Mt | Pt ] ≤ trace Pt (a)aa Qt
> −1
= d.
a∈A
It remains to choose γ and η. Strengthen (27.2) to |η Ŷt (a)| ≤ 1 and note that
since |Yt | ≤ 1,
Q(π)−1 /γ by Exercise 27.4. Using this and the Cauchy–Schwarz inequality shows
that
1 g(π)
t At | ≤ kakQ−1 kAt kQ−1 ≤ max ν Qt ν ≤
|a> Q−1 max ν > Q−1 (π)ν =
> −1
,
t t ν∈A γ ν∈A γ
Choosing γ = ηg(π) guarantees |η Ŷt (a)| ≤ 1. Plugging this choice into (27.3), we
get
log k p
Rn ≤ + ηn(2g(π) + d) = 2 (2g(π) + d)n log(k) ,
η
q
log(k)
where the last equality is derived by choosing η = (2g(π)+d)n finishing the proof
of (27.1).
For the second half, recall that by the Kiefer–Wolfowitz theorem (Theorem 21.1
and Exercise 21.6), there exists a sampling distribution π such that g(π) ≤ d.
Plugging this value into (27.1), finishes the proof.
A standard calculation shows (Exercise 27.6) shows that C can always be chosen
so that log |C| ≤ d log(6dn). Then it is easy to check that
p Exp3 on C suffers regret
relative to the best action in A of at most Rn = O(d n log(nd)). The problem
with this approach is that C is exponentially large in d, which makes this algorithm
intractable in most situations. When A is convex, a more computationally
tractable approach is to use the continuous exponential weights algorithm.
For this section, we assume that A is convex and has positive Lebesgue
measure. The latter condition can be relaxed with some care (Exercise 27.10).
supported on A defined by
R P
exp (a)
t−1
B
−η Ŷ
s=1 s da
P̃t (B) = R P . (27.5)
exp −η s=1 Ŷs (a) da
t−1
A
We will shortly see that the analysis in the previous section can be copied almost
verbatim to prove a regret bound for this strategy. But what has been bought here?
Rather than sampling from a discrete distribution on a large number of arms, we
now have to sample from a probability measure on a convex set. Sampling from
arbitrary probability measures is itself a challenging problem, but under certain
conditions there are polynomial time algorithms for this problem. The factors
that play the biggest role in the feasibility of sampling from a measure are (a)
the form of the measure or its density and (b) how the convex set is represented.
As it happens, the measure defined in the last display is log-concave, which
means that the logarithm of the density, with respect to the Lebesgue measure
on A, is a concave function.
Theorem 27.2. Let p(a) ∝ IA (a) exp(−f (a)) be a density with respect to the
Lebesgue measure on A such that f : A → R is a convex function. Then there
exists a polynomial-time algorithm for sampling from p, provided one can compute
the following efficiently:
x
A
Figure 27.1 Separation oracle returns the normal of a hyperplane that separates x from
A whenever x ∈
/ A. When x ∈ A, the separation oracle returns true.
The left-hand side in the above display is the logarithmic Laplace transform
of the uniform measure on K − {x∗ } evaluated at u.
Proof of Theorem 27.3 As before, choosing γ = dη ensures that |ηha, Ŷt i| ≤ 1 for
all a ∈ A (see the proof of Theorem 27.1). The standard argument (Exercise 27.9)
shows that
1 vol(A)
Rn ≤ E log R P + 3ηdn . (27.6)
η exp −η (Ŷ (a) − Ŷ (a∗ )) da
n
A t=1 t t
Pn
Using again that η|ha, Ŷt i| ≤ 1 and Proposition 27.4 with u = η t=1 Ŷt shows
that
d(1 + log+ (n/d)) q
Rn ≤ + 3ηdn ≤ 2d 3n(1 + log+ (2n/d)) .
η
27.4 Notes
√
3 The O( n) dependence of the regret on the horizon is not improvable, but the
linear dependence on the dimension is suboptimal for certain action sets and
optimal for others. An example where improvement is possible occurs when A
is the unit ball, which is analysed in the next chapter.
4 A slight modification of the set-up allows the action set to change in each
round, but where actions have identities. Suppose that k ∈ {1, 2, . . .} and At =
{a1 (t), . . . , ak (t)} and the adversary chooses losses so that maxa∈At |ha, yt i| ≤ 1
for all t. Then a straightforward adaptation of Algorithm 15 and Theorem 27.1
leads to an algorithm for which
" n #
X p
Rn = max E hAt − ai (t), yt i ≤ 2 3dn log(k) .
i∈[k]
t=1
The definition of the regret still compares the learner to the best single action
in hindsight, which makes it less meaningful than the definition of the regret
in Chapter 19 for stochastic linear bandits with changing action sets. These
differences are discussed in more detail in Chapter 29. See also Exercise 27.5.
The results in Sections 27.1 and 27.2 follow the article by Bubeck et al. [2012],
with minor modifications to make the argument more pedagogical. The main
difference is that they used John’s ellipsoid over the action set for exploration,
which is only the right thing when John’s ellipsoid is also a central ellipsoid. Here
we use Kiefer–Wolfowitz, which is equivalent to finding the minimum volume
central ellipsoid containing the action set. Theorem 27.2, which guarantees
the existence of a polynomial time sampling algorithm for convex sets with
gradient information and projections is by Bubeck et al. [2015b]. We warn the
reader that these algorithms are not very practical, especially if theoretically
justified parameters are used. The study of sampling from convex bodies is quite
fascinating. There is an overview by Lovász and Vempala [2007], though it is a
little old. The continuous exponential weights algorithm is perhaps attributable
to Cover [1991] in the special setting of online learning called universal portfolio
optimisation. The first application to linear bandits is by Hazan et al. [2016].
Their algorithm and analysis are more complicated because they seek to improve
the computation properties by replacing the exploration distribution based on
Kiefer–Wolfowitz with an adaptive randomised exploration basis that can be
computed in polynomial time under weaker assumptions. Continuous exponential
weights for linear bandits using the core set of John’s ellipsoid for exploration
(rather than Kiefer–Wolfowitz) was recently p analysed by van der Hoeven et al.
[2018]. Another path towards an efficient O(d n log(·)) policy for convex action
sets is to use the tools from online optimisation. We explain some of these ideas
in more detail in the next chapter, but the reader is referred to the paper by
Bubeck and Eldan [2015].
27.6 Exercises 324
27.6 Exercises
27.3 (Dependence on the range of losses (ii)) Now suppose that a < b
are known and yt ∈ {y ∈ Rd : ha, yi ∈ [a, b] for all a ∈ A}. How can you adapt
the algorithm now, and what is its regret?
Hint You will need to choose a new exploration distribution in every round.
Otherwise everything is more or less the same.
27.6 (Covering numbers for convex sets) For K ⊂ Rd let kxkK =
supy∈K |hx, yi|. Let A ⊂ Rd and L = {y : kykA ≤ 1}. Let N (A, ε) be the size of
the smallest subset C ⊆ A such that minx0 ∈C kx − x0 kL ≤ ε for all x ∈ A. Show
the following:
(a) When A = x ∈ Rd : kxkV −1 ≤ 1 , we have N (A, ε) ≤ (3/ε)d .
(b) When A is convex, bounded and span(A) = Rd we have N (A, ε) ≤ (3d/ε)d .
(c) For any bounded A ⊂ Rd we have N (A, ε) ≤ (6d/ε)d .
Hint For the first part, find a linear map from A to the Euclidean ball and
use the fact that the Euclidean ball can be covered with a set of size (3/ε)d . For
the second part use the fact that for any symmetric, convex and compact set K
there exists an ellipsoid E = {x : kxkV ≤ 1} such that E ⊆ K ⊆ dE.
27.7 (Low rank action sets (i)) In the definition of the algorithm and the
proof of Theorem 27.1, we assumed that A spans Rd and that it has positive
Lebesgue measure. Show that this assumption may be relaxed by carefully
adapting the algorithm and analysis.
27.10 (Low rank action sets (ii)) In the definition of the algorithm and
the proof of Theorem 27.3, we assumed that A spans Rd and that it has positive
Lebesgue measure. Show that this assumption may be relaxed by carefully
adapting the algorithm and analysis.
In the last chapter, we showed that if A ⊂ Rd has k elements, then the regret of
Exp3 with a careful exploration distribution has regret
p
Rn = O( dn log(k)) .
We also showed the continuous version of this algorithm has regret at most
p
Rn = O(d n log(n)) .
Although this algorithm can often be made to run in polynomial time, the degree
tends to be high and the implementation complicated, making the algorithm
impractical. In many cases this can be improved, both in terms of the regret and
computation. In this chapter we demonstrate this in the case when A is the unit
ball by showing that for this case pthere is an efficient, low-complexity algorithm
for which the regret is Rn = O( dn log(n)). More importantly, however, we
introduce a pair of related algorithms called follow-the-regularised-leader and
mirror descent, which are powerful tools for the design and analysis of bandit
algorithms. In fact, the exponential weights algorithm turns out to be a special case.
and the regret is Rn = maxa∈A Rn (a). We emphasise that the only difference
relative to the adversarial linear bandit is that now yt is observed rather than
hat , yt i. Actions are not capitalised in this section because the algorithms presented
here do not randomise.
Mirror descent
The basic version of mirror descent has two extra parameters beyond n and A. A
learning rate η > 0 and a convex function F : Rd → R̄ with domain D = dom(F ).
Usually F will be Legendre. The function F is called a potential function or
regulariser. In the first round, mirror descent predicts
Subsequently it predicts
Follow-the-Regularised-Leader
Like mirror descent, follow-the-regularised-leader depends on a convex potential
F with domain D = dom(F ) and predicts a1 = argmina∈A F (a). In subsequent
rounds t ∈ [n], the predictions are
t
!
X
at+1 = argmina∈A η ha, ys i + F (a) . (28.3)
s=1
The intuition is that the algorithm chooses at+1 to be the action that performed
best in hindsight with respect to the regularised loss. Again, the definition of
follow-the-regularised-leader implicitly assumes that (at )nt=1 are well-defined. As
for mirror descent, the regularisation serves to stabilise the algorithm, which
turns out to be a key property of good algorithms for online linear prediction.
Now mirror descent chooses at+1 to minimise Φt . The reader should check that
the assumption that F is Legendre on domain D ⊆ A implies that the minimiser
occurs in the interior of D ⊆ A and that ∇Φt (at+1 ) = 0 (see Exercise 28.1). This
means that ηyt = ∇F (at ) − ∇F (at+1 ), and so
t
X t
X
∇F (at+1 ) = −ηyt + ∇F (at ) = ∇F (a1 ) − η ys = −η ys ,
s=1 s=1
The last two displays and the fact that the gradient for Legendre functions is
invertible shows that mirror descent and follow-the-regularised-leader are the
same in this setting.
at+1,i =P P . (28.5)
exp
d t
j=1 −η s=1 ysj
Then the solution to Eq. (28.2) can be found using the following two-step
procedure:
Eq. (28.6) means the first optimisation problem can be evaluated explicitly as
the solution to
All potentials and losses that appear in positive results in this book guarantee
that mirror descent (and also follow-the-regularised-leader) are well defined,
and that condition in Eq. (28.6) holds.
The two-step implementation of mirror descent also explains its name. The
update in round t can be seen as transforming the action at ∈ A into the ‘mirror’
(dual) space using ∇F , where it is combined with the most recent (scaled) loss
ηyt . Then ∇F −1 is used to transform the updated vector back to the original
(primal) space. The function ∇F is called the mirror map.
Although mirror descent and follow-the-regularised-leader are not the same, the
bounds presented here are identical. The theorem for mirror descent has two
parts, the first of which is a little stronger than the second. To minimise clutter,
we abbreviate DF by D.
Theorem 28.4 (Mirror descent regret bound). Let η > 0 and F be Legendre with
domain D and A ⊂ Rd be a non-empty convex set with int(dom(F )) ∩ A = 6 ∅. Let
a1 , . . . , an+1 be the actions chosen by mirror descent, which are assumed to be
well-defined. Then, for any a ∈ A, the regret of mirror descent is bounded by
F (a) − F (a1 ) X 1X
n n
Rn (a) ≤ + hat − at+1 , yt i − D(at+1 , at ) .
η t=1
η t=1
Furthermore, suppose that Eq. (28.6) holds and ã2 , ã3 , . . . , ãn+1 are given by
28.2 Regret Analysis 331
−ηy1
∇F −1
a2
∇F
−ηy2
ΠA,F
a3 ∇F −1
∇F −ηy3
−ηy3
afollow-the-regularised-leader
4
∆F −1
∆F −1
aMD
4
∆F
Proof Fix a ∈ A. The result trivially holds when a 6∈ D. Hence, we assume that
a ∈ D. For the first part of the claim, we split the inner product:
In Exercise 28.1, you will show that at ∈ int(dom(F )), and hence the Bregman
divergence D(b, at ) = F (b) − F (at ) − hb − at , ∇F (at )i for any b ∈ dom(F ). By
definition, at+1 = argminb∈A ηhb, yt i + D(b, at ). Hence, the first-order optimality
conditions for at+1 (Proposition 26.14) show that
1X
Xn n
≤ hat − at+1 , yt i + (D(a, at ) − D(a, at+1 ) − D(at+1 , at ))
t=1
η t=1
!
1
Xn n
X
= hat − at+1 , yt i + D(a, a1 ) − D(a, an+1 ) − D(at+1 , at )
t=1
η t=1
F (a) − F (a1 ) 1 X
Xn n
≤ hat − at+1 , yt i + − D(at+1 , at ) , (28.10)
t=1
η η t=1
where the final inequality follows from the fact that D(a, an+1 ) ≥ 0 and
D(a, a1 ) ≤ F (a) − F (a1 ), the latter of which is true by the first-order optimality
conditions for a1 = argminb∈A F (b). To see the second part, note that
1
hat − at+1 , yt i = hat − at+1 , ∇F (at ) − ∇F (ãt+1 )i
η
1
= (D(at+1 , at ) + D(at , ãt+1 ) − D(at+1 , ãt+1 ))
η
1
≤ (D(at+1 , at ) + D(at , ãt+1 )) .
η
The result follows by substituting this into Eq. (28.10).
The assumption that a1 minimises the potential was only used to bound
D(a, a1 ) ≤ F (a) − F (a1 ). For a different initialisation, the following bound
still holds:
!
1 X n
Rn (a) ≤ D(a, a1 ) + D(at , ãt+1 ) . (28.11)
η t=1
As we shall see in Chapter 31, this is useful when using mirror descent to
analyse non-stationary bandits.
We now give two applications of the regret bound of Theorem 28.4 for mirror
descent. The same results would hold for the same problems for FTRL just in this
case we would need to use Theorem 28.5. Let diamF (A) = maxa,b∈A F (a) − F (b)
be the diameter of A with respect to F .
Proposition 28.6 (Regret on the unit ball). Let A = B2d = {a ∈ Rd : kak2 ≤ 1}
be the standard unit ball and assume
pyt ∈ B2 for all t. Then mirror descent with
d
1 2
potential F (a) = 2 kak2 and η = 1/n is well defined and its regret satisfies
√
Rn ≤ n.
Proof That mirror descent is well defined follows by a direct calculation (cf.
Example 28.1). By Eq. (28.9), we have ãt+1 = at − ηyt so
1 η2
D(at , ãt+1 ) = kãt+1 − at k22 = kyt k22 .
2 2
Therefore, since diamF (A) = 1/2 and kyt k2 ≤ 1 for all t,
diamF (A) η X 1
n
ηn √
Rn ≤ + kyt k22 ≤ + = n.
η 2 t=1 2η 2
F (a) − F (a1 ) X 1X
n n
Rn (a) ≤ + hat − at+1 , yt i − D(at+1 , at )
η t=1
η t=1
log(d) X 1X1
n n
≤ + kat − at+1 k1 kyt k∞ − kat − at+1 k21
η t=1
η t=1
2
log(d) η X log(d) ηn p
n
≤ + kyt k2∞ ≤ + = 2n log(d) ,
η 2 t=1 η 2
where the first inequality follows from Theorem 28.4, the second from Pinsker’s
inequality and the facts that diamF (A) = log(d). In the third inequality, we
used ‘optimise to bound’. In particular, we used that for any a ∈ R and b > 0,
maxx∈R ax − bx2 /2 = a2 /(2b). The last inequality follows from the assumption
that kyt k∞ ≤ 1.
28.3 Application to Linear Bandits 334
The last few steps in the above proof are so routine that we summarise their
use in a corollary, the proof of which we leave to the reader (Exercise 28.6).
diamF (A) η X
n
Rn ≤ + kyt k2t∗ ,
η 2 t=1
It often happens that the easiest way to bound the regret of mirror descent is to
find a norm that satisfies the conditions of Corollary 28.8. Often, Theorem 26.13
provides a good approach.
1
n
ηX
Rn ≤ + kyt k22 .
2η 2 t=1
2
But now we√ see that kyt k2 canpbe as large as d, and tuning η would lead to a
rate of O( nd) rather than O( n log(d)).
Both Theorems 28.4 and 28.5 were presented for the oblivious case where
(yt )nt=1 are chosen in advance. This assumption was not used, however, and in
fact the bounds continue to hold when yt is chosen strategically as a function
of a1 , y1 , . . . , yt−1 , at . This is analogous to how the basic regret bound for
exponential weights continues to hold in the face of strategic losses. But
be cautioned, this result does not carry immediately to the application of
mirror descent to bandits, as discussed at the end in Note 9.
relative to action a ∈ A is
" n #
X
Rn (a) = E hAt − a, yt i .
t=1
The regret is Rn = maxa∈A Rn (a). The application of mirror descent and follow-
the-regularised-leader to linear bandits is straightforward. The only difficulty
is that the learner does not observe yt but instead hAt , yt i. The solution is to
replace yt with an estimator, which is typically some kind of importance-weighted
estimator as in the previous chapter. Because estimation of yt is only possible
using randomisation, the algorithm cannot play the suggested action of mirror
descent, but instead plays a distribution over actions with the same mean as the
proposed action. This is often necessary anyway, when A is not convex. Since
the losses are linear, the expected additional regret by playing according to the
distribution vanishes. The algorithm is summarised in Algorithm 16. We have
switched to capital letters because the actions are now randomised.
Furthermore, letting
Ãt+1 = argmina∈dom(F ) ηha, Ŷt i + DF (a, Āt )
and assuming that −η Ŷt + ∇F (a) ∈ ∇F (dom(F )) for all a ∈ A almost surely,
the regret of the mirror descent variation satisfies
diamF (A) 1 X
n
Rn ≤ + E D(Āt , Ãt+1 ) .
η η t=1
Proof Using the definition of the algorithm and the assumption that Ŷt is
unbiased given Āt and that Pt has mean Āt leads to
h h ii
E [hAt , yt i] = E hĀt , yt i = E E hĀt , yt i | Āt = E E hĀt , Ŷt i Āt ,
that says these theorems continue to hold even for recursively constructed loss
sequences.
In other words, Et = 1 indicates that the algorithm explores, which happens with
probability 1 − kĀt k. Clearly, Et [At ] = Āt . (The sampling distribution Pt is just
the law of At given the past, which remains implicit.) For the estimator we use a
variant of the importance-weighted estimator from the last chapter:
d Et At hAt , yt i
Ŷt = . (28.12)
1 − kĀt k
The reader can check for themself that this estimator is unbiased. Next, we
inspect the contents of our magician’s hat and select the potential
Theorem 28.11. Assume that (yt )nt=1 are a sequence of losses such that kyt k2 ≤ 1
for all t. Suppose that Algorithm 16 is run using the sampling rule, estimator
and potential as describedp above, shrunken action set à with r = 1 − 2ηd where
the learning rate is η = p log(n)/(3dn). Then, the algorithm is well defined and
its regret satisfies Rn ≤ 2 3nd log(n).
28.4 Linear Bandits on the Unit Ball 338
You might notice that in some regimes this is smaller than the lower bound
for stochastic linear bandits (Theorem 24.2). There is no contradiction
because the adversarial and stochastic linear bandit models are actually
quite different. More details are in Chapter 29.
Proof That the algorithm is well defined follows because à is compact. Let
Pn
a∗ = argmina∈A t=1 ha, yt i be the optimal action. Then
" n # n
X X
Rn = E hAt − ra , yt i +
∗
hra∗ − a∗ , yt i ≤ Rn (ra∗ ) + (1 − r)n ,
t=1 t=1
where Zt ∈ [Āt , Āt+1 ] lies on the chord connecting Āt and Āt+1 . The algorithm is
stable in the sense that no matter how the losses are chosen, Āt+1 cannot
be too far from Āt . This also means that Zt is close to Āt . By definition,
ηkŶt k ≤ ηd/(1 − r) = 1/2. Combining this with Eq. (28.13) shows that
1 − kZt k 1 − kαĀt + (1 − α)Āt+1 k 1 − kĀt+1 k
≤ sup = max 1,
1 − kĀt k α∈[0,1] 1 − kĀt k 1 − kĀt k
( ) ( )
1 + ηkL̂t−1 k 1 + ηkL̂t−1 k
≤ max 1, ≤ max 1, ≤ 2.
1 + ηkL̂t k 1/2 + ηkL̂t−1 k
Here, the second inequality is proved by noting that if the maximum is not one,
kĀt+1 k < kĀt k. The next step is to find the Hessian of F , which is
I aa> I
∇2 F (a) = + .
1 − kak kak(1 − kak)2 1 − kak
Therefore, (∇2 F (a))−1 ≤ (1 − kak)I, and so
h i h i
2 2 2 (1 − kZt k)Et hUt , yt i2
E kŶt k(∇2 F (Zt ))−1 ≤ E (1 − kZt k)kŶt k = d E ≤ 2d .
(1 − kĀt k)2
The diameter satisfies diamF (Ã) ≤ log(1/(1 − r)), and hence
1 1
Rn ≤ (1 − r)n + log + ηnd
η 1−r
1 1
= log + 3ηnd
η 2ηd
p
≤ 2 3nd log(n) ,
where the last two relations follow from the choices of r and η, respectively.
28.5 Notes 339
28.5 Notes
1 Our assumptions on the potential and action set in the analysis of mirror
descent (Theorem 28.4) can be relaxed significantly. What is important is
that F is convex and the directional derivative v 7→ ∇v F (x) is linear for
all values for which it exists. Our assumptions are chosen to ensure that
at ∈ int(dom(F )), which for Legendre F means that ∇F (at ) exists, and hence
∇v F (at ) = hv, ∇F (at )i is linear. A comprehensive examination of various
generalisations is given by Joulani et al. [2017]. For follow-the-regularised-
leader, convexity of F suffices, as you will show using directional derivatives in
Exercise 28.5.
2 Finding at+1 for both mirror descent and follow-the-regularised-leader requires
solving a convex optimisation problem. Provided the dimension is not too
large and the action set and potential are reasonably nice, there exist practical
approximation algorithms for this problem. The two-step process described in
Eqs. (28.7) and (28.8) is sometimes an easier way to go. Usually (28.7) can
be solved analytically, while (28.8) can be quite expensive. In some important
special cases, however, the projection step can be written in closed form or
efficiently approximated.
3 We saw that follow-the-regularised-leader
p with a carefully chosen potential
function achieves O( dn log(n)) regret on the `2 -ball. On the `∞ ball
√
(hypercube), the optimal regret is O(d n). Interestingly, as n tends to infinity
the optimal dependence on the dimension for A = Bpd = {x ∈ Rd : kxkp ≤ 1}
√
with p ≥ 1 is either d or d with a complete classification given by Bubeck
et al. [2018].
4 Adversarial linear bandits with A = Pk−1 are essentially equivalent to k-armed
adversarial bandits.
√ There exists a potential such that the resulting algorithm
satisfies Rn = O( kn), which matches the lower bound up to constant factors
√
and shaves a factor of log k from the upper bounds presented in Chapters 11
and 12. For more details, see Exercise 28.15.
5 Most of the bounds proven for adversarial bandits have a worst-case flavour.
The tools in this chapter can often be applied to prove adaptive bounds. In
Exercise 28.14, you will analyse a simple algorithm for k-armed adversarial
28.5 Notes 340
Bounds of this kind are called first-order bounds [Allenberg et al., 2006,
Abernethy et al., 2012, Neu, 2015b, Wei and Luo, 2018]. The log(n/k) term
can be improved to log(k) using a more sophisticated algorithm/analysis.
6 Both mirror descent and follow-the-regularised-leader depend on the potential
function. Currently there is no characterisation of exactly what this potential
should be or how to find it. At least in the full information setting, there are
quite general universality results showing that if a certain regret is achievable
by some algorithm, then that same regret is nearly achievable by mirror descent
with some potential [Srebro et al., 2011]. In practice this result is not useful for
constructing new potential functions, however. There have been some attempts
to develop ‘universal’ potential functions that exhibit nice behaviour for any
action sets [Bubeck et al., 2015b, and others]. These can be useful, but as yet
we do not know precisely what properties are crucial, especially in the bandit
case.
7 When the horizon is unknown, the learning rate cannot be tuned ahead of time.
One option is to apply the doubling trick. A more elegant solution is to use a
decreasing schedule of learning rates. This requires an adaptation of the proofs
of Theorems 28.4 and 28.5, which we outline in Exercises 28.11 and 28.12. This
is one situation where mirror descent and follow-the-regularised-leader are not
the same and where the latter algorithm is usually to be preferred.
8 In much of the literature the potential is chosen in such a way that mirror descent
and follow-the-regularised-leader are the same algorithm. For historical reasons,
the name mirror descent is more commonly used in the bandit community.
Unfortunately ‘mirror descent’ is often used, sometimes with qualifiers, when
the algorithm being analysed is actually follow-the-regularised-leader. This is
confusing and makes it hard to identify for which algorithm the results actually
hold. Naming aside, we encourage the reader to keep both algorithms in mind,
since the analysis of one or the other can sometimes be slightly easier.
9 Mirror descent and follow-the-regularised-leader are used as modules for
converting loss sequences to distributions. Since these losses depend on past
actions, it is crucial that both algorithm are well-behaved in the full-information
setting when the losses are chosen non-obliviously. This does not translate to
Pn
the bandit setting for a subtle reason. Let R̂n (a) = t=1 hAt − a, yt i be the
random regret so that
" n n
#
X X
Rn = E max R̂n (a) = E hAt , yt i − min ha, yt i .
a∈A a∈A
t=1 t=1
The second sum is constant when the losses are oblivious, which means the
maximum can be brought outside the expectation, which is not true if the loss
28.5 Notes 341
vectors are non-oblivious. It is still possible to bound the expected loss relative
to a fixed comparator a so that
" n #
X
Rn (a) = E hAt − a, yt i ≤ B ,
t=1
where B is whatever bound obtained from the analysis presented above. Using
maxa R̂n (a) ≤ maxa R̂n (a) − Rn (a) + maxa Rn (a) shows that
Rn = E max R̂n (a) ≤ B + E max R̂n (a) − Rn (a) .
a∈A a∈A
The second term on the right-hand side can be bounded using tools from
√
empirical process theory, but the resulting bound is O( n) only if V[R̂n (a)] =
O(n). In general, however, the variance can be much larger (for an example, see
Exercise 11.6). We emphasise again that the non-oblivious regret is a strange
measure because it does not capture the reactive nature of the environment.
The details of the application of empirical process theory is beyond the scope
of this book. For an introduction to that topic, we recommend the books by
van der Vaart and Wellner [1996], van de Geer [2000], Boucheron et al. [2013]
and Dudley [2014]. p
10 The price of bandit information on the unit ball is an extra d log(n) (compare
Proposition 28.6 and Theorem 28.11). Except for log factors, this is also true for
the simplex (Proposition
√ 28.7 and Note 4). One might wonder if the difference
is always about d, but this is not true. The price of bandit information can
be as high as Θ(d). Overall the dimension dependence in the regret in terms of
the action set is still not well understood except for special cases.
11 The poor behaviour of follow-the-leader in the full information setting depends
on (a) the environment being adversarial rather than stochastic and (b) the
action set having sharp corners. When either of these factors is missing, follow-
the-leader is a reasonable choice [Huang et al., 2017b]. Note that with bandit
feedback, the failure is primarily due to a lack of exploration (Exercises 4.11
and 4.12).
12 A generalisation of online linear optimisation is online convex optimisation,
where the adversary secretly chooses a sequence of convex functions f1 , . . . , fn .
In each round the learner chooses at ∈ A and observes the entire function ft .
As usual, the regret is relative to a ∈ A is
n
X
Rn (a) = ft (at ) − ft (a) .
t=1
One way to tackle this problem is to linearise the loss functions. Let
yt = ∇ft (at ). Then, by convexity of the loss functions,
n
X
Rn (a) ≤ hat − a, yt i ,
t=1
which shows that an algorithm for online linear optimisation can be used to
28.6 Bibliographic Remarks 342
analyse the more general case. Now look again at Example 28.1 and notice
that online mirror descent with a quadratic potential and linearised losses is
really the same gradient descent we know and love. Online convex optimisation
is a rich topic by itself. We refer the interested reader to the books by Shalev-
Shwartz [2012] and Hazan [2016].
13 There is a nice application of online linear optimisation to minimax theorems.
Let X and Y be arbitrary sets. For any function f : X × Y → R,
Theorem 28.12 (Sion’s minimax theorem). Suppose that X and Y are convex
subsets of linear topological spaces with at least one of X or Y compact. Let
f : X × Y → R be a function such that f (·, y) is lower semi-continuous and
quasi-convex for all y ∈ Y and f (x, ·) is upper semi-continuous and quasi-
concave for all x ∈ X. Then
There is a short topological proof of this theorem [Komiya, 1988]. You will
use the tools of online linear optimisation to analyse two special cases in
Exercise 28.16. When X and Y are probability simplexes and f is linear, the
resulting theorem is von Neumann’s minimax theorem [von Neumann, 1928].
The minimax theorems form a bridge between minimax adversarial regret and
Bayesian regret, which we discuss in Chapters 34 and 36.
14 Let X be a subset of a linear topological space and f : X → R. The function f
is quasi-convex if f −1 ((−∞, a)) is convex for all a ∈ R and quasi-concave
if −f is quasi-convex. f is upper semi-continuous if for all x ∈ X and
ε > 0 there exists a neighborhood U of x such that f (y) ≤ f (x) + ε for all
y ∈ U . It is lower semi-continuous if for all x ∈ X and ε > 0 there exists a
neighborhood U of x such that f (y) ≥ f (x) − ε for all y ∈ U .
The results in this chapter come from a wide variety of sources. The online convex
optimisation framework was popularised by Zinkevich [2003]. The framework
has been briefly considered by Warmuth and Jagota [1997], then reintroduced
by Gordon [1999] (without noticing the earlier work of Warmuth and Jagota).
While the framework was introduced relatively recently, the core ideas have
been worked out earlier in the special case of linear prediction with nonlinear
28.7 Exercises 343
losses (the book of Cesa-Bianchi and Lugosi [2006] can be used as a reference
to this literature). Mirror descent was first developed by Nemirovsky [1979] and
Nemirovsky and Yudin [1983] for classical optimisation. In statistical learning,
follow-the-regularised-leader is known as regularised risk minimisation and
has a long history. In the context of online learning, Gordon [1999] considered
follow-the-regularised-leader and called it ‘generalised gradient descent’. The name
seems to originate from the work of Shalev-Shwartz [2007] and Shalev-Shwartz and
Singer [2007]. An implicit form of regularisation is to add a perturbation of the
losses, leading to the ‘follow-the-perturbed-leader’ algorithm [Hannan, 1957, Kalai
and Vempala, 2002], which is further explored in the context of combinatorial
bandit problems in Chapter 30 (and see also Exercise 11.7). Readers interested in
an overview of online learning will like the short books by Shalev-Shwartz [2012]
and Hazan [2016], while the book by Cesa-Bianchi and Lugosi [2006] has a little
more depth (but is also older). As far as we know, the first explicit application of
mirror descent to bandits was by Abernethy et al. [2008]. Since then the idea has
been used extensively, with some examples by Audibert et al. [2013], Abernethy
et al. [2015], Bubeck et al. [2018] and Wei and Luo [2018]. Mirror descent has
been adapted in a generic way to prove high-probability bounds by Abernethy
and Rakhlin [2009]. The reader can find (slightly) different proofs of some mirror
descent results in the book by Bubeck and Cesa-Bianchi [2012]. The results for
the unit ball are from a paper by Bubeck et al. [2012], but we have reworked
the proof to be more in line with the rest of the book. Mirror descent can be
generalised to Banach spaces. For details, see the article by Sridharan and Tewari
[2010].
28.7 Exercises
28.3 Prove the correctness of the two-step procedure described in Section 28.1.1.
28.4 (Linear regret for follow-the-leader) Let A = [−1, 1], and let
y1 = 1/2 and ys = 1 for odd s > 1 and ys = −1 for even s > 1.
28.7 Exercises 344
28.10 (Exp3 as mirror descent (ii)) Here you will show that the tools in
this chapter not only lead to the same algorithm, but also the same bounds.
(a) Let P̃t+1 = argminp∈[0,∞)k ηhp, Ŷt i + DF (p, Pt ). Show both relations in the
following display:
k
X η2 Xk
DF (Pt , P̃t+1 ) = Pti exp(−η Ŷti ) − 1 + η Ŷti ≤ Pti Ŷti2 .
i=1
2 i=1
" n #
1 X ηnk
(b) Show that E DF (Pt , P̃t+1 ) ≤ .
η t=1
2
(c) Show that diamF (Pk−1 ) = log(k).
(d) Conclude that for appropriately tuned η > 0, the regret of Exp3 satisfies,
p
Rn ≤ 2nk log(k) .
A ∩ int(D) non-empty and assume that Eq. (28.6) holds. Let η0 , η1 , . . . , ηn > 0,
a1 , a2 , . . . , an+1 ∈ A and ã2 , . . . , ãn+1 be sequences so that η0 = ∞, a1 =
argmina∈A F (a) and
where (ηt )∞
t=1 is an infinite sequence of learning rates and Ŷti = I {At = i} yti /Pti
and At is sampled from Pt .
(a) Let A = Pk−1 be the simplex, F be the unnormalised negentropy potential,
Pt−1
Ft (p) = F (p)/ηt and Φt (p) = F (p)/ηt + s=1 hp, Ŷs i. Show that Pt is the
choice of follow-the-regularised-leader with potentials (Ft )nt=1 and losses
(Ŷt )nt=1 .
(b) Assume that (ηt )nt=1 are decreasing and then use Exercise 28.12 to show that
" n #
log(k) X DF (Pt+1 , Pt )
Rn ≤ +E hPt − Pt+1 , Ŷt i − .
ηn t=1
ηt
(c) Use Theorem 26.13 in combination with the facts that Ŷti ≥ 0 for all i and
Ŷti = 0 unless At = i to show that
DF (Pt+1 , Pt ) ηt
hPt − Pt+1 , Ŷt i − ≤ .
ηt 2PtAt
log(k) k X
n
(d) Prove that Rn ≤ + ηt .
ηn 2 t=1
p
(e) Choose (ηt )∞
t=1 so that Rn ≤ 2 nk log(k) for all n ≥ 1.
28.14 (The log barrier and first-order bounds) Your mission in this
exercise is to prove first-order bounds for finite-armed bandits as studied in
Chapter 11. The notation is the same as the previous exercise. Let (yt )nt=1 be
Pk
a sequence of loss vectors with yt ∈ [0, 1]k for all t and F (a) = − i=1 log(ai ).
Consider the instance of follow-the-regularised-leader for bandits that samples
At from Pt defined by
t−1
X
Pt = argminp∈Pk−1 ηt hp, Ŷs i + F (p) .
s=1
(a) Show a particular, non-anticipating choice of the learning rates (ηt )nt=1 so
that
v "n−1 #!
u
u X n∨k
t
Rn ≤ k + 2 k 1 + E 2
ytAt log . (28.14)
t=1
k
(b) Prove that any algorithm satisfying Eq. (28.14) also satisfies
v !
u n
n∨k u X n∨k
Rn ≤ k + k log + C tk 1 + min yta log ,
k a∈[k]
t=1
k
28.7 Exercises 347
Hint For choosing the learning rate, you might take inspiration from
Theorem 18.3.
where Ŷsi = I {As = i} ysi /Psi is the importance-weighted estimator of ysi and
η > 0 is the learning rate.
28.16 (Minimax theorem) In this exercise you will prove simplified versions
of Sion’s minimax theorem.
(a) Use the tools from online linear optimisation to prove Sion’s minimax theorem
when X = Pk−1 and Y = Pj−1 and f (x, y) = x> Gy for some G ∈ Rk×j .
(b) Generalise your result to the case when X and Y are non-empty, convex,
compact subsets of Rd and f : X × Y → R is convex/concave and has
bounded gradients.
Hint Consider a repeated simultaneous game where the first player chooses
(xt )∞
t=1 and the second player chooses (yt )t=1 . The loss in round t to the first
∞
player is f (xt , yt ), and the loss to the second player is −f (xt , yt ). See what
Pn Pn
happens to the average iterates x̄n = n1 t=1 xt and ȳn = n1 t=1 yt when (xt )
and (yt ) are chosen by (appropriate) regret-minimising algorithms. For the second
part, see Note 12. Also observe that there is nothing fundamental about X and
Y both having dimension d.
28.17 (Counterexample to Sion without compactness) Find examples
of X, Y and f that satisfy the conditions of Sion’s theorem except that neither
X nor Y are compact and where the statement does not hold. Can you choose f
to be bounded?
29 The Relation between Adversarial
and Stochastic Linear Bandits
The purpose of this chapter is to highlight some of the differences and connections
between adversarial and stochastic linear bandits. As it turns out, the connection
between these are not as straightforward as for finite-armed bandits. We focus
on three topics:
(a) For fixed action sets, there is a reduction from stochastic linear bandits to
adversarial linear bandits. This does not come entirely for free. The action
set needs to be augmented for things to work (Section 29.2).
(b) The adversarial and stochastic settings make different assumptions about
the variability of the losses/rewards. This will explain the apparently
contradictory
p result that the upper bound for adversarial bandits on the unit
ball is O( dn log(n)) (Theorem 28.11), while the lower bound for stochastic
√
bandits also for the unit ball is Ω(d n) (Theorem 24.2).
(c) When the action set is changing, the notion of regret in the adversarial
setting must be carefully chosen, and for the ‘right’ choice, we do not yet
have effective algorithms (Section 29.4).
on Ft−1 . The expected regret for the two cases are defined as follows:
n
X
Rn = E [hAt , θi] − n inf ha, θi , (Stochastic setting)
a∈A
t=1
Xn
Rn = E [hAt , θt i] − n inf ha, θ̄n i . (Adversarial setting)
a∈A
t=1
1
Pn
In the last display, θ̄n = n t=1 θt is the average of the loss vectors chosen by
the adversary.
To formalise the intuition that adversarial environments are harder than stochastic
environments, one may try to find a reduction where learning in the stochastic
setting is reduced to learning in the adversarial setting. Here, reducing problem
E (‘easy’) to problem H (‘hard’) just means that we can use algorithms designed
for problem H to solve instances of problem E. In order to do this, we need to
transform instances of problem E into instances of problem H and translate back
the actions of algorithms designed for H to actions for problem E. To get a regret
bound for problem E from a regret bound for problem H, one needs to ensure
that the losses translate properly between the problem classes.
Of course, based on our previous discussion, we know that if there is a reduction
from stochastic linear bandits to adversarial linear bandits, then somehow the
adversarial problem must change so that no contradiction is created in the curious
case of the unit ball. To be able to use an adversarial algorithm in the stochastic
environment, we need to specify a sequence (θt )t so that the adversarial feedback
matches the stochastic one. Comparing Eq. (29.1) and Eq. (29.2), we can see
that the crux of the problem is incorporating the noise ηt into θt while satisfying
the other requirements. One simple way of doing this is by introducing an extra
dimension for the adversarial problem.
In particular, suppose that the stochastic problem is d-dimensional so that
A ⊂ Rd . For the sake of simplicity, assume furthermore that the noise and
parameter vector satisfy |ha, θi + ηt | ≤ 1 almost surely for all a ∈ A and that
a∗ = argmina∈A ha, θi exists. Then define Aaug = {(a, 1) : a ∈ A} ⊂ Rd+1 and let
the adversary choose θt = (θ, ηt ) ∈ Rd+1 . Here, we slightly abuse notation: for
x ∈ Rd and y ∈ R, we use (x, y) to denote the d + 1 dimensional vector whose
first d components are those of x and whose last component is y. The reduction
is now straightforward: for t = 1, 2, . . . , do the following:
4 Feed Yt to the adversarial bandit policy, increment t and repeat from step 2.
Let a0∗ = (a∗ , 1). Note that for any a = (a, 1) ∈ Aaug , hAt , θi − ha, θi =
0
hAt , θt i − ha0 , θt i and thus adversarial regret, and eventually Bn , will upper
0
t=1 t=1
are unit balls. It does not seem like this should make much difference, but at
√
least in the case of the ball, from our Ω(d n) lower bound on the regret for the
stochastic case, we see that the changed geometry must make the adversary more
powerful. This reinforces the importance of the geometry of the action set, which
we have already seen in the previous chapter.
While the reduction shows one way to use adversarial algorithms in stochastic
environments, the story seems to be unfinished. When facing a linear bandit
problem with some action set A, the user is forced to decide whether or not
the environment is stochastic. Strangely enough, for stochastic environments the
recommendation is to run your favorite adversarial linear bandit algorithm on the
augmented action set. What if the environment may or may not be stochastic?
One can still run the adversarial linear bandit algorithm on the original action
set. This usually works, but the algorithm may need to be tuned differently
(Exercises 29.2 and 29.3).
The real reason for all these discrepancies is that the adversarial linear bandit
model is better viewed as relaxation of another class of stochastic linear bandits.
Rather than assuming the noise is added after taking an inner product, assume that
(θt )nt=1 is a sequence of vectors sampled independently from a fixed distribution
ν on Rd . The resulting model is called a stochastic linear bandit with
parameter noise. This new problem can be trivially reduced to adversarial
29.3 Stochastic Linear Bandits with Parameter Noise 352
Combining the stochastic linear bandits with parameter noise model with the
techniques in Chapter 24 is the standard method for proving lower bounds
for adversarial linear bandits.
further emphasises the importance of the assumptions that restrict the choices of
the adversary.
The best way to think about the standard adversarial linear model is that it
generalises the stochastic linear bandit with parameter noise. Linear bandits
with parameter noise are sometimes easier than the standard model because
parameter noise limits the adversary’s control of the signal-to-noise ratio
experienced by the learner.
In practical applications the action set is usually changing from round to round.
Although it is possible to prove bounds for adversarial linear bandits with changing
action sets, the notion of regret makes the results less meaningful than what
one obtains in the stochastic setting. Suppose that (At )nt=1 are a sequence of
action sets. In the stochastic setting, the actions (At )t selected by the LinUCB
algorithm satisfy
" n #
X √
E hAt − a∗t , θi = Õ(d n) ,
t=1
where a∗t = argmaxa∈At ha, θi is the optimal action in round t. This definition of
the regret measures the right thing: the action a∗t really is the optimal action in
round t. The analogous result for adversarial bandits would be a bound on
" n #
X
Rn (Θ) = max E hAt − at (θ), yt i , (29.4)
θ∈Θ
t=1
29.5 Notes
1 For the reduction in Section 29.2, we assumed that |Yt | ≤ 1 almost surely.
This is not true for many classical noise models like the Gaussian. One way to
overcome this annoyance is to apply the adversarial analysis on the event that
|Yt | ≤ C for some constant C > 0 that is sufficiently large that the probability
that this event occurs is high. For example, if ηt is p a standard Gaussian and
supa∈A |ha, θi| ≤ 1, then C may be chosen to be 1 + 4 log(n), and the failure
event that there exists a t such that |hAt , θi + ηt | ≥ C has probability at most
1/n by Theorem 5.3 and a union bound.
2 The mirror descent analysis of adversarial linear bandits also works for
stochastic bandits. Recall that mirror descent samples At from a distribution
with a conditional mean of Āt , and suppose that θ̂t is a conditionally unbiased
estimator of θ. Then the regret for a stochastic linear bandit with optimal
action a∗ can be rewritten as
" n # " n # " n #
X X X
Rn = E hAt − a∗ , θi = E hĀt − a∗ , θi = E hĀt − a∗ , θ̂t i ,
t=1 t=1 t=1
Yt = `(At ) + ηt ,
Linear bandits on the sphere with parameter noise have been studied by Carpentier
and Munos [2012]. However they consider the case where the action set is the
sphere and the components of the noise are independent so that the reward is
Xt = hAt , θ + ηt i where the coordinates of ηt ∈ Rd are independent with unit
Pd
variance. In this case, the predictable variation is V[Xt | At ] = i=1 A2ti = 1 for
all actions At and the parameter noise is equivalent to the standard model. We
are not aware of any systematic studies of parameter noise in the stochastic
setting. With only a few exceptions, the impact on the regret of the action set
and adversary’s choices is not well understood beyond the case where A is an
`p -ball, which has been mentioned in the previous section. A variety of lower
bounds illustrating the complications are given by Shamir
√ [2015]. Perhaps the
most informative is the observation that obtaining O( dn) regret is not possible
when A = {a + x : kxk2 ≤ 1} is a shifted unit ball with a = (2, 0, . . . , 0), which
also follows from our reduction in Section 29.2.
29.7 Exercises
Hint Repeat the analysis in the proof of Theorem 28.11, update the learning
rate and check the bounds on the norm of the estimators.
29.3 (Follow-the-regularised-leader for stochastic bandits (ii))
Repeat the previous exercise using exponential weights or continuous exponential
weights with Kiefer–Wolfowitz exploration where
In the penultimate part, we collect a few topics to which we could not dedicate
a whole part. When deciding what to include, we balanced our subjective views
on what is important, pedagogical and sufficiently well understood for a book.
Of course we have played favourites with our choices and hope the reader can
forgive us for the omissions. We spend the rest of this intro outlining some of the
omitted topics.
Continuous-Armed Bandits
There is a small literature on bandits where the number of actions is infinitely
large. We covered the linear case in earlier chapters, but the linear assumption
can be relaxed significantly. Let A be an arbitrary set and F a set of functions
from A → R. The learner is given access to the action set A and function class
F. In each round, the learner chooses an action At ∈ A and receives reward
Xt = f (At ) + ηt , where ηt is noise and f ∈ F is fixed, but unknown. Of course
this set-up is general enough to model all of the stochastic bandits so far, but is
perhaps too general to say much. One interesting relaxation is the case where A
is a metric space and F is the set of Lipschitz functions. We refer the reader to
papers by Kleinberg [2005], Auer et al. [2007], Kleinberg et al. [2008], Bubeck
et al. [2011], Slivkins [2014], Magureanu et al. [2014] and Combes et al. [2017], as
well as the book of Slivkins [2019].
Infinite-Armed Bandits
Consider a bandit problem where in each round the learner can choose to play
an arm from an existing pool of Bernoulli arms or to add another Bernoulli arm
to the pool with mean sampled from a uniform distribution. The regret in this
setting is defined as
" n #
X
Rn = n − E Xt .
t=1
This problem is studied by Berry et al. [1997], who show that Rn = Θ(n1/2 ) is the
optimal regret. There are now a number of strengthening and generalisations of
this work [Wang et al., 2009, Bonald and Proutiere, 2013, Carpentier and Valko,
2015, for example], which sadly must be omitted from this book. The notable
difficulty is generalising the algorithms and analysis to the case where reservoir
distribution from which the new arms are sampled is unknown and/or does not
exhibit a nice structure.
Duelling Bandits
In the duelling bandit problem, the learner chooses two arms in each round
At1 , At2 . Rather than observing a reward for each arm, the learner observes
the winner of a ‘duel’ between the two arms. Let k be the number of arms and
P ∈ [0, 1]k×k be a matrix where Pij is the probability that arm i beats arm j in
a duel. It is natural to assume that Pij = 1 − Pji . A common, but slightly less
justifiable, assumption is the existence of a total ordering on the arms such that
359
if i j, then Pij > 1/2. There are at least two notions of regret. Let i∗ be the
optimal arm so that i∗ j for all j 6= i∗ . Then the strong and weak regret are
defined by
" n #
X
Strong regret = E (Pi∗ ,At1 + Pi∗ ,At2 − 1) ,
t=1
" n #
X
Weak regret = E min {Pi∗ ,At1 − 1/2, Pi∗ ,At2 − 1/2} .
t=1
Both definitions measure the number of times arms with low probability of
winning a duel against the optimal arm is played. The former definition only
vanishes when At1 = At2 = i∗ , while the latter is zero as soon as i∗ ∈ {At1 , At2 }.
The duelling bandit problem was introduced by Yue et al. [2009] and has seen
quite a lot of interest since then [Yue and Joachims, 2009, 2011, Ailon et al.,
2014, Zoghi et al., 2014, Dudı́k et al., 2015, Jamieson et al., 2015, Komiyama
et al., 2015a, Zoghi et al., 2015, Wu and Liu, 2016, Zimmert and Seldin, 2019].
Convex Bandits
Let A ⊂ Rd be a convex set. The convex bandit problem comes in both stochastic
and adversarial varieties. In both cases, the learner chooses At from A. In the
stochastic case, the learner receives a reward Xt = f (At ) + ηt where f is an
unknown convex function and ηt is noise. In the adversarial setting, the adversary
chooses a sequence of convex functions f1 , . . . , fn and the learner receives reward
Xt = ft (At ). This turned out to be a major challenge over the last decade with
most approaches leading to suboptimal regret in terms of the horizon. The best
bounds in the stochastic case are by Agarwal et al. [2011], while in the adversarial
case there has been a lot of recent progress [Bubeck et al., 2015a, Bubeck and
Eldan, 2016, Bubeck et al., 2017]. In both cases the dependence of the regret on
√
the horizon is O( n), which is optimal in the worst case. Many open question
remain, such as the optimal dependence on the dimension, or the related problem
of designing practical low-regret algorithms. The interested reader may consult
Shamir [2013] and Hu et al. [2016] for some of the open problems.
Budgeted Bandits
In many problems, choosing an action costs some resources. In the bandits-with-
knapsacks problem, the learner starts with a fixed budget B ∈ [0, ∞)d over
d resource types. Like in the standard K-armed stochastic bandit, the learner
chooses At ∈ [K] and receives a reward Xt sampled from a distribution depending
on At . The twist is that the game does not end after a fixed number of rounds.
Instead, in each round, the environment samples a cost vector Ct ∈ [0, 1]d from a
distribution that depends on At . The game ends in the first round τ for which
Pτ
there exists an i ∈ [d] such that t=1 Cti > Bi . This line of work was started by
Badanidiyuru et al. [2013] and has been extended in many directions by Agrawal
and Devanur [2014], Tran-Thanh et al. [2012], Ashwinkumar et al. [2014], Xia
360
et al. [2015], Agrawal and Devanur [2016], Tran-Thanh et al. [2010] and Hanawal
et al. [2015]. A somewhat related idea is the conservative bandit problem where
the goal is to minimise regret subject to the constraint that the learner must not
be much worse than some known baseline. The constraint limits the amount of
exploration and makes the regret guarantees slightly worse [Sui et al., 2015, Wu
et al., 2016, Kazerouni et al., 2017].
Graph Feedback
There is growing interest in feedback models that lie between the full information
and bandit settings. One way to do this is to let G be a directed graph with
K vertices. The adversary chooses a sequence of loss vectors in [0, 1]K as usual.
In each round, the learner chooses a vertex and observes the loss corresponding
to that vertex and its neighbours. The full information and bandit settings
are recovered by choosing the graph to be fully connected or have no edges
respectively, but of course there are many interesting regimes in between. There
are many variants on this basic problem. For example, G might change in each
round or be undirected. Or perhaps the graph is changing, and the learner only
observes it after choosing an action. The reader can explore this topic by reading
the articles by Mannor and Shamir [2011], Alon et al. [2013], Kocák et al. [2014]
and Alon et al. [2015] or the short book by Valko [2016].
30 Combinatorial Bandits
" n #
X
Rn = max E hAt − a, yt i ,
a∈A
t=1
In Chapters 27 and 28, we assumed that yt ∈ {y : supa∈A |ha, yi| ≤ 1}. This
restriction is not consistent with the applications we have in mind, so instead we
assume that yt ∈ [0, 1]d , which by the definition of A ensures that |hAt , yt i| ≤ m
for all t. In the standard bandit model, the learner observes hAt , yt i in each round.
30.2 Applications
Shortest-Path Problems
Let G = (V, E) be a fixed graph with a finite set of vertices V and edges
E ⊆ V × V , with |E| = d. The online shortest-path problem is a game over n
rounds between an adversary and a learner. Given fixed vertices u, v ∈ V , the
learner’s objective in each round is to find the shortest path between u and v. At
the beginning of the game, the adversary chooses a sequence of vectors y1 , . . . , yn ,
with yt ∈ [0, 1]d and yti representing the length of the ith edge in E in round t.
In each round, the learner chooses a path between u and v. The regret of the
learner is the difference between the distance they travelled and the distance of
the optimal path in hindsight. A path is represented by a vector a ∈ {0, 1}d where
ai = 1 if the ith edge is part of the path. Let A be the set of paths connecting
vertices u and v, then the length of path a in round t is ha, yt i. In this problem,
m is the length of the longest path. Fig. 30.1 illustrates a typical example.
Ranking
Suppose a company has d ads and m locations in which to display them. In each
round t, the learner should choose the m ads to display, which is represented by a
vector At ∈ {0, 1}d with kAt k1 = m. As before, the adversary chooses yt ∈ [0, 1]d
that measures the quality of each placement and the learner suffers loss hAt , yt i.
This problem could also be called ‘selection’ because the order of the items play
no role. Problems where the order plays a direct role are analysed in Chapter 32.
Multitask Bandits
Consider playing m multi-armed bandits simultaneously, each with k arms. If
the losses for each bandit problem are observed, then it is easy to apply Exp3 or
Exp3-IX to each bandit independently. But now suppose the learner only observes
the sum of the losses. This problem is represented as a combinatorial bandit by
30.3 Bandit Feedback 363
Beijing
12
13 7
Budapest
10 13
1
11 Abu Dhabi
Frankfurt
10
12 Singapore
8
Sydney
Figure 30.1 Shortest-path problem between Budapest and Sydney. The learner chooses
the path Budapest–Frankfurt–Singapore–Sydney. In the bandit setting, they observe
total travel time (21 hours), while in the semi-bandit they observe the length of each
flight on the route they took (1 hour, 12 hours, 8 hours).
letting d = mk and
( k
)
X
A= a ∈ {0, 1} : d
ai+kj = 1 for all 0 ≤ j < m .
i=1
In words, the d coordinates are partitioned into m parts and the learner needs to
select exactly one coordinate (“primitive action”) from each part. The resulting
problem is called the multi-task bandit problem: This problem is like making
m independent choices in parallel in m bandit problems blindly and then receiving
an aggregated feedback for all the m choices made. This scenario can arise in
practice when a company is making multiple independent interventions, but the
quality of the interventions are only observed via a single change in revenue.
The easiest approach is to apply the version of Exp3 for linear bandits described
in Chapter 27. The only difference is that now |hAt , yt i| can be as large as m,
which increases the regret by a factor of m. We leave the proof of the following
theorem to the reader (Exercise 30.1).
Theorem 30.1. Consider the setting of Section 30.1. If Algorithm 15 is run on
action set A with appropriately chosen learning rate, then
s
p 3/2 ed
Rn ≤ 2m 3dn log |A| ≤ m 12dn log .
m
There are two issues with this approach, both computational. First, the action
30.4 Semi-bandit Feedback and Mirror Descent 364
set is typically so large that finding the core set of the central minimum volume
enclosing ellipsoid that determines the Kiefer–Wolfowitz exploration distribution
of Algorithm 15 is hopeless. Second, efficiently sampling from the resulting
exponential weights distribution may not be possible. There is no silver bullet
for these issues. The combinatorial bandit can model a repeated version of the
travelling salesman problem, which is hard even to approximate. Since an online
learning algorithm with O(np ) regret with p < 1 can be used to approximate
the optimal solution, it follows that no such algorithm can be computationally
efficient. There are, however, special cases where efficient algorithms exist, and
we give some pointers to the relevant literature on this at the end of the chapter.
One modification that greatly eases computation is to replace the optimal Kiefer–
Wolfowitz exploration distribution with a distribution that can be computed and
sampled from in an efficient manner, as noted after Theorem 27.1.
In the semi-bandit setting, the learner observes the loss associated with all non-
zero coordinates of the chosen action. The additional information is exploited by
noting that yt can now be estimated in each coordinate. Let
Ati yti
Ŷti = , (30.1)
Āti
where Āti = E[Ati | Ft−1 ] with Ft = σ(A1 , . . . , At ). An easy calculation shows
that E[Ŷt | Ft−1 ] = yt , so this estimate is still unbiased. Unsurprisingly we will
again use online stochastic mirror descent, which is summarised for this setting
in Algorithm 18.
1: Input A, η, F
2: Ā1 = argmina∈co(A) F (a)
3: for t = 1, . . . , n do
P
4: Choose distribution Pt on A such that a∈A Pt (a)a = Āt
5: Sample At ∼ Pt and observe At1 yt1 , . . . , Atd ytd
6: Compute Ŷti = Ati yti /Āti for all i ∈ [d]
7: Update Āt+1 = argmina∈co(A) ηha, Ŷt i + DF (a, Āt )
8: end for
Algorithm 18: Online stochastic mirror descent for semi-bandits.
Proof Since A is a finite set, the algorithm is well defined. In particular, Āt > 0
exists and is unique for all t ∈ [n]. By Theorem 28.10,
" n #
diamF (co(A)) X 1
Rn ≤ +E hĀt − Āt+1 , Ŷt i − DF (Āt+1 , Āt ) . (30.2)
η t=1
η
The diameter is easily bounded by noting that F is negative in co(A) and using
Jensen’s inequality:
d
X 1
diamF (co(A)) ≤ sup ai + ai log ≤ m(1 + log(d/m)) .
a∈co(A) i=1 ai
For the second term in Eq. (30.2), let Ŷti0 = Ŷti I Āt+1,i ≤ Āti . Since Ŷt is
positive,
1 1
hĀt − Āt+1 , Ŷt i − DF (Āt+1 , Āt ) ≤ hĀt − Āt+1 , Ŷt0 i − DF (Āt+1 , Āt )
η η
d
η η X Ati
≤ kŶt0 k2∇2 F (Zt )−1 ≤ ,
2 2 i=1 Āti
where Zt is provided by Theorem 26.12 and lies on the chord [Āt , Āt+1 ]. The
final inequality follows because ∇2 F (z) = diag(1/z) and using the definition of
Ŷt0 , which ensures that the worst case occurs when Zt = Āt . Summing and taking
the expectation:
" n # " n d #
X 1 η X X Ati ηnd
E hĀt − Āt+1 , Ŷt i − DF (Āt+1 , Āt ) ≤ E = .
t=1
η 2 t=1 i=1
Āti 2
Algorithm 18 plays mirror descent on the convex hull of the actions, which has
dimension d − 1. In principle it would be possible to do the same thing on the
set of distributions over actions, which has dimension
p |A| − 1. Repeating the
analysis leads to a suboptimal regret of O(m dn log(d/m)). We encourage
the reader to go through this calculation to see where things go wrong.
Like in Section 30.3, the main problem is computation. In each round the algorithm
P
needs to find a distribution Pt over A such that a∈A Pt (a) = Āt . Feasibility
follows from the definition of co(A), while Carathéodory’s theorem proves the
30.5 Follow-the-Perturbed-Leader 366
support of Pt never needs to be larger than d+1. Since A is finite, we can write the
problem of finding Pt in terms of linear constraints, but naively the computation
complexity is polynomial in k = |A|, which is exponential in m. The algorithm also
needs to compute Āt+1 from Āt and Ŷt . This is a convex optimisation problem,
but the computation complexity depends on the representation of A and may be
intractable. See Note 6 for a few more details on this.
30.5 Follow-the-Perturbed-Leader
where η > 0 is the learning rate and Zt ∈ Rd is sampled from a carefully chosen
distribution Q. The random perturbations is chosen to both guard against worst-
case, and to induce necessary exploration. Notice that if η is small, then the
effect of Zt is larger and the algorithm can be expected to explore more, which is
consistent with the learning rate used in mirror descent or exponential weighting
studied in previous chapters.
Before defining the loss estimations and perturbation distribution, we make a
connection between FTPL and mirror descent. Given Legendre potential F with
dom(∇F ) = int(co(A)), online stochastic mirror descent chooses Āt so that
Taking derivatives and using the fact that dom(∇F ) = int(co(A)), we have
By duality (Theorem 26.6), this implies that Āt = ∇F ∗ (−η L̂t−1 ). On the other
hand, examining Eq. (30.4), we see that for FTPL,
h i
Āt = E[At | Ft−1 ] = E argmina∈A ha, η L̂t−1 − Zt i Ft−1 ,
The key to this argument is that the derivative of φ exists almost everywhere
and is equal to a(x). All this shows is that FTPL can be interpreted as mirror
descent with potential F defined in terms of its Fenchel dual,
Z
F (x) =
∗
φ(x + z)dQ(z) . (30.5)
Rd
It remains to choose the loss estimator. A natural choice would be the same as
Eq. (30.1), which is Ŷti = Ati yti /Pti with Pti = P (Ati = 1 | Ft−1 ) = Āti . The
problem is that Pti does not generally have a closed-form solution. And while
Pti can be estimated by sampling, the number of samples required for sufficient
accuracy can be quite large. The next idea is to replace 1/Pti in the importance-
weighted estimator with a random variable with conditional expectation equal to
1/Pti . This is based on the following well-known result:
The truncation parameter β is needed to ensure that Ŷti is never too large. We
have now provided all the pieces to define a version of FTPL that is a special
case of mirror descent. The algorithm is summarised in Algorithm 19.
Theorem 30.4. Consider the setting of Section 30.1. Let Q have density with
respect to the Lebesgue measure of q(z) = 2−d exp(−kzk1 ), and choose the
parameters η, β as follows:
s
2(1 + log(d)) 1
η= , β= .
(1 + e2 )dnm ηm
1: Input A, n, η, β, Q
2: L̂0 = 0 ∈ Rd
3: for t = 1, . . . , n do
4: Sample Zt ∼ Q
5: Compute At = argmaxa∈A ha, Zt − η L̂t−1 i
6: Observe At1 yt1 , . . . , Atd ytd
7: For each i ∈ [d] sample Kti ∼ Geometric(Pti )
8: For each i ∈ [d] compute Ŷti = min(β, Kti )Ati yti
9: L̂t = L̂t−1 + Ŷt
10: end for
Algorithm 19: Follow-the-perturbed-leader for semi-bandits.
Proof First, note that At is almost surely uniquely defined and so is Āt =
E [At | Ft−1 ]. Therefore, by isolating the bias in the loss estimators, and thanks
to Exercise 30.6, we can apply Theorem 28.4 to get that
" n # n
" #
X X
Rn (a) = E hAt − a, yt i = E hĀt − a, yt i
t=1 t=1
" n
# n
" #
X X
=E hĀt − a, Ŷt i + E hĀt − a, yt − Ŷt i
t=1 t=1
" # " n #
diamF (A) 1X
n X
≤ +E DF (Āt , Āt+1 ) + E hĀt − a, yt − Ŷt i .
η η t=1 t=1
(30.6)
1
d
X
≥ −E[maxhb, Zi] ≥ −mE[kZk∞ ] = −m ≥ −m(1 + log(d)) ,
b∈A
i=1
d
where the first inequality follows by choosing x = 0 and the second follows
from Hölder’s inequality and that kak1 ≤ m for any a ∈ A. The last equality is
non-trivial and is explained in Exercise 30.4. By the convexity of the maximum
function and the fact that Z is centered, we also have from Eq. (30.7) that
F (a) ≤ 0, which means that
The next step is to bound the Bregman divergence induced by F . We will shortly
show that the Hessian ∇2 F ∗ (x) of F ∗ exists, so by Part (b) of Theorem 26.6
30.5 Follow-the-Perturbed-Leader 370
and Taylor’s theorem, there exists an α ∈ [0, 1] and ξ = −η L̂t−1 − αη Ŷt such that
where the last inequality follows since α ∈ [0, 1] and Ŷti ≤ β = d1/(mη)e, ηm ≥ 1
and Ŷt has at most m non-zero entries. Continuing on from Eq. (30.9), we have
η2 e2 η 2 X e2 η 2 X X
d Xd d d
kŶt k2∇2 F ∗ (ξ) ≤ Pti Ŷti Ŷtj ≤ Pti Kti Ati Ktj Atj .
2 2 i=1 j=1
2 i=1 j=1
Chaining together the parts and taking the expectation shows that
e2 η X X
d d
E[DF (Āt , Āt+1 )] ≤ E Pti Kti Ati Ktj Atj
2 i=1 j=1
e2 η 2 X X Ati Atj e2 mdη 2
d d
= E ≤ .
2 i=1 j=1
Ptj 2
The last step is to control the bias term. For this, first note that since
30.6 Notes 371
where the last inequality follows from using that for x ∈ [0, 1], s > 0,
x(1 − x)s ≤ xe−sx ≤ 1/s. Putting together all the pieces into Eq. (30.6) leads to
m(1 + log(d)) e2 dnmη dnmη p
Rn ≤ + + ≤ m 2(1 + e2 )nd(1 + log(d)) .
η 2 2
30.6 Notes
1 For a long time, it was speculated that the dependence of the regret on m3/2
in Theorem 30.1 (bandit feedback) might be improvable to m. Very recently,
however, the lower bound was increased to show the upper bound is √ tight
[Cohen et al., 2017]. For semi-bandits the worst-case lower bound is Ω( dnm)
(Exercise 30.8), which holds for large enough n and m ≤ d/2 and is matched
up to constant factors by online stochastic mirror descent with a different
potential (Exercise 30.7).
2 The implementation of FTPL shown in Algorithm 19 needs to sample Kti for
each i with Ati = 1. The conditional expected running time for this is Ati /Pti ,
which has expectation 1. It follows that the expected running time over the
whole n rounds is O(nd) calls to the oracle linear optimisation algorithm. It
can happen that the algorithm is unlucky and chooses Ati = 1 for some i
with Pti quite small and then sampling Kti could be time-consuming. Note,
however, that only min(Kti , β) is actually used by the algorithm, and hence
the sampling procedure can be truncated at β. This minor modification ensures
the algorithm needs at most O(βnd) calls to the oracle in the worst case.
3 While FTPL is excellent in the face of semi-bandit information, we do not know
of a general result for the bandit model. The main challenge is controlling the
variance of the least squares estimator without explicitly inducing exploration
using a sophisticated exploration distribution like what is provided by Kiefer–
Wolfowitz.
30.6 Notes 372
Xt = hAt , θi + ηt , (30.12)
Xt = hAt , θt i . (30.13)
This latter version has ‘parameter noise’ (cf. Chapter 29) and is more closely
related to the adversarial set-up studied in this chapter. Finally, one can assume
additionally that the distribution of θt is a product distribution so that (θ1i )di=1
are also independent.
5 For some action sets, the off-diagonal elements of the Hessian in Eq. (30.10)
√
are negative, which improves the dependence on m to m. An example where
this occurs is when A = {a ∈ {0, 1}d : kak1 = m}. Let i 6= j, and suppose that
z, ξ ∈ Rd and zj ≥ 0. Then you can check that a(z + ξ)i ≤ a(z − 2zj ej + ξ)i ,
and so
Z
∇2 F ∗ (ξ)ij = a(z + ξ)i sign(z)j q(z)dz
Rd
Z Z ∞
= (a(z + ξ)i − a(z − 2zj ej + ξ)i )q(z)dzj dz−j
Rd−1 0
≤ 0,
where dz−j is shorthand for dz1 dz2 , . . . dzj−1 dzj+1 , . . . , dzd . You are asked to
complete all the details in Exercise 30.9. This result unfortunately does not
hold for every action set (Exercise 30.10).
6 In order to implement mirror descent or follow-the-regularised-leader with
bandit or semi-bandit information, one needs to solve two optimisation problems:
(a) a convex optimisation problem of the form argmina∈co(A) F (a) for some
convex F and (b) a linear optimisation problem to find a distribution P over A
with mean ā where ā ∈ co(A). More or less sufficient is an efficient membership
oracle for co(A) and evaluation oracle for F [Grötschel et al., 2012, Lee et al.,
2018]. Also necessary for bandits is to identify an exploration distribution,
which we discuss in the notes and bibliographic remarks of Chapter 27. This is
not required for semi-bandits, however, at least with the negentropy potential
used in by Algorithm 18.
30.7 Bibliographic Remarks 373
30.8 Exercises
Hint For the second inequality, you may find it useful to know that for
Pm
0 ≤ m ≤ n, defining Φm (n) = i=0 ni , it holds that (m/n)m Φm (n) ≤ em .
30.2 (Efficient computation on m-sets) Provide an efficient implementation
of Algorithm 18 for the m-set: A = {a ∈ {0, 1}d : kak1 = m}.
Hint This is not the easiest exercise. Start by reading the paper by Takimoto
and Warmuth [2003], then follow up with that of György et al. [2007].
30.4 (Expected supremum norm of Laplace) Let Z be sampled from
measure on Rd with density f (z) = 2−d exp(−kzk1 ). The purpose of this exercise
is to show that
1
d
X
E[kZk∞ ] = . (30.14)
i=1
i
Hint Recall that the support function φ of a non-empty compact set is a proper
30.8 Exercises 375
convex function. Then, note that for any proper convex function f : Rd → R∪{∞},
the set Rd \ dom(∇f ) has Lebesgue measure zero [Rockafellar, 2015, Theorem
25.5]. Next, by Danskin’s theorem, the directional derivative of φ in the direction
v ∈ Rd is given by ∇v φ(x) = maxa∈A(x) ha, vi, where A(x) is the set of maximisers
of a 7→ ha, xi over A [Bertsekas, 2015, Proposition 5.4.8 in Appendix B]. Finally,
it is worth remembering the following result: let f be an extended real-valued
function with x ∈ Rd in the interior of its domain. Then, for some g ∈ Rd ,
∇v f (x) = hg, vi holds true for all v ∈ Rd if and only if ∇f (x) exists and is equal
to g.
30.6 A function f : Rd → R̄ is closed if its epigraph is a closed set. Let F ∗ be
the function defined in Section 30.5 and F be the proper convex closed function
and whose Fenchel dual is F ∗ .
(a) Show that the function F is well defined (F ∗ is the Fenchel dual of a proper
convex closed function, and there is only a single such function).
(b) For the remainder of the exercise, let Q be absolutely continuous with respect
to the Lebesgue measure with an everywhere positive density, and let A be
the convex hull of finitely many points in Rd whose span is Rd . Show that
the function F is Legendre.
(c) Show that int(dom(F )) = int(co(A)).
Hint For Part (a), it may be worth recalling that the bidual (the dual of the
dual) of a proper convex closed function f is itself: f = f ∗∗ (= (f ∗ )∗ ). Furthermore,
the Fenchel dual of a proper function is always a proper convex closed function.
30.7 (Minimax bound for combinatorial semi-bandits) Adapt the analysis
in Exercise 28.15 to derive an algorithm for combinatorial
√ bandits with semi-
bandit feedback for which the regret is Rn ≤ C mdn for universal constant
C > 0.
Hint The most obvious choice is A = {a ∈ {0, 1}d : kak1 = m}, which are
sometimes called m-sets. A lower bound does hold for this action set [Lattimore
et al., 2018]. However, an easier path is to impose a little additional structure
such as multi-task bandits.
30.9 (Follow-the-perturbed-leader
√ for m-sets) Use the ideas in Note 5 to
prove that FTPL has Rn = Õ( mnd) regret when A = {a ∈ {0, 1}d : kak1 = m}.
Hint After proving the off-diagonal elements of the Hessian are negative, you
will also need to tune the learning rate. We do not know of a source for this
result, but the full information case was studied by Cohen and Hazan [2015].
30.8 Exercises 376
30.10 Construct an action set and i 6= j and z ∈ Rd with zj > 0 such that
a(z)i ≥ a(z − 2zj ej )i .
i j
start goal
Choose losses for the edges z, and think about what happens when the loss
associated with edge j decreases.
31 Non-stationary Bandits
To put this in perspective, a policy that plays each arm with probability half in
Pn
every round would have E[ t=1 ytAt ] = n/2. In other words, the regret guarantee
is practically meaningless.
What should we expect for this problem? The sequence of losses is so regular
31.1 Adversarial Bandits 378
that we might hope that a clever policy will mostly play the second arm in the
first n/2 rounds and then switch to playing mostly the first arm in the second
n/2 rounds. Then the cumulative loss would be close to zero and the regret would
be negative. Rather than aiming to guarantee negative regret, we redefine the
regret by enlarging the competitor class as a way to ensure meaningful results.
Let Γnm ⊂ [k]n be the set of action sequences of length n with at most m − 1
changes:
( n−1
)
X
Γnm = (at ) ∈ [k] :
n
I {at 6= at+1 } ≤ m − 1 .
t=1
which means a policy can only enjoy sublinear non-stationary regret if it detects
the change point quickly. The obvious question is whether or not such a policy
exists and how its regret depends on m.
To see that you cannot do much better than this, imagine interacting with m
adversarial bandit environments sequentially, each with horizon n/m. No matter
what policy you propose, there exist choices of bandits
p such that the expected
regret suffered against each bandit is at least Ω( nk/m). After summing over
the m instances, we see that the worst-case regret is at least
√
Rnm = Ω nmk ,
31.1 Adversarial Bandits 379
which matches the upper bound except for logarithmic factors. Notice how this
lower bound applies to policies that know the location of the changes, so it is
not true that things are significantly harder in the absence of this knowledge.
There is one big caveat with all these calculations. The running time of a naive
implementation of Exp4 is linear in the number of experts, which even for modestly
sized m is very large indeed.
where η > 0 is the learning rate and Ŷti = I {At = i} yti /Pti is the importance-
weighted estimator of the loss of action i for round t. The solution to the
optimisation problem of Eq. (31.3) can be computed efficiently using the two-step
process:
m log(1/α) ηnk
Rnm ≤ αn(k − 1) + + .
η 2
Pn
Proof Let a∗ ∈ argmina∈Γnm t=1 ytat be an optimal sequence of actions in
hindsight constrained to Γnm . Then let 1 = t1 < t2 < · · · < tm < tm+1 = n + 1
so that a∗t is constant on each interval {ti , . . . , ti+1 − 1}. We abuse notation by
31.2 Stochastic Bandits 380
The next step is to apply Eq. (28.11) and the solution to Exercise 28.10 to bound
the inner expectation, giving
"ti+1 −1 # "ti+1 −1 #
X X
E (ytAt − yta∗i ) Pti = E hPt − ea∗i , yt i Pti
t=ti t=ti
" #
ti+1 −1
X
≤ α(ti+1 − ti )(k − 1) + E max hPt − p, yt i Pti
p∈A
t=ti
" #
ti+1 −1
X
= α(ti+1 − ti )(k − 1) + E max hPt − p, Ŷt i Pti
p∈A
t=ti
D(p, Pti ) ηk(ti+1 − ti )
≤ α(ti+1 − ti )(k − 1) + E max + Pti .
p∈A η 2
By assumption, Pti ∈ A and so Pti j ≥ α for all j and D(p, Pti ) ≤ log(1/α).
Combining this observation with the previous two displays shows that
m log(1/α) ηnk
Rnm ≤ nα(k − 1) + + .
η 2
The learning rate and clipping parameters are approximately optimised by
p p
η = 2m log(1/α)/(nk) and α = m/(nk) ,
p √
which leads to a regret of Rnm ≤ mnk log(nk/m) + mnk. In typical
applications,
p the value p
of m is not known. In this case one can choose η =
√
log(1/α)/nk and α = 1/nk, and the regret increases by a factor of O( m).
To keep things simple, we will assume the rewards are Gaussian and that for
each arm i there is a function µi : [n] → R, and the reward is
Xt = µAt (t) + ηt ,
where (ηt )nt=1 is a sequence of independent standard Gaussian random variables.
The optimal arm in round t has mean µ∗ (t) = maxi∈[k] µi (t) and the regret is
n
" n #
X X
Rn (µ) = µ (t) − E
∗
µAt (t) .
t=1 t=1
31.2 Stochastic Bandits 381
If the locations of the change points were known then, thanks to the concavity of
log, running a new copy of UCB on each interval would lead to a bound of
n
mk
Rn (µ) = O m + log , (31.4)
∆min m
where ∆min is the smallest suboptimality gap over all m blocks and n ≥ m. This
is a non-vacuous bound for n large. Inspired by the results of the last section
that showed that the bound achieved by an omniscient policy that knows when
the changes occur can be achieved by a policy that does not, one then wonders
whether the same holds concerning the bound in Eq. (31.4). As it turns out, the
answer in this case is no.
Theorem 31.2. Let k = 2, and fix ∆ ∈ (0, 1) and a policy π. Let µ be so that
µi (t) = µi is constant for both arms and ∆ = µ1 − µ2 > 0. If the expected regret
Rn (µ) of policy π on bandit µ satisfies Rn (µ) = o(n), then for all sufficiently
large n, there exists a non-stationary bandit µ0 with at most two change points
and mint∈[n] |µ01 (t) − µ02 (t)| ≥ ∆ such that Rn (µ0 ) ≥ n/(22Rn (µ)).
The theorem implies that if a policy enjoys Rn (µ) = o(n1/2 ) for any non-trivial
(stationary) bandit, then its minimax regret is at least ω(n1/2 ) on some non-
stationary bandit. In particular, if Rn (µ) = O(log(n)), then its worst-case regret
against non-stationary bandits with at most two changes is at least Ω(n/ log(n)).
This dashes our hopes for a policy that outperforms Exp4 in a stochastic setting
with switches, even in an asymptotic sense. The reason for the negative result
is that any algorithm anticipating the possibility of an abrupt change must
frequently explore all suboptimal arms to check that no change has occurred.
There are algorithms designed for non-stationary bandits in the stochastic
setting with abrupt change points as described above. Those that come with
theoretical guarantees are based on forgetting or discounting data so that decisions
of the algorithm depend almost entirely on recent data. In the notes, we discuss
these approaches along with alternative models for non-stationarity. For now,
the advantage of the stochastic setting seems to be that in the stochastic setting
there are algorithms that do not need to know the number of changes, while, as
noted beforehand, such algorithms are not yet known (or maybe not possible) in
the nonstochastic setting.
Proof of Theorem 31.2 Let (Sj )L j=1 be a uniform partition of [n] into successive
intervals. Let P and E[·] denote the probabilities and expectations with respect
to the bandit determined by µ and P0 with respect to alternative non-stationary
31.3 Notes 382
31.3 Notes
2 The negative results for stochastic non-stationary bandits do not mean that
trying to improve on the adversarial bandit algorithms is completely hopeless.
First of all, the adversarial bandit algorithms are not well suited for exploiting
distributional assumptions on the noise, which makes things irritating when
the losses/rewards are Gaussian (which are unbounded) or Bernoulli (which
have small variance near the boundaries). There have been several algorithms
designed specifically for stochastic non-stationary bandits. When the reward
distributions are permitted to change abruptly, as in the last section, then the
two main algorithms are based on the idea of ‘forgetting’ rewards observed in
the distant past. One way to do this is with discounting. Let γ ∈ (0, 1) be
the discount factor , and define
t
X t
X
µ̂γi (t) = γ t−s I {As = i} Xs Tiγ (t) = γ t−s I {As = i} .
s=1 s=1
Then, for appropriately tuned constant α, the discounted UCB policy chooses
each arm once and subsequently
v !
u k
u α X
At = argmaxi∈[k] µ̂γi (t − 1) + t γ log Tiγ (t − 1) .
Ti (t − 1) i=1
The idea is to ‘discount’ rewards that occurred far in the past, which makes
the algorithm most influenced by recent events. A similar algorithm called
sliding-window UCB uses a similar approach, but rather than discounting past
rewards with a geometric discount function, it simply discards them altogether.
Let τ ∈ N+ be a constant, and define
t
X t
X
µ̂τi (t) = I {As = i} Xs Tiτ (t) = I {As = i} .
s=t−τ +1 s=t−τ +1
is to understand the magnitude of the linear regret in terms of the size of the
interval or volatility of the Brownian motion.
4 Yet another idea is to allow the means to change in an arbitrary way, but
restrict the amount of total variation. Let µt = (µ1 (t), . . . , µk (t)) and
n−1
X
Vn = kµt − µt+1 k∞
t=1
Non-stationary bandits have quite a long history. The celebrated Gittins index is
based on a model where each arm is associated with a Markov chain that evolves
when played, the reward depends on the state, and the state of the chosen Markov
chain is observed after it evolves [Gittins, 1979, Gittins et al., 2011]. The classical
approaches, as discussed in Chapter 35, address this problem in the Bayesian
framework, and the objective is primarily to design efficient algorithms rather
than understanding the frequentist regret. Even more related is the restless
bandit, which is the same as Gittins’s set-up except the Markov chain for every
arm evolves in every round, while the learner still only observes the state and
reward for the action they chose. As a result, the learner needs to reason about the
evolution of all the Markov chains, which makes this problem rather challenging.
Restless bandits were introduced by Whittle [1988] in the Bayesian framework,
where most of the results are not especially positive. There has been some interest
in a frequentist analysis, but the challenging nature of the problem makes it
difficult to design efficient algorithms with meaningful regret guarantees [Ortner
et al., 2012]. Certainly there is potential for more work in this area.
The ideas in Section 31.1 are mostly generalisations of algorithms designed
for the full information setting, notably the fixed share algorithm [Herbster and
Warmuth, 1998]. The first algorithm designed for the adversarial non-stationary
31.5 Exercises 385
31.5 Exercises
Hint For the second part, you may find it useful to show the following well-
Pm
known inequality: for 0 ≤ m ≤ n, defining Φm (n) = i=0 ni , it holds that
(m/n)m Φm (n) ≤ em .
31.2 (Lower bound for adversarial non-stationary bandits) Let
n, m, k ∈ N+ be such that n ≥ mk. Prove that for any policy π there exists an
adversarial bandit (yti ) such that
√
Rnm ≥ c nmk ,
where c > 0 is a universal constant.
Ranking is a huge topic, and our approach is necessarily quite narrow. In fact
there is still a long way to go before we have a genuinely practical algorithm
for large-scale online ranking problems. As usual, we summarise alternative
ideas in the notes.
Stochastic Ranking
A permutation on [`] is an invertible function σ : [`] → [`]. Let A be the set of
all permutations on [`]. In each round t the learner chooses an action At ∈ A,
which should be interpreted as meaning the learner places item At (k) in the kth
position. Equivalently, A−1
t (i) is the position of the ith item. Since the shortlist
has length m, the order of At (m + 1), . . . , At (`) is not important and is included
32.1 Click Models 388
only for notational convenience. After choosing their action, the learner observes
Cti ∈ {0, 1} for each i ∈ [`], where Cti = 1 if the user clicked on the ith item.
Note that the user may click on multiple items. We will assume a stochastic
model where the probability that the user clicks on position k in round t only
depends on At and is given by v(At , k), with v : A × [`] → [0, 1] an unknown
function. The regret over n rounds is
" n #
X̀ X X̀
Rn = n max v(a, k) − E Cti .
a∈A
k=1 t=1 i=1
Document-Based Model
The document-based model is one of the simplest click models, which assumes
the probability of clicking on a shortlisted item is equal to its attractiveness.
Formally, for each item i ∈ [`], let α(i) ∈ [0, 1] be the attractiveness of item i.
The document-based model assumes that
v(a, k) = α(a(k))I {k ≤ m} .
The unknown quantity in this model is the attractiveness function, which has
just ` parameters.
32.1 Click Models 389
Position-Based Model
The document-based model might occasionally be justified, but in most cases the
position of an item in the ranking also affects the likelihood of a click. A natural
extension that accounts for this behaviour is called the position-based model,
which assumes that
v(a, k) = α(a(k))χ(k) ,
where χ : [`] → [0, 1] is a function that measures the quality of position k. Since
the user cannot click on items that are not shown, we assume that χ(k) = 0 for
k > m. This model is richer than the document-based model, which is recovered
by choosing χ(k) = I {k ≤ m}. The number of parameters in the position-based
models is m + `.
Cascade Model
The position-based model is not suitable for applications where clicking on an
item takes the user to a different page. In the cascade model, it is assumed
that the learner scans the shortlisted items in order and only clicks on the first
item they find attractive. Define χ : A × [`] → [0, 1] by
1 if k = 1
χ(a, k) = 0 if k > m
Qk−1
(1 − α(a(k ))) otherwise ,
0
0 k =1
which is the probability that the user has not clicked on the first k − 1 items.
Then the cascade model assumes that
v(a, k) = α(a(k))χ(a, k) . (32.1)
The first term in the factorisation is the attractiveness function, which measures
the probability that the user is attracted to the ith item. The second term can be
interpreted as the probability that the user examines that item. This interpretation
is also valid in the position-based model. It is important to emphasise that
v(a, k) is the probability of clicking on the kth position when taking action
a ∈ A. This does not mean that Ct1 , . . . , Ct` are independent. The assumptions
only restricts the marginal distribution of each Cti , which is sufficient for our
purposes. Nevertheless, in the cascade model, it would be standard to assume
that CtAt (k) = 0 if there exists an k 0 < k such that CtAt (k0 ) = 1, and otherwise
P(CtAt (k) = 1 | At , CtAt (1) = 0, . . . , CtAt (k−1) = 0) = I {k ≤ m} α(At (k)) .
Like the document-based model, the cascade model has ` parameters.
Generic Model
We now introduce a model that generalises the last three. Previous models
essentially assumed that the probability of a click factorises into an attractiveness
probability and an examination probability. We deviate from this norm by making
32.1 Click Models 390
a a0
1
2 i j
3
4 j i
5
Figure 32.2 Part (c) of Assumption 32.1 says that the probability of clicking in the
second position on the left list is larger than the probability of clicking on the second
position on the right list by a factor of α(i)/α(j). For the fourth position, the probability
is larger for the right list than the left by the same factor.
These assumptions may appear quite mysterious. At some level they are
chosen to make the proof go through, while simultaneously generalising the
document-based, position-based and cascade models (32.1). The choices are
not entirely without basis or intuition, however. Part (a) asserts that the user
does not click on items that are not placed in the shortlist. Part (b) says that
α-optimal actions maximise the expected number of clicks. Note that there
are multiple optimal rankings if α is not injective. Part (c) is a little more
restrictive and is illustrated in Fig. 32.2. One way to justify this is to assume
that v(a, k) = α(a(k))χ(a, k), where χ(a, k) is viewed as the probability that the
user examines position k. It seems reasonable to assume that the probability
the user examines position k should only depend on the first k − 1 items. Hence
v(a, 2) = α(i)χ(a, 2) = α(i)χ(a0 , 2) = α(i)/α(j)v(a0 , 2). In order the make the
argument for the fourth position, we need to assume that placing less attractive
items in the early slots increases the probability that the user examines later
positions (searching for a good result). This is true for the position-based and
32.2 Policy 391
cascade models, but is perhaps the most easily criticised assumption. Part (d)
says that the probability that a user clicks on a position with a correctly placed
item is at least as large as the probability that the user clicks on that position in
an optimal ranking. The justification is that the items a(1), . . . , a(k − 1) cannot
be more attractive than a∗ (1), . . . , a∗ (k − 1), which should increase the likelihood
that the user makes it the kth position.
The generic model has many parameters, but we will see that the learner does
not need to learn all of them in order to suffer small regret. The advantage of
this model relative to the previous ones is that it offers more flexibility, and yet
it is not so flexible that learning is impossible.
32.2 Policy
We now explain the policy for learning to rank when v is unknown, but satisfies
Assumption 32.1. After the description is an illustration that may prove helpful.
Step 0: Initialisation
The policy takes as input a confidence parameter δ ∈ (0, 1) and ` and m. The
policy maintains a binary relation Gt ⊆ [`] × [`]. In the first round t = 1 the
relation is empty: G1 = ∅. You should think of Gt as maintaining pairs (i, j)
for which the policy has proven with high probability that α(i) < α(j). Ideally,
Gt ⊆ {(i, j) ∈ [`] × [`] : α(i) < α(j)}.
Finally, let Mt = max{d : Ptd 6= ∅}. The reader should check that if Gt does not
have cycles, then Mt is well defined and finite and that Pt1 , . . . , PtMt is indeed a
partition of [`] (Exercise 32.5). The event that Gt contains cycles is a failure event.
In order for the policy to be well defined, we assume it chooses some arbitrary
fixed action in this case.
Next let Σt ⊆ A be the set of actions σ such that σ(Itd ) = Ptd for all d ∈ [Mt ].
The algorithm chooses At uniformly at random from Σt . Intuitively the policy
first shuffles the items in Pt1 and uses these as the first |Pt1 | entries in the ranking.
Then Pt2 is shuffled, and the items are appended to the ranking. This process is
repeated until the ranking is complete. For an item i ∈ [`], we denote by Dti the
unique index d such that i ∈ Ptd .
All this means is that Stij tracks the difference between the number of clicks
on items i and j over rounds when they share a partition. As a final step, the
relation Gt+1 is given by
v !
u p
u c Ntij
Gt+1 = Gt ∪ (j, i) : Stij ≥ t2Ntij log ,
δ
where c ≈ 3.43 is the universal constant given in Exercise 20.10. In the analysis we
will show that if α(i) ≥ α(j), then with high probability Stji is never large enough
for Gt+1 to include (i, j). In this sense, with high probability, Gt is consistent
with the order on [`] induced by sorting in decreasing order with respect to α(·).
Note that Gt is generally not a partial order because it need not be transitive.
Illustration
Suppose ` = 5 and m = 4, and in round t the relation is Gt = {(3, 1), (5, 2), (5, 3)},
which is represented in the graph below, where an arrow from j to i indicates
that (j, i) ∈ Gt .
This means that in round t the first three positions in the ranking will contain
items from Pt1 = {1, 2, 4} but with random order. The fourth position will be
item 3, and item 5 is not shown to the user.
32.3 Regret Analysis 393
Part (a) of Assumption 32.1 means that items in position k > m are never
clicked. As a consequence, the algorithm never needs to actually compute
the partitions Ptd for which min Itd > m because items in these partitions
are never shortlisted.
Theorem 32.2. Let v satisfy Assumption 32.1, and assume that α(1) > α(2) >
· · · > α(`). Let ∆ij = α(i) − α(j) and δ ∈ (0, 1). Then the regret of TopRank is
bounded by
√
X̀ min{m,j−1}
X 6(α(i) + α(j)) log c δ n
Rn ≤ δnm`2 + 1 + .
j=1 i=1
∆ ij
s √
2 c n
Furthermore, Rn ≤ δnm` + m` + 4m3 `n log .
δ
By choosing δ = n−1 the theorem shows that the expected regret is at most
X̀ min{m,j−1}
X α(i) log(n) p
Rn = O and Rn = O m3 `n log(n) .
j=1 i=1
∆ij
The algorithm does not make use of any assumed ordering on α(·), so the
assumption is only used to allow for a simple expression for the regret. The core
idea of the proof is to show that (a) if the algorithm is suffering regret as a
consequence of misplacing an item, then it is gaining information so that Gt will
get larger and, (b) once Gt is sufficiently rich, the algorithm is playing optimally.
Let Ft = σ(A1 , C1 , . . . , At , Ct ) and Pt (·) = P(· | Ft ) and Et [·] = E[· | Ft ]. For each
t ∈ [n], let Ft be the failure event that there exists i = 6 j ∈ [`] and s < t such that
Nsij > 0 and
q
Xs p
Ssij − Eu−1 [Uuij | Uuij 6= 0] |Uuij | ≥ 2Nsij log(c Nsij /δ) .
u=1
Lemma 32.3. Let i and j satisfy α(i) ≥ α(j) and d ≥ 1. On the event that
i, j ∈ Psd and d ∈ [Ms ] and Usij 6= 0, the following hold almost surely:
∆ij
(a) Es−1 [Usij | Usij 6= 0] ≥ .
α(i) + α(j)
(b) Es−1 [Usji | Usji 6 0] ≤ 0 .
=
Proof For the remainder of the proof, we focus on the event that i, j ∈ Psd and
d ∈ [Ms ] and Usij =
6 0. We also discard the measure zero subset of this event where
32.3 Regret Analysis 394
Ps−1 (Usij =
6 0) = 0. From now on, we omit the ‘almost surely’ qualification on
conditional expectations. Under these circumstances, the definition of conditional
expectation shows that
Ps−1 (Csi = 1, Csj = 0) − Ps−1 (Csi = 0, Csj = 1)
Es−1 [Usij | Usij 6= 0] =
Ps−1 (Csi 6= Csj )
Ps−1 (Csi = 1) − Ps−1 (Csj = 1)
=
Ps−1 (Csi 6= Csj )
Ps−1 (Csi = 1) − Ps−1 (Csj = 1)
≥
Ps−1 (Csi = 1) + Ps−1 (Csj = 1)
Es−1 [v(As , As−1 (i)) − v(As , A−1
s (j))]
= , (32.2)
Es−1 [v(As , As (i)) + v(As , As (j))]
−1 −1
where in the second equality we added and subtracted Ps−1 (Csi = 1, Csj = 1).
By the design of TopRank, the items in Ptd are placed into slots Itd uniformly at
random. Let σ be the permutation that exchanges the positions of items i and j.
Then using Part (c) of Assumption 32.1,
X
Es−1 [v(As , A−1
s (i))] = Ps−1 (As = a)v(a, a−1 (i))
a∈A
α(i) X
≥ Ps−1 (As = a)v(σ ◦ a, a−1 (i))
α(j)
a∈A
α(i) X
= Ps−1 (As = σ ◦ a)v(σ ◦ a, (σ ◦ a)−1 (j))
α(j)
a∈A
α(i)
= Es−1 [v(As , A−1
s (j))] ,
α(j)
where the second equality follows from the fact that a−1 (i) = (σ ◦ a)−1 (j) and the
definition of the algorithm ensuring that Ps−1 (As = a) = Ps−1 (As = σ ◦ a). The
last equality follows from the fact that σ is a bijection. Using this and continuing
the calculation in Eq. (32.2) shows that
s (i)) − v(As , As (j))
Es−1 v(As , A−1 −1
Eq. (32.2) =
s (i)) + v(As , As (j))
Es−1 v(As , A−1 −1
2
=1−
1 + Es−1 v(As , As (i)) /Es−1 v(As , A−1
−1
s (j))
2
≥1−
1 + α(i)/α(j)
α(i) − α(j) ∆ij
= = .
α(i) + α(j) α(i) + α(j)
The second part follows from the first since Usji = −Usij .
The next lemma shows that the failure event occurs with low probability.
Proof The proof follows immediately from Lemma 32.3, the definition of Fn , the
union bound over all pairs of actions, and a modification of the Azuma–Hoeffding
inequality in Exercise 20.10.
Proof Let i < j so that α(i) ≥ α(j). On the event Ftc , either Nsji = 0 or
s r cp
X
Ssji − Eu−1 [Uuji | Uuji 6= 0]|Uuji | < 2Nsji log Nsji for all s < t .
u=1
δ
When i and j are in different blocks in round u < t, then Uuji = 0 by definition.
On the other hand, when i and j are in the same block, Eu−1 [Uuji | Uuji 6= 0] ≤ 0
almost surely by Lemma 32.3. Based on these observations,
r cp
Ssji < 2Nsji log Nsji for all s < t ,
δ
there must exist a sequence of items id , . . . , ic in blocks Ptd , . . . , Ptc such that
id < · · · < ic = i∗ . From the definition of Itd ∗
, Itd
∗
≤ id < i∗ . This concludes our
proof.
Lemma 32.7. On the event Fnc and for all i < j, it holds that
√
6(α(i) + α(j)) c n
Snij ≤ 1 + log .
∆ij δ
Proof The result is trivial when Nnij = 0. Assume from now on that Nnij > 0.
By the definition of the algorithm, arms i and j are not in the same block once
Stij grows too large relative to Ntij , which means that
r cp
Snij ≤ 1 + 2Nnij log Nnij .
δ
On the event Fnc and part (a) of Lemma 32.3, it also follows that
r cp
∆ij Nnij
Snij ≥ − 2Nnij log Nnij .
α(i) + α(j) δ
32.3 Regret Analysis 396
r cp r cp
∆ij Nnij
− 2Nnij log Nnij ≤ Snij ≤ 1 + 2Nnij log Nnij
α(i) + α(j) δ δ
r cp
√
≤ (1 + 2) Nnij log Nnij . (32.3)
δ
Using the fact that Nnij ≤ n and rearranging the terms in the previous display
shows that
√ √
(1 + 2 2)2 (α(i) + α(j))2 c n
Nnij ≤ log .
∆2ij δ
Proof of Theorem 32.2 The first step in the proof is an upper bound on the
expected number of clicks in the optimal list a∗ . Fix time t, block Ptd and recall
that Itd∗
= min Ptd is the most attractive item in Ptd . Let k = A−1t (Itd ) be the
∗
Assumption 32.1, we have v(At , k) ≥ v(σ ◦ At , k) ≥ v(a∗ , k). Hence, on the event
Ftc , the expected number of clicks on Itd
∗
is bounded from below by those on
items in a ,
∗
h i X
Et−1 CtItd
∗ = Pt−1 (A−1
t (Itd ) = k)Et−1 [v(At , k) | At (Itd ) = k]
∗ −1 ∗
k∈Itd
1 X 1 X
= Et−1 [v(At , k) | A−1
t (Itd ) = k] ≥
∗
v(a∗ , k) ,
|Itd | |Itd |
k∈Itd k∈Itd
where we also used the fact that TopRank randomises within each block to
guarantee that Pt−1 (A−1
t (Itd ) = k) = 1/|Itd | for any k ∈ Itd . Using this and the
∗
design of TopRank,
m
X Mt X
X Mt
X h i
v(a∗ , k) = v(a∗ , k) ≤ |Itd |Et−1 CtItd
∗ .
k=1 d=1 k∈Itd d=1
Therefore, under event Ftc , the conditional expected regret in round t is bounded
32.4 Notes 397
by
m
X X̀ XMt X̀
v(a∗ , k) − Et−1 Ctj ≤ Et−1 |Ptd |CtItd
∗ − Ctj
k=1 j=1 d=1 j=1
XMt X
= Et−1 (CtItd
∗ − Ctj )
d=1 j∈Ptd
Mt
X X
= Et−1 [UtItd
∗ j]
d=1 j∈Ptd
X̀ min{m,j−1}
X
≤ Et−1 [Utij ] . (32.4)
j=1 i=1
Pmin{m,j−1}
The last inequality follows by noting that Et−1 [UtItd
∗ j] ≤
i=1 Et−1 [Utij ].
To see this, use part (a) of Lemma 32.3 to show that Et−1 [Utij ] ≥ 0 for i < j and
Lemma 32.6 to show that when Itd ∗
> m, then neither Itd∗
nor j are not shown to
the user in round t so that UtItd
∗ j = 0. Substituting the bound in Eq. (32.4) into
X̀ min{m,j−1}
X
Rn ≤ nmP(Fn ) + E [I {Fnc } Snij ] , (32.5)
j=1 i=1
where we used the fact that the maximum number of clicks over n rounds is
nm. The proof of the first part is completed by using Lemma 32.4 to bound
the first term and Lemma 32.7 to bound the second. The problem-independent
bound follows from Eq. (32.5) and by stopping early in the proof of Lemma 32.7
(Exercise 32.6).
32.4 Notes
1 At no point in the analysis did we use the fact that v is fixed over time. Suppose
that v1 , . . . , vn are a sequence of click-probability functions that all satisfy
Assumption 32.1 with the same attractiveness function. The regret in this
setting is
n Xm
" n #
X X X̀
Rn = vt (a∗ , k) − E Cti .
t=1 k=1 t=1 i=1
Then the bounds in Theorem 32.2 still hold without changing the algorithm.
2 The cascade model is usually formalised in the following more restrictive fashion.
Let {Zti : i ∈ [`], t ∈ [n]} be a collection of independent Bernoulli random
variables with P (Zti = 1) = α(i). Then define Mt as the first item i in the
32.4 Notes 398
where the minimum of an empty set is ∞. Finally let Cti = 1 if and only if
Mt ≤ m and At (Mt ) = i. This set-up satisfies Eq. (32.1), but the independence
assumption makes it possible to estimate α without randomisation. Notice that
in any round t with Mt ≤ m, all items i with A−1 t (i) < Mt must have been
unattractive (Zti = 0), while the clicked item must be attractive (Zti = 1).
This fact can be used in combination with standard concentration analysis to
estimate the attractiveness. The optimistic policy sorts the ` items in decreasing
order by their upper confidence bounds and shortlists the first m. When the
confidence bounds are derived from Hoeffding’s inequality , this policy is called
CascadeUCB, while the policy that uses Chernoff’s lemma is called CascadeKL-
UCB. The computational cost of the latter policy is marginally higher than
the former, but the improvement is also quite significant because in practice
most items have barely positive attractiveness.
3 The linear dependence of the regret on ` is unpleasant when the number of
items is large, which is the case in many practical problems. Like for finite-
armed bandits, one can introduce a linear structure on the items by assuming
that α(i) = hθ, φi i where θ ∈ Rd is an unknown parameter vector and (φi )`i=1
are known feature vectors. This has been investigated in the cascade model by
Zong et al. [2016] and with a model resembling that of this chapter by Li et al.
[2019a].
4 There is an adversarial variant of the cascade model. In the ranked bandit
model an adversary secretly chooses a sequence of sets S1 , . . . , Sn , with St ⊆ [`].
In each round t the learner chooses At ∈ A and receives a reward Xt (At ), where
Xt : A → [0, 1] is given by Xt (a) = I {St ∩ {a(1), . . . , a(k)} =
6 ∅}. The feedback
is the position of the clicked action, which is Mt = min{k ∈ [m] : At (k) ∈ St }.
The regret is
n
X
Rn = (Xt (a∗ ) − Xt (At )) ,
t=1
Notice that this is the same as the cascade model when St = {i : Zti = 1}.
5 A challenge in the ranked bandit model is that solving the offline problem (Eq.
32.6) for known S1 , . . . , Sn is NP-hard. How can one learn when finding an
optimal solution to the offline problem is hard? First, hardness only matters if
|A| is large. When ` and m are not too large, then exhaustive search is quite
feasible. If this is not an option, one may use an approximation algorithm.
It turns out that in a certain sense, the best one can do is to use a greedy
32.5 Bibliographic Remarks 399
algorithm, We omit the details, but the highlight is that there exist efficient
algorithms such that
" n #
X 1 Xn p
E Xt (At ) ≥ 1 − max Xt (a) − O m n` log(`) .
t=1
e a∈A t=1
The feedback is the positions of the clicked items, St ∩ {a(1), . . . , a(k)}. For
this model, there are no computation issues. In fact, the problem can be
analysed using a reduction to combinatorial semi-bandits, which we ask you to
investigate in Exercise 32.3.
7 The position-based model can also be modelled in the adversarial setting by
letting Stk ⊂ [`] for each t ∈ [n] and k ∈ [m]. Then, defining the reward by
m
X
Xt (a) = I {At (k) ∈ Stk } .
k=1
Again, the feedback is the positions of the clicked items, {k ∈ [m] : At (k) ∈ Stk }.
This model can also be tackled using algorithms for combinatorial semi-bandits
(Exercise 32.4).
The policy and analysis presented in this chapter is by the authors and others
[Lattimore et al., 2018]. The most related work is by Zoghi et al. [2017], who
assumed a factorisation of the click probabilities v(a, k) = α(a(k))χ(a, k) and then
made assumptions on χ. The assumptions made here are slightly less restrictive,
and the bounds are simultaneously stronger. Some experimental results comparing
these algorithms are given by Lattimore et al. [2018]. For more information on
click models, we recommend the survey paper by Chuklin et al. [2015] and the
article by Craswell et al. [2008]. Cascading bandits were first studied by Kveton
et al. [2015a], who proposed algorithms based on UCB and KL-UCB and prove
finite-time instance-dependence upper bounds and asymptotic lower bounds that
match in specific regimes. Around the same time, Combes et al. [2015a] proposed
a different algorithm for the same model that is also asymptotically optimal. The
optimal regret has a complicated form and is not given explicitly in all generality.
We remarked in the notes that the linear dependence on ` is problematic for
large `. To overcome this problem, Zong et al. [2016] introduce a linear variant
where the attractiveness of an item is assumed to be an inner product between an
32.6 Exercises 400
32.6 Exercises
The optimality criterion Radlinski et al. [2008] had in mind is to present at least
one item that the user is attracted to. Do you find this argument convincing?
Why or why not?
32.6 Exercises 401
The probabilistic ranking principle was put forward by Maron and Kuhns
[1960]. The paper by Robertson [1977] identifies some sufficient conditions
under which the principle is valid and also discusses its limitations.
32.5 (Cycles in partial order) Prove that if Gt does not contain cycles, then
Mt defined in Section 32.2 is well defined and that Pt1 , . . . , PtMt is a partition of
[`].
1: for t = 1, . . . , n do
2: Choose At = 1 + (t mod k)
3: end for
4: Choose An+1 = argmaxi∈[k] µ̂i (n)
Proof Let ∆i = ∆i (ν) and P = Pνπ . Assume without loss of generality that
∆1 = 0, and let i be a suboptimal arm with ∆i > ∆. Observe that An+1 = i
implies that µ̂i (n) ≥ µ̂1 (n). Now Ti (n) ≥ bn/kc is not random, so by Theorem 5.3
and Lemma 5.4,
bn/kc ∆2i
P (µ̂i (n) ≥ µ̂1 (n)) = P (µ̂i (n) − µ̂1 (n) ≥ 0) ≤ exp − . (33.1)
4
The proof is completed by substituting Eq. (33.1) and taking the minimum over
all ∆ ≥ 0.
The theorem highlights some important differences between the simple regret
and the cumulative regret. If ν is fixed and n tends to infinity, then the simple
regret converges to zero exponentially fast. On the other hand, if n is fixed and ν
is allowed to vary, then we are in a worst-case regime.
p Theorem 33.1 can be used
to derive a bound in this case by choosing ∆ = 2 log(k)/ bn/kc, which after a
short algebraic calculation shows that for n ≥ k there exists a universal constant
C > 0 such that
r
k log(k)
Rn (UE, ν) ≤ C
simple
for all ν ∈ ESG
k
(1) . (33.2)
n
In Exercise 33.1 we ask you to use the techniques of Chapter 15 to prove pthat for
all policies there exists a bandit ν ∈ EN (1) such that Rn (π, ν) ≥ C k/n for
k simple
1X
n
πn+1 (i | a1 , x1 , . . . , an , xn ) = I {at = i} .
n t=1
33.2 Best-Arm Identification with a Fixed Confidence 404
Rn (π, ν)
Rnsimple ((πt )n+1
t=1 , ν) = ,
n
where Rn (π, ν) is the cumulative regret of policy π = (πt )nt=1 on bandit ν.
Proof By the regret decomposition identity (4.5),
" k #
X Ti (n)
Rn (π, ν) = nE ∆i = nE ∆An+1 = nRnsimple ((πt )n+1
t=1 , ν) ,
i=1
n
where the first equality follows from the definition of the cumulative regret, the
third from the definition of πn+1 and the last from the definition of the simple
regret.
Corollary 33.3. For all n there exists a policypπ such that for all ν ∈ ESG
k
(1)
with ∆(ν) ∈ [0, 1] it holds that Rn (π, ν) ≤ C k/n, where C is a universal
k simple
constant.
Proof Combine the previous result with Theorem 9.1.
Proposition 33.2 raises our hopes that policies designed for minimising the
cumulative regret might also have well-behaved simple regret. Unfortunately this
is only true in the intermediate regimes where the best arm is hard to identify.
Policies with small cumulative regret spend most of their time playing the optimal
arm and play suboptimal arms just barely enough to ensure they are not optimal.
In pure exploration this leads to a highly suboptimal policy for which the simple
regret is asymptotically polynomial, while we know from Theorem 33.1 that the
simple regret should decrease exponentially fast. More details and pointers to
the literature are given in Note 2 at the end of the chapter.
learner halts and ψ ∈ [k] is the recommended action, which by the measurability
assumption only depends on (A1 , X1 , . . . , Aτ , Xτ ). Note that in line with our
definition of stopping times (see Definition 3.6), it is possible that τ = ∞, which
just means the learner cannot ever make up their mind to stop. This behaviour
of a learner, of course, will not be encouraged! The function ψ is called the
selection rule.
Definition 33.4. A triple (π, τ, ψ) is sound at confidence level δ ∈ (0, 1) for
environment class E if for all ν ∈ E,
Pνπ (τ < ∞ and ∆ψ (ν) > 0) ≤ δ . (33.3)
The objective in fixed confidence best-arm identification is to find a sound
learner for which Eνπ [τ ] is minimised over environments ν ∈ E. Since this is a
multi-objective criteria, there is a priori no reason to believe that a single optimal
learner should exist. Conveniently, however, the condition that the learner must
satisfy Eq. (33.3) plays the role of the consistency assumption in the asymptotic
lower bounds in Chapter 16, which allows for a sense of instance-dependent
asymptotic optimality. The situation in finite time is more complicated, as we
discuss in Note 7.
If E is sufficiently rich and ν has multiple optimal arms, then no sound learner
can stop in finite time with positive probability. The reason is that there is
no way to reject the hypothesis that one optimal arm is fractionally better
than another. You will investigate this in Exercise 33.10. Also note that
in our definition, I {τ = t} is a deterministic function of A1 , X1 , . . . , At , Xt .
None of the results that follow would change if you allowed τ or ψ to also
depend on some exogenous source of randomness.
k
!!
X
c (ν) = sup
∗ −1
inf αi D(νi , νi )
0
(33.4)
α∈Pk−1 0 ν ∈Ealt (ν)
i=1
Proof The result is trivial when Eνπ [τ ] = ∞. For the remainder, assume that
Eνπ [τ ] < ∞, which implies that Pνπ (τ = ∞) = 0. Next, let ν 0 ∈ Ealt (ν) and
define event E = {τ < ∞ and ψ ∈ / i∗ (ν 0 )} ∈ Fτ . Then,
2δ ≥ Pνπ (τ < ∞ and ψ ∈
/ i∗ (ν)) + Pν 0 π (τ < ∞ and ψ ∈
/ i∗ (ν 0 ))
≥ Pνπ (E c ) + Pν 0 π (E)
!
1 Xk
≥ exp − Eνπ [Ti (τ )] D(νi , νi ) ,
0
(33.5)
2 i=1
where the first inequality follows from the definition of soundness and the last
from the Bretagnolle–Huber inequality (Theorem 14.2) and the stopping time
version of Lemma 15.1 (see Exercise 15.7). The second inequality holds because
Pνπ (τ = ∞) = 0 and i∗ (ν) ∩ i∗ (ν 0 ) = ∅ and
E c = {τ = ∞} ∪ {τ < ∞ and ψ ∈ i∗ (ν 0 )}
⊆ {τ = ∞} ∪ {τ < ∞ and ψ ∈
/ i∗ (ν)} .
Rearranging Eq. (33.5) shows that
1
k
X
Eνπ [Ti (τ )] D(νi , νi0 ) ≥ log , (33.6)
i=1
4δ
which implies that Eνπ [τ ] > 0. Using this, the definition of c∗ (ν) and Eq. (33.6),
Eνπ [τ ] Xk
= Eνπ [τ ] sup 0 inf αi D(νi , νi0 )
c∗ (ν) α∈Pk−1 ν ∈Ealt (ν) i=1
Eνπ [Ti (τ )]
k
X
≥ Eνπ [τ ] inf D(νi , νi0 ) (33.7)
ν 0 ∈Ealt (ν)
i=1
Eνπ [τ ]
k
X
= inf Eνπ [Ti (τ )] D(νi , νi0 )
ν 0 ∈Ealt (ν)
i=1
1
≥ log ,
4δ
where the last inequality follows from Eq. (33.6). Rearranging completes the proof.
Note, in the special case that c∗ (ν)−1 = 0, the assumption that Eνπ [τ ] < ∞
would lead to a contradiction.
Before this, we devote a little time to understanding the constant c∗ (ν). Suppose
that α∗ (ν) ∈ Pk−1 satisfies
k
X
c∗ (ν)−1 = inf αi∗ (ν) D(νi , νi0 ) .
ν 0 ∈Ealt (ν)
i=1
We need to construct a triple (π, τ, ψ) that is sound for E and for which Eνπ [τ ]
matches the lower bound in Theorem 33.5 as δ → 0. Both are derived using
the insights provided by the lower bound. The policy should choose action i in
proportion to αi∗ (ν), which must be estimated from data. The stopping rule is
motivated by noting that Eq. (33.6) implies that a sound stopping rule must
satisfy
1
X k
Eνπ [Ti (τ )] D(νi , νi0 ) ≥ log for all ν 0 ∈ Ealt (ν) .
i=1
4δ
If the inequality is tight, then we might guess that a reasonable stopping rule as
the first round t when
1
Xk
inf Ti (t) D(ν ,
i iν 0
) & log .
ν ∈Ealt (ν)
0 δ
i=1
There are two problems: (a) ν is unknown, so the expression cannot be evaluated,
and (b) we have replaced the expected number of pulls with the actual number of
pulls. Still, let us persevere. To deal with the first problem, we can try replacing
ν by the Gaussian bandit environment with mean vector µ̂(t), which we denote
by ν̂(t). Then let
1
k
X Xk
Zt = inf Ti (t) D(ν̂i (t), νi0 ) = inf Ti (t)(µ̂i (t) − µi (ν 0 ))2 .
ν 0 ∈Ealt (ν̂(t))
i=1
2 µ0 ∈Ealt (µ̂(t)) i=1
We will show there exists a choice of βt (δ) ≈ log(t/δ) such that if τ = min{t : Zt >
βt (δ)}, then the empirically optimal arm at time τ is the best arm with probability
at least 1 − δ. The next step is to craft a policy for which the expectation of τ
matches the lower bound asymptotically. As we remarked earlier, if the policy
is to match the lower bound, it should play arm i approximately in proportion
to αi∗ (ν). This suggests estimating α∗ (ν) by α̂(t) = α∗ (ν̂(t)) and then playing
the arm for which tα̂i (t) − Ti (t) is maximised. If α̂(t) is inaccurate, then perhaps
the samples collected will not allow the algorithm to improve its estimates. To
overcome this last challenge, the policy includes enough forced exploration to
ensure that eventually α̂(t) converges to α∗ (ν) with high probability. Combining
all these ideas leads to the track-and-stop policy (Algorithm 21).
Theorem 33.6. Let (π, τ, ψ) be the policy, stopping time and selection rule of
track-and-stop (Algorithm 21). There exists a choice of βt (δ) such that track-and-
stop is sound and for all ν ∈ E with |i∗ (ν)| = 1 it holds that
Eνπ [τ ]
lim = c∗ (ν) .
δ→0 log(1/δ)
33.2 Best-Arm Identification with a Fixed Confidence 409
Note that only π does not depend on δ inside the limit statement of the theorem,
but the stopping time does. The following lemma guarantees the soundness of
(π, τ, ψ).
Lemma 33.7. Let f : [k, ∞) → R be given by f (x) = exp(k − x)(x/k)k and
βt (δ) = k log(t2 + t) + f −1 (δ). Then, for τ = min{t : Zt ≥ βt (δ)}, it holds that
P (i∗ (ν̂(τ )) 6= i∗ (ν)) ≤ δ.
Proof of Lemma 33.7 Notice that |i∗ (ν̂(t))| > 1 implies that Zt = 0. Hence
|i∗ (ν̂(τ ))| = 1 for τ < ∞, and the selection rule is well defined. Abbreviate
µ = µ(ν) and ∆ = ∆(ν), and assume without loss of generality that ∆1 = 0. By
the definition of τ and Zt ,
( k )
1X 2
{ν ∈ Ealt (ν̂(τ ))} ⊆ Ti (τ )(µ̂i (τ ) − µi ) ≥ βτ (δ) .
2 i=1
Then apply Lemma 33.8 and Proposition 33.9 from Section 33.2.3.
A candidate for βt (δ) can be extracted from the proof and satisfies βt (δ) ≈
2k log t + log(1/δ). This can be improved to approximately k log log(t) + log(1/δ)
33.2 Best-Arm Identification with a Fixed Confidence 410
by using a law of the iterated logarithm bound instead of Lemma 33.8. Below,
we sketch the proof of Theorem 33.6. A more complete outline is given in
Exercise 33.6.
Proof sketch of Theorem 33.6 Lemma 33.7 shows that (π, τ ) are sound. It
remains to control the expectation of the stopping time. The intuition is
straightforward. As more samples are collected, we expect that α̂(t) ≈ α∗ (ν) and
µ̂ ≈ µ and
33.2.3 Concentration
The first concentration theorem follows from Corollary 5.5 and a union bound.
Proof For i ∈ [k], let Wi = max{w ∈ [0, 1] : Sis < g(s)+log(1/w) for all s ∈ N},
where we define log(1/0) = ∞. Note that Wi are well defined. Then, for any
s ∈ Nk ,
k k k k
! k
X X X X X
Sisi ≤ g(si ) + log(1/Wi ) ≤ kg si + log(1/Wi ) .
i=1 i=1 i=1 i=1 i=1
By assumption, (Wi )ki=1 are independent and satisfy P (Wi ≤ x) ≤ x for all
x ∈ [0, 1]. The proof is completed using the result of Exercise 5.16.
1: Input n and k
2: Set L = dlog2 (k)e and A1 = [k].
3: for ` = 1, . . .j, L dok
4: Let T` = L|A n
`|
.
5: Choose each arm in A` exactly T` times
6: For each i ∈ A` compute µ̂`i as the empirical mean of arm i based on the
last T` samples
7: Let A`+1 contain the top d|A` |/2e arms in A`
8: end for
9: return An+1 as the arm in AL+1
Algorithm 22: Sequential halving.
The assumption on the ordering of the means is only needed for the clean
definition of H2 , which would otherwise be defined by permuting the arms. The
algorithm is completely symmetric. In Exercise 33.8 we guide you through the
proof of Theorem 33.10.
The quantity H2 (µ) looks a bit unusual, but arises naturally in the
analysis. It is related to a more familiar quantity as follows. Define H1 (µ) =
Pk 2 2
i=1 min{1/∆i , 1/∆min }. Then
33.4 Notes
2 We mentioned that algorithms with logarithmic cumulative regret are not well
suited for pure exploration. Suppose π has asymptotically optimal cumulative
regret on E = ENk
, which means that limn→∞ Eνπ [Ti (n)]/ log(n) = 2/∆i (ν) for
all ν ∈ E. You will show in Exercise 33.5 that for any ε > 0, there exists a
ν ∈ E with a unique optimal arm such that
− log (Pνπ (An+1 6∈ i∗ (ν)))
lim inf ≤ 1 + ε.
n→∞ log(n)
This shows that using an asymptotically optimal policy for cumulative regret
minimisation leads to a best-arm identification policy for which the probability
of selecting a suboptimal arm decays only polynomially with n. This result
holds no matter how An+1 is selected.
3 A related observation is that the empirical estimates of the means after running
an algorithm designed for minimising the cumulative regret tend to be negatively
biased. This occurs because these algorithms play arms until their empirical
means are sufficiently small.
4 Although there is no exploration/exploitation dilemma in the pure exploration
setting, there is still an ‘exploration dilemma’ in the sense that the optimal
exploration policy depends on an unknown quantity. This means the policy
must balance (to some extent) the number of samples dedicated to learning
how to explore relative to those actually exploring.
5 Best-arm identification is a popular topic that lends itself to simple analysis
and algorithms. The focus on the correct identification of an optimal arm
makes us question the practicality of the setting, however. In reality, any
suboptimal arm is acceptable provided its suboptimality gap is small enough
relative to the budget, which is more faithfully captured by the simple
regret criterion. Of course the simple regret may be bounded naively by
Rnsimple ≤ maxi ∆i P ∆An+1 > 0 , which is tight in some circumstances and
loose in others.
6 An equivalent form of the bound shown in Theorem 33.5 is
( k k
)
X X
Eνπ [τ ] ≥ min αi : α1 , . . . , αk ≥ 0, inf αi D(νi , νi ) ≥ log(4/δ) .
0
ν∈Ealt (ν)
i=1 i=1
This form follows immediately from Eq. (33.6) by noting that Eνπ [τ ] =
P
i Eνπ [Ti (τ )]. The version given in the theorem is preferred because it is
a closed form expression. Exercise 33.3 asks you to explore the relation between
the two forms.
7 The forced exploration in the track-and-stop algorithm is sufficient for
asymptotic optimality. We are uneasy about the fact that the proof would work√
for any threshold Ctp with p ∈ (0, 1). There is nothing fundamental about t.
We do not currently know of a principled way to tune the amount of forced
exploration or if there is better algorithm design for best-arm identification.
Ideally one should provided finite-time upper bounds that match the finite-time
33.5 Bibliographical Remarks 414
lower bound provided by Theorem 33.5. The extent to which this is possible
appears to be an open question.
8 The choice of βt (δ) significantly influences the practical performance of track-
and-stop. We believe the analysis given here is mostly tight except that the
naive concentration bound given in Lemma 33.8 can be improved using a
finite-time version of the law of the iterated logarithm (see Exercise 20.9, for
example).
9 Perhaps the most practical set-up in pure exploration has not yet received any
attention, which is upper and lower instance-dependent bounds on the simple
regret. Even better would be to have an understanding of the distribution of
∆An+1 .
In the machine learning literature, pure exploration for bandits seems to have
been first studied by Even-Dar et al. [2002], Mannor and Tsitsiklis [2004] and
Even-Dar et al. [2006] in the ‘Probability Approximately Correct’ setting, where
the objective is to find an ε-optimal arm with high probability with as few samples
as possible. After a dry spell, the field was restarted by Bubeck et al. [2009] and
Audibert and Bubeck [2010b]. The asymptotically optimal algorithm for the fixed
confidence setting of Section 33.2 was introduced by Garivier and Kaufmann
[2016], who also provide results for exponential families as well as in-depth
intuition and historical background. Degenne and Koolen [2019] and Degenne
et al. [2019] have injected some new ideas into the basic principles of track-and-
stop by incorporating a kind of optimism and solving the optimisation problem
incrementally using online learning, which leads to theoretical and practical
improvements. A similar problem is studied in a Bayesian setting by Russo
[2016], who focuses on designing algorithms for which the posterior probability of
choosing a suboptimal arm converges to zero exponentially fast with an optimal
rate. Even more recently, Qin et al. [2017] designed a policy that is optimal in both
the frequentist and Bayesian settings. The stopping rule used by Garivier and
Kaufmann [2016] is inspired by similar rules by Chernoff [1959]. The sequential
halving algorithm is by Karnin et al. [2013], and the best summary of lower
bounds is by Carpentier and Locatelli [2016]. Besides this there have been many
other approaches, with a summary by Jamieson and Nowak [2014]. The negative
result discussed in Note 2 is due to Bubeck et al. [2009]. Pure exploration has
recently become a hot topic and is expanding beyond the finite-armed case. For
example, to linear bandits [Soare et al., 2014] and continuous-armed bandits
[Valko et al., 2013a], tree search [Garivier et al., 2016a, Huang et al., 2017a] and
combinatorial bandits [Chen et al., 2014, Huang et al., 2018].
The continuous-armed case is also known as zeroth-order (or derivative-
free) stochastic optimisation and is studied under various assumptions on
the unknown reward function, usually assuming that A ⊂ Rd . Because of the
33.5 Bibliographical Remarks 415
confidence parameter approaches one. We also like the short and readable review
of the literature up to the 1980s from the perspective of simulation optimisation
by Goldsman [1983].
A related setting studied mostly in the operations research community is
ordinal optimisation. In its simplest form, ordinal optimisation is concerned
with finding an arm amongst the αk arms with the highest pay-offs. Ho et al. [1992],
who defined this problem in the stochastic simulation optimisation literature,
emphasised that the probability of failing to find one of the ‘good arms’ decays
exponentially with the number of observations n per arm, in contrast to the
slow n−1/2 decay of the error of estimating the value of the best arm, which
this literature calls the problem of cardinal optimisation. Given the results
in this chapter, this should not be too surprising. A nice twist in this literature
is that the error probability does not need to depend on k (see Exercise 33.9).
The price, of course, is that the simple regret is in general uncontrolled. In a
way, ordinal optimisation is a natural generalisation of best-arm identification.
As such, it also leads to algorithmic choices that are not the best fit when the
actual goal is to keep the simple regret small. Based on a Bayesian reasoning, a
heuristic expression for the asymptotically optimal allocation of samples for the
Gaussian best-arm identification problem is given by Chen et al. [2000]. They
call the problem of finding an optimal allocation the ‘optimal computing budget
allocation’ (OCBA) problem. Their work can be viewed as the precursor to the
results in Section 33.2. Glynn and Juneja [2015] gives further pointers to this
literature, while connecting it to the bandit literature.
Best-arm identification has also been considered in the adversarial setting
[Jamieson and Talwalkar, 2016, Li et al., 2018, Abbasi-Yadkori et al., 2018].
Another related setting is called the max-armed bandit problem, where the
objective is to obtain the largest possible single reward over n rounds [Cicirello
and Smith, 2005, Streeter and Smith, 2006a,b, Carpentier and Valko, 2014, Achab
et al., 2017].
33.6 Exercises
33.1 (Simple regret lower bound) Show there exists a universal constant
C > 0 such that for all
p n ≥ k > 1 and all policies π, there exists a ν ∈ EN such
k
Rn (UE, ν) ≥ C k log(k)/n.
simple
33.6 Exercises 417
33.3 Let L > 0 and D ⊂ [0, ∞)k \ {0} be non-empty. Show that
!−1
inf kαk1 : α ∈ [0, ∞) , inf hα, di ≥ L =
k
sup inf hα, di L.
d∈D α∈Pk−1 d∈D
1 α1 αi ∆2i
k
X
inf αi D(νi , ν̃i ) = min .
ν̃∈Ealt (ν)
i=1
2 i>1 α1 σi2 + αi σ12
(e) Show that if σi2 /∆2i = σ12 /∆2min for all i, then equality holds in Eq. (33.10).
(a) For any ε > 0, prove there exists a ν ∈ E with a unique optimal arm such
that
− log(Pνπ (∆An+1 > 0))
lim inf ≤ 1 + ε.
n→∞ log(n)
(b) Can you prove the same result with lim inf replaced by lim sup?
(c) What happens if the assumption that π is asymptotically optimal is replaced
with the assumption that there exists a universal constant C > 0 such that
X log(n)
Rn (π, ν) ≤ C ∆i (ν) + .
∆i (ν)
i:∆i (ν)>0
metric space via the metric d(ν1 , ν2 ) = kµ(ν1 ) − µ(ν2 )k∞ . Let ε > 0 be a small
constant, and define random times
τν (ε) = 1 + max {t : d(ν̂t , ν) ≥ ε}
τα (ε) = 1 + max {t : kα∗ (ν) − α∗ (ν̂t )k∞ ≥ ε}
τT (ε) = 1 + max {t : kT (t)/t − αi∗ (ν)k∞ ≥ ε} .
Note, these are not stopping times. Do the following:
(a) Show that α∗ (ν) is unique.
(b) Show α∗ is continuous at ν.
(c) Prove that E[τν (ε)] < ∞ for all ε > 0.
(d) Prove that E[τα (ε)] < ∞ for all ε > 0.
(e) Prove that E[τT (ε)] < ∞ for all ε > 0.
(f) Prove that limδ→0 E[τ ]/ log(1/δ) ≤ c∗ (ν).
be the top m arms in A. To make life easier, you may also assume that k is a
power of two so that |A` | = k21−` and T` = n2`−1 / log2 (k).
(a) Prove that |AL+1 | = 1.
(b) Let i be a suboptimal arm in A` , and suppose that 1 ∈ A` . Show that
T` ∆2i
P µ̂1 ≤ µ̂i i ∈ A` , 1 ∈ A` ≤ exp −
` `
.
4
(c) Let A0` = A` \ TopM(A` , d|A` |/4e) be the bottom three-quarters of the arms
in round `. Show that if the optimal arm is eliminated after the `th phase,
then
X 1
N` = I µ̂`i ≥ µ̂`1 ≥ |A0` | .
0
3
i∈A`
(e) Combine the previous two parts with Markov’s inequality to show that
T ∆2i`
P (1 ∈
/ A`+1 | 1 ∈ A` ) ≤ 3 exp − .
16 log2 (k)i`
(f) Join the dots to prove Theorem 33.10.
Hint Part (b) of the above exercise is a challenging problem. The simplest
approach is to use an elimination algorithm that operates in phases where at
the end of each phase, the bottom half of the arms (in terms of their empirical
estimates) are eliminated. For details, see the paper by Even-Dar et al. [2002].
34 Foundations of Bayesian Learning
1 π1 , admissible
Loss
π2 , dominated
π3 , minimax optimal
π4 , admissible
0
0 1
Environments
Figure 34.1 Loss as a function of the environment for four different polices π1 , . . . , π4 ,
when E = [0, 1]. Which policy would you choose?
The Bayesian viewpoint is hard to criticise when the user really does know the
underlying likelihood of each environment and the user is risk-neutral. Even
when the distribution is not known exactly, however, sensible priors often yield
provably sensible outcomes, regardless of whether one is interested in the average
loss across the environments, or the worst-case loss, or some other metric.
The last section explained the ‘forward view’, where a policy is chosen in advance
that minimises the expected loss. The Bayesian can also act sequentially by
updating their beliefs (the prior) as data is observed to obtain a new distribution
34.2 Bayesian Learning and the Posterior Distribution 422
on the set of environments (more generally, the set of hypotheses). The new
distribution is called the posterior. This is simple and well defined when the
environment set is countable, but quickly gets technical for larger spaces. We
start gently with a finite case and then explain the measure-theoretic machinery
needed to rigourously treat the general case.
Suppose you are given a bag containing two marbles. A trustworthy source
tells you the bag contains either (a) two white marbles (ww) or (b) a white
marble and a black marble (wb). You are allowed to choose a marble from the
bag (without looking) and observe its colour, which we abbreviate by ‘observe
white’ (ow) or ‘observe black’ (ob). The question is how to update your ‘beliefs’
about the contents of the bag having observed one of the marbles. The Bayesian
way to tackle this problem starts by choosing a probability distribution on the
space of hypotheses, which, incidentally, is also called the prior. This distribution
usually reflects one’s beliefs about which hypotheses are more probable. In the
lack of extra knowledge, for the sake of symmetry, it seems reasonable to choose
P(ww) = 1/2 and P(wb) = 1/2. The next step is to think about the likelihood
of the possible outcomes under each hypothesis. Assuming that the marble is
selected blindly (without peeking into the bag) and the marbles in the bag are
well shuffled, these are
P(ow | ww) = 1 and P(ow | wb) = 1/2 .
The conditioning here indicates that we are including the hypotheses as part of
the probability space, which is a distinguishing feature of the Bayesian approach.
With this formulation we can apply Bayes’ law (Eq. (2.2)) to show that
P(ow | ww)P(ww) P(ow | ww)P(ww)
P(ww | ow) = =
P(ow) P(ow | ww)P(ww) + P(ow | wb)P(wb)
1 × 21 2
= = .
1 × 2 + 12 × 12
1 3
Of course P(wb | ow) = 1 − P(ww | ow) = 1/3. Thus, while in the lack of
observations, ‘a priori’, both hypotheses are equally likely, having observed a
white marble, the probability that the bag originally contained two white marbles
(and thus the bag has a white marble remaining in it) jumps to 2/3. An alternative
calculation shows that P(ww | ob) = 0, which makes sense because choosing a
black marble rules out the hypothesis that the bag contains two white marbles.
The conditional distribution P( · | ow) over the hypotheses is called the posterior
distribution and represents the Bayesian’s belief in each hypothesis after observing
a white marble.
for generality, there are two reasons not to do this. First, having spent the effort
developing the necessary tools in Chapter 2, it would seem a waste not to use them
now. And second, the subtle issues that arise highlight some real consequences of
the differences between the Bayesian and frequentist viewpoints. As we shall see,
there is a real gap between these viewpoints.
Let Θ be a set called the hypothesis space and G be a σ-algebra on Θ. While
Θ is often a subset of a Euclidean space, we do not make this assumption. A prior
is a probability measure Q on (Θ, G). Next, let (U, H) be a measurable space and
P = (Pθ : θ ∈ Θ) be a probability kernel from (Θ, G) to (U, H). We call P the
model. Let Ω = Θ × U and F = G ⊗ H. The prior and the model combine to yield
a probability P = Q ⊗ P on (Ω, F). The prior is now the marginal distribution
of the joint probability measure: Q(A) = P(A × U). Suppose a random element
X on Ω describes what is observed. Then, generalizing the previous example
with the marbles, the posterior should somehow be the marginal of the joint
probability measure conditioned X. To make this more precise, let (X , J ) be a
measurable space and X : Ω → X a F/J -measurable map. The posterior having
observed that X = x should be a measure Q( · | x) on (Θ, G).
Without much thought, we might try and apply Bayes’ law (Eq. (2.2)) to claim
that the posterior distribution having observed X(ω) = x should be a measure
on (Θ, G) given by
P (X = x | θ ∈ A) P (θ ∈ A)
Q(A | x) = P (θ ∈ A | X = x) = . (34.1)
P (X = x)
The problem with the ‘definition’ in (34.1) is that P (X = x) can have measure
zero, and then P (θ ∈ A | X = x) is not defined. This is not an esoteric problem.
Consider the problem when θ is randomly chosen from Θ = R and its distribution
is Q = N (0, 1), the parameter θ is observed in Gaussian noise with a variance of
one: U = R, Pθ = N (θ, 1) for all θ ∈ R and X(φ, u) = u for all (φ, u) ∈ Θ×U. Even
in this very simple example, we have P (X = x) = 0 for all x ∈ R. Having read
Chapter 2, the next attempt might be to define Q(A | X) as a σ(X)-measurable
random variable defined using conditional expectations: for A ∈ G,
true here. A related annoying issue is that Q( · | x) as defined above need not be
a measure. By assuming that (Θ, G) is a Borel space, this issue can be overcome
by using a regular version (Theorem 3.11), a result that we restate here using
the present notation.
Theorem 34.1. If (Θ, G) is a Borel space, then there exists a probability kernel
Q : X ×G → [0, 1] such that Q(A | X) = P (θ ∈ A | X) simultaneously for all A ∈ G
outside of some P-null set. Furthermore, for any two probability kernels Q, Q0
satisfying this condition, Q(· | x) = Q0 (· | x) for all x in some set of PX -probability
one.
Example 34.2. Consider the situation when the hypothesis set is the [0, 1]
interval, the prior is the uniform distribution, and the observation is equal to the
hypothesis sampled. Formally, Θ = [0, 1] and the prior Q is the uniform measure
on (Θ, B(Θ)), Pθ = δθ is the Dirac measure on [0, 1] at θ, and X : [0, 1] → [0, 1] is
the identity: X(x) = x for all x ∈ [0, 1]. Let C ⊂ [0, 1] be an arbitrary countable
set and µ be an arbitrary probability measure on ([0, 1], B(R)). It is not hard to
34.3 Conjugate Pairs, Conjugate Priors and the Exponential Family 425
satisfies the conditions of Theorem 34.1 and is thus one of the many versions of
the posterior, regardless of the choice of C and µ!
A true Bayesian is unconcerned. If θ is sampled from the prior Q, then the
event {X ∈ C} has measure zero, and there is little cause to worry about events
that happen with probability zero. But for a frequentist using Bayesian techniques
for inference, this actually matters. If θ is not sampled from Q, then nothing
prevents the situation that θ ∈ C and the non-uniqueness of the posterior is an
issue (Exercise 34.12). Probability theory does not provide a way around this
issue.
It follows that one must be careful to specify the version of the posterior
being used when using Bayesian techniques for inference in a frequentist
setting because in the frequentist viewpoint, θ is not part of the probability
space and results are proven for Pθ for arbitrary fixed θ ∈ Θ. By contrast,
the all-in Bayesians include θ in the probability space and thus will not worry
about events with negligible prior probability, and for them any version of
the posterior will do.
parametric form as the prior. In this case, the prior is called a conjugate prior
to the model.
Following convention, from now on we sweep under the rug that this posterior
is one of many choices, which is justified because all posteriors must agree
almost everywhere.
The limiting regimes as the prior/signal variance tend to zero or infinity are
quite illuminating. For example, as σP2 → 0 the posterior tends to a Gaussian
N (µP , σP2 ), which is equal to the prior and indicates that no learning occurs.
This is consistent with intuition. If the prior variance is zero, then the statistician
is already certain of the mean, and no amount of data can change their belief.
On the other hand, as σP2 tends to infinity, we see the mean of the posterior
has no dependence on the prior mean, which means that all prior knowledge is
washed away with just one sample. You should think about what happens when
σS2 → {0, ∞}.
Notice how the model has fixed σS2 , suggesting that the model variance is
known. The Bayesian can also incorporate their uncertainty over the variance. In
this case, the model parameters are Θ = R × [0, ∞) and Pθ = N (θ1 , θ2 ). But is
there a conjugate prior in this case? Already things are getting complicated, so we
will simply let you know that the family of Gaussian-inverse-gamma distributions
is conjugate.
the Gaussian case, the posterior for the Bernoulli model and beta prior is unique
(Exercise 34.2).
Example 34.3. Let σ 2 > 0 and h = N (0, σ 2 ) and η(θ) = σθ and S(x) = σx . An
easy calculation shows that A(θ) = θ2 /(2σ 2 ), which has domain Θ = R and
Pθ = N (θ, σ 2 ).
Example 34.5. The same family can be parameterised in many different ways.
Let h = δ0 + δ1 , S(x) = x and η(θ) = log(θ/(1 − θ)). Then A(θ) = − log(1 − θ)
and Θ = (0, 1) and Pθ = B(θ).
Exponential families have many nice properties, some of which you will prove
in Exercise 34.5. Of most interest to us here is the existence of conjugate priors.
Suppose that (Pθ : θ ∈ Θ) is a single-parameter exponential family determined by
h, η and S, where S(x) = x is the identity map. Let x0 , n0 ∈ R, and define prior
measure Q on (Θ, B(Θ)) in terms of its density q = dQ/dλ with λ the Lebesgue
measure:
exp (n0 x0 η(θ) − n0 A(θ))
q(θ) = R , (34.4)
Θ
exp (n0 x0 η(θ) − n0 A(θ)) dθ
where we assume that the integral in the denominator exists and is positive.
Suppose we observe X = x. Then a choice of posterior has density with respect
34.3 Conjugate Pairs, Conjugate Priors and the Exponential Family 428
There are important parametric families with conjugate priors that are not
exponential families. One example is the uniform family (U(a, b) : a < b),
which is conjugate to the Pareto family.
which shows that in this case the posterior summarises all the useful information
in (Xs )ts=1 for predicting future data. By introducing a little measure-theoretic
machinery and making suitable regularity assumptions, it is possible to show that
the sequence Q1 , . . . , Qn is a time-inhomogeneous Markov chain. In many cases,
the posterior has a simple form, as you can see in the next two examples.
Example 34.6. Suppose Θ = [0, 1] and G = B([0, 1]) and Q = Beta(α, β)
and Pθ = B(θ) is Bernoulli. Then the posterior after t observations is Qt =
Pt
Beta(α+St , β+t−St ), where St = s=1 Xs . Furthermore, E[Xt+1 | X1 , . . . , Xt ] =
EQt [Xt+1 ] = (α + St )/(α + β + t), and hence
α + St
P (St+1 = St + 1 | St ) = ,
α+β+t
β + t − St
P (St+1 = St | St ) = .
α+β+t
So the posterior after t observations is a Beta distribution depending on St and
S1 , S2 , . . . , Sn follows a Markov chain evolving according to the above display.
Example 34.7. Let (Θ, G) = (R, B(R)) and Q = N (µ, σ 2 ) and Pθ = N (θ, 1).
Then, using the same notation as above the posterior is almost surely Qt =
N (µt , σt2 ), where
−1
µ/σ 2 + St 2 1
µt = and σt = +t .
1/σ 2 + t σ2
Then S1 , S2 , . . . , Sn is a Markov chain with the conditional distribution of St+1
given St a Gaussian with mean St + µt and variance 1 + σt2 .
The Bayesian bandit model is the same as the frequentist version introduced
in Chapter 4, except that at the beginning of the game, an environment is
sampled from the prior. Of course, the chosen environment is not revealed to
the learner, but its presence forces us to change our conditions on the rewards
because the rewards are dependent on each other through the chosen environment.
For simplicity, we treat only the finite, k-armed case, but the more general set-up
is handled in the same was as in Chapter 4.
A k-armed Bayesian bandit environment is a tuple (E, G, Q, P ), where
(E, G) is a measurable space and Q is a probability measure on (E, G) called the
prior. The last element P = (Pνi : ν ∈ E, i ∈ [k]) is a probability kernel from
E × [k] to (R, B(R)), where Pνi is the reward distribution associated with the ith
arm in bandit ν. A Bayesian bandit environment and policy π = (πt )nt=1 interact
to produce a collection of random variables, ν ∈ E, (At )nt=1 and (Xt )nt=1 with
At ∈ [k] and Xt ∈ R that satisfy
(a) P (ν ∈ ·) = Q(·);
34.5 Posterior Distributions in Bandits 430
Q(A | a1 , x1 , . . . , at , xt )
where pνa is the density of Pνa with respect to λ. Then the posterior after t
rounds is given by
R
pνπ (a1 , x1 , . . . , at , xt )dQ(ν)
Q(B | a1 , x1 , . . . , at , xt ) = RB
p (a , x1 , . . . , at , xt )dQ(ν)
E νπ 1
R Qt
pνas (xs )dQ(ν)
= RB Qts=1 , (34.7)
E s=1 pνas (xs )dQ(ν)
34.6 Bayesian Regret 431
where the second equality follows from Eq. (34.6). The posterior is not
defined when the denominator is zero, which only occurs with probability zero
(Exercise 34.11). Note that the Radon–Nikodym derivatives pνa (x) are only
unique up to sets of Pνa -measure zero, and so the ‘choice’ of posterior has been
converted to a choice of the Radon–Nikodym derivatives, which, in all practical
situations is straightforward. Observe also that Eq. (34.7) is only well defined
if pνas (·) is G-measurable as a function of ν. Fortunately this is always possible
(see Note 8).
Example 34.9. The posterior for the Bayesian bandit in Example 34.8 in terms
of its density with respect to the Lebesgue measure is
k
Y α+si (ht )−1
q(θ | a1 , x1 , . . . , at , xt ) ∝ θi (1 − θi )β+ti (ht )−si (ht )−1 ,
ht i=1
Pt Pt
where si (ht ) = u=1 xu I {au = i} and ti (ht ) = u=1 I {au = i}. This means the
posterior is also the product of Beta distributions, each updated according to the
observations from the relevant arm.
Recall that the regret of policy π in k-armed bandit environment ν over n rounds
is
" n #
X
Rn (π, ν) = nµ∗ − E Xt , (34.8)
t=1
The fact that the expected regret Rn (π, ν) is non-negative for all ν and π
means that the Bayesian regret is always non-negative. Perhaps less obviously,
the Bayesian regret of the Bayesian optimal policy can be strictly greater
than zero (Exercise 34.8).
34.7 Notes
Qt (µ) = P P ,
1
exp log
t
ν∈M − s=1 ν(ys | y1 ,...,ys−1 )
of all policies and P be a convex space of probability measures over policies and
Q be a convex space of probability measures on (E, G). Define L : P × Q → R
by
Z Z
L(S, Q) = Rn (π, ν)Q(dν)S(dπ) ,
Π E
surely for all x ∈ X for some some h : X → [0, ∞) and gθ : Y → [0, ∞) Borel
34.8 Bibliographic Remarks 435
34.9 Exercises
34.3 Use the tower rule to prove the identity in Eq. (34.5).
where pθ (x) = dPθ /dµ and q(θ) = dQ/dν. You may assume that pθ (x) is jointly
measurable in θ and x (see Note 8).
R
(a) Let N = {x : Θ pψ (x)q(ψ)dν(ψ) = 0} and show that PX (N ) = 0.
R
(b) Define Q(A | x) = A q(θ | x)dν(θ) for x ∈/ N and Q(A | x) be an arbitrary
fixed probability measure for x ∈ N . Show that Q( · | X) is a regular version
of P (θ ∈ · | X).
Hint The ‘sections’ lemma may prove useful (Lemma 1.26 in Kallenberg 2002),
along with the properties of the Radon–Nikodym derivative.
34.5 (Exponential families) Let A, T , h, η and Θ be as in Section 34.3.1.
(a) Prove that Pθ is indeed a probability measure.
(b) Let Eθ denote expectations with respect to Pθ . Show that A0 (θ) = Eθ [T ].
(c) Let θ ∈ Θ and X ∼ Pθ . Show that for all λ with λ + θ ∈ θ,
Eθ [exp(λT (X))] = exp(A(λ + θ) − A(θ)) .
(d) Given θ, θ0 ∈ Θ, show that
pθ (X)
d(θ, θ ) = Eθ log
0
= A(θ0 ) − A(θ) − (θ0 − θ)A0 (θ) . (34.12)
pθ0 (X)
(e) Let θ, θ0 ∈ Θ be such that A0 (θ0 ) ≥ A0 (θ) and X1 , . . . , Xn be independent
Pn
and identically distributed and T̂ = n1 t=1 T (Xt ). Show that
P T̂ ≥ A0 (θ0 ) ≤ exp (−nd(θ0 , θ)) .
Curiously, the function d of Eq. (34.12) is both the relative entropy D(Pθ , Pθ0 )
and the Bregman divergence between θ0 and θ induced by the convex function
A. See Section 26.3 for the definition of Bregman divergence.
demonstrating that for some priors over finite-armed stochastic bandits, the
Bayesian regret is strictly positive: inf π BRn (π, Q) > 0.
Hint The key is to observe that under appropriate conditions, BRn (π, Q) = 0
would mean that π needs to know the identity of the optimal action under ν from
round one, which is impossible when ν is random and the model is rich enough.
34.9 (Canonical model) Prove the existence of a probability space carrying
the random variables satisfying the conditions in Section 34.4.
34.11 Prove that the denominator in Eq. (34.7) is almost surely non-zero.
where y 6< x is defined to mean it is not true that yi ≤ xi for all i with strict
inequality for at least one i (λ(S) is the Pareto frontier of set S, and its elements
are the non-dominated loss-outcome vectors in cl(S)). Prove that if λ(S) ⊆ S
and S is convex, then for every π ∗ ∈ Π such that `(π ∗ ) ∈ λ(S), there exists a
prior q ∈ P(E) such that
X X
q(ν)`(π ∗ , ν) = min q(ν)`(π, ν) .
π∈Π
ν∈E ν∈E
Hint Use the supporting hyperplane theorem, stated in the hint after
Exercise 26.2.
34.9 Exercises 438
The first section of this chapter provides simple bounds on the Bayesian optimal
regret, which are obtained by integrating the regret guarantees for frequentist
algorithms studied in Part II. This is followed by a short interlude on the basic
theory of optimal stopping, which we will need later. The next few sections
are devoted to special cases where computing the Bayesian optimal policy is
tractable. We start with the finite horizon Bayesian one-armed bandit problem
where the existence of a tractable solution is reduced to the computation of a
sequence of functions on the sufficient statistics of the arm with the unknown
pay-off. Next, the k-armed setting is considered. The main question is whether
there exists a solution that avoids considering joint sufficient statistics over all
arms, which would be intractable in the lack of further structure (see Note 2).
Avoiding the joint sufficient in general is not possible, but in the remarkable
case of the problem of maximising the total expected discounted reward over an
infinite horizon, where John C. Gittins’s celebrated result shows that the Bayesian
optimal policy takes the form of an ‘index’ policy that keeps statistics for each
arm separately (updated based on the arm’s observations only) to compute a
value (‘index’) for each arm, in each round choosing the arm with the highest
index.
Even in relatively benign set-ups, the computation of the Bayesian optimal policy
appears hopelessly intractable. Nevertheless, one can investigate the value of the
Bayesian optimal regret by proving upper and lower bounds.
For simplicity, we restrict our attention to Bernoulli bandits, but the arguments
generalise to other models. Let (E, G) = ([0, 1]k , B([0, 1]k )), and for ν ∈ [0, 1]k
let Pνj = B(νj ). Choose some prior Q on (E, G). The Bayesian optimal regret is
necessarily smaller than the minimax regret, which by Theorem 9.1 means that
√
BR∗n (Q) ≤ C kn ,
where C > 0 is a universal constant. The proof of the lower bound in Exercise 15.2
shows that for each n, there exists a prior Q for which
√
BR∗n (Q) ≥ c kn ,
35.2 Optimal Stopping ( ) 440
where c > 0 is a√universal constant. These two together show that the
supQ BR∗n (Q) = Θ( kn).
Turning to the asymptotics for a fixed distribution, recall that that for any fixed
Bernoulli bandit environment, the asymptotic growth rate of regret is Θ(log(n)).
In stark contrast to this, the best we can say in the Bayesian case is that the
√ √
asymptotic growth rate of BR∗n (Q) is slower than n, but for some priors, n is
almost a lower bound on the growth rate. In particular, we ask you to prove the
following theorem in Exercise 35.1:
We now make a detour to show some results of optimal stopping, which will be
used in the next sections to find tractable solutions to certain Bayesian bandit
problems.
The first setting we consider will be useful for the one-armed bandit problem.
Let (Ut )nt=1 be a sequence of random variables adapted to filtration F = (Ft )nt=1 .
Optimal stopping is concerned with finding solutions to optimisation problems of
the following form:
Intuitively, Et is the optimal expected value one can guarantee provided that
stage t was reached.
Theorem 35.2. Assume that n is finite and Ut is integrable for all t ∈ [n]. Then
the stopping time τ = min{t ∈ [n] : Ut = Et } ∈ Rn1 achieves the supremum in
Eq. (35.1).
Backwards induction is not directly applicable when the horizon is infinite.
There are several standard ways around this problem. For our purposes, the most
convenient workaround is to introduce a Markov structure. The connection to
the Bayesian bandit setting is that in the Bayesian setting, posteriors follow a
Markov process. The connection will be made explicit in a few examples in later
sections.
Let (S, G) be a Borel space and (Px : x ∈ S) be a probability kernel from S to
itself and u : S → R be S/B(R)-measurable. A Markov reward process is a
Markov chain (St )∞t=1 evolving according to P and a sequence of random variables
(Ut )∞
t=1 with Ut = u(St ). Define the filtration F = (Ft )∞
t=1 with Ft = σ(S1 , . . . , St ).
The (Markov) optimal stopping problem is
sup E[Uτ ] ,
τ ∈R1
where R1 is the set of F-adapted stopping times, and the initial distribution of
S1 is arbitrary. Inspired by the solution of the finite horizon problem define the
value function v : S → R by
v(x) = sup Ex [Uτ ] , (35.2)
τ ∈R1
where Px is the probability measure on the space carrying (St )∞ t=1 for which
Px (S1 = x) = 1 and Ex be the expectation
R with respect to Px . As before, the
idea is to stop when Ut is above S v(y)PSt (dy), the predicted optimal value of
continuing. Note that ties can be resolved in any way (depending on St , one may
or may not stop when the predicted optimal value of continuation is equal to Ut ).
The next result gives sufficient conditions under which stopping rules of this form
are indeed optimal.
Theorem 35.3. Assume for all x ∈ S that U∞ = limn→∞ Un exists Px -a.s. and
supn≥1 |Un | is Px -integrable. Then v satisfies the Wald–Bellman equation,
Z
v(x) = max{u(x), v(y)Px (dy)} for all x ∈ S .
S
Given a deterministic retirement policy π = (πt )nt=1 , define the random variable
τ = min{t ≥ 1 : πt (2 | 1, Z1 , . . . , 1, Zt−1 ) = 1} ,
where the minimum of an empty set in this case is n+1. Clearly τ is an F-stopping
35.3 One-armed Bayesian Bandits 443
This should make intuitive sense. It is optimal to continue only if the expected
future reward from doing so is at least as large as what can be obtained by
stopping immediately. The difficulty is that E[Zt + Wt+1 | Ft ] can be quite a
complicated object. We now give two examples where E[Zt + Wt+1 | Ft ] has a
simple representation and thus computing the optimal stopping rule becomes
practical. The idea is to find a sequence of sufficient statistics (St )nt=0 so that
St ∈ S is Ft -measurable and Pν1 (Z1 , . . . , Zt ∈ · | St ) is independent of ν. Then
Et is σ(St )-measurable, and by Lemma 2.5 it follows that Et = vt (St ) for an
appropriately measurable function vt : S → R. For more on this, read the next
two subsections, and then do Exercise 35.4.
35.3 One-armed Bayesian Bandits 444
Bayesian optimal
8 Frequentist
Expected regret 6
0
0.2 0.4 0.6 0.8
µ1
Figure 35.2 The plot shows the expected regret for the Bayesian optimal algorithm
compared to the ‘frequestist’ algorithm in Eq. (35.4) on the Bernoulli 1-armed bandit
where µ2 = 1/2 and µ1 varies on the x-axis. The horizontal lines show the average
regret for each algorithm with respect to the prior, which is uniform.
and variance σP2 > 0. By the results in Section 34.3, the posterior Q(· | x1 , . . . , xt )
after observing rewards x1 , . . . , xt from the first arm is almost surely Gaussian
with mean µt and variance σt2 given by
Pt −1
2 +
µP
σP s=1 xs 2 1
µt = and σ = t + . (35.5)
1 + σP−2
t
σP2
The posterior variance is independent of the observations, so the posterior is
determined entirely by its mean. As in the Bernoulli case, there exist functions
(wt )n+1
t=1 such that Wt = wt (µt−1 ) almost surely for all t ∈ [n]. Precisely,
wn+1 (µ) = 0 and for t ≤ n,
Z ∞
1 x2
wt (µ) = max (n − t + 1)µ2 , µ + √ exp − 2 wt+1 (µ + x)dx .
2π −∞ 2σt−1
(35.6)
The integral on the right-hand side does not have a closed-form solution, which
forces the use of approximate methods. Fortunately wt is a well-behaved function
and can be efficiently approximated. The favourable properties are summarised
in the next lemma, the proof of which is left to Exercise 35.5.
Lemma 35.5. The following hold:
(a) The function wt is increasing.
(b) The function wt is convex.
(c) limµ→∞ wt (µ)/µ = n − t + 1 and limµ→−∞ wt (µ) = (n − t + 1)µ2 .
There are many ways to approximate a function, but in order to propagate
the approximation using Eq. (35.6), it is convenient to choose a form for which
the integral in Eq. (35.6) can be computed analytically. Given the properties in
35.4 Gittins Index 446
Let (St )∞
t=1 be a Markov chain on Borel space (S, G) evolving according to
probability kernel (Px : x ∈ S). As in Section 35.2, let (Ω, F, Px ) be a probability
space carrying (Sn )∞
n=1 with Sn ∈ S such that
where α ∈ (0, 1) is the discount factor. To ensure that this is well defined, we
need the following assumption:
"∞ #
X
Assumption 35.6. For all x ∈ S, it holds that Ex α |r(St )| < ∞.
t−1
t=1
If the rewards are bounded, the assumption will hold. When the rewards are
unbounded, the assumption restricts the rate of growth of rewards over time.
The presence of discounting encourages the learner to obtain large rewards
earlier rather than later and is one distinction between this model and the finite-
horizon model studied for most of this book. A brief discussion of discounting is
left for the notes.
Fix a state x ∈ S. The map γ 7→ vγ (x) is decreasing and is always non-negative.
In fact, if γ is large enough, it is easy to see that retiring immediately (τ = 1)
achieves the supremum in the definition of vγ (x), and thus vγ (x) = 0. The
Gittins index, or fair charge, of a state x is the smallest value of γ for which
the learner is indifferent between retiring immediately and playing for at least
one round:
The form in (35.9) will be useful for computation. It is not immediately clear that
35.4 Gittins Index 448
a stopping time attaining the supremum in (35.9) exists. The following lemma
shows that it does and gives an explicit form.
Lemma 35.7. Let x ∈ S be arbitrary. The following hold under Assumption 35.6:
R
(a) vγ (x) = max{0, r(x) − γ + α S vγ (y)PxR(dy)} for all γ ∈ R.
(b) If γ ≤ g(x), then vγ (x) = r(x) − γ + α S vγ (y)Px (dy).
(c) The stopping time τ = min{t ≥ 2 : g(St ) ≤ γ} attains the supremum in
Eq. (35.9).
The result is relatively intuitive. The Gittins index represents the price the
learner should be willing to pay for the privilege of continuing to play. The optimal
policy continues to play as long as the actual value of the game is not smaller
than this price was at the start. The proof of Lemma 35.7 uses Theorem 35.3
and is left for the reader in Exercise 35.7.
The assumption that the Markov chains evolve on the same state space with
the same transition kernel is non-restrictive since the state space can always
be taken to be the union of k state spaces and the transition kernel defined
with k disconnected components.
Because the learner observes the state of all chains in each round, a policy π
now is a collection (πt )∞
t=1 , where πt is a probability kernel from (S ×[k])
k t−1
×S k
(history, including past observed states and actions) to [k]. Given a discount
rate α ∈ (0, 1), the objective is to find the policy maximising the cumulative
discounted reward:
"∞ #
X
argmaxπ Eπ α r(SAt (t)) ,
t−1
t=1
Figure 35.3 Interaction protocol for discounted bandits with Markov pay-offs
While this notation breaks our convention of putting the time index first in
the reward sequences of a multi-armed bandit, we prefer this notation here
as we need to consider reward sequences underlying individual arms.
Given an infinite sequence (a_t)_{t=1}^∞ taking values in [k], define the interleaving sequence I(g, a) = (I_t(g, a))_{t=1}^∞ by

I_t(g, a) = g_{a_t, 1+n_{a_t}(a, t−1)} ,  where  n_i(a, t − 1) = Σ_{s=1}^{t−1} I{a_s = i} .
Note that this is the same as the ‘reward-stack model’ of bandits mentioned on
page 65 in Chapter 4 except that here we have fixed sequences. The next lemma
follows from the Hardy–Littlewood inequality, a generalisation of the trivial
observation that the identical ordering of two sequences of numbers maximises
their inner product. We leave the proof to Exercise 35.9.
Lemma 35.10. Suppose that g_i is decreasing for all i ∈ [k], that (a^∗_t)_{t=1}^∞ is defined recursively by a^∗_t = argmax_i g_{i, 1+n_i(a^∗, t−1)} and that I^∗(g) = I(g, a^∗). Then, for any α ∈ (0, 1),

Σ_{t=1}^∞ α^{t−1} I^∗_t(g) = sup_{a∈[k]^N} Σ_{t=1}^∞ α^{t−1} I_t(g, a) .
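As a quick sanity check (not part of the text), the following sketch compares a finite truncation of the greedy interleaving with randomly drawn interleavings; the truncation horizon and all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, n, k = 0.9, 30, 3
g = [np.sort(rng.random(n))[::-1] for _ in range(k)]  # decreasing sequences

def value(a):
    """Truncated discounted value of the interleaving a_1, ..., a_n."""
    used, total = [0] * k, 0.0
    for t, i in enumerate(a):
        total += alpha ** t * g[i][used[i]]
        used[i] += 1
    return total

greedy, used = [], [0] * k
for _ in range(n):
    i = int(np.argmax([g[j][used[j]] for j in range(k)]))
    greedy.append(i)
    used[i] += 1

# the greedy interleaving is never beaten by a random interleaving
assert value(greedy) >= max(value(rng.integers(0, k, n)) for _ in range(2000))
```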
τ1 = min{t ≥ 1 : At = i} and
τj+1 = min{t > τj : At = i and g(Si (t)) ≤ Gi (τj )} ,
where the minimum of the empty set is defined to be infinite. Next, let
Note that on the event {τj < ∞}, Gi (t) = γj for all t ∈ Tj . Furthermore,
g(S_i(τ_j)) = γ_j. By definition, we have

E_π[ Σ_{t=1}^∞ α^{t−1} ( r(S_i(t)) − G_i(t) ) I{A_t = i} ] = E_π[ Σ_{j=1}^∞ Σ_{t∈T_j} α^{t−1} ( r(S_i(t)) − γ_j ) ] .
The claim follows by showing the term inside the sum on the right-hand side
vanishes for the Gittins index policy and is not positive for any other policy.
Fix j ≥ 1. By definition, for t ∈ T_j it holds that g(S_i(t)) ≥ G_i(t) = γ_j. Combining this with Part (b) of Lemma 35.7, on {t ∈ T_j}, thanks to {t ∈ T_j} ∈ F_t,

v_{γ_j}(S_i(t)) + γ_j − r(S_i(t)) = α ∫_S v_{γ_j}(y) P_{S_i(t)}(dy) = α E_π[ v_{γ_j}(S_i(t+1)) | F_t ] .
≤ 0,
where the final inequality holds since vγj is non-negative, vγj (Si (τj )) = 0 and by
telescoping the sum, which is possible because whenever t0 is the smallest element
larger than t in Tj , then Si (t0 ) = Si (t + 1). We now argue that the inequality
is replaced by an equality for the Gittins index policy. The key observation
is that having played Aτj = i, the Gittins index policy continues playing arm
i until g(Si (t)) ≤ γj , which means that Tj = {τj , τj + 1, . . . , κj − 1}, where
κj = min{t > τj : g(Si (t)) ≤ γj }, which by Part (c) of Lemma 35.7 means that
E_{π∗}[ Σ_{t∈T_j} α^{t−1} ( r(S_i(t)) − γ_j ) | F_{τ_j} ] = v_{γ_j}(S_i(τ_j)) = 0 .
E_{π∗}[ Σ_{t=1}^∞ α^{t−1} r(S_{A_t}(t)) ] = E_{π∗}[ Σ_{t=1}^∞ α^{t−1} G_{A_t}(t) ]
= E_{π∗}[ Σ_{t=1}^∞ α^{t−1} I_t(H, A) ]
= E_{π∗}[ Σ_{t=1}^∞ α^{t−1} I^∗_t(H) ] ,
where the first equality follows from part 1, the second by the definition of It
and H and the third by the definitions of It∗ from Lemma 35.10 and that of
the Gittins index policy, which always chooses an action that maximises the
prevailing charge. On the other hand, for any policy π,

E_π[ Σ_{t=1}^∞ α^{t−1} r(S_{A_t}(t)) ] ≤ E_π[ Σ_{t=1}^∞ α^{t−1} G_{A_t}(t) ]
= E_π[ Σ_{t=1}^∞ α^{t−1} I_t(H, A) ]
≤ E_π[ Σ_{t=1}^∞ α^{t−1} I^∗_t(H) ] ,
where the last line follows from Lemma 35.10. Finally, note that the law of H under P_π does not depend on π, and hence

E_π[ Σ_{t=1}^∞ α^{t−1} I^∗_t(H) ] = E_{π∗}[ Σ_{t=1}^∞ α^{t−1} I^∗_t(H) ] .
35.5 Computing the Gittins Index

We describe a simple approach that depends on the state space being finite. References to more general methods are given in the bibliographic remarks. Assume without loss of generality that S = {1, 2, . . . , |S|} and G = 2^S. The matrix form of the transition kernel is P ∈ [0, 1]^{|S|×|S|} and is defined by P_{ij} = P_i({j}). We also let r ∈ [0, 1]^{|S|} be the vector of rewards, so that r_i = r(i). The standard
basis vector is e_i ∈ R^{|S|}, and 1 ∈ R^{|S|} is the vector with one in every coordinate. For C ⊂ S, let Q_C be the transition matrix with (Q_C)_{ij} = P_{ij} I_C(j). For each i ∈ S, the goal is to find
g(i) = sup_{τ≥2}  E_i[ Σ_{t=1}^{τ−1} α^{t−1} r(S_t) ] / E_i[ Σ_{t=1}^{τ−1} α^{t−1} ] ,
where Ei is the expectation with respect to the measure Pi for which the initial
state is S1 = i. Lemma 35.7 shows that the stopping time τ = min{t ≥ 2 : g(St ) ≤
g(i)} attains the supremum in the above display. The set Ci = {j : g(j) > g(i)}
is called the continuation region, and Si = S \ Ci is the stopping region. Then
the Gittins index can be calculated as
g(i) = E_i[ Σ_{t=1}^{τ−1} α^{t−1} r(S_t) ] / E_i[ Σ_{t=1}^{τ−1} α^{t−1} ]
     = ( Σ_{t=1}^∞ α^{t−1} e_i^⊤ Q_{C_i}^{t−1} r ) / ( Σ_{t=1}^∞ α^{t−1} e_i^⊤ Q_{C_i}^{t−1} 1 )
     = ( e_i^⊤ (I − αQ_{C_i})^{−1} r ) / ( e_i^⊤ (I − αQ_{C_i})^{−1} 1 ) .
All this suggests an induction approach where the Gittins index is calculated
for each state in decreasing order of their indices. To get started, note that the
maximum possible Gittins index is maxi ri and that this is achievable for state
i = argmaxj rj with the deterministic stopping time τ = 2. For the induction
step, assume that g(i) is known for the j states C = {i1 , i2 , . . . , ij } with the
largest Gittins indices. Then ij+1 is given by
i_{j+1} = argmax_{i∉C}  ( e_i^⊤ (I − αQ_C)^{−1} r ) / ( e_i^⊤ (I − αQ_C)^{−1} 1 ) .
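To make the induction concrete, here is a minimal NumPy sketch of the procedure just described; the function name is illustrative and, for clarity, it forms the matrix inverse explicitly, whereas a practical implementation would solve the linear systems directly.

```python
import numpy as np

def gittins_indices(P, r, alpha):
    """Sketch of Varaiya's algorithm for a finite state space.

    P: |S| x |S| transition matrix, r: reward vector, alpha: discount factor."""
    S = len(r)
    g = np.full(S, -np.inf)
    top = int(np.argmax(r))
    g[top] = r[top]                       # largest index equals the largest reward
    C = [top]                             # states with the largest indices so far
    for _ in range(S - 1):
        Q = P * np.isin(np.arange(S), C)  # (Q_C)_{ij} = P_{ij} I{j in C}
        M = np.linalg.inv(np.eye(S) - alpha * Q)
        ratio = (M @ r) / (M @ np.ones(S))
        ratio[C] = -np.inf                # only states outside C are candidates
        nxt = int(np.argmax(ratio))
        g[nxt] = ratio[nxt]
        C.append(nxt)
    return g
```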
35.6 Notes
to lower bound the computation complexity of finding the optimal action for
k-armed Bayesian bandits when the prior is a product of Beta distributions,
but without discounting.
3 The solution to optimal stopping problems is essentially a form of dynamic
programming, which is a method that trades memory for computation by
introducing recursively defined value functions that suffice for reconstructing an
optimal policy. In the one-armed bandit optimal stopping problem, thanks to
the factorisation lemma (Lemma 2.5), for any 0 ≤ t ≤ n, there exists a function
wt : Rt → R such that Wt = wt (X1 , . . . , Xt ) almost surely. This function can
be seen as the value function that captures the optimal value-to-go from stage
t on, and (35.3) gives a recursive construction for it, wn (x1 , . . . , xn ) = 0, and
for t < n,
w_t(x_1, . . . , x_t) = max( (n − t)µ_2 , ∫ ( x_{t+1} + w_{t+1}(x_1, . . . , x_t, x_{t+1}) ) dP_t(x_{t+1}) ) ,
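As a concrete instance of this recursion (an illustration, not taken from the text), consider the Bayesian one-armed Bernoulli bandit with a Beta(1, 1) prior, where the pair of success/failure counts is a sufficient statistic for the posterior. The value of the optimal policy can then be computed by backwards induction:

```python
from functools import lru_cache

def one_armed_value(n, mu2, a0=1, b0=1):
    """Backwards induction for the Bayesian one-armed Bernoulli bandit (sketch).
    State (t, a, b): after t rounds, the posterior of arm 1 is Beta(a, b)."""
    @lru_cache(maxsize=None)
    def w(t, a, b):
        if t == n:
            return 0.0
        retire = (n - t) * mu2                 # retire and collect mu2 per round
        p = a / (a + b)                        # posterior mean of arm 1
        keep = p * (1.0 + w(t + 1, a + 1, b)) + (1 - p) * w(t + 1, a, b + 1)
        return max(retire, keep)
    return w(0, a0, b0)

print(one_armed_value(n=20, mu2=0.6))
```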
7 The previous note does not apply to one-armed bandits for which the
interleaving argument is not required. Given a Markov chain (St )t and horizon
n, the undiscounted Gittins index of state s is
g_n(s) = sup_{2≤τ≤n}  E_s[ Σ_{t=1}^{τ−1} r(S_t) ] / E_s[ τ − 1 ] .
If the learner receives reward µ2 by retiring, then the Bayesian optimal policy
is to retire in the first round t when gn−t+1 (St ) ≤ µ2 . A reasonable strategy
for undiscounted k-armed bandits is to play the arm At that maximises
gn−t+1 (Si (t)). Although this strategy is not Bayesian optimal anymore, it
nevertheless performs well in practice. In the Gaussian case, it even enjoys
frequentist regret guarantees similar to UCB [Lattimore, 2016a].
8 The form of the undiscounted Gittins index was analysed asymptotically
by Burnetas and Katehakis [1997b], who showed the index behaves like the
upper confidence bound provided by KL-UCB. This should not be especially
surprising and explains the performance of the algorithm in the previous note.
The asymptotic nature of the result does not make it suitable for proving regret
guarantees, however.
9 We mentioned that computing the Bayesian optimal policy in finite horizon
bandits is computationally intractable. But this is not quite true if n is small.
For example, when n = 50 and k = 5, the dynamic program for computing
the exact Bayesian optimal policy for Bernoulli noise and Beta prior has
approximately 1011 states. A big number to be sure, but not so large that the
table cannot be stored on disk. And this is without any serious effort to exploit
symmetries. For mission-critical applications with small horizon, the benefits
of exact optimality might make the computation worth the hassle.
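One plausible way to arrive at this order of magnitude (an assumption about the bookkeeping, not a claim about the authors' exact count) is to note that with a product of Beta priors the posterior is summarised by 2k success/failure counts whose total is at most n:

```python
from math import comb

n, k = 50, 5
# number of vectors of 2k non-negative integers summing to at most n
print(comb(n + 2 * k, 2 * k))  # 75394027566, roughly 10**11
```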
10 The algorithm in Section 35.5 for computing Gittins index is called Varaiya’s
algorithm. In the bibliographic remarks, we give some pointers on where to
look for more sophisticated methods. The assumption that |S| is finite is less
severe than it may appear. When the discount rate is not too close to one, the Gittins index can for many problems be approximated by removing states that cannot be reached from the start state before discounting has made them essentially irrelevant. When the state space is infinite, there is often a
topological structure that makes a discretisation possible.
regular priors and noise models, the asymptotic Bayesian optimal regret is
BR∗n ∼ c log(n)2 for some constant c > 0 that depends on the prior/model
(see theorem 3 of Lai [1987]). The Bayesian approach dominated research on
bandits from 1960 to 1980, with Gittins’s result (Theorem 35.9) receiving the
most attention [Gittins, 1979]. Gittins et al. [2011] has written a whole book on
Bayesian bandits. Another book that focusses mostly on the Bayesian problem is
by Berry and Fristedt [1985]. Although it is now more than 30 years old, this book
is still a worthwhile read and presents many curious and unintuitive results about
exact Bayesian policies. The book by Presman and Sonin [1990] also considers
the Bayesian case. As compared to the other books, here the emphasis is on a
case that is more similar to partial monitoring, the subject of Chapter 37 (in the
adversarial setting). As far as we know, the earliest fully Bayesian analysis is by
Bradt et al. [1956], who studied the finite horizon Bayesian one-armed bandit
problem, essentially writing down the optimal policy using backwards induction,
as presented here in Section 35.3. More general ‘approximation results’ are shown
by Burnetas and Katehakis [2003], who show that under weak assumptions the
Bayesian optimal strategy for one-armed bandits is asymptotically approximated
by a retirement policy reminiscent of Eq. (35.4). The very specific approach
to approximating the Bayesian strategy for Gaussian one-armed bandits is by
one of the authors [Lattimore, 2016a], where a precise approximation for this
special case is also given. There are at least four proofs of Gittins’s theorem
[Gittins, 1979, Whittle, 1980, Weber, 1992, Tsitsiklis, 1994]. All are summarised
in the review by Frostig and Weiss [1999]. There is a line of work on computing
and/or approximating the Gittins index, which we cannot do justice to. The
approach presented here for finite state spaces is due to Varaiya et al. [1985], but
more sophisticated algorithms exist with better guarantees. A nice survey is by
Chakravorty and Mahajan [2014], but see also the articles by Chen and Katehakis
[1986], Kallenberg [1986], Sonin [2008], Niño-Mora [2011] and Chakravorty and
Mahajan [2013]. There is also a line of work on approximations of the Gittins
index, most of which are based on approximating the discrete time stopping
problem with continuous time and applying free boundary methods [Yao, 2006,
and references therein]. The Gittins index has been generalised to continuous
time, where the challenge is to ensure the existence of solutions to the resulting
stochastic differential equations [Karoui and Karatzas, 1994]. We mentioned
restless bandits in Chapter 31 on non-stationary bandits, but they are usually
studied in the Bayesian context [Whittle, 1988, Weber and Weiss, 1990]. The
difference is that now the Markov chains for all actions evolve regardless of the
action chosen, but the learner only gets to observe the new state for the action
they chose.
35.8 Exercises
Hint For the first part, you should use the existence of a policy for Bernoulli
bandits such that
R_n(π, ν) ≤ C min{ √(kn) ,  k log(n)/∆_min(ν) } ,
where C > 0 is a universal constant and ∆min (ν) is the smallest positive
suboptimality gap. Then let En be a set of bandits for which there exists a
small enough positive suboptimality gap and integrate the above bound on En
and Enc . The second part is left as a challenge, though the solution is available.
35.2 (Finite horizon optimal stopping) Prove Theorem 35.2.
Hint Prove that (Et )nt=1 is a F-adapted supermartingale and that for stopping
time τ satisfying the conditions of the theorem that (Mt )nt=1 defined by Mt = Et∧τ
is a martingale. Then apply the optional stopping theorem (Theorem 3.8).
35.3 (Infinite horizon optimal stopping) Prove Theorem 35.3.
Hint This is a technical exercise. Use theorem 1.7 of Peskir and Shiryaev [2006],
and pass to the limit using the almost-sure convergence of (Ut )t as t → ∞. You
may find the ideas in the proof of theorem 1.11 of the same book useful. Be
careful, Peskir and Shiryaev adopt the convention that stopping times are almost
surely finite, while here we permit infinite stopping times.
35.4 This exercise uses the notation and setting of Section 35.3. Suppose that (S_t)_{t=0}^n is a sequence of random elements taking values in a measurable space (S, H) with S_t being F_t/H-measurable and P_{ν1}(Z_1, . . . , Z_t ∈ · | S_t) independent of ν. Show that E_t is σ(S_t)-measurable and that there exists an H/B(R)-measurable function v_t : S → R such that E_t = v_t(S_t). You may assume that (E, G) is Borel.
Hint Use the Hardy–Littlewood inequality, which for infinite sequences states
35.11 In this exercise, you will implement some Bayesian (near-)optimal 1-armed
bandit algorithms.
(a) Reproduce the experimental results in Experiment 1.
(b) Implement an approximation of the optimal policy for one-armed Gaussian
bandits and compare its performance to the stopping rule τα defined below
for a variety of different choices of α > 0.
τ_α = min{ t ≥ 2 : µ̂_{t−1} + √( 2 max{0, log(αn/t)} / (t − 1) ) ≤ µ_2 } .
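For part (b), a minimal sketch of the stopping rule τ_α (names and input conventions are illustrative; x holds the rewards observed from arm 1):

```python
import numpy as np

def stop_time(x, mu2, alpha, n):
    """Return the round in which tau_alpha retires, or n + 1 if it never does."""
    total = 0.0
    for t in range(2, n + 1):
        total += x[t - 2]                 # reward observed in round t - 1
        mu_hat = total / (t - 1)
        bonus = np.sqrt(2 * max(0.0, np.log(alpha * n / t)) / (t - 1))
        if mu_hat + bonus <= mu2:
            return t
    return n + 1
```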
36 Thompson Sampling
“As all things come to an end, even this story, a day came at last when they were in
sight of the country where Bilbo had been born and bred, where the shapes of the land
and of the trees were as well known to him as his hands and toes.” – Tolkien [1937].
Like Bilbo, as the end nears, we return to where it all began, to the first algorithm
for bandits proposed by Thompson [1933]. The idea is a simple one. Before the
game starts, the learner chooses a prior over a set of possible bandit environments.
In each round, the learner samples an environment from the posterior and
acts according to the optimal action in that environment. Thompson only gave
empirical evidence (calculated by hand) and focused on Bernoulli bandits with
two arms. Nowadays these limitations have been eliminated, and theoretical
guarantees have been proven demonstrating the approach is often close to optimal
in a wide range of settings. Perhaps more importantly, the resulting algorithms
are often quite practical both in terms of computation and empirical performance.
The idea of sampling from the posterior and playing the optimal action is called
Thompson sampling, or posterior sampling.
The exploration in Thompson sampling comes from the randomisation. If the
posterior is poorly concentrated, then the fluctuations in the samples are expected
to be large and the policy will likely explore. On the other hand, as more data
is collected, the posterior concentrates towards the true environment and the
rate of exploration decreases. We focus our attention on finite-armed stochastic
bandits and linear stochastic bandits, but Thompson sampling has been extended
to all kinds of models, as explained in the bibliographic remarks.
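As a concrete illustration of the idea (not part of the text), here is a minimal sketch of Thompson sampling for k-armed Gaussian bandits with unit-variance rewards and an independent standard Gaussian prior on each mean; the names and structure are illustrative.

```python
import numpy as np

def thompson_sampling_gaussian(n, mu, rng=None):
    """Run Thompson sampling for n rounds and return the random regret."""
    rng = np.random.default_rng() if rng is None else rng
    k = len(mu)
    counts, sums, regret = np.zeros(k), np.zeros(k), 0.0
    for _ in range(n):
        # with a N(0, 1) prior, arm i's posterior is N(sums/(counts+1), 1/(counts+1))
        theta = rng.normal(sums / (counts + 1), 1.0 / np.sqrt(counts + 1))
        a = int(np.argmax(theta))
        counts[a] += 1
        sums[a] += rng.normal(mu[a], 1.0)
        regret += np.max(mu) - mu[a]
    return regret
```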
Recalling the notation from Section 34.5, let k > 1 and (E, B(E), Q, P) be a k-armed Bayesian bandit environment. The learner chooses actions (A_t)_{t=1}^n and receives rewards (X_t)_{t=1}^n, and the posterior after t observations is a probability kernel Q(· | ·) from ([k] × R)^t to (E, B(E)). Denote the mean of the ith arm in bandit ν ∈ E by µ_i(ν) = ∫_R x dP_{νi}(x). In round t, Thompson sampling samples a bandit environment ν_t from the posterior of Q given A_1, X_1, . . . , A_{t−1}, X_{t−1} and then chooses the arm with the largest mean (Algorithm 23). A more precise definition is that Thompson sampling is the policy π = (π_t)_{t=1}^∞ with
Thompson sampling has been analysed in both the frequentist and the Bayesian
settings. We start with the latter where the result requires almost no assumptions
on the prior. In fact, after one small observation about Thompson sampling, the
analysis is almost the same as that of UCB.
where µ̂i (t − 1) is the empirical estimate of the reward of arm i after t − 1 rounds
and we assume µ̂i (t − 1) = 0 if Ti (t − 1) = 0. Let E be the event that for all
The key insight (Exercise 36.3) is to notice that the definition of Thompson
sampling implies the conditional distributions of A∗ and At given Ft−1 are the
same:
P (A∗ = · | Ft−1 ) = P (At = · | Ft−1 ) a.s. (36.1)
Using the previous display,
E [µA∗ − µAt | Ft−1 ] = E [µA∗ − Ut (At ) + Ut (At ) − µAt | Ft−1 ]
= E [µA∗ − Ut (A∗ ) + Ut (At ) − µAt | Ft−1 ] (Eq. (36.1))
= E[ µ_{A∗} − U_t(A∗) | F_{t−1} ] + E[ U_t(A_t) − µ_{A_t} | F_{t−1} ] .
On the event E c the terms inside the expectation are bounded by 2n, while on
the event E, the first sum is negative and the second is bounded by
I{E} Σ_{t=1}^n ( U_t(A_t) − µ_{A_t} ) = I{E} Σ_{t=1}^n Σ_{i=1}^k I{A_t = i} ( U_t(i) − µ_i )
≤ Σ_{i=1}^k Σ_{t=1}^n I{A_t = i} √( 8 log(1/δ) / (1 ∨ T_i(t − 1)) )
≤ Σ_{i=1}^k ∫_0^{T_i(n)} √( 8 log(1/δ) / s ) ds
= Σ_{i=1}^k √( 32 T_i(n) log(1/δ) )  ≤  √( 32nk log(1/δ) ) .
The proof is completed by choosing δ = n−2 and the fact that P (E c ) ≤ 2nkδ.
Bounding the frequentist regret of Thompson sampling is more technical than the
Bayesian regret. The trouble is the frequentist regret does not have an expectation
with respect to the prior, which means that At is not conditionally distributed in
the same way as the optimal action (which is not random). Thompson sampling
can be viewed as an instantiation of follow-the-perturbed-leader, which we already
saw in action for adversarial combinatorial semi-bandits in Chapter 30. Here we
work with the stochastic setting and consider the general form algorithm given
in Algorithm 24.
Let Fis be the cumulative distribution function used for arm i in all rounds
t with Ti (t − 1) = s. This quantity is defined even if Ti (n) < s by using the
reward-stack model from Section 4.6.
Theorem 36.2. Assume that arm 1 is optimal. Let i > 1 be an action and ε ∈ R
be arbitrary. Then the expected number of times Algorithm 24 plays action i is
bounded by

E[T_i(n)] ≤ 1 + E[ Σ_{s=0}^{n−1} ( 1/G_{1s} − 1 ) ] + E[ Σ_{s=0}^{n−1} I{G_{is} > 1/n} ] ,   (36.3)
In order to bound the first term, let A′_t = argmax_{i≠1} θ_i(t). Then

P(A_t = i, E_i(t) | F_{t−1}) ≤ ( 1 − P(θ_1(t) > µ_1 − ε | F_{t−1}) ) P(A′_t = i, E_i(t) | F_{t−1}) ,

which is true since {A_t = i, E_i(t) occurs} ⊆ {A′_t = i, E_i(t) occurs} ∩ {θ_1(t) ≤ µ_1 − ε}, and the two intersected events are conditionally independent given F_{t−1}.
Therefore using Eq. (36.5), we have
P(A_t = i, E_i(t) | F_{t−1}) ≤ ( 1/G_{1T_1(t−1)} − 1 ) P(A_t = 1, E_i(t) | F_{t−1})
≤ ( 1/G_{1T_1(t−1)} − 1 ) P(A_t = 1 | F_{t−1}) .
Substituting this into the first term in Eq. (36.4) leads to

E[ Σ_{t=1}^n I{A_t = i, E_i(t) occurs} ] ≤ E[ Σ_{t=1}^n ( 1/G_{1T_1(t−1)} − 1 ) P(A_t = 1 | F_{t−1}) ]
= E[ Σ_{t=1}^n ( 1/G_{1T_1(t−1)} − 1 ) I{A_t = 1} ]
≤ E[ Σ_{s=0}^{n−1} ( 1/G_{1s} − 1 ) ] ,   (36.6)
where in the last step we used the fact that T1 (t − 1) = s is only possible for one
round where A_t = 1. Let T = {t ∈ [n] : 1 − F_{iT_i(t−1)}(µ_1 − ε) > 1/n}. After some calculation (Exercise 36.5), we get

E[ Σ_{t=1}^n I{A_t = i, E_i^c(t) occurs} ] ≤ E[ Σ_{t∈T} I{A_t = i} ] + E[ Σ_{t∉T} I{E_i^c(t)} ]
≤ E[ Σ_{s=0}^{n−1} I{1 − F_{is}(µ_1 − ε) > 1/n} ] + E[ Σ_{t∉T} 1/n ]
≤ E[ Σ_{s=0}^{n−1} I{G_{is} > 1/n} ] + 1 .
Theorem 36.3. Suppose that F_i(1) = δ_∞ is the Dirac at infinity and let Update(F_i(t), A_t, X_t) be the cumulative distribution function of the Gaussian N(µ̂_i(t), 1/t). Then the regret of Algorithm 24 on a Gaussian bandit ν ∈ E^k_N(1) satisfies

lim_{n→∞} R_n / log(n) = Σ_{i:∆_i>0} 2/∆_i .

Furthermore, there exists a universal constant C > 0 such that R_n ≤ C √(nk log(n)).
(Figure: two panels, Thompson sampling and AdaUCB, plotting Proportion × Regret² against Regret.)
The Bayesian regret is controlled using the techniques from the previous section
in combination with the concentration analysis in Chapter 20. A frequentist
analysis is also possible under slightly unsatisfying assumptions, which we discuss
in the notes and bibliographic remarks.
Theorem 36.4. Assume that ‖θ‖₂ ≤ S with Q-probability one and sup_{a∈A} ‖a‖₂ ≤ L and sup_{a∈A} |⟨a, θ⟩| ≤ 1 with Q-probability one. Then the Bayesian regret of Algorithm 25 is bounded by

BR_n ≤ 2 + 2 √( 2dnβ² log( 1 + nS²L²/d ) ) ,

where β = 1 + √( 2 log(n) + d log( 1 + nS²L²/d ) ).
For fixed S and L, the upper bound obtained here is of order
O(d √( n log(n) log(n/d) )), which matches the upper bound obtained for LinUCB in Corollary 19.3.
Proof We apply the same technique as used in the proof of Theorem 36.1. Define
the upper confidence bound function Ut : A → R by
U_t(a) = ⟨a, θ̂_{t−1}⟩ + β ‖a‖_{V_{t−1}^{−1}} ,  where  V_t = (1/S²) I + Σ_{s=1}^t A_s A_s^⊤ .
By Theorem 20.5 and Eq. (20.9), P(∃ t ≤ n : ‖θ̂_{t−1} − θ‖_{V_{t−1}} > β) ≤ 1/n. Let E_t be the event that ‖θ̂_{t−1} − θ‖_{V_{t−1}} ≤ β, let E = ∩_{t=1}^n E_t and A∗ = argmax_{a∈A} ⟨a, θ⟩. Note that A∗ is a random variable because θ is random. Then

BR_n = E[ Σ_{t=1}^n ⟨A∗ − A_t, θ⟩ ]
= E[ I_{E^c} Σ_{t=1}^n ⟨A∗ − A_t, θ⟩ ] + E[ I_E Σ_{t=1}^n ⟨A∗ − A_t, θ⟩ ]
≤ 2 + E[ I_E Σ_{t=1}^n ⟨A∗ − A_t, θ⟩ ]
≤ 2 + E[ Σ_{t=1}^n I_{E_t} ⟨A∗ − A_t, θ⟩ ] .   (36.7)
≤ 2β ‖A_t‖_{V_{t−1}^{−1}} .
Substituting this combined with I_{E_t} ⟨A∗ − A_t, θ⟩ ≤ 2 into the second term of Eq. (36.7), we get

E[ Σ_{t=1}^n I_{E_t} ⟨A∗ − A_t, θ⟩ ] ≤ 2β E[ Σ_{t=1}^n ( 1 ∧ ‖A_t‖_{V_{t−1}^{−1}} ) ]
≤ 2 √( nβ² E[ Σ_{t=1}^n ( 1 ∧ ‖A_t‖²_{V_{t−1}^{−1}} ) ] )   (Cauchy–Schwarz)
≤ 2 √( 2dnβ² E[ log( 1 + nS²L²/d ) ] ) .   (Lemma 19.4)
36.3.1 Computation
An implementation of Thompson sampling for linear bandits needs to sample θt
from the posterior and then find the optimal action for the sampled parameter:
A_t = argmax_{a∈A} ⟨a, θ_t⟩ .
For some priors and noise models, sampling from the posterior is straightforward.
The most notable case is when Q is a multivariate Gaussian and the noise is
Gaussian with a known variance. More generally, there is a large literature devoted
to numerical methods for sampling from posterior distributions. Having sampled
θ_t, finding A_t is a linear optimisation problem. By comparison, LinUCB needs to solve a bilinear optimisation problem, maximising ⟨a, θ̃⟩ jointly over the action set and the confidence set, which is computationally harder in general.
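When the posterior is Gaussian, the sampling step is a single multivariate normal draw. A minimal sketch of one round (all names and the covariance scaling r are illustrative):

```python
import numpy as np

def lin_ts_action(actions, V, theta_hat, r, rng):
    """One round of linear Thompson sampling with a Gaussian posterior (sketch).

    actions: (m, d) candidate actions, V: regularised Gram matrix,
    theta_hat: regularised least-squares estimate, r: covariance scaling."""
    theta_t = rng.multivariate_normal(theta_hat, r * np.linalg.inv(V))
    return int(np.argmax(actions @ theta_t))   # a linear optimisation over A
```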
which means that At and X are conditionally independent given Ht−1 . This is
consistent with our definition of the model where X is sampled first from Q and
then At depends on X only through the history Ht−1 . The optimal action is
A∗ = argmax_{a∈[k]} Σ_{t=1}^n X_{ta}, with ties broken arbitrarily. The Bayesian regret is

BR_n = E[ Σ_{t=1}^n ( X_{tA∗} − X_{tA_t} ) ] .
Like in the previous sections, Thompson sampling is a policy π = (πt )nt=1 that
plays each action according to the conditional probability that it is optimal, which
means the following holds almost surely:
Theorem 36.5. The Bayesian regret of Thompson sampling for Bayesian k-armed
adversarial bandits satisfies
BR_n ≤ √( kn log(k)/2 ) .
The proof is done through a generic theorem that is powerful enough to analyse
a wide range of settings. For stating this result, we need some preparation. Let
Ft = σ(A1 , X1A1 , . . . , At , XtAt ) and Et [·] = E[· | Ft ] and Pt (·) = P( · | Ft ). Let
∆t = XtA∗ − XtAt denote the immediate regret of round t.
The promised generic theorem bounds the regret in terms of an ‘information
ratio’ that depends on the ratio of the squared expected instantaneous regret
conditioned on the past and a Bregman divergence with respect to some convex
function F to be chosen later.
where the first inequality follows from Fatou’s lemma and the second from the
convexity of F. The last equality is because E_{t−1}[M_t] = M_{t−1}. Hence,

BR_n = E[ Σ_{t=1}^n ∆_t ] ≤ E[ Σ_{t=1}^n √( β E_{t−1}[ D_F(M_t, M_{t−1}) ] ) ]
≤ √( βn E[ Σ_{t=1}^n E_{t−1}[ D_F(M_t, M_{t−1}) ] ] ) ≤ √( βn diam_F(P_{k−1}) ) ,
where the first inequality follows from the assumption in the theorem, the second
by Cauchy–Schwarz, while the third follows by Eq. (36.8), telescoping and the
definition of the diameter.
It remains to choose F and show that the condition of the previous result can
be met. As you might have guessed, a good choice is the unnormalised negentropy
potential F(p) = Σ_{a=1}^k ( p_a log(p_a) − p_a ). Remember that in this case the resulting
Bregman divergence DF (p, q) is the relative entropy, D(p, q), between categorical
distributions parameterised by p and q, respectively.
Lemma 36.7. If Xti ∈ [0, 1] almost surely for all t ∈ [n] and i ∈ [k] and At is
chosen by Thompson sampling using any prior, then
E_{t−1}[∆_t] ≤ √( (k/2) E_{t−1}[ D( P_t(A∗ ∈ ·), P_{t−1}(A∗ ∈ ·) ) ] ) .
Proof Given a measure P, we write PX|Y ( · ) for P(X ∈ · | Y ). In our application
below, X is a random variable, and hence P(X ∈ · | Y ) can be chosen to be a
probability measure by Theorem 3.11. When Y is discrete, we write PX|Y =y (·)
for P(X ∈ · | Y = y). The result follows by chaining Pinsker’s inequality and
Cauchy–Schwarz:
E_{t−1}[∆_t] = Σ_{a=1}^k P_{t−1}(A_t = a) ( E_{t−1}[X_{ta} | A∗ = a] − E_{t−1}[X_{ta}] )
≤ Σ_{a=1}^k P_{t−1}(A_t = a) √( (1/2) D( P_{t−1, X_{ta}|A∗=a}, P_{t−1, X_{ta}} ) )
≤ √( (k/2) Σ_{a=1}^k P_{t−1}(A_t = a)² D( P_{t−1, X_{ta}|A∗=a}, P_{t−1, X_{ta}} ) )
≤ √( (k/2) Σ_{a=1}^k Σ_{b=1}^k P_{t−1}(A_t = a) P_{t−1}(A∗ = b) D( P_{t−1, X_{ta}|A∗=b}, P_{t−1, X_{ta}} ) )
= √( (k/2) E_{t−1}[ D( P_t(A∗ ∈ ·), P_{t−1}(A∗ ∈ ·) ) ] ) ,
where the final equality follows from Bayes’ law and is left as an exercise.
Proof of Theorem 36.5 The result follows by combining Lemma 36.7, Theo-
rem 36.6 and the fact that the diameter of the unnormalised negentropy potential
is diamF (Pk−1 ) = log(k).
36.5 Notes
3 For the Gaussian noise model, it is known that Thompson sampling is not minimax optimal. Its worst-case regret is R_n = Θ(√(nk log(k))) [Agrawal and Goyal, 2013a].
4 An alternative to sampling from the posterior is to choose in each round
the arm that maximises a Bayesian upper confidence bound, which is a
quantile of the posterior. The resulting algorithm is called BayesUCB and
has excellent empirical and theoretical guarantees [Kaufmann et al., 2012a,
Kaufmann, 2018].
5 The prior has a significant effect on the performance of Thompson sampling.
In classical Bayesian statistics, a poorly chosen prior is quickly washed away by
data. This is not true in (stochastic, non-Bayesian) bandits because if the prior
underestimates the quality of an arm, then Thompson sampling may never
play that arm with high probability and no data is ever observed. We ask you
to explore this situation in Exercise 36.16.
6 An instantiation of Thompson sampling for stochastic contextual linear bandits
is known to enjoy near-optimal frequentist regret. In each round the algorithm
samples θ_t ∼ N(θ̂_{t−1}, r V_{t−1}^{−1}), where r = Θ(d) is a constant and

V_t = I + Σ_{s=1}^t A_s A_s^⊤   and   θ̂_t = V_t^{−1} Σ_{s=1}^t X_s A_s .
sampling At from the posterior on A∗ , one can sample At from the distribution
Pt given by
P_t = argmin_{p∈P_{k−1}}  ( Σ_{a=1}^k p_a ( E_{t−1}[X_{ta} | A∗ = a] − E_{t−1}[X_{ta}] ) )²  /  ( Σ_{a=1}^k p_a E_{t−1}[ D_F( P_{t−1}(A∗ = · | X_{ta}), P_{t−1}(A∗ = ·) ) ] ) .
where the second equality follows from Sion’s minimax theorem (Exercise 36.11)
and the inequality follows from Theorem 36.5. This bound is a factor
of two better than what we gave in Theorem 11.2 and can be improved to √(2nk) using
the argument from the previous note and Exercise 36.10. The approach has
been used in more sophisticated settings, like the first near-optimal analysis
for adversarial convex bandits [Bubeck et al., 2015a, Bubeck and Eldan, 2016]
or partial monitoring [Lattimore and Szepesvári, 2019c]. As noted earlier, the
main disadvantage is that the technique does not lead to algorithms for the
adversarial setting.
Thompson sampling has the honor of being the first bandit algorithm and
is named after its inventor [Thompson, 1933], who considered the Bernoulli
case with two arms. Thompson provided no theoretical guarantees, but argued
intuitively and gave hand-calculated empirical analysis. It would be wrong to
say that Thompson sampling was entirely ignored for the next eight decades,
but it was definitely not popular until recently, when a large number of authors
independently rediscovered the article/algorithm [Graepel et al., 2010, Granmo,
2010, Ortega and Braun, 2010, Chapelle and Li, 2011, May et al., 2012]. The
surge in interest was mostly empirical, but theoreticians followed soon with regret
guarantees. For the frequentist analysis, we followed the proofs by Agrawal and
Goyal [2012, 2013a], but the setting is slightly different. We presented results for
the ‘realisable’ case where the pay-off distributions are actually Gaussian, while
Agrawal and Goyal use the same algorithm but prove bounds for rewards bounded
in [0, 1]. Agrawal and Goyal [2013a] also analyse the Beta/Bernoulli variant of
Thompson sampling, which for rewards in [0, 1] is asymptotically optimal in
the same way as KL-UCB (see Chapter 10). This result was simultaneously
obtained by Kaufmann et al. [2012b], who later showed that for appropriate
priors, asymptotic optimality also holds for single-parameter exponential families
[Korda et al., 2013]. For Gaussian bandits with unknown mean and variance,
Thompson sampling is asymptotically optimal for some priors, but not others –
even quite natural ones [Honda and Takemura, 2014]. The Bayesian analysis of
Thompson sampling based on confidence intervals is due to Russo and Van Roy
[2014b]. Recently the idea has been applied to a wide range of bandit settings
[Kawale et al., 2015, Agrawal et al., 2017] and reinforcement learning [Osband
et al., 2013, Gopalan and Mannor, 2015, Leike et al., 2016, Kim, 2017]. The
BayesUCB algorithm is due to Kaufmann et al. [2012a], with improved analysis
and results by Kaufmann [2018]. The frequentist analysis of Thompson sampling
for linear bandits is by Agrawal and Goyal [2013b], with refined analysis by
Abeille and Lazaric [2017a] and a spectral version by Kocák et al. [2014]. A recent
paper analyses the combinatorial semi-bandit setting [Wang and Chen, 2018].
The information-theoretic analysis is by Russo and Van Roy [2014a, 2016], while
the generalising beyond the negentropy potential is by Lattimore and Szepesvári
[2019c]. As we mentioned, these ideas have been applied to convex bandits [Bubeck
et al., 2015a, Bubeck and Eldan, 2016] and also to partial monitoring [Lattimore
and Szepesvári, 2019c]. There is a tutorial on Thompson sampling by Russo
et al. [2018] that focuses mostly on applications and computational issues. We
mentioned there are other ways to configure Algorithm 24, for example the recent
article by Kveton et al. [2019].
36.7 Exercises
36.2 (Filling in steps in the proof of Theorem 36.1 (i)) Consider the
event E defined in Theorem 36.1, and prove that P (E c ) ≤ 2nkδ.
36.3 (Filling in steps in the proof of Theorem 36.1 (ii)) Prove Eq. (36.1).
Hint Replace the naive confidence intervals used in the proof of Theorem 36.1
by the more refined confidence bounds used in Chapter 9. The source for this
result is the paper by Bubeck and Liu [2013].
36.5 (Filling in steps in the proof of Theorem 36.2) Let Gi (s) =
1 − Fis (µ1 − ε). Show that
(a) Σ_{t∈T} I{A_t = i} ≤ Σ_{s=1}^n I{G_i(s − 1) > 1/n} ; and
(b) E[ Σ_{t∉T} I{E_i^c(t)} ] ≤ E[ Σ_{t∉T} 1/n ] .
(a) Show that there exists a universal constant c > 0 such that

E[ Σ_{s=0}^∞ ( 1/G_{1s}(ε) − 1 ) ] ≤ (c/ε²) log(1/ε) .
(c) Use Theorem 36.2 and the fundamental regret decomposition (Lemma 4.5)
to prove Theorem 36.3.
Hint For (a) you may find it useful to know that for y ≥ 0,

1 − Φ(y) ≥ √(2/π) · exp(−y²/2) / ( y + √(y² + 4) ) ,

where Φ(y) = (1/√(2π)) ∫_{−∞}^y exp(−x²/2) dx is the cumulative distribution function of the standard Gaussian [Abramowitz and Stegun, 1964, §7.1.13].
36.7 Prove the final equality in the proof of Lemma 36.7.
36.9 (Information-directed sampling) Prove that for any prior such that
Xti ∈ [0, 1] almost surely, the Bayesian regret of information-directed sampling
(see Note 8) satisfies
BR_n ≤ √( kn log(k)/2 ) .
36.10 (Minimax Bayesian regret for Thompson sampling) Prove that for
any prior over adversarial k-armed bandits such that X_{ti} ∈ [0, 1] almost surely, the Bayesian regret of Thompson sampling satisfies BR_n ≤ √(2kn).
Hint Use the potential F(p) = −2 Σ_{i=1}^k √(p_i) and the fact that the total variation distance is upper-bounded by the Hellinger distance.
36.11 (From Bayesian to adversarial regret) Let E = {0, 1}n×k and Q
be the space of probability measures on E. Prove that
Hint Repeat the argument in the solution to Exercise 34.16, noting that Q is
finite dimensional. Take care to adapt the result in Exercise 4.4 to the adversarial
setting.
36.12 (From Bayesian to adversarial regret) Let E = [0, 1]n×k . Prove
that
Hint Think about how to use a minimax optimal policy for {0, 1}n×k for
bandits in [0, 1]n×k .
36.14 (Implementation (i)) In this exercise, you will reproduce the results in
Experiment 1.
(a) Implement Thompson sampling as described in Theorem 36.3 as well as
UCB and AdaUCB.
(b) Reproduce the figures in Experiment 1 as well as UCB.
(c) How consistent are these results across different bandits? Run a few
experiments and report the results.
(d) Explain your findings. Which algorithm do you prefer and why.
36.16 (Misspecified prior) Fix a Gaussian bandit with unit variance and
mean vector µ = (0, 1/10) and horizon n = 1000. Now consider Thompson
sampling with a Gaussian model with known unit covariance and a prior on the
unknown mean of each arm given by a Gaussian distribution with mean µP and
covariance σP2 I.
(a) Let the prior mean be µP = (0, 0), and plot the regret of Thompson sampling
as a function of the prior variance σP2 .
(b) Repeat the above with µP = (0, 1/10) and (0, −1/10) and (2/10, 1/10).
(c) Explain your results.
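A minimal simulation sketch for this exercise (illustrative names; it assumes unit-variance Gaussian rewards and the usual conjugate Gaussian posterior updates):

```python
import numpy as np

def ts_misspecified(n, mu, mu_prior, var_prior, rng):
    """Thompson sampling with prior N(mu_prior, var_prior * I); returns the regret."""
    k = len(mu)
    prec = np.full(k, 1.0 / var_prior)       # posterior precision of each arm
    mean = np.array(mu_prior, dtype=float)   # posterior mean of each arm
    regret = 0.0
    for _ in range(n):
        theta = rng.normal(mean, 1.0 / np.sqrt(prec))
        a = int(np.argmax(theta))
        x = rng.normal(mu[a], 1.0)
        mean[a] = (prec[a] * mean[a] + x) / (prec[a] + 1.0)
        prec[a] += 1.0
        regret += max(mu) - mu[a]
    return regret
```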
Part VIII
Beyond Bandits
37 Partial Monitoring
While in a bandit problem, the feedback that the learner receives from the
environment is the loss of the chosen action, in partial monitoring the coupling
between the loss of the action and the feedback received by the learner is loosened.
Consider the problem of learning to match
pennies when feedback is costly. Let c > 0 be
a known constant. At the start of the game, the
adversary secretly chooses a sequence i1 , . . . , in ∈
{heads, tails}. In each round, the learner chooses
an action At ∈ {heads, tails, uncertain}. The
loss for choosing action a in round t is
y_{ta} = 0 if a = i_t ;  c if a = uncertain ;  and 1 otherwise.

Figure 37.1 Spam filtering is a potential application of partial monitoring. The turtle (called Spam) was inherited by one of the authors.
So far this looks like a bandit problem. The difference is that the learner
never directly observes ytAt . Instead, the learner observes nothing unless
At = uncertain, in which case they observe the value of it . As usual, the
goal is to minimise the (expected) regret, which is

R_n = max_{a∈[k]} E[ Σ_{t=1}^n ( y_{tA_t} − y_{ta} ) ] .
How should a learner act in problems like this, where the loss is not directly
observed? Can we find a policy with sublinear regret? In this chapter we give
a more or less complete answer to these questions for finite adversarial partial
monitoring games, which include the above problems as a special case.
Matching pennies with costly feedback seems like an esoteric problem. But
think about adding contextual information and replace the pennies with
emails to be classified as spam or otherwise. The true label is only accessible
by asking a human, which replaces the third action. While the chapter does
not cover the contextual version, some pointers to the literature are added
at the end.
37.1 Finite Adversarial Partial Monitoring Problems
We omit the arguments of Rn when they can be inferred from the context.
37.1.1 Examples
The partial monitoring framework is rich enough to model a wide variety of
problems, a few of which are illustrated in the examples that follow. Many of the
examples are quite artificial and are included only to highlight the flexibility of
the framework and challenges of making the regret small.
Two feedback matrices Φ ∈ Σ^{k×d} and Φ̃ ∈ Σ̃^{k×d} encode the same information if the pattern of identical entries in each row matches. For example,

Φ = [ ⊥ 4 ⊥ ]        Φ̃ = [ ♦ ♠ ♦ ]
    [ 1 2 2 ]    and      [ ♣ ♥ ♥ ]
    [ 3 1 1 ]             [ ♣ ♥ ♥ ]

both encode the same information. Note that for these matrices m = 2 since in any row there are at most two distinct symbols.
Example 37.2 (Trivial problem). Just as there are hopeless problems, there are
also trivial problems. This happens when one action dominates all others as in
the following problem:
L = [ 0 0 ] ,    Φ = [ ⊥ ⊥ ] .
    [ 1 1 ]          [ ⊥ ⊥ ]
In this game the learner can safely ignore the second action and suffer zero regret,
regardless of the choices of the adversary.
Matching pennies is a hard game for c > 1/2 in the sense that the adversary can
force the regret of any policy to be at least Ω(n2/3 ). To see this, consider the
randomised adversary that chooses the first outcome with probability p and the
second with probability 1 − p. Let ε > 0 be a small constant to be chosen later
and assume p is either 1/2 + ε or 1/2 − ε. The techniques in Chapter 13 show that
the learner can only distinguish between these environments by playing the third
action about 1/ε2 times. If the learner does not choose to do this, then the regret
is expected to be Ω(nε). Taking these together shows the regret is lower-bounded
by Rn = Ω(min(nε, (c − 1/2 + ε)/ε2 )). Choosing ε = n−1/3 leads to a bound
of Rn = Ω((c − 1/2)n2/3 ). Notice that the argument fails when c ≤ 1/2. We
encourage you to pause for a minute to convince yourself about the correctness of
the above argument and to consider what might be the situation when c ≤ 1/2.
The number of columns for this game is 2^k. For non-binary rewards, you would need even more columns. A partial monitoring problem where Φ = L can be called a bandit problem because the learner observes the loss of the chosen action. In bandit games, Exp3 from Chapter 11 guarantees a regret of O(√(kn log(k))), and as noted there, a more sophisticated algorithm will also remove the log(k) factor. If you completed Exercise 15.4, then you will know that, up to a constant factor, √(kn) is also the best possible regret in adversarial bandits with binary losses.
Example 37.5 (Full information problems). One can also represent problems
where the learner observes all the losses. With binary losses and two actions, we
have
L = [ 0 1 0 1 ] ,    Φ = [ 1 2 3 4 ] .
    [ 0 0 1 1 ]          [ 1 2 3 4 ]
Like for bandits, the size of the game grows quickly as more actions/outcomes
are added. A partial monitoring game where Φai = i for all a ∈ [k] and i ∈ [d]
can be called full information because the signal reveals the losses for all actions.
Example 37.6 (Dynamic pricing). A charity worker is going door to door selling
calendars. The marginal cost of a calendar is close to zero, but the wages of the
door knocker represents a fixed cost of c > 0 per occupied house. The question
is how to price the calendar. Each round corresponds to an attempt to sell a
calendar, and the action is the seller’s asking price from one of d choices. The
potential buyer will purchase the calendar if the asking price is low enough. Below
we give the corresponding matrices for the case where both the candidate asking prices and the possible values for the buyer's private valuations are {$1, $2, $3, $4}:

L = [ c−1  c−1  c−1  c−1 ]        Φ = [ Y Y Y Y ]
    [ c    c−2  c−2  c−2 ]            [ N Y Y Y ]
    [ c    c    c−3  c−3 ]            [ N N Y Y ]
    [ c    c    c    c−4 ] ,          [ N N N Y ] .
Notice that observing the feedback is sufficient to deduce the loss so the problem
could be tackled with a bandit algorithm. But there is additional structure in
the losses here because the learner knows that if a calendar did not sell for $3,
then it would not sell for $4.
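For concreteness, here is one way (not from the text) to build the loss and feedback matrices of this example for an arbitrary grid of asking prices, which double as the possible valuations:

```python
import numpy as np

def dynamic_pricing_game(prices, c):
    """Loss matrix L and feedback matrix Phi for the dynamic pricing example (sketch).
    Row a: asking price prices[a]; column i: buyer valuation prices[i]."""
    d = len(prices)
    L = np.empty((d, d))
    Phi = np.empty((d, d), dtype=object)
    for a in range(d):
        for i in range(d):
            sold = prices[a] <= prices[i]
            L[a, i] = c - prices[a] if sold else c
            Phi[a, i] = "Y" if sold else "N"
    return L, Phi

L, Phi = dynamic_pricing_game([1, 2, 3, 4], c=0.5)
```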
37.2 The Structure of Partial Monitoring
which is a convex polytope. The collection {Ca : a ∈ [k]} is called the cell
decomposition of Pd−1 . Actions with Ca = ∅ are called dominated because
they are never optimal, no matter how the adversary plays. For non-dominated
actions we define the dimension of an action to be the dimension of the affine
hull of Ca . Readers unfamiliar with the affine hull should read Note 4 at the
end of the chapter. A non-dominated action is called Pareto optimal if it has
dimension d − 1, and degenerate otherwise. Actions a and b are duplicates if ℓ_a = ℓ_b.
Neighbourhood relation
Pareto optimal actions a and b are neighbours if Ca ∩ Cb has dimension d − 2.
Note that if a and b are Pareto optimal duplicates, then Ca ∩ Cb has dimension
d − 1, and the definition means that a and b are not neighbours. For Pareto
optimal action a we let Na be the set consisting of a and its neighbours. Given
a pair of neighbours e = (a, b), we let Ne = Nab = {c ∈ [k] : Ca ∩ Cb ⊆ Cc } to
be the set of actions that are incident to e. The neighbourhood relation defines
an undirected graph over [k] with edges E = {(a, b) : a and b are neighbours},
which is called the neighbourhood graph.
The next result, which shows the connectedness of the neighborhood graph
induced by a set of actions whose cells cover the whole simplex, will play an
important role in subsequent proofs:
Lemma 37.7. Suppose that S is any set of Pareto optimal actions such that
∪a∈S Ca = Pd−1 . Then the graph with vertices S and edges from E is connected.
Let e = (a, b) ∈ E. The next lemma characterises actions in Ne as either a, b,
duplicates of a, b or degenerate actions c for which `c is a convex combination of
`a and `b . The situation is illustrated when d = 2 in Fig. 37.2.
Lemma 37.8. Let e = (a, b) ∈ E be neighbouring actions and let c ∈ N_e be an action such that ℓ_c ∉ {ℓ_a, ℓ_b}. Then
Proof We use the fact that if X ⊆ Y ⊆ R^d and dim(X) = dim(Y), then aff(X) = aff(Y) (Exercise 37.2). Introduce ker₀(x) = {u ∈ R^d : u^⊤x = 0, u^⊤1 = 1}. Clearly, C_a ∩ C_b ⊆ C_a ∩ C_c, and aff(C_a ∩ C_b) = ker₀(ℓ_a − ℓ_b) and aff(C_a ∩ C_c) = ker₀(ℓ_a − ℓ_c). By assumption, dim(C_a ∩ C_b) = d − 2. Since C_a ∩ C_b ⊆ C_a ∩ C_c, it holds that dim(C_a ∩ C_c) ≥ d − 2. Furthermore, dim(C_a ∩ C_c) ≤ d − 2, since otherwise ℓ_c = ℓ_a. Hence dim(C_a ∩ C_c) = d − 2, and thus by the fact mentioned and our earlier
Figure 37.2 The figure shows the situation when d = 2 and `1 = (1, 0) and `2 = (0, 1)
and `3 = (1/2, 1/2). The x axis corresponds to P1 = [0, 1], the y axis to the losses.
Then C1 = [0, 1/2] and C2 = [1/2, 1], which both have dimension 1 = d − 1. Then
C3 = {1/2} = C1 ∩ C2 , which has dimension 0.
findings, ker₀(ℓ_a − ℓ_b) = ker₀(ℓ_a − ℓ_c). This implies (Exercise 37.3) that ℓ_a − ℓ_b is proportional to ℓ_a − ℓ_c, so that (1 − α)(ℓ_a − ℓ_b) = ℓ_a − ℓ_c for some α ≠ 1. Rearranging shows that

ℓ_c = αℓ_a + (1 − α)ℓ_b .

Now we show that α ∈ (0, 1). First note that α ∉ {0, 1}, since otherwise ℓ_c ∈ {ℓ_a, ℓ_b}. Let u ∈ C_a be such that ⟨ℓ_a, u⟩ < ⟨ℓ_b, u⟩, which exists since dim(C_a) = d − 1 and dim(C_a ∩ C_b) = d − 2. Then

⟨ℓ_a, u⟩ ≤ ⟨ℓ_c, u⟩ = α⟨ℓ_a, u⟩ + (1 − α)⟨ℓ_b, u⟩ = ⟨ℓ_a, u⟩ + (α − 1)⟨ℓ_a − ℓ_b, u⟩ ,

which by the negativity of ⟨ℓ_a − ℓ_b, u⟩ implies that α ≤ 1. A symmetric argument shows that α > 0. For (b), it suffices to show that C_c ⊂ C_a ∩ C_b. By de Morgan's law, for this it suffices to show that P_{d−1} \ (C_a ∩ C_b) ⊂ P_{d−1} \ C_c. Thus, pick some u ∈ P_{d−1} \ (C_a ∩ C_b). The goal is to show that u ∉ C_c. The choice of u implies that there exists an action e such that ⟨ℓ_a − ℓ_e, u⟩ ≥ 0 and ⟨ℓ_b − ℓ_e, u⟩ ≥ 0 with a strict inequality for either a or b (or both). Therefore, using the fact that α ∈ (0, 1), we have

⟨ℓ_c, u⟩ = α⟨ℓ_a, u⟩ + (1 − α)⟨ℓ_b, u⟩ > ⟨ℓ_e, u⟩ ,

which by definition means that u ∉ C_c, completing the proof of (b). Finally, (c) is immediate from (b) and the definition of neighbouring actions.
the differences in losses between Pareto optimal actions and not the actual losses
themselves. In fact, there exist games for which estimating the actual losses is
impossible, but estimating the differences is straightforward:
The learner can never tell if the environment is playing in the first two columns or
the last two, but the differences between the losses of actions are easily deduced
from the feedback no matter the outcome and the action.
Only the loss differences between Pareto optimal actions need to be estimated.
There are games that are easy, but where some loss differences cannot be
estimated. For example, there is never any need to estimate the losses of a
dominated action.
Example 37.10. The partial monitoring problem illustrated in Fig. 37.3 has six
actions, three feedbacks and three outcomes. The cell decomposition is shown on
the right with the 2-simplex parameterised by its first two coordinates u1 and
u2 so that u3 = 1 − u2 − u1 . Actions 1, 2 and 3 are Pareto optimal. There are
no dominated actions while actions 4 and 5 are 1-dimensional and action 6 is 0-
dimensional. The neighbours are (1, 3) and (2, 3), which are both locally observable,
and so the game is locally observable. Note that (1, 2) are not neighbours because
the intersection of their cells is (d − 3)-dimensional. Finally, N3 = {1, 2, 3} and
N1 = {1, 3} and N23 = {2, 3, 4}. Think about how we decided on what losses to
use to get the cell decomposition shown in Fig. 37.3.
L = [ 0    1    1   ]        Φ = [ 1 2 3 ]
    [ 1    0    1   ]            [ ⊥ ⊥ ⊥ ]
    [ 1/2  1/2  1/2 ]            [ ⊥ ⊥ ⊥ ]
    [ 3/4  1/4  3/4 ]            [ 1 2 3 ]
    [ 1    1/2  1/2 ]            [ ⊥ ⊥ ⊥ ]
    [ 1    1/4  3/4 ]            [ ⊥ ⊥ ⊥ ]

(Figure: the cell decomposition, drawn in the coordinates (u1, u2) of the 2-simplex and showing the cells C1, . . . , C6.)
The terminology in the last section finally allows us to state the main theorem of
this chapter that classifies finite adversarial partial monitoring games.
The proof is split into parts by proving upper and lower bounds for each part.
First up is the lower bounds. We then describe a policy and analyse its regret.
Like for bandits, the lower bounds are most easily proven using a stochastic
adversary. In stochastic partial monitoring, we assume that u1 , . . . , un are
chosen independently at random from the same distribution. To emphasise
the randomness, we switch to capital letters. Given a partial monitoring game
G = (L, Φ) and probability vector u ∈ Pd−1 , the stochastic partial monitoring
environment associated with u samples a sequence of independently and identically
distributed random variables I1 , . . . , In with P (It = i) = ui and Ut = eIt . In each
round t, a policy chooses action At and receives feedback σt = ΦAt It . The regret
is

R_n(π, u) = max_{a∈[k]} E[ Σ_{t=1}^n ⟨ℓ_{A_t} − ℓ_a, U_t⟩ ] = max_{a∈[k]} E[ Σ_{t=1}^n ⟨ℓ_{A_t} − ℓ_a, u⟩ ] .
The reader should check that Rn∗ (G)≥ inf π maxu∈Pd−1 Rn (π, u), which allows
us to restrict our attention to stochastic partial monitoring problems. Given
u, q ∈ Pd−1 , let D(u, q) be the relative entropy between categorical distributions
with parameters u and q respectively:
D(u, q) = Σ_{i=1}^d u_i log( u_i / q_i ) ≤ Σ_{i=1}^d ( u_i − q_i )² / q_i ,   (37.4)
where the second inequality follows from the fact that for measures P, Q we have
D(P, Q) ≤ χ2 (P, Q) (see Note 6 in Chapter 13).
Theorem 37.12. Let G = (L, Φ) be a globally observable partial monitoring
problem that is not locally observable. Then there exists a constant cG > 0 such
that Rn∗ (G) ≥ cG n2/3 .
Proof The proof involves several steps. Roughly, we need to define two alternative
stochastic partial monitoring problems. We then show these environments are
hard to distinguish without playing an action associated with a large loss. Finally
we balance the cost of distinguishing the environments against the linear cost of
playing randomly. Without loss of generality assume that Σ = [m].
Figure 37.4 Lower-bound construction for hard partial monitoring problems. Shown is
Pd−1 , the cells Ca and Cb of two Pareto optimal actions, and two alternatives ua ∈ Ca
and ub ∈ Cb that induce the same distributions on the outcomes under both a and b.
In this form, it does not seem obvious what the next step should be. To clear
things up, we introduce some linear algebra. Let Sc ∈ {0, 1}m×d be the matrix
with (Sc )σi = I {Φci = σ}, which is chosen so that Sc ei = eΦci . Define the linear
map S : R^d → R^{|N_{ab}|m} by

S = ( S_a ; S_b ; · · · ) ,
which is the matrix formed by stacking the matrices {Sc : c ∈ Nab }. Then, an
elementary argument shows that there exists a function f satisfying Eq. (37.6) if
ℓ_a − ℓ_b = S^⊤ w .
In other words, actions (a, b) are locally observable if and only if ℓ_a − ℓ_b ∈ im(S^⊤). Since we have assumed that (a, b) are not locally observable, we must have ℓ_a − ℓ_b ∉ im(S^⊤). Let z ∈ im(S^⊤) and w ∈ ker(S) be such that ℓ_a − ℓ_b = z + w, which is possible since im(S^⊤) ⊕ ker(S) = R^d. Since ℓ_a − ℓ_b ∉ im(S^⊤), it holds that w ≠ 0 and ⟨ℓ_a − ℓ_b, w⟩ = ⟨z + w, w⟩ = ⟨w, w⟩ ≠ 0. Note also that 1 ∈ im(S^⊤) and hence ⟨1, w⟩ = 0. Finally, let q = w/⟨ℓ_a − ℓ_b, w⟩. By construction, q ∈ R^d, q ≠ 0, while Sq = 0, ⟨ℓ_a − ℓ_b, q⟩ = 1 and ⟨1, q⟩ = 0. Let ∆ > 0 be some small constant
to be chosen subsequently. With this, we define ua = u − ∆q and ub = u + ∆q so
that
where we used that u ∈ Ca ∩ Cb is not on the boundary of Pd−1 , so ui > 0 for all
i and we defined C̃u as a suitably large constant that depends on u (q is entirely
determined by a and b). Therefore,
D(P_{u_a}, P_{u_b}) ≤ C̃_u Σ_{c∉N_{ab}} E[T_c(n)] ∆² .   (37.9)
and without the loss of generality, we assumed that the losses lie in [0, 1]. Define
T̃ (n) to be the number of times an arm not in Nab is played:
T̃(n) = Σ_{c∉N_{ab}} T_c(n) .
By Lemma 37.8, for each action c ∈ N_{ab}, there exists an α ∈ [0, 1] such that ℓ_c = αℓ_a + (1 − α)ℓ_b. Therefore, by Eq. (37.7),
It also follows from (37.10) that if c ∈ N_{ab} and ⟨ℓ_c − ℓ_a, u_a⟩ < ∆/2, then ⟨ℓ_c − ℓ_b, u_b⟩ ≥ ∆/2. Hence, under u_b, the random pseudo-regret, Σ_c T_c(n)⟨ℓ_c − ℓ_b, u_b⟩, is at least (n − T̄(n))∆/2. Assume that ∆ is chosen sufficiently small so that ∆‖q‖₁ ≤ ε/2. By the above,
R_n(π, u_a) + R_n(π, u_b)
= E_{u_a}[ Σ_{c∈[k]} T_c(n) ⟨ℓ_c − ℓ_a, u_a⟩ ] + E_{u_b}[ Σ_{c∈[k]} T_c(n) ⟨ℓ_c − ℓ_b, u_b⟩ ]
≥ (ε/2) E_{u_a}[T̃(n)] + (n∆/4) ( P_{u_a}(T̄(n) ≥ n/2) + P_{u_b}(T̄(n) < n/2) )
≥ (ε/2) E_{u_a}[T̃(n)] + (n∆/8) exp( −D(P_{u_a}, P_{u_b}) )
≥ (ε/2) E_{u_a}[T̃(n)] + (n∆/8) exp( −C̃_u ∆² E_{u_a}[T̃(n)] ) ,

where the second inequality follows from the Bretagnolle–Huber inequality (Theorem 14.2) and the third from Eqs. (37.8) and (37.9). The bound is completed by choosing

∆ = min{ min_{i:q_i≠0} u_i/(2|q_i|) ,  ε/(2‖q‖₁ n^{1/3}) } ,
We leave the following theorems as exercises for the reader (Exercises 37.8
and 37.9):
Theorem 37.13. If G is not globally observable and has at least two non-
dominated actions, then there exists a constant cG > 0 such that Rn∗ (G) ≥ cG n.
small and ua = u−∆q and ub = u+∆q. Show that D(Pua , Pub ) = 0 for all policies
and complete the proof in the same fashion as the proof of Theorem 37.12.
Theorem 37.14. Let G = (L, Φ) be locally observable and have at least one pair of neighbours. Then there exists a constant c_G > 0 such that for all large enough n the minimax regret satisfies R_n^∗(G) ≥ c_G √n.
Proof sketch By assumption, there exists a pair of neighbouring actions (a, b).
Define u as the centroid of C_a ∩ C_b and let u_a and u_b be the centroids of C_a and C_b respectively. For sufficiently small ∆ > 0, let v_a = (1 − ∆)u + ∆u_a and v_b = (1 − ∆)u + ∆u_b. Then
D(P_{v_a}, P_{v_b}) ≤ n Σ_{i=1}^d ( v_{ai} − v_{bi} )² / v_{bi} ≤ c_G n∆² ,

where c_G > 0 is a game-dependent constant. Let ∆ = 1/√n and apply the ideas in the proof of Theorem 37.12.
We now describe a policy for globally and locally observable games, and prove its regret is O(n^{1/2}) for locally observable games and O(n^{2/3}) otherwise. For the remainder of this section, fix a globally observable game G = (L, Φ). The estimation functions in E^{glo}_{ab} and E^{loc}_{ab} are designed to combine with importance-weighting to estimate the loss differences between actions a and b. For this section, it is more convenient to define estimation functions for the whole loss vector up to constant shifts. Let E^{vec} be the set of all functions f : [k] × Σ → R^k such that:
The intuition is that E vec is the set of functions that serve as unbiased loss
difference estimators in the sense that when A ∼ p ∈ ri(Pk−1 ), then
E[ ⟨e_a − e_b, f(A, Φ_{Ai})⟩ / p_A ] = ⟨ℓ_a − ℓ_b, e_i⟩   for all Pareto optimal a, b and i ∈ [d] .
As we will see in the proof of Theorem 37.16, if G is globally observable, then
Q_{ta} = exp( −η Σ_{s=1}^{t−1} ŷ_{sa} ) / Σ_{b=1}^k exp( −η Σ_{s=1}^{t−1} ŷ_{sb} ) ,   t ∈ [n] ,
Σ_{t=1}^n Σ_{a=1}^k Q_{ta} ( ŷ_{ta} − ŷ_{ta∗} ) ≤ log(k)/η + (1/η) Σ_{t=1}^n Ψ_{Q_t}(η ŷ_t) .   (37.11)
The definition of global observability does not imply that loss differences
between dominated and degenerate actions can be estimated. Consequentially,
the distribution Qt used by the new algorithm will be supported on Pareto
optimal actions only. The actual distribution Pt used when choosing an
action may also include degenerate actions, however.
Of course, optq (η) depends on the game G, which is hidden from the notation
to reduce clutter. The first term in the right-hand side of Eq. (37.12) measures
the additional regret when playing p rather than q, while the second corresponds
to the expectation of the second term in Eq. (37.11) when the algorithm uses
importance-weighting using estimation function f . The optimisation problem is
convex and hence amenable to efficient computation (see Note 9 for some details).
The worst-case value over all q is
The function q 7→ optq (η) is generally not convex, so opt∗ (η) may be hard to
compute. This causes a minor problem when setting the learning rate, which can
be mitigated by adapting the learning rate online as discussed in Note 7.
We say that f ∈ E vec and p ∈ ri(Pk−1 ) solve Eq. (37.12) with precision
ε ≥ 0 if

max_{i∈[d]} [ (1/η) (p − q)^⊤ L e_i + (1/η²) Σ_{a=1}^k p_a Ψ_q( ηf(a, Φ_{ai}) / p_a ) ] ≤ opt_q(η) + ε .   (37.13)
Such approximately optimal solutions exist for any ε > 0, but may not exist
for ε = 0 because the constraint on p is not compact.
The convexity of the inner maximum in Eq. (37.12) can be checked using the
following construction. The perspective of a convex function f : Rd → R
is a function g : Rd+1 → R given by
g(x, u) = u f(x/u)  if u > 0 ,  and  g(x, u) = ∞  otherwise .   (37.14)
The perspective is known to be convex (Exercise 37.1). Since Ψq is convex
and the max of convex functions is convex, it follows that the term inside of
the infimum of Eq. (37.12) is convex.
Theorem 37.15. For any η > 0 and ε > 0, the regret of Algorithm 26 is bounded
by
R_n ≤ log(k)/η + nη ( opt^∗(η) + ε ) .
Proof The result follows from the definitions of E^{vec} and the regret, and the bound for exponential weights in Eq. (37.11). Let a∗ = argmin_{a∈Π} Σ_{t=1}^n ⟨ℓ_a, u_t⟩.
1: Input: η, ε, L and Φ
2: for t ∈ 1, . . . , n do
3:   Compute the exponential weights distribution Q_t ∈ P_{k−1} by
     Q_{ta} = I_Π(a) exp( −η Σ_{s=1}^{t−1} ŷ_{sa} ) / Σ_{b∈Π} exp( −η Σ_{s=1}^{t−1} ŷ_{sb} ) .
Then,

R_n = E[ Σ_{t=1}^n ( L_{A_t i_t} − L_{a∗ i_t} ) ]
= E[ Σ_{t=1}^n Σ_{a=1}^k P_{ta} ( L_{a i_t} − L_{a∗ i_t} ) ]
= E[ Σ_{t=1}^n Σ_{a=1}^k Q_{ta} ( L_{a i_t} − L_{a∗ i_t} ) ] + E[ Σ_{t=1}^n Σ_{a=1}^k ( P_{ta} − Q_{ta} ) L_{a i_t} ]
= E[ Σ_{t=1}^n Σ_{a=1}^k Q_{ta} ( ŷ_{ta} − ŷ_{ta∗} ) ] + E[ Σ_{t=1}^n Σ_{a=1}^k ( P_{ta} − Q_{ta} ) L_{a i_t} ] .
The first expectation is bounded using the definition of Q_t and Eq. (37.11) by

E[ Σ_{t=1}^n Σ_{a=1}^k Q_{ta} ( ŷ_{ta} − ŷ_{ta∗} ) ] ≤ log(k)/η + (1/η) E[ Σ_{t=1}^n Ψ_{Q_t}(η ŷ_t) ]
= log(k)/η + (1/η) E[ Σ_{t=1}^n Σ_{a=1}^k P_{ta} Ψ_{Q_t}( ηf_t(a, Φ_{a i_t}) / P_{ta} ) ] .
Combining the two displays, using the definitions of Pt , ft and εt , and substituting
the definition of optQt (η) ≤ opt∗ (η) completes the proof.
The extent to which this result is useful depends on the behaviour of opt∗ (η)
for different classes of games. The following two theorems bound the value of
the optimisation problem for globally observable and locally observable games
respectively. An apparently important quantity in the regret upper bounds for
both globally and locally observable games is the minimum magnitude of the
In the remainder of this chapter we assume that the losses are between zero and one: L ∈ [0, 1]^{k×d}.

Theorem 37.16. For all globally observable games, opt^∗(η) ≤ 2 v_{glo} k²/√η for all η ≤ 1/max{1, v_{glo}² k⁴}.

Theorem 37.17. For all locally observable games, opt^∗(η) ≤ 9k³ max(1, v_{loc}²) for all η ≤ 1/(2k max(1, v_{loc}²)).
By using Theorem 37.16, it follows that for globally observable games the regret is bounded by

R_n = O( (v_{glo} k n)^{2/3} (log(k))^{1/3} ) .
These results establish the upper bounds in the classification theorem for locally
and globally observable games. The quantities vglo and vloc only depend on G but
may be exponentially large in d. We walk you through the proof of the following
proposition in Exercise 37.12.
The only property of non-degenerate games used in Part (c) is that |Ne | = 2
for all e ∈ E. It is illustrative to bound opt∗ (η) for well-known games. The next
proposition shows that Algorithm 26 recovers the usual bounds for bandits and
the full information setting.
You will prove this proposition in Exercise 37.14 by making explicit choices of
p ∈ ri(Pk−1 ) and f ∈ E vec .
37.6 Proof of Theorem 37.16
Global (and local) observability is defined in terms of the existence of functions serving as unbiased estimators of loss differences between pairs of neighbouring actions. To make a connection between E^{vec} and E^{glo} (and E^{loc}), we need the
concept of an in-tree on the neighbourhood graph. Let S be a subset of Pareto
optimal actions with no duplicate actions and ∪a∈S Ca = Pd−1 . An in-tree on the
graph (S, E) is a set of edges T ⊂ E such that (S, T ) is a directed tree with all
edges pointing towards a special vertex called the root, denoted by rootT and
such that V (T ), the set of vertices underlying T , is the same as S. Provided
the game is non-trivial, then such a tree exists by Lemma 37.7. Given a Pareto
optimal action b, let pathT (b) ⊆ T denote the path from b to the root. The path
is empty when b is the root. When b is not the root, we let parT (b) denote the
unique Pareto optimal action such that (b, parT (b)) ∈ T .
Abbreviate v = vglo and let T ⊂ E be an arbitrary in-tree over the Pareto
optimal actions. For each e ∈ E, let fe ∈ Eeglo be such that kfe k∞ ≤ v. Then
define f : [k] × Σ → Rk by
X
f (a, σ)b = fe (a, σ) .
e∈pathT (b)
√
Let p = (1 − γ)q + γ1/k with γ = vk 2 η. By the condition in the theorem that
η ≤ 1/ max{1, v 2 k 4 }, it holds that γ ≤ 1 and hence p ∈ ri(Pk−1 ). The next step
is to bound the minimum possible value of the loss estimator. For actions a and
b and outcome i,
ηf (a, Φai )b ηvk 2 √
≥− = − η ≥ −1 ,
pa γ
where in the final inequality we used the fact that η ≤ 1. Next, using the fact
that exp(−x) ≤ x2 + 1 − x for x ≥ −1, it follows that for any z ≥ −1,
k
X
Ψq (z) ≤ qb zb2 , (37.15)
b=1
which is the inequality we have used long ago in Chapter 11. Using this,
X
1 X ηf (a, Φai ) k4 v2 k2 v
k k X
qb 2
p Ψ ≤ f (a, Φ ) ≤ = √ ,
η 2 a=1
a q ai b
pa a=1
pa γ η
b∈Π
where we used that kf (a, σ)k∞ ≤ kv and p ≥ γ/k1 and that q ∈ Pk−1 . For the
37.7 Proof of Theorem 37.17 497
Combining the previous two displays shows that for any i ∈ [d],
1 1 X ηf (a, Φai ) 2vk 2
k
(p − q) Lei + 2
>
p a Ψq ≤ √ ,
η η a=1 pa η
The figure on the right-hand side is the neighbourhood graph. Notice that the third
action is revealing and also separates the first two actions in the neighbourhood
graph. Clearly, loss differences can be estimated between all pairs of neighbours
in this graph, and hence the game is locally observable. Let’s suppose now that
q = (1/2 − ε/2, 1/2 − ε/2, ε) and p = q. The obvious estimation function f ∈ E vec
is given by
(0, 1, 1/4) , if a = 3 and σ = 1 ;
>
Examining the second term in Eq. (37.12) and using a second order Taylor
approximation,
3 3
1 X ηf (a, Φai ) 1 X qb 2 1 1−ε
p Ψ ≈ Lbi = + ,
2 2 32 4ε
a q
η a=1 pa p3
b=1
which holds for both i = 1 and i = 2. This is bad news. The appearance of p3 = q3
in the denominator means the objective can be arbitrarily large when ε is small.
Taylor’s theorem shows that the approximation is not to blame, provided that
η is suitably small. The main issue is that q and p assign most of their mass to
two actions that are not neighbours and hence cannot be distinguished without
37.7 Proof of Theorem 37.17 498
Root
7
5 8 9 12
6
Degenerate
4 11
actions
2 1 10
3
Figure 37.5 The large nodes are Pareto optimal actions in S. The smaller nodes inside
are their duplicates, which are not part of S. The remaining nodes are degenerate
actions that are linear combinations of Pareto optimal actions. The arrows indicate
the in-tree. A vector y ∈ Rk is T -increasing if it is constant on duplicate actions and
otherwise increasing in the direction of the arrows. In this case, the constraint is that
y1 = y2 = y3 ≤ y4 ≤ y5 = y6 = y7 ≤ y8 ≤ y9 ≤ y12 and y10 ≤ y11 ≤ y12 .
(a) r ≥ q/k;
(b) r is T -increasing; and
(c) hr − q, yi ≤ 0 for all T -decreasing vectors y ∈ Rk .
Proof For simplicity, we give the proof for the special case that all actions
are Pareto optimal and there are no duplicates, in which case S = [k].
The proof is generalised in Exercise 37.11. Given an action a ∈ [k], let
ancT (a) = ∪e∈pathT (a) Ne ∪ {a} be the set of ancestors of a, including a and
descT (a) = {b : a ∈ ancT (b)} be the set of descendants of a. Define r by
X qb
ra = .
| ancT (b)|
b∈descT (a)
For Part (a), the definition means that ra ≥ qa /| ancT (a)| ≥ qa /k. That r is
T -increasing follows immediately from the definition. (c) follows because
k
X X Xk X
qb yb qb
hr, yi = ya ≥ = hq, yi .
| ancT (b)| a=1 | ancT (b)|
a=1 b∈descT (a) b∈descT (a)
The existence of the mapping q 7→ r given by Lemma 37.21 was originally proven
using a ‘water flowing’ argument and was called the water transfer operator.
Lemma 37.21. Let S as before. Then, for any λ ∈ Pd−1 , there exists an in-tree
T ⊂ E over vertices S such that Lλ is T -decreasing.
Proof Again, we outline the argument for games with no degenerate or duplicate
actions, leaving the complete proof for Exercise 37.11. Let a be an action such
that λ ∈ Ca . First, assume that λ ∈ ri(Ca ). The root of our tree will be
a (the reader may find helpful to check Fig. 37.6). Next, for b 6= a, define
par(b) = argminc∈Nb e> c Lλ and then let T = {(b, par(b)) : b 6= a}. Clearly,
V (T ) = [k]. Provided that T really is a tree, the fact that Lλ is T -decreasing is
obvious from the definition of the parent function. That T is a tree follows by
showing that for any (b, d) ∈ T , e>
d Lλ < eb Lλ, which we will prove now. For this,
>
let ω ∈ ri(Cb ) and c ∈ Nb such that Cc ∩ [ω, λ] 6= ∅. These exist by Exercise 37.10
(see also Fig. 37.6). We now show that e> c Lλ < eb Lλ from which the desired
>
It suffices to show that f (1) > 0. The following hold: (a) f is linear; (b) f (0) < 0,
since ω ∈ ri(Cb ); and (c) there exists an α ∈ (0, 1) such that f (α) = 0, which
holds because Cc ∩ [ω, λ] 6= ∅ and λ ∈ ri(Ca ). Thus f (1) > 0, establishing the
result for λ ∈ ri(Ca ). When λ is on the boundary of Ca , let (λ(i) )∞
i=1 be a sequence
in ri(Ca ) so that limi→∞ λ(i) → λ. For each i, let T (i) ⊂ E be an in-tree such
that Lλ(i) is T (i) -decreasing. Since there are only finitely many trees, by selecting
a subsequence we conclude that there exists an in-tree T ⊂ E such that Lλ(i) is
T -decreasing for all i. The result follows by taking the limit.
This concludes the building of the tools needed to control optq (η) for locally
observable games.
Ca
λ
Cb
Cc ω
Figure 37.6 The core argument used in the proof of Lemma 37.21.
Theorem 37.17 shows that f ∈ E vec . Moving to the objective in Eq. (37.16), we
lower-bound the loss estimates:
ηf (a, σ)b η X ηvk 2
= fe (a, σ) ≥ − = −1 . (37.17)
pa pa γ
e∈pathT (b)
Fix i ∈ [k]. The stability term is bounded using the properties of p and f as
follows:
X
1 X ηf (a, Φai )
k k X
qb
pa Ψq ≤ f (a, Φai )2b
η 2 a=1 pa a=1 b∈Π
p a
2
Xk X
qb X
= fe (a, Φai )
pa
a=1 b∈Π e∈pathT (b)
2
Xk X X
q
≤ 2v 2
b
I {a ∈ Ne }
ra
a=1 b∈Π e∈pathT (b)
Xk X
qb
≤ 8v 2 I a ∈ ∪e∈pathT (b) Ne
a=1
ra
b∈Π
k X
X qb
≤ 8v 2
a=1 b∈Π
rb
3 2
≤ 8k v .
Here, in the first inequality we used Eq. (37.15) and Eq. (37.17). The second
inequality follows by the definition of v = vloc and the choice of fe ∈ Eeloc , and
also because pa ≥ ra /2 by the condition on η in the theorem statement. The
third since any action a is in Ne for at most two edges in e ∈ pathT (b) (because
V (T ) ⊂ Π and it has no duplicates). The fourth inequality is true since r is
T -increasing and the fifth because r ≥ q/k. Finally, by Part (c) of Lemma 37.20
37.8 Proof of the Classification Theorem 502
Almost all the results are now available to prove Theorem 37.11. In Section 37.4,
we showed that if G is globally observable and not locally observable, then
Rn∗ (G) = Ω(n2/3 ). We also proved that if G is locally observable and has
√
neighbours, then Rn∗ (G) = Ω( n). This last result is complemented by the policy
and analysis in Sections 37.5 to 37.7, where we showed that for globally observable
√
games Rn∗ (G) = O(n2/3 ) and for locally observable games Rn∗ (G) = O( n).
Finally we proved that if G is not globally observable, then Rn∗ (G) = Ω(n). All
that remains is to prove that if G has no neighbouring actions, then Rn∗ (G) = 0.
Theorem 37.22. If G has no neighbouring actions, then Rn∗ (G) = 0.
Proof Since G has no neighbouring actions, there exists an action a such that
Ca = Pd−1 and the policy that chooses At = a for all rounds suffers no regret.
37.9 Notes
1 The next three notes are covering some basic definitions and facts in linear
algebra. There are probably hundreds of introductory texts on linear algebra.
A short and intuitive exposition is by Axler [1997].
2 A non-empty set L ⊆ Rn is a linear subspace of Rn if αv + βw ∈ L for
all α, β ∈ R and v, w ∈ L. If L and M are linear subspaces of Rn , then
L ⊕ M = {v + w : L ∈ L, w ∈ M }. The orthogonal complement of linear
subspace L is L⊥ = {v ∈ Rn : hu, vi = 0 for all u ∈ L}. The following
properties are easily checked: (i) L⊥ is a linear subspace, (ii) (L⊥ )⊥ = L and
(iii) (L ∩ M )⊥ = L⊥ ⊕ M ⊥ .
3 Let A ∈ Rm×n be a matrix and recall that matrices of this form correspond
to linear maps from Rn → Rm where the function A : Rn → Rm is given by
matrix multiplication, A(x) = Ax. The image of A is im(A) = {Ax : x ∈ Rn },
and the kernel is ker(A) = {x ∈ Rn : Ax = 0}. Notice that im(A) ⊆ Rm and
ker(A) ⊆ Rn . One can easily check that im(A) and ker(A> ) are linear subspaces,
37.9 Notes 503
that it only depends on the game through the constant B. Furthermore, the
bound depends on (Vt )nt=1 , rather than opt∗ (η), which may sometimes be
beneficial. The analysis of the algorithm uses the same techniques as developed
in Exercise 28.13 and is given by Lattimore and Szepesvári [2019d].
8 Algorithm 26 can be modified in several ways. One enhancement is to drop the
constraint that f ∈ E vec in the optimisation problem and introduce the worst
case bias of f as a penalty. Certainly this does not make the bounds worse. A
more significant change is to introduce a moment-generating function into the
optimisation problem, which leads to high-probability bounds [Lattimore and
Szepesvári, 2019d].
9 The optimisation problem in Algorithm 26 is convex and can be solved using
standard solvers when k and d are small and η is not too small. When η is small
and/or k or d is large, then numerical instability is a real challenge. One way to
address this issue is to approximate the exponential in the definition of Ψq with
a quadratic and add constraints on p and f that ensure the approximation is
reasonable. Since the analysis uses p and f satisfying these conditions, none of
the theory changes. What is bought by this approximation is that the resulting
optimisation problem becomes a second order cone program, rather than an
exponential cone program, and these are better behaved. More details are in
our paper: [Lattimore and Szepesvári, 2019d].
10 Partial monitoring has many potential applications. We already mentioned
dynamic pricing and spam filtering. In the latter case, acquiring the true label
comes at a price, which is a typical component of hard partial monitoring
problems. In general, there are many set-ups where the learner can pay extra
for high-quality information. For example, in medical diagnosis the doctor can
request additional tests before recommending a treatment plan, but these cost
time and money. Yet another potential application is quality testing in factory
production where the quality control team can choose which items to test (at
great cost).
11 There are many possible extensions to the partial monitoring framework. We
have only discussed problems where the number of actions/feedbacks/outcomes
is potentially infinite, but nothing prevents studying a more general setting.
Suppose the learner chooses a sequence of real-valued outcomes i1 , . . . , in with
it ∈ [0, 1]. In each round, the learner chooses At ∈ [k] and observes ΦAt (it ),
where Φa : [0, 1] → Σ is a known feedback function. The loss is determined
by a collection of known functions La : [0, 1] → [0, 1]. We do not know of any
systematic study of this setting. The reader can no doubt imagine generalising
this idea to infinite action sets or introducing a linear structure for the loss.
12 A pair of Pareto-optimal actions (a, b) are called weak neighbours if
Ca ∩ Cb = 6 ∅ and pairwise observable if there exists a function g satisfying
Eq. (37.3) and with g(c, f ) = 0 whenever c ∈ / {a, b}. A partial monitoring
problem is called a point-locally observable game if all weak neighbours are
pairwise observable. All point-locally observable games are locally observable,
but the converse is not true. Bartók [2013] designed a policy for this type of
37.9 Notes 505
1 p
Rn ≤ kloc n log(n) ,
εG
where εG > 0 is a game-dependent constant and kloc is the size of the largest
A ⊆ [k] of Pareto optimal actions such that ∩a∈A Ca = 6 ∅. Using a different
policy, Lattimore and Szepesvári [2019a] have shown that as the horizon grows,
the game-dependence diminishes so that
Rn p
lim sup √ ≤ 8(2 + m) 2kloc log(k) .
n→∞ n
13 Linear regret is unavoidable in hopeless games, but that does not mean there
is nothing to play for. Rustichini [1999] considered a version of the regret that
captures the performance of policies in this harsh setting. Given p ∈ Pd−1
define set I(p) ⊆ Pd−1 by
( d
)
X
I(p) = q ∈ Pd−1 : (pi − qi )I {Φai = f } = 0 for all a ∈ [k] and f ∈ [m] .
i=1
This is the set of distributions over the outcomes that are indistinguishable
from p by the learner using any actions. Then define
d
X
f (p) = max min qi Lai .
q∈I(p) a∈[k]
i=1
and Shimkin, 2003, Perchet, 2011, Mannor et al., 2014]. There has been some
work on infinite partial monitoring games. Lin et al. [2014] study a stochastic
setting with finitely many actions, but infinitely many outcomes and a particular
linear structure for the feedback. Chaudhuri and Tewari [2016] also consider a
linear setting with global observability and prove O(n2/3 log(n)) regret using
an explore-then-commit algorithm. Kirschner et al. [2020] study a version of
information-directed sampling in partial monitoring setting with a linear feedback
structure and finitely or infinitely many actions.
One can also add context, as usual. The special case of stochastic finite
contextual partial monitoring has been considered by Bartók and Szepesvári
[2012]. In this version, the learner is still given the matrices (L, Φ), but also a set of
functions F that map a sequence (xt )t of contexts to outcome distributions, with
the assumption that the outcome in round t is generated from f (xt ) with f ∈ F
unknown to the learner. A special case, apple tasting with context (equivalently,
matching pennies with context) is the subject of the paper of Helmbold et al.
[2000]. The aforementioned paper by Kirschner et al. [2020] also studies the
contextual partial monitoring problem in a linear setting.
37.11 Exercises
37.3 (Modified kernel) Recall that ker0 (x) = {u : u> x = 0 and u> 1 = 1}.
Show that if ker0 (x) = ker0 (y) 6= ∅ then x and y are proportional.
37.5 (Apple tasting) Apples arrive sequentially from the farm to a processing
facility. Most apples are fine, but occasionally there is a rotten one. The only way
to figure out whether an apple is good or rotten is to taste it. For some reason
customers do not like bite marks in the apples they buy, which means that tested
apples cannot be sold. Good apples yield a unit reward when sold, while the sale
of a bad apple costs the company c > 0.
37.7 (Complete lower bound for hard games) Complete the last step in
the proof of Theorem 37.12.
37.10 Let a and b be non-duplicate Pareto optimal actions and λ ∈ ri(Ca ). Show
there exists an ω ∈ ri(Cb ) and neighbour c of b such that Cc ∩ [ω, λ] 6= ∅.
Hint It may be useful to look at Fig. 37.6 to get some tips. The figure depicts
a slightly different situation, but is still useful when it is changed a little.
37.11 Generalise the proofs of Lemma 37.20 and Lemma 37.21 to handle duplicate
and degenerate actions.
Hint For Part (a), let S ∈ Rkm×d be obtained by stacking (Sc )kc=1 , defined√as
in the proof of Theorem 37.12. Then argue that for globally observable games, d
times the reciprocal of the smallest non-zero singular value of S is an upper bound
on vglo and then use the fact that S > S has integer-valued coefficients. Part (b)
follows in a similar fashion. For Part (c), use a graph-theoretic argument.
37.13 Let m = |Σ| = 2 and d = 2k − 1 and construct a globally observable game
for which there exists a pair of neighbouring actions a, b for which
min kf k ≥ C2d/2 ,
glo
f ∈Eab
Hint Find choices of p and f that reduce the algorithm to Exp3 and exponential
weights respectively.
37.15 (Lower bound depending on the number of feedbacks) Consider
G = (L, Φ) given by
1 0 1 0 ··· 1 0
L= and
0 1 0 1 ··· 0 1
1 2 2 3 3 4 ··· m−1 m−1 m
Φ= .
1 1 2 2 3 3 ··· m−2 m−1 m−1
37.11 Exercises 509
opt∗ (η)
0
0.2 0.4 0.6 0.8 1
c
Figure 37.7 The value of opt∗ (η) as a function of c in matching pennies (Example 37.3).
The source for previous exercise is the paper by the authors [Lattimore and
Szepesvári, 2019a].
Hint The convex optimisation problem in Eq. (37.12) seems to cause problems
for some solvers (see Note 9 for some mitigating strategies). We assume that
many libraries can be made to work. Our implementation used the splitting cone
solver by O’Donoghue et al. [2016, 2017]. Your plot should resemble Fig. 37.7.
(a) For full information games, exponential weights behaves like Algorithm 26
except that ŷt = yt and Pt = Qt . Does the solution to the optimisation
37.11 Exercises 510
Hint You can approach this problem by using your solution to Exercise 37.18
and comparing values empirically. Alternatively, you can theoretically analyse
Eq. (37.12) in these special cases. Some of these questions are answered by
Lattimore and Szepesvári [2019d].
38 Markov Decision Processes
Bandit environments are a sensible model for many simple problems, but they do
not model more complex environments where actions have long-term consequences.
A brewing company needs to plan ahead when ordering ingredients, and the
decisions made today affect their position to brew the right amount of beer in
the future. A student learning mathematics benefits not only from the immediate
reward of learning an interesting topic but also from their improved job prospects.
A Markov decision process (MDP) is a simple way to incorporate long-term
planning into the bandit framework. Like in bandits, the learner chooses actions
and receives rewards. But they also observe a state, and the rewards for different
actions depend on the state. Furthermore, the actions chosen affect which state
will be observed next.
An MDP is defined by a tuple M = (S, A, P, r, µ). The first two items S and A
are sets called the state space and action space, and S = |S| and A = |A| are
their sizes, which may be infinite. An MDP is finite if S, A < ∞. The quantity
P = (Pa : a ∈ A) is called the transition function with Pa : S × S → [0, 1]
so that Pa (s, s0 ) is the probability that the learner transitions from state s to
s0 when taking action a. The fourth element of the tuple is r = (ra : a ∈ A),
which is a collection of reward functions with ra : S → [0, 1]. When the learner
takes action a in state s, it receives a deterministic reward of ra (s). The last
element is µ ∈ P(S), which is a distribution over the states that determines
the starting state. The transition and reward functions are often represented by
vectors or matrices. When the state space is finite, we may assume without loss
of generality that S = [S]. We write Pa (s) ∈ [0, 1]S as the probability vector with
s0 th coordinate given by Pa (s, s0 ). In the same way, we let Pa ∈ [0, 1]S×S be the
right stochastic matrix with (Pa )s,s0 = Pa (s, s0 ). Finally, we view ra as a vector
in [0, 1]S in the natural way.
The interaction protocol is similar to bandits. Before the game starts, the initial
state S1 is sampled from µ. In each round t, the learner observes the state St ∈ S,
chooses an action At ∈ A and receives reward rAt (St ). The environment then
38.1 Problem Set-Up 512
samples St+1 from the probability vector PAt (St ), and the next round begins
(Fig. 38.1).
t = 1 and sample S1 ∼ µ
Observe state St
Although the action set is the same in all states, this does not mean that
Pa (s) or ra (s) has any relationship to Pa (s0 ) or ra (s0 ) for states s 6= s0 . In
this sense, it might be better to use an entirely different set of actions for
each state, which would not change the results we present. And while we
are at it, of course one could also allow the number of actions to vary over
the state space.
1, 1
1, 1
2 2 2 2
2 5,0 5,0 5,0 5 , 10
5,0 3
5 , 10
3 3 3 3
5,0 5,0 5,0 5,0
2 3 4 5 6
1 1 1 1
5,1 5,1 5,1 5,1
1
4 5 , 12
5,1 4 4 4 4
5,1 5,1 5,1 5 , 12
Figure 38.2 A Markov decision process with six states and two actions represented by
solid and dashed arrows, respectively. The numbers next to each arrow represent the
probability of transition and reward for the action respectively. For example, taking the
solid action in state 3 results in a reward of 0, and the probability of moving to state
4 is 3/5, and the probability of moving to state 3 is 2/5. For human interpretability
only, the actions are given consistent meaning across the states (blue/solid actions
‘increment’ the state index, black/dashed actions decrement it). In reality there is no
sense of similarity between states or actions built into the MDP formalism.
Probability Spaces
It will be convenient to allow infinitely long interactions between the learner and
the environment. In line with Fig. 38.1, when the agent or learner follows a policy
π in MDP M = (S, A, P, r, µ), such a never-ending interaction gives rise to a
random process (S1 , A1 , S2 , A2 , . . . ) so that for any s, s0 ∈ S, a ∈ A and t ≥ 1,
(a) P(S1 = s) = µ(s);
(b) P(St+1 = s0 | Ht , At ) = PAt (St , s0 ); and
(c) P(At = a | Ht ) = π(a | Ht ).
Meticulous readers may wonder whether there exists a probability space (Ω, F, P)
holding the infinite sequence of random variables (S1 , A1 , S2 , A2 , . . . ) that satisfy
(a)–(c). The Ionescu–Tulcea theorem (Theorem 3.3) furnishes us with a positive
answer (Exercise 38.1). Item (b) above is known as the Markov property. Of
38.1 Problem Set-Up 514
course the measure P depends on the policy, Markov decision process and the
initial distribution. For most of the chapter, these quantities will be fixed and the
dependence is omitted from the notation. In the few places where disambiguation
is necessary, we provide additional notation. In addition to this, to minimise
clutter, we allow ourselves to write P(· | S1 = s), which just means the probability
distribution that results from the interconnection of π and M , while replacing µ
with an alternative initial state distribution that is a Dirac at s.
where the expectation is taken with respect to the law of Markov chain (St )∞
t=1
induced by the interaction between π and M .
A number of observations are in order about this definition. First, the order
of the maximum and minimum means that for any pair of states a different
policy may be used. Second, travel times are always minimised by deterministic
memoryless policies, so the restriction to these policies in the minimum is
inessential (Exercise 38.3). Finally, the definition only considers distinct states.
We also note that when the number of states is finite, it holds that D(M ) < ∞ if
and only if M is strongly connected (Exercise 38.4). The diameter of an MDP
with S states and A actions cannot be smaller than logA (S) − 3 (Exercise 38.5).
For the remainder of this chapter, unless otherwise specified, all MDPs are
assumed to be strongly connected.
38.2 Optimal Policies and the Bellman Optimality Equation 515
We now define the notion of an optimal policy and outline the proof that there
exists a deterministic memoryless optimal policy. Along the way, we define what is
called the Bellman optimality equation. Methods that solve this equation are the
basis for finding optimal policies in an efficient manner and also play a significant
role in learning algorithms. Throughout, we fix a strongly connected MDP M .
The gain of a policy π is the long-term average reward expected from using
that policy when starting in state s:
1X π
n
ρπs = lim E [rAt (St ) | S1 = s] ,
n→∞ n
t=1
1X π
n
ρ̄πs = lim sup E [rAt (St ) | S1 = s] ,
n→∞ n t=1
which exists for any policy. Of course, whenever ρπs exists we have ρπs = ρ̄πs . The
optimal gain is a real value
where the supremum is taken over all policies. A π policy is an optimal policy
if ρπ = ρ∗ 1. For strongly connected MDPs, an optimal policy is guaranteed to
exist. This is far from trivial, however, and we will spend the next little while
outlining the proof.
MDPs that are not strongly connected may not have a constant optimal
gain. This makes everything more complicated, and we are lucky not to have
to deal with such MDPs here.
Before continuing, we need some new notation. For a memoryless policy π, define
X X
Pπ (s, s0 ) = π(a | s)Pa (s, s0 ) and rπ (s) = π(a | s)ra (s) . (38.1)
a∈A a∈A
1 X t−1
n
ρπ = lim Pπ rπ = Pπ∗ rπ , (38.2)
n→∞ n
t=1
1
Pn
where Pπ∗ = limn→∞ n t=1 Pπt−1 is called the stationary transition matrix,
38.2 Optimal Policies and the Bellman Optimality Equation 516
the existence of which you will prove in Exercise 38.7. For each k ∈ N, define
k
X
vπ(k) = Pπt−1 (rπ − ρπ ) .
t=1
(k)
For s ∈ S, vπ (s) gives the total expected excess reward collected by π when
the process starts at state s and lasts for k time steps. The (differential) value
function of a policy is a function vπ : S → R defined as the Cesàro sum of the
sequence (Pπt (rπ − ρπ ))t≥0 ,
1 X (k)
n
vπ = lim vπ = ((I − Pπ + Pπ∗ )−1 − Pπ∗ )rπ . (38.3)
n→∞ n
k=1
Note, the second equality above is non-trivial (Exercise 38.7). The definition
implies that vπ (s) − vπ (s0 ) is the ‘average’ long-term advantage of starting in
state s relative to starting in state s0 when following policy π. These quantities
are only defined for memoryless policies where they are also guaranteed to exist
(Exercise 38.7). The definition of Pπ∗ implies that Pπ∗ Pπ = Pπ∗ , which in turn
implies that Pπ∗ vπ = 0. Combining this with Eqs. (38.2) and (38.3) shows that
for any memoryless policy π,
ρπ + vπ = rπ + Pπ vπ . (38.4)
A value function is a function v : S → R, and its span is given by
span(v) = max v(s) − min v(s) .
s∈S s∈S
(a) There exists a pair (ρ, v) that satisfies the Bellman optimality equation.
(b) If (ρ, v) satisfies the Bellman optimality equation, then ρ = ρ∗ and πv is
optimal.
(c) There exists a deterministic memoryless optimal policy.
Proof sketch The proof of part (a) is too long to include here, but we guide you
through it in Exercise 38.10. For part (b), let (ρ, v) satisfy the Bellman equation
and π ∗ = πv be the greedy policy with respect to v. Then, by Eq. (38.2),
1 X t−1 1 X t−1
n n
ρπ = lim Pπ∗ rπ∗ = lim Pπ∗ (ρ1 + v − Pπ∗ v) = ρ1 .
∗
n→∞ n n→∞ n
t=1 t=1
Next, let π be an arbitrary Markov policy. We show that ρ̄π ≤ ρ1. The result is
then completed using the result of Exercise 38.2, where you will prove that for any
policy π, there exists a Markov policy with the same expected rewards. Denote
by πt the memoryless policy used at time t = 1, 2, . . . when following the Markov
(t) (0)
policy π, and for t ≥ 1, let Pπ = Pπ1 . . . Pπt , while for t = 0, let Pπ = I. Thus,
(t)
Pπ (s, s ) is the probability of ending up in state s while following π from state
0 0
Pn (t−1)
s for t time steps. It follows that ρ̄π = lim supn→∞ n1 t=1 Pπ rπt . Fix t ≥ 1.
Using the fact that π ∗ is the greedy policy with respect to v gives
Taking the average of both sides over t ∈ [n] and then taking the limit shows
that ρ̄π ≤ ρ1, finishing the proof. Part (c) follows immediately from the first
two parts.
The theorem shows that there exist solutions to the Bellman optimality equation
and that the greedy policy with respect to the resulting value function is an
optimal policy. We need one more result about solutions to the Bellman optimality
equation, the proof of which you will provide in Exercise 38.13.
Lemma 38.3. Suppose that (ρ, v) satisfies the Bellman optimality equation. Then
span(v) ≤ D(M ).
There are many ways to find an optimal policy, including value iteration, policy
iteration and enumeration. These ideas are briefly discussed in Note 12. Here
we describe a two-step approach based on linear programming. Consider the
following constrained linear optimisation problem:
minimise ρ (38.6)
ρ∈R,v∈RS
Theorem 38.4. The optimisation problem in Eq. (38.6) is feasible, and if (ρ, v)
is a solution, then ρ = ρ∗ is the optimal gain.
Solutions (ρ, v) to the optimisation problem in Eq. (38.6) need not satisfy
the Bellman optimality equation (Exercise 38.12).
Proof of Theorem 38.4 Theorem 38.2 guarantees the existence of a pair (ρ∗ , v ∗ )
that satisfies the Bellman optimality equation:
Hence the pair (ρ , v ) satisfies the constraints in Eq. (38.6) and witnesses
∗ ∗
feasibility. Next, let (ρ, v) be a solution of Eq. (38.6). Since (ρ∗ , v ∗ ) satisfies the
constraints, ρ ≤ ρ∗ is immediate. It remains to prove that ρ ≥ ρ∗ . Let π = πv
be the greedy policy with respect to v and π ∗ be greedy with respect to v ∗ . By
Theorem 38.2, ρ∗ = ρπ . Furthermore,
∗
Pπt ∗ rπ∗ ≤ Pπt ∗ (rπ + Pπ v − Pπ∗ v) ≤ Pπt ∗ (ρ1 + v − Pπ∗ v) = ρ1 + Pπt ∗ v − Pπt+1
∗ v .
Pn−1
Summing over t shows that ρ∗ 1 = limn→∞ n1 t=0 Pπt ∗ rπ∗ ≤ ρ1, which completes
the proof.
Having found the optimal gain, the next step is to find a value function that
satisfies the Bellman optimality equation. Let s̃ ∈ S, and consider the following
linear program:
The second constraint is crucial in order for the minimum to exist, since otherwise
the value function can be arbitrarily small.
38.3 Finding an Optimal Policy ( ) 519
Theorem 38.5. There exists a state s̃ ∈ S such that the solution v of Eq. (38.7)
satisfies the Bellman optimality equation.
stochastic, Pπ∗∗ ε = 0. Choose s̃ to be a state such that Pπ∗∗ (s, s̃) > 0 for some s ∈ S,
which exists because Pπ∗∗ is right stochastic. Then 0 = (Pπ∗∗ ε)(s) ≥ Pπ∗∗ (s, s̃)ε(s̃)
and hence ε(s̃) = 0. It follows that ṽ = v − ε also satisfies the constraints in
Eq. (38.7). Because v is a solution to Eq. (38.7), hṽ, 1i ≥ hv, 1i, implying that
hε, 1i ≤ 0. Since we already showed that ε ≥ 0, it follows that ε = 0.
The theorem only demonstrates the existence of a state s̃ for which the solution
of Eq. (38.7) satisfies the Bellman optimality equation. There is a relatively
simple procedure for finding such a state using the solution to Eq. (38.6), but its
analysis depends on the basic theory of duality from linear programming, which
is beyond the scope of this text. More details are in Note 11 at the end of the
chapter. Instead we observe that one can simply solve Eq. (38.7) for all choices
of s̃ and take the first solution that satisfies the Bellman optimality equation.
minimise
n
hc, xi
x∈R
subject to Ax ≥ b ,
in n only, provided that the constraints satisfy certain structural properties. Let
K ⊂ Rn be convex, and consider the optimisation problem
minimise
n
hc, xi (38.8)
x∈R
subject to x ∈ K .
Algorithms for this problem generally have a slightly different flavour because K
may have no corners. Suppose the following holds:
Under these circumstances, the ellipsoid method accepts as input the size of the
bounding sphere R, the separation oracle and an accuracy parameter ε > 0. Its
output is a point x in time polynomial in n and log(R/(δε)) such that x ∈ K and
hc, xi ≤ hc, x∗ i + ε, where x∗ is the minimiser of Eq. (38.8). The reader can find
references to this method at the end of the chapter.
The linear programs in Eq. (38.6) and Eq. (38.7) do not have bounded feasible
regions because if v is feasible, then v + c1 is also feasible for any c ∈ R. For
strongly connected MDPs with diameter D, however, Lemma 38.3 allows us to
add the constraint that kvk∞ ≤ D. If the rewards are bounded in [0, 1], then we
may also add the constraint that 0 ≤ ρ ≤ 1. Together these imply that for (ν, ρ)
in the feasible region,
minimise ρ (38.9)
ρ∈R,v∈RS
Note that for any x in the feasible region of Eq. (38.9), there exists a y that is
feasible for Eq. (38.6) with kx − yk∞ ≤ ε. Furthermore, the solution to the above
linear program is at most ε away from the solution to Eq. (38.6). What we have
38.4 Learning in Markov Decision Processes 521
bought by adding this slack is that now the linear program in Eq. (38.9) satisfies
the conditions (a) and (c) above. The final step is to give a condition when a
separation oracle exists for the convex set determined by the constraints in the
above program. Define convex set K by
Assuming that
can be solved efficiently, Algorithm 27 provides a separation oracle for K. For the
specialised case considered later, Eq. (38.11) is trivial to compute efficiently. The
feasible region defined by the constraints in Eq. (38.9) is the intersection of K with
a small number of half-spaces. In Exercise 38.15, you will show how to efficiently
Tn
extend a separation oracle for arbitrary convex set K to i=1 Hk ∩ K, where
(Hk )nk=1 are half-spaces. You will show in Exercise 38.14 that approximately
solving Eq. (38.7) works in the same way as above, as well as the correctness of
Algorithm 27.
1: function SeparationOracle(ρ, v)
2: For each s ∈ S find a∗s ∈ argmaxa (ra (s) + hPa (s), vi)
3: if ε + ρ + v(s) ≥ ra∗s (s) + hPa∗s (s), vi for all s ∈ S then
4: return true
5: else
6: Find state s with ε + ρ + v(s) < ra∗s (s) + hPa∗s (s), vi
7: return (1, es − Pa∗s (s))
8: end if
9: end function
Algorithm 27: Separation oracle for Eq. (38.6).
is unknown while the reward function is given. This assumption is not especially
restrictive as the case where the rewards are also unknown is easily covered using
either a reduction or a simple generalisation, as we explain in the notes. The
regret of a policy π is the deficit of rewards suffered relative to the expected
average reward of an optimal policy:
n
X
R̂n = nρ∗ − rAt (St ) .
t=1
The reader will notice we are comparing the non-random nρ∗ to the random
sum of rewards received by the learner, which was also true in the study of
stochastic bandits. The difference is that ρ∗ is an asymptotic quantity while for
stochastic bandits the analogous quantity was nµ∗ . The definition stills makes
sense, however, because for MDPs with finite diameter D the optimal expected
cumulative reward over n rounds is at least nρ∗ − D so the difference is negligible
(Exercise 38.17). The main result of this chapter is the following:
Theorem 38.6. Let S, A and n be natural numbers and δ ∈ (0, 1). There
exists an efficiently computable policy π that when interacting with any MDP
M = (S, A, P, r) with S states, A actions, rewards in [0, 1] and any initial state
distribution satisfies with probability at least 1 − δ,
p
R̂n < CD(M )S An log(nSA/δ) ,
where C is a universal constant.
In Exercise 38.18, we ask you to use the assumption that the rewards are
bounded to find a choice of δ ∈ (0, 1) such that
p
E[R̂n ] ≤ 1 + CD(M )S 2An log(n) . (38.12)
This result is complemented by the following lower bound:
Theorem 38.7. Let S ≥ 3, A ≥ 2, D ≥ 6 + 2 logA S and n ≥ DSA. Then for
any policy π there exists a Markov decision process with S states, A actions and
diameter at most D such that
√
E[R̂n ] ≥ C DSAn ,
where C > 0 is again a universal constant.
√
The upper and lower bounds are separated by a factor of at least DS, which
is a considerable gap. Recent work has made progress towards closing this gap as
we explain in the notes.
which means that the next phase starts once the number of visits to some
state-action pair at least doubles.
38.5 Upper Confidence Bounds for Reinforcement Learning 524
1: Input S, A, r, δ ∈ (0, 1)
2: t=0
3: for k = 1, 2, . . . do
4: τk = t + 1
5: Find πk as the greedy policy with respect to vk satisfying Eq. (38.16)
6: do
7: t ← t + 1, observe St and take action At = πk (St )
8: while Tt (St , At ) < 2Tτk −1 (St , At )
9: end for
Algorithm 28: UCRL2.
The reward function of the extended MDP is r̃(a,P ) (s) = ra (s), and the transitions
are P̃a,P (s) = Pa (s). The action space in the extended MDP allows the agent to
choose both a ∈ A and a plausible transition vector Pa (s) ∈ Cτk (s, a). By the
definition of the confidence sets, for any pair of states s, s0 and action a ∈ A,
there always exists a transition vector Pa (s) ∈ Cτk (s, a) such that Pa (s, s0 ) > 0,
which means that M̃k is strongly connected. Hence solving the Bellman optimality
equation for M̃k yields a value function vk and constant gain ρk ∈ R that satisfy
38.6 Proof of Upper Bound 525
Eq. (38.16). A minor detail is that the extended action sets are infinite, while
the analysis in previous sections only demonstrated existence of solutions to
the Bellman optimality equation for finite MDPs. You should convince yourself
that Ct (s, a) is convex and has finitely many extremal points. Restricting the
confidence sets to these points makes the extended MDP finite without changing
the optimal policy.
can be carried out in an efficient manner. The inner optimisation is another linear
program with S variables and O(S) constraints and can be solved in polynomial
time. This procedure is repeated for each a ∈ A to compute the outcome of
(38.17). In fact the inner optimisation can be solved more straightforwardly by
sorting the entries of v and then allocating P coordinate by coordinate to be as
large as allowed by the constraints in decreasing order of v. The total computation
cost of solving Eq. (38.17) in this way is O(S(A + log S)). Combining this with
Algorithm 27 gives the required separation oracle.
The next problem is to find an R such that the set of feasible solutions to the
linear programs in Eq. (38.6) and Eq. (38.7) are contained in√the set {x : kxk ≤ R}.
As discussed in Section 38.3.1, a suitable value is R = 1 + D2 S, where D is
√
an upper bound on the diameter of the MDP. It turns out that D = n works
because for each pair of states s, s0 , there exists an action a and P ∈ Cτk (s, a)
√ √
such that P (s, s0 ) ≥ 1 ∧ (1/ n) so D(M̃k ) ≤ n. Combining this with the tools
developed in Section 38.3 shows that the Bellman optimality equation for M̃k may
be solved using linear programming in polynomial time. Note that the additional
constraints require a minor adaptation of the separation oracle, which we leave
to the reader.
The proof is developed in three steps. First we decompose the regret into phases
and define a failure event where the confidence intervals fail. In the second step,
we bound the regret in each phase, and in the third step we sum over the phases.
Recall that M = (S, A, P, r) is the true Markov decision process with diameter
D = D(M ). The initial state distribution is µ ∈ P(S), which is arbitrary.
38.6 Proof of Upper Bound 526
n
X K X
X
R̂n = (ρ∗ − rAt (St )) ≤ (ρk − rAt (St )) .
t=1 k=1 t∈Ek
| {z }
R̃k
In the next step, we bound R̃k under the assumption that F does not hold.
1 D
kvk k∞ ≤ span(vk ) ≤ , (38.19)
2 2
where the second inequality follows from Lemma 38.3 and the fact that when F
does not hold, the diameter of the extended MDP M̃k is at most D and vk also
satisfies the Bellman optimality equation in this MDP. By the definition of the
policy, we have At = πk (St ) for t ∈ Ek , which implies that
where the inequality follows from Hölder’s inequality and Eq. (38.19). Let
Et [·] denote the conditional expectation with respect to P conditioned on
σ(S1 , A1 , . . . , St−1 , At−1 , St ). To bound (A), we reorder the terms and use the
fact that span(vk ) ≤ D on the event F c . We get
X
(A) = (vk (St+1 ) − vk (St ) + hPAt (St ), vk i − vk (St+1 ))
t∈Ek
X
= vk (Sτk+1 ) − vk (Sτk ) + (hPAt (St ), vk i − vk (St+1 ))
t∈Ek
X
≤D+ (Et [vk (St+1 )] − vk (St+1 )) ,
t∈Ek
where the second equality used that max Ek = τk+1 − 1 and min Ek = τk . We
leave this here for now and move on to term (B) in Eq. (38.20). The definition of
the confidence intervals and the assumption that F does not occur shows that
√
D LS X T(k) (s, a)
(B) ≤ p .
2 1 ∨ Tτk −1 (s, a)
(s,a)∈S×A
It remains to bound the number of phases. A new phase starts when the visit
count for some state-action pair doubles. Hence K cannot be more than the
number of times the counters double in total for each of the states. It is easy to
see that 1 + log2 Tn (s, a) gives an upper bound on how many times the counter
for this pair may double (the constant 1 is there to account for the counter
P
changing from zero to one). Thus K ≤ K 0 = s,a 1 + log2 Tn (s, a). Noting that
P
0 ≤ Tn (s, a) and s,a Tn (s, a) = n and relaxing Tn (s, a) to take real values, we
find that the value of K 0 is the largest when Tn (s, a) = n/(SA), which shows that
n
K ≤ SA 1 + log2 .
SA
Putting everything together gives the desired result.
The lower bound is proven by crafting a difficult MDP that models a bandit
with approximately SA arms. This is a cumbersome endeavour, but intuitively
straightforward, and the explanations that follow should be made clear in Fig. 38.3.
Given S and A, the first step is to construct a tree of minimum depth with at
most A children for each node using exactly S − 2 states. The root of the tree is
denoted by s◦ and transitions within the tree are deterministic, so in any given
node, the learner can simply select which child to transition to. Let L be the
38.7 Proof of Lower Bound 529
number of leaves, and label these states s1 , . . . , sL . The last two states are sg
and sb (‘good’ and ‘bad’ respectively). For each i ∈ [L], the learner can take any
action a ∈ A and transitions to either the good state or the bad state according
to
1 1
Pa (si , sg ) = + ε(a, i) and Pa (si , sb ) = − ε(a, i) .
2 2
The function ε will be chosen so that ε(a, i) = 0 for all (a, i) pairs except one. For
this special state-action pair, we let ε(a, i) = ∆ for appropriately tuned ∆ > 0.
The good state and the bad state have the same transitions for all actions:
Pa (sg , sg ) = 1 − δ , Pa (sg , s◦ ) = δ ,
Pa (sb , sb ) = 1 − δ , Pa (sb , s◦ ) = δ .
Choosing δ = 4/D, which under the assumptions of the theorem is guaranteed
to be in (0, 1], ensures that the diameter of the described MDP is at most D,
regardless of the value of ∆. The reward function is ra (s) = 1 if s = sg and
ra (s) = 0 otherwise.
The connection to finite-armed bandits is straightforward. Each time the learner
arrives in state s◦ , it selects which leaf to visit and then chooses an action from
that leaf. This corresponds to choosing one of k = LA = Ω(SA) meta actions.
The optimal policy is to select the meta action with the largest probability of
transitioning to the good state. The choice of δ means the learner expects to
stay in the good/bad state for approximately D rounds, which also makes the
diameter of this MDP about D. This means the learner expects to make about
n/D decisions and p the rewards√are roughly in [0, D], so we should expect the
regret to be Ω(D kn/D) = Ω( nDSA).
s◦
δ, 1 δ, 0
s1 s2 s3
sg sb
Good state Bad state
1 − δ, 1 1 − δ, 0
One could almost claim victory here and not bother with the proof. As usual,
however, there are some technical difficulties, which in this case arise because the
number of visits to the decision state s◦ is a random quantity. For this reason we
give the proof, leaving as exercises the parts that are both obvious and annoying.
Proof of Theorem 38.7 The proof follows the path suggested in Exercise 15.2.
We break things up into two steps. Throughout we fix an arbitrary policy π.
By definition, this has k elements. Let M0 be the MDP with ε(s, a) = 0 for all
(s, a) ∈ L. Then let Mj be the MDP with ε(s, a) = ∆ for the jth state-action
pair in the above set. Define stopping time τ by
( t
)
X n
τ = n ∧ min t : I {Su = s◦ } ≥ −1 ,
u=1
D
which is the first round when the number of visits to state s◦ is at least n/D − 1,
or n if s◦ is visited fewer times than n/D. Next, let Tj be the number of visits to
Pk
state-action pair j ∈ [k] until stopping time τ and Tσ = j=1 Tj . For 0 ≤ j ≤ k,
let Pj be the law of T1 , . . . , Tk induced by the interaction of π and Mj . And
let Ej [·] be the expectation with respect to Pj . None of the following claims is
surprising, but they are all tiresome to prove to some extent. The claims are
listed in increasing order of difficulty and left to the reader in Exercise 38.24.
Claim 38.10. There exist universal constants 0 < c1 < c2 < ∞ such that
Claim 38.11. Let Rnj be the expected regret of policy π in MDP Mj over n
rounds. There exists a universal constant c3 > 0 such that
Rnj ≥ c3 ∆D Ej [Tσ − Tj ] .
where d(p, q) is the relative entropy between Bernoulli distributions with means
p and q, respectively. Now ∆ will be chosen to satisfy ∆ ≤ 1/4. It follows from
the entropy inequalities in Eq. (14.16) that
c1 n(k − 1)
Ej [Tσ − Tj ] ≥ .
2Dk
Then, for the last step, apply Claim 38.11 to show that
r
c2 c3 n(k − 1)2 D
Rnj ≥ c3 D∆Ej [Tσ − Tj ] ≥ 1 .
4k 2c2 nk
Naive bounding and simplification concludes the proof.
38.8 Notes
1 MDPs in applications can have millions (or ‘billions and billions’) of states,
which should make the reader worried that the bound in Theorem 38.6 could
be extremely large. The takeaway should be that learning in large MDPs
without additional assumptions is hard, as attested by the lower bound in
Theorem 38.7.
38.8 Notes 532
2 The key to choosing the state space is that the state must be observable and
sufficiently informative that the Markov property is satisfied. Blowing up the
size of the state space may help to increase the fidelity of the approximation
(the entire history always works), but will almost always slow down learning.
3 We simplified the definition of MDPs by making the rewards a deterministic
function of the current state and the action chosen. A more general definition
allows the rewards to evolve in a random fashion, jointly with the next state.
In this definition, the mean reward functions are dropped and the transition
kernel Pa is replaced with an S → S × R stochastic kernel, call it, P̃a . Thus,
for every s ∈ S, P̃a (s) is a probability measure over S × R. The meaning of this
is that when action a is chosen in state s, a random transition, (S, R) ∼ P̃a (s)
happens to state S, while reward R R is received. Note that the mean reward
along this transition is ra (s) = xP̃a (s, ds0 , dx).
4 A state s ∈ S is absorbing if Pa (s, s) = 1 for all a ∈ A. An MDP is episodic if
there exists an absorbing state that is reached almost surely by any policy. The
average reward criterion is meaningless in episodic MDPs because all policies
are optimal. In this case the usual objective is to maximise the expected reward
until the absorbing state is reached without limits or normalisation, sometimes
with discounting. An MDP is finite-horizon if it is episodic and the absorbing
state is always reached after some fixed number of rounds. The simplification
of the setting eases the analysis and preserves most of the intuition from the
general setting.
5 A partially observable MDP (POMDP) is a generalisation where the learner
does not observe the underlying state. Instead they receive an observation
that is a (possibly random) function of the state. Given a fixed (known) initial
state distribution, any POMDP can be mapped to an MDP at the price of
enlarging the state space. A simple way to achieve this is to let the new state
space be the space of all histories. Alternatively you can use any sufficient
statistic for the hidden state as the state. A natural choice is the posterior
distribution over the hidden state given the interaction history, which is called
the belief space. While the value function over the belief space has some nice
structure, in general even computing the optimal policy is hard [Papadimitriou
and Tsitsiklis, 1987].
6 We called the all-knowing entity that interacts with the MDP an agent. In
operations research the term is decision maker and in control theory it is
controller. In control theory the environment would be called the controlled
system or the plant (for power-plant, not a biological plant). Acting in
an MDP is studied in control theory under stochastic optimal control,
while in operations research the area is called multistage decision making
under uncertainty or multistage stochastic programming. In the control
community the infinite horizon setting with the average cost criterion is perhaps
the most common, while in operations research the episodic setting is typical.
7 The definition of the optimal gain that is appropriate for MDPs that are not
strongly connected is a vector ρ∗ ∈ RS given by ρ∗s = supπ ρ̄πs . A policy is
38.8 Notes 533
optimal if it achieves the supremum in this definition and such a policy always
exists as long as the MDP is finite. In strongly connected MDPs, the two
definitions coincide. For infinite MDPs, everything becomes more delicate and
a large portion of the literature on MDPs is devoted to this case.
8 In applications where the asymptotic nature of gain optimality is unacceptable,
there are criteria that make finer distinctions between the policies. A memoryless
policy π ∗ is bias optimal if it is gain optimal and vπ∗ ≥ vπ for all memoryless
policies π. Even more sensitive criteria exist. Some keywords to search for are
Blackwell optimality and n-discount optimality.
9 The Cesàro sum of a real-valued sequence (an )n is the asymptotic average of
its partial sums. Let sn = a0 + · · · + an−1 be the nth partial sum. The Cesàro
sum of this sequence is A = limn→∞ n1 (s1 + · · · + sn ) when this limit exists.
The idea is that Cesàro summation smoothes out periodicity, which means that
for certain sequences the Cesáro sum exists while sn does not converge. For
example, the alternating sequence (+1, −1, +1, −1, . . . ) is Cesàro summable,
and its Cesàro sum is easily seen to be 1/2, while it is not summable in the
normal sense. If a sequence is summable, then its sum and its Cesàro sum
coincide. The differential value of a policy is defined as a Cesàro sum so that it
is well defined even if the underlying Markov chain has periodic states.
10 For γ ∈ (0, 1), the γ-discounted average of sequence (an )n is Aγ = (1 −
P∞
γ) n=0 γ n an . An elementary argument shows that if Aγ is well defined, then
P∞ Pn
Aγ = (1 − γ)2 n=1 γ n−1 sn . Suppose the Cesàro sum A = limn→∞ n1 t=1 st
P
exists, then using the fact that 1 = (1 − γ)2 n=1 γ n−1 n, we have Aγ − A =
∞
2
P∞ n−1 P∞
(1−γ) n=1 γ (sn −nA). It is not hard to see that | n=1 γ n−1 (sn −nA)| =
O(1/(1 − γ)), and thus Aγ − A = O(1 − γ) as γ → 1, which means that
limγ→1 Aγ = A. The value limγ→1 Aγ is called the Abel sum of (an )n . Put
simply, the Abel sum of a sequence is equal to its Cesàro sum when the latter
exists. Abel summation is stronger in the sense that there are sequences that
are Abel summable but not Cesàro summable. The approach of approximating
Cesàro sums through γ-discounted averages, and taking the limit as γ → 1 is
called the vanishing discount approach and is one of the standard ways
to prove that the (average reward) Bellman equation has a solution (see
Exercises 38.9 and 38.10). As an aside, the systematic study of how to define
the ‘sum’ of a divergent series is a relatively modern endeavour. An enjoyable
historical account is given in the first chapter of the book on the topic by
Hardy [1973].
11 Given a solution (ρ, v) to Eq. (38.6), we mentioned a procedure for finding
a state s̃ ∈ S that is recurrent under some optimal policy. This works as
follows. Let C0 = {(s, a) : ρ + v(s) = ra (s) + hPa (s), vi} and I0 = {s :
(s, a) ∈ C0 for some a ∈ A}. Then define Ck+1 and Ik+1 inductively by the
following algorithm. First find an (s, a) ∈ Ck such that Pa (s, s0 ) > 0 for some
s0 6∈ Ik . If no such pair exists, then halt. Otherwise let Ck+1 = Ck \ {(s, a)}
and Ik+1 = {s : (s, a) ∈ Ck+1 for some a ∈ A}. Now use the complementary
slackness conditions of the dual program to Eq. (38.6) to prove that the
38.8 Notes 534
algorithm halts with some non-empty Ik and that these states are recurrent
under some optimal policy. For more details, have a look at Exercise 4.15 of
the second volume of the book by Bertsekas [2012].
12 We mentioned enumeration, value iteration and policy iteration as other
methods for computing optimal policies. Enumeration just means enumerating
all deterministic memoryless policies and selecting the one with the highest
gain. This is obviously too expensive. Policy iteration is an iterative process
that starts with a policy π0 . In each round, the algorithm computes πk+1
from πk by computing vπk and then choosing πk+1 to be the greedy policy
with respect to vπk . This method may not converge to an optimal policy,
but by slightly modifying the update process, one can prove convergence.
For more details, see chapter 4 of volume 2 of the book by Bertsekas [2012].
Value iteration works by choosing an arbitrary value function v0 and then
inductively defining vk+1 = T vk , where (T v)(s) = maxa∈A ra (s) + hPa (s), vi
is the Bellman operator. Under certain technical conditions, one can prove
that the greedy policy with respect to vk converges to an optimal policy. Note
that vk+1 = Ω(k), which can be a problem numerically. A simple idea is to
let vk+1 = T vk − δk where δk = maxs∈S vk (s). Since the greedy policy is the
same for v and v + c1, this does not change the mathematics, but improves
the numerical situation. The aforementioned book by Bertsekas is again a
good source for more details. Unfortunately, none of these algorithms have
known polynomial time guarantees on the computation complexity of finding
an optimal policy without stronger assumptions than we would like. In practice,
however, both value and policy iteration work quite well, while the ellipsoid
method for solving linear programs should be avoided at all costs. Of course
there are other methods for solving linear programs, and these can be effective.
13 Theorem 38.6 is vacuous when the diameter is infinite, but you might wonder if
the bound continues to hold in certain ‘nice’ cases. Unfortunately, the algorithm
is rather brittle. UCRL2 suffers linear regret if there is a single unreachable
state with reward larger than the optimal gain (Exercise 38.27).
14 One can modify the concept of regret to allow for MDPs that have traps. We
restrict our attention to policies with sublinear regret in strongly connected
MDPs, which must try and explore the whole state space and hence almost
surely become trapped in a strongly communicating subset of the state space.
The regret is redefined by ‘restarting the clock’ at the time when the policy
gets trapped. For details, see Exercise 38.29.
15 The assumption that the reward function is known can be relaxed without
difficulty. It is left as an exercise to figure out how to modify algorithm and
analysis to the case when r is unknown and reward observed in round t is
bounded in [0, 1] and has conditional mean rAt (St ). See Exercise 38.23.
16 Although√it has not been done yet in this setting, the path to removing the
spurious S from the bound is to avoid the application of Cauchy–Schwarz
in Eq. (38.20). Instead one should define confidence intervals directly on
hP̂k − P, vk i, where the dependence on the state and action has been omitted.
38.9 Bibliographical Remarks 535
√
Szepesvári [2011] and Abbasi-Yadkori [2012] give algorithms with O( n) regret
for linearly parameterised MDP problems with quadratic cost (linear quadratic
regulation, or LQR), while Ortner and Ryabko [2012] give O(n(2d+1)/(2d+2) ) regret
bounds under a Lipschitz assumption, where d is the dimensionality of the state
space. The algorithms in these works are not guaranteed to be computationally
efficient because they rely on optimistic policies. In theory, this could be addressed
by Thompson sampling, which is considered by Abeille and Lazaric [2017b], who
obtain partial results for the LQR setting. Thompson sampling has also been
studied in the Bayesian framework by Osband et al. [2013], Abbasi-Yadkori
and Szepesvári [2015], Osband and Van Roy [2017] and Theocharous et al.
[2017], of which Abbasi-Yadkori and Szepesvári [2015] and Theocharous et al.
[2017] consider general parametrisations, while the other papers are concerned
with finite state-action MDPs. Learning in MDPs has also been studied in the
probability approximately correct (PAC) framework introduced by Kearns and
Singh [2002], where the objective is to design policies for which the number
of badly suboptimal actions is small with high probability. The focus of these
papers is on the discounted reward setting rather than average reward. The
algorithms are again built on the optimism principle. Algorithms that are known
to be PAC-MDP include R-max [Brafman and Tennenholtz, 2003, Kakade, 2003],
MBIE [Strehl and Littman, 2005, 2008], delayed Q-learning [Strehl et al., 2006],
the optimistic-initialisation-based algorithm of Szita and Lőrincz [2009], MorMax
by Szita and Szepesvári [2010], and an adaptation of UCRL by Lattimore and
Hutter [2012], which they call UCRLγ. The latter work presents optimal results
(matching upper and lower bounds) for the case when the transition structure
is sparse, while the optimal dependence on the number of state-action pairs
is achieved by delayed Q-learning and Mormax [Strehl et al., 2006, Szita and
Szepesvári, 2010], though the Mormax bound is better in its dependency on the
discount factor. The idea to incorporate the uncertainty in the transitions into
the action space to solve the optimistic optimisation problem appeared in the
analysis of MBIE [Strehl and Littman, 2008]. A hybrid between stochastic and
adversarial settings is when the reward sequence is chosen by an adversary, while
transitions are stochastic. This problem has been introduced by Even-Dar et al.
[2004]. State-of-the-art results for the bandit case are due to Neu et al. [2014],
where the reader can also find further pointers to the literature. The case when
the rewards and the transition probability distributions are chosen adversarially
is studied by Abbasi-Yadkori et al. [2013].
38.10 Exercises
µ ∈ P(S), show there exists a probability space (Ω, F, P) and an infinite sequence
of random elements S1 , A1 , S2 , A2 , . . . such that for any s ∈ S, a ∈ A and t ∈ N,
Hint Let τ*(s, s′) be the shortest expected travel time between an arbitrary pair
of states s and s′, which for s = s′ is defined to be zero. Show that τ* satisfies the
fixed-point equation
τ*(s, s′) = 0 if s = s′, and τ*(s, s′) = 1 + min_a Σ_{s″} P_a(s, s″) τ*(s″, s′) otherwise.
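The fixed-point equation above can also be solved numerically by simply iterating it, in the spirit of value iteration. Below is a minimal sketch (not part of the original text); the transition tensor, the target state and the function name are illustrative choices.

```python
import numpy as np

def expected_travel_times(P, target, iters=10_000, tol=1e-10):
    """Iterate tau(s) <- 1 + min_a sum_{s''} P[a, s, s''] tau(s''),
    pinning tau(target) to zero.  P has shape (A, S, S)."""
    tau = np.zeros(P.shape[1])
    for _ in range(iters):
        new = 1.0 + (P @ tau).min(axis=0)   # minimise over actions
        new[target] = 0.0                   # tau*(s, s') = 0 when s = s'
        if np.max(np.abs(new - tau)) < tol:
            break
        tau = new
    return new

# Illustrative two-state MDP: action 0 stays put, action 1 switches state.
P = np.array([[[1.0, 0.0], [0.0, 1.0]],
              [[0.0, 1.0], [1.0, 0.0]]])
print(expected_travel_times(P, target=1))   # -> [1., 0.]
```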
38.5 (Diameter lower bound) Let M = (S, A, P, r) be any MDP. Show that
D(M) ≥ log_A(S) − 3.
Hint Denote by d∗ (s, s0 ) the minimum expected time it takes to reach
state s0 when starting from state s. The definition of d∗ can be extended to
arbitrary initial distributions µ0 over states and sets U ⊂ S of target states:
d*(μ0, U) = Σ_s μ0(s) Σ_{s′∈U} d*(s, s′). Prove by induction on the size of U that

d*(μ0, U) ≥ min{ Σ_{k≥0} k n_k : 0 ≤ n_k ≤ A^k for all k ≥ 0, Σ_{k≥0} n_k = |U| }    (38.25)
and then conclude that the proposition holds by choosing U = S [Jaksch et al.,
2010, corollary 15].
Hint Note that the first four parts of this exercise are the same as in Chapter 37.
For parts (c) and (d), you will likely find it useful that the space of right stochastic
matrices is compact. Then show that all cluster points of (An ) are the same. For
(g), show that v = U r.
The previous exercise shows that the gain and differential value function of
any memoryless policy in any MDP are well defined. The matrix H is called
the fundamental matrix, and U is called the deviation matrix.
38.8 (Discounted MDPs) Let γ ∈ (0, 1), and define the operator Tγ : RS → RS
by
(T_γ v)(s) = max_{a∈A} ( r_a(s) + γ⟨P_a(s), v⟩ ).
Hint For (b), you should use the contraction mapping theorem (or Banach
fixed point theorem), which says that if (X , d) is a complete metric space and
T : X → X satisfies d(T (x), T (y)) ≤ γd(x, y) for γ ∈ [0, 1), then there exists an
x ∈ X such that T (x) = x. For (e), use (d) and Exercise 38.2 to show that
it suffices to check that v_γ^π ≤ v for any Markov policy π. Verify this by using
the fact that T_γ is monotone (f ≤ g implies that T_γ f ≤ T_γ g) and showing that
v^π_{γ,n} ≤ T_γ^n 0 holds for any n, where v^π_{γ,n}(s) is the total expected discounted reward
of the policy when it is started from state s and is followed for n steps.
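For intuition about the hint, the following sketch (not from the text; the random MDP is a made-up example) applies the operator T_γ repeatedly and checks that the gap between successive iterates shrinks at rate γ, as the contraction mapping theorem predicts.

```python
import numpy as np

def T_gamma(v, P, r, gamma):
    """(T_gamma v)(s) = max_a ( r[a, s] + gamma * <P[a, s], v> )."""
    return (r + gamma * (P @ v)).max(axis=0)

rng = np.random.default_rng(0)
A, S, gamma = 2, 4, 0.9
P = rng.random((A, S, S))
P /= P.sum(axis=2, keepdims=True)          # make each P[a, s] a distribution
r = rng.random((A, S))

v, gap = np.zeros(S), None
for n in range(1, 101):
    v_next = T_gamma(v, P, r, gamma)
    new_gap = np.max(np.abs(v_next - v))   # ||T^n 0 - T^{n-1} 0||_inf
    if gap is not None:
        assert new_gap <= gamma * gap + 1e-12   # the contraction property
    gap, v = new_gap, v_next
print("approximate fixed point:", v)
```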
38.9 (From discounting to average reward) Recall that H = (I − P +
P ∗ )−1 , U = H − P ∗ . For γ ∈ [0, 1), define Pγ∗ = (1 − γ)(I − γP )−1 . Show that
Hint For (a) start by manipulating the expressions Pγ∗ P and (Pγ∗ )−1 P ∗ . For
(b) consider H −1 (Pγ∗ − P ∗ ).
38.10 (Solution to Bellman optimality equation) In this exercise you
will prove part (a) of Theorem 38.2.
(a) Prove there exists a deterministic stationary policy π and an increasing sequence
of discount rates (γn ) with γn < 1 and limn→∞ γn = 1 such that π is a
greedy policy with respect to the fixed point vn of Tγn for all n.
(b) For the remainder of the exercise, fix a policy π whose existence is guaranteed
by part (a). Show that the gain ρ^π is constant; that is, ρ^π = ρ1 for some ρ ∈ R.
(c) Let v = vπ be the value function and ρ = ρπ the gain of policy π. Show that
(ρ, v) satisfies the Bellman optimality equation.
Hint For (a), use the fact that for finite MDPs there are only finitely many
memoryless deterministic policies. For (b) and (c), use Exercise 38.9.
38.11 (Counterintuitive solutions to the Bellman equation) Consider
the deterministic MDP shown below with two states and two actions. The first
action, stay, keeps the state the same, and the second action, go, moves the
learner to the other state while incurring a reward of −1. Show that in this
example, solutions (ρ, v) to the Bellman optimality equations (Eq. (38.5)) are
exactly the elements of the set
{ (ρ, v) ∈ R × R² : ρ = 0, v(1) − 1 ≤ v(2) ≤ v(1) + 1 }.
[Figure: the two-state MDP of this exercise; stay has reward 0 in both states and go has reward −1 in both directions.]
Note for the sake of curiosity that the above display continues to hold for weakly
communicating MDPs.
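As a quick numerical sanity check of the claimed solution set (an illustration only; the state and action encoding is a choice of this sketch), one can evaluate both sides of the Bellman optimality equation:

```python
import numpy as np

# States {0, 1} stand for {1, 2}; action 0 = stay (reward 0), action 1 = go (reward -1).
P = np.array([[[1.0, 0.0], [0.0, 1.0]],
              [[0.0, 1.0], [1.0, 0.0]]])
r = np.array([[0.0, 0.0], [-1.0, -1.0]])    # r[a, s]

def solves_bellman(rho, v):
    """Check rho + v(s) = max_a ( r_a(s) + <P_a(s), v> ) for every state s."""
    return np.allclose(rho + v, (r + P @ v).max(axis=0))

print(solves_bellman(0.0, np.array([0.0, 0.5])))   # True: v(2) lies in [v(1)-1, v(1)+1]
print(solves_bellman(0.0, np.array([0.0, 2.0])))   # False: v(2) > v(1) + 1
print(solves_bellman(0.1, np.array([0.0, 0.0])))   # False: rho must be 0
```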
Figure 38.4 A two-state MDP. Transitions and rewards are deterministic. Numbers indicate the rewards (the figure shows rewards of 1/2).
(a) Find all memoryless optimal policies for the MDP in Fig. 38.4.
(b) Prove that the version of UCRL2 given in Exercise 38.23 modified to re-solve
the optimistic MDP in every round suffers linear regret on this MDP.
Hint Since UCRL2 and the environment are both deterministic you can
examine the behaviour of the algorithm on the MDP. You should aim to prove
that eventually the algorithm will alternate between actions stay and go.
[Long-term plans should have phases] The reason that UCRL2, or more generally
any optimistic algorithm without an explicit introduction of phases, fails here is
that when creating a plan, UCRL2 solves an infinite-horizon problem: a reward
in some state other than the current one that is larger by even the tiniest amount
than the reward in the current state makes it worthwhile to switch to that state.
If we instead considered a finite-horizon version of the problem, where experience
is collected in episodes with some fixed start state or start-state distribution, an
optimistic algorithm would eventually stop considering switches, because over a
finite horizon the loss from using the switching actions would eventually be
assessed to be higher than the potential gain from switching.
Hint Use the result of Exercise 5.17 and apply a union bound over all state-
action pairs and the number of samples. Use the Markov property to argue that
the independence assumption in Exercise 5.17 is not problematic.
38.22 Let (ak ) and (Ak ) be non-negative numbers so that for any k ≥ 0,
ak+1 ≤ Ak = 1 ∨ (a1 + · · · + ak ). Prove that for any m ≥ 1,
Σ_{k=1}^{m} a_k / √(A_{k−1}) ≤ (√2 + 1) √(A_m).
Hint The statement is trivial if Σ_{k=1}^{m−1} a_k ≤ 1. If this does not hold, use
induction based on m = n, n + 1, . . ., where n is the first integer such that
Σ_{k=1}^{n−1} a_k > 1.
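Before attempting the induction, a numerical sanity check can be reassuring. The sketch below (illustrative only) samples sequences satisfying the assumption a_{k+1} ≤ A_k and verifies the inequality on them.

```python
import numpy as np

rng = np.random.default_rng(1)
for _ in range(1000):
    m = 20
    a, A, s = [0.0], [1.0], 0.0              # a[1..m] and A[0..m], with A_0 = 1
    for k in range(1, m + 1):
        a.append(rng.uniform(0.0, A[k - 1])) # the assumption a_k <= A_{k-1}
        s += a[k]
        A.append(max(1.0, s))                # A_k = 1 v (a_1 + ... + a_k)
    lhs = sum(a[k] / np.sqrt(A[k - 1]) for k in range(1, m + 1))
    rhs = (np.sqrt(2.0) + 1.0) * np.sqrt(A[m])
    assert lhs <= rhs + 1e-12
print("the inequality held on all sampled sequences")
```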
38.23 (Unknown rewards) In this exercise, you will modify the algorithm
to handle the situation where r is unknown and rewards are stochastic. More
precisely, assume there exists a function ra (s) ∈ [0, 1] for all a ∈ A and s ∈ S.
Then, in each round, the learner observes St , chooses an action At and receives a
reward Xt ∈ [0, 1] with conditional mean r_{A_t}(S_t). Modify the algorithm and its
analysis to use the empirical reward estimates given by

r̂_{k,a}(s) = ( Σ_{u=1}^{τ_k−1} I{S_u = s, A_u = a} X_u ) / ( 1 ∨ T_{τ_k−1}(s, a) ).
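A minimal sketch of computing these estimates from logged data (the array-based trajectory format and the 0-based indexing are assumptions of this sketch, not of the text):

```python
import numpy as np

def reward_estimates(states, actions, rewards, tau_k, S, A):
    """Empirical mean reward for each (a, s) from the first tau_k - 1 rounds,
    with the count truncated below by one, as in the display above."""
    sums = np.zeros((A, S))
    counts = np.zeros((A, S))
    for u in range(tau_k - 1):               # rounds 1, ..., tau_k - 1 (0-based here)
        sums[actions[u], states[u]] += rewards[u]
        counts[actions[u], states[u]] += 1
    return sums / np.maximum(1.0, counts)

# Illustrative usage with a fabricated five-round trajectory:
states, actions = [0, 1, 0, 0, 1], [1, 0, 1, 0, 1]
rewards = [0.2, 0.9, 0.4, 0.0, 1.0]
print(reward_estimates(states, actions, rewards, tau_k=5, S=2, A=2))
```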
38.24 (Lower bound) In this exercise, you will prove the claims needed to complete
the proof of the lower bound.
(a) Derive the optimal policy and the average optimal reward.
(b) Show an optimal value function that solves the Bellman optimality equation.
(c) Prove that the diameter of this MDP is D = maxs 1/p(s).
(d) Consider the algorithm that puts one instance of an appropriate version
of UCB into every state (the same idea was explored in the context of
adversarial bandits in Section 18.1; a minimal sketch is given after this
exercise). Prove that the expected regret of your algorithm will be at most O(√(SAn)).
(e) Does the scaling behaviour of the upper bound in Theorem 38.6 match the
actual scaling behaviour of the expected regret of UCRL2 in this example?
Why or why not?
(f) Design and run an experiment to confirm your claim.
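For part (d), the sketch below (not from the text; the class name, the UCB1-style index and the interaction protocol are assumptions) keeps one index-based bandit per state and always plays the highest-index action in the current state. The environment loop is left to the reader.

```python
import math
import numpy as np

class UCBPerState:
    """One UCB1-style instance per state: in state s, play the action maximising
    mean(s, a) + sqrt(2 log t / T(s, a)), playing each action in s once first."""
    def __init__(self, S, A):
        self.sums = np.zeros((S, A))
        self.counts = np.zeros((S, A))
        self.t = 0

    def act(self, s):
        self.t += 1
        c = self.counts[s]
        if (c == 0).any():
            return int(np.argmin(c))             # unplayed actions first
        index = self.sums[s] / c + np.sqrt(2.0 * math.log(self.t) / c)
        return int(np.argmax(index))

    def update(self, s, a, reward):
        self.sums[s, a] += reward
        self.counts[s, a] += 1
```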
and S = [S] with S ≥ 2. In all states s > 1, action left deterministically leads
to state s − 1 and provides no reward. In state 1, action left leaves the state
unchanged and yields a reward of 0.05. The action right tends to make the agent
move right but not deterministically (the learner is swimming against a current).
With probability 0.3, the state is incremented; with probability 0.6, the state
is left unchanged; and with probability 0.1, the state is decremented. This
action incurs a reward of zero in all states except state S, where it yields a
reward of 1. The situation when S = 5 is illustrated in Fig. 38.5.
Figure 38.5 The RiverSwim MDP when S = 5. Solid arrows correspond to action left
and dashed ones to action right. The right-hand bank is slippery, so the learner
sometimes falls back into the river.
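Parts (b)–(e) below ask for simulations. A minimal sketch of the dynamics described above (the class name, the interface and the clipping behaviour at the two banks are choices of this sketch, not of the text):

```python
import numpy as np

LEFT, RIGHT = 0, 1

class RiverSwim:
    """States 1, ..., S. left: move one state towards 1 (reward 0.05 only in state 1).
    right: up w.p. 0.3, stay w.p. 0.6, down w.p. 0.1 (reward 1 only in state S)."""
    def __init__(self, S, seed=None):
        self.S, self.rng, self.state = S, np.random.default_rng(seed), 1

    def step(self, action):
        s = self.state
        if action == LEFT:
            reward = 0.05 if s == 1 else 0.0
            self.state = max(1, s - 1)
        else:
            reward = 1.0 if s == self.S else 0.0
            u = self.rng.random()
            if u < 0.3:
                self.state = min(self.S, s + 1)
            elif u < 0.9:
                self.state = s
            else:
                self.state = max(1, s - 1)
        return reward, self.state

env = RiverSwim(S=5, seed=0)
total = sum(env.step(RIGHT)[0] for _ in range(10_000))  # always swim right
```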
(a) Show that the optimal policy always takes action right and calculate the
optimal average reward ρ∗ as a function of S.
(b) Implement the MDP and test the optimal policy when started from state 1.
Plot the total reward as a function of time and compare it with the plot of
t ↦ tρ*. Run multiple simulations to produce error bars. How fast do you
think the total reward concentrates around tρ∗ ? Experiment with different
values of S.
(c) The ε-greedy strategy can also be implemented in MDPs as follows: based
on the data previously collected, estimate the transition probabilities and
rewards using empirical means. Find the optimal policy π ∗ of the resulting
MDP, and if the current state is s, use the action π ∗ (s) with probability 1 − ε
and choose one of the two actions uniformly at random with the remaining
probability. To ensure the empirical MDP has a well-defined optimal policy,
mix the empirical estimate of the next-state distributions Pa(s) with the
uniform distribution using a small mixture coefficient. Implement this strategy
(a sketch is given after this exercise) and plot the trajectories it exhibits for
various MDP sizes. Explain what you see.
(d) Implement UCRL2 and produce the same plots. Can you explain what you
see?
(e) Run simulations in RiverSwim instances of various sizes to compare the
regret of UCRL2 and ε-greedy. What do you conclude?
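For part (c), a minimal sketch (the planner below uses discounted value iteration with γ close to one as a stand-in for average-reward planning, and the interface, names and mixture coefficient are assumptions of this sketch):

```python
import numpy as np

def greedy_policy(P, r, gamma=0.99, iters=2000):
    """Plan in the estimated MDP: discounted value iteration, then act greedily."""
    v = np.zeros(P.shape[1])
    for _ in range(iters):
        v = (r + gamma * (P @ v)).max(axis=0)
    return (r + gamma * (P @ v)).argmax(axis=0)        # pi_star, shape (S,)

def epsilon_greedy_action(s, trans_counts, reward_sums, rng, eps=0.1, alpha=1e-3):
    """Estimate the MDP from counts, mix next-state estimates with the uniform
    distribution (coefficient alpha), plan, and act epsilon-greedily in state s."""
    A, S, _ = trans_counts.shape
    n_sa = np.maximum(1.0, trans_counts.sum(axis=2, keepdims=True))
    P_hat = (1 - alpha) * trans_counts / n_sa + alpha / S
    P_hat /= P_hat.sum(axis=2, keepdims=True)          # renormalise rows
    r_hat = reward_sums / np.maximum(1.0, trans_counts.sum(axis=2))
    pi_star = greedy_policy(P_hat, r_hat)
    return rng.integers(A) if rng.random() < eps else int(pi_star[s])
```

Re-planning from scratch in every round, as above, is wasteful; in practice one would recompute π* only occasionally.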
38.27 (UCRL2 and unreachable states) Show that UCRL2 suffers linear
regret if there is a single unreachable state with reward larger than the optimal
gain.
Hint Think about the optimistic MDP and the optimistic transitions to the
unreachable state. The article by Fruit et al. [2018] provides a policy that mitigates
the problem.
38.28 (MDPs with traps (i)) Fix state space S, action space A and reward
function r. Let π be a policy with sublinear regret in all strongly connected
MDPs (S, A, r, P ). Now suppose that (S, A, r, P ) is an MDP that is not strongly
connected such that for all s ∈ S, there exists a state s0 that is reachable from s
under some policy and where ρ∗s0 < maxu ρ∗u . Finally, assume that ρ∗S1 = maxu ρ∗u
almost surely. Prove that π has linear regret on this MDP.
38.29 (MDPs with traps (ii)) This exercise develops the ideas mentioned in
Note 14. First, we need some definitions: fix S and A and define Π0 as the set
of policies (learner strategies) for MDPs with state space S and action space A
that achieve sublinear regret in any strongly connected MDP with state space S
and action space A. Now consider an arbitrary finite MDP M = (S, A, P, r). A
state s ∈ S is reachable from state s0 ∈ S if there is a policy that when started
in s0 reaches state s with positive probability after one or more steps. A set of
states C ⊂ S is a strongly connected component (SCC) if every state s ∈ C
is reachable from every other state s′ ∈ C, including s = s′. A set C ⊆ S is
maximal if we cannot add more states to C and still maintain the SCC property.
An SCC C is called a maximal end component (MEC) if there does not exist another
SCC C′ with C ⊂ C′. Show the following (a computational sketch of these definitions appears after the exercise):
(a) Show that there exists at least one MEC and that any two MECs C1 and C2 are
either equal or disjoint.
(b) Let C1 , . . . , Ck be all the distinct MECs of an MDP. The MDP structure
defines a connectivity over C1 , . . . , Ck as follows: for i 6= j, we say that Ci is
connected to Cj if from some state in Ci , it is possible to reach some state of
Cj with positive probability under some policy. Show that this connectivity
structure defines a directed graph, which must be acyclic.
(c) Let C1 , . . . , Cm with m ≤ k be the sinks (the nodes with no out edges) of
this graph. Show that if M is strongly connected, then m = 1 and C1 = S.
(d) Show that for any i ∈ [m] and for any policy π ∈ Π0 , it holds that π will
reach Ci in finite time with positive probability if the initial state distribution
assigns positive mass to the non-trap states S \ ∪i∈[m] Ci .
(e) Show that for i ≤ m, for any s ∈ Ci and any action a ∈ A, Pa (s, s0 ) = 0 for
any s0 ∈ S \ Ci , i.e., Ci is closed.
(f) Show that the restriction of M to Ci defined as
is an MDP.
(g) Show that Mi is strongly connected.
(h) Let τ be the time when the learner enters one of C1 , . . . , Cm and let I ∈ [m]
be the index of the class that is entered at time τ . That is, Sτ ∈ CI . Show
that if M is strongly connected, then τ = 1 with probability one.
(i) We redefine the regret as follows:

R′_n = E[ nρ*(M_I) − Σ_{t=τ}^{τ+n−1} r_{A_t}(S_t) ].
The logic of the regret definition in part (i) is that, by part (d), reasonable
policies cannot control which trap they fall into in an MDP that has more
than one trap. As such, policies should not be penalised for which trap they
fall into. However, once a policy falls into some trap, we expect it to start
to behave near-optimally. What this definition is still lacking is that it is
insensitive to how fast a policy gets trapped. This last point is quite subtle
[Fruit et al., 2018].
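The definitions above can be explored computationally. The sketch below (illustrative; it glosses over the subtlety of singleton components without self-reachability) builds the positive-probability reachability relation and returns the closed classes of mutually reachable states, that is, the 'traps' C_1, . . . , C_m.

```python
import numpy as np

def positive_reachability(P):
    """reach[s, s'] = True iff s' is reachable from s in one or more steps with
    positive probability under some policy.  P has shape (A, S, S)."""
    adj = P.max(axis=0) > 0
    reach = adj.copy()
    for _ in range(P.shape[1]):
        reach |= (reach.astype(int) @ adj.astype(int)) > 0
    return reach

def traps(P):
    """Closed sets of mutually reachable states: no action leaves the set
    with positive probability (the sink components of the condensation)."""
    S = P.shape[1]
    reach = positive_reachability(P)
    mutual = reach & reach.T
    result, seen = [], set()
    for s in range(S):
        if s in seen:
            continue
        comp = sorted({s} | {t for t in range(S) if mutual[s, t]})
        seen.update(comp)
        outside = [t for t in range(S) if t not in comp]
        if not P[:, comp][:, :, outside].any():        # no positive mass leaves comp
            result.append(comp)
    return result
```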
38.30 (Chain rule for relative entropy) Prove the claim in Eq. (38.22).
Bibliography
on Machine Learning, pages 3–11, San Francisco, CA, USA, 1999. Morgan
Kaufmann Publishers Inc. [246]
M. Abeille and A. Lazaric. Linear Thompson sampling revisited. In Proceedings
of the 20th International Conference on Artificial Intelligence and Statistics,
pages 176–184, Fort Lauderdale, FL, USA, 2017a. JMLR.org. [473]
M. Abeille and A. Lazaric. Thompson sampling for linear-quadratic control
problems. In Proceedings of the 20th International Conference on Artificial
Intelligence and Statistics, pages 1246–1254, Fort Lauderdale, FL, USA, 2017b.
JMLR.org. [537]
J. D. Abernethy and A. Rakhlin. Beating the adaptive bandit with high probability.
In Proceedings of the 22nd Conference on Learning Theory, 2009. [171, 343]
J. D. Abernethy, E. Hazan, and A. Rakhlin. Competing in the dark: An efficient
algorithm for bandit linear optimization. In Proceedings of the 21st Conference
on Learning Theory, pages 263–274. Omnipress, 2008. [343]
J. D. Abernethy, E. Hazan, and A. Rakhlin. Interior-point methods for full-
information and bandit online learning. IEEE Transactions on Information
Theory, 58(7):4164–4175, 2012. [170, 340]
J. D. Abernethy, C. Lee, A. Sinha, and A. Tewari. Online linear optimization via
smoothing. In Proceedings of the 27th Conference on Learning Theory, pages
807–823, Barcelona, Spain, 2014. JMLR.org. [373]
J. D. Abernethy, C. Lee, and A. Tewari. Fighting bandits with a new kind of
smoothness. In Advances in Neural Information Processing Systems, pages
2197–2205. Curran Associates, Inc., 2015. [343, 373]
M. Abramowitz and I. A. Stegun. Handbook of mathematical functions: with
formulas, graphs, and mathematical tables, volume 55. Courier Corporation,
1964. [183, 474]
M. Achab, S. Clémençon, A. Garivier, A. Sabourin, and C. Vernade. Max k-
armed bandit: On the extremehunter algorithm and beyond. In Joint European
Conference on Machine Learning and Knowledge Discovery in Databases, pages
389–404. Springer, 2017. [416]
L. Adelman. Choice theory. In Saul I. Gass and Michael C. Fu, editors,
Encyclopedia of Operations Research and Management Science, pages 164–
168. Springer US, Boston, MA, 2013. [67]
A. Agarwal, D. P. Foster, D. J. Hsu, S. M. Kakade, and A. Rakhlin. Stochastic
convex optimization with bandit feedback. In Advances in Neural Information
Processing Systems, pages 1035–1043. Curran Associates, Inc., 2011. [359]
A. Agarwal, D. P. Foster, D. Hsu, S. M. Kakade, and A. Rakhlin. Stochastic
convex optimization with bandit feedback. SIAM Journal on Optimization, 23
(1):213–240, 2013. [415]
A. Agarwal, D. Hsu, S. Kale, J. Langford, L. Li, and R. Schapire. Taming the
monster: A fast and simple algorithm for contextual bandits. In Proceedings
of the 31st International Conference on Machine Learning, pages 1638–1646,
Bejing, China, 2014. JMLR.org. [231, 232, 233]
Conference on Learning Theory, pages 116–120, New York, NY, USA, 2016.
JMLR.org. [157]
P. Auer and R. Ortner. Logarithmic online regret bounds for undiscounted
reinforcement learning. In Advances in Neural Information Processing Systems,
pages 49–56. MIT Press, 2007. [536]
P. Auer and R. Ortner. UCB revisited: Improved regret bounds for the stochastic
multi-armed bandit problem. Periodica Mathematica Hungarica, 61(1-2):55–65,
2010. [98, 128]
P. Auer, N. Cesa-Bianchi, Y. Freund, and R. E. Schapire. Gambling in a
rigged casino: The adversarial multi-armed bandit problem. In Proceedings of
the 36th Annual Symposium on Foundations of Computer Science, pages
322–331. IEEE, 1995. [97, 146, 159, 202]
P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time analysis of the multiarmed
bandit problem. Machine Learning, 47:235–256, 2002a. [95, 110]
P. Auer, N. Cesa-Bianchi, Y. Freund, and R. E. Schapire. The nonstochastic
multiarmed bandit problem. SIAM Journal on Computing, 32(1):48–77, 2002b.
[171, 203, 232, 385]
P. Auer, R. Ortner, and Cs. Szepesvári. Improved rates for the stochastic
continuum-armed bandit problem. In International Conference on
Computational Learning Theory, pages 454–468. Springer, 2007. [358]
P. Auer, T. Jaksch, and R. Ortner. Near-optimal regret bounds for reinforcement
learning. In Advances in Neural Information Processing Systems, pages 89–96,
2009. [536]
P. Auer, P. Gajane, and R. Ortner. Adaptively tracking the best arm with
an unknown number of distribution changes. In European Workshop on
Reinforcement Learning 14, 2018. [385]
P. Auer, P. Gajane, and R. Ortner. Adaptively tracking the best bandit arm
with an unknown number of distribution changes. In Proceedings of the 32nd
Conference on Learning Theory, 2019. [383, 385]
B. Awerbuch and R. Kleinberg. Adaptive routing with end-to-end feedback:
Distributed learning and geometric approaches. In Proceedings of the 36th
annual ACM symposium on theory of computing, pages 45–53. ACM, 2004.
[373]
S. J. Axler. Linear algebra done right, volume 2. Springer, 1997. [502]
M. G. Azar, I. Osband, and R. Munos. Minimax regret bounds for reinforcement
learning. In Proceedings of the 34th International Conference on Machine
Learning, pages 263–272, Sydney, Australia, 06–11 Aug 2017. JMLR.org. [536]
A. Badanidiyuru, R. Kleinberg, and A. Slivkins. Bandits with knapsacks. In
Proceedings of the 54th Annual IEEE Symposium on Foundations of Computer
Science (FOCS), pages 207–216. IEEE, 2013. [359]
P. L. Bartlett and A. Tewari. Regal: A regularization based algorithm for
reinforcement learning in weakly communicating MDPs. In Proceedings of
the 25th Conference on Uncertainty in Artificial Intelligence, pages 35–42,
Arlington, VA, United States, 2009. AUAI Press. [541]
J. Gittins. Bandit processes and dynamic allocation indices. Journal of the Royal
Statistical Society. Series B (Methodological), 41(2):148–177, 1979. [384, 456]
J. Gittins, K. Glazebrook, and R. Weber. Multi-armed bandit allocation indices.
John Wiley & Sons, 2011. [17, 384, 456]
D. Glowacka. Bandit algorithms in information retrieval. Foundations and
Trends® in Information Retrieval, 13:299–424, 01 2019. [400]
P. Glynn and S. Juneja. Ordinal optimization – empirical large deviations rate
estimators, and stochastic multi-armed bandits. arXiv:1507.04564, 2015. [416]
D. Goldsman. Ranking and selection in simulation. In Proceedings of the 15th
Conference on Winter Simulation, pages 387–394, 1983. [416]
A. Gopalan and S. Mannor. Thompson sampling for learning parameterized
Markov decision processes. In Proceedings of the 28th Conference on Learning
Theory, pages 861–898, Paris, France, 2015. JMLR.org. [473]
G. J. Gordon. Regret bounds for prediction problems. In Proceedings of the 12th
Conference on Learning Theory, pages 29–40, 1999. [342, 343]
T. Graepel, J. Q. Candela, T. Borchert, and R. Herbrich. Web-scale Bayesian
click-through rate prediction for sponsored search advertising in Microsoft's
Bing search engine. In Proceedings of the 27th International Conference on
Machine Learning, pages 13–20, USA, 2010. Omnipress. [473]
O. Granmo. Solving two-armed Bernoulli bandit problems using a Bayesian
learning automaton. International Journal of Intelligent Computing and
Cybernetics, 3(2):207–234, 2010. [473]
R. M. Gray. Entropy and information theory. Springer Science & Business Media,
2011. [193]
K. Greenewald, A. Tewari, S. Murphy, and P. Klasnja. Action centered contextual
bandits. In Advances in Neural Information Processing Systems, pages 5977–
5985. Curran Associates, Inc., 2017. [17]
M. Grötschel, L. Lovász, and A. Schrijver. Geometric algorithms and combinatorial
optimization, volume 2. Springer Science & Business Media, 2012. [270, 372,
535]
F. Guo, C. Liu, and Y. M. Wang. Efficient multiple-click models in web search.
In Proceedings of the 2nd ACM International Conference on Web Search and
Data Mining, pages 124–131. ACM, 2009. [400]
A. György and Cs. Szepesvári. Shifting regret, mirror descent, and matrices. In
Proceedings of the 33rd International Conference on Machine Learning, pages
2943–2951, New York, NY, USA, 20–22 Jun 2016. JMLR.org. [385]
A. György, T. Linder, G. Lugosi, and G. Ottucsák. The on-line shortest path
problem under partial monitoring. Journal of Machine Learning Research, 8
(Oct):2369–2403, 2007. [373, 374]
A. György, D. Pál, and Cs. Szepesvári. Online learning: Algorithms for Big Data.
2019. [385]
P. R. Halmos. Measure Theory. Graduate Texts in Mathematics. Springer New
York, 1976. [53]
D. van der Hoeven, T. van Erven, and W. Kotlowski. The many faces of
exponential weights in online learning. In Proceedings of the 31st Conference
on Learning Theory, pages 2067–2092, 2018. [323]
A. W. van der Vaart and J. A. Wellner. Weak Convergence and Empirical
Processes. Springer, New York, 1996. [341]
H. P. Vanchinathan, G. Bartók, and A. Krause. Efficient partial monitoring
with prior information. In Advances in Neural Information Processing Systems,
pages 1691–1699. Curran Associates, Inc., 2014. [506]
V. Vapnik. Statistical learning theory, volume 3. Wiley, New York, 1998.
[236]
P. Varaiya, J. Walrand, and C. Buyukkoc. Extensions of the multiarmed bandit
problem: The discounted case. IEEE Transactions on Automatic Control, 30
(5):426–439, 1985. [456]
C. Vernade, O. Cappé, and V. Perchet. Stochastic bandit models for delayed
conversions. In Proceedings of the 33rd Conference on Uncertainty in Artificial
Intelligence. AUAI Press, 2017. [360]
C. Vernade, A. Carpentier, G. Zappella, B. Ermis, and M. Brueckner. Contextual
bandits under delayed feedback. arXiv:1807.02089, 2018. [360]
S. Villar, J. Bowden, and J. Wason. Multi-armed bandit models for the optimal
design of clinical trials: benefits and challenges. Statistical science: a review
journal of the Institute of Mathematical Statistics, 30(2):199–215, 2015. [17]
W. Vogel. An asymptotic minimax theorem for the two armed bandit problem.
The Annals of Mathematical Statistics, 31(2):444–451, 1960. [202]
J. von Neumann. Zur Theorie der Gesellschaftsspiele. Mathematische Annalen, 100
(1):295–320, 1928. [342]
V. G. Vovk. Aggregating strategies. Proceedings of Computational Learning
Theory, 1990. [146, 159]
S. Wang and W. Chen. Thompson sampling for combinatorial semi-bandits. In
Proceedings of the 35th International Conference on Machine Learning, pages
5114–5122, Stockholmsmässan, Stockholm Sweden, 10–15 Jul 2018. JMLR.org.
[373, 473]
Y. Wang, J-Y. Audibert, and R. Munos. Algorithms for infinitely many-armed
bandits. In Advances in Neural Information Processing Systems, pages 1729–
1736, 2009. [358]
M. K. Warmuth and A. Jagota. Continuous and discrete-time nonlinear gradient
descent: Relative loss bounds and convergence. In Electronic Proceedings of
the 5th International Symposium on Artificial Intelligence and Mathematics,
1997. [342]
P. L Wawrzynski and A. Pacut. Truncated importance sampling for reinforcement
learning with experience replay. In Proceedings of the International
Multiconference on Computer Science and Information Technology, pages 305–
315, 2007. [171]
R. Weber. On the Gittins index for multiarmed bandits. The Annals of Applied
Probability, 2(4):1024–1033, 1992. [456]