Thompson Sampling for Contextual Bandits with Linear Payoffs

Shipra Agrawal and Navin Goyal
Abstract

Thompson Sampling is one of the oldest heuristics for multi-armed bandit problems. It is a randomized algorithm based on Bayesian ideas, and has recently generated significant interest after several studies demonstrated it to have better empirical performance compared to the state-of-the-art methods. However, many questions regarding its theoretical performance remained open. In this paper, we design and analyze a generalization of the Thompson Sampling algorithm for the stochastic contextual multi-armed bandit problem with linear payoff functions, when the contexts are provided by an adaptive adversary. This is among the most important and widely studied versions of the contextual bandits problem. We prove a high probability regret bound of Õ( (d²/ε) √(T^{1+ε}) ) in time T for any 0 < ε < 1, where d is the dimension of each context vector and ε is a parameter used by the algorithm. Our results provide the first theoretical guarantees for the contextual version of Thompson Sampling, and are close to the lower bound of Ω(d√T) for this problem. This essentially solves a COLT open problem of Chapelle and Li [COLT 2012].

Proceedings of the 30th International Conference on Machine Learning, Atlanta, Georgia, USA, 2013. JMLR: W&CP volume 28. Copyright 2013 by the author(s).

1. Introduction

Multi-armed bandit (MAB) problems model the exploration/exploitation trade-off inherent in many sequential decision problems. There are many versions of multi-armed bandit problems; a particularly useful version is the contextual multi-armed bandit problem. In this problem, in each of T rounds, a learner is presented with the choice of taking one out of N actions, referred to as N arms. Before making the choice of which arm to play, the learner sees d-dimensional feature vectors b_i, referred to as "context", associated with each arm i. The learner uses these feature vectors, along with the feature vectors and rewards of the arms played by her in the past, to make the choice of the arm to play in the current round. Over time, the learner's aim is to gather enough information about how the feature vectors and rewards relate to each other, so that she can predict, with some certainty, which arm is likely to give the best reward by looking at the feature vectors. The learner competes with a class of predictors, in which each predictor takes in the feature vectors and predicts which arm will give the best reward. If the learner can guarantee to do nearly as well as the predictions of the best predictor in hindsight (i.e., have low regret), then the learner is said to successfully compete with that class.

In the contextual bandits setting with linear payoff functions, the learner competes with the class of all "linear" predictors on the feature vectors. That is, a predictor is defined by a d-dimensional parameter µ ∈ R^d, and the predictor ranks the arms according to b_i^T µ. We consider the stochastic contextual bandit problem under the linear realizability assumption, that is, we assume that there is an unknown underlying parameter µ ∈ R^d such that the expected reward for each arm i, given context b_i, is b_i^T µ. Under this realizability assumption, the linear predictor corresponding to µ is in fact the best predictor, and the learner's aim is to learn this underlying parameter. This realizability assumption is standard in the existing literature on contextual
multi-armed bandits, e.g. (Auer, 2002; Filippi et al., 2010; Chu et al., 2011; Abbasi-Yadkori et al., 2011).

Thompson Sampling (TS) is one of the earliest heuristics for multi-armed bandit problems. The first version of this Bayesian heuristic is around 80 years old, dating to Thompson (1933). Since then, it has been rediscovered numerous times independently in the context of reinforcement learning, e.g., in Wyatt (1997); Ortega & Braun (2010); Strens (2000). It is a member of the family of randomized probability matching algorithms. The basic idea is to assume a simple prior distribution on the underlying parameters of the reward distribution of every arm, and at every time step, play an arm according to its posterior probability of being the best arm. The general structure of TS for the contextual bandits problem involves the following elements:

1. a set Θ of parameters µ̃;
2. a prior distribution P(µ̃) on these parameters;
3. past observations D consisting of (context b, reward r) for the past time steps;
4. a likelihood function P(r | b, µ̃), which gives the probability of reward given a context b and a parameter µ̃;
5. a posterior distribution P(µ̃ | D) ∝ P(D | µ̃) P(µ̃), where P(D | µ̃) is the likelihood function.

In each round, TS plays an arm according to its posterior probability of having the best parameter. A simple way to achieve this is to produce a sample of the parameter for each arm, using the posterior distributions, and play the arm that produces the best sample.
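As a concrete illustration of this general structure, the following is a minimal sketch for the simplest Bernoulli-reward bandit with a conjugate Beta prior; the environment, parameter choices, and variable names here are illustrative only and are not taken from the paper.

```python
import numpy as np

# Minimal sketch of the general Thompson Sampling structure for a
# Bernoulli bandit with a Beta(1, 1) prior on each arm's mean reward.
# The Beta prior is conjugate to the Bernoulli likelihood, so the
# posterior after s successes and f failures is simply Beta(1+s, 1+f).

rng = np.random.default_rng(0)

def thompson_sampling_bernoulli(true_means, T):
    n_arms = len(true_means)
    successes = np.zeros(n_arms)
    failures = np.zeros(n_arms)
    total_reward = 0.0
    for _ in range(T):
        # Sample a parameter for each arm from its posterior ...
        samples = rng.beta(1.0 + successes, 1.0 + failures)
        # ... and play the arm whose sample is best.
        arm = int(np.argmax(samples))
        reward = float(rng.random() < true_means[arm])
        successes[arm] += reward
        failures[arm] += 1.0 - reward
        total_reward += reward
    return total_reward

print(thompson_sampling_bernoulli([0.3, 0.5, 0.7], T=10_000))
```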
In this paper, we design and analyze a natural generalization of Thompson Sampling (TS) for contextual bandits; this generalization fits the above general structure, and uses a Gaussian prior and a Gaussian likelihood function. We emphasize that although TS is a Bayesian approach, the description of the algorithm and our analysis apply to the prior-free stochastic MAB model, and our regret bounds will hold irrespective of whether or not the actual reward distribution matches the Gaussian likelihood function used to derive this Bayesian heuristic. Thus, our bounds for the TS algorithm are directly comparable to the UCB family of algorithms, which form a frequentist approach to the same problem. One could interpret the priors used by TS as a way of capturing the current knowledge about the arms.

Recently, TS has attracted considerable attention. Several studies (e.g., Granmo (2010); Scott (2010); Graepel et al. (2010); Chapelle & Li (2011); May & Leslie (2011); Kaufmann et al. (2012)) have empirically demonstrated the efficacy of TS: Scott (2010) provides a detailed discussion of probability matching techniques in many general settings, along with favorable empirical comparisons with other techniques. Chapelle & Li (2011) demonstrate that for the basic stochastic MAB problem, empirically TS achieves regret comparable to the lower bound of Lai & Robbins (1985); and in applications like display advertising and news article recommendation modeled by the contextual bandits problem, it is competitive with or better than other methods such as UCB. In their experiments, TS is also more robust to delayed or batched feedback than the other methods. TS has been used in an industrial-scale application for CTR prediction of search ads on search engines (Graepel et al., 2010). Kaufmann et al. (2012) do a thorough comparison of TS with the best known versions of UCB and show that TS has the lowest regret in the long run.

However, the theoretical understanding of TS is limited. Granmo (2010) and May et al. (2011) provided weak guarantees, namely, a bound of o(T) on the expected regret in time T. For the basic (i.e., without contexts) version of the stochastic MAB problem, some significant progress was made by Agrawal & Goyal (2012), Kaufmann et al. (2012) and, more recently, by Agrawal & Goyal (2013), who provided optimal bounds on the expected regret. But many questions regarding the theoretical analysis of TS remained open, including high probability regret bounds, and regret bounds for the more general contextual bandits setting. In particular, the contextual MAB problem does not seem easily amenable to the techniques used so far for analyzing TS for the basic MAB problem. In Section 3.1, we describe some of these challenges. Some of these questions and difficulties were also formally raised as a COLT 2012 open problem (Chapelle & Li, 2012).

In this paper, we use novel martingale-based analysis techniques to demonstrate that TS (i.e., our Gaussian prior based generalization of TS for contextual bandits) achieves high probability, near-optimal regret bounds for stochastic contextual bandits with linear payoff functions. To our knowledge, ours are the first non-trivial regret bounds for TS for the contextual bandits problem. Additionally, our results are the first high probability regret bounds for TS, even in the case of the basic MAB problem. This essentially solves the COLT 2012 open problem of Chapelle & Li (2012) for contextual bandits with linear payoffs.

Our version of the Thompson Sampling algorithm for the contextual MAB problem, described formally in Section 2.2, uses Gaussian prior and Gaussian likelihood functions. Our techniques can be extended to the use of other prior distributions, satisfying certain
conditions, as discussed in Section 4.

2. Problem setting and algorithm description

2.1. Problem setting

There are N arms. At time t = 1, 2, . . ., a context vector b_i(t) ∈ R^d is revealed for every arm i. These context vectors are chosen by an adversary in an adaptive manner after observing the arms played and their rewards up to time t − 1, i.e., the history H_{t−1},

H_{t−1} = {a(τ), r_{a(τ)}(τ), b_i(τ), i = 1, . . . , N, τ = 1, . . . , t − 1},

where a(τ) denotes the arm played at time τ. Given b_i(t), the reward for arm i at time t is generated from an (unknown) distribution with mean b_i(t)^T µ, where µ ∈ R^d is a fixed but unknown parameter:

E[ r_i(t) | {b_i(t)}_{i=1}^N, H_{t−1} ] = E[ r_i(t) | b_i(t) ] = b_i(t)^T µ.

An algorithm for this problem chooses, at every time t, an arm a(t) to play, using the history H_{t−1} and the current contexts. Let a∗(t) = arg max_i b_i(t)^T µ denote the optimal arm at time t, and let ∆_i(t) = b_{a∗(t)}(t)^T µ − b_i(t)^T µ denote the gap between the mean rewards of the optimal arm and of arm i at time t. The regret at time t is regret(t) = ∆_{a(t)}(t), and the goal is to minimize the total regret R(T) = Σ_{t=1}^T regret(t).

We assume that η_{i,t} = r_i(t) − b_i(t)^T µ is conditionally R-sub-Gaussian for a constant R ≥ 0, i.e.,

∀λ ∈ R,  E[ e^{λ η_{i,t}} | {b_i(t)}_{i=1}^N, H_{t−1} ] ≤ exp( λ²R²/2 ).

This assumption is satisfied whenever r_i(t) ∈ [b_i(t)^T µ − R, b_i(t)^T µ + R] (see Remark 1 in Appendix A.1 of Filippi et al. (2010)).
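Indeed, for bounded, conditionally zero-mean noise this is a direct consequence of Hoeffding's lemma:

```latex
% Hoeffding's lemma: if \eta_{i,t} \in [-R, R] almost surely and
% E[\eta_{i,t} \mid \{b_i(t)\}_{i=1}^{N}, H_{t-1}] = 0, then for every \lambda \in \mathbb{R},
E\bigl[e^{\lambda \eta_{i,t}} \mid \{b_i(t)\}_{i=1}^{N}, H_{t-1}\bigr]
\;\le\; \exp\!\left(\frac{\lambda^{2}(2R)^{2}}{8}\right)
\;=\; \exp\!\left(\frac{\lambda^{2} R^{2}}{2}\right),
% which is exactly the R-sub-Gaussian condition stated above.
```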
We will also assume that ||b_i(t)|| ≤ 1, ||µ|| ≤ 1, and ∆_i(t) ≤ 1 for all i, t (the norms, unless otherwise indicated, are ℓ2-norms). These assumptions are required to make the regret bounds scale-free, and are standard in the literature on this problem. If instead ||µ|| ≤ c, ||b_i(t)|| ≤ c, and ∆_i(t) ≤ c, then our regret bounds would increase by a factor of c.

Remark 1. An alternative definition of regret that appears in the literature is

regret(t) = r_{a∗(t)}(t) − r_{a(t)}(t).

We can obtain the same regret bounds for this alternative definition of regret. The details are provided in the supplementary material in Appendix A.5.

2.2. Thompson Sampling algorithm

We use a Gaussian likelihood function and a Gaussian prior to design our version of the Thompson Sampling algorithm. More precisely, suppose that the likelihood of reward r_i(t) at time t, given context b_i(t) and parameter µ, were given by the pdf of the Gaussian distribution N(b_i(t)^T µ, v²). Here, v = R √( (24/ε) d ln(1/δ) ), where ε ∈ (0, 1) is a parameter of our algorithm. Let

B(t) = I_d + Σ_{τ=1}^{t−1} b_{a(τ)}(τ) b_{a(τ)}(τ)^T,
µ̂(t) = B(t)^{−1} ( Σ_{τ=1}^{t−1} b_{a(τ)}(τ) r_{a(τ)}(τ) ).

Then, if the prior for µ at time t is given by N(µ̂(t), v² B(t)^{−1}), it is easy to compute the posterior distribution at time t + 1 as N(µ̂(t + 1), v² B(t + 1)^{−1}). At every time step t, Thompson Sampling therefore simply generates a sample µ̃(t) from the distribution N(µ̂(t), v² B(t)^{−1}) and plays the arm that maximizes b_i(t)^T µ̃(t); the procedure is summarized in Algorithm 1.
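For completeness, this conjugacy computation can be sketched as follows (a standard Bayesian linear regression calculation, written in the notation above):

```latex
% With likelihood r_{a(t)}(t) ~ N(b_{a(t)}(t)^T \mu, v^2) and prior
% \mu ~ N(\hat\mu(t), v^2 B(t)^{-1}), the posterior density satisfies
\Pr(\tilde\mu \mid r_{a(t)}(t))
  \;\propto\; \exp\!\Big(-\tfrac{1}{2v^2}\big(r_{a(t)}(t)-b_{a(t)}(t)^T\tilde\mu\big)^2\Big)
              \exp\!\Big(-\tfrac{1}{2v^2}(\tilde\mu-\hat\mu(t))^T B(t)(\tilde\mu-\hat\mu(t))\Big)
  \;\propto\; \exp\!\Big(-\tfrac{1}{2v^2}(\tilde\mu-\hat\mu(t+1))^T B(t+1)(\tilde\mu-\hat\mu(t+1))\Big),
% where, completing the square,
B(t+1) = B(t) + b_{a(t)}(t)\,b_{a(t)}(t)^T,
\qquad
\hat\mu(t+1) = B(t+1)^{-1}\Big(B(t)\hat\mu(t) + b_{a(t)}(t)\,r_{a(t)}(t)\Big).
% The posterior is again of the form N(\hat\mu(t+1), v^2 B(t+1)^{-1}); since
% B(t)\hat\mu(t) = \sum_{\tau<t} b_{a(\tau)}(\tau) r_{a(\tau)}(\tau), this is exactly
% the (B, f, \hat\mu) update performed in Algorithm 1.
```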
Algorithm 1 Thompson Sampling for Contextual Bandits
  Set B = I_d, µ̂ = 0_d, f = 0_d.
  for all t = 1, 2, . . . , do
    Sample µ̃(t) from the distribution N(µ̂, v² B^{−1}).
    Play arm a(t) := arg max_i b_i(t)^T µ̃(t), and observe reward r_t.
    Update B = B + b_{a(t)}(t) b_{a(t)}(t)^T, f = f + b_{a(t)}(t) r_t, µ̂ = B^{−1} f.
  end for

Every step t of Algorithm 1 consists of generating a d-dimensional sample µ̃(t) from a multivariate Gaussian distribution, and solving the problem arg max_i b_i(t)^T µ̃(t). Therefore, even if the number of arms N is large (or infinite), the above algorithm is efficient as long as the problem arg max_i b_i(t)^T µ̃(t) is
efficiently solvable. This is the case, for example, when the set of arms at time t is given by a d-dimensional convex set (every vector in the convex set is a context vector, and thus corresponds to an arm).
2.3. Our Results

Theorem 1. For the stochastic contextual bandit problem with linear payoff functions, with probability 1 − δ, the total regret in time T for Thompson Sampling (Algorithm 1) is bounded by

O( (d²/ε) √(T^{1+ε}) ln(Td) ln(1/δ) ),

for any 0 < ε < 1, 0 < δ < 1. Here, ε is a parameter used by the Thompson Sampling algorithm.

Remark 2. The parameter ε can be chosen to be any constant in (0, 1). If T is known, one could choose ε = 1/ln T, to get an Õ(d²√T) regret bound.
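To see why this choice suffices, note the following short calculation (our addition):

```latex
% With \epsilon = 1/\ln T we have
T^{\epsilon} = e^{\epsilon \ln T} = e,
\qquad\text{so}\qquad
\sqrt{T^{1+\epsilon}} = \sqrt{e}\,\sqrt{T},
\qquad\text{and}\qquad
\frac{1}{\epsilon} = \ln T .
% Substituting into the bound of Theorem 1, the extra factors are a constant
% and logarithmic terms, all absorbed by the \tilde{O}(\cdot) notation,
% leaving \tilde{O}(d^{2}\sqrt{T}).
```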
Remark 3. Our regret bound in Theorem 1 does not depend on N, and is applicable to the case of infinite arms, with only notational changes required in the analysis.

In the main body of this paper, we will discuss the proof of the above result. Below, we state two additional results; their proofs require small changes to the proof of Theorem 1 and are provided in the supplementary material.

The first result is for the setting where each of the N arms is associated with a different d-dimensional parameter µ_i ∈ R^d, so that the mean reward for arm i at time t is b_i(t)^T µ_i. This setting is a direct generalization of the basic MAB problem to d dimensions. Thompson Sampling for this setting will maintain a separate posterior distribution for each arm i, which would be updated only at the time instances when i is played. And, at every time step t, instead of a single sample µ̃(t), N independent samples will have to be generated: µ̃_i(t) for each arm i. We prove the following regret bound for this setting.

Theorem 2. For the setting with N different parameters, with probability 1 − δ, the total regret in time T for Thompson Sampling is bounded by

O( d √( (N T^{1+ε})/ε ) ln N ln T ln(1/δ) ),

for any 0 < ε < 1, 0 < δ < 1.

The details of the algorithm for the N-parameter setting and the proof of Theorem 2 appear in the supplementary material in Appendix C.
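For illustration, a minimal sketch of the sampling and update steps of this N-parameter variant (the structure and names are ours and illustrative only):

```python
import numpy as np

class PerArmThompsonSampling:
    """Sketch of the N-parameter variant: one Gaussian posterior per arm,
    updated only when that arm is played."""

    def __init__(self, n_arms, d, v, seed=0):
        self.v = v
        self.B = [np.eye(d) for _ in range(n_arms)]
        self.f = [np.zeros(d) for _ in range(n_arms)]
        self.mu_hat = [np.zeros(d) for _ in range(n_arms)]
        self.rng = np.random.default_rng(seed)

    def select_arm(self, contexts):
        # Independent samples mu_tilde_i ~ N(mu_hat_i, v^2 B_i^{-1}) for each arm.
        scores = [
            contexts[i] @ self.rng.multivariate_normal(
                self.mu_hat[i], self.v ** 2 * np.linalg.inv(self.B[i]))
            for i in range(len(self.B))
        ]
        return int(np.argmax(scores))

    def update(self, arm, context, reward):
        # Only the played arm's posterior is updated.
        self.B[arm] += np.outer(context, context)
        self.f[arm] += reward * context
        self.mu_hat[arm] = np.linalg.solve(self.B[arm], self.f[arm])
```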
Note that unlike Theorem 1, the regret bound in Theorem 2 has a dependence on N, which is expected because Theorem 2 deals with a setting where there are N different parameters to learn. However, the bound in Theorem 2 has a better dependence on d. This improvement results from the independence of the θ_i(t) = b_i(t)^T µ̃_i(t) in the algorithm for this setting. On the other hand, in Algorithm 1, used for the single parameter setting of Theorem 1, a single µ̃(t) is generated, and so the θ_i(t) = b_i(t)^T µ̃(t) are not independent. This motivates us to consider a modification of Algorithm 1 for the single parameter setting, in which the θ_i(t) are generated independently, each with the marginal distribution of b_i(t)^T µ̃(t). The arm with the highest value of θ_i(t) is played at time t. Although this modified algorithm could be inefficient compared to Algorithm 1 if N is large (say, exponential in d), the better dependence on d in the regret bound could be useful if d is large.
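For illustration, a minimal sketch of the modified sampling step, in which each θ_i(t) is drawn independently from the marginal N(b_i(t)^T µ̂(t), v² s_{t,i}²) (names are illustrative only):

```python
import numpy as np

def select_arm_independent_thetas(contexts, mu_hat, B, v, rng):
    """For each arm i, draw theta_i independently from the marginal
    N(b_i^T mu_hat, v^2 * b_i^T B^{-1} b_i) and play the arg max."""
    B_inv = np.linalg.inv(B)
    means = contexts @ mu_hat                                   # b_i^T mu_hat
    s2 = np.einsum("ij,jk,ik->i", contexts, B_inv, contexts)    # s_{t,i}^2
    thetas = rng.normal(means, v * np.sqrt(s2))
    return int(np.argmax(thetas))
```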
Theorem 3. For the modified algorithm in the single parameter setting, with probability 1 − δ, the total regret in time T is bounded by

O( d √( (T^{1+ε} ln N)/ε ) ln T ln(1/δ) ),

for any 0 < ε < 1, 0 < δ < 1.

The details of the modified algorithm and the proof of the above theorem appear in the supplementary material in Appendix B.

2.4. Related Work

The contextual bandit problem with linear payoffs is a widely studied problem in statistics and machine learning, often under different names, as mentioned by Chu et al. (2011): bandit problems with covariates (Woodroofe, 1979; Sarkar, 1991), associative reinforcement learning (Kaelbling, 1994), associative bandit problems (Auer, 2002; Strehl et al., 2006), bandit problems with expert advice (Auer et al., 2002), and linear bandits (Dani et al., 2008; Abbasi-Yadkori et al., 2011; Bubeck et al., 2012). The name contextual bandits was coined in Langford & Zhang (2007).

A lower bound of Ω(d√T) for this problem was given by Dani et al. (2008), when the number of arms is allowed to be infinite. In particular, they prove their lower bound using an example where the set of arms corresponds to all vectors in the intersection of a d-dimensional sphere and a cube. They also provide an upper bound of Õ(d√T), although their setting is slightly restrictive in the sense that the context vector for every arm is fixed in advance and is not allowed to change with time. Abbasi-Yadkori et al. (2011) analyze a UCB-style algorithm and provide a regret upper bound of O( d log(T)√T + √(dT log(T/δ)) ). Apart from the dependence on ε, our bounds are essentially away by a factor of d from these bounds.

For finite N, Chu et al. (2011) show a lower bound
of Ω(√(Td)) for d² ≤ T. Auer (2002) and Chu et al. (2011) analyze SupLinUCB, a complicated algorithm using UCB as a subroutine, for this problem. Chu et al. (2011) achieve a regret bound of O( √( Td ln³(NT ln(T)/δ) ) ) with probability at least 1 − δ (Auer (2002) proves similar results). This regret bound is not applicable to the case of infinite arms, and assumes that the context vectors are generated by an oblivious adversary. Also, this regret bound would give O(d²√T) regret if N is exponential in d. The state-of-the-art bounds for the linear bandits problem in the case of finite N are given by Bubeck et al. (2012). They provide an algorithm based on exponential weights, with regret of order √(dT log N) for any finite set of N actions. However, the exponential weights based algorithms are not efficient if N is large (sampling complexity of O(N) in every step). Also, their setting is slightly different from ours: the set of arms and the associated b_i vectors are non-adaptive and fixed in advance, and they consider a non-stochastic (adversarial) bandit setting where the reward at time t for arm i is b_i^T µ_t, with µ_t chosen by an adversary.

Very recent work of Russo & Roy (2013) provides near-optimal bounds on the Bayesian regret in many general settings. This result is incomparable to ours because of the different notion of regret used.

While the regret bounds provided in this paper do not match or better the best available regret bounds for the extensively studied problem of linear contextual bandits, our results demonstrate that the natural and efficient heuristic of Thompson Sampling can achieve theoretical bounds that are close to the best bounds. The main contribution of this paper is to provide new tools for the analysis of the Thompson Sampling algorithm for contextual bandits, which, despite being popular and empirically attractive, has eluded theoretical analysis. We believe the techniques used in this paper will provide useful insights into the workings of this Bayesian algorithm, and may be useful for further improvements and extensions.

3. Regret Analysis: Proof of Theorem 1

3.1. Challenges and proof outline

The contextual version of the multi-armed bandit problem presents new challenges for the analysis of the TS algorithm, and the techniques used so far for analyzing the basic multi-armed bandit problem by Agrawal & Goyal (2012); Kaufmann et al. (2012) do not seem directly applicable. Let us describe some of these difficulties and our novel ideas to resolve them.

In the basic MAB problem there are N arms, with mean reward µ_i ∈ R for arm i, and the regret for playing a suboptimal arm i is µ_{a∗} − µ_i, where a∗ is the arm with the highest mean. Let us compare this to a 1-dimensional contextual MAB problem, where arm i is associated with a parameter µ_i ∈ R, but in addition, at every time t, it is associated with a context b_i(t) ∈ R, so that the mean reward is b_i(t)µ_i. The best arm a∗(t) at time t is the arm with the highest mean at time t, and the regret for playing arm i is b_{a∗(t)}(t)µ_{a∗(t)} − b_i(t)µ_i.

In general, the basis of regret analysis for stochastic MAB is to prove that the variances of the empirical estimates for all arms decrease fast enough, so that the regret incurred until the variances become small enough is itself small. In the basic MAB, the variance of the empirical mean is inversely proportional to the number of plays k_i(t) of arm i at time t. Thus, every time the suboptimal arm i is played, we know that even though a regret of µ_{a∗} − µ_i ≤ 1 is incurred, there is also an improvement of exactly 1 in the number of plays of that arm, and hence a corresponding decrease in the variance. The techniques for analyzing the basic MAB rely on this observation to precisely quantify the exploration-exploitation tradeoff. On the other hand, the variance of the empirical mean for the contextual case is given by the inverse of B_i(t) = Σ_{τ≤t: a(τ)=i} b_i(τ)². When a suboptimal arm i is played, if b_i(t) is small, the regret b_{a∗(t)}(t)µ_{a∗(t)} − b_i(t)µ_i could be much higher than the improvement b_i(t)² in B_i(t).

In our proof, we overcome this difficulty by dividing the arms into two groups at any time: saturated and unsaturated arms, based on whether the standard deviation of the estimates for an arm is smaller or larger compared to the standard deviation for the optimal arm. The optimal arm is included in the group of unsaturated arms. We show that for the unsaturated arms, the regret on playing the arm can be bounded by a factor of the standard deviation, which improves every time the arm is played. This allows us to bound the total regret due to unsaturated arms. For the saturated arms, the standard deviation is small, or in other words, the estimates of the means constructed so far are quite accurate in the direction of the current contexts of these arms, so that the algorithm is able to distinguish between them and the optimal arm. We utilize this observation to show that the probability of playing such arms at any step is bounded by a function of the probability of playing the unsaturated arms.

Below is a more technical outline of the proof of Theorem 1. At any time step t, we divide the arms into two groups:
• saturated arms, defined as those with g(T) s_{t,i} < ℓ(T) s_{t,a∗(t)},
• unsaturated arms, defined as those with g(T) s_{t,i} ≥ ℓ(T) s_{t,a∗(t)},

where s_{t,i} = √( b_i(t)^T B(t)^{−1} b_i(t) ) and g(T), ℓ(T) (with g(T) > ℓ(T)) are constants (functions of T, d, δ) defined later. Note that s_{t,i} is the standard deviation of the estimate b_i(t)^T µ̂(t), and v·s_{t,i} is the standard deviation of the random variable b_i(t)^T µ̃(t).

We use concentration bounds for µ̃(t) and µ̂(t) to bound the regret at any time t by g(T)(s_{t,a∗(t)} + s_{t,a(t)}). Now, if an unsaturated arm is played at time t, then using the definition of unsaturated arms, the regret is at most (2g(T)²/ℓ(T)) s_{t,a(t)}. This is useful because of the inequality Σ_t s_{t,a(t)} = O(√(Td ln T)) (derived along the lines of Auer (2002)), which allows us to bound the total regret due to unsaturated arms.

For saturated arms, we prove that the probability of playing a saturated arm at any time t is at most 1/p times the probability of playing an unsaturated arm, up to an additive 1/(pT²) term, where p = 1/(4e√(πT^ε)). More precisely, we define F_{t−1} as the union of the history H_{t−1} and the contexts b_i(t), i = 1, . . . , N at time t, and prove that for "most" (in a high probability sense) F_{t−1},

Pr( a(t) is a saturated arm | F_{t−1} ) ≤ (1/p) · Pr( a(t) is an unsaturated arm | F_{t−1} ) + 1/(pT²).

We use these observations to establish that (X_t; t ≥ 0), where

X_t ≃ regret(t) − (g(T)/p) · I(a(t) is unsaturated) · s_{t,a∗(t)} − (2g(T)²/ℓ(T)) · s_{t,a(t)} − 2g(T)/(pT²),

is a super-martingale difference process adapted to the filtration F_t. Then, using the Azuma-Hoeffding inequality for super-martingales, along with the inequality Σ_t s_{t,a(t)} = O(√(Td ln T)), we will obtain the desired high probability regret bound.
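For reference, the Azuma-Hoeffding inequality for super-martingales, in its standard form (the precise statement used in the proof is Lemma 6 in Appendix A.2), reads as follows:

```latex
% If (Y_t; t \ge 0) is a super-martingale with respect to a filtration
% (\mathcal{F}_t), i.e. E[Y_t - Y_{t-1} \mid \mathcal{F}_{t-1}] \le 0,
% and |Y_t - Y_{t-1}| \le c_t almost surely, then for any a > 0,
\Pr\big(Y_T - Y_0 \ge a\big) \;\le\; \exp\!\left(\frac{-a^2}{2\sum_{t=1}^{T} c_t^2}\right).
% Equivalently, with probability at least 1-\delta,
Y_T - Y_0 \;\le\; \sqrt{2\ln\tfrac{1}{\delta}\,\textstyle\sum_{t=1}^{T} c_t^2}.
```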
3.2. Formal proof

For quick reference, the notations introduced below also appear in a table of notations at the beginning of the supplementary material.

Definition 1. For all i, define θ_i(t) = b_i(t)^T µ̃(t), and s_{t,i} = √( b_i(t)^T B(t)^{−1} b_i(t) ). By definition of µ̃(t), the marginal distribution of each θ_i(t) is Gaussian with mean b_i(t)^T µ̂(t) and standard deviation v·s_{t,i}. Also, s_{t,i} is the standard deviation of the estimate b_i(t)^T µ̂(t).

Definition 2. Recall that ∆_i(t) = b_{a∗(t)}(t)^T µ − b_i(t)^T µ, the difference between the mean reward of the optimal arm and arm i at time t.

Definition 3. Define ℓ(T) = R √( d ln(T³) ln(1/δ) ) + 1, v = R √( (24/ε) d ln(1/δ) ), and g(T) = √( 4d ln(Td) ) · v + ℓ(T).

Definition 4. Define E^µ(t) and E^θ(t) as the events that b_i(t)^T µ̂(t) and θ_i(t) are concentrated around their respective means. More precisely, define E^µ(t) as the event that

∀i : |b_i(t)^T µ̂(t) − b_i(t)^T µ| ≤ ℓ(T) s_{t,i}.

Define E^θ(t) as the event that

∀i : |θ_i(t) − b_i(t)^T µ̂(t)| ≤ √( 4d ln(Td) ) · v · s_{t,i}.

Definition 5. An arm i is called saturated at time t if g(T) s_{t,i} < ℓ(T) s_{t,a∗(t)}, and unsaturated otherwise. Let C(t) denote the set of saturated arms at time t. Note that the optimal arm is always unsaturated at time t, i.e., a∗(t) ∉ C(t). An arm may keep shifting from saturated to unsaturated and vice versa over time.

Definition 6. Define the filtration F_{t−1} as the union of the history until time t − 1 and the contexts at time t, i.e., F_{t−1} = {H_{t−1}, b_i(t), i = 1, . . . , N}.

By definition, F_1 ⊆ F_2 ⊆ · · · ⊆ F_{T−1}. Observe that the following quantities are determined by the history H_{t−1} and the contexts b_i(t) at time t, and hence are included in F_{t−1}:
• µ̂(t), B(t),
• s_{t,i}, for all i,
• the identity of the optimal arm a∗(t) and the set of saturated arms C(t),
• whether E^µ(t) is true or not,
• the distribution N(µ̂(t), v² B(t)^{−1}) of µ̃(t), and hence the joint distribution of θ_i(t) = b_i(t)^T µ̃(t), i = 1, . . . , N.

Lemma 1. For all t, 0 < δ < 1, Pr(E^µ(t)) ≥ 1 − δ/T². And, for all possible filtrations F_{t−1}, Pr(E^θ(t) | F_{t−1}) ≥ 1 − 1/T².

Proof. The complete proof of this lemma appears in Appendix A.3. The probability bound for E^µ(t) will be proven using a concentration inequality given by Abbasi-Yadkori et al. (2011), stated as Lemma 7 in Appendix A.2; the R-sub-Gaussian assumption on rewards is utilized here. The probability bound for E^θ(t) will be proven using a concentration inequality for Gaussian random variables from Abramowitz & Stegun (1964), stated as Lemma 5 in Appendix A.2.
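For reference, Gaussian tail bounds of the following standard form (see, e.g., Abramowitz & Stegun, 1964) drive both the concentration used for E^θ(t) (the upper bound) and the anti-concentration used for the optimal arm in Lemma 2 below (the lower bound); here Z is a standard Gaussian random variable:

```latex
% Standard Gaussian tail bounds: for any z > 0,
\frac{1}{\sqrt{2\pi}}\,\frac{z}{z^{2}+1}\;e^{-z^{2}/2}
\;\le\; \Pr(Z > z) \;\le\;
\frac{1}{\sqrt{2\pi}}\,\frac{1}{z}\;e^{-z^{2}/2}.
```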
The next lemma lower bounds the probability that θ_{a∗(t)}(t) = b_{a∗(t)}(t)^T µ̃(t) for the optimal arm at time t will exceed its mean reward b_{a∗(t)}(t)^T µ plus ℓ(T) s_{t,a∗(t)}.
Lemma 2. For any filtration F_{t−1} such that E^µ(t) is true,

Pr( θ_{a∗(t)}(t) > b_{a∗(t)}(t)^T µ + ℓ(T) s_{t,a∗(t)} | F_{t−1} ) ≥ 1/(4e√(πT^ε)).

Lemma 3. For any filtration F_{t−1} such that E^µ(t) is true,

Pr( a(t) ∈ C(t) | F_{t−1} ) ≤ (1/p) Pr( a(t) ∉ C(t) | F_{t−1} ) + 1/(pT²),

where p = 1/(4e√(πT^ε)).

Proof. The algorithm chooses the arm with the highest value of θ_i(t) = b_i(t)^T µ̃(t) to be played at time t. Therefore, if θ_{a∗(t)}(t) is greater than θ_j(t) for all saturated arms, i.e., θ_{a∗(t)}(t) > θ_j(t), ∀j ∈ C(t), then one of the unsaturated arms (which include the optimal arm and other suboptimal unsaturated arms) must be played. Therefore,

Pr( a(t) ∉ C(t) | F_{t−1} ) ≥ Pr( θ_{a∗(t)}(t) > θ_j(t), ∀j ∈ C(t) | F_{t−1} ).   (1)

By definition, for all saturated arms, i.e., for all j ∈ C(t), g(T) s_{t,j} < ℓ(T) s_{t,a∗(t)}. Also, if both the events E^µ(t) and E^θ(t) are true, then, by the definitions of these events, for all j ∈ C(t), θ_j(t) ≤ b_j(t)^T µ + g(T) s_{t,j}. Therefore, given an F_{t−1} such that E^µ(t) is true, either E^θ(t) is false, or else for all j ∈ C(t),

θ_j(t) ≤ b_j(t)^T µ + g(T) s_{t,j} ≤ b_{a∗(t)}(t)^T µ + ℓ(T) s_{t,a∗(t)}.

Hence, for any F_{t−1} such that E^µ(t) is true,

Pr( θ_{a∗(t)}(t) > θ_j(t), ∀j ∈ C(t) | F_{t−1} )
  ≥ Pr( θ_{a∗(t)}(t) > b_{a∗(t)}(t)^T µ + ℓ(T) s_{t,a∗(t)} | F_{t−1} ) − Pr( ¬E^θ(t) | F_{t−1} )
  ≥ p − 1/T².

The last inequality uses Lemma 2 and Lemma 1. Substituting in Equation (1), this gives

Pr( a(t) ∉ C(t) | F_{t−1} ) + 1/T² ≥ p,

which implies

Pr( a(t) ∈ C(t) | F_{t−1} ) ≤ (1/p) ( Pr( a(t) ∉ C(t) | F_{t−1} ) + 1/T² ).

Next, define regret'(t) = regret(t) · I(E^µ(t)), and define the process

X_t = regret'(t) − (g(T)/p) · I(a(t) ∉ C(t)) · s_{t,a∗(t)} − (2g(T)²/ℓ(T)) · s_{t,a(t)} − 2g(T)/(pT²),

with Y_0 = 0 and Y_t = Σ_{w=1}^t X_w.

Lemma 4. (Y_t; t = 0, . . . , T) is a super-martingale process with respect to the filtration F_t.

Proof. See Definition 9 in Appendix A.2 for the definition of super-martingales. We need to prove that for all t ∈ [1, T], and any F_{t−1}, E[Y_t − Y_{t−1} | F_{t−1}] ≤ 0, i.e.,

E[ regret'(t) | F_{t−1} ] ≤ (g(T)/p) Pr( a(t) ∉ C(t) | F_{t−1} ) s_{t,a∗(t)} + (2g(T)²/ℓ(T)) E[ s_{t,a(t)} | F_{t−1} ] + 2g(T)/(pT²).

If F_{t−1} is such that E^µ(t) is not true, then regret'(t) = regret(t) · I(E^µ(t)) = 0, and the above inequality holds trivially. So, we consider F_{t−1} such that E^µ(t) holds. We observe that if the events E^µ(t) and E^θ(t) are true, then ∆_{a(t)}(t) ≤ g(T)(s_{t,a(t)} + s_{t,a∗(t)}). This is because if an arm i is played at time t, then it must be true that θ_i(t) ≥ θ_{a∗(t)}(t). And, if E^θ(t) and E^µ(t) are true, then

b_i(t)^T µ ≥ θ_i(t) − g(T) s_{t,i} ≥ θ_{a∗(t)}(t) − g(T) s_{t,i} ≥ b_{a∗(t)}(t)^T µ − g(T) s_{t,a∗(t)} − g(T) s_{t,i}.

Therefore, given a filtration F_{t−1} such that E^µ(t) is true, either ∆_{a(t)}(t) ≤ g(T)(s_{t,a(t)} + s_{t,a∗(t)}) or E^θ(t)
is false. And, hence,

E[ regret'(t) | F_{t−1} ]
  = E[ ∆_{a(t)}(t) | F_{t−1} ]
  ≤ E[ g(T)(s_{t,a∗(t)} + s_{t,a(t)}) | F_{t−1} ] + Pr( ¬E^θ(t) | F_{t−1} )
  = g(T) E[ s_{t,a∗(t)} I(a(t) ∈ C(t)) | F_{t−1} ] + g(T) E[ s_{t,a∗(t)} I(a(t) ∉ C(t)) | F_{t−1} ]
      + g(T) E[ s_{t,a(t)} | F_{t−1} ] + Pr( ¬E^θ(t) | F_{t−1} )
  ≤ g(T) s_{t,a∗(t)} Pr( a(t) ∈ C(t) | F_{t−1} ) + g(T) E[ (g(T)/ℓ(T)) s_{t,a(t)} I(a(t) ∉ C(t)) | F_{t−1} ]
      + g(T) E[ s_{t,a(t)} | F_{t−1} ] + 1/T²
  ≤ g(T) s_{t,a∗(t)} · (1/p) Pr( a(t) ∉ C(t) | F_{t−1} ) + g(T)/(pT²)
      + (2g(T)²/ℓ(T)) E[ s_{t,a(t)} | F_{t−1} ] + 1/T²
  ≤ g(T) s_{t,a∗(t)} · (1/p) Pr( a(t) ∉ C(t) | F_{t−1} ) + (2g(T)²/ℓ(T)) E[ s_{t,a(t)} | F_{t−1} ] + 2g(T)/(pT²).

In the first inequality we used that ∆_i(t) ≤ 1 for all i. The second inequality used the definition of unsaturated arms to apply s_{t,a∗(t)} ≤ (g(T)/ℓ(T)) s_{t,a(t)} when a(t) ∉ C(t), and Lemma 1 to apply Pr(¬E^θ(t) | F_{t−1}) ≤ 1/T². The third inequality used Lemma 3, and also the observation that 0 ≤ s_{t,a∗(t)} ≤ ||b_{a∗(t)}(t)|| ≤ 1.

Now, we are ready to prove Theorem 1.

Proof of Theorem 1. We observe that the absolute value of each of the four terms in the definition of X_t is bounded by (2/p) g(T)²/ℓ(T); therefore the super-martingale Y_t has bounded differences |Y_t − Y_{t−1}| ≤ (8/p) g(T)²/ℓ(T) for all t ≥ 1. Thus, we can apply the Azuma-Hoeffding inequality (see Lemma 6 in Appendix A.2) to obtain that, with probability 1 − δ/2,

Σ_{t=1}^T regret'(t)
  ≤ Σ_{t=1}^T (g(T)/p) I(a(t) ∉ C(t)) s_{t,a∗(t)} + 2g(T)/(pT)
      + (2g(T)²/ℓ(T)) Σ_{t=1}^T s_{t,a(t)} + (8/p)(g(T)²/ℓ(T)) √( 2T ln(2/δ) )
  ≤ Σ_{t=1}^T (g(T)²/(ℓ(T) p)) I(a(t) ∉ C(t)) s_{t,a(t)} + 2g(T)/(pT)
      + (2g(T)²/ℓ(T)) Σ_{t=1}^T s_{t,a(t)} + (8/p)(g(T)²/ℓ(T)) √( 2T ln(2/δ) )
  ≤ (g(T)²/ℓ(T)) (3/p) Σ_{t=1}^T s_{t,a(t)} + 2g(T)/(pT) + (8/p)(g(T)²/ℓ(T)) √( 2T ln(2/δ) ).

The second inequality used the observation that if an unsaturated arm is played, i.e., a(t) ∉ C(t), then g(T) s_{t,a(t)} ≥ ℓ(T) s_{t,a∗(t)}.

Now, we can use Σ_{t=1}^T s_{t,a(t)} ≤ 5√(dT ln T), which can be derived along the lines of Lemma 3 of Chu et al. (2011), using Lemma 11 of Auer (2002) (see Appendix A.5 for details). Also, recalling the definitions of p, ℓ(T), and g(T) (see the table of notations at the beginning of the supplementary material), and substituting in the above, we get

Σ_{t=1}^T regret'(t) = O( (d²/ε) √(T^{1+ε}) ln(1/δ) ln(Td) ).

Also, because E^µ(t) holds for all t with probability at least 1 − δ/2 (see Lemma 1), regret'(t) = regret(t) for all t with probability at least 1 − δ/2. Hence, with probability 1 − δ,

R(T) = Σ_{t=1}^T regret(t) = Σ_{t=1}^T regret'(t) = O( (d²/ε) √(T^{1+ε}) ln(1/δ) ln(Td) ).

The proof for the alternate definition of regret mentioned in Remark 1 is provided in Appendix A.5.
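For intuition, the inequality Σ_t s_{t,a(t)} = O(√(dT ln T)) used above follows from a standard "elliptical potential" argument; the sketch below is ours (the exact constant 5 is derived in Appendix A.5):

```latex
% Since B(t+1) = B(t) + b_{a(t)}(t) b_{a(t)}(t)^T and
% s_{t,a(t)}^2 = b_{a(t)}(t)^T B(t)^{-1} b_{a(t)}(t), the matrix determinant lemma gives
\det B(t+1) = \det B(t)\,\bigl(1 + s_{t,a(t)}^{2}\bigr)
\;\Longrightarrow\;
\sum_{t=1}^{T} \ln\!\bigl(1 + s_{t,a(t)}^{2}\bigr)
  = \ln\frac{\det B(T+1)}{\det B(1)}
  \le d\,\ln\!\Bigl(1 + \frac{T}{d}\Bigr),
% using B(1) = I_d, \|b_i(t)\| \le 1, and the AM-GM inequality on the eigenvalues of B(T+1).
% Since s_{t,a(t)}^2 \le 1, we have s_{t,a(t)}^2 \le 2\ln(1+s_{t,a(t)}^2), so by Cauchy-Schwarz,
\sum_{t=1}^{T} s_{t,a(t)}
  \le \sqrt{T \sum_{t=1}^{T} s_{t,a(t)}^{2}}
  \le \sqrt{2\,T\,d\,\ln\!\Bigl(1+\frac{T}{d}\Bigr)}
  = O\!\left(\sqrt{dT\ln T}\right).
```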
4. Conclusions

Detailed concluding remarks appear in the supplementary material, Section D.

References

Abbasi-Yadkori, Yasin, Pál, Dávid, and Szepesvári, Csaba. Improved Algorithms for Linear Stochastic Bandits. In NIPS, pp. 2312–2320, 2011.

Abramowitz, Milton and Stegun, Irene A. Handbook of Mathematical Functions with Formulas, Graphs, and Mathematical Tables. Dover, New York, 1964.

Agrawal, Shipra and Goyal, Navin. Analysis of Thompson Sampling for the Multi-armed Bandit Problem. In COLT, 2012.

Agrawal, Shipra and Goyal, Navin. Further Optimal Regret Bounds for Thompson Sampling. In AISTATS, 2013.

Auer, Peter. Using Confidence Bounds for Exploitation-Exploration Trade-offs. Journal of Machine Learning Research, 3:397–422, 2002.

Auer, Peter, Cesa-Bianchi, Nicolò, Freund, Yoav, and Schapire, Robert E. The Nonstochastic Multiarmed Bandit Problem. SIAM J. Comput., 32(1):48–77, 2002.

Bubeck, Sébastien, Cesa-Bianchi, Nicolò, and Kakade, Sham M. Towards minimax policies for online linear optimization with bandit feedback. In Proceedings of the 25th Conference on Learning Theory (COLT), pp. 1–14, 2012.
Chapelle, Olivier and Li, Lihong. An Empirical Evaluation of Thompson Sampling. In NIPS, pp. 2249–2257, 2011.

Chapelle, Olivier and Li, Lihong. Open Problem: Regret Bounds for Thompson Sampling. In COLT, 2012.

Chu, Wei, Li, Lihong, Reyzin, Lev, and Schapire, Robert E. Contextual Bandits with Linear Payoff Functions. Journal of Machine Learning Research - Proceedings Track, 15:208–214, 2011.

Dani, Varsha, Hayes, Thomas P., and Kakade, Sham M. Stochastic Linear Optimization under Bandit Feedback. In COLT, pp. 355–366, 2008.

Filippi, Sarah, Cappé, Olivier, Garivier, Aurélien, and Szepesvári, Csaba. Parametric Bandits: The Generalized Linear Case. In NIPS, pp. 586–594, 2010.

Graepel, Thore, Candela, Joaquin Quiñonero, Borchert, Thomas, and Herbrich, Ralf. Web-Scale Bayesian Click-Through Rate Prediction for Sponsored Search Advertising in Microsoft's Bing Search Engine. In ICML, pp. 13–20, 2010.

Granmo, O.-C. Solving Two-Armed Bernoulli Bandit Problems Using a Bayesian Learning Automaton. International Journal of Intelligent Computing and Cybernetics (IJICC), 3(2):207–234, 2010.

Kaelbling, Leslie Pack. Associative Reinforcement Learning: Functions in k-DNF. Machine Learning, 15(3):279–298, 1994.

Kaufmann, Emilie, Korda, Nathaniel, and Munos, Rémi. Thompson Sampling: An Optimal Finite Time Analysis. In ALT, 2012.

Lai, T. L. and Robbins, H. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6:4–22, 1985.

Langford, John and Zhang, Tong. The Epoch-Greedy Algorithm for Multi-armed Bandits with Side Information. In NIPS, 2007.

May, Benedict C. and Leslie, David S. Simulation studies in optimistic Bayesian sampling in contextual-bandit problems. Technical Report 11:02, Statistics Group, Department of Mathematics, University of Bristol, 2011.

May, Benedict C., Korda, Nathan, Lee, Anthony, and Leslie, David S. Optimistic Bayesian sampling in contextual-bandit problems. Technical Report 11:01, Statistics Group, Department of Mathematics, University of Bristol, 2011.

Ortega, Pedro A. and Braun, Daniel A. A Minimum Relative Entropy Principle for Learning and Acting. Journal of Artificial Intelligence Research, 38:475–511, 2010.

Russo, Daniel and Roy, Benjamin Van. Learning to optimize via posterior sampling. CoRR, abs/1301.2609, 2013.

Sarkar, Jyotirmoy. One-armed bandit problems with covariates. The Annals of Statistics, 19(4):1978–2002, 1991.

Scott, S. A modern Bayesian look at the multi-armed bandit. Applied Stochastic Models in Business and Industry, 26:639–658, 2010.

Strehl, Alexander L., Mesterharm, Chris, Littman, Michael L., and Hirsh, Haym. Experience-efficient learning in associative bandit problems. In ICML, pp. 889–896, 2006.

Strens, Malcolm J. A. A Bayesian Framework for Reinforcement Learning. In ICML, pp. 943–950, 2000.

Thompson, William R. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3-4):285–294, 1933.

Woodroofe, Michael. A one-armed bandit problem with a concomitant variable. Journal of the American Statistical Association, 74(368):799–806, 1979.

Wyatt, Jeremy. Exploration and Inference in Learning from Reinforcement. PhD thesis, Department of Artificial Intelligence, University of Edinburgh, 1997.