Thompson Sampling for Contextual Bandits with Linear Payoffs

Shipra Agrawal and Navin Goyal
Abstract

Thompson Sampling is one of the oldest heuristics for multi-armed bandit problems. It is a randomized algorithm based on Bayesian ideas, and has recently generated significant interest after several studies demonstrated it to have better empirical performance compared to the state-of-the-art methods. However, many questions regarding its theoretical performance remained open. In this paper, we design and analyze a generalization of the Thompson Sampling algorithm for the stochastic contextual multi-armed bandit problem with linear payoff functions, when the contexts are provided by an adaptive adversary. This is among the most important and widely studied versions of the contextual bandits problem. We prove a high probability regret bound of Õ( (d²/ε) √(T^{1+ε}) ) in time T for any 0 < ε < 1, where d is the dimension of each context vector and ε is a parameter used by the algorithm. Our results provide the first theoretical guarantees for the contextual version of Thompson Sampling, and are close to the lower bound of Ω(d√T) for this problem. This essentially solves a COLT open problem of Chapelle and Li [COLT 2012].

Proceedings of the 30th International Conference on Machine Learning, Atlanta, Georgia, USA, 2013. JMLR: W&CP volume 28. Copyright 2013 by the author(s).

1. Introduction

Multi-armed bandit (MAB) problems model the exploration/exploitation trade-off inherent in many sequential decision problems. There are many versions of multi-armed bandit problems; a particularly useful version is the contextual multi-armed bandit problem. In this problem, in each of T rounds, a learner is presented with the choice of taking one out of N actions, referred to as N arms. Before making the choice of which arm to play, the learner sees d-dimensional feature vectors b_i, referred to as "context", associated with each arm i. The learner uses these feature vectors, along with the feature vectors and rewards of the arms played by her in the past, to make the choice of the arm to play in the current round. Over time, the learner's aim is to gather enough information about how the feature vectors and rewards relate to each other, so that she can predict, with some certainty, which arm is likely to give the best reward by looking at the feature vectors. The learner competes with a class of predictors, in which each predictor takes in the feature vectors and predicts which arm will give the best reward. If the learner can guarantee to do nearly as well as the predictions of the best predictor in hindsight (i.e., have low regret), then the learner is said to successfully compete with that class.

In the contextual bandits setting with linear payoff functions, the learner competes with the class of all "linear" predictors on the feature vectors. That is, a predictor is defined by a d-dimensional parameter µ ∈ R^d, and the predictor ranks the arms according to b_i^T µ. We consider the stochastic contextual bandit problem under the linear realizability assumption, that is, we assume that there is an unknown underlying parameter µ ∈ R^d such that the expected reward for each arm i, given context b_i, is b_i^T µ. Under this realizability assumption, the linear predictor corresponding to µ is in fact the best predictor, and the learner's aim is to learn this underlying parameter. This realizability assumption is standard in the existing literature on contextual
multi-armed bandits, e.g. (Auer, 2002; Filippi et al., 2010; Chu et al., 2011; Abbasi-Yadkori et al., 2011).

Thompson Sampling (TS) is one of the earliest heuristics for multi-armed bandit problems. The first version of this Bayesian heuristic is around 80 years old, dating to Thompson (1933). Since then, it has been rediscovered numerous times independently in the context of reinforcement learning, e.g., in Wyatt (1997); Ortega & Braun (2010); Strens (2000). It is a member of the family of randomized probability matching algorithms. The basic idea is to assume a simple prior distribution on the underlying parameters of the reward distribution of every arm, and at every time step, play an arm according to its posterior probability of being the best arm. The general structure of TS for the contextual bandits problem involves the following elements:

1. a set Θ of parameters µ̃;
2. a prior distribution P(µ̃) on these parameters;
3. past observations D consisting of (context b, reward r) for the past time steps;
4. a likelihood function P(r | b, µ̃), which gives the probability of reward given a context b and a parameter µ̃;
5. a posterior distribution P(µ̃ | D) ∝ P(D | µ̃) P(µ̃), where P(D | µ̃) is the likelihood function.

In each round, TS plays an arm according to its posterior probability of having the best parameter. A simple way to achieve this is to produce a sample of the parameter for each arm, using the posterior distributions, and play the arm that produces the best sample.
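As a concrete illustration of this general structure, the following is a minimal sketch for the simplest Bernoulli-reward bandit with a conjugate Beta prior; the environment, parameter choices, and variable names here are illustrative only and are not taken from the paper.

```python
import numpy as np

# Minimal sketch of the general Thompson Sampling structure for a
# Bernoulli bandit with a Beta(1, 1) prior on each arm's mean reward.
# The Beta prior is conjugate to the Bernoulli likelihood, so the
# posterior after s successes and f failures is simply Beta(1+s, 1+f).

rng = np.random.default_rng(0)

def thompson_sampling_bernoulli(true_means, T):
    n_arms = len(true_means)
    successes = np.zeros(n_arms)
    failures = np.zeros(n_arms)
    total_reward = 0.0
    for _ in range(T):
        # Sample a parameter for each arm from its posterior ...
        samples = rng.beta(1.0 + successes, 1.0 + failures)
        # ... and play the arm whose sample is best.
        arm = int(np.argmax(samples))
        reward = float(rng.random() < true_means[arm])
        successes[arm] += reward
        failures[arm] += 1.0 - reward
        total_reward += reward
    return total_reward

print(thompson_sampling_bernoulli([0.3, 0.5, 0.7], T=10_000))
```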
In this paper, we design and analyze a natural generalization of Thompson Sampling (TS) for contextual bandits; this generalization fits the above general structure, and uses a Gaussian prior and a Gaussian likelihood function. We emphasize that although TS is a Bayesian approach, the description of the algorithm and our analysis apply to the prior-free stochastic MAB model, and our regret bounds will hold irrespective of whether or not the actual reward distribution matches the Gaussian likelihood function used to derive this Bayesian heuristic. Thus, our bounds for the TS algorithm are directly comparable to the UCB family of algorithms, which form a frequentist approach to the same problem. One could interpret the priors used by TS as a way of capturing the current knowledge about the arms.

Recently, TS has attracted considerable attention. Several studies (e.g., Granmo (2010); Scott (2010); Graepel et al. (2010); Chapelle & Li (2011); May & Leslie (2011); Kaufmann et al. (2012)) have empirically demonstrated the efficacy of TS: Scott (2010) provides a detailed discussion of probability matching techniques in many general settings, along with favorable empirical comparisons with other techniques. Chapelle & Li (2011) demonstrate that for the basic stochastic MAB problem, empirically TS achieves regret comparable to the lower bound of Lai & Robbins (1985); and in applications like display advertising and news article recommendation modeled by the contextual bandits problem, it is competitive with or better than other methods such as UCB. In their experiments, TS is also more robust to delayed or batched feedback than the other methods. TS has been used in an industrial-scale application for CTR prediction of search ads on search engines (Graepel et al., 2010). Kaufmann et al. (2012) do a thorough comparison of TS with the best known versions of UCB and show that TS has the lowest regret in the long run.

However, the theoretical understanding of TS is limited. Granmo (2010) and May et al. (2011) provided weak guarantees, namely, a bound of o(T) on the expected regret in time T. For the basic (i.e., without contexts) version of the stochastic MAB problem, some significant progress was made by Agrawal & Goyal (2012), Kaufmann et al. (2012) and, more recently, by Agrawal & Goyal (2013), who provided optimal bounds on the expected regret. But many questions regarding the theoretical analysis of TS remained open, including high probability regret bounds, and regret bounds for the more general contextual bandits setting. In particular, the contextual MAB problem does not seem easily amenable to the techniques used so far for analyzing TS for the basic MAB problem. In Section 3.1, we describe some of these challenges. Some of these questions and difficulties were also formally raised as a COLT 2012 open problem (Chapelle & Li, 2012).

In this paper, we use novel martingale-based analysis techniques to demonstrate that TS (i.e., our Gaussian prior based generalization of TS for contextual bandits) achieves high probability, near-optimal regret bounds for stochastic contextual bandits with linear payoff functions. To our knowledge, ours are the first non-trivial regret bounds for TS for the contextual bandits problem. Additionally, our results are the first high probability regret bounds for TS, even in the case of the basic MAB problem. This essentially solves the COLT 2012 open problem of Chapelle & Li (2012) for contextual bandits with linear payoffs.

Our version of the Thompson Sampling algorithm for the contextual MAB problem, described formally in Section 2.2, uses Gaussian prior and Gaussian likelihood functions. Our techniques can be extended to the use of other prior distributions, satisfying certain
conditions, as discussed in Section 4.

2. Problem setting and algorithm description

2.1. Problem setting

There are N arms. At time t = 1, 2, . . ., a context vector b_i(t) ∈ R^d is revealed for every arm i. These context vectors are chosen by an adversary in an adaptive manner after observing the arms played and their rewards up to time t − 1, i.e., the history H_{t−1},

H_{t−1} = {a(τ), r_{a(τ)}(τ), b_i(τ), i = 1, . . . , N, τ = 1, . . . , t − 1},

where a(τ) denotes the arm played at time τ. Given b_i(t), the reward for arm i at time t is generated from an (unknown) distribution with mean b_i(t)^T µ, where µ ∈ R^d is a fixed but unknown parameter:

E[ r_i(t) | {b_i(t)}_{i=1}^N, H_{t−1} ] = E[ r_i(t) | b_i(t) ] = b_i(t)^T µ.

An algorithm for this problem chooses, at every time t, an arm a(t) to play, using the history H_{t−1} and the current contexts. Let a∗(t) = arg max_i b_i(t)^T µ denote the optimal arm at time t, and let ∆_i(t) = b_{a∗(t)}(t)^T µ − b_i(t)^T µ denote the gap between the mean rewards of the optimal arm and of arm i at time t. The regret at time t is regret(t) = ∆_{a(t)}(t), and the goal is to minimize the total regret R(T) = Σ_{t=1}^T regret(t).

We assume that η_{i,t} = r_i(t) − b_i(t)^T µ is conditionally R-sub-Gaussian for a constant R ≥ 0, i.e.,

∀λ ∈ R,  E[ e^{λ η_{i,t}} | {b_i(t)}_{i=1}^N, H_{t−1} ] ≤ exp( λ²R²/2 ).

This assumption is satisfied whenever r_i(t) ∈ [b_i(t)^T µ − R, b_i(t)^T µ + R] (see Remark 1 in Appendix A.1 of Filippi et al. (2010)).
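Indeed, for bounded, conditionally zero-mean noise this is a direct consequence of Hoeffding's lemma:

```latex
% Hoeffding's lemma: if \eta_{i,t} \in [-R, R] almost surely and
% E[\eta_{i,t} \mid \{b_i(t)\}_{i=1}^{N}, H_{t-1}] = 0, then for every \lambda \in \mathbb{R},
E\bigl[e^{\lambda \eta_{i,t}} \mid \{b_i(t)\}_{i=1}^{N}, H_{t-1}\bigr]
\;\le\; \exp\!\left(\frac{\lambda^{2}(2R)^{2}}{8}\right)
\;=\; \exp\!\left(\frac{\lambda^{2} R^{2}}{2}\right),
% which is exactly the R-sub-Gaussian condition stated above.
```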
We will also assume that ||b_i(t)|| ≤ 1, ||µ|| ≤ 1, and ∆_i(t) ≤ 1 for all i, t (the norms, unless otherwise indicated, are ℓ2-norms). These assumptions are required to make the regret bounds scale-free, and are standard in the literature on this problem. If instead ||µ|| ≤ c, ||b_i(t)|| ≤ c, and ∆_i(t) ≤ c, then our regret bounds would increase by a factor of c.

Remark 1. An alternative definition of regret that appears in the literature is

regret(t) = r_{a∗(t)}(t) − r_{a(t)}(t).

We can obtain the same regret bounds for this alternative definition of regret. The details are provided in the supplementary material in Appendix A.5.

2.2. Thompson Sampling algorithm

We use a Gaussian likelihood function and a Gaussian prior to design our version of the Thompson Sampling algorithm. More precisely, suppose that the likelihood of reward r_i(t) at time t, given context b_i(t) and parameter µ, were given by the pdf of the Gaussian distribution N(b_i(t)^T µ, v²). Here, v = R √( (24/ε) d ln(1/δ) ), where ε ∈ (0, 1) is a parameter of our algorithm. Let

B(t) = I_d + Σ_{τ=1}^{t−1} b_{a(τ)}(τ) b_{a(τ)}(τ)^T,
µ̂(t) = B(t)^{−1} ( Σ_{τ=1}^{t−1} b_{a(τ)}(τ) r_{a(τ)}(τ) ).

Then, if the prior for µ at time t is given by N(µ̂(t), v² B(t)^{−1}), it is easy to compute the posterior distribution at time t + 1 as N(µ̂(t + 1), v² B(t + 1)^{−1}). At every time step t, Thompson Sampling therefore simply generates a sample µ̃(t) from the distribution N(µ̂(t), v² B(t)^{−1}) and plays the arm that maximizes b_i(t)^T µ̃(t); the procedure is summarized in Algorithm 1.
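For completeness, this conjugacy computation can be sketched as follows (a standard Bayesian linear regression calculation, written in the notation above):

```latex
% With likelihood r_{a(t)}(t) ~ N(b_{a(t)}(t)^T \mu, v^2) and prior
% \mu ~ N(\hat\mu(t), v^2 B(t)^{-1}), the posterior density satisfies
\Pr(\tilde\mu \mid r_{a(t)}(t))
  \;\propto\; \exp\!\Big(-\tfrac{1}{2v^2}\big(r_{a(t)}(t)-b_{a(t)}(t)^T\tilde\mu\big)^2\Big)
              \exp\!\Big(-\tfrac{1}{2v^2}(\tilde\mu-\hat\mu(t))^T B(t)(\tilde\mu-\hat\mu(t))\Big)
  \;\propto\; \exp\!\Big(-\tfrac{1}{2v^2}(\tilde\mu-\hat\mu(t+1))^T B(t+1)(\tilde\mu-\hat\mu(t+1))\Big),
% where, completing the square,
B(t+1) = B(t) + b_{a(t)}(t)\,b_{a(t)}(t)^T,
\qquad
\hat\mu(t+1) = B(t+1)^{-1}\Big(B(t)\hat\mu(t) + b_{a(t)}(t)\,r_{a(t)}(t)\Big).
% The posterior is again of the form N(\hat\mu(t+1), v^2 B(t+1)^{-1}); since
% B(t)\hat\mu(t) = \sum_{\tau<t} b_{a(\tau)}(\tau) r_{a(\tau)}(\tau), this is exactly
% the (B, f, \hat\mu) update performed in Algorithm 1.
```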
Algorithm 1 Thompson Sampling for Contextual Bandits
  Set B = I_d, µ̂ = 0_d, f = 0_d.
  for all t = 1, 2, . . . , do
    Sample µ̃(t) from the distribution N(µ̂, v² B^{−1}).
    Play arm a(t) := arg max_i b_i(t)^T µ̃(t), and observe reward r_t.
    Update B = B + b_{a(t)}(t) b_{a(t)}(t)^T, f = f + b_{a(t)}(t) r_t, µ̂ = B^{−1} f.
  end for

Every step t of Algorithm 1 consists of generating a d-dimensional sample µ̃(t) from a multivariate Gaussian distribution, and solving the problem arg max_i b_i(t)^T µ̃(t). Therefore, even if the number of arms N is large (or infinite), the above algorithm is efficient as long as the problem arg max_i b_i(t)^T µ̃(t) is
efficiently solvable. This is the case, for example, when the set of arms at time t is given by a d-dimensional convex set (every vector in the convex set is a context vector, and thus corresponds to an arm).
2.3. Our Results

Theorem 1. For the stochastic contextual bandit problem with linear payoff functions, with probability 1 − δ, the total regret in time T for Thompson Sampling (Algorithm 1) is bounded by

O( (d²/ε) √(T^{1+ε}) ln(Td) ln(1/δ) ),

for any 0 < ε < 1, 0 < δ < 1. Here, ε is a parameter used by the Thompson Sampling algorithm.

Remark 2. The parameter ε can be chosen to be any constant in (0, 1). If T is known, one could choose ε = 1/ln T, to get an Õ(d²√T) regret bound.
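To see why this choice suffices, note the following short calculation (our addition):

```latex
% With \epsilon = 1/\ln T we have
T^{\epsilon} = e^{\epsilon \ln T} = e,
\qquad\text{so}\qquad
\sqrt{T^{1+\epsilon}} = \sqrt{e}\,\sqrt{T},
\qquad\text{and}\qquad
\frac{1}{\epsilon} = \ln T .
% Substituting into the bound of Theorem 1, the extra factors are a constant
% and logarithmic terms, all absorbed by the \tilde{O}(\cdot) notation,
% leaving \tilde{O}(d^{2}\sqrt{T}).
```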
Remark 3. Our regret bound in Theorem 1 does not depend on N, and is applicable to the case of infinite arms, with only notational changes required in the analysis.

In the main body of this paper, we will discuss the proof of the above result. Below, we state two additional results; their proofs require small changes to the proof of Theorem 1 and are provided in the supplementary material.

The first result is for the setting where each of the N arms is associated with a different d-dimensional parameter µ_i ∈ R^d, so that the mean reward for arm i at time t is b_i(t)^T µ_i. This setting is a direct generalization of the basic MAB problem to d dimensions. Thompson Sampling for this setting will maintain a separate posterior distribution for each arm i, which would be updated only at the time instances when i is played. And, at every time step t, instead of a single sample µ̃(t), N independent samples will have to be generated: µ̃_i(t) for each arm i. We prove the following regret bound for this setting.

Theorem 2. For the setting with N different parameters, with probability 1 − δ, the total regret in time T for Thompson Sampling is bounded by

O( d √( (N T^{1+ε})/ε ) ln N ln T ln(1/δ) ),

for any 0 < ε < 1, 0 < δ < 1.

The details of the algorithm for the N-parameter setting and the proof of Theorem 2 appear in the supplementary material in Appendix C.
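For illustration, a minimal sketch of the sampling and update steps of this N-parameter variant (the structure and names are ours and illustrative only):

```python
import numpy as np

class PerArmThompsonSampling:
    """Sketch of the N-parameter variant: one Gaussian posterior per arm,
    updated only when that arm is played."""

    def __init__(self, n_arms, d, v, seed=0):
        self.v = v
        self.B = [np.eye(d) for _ in range(n_arms)]
        self.f = [np.zeros(d) for _ in range(n_arms)]
        self.mu_hat = [np.zeros(d) for _ in range(n_arms)]
        self.rng = np.random.default_rng(seed)

    def select_arm(self, contexts):
        # Independent samples mu_tilde_i ~ N(mu_hat_i, v^2 B_i^{-1}) for each arm.
        scores = [
            contexts[i] @ self.rng.multivariate_normal(
                self.mu_hat[i], self.v ** 2 * np.linalg.inv(self.B[i]))
            for i in range(len(self.B))
        ]
        return int(np.argmax(scores))

    def update(self, arm, context, reward):
        # Only the played arm's posterior is updated.
        self.B[arm] += np.outer(context, context)
        self.f[arm] += reward * context
        self.mu_hat[arm] = np.linalg.solve(self.B[arm], self.f[arm])
```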
Note that unlike Theorem 1, the regret bound in Theorem 2 has a dependence on N, which is expected because Theorem 2 deals with a setting where there are N different parameters to learn. However, the bound in Theorem 2 has a better dependence on d. This improvement results from the independence of the θ_i(t) = b_i(t)^T µ̃_i(t) in the algorithm for this setting. On the other hand, in Algorithm 1, used for the single parameter setting of Theorem 1, a single µ̃(t) is generated, and so the θ_i(t) = b_i(t)^T µ̃(t) are not independent. This motivates us to consider a modification of Algorithm 1 for the single parameter setting, in which the θ_i(t) are generated independently, each with the marginal distribution of b_i(t)^T µ̃(t). The arm with the highest value of θ_i(t) is played at time t. Although this modified algorithm could be inefficient compared to Algorithm 1 if N is large (say, exponential in d), the better dependence on d in the regret bound could be useful if d is large.
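For illustration, a minimal sketch of the modified sampling step, in which each θ_i(t) is drawn independently from the marginal N(b_i(t)^T µ̂(t), v² s_{t,i}²) (names are illustrative only):

```python
import numpy as np

def select_arm_independent_thetas(contexts, mu_hat, B, v, rng):
    """For each arm i, draw theta_i independently from the marginal
    N(b_i^T mu_hat, v^2 * b_i^T B^{-1} b_i) and play the arg max."""
    B_inv = np.linalg.inv(B)
    means = contexts @ mu_hat                                   # b_i^T mu_hat
    s2 = np.einsum("ij,jk,ik->i", contexts, B_inv, contexts)    # s_{t,i}^2
    thetas = rng.normal(means, v * np.sqrt(s2))
    return int(np.argmax(thetas))
```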
Theorem 3. For the modified algorithm in the single parameter setting, with probability 1 − δ, the total regret in time T is bounded by

O( d √( (T^{1+ε} ln N)/ε ) ln T ln(1/δ) ),

for any 0 < ε < 1, 0 < δ < 1.

The details of the modified algorithm and the proof of the above theorem appear in the supplementary material in Appendix B.

2.4. Related Work

The contextual bandit problem with linear payoffs is a widely studied problem in statistics and machine learning, often under different names, as mentioned by Chu et al. (2011): bandit problems with covariates (Woodroofe, 1979; Sarkar, 1991), associative reinforcement learning (Kaelbling, 1994), associative bandit problems (Auer, 2002; Strehl et al., 2006), bandit problems with expert advice (Auer et al., 2002), and linear bandits (Dani et al., 2008; Abbasi-Yadkori et al., 2011; Bubeck et al., 2012). The name contextual bandits was coined in Langford & Zhang (2007).

A lower bound of Ω(d√T) for this problem was given by Dani et al. (2008), when the number of arms is allowed to be infinite. In particular, they prove their lower bound using an example where the set of arms corresponds to all vectors in the intersection of a d-dimensional sphere and a cube. They also provide an upper bound of Õ(d√T), although their setting is slightly restrictive in the sense that the context vector for every arm is fixed in advance and is not allowed to change with time. Abbasi-Yadkori et al. (2011) analyze a UCB-style algorithm and provide a regret upper bound of O( d log(T)√T + √(dT log(T/δ)) ). Apart from the dependence on ε, our bounds are essentially away by a factor of d from these bounds.

For finite N, Chu et al. (2011) show a lower bound
of Ω(√(Td)) for d² ≤ T. Auer (2002) and Chu et al. (2011) analyze SupLinUCB, a complicated algorithm using UCB as a subroutine, for this problem. Chu et al. (2011) achieve a regret bound of O( √( Td ln³(NT ln(T)/δ) ) ) with probability at least 1 − δ (Auer (2002) proves similar results). This regret bound is not applicable to the case of infinite arms, and assumes that the context vectors are generated by an oblivious adversary. Also, this regret bound would give O(d²√T) regret if N is exponential in d. The state-of-the-art bounds for the linear bandits problem in the case of finite N are given by Bubeck et al. (2012). They provide an algorithm based on exponential weights, with regret of order √(dT log N) for any finite set of N actions. However, the exponential weights based algorithms are not efficient if N is large (sampling complexity of O(N) in every step). Also, their setting is slightly different from ours: the set of arms and the associated b_i vectors are non-adaptive and fixed in advance, and they consider a non-stochastic (adversarial) bandit setting where the reward at time t for arm i is b_i^T µ_t, with µ_t chosen by an adversary.

Very recent work of Russo & Roy (2013) provides near-optimal bounds on the Bayesian regret in many general settings. This result is incomparable to ours because of the different notion of regret used.

While the regret bounds provided in this paper do not match or better the best available regret bounds for the extensively studied problem of linear contextual bandits, our results demonstrate that the natural and efficient heuristic of Thompson Sampling can achieve theoretical bounds that are close to the best bounds. The main contribution of this paper is to provide new tools for the analysis of the Thompson Sampling algorithm for contextual bandits, which, despite being popular and empirically attractive, has eluded theoretical analysis. We believe the techniques used in this paper will provide useful insights into the workings of this Bayesian algorithm, and may be useful for further improvements and extensions.

3. Regret Analysis: Proof of Theorem 1

3.1. Challenges and proof outline

The contextual version of the multi-armed bandit problem presents new challenges for the analysis of the TS algorithm, and the techniques used so far for analyzing the basic multi-armed bandit problem by Agrawal & Goyal (2012); Kaufmann et al. (2012) do not seem directly applicable. Let us describe some of these difficulties and our novel ideas to resolve them.

In the basic MAB problem there are N arms, with mean reward µ_i ∈ R for arm i, and the regret for playing a suboptimal arm i is µ_{a∗} − µ_i, where a∗ is the arm with the highest mean. Let us compare this to a 1-dimensional contextual MAB problem, where arm i is associated with a parameter µ_i ∈ R, but in addition, at every time t, it is associated with a context b_i(t) ∈ R, so that the mean reward is b_i(t)µ_i. The best arm a∗(t) at time t is the arm with the highest mean at time t, and the regret for playing arm i is b_{a∗(t)}(t)µ_{a∗(t)} − b_i(t)µ_i.

In general, the basis of regret analysis for stochastic MAB is to prove that the variances of the empirical estimates for all arms decrease fast enough, so that the regret incurred until the variances become small enough is itself small. In the basic MAB, the variance of the empirical mean is inversely proportional to the number of plays k_i(t) of arm i at time t. Thus, every time the suboptimal arm i is played, we know that even though a regret of µ_{a∗} − µ_i ≤ 1 is incurred, there is also an improvement of exactly 1 in the number of plays of that arm, and hence a corresponding decrease in the variance. The techniques for analyzing the basic MAB rely on this observation to precisely quantify the exploration-exploitation tradeoff. On the other hand, the variance of the empirical mean for the contextual case is given by the inverse of B_i(t) = Σ_{τ≤t: a(τ)=i} b_i(τ)². When a suboptimal arm i is played, if b_i(t) is small, the regret b_{a∗(t)}(t)µ_{a∗(t)} − b_i(t)µ_i could be much higher than the improvement b_i(t)² in B_i(t).

In our proof, we overcome this difficulty by dividing the arms into two groups at any time: saturated and unsaturated arms, based on whether the standard deviation of the estimates for an arm is smaller or larger compared to the standard deviation for the optimal arm. The optimal arm is included in the group of unsaturated arms. We show that for the unsaturated arms, the regret on playing the arm can be bounded by a factor of the standard deviation, which improves every time the arm is played. This allows us to bound the total regret due to unsaturated arms. For the saturated arms, the standard deviation is small, or in other words, the estimates of the means constructed so far are quite accurate in the direction of the current contexts of these arms, so that the algorithm is able to distinguish between them and the optimal arm. We utilize this observation to show that the probability of playing such arms at any step is bounded by a function of the probability of playing the unsaturated arms.

Below is a more technical outline of the proof of Theorem 1. At any time step t, we divide the arms into two groups:
• saturated arms, defined as those with g(T) s_{t,i} < ℓ(T) s_{t,a∗(t)},
• unsaturated arms, defined as those with g(T) s_{t,i} ≥ ℓ(T) s_{t,a∗(t)},

where s_{t,i} = √( b_i(t)^T B(t)^{−1} b_i(t) ) and g(T), ℓ(T) (with g(T) > ℓ(T)) are constants (functions of T, d, δ) defined later. Note that s_{t,i} is the standard deviation of the estimate b_i(t)^T µ̂(t), and v·s_{t,i} is the standard deviation of the random variable b_i(t)^T µ̃(t).

We use concentration bounds for µ̃(t) and µ̂(t) to bound the regret at any time t by g(T)(s_{t,a∗(t)} + s_{t,a(t)}). Now, if an unsaturated arm is played at time t, then using the definition of unsaturated arms, the regret is at most (2g(T)²/ℓ(T)) s_{t,a(t)}. This is useful because of the inequality Σ_t s_{t,a(t)} = O(√(Td ln T)) (derived along the lines of Auer (2002)), which allows us to bound the total regret due to unsaturated arms.

For saturated arms, we prove that the probability of playing a saturated arm at any time t is at most 1/p times the probability of playing an unsaturated arm, up to an additive 1/(pT²) term, where p = 1/(4e√(πT^ε)). More precisely, we define F_{t−1} as the union of the history H_{t−1} and the contexts b_i(t), i = 1, . . . , N at time t, and prove that for "most" (in a high probability sense) F_{t−1},

Pr( a(t) is a saturated arm | F_{t−1} ) ≤ (1/p) · Pr( a(t) is an unsaturated arm | F_{t−1} ) + 1/(pT²).

We use these observations to establish that (X_t; t ≥ 0), where

X_t ≃ regret(t) − (g(T)/p) · I(a(t) is unsaturated) · s_{t,a∗(t)} − (2g(T)²/ℓ(T)) · s_{t,a(t)} − 2g(T)/(pT²),

is a super-martingale difference process adapted to the filtration F_t. Then, using the Azuma-Hoeffding inequality for super-martingales, along with the inequality Σ_t s_{t,a(t)} = O(√(Td ln T)), we will obtain the desired high probability regret bound.
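For reference, the Azuma-Hoeffding inequality for super-martingales, in its standard form (the precise statement used in the proof is Lemma 6 in Appendix A.2), reads as follows:

```latex
% If (Y_t; t \ge 0) is a super-martingale with respect to a filtration
% (\mathcal{F}_t), i.e. E[Y_t - Y_{t-1} \mid \mathcal{F}_{t-1}] \le 0,
% and |Y_t - Y_{t-1}| \le c_t almost surely, then for any a > 0,
\Pr\big(Y_T - Y_0 \ge a\big) \;\le\; \exp\!\left(\frac{-a^2}{2\sum_{t=1}^{T} c_t^2}\right).
% Equivalently, with probability at least 1-\delta,
Y_T - Y_0 \;\le\; \sqrt{2\ln\tfrac{1}{\delta}\,\textstyle\sum_{t=1}^{T} c_t^2}.
```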
3.2. Formal proof

For quick reference, the notations introduced below also appear in a table of notations at the beginning of the supplementary material.

Definition 1. For all i, define θ_i(t) = b_i(t)^T µ̃(t), and s_{t,i} = √( b_i(t)^T B(t)^{−1} b_i(t) ). By definition of µ̃(t), the marginal distribution of each θ_i(t) is Gaussian with mean b_i(t)^T µ̂(t) and standard deviation v·s_{t,i}. Also, s_{t,i} is the standard deviation of the estimate b_i(t)^T µ̂(t).

Definition 2. Recall that ∆_i(t) = b_{a∗(t)}(t)^T µ − b_i(t)^T µ, the difference between the mean reward of the optimal arm and arm i at time t.

Definition 3. Define ℓ(T) = R √( d ln(T³) ln(1/δ) ) + 1, v = R √( (24/ε) d ln(1/δ) ), and g(T) = √( 4d ln(Td) ) · v + ℓ(T).

Definition 4. Define E^µ(t) and E^θ(t) as the events that b_i(t)^T µ̂(t) and θ_i(t) are concentrated around their respective means. More precisely, define E^µ(t) as the event that

∀i : |b_i(t)^T µ̂(t) − b_i(t)^T µ| ≤ ℓ(T) s_{t,i}.

Define E^θ(t) as the event that

∀i : |θ_i(t) − b_i(t)^T µ̂(t)| ≤ √( 4d ln(Td) ) · v · s_{t,i}.

Definition 5. An arm i is called saturated at time t if g(T) s_{t,i} < ℓ(T) s_{t,a∗(t)}, and unsaturated otherwise. Let C(t) denote the set of saturated arms at time t. Note that the optimal arm is always unsaturated at time t, i.e., a∗(t) ∉ C(t). An arm may keep shifting from saturated to unsaturated and vice versa over time.

Definition 6. Define the filtration F_{t−1} as the union of the history until time t − 1 and the contexts at time t, i.e., F_{t−1} = {H_{t−1}, b_i(t), i = 1, . . . , N}.

By definition, F_1 ⊆ F_2 ⊆ · · · ⊆ F_{T−1}. Observe that the following quantities are determined by the history H_{t−1} and the contexts b_i(t) at time t, and hence are included in F_{t−1}:
• µ̂(t), B(t),
• s_{t,i}, for all i,
• the identity of the optimal arm a∗(t) and the set of saturated arms C(t),
• whether E^µ(t) is true or not,
• the distribution N(µ̂(t), v² B(t)^{−1}) of µ̃(t), and hence the joint distribution of θ_i(t) = b_i(t)^T µ̃(t), i = 1, . . . , N.

Lemma 1. For all t, 0 < δ < 1, Pr(E^µ(t)) ≥ 1 − δ/T². And, for all possible filtrations F_{t−1}, Pr(E^θ(t) | F_{t−1}) ≥ 1 − 1/T².

Proof. The complete proof of this lemma appears in Appendix A.3. The probability bound for E^µ(t) will be proven using a concentration inequality given by Abbasi-Yadkori et al. (2011), stated as Lemma 7 in Appendix A.2; the R-sub-Gaussian assumption on rewards is utilized here. The probability bound for E^θ(t) will be proven using a concentration inequality for Gaussian random variables from Abramowitz & Stegun (1964), stated as Lemma 5 in Appendix A.2.
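For reference, Gaussian tail bounds of the following standard form (see, e.g., Abramowitz & Stegun, 1964) drive both the concentration used for E^θ(t) (the upper bound) and the anti-concentration used for the optimal arm in Lemma 2 below (the lower bound); here Z is a standard Gaussian random variable:

```latex
% Standard Gaussian tail bounds: for any z > 0,
\frac{1}{\sqrt{2\pi}}\,\frac{z}{z^{2}+1}\;e^{-z^{2}/2}
\;\le\; \Pr(Z > z) \;\le\;
\frac{1}{\sqrt{2\pi}}\,\frac{1}{z}\;e^{-z^{2}/2}.
```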
The next lemma lower bounds the probability that θ_{a∗(t)}(t) = b_{a∗(t)}(t)^T µ̃(t) for the optimal arm at time t will exceed its mean reward b_{a∗(t)}(t)^T µ plus ℓ(T) s_{t,a∗(t)}.
Lemma 2. For any filtration F_{t−1} such that E^µ(t) is true,

Pr( θ_{a∗(t)}(t) > b_{a∗(t)}(t)^T µ + ℓ(T) s_{t,a∗(t)} | F_{t−1} ) ≥ 1/(4e√(πT^ε)).

Lemma 3. For any filtration F_{t−1} such that E^µ(t) is true,

Pr( a(t) ∈ C(t) | F_{t−1} ) ≤ (1/p) Pr( a(t) ∉ C(t) | F_{t−1} ) + 1/(pT²),

where p = 1/(4e√(πT^ε)).

Proof. The algorithm chooses the arm with the highest value of θ_i(t) = b_i(t)^T µ̃(t) to be played at time t. Therefore, if θ_{a∗(t)}(t) is greater than θ_j(t) for all saturated arms, i.e., θ_{a∗(t)}(t) > θ_j(t), ∀j ∈ C(t), then one of the unsaturated arms (which include the optimal arm and other suboptimal unsaturated arms) must be played. Therefore,

Pr( a(t) ∉ C(t) | F_{t−1} ) ≥ Pr( θ_{a∗(t)}(t) > θ_j(t), ∀j ∈ C(t) | F_{t−1} ).   (1)

By definition, for all saturated arms, i.e., for all j ∈ C(t), g(T) s_{t,j} < ℓ(T) s_{t,a∗(t)}. Also, if both the events E^µ(t) and E^θ(t) are true, then, by the definitions of these events, for all j ∈ C(t), θ_j(t) ≤ b_j(t)^T µ + g(T) s_{t,j}. Therefore, given an F_{t−1} such that E^µ(t) is true, either E^θ(t) is false, or else for all j ∈ C(t),

θ_j(t) ≤ b_j(t)^T µ + g(T) s_{t,j} ≤ b_{a∗(t)}(t)^T µ + ℓ(T) s_{t,a∗(t)}.

Hence, for any F_{t−1} such that E^µ(t) is true,

Pr( θ_{a∗(t)}(t) > θ_j(t), ∀j ∈ C(t) | F_{t−1} )
  ≥ Pr( θ_{a∗(t)}(t) > b_{a∗(t)}(t)^T µ + ℓ(T) s_{t,a∗(t)} | F_{t−1} ) − Pr( ¬E^θ(t) | F_{t−1} )
  ≥ p − 1/T².

The last inequality uses Lemma 2 and Lemma 1. Substituting in Equation (1), this gives

Pr( a(t) ∉ C(t) | F_{t−1} ) + 1/T² ≥ p,

which implies

Pr( a(t) ∈ C(t) | F_{t−1} ) ≤ (1/p) ( Pr( a(t) ∉ C(t) | F_{t−1} ) + 1/T² ).

Next, define regret'(t) = regret(t) · I(E^µ(t)), and define the process

X_t = regret'(t) − (g(T)/p) · I(a(t) ∉ C(t)) · s_{t,a∗(t)} − (2g(T)²/ℓ(T)) · s_{t,a(t)} − 2g(T)/(pT²),

with Y_0 = 0 and Y_t = Σ_{w=1}^t X_w.

Lemma 4. (Y_t; t = 0, . . . , T) is a super-martingale process with respect to the filtration F_t.

Proof. See Definition 9 in Appendix A.2 for the definition of super-martingales. We need to prove that for all t ∈ [1, T], and any F_{t−1}, E[Y_t − Y_{t−1} | F_{t−1}] ≤ 0, i.e.,

E[ regret'(t) | F_{t−1} ] ≤ (g(T)/p) Pr( a(t) ∉ C(t) | F_{t−1} ) s_{t,a∗(t)} + (2g(T)²/ℓ(T)) E[ s_{t,a(t)} | F_{t−1} ] + 2g(T)/(pT²).

If F_{t−1} is such that E^µ(t) is not true, then regret'(t) = regret(t) · I(E^µ(t)) = 0, and the above inequality holds trivially. So, we consider F_{t−1} such that E^µ(t) holds. We observe that if the events E^µ(t) and E^θ(t) are true, then ∆_{a(t)}(t) ≤ g(T)(s_{t,a(t)} + s_{t,a∗(t)}). This is because if an arm i is played at time t, then it must be true that θ_i(t) ≥ θ_{a∗(t)}(t). And, if E^θ(t) and E^µ(t) are true, then

b_i(t)^T µ ≥ θ_i(t) − g(T) s_{t,i} ≥ θ_{a∗(t)}(t) − g(T) s_{t,i} ≥ b_{a∗(t)}(t)^T µ − g(T) s_{t,a∗(t)} − g(T) s_{t,i}.

Therefore, given a filtration F_{t−1} such that E^µ(t) is true, either ∆_{a(t)}(t) ≤ g(T)(s_{t,a(t)} + s_{t,a∗(t)}) or E^θ(t)
is false. And, hence,

E[ regret'(t) | F_{t−1} ]
  = E[ ∆_{a(t)}(t) | F_{t−1} ]
  ≤ E[ g(T)(s_{t,a∗(t)} + s_{t,a(t)}) | F_{t−1} ] + Pr( ¬E^θ(t) | F_{t−1} )
  = g(T) E[ s_{t,a∗(t)} I(a(t) ∈ C(t)) | F_{t−1} ] + g(T) E[ s_{t,a∗(t)} I(a(t) ∉ C(t)) | F_{t−1} ]
      + g(T) E[ s_{t,a(t)} | F_{t−1} ] + Pr( ¬E^θ(t) | F_{t−1} )
  ≤ g(T) s_{t,a∗(t)} Pr( a(t) ∈ C(t) | F_{t−1} ) + g(T) E[ (g(T)/ℓ(T)) s_{t,a(t)} I(a(t) ∉ C(t)) | F_{t−1} ]
      + g(T) E[ s_{t,a(t)} | F_{t−1} ] + 1/T²
  ≤ g(T) s_{t,a∗(t)} · (1/p) Pr( a(t) ∉ C(t) | F_{t−1} ) + g(T)/(pT²)
      + (2g(T)²/ℓ(T)) E[ s_{t,a(t)} | F_{t−1} ] + 1/T²
  ≤ g(T) s_{t,a∗(t)} · (1/p) Pr( a(t) ∉ C(t) | F_{t−1} ) + (2g(T)²/ℓ(T)) E[ s_{t,a(t)} | F_{t−1} ] + 2g(T)/(pT²).

In the first inequality we used that ∆_i(t) ≤ 1 for all i. The second inequality used the definition of unsaturated arms to apply s_{t,a∗(t)} ≤ (g(T)/ℓ(T)) s_{t,a(t)} when a(t) ∉ C(t), and Lemma 1 to apply Pr(¬E^θ(t) | F_{t−1}) ≤ 1/T². The third inequality used Lemma 3, and also the observation that 0 ≤ s_{t,a∗(t)} ≤ ||b_{a∗(t)}(t)|| ≤ 1.

Now, we are ready to prove Theorem 1.

Proof of Theorem 1. We observe that the absolute value of each of the four terms in the definition of X_t is bounded by (2/p) g(T)²/ℓ(T); therefore the super-martingale Y_t has bounded differences |Y_t − Y_{t−1}| ≤ (8/p) g(T)²/ℓ(T) for all t ≥ 1. Thus, we can apply the Azuma-Hoeffding inequality (see Lemma 6 in Appendix A.2) to obtain that, with probability 1 − δ/2,

Σ_{t=1}^T regret'(t)
  ≤ Σ_{t=1}^T (g(T)/p) I(a(t) ∉ C(t)) s_{t,a∗(t)} + 2g(T)/(pT)
      + (2g(T)²/ℓ(T)) Σ_{t=1}^T s_{t,a(t)} + (8/p)(g(T)²/ℓ(T)) √( 2T ln(2/δ) )
  ≤ Σ_{t=1}^T (g(T)²/(ℓ(T) p)) I(a(t) ∉ C(t)) s_{t,a(t)} + 2g(T)/(pT)
      + (2g(T)²/ℓ(T)) Σ_{t=1}^T s_{t,a(t)} + (8/p)(g(T)²/ℓ(T)) √( 2T ln(2/δ) )
  ≤ (g(T)²/ℓ(T)) (3/p) Σ_{t=1}^T s_{t,a(t)} + 2g(T)/(pT) + (8/p)(g(T)²/ℓ(T)) √( 2T ln(2/δ) ).

The second inequality used the observation that if an unsaturated arm is played, i.e., a(t) ∉ C(t), then g(T) s_{t,a(t)} ≥ ℓ(T) s_{t,a∗(t)}.

Now, we can use Σ_{t=1}^T s_{t,a(t)} ≤ 5√(dT ln T), which can be derived along the lines of Lemma 3 of Chu et al. (2011), using Lemma 11 of Auer (2002) (see Appendix A.5 for details). Also, recalling the definitions of p, ℓ(T), and g(T) (see the table of notations at the beginning of the supplementary material), and substituting in the above, we get

Σ_{t=1}^T regret'(t) = O( (d²/ε) √(T^{1+ε}) ln(1/δ) ln(Td) ).

Also, because E^µ(t) holds for all t with probability at least 1 − δ/2 (see Lemma 1), regret'(t) = regret(t) for all t with probability at least 1 − δ/2. Hence, with probability 1 − δ,

R(T) = Σ_{t=1}^T regret(t) = Σ_{t=1}^T regret'(t) = O( (d²/ε) √(T^{1+ε}) ln(1/δ) ln(Td) ).

The proof for the alternate definition of regret mentioned in Remark 1 is provided in Appendix A.5.
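For intuition, the inequality Σ_t s_{t,a(t)} = O(√(dT ln T)) used above follows from a standard "elliptical potential" argument; the sketch below is ours (the exact constant 5 is derived in Appendix A.5):

```latex
% Since B(t+1) = B(t) + b_{a(t)}(t) b_{a(t)}(t)^T and
% s_{t,a(t)}^2 = b_{a(t)}(t)^T B(t)^{-1} b_{a(t)}(t), the matrix determinant lemma gives
\det B(t+1) = \det B(t)\,\bigl(1 + s_{t,a(t)}^{2}\bigr)
\;\Longrightarrow\;
\sum_{t=1}^{T} \ln\!\bigl(1 + s_{t,a(t)}^{2}\bigr)
  = \ln\frac{\det B(T+1)}{\det B(1)}
  \le d\,\ln\!\Bigl(1 + \frac{T}{d}\Bigr),
% using B(1) = I_d, \|b_i(t)\| \le 1, and the AM-GM inequality on the eigenvalues of B(T+1).
% Since s_{t,a(t)}^2 \le 1, we have s_{t,a(t)}^2 \le 2\ln(1+s_{t,a(t)}^2), so by Cauchy-Schwarz,
\sum_{t=1}^{T} s_{t,a(t)}
  \le \sqrt{T \sum_{t=1}^{T} s_{t,a(t)}^{2}}
  \le \sqrt{2\,T\,d\,\ln\!\Bigl(1+\frac{T}{d}\Bigr)}
  = O\!\left(\sqrt{dT\ln T}\right).
```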
4. Conclusions

Detailed concluding remarks appear in the supplementary material, Section D.

References

Abbasi-Yadkori, Yasin, Pál, Dávid, and Szepesvári, Csaba. Improved Algorithms for Linear Stochastic Bandits. In NIPS, pp. 2312–2320, 2011.

Abramowitz, Milton and Stegun, Irene A. Handbook of Mathematical Functions with Formulas, Graphs, and Mathematical Tables. Dover, New York, 1964.

Agrawal, Shipra and Goyal, Navin. Analysis of Thompson Sampling for the Multi-armed Bandit Problem. In COLT, 2012.

Agrawal, Shipra and Goyal, Navin. Further Optimal Regret Bounds for Thompson Sampling. In AISTATS, 2013.

Auer, Peter. Using Confidence Bounds for Exploitation-Exploration Trade-offs. Journal of Machine Learning Research, 3:397–422, 2002.

Auer, Peter, Cesa-Bianchi, Nicolò, Freund, Yoav, and Schapire, Robert E. The Nonstochastic Multiarmed Bandit Problem. SIAM J. Comput., 32(1):48–77, 2002.

Bubeck, Sébastien, Cesa-Bianchi, Nicolò, and Kakade, Sham M. Towards minimax policies for online linear optimization with bandit feedback. In Proceedings of the 25th Conference on Learning Theory (COLT), pp. 1–14, 2012.
Chapelle, Olivier and Li, Lihong. An Empirical Evaluation of Thompson Sampling. In NIPS, pp. 2249–2257, 2011.

Chapelle, Olivier and Li, Lihong. Open Problem: Regret Bounds for Thompson Sampling. In COLT, 2012.

Chu, Wei, Li, Lihong, Reyzin, Lev, and Schapire, Robert E. Contextual Bandits with Linear Payoff Functions. Journal of Machine Learning Research - Proceedings Track, 15:208–214, 2011.

Dani, Varsha, Hayes, Thomas P., and Kakade, Sham M. Stochastic Linear Optimization under Bandit Feedback. In COLT, pp. 355–366, 2008.

Filippi, Sarah, Cappé, Olivier, Garivier, Aurélien, and Szepesvári, Csaba. Parametric Bandits: The Generalized Linear Case. In NIPS, pp. 586–594, 2010.

Graepel, Thore, Candela, Joaquin Quiñonero, Borchert, Thomas, and Herbrich, Ralf. Web-Scale Bayesian Click-Through Rate Prediction for Sponsored Search Advertising in Microsoft's Bing Search Engine. In ICML, pp. 13–20, 2010.

Granmo, O.-C. Solving Two-Armed Bernoulli Bandit Problems Using a Bayesian Learning Automaton. International Journal of Intelligent Computing and Cybernetics (IJICC), 3(2):207–234, 2010.

Kaelbling, Leslie Pack. Associative Reinforcement Learning: Functions in k-DNF. Machine Learning, 15(3):279–298, 1994.

Kaufmann, Emilie, Korda, Nathaniel, and Munos, Rémi. Thompson Sampling: An Optimal Finite Time Analysis. In ALT, 2012.

Lai, T. L. and Robbins, H. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6:4–22, 1985.

Langford, John and Zhang, Tong. The Epoch-Greedy Algorithm for Multi-armed Bandits with Side Information. In NIPS, 2007.

May, Benedict C. and Leslie, David S. Simulation studies in optimistic Bayesian sampling in contextual-bandit problems. Technical Report 11:02, Statistics Group, Department of Mathematics, University of Bristol, 2011.

May, Benedict C., Korda, Nathan, Lee, Anthony, and Leslie, David S. Optimistic Bayesian sampling in contextual-bandit problems. Technical Report 11:01, Statistics Group, Department of Mathematics, University of Bristol, 2011.

Ortega, Pedro A. and Braun, Daniel A. A Minimum Relative Entropy Principle for Learning and Acting. Journal of Artificial Intelligence Research, 38:475–511, 2010.

Russo, Daniel and Roy, Benjamin Van. Learning to optimize via posterior sampling. CoRR, abs/1301.2609, 2013.

Sarkar, Jyotirmoy. One-armed bandit problems with covariates. The Annals of Statistics, 19(4):1978–2002, 1991.

Scott, S. A modern Bayesian look at the multi-armed bandit. Applied Stochastic Models in Business and Industry, 26:639–658, 2010.

Strehl, Alexander L., Mesterharm, Chris, Littman, Michael L., and Hirsh, Haym. Experience-efficient learning in associative bandit problems. In ICML, pp. 889–896, 2006.

Strens, Malcolm J. A. A Bayesian Framework for Reinforcement Learning. In ICML, pp. 943–950, 2000.

Thompson, William R. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3-4):285–294, 1933.

Woodroofe, Michael. A one-armed bandit problem with a concomitant variable. Journal of the American Statistical Association, 74(368):799–806, 1979.

Wyatt, Jeremy. Exploration and Inference in Learning from Reinforcement. PhD thesis, Department of Artificial Intelligence, University of Edinburgh, 1997.