
Introduction to Stochastic Multi-Armed Bandits

Cynthia Rudin (with Stefano Tracá and Tianyu Wang)

The name “multi-armed bandit” (MAB) comes from the name of a gambling machine. You can choose one of the arms
(levers) of the machine at each round, and get a reward based on which arm you choose. The rewards for each arm are iid
from a distribution, and each arm has its own distribution. If one of the arms is better than the rest it would be good to always
pull that arm, but you don’t know which one it is! So you need to divide your time between exploring arms that you think
might be good and exploiting arms that you know are good.
There are many applications for MAB, including recommender systems. For instance, the New York Times uses MAB
to determine which news articles to show you on your cellular phone. One could also argue that contextual MAB are the
algorithms that are leading to the demise of our current society as we know it! These are the algorithms that are really
good at figuring out how to serve you advertisements and social media posts that you are most likely to click on. These
are some of the algorithms that can keep you addicted to social media, going down its rabbit holes. But they are actually
very simple optimization algorithms that are also used for many scientific purposes and clinical trials and so on. They are
probably not dangerous unless you are running a big social media company! I am hopeful that some of you, after learning how
these algorithms work, will figure out how to adapt them for the good of society, to optimize long term health of humanity
rather than simply using them to optimize clicks and viewing time! One could envision many ways to do this, for instance,
prioritizing articles that are more likely to be truthful and less likely to incite rage. Or perhaps to promote articles that
encourage educational topics (broadly construed).
Usually MAB is considered to be an alternative to massive A-B testing. Say you want to optimize the look of your website,
but there are many possible website options to consider. To determine which one is the best, you might try each option several
times and perform pairwise hypothesis testing between all pairs (this is a hypothesis test between option “A” and option “B,”
hence the terminology “A-B testing”). This will take a huge amount of time, so you might want to run a MAB instead, which
conducts all the tests at once, eliminating testing options that are bad once we are pretty sure they are bad, and focusing on
options that might be the best. Clinical trials also can use MAB. We can give many different drugs to the patients, and we
can use MAB to find the best drugs, based on the performance of these drugs over the course of the trial, without having to
do pairwise tests and waste our resources testing drugs that we know early on are not performing well.
In contextual MAB, we also consider the context of each trial. So, for instance, when the social media companies optimize
which advertisement to show you, they might not just use information about the general popularity of each ad, they would
also use a context vector (a feature vector) that they created about you (e.g., this person is an introvert, who likes machine
learning, Dungeons and Dragons, Minecraft, and is a student at a prestigious university in NC, with a political stance that
leans to the left, who stays up late looking at dating sites – yes they have that level of detail about you, and no, it is not too
hard to figure that information out if they know what you do online).
Formally, the stochastic multi-armed bandit problem is a game played in n rounds. At each round t the player chooses an
action among a finite set of m possible choices called arms. When arm j is played (j ∈ {1, · · · , m}) a random reward Xj (t)
is drawn from an unknown distribution. In the case of online advertising, the reward is often whether someone clicked on
something. The distribution of Xj (t) does not change with time (the index t is just used to indicate in which turn the reward
was drawn). At the end of each turn the player can update her estimate of the mean reward of arm j:
\[
\hat{X}_{j,t} = \frac{1}{T_j(t-1)} \sum_{s=1}^{t-1} X_j(s)\,\mathbf{1}_{\{I_s = j\}}, \tag{1}
\]

where Tj (t − 1) is the number of times arm j has been played before round t starts, and 1{It =j} is an indicator function equal
to 1 if arm j is played at time t (otherwise its value is 0). After a while, this empirical mean will be close to the arm’s mean
reward. Updating these estimates after each round will help the player in choosing a good arm in the next round.
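As a concrete illustration, here is a minimal Python sketch of the bookkeeping behind (1), written in the standard incremental form of the sample mean so that past rewards need not be stored. The class name `ArmStats` and its structure are our own illustration, not part of the notes.

```python
import numpy as np

class ArmStats:
    """Running estimates of the arm means, as in Equation (1)."""

    def __init__(self, m):
        self.counts = np.zeros(m, dtype=int)  # T_j(t-1): pulls of each arm so far
        self.means = np.zeros(m)              # X-hat_{j,t}: empirical mean reward per arm

    def update(self, j, reward):
        # Incremental form of (1): mean_j <- mean_j + (reward - mean_j) / T_j
        self.counts[j] += 1
        self.means[j] += (reward - self.means[j]) / self.counts[j]
```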
At each turn, the player suffers a possible regret from not having played the best arm. If they had chosen the best arm,
their reward would have been X ∗ (t) where notation ∗ means the best arm. The total regret at the end of the game is given by
\[
R_n^{(\mathrm{raw})} = \sum_{t=1}^{n} \sum_{j=1}^{m} \big[X^*(t) - X_j(t)\big]\,\mathbf{1}_{\{I_t = j\}},
\]

where X ∗ (t) is the reward of the best arm at time t if it would have been played at time t.
We don’t usually define the regret this way though when doing theory. We usually assign the regret to be based on the
means of the arms distributions. So let’s try it again:
\[
R_n = \sum_{t=1}^{n} \sum_{j=1}^{m} \big[\mu^* - \mu_j\big]\,\mathbf{1}_{\{I_t = j\}},
\]

where µj is the expected payoff of arm j . The mean regret for having played arm j is given by ∆j = µ∗ − µj , where µ∗ is
the mean reward of the best arm and µj is the mean reward obtained when playing arm j . So the regret is now:
\[
R_n = \sum_{t=1}^{n} \sum_{j=1}^{m} \Delta_j\,\mathbf{1}_{\{I_t = j\}}.
\]

The strategies presented in the following sections aim to minimize the expected cumulative regret E[Rn ], where the
expectation is over the random draw of the arms. (The algorithm reacts to these random draws, so the choice of arms It also
then becomes random.)
\[
\mathbb{E}[R_n] = \mathbb{E}\!\left[\sum_{t=1}^{n} \sum_{j=1}^{m} \Delta_j\,\mathbf{1}_{\{I_t = j\}}\right] = \sum_{j=1}^{m} \Delta_j\,\mathbb{E}\big[T_j(n)\big], \tag{2}
\]

where Tj (n) is the number of times arm j is played up to time n.
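To spell out the last equality in (2): by linearity of expectation we can pull the sum over arms outside, and the remaining sum of indicators is exactly the pull count,
\[
\mathbb{E}\!\left[\sum_{t=1}^{n}\sum_{j=1}^{m}\Delta_j\,\mathbf{1}_{\{I_t=j\}}\right] = \sum_{j=1}^{m}\Delta_j\,\mathbb{E}\!\left[\sum_{t=1}^{n}\mathbf{1}_{\{I_t=j\}}\right] = \sum_{j=1}^{m}\Delta_j\,\mathbb{E}\big[T_j(n)\big].
\]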


A complete list of the symbols used can be found in Appendix D.

1 ε-greedy algorithm
The first algorithm we consider is called ε-greedy, and it is in Algorithm 2. The idea is very simple: with some small
probability, play an arm uniformly at random. Otherwise, pick the arm that we think is the best.

Algorithm 2: ε-greedy algorithm

Input: number of rounds n, number of arms m, a constant k such that $k > \max\{10,\ 4/\min_j \Delta_j^2\}$, sequence $\{\varepsilon_t\}_{t=1}^{n}$ with $\varepsilon_t = \min\{1, \frac{km}{t}\}$
Initialization: play all arms once and initialize $\hat{X}_{j,t}$ (defined in (1)) for each $j = 1, \cdots, m$
for t = m + 1 to n do
    With probability $\varepsilon_t$ play an arm uniformly at random (each arm has probability $1/m$ of being selected); otherwise (with probability $1 - \varepsilon_t$) play the ("best") arm j such that $\hat{X}_{j,t-1} \ge \hat{X}_{i,t-1}$ for all i;
    Get reward $X_j(t)$;
    Update $\hat{X}_{j,t}$;
end

You will notice that there are some interesting terms in the algorithm, defining the choice of εt . They are chosen that way
so that we can get a tight bound on the regret of the algorithm.
Since we select k to be larger than both 10 and $4/\min_j \Delta_j^2$, the algorithm explores for a while before it does any exploitation. To see this, note that by the definition of $\varepsilon_t$ in the algorithm, $\varepsilon_t = 1$ until $km/t$ drops below 1 (that is, for all $t \le km$), which takes a while when k is large. In these early iterations, when $\varepsilon_t = 1$, the algorithm just plays arms uniformly at random. As I mentioned earlier, the whole idea of these MAB algorithms is to balance exploration of arms to reduce uncertainty against exploitation of arms that we know are good. So it's useful to explore for a while to learn which arms are good before trying to exploit.
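To make the procedure concrete, here is a minimal Python sketch of Algorithm 2. The simulated Bernoulli arms, the random seed, and the helper names are our own choices for illustration, not part of the notes.

```python
import numpy as np

def epsilon_greedy(true_means, n, k, rng=np.random.default_rng(0)):
    """Sketch of Algorithm 2 with epsilon_t = min(1, k*m/t), on simulated Bernoulli arms."""
    m = len(true_means)
    counts = np.zeros(m)
    means = np.zeros(m)

    def pull(j):
        # Simulated Bernoulli reward; in a real application this is the observed feedback.
        return float(rng.random() < true_means[j])

    def update(j, x):
        counts[j] += 1
        means[j] += (x - means[j]) / counts[j]

    for j in range(m):                     # initialization: play every arm once
        update(j, pull(j))
    for t in range(m + 1, n + 1):
        eps_t = min(1.0, k * m / t)
        if rng.random() < eps_t:           # explore: uniform random arm
            j = int(rng.integers(m))
        else:                              # exploit: arm with the best current estimate
            j = int(np.argmax(means))
        update(j, pull(j))
    return counts, means
```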
Theorem 1.1 shows that the regret of ε-greedy is bounded by a quantity that is at most logarithmic in n. You can see this because the bound consists of a sum of n terms (a term for each t), each of which is of order $t^{-1}$, and $\log n$ is a bound on the sum (integral) of such terms. Specifically, $\varepsilon_t$ is of order $\Theta(1/t)$, while the $\beta_j(t)$ term is $o(1/t)$. To see this, you need the assumptions we made about $\varepsilon_t$: for instance, that $k > 10$ so that the first exponent of $\beta_j$ is sufficiently negative, and that $k > 4/\Delta_j^2$ for all j so that the second exponent of $\beta_j$ is sufficiently negative.

Theorem 1.1 (Regret-bound for ε-greedy algorithm – adapted from Auer et al. (2002)). The bound on the mean regret
E[Rn ] at time n is given by
\[
\mathbb{E}[R_n] \le e\,k\,m^2 + \sum_{t=\lfloor e^2 km\rfloor + 1}^{n}\ \sum_{j:\mu_j<\mu^*} \Delta_j\left[\frac{1}{m}\varepsilon_t + (1-\varepsilon_t)\,\beta_j(t)\right], \tag{3}
\]
where
\[
\beta_j(t) = k\left(\frac{t}{mke}\right)^{-\frac{k}{10}}\log\!\left(\frac{t}{mke}\right) + \frac{4e^{\frac{1}{2}}}{\Delta_j^2}\left(\frac{t}{mke}\right)^{-\frac{k\Delta_j^2}{4}}. \tag{4}
\]

The first term in (3) is a bound on the mean regret during the "starting phase" of the ε-greedy algorithm (Algorithm 2). For the rounds after the starting phase, the quantity in brackets in (3) is an upper bound on the probability of playing arm j. In the bound, $\beta_j(t)$ is
an upper bound on the probability that our algorithm thinks arm j is the best arm at round t, and 1/m is the probability of
choosing arm j when the choice is made at random. Proof in Appendix A.

2 The UCB algorithm


The UCB algorithm is also very simple. It creates a confidence interval for each arm's mean reward. At each round, it chooses the arm with the highest upper confidence bound, because any arm with a high upper confidence bound could be the best arm.

Algorithm 3: UCB algorithm

Input: number of rounds n, number of arms m
Initialization: play all arms once and initialize $\hat{X}_{j,t}$ (as defined in (1)) for each $j = 1, \cdots, m$
for t = m + 1 to n do
    Play the arm j with the highest upper confidence bound on the mean estimate,
    \[
    \hat{X}_{j,t-1} + \sqrt{\frac{2\log(t)}{T_j(t-1)}};
    \]
    Get reward $X_j(t)$;
    Update $\hat{X}_{j,t}$;
end
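Here is a matching minimal Python sketch of Algorithm 3; again, the simulated Bernoulli arms and the helper names are our own, not part of the notes.

```python
import numpy as np

def ucb(true_means, n, rng=np.random.default_rng(0)):
    """Sketch of Algorithm 3: play the arm with the largest upper confidence bound."""
    m = len(true_means)
    counts = np.zeros(m)
    means = np.zeros(m)

    def pull(j):
        return float(rng.random() < true_means[j])   # simulated Bernoulli reward

    def update(j, x):
        counts[j] += 1
        means[j] += (x - means[j]) / counts[j]

    for j in range(m):                               # play each arm once
        update(j, pull(j))
    for t in range(m + 1, n + 1):
        index = means + np.sqrt(2.0 * np.log(t) / counts)  # upper confidence bounds
        j = int(np.argmax(index))                    # optimism in the face of uncertainty
        update(j, pull(j))
    return counts, means
```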

The bound for UCB also grows logarithmically in n. Again, this can be seen because the terms in the sum decay faster than 1/t.

Theorem 2.1 (Regret-bound of the UCB algorithm – adapted from Auer et al. (2002)). The bound on the mean regret
E[Rn] at time n is given by
\[
\mathbb{E}[R_n] \le \sum_{j=1}^{m} \Delta_j + \sum_{j:\mu_j<\mu^*} \frac{8}{\Delta_j}\log(n) + \sum_{j=1}^{m} \Delta_j\left(1 + 2\sum_{t=m+1}^{n} t^{-4}(t-1-m)^2\right). \tag{5}
\]

Proof in Appendix B.

3 Instance-independent regret bound
In previous sections, we have discussed regret bounds of the $-greedy and the UCB algorithm, which are logarithmic in
number of steps – n. In fact, such logarithmic regret rate is the best we can hope for in the asymptotic sense (Lai and
Robbins, 1985). Yet it is important to point out that the aforementioned bounds depend on ∆j (the gap of expected reward
of the optimal arm and arm j ). This means different problem instances exhibit different regret rates. In particular, when the
optimality gaps ∆j is small, the regret rate may be large. A natural question to ask is: is there a form of regret bound that
hold true for any problem instance? The answer is positive.

Corollary 3.1. Fix an arbitrary positive integer n. For the UCB algorithm, the mean regret satisfies
\[
\mathbb{E}[R_n] \le \sqrt{\sum_{j:\Delta_j>0}\left(8\log(n) + 1 + 2\sum_{t=m+1}^{n} t^{-4}(t-1-m)^2\right)}\;\sqrt{n}, \tag{6}
\]
where m is the number of arms. For the ε-greedy algorithm, the mean regret satisfies
\[
\mathbb{E}[R_n] \le \sqrt{\sum_{j:\Delta_j>0}\left[k\left(\frac{t}{mke}\right)^{-\frac{k}{10}}\log\!\left(\frac{t}{mke}\right) + 4e^{\frac{1}{2}}\left(\frac{t}{mke}\right)^{-\frac{k\Delta_j^2}{4}}\right]}\;\sqrt{n}, \tag{7}
\]

where m is the number of arms, and k is an algorithm parameter specified in Algorithm 2.

Given the proofs of Theorems 1.1 and 2.1, the proof of the above corollary is fairly straightforward; the UCB case is in Appendix C, and the ε-greedy counterpart follows the same procedure. The bounds in Corollary 3.1 for ε-greedy and the UCB algorithm are also optimal in the worst-case sense. The proof of asymptotic optimality (Lai and Robbins, 1985) uses a similar (but somewhat more involved) mechanism, and is left as optional reading. Textbooks covering this topic include those of Bubeck and Cesa-Bianchi (2012); Slivkins (2019); Lattimore and Szepesvári (2020).
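To read off the order of the bound in (6) (a rough calculation under the same assumptions, not stated explicitly in the corollary): since $t^{-4}(t-1-m)^2 \le t^{-2}$, the inner sum over t is bounded by a constant, so each of the at most m suboptimal arms contributes $8\log(n) + O(1)$, giving
\[
\mathbb{E}[R_n] \le \sqrt{\sum_{j:\Delta_j>0}\big(8\log(n) + O(1)\big)}\;\sqrt{n} = O\!\left(\sqrt{m\,n\log n}\right),
\]
which holds for every problem instance, in contrast with the instance-dependent (gap-dependent) logarithmic bounds of the previous sections.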

4 Contextual Bandits
A natural extension to the standard multi-armed bandit is to make decisions with “side information.” In this setting, at each
time, the agent observes some contextual information, and chooses an arm based not only on the arms’ histories, but also on
the contextual information.
This setting is used in important real-world scenarios. For example, in online item recommendation, a company may use
the context to describe the browsing user (age, gender, geographical location, etc.) and construct a feature vector to be the
user’s profile.

user in context = [1 if introvert, 1 if likes Jazz, 1 if it is now between 12am and 6am, 1 if browsing dating sites, ...]

After observing this context vector, an algorithm (the recommendation system) chooses an arm (item to recommend, e.g., a
video game ad) to display to the user. The choice of which arm would be more desirable really can depend on the context:
what the user wants after midnight could be totally different than what she wants at lunchtime! Also, different people respond
to very different ads, based on their interests. The algorithm could get a reward of 1 if the user clicks on the item, or a zero
reward otherwise. Or, the reward could be “conversion,” that is, whether the user purchased the item that is being advertised.
We will formulate (a preliminary version of) the problem, define a performance metric, and present one algorithm for this
problem (out of many possible algorithms). The problem is described by a set of contexts $C := [0, 1]^{d_S}$ ($d_S$ is the dimension of the context vector), a continuous set of arms $A := [0, 1]^{d_A}$ ($d_A$ is the dimension of the arm space), and a (stochastic) reward $f(z, a) + \epsilon$ for all $(z, a) \in C \times A$, where f is the mean reward function and $\epsilon$ is zero-mean, bounded noise. In addition, we assume that the mean reward is Lipschitz: for all $(z, a), (z', a') \in C \times A$, we assume $|f(z, a) - f(z', a')| \le L\,\|(z, a) - (z', a')\|_2$, where L is the Lipschitz constant of the mean reward. For time $t = 1, 2, \cdots, n$, the environment reveals a context $z_t$, the agent chooses an arm $a_t$, and receives a random reward $X_t = f(z_t, a_t) + \epsilon$. (It's confusing, but now $X_t$ is the label, whereas $z_t$ and $a_t$ act like the features!) In other words, the mean reward changes smoothly over the context-arm space. This
smoothness is precisely what allows us to interpolate from past situations to the present situation. Perhaps we have seen many
contexts that are similar to the current context (but not exactly the same), and we know what arms were shown in previous

contexts. We can use this information to estimate rewards, even for a new arm and a new context we have never seen before, as
long as they are similar to past cases.
(Note that if the set of arms is finite, then we need the rewards only to be Lipschitz with respect to the context, and we
can look at the past history of that particular arm. In fact, there are many variations of this problem!)
The performance is now measured by the contextual regret:
\[
R_n := \sum_{t=1}^{n}\left(\max_{a\in A} f(z_t, a) - f(z_t, a_t)\right). \tag{8}
\]

This is how much you lost in rewards from choosing the arm at in context zt , rather than choosing the best arm for that
context.
The UCB algorithm can be altered to solve contextual bandit problems with slight modification in order to accommodate
context. To do this, we can create parametric models that estimate f (zt , at ) with confidence intervals. You can use almost
any kind of model to estimate f as long as it has an upper confidence bound. For instance, you can use decision trees that
group the past contexts and arms we have seen so far into leaves of a tree. Then you can calculate the upper confidence bound
of rewards that fall into that leaf. To use UCB, you just need to be able to compute upper confidence bounds for the value of
f at any given context point zt .
Let us provide a simple contextual bandit algorithm that iteratively partitions the context and arm space into leaves of a
tree. Over iterations, we maintain a partition Pt of the context-arm space, and define a mean and confidence with respect to
each “leaf” in this partition. This partition could be learned using decision tree splitting, but one could also define it in other
ways. We would typically have the partition grow finer and finer as we gather more information about contexts, arms and
rewards. In the figure below, we show an example of a partition that we could have constructed using a decision tree. Each
number in the figure represents a trial, where the number’s position on the plot represents the context and arm that was pulled,
and the value (e.g., 4,5,6,7,8,9) is the reward that was obtained after pulling the arm.

Given partition Pt , function pt : C × A → Pt is called a Region Selection Function with respect to Pt if for any
(z, a) ∈ C × A, pt (z, a) is the region in Pt containing (z, a). In other words the Region Selection Function tells us which
region (“bin”) a given context-arm pair is in. With this function defined, we now define the mean estimate and confidence
bound in each partition bin. Let {(z1 , a1 ), x1 , (z2 , a2 ), x2 , · · · , (zt , at ), xt } be the (context-arm, reward) observations up to
time t. Define function nt to be the count of points (each corresponding to a historical arm pull) in the same bin as context-arm
pair (z, a) (but if there are no points, we count that as 1):
\[
n_t(z,a) = \max\left\{1,\ \sum_{s=1}^{t} \mathbf{1}_{\{(z_s,a_s)\in p_t(z,a)\}}\right\}. \tag{9}
\]

Define mt as the mean of the rewards in the bin:


\[
m_t(z,a) = \begin{cases} \dfrac{\sum_{s=1}^{t} x_s\, \mathbf{1}_{\{(z_s,a_s)\in p_t(z,a)\}}}{n_t(z,a)}, & \text{if } \sum_{s=1}^{t} \mathbf{1}_{\{(z_s,a_s)\in p_t(z,a)\}} > 0;\\[2mm] \infty, & \text{otherwise (no observations in bin } p_t(z,a)\text{)}. \end{cases} \tag{10}
\]

Now we can define the “UCB index” (that is, the upper confidence bound) for this problem: ∀(z, a) ∈ C × A,
\[
I_t(z,a) = m_t(z,a) + C_{UCB}\sqrt{\frac{\log t}{n_t(z,a)}} + D_L(p_t(z,a)), \tag{11}
\]

where $C_{UCB}$ is a parameter that scales the upper confidence bound and thus controls how much we explore. Here, $D_L(p_t(z,a))$ describes how large the region $p_t(z,a)$ is: $D_L(p_t(z,a)) := L \max_{w,w' \in p_t(z,a)} \|w - w'\|_2$. This last term is useful because the reward is smooth (Lipschitz). As long as the reward is smooth and the region is not too large, the function f must stay relatively constant in the region $p_t(z,a)$. In that case, we have a tighter UCB because we are more confident that our estimate of the mean reward represents all of the points in the bin. However, if the region is large, then even if we have estimated the mean and its UCB correctly, there could be significant variation in the mean reward f(z, a) over this large bin. In that case, we raise the upper confidence bound based on the diameter of the bin. This strategy is summarized in Algorithm 4 below. Please see Wang et al. (2020) (and references therein) for more information on (contextual) Lipschitz bandits.

Algorithm 4: UCB-Tree algorithm

Input: number of rounds n, arms A, exploration parameter $C_{UCB}$, Lipschitz constant L, tree fitting (partition maintenance) rule R.
for t = 1 to n do
    Observe context $z_t$;
    Compute $m_t$ and $n_t$ using (10) and (9);
    Play the arm $a_t$ with the highest upper confidence bound index,
    \[
    a_t = \arg\max_{a \in A}\ \left[ m_t(z_t, a) + C_{UCB}\sqrt{\frac{\log t}{n_t(z_t, a)}} + D_L(p_t(z_t, a)) \right];
    \]
    Get reward $x_t$;
    Update the partition $P_t$ using rule R.
end

When we update the partition, we create more bins by splitting existing ones. We split a bin when we have enough data that we would get a sufficiently good estimate of the mean reward in each new bin after the split. Since we will mostly be choosing bins with very high mean reward, this sequential splitting will let us zoom in on the arms that have the highest rewards for each context.
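As a rough illustration, here is a Python sketch of the index (11) with the simplest possible partition rule R: a fixed uniform grid over $C \times A$ with one context dimension and one arm dimension, and no splitting at all. The grid rule, the function names, and the candidate-arm discretization are our own simplifications, not the tree-based rule studied in Wang et al. (2020).

```python
import numpy as np

def ucb_grid_contextual(env_reward, n, bins=5, c_ucb=1.0, lip=1.0,
                        rng=np.random.default_rng(0)):
    """Sketch of the UCB index (11) on a fixed grid partition of [0,1] x [0,1].

    env_reward(z, a) returns a noisy reward; the bins*bins grid cells play the
    role of the leaves p_t(z, a), but the partition never changes.
    """
    counts = np.zeros((bins, bins))            # n_t per cell (0 means no observations)
    sums = np.zeros((bins, bins))              # running sum of rewards per cell
    arm_grid = (np.arange(bins) + 0.5) / bins  # one candidate arm per arm-cell
    diam = lip * np.sqrt(2.0) / bins           # D_L: Lipschitz constant times cell diameter

    def cell(x):
        return min(int(x * bins), bins - 1)

    for t in range(1, n + 1):
        z = rng.random()                       # context revealed by the environment
        zi = cell(z)
        best_a, best_index = arm_grid[0], -np.inf
        for a in arm_grid:
            ai = cell(a)
            if counts[zi, ai] == 0:
                index = np.inf                 # m_t = infinity for empty cells, as in (10)
            else:
                mean = sums[zi, ai] / counts[zi, ai]
                index = mean + c_ucb * np.sqrt(np.log(t) / counts[zi, ai]) + diam
            if index > best_index:
                best_a, best_index = a, index
        x = env_reward(z, best_a)              # observe the (noisy) reward
        counts[zi, cell(best_a)] += 1
        sums[zi, cell(best_a)] += x
    return counts, sums
```

For instance, one could call this with a hypothetical smooth reward such as `env_reward = lambda z, a: 1.0 - (z - a) ** 2 + 0.1 * np.random.randn()`.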

5 Other types of bandit problems


There are a huge variety of bandit problems. For instance, there are sleeping bandits, where the bandits disappear for a while
and then reappear (think about an online sale that appears and disappears), there are mortal bandits where the bandits appear
at various times and disappear and never come back! (Here, you can think about news articles.) There are bandits where the
rewards are delayed (so you’re playing blindly for a while when you start playing a new arm). There are bandits where arms
lock for a period of time, so that if you choose an arm, you can’t change it for a while (think about pricing items online where
you are not allowed to change the price too often or it would frighten the customers away).
Some of the most interesting MAB problems involve non-stationary time series, where the expected reward of an arm
changes over time. This happens a lot in reality, for instance, demand for many products has a weekly or yearly cycle, with a
spike in demand for Christmas!

A Regret-bound of the ε-greedy algorithm


The regret at round n is given by
\[
R_n = \sum_{t=1}^{n}\sum_{j=1}^{m} \Delta_j\,\mathbf{1}_{\{I_t=j\}},
\]

where 1{It =j} is an indicator function equal to 1 if arm j is played at time t (otherwise its value is 0) and ∆j = µ∗ − µj
is the difference between the mean of the best arm’s reward distribution and the mean of the j ’s arm reward distribution. By
taking the expectation we have that
\[
\mathbb{E}[R_n] = \sum_{t=1}^{n}\sum_{j=1}^{m} \Delta_j\,\mathbb{P}(I_t = j),
\]

which can be rewritten as


\[
\mathbb{E}[R_n] = \sum_{t=1}^{n}\sum_{j=1}^{m} \Delta_j\left[\frac{1}{m}\varepsilon_t + (1-\varepsilon_t)\,\mathbb{P}\!\left(\hat{X}_{j,T_j(t-1)} \ge \hat{X}_{i,T_i(t-1)}\ \forall i\right)\right], \tag{12}
\]

where $\hat{X}_{i,T_i(t-1)}$ is the estimated mean for arm i after it has been chosen $T_i(t-1)$ times up to time $t-1$. The first term is the probability that we choose arm j by exploring: we explore with probability $\varepsilon_t$, and if we explore, we choose j at random, that is, with probability $1/m$. If we choose j while exploiting, which happens with probability $1-\varepsilon_t$, then its average reward is above that of all the other arms.
For this proof, we assume the rewards are bounded, say between 0 and 1. If they are bounded by something bigger than
1, we would have an extra constant scaling factor in the theorem.
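For reference, the Chernoff-Hoeffding bound used below says that for i.i.d. rewards in [0, 1] with mean $\mu_j$, the empirical mean $\hat{X}_{j,s}$ after s pulls satisfies
\[
\mathbb{P}\big(\hat{X}_{j,s} \ge \mu_j + \epsilon\big) \le e^{-2s\epsilon^2}
\quad\text{and}\quad
\mathbb{P}\big(\hat{X}_{j,s} \le \mu_j - \epsilon\big) \le e^{-2s\epsilon^2};
\]
Step 2 applies the first inequality with $\epsilon = \Delta_j/2$, which produces the factor $e^{-\frac{\Delta_j^2}{2}s}$ in (15).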

STEP 1: Conditions when we think arm j is the best at time t. If we think arm j is the best at time t, then either we
overestimated its mean reward, or we underestimated the reward of the best arm, which is called arm ∗. If neither of those
things occurred, arm j ’s rewards would have been below those of arm ∗ and thus we would not think that arm j is the best
when it isn’t. In the first inequality below, we consider the probability arm j has average reward above all the other arms, and
this is less than the probability that arm j has reward greater than just one of those arms (in particular, arm ∗).
\[
\mathbb{P}\!\left(\hat{X}_{j,T_j(t-1)} \ge \hat{X}_{i,T_i(t-1)}\ \forall i\right) \le \mathbb{P}\!\left(\hat{X}_{j,T_j(t-1)} \ge \hat{X}_{*,T_*(t-1)}\right) \le \mathbb{P}\!\left(\hat{X}_{j,T_j(t-1)} \ge \mu_j + \frac{\Delta_j}{2}\right) + \mathbb{P}\!\left(\hat{X}_{*,T_*(t-1)} \le \mu^* - \frac{\Delta_j}{2}\right), \tag{13}
\]
where the last inequality follows from the fact that either we must have underestimated arm ∗ or overestimated arm j:
\[
\left\{\hat{X}_{j,T_j(t-1)} \ge \hat{X}_{*,T_*(t-1)}\right\} \subset \left\{\hat{X}_{*,T_*(t-1)} \le \mu^* - \frac{\Delta_j}{2}\right\} \cup \left\{\hat{X}_{j,T_j(t-1)} \ge \mu_j + \frac{\Delta_j}{2}\right\}.
\]
Aside: To show this, suppose that there exists an outcome $\omega \in \{\hat{X}_{j,T_j(t-1)} \ge \hat{X}_{*,T_*(t-1)}\}$ that does not belong to $\{\hat{X}_{*,T_*(t-1)} \le \mu^* - \frac{\Delta_j}{2}\} \cup \{\hat{X}_{j,T_j(t-1)} \ge \mu_j + \frac{\Delta_j}{2}\}$. Then we would have that
\[
\omega \in \left(\left\{\hat{X}_{*,T_*(t-1)} \le \mu^* - \frac{\Delta_j}{2}\right\} \cup \left\{\hat{X}_{j,T_j(t-1)} \ge \mu_j + \frac{\Delta_j}{2}\right\}\right)^{C} = \left\{\hat{X}_{*,T_*(t-1)} > \mu^* - \frac{\Delta_j}{2}\right\} \cap \left\{\hat{X}_{j,T_j(t-1)} < \mu_j + \frac{\Delta_j}{2}\right\}, \tag{14}
\]
but from the intersection of events in (14) it follows that $\hat{X}_{*,T_*(t-1)} > \mu^* - \frac{\Delta_j}{2} = \mu_j + \frac{\Delta_j}{2} > \hat{X}_{j,T_j(t-1)}$, which contradicts $\omega \in \{\hat{X}_{j,T_j(t-1)} \ge \hat{X}_{*,T_*(t-1)}\}$. Therefore, every event where $\hat{X}_{j,T_j(t-1)} \ge \hat{X}_{*,T_*(t-1)}$ belongs to the set of events where
\[
\left\{\hat{X}_{*,T_*(t-1)} \le \mu^* - \frac{\Delta_j}{2}\right\} \cup \left\{\hat{X}_{j,T_j(t-1)} \ge \mu_j + \frac{\Delta_j}{2}\right\}.
\]

STEP 2: Let us bound the probability of overestimating sub-optimal arm j at time t. Let us consider the first term of
(13). The computations for the second term are basically identical.
\[
\begin{aligned}
\mathbb{P}\!\left(\hat{X}_{j,T_j(t-1)} \ge \mu_j + \frac{\Delta_j}{2}\right) &= \sum_{s=1}^{t-1} \mathbb{P}\!\left(T_j(t-1) = s,\ \hat{X}_{j,s} \ge \mu_j + \frac{\Delta_j}{2}\right)\\
&= \sum_{s=1}^{t-1} \mathbb{P}\!\left(T_j(t-1) = s \,\middle|\, \hat{X}_{j,s} \ge \mu_j + \frac{\Delta_j}{2}\right) \mathbb{P}\!\left(\hat{X}_{j,s} \ge \mu_j + \frac{\Delta_j}{2}\right)\\
&\le \sum_{s=1}^{t-1} \mathbb{P}\!\left(T_j(t-1) = s \,\middle|\, \hat{X}_{j,s} \ge \mu_j + \frac{\Delta_j}{2}\right) e^{-\frac{\Delta_j^2}{2}s}, \qquad (15)
\end{aligned}
\]

where in the last inequality we used the Chernoff-Hoeffding bound. The second term will be small when s is large, so that
term will be sufficient to handle whatever the first term brings when s is large. When s is small, the first term could be
problematic since it will be large. We are going to separate this sum into large s and small s and handle them separately.
Here, small s means less than x0 , where we define it as:
\[
x_0 := \frac{1}{2m}\sum_{s=1}^{t} \varepsilon_s.
\]

Then
\[
(15) \le \sum_{s=1}^{\lfloor x_0\rfloor} \mathbb{P}\!\left(T_j(t-1) = s \,\middle|\, \hat{X}_{j,s} \ge \mu_j + \frac{\Delta_j}{2}\right)\cdot 1 + \sum_{s=\lfloor x_0\rfloor+1}^{t-1} 1\cdot e^{-\frac{\Delta_j^2}{2}s}.
\]

Here, we split the sum into two pieces and bounded one of the terms by 1.
Let us work on the second term. We will now use the fact that $\sum_{s=\lfloor x_0\rfloor+1}^{\infty} e^{-bs} \le \frac{1}{b}e^{-b\lfloor x_0\rfloor}$, where in our case $b = \frac{\Delta_j^2}{2}$:
\[
(15) \le \sum_{s=1}^{\lfloor x_0\rfloor} \mathbb{P}\!\left(T_j(t-1) = s \,\middle|\, \hat{X}_{j,s} \ge \mu_j + \frac{\Delta_j}{2}\right) + \frac{2}{\Delta_j^2}\, e^{-\frac{\Delta_j^2}{2}\lfloor x_0\rfloor}.
\]

Now comes a trick. Let us define $T_j^R(t-1)$ as the number of times arm j is played when we are performing exploration. Note that $T_j^R(t-1) \le T_j(t-1)$ and that $T_j^R(t-1) = \sum_{s=1}^{t-1} B_s$, where $B_s$ is a Bernoulli random variable with parameter $\varepsilon_s/m$ (this is the probability that we explore times the probability that we choose arm j when exploring, i.e., $\varepsilon_s$ times $1/m$). When $T_j(t-1) = s$, $T_j^R(t-1)$ equals some value at most s, but we don't know which one. Luckily we are constructing upper bounds, so we can replace the event $\{T_j(t-1) = s\}$ by the larger event $\{T_j^R(t-1) \le s\}$:
\[
(15) \le \sum_{s=1}^{\lfloor x_0\rfloor} \mathbb{P}\!\left(T_j^R(t-1) \le s \,\middle|\, \hat{X}_{j,s} \ge \mu_j + \frac{\Delta_j}{2}\right) + \frac{2}{\Delta_j^2}\, e^{-\frac{\Delta_j^2}{2}\lfloor x_0\rfloor}.
\]

Now things are good, since the number of times we explore to choose arm j , TjR (t − 1), does not depend on the estimate of
the mean for arm j . The number of terms in the sum is ⌊x0 ⌋:

\[
(15) \le \lfloor x_0\rfloor\, \mathbb{P}\!\left(T_j^R(t-1) \le \lfloor x_0\rfloor\right) + \frac{2}{\Delta_j^2}\, e^{-\frac{\Delta_j^2}{2}\lfloor x_0\rfloor}. \tag{16}
\]
Recall that $T_j^R(t-1) = \sum_{s=1}^{t-1} B_s$ where the $B_s$ are independent Bernoulli random variables with $\mathbb{P}(B_s = 1) = \frac{\varepsilon_s}{m}$. The Bernstein inequality states, for (independent) Bernoulli random variables,
\[
\mathbb{P}\!\left(\sum_{s=1}^{t-1} B_s \le \mathbb{E}\!\left[\sum_{s=1}^{t-1} B_s\right] - a\right) \le \exp\!\left(-\frac{\frac{1}{2}a^2}{\sum_{s=1}^{t-1}\mathrm{Var}(B_s) + \frac{1}{3}a}\right).
\]

Also, we have (using the formula for the variance of a Bernoulli random variable):
\[
\mathrm{Var}(B_s) = \frac{\varepsilon_s}{m}\left(1 - \frac{\varepsilon_s}{m}\right) \le \frac{\varepsilon_s}{m}. \tag{17}
\]
Thus, applying Bernstein's inequality to the Bernoulli random variables $B_s$ with $a = x_0 = \frac{1}{2}\mathbb{E}\big[T_j^R(t-1)\big]$ gives
\[
\begin{aligned}
\mathbb{P}\big(T_j^R(t-1) \le \lfloor x_0\rfloor\big) &\le \mathbb{P}\big(T_j^R(t-1) \le x_0\big)\\
&= \mathbb{P}\!\left(T_j^R(t-1) \le \mathbb{E}\big[T_j^R(t-1)\big] - \tfrac{1}{2}\mathbb{E}\big[T_j^R(t-1)\big]\right)\\
&\le \exp\!\left(-\frac{\tfrac{1}{8}\big(\mathbb{E}[T_j^R(t-1)]\big)^2}{\sum_{s=1}^{t-1}\mathrm{Var}(B_s) + \tfrac{1}{6}\mathbb{E}[T_j^R(t-1)]}\right)\\
&\le \exp\!\left(-\frac{\tfrac{1}{8}\big(\mathbb{E}[T_j^R(t-1)]\big)^2}{\sum_{s=1}^{t-1}\tfrac{\varepsilon_s}{m} + \tfrac{1}{6}\mathbb{E}[T_j^R(t-1)]}\right) && \text{(by Eq. (17))}\\
&= \exp\!\left(-\frac{\tfrac{1}{8}\big(\mathbb{E}[T_j^R(t-1)]\big)^2}{\mathbb{E}[T_j^R(t-1)] + \tfrac{1}{6}\mathbb{E}[T_j^R(t-1)]}\right) && \left(\text{because } \mathbb{E}[T_j^R(t-1)] = \textstyle\sum_{s=1}^{t-1}\tfrac{\varepsilon_s}{m}\right)\\
&= \exp\!\left(-\tfrac{6}{7}\cdot\tfrac{1}{8}\,\mathbb{E}[T_j^R(t-1)]\right) = \exp\!\left(-\tfrac{3}{7}\cdot\tfrac{1}{2}\,x_0\right)\\
&\le \exp\!\left(-\tfrac{1}{5}\,x_0\right). && (18)
\end{aligned}
\]

STEP 3: To upper bound (18), let us find a lower bound on ⌊x0 ⌋. Let us define n′ = ⌊km⌋ + 1 (where k was defined in
the algorithm statement, remember that it is more than 10), then
\[
\begin{aligned}
x_0 &= \frac{1}{2m}\sum_{s=1}^{t}\varepsilon_s = \frac{1}{2m}\sum_{s=1}^{t}\min\left\{1, \frac{km}{s}\right\}\\
&= \frac{1}{2m}\sum_{s=1}^{n'} 1 + \frac{km}{2m}\sum_{s=n'+1}^{t}\frac{1}{s}\\
&= \frac{n'}{2m} + \frac{k}{2}\left(\sum_{s=1}^{t}\frac{1}{s} - \sum_{s=1}^{n'}\frac{1}{s}\right).
\end{aligned}
\]
Here we will use some properties of harmonic sums, namely $\sum_{s=1}^{n}\frac{1}{s} \le \log n + 1$ and $\sum_{s=1}^{n}\frac{1}{s} > \int_{1}^{n+1}\frac{1}{s}\,ds = \log(n+1)$.
Continuing from the previous line,
\[
\begin{aligned}
x_0 &\ge \frac{n'}{2m} + \frac{k}{2}\Big(\log(t+1) - \big(\log(n') + \log(e)\big)\Big)\\
&= \frac{k}{2}\cdot\frac{n'}{mk} + \frac{k}{2}\log\!\left(\frac{t+1}{n'e}\right)\\
&\ge \frac{k}{2}\log\!\left(\frac{n'}{mk}\right) + \frac{k}{2}\log\!\left(\frac{t}{n'e}\right) && \text{(because } \log x \le x\text{)}\\
&= \frac{k}{2}\log\!\left(\frac{t}{mke}\right). && (19)
\end{aligned}
\]
Using (19) combined with (18) in (16), we get the following:
\[
\begin{aligned}
(15) &\le \lfloor x_0\rfloor\, \mathbb{P}\!\left(T_j^R(t-1) \le \lfloor x_0\rfloor\right) + \frac{2}{\Delta_j^2}\, e^{-\frac{\Delta_j^2}{2}\lfloor x_0\rfloor} && \text{(copying (16))}\\
&\le \lfloor x_0\rfloor \exp\!\left(-\tfrac{1}{5}x_0\right) + \frac{2}{\Delta_j^2}\, e^{-\frac{\Delta_j^2}{2}\lfloor x_0\rfloor} && \text{(from (18))}\\
&\le x_0 \exp\!\left(-\tfrac{1}{5}x_0\right) + \frac{2}{\Delta_j^2}\, e^{-\frac{\Delta_j^2}{2}(x_0-1)}\\
&= x_0 \exp\!\left(-\tfrac{1}{5}x_0\right) + \frac{2}{\Delta_j^2}\, e^{\frac{\Delta_j^2}{2}}\, e^{-\frac{\Delta_j^2}{2}x_0}\\
&\le x_0 \exp\!\left(-\tfrac{1}{5}x_0\right) + \frac{2}{\Delta_j^2}\, e^{\frac{1}{2}}\, e^{-\frac{\Delta_j^2}{2}x_0} && \text{(since } \Delta_j \in [0,1]\text{)}\\
&\le x_0 \exp\!\left(-\tfrac{1}{5}x_0\right) + \frac{2e^{\frac{1}{2}}}{\Delta_j^2}\left(\frac{t}{mke}\right)^{-\frac{k\Delta_j^2}{4}} && \text{(from (19)).} \qquad (20)
\end{aligned}
\]
Next, from the first-order derivative test we know that the function $x_0 \exp(-\frac{1}{5}x_0)$ is decreasing on $[5, \infty)$. Thus, when $\frac{k}{2}\log\!\left(\frac{t}{mke}\right) \ge 5$ (and from (19) we have $x_0 \ge \frac{k}{2}\log\!\left(\frac{t}{mke}\right)$), plugging $\frac{k}{2}\log\!\left(\frac{t}{mke}\right)$ in for $x_0$ in the expression $x_0\exp(-\frac{1}{5}x_0)$ gives
\[
x_0 \exp\!\left(-\frac{1}{5}x_0\right) \le \frac{k}{2}\left(\frac{t}{mke}\right)^{-\frac{k}{10}}\log\!\left(\frac{t}{mke}\right). \tag{21}
\]
Since we choose $k \ge \max\{10,\ \frac{4}{\min_j \Delta_j^2}\}$, we know $\frac{k}{2} \ge 5$. Thus $t \ge e^2 km$ is sufficient to ensure $x_0 \ge \frac{k}{2}\log\!\left(\frac{t}{mke}\right) \ge 5$.

Combining the above results (20) and (21) gives: when $t \ge e^2 km$,
\[
\text{the first term in (13)} = \text{the left side of (15)} \le \frac{k}{2}\left(\frac{t}{mke}\right)^{-\frac{k}{10}}\log\!\left(\frac{t}{mke}\right) + \frac{2e^{\frac{1}{2}}}{\Delta_j^2}\left(\frac{t}{mke}\right)^{-\frac{k\Delta_j^2}{4}}. \tag{22}
\]

STEP 4: Let us bound the probability of underestimating the best arm ∗ at time t. The computations for the second term in (13) are essentially identical, so doubling the bound in (22) (that is, removing the 1/2 factor) gives the following bound on $\mathbb{P}\big(\hat{X}_{j,T_j(t-1)} \ge \hat{X}_{i,T_i(t-1)}\ \forall i\big)$ (when $t \ge e^2 km$):
\[
\beta_j(t) = k\left(\frac{t}{mke}\right)^{-\frac{k}{10}}\log\!\left(\frac{t}{mke}\right) + \frac{4e^{\frac{1}{2}}}{\Delta_j^2}\left(\frac{t}{mke}\right)^{-\frac{k\Delta_j^2}{4}}. \tag{23}
\]
STEP 5: Let us bound the probability of playing suboptimal arm j. We now have an upper bound for $\mathbb{P}\big(\hat{X}_{j,T_j(t-1)} \ge \hat{X}_{i,T_i(t-1)}\ \forall i\big)$, the left-hand side of (13). We will plug this into (12), which yields the following bound on the mean regret at time n. First, we just split the sum over t in (12) into two parts.
\[
\begin{aligned}
\mathbb{E}[R_n] \le{}& \sum_{t=1}^{\lfloor ekm\rfloor}\sum_{j=1}^{m}\Delta_j\left[\frac{1}{m}\varepsilon_t + (1-\varepsilon_t)\,\mathbb{P}\!\left(\hat{X}_{j,T_j(t-1)} \ge \hat{X}_{i,T_i(t-1)}\ \forall i\right)\right]\\
&+ \sum_{t=\lfloor ekm\rfloor+1}^{n}\sum_{j=1}^{m}\Delta_j\left[\frac{1}{m}\varepsilon_t + (1-\varepsilon_t)\,\mathbb{P}\!\left(\hat{X}_{j,T_j(t-1)} \ge \hat{X}_{i,T_i(t-1)}\ \forall i\right)\right]. \qquad (24)
\end{aligned}
\]

The first term has an upper bound, since all probabilities are at most 1, and all ∆j ’s are at most 1:
\[
\begin{aligned}
\sum_{t=1}^{\lfloor ekm\rfloor}\sum_{j=1}^{m}\Delta_j\left[\frac{1}{m}\varepsilon_t + (1-\varepsilon_t)\,\mathbb{P}\!\left(\hat{X}_{j,T_j(t-1)} \ge \hat{X}_{i,T_i(t-1)}\ \forall i\right)\right]
&\le \sum_{t=1}^{\lfloor ekm\rfloor}\sum_{j=1}^{m}\left[1\cdot\frac{1}{m}\varepsilon_t + (1-\varepsilon_t)\cdot 1\right]\\
&\le \sum_{t=1}^{\lfloor ekm\rfloor}\sum_{j=1}^{m} 1\\
&\le ekm^2. \qquad (25)
\end{aligned}
\]

Thus, substituting into (24) the upper bound (25) for the first term and the upper bound (23) for the probability in the second term, we obtain
\[
\mathbb{E}[R_n] \le ekm^2 + \sum_{t=\lfloor e^2 km\rfloor+1}^{n}\ \sum_{j:\mu_j<\mu^*}\Delta_j\left[\frac{1}{m}\varepsilon_t + (1-\varepsilon_t)\,\beta_j(t)\right].
\]

This proves the theorem.

B The regret bound of the UCB algorithm


The regret at round n is given by

\[
R_n = \sum_{j=1}^{m}\Delta_j + \sum_{t=m+1}^{n}\sum_{j=1}^{m}\Delta_j\,\mathbf{1}_{\{I_t=j\}}.
\]

The expected regret E[Rn ] at round n is bounded by


\[
\mathbb{E}[R_n] \le \sum_{j=1}^{m}\Delta_j + \sum_{j=1}^{m}\Delta_j\,\mathbb{E}\big[T_j(n)\big], \tag{26}
\]
where $T_j(n) = \sum_{t=1}^{n}\mathbf{1}_{\{I_t=j\}}$ is the number of times arm j has been chosen up to round n. Recall that

\[
\hat{X}_{j,t} = \frac{1}{T_j(t-1)}\sum_{s=1}^{T_j(t-1)} X_j(s).
\]

Let’s suppose the rewards are bounded, say between 0 and 1.

STEP 1: Let us bound the probability of overestimating or underestimating suboptimal arm j .


From the Chernoff-Hoeffding Inequality we have that
\[
\mathbb{P}\!\left(\frac{1}{T_j(t-1)}\sum_{i=1}^{T_j(t-1)} X_j(i) - \mu_j \le -\varepsilon\right) \le \exp\{-2T_j(t-1)\varepsilon^2\},
\]
and
\[
\mathbb{P}\!\left(\frac{1}{T_j(t-1)}\sum_{i=1}^{T_j(t-1)} X_j(i) - \mu_j \ge \varepsilon\right) \le \exp\{-2T_j(t-1)\varepsilon^2\}.
\]
By selecting $\varepsilon = \sqrt{\frac{2\log(t)}{T_j(t-1)}}$ (so that $\exp\{-2T_j(t-1)\varepsilon^2\} = \exp\{-4\log(t)\} = t^{-4}$), we have
\[
\mathbb{P}\!\left(\hat{X}_{j,t} + \sqrt{\frac{2\log(t)}{T_j(t-1)}} \le \mu_j\right) \le t^{-4}, \tag{27}
\]
and
\[
\mathbb{P}\!\left(\hat{X}_{j,t} - \sqrt{\frac{2\log(t)}{T_j(t-1)}} \ge \mu_j\right) \le t^{-4}. \tag{28}
\]
STEP 2: Let us bound the number of times we play arm j .
For each t, we consider the events such that the UCB of j is higher than that of ∗. These are events that could potentially
happen due to the randomness in the draws of each arm at each time until t.
\[
\left\{\hat{X}_{j,T_j(t-1)} + \sqrt{\frac{2\log(t)}{T_j(t-1)}} \ge \hat{X}_{*,T_*(t-1)} + \sqrt{\frac{2\log(t)}{T_*(t-1)}},\ T_j(t-1) \ge u\right\} \subset \left\{\max_{s_j\in\{u,\dots,T_j(t-1)\}}\left[\hat{X}_{j,s_j} + \sqrt{\frac{2\log(t)}{s_j}}\right] \ge \min_{s_*\in\{1,\dots,T_*(t-1)\}}\left[\hat{X}_{*,s_*} + \sqrt{\frac{2\log(t)}{s_*}}\right]\right\}. \tag{29}
\]

Events on both the left and right sides of (29) are included in
\[
\bigcup_{s_*=1}^{T_*(t-1)}\ \bigcup_{s_j=u}^{T_j(t-1)}\left\{\hat{X}_{j,s_j} + \sqrt{\frac{2\log(t)}{s_j}} \ge \hat{X}_{*,s_*} + \sqrt{\frac{2\log(t)}{s_*}}\right\}. \tag{30}
\]

Thus, for any integer u, we may write


\[
\begin{aligned}
T_j(n) &= 1 + \sum_{t=m+1}^{n}\mathbf{1}\{I_t = j\} && \text{(play arm } j \text{ once during the starting phase)}\\
&\le u + \sum_{t=m+1}^{n}\mathbf{1}\{I_t = j,\ T_j(t-1) \ge u\} && \text{(split terms to separate out the first } u \text{ turns)}\\
&\le u + \sum_{t=m+1}^{n}\mathbf{1}\left\{\hat{X}_{j,T_j(t-1)} + \sqrt{\tfrac{2\log(t)}{T_j(t-1)}} \ge \hat{X}_{*,T_*(t-1)} + \sqrt{\tfrac{2\log(t)}{T_*(t-1)}},\ T_j(t-1)\ge u\right\} && (31)\\
& && \text{(play arm } j \text{ when its UCB is the highest)}\\
&\le u + \sum_{t=m+1}^{n}\mathbf{1}\left\{\max_{s_j\in\{u,\dots,T_j(t-1)\}}\left[\hat{X}_{j,s_j} + \sqrt{\tfrac{2\log(t)}{s_j}}\right] \ge \min_{s_*\in\{1,\dots,T_*(t-1)\}}\left[\hat{X}_{*,s_*} + \sqrt{\tfrac{2\log(t)}{s_*}}\right]\right\} && \text{(from (29))}\\
&\le u + \sum_{t=m+1}^{n}\sum_{s_*=1}^{T_*(t-1)}\sum_{s_j=u}^{T_j(t-1)}\mathbf{1}\left\{\hat{X}_{j,s_j} + \sqrt{\tfrac{2\log(t)}{s_j}} \ge \hat{X}_{*,s_*} + \sqrt{\tfrac{2\log(t)}{s_*}}\right\} && \text{(from (30)).} \qquad (32)
\end{aligned}
\]

STEP 3: Let us rewrite the event of playing arm j as a subset of the union of underestimating arm ∗ or overestimating arm j. When
\[
\mathbf{1}\left\{\hat{X}_{j,t} + \sqrt{\frac{2\log(t)}{T_j(t-1)}} \ge \hat{X}_{*,t} + \sqrt{\frac{2\log(t)}{T_*(t-1)}}\right\} \qquad \text{(when we play arm } j\text{)} \tag{33}
\]
is equal to one, at least one of the following has to be true:
\[
\hat{X}_{*,t} \le \mu^* - \sqrt{\frac{2\log(t)}{T_*(t-1)}}; \qquad \text{(we underestimated arm ∗)} \tag{34}
\]
\[
\hat{X}_{j,t} \ge \mu_j + \sqrt{\frac{2\log(t)}{T_j(t-1)}}; \qquad \text{(we overestimated arm } j\text{)} \tag{35}
\]
\[
\mu^* < \mu_j + 2\sqrt{\frac{2\log(t)}{T_j(t-1)}}. \qquad \text{(arm } j\text{'s UCB is higher than } \mu^*\text{)} \tag{36}
\]
To prove this, suppose none of them hold. From the negation of (34) we would have $\hat{X}_{*,t} > \mu^* - \sqrt{\frac{2\log(t)}{T_*(t-1)}}$; then, applying (36) with the inequality reversed (since we are assuming it does not hold), we get $\hat{X}_{*,t} > \mu_j + 2\sqrt{\frac{2\log(t)}{T_j(t-1)}} - \sqrt{\frac{2\log(t)}{T_*(t-1)}}$; and then from (35) (again with the inequality reversed) it follows that $\hat{X}_{*,t} > \hat{X}_{j,t} + \sqrt{\frac{2\log(t)}{T_j(t-1)}} - \sqrt{\frac{2\log(t)}{T_*(t-1)}}$, which contradicts (33). Now, if we set $u = \left\lceil \frac{8}{\Delta_j^2}\log(t) \right\rceil$, then for $T_j(t-1) \ge u$ we have seen arm j enough times that $2\sqrt{\frac{2\log(t)}{T_j(t-1)}} \le \Delta_j$, which means (36) cannot hold, as shown below:
µ∗ − µj − 2 ≥ µ∗ − µj − 2
Tj (t − 1) u
(
) 2 log(t)
)
= µ ∗ − µ j − 2* M N
8
∆j2 log (t)
(
)
) 2 log(t)
≥ µ ∗ − µ j − 2* 8
∆2
log (t)
j

≥ µ∗ − µj − ∆j = 0,

therefore, with this choice of u, (36) cannot hold. So either (34) or (35) is true if we play arm j instead of arm ∗.

STEP 4: Let us bound the expected number of times we play arm j .


Using (32) and Step 3, we have that
\[
\begin{aligned}
T_j(n) \le{}& \left\lceil\frac{8}{\Delta_j^2}\log(n)\right\rceil && \text{(this is } u \text{ in (32))}\\
&+ \sum_{t=m+1}^{n}\sum_{s_*=1}^{T_*(t-1)}\sum_{s_j=u}^{T_j(t-1)}\mathbf{1}\left\{\hat{X}_{*,s_*} \le \mu^* - \sqrt{\frac{2\log(t)}{s_*}}\right\} && \text{(from (34))}\\
&+ \sum_{t=m+1}^{n}\sum_{s_*=1}^{T_*(t-1)}\sum_{s_j=u}^{T_j(t-1)}\mathbf{1}\left\{\hat{X}_{j,s_j} \ge \mu_j + \sqrt{\frac{2\log(t)}{s_j}}\right\} && \text{(from (35))},
\end{aligned}
\]

and by taking the expectation,
\[
\begin{aligned}
\mathbb{E}\big[T_j(n)\big] \le{}& \left\lceil\frac{8}{\Delta_j^2}\log(n)\right\rceil\\
&+ \sum_{t=m+1}^{n}\sum_{s_*=1}^{T_*(t-1)}\sum_{s_j=u}^{T_j(t-1)}\mathbb{P}\!\left(\hat{X}_{*,s_*} \le \mu^* - \sqrt{\frac{2\log(t)}{s_*}}\right)\\
&+ \sum_{t=m+1}^{n}\sum_{s_*=1}^{T_*(t-1)}\sum_{s_j=u}^{T_j(t-1)}\mathbb{P}\!\left(\hat{X}_{j,s_j} \ge \mu_j + \sqrt{\frac{2\log(t)}{s_j}}\right)\\
\le{}& \frac{8}{\Delta_j^2}\log(n) + 1 + 2\sum_{t=m+1}^{n} t^{-4}(t-1-m)^2, \qquad (37)
\end{aligned}
\]

where in the last step we upper bounded $T_*(t-1)$ by $(t-1-m)$ (this is the maximum number of times we could have played the best arm, excluding the starting phase of m rounds), and we similarly bounded $T_j(t-1)$. Therefore, by using (26),
\[
\mathbb{E}[R_n] \le \sum_{j=1}^{m}\Delta_j + \sum_{j:\mu_j<\mu^*}\frac{8}{\Delta_j}\log(n) + \sum_{j=1}^{m}\Delta_j\left(1 + 2\sum_{t=m+1}^{n} t^{-4}(t-1-m)^2\right).
\]

Notice that the quantity in parentheses is bounded by a constant because the terms in the sum decrease rapidly enough in t. We have now proven the theorem.

C Proof of Corollary 3.1


Proof. We will first prove the statement of the corollary for the UCB algorithm. Using previous results, we have
\[
\begin{aligned}
\mathbb{E}[R_n] &= \sum_{j=1}^{m}\Delta_j\,\mathbb{E}\big[T_j(n)\big] && \text{(by Eq. (2))}\\
&= \sum_{j=1}^{m}\Delta_j\sqrt{\mathbb{E}[T_j(n)]}\cdot\sqrt{\mathbb{E}[T_j(n)]}\\
&\le \sqrt{\sum_{j=1}^{m}\Delta_j^2\,\mathbb{E}[T_j(n)]}\cdot\sqrt{\sum_{j=1}^{m}\mathbb{E}[T_j(n)]} && \text{(by the Cauchy-Schwarz inequality)}\\
&\le \sqrt{\sum_{j=1}^{m}\Delta_j^2\,\mathbb{E}[T_j(n)]}\cdot\sqrt{n} && \left(\text{since } \textstyle\sum_{j=1}^{m}\mathbb{E}[T_j(n)] \le n\text{: we cannot make more than } n \text{ arm pulls in } n \text{ rounds}\right)\\
&= \sqrt{\sum_{j:\Delta_j>0}\Delta_j^2\,\mathbb{E}[T_j(n)]\cdot n}\\
&\le \sqrt{\sum_{j:\Delta_j>0}\Delta_j^2\left(\frac{8}{\Delta_j^2}\log(n) + 1 + 2\sum_{t=m+1}^{n} t^{-4}(t-1-m)^2\right)}\;\sqrt{n} && \text{(from (37))}\\
&\le \sqrt{\sum_{j:\Delta_j>0}\left(8\log(n) + \left(1 + 2\sum_{t=m+1}^{n} t^{-4}(t-1-m)^2\right)\Delta_j^2\right)}\;\sqrt{n} && \text{(multiplying } \Delta_j^2 \text{ through)}\\
&\le \sqrt{\sum_{j:\Delta_j>0}\left(8\log(n) + 1 + 2\sum_{t=m+1}^{n} t^{-4}(t-1-m)^2\right)}\;\sqrt{n}. && \text{(since } \Delta_j \le 1\text{)}
\end{aligned}
\]

This is the first inequality of the corollary. On to ε-greedy.

The proof for ε-greedy is very similar, except for a different bound on $\mathbb{E}[T_j(n)]$: from Equations (12) and (23), we have
\[
\begin{aligned}
\mathbb{E}\big[T_j(n)\big] &= \sum_{t=1}^{n}\mathbb{P}(I_t = j)\\
&\le \sum_{t=1}^{n}\left[\frac{1}{m}\varepsilon_t + (1-\varepsilon_t)\,\mathbb{P}\!\left(\hat{X}_{j,T_j(t-1)} \ge \hat{X}_{i,T_i(t-1)}\ \forall i\right)\right]\\
&\le k\left(\frac{t}{mke}\right)^{-\frac{k}{10}}\log\!\left(\frac{t}{mke}\right) + \frac{4e^{\frac{1}{2}}}{\Delta_j^2}\left(\frac{t}{mke}\right)^{-\frac{k\Delta_j^2}{4}}. \qquad (38)
\end{aligned}
\]

Now we can repeat the same argument:


\[
\begin{aligned}
\mathbb{E}[R_n] &= \sum_{j=1}^{m}\Delta_j\,\mathbb{E}\big[T_j(n)\big] && \text{(by Eq. (2))}\\
&= \sum_{j=1}^{m}\Delta_j\sqrt{\mathbb{E}[T_j(n)]}\cdot\sqrt{\mathbb{E}[T_j(n)]}\\
&\le \sqrt{\sum_{j=1}^{m}\Delta_j^2\,\mathbb{E}[T_j(n)]}\cdot\sqrt{\sum_{j=1}^{m}\mathbb{E}[T_j(n)]} && \text{(by the Cauchy-Schwarz inequality)}\\
&\le \sqrt{\sum_{j=1}^{m}\Delta_j^2\,\mathbb{E}[T_j(n)]}\cdot\sqrt{n} && \left(\text{since } \textstyle\sum_{j=1}^{m}\mathbb{E}[T_j(n)] \le n\right)\\
&= \sqrt{\sum_{j:\Delta_j>0}\Delta_j^2\,\mathbb{E}[T_j(n)]\cdot n}\\
&\le \sqrt{\sum_{j:\Delta_j>0}\Delta_j^2\left[k\left(\frac{t}{mke}\right)^{-\frac{k}{10}}\log\!\left(\frac{t}{mke}\right) + \frac{4e^{\frac{1}{2}}}{\Delta_j^2}\left(\frac{t}{mke}\right)^{-\frac{k\Delta_j^2}{4}}\right]}\;\sqrt{n} && \text{(by Eq. (38))}\\
&\le \sqrt{\sum_{j:\Delta_j>0}\left[\Delta_j^2\, k\left(\frac{t}{mke}\right)^{-\frac{k}{10}}\log\!\left(\frac{t}{mke}\right) + 4e^{\frac{1}{2}}\left(\frac{t}{mke}\right)^{-\frac{k\Delta_j^2}{4}}\right]}\;\sqrt{n}\\
&\le \sqrt{\sum_{j:\Delta_j>0}\left[k\left(\frac{t}{mke}\right)^{-\frac{k}{10}}\log\!\left(\frac{t}{mke}\right) + 4e^{\frac{1}{2}}\left(\frac{t}{mke}\right)^{-\frac{k\Delta_j^2}{4}}\right]}\;\sqrt{n}. && \text{(since } \Delta_j \le 1\text{)}
\end{aligned}
\]

We have now proved the second inequality in the corollary.

D Notation summary

• m: number of arms;

• n: number of rounds;

• Xj (t): random reward for playing arm j ;

• µ∗ : mean reward of the optimal arm (µ∗ = max1≤j≤m µj );

• ∆j : difference between the mean reward of the optimal arm and the mean reward of arm j (∆j = µ∗ − µj );

• X̂j : current estimate of µj ;


• It : arm played at turn t;

• Tj (t − 1): number of times arm j has been played before round t starts;

• k: a constant such that $k > \max\{10,\ 4/\min_j \Delta_j^2\}$ in Algorithm 2 (ε-greedy);

• βj (t): upper bound on the probability that suboptimal arm j is considered to be the best arm at round t when using Algorithm 2;

• n′: a particular time, defined as ⌊km⌋ + 1, used in the analysis of the ε-greedy algorithm in Appendix A;

• Rn : total regret at round n.

References
Auer, P., Cesa-Bianchi, N., and Fischer, P. (2002). Finite-time analysis of the multi-armed bandit problem. Machine learning,
47(2-3):235–256.

Bubeck, S. and Cesa-Bianchi, N. (2012). Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning, 5(1):1–122.

Lai, T. L. and Robbins, H. (1985). Asymptotically efficient adaptive allocation rules. Advances in applied mathematics,
6(1):4–22.

Lattimore, T. and Szepesvári, C. (2020). Bandit algorithms. Cambridge University Press.

Slivkins, A. (2019). Introduction to multi-armed bandits. Foundations and Trends in Machine Learning, 12(1-2):1–286.

Wang, T., Ye, W., Geng, D., and Rudin, C. (2020). Towards practical Lipschitz bandits. In FODS.
