
Introduction to Stochastic Multi-Armed Bandits

Cynthia Rudin (with Stefano Tracá and Tianyu Wang)

The name “multi-armed bandit” (MAB) comes from the name of a gambling machine. You can choose one of the arms
(levers) of the machine at each round, and get a reward based on which arm you choose. The rewards for each arm are iid
from a distribution, and each arm has its own distribution. If one of the arms is better than the rest it would be good to always
pull that arm, but you don’t know which one it is! So you need to divide your time between exploring arms that you think
might be good and exploiting arms that you know are good.
There are many applications for MAB, including recommender systems. For instance, the New York Times uses MAB
to determine which news articles to show you on your cellular phone. One could also argue that contextual MAB are the
algorithms that are leading to the demise of our current society as we know it! These are the algorithms that are really
good at figuring out how to serve you advertisements and social media posts that you are most likely to click on. These
are some of the algorithms that can keep you addicted to social media, going down its rabbit holes. But they are actually
very simple optimization algorithms that are also used for many scientific purposes and clinical trials and so on. They are
probably not dangerous unless you are running a big social media company! I am hopeful that some of you, after learning how
these algorithms work, will figure out how to adapt them for the good of society, to optimize long term health of humanity
rather than simply using them to optimize clicks and viewing time! One could envision many ways to do this, for instance,
prioritizing articles that are more likely to be truthful and less likely to incite rage. Or perhaps to promote articles that
encourage educational topics (broadly construed).
Usually MAB is considered to be an alternative to massive A-B testing. Say you want to optimize the look of your website,
but there are many possible website options to consider. To determine which one is the best, you might try each option several
times and perform pairwise hypothesis testing between all pairs (this is a hypothesis test between option “A” and option “B,”
hence the terminology “A-B testing”). This will take a huge amount of time, so you might want to run a MAB instead, which
conducts all the tests at once, eliminating testing options that are bad once we are pretty sure they are bad, and focusing on
options that might be the best. Clinical trials also can use MAB. We can give many different drugs to the patients, and we
can use MAB to find the best drugs, based on the performance of these drugs over the course of the trial, without having to
do pairwise tests and waste our resources testing drugs that we know early on are not performing well.
In contextual MAB, we also consider the context of each trial. So, for instance, when the social media companies optimize
which advertisement to show you, they might not just use information about the general popularity of each ad, they would
also use a context vector (a feature vector) that they created about you (e.g., this person is an introvert, who likes machine
learning, Dungeons and Dragons, Minecraft, and is a student at a prestigious university in NC, with a political stance that
leans to the left, who stays up late looking at dating sites – yes they have that level of detail about you, and no, it is not too
hard to figure that information out if they know what you do online).
Formally, the stochastic multi-armed bandit problem is a game played in n rounds. At each round t the player chooses an
action among a finite set of m possible choices called arms. When arm j is played (j ∈ {1, · · · , m}) a random reward Xj (t)
is drawn from an unknown distribution. In the case of online advertising, the reward is often whether someone clicked on
something. The distribution of Xj (t) does not change with time (the index t is just used to indicate in which turn the reward
was drawn). At the end of each turn the player can update her estimate of the mean reward of arm j:
\[
\hat{X}_{j,t} = \frac{1}{T_j(t-1)} \sum_{s=1}^{t-1} X_j(s)\,\mathbf{1}_{\{I_s = j\}}, \tag{1}
\]

where Tj (t − 1) is the number of times arm j has been played before round t starts, and 1{It =j} is an indicator function equal
to 1 if arm j is played at time t (otherwise its value is 0). After a while, this empirical mean will be close to the arm’s mean
reward. Updating these estimates after each round will help the player in choosing a good arm in the next round.
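As a concrete illustration, here is a minimal Python sketch of the bookkeeping behind (1), written in the standard incremental form of the sample mean so that past rewards need not be stored. The class name `ArmStats` and its structure are our own illustration, not part of the notes.

```python
import numpy as np

class ArmStats:
    """Running estimates of the arm means, as in Equation (1)."""

    def __init__(self, m):
        self.counts = np.zeros(m, dtype=int)  # T_j(t-1): pulls of each arm so far
        self.means = np.zeros(m)              # X-hat_{j,t}: empirical mean reward per arm

    def update(self, j, reward):
        # Incremental form of (1): mean_j <- mean_j + (reward - mean_j) / T_j
        self.counts[j] += 1
        self.means[j] += (reward - self.means[j]) / self.counts[j]
```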
At each turn, the player suffers a possible regret from not having played the best arm. If they had chosen the best arm,
their reward would have been X ∗ (t) where notation ∗ means the best arm. The total regret at the end of the game is given by
\[
R_n^{(\mathrm{raw})} = \sum_{t=1}^{n} \sum_{j=1}^{m} \big[X^*(t) - X_j(t)\big]\,\mathbf{1}_{\{I_t = j\}},
\]

where X ∗ (t) is the reward of the best arm at time t if it would have been played at time t.
We don’t usually define the regret this way though when doing theory. We usually assign the regret to be based on the
means of the arms distributions. So let’s try it again:
\[
R_n = \sum_{t=1}^{n} \sum_{j=1}^{m} \big[\mu^* - \mu_j\big]\,\mathbf{1}_{\{I_t = j\}},
\]

where µj is the expected payoff of arm j . The mean regret for having played arm j is given by ∆j = µ∗ − µj , where µ∗ is
the mean reward of the best arm and µj is the mean reward obtained when playing arm j . So the regret is now:
\[
R_n = \sum_{t=1}^{n} \sum_{j=1}^{m} \Delta_j\,\mathbf{1}_{\{I_t = j\}}.
\]

The strategies presented in the following sections aim to minimize the expected cumulative regret E[Rn ], where the
expectation is over the random draw of the arms. (The algorithm reacts to these random draws, so the choice of arms It also
then becomes random.)
\[
\mathbb{E}[R_n] = \mathbb{E}\!\left[\sum_{t=1}^{n} \sum_{j=1}^{m} \Delta_j\,\mathbf{1}_{\{I_t = j\}}\right] = \sum_{j=1}^{m} \Delta_j\,\mathbb{E}\big[T_j(n)\big], \tag{2}
\]

where Tj (n) is the number of times arm j is played up to time n.
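To spell out the last equality in (2): by linearity of expectation we can pull the sum over arms outside, and the remaining sum of indicators is exactly the pull count,
\[
\mathbb{E}\!\left[\sum_{t=1}^{n}\sum_{j=1}^{m}\Delta_j\,\mathbf{1}_{\{I_t=j\}}\right] = \sum_{j=1}^{m}\Delta_j\,\mathbb{E}\!\left[\sum_{t=1}^{n}\mathbf{1}_{\{I_t=j\}}\right] = \sum_{j=1}^{m}\Delta_j\,\mathbb{E}\big[T_j(n)\big].
\]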


A complete list of the symbols used can be found in Appendix D.

1 ε-greedy algorithm
The first algorithm we consider is called ε-greedy, and it is in Algorithm 2. The idea is very simple: with some small
probability, play an arm uniformly at random. Otherwise, pick the arm that we think is the best.

Algorithm 2: ε-greedy algorithm

Input: number of rounds n, number of arms m, a constant k such that $k > \max\{10,\ 4/\min_j \Delta_j^2\}$, sequence $\{\varepsilon_t\}_{t=1}^{n}$ with $\varepsilon_t = \min\{1, \frac{km}{t}\}$
Initialization: play all arms once and initialize $\hat{X}_{j,t}$ (defined in (1)) for each $j = 1, \cdots, m$
for t = m + 1 to n do
    With probability $\varepsilon_t$ play an arm uniformly at random (each arm has probability $1/m$ of being selected); otherwise (with probability $1 - \varepsilon_t$) play the ("best") arm j such that $\hat{X}_{j,t-1} \ge \hat{X}_{i,t-1}$ for all i;
    Get reward $X_j(t)$;
    Update $\hat{X}_{j,t}$;
end

You will notice that there are some interesting terms in the algorithm, defining the choice of εt . They are chosen that way
so that we can get a tight bound on the regret of the algorithm.
Since we select k to be larger than both 10 and $4/\min_j \Delta_j^2$, the algorithm explores for a while before it does any exploitation. To see this, note that by the definition of $\varepsilon_t$ in the algorithm, $\varepsilon_t = 1$ until $km/t$ drops below 1 (that is, for all $t \le km$), which takes a while when k is large. In these early iterations, when $\varepsilon_t = 1$, the algorithm just plays arms uniformly at random. As I mentioned earlier, the whole idea of these MAB algorithms is to balance exploration of arms to reduce uncertainty against exploitation of arms that we know are good. So it's useful to explore for a while to learn which arms are good before trying to exploit.
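To make the procedure concrete, here is a minimal Python sketch of Algorithm 2. The simulated Bernoulli arms, the random seed, and the helper names are our own choices for illustration, not part of the notes.

```python
import numpy as np

def epsilon_greedy(true_means, n, k, rng=np.random.default_rng(0)):
    """Sketch of Algorithm 2 with epsilon_t = min(1, k*m/t), on simulated Bernoulli arms."""
    m = len(true_means)
    counts = np.zeros(m)
    means = np.zeros(m)

    def pull(j):
        # Simulated Bernoulli reward; in a real application this is the observed feedback.
        return float(rng.random() < true_means[j])

    def update(j, x):
        counts[j] += 1
        means[j] += (x - means[j]) / counts[j]

    for j in range(m):                     # initialization: play every arm once
        update(j, pull(j))
    for t in range(m + 1, n + 1):
        eps_t = min(1.0, k * m / t)
        if rng.random() < eps_t:           # explore: uniform random arm
            j = int(rng.integers(m))
        else:                              # exploit: arm with the best current estimate
            j = int(np.argmax(means))
        update(j, pull(j))
    return counts, means
```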
Theorem 1.1 shows that the regret of ε-greedy is bounded by a quantity that is at most logarithmic in n. You can see this because the bound consists of a sum of n terms (a term for each t), each of which is of order $t^{-1}$, and $\log n$ is a bound on the sum (integral) of such terms. Specifically, $\varepsilon_t$ is of order $\Theta(1/t)$, while the $\beta_j(t)$ term is $o(1/t)$. To see this, you need the assumptions we made about $\varepsilon_t$: for instance, that $k > 10$ so that the first exponent of $\beta_j$ is sufficiently negative, and that $k > 4/\Delta_j^2$ for all j so that the second exponent of $\beta_j$ is sufficiently negative.

Theorem 1.1 (Regret-bound for ε-greedy algorithm – adapted from Auer et al. (2002)). The bound on the mean regret
E[Rn ] at time n is given by
\[
\mathbb{E}[R_n] \le e\,k\,m^2 + \sum_{t=\lfloor e^2 km\rfloor + 1}^{n}\ \sum_{j:\mu_j<\mu^*} \Delta_j\left[\frac{1}{m}\varepsilon_t + (1-\varepsilon_t)\,\beta_j(t)\right], \tag{3}
\]
where
\[
\beta_j(t) = k\left(\frac{t}{mke}\right)^{-\frac{k}{10}}\log\!\left(\frac{t}{mke}\right) + \frac{4e^{\frac{1}{2}}}{\Delta_j^2}\left(\frac{t}{mke}\right)^{-\frac{k\Delta_j^2}{4}}. \tag{4}
\]

The first term in (3) is a bound on the mean regret during the "starting phase" of the ε-greedy algorithm (Algorithm 2). For the rounds after the starting phase, the quantity in brackets in (3) is an upper bound on the probability of playing arm j. In the bound, $\beta_j(t)$ is
an upper bound on the probability that our algorithm thinks arm j is the best arm at round t, and 1/m is the probability of
choosing arm j when the choice is made at random. Proof in Appendix A.

2 The UCB algorithm


The UCB algorithm is also very simple. It creates a confidence interval for each arm's mean reward. At each round, it chooses the arm with the highest upper confidence bound, because any arm with a high upper confidence bound could be the best arm.

Algorithm 3: UCB algorithm

Input: number of rounds n, number of arms m
Initialization: play all arms once and initialize $\hat{X}_{j,t}$ (as defined in (1)) for each $j = 1, \cdots, m$
for t = m + 1 to n do
    Play the arm j with the highest upper confidence bound on the mean estimate,
    \[
    \hat{X}_{j,t-1} + \sqrt{\frac{2\log(t)}{T_j(t-1)}};
    \]
    Get reward $X_j(t)$;
    Update $\hat{X}_{j,t}$;
end
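Here is a matching minimal Python sketch of Algorithm 3; again, the simulated Bernoulli arms and the helper names are our own, not part of the notes.

```python
import numpy as np

def ucb(true_means, n, rng=np.random.default_rng(0)):
    """Sketch of Algorithm 3: play the arm with the largest upper confidence bound."""
    m = len(true_means)
    counts = np.zeros(m)
    means = np.zeros(m)

    def pull(j):
        return float(rng.random() < true_means[j])   # simulated Bernoulli reward

    def update(j, x):
        counts[j] += 1
        means[j] += (x - means[j]) / counts[j]

    for j in range(m):                               # play each arm once
        update(j, pull(j))
    for t in range(m + 1, n + 1):
        index = means + np.sqrt(2.0 * np.log(t) / counts)  # upper confidence bounds
        j = int(np.argmax(index))                    # optimism in the face of uncertainty
        update(j, pull(j))
    return counts, means
```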

The bound for UCB also grows logarithmically in n. Again, this can be seen because the terms in the sum decay faster than 1/t.

Theorem 2.1 (Regret-bound of the UCB algorithm – adapted from Auer et al. (2002)). The bound on the mean regret
E[Rn] at time n is given by
\[
\mathbb{E}[R_n] \le \sum_{j=1}^{m} \Delta_j + \sum_{j:\mu_j<\mu^*} \frac{8}{\Delta_j}\log(n) + \sum_{j=1}^{m} \Delta_j\left(1 + 2\sum_{t=m+1}^{n} t^{-4}(t-1-m)^2\right). \tag{5}
\]

Proof in Appendix B.

3 Instance-independent regret bound
In previous sections, we have discussed regret bounds of the $-greedy and the UCB algorithm, which are logarithmic in
number of steps – n. In fact, such logarithmic regret rate is the best we can hope for in the asymptotic sense (Lai and
Robbins, 1985). Yet it is important to point out that the aforementioned bounds depend on ∆j (the gap of expected reward
of the optimal arm and arm j ). This means different problem instances exhibit different regret rates. In particular, when the
optimality gaps ∆j is small, the regret rate may be large. A natural question to ask is: is there a form of regret bound that
hold true for any problem instance? The answer is positive.

Corollary 3.1. Fix an arbitrary positive integer n. For the UCB algorithm, the mean regret satisfies
\[
\mathbb{E}[R_n] \le \sqrt{\sum_{j:\Delta_j>0}\left(8\log(n) + 1 + 2\sum_{t=m+1}^{n} t^{-4}(t-1-m)^2\right)}\;\sqrt{n}, \tag{6}
\]
where m is the number of arms. For the ε-greedy algorithm, the mean regret satisfies
\[
\mathbb{E}[R_n] \le \sqrt{\sum_{j:\Delta_j>0}\left[k\left(\frac{t}{mke}\right)^{-\frac{k}{10}}\log\!\left(\frac{t}{mke}\right) + 4e^{\frac{1}{2}}\left(\frac{t}{mke}\right)^{-\frac{k\Delta_j^2}{4}}\right]}\;\sqrt{n}, \tag{7}
\]

where m is the number of arms, and k is an algorithm parameter specified in Algorithm 2.

Given the proofs of Theorems 1.1 and 2.1, the proof of the above corollary is fairly straightforward; the UCB case is in Appendix C, and the ε-greedy counterpart follows the same procedure. The bounds in Corollary 3.1 for ε-greedy and the UCB algorithm are also optimal in the worst-case sense. The proof of asymptotic optimality (Lai and Robbins, 1985) uses a similar (but somewhat more involved) mechanism, and is left as optional reading. Textbooks covering this topic include those of Bubeck and Cesa-Bianchi (2012); Slivkins (2019); Lattimore and Szepesvári (2020).
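To read off the order of the bound in (6) (a rough calculation under the same assumptions, not stated explicitly in the corollary): since $t^{-4}(t-1-m)^2 \le t^{-2}$, the inner sum over t is bounded by a constant, so each of the at most m suboptimal arms contributes $8\log(n) + O(1)$, giving
\[
\mathbb{E}[R_n] \le \sqrt{\sum_{j:\Delta_j>0}\big(8\log(n) + O(1)\big)}\;\sqrt{n} = O\!\left(\sqrt{m\,n\log n}\right),
\]
which holds for every problem instance, in contrast with the instance-dependent (gap-dependent) logarithmic bounds of the previous sections.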

4 Contextual Bandits
A natural extension to the standard multi-armed bandit is to make decisions with “side information.” In this setting, at each
time, the agent observes some contextual information, and chooses an arm based not only on the arms’ histories, but also on
the contextual information.
This setting is used in important real-world scenarios. For example, in online item recommendation, a company may use
the context to describe the browsing user (age, gender, geographical location, etc.) and construct a feature vector to be the
user’s profile.

user in context = [1 if introvert, 1 if likes Jazz, 1 if it is now between 12am and 6am, 1 if browsing dating sites, ...]

After observing this context vector, an algorithm (the recommendation system) chooses an arm (item to recommend, e.g., a
video game ad) to display to the user. The choice of which arm would be more desirable really can depend on the context:
what the user wants after midnight could be totally different than what she wants at lunchtime! Also, different people respond
to very different ads, based on their interests. The algorithm could get a reward of 1 if the user clicks on the item, or a zero
reward otherwise. Or, the reward could be “conversion,” that is, whether the user purchased the item that is being advertised.
We will formulate (a preliminary version of) the problem, define a performance metric, and present one algorithm for this
problem (out of many possible algorithms). The problem is described by a set of contexts $C := [0, 1]^{d_S}$ ($d_S$ is the dimension of the context vector), a continuous set of arms $A := [0, 1]^{d_A}$ ($d_A$ is the dimension of the arm space), and a (stochastic) reward $f(z, a) + \epsilon$ for all $(z, a) \in C \times A$, where f is the mean reward function and $\epsilon$ is zero-mean, bounded noise. In addition, we assume that the mean reward is Lipschitz: for all $(z, a), (z', a') \in C \times A$, we assume $|f(z, a) - f(z', a')| \le L\,\|(z, a) - (z', a')\|_2$, where L is the Lipschitz constant of the mean reward. For time $t = 1, 2, \cdots, n$, the environment reveals a context $z_t$, the agent chooses an arm $a_t$, and receives a random reward $X_t = f(z_t, a_t) + \epsilon$. (It's confusing, but now $X_t$ is the label, whereas $z_t$ and $a_t$ act like the features!) In other words, the mean reward changes smoothly over the context-arm space. This
smoothness is precisely what allows us to interpolate from past situations to the present situation. Perhaps we have seen many
contexts that are similar to the current context (but not exactly the same), and we know what arms were shown in previous

contexts. We can use this information to estimate rewards, even for a new arm and a new context we have never seen before, as
long as they are similar to past cases.
(Note that if the set of arms is finite, then we need the rewards only to be Lipschitz with respect to the context, and we
can look at the past history of that particular arm. In fact, there are many variations of this problem!)
The performance is now measured by the contextual regret:
\[
R_n := \sum_{t=1}^{n}\left(\max_{a\in A} f(z_t, a) - f(z_t, a_t)\right). \tag{8}
\]

This is how much you lost in rewards from choosing the arm at in context zt , rather than choosing the best arm for that
context.
The UCB algorithm can be altered to solve contextual bandit problems with slight modification in order to accommodate
context. To do this, we can create parametric models that estimate f (zt , at ) with confidence intervals. You can use almost
any kind of model to estimate f as long as it has an upper confidence bound. For instance, you can use decision trees that
group the past contexts and arms we have seen so far into leaves of a tree. Then you can calculate the upper confidence bound
of rewards that fall into that leaf. To use UCB, you just need to be able to compute upper confidence bounds for the value of
f at any given context point zt .
Let us provide a simple contextual bandit algorithm that iteratively partitions the context and arm space into leaves of a
tree. Over iterations, we maintain a partition Pt of the context-arm space, and define a mean and confidence with respect to
each “leaf” in this partition. This partition could be learned using decision tree splitting, but one could also define it in other
ways. We would typically have the partition grow finer and finer as we gather more information about contexts, arms and
rewards. In the figure below, we show an example of a partition that we could have constructed using a decision tree. Each
number in the figure represents a trial, where the number’s position on the plot represents the context and arm that was pulled,
and the value (e.g., 4,5,6,7,8,9) is the reward that was obtained after pulling the arm.

Given partition Pt , function pt : C × A → Pt is called a Region Selection Function with respect to Pt if for any
(z, a) ∈ C × A, pt (z, a) is the region in Pt containing (z, a). In other words the Region Selection Function tells us which
region (“bin”) a given context-arm pair is in. With this function defined, we now define the mean estimate and confidence
bound in each partition bin. Let {(z1 , a1 ), x1 , (z2 , a2 ), x2 , · · · , (zt , at ), xt } be the (context-arm, reward) observations up to
time t. Define function nt to be the count of points (each corresponding to a historical arm pull) in the same bin as context-arm
pair (z, a) (but if there are no points, we count that as 1):
\[
n_t(z,a) = \max\left\{1,\ \sum_{s=1}^{t} \mathbf{1}_{\{(z_s,a_s)\in p_t(z,a)\}}\right\}. \tag{9}
\]

Define mt as the mean of the rewards in the bin:


\[
m_t(z,a) = \begin{cases} \dfrac{\sum_{s=1}^{t} x_s\, \mathbf{1}_{\{(z_s,a_s)\in p_t(z,a)\}}}{n_t(z,a)}, & \text{if } \sum_{s=1}^{t} \mathbf{1}_{\{(z_s,a_s)\in p_t(z,a)\}} > 0;\\[2mm] \infty, & \text{otherwise (no observations in bin } p_t(z,a)\text{)}. \end{cases} \tag{10}
\]

Now we can define the “UCB index” (that is, the upper confidence bound) for this problem: ∀(z, a) ∈ C × A,
\[
I_t(z,a) = m_t(z,a) + C_{UCB}\sqrt{\frac{\log t}{n_t(z,a)}} + D_L(p_t(z,a)), \tag{11}
\]

where $C_{UCB}$ is a parameter that scales the upper confidence bound and thus controls how much we explore. Here, $D_L(p_t(z,a))$ describes how large the region $p_t(z,a)$ is: $D_L(p_t(z,a)) := L \max_{w,w' \in p_t(z,a)} \|w - w'\|_2$. This last term is useful because the reward is smooth (Lipschitz). As long as the reward is smooth and the region is not too large, the function f must stay relatively constant in the region $p_t(z,a)$. In that case, we have a tighter UCB because we are more confident that our estimate of the mean reward represents all of the points in the bin. However, if the region is large, then even if we have estimated the mean and its UCB correctly, there could be significant variation in the mean reward f(z, a) over this large bin. In that case, we raise the upper confidence bound based on the diameter of the bin. This strategy is summarized in Algorithm 4 below. Please see Wang et al. (2020) (and references therein) for more information on (contextual) Lipschitz bandits.

Algorithm 4: UCB-Tree algorithm

Input: number of rounds n, arms A, exploration parameter $C_{UCB}$, Lipschitz constant L, tree fitting (partition maintenance) rule R.
for t = 1 to n do
    Observe context $z_t$;
    Compute $m_t$ and $n_t$ using (10) and (9);
    Play the arm $a_t$ with the highest upper confidence bound index,
    \[
    a_t = \arg\max_{a \in A}\ \left[ m_t(z_t, a) + C_{UCB}\sqrt{\frac{\log t}{n_t(z_t, a)}} + D_L(p_t(z_t, a)) \right];
    \]
    Get reward $x_t$;
    Update the partition $P_t$ using rule R.
end

When we update the partition, we create more bins by splitting existing ones. We split a bin when we have enough data that we would get a sufficiently good estimate of the mean reward in each new bin after the split. Since we will mostly be choosing bins with very high mean reward, this sequential splitting will let us zoom in on the arms that have the highest rewards for each context.
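As a rough illustration, here is a Python sketch of the index (11) with the simplest possible partition rule R: a fixed uniform grid over $C \times A$ with one context dimension and one arm dimension, and no splitting at all. The grid rule, the function names, and the candidate-arm discretization are our own simplifications, not the tree-based rule studied in Wang et al. (2020).

```python
import numpy as np

def ucb_grid_contextual(env_reward, n, bins=5, c_ucb=1.0, lip=1.0,
                        rng=np.random.default_rng(0)):
    """Sketch of the UCB index (11) on a fixed grid partition of [0,1] x [0,1].

    env_reward(z, a) returns a noisy reward; the bins*bins grid cells play the
    role of the leaves p_t(z, a), but the partition never changes.
    """
    counts = np.zeros((bins, bins))            # n_t per cell (0 means no observations)
    sums = np.zeros((bins, bins))              # running sum of rewards per cell
    arm_grid = (np.arange(bins) + 0.5) / bins  # one candidate arm per arm-cell
    diam = lip * np.sqrt(2.0) / bins           # D_L: Lipschitz constant times cell diameter

    def cell(x):
        return min(int(x * bins), bins - 1)

    for t in range(1, n + 1):
        z = rng.random()                       # context revealed by the environment
        zi = cell(z)
        best_a, best_index = arm_grid[0], -np.inf
        for a in arm_grid:
            ai = cell(a)
            if counts[zi, ai] == 0:
                index = np.inf                 # m_t = infinity for empty cells, as in (10)
            else:
                mean = sums[zi, ai] / counts[zi, ai]
                index = mean + c_ucb * np.sqrt(np.log(t) / counts[zi, ai]) + diam
            if index > best_index:
                best_a, best_index = a, index
        x = env_reward(z, best_a)              # observe the (noisy) reward
        counts[zi, cell(best_a)] += 1
        sums[zi, cell(best_a)] += x
    return counts, sums
```

For instance, one could call this with a hypothetical smooth reward such as `env_reward = lambda z, a: 1.0 - (z - a) ** 2 + 0.1 * np.random.randn()`.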

5 Other types of bandit problems


There are a huge variety of bandit problems. For instance, there are sleeping bandits, where the bandits disappear for a while
and then reappear (think about an online sale that appears and disappears), there are mortal bandits where the bandits appear
at various times and disappear and never come back! (Here, you can think about news articles.) There are bandits where the
rewards are delayed (so you’re playing blindly for a while when you start playing a new arm). There are bandits where arms
lock for a period of time, so that if you choose an arm, you can’t change it for a while (think about pricing items online where
you are not allowed to change the price too often or it would frighten the customers away).
Some of the most interesting MAB problems involve non-stationary time series, where the expected reward of an arm
changes over time. This happens a lot in reality, for instance, demand for many products has a weekly or yearly cycle, with a
spike in demand for Christmas!

A Regret-bound of the ε-greedy algorithm


The regret at round n is given by
\[
R_n = \sum_{t=1}^{n}\sum_{j=1}^{m} \Delta_j\,\mathbf{1}_{\{I_t=j\}},
\]

where 1{It =j} is an indicator function equal to 1 if arm j is played at time t (otherwise its value is 0) and ∆j = µ∗ − µj
is the difference between the mean of the best arm’s reward distribution and the mean of the j ’s arm reward distribution. By
taking the expectation we have that
\[
\mathbb{E}[R_n] = \sum_{t=1}^{n}\sum_{j=1}^{m} \Delta_j\,\mathbb{P}(I_t = j),
\]

which can be rewritten as


\[
\mathbb{E}[R_n] = \sum_{t=1}^{n}\sum_{j=1}^{m} \Delta_j\left[\frac{1}{m}\varepsilon_t + (1-\varepsilon_t)\,\mathbb{P}\!\left(\hat{X}_{j,T_j(t-1)} \ge \hat{X}_{i,T_i(t-1)}\ \forall i\right)\right], \tag{12}
\]

where $\hat{X}_{i,T_i(t-1)}$ is the estimated mean for arm i after it has been chosen $T_i(t-1)$ times up to time $t-1$. The first term is the probability that we choose arm j by exploring: we explore with probability $\varepsilon_t$, and if we explore, we choose j at random, that is, with probability $1/m$. If we choose j while exploiting, which happens with probability $1-\varepsilon_t$, then its average reward is above that of all the other arms.
For this proof, we assume the rewards are bounded, say between 0 and 1. If they are bounded by something bigger than
1, we would have an extra constant scaling factor in the theorem.
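For reference, the Chernoff-Hoeffding bound used below says that for i.i.d. rewards in [0, 1] with mean $\mu_j$, the empirical mean $\hat{X}_{j,s}$ after s pulls satisfies
\[
\mathbb{P}\big(\hat{X}_{j,s} \ge \mu_j + \epsilon\big) \le e^{-2s\epsilon^2}
\quad\text{and}\quad
\mathbb{P}\big(\hat{X}_{j,s} \le \mu_j - \epsilon\big) \le e^{-2s\epsilon^2};
\]
Step 2 applies the first inequality with $\epsilon = \Delta_j/2$, which produces the factor $e^{-\frac{\Delta_j^2}{2}s}$ in (15).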

STEP 1: Conditions when we think arm j is the best at time t. If we think arm j is the best at time t, then either we
overestimated its mean reward, or we underestimated the reward of the best arm, which is called arm ∗. If neither of those
things occurred, arm j ’s rewards would have been below those of arm ∗ and thus we would not think that arm j is the best
when it isn’t. In the first inequality below, we consider the probability arm j has average reward above all the other arms, and
this is less than the probability that arm j has reward greater than just one of those arms (in particular, arm ∗).
\[
\mathbb{P}\!\left(\hat{X}_{j,T_j(t-1)} \ge \hat{X}_{i,T_i(t-1)}\ \forall i\right) \le \mathbb{P}\!\left(\hat{X}_{j,T_j(t-1)} \ge \hat{X}_{*,T_*(t-1)}\right) \le \mathbb{P}\!\left(\hat{X}_{j,T_j(t-1)} \ge \mu_j + \frac{\Delta_j}{2}\right) + \mathbb{P}\!\left(\hat{X}_{*,T_*(t-1)} \le \mu^* - \frac{\Delta_j}{2}\right), \tag{13}
\]
where the last inequality follows from the fact that either we must have underestimated arm ∗ or overestimated arm j:
\[
\left\{\hat{X}_{j,T_j(t-1)} \ge \hat{X}_{*,T_*(t-1)}\right\} \subset \left\{\hat{X}_{*,T_*(t-1)} \le \mu^* - \frac{\Delta_j}{2}\right\} \cup \left\{\hat{X}_{j,T_j(t-1)} \ge \mu_j + \frac{\Delta_j}{2}\right\}.
\]
Aside: To show this, suppose that there exists an outcome $\omega \in \{\hat{X}_{j,T_j(t-1)} \ge \hat{X}_{*,T_*(t-1)}\}$ that does not belong to $\{\hat{X}_{*,T_*(t-1)} \le \mu^* - \frac{\Delta_j}{2}\} \cup \{\hat{X}_{j,T_j(t-1)} \ge \mu_j + \frac{\Delta_j}{2}\}$. Then we would have that
\[
\omega \in \left(\left\{\hat{X}_{*,T_*(t-1)} \le \mu^* - \frac{\Delta_j}{2}\right\} \cup \left\{\hat{X}_{j,T_j(t-1)} \ge \mu_j + \frac{\Delta_j}{2}\right\}\right)^{C} = \left\{\hat{X}_{*,T_*(t-1)} > \mu^* - \frac{\Delta_j}{2}\right\} \cap \left\{\hat{X}_{j,T_j(t-1)} < \mu_j + \frac{\Delta_j}{2}\right\}, \tag{14}
\]
but from the intersection of events in (14) it follows that $\hat{X}_{*,T_*(t-1)} > \mu^* - \frac{\Delta_j}{2} = \mu_j + \frac{\Delta_j}{2} > \hat{X}_{j,T_j(t-1)}$, which contradicts $\omega \in \{\hat{X}_{j,T_j(t-1)} \ge \hat{X}_{*,T_*(t-1)}\}$. Therefore, every event where $\hat{X}_{j,T_j(t-1)} \ge \hat{X}_{*,T_*(t-1)}$ belongs to the set of events where
\[
\left\{\hat{X}_{*,T_*(t-1)} \le \mu^* - \frac{\Delta_j}{2}\right\} \cup \left\{\hat{X}_{j,T_j(t-1)} \ge \mu_j + \frac{\Delta_j}{2}\right\}.
\]

STEP 2: Let us bound the probability of overestimating sub-optimal arm j at time t. Let us consider the first term of
(13). The computations for the second term are basically identical.
\[
\begin{aligned}
\mathbb{P}\!\left(\hat{X}_{j,T_j(t-1)} \ge \mu_j + \frac{\Delta_j}{2}\right) &= \sum_{s=1}^{t-1} \mathbb{P}\!\left(T_j(t-1) = s,\ \hat{X}_{j,s} \ge \mu_j + \frac{\Delta_j}{2}\right)\\
&= \sum_{s=1}^{t-1} \mathbb{P}\!\left(T_j(t-1) = s \,\middle|\, \hat{X}_{j,s} \ge \mu_j + \frac{\Delta_j}{2}\right) \mathbb{P}\!\left(\hat{X}_{j,s} \ge \mu_j + \frac{\Delta_j}{2}\right)\\
&\le \sum_{s=1}^{t-1} \mathbb{P}\!\left(T_j(t-1) = s \,\middle|\, \hat{X}_{j,s} \ge \mu_j + \frac{\Delta_j}{2}\right) e^{-\frac{\Delta_j^2}{2}s}, \qquad (15)
\end{aligned}
\]

where in the last inequality we used the Chernoff-Hoeffding bound. The second term will be small when s is large, so that
term will be sufficient to handle whatever the first term brings when s is large. When s is small, the first term could be
problematic since it will be large. We are going to separate this sum into large s and small s and handle them separately.
Here, small s means less than x0 , where we define it as:
\[
x_0 := \frac{1}{2m}\sum_{s=1}^{t} \varepsilon_s.
\]

Then
\[
(15) \le \sum_{s=1}^{\lfloor x_0\rfloor} \mathbb{P}\!\left(T_j(t-1) = s \,\middle|\, \hat{X}_{j,s} \ge \mu_j + \frac{\Delta_j}{2}\right)\cdot 1 + \sum_{s=\lfloor x_0\rfloor+1}^{t-1} 1\cdot e^{-\frac{\Delta_j^2}{2}s}.
\]

Here, we split the sum into two pieces and bounded one of the terms by 1.
Let us work on the second term. We will now use the fact that $\sum_{s=\lfloor x_0\rfloor+1}^{\infty} e^{-bs} \le \frac{1}{b}e^{-b\lfloor x_0\rfloor}$, where in our case $b = \frac{\Delta_j^2}{2}$:
\[
(15) \le \sum_{s=1}^{\lfloor x_0\rfloor} \mathbb{P}\!\left(T_j(t-1) = s \,\middle|\, \hat{X}_{j,s} \ge \mu_j + \frac{\Delta_j}{2}\right) + \frac{2}{\Delta_j^2}\, e^{-\frac{\Delta_j^2}{2}\lfloor x_0\rfloor}.
\]

Now comes a trick. Let us define $T_j^R(t-1)$ as the number of times arm j is played when we are performing exploration. Note that $T_j^R(t-1) \le T_j(t-1)$ and that $T_j^R(t-1) = \sum_{s=1}^{t-1} B_s$, where $B_s$ is a Bernoulli random variable with parameter $\varepsilon_s/m$ (this is the probability that we explore times the probability that we choose arm j when exploring, i.e., $\varepsilon_s$ times $1/m$). When $T_j(t-1) = s$, $T_j^R(t-1)$ equals some value at most s, but we don't know which one. Luckily we are constructing upper bounds, so we can replace the event $\{T_j(t-1) = s\}$ by the larger event $\{T_j^R(t-1) \le s\}$:
\[
(15) \le \sum_{s=1}^{\lfloor x_0\rfloor} \mathbb{P}\!\left(T_j^R(t-1) \le s \,\middle|\, \hat{X}_{j,s} \ge \mu_j + \frac{\Delta_j}{2}\right) + \frac{2}{\Delta_j^2}\, e^{-\frac{\Delta_j^2}{2}\lfloor x_0\rfloor}.
\]

Now things are good, since the number of times we explore to choose arm j , TjR (t − 1), does not depend on the estimate of
the mean for arm j . The number of terms in the sum is ⌊x0 ⌋:

\[
(15) \le \lfloor x_0\rfloor\, \mathbb{P}\!\left(T_j^R(t-1) \le \lfloor x_0\rfloor\right) + \frac{2}{\Delta_j^2}\, e^{-\frac{\Delta_j^2}{2}\lfloor x_0\rfloor}. \tag{16}
\]
Recall that $T_j^R(t-1) = \sum_{s=1}^{t-1} B_s$ where the $B_s$ are independent Bernoulli random variables with $\mathbb{P}(B_s = 1) = \frac{\varepsilon_s}{m}$. The Bernstein inequality states, for (independent) Bernoulli random variables,
\[
\mathbb{P}\!\left(\sum_{s=1}^{t-1} B_s \le \mathbb{E}\!\left[\sum_{s=1}^{t-1} B_s\right] - a\right) \le \exp\!\left(-\frac{\frac{1}{2}a^2}{\sum_{s=1}^{t-1}\mathrm{Var}(B_s) + \frac{1}{3}a}\right).
\]

Also, we have (using the formula for the variance of a Bernoulli random variable):
\[
\mathrm{Var}(B_s) = \frac{\varepsilon_s}{m}\left(1 - \frac{\varepsilon_s}{m}\right) \le \frac{\varepsilon_s}{m}. \tag{17}
\]
Thus, applying Bernstein's inequality to the Bernoulli random variables $B_s$ with $a = x_0 = \frac{1}{2}\mathbb{E}\big[T_j^R(t-1)\big]$ gives
\[
\begin{aligned}
\mathbb{P}\big(T_j^R(t-1) \le \lfloor x_0\rfloor\big) &\le \mathbb{P}\big(T_j^R(t-1) \le x_0\big)\\
&= \mathbb{P}\!\left(T_j^R(t-1) \le \mathbb{E}\big[T_j^R(t-1)\big] - \tfrac{1}{2}\mathbb{E}\big[T_j^R(t-1)\big]\right)\\
&\le \exp\!\left(-\frac{\tfrac{1}{8}\big(\mathbb{E}[T_j^R(t-1)]\big)^2}{\sum_{s=1}^{t-1}\mathrm{Var}(B_s) + \tfrac{1}{6}\mathbb{E}[T_j^R(t-1)]}\right)\\
&\le \exp\!\left(-\frac{\tfrac{1}{8}\big(\mathbb{E}[T_j^R(t-1)]\big)^2}{\sum_{s=1}^{t-1}\tfrac{\varepsilon_s}{m} + \tfrac{1}{6}\mathbb{E}[T_j^R(t-1)]}\right) && \text{(by Eq. (17))}\\
&= \exp\!\left(-\frac{\tfrac{1}{8}\big(\mathbb{E}[T_j^R(t-1)]\big)^2}{\mathbb{E}[T_j^R(t-1)] + \tfrac{1}{6}\mathbb{E}[T_j^R(t-1)]}\right) && \left(\text{because } \mathbb{E}[T_j^R(t-1)] = \textstyle\sum_{s=1}^{t-1}\tfrac{\varepsilon_s}{m}\right)\\
&= \exp\!\left(-\tfrac{6}{7}\cdot\tfrac{1}{8}\,\mathbb{E}[T_j^R(t-1)]\right) = \exp\!\left(-\tfrac{3}{7}\cdot\tfrac{1}{2}\,x_0\right)\\
&\le \exp\!\left(-\tfrac{1}{5}\,x_0\right). && (18)
\end{aligned}
\]

STEP 3: To upper bound (18), let us find a lower bound on ⌊x0 ⌋. Let us define n′ = ⌊km⌋ + 1 (where k was defined in
the algorithm statement, remember that it is more than 10), then
\[
\begin{aligned}
x_0 &= \frac{1}{2m}\sum_{s=1}^{t}\varepsilon_s = \frac{1}{2m}\sum_{s=1}^{t}\min\left\{1, \frac{km}{s}\right\}\\
&= \frac{1}{2m}\sum_{s=1}^{n'} 1 + \frac{km}{2m}\sum_{s=n'+1}^{t}\frac{1}{s}\\
&= \frac{n'}{2m} + \frac{k}{2}\left(\sum_{s=1}^{t}\frac{1}{s} - \sum_{s=1}^{n'}\frac{1}{s}\right).
\end{aligned}
\]
Here we will use some properties of harmonic sums, namely $\sum_{s=1}^{n}\frac{1}{s} \le \log n + 1$ and $\sum_{s=1}^{n}\frac{1}{s} > \int_{1}^{n+1}\frac{1}{s}\,ds = \log(n+1)$.
Continuing from the previous line,
\[
\begin{aligned}
x_0 &\ge \frac{n'}{2m} + \frac{k}{2}\Big(\log(t+1) - \big(\log(n') + \log(e)\big)\Big)\\
&= \frac{k}{2}\cdot\frac{n'}{mk} + \frac{k}{2}\log\!\left(\frac{t+1}{n'e}\right)\\
&\ge \frac{k}{2}\log\!\left(\frac{n'}{mk}\right) + \frac{k}{2}\log\!\left(\frac{t}{n'e}\right) && \text{(because } \log x \le x\text{)}\\
&= \frac{k}{2}\log\!\left(\frac{t}{mke}\right). && (19)
\end{aligned}
\]
Using (19) combined with (18) in (16), we get the following:
\[
\begin{aligned}
(15) &\le \lfloor x_0\rfloor\, \mathbb{P}\!\left(T_j^R(t-1) \le \lfloor x_0\rfloor\right) + \frac{2}{\Delta_j^2}\, e^{-\frac{\Delta_j^2}{2}\lfloor x_0\rfloor} && \text{(copying (16))}\\
&\le \lfloor x_0\rfloor \exp\!\left(-\tfrac{1}{5}x_0\right) + \frac{2}{\Delta_j^2}\, e^{-\frac{\Delta_j^2}{2}\lfloor x_0\rfloor} && \text{(from (18))}\\
&\le x_0 \exp\!\left(-\tfrac{1}{5}x_0\right) + \frac{2}{\Delta_j^2}\, e^{-\frac{\Delta_j^2}{2}(x_0-1)}\\
&= x_0 \exp\!\left(-\tfrac{1}{5}x_0\right) + \frac{2}{\Delta_j^2}\, e^{\frac{\Delta_j^2}{2}}\, e^{-\frac{\Delta_j^2}{2}x_0}\\
&\le x_0 \exp\!\left(-\tfrac{1}{5}x_0\right) + \frac{2}{\Delta_j^2}\, e^{\frac{1}{2}}\, e^{-\frac{\Delta_j^2}{2}x_0} && \text{(since } \Delta_j \in [0,1]\text{)}\\
&\le x_0 \exp\!\left(-\tfrac{1}{5}x_0\right) + \frac{2e^{\frac{1}{2}}}{\Delta_j^2}\left(\frac{t}{mke}\right)^{-\frac{k\Delta_j^2}{4}} && \text{(from (19)).} \qquad (20)
\end{aligned}
\]
Next, from the first-order derivative test we know that the function $x_0 \exp(-\frac{1}{5}x_0)$ is decreasing on $[5, \infty)$. Thus, when $\frac{k}{2}\log\!\left(\frac{t}{mke}\right) \ge 5$ (and from (19) we have $x_0 \ge \frac{k}{2}\log\!\left(\frac{t}{mke}\right)$), plugging $\frac{k}{2}\log\!\left(\frac{t}{mke}\right)$ in for $x_0$ in the expression $x_0\exp(-\frac{1}{5}x_0)$ gives
\[
x_0 \exp\!\left(-\frac{1}{5}x_0\right) \le \frac{k}{2}\left(\frac{t}{mke}\right)^{-\frac{k}{10}}\log\!\left(\frac{t}{mke}\right). \tag{21}
\]
Since we choose $k \ge \max\{10,\ \frac{4}{\min_j \Delta_j^2}\}$, we know $\frac{k}{2} \ge 5$. Thus $t \ge e^2 km$ is sufficient to ensure $x_0 \ge \frac{k}{2}\log\!\left(\frac{t}{mke}\right) \ge 5$.

Combining the above results (20) and (21) gives: when $t \ge e^2 km$,
\[
\text{the first term in (13)} = \text{the left side of (15)} \le \frac{k}{2}\left(\frac{t}{mke}\right)^{-\frac{k}{10}}\log\!\left(\frac{t}{mke}\right) + \frac{2e^{\frac{1}{2}}}{\Delta_j^2}\left(\frac{t}{mke}\right)^{-\frac{k\Delta_j^2}{4}}. \tag{22}
\]

STEP 4: Let us bound the probability of underestimating the best arm ∗ at time t. The computations for the second term in (13) are essentially identical, so doubling the bound in (22) (that is, removing the 1/2 factor) gives the following bound on $\mathbb{P}\big(\hat{X}_{j,T_j(t-1)} \ge \hat{X}_{i,T_i(t-1)}\ \forall i\big)$ (when $t \ge e^2 km$):
\[
\beta_j(t) = k\left(\frac{t}{mke}\right)^{-\frac{k}{10}}\log\!\left(\frac{t}{mke}\right) + \frac{4e^{\frac{1}{2}}}{\Delta_j^2}\left(\frac{t}{mke}\right)^{-\frac{k\Delta_j^2}{4}}. \tag{23}
\]
STEP 5: Let us bound the probability of playing suboptimal arm j. We now have an upper bound for $\mathbb{P}\big(\hat{X}_{j,T_j(t-1)} \ge \hat{X}_{i,T_i(t-1)}\ \forall i\big)$, the left-hand side of (13). We will plug this into (12), which yields the following bound on the mean regret at time n. First, we just split the sum over t in (12) into two parts.
\[
\begin{aligned}
\mathbb{E}[R_n] \le{}& \sum_{t=1}^{\lfloor ekm\rfloor}\sum_{j=1}^{m}\Delta_j\left[\frac{1}{m}\varepsilon_t + (1-\varepsilon_t)\,\mathbb{P}\!\left(\hat{X}_{j,T_j(t-1)} \ge \hat{X}_{i,T_i(t-1)}\ \forall i\right)\right]\\
&+ \sum_{t=\lfloor ekm\rfloor+1}^{n}\sum_{j=1}^{m}\Delta_j\left[\frac{1}{m}\varepsilon_t + (1-\varepsilon_t)\,\mathbb{P}\!\left(\hat{X}_{j,T_j(t-1)} \ge \hat{X}_{i,T_i(t-1)}\ \forall i\right)\right]. \qquad (24)
\end{aligned}
\]

The first term has an upper bound, since all probabilities are at most 1, and all ∆j ’s are at most 1:
\[
\begin{aligned}
\sum_{t=1}^{\lfloor ekm\rfloor}\sum_{j=1}^{m}\Delta_j\left[\frac{1}{m}\varepsilon_t + (1-\varepsilon_t)\,\mathbb{P}\!\left(\hat{X}_{j,T_j(t-1)} \ge \hat{X}_{i,T_i(t-1)}\ \forall i\right)\right]
&\le \sum_{t=1}^{\lfloor ekm\rfloor}\sum_{j=1}^{m}\left[1\cdot\frac{1}{m}\varepsilon_t + (1-\varepsilon_t)\cdot 1\right]\\
&\le \sum_{t=1}^{\lfloor ekm\rfloor}\sum_{j=1}^{m} 1\\
&\le ekm^2. \qquad (25)
\end{aligned}
\]

Thus, substituting into (24) the upper bound (25) for the first term and the upper bound (23) for the probability in the second term, we obtain
\[
\mathbb{E}[R_n] \le ekm^2 + \sum_{t=\lfloor e^2 km\rfloor+1}^{n}\ \sum_{j:\mu_j<\mu^*}\Delta_j\left[\frac{1}{m}\varepsilon_t + (1-\varepsilon_t)\,\beta_j(t)\right].
\]

This proves the theorem.

B The regret bound of the UCB algorithm


The regret at round n is given by

\[
R_n = \sum_{j=1}^{m}\Delta_j + \sum_{t=m+1}^{n}\sum_{j=1}^{m}\Delta_j\,\mathbf{1}_{\{I_t=j\}}.
\]

The expected regret E[Rn ] at round n is bounded by


\[
\mathbb{E}[R_n] \le \sum_{j=1}^{m}\Delta_j + \sum_{j=1}^{m}\Delta_j\,\mathbb{E}\big[T_j(n)\big], \tag{26}
\]
where $T_j(n) = \sum_{t=1}^{n}\mathbf{1}_{\{I_t=j\}}$ is the number of times arm j has been chosen up to round n. Recall that

\[
\hat{X}_{j,t} = \frac{1}{T_j(t-1)}\sum_{s=1}^{T_j(t-1)} X_j(s).
\]

Let’s suppose the rewards are bounded, say between 0 and 1.

STEP 1: Let us bound the probability of overestimating or underestimating suboptimal arm j .


From the Chernoff-Hoeffding Inequality we have that
\[
\mathbb{P}\!\left(\frac{1}{T_j(t-1)}\sum_{i=1}^{T_j(t-1)} X_j(i) - \mu_j \le -\varepsilon\right) \le \exp\{-2T_j(t-1)\varepsilon^2\},
\]
and
\[
\mathbb{P}\!\left(\frac{1}{T_j(t-1)}\sum_{i=1}^{T_j(t-1)} X_j(i) - \mu_j \ge \varepsilon\right) \le \exp\{-2T_j(t-1)\varepsilon^2\}.
\]
By selecting $\varepsilon = \sqrt{\frac{2\log(t)}{T_j(t-1)}}$ (so that $\exp\{-2T_j(t-1)\varepsilon^2\} = \exp\{-4\log(t)\} = t^{-4}$), we have
\[
\mathbb{P}\!\left(\hat{X}_{j,t} + \sqrt{\frac{2\log(t)}{T_j(t-1)}} \le \mu_j\right) \le t^{-4}, \tag{27}
\]
and
\[
\mathbb{P}\!\left(\hat{X}_{j,t} - \sqrt{\frac{2\log(t)}{T_j(t-1)}} \ge \mu_j\right) \le t^{-4}. \tag{28}
\]
STEP 2: Let us bound the number of times we play arm j .
For each t, we consider the events such that the UCB of j is higher than that of ∗. These are events that could potentially
happen due to the randomness in the draws of each arm at each time until t.
\[
\left\{\hat{X}_{j,T_j(t-1)} + \sqrt{\frac{2\log(t)}{T_j(t-1)}} \ge \hat{X}_{*,T_*(t-1)} + \sqrt{\frac{2\log(t)}{T_*(t-1)}},\ T_j(t-1) \ge u\right\} \subset \left\{\max_{s_j\in\{u,\dots,T_j(t-1)\}}\left[\hat{X}_{j,s_j} + \sqrt{\frac{2\log(t)}{s_j}}\right] \ge \min_{s_*\in\{1,\dots,T_*(t-1)\}}\left[\hat{X}_{*,s_*} + \sqrt{\frac{2\log(t)}{s_*}}\right]\right\}. \tag{29}
\]

Events on both the left and right sides of (29) are included in
\[
\bigcup_{s_*=1}^{T_*(t-1)}\ \bigcup_{s_j=u}^{T_j(t-1)}\left\{\hat{X}_{j,s_j} + \sqrt{\frac{2\log(t)}{s_j}} \ge \hat{X}_{*,s_*} + \sqrt{\frac{2\log(t)}{s_*}}\right\}. \tag{30}
\]

Thus, for any integer u, we may write


\[
\begin{aligned}
T_j(n) &= 1 + \sum_{t=m+1}^{n}\mathbf{1}\{I_t = j\} && \text{(play arm } j \text{ once during the starting phase)}\\
&\le u + \sum_{t=m+1}^{n}\mathbf{1}\{I_t = j,\ T_j(t-1) \ge u\} && \text{(split terms to separate out the first } u \text{ turns)}\\
&\le u + \sum_{t=m+1}^{n}\mathbf{1}\left\{\hat{X}_{j,T_j(t-1)} + \sqrt{\tfrac{2\log(t)}{T_j(t-1)}} \ge \hat{X}_{*,T_*(t-1)} + \sqrt{\tfrac{2\log(t)}{T_*(t-1)}},\ T_j(t-1)\ge u\right\} && (31)\\
& && \text{(play arm } j \text{ when its UCB is the highest)}\\
&\le u + \sum_{t=m+1}^{n}\mathbf{1}\left\{\max_{s_j\in\{u,\dots,T_j(t-1)\}}\left[\hat{X}_{j,s_j} + \sqrt{\tfrac{2\log(t)}{s_j}}\right] \ge \min_{s_*\in\{1,\dots,T_*(t-1)\}}\left[\hat{X}_{*,s_*} + \sqrt{\tfrac{2\log(t)}{s_*}}\right]\right\} && \text{(from (29))}\\
&\le u + \sum_{t=m+1}^{n}\sum_{s_*=1}^{T_*(t-1)}\sum_{s_j=u}^{T_j(t-1)}\mathbf{1}\left\{\hat{X}_{j,s_j} + \sqrt{\tfrac{2\log(t)}{s_j}} \ge \hat{X}_{*,s_*} + \sqrt{\tfrac{2\log(t)}{s_*}}\right\} && \text{(from (30)).} \qquad (32)
\end{aligned}
\]

STEP 3: Let us rewrite the event of playing arm j as a subset of the union of underestimating arm ∗ or overestimating arm j. When
\[
\mathbf{1}\left\{\hat{X}_{j,t} + \sqrt{\frac{2\log(t)}{T_j(t-1)}} \ge \hat{X}_{*,t} + \sqrt{\frac{2\log(t)}{T_*(t-1)}}\right\} \qquad \text{(when we play arm } j\text{)} \tag{33}
\]
is equal to one, at least one of the following has to be true:
\[
\hat{X}_{*,t} \le \mu^* - \sqrt{\frac{2\log(t)}{T_*(t-1)}}; \qquad \text{(we underestimated arm ∗)} \tag{34}
\]
\[
\hat{X}_{j,t} \ge \mu_j + \sqrt{\frac{2\log(t)}{T_j(t-1)}}; \qquad \text{(we overestimated arm } j\text{)} \tag{35}
\]
\[
\mu^* < \mu_j + 2\sqrt{\frac{2\log(t)}{T_j(t-1)}}. \qquad \text{(arm } j\text{'s UCB is higher than } \mu^*\text{)} \tag{36}
\]
To prove this, suppose none of them hold. From the negation of (34) we would have $\hat{X}_{*,t} > \mu^* - \sqrt{\frac{2\log(t)}{T_*(t-1)}}$; then, applying (36) with the inequality reversed (since we are assuming it does not hold), we get $\hat{X}_{*,t} > \mu_j + 2\sqrt{\frac{2\log(t)}{T_j(t-1)}} - \sqrt{\frac{2\log(t)}{T_*(t-1)}}$; and then from (35) (again with the inequality reversed) it follows that $\hat{X}_{*,t} > \hat{X}_{j,t} + \sqrt{\frac{2\log(t)}{T_j(t-1)}} - \sqrt{\frac{2\log(t)}{T_*(t-1)}}$, which contradicts (33). Now, if we set $u = \left\lceil \frac{8}{\Delta_j^2}\log(t) \right\rceil$, then for $T_j(t-1) \ge u$ we have seen arm j enough times that $2\sqrt{\frac{2\log(t)}{T_j(t-1)}} \le \Delta_j$, which means (36) cannot hold, as shown below:
µ∗ − µj − 2 ≥ µ∗ − µj − 2
Tj (t − 1) u
(
) 2 log(t)
)
= µ ∗ − µ j − 2* M N
8
∆j2 log (t)
(
)
) 2 log(t)
≥ µ ∗ − µ j − 2* 8
∆2
log (t)
j

≥ µ∗ − µj − ∆j = 0,

therefore, with this choice of u, (36) cannot hold. So either (34) or (35) is true if we play arm j instead of arm ∗.

STEP 4: Let us bound the expected number of times we play arm j .


Using (32) and Step 3, we have that
\[
\begin{aligned}
T_j(n) \le{}& \left\lceil\frac{8}{\Delta_j^2}\log(n)\right\rceil && \text{(this is } u \text{ in (32))}\\
&+ \sum_{t=m+1}^{n}\sum_{s_*=1}^{T_*(t-1)}\sum_{s_j=u}^{T_j(t-1)}\mathbf{1}\left\{\hat{X}_{*,s_*} \le \mu^* - \sqrt{\frac{2\log(t)}{s_*}}\right\} && \text{(from (34))}\\
&+ \sum_{t=m+1}^{n}\sum_{s_*=1}^{T_*(t-1)}\sum_{s_j=u}^{T_j(t-1)}\mathbf{1}\left\{\hat{X}_{j,s_j} \ge \mu_j + \sqrt{\frac{2\log(t)}{s_j}}\right\} && \text{(from (35))},
\end{aligned}
\]

and by taking the expectation,
\[
\begin{aligned}
\mathbb{E}\big[T_j(n)\big] \le{}& \left\lceil\frac{8}{\Delta_j^2}\log(n)\right\rceil\\
&+ \sum_{t=m+1}^{n}\sum_{s_*=1}^{T_*(t-1)}\sum_{s_j=u}^{T_j(t-1)}\mathbb{P}\!\left(\hat{X}_{*,s_*} \le \mu^* - \sqrt{\frac{2\log(t)}{s_*}}\right)\\
&+ \sum_{t=m+1}^{n}\sum_{s_*=1}^{T_*(t-1)}\sum_{s_j=u}^{T_j(t-1)}\mathbb{P}\!\left(\hat{X}_{j,s_j} \ge \mu_j + \sqrt{\frac{2\log(t)}{s_j}}\right)\\
\le{}& \frac{8}{\Delta_j^2}\log(n) + 1 + 2\sum_{t=m+1}^{n} t^{-4}(t-1-m)^2, \qquad (37)
\end{aligned}
\]

where in the last step we upper bounded $T_*(t-1)$ by $(t-1-m)$ (this is the maximum number of times we could have played the best arm, excluding the starting phase of m rounds), and we similarly bounded $T_j(t-1)$. Therefore, by using (26),
\[
\mathbb{E}[R_n] \le \sum_{j=1}^{m}\Delta_j + \sum_{j:\mu_j<\mu^*}\frac{8}{\Delta_j}\log(n) + \sum_{j=1}^{m}\Delta_j\left(1 + 2\sum_{t=m+1}^{n} t^{-4}(t-1-m)^2\right).
\]

Notice that the quantity in parentheses is bounded by a constant because the terms in the sum decrease rapidly enough in t. We have now proven the theorem.

C Proof of Corollary 3.1


Proof. We will first prove the statement of the corollary for the UCB algorithm. Using previous results, we have
\[
\begin{aligned}
\mathbb{E}[R_n] &= \sum_{j=1}^{m}\Delta_j\,\mathbb{E}\big[T_j(n)\big] && \text{(by Eq. (2))}\\
&= \sum_{j=1}^{m}\Delta_j\sqrt{\mathbb{E}[T_j(n)]}\cdot\sqrt{\mathbb{E}[T_j(n)]}\\
&\le \sqrt{\sum_{j=1}^{m}\Delta_j^2\,\mathbb{E}[T_j(n)]}\cdot\sqrt{\sum_{j=1}^{m}\mathbb{E}[T_j(n)]} && \text{(by the Cauchy-Schwarz inequality)}\\
&\le \sqrt{\sum_{j=1}^{m}\Delta_j^2\,\mathbb{E}[T_j(n)]}\cdot\sqrt{n} && \left(\text{since } \textstyle\sum_{j=1}^{m}\mathbb{E}[T_j(n)] \le n\text{: we cannot make more than } n \text{ arm pulls in } n \text{ rounds}\right)\\
&= \sqrt{\sum_{j:\Delta_j>0}\Delta_j^2\,\mathbb{E}[T_j(n)]\cdot n}\\
&\le \sqrt{\sum_{j:\Delta_j>0}\Delta_j^2\left(\frac{8}{\Delta_j^2}\log(n) + 1 + 2\sum_{t=m+1}^{n} t^{-4}(t-1-m)^2\right)}\;\sqrt{n} && \text{(from (37))}\\
&\le \sqrt{\sum_{j:\Delta_j>0}\left(8\log(n) + \left(1 + 2\sum_{t=m+1}^{n} t^{-4}(t-1-m)^2\right)\Delta_j^2\right)}\;\sqrt{n} && \text{(multiplying } \Delta_j^2 \text{ through)}\\
&\le \sqrt{\sum_{j:\Delta_j>0}\left(8\log(n) + 1 + 2\sum_{t=m+1}^{n} t^{-4}(t-1-m)^2\right)}\;\sqrt{n}. && \text{(since } \Delta_j \le 1\text{)}
\end{aligned}
\]

This is the first inequality of the corollary. On to ε-greedy.

The proof for ε-greedy is very similar, except for a different bound on $\mathbb{E}[T_j(n)]$: from Equations (12) and (23), we have
\[
\begin{aligned}
\mathbb{E}\big[T_j(n)\big] &= \sum_{t=1}^{n}\mathbb{P}(I_t = j)\\
&\le \sum_{t=1}^{n}\left[\frac{1}{m}\varepsilon_t + (1-\varepsilon_t)\,\mathbb{P}\!\left(\hat{X}_{j,T_j(t-1)} \ge \hat{X}_{i,T_i(t-1)}\ \forall i\right)\right]\\
&\le k\left(\frac{t}{mke}\right)^{-\frac{k}{10}}\log\!\left(\frac{t}{mke}\right) + \frac{4e^{\frac{1}{2}}}{\Delta_j^2}\left(\frac{t}{mke}\right)^{-\frac{k\Delta_j^2}{4}}. \qquad (38)
\end{aligned}
\]

Now we can repeat the same argument:


\[
\begin{aligned}
\mathbb{E}[R_n] &= \sum_{j=1}^{m}\Delta_j\,\mathbb{E}\big[T_j(n)\big] && \text{(by Eq. (2))}\\
&= \sum_{j=1}^{m}\Delta_j\sqrt{\mathbb{E}[T_j(n)]}\cdot\sqrt{\mathbb{E}[T_j(n)]}\\
&\le \sqrt{\sum_{j=1}^{m}\Delta_j^2\,\mathbb{E}[T_j(n)]}\cdot\sqrt{\sum_{j=1}^{m}\mathbb{E}[T_j(n)]} && \text{(by the Cauchy-Schwarz inequality)}\\
&\le \sqrt{\sum_{j=1}^{m}\Delta_j^2\,\mathbb{E}[T_j(n)]}\cdot\sqrt{n} && \left(\text{since } \textstyle\sum_{j=1}^{m}\mathbb{E}[T_j(n)] \le n\right)\\
&= \sqrt{\sum_{j:\Delta_j>0}\Delta_j^2\,\mathbb{E}[T_j(n)]\cdot n}\\
&\le \sqrt{\sum_{j:\Delta_j>0}\Delta_j^2\left[k\left(\frac{t}{mke}\right)^{-\frac{k}{10}}\log\!\left(\frac{t}{mke}\right) + \frac{4e^{\frac{1}{2}}}{\Delta_j^2}\left(\frac{t}{mke}\right)^{-\frac{k\Delta_j^2}{4}}\right]}\;\sqrt{n} && \text{(by Eq. (38))}\\
&\le \sqrt{\sum_{j:\Delta_j>0}\left[\Delta_j^2\, k\left(\frac{t}{mke}\right)^{-\frac{k}{10}}\log\!\left(\frac{t}{mke}\right) + 4e^{\frac{1}{2}}\left(\frac{t}{mke}\right)^{-\frac{k\Delta_j^2}{4}}\right]}\;\sqrt{n}\\
&\le \sqrt{\sum_{j:\Delta_j>0}\left[k\left(\frac{t}{mke}\right)^{-\frac{k}{10}}\log\!\left(\frac{t}{mke}\right) + 4e^{\frac{1}{2}}\left(\frac{t}{mke}\right)^{-\frac{k\Delta_j^2}{4}}\right]}\;\sqrt{n}. && \text{(since } \Delta_j \le 1\text{)}
\end{aligned}
\]

We have now proved the second inequality in the corollary.

D Notation summary

• m: number of arms;

• n: number of rounds;

• Xj (t): random reward for playing arm j ;

• µ∗ : mean reward of the optimal arm (µ∗ = max1≤j≤m µj );

• ∆j : difference between the mean reward of the optimal arm and the mean reward of arm j (∆j = µ∗ − µj );

• X̂j : current estimate of µj ;


• It : arm played at turn t;

• Tj (t − 1): number of times arm j has been played before round t starts;

• k: a constant such that $k > \max\{10,\ 4/\min_j \Delta_j^2\}$ in Algorithm 2 (ε-greedy);

• βj (t): upper bound on the probability that suboptimal arm j is considered to be the best arm at round t when using Algorithm 2;

• n′: a particular time, defined as ⌊km⌋ + 1, used in the analysis of the ε-greedy algorithm in Appendix A;

• Rn : total regret at round n.

References
Auer, P., Cesa-Bianchi, N., and Fischer, P. (2002). Finite-time analysis of the multi-armed bandit problem. Machine learning,
47(2-3):235–256.

Bubeck, S. and Cesa-Bianchi, N. (2012). Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning, 5(1):1–122.

Lai, T. L. and Robbins, H. (1985). Asymptotically efficient adaptive allocation rules. Advances in applied mathematics,
6(1):4–22.

Lattimore, T. and Szepesvári, C. (2020). Bandit algorithms. Cambridge University Press.

Slivkins, A. (2019). Introduction to multi-armed bandits. Foundations and Trends in Machine Learning, 12(1-2):1–286.

Wang, T., Ye, W., Geng, D., and Rudin, C. (2020). Towards practical Lipschitz bandits. In FODS.
