Lec05 Multi Armed Bandit
[Figure: the bandit reduction. In round i, OLA sends probabilities π^i to MAB; MAB plays an action against the WORLD with probabilities (1 − ϵ)π^i + ϵ/k (the mixing defined in Idea 2 below); the WORLD returns the payoff v^i_j of the played action j; MAB reports (0, …, v^i_j/π^i_j, …, 0) back to OLA.]

Question: What is the expected payoff of the algorithm in round i?
Notation: Online Algorithm (OLA)
• in round i:
  • probabilities of actions π^i = (π^i_1, …, π^i_k)
  • choose action j^i ∼ π^i
  • payoffs v^i = (v^i_1, …, v^i_k)
  • expected payoff:

    E[v^i_{j^i}] = Σ_j E[v^i_{j^i} | j^i = j] · Pr[j^i = j] = Σ_j v^i_j π^i_j
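As a sanity check on the expected-payoff identity, here is a minimal Python sketch (the variables and numbers are illustrative, not from the notes): averaging the realized payoff of j^i ∼ π^i approaches the dot product Σ_j v^i_j π^i_j.

    import random

    pi = [0.5, 0.3, 0.2]   # action probabilities pi^i
    v = [1.0, 0.0, 0.5]    # payoffs v^i

    # closed form: sum_j v^i_j * pi^i_j
    exact = sum(vj * pj for vj, pj in zip(v, pi))

    # Monte Carlo: sample j^i ~ pi^i, average the realized payoff v^i_{j^i}
    n = 100_000
    draws = random.choices(range(len(pi)), weights=pi, k=n)
    empirical = sum(v[j] for j in draws) / n

    print(exact, empirical)  # both ~0.6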
Idea 1: give the algorithm an unbiased estimator of the payoffs.
• if the algorithm uses probabilities π^i = (π^i_1, …, π^i_k)
• and samples j^i ∼ π^i
• and the real payoffs are v^i = (v^i_1, …, v^i_k)
• then it learns only v^i_{j^i}
• report payoff ṽ^i = (0, …, v^i_{j^i}/π^i_{j^i}, …, 0), i.e., ṽ^i_j = v^i_j/π^i_j if j = j^i and 0 otherwise

Lemma 1: the reported payoffs are unbiased estimators of the true payoffs.

Proof:

  E[ṽ^i_j] = E[ṽ^i_j | j^i = j] · Pr[j^i = j] + E[ṽ^i_j | j^i ≠ j] · Pr[j^i ≠ j]
           = v^i_j/π^i_j · π^i_j + 0 · (1 − π^i_j)
           = v^i_j,

so E[ṽ^i] = v^i.

Note: "online learning works with unbiased estimators of payoffs"
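A minimal sketch of Idea 1's estimator with an empirical check of Lemma 1 (the function and variable names are mine): the coordinate-wise average of the reports ṽ^i converges to the true v^i, even though each round reveals only one payoff.

    import random

    pi = [0.5, 0.3, 0.2]   # pi^i, every entry > 0
    v = [1.0, 0.0, 0.5]    # true payoffs v^i, hidden from the learner

    def report(pi, v):
        # one round: sample j^i ~ pi^i, observe only v^i_{j^i},
        # report (0, ..., v^i_{j^i}/pi^i_{j^i}, ..., 0)
        j = random.choices(range(len(pi)), weights=pi)[0]
        vt = [0.0] * len(pi)
        vt[j] = v[j] / pi[j]
        return vt

    n = 200_000
    mean = [0.0] * len(v)
    for _ in range(n):
        for j, x in enumerate(report(pi, v)):
            mean[j] += x / n

    print(mean)  # approaches v = [1.0, 0.0, 0.5], as Lemma 1 promises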
• the reported payoffs lie in [0, h̃] for h̃ = max_{i,j} v^i_j/π^i_j
• if π^i_j is small, then ṽ^i_j = v^i_j/π^i_j can be big!

Challenge 2: keep h̃ small

Idea 2: pick a random action with some minimal probability ϵ/k

Lemma 2: if π^i_j ≥ ϵ/k, then ṽ^i_j ≤ h̃ = kh/ϵ

Proof: ṽ^i_j = v^i_j/π^i_j ≤ h/(ϵ/k) = kh/ϵ

Note: explore-vs-exploit tradeoff with ϵ

Thm: for payoffs in [0, h̃], if OLA satisfies

  E[OLA] ≥ (1 − ϵ) OPT − h̃/ϵ ln k,

then for payoffs in [0, h], MAB satisfies

  E[MAB] ≥ (1 − 2ϵ) OPT − hk/ϵ² ln k.

Recall: Exponential Weights (EW) satisfies the assumption of the Thm.

Cor: for payoffs in [0, h], MAB-EW satisfies vanishing per-round regret.
Proof: similar to before.
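A sketch of the full reduction applied to exponential weights, under my reading of the notes (the multiplicative-update form, the parameter names, and the demo at the end are assumptions of this sketch, not pinned down by the text): EW maintains π^i over the reported payoffs ṽ^i, while MAB actually plays π̃^i = (1 − ϵ)π^i + ϵ/k; dividing the observed payoff by the probability it was played with keeps the estimate unbiased and, by Lemma 2, bounded by kh/ϵ.

    import math
    import random

    def mab_ew(payoff, k, n, h, eps, eta):
        # payoff(i, j) in [0, h]: hidden payoff of action j in round i
        h_tilde = k * h / eps            # range of the reported payoffs (Lemma 2)
        w = [1.0] * k                    # EW weights
        total = 0.0
        for i in range(n):
            s = sum(w)
            pi = [wj / s for wj in w]                       # OLA's pi^i
            pit = [(1 - eps) * pj + eps / k for pj in pi]   # mixed pi~^i, every entry >= eps/k
            j = random.choices(range(k), weights=pit)[0]    # play j^i ~ pi~^i
            vj = payoff(i, j)                               # learn only the played payoff
            total += vj
            vt = vj / pit[j]                                # reported payoff, at most k*h/eps
            w[j] *= math.exp(eta * vt / h_tilde)            # EW update on the nonzero coordinate
            m = max(w)
            w = [wj / m for wj in w]                        # renormalize to avoid overflow
        return total

    # demo: Bernoulli payoffs, best arm has mean 0.8
    random.seed(0)
    means = [0.2, 0.4, 0.6, 0.8, 0.5]
    got = mab_ew(lambda i, j: float(random.random() < means[j]),
                 k=5, n=20_000, h=1.0, eps=0.1, eta=0.5)
    print(got / 20_000)  # average payoff per round; well above the 0.5 uniform-play baseline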
Exercise: MAB-EW Analysis
Recall: the per-round regret of exponential weights is 2h √(ln k / n)
• dependence on h is O(h)
• dependence on n is O(√(1/n))
• dependence on k is O(√(log k))
Setup:
• payoffs in [0, h]
• apply the multi-armed-bandit reduction to the exponential weights algorithm
• Theorem: E[MAB] ≥ (1 − 2ϵ) OPT − hk/ϵ² ln k
• optimally tune the learning rate ϵ for n rounds

Question: analyze the per-round regret; what is the dependence on
• the maximum payoff h?
• the number of rounds n?
• the number of actions k?
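One way to carry out the tuning (a sketch of the intended calculation; the notes leave this as the exercise): since OPT ≤ hn, the Theorem gives total regret at most

  2ϵ OPT + hk/ϵ² ln k ≤ 2ϵhn + hk/ϵ² ln k.

Setting the derivative in ϵ to zero gives 2hn = 2hk ln k/ϵ³, i.e., ϵ = (k ln k/n)^{1/3}, at which point both terms are O(hϵn), so

  per-round regret ≤ 3h (k ln k/n)^{1/3}.

Thus the dependence on h is still O(h), the dependence on n degrades from O(√(1/n)) to O((1/n)^{1/3}), and the dependence on k degrades from O(√(log k)) to O((k log k)^{1/3}).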
Proof of Thm: "E[MAB] ≥ (1 − 2ϵ) OPT − hk/ϵ² ln k"

0. let
• R = h̃/ϵ ln k
• j* = argmax_j Σ_i v^i_j

1. what does OLA guarantee? for any ṽ^1, …, ṽ^n:

  OLA = Σ_i π^i · ṽ^i ≥ (1 − ϵ) Σ_i ṽ^i_{j*} − R

Taking expectations over the algorithm's randomness:

  E_{π,ṽ}[OLA] = E_{π,ṽ}[Σ_i π^i · ṽ^i] ≥ (1 − ϵ) E_ṽ[Σ_i ṽ^i_{j*}] − R

By Lemma 1, E_ṽ[ṽ^i_{j*}] = v^i_{j*}; and conditioning on π^i,

  E_{π^i,ṽ^i}[π^i · ṽ^i] = Σ_{π^i} E_{π^i,ṽ^i}[π^i · ṽ^i | π^i] · Pr[π^i]
                         = Σ_{π^i} (π^i · v^i) · Pr[π^i]
                         = E_{π^i}[π^i · v^i],

so

  E_π[Σ_i π^i · v^i] ≥ (1 − ϵ) Σ_i v^i_{j*} − R = (1 − ϵ) OPT − R.    (1)

2. what does MAB get? it plays π̃^i = (1 − ϵ) π^i + ϵ/k, so

  MAB = Σ_i π̃^i · v^i
      = (1 − ϵ) Σ_i π^i · v^i + (ϵ/k) Σ_i Σ_j v^i_j
      ≥ (1 − ϵ) Σ_i π^i · v^i.    (2)
3. Combine (1) and (2), plug in R, Lemma 2: taking expectations in (2),

  E[MAB] ≥ (1 − ϵ) E_π[Σ_i π^i · v^i]
         ≥ (1 − ϵ) ((1 − ϵ) OPT − R)
         ≥ (1 − 2ϵ) OPT − R              (using (1 − ϵ)² ≥ 1 − 2ϵ and (1 − ϵ)R ≤ R)
         = (1 − 2ϵ) OPT − hk/ϵ² ln k     (R = h̃/ϵ ln k with h̃ = kh/ϵ, by Lemma 2)
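Plugging illustrative numbers into the final bound (my example, not from the notes): with ϵ = 0.1, h = 1, k = 10, the guarantee reads E[MAB] ≥ 0.8 · OPT − (10/0.01) ln 10 ≈ 0.8 · OPT − 2303, which is vacuous unless OPT ≫ 2303; this is why ϵ must be tuned to the horizon n, as in the Exercise above.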