Lec05 Multi Armed Bandit

The document discusses the multi-armed bandit (MAB) learning model in the context of online learning, focusing on strategies for optimizing payoffs over multiple rounds. It emphasizes the tradeoff between exploration and exploitation, introduces unbiased estimators for the reported payoffs, and outlines the performance guarantees of algorithms like Exponential Weights (EW). The key takeaway is that a bandit algorithm can achieve performance close to that of the best action in hindsight by applying online learning techniques with careful tuning of the learning rate.


CS 332: Online Markets (Online) Multi-armed Bandit Learning

“online learning with partial information”


Lecture 5: Multi-armed Bandit Learning

Last Time:
• online learning
• best in hindsight
• regret
• exponential weights
• learning rates

Today:
• multi-armed bandit learning
• reduction to online learning


Model:
• k actions
• n rounds
• action j's payoff in round i: v^i_j ∈ [0, h]
• in round i:
  (a) choose an action j^i
  (b) learn payoff v^i_{j^i}
  (c) obtain payoff v^i_{j^i}
• payoff ALG = Σ_{i=1}^n v^i_{j^i}

Goal: payoff close to best action in hindsight.

Note: identical to online learning except only learn v^i_{j^i} and not (v^i_1, . . . , v^i_k).

Note: if we don't play an action j, we can't learn if j is good.

Challenge: tradeoff explore versus exploit.
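
To make the protocol concrete, here is a minimal sketch (Python, not from the notes) of the interaction: the algorithm picks an action each round, sees only that action's payoff, and is scored against the best single action in hindsight. The run_bandit helper, the UniformRandom placeholder strategy, and the random payoff matrix are all made up for illustration.

    import random

    def run_bandit(alg, v, k):
        """Play n rounds; the algorithm sees only the payoff of the action it chose."""
        n = len(v)
        alg_payoff = 0.0
        for i in range(n):
            j = alg.choose()             # (a) choose an action j^i
            alg_payoff += v[i][j]        # (c) obtain payoff v^i_{j^i}
            alg.observe(j, v[i][j])      # (b) learn only the chosen action's payoff
        best_in_hindsight = max(sum(v[i][j] for i in range(n)) for j in range(k))
        return alg_payoff, best_in_hindsight, (best_in_hindsight - alg_payoff) / n

    class UniformRandom:
        """Placeholder strategy: pure exploration, never exploits what it has learned."""
        def __init__(self, k):
            self.k = k
        def choose(self):
            return random.randrange(self.k)
        def observe(self, j, payoff):
            pass

    if __name__ == "__main__":
        k, n, h = 3, 1000, 1.0
        v = [[random.uniform(0, h) for _ in range(k)] for _ in range(n)]  # made-up payoffs in [0, h]
        print(run_bandit(UniformRandom(k), v, k))
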
Exercise: Expected Payoff

Setup:
• online learning, k = 2 actions
• probabilities the algorithm selects each action in round i are:

    π^i = (π^i_1, π^i_2) = (2/3, 1/3)

• payoffs of each action in round i are:

    v^i = (v^i_1, v^i_2) = (3, 9)

Question: What is the expected payoff of the algorithm in round i?
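
A quick numeric check of the exercise (a sketch, not part of the notes): draw j^i from π^i many times and compare the empirical average payoff with the exact expectation.

    import random

    pi = (2/3, 1/3)   # probabilities of selecting actions 1 and 2 in round i
    v  = (3, 9)       # payoffs of actions 1 and 2 in round i

    exact = sum(p * x for p, x in zip(pi, v))   # sum over j of Pr[j^i = j] * v^i_j

    trials = 100_000
    empirical = sum(random.choices(v, weights=pi)[0] for _ in range(trials)) / trials

    print(exact, round(empirical, 2))   # both close to 5.0
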
Reducing MAB to Online Learning

Approach: reduce partial information to full information.

“solve the multi-armed bandit problem with an online learning algorithm”

 -----              -----                  -----
|     |     pi     |     |   (1-eps)pi    |     |
|  O  | ---------> |  M  |    + eps/k     |  W  |
|  L  |            |  A  | -------------> |  O  |
|  A  |  (0,...,   |  B  |                |  R  |
|     |  vj/pij,   |     |       vj       |  L  |
|     |   ...,0)   |     | <------------- |  D  |
|     | <--------- |     |                |     |
 -----              -----                  -----

Notation: Online Algorithm (OLA)
• in round i:
  • probabilities of actions π^i = (π^i_1, . . . , π^i_k)
  • choose action j^i ∼ π^i
  • payoffs v^i = (v^i_1, . . . , v^i_k)
• expected payoff:

    E[v^i_{j^i}] = Σ_j E[v^i_{j^i} | j^i = j] Pr[j^i = j]
                 = Σ_j v^i_j π^i_j
                 = v^i · π^i   (vector dot product)

Recall Thm: for payoffs in [0, h̃], there exists an OLA such that

    E[OLA] ≥ (1 − ϵ) OPT − (h̃/ϵ) ln k

Challenge 1: what to report to the algorithm?

Def: random variable Y is an unbiased estimator of random variable X if E[Y] = E[X].

Example: the mean of 5 randomly sampled students' grades is an unbiased estimator of the mean of all students' grades.
Idea 1: give the algorithm unbiased estimators of the payoffs.
• if the algorithm uses probabilities π^i = (π^i_1, . . . , π^i_k)
• and samples j^i ∼ π^i
• real payoffs are v^i = (v^i_1, . . . , v^i_k)
• learn only v^i_{j^i}
• report payoff ṽ^i = (0, . . . , v^i_{j^i}/π^i_{j^i}, . . . , 0)

Lemma 1: the reported payoffs are unbiased estimators of the true payoffs.

Proof:

    E[ṽ^i_j] = E[ṽ^i_j | j^i = j] · Pr[j^i = j] + E[ṽ^i_j | j^i ≠ j] · Pr[j^i ≠ j]
             = (v^i_j/π^i_j) · π^i_j + 0 · (1 − π^i_j)
             = v^i_j

    so E[ṽ^i] = v^i.
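
A small simulation (illustration only, not from the notes) of Lemma 1: reporting v^i_j/π^i_j on the sampled coordinate and 0 elsewhere averages out to the true payoff vector. The probabilities and payoffs below are made up.

    import random

    k = 3
    pi = [0.5, 0.3, 0.2]    # algorithm's probabilities in round i (made up)
    v  = [4.0, 1.0, 7.0]    # true payoffs in round i (made up)

    trials = 200_000
    avg_report = [0.0] * k
    for _ in range(trials):
        j = random.choices(range(k), weights=pi)[0]   # sample j^i ~ pi^i
        avg_report[j] += (v[j] / pi[j]) / trials      # report v^i_j / pi^i_j on coordinate j, 0 elsewhere

    print([round(x, 2) for x in avg_report])   # approximately [4.0, 1.0, 7.0]
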
Note:
• reported payoffs are in [0, h̃] for h̃ = max_{i,j} v^i_j/π^i_j
• if π^i_j is small, then ṽ^i_j = v^i_j/π^i_j can be big!

Challenge 2: keep h̃ small.

Idea 2: pick a random action with some minimal probability ϵ/k.

Lemma 2: if π^i_j ≥ ϵ/k then ṽ^i_j ≤ h̃ = kh/ϵ

Proof: ṽ^i_j = v^i_j/π^i_j ≤ h/(ϵ/k) = kh/ϵ

Note: explore-vs-exploit tradeoff with ϵ.
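
A tiny numeric illustration of Idea 2 and Lemma 2 (made-up numbers): after mixing with the uniform distribution, every action has probability at least ϵ/k, so no reported payoff can exceed kh/ϵ.

    k, h, eps = 4, 1.0, 0.2
    pi = [0.97, 0.01, 0.01, 0.01]                       # the OLA may nearly ignore some actions
    pi_mixed = [(1 - eps) * p + eps / k for p in pi]    # explore uniformly with probability eps

    print([round(p, 3) for p in pi_mixed])              # [0.826, 0.058, 0.058, 0.058], each >= eps/k = 0.05
    print(max(h / p for p in pi_mixed) <= k * h / eps)  # True: worst-case report is at most k*h/eps = 20
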
Alg: MAB Reduction to OLA

In round i:
1. π^i ← OLA
2. draw j^i ∼ π̃^i with π̃^i_j = (1 − ϵ) π^i_j + ϵ/k
3. take action j^i
4. report ṽ^i to OLA with

    ṽ^i_j = v^i_j/π̃^i_j   if j = j^i,
    ṽ^i_j = 0              otherwise.

Thm: for payoffs in [0, h̃], if OLA satisfies

    E[OLA] ≥ (1 − ϵ) OPT − (h̃/ϵ) ln k

then for payoffs in [0, h], MAB satisfies

    E[MAB] ≥ (1 − 2ϵ) OPT − (hk/ϵ²) ln k

Recall: Exponential Weights (EW) satisfies the assumption of the Thm.

Cor: for payoffs in [0, h], MAB-EW satisfies vanishing per-round regret.

Proof: similar to before.
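
To tie the pieces together, here is a runnable sketch of the reduction (not official course code). The OLA is an exponential-weights-style learner; the multiplicative update w_j ← w_j · (1 + ϵ)^(ṽ_j/h̃) is one standard rule with a guarantee of the assumed form, though the previous lecture may have used a different variant. The class names, the Gaussian payoff stream, and the choice ϵ = (k ln k / n)^(1/3) are illustrative assumptions.

    import math
    import random

    class ExponentialWeights:
        """Full-information online learner (the OLA); reported payoffs assumed in [0, h_tilde]."""
        def __init__(self, k, eps, h_tilde):
            self.k, self.eps, self.h_tilde = k, eps, h_tilde
            self.w = [1.0] * k

        def distribution(self):          # pi^i, proportional to the weights
            total = sum(self.w)
            return [wj / total for wj in self.w]

        def report(self, v_tilde):       # assumed update rule: w_j <- w_j * (1+eps)^(v_j / h_tilde)
            for j in range(self.k):
                self.w[j] *= (1 + self.eps) ** (v_tilde[j] / self.h_tilde)

    class MABReduction:
        """The Alg above: mix in eps/k exploration, report importance-weighted payoff estimates."""
        def __init__(self, k, eps, h):
            self.k, self.eps = k, eps
            self.ola = ExponentialWeights(k, eps, h_tilde=k * h / eps)  # h_tilde = kh/eps by Lemma 2
            self.pi_mixed = None

        def choose(self):
            pi = self.ola.distribution()                                           # 1. pi <- OLA
            self.pi_mixed = [(1 - self.eps) * p + self.eps / self.k for p in pi]   # 2. mix with eps/k
            return random.choices(range(self.k), weights=self.pi_mixed)[0]         # 2-3. draw and take j^i

        def observe(self, j, payoff):
            v_tilde = [0.0] * self.k
            v_tilde[j] = payoff / self.pi_mixed[j]      # 4. unbiased estimate on the played coordinate
            self.ola.report(v_tilde)

    if __name__ == "__main__":
        random.seed(0)
        k, n, h = 5, 20_000, 1.0
        eps = (k * math.log(k) / n) ** (1 / 3)     # one reasonable tuning (see the end of the notes)
        means = [0.3, 0.5, 0.4, 0.7, 0.45]         # made-up per-action payoff means
        alg = MABReduction(k, eps, h)
        alg_payoff, best = 0.0, [0.0] * k
        for _ in range(n):
            v = [min(h, max(0.0, random.gauss(m, 0.1))) for m in means]
            j = alg.choose()
            alg_payoff += v[j]
            alg.observe(j, v[j])
            best = [b + x for b, x in zip(best, v)]
        print("ALG:", round(alg_payoff), "OPT:", round(max(best)),
              "per-round regret:", round((max(best) - alg_payoff) / n, 3))
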

Exercise: MAB-EW Analysis

Recall: the per-round regret of exponential weights is 2h √(ln k / n)
• dependence on h is O(h)
• dependence on n is O(√(1/n))
• dependence on k is O(√(log k))

Setup:
• payoffs in [0, h]
• apply the multi-armed-bandit reduction to the exponential weights algorithm
• Theorem: E[MAB] ≥ (1 − 2ϵ) OPT − (hk/ϵ²) ln k
• optimally tune the learning rate ϵ for n rounds

Question: analyze the per-round regret; what is the dependence on
• maximum payoff h?
• number of rounds n?
• number of actions k?


“online learning works with unbiased estimators of payoffs”

Proof of Thm: “E[MAB] ≥ (1 − 2ϵ) OPT − (hk/ϵ²) ln k”

0. Let
   • R = (h̃/ϵ) ln k
   • j* = argmax_j Σ_i v^i_j   (so OPT = Σ_i v^i_{j*})

1. What does OLA guarantee?

   For any ṽ^1, . . . , ṽ^n:

       OLA = Σ_i π^i · ṽ^i ≥ (1 − ϵ) Σ_i ṽ^i_{j*} − R

   Taking expectations (the right-hand side uses Lemma 1, E_ṽ[ṽ^i_{j*}] = v^i_{j*}):

       E_{π,ṽ}[OLA] = Σ_i E_{π,ṽ}[π^i · ṽ^i] ≥ (1 − ϵ) Σ_i E_ṽ[ṽ^i_{j*}] − R

       Σ_i E_π[π^i · v^i] ≥ (1 − ϵ) Σ_i v^i_{j*} − R

   For the left-hand side:

       E_{π^i,ṽ^i}[π^i · ṽ^i] = Σ_{π^i} E_{ṽ^i}[π^i · ṽ^i | π^i] · Pr[π^i]
                              = Σ_{π^i} (π^i · v^i) · Pr[π^i]
                              = E_{π^i}[π^i · v^i]

2. What is MAB's performance?

       MAB = Σ_i π̃^i · v^i
           = (1 − ϵ) Σ_i π^i · v^i + (ϵ/k) Σ_i Σ_j v^i_j
           ≥ (1 − ϵ) Σ_i π^i · v^i

3. Combine (1) and (2), plug in R and Lemma 2:

       E[MAB] ≥ (1 − ϵ) [(1 − ϵ) OPT − R]
              ≥ (1 − 2ϵ) OPT − R
              = (1 − 2ϵ) OPT − (h̃/ϵ) ln k
              = (1 − 2ϵ) OPT − (hk/ϵ²) ln k

   (using (1 − ϵ)² ≥ 1 − 2ϵ and, from Lemma 2, h̃ = kh/ϵ)
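
To finish the Cor (and the MAB-EW exercise above): since OPT ≤ hn, the bound just derived gives per-round regret at most 2ϵh + (hk ln k)/(ϵ²n); the two terms balance at ϵ = (k ln k / n)^(1/3), giving per-round regret on the order of h (k ln k / n)^(1/3), which vanishes as n grows. A quick numeric check of this tuning (a sketch, not from the notes):

    import math

    def per_round_regret_bound(eps, h, k, n):
        # from OPT - E[MAB] <= 2*eps*OPT + (h*k/eps^2) ln k, with OPT <= h*n, divided by n
        return 2 * eps * h + h * k * math.log(k) / (eps ** 2 * n)

    h, k, n = 1.0, 10, 1_000_000
    eps_star = (k * math.log(k) / n) ** (1 / 3)   # balances the two terms
    grid = [i / 1000 for i in range(1, 1000)]
    eps_grid = min(grid, key=lambda e: per_round_regret_bound(e, h, k, n))

    print(round(eps_star, 4), round(eps_grid, 4))                 # analytic vs. grid-search epsilon
    print(round(per_round_regret_bound(eps_star, h, k, n), 4),
          round(3 * h * (k * math.log(k) / n) ** (1 / 3), 4))     # both about 3h (k ln k / n)^(1/3)
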
