Lec05 Multi Armed Bandit

The document discusses the multi-armed bandit (MAB) learning model in the context of online learning, focusing on strategies for optimizing payoffs over multiple rounds. It emphasizes the tradeoff between exploration and exploitation, introduces unbiased estimators for the reported payoffs, and outlines the performance guarantees of algorithms like Exponential Weights (EW). The key takeaway is that a bandit algorithm can achieve performance close to that of the best action in hindsight by applying online learning techniques with careful tuning of the learning rate.


CS 332: Online Markets (Online) Multi-armed Bandit Learning

“online learning with partial information”


Lecture 5: Multi-armed Bandit Learning

Last Time:
• online learning
• best in hindsight
• regret
• exponential weights
• learning rates

Today:
• multi-armed bandit learning
• reduction to online learning


Model:
• k actions
• n rounds
• action j's payoff in round i: v^i_j ∈ [0, h]
• in round i:
  (a) choose an action j^i
  (b) learn payoff v^i_{j^i}
  (c) obtain payoff v^i_{j^i}
• payoff ALG = Σ_{i=1}^n v^i_{j^i}

Goal: payoff close to best action in hindsight.

Note: identical to online learning except only learn v^i_{j^i} and not (v^i_1, . . . , v^i_k).

Note: if we don't play an action j, we can't learn if j is good.

Challenge: tradeoff explore versus exploit.
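
To make the protocol concrete, here is a minimal sketch (Python, not from the notes) of the interaction: the algorithm picks an action each round, sees only that action's payoff, and is scored against the best single action in hindsight. The run_bandit helper, the UniformRandom placeholder strategy, and the random payoff matrix are all made up for illustration.

    import random

    def run_bandit(alg, v, k):
        """Play n rounds; the algorithm sees only the payoff of the action it chose."""
        n = len(v)
        alg_payoff = 0.0
        for i in range(n):
            j = alg.choose()             # (a) choose an action j^i
            alg_payoff += v[i][j]        # (c) obtain payoff v^i_{j^i}
            alg.observe(j, v[i][j])      # (b) learn only the chosen action's payoff
        best_in_hindsight = max(sum(v[i][j] for i in range(n)) for j in range(k))
        return alg_payoff, best_in_hindsight, (best_in_hindsight - alg_payoff) / n

    class UniformRandom:
        """Placeholder strategy: pure exploration, never exploits what it has learned."""
        def __init__(self, k):
            self.k = k
        def choose(self):
            return random.randrange(self.k)
        def observe(self, j, payoff):
            pass

    if __name__ == "__main__":
        k, n, h = 3, 1000, 1.0
        v = [[random.uniform(0, h) for _ in range(k)] for _ in range(n)]  # made-up payoffs in [0, h]
        print(run_bandit(UniformRandom(k), v, k))
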
Exercise: Expected Payoff

Setup:
• online learning, k = 2 actions
• probabilities the algorithm selects each action in round i are:

    π^i = (π^i_1, π^i_2) = (2/3, 1/3)

• payoffs of each action in round i are:

    v^i = (v^i_1, v^i_2) = (3, 9)

Question: What is the expected payoff of the algorithm in round i?
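
A quick numeric check of the exercise (a sketch, not part of the notes): draw j^i from π^i many times and compare the empirical average payoff with the exact expectation.

    import random

    pi = (2/3, 1/3)   # probabilities of selecting actions 1 and 2 in round i
    v  = (3, 9)       # payoffs of actions 1 and 2 in round i

    exact = sum(p * x for p, x in zip(pi, v))   # sum over j of Pr[j^i = j] * v^i_j

    trials = 100_000
    empirical = sum(random.choices(v, weights=pi)[0] for _ in range(trials)) / trials

    print(exact, round(empirical, 2))   # both close to 5.0
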
Reducing MAB to Online Learning

Approach: reduce partial information to full information.

“solve the multi-armed bandit problem with an online learning algorithm”

 -----              -----                  -----
|     |     pi     |     |   (1-eps)pi    |     |
|  O  | ---------> |  M  |    + eps/k     |  W  |
|  L  |            |  A  | -------------> |  O  |
|  A  |  (0,...,   |  B  |                |  R  |
|     |  vj/pij,   |     |       vj       |  L  |
|     |   ...,0)   |     | <------------- |  D  |
|     | <--------- |     |                |     |
 -----              -----                  -----

Notation: Online Algorithm (OLA)
• in round i:
  • probabilities of actions π^i = (π^i_1, . . . , π^i_k)
  • choose action j^i ∼ π^i
  • payoffs v^i = (v^i_1, . . . , v^i_k)
• expected payoff:

    E[v^i_{j^i}] = Σ_j E[v^i_{j^i} | j^i = j] Pr[j^i = j]
                 = Σ_j v^i_j π^i_j
                 = v^i · π^i   (vector dot product)

Recall Thm: for payoffs in [0, h̃], there exists an OLA such that

    E[OLA] ≥ (1 − ϵ) OPT − (h̃/ϵ) ln k

Challenge 1: what to report to the algorithm?

Def: random variable Y is an unbiased estimator of random variable X if E[Y] = E[X].

Example: the mean of 5 randomly sampled students' grades is an unbiased estimator of the mean of all students' grades.
Idea 1: give the algorithm unbiased estimators of the payoffs.
• if the algorithm uses probabilities π^i = (π^i_1, . . . , π^i_k)
• and samples j^i ∼ π^i
• real payoffs are v^i = (v^i_1, . . . , v^i_k)
• learn only v^i_{j^i}
• report payoff ṽ^i = (0, . . . , v^i_{j^i}/π^i_{j^i}, . . . , 0)

Lemma 1: the reported payoffs are unbiased estimators of the true payoffs.

Proof:

    E[ṽ^i_j] = E[ṽ^i_j | j^i = j] · Pr[j^i = j] + E[ṽ^i_j | j^i ≠ j] · Pr[j^i ≠ j]
             = (v^i_j/π^i_j) · π^i_j + 0 · (1 − π^i_j)
             = v^i_j

    so E[ṽ^i] = v^i.
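
A small simulation (illustration only, not from the notes) of Lemma 1: reporting v^i_j/π^i_j on the sampled coordinate and 0 elsewhere averages out to the true payoff vector. The probabilities and payoffs below are made up.

    import random

    k = 3
    pi = [0.5, 0.3, 0.2]    # algorithm's probabilities in round i (made up)
    v  = [4.0, 1.0, 7.0]    # true payoffs in round i (made up)

    trials = 200_000
    avg_report = [0.0] * k
    for _ in range(trials):
        j = random.choices(range(k), weights=pi)[0]   # sample j^i ~ pi^i
        avg_report[j] += (v[j] / pi[j]) / trials      # report v^i_j / pi^i_j on coordinate j, 0 elsewhere

    print([round(x, 2) for x in avg_report])   # approximately [4.0, 1.0, 7.0]
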
Note:
• reported payoffs are in [0, h̃] for h̃ = max_{i,j} v^i_j/π^i_j
• if π^i_j is small, then ṽ^i_j = v^i_j/π^i_j can be big!

Challenge 2: keep h̃ small.

Idea 2: pick a random action with some minimal probability ϵ/k.

Lemma 2: if π^i_j ≥ ϵ/k then ṽ^i_j ≤ h̃ = kh/ϵ

Proof: ṽ^i_j = v^i_j/π^i_j ≤ h/(ϵ/k) = kh/ϵ

Note: explore-vs-exploit tradeoff with ϵ.
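
A tiny numeric illustration of Idea 2 and Lemma 2 (made-up numbers): after mixing with the uniform distribution, every action has probability at least ϵ/k, so no reported payoff can exceed kh/ϵ.

    k, h, eps = 4, 1.0, 0.2
    pi = [0.97, 0.01, 0.01, 0.01]                       # the OLA may nearly ignore some actions
    pi_mixed = [(1 - eps) * p + eps / k for p in pi]    # explore uniformly with probability eps

    print([round(p, 3) for p in pi_mixed])              # [0.826, 0.058, 0.058, 0.058], each >= eps/k = 0.05
    print(max(h / p for p in pi_mixed) <= k * h / eps)  # True: worst-case report is at most k*h/eps = 20
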
Alg: MAB Reduction to OLA

In round i:
1. π^i ← OLA
2. draw j^i ∼ π̃^i with π̃^i_j = (1 − ϵ) π^i_j + ϵ/k
3. take action j^i
4. report ṽ^i to OLA with

    ṽ^i_j = v^i_j/π̃^i_j   if j = j^i,
    ṽ^i_j = 0              otherwise.

Thm: for payoffs in [0, h̃], if OLA satisfies

    E[OLA] ≥ (1 − ϵ) OPT − (h̃/ϵ) ln k

then for payoffs in [0, h], MAB satisfies

    E[MAB] ≥ (1 − 2ϵ) OPT − (hk/ϵ²) ln k

Recall: Exponential Weights (EW) satisfies the assumption of the Thm.

Cor: for payoffs in [0, h], MAB-EW satisfies vanishing per-round regret.

Proof: similar to before.
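
To tie the pieces together, here is a runnable sketch of the reduction (not official course code). The OLA is an exponential-weights-style learner; the multiplicative update w_j ← w_j · (1 + ϵ)^(ṽ_j/h̃) is one standard rule with a guarantee of the assumed form, though the previous lecture may have used a different variant. The class names, the Gaussian payoff stream, and the choice ϵ = (k ln k / n)^(1/3) are illustrative assumptions.

    import math
    import random

    class ExponentialWeights:
        """Full-information online learner (the OLA); reported payoffs assumed in [0, h_tilde]."""
        def __init__(self, k, eps, h_tilde):
            self.k, self.eps, self.h_tilde = k, eps, h_tilde
            self.w = [1.0] * k

        def distribution(self):          # pi^i, proportional to the weights
            total = sum(self.w)
            return [wj / total for wj in self.w]

        def report(self, v_tilde):       # assumed update rule: w_j <- w_j * (1+eps)^(v_j / h_tilde)
            for j in range(self.k):
                self.w[j] *= (1 + self.eps) ** (v_tilde[j] / self.h_tilde)

    class MABReduction:
        """The Alg above: mix in eps/k exploration, report importance-weighted payoff estimates."""
        def __init__(self, k, eps, h):
            self.k, self.eps = k, eps
            self.ola = ExponentialWeights(k, eps, h_tilde=k * h / eps)  # h_tilde = kh/eps by Lemma 2
            self.pi_mixed = None

        def choose(self):
            pi = self.ola.distribution()                                           # 1. pi <- OLA
            self.pi_mixed = [(1 - self.eps) * p + self.eps / self.k for p in pi]   # 2. mix with eps/k
            return random.choices(range(self.k), weights=self.pi_mixed)[0]         # 2-3. draw and take j^i

        def observe(self, j, payoff):
            v_tilde = [0.0] * self.k
            v_tilde[j] = payoff / self.pi_mixed[j]      # 4. unbiased estimate on the played coordinate
            self.ola.report(v_tilde)

    if __name__ == "__main__":
        random.seed(0)
        k, n, h = 5, 20_000, 1.0
        eps = (k * math.log(k) / n) ** (1 / 3)     # one reasonable tuning (see the end of the notes)
        means = [0.3, 0.5, 0.4, 0.7, 0.45]         # made-up per-action payoff means
        alg = MABReduction(k, eps, h)
        alg_payoff, best = 0.0, [0.0] * k
        for _ in range(n):
            v = [min(h, max(0.0, random.gauss(m, 0.1))) for m in means]
            j = alg.choose()
            alg_payoff += v[j]
            alg.observe(j, v[j])
            best = [b + x for b, x in zip(best, v)]
        print("ALG:", round(alg_payoff), "OPT:", round(max(best)),
              "per-round regret:", round((max(best) - alg_payoff) / n, 3))
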

Exercise: MAB-EW Analysis

Recall: the per-round regret of exponential weights is 2h √(ln k / n)
• dependence on h is O(h)
• dependence on n is O(√(1/n))
• dependence on k is O(√(log k))

Setup:
• payoffs in [0, h]
• apply the multi-armed-bandit reduction to the exponential weights algorithm
• Theorem: E[MAB] ≥ (1 − 2ϵ) OPT − (hk/ϵ²) ln k
• optimally tune the learning rate ϵ for n rounds

Question: analyze the per-round regret; what is the dependence on
• maximum payoff h?
• number of rounds n?
• number of actions k?


“online learning works with unbiased estimators of payoffs”

Proof of Thm: “E[MAB] ≥ (1 − 2ϵ) OPT − (hk/ϵ²) ln k”

0. Let
   • R = (h̃/ϵ) ln k
   • j* = argmax_j Σ_i v^i_j   (so OPT = Σ_i v^i_{j*})

1. What does OLA guarantee?

   For any ṽ^1, . . . , ṽ^n:

       OLA = Σ_i π^i · ṽ^i ≥ (1 − ϵ) Σ_i ṽ^i_{j*} − R

   Taking expectations (the right-hand side uses Lemma 1, E_ṽ[ṽ^i_{j*}] = v^i_{j*}):

       E_{π,ṽ}[OLA] = Σ_i E_{π,ṽ}[π^i · ṽ^i] ≥ (1 − ϵ) Σ_i E_ṽ[ṽ^i_{j*}] − R

       Σ_i E_π[π^i · v^i] ≥ (1 − ϵ) Σ_i v^i_{j*} − R

   For the left-hand side:

       E_{π^i,ṽ^i}[π^i · ṽ^i] = Σ_{π^i} E_{ṽ^i}[π^i · ṽ^i | π^i] · Pr[π^i]
                              = Σ_{π^i} (π^i · v^i) · Pr[π^i]
                              = E_{π^i}[π^i · v^i]

2. What is MAB's performance?

       MAB = Σ_i π̃^i · v^i
           = (1 − ϵ) Σ_i π^i · v^i + (ϵ/k) Σ_i Σ_j v^i_j
           ≥ (1 − ϵ) Σ_i π^i · v^i

3. Combine (1) and (2), plug in R and Lemma 2:

       E[MAB] ≥ (1 − ϵ) [(1 − ϵ) OPT − R]
              ≥ (1 − 2ϵ) OPT − R
              = (1 − 2ϵ) OPT − (h̃/ϵ) ln k
              = (1 − 2ϵ) OPT − (hk/ϵ²) ln k

   (using (1 − ϵ)² ≥ 1 − 2ϵ and, from Lemma 2, h̃ = kh/ϵ)
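
To finish the Cor (and the MAB-EW exercise above): since OPT ≤ hn, the bound just derived gives per-round regret at most 2ϵh + (hk ln k)/(ϵ²n); the two terms balance at ϵ = (k ln k / n)^(1/3), giving per-round regret on the order of h (k ln k / n)^(1/3), which vanishes as n grows. A quick numeric check of this tuning (a sketch, not from the notes):

    import math

    def per_round_regret_bound(eps, h, k, n):
        # from OPT - E[MAB] <= 2*eps*OPT + (h*k/eps^2) ln k, with OPT <= h*n, divided by n
        return 2 * eps * h + h * k * math.log(k) / (eps ** 2 * n)

    h, k, n = 1.0, 10, 1_000_000
    eps_star = (k * math.log(k) / n) ** (1 / 3)   # balances the two terms
    grid = [i / 1000 for i in range(1, 1000)]
    eps_grid = min(grid, key=lambda e: per_round_regret_bound(e, h, k, n))

    print(round(eps_star, 4), round(eps_grid, 4))                 # analytic vs. grid-search epsilon
    print(round(per_round_regret_bound(eps_star, h, k, n), 4),
          round(3 * h * (k * math.log(k) / n) ** (1 / 3), 4))     # both about 3h (k ln k / n)^(1/3)
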
