Class 3
Multi-Arm Bandits
Sutton and Barto, Chapter 2
The simplest reinforcement learning problem
The Exploration/Exploitation Dilemma
Online decision-making involves a fundamental choice:
• Exploitation: Make the best decision given current information
• Exploration: Gather more information
The best long-term strategy may involve short-term sacrifices:
gather enough information to make the best overall decisions
Examples
Restaurant Selection
Exploitation: Go to your favourite restaurant
Exploration: Try a new restaurant
Online Banner Advertisements
Exploitation: Show the most successful advert
Exploration: Show a different advert
Oil Drilling
Exploitation: Drill at the best known location
Exploration: Drill at a new location
Game Playing
Exploitation: Play the move you believe is best
Exploration: Play an experimental move
You are the algorithm! (bandit1)
The k-armed Bandit Problem
• On each of a sequence of time steps, t = 1, 2, 3, …, you choose an action A_t from k possibilities, and receive a real-valued reward R_t
• The reward depends only on the action taken; each action a has an unknown true value q*(a)
• You must both try actions to learn their values (explore), and prefer those that appear best (exploit)
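As an illustration of this setup, here is a minimal sketch of a k-armed testbed (the class name and the Gaussian choices are assumptions in the spirit of Sutton and Barto's 10-armed testbed, not code from the slides):

import numpy as np

class KArmedBandit:
    """Minimal k-armed bandit: true values q*(a) ~ N(0, 1), rewards ~ N(q*(a), 1)."""
    def __init__(self, k=10, seed=None):
        self.rng = np.random.default_rng(seed)
        self.k = k
        self.q_star = self.rng.normal(0.0, 1.0, size=k)  # unknown true action values

    def step(self, action):
        # Reward depends only on the chosen action and is sampled i.i.d. each step.
        return self.rng.normal(self.q_star[action], 1.0)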
The Exploration/Exploitation Dilemma
Regret
The action-value is the mean reward for action a:
• q*(a) = E[r | a]
The optimal value V* is:
• V* = q*(a*) = max_{a∈A} q*(a)
The regret is the opportunity loss for one step:
• l_t = E[V* − q*(A_t)]
The total regret is the total opportunity loss:
• L_t = E[ Σ_{τ=1}^{t} (V* − q*(A_τ)) ]
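As a small illustration of these definitions (assuming the KArmedBandit sketch above, where the simulator knows the true values):

def total_regret(q_star, actions):
    # Total regret: summed gap between the optimal value V* and the
    # value of each action actually taken.
    v_star = max(q_star)
    return sum(v_star - q_star[a] for a in actions)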
Multi-Armed Bandits
Regret
[Figure: total regret as a function of time-steps for greedy, ε-greedy, and decaying ε-greedy action selection.]
Overview
• Action-value methods
– Epsilon-greedy strategy
– Incremental implementation
– Stationary vs. non-stationary environment
– Optimistic initial values
• UCB action selection
• Gradient bandit algorithms
• Associative search (contextual bandits)
Basics
• Maximize total reward collected
– vs. learning an (optimal) policy (RL)
• An episode is one step
• The right balance of exploration and exploitation is a complex function of
– True values
– Uncertainty
– Number of time steps
– Stationary vs. non-stationary environment
Action-Value Methods
ε-Greedy Action Selection
[Figure: reward distributions of the 10-armed testbed, one distribution per action with mean q*(a) for a = 1, …, 10; each run lasts 1000 steps.]
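A minimal sketch of ε-greedy action selection (illustrative code, not from the slides): with probability ε choose an action at random, otherwise choose greedily with respect to the current estimates.

import numpy as np

def epsilon_greedy(Q, epsilon, rng):
    # Q: current action-value estimates Q_t(a) for all k actions.
    if rng.random() < epsilon:
        return int(rng.integers(len(Q)))   # explore: uniform random action
    return int(np.argmax(Q))               # exploit: greedy action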
ε-Greedy Methods on the 10-Armed Testbed
Averaging ⟶ learning rule
• To simplify notation, let us focus on one action
• We consider only its rewards, and its estimate Q_n after it has received n − 1 rewards:
Q_n ≐ (R_1 + R_2 + ··· + R_{n−1}) / (n − 1)
• How can we do this incrementally (without storing all the rewards)?
• Could store a running sum and count (and divide), or equivalently:
Derivation of incremental update
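A sketch of that derivation (standard material from Sutton and Barto, Section 2.4): expanding the sample average gives an update that needs only the previous estimate and a count.

Q_{n+1} = (1/n) Σ_{i=1}^{n} R_i
        = (1/n) ( R_n + (n − 1) Q_n )
        = Q_n + (1/n) [ R_n − Q_n ]

i.e. NewEstimate ← OldEstimate + StepSize [ Target − OldEstimate ], so only Q_n and n need to be stored.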
Tracking a Non-stationary Problem
Standard stochastic approximation convergence conditions
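For reference, a sketch of the standard results behind this slide (Sutton and Barto, Section 2.5): a constant step size tracks a non-stationary problem, while the classical stochastic-approximation (Robbins-Monro) conditions characterise when the estimates converge.

Constant step size (tracks non-stationarity, never fully converges):
Q_{n+1} = Q_n + α [ R_n − Q_n ],   with constant α ∈ (0, 1]

Convergence with probability 1 requires a step-size sequence α_n(a) with
Σ_n α_n(a) = ∞   and   Σ_n α_n(a)^2 < ∞
(satisfied by the sample average α_n = 1/n, violated by a constant α).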
Optimistic Initial Values
• All methods so far depend on the initial estimates Q_1(a), i.e., they are biased
• So far we have used Q_1(a) = 0
[Figure: % optimal action over the first 1000 plays on the 10-armed testbed for the optimistic, greedy method (Q_1 = 5, ε = 0) compared with a realistic ε-greedy baseline.]
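A minimal sketch of the idea (assuming the KArmedBandit and the sample-average update above; illustrative code): start with optimistic estimates and act purely greedily, so early disappointing rewards push the agent to try the other actions.

import numpy as np

def run_optimistic_greedy(bandit, steps=1000, q_init=5.0):
    Q = np.full(bandit.k, q_init)    # optimistic initial estimates Q_1(a)
    N = np.zeros(bandit.k)           # action counts
    for _ in range(steps):
        a = int(np.argmax(Q))        # purely greedy: epsilon = 0
        r = bandit.step(a)
        N[a] += 1
        Q[a] += (r - Q[a]) / N[a]    # incremental sample-average update
    return Q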
Upper Confidence Bound (UCB) action selection
• A clever way of reducing exploration over time
• Focus on actions whose estimates have a large degree of uncertainty
• Estimate an upper bound on the true action values
• Select the action with the largest (estimated) upper bound
[Figure: average reward over steps, UCB (c = 2) versus ε-greedy (ε = 0.1).]
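A sketch of UCB action selection (the c·sqrt(ln t / N(a)) bonus is the standard form from Sutton and Barto, Section 2.7; the code itself is illustrative):

import numpy as np

def ucb_action(Q, N, t, c=2.0):
    # Pick the action with the largest estimated upper confidence bound
    # Q(a) + c * sqrt(ln t / N(a)); untried actions are selected first.
    untried = np.flatnonzero(N == 0)
    if untried.size > 0:
        return int(untried[0])
    return int(np.argmax(Q + c * np.sqrt(np.log(t) / N)))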
Complexity of UCB Algorithm
Theorem
The UCB algorithm achieves logarithmic asymptotic total regret:
lim_{t→∞} L_t ≤ 8 log t · Σ_{a | Δ_a > 0} Δ_a
where Δ_a = V* − q*(a) is the gap between the optimal value and the value of action a.
Gradient-Bandit Algorithms
• Let H_t(a) be a learned preference for taking action a
[Figure: % optimal action over 1000 steps for gradient bandits with α = 0.1 and α = 0.4, with and without a reward baseline.]
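A minimal sketch of one gradient-bandit update (softmax over preferences with an average-reward baseline, as in Sutton and Barto, Section 2.8; function and variable names are illustrative):

import numpy as np

def gradient_bandit_step(H, reward, baseline, action, alpha=0.1):
    # Policy is a softmax over the preferences H(a); the chosen action's
    # preference rises (and the others fall) when the reward beats the baseline.
    pi = np.exp(H - H.max())
    pi /= pi.sum()
    one_hot = np.zeros_like(H)
    one_hot[action] = 1.0
    return H + alpha * (reward - baseline) * (one_hot - pi)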
Derivation of gradient-bandit algorithm
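The end result of that derivation (stochastic gradient ascent on the expected reward, Sutton and Barto, Section 2.8) is the preference update:

H_{t+1}(a) = H_t(a) + α (R_t − R̄_t) (1_{a = A_t} − π_t(a)),   for all a

where π_t(a) = e^{H_t(a)} / Σ_b e^{H_t(b)} is the softmax policy and R̄_t is the average of the rewards received so far, used as a baseline.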
Summary Comparison of Bandit Algorithms
Conclusions
• These are all simple methods