Class 3
Multi-Arm Bandits
Sutton and Barto, Chapter 2
The simplest reinforcement learning problem
The Exploration/Exploitation Dilemma
Online decision-making involves a fundamental choice:
• Exploitation: Make the best decision given current information
• Exploration: Gather more information
The best long-term strategy may involve short-term sacrifices:
gather enough information to make the best overall decisions
Examples
Restaurant Selection
Exploitation: Go to your favourite restaurant
Exploration: Try a new restaurant
Online Banner Advertisements
Exploitation: Show the most successful advert
Exploration: Show a different advert
Oil Drilling
Exploitation: Drill at the best known location
Exploration: Drill at a new location
Game Playing
Exploitation: Play the move you believe is best
Exploration: Play an experimental move
You are the algorithm! (bandit1)
The k-armed Bandit Problem
• On each of a sequence of time steps, t = 1, 2, 3, …, you choose an action A_t from k possibilities, and receive a real-valued reward R_t
• The reward depends only on the action taken; each action a has an unknown true value q*(a)
• You must both try actions to learn their values (explore), and prefer those that appear best (exploit)
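As an illustration of this setup, here is a minimal sketch of a k-armed testbed (the class name and the Gaussian choices are assumptions in the spirit of Sutton and Barto's 10-armed testbed, not code from the slides):

import numpy as np

class KArmedBandit:
    """Minimal k-armed bandit: true values q*(a) ~ N(0, 1), rewards ~ N(q*(a), 1)."""
    def __init__(self, k=10, seed=None):
        self.rng = np.random.default_rng(seed)
        self.k = k
        self.q_star = self.rng.normal(0.0, 1.0, size=k)  # unknown true action values

    def step(self, action):
        # Reward depends only on the chosen action and is sampled i.i.d. each step.
        return self.rng.normal(self.q_star[action], 1.0)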
The Exploration/Exploitation Dilemma
Regret
The action-value is the mean reward for action a:
• q*(a) = E[r | a]
The optimal value V* is:
• V* = q*(a*) = max_{a∈A} q*(a)
The regret is the opportunity loss for one step:
• l_t = E[V* − q*(A_t)]
The total regret is the total opportunity loss:
• L_t = E[ Σ_{τ=1}^{t} (V* − q*(A_τ)) ]
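As a small illustration of these definitions (assuming the KArmedBandit sketch above, where the simulator knows the true values):

def total_regret(q_star, actions):
    # Total regret: summed gap between the optimal value V* and the
    # value of each action actually taken.
    v_star = max(q_star)
    return sum(v_star - q_star[a] for a in actions)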
Multi-Armed Bandits
Regret
[Figure: total regret as a function of time-steps for greedy, ε-greedy, and decaying ε-greedy action selection.]
Overview
• Action-value methods
– Epsilon-greedy strategy
– Incremental implementation
– Stationary vs. non-stationary environment
– Optimistic initial values
• UCB action selection
• Gradient bandit algorithms
• Associative search (contextual bandits)
Basics
• Maximize total reward collected
– vs. learning an (optimal) policy (RL)
• An episode is one step
• The right balance of exploration and exploitation is a complex function of
– True values
– Uncertainty
– Number of time steps
– Stationary vs. non-stationary environment
Action-Value Methods
ε-Greedy Action Selection
[Figure: reward distributions of the 10-armed testbed, one distribution per action with mean q*(a) for a = 1, …, 10; each run lasts 1000 steps.]
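A minimal sketch of ε-greedy action selection (illustrative code, not from the slides): with probability ε choose an action at random, otherwise choose greedily with respect to the current estimates.

import numpy as np

def epsilon_greedy(Q, epsilon, rng):
    # Q: current action-value estimates Q_t(a) for all k actions.
    if rng.random() < epsilon:
        return int(rng.integers(len(Q)))   # explore: uniform random action
    return int(np.argmax(Q))               # exploit: greedy action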
ε-Greedy Methods on the 10-Armed Testbed
Averaging ⟶ learning rule
• To simplify notation, let us focus on one action
• We consider only its rewards, and its estimate Q_n after it has received n − 1 rewards:
Q_n ≐ (R_1 + R_2 + ··· + R_{n−1}) / (n − 1)
• How can we do this incrementally (without storing all the rewards)?
• Could store a running sum and count (and divide), or equivalently:
Derivation of incremental update
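A sketch of that derivation (standard material from Sutton and Barto, Section 2.4): expanding the sample average gives an update that needs only the previous estimate and a count.

Q_{n+1} = (1/n) Σ_{i=1}^{n} R_i
        = (1/n) ( R_n + (n − 1) Q_n )
        = Q_n + (1/n) [ R_n − Q_n ]

i.e. NewEstimate ← OldEstimate + StepSize [ Target − OldEstimate ], so only Q_n and n need to be stored.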
Tracking a Non-stationary Problem
Standard stochastic approximation convergence conditions
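For reference, a sketch of the standard results behind this slide (Sutton and Barto, Section 2.5): a constant step size tracks a non-stationary problem, while the classical stochastic-approximation (Robbins-Monro) conditions characterise when the estimates converge.

Constant step size (tracks non-stationarity, never fully converges):
Q_{n+1} = Q_n + α [ R_n − Q_n ],   with constant α ∈ (0, 1]

Convergence with probability 1 requires a step-size sequence α_n(a) with
Σ_n α_n(a) = ∞   and   Σ_n α_n(a)^2 < ∞
(satisfied by the sample average α_n = 1/n, violated by a constant α).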
Optimistic Initial Values
• All methods so far depend on the initial estimates Q_1(a), i.e., they are biased
• So far we have used Q_1(a) = 0
[Figure: % optimal action over the first 1000 plays on the 10-armed testbed for the optimistic, greedy method (Q_1 = 5, ε = 0) compared with a realistic ε-greedy baseline.]
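A minimal sketch of the idea (assuming the KArmedBandit and the sample-average update above; illustrative code): start with optimistic estimates and act purely greedily, so early disappointing rewards push the agent to try the other actions.

import numpy as np

def run_optimistic_greedy(bandit, steps=1000, q_init=5.0):
    Q = np.full(bandit.k, q_init)    # optimistic initial estimates Q_1(a)
    N = np.zeros(bandit.k)           # action counts
    for _ in range(steps):
        a = int(np.argmax(Q))        # purely greedy: epsilon = 0
        r = bandit.step(a)
        N[a] += 1
        Q[a] += (r - Q[a]) / N[a]    # incremental sample-average update
    return Q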
Upper Confidence Bound (UCB) action selection
• A clever way of reducing exploration over time
• Focus on actions whose estimates have a large degree of uncertainty
• Estimate an upper bound on the true action values
• Select the action with the largest (estimated) upper bound
[Figure: average reward over steps, UCB (c = 2) versus ε-greedy (ε = 0.1).]
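A sketch of UCB action selection (the c·sqrt(ln t / N(a)) bonus is the standard form from Sutton and Barto, Section 2.7; the code itself is illustrative):

import numpy as np

def ucb_action(Q, N, t, c=2.0):
    # Pick the action with the largest estimated upper confidence bound
    # Q(a) + c * sqrt(ln t / N(a)); untried actions are selected first.
    untried = np.flatnonzero(N == 0)
    if untried.size > 0:
        return int(untried[0])
    return int(np.argmax(Q + c * np.sqrt(np.log(t) / N)))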
Complexity of UCB Algorithm
Theorem
The UCB algorithm achieves logarithmic asymptotic total regret:
lim_{t→∞} L_t ≤ 8 log t · Σ_{a | Δ_a > 0} Δ_a
where Δ_a = V* − q*(a) is the gap between the optimal value and the value of action a.
Gradient-Bandit Algorithms
• Let H_t(a) be a learned preference for taking action a
[Figure: % optimal action over 1000 steps for gradient bandits with α = 0.1 and α = 0.4, with and without a reward baseline.]
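A minimal sketch of one gradient-bandit update (softmax over preferences with an average-reward baseline, as in Sutton and Barto, Section 2.8; function and variable names are illustrative):

import numpy as np

def gradient_bandit_step(H, reward, baseline, action, alpha=0.1):
    # Policy is a softmax over the preferences H(a); the chosen action's
    # preference rises (and the others fall) when the reward beats the baseline.
    pi = np.exp(H - H.max())
    pi /= pi.sum()
    one_hot = np.zeros_like(H)
    one_hot[action] = 1.0
    return H + alpha * (reward - baseline) * (one_hot - pi)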
Derivation of gradient-bandit algorithm
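The end result of that derivation (stochastic gradient ascent on the expected reward, Sutton and Barto, Section 2.8) is the preference update:

H_{t+1}(a) = H_t(a) + α (R_t − R̄_t) (1_{a = A_t} − π_t(a)),   for all a

where π_t(a) = e^{H_t(a)} / Σ_b e^{H_t(b)} is the softmax policy and R̄_t is the average of the rewards received so far, used as a baseline.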
Summary Comparison of Bandit Algorithms
Conclusions
• These are all simple methods