Lecture 9: Exploration and Exploitation
David Silver
Outline
1 Introduction
2 Multi-Armed Bandits
3 Contextual Bandits
4 MDPs
Introduction
Examples
Restaurant Selection
Exploitation: Go to your favourite restaurant
Exploration: Try a new restaurant
Online Banner Advertisements
Exploitation: Show the most successful advert
Exploration: Show a different advert
Oil Drilling
Exploitation: Drill at the best known location
Exploration: Drill at a new location
Game Playing
Exploitation: Play the move you believe is best
Exploration: Play an experimental move
Principles
Naive Exploration
Add noise to greedy policy (e.g. ϵ-greedy)
Optimistic Initialisation
Assume the best until proven otherwise
Optimism in the Face of Uncertainty
Prefer actions with uncertain values
Probability Matching
Select actions according to probability they are best
Information State Search
Lookahead search incorporating value of information
Multi-Armed Bandits
Regret
The action-value is the mean reward for action a,
Q(a) = E[r | a]
The optimal value V^* is
V^* = Q(a^*) = max_{a∈A} Q(a)
Counting Regret
The count N_t(a) is the expected number of selections for action a
The gap Δ_a is the difference in value between action a and the optimal action a^*, Δ_a = V^* − Q(a)
Regret is a function of the gaps and the counts
L_t = E[ Σ_{τ=1}^t (V^* − Q(a_τ)) ]
    = Σ_{a∈A} E[N_t(a)] (V^* − Q(a))
    = Σ_{a∈A} E[N_t(a)] Δ_a
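As a small numeric illustration of this decomposition, here is a sketch in Python; the action values and selection counts are made up for the example and are not from the lecture.

```python
import numpy as np

# Hypothetical true values Q(a) for a 3-armed bandit (illustrative only).
q_true = np.array([1.0, 0.5, 0.2])
v_star = q_true.max()        # optimal value V^* = max_a Q(a)
gaps = v_star - q_true       # gaps Delta_a = V^* - Q(a)

# Hypothetical expected selection counts E[N_t(a)] after t = 100 steps.
counts = np.array([80, 15, 5])

# Total regret L_t = sum_a E[N_t(a)] * Delta_a
regret = np.sum(counts * gaps)
print(regret)                # 15 * 0.5 + 5 * 0.8 = 11.5
```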
[Figure: total regret against time-steps for the greedy, ϵ-greedy and decaying ϵ-greedy algorithms.]
Greedy Algorithm
ϵ-Greedy Algorithm
Optimistic Initialisation
Q̂_t(a_t) = Q̂_{t−1} + (1/N_t(a_t)) (r_t − Q̂_{t−1})
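A minimal sketch of a greedy bandit agent that combines this incremental update with optimistic initial values; the initial value of 5.0 and the class interface are illustrative assumptions.

```python
import numpy as np

class OptimisticGreedyBandit:
    """Greedy action selection with optimistically initialised value estimates."""

    def __init__(self, n_arms, initial_value=5.0):
        self.q = np.full(n_arms, initial_value)   # optimistic Q-hat estimates
        self.n = np.zeros(n_arms)                 # selection counts N_t(a)

    def select(self):
        return int(np.argmax(self.q))             # act greedily w.r.t. Q-hat

    def update(self, a, r):
        # Incremental mean: Q_t(a_t) = Q_{t-1} + (r_t - Q_{t-1}) / N_t(a_t)
        self.n[a] += 1
        self.q[a] += (r - self.q[a]) / self.n[a]
```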
Decaying ϵ_t-greedy: pick a constant c > 0 and let d = min_{a | Δ_a > 0} Δ_a, then use the schedule
ϵ_t = min{1, c|A| / (d² t)}
This schedule gives logarithmic asymptotic total regret, but requires advance knowledge of the gaps
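A sketch of ϵ-greedy selection with this decaying schedule; note that d (the smallest non-zero gap) is unknown in practice, so supplying it here is purely an illustrative assumption.

```python
import numpy as np

def epsilon_schedule(t, n_arms, c=1.0, d=0.5):
    """Decaying epsilon_t = min(1, c|A| / (d^2 t)); c and d are assumed known."""
    return min(1.0, c * n_arms / (d ** 2 * t))

def epsilon_greedy_action(q_estimates, epsilon, rng=None):
    """With probability epsilon explore uniformly, otherwise exploit greedily."""
    rng = rng or np.random.default_rng()
    if rng.random() < epsilon:
        return int(rng.integers(len(q_estimates)))
    return int(np.argmax(q_estimates))
```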
Lower Bound
[Figure: probability distributions p(Q) over the action values Q(a1), Q(a2), Q(a3).]
Hoeffding’s Inequality
P[E[X] > X̄_t + u] ≤ e^{−2tu²}
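The usual next step (not present in the extracted text, but standard) is to apply Hoeffding's inequality to each arm's empirical mean and solve for an upper confidence bound U_t(a):

```latex
% Apply Hoeffding's inequality to the empirical mean \hat{Q}_t(a) of arm a:
%   P[ Q(a) > \hat{Q}_t(a) + U_t(a) ] \le e^{-2 N_t(a) U_t(a)^2}
% Choose a confidence level p and solve for U_t(a):
p = e^{-2 N_t(a) U_t(a)^2}
\quad\Longrightarrow\quad
U_t(a) = \sqrt{\frac{-\log p}{2 N_t(a)}}
% Shrinking p over time, e.g. p = t^{-4}, gives the bonus \sqrt{2 \log t / N_t(a)}.
```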
UCB1
Theorem
The UCB algorithm achieves logarithmic asymptotic total regret
lim_{t→∞} L_t ≤ 8 log t Σ_{a | Δ_a > 0} Δ_a
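A minimal sketch of the UCB1 rule, selecting a_t = argmax_a Q̂(a) + √(2 log t / N_t(a)) after trying each arm once; the `pull(a)` callable standing in for the bandit is an assumption.

```python
import numpy as np

def ucb1(pull, n_arms, n_steps):
    """UCB1: pick the arm maximising Q_hat(a) + sqrt(2 log t / N_t(a)).
    `pull(a)` is an assumed callable returning a stochastic reward."""
    q = np.zeros(n_arms)
    n = np.zeros(n_arms)
    for t in range(1, n_steps + 1):
        if t <= n_arms:
            a = t - 1                          # play each arm once first
        else:
            a = int(np.argmax(q + np.sqrt(2.0 * np.log(t) / n)))
        r = pull(a)
        n[a] += 1
        q[a] += (r - q[a]) / n[a]              # incremental mean update
    return q, n
```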
Upper Confidence Bound
Figure 9. Comparison on distribution 11 (10 machines with parameters 0.9, 0.6, . . . , 0.6).
Figure 10. Comparison on distribution 12 (10 machines with parameters 0.9, 0.8, 0.8, 0.8, 0.7, 0.7, 0.7, 0.6, 0.6, 0.6).
Bayesian Bandits
[Figure: posterior distributions over the action values Q(a1), Q(a2), Q(a3).]
Probability Matching
Thompson Sampling
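A minimal sketch of Thompson sampling for Bernoulli bandits: sample each arm's success probability from its Beta posterior and act greedily on the samples, which implements probability matching. The uniform Beta(1, 1) prior and the `pull(a)` callable are assumptions.

```python
import numpy as np

def thompson_sampling(pull, n_arms, n_steps, rng=None):
    """Thompson sampling for Bernoulli rewards; `pull(a)` returns 0 or 1."""
    rng = rng or np.random.default_rng()
    successes = np.zeros(n_arms)
    failures = np.zeros(n_arms)
    for _ in range(n_steps):
        # One sample per arm from its Beta posterior (uniform Beta(1,1) prior).
        samples = rng.beta(1 + successes, 1 + failures)
        a = int(np.argmax(samples))            # probability matching by sampling
        r = pull(a)
        successes[a] += r
        failures[a] += 1 - r
    return successes, failures
```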
Value of Information
Information State Search
Augmented information state MDP: M̃ = ⟨S̃, A, P̃, R, γ⟩
Starting from a Beta(α_a, β_a) prior over each arm's reward, each time arm a is selected its posterior is updated:
Beta(α_a + 1, β_a) if r = 0
Beta(α_a, β_a + 1) if r = 1
[Figure: Bayes-adaptive MDP tree for two arms (Drug 1, Drug 2). Each node is an information state ⟨α, β⟩ holding Beta(α, β) posteriors f(θ1), f(θ2) over the arms' success probabilities; Success/Failure outcomes branch the tree. Each state transition corresponds to a Bayesian model update.]
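The tree above can, in principle, be solved by lookahead over information states. Below is a minimal sketch for two Bernoulli arms over a short undiscounted horizon; the horizon, uniform prior, and exhaustive recursion are illustrative assumptions rather than the lecture's method.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def value(state, steps_left):
    """Optimal expected future reward from an information state.
    state = ((s1, f1), (s2, f2)) holds per-arm success/failure counts,
    i.e. the parameters of a Beta posterior over each arm's success probability."""
    if steps_left == 0:
        return 0.0
    best = 0.0
    for a, (s, f) in enumerate(state):
        p = (s + 1) / (s + f + 2)              # posterior mean under a Beta(1,1) prior
        success = list(state)
        success[a] = (s + 1, f)                # Bayesian model update on success
        failure = list(state)
        failure[a] = (s, f + 1)                # Bayesian model update on failure
        q = (p * (1.0 + value(tuple(success), steps_left - 1))
             + (1.0 - p) * value(tuple(failure), steps_left - 1))
        best = max(best, q)
    return best

# Value of 5 pulls starting from uniform priors over both arms.
print(value(((0, 0), (0, 0)), 5))
```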
Contextual Bandits
Linear Regression
The action-value function is the expected reward for state s and action a
Q(s, a) = E[r | s, a]
Estimate value function with a linear function approximator
Q_θ(s, a) = φ(s, a)^⊤ θ ≈ Q(s, a)
Estimate parameters by least squares regression
A_t = Σ_{τ=1}^t φ(s_τ, a_τ) φ(s_τ, a_τ)^⊤
b_t = Σ_{τ=1}^t φ(s_τ, a_τ) r_τ
θ_t = A_t^{−1} b_t
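A sketch of this least-squares estimate; the small ridge term is an added assumption to keep A_t invertible and is not part of the slide.

```python
import numpy as np

def least_squares_theta(features, rewards, ridge=1e-3):
    """Estimate theta_t from stacked features phi(s_tau, a_tau) and rewards r_tau.
    `features` is a (t, d) array, `rewards` a length-t vector."""
    d = features.shape[1]
    A = features.T @ features + ridge * np.eye(d)   # A_t = sum phi phi^T (+ ridge)
    b = features.T @ rewards                        # b_t = sum phi r
    return np.linalg.solve(A, b)                    # theta_t = A_t^{-1} b_t
```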
Linear UCB
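A sketch of the linear UCB action selection this suggests, adding an uncertainty bonus c·√(φ(s, a)^⊤ A_t^{-1} φ(s, a)) to the least-squares estimate; the constant c and the per-action feature vectors are assumptions for illustration.

```python
import numpy as np

def linucb_action(phi_by_action, A, theta, c=2.0):
    """Select argmax_a phi(s,a)^T theta + c * sqrt(phi(s,a)^T A^{-1} phi(s,a)).
    `phi_by_action` is a list of feature vectors phi(s, a), one per action."""
    A_inv = np.linalg.inv(A)
    scores = []
    for phi in phi_by_action:
        mean = phi @ theta                         # estimated value Q_theta(s, a)
        bonus = c * np.sqrt(phi @ A_inv @ phi)     # uncertainty of the estimate
        scores.append(mean + bonus)
    return int(np.argmax(scores))
```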
MDPs
Optimistic Initialisation
Initialise action-value function Q(s, a) to r_max / (1 − γ)
Run favourite model-free RL algorithm
Monte-Carlo control
Sarsa
Q-learning
...
Encourages systematic exploration of states and actions
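A minimal sketch of optimistic initialisation with tabular Q-learning; the environment interface (`env.reset()`, `env.step(a)` returning next state, reward, and a done flag) is a stand-in assumption.

```python
import numpy as np

def optimistic_q_learning(env, n_states, n_actions, r_max,
                          gamma=0.99, alpha=0.1, n_steps=10_000):
    """Q-learning with every Q(s, a) initialised optimistically to r_max / (1 - gamma)."""
    q = np.full((n_states, n_actions), r_max / (1.0 - gamma))
    s = env.reset()
    for _ in range(n_steps):
        a = int(np.argmax(q[s]))                   # purely greedy: optimism drives exploration
        s_next, r, done = env.step(a)
        target = r if done else r + gamma * np.max(q[s_next])
        q[s, a] += alpha * (target - q[s, a])      # standard Q-learning update
        s = env.reset() if done else s_next
    return q
```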
Bayesian Model-Based RL
Information State Search
Augmented information state MDP: M̃ = ⟨S̃, A, P̃, R, γ⟩
Conclusion