Lecture 9: Exploration and Exploitation (David Silver)

The document discusses the exploration-exploitation dilemma in online decision making. It introduces the multi-armed bandit problem, where an agent must balance exploring new actions against exploiting the currently most rewarding action. Greedy and epsilon-greedy algorithms can become stuck exploiting suboptimal actions, resulting in linear regret. Decaying epsilon-greedy achieves logarithmic regret by gradually reducing exploration over time. The document outlines lower bounds on regret based on similarities between action distributions. It introduces the concept of optimism in the face of uncertainty to guide exploration.



Outline

1 Introduction

2 Multi-Armed Bandits

3 Contextual Bandits

4 MDPs
Introduction

Exploration vs. Exploitation Dilemma

Online decision-making involves a fundamental choice:


Exploitation: make the best decision given current information
Exploration: gather more information
The best long-term strategy may involve short-term sacrifices
Gather enough information to make the best overall decisions

Examples

Restaurant Selection
    Exploitation: go to your favourite restaurant
    Exploration: try a new restaurant
Online Banner Advertisements
    Exploitation: show the most successful advert
    Exploration: show a different advert
Oil Drilling
    Exploitation: drill at the best known location
    Exploration: drill at a new location
Game Playing
    Exploitation: play the move you believe is best
    Exploration: play an experimental move

Principles

Naive Exploration
    Add noise to the greedy policy (e.g. ε-greedy)
Optimistic Initialisation
    Assume the best until proven otherwise
Optimism in the Face of Uncertainty
    Prefer actions with uncertain values
Probability Matching
    Select actions according to the probability they are best
Information State Search
    Lookahead search incorporating the value of information
Multi-Armed Bandits

The Multi-Armed Bandit

A multi-armed bandit is a tuple ⟨A, R⟩

A is a known set of m actions (or “arms”)
R^a(r) = P[r | a] is an unknown probability distribution over rewards
At each step t the agent selects an action a_t ∈ A
The environment generates a reward r_t ∼ R^{a_t}
The goal is to maximise cumulative reward Σ_{τ=1}^t r_τ
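As a concrete reference for the algorithms on the following slides, here is a minimal sketch of such a bandit in Python. The Bernoulli reward distributions and the names (BernoulliBandit, pull) are illustrative assumptions, not part of the lecture.

```python
import numpy as np

class BernoulliBandit:
    """Multi-armed bandit <A, R> with Bernoulli reward distributions R_a = B(mu_a)."""

    def __init__(self, success_probs, seed=0):
        self.mu = np.asarray(success_probs)      # true means, unknown to the agent
        self.rng = np.random.default_rng(seed)
        self.n_arms = len(self.mu)

    def pull(self, a):
        """Selecting action a returns a reward r ~ R_a."""
        return float(self.rng.random() < self.mu[a])

    def optimal_value(self):
        """V* = max_a Q(a); here Q(a) = mu_a."""
        return float(self.mu.max())
```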

Regret
The action-value is the mean reward for action a,

    Q(a) = E[r | a]

The optimal value V* is

    V* = Q(a*) = max_{a∈A} Q(a)

The regret is the opportunity loss for one step

    l_t = E[V* − Q(a_t)]

The total regret is the total opportunity loss

    L_t = E[ Σ_{τ=1}^t (V* − Q(a_τ)) ]

Maximise cumulative reward ≡ minimise total regret



Counting Regret
The count N_t(a) is the expected number of selections for action a
The gap ∆_a is the difference in value between action a and the optimal action a*, ∆_a = V* − Q(a)
Regret is a function of gaps and counts

    L_t = E[ Σ_{τ=1}^t (V* − Q(a_τ)) ]
        = Σ_{a∈A} E[N_t(a)] (V* − Q(a))
        = Σ_{a∈A} E[N_t(a)] ∆_a

A good algorithm ensures small counts for large gaps
Problem: the gaps are not known!
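When the true means (and hence the gaps) are known, e.g. in a simulation, the decomposition above can be computed directly. A tiny sketch; the function name is illustrative.

```python
def total_regret(counts, gaps):
    """L_t = sum_a N_t(a) * Delta_a, using realised counts in place of E[N_t(a)]."""
    return sum(n * g for n, g in zip(counts, gaps))
```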

Linear or Sublinear Regret

[Figure: total regret vs. time-steps for the greedy, ϵ-greedy and decaying ϵ-greedy algorithms]

If an algorithm forever explores it will have linear total regret


If an algorithm never explores it will have linear total regret
Is it possible to achieve sublinear total regret?
Greedy and ε-Greedy Algorithms

Greedy Algorithm

We consider algorithms that estimate Q̂_t(a) ≈ Q(a)
Estimate the value of each action by Monte-Carlo evaluation

    Q̂_t(a) = (1/N_t(a)) Σ_{t=1}^T r_t 1(a_t = a)

The greedy algorithm selects the action with the highest estimated value

    a_t* = argmax_{a∈A} Q̂_t(a)

Greedy can lock onto a suboptimal action forever
⇒ Greedy has linear total regret

ε-Greedy Algorithm

The ε-greedy algorithm continues to explore forever
    With probability 1 − ε select a = argmax_{a∈A} Q̂(a)
    With probability ε select a random action
Constant ε ensures minimum regret

    l_t ≥ (ε/|A|) Σ_{a∈A} ∆_a

⇒ ε-greedy has linear total regret
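A minimal sketch of ε-greedy with incremental Monte-Carlo value estimates, written against the illustrative BernoulliBandit interface above; the hyperparameters are arbitrary.

```python
import numpy as np

def epsilon_greedy(bandit, n_steps=1000, epsilon=0.1, seed=0):
    """Run epsilon-greedy on a bandit; returns value estimates and the reward sequence."""
    rng = np.random.default_rng(seed)
    Q = np.zeros(bandit.n_arms)   # action-value estimates Q_hat(a)
    N = np.zeros(bandit.n_arms)   # selection counts N(a)
    rewards = []
    for _ in range(n_steps):
        if rng.random() < epsilon:
            a = int(rng.integers(bandit.n_arms))   # explore: uniformly random action
        else:
            a = int(np.argmax(Q))                  # exploit: greedy action
        r = bandit.pull(a)
        N[a] += 1
        Q[a] += (r - Q[a]) / N[a]                  # incremental Monte-Carlo update
        rewards.append(r)
    return Q, np.array(rewards)
```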



Optimistic Initialisation

Simple and practical idea: initialise Q(a) to a high value
Update the action value by incremental Monte-Carlo evaluation
Starting with N(a) > 0,

    Q̂_t(a_t) = Q̂_{t−1} + (1/N_t(a_t)) (r_t − Q̂_{t−1})

Encourages systematic exploration early on
But can still lock onto a suboptimal action
⇒ greedy + optimistic initialisation has linear total regret
⇒ ε-greedy + optimistic initialisation has linear total regret

Decaying ε_t-Greedy Algorithm

Pick a decay schedule for ε_1, ε_2, ...
Consider the following schedule

    c > 0
    d = min_{a | ∆_a > 0} ∆_a
    ε_t = min{ 1, c|A| / (d² t) }

Decaying ε_t-greedy has logarithmic asymptotic total regret!
Unfortunately, the schedule requires advance knowledge of the gaps
Goal: find an algorithm with sublinear regret for any
multi-armed bandit (without knowledge of R)
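For reference, the schedule above as a short sketch; it takes the gaps as an argument precisely because of the limitation just noted, and the function name is illustrative.

```python
def decaying_epsilon(t, n_arms, gaps, c=1.0):
    """eps_t = min(1, c|A| / (d^2 t)), with d the smallest positive gap (assumed known here)."""
    d = min(g for g in gaps if g > 0)
    return min(1.0, c * n_arms / (d * d * max(t, 1)))
```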

Lower Bound

The performance of any algorithm is determined by the similarity
between the optimal arm and the other arms
Hard problems have similar-looking arms with different means
This is described formally by the gap ∆_a and the similarity in
distributions KL(R^a || R^{a*})

Theorem (Lai and Robbins)
Asymptotic total regret is at least logarithmic in the number of steps

    lim_{t→∞} L_t ≥ log t Σ_{a | ∆_a > 0} ∆_a / KL(R^a || R^{a*})
Upper Confidence Bound

Optimism in the Face of Uncertainty

[Figure: probability densities p(Q) over the action-values Q(a1), Q(a2), Q(a3)]

Which action should we pick?


The more uncertain we are about an action-value
The more important it is to explore that action
It could turn out to be the best action

Optimism in the Face of Uncertainty (2)

After picking the blue action


We are less uncertain about the value
And more likely to pick another action
Until we home in on best action

Upper Confidence Bounds

Estimate an upper confidence Û_t(a) for each action value
Such that Q(a) ≤ Q̂_t(a) + Û_t(a) with high probability
This depends on the number of times N(a) has been selected
    Small N_t(a) ⇒ large Û_t(a) (estimated value is uncertain)
    Large N_t(a) ⇒ small Û_t(a) (estimated value is accurate)
Select the action maximising the Upper Confidence Bound (UCB)

    a_t = argmax_{a∈A} [ Q̂_t(a) + Û_t(a) ]

Hoeffding’s Inequality

Theorem (Hoeffding’s Inequality)
Let X_1, ..., X_t be i.i.d. random variables in [0, 1], and let
X̄_t = (1/t) Σ_{τ=1}^t X_τ be the sample mean. Then

    P[ E[X] > X̄_t + u ] ≤ e^{−2tu²}

We will apply Hoeffding’s Inequality to the rewards of the bandit,
conditioned on selecting action a:

    P[ Q(a) > Q̂_t(a) + U_t(a) ] ≤ e^{−2 N_t(a) U_t(a)²}

Calculating Upper Confidence Bounds

Pick a probability p that the true value exceeds the UCB
Now solve for U_t(a)

    e^{−2 N_t(a) U_t(a)²} = p
    U_t(a) = √( −log p / (2 N_t(a)) )

Reduce p as we observe more rewards, e.g. p = t^{−4}
This ensures we select the optimal action as t → ∞

    U_t(a) = √( 2 log t / N_t(a) )

UCB1

This leads to the UCB1 algorithm

    a_t = argmax_{a∈A} [ Q̂_t(a) + √( 2 log t / N_t(a) ) ]

Theorem
The UCB algorithm achieves logarithmic asymptotic total regret

    lim_{t→∞} L_t ≤ 8 log t Σ_{a | ∆_a > 0} ∆_a
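A sketch of UCB1 against the same illustrative bandit interface as above; pulling each arm once before applying the bound (to avoid dividing by a zero count) is an assumed implementation detail.

```python
import numpy as np

def ucb1(bandit, n_steps=1000):
    """UCB1: pick a_t = argmax_a Q_hat(a) + sqrt(2 log t / N(a))."""
    Q = np.zeros(bandit.n_arms)
    N = np.zeros(bandit.n_arms)
    rewards = []
    for t in range(1, n_steps + 1):
        if t <= bandit.n_arms:
            a = t - 1                                # pull each arm once to initialise N(a)
        else:
            bonus = np.sqrt(2.0 * np.log(t) / N)     # uncertainty bonus U_t(a)
            a = int(np.argmax(Q + bonus))
        r = bandit.pull(a)
        N[a] += 1
        Q[a] += (r - Q[a]) / N[a]                    # incremental Monte-Carlo update
        rewards.append(r)
    return Q, np.array(rewards)
```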

Example: UCB vs. ε-Greedy on a 10-Armed Bandit

[Figure 8: comparison on distribution 3 (2 machines with parameters 0.55, 0.45)]
[Figure 9: comparison on distribution 11 (10 machines with parameters 0.9, 0.6, ..., 0.6)]
[Figure 10: comparison on distribution 12 (10 machines with parameters 0.9, 0.8, 0.8, 0.8, 0.7, 0.7, 0.7, 0.6, 0.6, 0.6)]

Bayesian Bandits

So far we have made no assumptions about the reward distribution R
(except bounds on the rewards)
Bayesian bandits exploit prior knowledge of rewards, p[R]
They compute the posterior distribution of rewards p[R | h_t],
where h_t = a_1, r_1, ..., a_{t−1}, r_{t−1} is the history
Use posterior to guide exploration
Upper confidence bounds (Bayesian UCB)
Probability matching (Thompson sampling)
Better performance if prior knowledge is accurate

Bayesian UCB Example: Independent Gaussians


Assume the reward distribution is Gaussian, R_a(r) = N(r; µ_a, σ_a²)

[Figure: Gaussian posteriors over the action-values Q(a1), Q(a2), Q(a3), showing the means µ(a1), µ(a2), µ(a3) and bounds c standard deviations above each mean]

Compute the Gaussian posterior over µ_a and σ_a² (by Bayes law)

    p(µ_a, σ_a² | h_t) ∝ p(µ_a, σ_a²) Π_{t | a_t = a} N(r_t; µ_a, σ_a²)

Pick the action with the largest upper bound, c standard deviations above the mean

    a_t = argmax_{a∈A} [ µ_a + c σ_a / √N(a) ]
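A rough sketch of the rule above, using simple per-arm running mean and variance estimates rather than a full conjugate Gaussian posterior; the constant c, the estimator, and the handling of unseen arms are all assumptions.

```python
import numpy as np

def bayesian_ucb_gaussian(bandit, n_steps=1000, c=2.0):
    """Pick a_t = argmax_a mu_a + c * sigma_a / sqrt(N(a)) with running Gaussian estimates."""
    n = bandit.n_arms
    mu = np.zeros(n)    # running mean estimate of Q(a)
    m2 = np.zeros(n)    # running sum of squared deviations (Welford)
    N = np.zeros(n)
    for _ in range(n_steps):
        sigma = np.sqrt(np.where(N > 1, m2 / np.maximum(N - 1, 1), 1.0))
        score = mu + c * sigma / np.sqrt(np.maximum(N, 1))
        a = int(np.argmax(np.where(N == 0, np.inf, score)))   # try unseen arms first
        r = bandit.pull(a)
        N[a] += 1
        delta = r - mu[a]
        mu[a] += delta / N[a]
        m2[a] += delta * (r - mu[a])                           # Welford variance update
    return mu, N
```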

Probability Matching

Probability matching selects action a according to the probability
that a is the optimal action

    π(a | h_t) = P[ Q(a) > Q(a'), ∀a' ≠ a | h_t ]

Probability matching is optimistic in the face of uncertainty
    Uncertain actions have a higher probability of being the max
Can be difficult to compute analytically from the posterior

Thompson Sampling

Thompson sampling implements probability matching

    π(a | h_t) = P[ Q(a) > Q(a'), ∀a' ≠ a | h_t ]
               = E_{R | h_t}[ 1(a = argmax_{a∈A} Q(a)) ]

Use Bayes law to compute the posterior distribution p[R | h_t]
Sample a reward distribution R from the posterior
Compute the action-value function Q(a) = E[R_a]
Select the action maximising value on the sample, a_t = argmax_{a∈A} Q(a)
Thompson sampling achieves the Lai and Robbins lower bound!
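A sketch of Thompson sampling for Bernoulli rewards with an assumed Beta(1, 1) prior per arm. Note it uses the standard Beta convention (α counts successes), which is the reverse of the α/β labelling on the Bernoulli-bandit slide later in this lecture.

```python
import numpy as np

def thompson_sampling(bandit, n_steps=1000, seed=0):
    """Beta-Bernoulli Thompson sampling: sample Q from the posterior, act greedily on the sample."""
    rng = np.random.default_rng(seed)
    alpha = np.ones(bandit.n_arms)   # pseudo-counts of successes (r = 1)
    beta = np.ones(bandit.n_arms)    # pseudo-counts of failures  (r = 0)
    for _ in range(n_steps):
        sampled_q = rng.beta(alpha, beta)   # one sample of each Q(a) from the posterior
        a = int(np.argmax(sampled_q))       # greedy with respect to the sampled values
        r = bandit.pull(a)
        alpha[a] += r
        beta[a] += 1 - r
    return alpha, beta
```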
Information State Search

Value of Information

Exploration is useful because it gains information


Can we quantify the value of information?
How much reward would a decision-maker be prepared to pay
in order to have that information, prior to making a decision?
    i.e. long-term reward after getting the information minus immediate reward
Information gain is higher in uncertain situations
Therefore it makes sense to explore uncertain situations more
If we know value of information, we can trade-off exploration
and exploitation optimally

Information State Space

We have viewed bandits as one-step decision-making problems
They can also be viewed as sequential decision-making problems
At each step there is an information state s̃
    s̃ is a statistic of the history, s̃_t = f(h_t),
    summarising all information accumulated so far
Each action a causes a transition to a new information state s̃'
(by adding information), with probability P̃^a_{s̃,s̃'}
This defines an MDP M̃ in augmented information state space

    M̃ = ⟨S̃, A, P̃, R, γ⟩

Example: Bernoulli Bandits

Consider a Bernoulli bandit, such that R_a = B(µ_a)
    e.g. win or lose a game with probability µ_a
We want to find which arm has the highest µ_a
The information state is s̃ = ⟨α, β⟩
    α_a counts the pulls of arm a where the reward was 0
    β_a counts the pulls of arm a where the reward was 1

Solving Information State Space Bandits

We now have an infinite MDP over information states


This MDP can be solved by reinforcement learning
Model-free reinforcement learning
e.g. Q-learning (Duff, 1994)
Bayesian model-based reinforcement learning
e.g. Gittins indices (Gittins, 1979)
This approach is known as Bayes-adaptive RL
Finds Bayes-optimal exploration/exploitation trade-off
with respect to prior distribution

Bayes-Adaptive Bernoulli Bandits

Start with a Beta(α_a, β_a) prior over each reward function R_a
Each time a is selected, update the posterior for R_a
    Beta(α_a + 1, β_a) if r = 0
    Beta(α_a, β_a + 1) if r = 1
This defines the transition function P̃ for the Bayes-adaptive MDP
The information state ⟨α, β⟩ corresponds to the reward model Beta(α, β)
Each state transition corresponds to a Bayesian model update

[Figure: a tree of Beta posterior densities over µ_1 and µ_2 for two arms (“Drug 1”, “Drug 2”), branching on the success or failure of each pull]
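A tiny sketch of the information-state transition just described, following the slide's convention that α_a counts reward-0 pulls and β_a counts reward-1 pulls; the function name and tuple representation are illustrative.

```python
def info_state_transition(alpha, beta, a, r):
    """Return the new information state <alpha, beta> after pulling arm a and observing reward r."""
    alpha, beta = list(alpha), list(beta)
    if r == 0:
        alpha[a] += 1   # Beta(alpha_a + 1, beta_a) if r = 0
    else:
        beta[a] += 1    # Beta(alpha_a, beta_a + 1) if r = 1
    return tuple(alpha), tuple(beta)
```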

Bayes-Adaptive MDP for Bernoulli Bandits



Gittins Indices for Bernoulli Bandits

Bayes-adaptive MDP can be solved by dynamic programming


The solution is known as the Gittins index
Exact solution to Bayes-adaptive MDP is typically intractable
Information state space is too large
Recent idea: apply simulation-based search (Guez et al. 2012)

Forward search in information state space


Using simulations from current information state

Contextual Bandits

A contextual bandit is a tuple ⟨A, S, R⟩

A is a known set of actions (or “arms”)
S = P[s] is an unknown distribution over states (or “contexts”)
R^a_s(r) = P[r | s, a] is an unknown probability distribution over rewards
At each step t
    The environment generates a state s_t ∼ S
    The agent selects an action a_t ∈ A
    The environment generates a reward r_t ∼ R^{a_t}_{s_t}
The goal is to maximise cumulative reward Σ_{τ=1}^t r_τ
Linear UCB

Linear Regression
The action-value function is the expected reward for state s and action a

    Q(s, a) = E[r | s, a]

Estimate the value function with a linear function approximator

    Q_θ(s, a) = φ(s, a)^T θ ≈ Q(s, a)

Estimate the parameters by least squares regression

    A_t = Σ_{τ=1}^t φ(s_τ, a_τ) φ(s_τ, a_τ)^T
    b_t = Σ_{τ=1}^t φ(s_τ, a_τ) r_τ
    θ_t = A_t^{−1} b_t

Linear Upper Confidence Bounds

Least squares regression estimates the mean action-value Q_θ(s, a)
But it can also estimate the variance of the action-value σ_θ²(s, a),
i.e. the uncertainty due to parameter estimation error
Add on a bonus for uncertainty, U_θ(s, a) = c σ_θ(s, a)
i.e. define the UCB to be c standard deviations above the mean
Geometric Interpretation

[Figure: confidence ellipsoid E in parameter space around the least squares estimate]

Define a confidence ellipsoid E_t around the parameters θ_t
Such that E_t includes the true parameters θ* with high probability
Use this ellipsoid to estimate the uncertainty of the action values
Pick the parameters within the ellipsoid that maximise the action value

    argmax_{θ∈E} Q_θ(s, a)

Calculating Linear Upper Confidence Bounds

For least squares regression, the parameter covariance is A^{−1}
The action-value is linear in the features, Q_θ(s, a) = φ(s, a)^T θ
So the action-value variance is quadratic,

    σ_θ²(s, a) = φ(s, a)^T A^{−1} φ(s, a)

The upper confidence bound is Q_θ(s, a) + c √( φ(s, a)^T A^{−1} φ(s, a) )
Select the action maximising the upper confidence bound

    a_t = argmax_{a∈A} [ Q_θ(s_t, a) + c √( φ(s_t, a)^T A_t^{−1} φ(s_t, a) ) ]
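A compact sketch of this linear UCB computation; the feature function φ is supplied by the caller, and the small ridge term added to A (to keep it invertible before enough data arrives) and the class/method names are assumptions.

```python
import numpy as np

class LinUCB:
    """Linear UCB: Q_theta(s,a) = phi(s,a)^T theta, bonus = c * sqrt(phi^T A^{-1} phi)."""

    def __init__(self, n_features, c=1.0, reg=1e-3):
        self.A = reg * np.eye(n_features)   # ridge term keeps A invertible early on
        self.b = np.zeros(n_features)
        self.c = c

    def select(self, features):
        """features: array of shape (n_actions, n_features), row i = phi(s_t, a_i)."""
        A_inv = np.linalg.inv(self.A)
        theta = A_inv @ self.b
        q = features @ theta                                  # mean action-values
        var = np.sum((features @ A_inv) * features, axis=1)   # phi^T A^{-1} phi per action
        return int(np.argmax(q + self.c * np.sqrt(var)))

    def update(self, phi, reward):
        """Accumulate the least squares statistics: A += phi phi^T, b += phi r."""
        self.A += np.outer(phi, phi)
        self.b += phi * reward
```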
Example: Linear UCB for Selecting Front Page News

[Figures (a)-(d) from Li, Chu, Langford, Moon and Wang]
MDPs

Exploration/Exploitation Principles to MDPs

The same principles for exploration/exploitation apply to MDPs


Naive Exploration
Optimistic Initialisation
Optimism in the Face of Uncertainty
Probability Matching
Information State Search

Optimistic Initialisation: Model-Free RL

Initialise the action-value function Q(s, a) to r_max / (1 − γ)
Run favourite model-free RL algorithm
Monte-Carlo control
Sarsa
Q-learning
...
Encourages systematic exploration of states and actions
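A sketch of optimistic initialisation for tabular Q-learning (one of the model-free algorithms listed above); the environment interface (reset/step returning state, reward, done) and the hyperparameters are assumptions.

```python
import numpy as np

def optimistic_q_learning(env, n_states, n_actions, r_max,
                          gamma=0.99, alpha=0.1, n_episodes=500):
    """Tabular Q-learning with every Q(s, a) initialised optimistically to r_max / (1 - gamma)."""
    Q = np.full((n_states, n_actions), r_max / (1.0 - gamma))
    for _ in range(n_episodes):
        s, done = env.reset(), False
        while not done:
            a = int(np.argmax(Q[s]))        # purely greedy: optimism drives exploration
            s_next, r, done = env.step(a)   # assumed interface: (state, reward, done)
            target = r if done else r + gamma * np.max(Q[s_next])
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
    return Q
```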

Optimistic Initialisation: Model-Based RL

Construct an optimistic model of the MDP


Initialise transitions to go to heaven
(i.e. transition to a terminal state with r_max reward)
Solve optimistic MDP by favourite planning algorithm
policy iteration
value iteration
tree search
...
Encourages systematic exploration of states and actions
e.g. RMax algorithm (Brafman and Tennenholtz)
Optimism in the Face of Uncertainty

Upper Confidence Bounds: Model-Free RL

Maximise the UCB on the action-value function Q^π(s, a)

    a_t = argmax_{a∈A} [ Q(s_t, a) + U(s_t, a) ]

    Estimates uncertainty in policy evaluation (easy)
    Ignores uncertainty from policy improvement
Maximise the UCB on the optimal action-value function Q*(s, a)

    a_t = argmax_{a∈A} [ Q(s_t, a) + U_1(s_t, a) + U_2(s_t, a) ]

    Estimates uncertainty in policy evaluation (easy)
    plus uncertainty from policy improvement (hard)

Bayesian Model-Based RL

Maintain posterior distribution over MDP models


Estimate both transitions and rewards, p[P, R | h_t],
where h_t = s_1, a_1, r_2, ..., s_t is the history
Use posterior to guide exploration
Upper confidence bounds (Bayesian UCB)
Probability matching (Thompson sampling)
Probability Matching

Thompson Sampling: Model-Based RL

Thompson sampling implements probability matching

    π(s, a | h_t) = P[ Q*(s, a) > Q*(s, a'), ∀a' ≠ a | h_t ]
                  = E_{P,R | h_t}[ 1(a = argmax_{a∈A} Q*(s, a)) ]

Use Bayes law to compute the posterior distribution p[P, R | h_t]
Sample an MDP ⟨P, R⟩ from the posterior
Solve the sampled MDP using your favourite planning algorithm to get Q*(s, a)
Select the optimal action for the sampled MDP, a_t = argmax_{a∈A} Q*(s_t, a)
Information State Search

Information State Search in MDPs

MDPs can be augmented to include the information state
Now the augmented state is ⟨s, s̃⟩
    where s is the original state within the MDP
    and s̃ is a statistic of the history (accumulated information)
Each action a causes a transition
    to a new state s' with probability P^a_{s,s'}
    to a new information state s̃'
This defines an MDP M̃ in augmented information state space

    M̃ = ⟨S̃, A, P̃, R, γ⟩

Bayes Adaptive MDPs

The posterior distribution over the MDP model is an information state

    s̃_t = P[P, R | h_t]

The augmented MDP over ⟨s, s̃⟩ is called the Bayes-adaptive MDP
Solve this MDP to find the optimal exploration/exploitation trade-off (with respect to the prior)
However, the Bayes-adaptive MDP is typically enormous
Simulation-based search has proven effective (Guez et al.)

Conclusion

We have covered several principles for exploration/exploitation:

Naive methods such as ε-greedy
Optimistic initialisation
Upper confidence bounds
Probability matching
Information state search
Each principle was developed in bandit setting
But same principles also apply to MDP setting
