
Exploration and Exploitation

Exploration and Exploitation

Paul Alexander Bilokon, PhD

Thalesians Ltd
Level39, One Canada Square, Canary Wharf, London E14 5AB

2023.01.17
Exploration and Exploitation

Acknowledgements

I These notes are based on Chapter 2 of [SB18] by Richard S. Sutton and Andrew G.
Barto.
I They also borrow a lot from the lectures on reinforcement learning by David
Silver [Sil15] and those by his colleague at DeepMind Hado van Hasselt [vH16].
Exploration and Exploitation
Bandits

An example: trading strategies

I I run a trading desk at GoodBank.


I My traders have contributed ten different trading strategies.
I These traders are very secretive, so I don’t know anything about the strategies—they
are black boxes.1
I All I can do is observe the strategies’ performance in real time, online.
I How do I allocate capital over these trading strategies?

1 It isn’t a good idea to run a trading desk with overly secretive traders and black-box trading strategies.
Exploration and Exploitation
Bandits

Another example: a mouse and two levers (i)

I Adapted from Hado van Hasselt.


I Suppose that a mouse is faced with a choice of two levers, a white one and a black
one.
Exploration and Exploitation
Bandits

An example: a mouse and two levers (ii)

I On Monday, the mouse presses the black lever and receives an electric shock.
I On Tuesday, the mouse presses the white lever and receives cheese.
I On Wednesday, the mouse presses the white lever and receives an electric shock.
I What should the mouse do on Thursday?
Exploration and Exploitation
Bandits

Exploration versus exploitation

I Both examples are concerned with online decision-making, which requires us to


make a sequence of decisions based on incremental information.
I The agent (decision-maker) receives evaluative feedback that evaluates the actions
taken rather than instructive feedback that instructs by giving correct actions.
I The agent faces the fundamental trade-off of online decision-making:
I Exploitation maximizes reward based on current knowledge;
I Exploration increases knowledge.
I The agent needs to gather enough information to make the best overall decisions.
I The best long-term strategy may involve short-term sacrifices.
I In the first example there were ten “levers” (actions), in the second only two.
I Such problems are called multi-armed bandits by analogy with slot machines, which
are sometimes referred to as one-armed bandits.
I Each action selection resembles a play of one of the slot machine’s levers, and the
rewards are the payoffs for hitting the jackpot.
I Through repeated action selections you are to maximize your winnings by
concentrating your actions on the best levers.
Exploration and Exploitation
Bandits

Further examples

I Portfolio selection [She15, HF17]:


I Exploitation: Buy the best performing portfolio.
I Exploration: Try a different portfolio.
I Dynamic online pricing [MSA19, MST19]:
I Exploitation: Quote the best known price.
I Exploration: Quote a different price to learn more about supply and demand.
I Recommender systems [BBG12, ZZXL17]:
I Exploitation: Recommend the item believed to be the most suitable.
I Exploration: Recommend a different item.
I Clinical trials [BB15, DAI+ 18]:
I Exploitation: Use the best known treatment (dose).
I Exploration: Try a different treatment (dose).
I For more examples, see [BR19].
Exploration and Exploitation
Bandits

A one-armed bandit

Figure: A one-armed bandit.


Exploration and Exploitation
Bandits

Multi-armed bandit (informally)

I A multi-armed bandit is the simplest setting for online decision making.


I In this setting, actions have no influence on next observations. There is no sequential
structure.
I You can interact with the problem again and again.
I Moreover, the state does not change: this setting does not involve learning to act in
more than one situation.
I This nonassociative setting avoids much of the complexity of the full reinforcement
learning problem.
Exploration and Exploitation
Bandits

Multi-armed bandit (formally)

I A is a known set of actions (“arms”).


I At each time step t, the agent selects an action At ∈ A.
I The environment responds with a reward Rt .
I The distribution p (r | a ) is fixed but unknown.
I The agent’s goal is to maximize the cumulative reward ∑ti=1 Ri .
I One can think of multi-armed bandits as repeated games against nature2 .

2 A game against nature involves a single agent choosing under conditions of uncertainty, where none of the
uncertainty is strategic—that is, either the uncertainty is due to natural acts (crop loss, death, and the like) or, if other
people are involved, the others do not behave strategically toward the agent being modeled [Gin00].
Exploration and Exploitation
Action-value methods

The value of an action

I In a k -armed bandit problem, each of the k actions, say a, has an expected or mean
reward given that that action is selected—the value of that action:

q (a ) := E [Rt | At = a ] .

I If you knew it, the problem would be trivial.


I At best, we can estimate q. We denote our estimate of q (a ) by Qt (a ).
I The methods of action selection that are based on such estimation are called
action-value methods.
Exploration and Exploitation
Action-value methods

Estimating the value of an action

I Let the count be defined by


N_t(a) := ∑_{i=1}^{t−1} 1_{A_i = a},

where 1predicate is 0 if the predicate is false and 1 if the predicate is true.


I One natural way to estimate q is

Q_t(a) := (sum of rewards when a taken prior to t) / (number of times a taken prior to t)
        = ( ∑_{i=1}^{t−1} R_i · 1_{A_i = a} ) / N_t(a).

I If the denominator is zero then we initialize Qt (a ) to some default value, such as 0.


I As the denominator goes to infinity, by the law of large numbers, Qt (a ) converges to
q (a ).
I We call this the sample-average method for estimating action values because each
estimate is an average of the sample of relevant rewards.
Exploration and Exploitation
Action-value methods

Greedy actions

I At any time step t there is at least one action a∗ whose estimated value, Qt (a∗ ), is
greatest.
I We refer to these as greedy actions.
I Choosing these actions amounts to exploitation.
I Choosing one of the non-greedy actions instead amounts to exploration.
Exploration and Exploitation
Action-value methods

Returning to the mouse example

I Let rshock = −1 and rcheese = 1.


I Then
I On Monday, Q(a_black) = 0, Q(a_white) = 0, black lever, electric shock.
I On Tuesday, Q(a_black) = −1, Q(a_white) = 0, white lever, cheese.
I On Wednesday, Q(a_black) = −1, Q(a_white) = 1, white lever, electric shock.
I On Thursday, Q(a_black) = −1, Q(a_white) = 0, white lever, electric shock.
I On Friday, Q(a_black) = −1, Q(a_white) = −0.3333..., white lever, electric shock.
I On Saturday, Q(a_black) = −1, Q(a_white) = −0.5, white lever, electric shock.
I On Sunday, Q(a_black) = −1, Q(a_white) = −0.6...
I What should the mouse do?
I Although Q(a_white) > Q(a_black), the intuition is to switch.
I How can we formalize this?
Exploration and Exploitation
Action-value methods

The greedy algorithm

I In our example, the mouse has employed the greedy algorithm: it has sampled all the
actions and is exploiting the greedy action.
I If there is more than one greedy action, then a selection is made among them in some
arbitrary way, perhaps randomly.
I We see that the greedy algorithm can lock the agent into a suboptimal action.
I Can we do better?
Exploration and Exploitation
Action-value methods

The e-greedy algorithm

I Let e ∈ [0, 1].


I The e-greedy algorithm continues to explore indefinitely:
I With probability 1 − e, select a greedy action arg max_{a∈A} Q_t(a).
I With probability e select a random action.
I An advantage of e-greedy methods is that, in the limit as the number of steps
increases, every action will be sampled an infinite number of times, thus ensuring that
all the Qt (a ) converge to q (a ).
I This of course implies that the probability of selecting the optimal action converges to
greater than 1 − e, that is, to near certainty.
I These are just asymptotic guarantees and say little about the practical effectiveness of
the methods.
I It is possible to reduce e over time to try to get the best of both high and low values.
Exploration and Exploitation
Action-value methods

Online (incremental) implementation


I Sample averages Qn of observed rewards can be computed in a computationally
efficient manner.
I In particular, with constant memory and constant per-time-step computation.
I To simplify the notation, we consider only one action.
I Let Ri denote the reward received after the ith selection of this action, and let Qn
denote the estimate of its action value after it has been selected n − 1 times, which we
can now write simply as

Q_n := (R_1 + R_2 + ... + R_{n−1}) / (n − 1).
I Then
Q_{n+1} = Q_n + (1/n) [R_n − Q_n].
I This update rule is of the form

NewEstimate ← OldEstimate + StepSize [Target − OldEstimate] ,

which occurs commonly in machine learning.


I The expression [Target − OldEstimate] is the error of the estimate.
I Notice that the StepSize changes from time step to time step. It is often denoted by α
or α(t ).
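
To make the incremental rule concrete, here is a minimal Python sketch (an illustration, not part of the original notes); the class name and defaults are assumptions.

# Minimal sketch of the incremental sample-average update (names are illustrative).
class IncrementalMean:
    """Maintains Q_n with O(1) memory: Q_{n+1} = Q_n + (1/n) * (R_n - Q_n)."""

    def __init__(self, initial_estimate: float = 0.0):
        self.q = initial_estimate  # current estimate Q_n
        self.n = 0                 # number of rewards seen so far

    def update(self, reward: float) -> float:
        self.n += 1
        step_size = 1.0 / self.n                  # alpha(t) = 1/n
        self.q += step_size * (reward - self.q)   # NewEstimate = Old + StepSize * (Target - Old)
        return self.q

# Example: the estimate after rewards 1, -1, -1 is their running mean.
est = IncrementalMean()
for r in [1.0, -1.0, -1.0]:
    est.update(r)
print(est.q)  # -0.333... equals (1 - 1 - 1) / 3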
Exploration and Exploitation
Action-value methods

An e-greedy bandit algorithm in pseudocode

I Initialize, for a = 1 to k :
I Q (a ) ← 0
I N (a ) ← 0
I Loop forever:

I A ← arg max_a Q(a) with probability 1 − e (breaking ties randomly);
      a random action with probability e.
I R ← bandit(A)
I N(A) ← N(A) + 1
I Q(A) ← Q(A) + (1/N(A)) [R − Q(A)]
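
A runnable Python sketch of the pseudocode above; the Gaussian testbed standing in for bandit(A) and all parameter values are assumptions for illustration.

import random

# Sketch of the e-greedy bandit loop above; the Gaussian testbed is an assumed stand-in for bandit(A).
def epsilon_greedy_bandit(true_means, epsilon=0.1, steps=10_000, seed=0):
    rng = random.Random(seed)
    k = len(true_means)
    Q = [0.0] * k   # action-value estimates
    N = [0] * k     # selection counts
    for _ in range(steps):
        if rng.random() < epsilon:
            a = rng.randrange(k)                                           # explore: random action
        else:
            best = max(Q)
            a = rng.choice([i for i, q in enumerate(Q) if q == best])      # exploit, ties broken randomly
        r = rng.gauss(true_means[a], 1.0)                                  # R <- bandit(A)
        N[a] += 1
        Q[a] += (r - Q[a]) / N[a]                                          # incremental sample-average update
    return Q, N

Q, N = epsilon_greedy_bandit([0.2, 0.5, 1.0, -0.3])
print(max(range(len(Q)), key=Q.__getitem__))  # typically the arm with true mean 1.0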
Exploration and Exploitation
Action-value methods

The advantage of the e-greedy method

I The advantage of e-greedy over greedy methods depends on the task.


I For example, suppose the reward variance had been larger, say 10 instead of 1.
I With noisier rewards it takes more exploration to find the optimal action, and the
e-greedy methods should fare even better relative to the greedy method.
I On the other hand, if the reward variances were zero, then the greedy method would
know the true value of each action after trying it once. In this case the greedy method
might actually perform best because it would soon find the optimal action and then
never explore.
I Suppose the bandit task were nonstationary, that is, the true values of the actions
changed over time. In this case exploration is needed even in the deterministic case to
make sure one of the nongreedy actions has not changed to become better than the
greedy one.
I Nonstationarity is the case most commonly encountered in reinforcement learning.
Exploration and Exploitation
Optimistic initial values

Initial action-value estimates

I All the methods we have discussed so far are dependent to some extent on the initial
action-value estimates, Q1 (a ).
I In the language of statistics, these methods are biased by their initial estimates.
I For the sample-average methods, the bias disappears once all actions have been
selected at least once, but for methods with constant α, the bias is permanent, though
decreasing over time.
I In practice, this kind of bias is usually not a problem and can sometimes be very
helpful.
I The downside is that the initial estimates become, in effect, a set of parameters that
must be picked by the user, if only to set them all to zero.
I The upside is that they provide an easy way to supply some prior knowledge about
what level of rewards can be expected.
Exploration and Exploitation
Optimistic initial values

Encouraging exploration

I Suppose that instead of setting the initial action values to zero, as we did in the
10-armed testbed, we set them all to +5.
I Recall that the q (a ) in this problem are selected from a normal distribution with mean
0 and variance 1.
I An initial estimate of +5 is thus wildly optimistic.
I But this optimism encourages action-value methods to explore.
I Whichever actions are initially selected, the reward is less than the starting estimates;
the learner switches to other actions, being “disappointed” with the rewards it is
receiving.
I The result is that all actions are tried several times before the value estimates
converge. The system does a fair amount of exploration even if greedy actions are
selected all the time.
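
The effect is easy to reproduce; the following Python sketch (illustrative only, with an assumed Gaussian testbed) runs a purely greedy learner whose estimates all start at +5.

import random

# Sketch: greedy action selection with optimistic initial values Q1(a) = +5 (testbed assumed).
def greedy_with_optimism(true_means, q0=5.0, steps=2000, seed=1):
    rng = random.Random(seed)
    k = len(true_means)
    Q = [q0] * k
    N = [0] * k
    for _ in range(steps):
        a = max(range(k), key=lambda i: Q[i])      # always greedy
        r = rng.gauss(true_means[a], 1.0)
        N[a] += 1
        Q[a] += (r - Q[a]) / N[a]                  # early rewards "disappoint" the optimistic estimate
    return N

print(greedy_with_optimism([0.0, 0.4, 0.8]))  # all arms get tried early despite pure greediness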
Exploration and Exploitation
Optimistic initial values

Optimistic initial values

I We call this technique for encouraging exploration optimistic initial values.


I We regard it as a simple trick that can be quite effective on stationary problems, but it
is far from being a generally useful approach to encouraging exploration.
I For example, it is not well suited to nonstationary problems because its drive for
exploration is inherently temporary.
I If the task changes, creating a renewed need for exploration, this method cannot help.
Indeed, any method that focusses on the initial conditions in any special way is unlikely
to help with the general nonstationary case.
I The beginning of time occurs only once, and thus we should not focus on it too much.
I This criticism applies as well to the sample-average methods, which also treat the
beginning of time as a special event, averaging all subsequent rewards with equal
weights.
I Nevertheless, all of these methods are very simple, and one of them — or some
combination of them — is often adequate in practice.
Exploration and Exploitation
Optimistic initial values

Optimism in the face of uncertainty

From [KS17]:
The principle of “optimism in the face of uncertainty” is an empirically verified
idea in sequential decision-making problems [BCB12], although the accurate ori-
gin of it is uncertain. Biologically, the optimism bias is known as a cognitive bias
in higher brain functions [Fox12, SCGT02]. On the other hand, the contexts of
machine learning deal with optimism as a strategy to explore better solutions. The
difference of the implications might be derived from the difference of the viewpoint
on rewards.
Exploration and Exploitation
How well can we do?

How well can we do?

I The optimal value is

v_* = q(a_*) = max_{a∈A} q(a) = max_a E[R_t | A_t = a],

where q (a ), as before, is the true value of action a.


I Regret is the opportunity loss for one time step: lt = E [v∗ − q (At )].
I The action regret or gap is the difference in value between action a and optimal
action a∗ : ∆a = v∗ − q (a ).
I The total regret is the total opportunity loss
" # " #
t t
Lt = E ∑ lt =E ∑ (v∗ − q(Ai )) .
i =1 i =1

I Thus the agent’s goal is to trade-off exploration and exploitation by minimizing the total
regret (which is equivalent to maximizing the cumulative reward).
I The agent cannot observe, or even sample, the real regret directly.
I However, it is useful for analysing different learning algorithms.
Exploration and Exploitation
How well can we do?

Regret

I Regret is a function of gaps and counts:


" #
t
Lt = E ∑ (v∗ − q(Ai ))
i =1

= ∑ E [Nt (a )] (v∗ − q (a ))
a ∈A

= ∑ E [Nt (a )] ∆a .
a ∈A

I A good algorithm ensures small counts for large gaps.


I The problem is that gaps are not known.
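
The decomposition translates directly into code; a small Python sketch (with hypothetical gaps and counts) evaluates L_t = ∑_a E[N_t(a)] ∆_a.

# Sketch: total regret from per-arm gaps and (expected) selection counts, L_t = sum_a N_t(a) * Delta_a.
def total_regret(gaps, counts):
    assert len(gaps) == len(counts)
    return sum(delta * n for delta, n in zip(gaps, counts))

# Hypothetical 3-armed example: arm 0 is optimal (gap 0).
gaps = [0.0, 0.3, 0.8]             # Delta_a = v* - q(a)
counts = [900, 70, 30]             # times each arm was pulled in t = 1000 steps
print(total_regret(gaps, counts))  # 0.3*70 + 0.8*30 = 45.0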
Exploration and Exploitation
How well can we do?

Regret for the greedy and e-greedy algorithms

I The greedy algorithm selects the action with the highest value

A_t = arg max_{a∈A} Q_t(a).

I The greedy algorithm can lock onto a suboptimal action forever.


I Therefore the greedy algorithm has linear total regret (it is a linear function of time t).
I The e-greedy method will continue to select all suboptimal actions with probability e.
I Constant e ensures minimum regret:
l_t ≥ (e / |A|) ∑_{a∈A} ∆_a.

I Therefore the e-greedy algorithm also has linear expected total regret.
I Optimistic greedy and optimistic e-greedy algorithms also have linear expected total
regret.
Exploration and Exploitation
How well can we do?

Decaying et -greedy algorithm

I Let us come up with a decaying et -greedy algorithm.


I The trick is to decrease et fast enough that the regret is close to the optimal (we will
find out what that is shortly) but slow enough that the estimate of which arm to choose
converges to the optimum.
I Pick a decay schedule for e1 , e2 , . . ..
I Consider the following schedule [ACBF02]:

e_t = min{ 1, c|A| / (d² t) } ∝ 1/t,

where c > 0 and d = min_{a : ∆_a > 0} ∆_a.


I This decaying et -greedy algorithm has logarithmic asymptotic total regret!
I Unfortunately, the schedule requires knowledge of gaps.
I We aim to find an algorithm with sublinear regret for any multi-armed bandit (without
knowledge of gaps).
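
A Python sketch of this schedule (the constant c and the gaps are assumed purely for illustration; in practice the gaps are unknown, which is the difficulty noted above).

# Sketch of the decaying schedule e_t = min(1, c*|A| / (d^2 * t)), with assumed c and gaps.
def epsilon_schedule(t, num_actions, gaps, c=1.0):
    d = min(delta for delta in gaps if delta > 0)   # smallest positive gap (unknown in practice)
    return min(1.0, c * num_actions / (d * d * t))

gaps = [0.0, 0.2, 0.5]
for t in [1, 10, 100, 1000]:
    print(t, epsilon_schedule(t, num_actions=len(gaps), gaps=gaps))
# e_t decays like 1/t, so exploration never stops but becomes increasingly rare.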
Exploration and Exploitation
How well can we do?

Regret for simple multi-armed bandit algorithms

Figure: Total regret for simple multi-armed bandit algorithms.


Exploration and Exploitation
How well can we do?

Kullback–Leibler divergence
I The Kullback–Leibler divergence [KL51, Kul59], also called relative entropy, is a
measure of how one probability distribution is different from a second, reference
probability distribution.
I For discrete probability distributions P and Q defined on the same space, X , the
relative entropy from Q to P is defined to be
   
KL(P || Q) = ∑_{x∈X} P(x) ln( P(x)/Q(x) ) = − ∑_{x∈X} P(x) ln( Q(x)/P(x) ).

I For distributions P and Q of a continuous random variable, relative entropy is defined


to be the integral
KL(P || Q) = ∫_{X} p(x) ln( p(x)/q(x) ) dx,
where p and q denote the probability densities of P and Q.
I More generally, if P and Q are probability measures over a set X , and P is absolutely
continuous with respect to Q, then
   
KL(P || Q) = ∫_X ln(dP/dQ) dP = ∫_X (dP/dQ) ln(dP/dQ) dQ   (by the chain rule),
where dP/dQ is the Radon–Nikodym derivative of P with respect to Q.
Exploration and Exploitation
How well can we do?

An example
I Consider a numerical example from [Kul59]:
x                    0       1       2
Distribution P(x)    9/25    12/25   4/25
Distribution Q(x)    1/3     1/3     1/3
I In this example,
 
KL(P || Q) = ∑_{x∈X} P(x) ln( P(x)/Q(x) )
           = (9/25) ln( (9/25)/(1/3) ) + (12/25) ln( (12/25)/(1/3) ) + (4/25) ln( (4/25)/(1/3) )
           = (1/25) (32 ln 2 + 55 ln 3 − 50 ln 5) ≈ 0.0852996,
I whereas
 
KL(Q || P) = ∑_{x∈X} Q(x) ln( Q(x)/P(x) )
           = (1/3) ln( (1/3)/(9/25) ) + (1/3) ln( (1/3)/(12/25) ) + (1/3) ln( (1/3)/(4/25) )
           = (1/3) (−4 ln 2 − 6 ln 3 + 6 ln 5) ≈ 0.097455.
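
These values are easy to verify numerically; a short Python sketch using only the standard library:

from math import log

# Sketch: verify the two KL divergences for the discrete example above.
def kl(p, q):
    return sum(pi * log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

P = [9/25, 12/25, 4/25]
Q = [1/3, 1/3, 1/3]
print(kl(P, Q))  # ~0.0852996
print(kl(Q, P))  # ~0.0974551  (asymmetry: KL(P||Q) differs from KL(Q||P))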
Exploration and Exploitation
How well can we do?

Properties of the Kullback–Leibler divergence

I By definition, the Kullback–Leibler divergence KL (P || Q ) is the expectation of the


logarithmic difference between the probabilities P and Q, where the expectation is
taken using the probabilities P.
I We have seen from our example that it is not symmetric, i.e.

KL(P || Q) ≠ KL(Q || P).

I Nor does the Kullback–Leibler divergence satisfy the triangle inequality.


I Therefore it is not a distance metric [Men90] in the traditional sense of the word.
I If one is looking for a distance metric, then one such distance metric can be
constructed on the basis of mutual information (MI), which is itself related to
Kullback–Leibler divergence.
I However, we’ll see now that it is the Kullback–Leibler divergence that features in many
multi-armed bandits applications.
Exploration and Exploitation
How well can we do?

A bound on regret

I The performance of any method is determined by similarity between optimal arm and
other arms.
I Hard problems have arms with similar distributions but different means.
I This is described formally by the gap ∆a and the Kullback–Leibler divergence

KL (p (r | a ) || p (r | a∗ )).

I T. L. Lai and Herbert Robbins established the following lower bound on total regret
in [LR85]:
lim_{t→∞} L_t ≥ ln t · ∑_{a : ∆_a > 0} ∆_a / KL( p(r | a) || p(r | a_*) ).

I Notice that this grows logarithmically with time (the ln t factor), which is a lot better
than linearly.
Exploration and Exploitation
How well can we do?

Big O functions ranking (i)


Exploration and Exploitation
How well can we do?

Big O functions ranking (ii)

Description               Function    n = 1   n = 10    n = 100     n = 1000

constant                  O(1)        1       1         1           1
logarithmic               O(ln n)     0       2.302     4.605       6.907
linear                    O(n)        1       10        100         1000
loglinear (linearithmic)  O(n ln n)   0       23.025    460.517     6907.755
quadratic                 O(n²)       1       100       10000       1000000
cubic                     O(n³)       1       1000      1000000     1E+09
exponential               O(2ⁿ)       2       1024      1.27E+30    1.1E+301
Exploration and Exploitation
Upper-Confidence-Bound Action Selection

Which action to pick?

Figure: Which action to pick?

I The more uncertain we are about an action-value, the more important it is to explore
that action.
I It could turn out to be the best action.
Exploration and Exploitation
Upper-Confidence-Bound Action Selection

Candidate actions

I Exploration is needed because there is always uncertainty about the accuracy of the
action-value estimates.
I The greedy actions are those that look best at present, but some of the other actions
may actually be better.
I e-greedy action selection forces the non-greedy actions to be tried, but
indiscriminately, with no preference for those that are nearly greedy or particularly
uncertain.
I It would be better to select among the non-greedy actions according to their potential
for being optimal, taking into account both how close their estimates are to being
maximal and the uncertainties in those estimates.
Exploration and Exploitation
Upper-Confidence-Bound Action Selection

Upper confidence bounds (UCB)

I Here is an idea: estimate an upper confidence Ut (a ) for each action value.


I We need
q ( a ) ≤ Qt ( a ) + Ut ( a )
with high probability.
I This depends on the number of times, N_t(a), that a has been selected:
I Small Nt (a ) ⇒ large Ut (a ) (estimated value is uncertain);
I Large Nt (a ) ⇒ small Ut (a ) (estimated value is accurate).
I Select the action maximizing the upper confidence bound (UCB):

A_t = arg max_{a∈A} [ Q_t(a) + U_t(a) ].
Exploration and Exploitation
Upper-Confidence-Bound Action Selection

Theorem (Hoeffding’s inequality)


I Let X_1, ..., X_t be i.i.d. random variables in [0, 1], and let X̄_t = (1/t) ∑_{i=1}^{t} X_i be the
sample mean.
I Then
P[ E[X] > X̄_t + u ] ≤ e^{−2tu²}.
I This is known as Hoeffding’s inequality [Hoe63].
I Applying this to UCB, we obtain
P[ q(a) > Q_t(a) + U_t(a) ] ≤ e^{−2 N_t(a) U_t(a)²}.
I Now let
p := e^{−2 N_t(a) U_t(a)²}
and solve for U_t(a):
U_t(a) = √( −ln p / (2 N_t(a)) ).
I Reduce p as we observe more rewards, e.g. p = t^{−4}, so we ensure that we select an
optimal action as t → ∞:
U_t(a) = √( 2 ln t / N_t(a) ).
Exploration and Exploitation
Upper-Confidence-Bound Action Selection

A UCB algorithm (i)


I This leads to the action selection on the basis of
A_t := arg max_a [ Q_t(a) + c √( ln t / N_t(a) ) ],
where N_t(a) denotes the number of times that action a has been selected prior to time
t, and the number c > 0 (in our example, c = √2) controls the degree of exploration.
I The square-root term is a measure of the uncertainty or variance in the estimate of a’s
value.
I The quantity being max’ed over is thus a sort of upper bound on the possible true
value of action a, with c determining the confidence level.
I Each time a is selected the uncertainty is presumably reduced: Nt (a ) increments,
and, as it appears in the denominator, the uncertainty estimate decreases.
I On the other hand, each time an action other than a is selected, t increases but Nt (a )
does not; because t appears in the numerator, the uncertainty estimate increases.
I The use of the natural logarithm means that the increases get smaller over time, but
are unbounded; all actions will eventually be selected, but actions with lower value
estimates, or that have already been selected frequently, will be selected with
decreasing frequency over time.
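
Putting the pieces together, here is a minimal Python sketch of UCB action selection (the Gaussian testbed and parameters are assumptions for illustration).

import math
import random

# Sketch of UCB action selection A_t = argmax_a [ Q_t(a) + c * sqrt(ln t / N_t(a)) ] (testbed assumed).
def ucb_bandit(true_means, c=math.sqrt(2), steps=5000, seed=0):
    rng = random.Random(seed)
    k = len(true_means)
    Q = [0.0] * k
    N = [0] * k
    for t in range(1, steps + 1):
        if 0 in N:
            a = N.index(0)   # play each arm once first (the bonus is infinite when N_t(a) = 0)
        else:
            a = max(range(k), key=lambda i: Q[i] + c * math.sqrt(math.log(t) / N[i]))
        r = rng.gauss(true_means[a], 1.0)
        N[a] += 1
        Q[a] += (r - Q[a]) / N[a]
    return Q, N

Q, N = ucb_bandit([0.1, 0.4, 0.9])
print(N)  # most pulls concentrate on the arm with the highest true mean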
Exploration and Exploitation
Upper-Confidence-Bound Action Selection

A UCB algorithm (ii)


I Theorem [ACBF02]: The UCB algorithm (with c = √2) achieves logarithmic expected
total regret for all t:
L_t ≤ 8 ∑_{a : ∆_a > 0} (ln t / ∆_a) + O( ∑_a ∆_a ).

I UCB often performs well, but is more difficult than e-greedy methods to extend beyond
bandits to more general reinforcement learning settings.
I One difficulty is in dealing with nonstationary problems.
I Another difficulty is dealing with large state spaces.
I In these more advanced settings the idea of UCB action selection is often impractical.
Exploration and Exploitation
Upper-Confidence-Bound Action Selection

Other inequalities

We have applied Hoeffding’s inequality, but could have applied...


I Bernstein’s inequality [Ber24];
I Empirical Bernstein’s inequality [PAR10];
I Chernoff inequality [Das08];
I Azuma’s inequality [Azu67];
I ...
Exploration and Exploitation
Tracking a nonstationary problem

Nonstationarity

I The averaging methods discussed so far are appropriate for stationary bandit
problems, that is, for bandit problems in which the reward probabilities do not change
over time.
I In stationary problems, the distribution of Rt given At is identical and independent
across time.
I We often encounter reinforcement learning problems that are effectively
nonstationary.
I In such cases it makes sense to give more weight to recent rewards than to long-past
rewards.
I One of the most popular ways of doing this is to use a constant step-size parameter.
I For example, the incremental update rule for updating an average Qn of the n − 1 past
rewards is modified to be

Qn+1 := Qn + α [Rn − Qn ] ,

where the step-size parameter α ∈ (0, 1] is constant.


I Constant α would lead to tracking rather than averaging with more weight towards
recent rewards.
Exploration and Exploitation
Tracking a nonstationary problem

Weighted average

I This results in Qn+1 being a weighted average of past rewards and the initial estimate
Q1 :
Q_{n+1} = (1 − α)^n Q_1 + ∑_{i=1}^{n} α(1 − α)^{n−i} R_i.
I We call this a weighted average because the sum of the weights is
(1 − α)^n + ∑_{i=1}^{n} α(1 − α)^{n−i} = 1.
I Note that the weight α(1 − α)^{n−i} given to the reward R_i depends on how many
rewards ago, n − i, it was observed.
I The quantity 1 − α is less than 1, and thus the weight given to Ri decreases as the
number of intervening rewards increases.
I In fact, the weight decays exponentially according to the exponent on 1 − α.
I Accordingly, this is sometimes called an exponential recency-weighted average.
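
A short Python sketch (with assumed rewards and α) confirming that the constant step-size recursion and the explicit exponential recency-weighted average coincide.

# Sketch: constant step-size update vs. the explicit exponential recency-weighted average.
def constant_alpha_estimate(rewards, q1=0.0, alpha=0.1):
    q = q1
    for r in rewards:
        q += alpha * (r - q)                      # Q_{n+1} = Q_n + alpha * (R_n - Q_n)
    return q

def weighted_average(rewards, q1=0.0, alpha=0.1):
    n = len(rewards)
    total = (1 - alpha) ** n * q1
    total += sum(alpha * (1 - alpha) ** (n - i) * r for i, r in enumerate(rewards, start=1))
    return total

rewards = [1.0, 0.0, 2.0, -1.0, 3.0]              # assumed reward sequence
print(constant_alpha_estimate(rewards))           # the two agree
print(weighted_average(rewards))                  # (up to floating-point rounding)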
Exploration and Exploitation
Tracking a nonstationary problem

Variable step size

I Sometimes it is convenient to vary the step-size parameter from step to step.


I Let αn (a ) denote the step-size parameter used to process the reward received after
the nth selection of action a.
I As we have noted, the choice α_n(a) = 1/n results in the sample-average method,
which is guaranteed to converge to the true action values by the law of large numbers.
I Convergence is not guaranteed for all choices of the sequence {αn (a )}.
I A well-known result in stochastic approximation theory gives us the conditions required
to assure convergence with probability 1:
∑_{n=1}^{∞} α_n(a) = ∞   and   ∑_{n=1}^{∞} α_n(a)² < ∞.

I The first condition is required to guarantee that the steps are large enough to
eventually overcome any initial conditions or random fluctuations.
I The second condition guarantees that eventually the steps become small enough to
assure convergence.
Exploration and Exploitation
Tracking a nonstationary problem

Constant step size

I Note that both convergence conditions are met for the sample-average case,
α_n(a) = 1/n, but not for the case of constant step-size parameter, α_n(a) = α.
I In the latter case, the second condition is not met, indicating that the estimates never
completely converge but continue to vary in response to the most recently received
rewards.
I This is actually desirable in a nonstationary environment, and problems that are
effectively nonstationary are the most common in reinforcement learning.
I In addition, sequences of step-size parameters that meet the two conditions often
converge very slowly or need considerable tuning in order to obtain a satisfactory
convergence rate.
I Although sequences of step-size parameters that meet these convergence conditions
are often used in theoretical work, they are seldom used in applications and empirical
research.
Exploration and Exploitation
Bayesian bandits

Interpretations of probability
I Classical:
I A random experiment E is performed.
I The set Ω of possible outcomes of E is finite and all ω ∈ Ω are equally likely.
I The probability of the event A ⊆ Ω is given by
P[A] = |A| / |Ω|.
I Frequentist:
I The superexperiment E ∞ consists in an infinite number of independent performances of a
random experiment E .
I Let N (A , n ) be the number of occurrences of the event A in the first n performances of E
within E ∞ .
I The probability of A is given by
P[A] = lim_{n→∞} N(A, n) / n.
I Bayesian:
I The probability of an event is the degree of belief that that event will occur.
I Axiomatic:
The theory of probability as a mathematical discipline can and should be devel-
oped from axioms in exactly the same way as Geometry and Algebra. [Kol33]
Exploration and Exploitation
Bayesian bandits

Bayes’s Theorem

I Let H and E be events (H stands for “hypothesis”, E stands for “evidence”).


I Bayes’s Theorem establishes the relationship between the probability of H, P [H ], the
probability of E, P [E ], the conditional probability of H given E, P [H | E ], and the
conditional probability of E given H, P [E | H ]:

P[H | E] = P[E | H] P[H] / P[E].
I The proof follows immediately from the definition of conditional probability:
P[H | E] = P[H ∩ E] / P[E] = P[E ∩ H] / P[E] = P[E | H] P[H] / P[E].
I It is useful to reformulate it for the case when there are multiple alternative
hypotheses, H1 , H2 , . . . Hs , s ∈ N∗ . Then, for 1 ≤ i ≤ s,

P[H_i | E] = P[E | H_i] P[H_i] / P[E] = P[E | H_i] P[H_i] / ∑_{j=1}^{s} P[E | H_j] P[H_j],

where the second equality follows from the Law of Total Probability.
Exploration and Exploitation
Bayesian bandits

Bayes’s 1763 essay

In An Essay towards Solving a Problem in the Doctrine of Chances by


the late Rev. Mr. Bayes, F. R. S. communicated by Mr. Price, in a Letter
to John Canton, A. M. F. R. S. [BP63]:
I Bayes attempts to solve the following Problem:
Given the number of times in which an unknown event has
happened and failed^a: Required the chance that the probability
of its happening in a single trial lies somewhere between any
two degrees of probability that can be named.

I Bayes performs a thought experiment. Bayes has his back turned


to a table, and asks his assistant to drop a ball on the table. He
asks his assistant to throw another ball on the table and report
whether it is to the left or the right of the first ball. If the new ball
landed to the left of the first ball, then the first ball is more likely to
be on the right side of the table than the left side. He asks his
assistant to throw the second ball again. If it again lands to the left
of the first ball, then the first ball is even more likely than before to
be on the right side of the table. And so on.
I Bayes’s system was: Initial Belief + New Data → Improved Belief.
I Bayes said that if he didn’t know what guess to make, he’d just
assign all possibilities equal probability to start.

a Did not happen.


Exploration and Exploitation
Bayesian bandits

Laplace’s 1774 memoir

In Mémoire sur la probabilité des causes par les évènemens by M.


de la Place, Professeur à l’École royale Militaire [Lap74]:
I Laplace states a “Principle” equivalent to Bayes’s theorem with all
causes being a priori equally likely.
I If F is Laplace’s event and θ1 , θ2 , . . . , θn the n causes, then his
axiomatic “Principle” is:
P[θ_i | F] / P[θ_j | F] = P[F | θ_i] / P[F | θ_j]
and
P[θ_i | F] = P[F | θ_i] / ∑_{j=1}^{n} P[F | θ_j].


I Some reasons why we can be reasonably certain Laplace was


unaware of Bayes’s earlier work can be found in [Sti78].
I Where Bayes gives a cogent argument why an a priori uniform
distribution might be acceptable, Laplace assumes the conclusion
as an intuitively obvious axiom.
I See [Sti86] for details.
Exploration and Exploitation
Bayesian bandits

Bayes’s Theorem: frequentist interpretation

I How does a frequentist interpret

P[H | E] = P[E | H] P[H] / P[E]?
I For a frequentist, probability is a long-term relative frequency of outcomes in the
superexperiment E^∞.
I Let N(A, n) be the number of occurrences of the event A in the first n performances of
E within E^∞. Then P[A] := lim_{n→∞} N(A, n) / n.
I In particular,
P[H] := lim_{n→∞} N(H, n) / n,   P[E] := lim_{n→∞} N(E, n) / n.
I Therefore,
P[H | E] = P[H ∩ E] / P[E] = lim_{n→∞} N(H ∩ E, n) / N(E, n),
P[E | H] = P[H ∩ E] / P[H] = lim_{n→∞} N(H ∩ E, n) / N(H, n).

I The conditional probabilities are, then, relative frequencies: P [H | E ] is the proportion


of outcomes with property H out of those with property E.
Exploration and Exploitation
Bayesian bandits

Bayes’s Theorem: Bayesian interpretation


I How does a Bayesian interpret

P[H | E] = P[E | H] P[H] / P[E]?
I For a Bayesian, probability is a degree of belief.
I Before (prior to) observing any new evidence E—the degree of belief in a certain
hypothesis H.
I After (posterior to)—the degree of belief in H after taking into account that piece of
evidence E.
I Thus,
I P [H ] is the prior, the initial degree of belief in H.
I P [H | E ] is the posterior, the degree of belief in H after accounting for E.
I P[E | H] / P[E] is the support that the evidence E provides for the hypothesis H.
I P [E | H ] is the likelihood, the compatibility of the evidence with the hypothesis.
I P [E ] is the marginal likelihood of the evidence, irrespective of the hypothesis.
I Using this Bayesian terminology, Bayes’s Theorem can be stated as:

posterior = support · prior ∝ likelihood · prior.


Exploration and Exploitation
Bayesian bandits

A perspective on Bayesian approach

According to Senn [Sen97, page 46],


A Bayesian is one who, vaguely expecting a horse, and catching a glimpse of a
donkey, strongly concludes he has seen a mule.
Exploration and Exploitation
Bayesian bandits

Question

I While watching a game of Champions League football in a bar, you observe someone
who is clearly supporting Manchester United in the game.
I What is the probability that they were actually born within 25 miles of Manchester?
I Assume that:
I the probability that a randomly selected person in a typical local bar environment is born
within 25 miles of Manchester is 1/20;
I the probability that a person born within 25 miles of Manchester supports United is 7/10;
I the probability that a person not born within 25 miles of Manchester supports United is 1/10.
Exploration and Exploitation
Bayesian bandits

Solution

I Define
I B—the event that the person is born within 25 miles of Manchester;
I U—the event that the person supports United.
I We want P [B | U ].

P[B | U] = P[U | B] P[B] / P[U]                                       (Bayes's Theorem)
         = P[U | B] P[B] / ( P[U | B] P[B] + P[U | B^c] P[B^c] )      (Law of Total Probability)
         = (7/10 · 1/20) / ( 7/10 · 1/20 + 1/10 · 19/20 )
         = 7/26 ≈ 0.269.
Exploration and Exploitation
Bayesian bandits

An example (i)

I Consider an experiment consisting of a single coin flip.


I We set the random variable Y to 0 if tails come up and 1 if heads come up.
I Then the probability density of Y is given by

p(y | θ) = θ^y (1 − θ)^{1−y},

where θ ∈ [0, 1] is the probability of heads showing up.


I You will recognize Y as a Bernoulli random variable.
I We view p as a function of y, but parameterized by the given parameter θ , hence the
notation, p (y | θ ).
Exploration and Exploitation
Bayesian bandits

An example (ii)

I More generally, suppose that we perform n such independent experiments (tosses) on


the same coin.
I Denote these n realizations of Y as
y = (y_1, y_2, ..., y_n)^⊤ ∈ {0, 1}^n,
where, for 1 ≤ i ≤ n, y_i is the result of the ith toss.


I Since the coin tosses are independent, the probability density of y, i.e. the joint
probability density of y1 , y2 , . . . , yn , is given by the product rule
p(y | θ) = p(y_1, y_2, ..., y_n | θ) = ∏_{i=1}^{n} θ^{y_i} (1 − θ)^{1−y_i} = θ^{∑ y_i} (1 − θ)^{n − ∑ y_i}.
Exploration and Exploitation
Bayesian bandits

An example (iii)

I Suppose that we have tossed the coin n = 50 times (performed n = 50 Bernoulli


trials) and recorded the results of the trials as

0 0 1 0 0 1 0 0 0 0 0 1 0 1 1 1 0 0 0 0 1 0 1 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 1 0 1 0 0

I How can we estimate θ given these data?


Exploration and Exploitation
Bayesian bandits

Bayesian estimation: an uninformative prior

I As Bayesians, we view the parameter θ as a random variable.


I Apply Laplace’s principle of indifference (principle of insufficient reason): when
faced with multiple possibilities, whose probabilities are unknown, assume that the
probabilities of all possibilities are equal: assume that all values of θ in [0, 1] are
equally likely.
I Thus our prior is that θ is uniformly distributed on [0, 1], i.e. θ ∼ U (a = 0, b = 1).
I In the context of Bayesian estimation, applying Laplace’s principle of indifference,
constitutes what is known as an uninformative prior.
I The pdf of the uniform distribution, U (a , b ), is given by

p(θ) = 1 / (b − a)

if θ ∈ [a , b ] and zero otherwise.


I In our case, a = 0, b = 1, and so our uninformative prior is given by

p ( θ ) = 1, for all θ ∈ [0, 1].


Exploration and Exploitation
Bayesian bandits

Bayesian estimation: a posterior (i)

I Bayes’s theorem tells us that

posterior ∝ likelihood · prior.

I Thus the posterior is

p(θ | y) ∝ p(y | θ) p(θ) = θ^{∑ y_i} (1 − θ)^{n − ∑ y_i} · 1.

I From the shape of the resulting pdf, we recognize it as the pdf of the Beta distribution3
Beta(θ | 1 + ∑ y_i, 1 + n − ∑ y_i),
and we immediately know the missing normalizing constant factor.
3 The function’s argument is now θ , not yi , so it is not the pdf of a Bernoulli distribution.
Exploration and Exploitation
Bayesian bandits

Bayesian estimation: a posterior (ii)

I From the properties of this distribution,
E[θ | y] = (1 + ∑ y_i) / ( (1 + ∑ y_i) + (1 + n − ∑ y_i) ) = (1 + ∑ y_i) / (n + 2) = 13/52 = 0.25,
I and
Var[θ | y] = (1 + ∑ y_i)(1 + n − ∑ y_i) / ( (n + 2)² (n + 3) ) = (13 · 39) / (52² · 53) = 507/143312 ≈ 0.003538.
I The standard deviation being, in units of probability, √(507/143312) ≈ 0.0595.
I Notice that the mean of the posterior, 0.25, is close to the frequentist maximum
likelihood estimate of θ, θ̂_ML = ∑ y_i / n = 0.24, and to our intuition.
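
A small Python sketch of this posterior summary for the data above (12 heads in n = 50 tosses), under the uniform, i.e. Beta(1, 1), prior.

# Sketch: posterior summary for theta under a uniform (Beta(1, 1)) prior and 12 heads in 50 tosses.
n, heads = 50, 12
alpha_post = 1 + heads          # Beta posterior parameters
beta_post = 1 + n - heads

mean = alpha_post / (alpha_post + beta_post)
var = (alpha_post * beta_post) / ((alpha_post + beta_post) ** 2 * (alpha_post + beta_post + 1))
print(mean)        # 0.25
print(var ** 0.5)  # ~0.0595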
Exploration and Exploitation
Bayesian bandits

Bayesian estimation: a more informative prior (i)


I Let us question our prior. Is it somewhat too uninformative? After all, most coins in the
world are probably close to being fair.
I We could use a
Beta(θ | α, β)                                                        (1)
prior instead of the Uniform prior. Picking α = β = 2, for example, will give a
distribution on [0, 1] centred on 1/2, incorporating the assumption that the coin is fair.
I The pdf of this prior is given by
p(θ) = (1 / B(α, β)) θ^{α−1} (1 − θ)^{β−1},   for all θ ∈ [0, 1].
I So the posterior becomes
p(θ | y) ∝ p(y | θ) p(θ)
         = θ^{∑ y_i} (1 − θ)^{n − ∑ y_i} · (1 / B(α, β)) θ^{α−1} (1 − θ)^{β−1}
         ∝ θ^{(α + ∑ y_i) − 1} (1 − θ)^{(β + n − ∑ y_i) − 1},
which we recognize as a pdf of the distribution
Beta(θ | α + ∑ y_i, β + n − ∑ y_i).                                   (2)
Exploration and Exploitation
Bayesian bandits

Bayesian estimation: a more informative prior (ii)

I If we initially assume a Beta(θ | α = 2, β = 2) prior, then the posterior expectation is

E[θ | y] = (α + ∑ y_i) / (α + ∑ y_i + β + n − ∑ y_i) = (α + ∑ y_i) / (α + β + n) = (2 + 12) / (2 + 2 + 50) = 7/27 ≈ 0.259.

I Unsurprisingly, since now our prior assumption is that the coin is unbiased,
12/50 < E[θ | y] < 1/2.

I If we look at Var [θ | y ], we will see that we are also somewhat more certain about the
posterior than when we assumed the uniform prior.
I (In this particular case) both the prior and posterior belong to the same probability
distribution family. In Bayesian estimation theory we refer to such prior and posterior
as conjugate distributions (with respect to this particular likelihood function).
I Notice that the results of Bayesian estimation are sensitive—to varying degree in each
specific case—to the choice of prior distribution.
Exploration and Exploitation
Bayesian bandits

Sequential Bayesian updates

I What would happen if, instead of observing all n coin tosses at once, we
I considered each coin toss in turn;
I obtained our posterior; and
I used that posterior as a prior for an update based on the information from the next coin toss?
I The equations (1) and (2) give the answer to this question.
I We start with our initial prior
Beta(θ | α, β),
then, substituting n = 1 into (2), we obtain

Beta(θ | α + y1 , β + 1 − y1 ).

I Using this posterior as a prior before the second coin toss, we obtain the next posterior
as
Beta(θ | α + y1 + y2 , β + 2 − y1 − y2 ).
I Proceeding along these lines, after all n coin tosses, we end up with

Beta(θ | α + ∑ yi , β + n − ∑ yi ).
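
A short Python sketch (with assumed prior parameters and toss outcomes) confirming that sequential updates reach the same posterior as the single batch update.

# Sketch: sequential Beta updates match the batch posterior Beta(alpha + sum(y), beta + n - sum(y)).
def sequential_posterior(y, alpha=2.0, beta=2.0):
    for yi in y:                          # use each posterior as the prior for the next toss
        alpha += yi
        beta += 1 - yi
    return alpha, beta

y = [0, 1, 1, 0, 1, 0, 0, 1]              # assumed toss outcomes
print(sequential_posterior(y))            # (6.0, 6.0)
print((2 + sum(y), 2 + len(y) - sum(y)))  # the batch update gives the same parameters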
Exploration and Exploitation
Bayesian bandits

Normal prior, likelihood, and posterior

I Suppose we have a measurement y ∼ N(θ, σ²), where the variance σ² is known.
I In other words, θ is the unknown parameter of interest and we are given that the
likelihood comes from a normal distribution with variance σ²:
p(y | θ) ∼ N(θ, σ²).
I If we choose a normal prior pdf
p(θ) ∼ N(µ_prior, σ²_prior),
then the posterior pdf is also normal:
p(θ | y) ∼ N(µ_post, σ²_post),
where
a = 1/σ²_prior,   b = 1/σ²,   µ_post = (a µ_prior + b y) / (a + b),   σ²_post = 1 / (a + b).
I Thus the normal distribution is its own conjugate prior (if the likelihood is normal too).
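
A minimal Python sketch of this normal-normal update; the numerical values are assumptions chosen purely for illustration.

# Sketch: normal-normal conjugate update for a single measurement y with known variance sigma^2.
def normal_posterior(y, sigma2, mu_prior, sigma2_prior):
    a = 1.0 / sigma2_prior            # prior precision
    b = 1.0 / sigma2                  # measurement precision
    mu_post = (a * mu_prior + b * y) / (a + b)
    sigma2_post = 1.0 / (a + b)
    return mu_post, sigma2_post

# Assumed numbers: vague prior N(0, 4), one observation y = 1.2 with noise variance 1.
print(normal_posterior(y=1.2, sigma2=1.0, mu_prior=0.0, sigma2_prior=4.0))  # (0.96, 0.8)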
Exploration and Exploitation
Bayesian bandits

Bayesian bandits

I Bayesian bandits [Sco10] exploit prior knowledge about the rewards.


I Consider a distribution p(Q | θ) over the action-value function with parameter θ, e.g.
independent Gaussians: θ = (µ_1, σ_1², ..., µ_k, σ_k²)^⊤ for a ∈ [1, k].
I Bayesian methods compute posterior distribution over θ,

p (θ | R1 , . . . , Rt ).
I Use the posterior to guide exploration:
I upper confidence bounds;
I probability matching;
I Thompson sampling.
I Better performance if prior knowledge is accurate.
Exploration and Exploitation
Bayesian bandits

Bayesian bandits

I Bayesian bandits model parameterized distributions over rewards, p (Rt | θ, a ).


I Compute posterior distribution over θ

p_t(θ | a) ∝ p(R_t | θ, a) p_{t−1}(θ | a).

I Allows us to inject rich prior knowledge p0 (θ | a ).


I Use posterior to guide exploration.
Exploration and Exploitation
Bayesian bandits

Bayesian bandits: Example

I Consider bandits with Bernoulli reward distribution (the so-called binary or Bernoulli
bandits [Ber72]): rewards are 0 or +1.
I For each action, the prior could be a uniform distribution on [0, 1].
I This means we think each mean reward in [0, 1] is equally likely.
I The posterior is a Beta distribution Beta(xa , ya ) with initial parameters xa = 1 and
ya = 1 for each action a.
I Updating the posterior:
I x_{A_t} ← x_{A_t} + 1 when R_t = 1;
I y_{A_t} ← y_{A_t} + 1 when R_t = 0.
Exploration and Exploitation
Bayesian bandits

Probability matching

I Probability matching selects action a according to the probability that a is the optimal
action:
π(a) = P[ Q(a) = max_{a′} Q(a′) | R_1, ..., R_{t−1} ].

I Probability matching is optimistic in the face of uncertainty, since uncertain actions


have higher probability of being maximum.
I It can be difficult to compute π (a ) analytically from the posterior.
Exploration and Exploitation
Bayesian bandits

Thompson sampling

I Thompson sampling4 is sample-based probability matching.


I Use Bayes’s theorem to compute the posterior distribution

pθ (Q | R1 , . . . , Rt −1 ).

I Sample an action-value function Q (a ) from the posterior.


I Select an action maximizing the sample, At = arg maxa ∈A Q (a ).
I For Bernoulli bandits, Thompson sampling achieves the Lai and Robbins lower bound
on regret [CL11]!

4 The technique dates back to [Tho33], hence the name.
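
A minimal Python sketch of Thompson sampling for Bernoulli bandits with Beta(1, 1) priors (the arms' success probabilities are assumed for illustration).

import random

# Sketch: Thompson sampling for Bernoulli bandits with Beta(1, 1) priors (success probabilities assumed).
def thompson_bernoulli(success_probs, steps=5000, seed=0):
    rng = random.Random(seed)
    k = len(success_probs)
    x = [1.0] * k   # Beta "success" parameters
    y = [1.0] * k   # Beta "failure" parameters
    pulls = [0] * k
    for _ in range(steps):
        samples = [rng.betavariate(x[a], y[a]) for a in range(k)]  # sample Q(a) from each posterior
        a = max(range(k), key=lambda i: samples[i])                # act greedily w.r.t. the sample
        reward = 1 if rng.random() < success_probs[a] else 0
        x[a] += reward
        y[a] += 1 - reward
        pulls[a] += 1
    return pulls

print(thompson_bernoulli([0.2, 0.5, 0.7]))  # pulls concentrate on the arm with success probability 0.7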


Exploration and Exploitation
Gradient bandit algorithms

Gradient bandit algorithms (i)


I So far we have considered methods that estimate action values and use those
estimates to select actions.
I This is often a good approach, but it is not the only one possible.
I We shall now consider learning a numerical preference for each action a, which we
denote Ht (a ).
I The larger the preference, the more often that action is taken, but the preference has
no interpretation in terms of reward.
I Only the relative preference of one action over another is important; if we add 1000 to
all the action preferences there is no effect on the action probabilities, which are
determined according to a soft-max distribution (i.e., Gibbs or Boltzmann
distribution) as follows:

P[A_t = a] := e^{H_t(a)} / ∑_{b=1}^{k} e^{H_t(b)} =: π_t(a),

where here we have also introduced a useful new notation, πt (a ), for the probability of
taking action a at time t.
I Initially all action preferences are the same (e.g. H1 (a ) = 0, for all a) so that all
actions have an equal probability of being selected.
Exploration and Exploitation
Gradient bandit algorithms

Gradient bandit algorithms (ii)

I There is a natural learning algorithm for this setting based on the idea of stochastic
gradient ascent.
I On each step, after selecting action At and receiving the reward Rt , the action
preferences are updated by:

H_{t+1}(A_t) := H_t(A_t) + α (R_t − R̄_t) (1 − π_t(A_t)),   and
H_{t+1}(a) := H_t(a) − α (R_t − R̄_t) π_t(a),   for all a ≠ A_t,

where α > 0 is a step-size parameter, and R̄_t ∈ ℝ is the average of all the rewards up
through and including time t, which can be computed incrementally.
I The R̄_t term serves as a baseline with which the reward is compared. If the reward is
higher than the baseline, then the probability of taking A_t in the future is increased,
and if the reward is below the baseline, then the probability is decreased.
I The nonselected actions move in the opposite direction.
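
A minimal Python sketch of the gradient bandit update with the running-average baseline (the Gaussian testbed and step size are assumptions for illustration).

import math
import random

# Sketch of the gradient bandit update with a running-average reward baseline (testbed assumed).
def gradient_bandit(true_means, alpha=0.1, steps=5000, seed=0):
    rng = random.Random(seed)
    k = len(true_means)
    H = [0.0] * k          # action preferences
    baseline = 0.0         # average reward R_bar_t
    for t in range(1, steps + 1):
        exp_h = [math.exp(h) for h in H]
        z = sum(exp_h)
        pi = [e / z for e in exp_h]                       # soft-max action probabilities
        a = rng.choices(range(k), weights=pi)[0]
        r = rng.gauss(true_means[a], 1.0)
        baseline += (r - baseline) / t                    # incremental average of all rewards so far
        for b in range(k):
            if b == a:
                H[b] += alpha * (r - baseline) * (1 - pi[b])
            else:
                H[b] -= alpha * (r - baseline) * pi[b]
    return H

print(gradient_bandit([0.0, 0.5, 1.0]))  # the best arm ends up with the largest preference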
Exploration and Exploitation
Contextual bandits

The associative setting

I So far we have considered only nonassociative tasks, that is, tasks in which there is
no need to associate different actions with different situations.
I In these tasks the learner either tries to find the best action when the task is stationary,
or tries to track the best action as it changes over time when the task is nonstationary.
I However, in a general reinforcement learning task there is more than one situation,
and the goal is to learn a policy: a mapping from situations to the actions that are best
in those situations. This is the associative setting.
Exploration and Exploitation
Contextual bandits

Contextual bandits

I Suppose there are several different k -armed bandit tasks, and that on each step you
confront one of these chosen at random.
I Thus, the bandit task changes randomly from step to step.
I This would appear to you as a single, nonstationary k -armed bandit task whose true
action values change randomly from step to step.
I You could try using one of the methods described above that can handle
nonstationarity, but unless the true value changes slowly, these methods will not work
very well.
I Now suppose that when a bandit task is selected for you, you are given some
distinctive clue about its identity (but not its action values).
I Maybe you are facing an actual slot machine that changes the colour of its display as it
changes its action values.
I Now you can learn a policy associating each task, signalled by the colour you see, with
the best action to take when facing that task—for instance, if red, select arm 1; if
green, select arm 2.
I With the right policy you can usually do much better than you could in the absence of
any information distinguishing one bandit task from another.
Exploration and Exploitation
Contextual bandits

Associative search

I This is an example of an associative search task, so called because it involves both


trial-and-error learning to search for the best actions, and association of these actions
with the situations in which they are best.
I Associative search tasks are now called contextual bandits [LLL10, BLL+ 11, EL19]
in the literature.
I Associative search tasks are intermediate between the k -armed bandit problem and
the full reinforcement learning problem.
I They are like the full reinforcement learning problem in that they involve learning a
policy, but like the k -armed bandit problem in that each action affects only the
immediate reward.
I If actions are allowed to affect the next situation as well as the reward, then we have
the full reinforcement learning problem.
Exploration and Exploitation
Bibliography

Peter Auer, Nicolò Cesa-Bianchi, and Paul Fischer.


Finite-time analysis of the multiarmed bandit problem.
Machine Learning, 47(2/3):235–256, 2002.
Kazuoki Azuma.
Weighted sums of certain dependent random variables.
Tohoku Mathematical Journal, 19(3):357–367, 1967.
Hamsa Bastani and Mohsen Bayati.
Online decision-making with high-dimensional covariates.
SSRN Electronic Journal, 2015.
Djallel Bouneffouf, Amel Bouzeghoub, and Alda Lopes Gançarski.
A contextual-bandit algorithm for mobile context-aware recommender system.
In Neural Information Processing, pages 324–331. Springer Berlin Heidelberg, 2012.
Sébastien Bubeck and Nicolò Cesa-Bianchi.
Regret analysis of stochastic and nonstochastic multi-armed bandit problems.
Foundations and Trends® in Machine Learning, 5(1):1–122, 2012.
Sergei Natanovich Bernstein.
On a modification of Chebyshev’s inequality and of the error formula of Laplace.
Uchenye Zapiski Nauch.-Issled. Kaf. Ukraine, Sect. Math., 1:38–48, 1924.
Donald A. Berry.
A Bernoulli two-armed bandit.
The Annals of Mathematical Statistics, 43(3):871–897, June 1972.
Exploration and Exploitation
Bibliography

Alina Beygelzimer, John Langford, Lihong Li, Lev Reyzin, and Robert Schapire.
Contextual bandit algorithms with supervised learning guarantees.
In Geoffrey Gordon, David Dunson, and Miroslav Dudı́k, editors, Proceedings of the
Fourteenth International Conference on Artificial Intelligence and Statistics, volume 15
of Proceedings of Machine Learning Research, pages 19–26, Fort Lauderdale, FL,
USA, 11–13 Apr 2011. JMLR Workshop and Conference Proceedings.
Thomas Bayes and Richard Price.
An essay towards solving a problem in the doctrine of chances.
Philosophical Transactions, 53:370–418, January 1763.
Djallel Bouneffouf and Irina Rish.
A survey on practical applications of multi-armed and contextual bandits.
arXiv, 2019.
Olivier Chapelle and Lihong Li.
An empirical evaluation of Thompson sampling.
In Advances in Neural Information Processing Systems 24 (NIPS 2011), 2011.
Audrey Durand, Charis Achilleos, Demetris Iacovides, Katerina Strati, Georgios D.
Mitsis, and Joelle Pineau.
Contextual bandits for adaptive treatment in a mouse model of de novo
carcinogenesis.
In Proceedings of the 3rd Machine Learning for Healthcare Conference, 2018.
Anirban DasGupta.
Asymptotic Theory of Statistics and Probability.
Exploration and Exploitation
Bibliography

Springer, 2008.
James A. Edwards and David S. Leslie.
Selecting multiple web adverts: A contextual multi-armed bandit with state uncertainty.

Journal of the Operational Research Society, 71(1):100–116, feb 2019.


Elaine Fox.
Rainy Brain, Sunny Brain: The New Science of Optimism and Pessimism.
Cornerstone Digital, 2012.
Herbert Gintis.
Game Theory Evolving: A Problem-Centered Introduction to Modeling Strategic
Interaction.
Princeton University Press, 2000.
Xiaoguang Huo and Feng Fu.
Risk-aware multi-armed bandit problem with application to portfolio selection.
Royal Society Open Science, 4(11):171377, nov 2017.
Wassily Hoeffding.
Probability inequalities for sums of bounded random variables.
Journal of the American Statistical Association, 58(301):13–30, mar 1963.
Solomon Kullback and Richard A. Leibler.
On information and sufficiency.
The Annals of Mathematical Statistics, 22(1):79–86, mar 1951.
Exploration and Exploitation
Bibliography

Andrey Nikolaevich Kolmogorov.


Grundbegriffe der Wahrscheinlichkeitsrechnung.
Ergebnisse der Mathematik und ihrer Grenzgebiete, 2(3):1–62, 1933.
Moto Kamiura and Kohei Sano.
Optimism in the face of uncertainty supported by a statistically-designed multi-armed
bandit algorithm.
Biosystems, 160:25–32, oct 2017.
Solomon Kullback.
Information Theory and Statistics.
John Wiley & Sons, 1959.
Pierre-Simon Laplace.
Mémoire sur la probabilité des causes par les évènemens.
Mémoires de Mathematique et de Physique, Presentés à l’Académie Royale des
Sciences, Par Divers Savans et Lus Dans ses Assemblées, 16:621–656, 1774.
Shoumei Li, Jungang Li, and Xiaohua Li.
Stochastic integral with respect to set-valued square integrable martingales.
Journal of Mathematical Analysis and Applications, 370:659–671, 2010.
T. L. Lai and Herbert Robbins.
Asymptotically efficient adaptive allocation rules.
Advances in Applied Mathematics, 6(1):4–22, March 1985.
Bert Mendelson.
Exploration and Exploitation
Bibliography

Introduction to Topology.
Dover Publications, Inc., 3 edition, 1990.
Kanishka Misra, Eric M. Schwartz, and Jacob Abernethy.
Dynamic online pricing with incomplete information using multiarmed bandit
experiments.
Marketing Science, 38(2):226–252, mar 2019.
Jonas Mueller, Vasilis Syrgkanis, and Matt Taddy.
Low-rank bandit methods for high-dimensional dynamic pricing.
In NeurIPS, 2019.
Thomas Peel, Sandrine Anthoine, and Liva Ralaivola.
Empirical Bernstein inequalities for U-statistics.
In Neural Information Processing Systems (NIPS), pages 1903–1911, Vancouver,
Canada, December 2010.
Richard S. Sutton and Andrew G. Barto.
Reinforcement Learning: An Introduction.
MIT Press, 2 edition, 2018.
James A. Shepperd, Patrick Carroll, Jodi Grace, and Meredith Terry.
Exploring the causes of comparative optimism.
Psychologica Belgica, 42(1–2):65–98, 2002.
Steven L. Scott.
A modern Bayesian look at the multi-armed bandit.
Applied Stochastic Models in Business and Industry, 26(6):639–658, nov 2010.
Exploration and Exploitation
Bibliography

Stephen Senn.
Statistical basis of public policy — present remembrance of priors past is not the same
as a true prior.
British Medical Journal, 1997.
Darryl Shen.
Order imbalance based strategy in high frequency trading.
Master’s thesis, Linacre College, University of Oxford, 2015.
David Silver.
Lectures on reinforcement learning.
url: https://www.davidsilver.uk/teaching/, 2015.
Stephen M. Stigler.
Laplace’s early work: chronology and citations.
Isis, 61:234–254, 1978.
Stephen M. Stigler.
Laplace’s 1774 memoir on inverse probability.
Statistical Science, 1(3):359–363, August 1986.
William R. Thompson.
On the likelihood that one unknown probability exceeds another in view of evidence of
two samples.
Biometrika, 25(3-4):285–294, dec 1933.
Hado van Hasselt.
Exploration and Exploitation
Bibliography

Lectures on reinforcement learning.


url: https://hadovanhasselt.com/2016/01/12/ucl-course/, 2016.
Qian Zhou, XiaoFang Zhang, Jin Xu, and Bin Liang.
Large-scale bandit approaches for recommender systems.
In Neural Information Processing, pages 811–821. Springer International Publishing,
2017.
