2 Exploration and Exploitation
Thalesians Ltd
Level39, One Canada Square, Canary Wharf, London E14 5AB
2023.01.17
Acknowledgements
I These notes are based on Chapter 2 of [SB18] by Richard S. Sutton and Andrew G.
Barto.
I They also borrow a lot from the lectures on reinforcement learning by David
Silver [Sil15] and those by his colleague at DeepMind, Hado van Hasselt [vH16].
Bandits
1 It isn’t a good idea to run a trading desk with overly secretive traders and black-box trading strategies.
I On Monday, the mouse presses the black lever and receives an electric shock.
I On Tuesday, the mouse presses the white lever and receives cheese.
I On Wednesday, the mouse presses the white lever and receives an electric shock.
I What should the mouse do on Thursday?
Further examples
A one-armed bandit
2 A game against nature involves a single agent choosing under conditions of uncertainty, where none of the
uncertainty is strategic—that is, either the uncertainty is due to natural acts (crop loss, death, and the like) or, if other
people are involved, the others do not behave strategically toward the agent being modeled [Gin00].
Action-value methods
I In a k-armed bandit problem, each of the k actions, say a, has an expected or mean
reward given that that action is selected—the value of that action:
q(a) := E[R_t | A_t = a].
Greedy actions
I At any time step t there is at least one action a* whose estimated value, Q_t(a*), is
greatest.
I We refer to these as greedy actions.
I Choosing these actions amounts to exploitation.
I Choosing one of the non-greedy actions instead amounts to exploration.
I In our example, the mouse has employed the greedy algorithm: it has sampled all the
actions and is exploiting the greedy action.
I If there is more than one greedy action, then a selection is made among them in some
arbitrary way, perhaps randomly.
I We see that the greedy algorithm can lock the agent into a suboptimal action.
I Can we do better?
I Let Q_n denote the estimate of an action's value after it has been selected n − 1 times, i.e. the sample average of the rewards received so far:
Q_n := (R_1 + R_2 + · · · + R_{n−1}) / (n − 1).
I Then
Q_{n+1} = Q_n + (1/n) [R_n − Q_n].
I This update rule is of the general form
NewEstimate ← OldEstimate + StepSize [Target − OldEstimate].
I Pseudocode for a complete bandit algorithm using incrementally computed sample averages and ε-greedy action selection (a runnable Python sketch follows below):
I Initialize, for a = 1 to k:
I Q(a) ← 0
I N(a) ← 0
I Loop forever:
I A ← argmax_a Q(a), with probability 1 − ε (breaking ties randomly); a random action, with probability ε
I R ← bandit(A)
I N(A) ← N(A) + 1
I Q(A) ← Q(A) + (1/N(A)) [R − Q(A)]
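The following is a minimal, self-contained Python sketch of the ε-greedy sample-average algorithm above. It is illustrative only: the function name run_epsilon_greedy, the Gaussian reward model (mimicking the 10-armed testbed), and the parameter defaults are assumptions, not part of the original notes.

# A minimal sketch of the epsilon-greedy sample-average bandit algorithm above.
import random

def run_epsilon_greedy(q_true, epsilon=0.1, steps=1000, seed=0):
    rng = random.Random(seed)
    k = len(q_true)
    Q = [0.0] * k          # action-value estimates
    N = [0] * k            # action counts
    total_reward = 0.0
    for _ in range(steps):
        if rng.random() < epsilon:
            A = rng.randrange(k)                                        # explore
        else:
            best = max(Q)
            A = rng.choice([a for a in range(k) if Q[a] == best])       # exploit, ties broken randomly
        R = rng.gauss(q_true[A], 1.0)                                   # R <- bandit(A)
        N[A] += 1
        Q[A] += (R - Q[A]) / N[A]                                       # incremental sample average
        total_reward += R
    return Q, total_reward

if __name__ == "__main__":
    q_true = [random.gauss(0.0, 1.0) for _ in range(10)]
    Q, total = run_epsilon_greedy(q_true)
    print("estimates:", [round(q, 2) for q in Q], "total reward:", round(total, 1))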
I All the methods we have discussed so far are dependent to some extent on the initial
action-value estimates, Q1 (a ).
I In the language of statistics, these methods are biased by their initial estimates.
I For the sample-average methods, the bias disappears once all actions have been
selected at least once, but for methods with constant α, the bias is permanent, though
decreasing over time.
I In practice, this kind of bias is usually not a problem and can sometimes be very
helpful.
I The downside is that the initial estimates become, in effect, a set of parameters that
must be picked by the user, if only to set them all to zero.
I The upside is that they provide an easy way to supply some prior knowledge about
what level of rewards can be expected.
Optimistic initial values
Encouraging exploration
I Suppose that instead of setting the initial action values to zero, as we did in the
10-armed testbed, we set them all to +5.
I Recall that the q (a ) in this problem are selected from a normal distribution with mean
0 and variance 1.
I An initial estimate of +5 is thus wildly optimistic.
I But this optimism encourages action-value methods to explore.
I Whichever actions are initially selected, the reward is less than the starting estimates;
the learner switches to other actions, being “disappointed” with the rewards it is
receiving.
I The result is that all actions are tried several times before the value estimates
converge. The system does a fair amount of exploration even if greedy actions are
selected all the time.
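As a quick illustration of this effect, the following sketch runs purely greedy selection with Q_1(a) = +5 and a constant step size, and counts how often each arm is tried. All numbers here (ten arms, α = 0.1, 200 steps) are assumptions chosen for the sketch, not values from the notes.

# A sketch of greedy selection with optimistic initial values Q1(a) = +5.
import random

rng = random.Random(1)
q_true = [rng.gauss(0.0, 1.0) for _ in range(10)]   # true values, as in the 10-armed testbed
Q = [5.0] * 10                                      # wildly optimistic initial estimates
alpha = 0.1
counts = [0] * 10
for _ in range(200):
    A = max(range(10), key=lambda a: Q[a])          # purely greedy selection
    R = rng.gauss(q_true[A], 1.0)
    Q[A] += alpha * (R - Q[A])                      # "disappointment" pulls Q[A] down
    counts[A] += 1
print("times each arm was tried:", counts)          # all arms get tried despite greedy selection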
From [KS17]:
The principle of “optimism in the face of uncertainty” is an empirically verified
idea in sequential decision-making problems [BCB12], although the accurate origin
of it is uncertain. Biologically, the optimism bias is known as a cognitive bias
in higher brain functions [Fox12, SCGT02]. On the other hand, the contexts of
machine learning deal with optimism as a strategy to explore better solutions. The
difference of the implications might be derived from the difference of the viewpoint
on rewards.
How well can we do?
I Thus the agent’s goal is to trade off exploration and exploitation by minimizing the total
regret (which is equivalent to maximizing the cumulative reward).
I The agent cannot observe, or even sample, the real regret directly.
I However, it is useful for analysing different learning algorithms.
Regret
I The total regret L_t is the expected total opportunity loss after t steps. Writing v_* := max_{a∈A} q(a) for the optimal value, N_t(a) for the number of times action a has been selected in the first t steps, and Δ_a := v_* − q(a) for the gap of action a,
L_t = E[ ∑_{τ=1}^t (v_* − q(A_τ)) ]
    = ∑_{a∈A} E[N_t(a)] (v_* − q(a))
    = ∑_{a∈A} E[N_t(a)] Δ_a.
I The greedy algorithm selects the action with the highest value,
A_t = argmax_{a∈A} Q_t(a).
I The greedy algorithm can lock onto a suboptimal action forever, so it has linear expected total regret.
I Because the ε-greedy algorithm keeps selecting a random action with probability ε at every step, it also has linear expected total regret.
I Optimistic greedy and optimistic ε-greedy algorithms also have linear expected total
regret.
I However, a decaying schedule for ε can do better: for a constant c > 0 and d := min_{a: Δ_a>0} Δ_a, set
ε_t = min{ 1, c|A| / (d² t) } ∝ 1/t.
Decaying ε_t-greedy with this schedule achieves logarithmic asymptotic expected total regret.
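A rough empirical comparison of fixed versus decaying ε-greedy is sketched below. The Bernoulli arms, horizon, and the constants c and d are assumptions chosen for illustration; the sketch is not a result from the notes.

# A sketch comparing total regret of fixed and decaying epsilon-greedy.
import random

def total_regret(epsilon_fn, q=(0.3, 0.5, 0.7), steps=5000, seed=0):
    rng = random.Random(seed)
    k = len(q)
    Q, N = [0.0] * k, [0] * k
    v_star = max(q)
    regret = 0.0
    for t in range(1, steps + 1):
        eps = epsilon_fn(t)
        A = rng.randrange(k) if rng.random() < eps else max(range(k), key=lambda a: Q[a])
        R = 1.0 if rng.random() < q[A] else 0.0
        N[A] += 1
        Q[A] += (R - Q[A]) / N[A]
        regret += v_star - q[A]          # expected regret of the chosen arm
    return regret

c, d, k = 1.0, 0.2, 3                    # d = smallest positive gap (0.7 - 0.5)
print("fixed eps=0.1 :", round(total_regret(lambda t: 0.1), 1))
print("decaying eps  :", round(total_regret(lambda t: min(1.0, c * k / (d * d * t))), 1))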
Kullback–Leibler divergence
I The Kullback–Leibler divergence [KL51, Kul59], also called relative entropy, is a
measure of how one probability distribution is different from a second, reference
probability distribution.
I For discrete probability distributions P and Q defined on the same space, X , the
relative entropy from Q to P is defined to be
KL(P ‖ Q) = ∑_{x∈X} P(x) ln( P(x) / Q(x) ) = − ∑_{x∈X} P(x) ln( Q(x) / P(x) ).
An example
I Consider a numerical example from [Kul59] (see the numerical check below):
x                    0       1       2
Distribution P(x)    9/25    12/25   4/25
Distribution Q(x)    1/3     1/3     1/3
I In this example,
KL(P ‖ Q) = ∑_{x∈X} P(x) ln( P(x) / Q(x) )
          = (9/25) ln( (9/25)/(1/3) ) + (12/25) ln( (12/25)/(1/3) ) + (4/25) ln( (4/25)/(1/3) )
          = (1/25) (32 ln 2 + 55 ln 3 − 50 ln 5) ≈ 0.0852996,
I whereas
KL(Q ‖ P) = ∑_{x∈X} Q(x) ln( Q(x) / P(x) )
          = (1/3) ln( (1/3)/(9/25) ) + (1/3) ln( (1/3)/(12/25) ) + (1/3) ln( (1/3)/(4/25) )
          = (1/3) (−4 ln 2 − 6 ln 3 + 6 ln 5) ≈ 0.097455.
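A short numerical check of the two divergences above (an illustrative sketch; the helper name kl is ours):

# Numerical check of the two KL divergences computed above.
from math import log

P = [9/25, 12/25, 4/25]
Q = [1/3, 1/3, 1/3]

def kl(p, q):
    # Relative entropy KL(p || q) = sum_x p(x) ln(p(x)/q(x))
    return sum(pi * log(pi / qi) for pi, qi in zip(p, q))

print(kl(P, Q))   # ~0.0852996
print(kl(Q, P))   # ~0.097455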
I Note that the Kullback–Leibler divergence is not symmetric: in general, KL(P ‖ Q) ≠ KL(Q ‖ P).
A bound on regret
I The performance of any method is determined by the similarity between the optimal arm and
the other arms.
I Hard problems have arms with similar distributions but different means.
I This is described formally by the gap Δ_a and the Kullback–Leibler divergence
KL(p(r | a) ‖ p(r | a_*)).
I T. L. Lai and Herbert Robbins established the following lower bound on the total regret
in [LR85]:
lim_{t→∞} L_t ≥ ln t ∑_{a: Δ_a>0} Δ_a / KL(p(r | a) ‖ p(r | a_*)).
I Notice that this grows logarithmically with time (the ln t factor), which is a lot better
than linearly.
Upper-Confidence-Bound Action Selection
I The more uncertain we are about an action-value, the more important it is to explore
that action.
I It could turn out to be the best action.
Candidate actions
I Exploration is needed because there is always uncertainty about the accuracy of the
action-value estimates.
I The greedy actions are those that look best at present, but some of the other actions
may actually be better.
I ε-greedy action selection forces the non-greedy actions to be tried, but
indiscriminately, with no preference for those that are nearly greedy or particularly
uncertain.
I It would be better to select among the non-greedy actions according to their potential
for being optimal, taking into account both how close their estimates are to being
maximal and the uncertainties in those estimates.
I Upper-confidence-bound (UCB) action selection estimates an upper confidence U_t(a) for each action value and selects the action maximising the upper confidence bound:
A_t = argmax_{a∈A} [ Q_t(a) + U_t(a) ].
I Now let
p := e^{−2 N_t(a) U_t(a)²}
and solve for U_t(a):
U_t(a) = √( −ln p / (2 N_t(a)) ).
I Theorem [ACBF02]: The UCB algorithm (with c = √2) achieves logarithmic expected
total regret for all t:
L_t ≤ 8 ∑_{a: Δ_a>0} ln t / Δ_a + O( ∑_a Δ_a ).
I UCB often performs well, but is more difficult than ε-greedy methods to extend beyond
bandits to more general reinforcement learning settings.
I One difficulty is in dealing with nonstationary problems.
I Another difficulty is dealing with large state spaces.
I In these more advanced settings the idea of UCB action selection is often impractical.
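In the simple bandit setting, however, UCB is straightforward to implement. A minimal sketch follows; the Bernoulli arms, c = √2, and the initialisation (pulling each arm once) are illustrative assumptions.

# A minimal sketch of UCB action selection on a Bernoulli bandit.
import math
import random

def ucb(q=(0.3, 0.5, 0.7), steps=5000, c=math.sqrt(2.0), seed=0):
    rng = random.Random(seed)
    k = len(q)
    Q = [0.0] * k
    N = [0] * k
    def pull(a):
        return 1.0 if rng.random() < q[a] else 0.0
    for a in range(k):                 # try each arm once so N[a] > 0
        N[a] = 1
        Q[a] = pull(a)
    for t in range(k + 1, steps + 1):
        # upper confidence bound: Q[a] + c * sqrt(ln t / N[a])
        A = max(range(k), key=lambda a: Q[a] + c * math.sqrt(math.log(t) / N[a]))
        R = pull(A)
        N[A] += 1
        Q[A] += (R - Q[A]) / N[A]
    return N

print("pull counts per arm:", ucb())   # most pulls should go to the best arm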
Other inequalities
Nonstationarity
I The averaging methods discussed so far are appropriate for stationary bandit
problems, that is, for bandit problems in which the reward probabilities do not change
over time.
I In stationary problems, the distribution of Rt given At is identical and independent
across time.
I We often encounter reinforcement learning problems that are effectively
nonstationary.
I In such cases it makes sense to give more weight to recent rewards than to long-past
rewards.
I One of the most popular ways of doing this is to use a constant step-size parameter.
I For example, the incremental update rule for updating an average Q_n of the n − 1 past
rewards is modified to be
Q_{n+1} := Q_n + α [R_n − Q_n],
where the step-size parameter α ∈ (0, 1] is constant.
Weighted average
I This results in Q_{n+1} being a weighted average of past rewards and the initial estimate
Q_1:
Q_{n+1} = (1 − α)^n Q_1 + ∑_{i=1}^n α (1 − α)^{n−i} R_i.
I We call this a weighted average because the sum of the weights is
(1 − α)^n + ∑_{i=1}^n α (1 − α)^{n−i} = 1.
I Note that the weight α (1 − α)^{n−i} given to the reward R_i depends on how many
rewards ago, n − i, it was observed.
I The quantity 1 − α is less than 1, and thus the weight given to R_i decreases as the
number of intervening rewards increases.
I In fact, the weight decays exponentially according to the exponent on 1 − α.
I Accordingly, this is sometimes called an exponential recency-weighted average.
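A small numerical check (an illustrative sketch with arbitrary rewards and α) that the constant step-size recursion and the exponential recency-weighted average above agree:

# Check that Q_{n+1} = Q_n + alpha*(R_n - Q_n) equals the closed form
# (1 - alpha)^n * Q_1 + sum_i alpha*(1-alpha)^(n-i) * R_i.
import random

random.seed(0)
alpha = 0.1
Q1 = 0.0
rewards = [random.gauss(1.0, 1.0) for _ in range(20)]   # R_1, ..., R_n

# recursive form
Q = Q1
for R in rewards:
    Q += alpha * (R - Q)

# closed (weighted-average) form
n = len(rewards)
Q_closed = (1 - alpha) ** n * Q1 + sum(
    alpha * (1 - alpha) ** (n - i) * R for i, R in enumerate(rewards, start=1)
)

print(Q, Q_closed)   # the two values agree (up to floating-point error)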
Tracking a nonstationary problem
I A classical result from stochastic approximation theory gives the conditions on the step sizes α_n(a) required to assure convergence with probability 1:
∑_{n=1}^∞ α_n(a) = ∞ and ∑_{n=1}^∞ α_n²(a) < ∞.
I The first condition is required to guarantee that the steps are large enough to
eventually overcome any initial conditions or random fluctuations.
I The second condition guarantees that eventually the steps become small enough to
assure convergence.
I Note that both convergence conditions are met for the sample-average case,
α_n(a) = 1/n, but not for the case of a constant step-size parameter, α_n(a) = α.
I In the latter case, the second condition is not met, indicating that the estimates never
completely converge but continue to vary in response to the most recently received
rewards.
I This is actually desirable in a nonstationary environment, and problems that are
effectively nonstationary are the most common in reinforcement learning.
I In addition, sequences of step-size parameters that meet the two conditions often
converge very slowly or need considerable tuning in order to obtain a satisfactory
convergence rate.
I Although sequences of step-size parameters that meet these convergence conditions
are often used in theoretical work, they are seldom used in applications and empirical
research.
Bayesian bandits
Interpretations of probability
I Classical:
I A random experiment E is performed.
I The set Ω of possible outcomes of E is finite and all ω ∈ Ω are equally likely.
I The probability of the event A ⊆ Ω is given by
P[A] = |A| / |Ω|.
I Frequentist:
I The superexperiment E ∞ consists in an infinite number of independent performances of a
random experiment E .
I Let N (A , n ) be the number of occurrences of the event A in the first n performances of E
within E ∞ .
I The probability of A is given by
P[A] = lim_{n→∞} N(A, n) / n.
I Bayesian:
I The probability of an event is the degree of belief that that event will occur.
I Axiomatic:
The theory of probability as a mathematical discipline can and should be devel-
oped from axioms in exactly the same way as Geometry and Algebra. [Kol33]
Bayes’s Theorem
I For events H (the hypothesis) and E (the evidence) with P[E] > 0, Bayes’s Theorem states that
P[H | E] = P[E | H] P[H] / P[E].
I The proof follows immediately from the definition of conditional probability:
P[H | E] = P[H ∩ E] / P[E] = P[E ∩ H] / P[E] = P[E | H] P[H] / P[E].
I It is useful to reformulate it for the case when there are multiple alternative
hypotheses, H_1, H_2, . . . , H_s, s ∈ N*. Then, for 1 ≤ i ≤ s,
P[H_i | E] = P[E | H_i] P[H_i] / P[E] = P[E | H_i] P[H_i] / ∑_{j=1}^s P[E | H_j] P[H_j],
where the second equality follows from the Law of Total Probability.
I and, when the hypotheses θ_1, . . . , θ_n are a priori equally likely, the priors cancel, giving
P[θ_i | F] = P[F | θ_i] / ∑_{j=1}^n P[F | θ_j].
I P[H | E] = P[E | H] P[H] / P[E]?
I For a frequentist, probability is a long-term relative frequency of outcomes in the
hyperexperiment E^∞.
I Let N(A, n) be the number of occurrences of the event A in the first n performances of
E within E^∞. Then P[A] := lim_{n→∞} N(A, n) / n.
I In particular,
P[H] := lim_{n→∞} N(H, n) / n,   P[E] := lim_{n→∞} N(E, n) / n.
I Therefore,
P[H | E] = P[H ∩ E] / P[E] = lim_{n→∞} N(H ∩ E, n) / N(E, n),
P[E | H] = P[H ∩ E] / P[H] = lim_{n→∞} N(H ∩ E, n) / N(H, n).
I P[H | E] = P[E | H] P[H] / P[E]?
I For a Bayesian, probability is a degree of belief.
I Before (prior to) observing any new evidence E—the degree of belief in a certain
hypothesis H.
I After (posterior to)—the degree of belief in H after taking into account that piece of
evidence E.
I Thus,
I P [H ] is the prior, the initial degree of belief in H.
I P [H | E ] is the posterior, the degree of belief in H after accounting for E.
I P[E | H] / P[E] is the support that the evidence E provides for the hypothesis H.
I P [E | H ] is the likelihood, the compatibility of the evidence with the hypothesis.
I P [E ] is the marginal likelihood of the evidence, irrespective of the hypothesis.
I Using this Bayesian terminology, Bayes’s Theorem can be stated as: posterior = support × prior.
Question
I While watching a game of Champions League football in a bar, you observe someone
who is clearly supporting Manchester United in the game.
I What is the probability that they were actually born within 25 miles of Manchester?
I Assume that:
I the probability that a randomly selected person in a typical local bar environment is born
within 25 miles of Manchester is 1/20;
I the probability that a person born within 25 miles of Manchester supports United is 7/10;
I the probability that a person not born within 25 miles of Manchester supports United is 1/10.
Exploration and Exploitation
Bayesian bandits
Solution
I Define
I B—the event that the person is born within 25 miles of Manchester;
I U—the event that the person supports United.
I We want P [B | U ].
P[B | U] = P[U | B] P[B] / P[U]                                          (Bayes’s Theorem)
         = P[U | B] P[B] / ( P[U | B] P[B] + P[U | B^c] P[B^c] )         (Law of Total Probability)
         = (7/10 · 1/20) / (7/10 · 1/20 + 1/10 · 19/20)
         = 7/26 ≈ 0.269.
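The same calculation can be checked with exact rational arithmetic (an illustrative sketch):

# Exact check of the Manchester United example.
from fractions import Fraction

p_B = Fraction(1, 20)          # born within 25 miles of Manchester
p_U_given_B = Fraction(7, 10)  # supports United given born nearby
p_U_given_notB = Fraction(1, 10)

p_U = p_U_given_B * p_B + p_U_given_notB * (1 - p_B)   # Law of Total Probability
p_B_given_U = p_U_given_B * p_B / p_U                  # Bayes's Theorem

print(p_B_given_U, float(p_B_given_U))                 # 7/26 ~ 0.269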
An example (i)
I Consider a coin toss encoded as y ∈ {0, 1}, where θ := P[y = 1] is unknown. The Bernoulli likelihood is
p(y | θ) = θ^y (1 − θ)^{1−y}.
An example (ii)
I Suppose we observe n independent such tosses, collected in the vector
y = (y_1, y_2, . . . , y_n)^⊤ ∈ {0, 1}^n.
An example (iii)
I Suppose the observed data are the following n = 50 outcomes (so ∑ y_i = 12):
0 0 1 0 0 1 0 0 0 0 0 1 0 1 1 1 0 0 0 0 1 0 1 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 1 0 1 0 0
I Take a uniform prior on θ, p(θ) = 1/(b − a) on [a, b] = [0, 1], so that p(θ) = 1 there. Then
p(θ | y) ∝ p(y | θ) p(θ) = θ^{∑ y_i} (1 − θ)^{n−∑ y_i} · 1.
I From the shape of the resulting pdf, we recognize it as the pdf of the Beta distribution³
Beta(θ | ∑ y_i, n − ∑ y_i).
³ The function’s argument is now θ, not y_i, so it is not the pdf of a Bernoulli distribution.
I The posterior mean is
E[θ | y] = ∑ y_i / (∑ y_i + (n − ∑ y_i)) = ∑ y_i / n = 12/50 = 0.24,
I and the posterior variance is
Var[θ | y] = (∑ y_i)(n − ∑ y_i) / [ (∑ y_i + n − ∑ y_i)² (∑ y_i + n − ∑ y_i + 1) ]
           = (n ∑ y_i − (∑ y_i)²) / (n² (n + 1)) = (50 · 12 − 12²) / (50² · 51) = 456/127500 ≈ 0.00357647058.
I The standard deviation is, in units of probability, √(456/127500) ≈ 0.05980360012.
I Notice that the mean of the posterior, 0.24, matches the frequentist maximum
likelihood estimate of θ , θ̂ML , and our intuition.
I Now suppose instead that the prior is a Beta distribution, Beta(θ | α, β):
p(θ) = (1 / B(α, β)) θ^{α−1} (1 − θ)^{β−1}, for all θ ∈ [0, 1].
I The posterior becomes
p(θ | y) ∝ p(y | θ) p(θ) = θ^{∑ y_i} (1 − θ)^{n−∑ y_i} · (1 / B(α, β)) θ^{α−1} (1 − θ)^{β−1} ∝ θ^{(α+∑ y_i)−1} (1 − θ)^{(β+n−∑ y_i)−1},
I which is the pdf of
Beta(θ | α + ∑ y_i, β + n − ∑ y_i).
(2)
I Taking α = β = 2, the posterior mean is
E[θ | y] = (α + ∑ y_i) / (α + ∑ y_i + β + n − ∑ y_i) = (α + ∑ y_i) / (α + β + n) = (2 + 12) / (2 + 2 + 50) = 14/54 = 7/27 ≈ 0.259.
I Unsurprisingly, since now our prior assumption is that the coin is unbiased,
12/50 < E[θ | y] < 1/2.
I If we look at Var [θ | y ], we will see that we are also somewhat more certain about the
posterior than when we assumed the uniform prior.
I (In this particular case) both the prior and posterior belong to the same probability
distribution family. In Bayesian estimation theory we refer to such prior and posterior
as conjugate distributions (with respect to this particular likelihood function).
I Notice that the results of Bayesian estimation are sensitive—to varying degree in each
specific case—to the choice of prior distribution.
I What would happen if, instead of observing all n coin tosses at once, we
I considered each coin toss in turn;
I obtained our posterior; and
I used that posterior as a prior for an update based on the information from the next coin toss? (A numerical check of this sequential updating follows below.)
I The equations (1) and (2) give the answer to this question.
I We start with our initial prior
Beta(θ | α, β),
then, substituting n = 1 into (2), we obtain
Beta(θ | α + y1 , β + 1 − y1 ).
I Using this posterior as a prior before the second coin toss, we obtain the next posterior
as
Beta(θ | α + y1 + y2 , β + 2 − y1 − y2 ).
I Proceeding along these lines, after all n coin tosses, we end up with
Beta(θ | α + ∑ yi , β + n − ∑ yi ).
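A numerical check of this sequential updating (an illustrative sketch; the ten-toss data here are made up and are not the 50-toss sample above):

# Sequential Beta updating gives the same posterior as a single batch update.
data = [0, 0, 1, 0, 1, 1, 0, 0, 1, 0]   # illustrative tosses
alpha, beta = 2, 2                      # prior Beta(alpha, beta)

# sequential updating: the posterior after each toss becomes the next prior
a, b = alpha, beta
for y in data:
    a, b = a + y, b + 1 - y

# batch updating: equation (2)
n, s = len(data), sum(data)
a_batch, b_batch = alpha + s, beta + n - s

print((a, b), (a_batch, b_batch))       # identical parameters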
I In other words, θ is the unknown parameter of interest and we are given that the
likelihood comes from a normal distribution with variance σ²:
p(y | θ) ∼ N(θ, σ²).
I With a normal prior, p(θ) ∼ N(μ_prior, σ²_prior), the posterior is again normal, p(θ | y) ∼ N(μ_post, σ²_post), where
a = 1/σ²_prior,   b = 1/σ²,   μ_post = (a μ_prior + b y) / (a + b),   σ²_post = 1 / (a + b).
I Thus the normal distribution is its own conjugate prior (if the likelihood is normal too).
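A minimal sketch of this conjugate update (the helper name normal_posterior and the numbers in the example call are assumptions):

# Normal-normal conjugate update for a single observation y.
def normal_posterior(mu_prior, s2_prior, y, s2):
    a = 1.0 / s2_prior           # prior precision
    b = 1.0 / s2                 # likelihood precision
    mu_post = (a * mu_prior + b * y) / (a + b)
    s2_post = 1.0 / (a + b)
    return mu_post, s2_post

print(normal_posterior(mu_prior=0.0, s2_prior=1.0, y=2.0, s2=1.0))   # (1.0, 0.5)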
I Bayesian bandits exploit prior knowledge of the reward distribution: they maintain a prior over its unknown parameters θ and compute the posterior given the rewards observed so far,
p(θ | R_1, . . . , R_t).
I Use the posterior to guide exploration:
I upper confidence bounds;
I probability matching;
I Thompson sampling.
I Better performance if prior knowledge is accurate.
I After each reward, the posterior for the selected action a is updated recursively via Bayes’s rule:
p_t(θ | a) ∝ p(R_t | θ, a) p_{t−1}(θ | a).
I Consider bandits with Bernoulli reward distribution (the so-called binary or Bernoulli
bandits [Ber72]): rewards are 0 or +1.
I For each action, the prior could be a uniform distribution on [0, 1].
I This means we think each mean reward in [0, 1] is equally likely.
I The posterior is a Beta distribution Beta(xa , ya ) with initial parameters xa = 1 and
ya = 1 for each action a.
I Updating the posterior:
I x_{A_t} ← x_{A_t} + 1 when R_t = 0;
I y_{A_t} ← y_{A_t} + 1 when R_t = 1.
Probability matching
Thompson sampling
I Probability matching selects each action with the probability that it is the optimal action under the posterior. Thompson sampling implements probability matching by drawing a sample of the action values from the posterior,
p_θ(Q | R_1, . . . , R_{t−1}),
and selecting the action whose sampled value is highest. (A sketch for Bernoulli bandits follows below.)
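A minimal sketch of Thompson sampling for Bernoulli bandits (the true success probabilities, horizon, and seed are illustrative assumptions; the Beta posterior is parameterised here by explicit success and failure counts):

# Thompson sampling for Bernoulli bandits with Beta posteriors.
import random

def thompson(q=(0.3, 0.5, 0.7), steps=5000, seed=0):
    rng = random.Random(seed)
    k = len(q)
    successes = [0] * k          # number of rewards equal to 1 per arm
    failures = [0] * k           # number of rewards equal to 0 per arm
    pulls = [0] * k
    for _ in range(steps):
        # sample a plausible mean reward for each arm from its Beta posterior
        samples = [rng.betavariate(1 + successes[a], 1 + failures[a]) for a in range(k)]
        A = max(range(k), key=lambda a: samples[a])
        if rng.random() < q[A]:
            successes[A] += 1
        else:
            failures[A] += 1
        pulls[A] += 1
    return pulls

print("pull counts per arm:", thompson())   # most pulls concentrate on the best arm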
Gradient bandit algorithms
I So far we have estimated action values. An alternative is to learn a numerical preference H_t(a) for each action a and to select actions according to a soft-max distribution over these preferences:
P[A_t = a] := e^{H_t(a)} / ∑_{b=1}^k e^{H_t(b)} =: π_t(a),
where here we have also introduced a useful new notation, π_t(a), for the probability of
taking action a at time t.
I Initially all action preferences are the same (e.g. H_1(a) = 0, for all a) so that all
actions have an equal probability of being selected.
I There is a natural learning algorithm for this setting based on the idea of stochastic
gradient ascent.
I On each step, after selecting action A_t and receiving the reward R_t, the action
preferences are updated by
H_{t+1}(A_t) := H_t(A_t) + α (R_t − R̄_t)(1 − π_t(A_t)),   and
H_{t+1}(a) := H_t(a) − α (R_t − R̄_t) π_t(a)   for all a ≠ A_t,
where α > 0 is a step-size parameter, and R̄_t ∈ R is the average of all the rewards up
through and including time t, which can be computed incrementally.
I The R̄_t term serves as a baseline with which the reward is compared. If the reward is
higher than the baseline, then the probability of taking A_t in the future is increased,
and if the reward is below the baseline, then the probability is decreased.
I The nonselected actions move in the opposite direction. (A sketch of this update appears below.)
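A minimal sketch of the gradient bandit update (the Gaussian rewards, the step size, and the true values are illustrative assumptions):

# Gradient bandit algorithm with a reward baseline.
import math
import random

def gradient_bandit(q_true=(0.2, 0.8, 0.5), alpha=0.1, steps=2000, seed=0):
    rng = random.Random(seed)
    k = len(q_true)
    H = [0.0] * k                      # action preferences
    baseline = 0.0                     # incremental average of all rewards
    for t in range(1, steps + 1):
        # soft-max action probabilities pi_t(a)
        z = [math.exp(h) for h in H]
        total = sum(z)
        pi = [x / total for x in z]
        A = rng.choices(range(k), weights=pi)[0]
        R = rng.gauss(q_true[A], 1.0)
        baseline += (R - baseline) / t
        for a in range(k):
            if a == A:
                H[a] += alpha * (R - baseline) * (1 - pi[a])
            else:
                H[a] -= alpha * (R - baseline) * pi[a]
    return H

print("learned preferences:", [round(h, 2) for h in gradient_bandit()])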
Contextual bandits
I So far we have considered only nonassociative tasks, that is, tasks in which there is
no need to associate different actions with different situations.
I In these tasks the learner either tries to find the best action when the task is stationary,
or tries to track the best action as it changes over time when the task is nonstationary.
I However, in a general reinforcement learning task there is more than one situation,
and the goal is to learn a policy: a mapping from situations to the actions that are best
in those situations. This is the associative setting.
I Suppose there are several different k -armed bandit tasks, and that on each step you
confront one of these chosen at random.
I Thus, the bandit task changes randomly from step to step.
I This would appear to you as a single, nonstationary k -armed bandit task whose true
action values change randomly from step to step.
I You could try using one of the methods described above that can handle
nonstationarity, but unless the true value changes slowly, these methods will not work
very well.
I Now suppose that when a bandit task is selected for you, you are given some
distinctive clue about its identity (but not its action values).
I Maybe you are facing an actual slot machine that changes the colour of its display as it
changes its action values.
I Now you can learn a policy associating each task, signalled by the colour you see, with
the best action to take when facing that task—for instance, if red, select arm 1; if
green, select arm 2.
I With the right policy you can usually do much better than you could in the absence of
any information distinguishing one bandit task from another (a minimal sketch of such a per-context learner follows below).
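A minimal sketch of such a per-context learner (everything here, the two colour contexts, the reward probabilities, and ε, is an illustrative assumption): it keeps separate sample-average estimates for each context and acts ε-greedily within it.

# A contextual bandit sketch: separate action-value estimates per context.
import random

contexts = ["red", "green"]
q_true = {"red": (0.8, 0.2), "green": (0.1, 0.9)}   # hypothetical success probabilities
rng = random.Random(0)
epsilon = 0.1
Q = {c: [0.0, 0.0] for c in contexts}   # per-context estimates
N = {c: [0, 0] for c in contexts}       # per-context counts

for _ in range(5000):
    c = rng.choice(contexts)            # the task (context) is revealed
    if rng.random() < epsilon:
        a = rng.randrange(2)
    else:
        a = max(range(2), key=lambda i: Q[c][i])
    r = 1.0 if rng.random() < q_true[c][a] else 0.0
    N[c][a] += 1
    Q[c][a] += (r - Q[c][a]) / N[c][a]

# the learned policy: which arm to select for each colour
print({c: max(range(2), key=lambda i: Q[c][i]) for c in contexts})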
Associative search
Bibliography
Alina Beygelzimer, John Langford, Lihong Li, Lev Reyzin, and Robert Schapire.
Contextual bandit algorithms with supervised learning guarantees.
In Geoffrey Gordon, David Dunson, and Miroslav Dudík, editors, Proceedings of the
Fourteenth International Conference on Artificial Intelligence and Statistics, volume 15
of Proceedings of Machine Learning Research, pages 19–26, Fort Lauderdale, FL,
USA, 11–13 Apr 2011. JMLR Workshop and Conference Proceedings.
Thomas Bayes and Richard Price.
An essay towards solving a problem in the doctrine of chances.
Philosophical Transactions, 53:370–418, January 1763.
Djallel Bouneffouf and Irina Rish.
A survey on practical applications of multi-armed and contextual bandits.
arXiv, 2019.
Olivier Chapelle and Lihong Li.
An empirical evaluation of Thompson sampling.
In Advances in Neural Information Processing Systems 24 (NIPS 2011), 2011.
Audrey Durand, Charis Achilleos, Demetris Iacovides, Katerina Strati, Georgios D.
Mitsis, and Joelle Pineau.
Contextual bandits for adaptive treatment in a mouse model of de novo
carcinogenesis.
In Proceedings of the 3rd Machine Learning for Healthcare Conference, 2018.
Anirban DasGupta.
Asymptotic Theory of Statistics and Probability.
Springer, 2008.
James A. Edwards and David S. Leslie.
Selecting multiple web adverts: A contextual multi-armed bandit with state uncertainty.
Introduction to Topology.
Dover Publications, Inc., 3 edition, 1990.
Kanishka Misra, Eric M. Schwartz, and Jacob Abernethy.
Dynamic online pricing with incomplete information using multiarmed bandit
experiments.
Marketing Science, 38(2):226–252, March 2019.
Jonas Mueller, Vasilis Syrgkanis, and Matt Taddy.
Low-rank bandit methods for high-dimensional dynamic pricing.
In NeurIPS, 2019.
Thomas Peel, Sandrine Anthoine, and Liva Ralaivola.
Empirical Bernstein inequalities for U-statistics.
In Neural Information Processing Systems (NIPS), pages 1903–1911, Vancouver,
Canada, December 2010.
Richard S. Sutton and Andrew G. Barto.
Reinforcement Learning: An Introduction.
MIT Press, 2 edition, 2018.
James A. Shepperd, Patrick Carroll, Jodi Grace, and Meredith Terry.
Exploring the causes of comparative optimism.
Psychologica Belgica, 42(1–2):65–98, 2002.
Steven L. Scott.
A modern Bayesian look at the multi-armed bandit.
Applied Stochastic Models in Business and Industry, 26(6):639–658, November 2010.
Stephen Senn.
Statistical basis of public policy — present remembrance of priors past is not the same
as a true prior.
British Medical Journal, 1997.
Darryl Shen.
Order imbalance based strategy in high frequency trading.
Master’s thesis, Linacre College, University of Oxford, 2015.
David Silver.
Lectures on reinforcement learning.
url: https://www.davidsilver.uk/teaching/, 2015.
Stephen M. Stigler.
Laplace’s early work: chronology and citations.
Isis, 61:234–254, 1978.
Stephen M. Stigler.
Laplace’s 1774 memoir on inverse probability.
Statistical Science, 1(3):359–363, August 1986.
William R. Thompson.
On the likelihood that one unknown probability exceeds another in view of evidence of
two samples.
Biometrika, 25(3–4):285–294, December 1933.
Hado van Hasselt.