Introduction to Bandit Algorithms, Unit 1


Understand the fundamental concepts and frameworks of Bandit Algorithms

Topics:
1. Introduction to Bandit Algorithms
2. The Language of Bandits
3. Applications
4. Probability Spaces
5. Independence
6. Batch to Online Setting
7. Adversarial Setting with Full Information
8. Halving Algorithm
9. Regret Lower Bounds
Introduction to Bandit Algorithms

Bandit problems were introduced by William R. Thompson in an article published in 1933 in Biometrika. Thompson was interested in medical trials and the cruelty of running a trial blindly, without adapting the treatment allocations on the fly as the drug appears more or less effective. The name comes from the 1950s, when Frederick Mosteller and Robert Bush decided to study animal learning and ran trials on mice and then on humans.

Figure 1.1 A mouse learning a T-maze.
The Language of Bandits
A bandit problem is a sequential game between a learner and an environment. The game
is played over n rounds, where n is a positive natural number called the horizon. In each
round t ∈ [n], the learner first chooses an action At from a given set A, and the environment
then reveals a reward Xt ∈ R.

The fundamental challenge in bandit problems is that the environment is unknown to the learner. All the learner knows is that the true environment lies in some set E called the environment class.
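
To make the interaction protocol concrete, here is a minimal sketch of the learner-environment loop in Python. The Bernoulli environment and the uniformly random learner below are illustrative assumptions, not constructions from the text:

```python
import random

class BernoulliEnv:
    """An assumed stochastic environment: one Bernoulli arm per entry of `means`."""
    def __init__(self, means):
        self.means = means
    def reward(self, a):
        # reveal the reward X_t in {0, 1} for the chosen action a
        return 1.0 if random.random() < self.means[a] else 0.0

class UniformLearner:
    """A deliberately naive learner that picks actions uniformly at random."""
    def __init__(self, k):
        self.k = k
    def choose(self):
        return random.randrange(self.k)
    def observe(self, a, x):
        pass  # a real bandit algorithm would update its estimates here

def play(learner, env, n):
    """Run the bandit game for a horizon of n rounds; return the total reward."""
    total = 0.0
    for t in range(n):
        a = learner.choose()   # learner picks action A_t
        x = env.reward(a)      # environment reveals reward X_t
        learner.observe(a, x)
        total += x
    return total

# total = play(UniformLearner(2), BernoulliEnv([0.4, 0.6]), n=1000)
```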
Applications
After this short preview, and as an appetiser before the hard work, we briefly describe the
formalisations of a variety of applications.

A/B Testing
The designers of a company website are trying to decide whether the ‘buy it now’ button
should be placed at the top of the product page or at the bottom. In the old days, they would commit to a
trial of each version by splitting incoming users into two groups of 10 000.
Each group would be shown a different version of the site, and a statistician would examine
the data at the end to decide which version was better. One problem with this approach is the non-adaptivity of the test: for example, if the effect size turns out to be large, the trial could have been stopped early.
One way to apply bandits to this problem is to view the two versions of the site as actions. Each time t a user makes a request, a bandit algorithm is used to choose an action At ∈ A = {SiteA, SiteB}, and the reward is Xt = 1 if the user purchases the product and Xt = 0 otherwise.

In traditional A/B testing, the objective of the statistician is to decide which website is better. When using a bandit algorithm, there is no need to end the trial. The algorithm automatically decides when one version of the site should be shown more often than another. Even if the real objective is to identify the best site, adaptivity or early stopping can be added to the A/B process using techniques from bandit theory.
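
As a sketch of how such an adaptive trial might look, the following hypothetical ε-greedy routine treats the two site versions as arms; the exploration rate and the `purchase` simulator are assumptions made for illustration, not the text's prescription:

```python
import random

def epsilon_greedy_ab(purchase, n=10_000, eps=0.1):
    """Adaptively allocate users to SiteA/SiteB instead of a fixed 50/50 split.

    `purchase(site)` is assumed to return 1 if the user bought the product, else 0.
    """
    counts = {"SiteA": 0, "SiteB": 0}
    wins = {"SiteA": 0, "SiteB": 0}
    for t in range(n):
        if random.random() < eps or min(counts.values()) == 0:
            site = random.choice(list(counts))                      # explore
        else:
            site = max(counts, key=lambda s: wins[s] / counts[s])   # exploit
        x = purchase(site)
        counts[site] += 1
        wins[site] += x
    return counts, wins
```

Unlike the fixed 10 000/10 000 split, the allocation drifts toward the better-performing version as evidence accumulates.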
Advert Placement
In advert placement, each round corresponds to a user visiting a website, and the set of actions A is the set of all available adverts. One could treat this as a standard multi-armed bandit problem, where in each round a policy chooses At ∈ A, and the reward is Xt = 1 if the user clicked on the advert and Xt = 0 otherwise. This might work for specialised websites where the adverts are all likely to be appropriate. But for a company like Amazon, the advertising should be targeted. A user who recently purchased rock-climbing shoes is much more likely to buy a harness than another user. Clearly an algorithm should take this into account.
Recommendation Services

Netflix has to decide which movies to place most prominently in your ‘Browse’ page. Like
in advert placement, users arrive at the page sequentially, and the reward can be measured as
some function of (a) whether or not you watched a movie and (b) whether or not you rated it positively.

There are many challenges. First of all, Netflix shows a long list of movies,
so the set of possible actions is combinatorially large. Second, each user watches relatively
few movies, and individual users are different. This suggests approaches such as low-rank
matrix factorisation (a popular approach in ‘collaborative filtering’). But notice this is not
an offline problem. The learning algorithm gets to choose what users see and this affects
the data. If users are never recommended the AlphaGo movie, then few will watch it, and data about this film will be scarce.
Network Routing

Another problem with an interesting structure is network routing, where the learner tries
to direct internet traffic through the shortest path on a network. In each round the learner
receives the start/end destinations for a packet of data. The set of actions is the set of all paths
starting and ending at the appropriate points on some known graph. The feedback in this
case is the time it takes for the packet to be received at its destination, and the reward is the
negation of this value. Again the action set is combinatorially large. Even relatively small
graphs have an enormous number of paths. The routing problem can obviously be applied
to more physical networks such as transportation systems used in operations research.
Dynamic Pricing

In dynamic pricing, a company is trying to automatically optimise the price of some product.
Users arrive sequentially, and the learner sets the price. The user will only purchase the
product if the price is lower than their valuation. What makes this problem interesting is that (a) the learner never actually observes the valuation of the product, only the binary signal that the price was too low or too high, and (b) there is a monotonicity structure in the pricing: if a user purchased an item priced at $10, then they would surely purchase it for $5, but whether or not it would sell when priced at $11 is uncertain. Also, the set of possible actions is close to continuous.
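
A minimal sketch of this feedback model, with an assumed hidden valuation distribution, shows why the learner only ever sees a censored binary signal:

```python
import random

def pricing_round(price, sample_valuation=lambda: random.uniform(0, 20)):
    """One dynamic-pricing round under an assumed valuation distribution.

    The learner never observes `valuation`, only whether the sale happened.
    """
    valuation = sample_valuation()
    sold = price <= valuation      # monotone: any lower price would also sell
    revenue = price if sold else 0.0
    return sold, revenue
```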
Waiting Problems

Every day you travel to work, either by bus or by walking. Once you get on the bus, the trip
only takes 5 minutes, but the timetable is unreliable, and the bus arrival time is unknown
and stochastic. Sometimes the bus doesn’t come at all. Walking, on the other hand, takes 30
minutes along a beautiful river away from the road. The problem is to devise a policy for
choosing how long to wait at the bus stop before giving up and walking to minimise the time
to get to your workplace. Walk too soon, and you miss the bus and gain little information.
But waiting too long also comes at a price.
While waiting for a bus is not a problem we all face, there are other applications of
this setting. For example, deciding the amount of inactivity required before putting a hard
drive into sleep mode or powering off a car engine at traffic lights. The statistical part of
the waiting problem concerns estimating the cumulative distribution function of the bus
arrival times from data. The twist is that the data is censored on the days you chose to
walk before the bus arrived, which is a problem analysed in the subfield of statistics called
survival analysis. The interplay between the statistical estimation problem and the challenge
of balancing exploration and exploitation is what makes this and the other problems studied
in this book interesting.
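
The censoring twist is easy to see in a small simulation of a fixed waiting threshold; the bus-arrival model below is an assumption made for illustration:

```python
import random

def commute(threshold, bus_trip=5, walk=30, p_no_bus=0.1):
    """Simulate one day with a fixed waiting threshold (in minutes).

    Returns (total_time, observation): the exact bus arrival time if we caught
    the bus, or only the censored fact that it had not arrived by `threshold`.
    """
    bus = float("inf") if random.random() < p_no_bus else random.expovariate(1 / 10)
    if bus <= threshold:
        return bus + bus_trip, ("exact", bus)
    return threshold + walk, ("censored", threshold)
```

On walking days the data point is censored: we learn only that the arrival time exceeded the threshold, which is exactly the situation studied in survival analysis.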
Probability Spaces and Random Elements
The thrill of gambling comes from the fact that the bet is placed on future outcomes that are uncertain at the time of the gamble. A central question in gambling is the fair value of a game. This can be difficult to answer for all but the simplest games. As an illustrative example, imagine the following moderately complex game: I throw a dice. If the result is four, I throw two more dice; otherwise I throw one dice only. Looking at each newly thrown dice (one or two), I repeat the same, for a total of three rounds. Afterwards, I pay you the sum of the values on the faces of the dice. How much are you willing to pay to play this game with me?

Many examples of practical interest exhibit a complex random interdependency between outcomes. The cornerstone of modern probability theory, as proposed by Kolmogorov, aims to remove this complexity by separating the randomness from the mechanism that produces the outcome. Instead of rolling the dice one by one, imagine that sufficiently many dice were rolled before the game has even started. For our game we need to roll seven dice, because this is the maximum number that might be required (one in the first round, two in the second round and four in the third round; see Fig. 2.1).
Figure 2.1 The initial phase of a gambling game with a random number of dice rolls.
Depending on the outcome of a dice roll, one or two dice are rolled for a total of three
rounds. The number of dice used will then be random in the range of three to seven.
Figure 2.2 A key idea in probability theory is the separation of sources of randomness from
game mechanisms. A mechanism creates values from the elementary random outcomes,
some of which are visible for observers, while others may remain hidden.
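
To put a number on the fair value question, here is a quick Monte Carlo sketch of the game described above (the simulation is our illustration; the game rules are as stated in the text):

```python
import random

def play_game():
    """Play the three-round dice game once and return the total payout."""
    total = 0
    active = 1                       # dice to roll in the current round
    for _ in range(3):
        rolls = [random.randint(1, 6) for _ in range(active)]
        total += sum(rolls)
        # each die showing four is followed by two dice next round, others by one
        active = sum(2 if r == 4 else 1 for r in rolls)
    return total

# fair value ≈ average payout over many plays:
# estimate = sum(play_game() for _ in range(100_000)) / 100_000
```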
The significance of the push-forward measure P_X is that any probabilistic question concerning X can be answered from the knowledge of P_X alone. Even Ω and the details of the map X are not needed. This is often used as an excuse to not even mention the underlying probability space (Ω, F, P).

A big 'conspiracy' in probability theory is that probability spaces are seldom mentioned in theorem statements, despite the fact that a measure cannot be defined without one. Statements are instead given in terms of random elements and constraints on their joint probabilities. For example, suppose that X and Y are random variables such that

P(X ∈ A, Y ∈ B) = (|A ∩ [6]| / 6) · (|B ∩ [2]| / 2) for all A, B ∈ B(R).

This pins down everything that matters: X is uniformly distributed on [6], Y is uniformly distributed on [2], and X and Y are independent, all without reference to any particular underlying probability space.
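
As a toy illustration of this point (our example, not the text's): take Ω to be the outcomes of two fair dice with the uniform measure and X their sum; every question about X is then answered by the push-forward P_X alone:

```python
from collections import Counter
from fractions import Fraction
from itertools import product

omega = list(product(range(1, 7), repeat=2))   # (Ω, F, P): two fair dice, uniform P

def X(w):                                      # random variable X: Ω -> R
    return w[0] + w[1]

# push-forward measure P_X(x) = P({ω : X(ω) = x})
PX = {x: Fraction(c, len(omega)) for x, c in Counter(X(w) for w in omega).items()}

# e.g. P(X >= 10) is computed from P_X without ever mentioning Ω again:
p = sum(q for x, q in PX.items() if x >= 10)   # = 1/6
```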
Independence
Independence is another basic concept of probability that relates to knowledge/information. In its simplest form, independence is a relation that holds between events on a probability space (Ω, F, P). Two events A, B ∈ F are independent if

P(A ∩ B) = P(A) P(B).    (2.3)

How is this related to knowledge? Assuming that P(B) > 0, dividing both sides by P(B) and using the definition of conditional probability, we get that (2.3) is equivalent to

P(A | B) = P(A).

Of course, we also have that if P(A) > 0, (2.3) is equivalent to P(B | A) = P(B). Both of the latter relations express that A and B are independent if the probability assigned to A (or B) remains the same regardless of whether it is known that B (respectively, A) occurred.
A collection of events G ⊂ F is said to be pairwise independent if any two distinct elements of G are independent of each other. The events in G are said to be mutually independent if for any integer n > 0 and any distinct elements A1, . . . , An of G,

P(A1 ∩ · · · ∩ An) = ∏_{i=1}^{n} P(Ai).

This is a stronger restriction than pairwise independence. In the case of mutually independent events, the knowledge of the joint occurrence of any finitely many events from the collection will not change our prediction of whether some other event in the collection happens. But this may not be the case when the events are only pairwise independent. Two collections of events G1, G2 are said to be independent of each other if for any A ∈ G1 and B ∈ G2, it holds that A and B are independent.
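
The classic example separating the two notions uses two fair coin flips; the enumeration below (our illustration) verifies that three events can be pairwise independent without being mutually independent:

```python
from itertools import product

omega = list(product([0, 1], repeat=2))        # two fair coin flips, uniform measure

def prob(event):
    return sum(1 for w in omega if event(w)) / len(omega)

A = lambda w: w[0] == 1                        # first flip is heads
B = lambda w: w[1] == 1                        # second flip is heads
C = lambda w: w[0] != w[1]                     # the flips differ

# each pair satisfies P(X ∩ Y) = P(X) P(Y) ...
for X, Y in [(A, B), (A, C), (B, C)]:
    assert prob(lambda w: X(w) and Y(w)) == prob(X) * prob(Y)
# ... but P(A ∩ B ∩ C) = 0, not P(A) P(B) P(C) = 1/8
assert prob(lambda w: A(w) and B(w) and C(w)) == 0.0
```

Knowing that A and B both occurred completely determines C (it cannot occur), even though C is independent of each of A and B on its own.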
Halving algorithm
When the concept class is finite, there is a surprisingly simple algorithm that obtains a
mistake bound of log2 |C|. This algorithm is called the Halving algorithm, and it uses two key
ideas. The first idea is that of a version space. The version space is the set of hypotheses that
are consistent with the data observed thus far. Thus, at the start of round t, the version
space Vt is the subset of hypotheses from C which are consistent with (x1, y1), . . . ,(xt−1,
yt−1). The second idea is to predict according to a majority vote. For a set of hypotheses F,
define the majority vote based on F as

mv_F(x) = 1 if |{f ∈ F : f(x) = 1}| ≥ |F|/2, and mv_F(x) = 0 otherwise.

The Halving algorithm simply predicts according to the majority vote with respect to the
version space in every round.
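
A direct sketch of the algorithm under these definitions; representing hypotheses as Python callables x -> {0, 1} is our assumption:

```python
def halving(hypotheses, stream):
    """Run the Halving algorithm on a stream of (x, y) examples.

    `hypotheses` is a finite concept class C containing the true concept.
    Returns the number of mistakes, which is at most log2(len(hypotheses)).
    """
    version_space = list(hypotheses)   # V_1 = C
    mistakes = 0
    for x, y in stream:
        votes = sum(h(x) for h in version_space)
        prediction = 1 if votes >= len(version_space) / 2 else 0
        if prediction != y:
            mistakes += 1              # the majority was wrong: at least half go
        # keep only hypotheses consistent with the observed label
        version_space = [h for h in version_space if h(x) == y]
    return mistakes
```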
Theorem 1. The Halving algorithm learns any finite concept class C in the mistake bound model and makes at most log2 |C| mistakes.

The bound holds because whenever the majority vote errs, at least half of the version space is inconsistent with the revealed label and is eliminated. Unfortunately, the runtime of the Halving algorithm is linear in |C|, which can be exorbitant. Why is this bad? In many situations, the size of the concept class |C| can be exponential in the dimension d of the data, in which case the runtime of the Halving algorithm is exponential in d. For instance, the class of monotone conjunctions over d variables has cardinality 2^d. In other cases, such as the case of linear separators, the concept class can even be infinite.
Regret lower bound
Suppose the margin parameter α ∈ [0, 1]. Let P_α be the class of distributions of (X_{a,t}, Y_{a,t} : a ∈ [K]), where the feature vector X_{a,t} is drawn from P_a and the reward Y_{a,t} = ⟨X_{a,t}, β∗⟩ + ε_{a,t} for a ∈ [K] satisfies Assumptions 1–2 with parameter α in Assumption 2(b). Then, for a large enough horizon T, the regret of any policy over P_α satisfies a lower bound in which C_L is a constant independent of T and d.
