Introduction to Bandit Algorithms, Unit 1


Understand the fundamental concepts and frameworks of Bandit Algorithms

Topics:
1. Introduction to Bandit Algorithms
2. The Language of Bandits
3. Applications
4. Probability Spaces
5. Independence
6. Batch to Online Setting
7. Adversarial Setting with Full Information
8. Halving Algorithm
9. Regret Lower Bounds
Introduction to Bandit Algorithms

Bandit problems were introduced by William R. Thompson in an article published in 1933 in Biometrika. Thompson was interested in medical trials and the cruelty of running a trial blindly, without adapting the treatment allocations on the fly as the drug appears more or less effective. The name comes from the 1950s, when Frederick Mosteller and Robert Bush decided to study animal learning and ran trials on mice and then on humans.

Figure 1.1 A mouse learning a T-maze.
The Language of Bandits
A bandit problem is a sequential game between a learner and an environment. The game
is played over n rounds, where n is a positive natural number called the horizon. In each
round t ∈ [n], the learner first chooses an action At from a given set A, and the environment
then reveals a reward Xt ∈ R.

The fundamental challenge in bandit problems is that the environment is unknown to the learner. All the learner knows is that the true environment lies in some set E called the environment class.
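
To make the interaction protocol concrete, here is a minimal sketch of the learner-environment loop in Python. The Bernoulli environment and the uniformly random learner below are illustrative assumptions, not constructions from the text:

```python
import random

class BernoulliEnv:
    """An assumed stochastic environment: one Bernoulli arm per entry of `means`."""
    def __init__(self, means):
        self.means = means
    def reward(self, a):
        # reveal the reward X_t in {0, 1} for the chosen action a
        return 1.0 if random.random() < self.means[a] else 0.0

class UniformLearner:
    """A deliberately naive learner that picks actions uniformly at random."""
    def __init__(self, k):
        self.k = k
    def choose(self):
        return random.randrange(self.k)
    def observe(self, a, x):
        pass  # a real bandit algorithm would update its estimates here

def play(learner, env, n):
    """Run the bandit game for a horizon of n rounds; return the total reward."""
    total = 0.0
    for t in range(n):
        a = learner.choose()   # learner picks action A_t
        x = env.reward(a)      # environment reveals reward X_t
        learner.observe(a, x)
        total += x
    return total

# total = play(UniformLearner(2), BernoulliEnv([0.4, 0.6]), n=1000)
```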
Applications
After this short preview, and as an appetiser before the hard work, we briefly describe the
formalisations of a variety of applications.

A/B Testing
The designers of a company website are trying to decide whether the ‘buy it now’ button
should be placed at the top of the product page or at the bottom. In the old days, they would commit to a
trial of each version by splitting incoming users into two groups of 10 000.
Each group would be shown a different version of the site, and a statistician would examine
the data at the end to decide which version was better. One problem with this approach is the non-adaptivity of the test: for example, if the effect size turns out to be large, the trial could have been stopped early.
One way to apply bandits to this problem is to view the two versions of the site as actions. Each time t a user makes a request, a bandit algorithm is used to choose an action At ∈ A = {SiteA, SiteB}, and the reward is Xt = 1 if the user purchases the product and Xt = 0 otherwise.

In traditional A/B testing, the objective of the statistician is to decide which website is better. When using a bandit algorithm, there is no need to end the trial. The algorithm automatically decides when one version of the site should be shown more often than another. Even if the real objective is to identify the best site, adaptivity or early stopping can be added to the A/B process using techniques from bandit theory.
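
As a sketch of how such an adaptive trial might look, the following hypothetical ε-greedy routine treats the two site versions as arms; the exploration rate and the `purchase` simulator are assumptions made for illustration, not the text's prescription:

```python
import random

def epsilon_greedy_ab(purchase, n=10_000, eps=0.1):
    """Adaptively allocate users to SiteA/SiteB instead of a fixed 50/50 split.

    `purchase(site)` is assumed to return 1 if the user bought the product, else 0.
    """
    counts = {"SiteA": 0, "SiteB": 0}
    wins = {"SiteA": 0, "SiteB": 0}
    for t in range(n):
        if random.random() < eps or min(counts.values()) == 0:
            site = random.choice(list(counts))                      # explore
        else:
            site = max(counts, key=lambda s: wins[s] / counts[s])   # exploit
        x = purchase(site)
        counts[site] += 1
        wins[site] += x
    return counts, wins
```

Unlike the fixed 10 000/10 000 split, the allocation drifts toward the better-performing version as evidence accumulates.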
Advert Placement
In advert placement, each round corresponds to a user visiting a website, and the set of actions A is the set of all available adverts. One could treat this as a standard multi-armed bandit problem, where in each round a policy chooses At ∈ A, and the reward is Xt = 1 if the user clicked on the advert and Xt = 0 otherwise. This might work for specialised websites where the adverts are all likely to be appropriate. But for a company like Amazon, the advertising should be targeted. A user who recently purchased rock-climbing shoes is much more likely to buy a harness than another user. Clearly an algorithm should take this into account.
Recommendation Services

Netflix has to decide which movies to place most prominently in your ‘Browse’ page. Like
in advert placement, users arrive at the page sequentially, and the reward can be measured as
some function of (a) whether or not you watched a movie and (b) whether or not you rated it positively.

There are many challenges. First of all, Netflix shows a long list of movies,
so the set of possible actions is combinatorially large. Second, each user watches relatively
few movies, and individual users are different. This suggests approaches such as low-rank
matrix factorisation (a popular approach in ‘collaborative filtering’). But notice this is not
an offline problem. The learning algorithm gets to choose what users see and this affects
the data. If users are never recommended the AlphaGo movie, then few will watch it, and data about this film will be scarce.
Network Routing

Another problem with an interesting structure is network routing, where the learner tries
to direct internet traffic through the shortest path on a network. In each round the learner
receives the start/end destinations for a packet of data. The set of actions is the set of all paths
starting and ending at the appropriate points on some known graph. The feedback in this
case is the time it takes for the packet to be received at its destination, and the reward is the
negation of this value. Again the action set is combinatorially large. Even relatively small
graphs have an enormous number of paths. The routing problem can obviously be applied
to more physical networks such as transportation systems used in operations research.
Dynamic Pricing

In dynamic pricing, a company is trying to automatically optimise the price of some product.
Users arrive sequentially, and the learner sets the price. The user will only purchase the
product if the price is lower than their valuation. What makes this problem interesting is that (a) the learner never actually observes the valuation of the product, only the binary signal that the price was too low or too high, and (b) there is a monotonicity structure in the pricing: if a user purchased an item priced at $10, then they would surely purchase it for $5, but whether or not it would sell when priced at $11 is uncertain. Also, the set of possible actions is close to continuous.
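
A minimal sketch of this feedback model, with an assumed hidden valuation distribution, shows why the learner only ever sees a censored binary signal:

```python
import random

def pricing_round(price, sample_valuation=lambda: random.uniform(0, 20)):
    """One dynamic-pricing round under an assumed valuation distribution.

    The learner never observes `valuation`, only whether the sale happened.
    """
    valuation = sample_valuation()
    sold = price <= valuation      # monotone: any lower price would also sell
    revenue = price if sold else 0.0
    return sold, revenue
```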
Waiting Problems

Every day you travel to work, either by bus or by walking. Once you get on the bus, the trip
only takes 5 minutes, but the timetable is unreliable, and the bus arrival time is unknown
and stochastic. Sometimes the bus doesn’t come at all. Walking, on the other hand, takes 30
minutes along a beautiful river away from the road. The problem is to devise a policy for
choosing how long to wait at the bus stop before giving up and walking to minimise the time
to get to your workplace. Walk too soon, and you miss the bus and gain little information.
But waiting too long also comes at a price.
While waiting for a bus is not a problem we all face, there are other applications of
this setting. For example, deciding the amount of inactivity required before putting a hard
drive into sleep mode or powering off a car engine at traffic lights. The statistical part of
the waiting problem concerns estimating the cumulative distribution function of the bus
arrival times from data. The twist is that the data is censored on the days you chose to
walk before the bus arrived, which is a problem analysed in the subfield of statistics called
survival analysis. The interplay between the statistical estimation problem and the challenge
of balancing exploration and exploitation is what makes this and the other problems studied
in this book interesting.
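
The censoring twist is easy to see in a small simulation of a fixed waiting threshold; the bus-arrival model below is an assumption made for illustration:

```python
import random

def commute(threshold, bus_trip=5, walk=30, p_no_bus=0.1):
    """Simulate one day with a fixed waiting threshold (in minutes).

    Returns (total_time, observation): the exact bus arrival time if we caught
    the bus, or only the censored fact that it had not arrived by `threshold`.
    """
    bus = float("inf") if random.random() < p_no_bus else random.expovariate(1 / 10)
    if bus <= threshold:
        return bus + bus_trip, ("exact", bus)
    return threshold + walk, ("censored", threshold)
```

On walking days the data point is censored: we learn only that the arrival time exceeded the threshold, which is exactly the situation studied in survival analysis.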
Probability Spaces and Random Elements
The thrill of gambling comes from the fact that the bet is placed on future outcomes that are uncertain at the time of the gamble. A central question in gambling is the fair value of a game. This can be difficult to answer for all but the simplest games. As an illustrative example, imagine the following moderately complex game: I throw a dice. If the result is four, I throw two more dice; otherwise I throw one dice only. Looking at each newly thrown dice (one or two), I repeat the same, for a total of three rounds. Afterwards, I pay you the sum of the values on the faces of the dice. How much are you willing to pay to play this game with me?

Many examples of practical interest exhibit a complex random interdependency between outcomes. The cornerstone of modern probability theory, as proposed by Kolmogorov, aims to remove this complexity by separating the randomness from the mechanism that produces the outcome. Instead of rolling the dice one by one, imagine that sufficiently many dice were rolled before the game has even started. For our game we need to roll seven dice, because this is the maximum number that might be required (one in the first round, two in the second round and four in the third round; see Fig. 2.1).
Figure 2.1 The initial phase of a gambling game with a random number of dice rolls.
Depending on the outcome of a dice roll, one or two dice are rolled for a total of three
rounds. The number of dice used will then be random in the range of three to seven.
Figure 2.2 A key idea in probability theory is the separation of sources of randomness from
game mechanisms. A mechanism creates values from the elementary random outcomes,
some of which are visible for observers, while others may remain hidden.
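
To put a number on the fair value question, here is a quick Monte Carlo sketch of the game described above (the simulation is our illustration; the game rules are as stated in the text):

```python
import random

def play_game():
    """Play the three-round dice game once and return the total payout."""
    total = 0
    active = 1                       # dice to roll in the current round
    for _ in range(3):
        rolls = [random.randint(1, 6) for _ in range(active)]
        total += sum(rolls)
        # each die showing four is followed by two dice next round, others by one
        active = sum(2 if r == 4 else 1 for r in rolls)
    return total

# fair value ≈ average payout over many plays:
# estimate = sum(play_game() for _ in range(100_000)) / 100_000
```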
The significance of the push-forward measure P_X is that any probabilistic question concerning X can be answered from the knowledge of P_X alone. Even Ω and the details of the map X are not needed. This is often used as an excuse to not even mention the underlying probability space (Ω, F, P).

A big 'conspiracy' in probability theory is that probability spaces are seldom mentioned in theorem statements, despite the fact that a measure cannot be defined without one. Statements are instead given in terms of random elements and constraints on their joint probabilities. For example, suppose that X and Y are random variables such that

P(X ∈ A, Y ∈ B) = (|A ∩ [6]| / 6) · (|B ∩ [2]| / 2) for all A, B ∈ B(R).

This pins down everything that matters: X is uniformly distributed on [6], Y is uniformly distributed on [2], and X and Y are independent, all without reference to any particular underlying probability space.
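
As a toy illustration of this point (our example, not the text's): take Ω to be the outcomes of two fair dice with the uniform measure and X their sum; every question about X is then answered by the push-forward P_X alone:

```python
from collections import Counter
from fractions import Fraction
from itertools import product

omega = list(product(range(1, 7), repeat=2))   # (Ω, F, P): two fair dice, uniform P

def X(w):                                      # random variable X: Ω -> R
    return w[0] + w[1]

# push-forward measure P_X(x) = P({ω : X(ω) = x})
PX = {x: Fraction(c, len(omega)) for x, c in Counter(X(w) for w in omega).items()}

# e.g. P(X >= 10) is computed from P_X without ever mentioning Ω again:
p = sum(q for x, q in PX.items() if x >= 10)   # = 1/6
```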
Independence
Independence is another basic concept of probability that relates to knowledge/information. In its simplest form, independence is a relation that holds between events on a probability space (Ω, F, P). Two events A, B ∈ F are independent if

P(A ∩ B) = P(A) P(B).    (2.3)

How is this related to knowledge? Assuming that P(B) > 0, dividing both sides by P(B) and using the definition of conditional probability, we get that (2.3) is equivalent to

P(A | B) = P(A).

Of course, we also have that if P(A) > 0, (2.3) is equivalent to P(B | A) = P(B). Both of the latter relations express that A and B are independent if the probability assigned to A (or B) remains the same regardless of whether it is known that B (respectively, A) occurred.
A collection of events G ⊂ F is said to be pairwise independent if any two distinct elements of G are independent of each other. The events in G are said to be mutually independent if for any integer n > 0 and any distinct elements A1, . . . , An of G,

P(A1 ∩ · · · ∩ An) = ∏_{i=1}^{n} P(Ai).

This is a stronger restriction than pairwise independence. In the case of mutually independent events, the knowledge of the joint occurrence of any finitely many events from the collection will not change our prediction of whether some other event in the collection happens. But this may not be the case when the events are only pairwise independent. Two collections of events G1, G2 are said to be independent of each other if for any A ∈ G1 and B ∈ G2, it holds that A and B are independent.
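
The classic example separating the two notions uses two fair coin flips; the enumeration below (our illustration) verifies that three events can be pairwise independent without being mutually independent:

```python
from itertools import product

omega = list(product([0, 1], repeat=2))        # two fair coin flips, uniform measure

def prob(event):
    return sum(1 for w in omega if event(w)) / len(omega)

A = lambda w: w[0] == 1                        # first flip is heads
B = lambda w: w[1] == 1                        # second flip is heads
C = lambda w: w[0] != w[1]                     # the flips differ

# each pair satisfies P(X ∩ Y) = P(X) P(Y) ...
for X, Y in [(A, B), (A, C), (B, C)]:
    assert prob(lambda w: X(w) and Y(w)) == prob(X) * prob(Y)
# ... but P(A ∩ B ∩ C) = 0, not P(A) P(B) P(C) = 1/8
assert prob(lambda w: A(w) and B(w) and C(w)) == 0.0
```

Knowing that A and B both occurred completely determines C (it cannot occur), even though C is independent of each of A and B on its own.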
Halving algorithm
When the concept class is finite, there is a surprisingly simple algorithm that obtains a
mistake bound of log2 |C|. This algorithm is called the Halving algorithm, and it uses two key
ideas. The first idea is that of a version space. The version space is the set of hypotheses that
are consistent with the data observed thus far. Thus, at the start of round t, the version
space Vt is the subset of hypotheses from C which are consistent with (x1, y1), . . . ,(xt−1,
yt−1). The second idea is to predict according to a majority vote. For a set of hypotheses F,
define the majority vote based on F as

mv_F(x) = 1 if |{f ∈ F : f(x) = 1}| ≥ |F|/2, and mv_F(x) = 0 otherwise.

The Halving algorithm simply predicts according to the majority vote with respect to the
version space in every round.
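
A direct sketch of the algorithm under these definitions; representing hypotheses as Python callables x -> {0, 1} is our assumption:

```python
def halving(hypotheses, stream):
    """Run the Halving algorithm on a stream of (x, y) examples.

    `hypotheses` is a finite concept class C containing the true concept.
    Returns the number of mistakes, which is at most log2(len(hypotheses)).
    """
    version_space = list(hypotheses)   # V_1 = C
    mistakes = 0
    for x, y in stream:
        votes = sum(h(x) for h in version_space)
        prediction = 1 if votes >= len(version_space) / 2 else 0
        if prediction != y:
            mistakes += 1              # the majority was wrong: at least half go
        # keep only hypotheses consistent with the observed label
        version_space = [h for h in version_space if h(x) == y]
    return mistakes
```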
Theorem 1. The Halving algorithm learns any finite concept class C in the mistake bound model and makes at most log2 |C| mistakes.

The bound holds because whenever the majority vote errs, at least half of the version space is inconsistent with the revealed label and is eliminated. Unfortunately, the runtime of the Halving algorithm is linear in |C|, which can be exorbitant. Why is this bad? In many situations, the size of the concept class |C| can be exponential in the dimension d of the data, in which case the runtime of the Halving algorithm is exponential in d. For instance, the class of monotone conjunctions over d variables has cardinality 2^d. In other cases, such as the case of linear separators, the concept class can even be infinite.
Regret lower bound
Suppose the margin parameter α ∈ [0, 1]. Let P_α be the class of distributions of (X_{a,t}, Y_{a,t} : a ∈ [K]), where the feature vector X_{a,t} is drawn from P_a and the reward Y_{a,t} = ⟨X_{a,t}, β∗⟩ + ε_{a,t} for a ∈ [K] satisfies Assumptions 1–2 with parameter α in Assumption 2(b). Then, for a large enough horizon T, the regret of any policy over P_α satisfies a lower bound in which C_L is a constant independent of T and d.
