
EE675 - IIT Kanpur, 2022-23-II

Lecture 2: n-Armed Bandits


11.01.23
Lecturer: Prof. Subrahmanyu Swamy Peruru Scribe: Ananya Mehrotra and Shamayeeta Dass

1 Recap
1.1 Paradigms of Machine Learning and Basic Terminology
There are three paradigms of Machine Learning:

• Supervised Learning

• Unsupervised Learning

• Reinforcement Learning

We primarily focus on Reinforcement Learning (RL) in this course. Some common terms used in
Reinforcement Learning are described below:

• State : Representation of the environment of the task. Commonly denoted as St

• Action : The agent’s response to the State. Commonly denoted as At

• Environment : Produces the new State and the Reward in response to the previous action

• Reward : Feedback to the agent corresponding to the previous action and the new
state. Commonly denoted as Rt

• Return : Cumulative Reward. It is equivalent to $\sum_{t=1}^{T} R_t$

• Policy : Methodology which evaluates history and predicts best possible action. A policy
can be segregated into the following two categories:

– Deterministic : The output is a single action which is predicted to be the best possible
choice
– Stochastic : The output is a probability distribution over the actions; the higher the
probability assigned to a particular action, the better the policy considers it.

The main objective of an RL algorithm is to maximize the cumulative reward over several rounds.
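To make the distinction between deterministic and stochastic policies concrete, here is a minimal Python sketch; the action set and the probabilities in it are made up purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
actions = ["a1", "a2", "a3"]       # hypothetical action set

def deterministic_policy():
    # A deterministic policy outputs the single action it predicts to be best.
    return "a2"

def stochastic_policy():
    # A stochastic policy outputs a probability for each action and samples from it;
    # the higher the probability of an action, the better the policy considers it.
    probs = [0.1, 0.7, 0.2]        # assumed probabilities, for illustration only
    return rng.choice(actions, p=probs)

print(deterministic_policy(), stochastic_policy())
```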

2 Basics of n-Armed Bandits
n-Armed Bandits are a special case of Reinforcement Learning in which there is no time-sequence
dependence. In simpler terms, an n-Armed Bandit is a single-state scenario: a single action leads
to a reward with no intermediate states, after which the environment resets and the process repeats.
Some Common Notations :
• Arms : Refer to actions taken by the user. The ith arm is denoted by ai , and the total number
of arms is taken to be K

• Expectation of a Random Variable ’X’ : Denoted by E[X].



$E[X] = \int x f_X(x)\, dx$

where $f_X(x)$ refers to the Probability Density Function of a continuous random variable X

• True means of arms : Denoted by µi for ith arm. We know,

$E[R_t \mid A_t = a_i] = \mu_i$

The best true mean is given by µ∗ such that,

$\mu^* = \max_i \mu_i$

• Estimated Sample Averages: Denoted by µ̄i for ith arm.


There can be two main objectives in a Multi-Armed Bandit problem:
• Best Arm Identification : This methodology focuses more on exploration. Over T rounds,
P (Identified Arm is the Optimal Arm) is maximised to help identify the optimal arm

• Regret Minimisation : This methodology minimises the Expected Regret.


The Expected Regret is

$\mu^* T - \sum_{t=1}^{T} \mu(a_t)$

and, the Actual Regret is



$\mu^* T - \sum_{t=1}^{T} R_t$

where $\mu(a_t) = E[R_t \mid a_t]$ and $\mu_i = E[R_t \mid a_t = a_i]$
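To make these two notions concrete, the small sketch below computes both quantities for a logged run; it assumes the true means and the realised rewards are known in hindsight (the learner, of course, never sees the true means).

```python
import numpy as np

def expected_regret(mu, arms_played):
    # mu[i] is the true mean of arm i; arms_played[t] is the index of the arm chosen in round t.
    mu = np.asarray(mu, dtype=float)
    T = len(arms_played)
    return T * mu.max() - mu[np.asarray(arms_played)].sum()

def actual_regret(mu, rewards):
    # rewards[t] is the realised reward R_t in round t.
    T = len(rewards)
    return T * float(np.max(mu)) - float(np.sum(rewards))

# Toy example with K = 2 arms of made-up means 0.5 and 0.8 over T = 5 rounds.
mu = [0.5, 0.8]
arms_played = [0, 1, 1, 0, 1]
rewards = [1, 0, 1, 0, 1]
print(expected_regret(mu, arms_played), actual_regret(mu, rewards))
```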

2.1 Algorithms
There are two main algorithms for a Multi-Armed Bandit problem.

2.1.1 Explore Then Commit Algorithm
In this methodology, we explore all arms uniformly and select the arm with the highest sample
average reward.

Algorithm:

1. Explore each arm N times.

2. Pick the arm â with the highest sample average.

3. Play arm â in all remaining rounds.
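A minimal Python sketch of the Explore-Then-Commit idea, assuming Bernoulli reward arms whose means are made up for illustration:

```python
import numpy as np

def explore_then_commit(true_means, N, T, seed=0):
    # Explore each of the K arms N times, then commit to the arm with the best sample average.
    rng = np.random.default_rng(seed)
    K = len(true_means)
    sums = np.zeros(K)
    total_reward = 0.0

    # Exploration phase: pull every arm N times and accumulate its sample sum.
    for a in range(K):
        for _ in range(N):
            r = rng.binomial(1, true_means[a])   # Bernoulli reward in {0, 1}
            sums[a] += r
            total_reward += r

    # Commit phase: play the empirically best arm in all remaining rounds.
    best = int(np.argmax(sums / N))
    for _ in range(T - K * N):
        total_reward += rng.binomial(1, true_means[best])
    return best, total_reward

best_arm, reward = explore_then_commit([0.4, 0.6], N=50, T=1000)
print(best_arm, reward)
```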

If we take a large number of samples, the sample mean will be approximately equal to the expected
mean. In our setting, the sample average reward $\bar{\mu}(a)$ for any action a should be close to the
expected reward $\mu(a)$. We make use of the following inequality to quantify this.

Hoeffding’s Inequality: For the sample mean $\bar{\mu}(a)$ computed from N samples and for all $\epsilon > 0$,

$P\big[\,|\bar{\mu}(a) - \mu(a)| \geq \epsilon\,\big] \leq 2e^{-2\epsilon^2 N}$
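As a quick numerical sanity check of this bound, the sketch below simulates Bernoulli rewards with an assumed mean of 0.6 and compares the empirical deviation probability with $2e^{-2\epsilon^2 N}$:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, N, eps, trials = 0.6, 100, 0.1, 20_000        # assumed values, for illustration

# Monte Carlo estimate of P(|sample mean of N rewards - mu| >= eps).
sample_means = rng.binomial(1, mu, size=(trials, N)).mean(axis=1)
empirical = np.mean(np.abs(sample_means - mu) >= eps)

hoeffding_bound = 2 * np.exp(-2 * eps**2 * N)
print(empirical, hoeffding_bound)                 # the empirical probability stays below the bound
```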

2.1.2 Regret Minimisation in the ETC Algorithm


In Hoeffding’s Inequality, we assume that $R_t \in [0, 1]$. As we want the probability to be small, we
take $\epsilon \sim \sqrt{2\log(T)/N}$. This leads to the upper bound in the above expression being of the order $O(1/T^4)$.

Let us take an example of K = 2 arms. This indicates one of the arms will be the optimal arm,
and the other one will be a sub-optimal arm.

Let $a^*$ denote the best arm and $a$ the sub-optimal arm. If we commit to the sub-optimal arm, it must
be because it had a higher sample mean, $\bar{\mu}(a) > \bar{\mu}(a^*)$. Applying Hoeffding’s Inequality to both arms,

$\mu(a) + \epsilon \geq \bar{\mu}(a) > \bar{\mu}(a^*) \geq \mu(a^*) - \epsilon$

Thus,

$\mu(a^*) - \mu(a) \leq 2\epsilon$
In order to analyse the regret R(T), we divide it into two parts:

• Exploration Regret : Each arm is sampled N times, and every round in which the sub-optimal
arm is played contributes a regret of at most 1, giving a term of order N.

• Exploitation Regret : In the remaining T − 2N rounds, every round in which the sub-optimal
arm is picked contributes a regret of at most 2ϵ.

Hence,
$R(T) \leq N + 2\epsilon(T - 2N) < N + 2\epsilon T$

$R(T) < N + 2T\sqrt{2\log(T)/N}$
For $N = T^{2/3}(\log T)^{1/3}$, this upper bound on the regret is minimised. Thus, we get

$R(T) \leq O\big(T^{2/3}(\log T)^{1/3}\big)$

Let A be the event that Hoeffding’s Inequality holds for every arm and A′ be its complementary
event. Then, the Expected Regret can be expressed as

$E[R(T)] = E[R(T) \mid A]\,P(A) + E[R(T) \mid A']\,P(A')$

If Hoeffding’s Inequality does not hold, the regret in each round is at most 1, and since $P(A')$ is of
order $1/T^4$, this event contributes at most $T \cdot O(1/T^4)$ to the expected regret. As $P(A) \leq 1$,

$E[R(T)] \leq R(T) + T \cdot O(1/T^4)$

$E[R(T)] \leq O\big(T^{2/3}(\log T)^{1/3}\big)$
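To get a feel for these quantities, the short snippet below evaluates the suggested exploration length $N = T^{2/3}(\log T)^{1/3}$ and the order of the resulting regret bound for a few horizons; constants are dropped, so the numbers are indicative only.

```python
import numpy as np

for T in (10**3, 10**4, 10**5):
    N = int(T ** (2 / 3) * np.log(T) ** (1 / 3))   # exploration length per arm
    bound = T ** (2 / 3) * np.log(T) ** (1 / 3)    # order of the regret bound, constants dropped
    print(f"T = {T:>6}, N ~ {N:>5}, regret ~ O({bound:.0f})")
```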

2.1.3 Epsilon-greedy Algorithm


Algorithm

1. Toss a coin that lands heads with probability ϵ.

2. If the coin lands heads, pick an arm uniformly at random; otherwise, pick the arm with the best sample average.
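A minimal Python sketch of epsilon-greedy, assuming Bernoulli arms with made-up means; a fixed ϵ is used here for simplicity, while the decaying schedule mentioned below gives the stated regret bound.

```python
import numpy as np

def epsilon_greedy(true_means, eps, T, seed=0):
    # With probability eps explore a random arm, otherwise exploit the best sample average so far.
    rng = np.random.default_rng(seed)
    K = len(true_means)
    counts = np.zeros(K)
    sums = np.zeros(K)
    total_reward = 0.0
    for _ in range(T):
        if rng.random() < eps:                     # coin lands heads: explore
            a = int(rng.integers(K))
        else:                                      # exploit the arm with the best sample average
            means = np.divide(sums, counts, out=np.zeros(K), where=counts > 0)
            a = int(np.argmax(means))
        r = rng.binomial(1, true_means[a])         # Bernoulli reward in {0, 1}
        counts[a] += 1
        sums[a] += r
        total_reward += r
    return total_reward

print(epsilon_greedy([0.4, 0.6], eps=0.1, T=1000))
```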

If the Exploration Probability $\epsilon_t \approx t^{-1/3}$, then

$E[R(t)] \leq t^{2/3} \cdot O\big((K \log t)^{1/3}\big)$

References
[1] A. Slivkins. Introduction to Multi-Armed Bandits. Foundations and Trends in Machine Learn-
ing, 2022.

[2] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. The MIT Press, 2020.

