
EE675 - IIT Kanpur, 2022-23-II

Lecture 2: n-Armed Bandits


11.01.23
Lecturer: Prof. Subrahmanyu Swamy Peruru Scribe: Ananya Mehrotra and Shamayeeta Dass

1 Recap
1.1 Paradigms of Machine Learning and Basic Terminology
There are three paradigms of Machine Learning:

• Supervised Learning

• Unsupervised Learning

• Reinforcement Learning

We primarily focus on Reinforcement Learning (RL) in this course. Some common terms used in
Reinforcement Learning are described below:

• State : Representation of the environment of the task. Commonly denoted as St

• Action : The agent’s response to the State. Commonly denoted as At

• Environment : Produces the new State and the Reward in response to the previous action

• Reward : Feedback to the agent corresponding to the previous action and the new
state. Commonly denoted as Rt

• Return : Cumulative Reward. It is equivalent to $\sum_{t=1}^{T} R_t$

• Policy : Methodology which evaluates history and predicts best possible action. A policy
can be segregated into the following two categories:

– Deterministic : The output is a single action which is predicted to be the best possible
choice
– Stochastic : The output is a probability distribution over the actions; the higher the
probability assigned to a particular action, the better the policy considers it.

The main objective of an RL algorithm is to maximize the cumulative reward over several rounds.
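To make the distinction between deterministic and stochastic policies concrete, here is a minimal Python sketch; the action set and the probabilities in it are made up purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
actions = ["a1", "a2", "a3"]       # hypothetical action set

def deterministic_policy():
    # A deterministic policy outputs the single action it predicts to be best.
    return "a2"

def stochastic_policy():
    # A stochastic policy outputs a probability for each action and samples from it;
    # the higher the probability of an action, the better the policy considers it.
    probs = [0.1, 0.7, 0.2]        # assumed probabilities, for illustration only
    return rng.choice(actions, p=probs)

print(deterministic_policy(), stochastic_policy())
```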

2 Basics of n-Armed Bandits
n-Armed Bandits are a special case of Reinforcement Learning in which there is no time-sequence
dependence. In simpler terms, an n-Armed Bandit is a single-state scenario: a single action leads
to a reward with no intermediate states, after which the environment resets and the process repeats.
Some Common Notations :
• Arms : Refer to actions taken by the user. The ith arm is denoted by ai , and the total number
of arms is taken to be K

• Expectation of a Random Variable ’X’ : Denoted by E[X].



$E[X] = \int x f_X(x)\, dx$

where $f_X(x)$ refers to the Probability Density Function of a continuous random variable X

• True means of arms : Denoted by µi for ith arm. We know,

$E[R_t \mid A_t = a_i] = \mu_i$

The best true mean is given by µ∗ such that,

$\mu^* = \max_i \mu_i$

• Estimated Sample Averages: Denoted by µ̄i for ith arm.


There can be two main objectives in a Multi-Armed Bandit problem:
• Best Arm Identification : This methodology focuses more on exploration. Over T rounds,
P (Identified Arm is the Optimal Arm) is maximised to help identify the optimal arm

• Regret Minimisation : This methodology minimises the Expected Regret.


The Expected Regret is

$\mu^* T - \sum_{t=1}^{T} \mu(a_t)$

and, the Actual Regret is



$\mu^* T - \sum_{t=1}^{T} R_t$

where $\mu(a_t) = E[R_t \mid a_t]$ and $\mu_i = E[R_t \mid a_t = a_i]$
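To make these two notions concrete, the small sketch below computes both quantities for a logged run; it assumes the true means and the realised rewards are known in hindsight (the learner, of course, never sees the true means).

```python
import numpy as np

def expected_regret(mu, arms_played):
    # mu[i] is the true mean of arm i; arms_played[t] is the index of the arm chosen in round t.
    mu = np.asarray(mu, dtype=float)
    T = len(arms_played)
    return T * mu.max() - mu[np.asarray(arms_played)].sum()

def actual_regret(mu, rewards):
    # rewards[t] is the realised reward R_t in round t.
    T = len(rewards)
    return T * float(np.max(mu)) - float(np.sum(rewards))

# Toy example with K = 2 arms of made-up means 0.5 and 0.8 over T = 5 rounds.
mu = [0.5, 0.8]
arms_played = [0, 1, 1, 0, 1]
rewards = [1, 0, 1, 0, 1]
print(expected_regret(mu, arms_played), actual_regret(mu, rewards))
```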

2.1 Algorithms
There are two main algorithms for a Multi-Armed Bandit problem.

2.1.1 Explore Then Commit Algorithm
In this methodology, we explore all arms uniformly and select the arm with the highest sample
average reward.

Algorithm:

1. Explore each arm N times.

2. Pick the arm â with the highest sample average.

3. Play arm â in all remaining rounds.
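A minimal Python sketch of the Explore-Then-Commit idea, assuming Bernoulli reward arms whose means are made up for illustration:

```python
import numpy as np

def explore_then_commit(true_means, N, T, seed=0):
    # Explore each of the K arms N times, then commit to the arm with the best sample average.
    rng = np.random.default_rng(seed)
    K = len(true_means)
    sums = np.zeros(K)
    total_reward = 0.0

    # Exploration phase: pull every arm N times and accumulate its sample sum.
    for a in range(K):
        for _ in range(N):
            r = rng.binomial(1, true_means[a])   # Bernoulli reward in {0, 1}
            sums[a] += r
            total_reward += r

    # Commit phase: play the empirically best arm in all remaining rounds.
    best = int(np.argmax(sums / N))
    for _ in range(T - K * N):
        total_reward += rng.binomial(1, true_means[best])
    return best, total_reward

best_arm, reward = explore_then_commit([0.4, 0.6], N=50, T=1000)
print(best_arm, reward)
```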

If we take a large number of samples, the sample mean will be approximately equal to the expected
mean. In our setting, the sample average reward $\bar{\mu}(a)$ for any action a should be close to the
expected reward $\mu(a)$. We make use of the following inequality to quantify this.

Hoeffding’s Inequality: For the sample mean $\bar{\mu}(a)$ computed from N samples and for all $\epsilon > 0$,

$P\big[\,|\bar{\mu}(a) - \mu(a)| \geq \epsilon\,\big] \leq 2e^{-2\epsilon^2 N}$
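As a quick numerical sanity check of this bound, the sketch below simulates Bernoulli rewards with an assumed mean of 0.6 and compares the empirical deviation probability with $2e^{-2\epsilon^2 N}$:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, N, eps, trials = 0.6, 100, 0.1, 20_000        # assumed values, for illustration

# Monte Carlo estimate of P(|sample mean of N rewards - mu| >= eps).
sample_means = rng.binomial(1, mu, size=(trials, N)).mean(axis=1)
empirical = np.mean(np.abs(sample_means - mu) >= eps)

hoeffding_bound = 2 * np.exp(-2 * eps**2 * N)
print(empirical, hoeffding_bound)                 # the empirical probability stays below the bound
```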

2.1.2 Regret Minimisation in the ETC Algorithm


In Hoeffding’s Inequality, we assume that $R_t \in [0, 1]$. As we want the probability to be small, we
take $\epsilon \sim \sqrt{2\log(T)/N}$. This leads to the upper bound in the above expression being of the order $O(1/T^4)$.

Let us take an example of K = 2 arms. This indicates one of the arms will be the optimal arm,
and the other one will be a sub-optimal arm.

Let $a^*$ denote the best arm and $a$ the sub-optimal arm. If we commit to the sub-optimal arm, it must
be because it had a higher sample mean, $\bar{\mu}(a) > \bar{\mu}(a^*)$. Applying Hoeffding’s Inequality to both arms,

$\mu(a) + \epsilon \geq \bar{\mu}(a) > \bar{\mu}(a^*) \geq \mu(a^*) - \epsilon$

Thus,

$\mu(a^*) - \mu(a) \leq 2\epsilon$
In order to analyse the regret R(T), we divide it into two parts:

• Exploration Regret : Each arm is sampled N times, and every round in which the sub-optimal
arm is played contributes a regret of at most 1, giving a term of order N.

• Exploitation Regret : In the remaining T − 2N rounds, every round in which the sub-optimal
arm is picked contributes a regret of at most 2ϵ.

Hence,
$R(T) \leq N + 2\epsilon(T - 2N) < N + 2\epsilon T$

$R(T) < N + 2T\sqrt{2\log(T)/N}$
For $N = T^{2/3}(\log T)^{1/3}$, this upper bound on the regret is minimised. Thus, we get

$R(T) \leq O\big(T^{2/3}(\log T)^{1/3}\big)$

Let A be the event that Hoeffding’s Inequality holds for every arm and A′ be its complementary
event. Then, the Expected Regret can be expressed as

$E[R(T)] = E[R(T) \mid A]\,P(A) + E[R(T) \mid A']\,P(A')$

If Hoeffding’s Inequality does not hold, the regret in each round is at most 1, and since $P(A')$ is of
order $1/T^4$, this event contributes at most $T \cdot O(1/T^4)$ to the expected regret. As $P(A) \leq 1$,

$E[R(T)] \leq R(T) + T \cdot O(1/T^4)$

$E[R(T)] \leq O\big(T^{2/3}(\log T)^{1/3}\big)$
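To get a feel for these quantities, the short snippet below evaluates the suggested exploration length $N = T^{2/3}(\log T)^{1/3}$ and the order of the resulting regret bound for a few horizons; constants are dropped, so the numbers are indicative only.

```python
import numpy as np

for T in (10**3, 10**4, 10**5):
    N = int(T ** (2 / 3) * np.log(T) ** (1 / 3))   # exploration length per arm
    bound = T ** (2 / 3) * np.log(T) ** (1 / 3)    # order of the regret bound, constants dropped
    print(f"T = {T:>6}, N ~ {N:>5}, regret ~ O({bound:.0f})")
```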

2.1.3 Epsilon-greedy Algorithm


Algorithm

1. Toss a coin that lands heads with probability ϵ.

2. If the coin lands heads, pick an arm uniformly at random; otherwise, pick the arm with the best sample average.
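A minimal Python sketch of epsilon-greedy, assuming Bernoulli arms with made-up means; a fixed ϵ is used here for simplicity, while the decaying schedule mentioned below gives the stated regret bound.

```python
import numpy as np

def epsilon_greedy(true_means, eps, T, seed=0):
    # With probability eps explore a random arm, otherwise exploit the best sample average so far.
    rng = np.random.default_rng(seed)
    K = len(true_means)
    counts = np.zeros(K)
    sums = np.zeros(K)
    total_reward = 0.0
    for _ in range(T):
        if rng.random() < eps:                     # coin lands heads: explore
            a = int(rng.integers(K))
        else:                                      # exploit the arm with the best sample average
            means = np.divide(sums, counts, out=np.zeros(K), where=counts > 0)
            a = int(np.argmax(means))
        r = rng.binomial(1, true_means[a])         # Bernoulli reward in {0, 1}
        counts[a] += 1
        sums[a] += r
        total_reward += r
    return total_reward

print(epsilon_greedy([0.4, 0.6], eps=0.1, T=1000))
```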

If the Exploration Probability $\epsilon_t \approx t^{-1/3}$, then

$E[R(t)] \leq t^{2/3} \cdot O\big((K \log t)^{1/3}\big)$

References
[1] A. Slivkins. Introduction to Multi-Armed Bandits. Foundations and Trends in Machine Learn-
ing, 2022.

[2] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. The MIT Press, 2020.

