Unit:1 Reinforcement Learning
In the multi-armed bandit problem, an agent repeatedly chooses between k different actions and receives a reward based on the action it chooses.
The classic illustration is a slot machine, also known as a bandit, with two levers. We assume that each lever has a separate distribution of rewards and that there is at least one lever that generates the maximum reward. The probability distribution for the reward from each lever is unknown to the gambler (the decision maker). Hence, the goal here is to identify which lever to pull to get the maximum reward after a given set of trials.
● In the Multi-Armed Bandit problem, the learner takes some action and the environment returns some reward value. The learner has to find a way of choosing actions that collects the largest total reward over many rounds.
● Suppose we have a slot machine which has one lever and a screen, and the screen displays three or more wheels. When you pull the lever, the game is activated. This single lever represents the single-arm or one-armed bandit.
● Suppose an advertiser has to find out the click rate of each ad for the same product. The objective of the advertiser is to find the best advertisement. So, the problem can be set up as follows:
1. The advertiser has several ads for the product, and one of them is displayed each time a user visits the web page.
2. Each time a user visits the web page, that makes one round.
3. In each round, the advertiser chooses one ad to display to that user.
4. The displayed ad earns a reward of 1 if the user clicks on it and 0 otherwise.
5. The advertiser’s goal is to maximize the total reward from all rounds.
For example:
● An advertiser wants to measure the click-through rate of three different ads for the same product. In each round, one of the ads is shown to the visiting user and the advertiser records whether the user clicks the ad or not. After a while, the advertiser notices that one ad seems to be working better than the others. The advertiser must now decide whether to show only that ad or to continue the randomized study.
● If the advertiser only displays one ad, then he can no longer collect data on the other two ads. Perhaps one of the other ads is actually better and only appears worse due to chance. If, on the other hand, the other two ads really are worse, then continuing the randomized study affects the overall click-through rate adversely. This is the exploration-exploitation trade-off of the multi-armed bandit: each ad that is displayed yields some unknown reward, and the profit of the advertiser after all rounds is the sum of those rewards. In the slot-machine analogy, the ads are the levers, and the rewards are the payoffs for hitting the jackpot. A minimal simulation of this setting is sketched below.
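As a concrete illustration, here is a minimal sketch of the randomized study above. The three click probabilities, the number of rounds, and the use of numpy are assumptions made only for this example; the advertiser would not know the true click rates.

```python
import numpy as np

rng = np.random.default_rng(0)

true_ctr = [0.05, 0.04, 0.08]   # hypothetical true click-through rates (unknown in practice)
n_rounds = 10_000

clicks = np.zeros(3)            # total reward (clicks) collected per ad
shows = np.zeros(3)             # how many times each ad was displayed

for _ in range(n_rounds):
    ad = rng.integers(3)                   # randomized study: show an ad chosen uniformly at random
    reward = rng.random() < true_ctr[ad]   # 1 if the user clicks, 0 otherwise
    shows[ad] += 1
    clicks[ad] += reward

print("estimated click-through rate per ad:", clicks / shows)
print("total reward:", int(clicks.sum()))
```

After enough rounds the estimated rates single out the best ad, but every round spent on a worse ad costs reward; that tension is the dilemma discussed above.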
The simplest action selection rule is to select the action (or one of the actions) with the highest estimated action value, that is, to select at step t one of the greedy actions, $A_t^*$, for which

$$Q_t(A_t^*) = \max_a Q_t(a).$$
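In code, greedy selection is just an argmax over the current estimates; the numbers and the use of numpy below are assumptions for illustration.

```python
import numpy as np

Q = np.array([0.25, 0.60, 0.10])   # current action-value estimates Q_t(a) (made-up values)
greedy_action = int(np.argmax(Q))  # A_t*: the action with the highest estimate (here, 1)
```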
Action-Values:
For the advertiser to decide which action is best, we must define the value of taking each action. We define these values with the action-value function, expressed in the language of probability. The value of selecting an action a, written q*(a), is defined as the expected reward Rt we receive when that action is taken from the possible set of actions. The goal of the agent is to maximize the expected reward by selecting the action that has the highest action-value.
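Written out in the standard k-armed-bandit notation, consistent with the symbols above:

$$q_*(a) \doteq \mathbb{E}\left[\, R_t \mid A_t = a \,\right].$$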
Action-value Estimate:
Since the value of selecting an action, i.e. q*(a), is not known to the agent, we use the sample-average method to estimate it.
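The sample-average estimate, stated here for completeness, averages the rewards actually received on the steps when action a was chosen:

$$Q_t(a) = \frac{\text{sum of rewards received when } a \text{ was taken prior to } t}{\text{number of times } a \text{ was taken prior to } t}.$$

A minimal incremental implementation is sketched below; the variable names and the use of numpy are assumptions, and each new reward moves the estimate toward itself by a shrinking step of size 1/N.

```python
import numpy as np

k = 3                        # number of actions (e.g. three ads)
Q = np.zeros(k)              # sample-average estimates Q_t(a)
N = np.zeros(k, dtype=int)   # how many times each action has been taken

def record(a, reward):
    """Incremental sample-average update: Q_new = Q_old + (1/N) * (reward - Q_old)."""
    N[a] += 1
    Q[a] += (reward - Q[a]) / N[a]
```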
Exploration vs Exploitation:
● Greedy Action: When an agent chooses an action that currently has the
largest estimated value. The agent exploits its current knowledge by
choosing the greedy action.
● Non-Greedy Action: When the agent does not choose the action with the largest estimated value, sacrificing immediate reward in the hope of gaining more information about the other actions.
● Exploration: It allows the agent to improve its knowledge about each action, hopefully leading to a long-term benefit.
● Exploitation: It allows the agent to choose the greedy action to try to get the most reward for a short-term benefit. Purely greedy action selection can lead to sub-optimal behaviour.
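One common way to mix the two, not named explicitly in these notes but a standard baseline, is ε-greedy selection: exploit the greedy action with probability 1 − ε and explore a random action with probability ε. A minimal sketch, with ε = 0.1 chosen arbitrarily:

```python
import numpy as np

rng = np.random.default_rng()

def epsilon_greedy(Q, epsilon=0.1):
    """With prob. epsilon take a random (exploratory) action, otherwise the greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(len(Q)))   # non-greedy action: explore
    return int(np.argmax(Q))               # greedy action: exploit current knowledge
```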
A dilemma occurs between exploration and exploitation because an agent cannot choose to both explore and exploit at the same time. Hence, we use the Upper Confidence Bound (UCB) algorithm to address the exploration-exploitation dilemma; its action-selection rule is sketched below.
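A minimal sketch of UCB action selection, assuming the common UCB1 form Q_t(a) + c·sqrt(ln t / N_t(a)); the exploration constant c, the variable names, and the use of numpy are assumptions.

```python
import numpy as np

def ucb_action(Q, N, t, c=2.0):
    """Select the action maximizing Q_t(a) + c * sqrt(ln t / N_t(a)).

    Q: action-value estimates, N: how often each action has been taken,
    t: current round (starting at 1). Untried actions are selected first.
    """
    if np.any(N == 0):
        return int(np.argmin(N))            # ensure every action is tried at least once
    bonus = c * np.sqrt(np.log(t) / N)      # uncertainty bonus shrinks as N_t(a) grows
    return int(np.argmax(Q + bonus))
```

The bonus term is large for actions tried only a few times, so UCB keeps exploring them; as the counts grow, the rule behaves more and more greedily.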
Constant Step-Size Parameter:
When the reward probabilities change over time (a nonstationary problem), it makes sense to weight recent rewards more heavily than long-past ones. One of the most popular ways of doing this is to use a constant step-size parameter. For example, the incremental update rule for the sample average, $Q_{k+1} = Q_k + \frac{1}{k}\big[R_k - Q_k\big]$, is modified to

$$Q_{k+1} = Q_k + \alpha \big[R_k - Q_k\big],$$

where the step-size parameter $\alpha \in (0, 1]$ is constant. This results in $Q_{k+1}$ being a weighted average of past rewards and the initial estimate $Q_1$:

$$
\begin{aligned}
Q_{k+1} &= Q_k + \alpha \big[R_k - Q_k\big] \\
        &= \alpha R_k + (1 - \alpha) Q_k \\
        &= \alpha R_k + (1 - \alpha)\big[\alpha R_{k-1} + (1 - \alpha) Q_{k-1}\big] \\
        &\;\,\vdots \\
        &= (1 - \alpha)^k Q_1 + \sum_{i=1}^{k} \alpha (1 - \alpha)^{k-i} R_i .
\end{aligned}
$$
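A small numerical check of this derivation (the reward values, α, and Q1 below are made-up): the incremental constant-α update and the closed-form weighted average give the same result.

```python
import math

alpha, Q1 = 0.1, 0.0
rewards = [1.0, 0.0, 1.0, 1.0, 0.0]   # hypothetical reward sequence R_1, ..., R_k

# Incremental form: Q_{k+1} = Q_k + alpha * (R_k - Q_k)
Q = Q1
for R in rewards:
    Q += alpha * (R - Q)

# Closed form: (1 - alpha)^k * Q_1 + sum_i alpha * (1 - alpha)^(k - i) * R_i
k = len(rewards)
closed = (1 - alpha) ** k * Q1 + sum(
    alpha * (1 - alpha) ** (k - i) * R for i, R in enumerate(rewards, start=1)
)

print(Q, closed, math.isclose(Q, closed))   # both values match: True
```

Because the weight α(1 − α)^(k−i) decays with the age of the reward, this estimate is often called an exponential recency-weighted average.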
Let $\alpha_k(a)$ denote the step-size parameter used after the k-th selection of action a. A well-known result in stochastic approximation theory gives the conditions on $\alpha_k(a)$ required to assure convergence with probability 1:

$$\sum_{k=1}^{\infty} \alpha_k(a) = \infty \qquad \text{and} \qquad \sum_{k=1}^{\infty} \alpha_k^2(a) < \infty .$$
The first condition is required to guarantee that the steps are large
enough to eventually overcome any initial conditions or random
fluctuations. The second condition guarantees that eventually the
steps become small enough to assure convergence.
Note that both convergence conditions are met for the sample-average case, $\alpha_k(a) = \frac{1}{k}$, but not for the case of a constant step-size parameter, $\alpha_k(a) = \alpha$. In the latter case, the second condition is not met, indicating that the estimates never completely converge but continue to vary in response to the most recently received rewards.
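A quick check of these two claims using standard series facts (not spelled out in the notes):

$$\sum_{k=1}^{\infty} \frac{1}{k} = \infty \;\;\text{(harmonic series)}, \qquad \sum_{k=1}^{\infty} \frac{1}{k^2} = \frac{\pi^2}{6} < \infty ,$$

whereas for a constant step size $\alpha > 0$,

$$\sum_{k=1}^{\infty} \alpha = \infty , \qquad \sum_{k=1}^{\infty} \alpha^2 = \infty ,$$

so the second condition fails.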