Know all About Upper Confidence Bound Algorithm in Reinforcement Learning

Have you tried a slot machine and won amazing rewards? You may have noticed that luck sometimes favors you, while at other times you win only a few cents or lose entirely. Have you ever wondered how to find the ‘lucky machine’ that rewards you with plenty of money? This problem is known as the multi-armed bandit problem, and the optimal approach employed to solve it is the UCB, or upper confidence bound, algorithm.

This article will detail what the multi-armed bandit problem is, the UCB algorithm, the math behind it, and its research applications. A prerequisite is a basic understanding of reinforcement learning, including concepts such as agent, environment, policy, action, state, and reward.

Exploration-exploitation trade-off
When deciding which state to choose next, an agent faces a trade-off between exploration and exploitation. Exploration involves choosing new states that the agent hasn’t chosen, or has chosen fewer times, so far. Exploitation involves making a decision about the next state based on its experience so far.

At a given timestep, should the agent make a decision based on exploration or exploitation? This is the trade-off. The concept is the foundation for solving the multi-armed bandit problem or, for that matter, any problem solved using reinforcement learning.

Multi-armed bandit

[Image: a player pulling several slot machines (multi-armed bandits)]

Consider the bandits, or slot machines, at a casino whose handles you can pull to win rewards. The above picture is a simple illustration to visualize the game. Whenever you pull one of the machines by its handle, you either win or lose. Note that you play with only one machine at a time.

Correlating the problem to reinforcement learning, the player is the agent here. Assume that different machines have different properties, although this isn’t necessarily the case in real life. In this setup, one machine may give very good rewards while another may not.

How do you know which machine is the best? This is important for a player, as the aim is to maximize the reward. You can find this out by playing with every bandit in the casino. As illustrated in the above picture, the player has multiple arms playing with multiple machines, indicating that the agent should play with all the machines. This enables the agent to learn each machine’s properties.


An application of the multi-armed bandit problem is clinical trials: since all drug versions can’t be tested exhaustively, the trial must identify and focus on the most promising one.


Consider that there are four machines: A, B, C and D. The agent randomly chooses them and plays. The results are:

Timestep | Machine played | Result
1        | A              | Win
2        | B              | Loss
3        | A              | Loss
4        | C              | Win
5        | D              | Loss
6        | C              | Loss

In the table above, the agent played with A and won. Exploring, the agent played with B and lost. Exploiting, it again played with A as it had won before; however, it lost. Exploring, it played with a new machine, C, and won. Exploring again, it played with a new machine, D, and lost. Exploiting, it returned to C but lost again. This process is repeated until the agent maximizes its reward.

The math behind the multi-armed bandit
Let the number of machines/bandits = n.

Each machine i returns rewards y_i drawn from a probability distribution P(y; θ_i), where θ_i is the parameter of that machine’s reward distribution.

The machines provide rewards based on these probabilities. The goal is to maximize the reward.
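As a minimal sketch of this setup, here is a simulated bandit environment in Python, assuming Bernoulli rewards (each θ_i is simply machine i’s win probability); the class and names are illustrative, not from the article:

```python
import random

class BernoulliBandit:
    """n slot machines; machine i pays out 1 with probability thetas[i], else 0."""

    def __init__(self, thetas):
        self.thetas = thetas              # one theta_i per machine, hidden from the agent

    def pull(self, i):
        """Play machine i once and return its reward y_i."""
        return 1 if random.random() < self.thetas[i] else 0

# Example: four machines with different (hidden) win probabilities.
bandit = BernoulliBandit([0.2, 0.5, 0.7, 0.1])
reward = bandit.pull(2)                   # the agent only ever observes these rewards
```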

Introduction to UCB

Let’s formulate the multi-armed bandit problem this way:

Let t = current timestep.
a_t is the decision or action taken by the agent at timestep t (action here refers to choosing one of the n machines and playing it). For instance, a_1 means the action taken at timestep t=1.
y_t is the reward or outcome at timestep t.
policy π: [(a_1, y_1), (a_2, y_2), ..., (a_(t-1), y_(t-1))] -> a_t

The policy maps everything the agent has learned so far to the new decision. This is why timesteps from t=1 to t-1 have been considered here.
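In code, a policy is just a function from the history of (action, reward) pairs observed so far to the next action. A trivial random policy might look like this (an illustrative sketch, not from the article):

```python
import random

def random_policy(history, n_machines):
    """history: list of (a_t, y_t) pairs seen so far; returns the next action a_t."""
    return random.randrange(n_machines)
```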

Problem:

Find a policy π that

Case-1: max_π ∑ y_t (sum over t = 1 to T)

or

Case-2: max_π y_T

where T is the final timestep.

Case-1 finds a policy that maximizes the sum of all the rewards (as you can infer from the ∑ symbol). Case-2 finds a policy that maximizes the reward obtained in the final step alone.

In case-2, the agent need not care about intermediate rewards, as the goal is to optimize only the final reward. Thus, in case-2, the agent can explore and learn as much as possible. However, in case-1, the agent must collect as many rewards as possible along the way.

So, which case should be chosen? This is where the UCB algorithm comes in.

UCB1 algorithm
The ‘1’ here simply distinguishes this basic version from variants such as UCB1-Tuned and UCB1-Normal, discussed later.

Step-1: Initialization
Play each machine once to have minimum knowledge of each machine and to avoid calculating ln 0 in the formula in step-2.

Step-2: Repeat
Play the machine i that maximizes

x̄_i + β √(2 ln n / n_i)

until the stopping criterion is reached.

END
Look at what is maximized in step-2:

x̄_i = average reward of machine i so far

n_i = how often machine i has been played so far

n = ∑ n_i = total number of rounds played so far

β is taken as 1 (mostly β = 0.99, approximately considered as 1).

The first term, x̄_i, is the exploitative term; relying on it alone, the agent won’t learn much about the other machines. The second term, β √(2 ln n / n_i), is the explorative term. It reduces entropy (uncertainty) because, unlike the first term, it allows the agent to play the less-chosen machines, so that it can better estimate whether the current best machine is indeed the best.
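Putting the two steps together, a bare-bones UCB1 loop might look like the sketch below. It assumes β = 1 and reuses the BernoulliBandit environment from the earlier snippet; the function and variable names are illustrative, not from the article:

```python
import math

def ucb1(bandit, n_machines, total_rounds, beta=1.0):
    counts = [0] * n_machines      # n_i: how often machine i has been played
    sums = [0.0] * n_machines      # running sum of rewards from machine i

    # Step-1: play each machine once so counts[i] > 0 (avoids ln 0 and division by zero).
    for i in range(n_machines):
        sums[i] += bandit.pull(i)
        counts[i] += 1

    # Step-2: repeatedly play the machine with the highest upper confidence bound.
    for _ in range(n_machines, total_rounds):
        n = sum(counts)            # total number of rounds played so far
        ucb_values = [
            sums[i] / counts[i] + beta * math.sqrt(2 * math.log(n) / counts[i])
            for i in range(n_machines)
        ]
        best = max(range(n_machines), key=lambda i: ucb_values[i])
        sums[best] += bandit.pull(best)
        counts[best] += 1

    # Return how often each machine was played and its estimated average reward.
    return counts, [sums[i] / counts[i] for i in range(n_machines)]
```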

Example problem for UCB1

Let’s understand the UCB algorithm with the help of an example. Consider that there are three bandits being played, and their corresponding rewards so far are as follows:

Bandit-1: 8, 7, 12, 13, 11, 9
Bandit-2: 8, 12
Bandit-3: 5, 13

Question:
At the last timestep, which bandit should the player play to maximize their reward?

Solution:

The UCB algorithm can be applied as follows:


Total number of rounds played so far (n) = No. of times Bandit-1 was played + No. of times Bandit-2 was played + No. of times Bandit-3 was played.
So, n = 6 + 2 + 2 = 10 => n = 10

For Bandit-1,

It has been played 6 times. So,

n_1 = 6

x̄_1 = (8+7+12+13+11+9)/6 = 10

β √(2 ln 10 / 6) ≈ 0.876

So, the value of step-2 is approximately 10.876.

For Bandit-2,

It has been played 2 times. So,

n_2 = 2

x̄_2 = (8+12)/2 = 10

β √(2 ln 10 / 2) ≈ 1.517

So, the value of step-2 is approximately 11.517.
For Bandit-3,

It has been played 2 times. So,

n_3 = 2

x̄_3 = (5+13)/2 = 9

β √(2 ln 10 / 2) ≈ 1.517

So, the value of step-2 is approximately 10.517.

Time t is the last timestep. There won’t be any further exploration, so check where exploitation is maximum.

Bandit-1 and Bandit-2 have the same, and the highest, exploitation value (10). Which one should be chosen? Considering entropy, it is Bandit-1, as it was played a greater number of times than Bandit-2. Since it has less entropy, its estimated value is more likely to be correct and is more expressive.

Hence, at the last timestep, the agent should play Bandit-1.
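You can reproduce the numbers above in a few lines of Python (the reward lists are taken from the example; everything else is an illustrative sketch):

```python
import math

rewards = {
    "Bandit-1": [8, 7, 12, 13, 11, 9],
    "Bandit-2": [8, 12],
    "Bandit-3": [5, 13],
}
n = sum(len(r) for r in rewards.values())        # 10 rounds played so far

for name, r in rewards.items():
    mean = sum(r) / len(r)                        # exploitation term
    bonus = math.sqrt(2 * math.log(n) / len(r))   # exploration term (beta = 1)
    print(name, round(mean, 3), round(bonus, 3), round(mean + bonus, 3))

# Prints approximately:
# Bandit-1 10.0 0.876 10.876
# Bandit-2 10.0 1.517 11.517
# Bandit-3  9.0 1.517 10.517
```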

Let’s look at the two variants of the UCB1 algorithm: UCB1-Tuned and UCB1-Normal. These variations are obtained by changing the value of C, where ‘C’ denotes the square-root (exploration) term in step-2 of the UCB1 algorithm.

UCB1-Tuned
For UCB1-Tuned,

C = √( (log N / n) × min(1/4, V(n)) )

where V(n) is an upper confidence bound on the variance of the bandit, i.e.

V(n) = Σ(x_i² / n) − (Σ x_i / n)² + √(2 log N / n)

x_i are the rewards gained from this bandit so far, n is the number of times this bandit has been played, and N is the total number of rounds played so far.
UCB1-Tuned is known to have outperformed
UCB1.
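The formula translates directly into Python; the sketch below only computes the exploration term C for a single bandit (the function name and signature are illustrative):

```python
import math

def ucb1_tuned_bonus(rewards_i, N):
    """Exploration term C of UCB1-Tuned for one bandit.

    rewards_i: rewards x_i obtained from this bandit so far
    N:         total number of rounds played across all bandits
    """
    n = len(rewards_i)
    mean = sum(rewards_i) / n
    # V(n): empirical variance of the bandit plus its own confidence bonus.
    v = sum(x * x for x in rewards_i) / n - mean ** 2 + math.sqrt(2 * math.log(N) / n)
    return math.sqrt((math.log(N) / n) * min(0.25, v))
```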

UCB1-Normal
The term ‘normal’ in the name of the algorithm refers to the normal distribution. The UCB1-Normal algorithm is designed for rewards drawn from normal distributions. The value of C in UCB1-Normal is based on the sample variance:

C = √( 16 SV(n) log(N − 1) / n )

where the sample variance is

SV(n) = ( Σ x_i² − n (Σ x_i / n)² ) / (n − 1)

and x_i are the rewards the agent got from this bandit so far.

Note that to calculate C, each machine or bandit must be played a minimum of two times to avoid division by zero, so each bandit is played twice as an initialization step. At each round N, the algorithm checks whether there is a bandit that has been played fewer than the ceiling of 8 log N times. If it finds one, the player plays that bandit.
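A sketch of that play rule plus the sample-variance bonus, again with illustrative names and assuming each bandit has already been played at least twice:

```python
import math

def ucb1_normal_bonus(rewards_i, N):
    """Exploration term C of UCB1-Normal for one bandit (needs len(rewards_i) >= 2)."""
    n = len(rewards_i)
    mean = sum(rewards_i) / n
    sample_var = (sum(x * x for x in rewards_i) - n * mean ** 2) / (n - 1)
    # max(..., 0.0) guards against tiny negative values from floating-point error.
    return math.sqrt(16 * max(sample_var, 0.0) * math.log(N - 1) / n)

def pick_bandit_ucb1_normal(reward_lists, N):
    """reward_lists[i] holds the rewards seen from bandit i so far; N is the current round."""
    # If some bandit has been played fewer than ceil(8 * log N) times, play it first.
    for i, r in enumerate(reward_lists):
        if len(r) < math.ceil(8 * math.log(N)):
            return i
    # Otherwise play the bandit with the highest mean reward plus C.
    return max(
        range(len(reward_lists)),
        key=lambda i: sum(reward_lists[i]) / len(reward_lists[i])
                      + ucb1_normal_bonus(reward_lists[i], N),
    )
```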

Research applications of the UCB algorithm

Researchers have proposed strategies based on UCB to maximize energy efficiency, minimize delay, and develop network routing algorithms that optimize throughput and delay. A recent research project includes route and power optimization for Software Defined Networking (SDN)-enabled IIoT (Industrial IoT) in a smart grid, taking into account the EMI (electromagnetic interference) coming from the electrical devices in the grid.
