Know all About Upper Confidence Bound Algorithm in Reinforcement Learning

Have you tried a slot machine and won amazing rewards? You may have noticed that luck sometimes favors you, while at other times you win only a few cents or lose entirely. Have you ever wondered how to find the ‘lucky machine’ that rewards you with plenty of money? This problem is known as the multi-armed bandit problem, and the optimal approach employed to solve it is the UCB, or upper confidence bound, algorithm.

This article will detail what the multi-armed bandit problem is, the UCB algorithm, the math behind it, and its research applications. A prerequisite is a basic understanding of reinforcement learning, including concepts such as agent, environment, policy, action, state, and reward.

Exploration-exploitation trade-off
When deciding which state to choose next, an agent faces a trade-off between exploration and exploitation. Exploration involves choosing new states that the agent hasn’t chosen, or has chosen fewer times, so far. Exploitation involves making a decision about the next state based on its experience so far.

At a given timestep, should the agent make a decision based on exploration or exploitation? This is the trade-off. The concept is the foundation for solving the multi-armed bandit problem or, for that matter, any problem solved using reinforcement learning.

Multi-armed bandit

[Image: a player pulling several slot machines (multi-armed bandits)]

Consider the bandits, or slot machines, at a casino whose handles you can pull to win rewards. The above picture is a simple illustration to visualize the game. Whenever you pull one of the machines by its handle, you either win or lose. Note that you play with only one machine at a time.

Correlating the problem to reinforcement learning, the player is the agent here. Assume that different machines have different properties, although this isn’t necessarily the case in real life. In this setup, one machine may give very good rewards while another may not.

How do you know which machine is the best? This is important for a player, as the aim is to maximize the reward. You can find this out by playing with every bandit in the casino. As illustrated in the above picture, the player has multiple arms playing with multiple machines, indicating that the agent should play with all the machines. This enables the agent to learn each machine’s properties.


An application of the multi-armed bandit problem is clinical trials: since all drug versions can’t be tested exhaustively, the trial must identify and focus on the most promising one.


Consider that there are four machines: A, B, C and D. The agent randomly chooses them and plays. The results are:

Timestep | Machine played | Result
1        | A              | Win
2        | B              | Loss
3        | A              | Loss
4        | C              | Win
5        | D              | Loss
6        | C              | Loss

In the table above, the agent played with A and won. Exploring, the agent played with B and lost. Exploiting, it again played with A as it had won before; however, it lost. Exploring, it played with a new machine, C, and won. Exploring again, it played with a new machine, D, and lost. Exploiting, it returned to C but lost again. This process is repeated until the agent maximizes its reward.

The math behind the multi-armed bandit
Let the number of machines/bandits = n.

Each machine i returns rewards y_i drawn from a probability distribution P(y; θ_i), where θ_i is the parameter of that machine’s reward distribution.

The machines provide rewards based on these probabilities. The goal is to maximize the reward.
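As a minimal sketch of this setup, here is a simulated bandit environment in Python, assuming Bernoulli rewards (each θ_i is simply machine i’s win probability); the class and names are illustrative, not from the article:

```python
import random

class BernoulliBandit:
    """n slot machines; machine i pays out 1 with probability thetas[i], else 0."""

    def __init__(self, thetas):
        self.thetas = thetas              # one theta_i per machine, hidden from the agent

    def pull(self, i):
        """Play machine i once and return its reward y_i."""
        return 1 if random.random() < self.thetas[i] else 0

# Example: four machines with different (hidden) win probabilities.
bandit = BernoulliBandit([0.2, 0.5, 0.7, 0.1])
reward = bandit.pull(2)                   # the agent only ever observes these rewards
```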

Introduction to UCB

Let’s formulate the multi-armed bandit problem this way:

Let t = current timestep.
a_t is the decision or action taken by the agent at timestep t (action here refers to choosing one of the n machines and playing it). For instance, a_1 means the action taken at timestep t=1.
y_t is the reward or outcome at timestep t.
policy π: [(a_1, y_1), (a_2, y_2), ..., (a_(t-1), y_(t-1))] -> a_t

The policy maps everything the agent has learned so far to the new decision. This is why timesteps from t=1 to t-1 have been considered here.
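In code, a policy is just a function from the history of (action, reward) pairs observed so far to the next action. A trivial random policy might look like this (an illustrative sketch, not from the article):

```python
import random

def random_policy(history, n_machines):
    """history: list of (a_t, y_t) pairs seen so far; returns the next action a_t."""
    return random.randrange(n_machines)
```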

Problem:

Find a policy π that

Case-1: max_π ∑ y_t (sum over t = 1 to T)

or

Case-2: max_π y_T

where T is the final timestep.

Case-1 finds a policy that maximizes the sum of all the rewards (as you can infer from the ∑ symbol). Case-2 finds a policy that maximizes the reward obtained in the final step alone.

In case-2, the agent need not care about intermediate rewards, as the goal is to optimize only the final reward. Thus, in case-2, the agent can explore and learn as much as possible. However, in case-1, the agent must collect as many rewards as possible along the way.

So, which case should be chosen? This is where the UCB algorithm comes in.

UCB1 algorithm
The ‘1’ here simply distinguishes this basic version from variants such as UCB1-Tuned and UCB1-Normal, discussed later.

Step-1: Initialization
Play each machine once to have minimum knowledge of each machine and to avoid calculating ln 0 in the formula in step-2.

Step-2: Repeat
Play the machine i that maximizes

x̄_i + β √(2 ln n / n_i)

until the stopping criterion is reached.

END
Look at what is maximized in step-2:

x̄_i = average reward of machine i so far

n_i = how often machine i has been played so far

n = ∑ n_i = total number of rounds played so far

β is taken as 1 (mostly β = 0.99, approximately considered as 1).

The first term, x̄_i, is the exploitative term; relying on it alone, the agent won’t learn much about the other machines. The second term, β √(2 ln n / n_i), is the explorative term. It reduces entropy (uncertainty) because, unlike the first term, it allows the agent to play the less-chosen machines, so that it can better estimate whether the current best machine is indeed the best.
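Putting the two steps together, a bare-bones UCB1 loop might look like the sketch below. It assumes β = 1 and reuses the BernoulliBandit environment from the earlier snippet; the function and variable names are illustrative, not from the article:

```python
import math

def ucb1(bandit, n_machines, total_rounds, beta=1.0):
    counts = [0] * n_machines      # n_i: how often machine i has been played
    sums = [0.0] * n_machines      # running sum of rewards from machine i

    # Step-1: play each machine once so counts[i] > 0 (avoids ln 0 and division by zero).
    for i in range(n_machines):
        sums[i] += bandit.pull(i)
        counts[i] += 1

    # Step-2: repeatedly play the machine with the highest upper confidence bound.
    for _ in range(n_machines, total_rounds):
        n = sum(counts)            # total number of rounds played so far
        ucb_values = [
            sums[i] / counts[i] + beta * math.sqrt(2 * math.log(n) / counts[i])
            for i in range(n_machines)
        ]
        best = max(range(n_machines), key=lambda i: ucb_values[i])
        sums[best] += bandit.pull(best)
        counts[best] += 1

    # Return how often each machine was played and its estimated average reward.
    return counts, [sums[i] / counts[i] for i in range(n_machines)]
```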

Example problem for UCB1

Let’s understand the UCB algorithm with the help of an example. Consider that there are three bandits being played, and their corresponding rewards so far are as follows:

Bandit-1: 8, 7, 12, 13, 11, 9
Bandit-2: 8, 12
Bandit-3: 5, 13

Question:
At the last timestep, which bandit should the player play to maximize their reward?

Solution:

The UCB algorithm can be applied as follows:


Total number of rounds played so far (n) = No. of times Bandit-1 was played + No. of times Bandit-2 was played + No. of times Bandit-3 was played.
So, n = 6 + 2 + 2 = 10 => n = 10

For Bandit-1,

It has been played 6 times. So,

n_1 = 6

x̄_1 = (8+7+12+13+11+9)/6 = 10

β √(2 ln 10 / 6) ≈ 0.876

So, the value of step-2 is approximately 10.876.

For Bandit-2,

It has been played 2 times. So,

n_2 = 2

x̄_2 = (8+12)/2 = 10

β √(2 ln 10 / 2) ≈ 1.517

So, the value of step-2 is approximately 11.517.
For Bandit-3,

It has been played 2 times. So,

n_3 = 2

x̄_3 = (5+13)/2 = 9

β √(2 ln 10 / 2) ≈ 1.517

So, the value of step-2 is approximately 10.517.

Time t is the last timestep. There won’t be any further exploration, so check where exploitation is maximum.

Bandit-1 and Bandit-2 have the same, and the highest, exploitation value (10). Which one should be chosen? Considering entropy, it is Bandit-1, as it was played a greater number of times than Bandit-2. Since it has less entropy, its estimated value is more likely to be correct and is more expressive.

Hence, at the last timestep, the agent should play Bandit-1.
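You can reproduce the numbers above in a few lines of Python (the reward lists are taken from the example; everything else is an illustrative sketch):

```python
import math

rewards = {
    "Bandit-1": [8, 7, 12, 13, 11, 9],
    "Bandit-2": [8, 12],
    "Bandit-3": [5, 13],
}
n = sum(len(r) for r in rewards.values())        # 10 rounds played so far

for name, r in rewards.items():
    mean = sum(r) / len(r)                        # exploitation term
    bonus = math.sqrt(2 * math.log(n) / len(r))   # exploration term (beta = 1)
    print(name, round(mean, 3), round(bonus, 3), round(mean + bonus, 3))

# Prints approximately:
# Bandit-1 10.0 0.876 10.876
# Bandit-2 10.0 1.517 11.517
# Bandit-3  9.0 1.517 10.517
```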

Let’s look at the two variants of the UCB1 algorithm: UCB1-Tuned and UCB1-Normal. These variations are obtained by changing the value of C, where ‘C’ denotes the square-root (exploration) term in step-2 of the UCB1 algorithm.

UCB1-Tuned
For UCB1-Tuned,

C = √( (log N / n) × min(1/4, V(n)) )

where V(n) is an upper confidence bound on the variance of the bandit, i.e.

V(n) = Σ(x_i² / n) − (Σ x_i / n)² + √(2 log N / n)

x_i are the rewards gained from this bandit so far, n is the number of times this bandit has been played, and N is the total number of rounds played so far.
UCB1-Tuned is known to have outperformed
UCB1.
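The formula translates directly into Python; the sketch below only computes the exploration term C for a single bandit (the function name and signature are illustrative):

```python
import math

def ucb1_tuned_bonus(rewards_i, N):
    """Exploration term C of UCB1-Tuned for one bandit.

    rewards_i: rewards x_i obtained from this bandit so far
    N:         total number of rounds played across all bandits
    """
    n = len(rewards_i)
    mean = sum(rewards_i) / n
    # V(n): empirical variance of the bandit plus its own confidence bonus.
    v = sum(x * x for x in rewards_i) / n - mean ** 2 + math.sqrt(2 * math.log(N) / n)
    return math.sqrt((math.log(N) / n) * min(0.25, v))
```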

UCB1-Normal
The term ‘normal’ in the name of the algorithm refers to the normal distribution. The UCB1-Normal algorithm is designed for rewards drawn from normal distributions. The value of C in UCB1-Normal is based on the sample variance:

C = √( 16 SV(n) log(N − 1) / n )

where the sample variance is

SV(n) = ( Σ x_i² − n (Σ x_i / n)² ) / (n − 1)

and x_i are the rewards the agent got from this bandit so far.

Note that to calculate C, each machine or bandit must be played a minimum of two times to avoid division by zero, so each bandit is played twice as an initialization step. At each round N, the algorithm checks whether there is a bandit that has been played fewer than the ceiling of 8 log N times. If it finds one, the player plays that bandit.
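A sketch of that play rule plus the sample-variance bonus, again with illustrative names and assuming each bandit has already been played at least twice:

```python
import math

def ucb1_normal_bonus(rewards_i, N):
    """Exploration term C of UCB1-Normal for one bandit (needs len(rewards_i) >= 2)."""
    n = len(rewards_i)
    mean = sum(rewards_i) / n
    sample_var = (sum(x * x for x in rewards_i) - n * mean ** 2) / (n - 1)
    # max(..., 0.0) guards against tiny negative values from floating-point error.
    return math.sqrt(16 * max(sample_var, 0.0) * math.log(N - 1) / n)

def pick_bandit_ucb1_normal(reward_lists, N):
    """reward_lists[i] holds the rewards seen from bandit i so far; N is the current round."""
    # If some bandit has been played fewer than ceil(8 * log N) times, play it first.
    for i, r in enumerate(reward_lists):
        if len(r) < math.ceil(8 * math.log(N)):
            return i
    # Otherwise play the bandit with the highest mean reward plus C.
    return max(
        range(len(reward_lists)),
        key=lambda i: sum(reward_lists[i]) / len(reward_lists[i])
                      + ucb1_normal_bonus(reward_lists[i], N),
    )
```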

Research applications of the UCB algorithm

Researchers have proposed strategies based on UCB to maximize energy efficiency, minimize delay, and develop network routing algorithms that optimize throughput and delay. A recent research project includes route and power optimization for Software Defined Networking (SDN)-enabled IIoT (Industrial IoT) in a smart grid, taking into account the EMI (electromagnetic interference) coming from the electrical devices in the grid.
