A Handy Guide To UCB Algorithm in Reinforcement Learning.
1/8/24, 14:31
Exploration-exploitation trade-off
When deciding which state to choose next, an agent faces a trade-off between exploration and exploitation. Exploration means choosing new states that the agent has not chosen before, or has chosen only a few times so far. Exploitation means choosing the next state based on the experience the agent has accumulated so far.
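A minimal way to see the trade-off in code is the ε-greedy rule (a sketch; the reward estimates and the 0.1 exploration rate below are made-up values, not from the article): with a small probability the agent explores a random arm, otherwise it exploits the arm with the best estimate so far.

```python
import random

def epsilon_greedy(estimates, epsilon=0.1):
    """Pick an arm: explore a random arm with probability epsilon,
    otherwise exploit the arm with the highest estimated reward."""
    if random.random() < epsilon:
        return random.randrange(len(estimates))        # exploration
    return max(range(len(estimates)), key=lambda i: estimates[i])  # exploitation

# With epsilon = 0 the agent always exploits: here it picks arm 2.
choice = epsilon_greedy([1.0, 0.5, 2.0], epsilon=0.0)
```

UCB, introduced below, replaces the fixed random exploration rate with a bonus that shrinks as an arm is played more often.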
Multi-armed bandit
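The multi-armed bandit setting can be sketched as a row of slot machines, each paying out from its own fixed but unknown reward distribution (the arm means and spread below are invented for illustration):

```python
import random

class Bandit:
    """One slot machine ('arm') with a fixed but unknown mean payout."""
    def __init__(self, mean, spread=1.0):
        self.mean = mean
        self.spread = spread

    def pull(self):
        # Reward drawn around the arm's true mean; the agent only sees samples.
        return random.gauss(self.mean, self.spread)

# Three arms with hidden means; the agent's task is to find the best one.
arms = [Bandit(1.0), Bandit(2.5), Bandit(0.5)]
rewards = [arm.pull() for arm in arms]
```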
Introduction to UCB
Problem: find a policy that maximizes the reward. Two objectives are commonly considered.

Case-1: maximize the expected total reward accumulated over all steps,
max E[ Σ_t r_t ]

or

Case-2: maximize the expected reward at the final step,
max E[ r_T ]

(E is the expectation symbol). Case-2 finds a policy to maximize the reward obtained in the final step alone.
UCB1 algorithm
Ignore the '1' here; it is simply part of the algorithm's name.
Step-1: Initialization
Play each machine once, to have minimum knowledge of each machine and to avoid calculating ln 0 in the formula in Step-2.

Step-2: Repeat
Play the machine i that maximizes
x̄_i + √(2 ln N / n_i)
where x̄_i is the average reward obtained from machine i so far, n_i is the number of times machine i has been played, and N is the total number of plays across all machines.
Until the stopping condition (for example, a fixed budget of plays) is reached.
END
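The two steps above can be sketched directly in Python. Only the selection rule comes from UCB1; the simulated reward draws and the play budget are assumptions for this sketch.

```python
import math
import random

def ucb1_select(avg_rewards, counts):
    """Step-2 of UCB1: pick the machine maximizing
    average reward + sqrt(2 * ln N / n_i)."""
    N = sum(counts)  # total plays so far
    scores = [avg + math.sqrt(2 * math.log(N) / n)
              for avg, n in zip(avg_rewards, counts)]
    return scores.index(max(scores))

def run_ucb1(pulls, total_plays):
    """pulls[i]() returns a reward from machine i."""
    k = len(pulls)
    counts = [1] * k
    sums = [pulls[i]() for i in range(k)]   # Step-1: play each machine once
    for _ in range(total_plays - k):        # Step-2: repeat until budget spent
        avgs = [s / n for s, n in zip(sums, counts)]
        i = ucb1_select(avgs, counts)
        sums[i] += pulls[i]()
        counts[i] += 1
    return counts

random.seed(0)
counts = run_ucb1([lambda: random.gauss(1.0, 0.1),
                   lambda: random.gauss(5.0, 0.1)], total_plays=200)
```

As the budget grows, the play counts concentrate on the machine with the higher true mean, while the shrinking √(2 ln N / n_i) bonus still forces occasional visits to the others.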
Look at what is maximized in Step-2. Suppose there are three bandits and N = 10 plays have been made so far.

For Bandit-1, it has been played 6 times and its average reward so far is 10. So,
n_1 = 6
√(2 ln 10 / 6) ≈ 0.876
So, the value of Step-2 is approximately 10 + 0.876 = 10.876.

For Bandit-2, it has been played 2 times. So,
n_2 = 2
Its average reward is (8 + 12)/2 = 10, and
√(2 ln 10 / 2) ≈ 1.517
So, the value of Step-2 is approximately 10 + 1.517 = 11.517.

For Bandit-3, the value is computed in the same way, and the machine with the largest Step-2 value is played next.
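The exploration terms in this worked example can be checked in a couple of lines (assuming N = 10 total plays, which matches the numbers shown):

```python
import math

# Bandit-1: played 6 times out of N = 10 total plays
bonus_1 = math.sqrt(2 * math.log(10) / 6)   # ≈ 0.876
# Bandit-2: played 2 times
bonus_2 = math.sqrt(2 * math.log(10) / 2)   # ≈ 1.517

value_1 = 10 + bonus_1  # ≈ 10.876
value_2 = 10 + bonus_2  # ≈ 11.517
```

Both bandits have the same average reward, yet the less-played Bandit-2 gets the larger bonus and is chosen next; this is exactly how UCB balances exploration against exploitation.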
UCB1-Tuned
For UCB1-Tuned,
C = √( (log N / n) × min(1/4, V(n)) )
where V(n) is an upper confidence bound on the variance of the bandit, i.e.
V(n) = Σ(x_i² / n) − (Σ x_i / n)² + √(2 log N / n)
and x_i are the rewards gained from the bandit so far.
UCB1-Tuned is known to have outperformed
UCB1.
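The two expressions can be translated into code as written above (a sketch assuming natural log; the sample rewards 8 and 12 are reused from the earlier example for illustration):

```python
import math

def v_upper(rewards, N):
    """V(n): mean of squares - squared mean + sqrt(2 log N / n)."""
    n = len(rewards)
    mean = sum(rewards) / n
    mean_sq = sum(x * x for x in rewards) / n
    return mean_sq - mean ** 2 + math.sqrt(2 * math.log(N) / n)

def ucb1_tuned_bonus(rewards, N):
    """C = sqrt((log N / n) * min(1/4, V(n)))."""
    n = len(rewards)
    return math.sqrt((math.log(N) / n) * min(0.25, v_upper(rewards, N)))

bonus = ucb1_tuned_bonus([8, 12], N=10)
```

For these rewards the variance term is capped at 1/4, giving a bonus of about 0.536, well below plain UCB1's √(2 ln 10 / 2) ≈ 1.517; this tighter bonus is one intuition for why UCB1-Tuned tends to outperform UCB1.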
UCB1-Normal
The term 'normal' in the algorithm's name refers to the normal (Gaussian) distribution: the UCB1-Normal algorithm is designed for rewards drawn from normal distributions.

The value of C in UCB1-Normal is based on the sample variance:
C = √( 16 SV(n) log(N − 1) / n )
where the sample variance is
SV(n) = ( Σ x_i² − n (Σ x_i / n)² ) / (n − 1)
and x_i are the rewards the agent has received from the bandit so far.
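A direct transcription of the two expressions (a sketch; the sample rewards and N = 10 below are illustrative values, not from the article):

```python
import math

def sample_variance(rewards):
    """SV(n) = (sum of x_i^2 - n * mean^2) / (n - 1)."""
    n = len(rewards)
    mean = sum(rewards) / n
    return (sum(x * x for x in rewards) - n * mean ** 2) / (n - 1)

def ucb1_normal_bonus(rewards, N):
    """C = sqrt(16 * SV(n) * log(N - 1) / n)."""
    n = len(rewards)
    return math.sqrt(16 * sample_variance(rewards) * math.log(N - 1) / n)

c = ucb1_normal_bonus([1.0, 2.0, 3.0], N=10)
```

Unlike UCB1, the bonus here scales with the observed spread of the rewards: a bandit whose payouts vary widely keeps a larger confidence bound than one that pays out consistently.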
Research applications of the UCB algorithm

Researchers have proposed strategies based on UCB to maximize energy efficiency, minimize delay, and develop network routing algorithms that optimize throughput and delay. One recent research project applies it to route and power optimization for Software-Defined Networking (SDN)-enabled IIoT (Industrial IoT) in a smart grid, while accounting for the EMI (electromagnetic interference) coming from the electrical devices in the grid.