Upper Confidence Bound Algorithm in Reinforcement Learning

In reinforcement learning, the agent or decision-maker generates its training data by interacting with the world. The agent must learn the consequences of its actions through trial and error, rather than being explicitly told the correct action.
Multi-Armed Bandit Problem
In reinforcement learning, the multi-armed bandit problem formalizes the notion of decision-making under uncertainty using k-armed bandits. A decision-maker, or agent, chooses between k different actions and receives a reward based on the action it chooses. The bandit problem is used to introduce fundamental concepts in reinforcement learning, such as rewards, timesteps, and values.

Consider a slot machine, also known as a bandit, with two levers. We assume that each lever has its own distribution of rewards and that at least one lever yields the maximum reward.
The reward distribution corresponding to each lever is different and is unknown to the gambler (the decision-maker). Hence, the goal is to identify which lever to pull to obtain the maximum reward after a given number of trials.
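As a minimal sketch of this setup (the class name and Gaussian reward distributions are illustrative assumptions, not from the article), each lever can be modelled as a reward distribution whose mean is hidden from the agent:

```python
import numpy as np

class GaussianBandit:
    """A k-armed bandit: each lever has its own hidden reward distribution."""

    def __init__(self, k, seed=None):
        self.rng = np.random.default_rng(seed)
        self.means = self.rng.normal(0.0, 1.0, k)   # unknown to the agent

    def pull(self, lever):
        # Reward is a noisy sample around that lever's hidden mean.
        return self.rng.normal(self.means[lever], 1.0)

# Example: a two-lever bandit, as in the slot-machine picture.
bandit = GaussianBandit(k=2, seed=0)
reward = bandit.pull(0)
```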
For Example:

Imagine an online advertising trial where an advertiser wants to measure the click-through rate of three different ads for the same product. Whenever a user visits the website, the advertiser displays an ad at random. The advertiser then monitors whether the user clicks on the ad or not. After a while, the advertiser notices that one ad seems to be working better than the others. The advertiser must now decide between sticking with the best-performing ad or continuing with the randomized study.
If the advertiser displays only one ad, they can no longer collect data on the other two. Perhaps one of the other ads is actually better and only appears worse due to chance. On the other hand, if the other two ads really are worse, continuing the study hurts the overall click-through rate. This advertising trial exemplifies decision-making under uncertainty.
In this example, the role of the agent is played by the advertiser, who has to choose between three different actions: displaying the first, second, or third ad. Each ad is an action, and choosing an ad yields some unknown reward. The profit the advertiser earns after showing the ad is the reward the advertiser receives.
Action-Values:
For the advertiser to decide which action is best, we must define the value of taking each action. We define these values with the action-value function, using the language of probability. The value of selecting an action a, denoted q*(a), is defined as the expected reward Rt we receive when taking action a from the possible set of actions:

q*(a) = E[Rt | At = a]

The goal of the agent is to maximize the expected reward by selecting the action that has the highest action-value.
Action-value Estimate:
Since the value of selecting an action, q*(a), is not known to the agent, we use the sample-average method to estimate it:

Qt(a) = (sum of rewards received when a was taken prior to t) / (number of times a was taken prior to t)
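As a minimal sketch (not from the original article), the sample-average estimate can be maintained incrementally; the class and attribute names below are illustrative assumptions:

```python
import numpy as np

class SampleAverageEstimator:
    """Maintains sample-average action-value estimates Q(a) for k actions."""

    def __init__(self, k):
        self.q = np.zeros(k)   # current estimates Q(a)
        self.n = np.zeros(k)   # number of times each action was taken

    def update(self, action, reward):
        # Incremental form of the sample average:
        # Q_{n+1} = Q_n + (1/n) * (R_n - Q_n)
        self.n[action] += 1
        self.q[action] += (reward - self.q[action]) / self.n[action]

# Example: three ads, update the estimate for ad 0 after a click (reward 1).
estimator = SampleAverageEstimator(k=3)
estimator.update(0, 1.0)
```

The incremental form avoids storing every past reward while producing exactly the same estimate as averaging them directly.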
Exploration vs Exploitation:
 Greedy Action: The agent chooses the action that currently has the largest estimated value. By choosing the greedy action, the agent exploits its current knowledge.
 Non-Greedy Action: The agent does not choose the action with the largest estimated value, sacrificing immediate reward in the hope of gaining more information about the other actions.
 Exploration: Allows the agent to improve its knowledge about each action, hopefully leading to a long-term benefit.
 Exploitation: Allows the agent to choose the greedy action to get the most reward in the short term. Purely greedy action selection can lead to sub-optimal behaviour.
A dilemma arises between exploration and exploitation because an agent cannot explore and exploit at the same time. The Upper Confidence Bound algorithm is one way to resolve this exploration-exploitation dilemma.
Upper Confidence Bound Action Selection:
Upper-Confidence Bound action selection uses uncertainty in the action-value estimates to balance exploration and exploitation. Because the estimates are computed from a sampled set of rewards, there is inherent uncertainty in their accuracy, and UCB uses this uncertainty to drive exploration:

At = argmax_a [ Qt(a) + c * sqrt( ln t / Nt(a) ) ]

Here Qt(a) represents the current estimate for action a at time t, Nt(a) is the number of times action a has been selected prior to time t, and c > 0 controls the degree of exploration. We select the action that has the highest estimated action-value plus the upper-confidence-bound exploration term.
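This selection rule can be sketched in a few lines of Python; the function and argument names are assumptions for illustration, not code from the article:

```python
import numpy as np

def ucb_select(q, n, t, c=2.0):
    """Select the action maximizing Q_t(a) + c * sqrt(ln t / N_t(a)).

    q: array of current action-value estimates Q_t(a)
    n: array of selection counts N_t(a)
    t: current timestep (1-based)
    c: exploration coefficient
    """
    # An action that has never been taken has maximal uncertainty,
    # so it is tried before any action is repeated.
    untried = np.flatnonzero(n == 0)
    if untried.size > 0:
        return int(untried[0])
    return int(np.argmax(q + c * np.sqrt(np.log(t) / n)))
```

Calling ucb_select(q, n, t) at each timestep and then updating q and n with the observed reward (for example with the sample-average update above) gives the full action-selection loop.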
Q(A) represents the current action-value estimate for action A, and the brackets around it represent a confidence interval around q*(A), which says that we are confident the actual action-value of action A lies somewhere in this region.
The lower bracket is called the lower bound, and the upper bracket is the upper bound. The region between the brackets is the confidence interval, which represents the uncertainty in the estimate. If the region is very small, we are very certain that the actual value of action A is near our estimate. On the other hand, if the region is large, we are uncertain that the value of action A is near our estimate.
The Upper Confidence Bound follows the principle of optimism in the face of uncertainty, which implies that if we are uncertain about an action, we should optimistically assume that it is the correct action.
For example, suppose we have four actions, each with an associated uncertainty, and our agent has no idea which is the best one. According to the UCB algorithm, it will optimistically pick the action with the highest upper bound, say action A. By doing this, either A really does have the highest value and the agent gets the highest reward, or the agent learns about the action it knows the least about.

Let's assume that after selecting action A we end up in a new situation. This time UCB will select action B, since Q(B) now has the highest upper confidence bound: its action-value estimate is the highest, even though its confidence interval is small.
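To make these two selections concrete, here is a small numeric sketch; all estimates and counts below are made-up, hypothetical values:

```python
import numpy as np

q = np.array([0.2, 1.0, 0.5, 0.4])   # hypothetical estimates for A, B, C, D
n = np.array([1, 20, 10, 12])        # A has barely been tried
t = int(n.sum())
c = 2.0

bounds = q + c * np.sqrt(np.log(t) / n)
print(dict(zip("ABCD", bounds.round(2))))
# A's bound dominates despite its low estimate, so UCB picks A.
# Once A has been pulled enough times, its bonus shrinks and the
# action with the highest estimate (B here) is selected instead.
```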

Initially, UCB explores more in order to systematically reduce uncertainty, but its exploration decreases over time. As a result, UCB tends to obtain greater reward on average than other algorithms such as epsilon-greedy and optimistic initial values.
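To make the comparison concrete, here is a small, self-contained simulation sketch (not from the original article); the 10-armed Gaussian bandit, the epsilon value of 0.1, and the exploration coefficient c = 2 are all illustrative assumptions:

```python
import numpy as np

def run_bandit(policy, k=10, steps=1000, seed=0):
    """Run one episode on a k-armed Gaussian bandit; return the average reward."""
    rng = np.random.default_rng(seed)
    true_values = rng.normal(0.0, 1.0, k)   # hidden q*(a) for each arm
    q = np.zeros(k)                          # sample-average estimates
    n = np.zeros(k)
    total = 0.0
    for t in range(1, steps + 1):
        a = policy(q, n, t, rng)
        reward = rng.normal(true_values[a], 1.0)
        n[a] += 1
        q[a] += (reward - q[a]) / n[a]       # incremental sample average
        total += reward
    return total / steps

def ucb_policy(q, n, t, rng, c=2.0):
    if np.any(n == 0):
        return int(np.argmin(n))             # try each arm at least once
    return int(np.argmax(q + c * np.sqrt(np.log(t) / n)))

def eps_greedy_policy(q, n, t, rng, eps=0.1):
    if rng.random() < eps:
        return int(rng.integers(len(q)))     # explore uniformly at random
    return int(np.argmax(q))

print("UCB average reward        :", round(run_bandit(ucb_policy), 3))
print("eps-greedy average reward :", round(run_bandit(eps_greedy_policy), 3))
```

On many runs UCB ends up with the higher average reward, since its exploration bonus shrinks as each arm's count grows, whereas epsilon-greedy keeps exploring at a fixed rate.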
