notes
A Multi-Armed Bandit (MAB) problem involves single-step decision-making, where an agent chooses among multiple actions
(arms) to maximize immediate rewards, balancing exploration vs. exploitation. In contrast, Reinforcement Learning (RL) deals
with sequential decision-making, where actions affect future states and rewards, requiring the agent to learn an optimal policy
over time. While MAB problems have no state transitions, RL problems involve Markov Decision Processes (MDPs) with delayed
rewards.
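The exploration vs. exploitation trade-off in a bandit is often handled with an ε-greedy rule. Below is a minimal sketch; the arm means, ε = 0.1, and the step count are made-up values, not taken from these notes.

```python
# Minimal epsilon-greedy agent for a k-armed bandit (illustrative values only).
import numpy as np

rng = np.random.default_rng(0)
true_means = [0.1, 0.5, 0.8]        # hypothetical expected reward of each arm
k, epsilon, steps = len(true_means), 0.1, 10_000

Q = np.zeros(k)                     # estimated value of each arm
N = np.zeros(k)                     # number of pulls of each arm

for _ in range(steps):
    # explore with probability epsilon, otherwise exploit the current best arm
    a = int(rng.integers(k)) if rng.random() < epsilon else int(np.argmax(Q))
    r = rng.normal(true_means[a], 1.0)     # noisy reward sample
    N[a] += 1
    Q[a] += (r - Q[a]) / N[a]              # incremental sample-average update

print(Q)  # the estimates should approach the true arm means
```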
Markov Decision Process (MDP)
A formalization of the sequential decision-making problem.
Actions influence not only immediate rewards but also subsequent situations (delayed reward) → trade-off
between immediate and delayed reward.
Bellman Equation: V^π(s) = Σ_a π(a|s) Σ_{s',r} p(s',r|s,a) [r + γ V^π(s')]
Solving an RL task means finding a policy that achieves a large reward over the long run → finding an optimal
policy.
A policy π is defined to be better than or equal to a policy π' if its expected return (i.e., value) is greater than
or equal to that of π' for all states, namely V^π(s) ≥ V^π'(s) for all s ∈ S.
The greedy policy takes the action that looks best after one step of lookahead according to V^π. By
construction this policy meets the conditions of the policy improvement theorem, hence it is as good as, or
better than, the original policy.
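A sketch of this one-step lookahead, assuming a known tabular model where P[s][a] is a list of (probability, next_state, reward) triples and V is the current state-value estimate; this data layout and gamma = 0.9 are illustrative assumptions.

```python
# One-step greedy policy improvement w.r.t. a state-value function V.
import numpy as np

def greedy_policy(P, V, gamma=0.9):
    """P[s][a]: list of (prob, next_state, reward) triples; V: array of state values."""
    policy = np.zeros(len(P), dtype=int)
    for s in range(len(P)):
        # evaluate each action with one step of lookahead according to V
        q = [sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
             for a in range(len(P[s]))]
        policy[s] = int(np.argmax(q))      # act greedily
    return policy
```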
The Policy Iteration algorithm repeats policy evaluation and policy improvement obtaining a sequence of
monotonically improving policies and value functions, until convergence to an optimal policy and optimal value
function.
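A sketch of Policy Iteration under the same assumed model format (P[s][a] as (probability, next_state, reward) triples); gamma and the evaluation threshold theta are illustrative choices.

```python
# Tabular Policy Iteration: alternate evaluation and greedy improvement until stable.
import numpy as np

def policy_iteration(P, n_actions, gamma=0.9, theta=1e-8):
    n_states = len(P)
    V = np.zeros(n_states)
    policy = np.zeros(n_states, dtype=int)
    while True:
        # policy evaluation: sweep the Bellman expectation backup until convergence
        while True:
            delta = 0.0
            for s in range(n_states):
                v = sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][policy[s]])
                delta = max(delta, abs(v - V[s]))
                V[s] = v
            if delta < theta:
                break
        # policy improvement: make the policy greedy w.r.t. the new V
        stable = True
        for s in range(n_states):
            q = [sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
                 for a in range(n_actions)]
            best = int(np.argmax(q))
            if best != policy[s]:
                stable = False
            policy[s] = best
        if stable:                         # no action changed → optimal policy and V
            return policy, V
```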
The estimate for one state in MC methods does not build upon the estimate of any other state, as is the case in
DP → MC methods do not perform bootstrapping. In MC methods the computational expense of estimating
the value of a single state is independent of the number of states → useful for online estimation or for estimating
the values of a subset of states.
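A sketch of first-visit MC prediction; here an episode is assumed to be a list of (state, reward) pairs, where the reward is the one received after leaving that state, and gamma is an illustrative choice.

```python
# First-visit Monte Carlo prediction of V from sample episodes (no model, no bootstrapping).
from collections import defaultdict

def first_visit_mc(episodes, gamma=0.9):
    """episodes: list of trajectories [(state, reward_after_leaving_state), ...]."""
    V = defaultdict(float)
    n_visits = defaultdict(int)
    for episode in episodes:
        # index of the first visit to each state in this episode
        first = {}
        for t, (s, _) in enumerate(episode):
            first.setdefault(s, t)
        G = 0.0
        for t in reversed(range(len(episode))):      # accumulate returns backwards
            s, r = episode[t]
            G = gamma * G + r
            if first[s] == t:                        # only the first visit counts
                n_visits[s] += 1
                V[s] += (G - V[s]) / n_visits[s]     # incremental average of returns
    return V
```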
First-visit MC and every-visit MC both converge to the true values (expected returns) as the number
of visits to each state-action pair approaches infinity.
Policy evaluation is performed using MC prediction (assuming we observe an infinite number of episodes,
so we obtain the exact q_π). Policy improvement is done by making the policy greedy w.r.t. the current value
function. Since we have an action-value function, no model is needed to construct the greedy policy.
MC methods can be used to find optimal policies given only sample episodes and no other knowledge of
the environment.
MC ES is an example of an on-policy method
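A sketch of the MC ES control loop; generate_episode is a hypothetical helper that starts from a random state-action pair (the exploring start) and then follows the current greedy policy, returning a list of (state, action, reward) steps.

```python
# Monte Carlo control with Exploring Starts (MC ES): first-visit Q estimation
# plus greedy policy improvement, with no model of the environment.
from collections import defaultdict
import numpy as np

def mc_es(generate_episode, n_actions, n_episodes=10_000, gamma=0.9):
    Q = defaultdict(lambda: np.zeros(n_actions))
    counts = defaultdict(lambda: np.zeros(n_actions))
    policy = defaultdict(int)                       # greedy action per state
    for _ in range(n_episodes):
        episode = generate_episode(policy)          # exploring start, then follow policy
        first = {}
        for t, (s, a, _) in enumerate(episode):
            first.setdefault((s, a), t)
        G = 0.0
        for t in reversed(range(len(episode))):
            s, a, r = episode[t]
            G = gamma * G + r
            if first[(s, a)] == t:                  # first-visit update of Q(s, a)
                counts[s][a] += 1
                Q[s][a] += (G - Q[s][a]) / counts[s][a]
                policy[s] = int(np.argmax(Q[s]))    # greedy improvement, no model needed
    return policy, Q
```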
Question: How can they learn about the optimal policy while behaving according to an exploratory policy?
The on-policy approach is a compromise. It learns action values not for the optimal policy but for a near-optimal policy
(i.e., ε-greedy) that still explores
Solution: use two policies. Target policy: the learned policy, which becomes the optimal policy. Behavior policy: an
exploratory policy used to generate the data. Learning is from data "off" the target policy → off-policy learning.
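The notes don't spell out how returns generated by the behavior policy are corrected; a standard choice is importance sampling. Below is a sketch of off-policy MC prediction of Q with weighted importance sampling, where target_prob(a, s) and behavior_prob(a, s) are hypothetical functions giving each policy's action probabilities.

```python
# Off-policy Monte Carlo prediction of Q using weighted importance sampling.
from collections import defaultdict

def off_policy_mc_q(episodes, target_prob, behavior_prob, gamma=0.9):
    """episodes: trajectories [(state, action, reward), ...] generated by the behavior policy."""
    Q = defaultdict(float)
    C = defaultdict(float)                    # cumulative importance-sampling weights
    for episode in episodes:
        G, W = 0.0, 1.0
        for t in reversed(range(len(episode))):
            s, a, r = episode[t]
            G = gamma * G + r                 # return following (s, a)
            C[(s, a)] += W
            Q[(s, a)] += (W / C[(s, a)]) * (G - Q[(s, a)])
            W *= target_prob(a, s) / behavior_prob(a, s)
            if W == 0.0:                      # target policy would never take this action
                break
    return Q
```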