
RL_L2_MultiArmedBandits

K-Armed Bandit Problem


The agent chooses among k different actions. Each action provides a numerical reward based on a probability
distribution. The goal is to maximize the expected total reward over time.
Exploration-Exploitation Dilemma
Exploitation: Selecting the action with the highest estimated reward.
Exploration: Trying other actions to discover potentially better rewards.
A balance between exploration and exploitation is crucial.
Action-Value Methods
Sample-Average Method: Computes action values by averaging received rewards.
ε-Greedy Method: Selects the best-known action most of the time, but explores randomly with probability ε.
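As a rough illustration of these two methods, here is a minimal sketch combining the incremental sample-average update with ε-greedy selection; the bandit interface `pull(a)` and all constants are assumptions made only for this example:

```python
import random

def epsilon_greedy_bandit(pull, k, steps=1000, epsilon=0.1):
    """Sample-average action-value estimation with epsilon-greedy selection.

    `pull(a)` is an assumed callable returning a numerical reward for arm a.
    """
    Q = [0.0] * k      # estimated value of each action
    N = [0] * k        # how many times each action has been selected
    total_reward = 0.0

    for _ in range(steps):
        if random.random() < epsilon:
            a = random.randrange(k)                 # explore: random action
        else:
            a = max(range(k), key=lambda i: Q[i])   # exploit: greedy action
        r = pull(a)
        N[a] += 1
        Q[a] += (r - Q[a]) / N[a]                   # incremental sample average
        total_reward += r
    return Q, total_reward

# Illustrative usage with Gaussian arms:
# means = [0.2, 0.5, 0.8]
# Q, total = epsilon_greedy_bandit(lambda a: random.gauss(means[a], 1.0), k=3)
```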
Upper Confidence Bound (UCB)
Selects actions based on both estimated reward and uncertainty. Encourages trying less-explored actions to get
more information.
Gradient Bandit Algorithm
Learns preferences for each action instead of estimating action values. Uses a softmax distribution to choose actions
probabilistically.
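In the usual formulation, the preferences $H_t(a)$ are updated using the reward relative to a baseline $\bar{R}_t$ (the average reward so far), with step size $\alpha$:

\[
H_{t+1}(A_t) = H_t(A_t) + \alpha\,(R_t - \bar{R}_t)\,\bigl(1 - \pi_t(A_t)\bigr),
\qquad
H_{t+1}(a) = H_t(a) - \alpha\,(R_t - \bar{R}_t)\,\pi_t(a) \quad \text{for all } a \neq A_t
\]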
The expected reward of an action: $q_*(a) = \mathbb{E}[R_t \mid A_t = a]$
Greedy action selection: $A_t = \arg\max_a Q_t(a)$
Soft-max distribution: $\pi_t(a) = \frac{e^{H_t(a)}}{\sum_{b=1}^{k} e^{H_t(b)}}$
UCB action selection: $A_t = \arg\max_a \left[ Q_t(a) + c \sqrt{\frac{\ln t}{N_t(a)}} \right]$

A Multi-Armed Bandit (MAB) problem involves single-step decision-making, where an agent chooses among multiple actions
(arms) to maximize immediate rewards, balancing exploration vs. exploitation. In contrast, Reinforcement Learning (RL) deals
with sequential decision-making, where actions affect future states and rewards, requiring the agent to learn an optimal policy
over time. While MAB problems have no state transitions, RL problems involve Markov Decision Processes (MDPs) with delayed
rewards.
Markov Decision Process (MDP)
A formalization of the sequential decision-making problem.
Actions influence not only immediate rewards but also subsequent situations (delayed reward), so the agent
must trade off immediate and delayed reward.
Bellman Equation:
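In standard notation, the Bellman equation for the state-value function of a policy $\pi$ is:

\[
v_\pi(s) = \sum_{a} \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\,\bigl[r + \gamma\, v_\pi(s')\bigr]
\]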

Solving an RL task means finding a policy that achieves a large reward over the long run → finding an optimal
policy.
A policy π is defined to be better than or equal to a policy π′ if its expected return (i.e., value) is greater than
or equal to that of π′ for all states, namely $\pi \ge \pi'$ if and only if $v_\pi(s) \ge v_{\pi'}(s)$ for all states $s$.

Dynamic Programming - Value Iteration and Policy Iteration:


DP can be used to compute the value function using the Bellman equation in an iterative way to improve
value approximations.
Policy Evaluation: Computes state values for a given policy.
Policy Improvement: Finds better actions by evaluating alternatives.
Policy Iteration: Alternates between evaluation and improvement to find the best policy.
Value Iteration: Merges policy evaluation & improvement into one step for faster learning.
Asynchronous DP: Updates only important states, making it more efficient.
For our purposes, iterative solution methods are most suitable. We use the Bellman equation as an update rule
(k is the iteration index):
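Written out, the iterative policy-evaluation update is:

\[
v_{k+1}(s) = \sum_{a} \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\,\bigl[r + \gamma\, v_k(s')\bigr]
\]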
Given the value function for a policy π, we would like to know whether we can get a better policy by choosing a
different action a in a specific state s:
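The quantity to compare against $v_\pi(s)$ is the value of taking action $a$ in $s$ and thereafter following $\pi$:

\[
q_\pi(s, a) = \sum_{s', r} p(s', r \mid s, a)\,\bigl[r + \gamma\, v_\pi(s')\bigr]
\]

If $q_\pi(s, a) > v_\pi(s)$, then selecting $a$ in $s$ (and following $\pi$ elsewhere) yields a better policy.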

The greedy policy takes the action that looks best after one step of lookahead according to $v_\pi$. By
construction, this policy meets the conditions of the policy improvement theorem, hence it is as good as, or
better than, the original policy.
The Policy Iteration algorithm repeats policy evaluation and policy improvement obtaining a sequence of
monotonically improving policies and value functions, until convergence to an optimal policy and optimal value
function.
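A minimal tabular sketch of Policy Iteration, assuming the MDP is known and represented as `P[s][a]` = list of `(prob, next_state, reward)` triples; this representation and the thresholds are assumptions made for illustration only:

```python
def policy_iteration(P, n_states, n_actions, gamma=0.9, theta=1e-8):
    """Tabular policy iteration for a known MDP.

    P[s][a] is assumed to be a list of (prob, next_state, reward) triples.
    """
    V = [0.0] * n_states
    policy = [0] * n_states            # deterministic policy: state -> action

    def q_value(s, a):
        # One-step lookahead: expected return of taking a in s, then using V
        return sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])

    while True:
        # Policy evaluation: sweep the Bellman update for the current policy
        while True:
            delta = 0.0
            for s in range(n_states):
                v_new = q_value(s, policy[s])
                delta = max(delta, abs(v_new - V[s]))
                V[s] = v_new
            if delta < theta:
                break
        # Policy improvement: act greedily w.r.t. the current value function
        stable = True
        for s in range(n_states):
            best_a = max(range(n_actions), key=lambda a: q_value(s, a))
            if best_a != policy[s]:
                policy[s] = best_a
                stable = False
        if stable:
            return policy, V
```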

Policy iteration: two interacting (competing/cooperating) processes: 1. Policy evaluation, 2. Policy
improvement. This schema, called Generalized Policy Iteration (GPI), is common to several RL algorithms,
such as Value Iteration and Asynchronous DP methods.
DP methods update estimates of the values of states based on estimates of the values of successor states,
i.e., they update estimates on the basis of other estimates → DP methods perform bootstrapping.
Monte Carlo methods: do not require a model and do not bootstrap.
Temporal-Difference learning: does not require a model and does bootstrap.

Monte Carlo (MC) Methods - Learning from Complete Episodes:


Monte Carlo Prediction: Estimates value functions by averaging observed returns over full episodes.
First-Visit MC: Uses only the first visit to a state in an episode.
Every-Visit MC: Uses all occurrences of a state in an episode.
Monte Carlo Control: Uses Generalized Policy Iteration (GPI) to improve policies.
Exploration Strategies:
Exploring Starts: Randomly starts episodes from different states.
ε-Greedy Policies: Ensures all actions have a chance to be explored.
Off-Policy Learning (Importance Sampling):
Behavior Policy: Collects data.
Target Policy: Learns from that data.
Key Takeaway: MC methods learn from full episodes, making them ideal for model-free RL.
Main idea of MC: average the returns observed after visits to a state.

The estimate for one state in MC methods does not build upon the estimate of any other state, as is the case in
DP → MC methods do not perform bootstrapping. In MC methods the computational expense of estimating
the value of a single state is independent of the number of states, which makes them useful for online estimation
or for estimating the values of a subset of states.
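A minimal sketch of first-visit MC prediction; the episode format (a list of `(state, reward)` pairs, where `reward` is the reward received after leaving `state`) is an assumption for this example:

```python
from collections import defaultdict

def first_visit_mc_prediction(episodes, gamma=1.0):
    """Estimate V(s) by averaging returns that follow the first visit to s."""
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)

    for episode in episodes:            # episode: list of (state, reward) pairs
        G = 0.0
        first_visit_return = {}
        # Walk backwards, accumulating the return G from each time step onward
        for state, reward in reversed(episode):
            G = gamma * G + reward
            first_visit_return[state] = G   # later overwrites keep the earliest visit
        for state, G in first_visit_return.items():
            returns_sum[state] += G
            returns_count[state] += 1

    return {s: returns_sum[s] / returns_count[s] for s in returns_sum}
```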

First-visit MC and every-visit MC converge quadratically to the true values (expected returns) as the number
of visits to each state-action pair approaches infinity

Policy evaluation: performed using MC prediction (assuming we observe an infinite number of episodes,
hence we get the exact $q_\pi$). Policy improvement: done by making the policy greedy w.r.t. the current value
function. Since we have an action-value function, no model is needed to construct the greedy policy.

MC methods can be used to find optimal policies given only sample episodes and no other knowledge of
the environment.
MC ES (Monte Carlo with Exploring Starts) is an example of an on-policy method.

Question: how can they learn about the optimal policy while behaving according to an exploratory policy?
The on-policy approach is a compromise: it learns action values not for the optimal policy but for a near-optimal
policy (e.g., ε-greedy) that still explores.
Solution: use two policies. Target policy: the learned policy, which becomes the optimal policy. Behavior policy:
the exploratory policy used to generate data. Learning is from data "off" the target policy → off-policy learning.

MC methods learn from sample episodes: four advantages over DP methods

1) No model of the environment is required.
2) They can be used with simulators of the environment.
3) They can focus on a subset of states (scaling).
4) They do not bootstrap, hence they may be less harmed by violations of the Markov property.
Problem of maintaining sufficient exploration:
Exploring starts: OK only for simulated episodes.
On-policy prediction/control: not completely precise.
Off-policy prediction/control: the best method but more complex (Target/Behavior policy,
Ordinary/weighted Importance Sampling).
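In standard notation, with behavior policy $b$ and target policy $\pi$, the importance-sampling ratio and the two estimators over the set $\mathcal{T}(s)$ of time steps at which $s$ is visited are:

\[
\rho_{t:T-1} = \prod_{k=t}^{T-1} \frac{\pi(A_k \mid S_k)}{b(A_k \mid S_k)},
\qquad
V_{\text{ordinary}}(s) = \frac{\sum_{t \in \mathcal{T}(s)} \rho_{t:T(t)-1}\, G_t}{|\mathcal{T}(s)|},
\qquad
V_{\text{weighted}}(s) = \frac{\sum_{t \in \mathcal{T}(s)} \rho_{t:T(t)-1}\, G_t}{\sum_{t \in \mathcal{T}(s)} \rho_{t:T(t)-1}}
\]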

Temporal Difference (TD) Learning - Combining MC & DP:

TD Prediction vs. MC vs. DP:


Like MC: Learns from experience.
Like DP: Uses bootstrapping (updates based on estimates).
TD(0) Update Rule: Updates value estimates using observed rewards + next state’s estimate.
TD Control Algorithms:
Sarsa (On-Policy TD Control): Learns Q-values for the current policy.
Q-Learning (Off-Policy TD Control): Learns Q-values for the optimal policy.
Maximization Bias & Double Learning:
Maximization Bias: Overestimation of Q-values.
Double Learning: Uses two Q-tables to reduce bias.
Key Takeaway: TD learning is efficient, widely used, and combines the benefits of MC and DP.
Unlike MC methods, TD methods need to wait only until the next time step: at time t+1 they immediately form a
target and make a useful update using the observed reward $R_{t+1}$ and the estimate $V(S_{t+1})$.
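In standard notation, the TD(0), Sarsa, and Q-learning updates referred to above are:

\[
V(S_t) \leftarrow V(S_t) + \alpha\,\bigl[R_{t+1} + \gamma\, V(S_{t+1}) - V(S_t)\bigr]
\]
\[
Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha\,\bigl[R_{t+1} + \gamma\, Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t)\bigr]
\]
\[
Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha\,\bigl[R_{t+1} + \gamma\, \max_a Q(S_{t+1}, a) - Q(S_t, A_t)\bigr]
\]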
