notes
A Multi-Armed Bandit (MAB) problem involves single-step decision-making, where an agent chooses among multiple actions
(arms) to maximize immediate rewards, balancing exploration vs. exploitation. In contrast, Reinforcement Learning (RL) deals
with sequential decision-making, where actions affect future states and rewards, requiring the agent to learn an optimal policy
over time. While MAB problems have no state transitions, RL problems involve Markov Decision Processes (MDPs) with delayed
rewards.
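The exploration vs. exploitation trade-off in a bandit is often handled with an ε-greedy rule. Below is a minimal sketch; the arm means, ε = 0.1, and the step count are made-up values, not taken from these notes.

```python
# Minimal epsilon-greedy agent for a k-armed bandit (illustrative values only).
import numpy as np

rng = np.random.default_rng(0)
true_means = [0.1, 0.5, 0.8]        # hypothetical expected reward of each arm
k, epsilon, steps = len(true_means), 0.1, 10_000

Q = np.zeros(k)                     # estimated value of each arm
N = np.zeros(k)                     # number of pulls of each arm

for _ in range(steps):
    # explore with probability epsilon, otherwise exploit the current best arm
    a = int(rng.integers(k)) if rng.random() < epsilon else int(np.argmax(Q))
    r = rng.normal(true_means[a], 1.0)     # noisy reward sample
    N[a] += 1
    Q[a] += (r - Q[a]) / N[a]              # incremental sample-average update

print(Q)  # the estimates should approach the true arm means
```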
Markov Decision Process (MDP)
A formalization of the sequential decision-making problem.
Actions influence not only immediate rewards but also subsequent situations (delayed reward) → trade-off
between immediate and delayed reward.
Bellman Equation: V^π(s) = Σ_a π(a|s) Σ_{s',r} p(s',r|s,a) [r + γ V^π(s')]
Solving an RL task means finding a policy that achieves a large reward over the long run → finding an optimal
policy.
A policy π is defined to be better than or equal to a policy π' if its expected return (i.e., value) is greater than
or equal to that of π' for all states, namely V^π(s) ≥ V^π'(s) for all s ∈ S.
The greedy policy takes the action that looks best after one step of lookahead according to V^π. By
construction this policy meets the conditions of the policy improvement theorem, hence it is as good as, or
better than, the original policy.
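A sketch of this one-step lookahead, assuming a known tabular model where P[s][a] is a list of (probability, next_state, reward) triples and V is the current state-value estimate; this data layout and gamma = 0.9 are illustrative assumptions.

```python
# One-step greedy policy improvement w.r.t. a state-value function V.
import numpy as np

def greedy_policy(P, V, gamma=0.9):
    """P[s][a]: list of (prob, next_state, reward) triples; V: array of state values."""
    policy = np.zeros(len(P), dtype=int)
    for s in range(len(P)):
        # evaluate each action with one step of lookahead according to V
        q = [sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
             for a in range(len(P[s]))]
        policy[s] = int(np.argmax(q))      # act greedily
    return policy
```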
The Policy Iteration algorithm repeats policy evaluation and policy improvement obtaining a sequence of
monotonically improving policies and value functions, until convergence to an optimal policy and optimal value
function.
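A sketch of Policy Iteration under the same assumed model format (P[s][a] as (probability, next_state, reward) triples); gamma and the evaluation threshold theta are illustrative choices.

```python
# Tabular Policy Iteration: alternate evaluation and greedy improvement until stable.
import numpy as np

def policy_iteration(P, n_actions, gamma=0.9, theta=1e-8):
    n_states = len(P)
    V = np.zeros(n_states)
    policy = np.zeros(n_states, dtype=int)
    while True:
        # policy evaluation: sweep the Bellman expectation backup until convergence
        while True:
            delta = 0.0
            for s in range(n_states):
                v = sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][policy[s]])
                delta = max(delta, abs(v - V[s]))
                V[s] = v
            if delta < theta:
                break
        # policy improvement: make the policy greedy w.r.t. the new V
        stable = True
        for s in range(n_states):
            q = [sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
                 for a in range(n_actions)]
            best = int(np.argmax(q))
            if best != policy[s]:
                stable = False
            policy[s] = best
        if stable:                         # no action changed → optimal policy and V
            return policy, V
```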
The estimate for one state in MC methods does not build upon the estimate of any other state, as is the case in
DP → MC methods do not perform bootstrapping. In MC methods the computational expense of estimating
the value of a single state is independent of the number of states → useful for online estimation or for estimating
the values of a subset of states.
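A sketch of first-visit MC prediction; here an episode is assumed to be a list of (state, reward) pairs, where the reward is the one received after leaving that state, and gamma is an illustrative choice.

```python
# First-visit Monte Carlo prediction of V from sample episodes (no model, no bootstrapping).
from collections import defaultdict

def first_visit_mc(episodes, gamma=0.9):
    """episodes: list of trajectories [(state, reward_after_leaving_state), ...]."""
    V = defaultdict(float)
    n_visits = defaultdict(int)
    for episode in episodes:
        # index of the first visit to each state in this episode
        first = {}
        for t, (s, _) in enumerate(episode):
            first.setdefault(s, t)
        G = 0.0
        for t in reversed(range(len(episode))):      # accumulate returns backwards
            s, r = episode[t]
            G = gamma * G + r
            if first[s] == t:                        # only the first visit counts
                n_visits[s] += 1
                V[s] += (G - V[s]) / n_visits[s]     # incremental average of returns
    return V
```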
First-visit MC and every-visit MC both converge to the true values (expected returns) as the number
of visits to each state-action pair approaches infinity.
Policy evaluation is performed using MC prediction (assuming we observe an infinite number of episodes,
so we obtain the exact q_π). Policy improvement is done by making the policy greedy w.r.t. the current value
function. Since we have an action-value function, no model is needed to construct the greedy policy.
MC methods can be used to find optimal policies given only sample episodes and no other knowledge of
the environment.
MC ES is an example of an on-policy method
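A sketch of the MC ES control loop; generate_episode is a hypothetical helper that starts from a random state-action pair (the exploring start) and then follows the current greedy policy, returning a list of (state, action, reward) steps.

```python
# Monte Carlo control with Exploring Starts (MC ES): first-visit Q estimation
# plus greedy policy improvement, with no model of the environment.
from collections import defaultdict
import numpy as np

def mc_es(generate_episode, n_actions, n_episodes=10_000, gamma=0.9):
    Q = defaultdict(lambda: np.zeros(n_actions))
    counts = defaultdict(lambda: np.zeros(n_actions))
    policy = defaultdict(int)                       # greedy action per state
    for _ in range(n_episodes):
        episode = generate_episode(policy)          # exploring start, then follow policy
        first = {}
        for t, (s, a, _) in enumerate(episode):
            first.setdefault((s, a), t)
        G = 0.0
        for t in reversed(range(len(episode))):
            s, a, r = episode[t]
            G = gamma * G + r
            if first[(s, a)] == t:                  # first-visit update of Q(s, a)
                counts[s][a] += 1
                Q[s][a] += (G - Q[s][a]) / counts[s][a]
                policy[s] = int(np.argmax(Q[s]))    # greedy improvement, no model needed
    return policy, Q
```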
Question: How can they learn about the optimal policy while behaving according to an exploratory policy?
The on-policy approach is a compromise. It learns action values not for the optimal policy but for a near-optimal policy
(i.e., ε-greedy) that still explores
Solution: use two policies. Target policy: the learned policy, which becomes the optimal policy. Behavior policy: an
exploratory policy used to generate the data. Learning is from data "off" the target policy → off-policy learning.
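The notes don't spell out how returns generated by the behavior policy are corrected; a standard choice is importance sampling. Below is a sketch of off-policy MC prediction of Q with weighted importance sampling, where target_prob(a, s) and behavior_prob(a, s) are hypothetical functions giving each policy's action probabilities.

```python
# Off-policy Monte Carlo prediction of Q using weighted importance sampling.
from collections import defaultdict

def off_policy_mc_q(episodes, target_prob, behavior_prob, gamma=0.9):
    """episodes: trajectories [(state, action, reward), ...] generated by the behavior policy."""
    Q = defaultdict(float)
    C = defaultdict(float)                    # cumulative importance-sampling weights
    for episode in episodes:
        G, W = 0.0, 1.0
        for t in reversed(range(len(episode))):
            s, a, r = episode[t]
            G = gamma * G + r                 # return following (s, a)
            C[(s, a)] += W
            Q[(s, a)] += (W / C[(s, a)]) * (G - Q[(s, a)])
            W *= target_prob(a, s) / behavior_prob(a, s)
            if W == 0.0:                      # target policy would never take this action
                break
    return Q
```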