
EASWARI ENGINEERING COLLEGE

(AUTONOMOUS)
DEPARTMENT OF ARTIFICIAL INTELLIGENCE AND DATA SCIENCE

191AIC601T – REINFORCEMENT LEARNING

Unit III – Notes


(Monte Carlo Learning)

III YEAR - B.TECH

PREPARED BY: G.SIVASATHIYA, AP/AI&DS

APPROVED BY: HOD/AI&DS


MONTE CARLO LEARNING

In dynamic programming we need a model (the agent knows the MDP transitions and rewards) and the agent does planning (once the model is available, the agent only needs to plan its action in each state). There is no real learning by the agent in the dynamic programming method.

The Monte Carlo method, on the other hand, is a very simple concept: the agent learns about states and rewards by interacting with the environment. In this method the agent generates experience samples and then, based on the average return, a value is calculated for a state or state-action pair.

Below are the key characteristics of the Monte Carlo (MC) method:

 There is no model (the agent does not know the MDP state transitions)
 The agent learns from sampled experience
 It learns the state value vπ(s) under policy π as the average return experienced over all sampled episodes (value = average return)
 Values are updated only after a complete episode (because of this, convergence of the algorithm is slow and the update happens only once an episode is complete)
 There is no bootstrapping
 It can only be used for episodic problems

Consider a real-life analogy: Monte Carlo learning is like an annual examination, where the student completes an episode at the end of the year. Here, the result of the annual exam is like the return obtained by the student.
Now, if the goal of the problem is to find how students in a class score during a calendar year (which is the episode here), we can take the sample results of some students and then calculate the mean result to find the score for the class (don't take the analogy point by point, but on a holistic level you can get the essence of MC learning).
Similarly, we have TD learning, or temporal difference learning (TD learning updates the value at every time step and does not need to wait until the end of an episode to update values), which we will cover later. It can be thought of like a weekly or monthly examination (the student can adjust their performance based on the score (reward) received after every small interval, and the final score is the accumulation of all the weekly tests (total rewards)).

Value function = Expected Return

The expected return is the discounted sum of all rewards.

In the Monte Carlo method, instead of the expected return we use the empirical return that the agent has sampled while following the policy.
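In standard notation (the discount factor γ and the reward sequence Rt+1, Rt+2, ... are the usual definitions, not spelled out above), this reads:

Gt = Rt+1 + γ Rt+2 + γ² Rt+3 + ...   (discounted sum of rewards after time t)

vπ(s) = Eπ[ Gt | St = s ] ≈ (1/N(s)) Σi G(i),   where G(i) is the return of the i-th sampled visit to s.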

If we go back to our very first example of gem collection, the agent follows a policy and completes an episode; along the way, at each step, it collects rewards in the form of gems. To get the state value, the agent sums up all the gems collected in each episode starting from that state.

Refer to the diagram below, where 3 samples are collected starting from state S05. The total reward collected (the discount factor is taken as 1 for simplicity) in each episode is as follows:
Return(Sample 01) = 2 + 1 + 2 + 2 + 1 + 5 = 13 gems
Return(Sample 02) = 2 + 3 + 1 + 3 + 1 + 5 = 15 gems
Return(Sample 03) = 2 + 3 + 1 + 3 + 1 + 5 = 15 gems

Observed mean return (based on 3 samples) = (13 + 15 + 15)/3 = 14.33 gems

Thus the state value as per the Monte Carlo method, vπ(S05), is 14.33 gems, based on 3 samples following policy π.
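As a quick cross-check of this arithmetic, here is a minimal Python sketch (the per-step reward lists are simply the gem counts from the three samples above):

# Per-step rewards (gems) in each sampled episode starting from state S05
sample_rewards = [
    [2, 1, 2, 2, 1, 5],   # Sample 01 -> return 13
    [2, 3, 1, 3, 1, 5],   # Sample 02 -> return 15
    [2, 3, 1, 3, 1, 5],   # Sample 03 -> return 15
]

def episode_return(rewards, gamma=1.0):
    """Discounted sum of rewards for one episode."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

returns = [episode_return(r) for r in sample_rewards]
print(sum(returns) / len(returns))   # 14.33... = estimated v_pi(S05)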

Monte Carlo Backup diagram


There are two types of MC learning policy evaluation (prediction) methods:

First Visit Monte Carlo Method

In this case, only the first visit to a state within an episode is counted (even if the agent comes back to the same state multiple times in the episode, only the first visit is counted). Detailed steps are as below:

1. To evaluate state s, first set the number of visits N(s) = 0 and the total return TR(s) = 0 (these values are accumulated across episodes)
2. The first time-step t that state s is visited in an episode, increment the counter N(s) = N(s) + 1
3. Increment the total return TR(s) = TR(s) + Gt
4. The value is estimated by the mean return V(s) = TR(s)/N(s)
5. By the law of large numbers, V(s) -> vπ(s) (the true value under policy π) as N(s) approaches infinity

Refer to below diagram for better understanding of counter increment.
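A minimal Python sketch of these steps (the episode format, a list of (state, reward) pairs produced by following the policy, and the sample_episode() function are assumptions made for illustration, not something defined in these notes):

from collections import defaultdict

def mc_prediction(sample_episode, num_episodes, gamma=1.0, first_visit=True):
    """Monte Carlo policy evaluation (first-visit by default).

    sample_episode() is assumed to return one episode generated by the policy
    as a list of (state, reward) pairs: [(S0, R1), (S1, R2), ..., (S_{T-1}, R_T)].
    """
    N = defaultdict(int)      # N(s): number of counted visits
    TR = defaultdict(float)   # TR(s): total return
    V = defaultdict(float)    # V(s): estimated value (mean return)

    for _ in range(num_episodes):
        episode = sample_episode()

        # Backwards pass: compute the return G_t for every time step.
        G, returns = 0.0, [0.0] * len(episode)
        for t in reversed(range(len(episode))):
            G = episode[t][1] + gamma * G
            returns[t] = G

        # Time step of the first visit to each state in this episode.
        first_time = {}
        for t, (state, _) in enumerate(episode):
            first_time.setdefault(state, t)

        for t, (state, _) in enumerate(episode):
            if first_visit and first_time[state] != t:
                continue                     # count only the first visit
            N[state] += 1
            TR[state] += returns[t]
            V[state] = TR[state] / N[state]  # mean return
    return V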


Every Visit Monte Carlo Method

In this case, every visit to the state within an episode is counted. Detailed steps are as below:
1. To evaluate state s, first set the number of visits N(s) = 0 and the total return TR(s) = 0 (these values are accumulated across episodes)
2. Every time-step t that state s is visited in an episode, increment the counter N(s) = N(s) + 1
3. Increment the total return TR(s) = TR(s) + Gt
4. The value is estimated by the mean return V(s) = TR(s)/N(s)
5. By the law of large numbers, V(s) -> vπ(s) (the true value under policy π) as N(s) approaches infinity

Refer to below diagram for better understanding of counter increment.
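The every-visit variant differs from the first-visit sketch above only in that the first-visit check is skipped, so every occurrence of a state in an episode contributes one count and one return, e.g.:

V_every = mc_prediction(sample_episode, num_episodes=1000, first_visit=False)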


Usually MC is updated incrementally after every episode (there is no need to store values from old episodes; a running mean value for each state can be maintained and updated after every episode).

Update V(s) incrementally after each episode S1, A1, R2, ..., ST. For each state St with return Gt:

N(St) = N(St) + 1
V(St) = V(St) + (1/N(St)) (Gt - V(St))

Usually, in place of 1/N(St), a constant learning rate (α) is used and the above equation becomes:

V(St) = V(St) + α (Gt - V(St))
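As a small illustrative fragment (assuming episodes and per-step returns computed as in the sketch above), the constant-α form of the update looks like this:

def incremental_update(V, episode, returns, alpha=0.1):
    """For each visited state S_t: V(S_t) <- V(S_t) + alpha * (G_t - V(S_t)).

    V is assumed to be a defaultdict(float) as in the sketch above; episode is
    a list of (state, reward) pairs and returns[t] holds G_t for time step t.
    """
    for t, (state, _) in enumerate(episode):
        V[state] += alpha * (returns[t] - V[state])
    return V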

For policy improvement, the Generalized Policy Iteration (GPI) scheme is used to update the policy using the action-value function of the Monte Carlo method.
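As a rough sketch of the greedy improvement step (assuming action values are stored in a dict Q mapping (state, action) pairs to estimates; this helper is illustrative, not part of the original notes):

def greedy_policy(Q, actions):
    """For each state appearing in Q, pick the action with the highest estimated value."""
    policy = {}
    for state in {s for (s, _) in Q}:
        policy[state] = max(actions, key=lambda a: Q.get((state, a), float("-inf")))
    return policy

In practice an epsilon-greedy version of this policy is used during learning so that exploration continues.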

Monte Carlo methods have the following advantages:

 Zero bias
 Good convergence properties (even with function approximation)
 Not very sensitive to initial values
 Very simple to understand and use

But they have the following limitations as well:

 MC must wait until the end of an episode before the return is known
 MC has high variance
 MC can only learn from complete sequences
 MC only works for episodic (terminating) environments

Even though the MC method takes time, it is an important tool for any reinforcement learning practitioner.
OFF-POLICY MONTE CARLO WITH IMPORTANCE SAMPLING

Off Policy Learning


 Because of the exploration-exploitation trade-off, the agent must sometimes take sub-optimal exploratory actions, for which it may receive less reward. One way of exploring is to use an epsilon-greedy policy, where the agent takes a non-greedy action with a small probability.
 In an on-policy method, improvement and evaluation are done on the policy which is used to select actions.
 In an off-policy method, improvement and evaluation are done on a policy different from the one used to select actions. The policy learned is "off" the policy used for action selection while gathering episodes.
o Target policy π(a|s): the value function being learned is that of π(a|s). We want the target policy to be the optimal policy π*(a|s). The target policy will be used for action selection after the learning process is complete (deployment).
o Behavior policy b(a|s): the behavior policy is used for action selection while gathering episodes to train the agent. This is generally an exploratory policy.

Importance Sampling
 We have a random variable X ∼ b, sampled from the behavior policy distribution b. We want to estimate the expected value of X with respect to the target distribution π, i.e. Eπ[X]. A plain sample average would instead give the expected value under b, Eb[X].
Define the importance-sampling ratio ρ(x) = π(x)/b(x) and form the new random variable Xρ(X).

Then Eπ[X] = Σx x π(x) = Σx x ρ(x) b(x) = Eb[Xρ(X)]. Now we have an expectation under b instead of π, which can be estimated by averaging xρ(x) over samples drawn from b.
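A small numeric sketch of this identity (the two discrete distributions below are arbitrary illustrative choices, not taken from the notes):

import random

outcomes = [1.0, 2.0, 3.0]
pi = [0.7, 0.2, 0.1]   # target distribution pi(x)
b = [0.2, 0.3, 0.5]    # behavior distribution b(x); b(x) > 0 wherever pi(x) > 0

exact = sum(x * p for x, p in zip(outcomes, pi))   # E_pi[X] computed directly

random.seed(0)
idx = random.choices(range(len(outcomes)), weights=b, k=100_000)          # samples from b
estimate = sum(outcomes[i] * pi[i] / b[i] for i in idx) / len(idx)        # average of x * rho(x)

print(exact, estimate)   # the importance-sampling estimate approaches E_pi[X]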

Off-Policy Monte Carlo Prediction with Importance Sampling
Off-policy control methods use two policies on the same episode: one that is learned about and becomes the optimal policy, called the target policy, and one that is more exploratory and is used to generate behavior, called the behavior policy.

In this case we say that learning is from data "off" the target policy, and the overall process is termed off-policy learning.

Here we only consider the prediction problem, in which both target and behavior policies are fixed. We require that π(a|s) > 0 implies b(a|s) > 0, which is called the assumption of coverage, to assure that every action taken under π is also taken under b. It follows from coverage that b must be stochastic in states where it is not identical to π. The target policy π itself may be deterministic.
Incremental Implementation of Off-policy MC

The off-policy Monte Carlo algorithm combines the incremental value updates described earlier with these importance-sampling ratios: returns sampled while following the behavior policy b are weighted by the ratio of action probabilities under π and b before being averaged, as sketched below.
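A minimal sketch of off-policy every-visit MC prediction with ordinary importance sampling (the episode format, lists of (state, action, reward) triples generated by b, and the probability functions pi_prob(s, a) and b_prob(s, a) are assumptions made for illustration):

from collections import defaultdict

def off_policy_mc_prediction(episodes, pi_prob, b_prob, gamma=1.0):
    """Estimate v_pi from episodes generated by the behavior policy b.

    Each episode is a list of (state, action, reward) triples
    [(S0, A0, R1), (S1, A1, R2), ...]. pi_prob(s, a) and b_prob(s, a) return
    the probability of taking action a in state s under the target and behavior
    policies (coverage assumed: pi_prob(s, a) > 0 implies b_prob(s, a) > 0).
    """
    N = defaultdict(int)          # number of visits to each state
    total = defaultdict(float)    # sum of importance-weighted returns
    V = defaultdict(float)        # value estimate for the target policy pi

    for episode in episodes:
        G = 0.0    # return from time t to the end of the episode
        rho = 1.0  # importance-sampling ratio from time t to the end
        for state, action, reward in reversed(episode):
            G = reward + gamma * G
            rho *= pi_prob(state, action) / b_prob(state, action)
            N[state] += 1
            total[state] += rho * G   # ordinary importance sampling: weight the return
            V[state] = total[state] / N[state]
    return V

A weighted importance-sampling variant would divide by the sum of the ratios instead of the visit count, trading some bias for much lower variance.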
