Monte Carlo Learning
The Monte Carlo method, on the other hand, is a very simple concept: the agent learns about states and rewards by interacting with the environment. It generates sample episodes of experience, and the value of a state (or state-action pair) is then estimated as the average return observed across those samples.
Consider a real-life analogy: Monte Carlo learning is like an annual examination, where the student completes their episode at the end of the year. Here, the result of the annual exam is like the return obtained by the student. Now, if the goal is to find how a class scores over a calendar year (which is the episode here), we can take the results of a sample of students and then calculate their mean to estimate the score for the class (don't take the analogy point by point, but at a holistic level it conveys the essence of MC learning).
Similarly, we have TD learning, or temporal difference learning (TD learning updates values at every time step and does not need to wait until the end of the episode), which we will cover in a future blog. It can be thought of as a weekly or monthly examination: the student can adjust their performance based on the score (reward) received after every small interval, and the final score is the accumulation of all the weekly tests (total reward).
Thus the state value under the Monte Carlo method, v_π(S05), is 14.33 gems, based on 3 sample returns collected while following policy π.
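This estimate is nothing more than the sample mean of the observed returns. Writing the three sampled returns from S05 as G1, G2 and G3 (hypothetical labels, not taken from the original example), the estimate is

v_π(S05) ≈ (G1 + G2 + G3) / 3 = 14.33,

and this average approaches the true value as more episodes are sampled.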
First-visit MC: in this case, only the first visit to a state in an episode is counted (even if the agent comes back to the same state multiple times in the episode, only the first visit is counted). The detailed steps are below, followed by a short code sketch:
1. To evaluate state s, first set the number of visits N(s) = 0 and the total return TR(s) = 0 (these values are accumulated across episodes).
2. At the first time-step t that state s is visited in an episode, increment the counter: N(s) = N(s) + 1.
3. Add the return from that time-step to the total return: TR(s) = TR(s) + Gt.
4. The value is estimated by the mean return: V(s) = TR(s)/N(s).
5. By the law of large numbers, V(s) → vπ(s) (the true value under policy π) as N(s) approaches infinity.
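Below is a minimal sketch of first-visit MC prediction in Python. It assumes each episode is already available as a list of (state, reward) pairs generated by following policy π, with reward being the reward received after leaving that state, plus a discount factor gamma; these names and assumptions are illustrative, not the exact setup used above.

from collections import defaultdict

def first_visit_mc_prediction(episodes, gamma=1.0):
    # N(s): number of first visits, TR(s): total return, V(s): estimate TR(s)/N(s)
    N = defaultdict(int)
    TR = defaultdict(float)
    V = defaultdict(float)

    for episode in episodes:
        # Compute the return Gt for every time-step by scanning the episode backwards.
        G = 0.0
        returns = []
        for (_, reward) in reversed(episode):
            G = reward + gamma * G
            returns.append(G)
        returns.reverse()          # returns[t] is now Gt

        seen = set()               # states already counted in this episode
        for t, (state, _) in enumerate(episode):
            if state in seen:
                continue           # first-visit: count each state once per episode
            seen.add(state)
            N[state] += 1                      # step 2: increment visit counter
            TR[state] += returns[t]            # step 3: add the return Gt
            V[state] = TR[state] / N[state]    # step 4: mean return

    return V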
Every-visit MC: in this case, every visit to the state in an episode is counted. The detailed steps are below, followed by the small change this makes to the sketch above:
1. To evaluate state s, first set the number of visits N(s) = 0 and the total return TR(s) = 0 (these values are accumulated across episodes).
2. At every time-step t that state s is visited in an episode, increment the counter: N(s) = N(s) + 1.
3. Add the return from that time-step to the total return: TR(s) = TR(s) + Gt.
4. The value is estimated by the mean return: V(s) = TR(s)/N(s).
5. By the law of large numbers, V(s) → vπ(s) (the true value under policy π) as N(s) approaches infinity.
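The only difference from the first-visit sketch is that the first-visit check (the seen set) disappears, so a state's return is accumulated on every occurrence. A minimal variant, under the same assumptions as above:

from collections import defaultdict

def every_visit_mc_prediction(episodes, gamma=1.0):
    N, TR, V = defaultdict(int), defaultdict(float), defaultdict(float)

    for episode in episodes:
        # Backward pass to compute Gt for each time-step, exactly as before.
        G = 0.0
        returns = []
        for (_, reward) in reversed(episode):
            G = reward + gamma * G
            returns.append(G)
        returns.reverse()

        for t, (state, _) in enumerate(episode):
            N[state] += 1                      # count every visit, not just the first
            TR[state] += returns[t]
            V[state] = TR[state] / N[state]

    return V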
Importance Sampling
We have a random variable X ∼ b, sampled from the behavior policy distribution b, but we want to estimate its expected value with respect to the target distribution π, i.e. Eπ[X]. The plain sample average of the draws would instead give the expected value under b, Eb[X]. The trick is to define the importance sampling ratio ρ(x) = π(x)/b(x) and consider the new random variable Xρ(X): its expectation under b equals Eπ[X], so averaging the weighted samples xρ(x) drawn from b gives the estimate we want.
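To see why, here is the one-line derivation (for discrete x; the continuous case replaces the sum with an integral):

Eb[ρ(X)·X] = Σx b(x)·(π(x)/b(x))·x = Σx π(x)·x = Eπ[X]

So multiplying each sample by the ratio ρ re-weights data generated by b so that, on average, it looks as if it had been generated by π.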
In this case we say that learning is from data “off” the target policy, and
the overall process is termed off-policy learning.
Here, we consider only the prediction problem, in which both the target and behavior policies are fixed. We require that π(a|s) > 0 implies b(a|s) > 0; this is called the assumption of coverage, and it ensures that every action that might be taken under π is also taken, at least occasionally, under b. It follows from coverage that b must be stochastic in states where it is not identical to π. The target policy π itself, however, may be deterministic.
Incremental Implementation of Off-policy MC
So the off-policy Monte Carlo control algorithm, in its incremental form, is as follows:
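Below is a minimal Python sketch of the standard incremental off-policy MC control algorithm with weighted importance sampling (in the spirit of Sutton and Barto); it is a sketch under stated assumptions, not necessarily the exact algorithm presented here. The callable generate_episode_b and the tuple format it returns are hypothetical placeholders: it is assumed to run one episode under a soft behavior policy b and return a list of (state, action, reward, b_prob) tuples, where b_prob = b(action | state), and actions is the list of all actions.

from collections import defaultdict

def off_policy_mc_control(generate_episode_b, actions, gamma=1.0, n_episodes=10000):
    Q = defaultdict(lambda: defaultdict(float))   # Q(s, a) action-value estimates
    C = defaultdict(lambda: defaultdict(float))   # C(s, a): cumulative importance weights
    target_policy = {}                            # greedy (deterministic) target policy pi

    for _ in range(n_episodes):
        episode = generate_episode_b()            # one episode under behavior policy b
        G = 0.0                                   # return
        W = 1.0                                   # importance sampling ratio

        # Work backwards through the episode, as in the incremental algorithm.
        for state, action, reward, b_prob in reversed(episode):
            G = gamma * G + reward
            C[state][action] += W
            # Weighted-importance-sampling incremental update of Q(s, a).
            Q[state][action] += (W / C[state][action]) * (G - Q[state][action])
            # Policy improvement: pi becomes greedy with respect to Q.
            target_policy[state] = max(actions, key=lambda a: Q[state][a])
            if action != target_policy[state]:
                break                             # pi(action|state) = 0, ratio collapses to 0
            W *= 1.0 / b_prob                     # pi(action|state) = 1 for the greedy action

    return Q, target_policy

Because the target policy is greedy and hence deterministic, the ratio π(a|s)/b(a|s) equals 1/b(a|s) when the taken action is the greedy one and 0 otherwise, which is why the inner loop breaks as soon as a non-greedy action is encountered.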