Lecture 6 MONTE CARLO Example
Reinforcement Learning
What is meant by Monte Carlo?
The term “Monte Carlo” is often used more broadly for any estimation method whose operation involves a significant random component.
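Monte Carlo methods require only experience: sample sequences of states, actions,
and rewards from actual or simulated interaction with an environment. Learning from
actual experience is striking because it requires no prior knowledge of the environment’s dynamics,
yet can still attain optimal behavior. Monte Carlo methods are defined here only for episodic tasks, in which every episode must eventually terminate.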
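● We are given two example episodes (we can generate them using random walks for
any environment):
Episode 1: A+3 → A+2 → B-4 → A+4 → B-3 → terminate
Episode 2: B-2 → A+3 → B-3 → terminate
● A+3 → A+2 means a transition from state A to state A with a reward of 3 for this transition.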
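To make the notation concrete, the sketch below stores each episode as a list of (state, reward) pairs and computes the undiscounted return from every time step. The two episodes are an assumption: they are reconstructed from the sums used for V(A) and V(B) in this example. The names `EPISODE_1`, `EPISODE_2`, and `returns_from_each_step` are illustrative only.

```python
# Each step is (state, reward): "A+3" becomes ("A", 3).
# ASSUMPTION: these two episodes are reconstructed from the worked sums
# in this example, not copied from an original listing.
EPISODE_1 = [("A", 3), ("A", 2), ("B", -4), ("A", 4), ("B", -3)]
EPISODE_2 = [("B", -2), ("A", 3), ("B", -3)]


def returns_from_each_step(episode):
    """Return the list of returns G_t, one per step t.

    G_t is the sum of all rewards from step t until termination
    (no discounting is applied in this sketch).
    """
    returns = []
    running = 0
    # Walk backwards so each return reuses the one that follows it.
    for _state, reward in reversed(episode):
        running += reward
        returns.append(running)
    returns.reverse()
    return returns


print(returns_from_each_step(EPISODE_1))
print(returns_from_each_step(EPISODE_2))
```

Each entry in the result lines up with one step of the episode, so the return attached to a visit to A or B can be read off by position.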
We know that averaging sampled rewards gives a value estimate, and the same idea extends
to multi-state RL problems. It is not quite that simple, though, because the value of a state
depends on future rewards as well as the immediate one. Monte Carlo learning therefore averages
returns (sums of future rewards), and it comes in 2 types that differ in which returns are averaged:
First Visit Monte Carlo: First visit estimates (Value|State: S1) as the average of the
returns following the first visit to the state S1.
Every Visit Monte Carlo: Every visit estimates (Value|State: S1) as the average of the
returns following every visit to the state S1.
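The two definitions differ only in how many returns each episode may contribute. A minimal sketch, assuming episodes are lists of (state, reward) pairs and no discounting; the function name `mc_value` and its `first_visit` flag are hypothetical, and the episode data is reconstructed from the sums in this example rather than taken from an original listing:

```python
# ASSUMPTION: episodes reconstructed from the arithmetic of the worked example.
EPISODES = [
    [("A", 3), ("A", 2), ("B", -4), ("A", 4), ("B", -3)],
    [("B", -2), ("A", 3), ("B", -3)],
]


def mc_value(episodes, state, first_visit=True):
    """Average the returns that follow visits to `state`.

    first_visit=True  -> at most one return per episode (first occurrence).
    first_visit=False -> one return per occurrence (every visit).
    Returns None if the state is never visited.
    """
    collected = []
    for episode in episodes:
        rewards = [reward for _state, reward in episode]
        for t, (visited, _reward) in enumerate(episode):
            if visited == state:
                # Undiscounted return: all rewards from step t onward.
                collected.append(sum(rewards[t:]))
                if first_visit:
                    break  # ignore later visits in this episode
    if not collected:
        return None
    return sum(collected) / len(collected)


for s in ("A", "B"):
    print(s, mc_value(EPISODES, s, first_visit=True),
          mc_value(EPISODES, s, first_visit=False))
```

The only difference between the two modes is the `break`: first visit stops after the first matching step of an episode, every visit keeps collecting.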
First Visit Monte Carlo:
● Calculating V(A)
As we have been given 2 different episodes, we sum all the rewards that come
after the first visit to ‘A’ in each episode (including the reward of that visit itself).
Therefore, we can’t have more than one summation term per episode for a state.
Hence,
● 1st episode=3+2+-4+4+-3=2
● 2nd episode=3+-3=0
As we have got two terms, we average these two values, i.e. V(A)=(2+0)/2=1
Note: If an episode doesn’t have an occurrence of ‘A’, it won’t be considered in the average.
Hence if a 3rd episode like B-3 → B-3 → terminate existed, V(A) using First Visit would still be 1.
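The note can be checked with a short sketch: an episode that never visits ‘A’ contributes no term, so the average is unchanged. The helper name `first_visit_returns` is hypothetical, and the first two episodes are an assumption reconstructed from the sums in this example; the third episode is the B-3 → B-3 one mentioned in the note.

```python
# ASSUMPTION: first two episodes reconstructed from the worked sums.
EP1 = [("A", 3), ("A", 2), ("B", -4), ("A", 4), ("B", -3)]
EP2 = [("B", -2), ("A", 3), ("B", -3)]
EP3 = [("B", -3), ("B", -3)]  # the extra episode from the note; no 'A' in it


def first_visit_returns(episodes, state):
    """Collect one return per episode, taken from the first visit to `state`.

    Episodes that never visit `state` are skipped entirely.
    """
    collected = []
    for episode in episodes:
        states = [s for s, _reward in episode]
        if state not in states:
            continue  # no visit, so no summation term for this episode
        t = states.index(state)  # index of the first visit
        collected.append(sum(reward for _s, reward in episode[t:]))
    return collected


without_third = first_visit_returns([EP1, EP2], "A")
with_third = first_visit_returns([EP1, EP2, EP3], "A")
print(without_third, with_third)
print(sum(with_third) / len(with_third))
```

The divisor is the number of episodes that actually visit the state, not the total number of episodes.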
● Calculating V(B)
● 1st episode=-4+4-3=-3
● 2nd episode=-2+3+-3=-2
Averaging, V(B)=(-3+-2)/2=-2.5
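Every Visit MC:
● Calculating V(A)
Here, we create a new summation term for every occurrence of ‘A’, adding all rewards
that come after that occurrence (including the reward of that occurrence itself).
● From 1st episode=(3+2+-4+4+-3)+(2+-4+4+-3)+(4+-3)=2+-1+1
● From 2nd episode=(3+-3)=0
i.e. V(A)=(2+-1+1+0)/4=0.5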
● Calculating V(B)
● From 1st episode=(-4+4+-3)+(-3)=-3+-3
● From 2nd episode=(-2+3+-3)+(-3)=-2+-3
V(B)=(-3+-3+-2+-3)/4=-2.75
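The every-visit summation terms for both states can be gathered in a single pass over the episodes. This is a sketch under stated assumptions: the episodes are reconstructed from the sums in this example, there is no discounting, and `every_visit_returns` is an illustrative name rather than part of any library.

```python
from collections import defaultdict

# ASSUMPTION: episodes reconstructed from the arithmetic of the worked example.
EPISODES = [
    [("A", 3), ("A", 2), ("B", -4), ("A", 4), ("B", -3)],
    [("B", -2), ("A", 3), ("B", -3)],
]


def every_visit_returns(episodes):
    """Map each state to the list of returns following every visit to it."""
    per_state = defaultdict(list)
    for episode in episodes:
        running = 0
        tagged = []
        # Accumulate rewards backwards so each step gets its full return.
        for state, reward in reversed(episode):
            running += reward
            tagged.append((state, running))
        # Restore time order before recording the terms.
        for state, g in reversed(tagged):
            per_state[state].append(g)
    return dict(per_state)


terms = every_visit_returns(EPISODES)
for state, returns in sorted(terms.items()):
    print(state, returns, sum(returns) / len(returns))
```

Keeping the individual terms, not just the averages, makes it easy to compare them with the hand-written sums for each episode.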