Lecture 6 MONTE CARLO Example

Monte Carlo methods estimate value functions using returns from sample episodes of interaction with an environment, without knowledge of the environment's dynamics. There are two types of Monte Carlo methods: First Visit estimates the value of a state as the average return following the first visit to that state in each episode. Every Visit estimates the value as the average return following every visit to a state across all episodes. These methods were demonstrated on examples to calculate the value of states A and B in a 3-state environment using 2 sample episodes.


MONTE CARLO method for Reinforcement Learning
What is meant by Monte Carlo?

The term “Monte Carlo” is often used more broadly for any estimation method whose operation involves a significant random component.

Monte Carlo methods require only experience: sample sequences of states, actions, and rewards from actual or simulated interaction with an environment. Learning from actual experience is striking because it requires no prior knowledge of the environment’s dynamics, yet one can still attain optimal behavior.


What is this Monte thing used for in RL?
It is a method for estimating the action-value function Q (value given a state and an action) or the state-value function V (value given a state) using sample runs (episodes) from the environment whose value function we are estimating.
● Let us consider a system of 3 states: A, B and terminate.

● We are given two sample episodes (for any environment, such episodes can be generated simply by interacting with it, e.g. with random walks).

● The notation A+3 → A+2 means a transition from state A back to state A, with a reward of 3 earned on that transition (and 2 on the next transition out of A). The two episodes used in the calculations below are A+3 → A+2 → B-4 → A+4 → B-3 → terminate and B-2 → A+3 → B-3 → terminate; they are also written out as data in the sketch below.
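Since the original figure with the two episodes is not reproduced here, they can be written out as plain data. A minimal sketch in Python; the (state, reward) pair representation is an assumption of this sketch, with the reward being the one earned on the transition out of that state:

```python
# The two sample episodes used in the calculations below, written as
# (state, reward) pairs. The terminal state is implicit at the end of each list.
episode_1 = [('A', 3), ('A', 2), ('B', -4), ('A', 4), ('B', -3)]
episode_2 = [('B', -2), ('A', 3), ('B', -3)]
```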
Now, we know that averaging rewards can give us the value function for multi-state RL problems as well. But things are not quite that easy, because the value function depends on future rewards too. Hence we have 2 types of Monte Carlo learning, differing in how the returns are averaged:

First Visit Monte Carlo: estimates V(S1) as the average of the returns following the first visit to state S1 in each episode.

Every Visit Monte Carlo: estimates V(S1) as the average of the returns following every visit to state S1, across all episodes.
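Both procedures can be captured in one small function. This is a minimal sketch, not code from the lecture: the name mc_state_values, the (state, reward) episode format, and the gamma parameter (set to 1 to match the undiscounted sums below) are assumptions of the sketch.

```python
from collections import defaultdict

def mc_state_values(episodes, first_visit=True, gamma=1.0):
    """Monte Carlo prediction of V(s) from a list of sample episodes.

    Each episode is a list of (state, reward) pairs, where the reward is
    the one received on the transition out of that state, as in the
    example worked through below (gamma=1 matches its undiscounted sums).
    """
    returns = defaultdict(list)              # state -> list of sampled returns
    for episode in episodes:
        # Compute the return from every time step by walking the episode backwards.
        g = 0.0
        rets = [0.0] * len(episode)
        for t in range(len(episode) - 1, -1, -1):
            g = episode[t][1] + gamma * g
            rets[t] = g
        seen = set()
        for t, (state, _) in enumerate(episode):
            if first_visit and state in seen:
                continue                     # first-visit: only the first occurrence counts
            seen.add(state)
            returns[state].append(rets[t])
    # Value estimate = average of the collected returns for each state.
    return {s: sum(gs) / len(gs) for s, gs in returns.items()}
```

Run on the two episodes above with first_visit=True and then first_visit=False, this should reproduce the hand calculations that follow.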


We will be calculating V(A) & V(B) using the above-mentioned Monte Carlo methods.
First Visit Monte Carlo:

● Calculating V(A)

Since we are given 2 different episodes, we sum all the rewards coming after the first visit to ‘A’ (including the reward earned on leaving A itself). Therefore, there is at most one summation term per episode for a given state.

Hence,

● For the 1st episode: 3 + 2 + (-4) + 4 + (-3) = 2

● For the 2nd episode: 3 + (-3) = 0

As we have two terms, we average them, i.e. V(A) = (2 + 0) / 2 = 1.
Note: if an episode contains no occurrence of ‘A’, it is not considered in the average.

Hence, even if a 3rd episode like B-3 → B-3 → terminate existed, V(A) using first-visit MC would still be 1.


Calculating V(B)
Drawing on the same episodes:

● 1st episode: -4 + 4 + (-3) = -3

● 2nd episode: -2 + 3 + (-3) = -2

Averaging, V(B) = (-3 + (-2)) / 2 = -2.5
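These first-visit averages can be checked with a few lines of Python. The helper first_visit_return and the episode lists are illustrative assumptions of this sketch; an episode that never visits the state is simply skipped, matching the note above.

```python
# First-visit check: one return per episode, taken from the first visit onward.
episodes = [[('A', 3), ('A', 2), ('B', -4), ('A', 4), ('B', -3)],   # episode 1
            [('B', -2), ('A', 3), ('B', -3)]]                        # episode 2

def first_visit_return(episode, state):
    """Undiscounted return from the first occurrence of `state`, or None if absent."""
    for i, (s, _) in enumerate(episode):
        if s == state:
            return sum(r for _, r in episode[i:])
    return None

for state in ('A', 'B'):
    gs = [g for g in (first_visit_return(ep, state) for ep in episodes) if g is not None]
    print(state, sum(gs) / len(gs))   # A -> 1.0, B -> -2.5
```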
Every Visit MC: Calculating V(A)
Here, we create a new summation term for every occurrence of ‘A’, adding all the rewards coming after that occurrence (again including the reward earned on leaving A).

● From the 1st episode: (3 + 2 - 4 + 4 - 3), (2 - 4 + 4 - 3), (4 - 3) = 2, -1, 1

● From the 2nd episode: (3 - 3) = 0

As we have 4 summation terms, we average with N = 4,

i.e. V(A) = (2 + (-1) + 1 + 0) / 4 = 0.5
Calculating V(B)
● From the 1st episode: (-4 + 4 - 3), (-3) = -3, -3

● From the 2nd episode: (-2 + 3 - 3), (-3) = -2, -3

As we have 4 summation terms, averaging with N = 4,

V(B) = (-3 + (-3) + (-2) + (-3)) / 4 = -2.75
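Similarly, the every-visit numbers can be verified with a short standalone check (again, every_visit_returns and the episode lists are illustrative, not from the lecture):

```python
# Every-visit check: one return per occurrence of the state, averaged over all of them.
episodes = [[('A', 3), ('A', 2), ('B', -4), ('A', 4), ('B', -3)],   # episode 1
            [('B', -2), ('A', 3), ('B', -3)]]                        # episode 2

def every_visit_returns(episode, state):
    """One undiscounted return for every occurrence of `state` in the episode."""
    return [sum(r for _, r in episode[i:])
            for i, (s, _) in enumerate(episode) if s == state]

for state in ('A', 'B'):
    gs = [g for ep in episodes for g in every_visit_returns(ep, state)]
    print(state, sum(gs) / len(gs))   # A -> 0.5, B -> -2.75
```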
