
EASWARI ENGINEERING COLLEGE

(AUTONOMOUS)
DEPARTMENT OF ARTIFICIAL INTELLIGENCE AND DATA SCIENCE

191AIC601T – REINFORCEMENT LEARNING

Unit III – Notes


(Monte Carlo Learning)

III YEAR - B.TECH

PREPARED BY: G.SIVASATHIYA, AP/AI&DS

APPROVED BY: HOD/AI&DS


MONTE CARLO LEARNING

In dynamic programming we need a model (the agent knows the MDP transitions and rewards) and the agent does planning (once the model is available, the agent only needs to plan its action in each state). There is no real learning by the agent in the dynamic programming method.

The Monte Carlo method, on the other hand, is a very simple concept: the agent learns about states and rewards by interacting with the environment. In this method the agent generates experience samples and then, based on the average return, a value is calculated for a state or state-action pair.

Below are the key characteristics of the Monte Carlo (MC) method:

 There is no model (the agent does not know the MDP state transitions)
 The agent learns from sampled experience
 It learns the state value vπ(s) under policy π as the average return experienced over all sampled episodes (value = average return)
 Values are updated only after a complete episode (because of this, convergence of the algorithm is slow and the update happens only once an episode is complete)
 There is no bootstrapping
 It can only be used for episodic problems

Consider a real-life analogy: Monte Carlo learning is like an annual examination, where the student completes an episode at the end of the year. Here, the result of the annual exam is like the return obtained by the student.
Now, if the goal of the problem is to find how students in a class score during a calendar year (which is the episode here), we can take the sample results of some students and then calculate the mean result to find the score for the class (don't take the analogy point by point, but on a holistic level you can get the essence of MC learning).
Similarly, we have TD learning, or temporal difference learning (TD learning updates the value at every time step and does not need to wait until the end of an episode to update values), which we will cover later. It can be thought of like a weekly or monthly examination (the student can adjust their performance based on the score (reward) received after every small interval, and the final score is the accumulation of all the weekly tests (total rewards)).

Value function = Expected Return

The expected return is the discounted sum of all rewards.

In the Monte Carlo method, instead of the expected return we use the empirical return that the agent has sampled while following the policy.
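In standard notation (the discount factor γ and the reward sequence Rt+1, Rt+2, ... are the usual definitions, not spelled out above), this reads:

Gt = Rt+1 + γ Rt+2 + γ² Rt+3 + ...   (discounted sum of rewards after time t)

vπ(s) = Eπ[ Gt | St = s ] ≈ (1/N(s)) Σi G(i),   where G(i) is the return of the i-th sampled visit to s.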

If we go back to our very first example of gem collection, the agent follows a policy and completes an episode; along the way, at each step, it collects rewards in the form of gems. To get the state value, the agent sums up all the gems collected in each episode starting from that state.

Refer to the diagram below, where 3 samples are collected starting from state S05. The total reward collected (the discount factor is taken as 1 for simplicity) in each episode is as follows:
Return(Sample 01) = 2 + 1 + 2 + 2 + 1 + 5 = 13 gems
Return(Sample 02) = 2 + 3 + 1 + 3 + 1 + 5 = 15 gems
Return(Sample 03) = 2 + 3 + 1 + 3 + 1 + 5 = 15 gems

Observed mean return (based on 3 samples) = (13 + 15 + 15)/3 = 14.33 gems

Thus the state value as per the Monte Carlo method, vπ(S05), is 14.33 gems, based on 3 samples following policy π.
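As a quick cross-check of this arithmetic, here is a minimal Python sketch (the per-step reward lists are simply the gem counts from the three samples above):

# Per-step rewards (gems) in each sampled episode starting from state S05
sample_rewards = [
    [2, 1, 2, 2, 1, 5],   # Sample 01 -> return 13
    [2, 3, 1, 3, 1, 5],   # Sample 02 -> return 15
    [2, 3, 1, 3, 1, 5],   # Sample 03 -> return 15
]

def episode_return(rewards, gamma=1.0):
    """Discounted sum of rewards for one episode."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

returns = [episode_return(r) for r in sample_rewards]
print(sum(returns) / len(returns))   # 14.33... = estimated v_pi(S05)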

Monte Carlo Backup diagram


There are two types of MC learning policy evaluation (prediction) methods:

First Visit Monte Carlo Method

In this case, only the first visit to a state within an episode is counted (even if the agent comes back to the same state multiple times in the episode, only the first visit is counted). Detailed steps are as below:

1. To evaluate state s, first set the number of visits N(s) = 0 and the total return TR(s) = 0 (these values are accumulated across episodes)
2. The first time-step t that state s is visited in an episode, increment the counter N(s) = N(s) + 1
3. Increment the total return TR(s) = TR(s) + Gt
4. The value is estimated by the mean return V(s) = TR(s)/N(s)
5. By the law of large numbers, V(s) -> vπ(s) (the true value under policy π) as N(s) approaches infinity

Refer to below diagram for better understanding of counter increment.
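A minimal Python sketch of these steps (the episode format, a list of (state, reward) pairs produced by following the policy, and the sample_episode() function are assumptions made for illustration, not something defined in these notes):

from collections import defaultdict

def mc_prediction(sample_episode, num_episodes, gamma=1.0, first_visit=True):
    """Monte Carlo policy evaluation (first-visit by default).

    sample_episode() is assumed to return one episode generated by the policy
    as a list of (state, reward) pairs: [(S0, R1), (S1, R2), ..., (S_{T-1}, R_T)].
    """
    N = defaultdict(int)      # N(s): number of counted visits
    TR = defaultdict(float)   # TR(s): total return
    V = defaultdict(float)    # V(s): estimated value (mean return)

    for _ in range(num_episodes):
        episode = sample_episode()

        # Backwards pass: compute the return G_t for every time step.
        G, returns = 0.0, [0.0] * len(episode)
        for t in reversed(range(len(episode))):
            G = episode[t][1] + gamma * G
            returns[t] = G

        # Time step of the first visit to each state in this episode.
        first_time = {}
        for t, (state, _) in enumerate(episode):
            first_time.setdefault(state, t)

        for t, (state, _) in enumerate(episode):
            if first_visit and first_time[state] != t:
                continue                     # count only the first visit
            N[state] += 1
            TR[state] += returns[t]
            V[state] = TR[state] / N[state]  # mean return
    return V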


Every Visit Monte Carlo Method

In this case, every visit to the state within an episode is counted. Detailed steps are as below:
1. To evaluate state s, first set the number of visits N(s) = 0 and the total return TR(s) = 0 (these values are accumulated across episodes)
2. Every time-step t that state s is visited in an episode, increment the counter N(s) = N(s) + 1
3. Increment the total return TR(s) = TR(s) + Gt
4. The value is estimated by the mean return V(s) = TR(s)/N(s)
5. By the law of large numbers, V(s) -> vπ(s) (the true value under policy π) as N(s) approaches infinity

Refer to below diagram for better understanding of counter increment.
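The every-visit variant differs from the first-visit sketch above only in that the first-visit check is skipped, so every occurrence of a state in an episode contributes one count and one return, e.g.:

V_every = mc_prediction(sample_episode, num_episodes=1000, first_visit=False)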


Usually MC is updated incrementally after every episode (there is no need to store values from old episodes; a running mean value for each state can be maintained and updated after every episode).

Update V(s) incrementally after each episode S1, A1, R2, ..., ST. For each state St with return Gt:

N(St) = N(St) + 1
V(St) = V(St) + (1/N(St)) (Gt - V(St))

Usually, in place of 1/N(St), a constant learning rate (α) is used and the above equation becomes:

V(St) = V(St) + α (Gt - V(St))
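As a small illustrative fragment (assuming episodes and per-step returns computed as in the sketch above), the constant-α form of the update looks like this:

def incremental_update(V, episode, returns, alpha=0.1):
    """For each visited state S_t: V(S_t) <- V(S_t) + alpha * (G_t - V(S_t)).

    V is assumed to be a defaultdict(float) as in the sketch above; episode is
    a list of (state, reward) pairs and returns[t] holds G_t for time step t.
    """
    for t, (state, _) in enumerate(episode):
        V[state] += alpha * (returns[t] - V[state])
    return V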

For policy improvement, the Generalized Policy Iteration (GPI) scheme is used to update the policy using the action-value function of the Monte Carlo method.
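As a rough sketch of the greedy improvement step (assuming action values are stored in a dict Q mapping (state, action) pairs to estimates; this helper is illustrative, not part of the original notes):

def greedy_policy(Q, actions):
    """For each state appearing in Q, pick the action with the highest estimated value."""
    policy = {}
    for state in {s for (s, _) in Q}:
        policy[state] = max(actions, key=lambda a: Q.get((state, a), float("-inf")))
    return policy

In practice an epsilon-greedy version of this policy is used during learning so that exploration continues.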

Monte Carlo methods have the following advantages:

 Zero bias
 Good convergence properties (even with function approximation)
 Not very sensitive to initial values
 Very simple to understand and use

But they have the following limitations as well:

 MC must wait until the end of an episode before the return is known
 MC has high variance
 MC can only learn from complete sequences
 MC only works for episodic (terminating) environments

Even though the MC method takes time, it is an important tool for any reinforcement learning practitioner.
OFF-POLICY MONTE CARLO WITH IMPORTANCE SAMPLING

Off Policy Learning


 Because of the exploration-exploitation trade-off, the agent must sometimes take sub-optimal exploratory actions, for which it may receive less reward. One way of exploring is to use an epsilon-greedy policy, where the agent takes a non-greedy action with a small probability.
 In an on-policy method, improvement and evaluation are done on the policy which is used to select actions.
 In an off-policy method, improvement and evaluation are done on a policy different from the one used to select actions. The policy learned is "off" the policy used for action selection while gathering episodes.
o Target policy π(a|s): the value function being learned is that of π(a|s). We want the target policy to be the optimal policy π*(a|s). The target policy will be used for action selection after the learning process is complete (deployment).
o Behavior policy b(a|s): the behavior policy is used for action selection while gathering episodes to train the agent. This is generally an exploratory policy.

Importance Sampling
 We have a random variable X ∼ b, sampled from the behavior policy distribution b. We want to estimate the expected value of X with respect to the target distribution π, i.e. Eπ[X]. A plain sample average would instead give the expected value under b, Eb[X].
Define the importance-sampling ratio ρ(x) = π(x)/b(x) and form the new random variable Xρ(X).

Then Eπ[X] = Σx x π(x) = Σx x ρ(x) b(x) = Eb[Xρ(X)]. Now we have an expectation under b instead of π, which can be estimated by averaging xρ(x) over samples drawn from b.
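A small numeric sketch of this identity (the two discrete distributions below are arbitrary illustrative choices, not taken from the notes):

import random

outcomes = [1.0, 2.0, 3.0]
pi = [0.7, 0.2, 0.1]   # target distribution pi(x)
b = [0.2, 0.3, 0.5]    # behavior distribution b(x); b(x) > 0 wherever pi(x) > 0

exact = sum(x * p for x, p in zip(outcomes, pi))   # E_pi[X] computed directly

random.seed(0)
idx = random.choices(range(len(outcomes)), weights=b, k=100_000)          # samples from b
estimate = sum(outcomes[i] * pi[i] / b[i] for i in idx) / len(idx)        # average of x * rho(x)

print(exact, estimate)   # the importance-sampling estimate approaches E_pi[X]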

Off-Policy Monte Carlo Prediction with Importance Sampling
Off-policy control methods use two policies on the same episode: one that is learned about and becomes the optimal policy, called the target policy, and one that is more exploratory and is used to generate behavior, called the behavior policy.

In this case we say that learning is from data "off" the target policy, and the overall process is termed off-policy learning.

Here we only consider the prediction problem, in which both target and behavior policies are fixed. We require that π(a|s) > 0 implies b(a|s) > 0, which is called the assumption of coverage, to assure that every action taken under π is also taken under b. It follows from coverage that b must be stochastic in states where it is not identical to π. The target policy π itself may be deterministic.
Incremental Implementation of Off-policy MC

The off-policy Monte Carlo algorithm combines the incremental value updates described earlier with these importance-sampling ratios: returns sampled while following the behavior policy b are weighted by the ratio of action probabilities under π and b before being averaged, as sketched below.
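A minimal sketch of off-policy every-visit MC prediction with ordinary importance sampling (the episode format, lists of (state, action, reward) triples generated by b, and the probability functions pi_prob(s, a) and b_prob(s, a) are assumptions made for illustration):

from collections import defaultdict

def off_policy_mc_prediction(episodes, pi_prob, b_prob, gamma=1.0):
    """Estimate v_pi from episodes generated by the behavior policy b.

    Each episode is a list of (state, action, reward) triples
    [(S0, A0, R1), (S1, A1, R2), ...]. pi_prob(s, a) and b_prob(s, a) return
    the probability of taking action a in state s under the target and behavior
    policies (coverage assumed: pi_prob(s, a) > 0 implies b_prob(s, a) > 0).
    """
    N = defaultdict(int)          # number of visits to each state
    total = defaultdict(float)    # sum of importance-weighted returns
    V = defaultdict(float)        # value estimate for the target policy pi

    for episode in episodes:
        G = 0.0    # return from time t to the end of the episode
        rho = 1.0  # importance-sampling ratio from time t to the end
        for state, action, reward in reversed(episode):
            G = reward + gamma * G
            rho *= pi_prob(state, action) / b_prob(state, action)
            N[state] += 1
            total[state] += rho * G   # ordinary importance sampling: weight the return
            V[state] = total[state] / N[state]
    return V

A weighted importance-sampling variant would divide by the sum of the ratios instead of the visit count, trading some bias for much lower variance.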
