
Lecture #5:

1
Monte Carlo Methods Part I

PREPARED BY: ENG. NADA JONIDE, SECOND SEMESTER 2024


2 Just for Fun

3 Recap

In the previous lecture we studied two dynamic programming algorithms for computing optimal policies, namely value iteration and policy iteration.
We modelled the environment as a Markov decision process (MDP).
We used a transition model to describe the probability of moving from one state to another.
The transition model was stored in a matrix T(s, a, s′) and used to find the value function (state-value function) 𝑉∗ and the best policy 𝜋∗.
The value of a state is defined as the expected cumulative future discounted reward starting from that state.
4 Example of a Transition Model from the Frozen Lake Game
 [Figure: the 8x8 Frozen Lake grid.]
 [Figure: the transition model of the 8x8 Frozen Lake for state 0 under actions 0, 1, 2, 3. For each action, every entry lists the transition probability, the next state, the reward, and a done flag: False for being in a state that is not a hole or an end state, True for falling in a hole or reaching the end state. See the code sketch below.]
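As an illustration, a minimal sketch of inspecting this transition model programmatically, assuming the Gymnasium implementation of Frozen Lake (the environment id FrozenLake8x8-v1 and the unwrapped.P attribute are Gymnasium-specific assumptions, and the exact API may differ between versions):

import gymnasium as gym

env = gym.make("FrozenLake8x8-v1", is_slippery=True)

# The toy-text environments expose the transition model as a nested dict:
# P[state][action] -> list of (probability, next_state, reward, done) tuples.
P = env.unwrapped.P

for action in range(env.action_space.n):   # 0 = left, 1 = down, 2 = right, 3 = up
    for prob, next_state, reward, done in P[0][action]:
        print(f"a={action}  p={prob:.2f}  s'={next_state}  r={reward}  done={done}")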
5

What if the transition model was missing???


If you recall the formula of the State-Value function
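For reference, this is the Bellman expectation equation used in the DP lectures:

v_\pi(s) = \sum_{a} \pi(a \mid s) \sum_{s',\, r} p(s', r \mid s, a)\,\bigl[\, r + \gamma\, v_\pi(s') \,\bigr]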

It is not possible to compute V(s) because p(s′, r | s, a) is now unknown to us.
6 Introduction
 Let’s say we want to train a bot to learn how to play
chess. Consider converting the chess environment into
an MDP.
 Now, depending on the positioning of pieces, this
environment will have many states (more than 10^50), as
well as a large number of possible actions. The model
of this environment is almost impossible to design!
 One potential solution could be to repeatedly play a
complete game of chess and receive a positive reward
for winning, and a negative reward for losing, at the
end of each game.
7 Introduction
This is called learning from experience.
8 The Monte Carlo (MC) method involves learning from experience.
 What does that mean? It means learning through sequences of states, actions, and rewards.
 Suppose our agent is in state s1, takes an action a1, gets a reward r1, and moves to state s2. This whole sequence is an experience.
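As a minimal sketch, assuming the Gymnasium Frozen Lake environment from the earlier example, one such experience (a full episode) can be collected as a list of (state, action, reward) transitions:

import gymnasium as gym

env = gym.make("FrozenLake8x8-v1")
state, _ = env.reset()

episode = []                               # list of (state, action, reward)
done = False
while not done:
    action = env.action_space.sample()     # random policy, purely for illustration
    next_state, reward, terminated, truncated, _ = env.step(action)
    episode.append((state, action, reward))
    state = next_state
    done = terminated or truncated

print(episode)   # the sequence s1, a1, r1, s2, a2, r2, ... of this experience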
9 Introduction
 Always keep in mind that our goal is to find the policy that maximizes the reward for an agent. We said previously that an analytical solution is hard to obtain, so we fall back on iterative solutions such as Dynamic Programming. However, DP has its own problems.
 An alternative solution is to play a sufficient number of episodes of the game and extract the information needed. [P.S.: when the agent–environment interaction breaks naturally into subsequences, we call these subsequences episodes.]
 Notice that in DP we didn't play the game because we knew its dynamics; in other words, at each state we knew the probabilities of going to another state when taking a certain action, and we knew what the reward was going to be. Based on that, we were able to do our calculations.
 In this new scenario, we won't know these data unless we play the game. This is a key difference between Monte Carlo and Dynamic Programming.
10 Introduction
We will now look at solving an MDP problem when part of the model is unknown.
In this case, our agent must learn from the environment by interacting with it and collecting experiences, or samples.
In doing so, the agent carries out policy evaluation and policy improvement, and can obtain the optimal policy.
Since the theory to support this approach comes from the Monte Carlo method, let's start by discussing Monte Carlo learning.
11 Some Concepts Are Needed
Let's define a couple of concepts so that the rest will be easy to understand.
Now, let's give a proper definition of model-free reinforcement learning, and in particular of passive and active reinforcement learning.
In model-free reinforcement learning:
The first thing we miss is the transition model. In fact, the name model-free stands for transition-model-free.
The second thing we miss is the reward function R(s), which gives the agent the reward associated with a particular state.
12 Passive & Active
 In the passive approach, we have a policy π which can be used by the agent to move in the environment. [Passive: assume the agent is already following a policy, so there is no action choice to be made.]
 In the active approach, it is possible to estimate the optimal policy while moving in the environment.
13 Prediction and Control
 There are two types of tasks in RL.
 Prediction: this type of task predicts the expected total reward from any given state, assuming the policy 𝜋(𝑎|𝑠) is given.
 (In other words) policy π is given, and it calculates the value function 𝑣_𝜋, with or without the model.
 ex: policy evaluation (as we have seen in DP).
 Control: this type of task finds the policy 𝜋(𝑎|𝑠) that maximizes the expected total reward from any given state.
 (In other words) some policy π is given, and it finds the optimal policy 𝜋∗.
 ex: policy improvement.
 Policy iteration is the combination of both to find the optimal policy.
 Just as in supervised learning we have regression and classification tasks, in reinforcement learning we have prediction and control tasks.
14 On-Policy and Off-Policy
 There are two types of policy learning methods.
 On-policy learning: it learns on the job, which means it evaluates or improves the policy that is used to make the decisions.
 (In other words) it directly learns a policy which gives you decisions about which action to take in some state.
 Off-policy learning: it evaluates one policy (the target policy) while following another policy (the behavior policy),
 just as we learn to do something while observing others doing the same thing.
 The target policy may be deterministic (ex: greedy) while the behavior policy is stochastic (see the sketch below).
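As a minimal sketch, a deterministic greedy target policy and a stochastic ε-greedy behavior policy can both be read off the same (hypothetical) action-value table Q:

import numpy as np

rng = np.random.default_rng(0)

def greedy_target_policy(Q, state):
    """Deterministic target policy: always pick the best-known action."""
    return int(np.argmax(Q[state]))

def epsilon_greedy_behavior_policy(Q, state, epsilon=0.1):
    """Stochastic behavior policy: explore with probability epsilon."""
    if rng.random() < epsilon:
        return int(rng.integers(Q.shape[1]))    # explore: random action
    return int(np.argmax(Q[state]))             # exploit: greedy action

Q = rng.random((5, 4))                          # hypothetical 5-state, 4-action table
print(greedy_target_policy(Q, 0), epsilon_greedy_behavior_policy(Q, 0))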
15 Generalized Policy Iteration GPI
 Policy iteration consists of two simultaneous, interacting processes, one
making the value function consistent with the current policy (policy
evaluation), and the other making the policy greedy with respect to the
current value function (policy improvement).
 In policy iteration, these two processes alternate, each completing
before the other begins, but this is not really necessary.
 In value iteration, for example, only a single iteration of policy evaluation
is performed in between each policy improvement.
 In asynchronous DP methods, the evaluation and improvement processes
are interleaved at an even finer grain.
 In some cases a single state is updated in one process before returning to
the other. As long as both processes continue to update all states, the
ultimate result is typically the same—convergence to the optimal value
function and an optimal policy.
16 Generalized Policy Iteration GPI

 We use the term generalized policy iteration (GPI) to refer to the general
idea of letting policy-evaluation and policy-improvement processes
interact, independent of the granularity and other details of the two
processes.
 Almost all reinforcement learning methods are well described as GPI. That is,
all have identifiable policies and value functions, with the policy always
being improved with respect to the value function and the value function
always being driven toward the value function for the policy, as suggested
by the diagram to the right.
 If both the evaluation process and the improvement process stabilize, that
is, no longer produce changes, then the value function and policy must be
optimal.
 The value function stabilizes only when it is consistent with the current policy,
and the policy stabilizes only when it is greedy with respect to the current
value function.
17 Generalized Policy Iteration GPI
Thus, both processes stabilize only when a policy
has been found that is greedy with respect to its
own evaluation function.
This implies that the Bellman optimality equation
holds, and thus that the policy and the value
function are optimal.
The evaluation and improvement processes in GPI
can be viewed as both competing and
cooperating. They compete in the sense that they
pull in opposing directions.
18 Generalized Policy Iteration GPI
Making the policy greedy with respect to the
value function typically makes the value
function incorrect for the changed policy, and
making the value function consistent with the
policy typically causes that policy no longer to
be greedy.
In the long run, however, these two processes
interact to find a single joint solution: the optimal
value function and an optimal policy.
19
Let's put all of that together
20 Monte Carlo (MC) Methods
 Monte Carlo (MC) methods are statistical techniques for estimating properties of complex systems via random sampling (see the small example after this slide).
 Interestingly, in many cases it is possible to generate experiences sampled according to the desired probability distributions, but infeasible to obtain the distributions in explicit form.
 MC for RL: finds optimal policies without an a priori model of the MDP, by random roll-outs and estimating expected returns (i.e., the value).
 Model-free RL.
 MC for RL: learns from complete sample returns in episodic tasks:
 it uses value functions but not Bellman equations.
 An important fact about Monte Carlo methods is that the estimates for each state are independent. The estimate for one state does not build upon the estimate of any other state, as is the case in DP. In other words, Monte Carlo methods do not bootstrap.
 In general, bootstrapping in RL means updating estimates of the values of states based on estimates of the values of successor states; that is, updating estimates on the basis of other estimates.
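A tiny example of this idea: estimating an expectation purely by averaging random samples. Here the quantity estimated is E[X²] for X ~ Uniform(0, 1), whose true value is 1/3:

import random

random.seed(42)
n_samples = 100_000

# Monte Carlo estimate of E[X^2]: average many random samples of X^2.
estimate = sum(random.random() ** 2 for _ in range(n_samples)) / n_samples
print(estimate)   # close to the true value 1/3; improves as n_samples grows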
21 Monte Carlo (MC) Methods
In particular, note that the computational expense of estimating the
value of a single state is independent of the number of states.
This can make Monte Carlo methods particularly attractive when
one requires the value of only one or a subset of states.
One can generate many sample episodes starting from the states of
interest, averaging returns from only these states, ignoring all others.
This is a third advantage Monte Carlo methods can have over DP
methods (after the ability to learn from actual experience and from
simulated experience).
22 Monte Carlo (MC) Methods
 Monte Carlo methods are ways of solving the reinforcement learning problem based on
averaging sample returns.
 To ensure that well-defined returns are available, here we define Monte Carlo methods only
for episodic tasks. That is, we assume experience is divided into episodes, and that all
episodes eventually terminate no matter what actions are selected. Only on the completion
of an episode are value estimates and policies changed.[Monte Carlo methods learn from
complete sample returns ]
 Monte Carlo methods can thus be incremental in an episode-by-episode sense, but not in a
step-by-step (online) sense.
 The term “Monte Carlo” is often used more broadly for any estimation method whose
operation involves a significant random component. Here we use it specifically for methods
based on averaging complete returns (as opposed to methods that learn from partial returns,
considered in the next lectures).
23 Monte Carlo (MC) Methods
 Monte Carlo methods sample and average returns for each state–action pair.
 Note that the return after taking an action in one state depends on the actions taken in later states in the same episode.
 Because all the action selections are undergoing learning, the problem becomes non-stationary from the point of view of the earlier state.
 To handle the non-stationarity, we adapt the idea of generalized policy iteration (GPI). Whereas in DP we computed value functions from knowledge of the MDP, here we learn value functions from sample returns obtained from the MDP.
 The value functions and corresponding policies still interact to attain optimality in essentially the same way (GPI). As in DP, first we consider the prediction problem (the computation of 𝑣_𝜋 and 𝑞_𝜋 for a fixed arbitrary policy 𝜋), then policy improvement, and, finally, the control problem and its solution by GPI. Each of these ideas taken from DP is extended to the Monte Carlo case, in which only sample experience is available.
24 Monte Carlo (MC) Methods
 In Monte Carlo (MC) we play an episode of the game starting from some random state (not necessarily the beginning) till the end, record the states, actions, and rewards that we encountered, and then compute V(s) and Q(s) for each state we passed through (a small sketch of this procedure follows this slide).
 We repeat this process by playing more episodes, and after each episode we get the states, actions, and rewards and average the values of the discovered V(s) and Q(s).
 [One drawback of MC is that it can only be applied to episodic Markov Decision Processes, where all episodes must terminate.]
 In Monte Carlo there is no guarantee that we will visit all the possible states; another weakness of this method is that we need to wait until the game ends to be able to update our V(s) and Q(s), which is problematic in games that never end.
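A minimal sketch of this procedure for state values, as first-visit Monte Carlo prediction; the helper generate_episode is an assumed function that returns one complete episode as a list of (state, action, reward) tuples, as collected earlier:

from collections import defaultdict

def mc_prediction(generate_episode, policy, num_episodes, gamma=0.99):
    returns = defaultdict(list)   # state -> list of observed sample returns
    V = defaultdict(float)        # state -> current value estimate (average of returns)

    for _ in range(num_episodes):
        episode = generate_episode(policy)          # [(state, action, reward), ...]
        states = [s for s, _, _ in episode]
        G = 0.0
        # Walk the episode backwards, accumulating the discounted return.
        for t in reversed(range(len(episode))):
            state, _, reward = episode[t]
            G = reward + gamma * G
            if state not in states[:t]:             # first visit of this state
                returns[state].append(G)
                V[state] = sum(returns[state]) / len(returns[state])
    return V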
25 Monte Carlo (MC) Methods
 Monte Carlo methods learn directly from experience.
 On-line: no model is necessary, and it still attains optimality.
 Simulated: no need for a full model.
 [Often a simulator of a planning domain is available, or can be learned from data, even when the domain can't be expressed in the MDP language.]
 Example domains with simulators: traffic simulators, robotics simulators, military campaign simulators, computer network simulators, emergency planning simulators (large-scale disaster and municipal), sports domains (Madden Football), board games (Go), and video games (RTS).
26 Monte Carlo (MC) Methods
 Remember: the value function of a state s under a policy 𝜋, denoted 𝑣_𝜋(𝑠), is the expected return 𝐺_𝑡 when starting in s and following 𝜋 thereafter.
 For MDPs, we defined 𝑣_𝜋(𝑠) formally by:

v_\pi(s) = \mathbb{E}_\pi\left[ G_t \mid S_t = s \right] = \mathbb{E}_\pi\left[ \sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \;\middle|\; S_t = s \right], \quad \text{for all } s \in \mathcal{S}
27 Monte Carlo (MC) Methods
 We know that we can estimate any expected value simply by adding up samples and
dividing by the total number of samples:

 i – Episode index
 s – Index of state
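A plausible reconstruction of the estimate referred to above (the slide's exact notation is assumed here; N denotes the number of sampled returns collected for state s):

V_\pi(s) \approx \frac{1}{N} \sum_{i=1}^{N} G_i(s)

where G_i(s) is the return observed from state s in the i-th sampled episode.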
 The question is: how do we get these sample returns? For that, we need to play a bunch of episodes and generate them.
 For every episode we play, we'll have a sequence of states and rewards. And from these rewards, we can calculate the return by definition, which is just the discounted sum of all future rewards (see the short snippet below).
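For example, a short sketch that turns an episode's reward sequence into the return at every time step, using G_t = r_{t+1} + γ·G_{t+1}:

def compute_returns(rewards, gamma=0.99):
    returns = []
    G = 0.0
    for r in reversed(rewards):    # walk the rewards backwards
        G = r + gamma * G
        returns.append(G)
    returns.reverse()              # returns[t] is now the return G_t
    return returns

print(compute_returns([0.0, 0.0, 1.0], gamma=0.9))   # [0.81, 0.9, 1.0]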
28 Monte Carlo (MC) Methods
 In state s the agent always produces the action a given by the policy π. The goal of the agent in passive reinforcement learning is to learn the state values, that is, to learn the value function V^π(s) (and maybe an action model).
 Sutton and Barto call this case MC for prediction.
 [In the active case we need to learn both the optimal policy and the state values (and maybe an action model).]
 Sutton and Barto call this case MC for control.
