Lecture#5 Monte Carlo Methods Part I
3 Recap
In the previous lecture we used dynamic programming (DP) for computing optimal policies, namely value iteration and policy iteration.
We modelled the environment as a Markov decision process (MDP).
For that we used a transition model to describe the probability of moving
from one state to another.
The transition model was stored in a matrix T(s, a, s') and used to find the optimal value
function (state-value function) V* and the best policy π*.
The value of a state is defined as the expected cumulative future discounted reward
starting from that state.
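As a quick reminder of how the stored model was used, here is a minimal value-iteration sketch. The array shapes assumed for T and R (and the names themselves) are illustrative, not taken from the lecture's code.

```python
import numpy as np

# Minimal value-iteration sketch (illustrative).
# Assumes T has shape [S, A, S] with T[s, a, s'] = p(s' | s, a),
# and R has shape [S, A] holding the expected immediate reward.
def value_iteration(T, R, gamma=0.99, tol=1e-8):
    n_states, n_actions, _ = T.shape
    V = np.zeros(n_states)
    while True:
        # Bellman optimality backup: Q[s, a] = R[s, a] + gamma * sum_s' T[s, a, s'] * V[s']
        Q = R + gamma * (T @ V)          # shape [S, A]
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            V = V_new
            break
        V = V_new
    policy = Q.argmax(axis=1)            # greedy policy with respect to V*
    return V, policy
```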
4 Example of Transition Model from Frozen Lake Game
[Figure: the 8x8 Frozen Lake transition model, listing for each state and each action (0, 1, 2, 3) the transition probability, next state, and reward.]
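A short way to see this model in code (assuming the gymnasium package and the FrozenLake8x8-v1 environment id; attribute names may differ between library versions):

```python
import gymnasium as gym

# Inspect the tabular transition model of the 8x8 Frozen Lake environment.
env = gym.make("FrozenLake8x8-v1", is_slippery=True)
P = env.unwrapped.P                      # P[state][action] -> list of transitions

state, action = 0, 2                     # actions 0..3 = left, down, right, up
for prob, next_state, reward, done in P[state][action]:
    print(f"p={prob:.2f}  s'={next_state}  r={reward}  done={done}")
```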
It is not possible to compute V(s) this way when p(s', r | s, a) is unknown to
us.
6 Introduction
Let’s say we want to train a bot to learn how to play
chess. Consider converting the chess environment into
an MDP.
Now, depending on the positioning of pieces, this
environment will have many states (more than 10^50), as
well as a large number of possible actions. The model
of this environment is almost impossible to design!
One potential solution could be to repeatedly play a
complete game of chess and receive a positive reward
for winning, and a negative reward for losing, at the
end of each game.
This is called learning from experience.
8 Monte Carlo (MC) methods involve learning from experience.
Prediction: this type of task predicts the expected total reward from any given state, assuming the
policy π(a|s) is given.
(In other words) the policy π is given, and we calculate the value function v_π, with or without the model.
e.g., policy evaluation (as we have seen in DP).
Control: this type of task finds the policy π(a|s) that maximizes the expected total reward from
any given state.
(In other words) starting from some policy π, it finds the optimal policy π*.
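For reference, these are the standard textbook definitions of the two tasks, written in the usual notation:

```latex
% Prediction: estimate the value function of a fixed policy \pi
v_\pi(s) \doteq \mathbb{E}_\pi\!\left[ G_t \mid S_t = s \right]
       = \mathbb{E}_\pi\!\left[ \sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \;\middle|\; S_t = s \right]

% Control: find an optimal policy \pi^* whose value is maximal in every state
v_{\pi^*}(s) = v_*(s) \doteq \max_{\pi} v_\pi(s) \quad \text{for all } s
```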
16 Generalized Policy Iteration GPI
We use the term generalized policy iteration (GPI) to refer to the general
idea of letting policy-evaluation and policy-improvement processes
interact, independent of the granularity and other details of the two
processes.
Almost all reinforcement learning methods are well described as GPI. That is,
all have identifiable policies and value functions, with the policy always
being improved with respect to the value function and the value function
always being driven toward the value function for the policy, as suggested
by the diagram to the right.
If both the evaluation process and the improvement process stabilize, that
is, no longer produce changes, then the value function and policy must be
optimal.
The value function stabilizes only when it is consistent with the current policy,
and the policy stabilizes only when it is greedy with respect to the current
value function.
17 Generalized Policy Iteration GPI
Thus, both processes stabilize only when a policy
has been found that is greedy with respect to its
own evaluation function.
This implies that the Bellman optimality equation
holds, and thus that the policy and the value
function are optimal.
The evaluation and improvement processes in GPI
can be viewed as both competing and
cooperating. They compete in the sense that they
pull in opposing directions.
18 Generalized Policy Iteration GPI
Making the policy greedy with respect to the
value function typically makes the value
function incorrect for the changed policy, and
making the value function consistent with the
policy typically causes that policy no longer to
be greedy.
In the long run, however, these two processes
interact to find a single joint solution: the optimal
value function and an optimal policy.
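The whole interaction can be pictured as a simple loop (a sketch only; evaluate_policy and greedy_policy stand in for whatever evaluation and improvement procedures are used):

```python
# Generalized policy iteration as a loop: evaluation and improvement alternate
# until neither process produces any further change.
def generalized_policy_iteration(evaluate_policy, greedy_policy, initial_policy):
    policy = initial_policy
    while True:
        V = evaluate_policy(policy)        # evaluation: drive V toward v_pi
        new_policy = greedy_policy(V)      # improvement: make the policy greedy w.r.t. V
        if new_policy == policy:           # stable: policy is greedy w.r.t. its own value function
            return V, policy               # Bellman optimality holds, so V = v*, policy = pi*
        policy = new_policy
```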
19 Let's put all of that together
20 Monte Carlo(MC) Methods
Monte-Carlo (MC) methods are statistical techniques for estimating properties of complex systems via
random sampling.
Interestingly, in many cases it is possible to generate experiences sampled according to the desired probability
distributions but infeasible to obtain the distributions in explicit form.
MC for RL: finds optimal policies without an a priori model of the MDP, by random roll-outs and estimating
expected returns (i.e., the value).
Model-free RL.
MC for RL: learns from complete sample returns in episodic tasks:
uses value functions but not Bellman equations.
An important fact about Monte Carlo methods is that the estimates for each state are independent. The
estimate for one state does not build upon the estimate of any other state, unlike in DP. In other
words, Monte Carlo methods do not bootstrap.
In general, bootstrapping in RL means [updating estimates of the values of states based on estimates of
the values of successor states; that is, updating estimates on the basis of other estimates.]
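For contrast, the standard update targets (written here for reference, in the usual textbook form) make the difference concrete: a Monte Carlo update moves V(S_t) toward the complete sampled return G_t, whereas a bootstrapped update (e.g., one-step temporal-difference learning, covered in later lectures) moves it toward a target built from the current estimate of the successor state.

```latex
% Monte Carlo update: the target is the complete return G_t (no bootstrapping)
V(S_t) \leftarrow V(S_t) + \alpha \left[ G_t - V(S_t) \right],
\qquad G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots

% Bootstrapped (one-step TD) update: the target uses the estimate V(S_{t+1})
V(S_t) \leftarrow V(S_t) + \alpha \left[ R_{t+1} + \gamma V(S_{t+1}) - V(S_t) \right]
```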
21 Monte Carlo(MC) Methods
In particular, note that the computational expense of estimating the
value of a single state is independent of the number of states.
This can make Monte Carlo methods particularly attractive when
one requires the value of only one or a subset of states.
One can generate many sample episodes starting from the states of
interest, averaging returns from only these states, ignoring all others.
This is the third advantage Monte Carlo methods can have over DP
methods (after the ability to learn from actual experience and from
simulated experience).
22 Monte Carlo(MC) Methods
Monte Carlo methods are ways of solving the reinforcement learning problem based on
averaging sample returns.
To ensure that well-defined returns are available, here we define Monte Carlo methods only
for episodic tasks. That is, we assume experience is divided into episodes, and that all
episodes eventually terminate no matter what actions are selected. Only on the completion
of an episode are value estimates and policies changed.[Monte Carlo methods learn from
complete sample returns ]
Monte Carlo methods can thus be incremental in an episode-by-episode sense, but not in a
step-by-step (online) sense.
The term “Monte Carlo” is often used more broadly for any estimation method whose
operation involves a significant random component. Here we use it specifically for methods
based on averaging complete returns (as opposed to methods that learn from partial returns,
considered in the next lectures).
23 Monte Carlo(MC) Methods
Monte Carlo methods sample and average returns for each state–action pair.
Note, however, that the returns for different state–action pairs are not independent: the return after
taking an action in one state depends on the actions taken in later states in the same episode.
Because all the action selections are undergoing learning, the problem becomes non-stationary
from the point of view of the earlier state.
To handle the non-stationarity, we adapt the idea of generalized policy iteration (GPI): whereas
in DP we computed value functions from knowledge of the MDP, here we learn value
functions from sample returns obtained by interacting with the MDP.
The value functions and corresponding policies still interact to attain optimality in essentially
the same way (GPI). As in DP, first we consider the prediction problem (the computation
of v_π and q_π for a fixed arbitrary policy π), then policy improvement, and finally the
control problem and its solution by GPI. Each of these ideas taken from DP is extended to
the Monte Carlo case, in which only sample experience is available.
24 Monte Carlo(MC) Methods
In Monte Carlo (MC) we play an episode of the game starting from some random state
(not necessarily the beginning) until the end, record the states, actions and rewards that
we encountered, and then compute V(s) and Q(s) for each state we passed through.
We repeat this process by playing more episodes; after each episode we take the recorded
states, actions, and rewards and average them into the discovered values of V(s) and Q(s).
[One drawback of MC is that it can only be applied to episodic Markov decision processes,
where all episodes must terminate.]
In Monte Carlo there is no guarantee that we will visit all the possible states. Another
weakness of this method is that we need to wait until the episode ends before we can
update V(s) and Q(s), which is problematic in games that never end. A minimal sketch of this
procedure for the prediction case is given below.
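Here is a first-visit MC prediction sketch along these lines (illustrative only; generate_episode is a placeholder for whatever routine plays one full episode under the fixed policy and is not a specific library call):

```python
from collections import defaultdict

# First-visit Monte Carlo prediction sketch (illustrative).
# generate_episode() is assumed to return one complete episode as a list of
# (state, action, reward) tuples collected under the fixed policy.
def mc_prediction(generate_episode, num_episodes, gamma=0.9):
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)
    V = defaultdict(float)

    for _ in range(num_episodes):
        episode = generate_episode()
        G = 0.0
        # Walk the episode backwards, accumulating the discounted return G.
        for t in reversed(range(len(episode))):
            state, _, reward = episode[t]
            G = reward + gamma * G
            # First-visit rule: only count G for the earliest occurrence of the state.
            if state not in [s for s, _, _ in episode[:t]]:
                returns_sum[state] += G
                returns_count[state] += 1
                V[state] = returns_sum[state] / returns_count[state]
    return V
```

The same idea with (state, action) pairs as keys gives the action-value estimate Q(s, a).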
25 Monte Carlo(MC) Methods
Monte Carlo methods learn directly from experience
On-line: No model necessary and still attains optimality
Simulated: No need for a full model
[Often a simulator of a planning domain is available, or can be learned from data,
even when the domain can't be expressed in the MDP language.]
Example domains with simulators:
•Traffic simulators
•Robotics simulators
•Military campaign simulators
•Computer network simulators
•Emergency planning simulators (large-scale disaster and municipal)
•Sports domains (Madden Football)
•Board games (Go) and video games (RTS)
26 Monte Carlo(MC) Methods
[Formula on slide: V(s) is estimated as the average of the sample returns over episodes, where i indexes the episode and s indexes the state.]
The question is how do we get these sample returns? For that, we need to play a bunch of
episodes and generate them.
For every episode we play, we'll have a sequence of states and rewards. From these
rewards, we can calculate the return by definition, which is just the (discounted) sum of all
future rewards. A small worked example is given below.
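As a small worked example with made-up numbers: suppose one episode yields rewards R_1 = 0, R_2 = 0, R_3 = 1 and we use γ = 0.9. Then the return from the first state is

```latex
G_0 = R_1 + \gamma R_2 + \gamma^2 R_3 = 0 + 0.9 \cdot 0 + 0.81 \cdot 1 = 0.81
```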
28 Monte Carlo(MC) Methods
In state s the agent always produces the action a given by the policy π. The
goal of the agent in passive reinforcement learning is to learn the state
values, i.e., to learn the value function V_π(s) (and possibly the action model).
Sutton and Barto call this case MC prediction.
[In active reinforcement learning we need to learn both the optimal policy and the
state values (and possibly the action model).]
Sutton and Barto call this case MC control.