Solution to Assignment_4_Dynamic Programming
for all s ∈ S.
Value Iteration Algorithm:
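A minimal sketch of this algorithm in Python, assuming a finite MDP whose dynamics p(s', r | s, a) are given as a hypothetical lookup table P[s][a] of (probability, next state, reward) triples:

def value_iteration(P, gamma=0.9, theta=1e-8):
    """Synchronous value iteration for a finite MDP.
    P[s][a] is assumed to be a list of (prob, next_state, reward) triples,
    i.e. the dynamics p(s', r | s, a) as a lookup table (hypothetical format)."""
    n_states = len(P)
    V = [0.0] * n_states
    while True:
        delta = 0.0
        V_new = list(V)                       # a full sweep updates every state
        for s in range(n_states):
            V_new[s] = max(
                sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
                for a in range(len(P[s]))
            )
            delta = max(delta, abs(V_new[s] - V[s]))
        V = V_new
        if delta < theta:                     # stop when a sweep changes V very little
            return V

Every iteration of the outer loop is one full sweep of the state set, which is exactly the cost discussed in the answer below.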
Answer:
A major drawback of the DP methods is that they involve operations over the
entire state set of the MDP, that is, they require sweeps of the state set. If the
state set is very large, then even a single sweep is expensive.
For example, the game of backgammon has over 10^20 states. Even if we could
perform the value iteration update on a million states per second, it would take
over a thousand years to complete a single sweep.
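As a rough check of that arithmetic:
10^20 states / 10^6 state updates per second = 10^14 seconds ≈ 3 × 10^6 years,
so a single sweep would indeed take far more than a thousand years.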
Asynchronous DP algorithms are not organized in terms of systematic sweeps
of the state set. These algorithms update the values of states in any order
whatsoever, using whatever values of other states happen to be available. The
values of some states may be updated several times before the values of others
are updated once.
Asynchronous DP algorithms allow great flexibility in selecting states to
update.
For example, one version of asynchronous value iteration updates the value, in
place, of only one state, s_k, on each step k, using the value iteration update, as
sketched below.
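A minimal sketch of that in-place, one-state-at-a-time variant, reusing the same hypothetical P[s][a] table as the value iteration sketch above:

import random

def async_value_iteration(P, n_steps=100_000, gamma=0.9, seed=0):
    """Asynchronous value iteration: the value of a single state s_k is
    updated in place on each step k, in any order whatsoever.
    P[s][a] is the same hypothetical (prob, next_state, reward) table."""
    rng = random.Random(seed)
    V = [0.0] * len(P)
    for _ in range(n_steps):
        # here the state is picked at random; any ordering that keeps
        # visiting every state still converges to the optimal values
        s = rng.randrange(len(P))
        V[s] = max(
            sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
            for a in range(len(P[s]))
        )
    return V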
It is possible to intermix policy evaluation and value iteration updates to
produce an asynchronous truncated policy iteration.
This flexibility just means that an algorithm does not need to get locked into any long sweep
before it can make progress improving a policy. We can try to take advantage of
this flexibility by selecting the states to which we apply updates so as to
improve the algorithm’s rate of progress. We can try to order the updates to let
value information propagate from state to state in an efficient way. Some states
may not need their values updated as often as others. We might even try to skip
updating some states entirely if they are not relevant to optimal behavior.
Asynchronous algorithms also make it easier to intermix computation with real-
time interaction. To solve a given MDP, we can run an iterative DP algorithm at
the same time that an agent is actually experiencing the MDP. The agent’s
experience can be used to determine the states to which the DP algorithm
applies its updates. At the same time, the latest value and policy information
from the DP algorithm can guide the agent’s decision making. For example, we
can apply updates to states as the agent visits them.
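A sketch of that interleaving, assuming a hypothetical environment object env with reset() and step(action) methods alongside the same hypothetical dynamics table P[s][a]:

def act_and_update(env, P, V, n_steps=10_000, gamma=0.9):
    """Intermix DP computation with real-time interaction: the value
    iteration update is applied only to the states the agent actually
    visits, while the latest V guides the agent's (greedy) decisions.
    env is a hypothetical environment with reset() -> state and
    step(action) -> (next_state, done); P is the same dynamics table."""
    s = env.reset()
    for _ in range(n_steps):
        # update the value of the state currently being visited
        V[s] = max(
            sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
            for a in range(len(P[s]))
        )
        # act greedily with respect to the latest value estimates
        a = max(range(len(P[s])),
                key=lambda a: sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a]))
        s, done = env.step(a)
        if done:
            s = env.reset()
    return V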
Generalised Policy Iteration
Once a policy, π, has been improved using Vπ to yield a better policy, π’, we
can then compute Vπ’ and improve it again to yield an even better π’’. We can
thus obtain a sequence of monotonically improving policies and value
functions:
π0 → Vπ0 → π1 → Vπ1 → π2 → ... → π* → V*,
where each arrow alternates a policy evaluation step with a policy improvement
step, and π* and V* denote the optimal policy and value function.
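A minimal sketch of this evaluate-and-improve cycle (classical policy iteration), again using the hypothetical P[s][a] dynamics table from the earlier sketches:

def policy_iteration(P, gamma=0.9, theta=1e-8):
    """Alternate a policy evaluation phase (make V consistent with pi) with
    a policy improvement phase (make pi greedy with respect to V) until the
    policy stops changing.
    P[s][a] is the same hypothetical (prob, next_state, reward) table."""
    n_states = len(P)
    pi = [0] * n_states                       # start from an arbitrary policy
    V = [0.0] * n_states
    while True:
        # policy evaluation: compute V_pi for the current policy
        while True:
            delta = 0.0
            for s in range(n_states):
                v = sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][pi[s]])
                delta = max(delta, abs(v - V[s]))
                V[s] = v
            if delta < theta:
                break
        # policy improvement: make pi greedy with respect to V
        stable = True
        for s in range(n_states):
            best = max(range(len(P[s])),
                       key=lambda a: sum(p * (r + gamma * V[s2])
                                         for p, s2, r in P[s][a]))
            if best != pi[s]:
                pi[s] = best
                stable = False
        if stable:                            # no state changed: the policy is stable
            return pi, V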
5. Explain in detail the Monte Carlo method. How can it be applied to solve RL
problems?
Answer:
Monte Carlo method:
Monte Carlo methods are the first learning methods we consider for estimating
value functions and discovering optimal policies. Unlike the previous methods
(MDP, DP), here we do not assume complete knowledge of the
environment.
Monte Carlo methods require only experience—sample sequences of
states, actions, and rewards from actual or simulated interaction with an
environment.
Learning from actual experience is striking because it requires no prior
knowledge of the environment’s dynamics, yet can still attain optimal
behavior.
Learning from simulated experience is also powerful. Although a model
is required, the model need only generate sample transitions, not the
complete probability distributions of all possible transitions that are
required for dynamic programming (DP).
Consider a real-life analogy: Monte Carlo learning is like an annual examination,
where a student completes an episode at the end of the year. Here, the result of
the annual exam is like the return obtained by the student. Now, if the goal of the
problem is to find how the students of a class score during a calendar year (which
is an episode here), we can take the sample results of some students and then
calculate the mean result to estimate the score for the class, as sketched below.
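A minimal sketch of that computation, first-visit Monte Carlo prediction in Python, assuming the sample episodes are supplied as lists of (state, reward) pairs (a hypothetical input format; the code that generates the episodes is not shown):

from collections import defaultdict

def first_visit_mc_prediction(episodes, gamma=1.0):
    """First-visit Monte Carlo prediction: V(s) is estimated as the average
    of the returns observed following the first visit to s in each episode.
    episodes: iterable of sample episodes, each a list of (state, reward)
    pairs, where reward is the reward received on leaving that state
    (a hypothetical input format chosen for this sketch)."""
    returns = defaultdict(list)               # state -> sample returns
    for episode in episodes:
        # compute the return G_t from every time step, working backwards
        G = 0.0
        G_at = [0.0] * len(episode)
        for t in range(len(episode) - 1, -1, -1):
            _, r = episode[t]
            G = r + gamma * G
            G_at[t] = G
        # record only the return following the FIRST visit to each state
        seen = set()
        for t, (s, _) in enumerate(episode):
            if s not in seen:
                seen.add(s)
                returns[s].append(G_at[t])
    # like averaging the sampled exam results to estimate the class score
    return {s: sum(gs) / len(gs) for s, gs in returns.items()}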
Because each value estimate is an average of actual sample returns, the Monte
Carlo estimate of a state's value has zero bias, although its variance can be high.