Solution to Assignment 4: Dynamic Programming


Assignment 4: Dynamic Programming

1. What is Dynamic Programming? Elaborate on the efficiency of dynamic programming.
Answer:
Dynamic Programming:
The term dynamic programming (DP) refers to a collection of algorithms that
can be used to compute optimal policies given a perfect model of the
environment as a Markov decision process (MDP). Classical DP algorithms are
of limited utility in reinforcement learning both because of their assumption of a
perfect model and because of their great computational expense, but they are
still important theoretically.
We usually assume that the environment is a finite MDP. That is, we assume
that its state, action, and reward sets, S, A, and R, are finite, and that its
dynamics are given by a set of probabilities p(s', r | s, a), for all s ∈ S, a ∈ A(s), r ∈ R, and s' ∈ S+ (S+ is S plus a terminal state if the problem is episodic).
Although DP ideas can be applied to problems with continuous state and action spaces, the resulting solutions are only approximate.
The key idea of DP, and of reinforcement learning generally, is the use of value functions to organize and structure the search for good policies. DP can be used to compute the value functions and to obtain optimal policies once we have found the optimal value functions, v* or q*, which satisfy the Bellman optimality equations:

v*(s) = max_a Σ_{s', r} p(s', r | s, a) [ r + γ v*(s') ]
q*(s, a) = Σ_{s', r} p(s', r | s, a) [ r + γ max_{a'} q*(s', a') ]

for all s ∈ S, a ∈ A(s), r ∈ R, and s' ∈ S+ (S+ is S plus a terminal state if the problem is episodic).

Efficiency of Dynamic Programming:

DP methods find an optimal policy in time polynomial in the number of states and actions. If n and k denote the number of states and actions, this means that a DP method requires a number of computational operations that is less than some polynomial function of n and k. A DP method is therefore guaranteed to find an optimal policy in polynomial time even though the total number of (deterministic) policies is k^n, so DP is exponentially faster than any direct search in policy space, because direct search would have to examine each policy exhaustively. Linear programming methods can also be used to solve MDPs, but they become impractical at a much smaller number of states than DP methods do, so for the largest problems, with very large state spaces, only DP methods (rather than linear programming or direct search) are feasible.
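As a rough illustration of this gap, the short Python snippet below compares the number of deterministic policies that a direct search would have to enumerate with the cost of a single synchronous DP sweep; the concrete values of n and k are hypothetical and chosen only for illustration.

# Illustrative comparison (hypothetical n and k, not taken from the assignment)
n, k = 20, 4                 # number of states and actions

num_policies = k ** n        # deterministic policies a direct search must consider
ops_per_sweep = n * k * n    # one synchronous sweep: every state, every action,
                             # summing over (at most) every successor state

print(f"deterministic policies: {num_policies:,}")    # 1,099,511,627,776
print(f"operations per DP sweep: {ops_per_sweep:,}")  # 1,600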
In practice, DP methods can be used with today’s computers to solve MDPs with millions of states. Both policy iteration and value iteration are widely used, and in practice these methods usually converge much faster than their theoretical worst-case bounds suggest, particularly if they are started with good initial value functions or policies.
Asynchronous DP methods are often preferred for problems with large state
spaces. To complete even one sweep of a synchronous method requires
computation and memory for every state.
2. Write and explain the value iteration method.
Answer:
Value Iteration:
One drawback to policy iteration is that each of its iterations involves policy evaluation, which is itself a protracted (lengthy) iterative computation requiring multiple sweeps through the state set. If policy evaluation is done iteratively, then convergence exactly to Vπ occurs only in the limit. Must we wait for exact convergence? In the small gridworld example used to illustrate policy evaluation, for instance, iterations beyond the first three have no effect on the corresponding greedy policy.
In fact, the policy evaluation step of policy iteration can be truncated in several ways without losing the convergence guarantees of policy iteration. One important special case is when policy evaluation is stopped after just one sweep (one update of each state).
This algorithm is called value iteration. It can be written as a particularly simple update operation that combines the policy improvement and truncated policy evaluation steps:

V_{k+1}(s) = max_a Σ_{s', r} p(s', r | s, a) [ r + γ V_k(s') ]

for all s ∈ S.
Value Iteration Algorithm:
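A minimal Python sketch of this algorithm is given below. The tabular representation (P[s][a] as a list of (probability, next state) pairs and R[s][a] as the expected immediate reward) and the stopping threshold theta are assumptions made for illustration, not part of the standard algorithm statement.

import numpy as np

def value_iteration(P, R, gamma=0.9, theta=1e-6):
    """Value iteration sketch for a finite MDP (assumed tabular P/R layout)."""
    n_states = len(P)
    V = np.zeros(n_states)                   # arbitrary initial value function
    while True:
        delta = 0.0
        for s in range(n_states):
            v_old = V[s]
            # V(s) <- max_a sum_{s'} p(s'|s,a) [ r(s,a) + gamma * V(s') ]
            V[s] = max(
                R[s][a] + gamma * sum(p * V[s2] for p, s2 in P[s][a])
                for a in range(len(P[s]))
            )
            delta = max(delta, abs(v_old - V[s]))
        if delta < theta:                     # the sweep changed values very little
            break
    # output a deterministic policy that is greedy with respect to V
    policy = [
        max(range(len(P[s])),
            key=lambda a: R[s][a] + gamma * sum(p * V[s2] for p, s2 in P[s][a]))
        for s in range(n_states)
    ]
    return V, policy

The loop terminates when a full sweep changes no state value by more than theta, after which the greedy policy extracted from V is returned.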

3. What is asynchronous dynamic programming? Elaborate on the generalized policy iteration method.

Answer:

Asynchronous Dynamic Programming

A major drawback to the DP methods is that they involve operations over the
entire state set of the MDP, that is, they require sweeps of the state set. If the
state set is very large, then even a single sweep is expensive.
For example, the game of backgammon has over 10^20 states. Even if we could
perform the value iteration update on a million states per second, it would take
over a thousand years to complete a single sweep.
Asynchronous DP algorithms are not organized in terms of systematic sweeps
of the state set. These algorithms update the values of states in any order
whatsoever, using whatever values of other states happen to be available. The
values of some states may be updated several times before the values of others
are updated once.
Asynchronous DP algorithms allow great flexibility in selecting states to
update.
For example, one version of asynchronous value iteration updates the value, in place, of only one state, s_k, on each step k, using the value iteration update, as sketched below.
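A minimal sketch of such an in-place, single-state update, reusing the hypothetical P/R layout from the value iteration sketch above:

import random

def async_value_iteration_step(V, s, P, R, gamma=0.9):
    """Update the value of a single state s in place (asynchronous DP)."""
    V[s] = max(
        R[s][a] + gamma * sum(p * V[s2] for p, s2 in P[s][a])
        for a in range(len(P[s]))
    )

# The states s_k can be chosen in any order, for example at random or as the
# agent happens to visit them (assuming V, P, R have already been constructed):
# for k in range(100_000):
#     s_k = random.randrange(len(P))
#     async_value_iteration_step(V, s_k, P, R)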
It is possible to intermix policy evaluation and value iteration updates to produce an asynchronous truncated policy iteration.
Avoiding sweeps does not necessarily mean less total computation; it just means that an algorithm does not need to get locked into any long sweep
before it can make progress improving a policy. We can try to take advantage of
this flexibility by selecting the states to which we apply updates so as to
improve the algorithm’s rate of progress. We can try to order the updates to let
value information propagate from state to state in an efficient way. Some states
may not need their values updated as often as others. We might even try to skip
updating some states entirely if they are not relevant to optimal behavior.
Asynchronous algorithms also make it easier to intermix computation with real-
time interaction. To solve a given MDP, we can run an iterative DP algorithm at
the same time that an agent is actually experiencing the MDP. The agent’s
experience can be used to determine the states to which the DP algorithm
applies its updates. At the same time, the latest value and policy information
from the DP algorithm can guide the agent’s decision making. For example, we
can apply updates to states as the agent visits them.
Generalized Policy Iteration

The interaction of the policy-evaluation and policy-improvement processes is known as generalized policy iteration (GPI). Almost all reinforcement learning
methods have identifiable policies and value functions, with the policy always
being improved with respect to the value function and the value function always
being driven toward the value function for the policy. If both the evaluation
process and the improvement process stabilize, that is, they no longer produce
changes, then the value function and policy must be optimal. Both processes
stabilize only when a policy has been found that is greedy with respect to its
own evaluation function.
The evaluation and improvement processes in GPI are both competing and
cooperating with each other. In the long run, however, these two processes
interact to find a single joint solution: the optimal value function and an optimal
policy.
Interaction between the evaluation and improvement processes in GPI can be considered in terms of two constraints or goals, for example, as two lines in two-dimensional space, as shown in the figure.
Each process drives the value function or policy toward one of the lines
representing a solution to one of the two goals. Driving
directly toward one goal causes some movement away from the other goal. Inevitably, however, the joint process is brought closer to the overall goal of optimality. The arrows in
this diagram correspond to the behavior of policy iteration in that each takes the
system all the way to achieving one of the two goals completely. In GPI one
could also take smaller, incomplete steps toward each goal.
4. Write the policy iteration algorithm (policy evaluation and improvement) for estimating policy π to the optimal policy π*, and explain how to apply it to obtain an optimal policy.
Answer:

Once a policy, π, has been improved using Vπ to yield a better policy, π’, we can then compute Vπ’ and improve it again to yield an even better π’’. We can thus obtain a sequence of monotonically improving policies and value functions:

π0 →E Vπ0 →I π1 →E Vπ1 →I π2 →E ... →I π* →E V*

where E denotes a policy evaluation and I denotes a policy improvement. Each policy is guaranteed to be a strict improvement over the previous one (unless it is already optimal). Because a finite MDP has only a finite number of policies, this process must converge to an optimal policy and optimal value function in a finite number of iterations. This way of finding an optimal policy is called policy iteration.
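A minimal Python sketch of policy iteration, using the same hypothetical tabular layout as the earlier sketches (P[s][a] is a list of (probability, next state) pairs, R[s][a] the expected reward, and theta an illustrative evaluation tolerance), is:

import numpy as np

def policy_iteration(P, R, gamma=0.9, theta=1e-6):
    """Policy iteration sketch: iterative policy evaluation + greedy improvement."""
    n_states = len(P)
    policy = [0] * n_states                  # arbitrary initial deterministic policy
    V = np.zeros(n_states)

    def q(s, a):
        # one-step lookahead value of taking action a in state s under V
        return R[s][a] + gamma * sum(p * V[s2] for p, s2 in P[s][a])

    while True:
        # 1. Policy evaluation: sweep until V approximates Vpi
        while True:
            delta = 0.0
            for s in range(n_states):
                v_old = V[s]
                V[s] = q(s, policy[s])
                delta = max(delta, abs(v_old - V[s]))
            if delta < theta:
                break
        # 2. Policy improvement: make the policy greedy with respect to V
        policy_stable = True
        for s in range(n_states):
            old_action = policy[s]
            policy[s] = max(range(len(P[s])), key=lambda a: q(s, a))
            if policy[s] != old_action:
                policy_stable = False
        if policy_stable:                    # greedy w.r.t. its own value function
            return V, policy

The outer loop stops when the improvement step leaves the policy unchanged, that is, when the policy is greedy with respect to its own value function, which is exactly the stabilization condition described in the GPI answer above.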

5. Explain in detail the Monte Carlo method. How to apply it for solving RL
problems?

Answer:
Monte Carlo method:

 In dynamic programming we need a model (the agent knows the MDP transitions and rewards) and the agent does planning (once the model is available, the agent plans its action in each state). There is no real learning by the agent in the dynamic programming method.
 The Monte Carlo method, on the other hand, is a very simple concept in which the agent learns about states and rewards by interacting with the environment. In this method the agent generates experienced samples, and the value of a state or state-action pair is then calculated as the average return.

 Monte Carlo is the first learning method we use for estimating value functions and discovering optimal policies. Unlike the previous methods (MDP, DP), here we do not assume complete knowledge of the environment.
 Monte Carlo methods require only experience—sample sequences of
states, actions, and rewards from actual or simulated interaction with an
environment.
 Learning from actual experience is striking because it requires no prior
knowledge of the environment’s dynamics, yet can still attain optimal
behavior.
 Learning from simulated experience is also powerful. Although a model
is required, the model need only generate sample transitions, not the
complete probability distributions of all possible transitions that is
required for dynamic programming (DP).

 Below are the key characteristics of the Monte Carlo (MC) method:
1. There is no model (the agent does not know the MDP state transitions).
2. The agent learns from sampled experience.
3. The agent learns the state value vπ(s) under policy π as the average return experienced over all sampled episodes (value = average return).
4. Values are updated only after a complete episode; because of this, convergence is slow and an update happens only once an episode is complete.
5. There is no bootstrapping.
6. It can only be used in episodic problems.

Consider a real-life analogy: Monte Carlo learning is like an annual examination, where a student completes an episode at the end of the year. Here, the result of the annual exam is like the return obtained by the student. Now, if the goal of the problem is to find how the students of a class score during a calendar year (which is the episode here), we can take the sample results of some students and then calculate the mean result to find the score for the class.

In the Monte Carlo method, instead of the expected return we use the empirical return that the agent has sampled by following the policy.
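A minimal sketch of first-visit Monte Carlo prediction built on this idea is given below; generate_episode is a hypothetical helper that interacts with the (real or simulated) environment and returns one complete episode as a list of (state, action, reward) triples.

from collections import defaultdict

def first_visit_mc_prediction(policy, generate_episode, num_episodes, gamma=1.0):
    """Estimate V(s) under `policy` as the average of first-visit returns."""
    returns = defaultdict(list)              # observed first-visit returns per state
    V = defaultdict(float)

    for _ in range(num_episodes):
        episode = generate_episode(policy)   # [(state, action, reward), ...]
        # index of the first occurrence of each state in this episode
        first_visit = {}
        for t, (s, _, _) in enumerate(episode):
            first_visit.setdefault(s, t)
        # walk backwards through the episode, accumulating the discounted return G
        G = 0.0
        for t in reversed(range(len(episode))):
            s, _, r = episode[t]
            G = gamma * G + r
            if first_visit[s] == t:          # count only the first visit to s
                returns[s].append(G)
                V[s] = sum(returns[s]) / len(returns[s])
    return V

Because the return G is known only once the episode has terminated, the value estimates are updated only after a complete episode, which matches the characteristics listed above.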

Monte Carlo methods have the following advantages:

 Zero bias

 Good convergence properties (even with function approximation)

 Not very sensitive to initial values

 Very simple to understand and use


But they have the following limitations as well:

 MC must wait until the end of an episode before the return is known

 MC has high variance

 MC can only learn from complete sequences

 MC only works for episodic (terminating) environments
