Solution to Assignment 4: Dynamic Programming


Assignment 4: Dynamic Programming

1. What is Dynamic Programming? Elaborate on the efficiency of dynamic programming.
Answer:
Dynamic Programming:
The term dynamic programming (DP) refers to a collection of algorithms that
can be used to compute optimal policies given a perfect model of the
environment as a Markov decision process (MDP). Classical DP algorithms are
of limited utility in reinforcement learning both because of their assumption of a
perfect model and because of their great computational expense, but they are
still important theoretically.
We usually assume that the environment is a finite MDP. That is, we assume
that its state, action, and reward sets, S, A, and R, are finite, and that its
dynamics are given by a set of probabilities p(s', r | s, a), for all s ∈ S, a ∈ A(s), r ∈ R, and s' ∈ S+ (S+ is S plus a terminal state if the problem is episodic).
Although DP ideas can be applied to problems with continuous state and action spaces, the resulting solutions are only approximate.
The key idea of DP, and of reinforcement learning generally, is the use of value functions to organize and structure the search for good policies. DP can be used to compute the value functions and to obtain optimal policies once we have found the optimal value functions, v* or q*, which satisfy the Bellman optimality equations:

v*(s) = max_a Σ_{s', r} p(s', r | s, a) [ r + γ v*(s') ]
q*(s, a) = Σ_{s', r} p(s', r | s, a) [ r + γ max_{a'} q*(s', a') ]

for all s ∈ S, a ∈ A(s), r ∈ R, and s' ∈ S+ (S+ is S plus a terminal state if the problem is episodic).

Efficiency of Dynamic Programming:

DP methods find an optimal policy in time polynomial in the number of states and actions. If n and k denote the number of states and actions, this means that a DP method requires a number of computational operations that is less than some polynomial function of n and k. A DP method is therefore guaranteed to find an optimal policy in polynomial time even though the total number of (deterministic) policies is k^n, so DP is exponentially faster than any direct search in policy space, because direct search would have to examine each policy exhaustively. Linear programming methods can also be used to solve MDPs, but they become impractical at a much smaller number of states than DP methods do, so for the largest problems, with very large state spaces, only DP methods (rather than linear programming or direct search) are feasible.
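As a rough illustration of this gap, the short Python snippet below compares the number of deterministic policies that a direct search would have to enumerate with the cost of a single synchronous DP sweep; the concrete values of n and k are hypothetical and chosen only for illustration.

# Illustrative comparison (hypothetical n and k, not taken from the assignment)
n, k = 20, 4                 # number of states and actions

num_policies = k ** n        # deterministic policies a direct search must consider
ops_per_sweep = n * k * n    # one synchronous sweep: every state, every action,
                             # summing over (at most) every successor state

print(f"deterministic policies: {num_policies:,}")    # 1,099,511,627,776
print(f"operations per DP sweep: {ops_per_sweep:,}")  # 1,600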
In practice, DP methods can be used with today’s computers to solve MDPs with millions of states. Both policy iteration and value iteration are widely used, and in practice these methods usually converge much faster than their theoretical worst-case bounds suggest, particularly if they are started with good initial value functions or policies.
Asynchronous DP methods are often preferred for problems with large state
spaces. To complete even one sweep of a synchronous method requires
computation and memory for every state.
2. Write and explain the value iteration method.
Answer:
Value Iteration:
One drawback to policy iteration is that each of its iterations involves policy evaluation, which is itself a protracted (lengthy) iterative computation requiring multiple sweeps through the state set. If policy evaluation is done iteratively, then convergence exactly to Vπ occurs only in the limit. Must we wait for exact convergence? In the small gridworld example used to illustrate policy evaluation, for instance, iterations beyond the first three have no effect on the corresponding greedy policy.
In fact, the policy evaluation step of policy iteration can be truncated in several ways without losing the convergence guarantees of policy iteration. One important special case is when policy evaluation is stopped after just one sweep (one update of each state).
This algorithm is called value iteration. It can be written as a particularly simple update operation that combines the policy improvement and truncated policy evaluation steps:

V_{k+1}(s) = max_a Σ_{s', r} p(s', r | s, a) [ r + γ V_k(s') ]

for all s ∈ S.
Value Iteration Algorithm:
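A minimal Python sketch of this algorithm is given below. The tabular representation (P[s][a] as a list of (probability, next state) pairs and R[s][a] as the expected immediate reward) and the stopping threshold theta are assumptions made for illustration, not part of the standard algorithm statement.

import numpy as np

def value_iteration(P, R, gamma=0.9, theta=1e-6):
    """Value iteration sketch for a finite MDP (assumed tabular P/R layout)."""
    n_states = len(P)
    V = np.zeros(n_states)                   # arbitrary initial value function
    while True:
        delta = 0.0
        for s in range(n_states):
            v_old = V[s]
            # V(s) <- max_a sum_{s'} p(s'|s,a) [ r(s,a) + gamma * V(s') ]
            V[s] = max(
                R[s][a] + gamma * sum(p * V[s2] for p, s2 in P[s][a])
                for a in range(len(P[s]))
            )
            delta = max(delta, abs(v_old - V[s]))
        if delta < theta:                     # the sweep changed values very little
            break
    # output a deterministic policy that is greedy with respect to V
    policy = [
        max(range(len(P[s])),
            key=lambda a: R[s][a] + gamma * sum(p * V[s2] for p, s2 in P[s][a]))
        for s in range(n_states)
    ]
    return V, policy

The loop terminates when a full sweep changes no state value by more than theta, after which the greedy policy extracted from V is returned.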

3. What is asynchronous dynamic programming? Elaborate on the generalized policy iteration method.

Answer:

Asynchronous Dynamic Programming

A major drawback to the DP methods is that they involve operations over the
entire state set of the MDP, that is, they require sweeps of the state set. If the
state set is very large, then even a single sweep is expensive.
For example, the game of backgammon has over 10^20 states. Even if we could
perform the value iteration update on a million states per second, it would take
over a thousand years to complete a single sweep.
Asynchronous DP algorithms are not organized in terms of systematic sweeps
of the state set. These algorithms update the values of states in any order
whatsoever, using whatever values of other states happen to be available. The
values of some states may be updated several times before the values of others
are updated once.
Asynchronous DP algorithms allow great flexibility in selecting states to
update.
For example, one version of asynchronous value iteration updates the value, in place, of only one state, s_k, on each step k, using the value iteration update, as sketched below.
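A minimal sketch of such an in-place, single-state update, reusing the hypothetical P/R layout from the value iteration sketch above:

import random

def async_value_iteration_step(V, s, P, R, gamma=0.9):
    """Update the value of a single state s in place (asynchronous DP)."""
    V[s] = max(
        R[s][a] + gamma * sum(p * V[s2] for p, s2 in P[s][a])
        for a in range(len(P[s]))
    )

# The states s_k can be chosen in any order, for example at random or as the
# agent happens to visit them (assuming V, P, R have already been constructed):
# for k in range(100_000):
#     s_k = random.randrange(len(P))
#     async_value_iteration_step(V, s_k, P, R)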
It is possible to intermix policy evaluation and value iteration updates to produce an asynchronous truncated policy iteration.
Avoiding sweeps does not necessarily mean less total computation; it just means that an algorithm does not need to get locked into any long sweep
before it can make progress improving a policy. We can try to take advantage of
this flexibility by selecting the states to which we apply updates so as to
improve the algorithm’s rate of progress. We can try to order the updates to let
value information propagate from state to state in an efficient way. Some states
may not need their values updated as often as others. We might even try to skip
updating some states entirely if they are not relevant to optimal behavior.
Asynchronous algorithms also make it easier to intermix computation with real-
time interaction. To solve a given MDP, we can run an iterative DP algorithm at
the same time that an agent is actually experiencing the MDP. The agent’s
experience can be used to determine the states to which the DP algorithm
applies its updates. At the same time, the latest value and policy information
from the DP algorithm can guide the agent’s decision making. For example, we
can apply updates to states as the agent visits them.
Generalized Policy Iteration

The interaction of the policy-evaluation and policy-improvement processes is known as generalized policy iteration (GPI). Almost all reinforcement learning
methods have identifiable policies and value functions, with the policy always
being improved with respect to the value function and the value function always
being driven toward the value function for the policy. If both the evaluation
process and the improvement process stabilize, that is, they no longer produce
changes, then the value function and policy must be optimal. Both processes
stabilize only when a policy has been found that is greedy with respect to its
own evaluation function.
The evaluation and improvement processes in GPI are both competing and
cooperating with each other. In the long run, however, these two processes
interact to find a single joint solution: the optimal value function and an optimal
policy.
Interaction between the evaluation and improvement processes in GPI can be considered in terms of two constraints or goals, for example, as two lines in two-dimensional space, as shown in the figure.
Each process drives the value function or policy toward one of the lines
representing a solution to one of the two goals. Driving
directly toward one goal causes some movement away from the other goal. Inevitably, however, the joint process is brought closer to the overall goal of optimality. The arrows in
this diagram correspond to the behavior of policy iteration in that each takes the
system all the way to achieving one of the two goals completely. In GPI one
could also take smaller, incomplete steps toward each goal.
4. Write the policy iteration algorithm (policy evaluation and improvement) for estimating policy π to the optimal policy π*, and explain how to apply it to obtain an optimal policy.
Answer:

Once a policy, π, has been improved using Vπ to yield a better policy, π’, we can then compute Vπ’ and improve it again to yield an even better π’’. We can thus obtain a sequence of monotonically improving policies and value functions:

π0 →E Vπ0 →I π1 →E Vπ1 →I π2 →E ... →I π* →E V*

where E denotes a policy evaluation and I denotes a policy improvement. Each policy is guaranteed to be a strict improvement over the previous one (unless it is already optimal). Because a finite MDP has only a finite number of policies, this process must converge to an optimal policy and optimal value function in a finite number of iterations. This way of finding an optimal policy is called policy iteration.
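A minimal Python sketch of policy iteration, using the same hypothetical tabular layout as the earlier sketches (P[s][a] is a list of (probability, next state) pairs, R[s][a] the expected reward, and theta an illustrative evaluation tolerance), is:

import numpy as np

def policy_iteration(P, R, gamma=0.9, theta=1e-6):
    """Policy iteration sketch: iterative policy evaluation + greedy improvement."""
    n_states = len(P)
    policy = [0] * n_states                  # arbitrary initial deterministic policy
    V = np.zeros(n_states)

    def q(s, a):
        # one-step lookahead value of taking action a in state s under V
        return R[s][a] + gamma * sum(p * V[s2] for p, s2 in P[s][a])

    while True:
        # 1. Policy evaluation: sweep until V approximates Vpi
        while True:
            delta = 0.0
            for s in range(n_states):
                v_old = V[s]
                V[s] = q(s, policy[s])
                delta = max(delta, abs(v_old - V[s]))
            if delta < theta:
                break
        # 2. Policy improvement: make the policy greedy with respect to V
        policy_stable = True
        for s in range(n_states):
            old_action = policy[s]
            policy[s] = max(range(len(P[s])), key=lambda a: q(s, a))
            if policy[s] != old_action:
                policy_stable = False
        if policy_stable:                    # greedy w.r.t. its own value function
            return V, policy

The outer loop stops when the improvement step leaves the policy unchanged, that is, when the policy is greedy with respect to its own value function, which is exactly the stabilization condition described in the GPI answer above.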

5. Explain in detail the Monte Carlo method. How to apply it for solving RL
problems?

Answer:
Monte Carlo method:

 In dynamic programming we need a model (the agent knows the MDP transitions and rewards) and the agent does planning (once the model is available, the agent plans its action in each state). There is no real learning by the agent in the dynamic programming method.
 The Monte Carlo method, on the other hand, is a very simple concept in which the agent learns about states and rewards by interacting with the environment. In this method the agent generates experienced samples, and the value of a state or state-action pair is then calculated as the average return.

 Monte Carlo is the first learning method we use for estimating value functions and discovering optimal policies. Unlike the previous methods (MDP, DP), here we do not assume complete knowledge of the environment.
 Monte Carlo methods require only experience—sample sequences of
states, actions, and rewards from actual or simulated interaction with an
environment.
 Learning from actual experience is striking because it requires no prior
knowledge of the environment’s dynamics, yet can still attain optimal
behavior.
 Learning from simulated experience is also powerful. Although a model
is required, the model need only generate sample transitions, not the
complete probability distributions of all possible transitions that is
required for dynamic programming (DP).

 Below are the key characteristics of the Monte Carlo (MC) method:
1. There is no model (the agent does not know the MDP state transitions).
2. The agent learns from sampled experience.
3. The agent learns the state value vπ(s) under policy π as the average return experienced over all sampled episodes (value = average return).
4. Values are updated only after a complete episode; because of this, convergence is slow and an update happens only once an episode is complete.
5. There is no bootstrapping.
6. It can only be used in episodic problems.

Consider a real-life analogy: Monte Carlo learning is like an annual examination, where a student completes an episode at the end of the year. Here, the result of the annual exam is like the return obtained by the student. Now, if the goal of the problem is to find how the students of a class score during a calendar year (which is the episode here), we can take the sample results of some students and then calculate the mean result to find the score for the class.

In the Monte Carlo method, instead of the expected return we use the empirical return that the agent has sampled by following the policy.
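A minimal sketch of first-visit Monte Carlo prediction built on this idea is given below; generate_episode is a hypothetical helper that interacts with the (real or simulated) environment and returns one complete episode as a list of (state, action, reward) triples.

from collections import defaultdict

def first_visit_mc_prediction(policy, generate_episode, num_episodes, gamma=1.0):
    """Estimate V(s) under `policy` as the average of first-visit returns."""
    returns = defaultdict(list)              # observed first-visit returns per state
    V = defaultdict(float)

    for _ in range(num_episodes):
        episode = generate_episode(policy)   # [(state, action, reward), ...]
        # index of the first occurrence of each state in this episode
        first_visit = {}
        for t, (s, _, _) in enumerate(episode):
            first_visit.setdefault(s, t)
        # walk backwards through the episode, accumulating the discounted return G
        G = 0.0
        for t in reversed(range(len(episode))):
            s, _, r = episode[t]
            G = gamma * G + r
            if first_visit[s] == t:          # count only the first visit to s
                returns[s].append(G)
                V[s] = sum(returns[s]) / len(returns[s])
    return V

Because the return G is known only once the episode has terminated, the value estimates are updated only after a complete episode, which matches the characteristics listed above.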

Monte Carlo methods have the following advantages:

 Zero bias

 Good convergence properties (even with function approximation)

 Not very sensitive to initial values

 Very simple to understand and use


But they have the following limitations as well:

 MC must wait until the end of an episode before the return is known

 MC has high variance

 MC can only learn from complete sequences

 MC only works for episodic (terminating) environments
