Reinforcement Learning - Unit 6 - Week 4

This document describes an online course on reinforcement learning from NPTEL, including the course structure, weekly topics, and assignments. It also contains a 10-question practice quiz on reinforcement learning and Markov decision processes, covering state transition probabilities, value functions, optimal policies, and convergence of algorithms.


Week 4: Assignment 4

Assignment not submitted. Due date: 2022-02-23, 23:59 IST.


1) State True/False (1 point)

The state transition graph for any MDP is a directed acyclic graph.

True
False
2) Consider the following statements: (1 point)

(i) The optimal policy of an MDP is unique.
(ii) We can determine an optimal policy for an MDP using only the optimal value function (v∗), without accessing the MDP parameters.
(iii) We can determine an optimal policy for a given MDP using only the optimal q-value function (q∗), without accessing the MDP parameters.

Which of these statements are true?

Only (ii)
Only (iii)
Only (i), (ii)
Only (i), (iii)
Only (ii), (iii)
3) Which of the following statements are true for a finite MDP? (Select all that apply) (1 point)

The Bellman equation of a value function of a finite MDP defines a contraction in a Banach space (using the max norm).
If 0 ≤ γ < 1, then the eigenvalues of γPπ are less than 1.
We call a normed vector space 'complete' if Cauchy sequences exist in that vector space.
The sequence defined by $v_n = r_\pi + \gamma P_\pi v_{n-1}$ is a Cauchy sequence in a Banach space (using the max norm).

(Pπ is a stochastic matrix)
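As a quick numerical illustration of the contraction property referred to above, the following sketch (my own example, not from the course material; it assumes a randomly generated row-stochastic Pπ and an arbitrary reward vector rπ) applies the policy-evaluation operator to two different vectors and shows that their max-norm distance shrinks by at least a factor of γ at every step:

```python
import numpy as np

rng = np.random.default_rng(0)

n_states = 4
P_pi = rng.random((n_states, n_states))
P_pi /= P_pi.sum(axis=1, keepdims=True)   # normalise rows so P_pi is a stochastic matrix
r_pi = rng.standard_normal(n_states)      # arbitrary reward vector for the illustration
gamma = 0.9

def bellman_policy_op(v):
    """Apply the policy-evaluation operator v -> r_pi + gamma * P_pi v."""
    return r_pi + gamma * P_pi @ v

u = rng.standard_normal(n_states)
v = rng.standard_normal(n_states)
for _ in range(5):
    ratio = np.max(np.abs(bellman_policy_op(u) - bellman_policy_op(v))) / np.max(np.abs(u - v))
    print(ratio)                          # never exceeds gamma: the operator is a contraction
    u, v = bellman_policy_op(u), bellman_policy_op(v)
```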
4) Which of the following is a benefit of using RL algorithms for solving MDPs? (1 point)

They do not require the state of the agent for solving an MDP.
They do not require the action taken by the agent for solving an MDP.
They do not require the state transition probability matrix for solving an MDP.
They do not require the reward signal for solving an MDP.
5) Consider the following equations: (1 point)

(i) $v_\pi(s) = \mathbb{E}_\pi\left[\sum_{i=t}^{\infty} \gamma^{\,i-t} R_{i+1} \;\middle|\; S_t = s\right]$

(ii) $q_\pi(s, a) = \sum_{s'} p(s' \mid s, a)\, v_\pi(s')$

(iii) $v_\pi(s) = \sum_{a} \pi(a \mid s)\, q_\pi(s, a)$

Which of the above are correct?

Only (i)
Only (i), (ii)
Only (ii), (iii)
Only (i), (iii)
(i), (ii), (iii)
6) What is true about γ (the discount factor) in reinforcement learning? (1 point)

The discount factor can be any real number.
The value of γ cannot affect the optimal policy.
The lower the value of γ, the more myopic the agent gets, i.e. the agent maximises the rewards that it receives over a shorter horizon.
7) Consider the following statements for a finite MDP (I is an identity matrix with dimensions |S| × |S|, where S is the set of all states, and Pπ is a stochastic matrix): (1 point)

(i) An MDP with stochastic rewards may not have a deterministic optimal policy.
(ii) There can be multiple optimal stochastic policies.
(iii) If 0 ≤ γ < 1, then the rank of the matrix I − γPπ is equal to |S|.
(iv) If 0 ≤ γ < 1, then the rank of the matrix I − γPπ is less than |S|.

Which of the above statements are true?

Only (ii), (iii)
Only (ii), (iv)
Only (i), (iii)
Only (i), (ii), (iii)
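For statements (iii) and (iv), a small numerical check is sketched below (my own illustration, using a randomly generated row-stochastic matrix rather than any particular MDP from the course): when 0 ≤ γ < 1, every eigenvalue of γPπ has magnitude strictly below 1, so I − γPπ is invertible.

```python
import numpy as np

rng = np.random.default_rng(1)

n_states = 5                              # |S| for this illustration
P_pi = rng.random((n_states, n_states))
P_pi /= P_pi.sum(axis=1, keepdims=True)   # rows sum to 1, so P_pi is a stochastic matrix
gamma = 0.99

eigvals = np.linalg.eigvals(gamma * P_pi)
print(np.max(np.abs(eigvals)))            # largest eigenvalue magnitude, strictly below 1

rank = np.linalg.matrix_rank(np.eye(n_states) - gamma * P_pi)
print(rank)                               # rank of I - gamma * P_pi
```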

8) Consider an MDP with 3 states A, B, C. From each state we can go to either of the other two states, i.e. if we are in state A we can perform two actions: go to state B or go to state C. The rewards for each transition are r(A, B) = −2 (the reward if we go from A to B), r(B, A) = 3, r(B, C) = 10, r(C, B) = −7, r(A, C) = −2, r(C, A) = 4, and the discount factor is 0.9. Find the fixed point of the value function for the policy π(A) = B (i.e. if we are in state A we choose the action that goes to state B), π(B) = C, π(C) = A. vπ([A, B, C]) = ? (round to 1 decimal place) (1 point)

[35.2, 48.6, 10.7]

[37.8, 44.2, 38.0]

[37.8, 38.0, 44.2]

[40.6, 20.2, 75.3]
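A minimal sketch of how such a fixed point can be computed, assuming the deterministic policy is the one given symbolically (π(A) = B, π(B) = C, π(C) = A) with the rewards from the question: the fixed point of v = rπ + γPπv is obtained by solving the linear system (I − γPπ)v = rπ.

```python
import numpy as np

# States are ordered [A, B, C].
# Deterministic policy from the question: pi(A) = B, pi(B) = C, pi(C) = A,
# so P_pi[i, j] is 1 exactly when pi moves state i to state j.
P_pi = np.array([
    [0.0, 1.0, 0.0],   # A -> B
    [0.0, 0.0, 1.0],   # B -> C
    [1.0, 0.0, 0.0],   # C -> A
])

# Immediate rewards along the chosen transitions:
# r(A, B) = -2, r(B, C) = 10, r(C, A) = 4.
r_pi = np.array([-2.0, 10.0, 4.0])
gamma = 0.9

# Fixed point of v = r_pi + gamma * P_pi v, i.e. v = (I - gamma * P_pi)^{-1} r_pi.
v_pi = np.linalg.solve(np.eye(3) - gamma * P_pi, r_pi)
print(np.round(v_pi, 1))                  # v_pi for states [A, B, C]
```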

9) Which of the following is not a valid norm function? (x is a D-dimensional vector) (1 point)

$\max_{d \in \{1, \ldots, D\}} |x_d|$

$\sqrt{\sum_{d=1}^{D} x_d^2}$

$\min_{d \in \{1, \ldots, D\}} |x_d|$

$\sum_{d=1}^{D} |x_d|$
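For reference, the candidate functions above can be evaluated as follows (a generic illustration with an arbitrary example vector; it does not indicate which option fails the norm axioms):

```python
import numpy as np

x = np.array([3.0, -4.0, 1.0])            # an arbitrary D-dimensional example vector

print(np.max(np.abs(x)))                  # max_d |x_d|       (max / infinity norm)
print(np.sqrt(np.sum(x ** 2)))            # sqrt(sum_d x_d^2) (Euclidean / L2 norm)
print(np.min(np.abs(x)))                  # min_d |x_d|
print(np.sum(np.abs(x)))                  # sum_d |x_d|       (L1 norm)
```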

10) For an operator L, which of the following properties must be satisfied by x for it to be a fixed point of L? (Multi-Correct) (1 point)

Lx = x

L²x = x

∀λ > 0, Lx = λx

None of the above

You may submit any number of times before the due date. The final submission will be considered for grading.
