Reinforcement Learning - Unit 13 - Week 10
Thank you for taking the Week 10 : Assignment 10.
1) Which of the following are the correct values of A and B?
(r_k is the reward received at time step k, and γ is the discount factor)
A = r_t ; B = γ
A = r_t + γ r_{t+1} + ... + γ^{τ−1} r_{t+τ} ; B = γ^τ
A = γ^t r_t + γ^{t+1} r_{t+1} + ... + γ^{t+τ−1} r_{t+τ} ; B = γ^{t+τ}
A = γ^{τ−1} r_{t+τ} ; B = γ^t
2) Consider an SMDP in which the next state and the reward depend only on the previous state and action, i.e. P(s′, τ | s, a) = P(s′ | s, a) P(τ | s, a) and R(s, a, τ, s′) = R(s, a, s′). If we solve the above SMDP with conventional Q-learning, we will end up with the same policy as solving it with SMDP Q-learning. (1 point)
yes, because now τ won't change anything and we end up with the same state and action sequences
no, because τ still depends on the state-action pair and discounting may have an effect on the final policies.
no, because the next state will still depend on τ.
yes, because the Bellman equation is the same for both methods in this case.
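Background note (a minimal sketch, not part of the assignment; the tabular setup, learning rate, and toy numbers are assumptions for illustration only): conventional Q-learning discounts its bootstrap term by γ after every decision, whereas SMDP Q-learning accumulates the discounted rewards observed over the τ steps of a transition and discounts the bootstrap term by γ^τ.

import numpy as np

def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    # Conventional Q-learning: one-step reward, bootstrap discounted by gamma.
    target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])

def smdp_q_update(Q, s, a, rewards, s_next, alpha=0.1, gamma=0.9):
    # SMDP Q-learning: 'rewards' lists the tau rewards seen during the
    # transition; the bootstrap is discounted by gamma**tau.
    tau = len(rewards)
    R = sum(gamma**k * r_k for k, r_k in enumerate(rewards))
    target = R + gamma**tau * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])

# Toy usage with a 3-state, 2-action table (hypothetical values):
Q = np.zeros((3, 2))
q_update(Q, s=0, a=1, r=1.0, s_next=2)
smdp_q_update(Q, s=0, a=1, rewards=[1.0, 0.0, 1.0], s_next=2)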
3) In HAM, what will be the immediate rewards received between two choice states? (1 point)

Accumulation of immediate rewards of the core MDP obtained between these choice points.
The return of the next choice state.
The reward of only the next primitive action taken.
Immediate reward is always zero.
4) Which of the following is true about Markov and Semi-Markov Options? (1 point)

In a Markov Option, the option's policy depends only on the current state.
In a Semi-Markov Option, the option's policy can depend only on the current state.
In a Semi-Markov Option, the option's policy may depend on the history since the execution of the option began.
A Semi-Markov Option is always a Markov Option but not vice versa.
5) Consider the two statements below for an SMDP for a HAM: (1 point)

Statement 1: The state of the SMDP is defined by the state of the base MDP, the call stack, and the state of the machine currently executing.
Statement 2: The actions of the SMDP can only be defined by the action states.

Which of the following are true?

Statement 1 is True and Statement 2 is True.
Statement 1 is True and Statement 2 is False.
7) In an SMDP, consider the case when τ is fixed for all state-action pairs. Will we always get the same policy for conventional Q-learning and SMDP Q-learning? Provide the answer for the three cases τ = 3, τ = 2, τ = 1. (1 point)
yes, yes, no
no, no, no
yes, yes, yes
no, no, yes
8)

True
False
9) Suppose that we model a robot in a room as an SMDP, such that the position of the robot in the room is the state of the SMDP. Which of the following scenarios satisfy the assumption that the next state and transition time are independent of each other given the current state and action, i.e. P(s′, τ | s, a) = P(s′ | s, a) P(τ | s, a)? (Assume that the primitive actions <left, right, up, down> take a single time step to execute.) (1 point)
The room has a single door. The actions available are: {exit the room, move left, move right, move up, move down}.
The room has two doors. The actions available are: {exit the room, move left, move right, move up, move down}.
The room has two doors. The actions available are: {move left, move right, move up, move down}.
None of the above.
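Background note (a minimal sketch; the toy dictionaries and state names below are hypothetical, not taken from the question): the independence assumption above says the next state and the holding time are sampled from separate conditionals given (s, a), whereas in the coupled case the holding time depends on which next state is actually reached.

import random

# Hypothetical toy model: from state 's0' under action 'a0' the agent can
# reach 's1' or 's2'.
next_states = {('s0', 'a0'): ['s1', 's2']}
durations = {('s0', 'a0'): [1, 2, 3]}
durations_given_next = {('s0', 'a0', 's1'): [1], ('s0', 'a0', 's2'): [3]}

def sample_factored(s, a):
    # P(s', tau | s, a) = P(s' | s, a) * P(tau | s, a): tau is drawn without
    # looking at which next state was sampled.
    s_next = random.choice(next_states[(s, a)])
    tau = random.choice(durations[(s, a)])
    return s_next, tau

def sample_coupled(s, a):
    # tau is drawn conditioned on s', so the joint distribution does not
    # factor in general (e.g. a more distant next state takes more steps).
    s_next = random.choice(next_states[(s, a)])
    tau = random.choice(durations_given_next[(s, a, s_next)])
    return s_next, tau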
10) Which of the following is a correct Bellman equation for an SMDP? (1 point)
Note: R(s, a, s′) ⟹ the reward is a function of only s, a and s′.
V*(s) = max_{a∈A(s)} [ R(s, a, τ, s′) + γ^τ P(s′ | s, a) V*(s′) ]

V*(s) = max_{a∈A(s)} [ Σ_{s′,τ} P(s′ | s, a, τ) ( R(s, a, τ, s′) + γ V*(s′) ) ]

V*(s) = max_{a∈A(s)} [ Σ_{s′,τ} P(s′, τ | s, a) ( R(s, a, τ, s′) + γ^τ V*(s′) ) ]

V*(s) = max_{a∈A(s)} [ Σ_{s′,τ} P(s′, τ | s, a) ( R(s, a, s′) + γ V*(s′) ) ]
You may submit any number of times before the due date. The final submission will be considered for
grading.
Submit Answers