Assignment 3

Reinforcement Learning
Prof. B. Ravindran
1. Which of the following is true for an MDP?
(a) Pr(s_{t+1}, r_{t+1} | s_t, a_t) = Pr(s_{t+1}, r_{t+1})
(b) Pr(s_{t+1}, r_{t+1} | s_t, a_t, s_{t−1}, a_{t−1}, s_{t−2}, a_{t−2}, ..., s_0, a_0) = Pr(s_{t+1}, r_{t+1} | s_t, a_t)
(c) Pr(s_{t+1}, r_{t+1} | s_t, a_t) = Pr(s_{t+1}, r_{t+1} | s_0, a_0)
(d) Pr(s_{t+1}, r_{t+1} | s_t, a_t) = Pr(s_t, r_t | s_{t−1}, a_{t−1})
Sol. (b)
(b) is the Markov property and holds for any MDP: the next state and reward depend only on the current state and action, not on the rest of the history. (a), (c) and (d) are not true in general.
2. The baseline in the REINFORCE update should not depend on which of the following (without invalidating any of the steps in the proof of REINFORCE)?
(a) r_{n−1}
(b) r_n
(c) Action taken (a_n)
(d) None of the above
Sol. (c)
The baseline must not depend on the action taken. It can, however, depend on current and past rewards. An example baseline given in the videos is the average of the rewards obtained so far; a sketch of this idea is given below.
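
As an illustration, here is a minimal Python sketch of a REINFORCE-style bandit update that uses the running average of rewards as the baseline. It is not taken from the lectures; the arm means, step size, and softmax-preference parameterization are all assumptions made for the example.

    import numpy as np

    # Minimal sketch: softmax-preference bandit, REINFORCE update with a
    # running-average-of-rewards baseline. Arm means and step size are assumed.
    rng = np.random.default_rng(0)
    true_means = np.array([0.2, 0.5, 0.8])   # assumed arm reward means
    H = np.zeros(3)                          # action preferences (policy parameters)
    alpha, baseline, n = 0.1, 0.0, 0

    for t in range(1000):
        pi = np.exp(H - H.max())
        pi /= pi.sum()                       # softmax policy over preferences
        a = rng.choice(3, p=pi)
        r = rng.normal(true_means[a], 1.0)   # sampled reward
        n += 1
        baseline += (r - baseline) / n       # average of rewards obtained so far
        grad = -pi.copy()
        grad[a] += 1.0                       # d ln pi(a) / dH for a softmax policy
        H += alpha * (r - baseline) * grad   # REINFORCE update with baseline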
3. In many supervised machine learning algorithms, such as neural networks, we rely on the
gradient descent technique. However, in the policy gradient approach to bandit problems, we
made use of gradient ascent. This discrepancy can mainly be attributed to the differences in

(a) the objectives of the learning tasks
(b) the parameters of the functions whose gradients are being calculated
(c) the nature of the feedback received by the algorithms

Sol. (a), (c)


The feedback in most supervised learning algorithms is an error signal that we wish to minimise, so we perform gradient descent. In the policy gradient approach we are trying to maximise the reward signal, hence gradient ascent.
4. In the case of linear bandits, suppose we have two actions, a_1 and a_2. The policy π to be followed when encountering a state s is given by

π(a_1|s) = e^{w_1^⊤ s} / (e^{w_1^⊤ s} + e^{w_2^⊤ s})

π(a_2|s) = e^{w_2^⊤ s} / (e^{w_1^⊤ s} + e^{w_2^⊤ s})
where w_1 and w_2 are weight vectors associated with a_1 and a_2 respectively. If we are using REINFORCE to learn the parameters, what will be the updates for w_1 and w_2 when we pull action a_1?
(i) w_1 → w_1 + αr(s − (e^{w_1^⊤ s} / (e^{w_1^⊤ s} + e^{w_2^⊤ s})) s)
(ii) w_1 → w_1 + αr(−(e^{w_1^⊤ s} / (e^{w_1^⊤ s} + e^{w_2^⊤ s})) s)
(iii) w_2 → w_2 + αr(s − (e^{w_2^⊤ s} / (e^{w_1^⊤ s} + e^{w_2^⊤ s})) s)
(iv) w_2 → w_2 + αr(−(e^{w_2^⊤ s} / (e^{w_1^⊤ s} + e^{w_2^⊤ s})) s)

Which of the above updates are correct?


(a) (i), (iii)
(b) (i), (iv)
(c) (ii), (iii)
(d) (ii), (iv)
Sol. (b)
Derive the update using the REINFORCE formula: ∂ln π(a_1|s)/∂w_1 = s − π(a_1|s)s and ∂ln π(a_1|s)/∂w_2 = −π(a_2|s)s, so multiplying by αr gives updates (i) and (iv). A numerical sketch follows.
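
For concreteness, here is a small NumPy sketch of this update step; the function and variable names are illustrative and not part of the course material.

    import numpy as np

    def reinforce_update_after_a1(w1, w2, s, r, alpha):
        # One REINFORCE step for the two-action softmax policy above,
        # assuming action a1 was pulled and reward r was received.
        e1, e2 = np.exp(w1 @ s), np.exp(w2 @ s)
        pi1, pi2 = e1 / (e1 + e2), e2 / (e1 + e2)
        # d ln pi(a1|s) / d w1 = s - pi(a1|s) * s   -> update (i)
        w1_new = w1 + alpha * r * (s - pi1 * s)
        # d ln pi(a1|s) / d w2 = -pi(a2|s) * s      -> update (iv)
        w2_new = w2 + alpha * r * (-pi2 * s)
        return w1_new, w2_new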
5. The update in REINFORCE is given by θ_{t+1} = θ_t + α r_t ∂ln π(a_t; θ_t)/∂θ_t, where r_t ∂ln π(a_t; θ_t)/∂θ_t is an unbiased estimator of the true gradient of the performance function. However, there is another variant of REINFORCE, where a baseline b that is independent of the action taken is subtracted from the obtained reward, i.e., the update is given by θ_{t+1} = θ_t + α(r_t − b) ∂ln π(a_t; θ_t)/∂θ_t. How are E[(r_t − b) ∂ln π(a_t; θ_t)/∂θ_t] and E[r_t ∂ln π(a_t; θ_t)/∂θ_t] related?

(a) E[(r_t − b) ∂ln π(a_t; θ_t)/∂θ_t] = E[r_t ∂ln π(a_t; θ_t)/∂θ_t]
(b) E[(r_t − b) ∂ln π(a_t; θ_t)/∂θ_t] < E[r_t ∂ln π(a_t; θ_t)/∂θ_t]
(c) E[(r_t − b) ∂ln π(a_t; θ_t)/∂θ_t] > E[r_t ∂ln π(a_t; θ_t)/∂θ_t]
(d) Could be either of (a), (b) or (c), depending on the choice of baseline

Sol. (a)

E[b ∂ln π(a_t; θ_t)/∂θ_t] = E[b (1/π(a_t; θ_t)) ∂π(a_t; θ_t)/∂θ_t]
                          = Σ_a b (1/π(a; θ_t)) (∂π(a; θ_t)/∂θ_t) π(a; θ_t)
                          = b Σ_a ∂π(a; θ_t)/∂θ_t
                          = b ∂(1)/∂θ_t
                          = 0

Thus, E[(r_t − b) ∂ln π(a_t; θ_t)/∂θ_t] = E[r_t ∂ln π(a_t; θ_t)/∂θ_t].
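
The cancellation above can also be checked numerically. Below is a small Monte Carlo sketch; the fixed softmax policy, baseline value, and sample count are arbitrary assumptions. It shows that the sample mean of b ∂ln π(a; θ)/∂θ is approximately zero.

    import numpy as np

    rng = np.random.default_rng(1)
    theta = np.array([0.3, -0.2, 0.1])            # preferences for 3 actions
    pi = np.exp(theta) / np.exp(theta).sum()      # fixed softmax policy
    b = 5.0                                       # any action-independent baseline

    grads = np.zeros((200_000, 3))
    for i in range(grads.shape[0]):
        a = rng.choice(3, p=pi)
        grad_log_pi = -pi.copy()
        grad_log_pi[a] += 1.0                     # d ln pi(a; theta) / d theta
        grads[i] = b * grad_log_pi
    print(grads.mean(axis=0))                     # close to [0, 0, 0]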

6. Consider the following policy-search algorithm for a multi-armed binary bandit:

∀a, π_{t+1}(a) = π_t(a)(1 − α) + α(1_{a=a_t} r_t + (1 − 1_{a=a_t})(1 − r_t))

where 1_{a=a_t} is 1 if a = a_t and 0 otherwise. Which of the following is true for the above algorithm?
(a) It is the L_{R−I} algorithm.
(b) It is the L_{R−ϵP} algorithm.
(c) It would work well if the best arm had a probability of 0.9 of resulting in a +1 reward and the next best arm had a probability of 0.5 of resulting in a +1 reward.
(d) It would work well if the best arm had a probability of 0.3 of resulting in a +1 reward and the worst arm had a probability of 0.25 of resulting in a +1 reward.
Sol. (c)
The given algorithm is the L_{R−P} algorithm. It would work well for the case described in (c): it gives equal weight to penalties and rewards, and since the gap between the best arm's and the next best arm's probabilities of giving a +1 reward is significant, the algorithm would easily identify the best arm. A simulation sketch is given below.
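
Here is a small simulation sketch of the above update for a two-armed binary bandit using the success probabilities from option (c); the step size, horizon, and initial policy are assumptions made for the example.

    import numpy as np

    rng = np.random.default_rng(2)
    p_success = np.array([0.9, 0.5])   # P(+1 reward) for each arm, as in option (c)
    pi = np.array([0.5, 0.5])          # initial policy
    alpha = 0.05

    for t in range(5000):
        a = rng.choice(2, p=pi)
        r = float(rng.random() < p_success[a])    # binary reward in {0, 1}
        ind = np.zeros(2)
        ind[a] = 1.0                              # indicator 1_{a = a_t}
        pi = pi * (1 - alpha) + alpha * (ind * r + (1 - ind) * (1 - r))
    print(pi)   # most of the probability mass ends up on the better arm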
7. The actions in contextual bandits do not determine the next state, but typically do in full RL
problems. True or false?
(a) True
(b) False
Sol. (a)
Refer to the contextual bandits lecture.
8. Value function-based methods generalize better to continuous actions than policy gradient methods.
(a) True
(b) False
Sol. (b)
When the action space is continuous, value function-based methods cannot store a value for each action. Because action values must be approximated, finding the estimated optimal action requires optimizing over the action space for every decision. This approach is slow, and policy gradient-based methods, which parameterize the policy directly, are likely to generalize better.
9. Let's assume that for some full RL problem we are acting according to a policy π. At some time t, we are in a state s where we took action a_1. After a few time steps, at time t′, the same state s was reached, where we performed an action a_2 (≠ a_1). Which of the following statements is true?
(a) π is definitely a stationary policy
(b) π is definitely a non-stationary policy

(c) π can be stationary or non-stationary.

Sol. (c)
A stationary policy can be stochastic, so different actions can be chosen in the same state at different time steps. Thus π can be either a stationary or a non-stationary policy.
10. In solving a multi-armed bandit problem using the policy gradient method, are we assured of converging to the optimal solution?

(a) no
(b) yes
Sol. (a)
Depending on the properties of the function whose gradient is being ascended, the policy gradient approach may converge only to a local optimum.
