
Reinforcement Learning (RO3002) Quiz-2

Name

ID No.

Total Marks: 15 Duration: 30 min

Q1. Expected Return:


A. (2 marks) Self-balancing Segway: We want a Segway to learn to balance by applying the right forces to the
Segway cart moving along a track so that the rider stays upright. A failure is said to occur if the rider falls
past a given angle from vertical, and the Segway pole and rider are reset to vertical after each failure. We can
treat self-balancing as a continuing task with discounting, where the reward is -1 on each failure and zero at
all other times. Consider an experiment run for 100 timesteps in which failures were observed at timesteps
t = 47 and t = 78. Given the discount factor γ = 0.9, what will the returns G_0 and G_10 be?
Solution: Using G_t = R_{t+1} + γ R_{t+2} + γ² R_{t+3} + ⋯, with R_47 = R_78 = −1 and every other reward equal to 0:

G_0 = (−1)(0.9)^46 + (−1)(0.9)^77 ≈ −0.0082

G_10 = (−1)(0.9)^36 + (−1)(0.9)^67 ≈ −0.0234
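A quick numerical check of these returns (a minimal Python sketch, assuming the convention G_t = R_{t+1} + γ R_{t+2} + ⋯ used above):

# Verify G_0 and G_10 for Q1.A (failures at t = 47 and t = 78, gamma = 0.9).
gamma = 0.9
rewards = {47: -1.0, 78: -1.0}          # R_t = -1 at each failure, 0 otherwise

def ret(t, horizon=100):
    # G_t = sum_{k >= 0} gamma^k * R_{t+k+1}
    return sum(gamma**k * rewards.get(t + k + 1, 0.0) for k in range(horizon))

print(ret(0))    # -(0.9**46 + 0.9**77) ≈ -0.0082
print(ret(10))   # -(0.9**36 + 0.9**67) ≈ -0.0234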

B. (2 marks) In another continuing task, suppose γ = 0.8 and the reward sequence is R1 = 3, R2 = 2, followed by
an infinite sequence of 5s. What will the returns G0 and G5 be?
Solution: G_0 = R_1 + γ R_2 + γ² R_3 + ⋯ = 3 + (0.8)(2) + (0.8)² [5 + (0.8)(5) + (0.8)²(5) + ⋯]
= 3 + 1.6 + (0.64)(5 / (1 − 0.8)) = 3 + 1.6 + (0.64)(25) = 20.6

G_5 = R_6 + γ R_7 + γ² R_8 + ⋯ = 5 + (0.8)(5) + (0.8)²(5) + ⋯ = 5 / (1 − 0.8) = 25
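A similar check for part B (a minimal Python sketch; the infinite tail of 5s is summed in closed form as 5 / (1 − γ)):

# Verify G_0 and G_5 for Q1.B (gamma = 0.8; R1 = 3, R2 = 2, then 5 forever).
gamma = 0.8
tail = 5.0 / (1.0 - gamma)                   # value of an infinite stream of 5s
G5 = tail                                    # from t = 5 onward every reward is 5
G0 = 3.0 + gamma * 2.0 + gamma**2 * tail     # 3 + 1.6 + 0.64 * 25 = 20.6
print(G0, G5)                                # 20.6 25.0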
Q2. A tech giant, Coocle, wants you to build an MDP to model their employee retention strategy. The company
categorizes its employees into three categories based on job satisfaction: low satisfaction (L), medium
satisfaction (M), and high satisfaction (H). When employees raise a concern, the company has two strategies for
addressing it: provide minimal support (ms) at zero cost, or offer full support (fs) at a cost of 2 units. Minimal
support causes the employee to drop one satisfaction level (unless they are already at the lowest level, in which
case they stay in the same state) with probability 0.8, or to stay in the same state with probability 0.2. Full
support increases employee satisfaction by one level (unless they are already at the highest level, in which case
they stay in the same state) with probability 0.9, or leaves them in the same state with probability 0.1. Any
transition that increases employee satisfaction yields an immediate reward of 8 units, and any transition that
decreases employee satisfaction incurs an immediate loss of 8 units.

A. (2 marks) Build a state transition diagram for this MDP with clear indication of states, actions, transition
probabilities, and rewards.

Solution (state transition diagram, described in text): three states {L, M, H} and two actions {ms, fs}; each arc
carries a transition probability p and an immediate reward r, with the fs cost of 2 netted into the reward (so a
satisfaction increase under fs is worth 8 − 2 = 6).

ms: L → L (p = 1.0, r = 0); M → L (p = 0.8, r = −8); M → M (p = 0.2, r = 0); H → M (p = 0.8, r = −8); H → H (p = 0.2, r = 0)

fs: L → M (p = 0.9, r = 6); L → L (p = 0.1, r = −2); M → H (p = 0.9, r = 6); M → M (p = 0.1, r = −2); H → H (p = 1.0, r = −2)
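For reference, the same transitions written out in Python (a minimal sketch; the dictionary P and the (probability, next state, reward) layout are conventions chosen here, with the fs cost of 2 already netted into the rewards):

# Coocle MDP from Q2: P[(state, action)] = [(prob, next_state, reward), ...]
# Rewards: +8 for a satisfaction increase, -8 for a decrease; every fs transition
# additionally pays the support cost of 2 (so an increase under fs is worth 8 - 2 = 6).
P = {
    ("L", "ms"): [(1.0, "L", 0.0)],                    # already at the lowest level
    ("M", "ms"): [(0.8, "L", -8.0), (0.2, "M", 0.0)],
    ("H", "ms"): [(0.8, "M", -8.0), (0.2, "H", 0.0)],
    ("L", "fs"): [(0.9, "M", 6.0), (0.1, "L", -2.0)],
    ("M", "fs"): [(0.9, "H", 6.0), (0.1, "M", -2.0)],
    ("H", "fs"): [(1.0, "H", -2.0)],                   # already at the highest level
}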

B. (4 marks) Consider a policy where the company always chooses minimal support (ms). Compute the true value
function for each state and determine possible improving actions, given a discount factor γ = 0.9.
Solution: For the policy π(s) = ms for all s ∈ S, with γ = 0.9:

V_π(L) = 1.0·(0 + 0.9·V_π(L))  ⇒  V_π(L) = 0

V_π(M) = 0.8·(−8 + 0.9·V_π(L)) + 0.2·(0 + 0.9·V_π(M)) = −6.4 + 0.18·V_π(M)  ⇒  V_π(M) = −6.4/0.82 ≈ −7.80

V_π(H) = 0.8·(−8 + 0.9·V_π(M)) + 0.2·(0 + 0.9·V_π(H)) ≈ −12.02 + 0.18·V_π(H)  ⇒  V_π(H) ≈ −12.02/0.82 ≈ −14.66

Possible improving actions (one-step look-ahead with q_π(s, a)):

q_π(L, ms) = 0 + 0.9·V_π(L) = 0
q_π(L, fs) = 0.9·(6 + 0.9·V_π(M)) + 0.1·(−2 + 0.9·V_π(L)) ≈ 0.9·(−1.02) − 0.2 ≈ −1.12
Since q_π(L, ms) > q_π(L, fs), there is no improving action at L.

q_π(M, ms) = V_π(M) ≈ −7.80
q_π(M, fs) = 0.9·(6 + 0.9·V_π(H)) + 0.1·(−2 + 0.9·V_π(M)) ≈ 0.9·(−7.19) + 0.1·(−9.02) ≈ −7.38
Since q_π(M, fs) > q_π(M, ms), fs is an improving action at M.

q_π(H, ms) = V_π(H) ≈ −14.66
q_π(H, fs) = 1.0·(−2 + 0.9·V_π(H)) ≈ −2 − 13.19 ≈ −15.19
Since q_π(H, ms) > q_π(H, fs), there is no improving action at H.
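These numbers can be reproduced with a short policy-evaluation sketch in Python (it assumes the transition dictionary P from the Q2.A sketch above is in scope):

# Evaluate the all-ms policy on the Q2 MDP, then check each action with a one-step look-ahead.
gamma = 0.9
states, actions = ["L", "M", "H"], ["ms", "fs"]
V = {s: 0.0 for s in states}

for _ in range(1000):                      # iterative policy evaluation for pi(s) = "ms"
    V = {s: sum(p * (r + gamma * V[s2]) for p, s2, r in P[(s, "ms")]) for s in states}

def q(s, a):
    # Expected return of taking action a in state s and following pi afterwards.
    return sum(p * (r + gamma * V[s2]) for p, s2, r in P[(s, a)])

print({s: round(V[s], 2) for s in states})                    # {'L': 0.0, 'M': -7.8, 'H': -14.66}
print({(s, a): round(q(s, a), 2) for s in states for a in actions})
# q(M, fs) ≈ -7.38 > q(M, ms) ≈ -7.80, so fs is an improving action at M;
# at L and H the ms action remains greedy with respect to V_pi.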
Q3. (1 mark) Suppose you are in an infinite-horizon MDP with rewards bounded by |R| ≤ 1 and γ = 0.99. What can
we say about the maximum possible value of a state V(s)?

A) V(s) can be arbitrarily large.
B) V(s) is at most 100.
C) V(s) is at most 1.
D) The maximum value of V(s) depends on the policy.

Answer: B
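The bound behind option B, written out: since every reward satisfies |R| ≤ 1,

V(s) = E[ R_{t+1} + γ R_{t+2} + γ² R_{t+3} + ⋯ ] ≤ 1 + γ + γ² + ⋯ = 1 / (1 − γ) = 1 / (1 − 0.99) = 100.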

Q4. (1 mark) Suppose you have two Markov Decision Processes (MDPs) with the same state transitions and rewards
but different discount factors γ_1 and γ_2. If γ_1 < γ_2, which of the following is true about their value functions
V(s)?

A) V_1(s) is always less than V_2(s) for all states s.
B) V_2(s) ≥ V_1(s) for all s, only if all the rewards are positive.
C) V_1(s) is always ≥ V_2(s) for all states s.
D) The Bellman equation does not hold when comparing two different discount factors.

Answer: B (when rewards can be negative, increasing γ can decrease V(s), so no general ordering holds)

Q5. (1 mark) Given an MDP with three states {A, B, C}, and two actions {Left, Right}, consider a policy π where:
π(A) = Left
π(B) = Right
π(C) = Left
After performing one step of policy improvement, which of the following is necessarily true?

A) The new policy π' will always have a higher value function than π.
B) The new policy π' will always be different from π.
C) The new policy π' is guaranteed to be the optimal policy.
D) The new policy π' will be at least as good as π.

Answer: D (by the policy improvement theorem; π' may equal π if π is already greedy with respect to V_π)

Q6. (1 mark) Which of the following scenarios would cause value iteration to take an unusually long time to
converge?

A) Low discount factor (γ close to 0).
B) Deterministic transitions.
C) Small action space.
D) Sparse reward structure.

Answer: D (with sparse rewards, value information must propagate through many backups before estimates change, slowing convergence)

Q7. (1 mark) If we initialize the value function V(s) arbitrarily and apply Value Iteration, what happens as the number
of iterations increases?

A) V(s) oscillates and does not converge
B) V(s) converges to the optimal value function V∗(s)
C) V(s) may diverge if initialized incorrectly
D) V(s) converges only if the reward function is bounded

Answer: B
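To illustrate Q7 (and the role of γ and the reward structure from Q6), here is a minimal value-iteration sketch in Python on the Q2 MDP, started from an arbitrary initialization (it again assumes the dictionary P from the Q2.A sketch):

# Value iteration from a random start: the iterates contract toward V* at rate gamma.
import random

gamma = 0.9
states, actions = ["L", "M", "H"], ["ms", "fs"]
V = {s: random.uniform(-50.0, 50.0) for s in states}     # arbitrary initialization

for sweep in range(500):
    # Bellman optimality backup: V(s) <- max_a sum_s' p * (r + gamma * V(s'))
    V_new = {s: max(sum(p * (r + gamma * V[s2]) for p, s2, r in P[(s, a)])
                    for a in actions)
             for s in states}
    done = max(abs(V_new[s] - V[s]) for s in states) < 1e-8
    V = V_new
    if done:
        break

greedy = {s: max(actions, key=lambda a: sum(p * (r + gamma * V[s2]) for p, s2, r in P[(s, a)]))
          for s in states}
print({s: round(V[s], 2) for s in states}, greedy)
# Regardless of the start, this converges to roughly {'L': 0.69, 'M': -5.64, 'H': -12.76}
# with greedy policy {'L': 'fs', 'M': 'fs', 'H': 'ms'}.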
