Quiz 2 Solutions
Name
ID No.
Q1 B. (2 marks) In another continuing task, suppose γ = 0.8 and the reward sequence is R1 = 3, R2 = 2, followed by
an infinite sequence of 5s. What will the returns G0 and G5 be?
Every reward from R3 onward is 5, so first compute the return from t = 2:

G2 = 5 + (0.8)(5) + (0.8)^2(5) + ... = 5 / (1 - 0.8) = 25

Working backwards:

G1 = R2 + γG2 = 2 + (0.8)(25) = 22
G0 = R1 + γG1 = 3 + (0.8)(22) = 3 + 17.6 = 20.6

(Equivalently, G0 = 3 + (0.8)(2) + (0.8)^2(25) = 3 + 1.6 + 16 = 20.6.)

Every reward from R6 onward is also 5, so

G5 = 5 + (0.8)(5) + (0.8)^2(5) + ... = 5 / (1 - 0.8) = 25
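As a quick numerical cross-check of these returns (not part of the original handwritten solution; the function name and the truncation length are illustrative choices), the infinite tail of 5s can be approximated by a long finite run:

```python
# Numerical check of G0 and G5 for gamma = 0.8 and rewards 3, 2, 5, 5, ...
# The infinite tail of 5s is truncated; 200 terms are plenty for convergence.
gamma = 0.8
rewards = [3, 2] + [5] * 200          # R1, R2, then a long run of 5s

def discounted_return(rewards, gamma, t):
    """G_t = R_{t+1} + gamma*R_{t+2} + ... (matching the indexing in the question)."""
    return sum(gamma**k * r for k, r in enumerate(rewards[t:]))

print(round(discounted_return(rewards, gamma, 0), 2))   # ~20.6
print(round(discounted_return(rewards, gamma, 5), 2))   # ~25.0
```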
Q2. A tech giant, Coocle, wants you to build an MDP to model their employee retention strategy. The company
categorizes its employees into three categories, namely low satisfaction (L), medium satisfaction (M), and high
satisfaction (H), based on their job satisfaction. When employees raise a concern, the company has two strategies
for addressing it: one is to provide minimal support (ms) with a zero cost, and the second option is to offer full
support (fs) with a cost of 2 units. Minimal support will lead the employee to lose satisfaction and reach one level
below (unless they already have the least satisfaction, in which case they will stay in the same state) with probability
0.8 or stay in the same state with probability 0.2. If full support is provided, it will increase employee satisfaction
(unless they already have the highest satisfaction, in which case they will stay in the same state) with probability 0.9
or stay in the same state with probability 0.1. Any transition that increases employee satisfaction yields an
immediate reward of 8 units, and any transition that decreases employee satisfaction incurs an immediate loss of
8 units.
A. (2 marks) Build a state transition diagram for this MDP with clear indication of states, actions, transition
probabilities, and rewards.
State transition diagram, written out as a transition list (rewards shown with the fs cost of 2 already subtracted):

(L, ms): stay in L with probability 1.0, reward 0
(M, ms): go to L with probability 0.8, reward -8; stay in M with probability 0.2, reward 0
(H, ms): go to M with probability 0.8, reward -8; stay in H with probability 0.2, reward 0
(L, fs): go to M with probability 0.9, reward 8 - 2 = 6; stay in L with probability 0.1, reward -2
(M, fs): go to H with probability 0.9, reward 8 - 2 = 6; stay in M with probability 0.1, reward -2
(H, fs): stay in H with probability 1.0, reward -2
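For reference, the same transition model can be written down programmatically; the sketch below is one possible encoding (the name `mdp` and the triple layout are illustrative choices, not part of the quiz), with the fs cost of 2 folded into the rewards:

```python
# Coocle retention MDP: (probability, next_state, reward) triples per (state, action).
mdp = {
    ('L', 'ms'): [(1.0, 'L', 0)],
    ('M', 'ms'): [(0.8, 'L', -8), (0.2, 'M', 0)],
    ('H', 'ms'): [(0.8, 'M', -8), (0.2, 'H', 0)],
    ('L', 'fs'): [(0.9, 'M', 6), (0.1, 'L', -2)],
    ('M', 'fs'): [(0.9, 'H', 6), (0.1, 'M', -2)],
    ('H', 'fs'): [(1.0, 'H', -2)],
}

# Sanity check: outgoing probabilities sum to 1 for every (state, action) pair.
for (s, a), outcomes in mdp.items():
    assert abs(sum(p for p, _, _ in outcomes) - 1.0) < 1e-9
```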
B. (4 marks) Consider a policy where the company always chooses minimal support (ms). Compute the true value
function for each state and determine possible improving actions, given a discount factor γ = 0.9.
Given policy: π(s) = ms for all s ∈ S, with γ = 0.9.

Vπ(L): under ms the employee stays in L with probability 1 and reward 0, so
Vπ(L) = 0 + 0.9 Vπ(L)  ⇒  Vπ(L) = 0

Vπ(M) = 0.8(-8 + 0.9 Vπ(L)) + 0.2(0 + 0.9 Vπ(M))
      = -6.4 + 0.18 Vπ(M)
  ⇒  Vπ(M) = -6.4 / 0.82 ≈ -7.80

Vπ(H) = 0.8(-8 + 0.9 Vπ(M)) + 0.2(0 + 0.9 Vπ(H))
      = -6.4 + (0.8)(0.9)(-7.80) + 0.18 Vπ(H)
      ≈ -12.02 + 0.18 Vπ(H)
  ⇒  Vπ(H) ≈ -12.02 / 0.82 ≈ -14.66

Possible improving actions (compare qπ(s, fs) with qπ(s, ms) = Vπ(s)):

qπ(L, fs) = 0.9(6 + 0.9 Vπ(M)) + 0.1(-2 + 0.9 Vπ(L))
          ≈ 5.4 - 6.32 - 0.2 = -1.12 < Vπ(L) = 0,
so ms remains the better action in L.

qπ(M, fs) = 0.9(6 + 0.9 Vπ(H)) + 0.1(-2 + 0.9 Vπ(M))
          ≈ 0.9(6 - 13.19) + 0.1(-2 - 7.02) = -6.47 - 0.90 = -7.37 > Vπ(M) ≈ -7.80,
so fs is an improving action in M.

qπ(H, fs) = 1.0(-2 + 0.9 Vπ(H)) ≈ -2 - 13.19 = -15.19 < Vπ(H) ≈ -14.66.
Since qπ(H, ms) > qπ(H, fs), there is no improving action in H.
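The hand computation above can be cross-checked with a short policy-evaluation loop followed by a one-step greedy improvement; this is only a sketch under the transition model from Part A (helper names such as `q` are illustrative):

```python
# Evaluate pi(s) = ms for all s, then do a one-step policy improvement check.
mdp = {
    ('L', 'ms'): [(1.0, 'L', 0)],
    ('M', 'ms'): [(0.8, 'L', -8), (0.2, 'M', 0)],
    ('H', 'ms'): [(0.8, 'M', -8), (0.2, 'H', 0)],
    ('L', 'fs'): [(0.9, 'M', 6), (0.1, 'L', -2)],
    ('M', 'fs'): [(0.9, 'H', 6), (0.1, 'M', -2)],
    ('H', 'fs'): [(1.0, 'H', -2)],
}
gamma = 0.9
states, actions = ['L', 'M', 'H'], ['ms', 'fs']

def q(s, a, V):
    """Expected one-step return of taking action a in state s, then bootstrapping from V."""
    return sum(p * (r + gamma * V[s2]) for p, s2, r in mdp[(s, a)])

# Iterative policy evaluation for the all-ms policy.
V = {s: 0.0 for s in states}
for _ in range(1000):
    V = {s: q(s, 'ms', V) for s in states}
print({s: round(v, 2) for s, v in V.items()})        # ~{'L': 0.0, 'M': -7.8, 'H': -14.66}

# Greedy (one-step improved) action with respect to V_pi.
for s in states:
    print(s, max(actions, key=lambda a: q(s, a, V)))  # expect: L -> ms, M -> fs, H -> ms
```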
Q3. (1 mark) Suppose you are in an infinite-horizon MDP with rewards bounded by |R| ≤ 1 and γ = 0.99. What can
we say about the maximum possible value of a state V(s)?
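A useful way to reason about this is the geometric-series bound |V(s)| ≤ R_max / (1 - γ); a two-line numeric check (a sketch, not part of the quiz solutions):

```python
# With |R| <= 1 and gamma = 0.99, every discounted return is bounded by sum_k gamma^k * 1.
gamma, r_max = 0.99, 1.0
print(r_max / (1 - gamma))   # ~100: upper bound on |V(s)|
```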
Q4. (1 mark) Suppose you have two Markov Decision Processes (MDPs) with the same state transitions and rewards
but different discount factors γ_1 and γ_2. If γ_1 < γ_2, which of the following is true about their value functions
V(s)?
Q5. (1 mark) Given an MDP with three states {A, B, C}, and two actions {Left, Right}, consider a policy π where:
π(A) = Left
π(B) = Right
π(C) = Left
After performing one step of policy improvement, which of the following is necessarily true?
A) The new policy π' will always have a higher value function than π.
B) The new policy π' will always be different from π.
C) The new policy π' is guaranteed to be the optimal policy.
D) The new policy π' will be at least as good as π.
Q6. (1 mark) Which of the following scenarios would cause value iteration to take an unusually long time to
converge?
Q7. (1 mark) If we initialize the value function V(s) arbitrarily and apply Value Iteration, what happens as the number
of iterations increases?
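As an illustration relevant to Q6 and Q7 (a sketch only, reusing the Q2 MDP and arbitrary starting values; not part of the quiz solutions), value iteration run from two different initializations ends at the same fixed point:

```python
import random

# Value iteration on the Q2 MDP from two arbitrary initial value functions;
# both runs converge to (approximately) the same V*.
mdp = {
    ('L', 'ms'): [(1.0, 'L', 0)],
    ('M', 'ms'): [(0.8, 'L', -8), (0.2, 'M', 0)],
    ('H', 'ms'): [(0.8, 'M', -8), (0.2, 'H', 0)],
    ('L', 'fs'): [(0.9, 'M', 6), (0.1, 'L', -2)],
    ('M', 'fs'): [(0.9, 'H', 6), (0.1, 'M', -2)],
    ('H', 'fs'): [(1.0, 'H', -2)],
}
gamma, states, actions = 0.9, ['L', 'M', 'H'], ['ms', 'fs']

def value_iteration(V, sweeps=1000):
    for _ in range(sweeps):
        V = {s: max(sum(p * (r + gamma * V[s2]) for p, s2, r in mdp[(s, a)])
                    for a in actions)
             for s in states}
    return {s: round(v, 4) for s, v in V.items()}

print(value_iteration({s: 0.0 for s in states}))
print(value_iteration({s: random.uniform(-100, 100) for s in states}))  # same V*
```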
Cheat sheet