Assignment 3

Reinforcement Learning
Prof. B. Ravindran
1. Which of the following is true for an MDP?
(a) Pr(s_{t+1}, r_{t+1} | s_t, a_t) = Pr(s_{t+1}, r_{t+1})
(b) Pr(s_{t+1}, r_{t+1} | s_t, a_t, s_{t−1}, a_{t−1}, s_{t−2}, a_{t−2}, ..., s_0, a_0) = Pr(s_{t+1}, r_{t+1} | s_t, a_t)
(c) Pr(s_{t+1}, r_{t+1} | s_t, a_t) = Pr(s_{t+1}, r_{t+1} | s_0, a_0)
(d) Pr(s_{t+1}, r_{t+1} | s_t, a_t) = Pr(s_t, r_t | s_{t−1}, a_{t−1})
Sol. (b)
(b) is the Markov property and holds for any MDP: the next state and reward depend only on the current state and action, not on the rest of the history. (a), (c) and (d) are not true in general.
2. The baseline in the REINFORCE update should not depend on which of the following (without invalidating any of the steps in the proof of REINFORCE)?
(a) r_{n−1}
(b) r_n
(c) Action taken (a_n)
(d) None of the above
Sol. (c)
The baseline must not depend on the action taken. It can, however, depend on current and past rewards. An example baseline given in the videos is the average of the rewards obtained so far; a sketch of this idea is given below.
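
As an illustration, here is a minimal Python sketch of a REINFORCE-style bandit update that uses the running average of rewards as the baseline. It is not taken from the lectures; the arm means, step size, and softmax-preference parameterization are all assumptions made for the example.

    import numpy as np

    # Minimal sketch: softmax-preference bandit, REINFORCE update with a
    # running-average-of-rewards baseline. Arm means and step size are assumed.
    rng = np.random.default_rng(0)
    true_means = np.array([0.2, 0.5, 0.8])   # assumed arm reward means
    H = np.zeros(3)                          # action preferences (policy parameters)
    alpha, baseline, n = 0.1, 0.0, 0

    for t in range(1000):
        pi = np.exp(H - H.max())
        pi /= pi.sum()                       # softmax policy over preferences
        a = rng.choice(3, p=pi)
        r = rng.normal(true_means[a], 1.0)   # sampled reward
        n += 1
        baseline += (r - baseline) / n       # average of rewards obtained so far
        grad = -pi.copy()
        grad[a] += 1.0                       # d ln pi(a) / dH for a softmax policy
        H += alpha * (r - baseline) * grad   # REINFORCE update with baseline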
3. In many supervised machine learning algorithms, such as neural networks, we rely on the
gradient descent technique. However, in the policy gradient approach to bandit problems, we
made use of gradient ascent. This discrepancy can mainly be attributed to the differences in

(a) the objectives of the learning tasks
(b) the parameters of the functions whose gradients are being calculated
(c) the nature of the feedback received by the algorithms

Sol. (a), (c)


The feedback in most supervised learning algorithms is an error signal that we wish to minimise, so we perform gradient descent. In the policy gradient approach we are trying to maximise the reward signal, hence gradient ascent.
4. In the case of linear bandits, suppose we have two actions, a_1 and a_2. The policy π to be followed when encountering a state s is given by

π(a_1|s) = e^{w_1^⊤ s} / (e^{w_1^⊤ s} + e^{w_2^⊤ s})

π(a_2|s) = e^{w_2^⊤ s} / (e^{w_1^⊤ s} + e^{w_2^⊤ s})
where w_1 and w_2 are weight vectors associated with a_1 and a_2 respectively. If we are using REINFORCE to learn the parameters, what will be the updates for w_1 and w_2 when we pull action a_1?
(i) w_1 → w_1 + αr(s − (e^{w_1^⊤ s} / (e^{w_1^⊤ s} + e^{w_2^⊤ s})) s)
(ii) w_1 → w_1 + αr(−(e^{w_1^⊤ s} / (e^{w_1^⊤ s} + e^{w_2^⊤ s})) s)
(iii) w_2 → w_2 + αr(s − (e^{w_2^⊤ s} / (e^{w_1^⊤ s} + e^{w_2^⊤ s})) s)
(iv) w_2 → w_2 + αr(−(e^{w_2^⊤ s} / (e^{w_1^⊤ s} + e^{w_2^⊤ s})) s)

Which of the above updates are correct?


(a) (i), (iii)
(b) (i), (iv)
(c) (ii), (iii)
(d) (ii), (iv)
Sol. (b)
Derive the update using the REINFORCE formula: ∂ln π(a_1|s)/∂w_1 = s − π(a_1|s)s and ∂ln π(a_1|s)/∂w_2 = −π(a_2|s)s, so multiplying by αr gives updates (i) and (iv). A numerical sketch follows.
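
For concreteness, here is a small NumPy sketch of this update step; the function and variable names are illustrative and not part of the course material.

    import numpy as np

    def reinforce_update_after_a1(w1, w2, s, r, alpha):
        # One REINFORCE step for the two-action softmax policy above,
        # assuming action a1 was pulled and reward r was received.
        e1, e2 = np.exp(w1 @ s), np.exp(w2 @ s)
        pi1, pi2 = e1 / (e1 + e2), e2 / (e1 + e2)
        # d ln pi(a1|s) / d w1 = s - pi(a1|s) * s   -> update (i)
        w1_new = w1 + alpha * r * (s - pi1 * s)
        # d ln pi(a1|s) / d w2 = -pi(a2|s) * s      -> update (iv)
        w2_new = w2 + alpha * r * (-pi2 * s)
        return w1_new, w2_new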
5. The update in REINFORCE is given by θ_{t+1} = θ_t + α r_t ∂ln π(a_t; θ_t)/∂θ_t, where r_t ∂ln π(a_t; θ_t)/∂θ_t is an unbiased estimator of the true gradient of the performance function. However, there is another variant of REINFORCE, where a baseline b that is independent of the action taken is subtracted from the obtained reward, i.e., the update is given by θ_{t+1} = θ_t + α(r_t − b) ∂ln π(a_t; θ_t)/∂θ_t. How are E[(r_t − b) ∂ln π(a_t; θ_t)/∂θ_t] and E[r_t ∂ln π(a_t; θ_t)/∂θ_t] related?

(a) E[(r_t − b) ∂ln π(a_t; θ_t)/∂θ_t] = E[r_t ∂ln π(a_t; θ_t)/∂θ_t]
(b) E[(r_t − b) ∂ln π(a_t; θ_t)/∂θ_t] < E[r_t ∂ln π(a_t; θ_t)/∂θ_t]
(c) E[(r_t − b) ∂ln π(a_t; θ_t)/∂θ_t] > E[r_t ∂ln π(a_t; θ_t)/∂θ_t]
(d) Could be either of (a), (b) or (c), depending on the choice of baseline

Sol. (a)

E[b ∂ln π(a_t; θ_t)/∂θ_t] = E[b (1/π(a_t; θ_t)) ∂π(a_t; θ_t)/∂θ_t]
                          = Σ_a b (1/π(a; θ_t)) (∂π(a; θ_t)/∂θ_t) π(a; θ_t)
                          = b Σ_a ∂π(a; θ_t)/∂θ_t
                          = b ∂(1)/∂θ_t
                          = 0

Thus, E[(r_t − b) ∂ln π(a_t; θ_t)/∂θ_t] = E[r_t ∂ln π(a_t; θ_t)/∂θ_t].
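
The cancellation above can also be checked numerically. Below is a small Monte Carlo sketch; the fixed softmax policy, baseline value, and sample count are arbitrary assumptions. It shows that the sample mean of b ∂ln π(a; θ)/∂θ is approximately zero.

    import numpy as np

    rng = np.random.default_rng(1)
    theta = np.array([0.3, -0.2, 0.1])            # preferences for 3 actions
    pi = np.exp(theta) / np.exp(theta).sum()      # fixed softmax policy
    b = 5.0                                       # any action-independent baseline

    grads = np.zeros((200_000, 3))
    for i in range(grads.shape[0]):
        a = rng.choice(3, p=pi)
        grad_log_pi = -pi.copy()
        grad_log_pi[a] += 1.0                     # d ln pi(a; theta) / d theta
        grads[i] = b * grad_log_pi
    print(grads.mean(axis=0))                     # close to [0, 0, 0]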

6. Consider the following policy-search algorithm for a multi-armed binary bandit:

∀a, π_{t+1}(a) = π_t(a)(1 − α) + α(1_{a=a_t} r_t + (1 − 1_{a=a_t})(1 − r_t))

where 1_{a=a_t} is 1 if a = a_t and 0 otherwise. Which of the following is true for the above algorithm?
(a) It is the L_{R−I} algorithm.
(b) It is the L_{R−ϵP} algorithm.
(c) It would work well if the best arm had a probability of 0.9 of resulting in a +1 reward and the next best arm had a probability of 0.5 of resulting in a +1 reward.
(d) It would work well if the best arm had a probability of 0.3 of resulting in a +1 reward and the worst arm had a probability of 0.25 of resulting in a +1 reward.
Sol. (c)
The given algorithm is the L_{R−P} algorithm. It would work well for the case described in (c): it gives equal weight to penalties and rewards, and since the gap between the best arm's and the next best arm's probabilities of giving a +1 reward is significant, the algorithm would easily identify the best arm. A simulation sketch is given below.
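
Here is a small simulation sketch of the above update for a two-armed binary bandit using the success probabilities from option (c); the step size, horizon, and initial policy are assumptions made for the example.

    import numpy as np

    rng = np.random.default_rng(2)
    p_success = np.array([0.9, 0.5])   # P(+1 reward) for each arm, as in option (c)
    pi = np.array([0.5, 0.5])          # initial policy
    alpha = 0.05

    for t in range(5000):
        a = rng.choice(2, p=pi)
        r = float(rng.random() < p_success[a])    # binary reward in {0, 1}
        ind = np.zeros(2)
        ind[a] = 1.0                              # indicator 1_{a = a_t}
        pi = pi * (1 - alpha) + alpha * (ind * r + (1 - ind) * (1 - r))
    print(pi)   # most of the probability mass ends up on the better arm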
7. The actions in contextual bandits do not determine the next state, but typically do in full RL
problems. True or false?
(a) True
(b) False
Sol. (a)
Refer to the contextual bandits lecture.
8. Value function-based methods generalize better to continuous actions than policy gradient methods.
(a) True
(b) False
Sol. (b)
When the action space is continuous, value function-based methods cannot store a value for each action. Because action values must be approximated, finding the estimated optimal action requires optimizing over the action space for every decision. This approach is slow, and policy gradient-based methods, which parameterize the policy directly, are likely to generalize better.
9. Let's assume that for some full RL problem we are acting according to a policy π. At some time t, we are in a state s where we took action a_1. After a few time steps, at time t′, the same state s was reached, where we performed an action a_2 (≠ a_1). Which of the following statements is true?
(a) π is definitely a stationary policy
(b) π is definitely a non-stationary policy

(c) π can be stationary or non-stationary.

Sol. (c)
A stationary policy can be stochastic, so different actions can be chosen in the same state at different time steps. Thus π can be either a stationary or a non-stationary policy.
10. In solving a multi-armed bandit problem using the policy gradient method, are we assured of converging to the optimal solution?

(a) no
(b) yes
Sol. (a)
Depending on the properties of the function whose gradient is being ascended, the policy gradient approach may converge only to a local optimum.
