Solution 3
Reinforcement Learning
Prof. B. Ravindran
1. In solving a multi-armed bandit problem using the policy gradient method, are we assured of
converging to the optimal solution?
(a) no
(b) yes
Sol. (a)
Depending on the properties of the function whose gradient is being ascended, the policy
gradient approach may converge only to a local optimum, so convergence to the optimal
solution is not guaranteed.
2. In many supervised machine learning algorithms, such as neural networks, we rely on the
gradient descent technique. However, in the policy gradient approach to bandit problems, we
made use of gradient ascent. This discrepancy can mainly be attributed to the differences in
Sol. (c)
The feedback in most supervised learning algorithms is an error signal which we wish to
minimise. Hence, we would look to perform gradient descent. In policy gradient, we are trying
to maximise the reward signal, hence gradient ascent.
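As an illustration of this sign difference, here is a minimal sketch; the parameter value, gradients, and learning rate below are illustrative assumptions, not part of the original question.

```python
# Hypothetical parameter, gradients, and learning rate, for illustration only.
def gradient_descent_step(theta, grad_loss, lr=0.1):
    # supervised learning: step against the gradient to reduce an error/loss signal
    return theta - lr * grad_loss

def gradient_ascent_step(theta, grad_reward, lr=0.1):
    # policy gradient: step along the gradient to increase the expected reward
    return theta + lr * grad_reward

print(gradient_descent_step(1.0, 0.5))  # 0.95
print(gradient_ascent_step(1.0, 0.5))   # 1.05
```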
3. Consider a bandit problem in which the parameters on which the policy depends are the
preferences of the actions and the action selection probabilities are determined by the softmax
relationship π(ai) = e^(θi) / Σ_{j=1}^{k} e^(θj), where k is the total number of actions and θi is the
preference value of action ai. Derive the parameter update conditions according to the REINFORCE
procedure considering the above described parameters and where the baseline is the reference
reward defined as the average of the rewards received for all arms.
Sol. (a)
According to the REINFORCE method, at each step we update the parameters as

∆θn = α(Rn − bn) ∂ ln π(an; θ)/∂θn

Considering the baseline b as the average of all rewards seen so far, and α as the step size
parameter, the update for the preference of the selected action ai is

∆θi = α(R − b)(1 − π(ai; θ))

and for every other action aj, j ≠ i, the corresponding update is ∆θj = −α(R − b) π(aj; θ).
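To make the derived update concrete, the following is a minimal sketch of a REINFORCE learner for a k-armed bandit with softmax preferences and an average-reward baseline; the bandit itself (true_means), the step size alpha, and the number of steps are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
k = 5
true_means = rng.normal(0.0, 1.0, k)   # hypothetical true reward means of the k arms
theta = np.zeros(k)                    # action preferences
alpha = 0.1                            # step size (assumed)
baseline, n_rewards = 0.0, 0           # running average of all rewards seen so far

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

for t in range(5000):
    pi = softmax(theta)
    a = rng.choice(k, p=pi)            # sample an arm from the softmax policy
    r = rng.normal(true_means[a], 1.0)

    # d ln pi(a_i)/d theta_i = 1 - pi(a_i);  d ln pi(a_i)/d theta_j = -pi(a_j) for j != i
    grad_log_pi = -pi
    grad_log_pi[a] += 1.0
    theta += alpha * (r - baseline) * grad_log_pi

    # update the reference reward (average of all rewards received)
    n_rewards += 1
    baseline += (r - baseline) / n_rewards

print("most preferred arm:", int(np.argmax(theta)), "| best arm:", int(np.argmax(true_means)))
```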
4. Repeat the above problem for the case where the parameters are the mean and variance of the
normal distribution according to which the actions are selected and the baseline is zero.

(a) ∆µn = αn rn (an − µn)/σn ;  ∆σn = αn rn {(an − µn)²/σn − 1}

(b) ∆µn = αn rn (an − µn)/σn² ;  ∆σn = αn rn [(an − µn)² − σn²]/σn²
Sol. (c)
Assuming the parameters to be µ and σ, we have the policy

π(a; µ, σ) = (1/(√(2π) σ)) e^(−(a−µ)²/(2σ²))

For the mean:

∂ ln π(an; µn, σn)/∂µn = ∂/∂µn [−(an − µn)²/(2σn²)] = (an − µn)/σn²

For the variance:

∂ ln π(an; µn, σn)/∂σn = ∂/∂σn [−ln(√(2π) σn)] + ∂/∂σn [−(an − µn)²/(2σn²)]

Solving, we have:

∂ ln π(an; µn, σn)/∂σn = −1/σn + (an − µn)²/σn³ = (1/σn){((an − µn)/σn)² − 1}

With a zero baseline, the REINFORCE updates are therefore
∆µn = αn rn (an − µn)/σn² and ∆σn = αn rn (1/σn){((an − µn)/σn)² − 1}.
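The following sketch applies the gradients derived above in a REINFORCE loop for a Gaussian policy with zero baseline; the reward function, step size, iteration count, and the lower bound on σ are all illustrative assumptions (the bound simply keeps σ from collapsing to zero).

```python
import numpy as np

rng = np.random.default_rng(0)

def reward(a):
    # hypothetical continuous-armed bandit: reward peaks at a = 2.0
    return float(np.exp(-(a - 2.0) ** 2))

mu, sigma = 0.0, 1.0        # initial policy parameters (assumed)
alpha, sigma_min = 0.01, 0.2

for t in range(20000):
    a = rng.normal(mu, sigma)
    r = reward(a)

    # gradients of ln pi(a; mu, sigma) derived above
    grad_mu = (a - mu) / sigma ** 2
    grad_sigma = ((a - mu) ** 2 - sigma ** 2) / sigma ** 3

    # REINFORCE with zero baseline: Delta(parameter) = alpha * r * gradient
    mu += alpha * r * grad_mu
    sigma = max(sigma_min, sigma + alpha * r * grad_sigma)

print("learned mu:", round(mu, 2), "learned sigma:", round(sigma, 2))
```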
5. Which among the following is/are differences between contextual bandits and full RL problems?
(a) the actions and states in contextual bandits share features, but not in full RL problems
(b) the actions in contextual bandits do not determine the next state, but typically do in full
RL problems
(c) full RL problems can be modelled as MDPs whereas contextual bandit problems cannot
(d) no difference
Sol. (b)
Option (a) is not true because we can think of representations where actions and states share
features. Similarly, option (c) is false because contextual bandit problems can also be modelled
as MDPs.
6. Given a stationary policy, is it possible that if the agent is in the same state at two different
time steps, it can choose two different actions?
(a) no
(b) yes
Sol. (b)
A stationary policy does not mean that the policy is deterministic. For example, for state
s1 and actions a1 and a2 , a stationary policy π may have π(a1 |s1 ) = 0.4 and π(a2 |s1 ) = 0.6.
Thus, at different time steps, if the agent is in the same state s1 , it may perform either of the
two actions.
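A tiny sketch of this point, with the hypothetical stationary policy π(a1|s1) = 0.4, π(a2|s1) = 0.6 held fixed throughout:

```python
import numpy as np

rng = np.random.default_rng(0)
# the stationary policy never changes, yet sampled actions can differ across visits to s1
policy = {"s1": (["a1", "a2"], [0.4, 0.6])}   # pi(a1|s1) = 0.4, pi(a2|s1) = 0.6

actions, probs = policy["s1"]
print([str(rng.choice(actions, p=probs)) for _ in range(5)])  # a mix of 'a1' and 'a2'
```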
7. In class we saw that it is possible to learn via a sequence of stationary policies, i.e., during an
episode, the policy does not change, but we move to a different stationary policy before the
next episode begins. Does the temporal difference method encountered when discussing the
tic-tac-toe example follow this pattern of learning?
(a) no
(b) yes
Sol. (a)
Recall that within an episode, at each step, after observing a reward, we modify the value
function (the probability values in the tic-tac-toe example). Since the policy used to select
actions is derived from the value function, in effect we are modifying the policy at each step.
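For reference, a schematic sketch of the kind of within-episode value update used in the tic-tac-toe example; the state names, value estimates, and step size here are hypothetical. Because move selection is greedy with respect to these values, updating them after every move amounts to changing the policy before the episode ends.

```python
# values: estimated probability of eventually winning from each board state
def td_backup(values, prev_state, next_state, alpha=0.1):
    # nudge the previous state's value toward the value of the state that followed
    values[prev_state] += alpha * (values[next_state] - values[prev_state])

V = {"s": 0.5, "s_next": 0.9}   # hypothetical states and estimates
td_backup(V, "s", "s_next")
print(V["s"])                   # 0.54: the estimate changes mid-episode
```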
8. We saw the following definition of the action-value function for policy π: qπ (s, a) = Eπ [Gt |St =
s, At = a]. Suppose that the action selected according to the policy π in state s is a1 . For the
same state, will the function be defined for actions other than a1 ?
(a) no
(b) yes
Sol. (b)
The interpretation of the action-value function qπ (s, a) = Eπ [Gt |St = s, At = a] is the
expectation of the return given that we take action a in state s at time t and thereafter select actions
based on policy π. Thus, the only constraint for the actions is that they must be applicable
in state s.
9. Consider a 100x100 grid world domain where the agent starts each episode in the bottom-left
corner, and the goal is to reach the top-right corner in the least number of steps. To learn an
optimal policy to solve this problem you decide on a reward formulation in which the agent
receives a reward of +1 on reaching the goal state and 0 for all other transitions. Suppose
you try two variants of this reward formulation, (P1 ), where you use discounted returns with
γ ∈ (0, 1), and (P2 ), where no discounting is used. Which among the following would you
expect to observe?
Sol. (c)
In (P2 ), since there is no discounting, the return for each episode regardless of the number of
steps is +1. This prevents the agent from learning a policy which tries to minimise the number
of steps to reach the goal state. In (P1 ), the discount factor ensures that the longer the agent
takes to reach the goal state, the smaller the return it receives. This motivates the agent to
take the shortest path to the goal state.
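A small numeric check of this argument; the value of γ and the path lengths are illustrative assumptions.

```python
gamma = 0.99                          # assumed discount factor for (P1)
for steps in (198, 5000):             # shortest path (198 moves) vs. a long wandering path
    discounted = gamma ** (steps - 1)  # the single +1 reward arrives on the final transition
    print(steps, "steps | discounted return:", round(discounted, 3), "| undiscounted return: 1")
```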
10. Given an MDP with finite state set, S, and an arbitrary function, f , that maps S to the real
numbers, the MDP has a policy π such that f = V π .
(a) false
(b) true
Sol. (a)
Consider an MDP where all rewards are 0. Then V π (s) = 0 for every state s and every policy
π, so any function f that is non-zero for some state cannot equal V π for any policy.