Assignment 8: Reinforcement Learning
Prof. B. Ravindran
1. You are given a training set of vectors Φ, such that each row of the matrix Φ corresponds to the k
attributes of a single training sample. Suppose that you are asked to find a linear function that
minimizes the mean squared error for a given set of stationary targets y using linear regression.
State true/false for the following statement.
Statement: If the column vectors of Φ are linearly independent, then there exists a unique
linear function that minimizes the mean squared error.
(a) True
(b) False
Sol. (a)
If the column vectors of Φ are linearly independent, then (Φ^T Φ) is invertible. Using the closed-form
solution θ = (Φ^T Φ)^{-1} Φ^T y, a unique solution to the linear regression problem must exist.
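As a quick check of this argument, here is a minimal numpy sketch (with made-up data; the shapes and names are illustrative only) showing that when the columns of Φ are linearly independent, the normal-equations solution exists and agrees with a generic least-squares solver:

```python
import numpy as np

# Made-up training data: 100 samples, k = 5 attributes each.
rng = np.random.default_rng(0)
Phi = rng.normal(size=(100, 5))   # columns are linearly independent with high probability
y = rng.normal(size=100)          # stationary targets

# Closed-form normal-equations solution: theta = (Phi^T Phi)^-1 Phi^T y.
theta = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)

# A generic least-squares solver finds the same minimizer, so it is unique here.
theta_lstsq, *_ = np.linalg.lstsq(Phi, y, rcond=None)
assert np.allclose(theta, theta_lstsq)
```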
2. Which of the following statements are true?
Sol. (a),(b)
Refer to the lectures on function approximation.
3. In which of the following cases would taking the loss of a function approximator as
Σ_{s∈S} (V̂(s) − V(s))^2 lead to poor performance? Consider 'relevant' states to be those which are
visited frequently when executing near-optimal policies.
Sol. (a)
A small percentage of relevant states would cause poor performance in this case. Since only
a few states are relevant, it is more important for the function approximator to do well in
those states than in the irrelevant ones. However, in the given loss function all states are
weighted equally, which makes the approximator perform poorly on the states that matter.
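To make this concrete, here is a small hypothetical two-state example (the numbers are invented for illustration) comparing the uniform loss above with a loss weighted by on-policy visitation frequencies:

```python
import numpy as np

# Hypothetical two-state problem: one shared parameter w must represent both
# states, so the approximator cannot fit both values exactly.
V_true = np.array([10.0, 0.0])   # s1 is 'relevant', s2 is rarely visited
mu = np.array([0.99, 0.01])      # on-policy visitation frequencies

# Uniform loss  sum_s (w - V(s))^2  is minimized by the plain mean.
w_uniform = V_true.mean()                    # 5.0 -> error of 5 on the relevant state

# Visitation-weighted loss  sum_s mu(s) (w - V(s))^2  is minimized by the weighted mean.
w_weighted = (mu * V_true).sum() / mu.sum()  # 9.9 -> error of 0.1 on the relevant state

print(w_uniform, w_weighted)
```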
4. Assertion: It is not possible to use look-up table based methods to solve continuous state
or action space problems. (Assume discretization of continuous space is not allowed)
Reason: For continuous state or action space, there are an infinite number of states/actions.
(a) Both Assertion and Reason are true, and Reason is the correct explanation for Assertion.
(b) Both Assertion and Reason are true, but Reason is not the correct explanation for Assertion.
(c) Assertion is true, Reason is false.
(d) Both Assertion and Reason are false.
Sol. (a)
Unless the continuous space is split into discrete points, it is not possible to maintain a look-up table, since the table would need an entry for each of the infinitely many states or actions.
5. Assertion: If we make incremental updates for a linear approximation of the value function
v̂ under a policy π, using gradient descent to minimize the mean-square-error between v̂(st )
and bootstrapped targets Rt + γv̂(st+1 ), then we will eventually converge to the same solution
that we would have if we used the true vπ values as targets instead.
Reason: Each update moves v̂ closer to vπ , so eventually the bootstrapped targets Rt +
γv̂(st+1 ) will converge to the true vπ (st ) values
6. Assertion: For a discrete action space, since we cannot maximize over the actions explicitly, we need to formulate a classification problem.
(a) Both Assertion and Reason are true, and Reason is the correct explanation for Assertion.
(b) Both Assertion and Reason are true, but Reason is not the correct explanation for Assertion.
(c) Assertion is true, Reason is false.
(d) Both Assertion and Reason are false.
Sol. (d)
Since the action space is discrete, we do not need to train a classifier. We can simply evaluate
each action with the function approximator and pick the one with the maximum estimated Q-value.
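A minimal sketch of this, assuming a hypothetical linear Q-function approximator and feature map (all names and numbers below are made up):

```python
import numpy as np

n_actions = 4
s = np.array([0.3, -1.2])   # made-up current state

def phi(state, action):
    # Hypothetical feature map: copy the state features into the block
    # selected by the action index, zeros elsewhere.
    x = np.zeros(state.size * n_actions)
    x[action * state.size:(action + 1) * state.size] = state
    return x

theta = np.random.default_rng(1).normal(size=s.size * n_actions)  # stand-in weights

# Evaluate Q(s, a) = theta . phi(s, a) for every discrete action and pick the best.
q_values = np.array([theta @ phi(s, a) for a in range(n_actions)])
greedy_action = int(np.argmax(q_values))
print(q_values, greedy_action)
```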
7. Which of the following is/are true about the LSTD and LSTDQ algorithm?
(a) Both are iterative algorithms, where the parameter estimates are updated using
the gradient information of the loss function.
(b) Both LSTD and LSTDQ can reuse samples.
(c) Both LSTD and LSTDQ are linear function approximation methods.
(d) None of the above
Sol. (c)
Refer to the videos on LSTD and LSTDQ.
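For reference, a minimal LSTD sketch (assuming a hypothetical feature map phi and a batch of (s, r, s') transitions): the linear parameters are obtained by solving a linear system assembled from the batch, not by gradient steps on a loss, which is why option (a) is false while the method remains a linear function approximation scheme.

```python
import numpy as np

def lstd(transitions, phi, gamma, k):
    """Hypothetical batch LSTD: transitions is a list of (s, r, s_next) tuples,
    phi maps a state to a length-k feature vector (assumed to make A invertible)."""
    A = np.zeros((k, k))
    b = np.zeros(k)
    for s, r, s_next in transitions:
        x, x_next = phi(s), phi(s_next)
        A += np.outer(x, x - gamma * x_next)
        b += r * x
    # Parameters come from one linear solve, not from gradient updates.
    return np.linalg.solve(A, b)
```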
8. Assertion: When minimizing mean-squared-error to approximate the value of states under
a given policy π, it is important that we draw samples on-policy.
Reason: Sampling on-policy makes the training data approximately reflect the steady state
distribution of states under the policy π.
(a) Both Assertion and Reason are true, and Reason is the correct explanation for Assertion.
(b) Both Assertion and Reason are true, but Reason is not the correct explanation for Assertion.
(c) Assertion is true and Reason is false
(d) Both Assertion and Reason are false
Sol. (a)
Both the Assertion and Reason are true, and the Reason is a correct explanation for the
Assertion. Refer to the lecture on function approximation.
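As a small illustration (a hypothetical 3-state Markov chain; the transition matrix is made up), sampling states by following the policy makes their empirical frequencies approach the steady-state distribution d_π, so an MSE computed over such samples is automatically weighted by d_π:

```python
import numpy as np

# Made-up 3-state transition matrix under the policy pi.
P_pi = np.array([[0.9, 0.1, 0.0],
                 [0.0, 0.5, 0.5],
                 [0.3, 0.0, 0.7]])

rng = np.random.default_rng(0)
s, counts = 0, np.zeros(3)
for _ in range(100_000):
    counts[s] += 1
    s = rng.choice(3, p=P_pi[s])   # sample on-policy

# Empirical state frequencies approximate the steady-state distribution d_pi,
# so states that pi visits often dominate any loss computed over these samples.
print(counts / counts.sum())
```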
9. Tile coding is a method of state aggregation for gridworld problems. Consider the following
statements.
(i) The number of indicators for each state is equal to the number of tilings.
(ii) Tile coding cannot be used in continuous state spaces.
(iii) Tile coding is also a form of Coarse coding.
State which of the above statements are true.
(a) (iii) only
(b) (i), (iii)
(c) (i) only
(d) (i), (ii), (iii)
Sol. (b)
Tile coding is one form of coarse coding, and it can also be applied to continuous state spaces. The
number of tilings can be chosen as required, and the number of indicators active for each state is equal to the number of tilings.
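A minimal 1-D tile-coding sketch (the tiling sizes and offsets are arbitrary choices for illustration), showing that exactly one indicator per tiling is active even for a continuous-valued state:

```python
def active_tiles(x, n_tilings=4, tiles_per_tiling=10):
    """For a continuous x in [0, 1), return one active tile index per tiling."""
    indices = []
    for t in range(n_tilings):
        offset = t / (n_tilings * tiles_per_tiling)        # shift each tiling slightly
        idx = int((x + offset) * tiles_per_tiling) % tiles_per_tiling
        indices.append(t * tiles_per_tiling + idx)         # unique id per (tiling, tile)
    return indices

print(active_tiles(0.37))   # 4 active indicators: exactly one per tiling
```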
10. Which of the following is the correct expression for Ã in the LSTDQ method?
Note: the samples are D = {(s_i, a_i, s'_i, r_i)}, ϕ(s_i, a_i) is the representation used for the (s_i, a_i) pair,
and π is the policy being followed.
(a) (1/L) Σ_{i=1}^{L} [ϕ(s_i, a_i)(ϕ(s_i, a_i) − γϕ(s'_i, a_i))^T]
(b) (1/L) Σ_{i=1}^{L} [ϕ(s_i)(ϕ(s_i) − γϕ(s'_i))^T]
(c) (1/L) Σ_{i=1}^{L} [ϕ(s_i, a_i)(ϕ(s_i, a_i) − γϕ(s'_i))^T]
(d) (1/L) Σ_{i=1}^{L} [ϕ(s_i, a_i)(ϕ(s_i, a_i) − γϕ(s'_i, π(s'_i)))^T]
Sol. (d)
The Φ matrix contains features for each state-action pair, and a policy π is followed. So when we
are in state s'_i, the correct feature to use is ϕ(s'_i, π(s'_i)), since after reaching the state s'_i we
would take the action π(s'_i) in the future. Therefore
(1/L) Σ_{i=1}^{L} [ϕ(s_i, a_i)(ϕ(s_i, a_i) − γϕ(s'_i, π(s'_i)))^T]
is the correct answer.
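A minimal sketch of assembling Ã as in option (d), assuming a hypothetical feature map ϕ over (state, action) pairs and a policy π to evaluate:

```python
import numpy as np

def lstdq_A(samples, phi, pi, gamma, k):
    """Hypothetical LSTDQ accumulation of A_tilde from samples (s, a, s_next, r),
    using a feature map phi over (state, action) pairs of dimension k."""
    A = np.zeros((k, k))
    for s, a, s_next, r in samples:
        x = phi(s, a)
        x_next = phi(s_next, pi(s_next))   # feature of (s'_i, pi(s'_i)), as in option (d)
        A += np.outer(x, x - gamma * x_next)
    return A / len(samples)               # the 1/L averaging
```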