Assignment 2- solution
Reinforcement Learning
Prof. B. Ravindran
1. Which of the following is true of the UCB algorithm?
(a) The action with the highest Q value is chosen at every iteration.
(b) After a very large number of iterations, the confidence intervals of unselected actions will
not change much.
(c) The true expected-value of an action always lies within its estimated confidence interval.
(d) With a small probability ϵ, we select a random action to ensure adequate exploration of
the action space.
Sol. (b)
For every trial in which we don't sample an action, the confidence interval of that action grows.
However, the confidence interval grows at a rate proportional to $\sqrt{\ln(n)}$ for n trials.
Consequently, as n becomes very large, the confidence intervals of unselected actions will not
change much.
(a) is false: we select the action with the largest $Q_n(j) + \sqrt{\frac{2\ln(n)}{n_j}}$ value to
capture uncertainty in estimates.
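The slow growth described above can be checked numerically. The sketch below is illustrative only: the helper name `ucb_bonus` and the fixed count `n_j = 10` are assumptions, and the bonus is taken to be $\sqrt{2\ln(n)/n_j}$ as in the explanation of option (a).

```python
import math

def ucb_bonus(n, n_j):
    """Exploration bonus sqrt(2 ln(n) / n_j) for an arm played n_j times after n trials."""
    return math.sqrt(2.0 * math.log(n) / n_j)

n_j = 10  # hypothetical: the arm is never selected again, so n_j stays fixed
trials = (100, 1_000, 10_000, 100_000)
bonuses = [ucb_bonus(n, n_j) for n in trials]
# Change in the bonus caused by one more trial in which the arm is not selected.
per_trial_growth = [ucb_bonus(n + 1, n_j) - ucb_bonus(n, n_j) for n in trials]

for n, b, g in zip(trials, bonuses, per_trial_growth):
    print(f"n={n:>6}  bonus={b:.4f}  growth per extra trial={g:.2e}")
```

The bonus keeps increasing with n, but the increase per additional trial shrinks towards zero, which is what option (b) states.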
Sol. (d)
The uncertainty term in the UCB algorithm shrinks much faster as actions are selected. This
means that we have higher confidence that our estimates are correct with fewer samples. If
we are able to make good estimates of the true action-values within a few samples, this would
mean that we waste less time selecting sub-optimal arms.
However, fewer samples typically mean worse estimates, especially for bandit problems with
noisy reward distributions. Low uncertainty terms would increase the importance of potentially
worse Q-value estimates after a few trials, and it is possible that we would select sub-optimal
actions more often as a result.
Since either (a) or (b) could be True, depending on the problem setting, the correct answer is
(d).
3. In a 4-arm bandit problem, after executing 100 iterations of the UCB algorithm, the estimates
of the Q values are $Q_{100}(1) = 1.73$, $Q_{100}(2) = 1.83$, $Q_{100}(3) = 1.89$, $Q_{100}(4) = 1.55$, and the
number of times each arm has been sampled is $n_1 = 25$, $n_2 = 20$, $n_3 = 30$, $n_4 = 15$. Which
arm will be sampled in the next trial?
(a) Arm 1
(b) Arm 2
(c) Arm 3
(d) Arm 4
Sol. (b)
Calculate the value of $Q_{100}(j) + \sqrt{\frac{2\ln(n)}{n_j}}$ for each arm. It is highest for Arm 2.
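The calculation can be reproduced with a few lines of Python. This is a minimal sketch: the function name `ucb_scores` is illustrative, and n is taken to be 100 as stated in the question.

```python
import math

def ucb_scores(q_values, counts, n):
    """Return Q_n(j) + sqrt(2 ln(n) / n_j) for every arm j."""
    return [q + math.sqrt(2.0 * math.log(n) / c) for q, c in zip(q_values, counts)]

q_values = [1.73, 1.83, 1.89, 1.55]  # Q_100(1..4) from the question
counts = [25, 20, 30, 15]            # n_1..n_4 from the question
scores = ucb_scores(q_values, counts, n=100)

# Arms are numbered from 1 in the question, hence the +1.
next_arm = max(range(len(scores)), key=scores.__getitem__) + 1
print([round(s, 3) for s in scores], "-> Arm", next_arm)
```

The scores come out at roughly 2.337, 2.509, 2.444 and 2.334, so Arm 2 is selected even though Arm 3 has the highest Q estimate.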
(a) 15000
(b) 10000
(c) 500
(d) 20000
Sol. (a)
Number of rounds in median elimination = $\log_2(\text{number of arms}) = 8$
$\implies$ number of arms $k = 256$
Number of samples required by the naive algorithm = $\frac{2k}{\epsilon^2} \ln\left(\frac{2k}{\delta}\right)$
Substituting,
Number of samples required = 14824
5. Consider the following equalities/inequalities for the UCB algorithm, following the notation
used in the lectures. ($T_i(n)$: the number of times that action i has been played in the previous
n trials; $C_{k,T_i(k)}$: the confidence bound for arm i after k trials)
i $T_i(n) = 1 + \sum_{m=k+1}^{n} \{I_m = i\}$
ii $T_i(n) = \sum_{m=1}^{n} \{I_m = i\}$
iii $T_i(n) \le 1 + \sum_{m=k+1}^{n} \{Q_{m-1}(a^*) + C_{m-1,T_{a^*}(m-1)} \le Q_{m-1}(i) + C_{m-1,T_i(m-1)}\}$
iv $T_i(n) \le 1 + \sum_{m=k+1}^{n} \{Q_{m-1}(a^*) \le Q_{m-1}(i) + C_{m-1,T_i(m-1)}\}$
Which of these equalities/inequalities are correct?
(a) i and iii
(b) ii and iv
(c) i, ii, iii
(d) i, ii, iii, iv
Sol. (d)
The indicator variable $\{I_m = i\}$ is non-zero only when action i is played, so (ii) is correct
by definition.
(i) is equivalent to (ii). Since we play every action once at the start of the UCB algorithm,
$\sum_{m=1}^{n} \{I_m = i\} = 1 + \sum_{m=k+1}^{n} \{I_m = i\}$
(iii) is valid. After the initial single play of every arm, we select an arm only
if its "estimate + upper confidence bound", $Q_{m-1}(i) + C_{m-1,T_i(m-1)}$, is at least that of every
other arm. This includes the "estimate + upper confidence bound" of the optimal arm,
$Q_{m-1}(a^*) + C_{m-1,T_{a^*}(m-1)}$.
(iv) follows from (iii). The confidence bound is non-negative, so
$Q_{m-1}(a^*) \le Q_{m-1}(a^*) + C_{m-1,T_{a^*}(m-1)}$, and every trial counted in (iii) is also counted in (iv).
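The four relations can also be checked empirically on a simulated run. The sketch below is an illustration under stated assumptions, not the lecture's code: the arms are Bernoulli with hypothetical means, the confidence bound is assumed to be $C_{m-1,T_j(m-1)} = \sqrt{2\ln(m-1)/T_j(m-1)}$, and the names `run_ucb`, `arm_means` and `checks` are made up for the example.

```python
import math
import random

def run_ucb(arm_means, n, rng):
    """Run UCB on Bernoulli arms; return play counts, played arms, and per-trial logs."""
    k = len(arm_means)
    counts, sums = [0] * k, [0.0] * k
    history, logs = [], []
    for m in range(1, n + 1):
        if m <= k:
            arm = m - 1  # initial phase: play every arm once
        else:
            q = [sums[j] / counts[j] for j in range(k)]
            # Assumed bound: C_{m-1, T_j(m-1)} = sqrt(2 ln(m-1) / T_j(m-1)).
            ucb = [q[j] + math.sqrt(2.0 * math.log(m - 1) / counts[j]) for j in range(k)]
            arm = max(range(k), key=ucb.__getitem__)
            logs.append((q, ucb))  # values that decided trial m
        counts[arm] += 1
        sums[arm] += 1.0 if rng.random() < arm_means[arm] else 0.0
        history.append(arm)
    return counts, history, logs

arm_means = [0.3, 0.5, 0.7]  # hypothetical true means; arm index 2 is optimal
k, n, best = len(arm_means), 500, 2
counts, history, logs = run_ucb(arm_means, n, random.Random(0))

checks = []
for i in range(k):
    rhs_i = 1 + sum(a == i for a in history[k:])                 # right side of (i)
    rhs_ii = sum(a == i for a in history)                        # right side of (ii)
    rhs_iii = 1 + sum(ucb[best] <= ucb[i] for _, ucb in logs)    # right side of (iii)
    rhs_iv = 1 + sum(q[best] <= ucb[i] for q, ucb in logs)       # right side of (iv)
    checks.append(counts[i] == rhs_i == rhs_ii <= rhs_iii <= rhs_iv)
print(checks)  # [True, True, True]
```

The relations hold for any run, not only for this seed, because an arm is played only when its index is the largest and the bound is never negative.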
6. In the naive (ϵ, δ)-PAC algorithm, suppose we draw $\frac{2}{\epsilon^2} \ln\left(\frac{k}{\delta}\right)$ samples for each arm instead of
$\frac{2}{\epsilon^2} \ln\left(\frac{2k}{\delta}\right)$ samples. Using the same analysis presented in the lectures, with what probability
can we guarantee that the arm a′ returned will have a q∗(a′) value ϵ-close to the q∗ value of the
optimal arm?
(a) 1 − δ
(b) 1 − 4δ
(c) 1 − 2δ
(d) $1 - \frac{\delta}{2}$
Sol. (c)
From the lectures, assuming that a′ is an arm with q∗ (a′ ) value at least ϵ away from q∗ (a∗ ):
$P(Q(a') \ge Q(a^*)) \le \frac{2\delta}{k}$
Summing over all k arms, the probability of picking an arm a′ that does not meet our criteria
is bounded by 2δ, and the probability that the arm returned has q∗ (a′ ) value ϵ close to the q∗
value of the optimal arm is, therefore, ≥ 1 − 2δ
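The arithmetic behind the $\frac{2\delta}{k}$ bound can be sketched in code. This assumes the lecture analysis uses Hoeffding's inequality for rewards in [0, 1] with a deviation of ϵ/2 on each of the two estimates; the function name and the example values of k, ϵ and δ are hypothetical, and the sample count is left unrounded so that the bound is met with equality.

```python
import math

def naive_pac_failure_bound(k, epsilon, delta):
    """Failure bounds when each arm gets (2 / eps^2) * ln(k / delta) samples."""
    samples_per_arm = 2.0 / epsilon**2 * math.log(k / delta)
    # Hoeffding (assumed): P(estimate deviates by more than eps/2 in one direction)
    # <= exp(-2 * (eps/2)^2 * samples).
    one_sided = math.exp(-2.0 * (epsilon / 2.0) ** 2 * samples_per_arm)
    per_arm = 2.0 * one_sided  # Q(a') too high, or Q(a*) too low
    total = k * per_arm        # union bound over the k arms
    return per_arm, total

k, epsilon, delta = 10, 0.1, 0.05  # hypothetical values for illustration
per_arm, total = naive_pac_failure_bound(k, epsilon, delta)
print(per_arm, 2 * delta / k)  # per-arm bound equals 2*delta/k
print(total, 2 * delta)        # overall bound equals 2*delta
```

Under these assumptions the guarantee is therefore 1 − 2δ, matching option (c).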
7. In the median elimination method for (ϵ, δ)-PAC bounds, we claim that for every phase l,
$\Pr[A \le B + \epsilon_l] > 1 - \delta_l$. ($S_l$ is the set of arms remaining in the l-th phase.)
Consider the following statements:
(i) A is the maximum of rewards of the true best arm in $S_l$, i.e. in the l-th phase
(ii) B is the maximum of rewards of the true best arm in $S_{l+1}$, i.e. in the (l+1)-th phase
(iii) B is the minimum of rewards of the true best arm in $S_{l+1}$, i.e. in the (l+1)-th phase
(iv) A is the minimum of rewards of the true best arm in $S_l$, i.e. in the l-th phase
(v) A is the maximum of rewards of the true best arm in $S_{l+1}$, i.e. in the (l+1)-th phase
(vi) B is the maximum of rewards of the true best arm in $S_l$, i.e. in the l-th phase
Which of the statements above are correct?
(a) i and ii
(b) iii and iv
(c) iii and iv
(d) v and vi
(e) i and iii
Sol. (a)
Refer to Lemma 1 in the proof of the Median Elimination algorithm.
8. Which of the following statements is NOT true about Thompson Sampling or Posterior Sampling?
(a) After each sample is drawn, the q∗ distribution for that sampled arm is updated to be
closer to the true distribution.
(b) Thompson sampling has been shown to generally give better regret bounds than UCB.
(c) In Thompson sampling, we do not need to eliminate arms each round to get good sample
complexity.
(d) The algorithm requires that we use Gaussian priors to represent distributions over q∗
values for each arm.
Sol. (d)
(d) is NOT true. We are not constrained to Gaussian priors. We can assume a prior distribution
of any type over the q∗ values for each arm.
9. Assertion: The confidence bound of each arm in the UCB algorithm cannot increase with
iterations.
Reason: The $n_j$ term in the denominator ensures that the confidence bound remains the same
for unselected arms and decreases for the selected arm.
(a) Assertion and Reason are both true and Reason is a correct explanation of Assertion
(b) Assertion and Reason are both true and Reason is not a correct explanation of Assertion
(c) Assertion is true and Reason is false
(d) Both Assertion and Reason are false
Sol. (d)
The confidence bound of an unselected arm actually increases in the UCB algorithm, because its
$n_j$ value remains the same while the $\ln(n)$ term in the numerator increases.
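A single step is enough to see both effects. The sketch below assumes the bound $\sqrt{2\ln(n)/n_j}$; the two arms and their counts are hypothetical.

```python
import math

def ucb_bonus(n, n_j):
    """Confidence bound sqrt(2 ln(n) / n_j)."""
    return math.sqrt(2.0 * math.log(n) / n_j)

n = 100
counts = {"selected": 40, "unselected": 60}  # hypothetical play counts after n trials
before = {arm: ucb_bonus(n, c) for arm, c in counts.items()}

# One more trial: only the selected arm's count goes up, but n goes up for both arms.
counts_after = {"selected": counts["selected"] + 1, "unselected": counts["unselected"]}
after = {arm: ucb_bonus(n + 1, c) for arm, c in counts_after.items()}

for arm in counts:
    print(arm, round(before[arm], 5), "->", round(after[arm], 5))
```

The bound of the selected arm shrinks while the bound of the unselected arm grows, so both the Assertion and the Reason fail.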
10. Which of the following is true about the Median Elimination algorithm?
(a) It is a regret minimizing algorithm.
(b) The probability of the $\epsilon_l$-optimal arms of round l being eliminated is less than $\delta_l$ for the
round.
(c) It is guaranteed to provide an ϵ-optimal arm at the end.
(d) Replacing ϵ with $\frac{\epsilon}{2}$ doubles the sample complexity.
Sol. (b)
Look at the derivation for Median Elimination.
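For reference, here is a minimal sketch of Median Elimination on simulated Bernoulli arms. It assumes the commonly stated schedule (start with $\epsilon_1 = \epsilon/4$, $\delta_1 = \delta/2$; sample every surviving arm $\frac{4}{\epsilon_l^2}\ln\frac{3}{\delta_l}$ times; then set $\epsilon_{l+1} = \frac{3}{4}\epsilon_l$, $\delta_{l+1} = \delta_l/2$), which may differ in detail from the lecture's version. The function name and the arm means are hypothetical.

```python
import math
import random

def median_elimination(arm_means, epsilon, delta, rng):
    """Return (index of the surviving arm, number of rounds) for Bernoulli arms."""
    surviving = list(range(len(arm_means)))
    eps_l, delta_l = epsilon / 4.0, delta / 2.0
    rounds = 0
    while len(surviving) > 1:
        # Samples per surviving arm in this round (assumed schedule, see above).
        n_samples = math.ceil(4.0 / eps_l**2 * math.log(3.0 / delta_l))
        estimates = {
            a: sum(rng.random() < arm_means[a] for _ in range(n_samples)) / n_samples
            for a in surviving
        }
        # Keep the half of the arms with the highest empirical means.
        surviving.sort(key=lambda a: estimates[a], reverse=True)
        surviving = surviving[: math.ceil(len(surviving) / 2)]
        eps_l, delta_l = 0.75 * eps_l, delta_l / 2.0
        rounds += 1
    return surviving[0], rounds

arm_means = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.9]  # hypothetical, k = 8
arm, rounds = median_elimination(arm_means, epsilon=0.5, delta=0.1, rng=random.Random(0))
print("returned arm:", arm, "rounds:", rounds)
```

With k = 8 arms the loop runs for $\log_2 8 = 3$ rounds. The returned arm is ϵ-optimal only with probability at least 1 − δ, which is why option (c) does not hold as a guarantee.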