Assignment 2

Reinforcement Learning
Prof. B. Ravindran
1. Which of the following is true of the UCB algorithm?

(a) The action with the highest Q value is chosen at every iteration.
(b) After a very large number of iterations, the confidence intervals of unselected actions will
not change much.
(c) The true expected value of an action always lies within its estimated confidence interval.
(d) With a small probability ϵ, we select a random action to ensure adequate exploration of
the action space.
Sol. (b)
For every trial in which we do not sample an action, the confidence interval of that action grows. However, the confidence interval grows at a rate proportional to $\sqrt{\ln(n)}$ over $n$ trials. Consequently, as $n$ becomes very large, the confidence intervals of unselected actions will not change much.

(a) is false: we select the action with the largest $Q_n(j) + \sqrt{\frac{2\ln(n)}{n_j}}$ value, to capture the uncertainty in our estimates.

(c) is false. It is possible to grossly overestimate or grossly underestimate action-values, so that the true expected reward does not lie within an estimated confidence interval.

(d) is false. UCB is a deterministic algorithm.
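As a reference, here is a minimal sketch of the selection rule above; the function name and signature are illustrative, not from the lectures:

```python
import numpy as np

def ucb_select(q_estimates, counts, n):
    """Return the index of the arm maximising Q_n(j) + sqrt(2*ln(n)/n_j).

    q_estimates : sample-mean estimates Q_n(j), one per arm
    counts      : number of pulls n_j per arm (each > 0, i.e. every arm
                  has already been played once, as UCB requires)
    n           : total number of pulls so far
    """
    q = np.asarray(q_estimates, dtype=float)
    nj = np.asarray(counts, dtype=float)
    bonus = np.sqrt(2.0 * np.log(n) / nj)   # exploration term
    return int(np.argmax(q + bonus))        # ties broken by lowest index
```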


2. In UCB, the term $\sqrt{\frac{2\ln(n)}{n_j}}$ is added to each arm's Q value and the arm with the highest value of this sum is chosen. Which one of the following would definitely happen to the frequency of picking sub-optimal arms when adding $\sqrt{\frac{2\ln(n)}{n_j^2}}$ instead of $\sqrt{\frac{2\ln(n)}{n_j}}$?

(a) Sub-optimal arms would be chosen more frequently.


(b) Sub-optimal arms would be chosen less frequently.
(c) Makes no change to the frequency of picking sub-optimal arms.
(d) Sub-optimal arms could be chosen less or more frequently, depending on the samples.

Sol. (d)
With $n_j^2$ in the denominator, the uncertainty term shrinks much faster as an action is selected. This means we place higher confidence in our estimates after fewer samples. If we are able to make good estimates of the true action-values within a few samples, we waste less time selecting sub-optimal arms.
However, fewer samples typically mean worse estimates, especially for bandit problems with noisy reward distributions. Small uncertainty terms would increase the weight given to potentially poor Q-value estimates after only a few trials, and it is possible that we would select sub-optimal actions more often as a result.

Since either (a) or (b) could be true, depending on the problem setting, the correct answer is (d).
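To see how much faster the modified bonus shrinks, here is a small illustrative comparison of the two exploration terms for a single arm; the values of n and n_j are arbitrary, not from the question:

```python
import numpy as np

n = 1000                              # total number of trials (illustrative)
n_j = np.array([1, 5, 10, 50, 100])   # times the arm has been pulled

original = np.sqrt(2 * np.log(n) / n_j)       # sqrt(2 ln(n) / n_j)
modified = np.sqrt(2 * np.log(n) / n_j**2)    # sqrt(2 ln(n) / n_j^2)

for k, a, b in zip(n_j, original, modified):
    print(f"n_j={int(k):4d}  original={a:.3f}  modified={b:.3f}")
```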
3. In a 4-arm bandit problem, after executing 100 iterations of the UCB algorithm, the estimates of the Q values are $Q_{100}(1) = 1.73$, $Q_{100}(2) = 1.83$, $Q_{100}(3) = 1.89$, $Q_{100}(4) = 1.55$, and the number of times each arm has been sampled is $n_1 = 25$, $n_2 = 20$, $n_3 = 30$, $n_4 = 15$. Which arm will be sampled in the next trial?

(a) Arm 1
(b) Arm 2
(c) Arm 3
(d) Arm 4

Sol. (b)
Calculate the value of $Q_{100}(j) + \sqrt{\frac{2\ln(n)}{n_j}}$ for each arm. It is highest for Arm 2.
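The calculation can be verified with a short script (a sketch, using the numbers from the question):

```python
import numpy as np

n = 100
q = np.array([1.73, 1.83, 1.89, 1.55])   # Q_100(j)
counts = np.array([25, 20, 30, 15])       # n_j

ucb = q + np.sqrt(2 * np.log(n) / counts)
print(ucb)                   # approx. [2.337, 2.509, 2.444, 2.334]
print(np.argmax(ucb) + 1)    # Arm 2
```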

4. We need 8 rounds of median elimination to get an $(\epsilon, \delta)$-PAC arm. Approximately how many samples would have been required using the naive $(\epsilon, \delta)$-PAC algorithm given $(\epsilon, \delta) = (1/2, 1/e)$? (Choose the value closest to the correct answer)

(a) 15000
(b) 10000
(c) 500
(d) 20000

Sol. (a)
No. of rounds in median elimination $= \log_2(\text{no. of arms}) = 8 \implies$ no. of arms $= 256$.
No. of samples required by the naive algorithm $= \frac{2k}{\epsilon^2}\ln\left(\frac{2k}{\delta}\right)$.
Substituting, the number of samples required $\approx 14824$.
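A quick sanity check of this arithmetic (a sketch; variable names are just for the script):

```python
import math

k = 2 ** 8          # 8 rounds of median elimination => 256 arms
eps, delta = 0.5, 1 / math.e

samples = (2 * k / eps ** 2) * math.log(2 * k / delta)
print(round(samples))   # ~14824, so the closest option is 15000
```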
5. Consider the following equalities/inequalities for the UCB algorithm, following the notation used in the lectures. ($T_i(n)$: the number of times that action $i$ has been played in the previous $n$ trials; $C_{k, T_i(k)}$ represents the confidence bound for arm $i$ after $k$ trials.)
(i) $T_i(n) = 1 + \sum_{m=k+1}^{n} \mathbb{1}\{I_m = i\}$
(ii) $T_i(n) = \sum_{m=1}^{n} \mathbb{1}\{I_m = i\}$
(iii) $T_i(n) \leq 1 + \sum_{m=k+1}^{n} \mathbb{1}\{Q_{m-1}(a^*) + C_{m-1, T_{a^*}(m-1)} \leq Q_{m-1}(i) + C_{m-1, T_i(m-1)}\}$
(iv) $T_i(n) \leq 1 + \sum_{m=k+1}^{n} \mathbb{1}\{Q_{m-1}(a^*) \leq Q_{m-1}(i) + C_{m-1, T_i(m-1)}\}$
Which of these equalities/inequalities are correct?
(a) i and iii
(b) ii and iv
(c) i, ii, iii

(d) i, ii, iii, iv

Sol. (d)
The indicator variable $\mathbb{1}\{I_m = i\}$ is non-zero only when action $i$ is played, so (ii) is correct by definition.

(i) is equivalent to (ii). Since we play every action once at the start of the UCB algorithm,
$$\sum_{m=1}^{n} \mathbb{1}\{I_m = i\} = 1 + \sum_{m=k+1}^{n} \mathbb{1}\{I_m = i\}$$

(iii) is valid. After the initial single play of every arm, we will only select to play an arm if its "estimate + upper confidence bound", $Q_{m-1}(i) + C_{m-1, T_i(m-1)}$, is greater than that of the other arms. This includes the case where $Q_{m-1}(i) + C_{m-1, T_i(m-1)}$ is greater than the "estimate + upper confidence bound" of the optimal arm, $Q_{m-1}(a^*) + C_{m-1, T_{a^*}(m-1)}$.

(iv) further relaxes the bound. It is easy to see that if
$$Q_{m-1}(a^*) + C_{m-1, T_{a^*}(m-1)} \leq Q_{m-1}(i) + C_{m-1, T_i(m-1)}$$
then
$$Q_{m-1}(a^*) \leq Q_{m-1}(i) + C_{m-1, T_i(m-1)}$$

6. In the naive $(\epsilon, \delta)$-PAC algorithm, suppose we draw $\frac{2}{\epsilon^2}\ln\left(\frac{k}{\delta}\right)$ samples for each arm instead of $\frac{2}{\epsilon^2}\ln\left(\frac{2k}{\delta}\right)$ samples. Using the same analysis presented in the lectures, with what probability can we guarantee that the arm $a'$ returned will have $q_*(a')$ value $\epsilon$-close to the $q_*$ value of the optimal arm?

(a) $1 - \delta$
(b) $1 - 4\delta$
(c) $1 - 2\delta$
(d) $1 - \frac{\delta}{2}$

Sol. (c)
From the lectures, assuming that $a'$ is an arm with $q_*(a')$ value at least $\epsilon$ away from $q_*(a^*)$:
$$P(Q(a') \geq Q(a^*)) \leq P(Q(a') \geq q_*(a') + \epsilon/2) + P(Q(a^*) < q_*(a^*) - \epsilon/2)$$
Using the Chernoff-Hoeffding bounds introduced in the lectures:
$$P(Q(a') \geq Q(a^*)) \leq 2e^{-\epsilon^2 l / 2}$$
Substituting the new number of samples drawn, $l = \frac{2}{\epsilon^2}\ln\left(\frac{k}{\delta}\right)$, we get:
$$P(Q(a') \geq Q(a^*)) \leq \frac{2\delta}{k}$$
Summing over all $k$ arms, the probability of picking an arm $a'$ that does not meet our criteria is bounded by $2\delta$, and the probability that the arm returned has $q_*(a')$ value $\epsilon$-close to the $q_*$ value of the optimal arm is therefore $\geq 1 - 2\delta$.
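For completeness, the substitution step works out as follows:
$$2e^{-\epsilon^2 l / 2} = 2\exp\!\left(-\frac{\epsilon^2}{2}\cdot\frac{2}{\epsilon^2}\ln\frac{k}{\delta}\right) = 2e^{-\ln(k/\delta)} = \frac{2\delta}{k}$$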
7. In the Median Elimination method for $(\epsilon, \delta)$-PAC bounds, we claim that for every phase $l$, $\Pr[A \leq B + \epsilon_l] > 1 - \delta_l$. ($S_l$ is the set of arms remaining in the $l$-th phase.)

Consider the following statements:

(i) A is the maximum of the rewards of the true best arm in $S_l$, i.e. in the $l$-th phase
(ii) B is the maximum of the rewards of the true best arm in $S_{l+1}$, i.e. in the $(l+1)$-th phase
(iii) B is the minimum of the rewards of the true best arm in $S_{l+1}$, i.e. in the $(l+1)$-th phase
(iv) A is the minimum of the rewards of the true best arm in $S_l$, i.e. in the $l$-th phase
(v) A is the maximum of the rewards of the true best arm in $S_{l+1}$, i.e. in the $(l+1)$-th phase
(vi) B is the maximum of the rewards of the true best arm in $S_l$, i.e. in the $l$-th phase
Which of the statements above are correct?
(a) i and ii
(b) iii and iv
(c) iii and iv
(d) v and vi
(e) i and iii

Sol. (a)
Refer to Lemma 1 in the proof of the Median Elimination algorithm: with probability at least $1 - \delta_l$, the reward of the best arm in $S_{l+1}$ is within $\epsilon_l$ of the reward of the best arm in $S_l$, which corresponds to statements (i) and (ii).

8. Which of the following statements is NOT true about Thompson Sampling or Posterior Sampling?

(a) After each sample is drawn, the $q_*$ distribution for that sampled arm is updated to be closer to the true distribution.
(b) Thompson Sampling has been shown to generally give better regret bounds than UCB.
(c) In Thompson Sampling, we do not need to eliminate arms each round to get good sample complexity.
(d) The algorithm requires that we use Gaussian priors to represent distributions over $q_*$ values for each arm.
Sol. (d)
(d) is NOT true. We are not constrained to Gaussian priors. We can assume a prior distribution of any type over the $q_*$ values for each arm.
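To illustrate why (d) is false, here is a minimal Thompson Sampling sketch that uses Beta (non-Gaussian) priors over Bernoulli arms; the arm means, prior parameters, and horizon are illustrative assumptions, not taken from the lectures:

```python
import numpy as np

rng = np.random.default_rng(0)
true_means = [0.3, 0.5, 0.7]             # illustrative Bernoulli arms
alpha = np.ones(len(true_means))          # Beta(1, 1) prior over each q*
beta = np.ones(len(true_means))

for t in range(1000):
    samples = rng.beta(alpha, beta)       # one draw from each posterior
    arm = int(np.argmax(samples))         # play the arm with the largest draw
    reward = rng.random() < true_means[arm]
    alpha[arm] += reward                  # posterior update moves the sampled
    beta[arm] += 1 - reward               # arm's belief toward the truth

print(alpha / (alpha + beta))             # posterior mean per arm
```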
9. Assertion: The confidence bound of each arm in the UCB algorithm cannot increase with iterations.
Reason: The $n_j$ term in the denominator ensures that the confidence bound remains the same for unselected arms and decreases for the selected arm.

(a) Assertion and Reason are both true and Reason is a correct explanation of Assertion
(b) Assertion and Reason are both true and Reason is not a correct explanation of Assertion
(c) Assertion is true and Reason is false
(d) Both Assertion and Reason are false

Sol. (d)
The confidence bound of an unselected arm actually increases in the UCB algorithm, since its $n_j$ value remains the same while the $\ln(n)$ term in the numerator increases.
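A small numeric illustration of why the Assertion fails: for an arm that is never selected, $n_j$ stays fixed while $n$ grows, so its bound grows (the values below are illustrative):

```python
import numpy as np

n_j = 10                                  # arm pulled 10 times, then never again
for n in [100, 1000, 10000, 100000]:      # total trials keeps growing
    bound = np.sqrt(2 * np.log(n) / n_j)
    print(n, round(bound, 3))             # the exploration bonus keeps increasing
```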
10. Which of the following is true about the Median Elimination algorithm?
(a) It is a regret-minimizing algorithm.
(b) The probability of the $\epsilon_l$-optimal arms of round $l$ being eliminated is less than $\delta_l$ for the round.
(c) It is guaranteed to provide an $\epsilon$-optimal arm at the end.
(d) Replacing $\epsilon$ with $\epsilon/2$ doubles the sample complexity.
Sol. (b)
Look at the derivation for Median Elimination.
