Assignment 2- solution

The document discusses various aspects of the UCB (Upper Confidence Bound) algorithm and its application in reinforcement learning, including true statements about the algorithm, comparisons with other methods like Thompson Sampling, and the implications of different sampling strategies. It also presents problems and solutions related to the UCB algorithm, including calculations for selecting arms in a bandit problem and the performance of the naive (ϵ, δ)-PAC algorithm. Additionally, it covers the Median Elimination method and its properties regarding optimal arm selection.

Uploaded by

MUKUND TIWARI

Assignment 2

Reinforcement Learning
Prof. B. Ravindran
1. Which of the following is true of the UCB algorithm?

(a) The action with the highest Q value is chosen at every iteration.
(b) After a very large number of iterations, the confidence intervals of unselected actions will
not change much.
(c) The true expected value of an action always lies within its estimated confidence interval.
(d) With a small probability ϵ, we select a random action to ensure adequate exploration of
the action space.
Sol. (b)
For every trial in which we do not sample an action, the confidence interval of that action grows.
However, the confidence interval grows at a rate proportional to $\sqrt{\ln(n)}$ for n trials. Consequently, as n becomes very large, the confidence intervals of unselected actions will not change
much.

(a) is false. We select the action with the largest $Q_n(j) + \sqrt{2\ln(n)/n_j}$ value to capture uncertainty in the estimates.

(c) is false. It is possible to grossly overestimate or grossly underestimate action-values so
that the true expected reward does not lie within an estimated confidence interval.

(d) is false. UCB is a deterministic algorithm.
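
To make the growth rate in solution (b) concrete, the following is a small Python sketch, not part of the original assignment. It assumes the exploration bonus $\sqrt{2\ln(n)/n_j}$ and holds $n_j$ fixed, as for an arm that is never selected. The helper name `ucb_bonus` and the value `n_j = 10` are illustrative assumptions.

```python
import math

def ucb_bonus(n, n_j):
    """Exploration bonus sqrt(2 ln(n) / n_j) for an arm pulled n_j times after n trials."""
    return math.sqrt(2.0 * math.log(n) / n_j)

# Hold n_j fixed (the arm is never selected) and let the total trial count grow.
n_j = 10  # illustrative value, not from the assignment
trial_counts = [10**2, 10**4, 10**6, 10**8]
bonuses = [ucb_bonus(n, n_j) for n in trial_counts]

for n, b in zip(trial_counts, bonuses):
    print(f"n = {n:>9d}  bonus = {b:.3f}")
```

The bonus keeps increasing, but going from a hundred trials to a hundred million only doubles it, since ln(n) grows by a factor of four over that range.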

2. In UCB, the term $\sqrt{2\ln(n)/n_j}$ is added to each arm's Q value and the arm with the highest value
of this sum is chosen. Which one of the following would definitely happen to the frequency of
picking sub-optimal arms when adding $\sqrt{2\ln(n)/n_j^2}$ instead of $\sqrt{2\ln(n)/n_j}$?

(a) Sub-optimal arms would be chosen more frequently.

(b) Sub-optimal arms would be chosen less frequently.
(c) Makes no change to the frequency of picking sub-optimal arms.
(d) Sub-optimal arms could be chosen less or more frequently, depending on the samples.

Sol. (d)
With the modified term $\sqrt{2\ln(n)/n_j^2}$, the uncertainty term shrinks much faster as actions are selected. This
means that we place higher confidence in our estimates after fewer samples. If
we are able to make good estimates of the true action-values within a few samples, we
waste less time selecting sub-optimal arms.
However, fewer samples typically mean worse estimates, especially for bandit problems with
noisy reward distributions. Small uncertainty terms increase the weight of potentially
poor Q-value estimates after only a few trials, and it is possible that we would select sub-optimal
actions more often as a result.
Since either (a) or (b) could be true, depending on the problem setting, the correct answer is
(d).
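
As a rough illustration of how much faster the modified term shrinks, the sketch below (an addition, not from the original solution) tabulates both bonuses for a few pull counts. The function names and the choice `n = 1000` are assumptions made for the example.

```python
import math

def bonus_standard(n, n_j):
    """Standard UCB term: sqrt(2 ln(n) / n_j)."""
    return math.sqrt(2.0 * math.log(n) / n_j)

def bonus_squared(n, n_j):
    """Modified term from the question: sqrt(2 ln(n) / n_j^2)."""
    return math.sqrt(2.0 * math.log(n) / n_j**2)

n = 1000  # illustrative total number of trials
pull_counts = [1, 5, 20, 100]
rows = [(n_j, bonus_standard(n, n_j), bonus_squared(n, n_j)) for n_j in pull_counts]

for n_j, std, sq in rows:
    print(f"n_j = {n_j:>3d}  standard = {std:.3f}  modified = {sq:.3f}")
```

The two terms coincide at a single pull and differ by a factor of $\sqrt{n_j}$ afterwards, so the modified rule commits to its estimates sooner. Whether that helps or hurts depends on how noisy those early estimates are, which is why (d) is the answer.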
3. In a 4-arm bandit problem, after executing 100 iterations of the UCB algorithm, the estimates
of the Q values are $Q_{100}(1) = 1.73$, $Q_{100}(2) = 1.83$, $Q_{100}(3) = 1.89$, $Q_{100}(4) = 1.55$, and the
number of times each arm has been sampled is $n_1 = 25$, $n_2 = 20$, $n_3 = 30$, $n_4 = 15$. Which
arm will be sampled in the next trial?

(a) Arm 1
(b) Arm 2
(c) Arm 3
(d) Arm 4

Sol. (b)
Calculate the value of $Q_{100}(j) + \sqrt{2\ln(n)/n_j}$ for each arm. It is highest for Arm 2.
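
The calculation can be checked with a few lines of Python. This is a sketch added for illustration; it takes n = 100 as stated in the question (the listed counts themselves sum to 90) and assumes the bonus $\sqrt{2\ln(n)/n_j}$. The variable names are not from the assignment.

```python
import math

# Estimates and pull counts from the question, keyed by arm number.
q = {1: 1.73, 2: 1.83, 3: 1.89, 4: 1.55}
counts = {1: 25, 2: 20, 3: 30, 4: 15}
n = 100  # number of iterations stated in the question

scores = {arm: q[arm] + math.sqrt(2.0 * math.log(n) / counts[arm]) for arm in q}
best_arm = max(scores, key=scores.get)

for arm, score in scores.items():
    print(f"Arm {arm}: {score:.3f}")
print("Next arm:", best_arm)
```

Arm 3 has the highest estimate, but Arm 2 has been sampled less often, and its larger bonus is enough to put it on top.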

4. We need 8 rounds of median-elimination to get an (ϵ, δ)-PAC arm. Approximately how
many samples would have been required using the naive (ϵ, δ)-PAC algorithm given (ϵ, δ) =
(1/2, 1/e)? (Choose the value closest to the correct answer)

(a) 15000
(b) 10000
(c) 500
(d) 20000

Sol. (a)
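No. of rounds in median elimination = $\log_2$(no. of arms) = 8
$\implies$ no. of arms $k = 2^8 = 256$
No. of samples required by the naive algorithm = $\frac{2k}{\epsilon^2}\ln\left(\frac{2k}{\delta}\right)$
Substituting $k = 256$, $\epsilon = 1/2$, $\delta = 1/e$:
No. of samples required = $2048\,(\ln 512 + 1) \approx 14824$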
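
The substitution can be reproduced numerically. The sketch below is an added check, assuming the naive algorithm draws $\frac{2}{\epsilon^2}\ln(2k/\delta)$ samples from each of the k arms; the variable names are illustrative.

```python
import math

rounds = 8
k = 2 ** rounds          # median elimination halves the arm set each round
eps = 0.5
delta = 1.0 / math.e

samples_per_arm = (2.0 / eps**2) * math.log(2 * k / delta)
total_samples = k * samples_per_arm

print("arms:", k)
print("total samples:", round(total_samples))
```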
5. Consider the following equalities/inequalities for the UCB algorithm, following the notation
used in the lectures. ($T_i(n)$: the number of times that action i has been played in the previous
n trials; $C_{k,T_i(k)}$ represents the confidence bound for arm i after k trials.)
(i) $T_i(n) = 1 + \sum_{m=k+1}^{n} \{I_m = i\}$
(ii) $T_i(n) = \sum_{m=1}^{n} \{I_m = i\}$
(iii) $T_i(n) \le 1 + \sum_{m=k+1}^{n} \{Q_{m-1}(a^*) + C_{m-1,T_{a^*}(m-1)} \le Q_{m-1}(i) + C_{m-1,T_i(m-1)}\}$
(iv) $T_i(n) \le 1 + \sum_{m=k+1}^{n} \{Q_{m-1}(a^*) \le Q_{m-1}(i) + C_{m-1,T_i(m-1)}\}$
Which of these equalities/inequalities are correct?
(a) i and iii
(b) ii and iv
(c) i, ii, iii
(d) i, ii, iii, iv

Sol. (d)
The indicator variable $\{I_m = i\}$ is non-zero only when action i is played, so (ii) is correct
by definition.

(i) is equivalent to (ii). Since we play every action once at the start of the UCB algorithm (the first k trials, where k is the number of arms),
$\sum_{m=1}^{n} \{I_m = i\} = 1 + \sum_{m=k+1}^{n} \{I_m = i\}$

(iii) is valid. After the initial single play of every arm, we select an arm only
if its "estimate + upper confidence bound", $Q_{m-1}(i) + C_{m-1,T_i(m-1)}$, is at least as large as that of
every other arm. In particular, it must be at least as large as the
"estimate + upper confidence bound" of the optimal arm, $Q_{m-1}(a^*) + C_{m-1,T_{a^*}(m-1)}$.

(iv) further relaxes the bound. Since the confidence term is non-negative, it is easy to see that if

$Q_{m-1}(a^*) + C_{m-1,T_{a^*}(m-1)} \le Q_{m-1}(i) + C_{m-1,T_i(m-1)}$
then
$Q_{m-1}(a^*) \le Q_{m-1}(i) + C_{m-1,T_i(m-1)}$
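
The four relations can also be checked empirically. The following is a minimal simulation sketch, not taken from the lectures: it assumes Bernoulli arms, the confidence term $C_{m-1,T_i(m-1)} = \sqrt{2\ln(m-1)/T_i(m-1)}$, and ties broken in favour of the lowest arm index. The function `run_ucb` and the arm means are hypothetical choices for the example.

```python
import math
import random

def run_ucb(means, n, seed=0):
    """Run UCB on Bernoulli arms and track the right-hand sides of (iii) and (iv)."""
    rng = random.Random(seed)
    k = len(means)
    best = max(range(k), key=lambda a: means[a])  # index of a*
    q = [0.0] * k        # running estimates Q(i)
    pulls = [0] * k      # T_i
    plays = []           # I_1, ..., I_n as 0-based arm indices
    rhs_iii = [1] * k
    rhs_iv = [1] * k
    for m in range(1, n + 1):
        if m <= k:
            arm = m - 1  # initial phase: play every arm once
        else:
            # Indices built from the statistics available after m-1 trials.
            index = [q[a] + math.sqrt(2.0 * math.log(m - 1) / pulls[a]) for a in range(k)]
            for i in range(k):
                rhs_iii[i] += index[best] <= index[i]
                rhs_iv[i] += q[best] <= index[i]
            arm = max(range(k), key=lambda a: index[a])
        reward = 1.0 if rng.random() < means[arm] else 0.0
        pulls[arm] += 1
        q[arm] += (reward - q[arm]) / pulls[arm]
        plays.append(arm)
    return plays, pulls, rhs_iii, rhs_iv

means = [0.3, 0.5, 0.7]  # illustrative arm means
k = len(means)
plays, pulls, rhs_iii, rhs_iv = run_ucb(means, n=500)
for i in range(k):
    # Columns: arm, T_i(n), right-hand sides of (i), (iii), (iv)
    print(i, pulls[i], 1 + plays[k:].count(i), rhs_iii[i], rhs_iv[i])
```

For every arm the second and third columns agree, and the last two columns are successively looser upper bounds on the pull count, matching the chain (i) = (ii) ≤ (iii) ≤ (iv).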

6. In the naive (ϵ, δ)-PAC algorithm, suppose we draw $\frac{2}{\epsilon^2}\ln\left(\frac{k}{\delta}\right)$ samples for each arm instead of
$\frac{2}{\epsilon^2}\ln\left(\frac{2k}{\delta}\right)$ samples. Using the same analysis presented in the lectures, with what probability
can we guarantee that the arm a′ returned will have a q∗(a′) value ϵ-close to the q∗ value of the
optimal arm?

(a) 1 − δ
(b) 1 − 4δ
(c) 1 − 2δ
(d) 1 − δ/2

Sol. (c)
From the lectures, assuming that a′ is an arm with q∗(a′) value at least ϵ away from q∗(a∗):

$P(Q(a') \ge Q(a^*)) \le P(Q(a') \ge q_*(a') + \epsilon/2) + P(Q(a^*) < q_*(a^*) - \epsilon/2)$

Using the Chernoff-Hoeffding bounds introduced in the lectures, with $l$ samples per arm:

$P(Q(a') \ge Q(a^*)) \le 2e^{-\epsilon^2 l / 2}$

Substituting the new number of samples drawn, $l = \frac{2}{\epsilon^2}\ln\left(\frac{k}{\delta}\right)$, we get:

$P(Q(a') \ge Q(a^*)) \le \frac{2\delta}{k}$

Summing over all k arms, the probability of picking an arm a′ that does not meet our criteria
is bounded by 2δ, and the probability that the arm returned has a q∗(a′) value ϵ-close to the q∗
value of the optimal arm is, therefore, ≥ 1 − 2δ.
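
A quick numeric check of this substitution is sketched below. It is an addition for illustration only: `failure_bound` is a hypothetical helper, and the values of ϵ, δ and k are arbitrary. It evaluates the per-arm bound $2e^{-\epsilon^2 l/2}$ and its sum over k arms for both sample sizes.

```python
import math

def failure_bound(eps, k, log_arg):
    """Return (l, per-arm bound, bound summed over k arms) for l = (2/eps^2) ln(log_arg)."""
    l = (2.0 / eps**2) * math.log(log_arg)
    per_arm = 2.0 * math.exp(-(eps**2) * l / 2.0)
    return l, per_arm, k * per_arm

eps, delta, k = 0.1, 0.05, 10  # illustrative values, not from the assignment

_, _, total_reduced = failure_bound(eps, k, k / delta)       # ln(k / delta) samples
_, _, total_original = failure_bound(eps, k, 2 * k / delta)  # ln(2k / delta) samples

print("failure bound with ln(k/delta): ", total_reduced)    # equals 2 * delta
print("failure bound with ln(2k/delta):", total_original)   # equals delta
```

Dropping the factor of 2 inside the logarithm doubles the failure bound, which is why the guarantee weakens from 1 − δ to 1 − 2δ.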
7. In the median elimination method for (ϵ, δ)-PAC bounds, we claim that for every phase $l$, $\Pr[A \le B + \epsilon_l] > 1 - \delta_l$. ($S_l$ is the set of arms remaining in the $l$-th phase.)

Consider the following statements:

(i) A is the maximum of rewards of the true best arm in $S_l$, i.e. in the $l$-th phase
(ii) B is the maximum of rewards of the true best arm in $S_{l+1}$, i.e. in the $(l+1)$-th phase
(iii) B is the minimum of rewards of the true best arm in $S_{l+1}$, i.e. in the $(l+1)$-th phase
(iv) A is the minimum of rewards of the true best arm in $S_l$, i.e. in the $l$-th phase
(v) A is the maximum of rewards of the true best arm in $S_{l+1}$, i.e. in the $(l+1)$-th phase
(vi) B is the maximum of rewards of the true best arm in $S_l$, i.e. in the $l$-th phase
Which of the statements above are correct?
(a) i and ii
(b) iii and iv
(c) iii and iv
(d) v and vi
(e) i and iii

Sol. (a)
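Refer to Lemma 1 in the proof for the Median Elimination algorithm. It bounds the expected reward of the best arm in $S_l$ (A) by that of the best arm surviving into $S_{l+1}$ (B) plus $\epsilon_l$.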

8. Which of the following statements is NOT true about Thompson Sampling or Posterior Sampling?

(a) After each sample is drawn, the q∗ distribution for that sampled arm is updated to be
closer to the true distribution.
(b) Thompson sampling has been shown to generally give better regret bounds than UCB.
(c) In Thompson sampling, we do not need to eliminate arms each round to get good sample
complexity.
(d) The algorithm requires that we use Gaussian priors to represent distributions over q∗
values for each arm.
Sol. (d)
(d) is NOT true. We are not constrained to Gaussian priors. We can assume a prior distribution
of any type over the q∗ values for each arm.
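
As an example of a non-Gaussian choice, the sketch below runs Thompson sampling with Beta priors on Bernoulli arms. It is an illustrative addition rather than the version from the lectures: the function name `thompson_bernoulli`, the uniform Beta(1, 1) prior, and the arm means are all assumptions.

```python
import random

def thompson_bernoulli(true_means, steps, seed=0):
    """Thompson sampling for Bernoulli arms with a Beta(1, 1) prior on each arm."""
    rng = random.Random(seed)
    k = len(true_means)
    alpha = [1.0] * k  # 1 + number of observed successes
    beta = [1.0] * k   # 1 + number of observed failures
    pulls = [0] * k
    for _ in range(steps):
        # Draw one plausible mean per arm from its posterior, then act greedily on the draws.
        samples = [rng.betavariate(alpha[a], beta[a]) for a in range(k)]
        arm = max(range(k), key=lambda a: samples[a])
        reward = 1 if rng.random() < true_means[arm] else 0
        # Conjugate update: only the sampled arm's posterior changes.
        alpha[arm] += reward
        beta[arm] += 1 - reward
        pulls[arm] += 1
    return pulls, alpha, beta

true_means = [0.2, 0.5, 0.8]  # illustrative arm means
pulls, alpha, beta = thompson_bernoulli(true_means, steps=2000)
print("pulls per arm:", pulls)
```

No arm is ever eliminated here. Arms with poor posteriors are simply sampled less and less often, which ties in with option (c).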
9. Assertion: The confidence bound of each arm in the UCB algorithm cannot increase with
iterations.
Reason: The $n_j$ term in the denominator ensures that the confidence bound remains the same
for unselected arms and decreases for the selected arm.

(a) Assertion and Reason are both true and Reason is a correct explanation of Assertion
(b) Assertion and Reason are both true and Reason is not a correct explanation of Assertion
(c) Assertion is true and Reason is false
(d) Both Assertion and Reason are false

Sol. (d)
The confidence bound of an unselected arm actually increases in the UCB algorithm, because its
$n_j$ value remains the same while the ln(n) term in the numerator increases.
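
A single update step shows both effects. This is a small added sketch assuming the bound $\sqrt{2\ln(n)/n_j}$; the counts are made-up numbers for illustration.

```python
import math

def bonus(n, n_j):
    """Confidence term sqrt(2 ln(n) / n_j)."""
    return math.sqrt(2.0 * math.log(n) / n_j)

n = 50                              # trials so far (illustrative)
n_selected, n_unselected = 30, 20   # pull counts of two arms (illustrative)

before_selected = bonus(n, n_selected)
before_unselected = bonus(n, n_unselected)

# One more trial in which the first arm is the one selected.
after_selected = bonus(n + 1, n_selected + 1)
after_unselected = bonus(n + 1, n_unselected)

print(f"selected arm:   {before_selected:.4f} -> {after_selected:.4f}")
print(f"unselected arm: {before_unselected:.4f} -> {after_unselected:.4f}")
```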
10. Which of the following is true about the Median Elimination algorithm?
(a) It is a regret minimizing algorithm.
(b) The probability of the $\epsilon_l$-optimal arms of round $l$ being eliminated is less than $\delta_l$ for the
round.
(c) It is guaranteed to provide an ϵ-optimal arm at the end.
(d) Replacing ϵ with ϵ/2 doubles the sample complexity.
Sol. (b)
Look at the derivation for Median Elimination.
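
For readers without the derivation at hand, here is a rough sketch of the algorithm's structure. It is an illustration under stated assumptions, not the lecture's presentation: the schedule ($\epsilon_1 = \epsilon/4$, $\delta_1 = \delta/2$, $\frac{4}{\epsilon_l^2}\ln(3/\delta_l)$ samples per arm, then $\epsilon_{l+1} = \frac{3}{4}\epsilon_l$ and $\delta_{l+1} = \delta_l/2$) follows the commonly cited formulation as I recall it, and the Bernoulli arms, function name and arm means are hypothetical.

```python
import math
import random

def median_elimination(true_means, eps, delta, seed=0):
    """Sketch of Median Elimination on Bernoulli arms; returns (arm, rounds, samples)."""
    rng = random.Random(seed)
    arms = list(range(len(true_means)))
    eps_l, delta_l = eps / 4.0, delta / 2.0  # assumed initial schedule
    rounds = 0
    total_samples = 0
    while len(arms) > 1:
        # Sample every surviving arm the same number of times.
        n_samples = math.ceil((4.0 / eps_l**2) * math.log(3.0 / delta_l))
        estimates = {}
        for a in arms:
            wins = sum(1 for _ in range(n_samples) if rng.random() < true_means[a])
            estimates[a] = wins / n_samples
        total_samples += n_samples * len(arms)
        # Keep the half of the arms with the highest empirical means.
        arms.sort(key=lambda a: estimates[a], reverse=True)
        arms = arms[: math.ceil(len(arms) / 2)]
        eps_l *= 0.75
        delta_l /= 2.0
        rounds += 1
    return arms[0], rounds, total_samples

true_means = [0.1, 0.2, 0.3, 0.4, 0.5, 0.3, 0.2, 0.9]  # illustrative arm means
eps, delta = 0.5, 0.1
best, rounds, total_samples = median_elimination(true_means, eps, delta)
print("returned arm:", best, "rounds:", rounds, "samples:", total_samples)
```

With 8 arms the loop runs for $\log_2 8 = 3$ rounds. The sketch only identifies an arm and makes no attempt to collect reward while sampling, and the ϵ-optimality of the returned arm holds with high probability rather than with certainty. The per-round sample count in this sketch scales as $1/\epsilon_l^2$.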
