
CS 747 (Autumn 2020): Weekly Quizzes

Instructor: Shivaram Kalyanakrishnan

September 18, 2020

Note. Provide justifications/calculations/steps along with each answer to illustrate how you arrived at the answer. You will not receive credit for giving an answer without sufficient explanation.

Submission. Write down your answer by hand, then scan and upload to Moodle. Write clearly
and legibly. Be sure to mention your roll number.

Week 5
Question. For an MDP (S, A, T, R, γ), let V_0 : S → R be an initial guess of the optimal value function V*. Suppose that this guess is progressively updated using Value Iteration: that is, by setting V_{t+1} ← B*(V_t) for t = 0, 1, 2, . . . . Recall that B* is the Bellman optimality operator.

In this question, we examine the design of a stopping condition for Value Iteration. As usual, let ‖·‖∞ denote the max norm. We would like that our computed solution, V_u for some u ∈ {1, 2, . . . }, be within ε of V* for some given tolerance ε > 0. In other words, we would like to stop after u applications of B*, so long as we can guarantee ‖V_u − V*‖∞ ≤ ε. Naturally, we cannot use V* itself in our stopping rule, since it is not known! Show that it suffices to stop when

‖V_u − V_{u−1}‖∞ ≤ ε(1 − γ)/γ,

and thereafter return V_u as the answer.

You are likely to find two results handy: (1) that B* is a contraction mapping with contraction factor γ, and (2) the triangle inequality: for X : S → R, Y : S → R, ‖X + Y‖∞ ≤ ‖X‖∞ + ‖Y‖∞.
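
(The following is not part of the quiz: a minimal Python sketch, on an assumed random MDP, of how this stopping rule can be checked empirically. It runs Value Iteration until ‖V_u − V_{u−1}‖∞ ≤ ε(1 − γ)/γ, then verifies that the returned V_u is within ε of an accurate estimate of V* obtained by iterating much longer. All names and the instance are illustrative.)

```python
import numpy as np

def value_iteration(T, R, gamma, eps):
    """Apply B* repeatedly; stop when ||V_u - V_{u-1}||_inf <= eps * (1 - gamma) / gamma."""
    V = np.zeros(T.shape[0])  # V_0: initial guess
    while True:
        # B*(V): Q(s, a) = sum_s' T(s, a, s') * (R(s, a, s') + gamma * V(s')), then max over a.
        Q = np.einsum('sap,sap->sa', T, R + gamma * V[None, None, :])
        V_next = Q.max(axis=1)
        if np.max(np.abs(V_next - V)) <= eps * (1 - gamma) / gamma:
            return V_next
        V = V_next

# Assumed random MDP with 5 states and 3 actions (illustrative, not from the quiz).
rng = np.random.default_rng(0)
n_s, n_a = 5, 3
T = rng.random((n_s, n_a, n_s)); T /= T.sum(axis=2, keepdims=True)
R = rng.random((n_s, n_a, n_s))
gamma, eps = 0.9, 1e-3

V_u = value_iteration(T, R, gamma, eps)

# Approximate V* by iterating much longer, then check the guarantee ||V_u - V*||_inf <= eps.
V_star = V_u.copy()
for _ in range(10000):
    V_star = np.einsum('sap,sap->sa', T, R + gamma * V_star[None, None, :]).max(axis=1)
print(np.max(np.abs(V_u - V_star)) <= eps)  # should print True
```
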
Week 4
Question. In this week’s lecture, we derived Bellman equations for policy evaluation. If M = (S, A, T, R, γ) is our input MDP, we showed for every policy π : S → A and state s ∈ S:

V^π(s) = Σ_{s′∈S} T(s, π(s), s′){R(s, π(s), s′) + γ V^π(s′)}.

This question considers four variations in our definitions or assumptions regarding the input MDP
M and policy π. In each case write down Bellman equations after making appropriate modifications.
The set of equations for each case will suffice; no need for additional explanation.

a. The reward function R does not depend on the next state s0 ; it is given to you as R : S×A → R.

b. The reward function R depends only on the next state s0 ; it is given to you as R : S → R.

c. The policy π is stochastic: for s ∈ S, a ∈ A, π(s, a) denotes the probability with which the
policy takes action a from state s.

d. The underlying MDP M is deterministic. Hence, the transition function T is given as T : S × A → S, with the semantics that T(s, a) is the next state s′ ∈ S for s ∈ S, a ∈ A.

Solution. Answers are given below for all policies π and states s ∈ S.

a. V^π(s) = R(s, π(s)) + γ Σ_{s′∈S} T(s, π(s), s′) V^π(s′).

b. V^π(s) = Σ_{s′∈S} T(s, π(s), s′){R(s′) + γ V^π(s′)}.

c. V^π(s) = Σ_{a∈A} π(s, a) Σ_{s′∈S} T(s, a, s′){R(s, a, s′) + γ V^π(s′)}.

d. V^π(s) = R(s, π(s), T(s, π(s))) + γ V^π(T(s, π(s))).
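
(As a quick illustration, not required by the quiz: a minimal Python sketch of iterative policy evaluation for case (c), the stochastic policy, on an assumed random MDP. The other cases follow the same pattern with the reward term adjusted as above.)

```python
import numpy as np

def evaluate_stochastic_policy(T, R, pi, gamma, tol=1e-10):
    """Iterate the Bellman equation of case (c):
    V(s) = sum_a pi(s, a) * sum_s' T(s, a, s') * (R(s, a, s') + gamma * V(s'))."""
    V = np.zeros(T.shape[0])
    while True:
        Q = np.einsum('sap,sap->sa', T, R + gamma * V[None, None, :])  # action values
        V_next = np.einsum('sa,sa->s', pi, Q)                          # average under pi
        if np.max(np.abs(V_next - V)) < tol:
            return V_next
        V = V_next

# Assumed random MDP and stochastic policy (illustrative values only).
rng = np.random.default_rng(1)
n_s, n_a = 4, 2
T = rng.random((n_s, n_a, n_s)); T /= T.sum(axis=2, keepdims=True)
R = rng.random((n_s, n_a, n_s))
pi = rng.random((n_s, n_a)); pi /= pi.sum(axis=1, keepdims=True)
print(evaluate_stochastic_policy(T, R, pi, gamma=0.9))
```
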

Week 3
Question. A 2-armed bandit instance I has arms with mean rewards p1, p2 ∈ [0, 1], where
|p1 − p2 | = ∆ > 0. Both arms produce 0 and 1 rewards (that is, from Bernoulli distributions).
Suppose we are given ∆, but we do not know which arm has the higher mean reward. Our aim
is to determine the optimal arm with probability at least 1 − δ. In order to do so, we pull each arm
N times, and declare as our answer the arm which registers the higher empirical mean (breaking
ties uniformly at random).
Show that it suffices to set

N = Θ((1/∆²) log(1/δ))

in order to indeed give the correct answer with probability at least 1 − δ.

Solution. Without loss of generality, let arm 1, with mean p1 , be the optimal arm, and arm 2,
with mean p2 , be the sub-optimal arm. Intuition suggests that as N becomes larger, the probability
that arm 1 is returned increases. We will build a proof assuming N is sufficiently large—and take
it to the point that the proof itself suggests to us how N must be set.
After N pulls each, let the empirical means of the arms be p̂1 and p̂2, respectively. Consider the mid-point of the true means, (p1 + p2)/2, as a “boundary”, in the sense that the answer is guaranteed to be correct if neither empirical mean “crosses” the boundary. In other words, if each empirical mean stays within ∆/2 of the true mean on its corresponding side, then p̂1 must exceed p̂2, thereby yielding the right answer. We invoke Hoeffding’s Inequality to bound the deviation probability.

P{Wrong answer given} ≤ P{p̂1 ≤ p̂2}
                     ≤ P{p̂1 ≤ (p1 + p2)/2 or p̂2 ≥ (p1 + p2)/2}
                     ≤ P{p̂1 ≤ (p1 + p2)/2} + P{p̂2 ≥ (p1 + p2)/2}
                     = P{p̂1 ≤ p1 − ∆/2} + P{p̂2 ≥ p2 + ∆/2}
                     ≤ e^(−2N(∆/2)²) + e^(−2N(∆/2)²).

Suppose we had set N such that 2e^(−2N(∆/2)²) ≤ δ; we would then have an acceptable proof to go with that choice! Observe that it suffices to take N = ⌈(2/∆²) ln(2/δ)⌉.
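
(Not part of the quiz: a small Python sketch, with made-up parameters, that draws N = ⌈(2/∆²) ln(2/δ)⌉ Bernoulli samples per arm and estimates how often the higher empirical mean picks the wrong arm. The observed error rate should be at most δ, and is typically far smaller since the Hoeffding bound is loose.)

```python
import math
import random

def estimate_error_probability(p1, p2, delta, runs=2000, seed=0):
    """Pull each arm N = ceil((2 / Delta^2) * ln(2 / delta)) times, pick the higher
    empirical mean (ties broken uniformly at random), and report the error rate."""
    random.seed(seed)
    gap = abs(p1 - p2)
    N = math.ceil((2 / gap ** 2) * math.log(2 / delta))
    best = 0 if p1 > p2 else 1
    mistakes = 0
    for _ in range(runs):
        means = [sum(random.random() < p for _ in range(N)) / N for p in (p1, p2)]
        wrong = means[best] < means[1 - best]
        tie_lost = means[0] == means[1] and random.random() < 0.5
        mistakes += wrong or tie_lost
    return N, mistakes / runs

print(estimate_error_probability(p1=0.6, p2=0.5, delta=0.1))
# Error rate should be at most delta = 0.1 (and typically much smaller).
```
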

Week 2
Question. In this question, we consider bandit instances in which the number of arms n = 10;
assume the set of arms is A = {0, 1, 2, . . . , 9}. Each arm yields rewards from a Bernoulli distribution
whose mean is strictly less than 1. Call this set of bandit instances Ī.
Now consider a family of algorithms L that operate on Ī, wherein each algorithm L ∈ L satisfies
the following properties.

• L is deterministic.

• In the first n pulls made by L (on steps 0 ≤ t ≤ n − 1), each arm is pulled exactly once.

• For t = n, n + 1, n + 2, . . . : if t is not a prime number, then the arm pulled by L on the t-th
step has the highest empirical mean among all the arms at that step.

In other words, each L ∈ L is a deterministic algorithm that begins with round-robin sampling for
n pulls, and thereafter exploits on every step t that is not a prime number. You can assume ties are
broken arbitrarily. The chief difference between the elements of L arises from the decisions they
make on steps t that are prime numbers—there is no restriction on the choice made on such steps.

a. Show that there exists Lgood ∈ L such that Lgood achieves sub-linear regret on all I ∈ Ī.

b. Show that there exists Lbad ∈ L such that Lbad does not achieve sub-linear regret on all I ∈ Ī.

Your arguments can be informal: no need for the dense notation of Class Note 1. You can use the fact that the number of prime numbers smaller than any natural number N is Θ(N/log(N)).

Solution. For part (a), it suffices to show that there exists Lgood ∈ L that is GLIE. For every prime
number t, let m(t) denote the number of prime numbers smaller than t. Thus m(2) = 0, m(3) =
1, m(5) = 2, . . . . Take Lgood as an algorithm that on every step t that is a prime number, pulls arm
m(t) mod n. It is clear that Lgood will pull each arm infinitely often in the limit. Furthermore, the number of “exploit” steps up to horizon T is at least T − Θ(T/log(T)). For I ∈ Ī, we have

lim_{T→∞} E_{Lgood,I}[exploit(T)]/T = lim_{T→∞} (1 − Θ(1/log(T))) = 1,

implying that Lgood is greedy in the limit.


For part (b), it suffices to show that there exists Lbad ∈ L that is not GLIE: in particular we
need only show that Lbad is not guaranteed to pull each arm infinitely often in the limit. Take
Lbad to be an algorithm that only pulls arm 0 on steps t that are prime numbers. On any instance
in which the means of arms are in increasing order of their index (hence arm 9 is the sole optimal
arm), there is a non-zero probability that arm 9 will initially give a 0-reward, some other arm a
1-reward, and thereafter arm 9 will never get pulled by Lbad . On such an instance, Lbad incurs
linear regret.
In summary: the prime number bound guarantees that each L ∈ L will be greedy in the limit,
and it also allows for infinite exploration. Whether L ∈ L actually performs infinite exploration of
each arm determines the sub-linearity of its regret.
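
(As an informal sanity check, not required for the quiz: a Python sketch of the Lgood described above, with an assumed instance and illustrative names. It explores arm m(t) mod n on prime steps and exploits otherwise; the regret per step should shrink as the horizon grows.)

```python
import random

def is_prime(t):
    """Trial-division primality test; adequate for the step indices used here."""
    if t < 2:
        return False
    d = 2
    while d * d <= t:
        if t % d == 0:
            return False
        d += 1
    return True

def run_l_good(means, horizon, seed=0):
    """Sketch of L_good: round-robin for the first n steps, pull arm (m(t) mod n) on
    prime steps t (m(t) = number of primes smaller than t), and exploit otherwise."""
    random.seed(seed)
    n = len(means)
    p_star = max(means)
    counts, sums = [0] * n, [0.0] * n
    primes_seen = 0   # m(t): primes strictly smaller than the current step t
    regret = 0.0
    for t in range(horizon):
        if t < n:
            arm = t                                                 # round-robin phase
        elif is_prime(t):
            arm = primes_seen % n                                   # explore on prime steps
        else:
            arm = max(range(n), key=lambda a: sums[a] / counts[a])  # exploit
        if is_prime(t):
            primes_seen += 1
        reward = 1.0 if random.random() < means[arm] else 0.0       # Bernoulli reward
        counts[arm] += 1
        sums[arm] += reward
        regret += p_star - means[arm]   # expected regret incurred by this pull
    return regret

# Assumed instance: means increase with the arm index, so arm 9 is the sole optimal arm.
means = [0.05 + 0.1 * i for i in range(10)]
for T in (1000, 10000, 100000):
    print(T, run_l_good(means, T) / T)   # regret per step should shrink as T grows
```
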

Week 1
Question. Consider a 2-armed bandit instance B in which the rewards from the arms come from
uniform distributions (recall that the lectures assumed they came from Bernoulli distributions).
The rewards of arm 1 are drawn uniformly at random from [a, b], and the rewards of arm 2 are
drawn uniformly at random from [c, d], where 0 < a < c < b < d < 1. Observe that this means
there is an overlap: both arms produce some rewards from the interval [c, b].
An algorithm L proceeds as follows. First it pulls arm 1; then it pulls arm 2; whichever of these
arms produced a higher reward (or arm 1 in case of a tie) is then pulled a further 20 times. In
other words, the algorithm performs round-robin exploration for 2 steps and greedily picks an arm
for the subsequent exploitation phase, during which that arm is blindly pulled 20 times. What is
the expected cumulative regret of L on B after 22 pulls?
(If you have worked out an answer but are not sure about it, consider writing a small program
to simulate L and run it many times for fixed a, b, c, d. Is the average regret from these runs close
to your answer? The program is for your own sake; no need to submit or to explain to us.)

Solution. The mean reward of arm 1 is p1 = (a + b)/2 and the mean reward of arm 2 is p2 = (c + d)/2. Since a < c and b < d, it is clear that arm 2 is optimal.
The expected cumulative regret from the 22 pulls is the sum of those from the first 2 pulls
and from the subsequent exploitation phase (20 pulls). In the first two pulls, the expected cumulative regret is exactly p2 − p1, since arm 1 (the suboptimal arm) is pulled exactly once.
In the exploitation phase, the expected cumulative regret is 0 in case arm 2 is played, and
20(p2 − p1 ) if arm 1 is pulled. The expected cumulative regret from exploitation is therefore
P{arm 1 is selected after first 2 steps} · 20 · (p2 − p1 ).
What is the probability that arm 1 gets selected after the first two pulls? We know that each reward x1 from arm 1 is drawn from [a, b] according to pdf 1/(b − a). Similarly, the reward x2 from arm 2 is drawn from [c, d] according to pdf 1/(d − c). The probability that x1 ≥ x2 is therefore

P{arm 1 is selected after first 2 steps} = ∫_{x1=c}^{b} ∫_{x2=c}^{x1} 1/((b − a)(d − c)) dx2 dx1 = (b − c)²/(2(b − a)(d − c)).

An alternative argument to obtain this probability is that (1) x1 falls in [c, b] with probability (b − c)/(b − a), (2) x2 falls in [c, b] with probability (b − c)/(d − c), and (3) conditioned on x1 and x2 both falling in [c, b], each has a uniform distribution in that range, and thus the probability that one exceeds the other is 1/2.
The expected cumulative regret from the 22 pulls is thus

(p2 − p1) + P{arm 1 is selected after first 2 steps} · 20 · (p2 − p1) = ((c + d − a − b)/2) · (1 + 10(b − c)²/((b − a)(d − c))).
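
(Following the question’s suggestion, here is a small Python sketch, with illustrative values of a, b, c, d, that estimates the expected regret by simulation and compares it against the closed-form expression above. It is for your own verification; nothing to submit.)

```python
import random

def simulate_L(a, b, c, d, runs=200000, seed=0):
    """Monte-Carlo estimate of the expected cumulative regret of L over 22 pulls."""
    random.seed(seed)
    p1, p2 = (a + b) / 2, (c + d) / 2
    total = 0.0
    for _ in range(runs):
        x1 = random.uniform(a, b)   # first pull: arm 1
        x2 = random.uniform(c, d)   # second pull: arm 2
        regret = p2 - p1            # arm 1 (suboptimal) is pulled once in the first 2 steps
        if x1 >= x2:                # ties go to arm 1, which is then pulled 20 more times
            regret += 20 * (p2 - p1)
        total += regret
    return total / runs

# Any values with 0 < a < c < b < d < 1 will do; these are illustrative.
a, b, c, d = 0.1, 0.6, 0.4, 0.9
closed_form = ((c + d - a - b) / 2) * (1 + 10 * (b - c) ** 2 / ((b - a) * (d - c)))
print(simulate_L(a, b, c, d), closed_form)   # the two numbers should be close
```
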
