cs747 A2020 Quizzes PDF
Note. Provide justifications/calculations/steps along with each answer to illustrate how you arrived at the answer. You will not receive credit for giving an answer without sufficient explanation.
Submission. Write down your answer by hand, then scan and upload to Moodle. Write clearly
and legibly. Be sure to mention your roll number.
Week 5
Question. For an MDP (S, A, T, R, γ), let V0 : S → R be an initial guess of the optimal value function V*. Suppose that this guess is progressively updated using Value Iteration: that is, by setting Vt+1 ← B*(Vt) for t = 0, 1, 2, . . . . Recall that B* is the Bellman optimality operator.
In this question, we examine the design of a stopping condition for Value Iteration. As usual, let ‖·‖∞ denote the max norm. We would like our computed solution, Vu for some u ∈ {1, 2, . . . }, to be within ε of V* for some given tolerance ε > 0. In other words, we would like to stop after u applications of B*, so long as we can guarantee ‖Vu − V*‖∞ ≤ ε. Naturally, we cannot use V* itself in our stopping rule, since it is not known! Show that it suffices to stop when

‖Vu − Vu−1‖∞ ≤ ε(1 − γ)/γ,

and thereafter return Vu as the answer.
You are likely to find two results handy: (1) that B* is a contraction mapping with contraction factor γ, and (2) the triangle inequality: for X : S → R, Y : S → R, ‖X + Y‖∞ ≤ ‖X‖∞ + ‖Y‖∞.
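For intuition about how such a rule is used in practice, here is a minimal Python sketch of Value Iteration with the above stopping condition on a small made-up MDP (the arrays T and R below are illustrative assumptions, not part of the question):

import numpy as np

# Illustrative 2-state, 2-action MDP (hypothetical numbers).
# T[s, a, s2] is the transition probability, R[s, a, s2] the reward.
T = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
R = np.array([[[1.0, 0.0], [0.0, 2.0]],
              [[0.0, 0.5], [3.0, 1.0]]])
gamma, eps = 0.9, 1e-6

V = np.zeros(2)                              # V0: initial guess
while True:
    # One application of the Bellman optimality operator B*.
    Q = (T * (R + gamma * V)).sum(axis=2)    # Q[s, a]
    V_new = Q.max(axis=1)
    # Stop once ||Vu - Vu-1||_inf <= eps * (1 - gamma) / gamma.
    if np.max(np.abs(V_new - V)) <= eps * (1 - gamma) / gamma:
        V = V_new
        break
    V = V_new

print(V)   # guaranteed to be within eps of V* in the max norm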
Week 4
Question. In this week’s lecture, we derived Bellman equations for policy evaluation. If M = (S, A, T, R, γ) is our input MDP, we showed for every policy π : S → A and state s ∈ S:

V^π(s) = Σ_{s′∈S} T(s, π(s), s′){R(s, π(s), s′) + γ V^π(s′)}.
This question considers four variations in our definitions or assumptions regarding the input MDP
M and policy π. In each case write down Bellman equations after making appropriate modifications.
The set of equations for each case will suffice; no need for additional explanation.
a. The reward function R does not depend on the next state s′; it is given to you as R : S × A → R.
b. The reward function R depends only on the next state s′; it is given to you as R : S → R.
c. The policy π is stochastic: for s ∈ S, a ∈ A, π(s, a) denotes the probability with which the
policy takes action a from state s.
Solution. Answers are given below for all policies π and states s ∈ S.
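A sketch of the answers to parts (a)-(c), written for every policy π and state s ∈ S in the notation of the question (the exact forms presented in lecture may differ in notation):

a. V^π(s) = R(s, π(s)) + γ Σ_{s′∈S} T(s, π(s), s′) V^π(s′).

b. V^π(s) = Σ_{s′∈S} T(s, π(s), s′){R(s′) + γ V^π(s′)}.

c. V^π(s) = Σ_{a∈A} π(s, a) Σ_{s′∈S} T(s, a, s′){R(s, a, s′) + γ V^π(s′)}.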
Week 3
Question. A 2-armed bandit instance I has arms with mean rewards p1, p2 ∈ [0, 1], where |p1 − p2| = ∆ > 0. Both arms produce 0/1 rewards (that is, rewards come from Bernoulli distributions).
Suppose we are given ∆, but we do not know which arm has the higher mean reward. Our aim
is to determine the optimal arm with probability at least 1 − δ. In order to do so, we pull each arm
N times, and declare as our answer the arm which registers the higher empirical mean (breaking
ties uniformly at random).
Show that it suffices to set

N = Θ((1/∆²) log(1/δ))

in order to indeed give the correct answer with probability at least 1 − δ.
Solution. Without loss of generality, let arm 1, with mean p1 , be the optimal arm, and arm 2,
with mean p2 , be the sub-optimal arm. Intuition suggests that as N becomes larger, the probability
that arm 1 is returned increases. We will build a proof assuming N is sufficiently large—and take
it to the point that the proof itself suggests to us how N must be set.
After N pulls each, let the empirical means of the arms be p̂1 and p̂2, respectively. Consider the mid-point of the true means, (p1 + p2)/2, as a “boundary”, in the sense that the answer is guaranteed to be correct if neither empirical mean “crosses” the boundary. In other words, if each empirical mean stays within ∆/2 of its true mean on the corresponding side, then p̂1 must exceed p̂2, thereby yielding the right answer. We invoke Hoeffding’s Inequality to bound the deviation probability.
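A sketch of how the calculation can be completed (using Hoeffding’s Inequality for N i.i.d. samples bounded in [0, 1]): by Hoeffding, P{p̂1 ≤ p1 − ∆/2} ≤ exp(−2N(∆/2)²) = exp(−N∆²/2), and likewise P{p̂2 ≥ p2 + ∆/2} ≤ exp(−N∆²/2). By the union bound, the declared answer is wrong with probability at most 2 exp(−N∆²/2), which is at most δ whenever

N ≥ (2/∆²) ln(2/δ),

matching the claimed N = Θ((1/∆²) log(1/δ)).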
Week 2
Question. In this question, we consider bandit instances in which the number of arms n = 10;
assume the set of arms is A = {0, 1, 2, . . . , 9}. Each arm yields rewards from a Bernoulli distribution
whose mean is strictly less than 1. Call this set of bandit instances Ī.
Now consider a family of algorithms 𝓛 that operate on Ī, wherein each algorithm L ∈ 𝓛 satisfies the following properties.
• L is deterministic.
• In the first n pulls made by L (on steps 0 ≤ t ≤ n − 1), each arm is pulled exactly once.
• For t = n, n + 1, n + 2, . . . : if t is not a prime number, then the arm pulled by L on the t-th
step has the highest empirical mean among all the arms at that step.
In other words, each L ∈ 𝓛 is a deterministic algorithm that begins with round-robin sampling for n pulls, and thereafter exploits on every step t that is not a prime number. You can assume ties are broken arbitrarily. The chief difference between the elements of 𝓛 arises from the decisions they make on steps t that are prime numbers: there is no restriction on the choice made on such steps.
a. Show that there exists Lgood ∈ 𝓛 such that Lgood achieves sub-linear regret on all I ∈ Ī.
b. Show that there exists Lbad ∈ 𝓛 such that Lbad does not achieve sub-linear regret on all I ∈ Ī.
Your arguments can be informal: no need for the dense notation of Class Note 1. You can use the fact that the number of prime numbers smaller than any natural number N is Θ(N/log(N)).
Solution. For part (a), it suffices to show that there exists Lgood ∈ 𝓛 that is GLIE. For every prime
number t, let m(t) denote the number of prime numbers smaller than t. Thus m(2) = 0, m(3) =
1, m(5) = 2, . . . . Take Lgood as an algorithm that on every step t that is a prime number, pulls arm
m(t) mod n. It is clear that Lgood will pull each arm infinitely often in the limit. Furthermore,
the number of “exploit” steps up to horizon T is at least T − Θ(T/log(T)). For I ∈ Ī, we have

lim_{T→∞} E_{Lgood,I}[exploit(T)]/T = lim_{T→∞} (1 − Θ(1/log(T))) = 1,

which establishes that Lgood is greedy in the limit, and hence GLIE.
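To make the construction of Lgood concrete, here is a minimal Python sketch on a hypothetical Bernoulli instance from Ī (the helper names and the means used below are illustrative assumptions):

import random

def is_prime(t):
    # Simple trial division; adequate for a sketch.
    if t < 2:
        return False
    return all(t % d for d in range(2, int(t ** 0.5) + 1))

def run_L_good(means, horizon, seed=0):
    # Round-robin for the first n pulls; arm m(t) mod n on prime-numbered
    # steps; otherwise the arm with the highest empirical mean (ties broken
    # by lowest index). Returns the empirical regret over the horizon.
    rng = random.Random(seed)
    n = len(means)
    counts, sums = [0] * n, [0.0] * n
    primes_seen = sum(is_prime(k) for k in range(n))   # equals m(n)
    total_reward = 0.0
    for t in range(horizon):
        if t < n:
            arm = t                                    # round-robin phase
        elif is_prime(t):
            arm = primes_seen % n                      # explore on prime steps
            primes_seen += 1                           # keeps primes_seen = m(t)
        else:
            arm = max(range(n), key=lambda a: sums[a] / counts[a])
        r = 1.0 if rng.random() < means[arm] else 0.0  # Bernoulli reward
        counts[arm] += 1
        sums[arm] += r
        total_reward += r
    return max(means) * horizon - total_reward

# Hypothetical 10-armed instance with means strictly below 1:
print(run_L_good([0.1 * k for k in range(10)], horizon=100000))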
Week 1
Question. Consider a 2-armed bandit instance B in which the rewards from the arms come from
uniform distributions (recall that the lectures assumed they came from Bernoulli distributions).
The rewards of arm 1 are drawn uniformly at random from [a, b], and the rewards of arm 2 are
drawn uniformly at random from [c, d], where 0 < a < c < b < d < 1. Observe that this means
there is an overlap: both arms produce some rewards from the interval [c, b].
An algorithm L proceeds as follows. First it pulls arm 1; then it pulls arm 2; whichever of these
arms produced a higher reward (or arm 1 in case of a tie) is then pulled a further 20 times. In
other words, the algorithm performs round-robin exploration for 2 steps and greedily picks an arm
for the subsequent exploitation phase, during which that arm is blindly pulled 20 times. What is
the expected cumulative regret of L on B after 22 pulls?
(If you have worked out an answer but are not sure about it, consider writing a small program
to simulate L and run it many times for fixed a, b, c, d. Is the average regret from these runs close
to your answer? The program is for your own sake; no need to submit or to explain to us.)
Solution. Let x1 and x2 denote the rewards obtained from the first pull of arm 1 and the first pull of arm 2, respectively. Then

P{arm 1 is selected after first 2 steps} = ∫_{x1=c}^{b} ∫_{x2=c}^{x1} (1/((b − a)(d − c))) dx2 dx1 = (b − c)²/(2(b − a)(d − c)).
An alternative argument to obtain this probability is that (1) x1 falls in [c, b] with probability (b − c)/(b − a), (2) x2 falls in [c, b] with probability (b − c)/(d − c), and (3) conditioned on x1 and x2 both falling in [c, b], each has a uniform distribution in that range, and thus the probability that one exceeds the other is 1/2 (note that arm 1 can register the higher reward only if both x1 and x2 fall in [c, b]).
With p1 = (a + b)/2 and p2 = (c + d)/2 denoting the mean rewards of arms 1 and 2 (so arm 2 is optimal), the first pull contributes expected regret p2 − p1, the second pull contributes none, and each of the 20 exploitation pulls contributes p2 − p1 exactly when arm 1 was selected. The expected cumulative regret from the 22 pulls is thus

(p2 − p1) + P{arm 1 is selected after first 2 steps} · 20 · (p2 − p1) = ((c + d − a − b)/2) · (1 + 10(b − c)²/((b − a)(d − c))).
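Following the question’s own suggestion, here is a small Python simulation sketch (with hypothetical values of a, b, c, d) that compares the average regret over many runs against the closed-form expression above:

import random

def run_once(a, b, c, d, rng):
    # Algorithm L: pull arm 1, pull arm 2, then pull the early winner 20 more times.
    x1 = rng.uniform(a, b)
    x2 = rng.uniform(c, d)
    reward = x1 + x2
    for _ in range(20):
        reward += rng.uniform(a, b) if x1 >= x2 else rng.uniform(c, d)
    # Regret is measured against always pulling arm 2, the optimal arm.
    return 22 * (c + d) / 2 - reward

def closed_form(a, b, c, d):
    gap = (c + d - a - b) / 2
    return gap * (1 + 10 * (b - c) ** 2 / ((b - a) * (d - c)))

rng = random.Random(0)
a, b, c, d = 0.1, 0.6, 0.4, 0.9    # hypothetical, satisfies 0 < a < c < b < d < 1
runs = 100000
avg = sum(run_once(a, b, c, d, rng) for _ in range(runs)) / runs
print(avg, closed_form(a, b, c, d))   # the two numbers should be close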