Sol Tor Csaba
Sol Tor Csaba
Contents
2 Foundations of Probability 5
2.1, 2.3, 2.5, 2.6, 2.7, 2.8, 2.9, 2.10, 2.11, 2.12, 2.14, 2.15, 2.16, 2.18, 2.19
3 Stochastic Processes and Markov Chains 11
3.1, 3.5, 3.8, 3.9
4 Stochastic Bandits 13
4.9, 4.10
5 Concentration of Measure 14
5.10, 5.12, 5.13, 5.14, 5.15, 5.16, 5.17, 5.18, 5.19
6 The Explore-then-Commit Algorithm 22
6.2, 6.3, 6.5, 6.6, 6.8
7 The Upper Confidence Bound Algorithm 26
7.1
8 The Upper Confidence Bound Algorithm: Asymptotic Optimality 27
8.1
9 The Upper Confidence Bound Algorithm: Minimax Optimality 28
9.1, 9.4
10 The Upper Confidence Bound Algorithm: Bernoulli Noise 29
10.1, 10.3, 10.4, 10.5
11 The Exp3 Algorithm 35
11.2, 11.5, 11.6, 11.7
12 The Exp3-IX Algorithm 37
12.1, 12.4
13 Lower Bounds: Basic Ideas 39
13.2
14 Foundations of Information Theory 39
14.4, 14.10, 14.11
15 Minimax Lower Bounds 41
15.1
16 Instance-Dependent Lower Bounds 42
16.2, 16.7
17 High-Probability Lower Bounds 43
17.1
1
18 Contextual Bandits 44
18.1, 18.6, 18.7, 18.8, 18.9
19 Stochastic Linear Bandits 47
19.3, 19.4, 19.5, 19.6, 19.7, 19.8
20 Confidence Bounds for Least Squares Estimators 54
20.1, 20.2, 20.3, 20.4, 20.5, 20.8, 20.9, 20.10, 20.11
21 Optimal Design for Least Squares Estimators 58
21.1, 21.2, 21.3, 21.5
22 Stochastic Linear Bandits for Finitely Many Arms 60
23 Stochastic Linear Bandits with Sparsity 60
23.2
24 Minimax Lower Bounds for Stochastic Linear Bandits 60
24.1
25 Asymptotic Lower Bounds for Stochastic Linear Bandits 61
25.3
26 Foundations of Convex Analysis 61
26.2, 26.3, 26.9, 26.13, 26.14, 26.15
27 Exp3 for Adversarial Linear Bandits 64
27.1, 27.4, 27.6, 27.8, 27.9, 27.11
28 Follow-the-Regularised-Leader and Mirror Descent 69
28.1, 28.5, 28.10, 28.11, 28.12, 28.13, 28.14, 28.15, 28.16, 28.17
29 The Relation between Adversarial and Stochastic Linear Bandits 77
29.2, 29.4
30 Combinatorial Bandits 78
30.4, 30.5, 30.6, 30.8
31 Non-stationary Bandits 81
31.1, 31.3
32 Ranking 82
32.2, 32.6
33 Pure Exploration 84
33.3, 33.4, 33.5, 33.6, 33.7, 33.9
34 Foundations of Bayesian Learning 90
34.4, 34.5, 34.13, 34.14, 34.15, 34.16
35 Bayesian Bandits 95
35.1, 35.2, 35.3, 35.6, 35.7
36 Thompson Sampling 100
36.3, 36.5, 36.6, 36.13
37 Partial Monitoring 103
37.3, 37.10, 37.12, 37.13, 37.14
38 Markov Decision Processes 106
38.2, 38.4, 38.5, 38.7, 38.8, 38.9, 38.10, 38.11, 38.12, 38.13, 38.14, 38.15, 38.16, 38.17,
38.19, 38.21, 38.22, 38.23, 38.24
2
Bibliography 118
3
4
Chapter 2 Foundations of Probability
2.1 Let h = g ◦ f . Let A ∈ H. We need to show that h−1 (A) ∈ F. We claim that
h−1 (A) = f −1 (g −1 (A)). Because g is G/H-measurable, g −1 (A) ∈ G and thus because f is F/G-
measurable, f −1 (g −1 (A)) is F-measurable, thus completing the proof, once we show that the claim
holds. To show the claim, we show two-sided containment. For showing h−1 (A) ⊂ f −1 (g −1 (A)) let
x ∈ h−1 (A). Thus, h(x) ∈ A. By definition, h(x) = g(f (x)) ∈ A. Hence, f (x) ∈ g −1 (A) and thus
x ∈ f −1 (g −1 (A)). For the other direction let x ∈ f −1 (g −1 (A)). This implies that f (x) ∈ g −1 (A),
which implies that h(x) = g(f (x)) ∈ A.
2.3 Since X(u) ∈ V for all u ∈ U we have X −1 (V) = U. Therefore U ∈ ΣX . Suppose that
U ∈ ΣX , then by definition there exists a V ∈ Σ such that X −1 (V ) = U . Because ΣX is a
σ-algebra we have V c ∈ Σ and by definition of ΣX we have U c = X −1 (V c ) ∈ ΣX . Therefore
ΣX is closed under complements. Finally let (Ui )i be a countable sequence with Ui ∈ ΣX . Then
S
i Ui = X
−1 (∪ X(U )) ∈ Σ , which means that Σ is closed under countable unions and the proof
i i X X
is completed.
2.5
(a) Let A be the set of all σ-algebras that contain G and define
\
F∗ = F.
F ∈A
We claim that F ∗ is the smallest σ-algebra containing G. Clearly F ∗ contains G and is contained
in all σ-algebras containing G. Furthermore, by definition it contains exactly those A that are
in every σ-algebra that contains G. It remains to show that F ∗ is a σ-algebra. Since Ω ∈ F
for all F ∈ A it follows that Ω ∈ F ∗ . Now suppose that A ∈ F ∗ . Then A ∈ F for all F ∈ A
and Ac ∈ F for all F ∈ A. Therefore Ac ∈ F ∗ . Therefore F ∗ is closed under complements.
Finally, suppose that (Ai )i is a family in F ∗ . Then (Ai )i are families in F for all F ∈ A and so
S S
i Ai ∈ F for all F ∈ A and again we have i Ai ∈ F . Therefore F is a σ-algebra.
∗ ∗
(b) Define H = A : X −1 (A) ∈ F . Then Ω ∈ H and for A ∈ H we have X −1 (Ac ) = X −1 (A)c so
Ac in H. Furthermore, for (Ai )i with Ai ∈ H we have
!
[ [
X −1 Ai = X −1 (Ai ) .
i i
(c) We need to show that I {A}−1 (B) ∈ F for all B ∈ B(R). There are four cases. If {0, 1} ∈ B,
then I {A}−1 (B) = Ω ∈ F. If {1} ∈ B, then I {A}−1 (B) = A ∈ F. If {0} ∈ B, then
5
I {A}−1 (B) = Ac ∈ F. Finally, if {0, 1} ∩ B = ∅, then I {A}−1 (B) = ∅ ∈ F. Therefore I {A} is
F-measurable.
2.6 Trivially, σ(X) = {∅, R}. Hence Y is not σ(X)/B(R)-measurable because Y −1 ([0, 1]) = [0, 1] 6∈
σ(X).
2.7 First P (∅ | B) = P (∅ ∩ B) /P (B) = 0 and P (Ω | B) = P (Ω ∩ B) /P (B) = 1. Let (Ei )i be a
countable collection of disjoint sets with Ei ∈ F. Then
! S S
[ P (B ∩ i Ei ) P ( i (B ∩ Ei ))
P Ei B = =
i
P (B) P (B)
X P (B ∩ Ei ) X
= = P (Ei | B) .
i
P (B) i
Therefore P( · | B) satisfies the countable additivity property and the proof is complete.
2.8 Using the definition of conditional probability and the assumption that P (A) > 0 and P (B) > 0
we have:
P (A ∩ B) P (B | A) P (A)
P (A | B) = = .
P (B) P (B)
Therefore {X1 < 2} is independent from {X2 is even}. For part (b) note that σ(X1 ) = {C × [6] :
C ∈ 2[6] } and σ(X2 ) = {[6] × C : C ∈ 2[6] }. It follows that for |A ∩ B| = |A||B|/62 and so
P (A ∩ B) |A ∩ B|/62 |A|
P (A | B) = = = 2 = P (A) .
P (B) |B|/6 2 6
(c) If A and Ac are independent, then 0 = P (∅) = P (A ∩ Ac ) = P (A) P (Ac ) = P (A) (1 − P (A)).
Therefore P (A) ∈ {0, 1}. This makes sense because the knowledge of A provides the knowledge
of Ac , so the two events can only be independent if one occurs with probability zero.
6
(d) If A is independent of itself, then P (A ∩ A) = P (A)2 . Therefore P (A) ∈ {0, 1} as before. The
intuition is the same as the previous part.
(e) Ω = {(1, 1), (1, 2), (2, 1), (2, 2)} and F = 2Ω .
(h) Assume that n is prime. By the previous part, n|A ∩ B| = |A||B| must hold if A and B are
independent of each other. If |A ∩ B| = 0, the events will be trivial. Hence, assume |A ∩ B| > 0.
Since n is prime, it follows then that n must be either a factor of |A| or a factor of |B|. Without
loss of generality, assume that it is a factor of |A|. This implies n ≤ |A|. But |A| ≤ n also holds,
hence |A| = n, i.e., A is a trivial event.
(i) Let X1 and X2 be independent Rademacher random variables and X3 = X1 X2 . Clearly these
random variables are not mutually independent since X3 takes multiple values with nonzero
probability and is fully determined by X1 and X2 . And yet X3 and Xi are independent for
i ∈ {1, 2}, which ensures that pairwise independence holds.
(j) No. Let Ω = [6] and F = 2Ω and P be the uniform measure. Define events A = {1, 3, 4} and
B = {1, 3, 5} and C = {3, 4, 5, 6}. Then A and B are clearly dependent and yet
2.11
(a) σ(X) = (Ω, ∅) is trivial. Let Y be another random variable, then X and Y are independent
if and only if for all A ∈ σ(X) and B ∈ σ(Y ) it holds that P (A ∩ B) = P (B), which is trivial
when A ∈ {Ω, ∅}.
7
(c) Suppose that A and B are independent. Then P (Ac | B) = 1 − P (A | B) = 1 − P (A) = P (Ac ).
Therefore Ac and B are independent and by the same argument so are Ac and B c as well as A
and B c . The ‘if’ direction follows by noting that σ(X) = {Ω, A, Ac , ∅} and σ(Y ) = {Ω, B, B c , ∅}
and recalling that every event is independent of Ω or the empty set. For the ‘only if’ note
that independence of X and Y means that any pair of events taken from σ(X) × σ(Y ) are
independent, which by the above includes the pair A, B.
(d) Let (Ai )i be a countable family of events and Xi (ω) = I {ω ∈ Ai } be the indicator of the ith
event. When the random variables/events are pairwise independent, then the above argument
goes through unchanged for each pair. In the case of mutual independence the ‘only if’ is again
the same. For the ‘if’, suppose that (Ai ) are mutually independent. Therefore for any finite
subset K ⊂ N we have
!
\ Y
P Ai = P (Ai )
i∈K i∈K
The same argument as the previous part shows that for any disjoint finite sets K, J ⊂ N we
have
!
[ [ Y Y
P Ai ∪ Aci = P (Ai ) P (Aci ) .
i∈K i∈J i∈K i∈J
Therefore for any finite set K ⊂ N and (Vi )i∈K with Vi ∈ σ(Xi ) = {Ω, ∅, Ai , Aci } it holds that
!
\ Y
P Vi = P (Vi ) ,
i∈K i∈K
2.12
(a) Let A ⊂ R be an open set. By definition, since f is continuous it holds that f −1 (A) is open.
But the Borel σ-algebra is generated by all open sets and so f −1 (A) ∈ B(R) as required.
(c) Recall that (X)+ = max{0, X} and (X)− = − min{0, X}. Therefore (|X|)+ = |X| =
(X)+ + (X)− and (|X|)− = 0. Recall that E[X] = E[(X)+ ] − E[(X)− ] exists if an only if
both expectations are defined. Therefore if X is integrable, then |X| is integrable. Now suppose
that |X| is integrable, then X is integrable by the dominated convergence theorem.
2.14 Assume without (much) loss of generality that Xi ≥ 0 for all i. The general case follows by
8
considering positive and negative parts, as usual. First we claim that for any n it holds that
" n # n
X X
E Xi = E[Xi ] .
i=1 i=1
P
Next let Sn = ni=1 Xi and note that by the monotone convergence theorem we have limn→∞ E[Sn ] =
E[X], which means that
"∞ # n
X X
E Xi = lim E[Sn ] = lim E[Xi ] .
n→∞ n→∞
i=1 i=1
Pn
2.15 Suppose that X(ω) = i=1 αi I {ω ∈ Ai } is simple and c > 0. Then cX is also simple and
n
X n
X
E[cX] = cαi I {ω ∈ Ai } = c αi I {ω ∈ Ai } = cE[X] .
i=1 i=1
Now suppose that X is positive (but maybe not simple) and c > 0, then cX is also positive and
For negative c simply note that (cX)+ = −c(X)− and (cX)− = −c(X)+ and repeat the above
argument.
9
PN PN
2.16 Suppose X = i=1 αi I {Ai } and Y = i=1 βi I {Bi } are simple functions. Then
"N N
#
X X
E[XY ] = E αi I {Ai } βi I {Bi }
i=1 i=1
N
XX N
= αi βj P (Ai ∩ Bj )
i=1 j=1
N X
X N
= αi βj P (Ai ) P (Aj )
i=1 j=1
= E[X]E[Y ] .
Now suppose that X and Y are arbitrary non-negative independent random variables. Then
Finally, for arbitrary random variables we have via the previous display and the linearity of
expectation that
2.18 Let X be a standard Rademacher random variable and Y = X. Then E[X]E[Y ] = 0 and
E[XY ] = 1.
Ra
2.19 Using the fact that 0 1dx = a for a ≥ 0 and the non-negativity of X we have
Z ∞
X(ω) = I {[0, X(ω)]} (x)dx .
0
10
Chapter 3 Stochastic Processes and Markov Chains
3.1
(a) We have F1 (x) = I {x ∈ [1/2, 1]} and F2 (x) = I {x ∈ [1/4, 2/4) ∪ [3/4, 4/4]}. More generally,
Ft (x) = IUt (x) where
[
Ut = {1} ∪ [(2s − 1)/2t , 2s/2t ) .
1≤s≤2t−1
U1
U2
U3
U4
P2t−1
(b) We have P(Ut ) = λ(Ut ) = s=1 (1/2t ) = 1/2.
(c) Given an index set K ⊂ N+ we need to show that {Fk : k ∈ K} are independent. Or equivalently,
that
\ Y
P Uk = P (Uk ) = 2−|K| . (3.1)
k∈K k∈K
(d) It follows directly from the definition of independence that any subsequence of an independent
sequence is also an independent sequence. That P (Xm,t = 0) = P (Xm,t = 1) = 1/2 follows from
Part (b).
11
P
(e) By the previous parts Xt = ∞ t=1 Xm,t 2
−t is a weighted sum of an independent sequence of
P
uniform Bernoully random variables. Therefore Xt has the same law as Y = ∞ t=1 Ft 2 . But
−t
(f) This follows from the definition of (Xm,t )∞ t=1 as disjoint subsets of independent random variables
(Ft )t=1 and the ‘grouping’ result that whenever (Ft )t∈T is a collection of independent σ-algebras
∞
and T1 , T2 are disjoint subsets of T , then σ(∪t∈T1 Ft ) and σ(∪t∈T2 Ft ) are independent [Kallenberg,
2002, Corollary 3.7]. This latter result is a good exercise. Use a monotone class argument.
R
3.5 Let A ∈ G and suppose that X(ω) = IA (ω). Then X X(x)K(ω, dx) = K(ω, A), which is
F-measurable by the definition of a probability kernel. The result extends to simple functions by
linearity. For nonnegative X let Xn ↑ X be a monotone increasing sequence of simple functions
R
converging point-wise to X [Kallenberg, 2002, Lemma 1.11]. Then Un (ω) = X Xn (x)K(ω, dx) is
R
F-measurable. Monotone convergence ensures that limn→∞ Un (ω) = X limn→∞ Xn (x)K(ω, dx) =
R
X X(x)K(ω, dx) = U (ω). Hence limn→∞ Un (ω) = U (ω) and a point-wise convergent sequence of
measurable functions is measurable, it follows that U is F-measurable. The result for arbitrary X
follows by decomposing into positive and negative parts.
3.8 Let (Xt )nt=0 be F = (Ft )nt=1 -adapted and τ = min{t : Xt ≥ ε}. By the submartingale property,
E[Xn | Ft ] ≥ Xt . Therefore
I {τ = t} Xt ≤ I {τ = t} E[Xn | Ft ] = E[I {τ = t} Xn | Ft ] .
3.9 Let ΣX ( ΣY ) be the σ-algebra underlying X (respectively, Y). It suffices to verify that for
12
A ∈ ΣX , B ∈ ΣY , P(X,Y ) (A × B) = (PY ⊗ PX|Y )(A × B). We have
P(X,Y ) (A × B) = P (X ∈ A, Y ∈ B)
= E [E [I {X ∈ A} I {Y ∈ B} | Y ]] (tower rule)
= E [I {Y ∈ B} E [I {X ∈ A} | Y ]] (I {Y ∈ B} is σ(Y )-measurable)
= E [I {Y ∈ B} P (X ∈ A | Y )] (relation of expectation and probability)
h i
= E I {Y ∈ B} PX|Y (X ∈ A | Y ) (definition of PX|Y )
Z
= PY (dy)PX|Y (X ∈ A | y) (pushforward property)
B
= (PY ⊗ PX|Y )(B × A) . (definition of ⊗)
4.9
(a) The statement is true. Let i be a suboptimal arm. By Lemma 4.5 we have
Rn (π, ν) Xk
E[Ti (n)]∆i E[Ti (n)]
0 = lim = lim sup ≥ lim sup ∆i .
n→∞ n n→∞
i=1
n n→∞ n
Hence lim supn→∞ E[Ti (n)]/n ≤ 0 ≤ lim inf n→∞ E[Ti (n)]/n and so limn→∞ E[Ti (n)]/n = 0 for
P P
suboptimal arms i. Since ki=1 E[Ti (n)]/n = 1 it follows that limn→∞ i:∆i =0 E[Ti ]/n = 1.
(b) The statement is false. Consider a two-armed bandit for which the second arm is suboptimal
and an algorithm that chooses the second arm in rounds t ∈ {1, 2, 4, 8, 16, . . .}.
4.10
(a) (Sketch) Fix policy π and n and assume without loss of generality the reward-stack model. We
turn policy π into a retirement policy π 0 by reshuffling the order in which π uses the two arms
during the n rounds so that if π uses action 1, say m times out of the n rounds, π 0 will use
action 1 in the first m rounds and then switches to arm 2. By using the regret decomposition,
we see that this suffices to show that π 0 achieves no more regret than π (actually, achieves the
same regret).
So what is policy π 0 ? Policy π 0 will keep querying policy π for at most n times. If π returns by
proposing to use action 1, π 0 will play this action, get the reward from the environment and
feeds the obtained reward to policy π. If π returns by proposing to play action 2, π 0 does not
play this action for now, just feeds π with zero. After π was queried n times, action 2 is played
up in the remaining rounds out of the total n rounds.
(b) Assume that arm 1 has a Bernoulli payoff with parameter p ∈ [0, 1] and arm 2 has a fixed
payoff of 0.5 (so µ1 = p and µ2 = 0.5). Note that whether π ever retires on these Bernoulli
13
environments depends on whether there exists some t > 0 and x1 , . . . , xt−1 ∈ {0, 1} such that
πt (2|1, x1 , . . . , 1, xt−1 ) > 0, or
We have the two cases. When (4.1) does not hold then π will have linear regret when
p < 0.5. When (4.1) does hold then take the t > 0 and x1 , . . . , xt−1 ∈ {0, 1} such that
ρ = πt (2|1, x1 , . . . , 1, xt−1 ) > 0 (these must exist). Assume that t > 0 is smallest possible:
Hence, πs (1|1, x01 , . . . , 1, x0s−1 ) = 1 for any s < t and x01 , . . . , x0s−1 ∈ {0, 1}. Now, take an
environment when p > 0.5 (so arm 1 is the optimal arm) and let Rn denote the regret of π in
this environment. Then letting ∆ = p − 0.5 > 0, we have
Rn = ∆E [T2 (n)]
≥ ∆E [I {A1 = 1, X1 = x1 , . . . , At−1 = 1, Xt−1 = xt−1 , At = 2} T2 (n)]
= ∆E [I {A1 = 1, X1 = x1 , . . . , At−1 = 1, Xt−1 = xt−1 , At = 2} (n − t + 1)]
= ∆P (A1 = 1, X1 = x1 , . . . , At−1 = 1, Xt−1 = xt−1 , At = 2) (n − t + 1)
t
Y
= ∆(n − t + 1)ρ pxs (1 − p)1−xs
s=1
≥ c(n − t + 1) ,
Qt−1 xs
where c = ∆ρ s=1 p (1 − p)
1−xs > 0. It follows that lim inf
n→∞ Rn /n ≥ c > 0.
5.10
1
log P (µ̂n ≥ ε) ≤ −(λε − MX (λ)) .
n
Since this holds for any λ ≥ 0, taking the supremum over λ ∈ R gives the desired inequality
(allowing λ < 0 potentially makes the resulting inequality loser).
14
λε − log cosh(λ). We have dλ d
log cosh(λ) = (eλ − e−λ )/(eλ + e−λ ) = tanh(λ) ∈ [−1, 1].
Hence, supλ f (λ) = +∞ when |ε| > 1. In the other case, we get that the maximum is
∗ (ε) = f (tanh−1 (ε)) = tanh−1 (ε)ε − log cosh(tanh−1 (ε)). Using tanh−1 (ε) = 1 log( 1+ε )
ψX 2 1−ε
we find that etanh = ( 1+ε
1−ε )
1/2 and e− tanh = ( 1−ε
1+ε )
1/2 , hence cosh(tanh−1 (ε)) =
−1 −1
(ε) (ε)
2 (( 1−ε ) +( 1−ε
1+ε )
1/2 ) = 1 ( (1+ε)+(1−ε) ) = . Therefore, ψX
∗ (ε) = log( 1+ε
1−ε )+ 2 log(1−ε ) =
1 1+ε 1/2 √ 1 ε 1 2
2 (1−ε2 )1/2 1−ε2 2
1+ε
2 log(1 + ε) + 1−ε
2 log(1 − ε).
(c) We have ψX (λ) = λ(p + ε) − log(1 − p + peλ ). The maximiser of this is λ∗ = log( (1−p)(p+ε)
p(1−(p+ε)) )
provided that p + ε < 1. Plugging in this value, after some algebra, gives the desired result. The
result also extends to p + ε = 1: In this case ψX is increasing and limλ→∞ λ − log(1 − p + peλ ) =
limλ→∞ λ − log(peλ ) = log(1/p) = d(1, p). For ε > 0 so that p + ε > 1, ψX ∗ (ε) = +∞ because
R R
(d) Set σ = 1 for simplicity. We have MX (λ) = √12π exp(−(x2 − 2λx)/2) dx = √12π exp(−(x −
λ)2 /2) exp(λ2 /2) dx = exp(λ2 /2). Hence, f (λ) = λε − log MX (λ) = λε − λ2 /2 and
supλ f (λ) = f (2ε) = ε2 /2.
p
(e) We need to calculate limn→∞ n1 log(1 − Φ(ε n/σ 2 )). By Eq. (5.3) we have 1 − Φ(x) ≤
p √ √
1/(2πx2 ) exp(−x2 /2). Further, by Eq. (13.4), 1 − Φ(x) ≥ exp(−x2 /2)/( π(x/ 2 +
p p
x2 /2 + 2)). Taking logarithm, plugging in x = ε n/σ 2 , dividing by n and taking n → ∞
gives
1 q
lim log(1 − Φ(ε n/σ 2 )) = ε2 /(2σ 2 ) .
n→∞ n
When X is a Rademacher random variable, ψX ∗ (ε) = 1+ε log(1 + ε) + 1−ε log(1 − ε) ≥ ε2 /2 for
2 2
any ε ∈ R, with equality holding only at ε = 0. Hence, the question-marked equality cannot
hold. (In fact, this is very easy to see also by noting that if X is supported on [−1, 1] then
µ̂n ∈ [−1, 1] almost surely and thus P (µ̂n > ε) = 0 for any ε ≥ 1, while the approximation from
the CLT gives ε2 /(2σ 2 ), a strictly larger value: The CLT can significantly overestimate tail
probabilities. What goes wrong with the careless application of the (strong form) of the CLT
is that limn→∞ supx |fn (x) − f (x)| = 0 does not imply | log fn (xn ) − log f (xn )| = o(n) for all
choices of {xn }. For example, one can take f (x) = exp(−x), fn (x) = exp(−x(1 + 1/n)) so that
log fn (x) − log f (x) = −x/n. Then, | log fn (n2 ) − log f (n2 )| = n 6= o(n). The same problem
happens in the specific case that was investigated.
p
5.12 Part (d) The plots of p 7→ Q(p) and p 7→ p(1 − p) are shown below:
15
0.5
Q(p)
p
p 7→ p(1 − p)
0.25
0
0 0.25 0.5 0.75 1
p
p
As can be seen, p(1 − p) ≤ Q(p) for p ∈ [0, 1].
Part (e): Consider 0 ≤ λ < 4 and λ ≥ 4 separately. In the latter case use λ2 ≥ 4λ. For the
former case consider the extremes p = 1 and p = 1/2 and then use convexity. The general conclusion
is that the subgaussianity constant may be misleadingly large when it comes to studying tails of
distributions: Tail bounds (for the upper tail) only need bounds on the MGF for nonnegative values
of λ!
5.13
Pn Pn
(a) Using linearity of expectation E[p̂n ] = E[ t=1 Xt /n] = t=1 E[Xt ]/n = p. Similarly,
V[p̂n ] = p(1 − p)/n.
(c) This is an empirical question, the solution to which we omit. You should do this calculation
directly using the binomial distribution.
(d.i) Let d(p, q) = p log(p/q) + (1 − p) log((1 − p)/(1 − q)) be the relative entropy between Bernoulli
distributions with means p and q. By large deviation theory (see Exercise 5.10),
where dBer = d(p + ∆, p) and dGauss = ∆2 /(2p(1 − p)) and (εn )∞ n=1 and (ξn )n=1 satisfy
∞
16
where the o(1) term vanishes as δ tends to zero (see below for the precise argument). Therefore
when ∆ = p = 1/10,
It remains to see the validity of Eq. (5.2). This follows from an elementary but somewhat tedious
argument. The precise claim is as follows: Let (pn ) be a sequence taking values in [0, 1], n(δ) =
min{n ≥ 1 : pn ≤ δ} such that n(δ) → ∞ as δ → 0 and log(1/pn ) = n(d + o(1)). We claim that
from these it follows that n(δ) = (1/d + o(1)) log(1/δ) as δ → 0. To show this it suffices to prove
that for any ε > 0, for any δ > 0 small enough, n(δ) ∈ [1/(d + ε) log(1/δ), 1/(d − ε) log(1/δ) + 1].
Fix ε > 0. Then, by our assumption on log(1/pn ) there exist some n0 > 0 such that for any
n ≥ n0 , log(1/pn ) ∈ [n(d − ε), n(d + ε)]. Further, by our assumption on n(δ), there exists δ0 > 0
such that for any δ < δ0 , n(δ) − 1 ≥ n0 . Take some δ < δ0 and let n0 = n(δ). By definition,
pn0 ≤ δ < pn0 −1 and hence log(1/pn0 ) ≥ log(1/δ) > log(1/pn0 −1 ). Since n0 ≥ n0 − 1 ≥ n0 , we
also have that (n0 − 1)(d − ε) ≤ log(1/pn0 −1 ) < log(1/δ) ≤ log(1/pn0 ) ≤ n0 (d + ε), from which it
follows that log(1/δ)
d+ε ≤ n < d−ε + 1, finishing the proof.
0 log(1/δ)
(d.ii) The central limit theorem only shows Eq. (5.1). In particular, you cannot choose x to depend
on n. A second try is to use Berry-Esseen (Exercise 5.5) which warrants that
√
|P (p̂n − p ≥ ∆) − P (Zn ≥ ∆) | = O(1/ n) .
The problem is that this provides very little information in the regime where ∆ is fixed and
n tends to infinity where both probabilities tend to zero exponentially fast and the error term
washes away the comparison. In particular, for the inversion process to work, one needs nontrivial
lower and upper bounds on P (p̂n − p ≥ ∆) and the central limit theorem only asserts that this
√
probability is in the range of [0, O(1/ n)] (irrespective of the value of p and ∆), which does not
lead to nontrivial bounds on nBer (δ; p, ∆).
To summarise, the study of ni (δ, p, ∆) as δ tends to zero is a question about the large deviation
regime, where the central limit theorem and Berry-Esseen do not provide meaningful information.
To make use of the central limit theorem and Berry-Esseen, one needs to choose the deviation
√ √
level x so that the probability P ( n(p̂n − p) ≥ x) is of a larger magnitude than O(1/ n), which
is the range of ‘small deviations’.
As an aside, comparisons between normal and binomial distributions have been studied extensively.
If you are interested, the most relevant lower bound for this discussion is Slud’s inequality [Slud,
1977].
5.14
x4
have h0 (x) = xex − ex + 1 and h00 (x) = xex . Hence, h0 is increasing on (0, ∞) and decreasing on
(−∞, 0). Since h(0) = 0, so sign(h(x)) = sign(x) and thus g 0 (x) > 0 for x 6= 0.
17
(b) We have exp(x) = 1 + x + g(x)x2 . Therefore, E [exp(X)] = 1 + E g(X)X 2 ≤ 1 + E g(b)X 2 =
1 + g(b)V[X], where the last inequality used that g is increasing.
Differentiation shows that the exponent is minimised by λ = 1/b log(1 + α) where recall that
α = bε/v. Plugging in this value we get (5.10) and then using the bound in Part ((c)) we get
(5.11).
(e) We need to solve δ = exp − 2v ε2
for ε ≥ 0. Algebra gives that this is quadratic equation in
(1+ 3v
bε
)
ε: Using the abbreviation L = log(1/δ), this quadratic equation is ε2 − 23 bLε − 2vL. The positive
q
root is ε = 1
2
2
3 bL +
( 23 bL)2 + 8vL . Hence, with probability 1 − δ, S ≤ ε. Further upper
p p p √
bounding ε using |a| + |b| ≤ |a| + |b| gives that with probability 1 − δ, Sn ≤ 32 bL + 2vL,
which is the desired inequality.
(f) We start by modifying the Cramér-Chernoff method. In particular, consider the problem
of bounding the probability of event A where for a random vector X ∈ Rd and a fixed
vector x ∈h R , A takes
d
i the
h form i A = {X ≥ x}. Notice that for f : R → [0, ∞),
d
P (A) ≤ E I {A} ef (X) ≤ E ef (X) . We use this with X = (S, −V ) and x = (ε, v) so that A =
{S ≥ ε, V ≤ v} = {X ≥ (ε, v)}. Then, for λ > 0 letting h(S, V ) = λS − g(λb)λ2 V wehhave on i
A that h(S, V ) ≥ h(ε, v) and so f (S, V ) = h(S, V ) − h(ε, v) ≥ 0 and P (A) ≤ e−h(ε,v) E eh(S,V ) .
We have eh(S,V ) = U1 . . . Un where Ut = eλZt −λ g(λb)Et−1 [Zt ] and Zt = Xt − µt = Xt − Et−1 [Xt ].
2 2
(1 + g(λb)Es−1 [(λZs )2 ])
2 g(λb)E 2
≤ e−λ s−1 [Zs ]
= e−λ = 1,
2 g(λb)E 2 2 g(λb)E 2
s−1 [Zs ]
eλ s−1 [Zs ]
and thus
" n #
h i Y
E eg(Sn ,Vn ) = E Ut = E [U1 . . . Un−1 En−1 [Un ]]
t=1
≤ E [U1 . . . Un−1 ] ≤ · · · ≤ 1 .
18
Thus, P (A) ≤ e−h(ε,v) . Notice that the expression on the right-hand side is the same as in
Eq. (5.3), finishing the proof.
All that remains is to show that the term inside the expectation is a supermartingale. Using the
fact that exp(x) ≤ 1 + x + x2 for x ≤ 1 and 1 + x ≤ exp(x) for all x ∈ R we have
1X n
hλ, p − p̂i = hλ, p − eXt i .
n t=1
19
Now, |hλ, p − eXt i| ≤ kλk∞ kp − eXt k1 ≤ 2 and E[hλ, p − eXt i] = 0. Then, by Hoeffding’s bound,
s !
2 1
P hλ, p − p̂i ≥ log ≤ δ.
n δ
log(n) λσ 2
E[Z] ≤ + .
λ 2
p p
Choosing λ = σ1 2 log(n) shows that E[Z] ≤ 2σ 2 log(n). For Part (b), a union bound in
combination with Theorem 5.3 suffices.
5.19 Let P be the set of measures on ([0, 1], B([0, 1])) and for q ∈ P let µq be its mean. The
theorem will be established by induction over n. The claim is immediate when x > n or n = 1.
Assume that n ≥ 2 and x ∈ (1, n] and the theorem holds for n − 1. Then
n
! " n
!#
X X
P E[Xt | Ft−1 ] =E P E[Xt | Ft−1 ] ≥ x − E[X1 | F0 ] F0
t=1 t=2
x − E[X1 | F0 ]
≤ E fn−1
1 − X1
x − E[X1 | F0 ]
= E E fn−1 F0
1 − X1
Z 1
x − µq
≤ sup fn−1 dq(y) ,
q∈P 0 1−y
Pn
where the first inequality follows from the inductive hypothesis and the fact that t=2 Xt /(1−X1 ) ≤1
almost surely. The result is completed by proving that for all q ∈ P,
Z 1
. x − µq
Fn (q) = fn−1 dq(y) ≤ fn (x) . (5.4)
0 1−y
Let q ∈ P have mean µ and y0 = max(0, 1 − x + µ). In Lemma 5.1 below it is shown that
x−µ 1−y x−µ
fn−1 ≤ fn−1 ,
1−y 1 − y0 1 − y0
20
which after integrating implies that
1−µ x−µ
Fn (q) ≤ fn−1 .
1 − y0 1 − y0
Considering two cases. First, when y0 = 0 the display shows that Fn (q) ≤ (1−µ)fn−1 (x−µ). On the
other hand, if y0 > 0 then x − 1 < µ ≤ 1 and Fn (q) ≤ (1 − µ)/(x − µ) ≤ (1 − (x − 1))fn−1 (x − (x − 1)).
Combining the two cases we have
Lemma 5.1. Suppose that n ≥ 1 and u ∈ (0, n] and y0 = max(0, 1 − u). Then
u 1−y u
fn ≤ fn−1 for all y ∈ [0, 1] .
1−y 1 − y0 1 − y0
Proof. The lemma is equivalent to the claim that the line connecting (y0 , fn (u/(1 − y0 ))) and (1, 0)
lies above fn (u/(1 − y)) for all y ∈ [0, 1] (see figure below). This is immediate for n = 1 when
fn (u/(1 − y)) = I {y ≤ 1 − u}. For larger n basic calculus shows that fn (u/(1 − y)) is concave as a
function of y on [1 − u, 1 − u/n] and
∂
fn (u/(1 − y)) = −1/u .
∂y y=1−u
Since fn (1) = 1 this means that the line connecting (1 − u, 1) and (1, 0) lies above fn (u/(1 − y)).
This completes the proof when y0 = 1 − u. Otherwise y0 ∈ [1 − u, 1 − u/n] and the result follows by
concavity of fn (u/(1 − y)) on this interval.
fn (u/(1 − y))
0
0 y0 1
y
21
Chapter 6 The Explore-then-Commit Algorithm
√ √ √
6.2 If ∆ ≤ 1/ n then 2from
Rn ≤ √ n∆ we get Rn ≤ n. Now, if ∆ > 1/ n then from
Rn ≤ ∆ + ∆4
1 + log+ n∆ 4 ≤ ∆ + 4 n + maxx>0 x1 log+ (nx2 /4). A simple calculation shows
√ √
that maxx>0 x1 log+ (nx2 /4) = e−2 n. Putting things together we get that Rn ≤ ∆ + (4 + e−2 ) n
holds no matter the value of ∆ > 0.
6.3 Assume for simplicity that n is even and let ∆ = max{∆1 , ∆2 } and
4 1
m = min n/k, log .
∆ 2 δ
When 2m = n the pseudo regret is bounded by R̄n ≤ m∆. Now suppose that 2m < n. Then
!
m∆2
P (T2 (n) > m) ≤ P (µ̂2 (2m) − µ2 − µ̂1 (2m) + µ1 ≥ ∆) ≤ exp − ≤ δ.
4
(a) For the first part we need to show that Rn ≤ (∆ + C)n2/3 when m = f (n) for a suitable
chosen
function f and C > 0 is a universal constant. By Eq. (6.4), Rn ≤ m∆ + n∆ exp − 4 m∆2
≤
q
m∆ + n max∆>0 ∆ exp − m∆ = m∆ + 2n exp − 12 , where the equality follows because
2 1
4 m
p p
maxx x exp(−cx2 ) is at the value x∗ = 1/(2c) and
q is equal to 1/(2c) exp(−1/2) as a simple
p p √
calculation shows (so, ∆∗ = 4/(2m) = 2/m = 2/n2/3 = 2n−1/3 ). That Rn ≤ ∆ + Cn2/3
cannot hold follows because m∆/2 ≤ Rn . Hence, if Rn ≤ ∆ + Cn2/3 was also true, we would
get that for any ∆ > 0, m∆/2 ≤ ∆ + Cn2/3 holds. Dividing both sides by ∆ and letting
∆ → ∞, this would imply that m ≤ 2. However, if m ≤ 2 then Rn = Ω(n) on some instances:
In particular, there is a positive probability that the arm chosen after trying both arms at most
twice is the suboptimal arm.
p
(b) For any fixed m, µ̂i (2m) − µi is 1/m subgaussian. Hence defining G = {|µ̂i (2m) − µi | ≤
p
2 log(n/δ)/m, i = 1, 2, m = 1, 2, . . . , bn/2c}, using n ≥ 2bn/2c union bounds, we have that
p
P (G) ≥ 1 − δ. Introduce w(m) = 2 log(n/δ)/m. Let M = min{1 ≤ m ≤ bn/2c : |µ̂1 (2m) −
µ̂2 (2m)| > 2w(m)} (note that M = ∞ if the condition is never met). Then on G if M < +∞
and say 1 = argmaxi µ̂i (2M ) then µ1 ≥ µ̂1 (2M ) − w(m) > µ̂2 (2M ) + 2w(M ) − w(M ) ≥ µ2
where the first and last inequalities used that we are on G and the middle one used the stopping
condition and that we assumed that at stopping, arm one has the highest mean. Hence,
22
Rn = P (Gc ) n∆/2 + E [M I {G}] ∆/2 ≤ δn + E [M I {G}] ∆/2. We now show a bound on M on
G. To reduce clutter assume that µ1 > µ2 . Assume G holds and let m < M . Then, 2w(m) ≥
|µ̂1 (2m)− µ̂2 (2m)| ≥ µ̂1 (2m)− µ̂2 (2m) ≥ (µ1 −w(m))−(µ2 +w(m)) = ∆−2w(m). Reordering we
see that 4w(m) ≥ ∆, which, using the definition of w(m), is equivalent to m ≤ (4/∆)2 2 log(n/δ).
Hence, on G, M = 1+max{m : 2w(i) ≥ |µ̂1 (2i)−µ̂i (2i)|, i = 1, 2, . . . , m} ≤ 1+(4/∆)2 2 log(n/δ).
Plugging this in and setting δ = 1/n, we get Rn ≤ ∆ + 16 ∆ log(n).
p
(c) In addition to the said inequality, of course, Rn ≤ n∆ also holds. If ∆ ≤ log(n)/n, we thus
p p p p
have Rn ≤ n/ log(n) ≤ n log(n). If ∆ > log(n)/n, Rn ≤ ∆ + C n log(n). Combining
p
the inequalities, we have Rn ≤ ∆ + (C ∨ 1) n log(n).
p
(d) Change the definition of w(m) from Part (b) to w(m) = 2 log(n/(mδ))/m. Then,
Pn/2
P (Gc ) ≤ nδ m=1 m ≤ cδn for a suitable universal constant c > 0. We will choose
p
δ = 1/(cn2 ) so that P (Gc ) ≤ 1/n. Hence, w(m) = c0 log(n/m)/m with a suitable universal
constant c0 > 0. With the same reasoning as in Part (b), we find that M ≤ 1 + m∗
where m∗ = max{m ≥ 1 : m ≤ c0 log(n/m)/∆2 }. A case analysis then gives that
00
for a suitable universal constant c00 > 0. Finishing as in Part (b),
2 n))
m∗ ≤ c log(e∨(∆
∆2
c00 log(e∨(∆2 n))
Rn = P (Gc ) n∆/2 + E [M I {G}] ∆/2 ≤ δn + E [M I {G}] ∆/2 ≤ ∆ + ∆ .
6.6
(a) Let N0 = 0 and for ` > 1, let N` = min(N`−1 + n` , n), T` = {N`−1 + 1, . . . , N` }. The intervals
(T` )``=1
max
are non-overlapping and policy π is used with horizon n` on interval T` . Since ν is a
stochastic environment,
`X
max `X
max `X
max
Rn (π , ν) =
∗
R|T` | (π(n` ), ν) ≤ max Rt (π(n` ), ν) ≤ fn` (ν) , (6.1)
1≤t≤n`
`=1 `=1 `=1
where the first inequality uses that |T` | ≤ n` (in fact for ` < `max , |T` | = n` ), and the second
inequality uses (6.10).
P P
(b) We have `i=1 ni = `−1 i=0 2 = 2 − 1, hence `max = dlog2 (n + 1)e and 2
i ` `max ≤ 2(n + 1). By
23
(c) By Eq. (6.1) we have
`X
max X−1
`max
Rn (π ∗ , ν) ≤ g(ν) log(2`−1 ) = log(2)g(ν) `
`=1 `=0
(`max − 1)`max
= log(2)g(ν) ≤ Cg(ν) log2 (n + 1)
2
with some universal constant C > 0. Hence, the regret significantly worsens in this case. A
better choice n` = 22 . With this,
`−1
`X
max X−1
`max
Rn (π , ν) ≤ g(ν)
∗
log(2 2`−1
) = log(2)g(ν) 2` ≤ log(2)g(ν)2`max
`=1 `=0
≤ Cg(ν) log(n) ,
(d) The power/advantage of the doubling trick is its generality. It shows, that, as long as we do not
mind losing a constant factor, under mild conditions, adapting to an unknown horizon is not
challenging. The first disadvantage is that π ∗ does lose a constant factor when perhaps there is
no need to lose a constant factor. The second disadvantage is that oftentimes one can design
algorithms such that the immediate expected regret at time t decreases as the time t increases.
This is a highly desirable property and can often be met, as it was explained in the chapter.
Yet, by applying the doubling trick this monotone decrease of the immediate expected regret
will be lost, which will raise questions in the user.
6.8
(a) Using the definition of the algorithm and concentration for subgaussian random variables:
P (1 ∈
/ A`+1 , 1 ∈ A` ) ≤ P 1 ∈ A` , exists i ∈ A` \ {1} : µ̂i,` ≥ µ̂1,` + 2−`
= P 1 ∈ A` , exists i ∈ A` \ {1} : µ̂i,` − µ̂1,` ≥ 2−`
!
m` 2−2`
≤ k exp − ,
4
where in the last final inequality we used (c) of Lemma 5.4 and Theorem 5.3.
24
(c) Let δ ∈ (0, 1) be some constant to be chosen later and
m` = 24+2` log(`/δ) .
/ A`i )
P (i ∈ A`i +1 ) ≤ P (i ∈ A`i +1 , i ∈ A`i , 1 ∈ A`i ) + P (1 ∈
!
m` (∆i − 2−`i )2 kπ 2 δ
≤ exp − +
4 6
!
m` 2−2`i kπ 2 δ
≤ exp − +
16 6
!
kπ 2
≤δ 1+ .
6
i ∧n
`X
E[Ti (n)] ≤ nP (i ∈ A`i +1 ) + m`
`=1
i ∧n
`X
n
≤1+ 24+2` log
`=1
δ
≤ 1 + C2 log(nk)
2`i
16C
≤ 1 + 2 log(nk) ,
∆i
where C > 1 is a suitably large universal constant derived by naively bounding the logarithmic
term and the geometric series. The result follows from upper bounding log(nk) ≤ 2 log(n) which
25
follows from k ≤ n and the standard regret decomposition (Lemma 4.5).
22` 22`
P (1 ∈
/ A` ) ≤ and P (i ∈ A`i +1 ) ≤ .
n n
From this it follows that
16
X i `
E[Ti (n)] ≤ + C 22`
log max e, kn2 −2`
.
∆2i `=1
Bounding the sum by an integral and some algebraic gymnastics eventually leads to the desired
result. Note, you have to justify the k in the logarithm is a lower order term. Argue by splitting
p
the suboptimal arms into those with ∆i ≤ k/n and the rest.
p
(f) Using the analysis in Part (e) and letting ∆ = k/n. Then
k
X
Rn = ∆i E[Ti (n)]
i=1
X C0
≤ n∆ + log max e, nk∆2i
i:∆ ≥∆
∆i
i
q
≤C 00
nk log(k) ,
where the constant C 0 is derived from C in Part (e) and the last inequality follows by considering
the monotonicity properties of x 7→ 1/x log max(e, nx2 ).
7.1
26
(a) We have
s " ( )#
2 log(1/δ) X ∞ Xn q
P µ̂ − µ ≥ = E I {T = n} I (Xt − µ) ≥ 2n log(1/δ)
T n=1 t=1
" " ( n ) ##
∞
X X q
= E E I {T = n} I (Xt − µ) ≥ 2n log(1/δ) T
n=1 t=1
" " ( n ) ##
X∞ X q
= E I {T = n} E I (Xt − µ) ≥ 2n log(1/δ) T
n=1 t=1
X∞
≤ E [I {T = n} δ]
n=1
= δ.
n P p o
(b) Let T = min n : nt=1 (Xt − µ) ≥ 2n log(1/δ) . By the law of the iterated logarithm, T < ∞
almost surely. The result follows.
8.1 Following the hint, F ≤ exp(−a)/(1 − exp(−a)) where a = ε2 /2. Reordering exp(−a)/(1 −
exp(−a)) ≤ 1/a gives 1 + a ≤ exp(a) which is well known (and easy to prove). Then
Z ∞ Z ∞
n
X 1 20
X 1 dt 20
X 1 dt
≤ + ≤ +
t=1
f (t) t=1
f (t) 20 f (t) t=1
f (t) 20 t log(t)2
20
X 1 1 5
= + ≤ .
t=1
f (t) log(20) 2
27
Chapter 9 The Upper Confidence Bound Algorithm: Minimax Optimality
9.1 Clearly (Mt ) is F-adapted. Then by Jensen’s inequality and convexity of the exponential
function,
t
!
X
E[Mt | Ft−1 ] = exp λ Xs E[exp(λXt ) | Ft−1 ]
s=1
t
!
X
≥ exp λ Xs exp(λE[Xt | Ft−1 ])
s=1
t−1
!
X
= exp λ Xs a.s.
s=1
Hence Mt is a F-submartingale.
9.4
(i) Consider the policy that plays each arm once and subsequently chooses
s
2 n
At = argmaxi∈[k] µ̂i (t − 1) + log h .
Ti (t − 1) Ti (t − 1)k
(j) The proof more-or-less follows the proof of Theorem 9.1, but uses the new concentration
inequalities. Define random variable
Ti (n) ≤ κi + nI {∆ ≥ ε} .
The expectation of κi is bounded using the same technique as in Theorem 9.1, but the tighter
28
confidence bound leads to an improved bound.
s
1 Xn
2 n∆2
E[κi ] ≤ 2 + P µ̂is + log h ≤ µ 1 − ε
∆i s=1
s k
2 log(n)
= + o(log(n)) .
(∆i − ε)2
2 log(n)
E[Ti (n)] ≤ + o(log(n)) .
(∆i − ε)2
The result follows from the fundamental regret decomposition lemma (Lemma 4.5) and by
taking the limit as n tends to infinity and ε tends to zero at appropriate rates.
Clearly, g 0 (0) = 0. Further, since q(1 − q) ≤ 1/4 for any q ∈ [0, 1], g 0 (x) ≥ 0 for x > 0 and g 0 (x) ≤ 0
for x < 0. Hence, g is increasing for positive x and decreasing for negative x. Thus, x = 0 is a
minimiser of g. Here, g(0) = 0, and so g(x) ≥ 0 over [−p, 1 − p].
and dµ 2 g(λ, µ) = − (1+µ(eλ −1))2 ≤ 0, showing that g(λ, ·) is concave as suggested in the hint. Now,
2 λ
(e −1)2
d
Pp
let Sp = t=1 (Xt − µt ), p ∈ [n] and let S0 = 0. Then for p ∈ [n],
and, by Note 2, E [exp(λ(Xp − µp )) | Fp−1 ] ≤ exp(g(λ, µp )). Hence, using that µn is not random,
Chaining this inequalities, using that S0 = 0 together with that g(λ, ·) is concave, we get
n
!!n
X
E [exp(λSn )] ≤ exp 1
n g(λ, µt ) ≤ exp (ng(λ, µ)) .
t=1
29
Thus,
From this point, repeat the proof of Lemma 10.3 word by word.
10.4 When the exponential family is in canonical form the mean of Pθ is µ(θ) = Eθ [S] = A0 (θ).
Since A is strictly convex by the assumption that M is nonsingular it follows that µ(θ) is strictly
increasing and hence invertible. Let µsup = supθ∈Θ µ(θ) and µinf = inf θ∈Θ and define
sup Θ if x ≥ µsup
θ̂(x) = inf Θ if x ≤ µinf
µ−1 (x) otherwise .
The function θ̂ is the bridge between the empirical mean and the maximum likelihood estimator of θ.
P
Precisely, let X1 , . . . , Xn be independent and identically distribution from Pθ and µ̂n = n1 nt=1 Xt .
Then provided that θ̂n = θ̂(µ̂n ) ∈ Θ, then θ̂n is the maximum likelihood estimator of θ,
n
Y dPθ
θ̂n = argmaxθ∈Θ (Xt ) .
t=1
dh
There is an irritating edge case that µ̂n does not lie in the range of µ : Θ → R. When this occurs
there is no maximum likelihood estimator.
Part I: Algorithm
¯ y) = I {x ≥ y} limz↑x d(z, y). The algorithm
Then define d(x, y) = I {x ≤ y} limz↓x d(z, y) and d(x,
chooses At = t for the first k rounds and subsequently At = argmaxi Ui (t) where
¯ θ̂i (t − 1), θ̃) ≤ log(f (Ti (t − 1)))
Ui (t) = sup θ̃ ∈ Θ : d( .
Ti (t − 1)
where the second inequality follows from Part (e) of Exercise 34.5. Similarly
¯ θ̂t , θ) ≥ ε ≤ exp(−tε) .
P d( (10.2)
30
Define random variable τ by
log(f (t))
τ = min t : d(θ̂s , θ − ε) < for all s ∈ [n] .
s
In order to bound the expectation of τ we need a connection between d(θ̂s , θ − ε) and d(θ̂s , θ). Let
x ≤ y − ε and g(z) = d(x, z). Then
Z y
g(y) = g(y − ε) + g 0 (z)dz
y−ε
Z y
= g(y − ε) + (z − x)A00 (z)dz
y−ε
Z y
≥ g(y − ε) + inf A (z)
00
(z − x)dz
z∈[y−ε,y] y−ε
1
= g(y − ε) + inf A00 (z)ε(2y − 2x − ε)
2 z∈[y−ε,y]
ε2 inf z∈[y−ε,y] A00 (z)
≥ g(y − ε) + .
2
Note that inf z∈[y−ε,y] A00 (z) > 0 is guaranteed because A00 is continuous and [y − ε, y] is compact
and because M was assumed to be nonsingular. Using this, the expectation of τ is bounded by
n
X
E[τ ] = P (τ ≥ t)
t=1
Xn Xn
log(f (t))
≤ P d(θ̂s , θ − ε) ≥
t=1 s=1
s
!
Xn X n
ε2 inf z∈[θ−ε,θ] A00 (z) log(f (t))
≤ P d(θ̂s , θ) ≥ +
t=1 s=1
2 s
n X
X n
exp(−s inf z∈[θ−ε,θ] A00 (z)ε2 /2)
≤
t=1 s=1
f (t)
= O(1) , (10.3)
where the last inequality follows from Eq. (10.1) and the final inequality is the same calculation as
in the proof of Lemma 10.7. Next let
where we used the fact that M is non-singular to ensure strict positivity of the divergences.
31
Part III: Bounding E[Ti (n)]
For each arm i let θ̂is = θ̂(µ̂is ). Now fix a suboptimal arm i and let ε < (θ1 − θi )/2 and
log(f (t))
τ = min t : d(θ̂s , θ − ε) < for all s ∈ [n] .
s
Then define
Then by Eq. (10.3) and Eq. (10.4), E[τ ] = O(1) and E[κ] = O(1). Suppose that t ≥ τ and
Ti (t − 1) ≥ κ and At = i. Then Ui (t) ≥ U1 (t) ≥ θ1 − ε and hence
log(f (n))
d(θi + ε, θ1 − ε) < .
Ti (t − 1)
E[Ti (n)] 1
lim sup ≤ .
n→∞ log(n) d(θi + ε, θ1 − ε)
Since the above holds for all sufficiently small ε > 0 and the divergence d is continuous it follows
that
E[Ti (n)] 1
lim sup ≤
n→∞ log(n) d(θi , θ1 )
for all suboptimal arms i. The result follows from the fundamental regret decomposition lemma
(Lemma 4.5).
10.5 For simplicity we assume the first arm is uniquely optimal. Define
¯ y) = I {x ≥ y} lim d(z, y) ,
d(x, d(x, y) = I {x ≤ y} lim d(x, y) .
z↑x z↓x
R
Let µ(θ) = R xdPθ (x) and s̄(θ) = Eθ [S] = A0 (θ) and S = {s̄(θ) : θ ∈ Θ}. Define θ̂ : R → cl(Θ) by
s−1 (x) , if x ∈ S ;
θ̂(x) = sup Θ , if x ≥ sup S ;
inf Θ , if x ≤ inf S .
32
The algorithm is a generalisation of KL-UCB. Let
1 X t
t̂i (t) = I {At = i} S(Xt ) and θ̂i (t) = θ̂(t̂i (t)) ,
Ti (t) s=1
which is the empirical estimator of the sufficient statistic. Like UCB, the algorithm plays At = t for
t ∈ [k] and subsequently At = argmaxi Ui (t), where
log(f (Ti (t − 1))f (t))
Ui (t) = sup µ(θ) : d(θ̂i (t − 1), θ) ≤
Ti (t − 1)
and ties in the argmax are broken by choosing the arm with the largest number of plays.
where the final equality follows from Eqs. (10.5) and (10.6) and the same calculation as in the proof
of Lemma 10.7. Next let
33
The expectation of κ is easily bounded Eq. (10.5) and Eq. (10.6):
∞
n X
X
E[κ] ≤ (exp(−ud(θ + ε, θ) + exp(−ud(θ − ε, θ))) = O(1) , (10.8)
s=1 u=s
where we used the fact that M is non-singular to ensure strict positivity of the divergences.
Part II: Bounding E[Ti (n)] Choose ε > 0 sufficiently small that for all suboptimal arms i,
and define
Let θ̂is be the empirical estimate of θi based on the first s samples of arm i, which means that
θ̂i (t) = θ̂iTi (t) . Let τ be the smallest t such that
Now suppose that t ≥ τ and Ti (t − 1) ≥ κi and At = i. Then Ui (t) ≥ U1 (t) ≥ µ∗ , which implies that
Then let
Λ = max t : T1 (t − 1) ≤ max Ti (t − 1) ,
i>1
which by Eq. (10.9) and Eq. (10.7) and Eq. (10.8) satisfies E[Λ] = O(1). Suppose now that t ≥ Λ.
Then T1 (t − 1) > maxi>1 Ti (t − 1) and by the definition of the algorithm At = i implies that
34
Ui (t) > µ∗ and so
Hence
log(f (Ti (n))f (t))
Ti (n) ≤ 1 + Λ + .
di,inf (ε)
E[Ti (n)] 1
lim sup ≤ .
n→∞ log(n) di,inf (ε)
The result because limε→0 di,inf (ε) = di,inf and by the fundamental regret decomposition (Lemma 4.5).
11.6 The first two parts are purely algebraic and are omitted.
35
E[Gu ] = 1/qu (α). Then
s
!
X
P (T2 (n/2) ≥ s + 1) = P Gu ≤ n/2
u=0
s−1
!
X
≥P Gu ≤ n/4 P (Gs ≤ n/4)
u=0
!! n/4 !
s−1
X 1
= 1−P Gu > n/4 1− 1−
u=0
8n
1
≥ (1 − exp(−1/32))
2
1
≥ .
65
t−1
X
L̂t2 = Ŷu2 ≥ 8αn ≥ 2n .
u=1
Using induction and the fact that Pt1 ≥ 1/2 as long as L̂t1 ≤ L̂t2 it follows that on the event
E = {Tt (n/2) ≥ s + 1} that Pt1 ≥ 1/2 for all t. Therefore
t−1
!
X
Pt2 ≤ exp −η (Ŷs2 − Ŷs1 )
s=1
t−1
!!
X
≤ exp η Ŷs1 − 2n
s=1
≤ exp (−nη) ,
The result follows because on the event {At = 1 for all t > n/2}, the regret satisfies
n αn n
R̂n ≥ − ≥ .
2 2 4
(e) Markov’s inequality need not hold for negative random variables and R̂n can be negative. For
36
this problem it even holds that E[R̂n ] < 0.
(f) Since for n = 104 , the probability of seeing a large regret is about 1/65 by the answer to the
previous part, Exp3 was run m = 500 times, which gives us a good margin to encounter large
regrets. The results are shown in Fig. 11.4. As can be seen, as predicted by the theory a
significant fraction of the cases, the regret is above n/4 = 2500. As seen from the figure, the
mean regret is negative.
11.7 First, note that if G = − log(− log(U )) with U uniform on [0, 1] then
P (G ≤ g) = e− exp(−g) .
= E Ui j6=i ai
1
= P
1+
aj
j6=i ai
ai
= Pk .
j=1 aj
12.1
2
(a) We have µt = Et−1 [Ŷti ] = Pti +γ . Further, Vt−1 [Ŷti ](= Et−1 [(Ŷti − µt )2 ]) =
Pti yti Pti (1−Pti )yti
(Pti +γ)2
≤
(Pti +γ)yti
(Pti +γ)2
= Pti +γ .
yti
For any η > 0 such that η(Ŷti − µt ) = η (AtiP−P ti )yti
ti +γ
≤ 1 almost surely for all
t ∈ [n],
X Pti yti X yti 1
L̂ni − ≤η + log(1/δ) .
t
Pti + γ t
Pti + γ η
Choosing η = γ, the constraints η(Ŷti − µt ) ≤ 1 are satisfied for t ∈ [n]. Plugging in this value
and reordering gives the desired inequality.
P P Pti yti P P P
(b) We have µt = Et−1 [ i Ŷti ] = i Pti +γ . Further, Vt−1 [ i Ŷti ] ≤ Et−1 [( i Ŷti ) ] =
2
i Et−1 [Ŷti ] =
2
P 2 P P P
i Pti +γ . To satisfy the constraint on η we calculate η( − µt ) ≤ η =
Pti yti yti
i (Pti +γ)2 ≤ i Ŷti i Ŷti
37
P Ati yti η P
i Ati = γ . Hence, any η ≤ γ is suitable. Choosing η = γ, we get
η
η i Pti +γ ≤ γ
P
Step 1: Decomposition Using that a Pta = 1 and some algebra we get
n X
X k
Pta (Zta − ZtA∗ )
t=1 a=1
n X
X k n X
X k n
X
= Pta (Z̃ta − Z̃ tA∗ )+ Pta (Zta − Z̃ta ) + (Z̃tA∗ − ZtA∗ ) .
t=1 a=1 t=1 a=1 t=1
| {z } | {z } | {z }
(A) (B) (C)
Step 2: Bounding (A) By assumption (c) we have βta ≥ 0, which by assumption (a) means
that η Z̃ta ≤ η Ẑta ≤ η|Ẑta | ≤ 1 for all a. A straightforward modification of the analysis in the last
chapter shows that (A) is bounded by
log(k) Xn X k
(A) ≤ +η 2
Pta Z̃ta
η t=1 a=1
log(k) Xn X k Xn X k
= +η Pta (Ẑta
2
+ βta
2
) − 2η Pta Ẑta βta
η t=1 a=1 t=1 a=1
log(k) Xn X k Xn X k
≤ +η Pta Ẑta + 3
2
Pta βta ,
η t=1 a=1 t=1 a=1
where in the last two line we used the assumptions that ηβta ≤ 1 and η|Ẑta | ≤ 1.
n X
X k n X
X k
(B) = Pta (Zta − Z̃ta ) = Pta (Zta − Ẑta + βta ) .
t=1 a=1 t=1 a=1
We prepare to use Exercise 5.15. By assumptions (c) and (d) respectively we have ηEt−1 [Ẑta
2]≤β
ta
and Et−1 [Ẑta ] = Zta . By Jensen’s inequality,
!2
k
X k
X k
X
ηEt−1 Pta (Zta − Ẑta ) ≤η Pta Et−1 [Ẑta
2
]≤ Pta βta .
a=1 a=1 a=1
38
Therefore by Exercise 5.15, with probability at least 1 − δ
n X
X k
log(1/δ)
(B) ≤ 2 Pta βta + .
t=1 a=1
η
Because A∗ is random we cannot directly apply Exercise 5.15, but need a union bound over all actions.
Let a be fixed. Then by Exercise 5.15 and the assumption that η|Ẑta | ≤ 1 and Et−1 [Ẑta ] = Zta and
ηEt−1 [Ẑta
2 ] ≤ β , with probability at least 1 − δ.
ta
n
X log(1/δ)
Ẑta − Zta − βta ≤ .
t=1
η
log(1/δ)
(C) ≤ .
η
Step 5: Putting it together Combining the bounds on (A), (B) and (C) in the last three
steps with the decomposition in the first step shows that with probability at least 1 − (k + 1)δ,
3 log(1/δ) Xn X k Xn X k
Rn ≤ +η Pta Ẑta + 5
2
Pta βta .
η t=1 a=1 t=1 a=1
13.2 Notice that a policy has zero regret on all bandits for which the first arm is optimal if and
only if Pνπ (At = 1) = 1 for all t ∈ [n]. Hence the policy that always plays the first arm is optimal.
14.4 Let µ = P − Q, which is a signed measure on (Ω, F). By the Hahn decomposition theorem
there exist disjoint sets A, B ⊂ Ω such that A ∪ B = Ω and µ(E) ≥ 0 for all measurable E ⊆ A and
39
µ(E) ≤ 0 for all measurable E ⊆ B. Then
Z Z Z Z
XdP − XdQ = Xdµ + Xdµ
Ω Ω A B
≤ bµ(A) + aµ(B)
= (b − a)µ(A)
≤ (b − a)δ(P, Q) ,
where we used the fact that µ(B) = P (B) − Q(B) = Q(A) − P (A) = −µ(A).
where the supremum is taken over all finite partitions of R with rational-valued end-points. By the
definition of a probability kernel it follows that the quantity inside the supremum on the right-hand
side is F-measurable as a function of ω for any finite partition. Since the supremum is over a
countable set, the whole right-hand side is F-measurable as required.
14.11 First assume that P Q. Then let P t and Qt be the restrictions of P and Q to (Rt , B(Rt ))
given by
You should check that P Q implies that P t Qt and hence there exists a Radon-Nikodym
derivative dP t /dQt . Define
,
dP t dP t−1
F (xt | x1 , . . . , xt−1 ) = (x1 , . . . , xt ) (x1 , . . . , xt−1 ) ,
dQt dQt−1
which is well defined for all x1 , . . . , xt−1 ∈ Rt−1 except for a set of P t−1 -measure zero. Then for any
A ∈ B(Rt−1 ) and B ∈ B(R),
Z Z Z Z
dP t
F (xt | ω)Qt (dxt | ω)P t−1 (dω) = t
(xt , ω)Qt (dxt | ωQt−1 (dω)
A B A B dQ
Z
dP t t
= t
dQ
A×B dQ
= P (A × B) .
A monotone class argument shows that F (xt | ω) is P t−1 -almost surely the Radon-Nikodym derivative
40
of Pt (· | ω) with respect to Qt (· | ω). Hence
dP
D(P, Q) = EP log
dQ
n
X
= EP [log (F (Xt | X1 , . . . , Xt−1 ))]
t=1
Xn
= EP [D(Pt (· | X1 , . . . , Xt−1 ), Qt (· | X1 , . . . , Xt−1 ))] .
t=1
Now suppose that P 6 Q. Then by definition D(P, Q) = ∞. We need to show this implies
there exists a t ∈ [n] such that D(Pt (· | ω), Qt (· | ω)) = ∞ with nonzero probability. Proving the
contrapositive, let
Q
Hence nt=1 F (xt | x1 , . . . , xt−1 ) behaves like the Radon-Nikodym derivative of P with respect to
Q on rectangles. Another monotone class argument extends this to all measurable sets and the
existence of dP/dQ guarantees that P Q.
15.1 Abbreviate θ̂ = θ̂(X1 , . . . , Xn ) and let R(P ) = EP [d(θ̂, P )]. By the triangle inequality
Let E = {d(θ̂, P0 ) ≤ ∆/2}. On E c it holds that d(θ̂, P0 ) ≥ ∆/2 and on E it holds that
d(θ̂, P1 ) ≥ ∆ − d(θ̂, P0 ) ≥ ∆/2.
∆ ∆
R(P0 ) + R(P1 ) ≥ (P0 (E c ) + P1 (E)) ≥ exp(− D(P0 , P1 )) .
2 4
41
The result follows because max{a, b} ≥ (a + b)/2.
16.2
(a) Suppose that µ 6= µ0 . Then D(R(µ), R(µ0 )) = ∞, since R(µ) and R(µ0 ) are not absolutely
continuous. Therefore dinf (R(µ), µ∗ , M) = ∞.
(b) Notice that each arm returns exactly two possible rewards and once these have been observed,
then the mean is known. Consider the algorithm that plays each arm until it has observed both
possible rewards from that arm and subsequently plays optimally. The expected number of
P
trials before both rewards from an arm are observed is ∞i=2 i2
1−i = 3. Hence
k
X
Rn ≤ 3 ∆i .
i=1
(c) Let Pµ be the shifted Rademacher distribution with mean µ. Then D(Pµ , Pµ+∆ ) is not
differentiable as a function of ∆.
Rn (π, ν) X ∆i X ∆i
lim inf ≥ ≥ ,
n→∞ log(n) d (Pi , µ , M) i:∆ >0 dinf (Pi , µ∗ , M0 )
i:∆ >0 inf
∗
i i
where the latter inequality holds thanks to M0 ⊂ M. Note that any P ∈ M0 is uniquely
determined by its Bernoulli parameter p. Choose some ν ∈ (M0 )k , ν = (Pi ) and let pi the
Bernoulli parameter underlying Pi . Introduce p∗ = maxi pi . Then, dinf (Pi , µ∗ , M0 ) = d(pi , p∗ ) where
d(p, q)(= D(B(p), B(q))) = p log(p/q) + (1 − p) log((1 − p)/(1 − q)) (cf. Definition 10.1). Furthermore,
∆i = b(p∗ − pi ). Hence, we conclude that
We will consider environments ν = νδ given by the Bernoulli parameters ((1+δ)/2, . . . , (1+δ)/2, (1−
42
δ)/2) for some δ ∈ [0, 1]. Thus,
Rn (π, ν) X b(p∗ − pi ) bδ
lim sup ≥ =
n→∞ log(n) i:p <p∗
d(pi , p )
∗ d((1 − δ)/2, (1 + δ)/2)
i
b
= .
log((1 + δ)/(1 − δ))
Denote the right-hand side by f (δ). Noticing that limδ→0 f (δ) = ∞ we immediately see that
Eq. (16.7) cannot hold: As the action gap gets small, if we maintain the variance at constant (as in
this example) then the regret blows up with the inverse action gap. To show that Eq. (16.8) cannot
hold either, consider the case when δ → 1. The right-hand side of Eq. (16.8) is σk2 /∆k = b(1−δ 2 )/(4δ).
Now, f (δ) ∆ k
σk2
= (1−δ2 ) log((1+δ)/(1−δ))
4δ
→ ∞ as δ → 1. (Note that in this case the variance decreases
to zero, while the gap is maintained. Since the algorithm does not know that the variance is zero, it
has to pay a logarithmic cost).
17.1
Proof of Claim 17.6. Abbreviate Pi to be the law of A1 , X1 , . . . , An , Xn induced by PQi and let Ei
be the corresponding expectation operator. Following the standard argument, let j be the arm that
minimises E1 [Tj (n)], which satisfies
n
E1 [Tj (n)] ≤ .
k−1
Therefore by Theorem 14.2 and Lemma 15.1,
max {P1 (T1 (n) ≤ n/2), Pj (Tj (n) ≤ n/2)} ≥ P1 (T1 (n) ≤ n/2) + Pj (T1 (n) > n/2)
1
≥ exp −E1 [Ti (n)]2∆2
2
1 k−1 1
≥ exp −E1 [Ti (n)] log
2 n 8δ
≥ 4δ .
43
Proof of Claim 17.7. Notice that if ηt + 2∆ < 1 and ηt > 0, then Xtj ∈ (0, 1) for all j ∈ [k]. Now
! !
(1/2 − 2∆)2 (1/2)2
Pi (ηt + 2∆ ≥ 1 or ηt ≤ 0) ≤ exp − + exp −
2σ 2 2σ 2
100 25
≤ exp − + exp −
32 2
1
≤ .
8
P
Let M = nt=1 I {ηt ≤ 0 or ηt + 2∆ ≥ 1}, which is an upper bound on the number of rounds where
clipping occurs. By Hoeffding’s bound,
s
n log(1/δ)
Pi M ≥ Ei [M ] + ≤ δ,
2
18.1
44
which matches the upper bound proven in the previous part.
18.6
(a) This follows because Xi ≤ maxj Xj for any family of random variables (Xi )i . Hence
E [Xi ] ≤ E [maxj Xj ].
for an arbitrarily fixed m∗ . By the definition of learning experts, Et [X̂t ] = xt and so Eq. (18.10)
also remains valid. Note this would not be true in general if E (t) were allowed to depend on At .
The rest follows the same way as in the oblivious case.
18.7 The inequality En∗ ≤ nk is trivial since maxm Emi ≤ 1. To prove En∗ ≤ nM , let
(t)
n X
X k X
M n X
X M X
k
En∗ = Emi I {m = m∗ti } ≤ Emi = nM ,
(t) (t)
Pk
where in the last step we used the fact that = 1.
(t)
i=1 Emi
18.8 Let X̂tφ = kI {At = φ(Ct )}. Assume that t ≤ m. Since At is chosen uniformly at random,
Finally note that X̂tφ ∈ [0, k]. Using the technique from Exercise 5.14,
!
kλ2 g(λk/m)
E[exp(λ(µ̂(φ) − µ(φ)))] ≤ exp ,
m
where g(x) = (exp(x) − 1 − x)/x2 , which for x ∈ (0, 1] satisfies g(x) ≤ 1/2 + x/4. Suppose that
45
m ≥ 2k log(2|Φ|). Then by following the argument in Exercise 5.18,
!
log(2|Φ|) kλ k 2 λ2
E max |µ̂(φ) − µ(φ)| ≤ inf + +
φ λ>0 λ 2m 4m2
s
2k log(2|Φ|) k log(2|Φ|)
≤ + ,
m 2m
18.9 Let C1 , . . . , Cn ∈ C be the i.i.d. sequence of contexts and for k ∈ [n] let C1:k = (C1 , . . . , Ck ).
The algorithm is as follows: Following the hint, for the first m rounds the algorithm selects arms
in an arbitrary fashion. The regret from this period is bounded by m. The algorithm then picks
M = |ΦC1:m | functions Φ0 = {φ1 , . . . , φM } from Φ so that Φ0 |C1:m = ΦC1:m and for the remaining
n − m rounds uses Exp4 with the set Φ0 . q
p
The regret of Exp4 for competing against Φ0 is 4n log(M ) ≤ 4n log((em/d)d ) =
p
4nd log(em/d), where the inequality follows from Sauer’s lemma. It remains to show that
the best expert in Φ0 achieves almost as much reward as the best expert in Φ. For this, it suffices to
show that for any φ∗ ∈ Φ, the expert φ0∗ ∈ Φ0 that agrees with φ∗ on C1:m agrees with φ∗ on most
of the rest of the rounds. Let d∗ be a positive integer to be chosen later. We will show that with
high probability φ0∗ and φ∗ agree except for at most d∗ rounds.
P
We need a few definitions: For a, b finite sequences of equal length k let d(a, b) = ki=1 I {ai 6= bi }
be their Hamming distance. For a sequence c ∈ C k , let φ(c) = (φ(c1 ), . . . , φ(ck )) ∈ {1, 2}k .
For φ, φ0 ∈ Φ, let dC (φ, φ0 ) = d(φ(C1:k ), φ0 (C1:k )). For a permutation π on [n], let C π =
(k)
= P ∃φ, φ0 ∈ Φ : dC π (φ, φ0 ) ≥ d∗ , dC π (φ, φ0 ) = 0
(n) (m)
h i
= E P ∃φ, φ0 ∈ Φ : dC π (φ, φ0 ) ≥ d∗ , dC π (φ, φ0 ) = 0|C
(n) (m)
,
46
where the first equality is the definition of p, the second holds by the exchangeability of (C1 , . . . , Cn ).
Now, by a union bound and because π and C are independent,
P ∃φ, φ0 ∈ Φ : dC π (φ, φ0 ) ≥ d∗ , dC π (φ, φ0 ) = 0|C
(n) (m)
X
≤ P (d(aπ , bπ ) ≥ d∗ , d(aπ1:m , bπ1:m ) = 0) .
a,b∈ΦC1:n
By Sauer’s lemma, |ΦC1:n | ≤ (en/d)d . For any fixed a, b ∈ {1, 2}n , p(a, b) =
P (d(a , b ) ≥ d , d(a1:m , b1:m ) = 0) is the probability that all bits in a random subsequence of
π π ∗ π π
length m of a bit sequence of length n with at least d∗ one bits has only zero bits.
mAs the probability
of a randomly chosen bit to be equal to zero is 1 − d /n, p(a, b) ≤ 1 − n
∗ d∗
≤ exp(−d∗ m/n).
Choosing d∗ = d m
n
log (en/d)2d
δ e we find that p ≤ δ.
Pt
19.3 Let T be the set of rounds t when kat kV −1 ≥ 1 and Gt = V0 + s=1 IT (s)as a>
s . Then
t−1
!d d
dλ + |T |L2 trace(Gn )
≥
d d
≥ det(Gn )
Y
= det(V0 ) (1 + kat k2G−1 )
t−1
t∈T
Y
≥ det(V0 ) (1 + kat k2V −1 )
t−1
t∈T
≥ λd 2|T | .
47
Abbreviate x = d/ log(2) and y = L2 /dλ, which are both positive. Then
x log (1 + y(3x log(1 + xy))) ≤ x log 1 + 3x2 y 2 ≤ x log(1 + xy)3 = 3x log(1 + xy) .
19.4
where the inequality follows by Cauchy-Schwartz, and the equality follows because det(A) =
det(B) det(I + B −1/2 uu> B −1/2 ) = det(B)(1 + kuk2B −1 ) per the proof of Lemma 19.4. In the
P
general case, C = A − B 0 can be written using its eigendecomposition as C = ki=1 ui u> i
P
with 0 ≤ k ≤ d. Letting Aj = B + i≤j ui u> i , 0 ≤ j ≤ k note that A = A k and B = A 0 and
thus by (19.1),
P
(b) We need some notation. As before, for t ≥ 0, we let Vt (λ) = V0 + s≤t As A> s , where As is the
action taken in round s. Let τ0 = 0. For t ≥ 1, we let τt ∈ [t] τt denote the round index when
the phase that contains round t starts. That is,
τ
t−1 , if det Vt−1 (λ) ≤ (1 + ε) det Vτt−1 −1 (λ) ;
τt =
t, otherwise .
Further,
A
t−1 , if τt = τt−1 ;
At =
argmax
a∈A UCBt (a) , otherwise .
Here, UCBt (a) is the upper confidence bound based on all the data available up the beginning
of round t.
Let the event when θ∗ ∈ ∩t∈[n] Ct hold. Define θ̃t ∈ Ct so that if Ãt = argmaxa∈A UCBt (a) then
hθ̃t , Ãt i = maxa∈A UCBt (a). Letting a∗ = argmaxa∈A hθ∗ , ai and noting that maxa UCBτt (a) =
48
UCBτt (Aτt ) and that At = Aτt ,
Hence,
Now, by Part (a), thanks to Vτt −1 Vt−1 and that θ∗ , θ̃τt ∈ Cτt and that Cτt is contained in an
ellipsoid of radius βτt ≤ βt and “shape” determined by Vτt −1 ,
1/2 q
det Vt−1
kθ̃τt − θ∗ kVt−1 ≤ kθ̃τt − θ∗ kVτt −1 ≤ 2 (1 + ε)βt .
det Vτt −1
which is the same as Eq. (19.10) except that βt is replaced with (1 + ε)βt . This implies in turn
P
that R̂n = nt=1 ≤ R̂n ((1 + ε)βn ).
19.5 We only present in detail the solution for the first part.
(a) Partition C into m equal length subintervals, call these C1 , . . . , Cm . Associate a bandit algorithm
with each of these subintervals. In round t, upon seeing Ct ∈ C, play the bandit algorithm
associated with the unique subinterval that Ct belongs to. For example, one could use a Exp3
as in the previous chapter, or UCB. The regret is
" n n
#
X X
Rn = E max r(Ct , a) − Xt
a∈[k]
t=1 t=1
" n # " n #
X X
=E max r(Ct , a) − max r̃([Ct ], a) + E max r̃([Ct ], a) − Xt ,
a∈[k] a a∈[k]
t=1 t=1
| {z } | {z }
(I) (II)
where for c ∈ C, [c] is the index of the unique part Ci that c belongs to and for i ∈ [m],
E [r(C , a) | [C ] = i] if P ([Ct ] = i) > 0
r̃(i, a) =
t t
0 otherwise .
The first term in the regret decomposition is called the approximation error, and the second is
the error due to learning. The approximation error is bounded using the Lipschitz assumption
49
and the definition of the discretisation:
≤ L/m ,
where in the second last inequality we used the assumption that r is Lipschitz and in the last
that when [Ct ] = [c] it holds that |Ct − c| ≤ 1/m. It remains to bound the error due to learning.
Note that E[Xt | [Ct ] = i] = r̃(i, At ). As a result, the data experienced by the bandit associated
with Ci satisfies the conditions of a stochastic bandit environment. Consider the case when
Exp3 is used with the adaptive learning rate as described in Exercise 28.13. For i ∈ [m], let
Ti = {t ∈ [n] | [Ct ] = i}, Ni = |Ti | and
!
X
Rni = max r̃([Ct ], a) − Xt .
a∈[k]
t∈Ti
p
Then, by Eq. (11.2), E[Rni ] ≤ CE[ k log(k)Ni ] and thus
" #
m
X m
X q
(II) ≤ E [Rni ] ≤ CE (k log(k)) ≤ C k log(k)mn ,
1/2 1/2
Ni
i=1 i=1
Ln q
Rn ≤ + C k log(k)mn .
m
for some constant C 0 that depends only on C. The same argument works with no change if
the bandit algorithm is switched to UCB, just a little extra work is needed to deal with the
fact that UCB will be run for a random number of rounds. Luckily, the number of rounds is
independent of the rewards experienced and the actions taken.
(c) Make an argument that C can be partitioned into m = (3dL/ε)d partitions to guarantee that for
fixed a the function r(c, a) varies by at most ε within each partition. Then bound the regret by
s
d
3dL
Rn ≤ nε + 2nk log(k) .
ε
50
By optimizing ε you should arrive at a bound that depends on the horizon like O(n(d+1)/(d+2) ),
which is really quite bad, but also not improvable without further assumptions. You might find
the results of Exercise 20.3 to be useful.
19.6
(a) We need to show that Lt (θ∗ ) ≤ βt for all t with high probability. By definition,
1/2
t
X
Lt (θ∗ ) = gt (θ∗ ) − Xs As
s=1 Vt−1
t
X
= λθ∗ + (µ(hθ∗ , As i)As − (µ(hθ∗ , As i)As + ηs )As )
s=1 Vt−1
t
X
= λθ∗ + As η s
s=1 Vt−1
t
X √
≤ As η s + λkθ∗ k2 .
s=1 Vt−1
(b) Let gt0 denote the derivative of gt (which exists by assumption). By the mean value theorem,
there exists a ξ on the segment connecting θ and θ0 such that
(c) Let θ̃t be such that µ(hθ̃t , At i) = maxa∈At maxθ∈Ct µ(hθ, ai). Using the fact that θ∗ ∈ Ct ,
2c2 βt−1
1/2
≤ kAt kV −1 .
c1 t−1
51
(d) By Part (a), with probability at least 1 − δ, θ∗ ∈ Ct for all t. Hence, by Part (c), with
probability at least 1 − δ,
2c2 βn
n
X 1/2 n
X
R̂n = rt ≤ (1 ∧ kAt kV −1 )
t=1
c1 t=1
t−1
s
2c2 nL2
≤ 2ndβn log 1 + ,
c1 d
where we used Lemma 19.4 and the same argument as in the proof of Theorem 19.2.
19.7
(b) By Theorem 20.4 in the next chapter, it holds with probability least 1 − δ that for all t,
s
t
X 1 det(Vt )
As Xs ≤ 2 log + log .
δ det(V0 )
s=1 Vt−1
52
On this event,
t
X t
X
kθ̂t − θkVt = Vt−1 s θ − θ + Vt
As A> −1
As ηs
s=1 s=1 Vt
t
X
≤ Vt−1 V0 θ + As η s
Vt
s=1 Vt
s
1 det(Vt )
≤ kθkV0 + 2 log + log
δ det(V0 )
s
1 det(Vt )
≤m+ 2 log + log .
δ det(V0 )
(d) The result follows by combining the result of the last part and that of Part (a) while choosing
δ = 1/n.
(e) The bound to improves when deff d. For this, V0 should have a large number of large
P
eigenvalues while maintaining i λi hvi , θ∗ i2 = kθ∗ k2V0 ≤ m2 : Choosing V0 is betting on the
directions (vi ) in which θ∗ will have small components as in those directions we can choose λi
to be large. This way, one can have a finer control than just choosing features of some fixed
dimension: The added flexibility is the main advantage of non-uniform regularisation. However,
this is not for free: Non-uniform regularisation may increase kθ∗ kV0 for some θ∗ which may lead to
worse regret. In particular, when kθ∗ kV0 ≤ m is not satisfied, the confidence set will still contain
θ∗ , but the coverage probability 1−δ will worsen to about 1−δ 0 where δ 0 ≈ δ exp((kθ∗ kV0 −m)2+ ),
which increases the δn term in the regret to δ 0 n. Thus, the degradation is smooth, but can be
quite harsh. Since the confidence radius is probably conservative, the degradation may be less
noticeable than what one would expect based on this calculation.
P
19.8 For the last part, note that it suffices to store the inverse matrix Ut = Vt−1 (λ), St = s≤t As Xs ,
together with dt = log det V0 (λ) .
det Vt (λ)
Choosing V0 = λI, we have U0 = λ−1 I, S0 = 0 and d0 = 0. Then,
53
for t ≥ 0 and a ∈ Rd we have
p
UCBt (a) = hθ̂t , ai + βt kakUt−1
where
and where after receiving (At , Xt ), the following updates need to be executed:
Here, the update of dt follows from Eq. (19.9). Note that the O(d2 ) operations are the calculation
of θ̂t and the update of Ut .
20.1 From the definition of the design Vn = mI and the ith coordinate of θ̂n − θ∗ is the sum of m
independent standard Gaussian random variables. Hence
h i d
X 1
E kθ̂n − θ∗ k2Vn = m = d.
i=1
m
20.2 Let m = n/d and assume for simplicity that m is a whole number. The actions (At )nt=1
are chosen in blocks of size m with the ith block starting in round ti = (i − 1)m + 1. For all
i ∈ {1, . . . , d} we set Ati = ei . For t ∈ {ti , . . . , ti + m − 1} we set Ati = ei if ηti > 0 and At = 0
otherwise. Clearly Vn−1 ≥ I and hence k1kVn−1 ≤ d. The choice of (At )nt=1 ensures that E[θ̂n,i ] is
independent of i. Furthermore,
Pm
t=1 ηt 1 1
E[θ̂n,1 ] = E I {η > 0} + I {η1 < 0} η1 =√ −1 .
m 2π m
Therefore
2
hθ̂n , 1i2 1 h i 1 h i2 d m−1
E ≥ E hθ̂ , 1i 2
≥ E h θ̂ , 1i = .
2π
2 n n
k1kV −1 d d m
n
20.3
54
(a) If C ⊂ A is an ε-covering then it is also an ε0 -covering with any ε0 ≥ ε. Hence, ε → N (ε) is a
decreasing function of ε.
(b) The inequality M (2ε) ≤ N (ε) amounts to showing that any 2ε packing has a cardinality at most
the cardinality of any ε covering. Assume this does not hold, that is, there is a 2ε packing P ⊂ A
and an ε-covering C ⊂ A such that |P | ≥ |C|+1. By the pigeonhole principle, there is c ∈ C such
that there are distinct x, y ∈ P such that x, y ∈ B(c, ε). Then kx − yk ≤ kx − ck + kc − yk ≤ 2ε,
which contradicts that P is a 2ε-packing.
If M (ε) = ∞, the inequality N (ε) ≤ M (ε) is trivially true. Otherwise take a maximum
ε-packing P of A. This packing is automatically an ε-covering as well (otherwise P would not
be a maximum packing), hence, the result.
.
(c) We show the inequalities going left to right. For the first inequality, if N = N (ε) = ∞ then there
is nothing to be shown. Otherwise let C be a minimum cardinality ε-cover of A. Then from
P
the definition of cover and the additivity of volume, vol(A) ≤ x∈A0 vol(B(x, ε)) = N εd vol(B).
Reordering gives the inequality.
The next inequality, namely that N (ε) ≤ M (ε) has already been shown.
.
Consider now the inequality bounding M = M (ε). Let P be a maximum cardinality ε-packing
of A. Then, for any x, y ∈ P distinct, B(x, ε/2) ∩ B(y, ε/2) = ∅. Further, for x ∈ P ,
B(x, ε/2) ⊂ A + 2ε B and thus ∪x∈P B(x, ε/2) ⊂ A + 2ε B, hence, by the additivity of volume,
M vol( 2ε B) ≤ vol(A + 2ε B).
For the next inequality note that εB ⊂ A immediately implies that A + 2ε B ⊂ A + 12 A (check
the containment using the definitions), while the convexity of A implies that A + 12 A ⊂ 32 A.
For this second claim let u ∈ A + 12 A. Then u = x + 12 y for some x, y ∈ A. By the convexity
of A, 23 u = 23 x + 13 y ∈ A and hence u = 32 ( 23 u) ∈ 32 A. For the final inequality note that for
measurable X and c > 0 we have vol(cX) = cd vol(X). This is true because cX is the image of
X under the linear mapping represented by a diagonal matrix with c on the diagonal and this
matrix has determinant cd .
(d) Let A be bounded, and say, A ⊂ rB for some r > 0. Then vol(A + ε/2B) ≤ vol(rB + ε/2B) =
vol((r + ε/2)B) < +∞, hence, the previous part gives that N (ε) ≤ M (ε) < +∞. Now
assume that N (ε) < ∞ and let C be a minimum cover of A. Then A ⊂ ∪x∈C B(x, ε) ⊂
∪x∈C (kxk + ε)B ⊂ maxx∈C (kxk + ε)B hence, A is bounded.
20.4 The result follows from Part (c) of Exercise 20.3 by taking A = B = {x ∈ Rd : kxk2 ≤ 1},
which shows that the covering number N (B, ε) ≤ (3/ε)d , from which the result follows.
20.5 Proving that M̄t is Ft -measurable is actually not trivial. It follows because Mt (·) is measurable
and by the ‘sections’ lemma [Kallenberg, 2002, Lemma 1.26]. It remains to show that E[M̄t | Ft−1 ] ≤
M̄t−1 almost surely. Proceeding by contradiction, suppose that P(E[M̄t | Ft−1 ] − M̄t−1 > 0) > 0.
Then there exists an ε > 0 such that the set A = {ω : E[M̄t | Ft−1 ](ω) − M̄t−1 (ω) > ε} ∈ Ft−1
55
satisfies P (A) > 0. Then
Z Z
0< (E[M̄t | Ft−1 ] − M̄t−1 )dP = (M̄t − M̄t−1 ))dP
A ZA Z
= (Mt (x) − Mt−1 (x))dh(x)dP
d
ZA ZR
= (Mt (x) − Mt−1 (x))dPdh(x)
Rd A
≤ 0,
where the first equality follows from the definition of conditional expectation, the second by
substituting the definition of M̄t and the third from Fubini-Tonelli’s theorem. The last follows
from Lemma 20.2 and the definition of conditional expectation again. The proof is completed by
noting the deep result that 0 6< 0. In this proof it is necessary to be careful to avoid integrating
over conditional E[Mt (x) | Ft−1 ], which are only defined for each x almost surely and need not be
measurable as a function of x (though a measurable choice can be constructed using separability of
Rd and continuity of x 7→ Mt (x)).
20.8 Let f (λ) = √12π exp(−λ2 /2) be the density of the standard Gaussian and define
supermartingale Mt by
Z ! !
tσ 2 λ2 1 St2
Mt = f (λ) exp λSt − dλ = √ exp .
R 2 tσ 2 + 1 2σ 2 (t + 1)
Since E[Mτ ] = M0 = 1, the maximal inequality shows that P (supt Mt ≥ 1/δ) ≤ δ, which after
rearranging the previous display completes the result.
20.9
Pn = P (exists t ≤ n : Mt ≥ 1/δ) ≤ δ .
56
Hence P (exists t : Mt ≥ 1/δ) ≤ δ. Substituting the result from the previous part and rearranging
completes the proof.
I {λ ≤ e−e }
(d) A suitable choice of f is f (λ) = 2 .
λ log 1
λ log log 1
λ
(e) Let εn = min{1/2, 1/ log log(n)} and δ ∈ [0, 1] be the largest (random) value such that Sn never
exceeds
s
2n 1 1
log + log .
1 − εn
2 δ εn Λn f (Λn (1 + εn ))
By Part (c) we have P (δ > 0) = 1. Furthermore, lim supn→∞ Sn /n = 0 almost surely by the
strong law of large numbers, so that Λn → 0 almost surely. On the intersection of these almost
sure events we have
Sn
lim sup p ≤ 1.
n→∞ 2n log log(n)
20.10 We first show a bound on the right tail of St . A symmetric argument suffices for the left tail.
P
Let Ys = Xs − µs |Xs | and Mt (λ) = exp( ts=1 (λYs − λ2 |Xs |/2)). Define filtration G1 ⊂ · · · ⊂ Gn by
Gt = σ(Ft−1 , |Xt |). Using the fact that Xs ∈ {−1, 0, 1} we have for any λ > 0 that
Therefore Mt (λ) is a supermartingale for any λ > 0. The next step is to use the method of mixtures
R
with a uniform distribution on [0, 2]. Let Mt = 02 Mt (λ)dλ. Then Markov’s inequality shows that
for any Gt -measurable stopping time τ with τ ≤ n almost surely, P (Mτ ≥ 1/δ) ≤ δ. Next we need a
bound on Mτ . The following holds whenever St ≥ 0.
Z
1 2
Mt = Mt (λ)dλ
2 0
r !
1 π St 2Nt − St St2
= erf √ + erf √ exp
2 2Nt 2Nt 2Nt 2Nt
√ r !
erf( 2) π St2
≥ exp .
2 2Nt 2Nt
The bound on the upper tail completed via a stopping time, which shows that
v
u s
u
u 2 2Nt
P exists t ≤ n : St ≥ t2Nt log √ and Nt > 0 ≤ δ .
δ erf( 2) π
57
(a) Following the hint, we show that exp(Lt (θ∗ )) is a martingale. Indeed, letting Ft = σ(X1 , . . . , Xt ),
h i
E [exp(Lt (θ∗ ))|Ft−1 ] = E pθ̂t−1 (Xt )/pθ∗ (Xt ) exp(Lt−1 (θ∗ ))
Z p (x)
= exp(Lt−1 (θ∗ )) ∗ (x) X
pθ θ̂t−1
dµ(x)
∗ (x)
pθX
Z
X
X
= exp(Lt−1 (θ∗ )) pθ̂t−1 (x)dµ(x)
= exp(Lt−1 (θ∗ )) .
where the inequality is due to Theorem 3.9, the maximal inequality of nonnegative
supermartingales.
where in the third equality we used that adj(V (π)) is symmetric since V (π) is symmetric, hence,
following the hint, adj(V (π)/ det(V (π)) = V (π)−1 .
where λi are the eigenvalues of H −1/2 ZH −1/2 . Since log(1 + tλi ) is concave, their sum is also
concave, proving that t 7→ log det(H + tZ) is concave.
21.3 Let A be a compact subset of Rd and (An )n be a sequence of finite subsets with An ⊂ An+1
and span(An ) = Rd and limn→∞ d(A, An ) = 0 where d is the Hausdorff metric. Then let πn be a
G-optimal design for An with support of size at most d(d + 1)/2 and Vn = V (πn ). Given any a ∈ A
58
we have
√
kakVn−1 ≤ min ka − bkVn−1 + kbkVn−1 ≤ d + min ka − bkVn−1 .
b∈An b∈An
Let W ∈ Rd×d be matrix with columns w1 , . . . , wd in A1 that span Rd . The operator norm of Vn
−1/2
is bounded by
kVn−1/2 k = kW −1 W Vn−1/2 k
≤ kW −1 kkVn−1/2 W k
n o
= kW −1 k sup kW xkVn−1 : kxk2 = 1
( d )
X
≤ kW −1
k sup |xi |kwi kVn−1 : kxk2 = 1
i=1
√ d
X
≤ kW −1 k d sup xi
x:kxk2 =1 i=1
≤ dkW −1 k .
Notice that πn may be represented as a tuple of vector/probability pairs with at most d(d + 1)/2
entries and where the vectors lie in A. Since the set of all such tuples with the obvious topology
forms a compact set it follows that (πn ) has a cluster point π ∗ , which represents a distribution on
A with support at most d(d + 1)/2. The previous display shows that g(π ∗ ) ≤ d. The fact that
g(π ∗ ) ≥ d follows from the same argument as the proof of Theorem 21.1.
21.5 Let π be a Dirac at a and π(t) = π ∗ + t(π ∗ − π). Since π ∗ (a) > 0 it follows for sufficiently
small t > 0 that π(t) is a distribution over A. Because π ∗ is a minimiser of f ,
d
0≥ f (π(t))|t=0 = h∇f (π ∗ ), π ∗ − πi = d − kak2V (π)−1 .
dt
59
Rearranging shows that kak2V (π)−1 ≥ d. The other direction follows by Theorem 21.1.
We proved that
C log(n)
Rni ≤ 3|θi | + .
|θi |
24.1 Assume without loss of generality that i = 1 and let θ(−1) ∈ Θp−1 . The objective is to prove
that
√
1 X kn
Rn1 (θ) ≥ .
|Θ| (1) 8
θ ∈Θ
P
For j ∈ [k] let Tj (n) = nt=1 I {Bt1 = j} be the number of times base action j is played in the first
bandit. Define ψ0 ∈ Rd to be the vector with ψ0 = θ(−1) and ψ0 = 0. For j ∈ [k] let ψj ∈ Rd be
(−1) (1)
given by ψj = θ(−1) and ψj = ∆ej . Abbreviate Pj = Pψj and Ej [·] = EPj [·]. With this notation,
(−1) (1)
we have
1 X 1X k
Rn1 (θ) = ∆(n − Ej [Tj (n)]) . (24.1)
|Θ| (1) k j=1
θ ∈Θ
60
Lemma 15.1 gives that
" #
1 Xn
∆2
D(P0 , Pj ) = E0 hAt , ψ0 − ψj i2 = E0 [Tj (n)] .
2 t=1
2
p
Choosing ∆ = k/n/2 and applying Pinsker’s inequality yields
r
k
X k
X k
X 1
Ej [Tj (n)] ≤ E0 [Tj (n)] + n D(P0 , Pj )
j=1 j=1 j=1
2
s
k
X ∆2
=n+n E0 [Tj (n)]
j=1
4
v
u
u k∆2 X
k
u
≤ n + nt E0 [Tj (n)] (Cauchy-Schwarz)
4 j=1
s
k∆2 n
=n+n
4
≤ 3nk/4 . (since k ≥ 2)
Combining the above display with Eq. (24.1) completes the proof:
1 X 1X k
n∆ 1√
Rn1 (θ) = ∆(n − Ej [Tj (n)]) ≥ = kn .
|Θ| (1) k j=1 4 8
θ ∈Θ
25.3 For (a) let θ1 = ∆ and θi = 0 for i > 1 and let A = {e1 , . . . , ed−1 }. Then adding ed
increases the asymptotic regret. For (b) let θ1 = ∆ and θi = 0 for 1 < i < d and θd = 1 and
A = {e1 , . . . , ed−1 }. Then for small values of ∆ adding ed decreases the asymptotic regret.
26.2 Let P be the on the space on which X is defined. Following the hint, let x0 = E[X] ∈ Rd .
Then let a ∈ Rd and b ∈ R be such that ha, x0 i + b = f (x0 ) and ha, xi + b ≤ f (x) for all x ∈ Rd .
The hyperplane {x : ha, xi + b − f (x0 ) = 0} is guaranteed to exist by the supporting hyperplane
theorem. Then
Z Z
f (X)dP ≥ (ha, Xi + b)dP = ha, x0 i + b = f (x0 ) = f (E[X]) .
61
An alternative is of course to follow the ideas next to the picture in the main text. As you may
recall that proof is given for the case when X is discrete. To extend the proof to the general case,
one can use the ‘standard machinery’ of building up the integral from simple functions, but the
resulting proof, originally due to Needham [1993], is much longer than what was given above.
26.3
= f (x) ,
where in the second inequality we used the definition of convexity to ensure that f (y) ≥
f (x) + hy − x, ∇f (x)i.
26.9
(a) Fix u ∈ Rd . By definition f ∗ (u) = supx hx, ui − f (x). To find this value we solve for x where
the derivative of hx, ui − f (x) in x is equal to zero. As calculated before, ∇f (x) = log(x).
Thus, we need to find the solution to u = log(x), giving x = exp(u). Plugging this value, we
get f ∗ (u) = hexp(u), ui − f (exp(u)). Now, f (exp(u)) = hexp(u), log(exp(u))i − hexp(u), 1i =
hexp(u), ui − hexp(u), 1i. Hence, f ∗ (u) = hexp(u), 1i and ∇f ∗ (u) = exp(u).
(c) Df ∗ (u, v) = f ∗ (u) − f ∗ (v) − h∇f ∗ (v), u − vi = hexp(u) − exp(v), 1i − hexp(v), u − vi.
(d) To check Part (a) of Theorem 26.6 note that ∇f (x) = log(x) and ∇f ∗ (u) = exp(u), which
are indeed inverses of each other and their respective domains match that of int(dom(f )) and
62
int(dom(f ∗ )), respectively. To check Part (b) of Theorem 26.6, we calculate Df ∗ (∇f (y), ∇f (x)):
26.13
z ∈ argminA g(z) ,
which exists by the assumption that A is compact. By convexity of A and the first-order
optimality condition it follows that
Therefore
The proof fails when f is not differentiable at y because the map v 7→ ∇v f (y) need not be
linear.
(b) Consider the function function f (x, y) = −(xy)1/4 and let y = (0, 0) and x = (1, 0) and
A = {(t, 1 − t) : t ∈ [0, 1]}. Then Df (x, y) = Df (z, y) = 0, but Df (x, z) = ∞.
26.14 Parts (a) and (b) are immediate from convexity and the definitions. For Part (c), we have
Then consider y = (2, 0) and x = (0, 2) and z = (0, −2). Then Df (z, y) = Df (x, y) = 0, but
Df ((x + z)/2, y) = Df (0, y) = 1 ≥ (Df (x, y) + Df (z, y))/2.
63
26.15 The first part follows immediately from Taylor’s theorem. The second part takes a little
work. To begin, abbreviate kx − ykz = kx − yk∇2 f (z) and for t ∈ (0, 1) let
1
δ(x, y, t) = Df (x, y) − kx − yk2tx+(1−t)y ,
2
which is continuous on int(dom(f )) × int(dom(f )) × [0, 1]. Let (a, b) ⊂ (0, 1) and consider
[ \ [
A= {(x, y) : δ(x, y, t) ≤ ε} .
δ∈(0,b−a)∩Q ε∈(0,1)∩Q t∈U ∩Q
As we mentioned already, Taylor’s theorem ensures there exists a t ∈ [0, 1] such that δ(x, y, t) = 0
for all (x, y) ∈ int(dom(f )) × int(dom(f )). By the continuity of δ it follows that (x, y) ∈ A if
and only if there exists a t ∈ (a, b) such that δ(x, y, t) = 0. Since (x, y) 7→ δ(x, y, t) is measurable
for each t it follows that A is measurable. Let T (x, y) = {t : δ(x, y, t) = 0}. Then by the
Kuratowski–Ryll-Nardzewski measurable selection theorem (theorem 6.9.4, Bogachev 2007) there
exists a measurable function τ : int(dom(f )) × int(dom(f )) → (0, 1) such that τ (x, y) ∈ T (x, y) for
all (x, y) ∈ dom(f ) × dom(f ). Therefore g(x, y) = τ (x, y)x + (1 − τ (x, y))y is measurable and the
result is complete.
27.1 Let
P
exp(−η t−1
s=1 Ŷs (a))
P̃t (a) = P Pt−1 .
a0 ∈A exp(−η s=1 Ŷs (a ))
0
Then,
Pt = (1 − γ)P̃t + γπ . (27.1)
P P Pn
Let L̂n (a) = nt=1 Ŷt (a), L̂n = nt=1 hPt , Ŷt i and L̃n = t=1 hP̃t , Ŷt i, where we abuse h·, ·i by defining
P
hp, yi = a∈A p(a)y(a) for p, y : A → R. Then,
" n #
X
Rn = max Rn (a) where Rn (a) = E hAt , yt i − ha, yt i .
a∈A
t=1
64
Now, by (27.1),
n
X
L̂n = (1 − γ)L̃n + γ hπ, Ŷt i .
t=1
Repeating the steps of the proof of Theorem 11.1 shows that, thanks to η Ŷt (a) ≥ −1,
log k Xn
L̃n ≤ L̂n (a) + +η hP̃t , Ŷt2 i (27.2)
η t=1
log k η X n
≤ L̂n (a) + + hPt , Ŷt2 i ,
η 1 − γ t=1
where Ŷt2 denotes the function a 7→ Ŷt2 (a) and the second inequality used that P̃t = Pt −γπ
1−γ ≤ 1−γ .
Pt
Now,
n
X log k Xn
L̂n − L̂n (a) ≤ γ hπ, Ŷt i + (1 − γ)L̂n (a) + +η hPt , Ŷt2 i − L̂n (a)
t=1
η t=1
log k Xn Xn
= +η hPt , Ŷt2 i + γ hπ − ea , Ŷt i ,
η t=1 t=1
log k Xn h i
Rn ≤ max Rn (a) ≤ + 2γn + η E hPt , Ŷt2 i ,
a η t=1
27.4 Note that it suffices to show that kxk2B −1 − kxk2A−1 = kxk2B −1 −A−1 ≥ 0 for any x ∈ Rd . Let
x ∈ Rd . Then, by the Cauchy-Schwarz inequality,
kxk2A−1 = hx, A−1 xi ≤ kxkB −1 kA−1 xkB ≤ kxkB −1 kA−1 xkA = kxkB −1 kxkA−1 .
27.6
n o
(a) A straightforward calculation shows that L = y ∈ Rd : kykV ≤ 1 . Let T x = V 1/2 x and note
n o
that T −1 L = T A = B = u ∈ Rd : kuk2 ≤ 1 . Then let U be an ε-cover of B with respect
to k · k2 with |U| ≤ (3/ε)d and C = T −1 U. Given x ∈ A let u = T x and u0 ∈ U be such that
65
ku − u0 k2 ≤ ε and x0 = T −1 u0 . Then
n o
(b) Notice that L is convex, symmetric, bounded and span(L) = Rd . Let E = y ∈ Rd : kykV ≤ 1
be the ellipsoid of maximum volume contained by cl(L). Then let
n o n o
E∗ = y ∈ Rd : kykE ≤ 1 = y ∈ Rd : kykV −1 ≤ 1 ,
which satisfies A ⊆ E∗ . Since span(L) = Rd the matrix V −1 is positive definite. By the previous
result there exists a C¯ ⊂ E∗ of size at most (3d/ε)d such that
sup inf kx − x0 kL ≤ ε .
x∈E∗ x0 ∈C¯
We are nearly done. The problem is that C¯ may contain elements not in A. To resolve this issue
¯ where Π(x) ∈ argminx0 ∈A kx − x0 kE . Then note that
let C = {Π(x) : x ∈ C}
result follows by choosing C = Π(x) : x ∈ C¯ where Π is the projection onto A with respect to
k · kE where E is the maximum volume ellipsoid contained by cl(co(A)).
27.8 Consider the case when d = 1, k = n and A = {1, −1, ε, ε/2, ε/4, . . .} for suitably small
ε. Then, for t = 1, Pt is uniform on A and hence Q−1 t ≈ 2/k and with probability 1 − 2/k,
|Ŷt | ≈ 1/k = 1/n. If η is small, then the algorithm will barely learn. If η is large, then it learns
quickly that either 1 or −1 is optimal, but is too unstable for small regret.
27.9 We can copy the proof presented for the finite-action case in the solution to Exercise 27.1 in
an almost verbatim manner: The minor change is that is that we need to replace the sums over the
action space with integrals. In particular, here we have
P
exp(−η t−1
s=1 Ŷs (a))
P̃t (a) = R Pt−1
A exp(−η s=1 Ŷs (a ))da
0 0
R
and hp, yi = A p(a)y(a)da for p, y : A → R. Now up to (27.2) everything is the same. Recall that
the inequality in this display was obtained by using the steps of the proof of Theorem 11.1. Here,
we need add a little detail because we need to change this inequality slightly.
66
We argue as follows: Define (Wt )nt=0 by
Z t
!
X
Wt = exp −η Ŷs (a) da ,
A s=1
n−1
Y Wt+1
Wn = vol(A) .
t=0
Wt
Therefore,
n
X n
X
log Wn ≤ log(vol(A)) − η hP̃t , Ŷt i + η 2 hP̃t , Ŷt2 i .
t=1 t=1
Pn
Recalling that L̃n = t=1 hP̃t , Ŷt i, a rearrangement of the previous display gives
1 vol(A) n
X
L̃n ≤ log +η hP̃t , Ŷt2 i . (27.3)
η Wn t=1
Pn
Let a∗ = argmina∈A t=1 hyt , ai. Note that
n
X 1 1
L̂n (a∗ ) = Ŷt (a∗ ) = − log P .
t=1
η exp η t=1 Ŷt (a )
n ∗
By adding and subtracting L̂n (a∗ ) to the right-side of Eq. (27.3) and using the last identity and the
definition of Wn , we get
1 Xn
L̃n ≤ L̂n (a∗ ) + log(Kn ) + η hP̃t , Ŷt2 i ,
η t=1
where
vol(A)
Kn = R Pn .
exp −η t=1 (Ŷt (a) − Ŷt (a )) da
∗
which is the inequality that replaces (27.2). In particular, the only difference between (27.2) and
the above display is that in the above display log(k) got replaced by log(Kn ). From here, we can
67
follow the steps of the proof of Exercise 27.1 up to the end, to get
" #
E [log(Kn )] Xn
Rn ≤ + 2γn + ηE hPt , Ŷt2 i ,
η t=1
The result is completed by noting that γ = ηd and the same argument as in the proof of Theorem 27.1
to bound
" n #
X
ηE hPt , Ŷt2 i ≤ ηdn .
t=1
We claim that it suffices to show the result for the case when kuk = 1. Indeed, if the claim was true
for kuk = 1 then for any u 6= 0 it would follow that
R R R
K exp(−hx, ui)dx K exp(−hxkuk, u/kuki)dx kukK exp(−hy, u/kuki)dy
= =
vol(K) vol(K) kukd vol(K)
R
kukK exp(−hy, u/kuki)dy
= ≥ g( sup hx, u/kuki) = g(suphx, ui) .
vol(kukK) x∈kukK x∈K
Hence, it remains to show the claim for vectors u such that kuk = 1. With an entirely similar
reasoning we can show that it suffices to show the claim for the case when vol(K) = 1. Hence, from
now on we will assume these.
Introduce α = supx∈K hx, ui. For t ∈ [0, α], define f (t) to be the volume of the slice
Kt = {x ∈ K : hx, ui = t} with respect to the (d − 1)-form. Since vol(K) = 1, f (t) > 0 for
R R R
0 < t < α and 1 = 0α f (t)dt. Clearly, we also have K exp(−hx, ui)dx = 0α f (t) exp(−t)dt. Now,
since K is convex, for any t ∈ (0, α), f (t) ≥ volt−1 (q/tKq ) for any t ≤ q ≤ α.
Since e−t decreasing, a rearrangement argument shows that the function f that minimises
Rα
0 f (t) exp(−t)dt
Rα
and which satisfies the above properties is f (t) = (tf (α))d−1 for a suitable value
of f (α) so that 0 f (t)dt = 1 (we want the function to increase as fast as possible). Note that f (t)
gives the volume of the tK̃α for a suitable set K̃α , as shown in the figure below:
68
u
0
The whole triangle is K̃ = {tK̃α : t ∈ [0, α]} with x∗ = 0 at the bottom corner. The thin lines
represent tK̃α for different values of t, which are (d − 1)-dimensional subsets of K̃ that lie in affine
spaces with normal vector u.
Rα R α d−1 −1
From the constraint 0 f (t)dt = 1 we get f (α)d−1 = ( 0 t dt) . We calculate
Z Z α Rα
exp(−t)td−1 dt
exp(−hx, ui)dx ≥ exp(−t)(tf (α)) d−1
dt = 0 Rα
d−1 dt
K 0 0 t
n o
≥ min 1, (d/α)d /ed = g(α) ,
R α d−1
where the final inequality follows because 0 t dt = αd /d and
Z α Z α∧d
1 min(α, d)d
td−1
exp(−t)dt ≥ d td−1 dt = .
0 e 0 ed d
28.1 The mapping a 7→ DF (a, b) is the sum of Legendre function F and a linear function, which
is clearly Legendre. Hence, Φ is Legendre. Suppose now that c ∈ ∂int(D) and let d ∈ A ∩ int(D) be
arbitrary. Then, the map α 7→ Φ(αc + (1 − α)d) must be decreasing. And yet,
d
Φ(αc + (1 − α)d) = h∇Φ(αc + (1 − α)d), c − di
dα
= hy + ∇F (αc + (1 − α)d), c − di ,
28.5 The first step is the same as the proof of Theorem 28.4:
n
X n
X n
X
Rn (a) = hat − a, yt i = hat − at+1 , yt i + hat+1 − a, yt i .
t=1 t=1 t=1
69
Pt
Next let Φt (a) = F (a)/η + s=1 ha, ys i so that
n
X n
X F (a)
hat+1 − a, yt i = hat+1 , yt i − Φn (a) +
t=1 t=1
η
n
X F (a)
= (Φt (at+1 ) − Φt−1 (at+1 )) − Φn (a) +
t=1
η
n−1
X F (a)
= −Φ0 (a1 ) + (Φt (at+1 ) − Φt (at+2 )) + Φn (an+1 ) − Φn (a) +
t=0
η
F (a) − F (a1 ) n−1
X
≤ + (Φt (at+1 ) − Φt (at+2 )) . (28.1)
η t=0
1
Φt (at+1 ) − Φt (at+2 ) = −∇at+2 −at+1 Φt (at+1 ) − DF (at+2 , at+1 )
η
1
≤ − DF (at+2 , at+1 ) ,
η
where the inequality follows by the first-order optimality condition applied to at+1 =
argmina∈A∩dom(F ) Φt (a) and at+2 . Substituting this into Eq. (28.1) completes the proof.
28.10
(a) For the first relation, direct calculation shows that P̃t+1,i = Pti exp(−η Ŷti ) and
k
! k k
X Pti X X
DF (Pt , P̃t+1 ) = Pti log − Pti + P̃t+1,i
i=1 P̃t+1,i i=1 i=1
Xk
= Pti exp −η Ŷti − 1 + η Ŷti .
i=1
(c) Simple calculus shows that for p ∈ Pk−1 , F (p) ≥ − log(k) − 1 and F (p) ≤ −1 is obvious.
Therefore diamF (Pk−1 ) = maxp,q∈Pk−1 F (p) − F (q) ≤ log(k).
(d) By the previous exercise, Exp3 chooses At sampled from Pt . Then applying the second bound
p
of Theorem 28.4 and parts (b) and (c) and choosing η = log(k)/(2nk) yields the result.
70
28.11 Abbreviate D(x, y) = DF (x, y). By the definition of ãt+1 and the first-order optimality
conditions we have ηt yt = ∇F (at ) − ∇F (ãt+1 ). Therefore
1
hat − a, yt i = hat − a, ∇F (at ) − ∇F (ãt+1 )i
ηt
1
= (−ha − at , ∇F (at )i − hat − ãt+1 , ∇F (ãt+1 )i + ha − ãt+1 , ∇F (ãt+1 )i)
ηt
1
= (D(a, at ) − D(a, ãt+1 ) + D(at , ãt+1 )) .
ηt
Summing completes the proof. For the second part use the generalised Pythagorean theorem
(Exercise 26.13) and positivity of the Bregman divergence to argue that D(a, ãt+1 ) ≥ D(a, at+1 ).
28.12
(a) We use the same argument as in the solution to Exercise 28.5. First,
n
X n
X n
X
Rn (a) = hat − a, yt i = hat − at+1 , yt i + hat+1 − a, yt i .
t=1 t=1 t=1
The next step also mirrors that in Exercise 28.5, but now we have to keep track of the changing
potentials:
n
X n
X
hat+1 − a, yt i = hat+1 , yt i − Φn+1 (a) + Fn+1 (a)
t=1 t=1
n
X n
X
= (Φt+1 (at+1 ) − Φt (at+1 )) + (Ft (at+1 ) − Ft+1 (at+1 )) − Φn+1 (a) + Fn+1 (a)
t=1 t=1
n−1
X
= −Φ1 (a1 ) + (Φt+1 (at+1 ) − Φt+1 (at+2 )) + Φn+1 (an+1 ) − Φn+1 (a)
t=0
n
X
+ Fn+1 (a) + (Ft (at+1 ) − Ft+1 (at+1 ))
t=1
n−1
X n
X
≤ Fn+1 (a) − F1 (a1 ) + (Φt+1 (at+1 ) − Φt+1 (at+2 )) + (Ft (at+1 ) − Ft+1 (at+1 )) .
t=0 t=1
Φt+1 (at+1 ) − Φt+1 (at+2 ) = −∇at+2 −at+1 Φt+1 (at+1 ) − DFt+1 (at+2 , at+1 )
≤ −DFt+1 (at+2 , at+1 ) ,
which combined with the previous big display completes the proof.
(b) Note that adding a constant to the potential does not change the policy or the Bregman
divergence. Applying the previous part with Ft (a) = (F (a) − minb∈A F (b))/ηt immediately
gives the result.
71
28.13
Then apply the result from Exercise 28.12 combined with the fact that diamF (A) = log(k).
On the other hand, if Pt+1,At ≤ PtAt , then by Theorem 26.13 with H = ∇2 f (q) = diag(1/q) for
some q ∈ [Pt , Pt+1 ] we have
DF (Pt+1 , Pt ) ηt
hPt − Pt+1 , Ŷt i − ≤ kŶt k2H −1 ,
ηt 2
2
ηt qAt ŶtA 2
ηt PtAt ŶtA 2
ηt ytA
ηt
kŶt k2H −1 = t
≤ t
≤ t
.
2 2 2 2PtAt
Therefore
DF (Pt+1 , Pt ) 2
ηt ytA ηt
hPt − Pt+1 , Ŷt i − ≤ t
≤ .
ηt 2PtAt 2PtAt
(d) Continuing from the previous part and using the fact that E[1/PtAt ] = k shows that
" #
log(k) 1 X n
ηt log(k) k X n
Rn ≤ E + = + ηt .
ηn 2 t=1 PtAt ηn 2 t=1
q p √
Pn
(e) Choose ηt = log(k)
kt and use the fact that t=1 1/t ≤ 2 n.
28.14
(a) The result is obvious for any algorithm when n < k. Assume for the remainder that n ≥ k. The
learning rate is chosen to be
v
u
u k log(n/k)
ηt = t Pt−1 2 ,
1+ s=1 ytAt
72
R p p
which is obviously decreasing. By noting that f 0 (x)/ f (x) dx = 2 f (x) and making a simple
approximation,
v
n u n
X u X
ηt ytAt ≤ 2tk
2 2 log(n/k) .
ytA t
(28.2)
t=1 t=1
Pn
Define Rn (p) = t=1 hPt − p, Ŷt i. Then
For the remainder of the proof let p ∈ [1/n, 1] ∩ Pk−1 be arbitrary. Notice that F (p) −
minq∈Pk−1 F (q) ≤ k log(n/k). By the result in Exercise 28.12,
k log(n/k) X n
DF (Pt+1 , Pt )
Rn (p) ≤ + hPt − Pt+1 , Ŷt i − , (28.4)
ηn t=1
ηt
If Pt+1,At ≥ PtAt , then hPt −Pt+1 , Ŷt i ≤ 0. Now suppose that Pt+1,At ≤ PtAt . By Theorem 26.12,
there exists a ξ ∈ [Pt , Pt+1 ] such that
ηt
hPt − Pt+1 , Ŷt i − DFt−1 (Pt+1 , Pt ) ≤ kŶt k2∇2 F (ξ)−1
2
2
ηt ytA
ηt 2 2 ηt 2 2
= ξA Ŷ ≤ P Ŷ = t
.
2 t tAt 2 tAt tAt 2
By the definition of ηn and Eq. (28.2),
k log(n/k) 1 X n
Rn (p) ≤ + 2
ηt ytA
ηn 2 t=1 t
v !
u n−1
u X
≤ 2tk 1+ 2
ytA t
log(n/k) .
t=1
73
(b) Combining the previous result with the fact that yt ∈ [0, 1]k shows that
v " n #!
u
u X
Rn ≤ k + 2tk 1+E ytAt log(n/k)
t=1
v !
u n
u X
= k + 2tk 1 + Rn + min yta log(n/k) .
a∈[k]
t=1
Solving the quadratic in Rn shows that for a suitably large universal constant C,
v !
u n
u X
Rn ≤ k(1 + log(n/k)) + C tk 1 + min yta log(n/k) .
a∈[k]
t=1
where the first inequality follows from Part (b) and the second from the Cauchy-Schwarz
inequality. The result follows by substituting the above display into Eq. (28.5) and choosing
p
η = 2/n.
28.16
(a) Following the suggestion in the hint let F be the negentropy potential and
t−1
!
X
xt = argminx∈X F (x) + η f (x, ys )
s=1
t−1
!
X
yt = argminy∈Y F (y) − η f (xs , y) .
s=1
74
q
Then let εd (n) = 2 log(d)
n . By Proposition 28.7,
1X n
= max f (xt , y)
y∈Y n
t=1
1X n
≤ f (xt , yt ) + εk (n)
n t=1
1 Xn
≤ min f (x, yt ) + εj (n) + εk (n)
n x∈X t=1
= min f (x, ȳn ) + εj (n) + εk (n)
x∈X
≤ max min f (x, y) + εj (n) + εk (n) .
y∈Y x∈X
(b) Following a similar plan. Let F (x) = 12 kxk22 and gs (x) = f (x, ys ) and hs (y) = f (xs , y). Then
define
t−1
!
X
xt = argminx∈X F (x) + η hx, ∇gs (xs )i
s=1
t−1
!
X
yt = argminy∈Y F (y) − η hy, ∇hs (ys )i .
s=1
p
Let G = supx∈X,y∈Y k∇f (x, y)k2 and B = supz∈X∪Y kzk2 . Then let ε(n) = GB 1/n. A
straightforward generalisation of the above argument and the analysis in Proposition 28.6 shows
75
that
1X n
1X n
f (xt , yt ) = gt (xt )
n t=1 n t=1
!
1 Xn Xn
= min gt (x) + (gt (xt ) − gt (x))
x∈X n
t=1 t=1
!
1 Xn
1X n
≤ min gt (x) + hxt − x, ∇gt (xt )i
x∈X n n t=1
t=1
1X n
≤ min gt (x) + ε(n)
x∈X n
t=1
1X n
= min f (x, yt ) + ε(n)
x∈X n t=1
≤ min f (x, ȳn ) + ε(n) .
x∈X
1X n
max f (x̄n , y) ≤ f (xt , yt ) + ε(n) .
y∈Y n t=1
Hence
And the result is again completed by taking the limit as n tends to infinity.
In both cases the pair of average iterates (x̄n , ȳn ) has a cluster point that is a saddle point of
f (·, ·). In general the iterates (xn , yn ) may not have a cluster point that is a saddle point.
76
f (x, y) = y/(x + y).
29.2 First we check that θ̂t = dEt At Yt /(1 − kĀt k2 ) is appropriately bounded. Indeed,
where the last step holds by choosing 1 − r = 2ηd. All of the steps in the proof of Theorem 28.11
are the same until the expectation of the dual norm of θ̂t . Then
h i h i
E kθ̂t k2∇F (Zt )−1 ≤ E (1 − kZt k2 )kθ̂t k2
" #
(1 − kZt k2 )Et Yt2
=d E 2
(1 − kĀt k2 )2
≤ 2d2 .
This last inequality is where things have changed, with the d becoming a d2 . From this we conclude
that
1 1 1 1
Rn ≤ log + (1 − r)n + ηnd2 ≤ log + 2ηnd + ηnd2
η 2ηd η 2ηd
29.4
t=1
" n #
X
≤E hAt − a∗ , θi + 2nε
t=1
" n #
X
=E hĀt − a , θi + 2nε .
∗
(29.1)
t=1
77
The estimator is θ̂t = dEt At Yt /(1 − kĀt k2 ), which is no longer unbiased. Then
" #
dEt At Yt
E[θ̂t | Ft−1 ] = E Ft−1
1 − kĀt k2
" #
Et At (hAt , θi + ηt + ε(At ))
= dE Ft−1
1 − kĀt k2
d
X
=θ+ ε(ei )ei ,
i=1
t=1
Then we need to check that ηkθ̂t k2 = ηkdEt At Yt /(1 − kĀt k2 )k2 ≤ ηd/(1 − r) ≤ 1/2. Now
proceed as in Exercise 29.2.
√
(b) When ε(a) = 0 for all a, the lower bound is Ω(d n). Now add a spike ε(a) = −ε in the vicinity
of the optimal arm. Since A is continuous, the learner will almost surely never identify the
√ √
‘needle’ and hence its regret is Ω(d n + εn). The d factor cannot be improved greatly, but
the argument is more complicated [Lattimore and Szepesvári, 2019].
30.4
(b) Using the independence of (Xj )dj=1 shows that almost surely,
Z Mj−1 Z ∞
E[Mj | Mj−1 ] = Mj−1 exp(−x) dx + x exp(−x) dx .
0 Mj−1
= Mj−1 + exp(−Mj−1 ) .
78
Therefore by induction it follows that
a!
E[exp(−aMj )] = Qa .
b=1 (j + b)
1
E[Mj ] = E[Mj−1 ] + E[exp(−Mj−1 )] = E[Mj−1 ] + .
j
30.5 Since A is compact, dom(φ) = Rd . Let D be the set of points x at which φ is differentiable.
Then, as noted in the hint, λ(Rd \ D) = 0 where λ is the Lebesgue measure. Since Q λ, then
Q(Rd \ D) = 0 as well. Define a(x) = argmaxa∈A ha, xi. Let v ∈ Rd be non-zero. Then, by the
second part of the hint, the directional derivative of φ is
By the last part of the hint, for x ∈ D this implies that A(x) is a singleton and thus ∇φ(x) = a(x).
Let q = dQ
dλ be the density of Q with respect to the Lebesgue measure. Then, for any v ∈ R ,
d
Z Z
∇v φ(x + z)q(z) dz = ∇v φ(x + z)q(z) dz
Rd ZR
d
= ∇v φ(x + z)q(z) dz
D+{x}
Z
= ha(x + z), vi q(z) dz
D+{x}
*Z +
= a(x + z)q(z) dz, v ,
D+{x}
where the exchange of limit (hidden in the derivative) and integral is justified by the dominated
convergence theorem. By the last part of the hint, since v ∈ Rd was arbitrary, it follows that
R R
∇ Rd φ(x + z)q(z) dz exists and is equal to D+{x} a(x + z)q(z) = E [a(x + Z)].
30.6
(a) To show that F is well defined we need to show that F ∗ is the Fenchel dual of a unique proper
convex closed function. Let g = (F ∗ )∗ . It is not hard to see that F ∗ is a proper convex function,
dom(F ∗ ) = Rd , and hence the epigraph of F ∗ is closed. Then, by the hint, g ∗ = (F ∗ )∗∗ = F ∗ ,
hence F ∗ is the Fenchel dual of g. By the hint, the Fenchel dual of g is a proper convex closed
function, so we can take F = g ∗ . It remains to show that there is only a single proper convex
closed function whose Fenchel dual is F ∗ . To show this let g, h be proper convex closed functions
such that g ∗ = h∗ = F ∗ . Then g = g ∗∗ = (F ∗ )∗ = h∗∗ = h, hence, F is uniquely defined.
79
(b) By Part (c) of Theorem 26.6, it suffices to show that F ∗ is Legendre. As noted earlier, the
domain of F ∗ is all of Rd . From Exercise 30.5, it follows that F ∗ is everywhere differentiable.
Part (c) of the definition of Legendre functions is automatically satisfied since ∂Rd = ∅, hence
it remains to prove that F ∗ is strictly convex.
Let a(x) = argmaxa∈A ha, xi, with ties broken arbitrarily and let q = dQ
dλ be the density of Q
with respect to the Lebesgue measure. Recalling the definitions and the result of Exercise 30.5,
where δ = x − y. Clearly the term f (u) = ha(u) − a(u + δ), ui is nonnegative for any
u ∈ Rd . Since by assumption q > 0, it suffices to show that f is strictly positive over a
neighborhood of zero that has positive volume. The assumption that span(A) = Rd means that
ha(−δ/2) − a(δ/2), −δ/2i = ε > 0. To see this, notice that
Were it the case that ha(−δ/2), δ/2i = ha(δ/2), δ/2i, then co(A) would be a subset of a (d − 1)-
dimensional hyperplane, contradicting the assumption that span(A) = Rd . Let u = −δ/2 and
k · k = k · k2 and diam(A) = diamk·k (A). Then
80
Similarly, ha(v + δ/2), δ/2i ≥ ha(δ/2), δ/2i − 2kvkdiam(A) and hence
Thus, for sufficiently small kvk, it holds that f (u + v) ≥ ε/2 and the claim follows.
(c) We need to show that int(dom(F )) = int(co(A)). By the first two parts of the exercise, F is
Legendre, and hence by Part (a) of Theorem 26.6 and by Exercise 30.5, we have
Z
int(dom(F )) = ∇F ∗ (Rd ) = a(x + z)q(z)dz : x ∈ Rd .
Rd
Clearly, this is a subset of int(co(A)). To establish the equality, by convexity of co(A) it suffices
to show that for any extreme point a ∈ A and ε > 0 there exists an x such that k∇F ∗ (x)−ak ≤ ε.
To show this, choose a vector x0 ∈ Rd so that a(x0 + v) = a for any v in the unit ball centered
at zero. Such a vector exist because of the conditions on A. Let Kε be a closed ball centered
at zero such that Q(Kε ) ≥ 1 − ε/(maxa∈A kak). This exist because Q(Rd ) = 1. Let r be the
radius of Kε . Pick any c > rε . Then, for any v ∈ Kε , a(cx0 + v) = a(x0 + v/c) = a and hence
R
∇F ∗ (cx0 ) = s + Kε a(cx0 + z)q(z) = s + a, where ksk ≤ ε, finishing the proof.
30.8 Following the advice, assume that the learner plays m bandits in parallel, each having k = d/m
√ p
actions. Let Rni be the regret of the learner in the ith bandit. Then, Rni ≥ c nk = c nd/m for
Pm
some universal constant c > 0. Further, if Rn is the regret of the learner, Rn = i=1 Rni . Hence,
√
Rn ≥ c ndm.
An alternative to this is to emulate a k = d/m-armed bandit with scaled rewards: For this
imagine that the d items (components of the combinatorial action) are partitioned into k parts, each
having m items in it. Unlike in multi-task bandits, the learner needs to choose a part and receives
feedback for all the items in it. Hence, the the rewards received belong to the [0, m] interval and we
√ √
also get Rn ≥ cm nk = c ndm.
31.1 As suggested, Exp4 is used with each element of Γnm identified with one expert. Consider
an arbitrary enumeration of Γnm = {a(1) , . . . , a(G) } where G = |Γnm |. The predictions of expert
g ∈ [G] for round t ∈ [n] encoded as a probability
n distribution
o over [k] (as required by the prediction-
with-expert-advice framework) is Eg,j t = I a(g) = j , j ∈ [k]. The expected regret of Exp4 when
t
81
used with these experts is
" n n
#
X X
Rnexperts =E ytAt − min Eg(t) yt ,
g∈[G]
t=1 t=1
and hence
Rnexperts = Rnm .
Thus, Theorem 18.1 indeed proves (31.1). To prove (31.2) it remains to show that G = |Γnm | ≤
P
Cm log(kn/m). For this note that G = m s=1 Gn,s where Gn,s is the number of sequences from [k]
∗ ∗ n
that switch exactly s − 1 times. When m − 1 ≤ n/2, a crude upper bound on G is mGnm . For ∗
s = 1, G∗n,s = k. For s > 1, a sequence with s − 1 switches is determined by the location of the
switches, and the identity of the action taken in each segment where the action does not change.
The possible switch locations are of the form (t, t + 1) with t = 1, . . . , n − 1. Thus the number of
these locations is n − 1, of which, we need to choose s − 1. There are n−1 s−1 ways of doing this.
Since there are s segments and for the first segment we can choose any action and for the others we
can choose any other action than the one chosen for the previous segments, there are kk s−1 valid
P n
ways of assigning actions to segments. Thus, G∗n,s = kk s−1 n−1
s−1 . Define Φm (n) = m i=0 i . Hence,
Pm−1 n−1
G≤k m
s=0 s = k Φm−1 (n − 1) ≤ k Φm (n). Now note that for n ≥ m, 0 ≤ m/n ≤ 1, hence
m m
m m ! n ! n
m X m i n X m i n m
Φm (n) ≤ ≤ = 1+ ≤ em .
n i=0
n i i=0
n i n
Reordering gives Φm (n) ≤ en m
m . Hence, log(G) ≤ m log(ekn/m). Plugging this into (31.1) gives
(31.2).
31.3 Use the construction and analysis in Exercise 11.6 and note that when m = 2 the random
version of the regret is nonnegative on the bandit constructed there.
Chapter 32 Ranking
32.2 The argument is half-convincing. The heart of the argument is that under the criterion
that at least one item should attract the user, it may be suboptimal to present the list composed
of the fittest items. The example with the query ‘jaguar’ is clear: Assume half of the users will
mean ‘jaguar’ as the big cat, while the other half will mean it as the car. Presenting items that are
relevant for both meanings may have a better chance to satisfy a randomly picked user than going
with the top m list, which may happen to support only one of the meanings. This shows that there
82
is indeed an issue with ‘linearizing’ the problem by just considering individual item fitness values.
However, the argument is confusing in other ways. First, it treats conditions (for example,
independence of attractiveness) that are sufficient but not necessary to validate the probabilistic
ranking principle (PRP) as if they were also necessary. In fact, in click model studied here, the
mentioned independence assumption is not needed. To clarify, the strong assumption in the stochastic
click model, is that the optimal list is indeed optimal. Under this assumption, the independence
assumption is not needed.
Next, that the same document can have different relevance to different users fits even the cascade
model, where the vector of attraction values are different each time they are sampled from the
model. So this alone would not undermine the PRP.
Finally, the last sentence confuses relevance and ‘usefulness’. Again, in the cascade model, the
relevance (attractiveness) of a document (item) does not depend on the relevance of any other
document. Yet in the reward in the cascade model is exactly one if and only if at least one document
presented is relevant (attractive).
32.6 Following the proof of Theorem 32.2 the first part until Eq. (32.5) we have
" #
X̀ min{m,j−1}
X n
X
Rn ≤ nmP(Fn ) + E I {Fnc } Utij .
j=1 i=1 t=1
As before the first term is bounded using Lemma 32.4. Then using the first part of the proof of
Lemma 32.7 shows that
v
n u √ !
X u c n
I {Fnc } Utij ≤ 1 + t2Nnij log .
t=1
δ
Substituting into the previous display and applying Cauchy-Schwarz shows that
v
u √ !
u X̀ min{m,j−1}
X
u c n
Rn ≤ nmP(Fn ) + m` + t2m`E Nnij log .
j=1 i=1
δ
83
Expanding the two terms in the inner sum and bounding each separately leads to
Mt X
X X Mt
X X
E Cti Ft−1 = E |Ptd | Cti Ft−1
d=1 j∈Ptd i∈Ptd ∩[m] d=1 i∈Ptd ∩[m]
Mt
X
≤ |Itd ∩ [m]||Ptd ∩ [m]| ≤ m2 ,
d=1
where the inequality follows from the fact that for i ∈ Ptd ,
|Itd ∩ [m]| |Itd ∩ [m]|
t (i) ∈ [m] | Ft−1 =
E[Cti | Ft−1 ] ≤ P A−1 = .
|Itd | |Ptd |
33.3 Abbreviate f (α) = inf d∈D hα, di, which is clearly positively homogeneous: f (cα) = cf (α) for
any c ≥ 0. Because D is nonempty, f (0) = 0. Hence we can ignore α = 0 in both optimisation
problems and so
!−1
L
sup f (α) L= inf
α∈Pk−1 α∈Pk−1 f (α)
Lkαk1
= inf
α≥0:kαk1 >0 f (α)
= inf kLα/f (α)k1
α≥0:kαk1 >0
33.4
84
(a) For each i > 1 define
∆22
= .
2(σ12 + σ22 )
(c) By the result in Exercise 33.3 and Part (a) of this exercise,
( k
)
X
c (ν) = inf kαk1 : α ∈ [0, ∞) ,
∗ k
inf αi D(νi , ν̃i ) = 1
ν̃∈Ealt (ν)
i=1
( )
α1 αi ∆2i
= inf kαk1 : α ∈ [0, ∞) , min k
=1 .
i>1 2α1 σi2 + 2αi σ12
Let α1 = 2aσ12 /∆2min , which by the constraint that α ≥ 0 must satisfy a > 1. Then
k
X 2α1 σi2
c∗ (ν) = inf α1 +
α1 >2σ12 /∆2min i=2
α1 ∆2i − 2σ12
2aσ12 a Xk
2σi2 /∆2i
≤ inf +
a>1 ∆2min a − 1 i=2
a−1
s v 2
u k
2σ1
2 uX 2σ 2
= +t i
∆2min i=2
∆2i
v
u
2σ 2 k
X 2σi2 4σ1 uXk
2σi2
= 21 + + t .
∆min i=2 ∆2i ∆min i=2 ∆2i
85
(d) From the previous part
( )
α1 αi ∆2i
c (ν) = inf kαk1 : α ∈ [0, ∞) , min
∗ k
=1 ,
i>1 2α1 σi2 + 2αi σ12
(e) Notice that the inequality in the previous part is now an equality.
33.5
(a) Let ν ∈ E be an arbitrary Gaussian bandit with µ1 (ν) > maxi>1 µi (ν) and assume that
− log Pνπ (∆An+1 > 0)
lim inf > 1 + ε. (33.1)
n→∞ log(n)
Notice that if Eq. (33.1) were not true then we would be done. Then let ν 0 be a Gaussian
bandit in Ealt (ν) with µ(ν 0 ) = µ(ν) except that µi (ν 0 ) = µi (ν) + ∆i (ν)(1 + δ) where i > 1 and
√
δ = 1 + ε − 1. By Theorem 14.2 and Lemma 15.1,
1
Pνπ (An+1 6= 1) + Pν 0 π (An+1 6= i) ≥ exp (− D(Pνπ , Pν 0 π ))
2 !
1 (1 + δ)2 ∆i (ν)2 Eνπ [Ti (n)]
≥ exp −
2 2
!
1 (1 + ε)∆i (ν)2 Eνπ [Ti (n)]
= exp − .
2 2
Because π is asymptotically optimal, limn→∞ Eνπ [Ti (n)]/ log(n) = 2/∆i (ν)2 and hence
1+ε+εn
1 1
Pνπ (An+1 6= 1) + Pν 0 π (An+1 6= i) ≥ ,
2 n
86
where limn→∞ εn = 0. Using Eq. (33.1) shows that
round-robin.
(c) The same argument as Part (a) shows there exists a ν ∈ E with a unique optimal arm such that
which means the probability of selecting a suboptimal arm decays only polynomially with n.
33.6
(a) Assume without loss of generality that arm 1 is unique in ν. By the work in Part (a) of
Exercise 33.4, α∗ (ν) = argmaxα∈Pk−1 Φ(ν, α) with
1 Xk
Φ(ν, α) = inf αi (µi (ν) − µi (ν̃))2
2 ν̃∈Ealt (ν) i=1
1 α1 αi ∆2i 1
= min = min fi (α1 , αi )
2 i>1 α1 + αi 2 i>1
where Φ(ν, α) = 0 if αi = 0 for any i and the last equality serves as the definition of fi .
The function Φ(ν, ·) is the minimum of a collection of concave functions and hence concave.
Abbreviate α∗ = α∗ (ν) and notice that α∗ must equalize the functions (fi ) so that fi (α1∗ , αi∗ ) is
constant for i > 1. Hence, for all i > 1,
2α1∗ Φ(ν)
αi∗ = .
∆2i α1∗
− 2Φ(ν)
Therefore
k
X 2α1∗ Φ(ν)
αi∗ + = 1.
i=2
∆2i α1∗− 2Φ(ν)
87
The solutions to this equation are the roots of a polynomial and by the fundamental theorem
of algebra, either this polynomial is zero or there are finitely many roots. Since the former is
clearly not true, we conclude there are at most finitely many maximisers. Yet concavity of the
objective means that the number of maximisers is either one or infinite. Therefore there is a
unique maximiser.
(b) Notice that i∗ (ξ) = i∗ (ν) whenever d(ξ, ν) is sufficiently small. Hence, by the previous
part, the function Φ(·, ·) is continuous at (ν, α) for any α. Suppose that α∗ (·) is not
continuous at ν. Then there exists a sequence (νn )∞ n=1 with limn→∞ d(νn , ν) = 0 and for
which lim inf n→∞ kα (ν) − α (νn )k∞ > 0. By compactness of Pk−1 , the sequence α∗ (νn ) has a
∗ ∗
cluster point α∞∗ , which by assumption must satisfy α∗ (ν) 6= α∗ . And yet, taking limits along
∞
an appropriate subsequence, Φ(α∗ (ν), ν) = limn→∞ Φ(α∗ (ν), νn ) ≤ limn→∞ Φ(α∗ (νn ), νn ) =
Φ(α∞∗ , ν). Therefore by Part (a), α∗ (ν) = α∗ , which is a contradiction.
∞
(c) We’ll be a little lackadasical about constants here. Define random variable
( s )
2 log(2λkt(t + 1))
Λ = min λ ≥ 1 : d(ν̂t , ν) ≤ for all t ,
mini Ti (t)
which by the usual concentration analysis and union bounding satisfies P (Λ ≥ x) ≤ 1/x.
Therefore
Z ∞ Z ∞
E[log(Λ)2 ] = P Λ ≥ exp(x1/2 ) dx ≤ exp(−x1/2 )dx = 2 .
0 0
By the definition of λ,
( s )
2 log(Λkt(t + 1))
τν (ε) ≤ 1 + max t : >ε .
mini Ti (t)
√
The forced exploration in the algorithm means that Ti (t) = Ω( t) almost surely and hence
E[τν (ε)] = O E[log(Λ)2 ] = O(1) .
(d) Let w(ε) = inf{x : d(ω, ν) ≤ x =⇒ kα∗ (ν) − α∗ (ω)k∞ ≤ ε}, which by (b) satisfies w(ε) > 0
for all ε > 0. Hence E[τα (ε)] ≤ E[τν (w(ε))] < ∞.
√
(e) By definition of the algorithm At = i implies that either Ti (t − 1) ≤ t or At =
argmaxi αi∗ (ν̂t−1 ) − Ti (t − 1)/(t − 1). Now suppose that
( )
2kτα (ε/(2k)) 16k 2
t ≥ max , 2 .
ε ε
88
Then the definition of the algorithm implies that
n √o
Ti (t) ≤ max Ti (τα (ε/(2k))), 1 + t(αi∗ (ν) + ε/(2k)), 1 + t
ε
≤ t αi∗ (ν) + .
k
Pk
Furthermore, since i=1 Ti (t) = t,
X X ε
Ti (t) ≥ t − Tj (t) ≥ t − t αj∗ (ν) + ≥ t(αi∗ (ν) − ε) .
j6=i j6=i
k
And the result follows from the previous part, which ensures that
" ( )#
2kτα (ε/(2k)) 16k 2
E max , 2 < ∞.
ε ε
(f) Given ε > 0 let τβ (ε) = 1 + max {t : tΦ(ν, α∗ (ν)) < βt (δ) + εt} and
Taking the limit as δ → 0 and using the previous parts shows that for any sufficiently small
ε > 0,
Continuity of Φ(·, ·) at (ν, α∗ (ν)) ensures that limε→0 u(ε) = 0 and the result follows since
c∗ (ν) = 1/Φ(ν, α∗ (ν)). Note that taking the limit as δ → 0 only works because the policy does
not depend on δ. Hence the expectations of τν (ε), τα (ε) and τT (ε) do not depend on δ.
33.7
89
(a) Recalling the definitions,
( )
k
X 1 1 i
H1 (µ) = min , 2 and H2 (µ) = max ,
i=1
∆min ∆i
2 i:∆i >0 ∆2 i
Therefore H1 (µ) ≥ H2 (µ). For the second inequality, let imin = min{i : ∆i > 0}. Then
imin Xk
1 i
H1 (µ) = +
∆min i=i +1 i ∆2i
2
min
k
X 1
≤ 1 + H2 (µ) ≤ (1 + log(k))H2 (µ) .
i=imin +1
i
Pk
The result follows because imin > 1 and because i=3 ≤ log(k).
√
(b) When ∆2 = · · · = ∆k > 0 it holds that H1 (µ) = H2 (µ). For the other direction let ∆i = i for
i ≥ 2 so that i/∆2i = 1 = H2 (µ) and
k
X 1
H1 (µ) = 1 + = L = LH2 (µ) .
i=3
i
33.9 We have P maxi∈[n] µ(Xi ) < µ∗α = P (µ(X1 ) < µ∗α )n ≤ (1 − α)n ≤ δ. Solving for n gives
the required inequality.
34.4
R
(a) By the ‘sections’ Lemma 1.26 in [Kallenberg, 2002], d(x) = Θ pψ (x)q(ψ)dν(ψ) is H-measurable.
90
Therefore N = d−1 (0) ∈ H is measurable. Then
Z Z
0= pψ (x)q(ψ) dν(ψ)dµ(x)
ZN Z Θ
= pψ (x)dµ(x)q(ψ) dν(ψ)
ZΘ N
= Pψ (N ) dν(ψ)
Θ
= P(X ∈ N )
= PX (N ) ,
where the first equality follows from the definition of N . The second is Fubini’s theorem, the
third by the definition of the Radon-Nikodym derivative, the fourth by the definition of P and
the last by the definition of PX .
(b) Note that q(θ | x) = pθ (x)q(θ)/d(x), which is jointly measurable in θ and x. The fact that
Q(A | x) is a probability measure for all x is straightforward from the definition of expectation
and because for x ∈ N ,
R
Θ q(θ | x)
Q(Θ | x) = R = 1.
p
Θ ψ (x)q(ψ) dν(ψ)
That Q(A | ·) is H-measurable follows from the sections lemma and the fact that N ∈ H. Let
A ∈ G and B ∈ σ(X) ⊆ F, which can be written as B = Θ × C for some C ∈ H. Then,
Z Z Z
Q(A | X(ω)) dP(ω) = q(θ | X(ω)) dν(θ)dP(ω)
B ZB ZA Z
= q(θ | x) dν(θ) pθ (x)q(θ) dµ(x)dν(θ)
ZΘ C A
Z
= d(x) q(θ | x) dν(θ)dµ(x)
ZC Z A
= pθ (x)q(θ) dν(θ)dµ(x)
ZC A
= pθ (C)q(θ) dν(θ)
A
= P(θ ∈ A, X ∈ C)
= P(θ ∈ A, X ∈ C)
Z
= IA (θ) dP ,
B
R
(a) Clearly pθ (x) ≥ 0. By definition, for B ∈ B(R), Pθ (B) = B pθ (x)dh(x). Hence Pθ (B) ≥ 0.
91
Furthermore,
Z Z
Pθ (R) = exp(θT (x) − A(θ))dh(x) = exp(−A(θ)) exp(θT (x))dh(x) = 1 .
R R
R R R
Additivity is immediate since B f dh + C f dh = B∪C f dh for disjoint B, C.
(b) Using the chain rule and passing the derivative under the integral yields the result:
d R
dθR R exp(θT (x))dh(x)
A (θ) =
0
exp(θT (x))dh(x)
R R
T (x) exp(θT (x))dh(x)
= RR
exp(θT (x))dh(x)
Z R
= T (x) exp(θT (x) − A(x))dh(x)
ZR
= T (x)pθ (x)dh(x)
R
= Eθ [T ] .
In order to justify the exchange of integral and derivative use the identity that for all sufficiently
small ε > 0 and all a > 0,
exp(aε) + exp(−aε)
a≤ .
ε
Hence for θ ∈ int(dom(A)) there exists a neighborhood N of θ such that for all ψ ∈ N ,
92
(d) This is another straightforward calculation:
Z
d(θ, θ0 ) = (θT (x) − A(θ) − θ0 T (x) + A(θ0 )) exp(θT (x) − A(θ))dh(x)
R Z
= A(θ0 ) − A(θ) + (θ − θ0 ) T (x) exp(θT (x) − A(θ))dh(x)
R
= A(θ0 ) − A(θ) − (θ0 − θ)A (θ) . 0
(e) The Crammer-Chernoff method is the solution. Let λ = n(θ0 − θ). Then
34.13 Let π ∗ as in the problem definition. Let S̃ be the extension of S by adding rays in the
positive direction: S̃ = {x + u : x ∈ S, u ≥ 0}. Clearly S̃ remains convex and λ(S) ⊆ ∂ S̃ is on the
boundary and is a subset of λ(S̃) (see figure) Let x ∈ λ(S). By the supporting hyperplane theorem
and the convexity of S̃ there exists a nonzero vector a ∈ RN and b ∈ R such that ha, `(π ∗ )i = b and
ha, yi ≥ b for all y ∈ S̃. Furthermore a ≥ 0 since x + ei ∈ S̃ and so hx + ei , ai = b + ai ≥ b. Define
q(νi ) = ai /kak1 . Then, for any policy π,
X 1 X N
ha, `(π)i b
q(ν)`(π, ν) = ai `(π, νi ) = ≥
ν∈E
kak1 i=1 kak1 kak1
with equality for any policy π with `(π) = `(π ∗ ). Since ai is nonnegative, a 6= 0, q ∈ P(E), finishing
the proof.
93
34.14
(a) Suppose that π is not admissible. Then there exists another policy π 0 with `(π 0 , ν) ≤ `(π, ν) for
all ν ∈ E. Clearly π 0 is also Bayesian optimal. But π was unique, which is a contradiction.
(b) Suppose that Π = {π1 , π2 } and E = {ν1 , ν2 } and `(π, ν1 ) = 0 for all π and `(πi , ν2 ) = I {i = 2}.
Then any policy is Bayesian optimal for Q = δν1 , but π2 is dominated by π1 .
(c) Suppose that π is not admissible. Then there exists another policy π 0 with `(π 0 , ν) ≤ `(π, ν) for
all ν ∈ E and `(π 0 , ν) < `(π, ν) for at least one ν ∈ E. Then
Z X X Z
`(π, ν)dQ(ν) = Q({ν})`(π, ν) > Q({ν})`(π, ν) = `(π 0 , ν)dQ(ν) ,
E ν∈E ν∈E E
which is a contradiction.
(d) Repeat the previous solution with the restriction to the support.
34.15 Let Π be the set of all policies and ΠD = {e1 , . . . , eN } the set of all deterministic policies,
which is finite. A policy π ∈ Π can be viewed as a probability measure on (ΠD , B(ΠD )), which is
the essence of Kuhn’s theorem on the equivalence of behavioral and mixed strategies in extensive
form games. Note that since ΠD is finite, probability measures on (ΠD , B(ΠD )) can be viewed
as distributions in PN −1 . In this way Π inherits a metric and topology from PN −1 . Even more
straightforwardly, E is identified with [0, 1]k and inherits a metric from that space. As metric spaces
both Π and E are compact and the regret Rn (π, ν) is continuous in both arguments by Exercise 14.4.
Let (νj )∞
j=1 be a sequence of bandit environments that is dense in E and Ej = {ν1 , . . . , νj }. Using
the notation of the previous exercise, let Rn,j (π) = (Rn (π, ν1 ), . . . , Rn (π, νj )), Sj = Rn,j (Π) ⊂ Rj
and let λ(Sj ) be the Pareto frontier of Sj . Note that Sj is non-empty, closed and convex. Thus,
λ(Sj ) ⊂ Sj . Now let π adm ∈ Π be an admissible policy. Then Rn,j (π adm ) ∈ λ(Sj ) and by the
result of the previous exercise there exists a distribution Qj ∈ P(E) supported on Ej such that
BRn (π adm , Qj ) ≤ minπ BRn (π, Qj ). Let Q be the space of probability measures on (E, B(E)),
which is compact with the weak* topology by Theorem 2.14. Hence (Qj )∞ j=1 contains a convergent
subsequence (Qi )i converging to Q. Notice that ν 7→ Rn (π, ν) is a continuous function from E to
[0, n]. Therefore by the definition of the weak* topology for any policy π,
Z
lim BRn (π, Qi ) = lim Rn (π, ν)dQ(ν) = BRn (π, Q) .
i→∞ i→∞ E
In the remainder we show the other direction. Since [0, 1]k is compact with the usual topology,
Theorem 2.14 shows that Q is compact with the weak* topology. Let Π be the space of all policies
94
with the discrete topology and
( )
X
P= p(π)δπ : p ∈ P(A) and A ⊂ Π is finite ,
π∈A
which is a convex subspace of the topological vector space of all signed measures on (Π, 2Π ) with
the weak* topology. Let L : P × Q → [0, n] be defined by
Z Z Z
L(S, Q) = − Rn (π, ν)dQ(ν)dS(π) = − Rn (πS , ν)dQ(ν) ,
Π E E
R
where πS a policy such that PνπS = Π Pνπ dS(π), which is defined in Exercise 4.4. The regret
is bounded in [0, n] and the discrete topology on Π means that all functions from Π to R are
R
continuous, including π 7→ E Rn (π, ν)Q(dν). By the definition of the weak* topology on P it holds
that L(·, Q) is continuous in its first argument for all Q. The integral over Π with respect to S ∈ P
is a finite sum and ν 7→ Rn (π, ν) is continuous for all π by the result in Exercise 14.4. Therefore L
is continuous and linear in both arguments. By Sion’s theorem (Theorem 28.12),
− max BR∗n (Q) = min sup L(S, Q) = sup min L(S, Q) = − inf Rn∗ (πS , E) ,
Q∈Q Q∈Q S∈P S∈P Q∈Q S∈P
Therefore
max BR∗n (Q) = inf Rn∗ (πS , E) = inf Rn∗ (π, E) = Rn∗ (E) .
Q∈Q S∈P π∈Π
35.1 Let π be the policy of MOSS from Chapter 9, which for any 1-subgaussian bandit ν with
rewards in [0, 1] satisfies
√
k log(n)
Rn (π, ν) ≤ C min kn, ,
∆min (ν)
95
where ∆min (ν) is the smallest positive suboptimality gap. Let En be the set of bandits in E for
which there exists an arm i with ∆i ∈ (0, n−1/4 ). Then, for C 0 = Ck,
The first part follows since ∩n En = ∅ and thus limn→∞ Q(En ) = 0 for any measure Q. For the second
part we describe roughly what needs to be done. The idea is to make use of the minimax lower
bound technique in Exercise 15.2, which shows that for a uniform prior concentrated on a finite set
√
of k bandits the regret is Ω( kn). The only problems are that (a) the rewards were assumed to be
Gaussian and (b) the prior depends on n. The first issue is corrected by replacing the Gaussian
distributions with Bernoulli distributions with means close to 1/2. For the second issue you should
P
compute this prior for n ∈ {1, 2, 4, 8, . . .} and denote them Q1 , Q2 , . . .. Then let Q = ∞
j=1 pj Qj
where pj ∝ (j log (j)) . The result follows easily.
2 −1
Et = max{Ut , E[Et+1 | Ft ]} .
Integrability of (Ut )nt=1 ensures that (Et )nt=1 are integrable. By definition Et ≥ E[Et+1 | Ft ]. Hence
(Et )nt=1 is a supermartingale adapted to F. Hence for any stopping time κ ∈ Rn1 the optional
stopping theorem says that
E[Uκ ] ≤ E[Eκ ] ≤ E1 .
On the other hand, for τ satisfying the requirements of the lemma the process Mt = Et∧τ is a
martingale and hence E[Uτ ] = E[Mτ ] = M1 = E1 .
35.3 Define v n (x) = supτ ∈Rn1 Ex [Uτ ]. By assumption, Ex [|u(St )|] < ∞ for all x ∈ S and t ∈ [n].
Therefore by Theorem 1.7 of Peskir and Shiryaev [2006],
Z
v (x) = max{u(x),
n
v n−1 (y)Px (dy)} . (35.1)
S
Recall that v(x) = supτ Ex [Uτ ]. Clearly v n (x) ≤ v(x) for all x ∈ S. Let τ be an arbitrary stopping
time. Then
Since by assumption supn |Un | is Px -integrable for all x, the dominated convergence theorem shows
96
that
h i
lim Ex [(Un − Uτ )I {τ ≥ n}] = Ex lim (Un − Uτ )I {τ ≥ n} = 0 ,
n→∞ n→∞
where the second equality follows because U∞ = limn→∞ Un exists Px -almost surely by assumption.
Therefore limn→∞ v n (x) = v(x). Since convergence is monotone, it follows that v is measurable.
Taking limits in Eq. (35.1) shows that
Z
v(x) = lim max{u(x), v n−1 (y)Px (dy)}
n→∞
ZS
= max{u(x), lim v n−1 (y)Px (dy)}
n→∞ S
Z
= max{u(x), v(y)Px (dy)} ,
S
where the last equality follows from the monotone convergence theorem. Next, let $V_n = v(S_n)$. Note that $\lim_{n\to\infty} \mathbb{E}_x[U_n] = \mathbb{E}_x[\lim_{n\to\infty} U_n] = \mathbb{E}_x[U_\infty]$, where the exchange of the limit and expectation is justified by the dominated convergence theorem because $\sup_n |U_n|$ is $P_x$-integrable. By definition, $V_n \ge U_n$. Hence,
$$\mathbb{E}_x[V_\infty - U_\infty] = \mathbb{E}_x\Big[\lim_{n\to\infty}(V_n - U_n)\Big] = \lim_{n\to\infty} \mathbb{E}_x[V_n - U_n] = 0\,,$$
where the exchange of limit and expectation is again justified by the dominated convergence theorem and the assumption that $\sup_n |U_n|$ is $P_x$-integrable. Therefore $V_\infty = \lim_{n\to\infty} V_n = U_\infty$ $P_x$-a.s. Then
$$\mathbb{E}_x[V_{n+1} \mid S_n] = \int_{\mathcal{S}} v(y)\,P_{S_n}(dy) \le V_n \quad \text{a.s.}$$
Therefore $(V_n)_{n=1}^\infty$ is a supermartingale, which means that for any stopping time $\kappa$,
$$\mathbb{E}_x[U_\kappa] = \mathbb{E}_x\Big[\lim_{n\to\infty} U_{\kappa\wedge n}\Big] = \lim_{n\to\infty} \mathbb{E}_x[U_{\kappa\wedge n}] \le \lim_{n\to\infty} \mathbb{E}_x[V_{\kappa\wedge n}] \le v(x)\,,$$
where the exchange of limits and expectation is justified by the dominated convergence theorem and the fact that $U_{\kappa\wedge n} \le \sup_n U_n$, which is $P_x$-integrable by assumption. Consider a stopping time $\tau$ satisfying the conditions of Theorem 35.3. Then $(V_{n\wedge\tau})_{n=1}^\infty$ is a martingale and using the same argument as before we have
$$\mathbb{E}_x[U_\tau] = \mathbb{E}_x[V_\tau] = \mathbb{E}_x\Big[\lim_{n\to\infty} V_{\tau\wedge n}\Big] = \lim_{n\to\infty} \mathbb{E}_x[V_{\tau\wedge n}] = v(x)\,,$$
where the first equality follows from the assumption on $\tau$ that on the event $\tau < \infty$, $U_\tau = V_\tau$, and the fact that $V_\infty = U_\infty$ $P_x$-a.s.
35.6 Fix $x \in \mathcal{S}$ and let
$$g = \sup_{\tau \ge 2} \frac{\mathbb{E}_x\big[\sum_{t=1}^{\tau-1} \alpha^{t-1} r(S_t)\big]}{\mathbb{E}_x\big[\sum_{t=1}^{\tau-1} \alpha^{t-1}\big]}\,.$$
We will show that (a) $v_\gamma(x) > 0$ for all $\gamma < g$ and (b) $v_\gamma(x) = 0$ for all $\gamma \ge g$.
For (a), assume $\gamma < g$. By the definition of $g$, there exists a stopping time $\tau \ge 2$ such that
$$\mathbb{E}_x\Big[\sum_{t=1}^{\tau-1} \alpha^{t-1} r(S_t)\Big] > \gamma\, \mathbb{E}_x\Big[\sum_{t=1}^{\tau-1} \alpha^{t-1}\Big]\,,$$
and rearranging shows that $v_\gamma(x) \ge \mathbb{E}_x\big[\sum_{t=1}^{\tau-1} \alpha^{t-1}(r(S_t) - \gamma)\big] > 0$.
Moving now to (b), first note that $v_\gamma(x) \ge 0$ for any $\gamma \in \mathbb{R}$ because when $\tau = 1$, $\mathbb{E}_x\big[\sum_{t=1}^{\tau-1} \alpha^{t-1}(r(S_t) - \gamma)\big] = 0$. Hence, it suffices to show that $v_\gamma(x) \le 0$ for all $\gamma \ge g$. Pick $\gamma \ge g$. By the definition of $g$, for any stopping time $\tau \ge 2$,
$$\mathbb{E}_x\Big[\sum_{t=1}^{\tau-1} \alpha^{t-1} (r(S_t) - \gamma)\Big] \le 0\,.$$
If $\tau$ is an $\mathbb{F}$-stopping time then $P_x(\tau = 1)$ is either zero or one (the stopping rule underlying $\tau$ either stops given $S_1 = x$, or does not stop; the stopping rule cannot inject any further randomness). From this it follows that
$$v_\gamma(x) = \sup_{\tau \ge 1} \mathbb{E}_x\Big[\sum_{t=1}^{\tau-1} \alpha^{t-1}(r(S_t) - \gamma)\Big] = \max\Big\{0,\ \sup_{\tau \ge 2} \mathbb{E}_x\Big[\sum_{t=1}^{\tau-1} \alpha^{t-1}(r(S_t) - \gamma)\Big]\Big\} \le 0\,,$$
finishing the proof.
35.7 We want to apply Theorem 35.3. The difficulty is that Theorem 35.3 considers the case where the reward depends only on the current state, while here the reward accumulates. The solution is to augment the state space to include the history. We use the convention that if $x \in \mathcal{S}^n$ and $y \in \mathcal{S}^m$, then $xy \in \mathcal{S}^{n+m}$ is the concatenation of $x$ and $y$. In particular, the $i$th component of $xy$ is $x_i$ if $i \le n$ and $y_{i-n}$ if $i > n$. We will also denote by $x_{1:n}$ the sequence $(x_1, \dots, x_n)$ formed from $x_1, \dots, x_n$. Recall that $\mathcal{S}^*$ is the set of all finite sequences with elements in $\mathcal{S}$ and let $\mathcal{G}^*$ be the $\sigma$-algebra given by
$$\mathcal{G}^* = \sigma\Big(\bigcup_{n=0}^\infty \mathcal{G}^n\Big)\,.$$
Define the kernel $Q$ on $\mathcal{S}^*$ by $Q_{x_{1:n}}(B_1 \times \cdots \times B_{n+1}) = \mathbb{I}\{x_t \in B_t \text{ for all } t \in [n]\}\, P_{x_n}(B_{n+1})$, where $B_1, \dots, B_{n+1} \in \mathcal{G}$ are measurable. Note that for measurable $f : \mathcal{S}^* \to \mathbb{R}$ and $x_{1:n} \in \mathcal{S}^n$,
$$\int_{\mathcal{S}^*} f(y)\,Q_{x_{1:n}}(dy) = \int_{\mathcal{S}} f(x_{1:n} x_{n+1})\,P_{x_n}(dx_{n+1})\,.$$
Define $u : \mathcal{S}^* \to \mathbb{R}$ by
$$u(x_{1:n}) = \sum_{t=1}^{n-1} \alpha^{t-1}(r(x_t) - \gamma)\,.$$
Notice that the value of $u(x_{1:n})$ does not depend on $x_n$. Let $P_{x_{1:n}}$ be the probability measure carrying the process $(S_n)_{n=1}^\infty$ for which $P_{x_{1:n}}(S_{1:n} = x_{1:n}) = 1$. As usual, let $\mathbb{E}_{x_{1:n}}$ be the expectation with respect to $P_{x_{1:n}}$. Then, for $x \in \mathcal{S}$ and $x_{1:n} \in \mathcal{S}^*$,
$$v_\gamma(x) = \bar v_\gamma(x) \quad \text{and} \quad \bar v_\gamma(x_{1:n}) = u(x_{1:n}) + \alpha^{n-1} \bar v_\gamma(x_n)\,. \tag{35.3}$$
In order to apply Theorem 35.3 we need to check the existence and integrability conditions of $(U_n)_{n=1}^\infty$. By Assumption 35.6, $U_\infty = \lim_{n\to\infty} U_n$ exists $P_{x_{1:n}}$-a.s. and $\sup_{n\ge1} U_n$ is $P_{x_{1:n}}$-integrable for all $x_{1:n} \in \mathcal{S}^*$. Then by Theorem 35.3 it follows that
$$\bar v_\gamma(x_{1:n}) = \max\Big\{u(x_{1:n}), \int_{\mathcal{S}} \bar v_\gamma(x_{1:n} x_{n+1})\,P_{x_n}(dx_{n+1})\Big\}\,.$$
The proof of Part (a) is completed by noting that $u(xy) = r(x) - \gamma$, and so by (35.3),
$$v_\gamma(x) = \bar v_\gamma(x) = \max\Big\{0, \int_{\mathcal{S}} \bar v_\gamma(xy)\,P_x(dy)\Big\} = \max\Big\{0,\ r(x) - \gamma + \alpha \int_{\mathcal{S}} v_\gamma(y)\,P_x(dy)\Big\}\,.$$
For Part (b), when $\gamma < g(x)$ we have $v_\gamma(x) > 0$ by definition and hence using the previous part it follows that
$$v_\gamma(x) = r(x) - \gamma + \alpha \int_{\mathcal{S}} v_\gamma(y)\,P_x(dy)\,.$$
Note that $\sup_{x\in\mathcal{S}} |v_{\gamma+\delta}(x) - v_\gamma(x)| \le |\delta|/(1-\alpha)$ and hence by continuity for $\gamma = g(x)$ we have
$$r(x) - \gamma + \alpha \int_{\mathcal{S}} v_\gamma(y)\,P_x(dy) = 0 = v_\gamma(x)\,.$$
For Part (c), applying Theorem 35.3 again shows that when $\bar v_\gamma(x) = 0$, then $\tau = \min\{t \ge 2 : \bar v_\gamma(S_{1:t}) = u(S_{1:t})\}$ attains the supremum in Eq. (35.2) with $x_{1:n} = x$. Notice finally that by (35.3), for any $x_{1:n} \in \mathcal{S}^*$,
$$\bar v_\gamma(x_{1:n}) - u(x_{1:n}) = \alpha^{n-1} \bar v_\gamma(x_n)\,,$$
which means that $\tau = \min\{t \ge 2 : \alpha^{t-1} v_\gamma(S_t) = 0\} = \min\{t \ge 2 : g(S_t) \le \gamma\}$, where we used the fact that $v_\gamma(x) = 0 \Leftrightarrow g(x) \le \gamma$.
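For a finite state space the fixed-point characterisation in Part (a) can be realised numerically: the map $v \mapsto \max\{0,\ r - \gamma + \alpha P v\}$ is an $\alpha$-contraction, so iteration converges to $v_\gamma$, and by Part (b) the index $g(x)$ is the threshold at which $v_\gamma(x)$ hits zero, which bisection can locate. A sketch with illustrative $P$, $r$ and $\alpha$ (not from the text):

```python
import numpy as np

P = np.array([[0.7, 0.3],
              [0.2, 0.8]])   # illustrative transition matrix
r = np.array([1.0, 0.2])     # illustrative rewards
alpha = 0.9

def v_gamma(gamma, iters=2000):
    v = np.zeros(2)
    for _ in range(iters):                # alpha-contraction: converges
        v = np.maximum(0.0, r - gamma + alpha * (P @ v))
    return v

lo, hi = r.min(), r.max()                 # g(x) lies in [min r, max r]
for _ in range(50):                       # bisection for g at state x = 0
    mid = (lo + hi) / 2
    lo, hi = (mid, hi) if v_gamma(mid)[0] > 0 else (lo, mid)
print("g(0) is approximately", (lo + hi) / 2)
```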
36.3 We show that the probability-matching identity holds almost surely. For specificity, let $r : \mathcal{E} \to [k]$ be the (tie-breaking) rule that chooses the arm with the highest mean given a bandit environment, so that $A_t = r(\nu_t)$ and $A^* = r(\nu)$. Recall that $\nu_t \sim Q_{t-1}(\cdot) = Q(\,\cdot \mid A_1, X_1, \dots, A_{t-1}, X_{t-1})$. We have
$$\mathbb{P}(A_t = i \mid A_1, X_1, \dots, A_{t-1}, X_{t-1}) = Q_{t-1}(r^{-1}(i)) = \mathbb{P}(A^* = i \mid A_1, X_1, \dots, A_{t-1}, X_{t-1})\,.$$
36.5 We have
$$\sum_{t\in\mathcal{T}} \mathbb{I}\{A_t = i\} \le \sum_{t=1}^n \sum_{s=1}^n \mathbb{I}\{T_i(t) = s,\ T_i(t-1) = s-1,\ G_i(T_i(t-1)) > 1/n\} = \sum_{s=1}^n \mathbb{I}\{G_i(s-1) > 1/n\} \sum_{t=1}^n \mathbb{I}\{T_i(t) = s,\ T_i(t-1) = s-1\} \le \sum_{s=1}^n \mathbb{I}\{G_i(s-1) > 1/n\}\,,$$
where the first inequality uses that when $A_t = i$, $T_i(t) = s$ and $T_i(t-1) = s-1$ for some $s \in [n]$ and that $t \in \mathcal{T}$ implies $G_i(T_i(t-1)) > 1/n$. The next equality is by algebra, and the last inequality follows because for any $s \in [n]$ there is at most one time point $t \in [n]$ such that $T_i(t) = s$ and $T_i(t-1) = s-1$.
For the next inequality, note that
$$\mathbb{E}\Big[\sum_{t\notin\mathcal{T}} \mathbb{I}\{E_i^c(t)\}\Big] = \mathbb{E}\Big[\sum_{t\notin\mathcal{T}} \mathbb{E}[\mathbb{I}\{E_i^c(t),\ G_i(T_i(t-1)) \le 1/n\} \mid \mathcal{F}_{t-1}]\Big] = \mathbb{E}\Big[\sum_{t\notin\mathcal{T}} \mathbb{I}\{G_i(T_i(t-1)) \le 1/n\}\, G_i(T_i(t-1))\Big] \le \mathbb{E}\Big[\sum_{t\notin\mathcal{T}} \mathbb{I}\{G_i(T_i(t-1)) \le 1/n\}\,\frac{1}{n}\Big] = \mathbb{E}\Big[\sum_{t\notin\mathcal{T}} \frac{1}{n}\Big]\,,$$
where the second equality used that $\mathbb{I}\{G_i(T_i(t-1)) \le 1/n\}$ is $\mathcal{F}_{t-1}$-measurable and $\mathbb{E}[\mathbb{I}\{E_i^c(t)\} \mid \mathcal{F}_{t-1}] = 1 - \mathbb{P}(\theta_i(t) \le \mu_1 - \varepsilon \mid \mathcal{F}_{t-1}) = G_i(T_i(t-1))$.
36.6
(a) Let $f(y) = \sqrt{s/(2\pi)} \exp(-sy^2/2)$ be the probability density function of a centered Gaussian with variance $1/s$ and $F(y) = \int_{-\infty}^y f(x)\,dx$ be its cumulative distribution function. Then
$$G_{1s} = \int_{\mathbb{R}} \frac{f(y+\varepsilon)F(y)}{1 - F(y)}\,dy \le \int_0^\infty \frac{f(y+\varepsilon)}{1 - F(y)}\,dy + 2\int_{-\infty}^0 f(y+\varepsilon)F(y)\,dy\,. \tag{36.1}$$
For the first term in Eq. (36.1), following the hint, we use the following bound on $1 - F(y)$ for $y \ge 0$:
$$1 - F(y) \ge \frac{\exp(-sy^2/2)}{y\sqrt{s} + \sqrt{sy^2 + 4}}\,.$$
Hence
$$\int_0^\infty \frac{f(y+\varepsilon)}{1-F(y)}\,dy \le \int_0^\infty f(y+\varepsilon)\exp(sy^2/2)\big(y\sqrt{s} + \sqrt{sy^2+4}\big)\,dy \le 2\exp(-s\varepsilon^2/2)\int_0^\infty \sqrt{\frac{s}{2\pi}}\, \exp(-sy\varepsilon)\,(y\sqrt{s}+1)\,dy = \frac{2(1+\varepsilon\sqrt{s})}{\varepsilon^2 s\sqrt{2\pi}} \exp(-s\varepsilon^2/2)\,.$$
(b) Let $\hat\mu_{is}$ be the empirical mean of arm $i$ after $s$ observations. Then $G_{is} \le 1/n$ if
$$\hat\mu_{is} + \sqrt{\frac{2\log(n)}{s}} \le \mu_1 - \varepsilon\,.$$
Summing,
$$\sum_{s=1}^n \mathbb{P}(G_{is} > 1/n) \le u + \sum_{s=\lceil u\rceil}^n \exp\left(-\frac{s\big(\Delta_i - \varepsilon - \sqrt{2\log(n)/s}\big)^2}{2}\right) \le 1 + \frac{2}{(\Delta_i - \varepsilon)^2}\Big(\log(n) + \sqrt{\pi\log(n)} + 1\Big)\,,$$
where the last inequality follows by bounding the sum by an integral as in the proof of Lemma 8.2.
36.13 Let $\pi$ be a minimax optimal policy for $\{0,1\}^{n\times k}$ and let $x \in [0,1]^{n\times k}$ be an arbitrary adversarial bandit. Choose $\tilde\pi$ to be the policy obtained by observing $X_t = x_{tA_t}$ and then sampling $\tilde X_t \sim \mathcal{B}(X_t)$ and passing $\tilde X_t$ to $\pi$. Then
$$R_n(\tilde\pi, x) \le \sum_{\tilde x \in \{0,1\}^{n\times k}} \prod_{t=1}^n \prod_{i=1}^k x_{ti}^{\tilde x_{ti}} (1 - x_{ti})^{1 - \tilde x_{ti}} R_n(\pi, \tilde x) \le R_n^*(\{0,1\}^{n\times k})\,.$$
Therefore $R_n^*([0,1]^{n\times k}) \le R_n^*(\{0,1\}^{n\times k})$. The other direction is obvious.
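The relabelling in this reduction is straightforward to implement. In the sketch below, `policy` is a hypothetical object with `choose`/`observe` methods (an assumed interface, not one from the text); each observed reward in $[0, 1]$ is replaced by a Bernoulli sample with the same mean before being passed to the binary-reward policy.

```python
import numpy as np

rng = np.random.default_rng(0)

def run_relabelled(policy, x):
    """x: n-by-k array of rewards in [0, 1]. Runs a policy designed for
    rewards in {0, 1} by resampling each observation, as argued above."""
    n, k = x.shape
    for t in range(n):
        a = policy.choose()                  # arm A_t
        x_tilde = rng.binomial(1, x[t, a])   # X~_t ~ B(X_t)
        policy.observe(a, x_tilde)
```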
37.3 It suffices to show that $y$ has no component in any direction $z$ that is perpendicular to $x$. Let $z$ be such a direction. Without loss of generality either $z^\top \mathbf{1} = 0$ or $z^\top \mathbf{1} = 1$. Assume first that $z^\top \mathbf{1} = 0$. Take some $u \in \ker_0(x)$. Then $u + z \in \ker_0(x)$ also holds. Since $\ker_0(x) \subset \ker_0(y)$, $z^\top y = (u+z)^\top y - u^\top y = 0$. Assume now that $z^\top \mathbf{1} = 1$. Then $z \in \ker_0(x) \subset \ker_0(y)$ and hence $z^\top y = 0$.
37.10 An argument that almost works is to choose $\omega \in \operatorname{ri}(C_b)$ arbitrarily and find the first $\nu$ along the chord connecting $\omega$ and $\lambda$ for which $\nu \in C_b \cap C_c$ for some cell $c$. The minor problem is that $C_b \cap C_c \ne \emptyset$ does not imply that $b$ and $c$ are neighbours, because there could be a third non-duplicate Pareto optimal action $d$ with $\nu \in C_b \cap C_c \cap C_d$. This issue is resolved by making an ugly dimension argument. Let $D$ be the set of $\omega \in \mathcal{P}_{d-1}$ where at least three non-duplicate Pareto optimal cells intersect, which has dimension at most $\dim(D) \le d - 3$. Since $b$ is Pareto optimal, $\dim(\operatorname{ri}(C_b)) = d - 1$. Meanwhile, the set of those $\omega \in \operatorname{ri}(C_b)$ such that $[\omega, \lambda] \cap D \ne \emptyset$ has dimension at most $d - 2$. Hence, there exists an $\omega \in \operatorname{ri}(C_b)$ such that $[\omega, \lambda] \cap D = \emptyset$, and for this choice the initial argument works.
37.12
(a) Assume without loss of generality that $\Sigma = [m]$. Given an action $c \in [k]$, let $S_c$ be as in the proof of Theorem 37.12 and let $S \in \mathbb{R}^{km\times d}$ be the matrix obtained by stacking $(S_c)_{c=1}^k$:
$$S = \begin{pmatrix} S_1 \\ \vdots \\ S_k \end{pmatrix}.$$
As in the proof of Theorem 37.12, by the definition of global observability, for any pair of neighbouring actions $a, b$ it holds that $\ell_a - \ell_b \in \operatorname{im}(S^\top)$ and hence
$$\ell_a - \ell_b = S^\top U (\ell_a - \ell_b)\,,$$
where $U = (S^\top)^+$ is the pseudo-inverse of $S^\top$.
The largest singular value of $U$ is the square root of the reciprocal of the smallest non-zero eigenvalue of $S^\top S \in \{0, \dots, k\}^{d\times d}$. Let $(\lambda_i)_{i=1}^p$ be the non-zero eigenvalues of $S^\top S$, in decreasing order. Recall that for a square matrix, the product of its non-zero eigenvalues is a coefficient of the characteristic polynomial. Since $S^\top S$ has entries in $\{0, \dots, k\}$, the characteristic polynomial has integer coefficients. Since $S^\top S$ is positive semidefinite, its non-zero eigenvalues are all positive and it follows that $\prod_{i=1}^p \lambda_i \ge 1$. If $p = 1$, then we are done. Suppose that $p > 1$. By the arithmetic-geometric mean inequality,
$$\frac{1}{\lambda_p} \le \prod_{i=1}^{p-1} \lambda_i \le \left(\frac{\operatorname{trace}(S^\top S)}{p-1}\right)^{p-1} \le \left(\frac{dk}{p-1}\right)^{p-1} \le (kd)^{p-1}\,.$$
(b) Repeat the argument above, but restrict $S$ to stacking $(S_c)_{c\in N_{ab}}$.
(c) For non-degenerate locally observable games, $N_e = \{a, b\}$ for all $e = (a, b) \in E$. Given a Pareto optimal action $a$, let $V_a = \{(a, \Phi_{ai}) : i \in [d]\}$. Let $a, b$ be neighbouring actions and $f \in \mathcal{E}_{ab}^{\text{loc}}$, which exists by the assumption that the game is locally observable. Define $C = \{((a, \Phi_{ai}), (b, \Phi_{bi})) : i \in [d]\} \subset V_a \times V_b$, which makes $(V_a \cup V_b, C)$ a bipartite graph. We define a new function $g : [k] \times \Sigma \to \mathbb{R}$. First, let $g(c, \sigma) = 0$ whenever $\sigma \notin \{\Phi_{ci} : i \in [d]\}$ or $c \notin \{a, b\}$. Next, let $(V_a' \cup V_b', C')$ be a connected component. Then, by the conditions on $f$,
$$\max_{n', n'' \in V_a' \cup V_b'} |f(n') - f(n'')| \le s := 2(m-1) + 1\,.$$
Letting $c$ be the midpoint of the interval that the values of $f$ restricted to $V_a' \cup V_b'$ fall into, define
$$f'(n) = \begin{cases} f(n) - c\,, & \text{if } n \in V_a'; \\ f(n) + c\,, & \text{if } n \in V_b'. \end{cases}$$
This is repeated until all connected components are processed. Let the resulting function be $g$. Then $g \in \mathcal{E}_{ab}^{\text{loc}}$ and also $\|g\|_\infty \le m$.
37.13 Let $\Sigma = \{\clubsuit, \heartsuit\}$, $\ell_1 = (1, 0, 0, \dots, 0)$ and $\ell_2 = (0, 1, 1, \dots, 1)$. For all other actions $a > 2$ let $\ell_a = \mathbf{1}$. Hence $\ell_1 - \ell_2 = (1, -1, -1, \dots, -1)$. Let $d = 2k - 1$ and let $\Phi \in \Sigma^{k\times d}$ be the matrix whose first row alternates between $\clubsuit$ and $\heartsuit$ and, for $a > 1$,
$$\Phi_{ai} = \begin{cases} \clubsuit & \text{if } i \le 2(a-1) \\ \heartsuit & \text{if } i = 1 + 2(a-1) \\ \clubsuit & \text{otherwise, if } i \text{ is odd} \\ \heartsuit & \text{otherwise, if } i \text{ is even}. \end{cases}$$
Let $f$ be any function witnessing global observability, so that $\sum_{a=1}^k f(a, \Phi_{ai}) = \ell_{1i} - \ell_{2i}$ for all $i \in [d]$. Then
$$2 = \ell_{11} - \ell_{21} + \ell_{22} - \ell_{12} = \sum_{a=1}^k f(a, \Phi_{a1}) - \sum_{a=1}^k f(a, \Phi_{a2}) = f(1, \clubsuit) - f(1, \heartsuit)\,.$$
Hence,
$$f(a, \clubsuit) - f(a, \heartsuit) = \sum_{b<a} \big(f(b, \clubsuit) - f(b, \heartsuit)\big)\,.$$
By induction it follows that $f(a, \clubsuit) - f(a, \heartsuit) = 2^{a-1}$ for all $a > 1$. Finally, note that $1$ and $2$ are non-duplicate and Pareto optimal. Since all other actions are degenerate, these actions are neighbours. The game is globally observable because all columns have distinct patterns.
37.14 For bandit games with $\Phi = L$, let $p = q$ and
$$f(a, \sigma)_b = \sigma\, \mathbb{I}\{a = b\}\,.$$
We have $\sum_{a=1}^k f(a, \Phi_{ai})_b = \sum_{a=1}^k f(a, L_{ai})_b = \sum_{a=1}^k L_{ai} \mathbb{I}\{a = b\} = L_{bi}$ and thus $f \in \mathcal{E}^{\text{vec}}$. Then,
using that $\exp(-x) + x - 1 \le x^2/2$ for $x \ge 0$,
$$\operatorname{opt}^*_q(\eta) \le \max_{i\in[d]} \left(\frac{(p-q)^\top L e_i}{\eta} + \frac{1}{\eta^2}\sum_{a=1}^k p_a \Psi_q\!\left(\frac{\eta f(a, \Phi_{ai})}{p_a}\right)\right) \le \max_{i\in[d]} \frac{1}{2}\sum_{a=1}^k \sum_{b=1}^k p_a\, \frac{q_b L_{ai}^2\, \mathbb{I}\{a = b\}}{p_a^2} \le \frac{k}{2}\,.$$
The full information game is similar. As before, let $p = q$, but choose $f(a, \sigma)_b = p_a L_{b\sigma}$. We have $\sum_{a=1}^k f(a, \Phi_{ai})_b = \sum_{a=1}^k f(a, i)_b = \sum_{a=1}^k p_a L_{bi} = L_{bi}$ and thus $f \in \mathcal{E}^{\text{vec}}$. Then, again using that $\exp(-x) + x - 1 \le x^2/2$ for $x \ge 0$,
$$\operatorname{opt}^*_q(\eta) \le \max_{i\in[d]} \frac{1}{2}\sum_{a=1}^k \sum_{b=1}^k p_a q_b L_{bi}^2 \le \frac{1}{2}\,.$$
38.2 The solution to Part (b) is immediate from Part (a), so we only show the solution to Part (a). Abbreviate $P_{\pi\mu}$ to $P$ and $P_{\pi'\mu}$ to $P'$. Let $\pi' = (\pi_1', \pi_2', \dots)$ be the Markov policy to be constructed.
38.4 We show that $D(M) < \infty$ implies that $M$ is strongly connected and that $D(M) = \infty$ implies that $M$ is not strongly connected. Assume first that $D(M) < \infty$. Take any $s, s' \in \mathcal{S}$. Assume first that $s \ne s'$. By definition, there is a policy whose expected travel time from state $s$ to $s'$ is finite. Take this policy. It follows that this policy reaches state $s'$ from state $s$ with positive probability, because otherwise the expected travel time would be infinite. Formally, if $T$ is the random travel time of a policy whose expected travel time between $s$ and $s'$ is finite, $\{T = \infty\}$ is the event that the policy does not reach state $s'$. Now, for any $n \in \mathbb{N}$, $T > n\mathbb{I}\{T = \infty\}$. Taking expectations and reordering gives $\mathbb{E}[T]/n > \mathbb{P}(T = \infty)$. Letting $n \to \infty$, we see that $\mathbb{P}(T = \infty) = 0$ (thus the policy in fact reaches state $s'$ with probability one). It remains to consider the case when $s = s'$. If the MDP has a single state, it is strongly connected by definition. Otherwise, there exists a state $s'' \in \mathcal{S}$ that is distinct from $s = s'$. Since $D(M)$ is finite, there is a policy that reaches $s''$ from $s$ with positive probability and another one that reaches $s$ again from $s''$ with positive probability. Composing these two policies in the obvious way gives a policy that travels from $s$ to $s$ with positive probability.
Assume now that $D(M) = \infty$, while $M$ is strongly connected (proof by contradiction). Since $M$ is strongly connected, for any $s, s'$ there is a policy that has a positive probability of reaching $s'$ from $s$. But this means that the uniformly random policy (the policy which chooses uniformly at random between the actions at any state) also has a positive probability of reaching any state from any other state. We claim that the expected travel time of this policy is finite between any pair of states. Indeed, this follows by noticing that the states under this policy form a time-homogeneous Markov chain whose transition probability matrix is irreducible, and the hitting times in such a Markov chain, which coincide with the expected travel times in the MDP for this policy, are finite.
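The equivalence proved here (strong connectedness of $M$ is the same as irreducibility of the chain induced by the uniformly random policy) gives a simple computational test. A sketch, assuming the MDP is given as a hypothetical $S \times A \times S$ array of transition probabilities:

```python
import numpy as np

def strongly_connected(P):
    """P[s, a] is the next-state distribution for action a in state s."""
    P_unif = P.mean(axis=1)           # chain of the uniformly random policy
    S = P_unif.shape[0]
    reach = ((np.eye(S) + P_unif) > 0).astype(float)
    for _ in range(S):                # transitive closure by squaring
        reach = ((reach @ reach) > 0).astype(float)
    return bool(reach.all())          # every state reaches every other
```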
38.5 We follow the advice of the hint. For the second part, note that the minimum in the definition of $d^*(\mu_0, U)$ is attained when $n_k$ is maximised for small indices until $|U|$ is exhausted. In particular, if $(n_k)_{0\le k\le m}$ denotes the optimal solution ($n_k = 0$ for $k > m$) then $n_0 = A^0, \dots, n_{m-1} = A^{m-1}$ and
$$0 \le n_m = |U| - \sum_{k=0}^{m-1} A^k = |U| - \frac{A^m - 1}{A - 1} < A^m\,.$$
Hence, $|U| < A^m + \frac{A^m - 1}{A - 1} \le 2A^m$, implying that
$$d^*(\mu_0, U) = \sum_{k=0}^{m-1} k A^k + m n_m = m|U| + \sum_{k=0}^{m-1} (k - m) A^k = m|U| + \frac{m}{A-1} - \frac{A^{m+1} - A}{(A-1)^2} \ge |U|\left(m - 1 - \frac{1}{A-1}\right) \ge |U|\big(\log_A(|U|) - 3\big)\,,$$
where the last inequality follows since $|U| < 2A^m \le A^{m+1}$. Choosing $U = \mathcal{S}$, we see that the expected minimum time to reach a random state in $\mathcal{S}$ is lower bounded by $\log_A(S) - 3$. The expected minimum time to reach an arbitrary state in $\mathcal{S}$ must also be above this quantity, proving the desired result.
38.7
(a) $A_n \mathbf{1} = \frac{1}{n}\sum_{t=0}^{n-1} P^t \mathbf{1} = \mathbf{1}$, which means that $A_n$ is right stochastic.
(c) Let $(B_n)$ and $(C_n)$ be convergent subsequences of $(A_n)$ with $\lim_{n\to\infty} B_n = B$ and $\lim_{n\to\infty} C_n = C$. It suffices to show that $B = C$. From Part (b), $B_m + \frac{1}{n_m}(P^{n_m} - I) = B_m P = P B_m$. Taking the limit as $m$ tends to infinity and using the fact that $P^n$ is $[0,1]$-valued we see that $B = BP = PB$. Similarly, $C = CP = PC$ and it follows that $B = BP^i = P^i B$ and $C = CP^i = P^i C$ hold for any $i \ge 0$. Hence $B = BC_m = C_m B$ and $C = CB_m = B_m C$ for any $m \ge 1$. Taking the limit as $m$ tends to infinity shows that $B = BC = CB$ and $C = CB = BC$, which together imply that $B = C$.
(d) We have already seen in the proof of Part (c) that $P^* = P^*P = PP^*$. From this it follows that $P^* = P^*P^i$ for any $i \ge 0$, which implies that $P^* = P^*A_n$ holds for any $n \ge 1$. Taking the limit shows that $P^* = P^*P^*$.
(e) Let $B = P - P^*$. Then
$$I - \frac{1}{n}\sum_{i=1}^n P^i + P^* = (I - B)H_n\,, \tag{38.1}$$
where $H_n = \frac{1}{n}\sum_{i=1}^n \sum_{k=0}^{i-1} (P - P^*)^k$. The limit of the left-hand side of (38.1) exists and is equal to the identity matrix $I$. Hence the limit of the right-hand side also exists and in particular the limit of $H_n$ must exist. Denoting this by $H_\infty$ we find that $I = (I - B)H_\infty$ and thus $I - B$ is invertible and its inverse $H$ is equal to $H_\infty$.
(f) Let $U_n = \frac{1}{n}\sum_{i=1}^n \sum_{k=0}^{i-1} (P^k - P^*)$. Then
$$B^k = \begin{cases} I\,, & \text{if } k = 0; \\ P^k - P^*\,, & \text{otherwise}. \end{cases}$$
Using this we calculate $H_n - U_n = \frac{1}{n}\sum_{i=1}^n (P - P^*)^0 - \frac{1}{n}\sum_{i=1}^n (P^0 - P^*) = I - I + P^* = P^*$. Hence $H - \lim_{n\to\infty} U_n = P^*$. From the definition of $U$ we have $U = \lim_{n\to\infty} U_n$.
(g) Since $\rho = P^* r$ we have $P^k(r - \rho) = (P^k - P^*)r$, so that
$$\lim_{n\to\infty} \frac{1}{n}\sum_{i=1}^n \sum_{k=0}^{i-1} P^k(r - \rho) = \lim_{n\to\infty} \frac{1}{n}\sum_{i=1}^n \sum_{k=0}^{i-1} (P^k - P^*)\, r = U r\,.$$
(h) One way to prove this is to note that by the previous part $v = Ur = (H - P^*)r$, hence $r = H^{-1}(v + \rho)$. Now, $H^{-1}(v+\rho) = (I - P + P^*)(v+\rho) = v - Pv + P^*v + \rho - P\rho + P^*\rho = v - Pv + \rho$, where we used that $P^*v = P^*Ur$ with $P^*U = P^*(H - P^*) = P^*H - P^* = (P^* - P^*H^{-1})H = 0$, and that $P\rho = PP^*r = P^*r = \rho$ and $P^*\rho = P^*P^*r = P^*r = \rho$.
Alternatively, the following direct argument also works. In this argument we only use that $v$ is well defined. Let $v_n = \sum_{t=0}^{n-1} P^t(r - \rho)$ and $\bar v_n = \frac{1}{n}\sum_{i=1}^n v_i$. Note that $\lim_{n\to\infty} \bar v_n = v$. Then $v_{k+1} = P v_k + (r - \rho)$. Taking the average of these over $k = 1, \dots, n$ we get
$$\frac{1}{n}\big((n+1)\bar v_{n+1} - v_1\big) = P \bar v_n + (r - \rho)\,.$$
Taking the limit of both sides proves that $v = Pv + r - \rho$, which, after reordering, gives $v + \rho = r + Pv$.
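The objects in Parts (d) to (h) are all computable for a concrete chain, which makes the identities easy to sanity check. In the sketch below ($P$ and $r$ are illustrative choices), $P^*$ is approximated by a long Cesàro average, $H = (I - P + P^*)^{-1}$, $U = H - P^*$, and the Poisson equation of Part (h) is verified.

```python
import numpy as np

P = np.array([[0.9, 0.1],
              [0.4, 0.6]])   # illustrative transition matrix
r = np.array([1.0, 0.0])     # illustrative rewards
I = np.eye(2)

powers, acc, n = I.copy(), np.zeros_like(P), 20000
for _ in range(n):            # P* = lim (1/n) sum_{t<n} P^t (Cesaro average)
    acc += powers
    powers = powers @ P
P_star = acc / n

H = np.linalg.inv(I - (P - P_star))   # Part (e)
U = H - P_star                        # Part (f)
rho = P_star @ r                      # gain
v = U @ r                             # bias, Part (g)
print(np.allclose(v + rho, r + P @ v))   # Poisson equation, Part (h)
```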
38.8
(a) First note that $|\max_x f(x) - \max_y g(y)| \le \max_x |f(x) - g(x)|$. Then for $v, w \in \mathbb{R}^{\mathcal{S}}$,
$$\|T_\gamma v - T_\gamma w\|_\infty \le \gamma \|v - w\|_\infty\,.$$
(b) This follows immediately from the Banach fixed point theorem, which also guarantees the uniqueness of a value function $v$ satisfying $v = T_\gamma v$.
(c) Recall that the greedy policy is $\pi(s) = \operatorname{argmax}_a r_a(s) + \gamma\langle P_a(s), v\rangle$.
(e) If $\pi$ is a memoryless policy, it is trivial to see that $v_\gamma^\pi = r_\pi + \gamma P_\pi v_\gamma^\pi$. Let $\pi^*$ be the greedy policy with respect to $v$, the unique solution of $v = T_\gamma v$. By the previous part of this exercise, it follows that $v_\gamma^{\pi^*} = v$. By Exercise 38.2, it suffices to show that for any Markov policy $\pi$, $v_\gamma^\pi \le v$. If $\pi_t$ is the memoryless policy used in time step $t$ when following $\pi$, then $v_\gamma^\pi = \sum_{t=1}^\infty \gamma^{t-1} P^{(t-1)} r_{\pi_t}$, where $P^{(0)} = I$ and for $t \ge 1$, $P^{(t)} = P_{\pi_1} \cdots P_{\pi_t}$. For $n \ge 1$, let $v_{\gamma,n}^\pi = \sum_{t=1}^n \gamma^{t-1} P^{(t-1)} r_{\pi_t}$. It is easy to see that $v_{\gamma,1}^\pi = r_{\pi_1} \le T_\gamma 0$. Assume that for some $n \ge 1$,
$$\sup_{\pi \text{ Markov}} v_{\gamma,n}^\pi \le T_\gamma^n 0\,. \tag{38.2}$$
Then $v_{\gamma,n+1}^\pi = r_{\pi_1} + \gamma P_{\pi_1} v_{\gamma,n}^{\pi'} \le T_\gamma(T_\gamma^n 0) = T_\gamma^{n+1} 0$, where $\pi'$ is the policy $\pi$ shifted by one step (that is, if $\pi = (\pi_1, \pi_2, \pi_3, \dots)$, then $\pi' = (\pi_2, \pi_3, \dots)$). This shows that (38.2) holds for all $n \ge 1$. Letting $n \to \infty$, the right-hand side converges to $v$, while the left-hand side converges to $v_\gamma^\pi$. Hence, $v_\gamma^\pi \le v$.
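Part (e) is the usual justification for discounted value iteration: since $T_\gamma$ is a $\gamma$-contraction, iterating it from $0$ converges to the optimal value function, and a greedy policy is read off from the maximiser. A small sketch with illustrative $r$, $P$ and $\gamma$ (not from the text):

```python
import numpy as np

r = np.array([[0.0, 1.0],    # r[s, a]
              [0.5, 0.0]])
P = np.array([[[1.0, 0.0], [0.0, 1.0]],    # P[s, a] next-state distribution
              [[0.5, 0.5], [1.0, 0.0]]])
gamma = 0.9

v = np.zeros(2)
for _ in range(1000):
    q = r + gamma * np.einsum('sat,t->sa', P, v)   # Q-values
    v = q.max(axis=1)                              # v = T_gamma v
print("v:", v, "greedy policy:", q.argmax(axis=1))
```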
38.9
(a) Let $0 \le \gamma < 1$. Algebra gives
$$P_\gamma^* P = (1-\gamma)\sum_{t=0}^\infty \gamma^t P^t P = \frac{P_\gamma^* - (1-\gamma)I}{\gamma}\,.$$
Hence $\gamma P_\gamma^* P = P_\gamma^* - (1-\gamma)I$. It is easy to check that $P_\gamma^*$ is right stochastic. By the compactness of the space of right stochastic matrices, $(P_\gamma^*)_\gamma$ has at least one cluster point $A$ as $\gamma \to 1^-$. It follows that $AP = A$, which implies that $AP^* = A$. Now, $P_\gamma^* P^* = (1-\gamma)(I - \gamma P)^{-1} P^* = P^*$, which implies that $P^* = AP^* = A$. Since this holds for any cluster point we conclude that $\lim_{\gamma\to1^-} P_\gamma^* = P^*$.
(b) The result is completed by taking the limit as $\gamma$ tends to one from below and using Part (a).
38.10
(a) Let $(\gamma_n)$ be an arbitrary increasing sequence with $\gamma_n < 1$ for all $n$ and $\lim_{n\to\infty} \gamma_n = 1$. Let $v_n$ be the fixed point of $T_{\gamma_n}$ and $\pi_n$ be the greedy policy with respect to $v_n$. Since greedy policies are always deterministic and there are only finitely many deterministic policies, it follows that there exists a subsequence $n_1 < n_2 < \cdots$ and a policy $\pi$ such that $\pi_{n_k} = \pi$ for all $k$.
(b) For arbitrary $\pi$ and $0 \le \gamma < 1$, let $v_\gamma^\pi$ be the value function of policy $\pi$ in the $\gamma$-discounted MDP, $v^\pi$ the value function of $\pi$ and $\rho^\pi$ its gain. Let $U_\pi$ be the deviation matrix underlying $P_\pi$. Define $f_\gamma^\pi$ by
$$v_\gamma^\pi = \frac{\rho^\pi}{1-\gamma} + v^\pi + f_\gamma^\pi\,. \tag{38.3}$$
By Part (b) of Exercise 38.9, and because $\rho^\pi = P_\pi^* r_\pi$ and $v^\pi = U_\pi r_\pi$, it holds that $\|f_\gamma^\pi\|_\infty \to 0$ as $\gamma \to 1$.
Fix now $\pi$ to be the policy whose existence is guaranteed by the previous part. By Part (e) of Exercise 38.8, $\pi$ is $\gamma_n$-discount optimal for all $n \ge 1$. Suppose that $\rho^\pi$ is not constant. In any case, $\rho^\pi$ is piecewise constant on the recurrent classes of the Markov chain with transition probabilities $P_\pi$. Let $\rho_*^\pi = \max_{s\in\mathcal{S}} \rho^\pi(s)$. Let $R \subset \mathcal{S}$ be the recurrent class in this Markov chain where $\rho^\pi$ is the largest and take a policy $\pi'$ that is identical to $\pi$ over $R$, while $\pi'$ is set up such that it gets to $R$ with probability one. Such a $\pi'$ exists because the MDP is strongly connected. Fix any $s \in \mathcal{S} \setminus R$. We claim that there exists some $\gamma^* \in (0, 1)$ such that for all $\gamma \ge \gamma^*$,
$$v_\gamma^{\pi'}(s) > v_\gamma^\pi(s)\,. \tag{38.4}$$
If this was true and $n$ is large enough so that $\gamma_n \ge \gamma^*$ then, since $\pi$ is $\gamma_n$-discount optimal, $v_{\gamma_n}^\pi(s) \ge v_{\gamma_n}^{\pi'}(s) > v_{\gamma_n}^\pi(s)$, which is a contradiction.
Hence, it remains to show (38.4). By the construction of $\pi'$, $\rho^{\pi'}(s) = \rho_*^\pi > \rho^\pi(s)$. From (38.3),
$$v_\gamma^{\pi'}(s) = \frac{\rho^{\pi'}(s)}{1-\gamma} + v^{\pi'}(s) + f_\gamma^{\pi'}(s) > \frac{\rho^\pi(s)}{1-\gamma} + v^\pi(s) + f_\gamma^\pi(s) = v_\gamma^\pi(s)\,,$$
where the inequality follows by taking $\gamma \ge \gamma^*$ for a suitable $\gamma^* \in (0, 1)$. The existence of an appropriate $\gamma^*$ follows because $f_\gamma^\pi(s), f_\gamma^{\pi'}(s) \to 0$, while $1/(1-\gamma) \to \infty$ as $\gamma \to 1$.
(c) Since $v$ is the value function of $\pi$ and $\rho$ is its gain, by (38.4) we have $\rho\mathbf{1} + v = r_\pi + P_\pi v$. Let $\pi'$ be an arbitrary stationary policy and $v_n$ be as before. Let $f_n = f_{\gamma_n}$ where $f_\gamma$ is defined by (38.3), and note that $\|f_n\|_\infty \to 0$ as $n \to \infty$. Taking the limit as $n$ tends to infinity and rearranging shows that
$$\rho\mathbf{1} + v \ge r_{\pi'} + P_{\pi'} v\,,$$
with equality when $\pi' = \pi$:
$$\rho\mathbf{1} + v = r_\pi + P_\pi v\,. \tag{38.5}$$
Since $\pi'$ was arbitrary, $\rho + v(s) \ge \max_a r_a(s) + \langle P_a(s), v\rangle$ holds for all $s \in \mathcal{S}$. This combined with (38.5) shows that the pair $(\rho, v)$ satisfies the Bellman optimality equation as required.
38.11 Clearly the optimal policy is to take action stay in any state and this policy has gain $\rho^* = 0$. Pick any solution $(\rho, v)$ to the Bellman optimality equations. Then $\rho = \rho^* = 0$ by Theorem 38.2. The Bellman optimality equation for state 1 is $v(1) = \max(v(1), -1 + v(2))$, which is equivalent to $v(1) \ge -1 + v(2)$. Similarly, the Bellman optimality equation for state 2 is equivalent to $v(2) \ge -1 + v(1)$. Thus the set of solutions is a subset of
$$\{(0, v) : v \in \mathbb{R}^2,\ |v(1) - v(2)| \le 1\}\,.$$
The same argument shows that any element of this set is a solution to the optimality equations.
38.12 Consider the deterministic MDP below with two states and two actions, $\mathcal{A} = \{\text{solid}, \text{dashed}\}$.
[Figure: a two-state deterministic MDP on states 1 and 2 with actions solid and dashed; the transitions at state 1 have reward $r = 0$ and those at state 2 have reward $r = 1$.]
Clearly the optimal policy is to choose $\pi(1) = \text{solid}$ and $\pi(2)$ arbitrarily, which leads to a gain of 1. On the other hand, choosing $\rho = 1$ and $v = (2, 1)$ satisfies the linear program in Eq. (38.6), and the greedy policy with respect to this value function chooses $\pi(1) = \text{dashed}$ and $\pi(2)$ arbitrarily.
38.13 Let $T : \mathbb{R}^{\mathcal{S}} \to \mathbb{R}^{\mathcal{S}}$ be defined by $(Tv)(s) = \max_a r_a(s) - \rho^* + \langle P_a(s), v\rangle$, so the Bellman optimality equation Eq. (38.5) can be written in the compact form $v = Tv$. Let $v \in \mathbb{R}^{\mathcal{S}}$ be a solution to Eq. (38.5). The proof follows from the definition of the diameter and by showing that for any states $s_1, s_2 \in \mathcal{S}$ and memoryless policy $\pi$ it holds that
$$v(s_1) \ge v(s_2) - \Big(\rho^* - \min_{s,a} r_a(s)\Big)\,\mathbb{E}[\tau_{s_2}]\,.$$
The remainder of the proof is devoted to proving this result for fixed $s_1, s_2 \in \mathcal{S}$ and memoryless policy $\pi$. Abbreviate $\tau = \tau_{s_2}$ and let $\mathbb{E}[\cdot]$ denote the expectation with respect to the measure induced by the interaction of $\pi$ and the MDP conditioned on $S_1 = s_1$. Since the result is trivial when $\mathbb{E}[\tau] = \infty$, for the remainder we assume that $\mathbb{E}[\tau] < \infty$. Define the operator $\bar T : \mathbb{R}^{\mathcal{S}} \to \mathbb{R}^{\mathcal{S}}$ by
$$(\bar T u)(s) = \begin{cases} \min_{s',a} r_a(s') - \rho^* + \langle P_\pi(s), u\rangle\,, & \text{if } s \ne s_2; \\ v(s_2)\,, & \text{otherwise}. \end{cases}$$
Since $r_\pi(s) - \rho^* \ge \min_{s',a} r_a(s') - \rho^*$ and $Tv = v$, it follows that $(\bar T v)(s) \le (Tv)(s) = v(s)$. Notice that for $u \le w$ it holds that $\bar T u \le \bar T w$. Then by induction we have $\bar T^n v \le v$ for all $n \in \mathbb{N}^+$. By unrolling the recurrence we have
$$v(s_1) \ge (\bar T^n v)(s_1) = \mathbb{E}\Big[-\Big(\rho^* - \min_{s',a} r_a(s')\Big)(n \wedge \tau) + v(S_{\tau\wedge n})\Big]\,.$$
Taking the limit as $n$ tends to infinity shows that $v(s_1) \ge v(s_2) - (\rho^* - \min_{s,a} r_a(s))\mathbb{E}[\tau]$, which completes the result.
38.14
(a) It is clear that Algorithm 27 returns true if and only if $(\rho, v)$ is feasible for Eq. (38.6). Note that feasibility can be written in the compact form $\rho\mathbf{1} + v \ge Tv$. It remains to show that when $(\rho, v)$ is not feasible then $u = (1, e_s - P_{a_s^*}(s))$ is such that for any feasible $(\rho', v')$, $\langle(\rho', v'), u\rangle > \langle(\rho, v), u\rangle$. For this, we have $\langle(\rho', v'), u\rangle = \rho' + v'(s) - \langle P_{a_s^*}(s), v'\rangle \ge r_{a_s^*}(s)$, where the inequality used that $(\rho', v')$ is feasible. Further, $\langle(\rho, v), u\rangle = \rho + v(s) - \langle P_{a_s^*}(s), v\rangle < r_{a_s^*}(s)$, by the construction of $u$. Putting these together gives the result.
(b) Relax the constraint that v(s̃) = 0 to −ε ≤ v(s̃) ≤ ε. Then add the ε of slack to the first
constraint of Eq. (38.7) and add the additional constraints used in Eq. (38.9). Now the ellipsoid
method can be applied as for Eq. (38.9).
38.15 Let $\phi_k(x) = \text{true}$ if $\langle a_k, x\rangle \ge b_k$ and $\phi_k(x) = a_k$ otherwise. Then the new separation oracle returns true if $\phi(x)$ and $\phi_k(x)$ are true for all $k$. Otherwise it returns the separating hyperplane provided by some $\phi$ or $\phi_k$ that did not return true.
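Chaining oracles this way is a one-liner in code. A sketch of the composite oracle, under the assumption that $x$ and the $a_k$ are numpy arrays and `phi` is a hypothetical base oracle:

```python
def separation_oracle(x, phi, halfspaces):
    """phi(x) returns True or a separating hyperplane for the base set;
    halfspaces is a list of pairs (a_k, b_k) encoding <a_k, x> >= b_k."""
    for a, b in halfspaces:     # the phi_k oracles
        if a @ x < b:
            return a            # a violated halfspace separates x
    return phi(x)               # True, or a separating hyperplane from phi
```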
38.16
(a) Let $\pi = (\pi_1, \pi_2, \dots)$ be an arbitrary Markov policy where $\pi_t$ is the policy followed in time step $t$. Using the notation and techniques from the proof of Theorem 38.2,
$$P_\pi^{(t-1)} r_{\pi_t} = P_\pi^{(t-1)}(r_{\pi_t} + P_{\pi_t} v - P_{\pi_t} v) \le P_\pi^{(t-1)}(r_{\tilde\pi} + P_{\tilde\pi} v - P_{\pi_t} v) \le P_\pi^{(t-1)}\big((\rho + \varepsilon)\mathbf{1} + v - P_{\pi_t} v\big) = (\rho + \varepsilon)\mathbf{1} + P_\pi^{(t-1)} v - P_\pi^{(t)} v\,.$$
Taking the average and then the limit shows that $\bar\rho^\pi(s) \le \rho + \varepsilon$ for all $s \in \mathcal{S}$. By the claim in Exercise 38.2, $\rho^* \le \rho + \varepsilon$.
(b) We have $r_{\tilde\pi}(s) + \langle P_{\tilde\pi(s)}(s), v\rangle \ge \max_a r_a(s) + \langle P_a(s), v\rangle - \varepsilon' \ge \rho + v(s) - (\varepsilon + \varepsilon')$. Therefore,
$$P_{\tilde\pi}^{t-1} r_{\tilde\pi} = P_{\tilde\pi}^{t-1}(r_{\tilde\pi} + P_{\tilde\pi} v - P_{\tilde\pi} v) \ge P_{\tilde\pi}^{t-1}\big((\rho - (\varepsilon + \varepsilon'))\mathbf{1} + v - P_{\tilde\pi} v\big)\,.$$
Taking the average and the limit again shows that $\rho^{\tilde\pi}(s) \ge \rho - (\varepsilon + \varepsilon')$. The claim follows from combining this with the previous result.
(c) Let $\varepsilon = v + \rho\mathbf{1} - Tv$, which by the first constraint satisfies $\varepsilon \ge 0$. Let $\pi^*$ be an optimal policy satisfying the requirements of the theorem statement and $\pi$ be the greedy policy with respect to $v$. Then $\rho^*\mathbf{1} = \rho^{\pi^*}\mathbf{1} \le \rho\mathbf{1} - P_{\pi^*}^* \varepsilon$, which means that $P_{\pi^*}^* \varepsilon \le \delta\mathbf{1}$. By the definition of $\tilde s$ there exists a state $s$ with $P_{\pi^*}^*(s, \tilde s) \ge 1/|\mathcal{S}|$, and so $\varepsilon(\tilde s) \le \delta|\mathcal{S}|$. Notice that $\tilde v = v - \varepsilon + \varepsilon(\tilde s)\mathbf{1}$ also satisfies the constraints in Eq. (38.7) and hence
$$\langle v - \varepsilon + \varepsilon(\tilde s)\mathbf{1}, \mathbf{1}\rangle \ge \langle v, \mathbf{1}\rangle\,,$$
which implies that $\langle \varepsilon, \mathbf{1}\rangle \le |\mathcal{S}|\varepsilon(\tilde s) \le |\mathcal{S}|^2\delta$. Hence $(\rho, v)$ approximately satisfies the Bellman optimality equation.
38.17 Define the operator $T : \mathbb{R}^{\mathcal{S}} \to \mathbb{R}^{\mathcal{S}}$ by $(Tu)(s) = \max_{a\in\mathcal{A}} r_a(s) + \langle P_a(s), u\rangle$, which is chosen so that $v_n^* = T^n 0$. Let $v$ be a solution to the Bellman optimality equation with $\min_s v(s) = 0$. Then $Tv = \rho^*\mathbf{1} + v$. The last inequality follows from the previous exercise and the assumption that $\min_s v(s) = 0$.
38.19
(a) There are four memoryless policies in this MDP. All are optimal except the policy $\pi$ that always chooses the dashed action.
(b) Whenever $T_{k-1}(s, a) \ge 1$ the transition estimates satisfy $\hat P_{k-1,a}(s) = P_a(s)$. Let $S_t'$ be the state with $S_t \ne S_t'$. Suppose that $T_{k-1}(S_t, \text{stay}) > T_{k-1}(S_t', \text{stay})$ and $\tilde r_{k,\text{stay}}(S_t) < 1$. Then $\tilde r_{k,\text{stay}}(S_t') > \tilde r_{k,\text{stay}}(S_t)$. Once this occurs the optimal policy in the optimistic MDP is to choose action go. It follows easily that once $t$ is sufficiently large the algorithm will alternate between choosing actions go and stay and subsequently suffer linear regret. Note that the uncertainty in the transitions does not play a big role here. In the optimistic MDP they will always be chosen to maximise the probability of transitioning to the state with the largest optimistic reward.
38.21 Here we abuse notation by letting $\hat P_{u,a}(s)$ be the empirical next-state transitions after $u$ visits to state-action pair $(s, a)$. By a union bound and the result in Exercise 5.17,
$$\mathbb{P}(F) \le \mathbb{P}\left(\exists u \in \mathbb{N},\ (s, a) \in \mathcal{S}\times\mathcal{A} : \|P_a(s) - \hat P_{u,a}(s)\|_1 \ge \sqrt{\frac{2S\log(4SAu(u+1)/\delta)}{u}}\right) \le \sum_{(s,a)\in\mathcal{S}\times\mathcal{A}} \sum_{u=1}^\infty \mathbb{P}\left(\|P_a(s) - \hat P_{u,a}(s)\|_1 \ge \sqrt{\frac{2S\log(4SAu(u+1)/\delta)}{u}}\right) \le \sum_{(s,a)\in\mathcal{S}\times\mathcal{A}} \sum_{u=1}^\infty \frac{\delta}{2SAu(u+1)} = \frac{\delta}{2}\,.$$
The statement in Exercise 5.17 makes an independence assumption that is not exactly satisfied here.
We are saved by the Markov property, which provides the conditional independence required.
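The union-bounded event can also be explored by simulation. The sketch below draws multinomial samples from an illustrative next-state distribution and records how often the $L_1$ deviation exceeds the threshold above; all parameter values are arbitrary illustrations.

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, delta = 4, 2, 0.1
p = np.array([0.5, 0.25, 0.15, 0.1])   # illustrative next-state distribution

failures, trials = 0, 0
for _ in range(2000):
    for u in (10, 100, 1000):
        counts = rng.multinomial(u, p)
        dev = np.abs(counts / u - p).sum()                     # L1 deviation
        thresh = np.sqrt(2 * S * np.log(4 * S * A * u * (u + 1) / delta) / u)
        failures += dev >= thresh
        trials += 1
print("empirical failure rate:", failures / trials)  # far below delta / 2
```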
38.22 If $\sum_{k=1}^{m-1} a_k \le 1$, then $A_0 = A_1 = \cdots = A_{m-1} = 1$. Hence $a_m \le 1$ also holds and using $A_m \ge 1$,
$$\sum_{k=1}^m \frac{a_k}{\sqrt{A_{k-1}}} = \sum_{k=1}^{m-1} a_k + a_m \le 1 + 1 \le (\sqrt{2} + 1)\sqrt{A_m}\,.$$
Let us now use induction on $m$. As long as $m$ is such that $\sum_{k=1}^{m-1} a_k \le 1$, the previous argument covers us. Thus, consider any $m > 1$ such that $\sum_{k=1}^{m-1} a_k > 1$ and assume that the statement holds for $m - 1$ (note that $m = 1$ implies $\sum_{k=1}^{m-1} a_k \le 1$). Let $c = \sqrt{2} + 1$. Then,
$$\begin{aligned}
\sum_{k=1}^m \frac{a_k}{\sqrt{A_{k-1}}} &\le c\sqrt{A_{m-1}} + \frac{a_m}{\sqrt{A_{m-1}}} && \text{(split sum, induction hypothesis)} \\
&= \sqrt{c^2 A_{m-1} + 2ca_m + \frac{a_m^2}{A_{m-1}}} \\
&\le \sqrt{c^2 A_{m-1} + (2c+1)a_m} && (a_m \le A_{m-1}) \\
&= c\sqrt{A_{m-1} + a_m} && \text{(choice of } c\text{)} \\
&= c\sqrt{A_m}\,. && (A_{m-1} \ge 1 \text{ and definition of } A_m)
\end{aligned}$$
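The inequality also survives a brute-force check. The sketch below samples random sequences obeying the assumptions used in the proof ($a_k \ge 0$, $a_k \le A_{k-1}$ and $A_k = \max\{1, a_1 + \cdots + a_k\}$) and asserts the bound; the sampling scheme itself is an arbitrary illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
for _ in range(1000):
    total, A_prev, lhs = 0.0, 1.0, 0.0
    for _ in range(int(rng.integers(1, 20))):
        ak = rng.uniform(0, A_prev)      # respects a_k <= A_{k-1}
        lhs += ak / np.sqrt(A_prev)
        total += ak
        A_prev = max(1.0, total)         # A_k = max(1, a_1 + ... + a_k)
    assert lhs <= (np.sqrt(2) + 1) * np.sqrt(A_prev) + 1e-9
print("all sampled sequences satisfy the bound")
```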
38.23 We only outline the necessary changes to the proof of Theorem 38.6. The first step is to augment the failure event to include the event that there exists a phase $k$ and state-action pair $(s, a)$ such that
$$|\tilde r_{k,a}(s) - r_a(s)| \ge \sqrt{\frac{2L}{T_t(s, a)}}\,.$$
The likelihood of this event is at most $\delta/2$ by Hoeffding's bound combined with a union bound. Like in the proof of Theorem 38.6 we now restrict our attention to the regret on the event that the failure does not occur. The first change is that $r_{\pi_k}(s)$ in Eq. (38.18) must be replaced with $\tilde r_{k,\pi_k}(s)$. Then the reward terms no longer cancel in Eq. (38.20), which means that now
$$\tilde R_k = \sum_{t\in E_k} \big({-v_k(S_t)} + \langle P_{k,A_t}(S_t), v_k\rangle + \tilde r_{k,A_t}(S_t) - r_{A_t}(S_t)\big) \le \sum_{t\in E_k} \big(\langle P_{A_t}(S_t), v_k\rangle - v_k(S_t)\big) + \frac{D}{2}\sum_{t\in E_k} \|P_{k,A_t}(S_t) - P_{A_t}(S_t)\|_1 + \sum_{t\in E_k} \sqrt{\frac{2L}{1 \vee T_{\tau_k-1}(S_t, A_t)}}\,.$$
The first two terms are the same as in the proof of Theorem 38.6 and are bounded in the same way, which results in the same contribution to the regret. Only the last term is new. Summing over all phases and applying the result from Exercise 38.22 and Cauchy-Schwarz,
$$\sum_{k=1}^K \sum_{t\in E_k} \sqrt{\frac{2L}{1 \vee T_{\tau_k-1}(S_t, A_t)}} = \sqrt{2L} \sum_{s\in\mathcal{S}} \sum_{a\in\mathcal{A}} \sum_{k=1}^K \frac{T_{(k)}(s, a)}{\sqrt{1 \vee T_{\tau_k-1}(s, a)}} \le (\sqrt{2} + 1)\sqrt{2LSAn}\,.$$
This term is small relative to the contribution due to the uncertainty in the transitions. Hence there exists a universal constant $C$ such that with probability at least $1 - 3\delta/2$ the regret of this modified algorithm is at most
$$\hat R_n \le C D(M)\, S \sqrt{nA \log\left(\frac{nSA}{\delta}\right)}\,.$$
38.24
(a) An easy calculation shows that the depth $d$ of the tree is bounded by $2 + \log_A S$, which by the conditions in the statement of the lower bound implies that $d + 1 \le D/2$. The diameter is the maximum over all distinct pairs of states of the expected travel time between those two states. It is not hard to see that this is maximised by the pair $s_g$ and $s_b$, so we restrict our attention to bounding the expected travel time between these two states under some policy. Let $\tau = \min\{t : S_t = s_b\}$ and let $\pi$ be a policy that traverses the tree to a decision state with $\varepsilon(s, a) = 0$. We will show that for this policy
$$\mathbb{E}[\tau \mid S_1 = s_g] \le D\,.$$
(b) The definition of the stopping time $\tau$ ensures that $T_\sigma \le n/D + 1 \le 2n/D$ almost surely and hence $D\mathbb{E}[T_\sigma]/n \le 2$ is immediate. For the second part note that
$$\mathbb{P}\left(\sum_{t=1}^{n/(2D)} D_t \ge \frac{n}{D}\right)$$
(c) We need to prove that $R_{nj} \ge c_3 \Delta D\, \mathbb{E}_j[T_\sigma - T_j]$, where $R_{nj}$ is the expected regret of $\pi$ in MDP $M_j$ over $n$ rounds and $c_3 > 0$ is a universal constant. The idea is to write the total reward incurred using episodes punctuated by visits to $s_0$ and note that the expected lengths of these episodes are the same regardless of the policy used.
For the formal argument, we start by rewriting the expected regret $R_{nj}$ in a more suitable form. For this we introduce episodes indexed by $k \in [n]$. The $k$th episode starts at the time $\tau_k$ when $s_0$ is visited the $k$th time (in particular, $\tau_1 = 0$). In the $k$th episode, a leaf is reached in time step $t = \tau_k + d$. Let $E_k$ be the indicator that the state-action pair of this time step is not the $j$th state-action pair in $L = \{(s_1, a_1), \dots, (s_p, a_p)\}$: $E_k = \mathbb{I}\{(S_{\tau_k+d}, A_{\tau_k+d}) \ne (s_j, a_j)\}$. Note that this is the indicator of a "regret-inducing choice".
In time steps $\mathcal{T}_k = \{\tau_k + d + 1, \dots, \tau_{k+1}\}$, the state is one of $s_g$ and $s_b$. Let $H_k = |\mathcal{T}_k \cap [n]|$ be the number of time steps in $\mathcal{T}_k$ that happen before round $n$ is over. For clarity, we add the subindex $\pi$ to $\mathbb{E}_j$ and $\mathbb{P}_j$ to make it explicit that these depend on $\pi$. Further, in MDP $M_j$, let the values of $\varepsilon(s, a)$ be denoted by $\varepsilon_j(s, a)$.
By construction, the total expected reward incurred up to time step $n$ by policy $\pi$ is
$$\begin{aligned}
V_j^\pi &:= \sum_{k=1}^n \mathbb{E}_{j,\pi}[H_k \mathbb{I}\{S_{\tau_k+d+1} = s_g\}] \\
&= \sum_{k=1}^n \sum_{(s,a)\in L} \mathbb{E}_{j,\pi}[H_k \mathbb{I}\{S_{\tau_k+d+1} = s_g\} \mid S_{\tau_k+d} = s, A_{\tau_k+d} = a]\, \mathbb{P}_{j,\pi}(S_{\tau_k+d} = s, A_{\tau_k+d} = a) && \text{(law of total probability)} \\
&= \sum_{k=1}^n \sum_{(s,a)\in L} \mathbb{E}_{j,\pi}[H_k]\, \mathbb{P}_{j,\pi}(S_{\tau_k+d+1} = s_g \mid S_{\tau_k+d} = s, A_{\tau_k+d} = a)\, \mathbb{P}_{j,\pi}(S_{\tau_k+d} = s, A_{\tau_k+d} = a) && \text{(conditioning, Markov property)} \\
&= \sum_{k=1}^n \sum_{(s,a)\in L} \mathbb{E}_{j,\pi}[H_k]\left(\frac{1}{2} + \varepsilon_j(s, a)\right) \mathbb{P}_{j,\pi}(S_{\tau_k+d} = s, A_{\tau_k+d} = a) && (M_j \text{ definition}) \\
&= \sum_{k=1}^n \mathbb{E}_{j,\pi}[H_k] \sum_{p'\ne j} \frac{1}{2}\, \mathbb{P}_{j,\pi}(S_{\tau_k+d} = s_{p'}, A_{\tau_k+d} = a_{p'}) + \left(\frac{1}{2} + \Delta\right)\sum_{k=1}^n \mathbb{E}_{j,\pi}[H_k]\, \mathbb{P}_{j,\pi}(S_{\tau_k+d} = s_j, A_{\tau_k+d} = a_j)\,.
\end{aligned}$$
Now, note that $\mathbb{E}_{j,\pi}[H_k] = \lambda_k$, regardless of the policy $\pi$ and index $j$. If $\pi_j^*$ is the optimal policy for MDP $M_j$, then $\mathbb{P}_{j,\pi_j^*}(S_{\tau_k+d} = s_j, A_{\tau_k+d} = a_j) = 1$. Let $\rho_k^\pi = \mathbb{P}_{j,\pi}((S_{\tau_k+d}, A_{\tau_k+d}) \ne (s_j, a_j))$. Hence,
$$\begin{aligned}
R_{nj} \ge V_j^{\pi_j^*} - V_j^\pi &= \left(\frac{1}{2} + \Delta\right)\sum_{k=1}^n \lambda_k - \frac{1}{2}\sum_{k=1}^n \lambda_k \rho_k^\pi - \left(\frac{1}{2} + \Delta\right)\sum_{k=1}^n \lambda_k (1 - \rho_k^\pi) \\
&= \sum_{k=1}^n \lambda_k \left(\left(\frac{1}{2} + \Delta\right) - \frac{1}{2}\rho_k^\pi - \left(\frac{1}{2} + \Delta\right)(1 - \rho_k^\pi)\right) \\
&= \Delta \sum_{k=1}^n \lambda_k \rho_k^\pi \ge \Delta \sum_{k=1}^{m-1} \lambda_k \rho_k^\pi \ge c\Delta D \sum_{k=1}^{m-1} \rho_k^\pi\,,
\end{aligned}$$
where $m = \lceil n/D - 1\rceil$ and the last inequality uses that $\lambda_k \ge cD$ for $k \in [m-1]$ with some universal constant $c > 0$, the proof of which is left to the reader. Now, by definition,
$$\mathbb{E}_{j,\pi}[T_\sigma - T_j] = \sum_{k=1}^n \mathbb{P}_{j,\pi}\big((S_{\tau_k+d}, A_{\tau_k+d}) \ne (s_j, a_j),\ \tau_k + d < \tau\big) \le \sum_{k=1}^{m-1} \mathbb{P}_{j,\pi}\big((S_{\tau_k+d}, A_{\tau_k+d}) \ne (s_j, a_j)\big) = \sum_{k=1}^{m-1} \rho_k^\pi\,,$$
where the inequality is because $\tau = n \wedge \tau_m$ and thus for $k \ge m$, $\tau_k + d \ge \tau$. Putting together the last two inequalities finishes the proof.
Bibliography
V. I. Bogachev. Measure theory, volume 2. Springer Science & Business Media, 2007.
T. Lattimore and Cs. Szepesvári. Learning with good feature representations in bandits and in RL with a generative model. arXiv:1911.07676, 2019.
G. Peskir and A. Shiryaev. Optimal stopping and free-boundary problems. Springer, 2006.
E. V. Slud. Distribution inequalities for the binomial law. The Annals of Probability, pages 404-412, 1977.