0% found this document useful (0 votes)
26 views118 pages

Sol Tor Csaba

This document provides solutions to selected exercises from the book "Bandit Algorithms" by Tor Lattimore and Csaba Szepesvári. It is organized by chapter and contains solutions to probability, stochastic processes, stochastic bandits, concentration of measure, multi-armed bandit algorithms like UCB, Exp3, Bayesian bandits, Thompson sampling, contextual bandits, stochastic linear bandits, and more.

Uploaded by

leeaxil
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
26 views118 pages

Sol Tor Csaba

This document provides solutions to selected exercises from the book "Bandit Algorithms" by Tor Lattimore and Csaba Szepesvári. It is organized by chapter and contains solutions to probability, stochastic processes, stochastic bandits, concentration of measure, multi-armed bandit algorithms like UCB, Exp3, Bayesian bandits, Thompson sampling, contextual bandits, stochastic linear bandits, and more.

Uploaded by

leeaxil
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 118

Solutions to Selected Exercises in Bandit Algorithms

Tor Lattimore and Csaba Szepesvári


Draft of Friday 11th September, 2020

Contents

2 Foundations of Probability 5
2.1, 2.3, 2.5, 2.6, 2.7, 2.8, 2.9, 2.10, 2.11, 2.12, 2.14, 2.15, 2.16, 2.18, 2.19
3 Stochastic Processes and Markov Chains 11
3.1, 3.5, 3.8, 3.9
4 Stochastic Bandits 13
4.9, 4.10
5 Concentration of Measure 14
5.10, 5.12, 5.13, 5.14, 5.15, 5.16, 5.17, 5.18, 5.19
6 The Explore-then-Commit Algorithm 22
6.2, 6.3, 6.5, 6.6, 6.8
7 The Upper Confidence Bound Algorithm 26
7.1
8 The Upper Confidence Bound Algorithm: Asymptotic Optimality 27
8.1
9 The Upper Confidence Bound Algorithm: Minimax Optimality 28
9.1, 9.4
10 The Upper Confidence Bound Algorithm: Bernoulli Noise 29
10.1, 10.3, 10.4, 10.5
11 The Exp3 Algorithm 35
11.2, 11.5, 11.6, 11.7
12 The Exp3-IX Algorithm 37
12.1, 12.4
13 Lower Bounds: Basic Ideas 39
13.2
14 Foundations of Information Theory 39
14.4, 14.10, 14.11
15 Minimax Lower Bounds 41
15.1
16 Instance-Dependent Lower Bounds 42
16.2, 16.7
17 High-Probability Lower Bounds 43
17.1

1
18 Contextual Bandits 44
18.1, 18.6, 18.7, 18.8, 18.9
19 Stochastic Linear Bandits 47
19.3, 19.4, 19.5, 19.6, 19.7, 19.8
20 Confidence Bounds for Least Squares Estimators 54
20.1, 20.2, 20.3, 20.4, 20.5, 20.8, 20.9, 20.10, 20.11
21 Optimal Design for Least Squares Estimators 58
21.1, 21.2, 21.3, 21.5
22 Stochastic Linear Bandits for Finitely Many Arms 60
23 Stochastic Linear Bandits with Sparsity 60
23.2
24 Minimax Lower Bounds for Stochastic Linear Bandits 60
24.1
25 Asymptotic Lower Bounds for Stochastic Linear Bandits 61
25.3
26 Foundations of Convex Analysis 61
26.2, 26.3, 26.9, 26.13, 26.14, 26.15
27 Exp3 for Adversarial Linear Bandits 64
27.1, 27.4, 27.6, 27.8, 27.9, 27.11
28 Follow-the-Regularised-Leader and Mirror Descent 69
28.1, 28.5, 28.10, 28.11, 28.12, 28.13, 28.14, 28.15, 28.16, 28.17
29 The Relation between Adversarial and Stochastic Linear Bandits 77
29.2, 29.4
30 Combinatorial Bandits 78
30.4, 30.5, 30.6, 30.8
31 Non-stationary Bandits 81
31.1, 31.3
32 Ranking 82
32.2, 32.6
33 Pure Exploration 84
33.3, 33.4, 33.5, 33.6, 33.7, 33.9
34 Foundations of Bayesian Learning 90
34.4, 34.5, 34.13, 34.14, 34.15, 34.16
35 Bayesian Bandits 95
35.1, 35.2, 35.3, 35.6, 35.7
36 Thompson Sampling 100
36.3, 36.5, 36.6, 36.13
37 Partial Monitoring 103
37.3, 37.10, 37.12, 37.13, 37.14
38 Markov Decision Processes 106
38.2, 38.4, 38.5, 38.7, 38.8, 38.9, 38.10, 38.11, 38.12, 38.13, 38.14, 38.15, 38.16, 38.17,
38.19, 38.21, 38.22, 38.23, 38.24

2
Bibliography 118

3
4
Chapter 2 Foundations of Probability

2.1 Let h = g ◦ f . Let A ∈ H. We need to show that h−1 (A) ∈ F. We claim that
h−1 (A) = f −1 (g −1 (A)). Because g is G/H-measurable, g −1 (A) ∈ G and thus because f is F/G-
measurable, f −1 (g −1 (A)) is F-measurable, thus completing the proof, once we show that the claim
holds. To show the claim, we show two-sided containment. For showing h−1 (A) ⊂ f −1 (g −1 (A)) let
x ∈ h−1 (A). Thus, h(x) ∈ A. By definition, h(x) = g(f (x)) ∈ A. Hence, f (x) ∈ g −1 (A) and thus
x ∈ f −1 (g −1 (A)). For the other direction let x ∈ f −1 (g −1 (A)). This implies that f (x) ∈ g −1 (A),
which implies that h(x) = g(f (x)) ∈ A.
2.3 Since X(u) ∈ V for all u ∈ U we have X −1 (V) = U. Therefore U ∈ ΣX . Suppose that
U ∈ ΣX , then by definition there exists a V ∈ Σ such that X −1 (V ) = U . Because ΣX is a
σ-algebra we have V c ∈ Σ and by definition of ΣX we have U c = X −1 (V c ) ∈ ΣX . Therefore
ΣX is closed under complements. Finally let (Ui )i be a countable sequence with Ui ∈ ΣX . Then
S
i Ui = X
−1 (∪ X(U )) ∈ Σ , which means that Σ is closed under countable unions and the proof
i i X X
is completed.
2.5

(a) Let A be the set of all σ-algebras that contain G and define
\
F∗ = F.
F ∈A

We claim that F ∗ is the smallest σ-algebra containing G. Clearly F ∗ contains G and is contained
in all σ-algebras containing G. Furthermore, by definition it contains exactly those A that are
in every σ-algebra that contains G. It remains to show that F ∗ is a σ-algebra. Since Ω ∈ F
for all F ∈ A it follows that Ω ∈ F ∗ . Now suppose that A ∈ F ∗ . Then A ∈ F for all F ∈ A
and Ac ∈ F for all F ∈ A. Therefore Ac ∈ F ∗ . Therefore F ∗ is closed under complements.
Finally, suppose that (Ai )i is a family in F ∗ . Then (Ai )i are families in F for all F ∈ A and so
S S
i Ai ∈ F for all F ∈ A and again we have i Ai ∈ F . Therefore F is a σ-algebra.
∗ ∗


(b) Define H = A : X −1 (A) ∈ F . Then Ω ∈ H and for A ∈ H we have X −1 (Ac ) = X −1 (A)c so
Ac in H. Furthermore, for (Ai )i with Ai ∈ H we have
!
[ [
X −1 Ai = X −1 (Ai ) .
i i

Therefore H is a σ-algebra on Ω and by definition σ(G) ⊆ H. Now for any A ∈ H we have


f −1 (A) ∈ F by definition. Therefore f −1 (A) ∈ F for all A ∈ σ(G).

(c) We need to show that I {A}−1 (B) ∈ F for all B ∈ B(R). There are four cases. If {0, 1} ∈ B,
then I {A}−1 (B) = Ω ∈ F. If {1} ∈ B, then I {A}−1 (B) = A ∈ F. If {0} ∈ B, then

5
I {A}−1 (B) = Ac ∈ F. Finally, if {0, 1} ∩ B = ∅, then I {A}−1 (B) = ∅ ∈ F. Therefore I {A} is
F-measurable.

2.6 Trivially, σ(X) = {∅, R}. Hence Y is not σ(X)/B(R)-measurable because Y −1 ([0, 1]) = [0, 1] 6∈
σ(X).
2.7 First P (∅ | B) = P (∅ ∩ B) /P (B) = 0 and P (Ω | B) = P (Ω ∩ B) /P (B) = 1. Let (Ei )i be a
countable collection of disjoint sets with Ei ∈ F. Then
! S S
[ P (B ∩ i Ei ) P ( i (B ∩ Ei ))
P Ei B = =
i
P (B) P (B)
X P (B ∩ Ei ) X
= = P (Ei | B) .
i
P (B) i

Therefore P( · | B) satisfies the countable additivity property and the proof is complete.
2.8 Using the definition of conditional probability and the assumption that P (A) > 0 and P (B) > 0
we have:
P (A ∩ B) P (B | A) P (A)
P (A | B) = = .
P (B) P (B)

2.9 For part (a),

P (X1 < 2 and X2 is even) 3/(62 ) 1


P (X1 < 2 | X2 is even) = = = = P (X1 < 2) .
P (X2 is even) 18/(6 )
2 6

Therefore {X1 < 2} is independent from {X2 is even}. For part (b) note that σ(X1 ) = {C × [6] :
C ∈ 2[6] } and σ(X2 ) = {[6] × C : C ∈ 2[6] }. It follows that for |A ∩ B| = |A||B|/62 and so

P (A ∩ B) |A ∩ B|/62 |A|
P (A | B) = = = 2 = P (A) .
P (B) |B|/6 2 6

Therefore A and B are independent.


2.10
(a) Let A ∈ F. Then P (A ∩ Ω) = P (A) = P (A) P (Ω) and P (A ∩ ∅) = 0 = P (∅) P (A). Intuitively,
Ω and ∅ happen surely/never respectively, so the occurrence or not of any other event cannot
alter their likelihood.

(b) Let A ∈ F satisfy P (A) = 1 and B ∈ F be arbitrary. Then P (B ∩ Ac ) ≤ P (Ac ) = 0.


Therefore P (A ∩ B) = P (A ∩ B) + P (Ac ∩ B) = P (B) = P (A) P (B). When P (A) = 0 we have
P (A ∩ B) ≤ P (A) = 0 = P (A) P (B).

(c) If A and Ac are independent, then 0 = P (∅) = P (A ∩ Ac ) = P (A) P (Ac ) = P (A) (1 − P (A)).
Therefore P (A) ∈ {0, 1}. This makes sense because the knowledge of A provides the knowledge
of Ac , so the two events can only be independent if one occurs with probability zero.

6
(d) If A is independent of itself, then P (A ∩ A) = P (A)2 . Therefore P (A) ∈ {0, 1} as before. The
intuition is the same as the previous part.

(e) Ω = {(1, 1), (1, 2), (2, 1), (2, 2)} and F = 2Ω .

Ω and A for all A ∈ F (16 pairs)


∅ and A for all A ∈ F − Ω (15 pairs)
{(1, 0), (1, 1)} and {(0, 0), (1, 0)}
{(1, 0), (1, 1)} and {(0, 1), (1, 1)}
{(0, 0), (0, 1)} and {(0, 0), (1, 0)}
{(0, 0), (0, 1)} and {(0, 1), (1, 1)}

(f) P (X1 ≤ 2, X1 = X2 ) = P (X1 = X2 = 1) + P (X1 = X2 = 2) = 2/9 = P (X1 ≤ 2) P (X1 = X2 )


because P (X1 ≤ 2) = 2/3 and P (X1 = X2 ) = 1/3.

(g) If A and B are independent, then |A ∩ B|/n = P (A ∩ B) = P (A) P (B) = |A||B|/n2 .


Rearranging shows that n|A ∩ B| = |A||B|. All steps can be reversed showing the reverse
direction.

(h) Assume that n is prime. By the previous part, n|A ∩ B| = |A||B| must hold if A and B are
independent of each other. If |A ∩ B| = 0, the events will be trivial. Hence, assume |A ∩ B| > 0.
Since n is prime, it follows then that n must be either a factor of |A| or a factor of |B|. Without
loss of generality, assume that it is a factor of |A|. This implies n ≤ |A|. But |A| ≤ n also holds,
hence |A| = n, i.e., A is a trivial event.

(i) Let X1 and X2 be independent Rademacher random variables and X3 = X1 X2 . Clearly these
random variables are not mutually independent since X3 takes multiple values with nonzero
probability and is fully determined by X1 and X2 . And yet X3 and Xi are independent for
i ∈ {1, 2}, which ensures that pairwise independence holds.

(j) No. Let Ω = [6] and F = 2Ω and P be the uniform measure. Define events A = {1, 3, 4} and
B = {1, 3, 5} and C = {3, 4, 5, 6}. Then A and B are clearly dependent and yet

P (A ∩ B ∩ C) = P (A ∩ B | C) P (C) = P (A) P (B) P (C) .

2.11

(a) σ(X) = (Ω, ∅) is trivial. Let Y be another random variable, then X and Y are independent
if and only if for all A ∈ σ(X) and B ∈ σ(Y ) it holds that P (A ∩ B) = P (B), which is trivial
when A ∈ {Ω, ∅}.

(b) Let A be an event with P (A) > 0. Then P (X = x | A) = P (X = x ∩ A) /P (A) = 1 = P (X = x).


Similarly, P (X 6= x | A) = 0 = P (X 6= x). Therefore X is independent of all events, including
those generated by Y .

7
(c) Suppose that A and B are independent. Then P (Ac | B) = 1 − P (A | B) = 1 − P (A) = P (Ac ).
Therefore Ac and B are independent and by the same argument so are Ac and B c as well as A
and B c . The ‘if’ direction follows by noting that σ(X) = {Ω, A, Ac , ∅} and σ(Y ) = {Ω, B, B c , ∅}
and recalling that every event is independent of Ω or the empty set. For the ‘only if’ note
that independence of X and Y means that any pair of events taken from σ(X) × σ(Y ) are
independent, which by the above includes the pair A, B.

(d) Let (Ai )i be a countable family of events and Xi (ω) = I {ω ∈ Ai } be the indicator of the ith
event. When the random variables/events are pairwise independent, then the above argument
goes through unchanged for each pair. In the case of mutual independence the ‘only if’ is again
the same. For the ‘if’, suppose that (Ai ) are mutually independent. Therefore for any finite
subset K ⊂ N we have
!
\ Y
P Ai = P (Ai )
i∈K i∈K

The same argument as the previous part shows that for any disjoint finite sets K, J ⊂ N we
have
!
[ [ Y Y
P Ai ∪ Aci = P (Ai ) P (Aci ) .
i∈K i∈J i∈K i∈J

Therefore for any finite set K ⊂ N and (Vi )i∈K with Vi ∈ σ(Xi ) = {Ω, ∅, Ai , Aci } it holds that
!
\ Y
P Vi = P (Vi ) ,
i∈K i∈K

which completes the proof that (Xi )i are mutually independent.

2.12

(a) Let A ⊂ R be an open set. By definition, since f is continuous it holds that f −1 (A) is open.
But the Borel σ-algebra is generated by all open sets and so f −1 (A) ∈ B(R) as required.

(b) Since | · | : R → R is continuous and by definition a random variable X on measurable space


(Ω, F) is F/B(R)-measurable it follows by the previous part that |X| is F/B(R)-measurable
and therefore a random variable.

(c) Recall that (X)+ = max{0, X} and (X)− = − min{0, X}. Therefore (|X|)+ = |X| =
(X)+ + (X)− and (|X|)− = 0. Recall that E[X] = E[(X)+ ] − E[(X)− ] exists if an only if
both expectations are defined. Therefore if X is integrable, then |X| is integrable. Now suppose
that |X| is integrable, then X is integrable by the dominated convergence theorem.

2.14 Assume without (much) loss of generality that Xi ≥ 0 for all i. The general case follows by

8
considering positive and negative parts, as usual. First we claim that for any n it holds that
" n # n
X X
E Xi = E[Xi ] .
i=1 i=1

To show this, note the definition that


" n # (Z n
)
X X
E Xi = sup h dP : h is simple and 0 ≤ h ≤ Xi
i=1 Ω i=1
n
X Z 
= sup h dP : h is simple and 0 ≤ h ≤ Xi .
i=1 Ω

P
Next let Sn = ni=1 Xi and note that by the monotone convergence theorem we have limn→∞ E[Sn ] =
E[X], which means that
"∞ # n
X X
E Xi = lim E[Sn ] = lim E[Xi ] .
n→∞ n→∞
i=1 i=1

Pn
2.15 Suppose that X(ω) = i=1 αi I {ω ∈ Ai } is simple and c > 0. Then cX is also simple and
n
X n
X
E[cX] = cαi I {ω ∈ Ai } = c αi I {ω ∈ Ai } = cE[X] .
i=1 i=1

Now suppose that X is positive (but maybe not simple) and c > 0, then cX is also positive and

E[cX] = sup {E[h] : h is simple and h ≤ cX}


= sup {E[ch] : h is simple and h ≤ X}
= sup {cE[h] : h is simple and h ≤ X}
= cE[X] .

Finally for arbitrary random variables and c > 0 we have

E[cX] = E[(cX)+ ] − E[(cX)− ] = cE[(X)+ ] − cE[(X)− ] = cE[X] .

For negative c simply note that (cX)+ = −c(X)− and (cX)− = −c(X)+ and repeat the above
argument.

9
PN PN
2.16 Suppose X = i=1 αi I {Ai } and Y = i=1 βi I {Bi } are simple functions. Then
"N N
#
X X
E[XY ] = E αi I {Ai } βi I {Bi }
i=1 i=1
N
XX N
= αi βj P (Ai ∩ Bj )
i=1 j=1
N X
X N
= αi βj P (Ai ) P (Aj )
i=1 j=1

= E[X]E[Y ] .

Now suppose that X and Y are arbitrary non-negative independent random variables. Then

E[XY ] = sup {E[h] : h is simple and h ≤ XY }


= sup {E[hg] : h ∈ σ(X), g ∈ σ(Y ) are simple and h ≤ X, g ≤ Y }
= sup {E[h]E[g] : h ∈ σ(X), g ∈ σ(Y ) are simple and h ≤ X, g ≤ Y }
= sup {E[h] : h ∈ σ(X) is simple and h ≤ X} sup {E[h] : h ∈ σ(Y ) is simple and h ≤ Y }
= E[X]E[Y ] .

Finally, for arbitrary random variables we have via the previous display and the linearity of
expectation that

E[XY ] = E[((X)+ − (X)− )((Y )+ − (Y )− )]


= E[(X)+ (Y )+ ] − E[(X)+ (Y )− ] − E[(X)− (Y )+ ] + E[(X)− (Y )− ]
= E[(X)+ ]E[(Y )+ ] − E[(X)+ ]E[(Y )− ] − E[(X)− ]E[(Y )+ ] + E[(X)− ]E[(Y )− ]
= E[(X)+ − (X)− ]E[(Y )+ − (Y )− ] .

2.18 Let X be a standard Rademacher random variable and Y = X. Then E[X]E[Y ] = 0 and
E[XY ] = 1.
Ra
2.19 Using the fact that 0 1dx = a for a ≥ 0 and the non-negativity of X we have
Z ∞
X(ω) = I {[0, X(ω)]} (x)dx .
0

Then by Fubini’s theorem,


Z ∞ 
E[X] = E I {[0, X(ω)]} (x)dx
Z ∞0
= E[I {[0, X(ω)]} (x)]dx
Z0∞
= P (X(ω) ≥ x) dx .
0

10
Chapter 3 Stochastic Processes and Markov Chains

3.1

(a) We have F1 (x) = I {x ∈ [1/2, 1]} and F2 (x) = I {x ∈ [1/4, 2/4) ∪ [3/4, 4/4]}. More generally,
Ft (x) = IUt (x) where
[
Ut = {1} ∪ [(2s − 1)/2t , 2s/2t ) .
1≤s≤2t−1

Since Ut ∈ B([0, 1]), Ft are random variables (see Fig. 3.1).

U1

U2

U3

U4

Figure 3.1: Illustration of events (Ut )∞


t=1 .

P2t−1
(b) We have P(Ut ) = λ(Ut ) = s=1 (1/2t ) = 1/2.

(c) Given an index set K ⊂ N+ we need to show that {Fk : k ∈ K} are independent. Or equivalently,
that
 
\ Y
P Uk  = P (Uk ) = 2−|K| . (3.1)
k∈K k∈K

Let k = max K. Then


   
[ 1 [
λ Uk ∩ Uj  = λ  .
j∈K\{k}
2 j∈K\{k}

Then Eq. (3.1) follows by induction.

(d) It follows directly from the definition of independence that any subsequence of an independent
sequence is also an independent sequence. That P (Xm,t = 0) = P (Xm,t = 1) = 1/2 follows from
Part (b).

11
P
(e) By the previous parts Xt = ∞ t=1 Xm,t 2
−t is a weighted sum of an independent sequence of
P
uniform Bernoully random variables. Therefore Xt has the same law as Y = ∞ t=1 Ft 2 . But
−t

Y (x) = x is the identity. Hence Y is uniformly distributed and so too is Xt .

(f) This follows from the definition of (Xm,t )∞ t=1 as disjoint subsets of independent random variables
(Ft )t=1 and the ‘grouping’ result that whenever (Ft )t∈T is a collection of independent σ-algebras

and T1 , T2 are disjoint subsets of T , then σ(∪t∈T1 Ft ) and σ(∪t∈T2 Ft ) are independent [Kallenberg,
2002, Corollary 3.7]. This latter result is a good exercise. Use a monotone class argument.

R
3.5 Let A ∈ G and suppose that X(ω) = IA (ω). Then X X(x)K(ω, dx) = K(ω, A), which is
F-measurable by the definition of a probability kernel. The result extends to simple functions by
linearity. For nonnegative X let Xn ↑ X be a monotone increasing sequence of simple functions
R
converging point-wise to X [Kallenberg, 2002, Lemma 1.11]. Then Un (ω) = X Xn (x)K(ω, dx) is
R
F-measurable. Monotone convergence ensures that limn→∞ Un (ω) = X limn→∞ Xn (x)K(ω, dx) =
R
X X(x)K(ω, dx) = U (ω). Hence limn→∞ Un (ω) = U (ω) and a point-wise convergent sequence of
measurable functions is measurable, it follows that U is F-measurable. The result for arbitrary X
follows by decomposing into positive and negative parts.

3.8 Let (Xt )nt=0 be F = (Ft )nt=1 -adapted and τ = min{t : Xt ≥ ε}. By the submartingale property,
E[Xn | Ft ] ≥ Xt . Therefore

I {τ = t} Xt ≤ I {τ = t} E[Xn | Ft ] = E[I {τ = t} Xn | Ft ] .

Therefore E[Xt I {τ = t}] ≤ E[I {τ = t} Xn ].


! n
X
P max Xt ≥ ε = P (τ = t)
t∈{0,1,...,n}
t=0
Xn
= P (Xt I {τ = t} ≥ ε)
t=0
1X n
≤ E[Xt I {τ = t}]
ε t=0
1X n
≤ E[Xn I {τ = t}]
ε t=0
E[Xn ]
≤ .
ε

3.9 Let ΣX ( ΣY ) be the σ-algebra underlying X (respectively, Y). It suffices to verify that for

12
A ∈ ΣX , B ∈ ΣY , P(X,Y ) (A × B) = (PY ⊗ PX|Y )(A × B). We have

P(X,Y ) (A × B) = P (X ∈ A, Y ∈ B)
= E [E [I {X ∈ A} I {Y ∈ B} | Y ]] (tower rule)
= E [I {Y ∈ B} E [I {X ∈ A} | Y ]] (I {Y ∈ B} is σ(Y )-measurable)
= E [I {Y ∈ B} P (X ∈ A | Y )] (relation of expectation and probability)
h i
= E I {Y ∈ B} PX|Y (X ∈ A | Y ) (definition of PX|Y )
Z
= PY (dy)PX|Y (X ∈ A | y) (pushforward property)
B
= (PY ⊗ PX|Y )(B × A) . (definition of ⊗)

Chapter 4 Stochastic Bandits

4.9
(a) The statement is true. Let i be a suboptimal arm. By Lemma 4.5 we have

Rn (π, ν) Xk
E[Ti (n)]∆i E[Ti (n)]
0 = lim = lim sup ≥ lim sup ∆i .
n→∞ n n→∞
i=1
n n→∞ n

Hence lim supn→∞ E[Ti (n)]/n ≤ 0 ≤ lim inf n→∞ E[Ti (n)]/n and so limn→∞ E[Ti (n)]/n = 0 for
P P
suboptimal arms i. Since ki=1 E[Ti (n)]/n = 1 it follows that limn→∞ i:∆i =0 E[Ti ]/n = 1.

(b) The statement is false. Consider a two-armed bandit for which the second arm is suboptimal
and an algorithm that chooses the second arm in rounds t ∈ {1, 2, 4, 8, 16, . . .}.

4.10
(a) (Sketch) Fix policy π and n and assume without loss of generality the reward-stack model. We
turn policy π into a retirement policy π 0 by reshuffling the order in which π uses the two arms
during the n rounds so that if π uses action 1, say m times out of the n rounds, π 0 will use
action 1 in the first m rounds and then switches to arm 2. By using the regret decomposition,
we see that this suffices to show that π 0 achieves no more regret than π (actually, achieves the
same regret).
So what is policy π 0 ? Policy π 0 will keep querying policy π for at most n times. If π returns by
proposing to use action 1, π 0 will play this action, get the reward from the environment and
feeds the obtained reward to policy π. If π returns by proposing to play action 2, π 0 does not
play this action for now, just feeds π with zero. After π was queried n times, action 2 is played
up in the remaining rounds out of the total n rounds.

(b) Assume that arm 1 has a Bernoulli payoff with parameter p ∈ [0, 1] and arm 2 has a fixed
payoff of 0.5 (so µ1 = p and µ2 = 0.5). Note that whether π ever retires on these Bernoulli

13
environments depends on whether there exists some t > 0 and x1 , . . . , xt−1 ∈ {0, 1} such that
πt (2|1, x1 , . . . , 1, xt−1 ) > 0, or

sup sup πt (2|1, x1 , . . . , 1, xt−1 ) > 0 . (4.1)


t>0 x1 ,...,xt−1 ∈{0,1}

We have the two cases. When (4.1) does not hold then π will have linear regret when
p < 0.5. When (4.1) does hold then take the t > 0 and x1 , . . . , xt−1 ∈ {0, 1} such that
ρ = πt (2|1, x1 , . . . , 1, xt−1 ) > 0 (these must exist). Assume that t > 0 is smallest possible:
Hence, πs (1|1, x01 , . . . , 1, x0s−1 ) = 1 for any s < t and x01 , . . . , x0s−1 ∈ {0, 1}. Now, take an
environment when p > 0.5 (so arm 1 is the optimal arm) and let Rn denote the regret of π in
this environment. Then letting ∆ = p − 0.5 > 0, we have

Rn = ∆E [T2 (n)]
≥ ∆E [I {A1 = 1, X1 = x1 , . . . , At−1 = 1, Xt−1 = xt−1 , At = 2} T2 (n)]
= ∆E [I {A1 = 1, X1 = x1 , . . . , At−1 = 1, Xt−1 = xt−1 , At = 2} (n − t + 1)]
= ∆P (A1 = 1, X1 = x1 , . . . , At−1 = 1, Xt−1 = xt−1 , At = 2) (n − t + 1)
t
Y
= ∆(n − t + 1)ρ pxs (1 − p)1−xs
s=1
≥ c(n − t + 1) ,
Qt−1 xs
where c = ∆ρ s=1 p (1 − p)
1−xs > 0. It follows that lim inf
n→∞ Rn /n ≥ c > 0.

Chapter 5 Concentration of Measure

5.10

(a) The Cramér-Chernoff method gives that for any λ ≥ 0,


" n
#
X
P (µ̂n ≥ ε) = P (exp(λnµ̂n ) ≥ exp(nλε)) ≤ exp(−nλε)E exp(λ Xt )
t=1
= exp(−nλε)MX (λ)n .

Taking logarithm of both sides and reordering gives

1
log P (µ̂n ≥ ε) ≤ −(λε − MX (λ)) .
n
Since this holds for any λ ≥ 0, taking the supremum over λ ∈ R gives the desired inequality
(allowing λ < 0 potentially makes the resulting inequality loser).

(b) Let X be a Rademacher variable. We have ψX (λ) = 12 (exp(−λ) + exp(λ)) = cosh(λ). To


get the Fenchel dual of log ψX , we find the maximum value of f (λ) = λε − log ψX (λ) =

14
λε − log cosh(λ). We have dλ d
log cosh(λ) = (eλ − e−λ )/(eλ + e−λ ) = tanh(λ) ∈ [−1, 1].
Hence, supλ f (λ) = +∞ when |ε| > 1. In the other case, we get that the maximum is
∗ (ε) = f (tanh−1 (ε)) = tanh−1 (ε)ε − log cosh(tanh−1 (ε)). Using tanh−1 (ε) = 1 log( 1+ε )
ψX 2 1−ε
we find that etanh = ( 1+ε
1−ε )
1/2 and e− tanh = ( 1−ε
1+ε )
1/2 , hence cosh(tanh−1 (ε)) =
−1 −1
(ε) (ε)

2 (( 1−ε ) +( 1−ε
1+ε )
1/2 ) = 1 ( (1+ε)+(1−ε) ) = . Therefore, ψX
∗ (ε) = log( 1+ε
1−ε )+ 2 log(1−ε ) =
1 1+ε 1/2 √ 1 ε 1 2
2 (1−ε2 )1/2 1−ε2 2
1+ε
2 log(1 + ε) + 1−ε
2 log(1 − ε).

(c) We have ψX (λ) = λ(p + ε) − log(1 − p + peλ ). The maximiser of this is λ∗ = log( (1−p)(p+ε)
p(1−(p+ε)) )
provided that p + ε < 1. Plugging in this value, after some algebra, gives the desired result. The
result also extends to p + ε = 1: In this case ψX is increasing and limλ→∞ λ − log(1 − p + peλ ) =
limλ→∞ λ − log(peλ ) = log(1/p) = d(1, p). For ε > 0 so that p + ε > 1, ψX ∗ (ε) = +∞ because

as λ → ∞, λ(p + ε) − log(1 − p + pe ) ∼ λ(p + ε − 1) → ∞.


λ

R R
(d) Set σ = 1 for simplicity. We have MX (λ) = √12π exp(−(x2 − 2λx)/2) dx = √12π exp(−(x −
λ)2 /2) exp(λ2 /2) dx = exp(λ2 /2). Hence, f (λ) = λε − log MX (λ) = λε − λ2 /2 and
supλ f (λ) = f (2ε) = ε2 /2.

p
(e) We need to calculate limn→∞ n1 log(1 − Φ(ε n/σ 2 )). By Eq. (5.3) we have 1 − Φ(x) ≤
p √ √
1/(2πx2 ) exp(−x2 /2). Further, by Eq. (13.4), 1 − Φ(x) ≥ exp(−x2 /2)/( π(x/ 2 +
p p
x2 /2 + 2)). Taking logarithm, plugging in x = ε n/σ 2 , dividing by n and taking n → ∞
gives

1 q
lim log(1 − Φ(ε n/σ 2 )) = ε2 /(2σ 2 ) .
n→∞ n

When X is a Rademacher random variable, ψX ∗ (ε) = 1+ε log(1 + ε) + 1−ε log(1 − ε) ≥ ε2 /2 for
2 2
any ε ∈ R, with equality holding only at ε = 0. Hence, the question-marked equality cannot
hold. (In fact, this is very easy to see also by noting that if X is supported on [−1, 1] then
µ̂n ∈ [−1, 1] almost surely and thus P (µ̂n > ε) = 0 for any ε ≥ 1, while the approximation from
the CLT gives ε2 /(2σ 2 ), a strictly larger value: The CLT can significantly overestimate tail
probabilities. What goes wrong with the careless application of the (strong form) of the CLT
is that limn→∞ supx |fn (x) − f (x)| = 0 does not imply | log fn (xn ) − log f (xn )| = o(n) for all
choices of {xn }. For example, one can take f (x) = exp(−x), fn (x) = exp(−x(1 + 1/n)) so that
log fn (x) − log f (x) = −x/n. Then, | log fn (n2 ) − log f (n2 )| = n 6= o(n). The same problem
happens in the specific case that was investigated.

p
5.12 Part (d) The plots of p 7→ Q(p) and p 7→ p(1 − p) are shown below:

15
0.5
Q(p)
p
p 7→ p(1 − p)

0.25

0
0 0.25 0.5 0.75 1
p
p
As can be seen, p(1 − p) ≤ Q(p) for p ∈ [0, 1].
Part (e): Consider 0 ≤ λ < 4 and λ ≥ 4 separately. In the latter case use λ2 ≥ 4λ. For the
former case consider the extremes p = 1 and p = 1/2 and then use convexity. The general conclusion
is that the subgaussianity constant may be misleadingly large when it comes to studying tails of
distributions: Tail bounds (for the upper tail) only need bounds on the MGF for nonnegative values
of λ!
5.13
Pn Pn
(a) Using linearity of expectation E[p̂n ] = E[ t=1 Xt /n] = t=1 E[Xt ]/n = p. Similarly,
V[p̂n ] = p(1 − p)/n.

(b) The central limit theorem says that


 √  √ 
lim P n(p̂n − p) ≥ x − P nZn ≥ x =0 for all x ∈ R . (5.1)
n→∞

(c) This is an empirical question, the solution to which we omit. You should do this calculation
directly using the binomial distribution.

(d.i) Let d(p, q) = p log(p/q) + (1 − p) log((1 − p)/(1 − q)) be the relative entropy between Bernoulli
distributions with means p and q. By large deviation theory (see Exercise 5.10),

P (p̂n ≥ p + ∆) = exp (−n(dBer + εn )) and


P (Zn ≥ p + ∆) = exp (−n(dGauss + ξn )) ,

where dBer = d(p + ∆, p) and dGauss = ∆2 /(2p(1 − p)) and (εn )∞ n=1 and (ξn )n=1 satisfy

limn→∞ εn = 0 and limn→∞ ξn = 0. It should be clear that limδ→0 ni (δ, p, ∆) = ∞ for


i ∈ {Ber, Gauss}. Hence, by inverting the above displays we arrive at
   
1 1
ni (δ, p, ∆) = + o(1) log , (5.2)
di δ

16
where the o(1) term vanishes as δ tends to zero (see below for the precise argument). Therefore
when ∆ = p = 1/10,

nBer (δ, p, ∆) dGauss ∆2 /(2p(1 − p))


lim = = ≈ 1.2512 .
δ→0 nGauss (δ, p, ∆) dBer d(p + ∆, p)

It remains to see the validity of Eq. (5.2). This follows from an elementary but somewhat tedious
argument. The precise claim is as follows: Let (pn ) be a sequence taking values in [0, 1], n(δ) =
min{n ≥ 1 : pn ≤ δ} such that n(δ) → ∞ as δ → 0 and log(1/pn ) = n(d + o(1)). We claim that
from these it follows that n(δ) = (1/d + o(1)) log(1/δ) as δ → 0. To show this it suffices to prove
that for any ε > 0, for any δ > 0 small enough, n(δ) ∈ [1/(d + ε) log(1/δ), 1/(d − ε) log(1/δ) + 1].
Fix ε > 0. Then, by our assumption on log(1/pn ) there exist some n0 > 0 such that for any
n ≥ n0 , log(1/pn ) ∈ [n(d − ε), n(d + ε)]. Further, by our assumption on n(δ), there exists δ0 > 0
such that for any δ < δ0 , n(δ) − 1 ≥ n0 . Take some δ < δ0 and let n0 = n(δ). By definition,
pn0 ≤ δ < pn0 −1 and hence log(1/pn0 ) ≥ log(1/δ) > log(1/pn0 −1 ). Since n0 ≥ n0 − 1 ≥ n0 , we
also have that (n0 − 1)(d − ε) ≤ log(1/pn0 −1 ) < log(1/δ) ≤ log(1/pn0 ) ≤ n0 (d + ε), from which it
follows that log(1/δ)
d+ε ≤ n < d−ε + 1, finishing the proof.
0 log(1/δ)

(d.ii) The central limit theorem only shows Eq. (5.1). In particular, you cannot choose x to depend
on n. A second try is to use Berry-Esseen (Exercise 5.5) which warrants that

|P (p̂n − p ≥ ∆) − P (Zn ≥ ∆) | = O(1/ n) .

The problem is that this provides very little information in the regime where ∆ is fixed and
n tends to infinity where both probabilities tend to zero exponentially fast and the error term
washes away the comparison. In particular, for the inversion process to work, one needs nontrivial
lower and upper bounds on P (p̂n − p ≥ ∆) and the central limit theorem only asserts that this

probability is in the range of [0, O(1/ n)] (irrespective of the value of p and ∆), which does not
lead to nontrivial bounds on nBer (δ; p, ∆).
To summarise, the study of ni (δ, p, ∆) as δ tends to zero is a question about the large deviation
regime, where the central limit theorem and Berry-Esseen do not provide meaningful information.
To make use of the central limit theorem and Berry-Esseen, one needs to choose the deviation
√ √
level x so that the probability P ( n(p̂n − p) ≥ x) is of a larger magnitude than O(1/ n), which
is the range of ‘small deviations’.
As an aside, comparisons between normal and binomial distributions have been studied extensively.
If you are interested, the most relevant lower bound for this discussion is Slud’s inequality [Slud,
1977].

5.14

(a) We have g 0 (x) = x (exp(x)−1)−2x(exp(x)−1−x) so that x3 g 0 (x) = h(x) = xex − 2ex + 2 + x. We


2

x4
have h0 (x) = xex − ex + 1 and h00 (x) = xex . Hence, h0 is increasing on (0, ∞) and decreasing on
(−∞, 0). Since h(0) = 0, so sign(h(x)) = sign(x) and thus g 0 (x) > 0 for x 6= 0.

17
   
(b) We have exp(x) = 1 + x + g(x)x2 . Therefore, E [exp(X)] = 1 + E g(X)X 2 ≤ 1 + E g(b)X 2 =
1 + g(b)V[X], where the last inequality used that g is increasing.

(c) Calculation – left to the reader.


Pn
(d) Let Zt = Xt − EXt so that S = t=1 Zt . By the Cramér-Chernoff method, for any λ ≥ 0,
n
Y
P (S ≥ ε) ≤ exp(−λε) E [exp(λZt )] .
t=1

Using E [exp(λZt )] ≤ 1 + g(λb)λ2 V[Zt ] ≤ exp(g(λb)λ2 V[Zt ]), we get


!
X
P Zt ≥ ε ≤ exp(−λε + g(λb)λ2 v) . (5.3)
t

Differentiation shows that the exponent is minimised by λ = 1/b log(1 + α) where recall that
α = bε/v. Plugging in this value we get (5.10) and then using the bound in Part ((c)) we get
(5.11).
 
(e) We need to solve δ = exp − 2v ε2
for ε ≥ 0. Algebra gives that this is quadratic equation in
(1+ 3v

)
ε: Using the abbreviation L = log(1/δ), this quadratic equation is ε2 − 23 bLε − 2vL. The positive
 q 
root is ε = 1
2
2
3 bL +
( 23 bL)2 + 8vL . Hence, with probability 1 − δ, S ≤ ε. Further upper
p p p √
bounding ε using |a| + |b| ≤ |a| + |b| gives that with probability 1 − δ, Sn ≤ 32 bL + 2vL,
which is the desired inequality.

(f) We start by modifying the Cramér-Chernoff method. In particular, consider the problem
of bounding the probability of event A where for a random vector X ∈ Rd and a fixed
vector x ∈h R , A takes
d
i the
h form i A = {X ≥ x}. Notice that for f : R → [0, ∞),
d

P (A) ≤ E I {A} ef (X) ≤ E ef (X) . We use this with X = (S, −V ) and x = (ε, v) so that A =
{S ≥ ε, V ≤ v} = {X ≥ (ε, v)}. Then, for λ > 0 letting h(S, V ) = λS − g(λb)λ2 V wehhave on i
A that h(S, V ) ≥ h(ε, v) and so f (S, V ) = h(S, V ) − h(ε, v) ≥ 0 and P (A) ≤ e−h(ε,v) E eh(S,V ) .
We have eh(S,V ) = U1 . . . Un where Ut = eλZt −λ g(λb)Et−1 [Zt ] and Zt = Xt − µt = Xt − Et−1 [Xt ].
2 2

Furthermore, owning to λ > 0, λZt ≤ λb, hence

Es−1 [Us ] = e−λ Es−1 [eλZs ]


2 g(λb)E 2
s−1 [Zs ]

(1 + g(λb)Es−1 [(λZs )2 ])
2 g(λb)E 2
≤ e−λ s−1 [Zs ]

= e−λ = 1,
2 g(λb)E 2 2 g(λb)E 2
s−1 [Zs ]
eλ s−1 [Zs ]

and thus
" n #
h i Y
E eg(Sn ,Vn ) = E Ut = E [U1 . . . Un−1 En−1 [Un ]]
t=1
≤ E [U1 . . . Un−1 ] ≤ · · · ≤ 1 .

18
Thus, P (A) ≤ e−h(ε,v) . Notice that the expression on the right-hand side is the same as in
Eq. (5.3), finishing the proof.

5.15 Let αt = ηEt−1 [(Xt − µt )2 ]. We use the Cramér-Chernoff method:


 ! ! !
n
X 1 1 n
X 1
P (Xt − µt − αt ) ≥ log = P exp η (Xt − µt − αt ) ≥
t=1
η δ t=1
δ
" n
!#
X
≤ δE exp η (Xt − µt − αt ) .
t=1

All that remains is to show that the term inside the expectation is a supermartingale. Using the
fact that exp(x) ≤ 1 + x + x2 for x ≤ 1 and 1 + x ≤ exp(x) for all x ∈ R we have

Et−1 [exp (η(Xt − µt − αt ))] = exp(−ηαt ) Et−1 [exp(η(Xt − µt ))]


 
≤ exp(−ηαt ) 1 + η 2 Et−1 [(Xt − µt )2 ]
 
= exp −ηαt + η 2 Et−1 [(Xt − µt )2 ] = 1 .
P
Therefore exp(η nt=1 (Xt − µt − αt )) is a supermartingale, which completes the proof of Part (a).
The proof of Part (b) follows in the same fashion.
5.16 By assumption P (Xt ≤ x) ≤ x, which means that for λ < 1,
Z ∞
E[exp(λ log(1/Xt ))] = P (exp(λ log(1/Xt )) ≥ x) dx
0
Z ∞ Z ∞
  1
=1+ P Xt ≤ x−1/λ dx ≤ 1 + x−1/λ dx = .
1 1 1−λ

Applying the Cramér-Chernoff method,


n
! n
! !
X X
P log(1/Xt ) ≥ ε = P exp λ log(1/Xi ) ≥ exp(λε)
t=1 t=1
" !#  n
n
X 1
≤ exp(−λε)E exp λ log(1/Xt ) ≤ exp(−λε) .
t=1
1−λ

Choosing λ = (ε − n)/ε completes the claim.


5.17 The 1-norm can be re-written as

kp − p̂k1 = max hλ, p − p̂i .


λ∈{−1,1}m

Next, let λ ∈ {−1, 1}m be fixed. Then,

1X n
hλ, p − p̂i = hλ, p − eXt i .
n t=1

19
Now, |hλ, p − eXt i| ≤ kλk∞ kp − eXt k1 ≤ 2 and E[hλ, p − eXt i] = 0. Then, by Hoeffding’s bound,
s  !
2 1
P hλ, p − p̂i ≥ log ≤ δ.
n δ

Taking a union bound over all λ ∈ {−1, 1}m shows that


s  !
2 2m
P max hλ, p − p̂i ≥ log ≤ δ.
λ∈{−1,1}m n δ

5.18 Let λ > 0. Then


n
X
exp(λE[Z]) ≤ E[exp(λZ)] ≤ E[exp(λXt )] ≤ n exp(λ2 σ 2 /2) .
t=1

Rearranging shows that

log(n) λσ 2
E[Z] ≤ + .
λ 2
p p
Choosing λ = σ1 2 log(n) shows that E[Z] ≤ 2σ 2 log(n). For Part (b), a union bound in
combination with Theorem 5.3 suffices.

5.19 Let P be the set of measures on ([0, 1], B([0, 1])) and for q ∈ P let µq be its mean. The
theorem will be established by induction over n. The claim is immediate when x > n or n = 1.
Assume that n ≥ 2 and x ∈ (1, n] and the theorem holds for n − 1. Then
n
! " n
!#
X X
P E[Xt | Ft−1 ] =E P E[Xt | Ft−1 ] ≥ x − E[X1 | F0 ] F0
t=1 t=2
  
x − E[X1 | F0 ]
≤ E fn−1
1 − X1
    
x − E[X1 | F0 ]
= E E fn−1 F0
1 − X1
Z 1  
x − µq
≤ sup fn−1 dq(y) ,
q∈P 0 1−y
Pn
where the first inequality follows from the inductive hypothesis and the fact that t=2 Xt /(1−X1 ) ≤1
almost surely. The result is completed by proving that for all q ∈ P,
Z 1  
. x − µq
Fn (q) = fn−1 dq(y) ≤ fn (x) . (5.4)
0 1−y

Let q ∈ P have mean µ and y0 = max(0, 1 − x + µ). In Lemma 5.1 below it is shown that
   
x−µ 1−y x−µ
fn−1 ≤ fn−1 ,
1−y 1 − y0 1 − y0

20
which after integrating implies that
 
1−µ x−µ
Fn (q) ≤ fn−1 .
1 − y0 1 − y0

Considering two cases. First, when y0 = 0 the display shows that Fn (q) ≤ (1−µ)fn−1 (x−µ). On the
other hand, if y0 > 0 then x − 1 < µ ≤ 1 and Fn (q) ≤ (1 − µ)/(x − µ) ≤ (1 − (x − 1))fn−1 (x − (x − 1)).
Combining the two cases we have

Fn (q) ≤ sup (1 − µ)fn−1 (1 − µ) = fn (x) .


µ∈[0,1]

Lemma 5.1. Suppose that n ≥ 1 and u ∈ (0, n] and y0 = max(0, 1 − u). Then
   
u 1−y u
fn ≤ fn−1 for all y ∈ [0, 1] .
1−y 1 − y0 1 − y0

Proof. The lemma is equivalent to the claim that the line connecting (y0 , fn (u/(1 − y0 ))) and (1, 0)
lies above fn (u/(1 − y)) for all y ∈ [0, 1] (see figure below). This is immediate for n = 1 when
fn (u/(1 − y)) = I {y ≤ 1 − u}. For larger n basic calculus shows that fn (u/(1 − y)) is concave as a
function of y on [1 − u, 1 − u/n] and


fn (u/(1 − y)) = −1/u .
∂y y=1−u

Since fn (1) = 1 this means that the line connecting (1 − u, 1) and (1, 0) lies above fn (u/(1 − y)).
This completes the proof when y0 = 1 − u. Otherwise y0 ∈ [1 − u, 1 − u/n] and the result follows by
concavity of fn (u/(1 − y)) on this interval.

fn (u/(1 − y))

0
0 y0 1
y

21
Chapter 6 The Explore-then-Commit Algorithm
√ √ √
6.2 If ∆ ≤ 1/ n then  2from
 Rn ≤ √ n∆ we get Rn ≤ n. Now, if ∆ > 1/ n then from
Rn ≤ ∆ + ∆4
1 + log+ n∆ 4 ≤ ∆ + 4 n + maxx>0 x1 log+ (nx2 /4). A simple calculation shows
√ √
that maxx>0 x1 log+ (nx2 /4) = e−2 n. Putting things together we get that Rn ≤ ∆ + (4 + e−2 ) n
holds no matter the value of ∆ > 0.
6.3 Assume for simplicity that n is even and let ∆ = max{∆1 , ∆2 } and
  
4 1
m = min n/k, log .
∆ 2 δ

When 2m = n the pseudo regret is bounded by R̄n ≤ m∆. Now suppose that 2m < n. Then
!
m∆2
P (T2 (n) > m) ≤ P (µ̂2 (2m) − µ2 − µ̂1 (2m) + µ1 ≥ ∆) ≤ exp − ≤ δ.
4

Hence with probability at least 1 − δ the pseudo regret is bounded by


  
n∆ 4 1
R̄n ≤ m∆ = min , log .
2 ∆ δ

6.5 By slightly abusing notation, to reduce clutter we abbreviate Rn (ν) to Rn and ∆ν to ∆.

(a) For the first part we need to show that Rn ≤ (∆ + C)n2/3 when m = f (n) for a suitable
 chosen

function f and C > 0 is a universal constant. By Eq. (6.4), Rn ≤ m∆ + n∆ exp − 4 m∆2

  q  
m∆ + n max∆>0 ∆ exp − m∆ = m∆ + 2n exp − 12 , where the equality follows because
2 1
4 m
p p
maxx x exp(−cx2 ) is at the value x∗ = 1/(2c) and
q is equal to 1/(2c) exp(−1/2) as a simple
p p √
calculation shows (so, ∆∗ = 4/(2m) = 2/m = 2/n2/3 = 2n−1/3 ). That Rn ≤ ∆ + Cn2/3
cannot hold follows because m∆/2 ≤ Rn . Hence, if Rn ≤ ∆ + Cn2/3 was also true, we would
get that for any ∆ > 0, m∆/2 ≤ ∆ + Cn2/3 holds. Dividing both sides by ∆ and letting
∆ → ∞, this would imply that m ≤ 2. However, if m ≤ 2 then Rn = Ω(n) on some instances:
In particular, there is a positive probability that the arm chosen after trying both arms at most
twice is the suboptimal arm.
p
(b) For any fixed m, µ̂i (2m) − µi is 1/m subgaussian. Hence defining G = {|µ̂i (2m) − µi | ≤
p
2 log(n/δ)/m, i = 1, 2, m = 1, 2, . . . , bn/2c}, using n ≥ 2bn/2c union bounds, we have that
p
P (G) ≥ 1 − δ. Introduce w(m) = 2 log(n/δ)/m. Let M = min{1 ≤ m ≤ bn/2c : |µ̂1 (2m) −
µ̂2 (2m)| > 2w(m)} (note that M = ∞ if the condition is never met). Then on G if M < +∞
and say 1 = argmaxi µ̂i (2M ) then µ1 ≥ µ̂1 (2M ) − w(m) > µ̂2 (2M ) + 2w(M ) − w(M ) ≥ µ2
where the first and last inequalities used that we are on G and the middle one used the stopping
condition and that we assumed that at stopping, arm one has the highest mean. Hence,

22
Rn = P (Gc ) n∆/2 + E [M I {G}] ∆/2 ≤ δn + E [M I {G}] ∆/2. We now show a bound on M on
G. To reduce clutter assume that µ1 > µ2 . Assume G holds and let m < M . Then, 2w(m) ≥
|µ̂1 (2m)− µ̂2 (2m)| ≥ µ̂1 (2m)− µ̂2 (2m) ≥ (µ1 −w(m))−(µ2 +w(m)) = ∆−2w(m). Reordering we
see that 4w(m) ≥ ∆, which, using the definition of w(m), is equivalent to m ≤ (4/∆)2 2 log(n/δ).
Hence, on G, M = 1+max{m : 2w(i) ≥ |µ̂1 (2i)−µ̂i (2i)|, i = 1, 2, . . . , m} ≤ 1+(4/∆)2 2 log(n/δ).
Plugging this in and setting δ = 1/n, we get Rn ≤ ∆ + 16 ∆ log(n).

p
(c) In addition to the said inequality, of course, Rn ≤ n∆ also holds. If ∆ ≤ log(n)/n, we thus
p p p p
have Rn ≤ n/ log(n) ≤ n log(n). If ∆ > log(n)/n, Rn ≤ ∆ + C n log(n). Combining
p
the inequalities, we have Rn ≤ ∆ + (C ∨ 1) n log(n).

p
(d) Change the definition of w(m) from Part (b) to w(m) = 2 log(n/(mδ))/m. Then,
Pn/2
P (Gc ) ≤ nδ m=1 m ≤ cδn for a suitable universal constant c > 0. We will choose
p
δ = 1/(cn2 ) so that P (Gc ) ≤ 1/n. Hence, w(m) = c0 log(n/m)/m with a suitable universal
constant c0 > 0. With the same reasoning as in Part (b), we find that M ≤ 1 + m∗
where m∗ = max{m ≥ 1 : m ≤ c0 log(n/m)/∆2 }. A case analysis then gives that
00
for a suitable universal constant c00 > 0. Finishing as in Part (b),
2 n))
m∗ ≤ c log(e∨(∆
∆2
c00 log(e∨(∆2 n))
Rn = P (Gc ) n∆/2 + E [M I {G}] ∆/2 ≤ δn + E [M I {G}] ∆/2 ≤ ∆ + ∆ .

(e) See the solution to Exercise 6.2.

6.6

(a) Let N0 = 0 and for ` > 1, let N` = min(N`−1 + n` , n), T` = {N`−1 + 1, . . . , N` }. The intervals
(T` )``=1
max
are non-overlapping and policy π is used with horizon n` on interval T` . Since ν is a
stochastic environment,
`X
max `X
max `X
max
Rn (π , ν) =

R|T` | (π(n` ), ν) ≤ max Rt (π(n` ), ν) ≤ fn` (ν) , (6.1)
1≤t≤n`
`=1 `=1 `=1

where the first inequality uses that |T` | ≤ n` (in fact for ` < `max , |T` | = n` ), and the second
inequality uses (6.10).

P P
(b) We have `i=1 ni = `−1 i=0 2 = 2 − 1, hence `max = dlog2 (n + 1)e and 2
i ` `max ≤ 2(n + 1). By

Eq. (6.1), the assumption on fn and the choice of (n` )` ,


r
`X
max √
`−1 1 √ 1 √ 1
Rn (π , ν) ≤

2 ≤√ 2`max ≤ √ 2n 1 +
`=1 2−1 2−1 n
√ √
= 2(1 + 2) n .

23
(c) By Eq. (6.1) we have

`X
max X−1
`max
Rn (π ∗ , ν) ≤ g(ν) log(2`−1 ) = log(2)g(ν) `
`=1 `=0
(`max − 1)`max
= log(2)g(ν) ≤ Cg(ν) log2 (n + 1)
2
with some universal constant C > 0. Hence, the regret significantly worsens in this case. A
better choice n` = 22 . With this,
`−1

`X
max X−1
`max
Rn (π , ν) ≤ g(ν)

log(2 2`−1
) = log(2)g(ν) 2` ≤ log(2)g(ν)2`max
`=1 `=0
≤ Cg(ν) log(n) ,

with another universal constant C > 0.

(d) The power/advantage of the doubling trick is its generality. It shows, that, as long as we do not
mind losing a constant factor, under mild conditions, adapting to an unknown horizon is not
challenging. The first disadvantage is that π ∗ does lose a constant factor when perhaps there is
no need to lose a constant factor. The second disadvantage is that oftentimes one can design
algorithms such that the immediate expected regret at time t decreases as the time t increases.
This is a highly desirable property and can often be met, as it was explained in the chapter.
Yet, by applying the doubling trick this monotone decrease of the immediate expected regret
will be lost, which will raise questions in the user.

6.8

(a) Using the definition of the algorithm and concentration for subgaussian random variables:
 
P (1 ∈
/ A`+1 , 1 ∈ A` ) ≤ P 1 ∈ A` , exists i ∈ A` \ {1} : µ̂i,` ≥ µ̂1,` + 2−`
 
= P 1 ∈ A` , exists i ∈ A` \ {1} : µ̂i,` − µ̂1,` ≥ 2−`
!
m` 2−2`
≤ k exp − ,
4

where in the last final inequality we used (c) of Lemma 5.4 and Theorem 5.3.

(b) Again, concentration and the algorithm definition show that:


 
P (i ∈ A`+1 , 1 ∈ A` , i ∈ A` ) ≤ P 1 ∈ A` , i ∈ A` , µ̂i,` + 2−` ≥ µ̂1,`
 
= P 1 ∈ A` , i ∈ A` , (µ̂i,` − µi ) − (µ̂1,` − µ1 ) ≥ ∆i − 2−`
!
m` (∆i − 2−` )2
≤ exp − .
4

24
(c) Let δ ∈ (0, 1) be some constant to be chosen later and

m` = 24+2` log(`/δ) .

Then by Part (a),



X
P (exists ` : 1 ∈
/ A` ) ≤ P (1 ∈
/ A`+1 , 1 ∈ A` )
`=1
!

X m` 22`
≤k exp −
`=1
4

X 1
≤ kδ
`=1
`2
kπ 2 δ
= .
6
Furthermore, by Part (b),

/ A`i )
P (i ∈ A`i +1 ) ≤ P (i ∈ A`i +1 , i ∈ A`i , 1 ∈ A`i ) + P (1 ∈
!
m` (∆i − 2−`i )2 kπ 2 δ
≤ exp − +
4 6
!
m` 2−2`i kπ 2 δ
≤ exp − +
16 6
!
kπ 2
≤δ 1+ .
6

Choosing δ = n−1 (1 + kπ 2 /6)−1 completes the result.


P
(d) For n < k, the result is trivial because all actions are tried at most once and hence Rn ≤ i:∆i ∆i
which is below the desired bound provided that C > 1. Hence, assume that n ≥ k. If there is
no suboptimal action, the statement is again trivial. Otherwise, let i be a suboptimal action.
Notice that 2−`i ≥ ∆i /4 and hence 22`i ≤ 16/∆2i . Furthermore, m` ≥ m1 ≥ 1 for ` ≥ 1. Hence,

i ∧n
`X
E[Ti (n)] ≤ nP (i ∈ A`i +1 ) + m`
`=1
i ∧n
`X  
n
≤1+ 24+2` log
`=1
δ
≤ 1 + C2 log(nk)
2`i

16C
≤ 1 + 2 log(nk) ,
∆i

where C > 1 is a suitably large universal constant derived by naively bounding the logarithmic
term and the geometric series. The result follows from upper bounding log(nk) ≤ 2 log(n) which

25
follows from k ≤ n and the standard regret decomposition (Lemma 4.5).

(e) Briefly, the idea is to choose


 
m` = C22` log max e, kn2−2` ,

where C is a suitably large universal constant chosen so that

22` 22`
P (1 ∈
/ A` ) ≤ and P (i ∈ A`i +1 ) ≤ .
n n
From this it follows that

16  
X i `
E[Ti (n)] ≤ + C 22`
log max e, kn2 −2`
.
∆2i `=1

Bounding the sum by an integral and some algebraic gymnastics eventually leads to the desired
result. Note, you have to justify the k in the logarithm is a lower order term. Argue by splitting
p
the suboptimal arms into those with ∆i ≤ k/n and the rest.

p
(f) Using the analysis in Part (e) and letting ∆ = k/n. Then

k
X
Rn = ∆i E[Ti (n)]
i=1
X C0  
≤ n∆ + log max e, nk∆2i
i:∆ ≥∆
∆i
i
q
≤C 00
nk log(k) ,

where the constant C 0 is derived from C in Part (e) and the last inequality follows by considering
the monotonicity properties of x 7→ 1/x log max(e, nx2 ).

Chapter 7 The Upper Confidence Bound Algorithm

7.1

26
(a) We have
 s  " ( )#
2 log(1/δ)  X ∞ Xn q
P µ̂ − µ ≥ = E I {T = n} I (Xt − µ) ≥ 2n log(1/δ)
T n=1 t=1
" " ( n ) ##

X X q
= E E I {T = n} I (Xt − µ) ≥ 2n log(1/δ) T
n=1 t=1
" " ( n ) ##
X∞ X q
= E I {T = n} E I (Xt − µ) ≥ 2n log(1/δ) T
n=1 t=1
X∞
≤ E [I {T = n} δ]
n=1
= δ.

n P p o
(b) Let T = min n : nt=1 (Xt − µ) ≥ 2n log(1/δ) . By the law of the iterated logarithm, T < ∞
almost surely. The result follows.

(c) Note that


 s  !
2 log(T (T + 1)/δ)  Xn q
P µ̂ − µ ≥ ≤ P exists n : (Xt − µ) ≥ 2n log(n(n + 1)/δ)
T t=1

X δ

n=1
n(n + 1)
= δ.

Chapter 8 The Upper Confidence Bound Algorithm: Asymptotic Optimality

8.1 Following the hint, F ≤ exp(−a)/(1 − exp(−a)) where a = ε2 /2. Reordering exp(−a)/(1 −
exp(−a)) ≤ 1/a gives 1 + a ≤ exp(a) which is well known (and easy to prove). Then
Z ∞ Z ∞
n
X 1 20
X 1 dt 20
X 1 dt
≤ + ≤ +
t=1
f (t) t=1
f (t) 20 f (t) t=1
f (t) 20 t log(t)2
20
X 1 1 5
= + ≤ .
t=1
f (t) log(20) 2

27
Chapter 9 The Upper Confidence Bound Algorithm: Minimax Optimality

9.1 Clearly (Mt ) is F-adapted. Then by Jensen’s inequality and convexity of the exponential
function,
t
!
X
E[Mt | Ft−1 ] = exp λ Xs E[exp(λXt ) | Ft−1 ]
s=1
t
!
X
≥ exp λ Xs exp(λE[Xt | Ft−1 ])
s=1
t−1
!
X
= exp λ Xs a.s.
s=1

Hence Mt is a F-submartingale.

9.4

(i) Consider the policy that plays each arm once and subsequently chooses
s   
2 n
At = argmaxi∈[k] µ̂i (t − 1) + log h .
Ti (t − 1) Ti (t − 1)k

The requested experimental validation is omitted from this solution.

(j) The proof more-or-less follows the proof of Theorem 9.1, but uses the new concentration
inequalities. Define random variable

∆ = min {∆ ≥ 0 : µ̂1s ≥ µ1 − ∆ for all s ∈ [n]} .



Let ε > 0. By Part (g), P (∆ ≥ ε) ≤ 5/(4nε2 ), where we used the fact the 2c/ π ≤ 5/4 when
c = 11/10. For suboptimal arm i define
( s    )
n
X 2 n
κi = I µ̂is + log h ≤ µ1 − ε .
s=1
s ks

By the definition of the algorithm and κi and ∆,

Ti (n) ≤ κi + nI {∆ ≥ ε} .

The expectation of κi is bounded using the same technique as in Theorem 9.1, but the tighter

28
confidence bound leads to an improved bound.
 s 
  
1 Xn
2 n∆2
E[κi ] ≤ 2 + P µ̂is + log h ≤ µ 1 − ε
∆i s=1
s k
2 log(n)
= + o(log(n)) .
(∆i − ε)2

Combining the two parts shows that for all ε > 0,

2 log(n)
E[Ti (n)] ≤ + o(log(n)) .
(∆i − ε)2

The result follows from the fundamental regret decomposition lemma (Lemma 4.5) and by
taking the limit as n tends to infinity and ε tends to zero at appropriate rates.

Chapter 10 The Upper Confidence Bound Algorithm: Bernoulli Noise

10.1 Let g be as in the hint. We have


 
1
g (x) = x
0
−4 .
(p + x)(1 − (p + x))

Clearly, g 0 (0) = 0. Further, since q(1 − q) ≤ 1/4 for any q ∈ [0, 1], g 0 (x) ≥ 0 for x > 0 and g 0 (x) ≤ 0
for x < 0. Hence, g is increasing for positive x and decreasing for negative x. Thus, x = 0 is a
minimiser of g. Here, g(0) = 0, and so g(x) ≥ 0 over [−p, 1 − p].

We have g(λ, µ) = −λµ + log(1 + µ(eλ − 1)). Taking derivatives, = −λ + 1+µ(e


d λ
e −1
10.3 dµ g(λ, µ) λ −1)

and dµ 2 g(λ, µ) = − (1+µ(eλ −1))2 ≤ 0, showing that g(λ, ·) is concave as suggested in the hint. Now,
2 λ
(e −1)2
d
Pp
let Sp = t=1 (Xt − µt ), p ∈ [n] and let S0 = 0. Then for p ∈ [n],

E [exp(λSp )] = E [exp(λSp−1 )E [exp(λ(Xp − µp )) | Fp−1 ]] ,

and, by Note 2, E [exp(λ(Xp − µp )) | Fp−1 ] ≤ exp(g(λ, µp )). Hence, using that µn is not random,

E [exp(λSp )] ≤ E [exp(λSp−1 )] exp(g(λ, µp )) .

Chaining this inequalities, using that S0 = 0 together with that g(λ, ·) is concave, we get
n
!!n
X
E [exp(λSn )] ≤ exp 1
n g(λ, µt ) ≤ exp (ng(λ, µ)) .
t=1

29
Thus,

P (µ̂ − µ ≥ ε) = P (exp(λSn ) ≥ exp(λnε))


≤ E [exp(λSn )] exp(−λnε)
≤ (µ exp(λ(1 − µ − ε)) + (1 − µ) exp(−λ(µ + ε)))n .

From this point, repeat the proof of Lemma 10.3 word by word.
10.4 When the exponential family is in canonical form the mean of Pθ is µ(θ) = Eθ [S] = A0 (θ).
Since A is strictly convex by the assumption that M is nonsingular it follows that µ(θ) is strictly
increasing and hence invertible. Let µsup = supθ∈Θ µ(θ) and µinf = inf θ∈Θ and define





sup Θ if x ≥ µsup
θ̂(x) = inf Θ if x ≤ µinf



µ−1 (x) otherwise .

The function θ̂ is the bridge between the empirical mean and the maximum likelihood estimator of θ.
P
Precisely, let X1 , . . . , Xn be independent and identically distribution from Pθ and µ̂n = n1 nt=1 Xt .
Then provided that θ̂n = θ̂(µ̂n ) ∈ Θ, then θ̂n is the maximum likelihood estimator of θ,
n
Y dPθ
θ̂n = argmaxθ∈Θ (Xt ) .
t=1
dh

There is an irritating edge case that µ̂n does not lie in the range of µ : Θ → R. When this occurs
there is no maximum likelihood estimator.

Part I: Algorithm
¯ y) = I {x ≥ y} limz↑x d(z, y). The algorithm
Then define d(x, y) = I {x ≤ y} limz↓x d(z, y) and d(x,
chooses At = t for the first k rounds and subsequently At = argmaxi Ui (t) where
 
¯ θ̂i (t − 1), θ̃) ≤ log(f (Ti (t − 1)))
Ui (t) = sup θ̃ ∈ Θ : d( .
Ti (t − 1)

Part II: Concentration


Given a fixed θ ∈ Θ and independent random variables X1 , . . . , Xn sampled from Pθ and
P
ŝt = 1t tu=1 S(Xu ) and θ̂t = θ̂(ŝt ). Let θ̃ ∈ Θ be such that d(θ̃, θ) = ε > 0. Then
   
P d(θ̂t , θ) ≥ ε ≤ P ŝt ≥ s̄(θ̃) ≤ exp(−td(θ̃, θ)) = exp(−tε) , (10.1)

where the second inequality follows from Part (e) of Exercise 34.5. Similarly
 
¯ θ̂t , θ) ≥ ε ≤ exp(−tε) .
P d( (10.2)

30
Define random variable τ by
 
log(f (t))
τ = min t : d(θ̂s , θ − ε) < for all s ∈ [n] .
s

In order to bound the expectation of τ we need a connection between d(θ̂s , θ − ε) and d(θ̂s , θ). Let
x ≤ y − ε and g(z) = d(x, z). Then
Z y
g(y) = g(y − ε) + g 0 (z)dz
y−ε
Z y
= g(y − ε) + (z − x)A00 (z)dz
y−ε
Z y
≥ g(y − ε) + inf A (z)
00
(z − x)dz
z∈[y−ε,y] y−ε
1
= g(y − ε) + inf A00 (z)ε(2y − 2x − ε)
2 z∈[y−ε,y]
ε2 inf z∈[y−ε,y] A00 (z)
≥ g(y − ε) + .
2
Note that inf z∈[y−ε,y] A00 (z) > 0 is guaranteed because A00 is continuous and [y − ε, y] is compact
and because M was assumed to be nonsingular. Using this, the expectation of τ is bounded by
n
X
E[τ ] = P (τ ≥ t)
t=1
 
Xn Xn
log(f (t))
≤ P d(θ̂s , θ − ε) ≥
t=1 s=1
s
!
Xn X n
ε2 inf z∈[θ−ε,θ] A00 (z) log(f (t))
≤ P d(θ̂s , θ) ≥ +
t=1 s=1
2 s
n X
X n
exp(−s inf z∈[θ−ε,θ] A00 (z)ε2 /2)

t=1 s=1
f (t)
= O(1) , (10.3)

where the last inequality follows from Eq. (10.1) and the final inequality is the same calculation as
in the proof of Lemma 10.7. Next let

κ = min{s ≥ 1 : θ̂u − θ < ε for all u ≥ s} .

The expectation of κ is easily bounded using Eq. (10.2),



n X
X
E[κ] ≤ (exp(−ud(θ + ε, θ))) = O(1) , (10.4)
s=1 u=s

where we used the fact that M is non-singular to ensure strict positivity of the divergences.

31
Part III: Bounding E[Ti (n)]

For each arm i let θ̂is = θ̂(µ̂is ). Now fix a suboptimal arm i and let ε < (θ1 − θi )/2 and
 
log(f (t))
τ = min t : d(θ̂s , θ − ε) < for all s ∈ [n] .
s

Then define

κ = min{s ≥ 1 : θ̂iu < θi + ε for all u ≥ s} .

Then by Eq. (10.3) and Eq. (10.4), E[τ ] = O(1) and E[κ] = O(1). Suppose that t ≥ τ and
Ti (t − 1) ≥ κ and At = i. Then Ui (t) ≥ U1 (t) ≥ θ1 − ε and hence

log(f (n))
d(θi + ε, θ1 − ε) < .
Ti (t − 1)

From this we conclude that


log(f (n))
Ti (n) ≤ 1 + τ + κ + .
d(θi + ε, θ1 − ε)

Taking expectations and limits shows that

E[Ti (n)] 1
lim sup ≤ .
n→∞ log(n) d(θi + ε, θ1 − ε)

Since the above holds for all sufficiently small ε > 0 and the divergence d is continuous it follows
that
E[Ti (n)] 1
lim sup ≤
n→∞ log(n) d(θi , θ1 )

for all suboptimal arms i. The result follows from the fundamental regret decomposition lemma
(Lemma 4.5).

10.5 For simplicity we assume the first arm is uniquely optimal. Define

¯ y) = I {x ≥ y} lim d(z, y) ,
d(x, d(x, y) = I {x ≤ y} lim d(x, y) .
z↑x z↓x

R
Let µ(θ) = R xdPθ (x) and s̄(θ) = Eθ [S] = A0 (θ) and S = {s̄(θ) : θ ∈ Θ}. Define θ̂ : R → cl(Θ) by





s−1 (x) , if x ∈ S ;
θ̂(x) = sup Θ , if x ≥ sup S ;



inf Θ , if x ≤ inf S .

32
The algorithm is a generalisation of KL-UCB. Let

1 X t
t̂i (t) = I {At = i} S(Xt ) and θ̂i (t) = θ̂(t̂i (t)) ,
Ti (t) s=1

which is the empirical estimator of the sufficient statistic. Like UCB, the algorithm plays At = t for
t ∈ [k] and subsequently At = argmaxi Ui (t), where
 
log(f (Ti (t − 1))f (t))
Ui (t) = sup µ(θ) : d(θ̂i (t − 1), θ) ≤
Ti (t − 1)

and ties in the argmax are broken by choosing the arm with the largest number of plays.

Part I: Concentration Given a fixed θ ∈ Θ and independent random variables X1 , . . . , Xn


P ¯ θ̃, θ) = ε > 0.
sampled from Pθ and ŝt = 1t tu=1 S(Xu ) and θ̂t = θ̂(ŝt ). Let θ̃ ∈ Θ be such that d(
Then
   
¯ θ̂s , θ) ≥ ε ≤ P ŝt ≥ s̄(θ̃) ≤ exp(−td(θ̃, θ)) = exp(−tε) .
P d( (10.5)

Using an identical argument,


 
P d(θ̂t , θ) ≥ ε) ≤ exp(−tε) . (10.6)

Define random variable τ by


 
log(f (s)f (t))
τ = min t : d(θ̂t , θ) < for all s ∈ [n] .
s

Then the expectation of τ is bounded by


n
X
E[τ ] = P (τ ≥ t)
t=1
 
Xn Xn
log(f (s)f (t))
≤ P d(θ̂s , θ + ε) ≥
t=1 s=1
s
n X
X n
2

t=1 s=1
f (s)f (t)
= O(1) , (10.7)

where the final equality follows from Eqs. (10.5) and (10.6) and the same calculation as in the proof
of Lemma 10.7. Next let

κ = min{s ≥ 1 : |θ̂u − θ| < ε for all u ≥ s} .

33
The expectation of κ is easily bounded Eq. (10.5) and Eq. (10.6):

n X
X
E[κ] ≤ (exp(−ud(θ + ε, θ) + exp(−ud(θ − ε, θ))) = O(1) , (10.8)
s=1 u=s

where we used the fact that M is non-singular to ensure strict positivity of the divergences.

Part II: Bounding E[Ti (n)] Choose ε > 0 sufficiently small that for all suboptimal arms i,

sup µ(φ) < µ∗


φ∈[θi −ε,θi +ε]

and define

di,min (ε) = min {d(θi + x, φ) : µ(φ) = µ∗ , x = ±ε}


φ∈Θ

di,inf (ε) = inf {d(θi + x, φ) : µ(φ) > µ∗ , x = ±ε} .


φ∈Θ

Let θ̂is be the empirical estimate of θi based on the first s samples of arm i, which means that
θ̂i (t) = θ̂iTi (t) . Let τ be the smallest t such that

log(f (s)f (t))


d(θ̂1s , θ1 ) < for all s ∈ [n] ,
s
which means that U1 (t) ≥ µ∗ for all t ≥ τ . For suboptimal arms i let κi be the random variable

κi = min{s : |θ̂iu − θi | < ε for all u ≥ s} .

Now suppose that t ≥ τ and Ti (t − 1) ≥ κi and At = i. Then Ui (t) ≥ U1 (t) ≥ µ∗ , which implies that

log(f (Ti (t − 1))f (t))


di,min (ε) ≥ .
Ti (t − 1)

This means that


( )!
X X log(t4 )
Ti (t) ≤ τ + 1 + max κi , . (10.9)
i>1 i>1
di,min (ε)

Then let
 
Λ = max t : T1 (t − 1) ≤ max Ti (t − 1) ,
i>1

which by Eq. (10.9) and Eq. (10.7) and Eq. (10.8) satisfies E[Λ] = O(1). Suppose now that t ≥ Λ.
Then T1 (t − 1) > maxi>1 Ti (t − 1) and by the definition of the algorithm At = i implies that

34
Ui (t) > µ∗ and so

log(Ti (t − 1)2 f (t))


Ti (t − 1) ≤ .
di,inf (ε)

Hence
log(f (Ti (n))f (t))
Ti (n) ≤ 1 + Λ + .
di,inf (ε)

Since E[Λ] = O(1) we conclude that

E[Ti (n)] 1
lim sup ≤ .
n→∞ log(n) di,inf (ε)

The result because limε→0 di,inf (ε) = di,inf and by the fundamental regret decomposition (Lemma 4.5).

Chapter 11 The Exp3 Algorithm

11.2 Let π be a deterministic policy. Therefore At is a function of x1A1 , . . . , xt−1,At−1 . We define


(xt )nt=1 inductively by

0 if At = i
xti =
1 otherwise .

Clearly the policy will collect zero reward and yet


n
X 1X n X k
n(k − 1)
max xti ≥ xti = .
i∈[k]
t=1
k t=1 i=1 k

Therefore the regret is at least n(k − 1)/k = n(1 − 1/k) as required.


P
11.5 Let X̂ as stated in the problem and let f (x) = i Pi X̂(i, xi ). Let x ∈ Rk be
arbitrary and x0 ∈ Rk be such that x0i = xi except possibly for component j > 1. Note that
f (x) = x1 = x01 = f (x0 ) and thus 0 = f (x) − f (x0 ) = Pj (X̂(j, xj ) − X̂(j, x0j )). Dividing
by Pj > 0 implies that X̂(j, xj ) = X̂(j, x0j ). Since xj and x0j were arbitrary, X̂(j, ·) ≡ const.
Call the value that X̂(j, ·) is equal to aj . Now, let x, x0 ∈ Rk be such that they agree on all
components except possibly on component one and let x01 = 0. Further, let a1 = X̂(1, 0). Then,
x1 −0 = f (x)−f (x0 ) = P1 (X̂(1, x1 )−a1 ). Reordering gives that for any x1 ∈ R, X̂(1, x1 ) = a1 +x1 /P1 .
P
Finally, let x be such that x1 = 0. Then, 0 = f (x) = i Pi ai , finishing the proof.

11.6 The first two parts are purely algebraic and are omitted.

(c) Let G0 , . . . , Gs be a sequence of independent geometrically distributed random variables with

35
E[Gu ] = 1/qu (α). Then
s
!
X
P (T2 (n/2) ≥ s + 1) = P Gu ≤ n/2
u=0
s−1
!
X
≥P Gu ≤ n/4 P (Gs ≤ n/4)
u=0
!!  n/4 !
s−1
X 1
= 1−P Gu > n/4 1− 1−
u=0
8n
1
≥ (1 − exp(−1/32))
2
1
≥ .
65

(d) Suppose that T2 (n/2) ≥ s + 1. Then for all t > n/2,

t−1
X
L̂t2 = Ŷu2 ≥ 8αn ≥ 2n .
u=1

On the other hand,


t−1
X t−1
X 1
L̂t1 = Ŷu1 = .
u=1
P
u=n/2+1 u1

Using induction and the fact that Pt1 ≥ 1/2 as long as L̂t1 ≤ L̂t2 it follows that on the event
E = {Tt (n/2) ≥ s + 1} that Pt1 ≥ 1/2 for all t. Therefore

t−1
!
X
Pt2 ≤ exp −η (Ŷs2 − Ŷs1 )
s=1
t−1
!!
X
≤ exp η Ŷs1 − 2n
s=1
≤ exp (−nη) ,

which combined with the previous part means that


 
1 n
P (At = 1 for all t > n/2) ≥ 1 − exp (−nη) .
65 2

The result follows because on the event {At = 1 for all t > n/2}, the regret satisfies
n αn n
R̂n ≥ − ≥ .
2 2 4

(e) Markov’s inequality need not hold for negative random variables and R̂n can be negative. For

36
this problem it even holds that E[R̂n ] < 0.

(f) Since for n = 104 , the probability of seeing a large regret is about 1/65 by the answer to the
previous part, Exp3 was run m = 500 times, which gives us a good margin to encounter large
regrets. The results are shown in Fig. 11.4. As can be seen, as predicted by the theory a
significant fraction of the cases, the regret is above n/4 = 2500. As seen from the figure, the
mean regret is negative.

11.7 First, note that if G = − log(− log(U )) with U uniform on [0, 1] then

P (G ≤ g) = e− exp(−g) .

Now, the result follows from a long sequence of equalities:


!  
Y
P log ai + Gi ≥ max log aj + Gj = E P (log aj + Gj ≤ log ai + Gi | Gi )
j∈[k]
j6=i
 
Y  
aj
= E exp − exp(−Gi ) 
j6=i
ai
" P aj #

= E Ui j6=i ai

1
= P
1+
aj
j6=i ai
ai
= Pk .
j=1 aj

Chapter 12 The Exp3-IX Algorithm

12.1
2
(a) We have µt = Et−1 [Ŷti ] = Pti +γ . Further, Vt−1 [Ŷti ](= Et−1 [(Ŷti − µt )2 ]) =
Pti yti Pti (1−Pti )yti
(Pti +γ)2

(Pti +γ)yti
(Pti +γ)2
= Pti +γ .
yti
For any η > 0 such that η(Ŷti − µt ) = η (AtiP−P ti )yti
ti +γ
≤ 1 almost surely for all
t ∈ [n],
X Pti yti X yti 1
L̂ni − ≤η + log(1/δ) .
t
Pti + γ t
Pti + γ η

Choosing η = γ, the constraints η(Ŷti − µt ) ≤ 1 are satisfied for t ∈ [n]. Plugging in this value
and reordering gives the desired inequality.
P P Pti yti P P P
(b) We have µt = Et−1 [ i Ŷti ] = i Pti +γ . Further, Vt−1 [ i Ŷti ] ≤ Et−1 [( i Ŷti ) ] =
2
i Et−1 [Ŷti ] =
2
P 2 P P P
i Pti +γ . To satisfy the constraint on η we calculate η( − µt ) ≤ η =
Pti yti yti
i (Pti +γ)2 ≤ i Ŷti i Ŷti

37
P Ati yti η P
i Ati = γ . Hence, any η ≤ γ is suitable. Choosing η = γ, we get
η
η i Pti +γ ≤ γ

X X X Pti yti XX γyti 1


L̂ni − ≤ + log(1/δ) .
i t i
Pti + γ t i
Pti + γ γ

Reordering as before gives the desired result.

12.4 We proceed in five steps.

P
Step 1: Decomposition Using that a Pta = 1 and some algebra we get

n X
X k
Pta (Zta − ZtA∗ )
t=1 a=1
n X
X k n X
X k n
X
= Pta (Z̃ta − Z̃ tA∗ )+ Pta (Zta − Z̃ta ) + (Z̃tA∗ − ZtA∗ ) .
t=1 a=1 t=1 a=1 t=1
| {z } | {z } | {z }
(A) (B) (C)

Step 2: Bounding (A) By assumption (c) we have βta ≥ 0, which by assumption (a) means
that η Z̃ta ≤ η Ẑta ≤ η|Ẑta | ≤ 1 for all a. A straightforward modification of the analysis in the last
chapter shows that (A) is bounded by

log(k) Xn X k
(A) ≤ +η 2
Pta Z̃ta
η t=1 a=1

log(k) Xn X k Xn X k
= +η Pta (Ẑta
2
+ βta
2
) − 2η Pta Ẑta βta
η t=1 a=1 t=1 a=1

log(k) Xn X k Xn X k
≤ +η Pta Ẑta + 3
2
Pta βta ,
η t=1 a=1 t=1 a=1

where in the last two line we used the assumptions that ηβta ≤ 1 and η|Ẑta | ≤ 1.

Step 3: Bounding (B) For (B) we have

n X
X k n X
X k
(B) = Pta (Zta − Z̃ta ) = Pta (Zta − Ẑta + βta ) .
t=1 a=1 t=1 a=1

We prepare to use Exercise 5.15. By assumptions (c) and (d) respectively we have ηEt−1 [Ẑta
2]≤β
ta
and Et−1 [Ẑta ] = Zta . By Jensen’s inequality,
 !2 
k
X k
X k
X
ηEt−1  Pta (Zta − Ẑta ) ≤η Pta Et−1 [Ẑta
2
]≤ Pta βta .
a=1 a=1 a=1

38
Therefore by Exercise 5.15, with probability at least 1 − δ
n X
X k
log(1/δ)
(B) ≤ 2 Pta βta + .
t=1 a=1
η

Step 4: Bounding (C) For (C) we have


n
X n 
X 
(C) = (Z̃tA∗ − ZtA∗ ) = ẐtA∗ − ZtA∗ − βtA∗ .
t=1 t=1

Because A∗ is random we cannot directly apply Exercise 5.15, but need a union bound over all actions.
Let a be fixed. Then by Exercise 5.15 and the assumption that η|Ẑta | ≤ 1 and Et−1 [Ẑta ] = Zta and
ηEt−1 [Ẑta
2 ] ≤ β , with probability at least 1 − δ.
ta

n 
X  log(1/δ)
Ẑta − Zta − βta ≤ .
t=1
η

Therefore by a union bound we have with probability at most 1 − kδ,

log(1/δ)
(C) ≤ .
η

Step 5: Putting it together Combining the bounds on (A), (B) and (C) in the last three
steps with the decomposition in the first step shows that with probability at least 1 − (k + 1)δ,

3 log(1/δ) Xn X k Xn X k
Rn ≤ +η Pta Ẑta + 5
2
Pta βta .
η t=1 a=1 t=1 a=1

where we used the assumption that δ ≤ 1/k.

Chapter 13 Lower Bounds: Basic Ideas

13.2 Notice that a policy has zero regret on all bandits for which the first arm is optimal if and
only if Pνπ (At = 1) = 1 for all t ∈ [n]. Hence the policy that always plays the first arm is optimal.

Chapter 14 Foundations of Information Theory

14.4 Let µ = P − Q, which is a signed measure on (Ω, F). By the Hahn decomposition theorem
there exist disjoint sets A, B ⊂ Ω such that A ∪ B = Ω and µ(E) ≥ 0 for all measurable E ⊆ A and

39
µ(E) ≤ 0 for all measurable E ⊆ B. Then
Z Z Z Z
XdP − XdQ = Xdµ + Xdµ
Ω Ω A B
≤ bµ(A) + aµ(B)
= (b − a)µ(A)
≤ (b − a)δ(P, Q) ,

where we used the fact that µ(B) = P (B) − Q(B) = Q(A) − P (A) = −µ(A).

14.10 Dobrushin’s theorem says that for any ω,


 
X P (Ai | ω)
D(P (· | ω), Q(· | ω)) = sup P (Ai | ω) log ,
(Ai ) i Q(Ai | ω)

where the supremum is taken over all finite partitions of R with rational-valued end-points. By the
definition of a probability kernel it follows that the quantity inside the supremum on the right-hand
side is F-measurable as a function of ω for any finite partition. Since the supremum is over a
countable set, the whole right-hand side is F-measurable as required.

14.11 First assume that P  Q. Then let P t and Qt be the restrictions of P and Q to (Rt , B(Rt ))
given by

P t (A) = P (A × Ωn−t ) and Qt (A) = Q(A × Ωn−t ) .

You should check that P  Q implies that P t  Qt and hence there exists a Radon-Nikodym
derivative dP t /dQt . Define
,
dP t dP t−1
F (xt | x1 , . . . , xt−1 ) = (x1 , . . . , xt ) (x1 , . . . , xt−1 ) ,
dQt dQt−1

which is well defined for all x1 , . . . , xt−1 ∈ Rt−1 except for a set of P t−1 -measure zero. Then for any
A ∈ B(Rt−1 ) and B ∈ B(R),
Z Z Z Z
dP t
F (xt | ω)Qt (dxt | ω)P t−1 (dω) = t
(xt , ω)Qt (dxt | ωQt−1 (dω)
A B A B dQ
Z
dP t t
= t
dQ
A×B dQ
= P (A × B) .

A monotone class argument shows that F (xt | ω) is P t−1 -almost surely the Radon-Nikodym derivative

40
of Pt (· | ω) with respect to Qt (· | ω). Hence
  
dP
D(P, Q) = EP log
dQ
n
X
= EP [log (F (Xt | X1 , . . . , Xt−1 ))]
t=1
Xn
= EP [D(Pt (· | X1 , . . . , Xt−1 ), Qt (· | X1 , . . . , Xt−1 ))] .
t=1

Now suppose that P 6 Q. Then by definition D(P, Q) = ∞. We need to show this implies
there exists a t ∈ [n] such that D(Pt (· | ω), Qt (· | ω)) = ∞ with nonzero probability. Proving the
contrapositive, let

Ut = {ω : D(Pt (· | ω), Qt (· | ω)) < ∞}

and assume that P (Ut = 1) for all t. Then U = ∩nt=1 Ut satisfies P (U ) = 1. On


Ut let F (xt | x1 , . . . , xt−1 ) = dPt (· | x1 , . . . , xt−1 )/dQt (· | x1 , . . . , xt−1 )(xt ) and otherwise let
F (xt | x1 , . . . , xt−1 ) = 0. Iterating applications of Fubini’s theorem shows that for any (At )nt=1
with At ∈ B(R) it holds that
Z n
Y
F (xt | x1 , . . . , xt−1 )Q(dx1 , . . . , dxn ) = P (A1 × · · · × An ) .
A1 ×···×An t=1

Q
Hence nt=1 F (xt | x1 , . . . , xt−1 ) behaves like the Radon-Nikodym derivative of P with respect to
Q on rectangles. Another monotone class argument extends this to all measurable sets and the
existence of dP/dQ guarantees that P  Q.

Chapter 15 Minimax Lower Bounds

15.1 Abbreviate θ̂ = θ̂(X1 , . . . , Xn ) and let R(P ) = EP [d(θ̂, P )]. By the triangle inequality

d(θ̂, P0 ) + d(θ̂, P1 ) ≥ d(P0 , P1 ) = ∆ .

Let E = {d(θ̂, P0 ) ≤ ∆/2}. On E c it holds that d(θ̂, P0 ) ≥ ∆/2 and on E it holds that
d(θ̂, P1 ) ≥ ∆ − d(θ̂, P0 ) ≥ ∆/2.

∆ ∆
R(P0 ) + R(P1 ) ≥ (P0 (E c ) + P1 (E)) ≥ exp(− D(P0 , P1 )) .
2 4

41
The result follows because max{a, b} ≥ (a + b)/2.

Chapter 16 Instance-Dependent Lower Bounds

16.2

(a) Suppose that µ 6= µ0 . Then D(R(µ), R(µ0 )) = ∞, since R(µ) and R(µ0 ) are not absolutely
continuous. Therefore dinf (R(µ), µ∗ , M) = ∞.

(b) Notice that each arm returns exactly two possible rewards and once these have been observed,
then the mean is known. Consider the algorithm that plays each arm until it has observed both
possible rewards from that arm and subsequently plays optimally. The expected number of
P
trials before both rewards from an arm are observed is ∞i=2 i2
1−i = 3. Hence

k
X
Rn ≤ 3 ∆i .
i=1

(c) Let Pµ be the shifted Rademacher distribution with mean µ. Then D(Pµ , Pµ+∆ ) is not
differentiable as a function of ∆.

16.7 Fix any policy π. If π is not a consistent policy for E[0,b]


k then π cannot have logarithmic
regret, which contradicts both Eq. (16.7) and Eq. (16.8). Hence, we may assume that π is consistent
for this environment class. We can thus apply Theorem 16.2. Let M be the set of probability
distributions supported on [0, b] and let M0 be the set of scaled Bernoulli distributions supported
on [0, b]: P ∈ M0 if P = (1 − p)δ0 + pδb for p ∈ [0, 1] where δx is the Dirac distribution supported
on {x}. Then, thanks to Theorem 16.2, for any ν ∈ E[0,b]k ,

Rn (π, ν) X ∆i X ∆i
lim inf ≥ ≥ ,
n→∞ log(n) d (Pi , µ , M) i:∆ >0 dinf (Pi , µ∗ , M0 )
i:∆ >0 inf

i i

where the latter inequality holds thanks to M0 ⊂ M. Note that any P ∈ M0 is uniquely
determined by its Bernoulli parameter p. Choose some ν ∈ (M0 )k , ν = (Pi ) and let pi the
Bernoulli parameter underlying Pi . Introduce p∗ = maxi pi . Then, dinf (Pi , µ∗ , M0 ) = d(pi , p∗ ) where
d(p, q)(= D(B(p), B(q))) = p log(p/q) + (1 − p) log((1 − p)/(1 − q)) (cf. Definition 10.1). Furthermore,
∆i = b(p∗ − pi ). Hence, we conclude that

Rn (π, ν) Rn (π, ν) X b(p∗ − pi )


lim sup ≥ lim inf ≥ .
n→∞ log(n) n→∞ log(n) i:p <p∗
d(pi , p∗ )
i

We will consider environments ν = νδ given by the Bernoulli parameters ((1+δ)/2, . . . , (1+δ)/2, (1−

42
δ)/2) for some δ ∈ [0, 1]. Thus,

Rn (π, ν) X b(p∗ − pi ) bδ
lim sup ≥ =
n→∞ log(n) i:p <p∗
d(pi , p )
∗ d((1 − δ)/2, (1 + δ)/2)
i

b
= .
log((1 + δ)/(1 − δ))

Denote the right-hand side by f (δ). Noticing that limδ→0 f (δ) = ∞ we immediately see that
Eq. (16.7) cannot hold: As the action gap gets small, if we maintain the variance at constant (as in
this example) then the regret blows up with the inverse action gap. To show that Eq. (16.8) cannot
hold either, consider the case when δ → 1. The right-hand side of Eq. (16.8) is σk2 /∆k = b(1−δ 2 )/(4δ).
Now, f (δ) ∆ k
σk2
= (1−δ2 ) log((1+δ)/(1−δ))

→ ∞ as δ → 1. (Note that in this case the variance decreases
to zero, while the gap is maintained. Since the algorithm does not know that the variance is zero, it
has to pay a logarithmic cost).

Chapter 17 High-Probability Lower Bounds

17.1

Proof of Claim 17.5. We have


Z
δ ≤ PQ (R̂n ≥ u) = Pδx (R̂n ≥ u)dQ(x) .
[0,1]n×k

Therefore there exists an x with Pδx (R̂n ≥ u) ≥ δ.

Proof of Claim 17.6. Abbreviate Pi to be the law of A1 , X1 , . . . , An , Xn induced by PQi and let Ei
be the corresponding expectation operator. Following the standard argument, let j be the arm that
minimises E1 [Tj (n)], which satisfies
n
E1 [Tj (n)] ≤ .
k−1
Therefore by Theorem 14.2 and Lemma 15.1,

max {P1 (T1 (n) ≤ n/2), Pj (Tj (n) ≤ n/2)} ≥ P1 (T1 (n) ≤ n/2) + Pj (T1 (n) > n/2)
1  
≥ exp −E1 [Ti (n)]2∆2
2   
1 k−1 1
≥ exp −E1 [Ti (n)] log
2 n 8δ
≥ 4δ .

Therefore there exists an i such that Pi (Ti (n) ≤ n/2) ≥ 2δ.

43
Proof of Claim 17.7. Notice that if ηt + 2∆ < 1 and ηt > 0, then Xtj ∈ (0, 1) for all j ∈ [k]. Now
! !
(1/2 − 2∆)2 (1/2)2
Pi (ηt + 2∆ ≥ 1 or ηt ≤ 0) ≤ exp − + exp −
2σ 2 2σ 2
   
100 25
≤ exp − + exp −
32 2
1
≤ .
8
P
Let M = nt=1 I {ηt ≤ 0 or ηt + 2∆ ≥ 1}, which is an upper bound on the number of rounds where
clipping occurs. By Hoeffding’s bound,
 s 
n log(1/δ) 
Pi M ≥ Ei [M ] + ≤ δ,
2

which means that with probability at least 1 − δ,


s
n n log(1/δ) n
M≤ + ≤ ,
8 2 4

where we used the fact that n ≥ 32 log(1/δ).

Chapter 18 Contextual Bandits

18.1

(a) By Jensen’s inequality,


v v
u n u n
X uX X 1 uX
t I {ct = c} = |C| t I {ct = c}
c∈C t=1 c∈C
|C| t=1
v
uX
u 1 X
n
≤ |C|t I {ct = c}
c∈C
|C| t=1
q
= |C|n ,

where the inequality follows from Jensen’s inequality and the concavity of · and the last
P P
equality follows since c∈C nt=1 I {ct = c} = n.

(b) When each context occurs n/|C| times we have


v
u n q
X uX
t I {ct = c} = n|C| ,
c∈C t=1

44
which matches the upper bound proven in the previous part.

18.6

(a) This follows because Xi ≤ maxj Xj for any family of random variables (Xi )i . Hence
E [Xi ] ≤ E [maxj Xj ].

(b) Modify the proof by proving a bound on


" n n
#
X (t) X
E Em∗ xt − Xt
t=1 t=1

for an arbitrarily fixed m∗ . By the definition of learning experts, Et [X̂t ] = xt and so Eq. (18.10)
also remains valid. Note this would not be true in general if E (t) were allowed to depend on At .
The rest follows the same way as in the oblivious case.

18.7 The inequality En∗ ≤ nk is trivial since maxm Emi ≤ 1. To prove En∗ ≤ nM , let
(t)

m∗ti = argmaxm Emi . Then


(t)

n X
X k X
M n X
X M X
k
En∗ = Emi I {m = m∗ti } ≤ Emi = nM ,
(t) (t)

t=1 i=1 m=1 t=1 m=1 i=1

Pk
where in the last step we used the fact that = 1.
(t)
i=1 Emi

18.8 Let X̂tφ = kI {At = φ(Ct )}. Assume that t ≤ m. Since At is chosen uniformly at random,

E[X̂tφ ] = E [kI {At = φ(Ct )} Xt ]


= E [kI {At = φ(Ct )} E [Xt | At , Ct ]]
= E [kI {At = φ(Ct )} µ(Ct , At )]
= E [µ(Ct , φ(Ct ))]
= µ(φ) .

The variance satisfies

E[(X̂tφ − µ(φ))2 ] ≤ E[X̂tφ


2
] ≤ k 2 E[I {At = φ(Ct )}] = k .

Finally note that X̂tφ ∈ [0, k]. Using the technique from Exercise 5.14,
!
kλ2 g(λk/m)
E[exp(λ(µ̂(φ) − µ(φ)))] ≤ exp ,
m

where g(x) = (exp(x) − 1 − x)/x2 , which for x ∈ (0, 1] satisfies g(x) ≤ 1/2 + x/4. Suppose that

45
m ≥ 2k log(2|Φ|). Then by following the argument in Exercise 5.18,
  !
log(2|Φ|) kλ k 2 λ2
E max |µ̂(φ) − µ(φ)| ≤ inf + +
φ λ>0 λ 2m 4m2
s
2k log(2|Φ|) k log(2|Φ|)
≤ + ,
m 2m

where the second inequality follows by choosing


s
2m log(2|Φ|) m
λ= ≤ .
k k

Therefore the regret is bounded by


s
2k log(2|Φ|) nk log(2|Φ|)
Rn ≤ m + 2n + .
m m

By tuning m it follows that


 
Rn = O n2/3 (k log(|Φ|))1/3

18.9 Let C1 , . . . , Cn ∈ C be the i.i.d. sequence of contexts and for k ∈ [n] let C1:k = (C1 , . . . , Ck ).
The algorithm is as follows: Following the hint, for the first m rounds the algorithm selects arms
in an arbitrary fashion. The regret from this period is bounded by m. The algorithm then picks
M = |ΦC1:m | functions Φ0 = {φ1 , . . . , φM } from Φ so that Φ0 |C1:m = ΦC1:m and for the remaining
n − m rounds uses Exp4 with the set Φ0 . q
p
The regret of Exp4 for competing against Φ0 is 4n log(M ) ≤ 4n log((em/d)d ) =
p
4nd log(em/d), where the inequality follows from Sauer’s lemma. It remains to show that
the best expert in Φ0 achieves almost as much reward as the best expert in Φ. For this, it suffices to
show that for any φ∗ ∈ Φ, the expert φ0∗ ∈ Φ0 that agrees with φ∗ on C1:m agrees with φ∗ on most
of the rest of the rounds. Let d∗ be a positive integer to be chosen later. We will show that with
high probability φ0∗ and φ∗ agree except for at most d∗ rounds.
P
We need a few definitions: For a, b finite sequences of equal length k let d(a, b) = ki=1 I {ai 6= bi }
be their Hamming distance. For a sequence c ∈ C k , let φ(c) = (φ(c1 ), . . . , φ(ck )) ∈ {1, 2}k .
For φ, φ0 ∈ Φ, let dC (φ, φ0 ) = d(φ(C1:k ), φ0 (C1:k )). For a permutation π on [n], let C π =
(k)

(Cπ(1) , Cπ(2) , . . . , Cπ(n) ).


Let π be a random permutation on [n], chosen uniformly from the set of all permutations on [n],
independently of C. We have
   
p = P dC (φ∗ , φ0∗ ) ≥ d∗ ≤ P ∃φ, φ0 ∈ Φ : dC (φ, φ0 ) ≥ d∗ , dC (φ, φ0 ) = 0
(n) (n) (m)

 
= P ∃φ, φ0 ∈ Φ : dC π (φ, φ0 ) ≥ d∗ , dC π (φ, φ0 ) = 0
(n) (m)

h  i
= E P ∃φ, φ0 ∈ Φ : dC π (φ, φ0 ) ≥ d∗ , dC π (φ, φ0 ) = 0|C
(n) (m)
,

46
where the first equality is the definition of p, the second holds by the exchangeability of (C1 , . . . , Cn ).
Now, by a union bound and because π and C are independent,
 
P ∃φ, φ0 ∈ Φ : dC π (φ, φ0 ) ≥ d∗ , dC π (φ, φ0 ) = 0|C
(n) (m)

X
≤ P (d(aπ , bπ ) ≥ d∗ , d(aπ1:m , bπ1:m ) = 0) .
a,b∈ΦC1:n

By Sauer’s lemma, |ΦC1:n | ≤ (en/d)d . For any fixed a, b ∈ {1, 2}n , p(a, b) =
P (d(a , b ) ≥ d , d(a1:m , b1:m ) = 0) is the probability that all bits in a random subsequence of
π π ∗ π π

length m of a bit sequence of length n with at least d∗ one bits has only zero bits.
mAs the probability
of a randomly chosen bit to be equal to zero is 1 − d /n, p(a, b) ≤ 1 − n
∗ d∗
≤ exp(−d∗ m/n).
 
Choosing d∗ = d m
n
log (en/d)2d
δ e we find that p ≤ δ.

Choosing δ = 1/n, we get that


s  
em
Rn ≤ 1 + m + d + ∗
4nd log
d
   s  
n en em
≤2+m+ log(n) + (2d) log + 4nd log .
m d d

Letting m = dn gives the desired bound.

Chapter 19 Stochastic Linear Bandits

Pt
19.3 Let T be the set of rounds t when kat kV −1 ≥ 1 and Gt = V0 + s=1 IT (s)as a>
s . Then
t−1

!d  d
dλ + |T |L2 trace(Gn )

d d
≥ det(Gn )
Y
= det(V0 ) (1 + kat k2G−1 )
t−1
t∈T
Y
≥ det(V0 ) (1 + kat k2V −1 )
t−1
t∈T

≥ λd 2|T | .

Rearranging and taking the logarithm shows that


!
d |T |L2
|T | ≤ log 1 + .
log(2) dλ

47
Abbreviate x = d/ log(2) and y = L2 /dλ, which are both positive. Then
 
x log (1 + y(3x log(1 + xy))) ≤ x log 1 + 3x2 y 2 ≤ x log(1 + xy)3 = 3x log(1 + xy) .

Since z − x log(1 + yz) is decreasing for z ≥ 3x log(1 + xy) it follows that


!
3d L2
|T | ≤ 3x log(1 + xy) = log 1 + .
log(2) λ log(2)

19.4

(a) Let A = B + uu> . Then, for x 6= 0,

kxk2A (x> u)2 kxk2B kuk2B −1 det A


= 1 + ≤ 1 + = 1 + kuk2B −1 = . (19.1)
2
kxkB kxkB2 kxkB2 det B

where the inequality follows by Cauchy-Schwartz, and the equality follows because det(A) =
det(B) det(I + B −1/2 uu> B −1/2 ) = det(B)(1 + kuk2B −1 ) per the proof of Lemma 19.4. In the
P
general case, C = A − B  0 can be written using its eigendecomposition as C = ki=1 ui u> i
P
with 0 ≤ k ≤ d. Letting Aj = B + i≤j ui u> i , 0 ≤ j ≤ k note that A = A k and B = A 0 and
thus by (19.1),

kxk2A kxk2Ak kxk2Ak kxk2Ak−1 kxk2A1


= = . . .
kxk2B kxk2A0 kxk2Ak−1 kxk2Ak−2 kxk2A0
det Ak det Ak−1 det A1 det A
≤ ... = .
det Ak−1 det Ak−2 det A0 det B

P
(b) We need some notation. As before, for t ≥ 0, we let Vt (λ) = V0 + s≤t As A> s , where As is the
action taken in round s. Let τ0 = 0. For t ≥ 1, we let τt ∈ [t] τt denote the round index when
the phase that contains round t starts. That is,

τ
t−1 , if det Vt−1 (λ) ≤ (1 + ε) det Vτt−1 −1 (λ) ;
τt =
t, otherwise .

Further,

A
t−1 , if τt = τt−1 ;
At =
argmax
a∈A UCBt (a) , otherwise .

Here, UCBt (a) is the upper confidence bound based on all the data available up the beginning
of round t.
Let the event when θ∗ ∈ ∩t∈[n] Ct hold. Define θ̃t ∈ Ct so that if Ãt = argmaxa∈A UCBt (a) then
hθ̃t , Ãt i = maxa∈A UCBt (a). Letting a∗ = argmaxa∈A hθ∗ , ai and noting that maxa UCBτt (a) =

48
UCBτt (Aτt ) and that At = Aτt ,

hθ∗ , a∗ i ≤ UCBτt (a∗ ) ≤ UCBτt (Aτt ) = hθ̃τt , Aτt i = hθ̃τt , At i .

Hence,

rt = hθ∗ , a∗ − At i ≤ hθ̃τt − θ∗ , At i ≤ kθ̃τt − θ∗ kVt−1 kAt kV −1 .


t−1

Now, by Part (a), thanks to Vτt −1  Vt−1 and that θ∗ , θ̃τt ∈ Cτt and that Cτt is contained in an
ellipsoid of radius βτt ≤ βt and “shape” determined by Vτt −1 ,
 1/2 q
det Vt−1
kθ̃τt − θ∗ kVt−1 ≤ kθ̃τt − θ∗ kVτt −1 ≤ 2 (1 + ε)βt .
det Vτt −1

Combined with the previous inequality, we see that


q
rt ≤ 2 (1 + ε)βt kAt kV −1 ,
t−1

which is the same as Eq. (19.10) except that βt is replaced with (1 + ε)βt . This implies in turn
P
that R̂n = nt=1 ≤ R̂n ((1 + ε)βn ).

19.5 We only present in detail the solution for the first part.

(a) Partition C into m equal length subintervals, call these C1 , . . . , Cm . Associate a bandit algorithm
with each of these subintervals. In round t, upon seeing Ct ∈ C, play the bandit algorithm
associated with the unique subinterval that Ct belongs to. For example, one could use a Exp3
as in the previous chapter, or UCB. The regret is
" n n
#
X X
Rn = E max r(Ct , a) − Xt
a∈[k]
t=1 t=1
" n # " n #
X X
=E max r(Ct , a) − max r̃([Ct ], a) + E max r̃([Ct ], a) − Xt ,
a∈[k] a a∈[k]
t=1 t=1
| {z } | {z }
(I) (II)

where for c ∈ C, [c] is the index of the unique part Ci that c belongs to and for i ∈ [m],

E [r(C , a) | [C ] = i] if P ([Ct ] = i) > 0
r̃(i, a) =
t t
0 otherwise .

The first term in the regret decomposition is called the approximation error, and the second is
the error due to learning. The approximation error is bounded using the Lipschitz assumption

49
and the definition of the discretisation:

max r(c, a) − max r̃([c], a) ≤ max(r(c, a) − r̃([c], a))


a∈[k] a∈[k] a∈[k]

= max E[r(c, a) − r(Ct , a) | [Ct ] = [c]])


a∈[k]

≤ max E[L|c − Ct | | [Ct ] = [c]])


a∈[k]

≤ L/m ,

where in the second last inequality we used the assumption that r is Lipschitz and in the last
that when [Ct ] = [c] it holds that |Ct − c| ≤ 1/m. It remains to bound the error due to learning.
Note that E[Xt | [Ct ] = i] = r̃(i, At ). As a result, the data experienced by the bandit associated
with Ci satisfies the conditions of a stochastic bandit environment. Consider the case when
Exp3 is used with the adaptive learning rate as described in Exercise 28.13. For i ∈ [m], let
Ti = {t ∈ [n] | [Ct ] = i}, Ni = |Ti | and
!
X
Rni = max r̃([Ct ], a) − Xt .
a∈[k]
t∈Ti

p
Then, by Eq. (11.2), E[Rni ] ≤ CE[ k log(k)Ni ] and thus
" #
m
X m
X q
(II) ≤ E [Rni ] ≤ CE (k log(k)) ≤ C k log(k)mn ,
1/2 1/2
Ni
i=1 i=1

where the last inequality follows from Cauchy-Schwarz. Hence,

Ln q
Rn ≤ + C k log(k)mn .
m

Optimizing m gives m = (L/C)2/3 (n/(k log(k)))1/3 and

Rn ≤ C 0 n2/3 (Lk log(k))1/3

for some constant C 0 that depends only on C. The same argument works with no change if
the bandit algorithm is switched to UCB, just a little extra work is needed to deal with the
fact that UCB will be run for a random number of rounds. Luckily, the number of rounds is
independent of the rewards experienced and the actions taken.

(b) Consider a lower bound that hides a ‘spike’ at one of m positions.

(c) Make an argument that C can be partitioned into m = (3dL/ε)d partitions to guarantee that for
fixed a the function r(c, a) varies by at most ε within each partition. Then bound the regret by
s
 d
3dL
Rn ≤ nε + 2nk log(k) .
ε

50
By optimizing ε you should arrive at a bound that depends on the horizon like O(n(d+1)/(d+2) ),
which is really quite bad, but also not improvable without further assumptions. You might find
the results of Exercise 20.3 to be useful.

19.6

(a) We need to show that Lt (θ∗ ) ≤ βt for all t with high probability. By definition,
1/2

t
X
Lt (θ∗ ) = gt (θ∗ ) − Xs As
s=1 Vt−1
t
X
= λθ∗ + (µ(hθ∗ , As i)As − (µ(hθ∗ , As i)As + ηs )As )
s=1 Vt−1
t
X
= λθ∗ + As η s
s=1 Vt−1
t
X √
≤ As η s + λkθ∗ k2 .
s=1 Vt−1

The result now follows by Theorem 20.4.

(b) Let gt0 denote the derivative of gt (which exists by assumption). By the mean value theorem,
there exists a ξ on the segment connecting θ and θ0 such that

gt (θ) − gt (θ0 ) = gt0 (ξ)(θ − θ0 )


t
!
X
= λI + µ 0
(hξ, As i)As A>
s (θ − θ0 )
s=1
= Mt (θ − θ0 ) ,
Pt
where Mt = λI + s=1 µ
0 (hξ, A i)A A>
s s s  c1 Vt . Hence,

kgt (θ) − gt (θ0 )kV −1 = kθ − θ0 kMt V −1 Mt ≥ c1 kθ − θ0 kVt .


t t

(c) Let θ̃t be such that µ(hθ̃t , At i) = maxa∈At maxθ∈Ct µ(hθ, ai). Using the fact that θ∗ ∈ Ct ,

rt = µ(hθ∗ , A∗t i) − µ(hθ∗ , At i) ≤ µ(hθ̃t , At i) − µ(hθ∗ , At i) ≤ c2 hθ̃t − θ∗ , At i


≤ c2 kθ̃t − θ∗ kVt−1 kAt kV −1
t−1
c2
≤ kgt−1 (θ̃t ) − gt−1 (θ∗ )kV −1 kAt kV −1
c1 t−1 t−1
c2  
≤ Lt−1 (θ̃t ) + Lt−1 (θ∗ ) kAt kV −1
c1 t−1

2c2 βt−1
1/2
≤ kAt kV −1 .
c1 t−1

51
(d) By Part (a), with probability at least 1 − δ, θ∗ ∈ Ct for all t. Hence, by Part (c), with
probability at least 1 − δ,

2c2 βn
n
X 1/2 n
X
R̂n = rt ≤ (1 ∧ kAt kV −1 )
t=1
c1 t=1
t−1
s  
2c2 nL2
≤ 2ndβn log 1 + ,
c1 d

where we used Lemma 19.4 and the same argument as in the proof of Theorem 19.2.

19.7

(a) Let ξ1 , . . . , ξd be the eigenvalues of Vt in increasing order. By the Courant–Fischer min-max


theorem, δi = ξi − λi ≥ 0. Hence,
 
det(Vt ) d
Y ξi d
Y δi
= = 1+ .
det(V0 ) i=1 λi i=1 λi
Pd
A trace argument shows that i=1 δi ≤ nL2 , which, assuming that deff < d, implies that
   
det(Vt ) d
X δi
log = log 1 +
det(V0 ) i=1
λi
deff
X   d
X
δi δi
= log 1 + + (log(1 + x) ≤ x, x ≥ 0, λi increasing)
i=1
λi i=deff +1
λdeff +1
deff !
X nL2 nL2
≤ log 1 + +
i=1
λ λdeff +1
!
nL2
≤ 2deff log 1 + .
λ

When deff = d, the desired inequality trivially holds.

(b) By Theorem 20.4 in the next chapter, it holds with probability least 1 − δ that for all t,
s    
t
X 1 det(Vt )
As Xs ≤ 2 log + log .
δ det(V0 )
s=1 Vt−1

52
On this event,
t
X t
X
kθ̂t − θkVt = Vt−1 s θ − θ + Vt
As A> −1
As ηs
s=1 s=1 Vt
t
X
≤ Vt−1 V0 θ + As η s
Vt
s=1 Vt
s    
1 det(Vt )
≤ kθkV0 + 2 log + log
δ det(V0 )
s    
1 det(Vt )
≤m+ 2 log + log .
δ det(V0 )

(c) Using again the argument as in the proof of Theorem 19.2,

rt = hA∗t − At , θ∗ i ≤ 2βn1/2 min{1, kAt kV −1 } .


t−1

Then, using Lemma 19.4,


v
n u n
X u X
rt ≤ 2βn
1/2 t
n min{1, kAt k2 −1 } Vt−1
t=1 t=1
s  
det Vt
≤ 2βn1/2 2n log
det V0
s  
nL2
≤ 2βn1/2 2ndeff log 1 + .
λ

(d) The result follows by combining the result of the last part and that of Part (a) while choosing
δ = 1/n.

(e) The bound to improves when deff  d. For this, V0 should have a large number of large
P
eigenvalues while maintaining i λi hvi , θ∗ i2 = kθ∗ k2V0 ≤ m2 : Choosing V0 is betting on the
directions (vi ) in which θ∗ will have small components as in those directions we can choose λi
to be large. This way, one can have a finer control than just choosing features of some fixed
dimension: The added flexibility is the main advantage of non-uniform regularisation. However,
this is not for free: Non-uniform regularisation may increase kθ∗ kV0 for some θ∗ which may lead to
worse regret. In particular, when kθ∗ kV0 ≤ m is not satisfied, the confidence set will still contain
θ∗ , but the coverage probability 1−δ will worsen to about 1−δ 0 where δ 0 ≈ δ exp((kθ∗ kV0 −m)2+ ),
which increases the δn term in the regret to δ 0 n. Thus, the degradation is smooth, but can be
quite harsh. Since the confidence radius is probably conservative, the degradation may be less
noticeable than what one would expect based on this calculation.
P
19.8 For the last part, note that it suffices to store the inverse matrix Ut = Vt−1 (λ), St = s≤t As Xs ,
together with dt = log det V0 (λ) .
det Vt (λ)
Choosing V0 = λI, we have U0 = λ−1 I, S0 = 0 and d0 = 0. Then,

53
for t ≥ 0 and a ∈ Rd we have
p
UCBt (a) = hθ̂t , ai + βt kakUt−1

where

θ̂t = Ut−1 St−1 ,


p √ q
βt = m2 λ + 2 log(1/δ) + dt ,

and where after receiving (At , Xt ), the following updates need to be executed:

(Ut−1 At )(Ut−1 At )>


Ut = Ut−1 − ,
1 + A> t Ut−1 At
dt = dt−1 + log(1 + A>
t Ut−1 At ) ,
St = St−1 + At Xt .

Here, the update of dt follows from Eq. (19.9). Note that the O(d2 ) operations are the calculation
of θ̂t and the update of Ut .

Chapter 20 Confidence Bounds for Least Squares Estimators

20.1 From the definition of the design Vn = mI and the ith coordinate of θ̂n − θ∗ is the sum of m
independent standard Gaussian random variables. Hence

h i d
X 1
E kθ̂n − θ∗ k2Vn = m = d.
i=1
m

20.2 Let m = n/d and assume for simplicity that m is a whole number. The actions (At )nt=1
are chosen in blocks of size m with the ith block starting in round ti = (i − 1)m + 1. For all
i ∈ {1, . . . , d} we set Ati = ei . For t ∈ {ti , . . . , ti + m − 1} we set Ati = ei if ηti > 0 and At = 0
otherwise. Clearly Vn−1 ≥ I and hence k1kVn−1 ≤ d. The choice of (At )nt=1 ensures that E[θ̂n,i ] is
independent of i. Furthermore,
 Pm   
t=1 ηt 1 1
E[θ̂n,1 ] = E I {η > 0} + I {η1 < 0} η1 =√ −1 .
m 2π m

Therefore
 
 2
hθ̂n , 1i2  1 h i 1 h i2 d m−1
E ≥ E hθ̂ , 1i 2
≥ E h θ̂ , 1i = .

2 n n
k1kV −1 d d m
n

20.3

54
(a) If C ⊂ A is an ε-covering then it is also an ε0 -covering with any ε0 ≥ ε. Hence, ε → N (ε) is a
decreasing function of ε.

(b) The inequality M (2ε) ≤ N (ε) amounts to showing that any 2ε packing has a cardinality at most
the cardinality of any ε covering. Assume this does not hold, that is, there is a 2ε packing P ⊂ A
and an ε-covering C ⊂ A such that |P | ≥ |C|+1. By the pigeonhole principle, there is c ∈ C such
that there are distinct x, y ∈ P such that x, y ∈ B(c, ε). Then kx − yk ≤ kx − ck + kc − yk ≤ 2ε,
which contradicts that P is a 2ε-packing.
If M (ε) = ∞, the inequality N (ε) ≤ M (ε) is trivially true. Otherwise take a maximum
ε-packing P of A. This packing is automatically an ε-covering as well (otherwise P would not
be a maximum packing), hence, the result.
.
(c) We show the inequalities going left to right. For the first inequality, if N = N (ε) = ∞ then there
is nothing to be shown. Otherwise let C be a minimum cardinality ε-cover of A. Then from
P
the definition of cover and the additivity of volume, vol(A) ≤ x∈A0 vol(B(x, ε)) = N εd vol(B).
Reordering gives the inequality.
The next inequality, namely that N (ε) ≤ M (ε) has already been shown.
.
Consider now the inequality bounding M = M (ε). Let P be a maximum cardinality ε-packing
of A. Then, for any x, y ∈ P distinct, B(x, ε/2) ∩ B(y, ε/2) = ∅. Further, for x ∈ P ,
B(x, ε/2) ⊂ A + 2ε B and thus ∪x∈P B(x, ε/2) ⊂ A + 2ε B, hence, by the additivity of volume,
M vol( 2ε B) ≤ vol(A + 2ε B).
For the next inequality note that εB ⊂ A immediately implies that A + 2ε B ⊂ A + 12 A (check
the containment using the definitions), while the convexity of A implies that A + 12 A ⊂ 32 A.
For this second claim let u ∈ A + 12 A. Then u = x + 12 y for some x, y ∈ A. By the convexity
of A, 23 u = 23 x + 13 y ∈ A and hence u = 32 ( 23 u) ∈ 32 A. For the final inequality note that for
measurable X and c > 0 we have vol(cX) = cd vol(X). This is true because cX is the image of
X under the linear mapping represented by a diagonal matrix with c on the diagonal and this
matrix has determinant cd .

(d) Let A be bounded, and say, A ⊂ rB for some r > 0. Then vol(A + ε/2B) ≤ vol(rB + ε/2B) =
vol((r + ε/2)B) < +∞, hence, the previous part gives that N (ε) ≤ M (ε) < +∞. Now
assume that N (ε) < ∞ and let C be a minimum cover of A. Then A ⊂ ∪x∈C B(x, ε) ⊂
∪x∈C (kxk + ε)B ⊂ maxx∈C (kxk + ε)B hence, A is bounded.

20.4 The result follows from Part (c) of Exercise 20.3 by taking A = B = {x ∈ Rd : kxk2 ≤ 1},
which shows that the covering number N (B, ε) ≤ (3/ε)d , from which the result follows.

20.5 Proving that M̄t is Ft -measurable is actually not trivial. It follows because Mt (·) is measurable
and by the ‘sections’ lemma [Kallenberg, 2002, Lemma 1.26]. It remains to show that E[M̄t | Ft−1 ] ≤
M̄t−1 almost surely. Proceeding by contradiction, suppose that P(E[M̄t | Ft−1 ] − M̄t−1 > 0) > 0.
Then there exists an ε > 0 such that the set A = {ω : E[M̄t | Ft−1 ](ω) − M̄t−1 (ω) > ε} ∈ Ft−1

55
satisfies P (A) > 0. Then
Z Z
0< (E[M̄t | Ft−1 ] − M̄t−1 )dP = (M̄t − M̄t−1 ))dP
A ZA Z
= (Mt (x) − Mt−1 (x))dh(x)dP
d
ZA ZR
= (Mt (x) − Mt−1 (x))dPdh(x)
Rd A
≤ 0,

where the first equality follows from the definition of conditional expectation, the second by
substituting the definition of M̄t and the third from Fubini-Tonelli’s theorem. The last follows
from Lemma 20.2 and the definition of conditional expectation again. The proof is completed by
noting the deep result that 0 6< 0. In this proof it is necessary to be careful to avoid integrating
over conditional E[Mt (x) | Ft−1 ], which are only defined for each x almost surely and need not be
measurable as a function of x (though a measurable choice can be constructed using separability of
Rd and continuity of x 7→ Mt (x)).
20.8 Let f (λ) = √12π exp(−λ2 /2) be the density of the standard Gaussian and define
supermartingale Mt by
Z ! !
tσ 2 λ2 1 St2
Mt = f (λ) exp λSt − dλ = √ exp .
R 2 tσ 2 + 1 2σ 2 (t + 1)

Since E[Mτ ] = M0 = 1, the maximal inequality shows that P (supt Mt ≥ 1/δ) ≤ δ, which after
rearranging the previous display completes the result.
20.9

(a) This follows from straightforward calculus.

(b) The result is trivial for Λ < 0. For Λ ≥ 0 we have


Z !
λ2 n
Mn = f (λ) exp λSn − dλ
R 2
Z Λ(1+ε) !
λ2 n
≥ f (λ) exp λSn − dλ
Λ 2
!
Λ2 (1 + ε)2 n
≥ εΛf (Λ(1 + ε)) exp Λ(1 + ε)Sn −
2
!
(1 − ε2 )Sn2
= εΛf (Λ(1 + ε)) exp .
2n

(c) Let n ∈ N. Since Mt is a supermartingale with M0 = 1 it follows that

Pn = P (exists t ≤ n : Mt ≥ 1/δ) ≤ δ .

56
Hence P (exists t : Mt ≥ 1/δ) ≤ δ. Substituting the result from the previous part and rearranging
completes the proof.
I {λ ≤ e−e }
(d) A suitable choice of f is f (λ) =    2 .
λ log 1
λ log log 1
λ

(e) Let εn = min{1/2, 1/ log log(n)} and δ ∈ [0, 1] be the largest (random) value such that Sn never
exceeds
s     
2n 1 1
log + log .
1 − εn
2 δ εn Λn f (Λn (1 + εn ))

By Part (c) we have P (δ > 0) = 1. Furthermore, lim supn→∞ Sn /n = 0 almost surely by the
strong law of large numbers, so that Λn → 0 almost surely. On the intersection of these almost
sure events we have
Sn
lim sup p ≤ 1.
n→∞ 2n log log(n)

20.10 We first show a bound on the right tail of St . A symmetric argument suffices for the left tail.
P
Let Ys = Xs − µs |Xs | and Mt (λ) = exp( ts=1 (λYs − λ2 |Xs |/2)). Define filtration G1 ⊂ · · · ⊂ Gn by
Gt = σ(Ft−1 , |Xt |). Using the fact that Xs ∈ {−1, 0, 1} we have for any λ > 0 that

E[exp(λYs − λ2 |Xs |/2) | Gs ] ≤ 1 .

Therefore Mt (λ) is a supermartingale for any λ > 0. The next step is to use the method of mixtures
R
with a uniform distribution on [0, 2]. Let Mt = 02 Mt (λ)dλ. Then Markov’s inequality shows that
for any Gt -measurable stopping time τ with τ ≤ n almost surely, P (Mτ ≥ 1/δ) ≤ δ. Next we need a
bound on Mτ . The following holds whenever St ≥ 0.
Z
1 2
Mt = Mt (λ)dλ
2 0
r      !
1 π St 2Nt − St St2
= erf √ + erf √ exp
2 2Nt 2Nt 2Nt 2Nt
√ r !
erf( 2) π St2
≥ exp .
2 2Nt 2Nt

The bound on the upper tail completed via a stopping time, which shows that
 v   
u s
u
 u 2 2Nt  
P exists t ≤ n : St ≥ t2Nt log  √ and Nt > 0 ≤ δ .
δ erf( 2) π

The result follows by symmetry and union bound.


20.11

57
(a) Following the hint, we show that exp(Lt (θ∗ )) is a martingale. Indeed, letting Ft = σ(X1 , . . . , Xt ),
h i
E [exp(Lt (θ∗ ))|Ft−1 ] = E pθ̂t−1 (Xt )/pθ∗ (Xt ) exp(Lt−1 (θ∗ ))
Z p (x)
= exp(Lt−1 (θ∗ )) ∗ (x) X
pθ  θ̂t−1
 dµ(x)
∗ (x)
pθX

Z
X
X
= exp(Lt−1 (θ∗ )) pθ̂t−1 (x)dµ(x)

= exp(Lt−1 (θ∗ )) .

Then, applying the Cramér-Chernoff trick,


!
P (Lt (θ∗ ) ≥ log(1/δ) for some t ≥ 1) = P sup exp(Lt (θ∗ )) ≥ 1/δ
t∈N

≤ δE [exp(L0 (θ∗ ))] = δ ,

where the inequality is due to Theorem 3.9, the maximal inequality of nonnegative
supermartingales.

(b) This follows from the definition of Ct and Part (a).

Chapter 21 Optimal Design for Least Squares Estimators

21.1 Following the hint,

trace(adj(V (π))aa> ) a> adj(V (π))a


∇f (π)a = = = a> V (π)−1 a = kak2V (π)−1 ,
det(V (π)) det(V (π))

where in the third equality we used that adj(V (π)) is symmetric since V (π) is symmetric, hence,
following the hint, adj(V (π)/ det(V (π)) = V (π)−1 .

21.2 By the determinant product rule,

log det(H + tZ) = log det(H 1/2 (I + tH −1/2 ZH −1/2 )H 1/2 )


= log det(H) + log det(I + tH −1/2 ZH −1/2 )
X
= log det(H) + log(1 + tλi ) ,
i

where λi are the eigenvalues of H −1/2 ZH −1/2 . Since log(1 + tλi ) is concave, their sum is also
concave, proving that t 7→ log det(H + tZ) is concave.

21.3 Let A be a compact subset of Rd and (An )n be a sequence of finite subsets with An ⊂ An+1
and span(An ) = Rd and limn→∞ d(A, An ) = 0 where d is the Hausdorff metric. Then let πn be a
G-optimal design for An with support of size at most d(d + 1)/2 and Vn = V (πn ). Given any a ∈ A

58
we have
  √
kakVn−1 ≤ min ka − bkVn−1 + kbkVn−1 ≤ d + min ka − bkVn−1 .
b∈An b∈An

Let W ∈ Rd×d be matrix with columns w1 , . . . , wd in A1 that span Rd . The operator norm of Vn
−1/2

is bounded by

kVn−1/2 k = kW −1 W Vn−1/2 k
≤ kW −1 kkVn−1/2 W k
n o
= kW −1 k sup kW xkVn−1 : kxk2 = 1
( d )
X
≤ kW −1
k sup |xi |kwi kVn−1 : kxk2 = 1
i=1
√ d
X
≤ kW −1 k d sup xi
x:kxk2 =1 i=1

≤ dkW −1 k .

Taking the limit as n tends to infinity shows that



lim sup kakVn−1 ≤ d + lim sup min ka − bkVn−1
n→∞ n→∞ b∈A

≤ d + dkW −1
k lim sup min ka − bk2
n→∞ b∈A

≤ d.

Since k · kVn−1 : A → R is continuous and A is compact it follows that

lim sup sup kak2V −1 ≤ d .


n→∞ a∈A n

Notice that πn may be represented as a tuple of vector/probability pairs with at most d(d + 1)/2
entries and where the vectors lie in A. Since the set of all such tuples with the obvious topology
forms a compact set it follows that (πn ) has a cluster point π ∗ , which represents a distribution on
A with support at most d(d + 1)/2. The previous display shows that g(π ∗ ) ≤ d. The fact that
g(π ∗ ) ≥ d follows from the same argument as the proof of Theorem 21.1.

21.5 Let π be a Dirac at a and π(t) = π ∗ + t(π ∗ − π). Since π ∗ (a) > 0 it follows for sufficiently
small t > 0 that π(t) is a distribution over A. Because π ∗ is a minimiser of f ,

d
0≥ f (π(t))|t=0 = h∇f (π ∗ ), π ∗ − πi = d − kak2V (π)−1 .
dt

59
Rearranging shows that kak2V (π)−1 ≥ d. The other direction follows by Theorem 21.1.

Chapter 22 Stochastic Linear Bandits for Finitely Many Arms

Chapter 23 Stochastic Linear Bandits with Sparsity


Pd
23.2 The usual idea does the trick. Recall that Rn = i=1 Rni where
" n #
X
Rni = n|θi | − E Ati θi .
t=1

We proved that

C log(n)
Rni ≤ 3|θi | + .
|θi |

Clearly Rni ≤ 2n|θi |. Let ∆ > 0 be a constant to be tuned later. Then


 
X C log(n) X
Rni ≤ 3|θi | + + 2n∆
i:|θi |>∆
|θi | i:|θi |∈(0,∆)
Ckθk0 log(n)
≤ 3kθk1 + + 2kθk0 n∆ .

p
Choosing ∆ = log(n)/n completes the result.

Chapter 24 Minimax Lower Bounds for Stochastic Linear Bandits

24.1 Assume without loss of generality that i = 1 and let θ(−1) ∈ Θp−1 . The objective is to prove
that

1 X kn
Rn1 (θ) ≥ .
|Θ| (1) 8
θ ∈Θ
P
For j ∈ [k] let Tj (n) = nt=1 I {Bt1 = j} be the number of times base action j is played in the first
bandit. Define ψ0 ∈ Rd to be the vector with ψ0 = θ(−1) and ψ0 = 0. For j ∈ [k] let ψj ∈ Rd be
(−1) (1)

given by ψj = θ(−1) and ψj = ∆ej . Abbreviate Pj = Pψj and Ej [·] = EPj [·]. With this notation,
(−1) (1)

we have

1 X 1X k
Rn1 (θ) = ∆(n − Ej [Tj (n)]) . (24.1)
|Θ| (1) k j=1
θ ∈Θ

60
Lemma 15.1 gives that
" #
1 Xn
∆2
D(P0 , Pj ) = E0 hAt , ψ0 − ψj i2 = E0 [Tj (n)] .
2 t=1
2
p
Choosing ∆ = k/n/2 and applying Pinsker’s inequality yields
r
k
X k
X k
X 1
Ej [Tj (n)] ≤ E0 [Tj (n)] + n D(P0 , Pj )
j=1 j=1 j=1
2
s
k
X ∆2
=n+n E0 [Tj (n)]
j=1
4
v
u
u k∆2 X
k
u
≤ n + nt E0 [Tj (n)] (Cauchy-Schwarz)
4 j=1
s
k∆2 n
=n+n
4
≤ 3nk/4 . (since k ≥ 2)

Combining the above display with Eq. (24.1) completes the proof:

1 X 1X k
n∆ 1√
Rn1 (θ) = ∆(n − Ej [Tj (n)]) ≥ = kn .
|Θ| (1) k j=1 4 8
θ ∈Θ

Chapter 25 Asymptotic Lower Bounds for Stochastic Linear Bandits

25.3 For (a) let θ1 = ∆ and θi = 0 for i > 1 and let A = {e1 , . . . , ed−1 }. Then adding ed
increases the asymptotic regret. For (b) let θ1 = ∆ and θi = 0 for 1 < i < d and θd = 1 and
A = {e1 , . . . , ed−1 }. Then for small values of ∆ adding ed decreases the asymptotic regret.

Chapter 26 Foundations of Convex Analysis

26.2 Let P be the on the space on which X is defined. Following the hint, let x0 = E[X] ∈ Rd .
Then let a ∈ Rd and b ∈ R be such that ha, x0 i + b = f (x0 ) and ha, xi + b ≤ f (x) for all x ∈ Rd .
The hyperplane {x : ha, xi + b − f (x0 ) = 0} is guaranteed to exist by the supporting hyperplane
theorem. Then
Z Z
f (X)dP ≥ (ha, Xi + b)dP = ha, x0 i + b = f (x0 ) = f (E[X]) .

61
An alternative is of course to follow the ideas next to the picture in the main text. As you may
recall that proof is given for the case when X is discrete. To extend the proof to the general case,
one can use the ‘standard machinery’ of building up the integral from simple functions, but the
resulting proof, originally due to Needham [1993], is much longer than what was given above.

26.3

(a) Using the definition,

f ∗∗ (x) = sup hx, ui − f ∗ (u)


u∈Rd
= sup hx, ui − ( sup hy, ui − f (y))
u∈Rd y∈Rd

≤ sup hx, ui − (hx, ui − f (x))


u∈Rd
= f (x) .

(b) We only need to show that f ∗∗ (x) ≥ f (x).

f ∗∗ (x) = sup hx, ui − f ∗ (u)


u∈Rd
≥ hx, ∇f (x)i − f ∗ (∇f (x))
= hx, ∇f (x)i − ( sup hy, ∇f (x)i − f (y))
y∈Rd

≥ hx, ∇f (x)i − ( sup hy, ∇f (x)i − f (x) − hy − x, ∇f (x)i)


y∈Rd

= f (x) ,

where in the second inequality we used the definition of convexity to ensure that f (y) ≥
f (x) + hy − x, ∇f (x)i.

26.9

(a) Fix u ∈ Rd . By definition f ∗ (u) = supx hx, ui − f (x). To find this value we solve for x where
the derivative of hx, ui − f (x) in x is equal to zero. As calculated before, ∇f (x) = log(x).
Thus, we need to find the solution to u = log(x), giving x = exp(u). Plugging this value, we
get f ∗ (u) = hexp(u), ui − f (exp(u)). Now, f (exp(u)) = hexp(u), log(exp(u))i − hexp(u), 1i =
hexp(u), ui − hexp(u), 1i. Hence, f ∗ (u) = hexp(u), 1i and ∇f ∗ (u) = exp(u).

(b) From our calculation, dom(∇f ∗ ) = Rd .

(c) Df ∗ (u, v) = f ∗ (u) − f ∗ (v) − h∇f ∗ (v), u − vi = hexp(u) − exp(v), 1i − hexp(v), u − vi.

(d) To check Part (a) of Theorem 26.6 note that ∇f (x) = log(x) and ∇f ∗ (u) = exp(u), which
are indeed inverses of each other and their respective domains match that of int(dom(f )) and

62
int(dom(f ∗ )), respectively. To check Part (b) of Theorem 26.6, we calculate Df ∗ (∇f (y), ∇f (x)):

Df ∗ (∇f (y), ∇f (x))


= hexp(log(y)) − exp(log(x)), 1i − hexp(log(x)), log(y) − log(x)i
= hy − x, 1i − hx, log(y) − log(x)i ,

which is indeed equal to Df (x, y).

26.13

(a) Let g(z) = f (z) − hz − y, ∇f (y)i. Then by definition

z ∈ argminA g(z) ,

which exists by the assumption that A is compact. By convexity of A and the first-order
optimality condition it follows that

∇x−z f (z) − hx − z, ∇f (y)i = ∇x−z g(z) ≥ 0 .

Therefore

∇x−z f (z) ≥ hx − z, ∇f (y)i = hx − y, ∇f (y)i − hz − y, ∇f (y)i


= ∇x−y f (y) − ∇z−y f (y) .

Substituting the definition of the Bregman divergence shows that

Df (x, y) ≥ Df (x, z) + Df (z, y)

The proof fails when f is not differentiable at y because the map v 7→ ∇v f (y) need not be
linear.

(b) Consider the function function f (x, y) = −(xy)1/4 and let y = (0, 0) and x = (1, 0) and
A = {(t, 1 − t) : t ∈ [0, 1]}. Then Df (x, y) = Df (z, y) = 0, but Df (x, z) = ∞.

26.14 Parts (a) and (b) are immediate from convexity and the definitions. For Part (c), we have

Df (x, y) = f (x) − f (y) − h∇f (y), x − yi ,

which is linear as a function of x. Note that here we used differentiability of f at y. An example


showing that differentiability at y is necessary occurs when f : R2 → R is given by

f (x) = max{1, kxk1 } .

Then consider y = (2, 0) and x = (0, 2) and z = (0, −2). Then Df (z, y) = Df (x, y) = 0, but
Df ((x + z)/2, y) = Df (0, y) = 1 ≥ (Df (x, y) + Df (z, y))/2.

63
26.15 The first part follows immediately from Taylor’s theorem. The second part takes a little
work. To begin, abbreviate kx − ykz = kx − yk∇2 f (z) and for t ∈ (0, 1) let

1
δ(x, y, t) = Df (x, y) − kx − yk2tx+(1−t)y ,
2
which is continuous on int(dom(f )) × int(dom(f )) × [0, 1]. Let (a, b) ⊂ (0, 1) and consider
[ \ [
A= {(x, y) : δ(x, y, t) ≤ ε} .
δ∈(0,b−a)∩Q ε∈(0,1)∩Q t∈U ∩Q

As we mentioned already, Taylor’s theorem ensures there exists a t ∈ [0, 1] such that δ(x, y, t) = 0
for all (x, y) ∈ int(dom(f )) × int(dom(f )). By the continuity of δ it follows that (x, y) ∈ A if
and only if there exists a t ∈ (a, b) such that δ(x, y, t) = 0. Since (x, y) 7→ δ(x, y, t) is measurable
for each t it follows that A is measurable. Let T (x, y) = {t : δ(x, y, t) = 0}. Then by the
Kuratowski–Ryll-Nardzewski measurable selection theorem (theorem 6.9.4, Bogachev 2007) there
exists a measurable function τ : int(dom(f )) × int(dom(f )) → (0, 1) such that τ (x, y) ∈ T (x, y) for
all (x, y) ∈ dom(f ) × dom(f ). Therefore g(x, y) = τ (x, y)x + (1 − τ (x, y))y is measurable and the
result is complete.

Chapter 27 Exp3 for Adversarial Linear Bandits

27.1 Let
P
exp(−η t−1
s=1 Ŷs (a))
P̃t (a) = P Pt−1 .
a0 ∈A exp(−η s=1 Ŷs (a ))
0

Then,

Pt = (1 − γ)P̃t + γπ . (27.1)
P P Pn
Let L̂n (a) = nt=1 Ŷt (a), L̂n = nt=1 hPt , Ŷt i and L̃n = t=1 hP̃t , Ŷt i, where we abuse h·, ·i by defining
P
hp, yi = a∈A p(a)y(a) for p, y : A → R. Then,
" n #
X
Rn = max Rn (a) where Rn (a) = E hAt , yt i − ha, yt i .
a∈A
t=1

As in the proof of Theorem 11.1,


h i
Rn (a) = E L̂n − L̂n (a) .

64
Now, by (27.1),
n
X
L̂n = (1 − γ)L̃n + γ hπ, Ŷt i .
t=1

Repeating the steps of the proof of Theorem 11.1 shows that, thanks to η Ŷt (a) ≥ −1,

log k Xn
L̃n ≤ L̂n (a) + +η hP̃t , Ŷt2 i (27.2)
η t=1
log k η X n
≤ L̂n (a) + + hPt , Ŷt2 i ,
η 1 − γ t=1

where Ŷt2 denotes the function a 7→ Ŷt2 (a) and the second inequality used that P̃t = Pt −γπ
1−γ ≤ 1−γ .
Pt

Now,
n
X log k Xn
L̂n − L̂n (a) ≤ γ hπ, Ŷt i + (1 − γ)L̂n (a) + +η hPt , Ŷt2 i − L̂n (a)
t=1
η t=1
log k Xn Xn
= +η hPt , Ŷt2 i + γ hπ − ea , Ŷt i ,
η t=1 t=1

where ea (a0 ) = I {a = a0 }. Now, thanks to −1 ≤ ha, yt i ≤ 1,


h i
E hπ − ea , Ŷt i = hπ − ea , yt i ≤ 2 .

Putting things together,

log k Xn h i
Rn ≤ max Rn (a) ≤ + 2γn + η E hPt , Ŷt2 i ,
a η t=1

thus, finishing the proof.

27.4 Note that it suffices to show that kxk2B −1 − kxk2A−1 = kxk2B −1 −A−1 ≥ 0 for any x ∈ Rd . Let
x ∈ Rd . Then, by the Cauchy-Schwarz inequality,

kxk2A−1 = hx, A−1 xi ≤ kxkB −1 kA−1 xkB ≤ kxkB −1 kA−1 xkA = kxkB −1 kxkA−1 .

Hence kxkA−1 ≤ kxkB −1 for all x, which completes the claim.

27.6

n o
(a) A straightforward calculation shows that L = y ∈ Rd : kykV ≤ 1 . Let T x = V 1/2 x and note
n o
that T −1 L = T A = B = u ∈ Rd : kuk2 ≤ 1 . Then let U be an ε-cover of B with respect
to k · k2 with |U| ≤ (3/ε)d and C = T −1 U. Given x ∈ A let u = T x and u0 ∈ U be such that

65
ku − u0 k2 ≤ ε and x0 = T −1 u0 . Then

kx − x0 k = suphx − x0 , yi = suphT −1 u − T −1 u0 , yi = suphu − u0 , T −1 yi ≤ ε .


y∈L y∈L y∈L

n o
(b) Notice that L is convex, symmetric, bounded and span(L) = Rd . Let E = y ∈ Rd : kykV ≤ 1
be the ellipsoid of maximum volume contained by cl(L). Then let
n o n o
E∗ = y ∈ Rd : kykE ≤ 1 = y ∈ Rd : kykV −1 ≤ 1 ,

which satisfies A ⊆ E∗ . Since span(L) = Rd the matrix V −1 is positive definite. By the previous
result there exists a C¯ ⊂ E∗ of size at most (3d/ε)d such that

sup inf kx − x0 kE ≤ ε/d .


x∈E∗ x0 ∈C¯

Using the fact that L ⊆ dE we have

sup inf kx − x0 kL ≤ ε .
x∈E∗ x0 ∈C¯

We are nearly done. The problem is that C¯ may contain elements not in A. To resolve this issue
¯ where Π(x) ∈ argminx0 ∈A kx − x0 kE . Then note that
let C = {Π(x) : x ∈ C}

kx − Π(x0 )kL ≤ dkx − Π(x0 )kE ≤ dkx − x0 kE .

(c) Let C¯ be an ε/2-cover of cl(co(A)),


n which by
o the previous part has size at most (6d/ε) . The
d

result follows by choosing C = Π(x) : x ∈ C¯ where Π is the projection onto A with respect to
k · kE where E is the maximum volume ellipsoid contained by cl(co(A)).

27.8 Consider the case when d = 1, k = n and A = {1, −1, ε, ε/2, ε/4, . . .} for suitably small
ε. Then, for t = 1, Pt is uniform on A and hence Q−1 t ≈ 2/k and with probability 1 − 2/k,
|Ŷt | ≈ 1/k = 1/n. If η is small, then the algorithm will barely learn. If η is large, then it learns
quickly that either 1 or −1 is optimal, but is too unstable for small regret.

27.9 We can copy the proof presented for the finite-action case in the solution to Exercise 27.1 in
an almost verbatim manner: The minor change is that is that we need to replace the sums over the
action space with integrals. In particular, here we have
P
exp(−η t−1
s=1 Ŷs (a))
P̃t (a) = R Pt−1
A exp(−η s=1 Ŷs (a ))da
0 0

R
and hp, yi = A p(a)y(a)da for p, y : A → R. Now up to (27.2) everything is the same. Recall that
the inequality in this display was obtained by using the steps of the proof of Theorem 11.1. Here,
we need add a little detail because we need to change this inequality slightly.

66
We argue as follows: Define (Wt )nt=0 by
Z t
!
X
Wt = exp −η Ŷs (a) da ,
A s=1

which means that W0 = vol(A) and

n−1
Y Wt+1
Wn = vol(A) .
t=0
Wt

Following the proof of Theorem 11.1, thanks to −η Ŷt (a) ≤ 1,


Z Z  
Wt+1
= exp(−η Ŷt (a))P̃t (a)da ≤ 1 − η Ŷt (a) + η 2 Ŷt2 (a) P̃t (a)da
Wt A A
 
≤ exp −ηhP̃t , Ŷt i + η 2
hP̃t , Ŷt2 i .

Therefore,
n
X n
X
log Wn ≤ log(vol(A)) − η hP̃t , Ŷt i + η 2 hP̃t , Ŷt2 i .
t=1 t=1

Pn
Recalling that L̃n = t=1 hP̃t , Ŷt i, a rearrangement of the previous display gives
 
1 vol(A) n
X
L̃n ≤ log +η hP̃t , Ŷt2 i . (27.3)
η Wn t=1
Pn
Let a∗ = argmina∈A t=1 hyt , ai. Note that
 
n
X 1 1
L̂n (a∗ ) = Ŷt (a∗ ) = − log   P  .
t=1
η exp η t=1 Ŷt (a )
n ∗

By adding and subtracting L̂n (a∗ ) to the right-side of Eq. (27.3) and using the last identity and the
definition of Wn , we get

1 Xn
L̃n ≤ L̂n (a∗ ) + log(Kn ) + η hP̃t , Ŷt2 i ,
η t=1

where
vol(A)
Kn = R  Pn  .
exp −η t=1 (Ŷt (a) − Ŷt (a )) da

which is the inequality that replaces (27.2). In particular, the only difference between (27.2) and
the above display is that in the above display log(k) got replaced by log(Kn ). From here, we can

67
follow the steps of the proof of Exercise 27.1 up to the end, to get
" #
E [log(Kn )] Xn
Rn ≤ + 2γn + ηE hPt , Ŷt2 i ,
η t=1

The result is completed by noting that γ = ηd and the same argument as in the proof of Theorem 27.1
to bound
" n #
X
ηE hPt , Ŷt2 i ≤ ηdn .
t=1

27.11 Throughout k · k = k · k2 is the standard Euclidean norm. By translating K we may assume


that x∗ = 0. Note that supx,y∈K hx − y, ui = supx∈K hx, ui (the last equality uses that x∗ = 0).
Clearly, the claim is equivalent to
Z !
dx
exp (−hx, ui) ≥ exp −d(1 + log+ suphx, ui/d) .
K vol(K) x∈K
| {z }
=:g(supx∈K hx,ui)

We claim that it suffices to show the result for the case when kuk = 1. Indeed, if the claim was true
for kuk = 1 then for any u 6= 0 it would follow that
R R R
K exp(−hx, ui)dx K exp(−hxkuk, u/kuki)dx kukK exp(−hy, u/kuki)dy
= =
vol(K) vol(K) kukd vol(K)
R
kukK exp(−hy, u/kuki)dy
= ≥ g( sup hx, u/kuki) = g(suphx, ui) .
vol(kukK) x∈kukK x∈K

Hence, it remains to show the claim for vectors u such that kuk = 1. With an entirely similar
reasoning we can show that it suffices to show the claim for the case when vol(K) = 1. Hence, from
now on we will assume these.
Introduce α = supx∈K hx, ui. For t ∈ [0, α], define f (t) to be the volume of the slice
Kt = {x ∈ K : hx, ui = t} with respect to the (d − 1)-form. Since vol(K) = 1, f (t) > 0 for
R R R
0 < t < α and 1 = 0α f (t)dt. Clearly, we also have K exp(−hx, ui)dx = 0α f (t) exp(−t)dt. Now,
since K is convex, for any t ∈ (0, α), f (t) ≥ volt−1 (q/tKq ) for any t ≤ q ≤ α.
Since e−t decreasing, a rearrangement argument shows that the function f that minimises

0 f (t) exp(−t)dt

and which satisfies the above properties is f (t) = (tf (α))d−1 for a suitable value
of f (α) so that 0 f (t)dt = 1 (we want the function to increase as fast as possible). Note that f (t)
gives the volume of the tK̃α for a suitable set K̃α , as shown in the figure below:

68
u
0

The whole triangle is K̃ = {tK̃α : t ∈ [0, α]} with x∗ = 0 at the bottom corner. The thin lines
represent tK̃α for different values of t, which are (d − 1)-dimensional subsets of K̃ that lie in affine
spaces with normal vector u.
Rα R α d−1 −1
From the constraint 0 f (t)dt = 1 we get f (α)d−1 = ( 0 t dt) . We calculate
Z Z α Rα
exp(−t)td−1 dt
exp(−hx, ui)dx ≥ exp(−t)(tf (α)) d−1
dt = 0 Rα
d−1 dt
K 0 0 t
n o
≥ min 1, (d/α)d /ed = g(α) ,
R α d−1
where the final inequality follows because 0 t dt = αd /d and
Z α Z α∧d
1 min(α, d)d
td−1
exp(−t)dt ≥ d td−1 dt = .
0 e 0 ed d

Chapter 28 Follow-the-Regularised-Leader and Mirror Descent

28.1 The mapping a 7→ DF (a, b) is the sum of Legendre function F and a linear function, which
is clearly Legendre. Hence, Φ is Legendre. Suppose now that c ∈ ∂int(D) and let d ∈ A ∩ int(D) be
arbitrary. Then, the map α 7→ Φ(αc + (1 − α)d) must be decreasing. And yet,

d
Φ(αc + (1 − α)d) = h∇Φ(αc + (1 − α)d), c − di

= hy + ∇F (αc + (1 − α)d), c − di ,

which converges to infinity as α tends to one by Proposition 26.7 and is a contradiction.

28.5 The first step is the same as the proof of Theorem 28.4:
n
X n
X n
X
Rn (a) = hat − a, yt i = hat − at+1 , yt i + hat+1 − a, yt i .
t=1 t=1 t=1

69
Pt
Next let Φt (a) = F (a)/η + s=1 ha, ys i so that
n
X n
X F (a)
hat+1 − a, yt i = hat+1 , yt i − Φn (a) +
t=1 t=1
η
n
X F (a)
= (Φt (at+1 ) − Φt−1 (at+1 )) − Φn (a) +
t=1
η
n−1
X F (a)
= −Φ0 (a1 ) + (Φt (at+1 ) − Φt (at+2 )) + Φn (an+1 ) − Φn (a) +
t=0
η
F (a) − F (a1 ) n−1
X
≤ + (Φt (at+1 ) − Φt (at+2 )) . (28.1)
η t=0

Now DΦt (a, b) = η1 DF (a, b). Therefore,

1
Φt (at+1 ) − Φt (at+2 ) = −∇at+2 −at+1 Φt (at+1 ) − DF (at+2 , at+1 )
η
1
≤ − DF (at+2 , at+1 ) ,
η

where the inequality follows by the first-order optimality condition applied to at+1 =
argmina∈A∩dom(F ) Φt (a) and at+2 . Substituting this into Eq. (28.1) completes the proof.
28.10

(a) For the first relation, direct calculation shows that P̃t+1,i = Pti exp(−η Ŷti ) and

k
! k k
X Pti X X
DF (Pt , P̃t+1 ) = Pti log − Pti + P̃t+1,i
i=1 P̃t+1,i i=1 i=1
Xk    
= Pti exp −η Ŷti − 1 + η Ŷti .
i=1

The second relation follows from the inequality exp(x) ≤ 1 + x + x2 /2 for x ≤ 0.

(b) Using part (a), we have


" # " # " #
1 X n
η Xn X k
η Xn X k
I {At = i}
E DF (Pt , P̃t+1 ) ≤ E Pti Ŷti2 ≤ E
η t=1
2 t=1 i=1
2 t=1 i=1
Pti
ηnk
= .
2

(c) Simple calculus shows that for p ∈ Pk−1 , F (p) ≥ − log(k) − 1 and F (p) ≤ −1 is obvious.
Therefore diamF (Pk−1 ) = maxp,q∈Pk−1 F (p) − F (q) ≤ log(k).

(d) By the previous exercise, Exp3 chooses At sampled from Pt . Then applying the second bound
p
of Theorem 28.4 and parts (b) and (c) and choosing η = log(k)/(2nk) yields the result.

70
28.11 Abbreviate D(x, y) = DF (x, y). By the definition of ãt+1 and the first-order optimality
conditions we have ηt yt = ∇F (at ) − ∇F (ãt+1 ). Therefore

1
hat − a, yt i = hat − a, ∇F (at ) − ∇F (ãt+1 )i
ηt
1
= (−ha − at , ∇F (at )i − hat − ãt+1 , ∇F (ãt+1 )i + ha − ãt+1 , ∇F (ãt+1 )i)
ηt
1
= (D(a, at ) − D(a, ãt+1 ) + D(at , ãt+1 )) .
ηt

Summing completes the proof. For the second part use the generalised Pythagorean theorem
(Exercise 26.13) and positivity of the Bregman divergence to argue that D(a, ãt+1 ) ≥ D(a, at+1 ).

28.12

(a) We use the same argument as in the solution to Exercise 28.5. First,
n
X n
X n
X
Rn (a) = hat − a, yt i = hat − at+1 , yt i + hat+1 − a, yt i .
t=1 t=1 t=1

The next step also mirrors that in Exercise 28.5, but now we have to keep track of the changing
potentials:
n
X n
X
hat+1 − a, yt i = hat+1 , yt i − Φn+1 (a) + Fn+1 (a)
t=1 t=1
n
X n
X
= (Φt+1 (at+1 ) − Φt (at+1 )) + (Ft (at+1 ) − Ft+1 (at+1 )) − Φn+1 (a) + Fn+1 (a)
t=1 t=1
n−1
X
= −Φ1 (a1 ) + (Φt+1 (at+1 ) − Φt+1 (at+2 )) + Φn+1 (an+1 ) − Φn+1 (a)
t=0
n
X
+ Fn+1 (a) + (Ft (at+1 ) − Ft+1 (at+1 ))
t=1
n−1
X n
X
≤ Fn+1 (a) − F1 (a1 ) + (Φt+1 (at+1 ) − Φt+1 (at+2 )) + (Ft (at+1 ) − Ft+1 (at+1 )) .
t=0 t=1

Now DΦt (a, b) = DFt (a, b). Therefore

Φt+1 (at+1 ) − Φt+1 (at+2 ) = −∇at+2 −at+1 Φt+1 (at+1 ) − DFt+1 (at+2 , at+1 )
≤ −DFt+1 (at+2 , at+1 ) ,

which combined with the previous big display completes the proof.

(b) Note that adding a constant to the potential does not change the policy or the Bregman
divergence. Applying the previous part with Ft (a) = (F (a) − minb∈A F (b))/ηt immediately
gives the result.

71
28.13

(a) Apply your solution to Exercise 26.11.

(b) Since Ŷt is unbiased we have


" n # " n #
X X
Rn = max E (ytAt − yti ) = max E hPt − P, Ŷt i .
i∈[k] i∈[k]
t=1 t=1

Then apply the result from Exercise 28.12 combined with the fact that diamF (A) = log(k).

(c) Consider two cases. First, if Pt+1,At ≥ PtAt , then

hPt − Pt+1 , Ŷt i = (PtAt − Pt+1,At )ŶtAt ≤ 0 .

On the other hand, if Pt+1,At ≤ PtAt , then by Theorem 26.13 with H = ∇2 f (q) = diag(1/q) for
some q ∈ [Pt , Pt+1 ] we have

DF (Pt+1 , Pt ) ηt
hPt − Pt+1 , Ŷt i − ≤ kŶt k2H −1 ,
ηt 2

Since Pt+1,At ≤ PtAt we have

2
ηt qAt ŶtA 2
ηt PtAt ŶtA 2
ηt ytA
ηt
kŶt k2H −1 = t
≤ t
≤ t
.
2 2 2 2PtAt

Therefore

DF (Pt+1 , Pt ) 2
ηt ytA ηt
hPt − Pt+1 , Ŷt i − ≤ t
≤ .
ηt 2PtAt 2PtAt

(d) Continuing from the previous part and using the fact that E[1/PtAt ] = k shows that
" #
log(k) 1 X n
ηt log(k) k X n
Rn ≤ E + = + ηt .
ηn 2 t=1 PtAt ηn 2 t=1

q p √
Pn
(e) Choose ηt = log(k)
kt and use the fact that t=1 1/t ≤ 2 n.

28.14

(a) The result is obvious for any algorithm when n < k. Assume for the remainder that n ≥ k. The
learning rate is chosen to be
v
u
u k log(n/k)
ηt = t Pt−1 2 ,
1+ s=1 ytAt

72
R p p
which is obviously decreasing. By noting that f 0 (x)/ f (x) dx = 2 f (x) and making a simple
approximation,
v
n u n
X u X
ηt ytAt ≤ 2tk
2 2 log(n/k) .
ytA t
(28.2)
t=1 t=1

Pn
Define Rn (p) = t=1 hPt − p, Ŷt i. Then

Rn = sup E[Rn (p)] ≤ k + sup E[Rn (p)] . (28.3)


p∈Pk−1 p∈[1/n,1]k ∩Pk−1

For the remainder of the proof let p ∈ [1/n, 1] ∩ Pk−1 be arbitrary. Notice that F (p) −
minq∈Pk−1 F (q) ≤ k log(n/k). By the result in Exercise 28.12,

k log(n/k) X n
DF (Pt+1 , Pt )
Rn (p) ≤ + hPt − Pt+1 , Ŷt i − , (28.4)
ηn t=1
ηt

If Pt+1,At ≥ PtAt , then hPt −Pt+1 , Ŷt i ≤ 0. Now suppose that Pt+1,At ≤ PtAt . By Theorem 26.12,
there exists a ξ ∈ [Pt , Pt+1 ] such that
ηt
hPt − Pt+1 , Ŷt i − DFt−1 (Pt+1 , Pt ) ≤ kŶt k2∇2 F (ξ)−1
2
2
ηt ytA
ηt 2 2 ηt 2 2
= ξA Ŷ ≤ P Ŷ = t
.
2 t tAt 2 tAt tAt 2
By the definition of ηn and Eq. (28.2),

k log(n/k) 1 X n
Rn (p) ≤ + 2
ηt ytA
ηn 2 t=1 t

v !
u n−1
u X
≤ 2tk 1+ 2
ytA t
log(n/k) .
t=1

Therefore by Eq. (28.3),


v
u ! 
u n−1
X
Rn ≤ k + 2E tk
 1+ 2
ytA t
log(n/k)
t=1
v "n−1 #!
u
u X
≤ k + 2tk 1+E 2
ytA t
log(n/k) ,
t=1

where the second line follows from Jensen’s inequality.

73
(b) Combining the previous result with the fact that yt ∈ [0, 1]k shows that
v " n #!
u
u X
Rn ≤ k + 2tk 1+E ytAt log(n/k)
t=1
v !
u n
u X
= k + 2tk 1 + Rn + min yta log(n/k) .
a∈[k]
t=1

Solving the quadratic in Rn shows that for a suitably large universal constant C,
v !
u n
u X
Rn ≤ k(1 + log(n/k)) + C tk 1 + min yta log(n/k) .
a∈[k]
t=1

28.15 The first parts are mechanical and are skipped.

(e) By Part (c), Theorem 28.5 and Theorem 26.13,


√ " n #
2 k η X
Rn ≤ + E kŶt k2∇2 F (Zt )−1 , (28.5)
η 2 t=1

where Zt ∈ Pk−1 = αPt + (1 − α)Pt+1 for some α ∈ [0, 1]. Then


" n k # " n k #
XX X X Ati y 2 3/2
E kŶt k2∇2 F (Zt )−1 = 2E ti
Zti
t=1 i=1 t=1 i=1
Pti2
" n k #
XX Ati
≤ 2E 1/2
t=1 i=1 Pti
" n k #
XX
= 2E
1/2
Pti
t=1 i=1

≤ 2n k ,

where the first inequality follows from Part (b) and the second from the Cauchy-Schwarz
inequality. The result follows by substituting the above display into Eq. (28.5) and choosing
p
η = 2/n.

28.16

(a) Following the suggestion in the hint let F be the negentropy potential and
t−1
!
X
xt = argminx∈X F (x) + η f (x, ys )
s=1
t−1
!
X
yt = argminy∈Y F (y) − η f (xs , y) .
s=1

74
q
Then let εd (n) = 2 log(d)
n . By Proposition 28.7,

min max f (x, y) ≤ max f (x̄n , y)


x∈X y∈Y y∈Y

1X n
= max f (xt , y)
y∈Y n
t=1
1X n
≤ f (xt , yt ) + εk (n)
n t=1
1 Xn
≤ min f (x, yt ) + εj (n) + εk (n)
n x∈X t=1
= min f (x, ȳn ) + εj (n) + εk (n)
x∈X
≤ max min f (x, y) + εj (n) + εk (n) .
y∈Y x∈X

Taking the limit as n tends to infinity shows that

min max f (x, y) ≤ max min f (x, y) .


x∈X y∈Y y∈Y x∈X

Since the other direction holds trivially, equality holds.

(b) Following a similar plan. Let F (x) = 12 kxk22 and gs (x) = f (x, ys ) and hs (y) = f (xs , y). Then
define
t−1
!
X
xt = argminx∈X F (x) + η hx, ∇gs (xs )i
s=1
t−1
!
X
yt = argminy∈Y F (y) − η hy, ∇hs (ys )i .
s=1
p
Let G = supx∈X,y∈Y k∇f (x, y)k2 and B = supz∈X∪Y kzk2 . Then let ε(n) = GB 1/n. A
straightforward generalisation of the above argument and the analysis in Proposition 28.6 shows

75
that

1X n
1X n
f (xt , yt ) = gt (xt )
n t=1 n t=1
!
1 Xn Xn
= min gt (x) + (gt (xt ) − gt (x))
x∈X n
t=1 t=1
!
1 Xn
1X n
≤ min gt (x) + hxt − x, ∇gt (xt )i
x∈X n n t=1
t=1
1X n
≤ min gt (x) + ε(n)
x∈X n
t=1
1X n
= min f (x, yt ) + ε(n)
x∈X n t=1
≤ min f (x, ȳn ) + ε(n) .
x∈X

In the same manner,

1X n
max f (x̄n , y) ≤ f (xt , yt ) + ε(n) .
y∈Y n t=1

Hence

min max f (x, y) ≤ max f (x̄n , y)


x∈X y∈Y y∈Y

≤ min f (x, ȳn ) + 2ε(n)


x∈X
≤ max min f (x, y) + 2ε(n) .
y∈Y x∈X

And the result is again completed by taking the limit as n tends to infinity.

In both cases the pair of average iterates (x̄n , ȳn ) has a cluster point that is a saddle point of
f (·, ·). In general the iterates (xn , yn ) may not have a cluster point that is a saddle point.

28.17 Let X = Y = R and f (x, y) = x + y. Clearly X and Y are convex topological


vector spaces and f is linear and linear in both arguments. Then inf x∈X supy∈Y f (x, y) = ∞
and supy∈Y inf x∈X f (x, y) = −∞. For a bounded example, consider X = Y = [1, ∞) and

76
f (x, y) = y/(x + y).

Chapter 29 The Relation between Adversarial and Stochastic Linear Bandits

29.2 First we check that θ̂t = dEt At Yt /(1 − kĀt k2 ) is appropriately bounded. Indeed,

ηdEt kAt k2 |Yt | ηd 1


ηkθ̂t k2 = ≤ ≤ ,
1 − kĀt k2 1 − kĀt k2 2

where the last step holds by choosing 1 − r = 2ηd. All of the steps in the proof of Theorem 28.11
are the same until the expectation of the dual norm of θ̂t . Then
h i h i
E kθ̂t k2∇F (Zt )−1 ≤ E (1 − kZt k2 )kθ̂t k2
" #
(1 − kZt k2 )Et Yt2
=d E 2
(1 − kĀt k2 )2
≤ 2d2 .

This last inequality is where things have changed, with the d becoming a d2 . From this we conclude
that
   
1 1 1 1
Rn ≤ log + (1 − r)n + ηnd2 ≤ log + 2ηnd + ηnd2
η 2ηd η 2ηd

and the result follows by tuning η.

29.4

(a) Let a∗ = argmina∈A `(a) be the optimal action. Then


" n #
X
Rn = E `(At ) − `(a )

t=1
" n #
X
≤E hAt − a∗ , θi + 2nε
t=1
" n #
X
=E hĀt − a , θi + 2nε .

(29.1)
t=1

77
The estimator is θ̂t = dEt At Yt /(1 − kĀt k2 ), which is no longer unbiased. Then
" #
dEt At Yt
E[θ̂t | Ft−1 ] = E Ft−1
1 − kĀt k2
" #
Et At (hAt , θi + ηt + ε(At ))
= dE Ft−1
1 − kĀt k2
d
X
=θ+ ε(ei )ei ,
i=1

which, given that Āt is Ft−1 , implies that

E[hĀt − a∗ , θi] ≤ E[hĀt − a∗ , θ̂t i] + εE[kĀt − a∗ k1 k1k∞ ]



≤ E[hĀt − a∗ , θ̂t i] + 2ε d .

Combining with Eq. (29.1) shows that


" n #
X √
Rn ≤ E hĀt − a , θ̂t i + 2εn + 2εn d .

t=1

Then we need to check that ηkθ̂t k2 = ηkdEt At Yt /(1 − kĀt k2 )k2 ≤ ηd/(1 − r) ≤ 1/2. Now
proceed as in Exercise 29.2.

(b) When ε(a) = 0 for all a, the lower bound is Ω(d n). Now add a spike ε(a) = −ε in the vicinity
of the optimal arm. Since A is continuous, the learner will almost surely never identify the
√ √
‘needle’ and hence its regret is Ω(d n + εn). The d factor cannot be improved greatly, but
the argument is more complicated [Lattimore and Szepesvári, 2019].

Chapter 30 Combinatorial Bandits

30.4

(b) Using the independence of (Xj )dj=1 shows that almost surely,
Z Mj−1 Z ∞
E[Mj | Mj−1 ] = Mj−1 exp(−x) dx + x exp(−x) dx .
0 Mj−1

= Mj−1 + exp(−Mj−1 ) .

Taking the expectation of both sides yields the result.

(c) The base case when j = 1 is immediate. For j ≥ 2,


a
E[exp(−aMj )] = E[exp(−aMj−1 )] − E[exp(−(a + 1)Mj−1 ] .
a+1

78
Therefore by induction it follows that

a!
E[exp(−aMj )] = Qa .
b=1 (j + b)

(d) Combining (b) and (c) shows that for j ≥ 2,

1
E[Mj ] = E[Mj−1 ] + E[exp(−Mj−1 )] = E[Mj−1 ] + .
j

The result follow by induction.

30.5 Since A is compact, dom(φ) = Rd . Let D be the set of points x at which φ is differentiable.
Then, as noted in the hint, λ(Rd \ D) = 0 where λ is the Lebesgue measure. Since Q  λ, then
Q(Rd \ D) = 0 as well. Define a(x) = argmaxa∈A ha, xi. Let v ∈ Rd be non-zero. Then, by the
second part of the hint, the directional derivative of φ is

∇v φ(x) = max ha, vi .


a∈A(x)

By the last part of the hint, for x ∈ D this implies that A(x) is a singleton and thus ∇φ(x) = a(x).
Let q = dQ
dλ be the density of Q with respect to the Lebesgue measure. Then, for any v ∈ R ,
d

Z Z
∇v φ(x + z)q(z) dz = ∇v φ(x + z)q(z) dz
Rd ZR
d

= ∇v φ(x + z)q(z) dz
D+{x}
Z
= ha(x + z), vi q(z) dz
D+{x}
*Z +
= a(x + z)q(z) dz, v ,
D+{x}

where the exchange of limit (hidden in the derivative) and integral is justified by the dominated
convergence theorem. By the last part of the hint, since v ∈ Rd was arbitrary, it follows that
R R
∇ Rd φ(x + z)q(z) dz exists and is equal to D+{x} a(x + z)q(z) = E [a(x + Z)].

30.6

(a) To show that F is well defined we need to show that F ∗ is the Fenchel dual of a unique proper
convex closed function. Let g = (F ∗ )∗ . It is not hard to see that F ∗ is a proper convex function,
dom(F ∗ ) = Rd , and hence the epigraph of F ∗ is closed. Then, by the hint, g ∗ = (F ∗ )∗∗ = F ∗ ,
hence F ∗ is the Fenchel dual of g. By the hint, the Fenchel dual of g is a proper convex closed
function, so we can take F = g ∗ . It remains to show that there is only a single proper convex
closed function whose Fenchel dual is F ∗ . To show this let g, h be proper convex closed functions
such that g ∗ = h∗ = F ∗ . Then g = g ∗∗ = (F ∗ )∗ = h∗∗ = h, hence, F is uniquely defined.

79
(b) By Part (c) of Theorem 26.6, it suffices to show that F ∗ is Legendre. As noted earlier, the
domain of F ∗ is all of Rd . From Exercise 30.5, it follows that F ∗ is everywhere differentiable.
Part (c) of the definition of Legendre functions is automatically satisfied since ∂Rd = ∅, hence
it remains to prove that F ∗ is strictly convex.

To prove this, we need to prove that for all x 6= y,

F ∗ (y) > F ∗ (x) + hy − x, ∇F ∗ (x)i .

Let a(x) = argmaxa∈A ha, xi, with ties broken arbitrarily and let q = dQ
dλ be the density of Q
with respect to the Lebesgue measure. Recalling the definitions and the result of Exercise 30.5,

F ∗ (y) − F ∗ (x) − hy − x, ∇F ∗ (x)i


Z Z  Z 
= φ(y + z)q(z)dz − φ(x + z)q(z)dz − y − x, a(x + z)q(z)dz
d Rd Rd
ZR
= hy + z, a(y + z) − a(x + z)i q(z) dz
d
ZR
= hu, a(u) − a(u + δ)i q(u − y) du ,
Rd

where δ = x − y. Clearly the term f (u) = ha(u) − a(u + δ), ui is nonnegative for any
u ∈ Rd . Since by assumption q > 0, it suffices to show that f is strictly positive over a
neighborhood of zero that has positive volume. The assumption that span(A) = Rd means that
ha(−δ/2) − a(δ/2), −δ/2i = ε > 0. To see this, notice that

co(A) ⊂ {x : hx, δ/2i ≤ ha(δ/2), δ/2i} and


co(A) ⊂ {x : hx, −δ/2i ≤ ha(−δ/2), −δ/2i} = {x : hx, δ/2i ≥ ha(−δ/2), δ/2i} .

Were it the case that ha(−δ/2), δ/2i = ha(δ/2), δ/2i, then co(A) would be a subset of a (d − 1)-
dimensional hyperplane, contradicting the assumption that span(A) = Rd . Let u = −δ/2 and
k · k = k · k2 and diam(A) = diamk·k (A). Then

ha(−δ/2 + v), −δ/2i = ha(v − δ/2), v − δ/2i − ha(v − δ/2), vi


≥ ha(−δ/2), v − δ/2i − kvkdiam(A)
≥ ha(−δ/2), −δ/2i − 2kvkdiam(A) .

80
Similarly, ha(v + δ/2), δ/2i ≥ ha(δ/2), δ/2i − 2kvkdiam(A) and hence

f (u + v) = ha(u + v) − a(u + δ + v), u + vi


≥ ha(u + v) − a(u + δ + v), ui − 2kvkdiam(A)
= ha(v − δ/2) − a(v + δ/2), −δ/2i − 2kvkdiam(A)
= ha(v − δ/2), −δ/2i + ha(δ/2 + v), δ/2i − 2kvkdiam(A)
≥ ha(−δ/2), −δ/2i + ha(δ/2), δ/2i − 6kvkdiam(A)
= ha(−δ/2) − a(δ/2), −δ/2i − 6kvkdiam(A)
= ε − 6kvkdiam(A) .

Thus, for sufficiently small kvk, it holds that f (u + v) ≥ ε/2 and the claim follows.

(c) We need to show that int(dom(F )) = int(co(A)). By the first two parts of the exercise, F is
Legendre, and hence by Part (a) of Theorem 26.6 and by Exercise 30.5, we have
Z 
int(dom(F )) = ∇F ∗ (Rd ) = a(x + z)q(z)dz : x ∈ Rd .
Rd

Clearly, this is a subset of int(co(A)). To establish the equality, by convexity of co(A) it suffices
to show that for any extreme point a ∈ A and ε > 0 there exists an x such that k∇F ∗ (x)−ak ≤ ε.
To show this, choose a vector x0 ∈ Rd so that a(x0 + v) = a for any v in the unit ball centered
at zero. Such a vector exist because of the conditions on A. Let Kε be a closed ball centered
at zero such that Q(Kε ) ≥ 1 − ε/(maxa∈A kak). This exist because Q(Rd ) = 1. Let r be the
radius of Kε . Pick any c > rε . Then, for any v ∈ Kε , a(cx0 + v) = a(x0 + v/c) = a and hence
R
∇F ∗ (cx0 ) = s + Kε a(cx0 + z)q(z) = s + a, where ksk ≤ ε, finishing the proof.

30.8 Following the advice, assume that the learner plays m bandits in parallel, each having k = d/m
√ p
actions. Let Rni be the regret of the learner in the ith bandit. Then, Rni ≥ c nk = c nd/m for
Pm
some universal constant c > 0. Further, if Rn is the regret of the learner, Rn = i=1 Rni . Hence,

Rn ≥ c ndm.
An alternative to this is to emulate a k = d/m-armed bandit with scaled rewards: For this
imagine that the d items (components of the combinatorial action) are partitioned into k parts, each
having m items in it. Unlike in multi-task bandits, the learner needs to choose a part and receives
feedback for all the items in it. Hence, the the rewards received belong to the [0, m] interval and we
√ √
also get Rn ≥ cm nk = c ndm.

Chapter 31 Non-stationary Bandits

31.1 As suggested, Exp4 is used with each element of Γnm identified with one expert. Consider
an arbitrary enumeration of Γnm = {a(1) , . . . , a(G) } where G = |Γnm |. The predictions of expert
g ∈ [G] for round t ∈ [n] encoded as a probability
n distribution
o over [k] (as required by the prediction-
with-expert-advice framework) is Eg,j t = I a(g) = j , j ∈ [k]. The expected regret of Exp4 when
t

81
used with these experts is
" n n
#
X X
Rnexperts =E ytAt − min Eg(t) yt ,
g∈[G]
t=1 t=1

where compared to Chapter 18 we switched to losses. By definition,


n
X n
X
Eg(t) yt = yt,a(g)
t
t=1 t=1

and hence

Rnexperts = Rnm .

Thus, Theorem 18.1 indeed proves (31.1). To prove (31.2) it remains to show that G = |Γnm | ≤
P
Cm log(kn/m). For this note that G = m s=1 Gn,s where Gn,s is the number of sequences from [k]
∗ ∗ n

that switch exactly s − 1 times. When m − 1 ≤ n/2, a crude upper bound on G is mGnm . For ∗

s = 1, G∗n,s = k. For s > 1, a sequence with s − 1 switches is determined by the location of the
switches, and the identity of the action taken in each segment where the action does not change.
The possible switch locations are of the form (t, t + 1) with t = 1, . . . , n − 1. Thus the number of

these locations is n − 1, of which, we need to choose s − 1. There are n−1 s−1 ways of doing this.
Since there are s segments and for the first segment we can choose any action and for the others we
can choose any other action than the one chosen for the previous segments, there are kk s−1 valid
 P n
ways of assigning actions to segments. Thus, G∗n,s = kk s−1 n−1
s−1 . Define Φm (n) = m i=0 i . Hence,
Pm−1 n−1
G≤k m
s=0 s = k Φm−1 (n − 1) ≤ k Φm (n). Now note that for n ≥ m, 0 ≤ m/n ≤ 1, hence
m m

 m m   ! n   !  n
m X m i n X m i n m
Φm (n) ≤ ≤ = 1+ ≤ em .
n i=0
n i i=0
n i n

Reordering gives Φm (n) ≤ en m
m . Hence, log(G) ≤ m log(ekn/m). Plugging this into (31.1) gives
(31.2).

31.3 Use the construction and analysis in Exercise 11.6 and note that when m = 2 the random
version of the regret is nonnegative on the bandit constructed there.

Chapter 32 Ranking

32.2 The argument is half-convincing. The heart of the argument is that under the criterion
that at least one item should attract the user, it may be suboptimal to present the list composed
of the fittest items. The example with the query ‘jaguar’ is clear: Assume half of the users will
mean ‘jaguar’ as the big cat, while the other half will mean it as the car. Presenting items that are
relevant for both meanings may have a better chance to satisfy a randomly picked user than going
with the top m list, which may happen to support only one of the meanings. This shows that there

82
is indeed an issue with ‘linearizing’ the problem by just considering individual item fitness values.

However, the argument is confusing in other ways. First, it treats conditions (for example,
independence of attractiveness) that are sufficient but not necessary to validate the probabilistic
ranking principle (PRP) as if they were also necessary. In fact, in click model studied here, the
mentioned independence assumption is not needed. To clarify, the strong assumption in the stochastic
click model, is that the optimal list is indeed optimal. Under this assumption, the independence
assumption is not needed.

Next, that the same document can have different relevance to different users fits even the cascade
model, where the vector of attraction values are different each time they are sampled from the
model. So this alone would not undermine the PRP.

Finally, the last sentence confuses relevance and ‘usefulness’. Again, in the cascade model, the
relevance (attractiveness) of a document (item) does not depend on the relevance of any other
document. Yet in the reward in the cascade model is exactly one if and only if at least one document
presented is relevant (attractive).

32.6 Following the proof of Theorem 32.2 the first part until Eq. (32.5) we have
" #
X̀ min{m,j−1}
X n
X
Rn ≤ nmP(Fn ) + E I {Fnc } Utij .
j=1 i=1 t=1

As before the first term is bounded using Lemma 32.4. Then using the first part of the proof of
Lemma 32.7 shows that
v
n u √ !
X u c n
I {Fnc } Utij ≤ 1 + t2Nnij log .
t=1
δ

Substituting into the previous display and applying Cauchy-Schwarz shows that
v  
u √ !
u X̀ min{m,j−1}
X
u c n
Rn ≤ nmP(Fn ) + m` + t2m`E  Nnij  log .
j=1 i=1
δ

Writing out the definition of Nnij reveals that we need to bound


    
X̀ min{m,j−1}
X n
X Mt X
X X
E Nnij  ≤ E E  Utij Ft−1 
j=1 i=1 t=1 d=1 j∈Ptd i∈Ptd ∩[m]
  
n
X Mt X
X X
≤ E E  (Cti + Ctj ) Ft−1  = (A) .
t=1 d=1 j∈Ptd i∈Ptd ∩[m]

83
Expanding the two terms in the inner sum and bounding each separately leads to
   
Mt X
X X Mt
X X
E Cti Ft−1  = E  |Ptd | Cti Ft−1 
d=1 j∈Ptd i∈Ptd ∩[m] d=1 i∈Ptd ∩[m]
Mt
X
≤ |Itd ∩ [m]||Ptd ∩ [m]| ≤ m2 ,
d=1

where the inequality follows from the fact that for i ∈ Ptd ,
  |Itd ∩ [m]| |Itd ∩ [m]|
t (i) ∈ [m] | Ft−1 =
E[Cti | Ft−1 ] ≤ P A−1 = .
|Itd | |Ptd |

For the second term that makes up (A),


   
Mt X
X X Mt
X X
Et−1  Ctj Ft−1  = Et−1  |Ptd ∩ [m]| Ctj Ft−1 
d=1 j∈Ptd i∈Ptd ∩[m] d=1 j∈Ptd
Mt
X
≤ |Ptd ∩ [m]||Itd ∩ [m]| ≤ m2 .
d=1
r  √ 
Hence (A) ≤ nm2 and Rn ≤ nmP(Fn ) + m` + 4m3 `n log c n
δ and the result follows from
Lemma 32.4.

Chapter 33 Pure Exploration

33.3 Abbreviate f (α) = inf d∈D hα, di, which is clearly positively homogeneous: f (cα) = cf (α) for
any c ≥ 0. Because D is nonempty, f (0) = 0. Hence we can ignore α = 0 in both optimisation
problems and so
!−1
L
sup f (α) L= inf
α∈Pk−1 α∈Pk−1 f (α)
Lkαk1
= inf
α≥0:kαk1 >0 f (α)
= inf kLα/f (α)k1
α≥0:kαk1 >0

= inf {kαk1 : f (α) ≥ L} ,

where we used the positive homogeneity of f (α) and the `1 norm.

33.4

84
(a) For each i > 1 define

Ei = {ν̃ ∈ E : µ1 (ν̃) = µi (ν̃) and µj (ν̃) = µj (ν) for j ∈


/ {1, i}} .

You can easily show that


k
X
inf αi D(νi , ν̃i ) = min inf (α1 D(ν1 , ν̃1 ) + αi D(νi , ν̃i ))
ν̃∈Ealt (ν) i>1 ν̃∈Ei
i=1
!
α1 (µ1 (ν) − µ̃)2 αi (µi (ν) − µ̃)2
= min inf +
i>1 µ̃∈R 2σ12 2σi2
1 α1 αi ∆2i
= min .
2 i>1 α1 σi2 + αi σ12

(b) Let α1 = α so that α2 = 1 − α. By the previous part


k
X α1 α2 ∆22
(c (ν))
∗ −1
= max inf αi D(νi , ν̃i ) = max
α∈[0,1] ν̃∈Ealt (ν) α∈[0,1] α1 σ22 + αi σ12
i=1
α(1 − α)∆22
= max
α∈[0,1] ασ22 + (1 − α)σ12

∆22
= .
2(σ12 + σ22 )

(c) By the result in Exercise 33.3 and Part (a) of this exercise,
( k
)
X
c (ν) = inf kαk1 : α ∈ [0, ∞) ,
∗ k
inf αi D(νi , ν̃i ) = 1
ν̃∈Ealt (ν)
i=1
( )
α1 αi ∆2i
= inf kαk1 : α ∈ [0, ∞) , min k
=1 .
i>1 2α1 σi2 + 2αi σ12

Let α1 = 2aσ12 /∆2min , which by the constraint that α ≥ 0 must satisfy a > 1. Then

k
X 2α1 σi2
c∗ (ν) = inf α1 +
α1 >2σ12 /∆2min i=2
α1 ∆2i − 2σ12
2aσ12 a Xk
2σi2 /∆2i
≤ inf +
a>1 ∆2min a − 1 i=2
a−1
s v 2
u k
2σ1
2 uX 2σ 2
= +t i 
∆2min i=2
∆2i
v
u
2σ 2 k
X 2σi2 4σ1 uXk
2σi2
= 21 + + t .
∆min i=2 ∆2i ∆min i=2 ∆2i

85
(d) From the previous part
( )
α1 αi ∆2i
c (ν) = inf kαk1 : α ∈ [0, ∞) , min
∗ k
=1 ,
i>1 2α1 σi2 + 2αi σ12

Let α1 = 2aσ12 /∆2min with a > 1. Then


!
k
X 2α1 σi2
c∗ (ν) = inf α1 +
α1 >2σ12 /∆2min i=2
α1 ∆2i − 2σ12
!
2aσ12 a Xk
2σi2
≤ inf +
a>1 ∆2min a−1 i=2
∆2i
s v 2
u k
2σ12 uX 2σ 2
= +t i 
∆2min i=2
∆2i
v
u k
2σ12 k
X 2σi2 4σ1 uX 2σ 2
= + + t i
.
∆2min i=2
∆2 i ∆ ∆2
min i=2 i

(e) Notice that the inequality in the previous part is now an equality.

33.5

(a) Let ν ∈ E be an arbitrary Gaussian bandit with µ1 (ν) > maxi>1 µi (ν) and assume that

− log Pνπ (∆An+1 > 0)
lim inf > 1 + ε. (33.1)
n→∞ log(n)

Notice that if Eq. (33.1) were not true then we would be done. Then let ν 0 be a Gaussian
bandit in Ealt (ν) with µ(ν 0 ) = µ(ν) except that µi (ν 0 ) = µi (ν) + ∆i (ν)(1 + δ) where i > 1 and

δ = 1 + ε − 1. By Theorem 14.2 and Lemma 15.1,

1
Pνπ (An+1 6= 1) + Pν 0 π (An+1 6= i) ≥ exp (− D(Pνπ , Pν 0 π ))
2 !
1 (1 + δ)2 ∆i (ν)2 Eνπ [Ti (n)]
≥ exp −
2 2
!
1 (1 + ε)∆i (ν)2 Eνπ [Ti (n)]
= exp − .
2 2

Because π is asymptotically optimal, limn→∞ Eνπ [Ti (n)]/ log(n) = 2/∆i (ν)2 and hence
 1+ε+εn
1 1
Pνπ (An+1 6= 1) + Pν 0 π (An+1 6= i) ≥ ,
2 n

86
where limn→∞ εn = 0. Using Eq. (33.1) shows that

lim inf n1+ε+εn Pν 0 π (An+1 6= i) > 0 ,


n→∞

which implies that

− log (Pν 0 π (An+1 6= i)) − log (Pν 0 π (An+1 6= i))


lim inf ≤ lim sup ≤ 1 + ε.
n→∞ log(n) n→∞ log(n)

(b) No. Consider the algorithm that plays UCB on rounds t ∈


/ {2k : k ∈ N} and otherwise plays
2

round-robin.

(c) The same argument as Part (a) shows there exists a ν ∈ E with a unique optimal arm such that

− log (Pνπ (An+1 6∈ i∗ (ν)))


lim inf = O(1) ,
n→∞ log(n)

which means the probability of selecting a suboptimal arm decays only polynomially with n.

33.6

(a) Assume without loss of generality that arm 1 is unique in ν. By the work in Part (a) of
Exercise 33.4, α∗ (ν) = argmaxα∈Pk−1 Φ(ν, α) with

1 Xk
Φ(ν, α) = inf αi (µi (ν) − µi (ν̃))2
2 ν̃∈Ealt (ν) i=1
1 α1 αi ∆2i 1
= min = min fi (α1 , αi )
2 i>1 α1 + αi 2 i>1

where Φ(ν, α) = 0 if αi = 0 for any i and the last equality serves as the definition of fi .
The function Φ(ν, ·) is the minimum of a collection of concave functions and hence concave.
Abbreviate α∗ = α∗ (ν) and notice that α∗ must equalize the functions (fi ) so that fi (α1∗ , αi∗ ) is
constant for i > 1. Hence, for all i > 1,

1 α1∗ αi∗ ∆2i


= max Φ(ν, α) = Φ(ν) .
2 α1∗ + αi∗ α∈Pk−1

Rearranging shows that

2α1∗ Φ(ν)
αi∗ = .
∆2i α1∗
− 2Φ(ν)

Therefore
k
X 2α1∗ Φ(ν)
αi∗ + = 1.
i=2
∆2i α1∗− 2Φ(ν)

87
The solutions to this equation are the roots of a polynomial and by the fundamental theorem
of algebra, either this polynomial is zero or there are finitely many roots. Since the former is
clearly not true, we conclude there are at most finitely many maximisers. Yet concavity of the
objective means that the number of maximisers is either one or infinite. Therefore there is a
unique maximiser.

(b) Notice that i∗ (ξ) = i∗ (ν) whenever d(ξ, ν) is sufficiently small. Hence, by the previous
part, the function Φ(·, ·) is continuous at (ν, α) for any α. Suppose that α∗ (·) is not
continuous at ν. Then there exists a sequence (νn )∞ n=1 with limn→∞ d(νn , ν) = 0 and for
which lim inf n→∞ kα (ν) − α (νn )k∞ > 0. By compactness of Pk−1 , the sequence α∗ (νn ) has a
∗ ∗

cluster point α∞∗ , which by assumption must satisfy α∗ (ν) 6= α∗ . And yet, taking limits along

an appropriate subsequence, Φ(α∗ (ν), ν) = limn→∞ Φ(α∗ (ν), νn ) ≤ limn→∞ Φ(α∗ (νn ), νn ) =
Φ(α∞∗ , ν). Therefore by Part (a), α∗ (ν) = α∗ , which is a contradiction.

(c) We’ll be a little lackadasical about constants here. Define random variable
( s )
2 log(2λkt(t + 1))
Λ = min λ ≥ 1 : d(ν̂t , ν) ≤ for all t ,
mini Ti (t)

which by the usual concentration analysis and union bounding satisfies P (Λ ≥ x) ≤ 1/x.
Therefore
Z ∞   Z ∞
E[log(Λ)2 ] = P Λ ≥ exp(x1/2 ) dx ≤ exp(−x1/2 )dx = 2 .
0 0

By the definition of λ,
( s )
2 log(Λkt(t + 1))
τν (ε) ≤ 1 + max t : >ε .
mini Ti (t)

The forced exploration in the algorithm means that Ti (t) = Ω( t) almost surely and hence
 
E[τν (ε)] = O E[log(Λ)2 ] = O(1) .

(d) Let w(ε) = inf{x : d(ω, ν) ≤ x =⇒ kα∗ (ν) − α∗ (ω)k∞ ≤ ε}, which by (b) satisfies w(ε) > 0
for all ε > 0. Hence E[τα (ε)] ≤ E[τν (w(ε))] < ∞.


(e) By definition of the algorithm At = i implies that either Ti (t − 1) ≤ t or At =
argmaxi αi∗ (ν̂t−1 ) − Ti (t − 1)/(t − 1). Now suppose that
( )
2kτα (ε/(2k)) 16k 2
t ≥ max , 2 .
ε ε

88
Then the definition of the algorithm implies that
n √o
Ti (t) ≤ max Ti (τα (ε/(2k))), 1 + t(αi∗ (ν) + ε/(2k)), 1 + t
 
ε
≤ t αi∗ (ν) + .
k
Pk
Furthermore, since i=1 Ti (t) = t,
X X  ε

Ti (t) ≥ t − Tj (t) ≥ t − t αj∗ (ν) + ≥ t(αi∗ (ν) − ε) .
j6=i j6=i
k

And the result follows from the previous part, which ensures that
" ( )#
2kτα (ε/(2k)) 16k 2
E max , 2 < ∞.
ε ε

(f) Given ε > 0 let τβ (ε) = 1 + max {t : tΦ(ν, α∗ (ν)) < βt (δ) + εt} and

u(ε) = sup {Φ(ω, α) : d(ω, ν) ≤ ε, kα − α∗ (ν)k∞ ≤ ε} .


ω,α

Then for t ≥ max{τν (ε), τT (ε), τβ (u(ε))} it holds that

tZt = tΦ(ν̂t , T (t)/t) ≥ t(Φ(ν, α∗ (ν)) − u(ε)) ≥ βt (δ) ,

which implies that

τ ≤ max {τν (ε), τT (ε), τβ (u(ε))} ≤ τν (ε) + τT (ε) + τβ (u(ε)) .

Taking the expectation,

E[τ ] ≤ E[τν (ε)] + E[τT (ε)] + E[τβ (u(ε))] .

Taking the limit as δ → 0 and using the previous parts shows that for any sufficiently small
ε > 0,

E[τ ] E[τβ (u(ε))] 1


lim sup ≤ lim sup = .
δ→0 log(1/δ) δ→0 log(1/δ) Φ(ν, α ∗ (ν)) − u(ε)

Continuity of Φ(·, ·) at (ν, α∗ (ν)) ensures that limε→0 u(ε) = 0 and the result follows since
c∗ (ν) = 1/Φ(ν, α∗ (ν)). Note that taking the limit as δ → 0 only works because the policy does
not depend on δ. Hence the expectations of τν (ε), τα (ε) and τT (ε) do not depend on δ.

33.7

89
(a) Recalling the definitions,
( )
k
X 1 1 i
H1 (µ) = min , 2 and H2 (µ) = max ,
i=1
∆min ∆i
2 i:∆i >0 ∆2 i

where ∆1 ≤ ∆2 ≤ · · · ≤ ∆k . Now, for any i ∈ [k] with ∆i > 0,


( ) ( )
i
X 1 1 i
X 1 1 i
H1 (µ) ≥ min , 2 ≥ min , 2 ≥ .
j=1
∆min ∆j
2
j=1
∆min ∆i
2 ∆2i

Therefore H1 (µ) ≥ H2 (µ). For the second inequality, let imin = min{i : ∆i > 0}. Then

imin Xk
1 i
H1 (µ) = +
∆min i=i +1 i ∆2i
2
min
 
k
X 1
≤ 1 + H2 (µ) ≤ (1 + log(k))H2 (µ) .
i=imin +1
i

Pk
The result follows because imin > 1 and because i=3 ≤ log(k).


(b) When ∆2 = · · · = ∆k > 0 it holds that H1 (µ) = H2 (µ). For the other direction let ∆i = i for
i ≥ 2 so that i/∆2i = 1 = H2 (µ) and

k
X 1
H1 (µ) = 1 + = L = LH2 (µ) .
i=3
i

 
33.9 We have P maxi∈[n] µ(Xi ) < µ∗α = P (µ(X1 ) < µ∗α )n ≤ (1 − α)n ≤ δ. Solving for n gives
the required inequality.

Chapter 34 Foundations of Bayesian Learning

34.4

R
(a) By the ‘sections’ Lemma 1.26 in [Kallenberg, 2002], d(x) = Θ pψ (x)q(ψ)dν(ψ) is H-measurable.

90
Therefore N = d−1 (0) ∈ H is measurable. Then
Z Z
0= pψ (x)q(ψ) dν(ψ)dµ(x)
ZN Z Θ
= pψ (x)dµ(x)q(ψ) dν(ψ)
ZΘ N
= Pψ (N ) dν(ψ)
Θ
= P(X ∈ N )
= PX (N ) ,

where the first equality follows from the definition of N . The second is Fubini’s theorem, the
third by the definition of the Radon-Nikodym derivative, the fourth by the definition of P and
the last by the definition of PX .

(b) Note that q(θ | x) = pθ (x)q(θ)/d(x), which is jointly measurable in θ and x. The fact that
Q(A | x) is a probability measure for all x is straightforward from the definition of expectation
and because for x ∈ N ,
R
Θ q(θ | x)
Q(Θ | x) = R = 1.
p
Θ ψ (x)q(ψ) dν(ψ)

That Q(A | ·) is H-measurable follows from the sections lemma and the fact that N ∈ H. Let
A ∈ G and B ∈ σ(X) ⊆ F, which can be written as B = Θ × C for some C ∈ H. Then,
Z Z Z
Q(A | X(ω)) dP(ω) = q(θ | X(ω)) dν(θ)dP(ω)
B ZB ZA Z
= q(θ | x) dν(θ) pθ (x)q(θ) dµ(x)dν(θ)
ZΘ C A
Z
= d(x) q(θ | x) dν(θ)dµ(x)
ZC Z A

= pθ (x)q(θ) dν(θ)dµ(x)
ZC A
= pθ (C)q(θ) dν(θ)
A
= P(θ ∈ A, X ∈ C)
= P(θ ∈ A, X ∈ C)
Z
= IA (θ) dP ,
B

which is the required averaging property.

34.5 Abbreviate pθ (x) = dh (x).


dPθ

R
(a) Clearly pθ (x) ≥ 0. By definition, for B ∈ B(R), Pθ (B) = B pθ (x)dh(x). Hence Pθ (B) ≥ 0.

91
Furthermore,
Z Z
Pθ (R) = exp(θT (x) − A(θ))dh(x) = exp(−A(θ)) exp(θT (x))dh(x) = 1 .
R R
R R R
Additivity is immediate since B f dh + C f dh = B∪C f dh for disjoint B, C.

(b) Using the chain rule and passing the derivative under the integral yields the result:
d R
dθR R exp(θT (x))dh(x)
A (θ) =
0
exp(θT (x))dh(x)
R R
T (x) exp(θT (x))dh(x)
= RR
exp(θT (x))dh(x)
Z R
= T (x) exp(θT (x) − A(x))dh(x)
ZR
= T (x)pθ (x)dh(x)
R
= Eθ [T ] .

In order to justify the exchange of integral and derivative use the identity that for all sufficiently
small ε > 0 and all a > 0,

exp(aε) + exp(−aε)
a≤ .
ε
Hence for θ ∈ int(dom(A)) there exists a neighborhood N of θ such that for all ψ ∈ N ,

exp((θ + ε)T (x)) + exp((θ − ε)T (x))


|T (x)| exp(ψT (x)) ≤ φ(x) = .
ε
Since φ(x) is integrable for sufficiently small ε it follows by the dominated convergence theorem
that the derivative and integral can be exchanged.

(c) When X ∼ Pθ we have


Z
dPθ
E[exp(λT (X))] = exp(λT (x)) dh(x)
ZR
dh
= exp(λT (x)) exp(θT (x) − A(θ))dh(x)
R Z
= exp(−A(θ)) exp((θ + λ)T (x))dh(x)
R
= exp(−A(θ)) exp(A(θ + λ))
= exp(A(θ + λ) − A(θ)) .

92
(d) This is another straightforward calculation:
Z
d(θ, θ0 ) = (θT (x) − A(θ) − θ0 T (x) + A(θ0 )) exp(θT (x) − A(θ))dh(x)
R Z
= A(θ0 ) − A(θ) + (θ − θ0 ) T (x) exp(θT (x) − A(θ))dh(x)
R
= A(θ0 ) − A(θ) − (θ0 − θ)A (θ) . 0

(e) The Crammer-Chernoff method is the solution. Let λ = n(θ0 − θ). Then

Pθ (T̂ ≥ Eθ0 [T ]) = Pθ (exp(λT̂ ) ≥ exp(λEθ0 [T ]))


≤ Eθ [exp(λT̂ )] exp(−λEθ0 [T ])
n
Y
= Eθ [exp(λT (Xt )/n)] exp(−λA0 (θ0 ))
t=1
= exp(n(A(θ + λ/n) − A(θ)) − λA0 (θ0 ))

= exp n(A(θ0 ) − A(θ) − (θ0 − θ)A0 (θ0 ))

= exp −nd(θ0 , θ) .

A symmetric calculation shows that for θ0 < θ,



Pθ (T̂ ≤ Eθ0 [T ]) ≤ exp −nd(θ0 , θ) .

34.13 Let π ∗ as in the problem definition. Let S̃ be the extension of S by adding rays in the
positive direction: S̃ = {x + u : x ∈ S, u ≥ 0}. Clearly S̃ remains convex and λ(S) ⊆ ∂ S̃ is on the
boundary and is a subset of λ(S̃) (see figure) Let x ∈ λ(S). By the supporting hyperplane theorem
and the convexity of S̃ there exists a nonzero vector a ∈ RN and b ∈ R such that ha, `(π ∗ )i = b and
ha, yi ≥ b for all y ∈ S̃. Furthermore a ≥ 0 since x + ei ∈ S̃ and so hx + ei , ai = b + ai ≥ b. Define
q(νi ) = ai /kak1 . Then, for any policy π,

X 1 X N
ha, `(π)i b
q(ν)`(π, ν) = ai `(π, νi ) = ≥
ν∈E
kak1 i=1 kak1 kak1

with equality for any policy π with `(π) = `(π ∗ ). Since ai is nonnegative, a 6= 0, q ∈ P(E), finishing
the proof.

93
34.14

(a) Suppose that π is not admissible. Then there exists another policy π 0 with `(π 0 , ν) ≤ `(π, ν) for
all ν ∈ E. Clearly π 0 is also Bayesian optimal. But π was unique, which is a contradiction.

(b) Suppose that Π = {π1 , π2 } and E = {ν1 , ν2 } and `(π, ν1 ) = 0 for all π and `(πi , ν2 ) = I {i = 2}.
Then any policy is Bayesian optimal for Q = δν1 , but π2 is dominated by π1 .

(c) Suppose that π is not admissible. Then there exists another policy π 0 with `(π 0 , ν) ≤ `(π, ν) for
all ν ∈ E and `(π 0 , ν) < `(π, ν) for at least one ν ∈ E. Then
Z X X Z
`(π, ν)dQ(ν) = Q({ν})`(π, ν) > Q({ν})`(π, ν) = `(π 0 , ν)dQ(ν) ,
E ν∈E ν∈E E

which is a contradiction.

(d) Repeat the previous solution with the restriction to the support.

34.15 Let Π be the set of all policies and ΠD = {e1 , . . . , eN } the set of all deterministic policies,
which is finite. A policy π ∈ Π can be viewed as a probability measure on (ΠD , B(ΠD )), which is
the essence of Kuhn’s theorem on the equivalence of behavioral and mixed strategies in extensive
form games. Note that since ΠD is finite, probability measures on (ΠD , B(ΠD )) can be viewed
as distributions in PN −1 . In this way Π inherits a metric and topology from PN −1 . Even more
straightforwardly, E is identified with [0, 1]k and inherits a metric from that space. As metric spaces
both Π and E are compact and the regret Rn (π, ν) is continuous in both arguments by Exercise 14.4.
Let (νj )∞
j=1 be a sequence of bandit environments that is dense in E and Ej = {ν1 , . . . , νj }. Using
the notation of the previous exercise, let Rn,j (π) = (Rn (π, ν1 ), . . . , Rn (π, νj )), Sj = Rn,j (Π) ⊂ Rj
and let λ(Sj ) be the Pareto frontier of Sj . Note that Sj is non-empty, closed and convex. Thus,
λ(Sj ) ⊂ Sj . Now let π adm ∈ Π be an admissible policy. Then Rn,j (π adm ) ∈ λ(Sj ) and by the
result of the previous exercise there exists a distribution Qj ∈ P(E) supported on Ej such that
BRn (π adm , Qj ) ≤ minπ BRn (π, Qj ). Let Q be the space of probability measures on (E, B(E)),
which is compact with the weak* topology by Theorem 2.14. Hence (Qj )∞ j=1 contains a convergent
subsequence (Qi )i converging to Q. Notice that ν 7→ Rn (π, ν) is a continuous function from E to
[0, n]. Therefore by the definition of the weak* topology for any policy π,
Z
lim BRn (π, Qi ) = lim Rn (π, ν)dQ(ν) = BRn (π, Q) .
i→∞ i→∞ E

Hence BRn (π adm , Q) ≤ minπ BR(π, Q).


34.16 Clearly

max BR∗n (Q) ≤ Rn∗ (E) .


Q∈Q

In the remainder we show the other direction. Since [0, 1]k is compact with the usual topology,
Theorem 2.14 shows that Q is compact with the weak* topology. Let Π be the space of all policies

94
with the discrete topology and
( )
X
P= p(π)δπ : p ∈ P(A) and A ⊂ Π is finite ,
π∈A

which is a convex subspace of the topological vector space of all signed measures on (Π, 2Π ) with
the weak* topology. Let L : P × Q → [0, n] be defined by
Z Z Z
L(S, Q) = − Rn (π, ν)dQ(ν)dS(π) = − Rn (πS , ν)dQ(ν) ,
Π E E
R
where πS a policy such that PνπS = Π Pνπ dS(π), which is defined in Exercise 4.4. The regret
is bounded in [0, n] and the discrete topology on Π means that all functions from Π to R are
R
continuous, including π 7→ E Rn (π, ν)Q(dν). By the definition of the weak* topology on P it holds
that L(·, Q) is continuous in its first argument for all Q. The integral over Π with respect to S ∈ P
is a finite sum and ν 7→ Rn (π, ν) is continuous for all π by the result in Exercise 14.4. Therefore L
is continuous and linear in both arguments. By Sion’s theorem (Theorem 28.12),

− max BR∗n (Q) = min sup L(S, Q) = sup min L(S, Q) = − inf Rn∗ (πS , E) ,
Q∈Q Q∈Q S∈P S∈P Q∈Q S∈P

Therefore

max BR∗n (Q) = inf Rn∗ (πS , E) = inf Rn∗ (π, E) = Rn∗ (E) .
Q∈Q S∈P π∈Π

Hence Rn∗ (E) = maxQ∈Q BR∗n (Q).

Chapter 35 Bayesian Bandits

35.1 Let π be the policy of MOSS from Chapter 9, which for any 1-subgaussian bandit ν with
rewards in [0, 1] satisfies
√ 
k log(n)
Rn (π, ν) ≤ C min kn, ,
∆min (ν)

95
where ∆min (ν) is the smallest positive suboptimality gap. Let En be the set of bandits in E for
which there exists an arm i with ∆i ∈ (0, n−1/4 ). Then, for C 0 = Ck,

BR∗n (Q) ≤ BRn (π, Q)


Z
= Rn (π, ν)dQ(ν)
ZE Z
= Rn (π, ν)dQ(ν) + Rn (π, ν)dQ(ν)
En Enc
Z

≤ C 0 nQ(En ) + C 0 n1/4 log(n)dQ(ν)
Enc
√ √
= C 0 nQ(En ) + o( n) .

The first part follows since ∩n En = ∅ and thus limn→∞ Q(En ) = 0 for any measure Q. For the second
part we describe roughly what needs to be done. The idea is to make use of the minimax lower
bound technique in Exercise 15.2, which shows that for a uniform prior concentrated on a finite set

of k bandits the regret is Ω( kn). The only problems are that (a) the rewards were assumed to be
Gaussian and (b) the prior depends on n. The first issue is corrected by replacing the Gaussian
distributions with Bernoulli distributions with means close to 1/2. For the second issue you should
P
compute this prior for n ∈ {1, 2, 4, 8, . . .} and denote them Q1 , Q2 , . . .. Then let Q = ∞
j=1 pj Qj
where pj ∝ (j log (j)) . The result follows easily.
2 −1

35.2 Recall that Et = Ut and for t < n,

Et = max{Ut , E[Et+1 | Ft ]} .

Integrability of (Ut )nt=1 ensures that (Et )nt=1 are integrable. By definition Et ≥ E[Et+1 | Ft ]. Hence
(Et )nt=1 is a supermartingale adapted to F. Hence for any stopping time κ ∈ Rn1 the optional
stopping theorem says that

E[Uκ ] ≤ E[Eκ ] ≤ E1 .

On the other hand, for τ satisfying the requirements of the lemma the process Mt = Et∧τ is a
martingale and hence E[Uτ ] = E[Mτ ] = M1 = E1 .
35.3 Define v n (x) = supτ ∈Rn1 Ex [Uτ ]. By assumption, Ex [|u(St )|] < ∞ for all x ∈ S and t ∈ [n].
Therefore by Theorem 1.7 of Peskir and Shiryaev [2006],
Z
v (x) = max{u(x),
n
v n−1 (y)Px (dy)} . (35.1)
S

Recall that v(x) = supτ Ex [Uτ ]. Clearly v n (x) ≤ v(x) for all x ∈ S. Let τ be an arbitrary stopping
time. Then

v n (x) ≥ Ex [Uτ ∧n ] = Ex [Uτ ] + Ex [(Un − Uτ )I {τ ≥ n}] .

Since by assumption supn |Un | is Px -integrable for all x, the dominated convergence theorem shows

96
that
h i
lim Ex [(Un − Uτ )I {τ ≥ n}] = Ex lim (Un − Uτ )I {τ ≥ n} = 0 ,
n→∞ n→∞

where the second equality follows because U∞ = limn→∞ Un exists Px -almost surely by assumption.
Therefore limn→∞ v n (x) = v(x). Since convergence is monotone, it follows that v is measurable.
Taking limits in Eq. (35.1) shows that
Z
v(x) = lim max{u(x), v n−1 (y)Px (dy)}
n→∞
ZS
= max{u(x), lim v n−1 (y)Px (dy)}
n→∞ S
Z
= max{u(x), v(y)Px (dy)} ,
S

where the last equality follows from the monotone convergence theorem. Next, let Vn = v(Sn ). Note
that limn→∞ Ex [Un ] = Ex [limn→∞ Un ] = Ex [U∞ ], where the exchange of the limit and expectation
is justified by the dominated convergence theorem because supn |Un | is Px integrable. By definition,
Vn ≥ Un . Hence,

lim Ex [|Vn − Un |] = lim Ex [Vn ] − lim Ex [Un ]


n→∞ n→∞ n→∞
= lim Ex [Vn − U∞ ]
n→∞
" " # #
≤ lim Ex Ex sup Ut Sn − U∞
n→∞ t≥n
" #
= lim Ex sup Ut − U∞
n→∞ t≥n
" #
= Ex lim sup Ut − U∞
n→∞ t≥n

= 0,

where the exchange of limit and expectation is again justified by the dominated convergence theorem
and the assumption that supn |Un | is Px -integrable. Therefore V∞ = limn→∞ Vn = U∞ Px -a.s. Then
Z
Ex [Vn+1 | Sn ] = v(y)PSn (dy) ≤ Vn a.s. .
S

Therefore (Vn )∞
n=1 is a supermartingale, which means that for any stopping time κ,
h i
Ex [Uκ ] = Ex lim Uκ∧n = lim Ex [Uκ∧n ] ≤ lim Ex [Vκ∧n ] ≤ v(x) ,
n→∞ n→∞ n→∞

where the exchange of limits and expectation is justified by the dominated convergence theorem
and the fact that Uκ∧n ≤ supn Un , which is Px -integrable by assumption. Consider a stopping time
τ satisfying the conditions of Theorem 35.3. Then (Vn∧τ )∞ n=1 is a martingale and using the same

97
argument as before we have
h i
Ex [Uτ ] = Ex [Vτ ] = Ex lim Vτ ∧n = lim Ex [Vτ ∧n ] = v(x) ,
n→∞ n→∞

where the first equality follows from the assumption on τ that on the event τ < ∞, Uτ = Vτ and
the fact that V∞ = U∞ Px -a.s..
35.6 Fix x ∈ S and let
hP i
Ex τ −1 t−1
t=1 α r(St )
g = sup hP i .
τ −1 t−1
τ ≥2 Ex α
t=1

We will show that (a) vγ (x) > 0 for all γ < g and (b) vγ (x) = 0 for all γ ≥ g.
For (a), assume γ < g. By the definition of g, there exists a stopping time τ ≥ 2 such that
"τ −1 # "τ −1 #
X X
Ex α t−1
r(St ) > γ Ex α t−1
,
t=1 t=1

which implies that


"τ −1 #
X
vγ (x) ≥ Ex α t−1
(r(St ) − γ) > 0 .
t=1

Moving
hP now to (b), first i note that vγ (x) ≥ 0 for any γ ∈ R because when τ = 1,
Ex τ −1 t−1
t=1 α (r(St ) − g) = 0. Hence, it suffices to show that vγ (x) ≤ 0 for all γ ≥ g. Pick
γ ≥ g. By the definition of g, for any stopping time τ ≥ 2,
"τ −1 #
X
Ex α t−1
(r(St ) − γ) ≤ 0 ,
t=1

which implies that


"τ −1 #
X
sup Ex αt−1 (r(St ) − γ) ≤ 0 .
τ ≥2 t=1

If τ is a F-stopping time then Px (τ = 1) is either zero or one (the stopping rule underlying τ either
stops given S1 = x, or does not stop – the stopping rule cannot inject any further randomness).
From this it follows that
"τ −1 #
X
vg (x) = sup Ex αt−1 (r(St ) − γ)
τ ≥1 t=1
( "τ −1 #)
X
= max 0, sup Ex αt−1 (r(St ) − γ)
τ ≥2 t=1
≤ 0,

98
finishing the proof.

35.7 We want to apply Theorem 35.3. The difficulty is that Theorem 35.3 considers the case where
the reward depends only on the current state, while here the reward accumulates. The solution
is to augment the state space to include the history. We use the convention that if x ∈ S n and
y ∈ S m , then xy ∈ S n+m is the concatenation of x and y. In particular, the ith component of xy is
xi if i ≤ n and yi−n if i > n. We will also denote by x1:n the sequence (x1 , . . . , xn ) formed from
x1 , . . . , xn . Recall that S ∗ is the set of all finite sequences with elements in S and let G ∗ be the
σ-algebra given by

!
[
G =σ

G n
.
n=0

For n ≥ 1, let S n = (S1 , . . . , Sn ). The sequence (S n )∞


n=1 is a Markov chain on the space of finite
sequences (S ∗ , G ∗ ) with probability kernel characterized by
n
Y
Qx1:n (B1 × · · · × Bn+1 ) = Pxn (Bn+1 ) I {xt ∈ Bt } ,
t=1

where B1 , . . . , Bn+1 ∈ G are measurable. Note that for measurable f : S ∗ → R and x1:n ∈ S n ,
Z Z
f (y)Qx1:n (dy) = f (x1:n xn+1 )Pxn (dxn+1 ) .
S∗ S

Now define the G ∗ /B(R)-measurable function u : S ∗ → R by

n−1
X
u(x1:n ) = αt−1 (r(xt ) − γ) .
t=1

Notice that the value of u(x1:n ) does not depend on xn . Let Px1:n be the probability measure
carrying (S n )∞
n=1 for which Px1:n (S = x1:n ) = 1. As usual, let Ex1:n be the expectation with respect
n

to Px1:n . Now let Ut = u(S t ) and define

v̄γ (x1:n ) = sup Ex1:n [Uτ ] , (35.2)


τ ≥n

The definitions ensure that for any x1:n ∈ S ∗ , x ∈ S and γ ∈ R,

vγ (x) = v̄γ (x) and v̄γ (x1:n ) = u(x1:n ) + αn v̄γ (xn ) . (35.3)

In order to apply Theorem 35.3 we need to check the existence and integrability conditions of
(Un )∞
n=1 . By Assumption 35.6, U = limn→∞ Un exists Px1:n -a.s. and supn≥1 Un is Px1:n -integrable
for all x1:n ∈ S ∗ . Then by Theorem 35.3 it follows that
Z
v̄γ (x1:n ) = max{u(x1:n ), v̄γ (x1:n xn+1 )Pxn (dxn+1 )} .
S

99
The proof of Part (a) is completed by noting that u(xy) = r(x) − γ, and so by (35.3),
Z
vγ (x) = v̄γ (x) = max{0, v̄γ (xy)Px (dy)}
S Z
= max{0, r(x) − γ + α vγ (y)Px (dy)} .
S

For Part (b), when γ < g(x) we have vγ (x) > 0 by definition and hence using the previous part it
follows that
Z
vγ (x) = r(x) − γ + α vγ (y)Px (dy) .
S

Note that supx∈S |vγ+δ (x) − vγ (x)| ≤ |δ|/(1 − α) and hence by continuity for γ = g(x) we have
Z
r(x) − γ + α vγ (y)Px (dy) = 0 = vγ (x) .
S

For Part (c), applying Theorem 35.3 again shows that when v̄γ (x) = 0, then τ = min{t ≥ 2 :
v̄γ (S t ) = u(S t )} attains the stopping time in Eq. (35.2) with x1:n = x. Notice finally that by (35.3),
for any x1:n ∈ S ∗ ,

v̄γ (x1:n ) − u(x1:n ) = αn vγ (xn ) ,

which means that τ = min{t ≥ 2 : αt−1 vγ (St ) = 0} = min{t ≥ 2 : g(St ) ≤ γ}, where we used the
fact vγ (x) = 0 ⇔ G(x) ≤ γ.

Chapter 36 Thompson Sampling

36.3 We need to show that

P (A∗ = · | Ft−1 ) = P (At = · | Ft−1 )

holds almost surely. For specificity, let r : E → [k] be the (tie-breaking) rule that chooses the arm
with the highest mean given a bandit environment so that At = r(νt ) and A∗ = r(ν). Recall that
νt ∼ Qt−1 (·) = Q( · | A1 , X1 , . . . , At−1 , Xt−1 ). We have

P (A∗ = i | Ft−1 ) = P (r(ν) = i | Ft−1 ) (definition of A∗ )


= Qt−1 ({x ∈ E : r(x) = i}) (definition of Qt−1 )
= P (r(νt ) = i | Ft−1 ) (definition of νt )
= P (At = i | Ft−1 ) . (definition of At )

100
36.5 We have
X n X
X n
I {At = i} ≤ I {Ti (t) = s, Ti (t − 1) = s − 1, Gi (Ti (t − 1)) > 1/n}
t∈T t=1 s=1
Xn n
X
= I {Gi (s − 1) > 1/n} I {Ti (t) = s, Ti (t − 1) = s − 1}
s=1 t=1
Xn
= I {Gi (s − 1) > 1/n} ,
s=1

where the first equality uses that when At = i, Ti (t) = s and Ti (t − 1) = s − 1 for some s ∈ [n] and
that t ∈ T implies Gi (Ti (t − 1)) > 1/n. The next equality is by algebra, and the last follows because
for any s ∈ [n], there is at most one time point t ∈ [n] such that Ti (t) = s and Ti (t − 1) = s − 1.
For the next inequality, note that
 
X X  
E I {Eic (t)} = E E[I {Eic (t), Gi (Ti (t − 1)) ≤ 1/n} |Ft−1 ]
t∈T
/ t
X  
= E I {Gi (Ti (t − 1)) ≤ 1/n} Gi (Ti (t − 1))
t
X  
≤ E I {Gi (Ti (t − 1)) ≤ 1/n} 1/n
t
 
X
= E 1/n ,
t∈T
/

where the second equality used that I {Gi (Ti (t − 1)) ≤ 1/n} is Ft−1 -measurable and
E[I {Eic (t)} |Ft−1 ] = 1 − P(θi (t) ≤ µ1 − ε|Ft−1 ) = Gi (Ti (t − 1)).

36.6

p
(a) Let f (y) = s/(2π) exp(−sy 2 /2) be the probability density function of a centered Gaussian
Ry
with variance 1/s and F (y) = −∞ f (x)dx be its cumulative distribution function. Then
Z
G1s = f (y + ε)F (y)/(1 − F (y))dy
R
Z ∞ Z 0
≤ f (y + ε)/(1 − F (y)) + 2 f (y + ε)F (y)dy . (36.1)
0 −∞

For the first term in Eq. (36.1), following the hint, we use the following bound on 1 − F (y) for
y ≥ 0:

exp(−sy 2 /2)
1 − F (y) ≥ √ p .
y s + sy 2 + 4

101
Hence
Z ∞ Z ∞ q
f (y + ε) √
dy ≤ f (y + ε) exp(sy 2 /2)(y s + sy 2 + 4)dy
0 1 − F (y) 0
Z ∞ r
√ s
≤ 2 exp(−sε /2) 2
exp(−syε)(y s + 1) dy
0 2π

1+ε s
= 2 2 √ exp(−sε2 /2) .
ε s 2π

For the second term in Eq. (36.1),


Z 0 Z
2 f (y + ε)F (y)dy ≤ 2 f (y + ε)F (y)dy ≤ 2 exp(−sε2 ) .
−∞ R

Summing from s = 1 to ∞ shows that


√ !  

X 1+ε s c 1
2 exp(−sε ) + 2
2
√ exp(−sε /2) ≤ 2 log
2
,
s=1 ε s 2π ε ε

where the last line follows from a Mathematica slog.

(b) Let µ̂is be the empirical mean of arm i after s observations. Then Gis ≤ 1/n if
s
2 log(n)
µ̂is + ≤ µ1 − ε .
s

Hence for s ≥ u = 2 log(n)


(∆i −ε) we have
 s 
2 log(n)
P (Gis > 1/n) ≤ P µ̂is + > µ1 − ε
s
 s 
2 log(n) 
= P µ̂is − µi > ∆i − ε −
s
  q 2 
 s ∆i − ε −
2 log(n)
s 
 
≤ exp − .
 2 

102
Summing,
  q 2 
 s ∆i − ε −
2 log(n)
n
X n
X s 
 
P (Gis > 1/n) ≤ u + exp − 
s=1 s=due
 2 

2 q
≤1+ (log(n) + π log(n) + 1) ,
(∆i − ε)2

where the last inequality follows by bounding the sum by an integral as in the proof of Lemma 8.2.

36.13 Let π be a minimax optimal policy for {0, 1}n×k . Given an arbitrary adversarial bandit
x ∈ [0, 1]n×k . Choose π̃ to be the policy obtained by observing Xt = xtAt and then sampling
X̃t ∼ B(Xt ) and passing X̃t to π. Then

X n Y
Y k
Rn (π̃, x) ≤ xx̃titi (1 − xti )1−x̃ti Rn (π, x̃) ≤ Rn∗ ({0, 1}n×k ) .
x̃∈{0,1}n×k t=1 i=1

Therefore Rn∗ ([0, 1]n×k ) ≤ Rn∗ ({0, 1}n×k ). The other direction is obvious.

Chapter 37 Partial Monitoring

37.3 It suffices to show that y has no component in any direction z that is perpendicular to x.
Let z be such a direction. Without loss of generality either z > 1 = 0 or z > 1 = 1. Assume first
that z > 1 = 0. Take some u ∈ ker0 (x). Then, u + z ∈ ker0 (x) also holds. Since ker0 (x) ⊂ ker0 (y),
z > y = (u + z)> y − u> y = 0. Assume now that z > 1 = 1. Then z ∈ ker0 (x) ⊂ ker0 (y) and hence
z > y = 0.

37.10 An argument that almost works is to choose ω ∈ ri(Cb ) arbitrarily and find the first ν
along the chord connecting ω and λ for which ν ∈ Cb ∩ Cc for some cell c. The minor problem
is that Cb ∩ Cc 6= ∅ does not imply that b and c are neighbours, because there could be a third
non-duplicate Pareto optimal action d with ν ∈ Cb ∩ Cc ∩ Cd . This issue is resolved by making an
ugly dimension argument. Let D be the set of ω ∈ Pd−1 for which at least three non-duplicate
Pareto optimal cells intersect, which has dimension at most dim(D) ≤ d − 3. Since b is Pareto
optimal, dim(ri(Cb )) = d − 1. Meanwhile, the dimension of those ω ∈ ri(Cb ) such that [ω, λ] ∩ D 6= ∅
has dimension at most d − 2. Hence, there exists an ω ∈ ri(Cb ) such that [ω, λ] ∩ D 6= ∅ and for this
choice the initial argument works.

37.12

(a) Assume without loss of generality that Σ = [m]. Given an action c ∈ [k], let Sc be as in the

103
proof of Theorem 37.12 and S ∈ Rkm×d be the matrix obtained by stacking (Sc )kc=1 :
 
S1
 . 
S =  .. 

.
Sk

As in the proof of Theorem 37.12, by the definition of global observability, for any pair of
neighbouring actions a, b, it holds that `a − `b ∈ im(S > ) and hence

`a − `b = S > U (`a − `b ) ,

where U is the Moore–Penrose pseudo-inverse of S > . Then,

kU (`a − `b )k∞ ≤ kU (`a − `b )k2 ≤ kU k2 k`a − `b k2 ≤ d1/2 kU k2 .

The largest singular value of U is the square root of the reciprocal of the smallest non-zero
eigenvalue of S > S ∈ {0, . . . , k}d×d . Let (λi )pi=1 be the non-zero eigenvalues of S > S, in decreasing
order. Recall that for square matrix A, the product of its non-zero eigenvalues is a coefficient of
the characteristic polynomial. Since S > S has entries in {0, . . . , k}, the characteristic equation
has integer coefficients. Since S > S is positive definite, its non-zero eigenvalues are all positive
Q
and it follows that pi=1 λi ≥ 1. If p = 1, then we are done. Suppose that p > 1. By the
arithmetic–geometric mean inequality,
!p−1  p−1
1 trace(S > S)
p−1
Y dk
≤ λi ≤ ≤ ≤ kd .
λp i=1
p−1 p−1

Hence, kU k2 ≤ k d/2 and the result follows.

(b) Repeat the argument above, but restrict S to stacking (Sc )c∈Nab .

(c) For non-degenerate locally observable games, Ne = {a, b} for all e = (a, b) ∈ E. Given
a Pareto optimal action a, let Va = {(a, Φai ) : i ∈ [d]}. Let a, b be neighbouring actions
and f ∈ Eab loc , which exists by the assumption that the game is locally observable. Define

C = {((a, Φai ), (b, Φbi ) : i ∈ [d]} ⊂ Va × Vb , which makes (Va ∪ Vb , C) a bipartite graph. We
define a new function g : [k] × Σ → R. First, let g(c, σ) = 0 whenever σ ∈ / {Φci : i ∈ [d]} or
c∈/ {a, b}. Next, let (Va ∪ Vb , C ) be a connected component. Then, by the conditions on f ,
0 0 0

.
max |f (n0 ) − f (n00 )| ≤ s = 2(m − 1) + 1 .
n0 ,n00 ∈Va0 ∪Vb0

Letting c be the midpoint of the interval that the values of f restricted to Va0 ∪ Vb0 fall into,

104
define

f (n) − c , if n ∈ Va0 ;
f 0 (n) =
f (n) + c , if n ∈ Vb0 .

Clearly we still have f 0 ∈ Eab


loc and f 0 |
Va0 ∪Vb0 takes values in [−s/2, s/2] = [−m, m]. Now repeat
the procedure with f and the next connected component of the graph until all connected
0

components are processed. Let the resulting function be g. Then, g ∈ Eab loc and also kgk
∞ ≤ m.

37.13 Let Σ = {♣, ♥} and `1 = (1, 0, 0, 0 . . . , 0, 0) and `2 = (0, 1, 1, 1, . . . , 1, 1). For all other
actions a > 2 let `a = 1. Hence `1 − `2 = (1, −1, −1, −1, . . . , −1, −1). Let d = 2k − 1. Let Φ ∈ Σk×d
be the matrix such that the first row is ♣ and ♥, alternating and for a > 1, let





♣ if i ≤ 2(a − 1)


 ♥ if i = 1 + 2(a − 1)
Φai =




♣ otherwise and i is odd


♥ otherwise and i is even .

For example, when k = 4, then


 
♣ ♥ ♣ ♥ ♣ ♥ ♣
 
♣ ♣ ♥ ♥ ♣ ♥ ♣
Φ=
♣
.
 ♣ ♣ ♣ ♥ ♥ ♣
♣ ♣ ♣ ♣ ♣ ♣ ♥

Suppose that f ∈ E12


glo
. Then,

k
X k
X
2 = `11 − `21 + `22 − `21 = f (a, Φa1 ) − f (a, Φa2 ) = f (1, ♣) − f (1, ♥) .
a=1 a=1

Furthermore, for a > 1 and i = 2a,


k
X k
X X
f (b, Φbi ) − f (b, Φb,i+1 ) = f (a, ♣) − f (a, ♥) + (f (b, ♥) − f (b, ♣))
b=1 b=1 b<a

Hence,
X
f (a, ♣) − f (a, ♥) = (f (b, ♣) − f (b, ♥)) .
b<a

By induction it follows that f (a, ♣) − f (a, ♥) = 2a−1 for all a > 1. Finally, note that 1 and 2
are non-duplicate and Pareto optimal. Since all other actions are degenerate, these actions are
neighbours. The game is globally observable because all columns have distinct patterns.

105
37.14 For bandit games with Φ = L, let p = q and

f (a, σ)b = σI {a = b} .
P P Pk
We have ka=1 f (a, Φai )b = ka=1 f (a, Lai )b = a=1 Lai I {a = b} = Lbi and thus f ∈ E vec . Then,
using that exp(−x) + x − 1 ≤ x2 /2 for x ≥ 0,
  !
(p − q)> Lei 1 X
k
ηf (a, Φai )
opt∗q (η) ≤ max + 2 pa Ψq
i∈[d] η η a=1 pa
!
1X k k
X qb L2ai I {a = b}
≤ max pa
i∈[d] 2 a=1 b=1 p2a
k
= .
2
The full information game is similar. As before, let p = q, but choose f (a, σ)b = pa Lbσ . We have
Pk Pk Pk
a=1 f (a, Φai )b = a=1 f (a, i)b = a=1 pa Lbi = Lbi and thus f ∈ E
vec . Then, again using that

exp(−x) + x − 1 ≤ x2 /2 for x ≥ 0,

1X k k
X 1
opt∗q (η) ≤ max pa qb L2bi = .
i∈[d] 2 a=1 b=1 2

Chapter 38 Markov Decision Processes

38.2 The solution to Part (b) is immediate from Part (a), so we only show the solution to
Part (a). Abbreviate Pπµ to P and Pπµ to P0 . Let π 0 = (π10 , π20 , . . . ) be the Markov policy to be
0

constructed: For each t ≥ 1, s ∈ S, πt0 (·|s) is a distribution over A.


Fix (s, a) ∈ S × A and consider first t = 1. We want P0 (S1 = s, A1 = a) = P(S1 = s, A1 = a).
By the definition of P0 , P and by the definition of conditional probabilities, we have P0 (S1 = s, A1 =
a) = P0 (A1 = a|S1 = s)P0 (S1 = s) = π10 (a|s)µ(s) = P(S1 = a, A1 = a) = P(A1 = a|S1 = s)µ(s).
Thus, defining π10 (a|s) = P(A1 = a|S1 = s) we see that the desired equality holds for t = 1. Now,
for t > 1 first notice that from P0 (At−1 = a, St−1 = s) = P0 (At−1 = a, St−1 = s), (s, a) ∈ S × A
it follows by summing these equations over a that P0 (St−1 = s) = P0 (St−1 = s). Hence, the same
calculation as for t = 1 applies, showing that πt0 (a|s) = P(At = a|St = s) will work.

38.4 We show that D(M ) < ∞ implies that M is strongly connected and that D(M ) = ∞ implies
that M is not strongly connected. Assume first that D(M ) < ∞. Take any s, s0 ∈ S. Assume first
that s 6= s0 . By definition, there is a policy whose expected travel time from state s to s0 is finite.
Take this policy. It follows that this policy reaches state s0 from state s with positive probability,
because otherwise the expected travel time would be infinite. Formally, if T is the random travel
time of a policy whose expected travel time between s and s0 is finite, {T = ∞} is the event that
the policy does not reach state s0 . Now, for any n ∈ N, T > nI {T = ∞}. Taking expectations and

106
reordering gives E [T ] /n > P (T = ∞). Letting n → ∞, we see that P (T = ∞) = 0 (thus we see
that the policy reaches state s0 in fact with probability one). It remains to consider the case when
s = s0 . If the MDP has a single state, it is strongly connected by definition. Otherwise, there exist a
state s00 ∈ S that is distinct from s = s0 . Since D(M ) is finite, there is a policy that reaches s0 from
s with positive probability and another one that reaches s again from s0 with positive probability.
Compose these two policies the obvious way to find the policy that travels from s to s with positive
probability.
Assume now that D(M ) = ∞, while M is strongly connected (proof by contradiction). Since M
is strongly connected, for any s, s0 , there is a policy that has a positive probability of reaching s0
from s. But this means that the uniformly random policy (the policy which chooses uniformly at
random between the actions at any state) has also a positive probability of reaching any state from
any other state. We claim that the expected travel time of this policy is finite between any pairs of
states. Indeed, this follows by noticing that the states under this policy form a time-homogenous
Markov chain whose transition probability matrix is irreducible and the hitting times in some a
Markov chain, which coincide with the expected travel times in the MDP for the said policy, are
finite. link

38.5 We follow the advice of the hint. For the second part, note that the minimum in the definition
of d∗ (µ0 , U ) is attained when nk is maximised for small indices until |U | is exhausted. In particular,
if (nk )0≤k≤m denotes the optimal solution (nk = 0 for k > m) then n0 = A0 , . . . , nm−1 = Ak ,
P
0 ≤ nm = |U | − n−1 k=0 A (= |U | − A−1 ) < A . Hence, |U | < A + A−1 ≤ 2A , implying that
k Am −1 m m Am −1 m

m ≥ logA (|U |/2). Thus,

m−1
X
d∗ (µ0 , U ) = k Ak + m nm
k=0
m−1
X
= m|U | + (k − m)Ak
k=0
(a) m Am+1 − A
= m|U | + −
A−1 (A − 1)2
 
1
≥ |U | m − 1 −
A−1
≥ |U |(logA (|U |) − 3) ,

where step (a) follows since |U | < AA−1 . Choosing U = S, we see that the expected minimum time
m
−1

to reach a random state in S is lower bounded by logA (S) − 3. The expected minimum time to
reach an arbitrary state in S must also be above this quantity, proving the desired result.

38.7
Pn−1
(a) An 1 = 1
n t=0 P t 1 = 1, which means that An is right stochastic.

(b) This follows immediately from the definitions.

107
(c) Let (Bn ) and (Cn ) be convergent subsequences of (An ) with limn→∞ Bn = B and limn→∞ Cn =
C. It suffices to show that B = C. From Part (b), Bm + n1m (P nm − I) = Bm P = P Bm .
Taking the limit as m tends to infinity and using the fact that P n is [0, 1]-valued we see
that B = BP = P B. Similarly, C = CP = P C and it follows that B = BP i = P i B and
C = CP i = P C i hold for any i ≥ 0. Hence B = BCm = Cm B and C = CBm = Bm C for any
m ≥ 1. Taking limit as m tends to infinity shows that B = BC = CB and C = CB = BC,
which together imply that B = C.

(d) We have already seen in the proof of Part (c) that P ∗ = P ∗ P = P P ∗ . From this, it follows
that P ∗ = P ∗ P i for any i ≥ 0, which implies that P ∗ = P ∗ An holds for any n ≥ 1. Taking limit
shows that P ∗ = P ∗ P ∗ .

(e) Let B = P − P ∗ . By algebra I − B i = (I − B)(I + B + · · · + B i−1 ). Summing over i = 1, . . . , n


and dividing by n and using the fact that B i = P i − P ∗ for all i ≥ 1,

1X n
I− P i + P ∗ = (I − B)Hn , (38.1)
n i=1
P P
where Hn = n1 ni=1 i−1 k=0 (P − P ) . The limit of the left-hand side of (38.1) exists and is equal
∗ k

to the identity matrix I. Hence the limit of the right-hand side also exists and in particular the
limit of Hn must exist. Denoting this by H∞ we find that I = (I − B)H∞ and thus I − B is
invertible and its inverse H is equal to H∞ .
Pn Pi−1
(f) Let Un = 1
n i=1 k=0 (P
k − P ∗ ). Then

I, if k = 0 ;
Bk =
P k − P ∗, otherwise .
P P
Using this we calculate Hn − Un = n1 ni=1 (P − P ∗ )0 − n1 ni=1 (P 0 − P ∗ ) = I − I + P ∗ = P ∗ .
Hence H − limn→∞ Un = P ∗ . From the definition of U we have U = limn→∞ Un .

(g) This follows immediately because

1X n Xi−1
1Xn Xi−1
lim P k (r − ρ) = lim (P k − P ∗ )r = U r .
n→∞ n n→∞ n
i=1 k=0 i=1 k=0

(h) One way to prove this is to note that by the previous part v = U r = (H − P ∗ )r, hence
r = H −1 (v+ρ). Now, H −1 (v+ρ) = (I −P +P ∗ )(v+ρ) = v−P v+P ∗ v+ρ−P ρ+P ∗ ρ = v−P v+ρ,
where we used that P ∗ v = P ∗ U r and P ∗ U = P ∗ (H − P ∗ ) = P ∗ H − P ∗ = (P ∗ − P ∗ H −1 )H = 0
and that P ρ = P P ∗ r = P ∗ r = P ∗ ρ = P ∗ P ∗ r.

Alternatively, the following direct argument also works. In this argument we only use that v
P 1 Pn
is well defined. Let vn = n−1t=0 P (r − ρ), v̄n = n
t
i=1 vi . Note that limn→∞ v̄n = v. Then,

108
vk+1 = P vk + (r − ρ). Taking the average of these over k = 1, . . . , n we get

1
((n + 1)v̄n+1 − v1 ) = P v̄n + (r − ρ) .
n
Taking the limit of both sides proves that v = P v + r − ρ, which, after reordering gives
v + ρ = r + P v.

38.8

(a) First note that | maxx f (x) − maxy g(y)| ≤ maxx |f (x) − g(x)|. Then for v, w ∈ RS ,

kTγ v − Tγ wk∞ ≤ max max γ |hPa (s), v − wi|


s∈S a∈A
≤ max max γkPa (s)k1 kv − wk∞
s∈S a∈A
= γkv − wk∞ .

Hence T is a contraction with respect to the supremum norm as required.

(b) This follows immediately from the Banach fixed point theorem, which also guarantees the
uniqueness of a value function v satisfying v = Tγ v.

(c) Recall that the greedy policy is π(s) = argmaxa ra (s) + γhPa (s), vi. Then

v(s) = max ra (s) + γhPa (s), vi = rπ (s) + γhPπ (s), vi .


a∈A

(d) We have v = rπ + γPπ v. Solving for v completes the result.

(e) If π is a memoryless policy, it is trivial to see that vγπ = rπ + γPπ vγπ . Let π ∗ be the greedy
policy with respect to v, the unique solution of v = Tγ v. By the previous part of this exercise, it
follows that vγπ = v. By Exercise 38.2, it suffices to show that for any Markov policy π, vγπ ≤ v.

P
If πt is the memoryless policy used in time step t when following π, vγπ = ∞ t=1 γ
t−1 P (t−1) r ,
πt
P
where P (0) = I and for t ≥ 1, P (t) = Pπ1 . . . Pπt . For n ≥ 1, let vγ,n
π = n
t=1 γ t−1 P (t−1) r . It is
πt
easy to see that vγ,1
π =r
π1 ≤ T 0. Assume that for some n ≥ 1,

sup π
vγ,n ≤ T n 0, (38.2)
π Markov

Notice that f ≤ g implies T f ≤ T g. Hence, T vγ,n π ≤ T n+1 0. Further, vγ,n+1


π =
rπ0 + γPπ0 vγ,n ≤ T vγ,n , where π is the Markov policy obtained from π by discarding π0
π 0 π 0 0

(that is, if π = (π0 , π1 , π2 , . . . ), π 0 = (π1 , π2 , . . . )). This shows that (38.2) holds for all n ≥ 1.
Letting n → ∞, the right-hand side converges to v, while the left-hand side converges to vγπ .
Hence, vγπ ≤ v.

38.9

109
(a) Let 0 ≤ γ < 1. Algebra gives

X Pγ∗ − (1 − γ)I
Pγ∗ P = (1 − γ) γtP tP = .
t=0
γ

Hence γPγ∗ P = Pγ∗ − (1 − γ)I. It is easy to check that Pγ∗ is right stochastic. By the compactness
of the space of right stochastic matrices, (Pγ∗ )γ has at least one cluster point A as γ → 1−.

It follows that AP = A, which implies that AP ∗ = A. Now, (Pγ∗ )−1 P ∗ = (I−γP
1−γ
)P
= P ∗,
which implies that P = AP = A. Since this holds for any cluster point we conclude that
∗ ∗

limγ→1− Pγ∗ = P ∗ .

(b) Since I − P + P ∗ is invertible and P P ∗ = P ∗ = P ∗ P ∗ , the required claim is equivalent to


!
Pγ∗ − P ∗
lim (I − P + P ) ∗
= I − P∗ .
γ→1− 1−γ
P∞
Rewriting Pγ∗ = (1 − γ) t=0 γ
tP t shows that
! ∞
!
Pγ∗ − P ∗ X
P∗
(I − P + P )∗
= (I − P + P ) ∗
γP − t t
1−γ t=0
1−γ

X 1
= (I − P )γ t P t = (I − Pγ∗ ) .
t=0
γ

The result is completed by taking the limit as γ tends to one from below and using Part (a).

38.10

(a) Let (γ_n) be an arbitrary increasing sequence with γ_n < 1 for all n and lim_{n→∞} γ_n = 1. Let v_n be the fixed point of T_{γ_n} and π_n be the greedy policy with respect to v_n. Since greedy policies are always deterministic and there are only finitely many deterministic policies, it follows that there exists a subsequence n_1 < n_2 < · · · and a policy π such that π_{n_k} = π for all k.

(b) For arbitrary π and 0 ≤ γ < 1, let v_γ^π be the value function of policy π in the γ-discounted MDP, v^π the value function of π and ρ^π its gain. Let U_π be the deviation matrix underlying P_π. Define f_γ^π by

v_γ^π = ρ^π/(1 − γ) + v^π + f_γ^π . (38.3)

By Part (b) of Exercise 38.9, and because ρ^π = P_π^* r_π and v^π = U_π r_π, it holds that ‖f_γ^π‖_∞ → 0 as γ → 1−.
Fix now π to be the policy whose existence is guaranteed by the previous part. By Part (e) of Exercise 38.8, π is γ_n-discount optimal for all n ≥ 1. Suppose that ρ^π is not constant. In any case, ρ^π is constant on each recurrent class of the Markov chain with transition probabilities P_π. Let ρ_*^π = max_{s∈S} ρ^π(s). Let R ⊂ S be the recurrent class of this Markov chain where ρ^π is largest and take a policy π′ that is identical to π over R, while π′ is set up so that it reaches R with probability one. Such a π′ exists because the MDP is strongly connected. Fix any s ∈ S \ R. We claim that there exists some γ^* ∈ (0, 1) such that for all γ ≥ γ^*,

v_γ^{π′}(s) > v_γ^π(s) . (38.4)

If this were true and n were large enough that γ_n ≥ γ^*, then, since π is γ_n-discount optimal, v_{γ_n}^π(s) ≥ v_{γ_n}^{π′}(s) > v_{γ_n}^π(s), which is a contradiction.

Hence, it remains to show (38.4). By the construction of π′, ρ^{π′}(s) = ρ_*^π > ρ^π(s). From (38.3),

v_γ^{π′}(s) = ρ^{π′}(s)/(1 − γ) + v^{π′}(s) + f_γ^{π′}(s) > ρ^π(s)/(1 − γ) + v^π(s) + f_γ^π(s) = v_γ^π(s) ,

where the inequality follows by taking γ ≥ γ^* for a suitable γ^* ∈ (0, 1). The existence of an appropriate γ^* follows because f_γ^π(s), f_γ^{π′}(s) → 0, while 1/(1 − γ) → ∞ as γ → 1−.

(c) Since v is the value function of π and ρ is its gain, by Part (h) of Exercise 38.7 we have ρ1 + v = r_π + P_π v. Let π′ be an arbitrary stationary policy and v_n be as before. Let f_n = f_{γ_n}, where f_γ is defined by (38.3). Note that ‖f_n‖_∞ → 0 as n → ∞. Then,

0 ≥ r_{π′} + (γ_n P_{π′} − I)v_n
  = r_{π′} + (γ_n P_{π′} − I)(ρ1/(1 − γ_n) + v + f_n)
  = r_{π′} − ρ1 + γ_n P_{π′} v − v + (γ_n P_{π′} − I)f_n .

Note that when π′ = π, the first inequality becomes an equality. Taking the limit as n tends to infinity and rearranging shows that

ρ1 + v ≥ r_{π′} + P_{π′} v

and

ρ1 + v = r_π + P_π v . (38.5)

Since π′ was arbitrary, ρ + v(s) ≥ max_a r_a(s) + ⟨P_a(s), v⟩ holds for all s ∈ S. This, combined with (38.5), shows that the pair (ρ, v) satisfies the Bellman optimality equation as required.

38.11 Clearly the optimal policy is to take action stay in any state, and this policy has gain ρ^* = 0. Pick any solution (ρ, v) to the Bellman optimality equations. Then ρ = ρ^* = 0 by Theorem 38.2. The Bellman optimality equation for state 1 is v(1) = max(v(1), −1 + v(2)), which is equivalent to v(1) ≥ −1 + v(2). Similarly, the Bellman optimality equation for state 2 is equivalent to v(2) ≥ −1 + v(1). Thus the set of solutions is a subset of

{(ρ, v) ∈ R × R^2 : ρ = 0, v(1) − 1 ≤ v(2) ≤ v(1) + 1} .

The same argument shows that any element of this set is a solution to the optimality equations.
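The claimed solution set can be verified mechanically. Below is a minimal sketch; the transition and reward structure (stay: reward 0, keep the state; go: reward −1, switch states) is read off from the Bellman equations above.

```python
import numpy as np

# Two-state MDP read off from the Bellman equations above: action "stay"
# keeps the state with reward 0; action "go" switches state with reward -1.
r = {"stay": np.array([0.0, 0.0]), "go": np.array([-1.0, -1.0])}
P = {"stay": np.eye(2), "go": np.array([[0.0, 1.0], [1.0, 0.0]])}

def satisfies_optimality(rho, v, tol=1e-9):
    # Bellman optimality: rho + v(s) = max_a r_a(s) + <P_a(s), v> for all s.
    rhs = np.max([r[a] + P[a] @ v for a in r], axis=0)
    return np.allclose(rho + v, rhs, atol=tol)

# Any (0, v) with |v(1) - v(2)| <= 1 solves the equations; others do not.
print(satisfies_optimality(0.0, np.array([0.0, 0.5])))   # True
print(satisfies_optimality(0.0, np.array([0.0, 2.0])))   # False
```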
38.12 Consider the deterministic MDP below with two states and two actions, A = {solid, dashed}.

[Figure: a two-state deterministic MDP on states 1 and 2, with the four action-arrows labelled by their rewards r = 0 and r = 1.]

Clearly the optimal policy is to choose π(1) = solid and π(2) arbitrarily, which leads to a gain of 1. On the other hand, choosing ρ = 1 and v = (2, 1) satisfies the linear program in Eq. (38.6), and the greedy policy with respect to this value function chooses π(1) = dashed and π(2) arbitrary.
38.13 Let T : R^S → R^S be defined by (T v)(s) = max_a r_a(s) − ρ^* + ⟨P_a(s), v⟩, so the Bellman optimality equation Eq. (38.5) can be written in the compact form v = T v. Let v ∈ R^S be a solution to Eq. (38.5). The proof follows from the definition of the diameter and by showing that for any states s_1, s_2 ∈ S and memoryless policy π it holds that

v(s_2) ≤ v(s_1) + (ρ^* − min_{s,a} r_a(s)) E_π[τ_{s_2} | S_1 = s_1] .

The remainder of the proof is devoted to proving this result for fixed s_1, s_2 ∈ S and memoryless policy π. Abbreviate τ = τ_{s_2} and let E[·] denote the expectation with respect to the measure induced by the interaction of π and the MDP, conditioned on S_1 = s_1. Since the result is trivial when E[τ] = ∞, for the remainder we assume that E[τ] < ∞. Define the operator T̄ : R^S → R^S by

(T̄ u)(s) = min_{s′,a} r_a(s′) − ρ^* + ⟨P_π(s), u⟩ , if s ≠ s_2 ;
(T̄ u)(s) = v(s_2) , otherwise .

Since r_π(s) − ρ^* ≥ min_{s,a} r_a(s) − ρ^* and T v = v, it follows that (T̄ v)(s) ≤ (T v)(s) = v(s). Notice that for u ≤ w it holds that T̄ u ≤ T̄ w. Then by induction we have T̄^n v ≤ v for all n ∈ N^+. By unrolling the recurrence we have

v(s_1) ≥ (T̄^n v)(s_1) = E[−(ρ^* − min_{s,a} r_a(s))(n ∧ τ) + v(S_{τ∧n})] .

Taking the limit as n tends to infinity shows that v(s_1) ≥ v(s_2) − (ρ^* − min_{s,a} r_a(s)) E[τ], which completes the result.
38.14

(a) It is clear that Algorithm 27 returns true if and only if (ρ, v) is feasible for Eq. (38.6). Note that feasibility can be written in the compact form ρ1 + v ≥ T v. It remains to show that when (ρ, v) is not feasible, then u = (1, e_s − P_{a_s^*}(s)) is such that for any feasible (ρ′, v′), ⟨(ρ′, v′), u⟩ > ⟨(ρ, v), u⟩. For this, we have ⟨(ρ′, v′), u⟩ = ρ′ + v′(s) − ⟨P_{a_s^*}(s), v′⟩ ≥ r_{a_s^*}(s), where the inequality used that (ρ′, v′) is feasible. Further, ⟨(ρ, v), u⟩ = ρ + v(s) − ⟨P_{a_s^*}(s), v⟩ < r_{a_s^*}(s), by the construction of u. Putting these together gives the result.

(b) Relax the constraint that v(s̃) = 0 to −ε ≤ v(s̃) ≤ ε. Then add the ε of slack to the first
constraint of Eq. (38.7) and add the additional constraints used in Eq. (38.9). Now the ellipsoid
method can be applied as for Eq. (38.9).
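For concreteness, here is a minimal sketch of the oracle from part (a) (the array layout is hypothetical; the reference is the book's Algorithm 27): it checks ρ + v(s) ≥ max_a r_a(s) + ⟨P_a(s), v⟩ at every state and, on a violation, returns u = (1, e_s − P_{a_s^*}(s)). An ellipsoid method can then cut with the returned hyperplane, as in part (b).

```python
import numpy as np

def separation_oracle(rho, v, r, P):
    """Feasibility oracle for the Bellman LP.

    r has shape (A, S) and P has shape (A, S, S) (hypothetical layout).
    Returns None if (rho, v) is feasible; otherwise a vector u such that
    <(rho', v'), u> > <(rho, v), u> for every feasible (rho', v').
    """
    A, S = r.shape
    for s in range(S):
        gains = r[:, s] + P[:, s, :] @ v   # r_a(s) + <P_a(s), v> for each a
        a_star = int(np.argmax(gains))     # the most violated action a*_s
        if rho + v[s] < gains[a_star]:
            e_s = np.zeros(S)
            e_s[s] = 1.0
            return np.concatenate(([1.0], e_s - P[a_star, s]))
    return None  # (rho, v) satisfies every constraint
```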

38.15 Let φ_k(x) = true if ⟨a_k, x⟩ ≥ b_k and φ_k(x) = a_k otherwise. Then the new separation oracle returns true if φ(x) and φ_k(x) are true for all k. Otherwise it returns the separating hyperplane provided by some φ or φ_k that did not return true.
38.16

(a) Let π = (π_1, π_2, . . .) be an arbitrary Markov policy, where π_t is the policy followed in time step t. Using the notation and techniques from the proof of Theorem 38.2,

P_π^{(t−1)} r_{π_t} = P_π^{(t−1)} (r_{π_t} + P_{π_t} v − P_{π_t} v) ≤ P_π^{(t−1)} (r_{π̃} + P_{π̃} v − P_{π_t} v) ≤ P_π^{(t−1)} ((ρ + ε)1 + v − P_{π_t} v) = (ρ + ε)1 + P_π^{(t−1)} v − P_π^{(t)} v .

Taking the average and then the limit shows that ρ̄^π(s) ≤ ρ + ε for all s ∈ S. By the claim in Exercise 38.2, ρ^* ≤ ρ + ε.

(b) We have r_{π̃}(s) + ⟨P_{π̃(s)}(s), v⟩ ≥ max_a r_a(s) + ⟨P_a(s), v⟩ − ε′ ≥ ρ + v(s) − (ε + ε′). Therefore,

P_{π̃}^{t−1} r_{π̃} = P_{π̃}^{t−1} (r_{π̃} + P_{π̃} v − P_{π̃} v) ≥ P_{π̃}^{t−1} ((ρ − (ε + ε′))1 + v − P_{π̃} v) .

Taking the average and the limit again shows that ρ^{π̃}(s) ≥ ρ − (ε + ε′). The claim follows from combining this with the previous result.

(c) Let ε = v + ρ1 − T v, which by the first constraint satisfies ε ≥ 0. Let π^* be an optimal policy satisfying the requirements of the theorem statement and π be the greedy policy with respect to v. Then

P_{π^*}^t r_{π^*} ≤ P_{π^*}^t (r_π + P_π v − P_{π^*} v) = P_{π^*}^t (ρ1 + v − ε − P_{π^*} v) .

Hence ρ^* 1 = ρ^{π^*} 1 ≤ ρ1 − P_{π^*}^* ε, which means that P_{π^*}^* ε ≤ δ1. By the definition of s̃ there exists a state s with P_{π^*}^*(s, s̃) ≥ 1/|S|. Therefore

δ ≥ (P_{π^*}^* ε)(s) = ∑_{s′∈S} P_{π^*}^*(s, s′) ε(s′) ≥ P_{π^*}^*(s, s̃) ε(s̃)

and so ε(s̃) ≤ δ|S|. Notice that ṽ = v − ε + ε(s̃)1 also satisfies the constraints in Eq. (38.7) and hence

⟨v − ε + ε(s̃)1, 1⟩ ≥ ⟨v, 1⟩ ,

which implies that ⟨ε, 1⟩ ≤ |S|ε(s̃) ≤ |S|^2 δ. Hence (ρ, v) approximately satisfies the Bellman optimality equation.

38.17 Define the operator T : R^S → R^S by (T u)(s) = max_{a∈A} r_a(s) + ⟨P_a(s), u⟩, which is chosen so that v_n^* = T^n 0. Let v be a solution to the Bellman optimality equation with min_s v(s) = 0. Then T v = ρ^* 1 + v and

v_n^* = T^n 0 ≤ T^n v = nρ^* 1 + v ≤ nρ^* 1 + D1 ,

where the last inequality follows from the previous exercise and the assumption that min_s v(s) = 0.
38.19
(a) There are four memoryless policies in this MDP. All are optimal except the policy π that always
chooses the dashed action.

(b) The optimistic rewards are given by

r̃_{k,stay}(s) = 1/2 + √(L / (2(1 ∨ T_{k−1}(s, stay)))) ,
r̃_{k,go}(s) = √(L / (2(1 ∨ T_{k−1}(s, go)))) .

Whenever T_{k−1}(s, a) ≥ 1, the transition estimates satisfy P̂_{k−1,a}(s) = P_a(s). Let S_t′ be the state with S_t′ ≠ S_t. Suppose that T_{k−1}(S_t, stay) > T_{k−1}(S_t′, stay) and r̃_{k,stay}(S_t) < 1. Then r̃_{k,stay}(S_t′) > r̃_{k,stay}(S_t). Once this occurs, the optimal policy in the optimistic MDP is to choose action go. It follows easily that once t is sufficiently large, the algorithm will alternate between choosing actions go and stay and subsequently suffer linear regret. Note that the uncertainty in the transitions does not play a big role here: in the optimistic MDP they will always be chosen to maximise the probability of transitioning to the state with the largest optimistic reward.
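To see the mechanism, here is a minimal sketch of how the optimistic reward bonus decays with the visit count (the value of L and the counts are hypothetical):

```python
import math

def r_tilde(mean_reward, L, count):
    # Optimistic reward: mean plus a Hoeffding-style exploration bonus.
    return mean_reward + math.sqrt(L / (2 * max(1, count)))

L = 4.0
for count in [1, 10, 100, 1000]:
    # In this construction "stay" has mean 1/2 and "go" has mean 0.
    print(count, r_tilde(0.5, L, count), r_tilde(0.0, L, count))
```

Once the less-visited state's stay-bonus exceeds that of the current state, the optimistic MDP prefers go, producing the oscillation described above.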

38.21 Here we abuse notation by letting P̂_{u,a}(s) be the empirical next-state transitions after u visits to state-action pair (s, a). By a union bound and the result in Exercise 5.17,

P(F) ≤ P(∃u ∈ N, (s, a) ∈ S × A : ‖P_a(s) − P̂_{u,a}(s)‖_1 ≥ √(2S log(4SAu(u + 1)/δ) / u))
  ≤ ∑_{(s,a)∈S×A} ∑_{u=1}^∞ P(‖P_a(s) − P̂_{u,a}(s)‖_1 ≥ √(2S log(4SAu(u + 1)/δ) / u))
  ≤ ∑_{(s,a)∈S×A} ∑_{u=1}^∞ δ/(2SAu(u + 1)) = δ/2 ,

where the last equality uses ∑_{u=1}^∞ 1/(u(u + 1)) = 1.

The statement in Exercise 5.17 makes an independence assumption that is not exactly satisfied here.
We are saved by the Markov property, which provides the conditional independence required.

38.22 If ∑_{k=1}^{m−1} a_k ≤ 1, then A_0 = A_1 = · · · = A_{m−1} = 1. Hence a_m ≤ 1 also holds, and using A_m ≥ 1,

∑_{k=1}^m a_k/√(A_{k−1}) = ∑_{k=1}^{m−1} a_k + a_m ≤ 1 + 1 ≤ (√2 + 1)√(A_m) .

Let us now use induction on m. As long as m is such that ∑_{k=1}^{m−1} a_k ≤ 1, the previous argument covers us. Thus, consider any m > 1 such that ∑_{k=1}^{m−1} a_k > 1 and assume that the statement holds for m − 1 (note that m = 1 implies ∑_{k=1}^{m−1} a_k ≤ 1). Let c = √2 + 1. Then,

∑_{k=1}^m a_k/√(A_{k−1})
  ≤ c√(A_{m−1}) + a_m/√(A_{m−1})       (split sum, induction hypothesis)
  = √(c^2 A_{m−1} + 2c a_m + a_m^2/A_{m−1})
  ≤ √(c^2 A_{m−1} + (2c + 1)a_m)       (a_m ≤ A_{m−1})
  = c√(A_{m−1} + a_m)                  (choice of c, so that c^2 = 2c + 1)
  = c√(A_m) .                          (A_{m−1} ≥ 1 and definition of A_m)
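The inequality can be spot-checked numerically, as a sketch under the assumptions used above (a_k ∈ [0, 1], A_0 = 1 and A_k = 1 ∨ (a_1 + · · · + a_k)):

```python
import math
import random

def check(a):
    # Verifies sum_{k=1}^m a_k / sqrt(A_{k-1}) <= (sqrt(2) + 1) sqrt(A_m).
    lhs, total, A_prev = 0.0, 0.0, 1.0
    for a_k in a:
        lhs += a_k / math.sqrt(A_prev)
        total += a_k
        A_prev = max(1.0, total)   # A_k = 1 v (a_1 + ... + a_k)
    return lhs <= (math.sqrt(2) + 1) * math.sqrt(A_prev)

random.seed(0)
print(all(check([random.random() for _ in range(200)]) for _ in range(1000)))
# True
```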

38.23 We only outline the necessary changes to the proof of Theorem 38.6. The first step is to augment the failure event to include the event that there exists a phase k and state-action pair (s, a) such that

|r̃_{k,a}(s) − r_a(s)| ≥ √(2L / (1 ∨ T_{τ_k−1}(s, a))) .

The likelihood of this event is at most δ/2 by Hoeffding's bound combined with a union bound. As in the proof of Theorem 38.6, we now restrict our attention to the regret on the event that the failure does not occur. The first change is that r_{π_k}(s) in Eq. (38.18) must be replaced with r̃_{k,π_k}(s). Then the reward terms no longer cancel in Eq. (38.20), which means that now
R̃_k = ∑_{t∈E_k} (−v_k(S_t) + ⟨P_{k,A_t}(S_t), v_k⟩ + r̃_{k,A_t}(S_t) − r_{A_t}(S_t))
  ≤ ∑_{t∈E_k} (⟨P_{A_t}(S_t), v_k⟩ − v_k(S_t)) + (D/2) ∑_{t∈E_k} ‖P_{k,A_t}(S_t) − P_{A_t}(S_t)‖_1 + ∑_{t∈E_k} √(2L / (1 ∨ T_{τ_k−1}(S_t, A_t))) .

The first two terms are the same as in the proof of Theorem 38.6 and are bounded in the same way, which results in the same contribution to the regret. Only the last term is new. Summing over all phases and applying the result from Exercise 38.22 and Cauchy-Schwarz,

∑_{k=1}^K ∑_{t∈E_k} √(2L / (1 ∨ T_{τ_k−1}(S_t, A_t))) = √(2L) ∑_{s∈S} ∑_{a∈A} ∑_{k=1}^K T_{(k)}(s, a)/√(1 ∨ T_{τ_k−1}(s, a))
  ≤ (√2 + 1)√(2LSAn) .

This term is small relative to the contribution due to the uncertainty in the transitions. Hence there exists a universal constant C such that with probability at least 1 − 3δ/2 the regret of this modified algorithm is at most

R̂_n ≤ C D(M) S √(nA log(nSA/δ)) .

38.24

(a) An easy calculation shows that the depth d of the tree is bounded by 2 + log_A S, which by the conditions in the statement of the lower bound implies that d + 1 ≤ D/2. The diameter is the maximum over all distinct pairs of states of the expected travel time between those two states. It is not hard to see that this is maximised by the pair s_g and s_b, so we restrict our attention to bounding the expected travel time between these two states under some policy. Let τ = min{t : S_t = s_b} and let π be a policy that traverses the tree to a decision state with ε(s, a) = 0. We will show that for this policy

E[τ | S_1 = s_g] ≤ D .

Let X_1, X_2, . . . be a sequence of random variables where X_i ∈ N^+ is the number of rounds until the policy leaves state s_g on the ith series of visits to s_g. Then let M be the number of visits to state s_g before s_b is reached. All of these random variables are independent and geometrically distributed. An easy calculation shows that

E[X_i] = 1/δ and E[M] = 2 .

Then τ = ∑_{i=1}^M (X_i + d + 1), which has expectation E[τ] = 2(1/δ + d + 1) ≤ 2/δ + D/2 ≤ D.

(b) The definition of the stopping time τ ensures that T_σ ≤ n/D + 1 ≤ 2n/D almost surely, and hence D E[T_σ]/n ≤ 2 is immediate. For the second part note that

P(∑_{t=1}^{n/(2D)} D_t ≥ n/D)

(c) We need to prove that R_n^j ≥ c_3 ∆D E_j[T_σ − T_j], where R_n^j is the expected regret of π in MDP M_j over n rounds and c_3 > 0 is a universal constant. The idea is to write the total reward incurred using episodes punctuated by visits to s_0 and note that the expected lengths of these episodes are the same regardless of the policy used.

For the formal argument, we start by rewriting the expected regret R_n^j in a more suitable form. For this we introduce episodes indexed by k ∈ [n]. The kth episode starts at the time τ_k when s_0 is visited the kth time (in particular, τ_1 = 0). In the kth episode, a leaf is reached in time step t = τ_k + d. Let E_k be the indicator that the state-action pair of this time step is not the jth state-action pair in L = {(s_1, a_1), . . . , (s_p, a_p)}: E_k = I{(S_{τ_k+d}, A_{τ_k+d}) ≠ (s_j, a_j)}. Note that this is an indicator of a "regret-inducing choice".

In time steps T_k = {τ_k + d + 1, . . . , τ_{k+1}}, the state is one of s_g and s_b. Let H_k = |T_k ∩ [n]| be the number of time steps in T_k that happen before round n is over. For clarity, we add the subscript π to E_j and P_j to make it explicit that these depend on π. Further, in MDP M_j, let the values of ε(s, a) be denoted by ε_j(s, a).
By construction, the total expected reward incurred up to time step n by policy π is
V_j^π := ∑_{k=1}^n E_{j,π}[H_k I{S_{τ_k+d+1} = s_g}]
  = ∑_{k=1}^n ∑_{(s,a)∈L} E_{j,π}[H_k I{S_{τ_k+d+1} = s_g} | S_{τ_k+d} = s, A_{τ_k+d} = a] P_{j,π}(S_{τ_k+d} = s, A_{τ_k+d} = a)   (by the law of total probability)
  = ∑_{k=1}^n ∑_{(s,a)∈L} E_{j,π}[H_k] P_{j,π}(S_{τ_k+d+1} = s_g | S_{τ_k+d} = s, A_{τ_k+d} = a) P_{j,π}(S_{τ_k+d} = s, A_{τ_k+d} = a)   (conditioning, Markov property)
  = ∑_{k=1}^n ∑_{(s,a)∈L} E_{j,π}[H_k] (1/2 + ε_j(s, a)) P_{j,π}(S_{τ_k+d} = s, A_{τ_k+d} = a)   (definition of M_j)
  = (1/2) ∑_{k=1}^n E_{j,π}[H_k] ∑_{p≠j} P_{j,π}(S_{τ_k+d} = s_p, A_{τ_k+d} = a_p) + (1/2 + ∆) ∑_{k=1}^n E_{j,π}[H_k] P_{j,π}(S_{τ_k+d} = s_j, A_{τ_k+d} = a_j) .

Now, note that E_{j,π}[H_k] = λ_k, regardless of policy π and index j. If π_j^* is the optimal policy for MDP M_j, then P_{j,π_j^*}(S_{τ_k+d} = s_j, A_{τ_k+d} = a_j) = 1. Let ρ_k^π = P_{j,π}((S_{τ_k+d}, A_{τ_k+d}) ≠ (s_j, a_j)). Hence,

R_n^j = V_j^{π_j^*} − V_j^π
  = (1/2 + ∆) ∑_{k=1}^n λ_k − (1/2) ∑_{k=1}^n λ_k ρ_k^π − (1/2 + ∆) ∑_{k=1}^n λ_k (1 − ρ_k^π)
  = ∑_{k=1}^n λ_k ((1/2 + ∆) − (1/2)ρ_k^π − (1/2 + ∆)(1 − ρ_k^π))
  = ∆ ∑_{k=1}^n λ_k ρ_k^π ≥ ∆ ∑_{k=1}^{m−1} λ_k ρ_k^π ≥ c∆D ∑_{k=1}^{m−1} ρ_k^π ,

where m = ⌈n/D − 1⌉ and the last inequality uses that λ_k ≥ cD for k ∈ [m − 1] with some universal constant c > 0, the proof of which is left to the reader. Now, by definition,
E_{j,π}[T_σ − T_j] = ∑_{k=1}^n P_{j,π}((S_{τ_k+d}, A_{τ_k+d}) ≠ (s_j, a_j), τ_k + d < τ)
  ≤ ∑_{k=1}^{m−1} P_{j,π}((S_{τ_k+d}, A_{τ_k+d}) ≠ (s_j, a_j))
  = ∑_{k=1}^{m−1} ρ_k^π ,

where the inequality holds because τ = n ∧ τ_m and thus for k ≥ m, τ_k + d ≥ τ. Putting together the last two inequalities finishes the proof.


