Review
On Lower Bounds for Statistical Learning Theory
Po-Ling Loh
Department of Electrical and Computer Engineering, University of Wisconsin-Madison, 1415 Engineering Drive,
Madison, WI 53706, USA; [email protected]; Tel.: +1-443-968-5029
Abstract: In recent years, tools from information theory have played an increasingly prevalent
role in statistical machine learning. In addition to developing efficient, computationally feasible
algorithms for analyzing complex datasets, it is of theoretical importance to determine whether
such algorithms are “optimal” in the sense that no other algorithm can lead to smaller statistical
error. This paper provides a survey of various techniques used to derive information-theoretic lower
bounds for estimation and learning. We focus on the settings of parameter and function estimation,
community recovery, and online learning for multi-armed bandits. A common theme is that lower
bounds are established by relating the statistical learning problem to a channel decoding problem,
for which lower bounds may be derived involving information-theoretic quantities such as the mutual
information, total variation distance, and Kullback–Leibler divergence. We close by discussing the
use of information-theoretic quantities to measure independence in machine learning applications
ranging from causality to medical imaging, and mention techniques for estimating these quantities
efficiently in a data-driven manner.
Keywords: machine learning; minimax estimation; community recovery; online learning; multi-armed
bandits; channel decoding; threshold phenomena
1. Introduction
Statistical learning theory refers to the rigorous mathematical analysis of machine learning
algorithms [1,2]. On one hand, it is desirable to derive error bounds for the performance of particular
machine learning algorithms under appropriate assumptions on the probabilistic models used to
generate the data. On the other hand, it is important to understand the fundamental limitations
of any algorithmic procedure, which may be influenced by quantities such as the sample size,
signal-to-noise ratio, or smoothness of an ambient function space. Whereas statistical techniques based
on concentration inequalities and empirical process theory may often be employed to derive rates
of convergence of specific estimators to the underlying parameters of a data-generating distribution,
the somewhat trickier problem of quantifying the best possible performance of any learning procedure
requires tools from information theory.
A general approach is to relate the machine learning task at hand to an appropriate channel
decoding problem, where the output corresponds to the observed data and the input corresponds to a
cleverly constructed subset of the parameter space. For estimation problems, the key observation is
that, if the underlying parameters may be estimated closely (i.e., on the level of discretization of the
subset of parameter space), decoding may be performed accurately with high probability. The hardness
of the decoding problem may in turn be quantified using techniques in information theory [3], leading
to a lower bound on the estimation error. This strategy has been applied successfully to a diverse
array of statistical estimation problems, including parametric and nonparametric regression, structure
estimation for graphical models, covariance matrix estimation, and dimension reduction methods
such as principal component analysis [4–9]. Section 2 discusses the method and several illustrative
examples in greater detail.
Although some classes of machine learning problems may not be analyzed directly using these
methods, alternative approaches involving related information-theoretic concepts may be employed.
In Sections 3 and 4, we consider the problems of community recovery and online learning, which are
both active areas of research in machine learning. Our discussion of weak recovery in the community
estimation setting is similar to the framework described in Section 2, but since the loss function
used to quantify the estimation error incurred by the algorithm is more complicated, a more careful
analysis must be conducted to derive sharp lower bounds. The theory characterizing the regimes
in which exact recovery is possible is of a somewhat different flavor, but the emergence of sharp
thresholds may again be related to Shannon coding theory. Section 4, concerning online learning for
multi-armed bandits, provides a still different setting, where the goal is to bound a quantity known
as regret. Although this is a radically different goal from bounding estimation error, the techniques
used to obtain lower bounds for multi-armed bandits nonetheless include components of reductions
to channel decoding problems: The key is to relate the performance of a learning algorithm to a
problem of distinguishing between pairs of parameter assignments corresponding to underlying
reward distributions that are close in parameter space.
We include proof sketches for the stated theorems in the main text of the paper, with references
to resources where the reader can find more detailed proofs and additional background material.
Although the discussion of each problem setting is necessarily brief, given the broad scope of this paper,
we hope that our survey will convey the high-level ideas involved in applying information-theoretic
tools to derive lower bounds for some statistical machine learning problems in a clear, concise manner.
We have intentionally selected a diverse variety of problem settings in order to help the reader compare
and contrast different approaches for obtaining lower bounds and identify the common threads
underlying all the strategies.
2. Statistical Estimation
We begin by discussing an approach based on minimax theory for statistical estimation
problems [10]. Our goal is a lower bound on the following quantity, known as the minimax risk:
$$\inf_{\hat{\theta}}\ \sup_{P \in \mathcal{P}}\ \mathbb{E}_{X \sim P}\bigl[\ell(\hat{\theta}(X), \theta(P))\bigr], \qquad (1)$$
where ℓ is a symmetric loss function. Here, P denotes a class of data-generating distributions and
θ : P → Ω is a functional that maps each distribution in P to a parameter in the metric space Ω.
The expectation in expression (1) is taken with respect to data from a particular distribution P ∈ P ,
and the infimum is then taken over all possible estimators θ̂ = θ̂(X) computed from the data. In other
words, quantity (1) captures the worst-case risk of the best possible estimator. Whereas statistical
analysis of a specific estimator can provide an upper bound on the minimax risk, tools from information
theory may be used to derive a lower bound on the same quantity. Throughout this section, we will
restrict our attention to the setting where ℓ = Φ ∘ ρ, for a metric ρ and a monotonically increasing
function Φ : [0, ∞) → [0, ∞). For instance, Example 2 below will discuss the setting where ρ is the
L_2-distance in a function space and Φ(t) = t², so ℓ is the squared L_2-distance.
The basic idea is to transform an estimation problem into a decoding problem, in which we
wish to infer the correct message from a discrete set of messages, corresponding to a collection of
parameters. The estimation problem must be at least as hard as the decoding problem, since, if the
parameters in the discrete set are appropriately separated, accurate parameter estimation implies
accurate decoding. In Section 2.1, we present a general technique based on Fano’s inequality, which
expresses the probability of error for the decoding in terms of the mutual information between the
input (parameters in the discrete subset) and output (observed data). Sections 2.2 and 2.3 then
provide methods for bounding the mutual information and discuss applications to concrete statistical
estimation settings. We will follow the convention of Cover and Thomas [3] and take all logarithms
with respect to base 2 in our definitions of entropy and mutual information; analogous results hold
when logarithms are taken with respect to base e.
The main result relates the minimax risk to the mutual information between the observations and the
data-generating distribution.
Theorem 1. Let {P_1, …, P_M} ⊆ P be a collection of distributions whose parameters are 2δ-separated, i.e., ρ(θ(P_i), θ(P_j)) ≥ 2δ for all i ≠ j. Then
$$\inf_{\hat{\theta}}\ \sup_{P \in \mathcal{P}}\ \mathbb{E}_{X \sim P}\bigl[\ell(\hat{\theta}(X), \theta(P))\bigr] \ \ge\ \Phi(\delta)\left(1 - \frac{I(Y; X) + 1}{\log_2 M}\right),$$
where Y is distributed uniformly on {1, …, M} and the conditional distribution of X given Y is defined by
$$X \mid \{Y = j\} \sim P_j. \qquad (2)$$
Proof (sketch). First, the worst-case risk is lower-bounded by the average risk over the finite collection:
$$\sup_{P \in \mathcal{P}}\ \mathbb{E}_{X \sim P}\bigl[\ell(\hat{\theta}(X), \theta(P))\bigr] \ \ge\ \frac{1}{M} \sum_{i=1}^{M} \mathbb{E}_{X \sim P_i}\bigl[\ell(\hat{\theta}(X), \theta(P_i))\bigr]. \qquad (3)$$
Next, define the minimum-distance decoder ψ(X) = argmin_{1≤j≤M} ρ(θ̂(X), θ(P_j)). Then
$$\mathbb{E}_{X \sim P_i}\bigl[\ell(\hat{\theta}(X), \theta(P_i))\bigr] \ \overset{(a)}{\ge}\ \Phi(\delta)\, P_i\Bigl(\ell(\hat{\theta}(X), \theta(P_i)) \ge \Phi(\delta)\Bigr) \ \overset{(b)}{\ge}\ \Phi(\delta)\, P_i\bigl(\psi(X) \neq i\bigr),$$
for each 1 ≤ i ≤ M. Inequality (a) is a direct application of Markov's inequality, and inequality (b)
follows from the fact that, if ℓ(θ̂, θ_i) < Φ(δ), or equivalently, ρ(θ̂, θ_i) < δ, then the 2δ-separation of
the parameters forces ρ(θ̂, θ_j) > δ for all j ≠ i, implying that ψ(X) = i.
Now, recall the statement of Fano's inequality:
Lemma 1 (Fano's inequality [3]). For any estimator Ŷ of Y such that Y → X → Ŷ forms a Markov chain,
it holds that
$$P(\hat{Y} \neq Y) \ \ge\ \frac{H(Y \mid X) - 1}{\log_2 |\mathcal{Y}|},$$
where 𝒴 is the range of Y.
Applying Lemma 1 to the decoder ψ, we obtain
$$\frac{1}{M} \sum_{i=1}^{M} P_i\bigl(\psi(X) \neq i\bigr) \ \ge\ \frac{H(Y \mid X) - 1}{\log_2 M} \ =\ \frac{\log_2 M - I(Y; X) - 1}{\log_2 M}, \qquad (4)$$
where the equality follows from relation (2) and the fact that Y has a uniform distribution. Combining
inequalities (3) and (4) establishes the desired result.
In the following subsections, we describe two methods for upper-bounding the mutual
information term I(Y; X) appearing in Theorem 1, yielding a lower bound on the minimax risk.
The first is the following bound:
Lemma 2. Under the setup of Theorem 1, we have
$$I(Y; X) \ \le\ \frac{1}{M^2} \sum_{1 \le i, j \le M} D_{KL}(P_i \,\|\, P_j).$$
Proof (sketch). One can check that
$$I(Y; X) \ =\ \frac{1}{M} \sum_{i=1}^{M} D_{KL}(P_i \,\|\, \bar{P}),$$
where $\bar{P} = \frac{1}{M} \sum_{j=1}^{M} P_j$ is a mixture distribution. By the convexity of the KL divergence, we then have
$$I(Y; X) \ \le\ \frac{1}{M} \sum_{i=1}^{M} \frac{1}{M} \sum_{j=1}^{M} D_{KL}(P_i \,\|\, P_j),$$
which is the desired bound.
This bounding technique is known as a “local packing,” since the trick is to design an appropriate
set {P_1, …, P_M} such that the parameters θ(P_i) are 2δ-separated, while the pairwise KL divergences
between the data-generating distributions are relatively small.
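To make the local packing technique concrete, the following minimal sketch evaluates the bound of Theorem 1 for a toy Gaussian location family, upper-bounding I(Y; X) via Lemma 2. The family, the grid of parameters, and all constants are our own illustrative choices rather than an example drawn from the references.

```python
import numpy as np

# Local-packing illustration of Theorem 1 + Lemma 2 for a toy Gaussian
# location family: P_i is the law of n i.i.d. N(theta_i, sigma^2) draws.
# The family, grid, and constants below are our own illustrative choices.

def fano_lower_bound(thetas, delta, n, sigma, Phi=lambda t: t):
    M = len(thetas)
    # Pairwise KL divergences for n i.i.d. Gaussian observations (in nats):
    # D_KL(P_i || P_j) = n (theta_i - theta_j)^2 / (2 sigma^2).
    diffs = np.subtract.outer(thetas, thetas)
    kl = n * diffs**2 / (2 * sigma**2)
    I_bound = kl.mean() / np.log(2)            # Lemma 2 bound, converted to bits
    return Phi(delta) * max(0.0, 1 - (I_bound + 1) / np.log2(M))

n, sigma, M = 100, 1.0, 8
delta = 0.2 * sigma / np.sqrt(n)               # packing at the estimation scale
thetas = 2 * delta * np.arange(M)              # 2*delta-separated parameters
print(fano_lower_bound(thetas, delta, n, sigma))  # a nontrivial (positive) bound
```

Because the grid lives at the estimation scale δ ≍ σ/√n, the pairwise KL divergences remain O(1) and the Fano factor stays bounded away from 0; spreading the same number of points farther apart inflates the mutual information term and drives the bound to 0.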
Example 1 (High-dimensional linear regression). Suppose we have observation pairs {(x_i, y_i)}_{i=1}^n from a
linear model:
$$y_i = x_i^T \beta^* + w_i,$$
where x_i ∈ R^p and w_i ∼ N(0, σ²) is i.i.d. noise, and β* ∈ R^p is the unknown parameter vector. We assume
that p > n, but β* is known to have at most s nonzero values, where s ≤ n. More precisely, if B_q(r) denotes the
ball of radius r in the ℓ_q norm, we are interested in characterizing the minimax risk over the parameter space
B_0(s) ∩ B_2(1).
For any fixed parameter δ > 0, it is possible to construct a subset of parameters {β_1, …, β_M} lying in
the parameter space such that δ ≤ ∥β_j − β_k∥_2 ≤ 2√2 δ for all 1 ≤ j < k ≤ M and log M ≥ (s/2) log((p − s)/(s/2)),
essentially by rescaling a packing of the subset of {−1, 0, 1}^p of s-sparse vectors such that the Hamming distance
between any two elements is at least s/2 [4,11]. Furthermore, we may compute the pairwise KL divergences in
terms of the squared ℓ_2-norm between parameter vectors, so
$$D_{KL}(P_j \,\|\, P_k) \ =\ \frac{1}{2\sigma^2} \bigl\|X(\beta_j - \beta_k)\bigr\|_2^2 \ \le\ \frac{4 n \delta^2 \gamma_{2s}^2}{\sigma^2},$$
where $\gamma_{2s} = \sup_{\beta \in B_0(2s)} \frac{\|X\beta\|_2}{\sqrt{n}\,\|\beta\|_2}$. Note that P_j and P_k refer to the conditional distributions of the y_i's given the
x_i's for this example, so we are assuming the design matrix is fixed. Applying Theorem 1 and Lemma 2 with ρ
equal to the ℓ_2-distance and Φ equal to the identity, we therefore have
equal to the `2 -distance and Φ equal to the identity, we therefore have
nδ2 γ2s
2
h i δ σ2
−1
inf sup E k βb − βk2 ≥ 1− .
p−s
βb β∈B0 (s)∩B2 (1) 2 s
log
2 s/2
Taking δ² ≍ σ²s log((p − s)/(s/2)) / (γ_{2s}² n) and assuming that the problem dimensions satisfy n ≥ Cs log p, we then obtain a lower
bound of the form
$$\inf_{\hat{\beta}}\ \sup_{\beta \in B_0(s) \cap B_2(1)} \mathbb{E}\bigl[\|\hat{\beta} - \beta\|_2\bigr] \ \ge\ \frac{\delta}{4} \ \ge\ \frac{c\,\sigma}{\gamma_{2s}} \sqrt{\frac{s}{n} \log \frac{p - s}{s/2}}.$$
In the case of the ℓ_2-loss, the Lasso estimator achieves the risk expression in the lower bound (up to constant
factors), implying that it is a rate-optimal estimator [4]. Similar bounds on the minimax risk may be derived
when the norms appearing in the loss function and/or parameter space are replaced by a general ℓ_q-norm [4,12].
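The packing underlying Example 1 can also be built explicitly. The sketch below uses a randomized rejection-sampling construction of our own (rather than the combinatorial argument cited from [11]) to collect s-sparse sign vectors with pairwise Hamming distance at least s/2, and then verifies the separation δ ≤ ∥β_j − β_k∥_2 ≤ 2√2 δ after rescaling by δ√(2/s).

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)

def sparse_packing(p, s, target_size, max_tries=20000):
    """Greedily collect s-sparse {-1,0,+1}^p vectors with pairwise
    Hamming distance >= s/2 (rejection sampling; illustrative only)."""
    pack = []
    for _ in range(max_tries):
        beta = np.zeros(p)
        support = rng.choice(p, size=s, replace=False)
        beta[support] = rng.choice([-1.0, 1.0], size=s)
        if all(np.sum(beta != other) >= s / 2 for other in pack):
            pack.append(beta)
        if len(pack) == target_size:
            break
    return np.array(pack)

p, s, delta = 64, 4, 0.1
# Rescaling by delta * sqrt(2/s) turns Hamming separation s/2 into
# delta <= ||b_j - b_k||_2 <= 2 * sqrt(2) * delta.
pack = delta * np.sqrt(2 / s) * sparse_packing(p, s, target_size=32)
dists = [np.linalg.norm(u - v) for u, v in combinations(pack, 2)]
print(min(dists) >= delta, max(dists) <= 2 * np.sqrt(2) * delta + 1e-12)
```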
An alternative approach, due to Yang and Barron [13], controls the mutual information by covering the
entire class of distributions. Let N_KL(ε; P) denote the ε-covering number of P, where distances are measured with respect to the square-root
KL divergence. We have the following bound:
Lemma 3 ([13]). The mutual information satisfies
$$I(Y; X) \ \le\ \inf_{\epsilon > 0}\ \bigl\{\epsilon^2 + \log N_{KL}(\epsilon; \mathcal{P})\bigr\}.$$
Proof (sketch). Suppose {Q_1, …, Q_N} is an ε-cover of P with respect to the square-root KL divergence.
Letting $\bar{P} = \frac{1}{M} \sum_{i=1}^{M} P_i$ and $\bar{Q} = \frac{1}{N} \sum_{j=1}^{N} Q_j$, we can check that
$$I(Y; X) \ =\ \frac{1}{M} \sum_{i=1}^{M} D_{KL}(P_i \,\|\, \bar{P}) \ \le\ \frac{1}{M} \sum_{i=1}^{M} D_{KL}(P_i \,\|\, \bar{Q}),$$
where the inequality holds because P̄ minimizes the average KL divergence with respect to the second
argument. Furthermore, for each i, we know that there exists some Q_n such that D_KL(P_i ∥ Q_n) ≤ ε², implying that
$$D_{KL}(P_i \,\|\, \bar{Q}) \ =\ \int \log \frac{dP_i(X)}{d\bar{Q}(X)}\, dP_i(X) \ \le\ \int \log \frac{dP_i(X)}{\frac{1}{N}\, dQ_n(X)}\, dP_i(X) \ =\ D_{KL}(P_i \,\|\, Q_n) + \log N \ \le\ \epsilon^2 + \log N_{KL}(\epsilon; \mathcal{P}).$$
Since the above inequality holds for all ε > 0, we may take an infimum over ε to obtain the stated
bound.
As an example of the above technique, we consider the problem of nonparametric regression. Note
that the following example shows that the general machinery developed above, though described in
terms of parameter estimation, may be applied to nonparametric settings involving function estimation,
as well.
Example 2 (Nonparametric regression). Suppose we have observation pairs {(x_i, y_i)}_{i=1}^n from the model
$$y_i = f^*(x_i) + w_i,$$
where x_i ∼ Uniform[0, 1], w_i ∼ N(0, 1), and x_i is independent of w_i. We also assume that f^* belongs to the
function class F_s, for a positive integer s, defined as a set of continuous functions on [0, 1] satisfying a smoothness
condition of order s (the precise conditions, omitted here, are what guarantee the metric entropy scaling quoted below).
We derive lower bounds on the minimax risk of estimating f^* when ℓ is the squared L_2-distance, defined by
$$\ell(f, g) \ =\ \int_0^1 \bigl(f(x) - g(x)\bigr)^2\, dx.$$
Hence, we will take Φ(t) = t² and ρ equal to the L_2-distance. Let P denote the set of joint distributions of
(x, y) generated by the class F_s. By standard results on the metric entropy of function classes [14,15], we have
the bound
$$c \left(\frac{1}{\epsilon}\right)^{1/s} \ \le\ \log N_2(\epsilon; \mathcal{F}_s) \ \le\ C \left(\frac{1}{\epsilon}\right)^{1/s},$$
where log N_2(ε; F_s) denotes the metric entropy of F_s with respect to the L_2-distance. Furthermore, for any
δ > 0, there exists a δ-packing {f_1, …, f_M} of F_s in the L_2-metric such that log M = c_0 (1/δ)^{1/s}. For two
functions f, g ∈ F_s, we may compute the KL divergence between the corresponding distributions P_f, P_g ∈ P:
$$D_{KL}(P_f \,\|\, P_g) \ =\ \frac{n}{2} \cdot \|f - g\|_2^2.$$
Hence, it follows that
$$\log N_{KL}(\epsilon; \mathcal{P}) \ \le\ \log N_2\left(\epsilon \sqrt{\frac{2}{n}};\ \mathcal{F}_s\right) \ \le\ C \left(\frac{1}{\epsilon} \sqrt{\frac{n}{2}}\right)^{1/s}.$$
Minimizing the bound obtained from Lemma 3 with respect to ε, we obtain ε* = C′ n^{1/(4s+2)}, and plugging back
into Theorem 1, we obtain a lower bound of the form
$$\delta^2 \left(1 - \frac{C'' n^{1/(4s+2)}}{(1/\delta)^{1/s}}\right).$$
Taking δ ≍ (1/n)^{s/(4s+2)} then yields the bound
$$\inf_{\hat{f}}\ \sup_{f^* \in \mathcal{F}_s} \mathbb{E}_{f^*}\Bigl[\|\hat{f} - f^*\|_2^2\Bigr] \ \ge\ c' \left(\frac{1}{n}\right)^{s/(2s+1)}.$$
A matching upper bound may be derived using local weighted polynomial regression [16], so the minimax risk is
Θ(n^{−s/(2s+1)}).
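The trade-off inside Lemma 3 is also easy to probe numerically. The following sketch grid-minimizes ε² + C(√(n/2)/ε)^{1/s} over ε, with the unspecified entropy constant set to C = 1 (our own stand-in), and checks the scaling ε* ≍ n^{1/(4s+2)} used above.

```python
import numpy as np

# Grid-minimize the Yang-Barron bound  eps^2 + C * (sqrt(n/2)/eps)^{1/s}
# over eps and check eps* ~ n^{1/(4s+2)}. C = 1 is an arbitrary stand-in
# for the unspecified metric-entropy constant.
eps_grid = np.logspace(-3, 3, 60001)

def eps_star(n, s, C=1.0):
    vals = eps_grid**2 + C * (np.sqrt(n / 2) / eps_grid) ** (1 / s)
    return eps_grid[np.argmin(vals)]

s = 2
for n in [10**3, 10**4, 10**5, 10**6]:
    print(n, eps_star(n, s) / n ** (1 / (4 * s + 2)))  # ratio stabilizes
```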
3. Community Recovery
Another area of machine learning that has recently received a substantial amount of attention
concerns recovering communities based on node connectivity in a network. A popular probabilistic
model is known as the stochastic block model (SBM). In the simplest form of the model, parametrized
by (n, K, p, q), the graph has nodes {1, . . . , n} partitioned into K communities. Let the community
label of node i be denoted by σ (i ). The edge set E of the random graph G is then constructed in the
following manner: each edge (i, j) is generated independently from all others, with probability
$$P\bigl((i, j) \in E\bigr) \ =\ \begin{cases} p, & \text{if } \sigma(i) = \sigma(j), \\ q, & \text{if } \sigma(i) \neq \sigma(j). \end{cases}$$
The goal is to partition the n nodes into the underlying communities based on observing the graph G.
In order to measure the performance of an algorithm, we consider the loss function
$$r(\hat{\sigma}, \sigma) \ =\ \frac{1}{n} \min_{\tau \in S_K} d_H(\hat{\sigma},\ \tau \circ \sigma).$$
Here, the estimator σ̂ : {1, …, n} → {1, …, K} corresponds to a partitioning of the nodes into K
communities, and d_H denotes the Hamming distance between assignments. Furthermore, we take
the minimum over all permutations S_K of the community labels. Hence, r(σ̂, σ) is the proportion of
incorrectly labeled nodes (for the optimal labeling of partitions). We will focus our discussion on the
setting where K is fixed, but p and q may vary with n; generalizations exist in the literature where K is
allowed to grow with n, as well. We are interested in the behavior of various algorithms as n → ∞.
In the following two subsections, we discuss the popular notions of weak recovery and exact recovery.
The algorithm σ̂ achieves weak recovery if E[r(σ̂, σ)] → 0 (i.e., the expected fraction of misclassified
nodes tends to 0 as n → ∞), and achieves exact recovery if r(σ̂, σ) = 0 with probability tending to 1. For a more complete description
of current work on stochastic block models, see the extensive survey paper by Abbe [17].
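For small K, the loss r(σ̂, σ) can be computed directly by brute force over the K! relabelings; a minimal sketch:

```python
from itertools import permutations

def misclassification_rate(sigma_hat, sigma, K):
    """r(sigma_hat, sigma): Hamming error minimized over relabelings
    tau in S_K of the true communities (brute force; small K only)."""
    n = len(sigma)
    best = n
    for tau in permutations(range(K)):
        relabeled = [tau[label] for label in sigma]
        best = min(best, sum(a != b for a, b in zip(sigma_hat, relabeled)))
    return best / n

sigma     = [0, 0, 1, 1, 2, 2]
sigma_hat = [1, 1, 0, 0, 2, 2]   # same partition, labels 0 and 1 swapped
print(misclassification_rate(sigma_hat, sigma, K=3))  # 0.0
```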
The quantity of interest is now the minimax risk
$$\inf_{\hat{\sigma}}\ \sup_{\sigma \in \Sigma(n, K)} \mathbb{E}\bigl[r(\hat{\sigma}, \sigma)\bigr],$$
where Σ(n, K) is an appropriate class of underlying community labelings. We state and prove a result
for approximately equal-sized communities in the limit as n → ∞, so Σ(n, K) is the set of all labelings
σ such that |{i : σ(i) = k}| = (1 + o(1)) n/K, for all 1 ≤ k ≤ K.
The main result is the following [18]:
Theorem 2. Suppose p = a/n and q = b/n, and suppose nI/K → ∞, where
$$I \ =\ -2 \log\left(\sqrt{\frac{a}{n}} \cdot \sqrt{\frac{b}{n}} + \sqrt{\left(1 - \frac{a}{n}\right)\left(1 - \frac{b}{n}\right)}\right). \qquad (5)$$
Then the minimax risk over Σ(n, K) satisfies
$$\inf_{\hat{\sigma}}\ \sup_{\sigma \in \Sigma(n, K)} \mathbb{E}\bigl[r(\hat{\sigma}, \sigma)\bigr] \ \ge\ \begin{cases} \exp\bigl(-(1 + o(1))\, \frac{nI}{2}\bigr), & K = 2, \\ \exp\bigl(-(1 + o(1))\, \frac{nI}{K}\bigr), & K \ge 3. \end{cases}$$
Proof (sketch). The core of the approach bears similarity to the method for obtaining lower bounds
for estimation, in the sense that we construct a subset Σ_L of the parameter space corresponding to
“messages”, which we wish to recover via an appropriate decoding strategy. In the case when K = 2
(and n is even), the subset Σ_L consists of all partitions of the nodes into equal-sized communities and
communities of sizes (n/2 + 1, n/2 − 1). We focus on the case K = 2 in the present proof sketch to avoid
technical complications.
The proof is somewhat more involved than the strategies outlined in Section 2, however, since the
unknown quantity to be estimated is a set of discrete labelings and the loss function is defined with
respect to an optimal permutation. The first step is to lower-bound the minimax risk by the average
risk over the class Σ_L. Furthermore, a more technical argument shows that we may just examine the
average local risk defined with respect to a single node in the graph:
$$\inf_{\hat{\sigma}}\ \sup_{\sigma \in \Sigma(n, 2)} \mathbb{E}\bigl[r(\hat{\sigma}, \sigma)\bigr] \ \ge\ \inf_{\hat{\sigma}}\ \frac{1}{|\Sigma_L|} \sum_{\sigma \in \Sigma_L} \mathbb{E}\bigl[r(\hat{\sigma}, \sigma)\bigr] \ =\ \inf_{\hat{\sigma}}\ \frac{1}{|\Sigma_L|} \sum_{\sigma \in \Sigma_L} \mathbb{E}\bigl[r_1(\hat{\sigma}, \sigma)\bigr],$$
where r1 is the local loss function defined with respect to node 1, which is the fraction of optimal
permutations of community assignments that incorrectly classify node 1. The next step is to lower-bound
the local risk (uniformly over all choices of σ ∈ Σ L ) using the minimum risk of a binary hypothesis
testing problem, where the two hypotheses correspond to the possible assignments of node 1 as a
member of the first or second community. In particular, we have the following inequality, which holds
for each σ:
$$\mathbb{E}\bigl[r_1(\hat{\sigma}, \sigma)\bigr] \ \ge\ c\, P\left(\sum_{i=1}^{n/2} X_i \ \ge\ \sum_{j=1}^{n/2} Y_j\right),$$
where the X_i ∼ Bernoulli(b/n) and Y_j ∼ Bernoulli(a/n) are i.i.d. and mutually independent random variables. Standard
techniques involving large deviation inequalities allow us to lower-bound the latter probability, thus
yielding the overall lower bound appearing in the theorem.
As demonstrated by Zhang and Zhou [18], the lower bound on the risk appearing in Theorem 2
may be achieved using a form of penalized likelihood estimation. A computationally feasible procedure
was subsequently provided in Gao et al. [19].
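Both the divergence I of Equation (5) and the two-point testing probability from the proof of Theorem 2 are easy to evaluate numerically. The following sketch (with illustrative values of a, b, and n of our own choosing) Monte Carlos P(Σ X_i ≥ Σ Y_j) and compares it against the exponential scale exp(−nI/2) suggested by standard large-deviations calculations.

```python
import numpy as np

rng = np.random.default_rng(1)

def renyi_I(a, b, n):
    # I from Equation (5): a Renyi-type divergence between Bern(a/n), Bern(b/n)
    p, q = a / n, b / n
    return -2 * np.log(np.sqrt(p * q) + np.sqrt((1 - p) * (1 - q)))

a, b, n, trials = 20.0, 4.0, 1000, 200_000
X = rng.binomial(n // 2, b / n, size=trials)   # edges to the wrong community
Y = rng.binomial(n // 2, a / n, size=trials)   # edges to the true community
I = renyi_I(a, b, n)
print(f"P(sum X >= sum Y) ~ {np.mean(X >= Y):.4f}")
print(f"exp(-n I / 2)     ~ {np.exp(-n * I / 2):.4f}")   # same exponential scale
```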
Theorem 3. Let p = (a log n)/n and q = (b log n)/n, where a > b ≥ 0. If (√a − √b)² < 2, then, for sufficiently large
n, the maximum likelihood estimator fails in recovering the communities with probability bounded away from 0:
$$\liminf_{n \to \infty}\ P\bigl(r(\hat{\sigma}_{MLE}, \sigma) \neq 0\bigr) \ >\ 0.$$
Proof (sketch). We denote the two communities by A and B. Let F be the event that the maximum
likelihood estimator fails in performing exact recovery, and let F_A denote the event that some node
i ∈ A has at least as many edges to B as to the remaining nodes of A; define F_B analogously. Then
$$F_A \cap F_B \ \subseteq\ F,$$
since, if both F_A and F_B were to occur simultaneously, swapping the labels of the corresponding nodes
i ∈ A and j ∈ B would lead to a value of the likelihood at least as high as that of the correct labeling. In particular, this implies that
$$P(F) \ \ge\ P(F_A \cap F_B) \ \ge\ P(F_A) + P(F_B) - 1 \ =\ 2 P(F_A) - 1, \qquad (6)$$
where the final equality uses the symmetry P(F_A) = P(F_B).
Let H ⊆ A denote a fixed subset with |H| = n / log³(n), and define the event
$$F_H \ =\ \left\{\exists\, j \in H \ \text{s.t.}\ E(j, A \setminus H) + \frac{\log n}{\log \log n} \ \le\ E(j, B)\right\},$$
where E(j, C) denotes the number of edges between j and the nodes in C. Note that, if event F_H
occurs and all nodes in H are connected to at most (log n)/(log log n) other nodes in H, then event F_A must occur.
Furthermore, one can show that, with high probability, every node in H is connected to at most (log n)/(log log n)
other nodes in H. Hence,
$$P(F_A) \ \ge\ P(F_H) + o(1). \qquad (7)$$
For each j ∈ H, define the event
$$F_H^{(j)} \ =\ \left\{E(j, A \setminus H) + \frac{\log n}{\log \log n} \ \le\ E(j, B)\right\},$$
and note that the F_H^{(j)}'s are independent, since they involve disjoint sets of edges. Hence,
$$P(F_H) \ =\ P\left(\bigcup_{j \in H} F_H^{(j)}\right) \ =\ 1 - \prod_{j \in H} \left(1 - P\bigl(F_H^{(j)}\bigr)\right).$$
Straightforward techniques for bounding sums of independent Bernoulli random variables show that
P(F_H^{(j)}) > log(4) · log³(n)/n for each j, from which we can conclude that
$$P(F_H) \ \ge\ 1 - \left(1 - \frac{\log(4) \log^3(n)}{n}\right)^{n / \log^3(n)} \ =\ 1 - \frac{1}{4} + o(1). \qquad (8)$$
Combining inequalities (6)–(8), we conclude that P(F) ≥ 1/2 + o(1), so the maximum likelihood
estimator indeed fails with probability bounded away from 0. Finally, since the maximum likelihood
estimator maximizes the probability of exact recovery when the community assignment is uniformly
random, we have, for any estimator σ̂,
$$P\bigl(r(\hat{\sigma}_{MLE}, \sigma) = 0\bigr) \ \ge\ P\bigl(r(\hat{\sigma}, \sigma) = 0\bigr),$$
so
$$\liminf_{n \to \infty}\ P\bigl(r(\hat{\sigma}, \sigma) \neq 0\bigr) \ \ge\ \liminf_{n \to \infty}\ P\bigl(r(\hat{\sigma}_{MLE}, \sigma) \neq 0\bigr) \ >\ 0.$$
Theorem 4. Under the same conditions as in Theorem 3, suppose instead that (√a − √b)² > 2. Then
the maximum likelihood estimator succeeds in recovering the communities with probability tending to 1:
$$\lim_{n \to \infty}\ P\bigl(r(\hat{\sigma}_{MLE}, \sigma) = 0\bigr) \ =\ 1.$$
Since the focus of this paper is to establish lower bounds, we refer the reader to Abbe [24] for
the proof of Theorem 4, which proceeds by direct calculation. An extension of Theorems 3 and 4 for
weighted stochastic block models may be found in Jog and Loh [25].
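To see the threshold of Theorems 3 and 4 empirically, one can Monte Carlo the failure event F_A ∩ F_B from the proof of Theorem 3. The sketch below samples node degree counts independently (an approximation that ignores the mild dependence induced by shared within-community edges); the values of a, b, and n are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)

def failure_freq(a, b, n, trials=200):
    """Frequency of F_A and F_B: some node in each community has at least
    as many cross-community edges as within-community edges. Degree counts
    are sampled independently per node (approximation: shared edges ignored)."""
    p, q = a * np.log(n) / n, b * np.log(n) / n
    m = n // 2
    fails = 0
    for _ in range(trials):
        own   = rng.binomial(m - 1, p, size=2 * m)  # edges within own community
        cross = rng.binomial(m,     q, size=2 * m)  # edges across communities
        fails += np.any(cross[:m] >= own[:m]) and np.any(cross[m:] >= own[m:])
    return fails / trials

n = 2000
for a, b in [(4.0, 1.0), (16.0, 1.0)]:   # (sqrt(a) - sqrt(b))^2 = 1 vs. 9
    print((np.sqrt(a) - np.sqrt(b)) ** 2, failure_freq(a, b, n))
```

Below the threshold (√a − √b)² < 2, the failure event occurs in nearly every trial; well above it, essentially never.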
Remark 2. The threshold behavior described in Theorems 3 and 4 is perhaps not surprising in light of known
threshold behavior in Shannon coding theory, and the connections between each of the statistical learning
tasks and the problem of decoding on a discrete alphabet after passage through a noisy channel. Indeed,
the community recovery problem has been cast in information-theoretic terminology as decoding in a “graphical
channel” [26]. On the other hand, the coding scheme is fixed according to the stochastic block model, whereas
Shannon theory allows one to design an optimal encoding scheme to achieve channel capacity. See also the paper
by Chen et al. [27], and the derivation of similar types of sharp threshold behavior in submatrix localization
problems [28,29]. Finally, we note that the scaling p = (a log n)/n and q = (b log n)/n, when a, b = Θ(1), corresponds
to the threshold for the graph to have isolated vertices with probability tending to 1 [23]. Indeed, it would
be impossible to perform exact recovery with high probability in the presence of isolated vertices: flipping the
community assignments of two isolated vertices belonging to the two different communities would not change
the value of the likelihood.
4. Online Learning
We now shift our focus to sequential allocation problems. The setup we consider involves a
series of actions taken by a player, using limited feedback about the environment based on his/her
past actions. We study the setting of a multi-armed bandit, where each potential action of the player
is associated with a reward distribution, but the player only observes the reward corresponding to
his/her action on successive rounds. In the following two subsections, we will consider the cases of
stochastic and adversarial bandits and obtain bounds on a quantity known as regret. More details on the
setting and results may be found in Bubeck and Cesa-Bianchi [30] or Cesa-Bianchi and Lugosi [31].
In the stochastic setting, each arm j ∈ {1, …, k} is associated with a reward distribution P_{θ_j} having
mean μ_j = μ(θ_j), and we write μ* = max_{1≤j≤k} μ_j. Letting I_t denote the arm pulled on round t, the
pseudo-regret is defined as
$$R_n \ =\ n \mu^* - \mathbb{E}\left[\sum_{t=1}^{n} \mu_{I_t}\right],$$
where we may also write R_n(θ_1, …, θ_k) to make the dependence on the reward distributions explicit.
If the player employs a random strategy, the expectation is computed with respect to randomness
in the sequence of actions (I_1, …, I_n), as well as randomness generated by draws from the reward
distributions. In other words, the pseudo-regret measures the difference between the expected reward
incurred by the player's strategy and the expected reward incurred by playing the arm with maximum
expected reward on every round.
Lai and Robbins [32] prove the following result. We omit some technical regularity conditions on
the parameter space, such as denseness of the parameter space and continuity with respect to the KL
divergence, in order to avoid cluttering the presentation.
Theorem 5. Suppose that, for all pairs θ_1, θ_2 ∈ Θ such that μ(θ_1) > μ(θ_2), we have 0 < D_KL(P_{θ_2} ∥ P_{θ_1}) < ∞.
Suppose a strategy satisfies R_n(θ_1, …, θ_k) = o(n^α), for all θ_1, …, θ_k ∈ Θ and all α > 0. Then, for any
(θ_1, …, θ_k) ∈ Θ^k, we have
$$\liminf_{n \to \infty}\ \frac{R_n(\theta_1, \dots, \theta_k)}{\log n} \ \ge\ \sum_{j : \mu_j < \mu^*} \frac{\mu^* - \mu_j}{D_{KL}(P_{\theta_j} \,\|\, P_{\theta^*})},$$
where θ* denotes the parameter of an arm with maximal mean μ*.
Proof (sketch). First, decompose the pseudo-regret as
$$R_n(\theta_1, \dots, \theta_k) \ =\ \sum_{j=1}^{k} \Delta_j\, \mathbb{E}[T_j(n)],$$
where Δ_j = μ* − μ_j and T_j(n) = Σ_{t=1}^n 1{I_t = j}. The main step is to show that the inequality
$$\mathbb{E}[T_j(n)] \ \ge\ \frac{\log n}{D_{KL}(P_{\theta_j} \,\|\, P_{\theta^*})}, \qquad \forall\, j : \mu_j < \mu^* \qquad (9)$$
holds asymptotically for any strategy. Inequality (9) provides a lower bound on the expected number of pulls of any
suboptimal arm (note that, as P_{θ_j} becomes further from P_{θ*}, the two arms are easier to distinguish, so the
expected number of pulls of the suboptimal arm can be smaller). We focus on proving inequality (9)
for j = 2; the other cases are similar.
Consider two parameter vectors θ = (θ_1, θ_2, …, θ_k) and θ′ = (θ_1, θ′_2, …, θ_k), which differ only in
the second coordinate. We further choose the parameters such that
$$\mu_1 > \mu_2 \ge \mu_3 \ge \dots \ge \mu_k, \qquad \mu_2' \ge \mu_1 > \mu_3 \ge \dots \ge \mu_k,$$
so the second arm is suboptimal in the first setting but optimal in the second. We will choose θ′_2 close
to θ_1, so
$$D_{KL}(P_{\theta_2} \,\|\, P_{\theta_2'}) \ \approx\ D_{KL}(P_{\theta_2} \,\|\, P_{\theta_1}) \ =\ D_{KL}(P_{\theta_2} \,\|\, P_{\theta^*}).$$
(The regularity conditions on the parameter space and reward distributions ensure that such a choice
is possible.) The idea is that, since P_θ and P_{θ′} are close, any strategy should pick roughly the same
sequence of arms in both scenarios, but a strategy that performs well on θ will behave relatively poorly
on θ′ (and vice versa), since the ordering of arms according to optimality is different in the two settings.
In particular, we will derive the following bound, relating the probabilities of pulling the second arm
in each of the parameter settings:
$$P_\theta\bigl(T_2(n) < a_n\bigr) \ \le\ b_n + c_n'\, P_{\theta'}\bigl(T_2(n) < a_n\bigr), \qquad (10)$$
where $a_n = \frac{(1 - 3\alpha) \log n}{D_{KL}(P_{\theta_2} \,\|\, P_{\theta_2'})}$, we take α < 1/3, and the quantities b_n and c′_n are defined below. We can show that b_n = o(1), since P_θ and P_{θ′} are close, and
that the right-hand probability is also o(1), since arm 2 is optimal under θ′.
For a fixed strategy, let {X_{j,s}}_{1 ≤ j ≤ k, 1 ≤ s ≤ n} denote the rewards corresponding to various arm pulls.
For a fixed integer n_2 and any event A ⊆ {T_2(n) = n_2}, we have
$$P_{\theta'}(A) \ =\ \int_A dP_{\theta'}(x) \ =\ \int_A \frac{dP_{\theta'}(x)}{dP_\theta(x)}\, dP_\theta(x) \ =\ \int_A\ \prod_{s=1}^{n_2} \frac{dP_{\theta_2'}(x_{2,s})}{dP_{\theta_2}(x_{2,s})}\, dP_\theta(x) \ =\ \int_A e^{-L_2(x)}\, dP_\theta(x),$$
where $L_2(x) = \sum_{s=1}^{T_2(n)} \log \frac{dP_{\theta_2}(x_{2,s})}{dP_{\theta_2'}(x_{2,s})}$. Taking A = {T_2(n) < a_n} ∩ {L_2(X) ≤ c_n} (and summing over the possible values of T_2(n) below a_n), we obtain
$$P_{\theta'}(A) \ \ge\ e^{-c_n}\, P_\theta(A),$$
which is inequality (10) with c′_n = e^{c_n} and b_n = P_θ(T_2(n) < a_n, L_2(X) > c_n). Note that, if T_2(n) < a_n,
we have
$$L_2(X) \ <\ \sum_{s=1}^{a_n} \log \frac{dP_{\theta_2}(X_{2,s})}{dP_{\theta_2'}(X_{2,s})},$$
so
$$b_n \ \le\ P_\theta\left(\sum_{s=1}^{a_n} \log \frac{dP_{\theta_2}(X_{2,s})}{dP_{\theta_2'}(X_{2,s})} > c_n\right) \ =\ o(1),$$
where the last equality follows from the fact that the rewards {X_{2,s}}_{s=1}^{a_n} are i.i.d. and
$$\frac{1}{a_n} \sum_{s=1}^{a_n} \log \frac{dP_{\theta_2}(X_{2,s})}{dP_{\theta_2'}(X_{2,s})} \ \xrightarrow{a.s.}\ \mathbb{E}_\theta\left[\log \frac{dP_{\theta_2}(X_{2,1})}{dP_{\theta_2'}(X_{2,1})}\right] \ =\ D_{KL}(P_{\theta_2} \,\|\, P_{\theta_2'}).$$
Furthermore, by Markov's inequality,
$$P_{\theta'}\bigl(T_2(n) < a_n\bigr) \ =\ P_{\theta'}\bigl(n - T_2(n) \ge n - a_n\bigr) \ \le\ \frac{\mathbb{E}_{\theta'}[n - T_2(n)]}{n - a_n} \ =\ o(n^{\alpha - 1}),$$
where the last equality follows from the fact that a_n = o(n) and the assumption on R_n(θ′). Altogether,
we conclude that the right-hand side of inequality (10) is o(1).
By another application of Markov's inequality, we conclude that
$$\mathbb{E}_\theta[T_2(n)] \cdot \frac{D_{KL}(P_{\theta_2} \,\|\, P_{\theta_2'})}{(1 - 3\alpha) \log n} \ =\ \frac{\mathbb{E}_\theta[T_2(n)]}{a_n} \ \ge\ P_\theta\bigl(T_2(n) \ge a_n\bigr) \ \to\ 1.$$
Hence, taking α arbitrarily small,
$$\frac{\mathbb{E}_\theta[T_2(n)]}{\log n} \ \gtrsim\ \frac{1}{D_{KL}(P_{\theta_2} \,\|\, P_{\theta_2'})} \ \approx\ \frac{1}{D_{KL}(P_{\theta_2} \,\|\, P_{\theta^*})},$$
as wanted.
Note that the assumption R_n(θ_1, …, θ_k) = o(n^α) implies that a sufficiently good player strategy
exists for all choices of reward parameters. In particular, such a condition may be verified when the
reward distributions are Bernoulli (e.g., P_θ ∼ Bernoulli(θ)). Then, we have
$$D_{KL}(P_{\theta_1} \,\|\, P_{\theta_2}) \ =\ \theta_1 \log \frac{\theta_1}{\theta_2} + (1 - \theta_1) \log \frac{1 - \theta_1}{1 - \theta_2},$$
and since D_KL(P_{θ_j} ∥ P_{θ*}) ≤ (μ* − μ_j)² / (μ*(1 − μ*)) in this case, Theorem 5 yields
$$\liminf_{n \to \infty}\ \frac{R_n(\theta_1, \dots, \theta_k)}{\log n} \ \ge\ \mu^*(1 - \mu^*) \sum_{j : \mu_j < \mu^*} \frac{1}{\mu^* - \mu_j}.$$
A player strategy known as the Upper Confidence Bound (UCB) strategy may be shown to achieve
this lower bound, up to constant factors [32,33].
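As an empirical companion to Theorem 5, the following sketch runs UCB1 (with the index √(2 log t / T_j(t)) of Auer et al. [33]) on Bernoulli arms and compares the simulated pseudo-regret divided by log n against the Lai–Robbins constant. The arm means and horizon are our own choices, and agreement should only be expected up to a constant factor.

```python
import numpy as np

rng = np.random.default_rng(3)

def kl_bern(p, q):
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

def ucb1_pseudo_regret(mus, n):
    """Run UCB1 and track the pseudo-regret sum_t (mu* - mu_{I_t})."""
    k, mu_star = len(mus), max(mus)
    counts = np.ones(k)
    sums = rng.binomial(1, mus).astype(float)      # one initial pull per arm
    regret = sum(mu_star - m for m in mus)
    for t in range(k, n):
        j = int(np.argmax(sums / counts + np.sqrt(2 * np.log(t + 1) / counts)))
        sums[j] += rng.binomial(1, mus[j])
        counts[j] += 1
        regret += mu_star - mus[j]
    return regret

mus, n = [0.5, 0.4, 0.3], 20_000
lai_robbins = sum((max(mus) - m) / kl_bern(m, max(mus)) for m in mus if m < max(mus))
print(np.mean([ucb1_pseudo_regret(mus, n) for _ in range(3)]) / np.log(n))
print(lai_robbins)   # the information-theoretic constant from Theorem 5
```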
Finally, we mention a non-asymptotic lower bound on the pseudo-regret that comes from the
probably approximately correct (PAC) literature on bandits [34–36]:
Theorem 6. In the case of Bernoulli reward distributions, there exist positive constants {c_i}_{i=1}^5 such that, for
all k ≥ 2 and n ≥ 1, the pseudo-regret of any strategy satisfies
$$\sup_{\theta_1, \dots, \theta_k \in [0,1]} R_n(\theta_1, \dots, \theta_k) \ \ge\ \min\bigl\{c_1 n,\ c_2 k + c_3 n,\ c_4 k (\log n - \log k + c_5)\bigr\}. \qquad (11)$$
Proof (sketch). For a detailed proof of Theorem 6, we refer the reader to Mannor and Tsitsiklis [36].
The main idea is to construct a collection of k vectors {θ^1, …, θ^k} ⊆ [0, 1]^k corresponding to the
parameters of the reward distributions on arms. For each 2 ≤ i ≤ k, we define the vector
θ^i = (θ^i_1, …, θ^i_k) such that
$$\theta_1^i = \frac{1}{2} + \frac{\epsilon}{2}, \qquad \theta_i^i = \frac{1}{2} + \epsilon, \qquad \theta_j^i = \frac{1}{2} \ \ \text{for } j \notin \{1, i\},$$
for an appropriately chosen ε > 0, so that arm i is optimal under θ^i but only slightly better than arm 1.
Theorem 6 is a type of minimax result, stating that, for any player strategy, a distribution of
Bernoulli rewards exists for which the problem incurs Ω(log n) regret. The same UCB strategies of
Auer et al. [33] may be used to obtain O(log n) upper bounds on the minimax regret even for the
worst-case reward distribution, showing that the bound stated in Theorem 6 is tight.
We now turn to the adversarial setting, in which the rewards may be chosen by an adversary. For a
player strategy S and an adversarial strategy P, the pseudo-regret is defined as
$$R_n(S, P) \ =\ \max_{1 \le i \le k}\ \mathbb{E}\left[\sum_{t=1}^{n} g_{i,t}\right] - \mathbb{E}\left[\sum_{t=1}^{n} g_{I_t, t}\right],$$
where g_{i,t} ∈ [0, 1] denotes the reward of arm i on round t, the first expectation is taken with respect to possible randomization in the adversarial strategy,
and the second expectation is taken with respect to randomization in the strategies of both the player
and adversary.
The following result provides a lower bound for the minimax pseudo-regret, where the supremum
is taken over P_Ber, the set of all adversaries assigning i.i.d. Bernoulli rewards to the k arms across the n
time steps, and the infimum is taken over all player strategies [30,37]:
Theorem 7 ([30,37]). We have
$$\inf_{S \in \mathcal{S}}\ \sup_{P \in \mathcal{P}_{Ber}} R_n(S, P) \ \ge\ \frac{1}{18} \min\{\sqrt{nk},\ n\},$$
where the infimum is taken over all (possibly randomized) player strategies.
Proof (sketch). Note that it suffices to prove the bound when the infimum is taken over deterministic
player strategies, since the pseudo-regret for a randomized strategy will be a convex combination
of the pseudo-regret of deterministic strategies. Fix a deterministic player strategy, and consider the
reward distributions P^1, …, P^k ∈ P_Ber, where P^j corresponds to the distribution under which the reward
of each arm i ≠ j is i.i.d. Bernoulli(1/2), and the reward of arm j is i.i.d. Bernoulli(1/2 + ε). Note that
this construction bears some similarity to the proof outline for Theorem 6 provided above, in that the
reward distribution P^j slightly favors arm j. We will also compute a lower bound for the weighted
regret, this time allocating uniform weights to each parameter setting, in order to conclude the existence
of at least one assignment of reward distributions satisfying the desired lower bounds. Let E_j denote
the expectation with respect to the reward distribution P^j.
We may compute
$$\frac{1}{k} \sum_{j=1}^{k} R_n(S, P^j) \ =\ \frac{1}{k} \sum_{j=1}^{k} \mathbb{E}_j\left[\sum_{i \neq j} \epsilon\, T_i(n)\right] \ =\ \frac{1}{k} \sum_{j=1}^{k} \epsilon\, \mathbb{E}_j\bigl[n - T_j(n)\bigr] \ =\ \epsilon\left(n - \frac{1}{k} \sum_{j=1}^{k} \mathbb{E}_j[T_j(n)]\right), \qquad (12)$$
Furthermore, let E (without a subscript) denote expectation under the distribution P^0, in which every
arm receives i.i.d. Bernoulli(1/2) rewards. One can show that
$$\mathbb{E}_j[T_j(n)] \ \overset{(a)}{\le}\ \mathbb{E}[T_j(n)] + \frac{n}{2} \sqrt{2\, D_{KL}(P^0 \,\|\, P^j)} \ \overset{(b)}{=}\ \mathbb{E}[T_j(n)] + \frac{n}{2} \sqrt{\mathbb{E}[T_j(n)]\, \log \frac{1}{1 - 4\epsilon^2}}, \qquad (13)$$
where inequality (a) may be derived by first relating the difference in expectations for bounded
random variables to total variation distance and then applying Pinsker's inequality, and equality (b)
follows from a direct computation. Combining inequalities (12) and (13), we then obtain
$$\frac{1}{k} \sum_{j=1}^{k} R_n(S, P^j) \ \ge\ \epsilon\left(n - \frac{n}{k} - \frac{n}{2k} \sqrt{\log \frac{1}{1 - 4\epsilon^2}} \sum_{j=1}^{k} \sqrt{\mathbb{E}[T_j(n)]}\right) \ =\ \epsilon n\left(1 - \frac{1}{k} - \frac{1}{2} \sqrt{\log \frac{1}{1 - 4\epsilon^2}} \cdot \frac{1}{k} \sum_{j=1}^{k} \sqrt{\mathbb{E}[T_j(n)]}\right) \ \ge\ \epsilon n\left(1 - \frac{1}{k} - \frac{1}{2} \sqrt{\frac{n}{k} \log \frac{1}{1 - 4\epsilon^2}}\right),$$
using the concavity of the square root function and the fact that Σ_{j=1}^k T_j(n) = n. Choosing ε = min{√(kn), n}/(4n) then yields the inequality
$$\sup_{P \in \mathcal{P}_{Ber}} R_n(S, P) \ \ge\ \frac{1}{k} \sum_{j=1}^{k} R_n(S, P^j) \ \ge\ \frac{1}{18} \min\{\sqrt{kn},\ n\},$$
and taking an infimum over all player strategies produces the desired result.
Note that the lower bound provided in Theorem 7 clearly also holds when the supremum is taken
over any class of adversarial strategies containing P_Ber. In particular, one topic of study is that of
oblivious adversaries, which are allowed to perform any strategy that is non-adaptive to the actions
of the player (i.e., it is chosen before the start of the first round). The Exp3 algorithm provides an
upper bound on the minimax pseudo-regret for oblivious adversaries that matches the lower bound in
Theorem 7 up to a factor of √(log k) [37]. The study of non-oblivious adversaries refers to the setting
where the adversary’s actions may be chosen in response to the player’s sequential choices, as well,
and is also an active area of research [31,38].
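To complement the lower bound, here is a minimal sketch of the loss-based Exp3 update (importance-weighted exponential weights), run against the hardest stochastic environment from the proof of Theorem 7. The learning rate follows the standard √(2 log k/(nk)) tuning; the environment parameters are illustrative, and this is a sketch rather than the tuned algorithm analyzed in [37].

```python
import numpy as np

rng = np.random.default_rng(4)

def exp3(n, k, reward_fn, eta):
    """Loss-based Exp3: sample an arm from exponential weights, update the
    chosen arm with an importance-weighted loss estimate loss / p[arm]."""
    logw = np.zeros(k)
    pulls = np.zeros(k, dtype=int)
    for _ in range(n):
        p = np.exp(logw - logw.max())
        p /= p.sum()
        arm = rng.choice(k, p=p)
        loss = 1.0 - reward_fn(arm)
        logw[arm] -= eta * loss / p[arm]
        pulls[arm] += 1
    return pulls

n, k, eps, j_star = 20_000, 10, 0.05, 3
reward = lambda arm: rng.binomial(1, 0.5 + (eps if arm == j_star else 0.0))
pulls = exp3(n, k, reward, eta=np.sqrt(2 * np.log(k) / (n * k)))
pseudo_regret = eps * (n - pulls[j_star])       # n mu* - sum_t mu_{I_t}
print(pseudo_regret, np.sqrt(n * k) / 18)       # vs. the Theorem 7 lower bound
```

The simulated pseudo-regret sits between the Theorem 7 lower bound and the O(√(nk log k)) guarantee for Exp3, consistent with the √(log k) gap noted above.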
5. Discussion
In this article, we have presented several distinct approaches for deriving lower bounds in various
statistical learning problems. In each of the settings described—statistical estimation, community
recovery, and online learning—we have shown how to simplify the problem to one involving channel
decoding, and leverage information-theoretic bounds on the hardness of the decoding problem to
bound the hardness of the corresponding statistical problem. It is worth reflecting on the similarities
between the techniques employed in each of the approaches. Although the specific interpretation
involving channel decoding looks quite different in each of the settings, the trick is to find an
appropriate discretization of parameter space so that pairs of parameters are relatively far apart,
but the corresponding data-generating distributions are close. In the context of statistical estimation,
this means that we construct a packing of parameter space. In the community recovery setting, we
consider pairs of community partitions that differ only in the assignment of a single node. In the
multi-armed bandit setting, we consider pairs of arm parameters that flip the assignment of the optimal
arm, while perturbing the parameter values as little as possible.
On a more applied note, information-theoretic tools have made an appearance in various
machine learning algorithms involving maximizing independence between observed quantities. Some
examples include decision tree learning via information gain [39]; independent component analysis by
mutual information minimization [40]; causal inference algorithms maximizing independence [41];
minimal-redundancy-maximal-relevance (mRMR) methods for feature selection [42]; and image
registration via mutual information maximization in medical imaging [43]. As a result, quantities
such as mutual information have become increasingly mainstream in data science applications. Note,
however, that such applications of information theory to machine learning have no connection to
the channel decoding techniques or hardness results discussed in this article. In terms of statistical
theory, these applications have created a renewed interest in deriving efficient estimators of entropy
and other related information measures based on finite samples [44–47], but a detailed discussion of
such methods is somewhat orthogonal to the main topic of this survey.
Acknowledgments: The author thanks Varun Jog, the Assistant Editor, and the anonymous referees for helpful
comments that enhanced the clarity of the paper.
Conflicts of Interest: The author declares no conflict of interest.
References
1. Bousquet, O.; Boucheron, S.; Lugosi, G. Introduction to statistical learning theory. In Advanced Lectures on
Machine Learning; Springer: Berlin/Heidelberg, Germany, 2004; pp. 169–207.
2. Friedman, J.; Hastie, T.; Tibshirani, R. The Elements of Statistical Learning; Springer Series in Statistics; Springer:
Berlin/Heidelberg, Germany, 2001; Volume 1.
3. Cover, T.M.; Thomas, J.A. Elements of Information Theory; John Wiley & Sons: New York, NY, USA, 2012.
4. Raskutti, G.; Wainwright, M.J.; Yu, B. Minimax rates of estimation for high-dimensional linear regression
over `q -balls. IEEE Trans. Inf. Theory 2011, 57, 6976–6994.
5. Tsybakov, A.B. Introduction to Nonparametric Estimation; Springer: Berlin/Heidelberg, Germany, 2008.
6. Santhanam, N.P.; Wainwright, M.J. Information-theoretic limits of selecting binary graphical models in high
dimensions. IEEE Trans. Inf. Theory 2012, 58, 4117–4134.
7. Guntuboyina, A. Lower bounds for the minimax risk using f -divergences, and applications. IEEE Trans.
Inf. Theory 2011, 57, 2386–2399.
8. Cai, T.T.; Zhang, C.H.; Zhou, H.H. Optimal rates of convergence for covariance matrix estimation. Ann. Stat.
2010, 38, 2118–2144.
9. Amini, A.A.; Wainwright, M.J. High-Dimensional Analysis of Semidefinite Relaxations for Sparse Principal
Components. Ann. Stat. 2009, 37, 2877–2921.
10. Lehmann, E.L.; Casella, G. Theory of Point Estimation; Springer Science & Business Media: Berlin/Heidelberg,
Germany, 2006.
11. Kühn, T. A lower estimate for entropy numbers. J. Approx. Theory 2001, 110, 120–124.
12. Zhang, C.H. Nearly unbiased variable selection under minimax concave penalty. Ann. Stat. 2010, 38, 894–942.
13. Yang, Y.; Barron, A. Information-theoretic determination of minimax rates of convergence. Ann. Stat. 1999,
27, 1564–1599.
14. Lorentz, G.G. Metric entropy and approximation. Bull. Am. Math. Soc. 1966, 72, 903–937.
15. Tikhomirov, V.M.; Shiryayev, A.N. ε-entropy and ε-capacity of sets in functional spaces. In Selected Works of
A.N. Kolmogorov: Volume III: Information Theory and the Theory of Algorithms; Springer: Dordrecht, The Netherlands,
1993; pp. 86–170.
16. Stone, C.J. Optimal global rates of convergence for nonparametric regression. Ann. Stat. 1982, 10, 1040–1053.
17. Abbe, E. Community detection and stochastic block models: Recent developments. arXiv 2017, arXiv:1703.10146.
18. Zhang, A.Y.; Zhou, H.H. Minimax rates of community detection in stochastic block models. Ann. Stat. 2016,
44, 2252–2280.
19. Gao, C.; Ma, Z.; Zhang, A.Y.; Zhou, H.H. Achieving optimal misclassification proportion in stochastic block
model. arXiv 2015, arXiv:1505.03772.
20. Xu, M.; Jog, V.; Loh, P. Optimal Rates for Community Estimation in the Weighted Stochastic Block Model.
arXiv 2017, arXiv:1706.01175.
21. Abbe, E.; Sandon, C. Community detection in general stochastic block models: Fundamental limits and
efficient algorithms for recovery. In Proceedings of the 2015 IEEE 56th Annual Symposium on Foundations
of Computer Science (FOCS), Berkeley, CA, USA, 17–20 October 2015; pp. 670–688.
22. Yun, S.Y.; Proutiere, A. Optimal cluster recovery in the labeled stochastic block model. In Advances in
Neural Information Processing Systems, Proceedings of the 30th Conference on Neural Information Processing
Systems (NIPS 2016), Barcelona, Spain, 4–9 December 2016; The Neural Information Processing Systems (NIPS)
Foundation: La Jolla, CA, USA, 2016; pp. 965–973.
23. Bollobás, B. Random Graphs (Cambridge Studies in Advanced Mathematics); Cambridge University Press:
Cambridge, UK, 2001.
24. Abbe, E.; Bandeira, A.S.; Hall, G. Exact recovery in the stochastic block model. IEEE Trans. Inf. Theory 2016,
62, 471–487.
25. Jog, V.; Loh, P. Recovering communities in weighted stochastic block models. In Proceedings of the 2015
53rd Annual Allerton Conference on Communication, Control, and Computing (Allerton), Monticello, IL,
USA, 29 September–2 October 2015; pp. 1308–1315.
26. Abbe, E.; Montanari, A. Conditional random fields, planted constraint satisfaction and entropy concentration.
In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques; Springer:
Berlin/Heidelberg, Germany, 2013; pp. 332–346.
27. Chen, Y.; Suh, C.; Goldsmith, A.J. Information recovery from pairwise measurements. IEEE Trans. Inf. Theory
2016, 62, 5881–5905.
28. Chen, Y.; Xu, J. Statistical-computational tradeoffs in planted problems and submatrix localization with
a growing number of clusters and submatrices. J. Mach. Learn. Res. 2016, 17, 882–938.
29. Hajek, B.; Wu, Y.; Xu, J. Submatrix localization via message passing. arXiv 2015, arXiv:1510.09219.
30. Bubeck, S.; Cesa-Bianchi, N. Regret analysis of stochastic and nonstochastic multi-armed bandit problems.
Found. Trends Mach. Learn. 2012, 5, 1–122.
31. Cesa-Bianchi, N.; Lugosi, G. Prediction, Learning, and Games; Cambridge University Press: Cambridge, UK, 2006.
32. Lai, T.L.; Robbins, H. Asymptotically efficient adaptive allocation rules. Adv. Appl. Math. 1985, 6, 4–22.
33. Auer, P.; Cesa-Bianchi, N.; Fischer, P. Finite-time analysis of the multiarmed bandit problem. Mach. Learn.
2002, 47, 235–256.
34. Anthony, M.; Bartlett, P.L. Neural Network Learning: Theoretical Foundations; Cambridge University Press:
Cambridge, UK, 1999.
35. Even-Dar, E.; Mannor, S.; Mansour, Y. PAC bounds for multi-armed bandit and Markov decision processes.
In Proceedings of the Fifteenth Annual Conference on Computational Learning Theory, Sydney, Australia,
8–10 July 2002; Springer: Berlin, Germany, 2002; pp. 255–270.
36. Mannor, S.; Tsitsiklis, J.N. The sample complexity of exploration in the multi-armed bandit problem.
J. Mach. Learn. Res. 2004, 5, 623–648.
37. Auer, P.; Cesa-Bianchi, N.; Freund, Y.; Schapire, R.E. The nonstochastic multiarmed bandit problem.
SIAM J. Comput. 2002, 32, 48–77.
38. Maillard, O.; Munos, R. Adaptive bandits: Towards the best history-dependent strategy. In Proceedings of
the Fourteenth International Conference on Artificial Intelligence and Statistics, Ft. Lauderdale, FL, USA,
11–13 April 2011; pp. 570–578.
39. Breiman, L.; Friedman, J.; Stone, C.J.; Olshen, R.A. Classification and Regression Trees; CRC Press: Boca Raton,
FL, USA, 1984.
40. Hyvärinen, A.; Karhunen, J.; Oja, E. ICA by Minimization of Mutual Information. In Independent Component
Analysis; John Wiley & Sons, Inc.: New York, NY, USA, 2002; pp. 221–227.
41. Janzing, D.; Mooij, J.; Zhang, K.; Lemeire, J.; Zscheischler, J.; Daniušis, P.; Steudel, B.; Schölkopf, B.
Information-geometric approach to inferring causal directions. Artif. Intell. 2012, 182, 1–31.
42. Peng, H.; Long, F.; Ding, C. Feature selection based on mutual information criteria of max-dependency,
max-relevance, and min-redundancy. IEEE Trans. Pattern Anal. Mach. Intell. 2005, 27, 1226–1238.
43. Maes, F.; Collignon, A.; Vandermeulen, D.; Marchal, G.; Suetens, P. Multimodality image registration by
maximization of mutual information. IEEE Trans. Med. Imaging 1997, 16, 187–198.
44. Wolpert, D.H.; Wolf, D.R. Estimating functions of probability distributions from a finite set of samples.
Phys. Rev. E 1995, 52, 6841.
45. Paninski, L. Estimation of entropy and mutual information. Neural Comput. 2003, 15, 1191–1253.
46. Valiant, G.; Valiant, P. Estimating the unseen: An n/ log(n)-sample estimator for entropy and support size,
shown optimal via new CLTs. In Proceedings of the Forty-Third Annual ACM Symposium on Theory of
Computing, San Jose, CA, USA, 6–8 June 2011; pp. 685–694.
47. Jiao, J.; Venkat, K.; Han, Y.; Weissman, T. Minimax estimation of functionals of discrete distributions.
IEEE Trans. Inf. Theory 2015, 61, 2835–2885.
© 2017 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access
article distributed under the terms and conditions of the Creative Commons Attribution
(CC BY) license (http://creativecommons.org/licenses/by/4.0/).