Sparsity-Based Generalization Bounds For Predictive Sparse Coding
Contributions  We provide a learning bound for the overcomplete setting in predictive sparse coding, where the dictionary size, or number of learned features, k exceeds the ambient dimension d. The bound holds provided the size m of the training sample is large enough, where the critical size for the bound to kick in depends on a certain notion of stability of the learned representation. This work's core contributions are:

• Under mild conditions, a stability bound for the LASSO (Tibshirani, 1996) under dictionary perturbations (Theorem 4).

• In the overcomplete setting, a learning bound that is essentially of order
$$\sqrt{\frac{dk}{m}} \;+\; \frac{\sqrt{s}}{m\,\lambda\,\mu_s(D)},$$
where each sparse code has at most s non-zero components (Theorem 5). The term 1/µ_s(D) is the inverse s-incoherence (see Definition 1) and is roughly the worst condition number among all linear systems induced by taking s columns of D.

The stability of the sparse codes is absolutely crucial to this work. Proving that the notion of stability of contribution 1 holds is highly nontrivial because the LASSO objective (see (1) below) is not strongly convex in general. Consequently, much of the technical difficulty of this work is owed to finding conditions under which the LASSO is stable under dictionary perturbations and to proving that when these conditions hold with respect to the learned hypothesis and the training sample, they also hold with respect to a future sample.

1.1. The predictive sparse coding problem

Let P be a probability measure over B_{R^d} × Y, the product of an input space B_{R^d} (the unit ball of R^d) and a space Y of univariate labels; examples of Y include a bounded subset of R for regression and {−1, 1} for classification. Let z = (z_1, . . . , z_m) be a sample of m points drawn iid from P, where each labeled point z_i equals (x_i, y_i) for x_i ∈ B_{R^d} and y_i ∈ Y. In the reconstructive setting, labels are not of interest and we can just as well consider an unlabeled sample x of m points drawn iid from the marginal probability measure Π on B_{R^d}.

The sparse coding problem is to represent each point x_i as a sparse linear combination of k basis vectors, or atoms, D_1, . . . , D_k. The atoms form the columns of a dictionary D living in a space of dictionaries D := (B_{R^d})^k, for D_i = (D_{i1}, . . . , D_{id})^T in the unit ℓ2 ball. An encoder ϕ_D can be used to express ℓ1 sparse coding:
$$\varphi_D(x) := \operatorname*{arg\,min}_{z}\; \tfrac12\|x - Dz\|_2^2 + \lambda\|z\|_1; \tag{1}$$
hence, encoding x as ϕ_D(x) amounts to solving a LASSO problem. The reconstructive ℓ1 sparse coding objective is then
$$\min_{D\in\mathcal D}\; \mathbb E_{x\sim\Pi}\,\|x - D\varphi_D(x)\|_2^2 + \lambda\|\varphi_D(x)\|_1.$$
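As a concrete illustration (not part of the original development), the encoder (1) can be computed with any off-the-shelf LASSO solver. The sketch below assumes scikit-learn's Lasso, whose objective is scaled by 1/(2·n_samples); its alpha parameter is therefore set to λ/d to match (1). All function names here are illustrative.

```python
import numpy as np
from sklearn.linear_model import Lasso

def encode(D, x, lam):
    """Sparse code phi_D(x) = argmin_z 0.5*||x - D z||_2^2 + lam*||z||_1  (Eq. 1).

    D has shape (d, k) with atoms (columns) in the unit l2 ball; x has shape (d,).
    sklearn's Lasso minimizes (1/(2*d))*||x - D z||^2 + alpha*||z||_1,
    so alpha = lam / d recovers objective (1) up to a constant factor.
    """
    d = D.shape[0]
    lasso = Lasso(alpha=lam / d, fit_intercept=False, max_iter=10000)
    lasso.fit(D, x)
    return lasso.coef_

# Toy usage: an overcomplete dictionary with k > d.
rng = np.random.default_rng(0)
d, k, lam = 20, 50, 0.1
D = rng.normal(size=(d, k))
D /= np.linalg.norm(D, axis=0)          # atoms in the unit l2 ball
x = rng.normal(size=d)
x /= max(1.0, np.linalg.norm(x))        # x in the unit ball of R^d
z = encode(D, x, lam)
print("support size:", np.count_nonzero(np.abs(z) > 1e-10))
```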
Generalization bounds for the empirical risk minimization (ERM) variant of this objective have been established. In the infinite-dimensional setting, Maurer & Pontil (2010) showed¹ that with probability 1 − δ over the training sample x = (x_1, . . . , x_m):
$$\sup_{D\in\mathcal D}\; \mathbb E_{x\sim\Pi} f_D(x) - \frac1m\sum_{i=1}^m f_D(x_i) \;\le\; \frac{k}{\sqrt m}\left(\frac{14}{\lambda} + \frac12\sqrt{\log(16m/\lambda^2)}\right) + \sqrt{\frac{\log(1/\delta)}{2m}}, \tag{2}$$
where f_D(x) := min_{z∈R^k} ‖x − Dz‖₂² + λ‖z‖₁. This bound is independent of the dimension d and hence useful when d ≫ k, as in general Hilbert spaces. They also showed a similar bound in the overcomplete setting where the k is replaced by √(dk). Vainsencher et al. (2011) handled the overcomplete setting, producing a bound that is O(√(dk/m)) as well as fast rates of O(dk/m), with only logarithmic dependence on 1/λ.

¹ To see this, take Theorem 1.2 of Maurer & Pontil (2010) with Y = {y ∈ R^k : ‖y‖₁ < 1/λ} and T = {T : R^k → R^d : ‖T e_j‖ ≤ 1, j ∈ [k]}, so that ‖T‖_Y ≤ 1/λ.

Predictive sparse coding (Mairal et al., 2012) minimizes a supervised loss with respect to a representation and an estimator linear in the representation. Let W be a space of linear hypotheses with W := rB_{R^k}, the ball in R^k scaled to radius r. A predictive sparse coding hypothesis function f is identified by f = (D, w) ∈ D × W and defined as f(x) = ⟨w, ϕ_D(x)⟩. The function class F is the set of such hypotheses. The loss will be measured via l : Y × R → [0, b], b > 0, a bounded loss function that is L-Lipschitz in its second argument.

The predictive sparse coding objective is²
$$\min_{D\in\mathcal D,\,w\in\mathcal W}\; \mathbb E_{(x,y)\sim P}\, l\big(y, \langle w, \varphi_D(x)\rangle\big) + \frac1r\|w\|_2^2. \tag{3}$$

² While the focus of this work is (3), formally the predictive sparse coding framework admits swapping out the squared ℓ2 norm regularizer on w for any other regularizer.

In this work, we analyze the ERM variant of (3):
$$\min_{D\in\mathcal D,\,w\in\mathcal W}\; \frac1m\sum_{i=1}^m l\big(y_i, \langle w, \varphi_D(x_i)\rangle\big) + \frac1r\|w\|_2^2. \tag{4}$$
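A direct (if slow) way to evaluate the empirical objective (4) for a fixed hypothesis (D, w) is sketched below; it is illustrative only, reuses the hypothetical encode helper from the previous snippet, and takes the loss as any bounded, Lipschitz function supplied by the caller.

```python
import numpy as np

def erm_objective(D, w, X, y, lam, r, loss):
    """Empirical predictive sparse coding objective (4) for a fixed f = (D, w):
    average loss of the predictions <w, phi_D(x_i)> plus the (1/r)*||w||_2^2 penalty.

    X: (m, d) array of inputs, y: length-m labels,
    loss: callable (y_i, yhat) -> value in [0, b].
    """
    preds = [float(np.dot(w, encode(D, x_i, lam))) for x_i in X]
    data_term = np.mean([loss(y_i, p) for y_i, p in zip(y, preds)])
    return data_term + np.dot(w, w) / r
```

The snippet only evaluates (4) at a candidate hypothesis; it does not attempt the joint minimization over (D, w).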
This objective is not convex, and it is unclear how to find global minima, so a priori we cannot say whether an optimal or nearly optimal hypothesis will be returned by any learning algorithm. However, we can and will bet on certain sparsity-related stability properties holding with respect to the learned hypothesis and the training sample.
Consequently, the presented learning bound will hold uniformly not over the set of all hypotheses but rather over potentially much smaller random subclasses of hypotheses. The presented bound will be algorithm-independent³, but algorithm design can influence the learned hypothesis's stability and hence the best a posteriori learning bound.

³ Empirically, stochastic gradient approaches such as the one of Mairal et al. (2012) perform quite well.

Encoder stability  Defining the encoder (1) via the ℓ1 sparsity-inducing regularizer (or sparsifier) is just one choice for the encoder. The choice of sparsifier seems to be pivotal both from an empirical perspective and a theoretical one. Bradley & Bagnell (2009) used a differentiable approximate sparsifier based on the Kullback-Leibler divergence (true sparsity may not result). The ℓ1 sparsifier ‖·‖₁ is the most popular and notably is the tightest convex lower bound for the ℓ0 "norm" ‖x‖₀ := |{i : x_i ≠ 0}| (Fazel, 2002). Regrettably, from a stability perspective the ℓ1 sparsifier is not well-behaved in general. Indeed, due to the lack of strict convexity, each x need not have a unique image under ϕ_D. It also is unclear how to analyze the class of mappings ϕ_D, parameterized by D, if the map changes drastically under small perturbations to D. Hence, we will begin by establishing sufficient conditions under which ϕ_D is stable under perturbations to D.

2. Conditions and main result

In this section, we develop several quantities that are central to the statement of the main result. Throughout this paper, let [n] := {1, . . . , n} for n ∈ N. Also, for t ∈ R^k, define supp(t) := {i ∈ [k] : t_i ≠ 0}.

Definition 1 (s-incoherence)  For s ∈ [k] and D ∈ D, the s-incoherence µ_s(D) is defined as the square of the minimum singular value among s-atom subdictionaries of D. Formally,
$$\mu_s(D) = \min\big\{\varsigma_s(D_\Lambda)^2 : \Lambda \subseteq [k],\, |\Lambda| = s\big\},$$
where ς_s(A) is the s th singular value of A.

The s-incoherence can be used to guarantee that sparse codes are stable in a certain sense. We also introduce some key parameter-and-data-dependent properties. The first property regards the sparsity of the encoder on a sample x = (x_1, . . . , x_m).

Definition 2 (s-sparsity)  If every point x_i in the set of points x satisfies ‖ϕ_D(x_i)‖₀ ≤ s, then ϕ_D is s-sparse on x. More concisely, the boolean expression s-sparse(ϕ_D(x)) is true.

This property is critical as the learning bound will exploit the observed sparsity level over the training sample. Finally, we require some margin properties.

Definition 3 (s-margin)  Given a dictionary D and a point x_i ∈ B_{R^d}, the s-margin of D on x_i is
$$\mathrm{margin}_s(D, x_i) := \max_{\substack{I\subseteq[k]\\|I|=k-s}}\;\min_{j\in I}\;\big\{\lambda - |\langle D_j,\, x_i - D\varphi_D(x_i)\rangle|\big\}.$$
The sample s-margin is the maximum s-margin that holds for all points in x, or the s-margin of D on x:
$$\mathrm{margin}_s(D, \mathbf{x}) := \min_{x_i\in\mathbf{x}}\, \mathrm{margin}_s(D, x_i).$$

The importance of the s-margin properties flows directly from the upcoming Sparse Coding Stability Theorem (Theorem 4). Intuitively, if the s-margin of D on x is high, there is a set of (k − s) inactive atoms that correlate poorly with the optimal residual x − Dϕ_D(x); hence these k − s atoms are far from being included in the set of active atoms. More formally, margin_s(D, x_i) is equal to the (s+1) th smallest element of the set of k elements {λ − |⟨D_j, x_i − Dϕ_D(x_i)⟩|}_{j∈[k]}. Note that if ‖ϕ_D(x_i)‖₀ = s, we can use the (s + ρ)-margin for any integer ρ ≥ 0. Indeed, ρ > 0 is justified when ϕ_D(x_i) has only s non-zero dimensions but, for precisely one index j* outside the support set, |⟨D_{j*}, x_i − Dϕ_D(x_i)⟩| is arbitrarily close to λ. In this scenario, the s-margin of D on x_i is trivially small; however, the (s+1)-margin is non-trivial because the max in the definition of the margin will remove j* from the min's choices I. Empirical evidence shown in Section 5 suggests that even when ρ is small, the (s + ρ)-margin is not too small.
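All of the quantities in Definitions 1–3 are directly computable for a given (D, x, λ). The sketch below is illustrative only: it reuses the hypothetical encode helper from Section 1.1 and brute-forces the minimum over s-atom subdictionaries, which is feasible only for small k and s.

```python
import numpy as np
from itertools import combinations

def s_incoherence(D, s):
    """mu_s(D) from Definition 1: the squared smallest singular value over all
    s-atom subdictionaries.  Exhaustive search, exponential in s; small k only."""
    _, k = D.shape
    return min(np.linalg.svd(D[:, list(idx)], compute_uv=False)[-1] ** 2
               for idx in combinations(range(k), s))

def s_margin_point(D, x, s, lam):
    """margin_s(D, x_i) from Definition 3, computed as the (s+1)-th smallest
    element of {lam - |<D_j, x - D phi_D(x)>|} over the k atoms."""
    residual = x - D @ encode(D, x, lam)
    return np.sort(lam - np.abs(D.T @ residual))[s]

def s_margin_sample(D, X, s, lam):
    """Sample s-margin margin_s(D, x): the minimum point margin over the sample."""
    return min(s_margin_point(D, x, s, lam) for x in X)
```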
Sparse coding stability  Our first result is a fundamental stability result for the LASSO. In addition to being critical in motivating the presented conditions, the result may be of interest in its own right.

Theorem 4 (Sparse Coding Stability)  Let D, D̃ ∈ D satisfy µ_s(D), µ_s(D̃) ≥ µ and ‖D − D̃‖₂ ≤ ε, and let x ∈ B_{R^d}. Suppose that there exists an index set I ⊆ [k] of k − s indices such that for all i ∈ I:
$$|\langle D_i,\, x - D\varphi_D(x)\rangle| < \lambda - \tau, \tag{5}$$
with
$$\varepsilon \le \frac{\tau^2\lambda}{27}. \tag{6}$$
Then the following stability bound holds:
$$\|\varphi_D(x) - \varphi_{\tilde D}(x)\|_2 \le \frac{3\,\varepsilon\sqrt{s}}{2\,\lambda\mu}.$$
Moreover, if $\varepsilon = \tau'^2\lambda/27$ for τ′ < τ, then for all i ∈ I:
$$|\langle \tilde D_i,\, x - \tilde D\varphi_{\tilde D}(x)\rangle| \le \lambda - (\tau - \tau').$$
Thus, some margin, and hence sparsity, is retained after perturbation.
Condition (5) means that at least k − s inactive atoms in the coding ϕ_D(x) do not have too high absolute correlation with the residual x − Dϕ_D(x). We refer to the right-hand side of (6) as the permissible radius of perturbation (PRP) because it is the largest perturbation for which the theorem can guarantee encoder stability. In short, the theorem says that if problem (1) admits a stable sparse solution, then a small perturbation to the dictionary will not change the fact that a certain set of k − s atoms remains inactive in the new solution.
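Theorem 4 can also be probed numerically. The sketch below (again illustrative, building on the hypothetical helpers above and taking τ to be the observed s-margin) perturbs D within the permissible radius of perturbation and compares the change in the code against the bound 3ε√s/(2λµ).

```python
import numpy as np

def check_sparse_coding_stability(D, x, s, lam, frac=0.5, seed=0):
    """Perturb D by a matrix of operator norm eps inside the PRP (condition (6))
    and compare ||phi_D(x) - phi_Dtilde(x)||_2 with the Theorem 4 bound.
    (A strict check would also project perturbed atoms back into the unit ball.)"""
    rng = np.random.default_rng(seed)
    tau = s_margin_point(D, x, s, lam)        # condition (5) holds with this tau
    eps = frac * tau ** 2 * lam / 27.0        # stay strictly inside the PRP
    E = rng.normal(size=D.shape)
    E *= eps / np.linalg.norm(E, 2)           # scale to operator norm eps
    D_tilde = D + E
    mu = min(s_incoherence(D, s), s_incoherence(D_tilde, s))
    lhs = np.linalg.norm(encode(D, x, lam) - encode(D_tilde, x, lam))
    rhs = 3 * eps * np.sqrt(s) / (2 * lam * mu)
    return lhs, rhs   # expect lhs <= rhs when the conditions of Theorem 4 hold
```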
The proof of Theorem 4 is quite long; we leave all but the following high-level sketch to Appendix A.

Proof sketch  First, we show that the solution ϕ_D̃(x) is s-sparse and, in particular, has support contained in the complement of I. Second, we reframe the LASSO as a quadratic program (QP). By exploiting the convexity of the QP and the fact that both solutions have their support contained in a set of s atoms, simple linear algebra yields the desired stability bound. The first step appears much more difficult than the second. The quartet below is our strategy for the first step:

1. optimal value stability: The two problems' optimal objective values are close; this is an easy consequence of the closeness of D and D̃.

2. stability of norm of reconstructor: The norms of the optimal reconstructors (Dϕ_D(x) and D̃ϕ_D̃(x)) of the two problems are close. We show this using optimal value stability and
$$(x - D\varphi_D(x))^{\mathsf T} D\varphi_D(x) = \lambda\|\varphi_D(x)\|_1, \tag{7}$$
the latter of which holds due to the subgradient of (1) with respect to z (Osborne et al., 2000).

3. reconstructor stability: The optimal reconstructors of the two problems are close. This fact is a consequence of stability of norm of reconstructor, using the ℓ1 norm's convexity and the equality (7).

4. preservation of sparsity: The solution to the perturbed problem also is supported on the complement of I. To show this, it is sufficient to show that the absolute correlation of each atom D̃_i (i ∈ I) with the residual in the perturbed problem is less than λ. This last claim is a relatively easy consequence of reconstructor stability.

2.1. Main result

Some notation will aid the result below and the subsequent analysis. Recall that the loss l is bounded by b and L-Lipschitz in its second argument. Also recall that F is the set of predictive sparse coding hypothesis functions f(x) = ⟨w, ϕ_D(x)⟩ indexed by D ∈ D and w ∈ W. For f ∈ F, define l(·, f) : Y × R^d → [0, b] as the loss-composed function (y, x) ↦ l(y, f(x)). Let l ◦ F be the class of such functions induced by the choice of F and l. A probability measure P operates on functions and loss-composed functions as:
$$P f = \mathbb E_{(x,y)\sim P}\, f(x) \qquad\quad P\, l(\cdot, f) = \mathbb E_{(x,y)\sim P}\, l(y, f(x)).$$
Similarly, an empirical measure P_z associated with sample z operates on functions and loss-composed functions as:
$$P_{\mathbf z} f = \frac1m\sum_{i=1}^m f(x_i) \qquad\quad P_{\mathbf z}\, l(\cdot, f) = \frac1m\sum_{i=1}^m l(y_i, f(x_i)).$$

Classically speaking, the overcomplete setting is the modus operandi in sparse coding. In this setting, an overcomplete basis is learned which will be used parsimoniously in coding individual points. The next result bounds the generalization error in the overcomplete setting. The Õ(·) notation hides log(log(·)) terms and assumes that r ≤ m^{min{d,k}}.

Theorem 5  With probability at least 1 − δ over z ∼ P^m, for any s ∈ [k] and any f = (D, w) ∈ F satisfying s-sparse(ϕ_D(x)) and m > 243/(margin_s²(D, x)·λ), the generalization error (P − P_z)l(·, f) is
$$\tilde O\!\left(b\sqrt{\frac{dk\log m + \log\frac1\delta}{m}} \;+\; \frac{b}{m}\sqrt{dk\log\frac{1}{\mathrm{margin}_s^2(D,\mathbf x)\cdot\lambda}} \;+\; \frac{L}{m}\cdot\frac{r\sqrt{s}}{\lambda\mu_s(D)}\right). \tag{8}$$
Note that this bound also applies to the particular hypothesis learned from the training sample.
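To get a feel for how (8) behaves, the three terms inside the Õ(·) can be tabulated (up to the hidden constants and log log factors) as functions of the sample size and the data-dependent quantities; the helper below is a rough illustration of those scalings, not a statement of the exact constants.

```python
import numpy as np

def theorem5_terms(b, L, r, d, k, m, s, lam, margin, mu, delta=0.05):
    """The three terms inside the O-tilde of (8), up to hidden constants:
    a sqrt(dk log m / m) complexity term, a (1/m) covering term driven by the
    sample s-margin, and a (1/m) encoder-stability term driven by mu_s(D).
    Assumes margin**2 * lam < 1, as is typical."""
    term1 = b * np.sqrt((d * k * np.log(m) + np.log(1.0 / delta)) / m)
    term2 = (b / m) * np.sqrt(d * k * np.log(1.0 / (margin ** 2 * lam)))
    term3 = (L / m) * r * np.sqrt(s) / (lam * mu)
    return term1, term2, term3
```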
Discussion of Theorem 5  The theorem highlights the central role of the stability of the sparse encoder. The bound is data-dependent and exploits properties related to the training sample and the learned hypothesis. Since k ≥ d in the overcomplete setting, an ideal learning bound has minimal dependence on k. The $1/\sqrt m$ term of the learning bound (8) exhibits square-root dependence on both the size of the dictionary k and the ambient dimension d. It is unclear whether further improvement is possible, even in the reconstructive setting. The two known results in the reconstructive setting were established by Maurer & Pontil (2010) and later by Vainsencher et al. (2011).
Contrasting the predictive setting with the reconstructive setting, the first term of (8) matches the slower of the rates shown by Vainsencher et al. (2011) for the unsupervised case. Vainsencher et al. also showed fast rates of dk/m (plus a small fraction of the observed empirical risk), but in the predictive setting it is an open question whether similar fast rates are possible. The second term of (8) represents the error in approximating the estimator via an (ε = 1/m)-cover of the space of dictionaries. This term reflects the stability of the sparse codes with respect to dictionary perturbations, as quantified by the Sparse Coding Stability Theorem (Theorem 4). The reason for the lower bound on m is that the ε-net used to approximate the space of dictionaries needs to be fine enough to satisfy the PRP condition (6) of the Sparse Coding Stability Theorem. Hence, both this lower bound and the second term are determined primarily by the Sparse Coding Stability Theorem, and so with this proof strategy the extent to which the Sparse Coding Stability Theorem cannot be improved also indicates the extent to which Theorem 5 cannot be improved.

Critically, encoder stability is not necessary in the reconstructive setting because stability in loss (reconstruction error) requires only stability in the norm of the residual of the LASSO problem rather than stability in the value of the solution to the problem. Stability of the norm of the residual is readily obtainable without any of the incoherence, sparsity, and margin conditions used here.

Remarks on conditions  One may wonder about typical values for the various hypothesis-and-data-dependent properties in Theorem 5. In practical applications of reconstructive and predictive sparse coding, the regularization parameter λ is set to ensure that s is small relative to the dimension d. As a result, the incoherence µ_s(D) of the learned dictionary can be expected to be bounded away from zero. A sufficiently large s-incoherence certainly is necessary if one hopes for any amount of stability of the class of sparse coders with respect to dictionary perturbations. Since our path to reaching Theorem 5 passes through the Sparse Coding Stability Theorem (Theorem 4), it seems that a drastically different strategy needs to be used if it is possible to avoid dependence on µ_s(D) in the learning bounds.

A curious aspect of the learning bound is its dependence on the s-margin margin_s(D, x). Suppose a dictionary is learned which is s-sparse on the training sample x, and s is the lowest such integer for which this holds. It may not always be the case that the s-margin is bounded away from zero, because for some points a small collection of inactive atoms may be very close to being brought into the optimal solution (the code); however, we can instead use the (s+ρ)-margin for some small positive integer ρ for which the (s+ρ)-margin is non-trivial. In Section 5 we show empirical evidence that such a non-trivial (s+ρ)-margin does exist, with ρ small, when learning predictive sparse codes on real data. Hence, there is evidence that predictive sparse coding learns a dictionary with high s-incoherence µ_s(D) and non-trivial s-margin margin_s(D, x) on the training sample for low s.
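In code, inspecting the (s + ρ)-margin amounts to a short loop over ρ once the sample-margin helper sketched earlier is available; D, X, and lam below stand for a learned dictionary, a training sample, and the regularization parameter, and the snippet is purely illustrative.

```python
import numpy as np

# Smallest s for which the encoder is s-sparse on the sample, then the
# (s + rho)-margin for a few small values of rho (cf. Definitions 2 and 3).
s0 = max(int(np.count_nonzero(np.abs(encode(D, x_i, lam)) > 1e-10)) for x_i in X)
for rho in range(4):
    print(rho, s_margin_sample(D, X, s0 + rho, lam))
```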
3. Tools

As before, let z be a labeled sample of m points (an m-sample) drawn iid from P and z′ be a second (ghost) labeled m-sample drawn iid from P. All epsilon-nets of spaces of dictionaries will use the metric induced by the operator norm ‖·‖₂.

The next result is essentially due to Mendelson & Philips (2004); it applies symmetrization by a ghost sample for random subclasses.

Lemma 6 (Symmetrization by Ghost Sample)  Let F(z) ⊂ F be a random subclass which can depend on a labeled sample z. Recall that z′ is a ghost sample of m points. If m ≥ b²/t², then
$$\Pr_{\mathbf z}\big\{\exists f \in \mathcal F(\mathbf z),\; (P - P_{\mathbf z})\,l(\cdot, f) \ge t\big\} \;\le\; 2\,\Pr_{\mathbf z\mathbf z'}\Big\{\exists f \in \mathcal F(\mathbf z),\; (P_{\mathbf z'} - P_{\mathbf z})\,l(\cdot, f) \ge \frac t2\Big\}.$$

For completeness, this lemma is proved in Appendix B. This symmetrization lemma will shift the analysis of the next section from large deviations of the empirical risk from the expected risk to large deviations between two independent empirical risks.

For a Banach space E of dimension d, the ε-covering numbers of the radius-r ball of E are bounded as N(rB_E, ε) ≤ (4r/ε)^d (Carl & Stephani, 1990, equation (1.1.10)). For spaces of dictionaries obeying some deterministic property, such as D_µ = {D ∈ D : µ_s(D) ≥ µ}, one must be careful to use a proper ε-cover so that the representative elements of the cover also obey the desired property. The following bound relates proper covering numbers to covering numbers (a simple proof is in Vidyasagar 2002, Lemma 2.1): if E is a Banach space and T ⊆ E is a bounded subset, then N_proper(E, ε, T) ≤ N(E, ε/2, T).

Let d, k ∈ N. Define E_µ := {E ∈ (B_{R^d})^k : µ_s(E) ≥ µ} and W := rB_{R^k}. From the above, we have:
Proposition 7  The proper ε-covering number of E_µ is bounded by (8/ε)^{dk}.

Proposition 8  The product of the proper ε-covering number of E_µ and the ε-covering number of W is bounded by
$$\left(\frac{8\,(r/2)^{1/(d+1)}}{\varepsilon}\right)^{(d+1)k}.$$
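Both constants can be checked by a short computation, assuming (as the propositions implicitly do) that (B_{R^d})^k is covered as a dk-dimensional unit ball and that W is covered as rB_{R^k}:

```latex
\mathcal{N}_{\mathrm{proper}}(\mathcal{E}_\mu,\varepsilon)
  \le \mathcal{N}\big((B_{\mathbb{R}^d})^k,\varepsilon/2\big)
  \le \left(\frac{4}{\varepsilon/2}\right)^{dk}
  = \left(\frac{8}{\varepsilon}\right)^{dk},
\qquad
\left(\frac{8}{\varepsilon}\right)^{dk}\left(\frac{4r}{\varepsilon}\right)^{k}
  = \frac{8^{dk}\,8^{k}\,(r/2)^{k}}{\varepsilon^{(d+1)k}}
  = \left(\frac{8\,(r/2)^{1/(d+1)}}{\varepsilon}\right)^{(d+1)k}.
```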
4. Proof of the learning bound

At a high level, our strategy for proving Theorem 5 is to construct an epsilon-net over a subclass of the space of functions F := {f = (D, w) : D ∈ D, w ∈ W} and to show that the metric entropy of this subclass is of order dk. The main difficulty is that an epsilon-net over D need not approximate F to any degree, unless one has a notion of encoder stability. Our analysis effectively will be concerned with only a training sample and a ghost sample, and it is similar in style to the luckiness framework of Shawe-Taylor et al. (1998). If we observe that the sufficient conditions for encoder stability hold true on the training sample, then it is enough to guarantee that most points in a ghost sample also satisfy these conditions (at a weaker level).

4.1. Useful conditions and subclasses

Let x̃ ⊆_η x indicate that x̃ is a subset of x with at most η elements of x removed. This notation is identical to Shawe-Taylor et al. (1998)'s notation from the luckiness framework.

In the RHS of the above, let the event whose probability is being measured be
$$J := \Big\{\mathbf z\mathbf z' : \exists f \in \mathcal F_\mu,\; \big[\mathrm{margin}_s(D, \mathbf x) > \iota\big] \text{ and } (P_{\mathbf z'} - P_{\mathbf z})\,l(\cdot, f) > t/2\Big\}.$$
Define Z as the event that there exists a hypothesis with stable codes on the original sample, in the sense of the Sparse Coding Stability Theorem (Theorem 4), but more than η = η(m, d, k, D, x, δ) points⁴ of the ghost sample have codes that are not guaranteed stable by the Sparse Coding Stability Theorem:
$$Z := \Big\{\mathbf z\mathbf z' : \exists f \in \mathcal F_\mu,\; \big[\mathrm{margin}_s(D, \mathbf x) > \iota\big] \text{ and } \nexists\, \tilde{\mathbf x} \subseteq_\eta \mathbf x'\; \big[\mathrm{margin}_s(D, \tilde{\mathbf x}) > \tfrac13\,\mathrm{margin}_s(D, \mathbf x)\big]\Big\}.$$
Our strategy will be to show that Pr(J) is small by use of the fact that⁵
$$\Pr(J) = \Pr(J \cap \bar Z) + \Pr(J \cap Z) \le \Pr(J \cap \bar Z) + \Pr(Z).$$
We show that Pr(Z) and Pr(J ∩ Z̄) are small in turn.

The imminent Good Ghost Lemma shadows Shawe-Taylor et al.'s (1998) notion of probable smoothness and provides a bound on Pr(Z).

Lemma 10 (Good Ghost)  Fix µ, λ > 0 and s ∈ [k]. With probability at least 1 − δ over an m-sample x ∼ P^m and a second m-sample x′ ∼ P^m, for any D ∈ D_µ for which ϕ_D is s-sparse on x, at least m − η(m, d, k, D, x, δ) points x̃ ⊆ x′ satisfy margin_s(D, x̃) > ⅓ margin_s(D, x), for
$$\eta(m, d, k, D, \mathbf x, \delta) = dk\log\frac{1944}{\mathrm{margin}_s^2(D,\mathbf x)\cdot\lambda} + \log(2m+1) + \log\frac1\delta.$$
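For intuition about how many ghost points the lemma tolerates, η can be evaluated directly; the throwaway helper below simply mirrors the display above.

```python
import numpy as np

def good_ghost_eta(m, d, k, margin, lam, delta):
    """Number of ghost-sample points allowed to violate the weakened margin
    in Lemma 10, as a function of the sample s-margin and the cover size."""
    return (d * k * np.log(1944.0 / (margin ** 2 * lam))
            + np.log(2 * m + 1) + np.log(1.0 / delta))
```

Since η grows only logarithmically in m, the fraction η/m of potentially unstable ghost points vanishes as the sample grows, which is what the bη/m slack in Lemma 11 requires.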
For a fixed dictionary D, let F_D^{marg} := {f_{D,τ}^{marg} : τ ∈ R₊} be the class of threshold functions defined via
$$f_{D,\tau}^{\mathrm{marg}}(x) := \begin{cases} 1 & \text{if } \mathrm{margin}_s(D, x) > \tau,\\ 0 & \text{otherwise.}\end{cases}$$
The VC dimension of the one-dimensional threshold functions is 1, and so VC(F_D^{marg}) = 1. By using the VC dimension of F_D^{marg} and the standard permutation argument of Vapnik & Chervonenkis (1968, Proof of Theorem 2), it follows that for a single, fixed element D′ of the proper cover, with probability at least 1 − δ at most log(2m + 1) + log(1/δ) points from a ghost sample will violate the margin inequality in question. Hence, by the bound on the proper covering numbers provided by Proposition 7, we can guarantee for all candidate members D′ of the cover that with probability 1 − δ at most
$$\eta = dk\log\frac{1944}{\mathrm{margin}_s^2(D,\mathbf x)\cdot\lambda} + \log(2m+1) + \log\frac1\delta$$
points from the ghost sample violate the s-margin inequality. Thus, for an arbitrary D′ in the cover satisfying the conditions of the lemma, with probability 1 − δ at most η(m, d, k, D, x, δ) points from the ghost sample violate margin_s(D′, ·) > ⅔ margin_s(D, x).

Finally, consider the at least m − η points in the ghost sample satisfying margin_s(D′, ·) > ⅔ margin_s(D, x). Since ‖D′ − D‖₂ ≤ (⅓ margin_s(D, x))²·λ/27, the Sparse Coding Stability Theorem (Theorem 4) implies that these points satisfy margin_s(D, ·) > ⅓ margin_s(D, x).

It remains to bound Pr(J ∩ Z̄).

Lemma 11 (Large Deviation on Good Ghost)  Let $\varpi := t/2 - \big(2L\beta + \tfrac{b\eta}{m}\big)$ and $\beta := \frac{\varepsilon}{2\lambda}\Big(1 + \frac{3r\sqrt s}{\mu}\Big)$. Then
$$\Pr(J \cap \bar Z) \le \left(\frac{8\,(r/2)^{1/(d+1)}}{\varepsilon}\right)^{(d+1)k} \exp\!\big(-m\varpi^2/(2b^2)\big).$$

Proof  First, note that the event J ∩ Z̄ is a subset of
$$R := \Big\{\mathbf z\mathbf z' : \exists f \in \mathcal F_\mu,\; \big[\mathrm{margin}_s(D, \mathbf x) > \iota \text{ and } \exists\, \tilde{\mathbf x} \subseteq_\eta \mathbf x',\; \mathrm{margin}_s(D, \tilde{\mathbf x}) > \tfrac13\,\mathrm{margin}_s(D, \mathbf x)\big] \text{ and } \big((P_{\mathbf z'} - P_{\mathbf z})\,l(\cdot, f) > t/2\big)\Big\}.$$
Let F_ε := D_ε × W_ε, where D_ε is a proper ε-cover of D_µ and W_ε is an ε-cover of W. It is sufficient to bound the probability of a large deviation for all of F_ε and to then consider the maximum difference between an element of F̃(x, x′) and its closest representative in F_ε. Clearly, for each f = (D, w) ∈ F̃(x, x′), there is an f′ = (D′, w′) ∈ F_ε satisfying ‖D − D′‖₂ ≤ ε and ‖w − w′‖₂ ≤ ε. If ε is sufficiently small, then for all but η of the points x_i in the ghost sample (and for all points x_i of the original sample) it is guaranteed that
$$|\langle w, \varphi_D(x_i)\rangle - \langle w', \varphi_{D'}(x_i)\rangle| \;\le\; |\langle w - w', \varphi_D(x_i)\rangle| + |\langle w', \varphi_D(x_i) - \varphi_{D'}(x_i)\rangle| \;\le\; \frac{\varepsilon}{2\lambda} + r\cdot\frac{3\varepsilon\sqrt s}{2\lambda\mu} \;=\; \frac{\varepsilon}{2\lambda}\left(1 + \frac{3r\sqrt s}{\mu}\right) \;=\; \beta,$$
where the second inequality follows from the Sparse Coding Stability Theorem (Theorem 4) and from ‖ϕ_D(x_i)‖₂ ≤ ‖ϕ_D(x_i)‖₁ ≤ 1/(2λ), which holds because the objective value of (1) at ϕ_D(x_i) is at most ½‖x_i‖₂² ≤ ½. Trivially, for the rest of the points x_i in the ghost sample each loss is bounded by b. Hence, on the original sample:
$$\frac1m\sum_{i=1}^m \big|l(y_i, \langle w, \varphi_D(x_i)\rangle) - l(y_i, \langle w', \varphi_{D'}(x_i)\rangle)\big| \le L\beta,$$
and on the ghost sample:
$$\frac1m\sum_{i=1}^m \big|l(y_i', \langle w, \varphi_D(x_i')\rangle) - l(y_i', \langle w', \varphi_{D'}(x_i')\rangle)\big| \;\le\; \frac{L}{m}\sum_{i\,\mathrm{good}} \big|\langle w, \varphi_D(x_i')\rangle - \langle w', \varphi_{D'}(x_i')\rangle\big| \;+\; \frac1m\sum_{i\,\mathrm{bad}} \big|l(y_i', \langle w, \varphi_D(x_i')\rangle) - l(y_i', \langle w', \varphi_{D'}(x_i')\rangle)\big| \;\le\; L\beta + \frac{b\eta}{m},$$
where good denotes the (at least m − η) points of the ghost sample for which the Sparse Coding Stability Theorem applies, and bad is the complement thereof.

Concluding the above argument, the difference between the losses of f and f′ on the double sample is at most 2Lβ + bη/m. Consequently, if (P_{z′} − P_z)l(·, f) > t/2, then the deviation between the loss of f′ on the original sample and the loss of f′ on the ghost sample must be at least t/2 − (2Lβ + bη/m). To bound the probability of R it therefore is sufficient to control
$$\Pr_{\mathbf z\mathbf z'}\Big\{\exists f = (D', w') \in \mathcal D_\varepsilon \times \mathcal W_\varepsilon :\; (P_{\mathbf z'} - P_{\mathbf z})\,l(\cdot, f) > t/2 - \big(2L\beta + \tfrac{b\eta}{m}\big)\Big\}.$$