Sparsity-Based Generalization Bounds for Predictive Sparse Coding

Nishant A. Mehta
Alexander G. Gray
College of Computing, Georgia Institute of Technology, Atlanta, GA 30332, USA

Abstract

The goal of predictive sparse coding is to learn a representation of examples as sparse linear combinations of elements from a dictionary, such that a learned hypothesis linear in the new representation performs well on a predictive task. Predictive sparse coding has demonstrated impressive performance on a variety of supervised tasks, but its generalization properties have not been studied. We establish the first generalization error bounds for predictive sparse coding, in the overcomplete setting, where the number of features k exceeds the original dimensionality d. The learning bound decays as Õ(√(dk/m)) with respect to d, k, and the size m of the training sample. It depends intimately on stability properties of the learned sparse encoder, as measured on the training sample. Consequently, we also present a fundamental stability result for the LASSO, a result that characterizes the stability of the sparse codes with respect to dictionary perturbations.

1. Introduction

Learning architectures such as the support vector machine and other linear predictors enjoy strong theoretical properties (Steinwart & Christmann, 2008; Kakade et al., 2009), but a learning-theoretic view of many more complex learning architectures is lacking. Predictive methods based on sparse coding recently have emerged which simultaneously learn a data representation via a nonlinear encoding scheme and an estimator linear in that representation (Bradley & Bagnell, 2009; Mairal et al., 2009; 2012). A sparse coding representation z ∈ R^k of a data point x ∈ R^d is learned by representing x as a sparse linear combination of k atoms D_j ∈ R^d of a dictionary D = (D_1, . . . , D_k) ∈ R^{d×k}. In the coding x ≈ Σ_{j=1}^k z_j D_j, all but a few z_j are zero.

Predictive sparse coding methods such as Mairal et al.'s (2012) task-driven dictionary learning have recently achieved state-of-the-art results on many tasks, including the MNIST digits task. Whereas standard sparse coding minimizes an unsupervised, reconstructive ℓ2 loss, predictive sparse coding seeks to minimize a supervised loss by learning a dictionary and a linear predictor in the space of codes induced by that dictionary. There is much empirical evidence that sparse coding can provide good abstraction by finding higher-level representations which are useful in predictive tasks (Yu et al., 2009). Intuitively, the power of prediction-driven dictionaries is that they pack more atoms into the parts of the representational space where the prediction task is more difficult. However, despite the empirical successes of predictive sparse coding, it is unknown how well it generalizes in a theoretical sense.

In this work, we develop what are to our knowledge the first generalization error bounds for predictive sparse coding algorithms; in particular, we focus on ℓ1-regularized sparse coding. Maurer & Pontil (2010) and Vainsencher et al. (2011) previously established generalization bounds for the classical, reconstructive sparse coding setting. Extending their analysis to the predictive setting introduces certain difficulties related to the complexity of the class of sparse encoders. Whereas in the reconstructive setting this complexity can be controlled directly by exploiting the stability of the reconstruction error to dictionary perturbations, in the predictive setting it appears that the complexity hinges upon the stability of the sparse codes themselves to dictionary perturbations. This latter notion of stability is much harder to prove; moreover, it can be realized only with additional assumptions which depend on the dictionary, the data, and their interaction (see Theorem 4). Furthermore, when the assumptions hold for the learned dictionary and data, we also need to guarantee that the assumptions hold on a newly drawn sample.
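To make the encoding and prediction pipeline concrete, the following minimal Python sketch computes an ℓ1 sparse code of a single point by proximal gradient descent (ISTA) and evaluates a linear predictor on that code, i.e. the hypothesis form f(x) = ⟨w, ϕ_D(x)⟩ studied below. This is an illustration only, under assumptions of our own: the function names, the toy random data, and the choice of ISTA as the solver are ours, and the actual task-driven approach of Mairal et al. (2012) learns D and w jointly by stochastic gradient descent rather than using a fixed random dictionary.

import numpy as np

def sparse_code(D, x, lam, n_iter=500):
    # Approximate phi_D(x) = argmin_z 0.5*||x - D z||_2^2 + lam*||z||_1 via ISTA.
    # D: (d, k) dictionary with unit-norm atoms (columns); x: point in the unit ball of R^d.
    k = D.shape[1]
    z = np.zeros(k)
    step = 1.0 / np.linalg.norm(D, 2) ** 2          # 1 / Lipschitz constant of the smooth part
    for _ in range(n_iter):
        z = z - step * (D.T @ (D @ z - x))          # gradient step on 0.5*||x - D z||^2
        z = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)   # soft-thresholding (prox of lam*||.||_1)
    return z

def predict(D, w, x, lam):
    # Predictive sparse coding hypothesis f(x) = <w, phi_D(x)>.
    return float(w @ sparse_code(D, x, lam))

# Toy usage with a random dictionary and weight vector (illustration only).
rng = np.random.default_rng(0)
d, k, lam = 20, 50, 0.2
D = rng.standard_normal((d, k))
D /= np.linalg.norm(D, axis=0)                      # atoms in the unit l2 ball
x = rng.standard_normal(d)
x /= np.linalg.norm(x)
w = rng.standard_normal(k)
z = sparse_code(D, x, lam)
print("non-zero coefficients:", int(np.count_nonzero(np.abs(z) > 1e-8)), " prediction:", predict(D, w, x, lam))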
Contributions  We provide a learning bound for the overcomplete setting in predictive sparse coding, where the dictionary size, or number of learned features, k exceeds the ambient dimension d. The bound holds provided the size m of the training sample is large enough, where the critical size for the bound to kick in depends on a certain notion of stability of the learned representation. This work's core contributions are:

• Under mild conditions, a stability bound for the LASSO (Tibshirani, 1996) under dictionary perturbations (Theorem 4).

• In the overcomplete setting, a learning bound that is essentially of order √(dk·s/m) / (λ µ_s(D)), where each sparse code has at most s non-zero components (Theorem 5). The term 1/µ_s(D) is the inverse s-incoherence (see Definition 1) and is roughly the worst condition number among all linear systems induced by taking s columns of D.

The stability of the sparse codes is absolutely crucial to this work. Proving that the notion of stability of contribution 1 holds is highly nontrivial because the LASSO objective (see (1) below) is not strongly convex in general. Consequently, much of the technical difficulty of this work is owed to finding conditions under which the LASSO is stable under dictionary perturbations and proving that, when these conditions hold with respect to the learned hypothesis and the training sample, they also hold with respect to a future sample.

1.1. The predictive sparse coding problem

Let P be a probability measure over B_{R^d} × Y, the product of an input space B_{R^d} (the unit ball of R^d) and a space Y of univariate labels; examples of Y include a bounded subset of R for regression and {−1, 1} for classification. Let z = (z_1, . . . , z_m) be a sample of m points drawn iid from P, where each labeled point z_i equals (x_i, y_i) for x_i ∈ B_{R^d} and y_i ∈ Y. In the reconstructive setting, labels are not of interest and we can just as well consider an unlabeled sample x of m points drawn iid from the marginal probability measure Π on B_{R^d}.

The sparse coding problem is to represent each point x_i as a sparse linear combination of k basis vectors, or atoms, D_1, . . . , D_k. The atoms form the columns of a dictionary D living in a space of dictionaries D := (B_{R^d})^k, for D_i = (D_{i1}, . . . , D_{id})^T in the unit ℓ2 ball. An encoder ϕ_D can be used to express ℓ1 sparse coding:

    ϕ_D(x) := argmin_z (1/2)‖x − Dz‖₂² + λ‖z‖₁;    (1)

hence, encoding x as ϕ_D(x) amounts to solving a LASSO problem. The reconstructive ℓ1 sparse coding objective is then

    min_{D∈D} E_{x∼Π} ‖x − Dϕ_D(x)‖₂² + λ‖ϕ_D(x)‖₁.

Generalization bounds for the empirical risk minimization (ERM) variant of this objective have been established. In the infinite-dimensional setting, Maurer & Pontil (2010) showed¹ that with probability 1 − δ over the training sample x = (x_1, . . . , x_m):

    sup_{D∈D} E_{x∼Π} f_D(x) − (1/m) Σ_{i=1}^m f_D(x_i) ≤ (k/√m) (14/λ + (1/2)√(log(16m/λ²))) + √(log(1/δ)/(2m)),    (2)

where f_D(x) := min_{z∈R^k} ‖x − Dz‖₂² + λ‖z‖₁. This bound is independent of the dimension d and hence useful when d ≫ k, as in general Hilbert spaces. They also showed a similar bound in the overcomplete setting where the k is replaced by √(dk). Vainsencher et al. (2011) handled the overcomplete setting, producing a bound that is O(√(dk/m)) as well as fast rates of O(dk/m), with only logarithmic dependence on 1/λ.

¹ To see this, take Theorem 1.2 of Maurer & Pontil (2010) with Y = {y ∈ R^k : ‖y‖₁ < 1/λ} and T = {T : R^k → R^d : ‖T e_j‖ ≤ 1, j ∈ [k]}, so that ‖T‖_Y ≤ 1/λ.

Predictive sparse coding (Mairal et al., 2012) minimizes a supervised loss with respect to a representation and an estimator linear in the representation. Let W be a space of linear hypotheses with W := rB_{R^k}, the ball in R^k scaled to radius r. A predictive sparse coding hypothesis function f is identified by f = (D, w) ∈ D × W and defined as f(x) = ⟨w, ϕ_D(x)⟩. The function class F is the set of such hypotheses. The loss will be measured via l : Y × R → [0, b], b > 0, a bounded loss function that is L-Lipschitz in its second argument.

The predictive sparse coding objective is²

    min_{D∈D, w∈W} E_{(x,y)∼P} l(y, ⟨w, ϕ_D(x)⟩) + (1/r)‖w‖₂².    (3)

² While the focus of this work is (3), formally the predictive sparse coding framework admits swapping out the squared ℓ2 norm regularizer on w for any other regularizer.

In this work, we analyze the ERM variant of (3):

    min_{D∈D, w∈W} (1/m) Σ_{i=1}^m l(y_i, ⟨w, ϕ_D(x_i)⟩) + (1/r)‖w‖₂².    (4)

This objective is not convex, and it is unclear how to find global minima, so a priori we cannot say whether an optimal or nearly optimal hypothesis will be returned by any learning algorithm. However, we can
and will bet on certain sparsity-related stability properties holding with respect to the learned hypothesis and the training sample. Consequently, the presented learning bound will hold uniformly not over the set of all hypotheses but rather over potentially much smaller random subclasses of hypotheses. The presented bound will be algorithm-independent³, but algorithm design can influence the learned hypothesis's stability and hence the best a posteriori learning bound.

³ Empirically, stochastic gradient approaches such as the one of Mairal et al. (2012) perform quite well.

Encoder stability  Defining the encoder (1) via the ℓ1 sparsity-inducing regularizer (or sparsifier) is just one choice for the encoder. The choice of sparsifier seems to be pivotal both from an empirical perspective and a theoretical one. Bradley & Bagnell (2009) used a differentiable approximate sparsifier based on the Kullback-Leibler divergence (true sparsity may not result). The ℓ1 sparsifier ‖·‖₁ is the most popular and notably is the tightest convex lower bound for the ℓ0 "norm" ‖x‖₀ := |{i : x_i ≠ 0}| (Fazel, 2002). Regrettably, from a stability perspective the ℓ1 sparsifier is not well-behaved in general. Indeed, due to the lack of strict convexity, each x need not have a unique image under ϕ_D. It also is unclear how to analyze the class of mappings ϕ_D, parameterized by D, if the map changes drastically under small perturbations to D. Hence, we will begin by establishing sufficient conditions under which ϕ_D is stable under perturbations to D.

2. Conditions and main result

In this section, we develop several quantities that are central to the statement of the main result. Throughout this paper, let [n] := {1, . . . , n} for n ∈ N. Also, for t ∈ R^k, define supp(t) := {i ∈ [k] : t_i ≠ 0}.

Definition 1 (s-incoherence)  For s ∈ [k] and D ∈ D, the s-incoherence µ_s(D) is defined as the square of the minimum singular value among s-atom subdictionaries of D. Formally,

    µ_s(D) = min {ς_s(D_Λ) : Λ ⊆ [k], |Λ| = s}²,

where ς_s(A) is the s-th singular value of A.

The s-incoherence can be used to guarantee that sparse codes are stable in a certain sense. We also introduce some key parameter-and-data-dependent properties. The first property regards the sparsity of the encoder on a sample x = (x_1, . . . , x_m).

Definition 2 (s-sparsity)  If every point x_i in the set of points x satisfies ‖ϕ_D(x_i)‖₀ ≤ s, then ϕ_D is s-sparse on x. More concisely, the boolean expression s-sparse(ϕ_D(x)) is true.

This property is critical, as the learning bound will exploit the observed sparsity level over the training sample. Finally, we require some margin properties.

Definition 3 (s-margin)  Given a dictionary D and a point x_i ∈ B_{R^d}, the s-margin of D on x_i is

    margin_s(D, x_i) := max_{I⊆[k], |I|=k−s} min_{j∈I} {λ − |⟨D_j, x_i − Dϕ_D(x_i)⟩|}.

The sample s-margin is the maximum s-margin that holds for all points in x, or the s-margin of D on x:

    margin_s(D, x) := min_{x_i∈x} margin_s(D, x_i).

The importance of the s-margin properties flows directly from the upcoming Sparse Coding Stability Theorem (Theorem 4). Intuitively, if the s-margin of D on x is high, there is a set of (k − s) inactive atoms that correlate poorly with the optimal residual x − Dϕ_D(x); hence these k − s atoms are far from being included in the set of active atoms. More formally, margin_s(D, x_i) is equal to the (s+1)-th smallest element of the set of k elements {λ − |⟨D_j, x_i − Dϕ_D(x_i)⟩|}_{j∈[k]}. Note that if ‖ϕ_D(x_i)‖₀ = s, we can use the (s + ρ)-margin for any integer ρ ≥ 0. Indeed, ρ > 0 is justified when ϕ_D(x_i) has only s non-zero dimensions but, for precisely one index j* outside the support set, |⟨D_{j*}, x_i − Dϕ_D(x_i)⟩| is arbitrarily close to λ. In this scenario, the s-margin of D on x_i is trivially small; however, the (s+1)-margin is non-trivial because the max in the definition of the margin will remove j* from the min's choices I. Empirical evidence shown in Section 5 suggests that even when ρ is small, the (s + ρ)-margin is not too small.

Sparse coding stability  Our first result is a fundamental stability result for the LASSO. In addition to being critical in motivating the presented conditions, the result may be of interest in its own right.

Theorem 4 (Sparse Coding Stability)  Let D, D̃ ∈ D satisfy µ_s(D), µ_s(D̃) ≥ µ and ‖D − D̃‖₂ ≤ ε, and let x ∈ B_{R^d}. Suppose that there exists an index set I ⊆ [k] of k − s indices such that for all i ∈ I:

    |⟨D_i, x − Dϕ_D(x)⟩| < λ − τ,    (5)

with

    ε ≤ τ²λ/27.    (6)

Then the following stability bound holds:

    ‖ϕ_D(x) − ϕ_{D̃}(x)‖₂ ≤ (3ε/2) · √s/(λµ).
Moreover, if ε = τ′²λ/27 for τ′ < τ, then for all i ∈ I:

    |⟨D̃_i, x − D̃ϕ_{D̃}(x)⟩| ≤ λ − (τ − τ′).

Thus, some margin, and hence sparsity, is retained after perturbation.

Condition (5) means that at least k − s inactive atoms in the coding ϕ_D(x) do not have too high absolute correlation with the residual x − Dϕ_D(x). We refer to the right-hand side of (6) as the permissible radius of perturbation (PRP) because it is the largest perturbation for which the theorem can guarantee encoder stability. In short, the theorem says that if problem (1) admits a stable sparse solution, then a small perturbation to the dictionary will not change the fact that a certain set of k − s atoms remains inactive in the new solution.

The proof of Theorem 4 is quite long; we leave all but the following high-level sketch to Appendix A.

Proof sketch  First, we show that the solution ϕ_{D̃}(x) is s-sparse and, in particular, has support contained in the complement of I. Second, we reframe the LASSO as a quadratic program (QP). By exploiting the convexity of the QP and the fact that both solutions have their support contained in a set of s atoms, simple linear algebra yields the desired stability bound. The first step appears much more difficult than the second. The quartet below is our strategy for the first step:

1. optimal value stability: The two problems' optimal objective values are close; this is an easy consequence of the closeness of D and D̃.

2. stability of norm of reconstructor: The norms of the optimal reconstructors (Dϕ_D(x) and D̃ϕ_{D̃}(x)) of the two problems are close. We show this using optimal value stability and

    (x − Dϕ_D(x))^T Dϕ_D(x) = λ‖ϕ_D(x)‖₁,    (7)

the latter of which holds due to the subgradient of (1) with respect to z (Osborne et al., 2000).

3. reconstructor stability: The optimal reconstructors of the two problems are close. This fact is a consequence of stability of norm of reconstructor, using the ℓ1 norm's convexity and the equality (7).

4. preservation of sparsity: The solution to the perturbed problem also is supported on the complement of I. To show this, it is sufficient to show that the absolute correlation of each atom D̃_i (i ∈ I) with the residual in the perturbed problem is less than λ. This last claim is a relatively easy consequence of reconstructor stability.

2.1. Main result

Some notation will aid the result below and the subsequent analysis. Recall that the loss l is bounded by b and L-Lipschitz in its second argument. Also recall that F is the set of predictive sparse coding hypothesis functions f(x) = ⟨w, ϕ_D(x)⟩ indexed by D ∈ D and w ∈ W. For f ∈ F, define l(·, f) : Y × R^d → [0, b] as the loss-composed function (y, x) ↦ l(y, f(x)). Let l ◦ F be the class of such functions induced by the choice of F and l. A probability measure P operates on functions and loss-composed functions as:

    P f = E_{(x,y)∼P} f(x),    P l(·, f) = E_{(x,y)∼P} l(y, f(x)).

Similarly, an empirical measure P_z associated with sample z operates on functions and loss-composed functions as:

    P_z f = (1/m) Σ_{i=1}^m f(x_i),    P_z l(·, f) = (1/m) Σ_{i=1}^m l(y_i, f(x_i)).

Classically speaking, the overcomplete setting is the modus operandi in sparse coding. In this setting, an overcomplete basis is learned which will be used parsimoniously in coding individual points. The next result bounds the generalization error in the overcomplete setting. The Õ(·) notation hides log(log(·)) terms and assumes that r ≤ m^{min{d,k}}.

Theorem 5  With probability at least 1 − δ over z ∼ P^m, for any s ∈ [k] and any f = (D, w) ∈ F satisfying s-sparse(ϕ_D(x)) and m > 243/(margin_s(D, x)²·λ), the generalization error (P − P_z)l(·, f) is

    Õ( b √((dk log m + log(1/δ))/m) + (b/m)·dk·log(1/(margin_s(D, x)²·λ)) + (L/m)·r√s/(λ µ_s(D)) ).    (8)

Note that this bound also applies to the particular hypothesis learned from the training sample.

Discussion of Theorem 5  The theorem highlights the central role of the stability of the sparse encoder. The bound is data-dependent and exploits properties related to the training sample and the learned hypothesis. Since k ≥ d in the overcomplete setting, an ideal learning bound has minimal dependence on k. The 1/√m term of the learning bound (8) exhibits square root dependence on both the size of the dictionary k and the ambient dimension d. It is unclear whether further improvement is possible, even in the reconstructive setting. The two known results in the reconstructive setting were established by Maurer & Pontil (2010) and later by Vainsencher et al. (2011).
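All of the data-dependent quantities appearing in Theorem 5 can be computed from a dictionary, a sample, and its codes. The sketch below gives illustrative helper functions of our own (they are not code from the paper): the s-incoherence of Definition 1 by brute-force enumeration of s-atom subdictionaries, the sample s-margin of Definition 3 via its characterization as the (s+1)-th smallest correlation gap (Section 2), and the sample-size threshold m > 243/(margin_s(D, x)²·λ) required by Theorem 5. Here X is an m × d array of training points and Z is the m × k array of their codes ϕ_D(x_i), produced by any LASSO solver (for instance the ISTA sketch above).

import itertools
import numpy as np

def s_incoherence(D, s):
    # mu_s(D) (Definition 1): square of the minimum s-th singular value over all
    # s-atom subdictionaries. Brute force over subsets; only practical for small k and s.
    k = D.shape[1]
    smallest = np.inf
    for atoms in itertools.combinations(range(k), s):
        sv = np.linalg.svd(D[:, atoms], compute_uv=False)[-1]   # smallest singular value
        smallest = min(smallest, sv)
    return smallest ** 2

def point_s_margin(D, x, z, lam, s):
    # margin_s(D, x_i) (Definition 3): the (s+1)-th smallest element of
    # { lam - |<D_j, x - D z>| : j in [k] }, where z = phi_D(x).
    gaps = lam - np.abs(D.T @ (x - D @ z))
    return float(np.sort(gaps)[s])                  # 0-based index s = (s+1)-th smallest

def sample_s_margin(D, X, Z, lam, s):
    # margin_s(D, x): the minimum point s-margin over the training sample.
    return min(point_s_margin(D, x, z, lam, s) for x, z in zip(X, Z))

def theorem5_sample_threshold(sample_margin, lam):
    # Theorem 5 requires m > 243 / (margin_s(D, x)^2 * lam).
    return 243.0 / (sample_margin ** 2 * lam)

For realistically sized dictionaries the exact s-incoherence is combinatorial to compute, so in practice one would bound it or estimate it over sampled subsets; the margin and threshold computations remain cheap.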
Contrasting the predictive setting with the reconstructive setting, the first term of (8) matches the slower of the rates shown by Vainsencher et al. (2011) for the unsupervised case. Vainsencher et al. also showed fast rates of dk/m (plus a small fraction of the observed empirical risk), but in the predictive setting it is an open question whether similar fast rates are possible. The second term of (8) represents the error in approximating the estimator via an (ε = 1/m)-cover of the space of dictionaries. This term reflects the stability of the sparse codes with respect to dictionary perturbations, as quantified by the Sparse Coding Stability Theorem (Theorem 4). The reason for the lower bound on m is that the ε-net used to approximate the space of dictionaries needs to be fine enough to satisfy the PRP condition (6) of the Sparse Coding Stability Theorem. Hence, both this lower bound and the second term are determined primarily by the Sparse Coding Stability Theorem, and so with this proof strategy the extent to which the Sparse Coding Stability Theorem cannot be improved also indicates the extent to which Theorem 5 cannot be improved.

Critically, encoder stability is not necessary in the reconstructive setting because stability in loss (reconstruction error) requires only stability in the norm of the residual of the LASSO problem rather than stability in the value of the solution to the problem. Stability of the norm of the residual is readily obtainable without any of the incoherence, sparsity, and margin conditions used here.

Remarks on conditions  One may wonder about typical values for the various hypothesis-and-data-dependent properties in Theorem 5. In practical applications of reconstructive and predictive sparse coding, the regularization parameter λ is set to ensure that s is small relative to the dimension d. As a result, the incoherence µ_s(D) of the learned dictionary can be expected to be bounded away from zero. A sufficiently large s-incoherence certainly is necessary if one hopes for any amount of stability of the class of sparse coders with respect to dictionary perturbations. Since our path to reaching Theorem 5 passes through the Sparse Coding Stability Theorem (Theorem 4), it seems that a drastically different strategy needs to be used if it is possible to avoid dependence on µ_s(D) in the learning bounds.

A curious aspect of the learning bound is its dependence on the s-margin margin_s(D, x). Suppose a dictionary is learned which is s-sparse on the training sample x, and s is the lowest such integer for which this holds. It may not always be the case that the s-margin is bounded away from zero, because for some points a small collection of inactive atoms may be very close to being brought into the optimal solution (the code); however, we can instead use the (s + ρ)-margin for some small positive integer ρ for which the (s + ρ)-margin is non-trivial. In Section 5 we show empirical evidence that such a non-trivial (s + ρ)-margin does exist, with ρ small, when learning predictive sparse codes on real data. Hence, there is evidence that predictive sparse coding learns a dictionary with high s-incoherence µ_s(D) and non-trivial s-margin margin_s(D, x) on the training sample for low s.

3. Tools

As before, let z be a labeled sample of m points (an m-sample) drawn iid from P and z′ be a second (ghost) labeled m-sample drawn iid from P. All epsilon-nets of spaces of dictionaries will use the metric induced by the operator norm ‖·‖₂.

The next result is essentially due to Mendelson & Philips (2004); it applies symmetrization by a ghost sample for random subclasses.

Lemma 6 (Symmetrization by Ghost Sample)  Let F(z) ⊂ F be a random subclass which can depend on a labeled sample z. Recall that z′ is a ghost sample of m points. If m ≥ (b/t)², then

    Pr_z {∃f ∈ F(z), (P − P_z)l(·, f) ≥ t} ≤ 2 Pr_{zz′} {∃f ∈ F(z), (P_{z′} − P_z)l(·, f) ≥ t/2}.

For completeness, this lemma is proved in Appendix B. This symmetrization lemma will shift the analysis of the next section from large deviations of the empirical risk from the expected risk to large deviations of two independent empirical risks.

For a Banach space E of dimension d, the ε-covering numbers of the radius-r ball of E are bounded as N(rB_E, ε) ≤ (4r/ε)^d (Carl & Stephani, 1990, equation (1.1.10)). For spaces of dictionaries obeying some deterministic property, such as D_µ = {D ∈ D : µ_s(D) ≥ µ}, one must be careful to use a proper ε-cover so that the representative elements of the cover also obey the desired property. The following bound relates proper covering numbers to covering numbers (a simple proof is in Vidyasagar 2002, Lemma 2.1): if E is a Banach space and T ⊆ E is a bounded subset, then N_proper(E, ε, T) ≤ N(E, ε/2, T).

Let d, k ∈ N. Define E_µ := {E ∈ (B_{R^d})^k : µ_s(E) ≥ µ} and W := rB_{R^k}. From the above, we have:
Proposition 7  The proper ε-covering number of E_µ is bounded by (8/ε)^{dk}.

Proposition 8  The product of the proper ε-covering number of E_µ and the ε-covering number of W is bounded by (8(r/2)^{1/(d+1)}/ε)^{(d+1)k}.

4. Proof of the learning bound

At a high level, our strategy for proving Theorem 5 is to construct an epsilon-net over a subclass of the space of functions F := {f = (D, w) : D ∈ D, w ∈ W} and to show that the metric entropy of this subclass is of order dk. The main difficulty is that an epsilon-net over D need not approximate F to any degree, unless one has a notion of encoder stability. Our analysis effectively will be concerned with only a training sample and a ghost sample, and it is similar in style to the luckiness framework of Shawe-Taylor et al. (1998). If we observe that the sufficient conditions for encoder stability hold true on the training sample, then it is enough to guarantee that most points in a ghost sample also satisfy these conditions (at a weaker level).

4.1. Useful conditions and subclasses

Let x̃ ⊆_η x indicate that x̃ is a subset of x with at most η elements of x removed. This notation is identical to Shawe-Taylor et al. (1998)'s notation from the luckiness framework.

Our bound uses a PRP-based condition depending on both the learned dictionary and the training sample:

    margin_s(D, x) ≥ ι(λ, ε)    for    ι(λ, ε) = √(243ε/λ).

For brevity we will refer to ι with its parameters implicit; the dependence on ε and λ will not be an issue because we first develop bounds with these quantities fixed a priori. Lastly, for µ > 0 define a restricted dictionary class D_µ := {D ∈ D : µ_s(D) ≥ µ} and a function class F_µ := {f = (D, w) ∈ F : D ∈ D_µ}.

4.2. Proof of the learning bound

The following proposition is a specialization of Lemma 6 with F(z) := {f ∈ F_µ : margin_s(D, x) > ι}.

Proposition 9  If m ≥ (b/t)², then

    Pr_z {∃f ∈ F_µ, [margin_s(D, x) > ι] and ((P − P_z)l(·, f) > t)}
        ≤ 2 Pr_{zz′} {∃f ∈ F_µ, [margin_s(D, x) > ι] and ((P_{z′} − P_z)l(·, f) > t/2)}.

In the RHS of the above, let the event whose probability is being measured be

    J := {zz′ : ∃f ∈ F_µ, [margin_s(D, x) > ι] and (P_{z′} − P_z)l(·, f) > t/2}.

Define Z as the event that there exists a hypothesis with stable codes on the original sample, in the sense of the Sparse Coding Stability Theorem (Theorem 4), but more than η = η(m, d, k, D, x, δ) points⁴ of the ghost sample have codes that are not guaranteed stable by the Sparse Coding Stability Theorem:

    Z := {zz′ : ∃f ∈ F_µ, [margin_s(D, x) > ι] and (∄ x̃ ⊆_η x′, [margin_s(D, x̃) > (1/3) margin_s(D, x)])}.

⁴ We use the shorthand η = η(m, d, k, D, x, δ).

Our strategy will be to show that Pr(J) is small by use of the fact that⁵

    Pr(J) = Pr(J ∩ Z̄) + Pr(J ∩ Z) ≤ Pr(J ∩ Z̄) + Pr(Z).

We show that Pr(Z) and Pr(J ∩ Z̄) are small in turn.

⁵ Our strategy thus far is similar to the beginning of Shawe-Taylor et al.'s proof of the main luckiness framework learning bound (Shawe-Taylor et al., 1998, Theorem 5.22).

The imminent Good Ghost Lemma shadows Shawe-Taylor et al.'s (1998) notion of probable smoothness and provides a bound on Pr(Z).

Lemma 10 (Good Ghost)  Fix µ, λ > 0 and s ∈ [k]. With probability at least 1 − δ over an m-sample x ∼ P^m and a second m-sample x′ ∼ P^m, for any D ∈ D_µ for which ϕ_D is s-sparse on x, at least m − η(m, d, k, D, x, δ) points x̃ ⊆ x′ satisfy [margin_s(D, x̃) > (1/3) margin_s(D, x)], for

    η := dk log(1944/(margin_s(D, x)²·λ)) + log(2m + 1) + log(1/δ).

Proof  By the assumptions of the lemma, consider an arbitrary dictionary D satisfying µ_s(D) ≥ µ and s-sparse(ϕ_D(x)). The goal is to guarantee with high probability that all but η points of the ghost sample are coded by ϕ_D with s-margin of at least (1/3) margin_s(D, x).

Let ε = ((1/3) margin_s(D, x))²·λ/27, and consider a minimum-cardinality proper ε-cover of D_µ. Let D′ be a candidate element of this cover satisfying ‖D − D′‖₂ ≤ ε. Then the Sparse Coding Stability Theorem (Theorem 4) implies that the coding margin of D′ on x retains over two-thirds of the coding margin of D on x; that is, [margin_s(D′, x) > (2/3) margin_s(D, x)].

Furthermore, most points from the ghost sample satisfy [margin_s(D′, ·) > (2/3) margin_s(D, x)]. To see this,
let F_D^{marg} := {f_{D,τ}^{marg} : τ ∈ R_+} be the class of threshold functions defined via

    f_{D,τ}^{marg}(x) := 1 if margin_s(D, x) > τ, and 0 otherwise.

The VC dimension of the one-dimensional threshold functions is 1, and so VC(F_D^{marg}) = 1. By using the VC dimension of F_D^{marg} and the standard permutation argument of Vapnik & Chervonenkis (1968, Proof of Theorem 2), it follows that for a single, fixed element of the cover, with probability at least 1 − δ at most log(2m + 1) + log(1/δ) points from a ghost sample will violate the margin inequality in question. Hence, by the bound on the proper covering numbers provided by Proposition 7, we can guarantee for all candidate members D′ of the cover that with probability 1 − δ at most

    η = dk log(1944/(margin_s(D, x)²·λ)) + log(2m + 1) + log(1/δ)

points from the ghost sample violate the s-margin inequality. Thus, for arbitrary D′ in the cover satisfying the conditions of the lemma, with probability 1 − δ at most η(m, d, k, D, x, δ) points from the ghost sample violate [margin_s(D′, ·) > (2/3) margin_s(D, x)].

Finally, consider the at least m − η points in the ghost sample satisfying [margin_s(D′, ·) > (2/3) margin_s(D, x)]. Since ‖D′ − D‖₂ ≤ ((1/3) margin_s(D, x))²·λ/27, the Sparse Coding Stability Theorem (Theorem 4) implies that these points satisfy [margin_s(D, ·) > (1/3) margin_s(D, x)].

It remains to bound Pr(J ∩ Z̄).

Lemma 11 (Large Deviation on Good Ghost)  Let ϖ := t/2 − (2Lβ + bη/m) and β := (ε/(2λ))(1 + 3r√s/µ). Then

    Pr(J ∩ Z̄) ≤ (8(r/2)^{1/(d+1)}/ε)^{(d+1)k} exp(−mϖ²/(2b²)).

Proof  First, note that the event J ∩ Z̄ is a subset of

    R := {zz′ : ∃f ∈ F_µ, [margin_s(D, x) > ι] and (∃x̃ ⊆_η x′, [margin_s(D, x̃) > (1/3) margin_s(D, x)]) and ((P_{z′} − P_z)l(·, f) > t/2)}.

Bounding the probability of the event R is equivalent to bounding the probability of a large deviation (i.e. (P_{z′} − P_z)l(·, f) > t/2) for the random subclass:

    F̃(x, x′) := {f ∈ F_µ : [margin_s(D, x) > ι] and (∃x̃ ⊆_η x′, [margin_s(D, x̃) > (1/3) margin_s(D, x)])}.

Let F_ε = D_ε × W_ε, where D_ε is a minimum-cardinality proper ε-cover of D_µ and W_ε is a minimum-cardinality ε-cover of W. It is sufficient to bound the probability of a large deviation for all of F_ε and to then consider the maximum difference between an element of F̃(x, x′) and its closest representative in F_ε. Clearly, for each f = (D, w) ∈ F̃(x, x′), there is an f′ = (D′, w′) ∈ F_ε satisfying ‖D − D′‖₂ ≤ ε and ‖w − w′‖₂ ≤ ε. If ε is sufficiently small, then for all but η of the points x_i in the ghost sample (and for all points x_i of the original sample) it is guaranteed that

    |⟨w, ϕ_D(x_i)⟩ − ⟨w′, ϕ_{D′}(x_i)⟩| ≤ |⟨w − w′, ϕ_D(x_i)⟩| + |⟨w′, ϕ_D(x_i) − ϕ_{D′}(x_i)⟩|
        ≤ ε/(2λ) + r · (3ε/2) · √s/(λµ) = (ε/(2λ))(1 + 3r√s/µ) = β,

where the second inequality follows from the Sparse Coding Stability Theorem (Theorem 4). Trivially, for the rest of the points x_i in the ghost sample each loss is bounded by b. Hence, on the original sample:

    (1/m) Σ_{i=1}^m |l(y_i, ⟨w, ϕ_D(x_i)⟩) − l(y_i, ⟨w′, ϕ_{D′}(x_i)⟩)| ≤ Lβ,

and on the ghost sample:

    (1/m) Σ_{i=1}^m |l(y′_i, ⟨w, ϕ_D(x′_i)⟩) − l(y′_i, ⟨w′, ϕ_{D′}(x′_i)⟩)|
        ≤ (L/m) Σ_{i good} |⟨w, ϕ_D(x′_i)⟩ − ⟨w′, ϕ_{D′}(x′_i)⟩| + (1/m) Σ_{i bad} |l(y′_i, ⟨w, ϕ_D(x′_i)⟩) − l(y′_i, ⟨w′, ϕ_{D′}(x′_i)⟩)|
        ≤ Lβ + bη/m,

where good denotes the (at least m − η) points of the ghost sample for which the Sparse Coding Stability Theorem applies, and bad is the complement thereof.

Concluding the above argument, the difference between the losses of f and f′ on the double sample is at most 2Lβ + bη/m. Consequently, if (P_{z′} − P_z)l(·, f) > t/2, then the deviation between the loss of f′ on the original sample and the loss of f′ on the ghost sample must be at least t/2 − (2Lβ + bη/m). To bound the probability of R it therefore is sufficient to control

    Pr_{zz′} {∃f = (D′, w′) ∈ D_ε × W_ε, (P_{z′} − P_z)l(·, f) > t/2 − (2Lβ + bη/m)}.

For the case of a fixed f = (D′, w′) ∈ D_ε × W_ε, applying Hoeffding's inequality to the random variable l(y_i, f(x_i)) − l(y′_i, f(x′_i)), with range in [−b, b], yields:

    Pr_{zz′} {(P_{z′} − P_z)l(·, f) > ϖ} ≤ exp(−mϖ²/(2b²)),

for ϖ := t/2 − (2Lβ + bη/m). Via the proper covering number bound for D_ε × W_ε (Proposition 8) and the union bound, this result extends over all of D_ε × W_ε:
    Pr_{zz′} {∃f = (D′, w′) ∈ D_ε × W_ε, (P_{z′} − P_z)l(·, f) > ϖ} ≤ (8(r/2)^{1/(d+1)}/ε)^{(d+1)k} exp(−mϖ²/(2b²)).

The bound on Pr(J ∩ Z̄) now follows.

We now prove Theorem 5 (full proof in Appendix C).

Proof sketch (of Theorem 5)  Proposition 9 and Lemmas 10 and 11 imply that

    Pr_z {∃f ∈ F_µ, [margin_s(D, x) > ι] and ((P − P_z)l(·, f) > t)} ≤ 2 ( (8(r/2)^{1/(d+1)}/ε)^{(d+1)k} exp(−mϖ²/(2b²)) + δ ).

Fix s ∈ [k] and µ > 0 a priori. Let ε = 1/m; elementary manipulations show that, provided m > 243/(margin_s(D, x)²·λ), with probability at least 1 − δ over z ∼ P^m, for any f = (D, w) ∈ F satisfying µ_s(D) ≥ µ and [margin_s(D, x) > ι], the generalization error (P − P_z)l(·, f) is bounded by:

    2b √( 2((d + 1)k log(8m) + k log(r/2) + log(4/δ)) / m )
        + (2L/m)·(1/λ)(1 + 3r√s/µ)
        + (2b/m)( dk log(1944/(margin_s(D, x)²·λ)) + log(2m + 1) + log(4/δ) ).

The theorem follows after suitably distributing a prior across the bounds for each choice of s and µ.

5. An empirical study of the s-margin

Empirical evidence suggests that the s-margin is well above zero even when s is only slightly larger than the observed sparsity level. We performed experiments on the MNIST digit classification task (LeCun et al., 1998), specifically the single binary task of the digit 4 vs all the other digits. All the training data was used, and each data point was normalized to unit norm. The results in Figure 1 show that when the minimum sparsity level is s (indicated by the colored dots on the s-axis of the plots), there is a non-trivial (s + ρ)-margin for ρ a small positive integer. Using the 2s-margin when s-sparsity holds may ensure that there is a moderate margin for only a constant factor increase to s.

[Figure 1. The s-margin for predictive sparse coding with 400 atoms trained on the MNIST training set, digit 4 versus all, for three settings of λ (0.1, 0.15, 0.2). The sparsity level (maximum number of non-zeros per code, taken across all codes of the training points) is indicated by the dots on the s-axis. Predictive sparse coding was trained as per the stochastic gradient descent approach of Mairal et al. (2012).]

6. Discussion and open problems

We have shown the first generalization error bound for predictive sparse coding. The learning bound in Theorem 5 is intimately related to the stability of the sparse encoder, and the bound consequently depends on properties of both the learned dictionary and the training sample. The PRP condition in the Sparse Coding Stability Theorem (Theorem 4) appears to be much stronger than necessary; we conjecture that the PRP actually is O(ε) rather than O(√ε). If the conjecture is true, the number of samples required before Theorem 5 kicks in would be greatly reduced, as would be the size of many of the constants in the results.

In machine learning, we often first map the data implicitly to a space of very high or even infinite dimension and use kernels for computability. In these cases where d ≫ k or d is infinite, any learning bound must be independent of d. We in fact have obtained a bound in the infinite-dimensional setting using considerably more sophisticated techniques (Rademacher complexities over "mostly good" random subclasses), but for space we leave this result to the long version.

Though we established an upper bound on the generalization error for predictive sparse coding, two things remain unclear. How close to optimal is the bound of Theorem 5, and what lower bounds can be established? If the conditions on which the bound relies are of fundamental importance, then the presented data-dependent bound provides motivation for an algorithm to prefer dictionaries for which small subdictionaries are well-conditioned and to additionally encourage large coding margin on the training sample.

Acknowledgments

We thank the anonymous reviewers for immeasurably useful feedback on this work. We also thank Dongryeol Lee for many formative conversations and comments.
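As a companion to the empirical study in Section 5, the following sketch shows one way to compute a Figure 1 style curve of the sample (s + ρ)-margin on one's own data. It is our own illustrative code, not the authors' experimental pipeline: it assumes a dictionary D already learned by some predictive sparse coding implementation, and codes Z with rows ϕ_D(x_i) computed by a LASSO solver such as the ISTA sketch given after the introduction.

import numpy as np

def observed_sparsity(Z, tol=1e-8):
    # Observed sparsity level: the maximum number of non-zeros over all training codes.
    return int((np.abs(Z) > tol).sum(axis=1).max())

def sample_margin_curve(D, X, Z, lam, s_values):
    # For each s in s_values, return the sample s-margin margin_s(D, x): the minimum over
    # training points of the (s+1)-th smallest element of { lam - |<D_j, x_i - D z_i>| }.
    # X: (m, d) training points; Z: (m, k) codes z_i = phi_D(x_i); D: (d, k) dictionary.
    residuals = X - Z @ D.T                     # row i holds x_i - D z_i
    gaps = lam - np.abs(residuals @ D)          # entry (i, j) = lam - |<D_j, x_i - D z_i>|
    gaps_sorted = np.sort(gaps, axis=1)
    return {s: float(gaps_sorted[:, s].min()) for s in s_values}

# Example usage (D, X, Z, lam assumed already in hand):
# s0 = observed_sparsity(Z)
# curve = sample_margin_curve(D, X, Z, lam, s_values=range(s0, min(s0 + 50, D.shape[1])))

Plotting the resulting values against s for several settings of λ gives the analogue of Figure 1, with observed_sparsity(Z) playing the role of the dots on the s-axis.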
References

Bradley, David M. and Bagnell, J. Andrew. Differentiable sparse coding. In Advances in Neural Information Processing Systems 21, pp. 113–120. MIT Press, 2009.

Carl, B. and Stephani, I. Entropy, compactness, and the approximation of operators, volume 98. Cambridge University Press, 1990.

Fazel, Maryam. Matrix rank minimization with applications. Elec Eng Dept Stanford University, 54:1–130, 2002.

Kakade, Sham M., Sridharan, Karthik, and Tewari, Ambuj. On the complexity of linear prediction: Risk bounds, margin bounds, and regularization. In Koller, Daphne, Schuurmans, Dale, Bengio, Yoshua, and Bottou, Léon (eds.), Advances in Neural Information Processing Systems 21, pp. 793–800. MIT Press, 2009.

LeCun, Yann, Bottou, Léon, Bengio, Yoshua, and Haffner, Patrick. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

Mairal, Julien, Bach, Francis, Ponce, Jean, Sapiro, Guillermo, and Zisserman, Andrew. Supervised dictionary learning. In Koller, Daphne, Schuurmans, Dale, Bengio, Yoshua, and Bottou, Léon (eds.), Advances in Neural Information Processing Systems 21, pp. 1033–1040. MIT Press, 2009.

Mairal, Julien, Bach, Francis, and Ponce, Jean. Task-driven dictionary learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(4):791–804, 2012.

Maurer, A. and Pontil, M. K-dimensional coding schemes in Hilbert spaces. IEEE Transactions on Information Theory, 56(11):5839–5846, 2010.

Mendelson, Shahar and Philips, Petra. On the importance of small coordinate projections. Journal of Machine Learning Research, 5:219–238, 2004.

Osborne, Michael R., Presnell, Brett, and Turlach, Berwin A. On the lasso and its dual. Journal of Computational and Graphical Statistics, pp. 319–337, 2000.

Shawe-Taylor, John, Bartlett, Peter L., Williamson, Robert C., and Anthony, Martin. Structural risk minimization over data-dependent hierarchies. IEEE Transactions on Information Theory, 44(5):1926–1940, 1998.

Steinwart, Ingo and Christmann, Andreas. Support vector machines. Springer, 2008.

Tibshirani, Robert. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B (Methodological), pp. 267–288, 1996.

Vainsencher, Daniel, Mannor, Shie, and Bruckstein, Alfred M. The sample complexity of dictionary learning. Journal of Machine Learning Research, 12:3259–3281, 2011.

Vapnik, Vladimir N. and Chervonenkis, Alexey Ya. Uniform convergence of frequencies of occurrence of events to their probabilities. In Dokl. Akad. Nauk SSSR, volume 181, pp. 915–918, 1968.

Vidyasagar, Mathukumalli. Learning and Generalization with Applications to Neural Networks. Springer, 2002.

Yu, Kai, Zhang, Tong, and Gong, Yihong. Nonlinear learning using local coordinate coding. In Bengio, Yoshua, Schuurmans, Dale, Lafferty, John, Williams, Christopher K. I., and Culotta, Aron (eds.), Advances in Neural Information Processing Systems 22, pp. 2223–2231. MIT Press, 2009.
