Breaking The Curse of Dimensionality With Convex Neural Networks
Abstract
We consider neural networks with a single hidden layer and non-decreasing positively ho-
mogeneous activation functions like the rectified linear units. By letting the number of
hidden units grow unbounded and using classical non-Euclidean regularization tools on
the output weights, they lead to a convex optimization problem and we provide a de-
tailed theoretical analysis of their generalization performance, with a study of both the
approximation and the estimation errors. We show in particular that they are adaptive
to unknown underlying linear structures, such as the dependence on the projection of the
input variables onto a low-dimensional subspace. Moreover, when using sparsity-inducing
norms on the input weights, we show that high-dimensional non-linear variable selection
may be achieved, without any strong assumption regarding the data and with a total num-
ber of variables potentially exponential in the number of observations. However, solving
this convex optimization problem in infinite dimensions is only possible if the non-convex
subproblem of addition of a new unit can be solved efficiently. We provide a simple geo-
metric interpretation for our choice of activation functions and describe simple conditions
for convex relaxations of the finite-dimensional non-convex subproblem to achieve the same
generalization error bounds, even when constant-factor approximations cannot be found.
We were not able to find strong enough convex relaxations to obtain provably polynomial-
time algorithms and leave open the existence or non-existence of such tractable algorithms
with non-exponential sample complexities.
Keywords: Neural networks, non-parametric estimation, convex optimization, convex
relaxation.
1. Introduction
Supervised learning methods come in a variety of ways. They are typically based on local
averaging methods, such as k-nearest neighbors, decision trees, or random forests, or on
optimization of the empirical risk over a certain function class, such as least-squares re-
gression, logistic regression or support vector machine, with positive definite kernels, with
model selection, structured sparsity-inducing regularization, or boosting (see, e.g., Györfi
and Krzyzak, 2002; Hastie et al., 2009; Shalev-Shwartz and Ben-David, 2014, and references
therein).
© 2017 Francis Bach. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v18/14-546.html.
Most methods assume either explicitly or implicitly a certain class of models to learn
from. In the non-parametric setting, the learning algorithms may adapt the complexity
of the models as the number of observations increases: the sample complexity (i.e., the
number of observations) to adapt to any particular problem is typically large. For example,
when learning Lipschitz-continuous functions in R^d, at least n = Ω(ε^{−max{d,2}}) samples are
needed to learn a function with excess risk ε (von Luxburg and Bousquet, 2004, Theorem
15). The exponential dependence on the dimension d is often referred to as the curse of
dimensionality: without any restrictions, exponentially many observations are needed to
obtain optimal generalization performances.
At the other end of the spectrum, parametric methods such as linear supervised learning
make strong assumptions regarding the problem and generalization bounds based on esti-
mation errors typically assume that the model is well-specified, and the sample complexity
to attain an excess risk of ε grows as n = Ω(d/ε2 ), for linear functions in d dimensions
and Lipschitz-continuous loss functions (Shalev-Shwartz and Ben-David, 2014, Chapter 9).
While the sample complexity is much lower, when the assumptions are not met, the methods
underfit and more complex models would provide better generalization performances.
Between these two extremes, there are a variety of models with structural assumptions
that are often used in practice. For input data x ∈ R^d, prediction functions f : R^d → R
may for example be parameterized as:
(a) Affine functions: f(x) = w⊤x + b, leading to potentially severe underfitting, but easy
optimization and good (i.e., non exponential) sample complexity.
(b) Generalized additive models: f(x) = Σ_{j=1}^d f_j(x_j), which are generalizations of the
above obtained by summing functions f_j : R → R which may not be affine (Hastie and Tibshirani, 1990; Ravikumar et al., 2008; Bach, 2008a). This leads to less severe underfitting
but cannot model interactions between variables, while the estimation may be done
with similar tools as for affine functions (e.g., convex optimization for convex losses).
(c) Nonparametric ANOVA models: f(x) = Σ_{A∈A} f_A(x_A) for a set A of subsets of
{1, . . . , d}, and non-linear functions f_A : R^A → R. The set A may be either given (Gu,
2013) or learned from data (Lin and Zhang, 2006; Bach, 2008b). Multi-way interactions are explicitly included but a key algorithmic problem is to explore the 2^d − 1
non-trivial potential subsets.
(d) Single hidden-layer neural networks: f(x) = Σ_{j=1}^k σ(w_j⊤x + b_j), where k is the number
of units in the hidden layer (see, e.g., Rumelhart et al., 1986; Haykin, 1994). The
activation function σ is here assumed to be fixed. While the learning problem may
be cast as a (sub)differentiable optimization problem, techniques based on gradient
descent may not find the global optimum. If the number of hidden units is fixed, this
is a parametric problem.
(e) Projection pursuit (Friedman and Stuetzle, 1981): f(x) = Σ_{j=1}^k f_j(w_j⊤x), where k is
the number of projections. This model combines both (b) and (d); the only difference
with neural networks is that the non-linear functions f_j : R → R are learned from data.
The optimization is often done sequentially and is harder than for neural networks.
(f) Dependence on an unknown k-dimensional subspace: f(x) = g(W⊤x) with W ∈ R^{d×k},
where g is a non-linear function. A variety of algorithms exist for this problem (Li,
1991; Fukumizu et al., 2004; Dalalyan et al., 2008). Note that when the columns of W
are assumed to be composed of a single non-zero element, this corresponds to variable
selection (with at most k selected variables).
In this paper, our main aim is to answer the following question: Is there a single
learning method that can deal efficiently with all situations above with prov-
able adaptivity ? We consider single-hidden-layer neural networks, with non-decreasing
homogeneous activation functions such as
σ(u) = max{u, 0}^α = (u)_+^α,
for α ∈ {0, 1, . . .}, with a particular focus on α = 0 (with the convention that 0^0 = 0), that
is σ(u) = 1_{u>0} (a threshold at zero), and α = 1, that is, σ(u) = max{u, 0} = (u)_+, the
so-called rectified linear unit (Nair and Hinton, 2010; Krizhevsky et al., 2012). We follow
the convexification approach of Bengio et al. (2006); Rosset et al. (2007), who consider
potentially infinitely many units and let a sparsity-inducing norm choose the number of
units automatically. This leads naturally to incremental algorithms such as forward greedy
selection approaches, which have a long history for single-hidden-layer neural networks (see,
e.g., Breiman, 1993; Lee et al., 1996).
We make the following contributions:
– We provide in Section 2 a review of functional analysis tools used for learning from
continuously infinitely many basis functions, by studying carefully the similarities and
differences between L1 - and L2 -penalties on the output weights. For L2 -penalties,
this corresponds to a positive definite kernel and may be interpreted through random
sampling of hidden weights. We also review incremental algorithms (i.e., forward
greedy approaches) to learn from these infinite sets of basis functions when using
L1 -penalties.
– The results are specialized in Section 3 to neural networks with a single hidden layer
and activation functions which are positively homogeneous (such as the rectified linear
unit). In particular, in Sections 3.2, 3.3 and 3.4, we provide simple geometric interpretations of the non-convex problem of adding a new unit, in terms of separating
hyperplanes or the Hausdorff distance between convex sets. They constitute the core po-
tentially hard computational tasks in our framework of learning from continuously
many basis functions.
Functional form | Generalization bound
No assumption | n^{−1/(d+3)} · log n
Affine function: w⊤x + b | d^{1/2} · n^{−1/2}
Generalized additive model: Σ_{j=1}^k f_j(w_j⊤x), w_j ∈ R^d | kd^{1/2} · n^{−1/4} · log n
Single-layer neural network: Σ_{j=1}^k η_j(w_j⊤x + b_j)_+ | kd^{1/2} · n^{−1/2}
Projection pursuit: Σ_{j=1}^k f_j(w_j⊤x), w_j ∈ R^d | kd^{1/2} · n^{−1/4} · log n
Dependence on subspace: f(W⊤x), W ∈ R^{d×s} | d^{1/2} · n^{−1/(s+3)} · log n
Table 1: Summary of generalization bounds for various models. The bound represents the
expected excess risk over the best predictor in the given class. When no assumption
is made, the dependence on n goes to zero with an exponent proportional to 1/d
(which leads to a sample complexity exponential in d), while making assumptions
removes the dependence on d from the exponent.
– We provide in Section 5.5 simple conditions for convex relaxations to achieve the
same generalization error bounds, even when constant-factor approximation cannot
be found (e.g., because it is NP-hard such as for the threshold activation function and
the rectified linear unit). We present in Section 6 convex relaxations based on semi-
definite programming, but we were not able to find strong enough convex relaxations
(they provide only a provable sample complexity with a polynomial time algorithm
which is exponential in the dimension d) and leave open the existence or non-existence
of polynomial-time algorithms that preserve the non-exponential sample complexity.
tion properties of neural networks (Barron, 1993; Kurkova and Sanguineti, 2001; Mhaskar,
2004), the algorithmic parts that we present in Section 2.5 have been studied in a variety
of contexts, such as “convex neural networks” (Bengio et al., 2006), or `1 -norm with infi-
nite dimensional feature spaces (Rosset et al., 2007), with links with conditional gradient
algorithms (Dunn and Harshbarger, 1978; Jaggi, 2013) and boosting (Rosset et al., 2004).
In the following sections, note that there will be two different notions of infinity: in-
finitely many inputs x and infinitely many basis functions x 7→ ϕv (x). Moreover, two
orthogonal notions of Lipschitz-continuity will be tackled in this paper: the one of the pre-
diction functions f , and the one of the loss ` used to measure the fit of these prediction
functions.
Given our assumptions regarding the compactness of V, for any f ∈ F1 , the infimum
defining γ1 (f ) is in fact attained by a signed measure µ, as a consequence of the compactness
of measures for the weak topology (see Evans and Gariepy, 1991, Section 1.9).
In the definition above, if we assume that the signed measure µ has a density with
respect to a fixed probability measure τ with full support on V, that is, dµ(v) = p(v)dτ (v),
then, the variation norm γ1 (f ) is also equal to the infimal value of
|µ|(V) = ∫_V |p(v)| dτ(v),
over all integrable functions p such that f(x) = ∫_V p(v) ϕ_v(x) dτ(v). Note however that not
all measures have densities, and the two infima are the same as all Radon measures
are limits of measures with densities. Moreover, the infimum in the definition above is not
attained in general (for example when the optimal measure is singular with respect to dτ );
however, it often provides a more intuitive definition of the variation norm, and leads to
easier comparisons with Hilbert spaces in Section 2.3.
Finite number of neurons. If f : X → R is decomposable into k basis functions, that
is, f(x) = Σ_{j=1}^k η_j ϕ_{v_j}(x), then this corresponds to µ = Σ_{j=1}^k η_j δ(v = v_j), and the total
variation of µ is equal to the ℓ1-norm ‖η‖_1 of η. Thus the function f has variation norm
less than or equal to ‖η‖_1. This is to be contrasted with the number of basis functions, which
is the ℓ0-pseudo-norm of η.
then the norm γ1(f) is the infimum of the total variation |µ|(V) = ∫_V |p(v)| dτ(v), over all
decompositions f(x) = ∫_V p(v) ϕ_v(x) dτ(v).
We may also define the infimum of ∫_V |p(v)|^2 dτ(v) over the same decompositions (squared
L2-norm instead of L1-norm). It turns out that it defines a squared norm γ_2^2 and that
the function space F2 of functions with finite norm happens to be a reproducing kernel Hilbert space (RKHS). When V is finite, then it is well-known (see, e.g., Berlinet
and Thomas-Agnan, 2004, Section 4.1) that the infimum of Σ_{v∈V} µ_v^2 over all vectors µ
such that f = Σ_{v∈V} µ_v ϕ_v defines a squared RKHS norm with positive definite kernel
k(x, y) = Σ_{v∈V} ϕ_v(x)ϕ_v(y).
We show in Appendix A that for any compact set V, we have defined a squared RKHS
norm γ_2^2 with positive definite kernel k(x, y) = ∫_V ϕ_v(x)ϕ_v(y) dτ(v).
Random sampling. Note that such kernels are well-adapted to approximations by sam-
pling several basis functions ϕv sampled from the probability measure τ (Neal, 1995; Rahimi
and Recht, 2007). Indeed, if we consider m i.i.d. samples v_1, . . . , v_m, we may define the approximation k̂(x, y) = (1/m) Σ_{i=1}^m ϕ_{v_i}(x) ϕ_{v_i}(y), which corresponds to an explicit feature representation. In other words, this corresponds to sampling units v_i, using prediction functions
of the form (1/m) Σ_{i=1}^m η_i ϕ_{v_i}(x) and then penalizing by the ℓ2-norm of η.
When m tends to infinity, then k̂(x, y) tends to k(x, y) and random sampling provides a
way to work efficiently with explicit m-dimensional feature spaces. See Rahimi and Recht
(2007) for an analysis of the number of units needed for an approximation with error ε, typ-
ically of order 1/ε2 . See also Bach (2017) for improved results with a better dependence on
ε when making extra assumptions on the eigenvalues of the associated covariance operator.
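As a concrete illustration of this random sampling scheme, here is a minimal sketch (our own code and naming, in NumPy; not taken from the paper) for the rectified-linear case ϕ_v(x) = (v⊤x)_+ with τ uniform on the sphere; as m grows, Phi @ Phi.T approaches the kernel matrix of k.

```python
import numpy as np

def sample_units(m, dim, rng):
    # draw m unit vectors approximately uniformly on the sphere
    v = rng.standard_normal((m, dim))
    return v / np.linalg.norm(v, axis=1, keepdims=True)

def random_relu_features(X, V):
    # explicit feature map: phi_i(x) = (v_i^T x)_+ / sqrt(m)
    m = V.shape[0]
    return np.maximum(X @ V.T, 0.0) / np.sqrt(m)

rng = np.random.default_rng(0)
d = 5
X = rng.standard_normal((10, d))
V = sample_units(2000, d, rng)
Phi = random_relu_features(X, V)
K_hat = Phi @ Phi.T   # approximates k(x, y) = E_v[(v^T x)_+ (v^T y)_+]
```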
Relationship between F1 and F2 . The corresponding RKHS norm is always greater
than the variation norm (because of Jensen’s inequality), and thus the RKHS F2 is included
in F1 . However, as shown in this paper, the two spaces F1 and F2 have very different
properties; e.g., γ2 may be computed easily in several cases, while γ1 typically cannot; also, learning
with F2 may either be done by random sampling of sufficiently many weights or using kernel
methods, while F1 requires dedicated convex optimization algorithms with potentially non-
polynomial-time steps (see Section 2.5).
Moreover, for any v ∈ V, ϕ_v ∈ F1 with norm γ1(ϕ_v) ≤ 1, while in general ϕ_v ∉ F2.
This is a simple illustration of the fact that F2 is too small and thus will lead to a lack
of adaptivity that will be further studied in Section 5.4 for neural networks with certain
activation functions.
[Figure 1: one conditional gradient step, moving from f_t towards the extreme point found in the direction −J′(f_t) to obtain f_{t+1}.]
et al., 2013) and leaving its analysis in terms of learning rates for future work. Since the
functional Jˆ depends only on function values taken at finitely many points, the results
from Section 2.2 apply and we expect the solution f to be spanned by only n functions
ϕ_{v_1}, . . . , ϕ_{v_n} (but we do not know in advance which ones among all ϕ_v, v ∈ V, and the algorithms
in Section 2.5 will provide approximate representations of this form with potentially fewer or more
than n functions).
that is, the excess risk J(f̂) − inf_{f∈F} J(f) is upper-bounded by the sum of an approximation
error inf_{f∈F^δ} J(f) − inf_{f∈F} J(f), an estimation error 2 sup_{f∈F^δ} |Ĵ(f) − J(f)|, and an opti-
mization error ε (see also Bottou and Bousquet, 2008). In this paper, we will deal with all
three errors, starting from the optimization error which we now consider for the space F1
and its variation norm.
When r^2 = sup_{v∈V} ‖ϕ_v‖^2_{L2(dρ)} is finite, we have ‖f‖^2_{L2(dρ)} ≤ r^2 γ1(f)^2 and thus we get a
convergence rate of 2L r^2 δ^2 / (t + 1).
Moreover, the basic Frank-Wolfe (FW) algorithm may be extended to handle the reg-
ularized problem as well (Harchaoui et al., 2013; Bach, 2013; Zhang et al., 2012), with
similar convergence rates in O(1/t). Also, the second step in the algorithm, where the
function ft+1 is built in the segment between ft and the newly found extreme function,
may be replaced by the optimization of J over the convex hull of all functions f¯0 , . . . , f¯t ,
a variant which is often referred to as fully corrective. Moreover, in our context where V
is a space where local search techniques may be considered, there is also the possibility
of “fine-tuning” the vectors v as well (Bengio et al., 2006), that is, we may optimize the
function (v_1, . . . , v_t, α_1, . . . , α_t) ↦ J(Σ_{i=1}^t α_i ϕ_{v_i}) through local search techniques, starting
from the weights (αi ) and points (vi ) obtained from the conditional gradient algorithm.
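To make the incremental scheme concrete, here is a minimal sketch (our own code and naming, assuming NumPy, squared loss and ReLU units), where the potentially hard Frank-Wolfe step discussed next is replaced by a crude random search over candidate directions; the fully corrective and fine-tuning variants mentioned above would re-optimize the output weights (and possibly the v_i) after each addition.

```python
import numpy as np

def frank_wolfe_relu(X, y, delta, T, n_candidates=1000, seed=0):
    """Approximate conditional gradient over the ball {gamma_1(f) <= delta}."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    f = np.zeros(n)                     # current predictions f_t(x_i)
    units = []                          # list of (direction v, signed output weight)
    for t in range(T):
        g = (f - y) / n                 # gradient of J(f) = 1/(2n) ||f - y||^2
        # Frank-Wolfe step: approximately maximize |1/n sum_i g_i (v^T x_i)_+| over ||v||_2 = 1
        V = rng.standard_normal((n_candidates, d))
        V /= np.linalg.norm(V, axis=1, keepdims=True)
        scores = np.maximum(X @ V.T, 0.0).T @ g     # shape (n_candidates,)
        j = np.argmax(np.abs(scores))
        v, s = V[j], -np.sign(scores[j])            # move against the gradient
        f_bar = s * delta * np.maximum(X @ v, 0.0)  # extreme point of the ball
        rho = 2.0 / (t + 2)                         # standard step size
        f = (1 - rho) * f + rho * f_bar
        units = [(vv, (1 - rho) * w) for vv, w in units] + [(v, rho * s * delta)]
    return units, f
```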
Adding a new basis function. The conditional gradient algorithm presented above
relies on solving at each iteration the “Frank-Wolfe step”:
max_{γ(f)≤δ} ⟨f, g⟩_{L2(dρ)},
for g = −J′(f_t) ∈ L2(dρ). For the norm γ1 defined through an L1-norm, we have, for
f = ∫_V ϕ_v dµ(v) such that γ1(f) = |µ|(V):
⟨f, g⟩_{L2(dρ)} = ∫_X f(x) g(x) dρ(x) = ∫_X ( ∫_V ϕ_v(x) dµ(v) ) g(x) dρ(x)
= ∫_V ( ∫_X ϕ_v(x) g(x) dρ(x) ) dµ(v)
≤ γ1(f) · max_{v∈V} | ∫_X ϕ_v(x) g(x) dρ(x) |,
with equality if and only if µ = µ_+ − µ_−, with µ_+ and µ_− two non-negative measures, with
µ_+ (resp. µ_−) supported in the set of maximizers v of |∫_X ϕ_v(x) g(x) dρ(x)| where the value
is positive (resp. negative).
This implies that:
max_{γ1(f)≤δ} ⟨f, g⟩_{L2(dρ)} = δ max_{v∈V} | ∫_X ϕ_v(x) g(x) dρ(x) |,    (1)
with the maximizers f of the first optimization problem above (left-hand side) obtained
as δ times convex combinations of ϕv and −ϕv for maximizers v of the second problem
(right-hand side).
A common difficulty in practice is the hardness of the Frank-Wolfe step, that is, the
optimization problem above over V may be difficult to solve. See Sections 3.2, 3.3 and 3.4
for neural networks, where this optimization is usually difficult.
Finitely many observations. When X is finite (or when using the result from Sec-
tion 2.2), the Frank-Wolfe step in Eq. (1) becomes equivalent to, for some vector g ∈ Rn :
sup_{γ1(f)≤δ} (1/n) Σ_{i=1}^n g_i f(x_i) = δ max_{v∈V} | (1/n) Σ_{i=1}^n g_i ϕ_v(x_i) |,    (2)
where the set of solutions of the first problem is in the convex hull of the solutions of the
second problem.
Non-smooth loss functions. In this paper, in our theoretical results, we consider non-
smooth loss functions for which conditional gradient algorithms do not converge in general.
One possibility is to smooth the loss function, as done by Nesterov (2005): an approximation
error of ε may be obtained with a smoothness constant proportional to 1/ε. By choosing ε
as 1/√t, we obtain a convergence rate of O(1/√t) after t iterations. See also Lan (2013).
Approximate oracles. The conditional gradient algorithm may deal with approximate
oracles; however, what we need in this paper is not the additive errors situations considered
by Jaggi (2013), but multiplicative ones on the computation of the dual norm (similar to
ones derived by Bach (2013) for the regularized problem).
Indeed, in our context, we minimize a function J(f ) on f ∈ L2 (dρ) over a norm ball
{γ1(f) ≤ δ}. A multiplicative approximate oracle outputs, for any g ∈ L2(dρ), a vector
f̂ ∈ L2(dρ) such that γ1(f̂) = 1, and
x ↦ σ(w⊤x + b),
– By homogeneity, they are invariant by a change of scale of the data; indeed, if all
observations x are multiplied by a constant, we may simply change the measure µ
defining the expansion of f by the appropriate constant to obtain exactly the same
function. This allows us to study functions defined on the unit-sphere.
– The special case α = 1, often referred to as the rectified linear unit, has seen consider-
able recent empirical success (Nair and Hinton, 2010; Krizhevsky et al., 2012), while
the case α = 0 (hard thresholds) has some historical importance (Rosenblatt, 1958).
The goal of this section is to specialize the results from Section 2 to this particular case and
show that the “Frank-Wolfe” steps have simple geometric interpretations.
We first show that the positive homogeneity of the activation functions allows us to transfer
the problem to the unit sphere.
Boundedness assumptions. For the theoretical analysis, we assume that our data in-
puts x ∈ Rd are almost surely bounded by R in `q -norm, for some q ∈ [2, ∞] (typically
q = 2 and q = ∞). We then build the augmented variable z ∈ R^{d+1} as z = (x⊤, R)⊤ ∈ R^{d+1}
by appending the constant R to x ∈ R^d. We therefore have ‖z‖_q ≤ √2 R. By defining the
vector v = (w⊤, b/R)⊤ ∈ R^{d+1}, we have:
This implies by Hölder's inequality that ϕ_v(x)^2 ≤ 2^α. Moreover this leads to functions
in F1 that are bounded everywhere, that is, ∀f ∈ F1, f(x)^2 ≤ 2^α γ1(f)^2. Note that the
functions in F1 are also Lipschitz-continuous for α ≥ 1.
Since all ℓ_p-norms (for p ∈ [1, 2]) are equivalent to each other with constants of at most
√d with respect to the ℓ2-norm, all the spaces F1 defined above are equal, but the norms
γ1 are of course different and they differ by a constant of at most d^{α/2}; this can be seen by
computing the dual norms like in Eq. (2) or Eq. (1).
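As a quick numerical sanity check of the augmented-variable construction above (our own snippet; it simply verifies that σ(w⊤x + b) = (v⊤z)_+^α with z = (x⊤, R)⊤ and v = (w⊤, b/R)⊤):

```python
import numpy as np

rng = np.random.default_rng(1)
d, R, alpha = 4, 2.0, 1
x = rng.uniform(-1, 1, d)             # data point
w, b = rng.standard_normal(d), rng.standard_normal()

z = np.concatenate([x, [R]])          # augmented input z = (x, R)
v = np.concatenate([w, [b / R]])      # augmented weight vector v = (w, b/R)

lhs = max(w @ x + b, 0.0) ** alpha    # sigma(w^T x + b) with sigma = (.)_+^alpha
rhs = max(v @ z, 0.0) ** alpha        # (v^T z)_+^alpha
assert np.isclose(lhs, rhs)
```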
Homogeneous reformulation. In our study of approximation properties, it will be useful to consider the space of functions G1, defined for z in the unit sphere S^d ⊂ R^{d+1} of
the Euclidean norm, such that g(z) = ∫_{S^d} σ(v⊤z) dµ(v), with the norm γ1(g) defined as the
infimum of |µ|(S^d) over all decompositions of g. Note the slight overloading of notation for
γ1 (for norms in G1 and F1) which should not cause any confusion.
In order to prove the approximation properties (with unspecified constants depending
only on d), we may assume that p = 2, since the norms ‖·‖_p for p ∈ [1, ∞] are equivalent
to ‖·‖_2 with a constant that grows at most as d^{α/2}. We thus
focus on the ℓ2-norm in all proofs in Section 4.
We may go from G1 (a space of real-valued functions defined on the unit `2 -sphere in
d + 1 dimensions) to the space F1 (a space of real-valued functions defined on the ball of
radius R for the `2 -norm) as follows (this corresponds to sending a ball in Rd into a spherical
cap in dimension d + 1, as illustrated in Figure 2).
– Given g ∈ G1, we define f ∈ F1, with
f(x) = ( ‖x‖_2^2/R^2 + 1 )^{α/2} g( (x⊤, R)⊤ / √(‖x‖_2^2 + R^2) ).
If g may be represented as ∫_{S^d} σ(v⊤z) dµ(v), then the function f that we have defined
may be represented as
f(x) = ( ‖x‖_2^2/R^2 + 1 )^{α/2} ∫_{S^d} ( v⊤ (x⊤, R)⊤ / √(‖x‖_2^2 + R^2) )_+^α dµ(v)
= ∫_{S^d} ( v⊤ (x/R, 1)⊤ )_+^α dµ(v) = ∫_{S^d} σ(w⊤x + b) dµ(Rw, b),
that is, γ1(f) ≤ γ1(g), because we have assumed that (w⊤, b/R)⊤ is on the (1/R)-sphere.
value of g(z, a) for a ≥ 1/√2 is enough to recover f from the formula above.
On that portion {a ≥ 1/√2} of the sphere S^d, this function exactly inherits the
differentiability properties of f. That is, (a) if f is bounded by 1 and f is (1/R)-
Lipschitz-continuous, then g is Lipschitz-continuous with a constant that only depends
on d and α and (b), if all derivatives of order less than k are bounded by R−k , then
all derivatives of the same order of g are bounded by a constant that only depends
on d and α. Precise notions of differentiability may be defined on the sphere, using
[Figure 2: the ball of radius R in R^d is mapped to a spherical cap of the sphere in dimension d + 1.]
the manifold structure (see, e.g., Absil et al., 2009) or through polar coordinates (see,
e.g., Atkinson and Han, 2012, Chapter 3). See these references for more details.
The only remaining important aspect is to define g on the entire sphere, so that (a) its
regularity constants are controlled by a constant times the ones on the portion of the
sphere where it is already defined, (b) g is either even or odd (this will be important
in Section 4). Ensuring that the regularity conditions can be met is classical when
extending to the full sphere (see, e.g., Whitney, 1934). Ensuring that the function
may be chosen as odd or even may be obtained by multiplying the function g by an
infinitely differentiable function which is equal to one for a ≥ 1/√2 and zero for a ≤ 0,
and extending by −g or g on the hemisphere a < 0.
In summary, we may consider in Section 4 functions defined on the sphere, which are
much easier to analyze. In the rest of the section, we specialize some of the general con-
cepts reviewed in Section 2 to our neural network setting with specific activation functions,
namely, in terms of corresponding kernel functions and geometric reformulations of the
Frank-Wolfe steps.
for (Rw, b) distributed uniformly on the unit ℓ2-sphere S^d ⊂ R^{d+1}, and x, x′ ∈ R^d. Given the
angle ϕ ∈ [0, π] defined through
x⊤x′/R^2 + 1 = (cos ϕ) √(‖x‖_2^2/R^2 + 1) √(‖x′‖_2^2/R^2 + 1),
we have explicit expressions (Le Roux and Bengio, 2007; Cho and Saul, 2009):
k_0(z, z′) = (1/(2π)) (π − ϕ),
k_1(z, z′) = [ √(‖x‖_2^2/R^2 + 1) √(‖x′‖_2^2/R^2 + 1) / (2(d + 1)π) ] ((π − ϕ) cos ϕ + sin ϕ).
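These closed forms can be checked by Monte Carlo integration over random units; the sketch below (our own code) does so for the degree-one case restricted to unit vectors z, z′ ∈ S^d, for which the prefactor reduces to 1/(2(d + 1)π).

```python
import numpy as np

def k1_closed_form(z1, z2, d):
    # degree-1 expression for unit vectors z1, z2 on S^d (prefactor 1/(2(d+1)pi))
    cos_phi = np.clip(z1 @ z2, -1.0, 1.0)
    phi = np.arccos(cos_phi)
    return ((np.pi - phi) * cos_phi + np.sin(phi)) / (2 * (d + 1) * np.pi)

rng = np.random.default_rng(2)
d = 3                                            # sphere S^d lives in R^{d+1}
z1, z2 = rng.standard_normal((2, d + 1))
z1, z2 = z1 / np.linalg.norm(z1), z2 / np.linalg.norm(z2)

V = rng.standard_normal((200000, d + 1))
V /= np.linalg.norm(V, axis=1, keepdims=True)    # approximately uniform on S^d
mc = np.mean(np.maximum(V @ z1, 0) * np.maximum(V @ z2, 0))
print(mc, k1_closed_form(z1, z2, d))             # the two values should be close
```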
where I_+ = {i, y_i > 0} and I_− = {i, y_i < 0}. As outlined by Bengio et al. (2006), this
is equivalent to finding a hyperplane parameterized by v that minimizes a weighted misclassification rate (when doing linear classification). Note that the norm of v has no effect.
NP-hardness. This problem is NP-hard in general. Indeed, if we assume that all y_i
are equal to −1 or 1 and that Σ_{i=1}^n y_i = 0, then we have a balanced binary classification
problem (we need to assume n even). The quantity Σ_{i=1}^n y_i 1_{v⊤z_i>0} is then equal to (n/2)(1 − 2e),
where e is the corresponding classification error for the problem of classifying as positive
(resp. negative) the examples in I_+ (resp. I_−) by thresholding the linear classifier v⊤z.
Guruswami and Raghavendra (2009) showed that for all (ε, δ), it is NP-hard to distinguish
between instances (i.e., configurations of points x_i) where a halfspace with classification
error at most ε exists, and instances where all halfspaces have an error of at least 1/2 − δ.
Thus, it is NP-hard to distinguish between instances where there exists v ∈ R^{d+1} such that
Σ_{i=1}^n y_i 1_{v⊤z_i>0} ≥ (n/2)(1 − 2ε) and instances where, for all v ∈ R^{d+1}, Σ_{i=1}^n y_i 1_{v⊤z_i>0} ≤ nδ.
Hence, it is NP-hard to distinguish instances where max_{v∈R^{d+1}} Σ_{i=1}^n y_i 1_{v⊤z_i>0} ≥ (n/2)(1 − 2ε)
from ones where it is less than nδ. Since this is valid for all δ and ε, this rules out a
constant-factor approximation.
Convex relaxation. Given linear binary classification problems, there are several algo-
rithms to approximately find a good half-space. These are based on using convex surrogates
(such as the hinge loss or the logistic loss). Although some theoretical results do exist regard-
ing the classification performance of estimators obtained from convex surrogates (Bartlett
et al., 2006), they do not apply in the context of linear classification.
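As a heuristic sketch of such a surrogate-based approach (our own code, assuming NumPy and SciPy; it carries no approximation guarantee, consistently with the discussion above), one can solve a weighted logistic regression and use the resulting hyperplane as a candidate v:

```python
import numpy as np
from scipy.optimize import minimize

def surrogate_halfspace(Z, y):
    """Heuristic for max_v sum_i y_i 1_{v^T z_i > 0}: weighted logistic regression.

    Examples with y_i > 0 (resp. < 0) are pushed to the positive (resp. negative)
    side of the hyperplane, with weights |y_i|.
    """
    labels = np.sign(y)
    weights = np.abs(y)

    def loss(v):
        margins = labels * (Z @ v)
        return np.sum(weights * np.logaddexp(0.0, -margins))

    v0 = np.zeros(Z.shape[1])
    v = minimize(loss, v0, method="L-BFGS-B").x
    return v / (np.linalg.norm(v) + 1e-12)   # the norm of v has no effect
```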
where I_+ = {i, y_i > 0} and I_− = {i, y_i < 0}. We have, with t_i = |y_i| z_i ∈ R^{d+1}, using convex
duality:
max_{‖v‖_p≤1} Σ_{i=1}^n y_i (v⊤z_i)_+ = max_{‖v‖_p≤1} [ Σ_{i∈I_+} (v⊤t_i)_+ − Σ_{i∈I_−} (v⊤t_i)_+ ]
= max_{‖v‖_p≤1} [ Σ_{i∈I_+} max_{b_i∈[0,1]} b_i v⊤t_i − Σ_{i∈I_−} max_{b_i∈[0,1]} b_i v⊤t_i ].
For the problem of maximizing | Σ_{i=1}^n y_i (v⊤z_i)_+ |, this corresponds to
max{ max_{b_+∈[0,1]^{n_+}} min_{b_−∈[0,1]^{n_−}} ‖T_+⊤ b_+ − T_−⊤ b_−‖_q , max_{b_−∈[0,1]^{n_−}} min_{b_+∈[0,1]^{n_+}} ‖T_+⊤ b_+ − T_−⊤ b_−‖_q }.
This is exactly the Hausdorff distance between the two convex sets {T_+⊤ b_+, b_+ ∈ [0,1]^{n_+}}
and {T_−⊤ b_−, b_− ∈ [0,1]^{n_−}} (referred to as zonotopes, see below).
Given the pair (b_+, b_−) achieving the Hausdorff distance, we may compute the
optimal v as v = arg max_{‖v‖_p≤1} v⊤(T_+⊤ b_+ − T_−⊤ b_−). Note this has not changed the problem
at all, since it is equivalent. It is still NP-hard in general (König, 2014). But we now have a
geometric interpretation with potential approximation algorithms. See below and Section 6.
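For very small instances, the reformulation can be evaluated directly; the sketch below (our own code, assuming NumPy and SciPy, and p = q = 2) computes one side of the max/min expression by enumerating the vertices of the first zonotope and projecting onto the second.

```python
import itertools
import numpy as np
from scipy.optimize import lsq_linear

def one_sided_zonotope_distance(T_plus, T_minus):
    """max over {T_+^T b_+} of the distance to the zonotope {T_-^T b_-} (one side).

    The maximum of a convex function over a zonotope is attained at a vertex,
    so we enumerate b_+ in {0,1}^{n_+} (exponential: only for tiny examples).
    """
    best, best_pair = -np.inf, None
    for b_plus in itertools.product([0.0, 1.0], repeat=T_plus.shape[0]):
        c = T_plus.T @ np.array(b_plus)
        # distance from c to the zonotope {T_-^T b_-, b_- in [0,1]^{n_-}}
        res = lsq_linear(T_minus.T, c, bounds=(0.0, 1.0))
        dist = np.linalg.norm(T_minus.T @ res.x - c)
        if dist > best:
            best, best_pair = dist, (c, T_minus.T @ res.x)
    return best, best_pair

# Taking the max of both orderings gives the Hausdorff distance, and
# v = (c - projection) / ||c - projection||_2 is the corresponding direction.
```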
Zonotopes. A zonotope A is the Minkowski sum of a finite number of segments from the
origin, that is, of the form
A = [0, t_1] + · · · + [0, t_r] = { Σ_{i=1}^r b_i t_i , b ∈ [0, 1]^r },
Figure 3: Two zonotopes in two dimensions: (left) vectors, and (right) their Minkowski sum (represented as a polygon).
Figure 4: Left: two zonotopes (with their generating segments) and the segments achieving the two sides of the Hausdorff distance. Right: approximation by ellipsoids.
ratio is not good enough to get any relevant bound for our purpose (see Section 5.5), as for
computing the Hausdorff distance, we care about potentially vanishing differences that are
swamped by constant-factor approximations.
Nevertheless, the ellipsoid approximation may prove useful in practice, in particular
because the ℓ2-Hausdorff distance between two ellipsoids may be computed in polynomial
time (see Appendix E).
NP-hardness. Given the reduction of the case α = 1 (rectified linear units) to α = 0
(exact thresholds) (Livni et al., 2014), the incremental problem is also NP-hard, as is obtaining a constant-factor approximation. However, this does not rule out convex relaxations
with non-constant approximation ratios (see Section 6 for more details).
where I_+ = {i, y_i > 0} and I_− = {i, y_i < 0}. We have, with t_i = |y_i|^{1/α} z_i ∈ R^{d+1}, and
β ∈ (1, 2] defined by 1/β + 1/α = 1 (we use the fact that the functions u ↦ u^α/α and
v ↦ v^β/β are Fenchel-dual to each other):
max_{‖v‖_p≤1} (1/α) Σ_{i=1}^n y_i (v⊤z_i)_+^α = max_{‖v‖_p≤1} [ Σ_{i∈I_+} (1/α)(v⊤t_i)_+^α − Σ_{i∈I_−} (1/α)(v⊤t_i)_+^α ]
= max_{‖v‖_p≤1} [ Σ_{i∈I_+} max_{b_i≥0} { b_i v⊤t_i − (1/β) b_i^β } − Σ_{i∈I_−} max_{b_i≥0} { b_i v⊤t_i − (1/β) b_i^β } ]
= max_{b_+∈R_+^{I_+}} min_{b_−∈R_+^{I_−}} max_{‖v‖_p≤1} v⊤[T_+⊤ b_+ − T_−⊤ b_−] − (1/β)‖b_+‖_β^β + (1/β)‖b_−‖_β^β
by Fenchel duality,
= max_{b_+∈[0,1]^{I_+}} min_{b_−∈[0,1]^{I_−}} ‖T_+⊤ b_+ − T_−⊤ b_−‖_q − (1/β)‖b_+‖_β^β + (1/β)‖b_−‖_β^β,    (3)
the results from Livni et al. (2014), which state that for the quadratic activation function
the incremental problem is equivalent to an eigendecomposition (and hence solvable in
polynomial time), do not apply.
4. Approximation Properties
In this section, we consider the approximation properties of the set F1 of functions defined
on Rd . As mentioned earlier, the norm used to penalize input weights w or v is irrelevant
for approximation properties as all norms are equivalent. Therefore, we focus on the case
q = p = 2 and `2 -norm constraints.
Because we consider homogeneous activation functions, we start by studying the set
G1 of functions defined on the unit `2 -sphere Sd ⊂ Rd+1 . We denote by τd the uniform
probability measure on S^d. The set G1 is defined as the set of functions on the sphere
such that g(z) = ∫_{S^d} σ(v⊤z) p(v) dτ_d(v), with the norm γ1(g) equal to the smallest possible
value of ∫_{S^d} |p(v)| dτ_d(v). We may also define the corresponding squared RKHS norm by the
smallest possible value of ∫_{S^d} |p(v)|^2 dτ_d(v), with the corresponding RKHS G2.
In this section, we first consider approximation properties of functions in G1 by a finite
number of neurons (only for α = 1). We then study approximation properties of functions on
the sphere by functions in G1 . It turns out that all our results are based on the approximation
properties of the corresponding RKHS G2 : we give sufficient conditions for being in G2 , and
then approximation bounds for functions which are not in G2 . Finally we transfer these to
the spaces F1 and F2 , and consider in particular functions which only depend on projections
on a low-dimensional subspace, for which the properties of G1 and G2 (and of F1 and F2 )
differ. This property is key to obtaining generalization bounds that show adaptivity to
linear structures in the prediction functions (as done in Section 5).
Approximation properties of neural networks with finitely many neurons have been
studied extensively (see, e.g., Petrushev, 1998; Pinkus, 1999; Makovoz, 1998; Burger and
Neubauer, 2001). In Section 4.7, we relate our new results to existing work from the
literature on approximation theory, by showing that our results provide an explicit control
of the various weight vectors which are needed for bounding the estimation error in Section 5.
with r ≤ C(d) ε^{−2+6/(d+3)} = C(d) ε^{−2d/(d+3)}, for some constant C(d) that depends only on
d. We may then simply write
and approximate the last two terms with error εµ± (Sd ) with r terms, leading to an approxi-
mation of εµ+ (Sd ) + εµ− (Sd ) = εγ1 (g) = ε, with a remainder that is a linear function q > z of
z, with kqk2 6 1. We may then simply add two extra units with vectors q/kqk2 and weights
−kqk2 and kqk2 . We thus obtain, with 2r + 2 units, the desired approximation result.
Note that Bourgain et al. (1989, Theorem 6.5) showed that the scaling in ε in Eq. (4)
is not improvable if the measure is allowed to have non-equal weights on all points; the
proof relies on the non-approximability of the Euclidean ball by centered zonotopes. This
result does not apply here, because we may have different weights µ_−(S^d) and µ_+(S^d).
Note that the proposition above is slightly improved in terms of the scaling of the num-
ber of neurons with respect to the approximation error ε (improved exponent), compared
to conditional gradient bounds (Barron, 1993; Kurkova and Sanguineti, 2001). Indeed, the
simple use of conditional gradient leads to r 6 ε−2 γ1 (g)2 , with a better constant (indepen-
dent of d) but a worse scaling in ε—also with a result in L2 (Sd )-norm and not uniformly on
the ball {kxkq 6 R}. Note also that the conditional gradient algorithm gives a construc-
tive way of building the measure. Moreover, the proposition above is related to the result
from Makovoz (1998, Theorem 2), which applies for α = 0 but with a number of neurons
growing as ε−2d/(d+1) , or to the one of Burger and Neubauer (2001, Example 3.1), which
applies to a piecewise affine sigmoidal function but with a number of neurons growing as
ε−2(d+1)/(d+3) (both slightly worse than ours).
Finally, the number of neurons needed to express a function with a bound on the γ2 -
norm can be estimated from general results on approximating reproducing kernel Hilbert
space described in Section 2.3, whose kernel can be expressed as an expectation. Indeed,
Bach (2017) shows that with k neurons, one can approximate a function in F2 with unit
γ2 -norm with an error measured in L2 of ε = k −(d+3)/(2d) . When inverting the relationship
between k and ε, we get a number of neurons scaling as ε−2d/(d+3) , which is the same as in
Prop. 1 but with an error in L2 -norm instead of L∞ -norm.
sphere Sd belongs to the family of dot-product kernels (Smola et al., 2001) that only depends
on the dot-product x> y, although in our situation, the function is not particularly simple
(see formulas in Section 3.1). The analysis of these kernels is similar to that of translation-invariant kernels; for d = 1, i.e., on the 2-dimensional sphere, it is done through Fourier
series, while for d > 1, spherical harmonics have to be used, as the expansion of functions
in series of spherical harmonics makes the computation of the RKHS norm explicit (see a
review of spherical harmonics in Appendix D.1 with several references therein). Since the
calculus is tedious, all proofs are put in appendices, and we only present here the main
results. In this section, we provide simple sufficient conditions for belonging to G2 (and
hence G1 ) based on the existence and boundedness of derivatives, while in the next section,
we show how any Lipschitz-function may be approximated by functions in G2 (and hence
G1 ) with precise control of the norm of the approximating functions.
The derivatives of functions defined on Sd may be defined in several ways, using the
manifold structure (see, e.g., Absil et al., 2009) or through polar coordinates (see, e.g.,
Atkinson and Han, 2012, Chapter 3). For d = 1, the one-dimensional sphere S1 ⊂ R2 may
be parameterized by a single angle and thus the notion of derivatives and the proof of the
following result is simpler and based on Fourier series (see Appendix C.2). For the general
proof based on spherical harmonics, see Appendix D.2.
– Tightness of conditions: as shown in Appendix D.5, there are functions g which have
bounded first s derivatives and do not belong to G2 while s ≤ d/2 + α (at least when
s − α is even). Therefore, when s − α is even, the scaling in (d − 1)/2 + α is optimal.
– Dependence on α: for any d, the higher the α, the stricter the sufficient condition.
Given that the estimation error grows slowly with α (see Section 5.1), low values of
α would be preferred in practice.
This proposition is shown in Appendix C.3 for d = 1 (using Fourier series) and in
Appendix D.4 for all d > 1 (using spherical harmonics). We can make the following obser-
vations:
– Dependence on δ and η: as expected, the main term in the error bound, (δ/η)^{−1/(α+(d−1)/2)},
is a decreasing function of δ/η; that is, when the norm γ2(h) is allowed to grow, the
approximation gets tighter, and when the Lipschitz constant of g increases, the approximation is less tight.
– Tightness: in Appendix D.5, we provide a function which is not in the RKHS and
for which the tightest possible approximation scales as δ^{−2/(d/2+α−2)}. Thus the linear
scaling of the rate in d/2 + α is not improvable (but constants are).
We see that for α = 1, the γ1-norm is less than a constant, and is much smaller than
the γ2-norm (which scales as √d). For α ≥ 2, we were not able to derive better bounds for
γ1 (other than the value of γ2).
γ1 (g̃) = |µ̃|(S1 ), which is not increasing in d. If we consider any vector t ∈ Rd+1 which is
orthogonal to w in Rd+1 , then, we may define a measure µ supported in the circle defined by
the two vectors w and t and which is equal to µ̃ on that circle. The total variation of µ is the
one of µ̃ while g can be decomposed using µ, and thus γ1(g) ≤ γ1(g̃). Similarly, Prop. 3 could
also be applied (and will be when obtaining generalization bounds); moreover, our reasoning works for
any low-dimensional projection: the dependence on a lower-dimensional projection allows
us to reduce smoothness requirements.
However, for the RKHS norm γ2, this reasoning does not apply. For example, as shown
in Appendix D.5, there exists a function ϕ which is s-times differentiable for s ≤ d/2 + α
(when s − α is even) and is not in G2. Thus, given Prop. 2, the dependence on a uni-dimensional projection does not make a difference regarding the level of smoothness which
is required to belong to G2.
Proposition 5 (Finite variation) Assume that f : R^d → R is such that all i-th order
derivatives exist and are upper-bounded on the ball {‖x‖_q ≤ R} by η/R^i for i ∈ {0, . . . , s},
where s is the smallest integer such that s ≥ (d − 1)/2 + α + 1; then f ∈ F2 and γ2(f) ≤
C(d, α) η, for a constant C(d, α) that depends only on d and α.
Proof By assumption, the function x ↦ f(Rx) has all its derivatives bounded by a constant times η. Moreover, we have defined g(t, a) = f(Rt/a) a^α so that all derivatives are
bounded by η. The result then follows immediately from Prop. 2.
Proof With the same reasoning as above, we obtain that g is Lipschitz-continuous with
constant η; we thus get the desired approximation error from Prop. 3.
Linear functions. If f(x) = w⊤x + b, with ‖w‖_2 ≤ η and b ≤ ηR, then for α = 1, it is
straightforward that γ1(f) ≤ 2Rη. Moreover, we have γ2(f) ∼ CRη. For other values of
α, we also have γ1-norms less than a constant (depending only on α) times Rη. The RKHS
norms are a bit harder to compute since a linear function f leads to a linear function g
only for α = 1.
`1 -penalty on input weights (p=1). When using an `1 -penalty on input weights instead
of an `2 -penalty, the results in Prop. 5 and 6 are unchanged (only the constants that
depend on d are changed). Moreover, when kxk∞ 6 1 almost surely, functions of the form
f(x) = ϕ(w⊤x), where ‖w‖_1 ≤ η and ϕ : R → R, will also inherit the
properties of ϕ (without any dependence on dimension). Similarly, for functions of the
form f(x) = Φ(W⊤x) for W ∈ R^{d×s} with all columns of ℓ1-norm less than η, we have
‖W⊤x‖_∞ ≤ Rη and we can apply the s-dimensional result.
5. Generalization Bounds
Our goal is to derive the generalization bounds outlined in Section 2.4 for neural networks
with a single hidden layer. The main results that we obtain are summarized in Table 2 and
show adaptivity to assumptions that avoid the curse of dimensionality.
More precisely, given some distribution over the pairs (x, y) ∈ X × Y, a loss function
` : Y × R → R, our aim is to find a function f : Rd → R such that J(f ) = E[`(y, f (x))]
is small, given some i.i.d. observations (xi , yi ), i = 1, . . . , n. We consider the empirical risk
minimization framework over a space of functions F, equipped with a norm γ (in our situ-
ations, F1 and F2, equipped with γ1 or γ2). The empirical risk Ĵ(f) = (1/n) Σ_{i=1}^n ℓ(y_i, f(x_i))
is minimized by constraining f to be in the ball F^δ = {f ∈ F, γ(f) ≤ δ}.
We assume that almost surely ‖x‖_q ≤ R, that for all y the function u ↦ ℓ(y, u) is
G-Lipschitz-continuous on {|u| ≤ √2 δ}, and that almost surely ℓ(y, 0) ≤ Gδ. As before, z
denotes z = (x⊤, R)⊤, so that ‖z‖_q ≤ √2 R. This corresponds to the following examples:
– Logistic regression and support vector machines: we have G = 1.
– Least-squares regression: we take G = max{ √2 δ + ‖y‖_∞ , ‖y‖_∞^2 / (√2 δ) } (see the quick check below).
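For instance, for the least-squares loss ℓ(y, u) = (y − u)^2/2, a quick check of this constant (our own verification, under the stated assumptions) reads

|∂_u ℓ(y, u)| = |u − y| ≤ √2 δ + ‖y‖_∞ for |u| ≤ √2 δ, and ℓ(y, 0) = y^2/2 ≤ (‖y‖_∞^2 / (√2 δ)) · δ,

so both the Lipschitz requirement and ℓ(y, 0) ≤ Gδ hold with G = max{√2 δ + ‖y‖_∞, ‖y‖_∞^2/(√2 δ)}.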
Approximation errors inf f ∈F δ J(f ) − inf f ∈F J(f ) will be obtained from the approxima-
tion results from Section 4 by assuming that the optimal target function f∗ has a specific
form. Indeed, we have:
inf_{f∈F^δ} J(f) − J(f_*) ≤ G inf_{f∈F^δ} { sup_{‖x‖_q≤R} |f(x) − f_*(x)| }.
Proposition 7 (Uniform deviations) We have the following bound on the expected uniform deviation:
E sup_{γ1(f)≤δ} |Ĵ(f) − J(f)| ≤ 4 (Gδ/√n) C(p, d, α),
with the following constants:
– for α ≥ 1, C(p, d, α) ≤ α √(2 log(d + 1)) for p = 1, and C(p, d, α) ≤ α √(p/(p − 1)) for p ∈ (1, 2];
– for α = 0, C(p, d, α) ≤ C √(d + 1), where C is a universal constant.
E sup_{γ1(f)≤δ} |J(f) − Ĵ(f)|
≤ 2 E sup_{γ1(f)≤δ} (1/n) Σ_{i=1}^n τ_i ℓ(y_i, f(x_i))    using Rademacher random variables τ_i,
≤ 2 E sup_{γ1(f)≤δ} (1/n) Σ_{i=1}^n τ_i ℓ(y_i, 0) + 2 E sup_{γ1(f)≤δ} (1/n) Σ_{i=1}^n τ_i [ℓ(y_i, f(x_i)) − ℓ(y_i, 0)]
≤ 2 Gδ/√n + 2G E sup_{γ(f)≤δ} (1/n) Σ_{i=1}^n τ_i f(x_i)    using the Lipschitz-continuity of the loss,
≤ 2 Gδ/√n + 2Gδ E sup_{‖v‖_p≤1/R} (1/n) Σ_{i=1}^n τ_i (v⊤z_i)_+^α    using Eq. (2).
E sup_{γ1(f)≤δ} |J(f) − Ĵ(f)| ≤ 2 Gδ/√n + 2Gδα E sup_{‖v‖_p≤1/R} (1/n) Σ_{i=1}^n τ_i v⊤z_i
using the α-Lipschitz-continuity of (·)_+^α on [−1, 1],
≤ 2 Gδ/√n + 2 (Gαδ/(Rn)) E ‖ Σ_{i=1}^n τ_i z_i ‖_q .
From Kakade et al. (2009), we get the following bounds on Rademacher complexities:
– If p ∈ (1, 2], then q ∈ [2, ∞), and E‖ Σ_{i=1}^n τ_i z_i ‖_q ≤ √(q − 1) R √n = R √(n/(p − 1)).
– If p = 1, then q = ∞, and E‖ Σ_{i=1}^n τ_i z_i ‖_q ≤ R √(2 n log(d + 1)).
Overall, we have E‖ Σ_{i=1}^n τ_i z_i ‖_q ≤ √n R C(p, d) with C(p, d) defined above, and thus
E sup_{γ(f)≤δ} |J(f) − Ĵ(f)| ≤ 2 (Gδ/√n) (1 + α C(p, d)) ≤ 4 (Gδα/√n) C(p, d).
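The Rademacher-complexity bounds used above are easy to check empirically; a small simulation (our own code) for the p = 1, q = ∞ case:

```python
import numpy as np

# empirical check of E || sum_i tau_i z_i ||_inf <= R sqrt(2 n log(d+1))
rng = np.random.default_rng(3)
n, d, R = 200, 50, 1.0
Z = rng.uniform(-R, R, (n, d + 1))                      # ||z_i||_inf <= R
tau = rng.choice([-1.0, 1.0], size=(5000, n))           # Rademacher draws
vals = np.abs(tau @ Z).max(axis=1)                      # || sum_i tau_i z_i ||_inf
print(vals.mean(), R * np.sqrt(2 * n * np.log(d + 1)))  # empirical mean vs bound
```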
Function space | ‖·‖_2, α ≥ 1 | ‖·‖_1, α ≥ 1 | α = 0
w⊤x + b | d^{1/2}/n^{1/2} | q^{1/2} (log d / n)^{1/2} | (dq)^{1/2}/n^{1/2}
Σ_{j=1}^k f_j(w_j⊤x), w_j ∈ R^d | kd^{1/2} n^{−1/(2α+2)} log n | kq^{1/2} (log d)^{1/(α+1)} n^{−1/(2α+2)} log n | k(dq)^{1/2}/n^{1/2}
Σ_{j=1}^k f_j(W_j⊤x), W_j ∈ R^{d×s} | kd^{1/2} n^{−1/(2α+s+1)} log n | kq^{1/2} (log d)^{1/(α+(s+1)/2)} n^{−1/(2α+s+1)} log n | (dq)^{1/2} d^{1/(s+1)} n^{−1/(s+1)} log n
Table 2: Summary of generalization bounds with different settings. See text for details.
– Comparing different values of α: the value α = 0 always has the best scaling in n,
but constants are better for α > 1 (among which α = 1 has the better scaling in n).
– Bounds for F2: The simplest upper bound for the penalization by the space F2
depends on the approximation properties of F2. For linear functions and α = 1,
it is less than √d ηR, with a bound GRη √(d/n). For the other values of α, there is a
constant C(d). Otherwise, there is no adaptivity, and all other situations only lead to
upper-bounds of O(n^{−1/(2α+d+1)}). See more details in Section 5.4.
– Sample complexity: Note that the generalization bounds above may be used to obtain
sample complexity results such as d ε^{−2} for affine functions, (ε k^{−1} d^{−1/2})^{−2α−2} for projection pursuit, and (ε k^{−1} d^{−1/2})^{−s−1−2α} for the generalized version (up to logarithmic
terms).
– Relationship to existing work: Maiorov (2006, Theorem 1.1) derives similar results for
neural networks with sigmoidal activation functions (that tend to one at infinity) and
the square loss only, and for a level of smoothness of the target function which grows
with dimension (in this case, one can easily get rates of n^{−1/2}). Our result holds
for problems where only bounded first-order derivatives are assumed, but by using
Prop. 2, we would get a similar rate by ensuring the target function belongs to F2 and
hence to F1.
Lower bounds. In the sections above, we have only provided generalization bounds.
Although interesting, deriving lower-bounds for the generalization performance when the
target function belongs to certain function classes is out of the scope of this paper. Note
however, that results from Sridharan (2012) suggest that the Rademacher complexities of
the associated function classes provide such lower-bounds. For general Lipschitz-functions,
these Rademacher complexities decrease as n^{−1/max{d,2}} (von Luxburg and Bousquet, 2004).
W_j ∈ R^{d×s}, having all columns with ℓ2-norm less than η (note that this is a weaker requirement than having all singular values less than η). If we assume that each of these
columns has at most q non-zeros, then the ℓ1-norms are less than η√q and we may use the
approximation properties described at the end of Section 4.6. We also assume that each F_j
is bounded by ηr√q and 1-Lipschitz-continuous (with respect to the ℓ2-norm).
We may approximate each x ↦ F_j(W_j⊤x) by a function with γ1-norm less than δηr√q and
uniform approximation error C(α, s) ηr√q δ^{−1/(α+(s−1)/2)} log δ. This leads to a total approximation error of k C(α, s) G ηr√q δ^{−1/(α+(s−1)/2)} log δ.
For α ≥ 1, the estimation error is kGr√q η δ √(log d)/√n, with an overall bound which is
equal to C(α, s) kGr√q η ( δ^{−1/(α+(s−1)/2)} log δ + δ √(log d)/√n ). With δ = (n/log d)^{(α+(s−1)/2)/(2α+s+1)},
we get an optimized bound of C(α, s) kGr√q η (log d)^{1/(2α+s+1)} n^{−1/(2α+s+1)} log n.
For α = 0, we have the bound C(s) kGr√q η (n/d)^{−1/(s+1)} log(n/d); that is, we cannot use the sparsity,
as the problem is invariant to the chosen norm on hidden weights.
– Group penalties: in this paper, we only consider `1 -norm on input weights; when
doing joint variable selection for all basis functions, it may be worth using a group
penalty (Yuan and Lin, 2006; Bach, 2008a).
up to a constant factor. That is, there exists κ > 1, such that for all y and z, we may
compute v̂ such that kv̂kp = 1 and
(1/n) Σ_{i=1}^n y_i (v̂⊤z_i)_+^α ≥ (1/κ) sup_{‖v‖_p=1} (1/n) Σ_{i=1}^n y_i (v⊤z_i)_+^α.
This is provably NP-hard for α = 0 (see Section 3.2), and for α = 1 (see Section 3.3). If such
an algorithm is available, the approximate conditional gradient presented in Section 2.5 leads
to an estimator with the same generalization bound. Moreover, given the strong hardness
results for improper learning in the situation α = 0 (Klivans and Sherstov, 2006; Livni et al.,
2014), a convex relaxation that would consider a larger set of predictors (e.g., by relaxing
vv⊤ into a symmetric positive-definite matrix) and obtain a constant approximation
guarantee is also ruled out.
However, this is only a sufficient condition, and a simpler sufficient condition may be
obtained. In the following, we consider V = {v ∈ Rd+1 , kvk2 = 1} and basis functions
ϕv (z) = (v > z)α+ (that is we specialize to the `2 -norm penalty on weight vectors). We
consider a new variation norm γ̂1 which has to satisfy the following assumptions:
– Lower-bound on γ1 : It is defined from functions ϕ̂v̂ , for v̂ ∈ V̂, where for any v ∈ V,
there exists v̂ ∈ V̂ such that ϕv = ϕ̂v̂ . This implies that the corresponding space F̂1
is larger than F1 and that if f ∈ F1 , then γ̂1 (f ) 6 γ1 (f ).
– Polynomial-time algorithm for dual norm: The dual norm sup_{v̂∈V̂} (1/n) | Σ_{i=1}^n y_i ϕ̂_v̂(z_i) | may
be computed in polynomial time.
– Performance guarantees for random directions: There exists κ > 0, such that for any
vectors z_1, . . . , z_n ∈ R^{d+1} with ℓ2-norm less than R, and a random standard Gaussian
vector y ∈ R^n,
sup_{v̂∈V̂} (1/n) | Σ_{i=1}^n y_i ϕ̂_v̂(z_i) | ≤ κ R/√n.    (6)
We may also replace the standard Gaussian vectors by Rademacher random variables.
These approximation algorithms may be divided into three families: they may be based
(a) on geometric interpretations as linear binary classification or computing Hausdorff
distances (see Sections 3.2 and 3.3), (b) on direct relaxations, or (c) on relaxations
of sign vectors. For simplicity, we only focus on the case p = 2 (that is, an ℓ2-constraint on
weights) and on α = 1 (rectified linear units). As described in Section 5.5, constant-factor
approximation ratios are not possible, while approximation ratios that increase with n
are possible (but, as of now, we only obtain scalings in n that provide a provable sample
complexity with a polynomial-time algorithm which is exponential in the dimension d).
which leads to a convex program in U = uu⊤, V = vv⊤ and J = uv⊤, that is, a semidefinite
program with d + n dimensions, with the constraints
and the usual semidefinite constraint ( U  J ; J⊤  V ) ≽ ( u ; v )( u ; v )⊤, with the additional
constraint that 4U_ii + z_i⊤ V z_i − 4δ_i⊤ J z_i = tr(V z_i z_i⊤).
If we add these constraints on top of the ones above, we obtain a tighter relaxation. Note
that for this relaxation, we must have [(2u_i − v⊤z_i) − (2u_j − v⊤z_j)] less than a constant times
‖z_i − z_j‖_2. Hence, the result mentioned above regarding Lipschitz-continuous functions and
the scaling of the upper-bound for random y holds (with the dependence on n which is not
good enough to preserve the generalization bounds with a polynomial-time algorithm).
We then need to maximize (1/(2n)) Σ_{i=1}^n y_i δ_i⊤ J x_i + (1/(2n)) Σ_{i=1}^n y_i v⊤ x_i, which leads to a semidefinite
program. Again, empirically, it did not lead to the correct scaling as a function of n for
random Gaussian vectors y ∈ R^n.
7. Conclusion
In this paper, we have provided a detailed analysis of the generalization properties of convex
neural networks with positively homogeneous non-decreasing activation functions. Our main
new result is the adaptivity of the method to underlying linear structures such as the
dependence on a low-dimensional subspace, a setting which includes non-linear variable
selection in the presence of potentially many input variables.
All our current results apply to estimators for which no polynomial-time algorithm is
known to exist and we have proposed sufficient conditions under which convex relaxations
could lead to the same bounds, leaving open the existence or non-existence of such algo-
rithms. Interestingly, these problems have simple geometric interpretations, either as binary
linear classification, or as computing the Hausdorff distance between two zonotopes.
In this work, we have considered a single real-valued output; the functional analysis
framework readily extends to outputs in a finite-dimensional vector-space where vector-
valued measures could be used, and then apply to multi-task or multi-class problems. How-
ever, the extension to multiple hidden layers does not appear straightforward as the units
of the last hidden layers share the weights of the first hidden layers, which should require a
new functional analysis framework.
Acknowledgements
The author was partially supported by the European Research Council (SIERRA Project),
and thanks Nicolas Le Roux for helpful discussions. The author also thanks Varun Kanade
for pointing out the NP-hardness result for linear classification. The main part of this work was
carried out while visiting the Centre de Recerca Matemàtica (CRM) in Barcelona.
for a fixed κ > 1. We now propose a modification of the conditional gradient algorithm
that converges to a certain ĥ such that γ(ĥ) ≤ δ and for which inf_{γ(h)≤δ} J(h) ≤ J(ĥ) ≤
inf_{γ(h)≤δ/κ} J(h).
We assume the smoothness of the function J with respect to the norm γ, that is, for a
certain L > 0, for all h, h′ such that γ(h) ≤ δ,
J(h′) ≤ J(h) + ⟨J′(h), h′ − h⟩ + (L/2) γ(h − h′)^2.    (7)
We consider the following recursion
In the previous recursion, one may replace the minimization of J on the segment [ht , ĥt ]
with the minimization of its upper-bound of Eq. (7) taken at h = ht . From the recursion,
all iterates are in the γ-ball of radius δ. Following the traditional convergence proof for the
conditional gradient method (Dunn and Harshbarger, 1978; Jaggi, 2013), we have, for any
ρ in [0, 1]:
This is valid for any ρ ∈ [0, 1]. If J(h_t) − J(h_*) ≤ 0 for some t, then by taking ρ = 0
it remains so for all larger t. Therefore, up to the (potentially never happening)
point where J(h_t) − J(h_*) ≤ 0, we can apply the regular proof of the conditional gradient
to obtain J(h_t) ≤ inf_{γ(h)≤δ/κ} J(h) + 4Lρ_t δ^2, which leads to the desired result. Note that a
similar reasoning may be used for ρ = 2/(t + 1).
= (1/π) ∫_0^{2π} [ (1/2π) ∫_0^{2π} p(ϕ) σ(cos(η − ϕ)) dϕ ] cos k(θ − η) dη    through the decomposition of g,
= (1/(2π^2)) ∫_0^{2π} p(ϕ) [ ∫_0^{2π} σ(cos(η − ϕ)) cos k(θ − η) dη ] dϕ
= (1/(2π^2)) ∫_0^{2π} p(ϕ) [ ∫_0^{2π} σ(cos η) cos k(θ − ϕ − η) dη ] dϕ    by a change of variable,
= (1/(2π^2)) ∫_0^{2π} p(ϕ) [ cos k(θ − ϕ) ∫_0^{2π} σ(cos η) cos kη dη + sin k(θ − ϕ) ∫_0^{2π} σ(cos η) sin kη dη ] dϕ    by expanding the cosine,
= (1/2π) ∫_0^{2π} σ(cos η) cos kη dη × (1/π) ∫_0^{2π} p(ϕ) cos k(θ − ϕ) dϕ + 0    by a parity argument,
= λ_k p_k(θ)    with λ_k = (1/2π) ∫_0^{2π} σ(cos η) cos kη dη.
For k = 0, the same equality holds (except that the two coefficients g_0 and p_0 are divided
by 2π instead of π).
Thus we may express ‖p‖^2_{L2(S^d)} as
‖p‖^2_{L2(S^d)} = Σ_{k≥0} ‖p_k‖^2_{L2(S^d)} = Σ_{λ_k≠0} ‖p_k‖^2_{L2(S^d)} + Σ_{λ_k=0} ‖p_k‖^2_{L2(S^d)}
= Σ_{λ_k≠0} (1/λ_k^2) ‖g_k‖^2_{L2(S^d)} + Σ_{λ_k=0} ‖p_k‖^2_{L2(S^d)}.
If we minimize over p, we thus need to have ‖p_k‖^2_{L2(S^d)} = 0 for λ_k = 0, and we get
γ2(g)^2 = Σ_{λ_k≠0} (1/λ_k^2) ‖g_k‖^2_{L2(S^d)}.    (8)
λk 6=0
We thus simply need to compute λk and its decay for all values of α, and then relate them
to the smoothness properties of g, which is standard for Fourier series.
C.1 Computing λk
We now detail the computation of λ_k = (1/2π) ∫_0^{2π} σ(cos η) cos kη dη for the different functions
σ = (·)_+^α. We have for α = 0:
(1/2π) ∫_0^{2π} 1_{cos η>0} cos kη dη = (1/2π) ∫_{−π/2}^{π/2} cos kη dη = (1/(πk)) sin(kπ/2)    if k ≠ 0.
For k = 0 it is equal to 1/2. It is equal to zero for all other even k, and different from zero
for all odd k, with λ_k going to zero as 1/k.
We have for α = 1:
(1/2π) ∫_0^{2π} (cos η)_+ cos kη dη = (1/2π) ∫_{−π/2}^{π/2} cos η cos kη dη
= (1/2π) ∫_{−π/2}^{π/2} [ (1/2) cos(k + 1)η + (1/2) cos(k − 1)η ] dη
= (1/4π) [ (2/(k + 1)) sin((k + 1)π/2) + (2/(k − 1)) sin((k − 1)π/2) ]
= (cos(kπ/2)/2π) [ 1/(k + 1) − 1/(k − 1) ] = −cos(kπ/2) / (π(k^2 − 1))    for k ≠ 1.
For k = 1, it is equal to 1/4. It is equal to zero for all other odd k, and different from zero
for all even k, with λ_k going to zero as 1/k^2.
For α = 2, we have:
(1/2π) ∫_0^{2π} (cos η)_+^2 cos kη dη = (1/2π) ∫_{−π/2}^{π/2} (cos η)^2 cos kη dη = (1/2π) ∫_{−π/2}^{π/2} ((1 + cos 2η)/2) cos kη dη
= (1/2π) ∫_{−π/2}^{π/2} [ (1/2) cos kη + (1/4) cos(k + 2)η + (1/4) cos(k − 2)η ] dη
= (1/4π) [ (2/k) sin(kπ/2) + (1/(k + 2)) sin((k + 2)π/2) + (1/(k − 2)) sin((k − 2)π/2) ]
= (sin(kπ/2)/4π) [ 2/k − 1/(k + 2) − 1/(k − 2) ]
= (sin(kπ/2)/4π) (2k^2 − 8 − k^2 + 2k − k^2 − 2k) / (k(k^2 − 4))
= −8 sin(kπ/2) / (4πk(k^2 − 4))    for k ∉ {0, 2}.
For k = 0, it is equal to 1/4, and for k = 2, it is equal to 1/8. It is equal to zero for all
other even k, and different from zero for all odd k, with λ_k going to zero as 1/k^3.
The general case for $\alpha \geq 2$ will be shown for all $d$ in Appendix D.2: for all $\alpha \in \mathbb{N}$,
$\lambda_k$ is different from zero for $k$ having the opposite parity of $\alpha$, with a decay as $1/k^{\alpha+1}$. All
values from $k = 0$ to $\alpha$ are also different from zero. All larger values with the same parity
as $\alpha$ are equal to zero.
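These closed forms are easy to check numerically; the following sketch (ours, assuming NumPy/SciPy are available; all function names are arbitrary) compares a quadrature evaluation of $\lambda_k$ with the expressions derived above.

```python
import numpy as np
from scipy.integrate import quad

def sigma(t, alpha):
    """Activation sigma(t) = (t)_+^alpha, with (t)_+^0 understood as 1_{t > 0}."""
    if alpha == 0:
        return 1.0 if t > 0 else 0.0
    return max(t, 0.0) ** alpha

def lambda_k_numeric(k, alpha):
    """lambda_k = (1/2pi) * int_0^{2pi} sigma(cos eta) cos(k eta) d eta, by quadrature."""
    val, _ = quad(lambda eta: sigma(np.cos(eta), alpha) * np.cos(k * eta),
                  0.0, 2.0 * np.pi, points=[np.pi / 2, 3 * np.pi / 2], limit=200)
    return val / (2.0 * np.pi)

def lambda_k_closed_form(k, alpha):
    """Closed forms computed above for alpha in {0, 1, 2}."""
    if alpha == 0:
        return 0.5 if k == 0 else np.sin(k * np.pi / 2) / (np.pi * k)
    if alpha == 1:
        return 0.25 if k == 1 else -np.cos(k * np.pi / 2) / (np.pi * (k ** 2 - 1))
    if alpha == 2:
        if k == 0:
            return 0.25
        if k == 2:
            return 0.125
        return -8 * np.sin(k * np.pi / 2) / (4 * np.pi * k * (k ** 2 - 4))
    raise ValueError("closed form only written out for alpha in {0, 1, 2}")

for alpha in (0, 1, 2):
    for k in range(8):
        assert abs(lambda_k_numeric(k, alpha) - lambda_k_closed_form(k, alpha)) < 1e-6
print("numerical and closed-form values of lambda_k agree for alpha in {0, 1, 2}")
```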
Our goal is to show that for $r$ chosen close enough to 1, the function $\hat g$ defined from $\hat p$
has small enough norm $\gamma_2(\hat g) \leq \|\hat p\|_{L_2(\mathbb{S}^d)}$, and is close to $g$.
Computation of norm. We have
$$\|\hat p\|_{L_2(\mathbb{S}^d)}^2 = \sum_{k,\, \lambda_k \neq 0} \lambda_k^{-2}\, r^{2k}\, \|g_k\|_{L_2(\mathbb{S}^d)}^2.$$
\begin{align*}
\hat g(\theta) &= \frac{1}{\pi}\int_0^{2\pi} \frac{1 - r\cos(\theta-\eta)}{(1 - r\cos(\theta-\eta))^2 + r^2(\sin(\theta-\eta))^2}\, g(\eta)\, d\eta \ -\ \frac{1}{2\pi}\int_0^{2\pi} g(\eta)\, d\eta \\
&= \frac{1}{\pi}\int_0^{2\pi} \frac{1 - r\cos(\theta-\eta)}{1 + r^2 - 2r\cos(\theta-\eta)}\, g(\eta)\, d\eta \ -\ \frac{1}{2\pi}\int_0^{2\pi} g(\eta)\, d\eta \\
&= \frac{1}{2\pi}\int_0^{2\pi} \frac{1 - r^2 + 1 + r^2 - 2r\cos(\theta-\eta)}{1 + r^2 - 2r\cos(\theta-\eta)}\, g(\eta)\, d\eta \ -\ \frac{1}{2\pi}\int_0^{2\pi} g(\eta)\, d\eta \\
&= \frac{1}{2\pi}\int_0^{2\pi} \frac{1 - r^2}{1 + r^2 - 2r\cos(\theta-\eta)}\, g(\eta)\, d\eta.
\end{align*}
We have, for any $\theta \in [0, 2\pi]$,
\begin{align*}
|\hat g(\theta) - g(\theta)| &= \Big| \frac{1}{2\pi}\int_0^{2\pi} \frac{1-r^2}{1+r^2-2r\cos(\theta-\eta)}\, [g(\eta) - g(\theta)]\, d\eta \Big| \\
&\leq \frac{1}{2\pi}\int_0^{2\pi} \frac{1-r^2}{1+r^2-2r\cos(\theta-\eta)}\, |g(\eta) - g(\theta)|\, d\eta \\
&= \frac{1}{2\pi}\int_{-\pi}^{\pi} \frac{1-r^2}{1+r^2-2r\cos\eta}\, |g(\theta) - g(\theta+\eta)|\, d\eta \quad \text{by periodicity,} \\
&= \frac{1}{\pi}\int_{-\pi/2}^{\pi/2} \frac{1-r^2}{1+r^2-2r\cos\eta}\, |g(\theta) - g(\theta+\eta)|\, d\eta \quad \text{by parity of } g, \\
&\leq \frac{1}{\pi}\int_{-\pi/2}^{\pi/2} \frac{1-r^2}{1+r^2-2r\cos\eta}\, \sqrt{2}\, |\sin\eta|\, d\eta \\
&\qquad \text{because the distance on the sphere is bounded by the sine,} \\
&\leq \frac{2}{\pi}\int_0^{\pi} \frac{1-r^2}{1+r^2-2r\cos\eta}\, \sin\eta\, d\eta \\
&= \frac{1}{\pi}\int_0^{1} \frac{1-r^2}{1+r^2-2rt}\, dt \quad \text{by the change of variable } t = \cos\eta, \\
&\leq C(1-r)\int_0^{1} \frac{1}{1+r^2-2rt}\, dt \\
&= C(1-r)\Big[ \frac{-1}{2r}\log(1+r^2-2rt) \Big]_0^1 = C(1-r)\, \frac{1}{2r}\log\frac{1+r^2}{(1-r)^2}.
\end{align*}
It can be easily checked that for any $r \in (1/2, 1)$, the last function is less than a constant
times $C(1-r)\log\frac{1}{1-r}$. We thus get for $\delta$ large enough, by taking $r = 1 - (C/\delta)^{1/\alpha} \in (1/2, 1)$, an error of
$$(C/\delta)^{1/\alpha} \log(C/\delta)^{-1/\alpha} = O(\delta^{-1/\alpha}\log\delta).$$
This leads to the desired result.
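The rate just obtained can be illustrated numerically (this is only a sanity check with an arbitrary Lipschitz test function, not part of the argument): convolve with the Poisson kernel and compare the uniform error with $(1-r)\log\frac{1}{1-r}$.

```python
import numpy as np

def poisson_smooth(g_vals, thetas, r):
    """Discretized convolution with the Poisson kernel
    (1/2pi) * (1 - r^2) / (1 + r^2 - 2 r cos(theta - eta))."""
    d_eta = thetas[1] - thetas[0]
    out = np.empty_like(g_vals)
    for i, theta in enumerate(thetas):
        kernel = (1 - r**2) / (1 + r**2 - 2 * r * np.cos(theta - thetas))
        out[i] = np.sum(kernel * g_vals) * d_eta / (2 * np.pi)
    return out

thetas = np.linspace(0.0, 2 * np.pi, 8000, endpoint=False)
g = np.abs(np.sin(thetas))                 # arbitrary 1-Lipschitz periodic test function
for r in (0.8, 0.9, 0.99):
    err = np.max(np.abs(poisson_smooth(g, thetas, r) - g))
    bound = (1 - r) * np.log(1.0 / (1 - r))
    print(f"r = {r:.2f}   sup-error = {err:.4f}   (1-r) log(1/(1-r)) = {bound:.4f}")
```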
Definition and links with Laplace-Beltrami operator. For any $k \geq 1$ (for $k = 0$,
the constant function is the corresponding basis element), there is an orthonormal basis
of spherical harmonics, $Y_{kj} : \mathbb{S}^d \to \mathbb{R}$, $1 \leq j \leq N(d, k) = \frac{2k+d-1}{k}\binom{k+d-2}{d-1}$. They are such that
$\langle Y_{ki}, Y_{sj} \rangle_{\mathbb{S}^d} = \int_{\mathbb{S}^d} Y_{ki}(x)\, Y_{sj}(x)\, d\tau_d(x) = \delta_{ij}\,\delta_{sk}$.
Each of these harmonics may be obtained from homogeneous polynomials in $\mathbb{R}^{d+1}$ with a
Euclidean Laplacian equal to zero, that is, if we define a function $H_k(y) = Y_{ki}(y/\|y\|_2)\,\|y\|_2^k$
for $y \in \mathbb{R}^{d+1}$, then $H_k$ is a homogeneous polynomial of degree $k$ with zero Laplacian. From
the relationship between the Laplacian in $\mathbb{R}^{d+1}$ and the Laplace-Beltrami operator $\Delta$ on $\mathbb{S}^d$,
$Y_{ki}$ is an eigenfunction of $\Delta$ with eigenvalue $-k(k+d-1)$. Like in Euclidean spaces, the
Laplace-Beltrami operator may be used to characterize differentiability of functions defined
on the sphere (Frye and Efthimiou, 2012; Atkinson and Han, 2012).
$$\sum_{j=1}^{N(d,k)} Y_{kj}(x)\, Y_{kj}(y) = N(d, k)\, P_k(x^\top y),$$
where $P_k$ is the Legendre polynomial of degree $k$ in dimension $d+1$, given by the Rodrigues formula
$$P_k(t) = (-1/2)^k\, \frac{\Gamma(d/2)}{\Gamma(k + d/2)}\, (1-t^2)^{(2-d)/2}\, \frac{d^k}{dt^k}(1-t^2)^{k+(d-2)/2}.$$
They are also referred to as Gegenbauer polynomials. For d = 1, Pk is the k-th Chebyshev
polynomial, such that Pk (cos θ) = cos(kθ) for all θ (and we thus recover the Fourier series
framework of Appendix C). For d = 2, Pk is the usual Legendre polynomial.
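The identification of $P_k$ for $d = 1$ and $d = 2$ can be verified symbolically; the sketch below (ours, assuming SymPy is available) evaluates the Rodrigues-type formula above and compares it with Chebyshev and Legendre polynomials.

```python
import sympy as sp

t = sp.symbols('t')

def P_k(k, d):
    """P_k in dimension d + 1, from the Rodrigues-type formula above."""
    return sp.simplify(
        sp.Rational(-1, 2) ** k * sp.gamma(sp.Rational(d, 2)) / sp.gamma(k + sp.Rational(d, 2))
        * (1 - t**2) ** sp.Rational(2 - d, 2)
        * sp.diff((1 - t**2) ** (k + sp.Rational(d - 2, 2)), t, k))

for k in range(5):
    assert sp.simplify(P_k(k, 1) - sp.chebyshevt(k, t)) == 0   # d = 1: Chebyshev polynomials
    assert sp.simplify(P_k(k, 2) - sp.legendre(k, t)) == 0     # d = 2: Legendre polynomials
print("Rodrigues formula matches Chebyshev (d = 1) and Legendre (d = 2) for k = 0, ..., 4")
```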
The polynomial $P_k$ is even (resp. odd) when $k$ is even (resp. odd), and we have
$$\int_{-1}^{1} P_k(t)\, P_j(t)\, (1-t^2)^{(d-2)/2}\, dt = \frac{\omega_d}{\omega_{d-1}}\, \frac{\delta_{jk}}{N(d, k)}.$$
For small $k$, we have $P_0(t) = 1$, $P_1(t) = t$, and $P_2(t) = \frac{(d+1)t^2 - 1}{d}$.
The Hecke-Funk formula leads to, for any linear combination $Y_k$ of $Y_{kj}$, $j \in \{1, \dots, N(d, k)\}$:
$$\int_{\mathbb{S}^d} f(x^\top y)\, Y_k(y)\, d\tau_d(y) = Y_k(x)\, \frac{\omega_{d-1}}{\omega_d} \int_{-1}^{1} f(t)\, P_k(t)\, (1-t^2)^{(d-2)/2}\, dt.$$
Decomposition of functions in $L_2(\mathbb{S}^d)$. Any function $g : \mathbb{S}^d \to \mathbb{R}$ such that
$\int_{\mathbb{S}^d} g(x)\, d\tau_d(x) = 0$ may be decomposed as
$$g(x) = \sum_{k=1}^{\infty} \sum_{j=1}^{N(d,k)} \langle Y_{kj}, g\rangle\, Y_{kj}(x) = \sum_{k=1}^{\infty} \sum_{j=1}^{N(d,k)} \int_{\mathbb{S}^d} Y_{kj}(y)\, Y_{kj}(x)\, g(y)\, d\tau_d(y) = \sum_{k=1}^{\infty} g_k(x),$$
with $g_k(x) = N(d, k) \int_{\mathbb{S}^d} g(y)\, P_k(x^\top y)\, d\tau_d(y)$,
and
\begin{align*}
\|g_k\|_{L_2(\mathbb{S}^d)}^2
&= \Big( \frac{\omega_{d-1}}{\omega_d} \int_{-1}^{1} \varphi(t)\, P_k(t)\, (1-t^2)^{(d-2)/2}\, dt \Big)^2 \sum_{j=1}^{N(d,k)} Y_{kj}(v)^2 \\
&= \Big( \frac{\omega_{d-1}}{\omega_d} \int_{-1}^{1} \varphi(t)\, P_k(t)\, (1-t^2)^{(d-2)/2}\, dt \Big)^2 N(d, k)\, P_k(1) \\
&= \Big( \frac{\omega_{d-1}}{\omega_d} \int_{-1}^{1} \varphi(t)\, P_k(t)\, (1-t^2)^{(d-2)/2}\, dt \Big)^2 N(d, k),
\end{align*}
using the addition formula and the orthonormality of the $Y_{kj}$.
By using Stirling's formula $\Gamma(x) \approx x^{x-1/2} e^{-x} \sqrt{2\pi}$, we get an equivalent when $k$ or $d$ tends
to infinity as a constant (that depends on $\alpha$) times an expression in which all exponential
terms cancel out. Moreover, when $k$ tends to infinity and $d$ is considered constant, we get
the equivalent $k^{-d/2-\alpha-1/2}$, which we need for the following sections. Finally, when $d$ tends
to infinity and $k$ is considered constant, we get the equivalent $d^{-\alpha/2-k/2+1/2}$.
We will also need expressions of $\lambda_k$ for $k = 0$ and $k = 1$. For $k = 0$, we have:
\begin{align*}
\int_0^1 t^\alpha (1-t^2)^{d/2-1}\, dt &= \int_0^1 (1-u)^{\alpha/2}\, u^{d/2-1}\, \frac{du}{2\sqrt{1-u}} \quad \text{with } t = \sqrt{1-u}, \\
&= \frac{1}{2}\int_0^1 (1-u)^{\alpha/2+1/2-1}\, u^{d/2-1}\, du = \frac{1}{2}\, \frac{\Gamma(\alpha/2+1/2)\,\Gamma(d/2)}{\Gamma(\alpha/2+1/2+d/2)}.
\end{align*}
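This Beta-function identity is straightforward to confirm numerically (an optional check with arbitrary test values of $\alpha$ and $d$, using SciPy):

```python
import numpy as np
from scipy.integrate import quad
from scipy.special import gamma

for alpha, d in [(1, 3), (2, 5), (3, 4)]:   # arbitrary test values
    numeric, _ = quad(lambda t: t**alpha * (1 - t**2) ** (d / 2 - 1), 0.0, 1.0)
    closed = 0.5 * gamma(alpha / 2 + 0.5) * gamma(d / 2) / gamma(alpha / 2 + 0.5 + d / 2)
    assert abs(numeric - closed) < 1e-8
print("integral identity used for lambda_0 verified on test values")
```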
Computing the RKHS norm. Given $g$ with the correct parity, we have
$$\gamma_2(g)^2 = \sum_{k \geq 0} \lambda_k^{-2}\, \|g_k\|_{L_2(\mathbb{S}^d)}^2.$$
Also, $g_k$ are eigenfunctions of the Laplacian with eigenvalues $k(k+d-1)$. Thus, we have
$$\|g_k\|_2^2 \leq \frac{1}{[k(k+d-1)]^s}\, \|f_k\|_{L_2(\mathbb{S}^d)}^2 \leq \|f_k\|_{L_2(\mathbb{S}^d)}^2 / k^{2s},$$
which is always defined when r ∈ (0, 1) because the series is absolutely convergent. This
defines a function ĝ that will have a finite γ2 -norm and be close to g.
Computing the norm. Given our assumption regarding the Lipschitz-continuity of $g$,
we have $g = \Delta^{1/2} f$ with $f \in L_2(\mathbb{S}^d)$ of norm less than 1 (Atkinson and Han, 2012).
Moreover $\|g_k\|_{L_2(\mathbb{S}^d)}^2 \leq C k^2 \|f_k\|_{L_2(\mathbb{S}^d)}^2$. We have
$$\|\hat p\|_{L_2(\mathbb{S}^d)}^2 = \sum_{k,\, \lambda_k \neq 0} \lambda_k^{-2}\, r^{2k}\, \|g_k\|_{L_2(\mathbb{S}^d)}^2.$$
The function $\hat p$ thus defines a function $\hat g \in \mathcal{G}_1$ by $\hat g_k = \lambda_k \hat p_k$, for which $\gamma_2(\hat g) \leq C(d, \alpha)\,(1-r)^{(-d+1)/2 - \alpha}$.
Approximation properties. We now show that g and ĝ are close to each other. Because
of the parity of g, we have ĝk = rk gk . We have, using Theorem 4.28 from Frye and Efthimiou
(2012):
$$\hat g(x) = \sum_{k \geq 0} r^k g_k(x) = \sum_{k \geq 0} r^k N(d, k) \int_{\mathbb{S}^d} g(y)\, P_k(x^\top y)\, d\tau_d(y)$$
\begin{align*}
&= \int_{\mathbb{S}^d} g(y) \Big( \sum_{k \geq 0} r^k N(d, k)\, P_k(x^\top y) \Big) d\tau_d(y) \\
&= \int_{\mathbb{S}^d} g(y)\, \frac{1-r^2}{(1 + r^2 - 2r (x^\top y))^{(d+1)/2}}\, d\tau_d(y).
\end{align*}
We then have
\begin{align*}
g(x) - \hat g(x) &= \int_{\mathbb{S}^d} [g(x) - g(y)]\, \frac{1-r^2}{(1 + r^2 - 2r (x^\top y))^{(d+1)/2}}\, d\tau_d(y) \\
&= 2 \int_{\mathbb{S}^d,\ y^\top x \geq 0} [g(x) - g(y)]\, \frac{1-r^2}{(1 + r^2 - 2r (x^\top y))^{(d+1)/2}}\, d\tau_d(y) \quad \text{by parity of } g, \\
|g(x) - \hat g(x)| &\leq 2 \int_{\mathbb{S}^d,\ y^\top x \geq 0} \sqrt{2}\, \sqrt{1 - x^\top y}\, \frac{1-r^2}{(1 + r^2 - 2r (x^\top y))^{(d+1)/2}}\, d\tau_d(y).
\end{align*}
As shown by Bourgain and Lindenstrauss (1988, Eq. (2.13)), this is less than a constant
that depends on $d$ times $(1-r)\log\frac{1}{1-r}$. We thus get for $\delta$ large enough, by taking $1 - r =
(C/\delta)^{1/(\alpha+(d-1)/2)} \in (0, 1)$, an error of $O(\delta^{-1/(\alpha+(d-1)/2)}\log\delta)$.
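The closed-form kernel used above (Theorem 4.28 of Frye and Efthimiou, 2012) can be checked numerically for $d = 2$, where $N(d, k) = 2k+1$ and $P_k$ is the Legendre polynomial (a small illustrative check, not part of the proof):

```python
import numpy as np
from scipy.special import eval_legendre

r, d = 0.6, 2                                   # arbitrary radius; d = 2 so P_k is Legendre
t = np.linspace(-1.0, 1.0, 11)
series = sum((2 * k + 1) * r**k * eval_legendre(k, t) for k in range(200))
closed = (1 - r**2) / (1 + r**2 - 2 * r * t) ** ((d + 1) / 2)
assert np.allclose(series, closed, atol=1e-10)
print("sum_k r^k N(d,k) P_k(t) matches (1-r^2)/(1+r^2-2rt)^{(d+1)/2} for d = 2")
```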
$N(d, k)$ grows as $k^{d-1}$. In order to use the computation of the RKHS norm derived in
Appendix D.2, we need to make sure that $g$ has the proper parity. This can be done by
removing all harmonics with $k \leq s$ (note that these harmonics are also functions of $w^\top x$,
and thus the function that we obtain is also a function of $w^\top x$). That function then has a
squared RKHS norm equal to
$$\sum_{k > s,\ \lambda_k \neq 0} \|g_k\|_{L_2(\mathbb{S}^d)}^2\, \lambda_k^{-2}.$$
The summand has an asymptotic equivalent proportional to $k^{-d-2s-1}\, k^{d-1}\, k^{d+2\alpha+1}$, which is
equal to $k^{d+2\alpha-2s-1}$. Thus if $d + 2\alpha - 2s \geq 0$, the series is divergent (the function is not in
the RKHS), i.e., if $s \leq \alpha + \frac{d}{2}$.
which should be of order $\delta^2$ (this gives the scaling of $\lambda$ as a function of $\delta$). Then the squared
error is
\begin{align*}
\sum_{k \geq 1} (1 - \alpha_k)^2\, k^{-2s-2} &= \sum_{k \geq 1} \frac{\lambda^2 k^{2d+4\alpha+2}}{(1 + \lambda k^{d+2\alpha+1})^2}\, k^{-2s-2} \\
&\approx \int_0^\infty \frac{\lambda^2 t^{2d+4\alpha-2s}}{(1 + \lambda t^{d+2\alpha+1})^2}\, dt \\
&\approx \lambda^2\, \lambda^{-(2d+4\alpha-2s+1)/(d+2\alpha+1)} = \lambda^{-(2d+4\alpha-2s+1-2d-4\alpha-2)/(d+2\alpha+1)} \\
&= \lambda^{(2s+1)/(d+2\alpha+1)} \approx \delta^{-2(2s+1)/(d+2\alpha-2s)},
\end{align*}
and thus the (non-squared) approximation error scales as $\delta^{-(2s+1)/(d+2\alpha-2s)}$. For $s = 1$, this
leads to a scaling as $\delta^{-3/(d+2\alpha-2)}$.
For $g$ a linear function, $g_k = 0$ except for $k = 1$, for which we have $g_1(x) = v^\top x$, and thus
$\|g_k\|_{L_2(\mathbb{S}^d)}^2 = \int_{\mathbb{S}^d} (v^\top x)^2\, d\tau_d(x) = v^\top \big( \int_{\mathbb{S}^d} x x^\top\, d\tau_d(x) \big) v = 1$. This implies that $\gamma_2(g) = \lambda_1^{-1}$.
Given the expression (from Appendix D.2) $\lambda_1 = \frac{d-1}{d}\, \frac{\alpha}{4\pi}\, \frac{\Gamma(\alpha/2)\,\Gamma(d/2+1)}{\Gamma(\alpha/2+d/2+1)}$ for $\alpha \geq 1$ and
$\lambda_1 = \frac{d-1}{2d\pi}$.
which are related by $w = a + A^{1/2} u - b - B^{1/2} v$. We first review classical methods for
optimization of quadratic functions over the $\ell_2$-unit ball.
Minimizing convex quadratic forms over the sphere. We consider the following
convex optimization problem, with $Q \succcurlyeq 0$; we have by Lagrangian duality:
\begin{align*}
\min_{\|x\|_2 \leq 1}\ & \tfrac{1}{2} x^\top Q x - q^\top x \\
= \max_{\lambda \geq 0}\ \min_{x \in \mathbb{R}^d}\ & \tfrac{1}{2} x^\top Q x - q^\top x + \tfrac{\lambda}{2}\,(\|x\|_2^2 - 1) \\
= \max_{\lambda \geq 0}\ & -\tfrac{1}{2}\, q^\top (Q + \lambda I)^{-1} q - \tfrac{\lambda}{2}, \quad \text{with } x = (Q + \lambda I)^{-1} q.
\end{align*}
If $\|Q^{-1} q\|_2 \leq 1$, then $\lambda = 0$ and $x = Q^{-1} q$. Otherwise, at the optimum, $\lambda > 0$ and
$\|x\|_2^2 = q^\top (Q + \lambda I)^{-2} q = 1$, which implies $1 \leq \frac{1}{\lambda + \lambda_{\min}(Q)}\, q^\top Q^{-1} q$, which leads to
$\lambda \leq q^\top Q^{-1} q - \lambda_{\min}(Q)$, which is important to reduce the interval of possible $\lambda$. The optimal $\lambda$
may then be obtained by binary search (from a single SVD of $Q$).
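A compact implementation of this dual binary search is sketched below (ours, not the paper's code; it uses a symmetric eigendecomposition, which plays the role of the single SVD mentioned above, and arbitrary tolerances):

```python
import numpy as np

def min_quadratic_over_ball(Q, q, tol=1e-10, max_iter=200):
    """Minimize 0.5 * x^T Q x - q^T x over ||x||_2 <= 1 for symmetric Q >= 0,
    by bisection on the multiplier lambda of the ball constraint."""
    x0 = np.linalg.pinv(Q) @ q                       # unconstrained minimizer (pinv if Q singular)
    if np.linalg.norm(x0) <= 1.0:
        return x0
    evals, V = np.linalg.eigh(Q)                     # single eigendecomposition of Q
    b = V.T @ q
    norm2 = lambda lam: np.sum((b / (evals + lam)) ** 2)    # ||(Q + lam I)^{-1} q||_2^2
    lo, hi = 0.0, max(q @ x0 - evals.min(), 0.0) + 1.0      # upper bound from the text, with slack
    while norm2(hi) > 1.0:                           # numerical safety: enlarge if needed
        hi *= 2.0
    for _ in range(max_iter):
        lam = 0.5 * (lo + hi)
        lo, hi = (lam, hi) if norm2(lam) > 1.0 else (lo, lam)
        if hi - lo < tol:
            break
    return V @ (b / (evals + 0.5 * (lo + hi)))
```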
Minimizing concave quadratic forms over the sphere. We consider the following
non-convex optimization problem, with $Q \succcurlyeq 0$, for which strong Lagrangian duality is known
to hold (Boyd and Vandenberghe, 2004):
\begin{align*}
\min_{\|x\|_2 \leq 1}\ -\tfrac{1}{2} x^\top Q x + q^\top x &= \min_{\|x\|_2 = 1}\ -\tfrac{1}{2} x^\top Q x + q^\top x \\
&= \max_{\lambda \geq 0}\ \min_{x \in \mathbb{R}^d}\ -\tfrac{1}{2} x^\top Q x + q^\top x + \tfrac{\lambda}{2}\,(\|x\|_2^2 - 1) \\
&= \max_{\lambda \geq \lambda_{\max}(Q)}\ -\tfrac{1}{2}\, q^\top (\lambda I - Q)^{-1} q - \tfrac{\lambda}{2}, \quad \text{with } x = (Q - \lambda I)^{-1} q.
\end{align*}
At the optimum, we have $q^\top (\lambda I - Q)^{-2} q = 1$, which implies $1 \leq \frac{1}{[\lambda - \lambda_{\max}(Q)]^2}\, \|q\|_2^2$, which
leads to $0 \leq \lambda - \lambda_{\max}(Q) \leq \|q\|_2$. We may perform binary search on $\lambda$ from a single SVD
of $Q$.
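The concave case admits the same kind of sketch (again ours; the so-called hard case of trust-region subproblems, where $q$ is orthogonal to the leading eigenvector, is deliberately ignored here):

```python
import numpy as np

def min_concave_quadratic_over_sphere(Q, q, tol=1e-10, max_iter=200):
    """Minimize -0.5 * x^T Q x + q^T x over ||x||_2 = 1 (Q symmetric >= 0),
    via binary search on lambda in [lambda_max(Q), lambda_max(Q) + ||q||_2],
    with x(lambda) = (Q - lambda I)^{-1} q and ||x(lambda)||_2 = 1."""
    evals, V = np.linalg.eigh(Q)
    b = V.T @ q
    lam_max = evals[-1]
    norm2 = lambda lam: np.sum((b / (evals - lam)) ** 2)   # ||x(lambda)||_2^2
    lo, hi = lam_max, lam_max + np.linalg.norm(q)
    # Hard case (q nearly orthogonal to the top eigenvector) is not handled in this sketch.
    for _ in range(max_iter):
        lam = 0.5 * (lo + hi)
        lo, hi = (lam, hi) if norm2(lam) > 1.0 else (lo, lam)
        if hi - lo < tol:
            break
    lam = 0.5 * (lo + hi)
    return V @ (b / (evals - lam))
```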
Computing the Hausdorff distance. We need to compute:
$$\max_{\|u\|_2 \leq 1}\ \min_{\|v\|_2 \leq 1}\ \tfrac{1}{2}\, \|a + A^{1/2} u - b - B^{1/2} v\|_2^2$$
\begin{align*}
&= \max_{\|u\|_2 \leq 1}\ \max_{\lambda \geq 0}\ \min_{v \in \mathbb{R}^d}\ \tfrac{1}{2}\|a + A^{1/2} u - b - B^{1/2} v\|_2^2 + \tfrac{\lambda}{2}\,(\|v\|_2^2 - 1) \\
&= \max_{\|u\|_2 \leq 1}\ \max_{\lambda \geq 0}\ -\tfrac{\lambda}{2} + \tfrac{1}{2}\|a - b + A^{1/2} u\|_2^2 - \tfrac{1}{2}(a - b + A^{1/2} u)^\top B (B + \lambda I)^{-1} (a - b + A^{1/2} u) \\
&= \max_{\|u\|_2 \leq 1}\ \max_{\lambda \geq 0}\ -\tfrac{\lambda}{2} + \tfrac{\lambda}{2}\,(a - b + A^{1/2} u)^\top (B + \lambda I)^{-1} (a - b + A^{1/2} u),
\end{align*}
with $v = (B + \lambda I)^{-1} B^{1/2} (a - b + A^{1/2} u)$. The interval in $\lambda$ which is sufficient to explore is
bounded as in the previous paragraphs. Dualizing the constraint on $u$ with a multiplier $\mu \geq 0$, we consider
\begin{align*}
&\min_{\mu \geq 0}\ \max_{u \in \mathbb{R}^d}\ \tfrac{\lambda}{2}\,(a - b + A^{1/2} u)^\top (B + \lambda I)^{-1} (a - b + A^{1/2} u) - \tfrac{\mu}{2}\,(\|u\|_2^2 - 1) - \tfrac{\lambda}{2} \\
&= \min_{\mu \geq 0}\ \max_{u \in \mathbb{R}^d}\ \tfrac{\lambda}{2}\,(a - b)^\top (B + \lambda I)^{-1} (a - b) + \tfrac{\mu - \lambda}{2} + \lambda\, u^\top A^{1/2} (B + \lambda I)^{-1} (a - b) \\
&\qquad\qquad\qquad -\ \tfrac{1}{2}\, u^\top \big( \mu I - \lambda A^{1/2} (B + \lambda I)^{-1} A^{1/2} \big) u \\
&= \min_{\mu \geq 0}\ \tfrac{\lambda}{2}\,(a - b)^\top (B + \lambda I)^{-1} (a - b) + \tfrac{\mu - \lambda}{2} \\
&\qquad\quad +\ \tfrac{\lambda^2}{2}\,(a - b)^\top (B + \lambda I)^{-1} A^{1/2} \big( \mu I - \lambda A^{1/2} (B + \lambda I)^{-1} A^{1/2} \big)^{-1} A^{1/2} (B + \lambda I)^{-1} (a - b).
\end{align*}
We have $u = \big( \tfrac{\mu}{\lambda} I - A^{1/2} (B + \lambda I)^{-1} A^{1/2} \big)^{-1} A^{1/2} (B + \lambda I)^{-1} (a - b)$, leading to
$w \propto (\lambda^{-1} B - \mu^{-1} A + I)(a - b)$. We need $\tfrac{\mu}{\lambda} > \lambda_{\max}\big(A^{1/2} (B + \lambda I)^{-1} A^{1/2}\big)$. Moreover
$$0 \leq \tfrac{\mu}{\lambda} - \lambda_{\max}\big(A^{1/2} (B + \lambda I)^{-1} A^{1/2}\big) \leq \|A^{1/2} (B + \lambda I)^{-1} (a - b)\|_2.$$
This means that the $\ell_2$-Hausdorff distance may be computed by solving in $\lambda$ and $\mu$, by
exhaustive search with respect to $\lambda$ and by binary search (or Newton's method) for $\mu$. The
complexity of each iteration is that of a singular value decomposition, that is, $O(d^3)$. For
more details on optimization of quadratic functions on the unit sphere, see Forsythe and
Golub (1965).
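As a rough numerical companion (not the $\lambda/\mu$ procedure above: the outer maximization is replaced by naive sampling of the unit sphere and the inner convex problem is handled by a generic solver; matrices, seeds, and names are arbitrary), one can approximate the max-min quantity directly:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
d = 3
def random_psd(d):                                   # arbitrary test ellipsoid shapes
    M = rng.standard_normal((d, d))
    return M @ M.T
A, B = random_psd(d), random_psd(d)
a, b = rng.standard_normal(d), rng.standard_normal(d)
A_half, B_half = np.linalg.cholesky(A), np.linalg.cholesky(B)   # any matrix square root gives the same ellipsoids

def inner_min(u):
    """min over ||v||_2 <= 1 of 0.5 * ||a + A^{1/2} u - b - B^{1/2} v||_2^2 (convex)."""
    c = a + A_half @ u - b
    res = minimize(lambda v: 0.5 * np.sum((c - B_half @ v) ** 2), np.zeros(d),
                   constraints=[{"type": "ineq", "fun": lambda v: 1.0 - v @ v}])
    return res.fun

best = 0.0
for _ in range(500):                                 # crude outer maximization over ||u||_2 <= 1
    u = rng.standard_normal(d)
    u /= np.linalg.norm(u)
    best = max(best, inner_min(u))
print("sampled approximation of max_u min_v 0.5 * ||a + A^(1/2) u - b - B^(1/2) v||^2:", best)
```

Sampling the unit sphere (rather than the ball) for the outer step is justified because the inner minimum is a convex function of $u$, so its maximum over the ball is attained on the boundary.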
References
P.-A. Absil, R. Mahony, and R. Sepulchre. Optimization Algorithms on Matrix Manifolds.
Princeton University Press, 2009.
R. A. Adams and J. F. Fournier. Sobolev Spaces, volume 140. Academic Press, 2003.
K. Atkinson and W. Han. Spherical Harmonics and Approximations on the Unit Sphere:
an Introduction, volume 2044. Springer, 2012.
F. Bach. Consistency of the group Lasso and multiple kernel learning. Journal of Machine
Learning Research, 9:1179–1225, 2008a.
F. Bach. Exploring large feature spaces with hierarchical multiple kernel learning. In
Advances in Neural Information Processing Systems (NIPS), 2008b.
F. Bach. Duality between subgradient and conditional gradient methods. SIAM Journal
on Optimization, 25(1):115–129, 2015.
F. Bach. On the equivalence between kernel quadrature rules and random feature expan-
sions. Journal of Machine Learning Research, 18:1–38, 2017.
P. L. Bartlett and S. Mendelson. Rademacher and Gaussian complexities: Risk bounds and
structural results. Journal of Machine Learning Research, 3:463–482, 2003.
L. Bottou and O. Bousquet. The tradeoffs of large scale learning. In Advances in Neural
Information Processing Systems (NIPS), 2008.
P. Bühlmann and S. Van De Geer. Statistics for high-dimensional data: methods, theory
and applications. Springer, 2011.
M. Burger and A. Neubauer. Error bounds for approximation with neural networks. Journal
of Approximation Theory, 112(2):235–250, 2001.
Y. Cho and L. K. Saul. Kernel methods for deep learning. In Advances in Neural Information
Processing Systems (NIPS), 2009.
A. S. Dalalyan, A. Juditsky, and V. Spokoiny. A new algorithm for estimating the effective
dimension-reduction subspace. Journal of Machine Learning Research, 9:1647–1678, 2008.
M. Dudik, Z. Harchaoui, and J. Malick. Lifted coordinate descent for learning with trace-
norm regularization. In Proceedings of the International Conference on Artificial Intelli-
gence and Statistics (AISTATS), 2012.
J. C. Dunn and S. Harshbarger. Conditional gradient algorithms with open loop step size
rules. Journal of Mathematical Analysis and Applications, 62(2):432–444, 1978.
L. C. Evans and R. F. Gariepy. Measure Theory and Fine Properties of Functions, volume 5.
CRC Press, 1991.
M. Frank and P. Wolfe. An algorithm for quadratic programming. Naval Research Logistics
Quarterly, 3(1-2):95–110, 1956.
K.-C. Li. Sliced inverse regression for dimension reduction. Journal of the American Sta-
tistical Association, 86(414):316–327, 1991.
R. Livni, S. Shalev-Shwartz, and O. Shamir. On the computational efficiency of training
neural networks. In Advances in Neural Information Processing Systems (NIPS), 2014.
V. Nair and G. E. Hinton. Rectified linear units improve restricted Boltzmann machines.
In Proceedings of the International Conference on Machine Learning (ICML), 2010.
R. M. Neal. Bayesian Learning for Neural Networks. PhD thesis, University of Toronto,
1995.
A. Pinkus. Approximation theory of the MLP model in neural networks. Acta Numerica,
8:143–195, 1999.
A. Rahimi and B. Recht. Random features for large-scale kernel machines. In Advances in
Neural Information Processing Systems (NIPS), 2007.
F. Rosenblatt. The perceptron: a probabilistic model for information storage and organi-
zation in the brain. Psychological Review, 65(6):386, 1958.
R. Schneider. Zu einem Problem von Shephard über die Projektionen konvexer Körper.
Mathematische Zeitschrift, 101(1):71–82, 1967.
M. Yuan and Y. Lin. Model selection and estimation in regression with grouped variables.
Journal of the Royal Statistical Society: Series B (Statistical Methodology), 68(1):49–67,
2006.