

Deep Learning Meets Sparse Regularization:


A Signal Processing Perspective
Rahul Parhi, Member, IEEE, and Robert D. Nowak, Fellow, IEEE
arXiv:2301.09554v2 [stat.ML] 30 Jan 2023

Abstract

Deep learning has been wildly successful in practice and most state-of-the-art machine learning
methods are based on neural networks. Lacking, however, is a rigorous mathematical theory that
adequately explains the amazing performance of deep neural networks. In this article, we present
a relatively new mathematical framework that provides the beginning of a deeper understanding
of deep learning. This framework precisely characterizes the functional properties of neural
networks that are trained to fit to data. The key mathematical tools which support this framework
include transform-domain sparse regularization, the Radon transform of computed tomography, and
approximation theory, which are all techniques deeply rooted in signal processing. This framework
explains the effect of weight decay regularization in neural network training, the use of skip
connections and low-rank weight matrices in network architectures, the role of sparsity in neural
networks, and why neural networks can perform well in high-dimensional problems.

I. I NTRODUCTION

Deep learning (DL) has revolutionized engineering and the sciences in the modern data age. The
typical goal of DL is to predict an output y ∈ Y (e.g., a label or response) from an input x ∈ X
(e.g., a feature or example). A neural network (NN) is “trained” to fit to a set of data consisting
of the pairs {(x_n, y_n)}_{n=1}^{N}, by finding a set of NN parameters θ so that the NN mapping closely matches the data. The trained NN is a function, denoted by f_θ : X → Y, that can be used to
predict the output y ∈ Y of a new input x ∈ X . The success of deep learning has spawned a
burgeoning industry that is continually developing new applications, NN architectures, and training

Rahul Parhi is with the Biomedical Imaging Group, École Polytechnique Fédérale de Lausanne, Lausanne, Switzerland
(e-mail: [email protected]).
Robert D. Nowak is with the Department of Electrical and Computer Engineering, University of Wisconsin–Madison,
Madison, WI, USA (e-mail: [email protected]).

algorithms. While the broader research community has been focused on understanding why certain
architectural choices and training algorithms can improve performance, relatively little work has
been done to characterize the properties of the functions learned by NNs. In this article, we present
a mathematical framework which unifies a line of work from many authors over the last few years
that sheds light on the nature and behavior of trained NN functions.
The purpose of this article is to provide a gentle introduction to this new mathematical framework,
accessible to readers with a mathematical background in Signals and Systems and Applied Linear
Algebra. The framework is based on mathematical tools familiar to the signal processing community,
including transform-domain sparse regularization, the Radon transform of computed tomography,
and approximation theory. It is also related to well-known signal processing ideas such as wavelets,
splines, and compressed sensing. Finally, this framework provides a new take on the following
fundamental questions:

1) What is the effect of regularization in DL?


2) What kinds of functions do NNs learn?
3) What is the role of NN activation functions?
4) Why do NNs seemingly break the curse of dimensionality?

This article will answer these fundamental questions as well as discuss several follow-up research
directions based on the presented framework.

II. NEURAL NETWORKS AND LEARNING FROM DATA

The task of DL corresponds to learning the input-output mapping from a data set in a hierarchical
or multi-layer manner. Deep neural networks (DNNs) are complicated function mappings built
from many smaller, simpler building blocks. The simplest building block of a DNN is an (artificial)
neuron, inspired by the biological neurons of the brain [25]. A neuron is a function mapping
R^d → R of the form x ↦ σ(w^T x − b), where w ∈ R^d corresponds to the weights of the neuron and b ∈ R corresponds to the bias of the neuron. The function σ : R → R is referred to as the activation function of the neuron and controls the nonlinear response of the neuron. A neuron “activates” when the weighted combination of its input x exceeds a certain threshold, i.e., w^T x > b. Therefore,
typical activation functions such as the sigmoid, unit step function, or rectified linear unit (ReLU)
activate when their input exceeds 0 as seen in Fig. 1.
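A minimal NumPy sketch of a single neuron is given below; the weights, bias, and input are illustrative values, not taken from the article. It evaluates σ(w^T x − b) for the three activation functions of Fig. 1.

```python
import numpy as np

def relu(t):
    return np.maximum(0.0, t)

def unit_step(t):
    return (t > 0).astype(float)

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def neuron(x, w, b, activation=relu):
    """A single artificial neuron: x -> sigma(w^T x - b)."""
    return activation(np.dot(w, x) - b)

# Illustrative example: the neuron "activates" once w^T x exceeds the threshold b.
w = np.array([1.0, -2.0, 0.5])
b = 0.3
x = np.array([0.8, -0.1, 1.0])
print(neuron(x, w, b, relu), neuron(x, w, b, unit_step), neuron(x, w, b, sigmoid))
```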
A neuron is composed of a linear mapping followed by a nonlinearity. A popular form (or
“architecture”) of a DNN is a fully-connected feedforward DNN which is a cascade of alternating
[Figure 1 omitted: plots of the Sigmoid, Unit Step, and ReLU activation functions.]
Fig. 1. Typical activation functions found in neural networks.

linear mappings and component-wise nonlinearities. A feedforward DNN f_θ (parameterized by θ) can be represented as the function composition

f_θ(x) = A^{(L)} ∘ σ ∘ A^{(L−1)} ∘ ··· ∘ σ ∘ A^{(1)}(x),    (1)

where, for each ℓ = 1, . . . , L, the function A^{(ℓ)}(z) = W^{(ℓ)} z − b^{(ℓ)} is an affine linear mapping with weight matrix W^{(ℓ)} and bias vector b^{(ℓ)}. Each σ that appears in the composition applies the activation function σ : R → R component-wise to its vector input. The parameters of this DNN are the weights and biases, i.e., θ = {(W^{(ℓ)}, b^{(ℓ)})}_{ℓ=1}^{L}. Each mapping A^{(ℓ)} corresponds to a layer of the DNN and the number of mappings L is the depth of the DNN. The dimensions of the weight matrices W^{(ℓ)} correspond to the number of neurons in each layer (i.e., the width of the layer). Alternative DNN architectures can be built with other simple building blocks, e.g., convolutions and pooling/downsampling operations, which correspond to deep convolutional neural networks (DCNNs) [17]. DNN architectures are often depicted with diagrams as in Fig. 2.
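As a concrete illustration of the composition in (1), here is a minimal NumPy sketch of the forward pass of a fully-connected feedforward DNN; the layer widths and random parameters are arbitrary choices for the example.

```python
import numpy as np

def relu(t):
    return np.maximum(0.0, t)

def forward(x, params, activation=relu):
    """Evaluate f_theta(x) = A^(L) o sigma o A^(L-1) o ... o sigma o A^(1)(x),
    where params is a list of (W, b) pairs, one per layer."""
    z = x
    for ell, (W, b) in enumerate(params, start=1):
        z = W @ z - b                 # affine map A^(ell)(z) = W^(ell) z - b^(ell)
        if ell < len(params):         # no nonlinearity after the final layer
            z = activation(z)
    return z

# Example: depth L = 3 with widths 5 -> 8 -> 8 -> 2 (arbitrary, for illustration).
rng = np.random.default_rng(0)
widths = [5, 8, 8, 2]
params = [(rng.standard_normal((widths[i + 1], widths[i])),
           rng.standard_normal(widths[i + 1])) for i in range(len(widths) - 1)]
print(forward(rng.standard_normal(5), params))
```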
Given a DNN fθ parameterized by θ ∈ Θ (of any architecture), the task of learning from the
data {(x_n, y_n)}_{n=1}^{N} is formulated as the optimization problem

min_{θ∈Θ} Σ_{n=1}^{N} L(y_n, f_θ(x_n)),    (2)

where L(·, ·) is a loss function (squared error, logistic, hinge loss, etc.). For example, the squared error loss is given by L(y, z) = (y − z)^2. A DNN is trained by solving this optimization problem,
usually via some form of gradient descent. In typical scenarios, this optimization problem is ill-posed
and so the problem is regularized either explicitly through the addition of a regularization term
and/or implicitly by constraints on the network architecture and the behavior of gradient descent
[Figure 2 omitted: (a) a feedforward DNN; (b) a single neuron with inputs z_1, . . . , z_d, weights w, and outputs Z_1, . . . , Z_D.]
Fig. 2. Example depiction of a deep neural network and its components: (a) a feedforward DNN architecture where the nodes represent the neurons and the edges represent the weights; (b) a single neuron from the DNN in (a) mapping an input z ∈ R^d to an output Z ∈ R^D via Z = vσ(w^T z).

procedures [34]. Explicit regularization corresponds to solving an optimization problem of the form
min_{θ∈Θ} Σ_{n=1}^{N} L(y_n, f_θ(x_n)) + λ C(θ),    (3)

where C(θ) ≥ 0 for all θ ∈ Θ. C(θ) is a regularizer which measures the “size” (or “capacity”) of the DNN parameterized by θ ∈ Θ, and λ > 0 is an adjustable hyperparameter which controls a trade-off between the data-fitting term and the regularizer. DNNs are often trained with weight decay, which corresponds to solving the optimization problem

min_{θ∈Θ} Σ_{n=1}^{N} L(y_n, f_θ(x_n)) + λ ‖θ‖_2^2,    (4)

where the regularizer C(θ) = ‖θ‖_2^2 is the squared Euclidean norm of all the network parameters. Sometimes the weight decay objective only regularizes the network weights and not the network biases. The primary focus of this article is explicit regularization with the weight decay objective.
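A hedged PyTorch sketch of the weight decay objective (4) is given below: an explicit λ‖θ‖_2^2 penalty is added to the squared-error data-fitting loss. The architecture, data, and hyperparameters are placeholders for illustration only.

```python
import torch
import torch.nn as nn

# Toy data and a small fully-connected ReLU network (placeholders, not from the article).
X = torch.randn(128, 10)
y = torch.randn(128, 1)
model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))

lam = 1e-3
opt = torch.optim.SGD(model.parameters(), lr=1e-2)
# Note: torch.optim optimizers also expose a weight_decay argument that folds a
# similar squared-norm penalty directly into the parameter update.

for step in range(200):
    opt.zero_grad()
    data_fit = ((model(X) - y) ** 2).sum()                      # squared-error loss, Eq. (2)
    penalty = sum((p ** 2).sum() for p in model.parameters())   # ||theta||_2^2 over all parameters
    loss = data_fit + lam * penalty                             # weight decay objective, Eq. (4)
    loss.backward()
    opt.step()
```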

III. WHAT IS THE EFFECT OF REGULARIZATION IN DEEP LEARNING?

Weight decay is a common form of regularization for DNNs. On the surface, it appears to simply
be the familiar Tikhonov or “ridge” regularization. In standard linear models, it is well-known
that this sort of regularization tends to reduce the size of the weights, but does not produce
sparse weights. However, when this regularization is used in conjunction with NNs, the results are strikingly different. Regularizing the sum of squared weights turns out to be equivalent to a type of ℓ^1-norm regularization on the network weights, leading to sparse
solutions in which the weights of many neurons are zero [48]. This is due to the key property
that the most commonly used activation functions in DNNs are homogeneous. A function σ(x) is
said to be homogeneous (of degree 1) if σ(γx) = γσ(x) for any γ > 0. The most common NN
activation function, the ReLU, is homogeneous, as well as the leaky ReLU, the linear activation,
and pooling/downsampling units. This homogeneity leads to the following theorem, referred to as
the neural balance theorem (NBT).

Neural Balance Theorem ([48, Theorem 1]): Let fθ be a DNN of any architecture
parameterized by θ ∈ Θ which solves the DNN training problem with weight decay in
(4). Then, the weights satisfy the following balance constraint: if w and v denote the input
and output weights of any homogeneous unit in the DNN, then ‖w‖_2 = ‖v‖_2.

The proof of this theorem boils down to the simple observation that for any homogeneous unit
with input weights w and output weights v , we can scale the input weight by γ > 0 and the output
weight by 1/γ without changing the function mapping. For example, consider the single neuron
z ↦ vσ(w^T z) with homogeneous activation function σ as depicted in Fig. 2(b). It is immediate that (v/γ)σ((γw)^T z) = vσ(w^T z). The theorem then follows by noticing that min_{γ>0} ‖γw‖_2^2 + ‖v/γ‖_2^2 occurs when γ = √(‖v‖_2/‖w‖_2), which implies that the minimum squared Euclidean-norm solution must satisfy the property that the input and output weights w and v are balanced.
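A small numerical check of this observation (the weights below are arbitrary illustrative values): rescaling (w, v) ↦ (γw, v/γ) leaves the neuron unchanged, and the squared Euclidean norm is minimized at γ = √(‖v‖_2/‖w‖_2), where the two norms balance.

```python
import numpy as np

rng = np.random.default_rng(0)
w, v = rng.standard_normal(5), rng.standard_normal(3)   # input/output weights of one neuron

def relu(t):
    return np.maximum(0.0, t)

z = rng.standard_normal(5)
gamma = 2.7  # any gamma > 0 leaves the mapping unchanged (homogeneity of the ReLU)
print(np.allclose(v * relu(w @ z), (v / gamma) * relu((gamma * w) @ z)))  # True

# The cost gamma^2 ||w||^2 + ||v||^2 / gamma^2 is minimized where the norms balance.
gammas = np.linspace(0.1, 5, 100_000)
cost = gammas**2 * np.sum(w**2) + np.sum(v**2) / gammas**2
g_star = gammas[np.argmin(cost)]
print(g_star, np.sqrt(np.linalg.norm(v) / np.linalg.norm(w)))  # numerically close
print(np.linalg.norm(g_star * w), np.linalg.norm(v / g_star))  # balanced norms
```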

A. The Secret Sparsity of Weight Decay

The balancing effect of the NBT has a striking effect on solutions to the weight decay objective. In particular, it induces a sparsity-promoting effect akin to least absolute shrinkage and selection operator (LASSO) regularization [42]. As an illustrative example, consider a shallow (L = 2), feedforward NN mapping R^d → R^D with a homogeneous activation function (e.g., the ReLU) and K neurons. In this case, the NN is given by

f_θ(x) = Σ_{k=1}^{K} v_k σ(w_k^T x − b_k).    (5)

Here, the weight decay regularizer takes the form (1/2) Σ_{k=1}^{K} (‖v_k‖_2^2 + ‖w_k‖_2^2), where w_k and v_k are the input and output weights of the k-th neuron, respectively. By the NBT, this is equivalent to using the regularizer Σ_{k=1}^{K} ‖v_k‖_2 ‖w_k‖_2. Due to the homogeneity of the activation function, we can assume, without loss of generality, that ‖w_k‖_2 = 1 by “absorbing” the magnitude of the input weight w_k into the output weight v_k. Therefore, by constraining the input weights to be unit-norm, the training problem can then be reformulated with the regularizer Σ_{k=1}^{K} ‖v_k‖_2. Remarkably, this is exactly the well-known group LASSO regularizer [49], which favors solutions with few active neurons (i.e., solutions typically have many v_k exactly equal to 0).
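A minimal numeric illustration of the rescaling step (the weights are arbitrary): absorbing ‖w_k‖_2 into v_k leaves the network unchanged and turns the balanced regularizer Σ_k ‖v_k‖_2‖w_k‖_2 into the group LASSO penalty Σ_k ‖v_k‖_2 with unit-norm input weights.

```python
import numpy as np

rng = np.random.default_rng(1)
d, D, K = 4, 3, 5
W = rng.standard_normal((K, d))   # input weights w_k (rows)
V = rng.standard_normal((K, D))   # output weights v_k (rows)
b = rng.standard_normal(K)

def net(x, W, V, b):
    # shallow network of Eq. (5): sum_k v_k * relu(w_k^T x - b_k)
    return np.maximum(0.0, W @ x - b) @ V

# Absorb the input-weight magnitudes into the output weights (and rescale the biases).
scales = np.linalg.norm(W, axis=1, keepdims=True)
W_unit, V_abs, b_abs = W / scales, V * scales, b / scales[:, 0]

x = rng.standard_normal(d)
print(np.allclose(net(x, W, V, b), net(x, W_unit, V_abs, b_abs)))       # same function
print(np.sum(np.linalg.norm(V, axis=1) * np.linalg.norm(W, axis=1)),    # sum_k ||v_k|| ||w_k||
      np.sum(np.linalg.norm(V_abs, axis=1)))                            # group LASSO penalty, equal
```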
More generally, consider the feedforward deep NN architecture in (1) with a homogeneous
activation function and consider training the DNN with weight decay only on the network weights.
An application of the NBT shows that the weight decay problem is equivalent to the regularized DNN training problem with the regularizer

C(θ) = (1/2) Σ_{k=1}^{K^{(1)}} ‖w_k^{(1)}‖_2^2 + (1/2) Σ_{k=1}^{K^{(L)}} ‖v_k^{(L)}‖_2^2 + Σ_{ℓ=1}^{L} Σ_{k=1}^{K^{(ℓ)}} ‖w_k^{(ℓ)}‖_2 ‖v_k^{(ℓ)}‖_2,    (6)

where K^{(ℓ)} denotes the number of neurons in layer ℓ, w_k^{(ℓ)} denotes the input weights to the k-th neuron in layer ℓ, and v_k^{(ℓ)} denotes the output weights of the k-th neuron in layer ℓ (see [48, Equation (2)]). Solutions based on this regularizer will also be sparse because the 2-norms that appear in the last term in (6) are not squared, akin to the group LASSO regularizer. Moreover, increasing the regularization parameter λ increases the number of weights that are zero in the solution. Therefore, training the DNN with weight decay favors sparse solutions, where sparsity is quantified via the number of active neurons. An early version of this result appeared in 1998 [18], although it did not become well-known until it was rediscovered in 2015 [26].

IV. WHAT KINDS OF FUNCTIONS DO NEURAL NETWORKS LEARN?

The sparsity-promoting effect of weight decay arising from the NBT in network architectures
with homogeneous activation functions has several consequences on the properties of trained NNs.
In this section, we will focus on the popular ReLU activation function, ρ(t) = max{0, t}. The imposed regularization not only promotes sparsity in the sense of the number of active neurons, but also promotes a kind of transform-domain sparsity. This transform-domain sparsity suggests the
inclusion of skip connections and low-rank weight matrices in network architectures.

A. Shallow Neural Networks

In the univariate case, a shallow, feedforward ReLU NN with K neurons is realized by the
mapping
f_θ(x) = Σ_{k=1}^{K} v_k ρ(w_k x − b_k).    (7)

Training this NN with weight decay corresponds to solving the optimization problem

min_{θ∈Θ} Σ_{n=1}^{N} L(y_n, f_θ(x_n)) + (λ/2) Σ_{k=1}^{K} (|v_k|^2 + |w_k|^2).    (8)

From Section III-A, we saw that the NBT implies that this problem is equivalent to

min_{θ∈Θ} Σ_{n=1}^{N} L(y_n, f_θ(x_n)) + λ Σ_{k=1}^{K} |v_k||w_k|.    (9)

As illustrated in Insert IN1, we see that (9) is actually regularizing the integral of the second derivative of the NN, which can be viewed as a measure of sparsity in the second derivative domain. The integral in (15) must be understood in the distributional sense since the Dirac impulse is not a function, but a generalized function or distribution. To make this precise, let g_ε(x) = e^{−x^2/2ε}/√(2πε) denote the Gaussian density with variance ε > 0. As is well-known in signal processing, g_ε converges to the Dirac impulse as ε → 0. Using this idea, given a distribution f, define the norm

‖f‖_M := sup_{ε>0} ‖f ∗ g_ε‖_{L^1} = sup_{ε>0} ∫_{−∞}^{∞} | ∫_{−∞}^{∞} f(x) g_ε(y − x) dx | dy.    (16)

This definition provides an explicit construction, via the convolution with a Gaussian, of a sequence of smooth functions that converge to f, where the supremum acts as the limit. For example, if f(x) = g(x) + Σ_{k=1}^{K} v_k δ(x − t_k), where g is an absolutely integrable function, then ‖f‖_M = ‖g‖_{L^1} + Σ_{k=1}^{K} |v_k| = ‖g‖_{L^1} + ‖v‖_1, with ‖v‖_1 = Σ_{k=1}^{K} |v_k|. It is in this sense that (15) holds, i.e., ‖D^2 f_θ‖_M = Σ_{k=1}^{K} |v_k||w_k|. In particular, the M-norm is precisely the continuous-domain analogue of the sparsity-promoting discrete ℓ^1-norm. Therefore, we see that training a NN with weight decay as in (8) prefers solutions with sparse second derivatives.
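A quick numerical sanity check of this identity with random illustrative weights: finite differences on a fine grid show that the total variation of the slope of the CPwL network equals Σ_k |v_k||w_k|.

```python
import numpy as np

rng = np.random.default_rng(0)
K = 6
v = rng.standard_normal(K)                      # output weights v_k
w = rng.standard_normal(K)                      # input weights w_k
b = rng.uniform(-1, 1, K)                       # biases b_k

def f(x):
    # univariate shallow ReLU network of Eq. (7)
    return np.sum(v[:, None] * np.maximum(0.0, w[:, None] * x[None, :] - b[:, None]), axis=0)

knots = b / w                                   # the knot locations b_k / w_k
x = np.linspace(knots.min() - 1, knots.max() + 1, 400_001)   # fine grid covering all knots
slope = np.diff(f(x)) / np.diff(x)              # D f_theta: piecewise constant
second_order_tv = np.sum(np.abs(np.diff(slope)))  # ||D^2 f_theta||_M = TV(D f_theta)

print(second_order_tv, np.sum(np.abs(v) * np.abs(w)))   # the two quantities agree, Eq. (15)
```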
It turns out that the connection between sparsity in the second derivative domain and NNs is even tighter. Let BV^2(R) denote the space of functions mapping R → R such that ‖D^2 f‖_M is finite. This is the space of functions of second-order bounded variation and the quantity ‖D^2 f‖_M
[IN1] ReLU Sparsity in the Second Derivative Domain

Given a ReLU neuron r(x) = ρ(wx − b), its first derivative, D r(x), is

D r(x) = D ρ(wx − b) = w u(wx − b),    (10)

where u is the unit step function (Fig. 1(b)). Therefore, its second derivative, D^2 r(x), is

D^2 r(x) = D[w u(wx − b)] = w^2 δ(wx − b).    (11)

By the scaling property of the Dirac impulse

δ(γx) = δ(x)/|γ|,    (12)

we have

D^2 r(x) = (w^2/|w|) δ(x − b/w) = |w| δ(x − b/w).    (13)

The second derivative of the NN (7) is then

D^2 f_θ(x) = Σ_{k=1}^{K} v_k |w_k| δ(x − b_k/w_k).    (14)

Therefore,

∫_{−∞}^{∞} |D^2 f_θ(x)| dx = Σ_{k=1}^{K} |v_k||w_k|.    (15)

[Figure 3 omitted: plots of f_θ(x), D f_θ(x), and D^2 f_θ(x); the second derivative is a train of impulses of weight v_k|w_k| located at the knots b_k/w_k.]
Fig. 3. Illustration of the sparsity in the second derivative domain of a univariate, shallow feedforward NN with 6 neurons.

is the second-order total variation¹ of f. It is well-known [15], [24], [45] that functions that fit data and have minimal second-order TV are linear splines, which are continuous piecewise linear (CPwL) functions. Since the ReLU is a CPwL function, ReLU NNs are CPwL functions [3]. In fact, under mild assumptions on the loss function, the solution set to the optimization problem

min_{f ∈ BV^2(R)} Σ_{n=1}^{N} L(y_n, f(x_n)) + λ ‖D^2 f‖_M    (17)

is completely characterized by NNs of the form

f_θ(x) = Σ_{k=1}^{K} v_k ρ(w_k x − b_k) + c_1 x + c_0,    (18)

where the number of neurons is strictly less than the number of data (K < N) [38], [29]. In neural network parlance, the c_1 x + c_0 term is a skip connection [20]. This term is an affine function that naturally arises since the second-order TV of an affine function is zero, and so the regularizer places no penalty on including this term.

¹The classical notion of total variation, often used in signal denoising problems [36], is TV(f) = ‖D f‖_M, and so the second-order total variation of f can be viewed as the total variation of the derivative of f: ‖D^2 f‖_M = TV(D f).
The intuition behind this result is due to the fact that the second derivative of a CPwL function is
an impulse train and therefore extremely sparse in the second derivative domain. This is illustrated
in Fig. 3. Therefore, the optimization (17) will favor sparse CPwL functions which always admit a
representation as in (18). In signal processing parlance, “signals” that are sparse in some transform
domain are said to have a finite rate of innovation [46]. Here, the transform involved is the second
derivative operator.
From the derivation in Insert IN1 combined with the equivalence of (8) and (9), we see that
training a sufficiently wide (K ≥ N ) NN with a skip connection (18) and weight decay (8) results
in a solution to the optimization problem (17) over the function space BV2 (R). While this result
may seem obvious in hindsight, it is remarkable since it says that the functions that NNs (as in (18)) trained with weight decay learn are exactly optimal functions in BV^2(R).
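The following PyTorch sketch mimics this setup on synthetic 1D data: a wide ReLU network with a skip connection is trained with weight decay on the hidden-layer and output weights. The data, widths, learning rate, and λ are illustrative choices, not from the article; the fitted function is CPwL by construction, and the penalty tends to shrink many products |v_k||w_k| toward zero, in line with (9).

```python
import torch

torch.manual_seed(0)
# Synthetic 1D data (illustrative).
N = 50
x = torch.linspace(-1, 1, N).unsqueeze(1)
y = torch.sin(3 * x) + 0.05 * torch.randn(N, 1)

K, lam = 200, 1e-3                        # wide network (K >= N) and weight decay strength
w = torch.randn(K, 1, requires_grad=True)
b = torch.randn(K, requires_grad=True)
v = torch.zeros(K, requires_grad=True)
c = torch.zeros(2, requires_grad=True)    # skip connection c_1 x + c_0

def f(x):
    # NN with skip connection as in Eq. (18)
    return torch.relu(x @ w.T - b) @ v.unsqueeze(1) + c[0] * x + c[1]

opt = torch.optim.Adam([w, b, v, c], lr=1e-2)
for step in range(5000):
    opt.zero_grad()
    loss = ((f(x) - y) ** 2).sum() + (lam / 2) * (v.pow(2).sum() + w.pow(2).sum())  # Eq. (8)
    loss.backward()
    opt.step()

# Report how many neurons contribute non-negligibly (products |v_k||w_k| above a small threshold).
products = v.detach().abs() * w.detach().abs().squeeze()
print((products > 1e-3).sum().item(), "of", K, "neurons above threshold")
```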
In the multivariate case, a shallow feedforward NN has the form
f_θ(x) = Σ_{k=1}^{K} v_k ρ(w_k^T x − b_k).    (19)

The key property connecting the univariate case and BV^2(R) was that ReLU neurons are sparsified
by the second derivative operator as in (13). A similar analysis can be carried out in the multivariate
case by finding an operator that is the sparsifying transform of the multivariate ReLU neuron
r(x) = ρ(wT x − b). The sparsifying transform was proposed in 2020 in the seminal work of Ongie
et al. [27], and hinges on the Radon transform that arises in tomographic imaging. Connections
between the Radon transform and neurons have been known since at least the 1990s, gaining
popularity with the proposal of ridgelets [6]. Early versions of the sparsifying transform for neurons were studied as early as 1997 [21].

The Radon transform, first studied by Radon in 1917 [35], of a function mapping Rd → R is
specified by the integral transform
R{f}(α, t) = ∫_{R^d} f(x) δ(α^T x − t) dx,    (20)

where δ is the univariate Dirac impulse, α ∈ S^{d−1} = {u ∈ R^d : ‖u‖_2 = 1} is a unit vector on the sphere in R^d, and t ∈ R is a scalar. The Radon transform of f at (α, t) is the integral of f along the hyperplane given by {x ∈ R^d : α^T x = t}. When d = 2, this is an integral along a line, as in the case of X-ray CT. The Radon transform is tightly linked with the Fourier transform via the Fourier slice theorem, which states that \widehat{R{f}}(α, ω) = f̂(ωα), where the Fourier transform on the left-hand side is the univariate Fourier transform with respect to the t-variable and the Fourier transform on the right-hand side is the d-variate Fourier transform of f. The Fourier transform is given by f̂(ω) = ∫_{R^d} f(x) e^{−jω^T x} dx, with j^2 = −1. The Fourier slice theorem shows that the univariate Fourier transform in the Radon domain corresponds to a slice of the Fourier transform in the spatial domain. The sparsifying transform for multivariate ReLU neurons is based on the (filtered) Radon transform, which is summarized in Insert IN2.
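A small NumPy demonstration of the Fourier slice theorem in the discrete setting (restricted to an axis-aligned direction so that no interpolation is needed): summing a 2D array along one axis and taking the 1D DFT of the projection reproduces the corresponding central slice of the 2D DFT. The array here is random, purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
f = rng.standard_normal((64, 64))          # a discrete "image" f(y, x)

# Projection along y (integration over vertical lines): p(x) = sum_y f(y, x).
p = f.sum(axis=0)

# Fourier slice theorem (discrete form): the 1D DFT of the projection equals the
# omega_y = 0 slice of the 2D DFT of f.
slice_from_2d = np.fft.fft2(f)[0, :]
print(np.allclose(np.fft.fft(p), slice_from_2d))   # True
```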
The filter K in (24) is exactly the backprojection filter that arises in the filtered backprojection
algorithm in tomographic image reconstruction and acts as a high-pass filter (or ramp filter) to
correct the attenuation of high frequencies from the Radon transform. The intuition behind this
is that the Radon transform integrates a function along hyperplanes. In the univariate case, the
magnitude of the frequency response of an integrator behaves as 1/|ω| and therefore attenuates high
frequencies. The magnitude of the frequency response of integration along a hyperplane, therefore,
behaves as 1/|ω|^{d−1}, since hyperplanes are of dimension (d − 1). Note that the even projector that
appears in (28) is due to the fact that (α, t) and (−α, −t) parameterize the same hyperplane.
From the derivation in Insert IN2, we immediately see that the sparsifying transform of the
multivariate ReLU neuron r(x) = ρ(w^T x − b) with (w, b) ∈ S^{d−1} × R is the operator D_t^2 K R, where D_t^2 = ∂^2/∂t^2 denotes the second-order partial derivative with respect to t. We have

D_t^2 K R{r}(α, t) = δ_R((α, t) − (w, b)),    (29)

where δ_R(z = (α, t)) = P_even{δ}(z) = (δ(z) + δ(−z))/2 is the even Dirac impulse that arises due to the even symmetry of the Radon domain. From the homogeneity of the ReLU activation, applying this sparsifying transform to the (unconstrained) neuron r(x) = ρ(w^T x − b) with (w, b) ∈ R^d × R yields

D_t^2 K R{r}(α, t) = ‖w‖_2 δ_R((α, t) − (w̃, b̃)),    (30)

[IN2] Filtered Radon Transform of a Neuron with Unit-Norm Input Weights

First consider the neuron r(x) = σ(w^T x) with w = e_1 = (1, 0, . . . , 0) (the first canonical unit vector). In this case, r(x) = σ(x_1) and so r = σ ⊗ 1 ⊗ ··· ⊗ 1. The Fourier transform of this tensor product is given by the product of the univariate Fourier transforms of each term in the tensor product

r̂(ω) = σ̂(ω_1) Π_{k=2}^{d} 2π δ(ω_k).    (21)

By the Fourier slice theorem,

\widehat{R{r}}(α, ω) = σ̂(ωα_1) Π_{k=2}^{d} 2π δ(ωα_k).    (22)

By the scaling property of the Dirac impulse (12), the above quantity equals

((2π)^{d−1}/|ω|^{d−1}) σ̂(ωα_1) δ(α_2, . . . , α_d).    (23)

If we define the filter K via the frequency response

\widehat{K f}(ω) = (|ω|^{d−1}/(2(2π)^{d−1})) f̂(ω),    (24)

we find

\widehat{K R{r}}(α, ω) = (σ̂(ωα_1)/2) δ(α_2, . . . , α_d).    (25)

Taking the inverse Fourier transform,

K R{r}(α, t)
  = (1/(2|α_1|)) σ(t/α_1) δ(α_2, . . . , α_d)
  = ((σ(t)δ(α_1 − 1) + σ(−t)δ(α_1 + 1))/2) δ(α_2, . . . , α_d)
  = (σ(t)δ(α − e_1) + σ(−t)δ(α + e_1))/2
  =: P_even{σ(t)δ(α − e_1)},    (26)

where P_even is the projector which extracts the even part of its input. The second line holds by the dilation property of the Fourier transform:

(1/|γ|) f(x/γ)  ⟷  f̂(γω).    (27)

Since α ∈ S^{d−1}, the third line holds by observing that when α_1 = ±1, we have α_2, . . . , α_d = 0, so the second line is σ(±t)/2 multiplied by an impulse, and when α_1 ≠ ±1, the second line is 0, which is exactly the third line. By the rotation properties of the Fourier transform, we have the following result for the neuron r(x) = σ(w^T x):

K R{r}(α, t) = P_even{σ(t)δ(α − w)},    (28)

where w ∈ S^{d−1}.

where w̃ = w/‖w‖_2 and b̃ = b/‖w‖_2. This is analogous to how D^2 is the sparsifying transform for univariate neurons as in (13). The sparsifying operator is simply the second derivative in the Radon domain. The key idea is that the (filtered) Radon transform allows us to extract the (univariate) activation function from the multivariate neuron and apply the univariate sparsifying transform in the t variable. Figure 4 is a cartoon diagram which depicts the sparsifying transform
of a ReLU neuron.
The story is now analogous to the univariate case. Indeed, by the NBT, training the NN in (19)
[Figure 4 omitted: (a) surface plot of r(x); (b) filtered Radon transform when α = w; (c) filtered Radon transform when α ≠ w; (d) sparsifying transform D_t^2 K R{r}.]
Fig. 4. Cartoon diagram illustrating the sparsifying transform of the ReLU neuron r(x) = ρ(w^T x − b) with (w, b) ∈ S^{d−1} × R. The heatmap is a top-down view of the ReLU neuron depicted in (a).

with weight decay is equivalent to solving the optimization problem


min_{θ∈Θ} Σ_{n=1}^{N} L(y_n, f_θ(x_n)) + λ Σ_{k=1}^{K} |v_k| ‖w_k‖_2.    (31)

From (30) we see that ‖D_t^2 K R{f_θ}‖_M = Σ_{k=1}^{K} |v_k| ‖w_k‖_2, and so training the NN (19) with weight decay prefers solutions with sparse second derivatives in the filtered Radon domain. This measure of sparsity can be viewed as the second-order total variation in the (filtered) Radon domain. Let R BV^2(R^d) denote the space of functions on R^d of second-order bounded variation in the (filtered) Radon domain (i.e., the second-order total variation in the (filtered) Radon domain is finite). Under mild assumptions on the loss function, the solution set to the optimization problem

min_{f ∈ R BV^2(R^d)} Σ_{n=1}^{N} L(y_n, f(x_n)) + λ ‖D_t^2 K R f‖_M    (32)

is characterized by NNs of the form

f_θ(x) = Σ_{k=1}^{K} v_k ρ(w_k^T x − b_k) + c^T x + c_0,    (33)

where the number of neurons is strictly less than the number of data (K < N) [30], [28], [44], [5]. Common loss functions such as the squared error, logistic, and soft-max satisfy the mild assumptions. The skip connection c^T x + c_0 arises because, as in the univariate case, the second-order total variation of an affine function is zero. Therefore, NNs (as in (33)) trained with weight decay are exactly optimal functions in R BV^2(R^d).

B. Deep Neural Networks

The machinery is straightforward to extend to the case of deep neural networks (DNNs). The
key idea is to consider fitting data using compositions of R BV2 -functions. It is shown in [31] that
under mild assumptions on the loss function, a solution to the optimization problem
min_{f^{(1)}, . . . , f^{(L)}} Σ_{n=1}^{N} L(y_n, f^{(L)} ∘ ··· ∘ f^{(1)}(x_n)) + λ Σ_{ℓ=1}^{L} Σ_{i=1}^{d_ℓ} ‖D_t^2 K R f_i^{(ℓ)}‖_M    (34)

has the form of a DNN as in (1) that satisfies the following properties:

• The number of layers is L + 1;


• The solution is sparse in the sense of having few active neurons (widths of the layers are
proportional to the number of data N );
• The solution has skip connections in all layers;
• The architecture has linear bottlenecks, which force the weight matrices to be low rank.

Such an architecture is illustrated in Fig. 5. The result shows that ReLU DNNs with skip connections
and linear bottlenecks trained with a variant of weight decay [31, Remark 4.7] are optimal solutions
to fitting data using compositions of R BV2 functions. Linear bottlenecks may be written as
factorized (low-rank) weight matrices of the form W^{(ℓ)} = U^{(ℓ)} V^{(ℓ)}. Factorizations of this form in
DNNs have been shown to speed up learning [1] and increase the accuracy [16], robustness [37],
and computational efficiency [47] of DNNs.
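A minimal PyTorch sketch of a linear bottleneck: the effective weight matrix W^{(ℓ)} is factored as U^{(ℓ)}V^{(ℓ)} by inserting a narrow linear layer (no activation) between ReLU layers, which caps the rank of the composed map. The widths below are arbitrary illustrative choices.

```python
import torch
import torch.nn as nn

def bottleneck_block(in_width, relu_width, bottleneck_width):
    """A ReLU layer followed by a linear bottleneck: the two linear maps with no
    nonlinearity between them compose to an effective weight matrix W = U V of
    rank at most bottleneck_width."""
    return nn.Sequential(
        nn.Linear(in_width, relu_width), nn.ReLU(),   # wide ReLU layer
        nn.Linear(relu_width, bottleneck_width),      # V: narrow linear neurons
        nn.Linear(bottleneck_width, relu_width),      # U: expand back toward the next ReLU layer
    )

# A toy DNN with linear bottlenecks in the spirit of Fig. 5 (widths are arbitrary).
model = nn.Sequential(
    bottleneck_block(10, 128, 8),
    nn.ReLU(),
    bottleneck_block(128, 128, 8),
    nn.ReLU(),
    nn.Linear(128, 1),
)
print(model(torch.randn(4, 10)).shape)   # torch.Size([4, 1])
```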

Fig. 5. A feedforward DNN architecture with linear bottlenecks. The blue nodes represent ReLU neurons, the gray
nodes represent linear neurons, and the white nodes depict the DNN inputs. Since the linear layers are narrower than the
ReLU layers, this architecture is referred to as a DNN with linear bottlenecks.

V. WHAT IS THE ROLE OF NEURAL NETWORK ACTIVATION FUNCTIONS?

The primary focus of the article so far has been the ReLU activation function ρ(t) = max{0, t}.
Many of the previously discussed ideas can be extended to a broad class of activation functions.
The property of the ReLU exploited so far has been that it is sparsified by the second derivative
operator in the sense that D2 ρ = δ . Indeed, we can define a broad class of neural function spaces
akin to R BV2 (Rd ) by defining spaces characterized by different sparsifying transforms matched
to an activation function. This entails replacing D2t in (29) with a generic sparsifying transform H.
Table 1 (adapted from [43]) provides examples of common activation functions that fall into this
framework, where each sparsifying transform H is defined by its frequency response Ĥ(ω). For the ReLU, we have H = D_t^2 and so Ĥ(ω) = (jω)^2 = −ω^2.
Therefore, many of the previously discussed results can be directly extended to a broad class
of activation functions including the classical sigmoid and arctan activation functions. We remark
that the sparsity-promoting effect of weight decay hinges on the homogeneity of the activation
function in the DNN. While the ReLU and truncated power activation functions in Table 1 are
homogeneous, the other activation functions are not. This provides evidence that one should prefer
homogeneous activation functions like the ReLU in order to exploit the tight connections between
weight decay and sparsity. Although the sparsity-promoting effect of weight decay does not apply
to these other activation functions, key results, including the characterizations of the solution
sets to the optimization problems akin to (32) and (34) do apply, providing key insight into the
kinds of functions that DNNs learn using these activation functions.

[Table 1] Common Activation Functions

Activation Function            σ(t)                        Frequency Response of Sparsifying Transform: Ĥ(ω)
Rectified Linear Unit (ReLU)   max{0, t}                   −ω^2
Truncated Power                max{0, t}^k / k!,  k ∈ N    (jω)^{k+1}
Sigmoid                        1/(1 + e^{−t}) − 1/2        (j/π) sinh(πω)
Arctan                         arctan(t)/π                 jω e^{|ω|}
Exponential                    e^{−|t|}/2                  1 + ω^2

VI. WHY DO NEURAL NETWORKS SEEMINGLY BREAK THE CURSE OF DIMENSIONALITY?

In 1993, Barron published his seminal paper [4] on the ability of NNs with sigmoid activation
functions to approximate a wide variety of multivariate functions. Remarkably, he showed that
NNs can approximate functions which satisfy certain decay conditions on their Fourier transforms
at a rate that is completely independent of the input-dimension of the functions. This property
has led to many people heralding his work as “breaking the curse of dimensionality.” Today, the
function spaces he studied are often referred to as the spectral Barron spaces. It turns out that this
remarkable approximation property of NNs is due to sparsity.
To explain this phenomenon, we first recall a problem which “suffers the curse of dimensionality”.
A classical problem in signal processing is reconstructing a signal from its samples. Shannon’s
sampling theorem [40] asserts that sampling a bandlimited signal on a regular grid at a rate faster
than the Nyquist rate guarantees that the sinc interpolator perfectly reconstructs the signal. Since the
sinc function and its shifts form an orthobasis for the space of bandlimited signals, the energy of the
signal (squared L^2-norm) corresponds to the squared (discrete) ℓ^2-norm of its samples. Multivariate
versions of the sampling theorem are similar and assert that sampling multivariate bandlimited
signal on a sufficiently fine regular grid guarantees perfect reconstruction with (multivariate) sinc
interpolation. It is easy to see that the grid size (and therefore the number of samples) grows
exponentially with the dimension of the signal. This shows that the sampling and reconstruction
of bandlimited signals suffers the curse of dimensionality. The fundamental reason for this is that
the energy or “size” of a bandlimited signal corresponds to the ℓ^2-norm of the signal’s expansion
coefficients in the sinc basis.
It turns out that there is a stark difference if we instead measure the “size” of a function by the
more restrictive ℓ^1-norm instead of the ℓ^2-norm, an idea popularized by wavelets and compressed sensing. Let D = {ψ}_{ψ∈D} be a dictionary of atoms (e.g., sinc functions, wavelets, neurons, etc.). Consider the problem of approximating a multivariate function mapping R^d → R that admits a decomposition f(x) = Σ_{k=1}^{∞} v_k ψ_k(x), where ψ_k ∈ D and the expansion coefficients satisfy Σ_{k=1}^{∞} |v_k| = ‖v‖_{ℓ^1} < ∞. We will show that there exists an approximant constructed with K terms from the dictionary D whose approximation error (measured in the squared L^2-norm) ‖f − f_K‖_{L^2}^2 decays at a rate completely independent of the input dimension d.
We will illustrate the argument when D = {ψ_k}_{k=1}^{∞} is an orthonormal basis (e.g., multivariate Haar wavelets). Given a function f : R^d → R that admits a decomposition f(x) = Σ_{k=1}^{∞} v_k ψ_k(x) such that ‖v‖_{ℓ^1} < ∞, we can construct an approximant f_K by a simple thresholding procedure that keeps the K largest coefficients of f and sets all other coefficients to 0. If we let |v_{(1)}| ≥ |v_{(2)}| ≥ ··· denote the coefficients of f sorted in nonincreasing magnitude, then the approximation error is bounded as

‖f − f_K‖_{L^2}^2 = ‖ Σ_{k>K} v_{(k)} ψ_{(k)} ‖_{L^2}^2 = Σ_{k>K} |v_{(k)}|^2,    (35)

where the last equality follows by exploiting the orthonormality of the {ψ_k}_{k=1}^{∞}. Finally, since the original sequence of coefficients v = (v_1, v_2, . . .) is absolutely summable, the magnitude of each term v_{(k)} in the tail of the sequence of sorted coefficients must decay faster than 1/k (since the harmonic series Σ_{k=1}^{∞} 1/k diverges). Putting this together with (35), the approximation error ‖f − f_K‖_{L^2}^2 decays as 1/K, completely independent of the input dimension d. For a more precise treatment of this argument, we refer the reader to Theorem 9.10 in Mallat’s Wavelet Tour of Signal Processing [23]. These kinds of thresholding procedures, particularly with wavelet bases [13], [9], [12], revolutionized signal and image processing and were the foundations of compressed sensing [8], [11], [7].
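A short numerical illustration of this argument with a synthetic, absolutely summable coefficient sequence: the K-term thresholding error Σ_{k>K}|v_{(k)}|^2 times K stays bounded, consistent with an error of at most O(1/K), with no reference to any input dimension.

```python
import numpy as np

rng = np.random.default_rng(0)
M = 100_000
v = rng.standard_normal(M) / np.arange(1, M + 1) ** 1.2    # absolutely summable coefficients

v_sorted_sq = np.sort(np.abs(v))[::-1] ** 2                # |v_(1)|^2 >= |v_(2)|^2 >= ...
tail = np.cumsum(v_sorted_sq[::-1])[::-1]                  # tail[K] = sum_{k > K} |v_(k)|^2

for K in (10, 100, 1_000, 10_000):
    # Squared L2 error of the K-term thresholding approximant, Eq. (35).
    print(K, tail[K], K * tail[K])   # K * error stays bounded, consistent with an O(1/K) rate
```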
By a more sophisticated argument, a similar phenomenon occurs when the orthonormal basis is replaced with an essentially arbitrary dictionary of atoms. The approximation result for general atoms was presented by Pisier in 1981 at the Functional Analysis Seminar at Polytechnique, Palaiseau, France, crediting the idea to Maurey [33]. Roughly speaking, the result of Maurey states that given a function which is an ℓ^1-combination of bounded atoms from a dictionary, there exists a K-term approximant that admits a dimension-free approximation rate that decays as 1/K. Barron used the result of Maurey to prove his dimension-free approximation rates with sigmoidal NNs.
More generally, Maurey’s idea can be applied to any function space where the functions are ℓ^1-combinations of bounded atoms. Such spaces are sometimes called variation spaces [2]. Recall from Sections IV and V that the operator H K R sparsifies neurons of the form σ(w^T x − b), where (w, b) ∈ S^{d−1} × R and σ is matched to H. This implies that the space of functions f : R^d → R such that ‖H K R f‖_M < ∞ can be viewed as a variation space, where the dictionary corresponds to the neurons {σ(w^T x − b)}_{(w,b) ∈ S^{d−1} × R}. Therefore, given any function f : R^d → R such that ‖H K R f‖_M < ∞, there exists a K-term approximant f_K that takes the form of a shallow NN with K neurons such that the approximation error decays as 1/K. These techniques have been studied in great detail [41] and have been extended to the setting of deep NNs [14] by considering compositional function spaces akin to the compositional space introduced in Section IV-B.
Combining these dimension-free approximation rates with the sparsity-promoting effect of weight
decay regularization for ReLU NNs has a striking effect on the learning problem. Suppose that
we train a shallow ReLU NN with weight decay on data generated from the noisy samples y_n = f(x_n) + ε_n, n = 1, . . . , N, of f ∈ R BV^2(R^d), where the x_n are i.i.d. uniform random variables on some bounded domain Ω ⊂ R^d and the ε_n are i.i.d. Gaussian random variables. Let f_N denote this trained NN. Then, it has been shown [32] that the mean integrated squared error (MISE) E‖f − f_N‖_{L^2(Ω)}^2 decays as N^{−1/2}, independent of the input dimension d. Moreover, this result also
shows that the generalization error of the trained NN on a new example x generated uniformly at
random on Ω is also immune to the curse of dimensionality. Furthermore, these ideas have been
studied in the context of deep NNs [39], proving dimension-free MISE rates for ReLU DNNs.

A. Mixed Variation and Low-Dimensional Structure

The national meeting of the American Mathematical Society in 2000 was held to discuss the
mathematical challenges of the 21st Century. Here, Donoho gave a lecture titled High-Dimensional
Data Analysis: The Curses and Blessings of Dimensionality [10]. In this lecture, he coined the term
“mixed variation” to refer to the kinds of functions that lie in variation spaces, citing the spectral
Barron spaces as an example. The variation spaces are different from classical multivariate function
spaces in that they favor functions that have weak variation in multiple directions (very smooth
functions) as well as functions that have very strong variation in one or a few directions (very
rough functions). These spaces also disfavor functions with strong variation in multiple directions.
[Figure 6 omitted: (a) weak variation in multiple directions; (b) strong variation in one direction; (c) strong variation in multiple directions.]
Fig. 6. Examples of functions exhibiting different kinds of variation.

It is this fact that makes them quite “small” compared to classical multivariate function spaces,
giving rise to their dimension-free approximation and MISE rates. Examples of functions with
different kinds of variation are illustrated in Fig. 6.
In order to interpret the idea of mixed variation in the context of modern data analysis and
DL, we turn our attention to Fig. 6(b). In this figure, the function has strong variation, but only
in a single direction. In other words, this function has low-dimensional structure. It has been
observed by a number of authors that DNNs are able to automatically adapt to the low-dimensional
structure that often arises in natural data. This is possible because the input weights can be trained
to adjust the orientation of each neuron. The dimension-independent approximation rate quantifies the
power of this tunability. This provides an explanation of why DNNs are good at learning functions
with low-dimensional structure. In particular, the function in Fig. 6(c) has strong variation in all
directions, so no method can overcome the curse of dimensionality in this sort of situation. On the
other hand, in Fig. 6(a) the function has weak variation in all directions and in Fig. 6(b) the function has strong variation only in one direction, so these are functions for which neural networks will overcome the curse. For (b), the sparsity-promoting effect of weight decay promotes DNN solutions with
neurons oriented in the direction of variation (i.e., it automatically learns the low-dimensional
structure).

VII. TAKEAWAY MESSAGES AND FUTURE RESEARCH DIRECTIONS

In this article we presented a mathematical framework to understand DNNs from first principles,
through the lens of sparsity and sparse regularization. Using familiar mathematical tools from
signal processing, we provided an explanation for the sparsity-promoting effect of the common
regularization scheme of weight decay in neural network training, the use of skip connections and
low-rank weight matrices in network architectures, and why neural networks seemingly break the
curse of dimensionality. This framework provides the mathematical setting for many future research
directions.
For example, the framework suggests the possibility of new neural training algorithms. The
equivalence of solutions using weight decay regularization and the regularization in (6) leads to proximal gradient methods akin to iterative soft-thresholding algorithms. The preliminary results in [48] show that proximal gradient training algorithms result in neural networks that perform as well as, and often better than (particularly when labels are corrupted), standard training with weight
decay, while producing sparser networks. We reproduce a figure from [48] in Fig. 7 showing this
effect on the MNIST data set with a fully-connected DNN.
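For intuition, here is a minimal sketch of one flavor of such a method, not necessarily the exact algorithm of [48]: a gradient step on the data-fitting loss followed by the group-LASSO proximal step, which soft-thresholds each neuron’s output-weight vector and can zero out entire neurons. The data, network sizes, and λ are placeholders.

```python
import torch

def group_soft_threshold(V, tau):
    """Prox of tau * sum_k ||v_k||_2 applied row-wise: shrinks each row toward 0
    and sets it exactly to 0 when its norm is below tau (neuron pruned)."""
    norms = V.norm(dim=1, keepdim=True).clamp_min(1e-12)
    return V * (1 - tau / norms).clamp_min(0.0)

torch.manual_seed(0)
X, y = torch.randn(256, 20), torch.randn(256, 1)      # placeholder data
W = torch.randn(64, 20, requires_grad=True)           # hidden-layer (input) weights
b = torch.zeros(64, requires_grad=True)
V = torch.zeros(64, 1, requires_grad=True)            # output weights (one group per neuron)

lr, lam = 1e-3, 5.0
for step in range(2000):
    loss = ((torch.relu(X @ W.T - b) @ V - y) ** 2).sum()
    loss.backward()
    with torch.no_grad():
        for p in (W, b, V):
            p -= lr * p.grad                          # gradient step on the data-fitting loss
            p.grad = None
        V.copy_(group_soft_threshold(V, lr * lam))    # proximal step on the output weights
print("active neurons:", int((V.detach().abs().sum(dim=1) > 0).sum()))
```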
A large body of work dating back to 1989 [22], [19] on NN pruning has shown empirically
that large NNs can be compressed or sparsified to a fraction of their size while still maintaining
their predictive performance. The connection between weight decay and sparsity-promoting
regularizers like in (6) suggests new approaches to pruning. For example, one could apply proximal
gradient algorithms to derive sparse approximations to large pre-trained neural networks. There are
many open questions in this direction, both experimental and theoretical, including applying these
algorithms to other DNN architectures and deriving convergence results for these algorithms.
The framework in this paper also shows that trained ReLU DNNs are compositions of R BV2
functions. As explained in Section VI-A, we have a clear understanding of R BV^2. The
R BV2 spaces favor functions that are smooth in most or all directions, which explains why neural
networks seemingly break the curse of dimensionality. Less is clear about the compositions of
R BV2 functions (which characterize DNNs). Better understanding of these compositional function

[Figure 7 omitted: plot of test accuracy (%) versus sparsity (%) for four methods: Group LASSO, LASSO, Proximal Gradient Algorithm, and Off-the-Shelf Weight Decay.]
Fig. 7. Test accuracy vs. sparsity (proportion of nonzero neurons) of a fully-connected DNN trained on the MNIST dataset using the group LASSO, the LASSO, the proximal gradient algorithm of Yang et al. [48], and using off-the-shelf weight decay. The proximal gradient algorithm results in a trained model with high test accuracy and extreme sparsity. Meanwhile, the off-the-shelf weight decay model is not sparse at all. (Figure reproduced from [48, Figure 3].)

spaces could provide new insights into the benefits of depth in neural networks. This in turn could
lead to new guidelines for designing NN architectures and training algorithms.

ACKNOWLEDGMENT

The authors would like to thank Rich Baraniuk, Misha Belkin, Ron DeVore, Kangwook Lee,
Greg Ongie, Dimitris Papailiopoulos, Tomaso Poggio, Lorenzo Rosasco, Joe Shenouda, Jonathan
Siegel, Ryan Tibshirani, Michael Unser, Becca Willett, Stephen Wright, Liu Yang, and Jifan Zhang,
for many insightful discussions on the topics presented in this article.
RP was supported in part by the NSF Graduate Research Fellowship Program under grant DGE-
1747503 and by the European Research Council (ERC Project FunLearn) under Grant 101020573.
RN was supported in part by the NSF grants DMS-2134140 and DMS-2023239, the ONR MURI
grant N00014-20-1-2787, and the AFOSR/AFRL grant FA9550-18-1-0166, as well as the Keith
and Jane Nosbusch Professorship.

REFERENCES

[1] L. J. Ba and R. Caruana, “Do deep nets really need to be deep?” in Proceedings of the 27th International Conference
on Neural Information Processing Systems-Volume 2, 2014, pp. 2654–2662.
[2] F. Bach, “Breaking the curse of dimensionality with convex neural networks,” The Journal of Machine Learning
Research, vol. 18, no. 1, pp. 629–681, 2017.
[3] R. Balestriero and R. G. Baraniuk, “Mad max: Affine spline insights into deep learning,” Proceedings of the IEEE,
vol. 109, no. 5, pp. 704–727, 2020.

[4] A. R. Barron, “Universal approximation bounds for superpositions of a sigmoidal function,” IEEE Transactions on
Information theory, vol. 39, no. 3, pp. 930–945, 1993.
[5] F. Bartolucci, E. De Vito, L. Rosasco, and S. Vigogna, “Understanding neural networks with reproducing kernel
Banach spaces,” Appl. Comput. Harmon. Anal., vol. 62, pp. 194–236, 2023.
[6] E. J. Candès, “Ridgelets: Theory and applications,” Ph.D. dissertation, Stanford University, 1998.
[7] E. J. Candès and J. Romberg, “Sparsity and incoherence in compressive sampling,” Inverse Problems, vol. 23, no. 3,
p. 969, 2007.
[8] E. J. Candès, J. Romberg, and T. Tao, “Robust uncertainty principles: Exact signal reconstruction from highly
incomplete frequency information,” IEEE Transactions on Information Theory, vol. 52, no. 2, pp. 489–509, 2006.
[9] A. Chambolle, R. A. DeVore, N.-Y. Lee, and B. J. Lucier, “Nonlinear wavelet image processing: Variational
problems, compression, and noise removal through wavelet shrinkage,” IEEE Transactions on Image Processing,
vol. 7, no. 3, pp. 319–335, 1998.
[10] D. L. Donoho, “High-dimensional data analysis: The curses and blessings of dimensionality,” AMS Lectures, p. 32,
2000.
[11] D. L. Donoho, “Compressed sensing,” IEEE Transactions on Information Theory, vol. 52, no. 4, pp. 1289–1306,
2006.
[12] D. L. Donoho and I. M. Johnstone, “Minimax estimation via wavelet shrinkage,” The Annals of Statistics, vol. 26,
no. 3, pp. 879–921, 1998.
[13] D. L. Donoho, I. M. Johnstone, G. Kerkyacharian, and D. Picard, “Wavelet shrinkage: Asymptopia?” Journal of the
Royal Statistical Society: Series B (Methodological), vol. 57, no. 2, pp. 301–337, 1995.
[14] W. E and S. Wojtowytsch, “On the Banach spaces associated with multi-layer ReLU networks: Function representation,
approximation theory and gradient descent dynamics,” CSIAM Transactions on Applied Mathematics, vol. 1, no. 3,
pp. 387–440, 2020.
[15] S. D. Fisher and J. W. Jerome, “Spline solutions to L1 extremal problems in one and several variables,” Journal of
Approximation Theory, vol. 13, no. 1, pp. 73–83, 1975.
[16] A. Golubeva, B. Neyshabur, and G. Gur-Ari, “Are wider nets better given the same number of parameters?”
International Conference on Learning Representations, 2021.
[17] R. C. Gonzalez, “Deep convolutional neural networks [lecture notes],” IEEE Signal Processing Magazine, vol. 35,
no. 6, pp. 79–87, 2018.
[18] Y. Grandvalet, “Least absolute shrinkage is equivalent to quadratic penalization,” in International Conference on
Artificial Neural Networks. Springer, 1998, pp. 201–206.
[19] B. Hassibi and D. Stork, “Second order derivatives for network pruning: Optimal brain surgeon,” Advances in
neural information processing systems, vol. 5, 1992.
[20] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE
conference on computer vision and pattern recognition, 2016, pp. 770–778.
[21] V. Kůrková, P. C. Kainen, and V. Kreinovich, “Estimates of the number of hidden units and variation with respect
to half-spaces,” Neural Networks, vol. 10, no. 6, pp. 1061–1068, 1997.
[22] Y. LeCun, J. Denker, and S. Solla, “Optimal brain damage,” Advances in neural information processing systems,
vol. 2, 1989.
[23] S. Mallat, A Wavelet Tour of Signal Processing, 3rd ed. Elsevier/Academic Press, Amsterdam, 2009.

[24] E. Mammen and S. van de Geer, “Locally adaptive regression splines,” The Annals of Statistics, vol. 25, no. 1, pp.
387–413, 1997.
[25] W. S. McCulloch and W. Pitts, “A logical calculus of the ideas immanent in nervous activity,” The Bulletin of
Mathematical Biophysics, vol. 5, no. 4, pp. 115–133, 1943.
[26] B. Neyshabur, R. Tomioka, and N. Srebro, “In search of the real inductive bias: On the role of implicit regularization
in deep learning.” in International Conference on Learning Representations (Workshop), 2015.
[27] G. Ongie, R. Willett, D. Soudry, and N. Srebro, “A function space view of bounded norm infinite width ReLU nets:
The multivariate case,” in International Conference on Learning Representations, 2020.
[28] R. Parhi, “On Ridge Splines, Neural Networks, and Variational Problems in Radon-Domain BV Spaces,” Ph.D.
dissertation, The University of Wisconsin–Madison, 2022.
[29] R. Parhi and R. D. Nowak, “The role of neural network activation functions,” IEEE Signal Processing Letters,
vol. 27, pp. 1779–1783, 2020.
[30] R. Parhi and R. D. Nowak, “Banach space representer theorems for neural networks and ridge splines.” Journal of
Machine Learning Research, vol. 22, no. 43, pp. 1–40, 2021.
[31] R. Parhi and R. D. Nowak, “What kinds of functions do deep neural networks learn? Insights from variational
spline theory,” SIAM Journal on Mathematics of Data Science, vol. 4, no. 2, pp. 464–489, 2022.
[32] R. Parhi and R. D. Nowak, “Near-minimax optimal estimation with shallow ReLU neural networks,” IEEE
Transactions on Information Theory, vol. 69, no. 2, pp. 1125–1140, 2023.
[33] G. Pisier, “Remarques sur un résultat non publié de B. Maurey,” Séminaire d’Analyse Fonctionnelle (dit ”Maurey-
Schwartz”), pp. 1–12, April 1981.
[34] T. Poggio, A. Banburski, and Q. Liao, “Theoretical issues in deep networks,” Proceedings of the National Academy
of Sciences, vol. 117, no. 48, pp. 30 039–30 045, 2020.
[35] J. Radon, “Über die bestimmung von funktionen durch ihre integralwerte längs gewisser mannigfaltigkeiten,” Ber.
Verh, Sachs Akad Wiss., vol. 69, pp. 262–277, 1917.
[36] L. I. Rudin, S. Osher, and E. Fatemi, “Nonlinear total variation based noise removal algorithms,” Physica D:
nonlinear phenomena, vol. 60, no. 1-4, pp. 259–268, 1992.
[37] A. Sanyal, P. H. Torr, and P. K. Dokania, “Stable rank normalization for improved generalization in neural networks
and GANs,” International Conference on Learning Representations, 2019.
[38] P. Savarese, I. Evron, D. Soudry, and N. Srebro, “How do infinite width bounded norm networks look in function
space?” in Conference on Learning Theory. PMLR, 2019, pp. 2667–2690.
[39] J. Schmidt-Hieber, “Nonparametric regression using deep neural networks with ReLU activation function,” The
Annals of Statistics, vol. 48, no. 4, pp. 1875–1897, 2020.
[40] C. E. Shannon, “Communication in the presence of noise,” Proceedings of the IRE, vol. 37, no. 1, pp. 10–21, 1949.
[41] J. W. Siegel and J. Xu, “Sharp bounds on the approximation rates, metric entropy, and n-widths of shallow neural
networks,” Foundations of Computational Mathematics, pp. 1–57, 2022.
[42] R. Tibshirani, “Regression shrinkage and selection via the lasso,” Journal of the Royal Statistical Society: Series B
(Methodological), vol. 58, no. 1, pp. 267–288, 1996.
[43] M. Unser, “From kernel methods to neural networks: A unifying variational formulation,” arXiv preprint
arXiv:2206.14625, 2022.
[44] M. Unser, “Ridges, neural networks, and the Radon transform,” Journal of Machine Learning Research, vol. 24,
no. 37, pp. 1–33, 2023.

[45] M. Unser, J. Fageot, and J. P. Ward, “Splines are universal solutions of linear inverse problems with generalized
TV regularization,” SIAM Review, vol. 59, no. 4, pp. 769–793, 2017.
[46] M. Vetterli, P. Marziliano, and T. Blu, “Sampling signals with finite rate of innovation,” IEEE Transactions on
Signal Processing, vol. 50, no. 6, pp. 1417–1428, 2002.
[47] H. Wang, S. Agarwal, and D. Papailiopoulos, “Pufferfish: Communication-efficient models at no extra cost,”
Proceedings of Machine Learning and Systems, vol. 3, pp. 365–386, 2021.
[48] L. Yang, J. Zhang, J. Shenouda, D. Papailiopoulos, K. Lee, and R. D. Nowak, “A better way to decay: Proximal
gradient training algorithms for neural nets,” in OPT 2022: Optimization for Machine Learning (NeurIPS Workshop),
2022.
[49] M. Yuan and Y. Lin, “Model selection and estimation in regression with grouped variables,” Journal of the Royal
Statistical Society: Series B (Statistical Methodology), vol. 68, no. 1, pp. 49–67, 2006.

Rahul Parhi received the B.S. degree in mathematics and the B.S. degree in computer science from the University
of Minnesota–Twin Cities in 2018 and the M.S. and Ph.D. degrees in electrical engineering from the University of
Wisconsin–Madison in 2019 and 2022, respectively. During his Ph.D., he was supported by an NSF Graduate Research
Fellowship. He is currently a Post-Doctoral Researcher with the Biomedical Imaging Group at the École Polytechnique
Fédérale de Lausanne. He is primarily interested in applications of functional and harmonic analysis to problems in
signal processing and data science. He is a member of the IEEE.

Robert D. Nowak received the Ph.D. degree in electrical engineering from the University of Wisconsin–Madison in
1995. He was a Post-Doctoral Fellow at Rice University from 1995-1996, an Assistant Professor at Michigan State
University from 1996-1999, and held Assistant and Associate Professor positions at Rice University from 1999-2003.
Since 2003, Nowak has been with the University of Wisconsin–Madison, where he now holds the Keith and Jane Morgan
Nosbusch Professorship in Electrical and Computer Engineering. His research focuses on signal processing, machine
learning, optimization, and statistics. His work on sparse signal recovery and compressed sensing has received several
awards, including the 2014 IEEE W.R.G. Baker Award. Nowak has held visiting positions at INRIA, Sophia-Antipolis
in 2001, and Trinity College, Cambridge in 2010. He has served as an Associate Editor for the IEEE Transactions on
Image Processing and the ACM Transactions on Sensor Networks, and as the Secretary of the SIAM Activity Group
on Imaging Science. He was General Chair for the 2007 IEEE Statistical Signal Processing workshop and Technical
Program Chair for the 2003 IEEE Statistical Signal Processing Workshop, the 2004 IEEE/ACM International Symposium
on Information Processing in Sensor Networks, and the inaugural IEEE GlobalSIP Conference in 2013. He is presently
a Section Editor for the SIAM Journal on Mathematics of Data Science and a Senior Editor for the IEEE Journal on
Selected Areas in Information Theory. Nowak is a Fellow of the IEEE.
