0% found this document useful (0 votes)
15 views

Deep ReLU networks

Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
15 views

Deep ReLU networks

Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 28

Deep ReLU networks – injectivity capacity upper bounds


Mihailo Stojnic
arXiv:2412.19677v1 [stat.ML] 27 Dec 2024

Abstract
We study deep ReLU feed forward neural networks (NN) and their injectivity abilities. The main focus
is on precisely determining the so-called injectivity capacity. For any given hidden layers architecture, it
is defined as the minimal ratio between number of network’s outputs and inputs which ensures unique
recoverability of the input from a realizable output. A strong recent progress in precisely studying single
ReLU layer injectivity properties is here moved to a deep network level. In particular, we develop a program
that connects deep l-layer net injectivity to an l-extension of the ℓ0 spherical perceptrons, thereby massively
generalizing an isomorphism between studying single layer injectivity and the capacity of the so-called (1-
extension) ℓ0 spherical perceptrons discussed in [82]. Random duality theory (RDT) based machinery is then
created and utilized to statistically handle properties of the extended ℓ0 spherical perceptrons and implicitly
of the deep ReLU NNs. A sizeable set of numerical evaluations is conducted as well to put the entire
RDT machinery in practical use. From these we observe a rapidly decreasing tendency in needed layers’
expansions, i.e., we observe a rapid expansion saturation effect. Only 4 layers of depth are sufficient to closely
approach level of no needed expansion – a result that fairly closely resembles observations made in practical
experiments and that has so far remained completely untouchable by any of the existing mathematical
methodologies.

Index Terms: Injectivity; Deep ReLU networks; Random duality.

1 Introduction
An avalanche of research in machine learning (ML) and neural networks (NN) over the last decade produced
some of the very best scientific breakthroughs. These include developments of both excellent algorithmic
methodologies as well as their accompanying theoretical justifications. For almost all of them a superior level
of understanding of underlying mathematical principles is needed. In this paper we study such a principle
called injectivity and discuss how its presence/absence impacts/limits functioning of neural nets.
Just by its definition, the functional injectivity plays a critical role in studying inverse problems and is in
a direct correspondence with their well- or ill-posedness. It comes as a no surprise that recent theoretical and
practical studying of (nonlinear) inverse problems via neural nets heavily relies on the associated injectivities
(see, e.g., [6, 11, 15, 16, 19, 32, 39, 41, 82]). Consequently, many aspects od injectivity gained strong interest in
recent years including studying Lipshitzian/stability properties, [20, 30, 37], role of the injective ReLU nets
in manifold densities and approximative maps [48, 49, 52], algorithmic approaches to generative models [21,
32,33,35,41,44,50,51,55,56,89], deep learning compressed sensing/phase retrieval [11,15,16,32,34,39,41,44],
and random matrix - neural networks connections [42, 43, 46].
As discussed in many of the above works, characterizing analytically injectivity of a whole network is
not an easy mathematical problem. It typically relates to the so-called injectivity capacity defined as the
minimal ratio of the network’s number of outputs and inputs for which a unique generative input produces a
realizable output. Despite a host of technical difficulties, in addition to excellent practical implementations,
strong accompanying analytical results are obtained in many of the above works as well. They usually relate
to the so-called qualitative performance characterizations which typically give correct dimensional orders
and, as such, provide an intuitive guidance for building networks’ architectures. Since our interest is in
more precise, i.e., quantitative performance characterizations, results from [13, 43, 45, 48, 82] are more closely
∗ e-mail: [email protected]

1
related to ours and we discuss them throughout the presentation after the introduction of necessary technical
preliminaries.

2 Mathematical preliminaries, related work, and contributions


For any given positive integer l, consider sequences of positive integers m0 , m1 , m2 , m3 , . . . , ml , matrices
A(i) ∈ Rmi ×mi−1 , and real maps fgi (·) : Rmi → Rmi . The following nonlinear system of equations will be
the main mathematical object of our study
    
ȳ(l) = fgl A(l) . . . A(4) fg3 A(3) fg2 A(2) fg1 (A(1) x̄) . (1)

Adopting the convention n , m0 , one has that A(i) ’s are linearly transformational system matrices, fgi (·)
are (nonlinear) system functions, and x̄ ∈ Rn is a generative input vector to be recovered. A generic nature
of our presentation will ensure that both developed methodologies and obtained results can be utilized in
conjunction with a host of different activation functions fgi (·). To ensure neatness of the exposition, we
consider the so-called componentwise activation functions that act in the same manner on each coordinate of
their vector argument. Given their importance in studying and practical utilization of deep neural networks
(DNN), we consider ReLUs as concrete activation examples

fgi (x) = max(x, 0), (2)

with max being applied componentwise. After setting

A1:l , [A(1) , A(2) , . . . , A(l) ], (3)

it is then not that difficult to see that the nonlinear system from (1) becomes the so-called
     
Deep ReLU system N1:l : ȳ(l) = max A(l) . . . A(2) max A(1) x̄, 0 , 0 , 0 , f¯nn (x̄; A1:l ). (4)

We particularly focus on mathematically typically the most challenging linear (or, as often called, propor-
tional ) high-dimensional regimes with i-th layer absolute expansion coefficients
mi
αi , lim , i = 1, 2, . . . , l. (5)
m0 →∞ m0
These are closely connected to relative expansion coefficients
mi αi
ζi , lim = , i = 1, 2, . . . , l. (6)
m0 →∞ mi−1 αi−1

Clearly, the absolute expansion coefficients relate to the expansion of the whole network, whereas the relative
ones relate to the expansions within each of the layers. The system in (4) is a mathematical description of a
feed forward ReLU DNN with input x̄, output ȳ, and the weights of the gates in the i-th hidden layer being
the rows of A(i) .
Given the growing popularity of machine learning (ML) and neural networks (NN) concepts, the interest
in ReLU DNNs picked up over the last decade as well. Various properties of ReLU gates have been the
subject of extensive research including both single-layer structures (see, e.g., [13, 23, 43, 45, 48, 49]) as well as
more complex multilayered ones (see, e.g., [5, 8, 80, 81, 90]). Of particular interest have been the invertibility
or injectivity abilities of ReLU activations as they make them a bit different from, say, more traditional sign
perceptrons or other nonlinear ones. For example, just mere existence of at least m1 ≥ m0 nonzero elements
at the output of the first layer is sufficient to recover the generating input (a non-degenerative scenario with
any subset of rows/columns of A(i) being of full rank is assumed throughout the paper; in statistical contexts
of our interest here, this typically happens with probability 1). While potential existence of the injectivity
is relatively easy to observe, it is highly nontrivial to determine the minimal length of ȳ that ensures its
existence (see, e.g., [13, 23, 43, 45, 48, 49, 82]). In fact, this is already very complicated to do evem for a single

2
layer network [43, 82].
Precisely determining sequence m0 , m1 , m2 , . . . , ml for which ReLU DNNs are injective is the main topic of
this paper. As we will be working in the high-dimensional linear (proportional) regime, this effectively trans-
lates into determining the corresponding absolute expansion sequence, αi , i = 1, 2, . . . , l. It is not that difficult
to see that in order to have an l-layer ReLU NN, N1:l , injective, all its subnetworks N1:1 , N1:2 , . . . , N1:(1−1) ,
must be injective as well. A sequence αi , i = 1, 2, . . . , (l − 1) that ensures this will be called injectively ad-
missible. Of particular interest are the minimal ones, i.e., the ones that require minimal necessary expansion
in each of the layers. They will be called minimally injectively admissible. Following the usual terminology
(see, e.g., [23, 43, 48, 49]), we, for an injectively admissible sequence αi , i = 1, 2, . . . , (l − 1) (for l = 1 the
sequence is empty and automatically admissible), and statistical A(i) , i = 1, 2, . . . , l, formally define

l-layer ReLU NN, N1:l , injectivity capacity:


(inj)
αReLU (α1 , α2 , . . . , αl−1 ) , min αl
subject to lim PA1:l (∀x̄, ∄x 6= x̄ such that f¯nn (x̄; A1:l ) = f¯nn (x; A1:l )) = 1.
n→∞
(7)

To facilitate the exposition we assume throughout the paper that A(i) ’s are comprised of iid standard normals
(as mentioned in [82], all of our results are easily generalizable to various other statistics that can be pushed
through the Lindeberg variant of the central limit theorem). We also adopt the convention that the subscripts
next to P and E denote the underlying source of randomness (these subscripts are left unspecified when the
source of randomness is clear from the context). It is also not that difficult to establish the definition of
(inj)
Minimally injectively admissible sequence: αi = αReLU (α1 , α2 , . . . , αi−1 ) , i = 1, 2, . . . , l. (8)

All our results will be obtained for generally injectively admissible sequences. To obtain concrete numerical
values of the capacities we will then consider the minimally injectively admissible ones as they provide
architecture with the minimal expansion in each of the layers.

2.1 Related prior work


⋆ Deep learning compressed sensing: A particularly prominent role of injectivity over the last several years
appeared in deep learning approaches to compressed sensing (or general structured objects recovery). In [11] a
deep learning compressed sensing paradigm was put forth where properly trained deep ReLU nets are used as
sparse signals generative engines. Nice ReLU analytical properties allowed for utilization of gradient methods
to recover both generating input and the corresponding network output (i.e., the desired sparse signal).
Experimental results showed a superior performance compared to standard convexity based techniques with
a level of compression decreased by 4-5 times compared to say LASSO. Analytical results showed dimensional
orders and errors that match the best known ones of the convex methods provided that gradient solves the
underlying optimization. [36] then showed a polynomial running of the gradient provided a logarithmic layers
expansions. On the other hand, [15]) showed that constant expansion in principle suffices for compressed
sensing while the results of [41] (when taken together with [56]) achieved the same constant type of expansion
with a particularly tailored layer-wise Gaussian elimination algorithm. Many other utilizations of generative
models, or generative adversarial networks (GAN), appeared in further improving compressed sensing, phase
retrieval or denoising [16, 32, 34, 44]. As noted in most of the above works, invertibility/injectivity of the
generating models/networks is a key precondition that allows for application of such objects in deep learning
compressed sensing. Moreover, its precise dimensional characterizations critically impact the architecture of
the networks, computational complexity of the training and recovery algorithms, and the accuracy of the
theoretical guarantees that rely on them. Naturally, studying the above defined injectivity capacities in full
detail imminently gained traction and became extraordinarily relevant research topic on its own as well.
⋆ Injectivity capacity versus classical perceptrons capacities: Before switching to precise studies of capacities,
one thing needs to be carefully addressed as well. Namely, although conceptually somewhat connected,
the capacities defined in the previous section are actually different from the ones typically associated with
classical spherical [9,12,14,26,38,53,57,58,63,70,76,84–88] or binary perceptrons [1–3,7,22,25–27,31,40,47].

3
There are two key difference: (i) The injectivity capacities are defined for the whole network whereas the
classical ones are typically defined for a single gate; and (ii) The classical perceptrons’ capacities usually
relate to the network (gate) ability to store/memorize patterns – fundamentally different from ability to
(uniquely) invert them. Some aspects of these differences can be bridged though. For example, regarding
the first difference, [10, 18, 24, 75, 79] show how single layer associative memories extend to multilayered ones
as well.
⋆ Connecting injectivity and classical perceptrons capacities: While the second of the above differences by the
definition can not be conceptually bridged, it can be bridged in mathematical terms. Namely, as shown in [82],
a so-called ℓ0 spherical perceptron reformulation of the single layer injectivity capacities is possible so that
they resemble (or become mathematically equivalent to) variants of the classical ones. Further connections
with classical perceptron studies have been established while focusing on the injectivity of 1-layer ReLU NNs.
In particular, [48] combined union bounding with spherical perceptron capacity characterizations [14, 86–88]
to obtain ≈ 3.3 and ≈ 10.5 as respective lower and upper bounds on the single layer injectivity capacity under
the Gaussianity of A (for a further upper-bound decrease to ≈ 9.091 see [43] and reference therein [13, 45];
on the other hand, [48] showed that allowing for optimal network’s weights choice lowers the capacity to 2).
Appearance of a discrepancy between the bounds strongly suggested their looseness. While opting for union
bounding comes as a no surprise given the highly nonconvex nature of the underlying problems, the resulting
deviation from exactness is to be expected as well. As discussed in a long line of work on various perceptrons
models [64, 70, 83, 84], utilization of classical techniques when facing nonconvex problems typically results in
built-in suboptimality.
⋆ Geometric approaches: Different approaches were taken in [13, 45] where a connection to high-dimensional
integral geometry studies of intersecting random subspaces is established. Relying on the kinematic formula
[54] (and somewhat resembling earlier compressed sensing approaches of [4,17]), [45] considered a heuristical
Euler characteristics approximation approach and obtained a slightly better capacity estimate ≈ 8.34.
⋆ Probabilistic approaches: Exactness of this prediction was considered in [43]. Following into the footsteps
of [13, 45] and connecting further ReLU injectivity and intersecting random subspaces, [43] showed that
application of Gordon’s escape through a mesh theorem [28, 29] results in a much large ≈ 23.54 capacity
bound. The authors of [43] then made use of the (plain) random duality theory (RDT) that Stojnic created
in a long line of work [60, 62, 63, 67, 69] and lowered the capacity upper-bound to ≈ 7.65. This was both
significantly better than what they got through the Gordon’s theorem and also sufficiently good to refute the
above mentioned Euler characteristics approximation based prediction. [43] continued further by utilizing
replica methods from statistical physics and obtained single layer injectivity capacity prediction ≈ 6.698.
All these results were closely matched by Stojnic in [82] through the utilization of the Fully lifted random
duality theory (fl RDT).
⋆ “Qualitative” vs “quantitative” performance characterizations: The above results mostly relate to single
layer ReLU injectivity. Due to underlying difficulties, studying, of a presumably simpler, single layer injec-
tivity equivalent is often undertaken as a first step on a path towards handling multilayered networks (see,
e.g. [13, 43, 45, 48, 82]). One however needs to note a significant difference between the role studying of a sin-
gle layer injectivity plays in both qualitative and quantitative performance analyses. In qualitative analyses
(that mostly focus on correct dimensional orders) results obtained for a single layer almost automatically
extend to multilayered structures as the dimension orders are to a large degree preserved. In quantitative
analyses things are way more different. Even when one can precisely characterize performance of one layer,
trivial extensions to higher layers typically incur a substantial suboptimality (while the dimensional orders
are likely to be preserved the concrete associated constants dramatically change). For example, if a single
layer injectivity capacity is α ≈ 6.7 then the corresponding l-layer one is for sure trivially upper-bounded
by αl ≈ 6.7l . From the qualitative performance characterization point of view, this is sufficient to ensure
constant expansion per both layer and network as a whole which on its own is a remarkable property. On the
other hand, from the quantitative performance analysis point of view, such a bound is expected to be highly
suboptimal. Moreover, it grows at a much faster pace than what is observed in practical implementations
(see, e.g., [11] where the needed expansion is ∼ 40 which is way smaller than 6.73 ≈ 300). As such it can
be way too conservative and result in projecting much larger network architectures than needed, potentially
causing computational intractabilities of both training and recovery algorithms. In other words, if signifi-
cantly suboptimal, it could be more suited as a cautious intuitive guidance rather than as a precise recipe
for designing network architectures. In general, both types of analysis are useful, one just needs to be careful

4
when and how to use them. The qualitative ones are more adapted for scenarios where one needs quick
intuitive assessments whereas the quantitative ones are more tailored for fine-grained precise designs.
In what follows, we consider multilayered deep ReLU NNs and focus on a quantitative type of analysis
that provides injectivity capacity upper-bounds much lower than the ones obtained through trivial single
layer generalizations. Before proceeding further with the technical analysis, we below summarize some of
our key results.

2.2 Our contributions


As stated above our focus is on precise studying of deep ReLU NN injectivity capacities. A fully connected
feed forward network architecture is considered in a statistical so-called linear/proportional high-dimensional
regime. This effectively means that the number of outputs of l-th layer, ml , is αl times larger than the number
of network inputs, n = m0 . Moreover, it remains constant as the number of inputs/outputs grows.
• Following [63, 70, 76, 82], we establish a connection between studying l-layer ReLU NN injectivity
properties and random feasibility problem (rfps). In particular (see Section 2), we find

l-layer ReLU NN injectivity ⇐⇒ l-extended ℓ0 spherical perceptron, (9)

which then implies

l-layer ReLU NN injectivity capacity ⇐⇒ l-extended ℓ0 spherical perceptron capacity. (10)

• Relying on Random duality theory (RDT) we create a generic program for studying deep ReLU NNs
and their injectivity properties. We introduce two notions of injectivity, weak and strong (see Section
3). The weak one relates to invertibility over all possible inputs, whereas the strong one relates to
invertibility of any given input. For both notions and for any number of layers l, we provide upper-
bounds on l-layer ReLU NN injectivity capacity (see Sections 3.1 and 4.1.1).
• For the first three layers explicit numerical values of the capacity bounds and all associated RDT
parameters are provided (see Section 3.1 and Tables 2 and 3 for 2-layer nets and Section 4.1.1 and
Tables 4, and 5 for 3-layer nets).
• Table 1 previews the change in weak injectivity capacity as the number of layers increases. A remarkable
rapid onset of expansion saturation effect is observed. Namely, already for nets with 4 layers necessary
expansion per layer decreases fairly close to 1, thereby almost reaching level of no needed expansion.

Table 1: Deep ReLU NN weak injectivity capacity and layers expansions ; mi – # of nodes in the i-th layer

l (# of layers) 1 2 3 4
Injectivity capacity (upper bound)
  6.7004 8.267 9.49 10.124
ml
αl = limn→∞ mnl = limm0 →∞ m 0

Layer’s expansion

ml αl
 6.7004 1.2338 1.1479 1.0668
ζl = limn→∞ ml−1 = αl−1

• To further lower the capacity upper bounds, we develop a powerful Lifted RDT based program (see
Sections 5.1 and 5.2). For 2-layer nets we provide concrete numerical values and observe that they
indeed lower the corresponding plain RDT ones (see Section 5.1 and Tables 8and 9).
• Implications for compressed sensing seem rather dramatic. As the reciprocal of the values given in
Table 1 are tightly connected to sparsity and undersampling ratios in compressed sensing, provided
that nets are sufficiently generalizable and that the underlying recovery algorithms run fast (which
practical implementations suggest is the case), the above results allow for unprecedented improvement
unreachable by any of the currently practically usable non NN based compressed sensing algorithms.

5
3 2-layer ReLU NN
To make the presentation smoother we start more technical discussions by considering 2-layered nets as the
simplest example of the multilayered ones. For such networks injectively admissible sequence has only one
component, α1 , which, by the earlier definition, is not smaller than the injectivity capacity of a single ReLU
layer. Keeping this in mind and proceeding in a similar fashion as in [82], we rely on [63,64,70,74,76] and the
random feasibility problems (rfps) considerations established therein to observe that for l = 2 the condition
under the probability in (8) is directly related to the following feasibility optimization problem

F (A1:2 , α1 , α2 ) : find x
subject to A(1) x = z
A(2) max(z, 0) = t
k max(t, 0)k0 < 2n. (11)

To see the connection, we first note that if the above problem is infeasible then ȳ(2) in (4) must have at
least 2n nonzero components. Under the assumption that α1 is injectively admissible (and under the non-
degenerative assumption stated a bit earlier) the ȳ(1) ’s number of the nonzeros is at least n = m0 . This then
implies that ȳ(2) has at least n nonzeros.
Assume that there are two outputs of the first layer, say ȳ(1,1) and ȳ(1,2) that both generate ȳ(2) . Let
the sets of indices of ȳ(1,1) and ȳ(1,2) nonzero components be S1 and S2 , respectively. Similarly, let the set
of indices of ȳ(2) ’s nonzero components be S0 . Set

So = S1 ∩ S2
Sd 1 = S1 \ S2
Sd 2 = S2 \ S1 . (12)

For a lack of injectivity, there must exist an x ∈ Rn such that


(1,1) (1)
ȳS1 = AS1 ,: x̄
(1,2) (1)
ȳS2 = AS2 ,: x
(2) (1) (2) (1,1) (2) (1,2) (2) (1)
AS0 ,S1 AS1 ,: x̄ = AS0 ,S1 ȳS1 = AS0 ,S2 ȳS2 = AS0 ,S2 AS1 ,: x. (13)

The last equality in (13) further gives that the following condition must also be met to ensure the lack of
injectivity
(2) (1) (2) (1)
AS0 ,S1 AS1 ,: x̄ − AS0 ,S2 AS1 ,: x = 0
i 

h
(2) (1) (2) (1) (2) (1) (2) (1)
⇐⇒ AS0 ,So ASo ,: + AS0 ,Sd ASd ,: −AS0 ,So ASo ,: − AS0 ,Sd ASd ,: = 0. (14)
1 1 2 2 x

Under the non-degenerative assumption stated a bit earlier and keeping in mind that the cardinalities of S0 ,
S1 , and S2 satisfy

2n ≤ |S0 | ≤ min(|S1 |, |S2 |), (15)

one has that


h i
(2) (1) (2) (1) (2) (1) (2) (1)
rank AS0 ,So ASo ,: + AS0 ,Sd ASd ,: −AS0 ,So ASo ,: − AS0 ,Sd ASd ,: ≥ 2n, (16)
1 1 2 2

 

which contradicts existence of ∈ R2n that satisfies (14) and therefore the presumed nonexistence of
x
injectivity. The likelihood of having non-degenerative assumption in place over continuous A(1) and A(2)
that we will consider later on (namely the standard normal ones) is trivially zero (in other words, it is a
possible, but improbable event). We then say that infeasibility of (11) implies a typical strong injectivity

6
or for brevity just “strong injectivity”. From a practical point of view, a weaker notion might be of even
greater interest. Such a notion assumes that for a given generative x̄ there is no x such that (13) holds.
The non-degenerative assumption together with the above reasoning then easily gives that the infeasibility
of (11) with a factor 2 removed implies the “weak injectivity’ ’.
To make writing easier we set

f(s) , k max(t, 0)k0 − 2n


f(w) , k max(t, 0)k0 − n
(
f(s) , for strong injectivity
finj , (17)
f(w) , for weak injectivity.

When needed for the concreteness in the derivations below, we specialize to the weak case and take finj =
f(w) . Very minimal modifications of our final results will then automatically be applicable to the strong case
as well. One should also note that there is no distinction between the above two notions of injectivity in
1-layer networks.
Returning back to (11) and noting the scaling invariance of the optimization therein, one can, without
loss of generality, assume the unit sphere restriction of x. Moreover, following further the trends set in
[63,64,70,74,76], one can also introduce fa (x, z, t) : Rm0 +m1 +m2 → R as an artificial objective and transform
the (random) feasibility problem (rfp) given in (11) into the following (random) optimization problem (rop)

2-extended ℓ0 spherical perceptron min fa (x, z, t)


x,z,t

subject to A(1) x = z
A(2) max(z, 0) = t
finj < 0
kxk2 = 1. (18)

Any optimization problem is actually solvable only if it is feasible. Under the feasibility assumption, (18)
can then be rewritten as
 
(2) T
ξf (fa ) = min max fa (x, z, t) + y(1) A(1) x + y(2)
T
A(2) max(z, 0) − y(1)
T T
z − y(2) t , (19)
kxk2 =1,finj <0 y(i) ∈Yi

where Yi = Rm1 . Since fa (x, z, t) is an artificial object, one can specialize back to fa (x, z, t) = 0 and write
 
(2) T
ξf (0) = min max y(1) A(1) x + y(2)
T
A(2) max(z, 0) − y(1)
T T
z − y(2) t . (20)
kxk2 =1,finj <0 y(i) ∈Yi

From (20), one can now easily see the main point behind the connection to rfps. Namely, the existence of a
triplet x, z, t such that kxk2 = 1, finj < 0 and A(1) x = z, A(2) max(z, 0) = t i.e., such that (11) is feasible,
(2)
ensures that (20)’s inner maximization can do no better than make ξf (0) = 0. If such a triplet does not
exist, then at least one of the equalities in A(1) x = z or A(2) max(z, 0) = t is violated which allows the inner
(2) (2) (2)
maximization to trivially achieve ξf (0) = ∞. Since ξf (0) = ∞ and ξf (0) > 0 are no different from the
feasibility point of view, the underlying optimization in (20) can be viewed as y(1)) , y(2)) scaling invariant
for all practical purposes. Such an invariance allows restriction to ky(1) k2 = 1 and ky(2) k2 = √1n which in
(2)
return guarantees boundedness of ξf (0). Having all of this in mind, one recognizes the importance of
 
(2) T
ξReLU = min max y(1) A(1) x + y(2)
T
A(2) max(z, 0) − y(1)
T T
z − y(2) t , (21)
kxk2 =1,finj <0 ky(1) k2 =1,|y(2) k2 = √1n

(2)
for the analytical characterization of (11). In fact, it is the sign of ξReLU (i.e., of the objective in (21)) that
(2)
determines (11)’s feasibility. If ξReLU > 0 then (11) is infeasible and the network is (typically) injective. A

7
combination with (11) allows then for the following typical rewriting of (7)
(inj)
αReLU (α1 ) , max{α2 | lim PA1:2 (∄x 6= x̄ such that f¯nn (x̄, A1:2 ) = f¯nn (x, A1:2 )) = 1}
n→∞
= max{α2 | lim PA1:2 (F (A1:2 , α1 , α2 ) is feasible) −→ 1}
n→∞
 
(2)
= max{α2 | lim PA1:2 ξReLU (0) > 0 −→ 1}. (22)
n→∞

To handle (21) (and ultimately (22)) we rely on Random duality theory (RDT) developed in a long series of
work [59–62, 69]. This is shown next.

(inj)
3.1 Upper-bounding αReLU (α1 ) via Random Duality Theory (RDT)
We start by providing a brief summary of the main RDT principles and then continue by showing, step-by-
step, how each of them relates to the problems of our interest here.

Summary of the RDT’s main principles [60, 69]

1) Finding underlying optimization algebraic representation 2) Determining the random dual


3) Handling the random dual 4) Double-checking strong random duality.

To make the presentation neater, we formalize all key results (simple and more complicated ones) as
lemmas or theorems.
1) Algebraic injectivity representation: The above considerations established a convenient connection
between injectivity capacity and feasibility problems. The main points are summarized in the following
lemma.
Lemma 1. Consider a sequence of positive integers, n = m0 , m1 , m2 , high-dimensional linear regime, corre-
sponding expansion coefficients α1 , α2 , and assume that α1 is injectively admissible. Assume a 2-layer ReLU
NN with architecture A1:2 = [A(1) , A(2) ] (the rows of matrix A(i) ∈ Rmi ×mi−1 , i = 1, 2, being the weights of
the nodes in the i-th layer). The network is typically injective if

frp (A1:2 ) > 0, (23)

where
1 
T (1) T (2) T T

frp (A1:2 ) , √ min max y A x + y A max(z, 0) − y z − y t , (24)
n kxk2 =1,finj <0 ky(1) k2 =1,|y(2) k2 = √1n (1) (2) (1) (2)

and

f(s) , k max(t, 0)k0 − 2n


f(w) , k max(t, 0)k0 − n
(
f(s) , for strong injectivity
finj , (25)
f(w) , for weak injectivity.

Proof. Follows immediately from the discussion presented in the previous section.

2) Determining the random dual: As is typical within the RDT, the so-called concentration of measure
property is utilized as well. This basically means that for any fixed ǫ > 0, we have (see, e.g. [60, 62, 69])

|frp (A1:2 ) − EA1:2 (frp (A1:2 ))|


 
lim PA1:2 > ǫ −→ 0.
n→∞ EA1:2 (frp (A1:2 ))

8
The following, so-called random dual theorem, is a key ingredient of the RDT machinery.
Theorem 1. Assume the setup of Lemma 2. Let the elements of A(i) ∈ Rmi ×mi−1 , g(i) ∈ Rmi ×1 , and
h(i) ∈ Rmi−1 ×1 be iid standard normals. Set

G(2) , [g(1) , g(2) , h(1) , h(2) ]


 
T (1) T (1) T (2) 1 T (2) T T
φ(x, z, t, y(1) , y(2) ) , y(1) g + x h + k max(z, 0)k2 y(2) g + √ max(z, 0) h − y(1) z − y(2) t
n
1
frd (G(2) ) , √ min max φ(x, z, t, y(1) , y(2) )
n kxk2 =1,finj <0 ky(1) k2 =1,|y(2) k2 = √1n
φ0 , lim EG(2) frd (G(2) ). (26)
n→∞

One then has


   
(φ0 > 0) =⇒ lim PG(2) (frd (G(2) ) > 0) −→ 1 =⇒ lim PA1:2 (frp (A1:2 ) > 0) −→ 1
n→∞ n→∞
 
=⇒ lim PA1:2 (2-layer ReLU NN with architecture A1:2 is typically injective) −→ 1 . (27)
n→∞

The injectivity is strong for finj = f(s) and weak for finj = f(w) .
Proof. Follows immediately as a direct 2-fold application of the Gordon’s probabilistic comparison theorem
(see, e.g., Theorem B in [29]). Gordon’s theorem is a special case of the results obtained in [71, 72] (see
Theorem 1, Corollary 1, and Section 2.7.2 in [72] as well as Theorem 1, Corollary 1, and Section 2.3.2
in [71]).  
T
In particular, term y(1) g(1) + xT h(1) − y(1)
T
z corresponds to the lower-bounding side of the Gordon’s
 
inequality related to A(1) . On the other hand, term k max(z, 0)k2 y(2) T
g(2) + √1n max(z, 0)T h(2) − y(2)
T
t
corresponds to the lower-bounding side related to A(2) . Given that (24) contains the summation of the
corresponding two terms from the other side of the inequality, the proof is completed.

2) Handling the random dual: To handle the above random dual we follow the methodologies invented
and presented in a series of papers [59–62, 69]. After solving the optimizations over x and y(i) , we find from
(26)
 
1 1 1
frd (G(2) ) = √ min kg(1) − zk2 − kh(1) k2 + √ kk max(z, 0)k2 g(2) − tk2 + √ max(z, 0)T h(2) . (28)
n finj <0 n n

The above can be rewritten as


 
1 1 1
frd (G(2) ) = √ min kg(1) − zk2 − kh(1) k2 + √ krg(2) − tk2 + √ max(z, 0)T h(2)
n r,z,t n n
subject to finj < 0
k max(z, 0)k2 = r. (29)

As mentioned earlier, taking for concreteness finj = f(w) and writing the Lagrangian we further have

1
frd (G(2) ) = √ min max L(ν, γ), (30)
n r,z,t ν,γ

where
 
(1) (1) 1 (2) 1 T (2)
L(ν, γ) = kg − zk2 − kh k2 + √ krg − tk2 + √ max(z, 0) h
n n
+νk max(t, 0)k0 − νn + γk max(z, 0)k22 − γr2 . (31)

9
One then utilizes the square root trick (introduced on numerous occasions in [65, 68, 70]) to obtain

kg(1) − zk22 1 krg(2) − tk22


 
(1) 1 T (2)
L(ν, γ) = min γ̄1 + − kh k2 + γ̄2 + + √ max(z, 0) h
γ̄1 ,γ̄2 4γ̄1 n 4γ̄2 n
+νk max(t, 0)k0 − νn + γk max(t, 0)k22 − γr2
  2  2 
(1) (2)
m
X i1 g − zi 1X
m 2 rg i − ti 1 X
m 1
(2) 
= min γ̄1 + − kh(1) k2 + γ̄2 + +√ max(zi , 0)T hi 

γ̄1 ,γ̄2
i=1
4γ̄ 1 n i=1
4γ̄ 2 n i=1

m2 m1
νX X
+ (1 + sign(ti )) − νn + γ max(zi , 0)2 − γr2 . (32)
2 i=1 i=1
√ √
After appropriate scaling γ̄i → γ̄i n, γ → √γn , r → r n, and cosmetic change ν
2 → √ν ,
n
concentrations,
statistical identicalness over i, and a combination of (30) and (32) give

φ0 , lim EG(2) frd (G(2) ) = EG(2) min max L1 (ν, γ), (33)
n→∞ r,γ̄1 ,γ̄2 ,zi ,ti ν,γ

where
 2 
(1)
g
 i − zi
(2)
L1 (ν, γ) = γ̄1 + α1  + max(zi , 0)T hi + γ max(zi , 0)2  − 1

4γ̄1
  2 
2 (2) ti 4γ̄2 ν ti
 
r gi − r + r 2 sign r
 − ν(2 − α2 ) − γr2 .

+γ̄2 + α2  (34)
 4γ̄2 

The Lagrangian duality also gives

φ0 , lim EG(2) frd (G(2) ) = EG(2) min max L1 (ν, γ) ≥ EG(2) min max min L1 (ν, γ). (35)
n→∞ r,γ̄1 ,γ̄2 ,zi ,ti ν,γ r,γ̄1 ,γ̄2 ν,γ zi ,ti

After setting
 2 
(1)
 gi − zi
(2)
fq,1 , EG(2) max  + max(zi , 0)T hi + γ max(zi , 0)2 

zi 4γ̄1
 2  !
(2) ti 4γ̄2 ν ti
fq,2 , EG(2) max gi − + 2 sign , (36)
ti r r r

one notes that a quantity structurally identical to fq,2 was already handled in [82]. Namely, after a change
of variables
4γ̄2 ν
ν1 → , (37)
r2
one can follow [82], set

ā = 2ν1 , (38)

and
2 !
e−ā. /2 ā 1


f¯x = − √ + erfc √
2π 2 2

10
1 ν1
f¯21 = − −
2 2 √ 
ν 1 2ν1
f¯22 ¯
= fx + erfc √
2 2
 √ 
1 1 2ν1
f¯23 = −ν1 − erfc √
2 2 2
f¯2 ¯ ¯ ¯
= f21 + f22 + f23 , (39)

and after solving the integrals obtain

fq,2 = (1 + f¯2 ). (40)

On the other hand, after setting


 2 
(1)
 gi − zi (2)
f¯q,1 , max  + max(zi , 0)T hi + γ max(zi , 0)2  (41)

zi 4γ̄1

and solving the optimization over zi one finds


!
(1) 2
  
gi (1) (2)
(max(gi −2hi γ̄1 ,0)).2 (1)
min 0, 4γ̄ − , if gi ≤ 0


1 4γ̄1 (1+4γ γ̄1 )
¯
fq,1 =  2 (42)
(1) (1) (2)
 gi (max(gi −2hi γ̄1 ,0)).2


4γ̄1 − 4γ̄1 (1+4γ γ̄1 ) , otherwise.

(1)
For gi > 0 one sets
(1)
gI
Ā =
2γ̄1
γ̄1
B̄ =
(1 + 4γγ̄1 )
C̄ = Ā
C̄.2
!
e− 2 (2Ā − C̄)
   
1 2 C̄
I11 = B̄ (Ā + 1) erf √ +1 + √
2 2 2π
(1)
(gi ).2
fˆq,1 = − I11 . (43)
4γ̄1
(1)
On the other hand, for gi ≤ 0 one sets
(1)
gI
Ā =
2γ̄1
γ̄1
B̄ =
(1 + 4γγ̄1 )
 
1
q
(1) (1)
C̄ = gI − | − (gI )2 |(1 + 4γγ̄1 )
2γ̄1
C̄.2
!
e− 2 (2Ā − C̄)
   
1 2 C̄
I11 = B̄ (Ā + 1) erf √ +1 + √
2 2 2π
(1)
(gi )2
fˆq,2 = − I11 . (44)
4γ̄1

11
After solving the remaining integrals one then obtains

fq,1 = EG(2) f¯q,1


2 2
(gi(1) ) +(h(2)
i )
e−
Z
2
(1) (2)
= f¯q,1
√ 2 dgi dhi
(1) (2)
gi ,hi 2π
2 2
(g(1) )
− i2
(g(1) )
− i2
e e
Z Z
(1) (1)
= fˆq,1 √ dgi + fˆq,1 √ dgi . (45)
(1)
gi >0 2π (1)
gi ≤0 2π

A combination of (34), (35), (36), (40), and (45) then gives

r2 fq,2
 
φ0 ≥ EG(2) min max min L1 (ν, γ) = min max γ̄1 + α1 fq,1 − γr2 + γ̄2 + α2 − ν(2 − α2 ) − 1 , (46)
r,γ̄1 ,γ̄2 ν,γ zi ,ti r,γ̄1 ,γ̄2 ν,γ 4γ̄2

where fq,1 and fq,2 are given in (45) and (40), respectively. An upper bound on the injectivity capacity is
then obtained for α2 such that φ0 = 0. Numerical evaluations produce the concrete parameters values given
in Table 2. One now observes that the expansion of the second layer, 8.267/6.7004 ≈ 1.2338, is much smaller

Table 2: RDT parameters; 2-layer ReLU NN weak injectivity capacity ; n → ∞;


(inj)
RDT parameters α1 r γ̄1 γ̄2 ν γ αReLU (α1 )
RDT parameters values 6.7004 1.7697 0.8935 0.9642 0.5560 0.3078 8.267

than of the first one, 6.7004/1 = 6.7004 .


The above weak injectivity capacity can easily be complemented by the corresponding strong one. The
only difference is that instead of characterization in (46) we now have

r2 fq,2
 
2
φ0 ≥ EG(2) min max min L1 (ν, γ) = min max γ̄ 1 + α f
1 q,1 − γr + γ̄ 2 + α2 − ν(4 − α2 ) − 1 , (47)
r,γ̄1 ,γ̄2 ν,γ zi ,ti r,γ̄1 ,γ̄2 ν,γ 4γ̄2

After solving the optimization in (47), we obtain the results shown in Table 3. The expansion of the second

Table 3: RDT parameters; 2-layer ReLU NN strong injectivity capacity ; n → ∞;


(inj)
RDT parameters α1 r γ̄1 γ̄2 ν γ αReLU (α1 )
RDT parameters values 6.7004 1.7708 0.9647 0.8938 0.3954 0.3077 12.35

layer, 12.35/6.7004 ≈ 1.8432, is now larger than in the weak case but still much smaller than of the first
layer.
4) Double checking the strong random duality: The last step of the RDT machinery assumes double
checking the strong random duality. As the underlying problems do not allow for a deterministic strong
duality the corresponding reversal considerations from [69] are not applicable and the strong random duality
is not in place. This effectively implies that the presented results are strict injectivity capacity upper bounds.

4 Multi-layer ReLU NN
In this section we show how to translate the above results for 2-layer NNs to the corresponding ones related
to multi-layer NNs. We start with 3-layer NNs and once we establish such a translation, the move to any
number of layers, l, will be automatic.

12
4.1 3-layer ReLU NN
To facilitate the ensuing presentation we try to parallel the derivations from Section 3. At the same time,
we proceed in a much faster fashion avoiding unnecessary repetitions and instead prioritizing showing key
differences. For 3-layer ReLU nets injectively admissible sequences have two components, α1 , α2 . As stated
earlier, the first element of the sequence, α1 , is not smaller than the injectivity capacity of a single ReLU
layer. On the other hand the second one is not smaller than the injectivity capacity of a 2-layer net. A generic
admissible sequence α1 , α2 is considered throughout the derivations in this section while the specializations
are deferred and made later on when we discuss concrete numerical capacity evaluations.
Having all of the above in mind, we proceed following the path traced in Section 3 and observe that for
l = 3 the condition under the probability in (8) is related to the following feasibility optimization problem
(basically a 3-layer analogue to (11))

F (A1:3 , α1 , α2 , α3 ) : find x
subject to A(1) x = z
A(2) max(z, 0) = t
A(3) max(t, 0) = t(1)
finj < 0, (48)

where as in (17)

f(s) = k max(t(1) , 0)k0 − 2n


f(w) = k max(t(1) , 0)k0 − n
(
f(s) , for strong injectivity
finj = (49)
f(w) , for weak injectivity.

After introducing an artificial objective fa (x, z, t, t(1) ) : Rm0 +m1 +m2 +m3 → R and restricting x to the unit
sphere, we have the following 3-layer analogue to (18)

3-extended ℓ0 spherical perceptron min fa (x, z, t)


x,z,t

subject to A(1) x = z
A(2) max(z, 0) = t
A(3) max(t, 0) = t(1)
finj < 0
kxk2 = 1. (50)

Paralleling further the derivation between (18) and (22), we obtain


(inj)
αReLU (α1 , α2 ) , max{α3 | lim PA1:3 (∄x 6= x̄ such that f¯nn (x̄, A1:3 ) = f¯nn (x, A1:3 )) = 1}
n→∞
= max{α3 | lim PA1:3 (F (A1:3 , α1 , α2 , α3 ) is feasible) −→ 1}
n→∞
 
(3)
= max{α3 | lim PA1:3 ξReLU (0) > 0 −→ 1}, (51)
n→∞

where
(3)
ξReLU = min max φ(3)
rp , (52)
1
kxk2 =1,finj <0 ky(1) k2 =1,ky(2) k2 = √1 ,ky(3) k2 = n
n

where
 
φ(3) T
rp = y(1) A
(1) T
x + y(2) A(2) max(z, 0) + y(3)
T
A(3) max(t, 0) − y(1)
T T
z − y(2) T (1)
t − y(3) t . (53)

13
Following further the trend of Section 3 we utilize the Random duality theory (RDT) to handle (52) and
(53) (and ultimately (51)).

(inj)
4.1.1 Upper-bounding αReLU (α1 , α2 ) via Random Duality Theory (RDT)
We below show how each of the four main RDT principles is implemented. All key results are again framed
as lemmas and theorems.
1) Algebraic injectivity representation: We start with the following 3-layer analogue to Lemma 2.
Lemma 2. Consider a sequence of positive integers, m0 , m1 , m2 , m3 (n = m0 ), high-dimensional linear
regime, corresponding expansion coefficients α1 , α2 , α3 , and assume that α1 .α2 is an injectively admissible
sequence. Assume a 3-layer ReLU NN with architecture A1:3 = [A(1) , A(2) , A(3) ] (the rows of matrix A(i) ∈
Rmi ×mi−1 , i = 1, 3, being the weights of the nodes in the i-th layer). The network is typically injective if

frp (A1:3 ) > 0, (54)

where
1
frp (A1:2 ) = √ min max φ(3) , (55)
n kxk2 =1,finj <0 ky(1) k2 =1,ky(2) k2 = √1n ,ky(3) k2 = n1 rp

 
φ(3) T
rp = y(1) A
(1) T
x + y(2) A(2) max(z, 0) + y(3)
T
A(3) max(t, 0) − y(1)
T T
z − y(2) T (1)
t − y(3) t , (56)

and

f(s) = k max(t(1) , 0)k0 − 2n


f(w) = k max(t(1) , 0)k0 − n
(
f(s) , for strong injectivity
finj = (57)
f(w) , for weak injectivity.

Proof. Follows as an automatic consequence of the discussion presented in the previous section.

2) Determining the random dual: We again utilize the concentration of measure phenomenon and the
following 3-layer random dual analogue of Theorem 1.
Theorem 2. Assume the setup of Lemma 2. Let the elements of A(i) ∈ Rmi ×mi−1 , g(i) ∈ Rmi ×1 , and
h(i) ∈ Rmi−1 ×1 be iid standard normals. Set

G(3) = [g(1) , g(2) , g(3) , h(1) , h(2) , h(3) ]


(3) T 1
φrd = y(1) g(1) + xT h(1) + k max(z, 0)k2 y(2) T
g(2) + √ max(z, 0)T h(2)
n
T 1
+k max(t, 0)k2 y(3) g(3) + max(t, 0)T h(3) − y(1) T T
z − y(2) T (1)
t − y(3) t
n
1 (3)
frd (G(3) ) = √ min max φ
n kxk2 =1,finj <0 ky(1) k2 =1,ky(2) k2 = √1n ,ky(3) k2 = n1 rd
φ0 = lim EG(3) frd (G(3) ). (58)
n→∞

One then has


   
(φ0 > 0) =⇒ lim PG(3) (frd (G(3) ) > 0) −→ 1 =⇒ lim PG(3) (frp (G(3) ) > 0) −→ 1
n→∞ n→∞
 
=⇒ lim PA1:3 (3-layer ReLU NN with architecture A1:3 is typically injective) −→ 1 . (59)
n→∞

14
The injectivity is strong for finj = f(s) and weak for finj = f(w) .
Proof. Follows in exactly the same way as the proof of Theorem 1 through a direct 3-fold application of the
Gordon’s probabilistic comparison theorem (see, e.g., Theorem B in [29] as well as Theorem 1, Corollary 1,
and Section 2.7.2 in [72] and Theorem 1, Corollary 1, and Section 2.3.2 in [71]). In particular, in addition to
T
terms (y(1) g(1) + xT h(1) − y(1)
T T
z) and (k max(z, 0)k2 y(2) g(2) + √1n max(z, 0)T h(2) − y(2)
T
t) corresponding to
the lower-bounding side related to A(1) and A(2) , the term (k max(z, 0)k2 y(2)
T
g(2) + n1 max(z, 0)T h(2) − y(2)
T
t)
(3)
corresponds to the lower-bounding side related to A . As (55) contains the summation of the corresponding
three terms from the other side of the inequality, the proof is completed.

3) Handling the random dual: The methodologies from [59–62,69] are again utilized to handle the above
random dual. Optimizing over x and y(i) one first transforms the optimization in (58) into

1 1
frd (G(3) ) = √ min kg(1) − zk2 − kh(1) k2 + √ kk max(z, 0)k2 g(2) − tk2
n inj
f <0 n
!
1 (3) (1) 1 T (2) 1 T (3)
+ kk max(t, 0)k2 g − t k2 + √ max(z, 0) h + max(t, 0) h ) . (60)
n n n

The above can then be rewritten as


1 (3)
frd (G(3) ) = √ min φrd,1
n r,r2 ,x,z,t,t(1)
subject to finj < 0
k max(z, 0)k2 = r
k max(t, 0)k2 = r2 , (61)

where
(3) 1 1
φrd,1 = kg(1) − zk2 − kh(1) k2 + √ krg(2) − tk2 + kr2 g(2) − tk2
n n
1 1
+ √ max(z, 0)T h(2) + max(t, 0)T h(3) . (62)
n n

As earlier, taking for concreteness finj = f(w) and writing the Lagrangian gives

1
frd (G(3) ) = √ min max L(ν, γ), (63)
n r,r2 ,z,t,t(1) ν,γ

where
1 1 1
L(ν, γ, γ2 ) = kg(1) − zk2 − kh(1) k2 + √ krg(2) − tk2 + kr2 g(2) − tk2 + √ max(z, 0)T h(2)
n n n
1
+ max(t, 0)T h(3) + νk max(t(1) , 0)k0 − νn + γk max(z, 0)k22 − γr2 + γ2 k max(t, 0)k22 − γ2 r22 .
n
(64)

After utilizing the square root trick one first finds


!
kg(1) − zk22 1 krg(2) − tk22 1 kr2 g(3) − t(1) k22
L(ν, γ, γ2 ) = min γ̄1 + + γ̄2 + + γ̄3 + 2
γ̄1 ,γ̄2 4γ̄1 n 4γ̄2 n 4γ̄3
1 1
−kh(1) k2 + √ max(z, 0)T h(2) + max(t, 0)T h(3) + νk max(t(1) , 0)k0 − νn
n n
+γk max(z, 0)k22 − γr2 + γ2 k max(t, 0)k22 − γ2 r22

15
m1 m2 m3
!
(1) (2) (3) (1)
X kg − zi k2
i 2 1X krgi − ti k22 1 X kr2 gi − ti k22
= min γ̄1 + + γ̄2 + + γ̄3 + 2
γ̄1 ,γ̄2
i=1
4γ̄1 n i=1 4γ̄2 n i=1 4γ̄3
1 m 2 m
1 X (2) 1X (3)
−kh(1) k2 + √ max(zi , 0)T hi + max(ti , 0)T hi
n i=1 n i=1
m2 m1 m2
νX (1)
X X
+ (1 + sign(ti )) − νn + γ max(zi , 0)2 − γr2 + γ2 max(ti , 0)2 − γ2 r22 . (65)
2 i=1 i=1 i=1
√ √ √
Appropriate scaling γ̄1/3 → γ̄1/3 n, γ → √γn ,γ2 → γn2 , r → r n, r → rn, and cosmetic changes γ̄2 → γ̄r2 n,
γ2 → r√γ2n3 , and ν2 → √νn , concentrations and statistical identicalness over i allow one to arrive at the
following analogues of (34) and 35
 2 
(1)
 ig − zi
(2)
L1 (ν, γ, γ2 ) = γ̄1 + α1  + max(zi , 0)T hi + γ max(zi , 0)2 

4γ̄1
  2 
(2) ti
 gi − r

ti
T 
ti
2
(3)
+r γ̄2 + α2  + max ,0 hi + γ2 max ,0
 
4γ̄2 r r


  2  ! 
(1) (1)
2 (3) ti 4γ̄3 ν ti
 r2 gi − r2 + r22
sign r2 2

 − ν(2 − α3 ) − γr2 − γ2 r2 − 1,
 
+γ̄3 + α3 

 4γ̄3 
 r

(66)

and

φ0 = lim EG(3) frd (G(3) ) = EG(3) min max L1 (ν, γ, γ2 ) ≥ EG(3) min max min L1 (ν, γ, γ2 ).
n→∞ r,γ̄1 ,γ̄2 ,zi ,ti ,t(1) ν,γ r,γ̄1 ,γ̄2 ,γ̄3 ν,γ,γ2 z ,t ,t(1)
i i i

(67)
Recalling on fq,1 and fq,2 from (36), one recognizes that fq,1 appears as factor multiplying α1 and α2 whereas
r2
fq,2 appears as factor multiplying α3 4γ̄23 . This is then sufficient to immediately write the following analogue
to (46)

φ0 ≥ EG(3) min max min L1 (ν, γ, γ2 )


r,r2 ,γ̄1 ,γ̄2 ,γ̄3 ν,γ,γ2 z ,t ,t(1)
i i i

r22 r22 fq,2


   
2
= min max γ̄1 + α1 fq,1 − γr + r γ̄2 + α2 fq,1 − γ 2 + γ̄3 + α3 − ν(2 − α3 ) − 1 ,
r,r2 ,γ̄1 ,γ̄2 ,γ̄3 ν,γ,γ2 r 4γ̄3
(68)

where fq,1 and fq,2 are as in (45) and (40), respectively. An upper bound on the 3-layer net injectivity
capacity is then obtained for α3 such that φ0 = 0. After setting
       
α2 r γ̄ γ
α(0) = , r(0) = , γ̄ (0) = 1 , γ (0) = , (69)
α2 r2 γ̄2 γ2

and conducting numerical evaluations one obtains the concrete parameters values given in Table 4. One
observes that the expansion of the third layer, 9.49/8.267 ≈ 1.1479, is even smaller than of the second one,
≈ 1.2338 .
To complement the above weak injectivity capacity with the corresponding strong one, one utilizes,

16
Table 4: RDT parameters; 3-layer ReLU NN weak injectivity capacity ; n → ∞;
(inj)
parameters α(0) r(0) γ̄ (0) γ̄3 ν γ (0) αReLU (α1 , α2 )
" # " # " # " #
6.7004 1.75 0.8830 0.3128
values 2.344125 1.1620 9.49
8.267 3.73 1.1224 0.2952

instead of characterization in (68), the following

φ0 ≥ EG(3) min max min L1 (ν, γ, γ2 )


r,r2 ,γ̄1 ,γ̄2 ,γ̄3 ν,γ,γ2 z ,t ,t(1)
i i i

r2 r2 fq,2
   
= min max γ̄1 + α1 fq,1 − γr2 + r γ̄2 + α2 fq,1 − γ 22 + γ̄3 + α3 2 − ν(4 − α3 ) − 1 .
r,r2 ,γ̄1 ,γ̄2 ,γ̄3 ν,γ,γ2 r 4γ̄3
(70)

Solving the optimization in (70) gives the numerical values shown in Table 5. The expansion of the third

Table 5: RDT parameters; 3-layer ReLU NN strong injectivity capacity ; n → ∞;


(inj)
parameters α(0) r(0) γ̄ (0) γ̄3 ν γ (0) αReLU (α1 , α2 )
" # " # " # " #
6.7004 1.76 0.8870 0.3101
values 5.7610 1.5965 17.13
12.35 7.2 2.1721 0.1955

layer, 17.13/12.35 ≈ 1.3870, is again larger than in the weak case but still smaller than the expansion of the
second layer, 1.8432.
4) Double checking the strong random duality: As was the case for 2-layer nets, the underlying prob-
lems do not allow for a deterministic strong duality and the corresponding reversal considerations from [69]
can not be applied which implies the absence of strong random duality and an upper-bounding nature of the
above results.

4.2 l-layer ReLU NN


It is now not that difficult to extend the above considerations to general l-layer depth. Setting r(0) = 1,
(0) (0)
αl = αl , γ̄l = γ̄l , and analogously to (69)
       
α1 r γ̄1 γ
 α2   r2   γ̄2   γ2 
α(0) =  .  , r(0) =  .  , γ̄ (0) =  .  , γ (0) =  .  , (71)
       
 ..   ..   ..   .. 
αl−1 rl−1 γ̄l−1 γl−1

one can write


(i) (0) (0
fq,1 = fq,i (γ̄i , γi )), 1 ≤ i ≤ l − 1
(l) (0) (0
fq,2 = fq,2 (γ̄l , rl−1 )), (72)

17
and analogously to (68) and (70)
   2   2 
(0) (0) (l)
l−1
X (0)  (0) ri rl−1 f q,2
(0) (i) (0) (0) (0) (0)
φ0 ≥ min max  ri−1 γ̄i + αi fq,1 − γi  2  + γ̄l + αl − ν(2 − αl ) − 1 ,
 
(0) (0) (0)
r (0) ,γ̄1 ν,γ i=1
(0)
ri−1 4γ̄ l

and
   2   2 
(0) (0) (l)
l−1 ri rl−1 fq,2
(0)  (0) (0) (i) (0) (0) (0) (0)
X
φ0 ≥ min max  ri−1 γ̄i + αi fq,1 − γi  2  + γ̄l + αl − ν(4 − αl ) − 1 .
  
(0) ν,γ (0) (0)
r (0) ,γ̄1 i=1
(0)
ri−1 4γ̄l

After conducting numerical evaluations for the fourth layer one obtains the concrete parameters values given
in Table 6. One observes that the decreasing expansion trend continues. For the forth layer, the above

Table 6: RDT parameters; 4-layer ReLU NN weak injectivity capacity ; n → ∞;


(inj)
parameters α(0) r(0) γ̄ (0) γ̄3 ν γ (0) αReLU (α1 , α2 )
       
6.7004 1.73 0.8751 0.3184
values  8.267  3, 68 1.1205 4.485862 2.0769 0.2960 10.124
       

9.49 6.7 0.9983 0.3683

results give 10.124/9.49 ≈ 1.0668 which is smaller than the expansion of the third layer 1.1479, Table 7
shows systematically how the weak injectivity capacity and layers expansions change as the depth of the
network increases. One observes a rapid onset of an expansion saturation effect. After only 4 layers, hardly
any further expansion is needed. Interestingly, a similar phenomenon is empirically observed within deep
learning compressed sensing context, where after significantly expanded first layer much smaller expansions
are used in higher layers (see, e.g., [11]). Of course, overthere one he to be extra careful since, in addition to
the injectivity, the architecture must have excellent generalizabiity properties as well.

Table 7: Deep ReLU NN weak injectivity capacity and layers expansions ; mi – # of nodes in the i-th layer

l (# of layers) 1 2 3 4
Injectivity capacity (upper bound)
  6.7004 8.267 9.49 10.124
αl = limn→∞ mnl = limm0 →∞ m
ml
0

Layer’s expansion

ml αl
 6.7004 1.2338 1.1479 1.0668
ζl = limn→∞ ml−1 = αl−1

5 Lifted RDT
As we mentioned earlier, the strong random duality is not in place and the above capacity characterizations
are not only bounds but also expected to be strict upper bounds. In other words, one expects that they can
be further lowered. The recent development of fully lifted (fl) RDT [73,77,78] allows to precisely evaluate by
how much the above given upper bounds can be lowered. However, full implementation of the fl RDT heavily
relies on a sizeable set of numerical evaluations. Since the plain RDT already requires a strong numerical
effort, we find it practically beneficial to consider a bit less accurate but way more convenient partially lifted
(pl) RDT variant [65, 66, 68, 70, 79].

18
(inj)
5.1 Lowering upper bounds on αReLU (α1 ) via pl RDT
As was the case earlier when we considered the plain RDT, we again start with 2-layer nets and later extend
the results to multilayered ones. The pl RDT relies on the same principles as the plain RDT with the
exception that one now deals with the partially lifted random dual introduced in the following theorem
(basically a partially lifted analogue to Theorem 1).
Theorem 3. Assume the setup of Theorem 1 with the elements of A(i) ∈ Rmi ×mi−1 , g(i) ∈ Rmi ×1 , and
h(i) ∈ Rmi−1 ×1 being iid standard normals, c3 > 0 and

G(2) , [g(1) , g(2) , h(1) , h(2) ]


 
T
φ(x, z, t, y(1) , y(2) ) , y(1) g(1) + xT h(1) + k max(z, 0)k2 y(2)
T
g(2) + max(z, 0)T h(2) − y(1)
T T
z − y(2) t
f¯rd (G(2) ) , min max φ(x, z, t, y(1) , y(2) )
r,kxk2 =1,k max(z,0)k2 =r,finj <0 kyi k2 =1
 
1 c3 c3 2 1 
−c3 f¯rd (G(2) )
φ̄0 , min lim √ + r − log EG(2) e . (73)
r>0 n→∞ n 2 2n c3

One then has


 
(φ̄0 > 0) =⇒ lim PA1:2 (frp (A1:2 ) > 0) −→ 1
n→∞
 
=⇒ lim PA1:2 (2-layer ReLU NN with architecture A1:2 is typically injective) −→ 1 . (74)
n→∞

The injectivity is strong for finj = f(s) and weak for finj = f(w) .
Proof. For any fixed r it follows automatically as a 2-fold application of Corollary 3 from [71] (see Section
T
3.2.1 and equation (86); see also Lemma 2 and equation (57) in [66]). For example, terms y(1) g(1) , xT h(1) ,
T c3 (1)
y(1) z, and 2 correspond to the lower-bounding side of equation (86) in [71] related to A , whereas terms
T
k max(z, 0)k2 y(2) g(2) , max(z, 0)T h(2) , y(2)
T
t, and c23 r2 are their A(2) related analogous. Given that the left
hand side of (86) corresponds to frp and that the minimization over r accounts for the least favorable choice
the proof is completed.
To handle the above partially lifted random dual we rely on results from previous sections. In particular,
repeating with minimal adjustments the derivations between (28) and (32) one arrives at

f¯rd (G(2) ) = min max L(ν, γ), (75)


r,z,t ν,γ

where
  2  2 
(1) (2)
m1 gi − zi 1
m2 rgi − ti 1
m1
(2) 
X X X
L(ν, γ) = min γ̄1 + − kh(1) k2 + γ̄2 + √ +√ max(zi , 0)T hi 

γ̄1 ,γ̄2
i=1
4γ̄1 n i=1
4γ̄2 n i=1

m2 m1
νX X
+ (1 + sign(ti )) − νn + γ max(zi , 0)2 − γr2
2 i=1 i=1
 2  2
(1) (2)
m1 gi − zi 1 X
m2 rgi − ti 1 X
m1
(2)
X
= min γ̄1 + + γ̄2 + √ +√ max(zi , 0)T hi
γ̄1 ,γ̄2 ,γsph
i=1
4γ̄ 1 n i=1
4γ̄ 2 n i=1
Pn  (1) 2 !
m2 m1
νX X i=1 hi
+ (1 + sign(ti )) − νn + γ max(zi , 0)2 − γr2 − γsph − . (76)
2 i=1 i=1
4γsph
√ √ √ √
After appropriate scaling c3 → c3 n, γ̄i → γ̄i n, γsph → γsph n, r → r n, γ → √γn , and cosmetic change
ν √ν , concentrations, statistical identicalness over i, Lagrangian duality, and a combination of (75) and
2 → n

19
(76) give
 
1 c3 c3 2 1 
−c3 f¯rd (G(2) )
φ̄0 = min lim √ + r − log EG(2) e
r>0 n→∞ n 2 2n c3
c3 c3 α1 
¯

≥ min max + r2 + γ̄1 − log EG(2) e−c3 fq,1 − γr2 + γ̄2 − ν(2 − α2 )
r>0,γ̄1 ,γ̄2 ,γsph ν,γ 2 2 c3
 2
!
α2   1 c
(h(1)
i )
¯
− log EG(2) e−c3 fq,2 − γsph −
3
log EG(2) e 4γsph  , (77)
c3 c3

where
 2 
(1)
 gi − zi
(2)
f¯q,1 = max  + max(zi , 0)T hi + γ max(zi , 0)2 

zi 4γ̄1
 2  !
(2) ti 4γ̄2 ν ti
f¯q,2 = max gi − + 2 sign . (78)
ti r r r

Changing variables
4γ̄2 ν
ν1 → , (79)
r2
and recalling on (42) and equation (44) in [82], we further write
!
(1) 2
  
gi (1) (2)
(max(gi −2hi γ̄1 ,0)).2 (1)
min 0, 4γ̄ − , if gi ≤0


1 4γ̄1 (1+4γ γ̄1 )
f¯q,1 = (80)
(1) 2
 
(1) (2)
 gi (max(gi −2hi γ̄1 ,0)).2


4γ̄1 − 4γ̄1 (1+4γ γ̄1 ) , otherwise,

and

  2
(2) (2)


 − g i − ν1 , if gi ≤ 0
 √
f¯q,2 = −ν1 , (2)
if 0 ≤ gi ≤ 2ν1 (81)
   2
− g(2) + ν1 , otherwise.


i

After setting
s
√ 2c3 r2
ā1 = 2ν1 1+
4γ̄2
2
r
c3 4γ̄ ν1
e 2
f¯21
+
=
2
r
−c3 4γ̄ ν1
2 √ 
e 2 2ν1
f¯22
+
= erfc √
2 2
r2
c3 4γ̄ ν1   
e 2 1 1 ā1
f¯23
+
= q − erfc √ , (82)
1 + 2c3 r
2 2 2 2
4γ̄2

20
and solving the integrals one obtains
(lif t) ¯
fq,2 (γ̄2 , r) = EG(2) e−c3 fq,2 = f¯21
+
+ f¯22
+
+ f¯23
+
. (83)

(1)
On the other hand, after setting for gi >0
(1)
gI
Ā+ =
2γ̄1
c3 γ̄1
B̄ + =
(1 + 4γγ̄1 )
C̄ + = Ā+ !
+ + + +
1 B̄ + + 2 (2 B̄ Ā + (1 − 2 B̄ )C̄ )
e 1−2B̄+ ( ) erfc −
+ Ā
I11 = √ p
2 1 − 2B̄ + 2(1 − 2B̄ + )
(1) 2
C̄ +
   
(g
i
)
1
fˆq,1
+
= e−c3 4γ̄1 1 − erfc − √ +
+ I11 . (84)
2 2
(1)
and for gi ≤0
(1)
gI
Ā+ =
2γ̄1
c3 γ̄1
B̄ + =
(1 + 4γγ̄1 )
 
1
q
(1) (1) 2
C̄ + = gI − | − (gI ) |(1 + 4γγ̄1 )
2γ̄1
!
+ + + +
1 B̄ +
(Ā+ ) erfc − (2B̄ Āp + (1 − 2B̄ )C̄ )
2
+
I11 = √ e 1−2B̄ +

2 1 − 2B̄ + 2(1 − 2B̄ + )


(1) 2
(g )
−c3 i
+
I12 = e I+ 4γ̄1

11 + 
+ 1 C̄
I13 = 1 − erfc − √
2 2
fˆq,1
+
= +
I12 +
+ I13 , (85)

solving remaining integrals gives


2 2
(gi(1) )

(g(1) )
− i2
e e
Z Z
¯ 2
(lif t) (1) (1)
fq,1 (c3 , γ̄1 , γ) = EG(2) e−c3 fq,1 = fˆq,1 √
+
dgi + fˆq,2 √
+
dgi . (86)
(1)
gi >0 2π (1)
gi ≤0 2π
(1)
After solving the integral over hi and optimizing over γsph one first finds (see, e.g., [66])

c3 + c3 + 4
γ̂sph = , (87)
4
and then rewrites (77) as

c3 c3 α1 
(lif t)

φ̄0 ≥ min max + r2 + γ̄1 − log fq,1 − γr2 + γ̄2 − ν(2 − α2 )
r>0,γ̄1 ,γ̄2 ν,γ 2 2 c3
 !
α2 
(lif t)
 1 c3
− log fq,2 − γ̂sph + log 1 − , (88)
c3 2c3 2γ̂sph

21
(lif t) (lif t)
with fq,1 , fq,2 , and γ̂sph as in (83), (86)), and (87), respectively. As (88) holds for any c3 , maximization
of the right hand side over c3 also gives

c3 c3 α1 
(lif t)

φ̄0 ≥ max min max + r2 + γ̄1 − log fq,1 − γr2 + γ̄2 − ν(2 − α2 )
c3 >0 r>0,γ̄1 ,γ̄2 ν,γ 2 2 c3
 !
α2 
(lif t)
 1 c2
− log fq,2 − γ̂sph + log 1 − . (89)
c3 2c3 2γ̂sph

Limiting value of α2 for which one has φ̄0 = 0 gives then an upper bound on the injectivity capacity. The
concrete parameters values obtained through numerical evaluations are given in Table 8. The plain RDT
results are shown in parallel as well. The upper bound is indeed lowered. To get the corresponding strong

Table 8: Lifted RDT parameters; 2-layer ReLU NN weak injectivity capacity ; n → ∞;


(inj)
RDT parameters α1 r γ̄1 γ̄2 ν γ c3 αReLU (α1 )
Plain RDT 6.7004 1.7697 0.8935 0.9642 0.5560 0.3078 →0 8.267
Lifted RDT 6.7004 1.7931 0.8810 0.9053 0.5504 0.3361 0.1091 8.264

injectivity capacity bound, instead of (88) we utilize

c3 c3 α1 
(lif t)

φ̄0 ≥ max min max + r2 + γ̄1 − log fq,1 − γr2 + γ̄2 − ν(4 − α2 )
c3 >0 r>0,γ̄1 ,γ̄2 ν,γ 2 2 c3
 !
α2 
(lif t)
 1 c2
− log fq,2 − γ̂sph + log 1 − . (90)
c3 2c3 2γ̂sph

The obtained results together with the corresponding plain RDT ones are shown in Table 9. We again
observe a lowering of the plain RDT upper bound. In fact, the lowering is now a bit more pronounced than
in the weak case.

Table 9: Lifted RDT parameters; 2-layer ReLU NN strong injectivity capacity; n → ∞

RDT parameters | α1     | r      | γ̄1    | γ̄2    | ν      | γ      | c3     | α_ReLU^(inj)(α1)
Plain RDT      | 6.7004 | 1.7708 | 0.9647 | 0.8938 | 0.3954 | 0.3077 | → 0    | 12.350
Lifted RDT     | 6.7004 | 1.9060 | 0.7862 | 0.5707 | 0.3561 | 0.5728 | 0.8315 | 12.183
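The limiting-\alpha_2 procedure behind Tables 8 and 9 can be mechanized by a simple bisection on the sign of \bar{\phi}_0. The sketch below assumes a user-supplied routine phi0_bar(alpha2) that evaluates the fully optimized right-hand side of (89) (or of (90) for the strong variant); that routine, its single sign change on the bracketing interval, and the interval itself are our assumptions rather than part of the derivation.

\begin{verbatim}
def alpha2_upper_bound(phi0_bar, lo=1.0, hi=20.0, tol=1e-6):
    """Bisection for the limiting alpha_2 at which phi0_bar(alpha_2) = 0.

    phi0_bar : callable returning the optimized right-hand side of (89)/(90)
               for a given alpha_2; assumed to change sign exactly once on [lo, hi].
    """
    f_lo = phi0_bar(lo)
    assert f_lo * phi0_bar(hi) < 0, "a sign change of phi0_bar must be bracketed"
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        f_mid = phi0_bar(mid)
        if f_lo * f_mid <= 0:
            hi = mid
        else:
            lo, f_lo = mid, f_mid
    return 0.5 * (lo + hi)
\end{verbatim}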

5.2 Lifted RDT – l-layer ReLU NN


The above 2-layer considerations extend to general l-layer depth precisely in the same manner as the corre-
sponding plain RDT ones. After setting r_0^{(0)} = 1, \alpha_l^{(0)} = \alpha_l, \bar{\gamma}_l^{(0)} = \bar{\gamma}_l, we recall (71)
       
\alpha^{(0)} = \begin{bmatrix} \alpha_1 \\ \alpha_2 \\ \vdots \\ \alpha_{l-1} \end{bmatrix}, \quad r^{(0)} = \begin{bmatrix} r \\ r_2 \\ \vdots \\ r_{l-1} \end{bmatrix}, \quad \bar{\gamma}^{(0)} = \begin{bmatrix} \bar{\gamma}_1 \\ \bar{\gamma}_2 \\ \vdots \\ \bar{\gamma}_{l-1} \end{bmatrix}, \quad \gamma^{(0)} = \begin{bmatrix} \gamma \\ \gamma_2 \\ \vdots \\ \gamma_{l-1} \end{bmatrix},    (91)

and analogously to (72) write

f_{q,1}^{(lift,i)} = f_{q,1}\left(c_3 r_{i-1}^{(0)}, \bar{\gamma}_i^{(0)}, \gamma_i^{(0)}\right), \quad 1 \le i \le l-1

f_{q,2}^{(lift,l)} = f_{q,2}\left(c_3 r_{l-1}^{(0)}, \bar{\gamma}_l^{(0)}, r_{l-1}^{(0)}\right).    (92)

It is then straightforward to mimic the mechanism utilized to obtain (73) and combine it with (89) to write
the l-layer analogue of (89)

\bar{\phi}_0^{(l)} \ge \max_{c_3>0} \min_{r^{(0)},\bar{\gamma}^{(0)}} \max_{\nu,\gamma^{(0)}} \Bigg( \sum_{i=1}^{l-1}\left( \bar{\gamma}_i^{(0)} r_{i-1}^{(0)} + \frac{\alpha_i^{(0)}}{c_3}\log f_{q,1}^{(lift,i)} - \gamma_i^{(0)}\left(\frac{r_i^{(0)}}{r_{i-1}^{(0)}}\right)^2 \right) + \bar{\gamma}_l^{(0)} + \frac{\alpha_l^{(0)}}{c_3}\log f_{q,2}^{(lift,l)}

    - \nu\left(2-\alpha_l^{(0)}\right) - \hat{\gamma}_{sph} + \frac{1}{2c_3}\log\left(1 - \frac{c_3}{2\hat{\gamma}_{sph}}\right) \Bigg).    (93)
2c3 2γ̂sph

The strong injectivity complement is then easily obtained as

\bar{\phi}_0^{(l)} \ge \max_{c_3>0} \min_{r^{(0)},\bar{\gamma}^{(0)}} \max_{\nu,\gamma^{(0)}} \Bigg( \sum_{i=1}^{l-1}\left( \bar{\gamma}_i^{(0)} r_{i-1}^{(0)} + \frac{\alpha_i^{(0)}}{c_3}\log f_{q,1}^{(lift,i)} - \gamma_i^{(0)}\left(\frac{r_i^{(0)}}{r_{i-1}^{(0)}}\right)^2 \right) + \bar{\gamma}_l^{(0)} + \frac{\alpha_l^{(0)}}{c_3}\log f_{q,2}^{(lift,l)}

    - \nu\left(4-\alpha_l^{(0)}\right) - \hat{\gamma}_{sph} + \frac{1}{2c_3}\log\left(1 - \frac{c_3}{2\hat{\gamma}_{sph}}\right) \Bigg).    (94)
2c3 2γ̂sph

After conducting numerical evaluations we did not find substantial improvements beyond the second layer.
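To make the bookkeeping in (91)-(93) concrete, the following sketch assembles the inner l-layer objective from per-layer parameter vectors. The callables f_q1_lift and f_q2_lift are assumed to follow the three-argument signature of (92) (they are not literally the two-layer helpers sketched earlier), the outer max-min-max optimization is omitted, and the sign and normalization conventions follow our reconstruction of (93)-(94).

\begin{verbatim}
import numpy as np


def l_layer_objective(alpha, r, gamma_bar, gamma, nu, c3, gamma_hat_sph,
                      f_q1_lift, f_q2_lift, strong=False):
    """Inner objective of (93) (weak) or (94) (strong) for an l-layer network.

    alpha, gamma_bar : length-l sequences (alpha_1..alpha_l, gamma_bar_1..gamma_bar_l)
    r, gamma         : length-(l-1) sequences (r_1..r_{l-1}, gamma_1..gamma_{l-1})
    r_0 is fixed to 1, as in the text preceding (91).
    """
    l = len(alpha)
    r_full = np.concatenate(([1.0], np.asarray(r, dtype=float)))  # prepend r_0 = 1
    total = 0.0
    for i in range(1, l):  # layers 1, ..., l-1, cf. (92)
        total += (gamma_bar[i - 1] * r_full[i - 1]
                  + alpha[i - 1] / c3
                  * np.log(f_q1_lift(c3 * r_full[i - 1], gamma_bar[i - 1], gamma[i - 1]))
                  - gamma[i - 1] * (r_full[i] / r_full[i - 1]) ** 2)
    # last-layer terms, cf. (92)-(93)
    total += (gamma_bar[l - 1]
              + alpha[l - 1] / c3
              * np.log(f_q2_lift(c3 * r_full[l - 1], gamma_bar[l - 1], r_full[l - 1])))
    total += -nu * ((4 if strong else 2) - alpha[l - 1])
    total += -gamma_hat_sph + np.log(1 - c3 / (2 * gamma_hat_sph)) / (2 * c3)
    return total
\end{verbatim}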

6 Conclusion
We studied the injectivity capacity of deep ReLU networks. For 1-layer networks, studying the injectivity
capacity was recently shown [82] to be equivalent to studying the capacity of the so-called ℓ0 spherical per-
ceptron. Here we show that a similar concept applies to multilayered deep ReLU networks. Namely,
the injectivity of an l-layer deep ReLU NN is uncovered to directly relate to l-extended ℓ0 spherical perceptrons.
Utilizing Random duality theory (RDT) we then create a generic program for the statistical performance analysis
of such perceptrons and, consequently, of the injectivity properties of deep ReLU nets. As a result we obtain
upper bounds on the injectivity capacity for any number of network layers and any number of nodes in any
of the layers.
Two notions of injectivity capacity, weak and strong, are introduced and concrete numerical values for
both of them are obtained. We observe that the relative network expansion per layer (the increase from
the number of a layer's inputs to the number of its outputs) needed for injectivity decreases with every newly
added layer. Moreover, this decrease is more pronounced for the practically more relevant weak injectivity.
In particular, we obtain that 4-layer deep nets already come close to reaching the maximally needed expansion,
which is ∼ 10.
As the results obtained through plain RDT are upper bounds, we implemented a partially lifted (pl)
RDT variant to lower them. To do so, we first created a generic analytical adaptation of the plain RDT
program and then conducted the related numerical evaluations. For 2-layer nets we found that the partially lifted
RDT is indeed capable of decreasing the plain RDT upper bounds.
Since the developed methodologies are fairly generic, many extensions and generalizations are also possible.
Beyond the naturally expected fully lifted (fl) RDT implementations, these include, for example, studying stability
properties, injectivity sensitivity with respect to various imperfections (noisy gates, missing/incomplete portions
of the training data, not fully adequately or optimally trained networks, etc.), and many others. Given
the importance of injectivity within the deep compressed sensing paradigm, quantifying networks' generalizability
properties as well as the complexity of the training and recovery algorithms related to such contexts
is of great interest as well. The mechanisms presented here can be used for any of these tasks. As the
associated technicalities are problem specific, we discuss them in separate papers.

References
[1] E. Abbe, S. Li, and A. Sly. Proof of the contiguity conjecture and lognormal limit for the symmetric
perceptron. In 62nd IEEE Annual Symposium on Foundations of Computer Science, FOCS 2021,
Denver, CO, USA, February 7-10, 2022, pages 327–338. IEEE, 2021.
[2] E. Abbe, S. Li, and A. Sly. Binary perceptron: efficient algorithms can find solutions in a rare well-
connected cluster. In STOC ’22: 54th Annual ACM SIGACT Symposium on Theory of Computing,
Rome, Italy, June 20 - 24, 2022, pages 860–873. ACM, 2022.
[3] R. Alweiss, Y. P. Liu, and M. Sawhney. Discrepancy minimization via a self-balancing walk. In Proc.
53rd STOC, ACM, pages 14–20, 2021.
[4] D. Amelunxen, M. Lotz, M. McCoy, and J. Tropp. Living on the edge: phase transitions in convex
programs with random data. Information and Inference: A Journal of the IMA, 3(3):224–294, 2014.
[5] B. L. Annesi, E. M. Malatesta, and F. Zamponi. Exact full-RSB SAT/UNSAT transition in infinitely
wide two-layer neural networks. 2023. available online at http://arxiv.org/abs/2410.06717.
[6] S. R. Arridge, P. Maass, O. Öktem, and C. B. Schönlieb. Solving inverse problems using data-driven
models. Acta Numer., 28:1–174, 2019.
[7] B. Aubin, W. Perkins, and L. Zdeborova. Storage capacity in symmetric binary perceptrons. J. Phys.
A, 52(29):294003, 2019.
[8] C. Baldassi, E. M. Malatesta, and R. Zecchina. Properties of the geometry of solutions and capacity of
multilayer neural networks with rectified linear unit activations. Phys. Rev. Lett., 123:170602, October
2019.
[9] P. Baldi and S. Venkatesh. Number of stable points for spin-glasses and neural networks of higher
orders. Phys. Rev. Letters, 58(9):913–916, Mar. 1987.
[10] E. Barkai, D. Hansel, and H. Sompolinsky. Broken symmetries in multilayered perceptrons. Phys. Rev.
A, 45(6):4146, March 1992.
[11] A. Bora, A. Jalal, E. Price, and A. G. Dimakis. Compressed sensing using generative models. In
Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW,
Australia, 6-11 August 2017, volume 70 of Proceedings of Machine Learning Research, pages 537–546.
PMLR, 2017.

[12] S. H. Cameron. Tech-report 60-600. Proceedings of the bionics symposium, pages 197–212, 1960. Wright
air development division, Dayton, Ohio.
[13] C. Clum. Topics in the Mathematics of Data Science. PhD thesis, The Ohio State University, 2022.
[14] T. Cover. Geometrical and statistical properties of systems of linear inequalities with applications in
pattern recognition. IEEE Transactions on Electronic Computers, (EC-14):326–334, 1965.
[15] C. Daskalakis, D. Rohatgi, and E. Zampetakis. Constant-expansion suffices for compressed sensing with
generative priors. In Advances in Neural Information Processing Systems 33: Annual Conference on
Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, 2020.
[16] M. Dhar, A. Grover, and S. Ermon. Modeling sparse deviations for compressed sensing using gener-
ative models. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018,
Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, volume 80 of Proceedings of Machine Learning
Research, pages 1222–1231. PMLR, 2018.
[17] D. Donoho. High-dimensional centrally symmetric polytopes with neighborliness proportional to dimen-
sion. Disc. Comput. Geometry, 35(4):617–652, 2006.

[18] A. Engel, H. M. Kohler, F. Tschepke, H. Vollmayr, and A. Zippelius. Storage capacity and learning
algorithms for two-layer neural networks. Phys. Rev. A, 45(10):7590, May 1992.
[19] J. B. Estrach, A. Szlam, and Y. LeCun. Signal recovery from pooling representations. In Proceedings of
the 31th International Conference on Machine Learning, ICML 2014, Beijing, China, 21-26 June 2014,
volume 32 of JMLR Workshop and Conference Proceedings, pages 307–315. JMLR.org, 2014.
[20] M. Fazlyab, A. Robey, H. Hassani, M. Morari, and G. J. Pappas. Efficient and accurate estimation of
lipschitz constants for deep neural networks. In Advances in Neural Information Processing Systems 32:
Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14,
2019, Vancouver, BC, Canada, pages 11423–11434, 2019.
[21] Alyson K. Fletcher, Sundeep Rangan, and Philip Schniter. Inference in deep networks in high dimen-
sions. In 2018 IEEE International Symposium on Information Theory, ISIT 2018, Vail, CO, USA, June
17-22, 2018, pages 1884–1888. IEEE, 2018.
[22] S. Franz, G. Parisi, M. Sevelev, P. Urbani, and F. Zamponi. Universality of the SAT-UNSAT (jamming)
threshold in non-convex continuous constraint satisfaction problems. SciPost Physics, 2:019, 2017.
[23] T. Furuya, M. Puthawala, M. Lassas, and M. V. de Hoop. Globally injective and bijective neural
operators. In Advances in Neural Information Processing Systems 36: Annual Conference on Neural
Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023,
2023.
[24] R. M. Durbin and G. J. Mitchison. Bounds on the learning capacity of some multi-layer networks. Biological
Cybernetics, 60:345–365, 1989.
[25] David Gamarnik, Eren C. Kizildag, Will Perkins, and Changji Xu. Algorithms and barriers in the
symmetric binary perceptron model. In 63rd IEEE Annual Symposium on Foundations of Computer
Science, FOCS 2022, Denver, CO, USA, October 31 - November 3, 2022, pages 576–587. IEEE, 2022.
[26] E. Gardner. The space of interactions in neural networks models. J. Phys. A: Math. Gen., 21:257–270,
1988.
[27] E. Gardner and B. Derrida. Optimal storage properties of neural networks models. J. Phys. A: Math.
Gen., 21:271–284, 1988.
[28] Y. Gordon. Some inequalities for Gaussian processes and applications. Israel Journal of Mathematics,
50(4):265–289, 1985.
[29] Y. Gordon. On Milman’s inequality and random subspaces which escape through a mesh in Rn . Geo-
metric Aspects of Functional Analysis, Isr. Semin. 1986-87, Lect. Notes Math, 1317, 1988.
[30] H. Gouk, E. Frank, B. Pfahringer, and M. J. Cree. Regularisation of neural networks by enforcing
lipschitz continuity. Mach. Learn., 110(2):393–416, 2021.
[31] H. Gutfreund and Y. Stein. Capacity of neural networks with discrete synaptic couplings. J. Physics
A: Math. Gen, 23:2613, 1990.
[32] P. Hand, O. Leong, and V. Voroninski. Phase retrieval under a generative prior. In Advances in Neural
Information Processing Systems 31: Annual Conference on Neural Information Processing Systems
2018, NeurIPS 2018, December 3-8, 2018, Montréal, Canada, pages 9154–9164, 2018.
[33] H. He, C.-K. Wen, and S. Jin. Generalized expectation consistent signal recovery for nonlinear measure-
ments. In 2017 IEEE International Symposium on Information Theory, ISIT 2017, Aachen, Germany,
June 25-30, 2017, pages 2333–2337. IEEE, 2017.
[34] R. Heckel, W. Huang, P. Hand, and V. Voroninski. Deep denoising: Rate-optimal recovery of structured
signals with a deep prior. 2018. available online at http://arxiv.org/abs/1805.08855.

[35] C. Hegde, M. B. Wakin, and R. G. Baraniuk. Random projections for manifold learning. In Advances
in Neural Information Processing Systems 20, Proceedings of the Twenty-First Annual Conference on
Neural Information Processing Systems, Vancouver, British Columbia, Canada, December 3-6, 2007,
pages 641–648. Curran Associates, Inc., 2007.
[36] W. Huang, P. Hand, R. Heckel, and V. Voroninski. A provably convergent scheme for compressive sensing
under random generative priors. 2018. available online at http://arxiv.org/abs/1812.04176.
[37] M. Jordan and A. G. Dimakis. Exactly computing the local lipschitz constant of relu networks. In
Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information
Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, 2020.
[38] R. D. Joseph. The number of orthants in n-space intersected by an s-dimensional subspace. Tech.
memo 8, project PARA, 1960. Cornell Aeronautical Lab., Buffalo, N.Y.
[39] K. Kothari, A. Khorashadizadeh, M. V. de Hoop, and I. Dokmanic. Trumpets: Injective flows for
inference and inverse problems. In Proceedings of the Thirty-Seventh Conference on Uncertainty in
Artificial Intelligence, UAI 2021, Virtual Event, 27-30 July 2021, volume 161 of Proceedings of Machine
Learning Research, pages 1269–1278. AUAI Press, 2021.
[40] W. Krauth and M. Mezard. Storage capacity of memory networks with binary couplings. J. Phys.
France, 50:3057–3066, 1989.
[41] Q. Lei, A. Jalal, I. S. Dhillon, and A. G. Dimakis. Inverting deep generative models, one layer at a time.
In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information
Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada, pages 13910–
13919, 2019.
[42] C. Louart, Z. Liao, and R. Couillet. A random matrix approach to neural networks. CoRR,
abs/1702.05419, 2017.
[43] A. Maillard, A. S. Bandeira, D. Belius, I. Dokmanic, and S. Nakajima. Injectivity of relu networks:
perspectives from statistical physics. 2023. available online at http://arxiv.org/abs/2302.14112.
[44] M. Mardani, Q. Sun, D. L. Donoho, V. Papyan, H. Monajemi, S. Vasanawala, and J. M. Pauly. Neural
proximal gradient descent for compressive imaging. In Advances in Neural Information Processing Sys-
tems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, December
3-8, 2018, Montréal, Canada, pages 9596–9606, 2018.
[45] D. Paleka. Injectivity of ReLU neural networks at initialization. Master thesis, ETH Zurich, 2021.
[46] J. Pennington and P. Worah. Nonlinear random matrix theory for deep learning. In Advances in Neural
Information Processing Systems 30: Annual Conference on Neural Information Processing Systems
2017, December 4-9, 2017, Long Beach, CA, USA, pages 2637–2646, 2017.
[47] W. Perkins and C. Xu. Frozen 1-RSB structure of the symmetric Ising perceptron. STOC 2021:
Proceedings of the 53rd Annual ACM SIGACT Symposium on Theory of Computing, pages 1579–1588,
2021.
[48] M. Puthawala, K. Kothari, M. Lassas, I. Dokmanic, and M. V. de Hoop. Globally injective relu networks.
J. Mach. Learn. Res., 23:105:1–105:55, 2022.
[49] M. Puthawala, M. Lassas, I. Dokmanic, and M. V. de Hoop. Universal joint approximation of manifolds
and densities by simple injective flows. In International Conference on Machine Learning, ICML 2022,
17-23 July 2022, Baltimore, Maryland, USA, volume 162 of Proceedings of Machine Learning Research,
pages 17959–17983. PMLR, 2022.
[50] S. Rangan, P. Schniter, and A. K. Fletcher. Vector approximate message passing. In 2017 IEEE
International Symposium on Information Theory, ISIT 2017, Aachen, Germany, June 25-30, 2017,
pages 1588–1592. IEEE, 2017.

26
[51] Y. Romano, M. Elad, and P. Milanfar. The little engine that could: Regularization by denoising (RED).
SIAM J. Imaging Sci., 10(4):1804–1844, 2017.
[52] B. Leigh Ross and J. C. Cresswell. Tractable density estimation on learned manifolds with conformal
embedding flows. In Advances in Neural Information Processing Systems 34: Annual Conference on
Neural Information Processing Systems 2021, NeurIPS 2021, December 6-14, 2021, virtual, pages 26635–
26648, 2021.
[53] L. Schlafli. Gesammelte Mathematische Abhandlungen I. Basel, Switzerland: Verlag Birkhauser, 1950.
[54] R. Schneider and W. Weil. Stochastic and Integral Geometry. Springer-Verlag Berlin Heidelberg, 2008.
[55] P. Schniter, S. Rangan, and A. K. Fletcher. Vector approximate message passing for the generalized
linear model. In 50th Asilomar Conference on Signals, Systems and Computers, ACSSC 2016, Pacific
Grove, CA, USA, November 6-9, 2016, pages 1525–1529. IEEE, 2016.
[56] V. Shah and C. Hegde. Solving linear inverse problems using gan priors: An algorithm with provable
guarantees. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP
2018, Calgary, AB, Canada, April 15-20, 2018, pages 4609–4613. IEEE, 2018.
[57] M. Shcherbina and B. Tirozzi. On the volume of the intersection of a sphere with random half spaces.
C. R. Acad. Sci. Paris. Ser I, (334):803–806, 2002.
[58] M. Shcherbina and B. Tirozzi. Rigorous solution of the Gardner problem. Comm. on Math. Physics,
(234):383–422, 2003.
[59] M. Stojnic. Block-length dependent thresholds in block-sparse compressed sensing. available online at
http://arxiv.org/abs/0907.3679.
[60] M. Stojnic. Various thresholds for ℓ1 -optimization in compressed sensing. available online at http://
arxiv.org/abs/0907.3666.
[61] M. Stojnic. Block-length dependent thresholds for ℓ2 /ℓ1 -optimization in block-sparse compressed sens-
ing. ICASSP, IEEE International Conference on Acoustics, Signal and Speech Processing, pages 3918–
3921, 14-19 March 2010. Dallas, TX.
[62] M. Stojnic. ℓ1 optimization and its various thresholds in compressed sensing. ICASSP, IEEE Inter-
national Conference on Acoustics, Signal and Speech Processing, pages 3910–3913, 14-19 March 2010.
Dallas, TX.
[63] M. Stojnic. Another look at the Gardner problem. 2013. available online at http://arxiv.org/abs/
1306.3979.
[64] M. Stojnic. Discrete perceptrons. 2013. available online at http://arxiv.org/abs/1303.4375.
[65] M. Stojnic. Lifting ℓ1 -optimization strong and sectional thresholds. 2013. available online at http://
arxiv.org/abs/1306.3770.
[66] M. Stojnic. Lifting/lowering Hopfield models ground state energies. 2013. available online at http://
arxiv.org/abs/1306.3975.
[67] M. Stojnic. Meshes that trap random subspaces. 2013. available online at http://arxiv.org/abs/
1304.0003.
[68] M. Stojnic. Negative spherical perceptron. 2013. available online at http://arxiv.org/abs/1306.
3980.
[69] M. Stojnic. Regularly random duality. 2013. available online at http://arxiv.org/abs/1303.7295.
[70] M. Stojnic. Spherical perceptron as a storage memory with limited errors. 2013. available online at
http://arxiv.org/abs/1306.3809.

[71] M. Stojnic. Fully bilinear generic and lifted random processes comparisons. 2016. available online at
http://arxiv.org/abs/1612.08516.
[72] M. Stojnic. Generic and lifted probabilistic comparisons – max replaces minmax. 2016. available online
at http://arxiv.org/abs/1612.08506.
[73] M. Stojnic. Bilinearly indexed random processes – stationarization of fully lifted interpolation. 2023.
available online at http://arxiv.org/abs/2311.18097.
[74] M. Stojnic. Binary perceptrons capacity via fully lifted random duality theory. 2023. available online
at http://arxiv.org/abs/2312.00073.
[75] M. Stojnic. Capacity of the treelike sign perceptrons neural networks with one hidden layer – rdt based
upper bounds. 2023. available online at http://arxiv.org/abs/2312.08244.
[76] M. Stojnic. Fl rdt based ultimate lowering of the negative spherical perceptron capacity. 2023. available
online at http://arxiv.org/abs/2312.16531.
[77] M. Stojnic. Fully lifted interpolating comparisons of bilinearly indexed random processes. 2023. available
online at http://arxiv.org/abs/2311.18092.
[78] M. Stojnic. Fully lifted random duality theory. 2023. available online at http://arxiv.org/abs/2312.
00070.
[79] M. Stojnic. Lifted rdt based capacity analysis of the 1-hidden layer treelike sign perceptrons neural
networks. 2023. available online at http://arxiv.org/abs/2312.08257.
[80] M. Stojnic. Exact capacity of the wide hidden layer treelike neural networks with generic activations.
2024. available online at http://arxiv.org/abs/2402.05719.
[81] M. Stojnic. Fixed width treelike neural networks capacity analysis – generic activations. 2024. available
online at http://arxiv.org/abs/2402.05696.
[82] M. Stojnic. Injectivity capacity of relu gates. 2024. available online at http://arxiv.org/abs/2410.
20646.
[83] M. Talagrand. The Parisi formula. Annals of mathematics, 163(2):221–263, 2006.
[84] M. Talagrand. Mean field models and spin glasses: Volume I. A series of modern surveys in mathematics
54, Springer-Verlag, Berlin Heidelberg, 2011.
[85] S. Venkatesh. Epsilon capacity of neural networks. Proc. Conf. on Neural Networks for Computing,
Snowbird, UT, 1986.
[86] J. G. Wendel. A problem in geometric probability. Mathematica Scandinavica, 11:109–111, 1962.
[87] R. O. Winder. Single stage threshold logic. Switching circuit theory and logical design, pages 321–332,
Sep. 1961. AIEE Special publications S-134.
[88] R. O. Winder. Threshold logic. Ph. D. dissertation, Princeton University, 1962.
[89] Y. Wu, M. Rosca, and T. P. Lillicrap. Deep compressed sensing. In Proceedings of the 36th International
Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA, volume 97
of Proceedings of Machine Learning Research, pages 6850–6860. PMLR, 2019.
[90] J. A. Zavatone-Veth and C. Pehlevan. Activation function dependence of the storage capacity of treelike
neural networks. Phys. Rev. E, 103:L020301, February 2021.
