17-AAP1328
CONTENTS
1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1191
2. System model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1193
3. Main results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1196
3.1. Main technical results and training performance . . . . . . . . . . . . . . . . . . . . . 1196
3.2. Testing performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1199
3.3. Evaluation of Φ_AB . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1200
4. Practical outcomes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1202
4.1. Simulation results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1202
4.2. The underlying kernel . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1203
4.3. Limiting cases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1207
5. Proof of the main results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1209
5.1. Concentration results on Σ . . . . . . . . . . . . . . . . . . . . . . . . . . . 1209
5.2. Asymptotic equivalents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1219
5.2.1. First equivalent for E[Q] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1219
5.2.2. Second equivalent for E[Q] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1223
5.2.3. Asymptotic equivalent for E[QAQ], where A is either Φ or symmetric of
bounded norm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1225
5.3. Derivation of Ē_test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1230
with the aforementioned single hidden-layer random neural network known as ex-
treme learning machine. We show that, under mild conditions, both the training
Etrain and testing Etest mean-square errors, respectively, corresponding to the re-
gression errors on known input–output pairs (x1 , y1 ), . . . , (xT , yT ) (with xi ∈ Rp ,
yi ∈ Rd ) and unknown pairings (x̂1 , ŷ1 ), . . . , (x̂T̂ , ŷT̂ ), almost surely converge to
deterministic limiting values as n, p, T grow large at the same rate (while d is
kept constant) for every fixed ridge-regression parameter γ > 0. Simulations on
real image datasets are provided that corroborate our results.
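The setting described above can be sketched in a few lines of code. The data, dimensions, and the choice of a ReLU activation below are illustrative stand-ins, not the experimental setup of the article:

```python
import numpy as np

rng = np.random.default_rng(0)
p, n, T, d, gamma = 64, 128, 256, 1, 0.1

# Synthetic (x_i, y_i) pairs standing in for real input-output data.
X = rng.standard_normal((p, T))
Y = np.sign(rng.standard_normal((d, T)))

W = rng.standard_normal((n, p))       # random, untrained first layer
Sigma = np.maximum(W @ X, 0)          # sigma(W X), here with sigma(t) = max(t, 0)

# Ridge regression on the output layer only, with parameter gamma > 0.
beta = np.linalg.solve(Sigma @ Sigma.T / T + gamma * np.eye(n), Sigma @ Y.T / T)

E_train = np.linalg.norm(Y - beta.T @ Sigma) ** 2 / T  # training mean-square error
```

Only the output layer is trained; the first layer stays at its random draw, which is what makes the model amenable to the random matrix analysis that follows.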
These findings provide new insights into the roles played by the activation func-
tion σ (·) and the random distribution of the entries of W in random feature maps as
well as by the ridge-regression parameter γ in the neural network performance. We
notably exhibit and prove some peculiar behaviors, such as the impossibility for
the network to carry out elementary Gaussian mixture classification tasks, when
either the activation function or the random weights distribution are ill chosen.
Besides, for the practitioner, the theoretical formulas retrieved in this work al-
low for a fast offline tuning of the aforementioned hyperparameters of the neural
network, notably when T is not too large compared to p. The graphical results
provided in the course of the article were notably obtained with a 100- to
500-fold gain in computation time for the theoretical formulas over the simulations.
The remainder of the article is structured as follows: in Section 2, we introduce
the mathematical model of the system under investigation. Our main results are
then described and discussed in Section 3, the proofs of which are deferred to
Section 5. Section 4 discusses our main findings. The article closes on concluding
remarks on envisioned extensions of the present work in Section 6. The Appendix
provides some intermediary lemmas of constant use throughout the proof section.
Reproducibility: Python 3 codes used to produce the results of Section 4 are
available at https://github.com/Zhenyu-LIAO/RMT4ELM.
Notation: The norm ‖·‖ is understood as the Euclidean norm for vectors and
the operator norm for matrices, while the norm ‖·‖_F is the Frobenius norm for
matrices. All vectors in the article are understood as column vectors.
W = ϕ(W̃ )
This assumption holds for many of the activation functions traditionally con-
sidered in neural networks, such as sigmoid functions, the rectified linear unit
σ (t) = max(t, 0) or the absolute value operator.
When considering the interesting case of simultaneously large data and ran-
dom features (or neurons), we shall then make the following growth rate assump-
tions.
0 < lim inf_n min{p/n, T/n} ≤ lim sup_n max{p/n, T/n} < ∞
3. Main results.
Note that this lemma partially extends concentration of measure results involv-
ing quadratic forms [see, e.g., Rudelson, Vershynin et al. (2013), Theorem 1.1], to
nonlinear vectors.
With this result in place, the standard resolvent approaches of random matrix
theory apply, providing our main theoretical finding as follows.
∫ f dμ_n − ∫ f dμ̄_n → 0,

where μ̄_n is the measure defined through its Stieltjes transform m_{μ̄_n}(z) ≡ ∫ (t − z)^{−1} dμ̄_n(t) given, for z ∈ {w ∈ C, ℑ[w] > 0}, by

m_{μ̄_n}(z) = (1/T) tr((n/T) Φ/(1 + δ_z) − z I_T)^{−1}

with δ_z the unique solution in {w ∈ C, ℑ[w] > 0} of

δ_z = (1/T) tr Φ((n/T) Φ/(1 + δ_z) − z I_T)^{−1}.
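In practice, δ_z is obtained by fixed-point iteration. The sketch below, which assumes a synthetic positive semi-definite stand-in for Φ (the true Φ depends on σ and X), evaluates the fixed point at the real argument z = −γ < 0, where all quantities are real and positive:

```python
import numpy as np

rng = np.random.default_rng(1)
T, n, gamma = 200, 300, 0.5

B = rng.standard_normal((T, T))
Phi = B @ B.T / T                      # synthetic stand-in for the Gram matrix Phi

# Iterate delta = (1/T) tr Phi ((n/T) Phi/(1+delta) + gamma I_T)^(-1), i.e., z = -gamma
delta = 0.0
for _ in range(100):
    Q_bar = np.linalg.inv(n / T * Phi / (1 + delta) + gamma * np.eye(T))
    delta = np.trace(Phi @ Q_bar) / T

m = np.trace(Q_bar) / T                # Stieltjes transform of mu_bar_n at z = -gamma
```

The iteration converges geometrically since the map defining δ_z is monotone and contracting on (0, ∞), in line with the standard interference function argument invoked later in Section 5.2.2.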
Note that μ̄n has a well-known form, already met in early random matrix
works [e.g., Silverstein and Bai (1995)] on sample covariance matrix models. Notably, μ̄_n is also the deterministic equivalent of the empirical spectral measure of (1/T) P^T W^T W P for any deterministic matrix P ∈ R^{p×T} such that P^T P = Φ. As such, we refer to Section 4.2 for an example where inappropriate choices for the law of W lead to a failure of the network to fulfill the regression task.
For convenience in the following, letting δ and Φ be defined as in Theorem 1, we shall denote

(3)  Ψ = (n/T) Φ/(1 + δ).
Theorem 1 provides the central step in the evaluation of Etrain, for which not only E[Q] but also E[Q²] needs to be estimated. This last ingredient is provided in the following proposition.
Since Q̄ and Ψ share the same orthogonal eigenvector basis, it appears that Etrain depends on the alignment between the right singular vectors of Y and the eigenvectors of Ψ, with weighting coefficients

(γ²/(λ_i + γ)²) (1 + λ_i [(1/n) ∑_{j=1}^T λ_j(λ_j + γ)^{−2}] / [1 − (1/n) ∑_{j=1}^T λ_j²(λ_j + γ)^{−2}]),  1 ≤ i ≤ T,

where we denoted λ_i = λ_i(Ψ), 1 ≤ i ≤ T, the eigenvalues of Ψ [which depend on γ through λ_i(Ψ) = n/(T(1 + δ)) λ_i(Φ)]. If lim inf_n n/T > 1, it is easily seen that
δ → 0 as γ → 0, in which case Etrain → 0 almost surely. However, in the more
A RANDOM MATRIX APPROACH TO NEURAL NETWORKS 1199
interesting case in practice where lim supn n/T < 1, δ → ∞ as γ → 0 and Etrain
consequently does not have a simple limit (see Section 4.3 for more discussion on
this aspect).
Theorem 3 is also reminiscent of applied random matrix works on empirical
covariance matrix models, such as Bai and Silverstein (2007), Kammoun et al.
(2009), then further emphasizing the strong connection between the nonlinear matrix σ(W X) and its linear counterpart WΦ^{1/2}.
As a side note, observe that, to obtain Theorem 3, we could have used the fact that tr Y^T Y Q² = −(∂/∂γ) tr Y^T Y Q which, along with some analyticity arguments [for instance when extending the definition of Q = Q(γ) to Q(z), z ∈ C], would have directly ensured that ∂Q̄/∂γ is an asymptotic equivalent for −E[Q²], without the need
for the explicit derivation of Proposition 1. Nonetheless, as shall appear subse-
quently, Proposition 1 is also a proxy to the asymptotic analysis of Etest . Besides,
the technical proof of Proposition 1 quite interestingly showcases the strength of
the concentration of measure tools under study here.
While not immediate at first sight, one can confirm (using notably the relation Q̄Ψ + γ Q̄ = I_T) that, for (X̂, Ŷ) = (X, Y), Ētrain = Ētest, as expected.
In order to evaluate practically the results of Theorem 3 and Conjecture 1, it is a first step to be capable of estimating the values of Φ_AB for various σ(·) activation functions of practical interest. Such results, which call for completely different
mathematical tools (mostly based on integration tricks), are provided in the subse-
quent section.
3.3. Evaluation of Φ_AB. The evaluation of Φ_AB = E[σ(w^T A)^T σ(w^T B)] for arbitrary matrices A, B naturally boils down to the evaluation of its individual entries, and thus to the calculus, for arbitrary vectors a, b ∈ R^p, of

(4)  Φ_ab ≡ E[σ(w^T a)σ(w^T b)] = (2π)^{−p/2} ∫ σ(ϕ(w̃)^T a)σ(ϕ(w̃)^T b) e^{−‖w̃‖²/2} dw̃.
The evaluation of (4) can be obtained through various integration tricks for a wide family of mappings ϕ(·) and activation functions σ(·). The most popular activation functions in neural networks are sigmoid functions, such as σ(t) = erf(t) ≡ (2/√π) ∫₀^t e^{−u²} du, as well as the so-called rectified linear unit (ReLU) defined by σ(t) = max(t, 0), which has been recently popularized as a result of its robust behavior in deep neural networks. In physical artificial neural networks implemented
using light projections, σ (t) = |t| is the preferred choice. Note that all aforemen-
tioned functions are Lipschitz continuous and, therefore, in accordance with As-
sumption 2.
Despite their not abiding by the prescription of Assumptions 1 and 2, we be-
lieve that the results of this article could be extended to more general settings, as
discussed in Section 4. In particular, since the key ingredient in the proof of all our
results is that the vector σ (w T X) follows a concentration of measure phenomenon,
induced by the Gaussianity of w̃ [if w = ϕ(w̃)], the Lipschitz character of σ and
the norm boundedness of X, it is likely, although not necessarily simple to prove,
that σ (w T X) may still concentrate under relaxed assumptions. This is likely the
case for more generic vectors w than Nϕ (0, Ip ) as well as for a larger class of
A RANDOM MATRIX APPROACH TO NEURAL NETWORKS 1201
TABLE 1
Values of Φ_ab for w ∼ N(0, I_p), with ∠(a, b) ≡ a^T b/(‖a‖‖b‖)

σ(t)        Φ_ab
t           a^T b
max(t, 0)   (1/2π) ‖a‖‖b‖ (∠(a, b) acos(−∠(a, b)) + √(1 − ∠(a, b)²))
|t|         (2/π) ‖a‖‖b‖ (∠(a, b) asin(∠(a, b)) + √(1 − ∠(a, b)²))
erf(t)      (2/π) asin(2a^T b/√((1 + 2‖a‖²)(1 + 2‖b‖²)))
1_{t>0}     1/2 − (1/2π) acos(∠(a, b))
sign(t)     (2/π) asin(∠(a, b))
cos(t)      exp(−(‖a‖² + ‖b‖²)/2) cosh(a^T b)
sin(t)      exp(−(‖a‖² + ‖b‖²)/2) sinh(a^T b)
2 It is in particular not difficult to prove, based on our framework, that as n/T → ∞, a random
neural network composed of n/2 neurons with activation function σ (t) = cos(t) and n/2 neurons
with activation function σ (t) = sin(t) implements a Gaussian difference kernel.
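The closed-form entries of Table 1 can be checked by direct Monte Carlo integration over w ∼ N(0, I_p); the vectors a, b below are arbitrary choices for illustration, and only the max(t, 0) and sign(t) rows are checked here:

```python
import numpy as np

rng = np.random.default_rng(2)
p, N = 3, 2_000_000
a = np.array([1.0, 0.5, 0.0])
b = np.array([0.2, 1.0, 0.3])
W = rng.standard_normal((N, p))               # N independent draws of w ~ N(0, I_p)

na, nb = np.linalg.norm(a), np.linalg.norm(b)
ang = a @ b / (na * nb)                       # angle(a, b)

# sigma(t) = max(t, 0)
mc_relu = np.mean(np.maximum(W @ a, 0) * np.maximum(W @ b, 0))
th_relu = na * nb / (2 * np.pi) * (ang * np.arccos(-ang) + np.sqrt(1 - ang ** 2))

# sigma(t) = sign(t)
mc_sign = np.mean(np.sign(W @ a) * np.sign(W @ b))
th_sign = 2 / np.pi * np.arcsin(ang)
```

With two million samples, the Monte Carlo estimates agree with the table entries to within a few multiples of 10^{-3}.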
1202 C. LOUART, Z. LIAO AND R. COUILLET
FIG. 1. Neural network performance for Lipschitz continuous σ(·), Wij ∼ N(0, 1), as a function of γ, for 2-class MNIST data (sevens, nines), n = 512, T = T̂ = 1024, p = 784.
using the linked Python script were found to be 100 to 500 times faster to generate than the simulated network performances. Beyond their theoretical interest, the provided formulas therefore allow for an efficient offline tuning of the network hyperparameters, notably the choice of an appropriate value for the ridge-regression parameter γ.
4.2. The underlying kernel. Theorem 1 and the subsequent theoretical findings importantly reveal that the neural network performances are directly related to the Gram matrix Φ, which acts as a deterministic kernel on the dataset X. This
is in fact a well-known result found, for example, in Williams (1998) where it is
shown that, as n → ∞ alone, the neural network behaves as a mere kernel operator
(this observation is retrieved here in the subsequent Section 4.3). This remark was
then put at an advantage in Rahimi and Recht (2007) and subsequent works, where
random feature maps of the type x ↦ σ(W x) are proposed as a computationally efficient proxy to evaluate kernels (x, y) ↦ Φ(x, y).
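This random-feature viewpoint is easy to illustrate: in the spirit of the cos/sin remark in the footnote to Table 1, a map with half cos and half sin neurons sharing the same random weights realizes, entrywise, the Gaussian kernel e^{−‖x−y‖²/2}. The sketch below (dimensions illustrative) checks this numerically:

```python
import numpy as np

rng = np.random.default_rng(3)
p, n = 10, 200_000
x = 0.3 * rng.standard_normal(p)
y = 0.3 * rng.standard_normal(p)

W = rng.standard_normal((n, p))

def features(v):
    # n/2 cos neurons and n/2 sin neurons sharing the same random weights
    return np.concatenate([np.cos(W[: n // 2] @ v), np.sin(W[: n // 2] @ v)])

approx = features(x) @ features(y) / (n // 2)       # empirical random-feature kernel
exact = np.exp(-np.linalg.norm(x - y) ** 2 / 2)     # Gaussian kernel value
```

This is precisely the random Fourier feature construction of Rahimi and Recht (2007), here with the two trigonometric activations standing in for a single phase-shifted cosine.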
As discussed previously, the formulas for Ētrain and Ētest suggest that good performances are achieved if the dominant eigenvectors of Φ show a good alignment to Y (and similarly for Φ_X̂X and Ŷ). This naturally drives us to finding a priori simple regression tasks where ill choices of Φ may annihilate the neural network performance. Following recent works on the asymptotic performance analysis of kernel methods for Gaussian mixture models [Couillet and Benaych-Georges (2016),
Liao and Couillet (2017), Mai and Couillet (2017)] and [Couillet and Kammoun
(2016)], we describe here such a task.
Let x_1, ..., x_{T/2} ∼ N(0, (1/p)C_1) and x_{T/2+1}, ..., x_T ∼ N(0, (1/p)C_2), where C_1 and C_2 are such that tr C_1 = tr C_2, ‖C_1‖, ‖C_2‖ are bounded, and tr(C_1 − C_2)² = O(p). Accordingly, y_1 = · · · = y_{T/2} = −1 and y_{T/2+1} = · · · = y_T = 1. It is proved in the
aforementioned articles that, under these conditions, it is theoretically possible, in
the large p, T limit, to classify the data using a kernel least-square support vector
machine (i.e., with a training dataset) or with a kernel spectral clustering method
(i.e., in a completely unsupervised manner) with a nontrivial limiting error probability (i.e., neither zero nor one). This scenario has the interesting feature that x_i^T x_j → 0 almost surely for all i ≠ j, while ‖x_i‖² − (1/p) tr((1/2)C_1 + (1/2)C_2) → 0, almost surely, irrespective of the class of x_i, thereby allowing for a Taylor expansion of the nonlinear kernels as early proposed in El Karoui (2010).
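These concentration properties are easy to visualize numerically; the covariance pair below follows the diag(I, 4I)/diag(4I, I) model used later in Figure 3, with illustrative dimensions:

```python
import numpy as np

rng = np.random.default_rng(4)
p, T = 400, 200
c1 = np.concatenate([np.ones(p // 2), 4 * np.ones(p // 2)])   # diagonal of C1
c2 = np.concatenate([4 * np.ones(p // 2), np.ones(p // 2)])   # diagonal of C2 (tr C1 = tr C2)

X1 = rng.standard_normal((T // 2, p)) * np.sqrt(c1 / p)       # x_i ~ N(0, C1/p), class 1
X2 = rng.standard_normal((T // 2, p)) * np.sqrt(c2 / p)       # x_i ~ N(0, C2/p), class 2
X = np.vstack([X1, X2])

G = X @ X.T
cross_max = np.abs(G - np.diag(np.diag(G))).max()             # x_i^T x_j, i != j: vanishing
norm_dev = np.abs(np.diag(G) - (c1 + c2).sum() / (2 * p)).max()  # ||x_i||^2 near tr(C1/2 + C2/2)/p
```

The off-diagonal entries of the Gram matrix are an order of magnitude smaller than the (class-independent) concentrating diagonal, which is what makes the classification task nontrivial for kernel methods.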
Transposed to our present setting, the aforementioned Taylor expansion allows for a consistent approximation Φ̃ of Φ by an information-plus-noise (spiked) random matrix model [see, e.g., Benaych-Georges and Nadakuditi (2012), Loubaton and Vallet (2010)]. In the present Gaussian mixture context, it is shown in Couillet and Benaych-Georges (2016) that data classification is (asymptotically at least) only possible if Φ̃_ij explicitly contains the quadratic term (x_i^T x_j)² [or combinations of (x_i²)^T x_j, (x_j²)^T x_i, and (x_i²)^T(x_j²)]. In particular, letting a, b ∼ N(0, C_i)
with i = 1, 2, it is easily seen from Table 1 that only max(t, 0), |t|, and cos(t) can realize the task. Indeed, we have the following Taylor expansions around x = 0:

asin(x) = x + O(x³),
sinh(x) = x + O(x³),
acos(x) = π/2 − x + O(x³),
cosh(x) = 1 + x²/2 + O(x³),
x acos(−x) + √(1 − x²) = 1 + πx/2 + x²/2 + O(x³),
x asin(x) + √(1 − x²) = 1 + x²/2 + O(x³),

where only the last three functions [only found in the expression of Φ_ab corresponding to σ(t) = max(t, 0), |t|, or cos(t)] exhibit a quadratic term.
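The presence or absence of the quadratic term can also be read off numerically, here by fitting a cubic polynomial around x = 0 to two of the functions above:

```python
import numpy as np

x = np.linspace(-0.05, 0.05, 201)

f_relu = x * np.arccos(-x) + np.sqrt(1 - x ** 2)   # appears in Phi_ab for sigma(t) = max(t, 0)
f_sign = np.arcsin(x)                              # appears in Phi_ab for sigma(t) = sign(t)

# coefficient of x^2 in a local cubic fit (np.polyfit returns highest degree first)
c2_relu = np.polyfit(x, f_relu, 3)[1]
c2_sign = np.polyfit(x, f_sign, 3)[1]
```

The fit recovers the coefficient 1/2 for the ReLU-related function and a numerically vanishing coefficient for asin, consistent with the expansions displayed above.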
More surprisingly maybe, recalling now equation (5), which considers not necessarily Gaussian Wij with moments m_k of order k, a more refined analysis shows that the aforementioned Gaussian mixture classification task will fail if m_3 = 0 and m_4 = m_2², so for instance for Wij ∈ {−1, 1} Bernoulli with parameter 1/2.
The performance comparison of this scenario is shown in the top part of Figure 3 for σ(t) = −(1/2)t² + 1 and C_1 = diag(I_{p/2}, 4I_{p/2}), C_2 = diag(4I_{p/2}, I_{p/2}), for Wij ∼ N(0, 1) and Wij ∼ Bern [i.e., Bernoulli {(−1, 1/2), (1, 1/2)}]. The choice of σ(t) = ζ_2 t² + ζ_1 t + ζ_0 with ζ_1 = 0 is motivated by Couillet and Benaych-Georges (2016), Couillet and Kammoun (2016), where it is shown, in a somewhat different setting, that this choice is optimal for class recovery. Note that, while
the test performances are overall rather weak in this setting, for Wij ∼ N (0, 1),
Etest drops below one (the amplitude of the Ŷij ), thereby indicating that nontriv-
ial classification is performed. This is not so for the Bernoulli Wij ∼ Bern case
where Etest is systematically greater than |Ŷij| = 1. This is theoretically explained by the fact that, from equation (5), Φij contains structural information about the data classes through the term 2m_2²(x_i^T x_j)² + (m_4 − 3m_2²)(x_i²)^T(x_j²), which induces an information-plus-noise model for Φ as long as 2m_2² + (m_4 − 3m_2²) ≠ 0, that is, m_4 ≠ m_2² [see Couillet and Benaych-Georges (2016) for details]. This is visually seen in the bottom part of Figure 3 where the Gaussian scenario presents an isolated eigenvalue for Φ with corresponding structured eigenvector, which is not the
case of the Bernoulli scenario. To complete this discussion, it appears relevant in the present setting to choose Wij in such a way that m_4 − m_2² is far from zero, thus suggesting the interest of heavy-tailed distributions. To confirm this prediction, Figure 3 additionally displays the performance achieved and the spectrum of Φ
FIG. 3. (Top) Neural network performance for σ(t) = −(1/2)t² + 1, with different Wij, for a 2-class Gaussian mixture model (see details in text), n = 512, T = T̂ = 1024, p = 256. (Bottom) Spectra and second eigenvector of Φ for different Wij (first eigenvalues are of order n and not shown; associated eigenvectors are provably noninformative).
observed for Wij ∼ Stud, that is, following a Student-t distribution with ν = 7 degrees of freedom, normalized to unit variance (in this case m_2 = 1 and m_4 = 5). Figure 3 confirms the large superiority of this choice over the Gaussian case (note nonetheless the slight inaccuracy of our theoretical formulas in this case, which
is likely due to too small values of p, n, T to accommodate Wij with higher order
moments, an observation which is confirmed in simulations when letting ν be even
smaller).
4.3. Limiting cases. We have suggested that Φ contains, in its dominant eigenmodes, all the usable information describing X. In the Gaussian mixture example above, it was notably shown that Φ may completely fail to contain this information, resulting in the impossibility to perform a classification task, even if one were to take infinitely many neurons in the network. For Φ containing useful information about X, it is intuitive to expect that both inf_γ Ētrain and inf_γ Ētest become smaller as n/T and n/p become large. It is in fact easy to see that, if Φ is invertible (which is likely to occur in most cases if lim inf_n T/p > 1), then
lim_{n→∞} Ētrain = 0,

lim_{n→∞} [Ētest − (1/T̂)‖Ŷ^T − Φ_X̂X Φ^{−1} Y^T‖²_F] = 0
and we fall back on the performance of a classical kernel regression. It is inter-
esting in particular to note that, as the number of neurons n becomes large, the
effect of γ on Etest flattens out. Therefore, a smart choice of γ is only relevant for
small (and thus computationally more efficient) neuron layers. This observation is
depicted in Figure 4 where it is made clear that a growth of n reduces Etrain to zero while Etest saturates to a nonzero limit which becomes increasingly independent of γ. Note additionally the interesting phenomenon occurring for n ≤ T where too small values of γ induce important performance losses, thereby suggesting the strong importance of a proper choice of γ in this regime.
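The n → ∞ limit above can be evaluated directly from Φ. The sketch below assumes the erf entry of Table 1 to build Φ and Φ_X̂X in closed form, uses a small jitter in place of γ → 0 for numerical invertibility, and computes the limiting test error; the data, target, and dimensions are illustrative:

```python
import numpy as np

def phi_erf(A, B):
    # Phi_AB entrywise for sigma = erf and w ~ N(0, I_p), from Table 1
    na2 = (A * A).sum(axis=0)
    nb2 = (B * B).sum(axis=0)
    arg = 2 * A.T @ B / np.sqrt(np.outer(1 + 2 * na2, 1 + 2 * nb2))
    return 2 / np.pi * np.arcsin(arg)

rng = np.random.default_rng(5)
p, T, T_hat = 5, 100, 50
X = rng.standard_normal((p, T)) / np.sqrt(p)
X_hat = rng.standard_normal((p, T_hat)) / np.sqrt(p)
Y = np.sign(X[:1])            # toy targets: sign of the first coordinate
Y_hat = np.sign(X_hat[:1])

Phi = phi_erf(X, X)
Phi_hat = phi_erf(X_hat, X)   # Phi_{X_hat X}
coef = np.linalg.solve(Phi + 1e-8 * np.eye(T), Y.T)   # jitter stands in for gamma -> 0
E_test_limit = np.linalg.norm(Y_hat.T - Phi_hat @ coef) ** 2 / T_hat
```

This is exactly a kernel ridge regression with kernel Φ, confirming that no choice of γ can improve on the kernel regression performance once n is taken arbitrarily large.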
Of course, practical interest lies precisely in situations where n is not too large.
We may thus subsequently assume that lim supn n/T < 1. In this case, as sug-
gested by Figures 1–2, the mean-square error performances achieved as γ → 0
may predict the superiority of specific choices of σ (·) for optimally chosen γ . It
is important for this study to differentiate between cases where r ≡ rank(Φ) is smaller or greater than n. Indeed, observe that, with the spectral decomposition Φ = U_r Λ_r U_r^T for Λ_r ∈ R^{r×r} diagonal and U_r ∈ R^{T×r},

δ = (1/T) tr Φ((n/T) Φ/(1 + δ) + γ I_T)^{−1} = (1/T) tr Λ_r((n/T) Λ_r/(1 + δ) + γ I_r)^{−1},
which satisfies, as γ → 0,

δ → r/(n − r),  r < n,
γδ → ℓ = (1/T) tr Φ((n/(Tℓ)) Φ + I_T)^{−1},  r ≥ n.

A phase transition therefore exists whereby δ assumes a finite positive value in the small γ limit if r/n < 1, or scales like 1/γ otherwise.
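The r < n branch of this phase transition is easy to check numerically for a synthetic rank-deficient Φ (illustrative dimensions):

```python
import numpy as np

rng = np.random.default_rng(6)
T, n, r = 200, 140, 80                      # rank r < n
U = np.linalg.qr(rng.standard_normal((T, r)))[0]
Phi = U @ np.diag(1 + rng.random(r)) @ U.T  # rank-r positive semi-definite Phi

def delta_of(gamma, iters=300):
    # fixed-point iteration delta = (1/T) tr Phi ((n/T) Phi/(1+delta) + gamma I_T)^(-1)
    d = 0.0
    for _ in range(iters):
        d = np.trace(Phi @ np.linalg.inv(n / T * Phi / (1 + d) + gamma * np.eye(T))) / T
    return d

d_small_gamma = delta_of(1e-5)              # should approach r/(n - r) as gamma -> 0
```

With r = 80 and n = 140 the iteration settles very close to r/(n − r) = 4/3, while for r ≥ n the same iteration would instead grow like 1/γ.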
FIG. 4. Neural network performance for growing n (256, 512, 1024, 2048, 4096) as a function of γ, σ(t) = max(t, 0); 2-class MNIST data (sevens, nines), T = T̂ = 1024, p = 784. Limiting (n = ∞) Ētest shown in thick black line.
sponds to the energy of Y not captured by the space spanned by Φ. Since Etrain is an increasing function of γ, so is Ētrain (at least for all large n), and thus (1/T)‖Y V_r‖²_F corresponds to the lowest achievable asymptotic training error.
If instead r > n (which is the most likely outcome in practice), as γ → 0, Q̄ ∼ (1/γ)((n/(Tℓ)) Φ + I_T)^{−1} and thus, as γ → 0,

Ētrain → (1/T) tr Y Q̄₀ ([(1/n) tr(Ψ̄ Q̄₀²)] / [1 − (1/n) tr(Ψ̄ Q̄₀)²] Ψ̄ + I_T) Q̄₀ Y^T,

where Ψ̄ = (n/(Tℓ)) Φ and Q̄₀ = (Ψ̄ + I_T)^{−1}.
These results suggest that neural networks should be designed both in a way that
reduces the rank of while maintaining a strong alignment between the dominant
eigenvectors of and the output matrix Y .
Interestingly, if X is assumed as above to be extracted from a Gaussian mixture and Y ∈ R^{1×T} is a classification vector with Y_{1j} ∈ {−1, 1}, then the tools proposed in Couillet and Benaych-Georges (2016) (related to spike random matrix
5. Proof of the main results. In the remainder, we shall use extensively the following notation:

Σ = σ(W X) = [σ_1^T ; … ; σ_n^T],   W = [w_1^T ; … ; w_n^T],
that is, σ_i = σ(w_i^T X)^T. Also, we shall define Σ_{−i} ∈ R^{(n−1)×T} the matrix Σ with its ith row removed, and correspondingly

Q_{−i} = ((1/T)Σ^T Σ − (1/T)σ_i σ_i^T + γ I_T)^{−1}.
Finally, because of exchangeability, it shall often be convenient to work with the generic random vector w ∼ N_ϕ(0, I_p), the random vector σ distributed as any of the σ_i's, the random matrix Σ_− distributed as any of the Σ_{−i}'s, and with the random matrix Q_− distributed as any of the Q_{−i}'s.
The main approach to the proof of our results, starting with that of the key Lemma 1, is as follows: since Wij = ϕ(W̃ij) with W̃ij ∼ N(0, 1) and ϕ Lipschitz, the normal concentration of W̃ transfers to W, which further induces a normal concentration of the random vector σ and the matrix Σ, thereby implying that Lipschitz functionals of σ or Σ also concentrate. As pointed out earlier, these
concentration results are used in place of the independence assumptions (and their
multiple consequences on convergence of random variables) classically exploited
in random matrix theory.
Notation: In all subsequent lemmas and proofs, the letters c, ci , C, Ci > 0 will
be used interchangeably as positive constants independent of the key equation pa-
rameters (notably n and t below) and may be reused from line to line. Additionally,
the variable ε > 0 will denote any small positive number; the variables c, ci , C, Ci
may depend on ε.
We start by recalling the first part of the statement of Lemma 1 and subsequently
providing its proof.
E[σ] ≠ 0, then Σ/√T is “dominated” by the matrix (1/√T) 1_n E[σ]^T, the operator norm of which is indeed of order √n, and the bound is tight. If σ(t) = t and E[Wij] = 0, we however know that ‖Σ‖/√T = O(1) [Bai and Silverstein (1998)]. One is tempted to believe that, more generally, if E[σ] = 0, then ‖Σ‖/√T should remain of this order. And, if instead E[σ] ≠ 0, the contribution of (1/√T) 1_n E[σ]^T should merely engender a single large-amplitude isolated singular value in the spectrum of Σ/√T and the other singular values should remain of order O(1). These intuitions are not captured by our concentration of measure approach.
Since Σ = σ(W X) results from an entry-wise operation, concentration results with respect to the Frobenius norm are natural, whereas results with respect to the operator norm are hardly accessible.
Lemma 2 with respect to A_K and its complementary A_K^c, for some K ≥ 4t₀, gives

P(|(1/T)σ(w^T X)Aσ(w^T X)^T − (1/T) tr ΦA| > t)
  ≤ P(|(1/T)σ(w^T X)Aσ(w^T X)^T − (1/T) tr ΦA| > t, A_K) + P(A_K^c).
We can already bound P(A_K^c) thanks to (7). As for the first right-hand side term, note that, on the set {σ(w^T X), w ∈ A_K}, the function f : R^T → R, σ ↦ σ^T Aσ satisfies

|f(σ + h) − f(σ)| = |h^T Aσ + (σ + h)^T Ah| ≤ 2K√T ‖h‖.
Our next step is then to bound the difference Δ ≡ |E[f̃(σ(w^T X))] − E[f(σ(w^T X))]|. Since f and f̃ are equal on {σ, ‖σ‖ ≤ K√T},

Δ ≤ ∫_{‖σ‖≥K√T} (|f(σ)| + |f̃(σ)|) dμ_σ(σ),

where μ_σ is the law of σ(w^T X). Since ‖A‖ ≤ 1, for ‖σ‖ ≥ K√T, max(|f(σ)|, |f̃(σ)|) ≤ ‖σ‖², and thus

Δ ≤ 2 ∫_{‖σ‖≥K√T} ‖σ‖² dμ_σ = 2 ∫_{‖σ‖≥K√T} ∫₀^∞ 1_{‖σ‖²≥t} dt dμ_σ
  = 2 ∫₀^∞ P({‖σ‖² ≥ t}, A_K^c) dt
  ≤ 2 ∫₀^{K²T} P(A_K^c) dt + 2 ∫_{K²T}^∞ P(‖σ(w^T X)‖² ≥ t) dt
  ≤ 2P(A_K^c)K²T + 2 ∫_{K²T}^∞ C e^{−ct/(2λ_ϕ²λ_σ²‖X‖²)} dt
  ≤ 2CTK² e^{−cTK²/(2λ_ϕ²λ_σ²‖X‖²)} + (2Cλ_ϕ²λ_σ²‖X‖²/c) e^{−cTK²/(2λ_ϕ²λ_σ²‖X‖²)}
  ≤ (6C/c) λ_ϕ²λ_σ²‖X‖²,

where in the last inequality we used the fact that, for x ≥ 0, xe^{−x} ≤ e^{−1} ≤ 1, and K ≥ 4t₀ ≥ 4λ_σλ_ϕ‖X‖√(p/T).
As a consequence,

P(|f(σ(w^T X)) − E[f(σ(w^T X))]| ≥ KTt + Δ, A_K) ≤ C e^{−cTt²/(‖X‖²λ_ϕ²λ_σ²)},

so that, with the same remark as before, for t ≥ 4Δ/(KT),

P(|f(σ(w^T X)) − E[f(σ(w^T X))]| ≥ KTt, A_K) ≤ C e^{−cTt²/(2‖X‖²λ_ϕ²λ_σ²)}.
To avoid the condition t ≥ 4Δ/(KT), we use the fact that, probabilities being lower than one, it suffices to replace C by λC with λ ≥ 1 such that

λC e^{−cTt²/(2‖X‖²λ_ϕ²λ_σ²)} ≥ 1   for t ≤ 4Δ/(KT).

The above inequality holds if we take for instance λ = (1/C)e^{18C²/c} since then t ≤ 4Δ/(KT) ≤ 24Cλ_ϕ²λ_σ²‖X‖²/(cKT) ≤ 6Cλ_ϕλ_σ‖X‖/(c√(pT)) [using successively Δ ≤ (6C/c)λ_ϕ²λ_σ²‖X‖² and K ≥ 4λ_σλ_ϕ‖X‖√(p/T)], and thus

λC e^{−cTt²/(2‖X‖²λ_ϕ²λ_σ²)} ≥ λC e^{−18C²/(cp)} ≥ λC e^{−18C²/c} ≥ 1.
Therefore, setting λ = max(1, (1/C)e^{18C²/c}), we get for every t > 0

P(|f(σ(w^T X)) − E[f(σ(w^T X))]| ≥ KTt, A_K) ≤ λC e^{−cTt²/(2‖X‖²λ_ϕ²λ_σ²)},

which, together with the inequality P(A_K^c) ≤ C e^{−cTK²/(2λ_ϕ²λ_σ²‖X‖²)}, gives

P(|f(σ(w^T X)) − E[f(σ(w^T X))]| ≥ KTt) ≤ λC e^{−cTt²/(2‖X‖²λ_ϕ²λ_σ²)} + C e^{−cTK²/(2λ_ϕ²λ_σ²‖X‖²)}.
We then conclude

P(|(1/T)σ(w^T X)Aσ(w^T X)^T − (1/T) tr(ΦA)| ≥ t) ≤ (λ + 1)C e^{−cT min(t²/K², K²)/(2‖X‖²λ_ϕ²λ_σ²)}

and, with K = max(4t₀, √t),

P(|(1/T)σ(w^T X)Aσ(w^T X)^T − (1/T) tr(ΦA)| ≥ t) ≤ (λ + 1)C e^{−cT min(t²/(16t₀²), t)/(2‖X‖²λ_ϕ²λ_σ²)}.

Indeed, if 4t₀ ≤ √t then min(t²/K², K²) = t, while if 4t₀ ≥ √t then min(t²/K², K²) = min(t²/(16t₀²), 16t₀²) = t²/(16t₀²).
PROOF. We use the fact that, for a nonnegative random variable Y, E[Y] = ∫₀^∞ P(Y > t) dt, so that

E[|(1/T)σ^T Aσ − (1/T) tr ΦA|^k]
  = ∫₀^∞ P(|(1/T)σ^T Aσ − (1/T) tr ΦA|^k > u) du
  = ∫₀^∞ k v^{k−1} P(|(1/T)σ^T Aσ − (1/T) tr ΦA| > v) dv
  ≤ ∫₀^∞ k v^{k−1} C e^{−(cT/η²) min(v²/t₀², v)} dv
  ≤ ∫₀^{t₀} k v^{k−1} C e^{−cTv²/(t₀²η²)} dv + ∫_{t₀}^∞ k v^{k−1} C e^{−cTv/η²} dv
  ≤ ∫₀^∞ k v^{k−1} C e^{−cTv²/(t₀²η²)} dv + ∫₀^∞ k v^{k−1} C e^{−cTv/η²} dv
  = (t₀η/√(cT))^k ∫₀^∞ k t^{k−1} C e^{−t²} dt + (η²/(cT))^k ∫₀^∞ k t^{k−1} C e^{−t} dt,

which, along with the boundedness of the integrals, concludes the proof.
+ (1/T) |tr(((R + H)^T(R + H) − z I_T)^{−1} H^T R(R^T R − z I_T)^{−1})|
  ≤ 2‖H‖/dist(z, R₊)^{3/2} ≤ 2‖H‖_F/dist(z, R₊)^{3/2},

where, for the second to last inequality, we successively used the relations |tr AB| ≤ √(tr AA^T) √(tr BB^T), |tr CD| ≤ ‖D‖ tr C for nonnegative definite C, and ‖(R^T R − z I_T)^{−1}‖ ≤ dist(z, R₊)^{−1}, ‖(R^T R − z I_T)^{−1} R^T R‖ ≤ 1, ‖(R^T R − z I_T)^{−1} R^T‖ = ‖(R^T R − z I_T)^{−1} R^T R(R^T R − z I_T)^{−1}‖^{1/2} ≤ ‖(R^T R − z I_T)^{−1} R^T R‖^{1/2} ‖(R^T R − z I_T)^{−1}‖^{1/2} ≤ dist(z, R₊)^{−1/2}, for z ∈ C \ R₊, and finally ‖·‖ ≤ ‖·‖_F.
for some C > 0. The function f is thus Lipschitz with parameter independent of n,
which allows us to conclude using Lemma 3.
The aforementioned concentration results are the building blocks of the proofs of Theorems 1–3 which, under Assumptions 1–3, are established using standard random matrix approaches.
5.2.1. First equivalent for E[Q]. This section is dedicated to a first character-
ization of E[Q], in the “simultaneously large” n, p, T regime. This preliminary
step is classical in studying resolvents in random matrix theory as the direct com-
parison of E[Q] to Q̄ with the implicit δ may be cumbersome. To this end, let us
thus define the intermediary deterministic matrix

Q̃ = ((n/T) Φ/(1 + α) + γ I_T)^{−1}

with α ≡ (1/T) tr ΦE[Q_−], where we recall that Q_− is a random matrix distributed as, say, ((1/T)Σ^T Σ − (1/T)σ_1σ_1^T + γ I_T)^{−1}.

First note that, since (1/T) tr Φ = E[(1/T)‖σ‖²] and, from (7) and Assumption 3, P((1/T)‖σ‖² > t) ≤ C e^{−cnt} for all large t, we find that (1/T) tr Φ = ∫₀^∞ P((1/T)‖σ‖² > t) dt ≤ C′ for some constant C′. Thus, α ≤ ‖E[Q_−]‖ (1/T) tr Φ ≤ C′/γ is uniformly bounded.
We will show here that ‖E[Q] − Q̃‖ → 0 as n → ∞ in the regime of Assumption 3. As the proof steps are somewhat classical, we defer to the Appendix some classical intermediary lemmas (Lemmas 5–7). Using the resolvent identity, Lemma 5, we start by writing

E[Q] − Q̃ = E[Q((n/T) Φ/(1 + α) − (1/T)Σ^T Σ)]Q̃
  = (n/T)(1/(1 + α)) E[Q]ΦQ̃ − E[Q (1/T)Σ^T Σ]Q̃
  = (n/T)(1/(1 + α)) E[Q]ΦQ̃ − (1/T) ∑_{i=1}^n E[Qσ_iσ_i^T]Q̃,
E[Q] − Q̃
  = (n/T)(1/(1 + α)) E[Q]ΦQ̃ − (1/T) ∑_{i=1}^n E[Q_{−i} σ_iσ_i^T/(1 + (1/T)σ_i^T Q_{−i}σ_i)]Q̃
  = (n/T)(1/(1 + α)) E[Q]ΦQ̃ − (1/(1 + α))(1/T) ∑_{i=1}^n E[Q_{−i}σ_iσ_i^T]Q̃
    + (1/T) ∑_{i=1}^n E[Q_{−i}σ_iσ_i^T((1/T)σ_i^T Q_{−i}σ_i − α)/((1 + α)(1 + (1/T)σ_i^T Q_{−i}σ_i))]Q̃.
Note now, from the independence of Q_{−i} and σ_iσ_i^T, that the second right-hand side expectation is simply E[Q_{−i}]Φ. Also, exploiting Lemma 6 in reverse on the rightmost term, this gives

(8)  E[Q] − Q̃ = (1/T) ∑_{i=1}^n (E[Q − Q_{−i}]Φ/(1 + α)) Q̃
       + (1/(1 + α))(1/T) ∑_{i=1}^n E[Qσ_iσ_i^T((1/T)σ_i^T Q_{−i}σ_i − α)]Q̃.
for some C, c > 0, so in particular, recalling that α ≤ C′ for some constant C′ > 0,

E[max_{1≤i≤n} D_ii] = ∫₀^{2(1+C′)} P(max_{1≤i≤n} D_ii > t) dt + ∫_{2(1+C′)}^∞ P(max_{1≤i≤n} D_ii > t) dt
  ≤ 2(1 + C′) + ∫_{2(1+C′)}^∞ Cn e^{−cn min((t−(1+C′))², t−(1+C′))} dt
  = 2(1 + C′) + ∫_{1+C′}^∞ Cn e^{−cnt} dt
  = 2(1 + C′) + (C/c) e^{−cn(1+C′)} = O(1).
As a consequence of all the above (and of the boundedness of α), we have that, for some c > 0,

(10)  ‖E[Q (1/T)Σ^T D Σ Q]Q̃‖ ≤ c/n.
Let us now consider the second right-hand side term of (9). Using the relation ab^T + ba^T ⪯ aa^T + bb^T in the order of Hermitian matrices [which unfolds from (a − b)(a − b)^T ⪰ 0], we have, with a = T^{1/4}Qσ_i((1/T)σ_i^T Q_{−i}σ_i − α) and b = T^{−1/4}Q̃σ_i,

(1/T) ∑_{i=1}^n E[(Qσ_iσ_i^T Q̃ + Q̃σ_iσ_i^T Q)((1/T)σ_i^T Q_{−i}σ_i − α)]
  ⪯ (√T/T) ∑_{i=1}^n E[Qσ_iσ_i^T Q((1/T)σ_i^T Q_{−i}σ_i − α)²] + (1/(T√T)) ∑_{i=1}^n E[Q̃σ_iσ_i^T Q̃]
  = √T E[Q (1/T)Σ^T D₂² Σ Q] + (n/(T√T)) Q̃ΦQ̃,

where D₂ = diag({(1/T)σ_i^T Q_{−i}σ_i − α}_{i=1}^n). Of course, since we also have −aa^T − bb^T ⪯ ab^T + ba^T [from (a + b)(a + b)^T ⪰ 0], we have symmetrically

(1/T) ∑_{i=1}^n E[(Qσ_iσ_i^T Q̃ + Q̃σ_iσ_i^T Q)((1/T)σ_i^T Q_{−i}σ_i − α)]
  ⪰ −√T E[Q (1/T)Σ^T D₂² Σ Q] − (n/(T√T)) Q̃ΦQ̃.
α = (1/T) tr ΦE[Q_−] = (1/T) tr ΦE[Q] + (1/T) tr Φ(E[Q_−] − E[Q]),

|α − (1/T) tr ΦQ̃| ≤ C n^{−1/2+ε} (1/T) tr Φ.

We have proved in the beginning of the section that (1/T) tr Φ is bounded and thus we finally conclude that

|α − (1/T) tr ΦQ̃| ≤ C n^{ε−1/2}.
5.2.2. Second equivalent for E[Q]. In this section, we show that E[Q] can be approximated by the matrix Q̄, which we recall is defined as

Q̄ = ((n/T) Φ/(1 + δ) + γ I_T)^{−1},

where δ > 0 is the unique positive solution to δ = (1/T) tr ΦQ̄. The fact that δ > 0 is well defined is quite standard and has already been proved several times for more elaborate models. Following the ideas of Couillet, Hoydis and Debbah (2012), we may for instance use the framework of so-called standard interference functions [Yates (1995)], which states that, if a map f : [0, ∞) → (0, ∞), x ↦ f(x), satisfies x ≥ x′ ⇒ f(x) ≥ f(x′), af(x) > f(ax) for all a > 1, and there exists x₀ such that x₀ ≥ f(x₀), then f has a unique fixed point [Yates (1995), Theorem 2]. It is easily shown that δ ↦ (1/T) tr ΦQ̄ is such a map, so that δ exists and is unique.
To compare Q̃ and Q̄, using the resolvent identity, Lemma 5, we start by writing

Q̃ − Q̄ = (α − δ) Q̃ ((n/T)Φ/((1 + α)(1 + δ))) Q̄,

from which

|α − δ| = |(1/T) tr Φ(E[Q_−] − Q̄)|
  ≤ |(1/T) tr Φ(Q̃ − Q̄)| + c n^{−1/2+ε}
  = |α − δ| (1/T) tr(ΦQ̃ (n/T)ΦQ̄)/((1 + α)(1 + δ)) + c n^{−1/2+ε},

which implies that

|α − δ| (1 − (1/T) tr(ΦQ̃ (n/T)ΦQ̄)/((1 + α)(1 + δ))) ≤ c n^{−1/2+ε}.
It thus remains to show that

lim sup_n (1/T) tr(ΦQ̃ (n/T)ΦQ̄)/((1 + α)(1 + δ)) < 1

to prove that |α − δ| ≤ c n^{ε−1/2}. To this end, note that, by Cauchy–Schwarz's inequality,

(1/T) tr(ΦQ̃ (n/T)ΦQ̄)/((1 + α)(1 + δ)) ≤ √[(n/T)(1/(1 + δ)²)(1/T) tr Φ²Q̄² · (n/T)(1/(1 + α)²)(1/T) tr Φ²Q̃²]

so that it is sufficient to bound the limsup of both terms under the square root strictly by one. Next, remark that

δ = (1/T) tr ΦQ̄ = (1/T) tr(ΦQ̄² Q̄^{−1}) = (n/(T(1 + δ)))(1/T) tr Φ²Q̄² + γ (1/T) tr ΦQ̄².

In particular,

(n/T)(1/(1 + δ)²)(1/T) tr Φ²Q̄² = δ [(n/(T(1 + δ)²))(1/T) tr Φ²Q̄²] / [(n/(T(1 + δ)))(1/T) tr Φ²Q̄² + γ (1/T) tr ΦQ̄²] ≤ δ/(1 + δ),

which completes the proof that |α − δ| ≤ c n^{ε−1/2}.
As a consequence of all this,
Q̃ Tn Q̄ 1
Q̃ − Q̄ = |α − δ| · ≤ cn− 2 +ε
(1 + α)(1 + δ)
1
and we have thus proved that E[Q] − Q̄ ≤ cn− 2 +ε for some c > 0.
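This deterministic equivalent can be observed numerically. The following sketch (our own addition, under the assumed setting σ(t) = max(t, 0), Gaussian W and γ = 1) estimates E[Q] by Monte Carlo over W and compares it with Q̄ built from the explicit ReLU kernel Φ derived in Section 5.3 below:

```python
import numpy as np

def relu_kernel(X):
    # Phi_ij = E[max(w^T x_i, 0) max(w^T x_j, 0)], w ~ N(0, I_p)
    norms = np.linalg.norm(X, axis=0)
    rho = np.clip((X.T @ X) / np.outer(norms, norms), -1.0, 1.0)
    return (np.outer(norms, norms) / (2 * np.pi)
            * (np.sqrt(1 - rho**2) + rho * np.arccos(-rho)))

def q_bar(Phi, n, gamma, iters=500):
    T, delta = Phi.shape[0], 1.0
    for _ in range(iters):
        Qb = np.linalg.inv(n / T * Phi / (1 + delta) + gamma * np.eye(T))
        delta = np.trace(Phi @ Qb) / T
    return Qb

rng = np.random.default_rng(0)
p, T, n, gamma = 40, 60, 120, 1.0
X = rng.standard_normal((p, T)) / np.sqrt(p)
Qb = q_bar(relu_kernel(X), n, gamma)

EQ = np.zeros((T, T))
reps = 300
for _ in range(reps):                  # Monte Carlo estimate of E[Q]
    W = rng.standard_normal((n, p))
    S = np.maximum(W @ X, 0)           # Sigma = sigma(WX)
    EQ += np.linalg.inv(S.T @ S / T + gamma * np.eye(T))
EQ /= reps
assert abs(np.trace(EQ - Qb)) / T < 0.05
assert np.linalg.norm(EQ - Qb, 2) < 0.5
```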
From this result, along with Corollary 2, we now have that

P( |(1/T) tr Q − (1/T) tr Q̄| > t )
≤ P( |(1/T) tr Q − (1/T) tr E[Q]| > t − |(1/T) tr E[Q] − (1/T) tr Q̄| )
≤ C e^{−c n (t − c n^{−1/2+ε})}
≤ C′ e^{−(1/2) c′ n t}

(the last inequality holding for t larger than 2c n^{ε−1/2}).
5.2.3. Asymptotic equivalent for E[QAQ], where A is either Φ or symmetric of bounded norm. By exchangeability of the σi,

(1/T) Σ_{i=1}^n E[Qσiσiᵀ Q̄AQ] = (n/T) E[Qσσᵀ Q̄AQ]
= (n/T) E[ Q−σσᵀ Q̄AQ / (1 + (1/T)σᵀQ−σ) ]
= (n/T)(1/(1+δ)) E[Q−σσᵀ Q̄AQ]
+ (n/T)(1/(1+δ)) E[ Q−σσᵀ Q̄AQ (δ − (1/T)σᵀQ−σ) / (1 + (1/T)σᵀQ−σ) ],
Expanding next the remaining Q factors through Q = Q− − (1/T) Q−σσᵀQ− / (1 + (1/T)σᵀQ−σ), we obtain

(1/T) Σ_{i=1}^n E[Qσiσiᵀ Q̄AQ]
= (n/T)(1/(1+δ)) E[Q−σσᵀQ̄AQ−] − (n/T²)(1/(1+δ)) E[ Q−σσᵀQ̄AQ−σσᵀQ− / (1 + (1/T)σᵀQ−σ) ]
+ (n/T) E[ Q−σσᵀQ̄AQ− (δ − (1/T)σᵀQ−σ) / ((1+δ)(1 + (1/T)σᵀQ−σ)) ]
− (n/T²) E[ Q−σσᵀQ̄AQ−σσᵀQ− (δ − (1/T)σᵀQ−σ) / ((1+δ)(1 + (1/T)σᵀQ−σ)²) ]
= (n/T)(1/(1+δ)) E[Q−ΦQ̄AQ−] − (n/T)(1/(1+δ)) E[ Q−σσᵀQ− (1/T)σᵀQ̄AQ−σ / (1 + (1/T)σᵀQ−σ) ]
+ (n/T) E[ Q−σσᵀQ̄AQ− (δ − (1/T)σᵀQ−σ) / ((1+δ)(1 + (1/T)σᵀQ−σ)) ]
− (n/T) E[ Q−σσᵀQ− (1/T)σᵀQ̄AQ−σ (δ − (1/T)σᵀQ−σ) / ((1+δ)(1 + (1/T)σᵀQ−σ)²) ]
≡ Z1 + Z2 + Z3 + Z4

(where in the second-to-last display we have merely reorganized the terms conveniently), and our interest is in handling Z1 + Z1ᵀ + Z2 + Z2ᵀ + Z3 + Z3ᵀ + Z4 + Z4ᵀ.
Let us first treat the term Z2. Since ‖Q̄AQ−‖ is bounded, by Lemma 4, (1/T)σᵀQ̄AQ−σ concentrates around (1/T) tr ΦQ̄A E[Q−]; but, as ‖Q̄‖ is bounded, we also have |(1/T) tr ΦQ̄A E[Q−] − (1/T) tr ΦQ̄AQ̄| ≤ c n^{ε−1/2}. We thus deduce, with similar arguments as previously,

‖ E[ Q−σσᵀQ− (1/T)σᵀQ̄AQ−σ / (1 + (1/T)σᵀQ−σ) ] − ( (1/T) tr ΦQ̄AQ̄ / (1+δ) ) E[Q−σσᵀQ−] ‖
≤ ‖E[Q−ΦQ−]‖ C n^{ε−1/2} + C n e^{−c n^ε}
≤ ‖E[Q−ΦQ−]‖ C′ n^{ε−1/2}.

But, again by exchangeability arguments,

E[Q−ΦQ−] = E[Q−σσᵀQ−] = E[ Qσσᵀ Q (1 + (1/T)σᵀQ−σ)² ] = (T/n) E[ Q (1/T)ΣᵀD²Σ Q ]

with D = diag({1 + (1/T)σiᵀQ−iσi}_{i=1}^n), the operator norm of which is bounded as O(1). So finally,

‖ E[ Q−σσᵀQ− (1/T)σᵀQ̄AQ−σ / (1 + (1/T)σᵀQ−σ) ] − ( (1/T) tr ΦQ̄AQ̄ / (1+δ) ) E[Q−ΦQ−] ‖ ≤ C n^{ε−1/2}.
The terms Z3 and Z4 can be expressed in terms of E[Q−ΦQ−] and E[Q−ΦQ̄ᵏΦQ−] for k = 1, 2, all of which have been shown to be bounded (at most by Cn^ε). We thus conclude that

‖ E[ (δ − (1/T)σᵀQ−σ) ( Q−σσᵀQ̄AQ− + Q−AQ̄σσᵀQ− ) / (1 + (1/T)σᵀQ−σ) ] ‖ ≤ C n^{ε−1/2}.
The first norm in the parenthesis is bounded by Cn^ε and it thus remains to control the second norm. To this end, similar to the control of E[QΦQ], by writing E[QΦQΦQ] = E[Qσ₁σ₁ᵀQσ₂σ₂ᵀQ] for σ₁, σ₂ independent vectors with the same law as σ, and exploiting the exchangeability, we obtain after some calculus that E[QΦQΦQ] can be expressed as the sum of terms of the form

E[ Q₊₊ (1/T)Σ₊₊ᵀDΣ₊₊ Q₊₊ ]  or  E[ Q₊₊ (1/T)Σ₊₊ᵀDΣ₊₊ Q₊₊ (1/T)Σ₊₊ᵀD₂Σ₊₊ Q₊₊ ]

for D, D₂ diagonal matrices of norm bounded as O(1), while Σ₊₊ and Q₊₊ are defined as Σ and Q, only for n replaced by n + 2. All these terms are bounded as O(1) and we finally obtain that ‖E[QΦQΦQ]‖ is bounded, and thus

‖ E[QΦQ] − (1/2) E[ Q−ΦQ + QΦQ− ] ‖ ≤ C/n.

With the additional control on ‖QΦQ− − Q−ΦQ−‖ and ‖Q−ΦQ − Q−ΦQ−‖, together, this implies that E[QΦQ] = E[Q−ΦQ−] + O_{‖·‖}(n^{−1}). Hence, for A = Φ, exploiting the fact that (n/T) ΦQ̄/(1+δ) = I_T − γQ̄, we have the simplification

E[QΦQ] = Q̄ΦQ̄ + (n/T)(1/(1+δ)) E[QΦQ̄ΦQ] − (n/T)(1/(1+δ)) E[Q−ΦQ̄ΦQ−]
+ (n/T) ( (1/T) tr Φ²Q̄² / (1+δ)² ) E[Q−ΦQ−] + O_{‖·‖}(n^{ε−1/2})
= Q̄ΦQ̄ + (n/T) ( (1/T) tr Φ²Q̄² / (1+δ)² ) E[QΦQ] + O_{‖·‖}(n^{ε−1/2}),

or equivalently

E[QΦQ] ( 1 − (n/T)(1/T) tr Φ²Q̄² / (1+δ)² ) = Q̄ΦQ̄ + O_{‖·‖}(n^{ε−1/2}).
We have already shown in (11) that lim sup_n (n/T)(1/T) tr Φ²Q̄²/(1+δ)² < 1, and thus

E[QΦQ] = Q̄ΦQ̄ / ( 1 − (n/T)(1/T) tr Φ²Q̄²/(1+δ)² ) + O_{‖·‖}(n^{ε−1/2}).
5.3. Derivation of Φ_{ab}. Recall that, for w ∼ N(0, I_p), we have to evaluate

I ≡ Φ_{ab} = (2π)^{−p/2} ∫_{R^p} σ(wᵀa) σ(wᵀb) e^{−½‖w‖²} dw.

Assume that a and b are not linearly dependent. It is convenient to observe that this integral can be reduced to a two-dimensional integration by considering the basis e₁, …, e_p defined (for instance) by

e₁ = a/‖a‖,  e₂ = ( b/‖b‖ − (aᵀb/(‖a‖‖b‖)) a/‖a‖ ) / √( 1 − (aᵀb)²/(‖a‖²‖b‖²) ),

with e₃, …, e_p completing the orthonormal basis. In these coordinates,

I = (1/2π) ∫_R ∫_R σ(w̃₁ã₁) σ(w̃₁b̃₁ + w̃₂b̃₂) e^{−½(w̃₁²+w̃₂²)} dw̃₁ dw̃₂.

Letting w̃ = [w̃₁, w̃₂]ᵀ, ã = [ã₁, 0]ᵀ and b̃ = [b̃₁, b̃₂]ᵀ (the coordinates of a and b in the basis (e₁, e₂)), this is conveniently written as the two-dimensional integral

I = (1/2π) ∫_{R²} σ(w̃ᵀã) σ(w̃ᵀb̃) e^{−½‖w̃‖²} dw̃.

The case where a and b would be linearly dependent can then be obtained by continuity arguments.
The function σ(t) = max(t, 0). For this function, we have

I = (1/2π) ∫_{min(w̃ᵀã, w̃ᵀb̃) ≥ 0} (w̃ᵀã)(w̃ᵀb̃) e^{−½‖w̃‖²} dw̃
= (1/2π) ∫_{θ₀−π/2}^{π/2} ã₁ cos(θ) ( cos(θ)b̃₁ + sin(θ)b̃₂ ) dθ ∫_{R₊} r³ e^{−½r²} dr,

where we used the polar coordinates w̃ = r[cos θ, sin θ]ᵀ and where θ₀ ∈ [0, π] denotes the angle between ã and b̃, that is, cos θ₀ = b̃₁/‖b̃‖ (taking b̃₂ ≥ 0).
With two integrations by parts, we have that ∫_{R₊} r³ e^{−½r²} dr = 2. Classical trigonometric formulas also provide

∫_{θ₀−π/2}^{π/2} cos(θ)² dθ = ½(π − θ₀) + ¼ sin(2θ₀) = ½ ( π − arccos(b̃₁/‖b̃‖) + b̃₁b̃₂/‖b̃‖² ),
∫_{θ₀−π/2}^{π/2} cos(θ) sin(θ) dθ = ½ sin²(θ₀) = ½ (b̃₂/‖b̃‖)²,

where we used in particular sin(2 arccos(x)) = 2x√(1 − x²). Altogether, this is after simplification and replacement of ã₁, b̃₁ and b̃₂,

I = (1/2π) ‖a‖‖b‖ ( √(1 − ∠(a, b)²) + ∠(a, b) arccos(−∠(a, b)) ),

where ∠(a, b) ≡ aᵀb/(‖a‖‖b‖).
It is worth noticing that this may be more compactly written as

I = (1/2π) ‖a‖‖b‖ ∫_{−1}^{∠(a,b)} arccos(−x) dx,

which is minimal as ∠(a, b) → −1 (since arccos(−x) ≥ 0 on [−1, 1]) and takes there the limiting value zero. Hence, I > 0 for a and b not linearly dependent. For a and b linearly dependent, we simply have I = 0 for ∠(a, b) = −1 and I = ½‖a‖‖b‖ for ∠(a, b) = 1.
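The closed form above is easily checked by Monte Carlo (a sketch of our own; the vectors a, b below are arbitrary test values, not from the paper):

```python
import numpy as np

a = np.array([1.0, 0.5, -0.2])
b = np.array([0.3, -1.0, 0.8])
na, nb = np.linalg.norm(a), np.linalg.norm(b)
rho = a @ b / (na * nb)                      # the angle ∠(a, b)
closed_form = na * nb / (2 * np.pi) * (np.sqrt(1 - rho**2)
                                       + rho * np.arccos(-rho))

rng = np.random.default_rng(1)
w = rng.standard_normal((2_000_000, 3))      # w ~ N(0, I_p), p = 3
mc = np.mean(np.maximum(w @ a, 0) * np.maximum(w @ b, 0))
assert abs(mc - closed_form) < 1e-2
```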
The function σ(t) = |t|. Since |t| = max(t, 0) + max(−t, 0), we have

|wᵀa| · |wᵀb|
= max(wᵀa, 0) max(wᵀb, 0) + max(wᵀ(−a), 0) max(wᵀ(−b), 0)
+ max(wᵀ(−a), 0) max(wᵀb, 0) + max(wᵀa, 0) max(wᵀ(−b), 0).

Hence, reusing the results above, we have here

I = (‖a‖‖b‖/2π) ( 4√(1 − ∠(a, b)²) + 2∠(a, b) arccos(−∠(a, b)) − 2∠(a, b) arccos(∠(a, b)) ).
1232 C. LOUART, Z. LIAO AND R. COUILLET
Using the identity arccos(−x) − arccos(x) = 2 arcsin(x) provides the expected result.
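Applying the identity, the |t| kernel reads (‖a‖‖b‖/2π)(4√(1 − ρ²) + 4ρ arcsin ρ) with ρ = ∠(a, b), which we can again verify numerically (our own sketch, arbitrary test vectors):

```python
import numpy as np

a = np.array([1.0, 0.5, -0.2])
b = np.array([0.3, -1.0, 0.8])
na, nb = np.linalg.norm(a), np.linalg.norm(b)
rho = a @ b / (na * nb)
closed_form = na * nb / (2 * np.pi) * (4 * np.sqrt(1 - rho**2)
                                       + 4 * rho * np.arcsin(rho))

rng = np.random.default_rng(2)
w = rng.standard_normal((2_000_000, 3))
mc = np.mean(np.abs(w @ a) * np.abs(w @ b))
assert abs(mc - closed_form) < 1e-2
```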
The function σ(t) = 1_{t≥0}. With the same notation as in the case σ(t) = max(t, 0), we have to evaluate

I = (1/2π) ∫_{min(w̃ᵀã, w̃ᵀb̃) ≥ 0} e^{−½‖w̃‖²} dw̃.

After a polar coordinate change of variable, this is

I = (1/2π) ∫_{θ₀−π/2}^{π/2} dθ ∫_{R₊} r e^{−½r²} dr = ½ − θ₀/(2π),

from which the result unfolds.
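Since cos θ₀ = ∠(a, b), this says P(wᵀa ≥ 0, wᵀb ≥ 0) = ½ − arccos(∠(a, b))/(2π), which a quick simulation confirms (our own sketch, arbitrary test vectors):

```python
import numpy as np

a = np.array([1.0, 0.5, -0.2])
b = np.array([0.3, -1.0, 0.8])
rho = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
closed_form = 0.5 - np.arccos(rho) / (2 * np.pi)

rng = np.random.default_rng(3)
w = rng.standard_normal((2_000_000, 3))
mc = np.mean((w @ a >= 0) & (w @ b >= 0))
assert abs(mc - closed_form) < 5e-3
```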
The function σ(t) = sign(t). Here, it suffices to note that sign(t) = 1_{t≥0} − 1_{−t≥0}, so that

σ(wᵀa) σ(wᵀb) = 1_{wᵀa≥0} 1_{wᵀb≥0} + 1_{wᵀ(−a)≥0} 1_{wᵀ(−b)≥0} − 1_{wᵀ(−a)≥0} 1_{wᵀb≥0} − 1_{wᵀa≥0} 1_{wᵀ(−b)≥0},

and to apply the result of the previous paragraph, with either (a, b), (−a, −b), (−a, b) or (a, −b). Since arccos(−x) = −arccos(x) + π, we conclude that

I = (2π)^{−p/2} ∫_{R^p} sign(wᵀa) sign(wᵀb) e^{−½‖w‖²} dw = 1 − 2θ₀/π.
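Equivalently, I = 1 − 2 arccos(∠(a, b))/π, the classical Grothendieck identity, checked below by simulation (our own sketch, arbitrary test vectors):

```python
import numpy as np

a = np.array([1.0, 0.5, -0.2])
b = np.array([0.3, -1.0, 0.8])
rho = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
closed_form = 1 - 2 * np.arccos(rho) / np.pi

rng = np.random.default_rng(4)
w = rng.standard_normal((2_000_000, 3))
mc = np.mean(np.sign(w @ a) * np.sign(w @ b))
assert abs(mc - closed_form) < 1e-2
```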
The functions σ(t) = cos(t) and σ(t) = sin(t). Let us first consider σ(t) = cos(t). We have here to evaluate

I = (1/2π) ∫_{R²} cos(w̃ᵀã) cos(w̃ᵀb̃) e^{−½‖w̃‖²} dw̃
= (1/8π) ∫_{R²} ( e^{ıw̃ᵀã} + e^{−ıw̃ᵀã} )( e^{ıw̃ᵀb̃} + e^{−ıw̃ᵀb̃} ) e^{−½‖w̃‖²} dw̃,

which boils down to evaluating, for d ∈ {ã + b̃, ã − b̃, −ã + b̃, −ã − b̃}, the integral

∫_{R²} e^{−½‖d‖²} e^{−½‖w̃−ıd‖²} dw̃ = (2π) e^{−½‖d‖²}.

Altogether, we find

I = ½ ( e^{−½‖a+b‖²} + e^{−½‖a−b‖²} ) = e^{−½(‖a‖²+‖b‖²)} cosh(aᵀb).

For σ(t) = sin(t), it suffices to appropriately adapt the signs in the expression of I [using the relation sin(t) = (e^{ıt} − e^{−ıt})/(2ı)] to obtain in the end

I = ½ ( e^{−½‖a−b‖²} − e^{−½‖a+b‖²} ) = e^{−½(‖a‖²+‖b‖²)} sinh(aᵀb),

as desired.
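Both trigonometric kernels can be verified in one Monte Carlo run (our own sketch, arbitrary test vectors):

```python
import numpy as np

a = np.array([1.0, 0.5, -0.2])
b = np.array([0.3, -1.0, 0.8])
na2, nb2 = a @ a, b @ b
closed_cos = np.exp(-(na2 + nb2) / 2) * np.cosh(a @ b)
closed_sin = np.exp(-(na2 + nb2) / 2) * np.sinh(a @ b)

rng = np.random.default_rng(5)
w = rng.standard_normal((2_000_000, 3))
mc_cos = np.mean(np.cos(w @ a) * np.cos(w @ b))
mc_sin = np.mean(np.sin(w @ a) * np.sin(w @ b))
assert abs(mc_cos - closed_cos) < 5e-3
assert abs(mc_sin - closed_sin) < 5e-3
```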
5.4. Polynomial σ(·) and generic w. In this section, we prove equation (5) for σ(t) = ζ₂t² + ζ₁t + ζ₀ and w ∈ R^p a random vector with independent and identically distributed entries of zero mean and moment of order k equal to m_k. The result is based on standard combinatorics. We are to evaluate

Φ_{ab} = E[ ( ζ₂(wᵀa)² + ζ₁(wᵀa) + ζ₀ )( ζ₂(wᵀb)² + ζ₁(wᵀb) + ζ₀ ) ].

After development, it appears that one needs only assess, for say vectors c, d ∈ R^p that take values in {a, b}, the moments

E[(wᵀc)²(wᵀd)²] = Σ_{i₁,i₂,j₁,j₂} c_{i₁}c_{i₂}d_{j₁}d_{j₂} E[w_{i₁}w_{i₂}w_{j₁}w_{j₂}]
= m₄ Σ_{i₁} c²_{i₁}d²_{i₁} + m₂² Σ_{i₁≠j₁} c²_{i₁}d²_{j₁} + 2m₂² Σ_{i₁≠i₂} c_{i₁}d_{i₁}c_{i₂}d_{i₂}
= m₄ (c²)ᵀ(d²) + m₂² ( ‖c‖²‖d‖² − (c²)ᵀ(d²) ) + 2m₂² ( (cᵀd)² − (c²)ᵀ(d²) )
= (m₄ − 3m₂²) (c²)ᵀ(d²) + m₂² ( ‖c‖²‖d‖² + 2(cᵀd)² ),

E[(wᵀc)²(wᵀd)] = Σ_{i₁,i₂,j} c_{i₁}c_{i₂}d_j E[w_{i₁}w_{i₂}w_j] = m₃ Σ_{i₁} c²_{i₁}d_{i₁} = m₃ (c²)ᵀd,

E[(wᵀc)²] = Σ_{i₁,i₂} c_{i₁}c_{i₂} E[w_{i₁}w_{i₂}] = m₂ ‖c‖²,

where we recall the definition (a²) ≡ [a₁², …, a_p²]ᵀ. Gathering all the terms for appropriate selections of c, d leads to (5).
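These moment identities hold for any zero-mean i.i.d. entry distribution, and can be checked exactly by enumeration over a small discrete distribution (our own sketch; the two-point law and vectors below are arbitrary test choices with nonzero third moment):

```python
import itertools
import numpy as np

vals = np.array([-1.0, 2.0])           # entry support, mean zero
probs = np.array([2/3, 1/3])
m2, m3, m4 = [(probs * vals**k).sum() for k in (2, 3, 4)]

c = np.array([1.0, -0.5, 2.0])
d = np.array([0.5, 1.5, -1.0])
E22 = E21 = 0.0
for idx in itertools.product(range(2), repeat=3):   # exact expectation
    w = vals[list(idx)]
    pr = probs[list(idx)].prod()
    E22 += pr * (w @ c)**2 * (w @ d)**2
    E21 += pr * (w @ c)**2 * (w @ d)

pred22 = ((m4 - 3 * m2**2) * (c**2 @ d**2)
          + m2**2 * ((c @ c) * (d @ d) + 2 * (c @ d)**2))
pred21 = m3 * (c**2 @ d)
assert abs(E22 - pred22) < 1e-9 and abs(E21 - pred21) < 1e-9
```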
that would require more advanced proof techniques, let us envision the following heuristic derivation of Conjecture 1.
Recall that our interest is in the test performance Etest defined as

Etest = (1/T̂) ‖ Ŷᵀ − Σ̂ᵀβ ‖²_F,

which may be rewritten as

(12) Etest = (1/T̂) tr ŶŶᵀ − (2/(TT̂)) tr( YQΣᵀΣ̂Ŷᵀ ) + (1/(T²T̂)) tr( YQΣᵀΣ̂Σ̂ᵀΣQYᵀ ) ≡ Z1 − Z2 + Z3.
If Σ̂ = Σ̂° + 1_n σ̄̂ᵀ, with Σ̂° following the aforementioned claimed operator norm control, reproducing the steps of Corollary 3 leads to a similar concentration for Etest, which we shall then admit. We are therefore left to evaluating E[Z2] and E[Z3].
We start with the term E[Z2], which we expand as

E[Z2] = (2/(TT̂)) E tr( YQΣᵀΣ̂Ŷᵀ ) = (2/(TT̂)) Σ_{i=1}^n E tr( YQσiσ̂iᵀŶᵀ )
= (2/(TT̂)) Σ_{i=1}^n E tr( YQ−iσiσ̂iᵀŶᵀ / (1 + (1/T)σiᵀQ−iσi) )
= (2/(TT̂)) (1/(1+δ)) Σ_{i=1}^n E tr( YQ−iσiσ̂iᵀŶᵀ )
+ (2/(TT̂)) (1/(1+δ)) Σ_{i=1}^n E tr( YQ−iσiσ̂iᵀŶᵀ (δ − (1/T)σiᵀQ−iσi) / (1 + (1/T)σiᵀQ−iσi) )
= (2n/(TT̂)) (1/(1+δ)) tr( Y E[Q−] Φ_{XX̂} Ŷᵀ ) + (2/(TT̂)) (1/(1+δ)) E tr( YQΣᵀDΣ̂Ŷᵀ )
≡ Z21 + Z22,

with Φ_{XX̂} ≡ E[σσ̂ᵀ] and D = diag({δ − (1/T)σiᵀQ−iσi}_{i=1}^n), the operator norm of which is bounded by n^{ε−1/2} with high probability.
Now, observe that, again with the assumption that Σ̂ = Σ̂° + 1_n σ̄̂ᵀ with controlled ‖Σ̂°‖, Z22 may be decomposed as

(2/(TT̂)) (1/(1+δ)) E tr( YQΣᵀDΣ̂Ŷᵀ )
= (2/(TT̂)) (1/(1+δ)) E tr( YQΣᵀDΣ̂°Ŷᵀ ) + (2/(TT̂)) (1/(1+δ)) σ̄̂ᵀŶᵀ E[ YQΣᵀD1_n ].

In the display above, the first right-hand side term is now of order O(n^{ε−1/2}). As for the second right-hand side term, note that D1_n is a vector of identically distributed zero mean and variance O(n^{−1}) entries; while not formally independent of YQΣᵀ, it is nonetheless expected that this dependence “weakens” asymptotically (a behavior several times observed in linear random matrix models), so that one expects by central limit arguments that the second right-hand side term be also of order O(n^{ε−1/2}).
This would thus result in

E[Z2] = (2n/(TT̂)) (1/(1+δ)) tr( Y E[Q−] Φ_{XX̂} Ŷᵀ ) + O(n^{ε−1/2})
= (2n/(TT̂)) (1/(1+δ)) tr( Y Q̄ Φ_{XX̂} Ŷᵀ ) + O(n^{ε−1/2})
= (2/T̂) tr( Y Q̄ Ψ_{XX̂} Ŷᵀ ) + O(n^{ε−1/2}),

where we used ‖E[Q−] − Q̄‖ ≤ C n^{ε−1/2} and the definition Ψ_{XX̂} ≡ (n/T) Φ_{XX̂}/(1+δ).
We then move on to E[Z3] of equation (12), which can be developed as

E[Z3] = (1/(T²T̂)) E tr( YQΣᵀΣ̂Σ̂ᵀΣQYᵀ ) = (1/(T²T̂)) Σ_{i,j=1}^n E tr( YQσiσ̂iᵀσ̂jσjᵀQYᵀ )
= (1/(T²T̂)) Σ_{i,j=1}^n E tr( Y (Q−iσiσ̂iᵀ / (1 + (1/T)σiᵀQ−iσi)) (σ̂jσjᵀQ−j / (1 + (1/T)σjᵀQ−jσj)) Yᵀ )
= (1/(T²T̂)) Σ_{i=1}^n Σ_{j≠i} E tr( Y (Q−iσiσ̂iᵀ σ̂jσjᵀQ−j) Yᵀ / ((1 + (1/T)σiᵀQ−iσi)(1 + (1/T)σjᵀQ−jσj)) )
+ (1/(T²T̂)) Σ_{i=1}^n E tr( Y Q−iσiσ̂iᵀσ̂iσiᵀQ−i Yᵀ / (1 + (1/T)σiᵀQ−iσi)² )
≡ Z31 + Z32.
In the term Z32, reproducing the proof of Lemma 1 with the condition ‖X̂‖ bounded, we obtain that σ̂iᵀσ̂i concentrates around tr Φ_{X̂X̂}, which allows us to write

Z32 = (1/(T²T̂)) Σ_{i=1}^n E tr( Y Q−iσi tr(Φ_{X̂X̂}) σiᵀQ−i Yᵀ / (1 + (1/T)σiᵀQ−iσi)² )
+ (1/(T²T̂)) Σ_{i=1}^n E tr( Y Q−iσi (σ̂iᵀσ̂i − tr Φ_{X̂X̂}) σiᵀQ−i Yᵀ / (1 + (1/T)σiᵀQ−iσi)² )
= (1/T²) (tr(Φ_{X̂X̂})/T̂) Σ_{i=1}^n E tr( Y Q−iσiσiᵀQ−i Yᵀ / (1 + (1/T)σiᵀQ−iσi)² )
+ (1/T²) Σ_{i=1}^n E tr( Y Qσi ((σ̂iᵀσ̂i − tr Φ_{X̂X̂})/T̂) σiᵀQ Yᵀ )
≡ Z321 + Z322.
In Z321, replacing (1 + (1/T)σiᵀQ−iσi)^{−2} by (1+δ)^{−2} up to a correction term, we get

Z321 = (1/T²) (tr(Φ_{X̂X̂})/T̂) (1/(1+δ)²) Σ_{i=1}^n E tr( Y Q−iσiσiᵀQ−i Yᵀ )
+ (1/T²) (tr(Φ_{X̂X̂})/T̂) (1/(1+δ)²) Σ_{i=1}^n E tr{ Y QσiσiᵀQ Yᵀ [ (1+δ)² − (1 + (1/T)σiᵀQ−iσi)² ] }
= (1/T²) (tr(Φ_{X̂X̂})/T̂) (1/(1+δ)²) Σ_{i=1}^n E tr( Y Q−i Φ_{XX} Q−i Yᵀ )
+ (1/T²) (tr(Φ_{X̂X̂})/T̂) (1/(1+δ)²) E tr( Y QΣᵀDΣQ Yᵀ )
= ( n tr(Φ_{X̂X̂}) / (T²T̂(1+δ)²) ) E tr( Y Q− Φ_{XX} Q− Yᵀ ) + O(n^{ε−1/2}),

with D = diag({(1+δ)² − (1 + (1/T)σiᵀQ−iσi)²}_{i=1}^n), of operator norm O(n^{ε−1/2}) with high probability, Q− having the same law as the Q−i, and where we used the relation

Q = Q−j − (1/T) Q−jσjσjᵀQ−j / (1 + (1/T)σjᵀQ−jσj).

The term Z31 is in turn split, by means of the same relation (removing the σj-dependence of the rightmost resolvent factors), as Z31 ≡ Z311 − Z312.
The first of these terms may be expanded as

Z311 = (1/(T²T̂)) (1/(1+δ)) Σ_{j=1}^n E tr( Y Q−j Σᵀ−jΣ̂−j σ̂jσjᵀ Q−j Yᵀ )
+ (1/(T²T̂)) (1/(1+δ)) Σ_{j=1}^n E tr( Y Q−j Σᵀ−jΣ̂−j σ̂jσjᵀ Q−j (δ − (1/T)σjᵀQ−jσj) Yᵀ / (1 + (1/T)σjᵀQ−jσj) )
≡ Z3111 + Z3112.
The idea to handle Z3112 is to retrieve forms of the type (1/T) Σ_{j=1}^n dj σ̂jσjᵀ = (1/T) Σ̂ᵀDΣ for some D satisfying ‖D‖ ≤ n^{ε−1/2} with high probability. To this end, we use

Q−j (Σᵀ−jΣ̂−j/T) = Q−j (ΣᵀΣ̂/T) − (1/T) Q−j σjσ̂jᵀ
= Q (ΣᵀΣ̂/T) + (1/T) QσjσjᵀQ (ΣᵀΣ̂/T) / (1 − (1/T)σjᵀQσj) − (1/T) Q−j σjσ̂jᵀ
and thus Z3112 can be expanded as the sum of three terms that shall be studied in order:

Z3112 = (1/(T²T̂)) (1/(1+δ)) Σ_{j=1}^n E tr( Y Q−j Σᵀ−jΣ̂−j σ̂jσjᵀ Q−j (δ − (1/T)σjᵀQ−jσj) Yᵀ / (1 + (1/T)σjᵀQ−jσj) )
= (1/(TT̂)) (1/(1+δ)) E tr( Y Q (ΣᵀΣ̂/T) Σ̂ᵀDΣ Q Yᵀ )
+ (1/(TT̂)) (1/(1+δ)) Σ_{j=1}^n E tr( Y QσjσjᵀQ (ΣᵀΣ̂/T) σ̂j (δ − (1/T)σjᵀQ−jσj) σjᵀ Q Yᵀ / (T (1 − (1/T)σjᵀQσj)) )
− (1/(T²T̂)) (1/(1+δ)) Σ_{j=1}^n E tr( Y Qσj σ̂jᵀσ̂j σjᵀQ (δ − (1/T)σjᵀQ−jσj)(1 + (1/T)σjᵀQ−jσj) Yᵀ )
≡ Z31121 + Z31122 − Z31123,

where D = diag({δ − (1/T)σjᵀQ−jσj}_{j=1}^n). First, Z31121 is of order O(n^{ε−1/2}), since Q(ΣᵀΣ̂/T) is of bounded operator norm. Subsequently, Z31122 can be rewritten as

Z31122 = (1/T̂) (1/(1+δ)) E tr( Y Q (ΣᵀD′Σ/T) Q Yᵀ ) = O(n^{ε−1/2})
with here

D′ = diag( { (δ − (1/T)σjᵀQ−jσj) (1/T)σjᵀQ (Σᵀ−jΣ̂−j/T) σ̂j / ( (1 − (1/T)σjᵀQσj)(1 + (1/T)σjᵀQ−jσj) ) }_{j=1}^n ),

each entry of which is of order O(n^{ε−1/2}) with high probability. The same arguments apply for Z31123, but for

D′′ = diag( { (tr(Φ_{X̂X̂})/T̂) (δ − (1/T)σjᵀQ−jσj)(1 + (1/T)σjᵀQ−jσj) }_{j=1}^n ),

which completes the proof that |Z3112| ≤ C n^{ε−1/2}, and thus

Z311 = Z3111 + O(n^{ε−1/2})
= (1/(T²T̂)) (1/(1+δ)) Σ_{j=1}^n E tr( Y Q−j Σᵀ−jΣ̂−j σ̂jσjᵀ Q−j Yᵀ ) + O(n^{ε−1/2}).
Expanding Σᵀ−jΣ̂−j = Σ_{i≠j} σiσ̂iᵀ, replacing σ̂jσjᵀ by its conditional expectation Φ_{X̂X} (at the expense of a O(n^{ε−1/2}) error) and removing the σi-dependence of the Q−j factors, we next obtain

Z3111 = (1/(T²T̂)) (1/(1+δ)) Σ_{j=1}^n Σ_{i≠j} E tr( Y (Q−ijσiσ̂iᵀ / (1 + (1/T)σiᵀQ−ijσi)) Φ_{X̂X} Q−ij Yᵀ )
− (1/(T³T̂)) (1/(1+δ)) Σ_{j=1}^n Σ_{i≠j} E tr( Y (Q−ijσiσ̂iᵀ / (1 + (1/T)σiᵀQ−ijσi)) Φ_{X̂X} (Q−ijσiσiᵀQ−ij / (1 + (1/T)σiᵀQ−ijσi)) Yᵀ )
≡ Z31111 − Z31112.

The first term may be expanded as

Z31111 = (1/(T²T̂)) (1/(1+δ)²) Σ_{j=1}^n Σ_{i≠j} E tr( Y Q−ij σiσ̂iᵀ Φ_{X̂X} Q−ij Yᵀ )
+ (1/(T²T̂)) (1/(1+δ)²) Σ_{j=1}^n Σ_{i≠j} E tr( Y Q−ij σiσ̂iᵀ (δ − (1/T)σiᵀQ−ijσi) Φ_{X̂X} Q−ij Yᵀ / (1 + (1/T)σiᵀQ−ijσi) )
= (n²/(T²T̂)) (1/(1+δ)²) E tr( Y Q−− Φ_{XX̂}Φ_{X̂X} Q−− Yᵀ )
+ (1/(T²T̂)) (1/(1+δ)²) Σ_{j=1}^n E tr( Y Q−j Σᵀ−j D Σ̂−j Φ_{X̂X} Q−j Yᵀ ) + O(n^{ε−1/2})
= (n²/(T²T̂)) (1/(1+δ)²) E tr( Y Q−− Φ_{XX̂}Φ_{X̂X} Q−− Yᵀ ) + O(n^{ε−1/2}),

with Q−− having the same law as Q−ij and D = diag({δ − (1/T)σiᵀQ−ijσi}_{i=1}^n). Plugging the asymptotic equivalent of E[QAQ] derived in Section 5.2.3, this becomes

Z31111 = (1/T̂) tr( Y Q̄ Ψ_{XX̂}Ψ_{X̂X} Q̄ Yᵀ )
+ (1/T̂) [ (1/n) tr( Y Q̄ Ψ_{XX} Q̄ Yᵀ ) / ( 1 − (1/n) tr(Ψ_{XX}²Q̄²) ) ] tr( Ψ_{XX} Q̄ Ψ_{XX̂}Ψ_{X̂X} Q̄ ) + O(n^{ε−1/2}).
Following the same principle, we deduce for Z31112 that

Z31112 = (1/(T³T̂)) (1/(1+δ)) Σ_{j=1}^n Σ_{i≠j} E tr( Y (Q−ijσiσ̂iᵀ / (1 + (1/T)σiᵀQ−ijσi)) Φ_{X̂X} (Q−ijσiσiᵀQ−ij / (1 + (1/T)σiᵀQ−ijσi)) Yᵀ )
= (1/(T³T̂)) (1/(1+δ)³) Σ_{j=1}^n Σ_{i≠j} E{ tr( Y Q−ijσiσiᵀQ−ij Yᵀ ) tr( Φ_{X̂X} Q−ij Φ_{XX̂} ) }
+ (1/(T²T̂)) (1/(1+δ)³) Σ_{j=1}^n Σ_{i≠j} E tr( Y Q−jσi Di σiᵀQ−j Yᵀ ) + O(n^{ε−1/2})
= (n²/(T³T̂)) (1/(1+δ)³) E{ tr( Y Q−− Φ_{XX} Q−− Yᵀ ) tr( Φ_{X̂X} Q−− Φ_{XX̂} ) } + O(n^{ε−1/2})
= (1/T̂) [ (1/n) tr( Y Q̄ Ψ_{XX} Q̄ Yᵀ ) / ( 1 − (1/n) tr(Ψ_{XX}²Q̄²) ) ] tr( Ψ_{X̂X} Q̄ Ψ_{XX̂} ) + O(n^{ε−1/2}),

with Di = (1/T) tr( Φ_{X̂X} Q−ij Φ_{XX̂} ) [ (1+δ)² − (1 + (1/T)σiᵀQ−ijσi)² ], also believed to be of order O(n^{ε−1/2}). Recalling the fact that Z311 = Z3111 + O(n^{ε−1/2}), we can thus conclude for Z311 that

Z311 = (1/T̂) tr( Y Q̄ Ψ_{XX̂}Ψ_{X̂X} Q̄ Yᵀ )
+ (1/T̂) [ (1/n) tr( Y Q̄ Ψ_{XX} Q̄ Yᵀ ) / ( 1 − (1/n) tr(Ψ_{XX}²Q̄²) ) ] tr( Ψ_{XX} Q̄ Ψ_{XX̂}Ψ_{X̂X} Q̄ )
− (1/T̂) [ (1/n) tr( Y Q̄ Ψ_{XX} Q̄ Yᵀ ) / ( 1 − (1/n) tr(Ψ_{XX}²Q̄²) ) ] tr( Ψ_{X̂X} Q̄ Ψ_{XX̂} ) + O(n^{ε−1/2}).
Since Q−j (1/T)Σᵀ−jΣ̂−j is expected to be of bounded norm, using the concentration inequality of the quadratic form (1/T)σjᵀQ−j (1/T)Σᵀ−jΣ̂−j σ̂j, we infer

Z312 = (1/(TT̂)) Σ_{j=1}^n E{ tr( Y Q−jσjσjᵀQ−j Yᵀ / (1 + (1/T)σjᵀQ−jσj)² ) (1/T²) tr( Q−j Σᵀ−jΣ̂−j Φ_{X̂X} ) } + O(n^{ε−1/2}).
We again replace (1/T)σjᵀQ−jσj by δ and take the expectation over wj to obtain

Z312 = (1/(TT̂)) (1/(1+δ)²) Σ_{j=1}^n E{ tr( Y Q−jσjσjᵀQ−j Yᵀ ) (1/T²) tr( Q−j Σᵀ−jΣ̂−j Φ_{X̂X} ) }
+ (1/(TT̂)) (1/(1+δ)²) Σ_{j=1}^n E{ tr( Y Q−jσj Dj σjᵀQ−j Yᵀ / (1 + (1/T)σjᵀQ−jσj)² ) (1/T²) tr( Q−j Σᵀ−jΣ̂−j Φ_{X̂X} ) } + O(n^{ε−1/2})
= (n/(TT̂)) (1/(1+δ)²) E{ tr( Y Q− Φ_{XX} Q− Yᵀ ) (1/T²) tr( Q− Σᵀ−Σ̂− Φ_{X̂X} ) }
+ (1/(TT̂)) (1/(1+δ)²) E{ tr( Y QΣᵀDΣQ Yᵀ ) (1/T²) tr( Q− Σᵀ−Σ̂− Φ_{X̂X} ) } + O(n^{ε−1/2}),

with Dj = (1+δ)² − (1 + (1/T)σjᵀQ−jσj)² = O(n^{ε−1/2}), which eventually brings the second term to vanish, and we thus get

Z312 = (n/(TT̂)) (1/(1+δ)²) E{ tr( Y Q− Φ_{XX} Q− Yᵀ ) (1/T²) tr( Q− Σᵀ−Σ̂− Φ_{X̂X} ) } + O(n^{ε−1/2}).
For the term (1/T²) tr( Q− Σᵀ−Σ̂− Φ_{X̂X} ), we apply again the concentration inequality to get

(1/T²) tr( Q−j Σᵀ−jΣ̂−j Φ_{X̂X} ) = (1/T²) Σ_{i≠j} tr( Q−j σiσ̂iᵀ Φ_{X̂X} )
= (1/T²) Σ_{i≠j} tr( Q−ijσiσ̂iᵀ Φ_{X̂X} / (1 + (1/T)σiᵀQ−ijσi) )
= (1/T²) (1/(1+δ)) Σ_{i≠j} tr( Q−ijσiσ̂iᵀ Φ_{X̂X} )
+ (1/T²) (1/(1+δ)) Σ_{i≠j} tr( Q−ijσiσ̂iᵀ Φ_{X̂X} (δ − (1/T)σiᵀQ−ijσi) / (1 + (1/T)σiᵀQ−ijσi) )
= ((n−1)/T²) (1/(1+δ)) tr( Φ_{X̂X} E[Q−−] Φ_{XX̂} )
+ (1/T²) (1/(1+δ)) tr( Q−j Σᵀ−j D Σ̂−j Φ_{X̂X} ) + O(n^{ε−1/2})

with high probability, where D = diag({δ − (1/T)σiᵀQ−ijσi}_{i=1}^n), the norm of which is of order O(n^{ε−1/2}). This entails

(1/T²) tr( Q− Σᵀ−Σ̂− Φ_{X̂X} ) = (n/T²) (1/(1+δ)) tr( Φ_{X̂X} E[Q−−] Φ_{XX̂} ) + O(n^{ε−1/2})
with high probability. Once more plugging the asymptotic equivalent of E[QAQ] deduced in Section 5.2.3, we conclude for Z312 that

Z312 = (1/T̂) [ (1/n) tr( Y Q̄ Ψ_{XX} Q̄ Yᵀ ) / ( 1 − (1/n) tr(Ψ_{XX}²Q̄²) ) ] tr( Ψ_{X̂X} Q̄ Ψ_{XX̂} ) + O(n^{ε−1/2}),

so that, within Z3 = Z311 − Z312 + Z32, the term tr( Ψ_{X̂X} Q̄ Ψ_{XX̂} ) appears in total with a factor −2.
Combining the estimates of E[Z2] as well as Z31 and Z32, we finally have the estimate for the test error defined in (12):

Etest = (1/T̂) ‖ Ŷᵀ − Ψᵀ_{XX̂} Q̄ Yᵀ ‖²_F
+ [ (1/n) tr( Y Q̄ Ψ_{XX} Q̄ Yᵀ ) / ( 1 − (1/n) tr(Ψ_{XX}²Q̄²) ) ]
× [ (1/T̂) tr Ψ_{X̂X̂} + (1/T̂) tr( Ψ_{XX} Q̄ Ψ_{XX̂}Ψ_{X̂X} Q̄ ) − (2/T̂) tr( Ψ_{X̂X} Q̄ Ψ_{XX̂} ) ]
+ O(n^{ε−1/2}).

It then suffices to use the relation

Ψ_{XX} Q̄ = ( Ψ_{XX} + γ I_T − γ I_T )( Ψ_{XX} + γ I_T )^{−1} = I_T − γ Q̄

in the second term in brackets to finally retrieve the form of Conjecture 1.
of σ (or Σ) further satisfy concentration inequalities, and thus, if the Lipschitz parameter scales with n, convergence results as n → ∞. With this in mind, note that we could have generalized our input–output model z = βᵀσ(W x) of Section 2 to

z = βᵀσ(x; W)

for σ : R^p × P → R^n, with P some probability space and W ∈ P a random variable such that σ(x; W) and σ(X; W) [where σ(·) is here applied column-wise] satisfy a concentration of measure phenomenon; it is not even necessary that σ(X; W) have a normal concentration, so long as the corresponding concentration function allows for appropriate convergence results. This generalized setting however has the drawback of being less explicit and less practical (as most neural networks involve linear maps W x rather than nonlinear maps of W and x).
A much less demanding generalization though would consist in changing the
vector w ∼ Nϕ (0, Ip ) for a vector w still satisfying an exponential (not necessarily
normal) concentration. This is the case notably if w = ϕ(w̃) with ϕ(·) a Lipschitz
map with Lipschitz parameter bounded by, say, log(n) or any small enough power
of n. This would then allow for w with heavier than Gaussian tails.
Despite its simplicity, the concentration method also has some strong limitations that presently do not allow for a sufficiently profound analysis of the testing mean square error. We believe that Conjecture 1 can be proved by means of more elaborate methods. Notably, we believe that the powerful Gaussian method advertised in Pastur and Shcherbina (2011), which relies on Stein's lemma and the Poincaré–Nash inequality, could provide a refined control of the residual terms involved in the derivation of Conjecture 1. However, since Stein's lemma (which states that E[xφ(x)] = E[φ′(x)] for x ∼ N(0, 1) and differentiable polynomially bounded φ) can only be used on products xφ(x) involving the linear component x, the latter is not directly accessible here; we nonetheless believe that appropriate ansätze of Stein's lemma, adapted to the nonlinear setting and currently under investigation, could be exploited.
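As a quick numerical illustration of the lemma itself (our own addition, with the arbitrary differentiable, polynomially bounded choice φ = tanh):

```python
import numpy as np

# Stein's lemma: E[x phi(x)] = E[phi'(x)] for x ~ N(0, 1).
rng = np.random.default_rng(6)
x = rng.standard_normal(4_000_000)
lhs = np.mean(x * np.tanh(x))          # E[x tanh(x)]
rhs = np.mean(1 - np.tanh(x)**2)       # E[sech^2(x)] = E[(tanh)'(x)]
assert abs(lhs - rhs) < 5e-3
```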
As a striking example, one key advantage of such a tool would be the possibility to evaluate expectations of the type Z = E[σσᵀ((1/T)σᵀQ−σ − α)], which, in our present analysis, was shown to be bounded in the order of symmetric matrices by Cn^{ε−1/2}Φ with high probability. Thus, if no matrix (such as Q̄) pre-multiplies Z, since ‖Φ‖ can grow as large as O(n), Z cannot be shown to vanish. But such a bound does not account for the fact that Φ would in general be unbounded because of the term σ̄σ̄ᵀ in the display Φ = σ̄σ̄ᵀ + E[(σ − σ̄)(σ − σ̄)ᵀ], where σ̄ = E[σ]. Intuitively, the “mean” contribution σ̄σ̄ᵀ of σσᵀ, being post-multiplied in Z by (1/T)σᵀQ−σ − α (which averages to zero), disappears; and thus only smaller order terms remain. We believe that the aforementioned ansätze for the Gaussian tools would be capable of subtly handling this self-averaging effect on Z to prove that Z vanishes [for σ(t) = t, it is simple to show that ‖Z‖ ≤ Cn^{−1}]. In addition, Stein's lemma-based methods only require the differentiability of σ(·), which need not be Lipschitz, thereby allowing for a larger class of activation functions.
REFERENCES
Akhiezer, N. I. and Glazman, I. M. (1993). Theory of Linear Operators in Hilbert Space. Dover, New York. MR1255973
Bai, Z. D. and Silverstein, J. W. (1998). No eigenvalues outside the support of the limiting spectral distribution of large-dimensional sample covariance matrices. Ann. Probab. 26 316–345. MR1617051
Bai, Z. D. and Silverstein, J. W. (2007). On the signal-to-interference-ratio of CDMA systems in wireless communications. Ann. Appl. Probab. 17 81–101. MR2292581
Bai, Z. and Silverstein, J. W. (2010). Spectral Analysis of Large Dimensional Random Matrices, 2nd ed. Springer, New York. MR2567175
Benaych-Georges, F. and Nadakuditi, R. R. (2012). The singular values and vectors of low rank perturbations of large rectangular random matrices. J. Multivariate Anal. 111 120–135.
Cambria, E., Gastaldo, P., Bisio, F. and Zunino, R. (2015). An ELM-based model for affective analogical reasoning. Neurocomputing 149 443–455.
Choromanska, A., Henaff, M., Mathieu, M., Arous, G. B. and LeCun, Y. (2015). The loss surfaces of multilayer networks. In AISTATS.
Couillet, R. and Benaych-Georges, F. (2016). Kernel spectral clustering of large dimensional data. Electron. J. Stat. 10 1393–1454.
Couillet, R., Hoydis, J. and Debbah, M. (2012). Random beamforming over quasi-static and fading channels: A deterministic equivalent approach. IEEE Trans. Inform. Theory 58 6392–6425. MR2982669
Couillet, R. and Kammoun, A. (2016). Random matrix improved subspace clustering. In 2016 Asilomar Conference on Signals, Systems, and Computers.
Couillet, R., Pascal, F. and Silverstein, J. W. (2015). The random matrix regime of Maronna's M-estimator with elliptically distributed samples. J. Multivariate Anal. 139 56–78.
El Karoui, N. (2009). Concentration of measure and spectra of random matrices: Applications to correlation matrices, elliptical distributions and beyond. Ann. Appl. Probab. 19 2362–2405. MR2588248
El Karoui, N. (2010). The spectrum of kernel random matrices. Ann. Statist. 38 1–50. MR2589315
El Karoui, N. (2013). Asymptotic behavior of unregularized and ridge-regularized high-dimensional robust regression estimators: Rigorous results. Preprint. Available at arXiv:1311.2445.
Giryes, R., Sapiro, G. and Bronstein, A. M. (2016). Deep neural networks with random Gaussian weights: A universal classification strategy? IEEE Trans. Signal Process. 64 3444–3457. MR3515693
Hornik, K., Stinchcombe, M. and White, H. (1989). Multilayer feedforward networks are universal approximators. Neural Networks 2 359–366.
Huang, G.-B., Zhu, Q.-Y. and Siew, C.-K. (2006). Extreme learning machine: Theory and applications. Neurocomputing 70 489–501.
Huang, G.-B., Zhou, H., Ding, X. and Zhang, R. (2012). Extreme learning machine for regression and multiclass classification. IEEE Trans. Syst., Man, Cybern. B, Cybern. 42 513–529.
Jaeger, H. and Haas, H. (2004). Harnessing nonlinearity: Predicting chaotic systems and saving energy in wireless communication. Science 304 78–80.
Kammoun, A., Kharouf, M., Hachem, W. and Najim, J. (2009). A central limit theorem for the SINR at the LMMSE estimator output for large-dimensional signals. IEEE Trans. Inform. Theory 55 5048–5063.
Krizhevsky, A., Sutskever, I. and Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 1097–1105.
LeCun, Y., Cortes, C. and Burges, C. (1998). The MNIST database of handwritten digits.
Ledoux, M. (2005). The Concentration of Measure Phenomenon 89. Amer. Math. Soc., Providence, RI. MR1849347
Liao, Z. and Couillet, R. (2017). A large dimensional analysis of least squares support vector machines. J. Mach. Learn. Res. To appear. Available at arXiv:1701.02967.
Loubaton, P. and Vallet, P. (2010). Almost sure localization of the eigenvalues in a Gaussian information plus noise model. Application to the spiked models. Electron. J. Probab. 16 1934–1959.
Mai, X. and Couillet, R. (2017). The counterintuitive mechanism of graph-based semi-supervised learning in the big data regime. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP'17).
Marčenko, V. A. and Pastur, L. A. (1967). Distribution of eigenvalues for some sets of random matrices. Math. USSR, Sb. 1 457–483.
Pastur, L. and Shcherbina, M. (2011). Eigenvalue Distribution of Large Random Matrices. Amer. Math. Soc., Providence, RI. MR2808038
Rahimi, A. and Recht, B. (2007). Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems 1177–1184.
Rosenblatt, F. (1958). The perceptron: A probabilistic model for information storage and organization in the brain. Psychol. Rev. 65 386–408.
Rudelson, M. and Vershynin, R. (2013). Hanson–Wright inequality and sub-Gaussian concentration. Electron. Commun. Probab. 18 1–9.
Saxe, A., Koh, P. W., Chen, Z., Bhand, M., Suresh, B. and Ng, A. Y. (2011). On random weights and unsupervised feature learning. In Proceedings of the 28th International Conference on Machine Learning (ICML-11) 1089–1096.
Schmidhuber, J. (2015). Deep learning in neural networks: An overview. Neural Netw. 61 85–117.
Silverstein, J. W. and Bai, Z. D. (1995). On the empirical distribution of eigenvalues of a class of large dimensional random matrices. J. Multivariate Anal. 54 175–192.
Silverstein, J. W. and Choi, S. (1995). Analysis of the limiting spectral distribution of large dimensional random matrices. J. Multivariate Anal. 54 295–309.
Tao, T. (2012). Topics in Random Matrix Theory 132. Amer. Math. Soc., Providence, RI.
Titchmarsh, E. C. (1939). The Theory of Functions. Oxford Univ. Press, New York.
Vershynin, R. (2012). Introduction to the non-asymptotic analysis of random matrices. In Compressed Sensing 210–268. Cambridge Univ. Press, Cambridge.
Williams, C. K. I. (1998). Computation with infinite neural networks. Neural Comput. 10 1203–1216.
Yates, R. D. (1995). A framework for uplink power control in cellular radio systems. IEEE J. Sel. Areas Commun. 13 1341–1347.
Zhang, T., Cheng, X. and Singer, A. (2014). Marchenko–Pastur law for Tyler's and Maronna's M-estimators. Available at arXiv:1401.3424.