These findings provide new insights into the roles played by the acti-
vation function σ(·) and the random distribution of the entries of W in
random feature maps as well as by the ridge-regression parameter γ in the
neural network performance. We notably exhibit and prove some peculiar
behaviors, such as the impossibility for the network to carry out elementary
Gaussian mixture classification tasks, when either the activation function or
the random weights distribution is ill chosen.
Besides, for the practitioner, the theoretical formulas retrieved in this
work allow for a fast offline tuning of the aforementioned hyperparameters
of the neural network, notably when T is not too large compared to p.
The graphical results provided in the course of the article were notably
obtained with a 100- to 500-fold gain in computation time of the theoretical
formulas over the corresponding simulations.
where we defined Σ ≡ σ(W X). This follows from differentiating the mean
square error with respect to β to obtain $0 = \gamma\beta + \frac1T\sum_{i=1}^T\sigma(Wx_i)(\beta^T\sigma(Wx_i) - y_i)^T$,
so that $(\frac1T\Sigma\Sigma^T + \gamma I_n)\beta = \frac1T\Sigma Y^T$ which, along with $(\frac1T\Sigma\Sigma^T + \gamma I_n)^{-1}\Sigma = \Sigma(\frac1T\Sigma^T\Sigma + \gamma I_T)^{-1}$, gives the result.
In the remainder, we will also denote
$$Q \equiv \left(\frac1T\Sigma^T\Sigma + \gamma I_T\right)^{-1}$$
with which the training mean square error reads
$$E_{\rm train} = \frac1T\left\|Y^T - \Sigma^T\beta\right\|_F^2 = \frac{\gamma^2}{T}\,{\rm tr}\,Y^TYQ^2. \tag{1}$$
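As a concrete illustration of these definitions, the short sketch below builds such a single-hidden-layer random feature network and checks the two equivalent expressions in (1); the data, the dimensions and the ReLU activation are illustrative placeholders rather than the settings used in the article.

```python
import numpy as np

# Minimal sketch of the random-feature ridge regressor described above.
# Dimensions, data and the ReLU activation are illustrative placeholders.
rng = np.random.default_rng(0)
p, n, T, gamma = 50, 128, 200, 1e-2

X = rng.standard_normal((p, T))           # training inputs (p x T)
Y = np.sign(rng.standard_normal((1, T)))  # training targets (d x T), here d = 1
W = rng.standard_normal((n, p))           # random (untrained) hidden layer
Sigma = np.maximum(W @ X, 0)              # Sigma = sigma(WX) with sigma = ReLU

# Ridge-regression output layer: beta = (1/T) Sigma Q Y^T with
# Q = (Sigma^T Sigma / T + gamma I_T)^{-1}
Q = np.linalg.inv(Sigma.T @ Sigma / T + gamma * np.eye(T))
beta = Sigma @ Q @ Y.T / T

# Training mean square error and its equivalent expression (1)
E_train = np.linalg.norm(Y.T - Sigma.T @ beta, 'fro') ** 2 / T
E_train_Q = gamma ** 2 / T * np.trace(Y.T @ Y @ Q @ Q)
print(E_train, E_train_Q)   # the two expressions coincide
```

The T × T resolvent Q used here is the object around which the whole analysis revolves.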
where Σ̂ = σ(W X̂) and β is the same as used in (1) (and thus only depends
on (X, Y ) and γ). One of the key questions in the analysis of such an ele-
mentary neural network lies in the determination of γ which minimizes Etest
(and is thus said to have good generalization performance). Notably, small
γ values are known to reduce Etrain but to induce the popular overfitting is-
sue which generally increases Etest , while large γ values engender both large
values for Etrain and Etest .
From a mathematical standpoint though, the study of Etest brings for-
ward technical difficulties that do not allow for as simple a treatment
through the present concentration of measure methodology as the study of
Etrain . Nonetheless, the analysis of Etrain at least makes heuristic ap-
proaches available, which we shall exploit to propose an asymptotic deter-
ministic approximation for Etest .
W = ϕ(W̃ )
3. Main Results.
$$\left\|E[Q] - \bar Q\right\| \leq cn^{-\frac12+\varepsilon}.$$
where µ̄n is the measure defined through its Stieltjes transform $m_{\bar\mu_n}(z) \equiv \int(t-z)^{-1}d\bar\mu_n(t)$ given, for z ∈ {w ∈ C, ℑ[w] > 0}, by
$$m_{\bar\mu_n}(z) = \frac1T\,{\rm tr}\left(\frac nT\frac{\Phi}{1+\delta_z} - zI_T\right)^{-1}.$$
Note that µ̄n has a well-known form, already met in early random ma-
trix works (e.g., (Silverstein and Bai, 1995)) on sample covariance matrix
models. Notably, µ̄n is also the deterministic equivalent of the empirical spectral distribution of $\frac1T\Sigma^T\Sigma$.
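To make this definition concrete, the small sketch below compares $m_{\bar\mu_n}(z)$, obtained from the fixed point in $\delta_z$, with the empirical Stieltjes transform $\frac1T{\rm tr}(\frac1T\Sigma^T\Sigma - zI_T)^{-1}$ of one realization; the Monte Carlo estimate of Φ, the ReLU activation and all dimensions are placeholder assumptions.

```python
import numpy as np

rng = np.random.default_rng(10)
p, n, T = 40, 120, 180
X = rng.standard_normal((p, T)) / np.sqrt(p)
sigma = lambda t: np.maximum(t, 0)      # ReLU, as an example

# Monte Carlo estimate of Phi = E[sigma(X^T w) sigma(w^T X)]
M = 50_000
Phi = np.zeros((T, T))
for Wb in np.array_split(rng.standard_normal((M, p)), 25):
    S = sigma(Wb @ X)
    Phi += S.T @ S
Phi /= M

z = -0.5   # a point on the negative real axis
d = 0.0    # fixed point for delta_z
for _ in range(200):
    R = np.linalg.inv(n / T * Phi / (1 + d) - z * np.eye(T))
    d = np.trace(Phi @ R) / T
R = np.linalg.inv(n / T * Phi / (1 + d) - z * np.eye(T))
m_bar = np.trace(R) / T

# empirical Stieltjes transform of (1/T) Sigma^T Sigma for one realization
Sigma = sigma(rng.standard_normal((n, p)) @ X)
m_emp = np.trace(np.linalg.inv(Sigma.T @ Sigma / T - z * np.eye(T))) / T
print(m_bar, m_emp)   # close for large n, p, T
```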
Theorem 1 provides the central step in the evaluation of Etrain , for which
not only E[Q] but also E[Q²] needs to be estimated. This last ingredient is
provided in the following proposition.
$$E_{\rm train} = \frac1T\left\|Y^T - \Sigma^T\beta\right\|_F^2 = \frac{\gamma^2}{T}\,{\rm tr}\,Y^TYQ^2,$$
$$\bar E_{\rm train} = \frac{\gamma^2}{T}\,{\rm tr}\,Y^TY\bar Q\left[\frac{\frac1n{\rm tr}\,\Psi^2\bar Q^2}{1-\frac1n{\rm tr}(\Psi\bar Q)^2}\,\Psi + I_T\right]\bar Q.$$
While not immediate at first sight, one can confirm (using notably the
relation ΨQ̄ + γ Q̄ = IT ) that, for (X̂, Ŷ ) = (X, Y ), Ētrain = Ētest , as ex-
pected.
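As a sanity check, the deterministic equivalent Ētrain above can be evaluated numerically once Φ is known (or estimated); the sketch below uses a Monte Carlo estimate of Φ and a plain fixed-point iteration for δ, with the data, the dimensions and the ReLU activation being illustrative placeholders.

```python
import numpy as np

rng = np.random.default_rng(1)
p, n, T, gamma = 50, 128, 200, 1e-2
X = rng.standard_normal((p, T))
Y = np.sign(rng.standard_normal((1, T)))
sigma = lambda t: np.maximum(t, 0)                 # ReLU, as an example

# Monte Carlo estimate of Phi = E[sigma(X^T w) sigma(w^T X)], w ~ N(0, I_p)
M = 20_000
Wmc = rng.standard_normal((M, p))
S = sigma(Wmc @ X)
Phi = S.T @ S / M

# Fixed-point iteration for delta = (1/T) tr(Phi Qbar),
# with Qbar = (n/T * Phi/(1+delta) + gamma I_T)^{-1}
delta = 0.0
for _ in range(200):
    Qbar = np.linalg.inv(n / T * Phi / (1 + delta) + gamma * np.eye(T))
    delta = np.trace(Phi @ Qbar) / T
Qbar = np.linalg.inv(n / T * Phi / (1 + delta) + gamma * np.eye(T))

Psi = n / T * Phi / (1 + delta)
x = np.trace(Psi @ Qbar @ Psi @ Qbar) / n
E_train_bar = gamma**2 / T * np.trace(
    Y @ Qbar @ (x / (1 - x) * Psi + np.eye(T)) @ Qbar @ Y.T)
print(E_train_bar)
```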
3.3. Evaluation of ΦAB . The evaluation of Φ_AB = E[σ(wᵀA)ᵀσ(wᵀB)]
for arbitrary matrices A, B naturally boils down to the evaluation of its
individual entries and thus to the calculus, for arbitrary vectors a, b ∈ Rᵖ,
of
$$\Phi_{ab} \equiv E[\sigma(w^Ta)\sigma(w^Tb)] = (2\pi)^{-\frac p2}\int\sigma(\varphi(\tilde w)^Ta)\,\sigma(\varphi(\tilde w)^Tb)\,e^{-\frac12\|\tilde w\|^2}d\tilde w. \tag{4}$$
σ(t)           Φ_ab
t              aᵀb
max(t, 0)      (1/(2π)) ‖a‖‖b‖ ( ∠(a,b) arccos(−∠(a,b)) + √(1 − ∠(a,b)²) )
|t|            (2/π) ‖a‖‖b‖ ( ∠(a,b) arcsin(∠(a,b)) + √(1 − ∠(a,b)²) )
erf(t)         (2/π) arcsin( 2aᵀb / √((1 + 2‖a‖²)(1 + 2‖b‖²)) )
1_{t>0}        1/2 − (1/(2π)) arccos(∠(a,b))
sign(t)        (2/π) arcsin(∠(a,b))
cos(t)         exp(−(‖a‖² + ‖b‖²)/2) cosh(aᵀb)
sin(t)         exp(−(‖a‖² + ‖b‖²)/2) sinh(aᵀb)

Table 1
Values of Φ_ab for w ∼ N(0, I_p), with ∠(a, b) ≡ aᵀb/(‖a‖‖b‖).
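A quick Monte Carlo check of, e.g., the max(t, 0) row of Table 1 can be carried out as follows (the vectors a, b and the sample size are arbitrary placeholders):

```python
import numpy as np

rng = np.random.default_rng(2)
p = 10
a, b = rng.standard_normal(p), rng.standard_normal(p)

# Monte Carlo estimate of Phi_ab = E[max(w.a, 0) * max(w.b, 0)], w ~ N(0, I_p)
W = rng.standard_normal((500_000, p))
mc = np.mean(np.maximum(W @ a, 0) * np.maximum(W @ b, 0))

# Closed form from Table 1 for sigma(t) = max(t, 0)
ang = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
closed = (np.linalg.norm(a) * np.linalg.norm(b) / (2 * np.pi)
          * (ang * np.arccos(-ang) + np.sqrt(1 - ang**2)))
print(mc, closed)   # the two values agree up to Monte Carlo error
```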
[Figure 1 here: MSE (Etrain, Etest and their deterministic equivalents Ētrain, Ētest) as a function of γ, with one panel for σ(t) = erf(t) and one for σ(t) = max(t, 0).]
Fig 1. Neural network performance for Lipschitz continuous σ(·), Wij ∼ N (0, 1), as a
function of γ, for 2-class MNIST data (sevens, nines), n = 512, T = T̂ = 1024, p = 784.
found to be 100 to 500 times faster than generating the simulated net-
work performances. Beyond their theoretical interest, the provided formulas
therefore allow for an efficient offline tuning of the network hyperparameters,
notably the choice of an appropriate value for the ridge-regression parameter
γ.
[Figure 2 here: MSE (Etrain, Etest, Ētrain, Ētest) versus γ ∈ [10⁻⁴, 10²], for σ(t) = 1 − t²/2, σ(t) = 1_{t>0} and σ(t) = sign(t).]
Fig 2. Neural network performance for σ(·) either discontinuous or non Lipschitz, Wij ∼
N (0, 1), as a function of γ, for 2-class MNIST data (sevens, nines), n = 512, T = T̂ =
1024, p = 784.
i.e., m4 ≠ m2² (see (Couillet and Benaych-Georges, 2016) for details). This
is visually seen in the bottom part of Figure 3 where the Gaussian scenario
presents an isolated eigenvalue for Φ with corresponding structured eigen-
vector, which is not the case of the Bernoulli scenario. To complete this
discussion, it appears relevant in the present setting to choose Wij in such a
way that m4 −m22 is far from zero, thus suggesting the interest of heavy-tailed
distributions. To confirm this prediction, Figure 3 additionally displays the
performance achieved and the spectrum of Φ observed for Wij ∼ Stud, that
is, following a Student-t distribution with ν = 7 degrees of freedom, normal-
ized to unit variance (in this case m2 = 1 and m4 = 5). Figure 3 confirms
the large superiority of this choice over the Gaussian case (note nonetheless
the slight inaccuracy of our theoretical formulas in this case, which is likely
due to too small values of p, n, T to accommodate Wij with higher order
moments, an observation which is confirmed in simulations when letting ν
be even smaller).
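For reference, this normalization can be reproduced as in the short sketch below (the sample size is arbitrary): a Student-t variable with ν = 7 degrees of freedom has variance ν/(ν − 2), so dividing by √(ν/(ν − 2)) yields m2 = 1 and m4 = 3 + 6/(ν − 4) = 5.

```python
import numpy as np

rng = np.random.default_rng(3)
nu = 7
# Student-t samples normalized to unit variance: Var(t_nu) = nu / (nu - 2)
w = rng.standard_t(nu, size=2_000_000) / np.sqrt(nu / (nu - 2))
m2, m4 = np.mean(w**2), np.mean(w**4)
print(m2, m4)   # expected: m2 ~ 1, m4 ~ 5, so m4 - m2^2 ~ 4 (far from 0)
```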
$$\lim_{n\to\infty}\bar E_{\rm train} = 0, \qquad \lim_{n\to\infty}\left[\bar E_{\rm test} - \frac1{\hat T}\left\|\hat Y^T - \Phi_{\hat XX}\Phi^{-1}Y^T\right\|_F^2\right] = 0$$
[Figure 3 here: (top) MSE (Etrain, Etest, Ētrain, Ētest) for Wij ∼ Bern, Wij ∼ N(0,1) and Wij ∼ Stud; (bottom) spectra of Φ, with or without an isolated spike ("no spike"/"spike", mixture means "close"/"far"), and associated second eigenvectors.]
Fig 3. (Top) Neural network performance for σ(t) = − 1/2 t² + 1, with different Wij , for a
2-class Gaussian mixture model (see details in text), n = 512, T = T̂ = 1024, p = 256.
(Bottom) Spectra and second eigenvector of Φ for different Wij (first eigenvalues are of
order n and not shown; associated eigenvectors are provably non informative).
[Figure 4 here: MSE (Etrain, Etest, Ētrain, Ētest) versus γ for n growing from 256 to 4096.]
Fig 4. Neural network performance for growing n (256, 512, 1 024, 2 048, 4 096) as a
function of γ, σ(t) = max(t, 0); 2-class MNIST data (sevens, nines), T = T̂ = 1024,
p = 784. Limiting (n = ∞) Ētest shown in thick black line.
which satisfies, as γ → 0,
$$\begin{cases}\delta\to\dfrac{r}{n-r}, & r<n\\[2mm] \gamma\delta\to\Delta = \dfrac1T{\rm tr}\,\Phi\left(\dfrac nT\dfrac{\Phi}{\Delta} + I_T\right)^{-1}, & r\geq n,\end{cases}$$
where $\Psi_\Delta = \frac nT\frac{\Phi}{\Delta}$ and $Q_\Delta = \left(\frac nT\frac{\Phi}{\Delta} + I_T\right)^{-1}$.
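For the case r ≥ n, the limit ∆ can be obtained by a plain fixed-point iteration, as in the hedged sketch below (the placeholder Φ, the dimensions and the initialization are arbitrary assumptions, not values from the article):

```python
import numpy as np

def delta_limit(Phi, n, T, iters=500):
    """Fixed-point iteration for Delta = (1/T) tr Phi (n/T Phi/Delta + I_T)^{-1},
    the gamma -> 0 limit of gamma*delta when r = rank(Phi) >= n (plain sketch)."""
    Delta = 1.0
    for _ in range(iters):
        Q_Delta = np.linalg.inv(n / T * Phi / Delta + np.eye(T))
        Delta = np.trace(Phi @ Q_Delta) / T
    return Delta

# Example usage with an arbitrary full-rank nonnegative definite Phi (r = T >= n)
rng = np.random.default_rng(4)
T, n = 200, 128
A = rng.standard_normal((T, T))
Phi = A @ A.T / T
print(delta_limit(Phi, n, T))
```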
These results suggest that neural networks should be designed in a way
that both reduces the rank of Φ and maintains a strong alignment between
the dominant eigenvectors of Φ and the output matrix Y .
Interestingly, if X is assumed, as above, to be extracted from a Gaussian
mixture and Y ∈ R^{1×T} is a classification vector with Y_{1j} ∈ {−1, 1},
then the tools proposed in (Couillet and Benaych-Georges, 2016) (related to
spike random matrix analysis) allow for an explicit evaluation of the afore-
mentioned limits as n, p, T grow large. This analysis is however cumbersome
and outside the scope of the present work.
i.e., σi = σ(wiT X)T . Also, we shall define Σ−i ∈ R(n−1)×T the matrix Σ with
i-th row removed, and correspondingly
$$Q_{-i} = \left(\frac1T\Sigma^T\Sigma - \frac1T\sigma_i\sigma_i^T + \gamma I_T\right)^{-1}.$$
The main approach to the proof of our results, starting with that of the
key Lemma 1, is as follows: since Wij = ϕ(W̃ij ) with W̃ij ∼ N (0, 1) and
ϕ Lipschitz, the normal concentration of W̃ transfers to W which further
induces a normal concentration of the random vector σ and the matrix Σ,
thereby implying that Lipschitz functionals of σ or Σ also concentrate. As
pointed out earlier, these concentration results are used in place of the
independence assumptions (and their multiple consequences on the convergence
of random variables) classically exploited in random matrix theory.
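To illustrate the kind of concentration exploited throughout (in the spirit of Lemma 1), the short sketch below checks empirically that the quadratic form (1/T)σ(wᵀX)Aσ(wᵀX)ᵀ fluctuates around (1/T)tr(ΦA) with fluctuations shrinking as the dimensions grow; the data model, the choice A = I_T and the ReLU activation are placeholder assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)
sigma = lambda t: np.maximum(t, 0)    # ReLU, as an example

for p, T in ((50, 100), (200, 400), (800, 1600)):
    X = rng.standard_normal((p, T)) / np.sqrt(p)   # keeps ||X|| bounded as p, T grow
    # (1/T)||sigma(w^T X)||^2, i.e. the quadratic form with A = I_T, over many draws of w
    vals = [np.mean(sigma(w @ X) ** 2)
            for w in rng.standard_normal((1000, p))]
    print(p, T, np.std(vals))   # fluctuations around (1/T) tr(Phi) shrink with p, T
```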
$$P\left(\frac1{\sqrt T}\left\|\sigma(w^TX)\right\| \geq t + t_0\right) \leq C e^{-\frac{cTt^2}{\lambda_\varphi^2\lambda_\sigma^2\|X\|^2}}$$
$$\forall t\geq 4t_0,\quad P\left(\left\|\frac{\Sigma}{\sqrt T}\right\| \geq t\sqrt T\right) \leq Cn\,e^{-\frac{cT^2t^2}{2n\lambda_\varphi^2\lambda_\sigma^2\|X\|^2}}$$
1998). One is tempted to believe that, more generally, if E[σ] = 0, then ‖Σ/√T‖
should remain of this order. And, if instead E[σ] ≠ 0, the contribution of
(1/√T)E[σ]1_T^T should merely engender a single large-amplitude isolated singular
value in the spectrum of Σ/√T, the other singular values remaining of order
O(1). These intuitions are not captured by our concentration of measure
approach.
Since Σ = σ(W X) is an entry-wise operation, concentration results with
respect to the Frobenius norm are natural, whereas those with respect to the
operator norm are hardly accessible.
We can already bound P(A_K^c) thanks to (7). As for the first right-hand
side term, note that on the set {σ(wᵀX), w ∈ A_K}, the function f : Rᵀ → R,
σ ↦ σᵀAσ, is 2K√T-Lipschitz. This is because, for all σ, σ + h ∈ {σ(wᵀX), w ∈ A_K},
$$|f(\sigma+h) - f(\sigma)| = \left|h^TA\sigma + (\sigma+h)^TAh\right| \leq 2K\sqrt T\,\|h\|.$$
Therefore,
$$P\left(\left\{\left|f(\sigma(w^TX)) - E[\tilde f(\sigma(w^TX))]\right| \geq KTt\right\},\ A_K\right) = P\left(\left\{\left|\tilde f(\sigma(w^TX)) - E[\tilde f(\sigma(w^TX))]\right| \geq KTt\right\},\ A_K\right)$$
$$\leq P\left(\left|\tilde f(\sigma(w^TX)) - E[\tilde f(\sigma(w^TX))]\right| \geq KTt\right) \leq Ce^{-\frac{cTt^2}{\|X\|^2\lambda_\sigma^2\lambda_\varphi^2}}.$$
Our next step is then to bound the difference ∆ ≡ |E[f̃(σ(wᵀX))] − E[f(σ(wᵀX))]|.
Since f and f̃ are equal on {σ, ‖σ‖ ≤ K√T},
$$\Delta \leq \int_{\|\sigma\|\geq K\sqrt T}\left(|f(\sigma)| + |\tilde f(\sigma)|\right)d\mu_\sigma(\sigma)$$
where µ_σ is the law of σ(wᵀX). Since ‖A‖ ≤ 1, for ‖σ‖ ≥ K√T, max(|f(σ)|, |f̃(σ)|) ≤ ‖σ‖² and thus
$$\Delta \leq 2\int_{\|\sigma\|\geq K\sqrt T}\|\sigma\|^2d\mu_\sigma = 2\int_{\|\sigma\|\geq K\sqrt T}\int_{t=0}^\infty 1_{\|\sigma\|^2\geq t}\,dt\,d\mu_\sigma = 2\int_{t=0}^\infty P\left(\left\{\|\sigma\|^2\geq t\right\},\ A_K^c\right)dt$$
$$\leq 2\int_{t=0}^{K^2T}P(A_K^c)\,dt + 2\int_{t=K^2T}^\infty P\left(\|\sigma(w^TX)\|^2\geq t\right)dt \leq 2P(A_K^c)K^2T + 2\int_{t=K^2T}^\infty Ce^{-\frac{ct}{2\lambda_\varphi^2\lambda_\sigma^2\|X\|^2}}dt$$
$$\leq 2CTK^2e^{-\frac{cTK^2}{2\lambda_\varphi^2\lambda_\sigma^2\|X\|^2}} + \frac{2C\lambda_\varphi^2\lambda_\sigma^2\|X\|^2}{c}e^{-\frac{cTK^2}{2\lambda_\varphi^2\lambda_\sigma^2\|X\|^2}} \leq \frac{6C}{c}\lambda_\varphi^2\lambda_\sigma^2\|X\|^2$$
where in the last inequality we used the fact that, for x ∈ R, xe^{−x} ≤ e^{−1} ≤ 1,
and K ≥ 4t₀ ≥ 4λ_σλ_φ‖X‖√(p/T). As a consequence,
$$P\left(\left\{\left|f(\sigma(w^TX)) - E[f(\sigma(w^TX))]\right| \geq KTt + \Delta\right\},\ A_K\right) \leq Ce^{-\frac{cTt^2}{\|X\|^2\lambda_\varphi^2\lambda_\sigma^2}}$$
so that, with the same remark as before, for $t \geq \frac{4\Delta}{KT}$,
$$P\left(\left\{\left|f(\sigma(w^TX)) - E[f(\sigma(w^TX))]\right| \geq KTt\right\},\ A_K\right) \leq Ce^{-\frac{cTt^2}{2\|X\|^2\lambda_\varphi^2\lambda_\sigma^2}}.$$
To avoid the condition $t \geq \frac{4\Delta}{KT}$, we use the fact that, probabilities being
lower than one, it suffices to replace C by λC with λ ≥ 1 such that
$$\lambda Ce^{-\frac{cTt^2}{2\|X\|^2\lambda_\varphi^2\lambda_\sigma^2}} \geq 1\quad\text{for } t\leq \frac{4\Delta}{KT}.$$
The above inequality holds if we take for instance $\lambda = \frac1Ce^{\frac{18C^2}{c}}$ since then
$$t \leq \frac{4\Delta}{KT} \leq \frac{24C\lambda_\varphi^2\lambda_\sigma^2\|X\|^2}{cKT} \leq \frac{6C\lambda_\varphi\lambda_\sigma\|X\|}{c\sqrt{pT}}$$
(using successively $\Delta \leq \frac{6C}{c}\lambda_\varphi^2\lambda_\sigma^2\|X\|^2$ and $K \geq 4\lambda_\sigma\lambda_\varphi\|X\|\sqrt{p/T}$), so that, for all t > 0,
$$P\left(\left\{\left|f(\sigma(w^TX)) - E[f(\sigma(w^TX))]\right| \geq KTt\right\},\ A_K\right) \leq \lambda Ce^{-\frac{cTt^2}{2\|X\|^2\lambda_\varphi^2\lambda_\sigma^2}}$$
which, together with the inequality $P(A_K^c) \leq Ce^{-\frac{cTK^2}{2\lambda_\varphi^2\lambda_\sigma^2\|X\|^2}}$, gives
$$P\left(\left|f(\sigma(w^TX)) - E[f(\sigma(w^TX))]\right| \geq KTt\right) \leq \lambda Ce^{-\frac{cTt^2}{2\|X\|^2\lambda_\varphi^2\lambda_\sigma^2}} + Ce^{-\frac{cTK^2}{2\lambda_\varphi^2\lambda_\sigma^2\|X\|^2}}.$$
We then conclude
$$P\left(\left|\frac1T\sigma(w^TX)A\sigma(w^TX)^T - \frac1T{\rm tr}(\Phi A)\right| \geq t\right) \leq (\lambda+1)C\,e^{-\frac{cT}{2\|X\|^2\lambda_\varphi^2\lambda_\sigma^2}\min\left(t^2/K^2,\,K^2\right)}$$
and, with K = max(4t₀, √t),
$$P\left(\left|\frac1T\sigma(w^TX)A\sigma(w^TX)^T - \frac1T{\rm tr}(\Phi A)\right| \geq t\right) \leq (\lambda+1)C\,e^{-\frac{cT}{2\|X\|^2\lambda_\varphi^2\lambda_\sigma^2}\min\left(\frac{t^2}{16t_0^2},\,t\right)}.$$
Indeed, if 4t₀ ≤ √t then min(t²/K², K²) = t, while if 4t₀ ≥ √t then
min(t²/K², K²) = min(t²/(16t₀²), 16t₀²) = t²/(16t₀²).
which, along with the boundedness of the integrals, concludes the proof.
Lipschitz function of W with respect to the Frobenius norm (i.e., the Eu-
clidean norm of vec(W)), by (6),
$$P\left(|g(W) - E[g(W)]| > t\right) = P\left(\left|g(\varphi(\tilde W)) - E[g(\varphi(\tilde W))]\right| > t\right) \leq Ce^{-\frac{ct^2}{\lambda_g^2\lambda_\varphi^2}}$$
for some C, c > 0. Let us consider in particular g : W ↦ f(Σ/√T) and remark that
$$|g(W+H) - g(W)| = \left|f\left(\frac{\sigma((W+H)X)}{\sqrt T}\right) - f\left(\frac{\sigma(WX)}{\sqrt T}\right)\right| \leq \frac{\lambda_f}{\sqrt T}\left\|\sigma((W+H)X) - \sigma(WX)\right\|_F$$
$$\leq \frac{\lambda_f\lambda_\sigma}{\sqrt T}\|HX\|_F = \frac{\lambda_f\lambda_\sigma}{\sqrt T}\sqrt{{\rm tr}\,HXX^TH^T} \leq \frac{\lambda_f\lambda_\sigma}{\sqrt T}\sqrt{\|XX^T\|}\,\|H\|_F$$
concluding the proof.
for some C, c > 0, where dist(z, R⁺) is the Hausdorff set distance. In par-
ticular, for z = −γ, γ > 0, and under the additional Assumption 3,
$$P\left(\left|\frac1T{\rm tr}\,Q - \frac1T{\rm tr}\,E[Q]\right| > t\right) \leq Ce^{-cnt^2}.$$
Proof. We can apply Lemma 3 for f : R ↦ \frac1T{\rm tr}(RᵀR − zI_T)^{-1}, since we have
$$|f(R+H) - f(R)| = \frac1T\left|{\rm tr}\left((R+H)^T(R+H) - zI_T\right)^{-1}\left((R+H)^TH + H^TR\right)(R^TR - zI_T)^{-1}\right|$$
$$\leq \frac1T\left|{\rm tr}\left((R+H)^T(R+H) - zI_T\right)^{-1}(R+H)^TH(R^TR - zI_T)^{-1}\right| + \frac1T\left|{\rm tr}\left((R+H)^T(R+H) - zI_T\right)^{-1}H^TR(R^TR - zI_T)^{-1}\right|$$
$$\leq \frac{2\|H\|}{{\rm dist}(z,\mathbb R^+)^{\frac32}} \leq \frac{2\|H\|_F}{{\rm dist}(z,\mathbb R^+)^{\frac32}}$$
where, for the second to last inequality, we successively used the relations
$|{\rm tr}\,AB| \leq \sqrt{{\rm tr}\,AA^T}\sqrt{{\rm tr}\,BB^T}$, $|{\rm tr}\,CD| \leq \|D\|\,{\rm tr}\,C$ for nonnegative definite
C, and $\|(R^TR - zI_T)^{-1}\| \leq {\rm dist}(z,\mathbb R^+)^{-1}$, $\|(R^TR - zI_T)^{-1}R^TR\| \leq 1$,
$\|(R^TR - zI_T)^{-1}R^T\| = \|(R^TR - zI_T)^{-1}R^TR(R^TR - zI_T)^{-1}\|^{\frac12} \leq \|(R^TR - zI_T)^{-1}R^TR\|^{\frac12}\|(R^TR - zI_T)^{-1}\|^{\frac12} \leq {\rm dist}(z,\mathbb R^+)^{-\frac12}$,
for z ∈ C\R⁺, and finally ‖·‖ ≤ ‖·‖_F.
for some C, c > 0. We may then apply Lemma 1 to the bounded norm
matrix AE[Q_−]B to further find that
$$P\left(\left|\frac1T\sigma^TAQ_-B\sigma - \frac1T{\rm tr}\,\Phi AE[Q_-]B\right| > t\right) \leq P\left(\left|\frac1T\sigma^TAQ_-B\sigma - \frac1T\sigma^TAE[Q_-]B\sigma\right| > \frac t2\right)$$
$$+ P\left(\left|\frac1T\sigma^TAE[Q_-]B\sigma - \frac1T{\rm tr}\,\Phi AE[Q_-]B\right| > \frac t2\right) \leq C'e^{-c'n\min(t^2,\,t)}$$
With $Q^H$ the resolvent associated with R + H,
$$|f(R+H) - f(R)| = \frac1T\left|{\rm tr}\,Y^TY\left((Q^H)^2 - Q^2\right)\right| \leq \frac1T\left|{\rm tr}\,Y^TY(Q^H-Q)Q^H\right| + \frac1T\left|{\rm tr}\,Y^TYQ(Q^H-Q)\right|$$
$$= \frac1T\left|{\rm tr}\,Y^TYQ^H\left((R+H)^T(R+H) - R^TR\right)QQ^H\right| + \frac1T\left|{\rm tr}\,Y^TYQQ^H\left((R+H)^T(R+H) - R^TR\right)Q\right|$$
$$\leq \frac1T\left|{\rm tr}\,Y^TYQ^H(R+H)^THQQ^H\right| + \frac1T\left|{\rm tr}\,Y^TYQ^HH^TRQQ^H\right| + \frac1T\left|{\rm tr}\,Y^TYQQ^H(R+H)^THQ\right| + \frac1T\left|{\rm tr}\,Y^TYQQ^HH^TRQ\right|.$$
5.2.1. First Equivalent for E[Q]. This section is dedicated to a first char-
acterization of E[Q], in the “simultaneously large” n, p, T regime. This pre-
liminary step is classical in studying resolvents in random matrix theory as
the direct comparison of E[Q] to Q̄ with the implicit δ may be cumbersome.
To this end, let us thus define the intermediary deterministic matrix
$$\tilde Q = \left(\frac nT\frac{\Phi}{1+\alpha} + \gamma I_T\right)^{-1}$$
with α ≡ \frac1T{\rm tr}\,\Phi E[Q_-], where we recall that Q_- is a random matrix dis-
tributed as, say, $(\frac1T\Sigma^T\Sigma - \frac1T\sigma_1\sigma_1^T + \gamma I_T)^{-1}$.
First note that, since \frac1T{\rm tr}\,\Phi = E[\frac1T\|\sigma\|^2] and, from (7) and Assump-
tion 3, $P(\frac1T\|\sigma\|^2 > t) \leq Ce^{-cnt^2}$ for all large t, we find that
$\frac1T{\rm tr}\,\Phi = \int_0^\infty P(\frac1T\|\sigma\|^2 > t)\,dt \leq C'$ for some constant C'. Thus,
$\alpha \leq \|E[Q_-]\|\frac1T{\rm tr}\,\Phi \leq \frac{C'}{\gamma}$ is uniformly bounded.
Note now, from the independence of Q−i and σi σiT , that the second right-
hand side expectation is simply E[Q−i ]Φ. Also, exploiting Lemma 6 in re-
verse on the rightmost term, this gives
$$E[Q] - \tilde Q = \frac1T\sum_{i=1}^n\frac{E[Q - Q_{-i}]\Phi}{1+\alpha}\tilde Q + \frac1{1+\alpha}\frac1T\sum_{i=1}^nE\left[Q\sigma_i\sigma_i^T\tilde Q\left(\frac1T\sigma_i^TQ_{-i}\sigma_i - \alpha\right)\right]. \tag{8}$$
where we used again Lemma 6 in reverse. Denoting D = diag({1 + \frac1T\sigma_i^TQ_{-i}\sigma_i}_{i=1}^n),
this can be compactly written
$$\frac1T\sum_{i=1}^n\frac{E[Q-Q_{-i}]\Phi}{1+\alpha}\tilde Q = -\frac1{1+\alpha}E\left[Q\frac1T\Sigma^TD\Sigma Q\right]\frac1T\Phi\tilde Q.$$
i=1
Let us now consider the second right-hand side term of (9). Using the
relation abT + baT aaT + bbT in the order of Hermitian matrices (which
1
unfolds from (a − b)(a − b)T 0), we have, with a = T 4 Qσi ( T1 σiT Q−i σi − α)
1
and b = T − 4 Q̃σi ,
n 1
1X T T T
E Qσi σi Q̃ + Q̃σi σ Q σ Q−i σi − α
T T i
i=1
n n
" 2 #
1 X T 1 T 1 X h i
√ E Qσi σi Q σi Q−i σi − α + √ E Q̃σi σiT Q̃
T i=1 T T T i=1
√
1 n
= T E Q ΣT D22 ΣQ + √ Q̃ΦQ̃
T T T
where D2 = diag({ T1 σiT Q−i σi −α}ni=1 ). Of course, since we also have −aaT −
bbT abT + baT (from (a + b)(a + b)T 0), we have symmetrically
n 1
1X T T T
E Qσi σi Q̃ + Q̃σi σ Q σ Q−i σi − α
T T i
i=1
√
1 T 2 n
− T E Q Σ D2 ΣQ − √ Q̃ΦQ̃.
T T T
But from Lemma 4,
ε− 21
1 T ε− 21
P kD2 k > tn = P max σi Q−i σi − α > tn
1≤i≤n T
1
2ε t2 ,n 2 +ε t)
≤ Cne−cmin(n
n
1 X 1
1
E Qσi σiT Q̃ + Q̃σi σiT Q σiT Q−i σi − α
≤ Cnε− 2 .
T T
i=1
5.2.2. Second Equivalent for E[Q]. In this section, we show that E[Q]
can be approximated by the matrix Q̄, which we recall is defined as
$$\bar Q = \left(\frac nT\frac{\Phi}{1+\delta} + \gamma I_T\right)^{-1}$$
where δ > 0 is the unique positive solution to δ = T1 tr ΦQ̄. The fact that δ >
0 is well defined is quite standard and has already been proved several times
for more elaborate models. Following the ideas of (Hoydis, Couillet and Debbah,
2013), we may for instance use the framework of so-called standard interfer-
ence functions (Yates, 1995) which claims that, if a map f : [0, ∞) → (0, ∞),
x 7→ f (x), satisfies x ≥ x′ ⇒ f (x) ≥ f (x′ ), ∀a > 1, af (x) > f (ax) and there
exists x0 such that x0 ≥ f (x0 ), then f has a unique fixed point (Yates, 1995,
Th 2). It is easily shown that δ 7→ T1 tr ΦQ̄ is such a map, so that δ exists
and is unique.
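For intuition, the two defining properties of a standard interference function, as well as the convergence of the resulting fixed-point iteration to a single point from arbitrary starting values, can be checked numerically on the map x ↦ (1/T) tr Φ((n/T)Φ/(1 + x) + γI_T)⁻¹; in the hedged sketch below, Φ, the dimensions and γ are arbitrary placeholders.

```python
import numpy as np

rng = np.random.default_rng(6)
T, n, gamma = 120, 80, 1e-1
A = rng.standard_normal((T, T))
Phi = A @ A.T / T                      # arbitrary nonnegative definite placeholder

def f(x):
    # the map whose unique fixed point defines delta
    return np.trace(Phi @ np.linalg.inv(n / T * Phi / (1 + x) + gamma * np.eye(T))) / T

# monotonicity and scalability (the standard interference function properties)
print(f(2.0) >= f(1.0))               # x >= x'  =>  f(x) >= f(x')
print(2.0 * f(1.0) > f(2.0))          # a > 1    =>  a f(x) > f(a x)

# the fixed-point iteration converges to the same delta from different starts
for x in (0.0, 10.0):
    for _ in range(500):
        x = f(x)
    print(x)
```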
from which
$$|\alpha - \delta| = \left|\frac1T{\rm tr}\,\Phi\left(E[Q_-] - \bar Q\right)\right| \leq \left|\frac1T{\rm tr}\,\Phi\left(\tilde Q - \bar Q\right)\right| + cn^{-\frac12+\varepsilon} = |\alpha-\delta|\,\frac1T{\rm tr}\,\frac{\Phi\tilde Q\,\frac nT\Phi\bar Q}{(1+\alpha)(1+\delta)} + cn^{-\frac12+\varepsilon}$$
so that it is sufficient to bound the limsup of both terms under the square
root strictly by one. Next, remark that
$$\delta = \frac1T{\rm tr}\,\Phi\bar Q = \frac1T{\rm tr}\,\Phi\bar Q^2\bar Q^{-1} = \frac{n}{T(1+\delta)}\frac1T{\rm tr}\,\Phi^2\bar Q^2 + \gamma\frac1T{\rm tr}\,\Phi\bar Q^2.$$
In particular,
$$\frac nT\frac1{(1+\delta)^2}\frac1T{\rm tr}\,\Phi^2\bar Q^2 = \delta\,\frac{\frac nT\frac1{(1+\delta)^2}\frac1T{\rm tr}\,\Phi^2\bar Q^2}{\frac nT\frac1{1+\delta}\frac1T{\rm tr}\,\Phi^2\bar Q^2 + \gamma\frac1T{\rm tr}\,\Phi\bar Q^2} \leq \frac{\delta}{1+\delta}.$$
so (that is, on a probability set A_z with P(A_z) = 1). Thus, letting {z_k}_{k=1}^∞
be a converging sequence strictly included in R⁻, on the probability one
space A = ∩_{k=1}^∞ A_{z_k}, m_{µ_n}(z_k) − m_{µ̄_n}(z_k) → 0 for all k. Now, m_{µ_n} is complex
analytic on C \ R⁺ and bounded on all compact subsets of C \ R⁺.
Besides, it was shown in (Silverstein and Bai, 1995; Silverstein and Choi,
1995) that the function mµ̄n is well-defined, complex analytic and bounded
on all compact subsets of C \ R+ . As a result, on A, mµn − mµ̄n is complex
analytic, bounded on all compact subsets of C \ R+ and converges to zero on
a subset admitting at least one accumulation point. Thus, by Vitali’s con-
vergence theorem (Titchmarsh, 1939), with probability one, mµn − mµ̄n con-
verges to zero everywhere on C \ R+ . This implies, by (Bai and Silverstein,
2009, Theorem B.9), that µn − µ̄n → 0, vaguely as a signed finite measure,
with probability one, and, since µ̄n is a probability measure (again from the
results of (Silverstein and Bai, 1995; Silverstein and Choi, 1995)), we have
thus proved Theorem 2.
the neural network under study requires, besides E[Q], to evaluate the more
involved form E[QAQ], where A is a symmetric matrix either equal to Φ
or of bounded norm (so in particular kQ̄Ak is bounded). To evaluate this
quantity, first write
$$E[QAQ] = E\left[\bar QAQ\right] + E\left[(Q-\bar Q)AQ\right] = E\left[\bar QAQ\right] + E\left[Q\left(\frac nT\frac{\Phi}{1+\delta} - \frac1T\Sigma^T\Sigma\right)\bar QAQ\right]$$
$$= E\left[\bar QAQ\right] + \frac nT\frac1{1+\delta}E\left[Q\Phi\bar QAQ\right] - \frac1T\sum_{i=1}^nE\left[Q\sigma_i\sigma_i^T\bar QAQ\right].$$
(where in the previous to last line, we have merely reorganized the terms
conveniently) and our interest is in handling Z1 + Z1T + Z2 + Z2T + Z3 +
Z3T + Z4 + Z4T . Let us first treat the term Z2 . Since Q̄AQ− is bounded, by
Lemma 4, $\frac1T\sigma^T\bar QAQ_-\sigma$ concentrates around $\frac1T{\rm tr}\,\Phi\bar QAE[Q_-]$; but, as ‖ΦQ̄‖
is bounded, we also have $|\frac1T{\rm tr}\,\Phi\bar QAE[Q_-] - \frac1T{\rm tr}\,\Phi\bar QA\bar Q| \leq cn^{\varepsilon-\frac12}$. We thus
deduce, with similar arguments as previously, that
$$-Q_-\sigma\sigma^TQ_-\,Cn^{\varepsilon-\frac12} \preceq Q_-\sigma\sigma^TQ_-\left(\frac{\frac1T\sigma^T\bar QAQ_-\sigma}{1+\frac1T\sigma^TQ_-\sigma} - \frac{\frac1T{\rm tr}\,\Phi\bar QA\bar Q}{1+\delta}\right) \preceq Q_-\sigma\sigma^TQ_-\,Cn^{\varepsilon-\frac12}.$$
We now move to the term Z3 + Z3T . Using the relation $ab^T + ba^T \preceq aa^T + bb^T$,
$$E\left[\left(\delta - \frac1T\sigma^TQ_-\sigma\right)\frac{Q_-\sigma\sigma^T\bar QAQ_- + Q_-A\bar Q\sigma\sigma^TQ_-}{(1+\frac1T\sigma^TQ_-\sigma)^2}\right] \preceq \sqrt n\,E\left[Q_-\sigma\sigma^TQ_-\frac{(\delta-\frac1T\sigma^TQ_-\sigma)^2}{(1+\frac1T\sigma^TQ_-\sigma)^4}\right] + \frac1{\sqrt n}E\left[Q_-A\bar Q\sigma\sigma^T\bar QAQ_-\right]$$
$$= \sqrt n\,\frac Tn\,E\left[Q\frac1T\Sigma^TD_3^2\Sigma Q\right] + \frac1{\sqrt n}E\left[Q_-A\bar Q\Phi\bar QAQ_-\right]$$
and the symmetrical lower bound (equal to the opposite of the upper bound),
where $D_3 = {\rm diag}\left((\delta - \frac1T\sigma_i^TQ_{-i}\sigma_i)/(1+\frac1T\sigma_i^TQ_{-i}\sigma_i)\right)$. For the same reasons
as above, the first right-hand side term is bounded by $Cn^{\varepsilon-\frac12}$. As for the
second term, for A = I_T, it is clearly bounded; for A = Φ, using $\frac nT\frac{\bar Q\Phi}{1+\delta} = I_T - \gamma\bar Q$,
E[Q_−AQ̄ΦQ̄AQ_−] can be expressed in terms of E[Q_−ΦQ_−] and
E[Q_−Q̄^kΦQ_−] for k = 1, 2, all of which have been shown to be bounded (at
most by Cn^ε). We thus conclude that
$$\left\|E\left[\left(\delta-\frac1T\sigma^TQ_-\sigma\right)\frac{Q_-\sigma\sigma^T\bar QAQ_- + Q_-A\bar Q\sigma\sigma^TQ_-}{(1+\frac1T\sigma^TQ_-\sigma)^2}\right]\right\| \leq Cn^{\varepsilon-\frac12}.$$
The first norm in the parenthesis is bounded by Cn^ε and it thus remains
to control the second norm. To this end, similar to the control of E[QΦQ],
by writing E[QΦQΦQ] = E[Qσ₁σ₁ᵀQσ₂σ₂ᵀQ] for σ₁, σ₂ independent vectors
with the same law as σ, and exploiting the exchangeability, we obtain after
some calculus that E[QΦQΦQ] can be expressed as the sum of terms of the form
$E[Q_{++}\frac1T\Sigma_{++}^TD\Sigma_{++}Q_{++}]$ or $E[Q_{++}\frac1T\Sigma_{++}^TD\Sigma_{++}Q_{++}\frac1T\Sigma_{++}^TD_2\Sigma_{++}Q_{++}]$ for
D, D₂ diagonal matrices of norm bounded as O(1), while Σ₊₊ and Q₊₊ are
defined similarly to Σ and Q, only with n replaced by n+2. All these terms are bounded
as O(1) and we finally obtain that E[QΦQΦQ] is bounded and thus
$$\left\|E[Q\Phi Q] - \frac12E\left[Q_-\Phi Q + Q\Phi Q_-\right]\right\| \leq \frac Cn,$$
$$E[Q\Phi Q] = \frac{\bar Q\Phi\bar Q}{1 - \frac nT\frac{\frac1T{\rm tr}\,\Phi^2\bar Q^2}{(1+\delta)^2}} + O_{\|\cdot\|}\left(n^{\varepsilon-\frac12}\right).$$
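A quick numerical check of this last equivalent for E[QΦQ] may look as in the sketch below; the Monte Carlo model for Φ, the ReLU activation, Gaussian W and all dimensions are placeholder assumptions, and agreement is only up to finite-dimensional fluctuations.

```python
import numpy as np

rng = np.random.default_rng(7)
p, n, T, gamma = 40, 100, 150, 1e-1
X = rng.standard_normal((p, T)) / np.sqrt(p)
sigma = lambda t: np.maximum(t, 0)

# Monte Carlo estimate of Phi
M = 100_000
Phi = np.zeros((T, T))
for Wb in np.array_split(rng.standard_normal((M, p)), 20):
    S = sigma(Wb @ X)
    Phi += S.T @ S
Phi /= M

# Monte Carlo estimate of E[Q Phi Q]
EQPhiQ, reps = np.zeros((T, T)), 200
for _ in range(reps):
    Sig = sigma(rng.standard_normal((n, p)) @ X)
    Q = np.linalg.inv(Sig.T @ Sig / T + gamma * np.eye(T))
    EQPhiQ += Q @ Phi @ Q
EQPhiQ /= reps

# deterministic equivalent Qbar Phi Qbar / (1 - (n/T) (1/T) tr Phi^2 Qbar^2 / (1+delta)^2)
delta = 0.0
for _ in range(200):
    Qbar = np.linalg.inv(n / T * Phi / (1 + delta) + gamma * np.eye(T))
    delta = np.trace(Phi @ Qbar) / T
Qbar = np.linalg.inv(n / T * Phi / (1 + delta) + gamma * np.eye(T))
denom = 1 - n / T * np.trace(Phi @ Phi @ Qbar @ Qbar) / T / (1 + delta) ** 2
approx = Qbar @ Phi @ Qbar / denom
print(np.linalg.norm(EQPhiQ - approx) / np.linalg.norm(approx))  # small relative error
```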
$$I = \frac1{2\pi}\int_{\mathbb R}\int_{\mathbb R}\sigma(\tilde w_1\tilde a_1)\,\sigma(\tilde w_1\tilde b_1 + \tilde w_2\tilde b_2)\,e^{-\frac12(\tilde w_1^2+\tilde w_2^2)}d\tilde w_1d\tilde w_2.$$
Letting w̃ = [w̃₁, w̃₂]ᵀ, ã = [ã₁, 0]ᵀ and b̃ = [b̃₁, b̃₂]ᵀ, this is conveniently
written as the two-dimensional integral
$$I = \frac1{2\pi}\int_{\mathbb R^2}\sigma(\tilde w^T\tilde a)\,\sigma(\tilde w^T\tilde b)\,e^{-\frac12\|\tilde w\|^2}d\tilde w.$$
The case where a and b would be linearly dependent can then be obtained
by continuity arguments.
The function σ(t) = max(t, 0). For this function, we have
$$I = \frac1{2\pi}\int_{\min(\tilde w^T\tilde a,\,\tilde w^T\tilde b)\geq 0}\tilde w^T\tilde a\cdot\tilde w^T\tilde b\cdot e^{-\frac12\|\tilde w\|^2}d\tilde w.$$
Since ã = ã₁e₁, a simple geometric representation lets us observe that
$$\left\{\tilde w\ \middle|\ \min(\tilde w^T\tilde a,\,\tilde w^T\tilde b)\geq 0\right\} = \left\{r\cos(\theta)e_1 + r\sin(\theta)e_2\ \middle|\ r\geq 0,\ \theta\in\left[\theta_0-\frac\pi2,\frac\pi2\right]\right\}$$
where θ₀ denotes the angle between ã and b̃. With two integrations by parts, we have that $\int_{\mathbb R^+}r^3e^{-\frac12r^2}dr = 2$. Classical
trigonometric formulas also provide
$$\int_{\theta_0-\frac\pi2}^{\frac\pi2}\cos^2(\theta)\,d\theta = \frac12(\pi-\theta_0) + \frac14\sin(2\theta_0) = \frac12\left(\pi - \arccos\left(\frac{\tilde b_1}{\|\tilde b\|}\right) + \frac{\tilde b_1}{\|\tilde b\|}\frac{\tilde b_2}{\|\tilde b\|}\right)$$
$$\int_{\theta_0-\frac\pi2}^{\frac\pi2}\cos(\theta)\sin(\theta)\,d\theta = \frac12\sin^2(\theta_0) = \frac12\left(\frac{\tilde b_2}{\|\tilde b\|}\right)^2$$
where we used in particular $\sin(2\arccos(x)) = 2x\sqrt{1-x^2}$. Altogether, this
is after simplification and replacement of ã₁, b̃₁ and b̃₂,
$$I = \frac1{2\pi}\|a\|\|b\|\left(\sqrt{1-\angle(a,b)^2} + \angle(a,b)\arccos(-\angle(a,b))\right).$$
It is worth noticing that this may be more compactly written as
$$I = \frac1{2\pi}\|a\|\|b\|\int_{-1}^{\angle(a,b)}\arccos(-x)\,dx.$$
$$\sigma(w^Ta)\sigma(w^Tb) = 1_{w^Ta\geq0}1_{w^Tb\geq0} + 1_{w^T(-a)\geq0}1_{w^T(-b)\geq0} - 1_{w^T(-a)\geq0}1_{w^Tb\geq0} - 1_{w^Ta\geq0}1_{w^T(-b)\geq0}$$
and to apply the result of the previous section, with either (a, b), (−a, b),
(a, −b) or (−a, −b). Since arccos(−x) = −arccos(x) + π, we conclude that
$$I = (2\pi)^{-\frac p2}\int_{\mathbb R^p}{\rm sign}(w^Ta)\,{\rm sign}(w^Tb)\,e^{-\frac12\|w\|^2}dw = 1 - \frac{2\theta_0}{\pi}.$$
The functions σ(t) = cos(t) and σ(t) = sin(t). Let us first consider σ(t) =
cos(t). We have here to evaluate
$$I = \frac1{2\pi}\int_{\mathbb R^2}\cos\left(\tilde w^T\tilde a\right)\cos\left(\tilde w^T\tilde b\right)e^{-\frac12\|\tilde w\|^2}d\tilde w = \frac1{8\pi}\int_{\mathbb R^2}\left(e^{\imath\tilde w^T\tilde a} + e^{-\imath\tilde w^T\tilde a}\right)\left(e^{\imath\tilde w^T\tilde b} + e^{-\imath\tilde w^T\tilde b}\right)e^{-\frac12\|\tilde w\|^2}d\tilde w$$
which boils down to evaluating, for d ∈ {ã + b̃, ã − b̃, −ã + b̃, −ã − b̃}, the
integral
$$\int_{\mathbb R^2}e^{-\frac12\|d\|^2}e^{-\frac12\|\tilde w-\imath d\|^2}d\tilde w = (2\pi)\,e^{-\frac12\|d\|^2}.$$
Altogether, we find
$$I = \frac12\left(e^{-\frac12\|a+b\|^2} + e^{-\frac12\|a-b\|^2}\right) = e^{-\frac12(\|a\|^2+\|b\|^2)}\cosh(a^Tb).$$
For σ(t) = sin(t), it suffices to appropriately adapt the signs in the ex-
pression of I (using the relation $\sin(t) = \frac1{2\imath}(e^{\imath t} - e^{-\imath t})$) to obtain in the end
$$I = \frac12\left(e^{-\frac12\|a-b\|^2} - e^{-\frac12\|a+b\|^2}\right) = e^{-\frac12(\|a\|^2+\|b\|^2)}\sinh(a^Tb)$$
as desired.
where we recall the definition (a²) ≡ [a₁², . . . , a_p²]ᵀ. Gathering all the terms
for appropriate selections of c, d leads to (5).
If Σ̂ = Σ̂° + σ̄̂ 1_{T̂}ᵀ follows the aforementioned claimed operator norm control,
reproducing the steps of Corollary 3 leads to a similar concentration for
Etest , which we shall then admit. We are therefore left with evaluating E[Z2 ]
and E[Z3 ].
with D = diag({δ − \frac1T\sigma_i^TQ_{-i}\sigma_i}), the operator norm of which is bounded by
$n^{\varepsilon-\frac12}$ with high probability. Now, observe that, again with the assumption
that Σ̂ = Σ̂° + σ̄1_{T̂}ᵀ with controlled Σ̂°, Z22 may be decomposed as
$$\frac2{T\hat T}\frac1{1+\delta}E\left[{\rm tr}\,YQ\Sigma^TD\hat\Sigma\hat Y^T\right] = \frac2{T\hat T}\frac1{1+\delta}E\left[{\rm tr}\,YQ\Sigma^TD\hat\Sigma^\circ\hat Y^T\right] + \frac2{T\hat T}\frac1{1+\delta}1_{\hat T}^T\hat Y^TE\left[YQ\Sigma^TD\bar\sigma\right].$$
In the display above, the first right-hand side term is now of order $O(n^{\varepsilon-\frac12})$.
As for the second right-hand side term, note that Dσ̄ is a vector of inde-
pendent and identically distributed zero mean and variance O(n⁻¹) entries;
while not formally independent of YQΣᵀ, it is nonetheless expected that
this independence “weakens” asymptotically (a behavior several times ob-
served in linear random matrix models), so that one expects by central limit
arguments that the second right-hand side term be also of order $O(n^{\varepsilon-\frac12})$.
This would thus result in
$$E[Z_2] = \frac{2n}{T\hat T}\frac1{1+\delta}{\rm tr}\,YE[Q_-]\Phi_{X\hat X}\hat Y^T + O(n^{\varepsilon-\frac12}) = \frac{2n}{T\hat T}\frac1{1+\delta}{\rm tr}\,Y\bar Q\Phi_{X\hat X}\hat Y^T + O(n^{\varepsilon-\frac12}) = \frac2{\hat T}{\rm tr}\,Y\bar Q\Psi_{X\hat X}\hat Y^T + O(n^{\varepsilon-\frac12})$$
where we used $\|E[Q_-] - \bar Q\| \leq Cn^{\varepsilon-\frac12}$ and the definition $\Psi_{X\hat X} = \frac nT\frac{\Phi_{X\hat X}}{1+\delta}$.
In the term Z32 , reproducing the proof of Lemma 1 with the condition
‖X̂‖ bounded, we obtain that $\frac{\hat\sigma_i^T\hat\sigma_i}{\hat T}$ concentrates around $\frac1{\hat T}{\rm tr}\,\Phi_{\hat X\hat X}$, which
allows us to write
$$Z_{32} = \frac1{T^2\hat T}\sum_{i=1}^nE\left[{\rm tr}\left(Y\frac{Q_{-i}\sigma_i\,{\rm tr}(\Phi_{\hat X\hat X})\,\sigma_i^TQ_{-i}}{(1+\frac1T\sigma_i^TQ_{-i}\sigma_i)^2}Y^T\right)\right] + \frac1{T^2\hat T}\sum_{i=1}^nE\left[{\rm tr}\left(Y\frac{Q_{-i}\sigma_i\left(\hat\sigma_i^T\hat\sigma_i - {\rm tr}\,\Phi_{\hat X\hat X}\right)\sigma_i^TQ_{-i}}{(1+\frac1T\sigma_i^TQ_{-i}\sigma_i)^2}Y^T\right)\right]$$
$$= \frac{{\rm tr}(\Phi_{\hat X\hat X})}{T^2\hat T}E\left[{\rm tr}\left(Y\sum_{i=1}^n\frac{Q_{-i}\sigma_i\sigma_i^TQ_{-i}}{(1+\frac1T\sigma_i^TQ_{-i}\sigma_i)^2}Y^T\right)\right] + \frac1{T^2\hat T}E\left[{\rm tr}\left(Y\sum_{i=1}^n\left(\hat\sigma_i^T\hat\sigma_i - {\rm tr}\,\Phi_{\hat X\hat X}\right)Q\sigma_i\sigma_i^TQY^T\right)\right] \equiv Z_{321} + Z_{322},$$
where, with $D = {\rm diag}(\{\hat\sigma_i^T\hat\sigma_i - {\rm tr}\,\Phi_{\hat X\hat X}\}_{i=1}^n)$, the second term may be rewritten as
$$Z_{322} = \frac1{T\hat T}E\left[{\rm tr}\,Y\frac{Q\Sigma^T}{\sqrt T}D\frac{\Sigma Q}{\sqrt T}Y^T\right] = O(n^{\varepsilon-\frac12}).$$
The term Z31 of the double sum over i and j (j ≠ i) needs more effort.
To handle this term, we need to remove the dependence of Q on both σᵢ and σⱼ
in sequence. We start with j as follows:
$$Z_{31} = \frac1{T^2\hat T}\sum_{i=1}^n\sum_{j\neq i}E\left[{\rm tr}\left(YQ\sigma_i\hat\sigma_i^T\frac{\hat\sigma_j\sigma_j^TQ_{-j}}{1+\frac1T\sigma_j^TQ_{-j}\sigma_j}Y^T\right)\right]$$
$$= \frac1{T^2\hat T}\sum_{i=1}^n\sum_{j\neq i}E\left[{\rm tr}\left(YQ_{-j}\sigma_i\hat\sigma_i^T\frac{\hat\sigma_j\sigma_j^TQ_{-j}}{1+\frac1T\sigma_j^TQ_{-j}\sigma_j}Y^T\right)\right] - \frac1{T^3\hat T}\sum_{i=1}^n\sum_{j\neq i}E\left[{\rm tr}\left(Y\frac{Q_{-j}\sigma_j\sigma_j^TQ_{-j}\sigma_i\hat\sigma_i^T\hat\sigma_j\sigma_j^TQ_{-j}}{\left(1+\frac1T\sigma_j^TQ_{-j}\sigma_j\right)^2}Y^T\right)\right] \equiv Z_{311} - Z_{312}$$
For Z311 , we replace $1 + \frac1T\sigma_j^TQ_{-j}\sigma_j$ by 1 + δ and take the expectation over wⱼ:
$$Z_{311} = \frac1{T^2\hat T}\sum_{i=1}^n\sum_{j\neq i}E\left[{\rm tr}\left(YQ_{-j}\sigma_i\hat\sigma_i^T\frac{\hat\sigma_j\sigma_j^TQ_{-j}}{1+\frac1T\sigma_j^TQ_{-j}\sigma_j}Y^T\right)\right] = \frac1{T^2\hat T}\sum_{j=1}^nE\left[{\rm tr}\left(Y\frac{Q_{-j}\Sigma_{-j}^T\hat\Sigma_{-j}\hat\sigma_j\sigma_j^TQ_{-j}}{1+\frac1T\sigma_j^TQ_{-j}\sigma_j}Y^T\right)\right]$$
$$= \frac1{T^2\hat T}\frac1{1+\delta}\sum_{j=1}^nE\left[{\rm tr}\,YQ_{-j}\Sigma_{-j}^T\hat\Sigma_{-j}\hat\sigma_j\sigma_j^TQ_{-j}Y^T\right] + \frac1{T^2\hat T}\frac1{1+\delta}\sum_{j=1}^nE\left[{\rm tr}\left(Y\frac{Q_{-j}\Sigma_{-j}^T\hat\Sigma_{-j}\hat\sigma_j\sigma_j^TQ_{-j}\left(\delta-\frac1T\sigma_j^TQ_{-j}\sigma_j\right)}{1+\frac1T\sigma_j^TQ_{-j}\sigma_j}Y^T\right)\right] \equiv Z_{3111} + Z_{3112}.$$
The idea to handle Z3112 is to retrieve forms of the type $\sum_{j=1}^nd_j\hat\sigma_j\sigma_j^T = \hat\Sigma^TD\Sigma$ for some D satisfying $\|D\| \leq n^{\varepsilon-\frac12}$ with high probability. To this
end, we use
$$Q_{-j}\frac{\Sigma_{-j}^T\hat\Sigma_{-j}}{T} = Q_{-j}\frac{\Sigma^T\hat\Sigma}{T} - Q_{-j}\frac{\sigma_j\hat\sigma_j^T}{T} = Q\frac{\Sigma^T\hat\Sigma}{T} + \frac{\frac1TQ\sigma_j\sigma_j^TQ}{1-\frac1T\sigma_j^TQ\sigma_j}\frac{\Sigma^T\hat\Sigma}{T} - Q_{-j}\frac{\sigma_j\hat\sigma_j^T}{T}$$
and thus Z3112 can be expanded as the sum of three terms that shall be
studied in order:
$$Z_{3112} = \frac1{T^2\hat T}\frac1{1+\delta}\sum_{j=1}^nE\left[{\rm tr}\left(Y\frac{Q_{-j}\Sigma_{-j}^T\hat\Sigma_{-j}\hat\sigma_j\sigma_j^TQ_{-j}\left(\delta-\frac1T\sigma_j^TQ_{-j}\sigma_j\right)}{1+\frac1T\sigma_j^TQ_{-j}\sigma_j}Y^T\right)\right]$$
$$= \frac1{T\hat T}\frac1{1+\delta}E\left[{\rm tr}\left(YQ\frac{\Sigma^T\hat\Sigma}{T}\hat\Sigma^TD\Sigma QY^T\right)\right] + \frac1{T\hat T}\frac1{1+\delta}\sum_{j=1}^nE\left[{\rm tr}\left(Y\frac{Q\sigma_j\sigma_j^TQ\Sigma^T\hat\Sigma\hat\sigma_j\left(\delta-\frac1T\sigma_j^TQ_{-j}\sigma_j\right)\sigma_j^TQ}{T\left(1-\frac1T\sigma_j^TQ\sigma_j\right)}Y^T\right)\right]$$
$$- \frac1{T^2\hat T}\frac1{1+\delta}\sum_{j=1}^nE\left[{\rm tr}\left(YQ\sigma_j\hat\sigma_j^T\hat\sigma_j\sigma_j^TQ\left(\delta-\frac1T\sigma_j^TQ_{-j}\sigma_j\right)\left(1+\frac1T\sigma_j^TQ_{-j}\sigma_j\right)Y^T\right)\right] \equiv Z_{31121} + Z_{31122} - Z_{31123}.$$
where $D = {\rm diag}(\{\delta - \frac1T\sigma_j^TQ_{-j}\sigma_j\}_{j=1}^n)$. First, Z31121 is of order $O(n^{\varepsilon-\frac12})$ since
$Q\frac{\Sigma^T\hat\Sigma}{T}$ is of bounded operator norm. Subsequently, Z31122 can be rewritten
as
$$Z_{31122} = \frac1{\hat T}\frac1{1+\delta}E\left[{\rm tr}\left(YQ\frac{\Sigma^TD\Sigma}{T}QY^T\right)\right] = O(n^{\varepsilon-\frac12})$$
with here
$$D = {\rm diag}\left(\left\{\frac{\left(\delta - \frac1T\sigma_j^TQ_{-j}\sigma_j\right)\frac1T{\rm tr}\left(Q_{-j}\frac{\Sigma_{-j}^T\hat\Sigma_{-j}}{T}\Phi_{\hat XX}\right) + \frac1T{\rm tr}(Q_{-j}\Phi)\,\frac1T{\rm tr}\,\Phi_{\hat X\hat X}}{\left(1-\frac1T\sigma_j^TQ\sigma_j\right)\left(1+\frac1T\sigma_j^TQ_{-j}\sigma_j\right)}\right\}_{j=1}^n\right).$$
The term Z31111 is then expanded as follows:
$$Z_{31111} = \frac1{T^2\hat T}\frac1{1+\delta}\sum_{j=1}^n\sum_{i\neq j}E\left[{\rm tr}\left(Y\frac{Q_{-ij}\sigma_i\hat\sigma_i^T}{1+\frac1T\sigma_i^TQ_{-ij}\sigma_i}\Phi_{\hat XX}Q_{-ij}Y^T\right)\right]$$
$$= \frac1{T^2\hat T}\frac1{(1+\delta)^2}\sum_{j=1}^n\sum_{i\neq j}E\left[{\rm tr}\,YQ_{-ij}\sigma_i\hat\sigma_i^T\Phi_{\hat XX}Q_{-ij}Y^T\right] + \frac1{T^2\hat T}\frac1{(1+\delta)^2}\sum_{j=1}^n\sum_{i\neq j}E\left[{\rm tr}\left(Y\frac{Q_{-ij}\sigma_i\hat\sigma_i^T\left(\delta-\frac1T\sigma_i^TQ_{-ij}\sigma_i\right)}{1+\frac1T\sigma_i^TQ_{-ij}\sigma_i}\Phi_{\hat XX}Q_{-ij}Y^T\right)\right]$$
$$= \frac{n^2}{T^2\hat T}\frac1{(1+\delta)^2}E\left[{\rm tr}\,YQ_{--}\Phi_{X\hat X}\Phi_{\hat XX}Q_{--}Y^T\right] + \frac1{T^2\hat T}\frac1{(1+\delta)^2}\sum_{j=1}^n\sum_{i\neq j}E\left[{\rm tr}\,YQ_{-j}\sigma_i\hat\sigma_i^T\left(\delta-\frac1T\sigma_i^TQ_{-ij}\sigma_i\right)\Phi_{\hat XX}Q_{-j}Y^T\right]$$
$$\quad + \frac1{T^2\hat T}\frac1{(1+\delta)^2}\sum_{j=1}^n\sum_{i\neq j}E\left[{\rm tr}\left(YQ_{-j}\sigma_i\hat\sigma_i^T\Phi_{\hat XX}\frac{Q_{-j}\sigma_i\sigma_i^TQ_{-j}}{1-\frac1T\sigma_i^TQ_{-j}\sigma_i}Y^T\right)\frac1T\left(\delta-\frac1T\sigma_i^TQ_{-ij}\sigma_i\right)\right]$$
$$= \frac{n^2}{T^2\hat T}\frac1{(1+\delta)^2}E\left[{\rm tr}\,YQ_{--}\Phi_{X\hat X}\Phi_{\hat XX}Q_{--}Y^T\right] + \frac1{T^2\hat T}\frac1{(1+\delta)^2}\sum_{j=1}^nE\left[{\rm tr}\,YQ_{-j}\Sigma_{-j}^TD\hat\Sigma_{-j}\Phi_{\hat XX}Q_{-j}Y^T\right]$$
$$\quad + \frac{n}{T^2\hat T}\frac1{(1+\delta)^2}\sum_{j=1}^nE\left[YQ_{-j}\Sigma_{-j}^TD'\Sigma_{-j}Q_{-j}Y^T\right] + O(n^{\varepsilon-\frac12}) = \frac{n^2}{T^2\hat T}\frac1{(1+\delta)^2}E\left[{\rm tr}\,YQ_{--}\Phi_{X\hat X}\Phi_{\hat XX}Q_{--}Y^T\right] + O(n^{\varepsilon-\frac12})$$
with Q₋₋ having the same law as Q₋ᵢⱼ, $D = {\rm diag}(\{\delta - \frac1T\sigma_i^TQ_{-ij}\sigma_i\}_{i=1}^n)$
and $D' = {\rm diag}\left(\left\{\frac{(\delta-\frac1T\sigma_i^TQ_{-ij}\sigma_i)\frac1T{\rm tr}(\Phi_{\hat XX}Q_{-ij}\Phi_{X\hat X})}{(1-\frac1T\sigma_i^TQ_{-j}\sigma_i)(1+\frac1T\sigma_i^TQ_{-ij}\sigma_i)}\right\}_{i=1}^n\right)$, both expected to be of
order $O(n^{\varepsilon-\frac12})$. Using again the asymptotic equivalent of E[QAQ] devised
in Section 5.2.3, we then have
$$Z_{31111} = \frac{n^2}{T^2\hat T}\frac1{(1+\delta)^2}E\left[{\rm tr}\,YQ_{--}\Phi_{X\hat X}\Phi_{\hat XX}Q_{--}Y^T\right] + O(n^{\varepsilon-\frac12})$$
$$= \frac1{\hat T}{\rm tr}\,Y\bar Q\Psi_{X\hat X}\Psi_{\hat XX}\bar QY^T + \frac1{\hat T}{\rm tr}\left(\Psi_X\bar Q\Psi_{X\hat X}\Psi_{\hat XX}\bar Q\right)\frac{\frac1n{\rm tr}\,Y\bar Q\Psi_X\bar QY^T}{1-\frac1n{\rm tr}(\Psi_X^2\bar Q^2)} + O(n^{\varepsilon-\frac12}).$$
We then deduce
$$Z_{311} = \frac1{\hat T}{\rm tr}\,Y\bar Q\Psi_{X\hat X}\Psi_{\hat XX}\bar QY^T + \frac1{\hat T}{\rm tr}\left(\Psi_X\bar Q\Psi_{X\hat X}\Psi_{\hat XX}\bar Q\right)\frac{\frac1n{\rm tr}\,Y\bar Q\Psi_X\bar QY^T}{1-\frac1n{\rm tr}(\Psi_X^2\bar Q^2)} - \frac1{\hat T}{\rm tr}\left(\Psi_{\hat XX}\bar Q\Psi_{X\hat X}\right)\frac{\frac1n{\rm tr}\,Y\bar Q\Psi_X\bar QY^T}{1-\frac1n{\rm tr}(\Psi_X^2\bar Q^2)} + O(n^{\varepsilon-\frac12}).$$
Since $Q_{-j}\frac1T\Sigma_{-j}^T\hat\Sigma_{-j}$ is expected to be of bounded norm, using the concen-
tration inequality on the quadratic form $\frac1T\sigma_j^TQ_{-j}\frac{\Sigma_{-j}^T\hat\Sigma_{-j}}{T}\hat\sigma_j$, we infer
$$Z_{312} = \frac1{T\hat T}\sum_{j=1}^nE\left[{\rm tr}\left(Y\frac{Q_{-j}\sigma_j\sigma_j^TQ_{-j}Y^T}{(1+\frac1T\sigma_j^TQ_{-j}\sigma_j)^2}\right)\frac1{T^2}{\rm tr}\left(Q_{-j}\Sigma_{-j}^T\hat\Sigma_{-j}\Phi_{\hat XX}\right)\right] + O(n^{\varepsilon-\frac12}).$$
We again replace $\frac1T\sigma_j^TQ_{-j}\sigma_j$ by δ and take the expectation over wⱼ to obtain
$$Z_{312} = \frac1{T\hat T}\frac1{(1+\delta)^2}\sum_{j=1}^nE\left[{\rm tr}\left(YQ_{-j}\sigma_j\sigma_j^TQ_{-j}Y^T\right)\frac1{T^2}{\rm tr}\left(Q_{-j}\Sigma_{-j}^T\hat\Sigma_{-j}\Phi_{\hat XX}\right)\right]$$
$$+ \frac1{T\hat T}\frac1{(1+\delta)^2}\sum_{j=1}^nE\left[\frac{{\rm tr}\left(YQ_{-j}\sigma_jD_j\sigma_j^TQ_{-j}Y^T\right)}{(1+\frac1T\sigma_j^TQ_{-j}\sigma_j)^2}\frac1{T^2}{\rm tr}\left(Q_{-j}\Sigma_{-j}^T\hat\Sigma_{-j}\Phi_{\hat XX}\right)\right] + O(n^{\varepsilon-\frac12})$$
$$= \frac{n}{T\hat T}\frac1{(1+\delta)^2}E\left[{\rm tr}\left(YQ_-\Phi_XQ_-Y^T\right)\frac1{T^2}{\rm tr}\left(Q_-\Sigma_-^T\hat\Sigma_-\Phi_{\hat XX}\right)\right] + \frac1{T\hat T}\frac1{(1+\delta)^2}E\left[{\rm tr}\left(YQ\Sigma^TD\Sigma QY^T\right)\frac1{T^2}{\rm tr}\left(Q_-\Sigma_-^T\hat\Sigma_-\Phi_{\hat XX}\right)\right] + O(n^{\varepsilon-\frac12})$$
with $D_j = (1+\delta)^2 - (1+\frac1T\sigma_j^TQ_{-j}\sigma_j)^2 = O(n^{\varepsilon-\frac12})$, which eventually brings
the second term to vanish, and we thus get
$$Z_{312} = \frac{n}{T\hat T}\frac1{(1+\delta)^2}E\left[{\rm tr}\left(YQ_-\Phi_XQ_-Y^T\right)\frac1{T^2}{\rm tr}\left(Q_-\Sigma_-^T\hat\Sigma_-\Phi_{\hat XX}\right)\right] + O(n^{\varepsilon-\frac12}).$$
For the term $\frac1{T^2}{\rm tr}\,Q_-\Sigma_-^T\hat\Sigma_-\Phi_{\hat XX}$ we apply again the concentration
inequality to get
$$\frac1{T^2}{\rm tr}\,Q_-\Sigma_-^T\hat\Sigma_-\Phi_{\hat XX} = \frac1{T^2}\sum_{i\neq j}{\rm tr}\,Q_{-j}\sigma_i\hat\sigma_i^T\Phi_{\hat XX} = \frac1{T^2}\sum_{i\neq j}{\rm tr}\left(\frac{Q_{-ij}\sigma_i\hat\sigma_i^T}{1+\frac1T\sigma_i^TQ_{-ij}\sigma_i}\Phi_{\hat XX}\right)$$
$$= \frac1{T^2}\frac1{1+\delta}\sum_{i\neq j}{\rm tr}\left(Q_{-ij}\sigma_i\hat\sigma_i^T\Phi_{\hat XX}\right) + \frac1{T^2}\frac1{1+\delta}\sum_{i\neq j}{\rm tr}\left(\frac{Q_{-ij}\sigma_i\hat\sigma_i^T\left(\delta-\frac1T\sigma_i^TQ_{-ij}\sigma_i\right)}{1+\frac1T\sigma_i^TQ_{-ij}\sigma_i}\Phi_{\hat XX}\right)$$
$$= \frac{n-1}{T^2}\frac1{1+\delta}{\rm tr}\,\Phi_{\hat XX}E[Q_{--}]\Phi_{X\hat X} + \frac1{T^2}\frac1{1+\delta}{\rm tr}\,Q_{-j}\Sigma_{-j}^TD\hat\Sigma_{-j}\Phi_{\hat XX} + O(n^{\varepsilon-\frac12})$$
with high probability, where $D = {\rm diag}(\{\delta - \frac1T\sigma_i^TQ_{-ij}\sigma_i\}_{i=1}^n)$, the norm of
which is of order $O(n^{\varepsilon-\frac12})$. This entails
$$\frac1{T^2}{\rm tr}\,Q_-\Sigma_-^T\hat\Sigma_-\Phi_{\hat XX} = \frac{n}{T^2}\frac1{1+\delta}{\rm tr}\,\Phi_{\hat XX}E[Q_{--}]\Phi_{X\hat X} + O(n^{\varepsilon-\frac12})$$
with high probability. Once more plugging the asymptotic equivalent of
E[QAQ] deduced in Section 5.2.3, we conclude for Z312 that
$$Z_{312} = \frac1{\hat T}{\rm tr}\left(\Psi_{\hat XX}\bar Q\Psi_{X\hat X}\right)\frac{\frac1n{\rm tr}\,Y\bar Q\Psi_X\bar QY^T}{1-\frac1n{\rm tr}(\Psi_X^2\bar Q^2)} + O(n^{\varepsilon-\frac12}).$$
Combining the estimates of E[Z2 ] as well as Z31 and Z32 , we finally have
the estimate for the test error defined in (12) as
$$E_{\rm test} = \frac1{\hat T}\left\|\hat Y^T - \Psi_{X\hat X}^T\bar QY^T\right\|_F^2 + \frac{\frac1n{\rm tr}\,Y\bar Q\Psi_X\bar QY^T}{1-\frac1n{\rm tr}(\Psi_X^2\bar Q^2)}\left[\frac1{\hat T}{\rm tr}\,\Psi_{\hat X\hat X} + \frac1{\hat T}{\rm tr}\,\Psi_X\bar Q\Psi_{X\hat X}\Psi_{\hat XX}\bar Q - \frac2{\hat T}{\rm tr}\,\Psi_{\hat XX}\bar Q\Psi_{X\hat X}\right] + O(n^{\varepsilon-\frac12}).$$
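For completeness, this deterministic test-error estimate can be evaluated numerically; the hedged sketch below does so for σ(t) = erf(t), for which Φ is available in closed form from Table 1 (the data model, the dimensions and γ are illustrative placeholders, not the settings of the article).

```python
import numpy as np

def Phi_erf(A, B):
    # closed-form Phi_{AB} for sigma = erf and w ~ N(0, I_p), cf. Table 1
    na2 = np.sum(A**2, axis=0)           # squared norms of the columns of A
    nb2 = np.sum(B**2, axis=0)
    return 2 / np.pi * np.arcsin(2 * (A.T @ B)
                                 / np.sqrt(np.outer(1 + 2 * na2, 1 + 2 * nb2)))

rng = np.random.default_rng(8)
p, n, T, That, gamma = 30, 100, 150, 120, 1e-1
X = rng.standard_normal((p, T)) / np.sqrt(p)
Xhat = rng.standard_normal((p, That)) / np.sqrt(p)
Y = np.sign(rng.standard_normal((1, T)))
Yhat = np.sign(rng.standard_normal((1, That)))

Phi = Phi_erf(X, X)
Phi_XXh = Phi_erf(X, Xhat)
Phi_XhXh = Phi_erf(Xhat, Xhat)

delta = 0.0
for _ in range(200):
    Qbar = np.linalg.inv(n / T * Phi / (1 + delta) + gamma * np.eye(T))
    delta = np.trace(Phi @ Qbar) / T
Qbar = np.linalg.inv(n / T * Phi / (1 + delta) + gamma * np.eye(T))

Psi_X = n / T * Phi / (1 + delta)
Psi_XXh = n / T * Phi_XXh / (1 + delta)
Psi_XhXh = n / T * Phi_XhXh / (1 + delta)

first = np.linalg.norm(Yhat.T - Psi_XXh.T @ Qbar @ Y.T, 'fro')**2 / That
num = np.trace(Y @ Qbar @ Psi_X @ Qbar @ Y.T) / n
den = 1 - np.trace(Psi_X @ Qbar @ Psi_X @ Qbar) / n
bracket = (np.trace(Psi_XhXh) / That
           + np.trace(Psi_X @ Qbar @ Psi_XXh @ Psi_XXh.T @ Qbar) / That
           - 2 * np.trace(Psi_XXh.T @ Qbar @ Psi_XXh) / That)
E_test_bar = first + num / den * bracket
print(E_test_bar)
```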
z = β T σ(x; W)
Despite its simplicity, the concentration method also has some strong
limitations that presently do not allow for a sufficiently profound analysis of
the testing mean square error. We believe that Conjecture 1 can be proved
by means of more elaborate methods. Notably, we believe that the powerful
Gaussian method advertised in (Pastur and Ŝerbina, 2011) which relies on
Stein’s lemma and the Poincaré–Nash inequality could provide a refined
control of the residual terms involved in the derivation of Conjecture 1.
However, since Stein’s lemma (which states that E[xφ(x)] = E[φ′ (x)] for
x ∼ N (0, 1) and differentiable polynomially bounded φ) can only be used
on products xφ(x) involving the linear component x, the latter is not directly
accessible; we nonetheless believe that appropriate ansatzs of Stein’s lemma,
adapted to the non-linear setting and currently under investigation, could
be exploited.
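As a small illustration of the identity invoked here (not of the non-linear ansatz itself, which remains under investigation), Stein's lemma E[xφ(x)] = E[φ'(x)] for x ∼ N(0, 1) can be checked numerically for, say, the placeholder test function φ(x) = x³:

```python
import numpy as np

rng = np.random.default_rng(9)
x = rng.standard_normal(5_000_000)
phi = lambda t: t**3          # a polynomially bounded, differentiable test function
dphi = lambda t: 3 * t**2
print(np.mean(x * phi(x)), np.mean(dphi(x)))   # both close to E[x^4] = 3
```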
As a striking example, one key advantage of such a tool would be the
possibility to evaluate expectations of the type $Z = E[\sigma\sigma^T(\frac1T\sigma^TQ_-\sigma - \alpha)]$
which, in our present analysis, was shown to be bounded in the order of
symmetric matrices by $\Phi\,Cn^{\varepsilon-\frac12}$ with high probability. Thus, if no matrix
(such as Q̄) pre-multiplies Z, since kΦk can grow as large as O(n), Z cannot
be shown to vanish. But such a bound does not account for the fact that
erate almost equally well when taken random in some very specific scenar-
ios, are usually only initiated as random networks before being subsequently
trained through backpropagation of the error on the training dataset (that is,
essentially through convex gradient descent). We believe that our framework
can allow for the understanding of at least finitely many steps of gradient de-
scent, which may then provide further insights into the overall performance
of deep learning networks.
REFERENCES
Akhiezer, N. I. and Glazman, I. M. (1993). Theory of linear operators in Hilbert space.
Courier Dover Publications.
Bai, Z. D. and Silverstein, J. W. (1998). No eigenvalues outside the support of the lim-
iting spectral distribution of large dimensional sample covariance matrices. The Annals
of Probability 26 316-345.
Bai, Z. D. and Silverstein, J. W. (2007). On the signal-to-interference-ratio of CDMA
systems in wireless communications. Annals of Applied Probability 17 81-101.
Bai, Z. D. and Silverstein, J. W. (2009). Spectral analysis of large dimensional random
matrices, second ed. Springer Series in Statistics, New York, NY, USA.
Benaych-Georges, F. and Nadakuditi, R. R. (2012). The singular values and vectors
of low rank perturbations of large rectangular random matrices. Journal of Multivariate
Analysis 111 120–135.
Cambria, E., Gastaldo, P., Bisio, F. and Zunino, R. (2015). An ELM-based model
for affective analogical reasoning. Neurocomputing 149 443–455.
Choromanska, A., Henaff, M., Mathieu, M., Arous, G. B. and LeCun, Y. (2015).
The Loss Surfaces of Multilayer Networks. In AISTATS.
Couillet, R. and Benaych-Georges, F. (2016). Kernel spectral clustering of large
dimensional data. Electronic Journal of Statistics 10 1393–1454.
Couillet, R. and Kammoun, A. (2016). Random Matrix Improved Subspace Clustering.
In 2016 Asilomar Conference on Signals, Systems, and Computers.
Couillet, R., Pascal, F. and Silverstein, J. W. (2015). The random matrix regime
of Maronna’s M-estimator with elliptically distributed samples. Journal of Multivariate
Analysis 139 56–78.
Giryes, R., Sapiro, G. and Bronstein, A. M. (2015). Deep Neural Networks with
Random Gaussian Weights: A Universal Classification Strategy? IEEE Transactions
on Signal Processing 64 3444-3457.
Hornik, K., Stinchcombe, M. and White, H. (1989). Multilayer feedforward networks
are universal approximators. Neural networks 2 359–366.
Hoydis, J., Couillet, R. and Debbah, M. (2013). Random beamforming over quasi-
static and fading channels: a deterministic equivalent approach. IEEE Transactions on
Information Theory 58 6392-6425.
Huang, G.-B., Zhu, Q.-Y. and Siew, C.-K. (2006). Extreme learning machine: theory
and applications. Neurocomputing 70 489–501.
Huang, G.-B., Zhou, H., Ding, X. and Zhang, R. (2012). Extreme learning machine
for regression and multiclass classification. Systems, Man, and Cybernetics, Part B:
Cybernetics, IEEE Transactions on 42 513–529.
Jaeger, H. and Haas, H. (2004). Harnessing nonlinearity: Predicting chaotic systems
and saving energy in wireless communication. Science 304 78–80.
Kammoun, A., Kharouf, M., Hachem, W. and Najim, J. (2009). A central limit the-
orem for the sinr at the lmmse estimator output for large-dimensional signals. IEEE
Transactions on Information Theory 55 5048–5063.
El Karoui, N. (2009). Concentration of measure and spectra of random matrices: ap-
plications to correlation matrices, elliptical distributions and beyond. The Annals of
Applied Probability 19 2362–2405.
El Karoui, N. (2010). The spectrum of kernel random matrices. The Annals of Statistics
38 1–50.
El Karoui, N. (2013). Asymptotic behavior of unregularized and ridge-regularized
high-dimensional robust regression estimators: rigorous results. arXiv preprint
arXiv:1311.2445.
Krizhevsky, A., Sutskever, I. and Hinton, G. E. (2012). Imagenet classification with
deep convolutional neural networks. In Advances in neural information processing sys-
tems 1097–1105.
LeCun, Y., Cortes, C. and Burges, C. (1998). The MNIST database of handwritten
digits.
Ledoux, M. (2005). The concentration of measure phenomenon 89. American Mathemat-
ical Soc.
Loubaton, P. and Vallet, P. (2010). Almost sure localization of the eigenvalues in a
Gaussian information plus noise model. Application to the spiked models. Electronic
Journal of Probability 16 1934–1959.
Mai, X. and Couillet, R. (2017). The counterintuitive mechanism of graph-based semi-
supervised learning in the big data regime. In IEEE International Conference on Acous-
tics, Speech and Signal Processing (ICASSP’17).