
The Annals of Applied Probability

2018, Vol. 28, No. 2, 1190–1248


https://doi.org/10.1214/17-AAP1328
© Institute of Mathematical Statistics, 2018

A RANDOM MATRIX APPROACH TO NEURAL NETWORKS

BY COSME LOUART, ZHENYU LIAO AND ROMAIN COUILLET1


CentraleSupélec, University of Paris–Saclay
This article studies the Gram random matrix model G = (1/T) Σ^T Σ, Σ =
σ(WX), classically found in the analysis of random feature maps and random
neural networks, where X = [x_1, ..., x_T] ∈ R^{p×T} is a (data) matrix of
bounded norm, W ∈ R^{n×p} is a matrix of independent zero-mean unit-variance
entries and σ : R → R is a Lipschitz continuous (activation) function, with
σ(WX) understood entry-wise. By means of a key concentration of measure lemma
arising from nonasymptotic random matrix arguments, we prove that, as n, p, T
grow large at the same rate, the resolvent Q = (G + γ I_T)^{−1}, for γ > 0,
has a similar behavior to that met in sample covariance matrix models,
involving notably the moment Φ = (T/n) E[G], which provides in passing a
deterministic equivalent for the empirical spectral measure of G.
Application-wise, this result enables the estimation of the asymptotic
performance of single-layer random neural networks. This in turn provides
practical insights into the mechanisms at play in random neural networks,
entailing several unexpected consequences, as well as a fast practical means
to tune the network hyperparameters.

CONTENTS
1. Introduction . . . 1191
2. System model . . . 1193
3. Main results . . . 1196
3.1. Main technical results and training performance . . . 1196
3.2. Testing performance . . . 1199
3.3. Evaluation of Φ_AB . . . 1200
4. Practical outcomes . . . 1202
4.1. Simulation results . . . 1202
4.2. The underlying kernel . . . 1203
4.3. Limiting cases . . . 1207
5. Proof of the main results . . . 1209
5.1. Concentration results on Σ . . . 1209
5.2. Asymptotic equivalents . . . 1219
5.2.1. First equivalent for E[Q] . . . 1219
5.2.2. Second equivalent for E[Q] . . . 1223
5.2.3. Asymptotic equivalent for E[QAQ], where A is either Φ or symmetric of bounded norm . . . 1225
5.3. Derivation of Φ_ab . . . 1230
5.3.1. Gaussian w . . . 1230
5.4. Polynomial σ(·) and generic w . . . 1233
5.5. Heuristic derivation of Conjecture 1 . . . 1233
6. Concluding remarks . . . 1243
Appendix: Intermediary lemmas . . . 1245
References . . . 1246

Received February 2017; revised June 2017.
1 Supported by the ANR Project RMT4GRAPH (ANR-14-CE28-0006).
MSC2010 subject classifications. Primary 60B20; secondary 62M45.
Key words and phrases. Random matrix theory, random feature maps, neural networks.

1. Introduction. Artificial neural networks, developed in the late fifties
[Rosenblatt (1958)] in an attempt to build machines capable of brain-like
behaviors, enjoy today an unprecedented research interest, notably in their
applications to computer vision and machine learning at large [Krizhevsky,
Sutskever and Hinton (2012), Schmidhuber (2015)], where superhuman performances
on specific tasks are now commonly achieved. Recent progress in neural network
performance, however, finds its source in the processing power of modern
computers and in the availability of large datasets rather than in the
development of new mathematics. In fact, for lack of appropriate tools to
understand the theoretical behavior of the nonlinear activations and
deterministic data dependence underlying these networks, the discrepancy
between mathematical and practical (heuristic) studies of neural networks has
kept widening. A first salient problem in harnessing neural networks lies in
their being completely designed upon a deterministic
training dataset X = [x_1, ..., x_T] ∈ R^{p×T}, so that their resulting
performances intricately depend first and foremost on X. Recent works have
nonetheless established that, when smartly designed, mere randomly connected
neural networks can achieve performances close to those reached by entirely
data-driven network designs [Rahimi and Recht (2007), Saxe et al. (2011)]. As a
matter of fact, to handle gigantic databases, the computationally expensive
learning phase (the so-called backpropagation of the error method) typical of
deep neural network structures becomes impractical, while it was recently
shown that smartly designed single-layer random networks (as studied presently)
can already reach superhuman capabilities [Cambria et al. (2015)] and beat
expert knowledge in specific fields [Jaeger and Haas (2004)]. These various
findings have opened the road to the study of neural networks by means of
statistical and probabilistic tools [Choromanska et al. (2015), Giryes, Sapiro
and Bronstein (2016)]. The second problem relates to the nonlinear activation
functions present at each neuron, which have long been known (as opposed to
linear activations) to help design universal approximators for any
input–output target map [Hornik, Stinchcombe and White (1989)].
In this work, we propose an original random matrix-based approach to
understand the end-to-end regression performance of single-layer random
artificial neural networks, sometimes referred to as extreme learning machines
[Huang, Zhu and Siew (2006), Huang et al. (2012)], when the number T and size
p of the input dataset are large and scale proportionally with the number n of
neurons in the network. These networks can also be seen, from a more immediate
statistical viewpoint, as a mere linear ridge-regressor relating a random
feature map
σ(WX) ∈ R^{n×T} of explanatory variables X = [x_1, ..., x_T] ∈ R^{p×T} and
target variables y = [y_1, ..., y_T] ∈ R^{d×T}, for W ∈ R^{n×p} a randomly
designed matrix and σ(·) a nonlinear R → R function (applied component-wise).
Our approach
has several interesting features both for theoretical and practical
considerations. It is first one of the few known attempts to move the random
matrix realm away from matrices with independent or linearly dependent entries.
Notable exceptions are the line of works surrounding kernel random matrices
[Couillet and Benaych-Georges (2016), El Karoui (2010)] as well as large
dimensional robust statistics models [Couillet, Pascal and Silverstein (2015),
El Karoui (2013), Zhang, Cheng and Singer (2014)]. Here, to alleviate the
nonlinear difficulty, we exploit concentration of measure arguments [Ledoux
(2005)] for nonasymptotic random matrices, thereby pushing further the original
ideas of El Karoui (2009), Vershynin (2012)
established for simpler random matrix models. While we believe that more
powerful, albeit more computationally intensive, tools [such as an appropriate
adaptation of the Gaussian tools advocated in Pastur and Shcherbina (2011)]
cannot be avoided to
handle advanced considerations in neural networks, we demonstrate here that the
concentration of measure phenomenon allows one to fully characterize the main
quantities at the heart of the single-layer regression problem at hand.
In terms of practical applications, our findings shed light on the still
incompletely understood extreme learning machines, which have proved extremely
efficient in handling machine learning problems involving large to huge
datasets [Cambria et al. (2015), Huang et al. (2012)] at a computationally
affordable cost. But our objective is also to pave the way toward the
understanding of more involved neural network structures, featuring notably
multiple layers and some steps of learning by means of backpropagation of the
error.
Our main contribution is twofold. From a theoretical perspective, we first
obtain a key lemma, Lemma 1, on the concentration of quadratic forms of the
type σ(w^T X) A σ(X^T w), where w = ϕ(w̃), w̃ ∼ N(0, I_p), with ϕ : R → R and
σ : R → R Lipschitz functions, and X ∈ R^{p×T}, A ∈ R^{T×T} deterministic
matrices. This nonasymptotic result (valid for all n, p, T) is then exploited
under a simultaneous growth regime for n, p, T and boundedness conditions on X
and A to obtain, in Theorem 1, a deterministic approximation Q̄ of the
resolvent E[Q], where Q = ((1/T) Σ^T Σ + γ I_T)^{−1}, γ > 0, Σ = σ(WX), for
some W = ϕ(W̃), W̃ ∈ R^{n×p} having independent N(0, 1) entries. As the
resolvent of a matrix (or operator) is an important proxy for the
characterization of its spectrum [see, e.g., Akhiezer and Glazman (1993),
Pastur and Shcherbina (2011)], this result therefore allows for the
characterization of the asymptotic spectral properties of (1/T) Σ^T Σ, such as
its limiting spectral measure in Theorem 2.
Application-wise, the theoretical findings are an important preliminary step
for the understanding and improvement of various statistical methods based on
random features in the large dimensional regime. Specifically, here, we
consider the question of linear ridge-regression from random feature maps,
which coincides with the aforementioned single hidden-layer random neural
network known as extreme learning machine. We show that, under mild conditions,
both the training Etrain and testing Etest mean-square errors, respectively
corresponding to the regression errors on known input–output pairs (x_1, y_1),
..., (x_T, y_T) (with x_i ∈ R^p, y_i ∈ R^d) and unknown pairings (x̂_1, ŷ_1),
..., (x̂_T̂, ŷ_T̂), almost surely converge to deterministic limiting values as
n, p, T grow large at the same rate (while d is kept constant), for every
fixed ridge-regression parameter γ > 0. Simulations on real image datasets are
provided that corroborate our results.
These findings provide new insights into the roles played by the activation
function σ(·) and the random distribution of the entries of W in random
feature maps, as well as by the ridge-regression parameter γ, in the neural
network performance. We notably exhibit and prove some peculiar behaviors, such
as the impossibility for the network to carry out elementary Gaussian mixture
classification tasks when either the activation function or the random weights
distribution is ill chosen.
Besides, for the practitioner, the theoretical formulas retrieved in this work
allow for a fast offline tuning of the aforementioned hyperparameters of the
neural network, notably when T is not too large compared to p. The graphical
results provided in the course of the article were notably obtained with a
100- to 500-fold gain in computation time of theory over simulations.
The remainder of the article is structured as follows: in Section 2, we introduce
the mathematical model of the system under investigation. Our main results are
then described and discussed in Section 3, the proofs of which are deferred to
Section 5. Section 4 discusses our main findings. The article closes on concluding
remarks on envisioned extensions of the present work in Section 6. The Appendix
provides some intermediary lemmas of constant use throughout the proof section.
Reproducibility: Python 3 codes used to produce the results of Section 4 are
available at https://github.com/Zhenyu-LIAO/RMT4ELM.
Notation: The norm ‖·‖ is understood as the Euclidean norm for vectors and
the operator norm for matrices, while ‖·‖_F is the Frobenius norm for
matrices. All vectors in the article are understood as column vectors.

2. System model. We consider a ridge-regression task on random feature


maps defined as follows. Each input datum x ∈ Rp is multiplied by a matrix
W ∈ Rn×p ; a nonlinear function σ : R → R is then applied entry-wise to the
vector W x, thereby providing a set of n random features σ (W x) ∈ Rn for each
datum x ∈ Rp . The output z ∈ Rd of the linear regression is the inner product
z = β T σ (W x) for some matrix β ∈ Rn×d to be designed.
From a neural network viewpoint, the n neurons of the network are the virtual
units operating the mapping Wi· x → σ (Wi· x) (Wi· being the ith row of W ), for
1 ≤ i ≤ n. The neural network then operates in two phases: a training phase where
the regression matrix β is learned based on a known input–output dataset pair
(X, Y ) and a testing phase where, for β now fixed, the network operates on a new
input dataset X̂ with corresponding unknown output Ŷ .
During the training phase, based on a set of known input X = [x_1, ..., x_T] ∈
R^{p×T} and output Y = [y_1, ..., y_T] ∈ R^{d×T} datasets, the matrix β is
chosen so as to minimize the mean square error (1/T) ∑_{i=1}^T ‖z_i − y_i‖² +
γ ‖β‖²_F, where z_i = β^T σ(W x_i) and γ > 0 is some regularization factor.
Solving for β, this leads to the explicit ridge-regressor

    β = (1/T) Σ ( (1/T) Σ^T Σ + γ I_T )^{−1} Y^T,

where we defined Σ ≡ σ(WX). This follows from differentiating the mean square
error along β to obtain 0 = γ β + (1/T) ∑_{i=1}^T σ(W x_i)(β^T σ(W x_i) −
y_i)^T, so that ( (1/T) Σ Σ^T + γ I_n ) β = (1/T) Σ Y^T, which, along with
( (1/T) Σ Σ^T + γ I_n )^{−1} Σ = Σ ( (1/T) Σ^T Σ + γ I_T )^{−1}, gives the
result.
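The closed form above, and the push-through identity it relies on, are straightforward to verify numerically. The sketch below (assuming NumPy, with illustrative toy dimensions and a ReLU activation chosen only as an example) compares the two equivalent expressions for β and checks the first-order optimality condition:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, T, d, gamma = 30, 20, 40, 2, 0.5   # toy sizes, for illustration only

X = rng.standard_normal((p, T))
Y = rng.standard_normal((d, T))
W = rng.standard_normal((n, p))
S = np.maximum(W @ X, 0.0)               # Sigma = sigma(WX), here sigma = ReLU

# beta from the T x T resolvent, as in the text
Q = np.linalg.inv(S.T @ S / T + gamma * np.eye(T))
beta = S @ Q @ Y.T / T

# beta from the equivalent n x n formulation
beta2 = np.linalg.inv(S @ S.T / T + gamma * np.eye(n)) @ S @ Y.T / T

print(np.allclose(beta, beta2))          # expected: True (push-through identity)
```

The identity ((1/T) Σ Σ^T + γ I_n)^{−1} Σ = Σ ((1/T) Σ^T Σ + γ I_T)^{−1} is what makes the T × T formulation usable when n is much larger than T.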
In the remainder, we will also denote

    Q ≡ ( (1/T) Σ^T Σ + γ I_T )^{−1}

the resolvent of (1/T) Σ^T Σ. The matrix Q naturally appears as a key quantity
in the performance analysis of the neural network. Notably, the mean-square
error Etrain on the training dataset X is given by

(1)    Etrain = (1/T) ‖Y^T − Σ^T β‖²_F = (γ²/T) tr( Y^T Y Q² ).

Under the growth rate assumptions on n, p, T taken below, it shall appear that
the random variable Etrain concentrates around its mean, so that E[Q²] emerges
as a central object in the asymptotic evaluation of Etrain.
The testing phase of the neural network is more interesting in practice, as it
unveils the actual performance of neural networks. For a test dataset X̂ ∈
R^{p×T̂} of length T̂, with unknown output Ŷ ∈ R^{d×T̂}, the test mean-square
error is defined by

    Etest = (1/T̂) ‖Ŷ^T − Σ̂^T β‖²_F,

where Σ̂ = σ(W X̂) and β is the same as used in (1) [and thus only depends on
(X, Y) and γ]. One of the key questions in the analysis of such an elementary
neural network lies in the determination of the γ which minimizes Etest (and
is thus said to achieve good generalization performance). Notably, small
values of γ are known to reduce Etrain but to induce the popular overfitting
issue, which generally increases Etest, while large values of γ engender large
values of both Etrain and Etest.
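These definitions are easy to simulate. The following sketch (assuming NumPy and a hypothetical toy regression task; all names and dimensions are illustrative) evaluates Etrain and Etest over a grid of γ and checks the identity Etrain = (γ²/T) tr(Y^T Y Q²) of (1):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, T, That, d = 100, 30, 80, 60, 1

X, Xhat = rng.standard_normal((p, T)), rng.standard_normal((p, That))
b = rng.standard_normal((p, d))            # hypothetical ground-truth linear map
Y = b.T @ X + 0.5 * rng.standard_normal((d, T))
Yhat = b.T @ Xhat + 0.5 * rng.standard_normal((d, That))

W = rng.standard_normal((n, p))
S, Shat = np.maximum(W @ X, 0), np.maximum(W @ Xhat, 0)   # sigma = ReLU

gammas = [1e-3, 1e-2, 1e-1, 1.0, 10.0]
E_train, E_train_eq1, E_test = [], [], []
for gamma in gammas:
    Q = np.linalg.inv(S.T @ S / T + gamma * np.eye(T))
    beta = S @ Q @ Y.T / T
    E_train.append(np.sum((Y.T - S.T @ beta) ** 2) / T)
    E_train_eq1.append(gamma**2 / T * np.trace(Y.T @ Y @ Q @ Q))  # identity (1)
    E_test.append(np.sum((Yhat.T - Shat.T @ beta) ** 2) / That)

print(E_train)   # nondecreasing in gamma
print(E_test)    # typically U-shaped in gamma
```

The training error is monotone nondecreasing in γ (a classical property of ridge regression), while the test error generally exhibits a trade-off, as discussed above.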
From a mathematical standpoint though, the study of Etest brings forward
technical difficulties that do not allow for as simple a treatment through the
present concentration of measure methodology as the study of Etrain.
Nonetheless, the
analysis of Etrain allows at least for heuristic approaches to become available,


which we shall exploit to propose an asymptotic deterministic approximation for
Etest .
From a technical standpoint, we shall make the following set of assumptions on
the mapping x → σ (W x).

ASSUMPTION 1 (Sub-Gaussian W). The matrix W is defined by

    W = ϕ(W̃)

(understood entry-wise), where W̃ has independent and identically distributed
N(0, 1) entries and ϕ(·) is λ_ϕ-Lipschitz.

For a = ϕ(b) ∈ R^ℓ, ℓ ≥ 1, with b ∼ N(0, I_ℓ), we shall subsequently denote
a ∼ N_ϕ(0, I_ℓ).
Under the notation of Assumption 1, we have in particular W_ij ∼ N(0, 1) if
ϕ(t) = t, and W_ij ∼ U(−1, 1) (the uniform distribution on [−1, 1]) if
ϕ(t) = −1 + 2 (1/√(2π)) ∫_{−∞}^t e^{−x²/2} dx (ϕ is here a √(2/π)-Lipschitz
map).
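The Gaussian-to-uniform example above can be checked directly. In the sketch below (assuming NumPy; dimensions are illustrative), the map ϕ(t) = −1 + 2 (1/√(2π)) ∫_{−∞}^t e^{−x²/2} dx reduces to erf(t/√2), and the resulting entries should match the moments of U(−1, 1):

```python
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(0)
W_tilde = rng.standard_normal((1000, 500))    # i.i.d. N(0,1) entries

# phi(t) = -1 + 2 * P(N(0,1) <= t) = erf(t / sqrt(2)), a sqrt(2/pi)-Lipschitz map
phi = np.vectorize(lambda t: erf(t / sqrt(2.0)))
W = phi(W_tilde)                              # entries now distributed U(-1, 1)

m2, m4 = np.mean(W**2), np.mean(W**4)
print(m2, m4)   # should approach 1/3 and 1/5, the moments of U(-1, 1)
```

Note that, while the entries of W̃ are unit variance, those of W here have variance 1/3; Assumption 1 only constrains ϕ to be Lipschitz.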

We further need the following regularity condition on the function σ.

ASSUMPTION 2 (Function σ). The function σ is Lipschitz continuous with
parameter λ_σ.

This assumption holds for many of the activation functions traditionally
considered in neural networks, such as sigmoid functions, the rectified linear
unit σ(t) = max(t, 0) or the absolute value operator.
When considering the interesting case of simultaneously large data and random
features (or neurons), we shall then make the following growth rate
assumptions.

ASSUMPTION 3 (Growth rate). As n → ∞,

    0 < lim inf_n min{p/n, T/n} ≤ lim sup_n max{p/n, T/n} < ∞,

while γ, λ_σ, λ_ϕ > 0 and d are kept constant. In addition,

    lim sup_n ‖X‖ < ∞,
    lim sup_n max_{ij} |Y_ij| < ∞.
3. Main results.

3.1. Main technical results and training performance. As a standard
preliminary step in the asymptotic random matrix analysis of the expectation
E[Q] of the resolvent Q = ((1/T) Σ^T Σ + γ I_T)^{−1}, a convergence of
quadratic forms based on the row vectors of Σ is necessary [see, e.g.,
Marčenko and Pastur (1967), Silverstein and Bai (1995)]. Such results are
usually obtained by exploiting the independence (or linear dependence) in the
vector entries. This not being the case here, as the entries of the vector
σ(X^T w) are in general not independent, we resort to a concentration of
measure approach, as advocated in El Karoui (2009). The following lemma,
stated here in a nonasymptotic random matrix regime (i.e., without necessarily
resorting to Assumption 3), and thus of independent interest, provides this
concentration result. For this lemma, we first need to define the following
key matrix:

(2)    Φ = E[ σ(w^T X)^T σ(w^T X) ]

of size T × T, where w ∼ N_ϕ(0, I_p).

LEMMA 1 (Concentration of quadratic forms). Let Assumptions 1–2 hold. Let
also A ∈ R^{T×T} be such that ‖A‖ ≤ 1 and, for X ∈ R^{p×T} and w ∼ N_ϕ(0, I_p),
define the random vector σ ≡ σ(w^T X)^T ∈ R^T. Then

    P( | (1/T) σ^T A σ − (1/T) tr(ΦA) | > t )
        ≤ C exp( − (cT / (‖X‖² λ_ϕ² λ_σ²)) min( t²/t_0², t ) )

for t_0 ≡ |σ(0)| + λ_ϕ λ_σ ‖X‖ √(p/T) and C, c > 0 independent of all other
parameters. In particular, under the additional Assumption 3,

    P( | (1/T) σ^T A σ − (1/T) tr(ΦA) | > t ) ≤ C e^{−cn min(t, t²)}

for some C, c > 0.

Note that this lemma partially extends concentration of measure results
involving quadratic forms [see, e.g., Rudelson, Vershynin et al. (2013),
Theorem 1.1] to nonlinear vectors.
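The concentration claimed by Lemma 1 is easy to visualize by Monte Carlo. The sketch below (assuming NumPy, with Φ itself estimated from independent draws of w, A = I_T for simplicity, and σ = tanh as an illustrative Lipschitz activation) shows the quadratic form (1/T) σ^T A σ fluctuating tightly around (1/T) tr(ΦA):

```python
import numpy as np

rng = np.random.default_rng(2)
p, T = 200, 300
X = rng.standard_normal((p, T)) / np.sqrt(p)   # bounded-norm data matrix
A = np.eye(T)                                  # a simple choice with ||A|| <= 1
sigma = np.tanh                                # a 1-Lipschitz activation

# estimate Phi = E[sigma(w^T X)^T sigma(w^T X)] by Monte Carlo over w ~ N(0, I_p)
Wmc = rng.standard_normal((20000, p))
Smc = sigma(Wmc @ X)
Phi = Smc.T @ Smc / Wmc.shape[0]
target = np.trace(Phi @ A) / T

# fluctuations of the quadratic form around (1/T) tr(Phi A)
vals = [s @ A @ s / T for s in sigma(rng.standard_normal((500, p)) @ X)]
print(np.mean(vals), target, np.std(vals))   # mean close to target, small spread
```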
With this result in place, the standard resolvent approaches of random matrix
theory apply, providing our main theoretical finding as follows.

THEOREM 1 (Asymptotic equivalent for E[Q]). Let Assumptions 1–3 hold and
define Q̄ as

    Q̄ ≡ ( (n/T) Φ/(1 + δ) + γ I_T )^{−1},

where δ is implicitly defined as the unique positive solution to
δ = (1/T) tr(Φ Q̄). Then, for all ε > 0, there exists c > 0 such that

    ‖ E[Q] − Q̄ ‖ ≤ c n^{−1/2 + ε}.
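In practice, δ is conveniently obtained by iterating its defining fixed point. The sketch below (assuming NumPy, with Φ estimated by Monte Carlo and toy dimensions chosen for illustration) computes Q̄ and compares it with an empirical average of Q over several draws of W:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, T, gamma = 400, 200, 300, 0.5
X = rng.standard_normal((p, T)) / np.sqrt(p)
sigma = np.tanh

# Monte Carlo estimate of Phi = E[sigma(w^T X)^T sigma(w^T X)]
S = sigma(rng.standard_normal((20000, p)) @ X)
Phi = S.T @ S / S.shape[0]

# fixed point: delta = (1/T) tr(Phi Qbar), Qbar = ((n/T) Phi/(1+delta) + gamma I)^-1
delta = 0.0
for _ in range(100):
    Qbar = np.linalg.inv(n / T * Phi / (1 + delta) + gamma * np.eye(T))
    delta_new = np.trace(Phi @ Qbar) / T
    if abs(delta_new - delta) < 1e-12:
        break
    delta = delta_new

# empirical E[Q] over a few realizations of W
reps = 20
EQ = np.zeros((T, T))
for _ in range(reps):
    Sig = sigma(rng.standard_normal((n, p)) @ X)
    EQ += np.linalg.inv(Sig.T @ Sig / T + gamma * np.eye(T)) / reps

err = np.linalg.norm(EQ - Qbar, 2) / np.linalg.norm(Qbar, 2)
print(err)   # small for large n, p, T
```

The fixed-point iteration typically converges in a handful of steps for γ > 0, which is what makes the deterministic equivalent cheap to evaluate.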

As a corollary of Theorem 1, along with a concentration argument on
(1/T) tr Q, we have the following result on the spectral measure of
(1/T) Σ^T Σ, which may be seen as a nonlinear extension of Silverstein and Bai
(1995), for which σ(t) = t.

THEOREM 2 (Limiting spectral measure of (1/T) Σ^T Σ). Let Assumptions 1–3
hold and, for λ_1, ..., λ_T the eigenvalues of (1/T) Σ^T Σ, define
μ_n = (1/T) ∑_{i=1}^T δ_{λ_i}. Then, for every bounded continuous function f,
with probability one,

    ∫ f dμ_n − ∫ f dμ̄_n → 0,

where μ̄_n is the measure defined through its Stieltjes transform
m_{μ̄_n}(z) ≡ ∫ (t − z)^{−1} dμ̄_n(t), given, for z ∈ {w ∈ C, ℑ[w] > 0}, by

    m_{μ̄_n}(z) = (1/T) tr( (n/T) Φ/(1 + δ_z) − z I_T )^{−1}

with δ_z the unique solution in {w ∈ C, ℑ[w] > 0} of

    δ_z = (1/T) tr( Φ ( (n/T) Φ/(1 + δ_z) − z I_T )^{−1} ).

Note that μ̄_n has a well-known form, already met in early random matrix works
[e.g., Silverstein and Bai (1995)] on sample covariance matrix models. Notably,
μ̄_n is also the deterministic equivalent of the empirical spectral measure of
(1/T) P^T W^T W P for any deterministic matrix P ∈ R^{p×T} such that
P^T P = Φ. As such, to some extent, the results above provide a consistent
asymptotic linearization of (1/T) Σ^T Σ. From standard spiked model arguments
[see, e.g., Benaych-Georges and Nadakuditi (2012)], the result
‖E[Q] − Q̄‖ → 0 further suggests that the eigenvectors associated with
isolated eigenvalues of (1/T) Σ^T Σ (if any) also behave similarly to those of
(1/T) P^T W^T W P, a remark of fundamental importance for the understanding of
neural network performance.
However, as shall be shown in Section 3.3, and contrary to empirical
covariance matrix models of the type P^T W^T W P, Φ explicitly depends on the
distribution of the W_ij (i.e., beyond its first two moments). Thus, the
aforementioned linearization of (1/T) Σ^T Σ, and subsequently the
deterministic equivalent for μ_n, are not universal with respect to the
distribution of the zero-mean unit-variance W_ij. This is in striking contrast
to the many linear random matrix models studied to date, which often exhibit
such universal behaviors. This property too will have deep consequences for
the performance of neural networks, as shall be shown through Figure 3 in
Section 4
for an example where inappropriate choices for the law of W lead to network
failure to fulfill the regression task.
For convenience in the following, letting δ and Φ be defined as in Theorem 1,
we shall denote

(3)    Ψ = (n/T) · Φ/(1 + δ).
Theorem 1 provides the central step in the evaluation of Etrain, for which not
only E[Q] but also E[Q²] needs to be estimated. This last ingredient is
provided in the following proposition.

PROPOSITION 1 (Asymptotic equivalent for E[QAQ]). Let Assumptions 1–3 hold
and let A ∈ R^{T×T} be a symmetric nonnegative definite matrix which is either
Φ or a matrix with uniformly bounded operator norm (with respect to T). Then,
for all ε > 0, there exists c > 0 such that, for all n,

    ‖ E[QAQ] − ( Q̄AQ̄ + ( (1/n) tr(Ψ Q̄ A Q̄) / (1 − (1/n) tr(Ψ Q̄)²) ) Q̄ Ψ Q̄ ) ‖
        ≤ c n^{−1/2 + ε}.

As an immediate consequence of Proposition 1, we have the following result on


the training mean-square error of single-layer random neural networks.

THEOREM 3 (Asymptotic training mean-square error). Let Assumptions 1–3 hold
and let Q̄, Ψ be defined as in Theorem 1 and (3). Then, for all ε > 0,

    n^{1/2 − ε} (Etrain − Ētrain) → 0

almost surely, where

    Etrain = (1/T) ‖Y^T − Σ^T β‖²_F = (γ²/T) tr( Y^T Y Q² ),

    Ētrain = (γ²/T) tr( Y^T Y Q̄ [ ( (1/n) tr(Ψ Q̄²) / (1 − (1/n) tr(Ψ Q̄)²) ) Ψ + I_T ] Q̄ ).
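Once Φ and δ are available, Ētrain is fully explicit. The following sketch (assuming NumPy, on a toy regression task with a Monte Carlo estimate of Φ; all names and sizes are illustrative) implements the formula of Theorem 3 and compares it with the simulated Etrain:

```python
import numpy as np

rng = np.random.default_rng(5)
n, p, T, gamma = 400, 200, 300, 0.1
X = rng.standard_normal((p, T)) / np.sqrt(p)
Y = rng.standard_normal((1, T))
sigma = np.tanh

# Monte Carlo estimate of Phi
S = sigma(rng.standard_normal((20000, p)) @ X)
Phi = S.T @ S / S.shape[0]

# delta from Theorem 1, then Psi and Qbar
delta = 0.0
for _ in range(200):
    Qbar = np.linalg.inv(n / T * Phi / (1 + delta) + gamma * np.eye(T))
    delta = np.trace(Phi @ Qbar) / T
Psi = n / T * Phi / (1 + delta)

# Theorem 3: deterministic equivalent of the training error
c = np.trace(Psi @ Qbar @ Qbar) / n / (1 - np.trace(Psi @ Qbar @ Psi @ Qbar) / n)
E_bar = gamma**2 / T * np.trace(Y @ Qbar @ (c * Psi + np.eye(T)) @ Qbar @ Y.T)

# simulated training error, averaged over realizations of W
reps = 10
E_sim = 0.0
for _ in range(reps):
    Sig = sigma(rng.standard_normal((n, p)) @ X)
    Q = np.linalg.inv(Sig.T @ Sig / T + gamma * np.eye(T))
    E_sim += gamma**2 / T * np.trace(Y @ Q @ Q @ Y.T) / reps

print(E_bar, E_sim)   # close for large n, p, T
```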

Since Q̄ and Ψ share the same orthogonal eigenvector basis, it appears that
Ētrain depends on the alignment between the right singular vectors of Y and
the eigenvectors of Ψ, with weighting coefficients

    ( γ/(λ_i + γ) )² [ 1 + λ_i ( (1/n) ∑_{j=1}^T λ_j (λ_j + γ)^{−2} )
        / ( 1 − (1/n) ∑_{j=1}^T λ_j² (λ_j + γ)^{−2} ) ],   1 ≤ i ≤ T,

where we denoted λ_i = λ_i(Ψ), 1 ≤ i ≤ T, the eigenvalues of Ψ [which depend
on γ through λ_i(Ψ) = n λ_i(Φ)/(T(1 + δ))]. If lim inf_n n/T > 1, it is easily
seen that δ → 0 as γ → 0, in which case Etrain → 0 almost surely. However, in
the more
interesting case in practice where lim supn n/T < 1, δ → ∞ as γ → 0 and Etrain
consequently does not have a simple limit (see Section 4.3 for more discussion on
this aspect).
Theorem 3 is also reminiscent of applied random matrix works on empirical
covariance matrix models, such as Bai and Silverstein (2007), Kammoun et al.
(2009), further emphasizing the strong connection between the nonlinear matrix
σ(WX) and its linear counterpart W Φ^{1/2}.
As a side note, observe that, to obtain Theorem 3, we could have used the fact
that tr(Y^T Y Q²) = −(∂/∂γ) tr(Y^T Y Q) which, along with some analyticity
arguments [for instance when extending the definition of Q = Q(γ) to Q(z),
z ∈ C], would have directly ensured that ∂Q̄/∂γ is an asymptotic equivalent
for −E[Q²], without the need for the explicit derivation of Proposition 1.
Nonetheless, as shall appear subsequently, Proposition 1 is also a proxy to
the asymptotic analysis of Etest. Besides, the technical proof of
Proposition 1 quite interestingly showcases the strength of the concentration
of measure tools under study here.

3.2. Testing performance. As previously mentioned, harnessing the asymptotic
testing performance Etest seems, to the best of the authors' knowledge, out of
current reach with the sole concentration of measure arguments used for the
proof of the previous main results. Nonetheless, if not fully effective, these
arguments allow for an intuitive derivation of a deterministic equivalent for
Etest, which is strongly supported by simulation results. We provide this
result below under the form of a yet unproven claim, a heuristic derivation of
which is provided at the end of Section 5.
To introduce this result, let X̂ = [x̂_1, ..., x̂_T̂] ∈ R^{p×T̂} be a set of
input data with corresponding output Ŷ = [ŷ_1, ..., ŷ_T̂] ∈ R^{d×T̂}. We also
define Σ̂ = σ(W X̂) ∈ R^{n×T̂}. We assume that X̂ and Ŷ satisfy the same
growth rate conditions as X and Y in Assumption 3. To introduce our claim, we
need to extend the definitions of Φ in (2) and Ψ in (3) to the following
notation: for each pair of matrices (A, B) of appropriate dimensions,

    Φ_{AB} = E[ σ(w^T A)^T σ(w^T B) ],
    Ψ_{AB} = (n/T) · Φ_{AB}/(1 + δ),

where w ∼ N_ϕ(0, I_p). In particular, Φ = Φ_{XX} and Ψ = Ψ_{XX}.
With this notation in place, we are in a position to state our claimed result.

CONJECTURE 1 (Deterministic equivalent for Etest). Let Assumptions 1–2 hold
and let X̂, Ŷ satisfy the same conditions as X, Y in Assumption 3. Then, for
all ε > 0,

    n^{1/2 − ε} (Etest − Ētest) → 0
almost surely, where

    Etest = (1/T̂) ‖Ŷ^T − Σ̂^T β‖²_F,

    Ētest = (1/T̂) ‖Ŷ^T − Ψ_{XX̂}^T Q̄ Y^T‖²_F
        + ( (1/n) tr(Y^T Y Q̄ Ψ Q̄) / (1 − (1/n) tr(Ψ Q̄)²) )
          × [ (1/T̂) tr Ψ_{X̂X̂} − (1/T̂) tr( (I_T + γ Q̄) Ψ_{XX̂} Ψ_{X̂X} Q̄ ) ].
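Although Conjecture 1 is only heuristically derived, it is easy to probe numerically. The sketch below (assuming NumPy, toy Gaussian data, and Monte Carlo estimates of the Φ matrices; all names and sizes are illustrative) implements Ētest as written above and compares it with the simulated Etest:

```python
import numpy as np

rng = np.random.default_rng(6)
n, p, T, That, gamma = 400, 200, 300, 250, 0.1
X = rng.standard_normal((p, T)) / np.sqrt(p)
Xhat = rng.standard_normal((p, That)) / np.sqrt(p)
b = rng.standard_normal((p, 1))
Y = b.T @ X + 0.3 * rng.standard_normal((1, T))
Yhat = b.T @ Xhat + 0.3 * rng.standard_normal((1, That))
sigma = np.tanh

# Monte Carlo estimates of Phi_XX, Phi_XXhat, Phi_XhatXhat
G = rng.standard_normal((20000, p))
SX, SXh = sigma(G @ X), sigma(G @ Xhat)
m = G.shape[0]
Phi, Phi_XXh, Phi_XhXh = SX.T @ SX / m, SX.T @ SXh / m, SXh.T @ SXh / m

# delta, Qbar and the Psi matrices
delta = 0.0
for _ in range(200):
    Qbar = np.linalg.inv(n / T * Phi / (1 + delta) + gamma * np.eye(T))
    delta = np.trace(Phi @ Qbar) / T
r = n / T / (1 + delta)
Psi, Psi_XXh, Psi_XhXh = r * Phi, r * Phi_XXh, r * Phi_XhXh

# Conjecture 1
first = np.sum((Yhat.T - Psi_XXh.T @ Qbar @ Y.T) ** 2) / That
D = 1 - np.trace(Psi @ Qbar @ Psi @ Qbar) / n
second = (np.trace(Y @ Qbar @ Psi @ Qbar @ Y.T) / n / D) * (
    np.trace(Psi_XhXh) / That
    - np.trace((np.eye(T) + gamma * Qbar) @ Psi_XXh @ Psi_XXh.T @ Qbar) / That)
E_bar_test = first + second

# simulated test error over realizations of W
E_sim = 0.0
for _ in range(10):
    Wr = rng.standard_normal((n, p))
    Sig, Sighat = sigma(Wr @ X), sigma(Wr @ Xhat)
    beta = Sig @ np.linalg.inv(Sig.T @ Sig / T + gamma * np.eye(T)) @ Y.T / T
    E_sim += np.sum((Yhat.T - Sighat.T @ beta) ** 2) / That / 10
print(E_bar_test, E_sim)   # close for large n, p, T
```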

While not immediate at first sight, one can confirm [using notably the
relation Ψ Q̄ + γ Q̄ = I_T] that, for (X̂, Ŷ) = (X, Y), Ētrain = Ētest, as
expected.
In order to evaluate the results of Theorem 3 and Conjecture 1 in practice, a
first step is to be able to estimate the values of Φ_{AB} for the various
activation functions σ(·) of practical interest. Such results, which call for
completely different mathematical tools (mostly based on integration tricks),
are provided in the subsequent section.

3.3. Evaluation of Φ_AB. The evaluation of Φ_{AB} = E[σ(w^T A)^T σ(w^T B)]
for arbitrary matrices A, B naturally boils down to the evaluation of its
individual entries, and thus to the calculus, for arbitrary vectors
a, b ∈ R^p, of

(4)    Φ_{ab} ≡ E[ σ(w^T a) σ(w^T b) ]
             = (2π)^{−p/2} ∫ σ(ϕ(w̃)^T a) σ(ϕ(w̃)^T b) e^{−‖w̃‖²/2} dw̃.

The evaluation of (4) can be obtained through various integration tricks for a wide
family of mappings ϕ(·) and activation functions σ (·). The most popular activa-
tion functions in neural networks are sigmoid functions, such as σ (t) = erf(t) ≡
 t −u2
√2 e du, as well as the so-called rectified linear unit (ReLU) defined by
π 0
σ (t) = max(t, 0) which has been recently popularized as a result of its robust be-
havior in deep neural networks. In physical artificial neural networks implemented
using light projections, σ (t) = |t| is the preferred choice. Note that all aforemen-
tioned functions are Lipschitz continuous and, therefore, in accordance with As-
sumption 2.
Despite their not abiding by the prescriptions of Assumptions 1 and 2, we
believe that the results of this article could be extended to more general
settings, as discussed in Section 4. In particular, since the key ingredient
in the proof of all our results is that the vector σ(w^T X) follows a
concentration of measure phenomenon, induced by the Gaussianity of w̃ [if
w = ϕ(w̃)], the Lipschitz character of σ and the norm boundedness of X, it is
likely, although not necessarily simple to prove, that σ(w^T X) may still
concentrate under relaxed assumptions. This is likely the case for more
generic vectors w than N_ϕ(0, I_p) as well as for a larger class of
TABLE 1
Values of Φ_ab for w ∼ N(0, I_p), ∠(a, b) ≡ a^T b/(‖a‖ ‖b‖)

σ(t)         Φ_ab
t            a^T b
max(t, 0)    (1/(2π)) ‖a‖ ‖b‖ ( ∠(a, b) acos(−∠(a, b)) + √(1 − ∠(a, b)²) )
|t|          (2/π) ‖a‖ ‖b‖ ( ∠(a, b) asin(∠(a, b)) + √(1 − ∠(a, b)²) )
erf(t)       (2/π) asin( 2 a^T b / √((1 + 2‖a‖²)(1 + 2‖b‖²)) )
1_{t>0}      1/2 − (1/(2π)) acos(∠(a, b))
sign(t)      (2/π) asin(∠(a, b))
cos(t)       exp(−(‖a‖² + ‖b‖²)/2) cosh(a^T b)
sin(t)       exp(−(‖a‖² + ‖b‖²)/2) sinh(a^T b)

activation functions, such as polynomial or piecewise Lipschitz continuous
functions.
In anticipation of these likely generalizations, we provide in Table 1 the
values of Φ_{ab} for w ∼ N(0, I_p) [i.e., for ϕ(t) = t] and for a set of
functions σ(·) not necessarily satisfying Assumption 2. Denoting Φ(σ(t)) the
matrix Φ associated with the activation σ(t), it is interesting to remark
that, since acos(x) = −asin(x) + π/2, Φ(max(t, 0)) = Φ((1/2)t) + Φ((1/2)|t|).
Also, [Φ(cos(t)) + Φ(sin(t))]_{a,b} = exp(−(1/2)‖a − b‖²), a result
reminiscent of Rahimi and Recht (2007).2 Finally, note that
Φ(erf(κt)) → Φ(sign(t)) as κ → ∞, inducing that the extension by continuity of
erf(κt) to sign(t) propagates to their associated kernels.
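The entries of Table 1 are easy to validate by simulation. The sketch below (assuming NumPy, with fixed illustrative vectors a, b) checks the max(t, 0) and |t| formulas against Monte Carlo averages over w ∼ N(0, I_p):

```python
import numpy as np

rng = np.random.default_rng(7)
a = np.array([1.0, 0.0, 0.0])
b = np.array([0.6, 0.8, 0.0])
na, nb = np.linalg.norm(a), np.linalg.norm(b)
ang = a @ b / (na * nb)

# Table 1, sigma(t) = max(t, 0) and sigma(t) = |t|
phi_relu = na * nb / (2 * np.pi) * (ang * np.arccos(-ang) + np.sqrt(1 - ang**2))
phi_abs = 2 / np.pi * na * nb * (ang * np.arcsin(ang) + np.sqrt(1 - ang**2))

# Monte Carlo over w ~ N(0, I_p)
W = rng.standard_normal((1_000_000, 3))
mc_relu = np.mean(np.maximum(W @ a, 0) * np.maximum(W @ b, 0))
mc_abs = np.mean(np.abs(W @ a) * np.abs(W @ b))
print(phi_relu, mc_relu)   # the two should agree to a few decimals
print(phi_abs, mc_abs)
```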
In addition to these results for w ∼ N(0, I_p), we also evaluated
Φ_{ab} = E[σ(w^T a) σ(w^T b)] for σ(t) = ζ_2 t² + ζ_1 t + ζ_0 and w ∈ R^p a
vector of independent and identically distributed entries of zero mean and
with moment of order k equal to m_k (so m_1 = 0); w is not restricted here to
satisfy w ∼ N_ϕ(0, I_p). In this case, we find

(5)    Φ_{ab} = ζ_2² [ m_2² ( 2 (a^T b)² + ‖a‖² ‖b‖² ) + (m_4 − 3 m_2²) (a²)^T (b²) ]
             + ζ_1² m_2 a^T b + ζ_2 ζ_1 m_3 [ (a²)^T b + a^T (b²) ]
             + ζ_2 ζ_0 m_2 ( ‖a‖² + ‖b‖² ) + ζ_0²,

where we defined (a²) ≡ [a_1², ..., a_p²]^T.
It is already interesting to remark that, while classical random matrix models
exhibit a well-known universality property—in the sense that their limiting
spectral distribution is independent of the moments (higher than two) of the
entries of the involved random matrix, here W—for σ(·) a polynomial of order
two, Φ, and

2 It is in particular not difficult to prove, based on our framework, that as n/T → ∞, a random
neural network composed of n/2 neurons with activation function σ (t) = cos(t) and n/2 neurons
with activation function σ (t) = sin(t) implements a Gaussian difference kernel.
thus μ̄_n, strongly depend on E[W_ij^k] for k = 3, 4. We shall see in
Section 4 that this remark has troubling consequences. We will notably infer
(and confirm via simulations) that the studied neural network may provably
fail to fulfill a specific task if the W_ij are Bernoulli with zero mean and
unit variance, but succeed with possibly high performance if the W_ij are
standard Gaussian [which is explained by the disappearance or not of the terms
(a^T b)² and (a²)^T (b²) in (5) when m_4 = m_2²].
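Formula (5), and in particular its dependence on the higher moments m_3 and m_4, can be checked by direct simulation. The sketch below (assuming NumPy, with illustrative vectors a, b) compares Gaussian and symmetric Bernoulli weights for the purely quadratic activation σ(t) = t², for which the two laws share m_2 = 1 but differ in m_4:

```python
import numpy as np

rng = np.random.default_rng(8)
p = 4
a = np.array([0.5, -1.0, 2.0, 0.3])
b = np.array([1.0, 0.4, -0.7, 1.5])

def phi_ab_formula(m2, m3, m4, z2=1.0, z1=0.0, z0=0.0):
    """Equation (5) for sigma(t) = z2*t^2 + z1*t + z0 and i.i.d. zero-mean w."""
    return (z2**2 * (m2**2 * (2 * (a @ b) ** 2 + (a @ a) * (b @ b))
                     + (m4 - 3 * m2**2) * (a**2 @ b**2))
            + z1**2 * m2 * (a @ b)
            + z2 * z1 * m3 * (a**2 @ b + a @ b**2)
            + z2 * z0 * m2 * (a @ a + b @ b) + z0**2)

N = 1_000_000
results = []
for W, moments in [(rng.standard_normal((N, p)), (1, 0, 3)),            # Gaussian
                   (rng.choice([-1.0, 1.0], size=(N, p)), (1, 0, 1))]:  # Bernoulli
    mc = np.mean((W @ a) ** 2 * (W @ b) ** 2)   # sigma(t) = t^2
    results.append((mc, phi_ab_formula(*moments)))
    print(results[-1])

# same mean and variance for the entries, yet two different kernels: no universality
```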

4. Practical outcomes. We discuss in this section the outcomes of our main
results in terms of neural network applications. The technical discussions on
Theorem 1 and Proposition 1 will be made in the course of their respective
proofs in Section 5.

4.1. Simulation results. We first provide in this section a simulation
corroborating the findings of Theorem 3 and suggesting the validity of
Conjecture 1. To this end, we consider the task of classifying the popular
MNIST image database [LeCun, Cortes and Burges (1998)], composed of grayscale
handwritten digits of size 28 × 28, with a neural network composed of n = 512
units and standard Gaussian W. We represent here each image as a p = 784-size
vector; 1024 images of sevens and 1024 images of nines were extracted from the
database and were evenly
split in 512 training and test images, respectively. The database images were jointly
centered and scaled so as to fall close to the setting of Assumption 3 on X and X̂ (an
admissible preprocessing intervention). The columns of the output values Y and
Ŷ were taken as unidimensional (d = 1) with Y1j , Ŷ1j ∈ {−1, 1} depending on
the image class. Figure 1 displays the simulated (averaged over 100 realizations
of W ) versus theoretical values of Etrain and Etest for three choices of Lipschitz
continuous functions σ (·), as a function of γ .
Note that a perfect match between theory and practice is observed, for both
Etrain and Etest , which is a strong indicator of both the validity of Conjecture 1 and
the adequacy of Assumption 3 to the MNIST dataset.
We subsequently provide in Figure 2 the comparison between theoretical for-
mulas and practical simulations for a set of functions σ (·) which do not satisfy
Assumption 2, that is, either discontinuous or non-Lipschitz maps. The closeness
between both sets of curves is again remarkably good, although to a lesser extent
than for the Lipschitz continuous functions of Figure 1. Also, the achieved perfor-
mances are generally worse than those observed in Figure 1.
FIG. 1. Neural network performance for Lipschitz continuous σ(·), Wij ∼ N(0, 1), as a function of γ, for 2-class MNIST data (sevens, nines), n = 512, T = T̂ = 1024, p = 784.

It should be noted that the performance estimates provided by Theorem 3 and Conjecture 1 can be efficiently implemented at low computational cost in practice. Indeed, by diagonalizing Φ (a marginal cost independent of γ), Ētrain can be computed for all γ through mere vector operations; similarly, Ētest is obtained at the marginal cost of a basis change of Φ_X̂X and of the matrix product Φ_XX̂ Φ_X̂X, all remaining operations being accessible through vector operations. As a consequence, generating the aforementioned theoretical curves using the linked Python script was found to be 100 to 500 times faster than generating the simulated network performances. Beyond their theoretical interest, the provided formulas therefore allow for an efficient offline tuning of the network hyperparameters, notably the choice of an appropriate value for the ridge-regression parameter γ.
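As an illustration of this computational scheme, the short sketch below (our own helper names, with a randomly generated positive semi-definite matrix standing in for Φ) diagonalizes Φ once and then solves, for each γ, the fixed-point equation δ = (1/T) tr Φ((n/T)·Φ/(1+δ) + γI_T)⁻¹ by plain fixed-point iteration, from which the eigenvalues of Q̄ follow through mere vector operations:

```python
import numpy as np

def delta_fixed_point(eigs, n, gamma, tol=1e-10, max_iter=1000):
    """Solve delta = (1/T) tr Phi ((n/T) Phi/(1+delta) + gamma I)^{-1},
    given the eigenvalues `eigs` of Phi, by fixed-point iteration."""
    T = len(eigs)
    delta = 0.0
    for _ in range(max_iter):
        new = np.sum(eigs / (n / T * eigs / (1 + delta) + gamma)) / T
        if abs(new - delta) < tol:
            return new
        delta = new
    return delta

# Toy PSD matrix standing in for Phi (assumed data; not MNIST).
rng = np.random.default_rng(0)
T, n, gamma = 200, 300, 1e-2
A = rng.standard_normal((T, T))
Phi = A @ A.T / T
eigs, U = np.linalg.eigh(Phi)      # one diagonalization, reused for every gamma

for g in [1e-3, 1e-2, 1e-1]:
    d = delta_fixed_point(eigs, n, g)
    qbar_eigs = 1.0 / (n / T * eigs / (1 + d) + g)   # eigenvalues of Qbar
    # e.g., (1/T) tr Qbar, a quantity entering the asymptotic formulas
    print(g, d, qbar_eigs.mean())
```

Sweeping γ this way only costs vector operations per value, which is what makes offline tuning of γ cheap.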

4.2. The underlying kernel. Theorem 1 and the subsequent theoretical findings importantly reveal that the neural network performances are directly related to the Gram matrix Φ, which acts as a deterministic kernel on the dataset X. This is in fact a well-known result found, for example, in Williams (1998), where it is shown that, as n → ∞ alone, the neural network behaves as a mere kernel operator (this observation is retrieved here in the subsequent Section 4.3). This remark was then put to advantage in Rahimi and Recht (2007) and subsequent works, where random feature maps of the type x → σ(W x) are proposed as a computationally efficient proxy to evaluate kernels (x, y) → Φ(x, y).
FIG. 2. Neural network performance for σ(·) either discontinuous or non-Lipschitz, Wij ∼ N(0, 1), as a function of γ, for 2-class MNIST data (sevens, nines), n = 512, T = T̂ = 1024, p = 784.

As discussed previously, the formulas for Ētrain and Ētest suggest that good performances are achieved if the dominant eigenvectors of Φ show a good alignment to Y (and similarly for Φ_X̂X and Ŷ). This naturally drives us to finding a priori simple regression tasks where ill-choices of Φ may annihilate the neural network performance. Following recent works on the asymptotic performance analysis of kernel methods for Gaussian mixture models [Couillet and Benaych-Georges (2016), Liao and Couillet (2017), Mai and Couillet (2017)] and [Couillet and Kammoun (2016)], we describe here such a task.
Let x1, . . . , xT/2 ∼ N(0, (1/p)C1) and xT/2+1, . . . , xT ∼ N(0, (1/p)C2), where C1 and C2 are such that tr C1 = tr C2, ‖C1‖, ‖C2‖ are bounded, and tr(C1 − C2)² = O(p). Accordingly, y1, . . . , yT/2 = −1 and yT/2+1, . . . , yT = 1. It is proved in the aforementioned articles that, under these conditions, it is theoretically possible, in the large p, T limit, to classify the data using a kernel least-square support vector machine (i.e., with a training dataset) or with a kernel spectral clustering method (i.e., in a completely unsupervised manner) with a nontrivial limiting error probability (i.e., neither zero nor one). This scenario has the interesting feature that xiᵀxj → 0 almost surely for all i ≠ j, while ‖xi‖² − (1/p) tr(½C1 + ½C2) → 0 almost surely, irrespective of the class of xi, thereby allowing for a Taylor expansion of the nonlinear kernels as early proposed in El Karoui (2010).
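The concentration properties underlying this regime are easy to observe numerically; the following minimal sketch (arbitrary diagonal C1, C2 with tr C1 = tr C2, dimensions of our own choosing) checks that xiᵀxj is small while ‖xi‖² is close to (1/p) tr(½C1 + ½C2) for both classes:

```python
import numpy as np

rng = np.random.default_rng(1)
p = 4000
# diagonals of C1, C2 with equal traces, as in the text
c1 = np.concatenate([np.ones(p // 2), 4 * np.ones(p // 2)])
c2 = c1[::-1]
xi = rng.standard_normal(p) * np.sqrt(c1 / p)   # xi ~ N(0, C1/p)
xj = rng.standard_normal(p) * np.sqrt(c2 / p)   # xj ~ N(0, C2/p)

print(xi @ xj)                 # close to 0
print(xi @ xi, c1.sum() / p)   # both close to (1/p) tr((C1+C2)/2) = 2.5
```

The vanishing cross products are what justifies the entry-wise Taylor expansion of the kernel discussed next.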
Transposed to our present setting, the aforementioned Taylor expansion allows for a consistent approximation Φ̃ of Φ by an information-plus-noise (spiked) random matrix model [see, e.g., Benaych-Georges and Nadakuditi (2012), Loubaton and Vallet (2010)]. In the present Gaussian mixture context, it is shown in Couillet and Benaych-Georges (2016) that data classification is (asymptotically at least) only possible if Φ̃ij explicitly contains the quadratic term (xiᵀxj)² [or combinations of (xi²)ᵀxj, (xj²)ᵀxi, and (xi²)ᵀ(xj²)]. In particular, letting a, b ∼ N(0, Ci) with i = 1, 2, it is easily seen from Table 1 that only max(t, 0), |t|, and cos(t) can realize the task. Indeed, we have the following Taylor expansions around x = 0:
  asin(x) = x + O(x³),
  sinh(x) = x + O(x³),
  acos(x) = π/2 − x + O(x³),
  cosh(x) = 1 + x²/2 + O(x³),
  x acos(−x) + √(1 − x²) = 1 + πx/2 + x²/2 + O(x³),
  x asin(x) + √(1 − x²) = 1 + x²/2 + O(x³),

where only the last three functions [only found in the expression of Φab corresponding to σ(t) = max(t, 0), |t|, or cos(t)] exhibit a quadratic term.
More surprisingly maybe, recalling now equation (5), which considers non-necessarily Gaussian Wij with moments mk of order k, a more refined analysis shows that the aforementioned Gaussian mixture classification task will fail if m3 = 0 and m4 = m2², so, for instance, for Wij ∈ {−1, 1} Bernoulli with parameter ½.
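As a quick sanity check of (5), the snippet below (our own helper, not part of the paper's linked script) implements Φab for the quadratic σ and compares it with a Monte Carlo estimate in the Gaussian case (m2, m3, m4) = (1, 0, 3); substituting (1, 0, 1), as for Bernoulli ±1 weights, makes the coefficient m4 − m2² of the informative combination vanish:

```python
import numpy as np

def phi_ab(a, b, zeta, m):
    # Equation (5): Phi_ab for sigma(t) = z2 t^2 + z1 t + z0 and
    # i.i.d. zero-mean entries of w with moments m = (m2, m3, m4).
    z2, z1, z0 = zeta
    m2, m3, m4 = m
    a2, b2 = a * a, b * b
    return (z2**2 * (m2**2 * (2 * (a @ b)**2 + (a @ a) * (b @ b))
                     + (m4 - 3 * m2**2) * (a2 @ b2))
            + z1**2 * m2 * (a @ b)
            + z2 * z1 * m3 * (a2 @ b + a @ b2)
            + z2 * z0 * m2 * (a @ a + b @ b) + z0**2)

rng = np.random.default_rng(2)
a, b = rng.standard_normal(5), rng.standard_normal(5)
zeta = (-0.5, 0.0, 1.0)            # sigma(t) = -t^2/2 + 1, as used below

# Monte Carlo check for Gaussian w, whose moments are (1, 0, 3)
W = rng.standard_normal((500000, 5))
sig = lambda t: zeta[0] * t**2 + zeta[1] * t + zeta[2]
mc = np.mean(sig(W @ a) * sig(W @ b))
print(mc, phi_ab(a, b, zeta, (1.0, 0.0, 3.0)))   # close to each other
```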
The performance comparison of this scenario is shown in the top part of Figure 3 for σ(t) = −½t² + 1 and C1 = diag(Ip/2, 4Ip/2), C2 = diag(4Ip/2, Ip/2), for Wij ∼ N(0, 1) and Wij ∼ Bern [i.e., Bernoulli {(−1, ½), (1, ½)}]. The choice of σ(t) = ζ2t² + ζ1t + ζ0 with ζ1 = 0 is motivated by Couillet and Benaych-Georges (2016), Couillet and Kammoun (2016), where it is shown, in a somewhat different setting, that this choice is optimal for class recovery. Note that, while the test performances are overall rather weak in this setting, for Wij ∼ N(0, 1), Etest drops below one (the amplitude of the Ŷij), thereby indicating that nontrivial classification is performed. This is not so for the Bernoulli Wij ∼ Bern case, where Etest is systematically greater than |Ŷij| = 1. This is theoretically explained by the fact that, from equation (5), Φij contains structural information about the data classes through the term 2m2²(xiᵀxj)² + (m4 − 3m2²)(xi²)ᵀ(xj²), which induces an information-plus-noise model for Φ as long as 2m2² + (m4 − 3m2²) ≠ 0, that is, m4 ≠ m2² [see Couillet and Benaych-Georges (2016) for details]. This is visually seen in the bottom part of Figure 3, where the Gaussian scenario presents an isolated eigenvalue for Φ with corresponding structured eigenvector, which is not the case of the Bernoulli scenario. To complete this discussion, it appears relevant in the present setting to choose Wij in such a way that m4 − m2² is far from zero, thus suggesting the interest of heavy-tailed distributions. To confirm this prediction, Figure 3 additionally displays the performance achieved and the spectrum of
Φ observed for Wij ∼ Stud, that is, following a Student-t distribution with ν = 7 degrees of freedom, normalized to unit variance (in this case m2 = 1 and m4 = 5). Figure 3 confirms the large superiority of this choice over the Gaussian case (note nonetheless the slight inaccuracy of our theoretical formulas in this case, which is likely due to too small values of p, n, T to accommodate Wij with higher order moments, an observation which is confirmed in simulations when letting ν be even smaller).

FIG. 3. (Top) Neural network performance for σ(t) = −½t² + 1, with different Wij, for a 2-class Gaussian mixture model (see details in text), n = 512, T = T̂ = 1024, p = 256. (Bottom) Spectra and second eigenvector of Φ for different Wij (first eigenvalues are of order n and not shown; associated eigenvectors are provably noninformative).

4.3. Limiting cases. We have suggested that Φ contains, in its dominant eigenmodes, all the usable information describing X. In the Gaussian mixture example above, it was notably shown that Φ may completely fail to contain this information, resulting in the impossibility to perform a classification task, even if one were to take infinitely many neurons in the network. For Φ containing useful information about X, it is intuitive to expect that both inf_γ Ētrain and inf_γ Ētest become smaller as n/T and n/p become large. It is in fact easy to see that, if Φ is invertible (which is likely to occur in most cases if lim inf_n T/p > 1), then

  lim_{n→∞} Ētrain = 0,
  lim_{n→∞} | Ētest − (1/T̂) ‖ Ŷᵀ − Φ_X̂X Φ⁻¹ Yᵀ ‖²_F | = 0
and we fall back on the performance of a classical kernel regression. It is interesting in particular to note that, as the number of neurons n becomes large, the effect of γ on Etest flattens out. Therefore, a smart choice of γ is only relevant for small (and thus computationally more efficient) neuron layers. This observation is depicted in Figure 4, where it is made clear that a growth of n reduces Etrain to zero while Etest saturates to a nonzero limit which becomes increasingly insensitive to γ. Note additionally the interesting phenomenon occurring for n ≤ T, where too small values of γ induce important performance losses, thereby suggesting the strong importance of a proper choice of γ in this regime.
Of course, practical interest lies precisely in situations where n is not too large. We may thus subsequently assume that lim sup_n n/T < 1. In this case, as suggested by Figures 1–2, the mean-square error performances achieved as γ → 0 may predict the superiority of specific choices of σ(·) for optimally chosen γ. It is important for this study to differentiate between cases where r ≡ rank(Φ) is smaller or greater than n. Indeed, observe that, with the spectral decomposition Φ = U_r Λ_r U_rᵀ for Λ_r ∈ R^(r×r) diagonal and U_r ∈ R^(T×r),

  δ = (1/T) tr Φ( (n/T)·Φ/(1 + δ) + γI_T )⁻¹ = (1/T) tr Λ_r( (n/T)·Λ_r/(1 + δ) + γI_r )⁻¹,

which satisfies, as γ → 0,

  δ → r/(n − r),  if r < n,
  γδ → Δ ≡ (1/T) tr Φ( (n/T)·Φ/Δ + I_T )⁻¹,  if r ≥ n.

A phase transition therefore exists whereby δ assumes a finite positive value in the small γ limit if r/n < 1, or scales like 1/γ otherwise.
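This phase transition can be observed numerically; in the small sketch below (our own toy spectrum), Φ has rank r < n and the fixed-point solution δ(γ) indeed approaches r/(n − r) as γ → 0:

```python
import numpy as np

def solve_delta(eigs, n, gamma, iters=2000):
    # Fixed point: delta = (1/T) sum_i lam_i / ((n/T) lam_i/(1+delta) + gamma)
    T, delta = len(eigs), 0.0
    for _ in range(iters):
        delta = np.sum(eigs / (n / T * eigs / (1 + delta) + gamma)) / T
    return delta

T, n, r = 400, 300, 100                                 # rank r < n
eigs = np.concatenate([np.ones(r), np.zeros(T - r)])    # spectrum of a rank-r Phi
for gamma in [1e-2, 1e-4, 1e-6]:
    print(gamma, solve_delta(eigs, n, gamma))   # tends to r/(n-r) = 0.5
```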
FIG. 4. Neural network performance for growing n (256, 512, 1024, 2048, 4096) as a function of γ, σ(t) = max(t, 0); 2-class MNIST data (sevens, nines), T = T̂ = 1024, p = 784. Limiting (n = ∞) Ētest shown in thick black line.
As a consequence, if r < n, as γ → 0, (n/T)/(1 + δ) → (n/T)(1 − r/n) = (n − r)/T and Q̄ ∼ (T/(n − r)) U_r Λ_r⁻¹ U_rᵀ + (1/γ) V_r V_rᵀ, where V_r ∈ R^(T×(T−r)) is any matrix such that [U_r V_r] is orthogonal; in particular, γQ̄ → V_r V_rᵀ and γ²Q̄² → V_r V_rᵀ, and thus Ētrain → (1/T) tr Y V_r V_rᵀ Yᵀ = (1/T)‖Y V_r‖²_F, which states that the residual training error corresponds to the energy of Y not captured by the space spanned by Φ. Since Etrain is an increasing function of γ, so is Ētrain (at least for all large n), and thus (1/T)‖Y V_r‖²_F corresponds to the lowest achievable asymptotic training error.
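The interpretation of (1/T)‖Y V_r‖²_F as the energy of Y not captured by the features can be checked on the finite-dimensional ridge regression itself, for which the training residual equals γQYᵀ: as γ → 0 it converges to the projection residual of Yᵀ onto the row space of Σ. A minimal sketch under assumed toy dimensions:

```python
import numpy as np

rng = np.random.default_rng(3)
n, T = 30, 100                    # fewer features than samples, so rank(Sigma) = 30 < T
Sigma = np.maximum(rng.standard_normal((n, T)), 0)   # e.g. ReLU features
Y = rng.standard_normal((1, T))

gamma = 1e-9
Q = np.linalg.inv(Sigma.T @ Sigma / T + gamma * np.eye(T))
# ridge training error: residuals equal gamma * Q @ Y^T
E_train = gamma**2 / T * np.linalg.norm(Q @ Y.T)**2

# residual energy of Y outside the row space of Sigma
P = Sigma.T @ np.linalg.pinv(Sigma.T)   # projector onto the row space of Sigma
resid = np.linalg.norm(Y.T - P @ Y.T)**2 / T
print(E_train, resid)   # nearly equal for small gamma
```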
If instead r > n (which is the most likely outcome in practice), as γ → 0, Q̄ ∼ (1/γ)( (n/T)·Φ/Δ + I_T )⁻¹ and thus

  Ētrain −→(γ→0) (1/T) tr Y Q̄_Δ [ ( (1/n) tr Ψ̄Q̄_Δ² / ( 1 − (1/n) tr(Ψ̄Q̄_Δ)² ) ) Ψ̄ + I_T ] Q̄_Δ Yᵀ,

where Ψ̄ = (n/T)·Φ/Δ and Q̄_Δ = ( Ψ̄ + I_T )⁻¹.
These results suggest that neural networks should be designed in a way that both reduces the rank of Φ and maintains a strong alignment between the dominant eigenvectors of Φ and the output matrix Y.
Interestingly, if X is assumed, as above, to be extracted from a Gaussian mixture and Y ∈ R^(1×T) is a classification vector with Y1j ∈ {−1, 1}, then the tools proposed in Couillet and Benaych-Georges (2016) (related to spiked random matrix analysis) allow for an explicit evaluation of the aforementioned limits as n, p, T grow large. This analysis is however cumbersome and outside the scope of the present work.

5. Proof of the main results. In the remainder, we shall use extensively the following notation:

  Σ = σ(W X) = [σ1, . . . , σn]ᵀ,  W = [w1, . . . , wn]ᵀ,

that is, σi = σ(wiᵀX)ᵀ. Also, we shall define Σ−i ∈ R^((n−1)×T) the matrix Σ with ith row removed, and correspondingly

  Q−i = ( (1/T)ΣᵀΣ − (1/T)σiσiᵀ + γI_T )⁻¹.

Finally, because of exchangeability, it shall often be convenient to work with the generic random vector w ∼ Nϕ(0, Ip), the random vector σ distributed as any of the σi's, the random matrix Σ− distributed as any of the Σ−i's, and with the random matrix Q− distributed as any of the Q−i's.

5.1. Concentration results on Σ. Our first results provide concentration of measure properties on functionals of Σ. These results unfold from the following concentration inequality for Lipschitz applications of a Gaussian vector; see, for example, Ledoux (2005), Corollary 2.6, Propositions 1.3, 1.8, or Tao (2012), Theorem 2.1.12. For d ∈ N, consider μ the canonical Gaussian probability measure on R^d, defined through its density dμ(w) = (2π)^(−d/2) e^(−‖w‖²/2) dw, and f : R^d → R a λf-Lipschitz function. Then we have the said normal concentration

(6)  μ( |f − ∫ f dμ| ≥ t ) ≤ C e^(−c t²/λf²),

where C, c > 0 are independent of d and λf. As a corollary [see, e.g., Ledoux (2005)], for every k ≥ 1,

  E[ |f − ∫ f dμ|^k ] ≤ ( Cλf/√c )^k.
The main approach to the proof of our results, starting with that of the key Lemma 1, is as follows: since Wij = ϕ(W̃ij) with W̃ij ∼ N(0, 1) and ϕ Lipschitz, the normal concentration of W̃ transfers to W, which further induces a normal concentration of the random vector σ and the matrix Σ, thereby implying that Lipschitz functionals of σ or Σ also concentrate. As pointed out earlier, these concentration results are used in place of the independence assumptions (and their multiple consequences on the convergence of random variables) classically exploited in random matrix theory.

Notation: In all subsequent lemmas and proofs, the letters c, ci, C, Ci > 0 will be used interchangeably as positive constants independent of the key equation parameters (notably n and t below) and may be reused from line to line. Additionally, the variable ε > 0 will denote any small positive number; the variables c, ci, C, Ci may depend on ε.
We start by recalling the first part of the statement of Lemma 1 and subsequently
providing its proof.

LEMMA 2 (Concentration of quadratic forms). Let Assumptions 1–2 hold. Let also A ∈ R^(T×T) such that ‖A‖ ≤ 1 and, for X ∈ R^(p×T) and w ∼ Nϕ(0, Ip), define the random vector σ ≡ σ(wᵀX)ᵀ ∈ R^T. Then

  P( |(1/T)σᵀAσ − (1/T) tr ΦA| > t ) ≤ C exp( −(cT/(‖X‖²λϕ²λσ²)) min(t²/t0², t) )

for t0 ≡ |σ(0)| + λϕλσ‖X‖√(p/T) and C, c > 0 independent of all other parameters.

PROOF. The layout of the proof is as follows: since the application w → (1/T)σᵀAσ is "quadratic" in w, and thus not Lipschitz (therefore not allowing for a natural transfer of the concentration of w to (1/T)σᵀAσ), we first prove that (1/√T)‖σ‖ satisfies a concentration inequality, which provides a high probability O(1) bound on (1/√T)‖σ‖. Conditioning on this event, the map w → (1/T)σᵀAσ can then be shown to be Lipschitz (by isolating one of the σ terms for bounding and the other one for retrieving the Lipschitz character) and, up to an appropriate control of concentration results under conditioning, the result is obtained.
Following this plan, we first provide a concentration inequality for ‖σ‖. To this end, note that the application ψ : R^p → R^T, w̃ → σ(ϕ(w̃)ᵀX)ᵀ is Lipschitz with parameter λϕλσ‖X‖, as the combination of the λϕ-Lipschitz function ϕ : w̃ → w, the ‖X‖-Lipschitz map R^p → R^T, w → Xᵀw, and the λσ-Lipschitz map R^T → R^T, Y → σ(Y). As a Gaussian vector, w̃ has a normal concentration and so does ψ(w̃). Since the Euclidean norm R^T → R, Y → ‖Y‖ is 1-Lipschitz, we thus have immediately by (6)

  P( | (1/√T)‖σ(wᵀX)‖ − E[(1/√T)‖σ(wᵀX)‖] | ≥ t ) ≤ C e^(−cT t²/(‖X‖²λσ²λϕ²))

for some c, C > 0 independent of all parameters.
Finally, using again the Lipschitz character of σ(wᵀX),

  ‖σ(wᵀX)‖ − ‖σ(0)1_Tᵀ‖ ≤ ‖σ(wᵀX) − σ(0)1_Tᵀ‖ ≤ λσ‖w‖·‖X‖
so that, by Jensen's inequality,

  E[ (1/√T)‖σ(wᵀX)‖ ] ≤ |σ(0)| + λσ E[ (1/√T)‖w‖ ] ‖X‖ ≤ |σ(0)| + λσ √( (1/T)E[‖w‖²] ) ‖X‖

with E[‖ϕ(w̃)‖²] ≤ λϕ² E[‖w̃‖²] = pλϕ² [since w̃ ∼ N(0, Ip)]. Letting t0 ≡ |σ(0)| + λσλϕ‖X‖√(p/T), we then find

  P( (1/√T)‖σ(wᵀX)‖ ≥ t + t0 ) ≤ C e^(−cT t²/(λϕ²λσ²‖X‖²)),

which, with the remark t ≥ 4t0 ⇒ (t − t0)² ≥ t²/2, may be equivalently stated as

(7)  ∀t ≥ 4t0,  P( (1/√T)‖σ(wᵀX)‖ ≥ t ) ≤ C e^(−cT t²/(2λϕ²λσ²‖X‖²)).
As a side (but important) remark, note that, since

  P( ‖Σ/√T‖_F ≥ t√T ) = P( ∑_{i=1}^n ‖σi/√T‖² ≥ t²T )
    ≤ P( max_{1≤i≤n} ‖σi/√T‖ ≥ t√(T/n) )
    ≤ n P( ‖σ/√T‖ ≥ t√(T/n) ),

the result above implies that

  ∀t ≥ 4t0,  P( ‖Σ/√T‖_F ≥ t√T ) ≤ C n e^(−cT²t²/(2nλϕ²λσ²‖X‖²))

and thus, since ‖·‖ ≤ ‖·‖_F, we have

  ∀t ≥ 4t0,  P( ‖Σ/√T‖ ≥ t√T ) ≤ C n e^(−cT²t²/(2nλϕ²λσ²‖X‖²)).

Thus, in particular, under the additional Assumption 3, with high probability, the operator norm of Σ/√T cannot exceed the rate √T.
REMARK 1 (Loss of control of the structure of Σ). The aforementioned control of ‖Σ‖ arises from the bound ‖Σ‖ ≤ ‖Σ‖_F, which may be quite loose (by as much as a factor √T). Intuitively, under the supplementary Assumption 3, if E[σ] ≠ 0, then Σ/√T is "dominated" by the matrix (1/√T)1_n E[σ]ᵀ, the operator norm of which is indeed of order √n, and the bound is tight. If σ(t) = t and E[Wij] = 0, we however know that ‖Σ/√T‖ = O(1) [Bai and Silverstein (1998)]. One is tempted to believe that, more generally, if E[σ] = 0, then ‖Σ/√T‖ should remain of this order. And, if instead E[σ] ≠ 0, the contribution of (1/√T)1_n E[σ]ᵀ should merely engender a single isolated singular value of large amplitude in the spectrum of Σ/√T, the other singular values remaining of order O(1). These intuitions are not captured by our concentration of measure approach.
Since Σ = σ(W X) is obtained from an entry-wise operation, concentration results with respect to the Frobenius norm are natural, whereas concentration results with respect to the operator norm are hardly accessible.
Back to our present considerations, let us define the set A_K = {w, ‖σ(wᵀX)‖ ≤ K√T}. Conditioning the random variable of interest in Lemma 2 with respect to A_K and its complementary A_K^c, for some K ≥ 4t0, gives

  P( |(1/T)σ(wᵀX)Aσ(wᵀX)ᵀ − (1/T) tr ΦA| > t )
    ≤ P( |(1/T)σ(wᵀX)Aσ(wᵀX)ᵀ − (1/T) tr ΦA| > t, A_K ) + P(A_K^c).
We can already bound P(A_K^c) thanks to (7). As for the first right-hand side term, note that, on the set {σ(wᵀX), w ∈ A_K}, the function f : R^T → R, σ → σᵀAσ is 2K√T-Lipschitz. This is because, for all σ, σ + h ∈ {σ(wᵀX), w ∈ A_K},

  |f(σ + h) − f(σ)| = |hᵀAσ + (σ + h)ᵀAh| ≤ 2K√T‖h‖.
Since conditioning does not allow for a straightforward application of (6), we consider instead f̃, a 2K√T-Lipschitz continuation to R^T of f_{A_K}, the restriction of f to A_K, such that all the radial derivatives of f̃ are constant on the set {σ, ‖σ‖ ≥ K√T}. We may thus now apply (6) and our previous results to obtain

  P( |f̃(σ(wᵀX)) − E[f̃(σ(wᵀX))]| ≥ KT t ) ≤ C e^(−cT t²/(‖X‖²λσ²λϕ²)).

Therefore,

  P( |f(σ(wᵀX)) − E[f̃(σ(wᵀX))]| ≥ KT t, A_K )
    = P( |f̃(σ(wᵀX)) − E[f̃(σ(wᵀX))]| ≥ KT t, A_K )
    ≤ P( |f̃(σ(wᵀX)) − E[f̃(σ(wᵀX))]| ≥ KT t ) ≤ C e^(−cT t²/(‖X‖²λσ²λϕ²)).
Our next step is then to bound the difference Δ ≡ |E[f̃(σ(wᵀX))] − E[f(σ(wᵀX))]|. Since f and f̃ are equal on {σ, ‖σ‖ ≤ K√T},

  Δ ≤ ∫_{‖σ‖≥K√T} ( |f(σ)| + |f̃(σ)| ) dμ_σ(σ),

where μ_σ is the law of σ(wᵀX). Since ‖A‖ ≤ 1, for ‖σ‖ ≥ K√T, max(|f(σ)|, |f̃(σ)|) ≤ ‖σ‖², and thus

  Δ ≤ 2 ∫_{‖σ‖≥K√T} ‖σ‖² dμ_σ = 2 ∫_{‖σ‖≥K√T} ∫_{t=0}^∞ 1_{‖σ‖²≥t} dt dμ_σ
    = 2 ∫_{t=0}^∞ P( {‖σ‖² ≥ t}, A_K^c ) dt
    ≤ 2 ∫_{t=0}^{K²T} P(A_K^c) dt + 2 ∫_{t=K²T}^∞ P( ‖σ(wᵀX)‖² ≥ t ) dt
    ≤ 2 P(A_K^c) K²T + 2 ∫_{t=K²T}^∞ C e^(−ct/(2λϕ²λσ²‖X‖²)) dt
    ≤ 2CTK² e^(−cTK²/(2λϕ²λσ²‖X‖²)) + (2Cλϕ²λσ²‖X‖²/c) e^(−cTK²/(2λϕ²λσ²‖X‖²))
    ≤ (6C/c) λϕ²λσ²‖X‖²,

where in the last inequality we used the fact that, for x ∈ R, xe^(−x) ≤ e^(−1) ≤ 1, and K ≥ 4t0 ≥ 4λσλϕ‖X‖√(p/T). As a consequence,
  P( |f(σ(wᵀX)) − E[f(σ(wᵀX))]| ≥ KT t + Δ, A_K ) ≤ C e^(−cT t²/(‖X‖²λϕ²λσ²)),

so that, with the same remark as before, for t ≥ 4Δ/(KT),

  P( |f(σ(wᵀX)) − E[f(σ(wᵀX))]| ≥ KT t, A_K ) ≤ C e^(−cT t²/(2‖X‖²λϕ²λσ²)).

To avoid the condition t ≥ 4Δ/(KT), we use the fact that, probabilities being lower than one, it suffices to replace C by λC with λ ≥ 1 such that

  λC e^(−cT t²/(2‖X‖²λϕ²λσ²)) ≥ 1  for t ≤ 4Δ/(KT).

The above inequality holds if we take for instance λ = (1/C) e^(18C²/c), since then t ≤ 4Δ/(KT) ≤ 24Cλϕ²λσ²‖X‖²/(cKT) ≤ 6Cλϕλσ‖X‖/(c√(pT)) [using successively Δ ≤ (6C/c)λϕ²λσ²‖X‖² and K ≥ 4λσλϕ‖X‖√(p/T)], and thus

  λC e^(−cT t²/(2‖X‖²λϕ²λσ²)) ≥ λC e^(−18C²/(cp)) ≥ λC e^(−18C²/c) ≥ 1.
Therefore, setting λ = max(1, (1/C)e^(18C²/c)), we get for every t > 0

  P( |f(σ(wᵀX)) − E[f(σ(wᵀX))]| ≥ KT t, A_K ) ≤ λC e^(−cT t²/(2‖X‖²λϕ²λσ²)),

which, together with the inequality P(A_K^c) ≤ C e^(−cTK²/(2λϕ²λσ²‖X‖²)), gives

  P( |f(σ(wᵀX)) − E[f(σ(wᵀX))]| ≥ KT t )
    ≤ λC e^(−cT t²/(2‖X‖²λϕ²λσ²)) + C e^(−cTK²/(2λϕ²λσ²‖X‖²)).

We then conclude

  P( |(1/T)σ(wᵀX)Aσ(wᵀX)ᵀ − (1/T) tr ΦA| ≥ t ) ≤ (λ + 1)C e^(−(cT/(2‖X‖²λϕ²λσ²)) min(t²/K², K²))

and, with K = max(4t0, √t),

  P( |(1/T)σ(wᵀX)Aσ(wᵀX)ᵀ − (1/T) tr ΦA| ≥ t ) ≤ (λ + 1)C e^(−(cT/(2‖X‖²λϕ²λσ²)) min(t²/(16t0²), t)).

Indeed, if 4t0 ≤ √t, then min(t²/K², K²) = t, while if 4t0 ≥ √t, then min(t²/K², K²) = min(t²/(16t0²), 16t0²) = t²/(16t0²). □
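Empirically, the concentration stated in Lemma 2 is easy to visualize: with A = I_T and a ReLU activation (toy dimensions of our own choosing), the fluctuations of (1/T)σᵀAσ across draws of w shrink as T grows with p/T fixed:

```python
import numpy as np

rng = np.random.default_rng(4)

def fluctuation(T, reps=200):
    p = T // 2                                     # p/T fixed, Assumption-3-type scaling
    X = rng.standard_normal((p, T)) / np.sqrt(p)   # bounded-norm data matrix
    vals = []
    for _ in range(reps):
        w = rng.standard_normal(p)
        sigma = np.maximum(w @ X, 0)               # sigma(w^T X) with ReLU
        vals.append(sigma @ sigma / T)             # (1/T) sigma^T A sigma, A = I_T
    return np.std(vals)

print(fluctuation(100), fluctuation(400))   # the second value is smaller
```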

As a corollary of Lemma 2, we have the following control of the moments of (1/T)σᵀAσ.

COROLLARY 1 (Moments of quadratic forms). Let Assumptions 1–2 hold. For w ∼ Nϕ(0, Ip), σ ≡ σ(wᵀX)ᵀ ∈ R^T, A ∈ R^(T×T) such that ‖A‖ ≤ 1, and k ∈ N,

  E[ |(1/T)σᵀAσ − (1/T) tr ΦA|^k ] ≤ C1 ( t0η/√T )^k + C2 ( η²/T )^k

with t0 = |σ(0)| + λσλϕ‖X‖√(p/T), η = ‖X‖λσλϕ, and C1, C2 > 0 independent of the other parameters. In particular, under the additional Assumption 3,

  E[ |(1/T)σᵀAσ − (1/T) tr ΦA|^k ] ≤ C/n^(k/2).
PROOF. We use the fact that, for a nonnegative random variable Y, E[Y] = ∫_0^∞ P(Y > t) dt, so that

  E[ |(1/T)σᵀAσ − (1/T) tr ΦA|^k ] = ∫_0^∞ P( |(1/T)σᵀAσ − (1/T) tr ΦA|^k > u ) du
    = ∫_0^∞ k v^(k−1) P( |(1/T)σᵀAσ − (1/T) tr ΦA| > v ) dv
    ≤ ∫_0^∞ k v^(k−1) C e^(−(cT/η²) min(v²/t0², v)) dv
    ≤ ∫_0^{t0} k v^(k−1) C e^(−cT v²/(t0²η²)) dv + ∫_{t0}^∞ k v^(k−1) C e^(−cT v/η²) dv
    ≤ ∫_0^∞ k v^(k−1) C e^(−cT v²/(t0²η²)) dv + ∫_0^∞ k v^(k−1) C e^(−cT v/η²) dv
    = ( t0η/√(cT) )^k ∫_0^∞ k t^(k−1) C e^(−t²) dt + ( η²/(cT) )^k ∫_0^∞ k t^(k−1) C e^(−t) dt,

which, along with the boundedness of the integrals, concludes the proof. □

Beyond concentration results on functions of the vector σ, we also have the following convenient property for functions of the matrix Σ.

LEMMA 3 (Lipschitz functions of Σ). Let f : R^(n×T) → R be a λf-Lipschitz function with respect to the Frobenius norm. Then, under Assumptions 1–2,

  P( |f(Σ/√T) − E[f(Σ/√T)]| > t ) ≤ C e^(−cT t²/(λσ²λϕ²λf²‖X‖²))

for some C, c > 0. In particular, under the additional Assumption 3,

  P( |f(Σ/√T) − E[f(Σ/√T)]| > t ) ≤ C e^(−cT t²).

PROOF. Denoting W = ϕ(W̃), since vec(W̃) ≡ [W̃11, . . . , W̃np]ᵀ is a Gaussian vector, by the normal concentration of Gaussian vectors, for g a λg-Lipschitz function of W with respect to the Frobenius norm [i.e., the Euclidean norm of vec(W)], by (6),

  P( |g(W) − E[g(W)]| > t ) = P( |g(ϕ(W̃)) − E[g(ϕ(W̃))]| > t ) ≤ C e^(−ct²/(λg²λϕ²))

for some C, c > 0. Let us consider in particular g : W → f(Σ/√T) and remark that

  |g(W + H) − g(W)| = | f( σ((W + H)X)/√T ) − f( σ(W X)/√T ) |
    ≤ (λf/√T) ‖σ((W + H)X) − σ(W X)‖_F
    ≤ (λfλσ/√T) ‖H X‖_F
    = (λfλσ/√T) √( tr H X Xᵀ Hᵀ )
    ≤ (λfλσ/√T) ‖X‖ ‖H‖_F,

concluding the proof. □

A first corollary of Lemma 3 is the concentration of the Stieltjes transform (1/T) tr((1/T)ΣᵀΣ − zI_T)⁻¹ of μn, the empirical spectral measure of (1/T)ΣᵀΣ, for all z ∈ C \ R⁺ (so in particular, for z = −γ, γ > 0).

COROLLARY 2 (Concentration of the Stieltjes transform of μn). Under Assumptions 1–2, for z ∈ C \ R⁺,

  P( | (1/T) tr((1/T)ΣᵀΣ − zI_T)⁻¹ − E[(1/T) tr((1/T)ΣᵀΣ − zI_T)⁻¹] | > t )
    ≤ C e^(−c dist(z, R⁺)³ T t²/(λσ²λϕ²‖X‖²))

for some C, c > 0, where dist(z, R⁺) is the Hausdorff set distance. In particular, for z = −γ, γ > 0, and under the additional Assumption 3,

  P( |(1/T) tr Q − (1/T) tr E[Q]| > t ) ≤ C e^(−cnt²).

PROOF. We can apply Lemma 3 to f : R → (1/T) tr(RᵀR − zI_T)⁻¹, since we have

  |f(R + H) − f(R)|
    = (1/T) | tr[ ((R + H)ᵀ(R + H) − zI_T)⁻¹ ( (R + H)ᵀH + HᵀR ) (RᵀR − zI_T)⁻¹ ] |
    ≤ (1/T) | tr[ ((R + H)ᵀ(R + H) − zI_T)⁻¹ (R + H)ᵀ H (RᵀR − zI_T)⁻¹ ] |
      + (1/T) | tr[ ((R + H)ᵀ(R + H) − zI_T)⁻¹ Hᵀ R (RᵀR − zI_T)⁻¹ ] |
    ≤ 2‖H‖ / dist(z, R⁺)^(3/2) ≤ 2‖H‖_F / dist(z, R⁺)^(3/2),

where, for the second to last inequality, we successively used the relations |tr AB| ≤ √(tr AAᵀ) √(tr BBᵀ), |tr CD| ≤ ‖D‖ tr C for nonnegative definite C, and ‖(RᵀR − zI_T)⁻¹‖ ≤ dist(z, R⁺)⁻¹, ‖(RᵀR − zI_T)⁻¹RᵀR‖ ≤ 1, ‖(RᵀR − zI_T)⁻¹Rᵀ‖ = ‖(RᵀR − zI_T)⁻¹RᵀR(RᵀR − zI_T)⁻¹‖^(1/2) ≤ ‖(RᵀR − zI_T)⁻¹RᵀR‖^(1/2) ‖(RᵀR − zI_T)⁻¹‖^(1/2) ≤ dist(z, R⁺)^(−1/2), for z ∈ C \ R⁺, and finally ‖·‖ ≤ ‖·‖_F. □
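This concentration is also easily observed in simulation: across independent draws of W (ReLU activation, toy dimensions of our own choosing), (1/T) tr Q fluctuates very little around its mean even at moderate n, p, T:

```python
import numpy as np

rng = np.random.default_rng(5)
p, n, T, gamma = 100, 300, 200, 0.1
X = rng.standard_normal((p, T)) / np.sqrt(p)

traces = []
for _ in range(50):
    W = rng.standard_normal((n, p))
    S = np.maximum(W @ X, 0)                               # Sigma = sigma(WX), ReLU
    Q = np.linalg.inv(S.T @ S / T + gamma * np.eye(T))
    traces.append(np.trace(Q) / T)

print(np.mean(traces), np.std(traces))   # std tiny relative to the mean
```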


Lemma 3 also allows for an important application of Lemma 2 as follows.

LEMMA 4 (Concentration of (1/T)σᵀQ−σ). Let Assumptions 1–3 hold and write Wᵀ = [w1, . . . , wn]. Define σ ≡ σ(w1ᵀX)ᵀ ∈ R^T and, for W−ᵀ = [w2, . . . , wn] and Σ− = σ(W−X), let Q− = ((1/T)Σ−ᵀΣ− + γI_T)⁻¹. Then, for A, B ∈ R^(T×T) such that ‖A‖, ‖B‖ ≤ 1,

  P( |(1/T)σᵀAQ−Bσ − (1/T) tr ΦAE[Q−]B| > t ) ≤ C e^(−cn min(t²,t))

for some C, c > 0 independent of the other parameters.

PROOF. Let f : R → (1/T)σᵀA(RᵀR + γI_T)⁻¹Bσ. Reproducing the proof of Corollary 2, conditionally on (1/T)‖σ‖² ≤ K for any arbitrary large enough K > 0, it appears that f is Lipschitz with parameter of order O(1). Along with (7) and Assumption 3, this thus ensures that

  P( |(1/T)σᵀAQ−Bσ − (1/T)σᵀAE[Q−]Bσ| > t )
    ≤ P( |(1/T)σᵀAQ−Bσ − (1/T)σᵀAE[Q−]Bσ| > t, (1/T)‖σ‖² ≤ K )
      + P( (1/T)‖σ‖² > K ) ≤ C e^(−cnt²)

for some C, c > 0. We may then apply Lemma 1 to the bounded norm matrix AE[Q−]B to further find that

  P( |(1/T)σᵀAQ−Bσ − (1/T) tr ΦAE[Q−]B| > t )
    ≤ P( |(1/T)σᵀAQ−Bσ − (1/T)σᵀAE[Q−]Bσ| > t/2 )
      + P( |(1/T)σᵀAE[Q−]Bσ − (1/T) tr ΦAE[Q−]B| > t/2 )
    ≤ C′ e^(−c′n min(t²,t)),

which concludes the proof. □

As a further corollary of Lemma 3, we have the following concentration result on the training mean-square error of the neural network under study.

COROLLARY 3 (Concentration of the mean-square error). Under Assumptions 1–3,

  P( |(1/T) tr YᵀY Q² − (1/T) tr YᵀY E[Q²]| > t ) ≤ C e^(−cnt²)

for some C, c > 0 independent of the other parameters.

PROOF. We apply Lemma 3 to the mapping f : R → (1/T) tr YᵀY (RᵀR + γI_T)⁻². Denoting Q = (RᵀR + γI_T)⁻¹ and Q_H = ((R + H)ᵀ(R + H) + γI_T)⁻¹, remark indeed that

  |f(R + H) − f(R)| = (1/T) | tr YᵀY ( Q_H² − Q² ) |
    ≤ (1/T) | tr YᵀY (Q_H − Q) Q_H | + (1/T) | tr YᵀY Q (Q_H − Q) |
    = (1/T) | tr YᵀY Q_H ( (R + H)ᵀ(R + H) − RᵀR ) Q Q_H |
      + (1/T) | tr YᵀY Q Q_H ( (R + H)ᵀ(R + H) − RᵀR ) Q |
    ≤ (1/T) | tr YᵀY Q_H (R + H)ᵀ H Q Q_H | + (1/T) | tr YᵀY Q_H Hᵀ R Q Q_H |
      + (1/T) | tr YᵀY Q Q_H (R + H)ᵀ H Q | + (1/T) | tr YᵀY Q Q_H Hᵀ R Q |.

As ‖Q_H(R + H)ᵀ‖ = ‖Q_H(R + H)ᵀ(R + H)Q_H‖^(1/2) and ‖RQ‖ = ‖QRᵀRQ‖^(1/2) are bounded and (1/T) tr YᵀY is also bounded by Assumption 3, this implies

  |f(R + H) − f(R)| ≤ C‖H‖ ≤ C‖H‖_F

for some C > 0. The function f is thus Lipschitz with parameter independent of n, which allows us to conclude using Lemma 3. □

The aforementioned concentration results are the building blocks of the proofs of Theorems 1–3 which, under Assumptions 1–3, are established using standard random matrix approaches.

5.2. Asymptotic equivalents.

5.2.1. First equivalent for E[Q]. This section is dedicated to a first characterization of E[Q] in the "simultaneously large" n, p, T regime. This preliminary step is classical in studying resolvents in random matrix theory, as the direct comparison of E[Q] to Q̄ with the implicit δ may be cumbersome. To this end, let us thus define the intermediary deterministic matrix

  Q̃ = ( (n/T)·Φ/(1 + α) + γI_T )⁻¹

with α ≡ (1/T) tr ΦE[Q−], where we recall that Q− is a random matrix distributed as, say, ((1/T)ΣᵀΣ − (1/T)σ1σ1ᵀ + γI_T)⁻¹.
First note that, since (1/T) tr Φ = E[(1/T)‖σ‖²] and, from (7) and Assumption 3, P((1/T)‖σ‖² > t) ≤ C e^(−cnt) for all large t, we find that (1/T) tr Φ = ∫_0^∞ P((1/T)‖σ‖² > t) dt ≤ C′ for some constant C′. Thus, α ≤ ‖E[Q−]‖ (1/T) tr Φ ≤ C′/γ is uniformly bounded.
We will show here that ‖E[Q] − Q̃‖ → 0 as n → ∞ in the regime of Assumption 3. As the proof steps are somewhat classical, we defer to the Appendix some classical intermediary lemmas (Lemmas 5–7). Using the resolvent identity, Lemma 5, we start by writing

  E[Q] − Q̃ = E[ Q( (n/T)·Φ/(1 + α) − (1/T)ΣᵀΣ ) ]Q̃
    = (n/T) E[Q] (Φ/(1 + α)) Q̃ − E[ Q (1/T)ΣᵀΣ ] Q̃
    = (n/T) E[Q] (Φ/(1 + α)) Q̃ − (1/T) ∑_{i=1}^n E[ Qσiσiᵀ ] Q̃,

which, from Lemma 6, gives, for Q−i = ((1/T)ΣᵀΣ − (1/T)σiσiᵀ + γI_T)⁻¹,

  E[Q] − Q̃ = (n/T) E[Q] (Φ/(1 + α)) Q̃ − (1/T) ∑_{i=1}^n E[ Q−iσiσiᵀ / (1 + (1/T)σiᵀQ−iσi) ] Q̃
    = (n/T) E[Q] (Φ/(1 + α)) Q̃ − (1/(1 + α)) (1/T) ∑_{i=1}^n E[ Q−iσiσiᵀ ] Q̃
      + (1/T) ∑_{i=1}^n E[ Q−iσiσiᵀ ( (1/T)σiᵀQ−iσi − α ) / ( (1 + α)(1 + (1/T)σiᵀQ−iσi) ) ] Q̃.

Note now, from the independence of Q−i and σiσiᵀ, that the second right-hand side expectation is simply E[Q−i]Φ. Also, exploiting Lemma 6 in reverse on the rightmost term, this gives

(8)  E[Q] − Q̃ = (1/T) ∑_{i=1}^n ( E[Q − Q−i]/(1 + α) ) Φ Q̃
      + (1/(1 + α)) (1/T) ∑_{i=1}^n E[ Qσiσiᵀ ( (1/T)σiᵀQ−iσi − α ) ] Q̃.

It is convenient at this point to note that, since E[Q] − Q̃ is symmetric, we may write

(9)  E[Q] − Q̃ = ½ [ (1/(1 + α)) (1/T) ∑_{i=1}^n ( E[Q − Q−i] Φ Q̃ + Q̃ Φ E[Q − Q−i] )
      + (1/T) ∑_{i=1}^n E[ ( Qσiσiᵀ Q̃ + Q̃σiσiᵀ Q ) ( (1/T)σiᵀQ−iσi − α ) ] ].
We study the two right-hand side terms of (9) independently.
For the first term, since Q − Q−i = −Q(1/T)σiσiᵀQ−i,

  (1/T) ∑_{i=1}^n ( E[Q − Q−i]/(1 + α) ) Φ Q̃ = −(1/(1 + α)) E[ Q (1/T) ∑_{i=1}^n (1/T)σiσiᵀ Q−i ] Φ Q̃
    = −(1/(1 + α)) E[ Q (1/T) ∑_{i=1}^n (1/T)σiσiᵀ Q ( 1 + (1/T)σiᵀQ−iσi ) ] Φ Q̃,

where we used again Lemma 6 in reverse. Denoting D = diag({1 + (1/T)σiᵀQ−iσi}_{i=1}^n), this can be compactly written:

  (1/T) ∑_{i=1}^n ( E[Q − Q−i]/(1 + α) ) Φ Q̃ = −(1/(1 + α)) E[ Q (1/T)Σᵀ D (1/T)Σ Q ] Φ Q̃.

Note at this point that, from Lemma 7, ‖ΦQ̃‖ ≤ (1 + α)T/n and

  ‖Q Σᵀ/√T‖ = ‖Q (1/T)ΣᵀΣ Q‖^(1/2) ≤ γ^(−1/2).
Besides, by Lemma 4 and the union bound,

  P( max_{1≤i≤n} Dii > 1 + α + t ) ≤ C n e^(−cn min(t²,t))

for some C, c > 0, so in particular, recalling that α ≤ C′ for some constant C′ > 0,

  E[ max_{1≤i≤n} Dii ] = ∫_0^{2(1+C′)} P( max_{1≤i≤n} Dii > t ) dt + ∫_{2(1+C′)}^∞ P( max_{1≤i≤n} Dii > t ) dt
    ≤ 2(1 + C′) + ∫_{2(1+C′)}^∞ C n e^(−cn min((t−(1+C′))², t−(1+C′))) dt
    = 2(1 + C′) + ∫_{1+C′}^∞ C n e^(−cnt) dt
    = 2(1 + C′) + (C/c) e^(−cn(1+C′)) = O(1).

As a consequence of all the above (and of the boundedness of α), we have that, for some c > 0,

(10)  ‖ E[ Q (1/T)Σᵀ D (1/T)Σ Q ] Φ Q̃ ‖ ≤ c/n.
Let us now consider the second right-hand side term of (9). Using the relation $ab^T + ba^T \preceq aa^T + bb^T$ in the order of Hermitian matrices [which unfolds from $(a-b)(a-b)^T \succeq 0$], we have, with $a = T^{\frac14} Q\sigma_i ( \frac{1}{T}\sigma_i^T Q_{-i}\sigma_i - \alpha )$ and $b = T^{-\frac14}\tilde Q\sigma_i$,
$$ \frac{1}{T}\sum_{i=1}^n E\Big[ \big( Q\sigma_i\sigma_i^T\tilde Q + \tilde Q\sigma_i\sigma_i^T Q \big)\Big( \frac{1}{T}\sigma_i^T Q_{-i}\sigma_i - \alpha \Big) \Big] \preceq \frac{\sqrt T}{T}\sum_{i=1}^n E\Big[ Q\sigma_i\sigma_i^T Q\Big( \frac{1}{T}\sigma_i^T Q_{-i}\sigma_i - \alpha \Big)^2 \Big] + \frac{1}{T\sqrt T}\sum_{i=1}^n E\big[ \tilde Q\sigma_i\sigma_i^T\tilde Q \big] = \sqrt T\, E\Big[ Q\frac{1}{T}\Sigma^T D_2^2\Sigma Q \Big] + \frac{n}{T\sqrt T}\tilde Q\Phi\tilde Q, $$
where $D_2 = \mathrm{diag}( \{ \frac{1}{T}\sigma_i^T Q_{-i}\sigma_i - \alpha \}_{i=1}^n )$. Of course, since we also have $-aa^T - bb^T \preceq ab^T + ba^T$ [from $(a+b)(a+b)^T \succeq 0$], we have symmetrically
$$ \frac{1}{T}\sum_{i=1}^n E\Big[ \big( Q\sigma_i\sigma_i^T\tilde Q + \tilde Q\sigma_i\sigma_i^T Q \big)\Big( \frac{1}{T}\sigma_i^T Q_{-i}\sigma_i - \alpha \Big) \Big] \succeq -\sqrt T\, E\Big[ Q\frac{1}{T}\Sigma^T D_2^2\Sigma Q \Big] - \frac{n}{T\sqrt T}\tilde Q\Phi\tilde Q. $$

But from Lemma 4,
$$ P\big( \|D_2\| > tn^{\varepsilon-\frac12} \big) = P\Big( \max_{1\le i\le n}\Big| \frac{1}{T}\sigma_i^T Q_{-i}\sigma_i - \alpha \Big| > tn^{\varepsilon-\frac12} \Big) \le Cn e^{-c\min( n^{2\varepsilon}t^2,\, n^{\frac12+\varepsilon}t )}, $$
so that, with a similar reasoning as in the proof of Corollary 1,
$$ \sqrt T\, \Big\| E\Big[ Q\frac{1}{T}\Sigma^T D_2^2\Sigma Q \Big] \Big\| \le \sqrt T\, E\big[ \|D_2\|^2 \big]\gamma^{-1} \le Cn^{\varepsilon-\frac12}, $$
where we additionally used $\| Q\frac{1}{\sqrt T}\Sigma^T \| \le \gamma^{-\frac12}$ in the first inequality. Since in addition $\| \frac{n}{T\sqrt T}\tilde Q\Phi\tilde Q \| \le Cn^{-\frac12}$, this gives
$$ \Big\| \frac{1}{T}\sum_{i=1}^n E\Big[ \big( Q\sigma_i\sigma_i^T\tilde Q + \tilde Q\sigma_i\sigma_i^T Q \big)\Big( \frac{1}{T}\sigma_i^T Q_{-i}\sigma_i - \alpha \Big) \Big] \Big\| \le Cn^{\varepsilon-\frac12}. $$
Together with (9), we thus conclude that
$$ \big\| E[Q] - \tilde Q \big\| \le Cn^{\varepsilon-\frac12}. $$

Note in passing that we proved that
$$ \big\| E[Q - Q_-] \big\| = \Big\| \frac{T}{n}\frac{1}{T}\sum_{i=1}^n E[Q - Q_{-i}] \Big\| = \Big\| \frac{1}{n} E\Big[ Q\frac{1}{T}\Sigma^T D\Sigma Q \Big] \Big\| \le \frac{c}{n}, $$
where the first equality holds by exchangeability arguments. In particular,
$$ \alpha = \frac{1}{T}\operatorname{tr}\Phi E[Q_-] = \frac{1}{T}\operatorname{tr}\Phi E[Q] + \frac{1}{T}\operatorname{tr}\Phi\big( E[Q_-] - E[Q] \big), $$
where $| \frac{1}{T}\operatorname{tr}\Phi( E[Q_-] - E[Q] ) | \le \frac{c}{n}\frac{1}{T}\operatorname{tr}\Phi$. And thus, by the previous result,
$$ \Big| \alpha - \frac{1}{T}\operatorname{tr}\Phi\tilde Q \Big| \le Cn^{-\frac12+\varepsilon}\frac{1}{T}\operatorname{tr}\Phi. $$
We have proved in the beginning of the section that $\frac{1}{T}\operatorname{tr}\Phi$ is bounded and we thus finally conclude that
$$ \Big| \alpha - \frac{1}{T}\operatorname{tr}\Phi\tilde Q \Big| \le Cn^{\varepsilon-\frac12}. $$

5.2.2. Second equivalent for $E[Q]$. In this section, we show that $E[Q]$ can be approximated by the matrix $\bar Q$, which we recall is defined as
$$ \bar Q = \Big( \frac{n}{T}\frac{\Phi}{1+\delta} + \gamma I_T \Big)^{-1}, $$
where $\delta > 0$ is the unique positive solution to $\delta = \frac{1}{T}\operatorname{tr}\Phi\bar Q$. The fact that $\delta > 0$ is well defined is quite standard and has already been proved several times for more elaborate models. Following the ideas of Couillet, Hoydis and Debbah (2012), we may for instance use the framework of so-called standard interference functions [Yates (1995)], which states that, if a map $f : [0,\infty) \to (0,\infty)$, $x \mapsto f(x)$, satisfies (i) $x \ge x' \Rightarrow f(x) \ge f(x')$, (ii) $af(x) > f(ax)$ for all $a > 1$, and (iii) there exists $x_0$ such that $x_0 \ge f(x_0)$, then $f$ has a unique fixed point [Yates (1995), Theorem 2]. It is easily shown that $x \mapsto \frac{1}{T}\operatorname{tr}\Phi( \frac{n}{T}\frac{\Phi}{1+x} + \gamma I_T )^{-1}$ is such a map, so that $\delta$ exists and is unique.
To compare $\tilde Q$ and $\bar Q$, using the resolvent identity, Lemma 5, we start by writing
$$ \tilde Q - \bar Q = (\alpha-\delta)\frac{n}{T}\frac{ \tilde Q\Phi\bar Q }{ (1+\alpha)(1+\delta) }, $$
from which
$$ |\alpha-\delta| = \Big| \frac{1}{T}\operatorname{tr}\Phi\big( E[Q_-] - \bar Q \big) \Big| \le \Big| \frac{1}{T}\operatorname{tr}\Phi(\tilde Q - \bar Q) \Big| + cn^{-\frac12+\varepsilon} = |\alpha-\delta|\, \frac{1}{T}\operatorname{tr}\frac{ \Phi\tilde Q\frac{n}{T}\Phi\bar Q }{ (1+\alpha)(1+\delta) } + cn^{-\frac12+\varepsilon}, $$
which implies that
$$ |\alpha-\delta|\Big( 1 - \frac{1}{T}\operatorname{tr}\frac{ \Phi\tilde Q\frac{n}{T}\Phi\bar Q }{ (1+\alpha)(1+\delta) } \Big) \le cn^{-\frac12+\varepsilon}. $$
It thus remains to show that
$$ \limsup_n \frac{1}{T}\operatorname{tr}\frac{ \Phi\tilde Q\frac{n}{T}\Phi\bar Q }{ (1+\alpha)(1+\delta) } < 1 $$
to prove that $|\alpha-\delta| \le cn^{\varepsilon-\frac12}$. To this end, note that, by the Cauchy–Schwarz inequality,
$$ \frac{1}{T}\operatorname{tr}\frac{ \Phi\tilde Q\frac{n}{T}\Phi\bar Q }{ (1+\alpha)(1+\delta) } \le \sqrt{ \frac{n}{T(1+\delta)^2}\frac{1}{T}\operatorname{tr}\Phi^2\bar Q^2 \cdot \frac{n}{T(1+\alpha)^2}\frac{1}{T}\operatorname{tr}\Phi^2\tilde Q^2 }, $$
so that it is sufficient to bound the limsup of both terms under the square root strictly by one. Next, remark that
$$ \delta = \frac{1}{T}\operatorname{tr}\Phi\bar Q = \frac{1}{T}\operatorname{tr}\Phi\bar Q^2\bar Q^{-1} = \frac{n}{T(1+\delta)}\frac{1}{T}\operatorname{tr}\Phi^2\bar Q^2 + \gamma\frac{1}{T}\operatorname{tr}\Phi\bar Q^2. $$

In particular,
$$ \frac{n}{T(1+\delta)^2}\frac{1}{T}\operatorname{tr}\Phi^2\bar Q^2 = \delta\, \frac{ \frac{n}{T(1+\delta)^2}\frac{1}{T}\operatorname{tr}\Phi^2\bar Q^2 }{ \frac{n}{T(1+\delta)^2}\frac{1}{T}\operatorname{tr}\Phi^2\bar Q^2 + \frac{\gamma}{1+\delta}\frac{1}{T}\operatorname{tr}\Phi\bar Q^2 } \le \frac{\delta}{1+\delta}. $$
But at the same time, since $\| ( \frac{n}{T}\Phi + \gamma I_T )^{-1} \| \le \gamma^{-1}$,
$$ \delta \le \frac{1}{\gamma T}\operatorname{tr}\Phi, $$
the limsup of which is bounded. We thus conclude that
$$ \limsup_n \frac{n}{T(1+\delta)^2}\frac{1}{T}\operatorname{tr}\Phi^2\bar Q^2 < 1. \qquad (11) $$
Similarly, $\alpha$, which is known to be bounded, satisfies
$$ \alpha = \frac{n}{T(1+\alpha)}\frac{1}{T}\operatorname{tr}\Phi^2\tilde Q^2 + \gamma\frac{1}{T}\operatorname{tr}\Phi\tilde Q^2 + O\big( n^{\varepsilon-\frac12} \big), $$
and we thus also have
$$ \limsup_n \frac{n}{T(1+\alpha)^2}\frac{1}{T}\operatorname{tr}\Phi^2\tilde Q^2 < 1, $$
which completes the proof that $|\alpha-\delta| \le cn^{\varepsilon-\frac12}$.
As a consequence of all this,
$$ \| \tilde Q - \bar Q \| = |\alpha-\delta| \cdot \Big\| \frac{n}{T}\frac{ \tilde Q\Phi\bar Q }{ (1+\alpha)(1+\delta) } \Big\| \le cn^{-\frac12+\varepsilon}, $$
and we have thus proved that $\| E[Q] - \bar Q \| \le cn^{-\frac12+\varepsilon}$ for some $c > 0$.
From this result, along with Corollary 2, we now have that
$$ P\Big( \Big| \frac{1}{T}\operatorname{tr} Q - \frac{1}{T}\operatorname{tr}\bar Q \Big| > t \Big) \le P\Big( \Big| \frac{1}{T}\operatorname{tr} Q - \frac{1}{T}\operatorname{tr} E[Q] \Big| > t - \Big| \frac{1}{T}\operatorname{tr} E[Q] - \frac{1}{T}\operatorname{tr}\bar Q \Big| \Big) \le C' e^{-c'n( t - cn^{-\frac12+\varepsilon} )} \le C' e^{-\frac12 c'nt} $$
for all large $n$. As a consequence, for all $\gamma > 0$, $\frac{1}{T}\operatorname{tr} Q - \frac{1}{T}\operatorname{tr}\bar Q \to 0$ almost surely.


As such, the difference $m_{\mu_n} - m_{\bar\mu_n}$ of the Stieltjes transforms $m_{\mu_n} : \mathbb{C}\setminus\mathbb{R}^+ \to \mathbb{C}$, $z \mapsto \frac{1}{T}\operatorname{tr}( \frac{1}{T}\Sigma^T\Sigma - zI_T )^{-1}$, and $m_{\bar\mu_n} : \mathbb{C}\setminus\mathbb{R}^+ \to \mathbb{C}$, $z \mapsto \frac{1}{T}\operatorname{tr}( \frac{n}{T}\frac{\Phi}{1+\delta_z} - zI_T )^{-1}$ [with $\delta_z$ the unique solution to $\delta_z = \frac{1}{T}\operatorname{tr}\Phi( \frac{n}{T}\frac{\Phi}{1+\delta_z} - zI_T )^{-1}$], converges to zero for each $z$ in a subset of $\mathbb{C}\setminus\mathbb{R}^+$ having at least one accumulation point (namely $\mathbb{R}^-$), almost surely so [i.e., on a probability set $A_z$ with $P(A_z) = 1$]. Thus, letting $\{z_k\}_{k=1}^{\infty}$ be a converging sequence strictly included in $\mathbb{R}^-$, on the probability one space $A = \bigcap_{k=1}^{\infty} A_{z_k}$, $m_{\mu_n}(z_k) - m_{\bar\mu_n}(z_k) \to 0$ for all $k$. Now, $m_{\mu_n}$ is complex analytic on $\mathbb{C}\setminus\mathbb{R}^+$ and bounded on all compact subsets of $\mathbb{C}\setminus\mathbb{R}^+$. Besides, it was shown in Silverstein and Bai (1995), Silverstein and Choi (1995) that the function $m_{\bar\mu_n}$ is well defined, complex analytic and bounded on all compact subsets of $\mathbb{C}\setminus\mathbb{R}^+$. As a result, on $A$, $m_{\mu_n} - m_{\bar\mu_n}$ is complex analytic, bounded on all compact subsets of $\mathbb{C}\setminus\mathbb{R}^+$ and converges to zero on a subset admitting at least one accumulation point. Thus, by Vitali's convergence theorem [Titchmarsh (1939)], with probability one, $m_{\mu_n} - m_{\bar\mu_n}$ converges to zero everywhere on $\mathbb{C}\setminus\mathbb{R}^+$. This implies, by Bai and Silverstein (2010), Theorem B.9, that $\mu_n - \bar\mu_n \to 0$, vaguely as a signed finite measure, with probability one, and since $\bar\mu_n$ is a probability measure [again from the results of Silverstein and Bai (1995), Silverstein and Choi (1995)], we have thus proved Theorem 2.

5.2.3. Asymptotic equivalent for $E[QAQ]$, where $A$ is either $\Phi$ or symmetric of bounded norm. The evaluation of the second-order statistics of the neural network under study requires, beside $E[Q]$, to evaluate the more involved form $E[QAQ]$, where $A$ is a symmetric matrix either equal to $\Phi$ or of bounded norm (so that in particular $\bar QA$ is bounded). To evaluate this quantity, first write
$$ E[QAQ] = E[\bar QAQ] + E\big[ (Q-\bar Q)AQ \big] = E[\bar QAQ] + E\Big[ Q\Big( \frac{n}{T}\frac{\Phi}{1+\delta} - \frac{1}{T}\Sigma^T\Sigma \Big)\bar QAQ \Big] = E[\bar QAQ] + \frac{n}{T}\frac{1}{1+\delta} E[Q\Phi\bar QAQ] - \frac{1}{T}\sum_{i=1}^n E\big[ Q\sigma_i\sigma_i^T\bar QAQ \big]. $$

Of course, since $QAQ$ is symmetric, we may write
$$ E[QAQ] = \frac{1}{2} E[\bar QAQ + QA\bar Q] + \frac{1}{2}\frac{n}{T}\frac{1}{1+\delta} E[Q\Phi\bar QAQ + QA\bar Q\Phi Q] - \frac{1}{2}\frac{1}{T}\sum_{i=1}^n E\big[ Q\sigma_i\sigma_i^T\bar QAQ + QA\bar Q\sigma_i\sigma_i^T Q \big], $$
which will reveal more practical to handle. First note that, since $\| E[Q] - \bar Q \| \le Cn^{\varepsilon-\frac12}$ and $A$ is such that $\bar QA$ is bounded, $\| E[\bar QAQ] - \bar QA\bar Q \| \le \|\bar QA\|\, \| E[Q] - \bar Q \| \le C'n^{\varepsilon-\frac12}$, which provides an estimate for the first expectation. We next evaluate the last right-hand side expectation above. With the same notation as previously, from exchangeability arguments

and using $Q = Q_- - Q\frac{1}{T}\sigma\sigma^T Q_-$, observe that
$$ \frac{1}{T}\sum_{i=1}^n E\big[ Q\sigma_i\sigma_i^T\bar QAQ \big] = \frac{n}{T} E\big[ Q\sigma\sigma^T\bar QAQ \big] = \frac{n}{T} E\Big[ \frac{ Q_-\sigma\sigma^T\bar QAQ }{ 1+\frac{1}{T}\sigma^T Q_-\sigma } \Big] = \frac{n}{T}\frac{1}{1+\delta} E\big[ Q_-\sigma\sigma^T\bar QAQ \big] + \frac{n}{T}\frac{1}{1+\delta} E\Big[ Q_-\sigma\sigma^T\bar QAQ\, \frac{ \delta-\frac{1}{T}\sigma^T Q_-\sigma }{ 1+\frac{1}{T}\sigma^T Q_-\sigma } \Big], $$
which, reusing $Q = Q_- - Q\frac{1}{T}\sigma\sigma^T Q_-$, is further decomposed as
$$ \frac{1}{T}\sum_{i=1}^n E\big[ Q\sigma_i\sigma_i^T\bar QAQ \big] = \frac{n}{T}\frac{1}{1+\delta} E[Q_-\Phi\bar QAQ_-] - \frac{n}{T}\frac{1}{1+\delta} E\Big[ Q_-\sigma\sigma^T Q_-\, \frac{ \frac{1}{T}\sigma^T\bar QAQ_-\sigma }{ 1+\frac{1}{T}\sigma^T Q_-\sigma } \Big] + \frac{n}{T} E\Big[ Q_-\sigma\sigma^T\bar QAQ_-\, \frac{ \delta-\frac{1}{T}\sigma^T Q_-\sigma }{ (1+\delta)(1+\frac{1}{T}\sigma^T Q_-\sigma) } \Big] - \frac{n}{T} E\Big[ Q_-\sigma\sigma^T Q_-\, \frac{ \frac{1}{T}\sigma^T\bar QAQ_-\sigma\, ( \delta-\frac{1}{T}\sigma^T Q_-\sigma ) }{ (1+\delta)(1+\frac{1}{T}\sigma^T Q_-\sigma)^2 } \Big] $$
$$ \equiv Z_1 + Z_2 + Z_3 + Z_4 $$
(where in the previous-to-last line, we have merely reorganized the terms conveniently) and our interest is in handling $Z_1 + Z_1^T + Z_2 + Z_2^T + Z_3 + Z_3^T + Z_4 + Z_4^T$.
Let us first treat the term $Z_2$. Since $\bar QAQ_-$ is bounded, by Lemma 4, $\frac{1}{T}\sigma^T\bar QAQ_-\sigma$ concentrates around $\frac{1}{T}\operatorname{tr}\Phi\bar QA E[Q_-]$; but, as $\bar Q$ is bounded, we also have $| \frac{1}{T}\operatorname{tr}\Phi\bar QA E[Q_-] - \frac{1}{T}\operatorname{tr}\Phi\bar QA\bar Q | \le cn^{\varepsilon-\frac12}$. We thus deduce, with similar arguments as previously, that
$$ -Q_-\sigma\sigma^T Q_-\, Cn^{\varepsilon-\frac12} \preceq Q_-\sigma\sigma^T Q_-\Big( \frac{ \frac{1}{T}\sigma^T\bar QAQ_-\sigma }{ 1+\frac{1}{T}\sigma^T Q_-\sigma } - \frac{ \frac{1}{T}\operatorname{tr}\Phi\bar QA\bar Q }{ 1+\delta } \Big) \preceq Q_-\sigma\sigma^T Q_-\, Cn^{\varepsilon-\frac12} $$
with probability exponentially close to one, in the order of symmetric matrices. Taking expectation and norms on both sides, and conditioning on the aforementioned event and its complementary, we thus have that
$$ \Big\| E\Big[ Q_-\sigma\sigma^T Q_-\, \frac{ \frac{1}{T}\sigma^T\bar QAQ_-\sigma }{ 1+\frac{1}{T}\sigma^T Q_-\sigma } \Big] - \frac{ \frac{1}{T}\operatorname{tr}\Phi\bar QA\bar Q }{ 1+\delta } E[Q_-\Phi Q_-] \Big\| \le \big\| E[Q_-\Phi Q_-] \big\| Cn^{\varepsilon-\frac12} + C'ne^{-cn^{\varepsilon}} \le \big\| E[Q_-\Phi Q_-] \big\| C''n^{\varepsilon-\frac12}. $$
But, again by exchangeability arguments,
$$ \big\| E[Q_-\Phi Q_-] \big\| = \big\| E[Q_-\sigma\sigma^T Q_-] \big\| = \Big\| E\Big[ Q\sigma\sigma^T Q\Big( 1+\frac{1}{T}\sigma^T Q_-\sigma \Big)^2 \Big] \Big\| = \Big\| \frac{T}{n} E\Big[ Q\frac{1}{T}\Sigma^T D^2\Sigma Q \Big] \Big\| $$
with $D = \mathrm{diag}( \{ 1+\frac{1}{T}\sigma_i^T Q_{-i}\sigma_i \}_{i=1}^n )$, the operator norm of which is bounded as $O(1)$. So finally,
$$ \Big\| E\Big[ Q_-\sigma\sigma^T Q_-\, \frac{ \frac{1}{T}\sigma^T\bar QAQ_-\sigma }{ 1+\frac{1}{T}\sigma^T Q_-\sigma } \Big] - \frac{ \frac{1}{T}\operatorname{tr}\Phi\bar QA\bar Q }{ 1+\delta } E[Q_-\Phi Q_-] \Big\| \le Cn^{\varepsilon-\frac12}. $$

We now move to the term $Z_3 + Z_3^T$. Using the relation $ab^T + ba^T \preceq aa^T + bb^T$, with $a = n^{\frac14} Q_-\sigma\, \frac{ \delta-\frac{1}{T}\sigma^T Q_-\sigma }{ (1+\frac{1}{T}\sigma^T Q_-\sigma)^2 }$ and $b = n^{-\frac14} Q_-A\bar Q\sigma$,
$$ E\Big[ \Big( \delta - \frac{1}{T}\sigma^T Q_-\sigma \Big)\frac{ Q_-\sigma\sigma^T\bar QAQ_- + Q_-A\bar Q\sigma\sigma^T Q_- }{ (1+\frac{1}{T}\sigma^T Q_-\sigma)^2 } \Big] \preceq \sqrt n\, E\Big[ Q_-\sigma\sigma^T Q_-\, \frac{ ( \delta-\frac{1}{T}\sigma^T Q_-\sigma )^2 }{ (1+\frac{1}{T}\sigma^T Q_-\sigma)^4 } \Big] + \frac{1}{\sqrt n} E\big[ Q_-A\bar Q\sigma\sigma^T\bar QAQ_- \big] = \sqrt n\, \frac{T}{n} E\Big[ Q\frac{1}{T}\Sigma^T D_3^2\Sigma Q \Big] + \frac{1}{\sqrt n} E[Q_-A\bar Q\Phi\bar QAQ_-], $$
and the symmetrical lower bound (equal to the opposite of the upper bound), where $D_3 = \mathrm{diag}( \{ ( \delta-\frac{1}{T}\sigma_i^T Q_{-i}\sigma_i )/( 1+\frac{1}{T}\sigma_i^T Q_{-i}\sigma_i ) \}_{i=1}^n )$. For the same reasons as above, the first right-hand side term is bounded by $Cn^{\varepsilon-\frac12}$. As for the second term, for $A = I_T$, it is clearly bounded; for $A = \Phi$, using $\frac{n}{T}\frac{\Phi\bar Q}{1+\delta} = I_T - \gamma\bar Q$, $E[Q_-A\bar Q\Phi\bar QAQ_-]$ can be expressed in terms of $E[Q_-\Phi Q_-]$ and $E[Q_-\bar Q^k\Phi Q_-]$ for $k = 1, 2$, all of which have been shown to be bounded (at most by $Cn^{\varepsilon}$). We thus conclude that
$$ \Big\| E\Big[ \Big( \delta - \frac{1}{T}\sigma^T Q_-\sigma \Big)\frac{ Q_-\sigma\sigma^T\bar QAQ_- + Q_-A\bar Q\sigma\sigma^T Q_- }{ (1+\frac{1}{T}\sigma^T Q_-\sigma)^2 } \Big] \Big\| \le Cn^{\varepsilon-\frac12}. $$
Finally, the term $Z_4$ can be handled similarly to the term $Z_2$ and is shown to be of norm bounded by $Cn^{\varepsilon-\frac12}$.
As a consequence of all the above, we thus find that
$$ E[QAQ] = \bar QA\bar Q + \frac{n}{T}\frac{ E[Q\Phi\bar QAQ] }{ 1+\delta } - \frac{n}{T}\frac{ E[Q_-\Phi\bar QAQ_-] }{ 1+\delta } + \frac{n}{T}\frac{ \frac{1}{T}\operatorname{tr}\Phi\bar QA\bar Q }{ (1+\delta)^2 } E[Q_-\Phi Q_-] + O_{\|\cdot\|}\big( n^{\varepsilon-\frac12} \big). $$
One is tempted to expect that the sum of the second and third terms above vanishes. This is indeed verified by observing that, for any matrix $B$,
$$ E[QBQ] - E[Q_-BQ] = -E\Big[ Q\frac{1}{T}\sigma\sigma^T Q_-BQ \Big] = -E\Big[ Q\frac{1}{T}\sigma\sigma^T QBQ\Big( 1+\frac{1}{T}\sigma^T Q_-\sigma \Big) \Big] = -\frac{1}{n} E\Big[ Q\frac{1}{T}\Sigma^T D\Sigma QBQ \Big] $$
and symmetrically
$$ E[QBQ] - E[QBQ_-] = -\frac{1}{n} E\Big[ QBQ\frac{1}{T}\Sigma^T D\Sigma Q \Big], $$
with $D = \mathrm{diag}( \{ 1+\frac{1}{T}\sigma_i^T Q_{-i}\sigma_i \}_{i=1}^n )$, and a similar reasoning is performed to control $E[Q_-BQ] - E[Q_-BQ_-]$ and $E[QBQ_-] - E[Q_-BQ_-]$. For $B$ bounded, $E[Q\frac{1}{T}\Sigma^T D\Sigma QBQ]$ is bounded as $O(1)$, and thus $E[QBQ] - E[Q_-BQ_-]$ is of order $O(n^{-1})$. So in particular, taking $A$ of bounded norm, we find that
$$ E[QAQ] = \bar QA\bar Q + \frac{n}{T}\frac{ \frac{1}{T}\operatorname{tr}\Phi\bar QA\bar Q }{ (1+\delta)^2 } E[Q_-\Phi Q_-] + O_{\|\cdot\|}\big( n^{\varepsilon-\frac12} \big). $$

Take now $B = \Phi$. Then, from the relation $AB^T + BA^T \preceq AA^T + BB^T$ in the order of symmetric matrices,
$$ \Big\| E[Q\Phi Q] - \frac{1}{2} E[Q_-\Phi Q + Q\Phi Q_-] \Big\| = \Big\| \frac{1}{2n} E\Big[ Q\frac{1}{T}\Sigma^T D\Sigma Q\Phi Q + Q\Phi Q\frac{1}{T}\Sigma^T D\Sigma Q \Big] \Big\| \le \frac{1}{2n}\Big( \Big\| E\Big[ Q\frac{1}{T}\Sigma^T D\Sigma Q\frac{1}{T}\Sigma^T D\Sigma Q \Big] \Big\| + \big\| E[Q\Phi Q\Phi Q] \big\| \Big). $$
The first norm in the parenthesis is bounded by $Cn^{\varepsilon}$ and it thus remains to control the second norm. To this end, similar to the control of $E[Q\Phi Q]$, by writing $E[Q\Phi Q\Phi Q] = E[Q\sigma_1'\sigma_1'^T Q\sigma_2'\sigma_2'^T Q]$ for $\sigma_1', \sigma_2'$ independent vectors with the same law as $\sigma$, and exploiting the exchangeability, we obtain after some calculus that $E[Q\Phi Q\Phi Q]$ can be expressed as the sum of terms of the form $E[ Q_{++}\frac{1}{T}\Sigma_{++}^T D_{++}\Sigma_{++} Q_{++} ]$ or $E[ Q_{++}\frac{1}{T}\Sigma_{++}^T D_{++}\Sigma_{++} Q_{++}\frac{1}{T}\Sigma_{++}^T D_2\Sigma_{++} Q_{++} ]$ for $D_{++}, D_2$ diagonal matrices of norm bounded as $O(1)$, where $\Sigma_{++}$ and $Q_{++}$ are defined as $\Sigma$ and $Q$, only for $n$ replaced by $n+2$. All these terms are bounded as $O(1)$ and we finally obtain that $\| E[Q\Phi Q\Phi Q] \|$ is bounded, and thus
$$ \Big\| E[Q\Phi Q] - \frac{1}{2} E[Q_-\Phi Q + Q\Phi Q_-] \Big\| \le \frac{C}{n}. $$
With the additional control of $E[Q\Phi Q_-] - E[Q_-\Phi Q_-]$ and $E[Q_-\Phi Q] - E[Q_-\Phi Q_-]$, together, this implies that $E[Q\Phi Q] = E[Q_-\Phi Q_-] + O_{\|\cdot\|}(n^{-1})$. Hence, for $A = \Phi$, exploiting the fact that $\frac{n}{T}\frac{\Phi\bar Q}{1+\delta} = I_T - \gamma\bar Q$, we have the simplification
$$ E[Q\Phi Q] = \bar Q\Phi\bar Q + \frac{n}{T}\frac{ E[Q\Phi\bar Q\Phi Q] }{ 1+\delta } - \frac{n}{T}\frac{ E[Q_-\Phi\bar Q\Phi Q_-] }{ 1+\delta } + \frac{n}{T}\frac{ \frac{1}{T}\operatorname{tr}\Phi^2\bar Q^2 }{ (1+\delta)^2 } E[Q_-\Phi Q_-] + O_{\|\cdot\|}\big( n^{\varepsilon-\frac12} \big) = \bar Q\Phi\bar Q + \frac{n}{T}\frac{ \frac{1}{T}\operatorname{tr}\Phi^2\bar Q^2 }{ (1+\delta)^2 } E[Q\Phi Q] + O_{\|\cdot\|}\big( n^{\varepsilon-\frac12} \big), $$
or equivalently
$$ E[Q\Phi Q]\Big( 1 - \frac{n}{T}\frac{ \frac{1}{T}\operatorname{tr}\Phi^2\bar Q^2 }{ (1+\delta)^2 } \Big) = \bar Q\Phi\bar Q + O_{\|\cdot\|}\big( n^{\varepsilon-\frac12} \big). $$
We have already shown in (11) that $\limsup_n \frac{n}{T}\frac{ \frac{1}{T}\operatorname{tr}\Phi^2\bar Q^2 }{ (1+\delta)^2 } < 1$, and thus
$$ E[Q\Phi Q] = \frac{ \bar Q\Phi\bar Q }{ 1 - \frac{n}{T}\frac{ \frac{1}{T}\operatorname{tr}\Phi^2\bar Q^2 }{ (1+\delta)^2 } } + O_{\|\cdot\|}\big( n^{\varepsilon-\frac12} \big). $$

So finally, for all $A$ of bounded norm,
$$ E[QAQ] = \bar QA\bar Q + \frac{n}{T}\frac{ \frac{1}{T}\operatorname{tr}\Phi\bar QA\bar Q }{ (1+\delta)^2 }\, \frac{ \bar Q\Phi\bar Q }{ 1 - \frac{n}{T}\frac{ \frac{1}{T}\operatorname{tr}\Phi^2\bar Q^2 }{ (1+\delta)^2 } } + O_{\|\cdot\|}\big( n^{\varepsilon-\frac12} \big), $$
which immediately proves Proposition 1 and Theorem 3.



5.3. Derivation of $\Phi_{ab}$.

5.3.1. Gaussian $w$. In this section, we evaluate the terms $\Phi_{ab}$ provided in Table 1. The proof for the term corresponding to $\sigma(t) = \operatorname{erf}(t)$ can already be found in Williams (1998), Section 3.1, and is not recalled here. For the other functions $\sigma(\cdot)$, we follow a similar approach as in Williams (1998), as detailed next.
The evaluation of $\Phi_{ab}$ for $w \sim \mathcal{N}(0, I_p)$ requires to estimate
$$ I \equiv (2\pi)^{-\frac{p}{2}} \int_{\mathbb{R}^p} \sigma(w^T a)\sigma(w^T b)\, e^{-\frac12\|w\|^2}\, dw. $$

Assume that $a$ and $b$ are not linearly dependent. It is convenient to observe that this integral can be reduced to a two-dimensional integration by considering the basis $e_1, \ldots, e_p$ defined (for instance) by
$$ e_1 = \frac{a}{\|a\|}, \qquad e_2 = \frac{ b - \frac{a^T b}{\|a\|^2} a }{ \|b\|\sqrt{ 1 - \frac{(a^T b)^2}{\|a\|^2\|b\|^2} } }, $$
and $e_3, \ldots, e_p$ any completion of the basis. By letting $w = \tilde w_1 e_1 + \cdots + \tilde w_p e_p$ and $a = \tilde a_1 e_1$ ($\tilde a_1 = \|a\|$), $b = \tilde b_1 e_1 + \tilde b_2 e_2$ (where $\tilde b_1 = \frac{a^T b}{\|a\|}$ and $\tilde b_2 = \|b\|\sqrt{ 1 - \frac{(a^T b)^2}{\|a\|^2\|b\|^2} }$), this reduces $I$ to
$$ I = \frac{1}{2\pi} \int_{\mathbb{R}} \int_{\mathbb{R}} \sigma(\tilde w_1\tilde a_1)\sigma(\tilde w_1\tilde b_1 + \tilde w_2\tilde b_2)\, e^{-\frac12(\tilde w_1^2+\tilde w_2^2)}\, d\tilde w_1\, d\tilde w_2. $$
Letting $\tilde w = [\tilde w_1, \tilde w_2]^T$, $\tilde a = [\tilde a_1, 0]^T$ and $\tilde b = [\tilde b_1, \tilde b_2]^T$, this is conveniently written as the two-dimensional integral
$$ I = \frac{1}{2\pi} \int_{\mathbb{R}^2} \sigma(\tilde w^T\tilde a)\sigma(\tilde w^T\tilde b)\, e^{-\frac12\|\tilde w\|^2}\, d\tilde w. $$
The case where $a$ and $b$ are linearly dependent can then be obtained by continuity arguments.
The function $\sigma(t) = \max(t, 0)$. For this function, we have
$$ I = \frac{1}{2\pi} \int_{\min(\tilde w^T\tilde a,\, \tilde w^T\tilde b)\ge 0} \tilde w^T\tilde a \cdot \tilde w^T\tilde b \cdot e^{-\frac12\|\tilde w\|^2}\, d\tilde w. $$
Since $\tilde a = \tilde a_1 e_1$, a simple geometric representation lets us observe that
$$ \big\{ \tilde w \mid \min( \tilde w^T\tilde a, \tilde w^T\tilde b ) \ge 0 \big\} = \Big\{ r\cos(\theta)e_1 + r\sin(\theta)e_2 \,\Big|\, r \ge 0,\ \theta \in \Big[ \theta_0 - \frac{\pi}{2}, \frac{\pi}{2} \Big] \Big\}, $$
where we defined $\theta_0 \equiv \arccos( \frac{\tilde b_1}{\|\tilde b\|} ) = -\arcsin( \frac{\tilde b_1}{\|\tilde b\|} ) + \frac{\pi}{2}$. We may thus operate a polar coordinate change of variable (with inverse Jacobian determinant equal to $r$) to obtain
$$ I = \frac{1}{2\pi} \int_{\theta_0-\frac{\pi}{2}}^{\frac{\pi}{2}} \int_{\mathbb{R}^+} r\cos(\theta)\tilde a_1\big( r\cos(\theta)\tilde b_1 + r\sin(\theta)\tilde b_2 \big)\, re^{-\frac12 r^2}\, dr\, d\theta = \frac{\tilde a_1}{2\pi} \int_{\theta_0-\frac{\pi}{2}}^{\frac{\pi}{2}} \cos(\theta)\big( \cos(\theta)\tilde b_1 + \sin(\theta)\tilde b_2 \big)\, d\theta \int_{\mathbb{R}^+} r^3 e^{-\frac12 r^2}\, dr. $$
With two integrations by parts, we have that $\int_{\mathbb{R}^+} r^3 e^{-\frac12 r^2}\, dr = 2$. Classical trigonometric formulas also provide
$$ \int_{\theta_0-\frac{\pi}{2}}^{\frac{\pi}{2}} \cos(\theta)^2\, d\theta = \frac12(\pi-\theta_0) + \frac14\sin(2\theta_0) = \frac12\Big( \pi - \arccos\Big( \frac{\tilde b_1}{\|\tilde b\|} \Big) + \frac{\tilde b_1\tilde b_2}{\|\tilde b\|^2} \Big), \qquad \int_{\theta_0-\frac{\pi}{2}}^{\frac{\pi}{2}} \cos(\theta)\sin(\theta)\, d\theta = \frac12\sin^2(\theta_0) = \frac12\Big( \frac{\tilde b_2}{\|\tilde b\|} \Big)^2, $$
where we used in particular $\sin(2\arccos(x)) = 2x\sqrt{1-x^2}$. Altogether, this is, after simplification and replacement of $\tilde a_1$, $\tilde b_1$ and $\tilde b_2$,
$$ I = \frac{1}{2\pi}\|a\|\|b\|\Big( \sqrt{ 1-\angle(a,b)^2 } + \angle(a,b)\arccos\big( -\angle(a,b) \big) \Big), $$
where $\angle(a,b) \equiv \frac{a^T b}{\|a\|\|b\|}$. It is worth noticing that this may be more compactly written as
$$ I = \frac{1}{2\pi}\|a\|\|b\| \int_{-1}^{\angle(a,b)} \arccos(-x)\, dx, $$
which is minimal as $\angle(a,b) \to -1$ (since $\arccos(-x) \ge 0$ on $[-1,1]$) and takes there the limiting value zero. Hence, $I > 0$ for $a$ and $b$ not linearly dependent. For $a$ and $b$ linearly dependent, we simply have $I = 0$ for $\angle(a,b) = -1$ and $I = \frac12\|a\|\|b\|$ for $\angle(a,b) = 1$.
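The closed form above is easily verified by Monte Carlo integration. In the sketch below, the test vectors a and b are arbitrary illustrative choices:

```python
import numpy as np

# Monte Carlo sanity check of the closed form for sigma(t) = max(t, 0):
#   I = (1/2pi) |a||b| ( sqrt(1 - c^2) + c * arccos(-c) ),
# with c = angle(a, b) = a^T b / (|a| |b|) and w ~ N(0, I_p).
rng = np.random.default_rng(1)
a = np.array([1.0, 0.5, -0.2])   # arbitrary illustration vectors
b = np.array([-0.3, 1.2, 0.4])

W = rng.standard_normal((2_000_000, 3))
emp = np.mean(np.maximum(W @ a, 0.0) * np.maximum(W @ b, 0.0))

c = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
theo = (
    np.linalg.norm(a) * np.linalg.norm(b) / (2 * np.pi)
    * (np.sqrt(1 - c**2) + c * np.arccos(-c))
)
```

With two million Gaussian samples the empirical average matches the closed form to roughly three decimal places.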
The function $\sigma(t) = |t|$. Since $|t| = \max(t,0) + \max(-t,0)$, we have
$$ |w^T a| \cdot |w^T b| = \max(w^T a, 0)\max(w^T b, 0) + \max(w^T(-a), 0)\max(w^T(-b), 0) + \max(w^T(-a), 0)\max(w^T b, 0) + \max(w^T a, 0)\max(w^T(-b), 0). $$
Hence, reusing the results above, we have here
$$ I = \frac{\|a\|\|b\|}{2\pi}\Big( 4\sqrt{ 1-\angle(a,b)^2 } + 2\angle(a,b)\arccos\big( -\angle(a,b) \big) - 2\angle(a,b)\arccos\big( \angle(a,b) \big) \Big). $$
Using the identity $\arccos(-x) - \arccos(x) = 2\arcsin(x)$ provides the expected result.
The function $\sigma(t) = 1_{t\ge 0}$. With the same notation as in the case $\sigma(t) = \max(t,0)$, we have to evaluate
$$ I = \frac{1}{2\pi} \int_{\min(\tilde w^T\tilde a,\, \tilde w^T\tilde b)\ge 0} e^{-\frac12\|\tilde w\|^2}\, d\tilde w. $$
After a polar coordinate change of variable, this is
$$ I = \frac{1}{2\pi} \int_{\theta_0-\frac{\pi}{2}}^{\frac{\pi}{2}} d\theta \int_{\mathbb{R}^+} re^{-\frac12 r^2}\, dr = \frac12 - \frac{\theta_0}{2\pi}, $$
from which the result unfolds.
The function $\sigma(t) = \operatorname{sign}(t)$. Here, it suffices to note that $\operatorname{sign}(t) = 1_{t\ge 0} - 1_{-t\ge 0}$, so that
$$ \sigma(w^T a)\sigma(w^T b) = 1_{w^T a\ge 0}1_{w^T b\ge 0} + 1_{w^T(-a)\ge 0}1_{w^T(-b)\ge 0} - 1_{w^T(-a)\ge 0}1_{w^T b\ge 0} - 1_{w^T a\ge 0}1_{w^T(-b)\ge 0}, $$
and to apply the result obtained for $\sigma(t) = 1_{t\ge 0}$, with either $(a,b)$, $(-a,b)$, $(a,-b)$ or $(-a,-b)$. Since $\arccos(-x) = -\arccos(x) + \pi$, we conclude that
$$ I = (2\pi)^{-\frac{p}{2}} \int_{\mathbb{R}^p} \operatorname{sign}(w^T a)\operatorname{sign}(w^T b)\, e^{-\frac12\|w\|^2}\, dw = 1 - \frac{2\theta_0}{\pi}. $$
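This sign-kernel identity can likewise be checked by simulation (same arbitrary illustration vectors as before, with $\theta_0 = \arccos(\angle(a,b))$):

```python
import numpy as np

# Monte Carlo sanity check of I = 1 - 2*theta_0/pi for sigma(t) = sign(t),
# where theta_0 = arccos( angle(a, b) ) and w ~ N(0, I_p).
rng = np.random.default_rng(2)
a = np.array([1.0, 0.5, -0.2])   # arbitrary illustration vectors
b = np.array([-0.3, 1.2, 0.4])

W = rng.standard_normal((1_000_000, 3))
emp = np.mean(np.sign(W @ a) * np.sign(W @ b))

c = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
theo = 1.0 - 2.0 * np.arccos(c) / np.pi
```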
The functions $\sigma(t) = \cos(t)$ and $\sigma(t) = \sin(t)$. Let us first consider $\sigma(t) = \cos(t)$. We have here to evaluate
$$ I = \frac{1}{2\pi} \int_{\mathbb{R}^2} \cos(\tilde w^T\tilde a)\cos(\tilde w^T\tilde b)\, e^{-\frac12\|\tilde w\|^2}\, d\tilde w = \frac{1}{8\pi} \int_{\mathbb{R}^2} \big( e^{\imath\tilde w^T\tilde a} + e^{-\imath\tilde w^T\tilde a} \big)\big( e^{\imath\tilde w^T\tilde b} + e^{-\imath\tilde w^T\tilde b} \big) e^{-\frac12\|\tilde w\|^2}\, d\tilde w, $$
which boils down to evaluating, for $d \in \{ \tilde a+\tilde b, \tilde a-\tilde b, -\tilde a+\tilde b, -\tilde a-\tilde b \}$, the integral
$$ \int_{\mathbb{R}^2} e^{\imath\tilde w^T d}\, e^{-\frac12\|\tilde w\|^2}\, d\tilde w = e^{-\frac12\|d\|^2} \int_{\mathbb{R}^2} e^{-\frac12\|\tilde w - \imath d\|^2}\, d\tilde w = 2\pi\, e^{-\frac12\|d\|^2}. $$
Altogether, we find
$$ I = \frac12\big( e^{-\frac12\|a+b\|^2} + e^{-\frac12\|a-b\|^2} \big) = e^{-\frac12(\|a\|^2+\|b\|^2)}\cosh\big( a^T b \big). $$
For $\sigma(t) = \sin(t)$, it suffices to appropriately adapt the signs in the expression of $I$ [using the relation $\sin(t) = \frac{1}{2\imath}( e^{\imath t} - e^{-\imath t} )$] to obtain in the end
$$ I = \frac12\big( e^{-\frac12\|a-b\|^2} - e^{-\frac12\|a+b\|^2} \big) = e^{-\frac12(\|a\|^2+\|b\|^2)}\sinh\big( a^T b \big), $$
as desired.
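Both Gaussian integrals above admit a quick Monte Carlo confirmation (a, b below are arbitrary illustrative vectors):

```python
import numpy as np

# Monte Carlo sanity check, for w ~ N(0, I_p), of
#   E[cos(w^T a) cos(w^T b)] = exp(-(|a|^2+|b|^2)/2) * cosh(a^T b),
#   E[sin(w^T a) sin(w^T b)] = exp(-(|a|^2+|b|^2)/2) * sinh(a^T b).
rng = np.random.default_rng(4)
a = np.array([0.8, 0.3])   # arbitrary illustration vectors
b = np.array([0.2, -0.5])

W = rng.standard_normal((2_000_000, 2))
emp_cos = np.mean(np.cos(W @ a) * np.cos(W @ b))
emp_sin = np.mean(np.sin(W @ a) * np.sin(W @ b))

pref = np.exp(-(a @ a + b @ b) / 2)
theo_cos = pref * np.cosh(a @ b)
theo_sin = pref * np.sinh(a @ b)
```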

5.4. Polynomial $\sigma(\cdot)$ and generic $w$. In this section, we prove equation (5) for $\sigma(t) = \zeta_2 t^2 + \zeta_1 t + \zeta_0$ and $w \in \mathbb{R}^p$ a random vector with independent and identically distributed entries of zero mean and moment of order $k$ equal to $m_k$. The result is based on standard combinatorics. We are to evaluate
$$ \Phi_{ab} = E\big[ \big( \zeta_2 (w^T a)^2 + \zeta_1 w^T a + \zeta_0 \big)\big( \zeta_2 (w^T b)^2 + \zeta_1 w^T b + \zeta_0 \big) \big]. $$
After development, it appears that one needs only assess, for vectors $c, d \in \mathbb{R}^p$ that take values in $\{a, b\}$, the moments
$$ E\big[ (w^T c)^2(w^T d)^2 \big] = \sum_{i_1 i_2 j_1 j_2} c_{i_1}c_{i_2}d_{j_1}d_{j_2} E[w_{i_1}w_{i_2}w_{j_1}w_{j_2}] = m_4\sum_i c_i^2 d_i^2 + m_2^2\sum_{i\ne j} c_i^2 d_j^2 + 2m_2^2\sum_{i\ne j} c_i d_i c_j d_j = \big( m_4 - 3m_2^2 \big)(c^2)^T(d^2) + m_2^2\big( \|c\|^2\|d\|^2 + 2(c^T d)^2 \big), $$
$$ E\big[ (w^T c)^2 w^T d \big] = \sum_{i_1 i_2 j} c_{i_1}c_{i_2}d_j E[w_{i_1}w_{i_2}w_j] = m_3\sum_i c_i^2 d_i = m_3 (c^2)^T d, \qquad E\big[ w^T c\, w^T d \big] = \sum_{i_1 i_2} c_{i_1}d_{i_2} E[w_{i_1}w_{i_2}] = m_2\, c^T d, $$
where we recall the definition $(a^2) \equiv [a_1^2, \ldots, a_p^2]^T$. Gathering all the terms for appropriate selections of $c, d$ leads to (5).
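The fourth-moment identity above can be verified exactly in a small instance. The sketch below takes $w$ with i.i.d. Rademacher entries ($m_2 = m_4 = 1$, $m_3 = 0$) and $p = 3$, so that the expectation is an exact average over the $2^3$ equiprobable sign vectors; c and d are arbitrary illustration vectors:

```python
import itertools
import numpy as np

# Exact check of
#   E[(w^T c)^2 (w^T d)^2]
#     = (m4 - 3 m2^2) (c^2)^T (d^2) + m2^2 ( |c|^2 |d|^2 + 2 (c^T d)^2 )
# for Rademacher entries, by full enumeration of the 8 sign vectors.
c = np.array([1.0, -2.0, 0.5])
d = np.array([0.3, 1.0, -1.5])
m2, m4 = 1.0, 1.0

emp = np.mean([
    (w @ c) ** 2 * (w @ d) ** 2
    for w in (np.array(s) for s in itertools.product([-1.0, 1.0], repeat=3))
])

theo = (m4 - 3 * m2**2) * (c**2 @ d**2) + m2**2 * (
    (c @ c) * (d @ d) + 2 * (c @ d) ** 2
)
```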

5.5. Heuristic derivation of Conjecture 1. Conjecture 1 essentially follows as an aftermath of Remark 1. We believe that, similarly to $\Sigma$, $\hat\Sigma$ is expected to be of the form $\hat\Sigma = \hat\Sigma^{\circ} + 1_n\hat{\bar\sigma}^T$, where $\hat{\bar\sigma} = E[\sigma(\hat X^T w)]$, with $\| \hat\Sigma^{\circ} \|/\sqrt{\hat T} \le n^{\varepsilon}$ with high probability. Besides, if $X, \hat X$ were chosen as constituted of Gaussian mixture vectors, with nontrivial growth rate conditions as introduced in Couillet and Benaych-Georges (2016), it is easily seen that $\bar\sigma = c1_T + v$ and $\hat{\bar\sigma} = c1_{\hat T} + \hat v$, for some constant $c$ and $\|v\|, \|\hat v\| = O(1)$.
This subsequently ensures that $\Phi_{X\hat X}$ and $\Phi_{\hat X\hat X}$ would be of a similar form $\Phi_{X\hat X}^{\circ} + \bar\sigma\hat{\bar\sigma}^T$ and $\Phi_{\hat X\hat X}^{\circ} + \hat{\bar\sigma}\hat{\bar\sigma}^T$, with $\Phi_{X\hat X}^{\circ}$ and $\Phi_{\hat X\hat X}^{\circ}$ of bounded norm. These facts,

that would require more advanced proof techniques, let us envision the following heuristic derivation for Conjecture 1.
Recall that our interest is in the test performance $E_{\mathrm{test}}$ defined as
$$ E_{\mathrm{test}} = \frac{1}{\hat T}\big\| \hat Y^T - \hat\Sigma^T\beta \big\|_F^2, $$
which may be rewritten as
$$ E_{\mathrm{test}} = \frac{1}{\hat T}\operatorname{tr}\hat Y\hat Y^T - \frac{2}{T\hat T}\operatorname{tr} YQ\Sigma^T\hat\Sigma\hat Y^T + \frac{1}{T^2\hat T}\operatorname{tr} YQ\Sigma^T\hat\Sigma\hat\Sigma^T\Sigma QY^T \equiv Z_1 - Z_2 + Z_3. \qquad (12) $$
If $\hat\Sigma = \hat\Sigma^{\circ} + 1_n\hat{\bar\sigma}^T$ follows the aforementioned claimed operator norm control, reproducing the steps of Corollary 3 leads to a similar concentration for $E_{\mathrm{test}}$, which we shall then admit. We are therefore left to evaluating $E[Z_2]$ and $E[Z_3]$.
We start with the term $E[Z_2]$, which we expand as
$$ E[Z_2] = \frac{2}{T\hat T} E\big[ \operatorname{tr} YQ\Sigma^T\hat\Sigma\hat Y^T \big] = \frac{2}{T\hat T}\sum_{i=1}^n E\big[ \operatorname{tr} YQ\sigma_i\hat\sigma_i^T\hat Y^T \big] = \frac{2}{T\hat T}\sum_{i=1}^n E\Big[ \operatorname{tr}\frac{ YQ_{-i}\sigma_i\hat\sigma_i^T\hat Y^T }{ 1+\frac{1}{T}\sigma_i^T Q_{-i}\sigma_i } \Big] $$
$$ = \frac{2}{T\hat T}\frac{1}{1+\delta}\sum_{i=1}^n E\big[ \operatorname{tr} YQ_{-i}\sigma_i\hat\sigma_i^T\hat Y^T \big] + \frac{2}{T\hat T}\frac{1}{1+\delta}\sum_{i=1}^n E\Big[ \operatorname{tr} YQ_{-i}\sigma_i\hat\sigma_i^T\hat Y^T\, \frac{ \delta-\frac{1}{T}\sigma_i^T Q_{-i}\sigma_i }{ 1+\frac{1}{T}\sigma_i^T Q_{-i}\sigma_i } \Big] = \frac{2n}{T\hat T}\frac{1}{1+\delta}\operatorname{tr}\big( Y E[Q_-]\Phi_{X\hat X}\hat Y^T \big) + \frac{2}{T\hat T}\frac{1}{1+\delta} E\big[ \operatorname{tr} YQ\Sigma^T D\hat\Sigma\hat Y^T \big] \equiv Z_{21} + Z_{22}, $$
with $D = \mathrm{diag}( \{ \delta-\frac{1}{T}\sigma_i^T Q_{-i}\sigma_i \}_{i=1}^n )$, the operator norm of which is bounded by $n^{\varepsilon-\frac12}$ with high probability. Now, observe that, again with the assumption that $\hat\Sigma = \hat\Sigma^{\circ} + 1_n\hat{\bar\sigma}^T$ with controlled $\|\hat\Sigma^{\circ}\|$, $Z_{22}$ may be decomposed as
$$ \frac{2}{T\hat T}\frac{1}{1+\delta} E\big[ \operatorname{tr} YQ\Sigma^T D\hat\Sigma\hat Y^T \big] = \frac{2}{T\hat T}\frac{1}{1+\delta} E\big[ \operatorname{tr} YQ\Sigma^T D\hat\Sigma^{\circ}\hat Y^T \big] + \frac{2}{T\hat T}\frac{1}{1+\delta} E\big[ \hat{\bar\sigma}^T\hat Y^T YQ\Sigma^T D1_n \big]. $$
In the display above, the first right-hand side term is now of order $O(n^{\varepsilon-\frac12})$. As for the second right-hand side term, note that $D1_n$ is a vector of identically distributed entries of zero mean and variance $O(n^{-1})$; while not formally independent of $YQ\Sigma^T$, it is nonetheless expected that this dependence "weakens" asymptotically (a behavior several times observed in linear random matrix models), so that one expects by central limit arguments that the second right-hand side term also be of order $O(n^{\varepsilon-\frac12})$.
This would thus result in
$$ E[Z_2] = \frac{2n}{T\hat T}\frac{1}{1+\delta}\operatorname{tr}\big( Y E[Q_-]\Phi_{X\hat X}\hat Y^T \big) + O\big( n^{\varepsilon-\frac12} \big) = \frac{2n}{T\hat T}\frac{1}{1+\delta}\operatorname{tr}\big( Y\bar Q\Phi_{X\hat X}\hat Y^T \big) + O\big( n^{\varepsilon-\frac12} \big) = \frac{2}{\hat T}\operatorname{tr}\big( Y\bar Q\Psi_{X\hat X}\hat Y^T \big) + O\big( n^{\varepsilon-\frac12} \big), $$
where we used $\| E[Q_-] - \bar Q \| \le Cn^{\varepsilon-\frac12}$ and the definition $\Psi_{X\hat X} \equiv \frac{n}{T}\frac{\Phi_{X\hat X}}{1+\delta}$.
We then move on to E[Z3 ] of equation (12), which can be developed as
1  
E[Z3 ] = ˆ
E tr Y Q T  ˆ T QY T
T 2 T̂
1 !
n
 
= E tr Y Qσi σ̂iT σ̂j σjT QY T
T 2 T̂ i,j =1

!
n   σ̂j σjT Q−j 
1 Q−i σi σ̂iT
= E tr Y YT
T 2 T̂ i,j =1 1 + T1 σiT Q−i σi 1 + T1 σjT Q−j σj
n !   
1 ! Q−i σi σ̂iT σ̂j σjT Q−j
= E tr Y Y T
T 2 T̂ i=1 j =i 1 + T1 σiT Q−i σi 1 + T1 σjT Q−j σj
  
1 ! n
Q−i σi σ̂iT σ̂i σiT Q−i T
+ E tr Y Y ≡ Z31 + Z32 .
T 2 T̂ i=1 (1 + T1 σiT Q−i σi )2
In the term Z32 , reproducing the proof of Lemma 1 with the condition X̂
σ̂ T σ̂
bounded, we obtain that i i concentrates around 1 tr X̂X̂ , which allows us to
T̂ T̂
write
  
1 ! n
Q−i σi tr(X̂X̂ )σiT Q−i T
Z32 = E tr Y Y
T 2 T̂ i=1 (1 + T1 σiT Q−i σi )2
  
1 ! n
Q−i σi (σ̂iT σ̂i − tr T̂ )σiT Q−i T
+ E tr Y Y
T 2 T̂ i=1 (1 + T1 σiT Q−i σi )2
  
1 tr(X̂X̂ ) !n
Q−i σi σiT Q−i
= 2 E tr Y YT
T T̂ i=1 (1 + 1 T
σ
T i Q σ
−i i )2
1236 C. LOUART, Z. LIAO AND R. COUILLET
    
1 !
n
σ̂ T σ̂i − tr T̂
+ 2 E tr Y Qσi i σiT QY T
T i=1 T̂
≡ Z321 + Z322

with D = diag({ 1 σiT σ̂i − 1


tr T̂ T̂ }ni=1 ) and thus Z322 can be rewritten as
T̂ T̂
  
1 Q T Q  1
Z322 = E tr Y √ D √ Y T = O nε− 2
T T T
while for Z321 , following the same arguments as previously, we have
  
1 tr X̂X̂ !
n
Q−i σi σiT Q−i
Z321 = E tr Y Y T
T 2 T̂ i=1 (1 + T1 σiT Q−i σi )2
1 tr X̂X̂ !
n
1  
= E tr Y Q−i σi σiT Q−i Y T
T 2
T̂ i=1 (1 + δ)2

1 tr X̂X̂ !
n
1
+
T 2
T̂ i=1 (1 + δ)2
   2 
 1
× E tr Y Qσi σi QY T T
(1 + δ) − 1 + σiT Q−i σi
2
T
1 tr X̂X̂ !
n
1  
= E tr Y Q−i X Q−i Y T
T 2
T̂ i=1 (1 + δ)2

1 tr X̂X̂ !
n
1  
+ E tr Y Q T DQY T
T 2
T̂ i=1 (1 + δ)2

n   tr(X̂X̂ )  1
= 2
E tr Y Q− X Q− Y T + O nε− 2 ,
T T̂ (1 + δ) 2

where D = diag({(1 + δ)2 − (1 + T1 σiT Q−i σi )2 }ni=1 ).


1
Since E[Q− AQ− ] = E[QAQ] + O· (nε− 2 ), we are free to plug in the asymp-
totic equivalent of E[QAQ] derived in Section 5.2.3, and we deduce
   
X Q̄ · n tr( X Q̄X Q̄)
1
n Q̄ tr(X̂X̂ )
Z32 = E tr Y Q̄X Q̄ + YT
T 2
1 − n tr( X Q̄2 )
1 2
T̂ (1 + δ)2
1 T
tr(Y Q̄ X Q̄Y ) 1  ε− 1
= n
tr( X̂X̂ ) + O n 2 .
1 − tr(1
n
2 2
X Q̄ ) T̂
The term Z31 of the double sum over i and j (j = i) needs more efforts. To
handle this term, we need to remove the dependence of both σi and σj in Q in
A RANDOM MATRIX APPROACH TO NEURAL NETWORKS 1237

sequence. We start with j as follows:


n !   
1 ! σ̂j σjT Q−j
Z31 = E tr Y Qσi σ̂iT Y T
T 2 T̂ i=1 j =i 1 + T1 σjT Q−j σj
n !   
1 ! σ̂j σjT Q−j
= E tr Y Q−j σi σ̂iT
Y T
T 2 T̂ i=1 j =i 1 + T1 σjT Q−j σj
n !   Q σ σ T Q σ σ̂ T 
1 ! −j j j −j i i σ̂j σjT Q−j
− E tr Y Y T
T 3 T̂ i=1 j =i 1 + T1 σjT Q−j σj 1 + T1 σjT Q−j σj

≡ Z311 − Z312 ,
where in the previous to last inequality we used the relation
Q−j σj σjT Q−j
Q = Q−j − .
1 + T1 σjT Q−j σj

For Z311 , we replace 1 + T1 σjT Q−j σj by 1 + δ and take expectation over wj :


n !   
1 ! σ̂j σjT Q−j
Z311 = E tr Y Q−j σi σ̂iT Y T
T 2 T̂ i=1 j =i 1 + T1 σjT Q−j σj
  Q T  
1 ! −j −j ˆ −j σ̂j σj Q−j T
n T
= E tr Y Y
T 2 T̂ j =1 1 + T1 σjT Q−j σj

1 1 ! n
  T ˆ
= E tr Y Q−j −j −j σ̂j σjT Q−j Y T
T T̂ 1 + δ j =1
2

  Q T  
−j −j ˆ −j σ̂j σj Q−j (δ − T σj Q−j σj ) T
1 T
1 ! n T
1
+ E tr Y Y
T 2 T̂ 1 + δ j =1 1 + T1 σjT Q−j σj

≡ Z3111 + Z3112 .
n
The idea to handle Z3112 is to retrieve forms of the type T
j =1 dj σ̂j σj
ˆ T D
=
1
for some D satisfying D ≤ nε− 2 with high probability. To this end, we use
T ˆ ˆ
−j −j T σj σ̂jT
Q−j = Q−j − Q−j
T T T
Tˆ T
Qσj σj Q  T  ˆ σj σ̂jT
=Q + − Q−j
T 1 − T1 σjT Qσj T T
1238 C. LOUART, Z. LIAO AND R. COUILLET

and thus Z3112 can be expanded as the sum of three terms that shall be studied in
order:
  Q T  
−j −j ˆ −j σ̂j σj Q−j (δ − T σj Q−j σj ) T
1 T
1 ! n T
1
Z3112 = E tr Y Y
T 2 T̂ 1 + δ j =1 1 + T1 σjT Q−j σj
  ˆ T 
1 1 T
= E tr Y Q ˆ DQY T
T T̂ 1 + δ T
  Qσ σ T Q T 
ˆ σ̂j (δ − 1 σ T Q−j σj )σ T Q 
1 1 ! n
j j T j j
+ E tr Y YT
T T̂ 1 + δ j =1 T (1 − T σj Qσj )
1 T

   
1 !
1 n
1
− E tr Y Qσj σ̂jT σ̂j σjT Q δ − σjT Q−j σj
2
T T̂ 1 + δ j =1
T
  
1
× 1 + σjT Q−j σj Y T
T
≡ Z31121 + Z31122 − Z31123 ,
1
where D = diag({δ − T σj Q−j σj }i=1 ).
1 T n First, Z31121 is of order O(nε− 2 ) since
T ˆ
Q T  is of bounded operator norm. Subsequently, Z31122 can be rewritten as
  
1 1  T D  1
Z31122 = E tr Y Q QY T = O nε− 2
T̂ 1 + δ T
with here
   T ˆ
−j −j 
1 T 1
D = diag δ−
σj Q−j σj tr Q−j X̂X
T T T

1 1
+ tr(Q−j ) tr X̂X̂
T T
. 1

1
n
1 − σjT Qσj 1 + σjT Q−j σj .
T T i=1
The same arguments apply for Z31123 but for
   n
tr X̂X̂ 1 T 1 T
D = diag δ − σj Q−j σj 1 + σj Q−j σj ,
T T T i=1
1
which completes to show that |Z3112 | ≤ Cnε− 2 and thus
 1
Z311 = Z3111 + O nε− 2
1 1 ! n
  T ˆ  1
= E tr Y Q−j −j −j σ̂j σjT Q−j Y T + O nε− 2 .
T 2 T̂ 1 + δ j =1
A RANDOM MATRIX APPROACH TO NEURAL NETWORKS 1239

It remains to handle Z3111 . Under the same claims as above, we have


  T ˆ 
1 1 ! n −j −j
Z3111 = E tr Y Q−j X̂X Q−j Y T
T T̂ 1 + δ j =1 T
  
1 1 ! n !
σi σ̂iT
= E tr Y Q−j X̂X Q−j Y T
T T̂ 1 + δ j =1 i =j
T
  
1 1 ! n !
Q−ij σi σ̂iT
= E tr Y X̂X Q−ij Y T
2
T T̂ 1 + δ j =1 i =j 1 + T σi Q−ij σi
1 T

 
1 1 ! n !
Q−ij σi σ̂iT
− E tr Y
T 3 T̂ 1 + δ j =1 i =j 1 + T1 σiT Q−ij σi

Q−ij σi σiT Q−ij
× X̂X YT
1 + T1 σiT Q−ij σi
≡ Z31111 − Z31112 ,

where we introduced the notation Q−ij = ( T1  T  − T1 σi σiT − T1 σj σjT + γ IT )−1 .


For Z31111 , we replace T1 σiT Q−ij σi by δ, and take the expectation over wi , as
follows:
  
1 1 ! n !
Q−ij σi σ̂iT
Z31111 = E tr Y X̂X Q−ij Y T
2
T T̂ 1 + δ j =1 i =j 1 + T σi Q−ij σi
1 T

1 1 ! n !
 
= E tr Y Q−ij σi σ̂iT X̂X Q−ij Y T
2
T T̂ (1 + δ)2
j =1 i =j

1 1
+
T 2 T̂ (1 + δ)2
n !  
! 
Q−ij σi σ̂iT (δ − T1 σiT Q−ij σi )
× E tr Y X̂X Q−ij Y T
j =1 i =j 1 + T1 σiT Q−ij σi

n2 1  
= E tr Y Q−− XX̂ X̂X Q−− Y T
2
T T̂ (1 + δ)2

!n !    
1 1 1
+ E tr Y Q−j σi σ̂iT δ − σiT Q−ij σi
T T̂ (1 + δ) j =1 i =j
2 2 T

× X̂X Q−j Y T
1240 C. LOUART, Z. LIAO AND R. COUILLET

!n !  
1 1
+ E tr Y Q−j σi σ̂iT
T T̂ (1 + δ) j =1 i =j
2 2

 
Q−j T1 σi σiT Q−j 1
× X̂X Y T
δ − σiT Q−ij σi
1 − T1 σiT Q−j σi T
n2 1  
= E tr Y Q−− XX̂ X̂X Q−− Y T
T 2 T̂ (1 + δ)2

1 1 !n
 
+ T
E tr Y Q−j −j ˆ −j  Q−j Y T
D
T T̂ (1 + δ) j =1
2 2 X̂X

n 1 !n
  1
+ T
E Y Q−j −j D  −j Q−j Y T + O nε− 2
T T̂ (1 + δ) j =1
2 2

n2 1    ε− 1
= E tr Y Q−−   Q−− Y T
+ O n 2
T 2 T̂ (1 + δ)2 X X̂ X̂X

with Q−− having the same law as Q−ij , D = diag({δ − T σi Q−ij σi }i=1 )
1 T n

(δ− T1 σiT Q−ij σi ) T1 tr(X̂X Q−ij XX̂ )


and D  = diag{ }ni=1 , both expected to be of order
(1− T1 σiT Q−j σi )(1+ T1 σiT Q−ij σi )
1
O(nε− 2 ). Using again the asymptotic equivalent of E[QAQ] devised in Sec-
tion 5.2.3, we then have
n2 1    1
Z31111 = E tr Y Q−− XX̂ X̂X Q−− Y T + O nε− 2
T 2 T̂ (1 + δ)2

1 
= tr Y Q̄ X X̂ X̂X Q̄Y
T

1 T
1 n tr(Y Q̄ X Q̄Y )
+ tr( X Q̄ X X̂ X̂X Q̄)
T̂ 1 − n1 tr( X
2 Q̄2 )

 1
+ O nε− 2 .
Following the same principle, we deduce for Z31112 that
  
1 1 ! n !
Q−ij σi σ̂iT Q−ij σi σiT Q−ij T
Z31112 = E tr Y  Y
T 3 T̂ 1 + δ j =1 i =j 1 + T1 σiT Q−ij σi X̂X 1 + T1 σiT Q−ij σi
!n !  
1 1  1
= E tr Y Q−ij σi σiT Q−ij Y T tr(X̂X Q−ij XX̂ )
T T̂ (1 + δ) j =1 i =j
3 3 T

1 1 !n !
   1
+ E tr Y Q−j σi Di σiT Q−j Y T + O nε− 2
T T̂ (1 + δ) j =1 i =j
3 3
A RANDOM MATRIX APPROACH TO NEURAL NETWORKS 1241
 
n2 1  1
= E tr Y Q−− X Q−− Y T tr(X̂X Q−− XX̂ )
3
T T̂ 1 + δ T
 1
+ O nε− 2
1 T
1 n tr(Y Q̄ X Q̄Y )  1
= tr( X̂X Q̄ X X̂ ) + O nε− 2
T̂ 1 − n tr( X Q̄ )
1 2 2

with Di = 1
T tr(X̂X Q−ij XX̂ )[(1 + δ)2 − (1 + T1 σiT Q−ij σi )2 ], also believed to
1 1
be of order O(nε− 2 ). Recalling the fact that Z311 = Z3111 + O(nε− 2 ), we can thus
conclude for Z311 that
1 
Z311 = tr Y Q̄ X X̂ X̂X Q̄Y
T

1 T
1 tr(Y Q̄ X Q̄Y )
+ tr( X Q̄ X X̂ X̂X Q̄)
n
T̂ 1 − n1 tr( 2 2
X Q̄ )
1 T
1 n tr(Y Q̄ X Q̄Y )  1
− tr( X̂X Q̄ X X̂ ) + O nε− 2 .
T̂ 1 − n tr( X Q̄2 )
1 2

As for Z312 , we have


n !   Q σ σ T Q σ σ̂ T 
1 ! −j j j −j i i σ̂j σjT Q−j
Z312 = E tr Y Y T
T 3 T̂ i=1 j =i 1 + T1 σjT Q−j σj 1 + T1 σjT Q−j σj
  Q σ σ TQ T  
1 ! −j j j −j −j ˆ −j σ̂j σjT Q−j
n
= E tr Y Y T
.
T 3 T̂ j =1 1 + T1 σjT Q−j σj 1 + T1 σjT Q−j σj

T ˆ
Since Q−j T1 −j −j is expected to be of bounded norm, using the concentration
T ˆ
1 T −j −j
inequality of the quadratic form σ
T j Q−j T σ̂j , we infer
  Q σ σ TQ Y T 
1 ! n
−j j j −j
Z312 = E tr Y
T T̂ j =1 (1 + T1 σjT Q−j σj )2
 
1  T ˆ  1
× 2
tr Q−j −j −j X̂X + O nε− 2
T
  Q σ σ T Q Y T  
1 ! n
−j j j −j 1  T ˆ
= E tr Y tr Q−j −j −j X̂X
T T̂ j =1 (1 + T1 σjT Q−j σj )2 T 2
 1
+ O nε− 2 .
1242 C. LOUART, Z. LIAO AND R. COUILLET

1 T
We again replace T σj Q−j σj by δ and take expectation over wj to obtain
!n  
1 1  1  T ˆ
Z312 = E tr Y Q−j σj σjT Q−j Y T 2 tr Q−j −j −j X̂X
T T̂ (1 + δ) j =1
2 T

1 1
+
T T̂ (1 + δ)2
!
n  tr(Y Q σ D σ T Q Y T ) 
−j j j j −j 1  T ˆ
× E tr Q−j −j −j X̂X
j =1 (1 + T1 σjT Q−j σj )2 T 2

 1
+ O nε− 2
 
n 1  1  T ˆ
= E tr Y Q− X Q− Y T 2 tr Q− − − X̂X
T T̂ (1 + δ)2 T
 
1 1  T 1  T ˆ  ε− 1
+ E tr Y Q T
DQY tr Q− − − X̂X + O n 2
  
T T̂ (1 + δ) 2 T 2
1
with Dj = (1 + δ)2 − (1 + T1 σjT Q−j σj )2 = O(nε− 2 ), which eventually brings the
second term to vanish, and we thus get
 
n 1  T 1  T ˆ  1
Z312 = E tr Y Q− X Q− Y tr Q− − − X̂X + O nε− 2 .
T T̂ (1 + δ) 2 T 2

1 T ˆ
For the term T 2 tr(Q− − − X̂X ) we apply again the concentration inequality
to get
1  T ˆ
tr Q− − − X̂X
T2
1 ! 
= 2 tr Q−j σi σ̂iT X̂X
T i =j
 
1 ! Q−ij σi σ̂iT
= 2 tr 
T i =j 1 + T1 σiT Q−ij σi X̂X
1 1 ! 
= tr Q−ij σi σ̂iT X̂X
T 2 1 + δ i =j
 
1 1 ! Q−ij σi σ̂iT (δ − T1 σiT Q−ij σi )
+ tr X̂X
T 2 1 + δ i =j 1 + T1 σiT Q−ij σi
n−1 1 
= tr X̂X E[Q−− ]XX̂
T 1+δ
2

1 1  
ˆ −j 
1
+ 2 T
tr Q−j −j D + O nε− 2
T 1+δ X̂X
A RANDOM MATRIX APPROACH TO NEURAL NETWORKS 1243

with high probability, where D = diag({δ − T1 σiT Q−ij σi }ni=1 ), the norm of which
1
is of order O(nε− 2 ). This entails
1  T ˆ n 1   1
tr Q− − − X̂X = 2 tr X̂X E[Q−− ]XX̂ + O nε− 2
T 2 T 1+δ
with high probability. Once more plugging the asymptotic equivalent of E[QAQ]
deduced in Section 5.2.3, we conclude for Z312 that
1 T
1 n tr(Y Q̄ X Q̄Y )  1
Z312 = tr( X̂X Q̄ X X̂ ) + O nε− 2
T̂ 1 − n tr( X Q̄ )
1 2 2

and eventually for Z31


1 T
1  1 n tr(Y Q̄ X Q̄Y )
Z31 = tr Y Q̄ X X̂ X̂X Q̄Y
T
+ tr( X Q̄ X X̂ X̂X Q̄)
T̂ T̂ 1 − n1 tr( X
2 Q̄2 )

1 T
2 n tr(Y Q̄ X Q̄Y )  1
− tr( X̂X Q̄ X X̂ ) + O nε− 2 .
T̂ 1 − n tr( X Q̄ )
1 2 2

Combining the estimates of E[Z_2] as well as Z_31 and Z_32, we finally have the estimate for the test error defined in (12) as

E_test = (1/T̂) ‖ Ŷ^T − Ψ_{XX̂}^T Q̄ Y^T ‖_F²
         + [ (1/n) tr(Y Q̄ Ψ_X Q̄ Y^T) / (1 − (1/n) tr(Ψ_X² Q̄²)) ]
           × [ (1/T̂) tr Ψ_{X̂X̂} + (1/T̂) tr( Ψ_X Q̄ Ψ_{XX̂} Ψ_{X̂X} Q̄ ) − (2/T̂) tr( Ψ_{X̂X} Q̄ Ψ_{XX̂} ) ]
         + O(n^{ε−1/2}).

Since by definition Q̄ = (Ψ_X + γ I_T)^{−1}, we may use

Ψ_X Q̄ = (Ψ_X + γ I_T − γ I_T)(Ψ_X + γ I_T)^{−1} = I_T − γ Q̄

in the second term in brackets to finally retrieve the form of Conjecture 1.
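The algebraic step Ψ_X Q̄ = I_T − γ Q̄ is a pure resolvent identity and can be sanity-checked numerically; below is a minimal sketch (illustrative, not part of the proof) in which Ψ_X is replaced by an arbitrary nonnegative definite stand-in matrix:

```python
import numpy as np

# Check that Psi Q_bar = I_T - gamma Q_bar for Q_bar = (Psi + gamma I_T)^{-1},
# with Psi an arbitrary nonnegative definite stand-in for Psi_X.
rng = np.random.default_rng(0)
T, gamma = 5, 0.7
A = rng.standard_normal((T, T))
Psi = A @ A.T                                   # nonnegative definite
Q_bar = np.linalg.inv(Psi + gamma * np.eye(T))  # Q_bar = (Psi + gamma I_T)^{-1}
assert np.allclose(Psi @ Q_bar, np.eye(T) - gamma * Q_bar)
```

The identity holds for any γ > 0 since Ψ_X Q̄ = (Ψ_X + γ I_T − γ I_T) Q̄.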
6. Concluding remarks. This article provides a possible direction of exploration of random matrices involving entry-wise nonlinear transformations [here through the function σ(·)], as typically found in modelling neural networks, by means of a concentration of measure approach. The main advantage of the method is that it leverages the concentration of an initial random vector w (here a Lipschitz function of a Gaussian vector) to transfer concentration to every vector σ (or matrix Σ) that is a Lipschitz function of w. This induces that Lipschitz functionals of σ (or Σ) further satisfy concentration inequalities, and thus, if the Lipschitz parameter scales with n, convergence results as n → ∞. With this in mind, note that we could have generalized our input–output model z = β^T σ(W x) of Section 2 to

z = β^T σ(x; W)

for σ : R^p × P → R^n with P some probability space and W ∈ P a random variable such that σ(x; W) and σ(X; W) [where σ(·) is here applied column-wise] satisfy a concentration of measure phenomenon; it is not even necessary that σ(X; W) has a normal concentration so long as the corresponding concentration function allows for appropriate convergence results. This generalized setting however has the drawback of being less explicit and less practical (as most neural networks involve linear maps W x rather than nonlinear maps of W and x).
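The driving mechanism, that Lipschitz functionals of the Gaussian matrix W concentrate at a rate improving with the dimensions, can be illustrated with a toy Monte Carlo experiment (a sketch only; the choice σ = ReLU and the specific functional are ours, for illustration):

```python
import numpy as np

# Toy illustration: the 1/sqrt(nT)-normalized Frobenius norm of sigma(W X),
# a Lipschitz functional of the Gaussian matrix W, fluctuates less and less
# as n, p grow together (here sigma = ReLU, X of bounded operator norm).
rng = np.random.default_rng(0)

def fluctuation(n, p, T, reps=200):
    X = rng.standard_normal((p, T)) / np.sqrt(p)   # data matrix of bounded norm
    vals = []
    for _ in range(reps):
        W = rng.standard_normal((n, p))
        vals.append(np.linalg.norm(np.maximum(W @ X, 0)) / np.sqrt(n * T))
    return np.std(vals)

small, large = fluctuation(20, 20, 30), fluctuation(320, 320, 30)
assert large < small  # fluctuations shrink as the dimensions grow
```

The observed standard deviation decays roughly like 1/√n, consistent with a Lipschitz constant of order ‖X‖/√(nT) for the functional W ↦ (1/√(nT))‖σ(WX)‖_F.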
A much less demanding generalization though would consist in changing the
vector w ∼ Nϕ (0, Ip ) for a vector w still satisfying an exponential (not necessarily
normal) concentration. This is the case notably if w = ϕ(w̃) with ϕ(·) a Lipschitz
map with Lipschitz parameter bounded by, say, log(n) or any small enough power
of n. This would then allow for w with heavier than Gaussian tails.
Despite its simplicity, the concentration method also has some strong limitations that presently do not allow for a sufficiently profound analysis of the testing mean square error. We believe that Conjecture 1 can be proved by means of more elaborate methods. Notably, we believe that the powerful Gaussian method advertised in Pastur and Ŝerbina (2011), which relies on Stein's lemma and the Poincaré–Nash inequality, could provide a refined control of the residual terms involved in the derivation of Conjecture 1. However, since Stein's lemma [which states that E[xφ(x)] = E[φ′(x)] for x ∼ N(0, 1) and differentiable polynomially bounded φ] can only be used on products xφ(x) involving the linear component x, the latter is not directly accessible; we nonetheless believe that appropriate ansatzes of Stein's lemma, adapted to the nonlinear setting and currently under investigation, could be exploited.
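Stein's lemma itself is easy to verify by Monte Carlo; a minimal sketch (with the illustrative choice φ(x) = x³, so that φ′(x) = 3x²):

```python
import numpy as np

# Monte Carlo check of Stein's lemma E[x phi(x)] = E[phi'(x)] for x ~ N(0,1),
# with phi(x) = x^3 and hence phi'(x) = 3 x^2; both sides estimate E[x^4] = 3.
rng = np.random.default_rng(42)
x = rng.standard_normal(1_000_000)
lhs = np.mean(x * x**3)   # estimates E[x phi(x)] = E[x^4] = 3
rhs = np.mean(3 * x**2)   # estimates E[phi'(x)] = 3 E[x^2] = 3
assert abs(lhs - rhs) < 0.05
```

With 10⁶ samples both estimates agree with the exact value 3 to within a few hundredths.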
As a striking example, one key advantage of such a tool would be the possibility to evaluate expectations of the type Z = E[σσ^T((1/T)σ^T Q_− σ − α)] which, in our present analysis, was shown to be bounded in the order of symmetric matrices by Cn^{ε−1/2} Φ with high probability. Thus, if no matrix (such as Q̄) pre-multiplies Z, since ‖Φ‖ can grow as large as O(n), ‖Z‖ cannot be shown to vanish. But such a bound does not account for the fact that ‖Φ‖ would in general be unbounded because of the term σ̄σ̄^T in the display Φ = σ̄σ̄^T + E[(σ − σ̄)(σ − σ̄)^T], where σ̄ = E[σ]. Intuitively, the "mean" contribution σ̄σ̄^T of σσ^T, being post-multiplied in Z by (1/T)σ^T Q_− σ − α (which averages to zero), disappears; and thus only smaller order terms remain. We believe that the aforementioned ansatzes for the Gaussian tools would be capable of subtly handling this self-averaging effect on Z to prove that Z vanishes [for σ(t) = t, it is simple to show that ‖Z‖ ≤ Cn^{−1}]. In addition, Stein's lemma-based methods only require the differentiability of σ(·), which need not be Lipschitz, thereby allowing for a larger class of activation functions.
As suggested in the simulations of Figure 2, our results also seem to extend to noncontinuous functions σ(·). To date, we cannot envision a method allowing us to tackle this setting.
In terms of neural network applications, the present article is merely a first step towards a better understanding of the "hardening" effect occurring in large dimensional networks with numerous samples and large data points (i.e., simultaneously large n, p, T), which we exemplified here through the convergence of mean-square errors. The mere fact that some standard performance measures of these random networks "freeze" as n, p, T grow in the predicted regime, and that the performance heavily depends on the distribution of the random entries, is already in itself an interesting result for neural network understanding and dimensioning. However, more interesting questions remain open. Since neural networks are today dedicated to classification rather than regression, a first question is the study of the asymptotic statistics of the output z = β^T σ(W x) itself; we believe that z satisfies a central limit theorem with mean and covariance allowing for an assessment of the asymptotic misclassification rate.
A further extension of the present work would be to go beyond the single-layer
network and include multiple layers (finitely many or possibly a number scaling
with n) in the network design. The interest here would be on the key question of
the best distribution of the number of neurons across the successive layers.
It is also classical in neural networks to introduce different (possibly random)
biases at the neuron level, thereby turning σ (t) into σ (t + b) for a random variable
b different for each neuron. This has the effect of mitigating the negative impact
of the mean E[σ (wiT xj )], which is independent of the neuron index i.
Finally, neural networks, despite having recently been shown to operate almost equally well when taken random in some very specific scenarios, are usually only initialized as random networks before being subsequently trained through backpropagation of the error on the training dataset (i.e., essentially through convex gradient descent). We believe that our framework can allow for the understanding of at least finitely many steps of gradient descent, which may then provide further insights into the overall performance of deep learning networks.

APPENDIX: INTERMEDIARY LEMMAS

This section recalls some elementary algebraic relations and identities used throughout the proof section.

LEMMA 5 (Resolvent identity). For invertible matrices A and B,

A^{−1} − B^{−1} = A^{−1}(B − A)B^{−1}.
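A quick numerical check of Lemma 5 (illustrative only, with diagonally shifted random matrices to ensure invertibility):

```python
import numpy as np

# Numerical check of the resolvent identity A^{-1} - B^{-1} = A^{-1}(B - A)B^{-1}.
rng = np.random.default_rng(1)
n = 4
A = rng.standard_normal((n, n)) + n * np.eye(n)  # diagonal shift -> invertible
B = rng.standard_normal((n, n)) + n * np.eye(n)
Ai, Bi = np.linalg.inv(A), np.linalg.inv(B)
assert np.allclose(Ai - Bi, Ai @ (B - A) @ Bi)
```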
LEMMA 6 (A rank-1 perturbation identity). For A Hermitian, v a vector and t ∈ R, if A and A + tvv^T are invertible, then

(A + tvv^T)^{−1} v = A^{−1}v / (1 + t v^T A^{−1} v).
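Lemma 6 (a Sherman–Morrison-type identity) can likewise be verified numerically; a minimal sketch:

```python
import numpy as np

# Check of Lemma 6: (A + t v v^T)^{-1} v = A^{-1} v / (1 + t v^T A^{-1} v).
rng = np.random.default_rng(2)
n, t = 4, 0.5
M = rng.standard_normal((n, n))
A = M @ M.T + np.eye(n)          # symmetric (Hermitian) and invertible
v = rng.standard_normal(n)
Ai = np.linalg.inv(A)
lhs = np.linalg.inv(A + t * np.outer(v, v)) @ v
rhs = Ai @ v / (1 + t * v @ Ai @ v)
assert np.allclose(lhs, rhs)
```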
LEMMA 7 (Operator norm control). For nonnegative definite A and z ∈ C \ R^+,

‖(A − zI_T)^{−1}‖ ≤ dist(z, R^+)^{−1},
‖A(A − zI_T)^{−1}‖ ≤ 1,

where dist(z, R^+) denotes the distance from the point z to the set R^+. In particular, for γ > 0, ‖(A + γ I_T)^{−1}‖ ≤ γ^{−1} and ‖A(A + γ I_T)^{−1}‖ ≤ 1.
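The particular case γ > 0 of Lemma 7, used repeatedly throughout the proofs, is easy to check numerically (illustrative sketch):

```python
import numpy as np

# Check of the operator norm bounds of Lemma 7 for gamma > 0:
# ||(A + gamma I)^{-1}|| <= 1/gamma and ||A (A + gamma I)^{-1}|| <= 1.
rng = np.random.default_rng(3)
n, gamma = 6, 0.3
M = rng.standard_normal((n, n))
A = M @ M.T                       # nonnegative definite
Q = np.linalg.inv(A + gamma * np.eye(n))
assert np.linalg.norm(Q, 2) <= 1 / gamma + 1e-12       # spectral norm bound
assert np.linalg.norm(A @ Q, 2) <= 1 + 1e-12
```

Both bounds follow by spectral calculus: the eigenvalues of the two matrices are 1/(λ + γ) and λ/(λ + γ) for λ ≥ 0.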
REFERENCES

AKHIEZER, N. I. and GLAZMAN, I. M. (1993). Theory of Linear Operators in Hilbert Space. Dover, New York. MR1255973
BAI, Z. D. and SILVERSTEIN, J. W. (1998). No eigenvalues outside the support of the limiting spectral distribution of large-dimensional sample covariance matrices. Ann. Probab. 26 316–345. MR1617051
BAI, Z. D. and SILVERSTEIN, J. W. (2007). On the signal-to-interference-ratio of CDMA systems in wireless communications. Ann. Appl. Probab. 17 81–101. MR2292581
BAI, Z. and SILVERSTEIN, J. W. (2010). Spectral Analysis of Large Dimensional Random Matrices, 2nd ed. Springer, New York. MR2567175
BENAYCH-GEORGES, F. and NADAKUDITI, R. R. (2012). The singular values and vectors of low rank perturbations of large rectangular random matrices. J. Multivariate Anal. 111 120–135.
CAMBRIA, E., GASTALDO, P., BISIO, F. and ZUNINO, R. (2015). An ELM-based model for affective analogical reasoning. Neurocomputing 149 443–455.
CHOROMANSKA, A., HENAFF, M., MATHIEU, M., AROUS, G. B. and LECUN, Y. (2015). The loss surfaces of multilayer networks. In AISTATS.
COUILLET, R. and BENAYCH-GEORGES, F. (2016). Kernel spectral clustering of large dimensional data. Electron. J. Stat. 10 1393–1454.
COUILLET, R., HOYDIS, J. and DEBBAH, M. (2012). Random beamforming over quasi-static and fading channels: A deterministic equivalent approach. IEEE Trans. Inform. Theory 58 6392–6425. MR2982669
COUILLET, R. and KAMMOUN, A. (2016). Random matrix improved subspace clustering. In 2016 Asilomar Conference on Signals, Systems, and Computers.
COUILLET, R., PASCAL, F. and SILVERSTEIN, J. W. (2015). The random matrix regime of Maronna's M-estimator with elliptically distributed samples. J. Multivariate Anal. 139 56–78.
EL KAROUI, N. (2009). Concentration of measure and spectra of random matrices: Applications to correlation matrices, elliptical distributions and beyond. Ann. Appl. Probab. 19 2362–2405. MR2588248
EL KAROUI, N. (2010). The spectrum of kernel random matrices. Ann. Statist. 38 1–50. MR2589315
EL KAROUI, N. (2013). Asymptotic behavior of unregularized and ridge-regularized high-dimensional robust regression estimators: Rigorous results. Preprint. Available at arXiv:1311.2445.
GIRYES, R., SAPIRO, G. and BRONSTEIN, A. M. (2016). Deep neural networks with random Gaussian weights: A universal classification strategy? IEEE Trans. Signal Process. 64 3444–3457. MR3515693
HORNIK, K., STINCHCOMBE, M. and WHITE, H. (1989). Multilayer feedforward networks are universal approximators. Neural Networks 2 359–366.
HUANG, G.-B., ZHU, Q.-Y. and SIEW, C.-K. (2006). Extreme learning machine: Theory and applications. Neurocomputing 70 489–501.
HUANG, G.-B., ZHOU, H., DING, X. and ZHANG, R. (2012). Extreme learning machine for regression and multiclass classification. IEEE Trans. Syst., Man, Cybern. B, Cybern. 42 513–529.
JAEGER, H. and HAAS, H. (2004). Harnessing nonlinearity: Predicting chaotic systems and saving energy in wireless communication. Science 304 78–80.
KAMMOUN, A., KHAROUF, M., HACHEM, W. and NAJIM, J. (2009). A central limit theorem for the SINR at the LMMSE estimator output for large-dimensional signals. IEEE Trans. Inform. Theory 55 5048–5063.
KRIZHEVSKY, A., SUTSKEVER, I. and HINTON, G. E. (2012). Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 1097–1105.
LECUN, Y., CORTES, C. and BURGES, C. (1998). The MNIST database of handwritten digits.
LEDOUX, M. (2005). The Concentration of Measure Phenomenon 89. Amer. Math. Soc., Providence, RI. MR1849347
LIAO, Z. and COUILLET, R. (2017). A large dimensional analysis of least squares support vector machines. J. Mach. Learn. Res. To appear. Available at arXiv:1701.02967.
LOUBATON, P. and VALLET, P. (2010). Almost sure localization of the eigenvalues in a Gaussian information plus noise model. Application to the spiked models. Electron. J. Probab. 16 1934–1959.
MAI, X. and COUILLET, R. (2017). The counterintuitive mechanism of graph-based semi-supervised learning in the big data regime. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP'17).
MARČENKO, V. A. and PASTUR, L. A. (1967). Distribution of eigenvalues for some sets of random matrices. Math. USSR, Sb. 1 457–483.
PASTUR, L. and ŜERBINA, M. (2011). Eigenvalue Distribution of Large Random Matrices. Amer. Math. Soc., Providence, RI. MR2808038
RAHIMI, A. and RECHT, B. (2007). Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems 1177–1184.
ROSENBLATT, F. (1958). The perceptron: A probabilistic model for information storage and organization in the brain. Psychol. Rev. 65 386–408.
RUDELSON, M. and VERSHYNIN, R. (2013). Hanson–Wright inequality and sub-Gaussian concentration. Electron. Commun. Probab. 18 1–9.
SAXE, A., KOH, P. W., CHEN, Z., BHAND, M., SURESH, B. and NG, A. Y. (2011). On random weights and unsupervised feature learning. In Proceedings of the 28th International Conference on Machine Learning (ICML-11) 1089–1096.
SCHMIDHUBER, J. (2015). Deep learning in neural networks: An overview. Neural Netw. 61 85–117.
SILVERSTEIN, J. W. and BAI, Z. D. (1995). On the empirical distribution of eigenvalues of a class of large dimensional random matrices. J. Multivariate Anal. 54 175–192.
SILVERSTEIN, J. W. and CHOI, S. (1995). Analysis of the limiting spectral distribution of large dimensional random matrices. J. Multivariate Anal. 54 295–309.
TAO, T. (2012). Topics in Random Matrix Theory 132. Amer. Math. Soc., Providence, RI.
TITCHMARSH, E. C. (1939). The Theory of Functions. Oxford Univ. Press, New York.
VERSHYNIN, R. (2012). Introduction to the non-asymptotic analysis of random matrices. In Compressed Sensing 210–268. Cambridge Univ. Press, Cambridge.
WILLIAMS, C. K. I. (1998). Computation with infinite neural networks. Neural Comput. 10 1203–1216.
YATES, R. D. (1995). A framework for uplink power control in cellular radio systems. IEEE J. Sel. Areas Commun. 13 1341–1347.
ZHANG, T., CHENG, X. and SINGER, A. (2014). Marchenko–Pastur law for Tyler's and Maronna's M-estimators. Available at http://arxiv.org/abs/1401.3424.
LABORATOIRE DES SIGNAUX ET SYSTÈMES
CENTRALESUPÉLEC
UNIVERSITY OF PARIS-SACLAY
3, RUE JOLIOT-CURIE
91192 GIF-SUR-YVETTE
FRANCE
E-MAIL: [email protected]
[email protected]
[email protected]