These findings provide new insights into the roles played by the acti-
vation function σ(·) and the random distribution of the entries of W in
random feature maps as well as by the ridge-regression parameter γ in the
neural network performance. We notably exhibit and prove some peculiar
behaviors, such as the impossibility for the network to carry out elementary
Gaussian mixture classification tasks, when either the activation function or
the random weights distribution is ill chosen.
Besides, for the practitioner, the theoretical formulas retrieved in this
work allow for a fast offline tuning of the aforementioned hyperparameters
of the neural network, notably when T is not too large compared to p.
The graphical results provided in the course of the article were notably
obtained with a 100- to 500-fold gain in computation time of the theoretical
formulas over the corresponding simulations.
where we defined Σ ≡ σ(W X). This follows from differentiating the mean
square error with respect to β to obtain $0 = \gamma\beta + \frac1T\sum_{i=1}^T\sigma(Wx_i)(\beta^T\sigma(Wx_i) - y_i)^T$,
so that $(\frac1T\Sigma\Sigma^T + \gamma I_n)\beta = \frac1T\Sigma Y^T$ which, along with $(\frac1T\Sigma\Sigma^T + \gamma I_n)^{-1}\Sigma = \Sigma(\frac1T\Sigma^T\Sigma + \gamma I_T)^{-1}$, gives the result.
In the remainder, we will also denote
$$Q \equiv \left(\frac1T\Sigma^T\Sigma + \gamma I_T\right)^{-1}$$
with which the training mean square error reads
$$E_{\rm train} = \frac1T\left\|Y^T - \Sigma^T\beta\right\|_F^2 = \frac{\gamma^2}{T}\,{\rm tr}\,Y^TYQ^2. \tag{1}$$
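As a concrete illustration of these definitions, the short sketch below builds such a single-hidden-layer random feature network and checks the two equivalent expressions in (1); the data, the dimensions and the ReLU activation are illustrative placeholders rather than the settings used in the article.

```python
import numpy as np

# Minimal sketch of the random-feature ridge regressor described above.
# Dimensions, data and the ReLU activation are illustrative placeholders.
rng = np.random.default_rng(0)
p, n, T, gamma = 50, 128, 200, 1e-2

X = rng.standard_normal((p, T))           # training inputs (p x T)
Y = np.sign(rng.standard_normal((1, T)))  # training targets (d x T), here d = 1
W = rng.standard_normal((n, p))           # random (untrained) hidden layer
Sigma = np.maximum(W @ X, 0)              # Sigma = sigma(WX) with sigma = ReLU

# Ridge-regression output layer: beta = (1/T) Sigma Q Y^T with
# Q = (Sigma^T Sigma / T + gamma I_T)^{-1}
Q = np.linalg.inv(Sigma.T @ Sigma / T + gamma * np.eye(T))
beta = Sigma @ Q @ Y.T / T

# Training mean square error and its equivalent expression (1)
E_train = np.linalg.norm(Y.T - Sigma.T @ beta, 'fro') ** 2 / T
E_train_Q = gamma ** 2 / T * np.trace(Y.T @ Y @ Q @ Q)
print(E_train, E_train_Q)   # the two expressions coincide
```

The T × T resolvent Q used here is the object around which the whole analysis revolves.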
where Σ̂ = σ(W X̂) and β is the same as used in (1) (and thus only depends
on (X, Y ) and γ). One of the key questions in the analysis of such an ele-
mentary neural network lies in the determination of γ which minimizes Etest
(and is thus said to have good generalization performance). Notably, small
γ values are known to reduce Etrain but to induce the popular overfitting is-
sue which generally increases Etest , while large γ values engender both large
values for Etrain and Etest .
From a mathematical standpoint though, the study of Etest brings for-
ward technical difficulties that do not allow for as simple a treatment
through the present concentration of measure methodology as the study of
Etrain . Nonetheless, the analysis of Etrain at least makes heuristic ap-
proaches available, which we shall exploit to propose an asymptotic deter-
ministic approximation for Etest .
W = ϕ(W̃ )
3. Main Results.
$$\left\|E[Q] - \bar Q\right\| \leq cn^{-\frac12+\varepsilon}.$$
where µ̄n is the measure defined through its Stieltjes transform $m_{\bar\mu_n}(z) \equiv \int(t-z)^{-1}d\bar\mu_n(t)$ given, for z ∈ {w ∈ C, ℑ[w] > 0}, by
$$m_{\bar\mu_n}(z) = \frac1T\,{\rm tr}\left(\frac nT\frac{\Phi}{1+\delta_z} - zI_T\right)^{-1}.$$
Note that µ̄n has a well-known form, already met in early random ma-
trix works (e.g., (Silverstein and Bai, 1995)) on sample covariance matrix
models. Notably, µ̄n is also the deterministic equivalent of the empirical spectral distribution of $\frac1T\Sigma^T\Sigma$.
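To make this definition concrete, the small sketch below compares $m_{\bar\mu_n}(z)$, obtained from the fixed point in $\delta_z$, with the empirical Stieltjes transform $\frac1T{\rm tr}(\frac1T\Sigma^T\Sigma - zI_T)^{-1}$ of one realization; the Monte Carlo estimate of Φ, the ReLU activation and all dimensions are placeholder assumptions.

```python
import numpy as np

rng = np.random.default_rng(10)
p, n, T = 40, 120, 180
X = rng.standard_normal((p, T)) / np.sqrt(p)
sigma = lambda t: np.maximum(t, 0)      # ReLU, as an example

# Monte Carlo estimate of Phi = E[sigma(X^T w) sigma(w^T X)]
M = 50_000
Phi = np.zeros((T, T))
for Wb in np.array_split(rng.standard_normal((M, p)), 25):
    S = sigma(Wb @ X)
    Phi += S.T @ S
Phi /= M

z = -0.5   # a point on the negative real axis
d = 0.0    # fixed point for delta_z
for _ in range(200):
    R = np.linalg.inv(n / T * Phi / (1 + d) - z * np.eye(T))
    d = np.trace(Phi @ R) / T
R = np.linalg.inv(n / T * Phi / (1 + d) - z * np.eye(T))
m_bar = np.trace(R) / T

# empirical Stieltjes transform of (1/T) Sigma^T Sigma for one realization
Sigma = sigma(rng.standard_normal((n, p)) @ X)
m_emp = np.trace(np.linalg.inv(Sigma.T @ Sigma / T - z * np.eye(T))) / T
print(m_bar, m_emp)   # close for large n, p, T
```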
Theorem 1 provides the central step in the evaluation of Etrain , for which
not only E[Q] but also E[Q²] needs to be estimated. This last ingredient is
provided in the following proposition.
$$E_{\rm train} = \frac1T\left\|Y^T - \Sigma^T\beta\right\|_F^2 = \frac{\gamma^2}{T}\,{\rm tr}\,Y^TYQ^2,$$
$$\bar E_{\rm train} = \frac{\gamma^2}{T}\,{\rm tr}\,Y^TY\bar Q\left[\frac{\frac1n{\rm tr}\,\Psi^2\bar Q^2}{1-\frac1n{\rm tr}(\Psi\bar Q)^2}\,\Psi + I_T\right]\bar Q.$$
While not immediate at first sight, one can confirm (using notably the
relation ΨQ̄ + γ Q̄ = IT ) that, for (X̂, Ŷ ) = (X, Y ), Ētrain = Ētest , as ex-
pected.
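As a sanity check, the deterministic equivalent Ētrain above can be evaluated numerically once Φ is known (or estimated); the sketch below uses a Monte Carlo estimate of Φ and a plain fixed-point iteration for δ, with the data, the dimensions and the ReLU activation being illustrative placeholders.

```python
import numpy as np

rng = np.random.default_rng(1)
p, n, T, gamma = 50, 128, 200, 1e-2
X = rng.standard_normal((p, T))
Y = np.sign(rng.standard_normal((1, T)))
sigma = lambda t: np.maximum(t, 0)                 # ReLU, as an example

# Monte Carlo estimate of Phi = E[sigma(X^T w) sigma(w^T X)], w ~ N(0, I_p)
M = 20_000
Wmc = rng.standard_normal((M, p))
S = sigma(Wmc @ X)
Phi = S.T @ S / M

# Fixed-point iteration for delta = (1/T) tr(Phi Qbar),
# with Qbar = (n/T * Phi/(1+delta) + gamma I_T)^{-1}
delta = 0.0
for _ in range(200):
    Qbar = np.linalg.inv(n / T * Phi / (1 + delta) + gamma * np.eye(T))
    delta = np.trace(Phi @ Qbar) / T
Qbar = np.linalg.inv(n / T * Phi / (1 + delta) + gamma * np.eye(T))

Psi = n / T * Phi / (1 + delta)
x = np.trace(Psi @ Qbar @ Psi @ Qbar) / n
E_train_bar = gamma**2 / T * np.trace(
    Y @ Qbar @ (x / (1 - x) * Psi + np.eye(T)) @ Qbar @ Y.T)
print(E_train_bar)
```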
3.3. Evaluation of ΦAB . The evaluation of Φ_AB = E[σ(wᵀA)ᵀσ(wᵀB)]
for arbitrary matrices A, B naturally boils down to the evaluation of its
individual entries and thus to the calculus, for arbitrary vectors a, b ∈ Rᵖ,
of
$$\Phi_{ab} \equiv E[\sigma(w^Ta)\sigma(w^Tb)] = (2\pi)^{-\frac p2}\int\sigma(\varphi(\tilde w)^Ta)\,\sigma(\varphi(\tilde w)^Tb)\,e^{-\frac12\|\tilde w\|^2}d\tilde w. \tag{4}$$
σ(t)           Φ_ab
t              aᵀb
max(t, 0)      (1/(2π)) ‖a‖‖b‖ ( ∠(a,b) arccos(−∠(a,b)) + √(1 − ∠(a,b)²) )
|t|            (2/π) ‖a‖‖b‖ ( ∠(a,b) arcsin(∠(a,b)) + √(1 − ∠(a,b)²) )
erf(t)         (2/π) arcsin( 2aᵀb / √((1 + 2‖a‖²)(1 + 2‖b‖²)) )
1_{t>0}        1/2 − (1/(2π)) arccos(∠(a,b))
sign(t)        (2/π) arcsin(∠(a,b))
cos(t)         exp(−(‖a‖² + ‖b‖²)/2) cosh(aᵀb)
sin(t)         exp(−(‖a‖² + ‖b‖²)/2) sinh(aᵀb)

Table 1
Values of Φ_ab for w ∼ N(0, I_p), with ∠(a, b) ≡ aᵀb/(‖a‖‖b‖).
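A quick Monte Carlo check of, e.g., the max(t, 0) row of Table 1 can be carried out as follows (the vectors a, b and the sample size are arbitrary placeholders):

```python
import numpy as np

rng = np.random.default_rng(2)
p = 10
a, b = rng.standard_normal(p), rng.standard_normal(p)

# Monte Carlo estimate of Phi_ab = E[max(w.a, 0) * max(w.b, 0)], w ~ N(0, I_p)
W = rng.standard_normal((500_000, p))
mc = np.mean(np.maximum(W @ a, 0) * np.maximum(W @ b, 0))

# Closed form from Table 1 for sigma(t) = max(t, 0)
ang = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
closed = (np.linalg.norm(a) * np.linalg.norm(b) / (2 * np.pi)
          * (ang * np.arccos(-ang) + np.sqrt(1 - ang**2)))
print(mc, closed)   # the two values agree up to Monte Carlo error
```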
[Figure 1 here: MSE (Etrain, Etest and their deterministic equivalents Ētrain, Ētest) as a function of γ, with one panel for σ(t) = erf(t) and one for σ(t) = max(t, 0).]
Fig 1. Neural network performance for Lipschitz continuous σ(·), Wij ∼ N (0, 1), as a
function of γ, for 2-class MNIST data (sevens, nines), n = 512, T = T̂ = 1024, p = 784.
found to be 100 to 500 times faster than generating the simulated net-
work performances. Beyond their theoretical interest, the provided formulas
therefore allow for an efficient offline tuning of the network hyperparameters,
notably the choice of an appropriate value for the ridge-regression parameter
γ.
[Figure 2 here: MSE (Etrain, Etest, Ētrain, Ētest) versus γ ∈ [10⁻⁴, 10²], for σ(t) = 1 − t²/2, σ(t) = 1_{t>0} and σ(t) = sign(t).]
Fig 2. Neural network performance for σ(·) either discontinuous or non Lipschitz, Wij ∼
N (0, 1), as a function of γ, for 2-class MNIST data (sevens, nines), n = 512, T = T̂ =
1024, p = 784.
i.e., m4 ≠ m2² (see (Couillet and Benaych-Georges, 2016) for details). This
is visually seen in the bottom part of Figure 3 where the Gaussian scenario
presents an isolated eigenvalue for Φ with corresponding structured eigen-
vector, which is not the case of the Bernoulli scenario. To complete this
discussion, it appears relevant in the present setting to choose Wij in such a
way that m4 −m22 is far from zero, thus suggesting the interest of heavy-tailed
distributions. To confirm this prediction, Figure 3 additionally displays the
performance achieved and the spectrum of Φ observed for Wij ∼ Stud, that
is, following a Student-t distribution with ν = 7 degrees of freedom, normal-
ized to unit variance (in this case m2 = 1 and m4 = 5). Figure 3 confirms
the large superiority of this choice over the Gaussian case (note nonetheless
the slight inaccuracy of our theoretical formulas in this case, which is likely
due to too small values of p, n, T to accommodate Wij with higher order
moments, an observation which is confirmed in simulations when letting ν
be even smaller).
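For reference, this normalization can be reproduced as in the short sketch below (the sample size is arbitrary): a Student-t variable with ν = 7 degrees of freedom has variance ν/(ν − 2), so dividing by √(ν/(ν − 2)) yields m2 = 1 and m4 = 3 + 6/(ν − 4) = 5.

```python
import numpy as np

rng = np.random.default_rng(3)
nu = 7
# Student-t samples normalized to unit variance: Var(t_nu) = nu / (nu - 2)
w = rng.standard_t(nu, size=2_000_000) / np.sqrt(nu / (nu - 2))
m2, m4 = np.mean(w**2), np.mean(w**4)
print(m2, m4)   # expected: m2 ~ 1, m4 ~ 5, so m4 - m2^2 ~ 4 (far from 0)
```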
$$\lim_{n\to\infty}\bar E_{\rm train} = 0, \qquad \lim_{n\to\infty}\left[\bar E_{\rm test} - \frac1{\hat T}\left\|\hat Y^T - \Phi_{\hat XX}\Phi^{-1}Y^T\right\|_F^2\right] = 0$$
[Figure 3 here: (top) MSE (Etrain, Etest, Ētrain, Ētest) for Wij ∼ Bern, Wij ∼ N(0,1) and Wij ∼ Stud; (bottom) spectra of Φ, with or without an isolated spike ("no spike"/"spike", mixture means "close"/"far"), and associated second eigenvectors.]
Fig 3. (Top) Neural network performance for σ(t) = − 1/2 t² + 1, with different Wij , for a
2-class Gaussian mixture model (see details in text), n = 512, T = T̂ = 1024, p = 256.
(Bottom) Spectra and second eigenvector of Φ for different Wij (first eigenvalues are of
order n and not shown; associated eigenvectors are provably non informative).
[Figure 4 here: MSE (Etrain, Etest, Ētrain, Ētest) versus γ for n growing from 256 to 4096.]
Fig 4. Neural network performance for growing n (256, 512, 1 024, 2 048, 4 096) as a
function of γ, σ(t) = max(t, 0); 2-class MNIST data (sevens, nines), T = T̂ = 1024,
p = 784. Limiting (n = ∞) Ētest shown in thick black line.
which satisfies, as γ → 0,
$$\begin{cases}\delta\to\dfrac{r}{n-r}, & r<n\\[2mm] \gamma\delta\to\Delta = \dfrac1T{\rm tr}\,\Phi\left(\dfrac nT\dfrac{\Phi}{\Delta} + I_T\right)^{-1}, & r\geq n,\end{cases}$$
where $\Psi_\Delta = \frac nT\frac{\Phi}{\Delta}$ and $Q_\Delta = \left(\frac nT\frac{\Phi}{\Delta} + I_T\right)^{-1}$.
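For the case r ≥ n, the limit ∆ can be obtained by a plain fixed-point iteration, as in the hedged sketch below (the placeholder Φ, the dimensions and the initialization are arbitrary assumptions, not values from the article):

```python
import numpy as np

def delta_limit(Phi, n, T, iters=500):
    """Fixed-point iteration for Delta = (1/T) tr Phi (n/T Phi/Delta + I_T)^{-1},
    the gamma -> 0 limit of gamma*delta when r = rank(Phi) >= n (plain sketch)."""
    Delta = 1.0
    for _ in range(iters):
        Q_Delta = np.linalg.inv(n / T * Phi / Delta + np.eye(T))
        Delta = np.trace(Phi @ Q_Delta) / T
    return Delta

# Example usage with an arbitrary full-rank nonnegative definite Phi (r = T >= n)
rng = np.random.default_rng(4)
T, n = 200, 128
A = rng.standard_normal((T, T))
Phi = A @ A.T / T
print(delta_limit(Phi, n, T))
```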
These results suggest that neural networks should be designed in a way
that both reduces the rank of Φ and maintains a strong alignment between
the dominant eigenvectors of Φ and the output matrix Y .
Interestingly, if X is assumed, as above, to be extracted from a Gaussian
mixture and Y ∈ R^{1×T} is a classification vector with Y_{1j} ∈ {−1, 1},
then the tools proposed in (Couillet and Benaych-Georges, 2016) (related to
spike random matrix analysis) allow for an explicit evaluation of the afore-
mentioned limits as n, p, T grow large. This analysis is however cumbersome
and outside the scope of the present work.
i.e., σi = σ(wiT X)T . Also, we shall define Σ−i ∈ R(n−1)×T the matrix Σ with
i-th row removed, and correspondingly
$$Q_{-i} = \left(\frac1T\Sigma^T\Sigma - \frac1T\sigma_i\sigma_i^T + \gamma I_T\right)^{-1}.$$
The main approach to the proof of our results, starting with that of the
key Lemma 1, is as follows: since Wij = ϕ(W̃ij ) with W̃ij ∼ N (0, 1) and
ϕ Lipschitz, the normal concentration of W̃ transfers to W which further
induces a normal concentration of the random vector σ and the matrix Σ,
thereby implying that Lipschitz functionals of σ or Σ also concentrate. As
pointed out earlier, these concentration results are used in place of the
independence assumptions (and their multiple consequences on the convergence
of random variables) classically exploited in random matrix theory.
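To illustrate the kind of concentration exploited throughout (in the spirit of Lemma 1), the short sketch below checks empirically that the quadratic form (1/T)σ(wᵀX)Aσ(wᵀX)ᵀ fluctuates around (1/T)tr(ΦA) with fluctuations shrinking as the dimensions grow; the data model, the choice A = I_T and the ReLU activation are placeholder assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)
sigma = lambda t: np.maximum(t, 0)    # ReLU, as an example

for p, T in ((50, 100), (200, 400), (800, 1600)):
    X = rng.standard_normal((p, T)) / np.sqrt(p)   # keeps ||X|| bounded as p, T grow
    # (1/T)||sigma(w^T X)||^2, i.e. the quadratic form with A = I_T, over many draws of w
    vals = [np.mean(sigma(w @ X) ** 2)
            for w in rng.standard_normal((1000, p))]
    print(p, T, np.std(vals))   # fluctuations around (1/T) tr(Phi) shrink with p, T
```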
$$P\left(\frac1{\sqrt T}\left\|\sigma(w^TX)\right\| \geq t + t_0\right) \leq C e^{-\frac{cTt^2}{\lambda_\varphi^2\lambda_\sigma^2\|X\|^2}}$$
$$\forall t\geq 4t_0,\quad P\left(\left\|\frac{\Sigma}{\sqrt T}\right\| \geq t\sqrt T\right) \leq Cn\,e^{-\frac{cT^2t^2}{2n\lambda_\varphi^2\lambda_\sigma^2\|X\|^2}}$$
1998). One is tempted to believe that, more generally, if E[σ] = 0, then ‖Σ/√T‖
should remain of this order. And, if instead E[σ] ≠ 0, the contribution of
(1/√T)E[σ]1_T^T should merely engender a single large-amplitude isolated singular
value in the spectrum of Σ/√T, the other singular values remaining of order
O(1). These intuitions are not captured by our concentration of measure
approach.
Since Σ = σ(W X) is an entry-wise operation, concentration results with
respect to the Frobenius norm are natural, whereas those with respect to the
operator norm are hardly accessible.
We can already bound P(A_K^c) thanks to (7). As for the first right-hand
side term, note that on the set {σ(wᵀX), w ∈ A_K}, the function f : Rᵀ → R,
σ ↦ σᵀAσ, is 2K√T-Lipschitz. This is because, for all σ, σ + h ∈ {σ(wᵀX), w ∈ A_K},
$$|f(\sigma+h) - f(\sigma)| = \left|h^TA\sigma + (\sigma+h)^TAh\right| \leq 2K\sqrt T\,\|h\|.$$
Therefore,
$$P\left(\left\{\left|f(\sigma(w^TX)) - E[\tilde f(\sigma(w^TX))]\right| \geq KTt\right\},\ A_K\right) = P\left(\left\{\left|\tilde f(\sigma(w^TX)) - E[\tilde f(\sigma(w^TX))]\right| \geq KTt\right\},\ A_K\right)$$
$$\leq P\left(\left|\tilde f(\sigma(w^TX)) - E[\tilde f(\sigma(w^TX))]\right| \geq KTt\right) \leq Ce^{-\frac{cTt^2}{\|X\|^2\lambda_\sigma^2\lambda_\varphi^2}}.$$
Our next step is then to bound the difference ∆ ≡ |E[f̃(σ(wᵀX))] − E[f(σ(wᵀX))]|.
Since f and f̃ are equal on {σ, ‖σ‖ ≤ K√T},
$$\Delta \leq \int_{\|\sigma\|\geq K\sqrt T}\left(|f(\sigma)| + |\tilde f(\sigma)|\right)d\mu_\sigma(\sigma)$$
where µ_σ is the law of σ(wᵀX). Since ‖A‖ ≤ 1, for ‖σ‖ ≥ K√T, max(|f(σ)|, |f̃(σ)|) ≤ ‖σ‖² and thus
$$\Delta \leq 2\int_{\|\sigma\|\geq K\sqrt T}\|\sigma\|^2d\mu_\sigma = 2\int_{\|\sigma\|\geq K\sqrt T}\int_{t=0}^\infty 1_{\|\sigma\|^2\geq t}\,dt\,d\mu_\sigma = 2\int_{t=0}^\infty P\left(\left\{\|\sigma\|^2\geq t\right\},\ A_K^c\right)dt$$
$$\leq 2\int_{t=0}^{K^2T}P(A_K^c)\,dt + 2\int_{t=K^2T}^\infty P\left(\|\sigma(w^TX)\|^2\geq t\right)dt \leq 2P(A_K^c)K^2T + 2\int_{t=K^2T}^\infty Ce^{-\frac{ct}{2\lambda_\varphi^2\lambda_\sigma^2\|X\|^2}}dt$$
$$\leq 2CTK^2e^{-\frac{cTK^2}{2\lambda_\varphi^2\lambda_\sigma^2\|X\|^2}} + \frac{2C\lambda_\varphi^2\lambda_\sigma^2\|X\|^2}{c}e^{-\frac{cTK^2}{2\lambda_\varphi^2\lambda_\sigma^2\|X\|^2}} \leq \frac{6C}{c}\lambda_\varphi^2\lambda_\sigma^2\|X\|^2$$
where in the last inequality we used the fact that, for x ∈ R, xe^{−x} ≤ e^{−1} ≤ 1,
and K ≥ 4t₀ ≥ 4λ_σλ_φ‖X‖√(p/T). As a consequence,
$$P\left(\left\{\left|f(\sigma(w^TX)) - E[f(\sigma(w^TX))]\right| \geq KTt + \Delta\right\},\ A_K\right) \leq Ce^{-\frac{cTt^2}{\|X\|^2\lambda_\varphi^2\lambda_\sigma^2}}$$
so that, with the same remark as before, for $t \geq \frac{4\Delta}{KT}$,
$$P\left(\left\{\left|f(\sigma(w^TX)) - E[f(\sigma(w^TX))]\right| \geq KTt\right\},\ A_K\right) \leq Ce^{-\frac{cTt^2}{2\|X\|^2\lambda_\varphi^2\lambda_\sigma^2}}.$$
To avoid the condition $t \geq \frac{4\Delta}{KT}$, we use the fact that, probabilities being
lower than one, it suffices to replace C by λC with λ ≥ 1 such that
$$\lambda Ce^{-\frac{cTt^2}{2\|X\|^2\lambda_\varphi^2\lambda_\sigma^2}} \geq 1\quad\text{for } t\leq \frac{4\Delta}{KT}.$$
The above inequality holds if we take for instance $\lambda = \frac1Ce^{\frac{18C^2}{c}}$ since then
$$t \leq \frac{4\Delta}{KT} \leq \frac{24C\lambda_\varphi^2\lambda_\sigma^2\|X\|^2}{cKT} \leq \frac{6C\lambda_\varphi\lambda_\sigma\|X\|}{c\sqrt{pT}}$$
(using successively $\Delta \leq \frac{6C}{c}\lambda_\varphi^2\lambda_\sigma^2\|X\|^2$ and $K \geq 4\lambda_\sigma\lambda_\varphi\|X\|\sqrt{p/T}$), so that, for all t > 0,
$$P\left(\left\{\left|f(\sigma(w^TX)) - E[f(\sigma(w^TX))]\right| \geq KTt\right\},\ A_K\right) \leq \lambda Ce^{-\frac{cTt^2}{2\|X\|^2\lambda_\varphi^2\lambda_\sigma^2}}$$
which, together with the inequality $P(A_K^c) \leq Ce^{-\frac{cTK^2}{2\lambda_\varphi^2\lambda_\sigma^2\|X\|^2}}$, gives
$$P\left(\left|f(\sigma(w^TX)) - E[f(\sigma(w^TX))]\right| \geq KTt\right) \leq \lambda Ce^{-\frac{cTt^2}{2\|X\|^2\lambda_\varphi^2\lambda_\sigma^2}} + Ce^{-\frac{cTK^2}{2\lambda_\varphi^2\lambda_\sigma^2\|X\|^2}}.$$
We then conclude
$$P\left(\left|\frac1T\sigma(w^TX)A\sigma(w^TX)^T - \frac1T{\rm tr}(\Phi A)\right| \geq t\right) \leq (\lambda+1)C\,e^{-\frac{cT}{2\|X\|^2\lambda_\varphi^2\lambda_\sigma^2}\min\left(t^2/K^2,\,K^2\right)}$$
and, with K = max(4t₀, √t),
$$P\left(\left|\frac1T\sigma(w^TX)A\sigma(w^TX)^T - \frac1T{\rm tr}(\Phi A)\right| \geq t\right) \leq (\lambda+1)C\,e^{-\frac{cT}{2\|X\|^2\lambda_\varphi^2\lambda_\sigma^2}\min\left(\frac{t^2}{16t_0^2},\,t\right)}.$$
Indeed, if 4t₀ ≤ √t then min(t²/K², K²) = t, while if 4t₀ ≥ √t then
min(t²/K², K²) = min(t²/(16t₀²), 16t₀²) = t²/(16t₀²).
which, along with the boundedness of the integrals, concludes the proof.
Lipschitz function of W with respect to the Frobenius norm (i.e., the Eu-
clidean norm of vec(W)), by (6),
$$P\left(|g(W) - E[g(W)]| > t\right) = P\left(\left|g(\varphi(\tilde W)) - E[g(\varphi(\tilde W))]\right| > t\right) \leq Ce^{-\frac{ct^2}{\lambda_g^2\lambda_\varphi^2}}$$
for some C, c > 0. Let us consider in particular g : W ↦ f(Σ/√T) and remark that
$$|g(W+H) - g(W)| = \left|f\left(\frac{\sigma((W+H)X)}{\sqrt T}\right) - f\left(\frac{\sigma(WX)}{\sqrt T}\right)\right| \leq \frac{\lambda_f}{\sqrt T}\left\|\sigma((W+H)X) - \sigma(WX)\right\|_F$$
$$\leq \frac{\lambda_f\lambda_\sigma}{\sqrt T}\|HX\|_F = \frac{\lambda_f\lambda_\sigma}{\sqrt T}\sqrt{{\rm tr}\,HXX^TH^T} \leq \frac{\lambda_f\lambda_\sigma}{\sqrt T}\sqrt{\|XX^T\|}\,\|H\|_F$$
concluding the proof.
for some C, c > 0, where dist(z, R⁺) is the Hausdorff set distance. In par-
ticular, for z = −γ, γ > 0, and under the additional Assumption 3,
$$P\left(\left|\frac1T{\rm tr}\,Q - \frac1T{\rm tr}\,E[Q]\right| > t\right) \leq Ce^{-cnt^2}.$$
Proof. We can apply Lemma 3 for f : R ↦ \frac1T{\rm tr}(RᵀR − zI_T)^{-1}, since we have
$$|f(R+H) - f(R)| = \frac1T\left|{\rm tr}\left((R+H)^T(R+H) - zI_T\right)^{-1}\left((R+H)^TH + H^TR\right)(R^TR - zI_T)^{-1}\right|$$
$$\leq \frac1T\left|{\rm tr}\left((R+H)^T(R+H) - zI_T\right)^{-1}(R+H)^TH(R^TR - zI_T)^{-1}\right| + \frac1T\left|{\rm tr}\left((R+H)^T(R+H) - zI_T\right)^{-1}H^TR(R^TR - zI_T)^{-1}\right|$$
$$\leq \frac{2\|H\|}{{\rm dist}(z,\mathbb R^+)^{\frac32}} \leq \frac{2\|H\|_F}{{\rm dist}(z,\mathbb R^+)^{\frac32}}$$
where, for the second to last inequality, we successively used the relations
$|{\rm tr}\,AB| \leq \sqrt{{\rm tr}\,AA^T}\sqrt{{\rm tr}\,BB^T}$, $|{\rm tr}\,CD| \leq \|D\|\,{\rm tr}\,C$ for nonnegative definite
C, and $\|(R^TR - zI_T)^{-1}\| \leq {\rm dist}(z,\mathbb R^+)^{-1}$, $\|(R^TR - zI_T)^{-1}R^TR\| \leq 1$,
$\|(R^TR - zI_T)^{-1}R^T\| = \|(R^TR - zI_T)^{-1}R^TR(R^TR - zI_T)^{-1}\|^{\frac12} \leq \|(R^TR - zI_T)^{-1}R^TR\|^{\frac12}\|(R^TR - zI_T)^{-1}\|^{\frac12} \leq {\rm dist}(z,\mathbb R^+)^{-\frac12}$,
for z ∈ C\R⁺, and finally ‖·‖ ≤ ‖·‖_F.
for some C, c > 0. We may then apply Lemma 1 to the bounded norm
matrix AE[Q_−]B to further find that
$$P\left(\left|\frac1T\sigma^TAQ_-B\sigma - \frac1T{\rm tr}\,\Phi AE[Q_-]B\right| > t\right) \leq P\left(\left|\frac1T\sigma^TAQ_-B\sigma - \frac1T\sigma^TAE[Q_-]B\sigma\right| > \frac t2\right)$$
$$+ P\left(\left|\frac1T\sigma^TAE[Q_-]B\sigma - \frac1T{\rm tr}\,\Phi AE[Q_-]B\right| > \frac t2\right) \leq C'e^{-c'n\min(t^2,\,t)}$$
With $Q^H$ the resolvent associated with R + H,
$$|f(R+H) - f(R)| = \frac1T\left|{\rm tr}\,Y^TY\left((Q^H)^2 - Q^2\right)\right| \leq \frac1T\left|{\rm tr}\,Y^TY(Q^H-Q)Q^H\right| + \frac1T\left|{\rm tr}\,Y^TYQ(Q^H-Q)\right|$$
$$= \frac1T\left|{\rm tr}\,Y^TYQ^H\left((R+H)^T(R+H) - R^TR\right)QQ^H\right| + \frac1T\left|{\rm tr}\,Y^TYQQ^H\left((R+H)^T(R+H) - R^TR\right)Q\right|$$
$$\leq \frac1T\left|{\rm tr}\,Y^TYQ^H(R+H)^THQQ^H\right| + \frac1T\left|{\rm tr}\,Y^TYQ^HH^TRQQ^H\right| + \frac1T\left|{\rm tr}\,Y^TYQQ^H(R+H)^THQ\right| + \frac1T\left|{\rm tr}\,Y^TYQQ^HH^TRQ\right|.$$
5.2.1. First Equivalent for E[Q]. This section is dedicated to a first char-
acterization of E[Q], in the “simultaneously large” n, p, T regime. This pre-
liminary step is classical in studying resolvents in random matrix theory as
the direct comparison of E[Q] to Q̄ with the implicit δ may be cumbersome.
To this end, let us thus define the intermediary deterministic matrix
$$\tilde Q = \left(\frac nT\frac{\Phi}{1+\alpha} + \gamma I_T\right)^{-1}$$
with α ≡ \frac1T{\rm tr}\,\Phi E[Q_-], where we recall that Q_- is a random matrix dis-
tributed as, say, $(\frac1T\Sigma^T\Sigma - \frac1T\sigma_1\sigma_1^T + \gamma I_T)^{-1}$.
First note that, since \frac1T{\rm tr}\,\Phi = E[\frac1T\|\sigma\|^2] and, from (7) and Assump-
tion 3, $P(\frac1T\|\sigma\|^2 > t) \leq Ce^{-cnt^2}$ for all large t, we find that
$\frac1T{\rm tr}\,\Phi = \int_0^\infty P(\frac1T\|\sigma\|^2 > t)\,dt \leq C'$ for some constant C'. Thus,
$\alpha \leq \|E[Q_-]\|\frac1T{\rm tr}\,\Phi \leq \frac{C'}{\gamma}$ is uniformly bounded.
Note now, from the independence of Q−i and σi σiT , that the second right-
hand side expectation is simply E[Q−i ]Φ. Also, exploiting Lemma 6 in re-
verse on the rightmost term, this gives
$$E[Q] - \tilde Q = \frac1T\sum_{i=1}^n\frac{E[Q - Q_{-i}]\Phi}{1+\alpha}\tilde Q + \frac1{1+\alpha}\frac1T\sum_{i=1}^nE\left[Q\sigma_i\sigma_i^T\tilde Q\left(\frac1T\sigma_i^TQ_{-i}\sigma_i - \alpha\right)\right]. \tag{8}$$
where we used again Lemma 6 in reverse. Denoting D = diag({1 + \frac1T\sigma_i^TQ_{-i}\sigma_i}_{i=1}^n),
this can be compactly written
$$\frac1T\sum_{i=1}^n\frac{E[Q-Q_{-i}]\Phi}{1+\alpha}\tilde Q = -\frac1{1+\alpha}E\left[Q\frac1T\Sigma^TD\Sigma Q\right]\frac1T\Phi\tilde Q.$$
i=1
Let us now consider the second right-hand side term of (9). Using the
relation abT + baT aaT + bbT in the order of Hermitian matrices (which
1
unfolds from (a − b)(a − b)T 0), we have, with a = T 4 Qσi ( T1 σiT Q−i σi − α)
1
and b = T − 4 Q̃σi ,
n 1
1X T T T
E Qσi σi Q̃ + Q̃σi σ Q σ Q−i σi − α
T T i
i=1
n n
" 2 #
1 X T 1 T 1 X h i
√ E Qσi σi Q σi Q−i σi − α + √ E Q̃σi σiT Q̃
T i=1 T T T i=1
√
1 n
= T E Q ΣT D22 ΣQ + √ Q̃ΦQ̃
T T T
where D2 = diag({ T1 σiT Q−i σi −α}ni=1 ). Of course, since we also have −aaT −
bbT abT + baT (from (a + b)(a + b)T 0), we have symmetrically
n 1
1X T T T
E Qσi σi Q̃ + Q̃σi σ Q σ Q−i σi − α
T T i
i=1
√
1 T 2 n
− T E Q Σ D2 ΣQ − √ Q̃ΦQ̃.
T T T
But from Lemma 4,
ε− 21
1 T ε− 21
P kD2 k > tn = P max σi Q−i σi − α > tn
1≤i≤n T
1
2ε t2 ,n 2 +ε t)
≤ Cne−cmin(n
n
1 X 1
1
E Qσi σiT Q̃ + Q̃σi σiT Q σiT Q−i σi − α
≤ Cnε− 2 .
T T
i=1
5.2.2. Second Equivalent for E[Q]. In this section, we show that E[Q]
can be approximated by the matrix Q̄, which we recall is defined as
$$\bar Q = \left(\frac nT\frac{\Phi}{1+\delta} + \gamma I_T\right)^{-1}$$
where δ > 0 is the unique positive solution to δ = T1 tr ΦQ̄. The fact that δ >
0 is well defined is quite standard and has already been proved several times
for more elaborate models. Following the ideas of (Hoydis, Couillet and Debbah,
2013), we may for instance use the framework of so-called standard interfer-
ence functions (Yates, 1995) which claims that, if a map f : [0, ∞) → (0, ∞),
x 7→ f (x), satisfies x ≥ x′ ⇒ f (x) ≥ f (x′ ), ∀a > 1, af (x) > f (ax) and there
exists x0 such that x0 ≥ f (x0 ), then f has a unique fixed point (Yates, 1995,
Th 2). It is easily shown that δ 7→ T1 tr ΦQ̄ is such a map, so that δ exists
and is unique.
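For intuition, the two defining properties of a standard interference function, as well as the convergence of the resulting fixed-point iteration to a single point from arbitrary starting values, can be checked numerically on the map x ↦ (1/T) tr Φ((n/T)Φ/(1 + x) + γI_T)⁻¹; in the hedged sketch below, Φ, the dimensions and γ are arbitrary placeholders.

```python
import numpy as np

rng = np.random.default_rng(6)
T, n, gamma = 120, 80, 1e-1
A = rng.standard_normal((T, T))
Phi = A @ A.T / T                      # arbitrary nonnegative definite placeholder

def f(x):
    # the map whose unique fixed point defines delta
    return np.trace(Phi @ np.linalg.inv(n / T * Phi / (1 + x) + gamma * np.eye(T))) / T

# monotonicity and scalability (the standard interference function properties)
print(f(2.0) >= f(1.0))               # x >= x'  =>  f(x) >= f(x')
print(2.0 * f(1.0) > f(2.0))          # a > 1    =>  a f(x) > f(a x)

# the fixed-point iteration converges to the same delta from different starts
for x in (0.0, 10.0):
    for _ in range(500):
        x = f(x)
    print(x)
```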
from which
$$|\alpha - \delta| = \left|\frac1T{\rm tr}\,\Phi\left(E[Q_-] - \bar Q\right)\right| \leq \left|\frac1T{\rm tr}\,\Phi\left(\tilde Q - \bar Q\right)\right| + cn^{-\frac12+\varepsilon} = |\alpha-\delta|\,\frac1T{\rm tr}\,\frac{\Phi\tilde Q\,\frac nT\Phi\bar Q}{(1+\alpha)(1+\delta)} + cn^{-\frac12+\varepsilon}$$
so that it is sufficient to bound the limsup of both terms under the square
root strictly by one. Next, remark that
$$\delta = \frac1T{\rm tr}\,\Phi\bar Q = \frac1T{\rm tr}\,\Phi\bar Q^2\bar Q^{-1} = \frac{n}{T(1+\delta)}\frac1T{\rm tr}\,\Phi^2\bar Q^2 + \gamma\frac1T{\rm tr}\,\Phi\bar Q^2.$$
In particular,
$$\frac nT\frac1{(1+\delta)^2}\frac1T{\rm tr}\,\Phi^2\bar Q^2 = \delta\,\frac{\frac nT\frac1{(1+\delta)^2}\frac1T{\rm tr}\,\Phi^2\bar Q^2}{\frac nT\frac1{1+\delta}\frac1T{\rm tr}\,\Phi^2\bar Q^2 + \gamma\frac1T{\rm tr}\,\Phi\bar Q^2} \leq \frac{\delta}{1+\delta}.$$
so (that is, on a probability set A_z with P(A_z) = 1). Thus, letting {z_k}_{k=1}^∞
be a converging sequence strictly included in R⁻, on the probability one
space A = ∩_{k=1}^∞ A_{z_k}, m_{µ_n}(z_k) − m_{µ̄_n}(z_k) → 0 for all k. Now, m_{µ_n} is complex
analytic on C \ R⁺ and bounded on all compact subsets of C \ R⁺.
Besides, it was shown in (Silverstein and Bai, 1995; Silverstein and Choi,
1995) that the function mµ̄n is well-defined, complex analytic and bounded
on all compact subsets of C \ R+ . As a result, on A, mµn − mµ̄n is complex
analytic, bounded on all compact subsets of C \ R+ and converges to zero on
a subset admitting at least one accumulation point. Thus, by Vitali’s con-
vergence theorem (Titchmarsh, 1939), with probability one, mµn − mµ̄n con-
verges to zero everywhere on C \ R+ . This implies, by (Bai and Silverstein,
2009, Theorem B.9), that µn − µ̄n → 0, vaguely as a signed finite measure,
with probability one, and, since µ̄n is a probability measure (again from the
results of (Silverstein and Bai, 1995; Silverstein and Choi, 1995)), we have
thus proved Theorem 2.
the neural network under study requires, besides E[Q], to evaluate the more
involved form E[QAQ], where A is a symmetric matrix either equal to Φ
or of bounded norm (so in particular kQ̄Ak is bounded). To evaluate this
quantity, first write
$$E[QAQ] = E\left[\bar QAQ\right] + E\left[(Q-\bar Q)AQ\right] = E\left[\bar QAQ\right] + E\left[Q\left(\frac nT\frac{\Phi}{1+\delta} - \frac1T\Sigma^T\Sigma\right)\bar QAQ\right]$$
$$= E\left[\bar QAQ\right] + \frac nT\frac1{1+\delta}E\left[Q\Phi\bar QAQ\right] - \frac1T\sum_{i=1}^nE\left[Q\sigma_i\sigma_i^T\bar QAQ\right].$$
(where in the previous to last line, we have merely reorganized the terms
conveniently) and our interest is in handling Z1 + Z1T + Z2 + Z2T + Z3 +
Z3T + Z4 + Z4T . Let us first treat the term Z2 . Since Q̄AQ− is bounded, by
Lemma 4, $\frac1T\sigma^T\bar QAQ_-\sigma$ concentrates around $\frac1T{\rm tr}\,\Phi\bar QAE[Q_-]$; but, as ‖ΦQ̄‖
is bounded, we also have $|\frac1T{\rm tr}\,\Phi\bar QAE[Q_-] - \frac1T{\rm tr}\,\Phi\bar QA\bar Q| \leq cn^{\varepsilon-\frac12}$. We thus
deduce, with similar arguments as previously, that
$$-Q_-\sigma\sigma^TQ_-\,Cn^{\varepsilon-\frac12} \preceq Q_-\sigma\sigma^TQ_-\left(\frac{\frac1T\sigma^T\bar QAQ_-\sigma}{1+\frac1T\sigma^TQ_-\sigma} - \frac{\frac1T{\rm tr}\,\Phi\bar QA\bar Q}{1+\delta}\right) \preceq Q_-\sigma\sigma^TQ_-\,Cn^{\varepsilon-\frac12}.$$
We now move to the term Z3 + Z3T . Using the relation $ab^T + ba^T \preceq aa^T + bb^T$,
$$E\left[\left(\delta - \frac1T\sigma^TQ_-\sigma\right)\frac{Q_-\sigma\sigma^T\bar QAQ_- + Q_-A\bar Q\sigma\sigma^TQ_-}{(1+\frac1T\sigma^TQ_-\sigma)^2}\right] \preceq \sqrt n\,E\left[Q_-\sigma\sigma^TQ_-\frac{(\delta-\frac1T\sigma^TQ_-\sigma)^2}{(1+\frac1T\sigma^TQ_-\sigma)^4}\right] + \frac1{\sqrt n}E\left[Q_-A\bar Q\sigma\sigma^T\bar QAQ_-\right]$$
$$= \sqrt n\,\frac Tn\,E\left[Q\frac1T\Sigma^TD_3^2\Sigma Q\right] + \frac1{\sqrt n}E\left[Q_-A\bar Q\Phi\bar QAQ_-\right]$$
and the symmetrical lower bound (equal to the opposite of the upper bound),
where $D_3 = {\rm diag}\left((\delta - \frac1T\sigma_i^TQ_{-i}\sigma_i)/(1+\frac1T\sigma_i^TQ_{-i}\sigma_i)\right)$. For the same reasons
as above, the first right-hand side term is bounded by $Cn^{\varepsilon-\frac12}$. As for the
second term, for A = I_T, it is clearly bounded; for A = Φ, using $\frac nT\frac{\bar Q\Phi}{1+\delta} = I_T - \gamma\bar Q$,
E[Q_−AQ̄ΦQ̄AQ_−] can be expressed in terms of E[Q_−ΦQ_−] and
E[Q_−Q̄^kΦQ_−] for k = 1, 2, all of which have been shown to be bounded (at
most by Cn^ε). We thus conclude that
$$\left\|E\left[\left(\delta-\frac1T\sigma^TQ_-\sigma\right)\frac{Q_-\sigma\sigma^T\bar QAQ_- + Q_-A\bar Q\sigma\sigma^TQ_-}{(1+\frac1T\sigma^TQ_-\sigma)^2}\right]\right\| \leq Cn^{\varepsilon-\frac12}.$$
The first norm in the parenthesis is bounded by Cn^ε and it thus remains
to control the second norm. To this end, similar to the control of E[QΦQ],
by writing E[QΦQΦQ] = E[Qσ₁σ₁ᵀQσ₂σ₂ᵀQ] for σ₁, σ₂ independent vectors
with the same law as σ, and exploiting the exchangeability, we obtain after
some calculus that E[QΦQΦQ] can be expressed as the sum of terms of the form
$E[Q_{++}\frac1T\Sigma_{++}^TD\Sigma_{++}Q_{++}]$ or $E[Q_{++}\frac1T\Sigma_{++}^TD\Sigma_{++}Q_{++}\frac1T\Sigma_{++}^TD_2\Sigma_{++}Q_{++}]$ for
D, D₂ diagonal matrices of norm bounded as O(1), while Σ₊₊ and Q₊₊ are
defined similarly to Σ and Q, only with n replaced by n+2. All these terms are bounded
as O(1) and we finally obtain that E[QΦQΦQ] is bounded and thus
$$\left\|E[Q\Phi Q] - \frac12E\left[Q_-\Phi Q + Q\Phi Q_-\right]\right\| \leq \frac Cn,$$
$$E[Q\Phi Q] = \frac{\bar Q\Phi\bar Q}{1 - \frac nT\frac{\frac1T{\rm tr}\,\Phi^2\bar Q^2}{(1+\delta)^2}} + O_{\|\cdot\|}\left(n^{\varepsilon-\frac12}\right).$$
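A quick numerical check of this last equivalent for E[QΦQ] may look as in the sketch below; the Monte Carlo model for Φ, the ReLU activation, Gaussian W and all dimensions are placeholder assumptions, and agreement is only up to finite-dimensional fluctuations.

```python
import numpy as np

rng = np.random.default_rng(7)
p, n, T, gamma = 40, 100, 150, 1e-1
X = rng.standard_normal((p, T)) / np.sqrt(p)
sigma = lambda t: np.maximum(t, 0)

# Monte Carlo estimate of Phi
M = 100_000
Phi = np.zeros((T, T))
for Wb in np.array_split(rng.standard_normal((M, p)), 20):
    S = sigma(Wb @ X)
    Phi += S.T @ S
Phi /= M

# Monte Carlo estimate of E[Q Phi Q]
EQPhiQ, reps = np.zeros((T, T)), 200
for _ in range(reps):
    Sig = sigma(rng.standard_normal((n, p)) @ X)
    Q = np.linalg.inv(Sig.T @ Sig / T + gamma * np.eye(T))
    EQPhiQ += Q @ Phi @ Q
EQPhiQ /= reps

# deterministic equivalent Qbar Phi Qbar / (1 - (n/T) (1/T) tr Phi^2 Qbar^2 / (1+delta)^2)
delta = 0.0
for _ in range(200):
    Qbar = np.linalg.inv(n / T * Phi / (1 + delta) + gamma * np.eye(T))
    delta = np.trace(Phi @ Qbar) / T
Qbar = np.linalg.inv(n / T * Phi / (1 + delta) + gamma * np.eye(T))
denom = 1 - n / T * np.trace(Phi @ Phi @ Qbar @ Qbar) / T / (1 + delta) ** 2
approx = Qbar @ Phi @ Qbar / denom
print(np.linalg.norm(EQPhiQ - approx) / np.linalg.norm(approx))  # small relative error
```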
$$I = \frac1{2\pi}\int_{\mathbb R}\int_{\mathbb R}\sigma(\tilde w_1\tilde a_1)\,\sigma(\tilde w_1\tilde b_1 + \tilde w_2\tilde b_2)\,e^{-\frac12(\tilde w_1^2+\tilde w_2^2)}d\tilde w_1d\tilde w_2.$$
Letting w̃ = [w̃₁, w̃₂]ᵀ, ã = [ã₁, 0]ᵀ and b̃ = [b̃₁, b̃₂]ᵀ, this is conveniently
written as the two-dimensional integral
$$I = \frac1{2\pi}\int_{\mathbb R^2}\sigma(\tilde w^T\tilde a)\,\sigma(\tilde w^T\tilde b)\,e^{-\frac12\|\tilde w\|^2}d\tilde w.$$
The case where a and b would be linearly dependent can then be obtained
by continuity arguments.
The function σ(t) = max(t, 0). For this function, we have
$$I = \frac1{2\pi}\int_{\min(\tilde w^T\tilde a,\,\tilde w^T\tilde b)\geq 0}\tilde w^T\tilde a\cdot\tilde w^T\tilde b\cdot e^{-\frac12\|\tilde w\|^2}d\tilde w.$$
Since ã = ã₁e₁, a simple geometric representation lets us observe that
$$\left\{\tilde w\ \middle|\ \min(\tilde w^T\tilde a,\,\tilde w^T\tilde b)\geq 0\right\} = \left\{r\cos(\theta)e_1 + r\sin(\theta)e_2\ \middle|\ r\geq 0,\ \theta\in\left[\theta_0-\frac\pi2,\frac\pi2\right]\right\}$$
where θ₀ denotes the angle between ã and b̃. With two integrations by parts, we have that $\int_{\mathbb R^+}r^3e^{-\frac12r^2}dr = 2$. Classical
trigonometric formulas also provide
$$\int_{\theta_0-\frac\pi2}^{\frac\pi2}\cos^2(\theta)\,d\theta = \frac12(\pi-\theta_0) + \frac14\sin(2\theta_0) = \frac12\left(\pi - \arccos\left(\frac{\tilde b_1}{\|\tilde b\|}\right) + \frac{\tilde b_1}{\|\tilde b\|}\frac{\tilde b_2}{\|\tilde b\|}\right)$$
$$\int_{\theta_0-\frac\pi2}^{\frac\pi2}\cos(\theta)\sin(\theta)\,d\theta = \frac12\sin^2(\theta_0) = \frac12\left(\frac{\tilde b_2}{\|\tilde b\|}\right)^2$$
where we used in particular $\sin(2\arccos(x)) = 2x\sqrt{1-x^2}$. Altogether, this
is after simplification and replacement of ã₁, b̃₁ and b̃₂,
$$I = \frac1{2\pi}\|a\|\|b\|\left(\sqrt{1-\angle(a,b)^2} + \angle(a,b)\arccos(-\angle(a,b))\right).$$
It is worth noticing that this may be more compactly written as
$$I = \frac1{2\pi}\|a\|\|b\|\int_{-1}^{\angle(a,b)}\arccos(-x)\,dx.$$
$$\sigma(w^Ta)\sigma(w^Tb) = 1_{w^Ta\geq0}1_{w^Tb\geq0} + 1_{w^T(-a)\geq0}1_{w^T(-b)\geq0} - 1_{w^T(-a)\geq0}1_{w^Tb\geq0} - 1_{w^Ta\geq0}1_{w^T(-b)\geq0}$$
and to apply the result of the previous section, with either (a, b), (−a, b),
(a, −b) or (−a, −b). Since arccos(−x) = −arccos(x) + π, we conclude that
$$I = (2\pi)^{-\frac p2}\int_{\mathbb R^p}{\rm sign}(w^Ta)\,{\rm sign}(w^Tb)\,e^{-\frac12\|w\|^2}dw = 1 - \frac{2\theta_0}{\pi}.$$
The functions σ(t) = cos(t) and σ(t) = sin(t). Let us first consider σ(t) =
cos(t). We have here to evaluate
$$I = \frac1{2\pi}\int_{\mathbb R^2}\cos\left(\tilde w^T\tilde a\right)\cos\left(\tilde w^T\tilde b\right)e^{-\frac12\|\tilde w\|^2}d\tilde w = \frac1{8\pi}\int_{\mathbb R^2}\left(e^{\imath\tilde w^T\tilde a} + e^{-\imath\tilde w^T\tilde a}\right)\left(e^{\imath\tilde w^T\tilde b} + e^{-\imath\tilde w^T\tilde b}\right)e^{-\frac12\|\tilde w\|^2}d\tilde w$$
which boils down to evaluating, for d ∈ {ã + b̃, ã − b̃, −ã + b̃, −ã − b̃}, the
integral
$$\int_{\mathbb R^2}e^{-\frac12\|d\|^2}e^{-\frac12\|\tilde w-\imath d\|^2}d\tilde w = (2\pi)\,e^{-\frac12\|d\|^2}.$$
Altogether, we find
$$I = \frac12\left(e^{-\frac12\|a+b\|^2} + e^{-\frac12\|a-b\|^2}\right) = e^{-\frac12(\|a\|^2+\|b\|^2)}\cosh(a^Tb).$$
For σ(t) = sin(t), it suffices to appropriately adapt the signs in the ex-
pression of I (using the relation $\sin(t) = \frac1{2\imath}(e^{\imath t} - e^{-\imath t})$) to obtain in the end
$$I = \frac12\left(e^{-\frac12\|a-b\|^2} - e^{-\frac12\|a+b\|^2}\right) = e^{-\frac12(\|a\|^2+\|b\|^2)}\sinh(a^Tb)$$
as desired.
where we recall the definition (a²) ≡ [a₁², . . . , a_p²]ᵀ. Gathering all the terms
for appropriate selections of c, d leads to (5).
If Σ̂ = Σ̂° + σ̄̂ 1_{T̂}ᵀ follows the aforementioned claimed operator norm control,
reproducing the steps of Corollary 3 leads to a similar concentration for
Etest , which we shall then admit. We are therefore left with evaluating E[Z2 ]
and E[Z3 ].
with D = diag({δ − \frac1T\sigma_i^TQ_{-i}\sigma_i}), the operator norm of which is bounded by
$n^{\varepsilon-\frac12}$ with high probability. Now, observe that, again with the assumption
that Σ̂ = Σ̂° + σ̄1_{T̂}ᵀ with controlled Σ̂°, Z22 may be decomposed as
$$\frac2{T\hat T}\frac1{1+\delta}E\left[{\rm tr}\,YQ\Sigma^TD\hat\Sigma\hat Y^T\right] = \frac2{T\hat T}\frac1{1+\delta}E\left[{\rm tr}\,YQ\Sigma^TD\hat\Sigma^\circ\hat Y^T\right] + \frac2{T\hat T}\frac1{1+\delta}1_{\hat T}^T\hat Y^TE\left[YQ\Sigma^TD\bar\sigma\right].$$
In the display above, the first right-hand side term is now of order $O(n^{\varepsilon-\frac12})$.
As for the second right-hand side term, note that Dσ̄ is a vector of inde-
pendent and identically distributed zero mean and variance O(n⁻¹) entries;
while not formally independent of YQΣᵀ, it is nonetheless expected that
this independence “weakens” asymptotically (a behavior several times ob-
served in linear random matrix models), so that one expects by central limit
arguments that the second right-hand side term be also of order $O(n^{\varepsilon-\frac12})$.
This would thus result in
$$E[Z_2] = \frac{2n}{T\hat T}\frac1{1+\delta}{\rm tr}\,YE[Q_-]\Phi_{X\hat X}\hat Y^T + O(n^{\varepsilon-\frac12}) = \frac{2n}{T\hat T}\frac1{1+\delta}{\rm tr}\,Y\bar Q\Phi_{X\hat X}\hat Y^T + O(n^{\varepsilon-\frac12}) = \frac2{\hat T}{\rm tr}\,Y\bar Q\Psi_{X\hat X}\hat Y^T + O(n^{\varepsilon-\frac12})$$
where we used $\|E[Q_-] - \bar Q\| \leq Cn^{\varepsilon-\frac12}$ and the definition $\Psi_{X\hat X} = \frac nT\frac{\Phi_{X\hat X}}{1+\delta}$.
In the term Z32 , reproducing the proof of Lemma 1 with the condition
‖X̂‖ bounded, we obtain that $\frac{\hat\sigma_i^T\hat\sigma_i}{\hat T}$ concentrates around $\frac1{\hat T}{\rm tr}\,\Phi_{\hat X\hat X}$, which
allows us to write
$$Z_{32} = \frac1{T^2\hat T}\sum_{i=1}^nE\left[{\rm tr}\left(Y\frac{Q_{-i}\sigma_i\,{\rm tr}(\Phi_{\hat X\hat X})\,\sigma_i^TQ_{-i}}{(1+\frac1T\sigma_i^TQ_{-i}\sigma_i)^2}Y^T\right)\right] + \frac1{T^2\hat T}\sum_{i=1}^nE\left[{\rm tr}\left(Y\frac{Q_{-i}\sigma_i\left(\hat\sigma_i^T\hat\sigma_i - {\rm tr}\,\Phi_{\hat X\hat X}\right)\sigma_i^TQ_{-i}}{(1+\frac1T\sigma_i^TQ_{-i}\sigma_i)^2}Y^T\right)\right]$$
$$= \frac{{\rm tr}(\Phi_{\hat X\hat X})}{T^2\hat T}E\left[{\rm tr}\left(Y\sum_{i=1}^n\frac{Q_{-i}\sigma_i\sigma_i^TQ_{-i}}{(1+\frac1T\sigma_i^TQ_{-i}\sigma_i)^2}Y^T\right)\right] + \frac1{T^2\hat T}E\left[{\rm tr}\left(Y\sum_{i=1}^n\left(\hat\sigma_i^T\hat\sigma_i - {\rm tr}\,\Phi_{\hat X\hat X}\right)Q\sigma_i\sigma_i^TQY^T\right)\right] \equiv Z_{321} + Z_{322},$$
where, with $D = {\rm diag}(\{\hat\sigma_i^T\hat\sigma_i - {\rm tr}\,\Phi_{\hat X\hat X}\}_{i=1}^n)$, the second term may be rewritten as
$$Z_{322} = \frac1{T\hat T}E\left[{\rm tr}\,Y\frac{Q\Sigma^T}{\sqrt T}D\frac{\Sigma Q}{\sqrt T}Y^T\right] = O(n^{\varepsilon-\frac12}).$$
The term Z31 of the double sum over i and j (j ≠ i) needs more effort.
To handle this term, we need to remove the dependence of Q on both σᵢ and σⱼ
in sequence. We start with j as follows:
$$Z_{31} = \frac1{T^2\hat T}\sum_{i=1}^n\sum_{j\neq i}E\left[{\rm tr}\left(YQ\sigma_i\hat\sigma_i^T\frac{\hat\sigma_j\sigma_j^TQ_{-j}}{1+\frac1T\sigma_j^TQ_{-j}\sigma_j}Y^T\right)\right]$$
$$= \frac1{T^2\hat T}\sum_{i=1}^n\sum_{j\neq i}E\left[{\rm tr}\left(YQ_{-j}\sigma_i\hat\sigma_i^T\frac{\hat\sigma_j\sigma_j^TQ_{-j}}{1+\frac1T\sigma_j^TQ_{-j}\sigma_j}Y^T\right)\right] - \frac1{T^3\hat T}\sum_{i=1}^n\sum_{j\neq i}E\left[{\rm tr}\left(Y\frac{Q_{-j}\sigma_j\sigma_j^TQ_{-j}\sigma_i\hat\sigma_i^T\hat\sigma_j\sigma_j^TQ_{-j}}{\left(1+\frac1T\sigma_j^TQ_{-j}\sigma_j\right)^2}Y^T\right)\right] \equiv Z_{311} - Z_{312}$$
For Z311 , we replace $1 + \frac1T\sigma_j^TQ_{-j}\sigma_j$ by 1 + δ and take the expectation over wⱼ:
$$Z_{311} = \frac1{T^2\hat T}\sum_{i=1}^n\sum_{j\neq i}E\left[{\rm tr}\left(YQ_{-j}\sigma_i\hat\sigma_i^T\frac{\hat\sigma_j\sigma_j^TQ_{-j}}{1+\frac1T\sigma_j^TQ_{-j}\sigma_j}Y^T\right)\right] = \frac1{T^2\hat T}\sum_{j=1}^nE\left[{\rm tr}\left(Y\frac{Q_{-j}\Sigma_{-j}^T\hat\Sigma_{-j}\hat\sigma_j\sigma_j^TQ_{-j}}{1+\frac1T\sigma_j^TQ_{-j}\sigma_j}Y^T\right)\right]$$
$$= \frac1{T^2\hat T}\frac1{1+\delta}\sum_{j=1}^nE\left[{\rm tr}\,YQ_{-j}\Sigma_{-j}^T\hat\Sigma_{-j}\hat\sigma_j\sigma_j^TQ_{-j}Y^T\right] + \frac1{T^2\hat T}\frac1{1+\delta}\sum_{j=1}^nE\left[{\rm tr}\left(Y\frac{Q_{-j}\Sigma_{-j}^T\hat\Sigma_{-j}\hat\sigma_j\sigma_j^TQ_{-j}\left(\delta-\frac1T\sigma_j^TQ_{-j}\sigma_j\right)}{1+\frac1T\sigma_j^TQ_{-j}\sigma_j}Y^T\right)\right] \equiv Z_{3111} + Z_{3112}.$$
The idea to handle Z3112 is to retrieve forms of the type $\sum_{j=1}^nd_j\hat\sigma_j\sigma_j^T = \hat\Sigma^TD\Sigma$ for some D satisfying $\|D\| \leq n^{\varepsilon-\frac12}$ with high probability. To this
end, we use
$$Q_{-j}\frac{\Sigma_{-j}^T\hat\Sigma_{-j}}{T} = Q_{-j}\frac{\Sigma^T\hat\Sigma}{T} - Q_{-j}\frac{\sigma_j\hat\sigma_j^T}{T} = Q\frac{\Sigma^T\hat\Sigma}{T} + \frac{\frac1TQ\sigma_j\sigma_j^TQ}{1-\frac1T\sigma_j^TQ\sigma_j}\frac{\Sigma^T\hat\Sigma}{T} - Q_{-j}\frac{\sigma_j\hat\sigma_j^T}{T}$$
and thus Z3112 can be expanded as the sum of three terms that shall be
studied in order:
$$Z_{3112} = \frac1{T^2\hat T}\frac1{1+\delta}\sum_{j=1}^nE\left[{\rm tr}\left(Y\frac{Q_{-j}\Sigma_{-j}^T\hat\Sigma_{-j}\hat\sigma_j\sigma_j^TQ_{-j}\left(\delta-\frac1T\sigma_j^TQ_{-j}\sigma_j\right)}{1+\frac1T\sigma_j^TQ_{-j}\sigma_j}Y^T\right)\right]$$
$$= \frac1{T\hat T}\frac1{1+\delta}E\left[{\rm tr}\left(YQ\frac{\Sigma^T\hat\Sigma}{T}\hat\Sigma^TD\Sigma QY^T\right)\right] + \frac1{T\hat T}\frac1{1+\delta}\sum_{j=1}^nE\left[{\rm tr}\left(Y\frac{Q\sigma_j\sigma_j^TQ\Sigma^T\hat\Sigma\hat\sigma_j\left(\delta-\frac1T\sigma_j^TQ_{-j}\sigma_j\right)\sigma_j^TQ}{T\left(1-\frac1T\sigma_j^TQ\sigma_j\right)}Y^T\right)\right]$$
$$- \frac1{T^2\hat T}\frac1{1+\delta}\sum_{j=1}^nE\left[{\rm tr}\left(YQ\sigma_j\hat\sigma_j^T\hat\sigma_j\sigma_j^TQ\left(\delta-\frac1T\sigma_j^TQ_{-j}\sigma_j\right)\left(1+\frac1T\sigma_j^TQ_{-j}\sigma_j\right)Y^T\right)\right] \equiv Z_{31121} + Z_{31122} - Z_{31123}.$$
where $D = {\rm diag}(\{\delta - \frac1T\sigma_j^TQ_{-j}\sigma_j\}_{j=1}^n)$. First, Z31121 is of order $O(n^{\varepsilon-\frac12})$ since
$Q\frac{\Sigma^T\hat\Sigma}{T}$ is of bounded operator norm. Subsequently, Z31122 can be rewritten
as
$$Z_{31122} = \frac1{\hat T}\frac1{1+\delta}E\left[{\rm tr}\left(YQ\frac{\Sigma^TD\Sigma}{T}QY^T\right)\right] = O(n^{\varepsilon-\frac12})$$
with here
$$D = {\rm diag}\left(\left\{\frac{\left(\delta - \frac1T\sigma_j^TQ_{-j}\sigma_j\right)\frac1T{\rm tr}\left(Q_{-j}\frac{\Sigma_{-j}^T\hat\Sigma_{-j}}{T}\Phi_{\hat XX}\right) + \frac1T{\rm tr}(Q_{-j}\Phi)\,\frac1T{\rm tr}\,\Phi_{\hat X\hat X}}{\left(1-\frac1T\sigma_j^TQ\sigma_j\right)\left(1+\frac1T\sigma_j^TQ_{-j}\sigma_j\right)}\right\}_{j=1}^n\right).$$
The term Z31111 is then expanded as follows:
$$Z_{31111} = \frac1{T^2\hat T}\frac1{1+\delta}\sum_{j=1}^n\sum_{i\neq j}E\left[{\rm tr}\left(Y\frac{Q_{-ij}\sigma_i\hat\sigma_i^T}{1+\frac1T\sigma_i^TQ_{-ij}\sigma_i}\Phi_{\hat XX}Q_{-ij}Y^T\right)\right]$$
$$= \frac1{T^2\hat T}\frac1{(1+\delta)^2}\sum_{j=1}^n\sum_{i\neq j}E\left[{\rm tr}\,YQ_{-ij}\sigma_i\hat\sigma_i^T\Phi_{\hat XX}Q_{-ij}Y^T\right] + \frac1{T^2\hat T}\frac1{(1+\delta)^2}\sum_{j=1}^n\sum_{i\neq j}E\left[{\rm tr}\left(Y\frac{Q_{-ij}\sigma_i\hat\sigma_i^T\left(\delta-\frac1T\sigma_i^TQ_{-ij}\sigma_i\right)}{1+\frac1T\sigma_i^TQ_{-ij}\sigma_i}\Phi_{\hat XX}Q_{-ij}Y^T\right)\right]$$
$$= \frac{n^2}{T^2\hat T}\frac1{(1+\delta)^2}E\left[{\rm tr}\,YQ_{--}\Phi_{X\hat X}\Phi_{\hat XX}Q_{--}Y^T\right] + \frac1{T^2\hat T}\frac1{(1+\delta)^2}\sum_{j=1}^n\sum_{i\neq j}E\left[{\rm tr}\,YQ_{-j}\sigma_i\hat\sigma_i^T\left(\delta-\frac1T\sigma_i^TQ_{-ij}\sigma_i\right)\Phi_{\hat XX}Q_{-j}Y^T\right]$$
$$\quad + \frac1{T^2\hat T}\frac1{(1+\delta)^2}\sum_{j=1}^n\sum_{i\neq j}E\left[{\rm tr}\left(YQ_{-j}\sigma_i\hat\sigma_i^T\Phi_{\hat XX}\frac{Q_{-j}\sigma_i\sigma_i^TQ_{-j}}{1-\frac1T\sigma_i^TQ_{-j}\sigma_i}Y^T\right)\frac1T\left(\delta-\frac1T\sigma_i^TQ_{-ij}\sigma_i\right)\right]$$
$$= \frac{n^2}{T^2\hat T}\frac1{(1+\delta)^2}E\left[{\rm tr}\,YQ_{--}\Phi_{X\hat X}\Phi_{\hat XX}Q_{--}Y^T\right] + \frac1{T^2\hat T}\frac1{(1+\delta)^2}\sum_{j=1}^nE\left[{\rm tr}\,YQ_{-j}\Sigma_{-j}^TD\hat\Sigma_{-j}\Phi_{\hat XX}Q_{-j}Y^T\right]$$
$$\quad + \frac{n}{T^2\hat T}\frac1{(1+\delta)^2}\sum_{j=1}^nE\left[YQ_{-j}\Sigma_{-j}^TD'\Sigma_{-j}Q_{-j}Y^T\right] + O(n^{\varepsilon-\frac12}) = \frac{n^2}{T^2\hat T}\frac1{(1+\delta)^2}E\left[{\rm tr}\,YQ_{--}\Phi_{X\hat X}\Phi_{\hat XX}Q_{--}Y^T\right] + O(n^{\varepsilon-\frac12})$$
with Q₋₋ having the same law as Q₋ᵢⱼ, $D = {\rm diag}(\{\delta - \frac1T\sigma_i^TQ_{-ij}\sigma_i\}_{i=1}^n)$
and $D' = {\rm diag}\left(\left\{\frac{(\delta-\frac1T\sigma_i^TQ_{-ij}\sigma_i)\frac1T{\rm tr}(\Phi_{\hat XX}Q_{-ij}\Phi_{X\hat X})}{(1-\frac1T\sigma_i^TQ_{-j}\sigma_i)(1+\frac1T\sigma_i^TQ_{-ij}\sigma_i)}\right\}_{i=1}^n\right)$, both expected to be of
order $O(n^{\varepsilon-\frac12})$. Using again the asymptotic equivalent of E[QAQ] devised
in Section 5.2.3, we then have
$$Z_{31111} = \frac{n^2}{T^2\hat T}\frac1{(1+\delta)^2}E\left[{\rm tr}\,YQ_{--}\Phi_{X\hat X}\Phi_{\hat XX}Q_{--}Y^T\right] + O(n^{\varepsilon-\frac12})$$
$$= \frac1{\hat T}{\rm tr}\,Y\bar Q\Psi_{X\hat X}\Psi_{\hat XX}\bar QY^T + \frac1{\hat T}{\rm tr}\left(\Psi_X\bar Q\Psi_{X\hat X}\Psi_{\hat XX}\bar Q\right)\frac{\frac1n{\rm tr}\,Y\bar Q\Psi_X\bar QY^T}{1-\frac1n{\rm tr}(\Psi_X^2\bar Q^2)} + O(n^{\varepsilon-\frac12}).$$
We then deduce
$$Z_{311} = \frac1{\hat T}{\rm tr}\,Y\bar Q\Psi_{X\hat X}\Psi_{\hat XX}\bar QY^T + \frac1{\hat T}{\rm tr}\left(\Psi_X\bar Q\Psi_{X\hat X}\Psi_{\hat XX}\bar Q\right)\frac{\frac1n{\rm tr}\,Y\bar Q\Psi_X\bar QY^T}{1-\frac1n{\rm tr}(\Psi_X^2\bar Q^2)} - \frac1{\hat T}{\rm tr}\left(\Psi_{\hat XX}\bar Q\Psi_{X\hat X}\right)\frac{\frac1n{\rm tr}\,Y\bar Q\Psi_X\bar QY^T}{1-\frac1n{\rm tr}(\Psi_X^2\bar Q^2)} + O(n^{\varepsilon-\frac12}).$$
Since $Q_{-j}\frac1T\Sigma_{-j}^T\hat\Sigma_{-j}$ is expected to be of bounded norm, using the concen-
tration inequality on the quadratic form $\frac1T\sigma_j^TQ_{-j}\frac{\Sigma_{-j}^T\hat\Sigma_{-j}}{T}\hat\sigma_j$, we infer
$$Z_{312} = \frac1{T\hat T}\sum_{j=1}^nE\left[{\rm tr}\left(Y\frac{Q_{-j}\sigma_j\sigma_j^TQ_{-j}Y^T}{(1+\frac1T\sigma_j^TQ_{-j}\sigma_j)^2}\right)\frac1{T^2}{\rm tr}\left(Q_{-j}\Sigma_{-j}^T\hat\Sigma_{-j}\Phi_{\hat XX}\right)\right] + O(n^{\varepsilon-\frac12}).$$
We again replace $\frac1T\sigma_j^TQ_{-j}\sigma_j$ by δ and take the expectation over wⱼ to obtain
$$Z_{312} = \frac1{T\hat T}\frac1{(1+\delta)^2}\sum_{j=1}^nE\left[{\rm tr}\left(YQ_{-j}\sigma_j\sigma_j^TQ_{-j}Y^T\right)\frac1{T^2}{\rm tr}\left(Q_{-j}\Sigma_{-j}^T\hat\Sigma_{-j}\Phi_{\hat XX}\right)\right]$$
$$+ \frac1{T\hat T}\frac1{(1+\delta)^2}\sum_{j=1}^nE\left[\frac{{\rm tr}\left(YQ_{-j}\sigma_jD_j\sigma_j^TQ_{-j}Y^T\right)}{(1+\frac1T\sigma_j^TQ_{-j}\sigma_j)^2}\frac1{T^2}{\rm tr}\left(Q_{-j}\Sigma_{-j}^T\hat\Sigma_{-j}\Phi_{\hat XX}\right)\right] + O(n^{\varepsilon-\frac12})$$
$$= \frac{n}{T\hat T}\frac1{(1+\delta)^2}E\left[{\rm tr}\left(YQ_-\Phi_XQ_-Y^T\right)\frac1{T^2}{\rm tr}\left(Q_-\Sigma_-^T\hat\Sigma_-\Phi_{\hat XX}\right)\right] + \frac1{T\hat T}\frac1{(1+\delta)^2}E\left[{\rm tr}\left(YQ\Sigma^TD\Sigma QY^T\right)\frac1{T^2}{\rm tr}\left(Q_-\Sigma_-^T\hat\Sigma_-\Phi_{\hat XX}\right)\right] + O(n^{\varepsilon-\frac12})$$
with $D_j = (1+\delta)^2 - (1+\frac1T\sigma_j^TQ_{-j}\sigma_j)^2 = O(n^{\varepsilon-\frac12})$, which eventually brings
the second term to vanish, and we thus get
$$Z_{312} = \frac{n}{T\hat T}\frac1{(1+\delta)^2}E\left[{\rm tr}\left(YQ_-\Phi_XQ_-Y^T\right)\frac1{T^2}{\rm tr}\left(Q_-\Sigma_-^T\hat\Sigma_-\Phi_{\hat XX}\right)\right] + O(n^{\varepsilon-\frac12}).$$
For the term $\frac1{T^2}{\rm tr}\,Q_-\Sigma_-^T\hat\Sigma_-\Phi_{\hat XX}$ we apply again the concentration
inequality to get
$$\frac1{T^2}{\rm tr}\,Q_-\Sigma_-^T\hat\Sigma_-\Phi_{\hat XX} = \frac1{T^2}\sum_{i\neq j}{\rm tr}\,Q_{-j}\sigma_i\hat\sigma_i^T\Phi_{\hat XX} = \frac1{T^2}\sum_{i\neq j}{\rm tr}\left(\frac{Q_{-ij}\sigma_i\hat\sigma_i^T}{1+\frac1T\sigma_i^TQ_{-ij}\sigma_i}\Phi_{\hat XX}\right)$$
$$= \frac1{T^2}\frac1{1+\delta}\sum_{i\neq j}{\rm tr}\left(Q_{-ij}\sigma_i\hat\sigma_i^T\Phi_{\hat XX}\right) + \frac1{T^2}\frac1{1+\delta}\sum_{i\neq j}{\rm tr}\left(\frac{Q_{-ij}\sigma_i\hat\sigma_i^T\left(\delta-\frac1T\sigma_i^TQ_{-ij}\sigma_i\right)}{1+\frac1T\sigma_i^TQ_{-ij}\sigma_i}\Phi_{\hat XX}\right)$$
$$= \frac{n-1}{T^2}\frac1{1+\delta}{\rm tr}\,\Phi_{\hat XX}E[Q_{--}]\Phi_{X\hat X} + \frac1{T^2}\frac1{1+\delta}{\rm tr}\,Q_{-j}\Sigma_{-j}^TD\hat\Sigma_{-j}\Phi_{\hat XX} + O(n^{\varepsilon-\frac12})$$
with high probability, where $D = {\rm diag}(\{\delta - \frac1T\sigma_i^TQ_{-ij}\sigma_i\}_{i=1}^n)$, the norm of
which is of order $O(n^{\varepsilon-\frac12})$. This entails
$$\frac1{T^2}{\rm tr}\,Q_-\Sigma_-^T\hat\Sigma_-\Phi_{\hat XX} = \frac{n}{T^2}\frac1{1+\delta}{\rm tr}\,\Phi_{\hat XX}E[Q_{--}]\Phi_{X\hat X} + O(n^{\varepsilon-\frac12})$$
with high probability. Once more plugging the asymptotic equivalent of
E[QAQ] deduced in Section 5.2.3, we conclude for Z312 that
$$Z_{312} = \frac1{\hat T}{\rm tr}\left(\Psi_{\hat XX}\bar Q\Psi_{X\hat X}\right)\frac{\frac1n{\rm tr}\,Y\bar Q\Psi_X\bar QY^T}{1-\frac1n{\rm tr}(\Psi_X^2\bar Q^2)} + O(n^{\varepsilon-\frac12}).$$
Combining the estimates of E[Z2 ] as well as Z31 and Z32 , we finally have
the estimate for the test error defined in (12) as
$$E_{\rm test} = \frac1{\hat T}\left\|\hat Y^T - \Psi_{X\hat X}^T\bar QY^T\right\|_F^2 + \frac{\frac1n{\rm tr}\,Y\bar Q\Psi_X\bar QY^T}{1-\frac1n{\rm tr}(\Psi_X^2\bar Q^2)}\left[\frac1{\hat T}{\rm tr}\,\Psi_{\hat X\hat X} + \frac1{\hat T}{\rm tr}\,\Psi_X\bar Q\Psi_{X\hat X}\Psi_{\hat XX}\bar Q - \frac2{\hat T}{\rm tr}\,\Psi_{\hat XX}\bar Q\Psi_{X\hat X}\right] + O(n^{\varepsilon-\frac12}).$$
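For completeness, this deterministic test-error estimate can be evaluated numerically; the hedged sketch below does so for σ(t) = erf(t), for which Φ is available in closed form from Table 1 (the data model, the dimensions and γ are illustrative placeholders, not the settings of the article).

```python
import numpy as np

def Phi_erf(A, B):
    # closed-form Phi_{AB} for sigma = erf and w ~ N(0, I_p), cf. Table 1
    na2 = np.sum(A**2, axis=0)           # squared norms of the columns of A
    nb2 = np.sum(B**2, axis=0)
    return 2 / np.pi * np.arcsin(2 * (A.T @ B)
                                 / np.sqrt(np.outer(1 + 2 * na2, 1 + 2 * nb2)))

rng = np.random.default_rng(8)
p, n, T, That, gamma = 30, 100, 150, 120, 1e-1
X = rng.standard_normal((p, T)) / np.sqrt(p)
Xhat = rng.standard_normal((p, That)) / np.sqrt(p)
Y = np.sign(rng.standard_normal((1, T)))
Yhat = np.sign(rng.standard_normal((1, That)))

Phi = Phi_erf(X, X)
Phi_XXh = Phi_erf(X, Xhat)
Phi_XhXh = Phi_erf(Xhat, Xhat)

delta = 0.0
for _ in range(200):
    Qbar = np.linalg.inv(n / T * Phi / (1 + delta) + gamma * np.eye(T))
    delta = np.trace(Phi @ Qbar) / T
Qbar = np.linalg.inv(n / T * Phi / (1 + delta) + gamma * np.eye(T))

Psi_X = n / T * Phi / (1 + delta)
Psi_XXh = n / T * Phi_XXh / (1 + delta)
Psi_XhXh = n / T * Phi_XhXh / (1 + delta)

first = np.linalg.norm(Yhat.T - Psi_XXh.T @ Qbar @ Y.T, 'fro')**2 / That
num = np.trace(Y @ Qbar @ Psi_X @ Qbar @ Y.T) / n
den = 1 - np.trace(Psi_X @ Qbar @ Psi_X @ Qbar) / n
bracket = (np.trace(Psi_XhXh) / That
           + np.trace(Psi_X @ Qbar @ Psi_XXh @ Psi_XXh.T @ Qbar) / That
           - 2 * np.trace(Psi_XXh.T @ Qbar @ Psi_XXh) / That)
E_test_bar = first + num / den * bracket
print(E_test_bar)
```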
z = β T σ(x; W)
Despite its simplicity, the concentration method also has some strong
limitations that presently do not allow for a sufficiently profound analysis of
the testing mean square error. We believe that Conjecture 1 can be proved
by means of more elaborate methods. Notably, we believe that the powerful
Gaussian method advertised in (Pastur and Ŝerbina, 2011) which relies on
Stein’s lemma and the Poincaré–Nash inequality could provide a refined
control of the residual terms involved in the derivation of Conjecture 1.
However, since Stein’s lemma (which states that E[xφ(x)] = E[φ′ (x)] for
x ∼ N (0, 1) and differentiable polynomially bounded φ) can only be used
on products xφ(x) involving the linear component x, the latter is not directly
accessible; we nonetheless believe that appropriate ansatzs of Stein’s lemma,
adapted to the non-linear setting and currently under investigation, could
be exploited.
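As a small illustration of the identity invoked here (not of the non-linear ansatz itself, which remains under investigation), Stein's lemma E[xφ(x)] = E[φ'(x)] for x ∼ N(0, 1) can be checked numerically for, say, the placeholder test function φ(x) = x³:

```python
import numpy as np

rng = np.random.default_rng(9)
x = rng.standard_normal(5_000_000)
phi = lambda t: t**3          # a polynomially bounded, differentiable test function
dphi = lambda t: 3 * t**2
print(np.mean(x * phi(x)), np.mean(dphi(x)))   # both close to E[x^4] = 3
```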
As a striking example, one key advantage of such a tool would be the
possibility to evaluate expectations of the type $Z = E[\sigma\sigma^T(\frac1T\sigma^TQ_-\sigma - \alpha)]$
which, in our present analysis, was shown to be bounded in the order of
symmetric matrices by $\Phi\,Cn^{\varepsilon-\frac12}$ with high probability. Thus, if no matrix
(such as Q̄) pre-multiplies Z, since kΦk can grow as large as O(n), Z cannot
be shown to vanish. But such a bound does not account for the fact that
erate almost equally well when taken random in some very specific scenar-
ios, are usually only initiated as random networks before being subsequently
trained through backpropagation of the error on the training dataset (that is,
essentially through convex gradient descent). We believe that our framework
can allow for the understanding of at least finitely many steps of gradient de-
scent, which may then provide further insights into the overall performance
of deep learning networks.
REFERENCES
Akhiezer, N. I. and Glazman, I. M. (1993). Theory of linear operators in Hilbert space.
Courier Dover Publications.
Bai, Z. D. and Silverstein, J. W. (1998). No eigenvalues outside the support of the lim-
iting spectral distribution of large dimensional sample covariance matrices. The Annals
of Probability 26 316-345.
Bai, Z. D. and Silverstein, J. W. (2007). On the signal-to-interference-ratio of CDMA
systems in wireless communications. Annals of Applied Probability 17 81-101.
Bai, Z. D. and Silverstein, J. W. (2009). Spectral analysis of large dimensional random
matrices, second ed. Springer Series in Statistics, New York, NY, USA.
Benaych-Georges, F. and Nadakuditi, R. R. (2012). The singular values and vectors
of low rank perturbations of large rectangular random matrices. Journal of Multivariate
Analysis 111 120–135.
Cambria, E., Gastaldo, P., Bisio, F. and Zunino, R. (2015). An ELM-based model
for affective analogical reasoning. Neurocomputing 149 443–455.
Choromanska, A., Henaff, M., Mathieu, M., Arous, G. B. and LeCun, Y. (2015).
The Loss Surfaces of Multilayer Networks. In AISTATS.
Couillet, R. and Benaych-Georges, F. (2016). Kernel spectral clustering of large
dimensional data. Electronic Journal of Statistics 10 1393–1454.
Couillet, R. and Kammoun, A. (2016). Random Matrix Improved Subspace Clustering.
In 2016 Asilomar Conference on Signals, Systems, and Computers.
Couillet, R., Pascal, F. and Silverstein, J. W. (2015). The random matrix regime
of Maronna’s M-estimator with elliptically distributed samples. Journal of Multivariate
Analysis 139 56–78.
Giryes, R., Sapiro, G. and Bronstein, A. M. (2015). Deep Neural Networks with
Random Gaussian Weights: A Universal Classification Strategy? IEEE Transactions
on Signal Processing 64 3444-3457.
Hornik, K., Stinchcombe, M. and White, H. (1989). Multilayer feedforward networks
are universal approximators. Neural networks 2 359–366.
Hoydis, J., Couillet, R. and Debbah, M. (2013). Random beamforming over quasi-
static and fading channels: a deterministic equivalent approach. IEEE Transactions on
Information Theory 58 6392-6425.
Huang, G.-B., Zhu, Q.-Y. and Siew, C.-K. (2006). Extreme learning machine: theory
and applications. Neurocomputing 70 489–501.
Huang, G.-B., Zhou, H., Ding, X. and Zhang, R. (2012). Extreme learning machine
for regression and multiclass classification. Systems, Man, and Cybernetics, Part B:
Cybernetics, IEEE Transactions on 42 513–529.
Jaeger, H. and Haas, H. (2004). Harnessing nonlinearity: Predicting chaotic systems
and saving energy in wireless communication. Science 304 78–80.
Kammoun, A., Kharouf, M., Hachem, W. and Najim, J. (2009). A central limit the-
orem for the sinr at the lmmse estimator output for large-dimensional signals. IEEE
Transactions on Information Theory 55 5048–5063.
El Karoui, N. (2009). Concentration of measure and spectra of random matrices: ap-
plications to correlation matrices, elliptical distributions and beyond. The Annals of
Applied Probability 19 2362–2405.
El Karoui, N. (2010). The spectrum of kernel random matrices. The Annals of Statistics
38 1–50.
El Karoui, N. (2013). Asymptotic behavior of unregularized and ridge-regularized
high-dimensional robust regression estimators: rigorous results. arXiv preprint
arXiv:1311.2445.
Krizhevsky, A., Sutskever, I. and Hinton, G. E. (2012). Imagenet classification with
deep convolutional neural networks. In Advances in neural information processing sys-
tems 1097–1105.
LeCun, Y., Cortes, C. and Burges, C. (1998). The MNIST database of handwritten
digits.
Ledoux, M. (2005). The concentration of measure phenomenon 89. American Mathemat-
ical Soc.
Loubaton, P. and Vallet, P. (2010). Almost sure localization of the eigenvalues in a
Gaussian information plus noise model. Application to the spiked models. Electronic
Journal of Probability 16 1934–1959.
Mai, X. and Couillet, R. (2017). The counterintuitive mechanism of graph-based semi-
supervised learning in the big data regime. In IEEE International Conference on Acous-
tics, Speech and Signal Processing (ICASSP’17).