Random Matrix Methods for Machine Learning
This book presents a unified theory of random matrices for applications in machine learning, offering a large-dimensional data vision that exploits concentration and universality phenomena. This enables a precise understanding, and possible improvements, of the core mechanisms at play in real-world machine learning algorithms. The book opens with a thorough introduction to the theoretical basics of random matrices, which serves as a support to a wide scope of applications ranging from support vector machines, through semi-supervised learning, unsupervised spectral clustering, and graph methods, to neural networks and deep learning. For each application, the authors discuss small- versus large-dimensional intuitions of the problem, followed by a systematic random matrix analysis of the resulting performance and possible improvements. All concepts, applications, and variations are illustrated numerically on synthetic as well as real-world data, with MATLAB and Python code provided on the accompanying website.
Romain Couillet is a full professor at Grenoble Alpes University, France. Prior to that, he was a full professor at CentraleSupélec, University of Paris-Saclay, France. His research topics are in random matrix theory applied to statistics, machine learning, and signal processing. He is the recipient of the 2021 IEEE/SEE (Institute of Electrical and Electronics Engineers/Society of Electricity, Electronics and Information and Communication Technologies) Glavieux Prize, the 2013 CNRS (The National Center for Scientific Research) Bronze Medal, and the 2013 IEEE ComSoc Outstanding Young Researcher Award.

Zhenyu Liao is an associate professor with Huazhong University of Science and Technology (HUST), China. He is the recipient of the 2021 East Lake Youth Talent Program Fellowship of HUST, the 2019 ED STIC Ph.D. Student Award, and the 2016 Supélec Foundation Ph.D. Fellowship of University of Paris-Saclay, France.
www.cambridge.org
Information on this title: www.cambridge.org/9781009123235
DOI: 10.1017/9781009128490
© Romain Couillet and Zhenyu Liao 2022
This publication is in copyright. Subject to statutory exception
and to the provisions of relevant collective licensing agreements,
no reproduction of any part may take place without the written
permission of Cambridge University Press.
First published 2022
Printed in the United Kingdom by TJ Books Limited, Padstow Cornwall
A catalogue record for this publication is available from the British Library.
ISBN 978-1-009-12323-5 Hardback
Cambridge University Press has no responsibility for the persistence or accuracy of
URLs for external or third-party internet websites referred to in this publication
and does not guarantee that any content on such websites is, or will remain,
accurate or appropriate.
1 Introduction
1.1 Motivation: The Pitfalls of Large-Dimensional Statistics
1.2 Random Matrix Theory as an Answer
1.3 Outline and Online Toolbox
Bibliography
Index
This chapter discusses fundamentally different mental images of large- versus small-dimensional machine learning through examples of sample covariance and kernel matrices, on both synthetic and real data. Random matrix theory is presented as a flexible and powerful tool to assess, understand, and improve classical machine learning methods in this modern large-dimensional setting.
1.1.1 The Big Data Era: When n Is No Longer Much Larger than p
The big data revolution comes along with the challenging needs to parse, mine, and
compress a large amount of large-dimensional and possibly heterogeneous data. In
many applications, the dimension p of the observations is as large as – if not much
larger than – their number n. In array processing and wireless communications, the
number of antennas required for fine localization resolution or increased communi-
cation throughput may be as large (today in the order of hundreds) as the number of
available independent signal observations [Li and Stoica, 2007, Lu et al., 2014]. In
genomics, the identification of correlations among hundreds of thousands of genes
based on a limited number of independent (and expensive) samples induces an even
larger ratio p/n [Arnold et al., 1994]. In statistical finance, portfolio optimization relies
on the need to invest on a large number p of assets to reduce volatility but at the same
time to estimate the current (rather than past) asset statistics from a relatively small
number n of asset return records [Laloux et al., 2000].
As we shall demonstrate in the following section, the fact that in these problems
n is not much larger than p annihilates most of the results from standard asymp-
totic statistics that assume n alone is large [Vaart, 2000]. As a rule of thumb, by
“much larger” we mean here that n must be at least 100 times larger than p for
standard asymptotic statistics to be of practical convenience (see our argument in Sec-
tion 1.1.2). Many algorithms in statistics, signal processing, and machine learning are
precisely derived from this n ≫ p assumption that is no longer appropriate today. A
major objective of this book is to cast some light on the resulting biases and prob-
lems incurred and to provide a systematic random matrix framework to improve these
algorithms.
Possibly more importantly, we will see in this book that (small p) small-dimensional
intuitions at the core of many machine learning algorithms (starting with spectral
clustering [Ng et al., 2002, Luxburg, 2007]) may strikingly fail when applied in a
simultaneously large n,p setting. A compelling example lies in the notion of “dis-
tance” between vectors. Most classification methods in machine learning are rooted
in the observation that random data vectors arising from a mixture distribution (say
Gaussian) gather in “groups” of close-by vectors in the Euclidean norm. When deal-
ing with large-dimensional data, however, concentration phenomena arise that make
Euclidean distances useless, if not counterproductive: Vectors from the same mixture
class may be further away in Euclidean distance than vectors arising from different
classes. While classification may still be doable, it works in a rather different way
from our small-dimensional intuition. The book intends to prepare the reader for the
multiple traps caused by this “curse of dimensionality.”
$$\hat{C} = \frac{1}{n}\sum_{i=1}^{n} x_i x_i^{\mathsf T} = \frac{1}{n} X X^{\mathsf T}. \tag{1.1}$$
These results ensure that, as far as spectral properties are concerned, A_p can be studied equivalently through B_p. We will often use this argument to investigate intractable random matrices A_p by means of a more tractable "proxy" B_p.
The pitfall that consists in assuming that Ĉ is a valid estimator of C since ‖Ĉ − C‖_∞ → 0 almost surely may thus have deleterious practical consequences when n is not significantly larger than p.
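As a minimal numerical sketch of this pitfall (an illustration only, not the book's companion MATLAB/Python code), one can check that, for C = I_p and n only a few times larger than p, the entrywise error of Ĉ keeps shrinking while its operator-norm (spectral) error stays of order one:

```python
import numpy as np

rng = np.random.default_rng(0)

for p in [50, 200, 800]:
    n = 2 * p                        # n comparable to p: the random matrix regime
    X = rng.standard_normal((p, n))  # columns x_i ~ N(0, I_p), so the true C = I_p
    C_hat = X @ X.T / n              # sample covariance (1.1)

    err_inf = np.max(np.abs(C_hat - np.eye(p)))        # entrywise (max) error
    err_op = np.linalg.norm(C_hat - np.eye(p), ord=2)  # operator-norm (spectral) error

    print(f"p={p:4d}, n={n:5d}:  ||C_hat - I||_inf = {err_inf:.3f},"
          f"  ||C_hat - I||_op = {err_op:.3f}")
```

Running the sketch, the entrywise error decreases with p while the spectral error stalls around (1 + √c)² − 1 for c = p/n, consistent with the Marčenko–Pastur spread discussed next.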
Figure 1.1 Histogram of the eigenvalues of Ĉ versus the Marčenko–Pastur law, for X having standard Gaussian entries, p = 500 and n = 50 000. Code on web: MATLAB and Python.

Resuming our discussion of norm convergence, it is now natural to ask whether Ĉ, which badly estimates C, has a controlled asymptotic behavior. Therein precisely lie the first theoretical interests of random matrix theory. While Ĉ itself does not converge in any useful way, its eigenvalue distribution does exhibit a traceable limiting behavior
[Marčenko and Pastur, 1967, Silverstein and Bai, 1995, Bai and Silverstein, 2010]. The
seminal result in this direction, due to Marčenko and Pastur, states that, for C = I_p, as n, p → ∞ with p/n → c ∈ (0, ∞), it holds with probability 1 that the random discrete eigenvalue/empirical spectral distribution

$$\mu_p \equiv \frac{1}{p} \sum_{i=1}^{p} \delta_{\lambda_i(\hat{C})}$$

converges in law to a nonrandom smooth limit, today referred to as the "Marčenko–Pastur law" [Marčenko and Pastur, 1967],

$$\mu(dx) = (1 - c^{-1})^{+}\, \delta_0(x) + \frac{1}{2\pi c x}\sqrt{(x - E_-)^{+} (E_+ - x)^{+}}\, dx, \tag{1.2}$$

where E_± = (1 ± √c)² and (x)^+ ≡ max(x, 0).
Figure 1.1 compares the empirical spectral distribution of Ĉ to the limiting
Marc̆enko–Pastur law given in (1.2), for p = 500 and n = 50 000.
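The following short Python sketch (a stand-in for the MATLAB and Python code provided on the book's website, not the code itself) reproduces the spirit of Figure 1.1 by comparing the empirical eigenvalues of Ĉ to the density in (1.2):

```python
import numpy as np
import matplotlib.pyplot as plt

p, n = 500, 50_000
c = p / n
rng = np.random.default_rng(0)

X = rng.standard_normal((p, n))       # standard Gaussian entries
C_hat = X @ X.T / n                   # sample covariance; true C = I_p
eigvals = np.linalg.eigvalsh(C_hat)

# Marchenko-Pastur density (1.2) on its support [E_-, E_+]; no atom at 0 since c < 1
E_minus, E_plus = (1 - np.sqrt(c)) ** 2, (1 + np.sqrt(c)) ** 2
x = np.linspace(E_minus, E_plus, 1000)
mp_density = np.sqrt(np.maximum(x - E_minus, 0) * np.maximum(E_plus - x, 0)) / (2 * np.pi * c * x)

plt.hist(eigvals, bins=60, density=True, label="Empirical eigenvalues of C_hat")
plt.plot(x, mp_density, "r", label="Marchenko-Pastur law")
plt.xlabel("Eigenvalues of C_hat")
plt.legend()
plt.show()
```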
The elementary Marc̆enko–Pastur result is already quite instructive and insightful.
Remark 1.2 (When is one under the random matrix regime?). Equation (1.2) reveals that the eigenvalues of Ĉ, instead of concentrating at x = 1 as a large-n alone analysis would suggest, are spread from (1 − √c)² to (1 + √c)². As such, the eigenvalues span a range

$$(1 + \sqrt{c})^2 - (1 - \sqrt{c})^2 = 4\sqrt{c}.$$

This is a slowly decaying behavior with respect to c = lim p/n. In particular, for n = 100p, in which case one would expect a sufficiently large number of samples for Ĉ to properly estimate C = I_p, one has 4√c = 0.4, which is a large spread around the mean (and true) eigenvalue 1. This is visually confirmed by Figure 1.1 for p = 500 and n = 50 000, where the histogram of the eigenvalues is nowhere near concentrated at x = 1. Therefore, random matrix results will be much more accurate than classical asymptotic statistics even when n ∼ 100p. As a telling example, estimating the covariance matrix of each digit from the popular Modified National Institute of Standards and Technology (MNIST) dataset [LeCun et al., 1998], made of no more than 60 000 training samples (and thus about n = 6 000 samples per digit) of size p = 784, is likely a hazardous undertaking.
Remark 1.3 (On universality). Although introduced here in the context of a Gaussian
distribution for xi , the Marc̆enko–Pastur law applies to much more general cases.
Indeed, the result remains valid as long as the xi s have independent normalized entries
of zero mean and unit variance (and even beyond this setting, see El Karoui [2009]
and Louart and Couillet [2018]). Similar to the law of large numbers in standard
asymptotic statistics, this universality phenomenon commonly arises in random matrix
theory and large-dimensional statistics. We will exploit this phenomenon in the book
to justify the wide applicability of the presented results, even to real datasets. See
Chapter 8 for more detail.
for some constant τ > 0 as n,p → ∞, independently of the classes (same or different)
of xi and x j (here the normalization by p is used for compliance with the notations in
the remainder of this book and has no particular importance).
This asymptotic behavior is extremely counterintuitive and conveys the idea that
classification by standard methods ought not to be doable in this large-dimensional
regime. Indeed, in the conventional small-dimensional intuition that forged many of
the leading machine learning algorithms of everyday use (such as spectral clustering
[Ng et al., 2002, Luxburg, 2007]), two data points are assigned to the same class if
they are “close” in Euclidean distance. Here we claim that, when p is large, data pairs
are neither close nor far from each other, regardless of their belonging to the same
class or not. Despite this troubling loss of individual discriminative power between
data pairs, we subsequently show that, thanks to a collective behavior of all data
belonging to the same (few and thus large) classes, data classification or clustering
is still achievable. Better, we shall see that, while many conventional methods devised
from small-dimensional intuitions do fail in this large-dimensional regime, some pop-
ular approaches, such as the Ng–Jordan–Weiss spectral clustering method [Ng et al.,
2002] or the PageRank semisupervised learning approach [Avrachenkov et al., 2012],
still function. But the core reasons for their functioning are strikingly different from
the reasons of their initial designs, and they often operate far from optimally.
each drawn with probability 1/2, for some deterministic μ ∈ R p and symmetric
E ∈ R p×p , both possibly depending on p. In the ideal case where μ and E are perfectly
known, one can devise a (decision optimal) Neyman–Pearson test. For an unknown x,
genuinely belonging to C1 , the Neyman–Pearson test to decide on the class of x reads
$$(x+\mu)^{\mathsf T}(I_p+E)^{-1}(x+\mu) - (x-\mu)^{\mathsf T}(x-\mu) \;\overset{\mathcal{C}_1}{\underset{\mathcal{C}_2}{\gtrless}}\; -\log\det(I_p+E). \tag{1.5}$$
where

$$\bar{T} = 4\|\mu\|^2 - \frac{1}{2}\operatorname{tr}(E^2) + o(1); \qquad V_T = 16\|\mu\|^2 + 2\operatorname{tr}(E^2) + o(1),$$
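Since the form of (1.5) suggests a two-class model of the type C1 : x ∼ N(+μ, I_p) versus C2 : x ∼ N(−μ, I_p + E) (the precise class definitions are given earlier in the book and are assumed here), the test can be sketched numerically as follows; the specific choices of μ and E below are merely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
p = 200

# Illustrative (assumed) statistics: ||mu|| = O(1), E symmetric with small operator norm
mu = np.zeros(p)
mu[0] = 2.0
A = rng.standard_normal((p, p))
E = (A + A.T) / (6 * np.sqrt(p))

Sigma2 = np.eye(p) + E                       # covariance of class C2 (assumed model)
threshold = -np.linalg.slogdet(Sigma2)[1]    # right-hand side of (1.5): -log det(I_p + E)
Sigma2_inv = np.linalg.inv(Sigma2)

# Empirical error when x genuinely belongs to C1 : x ~ N(+mu, I_p)
n_test = 5000
X1 = rng.standard_normal((n_test, p)) + mu   # rows are test samples from C1
Xp = X1 + mu
lhs = np.sum((Xp @ Sigma2_inv) * Xp, axis=1) - np.sum((X1 - mu) ** 2, axis=1)
error_rate = np.mean(lhs <= threshold)       # decided C2 although x came from C1
print(f"empirical misclassification rate on C1: {error_rate:.3f}")
```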
These are the minimal conditions for classification in the case of perfectly known means and covariances in the following sense: (i) if none of the inequalities hold (i.e., if the means and covariances from both classes are too close), asymptotic classification must fail and (ii) if at least one of the inequalities is not tight (say if ‖μ‖ ≥ O(√p)), asymptotic classification becomes trivial.¹
We shall subsequently see that (1.7) precisely induces the asymptotic loss of dis-
tance discrimination raised in (1.3) but that standard spectral clustering methods based
on n ∼ p data remain valid.
$$\frac{1}{p}\|x_i - x_j\|^2 = \begin{cases} \dfrac{1}{p}\|z_i - z_j\|^2 + A\,p^{-1}, & \text{for } a = b, \\[2mm] \dfrac{1}{p}\|z_i - z_j\|^2 + B\,p^{-1}, & \text{for } a = 1,\ b = 2, \end{cases} \tag{1.8}$$
1 It should be noted here that, unlike in computer science, we will stick in this book with the notation O(·), used indifferently in place of the complexity notations Ω(·), O(·), and Θ(·). The exact meaning of O(·) will be clear in context. For instance, under computer science notations, Equation (1.7) would be ‖μ‖ ≥ Θ(1), ‖E‖ ≥ Θ(p^{−1/2}), |tr(E)| ≥ Θ(1), and tr(E²) ≥ Θ(1).
Figure 1.2 Kernel matrices K and their second top eigenvectors v2, in the settings (a) p = 5 and (b) p = 250 described in the text below.
where
almost surely as n, p → ∞ (this follows by exploiting the fact that ‖z_i − z_j‖² is a chi-square random variable with p degrees of freedom). As a consequence, as previously claimed in (1.3),

$$\max_{1 \le i \ne j \le n} \left| \frac{1}{p}\|x_i - x_j\|^2 - \tau \right| \to 0$$
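This concentration of pairwise distances is easy to check directly; the sketch below (an illustration only, not the book's online code) draws n vectors from the two-class mixture N(±μ, I_p) and reports how the normalized squared distances spread around their common limit, here τ = 2 for identity covariances (consistent with the kernel value exp(−1) ≈ exp(−τ/2) discussed next):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
mu_norm = 2.0                                    # ||mu|| = O(1), as in (1.7)

for p in [5, 50, 500, 5000]:
    mu = np.zeros(p)
    mu[0] = mu_norm
    labels = rng.integers(0, 2, size=n) * 2 - 1  # +/- 1 class labels
    X = labels[:, None] * mu + rng.standard_normal((n, p))   # rows x_i ~ N(+/- mu, I_p)

    # all pairwise squared distances, normalized by p
    sq_norms = np.sum(X ** 2, axis=1)
    D = (sq_norms[:, None] + sq_norms[None, :] - 2 * X @ X.T) / p
    off_diag = D[~np.eye(n, dtype=bool)]

    print(f"p={p:5d}: max deviation of ||x_i - x_j||^2 / p from 2 = "
          f"{np.max(np.abs(off_diag - 2)):.3f}")
```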
In Figure 1.2, the data are drawn from a two-class Gaussian mixture x ∼ N(±μ, I_p), with μ = [2; 0_{p−1}]. For a constant n = 500, we take p = 5 in Figure 1.2(a) and p = 250 in Figure 1.2(b).
While the “block-structure” in the case of p = 5 of Figure 1.2(a) does agree with
the small-dimensional intuition – data vectors from the same class are “closer” to one
another in diagonal blocks with larger values (since exp(−x/2) decreases with x) than
in nondiagonal blocks – this intuition collapses when large-dimensional data vectors
are considered. Indeed, in the large data setting of Figure 1.2(b), all entries (except
obviously on the diagonal) of K have approximately the same value, which, we now
know from (1.3), is exp(−1).
This is no longer surprising to us. However, what remains surprising in Figure 1.2
at this stage of our analysis is that the eigenvector v2 of K seems not affected by this
(asymptotic) loss of class-wise discrimination of individual distances. And spectral
clustering seems to work equally well for p = 5 and for p = 250, despite the radical
and intuitively destructive change in the behavior of K for p = 250.
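The spirit of Figure 1.2 can be reproduced with the following sketch, assuming the Gaussian kernel K_ij = exp(−‖x_i − x_j‖²/(2p)) suggested by the exp(−x/2) and exp(−1) values quoted above (this is an illustration, not the book's own code):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

for p in [5, 250]:
    mu = np.zeros(p)
    mu[0] = 2.0
    # first n/2 samples from N(+mu, I_p), last n/2 from N(-mu, I_p)
    j = np.concatenate([np.ones(n // 2), -np.ones(n // 2)])
    X = j[:, None] * mu + rng.standard_normal((n, p))

    sq_norms = np.sum(X ** 2, axis=1)
    D = sq_norms[:, None] + sq_norms[None, :] - 2 * X @ X.T   # squared distances
    K = np.exp(-D / (2 * p))                                  # Gaussian kernel matrix

    eigvals, eigvecs = np.linalg.eigh(K)
    v2 = eigvecs[:, -2]                                       # second top eigenvector
    acc = max(np.mean(np.sign(v2) == j), np.mean(np.sign(v2) == -j))
    off = K[~np.eye(n, dtype=bool)]
    print(f"p={p:4d}: class recovery from sign(v2) = {acc:.2f},"
          f"  off-diagonal K entries in [{off.min():.2f}, {off.max():.2f}]")
```

For p = 250, the off-diagonal entries of K indeed all collapse around exp(−1) ≈ 0.37, yet the sign of v2 still recovers the two classes, exactly as the figure shows.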
$$K = \exp(-1)\left( 1_n 1_n^{\mathsf T} + \frac{1}{p} Z^{\mathsf T} Z \right) + f(\mu,E)\cdot \frac{1}{p}\, j j^{\mathsf T} + * + o_{\|\cdot\|}(1), \tag{1.9}$$
where Z = [z1 ,. . . ,zn ] ∈ R p×n is a Gaussian noise matrix, f (μ,E) = O(1), and
j = [1n/2 ; − 1n/2 ] is the class-information “label” vector (as in the setting of
Figure 1.2). Here “*” symbolizes extra terms of marginal importance to the present
discussion, and o· (1) represents terms of asymptotically vanishing operator norm as
n,p → ∞. The important remark to be made here is that
(i) Under this description, [K]_ij = exp(−1)(1 + z_i^T z_j/p) ± f(μ,E)/p + ∗, with f(μ,E)/p ≪ z_i^T z_j/p = O(p^{−1/2}); this is consistent with our previous discussion: The statistical information is entry-wise dominated by noise.
(ii) From a spectral viewpoint, ‖Z^T Z/p‖ = O(1), as per the Marčenko–Pastur theorem [Marčenko and Pastur, 1967] discussed in Section 1.1.2 and visually confirmed in Figure 1.1, while ‖f(μ,E) · jj^T/p‖ = O(1): Thus, spectrum-wise, the information stands on even ground with the noise.
v2 = v2 =
⎡ ⎤ ⎡ ⎤
⎢ ⎥ ⎢ ⎥
⎢ ⎥ ⎢ ⎥
⎢ ⎥ ⎢ ⎥
⎢ ⎥ ⎢ ⎥
⎢ ⎥ ⎢ ⎥
⎢ ⎥ ⎢ ⎥
⎢ ⎥ ⎢ ⎥
⎢ ⎥ ⎢ ⎥
K=⎢ ⎥ K=⎢ ⎥
⎢ ⎥ ⎢ ⎥
⎢ ⎥ ⎢ ⎥
⎢ ⎥ ⎢ ⎥
⎢ ⎥ ⎢ ⎥
⎢ ⎥ ⎢ ⎥
⎢ ⎥ ⎢ ⎥
⎣ ⎦ ⎣ ⎦
Figure 1.3 Gaussian kernel matrices K and the second top eigenvectors v2 for (a) MNIST
[LeCun et al., 1998] (class 8 versus 9) and (b) Fashion-MNIST [Xiao et al., 2017] data
(class 5 versus 7), with x1 ,. . . ,xn/2 ∈ C1 and xn/2+1 ,. . . ,xn ∈ C2 for n = 5 000. Code on
web: MATLAB and Python.
The mathematical magic at play here lies in f (μ,E) · jjT /p having entries of order
O(p−1 ) while being a low-rank (here unit-rank) matrix: All its “energy” concentrates
in a single nonzero eigenvalue. As for ZT Z/p, with larger O(p−1/2 ) amplitude entries,
it is composed of “essentially independent” zero-mean random variables and tends
to be of full rank and spreads its energy over its n eigenvalues. Spectrum-wise, both
f (μ,E) · jjT /p and ZT Z/p meet on even ground under the nontrivial classification
setting of (1.7).
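This "low-rank information versus full-rank noise" competition can be visualized with a toy version of (1.9), treating f(μ,E) as a fixed scalar f and dropping the constant terms (an assumption made purely for illustration): although the rank-one term has entries of order O(p⁻¹), once f is large enough it creates an isolated eigenvalue whose eigenvector correlates with the class labels.

```python
import numpy as np

rng = np.random.default_rng(0)
n = p = 1000

Z = rng.standard_normal((p, n))
j = np.concatenate([np.ones(n // 2), -np.ones(n // 2)])      # class-label vector

noise = Z.T @ Z / p                                          # full rank, O(1) spectral norm
for f in [0.0, 1.0, 4.0]:                                    # scalar playing the role of f(mu, E)
    M = noise + f * np.outer(j, j) / p                       # rank-one "information" term
    eigvals, eigvecs = np.linalg.eigh(M)
    top_vec = eigvecs[:, -1]
    alignment = abs(top_vec @ j) / np.sqrt(n)                # in [0, 1]
    print(f"f={f:3.1f}: largest eigenvalue = {eigvals[-1]:.2f}, "
          f"alignment |<u_1, j>| / sqrt(n) = {alignment:.2f}")
```

In this toy setting, small values of f leave the top eigenvalue stuck at the bulk edge with negligible alignment, while a sufficiently large f produces an isolated eigenvalue and a strongly class-aligned eigenvector.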
We shall see in Section 4 that things are actually not as clear-cut and, in particular,
that not all choices of kernel functions can achieve the same nontrivial classification
rates. In particular, the popular Gaussian (radial basis function [RBF]) kernel will be
shown to be largely suboptimal in this respect.
1.1.4 Summarizing
In this section, we discussed two simple, yet counterintuitive examples of common
pitfalls in learning from large-dimensional data.
In the sample covariance matrix example of Section 1.1.2, we made the important remark that matrix norms lose their equivalence in the random matrix regime where the data (or feature) dimension p and their number n are both large and comparable, which is at the source of many seemingly striking empirical observations in modern machine learning. We insist, in particular, that for matrices A_n, B_n ∈ R^{n×n} of large sizes,
that is defined, for all z ∈ C not in the eigenspectrum of X (i.e., not coinciding with an
eigenvalue of X), by
$$m_X(z) \equiv \int \frac{\mu_X(d\lambda)}{\lambda - z} = \frac{1}{n}\sum_{i=1}^{n} \frac{1}{\lambda_i(X) - z} = \frac{1}{n}\operatorname{tr} Q_X(z);$$
for Γ ⊂ C, a positively oriented contour in the complex plane surrounding all the λ_i(X)'s, and f(z) complex analytic in a neighborhood of the "inside" of Γ;
• the eigenvectors and subspaces of X, again, through Cauchy's integral relation

$$u_i(X)\, u_i(X)^{\mathsf T} = -\frac{1}{2\pi\imath} \oint_{\Gamma_{\lambda_i(X)}} Q_X(z)\, dz,$$

for all 1-Lipschitz linear mappings u : R^{n×n} → R. Of particular interest are the functions u(X) = (1/n) tr(AX) for ‖A‖ ≤ 1, and u(X) = a^T X b for ‖a‖, ‖b‖ ≤ 1.²
As an example, in the setting of the Marčenko–Pastur law, where the random matrix of interest is (1/n) X X^T with X ∈ R^{p×n} having i.i.d. zero mean and unit variance entries, the resolvent

$$Q(z) = \left( \frac{1}{n} X X^{\mathsf T} - z I_p \right)^{-1}$$

admits

$$\bar{Q}(z) = m_\mu(z)\, I_p, \qquad m_\mu(z) = \int \frac{\mu(d\lambda)}{\lambda - z}, \quad \text{for } \mu \text{ defined in (1.2)},$$

as a deterministic equivalent. Thus, in particular, (1/p) tr Q(z) − m_μ(z) → 0 almost surely and a^T Q(z) b − m_μ(z) a^T b → 0 almost surely for deterministic a, b ∈ R^p of bounded Euclidean norm.
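As a quick numerical sanity check of this deterministic equivalent (a sketch under the above setting, not the book's code), one can compare (1/p) tr Q(z) with m_μ(z) obtained by direct numerical integration of the Marčenko–Pastur density (1.2), for a point z outside the support:

```python
import numpy as np

rng = np.random.default_rng(0)
p, n = 400, 1600
c = p / n
z = -1.0                                  # real z < 0, outside the eigenvalue support

X = rng.standard_normal((p, n))
Q = np.linalg.inv(X @ X.T / n - z * np.eye(p))   # resolvent Q(z)
trace_term = np.trace(Q) / p

# m_mu(z) by numerical integration of the density (1.2); c < 1, so no atom at 0
E_minus, E_plus = (1 - np.sqrt(c)) ** 2, (1 + np.sqrt(c)) ** 2
lam = np.linspace(E_minus, E_plus, 200_000)
density = np.sqrt(np.maximum(lam - E_minus, 0) * np.maximum(E_plus - lam, 0)) / (2 * np.pi * c * lam)
m_mu = np.sum(density / (lam - z)) * (lam[1] - lam[0])

a = rng.standard_normal(p)                # drawn independently of X, then held fixed
a /= np.linalg.norm(a)
print(f"(1/p) tr Q(z)      = {trace_term:.4f}")
print(f"m_mu(z) (integral) = {m_mu:.4f}")
print(f"a^T Q(z) a         = {a @ Q @ a:.4f}")
```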
Consequently, the resolvent (and Stieltjes transform) approach simultaneously
involves notions from three distinct mathematical areas:
• linear algebra, and particularly the exploitation of inverse matrix lemmas, the
Schur complement, interlacing, and low-rank perturbation identities [Horn and
Johnson, 2012];
• complex analysis (the resolvent Q(z) is a complex analytic matrix-valued
function), and particularly the theory of analytic functions, contour integrals, and
residue calculus [Stein and Shakarchi, 2003];
• probability theory, and, most specifically, notions of convergence, central limit
theory, and the method of moments [Billingsley, 2012]. Depending on the
underlying random matrix assumptions (independence of entries, Gaussianity,
concentration properties), different random matrix-adapted techniques (among
others and variations) will be discussed in this book: the Gaussian tools developed
by Pastur, relying on Stein’s lemma and the Nash–Poincaré inequality [Pastur and
Shcherbina, 2011], the Bai–Silverstein inductive method [Bai and Silverstein,
2010], the concentration of measure framework developed by Ledoux [2005] and
applied to random matrix endeavors successively by El Karoui [2009], Vershynin
[2012], and Louart and Couillet [2018], or the double leave-one-out approach
devised by El Karoui et al. [2013].
The aforementioned tools are, in general, used together with a perturbation approach in the sense that they exploit the fact that, by eliminating a row or a column (say, here both row and column i) of a large random matrix X ∈ R^{n×n} to obtain X_{−i} ∈ R^{(n−1)×(n−1)}, the resulting resolvent Q_{−i}(z) = (X_{−i} − zI_{n−1})^{−1} can be related to the original resolvent Q(z) through both linear algebraic relations and asymptotically comparable statistical behaviors. For instance, in the case of symmetric X with i.i.d. (and properly normalized) entries, it is not difficult to show that m_X(z) = m_{X_{−i}}(z) + O(n^{−1}).

2 Here, A and a, b must be understood as "sequences" of deterministic matrices (or vectors) of growing size but with controlled norm; in particular, A and a, b, being deterministic, cannot depend on X (in which case, the convergence results may fail: take for instance a = b some eigenvector of X to be convinced).
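This perturbation identity is easy to probe numerically; the sketch below (illustrative only) builds a symmetric matrix X with properly normalized i.i.d.-like entries, removes one row and column, and compares the two Stieltjes transforms at a point z away from the spectrum:

```python
import numpy as np

rng = np.random.default_rng(0)
z = -3.0                                    # real z outside the limiting spectrum [-2, 2]

for n in [200, 800, 1600]:
    A = rng.standard_normal((n, n))
    X = (A + A.T) / np.sqrt(2 * n)          # symmetric, entries of variance ~ 1/n
    X_minus = np.delete(np.delete(X, 0, axis=0), 0, axis=1)   # remove row and column i = 0

    m_X = np.mean(1.0 / (np.linalg.eigvalsh(X) - z))
    m_X_minus = np.mean(1.0 / (np.linalg.eigvalsh(X_minus) - z))
    diff = abs(m_X - m_X_minus)
    print(f"n={n:5d}: |m_X(z) - m_X_-i(z)| = {diff:.2e}   (n * difference = {n * diff:.2f})")
```

The scaled difference n · |m_X(z) − m_{X_{−i}}(z)| remains of order one as n grows, in line with the O(n^{−1}) claim.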
In this regard, Pastur’s Gaussian method manages, for models of X involving Gaus-
sianity (e.g., X has Gaussian entries or its entries are functions of Gaussian random
variables), to obtain asymptotic relations for EQ(z). Interpolation methods may then
be used to extrapolate the results beyond the Gaussian setting. The Bai–Silverstein
inductive method, on the contrary, is not restricted to matrices with Gaussian entries
but is restricted to the specific analysis of either trace forms tr AQ(z) or bilinear forms a^T Q(z) b, which need to be treated individually (it also struggles to handle exotic forms of dependence within X). The concentration of measure approach is quite versatile: by
merely restricting the matrix under study to be constituted of concentrated random
vectors (so, in particular, Lipschitz maps of standard Gaussian random vectors or of
vectors with i.i.d. entries), it allows one to study simultaneously the fluctuations of all
linear functionals of Q(z) under light conditions on X.
3 Statistical physics and statistical mechanics are powerful tools to map large-dimensional data problems
into physics-inspired problems of “interacting particles” [Mézard and Montanari, 2009]. In the early
2000s, statistical physics has brought inspiring ideas and powerful (but unfortunately often unreliable,
since nonrigorous) tools for the analysis of wireless communication and information-theoretic problems,
before being caught up by added solid and versatile mathematical techniques. Today, statistical physics
has an edge on the study of sparse (graph-based) machine learning problems for which random matrix
theory still struggles to offer a sound theory.
4 The recent field of nonasymptotic random matrix theory is based on a concentration inequality approach and aims, as such, to provide bounds rather than exact (deterministic) asymptotics on various random matrix quantities [Vershynin, 2018]. This set of concentration inequalities should not be confused with the concentration of measure theory [Ledoux, 2005]: Concentration inequalities form a restricted subset of the theory by proving statistical bounds on specific quantities.
5 Compressive sensing revolves around the assumption that large (p)-dimensional data often arise from a manifold in R^p of much lower intrinsic dimension: Under this assumption, the curse of dimensionality (when p ∼ n or even p ≫ n) vanishes if one manages to retrieve the (often unknown) low-dimensional manifold. As an aftermath of the seminal work by Candes and Tao [2005], compressive sensing was possibly the first major breakthrough in the modern field of large-dimensional statistical machine learning.
distinctive features, making it simultaneously more powerful and versatile than these
alternative tools:
(i) Unlike nonasymptotic random matrix theory and compressive sensing
methods, which mostly aim at bounding key quantities (from a rather qualitative
standpoint), large-dimensional random matrix theory is able to provide precise
and quantitative (asymptotically exact) approximations for a host of quantities,
defined as functionals of random matrices. As a matter of fact, nonasymptotic random matrix theory is more flexible in that it does not constrain the system dimensions (p, n) and latent variables (data statistics, model hyperparameters) to increase at a controlled rate. Large-dimensional random matrix theory, on the contrary, imposes a controlled growth on the dimensions and, consequently, on the model statistics to enforce nontrivial limiting behavior. The ensuing drawback of this allowed flexibility is that only qualitative bounds can be obtained on the system behavior, which at best provides "rules of thumb" and orders of magnitude on the performance of given algorithms. Large-dimensional random matrix
theory, by providing exact asymptotics, allows one to finely track the system
behavior and opens the possibility to improve its (also fully traced) performance.
(ii) Modern advances in large-dimensional random matrix theory, as opposed to
statistical physics notably, further provide results for rather generic and complex
system models: matrix models involving nonlinearities (kernels, activation
functions), structural data dependence (nonidentity covariances, heterogeneous
mixture models, models of concentrated random vectors with strong nonlinear
dependence). These key features bring the random matrix tools much closer
to practical settings and algorithms. As such, not only does random matrix theory
provide a precise understanding of the behavior of key algorithms in machine
learning, but it also predicts their behavior when applied to realistic data models.
These two advantages are decisive to the analysis, improvement, and proposition of
new machine learning algorithms.
are not easily modeled, the nonlinear features extracted from those data are com-
plex mathematical objects (even in the case where the original data could be modeled
as multivariate Gaussian random vectors), and the logistic regression is an implicit
optimization method not easily amenable to explicit mathematical analysis.
We shall demonstrate throughout this book that random matrix theory provides a
satisfying answer to all these difficulties at one fell swoop and can actually solve the
Problem. This is made possible by the powerful joint universality and determinism
effects brought by large-dimensional data models and treatments.
Specifically, in the random matrix regime where n,p grow large at a controlled rate,
the following key properties arise:
• fast asymptotic determinism: the law of large numbers and the central limit theorem tell us that the average of n i.i.d. random variables converges to a deterministic limit (e.g., the expectation) at an O(1/√n) speed. By gathering independence (or degrees of freedom) both in the sample dimension p and size n, functionals of large random matrices (even mathematically involved functionals, such as the average of functions of their eigenvalues) also converge to deterministic limits, but at an increased speed of up to O(1/√(np)), which, for n ∼ p, is O(1/n). In machine learning problems, performance may be expressed in terms of misclassification rates or regression errors (i.e., averaged statistics of sometimes involved random matrix functionals) and can thereby be predicted with high accuracy, even for not too large datasets;
• universality with respect to data models: similarly, again, consistently with the law
of large numbers and the central limit theorem in the large-n alone setting, the
above asymptotic deterministic behavior at large n,p is, in general, independent of
the underlying distribution of the random matrix entries. This phenomenon,
referred to in the random matrix literature as universality, predicts notably that the
asymptotic statistics of even complex machine learning procedures depend on the
input data only via the first- and second-order statistics; this is a major distinctive
feature when compared to the fixed-p and large-n regime, where the asymptotic
performance of algorithms, when accessible, would, in general, depend on the
exact p-dimensional distribution of the data;6
6 Compare, for instance, Luxburg et al. [2008] on the fixed-p and large-n asymptotics of spectral clustering (the main result of which contains nonlinear expressions of the input data distribution) to Couillet and Benaych-Georges [2016] on the large p, n asymptotics of the same problem (the main result of which only involves linear and quadratic forms of the statistical mean and covariances of the data, irrespective of the input data distribution, as further confirmed by Seddik et al. [2019]).
• universality with respect to algorithm nonlinearities: when nonlinear methods are considered, the nonlinear function f (e.g., the kernel function or the activation function) gets involved in the large-dimensional machine learning algorithm performance only via a few parameters (e.g., its derivatives f′(τ), f″(τ), . . . at a precise location τ, its "moments" ∫ f^k dμ with respect to the Gaussian measure μ, or more elaborate scalars solution to a fixed-point equation involving f). For instance, in the case of kernel random matrices of the type f(‖x_i − x_j‖²/p), only the first three successive derivatives of the kernel function f at the "concentration" point τ = lim_p ‖x_i − x_j‖²/p matter; the performance of random neural networks
depends on the nonlinear activation function σ(·) solely through its first Hermite
coefficients (i.e., its Gaussian moments); in implicit optimization schemes (such as
logistic regression), the solution “concentrates” with predictable asymptotics,
which, despite the nonlinear and implicit nature of the problem, only depend on a
few scalar parameters of the logistic loss function. This, together with the
asymptotic deterministic behavior of the linear (eigenvalue or eigenvector)
statistics discussed above, gives access to the performance of a host of nonlinear
machine learning algorithms.
• tractable real data modeling: possibly, the most important aspect of
large-dimensional random matrix analysis in machine learning practice relates to
the counterintuitive fact that, as p,n grow large, machine learning algorithms tend
to treat real data as if they were mere Gaussian mixture models. This statement, to
be discussed thoroughly in the subsequent sections, is both supported by empirical
observations (with most theoretical findings derived for Gaussian mixtures observed
to fit the performances retrieved on real data) and by the theoretical fact that some
extremely realistic datasets (in particular, artificial images created by the popular
generative adversarial networks, or GANs) are by definition concentrated random
vectors, which are: (i) amenable to (and, in fact, extremely well-suited for) random
matrix analysis, and (ii) proven to behave as if they were mere Gaussian mixtures.
In a word, in large-dimensional problems, data no longer "gather" in groups and do not really "spread" all over their large ambient space either. But, by accumulation of degrees of freedom, they rather concentrate within a thin lower-dimensional "layer."
Each scalar observation of the data, even through complicated functions (regressors,
classifiers for machine learning applications), tends to become deterministic, pre-
dictable, and simple functions of first-order statistics of the data distribution. Random
matrix theory exploits these effects and is thus able to answer seemingly inaccessible
machine learning questions.
in applied random matrix theory could anticipate. And, more importantly, that a large
class of “real data” naturally falls under the random matrix theory umbrella.
Our argumentation line and every single treatment of machine learning algorithm
analysis and improvement proceed along the following steps: One needs to (i) con-
ceive the limitations of low-dimensional intuitions and understand the reach of the
very different large-dimensional intuitions, (ii) capture the behavior of the main math-
ematical objects at play in machine learning methods on large-dimensional models so
as to (iii) include these objects in a mathematical framework for performance analysis,
and (iv) foresee means of improvement based on the newly acquired large-dimensional
intuitions and mathematical understanding.
In the remainder of this subsection, we will illustrate the above four-step method-
ology with the examples of kernel methods and the very related random feature maps
(which may alternatively be seen as a two-layer neural network model with random
first-layer weights).
7 The statistical information contained in the data such as the mean E[x] ∈ R p can be sparse (i.e., has a
few nonzero entries), but the practical large-dimensional data vectors must randomly “fluctuate” with
sufficiently many degrees of freedom around their possibly low-dimensional manifold structure. The
large-dimensional random fluctuation of the data is essential to produce a statistically “robust” behavior
of the algorithms and is key to establishing mathematical convergence in the large n, p setting.
Figure 1.5 Visual representation of classification in (a) small and (b) large dimensions. The red circles and blue crosses represent data points from different classes.
typically play the role of a feature extraction method, which maps the data points into
a reproducing kernel Hilbert space (RKHS) [Schölkopf and Smola, 2018]. Another
closely related, yet equally popular, approach is random extraction by means of ran-
dom feature maps, which consist in operating σ(Wx) for some (usually randomly and
independently drawn) matrix W ∈ R N ×p and some nonlinear function σ : R N → R N
applying entrywise, i.e., σ(y) = [σ0 (y1 ),. . . ,σ0 (y N )]T for some σ0 : R → R, which,
with a slight abuse of notation, we simply call σ. Among random feature maps,
the most popular is the random Fourier features method proposed by Rahimi and
Recht [2008], for which σ(t) = exp(−ıt) (so, formally, σ(R) ⊂ C rather than R in
this case).
Neural networks operate likewise. Every size-N layer (that contains N neurons) of
a neural network operates σ(Wx) for an input x, a linear mapping W ∈ R N ×p (the
neural weights to be learned), and a nonlinear activation function σ : R → R.8 In this
setting, σ is usually taken to be a sigmoid function (the logistic function, the tanh,
or the Gaussian error function), or, more recently, the rectified linear unit (ReLU)
function σ(t) = max(0,t).
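A minimal random-features sketch, assuming for illustration ReLU features and a Gram matrix normalized by the number of samples (the book's own conventions and companion code may differ), is as follows:

```python
import numpy as np

rng = np.random.default_rng(0)
p, n, N = 200, 1000, 512                 # data dimension, sample size, number of random features

X = rng.standard_normal((p, n))          # data matrix X = [x_1, ..., x_n]
W = rng.standard_normal((N, p)) / np.sqrt(p)   # random (fixed) first-layer weights

def sigma(t):
    """ReLU activation applied entrywise."""
    return np.maximum(t, 0.0)

Sigma = sigma(W @ X)                     # random features sigma(W x_i), stacked as columns
G = Sigma.T @ Sigma / n                  # Gram matrix of the random features (assumed normalization)

print("feature matrix shape:", Sigma.shape)       # (N, n)
print("Gram matrix shape:   ", G.shape)           # (n, n)
print("top eigenvalues of G:", np.round(np.linalg.eigvalsh(G)[-3:], 3))
```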
Collecting the data in X = [x1 ,. . . ,xn ] ∈ R p×n , the sample covariance matrix of the
random features of the data then reduces to the Gram matrix:
8 Sometimes, an additional bias term is considered and the network operates σ(WX)+b for some b ∈ R N
also to be learned.
$$K - \tilde{K} \xrightarrow{\text{a.s.}} 0, \qquad \Phi - \tilde{\Phi} \xrightarrow{\text{a.s.}} 0,$$
in operator norm as n,p, N → ∞ at a similar rate. These matrices K̃ and Φ̃ will allow
for further and deeper mathematical analysis.
K̃ = Z + P,
same as with simple Gaussian mixtures [Seddik et al., 2020]. These seemingly striking
empirical observations are indeed theoretically sustained by universality arguments
arising from the powerful concentration of measure theory.
To be more precise, the following systematic comparison approach will be pur-
sued in this book. An asymptotically nontrivial classification or regression problem is
studied: that is, we assume that the problem at hand is theoretically neither too easy
nor too hard to solve (as the one discussed in Section 1.1.3) and practically leads,
in general, to, say, (binary) classification error rates of the order of 5%−30% and of
relative regression errors also of the order of 5%−30%. In particular, we insist that the asymptotic random matrix framework under study is, in general, incapable of finely grasping error rates below the 1%–2% region, which may be the domain of "outliers" and marginal data.
Having posed this nontriviality assumption, we shall generically model the data as
being drawn from a simple mixture model, for example, the Gaussian mixture model
that gives access to a large panoply of powerful technical tools. The theoretical results
obtained from the proposed analyses (asymptotic performance notably) are thus functions of the statistical means and covariances of the mixture distribution. To compare
the theoretical results to real data, we then conduct the following procedure:
(i) exploiting the numerous and labeled samples of the real datasets (such as the
∼60 000 images of the training MNIST database), we empirically estimate the
scalar functions of the statistical means and covariances (that determine the
asymptotic performance of the method under study), for each class in the
database;
(ii) we then evaluate the asymptotic performance that a genuine Gaussian mixture
model having these means and covariances would have;
(iii) we compare these “theoretical” values to actual simulations.
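In code, a simulation-based rendition of this comparison procedure could look as follows (a hypothetical sketch: it replaces the closed-form asymptotics of step (ii) with a Monte Carlo run on the fitted Gaussian mixture, and uses a plain ridge-regularized least-squares classifier purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def error_rate(X_train, y_train, X_test, y_test):
    """Train a ridge-regularized least-squares classifier and report its test error."""
    p = X_train.shape[1]
    w = np.linalg.solve(X_train.T @ X_train + 1e-1 * np.eye(p), X_train.T @ y_train)
    return np.mean(np.sign(X_test @ w) != y_test)

def compare(X1, X2, n_train=500, n_test=500):
    """Steps (i)-(iii): fit per-class Gaussians to the data and compare classifiers."""
    # (i) estimate per-class means and covariances from the 'real' data
    means = [X.mean(axis=0) for X in (X1, X2)]
    covs = [np.cov(X, rowvar=False) for X in (X1, X2)]
    # (ii) performance of the same classifier on a genuine Gaussian mixture with these statistics
    gauss = [rng.multivariate_normal(m, C, size=n_train + n_test) for m, C in zip(means, covs)]
    # (iii) compare with the performance on the (non-Gaussian) data themselves
    results = {}
    for name, (A, B) in {"real": (X1, X2), "gaussian": (gauss[0], gauss[1])}.items():
        Xtr = np.vstack([A[:n_train], B[:n_train]])
        Xte = np.vstack([A[n_train:n_train + n_test], B[n_train:n_train + n_test]])
        ytr = np.concatenate([np.ones(n_train), -np.ones(n_train)])
        yte = np.concatenate([np.ones(n_test), -np.ones(n_test)])
        results[name] = error_rate(Xtr, ytr, Xte, yte)
    return results

# demo on synthetic non-Gaussian data (a stand-in for, e.g., two MNIST classes)
p = 100
X1 = rng.standard_normal((1200, p)) ** 2
X2 = rng.standard_normal((1200, p)) ** 2 + 0.3
print(compare(X1, X2))
```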
As the book will demonstrate in most scenarios, this procedure systematically leads to the conclusion that the performance of machine learning methods obtained on mere Gaussian mixtures approximates surprisingly well the performance observed on real data and features. On a side note, we mentioned in Remark 1.2 that it is likely inappropriate to use the sample covariance matrix to estimate the population covariance of the small (i.e., n not much larger than p) databases, such as the MNIST database (for which n/p ≪ 100). However, it turns out that, as the quantities of inter-
est (e.g., classification or regression errors) are generally scalar functionals of the
data statistical means and covariances, it is still possible, in the large n,p regime, to
derive consistent estimators of these quantities without resorting to an exact eval-
uation of the (large-dimensional) moments; see more discussions on this topic in
Sections 3.2 and 4.4.
As already mentioned in Remark 1.3, this surprising accordance between theory and practice is possibly due to the universality of random matrix results, that is, the fact that only the first several order statistics of the data/features at hand matter in the large-dimensional regime (recall for instance that the limiting eigenvalue distribution of (1/n) X X^T for X ∈ R^{p×n} having i.i.d. zero mean and unit variance entries is the same Marčenko–Pastur law, irrespective of the actual distribution of these entries).
Yet, another stronger argument can be made, especially when it comes to machine
learning for image processing.
9 See also Pajor and Pastur [2009] published in the same year under slightly more constrained assumptions.
10 Note that by modeling the input data x as a concentrated random vector and stating that the output
(statistics) of a machine learning algorithm is “stable” implicitly assumes some regularity in the algo-
rithm, which, as we shall see, can be shown to hold for many popular methods including deep neural
networks (and which often takes the form of a “Lipschitz control”).
11 Under the more restricted class of Lipschitz and convex functions, random vectors with i.i.d. and bounded
entries (up to normalization) also create a class of (convexly) concentrated random vectors.
Figure 1.7 Illustration of a generative adversarial network (GAN): a generator maps N(0, I_p) inputs to generated examples, which a discriminator attempts to distinguish (real? fake?) from real examples.
vectors are seemingly no more elaborate models than linear and affine maps of Gaus-
sian vectors. As a consequence, there is a priori no reason to assume that the mixtures
of concentrated random vectors can model real data any better than Gaussian mixtures.
It turns out that this intuition is again tainted by erroneous small-dimensional
insights. Indeed, there practically exist extremely data-realistic concentrated random
vectors: the outputs of GANs [Goodfellow et al., 2014], as shown in Figure 1.7. GANs
generate artificial images g(x) from large-dimensional standard Gaussian vectors x,
where g is a conventional feedforward neural network trained to mimic real data. As
such, g is the combination of Lipschitz nonlinear (the neural activations) and linear
(the inter-layer connections) maps, and is thus a Lipschitz mapping.12 The output
image vectors g(x), see examples in Figure 1.8, are thus concentrated vectors. Modern
GANs are so sophisticated that it has become virtually impossible for human beings
to tell whether their outputs are genuine or artificial. This, as a result, strongly suggests
that concentrated random vectors are accurate models of real-world data.
12 In practice, other operations are also performed in neural networks, such as pooling operations, random
or deterministic dropouts, and various connectivity matrix normalization procedures, so as to achieve
better performance. They are all shown to be Lipschitz [Seddik et al., 2020].
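The stability underlying this argument can be illustrated with a toy "generator": a random, untrained feedforward network with 1-Lipschitz activations and spectrally normalized weights (a hypothetical stand-in for a trained GAN, used purely for illustration). For growing input dimension p, a 1-Lipschitz scalar observation of g(x) keeps O(1) fluctuations, even though g(x) itself lives in high dimension:

```python
import numpy as np

rng = np.random.default_rng(0)

def make_generator(p, width=256, depth=3, out_dim=784):
    """A random feedforward map: linear layers of spectral norm 1 composed with tanh."""
    dims = [p] + [width] * (depth - 1) + [out_dim]
    Ws = []
    for d_in, d_out in zip(dims[:-1], dims[1:]):
        W = rng.standard_normal((d_out, d_in))
        Ws.append(W / np.linalg.norm(W, ord=2))   # each layer 1-Lipschitz (quite contractive)
    def g(x):
        for W in Ws:
            x = np.tanh(W @ x)                    # tanh is 1-Lipschitz
        return x
    return g

for p in [32, 128, 512, 2048]:
    g = make_generator(p)
    samples = np.array([g(rng.standard_normal(p)) for _ in range(2000)])
    obs = np.linalg.norm(samples, axis=1)         # a 1-Lipschitz scalar observation of g(x)
    print(f"p={p:5d}: std of ||g(x)|| = {obs.std():.3f}   (mean = {obs.mean():.3f})")
```

Since g is a composition of 1-Lipschitz maps applied to a standard Gaussian input, any 1-Lipschitz observation of g(x) has fluctuations that do not grow with p, which is precisely the concentration property discussed here.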
A strong emphasis has thus lately been given to these models. The book will, in par-
ticular, elaborate on the work of Louart and Couillet [2018], which largely generalizes
the seminal findings of El Karoui by providing a systematic methodological toolbox of
concentration theory for random matrices. There, the notion of concentration is gen-
eralized by including linear concentration, which provides a consistent framework for
the important notion of deterministic equivalents in random matrix theory, and by pro-
viding a wide range of properties and lemmas of immediate use for random matrix
purposes.
An important finding of Louart and Couillet [2018] is that first-order statistics of functionals of random matrices built from concentrated random vectors are universal; the asymptotic performance of many machine learning methods is, therefore, also universal. Specifically, for most conventional machine learning methods (support vector machines, semi-supervised learning, spectral clustering, random feature maps, linear regression, etc.), the asymptotic performance achieved on Gaussian mixtures N(μ_a, C_a), a ∈ {1, . . . , k}, coincides with that obtained on mixtures of concentrated random vectors L_a(μ_a, C_a), a ∈ {1, . . . , k}, having the same means μ_a and covariances C_a per class, and is independent of the higher-order moments of the underlying distribution.
This strongly suggests that Gaussian mixture models, if not appropriate data
“models” per se, are largely sufficient statistical assumptions for the theoretical
understanding of real data machine learning.
Remark 1.4 (Concentration of measure, concentration inequalities, and non-
asymptotic random matrices). It is important to raise here the fact that the con-
centration of measure theory is structurally broader than the scope of the popular
concentration inequalities regularly used in statistical learning theory [Boucheron
et al., 2013, Tropp, 2015, Vershynin, 2018]. Concentration inequalities are gener-
ally expressions of (1.14) for specific choices of f and their consequences, and they
are, in particular, not new to random matrix theory. In Vershynin [2012] and Tao
[2012], the authors exploit the mathematical strength of concentration inequalities
(which, thanks to the exponential decay, is stronger and less cumbersome to han-
dle than moment bounds) to prove fundamental results in random matrix theory. Yet,
these inequalities are mostly exploited in proofs involving Gaussian or sub-Gaussian
random vectors (as an instance of concentrated random vector). In particular, Ver-
shynin establishes a nonasymptotic random matrix theory by exploiting concentration
inequalities to bound various quantities of theoretical interest (notably bounds on the
eigenvalue positions of random matrices). The book instead puts forth the interest
of concentration of measure theory for data modeling beyond a merely convenient
mathematical tool.
Concentration of measure theory is also all the more suited to machine learning
as it structurally relates to linear, Lipschitz, or convex-Lipschitz functionals of ran-
dom vectors and matrices. These are precisely the core elements of machine learning
algorithms (kernels, activation functions, convex optimization schemes). From this
viewpoint, concentration of measure theory is much more adapted to machine learn-
ing analysis than seemingly simpler data models. Note, for instance, that concentrated
random vectors are stable (i.e., they remain concentrated) when passed through the
layers of a neural network; this is particularly not true for Gaussian random vectors
or vectors with independent entries, which, in general, no longer have independent
entries when passed through nonlinear layers.
A last but not least convenient aspect of concentration of measure theory is that
it flexibly allows one to “decouple” the behavior of the data size p and number n
in the large-dimensional setting. It is technically much easier to keep track of inde-
pendent growth rates for p and n under a concentration of measure framework than
when exploiting more standard random matrix techniques (such as Gaussian tools to
be discussed in Section 2.2.2).
VIII
In early September, the workmen and the decorators left the big brown-
stone house in a condition of coldness known as “modern” and a state of
deadness known as “beautiful.” And then, the family of Burt moved in. At
this time, Quincy was approaching his twelfth birthday.
Not as long as he lived did he forget the feeling that came over him, that
first time, as he mounted the stoop and went through the ponderous carved
door into the house that was now to be his home.
The coupé came to a stand-still. The horses had made a strange and
muffled sing-song on the pavement. From time to time, this changed to a
metallic patter in syncopation—a sound symbol, it seemed to Quincy, of
affright. At these intervals, the carriage jolted, one wheel rolled high, the
other was in a trough. Then again, all righted itself and the sing-song was
resumed. In the coach, was gloom of blue upholstery and leather. His father
and his sisters sat tight, thrilled, their eyes intent upon the passing city. It
was a maelstrom of half impressions. Cars clanged, other horses sloughed
off the view, a swaying coachman shouted, an insatiate tide of men and
women ebbed and flowed. Marsden, mother, Jonas and the servant had gone
before in another carriage. And now, the frenzy of Manhattan seemed to
abate. It was like leaving the wind behind one on the water. They swung up
the border of a Park. There were trees and shrubs and grass! It was a wood,
by all conventions. And yet it depressed Quincy who yearned for just such
balm. The trees were grey; the grass was dull. Here was not life, but a show
of artifice. Hard walks girded the green stretches like belts of steel. None of
the free tang and give and sunniness, none of the lilt and smiling, none of
the purple murmur of the woods was here. Central Park did not fool Quincy
—could not have fooled him, even if cars had not swept back and forth
between him and it; even if a depressing monotone of houses had not filled
the other flank.
And now, the carriage rolled up. A crowded tramp of the horses and it
came to a halt.
Rhoda and Adelaide seemed to emerge from a trance.
“Here we are!” they sighed, with a hollow note that bespoke the nervous
feel in their stomachs. The carriage door flung open. They bounded out and
ran up the stoop. Josiah half-lifted, half-pushed Quincy to the pavement. He
stood there, balanced by his bewilderments, while his father paid the
coachman.
“Come on, sir,” the big man tapped his shoulder. The carriage had
disappeared. Quincy looked up.
Before him was an unbroken but uneven battlement of houses. Some
were brown, some were red, some were grey. At their feet ran the wide,
flagstoned pavement. Some were straight-stepped, some were curved, some
were curiously decorated boxes. A few had no feet at all—with doorways
punched abruptly in the wall. Before Quincy’s eyes it was brown; the
protuberance with stairs was straight. Red doors were flung wide open—
held so for passage of the trunks—and within was darkness. Above, it was
all very vague and high. Quincy felt this, though he did not look. He went
up mechanically, with his father. As he stepped in, he felt a quick sensation
of the sky—shrill blue, inexorably far away, yet good. It seemed like a short
draught of water when one has been long athirst. It was but a momentary
glimpse. And then, his body carried him beyond, within, where the sky was
not. He saw the long hall, shadowed, and the wide stairs. It seemed clear to
him now why it had been as if the sky was snuffed away. Everything
loomed forward and smothered Quincy; filled up the crystal space within
him that cared for the blue above and seemed somehow related to it.
Everything loaded down upon him, occupied him, stayed there. As he
trudged up, it was as if a mighty burden had come suddenly. Quincy
observed no more, felt nothing more explicit. But for an instant a
perspective flashed on him, though he deemed it merely a natural panic like
a score of others he had undergone. In it, he saw himself, slight, small,
stooped, his head strained back with his disordered tension, his legs
careering stiffly with untrained, superfluous energy. In it, he saw about him
a weight of gloom—the stuff and color of this house which was to be his
home. And then, once more, he was a child. His mother stood at the head of
the stairs. She was very busy, and rather dirty-looking.
“You had better go up to your room, dear, where you won’t be in the
way.” She turned to Jonas who sat sprawling within vision, upon a great
chair in green satin. “Jonas, will you take Quincy up?”
“Sure,” replied the boy. “Come along, Kid.”
It was easy to tell that what interested Jonas was the chance of showing.
So they were still to share a room? Quincy learned this, as he took in the
two white-enameled beds and the valise with “J.B.” upon it that he
stumbled over as he entered. The sight of his mother and this new event
which scarcely he had dared to hope for, seemed to enliven Quincy. He had
not given up Jonas. He had had his pangs from him, as from his mother.
“Oh, Jonas,” he exclaimed, “aren’t you glad we’re here?”
“You bet,” said Jonas.
“No—I mean we—”
The elder boy looked down, first quizzically, then with a withering
wrinkle upon his eyes and nose and mouth.
“For God’s sake, Quincy—what a sis you are!” Then,—“Ma says for
you to stay up here,” and left the room.
So Quincy was alone.
The end of that first month was the beginning of the time when Quincy
began once more to breathe in a normal fashion. For long, everything had
been so new, all the old ceremonies had been so suddenly replaced, all the
comfortable nooks of life which with difficulty he had carved for himself in
Harriet were so miserably absent, that life had become a breathless trick
like trying to ride bareback (as he had once essayed), or endeavoring not to
irritate his sisters. There was, for instance, the problem of eating in the
ominous, overbearing dining room, the problem of sleeping in a bed which
shone like the exhibition motor in the shop on Main Street, the problem of
being comfortable in blouses that had to be kept clean and with thin new
stockings that had to be kept whole. Also, there was the problem of loving
his mother in a dazzling housegown of blue satin. These were like enemies,
besetting the routes of life. And at first they had seemed insuperable. And at
last they had faded quite away, and wonder about them, as well as memory,
had died in the fresh, general glamor. But now, with his recovery, came a
new shock.
Something of adoration had persisted in Quincy toward his room-mate
despite constantly recurring disillusions, rational promptings, and rebuffs.
In the fixations of childish fantasy and love there is the doggedness of
plant-life which persists where it has grown, though all nature conspire to
prove the folly of its position. Such plants will die, or they must be
uprooted. They are such stubborn things precisely because of the logic of
their existence—to rise from their roots. A similar instinct was in Quincy
concerning Jonas. All the persuasions of deed or mind could prevail little
against his intuitive attachment, because they were in different planes. He
would hold to his sentiment for Jonas until the roots of energy which had
thus grown were pointedly grasped and torn away. For a seed of his life
instinct was there. And where it had fallen, it had remained. In Jonas,
Quincy saw a future of his own growth—a boy, happy, cherished, of
importance. What aided this admiration perhaps most of all was the sense of
imperviousness to life’s problems which permeated Jonas. This, in
particular, was a desideratum. But, after all, these were but rationalizations.
The heart of the young boy’s attachment, no young boy could understand.
Quincy was in his room. It was but half an hour before dinner. The boy
sat at his desk solving a knotty problem in arithmetic with a facility beyond
the power of his six-years-older brother. Quincy was very apt at
mathematics. But also, he was good at literature. This double
accomplishment militated against his being singled out for any talent. It is a
way of people, to mark a virtue only when it is one-sided.
So Quincy sat at his little desk and worked. It was a slanting, box-shaped
affair. Its top lifted upon a hinge, disclosing within a maze of paper, school-
books, pencils, twine. In one corner, half hidden under a pad, were two
unframed pictures—cheap photographs which he had clandestinely
collected of the Farnese Hercules and the Venus de Milo. Quincy’s instinct
told him that it would be well not to make show of these treasures. His
mother would have found them naughty, Jonas would have seared them
with the laughter of Philistia. So he hid them and, like forbidden fruit,
enjoyed them. Of all the pictures he had ever seen, these meant the most to
him. The huge, power-ridged torso of the Hercules filled him with
fellowship to so much might and in some way seemed to make him share
the giant’s merciless efficiency. He could repeat the Labors. The genius of
this overweening man who had penetrated to all the corners of the earth and
forever forced for himself acceptance and a welcome was dazzling to
Quincy. He enjoyed gazing at a plastic wish fulfillment. But for the
sentiment of the divine, he turned toward the Venus. Hercules was to him a
successful man; Venus was a goddess. He loved to look at her. The subdued
rhythm of her body, the gentle poise of her head and breasts justified
Quincy in his own nature, whereas the brawny giant served to mitigate that
nature’s realness and to exalt its opposite. So also, since the Venus reached
him not by antithesis, but by a direct appeal to a deep, primal part of him,
his love for her was more rapt, more pointless, sweeter. He spent more time
with her than with the giant. But she made him think less. And he knew less
about the instinct which drew him toward her.
Above Quincy’s head, as he worked, was an electric bracket. To his left
were the two beds. In the direction that he faced were the windows,
curtained in dainty, dun-colored mesh. To his right was a mahogany bureau.
This was no ideal setting for his work. But Quincy had learned—it was one
of the gifts of the poor days—to concentrate.
The door opened and Jonas slammed in. There was no formality between
them. So Quincy went on working. But in the pause that followed, the child
felt something which disturbed him. Still holding his pencil, he turned
about. There, near the door, stood Jonas, looking at him. On his face was a
gleam of triumph.
“What’s happened, Jonas?”
Jonas chuckled. “Time to get ready for supper, Quint.”
The child jumped up, to obey. “What has happened?”
“Oh, if you only knew!”
“Tell me.”
Jonas laughed tantalizingly. The little lad looked at him with a hopeless
rebuke. And then, tossing his head, he moved to the bureau and began to
brush his hair. His hair seemed to grow awry, to shoot out in a dozen
directions from his scalp. So Quincy’s task was always a fairly hard one
since his mother insisted on a part; and since, when he did not succeed, she
would brush it for him and invariably hurt him. So Quincy fell to. Jonas
stood smiling at him still. Suddenly, he began to speak.
“I guess I will tell you. I’m going away to school—right off.”
The child leaped around. “Jonas!”
“Next week. To Exeter.”
Quincy’s head worked fast. Then, with effort: “Can’t I go, too?”
It was a fatal question. Jonas took it sneering. He was nearly seventeen
and he had the sense of age and independence that ordinary boys are prone
to.
“You? I guess not. I don’t want you ’round any more. You stay at home,
where you belong—if you belong anywhere.” He smiled.
He would have said more in this great need of establishing his power and
independence through attack on some one immeasurably weaker, in these
things, than himself. But just then a flying brush hurtled against his
forehead. He looked up, not understanding. And then, he smiled through his
pain. For it was not to be admitted that this infuriated child could hurt him.
“You little sinner!—” he stepped back instinctively. Then, again, he
smiled.
And at this last smile, Quincy became an unaccountable demon. He saw
what he had done. It moved him strangely. A need swept over him to rush
up to Jonas, to fling arms about his neck, to kiss him, to implore him, to cry
out: “Take me too. Please, please stop despising me!” This was all his need.
And yet, out of the fullness of his love he had flung his brush. And out of
his love again, there he was, leaping upon his brother, biting him, scratching
him, tearing his face. And all that became articulate of his beseechment was
a liquid “Oh! Oh! Oh!”
Jonas grasped at his frenzied, writhing assailant. At length, he caught
him comprehensively within his arms. And then, Quincy went flying
through the air. He fell, safely, stomach downward, on the bed. And there
he lay, tearless, motionless, overwhelmed with the bitterness of life.
The door opened. He did not budge. But he understood. The alarm had
brought his father. And there in the door he felt the cold, looming figure of
the man whose presence alone was needed to brim his misery. Stark, stiffly,
he lay now—one nerve of agony. And when his father’s voice came, it was
like the sharp touch of steel upon a nerve that is exposed.
“What is this?”
“Oh, nothing,” replied Jonas, moved again by the need of minimizing
the damage done by a child not yet turned twelve. But his father could see
the two bloody scratches on his cheek, the slight swelling on his forehead.
And Quincy could hear the nervous clutch in his voice.
Josiah looked long, saying nothing. And then:
“It’s supper time, Jonas.... Come down.... And as to you, my lad,—”
Quincy held his breath with his galled anguish,—“you’d better stay up here
—and cool off.”
Quincy had felt the smile in this voice also. He felt the two, their eyes
meeting and smiling together. Then the door slammed and they were gone.
Smiles, smiles—what a curse smiles seemed to him! There was so much
laughter in the house. But when they looked at him, it became a smile.
Never, never did they laugh with him. Surely, then, he too must learn to
smile. Rigid as ever, he turned on his back. And through his scarce-started
tears, he looked up at the blurred electric lamp. And then, as he lay there,
his mouth trembled and he learned to smile. It was an evil moment.
Never had there been so deep a silence. With outstretched body, it was to
Quincy as if he had been swept beyond the bed. He seemed afloat, astride
two worlds, strangely apart from the one in which he had been incontinently
dropped. And then a thought of what had happened—a whispered thought
like a dim memory—brushed him back into the actual living. With his face
hot in fever, his mind seething in visions that burst out and vanished ere
they had been caught, all of his life came to him in a clear, ghostly light. He
saw the household, below stairs, joyously seated at the gleaming table,
eating good things. He felt hungry. He wished to go out and steal some
food. He wished to crash through the floor and fall, dead and mangled,
upon that board of mocking plenty. But through it all, he managed still to
smile.
And then, he looked up. His mother had come in, holding a tray.
“You were very naughty, Quincy,” she said in a voice that went up and
down, “but I have brought you some dinner.”
Quincy was still smiling. He reached out his arm for a pillow and flung it
at his mother. It fell short, doing no harm. Sarah placed the tray
precipitately on a chair and rushed to the bed.
“Darling, darling!” she cried, “why are you like this? Tell mother, won’t
you? What is it? Why? Why?”
She kissed his face. She pressed his hands. She clamped his ears so that
they hurt deliciously. And Quincy lay silent, happy; his smile lost at last in
the sweet tears.
Then, something came over his mother. She stopped.
“I must go now,” she said hurriedly. “Father was angry even at this. He
said for Bridget to bring it up. He’ll be mad, dearest, if I stay.”
She leaned over and kissed her son, once more. He lay now, stiffly again
and sternly. Then the door closed behind her.
For some time Quincy remained upon the bed. His face was a screen to a
bitter battle. Bitterness had the victory. He jumped up and uncovered the
dishes on the tray. Roast-beef soaking in gravy, peas and sweet-potatoes,
apple-sauce and angel cake. Calmly and slowly he took a dish, moved to the
low casement window, opened it, and threw out the contents. And in this
gesture he continued, going back and forth, eyes bright with fever, mouth
parted with passion,—until all of the food he craved had disappeared into
the grey, deep night.
So Jonas went away to school, and the agony of this period in Quincy’s
life set in—the end of the Reign of Jonas.
For the most part, it was the feeling of a void; to fill it, the creating of a
fancied Jonas. And this, being in its essence art, Quincy discovered to be
painful. In his relations with his school-mates, with his sisters, with
Marsden, he missed him most. And the Jonas whom he regretted was a
bland, kind brother, not reasoned-out but clear through the pleasing memory
that he inspired. In brief, Quincy was longing for a brother he himself had
invented. And to this creation went little fact beyond the name of Jonas. If
this new brother’s face was one with the old, at least his expression was so
different as to transfigure it. If his voice was one, the words he spoke had
varied even that. Besides, Quincy was happy with his creation. Had Jonas
never returned, never shown his real face, his real spirit, doubtless Quincy
would have prospered with his auspicious figment. And though life had
turmoiled him in a dozen other ways, he would still have clung close to his
dream, derived from it sustenance and so hewn out an indomitable faith.
For no actual occurrence could have attained to splinter it. Jonas alone was
so empowered.
Quincy had troubled his mother enough to know on what day Jonas was
coming for his Christmas holidays. At first, knowledge of the day had
sufficed. He had aimed his existence at that day. It had seemed a small
enough target. But when the day came, Quincy realized his error. He
learned of the vague, broad desert that a day can be. He felt the irony of
implacable dimension when one’s heart strains toward a pin-point. He
awoke in the morning with a bound of fear. What if Jonas was already
there! He lay in bed and listened. No stirring. It was not too late. For Jonas
always made a noise. And then, as a sense of gratitude came over him, it
was dispelled at once by a succeeding sickish thought. His vacation did not
begin until the morrow. He would, then, have to spend the morning—half of
the day—out of sight and beyond watch. During that time, Jonas might
walk in! The idea froze him. He thought of playing sick—just so sick as to
be able to remain at home. He did not care even if it did mean doctoring
him. His fervor laughed at castor-oil. Once more, a glow of satisfaction,
such as one feels after a great invention. But this, also, was short-lived. The
mournful countenance of duty had thrust in at the door. The school, that
morning, was giving a Christmas Exercise. He had his rôle in the festivities.
There was to be a masque of the nations, assembled to wish America a
happy new year. In this great ceremony, Quincy was to take the part of
Mexico. Leggins, sombrero, tasseled vest and practice in rolling out
“carramba” had gone toward the occasion. He could not shirk this austere
duty. The vision of staying luxuriously at home, of waiting in the warm
house for Jonas, had disappeared.
And so, since he must run risks, he faced them. He jumped out of bed
although it was a full hour before breakfast. He dressed and went
downstairs. Perhaps Jonas might come in early—and he be there, alone, to
greet him! He was disappointed. And then his mother appeared.
“You down so early?” she asked, looking for something wrong.
“Mama, when does Jonas come?”
“Today.” She went toward the pantry to signal the cook.
“Yes, Mama. But when to-day?”
Sarah looked at her little son as he stood there, all serious and expectant.
“I don’t know. Don’t bother me. By suppertime, I guess,” and she went
out.
Breakfast went fast, since the threat of school was at its termination. And
then, he and Adelaide were shuttled off in the brand-new limousine.
Before he had gauged the event, he was on his feet. The thought of Jonas
had lurked in a strategic corner of his consciousness. And it had gathered to
it the energy which might have gone to stage-fright.
Before him lay countless amorphous rows of children, with stern
sprinklings of teachers. He began. It was as if his voice went on of itself—
uncontrolled, aloof from him, so that he heard it from afar. He had studied
his rôle too well. It left no fear of his forgetting it. And in the relief, back
stole the thought of Jonas, the possible catastrophe if he returned, while
Quincy stood there droning out his duty. If his duty was to be done, at least
it could be gone through quickly. The words flowed. He hurried them, as if
pursuing from behind. It was a way of getting back. And then, suddenly, his
words stopped. He realized that there were no others. He sat down.
He received little praise for his performance. His teacher, a stout, florid
creature, came up to him showing her teeth, as she did always when she
was irritated. She found fault with his rendition. But what bothered Quincy
was the thought that his hurrying had not brought nearer his return. It had
been useless.
He did not hate the woman only because of his indifference. About him
was a bustle of holiday excitement. Children laughed and ran about.
Teachers relaxed and Quincy noticed that when they became smirkingly
human they were still more disagreeable than when they had been teachers.
It was as if they had taken off their clothes. Greetings of “Merry Christmas”
interspersed the general murmur of voices and light feet. It was all like a
great, dim wave on the edge of Quincy’s consciousness.
At last, however, the wave broke; the voices scattered; the eddies
lessened. And then, Quincy came home.
Jonas was not there. So he ate his lunch, tossed between gladness at not
having missed him and hollow perturbation at the deep-shadowed future.
For in this state, the coming home of Jonas was Quincy’s future.
His mother had said to Rhoda:
“Dear, if you have nothing to do, will you take the children for a short
walk in the Park?”
Rhoda had consented. The remainder of the meal was torture to Quincy.
But out of his new anguish was born a device. For he was resolved not to go
out, that afternoon.
As soon as the company had gotten up from the table, Quincy went
bravely up to Rhoda who stood alone in a corner. Rhoda was seventeen—a
remote, resplendent creature. She was dark and tall and very cold and very
occupied in her own mysterious affairs. Quincy looked up at her with
admiration, but with distance. He felt that she was beautiful. He felt that he
would not have minded had she kissed him—which she never did; that he
would have been glad, had she noticed him—which occurred scarce more
often. Now, however, he was inspired. So he went up to her unhesitant,
reached for her wrists, clasped them, and spoke.
“Sister.”
“Well?”
“Please, sister—make Mama not make you take me out, to-day.”
“What’s wrong?” She looked down, interested.
“Please, sister—I want to be here when Jonas is coming.”
He looked up piteously. And Rhoda laughed.
“All right. I don’t care, I’m sure. I’ll see to it.”
So Quincy went up to his room. He did not take out his toy engine or his
soldiers. He knew he should not be able to do his vacation homework. For
some strange reason, he was prompted to look at Hercules and Venus. But
mostly, he waited. Waiting was by now the atmosphere he lived in. His
mood had grown so wide, he scarcely noticed it, for want of something to
contrast it with. His strain toward Jonas had grown so intense, he scarce
saw him, felt him, thought of him any longer. His subconscious mind
seemed to be the active one. So he merely waited, doing few external
things, aware of few external qualities. And among those that went were
time and Jonas. All that remained was the abstract waiting for him.
And then, a noise below. For a moment, it meant nothing. Then it flashed
on him palpably that Jonas had arrived! He rushed to the door. He was
downstairs. Jonas was before him.
“Brother!” he cried, aching to be caught up.
Jonas looked down at the intrusion.
“Oh, hello, Kid. How are you?” He had outgrown kissing. He threw his
coat on a chair, lounged into another, lighted a cigarette and then went on:
“Say, Ma—I’m hungry.”
IX
Two days Quincy had been going about with a soiled handkerchief. At last
his mother noticed it, lost patience and sent him upstairs for a fresh one.
Quincy grit his teeth and went. But the philosophy behind that handkerchief
no one in the family was near to reckoning.
The truth was, he avoided his old room. Throughout the holidays, he
stepped within it as little as was possible. And there his handkerchiefs were
kept. Had he been wise, once there, he would have taken more than a day’s
supplies. But Quincy did not have that manner of shrewdness. And even if
he had, it would have been a costly risk to hide his linen in the room below.
So, at times, a hurried visit was inevitable. In the old days, he had spent
long hours there. And still, he might have, since Jonas generally was gone
with lunch, not to return until vague hours after. But the charm and the glow
of the room were dead. Enough of it remained to have turned into a sneer
and mockery. For this room had been an altar to Quincy’s faith. In it, he had
performed his services; here he had dwelt as a priest, in the abode of his
faith. And he had been driven out and all of the temple had been sullied.
And now, there was no god at all. So that the room served merely as a cold
record of present miseries and lost illusions.
Meantime, for two weeks, he slept in a broad couch placed at the foot of
his parents’ beds. And here was a steadfast torture not to be avoided like the
room above. A dilemma confronted Quincy. He was, for some reason,
uneasy about lapsing into unconsciousness ere his parents came to retire;
and yet, to be awake when they came in was a miserable trial. So between
the two uncomfortable states, the child built up a fever of resentments.
Although he could not so have worded it, his was a feeling more than all
else of humiliation, of shame at this promiscuous arrangement. He went to
bed with a weighing smart on his soul, like the mark of a blow that one can
not avenge. And then, with gloomy prospect, he drew the covers high over
his face and lay there, staring out, gripping his blanket, stiff with a sense of
deep discomfort. And all manner of wild, ugly thoughts raced through his
half somnolent mind—thoughts of vindication against Jonas, against his
sisters, against his parents; lurid sweeps of chastisement in which there was
neither mercy nor discrimination. Fairly, his blood boiled with his
resentment at this cavalier disposal of him and the malignant token—his
lying there!—of how well Fortune could distort his hopes. And then,
generally, his body would prevail and he would fall asleep.
But with such rivalry, sleep could not be firm. Too much passion and
reflection, thrown up by his unconscious self like the lava of a volcano,
flooded the black slopes of Quincy’s night. And in his sleep came hectic,
vivid dreams—dreams in which a burst of repressed wishes stormed to
realization. Nor were the wishes good or gentle or composed....
In the midst of some fantasy, painting his sleep, Quincy slides across the
faint border into consciousness. There are his parents. A light sears through
his closed eyes. He will keep them closed, though it meant never to be able
again to open them. For solely by feigning sleep can he be sure his mother
and father will not address him. And the idea of that is unbearable. How can
he speak to these two tyrants about whom his thoughts contain so many
guilty reservations? He lies still, strained, listening for all the little noises
that will upset him, waiting for the light to go and rest to come. His parents
never make mention of his name. They talk sparingly, and then of things
that he does not understand. But every sound they make seems to him a
monstrous thing. They are trying to be quiet. But of what avail, when the
drop of a shoe on the floor twinges through him like a dart? when the creak
of a bed plays on his nerves as on a jangled harp? when the repressed sound
of their voices, whispering, becomes a source of suffering worse through
the very quality of stillness and of effort?
And with the morning, the shame of getting up when they do; the grim
refusal (though the price be their completer feeling of his “badness”) in
such naked intimacy to share their life, as if he shared as well their cruelty
and their dull perceptions. The agonized last minutes waiting under covers,
though the dregs of sleep be turned into a bitter wide-awake, until they are
dressed. And then, to spring up with a sick relief, to rush into clothes and
go!... Above all, with the morning, the consciousness of the next night.
So came the New Year, while the City revelled. And then, at last this
holiday of suffering was over. Jonas was gone. Quincy returned to his room.
And it had no positive reaction upon him. Simply, he was glad that certain
things were gone. Passively, he accepted the present. He entered upon a
period of calm, of seeming apathy. But this welcome state was destined to
prove meretricious. Quincy had not really attained serenity or resignation,
after his turbulent experience. His soul, rebuffed and bruised, was not yet
content to retire within itself and glut its own demands. There was a long
road, ere this, for Quincy. For the present, unconsciously of course, he
delighted in his breathing space and with great readiness forgot the past
with its poignant measure of prophetic warning. But, in truth, his soul was
merely crouching ere it leaped out once more. It had not been discouraged.
Within it, was too much vitality for that. It was lying low, only to attempt a
higher flight, once a new object had been scented.
When Quincy was thirteen, he discovered Rhoda.
Jonas was still away—at college. Rhoda had done with school and the
idea of college for a girl was not part of the mental outfit which the Burt
family had brought along from Harriet, Long Island. Adelaide approached
sixteen. Her educational trials were not yet over. Marsden was now literally
a man. A certain surface of his mind had grown hard and polished, so that
he was clever and ingenious; able to make bearable the largely mental life
to which his body had condemned him. The depths of his mind had died, so
that he was bitter in spirit, visionless, and well nigh content. He derived the
same satisfaction from his dominion in the household that a normal man
might glean from his part in a community. He drew joy from his ability to
judge aloof; he created a sort of life from the business of poising the lives of
others.
Some time, he had been watching Quincy. But the period of their talks
was not yet ripe. For the present, he was a cold, hard, cynical obstruction
upon Quincy’s path—a creature that could sear and wither with a word, an
intimate that exploited his foreknowledge with the evil unconcern of an
outsider. Quincy hated him, feared him. For he was powerful beyond his
parents when it became his listless fancy to ordain. And his mother loved
him with a warmth that was thrice cursed, since it withdrew a part of her
from Quincy, since it served as a common interest to veer her toward his
father and since in the last fact, the love she did bestow upon him seemed
somehow tainted and imperfect. Marsden had not yet been created in
Quincy’s life. He stood there as a bitter, impersonal condition—an element,
a natural detail. In this way, a pagan peasant might regard a mountain that
stood north above his land—the womb of storms, the treasury of frosts, a
thing under no circumstance to be explored.
Marsden’s time would come. But now, the child’s senses opened
miraculously wide to Rhoda. With Rhoda, a new Adelaide was as well
created. But in the counterpoint played by these sisters upon Quincy, the
elder had the upper hand.
It was the common mistake of most people to call Rhoda the prettier of
the two. Such are the triumphs of an aggressive spirit. For although the
younger girl was essentially the finer girl, her subdued nature shone forth
badly beside the obvious brilliance of her sister. Rhoda was tall and dark;
Adelaide was short and blonde. Rhoda’s eyes were large, brown, pent-up
always with whatever mood possessed her. Adelaide’s eyes were small and
their blue was unobtrusive; their spirit was diffident; their suggestion of an
Oriental tilt seemed somehow to conceal itself. They were set deep and soft,
seclusively almost. And the angle of their position with the faint thickness
of their lids, the short fringe of golden lash above them (a sign of delicacy
for those who understood), served with most persons as an excuse for not noticing
her at all. Rhoda’s eyes, on the other hand, squarely, unimaginatively set,
came forward to command. At eighteen, it was obvious to her mother that
she was to be what she herself termed a “belle.”
To Quincy, Rhoda became now, by degrees, a dominant and estimable
figure. The evolution of this from early hatred and mistrust, through the
period of apathy when Jonas had foregathered his affections, was of course
not a conscious one. But gradually, Rhoda entered his dreams, later his
thoughts—came in some way to merge with them and to be welcome there.
The sporadic walks which she took in Central Park with Adelaide and him
grew to be hours of anticipation. During them, this tall glamorous creature
that lived so near seemed to relax from her haughty state and to be willing
to consort with him.
Upon one occasion, he had chanced to see her as she was dressing.
Rhoda had seemed unconcerned enough at this momentous incident. She
had crossed her hands quickly upon her breasts, cried: “Go away!”—and
been otherwise unmoved. In the shock of his amazement and swift retreat,
he had seen little. Yet that little remained long, branded within the texture of
his mind. He had not forgotten the event,—although he had no idea of what,
exactly, he remembered.
For his thirteenth birthday, Quincy received a bicycle. With the spring,
he determined to learn to ride it. And Rhoda volunteered to teach him. They
went to Central Park, to the circular abandoned road-way where once, ere
Quincy’s memory, had stood a statue of Bolivar and which later great
quantities of stone and sand turned into a manner of Park dump. With the
strangely sweet presence of Rhoda beside him, he had learned rapidly. And
having learned, he found that he had learned too rapidly, since now this
presence would be no longer there, beside him, as he pedalled. Having
stooped from her cold estate in the excitement of teaching her brother how
to ride, Rhoda lost no time in clambering back again. And so, Quincy found
a new want in his heart.
In his own way, he began to pay court to Rhoda. He was fully aware that
he had rivals. Rhoda went to parties on many evenings. And on many
others, she remained alone in the parlor with some caller whom Quincy
never saw. A great curiosity came over him to see these men who were his
rivals. Once, he slipped downstairs at the time when he was supposed to go
up to bed, and hid in the portières. He knew that a caller was coming. But
he became frightened and ran away even before the bell had rung. It was
decidedly too dangerous an experiment. Quincy knew that his discovery in
such flagrant badness before his sister’s caller would reflect on her. It would
plunge them together into shame. Had not his mother told him that when he
was bad, she always lost a bit of her prestige before her God? He did not
wish that kind of sharing, with his sister. So he never stole downstairs
again, to hide, when he had been sent to bed.
Quincy’s way of courting Rhoda gave little promise of his subsequent
success with women. Quincy did not understand his sister. It did not occur
to him to psychologize beforehand the effect which his efforts would be
likely to attain. It did not occur to him to find what she most wished by
studying her nature, to suppress what was conflicting in himself and to
feign what was harmonious. Gallantry is a sense in men capable of being
tutored into art. At thirteen, it was already patent that Quincy’s sense of
hearing did not promise the musician, nor his sense of sight the artist—nor
his sense of pleasing, the diplomat or gallant. He was an ordinary boy. The
most that one had ever said about him was that he was bright at school.
Now, when Quincy received a piece of chocolate or a box of soldiers or
a flower, he was very happy. Also, in his new state, he needed—though he
ignored the reason—to make Rhoda happy. His way was that of a miserable
logic. When he received a cake of chocolate, in lieu of devouring it as his
lust urged, a greater love restrained him. He saved it. And when, in the late
afternoon, Rhoda came home from a tea, he rushed up to her, held out his
offering before her, and said: “Here!”
In his accent, Quincy put neither a suggestion of his sacrifice nor a hint
of the motive behind his gift. So perhaps Rhoda must be excused. She
looked at the chocolate:
“I don’t want it.” The fact was that she had eaten too much pastry at the
tea.
Quincy withdrew his offering and ate it, that night, before retiring.
On another occasion, it was a daisy that his mother gave him. It was the
evening of the great dance to which Rhoda was going and for which
preparations were on foot with the early afternoon. Now Quincy had
learned much of the mundane way. And one of the things he knew was that
when his sister went to a dance, she always wanted flowers.
So he took his daisy and held it up to her.
“Will you wear this to-night?” he asked proudly.
Rhoda laughed. “Here,” she said, “—wear it yourself,—” and pinned it
upon his blouse.
Quincy left it there. But he felt the condescension in her retort. Had he
seen the bower of roses that one of the callers brought with him, after he
had gone to bed, he might have understood. But also, he might not have
gained much solace from his added wisdom, since roses were so
pathetically far beyond his reach.
And so, time and again, Quincy’s assiduous courtship failed, not only of
impression but of notice. Gradually, Quincy began to realize that his desire
to be an entity in Rhoda’s life was for some reason quite as monstrous and
preposterous as had been his kindred wish with Jonas. His mute attempts to
share in the lives of those whom he was so apt to love, his efforts to inspire
them to share in his, seemed, for some cause, to partake of the nature of a
jest—of an extravagance.
Once more, his vital energies swirled and stormed and veered within
him, hopeless of goal or outlet. His affection for his sister had at best been
vague. Quincy had no conception of the deep ties and firm hold which a
response in her would have called forth. He had no idea of the undirected