Spectrum Estimation For Large Dimensional Covariance Matrices Using Random Matrix Theory
Abstract
Estimating the eigenvalues of a population covariance matrix from a sample covariance matrix is a
problem of fundamental importance in multivariate statistics; the eigenvalues of covariance matrices play
a key role in many widely used techniques, in particular in Principal Component Analysis (PCA). In many
modern data analysis problems, statisticians are faced with large datasets where the sample size, n, is
of the same order of magnitude as the number of variables p. Random matrix theory predicts that in
this context, the eigenvalues of the sample covariance matrix are not good estimators of the eigenvalues
of the population covariance.
We propose to use a fundamental result in random matrix theory, the Marčenko-Pastur equation, to
better estimate the eigenvalues of large dimensional covariance matrices. The Marčenko-Pastur equation
holds in very wide generality and under weak assumptions. The estimator we obtain can be thought of
as “shrinking” in a non linear fashion the eigenvalues of the sample covariance matrix to estimate the
population eigenvalue. Inspired by ideas of random matrix theory, we also suggest a change of point of
view when thinking about estimation of high-dimensional vectors: we do not try to estimate directly
the vectors but rather a probability measure that describes them. We think this is a theoretically more
fruitful way to think about these problems.
Our estimator is fast to compute and gives good or very good results in extended simulations. Our algorithmic
approach is based on convex optimization. We also show that the proposed estimator is consistent.
1 Introduction
With data acquisition and storage now easy, today’s statisticians often encounter datasets for which
the sample size, n, and the number of variables, p, are both large: on the order of hundreds, thousands,
millions, or even billions in situations such as web search problems.
The analysis of these datasets using classical methods of multivariate statistical analysis requires some
care. While the ideas are still relevant, the intuition for the estimators that are used and the interpretation
of the results are often, implicitly, justified by assuming an asymptotic framework of p fixed and n
growing infinitely large. This assumption was consistent with the practice of statistics when these ideas
were developed, since investigation of datasets with a large number of variables was very difficult. A better
theoretical framework for modern (i.e., large p) datasets, however, is the assumption of the so-called “large
n, large p” asymptotics. In other words, one should consider that both n and p go to infinity, perhaps
with the restriction that their ratio goes to a finite limit γ, and draw practical insights from the theoretical
results obtained in this setting.
Acknowledgements: The author is grateful to Alexandre d’Aspremont, Peter Bickel, Laurent El Ghaoui, Elizabeth
Purdom, John Rice, Saharon Rosset and Bin Yu for stimulating discussions and comments at various stages of this project.
Support from NSF grant DMS-0605169 is gratefully acknowledged. AMS 2000 SC: Primary 62H12, Secondary 62-09.
Key words and Phrases : covariance matrices, principal component analysis, eigenvalues of covariance matrices, high-
dimensional inference, random matrix theory, Stieltjes transforms, Marčenko-Pastur equation, convex optimization. Contact :
[email protected]
We will turn our attention to an object of central interest in multivariate statistics: the eigenvalues
of covariance matrices. A key application is Principal Components Analysis (PCA), where one searches
for a good low dimensional approximation to the data by projecting the data on the “best” possible k
dimensional subspace: here “best” means that the projected data explain as much variance in the original
data as possible. This amount of variance explained is measured by the eigenvalues of the population
covariance matrix, Σp , and hence we need to find a way to estimate those eigenvalues. We will discuss in
the course of the paper other problems where the eigenvalues of Σp play a key role.
We take a moment here to give a few examples that illustrate the differences that occur under the
different asymptotic settings. To pose the problem more formally, let us say that we observe iid random
vectors X1 , . . . , Xn in Rp , and that the covariance of Xi is Σp . We call X the data matrix whose rows are
the Xi ’s. In the classical context, where p is fixed and n goes to ∞, a fundamental result of (Anderson,
1963) says that the eigenvalues of the sample covariance matrix Sp = (X − X̄)′ (X − X̄)/(n − 1) are good
estimators of the population eigenvalues (i.e the eigenvalues of Σp ). More precisely, calling li the ordered
eigenvalues of Sp (l1 ≥ l2 . . .) and λi the ordered eigenvalues of Σp (λ1 ≥ λ2 . . .), it was shown in (Anderson,
1963) that
$$\sqrt{n}\,(l_i - \lambda_i) \Rightarrow N(0,\, 2\lambda_i^2)\,,$$
when the Xi are normally distributed and all the λi ’s are distinct. This result provided rigorous grounds
for estimating the eigenvalues of the population covariance matrix, Σp , with the eigenvalues of the sample
covariance matrix, Sp , when p is small compared to n. (For more details on Anderson’s theorem, we refer
the reader to (Anderson, 2003) Theorem 13.5.1.)
Shifting assumptions to “large n, large p” asymptotics induces fundamental differences in the behavior
of multivariate statistics, some of which we will highlight in the course of the paper. As a first example, let
us consider the case where Σp = Idp , so all the population eigenvalues are equal to 1. A result first shown
in (Geman, 1980) under some moment growth assumptions, and later refined in (Yin et al., 1988), states
that if the entries of the Xi ’s are i.i.d and have a fourth moment, and if p/n → γ, then
$$l_1 \to (1 + \sqrt{\gamma})^2 \quad \text{a.s.}$$
size 1/100 in all coordinates, the resulting l1 error is 1, even though, at least intuitively, it would seem like
we are doing well. Also, if we made a large error (say size 1) in one direction, the l2 norm would be large
(larger than 1 at least), even though we may have gotten the structural information about this vector (and
almost all its coordinates) “right”. Inspired by ideas of random matrix theory, we propose to associate
to high-dimensional vectors probability measures that describe them. We will explain this in more detail
in Section 2.1. After this change of point of view, our focus becomes trying to estimate these measures.
Why choose to estimate measures? The reasons are many. Chief among them is that this approach will
allow us to look into the structure of the population eigenvalues. For instance, we would like to be able
to say whether all population eigenvalues are equal, or whether they are clustered around say two values,
or if they are uniformly spread out on an interval. Because the ratio p/n can make the scree plot appear
smooth (and hence in some sense uninformative) regardless of the true population eigenvalue structure,
this structural information is not well estimated by currently existing methods. We discuss other practical
benefits (like scalability with p) of the measure estimation approach in 3.3.7. In the context of PCA, where
usually the concern is not to estimate each population eigenvalue with very high precision, but rather to
have an idea of the structure of the population spectrum to guide the choice of lower-dimensional subspaces
on which to project the data, this measure approach is particularly appealing. Examples to come later in
the paper will illustrate this point.
Random matrix theory plays a key role in our approach to this measure estimation problem. A main
ingredient of our method is a fundamental result, which we call the Marčenko-Pastur equation (see Theorem
1), which relates the asymptotic behavior of the sample eigenvalues to the population eigenvalues. The
assumptions under which the theorem holds are very weak (a fourth moment condition) and hence it is very
widely applicable. Until now, this theorem has not been used to do inference on population eigenvalues.
Partly this is because in its general form it has not received much attention in statistics, and partly because
the inverse problem that needs to be considered is very hard to solve if it is not posed the right way. We
propose an original way to approach inverting the Marčenko-Pastur equation. In particular, we will be
able to estimate, given the eigenvalues of the sample covariance matrix Sp , the probability measure, Hp ,
that describes the population eigenvalues. We use the standard names empirical spectral distribution for
Fp and population spectral distribution for Hp . It is important to state clearly what asymptotic framework
we place ourselves in. We will consider that when p and n go to infinity, Hp stays fixed. In particular, it
has a limit, denoted H∞ . We call this framework “asymptotics at fixed spectral distribution”. Of course,
fixing Hp does not imply that we fix p. For instance, sometimes we will have Hp = δ1 , for all p. Since the
parameter of interest in our problems is really the measure Hp , the fixed spectral distribution asymptotics
corresponds to classical assumptions for parameter estimation in statistics, where the parameter does not
change with the number of variables observed. We refer the reader to 3.3.6 for a more detailed discussion.
To solve the inverse problem posed by the Marčenko-Pastur equation, we propose to discretize the
Marčenko-Pastur equation and then use convex optimization methods to solve the discretized version of
the problem. In doing so, we obtain a fast and provably accurate algorithm to estimate the population
parameter of interest, Hp , from the sample eigenvalues. The approach is non-parametric since no assump-
tions are made a priori on the structure of the population eigenvalues. One outcome of the algorithm is an
efficient graphical method to look at the structure of the population eigenvalues. Another outcome is that
since we have an estimate of the measure that describes the population eigenvalues, standard statistical
ideas then allow us to get estimates of the individual population eigenvalues λi . Some subtle problems
may arise when doing so and we address them in 3.3.6. The final result of the algorithm can be thought
of as performing non-linear shrinkage of the sample eigenvalues to estimate the population eigenvalues.
We want to highlight two contributions of our paper. First, we propose to estimate measures associated
with high-dimensional vectors rather than estimating the vectors. This gives rise to natural notions of
consistency and accuracy of our estimates which are reasonable theoretical requirements for any estimator
to achieve. And second, we make use, for the first time, of a fundamental result of random matrix theory
to solve an important practical problem in multivariate statistics.
The rest of the paper is divided into four parts. In Section 2, we give some background on results in
Random Matrix Theory that will be needed. We do not assume that the reader has any familiarity with
the topic. In Section 3, we present our algorithm to estimate Hp , the population spectral distribution,
and also the population eigenvalues. In Section 4, we present the results of some simulations. We give in
Section 5 a proof of consistency of our algorithm. The Appendix contains some details on implementation
of the algorithm.
A note on notation is needed before we start: in the rest of the paper, p will always be a function of n,
with the property that p(n)/n → γ and γ ∈ (0, ∞). To avoid cumbersome notations, we will usually write
p and not p(n).
Gp is thus a measure with p point masses of equal weight, one at each of the coordinates of the vector.
In the rest of the paper, we will denote by Hp the spectral distribution of the population covariance
matrix Σp , i.e the measure associated with the vector of eigenvalues of Σp . We will refer to Hp as the
population spectral distribution. We can write this measure as
$$dH_p(x) = \frac{1}{p}\sum_{i=1}^{p} \delta_{\lambda_i}(x)\,,$$
where δλi is a point mass, of mass 1, at λi . We also call δλi a “dirac” at λi . The simplest example of
population spectral distribution is found when Σp = Idp . In this case, for all i, λi = 1, and dHp = δ1 . So
the population spectral distribution is a point mass at 1 when Σp = Idp .
Similarly, we will denote by Fp the measure associated with the eigenvalues of the sample covariance
matrix Sp . We refer to Fp as the empirical spectral distribution. Equivalently, we define
$$dF_p(x) = \frac{1}{p}\sum_{i=1}^{p} \delta_{l_i}(x)\,.$$
The change of focus from vector to measure implies a change of focus in the notion of convergence we
will consider adequate. In particular, for consistency issues, the notion of convergence we will use is weak
convergence of probability measures. While this is the natural way to pose the problem mathematically,
we may ask if it will allow us to gather the statistical information we are looking for. An example of the
difficulties that arise is the following. Suppose dHp = (1 − 1/p) δ1 + 1/p δ2 . In other words, the population
covariance has one eigenvalue that is equal to 2 and (p − 1) that are equal to 1. Clearly, when p → ∞,
Hp weakly converges to H∞ , with dH∞ = δ1 . So all information about the large and isolated eigenvalue
2, which is present in Hp for all p and is naturally of great interest in PCA, seems lost in the limit. This
is not the case when one does asymptotics at fixed spectral distribution and considers that we are following
a sequence of models which are going to infinity with Hp = Hp0 = H∞ , where p0 is the p which is given
by the data set. Fixed distribution asymptotics is more akin to what is done in classical statistics and
we place ourselves in this framework. We refer the reader to 3.3.6 for a more detailed justification of our
point.
In other respects, associating a measure to a vector in the way we described is meaningful mostly when
one wants to have information about the whole set of values taken by the coordinates of the vector, and not
about each coordinate. In particular, when going from vector to measure as described above we are losing
all coordinate information: permuting the coordinates would drastically change the vector but yield the
same measure. However, in the case of vectors of eigenvalues, since there is a canonical way to represent
the vector (the i-th largest eigenvalue occupying the i-th coordinate), the information contained in the
measure is sufficient. This measure approach is especially good when we are not focused on getting all the
fine details of the vectors right, but rather when we are looking for structural information concerning the
values taken by the coordinates.
An important area of random matrix theory for sample covariance matrices is concerned with under-
standing the properties of Fp as p (and n) go to ∞. A key theorem, which we review later (see Theorem
1), states that for a wide class of sample covariance matrices, F∞ , the limit of Fp , is asymptotically non-
random. Furthermore, the theorem connects F∞ to H∞ , the limit of Hp : given H∞ , we can theoretically
compute F∞ , by solving a complicated equation. In data analysis, we observe the empirical spectral distri-
bution, Fp . Our goal, of course, as far as eigenvalues are concerned, is to estimate the population spectral
distribution, Hp . Our method will “invert” the relation between F∞ and H∞ , so that we can go from Fp
to Ĥp , an estimate of Hp . The method does not work directly with Fp but with a tool that is similar in
flavor to the characteristic function of a distribution: the Stieltjes transform of a measure. We introduce
this tool in the next subsection. As we will see later, it will also play a key role in our algorithm.
2. If F and G are two measures, and if mF (z) = mG (z), for all z ∈ C+ , then G = F , a.e.
3. (Geronimo and Hill, 2003, Theorem 1): If Gn is a sequence of probability measures and mGn (z)
has a (pointwise) limit m(z) for all z ∈ C+ , then there exists a probability measure G with Stieltjes
transform mG = m if and only if $\lim_{y \to \infty} -iy\, m(iy) = 1$. If that is the case, Gn converges weakly to G.
4. (Geronimo and Hill, 2003, Theorem 2): The same is true if the convergence happens only for an
infinite sequence {zi }, i ≥ 1, in C+ with a limit point in C+ .
5. If t is a continuity point of the cdf of G, then $dG(t)/dt = \lim_{\epsilon \to 0^+} \frac{1}{\pi} \operatorname{Im}\left(m_G(t + i\epsilon)\right)$.
For proofs, we refer the reader to (Geronimo and Hill, 2003).
Note that the Stieltjes transform of the spectral distribution Γp of a p × p matrix Ap is just
$$m_{\Gamma_p}(z) = \frac{1}{p}\,\operatorname{trace}\left((A_p - z\,\mathrm{Id}_p)^{-1}\right).$$
Finally, it is clear that points 3 and 4 above can be used to show convergence of probability measures if
one can control the corresponding Stieltjes transforms.
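To make this concrete, here is a minimal sketch, in Python/NumPy (our own illustration; the paper's computations were done in Matlab with MOSEK), of how the Stieltjes transform of an empirical spectral distribution can be evaluated from the sample eigenvalues using the trace formula above. All names are ours.

```python
import numpy as np

def stieltjes_transform(eigenvalues, z):
    """m_G(z) = (1/p) * trace((A - z*Id)^{-1}) = (1/p) * sum_i 1/(lambda_i - z):
    the Stieltjes transform of the spectral distribution of a symmetric matrix
    with the given eigenvalues, evaluated at a point z with Im(z) > 0."""
    eigenvalues = np.asarray(eigenvalues, dtype=float)
    return np.mean(1.0 / (eigenvalues - z))

# Example: empirical spectral distribution of a sample covariance matrix.
rng = np.random.default_rng(0)
n, p = 500, 100
X = rng.standard_normal((n, p))            # population covariance = Id_p
Xc = X - X.mean(axis=0)
Sp = Xc.T @ Xc / (n - 1)                   # sample covariance matrix
sample_eigs = np.linalg.eigvalsh(Sp)

z = 1.0 + 0.1j                             # a point in C+
print(stieltjes_transform(sample_eigs, z)) # m_{F_p}(z), a complex number
```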
3. The previous equation has one and only one solution which is the Stieltjes transform of a measure.
In plain English, under the assumptions put forth in Theorem 1, the spectral distribution of the
sample covariance matrix is asymptotically non-random. Furthermore, it is fully characterized by the true
population spectral distribution, through the equation (M-P).
A particular case of equation (M-P) is often of interest: the situation when all the population eigenvalues
are equal to 1. Then of course, Hp = H∞ = δ1 . A little bit of elementary work leads to the well-known
fact in random matrix theory that the empirical spectral distribution, Fp , converges (a.s) to the Marčenko-
Pastur law, whose density is given by, if γ ≤ 1,
$$f_\gamma(x) = \frac{\sqrt{(b - x)(x - a)}}{2\pi \gamma x}\,, \quad \text{with } a = (1 - \gamma^{1/2})^2,\ b = (1 + \gamma^{1/2})^2.$$
We refer the reader to (Marčenko and Pastur, 1967), (Bai, 1999) and (Johnstone, 2001) for more details
and explanations concerning the case γ > 1. One point of statistical interest is that even though the
true population eigenvalues are all equal to 1, the empirical ones are now spread out over the interval
[(1 − γ^{1/2})^2 , (1 + γ^{1/2})^2 ]. Plotting the density also shows that its shape varies with γ in a non-trivial way. These
two remarks illustrate some of the difficulties that need to be overcome when working under “large n, large
p” asymptotics.
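For illustration, the Marčenko-Pastur density above is easy to evaluate numerically; the short sketch below (our own, with γ = 0.2 as in the simulations of Section 4) computes fγ on a grid and can be superimposed on a histogram of sample eigenvalues when Σp = Idp.

```python
import numpy as np

def marcenko_pastur_density(x, gamma):
    """Marcenko-Pastur density for gamma = p/n <= 1:
    f_gamma(x) = sqrt((b - x)(x - a)) / (2 * pi * gamma * x) on [a, b],
    with a = (1 - sqrt(gamma))**2 and b = (1 + sqrt(gamma))**2."""
    a = (1.0 - np.sqrt(gamma)) ** 2
    b = (1.0 + np.sqrt(gamma)) ** 2
    x = np.asarray(x, dtype=float)
    f = np.zeros_like(x)
    on_support = (x > a) & (x < b)
    f[on_support] = np.sqrt((b - x[on_support]) * (x[on_support] - a)) \
        / (2.0 * np.pi * gamma * x[on_support])
    return f

gamma = 0.2                                  # e.g. p = 100, n = 500
grid = np.linspace(0.0, 2.5, 500)
density = marcenko_pastur_density(grid, gamma)
```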
3 Algorithm and Statistical considerations
3.1 Formulation of the estimation problem
A remarkable feature of the equation (M-P) is that the knowledge of the limiting distribution of the
eigenvalues in the population given by H∞ fully characterizes the limiting behavior of the eigenvalues of
the sample covariance matrix. However, the relationship between the two is hard to disentangle. As is
common in statistics, the question is how to invert this relationship to estimate Hp . The question thus
becomes, given l1 , . . . , lp , the eigenvalues of a sample covariance matrix, can we estimate the population
eigenvalues, λ1 , . . . , λp , using Equation (M-P)? Or in terms of spectral distribution, can we estimate Hp
from Fp ?
Our strategy is the following: 1) the first aim is to estimate the measure H∞ appearing in the Marčenko-
Pastur equation. 2) Given an estimator, Ĥ∞ , of this measure, we will estimate λi as the i-th quantile of
our estimated distribution. It is common in statistical practice to get these estimates by using the i/(p + 1)
percentile and this is what we do. (We come back to possible difficulties in getting from Ĥp to λ̂i in 3.3.6.)
3) An important point is that since we are considering fixed distribution asymptotics, our estimate of H∞
will serve as our estimate of Hp , so Ĥp = Ĥ∞ .
The main question, then, is how to approach step 1: estimating H∞ based only on Fp . Of course,
since we can compute the eigenvalues of Sp , we can compute vFp (z) for any z we choose. By evaluating
vFp at a grid of values {zj }, 1 ≤ j ≤ Jn , we have a set of values {vFp (zj )} for which equation (M-P) should
(approximately) hold. We want to find Ĥ∞ that will “best” satisfy equation (M-P) across the set of values
of vFp (zj ). In other words, we will pick
$$\widehat{H}_p = \widehat{H}_\infty = \operatorname*{argmin}_{H}\; L\!\left(\left\{\frac{1}{v_{F_p}(z_j)} + z_j - \frac{p}{n}\int \frac{\lambda\, dH(\lambda)}{1 + \lambda v_{F_p}(z_j)}\right\}_{j=1}^{J_n}\right),$$
where the optimization is over probability measures H, and L is a loss function to be chosen later. In this
way we are “inverting” the equation (M-P), going from Fp , an estimate of F∞ , to an estimate of H∞ .
We will solve this inverse problem in two steps: discretization and convex optimization. We give a
high-level overview of our method and postpone implementation details to the Appendix.
To summarize, we face the following interpolation problem: given an integer J and pairs (zj , vFp (zj )), j = 1, . . . , J, we
want to find an estimate of H∞ that approximately satisfies equation (M-P). In Section 5, we show that
doing so for the L∞ loss function leads to a consistent estimator of H∞ , under the reasonable assumption that
all spectra are bounded.
3.2.1 Discretization
Naturally, dH can be simply approximated by a weighted sum of point masses:
$$dH(x) \simeq \sum_{k=1}^{K} w_k\, \delta_{t_k}(x)\,,$$
where {tk }, k = 1, . . . , K, is a grid of points, chosen by us, and the wk ’s are weights. The fact that we are
looking for a probability measure imposes the constraints
$$\sum_{k=1}^{K} w_k = 1\,, \quad \text{and} \quad w_k \geq 0\,.$$
This approximation turns the optimization over measures into a search for a vector of weights
in R_+^K . After discretization, the integral in equation (M-P) can be approximated by
$$\int \frac{\lambda\, dH(\lambda)}{1 + \lambda v} \simeq \sum_{k=1}^{K} w_k\, \frac{t_k}{1 + t_k v}\,.$$
Hence finding a measure that approximately satisfies Equation (M-P) is equivalent to finding a set of
weights {wk }, k = 1, . . . , K, for which we have
$$-\frac{1}{v_\infty(z_j)} \simeq z_j - \frac{p}{n}\sum_{k=1}^{K} w_k\, \frac{t_k}{1 + t_k v_\infty(z_j)}\,, \quad \forall j.$$
Naturally, we do not get to observe v∞ , and so we make a further approximation and replace v∞ by
vFp . Our problem is thus to find {wk }, k = 1, . . . , K, such that
$$-\frac{1}{v_{F_p}(z_j)} \simeq z_j - \frac{p}{n}\sum_{k=1}^{K} w_k\, \frac{t_k}{1 + t_k v_{F_p}(z_j)}\,, \quad \forall j.$$
One good thing about this approach is that the problem we now face is linear in the weights, which
are the only unknowns here. We will demonstrate that this allows us to cast the problem as a relatively
simple convex optimization problem.
As explained above, there are two sources of error in ej (the residual, at zj , of the approximate equation above): one comes from the discretization of the integral
involving H∞ . The other one comes from the substitution of v∞ , a non-random and asymptotic quantity,
by vFp , a (random) quantity computable from the data. ej is of course a complex number in general.
We can now state several convex problems as approximation of the inversion of the Marčenko-Pastur
equation problem. We show in Section 5 consistency of the solution of the “L∞ ” version of the problem
described below. Here are a few examples of convex formulations for our inverse problem. In all these
problems, the wk ’s are constrained to sum to 1 and to be non-negative.
The advantages of formulating our problem as a convex optimization problem are many. We will come
back to the more statistical issues later. From a purely numerical point of view, we are guaranteed that an
optimum exists, and fast algorithms are available. In practice, we used the optimization package MOSEK
(see (MOSEK, 2006)), within Matlab, for solving our problems.
Because the rest of the article focuses particularly on the “L∞ ” version of the problem described above,
we want to give a bit more detail about it. The “translation” of the problem into a convex optimization
problem is
$$\begin{aligned}
\min_{(w_1, \ldots, w_K,\, u)} \quad & u \\
\text{subject to} \quad & -u \le \operatorname{Re}(e_j) \le u, \quad \forall j, \\
& -u \le \operatorname{Im}(e_j) \le u, \quad \forall j, \\
& \sum_{k=1}^{K} w_k = 1, \\
& w_k \ge 0, \quad \forall k.
\end{aligned}$$
This is a linear program (LP) with unknowns (w1 , . . . , wK ) and u (see (Boyd and Vandenberghe, 2004) for
standard manipulations to make it a standard form LP).
The simulations we present in Section 4 were made using this version of this algorithm. The proof in
Section 5 applies to this version of the algorithm.
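For concreteness, here is a small self-contained sketch of this linear program. It is our own illustration in Python, using scipy.optimize.linprog rather than MOSEK/Matlab as in the paper; the way vFp is computed (from the eigenvalues of X∗X/n padded with n − p zeros, representing the spectrum of XX∗/n), the grids and all names are assumptions of this sketch, not the paper's exact implementation.

```python
import numpy as np
from scipy.optimize import linprog

def v_Fp(z, sample_eigs, n):
    """Stieltjes transform of the spectral distribution of XX*/n, whose
    eigenvalues are the p sample eigenvalues of X*X/n plus (n - p) zeros."""
    p = len(sample_eigs)
    return (np.sum(1.0 / (sample_eigs - z)) - (n - p) / z) / n

def estimate_spectrum_weights(sample_eigs, n, t_grid, z_grid):
    """Discretized "L_infinity" inversion of the Marcenko-Pastur equation:
    find w_k >= 0, sum w_k = 1, minimizing the largest residual, where
    e_j = 1/v_j + z_j - (p/n) * sum_k w_k * t_k / (1 + t_k * v_j)."""
    t_grid = np.asarray(t_grid, dtype=float)
    z_grid = np.asarray(z_grid, dtype=complex)
    p, K, J = len(sample_eigs), len(t_grid), len(z_grid)
    v = np.array([v_Fp(z, sample_eigs, n) for z in z_grid])

    # e_j = c[j] - A[j, :] @ w is affine in the weights w.
    c = 1.0 / v + z_grid
    A = (p / n) * t_grid[None, :] / (1.0 + np.outer(v, t_grid))

    # Variables x = (w_1, ..., w_K, u); minimize u.
    cost = np.zeros(K + 1)
    cost[-1] = 1.0
    A_ub, b_ub = [], []
    for part in (np.real, np.imag):
        # part(e_j) <= u   <=>  -part(A) @ w - u <= -part(c)
        A_ub.append(np.hstack([-part(A), -np.ones((J, 1))]))
        b_ub.append(-part(c))
        # -part(e_j) <= u  <=>   part(A) @ w - u <=  part(c)
        A_ub.append(np.hstack([part(A), -np.ones((J, 1))]))
        b_ub.append(part(c))
    A_ub, b_ub = np.vstack(A_ub), np.concatenate(b_ub)
    A_eq = np.hstack([np.ones((1, K)), np.zeros((1, 1))])   # sum_k w_k = 1

    res = linprog(cost, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=[1.0],
                  bounds=[(0, None)] * (K + 1), method="highs")
    return res.x[:K]      # weights of the estimated measure sum_k w_k * delta_{t_k}
```

Given the weights, the estimate Ĥp is the discrete measure Σk wk δtk, and individual eigenvalue estimates follow by taking quantiles as described in Section 3.1.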
where the Mi ’s are the measures in our dictionary.
In the preceding discussion on discretization, we restricted ourselves to Mi ’s being point masses at
chosen “grid points”. Of course, we can enlarge our dictionary to include, for instance:
1. Probability measures that are uniform on an interval: dMi (x) = 1_{[ai ,bi ]}(x) dx/(bi − ai ).
2. Probability measures that have a linearly increasing density on an interval [ai , bi ] and density 0
elsewhere: dMi (x) = 1_{[ai ,bi ]}(x) 2(x − ai )/(bi − ai )^2 dx.
3. Probability measures that have a linearly decreasing density on an interval [ai , bi ], and density 0
elsewhere: dMi (x) = 1_{[ai ,bi ]}(x) 2(bi − x)/(bi − ai )^2 dx.
If we decide to include a probability measure M in our dictionary, the only requirement is that we be
able to compute the integral
$$\int \frac{\lambda\, dM(\lambda)}{1 + \lambda v}$$
for any v in C+ .
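To illustrate this requirement, here is a sketch (our own) of the integral for two simple dictionary elements: a point mass δt, for which the integral is just t/(1 + tv), and the uniform distribution on [a, b], for which the closed form below is our own short derivation, checked numerically in the last lines.

```python
import numpy as np

def integral_point_mass(t, v):
    """Integral of lambda dM(lambda)/(1 + lambda v) for M = delta_t."""
    return t / (1.0 + t * v)

def integral_uniform(a, b, v):
    """Same integral for M uniform on [a, b] (0 <= a < b), v in C+:
    (1/(b-a)) * int_a^b lambda/(1 + lambda v) d lambda
      = 1/v - (log(1 + b v) - log(1 + a v)) / ((b - a) * v**2).
    The principal log is fine here: 1 + lambda*v stays in the closed
    upper half-plane for lambda in [a, b]."""
    return 1.0 / v - (np.log(1.0 + b * v) - np.log(1.0 + a * v)) / ((b - a) * v**2)

# Numerical sanity check against a Riemann-sum approximation.
v = -0.3 + 0.5j
a, b = 0.5, 2.0
lam = np.linspace(a, b, 200001)
approx = np.mean(lam / (1.0 + lam * v))     # ~ (1/(b-a)) * integral
print(integral_uniform(a, b, v), approx)    # should agree to several digits
```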
Choosing a larger dictionary increases the size of the convex optimization problems we try to solve,
and hence is at first glance computationally harder. However, statistically, enlarging the dictionary may
lead to sparser representations of the measure we are estimating, and hence, at least intuitively, lead to
better estimates of H∞ . The most favorable case is of course when H∞ is a mixture of a small number of
measures present in our dictionary. For instance, if H∞ has a density whose graph is a triangle, having
measures as described in points 2 and 3 above would most likely lead to sparser and maybe more accurate
estimates. In the presence of a priori information on H∞ , the choice of dictionary should be adapted so
that H∞ has a sparse representation in the dictionary.
3.3.5 On covariance estimation, linear and non-linear shrinkage of eigenvalues
There is some classical and more recent statistical work on shrinkage of eigenvalues to improve co-
variance estimation. We refer the reader to Section 4.1 in (Ledoit and Wolf, 2004) for some examples
due to Charles Stein and Leonard Haff, unfortunately in unpublished manuscripts. More recently, the
interesting paper (Ledoit and Wolf, 2004) proposed to linearly shrink the eigenvalues of Sp
toward the identity: i.e., the li ’s become l̃i = (1 − ρ)li + ρ, for some ρ, independent of i, chosen using the data
and the Marčenko-Pastur law. Then the authors of (Ledoit and Wolf, 2004) proposed to estimate Σp by
(1 − ρ)Sp + ρIdp . Since this latter matrix and Sp have the same eigenvectors, their method of covariance
estimation can be viewed as linearly shrinking the sample eigenvalues and keeping the eigenvectors of Sp
as estimates of the eigenvectors of Σp .
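For comparison, a minimal sketch of that linear shrinkage (ours, with ρ left as a free parameter; (Ledoit and Wolf, 2004) choose ρ from the data):

```python
import numpy as np

def linear_shrinkage(Sp, rho):
    """Linearly shrink the sample covariance toward the identity:
    (1 - rho) * Sp + rho * Id.  The eigenvectors of Sp are unchanged and
    each sample eigenvalue l_i becomes (1 - rho) * l_i + rho."""
    return (1.0 - rho) * Sp + rho * np.eye(Sp.shape[0])
```

Note that this shrinkage applies the same affine map to every eigenvalue, whereas the estimator of this paper replaces it with a non-linear map obtained from the estimated population spectral distribution.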
Our method of estimation of the population eigenvalues can be viewed as doing a non-linear shrinkage
of the sample eigenvalues. While we could propose to just keep the eigenvectors of Sp as estimates of
the eigenvectors of Σp , and hence get an estimate of the population covariance matrix, we think one
should be able to do better by using the eigenvalue information to drive the eigenvector estimation. It is
known that in “large n, large p” asymptotics, the eigenvectors of the sample covariance matrix are not
consistent estimators of the population eigenvectors (see (Paul, To Appear)), even in the most favorable
cases. However, having a good idea of the structure of the population eigenvalues should help us estimate
the eigenvectors of the population covariance matrix, or at least formulate the right questions for the
problem at hand. For instance, the inferred structure of the covariance matrix could help us decide how
many subspaces we need to identify: if, for example, it turned out that the population eigenvalues were
clustered around two values, we would have to identify two subspaces, the dimensions of these subspaces
being the number of eigenvalues clustered around each value. Also, having estimates of the eigenvalues
tells us how much variance our “eigenvectors” will have to explain. In other words, our hope is that taking
advantage of the crucial eigenvalue information we are now able to gather will lead to better estimation of
Σp by doing a “reasoned” spectral decomposition. Work in this direction is in progress.
2005)), that takes advantage of the Marčenko-Pastur law to estimate some moments of H∞ . H∞ is then
assumed to be a mixture of a finite and pre-specified number of point masses (see (Burda et al., 2004, p.
303)) and the moments are then matched with possible point masses and weights. While these methods
might be of some use sometimes, we think they require too many assumptions to be practically acceptable
for a broad class of problems. It might be tempting to try to develop a non-parametric estimator from
moments, but we think that without the strong assumptions made in (Burda et al., 2004), those estimators
will suffer drastically from: 1) the number of moments needed a priori may be large, and high-order moments
are estimated very unreliably; 2) moments estimated indirectly may not constitute a genuine family of
moments: certain Hankel matrices need to be positive semi-definite and will not necessarily be so. Semi-
definite programming type corrections will then be necessary, but hard to implement. 3) Even if one has a
genuine moment sequence, there are usually many distributions with the same moments. Choosing between
them is clearly going to be a difficult task.
4 Simulations
We now present some simulations to illustrate the practical capabilities of the method. The objectives
of eigenvalue estimation are manifold and depend on the area of application. We review some of those
that inspired our work.
In settings like PCA, one basically wishes to discover some form of structure in the covariance matrix by
looking at the eigenvalues of the sample covariance matrix. In particular, a situation where the population
eigenvalues are different from each other indicates that projecting the data in some directions will be more
“informative” than projecting it in other directions; while in the case where all the population eigenvalues
are equal, all projections are equally informative or uninformative. As our brief discussion of the Marčenko-
Pastur law illustrated, in the “large n, large p” setting, it is difficult to know from the sample eigenvalues
whether all population eigenvalues are equal to each other or not, or even if there is any kind of structure
in them. When p and n are both large, standard graphical methods like the scree plot tend to look similar
whether or not there is structure in the data. We will see that our approach is able to differentiate between
the situations. Among other things, our method can thus be thought of as an alternative to the scree plot for
high-dimensional problems.
In other applications, one focuses more on trying to estimate the value of the largest or smallest
eigenvalues. In PCA, the largest population eigenvalues measure how much variance we can explain through
a low dimensional projection and are hence important. In financial applications, like the Markowitz portfolio
optimization problem, the small population eigenvalues are important. They essentially measure the
minimum risk one can take by investing in a portfolio of certain stocks (see (Laloux et al., 1999) and
(Campbell et al., 1996, Chapter 5)). However, as explained in the Appendix, the largest eigenvalue of the
sample covariance matrix tends to overestimate the largest eigenvalue of the population covariance. And
similarly, the smallest eigenvalue of the sample covariance matrix tends to underestimate its population
counterpart. What that means is that using these measures of “information” and “risk”, we will tend to
overestimate the amount of information there is in our data and tend to underestimate the amount of risk
there is in our portfolios. So it is important to have tools to correct this bias. Our estimator provides a
way to do so.
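A tiny simulation (ours, for illustration) makes this bias visible: with Σp = Idp all population eigenvalues are 1, yet the extreme sample eigenvalues land near the Marčenko-Pastur edges (1 ± √γ)².

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 500, 100                                      # gamma = p/n = 0.2
X = rng.standard_normal((n, p))                      # population covariance Id_p
sample_eigs = np.linalg.eigvalsh(X.T @ X / n)

print(sample_eigs.max(), (1 + np.sqrt(p / n)) ** 2)  # ~2.1: overestimates lambda_1 = 1
print(sample_eigs.min(), (1 - np.sqrt(p / n)) ** 2)  # ~0.31: underestimates lambda_p = 1
```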
high-dimension the sample eigenvalues will often blur the clusters together. We show that our method
generally recovers these two clusters well.
Finally, the third example is one where Σp is a Toeplitz matrix. More details on Toeplitz matrices
are given in 4.1.3. This situation poses a harder estimation problem. While the asymptotic behavior of
the eigenvalues of such matrices is well understood, there are generally no easy and explicit formulas to
represent the limit. We present the results to show that even in this difficult setting, our method performs
quite well.
To measure the performance of our estimators, we compare the Lévy distance between our estimator,
Ĥp , and the true distribution of the population eigenvalues, Hp , to the Lévy distance between the empirical
spectral distribution, Fp , and Hp . Our choice is motivated by the fact that the Lévy distance can be used as
a metric for weak convergence of distributions on R. Recall (see e.g (Durrett, 1996)) that the Lévy distance
between two distributions F and G on the real line is defined as
$$d_L(F, G) = \inf\left\{\epsilon > 0 \,:\, F(x - \epsilon) - \epsilon \le G(x) \le F(x + \epsilon) + \epsilon \ \text{ for all } x \right\}.$$
In the plots we will depict the cumulative distribution function (cdf) of our estimated measures. Recall
that the estimates of the population eigenvalues λi ’s are obtained by taking appropriate percentiles of these
measures.
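Concretely, if the algorithm returns the estimate Ĥp = Σk wk δtk, the individual eigenvalue estimates are obtained as the i/(p + 1) quantiles of that discrete distribution, as described in Section 3.1. A minimal sketch (ours) is:

```python
import numpy as np

def eigenvalues_from_weights(t_grid, weights, p):
    """Estimate the p population eigenvalues as the i/(p+1) quantiles,
    i = 1, ..., p, of the discrete distribution sum_k w_k * delta_{t_k}.
    Returned in increasing order; relabel as needed for lambda_1 >= ... >= lambda_p."""
    order = np.argsort(t_grid)
    t = np.asarray(t_grid, dtype=float)[order]
    w = np.asarray(weights, dtype=float)[order]
    cdf = np.cumsum(w)
    probs = np.arange(1, p + 1) / (p + 1.0)
    # index of the smallest grid point whose CDF reaches each target probability
    idx = np.minimum(np.searchsorted(cdf, probs, side="left"), len(t) - 1)
    return t[idx]
```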
[Figure 1, three panels: (a) eigenvalues (scree plot) of the sample covariance matrix; (b) CDF of the eigenvalues of the sample covariance matrix (Fp ); (c) CDF of the estimated population spectral distribution. n = 500, p = 100, Σp = Id.]
Figure 1: case Σp = Idp. The three figures above compare the performance of our estimator to the one
derived from the sample covariance matrix on one realization of the data. The data matrix X is 500 × 100.
All its entries are iid N (0, 1). The population covariance is Σp = Id100 , so the distribution of the eigenvalues
is a point mass at 1. This is what our estimator (Figure (c)) recovers. Average computation time (over
1000 repetitions) was 13.33 seconds, according to Matlab tic and toc functions. Implementation details
are in the Appendix.
Figure 2: case Σp = Idp : Ratios dL (Ĥp , Hp )/dL (Fp , Hp ) over 1,000 repetitions. Dictionary consisted of
only point masses. Large values indicate better performance of our algorithm. All ratios were found to be
larger than 1.
[Figure 3, three panels: (a) scree plot of the eigenvalues of the sample covariance matrix; (b) CDF of the eigenvalues of the sample covariance matrix (Fp ); (c) CDF of the estimated population spectral distribution. n = 500, p = 100, with 50 population eigenvalues equal to 1 and 50 equal to 2.]
Figure 3: case Hp = .5δ1 + .5δ2 : the three figures above compare the performance of our estimator on
one realization of the data. The data matrix Y is 500 × 100. All its entries are iid N (0, 1). The covariance
is diagonal and has spectral distribution Hp = .5δ1 + .5δ2 . In other words, 50 eigenvalues are equal to 1
and fifty eigenvalues are equal to 2. This is essentially what our estimator (Figure (c)) recovers. Average
computation time (over 1000 repetitions) was 15.71 seconds, according to Matlab tic and toc functions.
Figure 4: case Hp = .5δ1 + .5δ2 : Ratios dL (Ĥp , Hp )/dL (Fp , Hp ) over 1,000 repetitions. Dictionary
consisted of only point masses. Large values indicate better performance of our algorithm. All ratios were
found to be larger than 1.
[Figure 5, three panels: (a) scree plot of the eigenvalues of the sample covariance matrix; (b) empirical and population spectral distributions; (c) estimated (blue) and true (red) population spectral distributions. n = 500, p = 100, Toeplitz covariance with entries .3^{|i−j|}.]
Figure 5: case Σp Toeplitz with entries .3^{|i−j|} : the three figures above show the performance of our
estimator on one realization of the data. The data matrix Y is 500×100. All its entries are iid N (0, 1). The
covariance is Toeplitz, with t(|i − j|) = .3^{|i−j|} . In Figure (c), we superimpose our estimator (blue curve)
and the true distribution of eigenvalues (red curve). Average computation time (over 1000 repetitions) was
16.61 seconds, according to Matlab tic and toc functions.
once again our estimator clearly outperforms the one derived from the sample covariance matrix, by a
large factor. Again, upon further investigation, the estimator generally gets the correct structure of the
distribution of the population eigenvalues: in this case two spikes at 1 and 2.
Figure 6: Case Σp Toeplitz with entries .3^{|i−j|} : Ratios dL (Ĥp , Hp )/dL (Fp , Hp ) over 1,000 repetitions.
Dictionary consisted of only point masses. Large values indicate better performance of our algorithm. All
ratios were found to be larger than 1.
not severely affected and the results are still quite good. To give a more detailed comparison, we present
in Figure 6 a histogram of the ratios dL (Ĥp , Hp )/dL (Fp , Hp ).
5 Consistency
In this section, we prove that the algorithm we propose leads to a consistent (in the sense of weak
convergence of probability measures) estimator of the spectral distribution of the covariance matrices of
interest.
More precisely, we focus on the “L∞ ” version of the algorithm proposed in 3.2.2. In short, the theoretical
results we prove state that as our computational resources grow (both in terms of size of available data and
grid points on which to evaluate functions), the estimator Ĥp converges to H∞ . The meaning of Theorem
2, which follows, is the following. We first choose a family of points {zj } in the upper-half of the complex
plane, with a limit point in the upper-half of the complex plane. We assume that the population spectral
distribution Hp has a limit, in the sense of weak convergence of distributions, when p → ∞. We call this
limit H∞ . This assumption of weak convergence allows us to vary Hp , as p grows, and to not be limited
to Hp = H∞ for the theory; this provides maximal generality. We then solve the “L∞ ” version of our
optimization problem, by including more and more of the zj ’s in the optimization problem as n → ∞.
We assume in Theorem 2 that we can solve this problem by optimizing over all probability measures.
Then Theorem 2 shows that the solution of the optimization problem, Ĥp , converges in distribution to the
limiting population spectral distribution, H∞ . In Corollary 1, we show that the same conclusion holds if
the optimization is now made over probability measures that are mixture of point masses, whose locations
are on a grid whose step size goes to 0 with p and n. Actually, the requirement is that the dictionary of
measures we use contain these diracs. It can of course be larger. Hence, Corollary 1 proves consistency
of the estimators specifically obtained through our algorithm. Besides the assumptions of Theorem 1, we
assume that all the spectra of the population covariances are (uniformly) bounded. That translates into
the mild requirement that the support of all Hp ’s be contained in a same compact set. Note that in the
context of asymptotics at fixed spectral distribution, this is automatically satisfied.
We now turn to a more formal statement of the theorem. The notation B(z0 , r) denotes the closed ball
of center z0 and radius r. Our main theorem is the following.
Theorem 2. Suppose we are under the setup of Theorem 1, Hp ⇒ H∞ and p/n → γ, with 0 < γ < ∞.
Assume that the spectra of the Σp ’s are uniformly bounded. Let J1 , J2 , . . . , be a sequence of integers tending
to ∞. Let z0 ∈ C+ and r ∈ R+ be such that B(z0 , r) ⊂ C+ . Let z1 , z2 , . . . be a sequence of complex numbers
with an accumulation point, all contained in B(z0 , r). Let Ĥp be the solution of
$$\widehat{H}_p = \operatorname*{argmin}_{H}\ \max_{j \le J_n}\ \left|\, \frac{1}{v_{F_p}(z_j)} + z_j - \frac{p}{n}\int \frac{\lambda\, dH(\lambda)}{1 + \lambda v_{F_p}(z_j)} \,\right|. \qquad (1)$$
Then
$$\widehat{H}_p \Rightarrow H_\infty\,, \quad \text{a.s.}$$
Before we turn to proving the theorem, we need a few intermediate results. An important step in the
proof is the following analytic lemma.
So we have
$$\int \frac{\lambda\, d\widehat{H}_p(\lambda)}{1 + \lambda v_\infty(z_j)} \ \to\ \int \frac{\lambda\, dH_\infty(\lambda)}{1 + \lambda v_\infty(z_j)}\,.$$
We remark that for m ∈ C+ , and G a probability measure on R, whose Stieltjes transform is denoted by
SG ,
$$\int \frac{\lambda\, dG(\lambda)}{1 + \lambda m} = \frac{1}{m} - \frac{1}{m}\int \frac{dG(\lambda)}{1 + \lambda m} = \frac{1}{m} - \frac{1}{m^2}\int \frac{dG(\lambda)}{1/m + \lambda} = \frac{1}{m} - \frac{1}{m^2}\, S_G\!\left(-\frac{1}{m}\right).$$
Hence, when the assumptions of the lemma are satisfied, we have
$$S_{\widehat{H}_p}\!\left(-\frac{1}{v_\infty(z_j)}\right) \ \to\ S_{H_\infty}\!\left(-\frac{1}{v_\infty(z_j)}\right).$$
Now since v∞ (zj ) satisfies Equation (3), we see that if v∞ (zj ) = v∞ (zk ), then zj = zk . Hence, {−1/v∞ (zj )}, j ≥ 1,
is an infinite sequence of complex numbers in C+ . Moreover, because v∞ is analytic in C+ , it is continuous,
and so {−1/v∞ (zj )}, j ≥ 1, has an accumulation point. Further, because |v∞ (zj )| < ∞ and Im (v∞ (zj )) > δ,
this accumulation point is in C+ .
So under the assumptions of the lemma, we have shown that there exists an infinite sequence {yj }, j ≥ 1,
of complex numbers in C+ , with an accumulation point in C+ , such that SĤp (yj ) → SH∞ (yj ). By points 3
and 4 above, this implies
$$\widehat{H}_p \Rightarrow H_\infty\,.$$
In the context of spectrum estimation, the intuitive meaning of the previous lemma is that if for a
sequence of complex numbers {zj }, j ≥ 1, with an accumulation point in C+ , we can find a sequence of Ĥp ’s
approximately satisfying the Marčenko-Pastur equation at more and more of the zj ’s when n grows, then
this sequence of measures will converge to H∞ .
We now state and prove a few results that will be needed in the proof of Theorem 2. The first one is a
remark concerning Stieltjes transforms.
$$|S_H(z_1) - S_H(z_2)| \ \le\ \frac{|z_1 - z_2|}{u_{\min}^2}\,.$$
So we have shown that SH is uniformly Lipschitz, with constant 1/u_min^2 , on C+ ∩ {Im (z) > umin }.
Now, it is an elementary and standard fact of analysis that if a sequence of K-Lipschitz functions
converge pointwise to a K-Lipschitz function, then the convergence is uniform on compact sets. This
shows the uniform convergence part of our statement.
In the proof of the Theorem, we will need the result of the following proposition.
Proposition 2. Assume the assumptions underlying Theorem 1 are satisfied. Recall that vFp is the Stieltjes
transform of F̃p , the spectral distribution of XX ∗ /n = Y Σp Y ∗ /n. Assume that the population spectral
distribution Hp has a limit H∞ and that all the spectra are uniformly bounded. Let z ∈ B(z0 , r), with
B(z0 , r) ⊂ C+ . Then, almost surely,
$$\exists N,\ n > N \ \Rightarrow\ \inf_{n > N,\, z \in B(z_0, r)} \operatorname{Im}\left(v_{F_p}(z)\right) = \delta > 0\,.$$
Proof. Since we assume that all spectra are bounded, we can assume that the population eigenvalues are
all uniformly bounded by K. Because the spectral norm is a matrix norm and X = Y Σp^{1/2} , we have
$$\lambda_{\max}(XX^*/n) \ \le\ K\, \lambda_{\max}(Y^*Y/n)\,.$$
Now it is a standard result in random matrix theory that λmax (Y ∗ Y /n) → (1 + √γ)^2 , a.s, so for n large
enough,
$$\lambda_{\max}(Y^*Y/n) \ \le\ 2(1 + \sqrt{\gamma})^2 \quad \text{a.s.}$$
Calling z = u + iv, we have
$$\operatorname{Im}\left(v_{F_p}(z)\right) = \int \frac{v\, d\widetilde{F}_p(\lambda)}{(\lambda - u)^2 + v^2} \ \ge\ \int \frac{v\, d\widetilde{F}_p(\lambda)}{2(\lambda^2 + u^2) + v^2}\,,$$
because v ≥ 0. Now, the remark we made concerning the eigenvalues of X ∗ X/n implies that almost surely,
for n large enough, F̃p puts all its mass within [0, C], for some C. Therefore,
$$\operatorname{Im}\left(v_{F_p}(z)\right) \ \ge\ \frac{v}{2C^2 + 2u^2 + v^2}\,,$$
and hence Im vFp (z) is a.s bounded away from 0, for n large enough.
To show that we can find “good” probability measures when solving our optimization problem, we will
need to exhibit a sequence of measures that approximately satisfy the Marčenko-Pastur equation. The
next proposition is a step in this direction.
Proposition 3. Let r ∈ R+ and z0 ∈ C+ be given and satisfying B(z0 , r) ⊂ C+ . Suppose p/n → γ when
n → ∞, and ∀ε ∃N : n > N ⇒ ∀z ∈ B(z0 , r), |vFp (z) − v∞ (z)| < ε, where v∞ satisfies equation (3).
Suppose further that |Im (v∞ (z)) | > umin on B(z0 , r). Then, if ε < umin /2,
$$\exists N' \in \mathbb{N},\ \forall z \in B(z_0, r),\ \forall n > N',\quad \left|\, \frac{1}{v_{F_p}(z)} + z - \frac{p}{n}\int \frac{\lambda\, dH_\infty(\lambda)}{1 + \lambda v_{F_p}(z)} \,\right| \ <\ 2\epsilon\, \frac{1 + 2\gamma}{u_{\min}^2}\,.$$
Because γ − p/n → 0, and |λ/(1 + λv∞ (z))| ≤ 1/|Im (v∞ (z)) | ≤ 1/umin , we have
$$\left(\gamma - \frac{p}{n}\right)\int \frac{\lambda}{1 + \lambda v_\infty(z)}\, dH_\infty(\lambda) \ \to\ 0 \quad \text{uniformly on } B(z_0, r)\,.$$
Now, of course,
$$\Delta_n^I(z) = \frac{v_\infty(z) - v_{F_p}(z)}{v_{F_p}(z)\, v_\infty(z)} \;-\; \frac{p}{n}\left(v_{F_p}(z) - v_\infty(z)\right)\int \frac{\lambda^2\, dH_\infty(\lambda)}{(1 + \lambda v_{F_p}(z))(1 + \lambda v_\infty(z))}\,.$$
We remark that |vFp (z)| > |Im vFp (z) | > umin − ε > umin /2. Hence, if n is large enough,
$$\left|\Delta_n^I(z)\right| \ \le\ \frac{2}{u_{\min}^2}\,\left|v_\infty(z) - v_{F_p}(z)\right| \;+\; \frac{2}{u_{\min}^2}\,\frac{p}{n}\,\left|v_{F_p}(z) - v_\infty(z)\right| \ \le\ \epsilon\, \frac{2}{u_{\min}^2}\,(1 + 2\gamma)\,.$$
Proof of Theorem 2. According to Propositions 1 and 2, the assumptions put forth in Proposition 3
are a.s satisfied for vFp and v∞ , where v∞ is the Stieltjes transform appearing in Theorem 1. Note also that Theorem 1 states that a.s,
vFp (z) → v∞ (z), and that all these functions are analytic in C+ . In other words, they have the properties
needed for Lemma 1 to apply.
In particular, Proposition 3 implies that if {zj } is a family of complex numbers included in B(z0 , r),
and if Ĥp is the solution of equation (1), equation (2) will be satisfied almost surely, with a family {εj } of
positive real numbers that converge to 0. According to Lemma 1, this implies that
$$\widehat{H}_p \Rightarrow H_\infty\,, \quad \text{almost surely.}$$
Proof. All that is needed is to show that a discretized version of H∞ furnishes a good sequence of measures
in the sense that Proposition 3 holds for this sequence of discretized version of H∞ .
We call HMn a discretization of H∞ on a regular discrete grid of size 1/Mn . For instance, we can
choose HMn (x) to be a step function, with HMn (x) = H∞ (x) if x = l/Mn , l ∈ N, and HMn constant on
[l/Mn , (l + 1)/Mn ). Recall also that H∞ is compactly supported.
In light of the proof of Proposition 3, for the corollary to hold, it is sufficient to show that uniformly
in z ∈ B(z0 , r),
$$\left|\, \int \frac{\lambda}{1 + \lambda v_{F_p}(z)}\, dH_{M_n}(\lambda) \;-\; \int \frac{\lambda}{1 + \lambda v_{F_p}(z)}\, dH_\infty(\lambda) \,\right| \ \to\ 0\,.$$
Now calling dW (HMn , H∞ ) the Wasserstein distance between HMn and H∞ , we have
$$d_W(H_{M_n}, H_\infty) = \int_0^{\infty} \left| H_{M_n}(x) - H_\infty(x) \right| dx \ \to\ 0 \quad \text{as } n \to \infty\,.$$
(HMn and H∞ put mass only on R+ , so the previous integral is restricted to R+ . We refer the reader to
the survey (Gibbs and Su, 2001) for properties of different metrics on probability measures.)
In other respects, it is easy to see that under the assumptions of Proposition 3, there exists N such
that $\sup_{n > N,\, z \in B(z_0, r)} |v_{F_p}(z)| \le K$, for some K < ∞. Recall also that under the same assumptions,
$\inf_{n > N,\, z \in B(z_0, r)} \operatorname{Im}\, v_{F_p}(z) \ge \delta$, for some δ > 0.
For two probability measures G and H, we also have
$$d_W(G, H) = \sup_{f}\left\{\ \left|\int f\, dG - \int f\, dH\right| \ :\ f \text{ a 1-Lipschitz function}\right\}.$$
Hence, because H∞ and HMn are supported on a compact set that is independent of n, to have the result
we want, it will be enough to show that
$$f_{v_{F_p}(z)}(\lambda) = \frac{\lambda}{1 + \lambda v_{F_p}(z)}$$
is Lipschitz in λ, with a Lipschitz constant that can be chosen independently of n and z.
The proof of the corollary makes clear that when solving the optimization problem over any dictionary
of probability measures containing point masses (but also possibly other measures) at grid points on a grid
whose step size goes to 0, the algorithm will lead to a consistent estimator.
Finally, as explained in the Appendix, the algorithm we implemented starts with vFp (zj ) sequences, as
opposed to simply zj sequences. It can be straightforwardly adapted to handle the zj ’s as a starting point,
too, but we got slightly better numerical results when starting with vFp (zj ). The proof we just gave could
be adapted to handle the situation where the vFp (zj )’s are used as starting point. However, a few other
technical issues would have to be addressed that we felt would make the important ideas of the proof less
clear. Hence we decided to show consistency in the setting of Corollary 1.
6 Conclusion
In this paper we have presented an original method to estimate the spectrum of large dimensional
covariance matrices. We place ourselves in a “large n, large p” asymptotic framework, where both the
number of observations and the number of variables are going to infinity, while their ratio goes to a finite,
non-zero limit. Approaching problems in this framework is increasingly relevant as datasets of larger and
larger size become more common.
Instead of estimating individually each eigenvalue, we propose to associate to each vector of eigenvalues
a probability distribution and estimate this distribution. We then estimate the population eigenvalues as
the appropriate quantiles of the estimated distribution. We use a fundamental result of random matrix
theory, the Marčenko-Pastur equation, to formulate our estimation problem. We propose a practical
method to solve this estimation problem, using tools from convex optimization.
The estimator has good practical properties: it is fast to compute on modern computers (we use the
software (MOSEK, 2006) to solve our optimization problem) and scales well with the number of parameters
to estimate. We show that our estimator of the distribution of interest is consistent, where the appropriate
notion of convergence is weak convergence of distributions.
The estimator performs a non-linear shrinkage of the sample eigenvalues. It is basis independent and
we hope it will help in improving the estimation of eigenvectors of large dimensional covariance matrices. To
the best of our knowledge, our method is the first that harnesses deep results of random matrix theory to
practically solve estimation problems. We have seen in simulations that the improvements it leads to are
often dramatic. In particular, it enables us to find structure in the data when it exists and to conclude that
it is absent when there is none, even when classical methods would point to different conclusions.
APPENDIX
A.1 Implementation details
We plan to release the software we used to create the figures appearing in the simulation and data
analysis section in the near future. However, we want to mention here the choices of parameters we made
to implement our algorithm. The justification for these choices is based on intuition coming from studying the
equation (M-P).
Scaling of the eigenvalues If all the entries of the data matrix are multiplied by a constant a, then the
eigenvalues of Σp are multiplied by a2 , and so are the eigenvalues of Sp . Hence, if the eigenvalues of Sp are
divided by a factor a, Equation (M-P) remains valid if we change H∞ (x) into H∞ (ax). In practice, we scale
the empirical eigenvalues by l1 , the largest eigenvalue of Sp . We solve our convex optimization problem
with the scaled eigenvalues to obtain H∞ (l1 x), from which we get H∞ (x) through easy manipulations.
The subsequent details describe how we solve our convex optimization problem, after rescaling of the
eigenvalues.
Choice of (zj , v(zj )) We have found that using 100 pairs (zj , v(zj )) was generally sufficient to obtain good
and quick (10s-60s) results in simulations. More points are of course better. With 200 points, solving the
problem took more time, but was still doable (40s-3mins). In the simulations and data analysis presented
afterwards, we first chose the v(zj ) and numerically found the corresponding zj using Matlab’s optimization
toolbox. We took v(zj ) to have a real part equally spaced (every .02) on [0, 1], and imaginary part of 10^{−2}
or 10^{−3} . In other words, our v(zj )’s consisted of two (discretized) segments in C+ , the second one being
obtained from the first one by a vertical translation of 9 × 10^{−3} .
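As an illustration of this step, the sketch below (ours; the paper uses Matlab's optimization toolbox) recovers a zj from a chosen target value of vFp (zj ) by two-dimensional real root finding. The grid of target values follows the description above; the starting point is an assumption of the sketch, and sol.success should be checked in practice.

```python
import numpy as np
from scipy.optimize import root

def v_Fp(z, sample_eigs, n):
    """Stieltjes transform of the spectral distribution of XX*/n
    (the p eigenvalues of X*X/n plus n - p zero eigenvalues)."""
    p = len(sample_eigs)
    return (np.sum(1.0 / (sample_eigs - z)) - (n - p) / z) / n

def find_z(target_v, sample_eigs, n, z_start=1.0 + 1.0j):
    """Find z in C+ with v_Fp(z) = target_v, as a 2-d real root-finding problem."""
    def residual(xy):
        z = xy[0] + 1j * xy[1]
        d = v_Fp(z, sample_eigs, n) - target_v
        return [d.real, d.imag]
    sol = root(residual, x0=[z_start.real, z_start.imag])
    return sol.x[0] + 1j * sol.x[1], sol.success

# Target v-values: real parts spaced every .02 on [0, 1], imaginary parts 1e-2 and 1e-3.
re_parts = np.arange(0.0, 1.0 + 1e-9, 0.02)
v_targets = np.concatenate([re_parts + 1e-2j, re_parts + 1e-3j])
```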
Choice of interval to focus on The largest (resp. smallest) eigenvalue of a p × p symmetric matrix S
is a convex (resp. concave) function of the entries of the matrix. This is because $l_1(S) = \sup_{\|u\|_2 = 1} u' S u$,
where u is a vector in Rp . Hence l1 (S) is the supremum of linear functionals of the entries of the matrix.
Similarly, $l_p(S) = \inf_{\|u\|_2 = 1} u' S u$, so lp (S) is a concave function of the entries of S. Note that the sample
covariance matrix Sp is an unbiased estimator of Σp . By Jensen’s inequality, we therefore have E(l1 (Sp )) ≥
l1 (E(Sp )) = λ1 (Σp ). In other words, l1 (Sp ) is a biased estimator of λ1 (Σp ), and tends to overestimate it.
Similarly, lp (Sp ) is a biased estimator of λp (Σp ) and tends to underestimate it. More detailed studies of
l1 and lp indicate that they do not fluctuate too much around their mean. Practically, as n → ∞, we
will have with large probability, lp ≤ λp and l1 ≥ λ1 . (In certain cases, concentration bounds can make
the previous statement rigorous.) Hence, after rescaling of the eigenvalues, it will be enough to focus on
probability measures supported on the interval [lp /l1 , 1] when decomposing H∞ (l1 x).
References
Akhiezer, N. I. (1965). The classical moment problem and some related questions in analysis. Translated
by N. Kemmer. Hafner Publishing Co., New York.
Anderson, T. W. (1963). Asymptotic theory for principal component analysis. Ann. Math. Statist. 34,
122–148.
Baik, J., Ben Arous, G., and Péché, S. (2005). Phase transition of the largest eigenvalue for non-null
complex sample covariance matrices. Ann. Probab. 33, 1643–1697.
Baik, J. and Silverstein, J. (2004). Eigenvalues of large sample covariance matrices of spiked population
models. arXiv:math.ST/0408165 .
Bickel, P. J. and Levina, E. (2004). Some theory of Fisher’s linear discriminant function, ‘naive Bayes’,
and some alternatives when there are many more variables than observations. Bernoulli 10, 989–1010.
Bickel, P. J. and Levina, E. (2006). Regularized estimation of large covariance matrices. Forthcoming
Technical Report .
Böttcher, A. and Silbermann, B. (1999). Introduction to large truncated Toeplitz matrices. Universitext.
Springer-Verlag, New York.
Boyd, S. and Vandenberghe, L. (2004). Convex optimization. Cambridge University Press, Cambridge.
Burda, Z., Görlich, A., Jarosz, A., and Jurkiewicz, J. (2004). Signal and noise in correlation matrix.
Physica A 343, 295–310.
Burda, Z., Jurkiewicz, J., and Waclaw, B. (2005). Spectral moments of correlated Wishart matrices.
Phys. Rev. E 71.
Campbell, J., Lo, A., and MacKinlay, C. (1996). The Econometrics of Financial Markets. Princeton
University Press, Princeton, NJ.
Chen, S. S., Donoho, D. L., and Saunders, M. A. (1998). Atomic decomposition by basis pursuit. SIAM
J. Sci. Comput. 20, 33–61 (electronic).
Durrett, R. (1996). Probability: theory and examples. Duxbury Press, Belmont, CA, second edition.
El Karoui, N. (To Appear). Tracy-Widom limit for the largest eigenvalue of a large class of complex
sample covariance matrices. The Annals of Probability. See also arxiv.PR/0503109.
Geman, S. (1980). A limit theorem for the norm of random matrices. Ann. Probab. 8, 252–261.
Geronimo, J. S. and Hill, T. P. (2003). Necessary and sufficient condition that the limit of Stieltjes
transforms is a Stieltjes transform. J. Approx. Theory 121, 54–60.
Gibbs, A. L. and Su, F. (2001). On choosing and bounding probability metrics. International Statistical
Review 70, 419–435.
Grenander, U. and Szegö, G. (1958). Toeplitz forms and their applications. California Monographs in
Mathematical Sciences. University of California Press, Berkeley.
Hastie, T., Tibshirani, R., and Friedman, J. (2001). The Elements of Statistical Learning. Springer
Series in Statistics. Springer-Verlag, New York. Data mining, inference, and prediction.
Hiai, F. and Petz, D. (2000). The semicircle law, free random variables and entropy, volume 77 of
Mathematical Surveys and Monographs. American Mathematical Society, Providence, RI.
Johnstone, I. (2001). On the distribution of the largest eigenvalue in principal component analysis. Ann.
Statist. 29, 295–327.
Jonsson, D. (1982). Some limit theorems for the eigenvalues of a sample covariance matrix. J. Multivariate
Anal. 12, 1–38.
Laloux, L., Cizeau, P., Bouchaud, J.-P., and Potters, M. (1999). Noise dressing of financial correlation
matrices. Phys. Rev. Lett. 83, 1467–1470.
Lax, P. D. (2002). Functional analysis. Pure and Applied Mathematics (New York). Wiley-Interscience
[John Wiley & Sons], New York.
Ledoit, O. and Wolf, M. (2004). A well-conditioned estimator for large-dimensional covariance matrices.
J. Multivariate Anal. 88, 365–411.
Mardia, K. V., Kent, J. T., and Bibby, J. M. (1979). Multivariate analysis. Academic Press [Harcourt
Brace Jovanovich Publishers], London. Probability and Mathematical Statistics: A Series of Monographs
and Textbooks.
Paul, D. (To Appear). Asymptotics of sample eigenstructure for a large dimensional spiked covariance
model. Statistica Sinica .
Wachter, K. W. (1978). The strong limits of random matrix spectra for sample matrices of independent
elements. Ann. Probability 6, 1–18.
Yin, Y. Q., Bai, Z. D., and Krishnaiah, P. R. (1988). On the limit of the largest eigenvalue of the
large-dimensional sample covariance matrix. Probab. Theory Related Fields 78, 509–521.