
Spectrum estimation for large dimensional covariance matrices using random matrix theory

Noureddine El Karoui ∗
Department of Statistics, University of California, Berkeley

arXiv:math/0609418v1 [math.ST] 14 Sep 2006

October 29, 2018

Abstract
Estimating the eigenvalues of a population covariance matrix from a sample covariance matrix is a
problem of fundamental importance in multivariate statistics; the eigenvalues of covariance matrices play
a key role in many widely used techniques, in particular in Principal Component Analysis (PCA). In many
modern data analysis problems, statisticians are faced with large datasets where the sample size, n, is
of the same order of magnitude as the number of variables p. Random matrix theory predicts that in
this context, the eigenvalues of the sample covariance matrix are not good estimators of the eigenvalues
of the population covariance.
We propose to use a fundamental result in random matrix theory, the Marčenko-Pastur equation, to
better estimate the eigenvalues of large dimensional covariance matrices. The Marčenko-Pastur equation
holds in very wide generality and under weak assumptions. The estimator we obtain can be thought of
as “shrinking” the eigenvalues of the sample covariance matrix in a nonlinear fashion to estimate the
population eigenvalues. Inspired by ideas of random matrix theory, we also suggest a change of point of
view when thinking about estimation of high-dimensional vectors: we do not try to estimate directly
the vectors but rather a probability measure that describes them. We think this is a theoretically more
fruitful way to think about these problems.
Our estimator gives fast and good or very good results in extended simulations. Our algorithmic
approach is based on convex optimization. We also show that the proposed estimator is consistent.

1 Introduction
With data acquisition and storage now easy, today’s statisticians often encounter datasets for which
the sample size, n, and the number of variables, p, are both large: on the order of hundreds, thousands,
millions, or even billions in situations such as web search problems.
The analysis of these datasets using classical methods of multivariate statistical analysis requires some
care. While the ideas are still relevant, the intuition for the estimators that are used and the interpretation
of the results are often - implicitly - justified by assuming an asymptotic framework of p fixed and n
growing infinitely large. This assumption was consistent with the practice of statistics when these ideas
were developed, since investigation of datasets with a large number of variables was very difficult. A better
theoretical framework for modern - i.e large p - datasets, however, is the assumption of the so-called “large
n, large p” asymptotics. In other words, one should consider that both n and p go to infinity, perhaps
with the restriction that their ratio goes to a finite limit γ, and draw practical insights from the theoretical
results obtained in this setting.
Acknowledgements: The author is grateful to Alexandre d’Aspremont, Peter Bickel, Laurent El Ghaoui, Elizabeth
Purdom, John Rice, Saharon Rosset and Bin Yu for stimulating discussions and comments at various stages of this project.
Support from NSF grant DMS-0605169 is gratefully acknowledged. AMS 2000 SC: Primary 62H12, Secondary 62-09.
Key words and Phrases: covariance matrices, principal component analysis, eigenvalues of covariance matrices,
high-dimensional inference, random matrix theory, Stieltjes transforms, Marčenko-Pastur equation, convex optimization.
Contact: [email protected]

We will turn our attention to an object of central interest in multivariate statistics: the eigenvalues
of covariance matrices. A key application is Principal Components Analysis (PCA), where one searches
for a good low dimensional approximation to the data by projecting the data on the “best” possible k
dimensional subspace: here “best” means that the projected data explain as much variance in the original
data as possible. This amount of variance explained is measured by the eigenvalues of the population
covariance matrix, Σp , and hence we need to find a way to estimate those eigenvalues. We will discuss in
the course of the paper other problems where the eigenvalues of Σp play a key role.
We take a moment here to give a few examples that illustrate the differences that occur under the
different asymptotic settings. To pose the problem more formally, let us say that we observe iid random
vectors X1 , . . . , Xn in Rp , and that the covariance of Xi is Σp . We call X the data matrix whose rows are
the Xi ’s. In the classical context, where p is fixed and n goes to ∞, a fundamental result of (Anderson,
1963) says that the eigenvalues of the sample covariance matrix Sp = (X − X̄)′ (X − X̄)/(n − 1) are good
estimators of the population eigenvalues (i.e the eigenvalues of Σp ). More precisely, calling li the ordered
eigenvalues of Sp (l1 ≥ l2 . . .) and λi the ordered eigenvalues of Σp (λ1 ≥ λ2 . . .), it was shown in (Anderson,
1963) that

\sqrt{n}\,(l_i - \lambda_i) \Rightarrow N(0, 2\lambda_i^2),
when the Xi are normally distributed and all the λi ’s are distinct. This result provided rigorous grounds
for estimating the eigenvalues of the population covariance matrix, Σp , with the eigenvalues of the sample
covariance matrix, Sp , when p is small compared to n. (For more details on Anderson’s theorem, we refer
the reader to (Anderson, 2003) Theorem 13.5.1.)
Shifting assumptions to “large n, large p” asymptotics induces fundamental differences in the behavior
of multivariate statistics, some of which we will highlight in the course of the paper. As a first example, let
us consider the case where Σp = Idp , so all the population eigenvalues are equal to 1. A result first shown
in (Geman, 1980) under some moment growth assumptions, and later refined in (Yin et al., 1988), states
that if the entries of the Xi ’s are i.i.d and have a fourth moment, and if p/n → γ, then

l_1 \to (1 + \sqrt{\gamma})^2 \quad \text{a.s.}

In particular, l1 is not a consistent estimator of λ1 . Note that by picking n = p, l1 tends to 4 whereas
λ1 = 1. (For more general Σp , see (El Karoui, To Appear) Section 4.3 for numerically explicit results about
the limit of l1 .)
As the case of Σp = Idp illustrated, when n and p are both large, the largest sample eigenvalue is
biased, sometimes dramatically so. Hence, we should correct this bias in the largest sample eigenvalue(s)
if we want to use them in data analysis. Theoretical results predict that the behavior of extreme sample
eigenvalues can be quite subtle; in particular, depending on how far an isolated population eigenvalue is
from the bulk of the population spectrum, the corresponding sample eigenvalue can either be isolated, and
far away from the bulk of the sample eigenvalues, or be absorbed by the bulk of the sample eigenvalues (see
(Baik et al., 2005), (El Karoui, To Appear), (Baik and Silverstein, 2004), (Paul, To Appear)). One thing
is however clear from the most recent theoretical results: if we wish to de-bias extreme sample eigenvalues,
we need an accurate estimate of the so-called population spectral distribution, a probability measure that
characterizes the population eigenvalues (see (El Karoui, To Appear)). This is what our algorithm will
deliver.
We have so far mostly discussed extreme sample eigenvalues. However, much is also known about the
behavior of the whole vector of sample eigenvalues (l1 , l2 , . . . , lp ) and its asymptotic behavior. In particular,
theory predicts that in the “large n, large p” case, the scree plot (i.e the plot of the sample eigenvalues
vs. their rank; see (Mardia et al., 1979)) becomes uninformative and deceptive. What we propose in this
paper is to use random matrix theory to develop practically useful tools to remedy the flaws appearing in
some widely used tools in multivariate statistics.
Before we discuss how we will go about it, let us briefly discuss some issues that arise when estimating
vectors of large dimension, since working in an asymptotic setting where p → ∞ is not without additional
difficulties. Since we will try to estimate vectors of larger and larger size, an appropriate
notion of convergence is needed if we want to quantify the quality of our estimators. Standard norms in
high dimensions are not necessarily a very good choice: for instance, if we are in R100 , and make an error of

size 1/100 in all coordinates, the resulting l1 error is 1, even though, at least intuitively, it would seem like
we are doing well. Also, if we made a large error (say size 1) in one direction, the l2 norm would be large
(larger than 1 at least), even though we may have gotten the structural information about this vector (and
almost all its coordinates) “right”. Inspired by ideas of random matrix theory, we propose to associate
to high-dimensional vectors probability measures that describe them. We will explain this in more detail
in Section 2.1. After this change of point of view, our focus becomes trying to estimate these measures.
Why choose to estimate measures? The reasons are many. Chief among them is that this approach will
allow us to look into the structure of the population eigenvalues. For instance, we would like to be able
to say whether all population eigenvalues are equal, or whether they are clustered around say two values,
or if they are uniformly spread out on an interval. Because the ratio p/n can make the scree plot appear
smooth (and hence in some sense uninformative) regardless of the true population eigenvalue structure,
this structural information is not well estimated by currently existing methods. We discuss other practical
benefits (like scalability with p) of the measure estimation approach in 3.3.7. In the context of PCA, where
usually the concern is not to estimate each population eigenvalue with very high precision, but rather to
have an idea of the structure of the population spectrum to guide the choice of lower-dimensional subspaces
on which to project the data, this measure approach is particularly appealing. Examples to come later in
the paper will illustrate this point.
Random matrix theory plays a key role in our approach to this measure estimation problem. A main
ingredient of our method is a fundamental result, which we call the Marčenko-Pastur equation (see Theorem
1), which relates the asymptotic behavior of the sample eigenvalues to the population eigenvalues. The
assumptions under which the theorem holds are very weak (a fourth moment condition) and hence it is very
widely applicable. Until now, this theorem has not been used to do inference on population eigenvalues.
Partly this is because in its general form it has not received much attention in statistics, and partly because
the inverse problem that needs to be considered is very hard to solve if it is not posed the right way. We
propose an original way to approach inverting the Marčenko-Pastur equation. In particular, we will be
able to estimate, given the eigenvalues of the sample covariance matrix Sp , the probability measure, Hp ,
that describes the population eigenvalues. We use the standard names empirical spectral distribution for
Fp (the measure associated with the sample eigenvalues; see Section 2.1) and population spectral distribution
for Hp . It is important to state clearly what asymptotic framework
we place ourselves in. We will consider that when p and n go to infinity, Hp stays fixed. In particular, it
has a limit, denoted H∞ . We call this framework “asymptotics at fixed spectral distribution”. Of course,
fixing Hp does not imply that we fix p. For instance, sometimes we will have Hp = δ1 , for all p. Since the
parameter of interest in our problems is really the measure Hp , the fixed spectral distribution asymptotics
corresponds to classical assumptions for parameter estimation in statistics, where the parameter does not
change with the number of variables observed. We refer the reader to 3.3.6 for a more detailed discussion.
To solve the inverse problem posed by the Marčenko-Pastur equation, we propose to discretize the
Marčenko-Pastur equation and then use convex optimization methods to solve the discretized version of
the problem. In doing so, we obtain a fast and provably accurate algorithm to estimate the population
parameter of interest, Hp , from the sample eigenvalues. The approach is non-parametric since no
assumptions are made a priori on the structure of the population eigenvalues. One outcome of the algorithm is an
efficient graphical method to look at the structure of the population eigenvalues. Another outcome is that
since we have an estimate of the measure that describes the population eigenvalues, standard statistical
ideas then allow us to get estimates of the individual population eigenvalues λi . Some subtle problems
may arise when doing so and we address them in 3.3.6. The final result of the algorithm can be thought
of as performing non-linear shrinkage of the sample eigenvalues to estimate the population eigenvalues.
We want to highlight two contributions of our paper. First, we propose to estimate measures associated
with high-dimensional vectors rather than estimating the vectors. This gives rise to natural notions of
consistency and accuracy of our estimates which are reasonable theoretical requirements for any estimator
to achieve. And second, we make use, for the first time, of a fundamental result of random matrix theory
to solve an important practical problem in multivariate statistics.
The rest of the paper is divided into four parts. In Section 2, we give some background on results in
Random Matrix Theory that will be needed. We do not assume that the reader has any familiarity with
the topic. In Section 3, we present our algorithm to estimate Hp , the population spectral distribution,
and also the population eigenvalues. In Section 4, we present the results of some simulations. We give in

Section 5 a proof of consistency of our algorithm. The Appendix contains some details on implementation
of the algorithm.
A note on notation is needed before we start: in the rest of the paper, p will always be a function of n,
with the property that p(n)/n → γ and γ ∈ (0, ∞). To avoid cumbersome notations, we will usually write
p and not p(n).

2 Background: Random matrix theory of sample covariance matrices


There is a large body of work concerned with the limiting behavior of the eigenvalues of a sample
covariance matrix when p and n both go to ∞; it constitutes an important subset of what is commonly
known as Random Matrix Theory, to which we now turn. This is a wide area of research, of which we
will only give a very quick and self-contained overview. Our eventual aim in this section is to introduce a
fundamental result, the Marčenko-Pastur equation, that relates the asymptotic behavior of the eigenvalues
of the sample covariance matrix to that of the population covariance in the “large n, large p” asymptotic
setting. The formulation of the result requires that we introduce some concepts and notations.

2.1 Changing point of views: from vectors to measures


One of the first problems to tackle is to find a mathematically efficient way to express the limit of a
vector whose size grows to ∞. (Recall that there are p eigenvalues to estimate in our problem and p goes
to ∞.) A fairly natural way to do so is to associate to any vector a probability measure. More explicitly,
suppose we have a vector (y1 , . . . , yp ) in Rp . We can associate to it the following measure:
dG_p(x) = \frac{1}{p} \sum_{i=1}^{p} \delta_{y_i}(x).

Gp is thus a measure with p point masses of equal weight, one at each of the coordinates of the vector.
In the rest of the paper, we will denote by Hp the spectral distribution of the population covariance
matrix Σp , i.e the measure associated with the vector of eigenvalues of Σp . We will refer to Hp as the
population spectral distribution. We can write this measure as
dH_p(x) = \frac{1}{p} \sum_{i=1}^{p} \delta_{\lambda_i}(x),

where δλi is a point mass, of mass 1, at λi . We also call δλi a “dirac” at λi . The simplest example of
population spectral distribution is found when Σp = Idp . In this case, for all i, λi = 1, and dHp = δ1 . So
the population spectral distribution is a point mass at 1 when Σp = Idp .
Similarly, we will denote by Fp the measure associated with the eigenvalues of the sample covariance
matrix Sp . We refer to Fp as the empirical spectral distribution. Equivalently, we define
dF_p(x) = \frac{1}{p} \sum_{i=1}^{p} \delta_{l_i}(x).

The change of focus from vector to measure implies a change of focus in the notion of convergence we
will consider adequate. In particular, for consistency issues, the notion of convergence we will use is weak
convergence of probability measures. While this is the natural way to pose the problem mathematically,
we may ask if it will allow us to gather the statistical information we are looking for. An example of the
difficulties that arise is the following. Suppose dHp = (1 − 1/p) δ1 + 1/p δ2 . In other words, the population
covariance has one eigenvalue that is equal to 2 and (p − 1) that are equal to 1. Clearly, when p → ∞,
Hp weakly converges to H∞ , with dH∞ = δ1 . So all information about the large and isolated eigenvalue
2, which is present in Hp for all p and is naturally of great interest in PCA, seems lost in the limit. This
is not the case when one does asymptotics at fixed spectral distribution and considers that we are following
a sequence of models going to infinity with Hp = Hp0 = H∞ , where p0 is the p given
by the data set. Fixed distribution asymptotics is more akin to what is done in classical statistics and
we place ourselves in this framework. We refer the reader to 3.3.6 for a more detailed justification of our
point.
In other respects, associating a measure to a vector in the way we described is meaningful mostly when
one wants to have information about the whole set of values taken by the coordinates of the vector, and not
about each coordinate. In particular, when going from vector to measure as described above we are losing
all coordinate information: permuting the coordinates would drastically change the vector but yield the
same measure. However, in the case of vectors of eigenvalues, since there is a canonical way to represent
the vector (the i-th largest eigenvalue occupying the i-th coordinate), the information contained in the
measure is sufficient. This measure approach is especially good when we are not focused on getting all the
fine details of the vectors right, but rather when we are looking for structural information concerning the
values taken by the coordinates.
An important area of random matrix theory for sample covariance matrices is concerned with
understanding the properties of Fp as p (and n) go to ∞. A key theorem, which we review later (see Theorem
1), states that for a wide class of sample covariance matrices, F∞ , the limit of Fp , is asymptotically non-
random. Furthermore, the theorem connects F∞ to H∞ , the limit of Hp : given H∞ , we can theoretically
compute F∞ , by solving a complicated equation. In data analysis, we observe the empirical spectral distri-
bution, Fp . Our goal, of course, as far as eigenvalues are concerned, is to estimate the population spectral
distribution, Hp . Our method will “invert” the relation between F∞ and H∞ , so that we can go from Fp
to Ĥp , an estimate of Hp . The method does not work directly with Fp but with a tool that is similar in
flavor to the characteristic function of a distribution: the Stieltjes transform of a measure. We introduce
this tool in the next subsection. As we will see later, it will also play a key role in our algorithm.

2.2 The Stieltjes transform of measures


A large number of results concerning the asymptotic properties of the eigenvalues of large dimensional
random matrices are formulated in terms of limiting behavior of the Stieltjes transform of their empirical
spectral distributions. The Stieltjes transform is a convenient and very powerful tool in the study of the
convergence of spectral distribution of matrices (or operators), just as the characteristic function of a
probability distribution is a powerful tool for central limit theorems. Most importantly, there is a simple
connection between the Stieltjes transform of the spectral distribution of a matrix and its eigenvalues.
By definition, the Stieltjes transform of a measure G on R is defined as
m_G(z) = \int \frac{dG(x)}{x - z}, \quad \text{for } z \in \mathbb{C}^+,

where C+ = C ∩ {z : Im (z) > 0} is the set of complex numbers with strictly positive imaginary part.
The Stieltjes transform appears to be known under several names in different areas of mathematics. It is
sometimes referred to as Cauchy or Abel-Stieltjes transform. Good references about Stieltjes transforms
include (Akhiezer, 1965, Sections 3.1-2), (Lax, 2002, Chapter 32), (Hiai and Petz, 2000, Chapter 3) and
(Geronimo and Hill, 2003).
For the purpose of this paper, where we will consider only compactly supported measures, the following
results will be needed:

Fact. Important properties of Stieltjes transforms of measures on R:

1. If G is a probability measure, mG (z) ∈ C+ if z ∈ C+ and limy→∞ −iymG (iy) = 1.

2. If F and G are two measures, and if mF (z) = mG (z), for all z ∈ C+ , then G = F , a.e.

3. (Geronimo and Hill, 2003, Theorem 1): If Gn is a sequence of probability measures and mGn (z)
has a (pointwise) limit m(z) for all z ∈ C+ , then there exists a probability measure G with Stieltjes
transform mG = m if and only if limy→∞ −iym(iy) = 1. If it is the case, Gn converges weakly to G.

4. (Geronimo and Hill, 2003, Theorem 2): The same is true if the convergence happens only for an
infinite sequence {z_i}_{i=1}^{\infty} in C+ with a limit point in C+ .

5. If t is a continuity point of the cdf of G, dG(t)/dt = \lim_{\epsilon \to 0} \frac{1}{\pi} \mathrm{Im}\left( m_G(t + i\epsilon) \right).
For proofs, we refer the reader to (Geronimo and Hill, 2003).
Note that the Stieltjes transform of the spectral distribution Γp of a p × p matrix Ap is just
m_{\Gamma_p}(z) = \frac{1}{p}\, \mathrm{trace}\!\left( (A_p - z\, \mathrm{Id}_p)^{-1} \right).
Finally, it is clear that points 3 and 4 above can be used to show convergence of probability measures if
one can control the corresponding Stieltjes transforms.
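To make this connection concrete, here is a small numerical illustration of ours (not part of the original
exposition): the Stieltjes transform of the spectral distribution of a p × p symmetric matrix can be evaluated
either from its eigenvalues or from the resolvent, and the two agree. The sketch assumes numpy; names are
illustrative.

    import numpy as np

    def stieltjes_from_eigenvalues(eigs, z):
        # m_Gamma(z) = (1/p) * sum_i 1 / (eig_i - z), for z in C+
        return np.mean(1.0 / (np.asarray(eigs, dtype=float) - z))

    p = 5
    A = np.cov(np.random.randn(2 * p, p), rowvar=False)      # some p x p symmetric matrix
    z = 1.0 + 0.5j                                           # a point in C+
    m1 = stieltjes_from_eigenvalues(np.linalg.eigvalsh(A), z)
    m2 = np.trace(np.linalg.inv(A - z * np.eye(p))) / p      # direct resolvent formula
    assert np.allclose(m1, m2)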

2.3 A fundamental result: the Marčenko-Pastur equation


In the study of covariance matrices, a remarkable result exists that describes the limiting behavior
of the empirical spectral distribution, F∞ , in terms of the limiting behavior of the population spectral
distribution, H∞ . The connection between these two measures is made through an equation that links
the Stieltjes transform of the empirical spectral distribution to an integral against the population spectral
distribution. We call this equation the Marčenko-Pastur equation because it first appeared in the landmark
paper of (Marčenko and Pastur, 1967). The result was independently re-discovered in (Wachter, 1978) and
then refined in (Silverstein and Bai, 1995) and (Silverstein, 1995). In particular, (Silverstein, 1995) is the
only paper where the case of a non-diagonal population covariance is tackled.
In what follows, we will be working with an n × p data matrix X. We call Sp = X ∗ X/n and denote by
mFp the Stieltjes transform of the spectral distribution, Fp , of Sp . We will call vFp the function defined by

v_{F_p}(z) = -\frac{1 - p/n}{z} + \frac{p}{n}\, m_{F_p}(z);

vFp is the Stieltjes transform of the spectral distribution of XX ∗ /n.
Currently, the most general version of the result is found in (Silverstein, 1995) and states the following:
Theorem 1. Suppose the data matrix X can be written X = Y Σp^{1/2} , where Σp is a p × p positive definite
matrix and Y is an n × p matrix whose entries are i.i.d (real or complex), with E(Yi,j ) = 0, E(|Yi,j |2 ) = 1
and E(|Yi,j |4 ) < ∞.
Call Hp the population spectral distribution, i.e the distribution that puts mass 1/p at each of the
eigenvalues of the population covariance matrix, Σp . Assume that Hp converges weakly to a limit denoted
H∞ . (We write this convergence Hp ⇒ H∞ .) Then, when p, n → ∞, and p/n → γ, γ ∈ (0, ∞),
1. vFp (z) → v∞ (z), a.s, where v∞ (z) is a deterministic function
2. v∞ (z) satisfies the equation
-\frac{1}{v_\infty(z)} = z - \gamma \int \frac{\lambda\, dH_\infty(\lambda)}{1 + \lambda v_\infty(z)}, \quad \forall z \in \mathbb{C}^+ \qquad \text{(M-P)}

3. The previous equation has one and only one solution which is the Stieltjes transform of a measure.
In plain English, under the assumptions put forth in Theorem 1, the spectral distribution of the
sample covariance matrix is asymptotically non-random. Furthermore, it is fully characterized by the true
population spectral distribution, through the equation (M-P).
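As an aside (not part of the paper's estimation procedure), the forward map from H∞ to v∞ can be explored
numerically: for a discrete H∞ = Σ_k w_k δ_{t_k}, one simple approach is to iterate the fixed-point form of
(M-P). The sketch below is illustrative only; plain fixed-point iteration typically behaves well for z well
inside C+, but a damped or more careful solver may be needed in general.

    import numpy as np

    def mp_forward(z, gamma, t, w, n_iter=2000):
        # Iterate v <- -1 / (z - gamma * sum_k w_k * t_k / (1 + t_k * v))
        v = -1.0 / z                      # starting point in C+ when z is in C+
        for _ in range(n_iter):
            integral = np.sum(w * t / (1.0 + t * v))
            v = -1.0 / (z - gamma * integral)
        return v

    # Example: H_infty = delta_1 (identity covariance), gamma = p/n = 0.2
    v = mp_forward(2.0 + 0.1j, 0.2, t=np.array([1.0]), w=np.array([1.0]))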
A particular case of equation (M-P) is often of interest: the situation when all the population eigenvalues
are equal to 1. Then of course, Hp = H∞ = δ1 . A little bit of elementary work leads to the well-known
fact in random matrix theory that the empirical spectral distribution, Fp , converges (a.s) to the Marčenko-
Pastur law, whose density is given by, if γ ≤ 1,
f_\gamma(x) = \frac{\sqrt{(b - x)(x - a)}}{2\pi x \gamma}, \quad \text{with } a = (1 - \sqrt{\gamma})^2,\ b = (1 + \sqrt{\gamma})^2.
We refer the reader to (Marčenko and Pastur, 1967), (Bai, 1999) and (Johnstone, 2001) for more details
and explanations concerning the case γ > 1. One point of statistical interest is that even though the
true population eigenvalues are all equal to 1, the empirical ones are now spread on the interval
[(1 − \sqrt{\gamma})^2 , (1 + \sqrt{\gamma})^2 ]. Plotting the density also shows that its shape varies with γ in a non-trivial way. These
two remarks illustrate some of the difficulties that need to be overcome when working under “large n, large
p” asymptotics.
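For concreteness, the Marčenko-Pastur density above is easy to evaluate numerically; the following short
sketch (ours, assuming numpy) does so for γ ≤ 1.

    import numpy as np

    def marcenko_pastur_density(x, gamma):
        # Density of the Marcenko-Pastur law (identity covariance, gamma <= 1)
        a = (1.0 - np.sqrt(gamma)) ** 2
        b = (1.0 + np.sqrt(gamma)) ** 2
        x = np.asarray(x, dtype=float)
        f = np.zeros_like(x)
        inside = (x > a) & (x < b)
        f[inside] = np.sqrt((b - x[inside]) * (x[inside] - a)) / (2.0 * np.pi * gamma * x[inside])
        return f

    print(marcenko_pastur_density(np.linspace(0.2, 2.2, 5), gamma=0.2))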

3 Algorithm and Statistical considerations
3.1 Formulation of the estimation problem
A remarkable feature of the equation (M-P) is that the knowledge of the limiting distribution of the
eigenvalues in the population given by H∞ fully characterizes the limiting behavior of the eigenvalues of
the sample covariance matrix. However, the relationship between the two is hard to disentangle. As is
common in statistics, the question is how to invert this relationship to estimate Hp . The question thus
becomes, given l1 , . . . , lp , the eigenvalues of a sample covariance matrix, can we estimate the population
eigenvalues, λ1 , . . . , λp , using Equation (M-P)? Or in terms of spectral distribution, can we estimate Hp
from Fp ?
Our strategy is the following: 1) the first aim is to estimate the measure H∞ appearing in the Marčenko-
Pastur equation. 2) Given an estimator, Ĥ∞ , of this measure, we will estimate λi as the i-th quantile of
our estimated distribution. It is common in statistical practice to get these estimates by using the i/(p + 1)
percentile and this is what we do. (We come back to possible difficulties getting from Ĥp to λ̂i in 3.3.6.)
3) An important point is that since we are considering fixed distribution asymptotics, our estimate of H∞
will serve as our estimate of Hp , so Ĥp = Ĥ∞ .
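To fix ideas about step 2, here is one way the quantile step can be carried out once a discrete estimate
Ĥp = Σ_k w_k δ_{t_k} is in hand (an illustrative sketch of ours, assuming numpy). It returns the j/(p + 1)
quantiles in increasing order; reading them in decreasing order gives λ̂1 ≥ . . . ≥ λ̂p.

    import numpy as np

    def eigenvalue_estimates_from_measure(t, w, p):
        # Return the j/(p+1) quantiles, j = 1, ..., p, of H_hat = sum_k w_k * delta_{t_k}
        order = np.argsort(t)
        t = np.asarray(t, dtype=float)[order]
        w = np.asarray(w, dtype=float)[order]
        cdf = np.cumsum(w)
        probs = np.arange(1, p + 1) / (p + 1)
        idx = np.searchsorted(cdf, probs, side="left")      # smallest t with cdf >= prob
        return t[np.clip(idx, 0, len(t) - 1)]               # in increasing order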
The main question, then, is how to approach step 1: estimating H∞ based only on Fp . Of course,
since we can compute the eigenvalues of Sp , we can compute vFp (z) for any z we choose. By evaluating
vFp at a grid of values {z_j}_{j=1}^{J_n} , we have a set of values {v_{F_p}(z_j)}_{j=1}^{J_n} for which equation (M-P) should
(approximately) hold. We want to find Ĥ∞ that will “best” satisfy equation (M-P) across the set of values
of vFp (zj ). In other words, we will pick

\hat{H}_p = \hat{H}_\infty = \operatorname*{argmin}_{H}\ L\left( \left\{ \frac{1}{v_{F_p}(z_j)} + z_j - \frac{p}{n} \int \frac{\lambda\, dH(\lambda)}{1 + \lambda v_{F_p}(z_j)} \right\}_{j=1}^{J_n} \right),

where the optimization is over probability measures H, and L is a loss function to be chosen later. In this
way we are “inverting” the equation (M-P), going from Fp , an estimate of F∞ , to an estimate of H∞ .
We will solve this inverse problem in two steps: discretization and convex optimization. We give a
high-level overview of our method and postpone implementation details to the Appendix.
To summarize, we face the following interpolation problem: given an integer J and (z_j , v_{F_p}(z_j))_{j=1}^{J} , we
want to find an estimate of H∞ that approximately satisfies equation (M-P). In Section 5, we show that
doing so for the L∞ loss function leads to a consistent estimator of H∞ , under the reasonable assumption that
all spectra are bounded.
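For illustration, computing the values v_{F_p}(z_j) from the sample eigenvalues is straightforward (a sketch
of ours, assuming numpy; the actual choice of the grid {z_j} is discussed in the Appendix and is not
reproduced here):

    import numpy as np

    def v_Fp(z, sample_eigs, p, n):
        # v_{F_p}(z) = -(1 - p/n)/z + (p/n) m_{F_p}(z), with m_{F_p}(z) = (1/p) sum_i 1/(l_i - z)
        m = np.mean(1.0 / (np.asarray(sample_eigs, dtype=float) - z))
        return -(1.0 - p / n) / z + (p / n) * m

    # Example with placeholder eigenvalues and an arbitrary grid of z_j in C+
    sample_eigs = np.array([0.4, 0.8, 1.1, 1.6, 2.1])
    p, n = len(sample_eigs), 50
    z_grid = np.linspace(0.1, 3.0, 20) + 0.1j
    v_values = np.array([v_Fp(z, sample_eigs, p, n) for z in z_grid])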

3.2 The algorithm


To lighten notation, we will write H instead of H∞ when it does not cause any confusion.

3.2.1 Discretization
Naturally, dH can be simply approximated by a weighted sum of point masses:
dH(x) \simeq \sum_{k=1}^{K} w_k\, \delta_{t_k}(x),

where {t_k}_{k=1}^{K} is a grid of points, chosen by us, and the wk ’s are weights. The fact that we are looking for a
probability measure imposes the constraints

\sum_{k=1}^{K} w_k = 1, \quad \text{and} \quad w_k \ge 0.

This approximation turns the optimization-over-measures problem into searching for a vector of weights
in R^K_+ . After discretization, the integral in equation (M-P) can be approximated by

\int \frac{\lambda\, dH(\lambda)}{1 + \lambda v} \simeq \sum_{k=1}^{K} w_k \frac{t_k}{1 + t_k v}.

Hence finding a measure that approximately satisfies Equation (M-P) is equivalent to finding a set of
weights {w_k}_{k=1}^{K} , for which we have

-\frac{1}{v_\infty(z_j)} \simeq z_j - \frac{p}{n} \sum_{k=1}^{K} w_k \frac{t_k}{1 + t_k v_\infty(z_j)}, \quad \forall j.

Naturally, we do not get to observe v∞ , and so we make a further approximation and replace v∞ by
vFp . Our problem is thus to find {w_k}_{k=1}^{K} such that

-\frac{1}{v_{F_p}(z_j)} \simeq z_j - \frac{p}{n} \sum_{k=1}^{K} w_k \frac{t_k}{1 + t_k v_{F_p}(z_j)}, \quad \forall j.

One good thing about this approach is that the problem we now face is linear in the weights, which
are the only unknowns here. We will demonstrate that this allows us to cast the problem as a relatively
simple convex optimization problem.

3.2.2 Convex Optimization formulation


To show that we can formulate our inverse problem as a convex problem, let us call the approximation
errors we make
e_j = \frac{1}{v_{F_p}(z_j)} + z_j - \frac{p}{n} \sum_{k=1}^{K} w_k \frac{t_k}{1 + v_{F_p}(z_j)\, t_k}.

As explained above, there are two sources of error in ej : one comes from the discretization of the integral
involving H∞ . The other one comes from the substitution of v∞ , a non-random and asymptotic quantity,
by vFp , a (random) quantity computable from the data. ej is of course a complex number in general.
We can now state several convex problems as approximation of the inversion of the Marčenko-Pastur
equation problem. We show in Section 5 consistency of the solution of the “L∞ ” version of the problem
described below. Here are a few examples of convex formulations for our inverse problem. In all these
problems, the wk ’s are constrained to sum to 1 and to be non-negative.

1. “L∞ ” version: Find wk ’s to

   Minimize \max_{j=1,\ldots,J_n} \max\{ |\mathrm{Re}(e_j)|, |\mathrm{Im}(e_j)| \}

2. “L2 ” version: Find wk ’s to

   Minimize \sum_{j=1}^{J_n} |e_j| .

3. “L2 -squared” version: Find wk ’s to

   Minimize \sum_{j=1}^{J_n} |e_j|^2 .

The advantages of formulating our problem as a convex optimization problem are many. We will come
back to the more statistical issues later. From a purely numerical point of view, we are guaranteed that an
optimum exists, and fast algorithms are available. In practice, we used the optimization package MOSEK
(see (MOSEK, 2006)), within Matlab, for solving our problems.

Because the rest of the article focuses particularly on the “L∞ ” version of the problem described above,
we want to give a bit more detail about it. The “translation” of the problem into a convex optimization
problem is

\min_{(w_1, \ldots, w_K, u)} \; u

subject to \quad \forall j,\ -u \le \mathrm{Re}(e_j) \le u,
\qquad \forall j,\ -u \le \mathrm{Im}(e_j) \le u,
\qquad \sum_{k=1}^{K} w_k = 1,
\qquad w_k \ge 0,\ \forall k.

This is a linear program (LP) with unknowns (w1 , . . . , wK ) and u (see (Boyd and Vandenberghe, 2004) for
standard manipulations to make it a standard form LP).
The simulations we present in Section 4 were made using this version of this algorithm. The proof in
Section 5 applies to this version of the algorithm.
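For readers who want to experiment, here is an illustrative re-implementation sketch of this L∞ linear
program in Python with scipy (the paper's own implementation used MOSEK within Matlab; the grids {t_k}
and {z_j} below are user-supplied inputs, chosen for instance as described in the Appendix).

    import numpy as np
    from scipy.optimize import linprog

    def estimate_H_linf(sample_eigs, p, n, t_grid, z_grid):
        # Illustrative L-infinity inversion of (M-P): variables (w_1, ..., w_K, u),
        # minimize u subject to |Re(e_j)| <= u, |Im(e_j)| <= u, sum_k w_k = 1, w_k >= 0.
        sample_eigs = np.asarray(sample_eigs, dtype=float)
        t_grid = np.asarray(t_grid, dtype=float)
        z_grid = np.asarray(z_grid, dtype=complex)
        K, J, gamma = len(t_grid), len(z_grid), p / n

        # v_{F_p}(z_j) computed from the sample eigenvalues
        v = np.array([-(1 - gamma) / z + gamma * np.mean(1.0 / (sample_eigs - z)) for z in z_grid])

        # e_j = c_j - gamma * sum_k w_k g_{jk}, with c_j = 1/v_j + z_j and g_{jk} = t_k / (1 + t_k v_j)
        c = 1.0 / v + z_grid
        G = t_grid[None, :] / (1.0 + np.outer(v, t_grid))                 # shape (J, K)

        # Inequality constraints A_ub x <= b_ub, for x = (w, u)
        rows, rhs = [], []
        for c_part, G_part in ((c.real, G.real), (c.imag, G.imag)):
            rows.append(np.hstack([-gamma * G_part, -np.ones((J, 1))]))   #  Re/Im(e_j) <= u
            rhs.append(-c_part)
            rows.append(np.hstack([+gamma * G_part, -np.ones((J, 1))]))   # -Re/Im(e_j) <= u
            rhs.append(+c_part)
        A_ub, b_ub = np.vstack(rows), np.concatenate(rhs)

        A_eq = np.hstack([np.ones((1, K)), np.zeros((1, 1))])             # sum_k w_k = 1
        b_eq = np.array([1.0])
        cost = np.zeros(K + 1); cost[-1] = 1.0                            # minimize u
        bounds = [(0, None)] * (K + 1)

        res = linprog(cost, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
        return t_grid, res.x[:K]                                          # point masses and weights

The returned pair represents the discrete estimate Ĥp = Σ_k w_k δ_{t_k}; additional linear constraints
(e.g. the moment constraints of Section 3.3.1) can be appended as extra rows of A_ub or A_eq.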

3.3 Statistical considerations


The formulation we proposed is quite flexible and has several important qualities. For instance, reg-
ularization constraints can be easily handled through our proposal. We also can view the algorithm as a
form of “basis pursuit” in measure space, from which we can draw some practical conclusions.

3.3.1 Regularization and constraints


Methods to invert the Marčenko-Pastur equation should be flexible enough to accommodate reasonable
constraints that could provide additional improvement to our estimate of Hp . The fact that we essentially
just optimize over the weights wk ’s means that we can easily regularize and add constraints. For instance,
we might want to regularize our estimator and make it smoother by adding a total variation penalty (on
the wk ’s) to our objective function. In terms of constraints, we might want to specify that the first moment
of our estimate Ĥp matches the trace of Sp /p, since we know that the trace of Sp /p is a good estimate of
the trace of Σp /p (see e.g (Jonsson, 1982)), and that the trace of Σp /p is equal to the first moment of Hp .
Note that constraints on the moments of our estimator are linear in the wk ’s and so such constraints would
still lead to a convex problem. The framework we provide can very easily incorporate these two examples
of penalty and constraints, as well as many others.
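For instance (simply spelling out the point-mass case of the statement above), with dĤ = Σ_k w_k δ_{t_k}
the first-moment constraint reads

\sum_{k=1}^{K} w_k\, t_k = \mathrm{trace}(S_p)/p,

a single linear equality in the wk ’s that can be appended directly to the linear program of Section 3.2.2.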

3.3.2 A “basis pursuit” point of view


A semantic point is needed before we start our discussion. We use the term “basis pursuit” in a loose
sense: we are not referring to the algorithm proposed in (Chen et al., 1998) but rather use this expression
as a generic term for describing techniques that aim to optimize the representations of functional objects
in overcomplete dictionaries. We refer the reader to (Hastie et al., 2001, Chapter 5) for some of the core
statistical ideas of these so-called basis expansion methods.
The algorithm we propose can be viewed as a relaxation of a measure estimation problem. We want
to estimate a measure H∞ and instead of searching among all possible probability measures, we restrict
our search space to mixtures of a certain class of probability measures. In 3.2.1 for instance, we restricted
the choice to mixtures of point masses. In that sense, we can view it as a type of “basis pursuit” in
probability measure space. We first choose a “dictionary” of probability measures on the real line, and we
then decompose our estimator on this dictionary, searching for the best coefficients. Hence our problem
can be formulated as
find the best possible weights {w_1 , . . . , w_N } with d\hat{H} = \sum_{i=1}^{N} w_i\, dM_i ,

where the Mi ’s are the measures in our dictionary.
In the preceding discussion on discretization, we restricted ourselves to Mi ’s being point masses at
chosen “grid points”. Of course, we can enlarge our dictionary to include, for instance:

1. Probability measures that are uniform on an interval: dMi (x) = 1_{[a_i, b_i]}(x)\, dx/(b_i − a_i ).

2. Probability measures that have a linearly increasing density on an interval [ai , bi ] and density 0
   elsewhere: dMi (x) = 1_{[a_i, b_i]}(x)\, 2(x − a_i)/(b_i − a_i)^2\, dx.

3. Probability measures that have a linearly decreasing density on an interval [ai , bi ] and density 0
   elsewhere: dMi (x) = 1_{[a_i, b_i]}(x)\, 2(b_i − x)/(b_i − a_i)^2\, dx.

If we decide to include a probability measure M in our dictionary, the only requirement is that we be
able to compute the integral

\int \frac{\lambda\, dM(\lambda)}{1 + \lambda v}

for any v in C+ .
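For example (a routine calculus computation, added here for concreteness), for the uniform measures of item
1 above this integral has a closed form: for v ∈ C+ and with the principal branch of the logarithm,

\int_{a_i}^{b_i} \frac{\lambda}{1 + \lambda v}\, \frac{d\lambda}{b_i - a_i} = \frac{1}{v} - \frac{1}{v^2 (b_i - a_i)} \log\!\left( \frac{1 + b_i v}{1 + a_i v} \right),

so such elements can be added to the dictionary at essentially no extra computational cost.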
Choosing a larger dictionary increases the size of the convex optimization problems we try to solve,
and hence is at first glance computationally harder. However, statistically, enlarging the dictionary may
lead to sparser representations of the measure we are estimating, and hence, at least intuitively, lead to
better estimates of H∞ . The most favorable case is of course when H∞ is a mixture of a small number of
measures present in our dictionary. For instance, if H∞ has a density whose graph is a triangle, having
measures as described in points 2 and 3 above would most likely lead to sparser and maybe more accurate
estimates. In the presence of a priori information on H∞ , the choice of dictionary should be adapted so
that H∞ has a sparse representation in the dictionary.

3.3.3 Useful properties of the algorithm


One important advantage of choosing to estimate measures instead of choosing to estimate a high-
dimensional vector is that the algorithm’s complexity does not increase with the size of the answer required
by the user. Hence given a p dimensional vector of eigenvalues, once the values vFp (zj ) are computed, the
computational cost of the algorithm is the same irrespective of p. This means that for large p problems,
only one difficult computation is required: that of the eigenvalues of the empirical covariance matrix. Our
algorithm is hence, in some sense, “dimension-free”, i.e, except for the computation of the eigenvalues,
it is insensitive to the dimensionality of our original problem. This scaling property is important for
high-dimensional problems.
Another good property of our method is that it is independent of the basis in which the data is
represented. Because our method requires only as input the eigenvalues of the sample covariance matrix -
quantities obviously independent of the original basis of the data - our method is basis independent.
In other respects, Theorem 1 holds for random variables that have a fourth moment; we are not limited to
Gaussian random variables. Complex random variables are also possible. Hence, the theorem is well-suited
for wide applicability. Elementary properties of Gaussian random variables show that Theorem 1 covers all
possible Gaussian problems. This will not be true for all distributions, but the scope of the theorem is still
very wide. Note also that the Equation (M-P) holds in greater generality than mentioned in Theorem 1.
We refer the reader to the original paper (Marčenko and Pastur, 1967) for further examples, in particular
when the data is distributed on spheres or ellipsoids. (The original formulation of the theorem allows for
dependence between the entries of the matrix Y , but the convergence is not shown to be almost sure.)

3.3.4 The case p > n and how large is large?


Another advantage of the proposed method is that it is insensitive to whether p is larger than n or n is
larger than p. The only requirement is that they both be quite large. We had reasonable to good results in
simulation as soon as p > 30 or so. As a matter of fact, it is quite clear that to have reasonably accurate
estimates of the eigenvalues, we need to “populate” the interval [λp , λ1 ] with enough points, for otherwise
quantile methods may be somewhat inaccurate.

3.3.5 On covariance estimation, linear and non-linear shrinkage of eigenvalues
There is some classical and more recent statistical work on shrinkage of eigenvalues to improve
covariance estimation. We refer the reader to Section 4.1 in (Ledoit and Wolf, 2004) for some examples
due to Charles Stein and Leonard Haff, unfortunately in unpublished manuscripts. More recently, in the
interesting paper by (Ledoit and Wolf, 2004), what was proposed is to linearly shrink the eigenvalues of Sp
toward the identity: the li ’s become l̃i = (1 − ρ)li + ρ, for some ρ, independent of i, chosen using the data
and the Marčenko-Pastur law. Then the authors of (Ledoit and Wolf, 2004) proposed to estimate Σp by
(1 − ρ)Sp + ρIdp . Since this latter matrix and Sp have the same eigenvectors, their method of covariance
estimation can be viewed as linearly shrinking the sample eigenvalues and keeping the eigenvectors of Sp
as estimates of the eigenvectors of Σp .
Our method of estimation of the population eigenvalues can be viewed as doing a non-linear shrinkage
of the sample eigenvalues. While we could propose to just keep the eigenvectors of Sp as estimates of
the eigenvectors of Σp , and hence get an estimate of the population covariance matrix, we think one
should be able to do better by using the eigenvalue information to drive the eigenvector estimation. It is
known that in “large n, large p” asymptotics, the eigenvectors of the sample covariance matrix are not
consistent estimators of the population eigenvectors (see (Paul, To Appear)), even in the most favorable
cases. However, having a good idea of the structure of the population eigenvalues should help us estimate
the eigenvectors of the population covariance matrix, or at least formulate the right questions for the
problem at hand. For instance, the inferred structure of the covariance matrix could help us decide how
many subspaces we need to identify: if, for example, it turned out that the population eigenvalues were
clustered around two values, we would have to identify two subspaces, the dimensions of these subspaces
being the number of eigenvalues clustered around each value. Also, having estimates of the eigenvalues
tell us how much variance our “eigenvectors” will have to explain. In other words, our hope is that taking
advantage of the crucial eigenvalue information we are now able to gather will lead to better estimation of
Σp by doing a “reasoned” spectral decomposition. Work in this direction is in progress.

3.3.6 Asymptotics at fixed spectral distribution and isolated eigenvalues


Our algorithm actually uses asymptotics assuming a fixed spectral distribution: we are essentially fixing
Hp = H∞ when solving our optimization problem. Naturally, this does not mean that p is fixed. Note that
this is what is classically done in statistics: for the simple problem of estimating the mean of a population
from a sample Z1 , . . . , ZK , it is common to assume that the Zk ’s have the same mean µ, and that µ does
not depend on K. However, when studying the asymptotic properties of this simple estimator, we could
instead allow µ to depend on K, with µ(K) → µ. (All we would have to do is consider a triangular array of
data, and get to observe just one row of this array at a time.) Hence our fixed spectral distribution
“assumption” is very natural and similar to classical assumptions made in estimation problems.
Let us go back now to the problem of isolated eigenvalues. Suppose we get to see data in Rp0 for
some p0 . Then, any isolated eigenvalue that may be present is numerically treated as if the mass that
is attached to it is held fixed at 1/p0 when p → ∞. So a point mass at the corresponding population
eigenvalue would appear in Ĥp . This has been verified numerically. If the estimator were perfect, this
mass should be equal to 1/p0 . However, because of variability it may not be exactly of mass 1/p0 . Then,
estimating the population eigenvalues by the quantiles of the estimated population spectral distribution,
we may “miss” this isolated eigenvalue. In the case of the largest eigenvalue, that would happen if the mass
found numerically at this isolated eigenvalue is less than 1/(p0 + 1). So isolated eigenvalues will require
special care and caution, particularly in going from Ĥp to λ̂i . While the method focuses on identifying the
structure of the population eigenvalues and hence may have problems when it comes to estimating isolated
eigenvalues, we have found in practice that it still provided a good tool for this task but that some care
was required.

3.3.7 Existing related work


As far as we know, there has been no work on non-parametric estimation of Hp or H∞ using the
Marčenko-Pastur equation. However, some work exists in the physics literature ((Burda et al., 2004,
2005)) that takes advantage of the Marčenko-Pastur law to estimate some moments of H∞ . H∞ is then
assumed to be a mixture of a finite and pre-specified number of point masses (see (Burda et al., 2004, p.
303)) and the moments are then matched with possible point masses and weights. While these methods
might be of some use sometimes, we think they require too many assumptions to be practically acceptable
for a broad class of problems. It might be tempting to try to develop a non-parametric estimator from
moments, but we think that without the strong assumptions made in (Burda et al., 2004), those estimators
will suffer drastically from: 1) the number of moments needed a priori may be large, and estimates of
high-order moments are very unreliable; 2) moments estimated indirectly may not constitute a genuine family of
moments: certain Hankel matrices need to be positive semi-definite and will not necessarily be so. Semi-
definite programming type corrections will then be necessary, but hard to implement. 3) Even if one has a
genuine moment sequence, there are usually many distributions with the same moments. Choosing between
them is clearly going to be a difficult task.

4 Simulations
We now present some simulations to illustrate the practical capabilities of the method. The objectives
of eigenvalue estimation are manifold and depend on the area of application. We review some of those
that inspired our work.
In settings like PCA, one basically wishes to discover some form of structure in the covariance matrix by
looking at the eigenvalues of the sample covariance matrix. In particular, a situation where the population
eigenvalues are different from each other indicates that projecting the data in some directions will be more
“informative” than projecting it in other directions; while in the case where all the population eigenvalues
are equal, all projections are equally informative or uninformative. As our brief discussion of the Marčenko-
Pastur law illustrated, in the “large n, large p” setting, it is difficult to know from the sample eigenvalues
whether all population eigenvalues are equal to each other or not, or even if there is any kind of structure
in them. When p and n are both large, standard graphical methods like the scree plot tend to look similar
whether or not there is structure in the data. We will see that our approach is able to differentiate between
the situations. Among other things, our method can thus be thought of as an alternative to the scree plot for
high-dimensional problems.
In other applications, one focuses more on trying to estimate the value of the largest or smallest
eigenvalues. In PCA, the largest population eigenvalues measure how much variance we can explain through
a low dimensional projection and are hence important. In financial applications, like Markowitz’ portfolio
optimization problem, the small population eigenvalues are important. They essentially measure the
minimum risk one can take by investing in a portfolio of certain stocks (see (Laloux et al., 1999) and
(Campbell et al., 1996, Chapter 5)). However, as explained in the Appendix, the largest eigenvalue of the
sample covariance matrix tends to overestimate the largest eigenvalue of the population covariance. And
similarly, the smallest eigenvalue of the sample covariance matrix tends to underestimate its population
counterpart. What that means is that using these measures of “information” and “risk”, we will tend to
overestimate the amount of information there is in our data and tend to underestimate the amount of risk
there is in our portfolios. So it is important to have tools to correct this bias. Our estimator provides a
way to do so.

4.1 Details of the simulations


We illustrate the performance of our method on three cases, each with a very different covariance
structure. We will give more details on each individual case in the following subsections.
We now describe more precisely these examples. The first case is that of Σp = Idp , in other words,
there is no “information” in the data. However, standard graphical statistical methods like the “scree plot”
will tend to show a pattern in the eigenvalues. We will show that our method is generally able to inform
us that all the eigenvalues are equal.
The second case is one where Σp has 50% of its eigenvalues equal to 1 and 50% equal to 2. While
it should be easy to discern that there are two very distinct clusters of eigenvalues in the population, in
high dimensions the sample eigenvalues will often blur the clusters together. We show that our method
generally recovers these two clusters well.
Finally, the third example is one where Σp is a Toeplitz matrix. More details on Toeplitz matrices
are given in 4.1.3. This situation poses a harder estimation problem. While the asymptotic behavior of
the eigenvalues of such matrices is well understood, there are generally no easy and explicit formulas to
represent the limit. We present the results to show that even in this difficult setting, our method performs
quite well.
To measure the performance of our estimators, we compare the Lévy distance between our estimator,
Ĥp , and the true distribution of the population eigenvalues, Hp , to the Lévy distance between the empirical
spectral distribution, Fp , and Hp . Our choice is motivated by the fact that the Lévy distance can be used as
a metric for weak convergence of distributions on R. Recall (see e.g (Durrett, 1996)) that the Lévy distance
between two distributions F and G on the real line is defined as

d_L(F, G) = \inf\{ \epsilon > 0 : F(x - \epsilon) - \epsilon \le G(x) \le F(x + \epsilon) + \epsilon, \ \forall x \}.
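For reference, a simple grid-based numerical approximation of this distance (our illustration, not the
authors' code) can be written as follows, assuming numpy:

    import numpy as np

    def levy_distance(cdf_F, cdf_G, grid, eps_grid=None):
        # Smallest epsilon on eps_grid such that F(x - eps) - eps <= G(x) <= F(x + eps) + eps
        # for all x on the evaluation grid (an approximation of the infimum in the definition).
        if eps_grid is None:
            eps_grid = np.linspace(0.0, 2.0, 2001)
        F, G = np.vectorize(cdf_F), np.vectorize(cdf_G)
        for eps in eps_grid:
            if np.all(F(grid - eps) - eps <= G(grid)) and np.all(G(grid) <= F(grid + eps) + eps):
                return eps
        return eps_grid[-1]

    # Example: empirical cdf of a few eigenvalues vs. the cdf of a point mass at 1
    eigs = np.array([0.7, 0.9, 1.1, 1.4])
    F_cdf = lambda x: np.mean(eigs <= x)
    H_cdf = lambda x: float(x >= 1.0)
    d = levy_distance(F_cdf, H_cdf, grid=np.linspace(0.0, 3.0, 601))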

In the plots we will depict the cumulative distribution function (cdf) of our estimated measures. Recall
that the estimates of the population eigenvalues λi ’s are obtained by taking appropriate percentiles of these
measures.

4.1.1 The case Σp = Idp


In this situation, the Marčenko-Pastur law predicts that instead of being concentrated at 1 like the
population eigenvalues, the sample eigenvalues will be spread on the interval [(1 − \sqrt{p/n})^2 , (1 + \sqrt{p/n})^2 ].
This is problematic, since by looking at the scree plot of just the sample eigenvalues, one might think
that some population eigenvalues are (much) larger than others and hence some projections of the data
are more informative than others. This is vividly illustrated on Figure 1a. However, as we see on Figure
1c, the method we propose finds that the population spectral distribution is very close to a point mass
at 1, and all eigenvalues are thus close to 1. Statistically, this of course means that there is no preferred
direction to project the data. All directions are equally informative, or uninformative.
The figures presented in Figure 1 were chosen at random among 1000 Monte-Carlo simulations and are
very encouraging. To further our empirical investigation of the performance of our method, we repeated
the estimation process 1000 times. On further investigation (manually checking the graphs of many of the
estimators we obtained), we also saw that the estimator consistently gets the structure “right”, namely a
huge spike in the vicinity of 1. This is of course very important for applications such
as PCA, where the structure of the spectrum of the covariance matrix is of fundamental importance. For
each repetition, we estimated the distribution of the eigenvalues in the population, and computed the Lévy
distance of our estimator, Ĥp , to the true distribution, Hp , in this case a point mass at 1. We did the
same for the empirical spectral distribution Fp . Figure 2 shows the ratio dL (Ĥp , Hp )/dL (Fp , Hp ) for these
simulations. Our estimator clearly outperforms the one derived from the sample covariance matrix, often
by a dramatic factor.

4.1.2 The case Hp = .5δ1 + .5δ2


In this case the eigenvalues of the population covariance matrix are split into two clusters of equal size.
For the specific example we investigate, 50% of the eigenvalues are equal to 1 and 50% are equal to 2.
While it should be easy to discern that there are two very distinct clusters of population eigenvalues,
when p is sufficiently close to n the two clusters merge together and the scree plot of the sample eigenvalues
does not show a clear separation between the two regions. The Marčenko-Pastur law predicts (in the case
of identity covariance) that the sample eigenvalues spread over larger and larger intervals as p gets closer to
n. Therefore, it is intuitively not surprising that when we have two not too distant clusters of population
eigenvalues, the corresponding sample eigenvalues would start to overlap if p is close enough to n.
We did a Monte Carlo analysis (similar to the one done in the case of Idp covariance) of our estimator
and did comparisons to the empirical spectral distribution. As in the case of Idp , we present a figure
showing the ratio of the Lévy distance of the two estimates to the true distribution. Figure 4 shows that

[Figure 1 here. Panels: (a) eigenvalues (scree plot) of the sample covariance matrix; (b) CDF of the eigenvalues of the sample covariance matrix (Fp ); (c) CDF of the eigenvalues of the estimated population covariance matrix (Ĥp ). Settings: n = 500, p = 100, cov = Id.]

Figure 1: case Σp = Idp. The three figures above compare the performance of our estimator to the one
derived from the sample covariance matrix on one realization of the data. The data matrix X is 500 × 100.
All its entries are iid N (0, 1). The population covariance is Σp = Id100 , so the distribution of the eigenvalues
is a point mass at 1. This is what our estimator (Figure (c)) recovers. Average computation time (over
1000 repetitions) was 13.33 seconds, according to Matlab tic and toc functions. Implementation details
are in the Appendix.

[Figure 2 here: histogram of the ratios of Lévy metrics, empirical/adjusted, 1000 repetitions, n = 500, p = 100.]

Figure 2: case Σp = Idp: Ratios dL (Ĥp , Hp )/dL (Fp , Hp ) over 1,000 repetitions. Dictionary consisted of
only point masses. Large values indicate better performance of our algorithm. All ratios were found to be
larger than 1.

[Figure 3 here. Panels: (a) scree plot of the eigenvalues of the sample covariance matrix, with no clear separation around the 50th eigenvalue; (b) CDF of the eigenvalues of the sample covariance matrix (Fp ); (c) estimated CDF of the eigenvalues of the population covariance matrix (Ĥp ). Settings: n = 500, p = 100, 50 eigenvalues = 1, 50 eigenvalues = 2.]

Figure 3: case Hp = .5δ1 + .5δ2 : the three figures above compare the performance of our estimator on
one realization of the data. The data matrix Y is 500 × 100. All its entries are iid N (0, 1). The covariance
is diagonal and has spectral distribution Hp = .5δ1 + .5δ2 . In other words, 50 eigenvalues are equal to 1
and 50 eigenvalues are equal to 2. This is essentially what our estimator (Figure (c)) recovers. Average
computation time (over 1000 repetitions) was 15.71 seconds, according to Matlab tic and toc functions.

[Figure 4 here: histogram of the ratios of Lévy metrics, two-point (1, 2) covariance, 1000 repetitions, n = 500, p = 100.]

Figure 4: case Hp = .5δ1 + .5δ2 : Ratios dL (Ĥp , Hp )/dL (Fp , Hp ) over 1,000 repetitions. Dictionary
consisted of only point masses. Large values indicate better performance of our algorithm. All ratios were
found to be larger than 1.

[Figure 5 here. Panel (a): scree plot of the eigenvalues of the sample covariance matrix. Panel (b): CDF of the eigenvalues of the sample covariance matrix (Fp), with the population spectral distribution superimposed. Panel (c): estimated CDF of the eigenvalues of the population covariance matrix (Ĥp), with the population spectral distribution superimposed. All panels: n = 500, p = 100, Toeplitz covariance with entries .3^{|i−j|}.]

Figure 5: case Σp Toeplitz with entries .3^{|i−j|}: the three figures above show the performance of our estimator on one realization of the data. The data matrix Y is 500 × 100. All its entries are iid N(0, 1). The covariance is Toeplitz, with t(|i − j|) = .3^{|i−j|}. In panel (c), we superimpose our estimator (blue curve) and the true distribution of eigenvalues (red curve). Average computation time (over 1000 repetitions) was 16.61 seconds, according to Matlab's tic and toc functions.


4.1.3 The case of a Toeplitz covariance matrix


Finally, we performed the same type of analysis on a Toeplitz matrix, to show that the method we
propose works quite well on more complicated types of covariance structures. Note that this is inherently a difficult problem if we do not assume a priori that the matrix is Toeplitz.
We recall that a Toeplitz matrix T is a matrix whose entries satisfy Ti,j = t(i − j), for a certain function
t. Since covariance matrices are symmetric, the Toeplitz matrices at hand will satisfy Ti,j = t(|i − j|). The
limiting spectral distribution of these objects is very well understood: see (Böttcher and Silbermann,
1999), (Gray, 2002) or (Grenander and Szegö, 1958).
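For concreteness, here is a small sketch (ours, using standard but assumed conventions) of the simulation setup used in this subsection: the covariance is the Toeplitz matrix with entries .3^{|i−j|}, the data matrix Y has iid N(0, 1) entries, and the sample covariance spectrum is computed from X = Y Σp^{1/2}.

import numpy as np

rng = np.random.default_rng(0)
n, p = 500, 100
idx = np.arange(p)
sigma = 0.3 ** np.abs(idx[:, None] - idx[None, :])   # Toeplitz covariance, t(|i-j|) = .3^{|i-j|}

w, V = np.linalg.eigh(sigma)                         # population spectrum w (the target H_p)
sqrt_sigma = (V * np.sqrt(w)) @ V.T                  # symmetric square root of Sigma

Y = rng.standard_normal((n, p))
X = Y @ sqrt_sigma
sample_eigs = np.linalg.eigvalsh(X.T @ X / n)        # spectrum of the sample covariance S_p
print(w.min(), w.max())                              # population spectrum stays within roughly (0.54, 1.86)
print(sample_eigs.min(), sample_eigs.max())          # the sample spectrum is noticeably more spread out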
Approaches exist that take advantage of the particular structure of a Toeplitz matrix. See, for instance, the interesting papers (Bickel and Levina, 2004) and, for even more generality beyond Toeplitz matrices, (Bickel and Levina, 2006). However, these approaches are very basis dependent; they assume that the
variables are measured in the appropriate basis. In data analysis, this may sometimes be justified and
sometimes not. In particular, if the order of the variables is permuted, the resulting estimators might
change. Since we want to be able to avoid this type of behavior, we feel that a “basis independent” method
is needed and should be available. Finding such a method was one of the original motivations of our
investigations.
Once again, the results displayed in Figure 5 are quite encouraging. Note that this time the population spectral distribution could only be approximated by a large number of elements of our dictionary, so there was no sparse representation of H∞ in our chosen dictionary of measures. However, computation time was not severely affected and the results are still quite good. To give a more detailed comparison, we present in Figure 6 a histogram of the ratios dL(Fp, Hp)/dL(Ĥp, Hp).
[Figure 6 here: histogram of the ratios of Lévy metrics over 1,000 repetitions for the Toeplitz covariance, n = 500, p = 100.]

Figure 6: case Σp Toeplitz with entries .3^{|i−j|}: ratios dL(Fp, Hp)/dL(Ĥp, Hp) over 1,000 repetitions. The dictionary consisted of only point masses. Large values indicate better performance of our algorithm. All ratios were found to be larger than 1.

5 Consistency
In this section, we prove that the algorithm we propose leads to a consistent (in the sense of weak
convergence of probability measures) estimator of the spectral distribution of the covariance matrices of
interest.
More precisely, we focus on the “L∞” version of the algorithm proposed in 3.2.2. In short, the theoretical results we prove state that as our computational resources grow (both in terms of the size of the available data and of the grid points on which to evaluate functions), the estimator Ĥp converges to H∞. The meaning of Theorem 2, which follows, is the following. We first choose a family of points {zj} in the upper half of the complex plane, with a limit point in the upper half of the complex plane. We assume that the population spectral distribution Hp has a limit, in the sense of weak convergence of distributions, when p → ∞. We call this limit H∞. This assumption of weak convergence allows us to vary Hp as p grows, and not to be limited to Hp = H∞ for the theory; this provides maximal generality. We then solve the “L∞” version of our optimization problem, by including more and more of the zj's in the optimization problem as n → ∞. We assume in Theorem 2 that we can solve this problem by optimizing over all probability measures. Then Theorem 2 shows that the solution of the optimization problem, Ĥp, converges in distribution to the limiting population spectral distribution, H∞. In Corollary 1, we show that the same conclusion holds if the optimization is now made over probability measures that are mixtures of point masses, whose locations are on a grid whose step size goes to 0 with p and n. Actually, the requirement is that the dictionary of measures we use contains these Dirac masses; it can of course be larger. Hence, Corollary 1 proves consistency of the estimators specifically obtained through our algorithm. Besides the assumptions of Theorem 1, we assume that all the spectra of the population covariances are (uniformly) bounded. This translates into the mild requirement that the support of all the Hp's be contained in the same compact set. Note that in the context of asymptotics at a fixed spectral distribution, this is automatically satisfied.
We now turn to a more formal statement of the theorem. The notation B(z0 , r) denotes the closed ball
of center z0 and radius r. Our main theorem is the following.

Theorem 2. Suppose we are under the setup of Theorem 1, Hp ⇒ H∞ and p/n → γ, with 0 < γ < ∞.
Assume that the spectra of the Σp ’s are uniformly bounded. Let J1 , J2 , . . . , be a sequence of integers tending
to ∞. Let z0 ∈ C+ and r ∈ R+ be such that B(z0, r) ⊂ C+. Let z1, z2, . . . be a sequence of complex numbers with an accumulation point, all contained in B(z0, r). Let Ĥp be the solution of
$$
\hat{H}_p = \operatorname*{argmin}_{H} \; \max_{j \le J_n} \left| \frac{1}{v_{F_p}(z_j)} + z_j - \frac{p}{n} \int \frac{\lambda \, dH(\lambda)}{1 + \lambda v_{F_p}(z_j)} \right| , \qquad (1)
$$
where H is a probability measure. Then we have
$$
\hat{H}_p \Rightarrow H_\infty , \quad \text{a.s.}
$$

Before we turn to proving the theorem, we need a few intermediate results. An important step in the
proof is the following analytic lemma.

Lemma 1. Suppose we have a family {zi}_{i=1}^∞ of complex numbers in C+, with an accumulation point in C+. Suppose there exist a sequence {Ji}_{i=1}^∞ of integers tending to ∞, a sequence {ǫi}_{i=1}^∞ of positive reals tending to 0, a sequence {p(n)}_{n=1}^∞ of integers with p(n)/n → γ > 0, and a sequence of probability measures {Ĥp}_{p=1}^∞ such that
$$
\forall j \le J_n , \quad \left| \frac{1}{v_{F_p}(z_j)} + z_j - \frac{p}{n} \int \frac{\lambda \, d\hat{H}_p(\lambda)}{1 + \lambda v_{F_p}(z_j)} \right| < \epsilon_n . \qquad (2)
$$
Assume that v∞ satisfies
$$
-\frac{1}{v_\infty(z_j)} = z_j - \gamma \int \frac{\lambda \, dH_\infty(\lambda)}{1 + \lambda v_\infty(z_j)} , \qquad (3)
$$
for some probability measure H∞. Assume that vFp(zj) → v∞(zj), and that both are analytic in C+ and map C+ into C+. Further, assume that |v∞(zj)| < C for some C ∈ R, and |Im(vFp(zj))| > δ, as well as |Im(v∞(zj))| > δ, for some δ > 0. Then
$$
\hat{H}_p \Rightarrow H_\infty .
$$

Proof. Since v∞ satisfies
$$
\frac{1}{v_\infty(z_j)} + z_j - \gamma \int \frac{\lambda \, dH_\infty(\lambda)}{1 + \lambda v_\infty(z_j)} = 0 ,
$$
equation (2) reads
$$
\left| \left( \frac{1}{v_{F_p}(z_j)} - \frac{1}{v_\infty(z_j)} \right) + \left( \gamma - \frac{p}{n} \right) \int \frac{\lambda \, dH_\infty(\lambda)}{1 + \lambda v_\infty(z_j)} + \frac{p}{n} \left( \int \frac{\lambda \, dH_\infty(\lambda)}{1 + \lambda v_\infty(z_j)} - \int \frac{\lambda \, d\hat{H}_p(\lambda)}{1 + \lambda v_{F_p}(z_j)} \right) \right| < \epsilon_n .
$$
Note that since |Im(vFp)| > δ and |Im(v∞)| > δ, and given that
$$
\left| \frac{1}{v_{F_p}} - \frac{1}{v_\infty} \right| \le \frac{|v_{F_p} - v_\infty|}{|\mathrm{Im}(v_{F_p})| \, |\mathrm{Im}(v_\infty)|} ,
$$
we have |1/vFp − 1/v∞| → 0.
Also, because p/n → γ, the previous equation implies that
$$
\int \frac{\lambda \, dH_\infty(\lambda)}{1 + \lambda v_\infty(z_j)} - \int \frac{\lambda \, d\hat{H}_p(\lambda)}{1 + \lambda v_{F_p}(z_j)} \to 0 .
$$
Now because vFp(zj) → v∞(zj), we have
$$
\left| \int \frac{\lambda \, d\hat{H}_p(\lambda)}{1 + \lambda v_{F_p}(z_j)} - \int \frac{\lambda \, d\hat{H}_p(\lambda)}{1 + \lambda v_\infty(z_j)} \right| = \left| \int \frac{\lambda^2 \big( v_\infty(z_j) - v_{F_p}(z_j) \big) \, d\hat{H}_p(\lambda)}{(1 + \lambda v_\infty(z_j))(1 + \lambda v_{F_p}(z_j))} \right| \le \frac{|v_{F_p}(z_j) - v_\infty(z_j)|}{|\mathrm{Im}(v_{F_p}(z_j))| \, |\mathrm{Im}(v_\infty(z_j))|} \to 0 .
$$
So we have
$$
\int \frac{\lambda \, d\hat{H}_p(\lambda)}{1 + \lambda v_\infty(z_j)} \to \int \frac{\lambda \, dH_\infty(\lambda)}{1 + \lambda v_\infty(z_j)} .
$$
We remark that for m ∈ C+, and G a probability measure on R whose Stieltjes transform is denoted by SG,
$$
\int \frac{\lambda \, dG(\lambda)}{1 + \lambda m} = \frac{1}{m} - \frac{1}{m} \int \frac{dG(\lambda)}{1 + \lambda m} = \frac{1}{m} - \frac{1}{m^2} \int \frac{dG(\lambda)}{1/m + \lambda} = \frac{1}{m} - \frac{1}{m^2} \, S_G\!\left( -\frac{1}{m} \right) .
$$
Hence, when the assumptions of the lemma are satisfied, we have
$$
S_{\hat{H}_p}\!\left( -\frac{1}{v_\infty(z_j)} \right) \to S_{H_\infty}\!\left( -\frac{1}{v_\infty(z_j)} \right) .
$$
Now since v∞(zj) satisfies Equation (3), we see that if v∞(zj) = v∞(zk), then zj = zk. Hence, {−1/v∞(zj)}_{j=1}^∞ is an infinite sequence of complex numbers in C+. Moreover, because v∞ is analytic in C+, it is continuous, and so {−1/v∞(zj)}_{j=1}^∞ has an accumulation point. Further, because |v∞(zj)| < ∞ and Im(v∞(zj)) > δ, this accumulation point is in C+.
So under the assumptions of the lemma, we have shown that there exists an infinite sequence {yj}_{j=1}^∞ of complex numbers in C+, with an accumulation point in C+, such that
$$
S_{\hat{H}_p}(y_j) \to S_{H_\infty}(y_j) , \quad \forall j .
$$
According to (Geronimo and Hill, 2003), Theorem 2, this implies that
$$
\hat{H}_p \Rightarrow H_\infty .
$$

In the context of spectrum estimation, the intuitive meaning of the previous lemma is that if, for a sequence of complex numbers {zj}_{j=1}^∞ with an accumulation point in C+, we can find a sequence of Ĥp's approximately satisfying the Marčenko-Pastur equation at more and more of the zj's when n grows, then this sequence of measures will converge to H∞.
We now state and prove a few results that will be needed in the proof of Theorem 2. The first one is a
remark concerning Stieltjes transforms.

Proposition 1. The Stieltjes transform, SH, of any probability measure H on R, is Lipschitz with constant 1/u_min^2 on C+ ∩ {Im(z) > umin}.
Hence, if SHn(z) → SH∞(z) pointwise, where all the measures considered are probability measures, the convergence is uniform on compact subsets of C+ ∩ {Im(z) > umin}.

Proof. We first show the Lipschitz character of SH. We have
$$
S_H(z_1) - S_H(z_2) = \int \left( \frac{1}{\lambda - z_1} - \frac{1}{\lambda - z_2} \right) dH(\lambda) = (z_1 - z_2) \int \frac{dH(\lambda)}{(\lambda - z_1)(\lambda - z_2)} .
$$
Now |λ − z1| ≥ |Im(λ − z1)| = Im(z1) > umin, and similarly for z2. So
$$
|S_H(z_1) - S_H(z_2)| \le \frac{|z_1 - z_2|}{u_{\min}^2} .
$$
So we have shown that SH is uniformly Lipschitz with constant 1/u_min^2 on C+ ∩ {Im(z) > umin}.
Now, it is an elementary and standard fact of analysis that if a sequence of K-Lipschitz functions converges pointwise to a K-Lipschitz function, then the convergence is uniform on compact sets. This shows the uniform convergence part of our statement.

In the proof of the Theorem, we will need the result of the following proposition.

Proposition 2. Assume the assumptions underlying Theorem 1 are satisfied. Recall that vFp is the Stieltjes transform of F̃p, the spectral distribution of XX∗/n = Y Σp Y∗/n. Assume that the population spectral distribution Hp has a limit H∞ and that all the spectra are uniformly bounded. Let B(z0, r) ⊂ C+. Then, almost surely,
$$
\exists N : \quad \inf_{n > N , \; z \in B(z_0, r)} \mathrm{Im}\big( v_{F_p}(z) \big) = \delta > 0 .
$$

Proof. Since we assume that all spectra are bounded, we can assume that the population eigenvalues are all uniformly bounded by K. Because the spectral norm is a matrix norm and X = Y Σp^{1/2}, we have
$$
\lambda_{\max}(X^*X/n) \le \lambda_{\max}(\Sigma_p) \, \lambda_{\max}(Y^*Y/n) .
$$
Now it is a standard result in random matrix theory that λmax(Y∗Y/n) → (1 + √γ)², a.s., so for n large enough,
$$
\lambda_{\max}(Y^*Y/n) \le 2(1 + \sqrt{\gamma})^2 \quad \text{a.s.}
$$
Calling z = u + iv, we have
$$
\mathrm{Im}\big( v_{F_p}(z) \big) = \int \frac{v \, d\tilde{F}_p(\lambda)}{(\lambda - u)^2 + v^2} \ge \int \frac{v \, d\tilde{F}_p(\lambda)}{2(\lambda^2 + u^2) + v^2} ,
$$
because (λ − u)² ≤ 2(λ² + u²). Now, the remark we made concerning the eigenvalues of X∗X/n implies that almost surely, for n large enough, F̃p puts all its mass within [0, C], for some C. Therefore,
$$
\mathrm{Im}\big( v_{F_p}(z) \big) \ge \frac{v}{2(C^2 + u^2) + v^2} ,
$$
and hence Im(vFp(z)) is a.s. bounded away from 0, uniformly over B(z0, r), for n large enough (on B(z0, r), v is bounded below by Im(z0) − r > 0 and |u| is bounded above).

To show that we can find “good” probability measures when solving our optimization problem, we will
need to exhibit a sequence of measures that approximately satisfy the Marčenko-Pastur equation. The
next proposition is a step in this direction.

Proposition 3. Let r ∈ R+ and z0 ∈ C+ be given and satisfying B(z0 , r) ⊂ C+ . Suppose p/n → γ when
n → ∞, and ∀ǫ ∃N : n > N ⇒ ∀z ∈ B(z0 , r), |vFp (z) − v∞ (z)| < ǫ, where v∞ satisfies equation (3).
Suppose further that |Im (v∞ (z)) | > umin on B(z0 , r). Then, if ǫ < umin /2,
$$
\exists N' \in \mathbb{N} , \; \forall z \in B(z_0, r) , \; \forall n > N' , \quad \left| \frac{1}{v_{F_p}(z)} + z - \frac{p}{n} \int \frac{\lambda \, dH_\infty(\lambda)}{1 + \lambda v_{F_p}(z)} \right| < 2\epsilon \, \frac{1 + 2\gamma}{u_{\min}^2} .
$$

Proof. Using equation (3) we find that
$$
\begin{aligned}
\Delta_n(z) &= \frac{1}{v_{F_p}(z)} + z - \frac{p}{n} \int \frac{\lambda \, dH_\infty(\lambda)}{1 + \lambda v_{F_p}(z)} \\
&= \frac{1}{v_{F_p}(z)} - \frac{1}{v_\infty(z)} + \frac{p}{n} \int \left( \frac{\lambda}{1 + \lambda v_\infty(z)} - \frac{\lambda}{1 + \lambda v_{F_p}(z)} \right) dH_\infty(\lambda) + \left( \gamma - \frac{p}{n} \right) \int \frac{\lambda}{1 + \lambda v_\infty(z)} \, dH_\infty(\lambda) \\
&=: \Delta_n^I(z) + \left( \gamma - \frac{p}{n} \right) \int \frac{\lambda}{1 + \lambda v_\infty(z)} \, dH_\infty(\lambda) .
\end{aligned}
$$
Because γ − p/n → 0, and |λ/(1 + λv∞(z))| ≤ 1/|Im(v∞(z))| ≤ 1/umin, we have
$$
\left( \gamma - \frac{p}{n} \right) \int \frac{\lambda}{1 + \lambda v_\infty(z)} \, dH_\infty(\lambda) \to 0 \quad \text{uniformly on } B(z_0, r) .
$$
Now, of course,
$$
\Delta_n^I(z) = \frac{v_\infty(z) - v_{F_p}(z)}{v_{F_p}(z) v_\infty(z)} - \frac{p}{n} \big( v_\infty(z) - v_{F_p}(z) \big) \int \frac{\lambda^2 \, dH_\infty(\lambda)}{(1 + \lambda v_{F_p}(z))(1 + \lambda v_\infty(z))} .
$$
We remark that |vFp(z)| ≥ |Im(vFp(z))| > umin − ǫ > umin/2. Hence, if n is large enough,
$$
\big| \Delta_n^I(z) \big| \le \frac{2}{u_{\min}^2} \, |v_\infty(z) - v_{F_p}(z)| + \frac{2}{u_{\min}^2} \, \frac{p}{n} \, |v_{F_p}(z) - v_\infty(z)| \le \epsilon \, \frac{2}{u_{\min}^2} (1 + 2\gamma) .
$$

We now turn to proving Theorem 2.

Proof of Theorem 2. According to Propositions 1 and 2, the assumptions put forth in Proposition 3 are a.s. satisfied for vFp, where v∞ is the Stieltjes transform appearing in Theorem 1. Note also that Theorem 1 states that a.s., vFp(z) → v∞(z), and that all these functions are analytic in C+. In other words, they have the properties needed for Lemma 1 to apply.
In particular, Proposition 3 implies that if {zj} is a family of complex numbers included in B(z0, r), and if Ĥp is the solution of equation (1), equation (2) will be satisfied almost surely, with a family {ǫj} of positive real numbers that converges to 0. According to Lemma 1, this implies that
Ĥp ⇒ H∞, almost surely.

As a corollary of Theorem 2, we are now ready to prove consistency of our algorithm.

Corollary 1 (Consistency of proposed algorithm). Assume the same assumptions as in Theorem 2. Call Ĥp the solution of equation (1), where the optimization is now over measures which are sums of atoms, the locations of which are restricted to belong to a grid (depending on n) whose step size goes to 0 as n → ∞. Then
$$
\hat{H}_p \Rightarrow H_\infty \quad \text{a.s.}
$$

Proof. All that is needed is to show that a discretized version of H∞ furnishes a good sequence of measures, in the sense that Proposition 3 holds for this sequence of discretized versions of H∞.
We call HMn a discretization of H∞ on a regular discrete grid of step size 1/Mn. For instance, we can choose HMn(x) to be a step function, with HMn(x) = H∞(x) if x = l/Mn, l ∈ N, and HMn constant on [l/Mn, (l + 1)/Mn). Recall also that H∞ is compactly supported.
In light of the proof of Proposition 3, for the corollary to hold, it is sufficient to show that, uniformly in z ∈ B(z0, r),
$$
\left| \int \frac{\lambda}{1 + \lambda v_{F_p}(z)} \, dH_{M_n}(\lambda) - \int \frac{\lambda}{1 + \lambda v_{F_p}(z)} \, dH_\infty(\lambda) \right| \to 0 .
$$

Now calling dW(HMn, H∞) the Wasserstein distance between HMn and H∞, we have
$$
d_W(H_{M_n}, H_\infty) = \int_0^\infty |H_{M_n}(x) - H_\infty(x)| \, dx \to 0 \quad \text{as } n \to \infty .
$$

(HMn and H∞ put mass only on R+ , so the previous integral is restricted to R+ . We refer the reader to
the survey (Gibbs and Su, 2001) for properties of different metrics on probability measures.)
Moreover, it is easy to see that under the assumptions of Proposition 3, there exists N such that sup_{n>N, z∈B(z0,r)} |vFp(z)| ≤ K, for some K < ∞. Recall also that under the same assumptions, inf_{n>N, z∈B(z0,r)} Im(vFp(z)) ≥ δ, for some δ > 0.
For two probability measures G and H, we also have
$$
d_W(G, H) = \sup_f \left\{ \left| \int f \, dG - \int f \, dH \right| \; ; \; f \text{ a 1-Lipschitz function} \right\} .
$$

Hence, because H∞ and HMn are supported on a compact set that is independent of n, to have the result we want, it will be enough to show that
$$
f_{v_{F_p}(z)}(\lambda) = \frac{\lambda}{1 + \lambda v_{F_p}(z)}
$$
is uniformly Lipschitz (as a function of λ) when z ∈ B(z0, r) and n > N.


Now note that
λ1 − λ2
fvFp (z) (λ1 ) − fvFp (z) (λ2 ) = .
(1 + λ1 vFp (z))(1 + λ2 vFp (z))
If λ ≤ 1/(2K),
 then |λvFp (z)| ≤ 1/2, so |1 + λvFp (z)| ≥ 1/2. If λ ≥ 1/(2K), then |1 + λvFp (z)| ≥
λIm vFp (z) ≥ δ/(2K). So |1 + λvFp (z)| ≥ min(1/2, δ/(2K)) = C. Hence fvFp (z) is 1/C 2 -Lipschitz, and C
is uniform in n and z, as needed.
Having thus extended Proposition 3 to discretized versions of H∞ , the proof of the corollary is the
same as that of Theorem 2.

The proof of the corollary makes clear that when solving the optimization problem over any dictionary
of probability measures containing point masses (but also possibly other measures) at grid points on a grid
whose step size goes to 0, the algorithm will lead to a consistent estimator.
Finally, as explained in the Appendix, the algorithm we implemented starts with vFp(zj) sequences, as
opposed to simply zj sequences. It can be straightforwardly adapted to handle the zj ’s as a starting point,
too, but we got slightly better numerical results when starting with vFp (zj ). The proof we just gave could
be adapted to handle the situation where the vFp (zj )’s are used as starting point. However, a few other
technical issues would have to be addressed that we felt would make the important ideas of the proof less
clear. Hence we decided to show consistency in the setting of Corollary 1.

6 Conclusion
In this paper we have presented an original method to estimate the spectrum of large dimensional
covariance matrices. We place ourselves in a “large n, large p” asymptotic framework, where both the
number of observations and the number of variables are going to infinity, while their ratio goes to a finite,
non-zero limit. Approaching problems in this framework is increasingly relevant as datasets of larger and
larger size become more common.
Instead of estimating individually each eigenvalue, we propose to associate to each vector of eigenvalues
a probability distribution and estimate this distribution. We then estimate the population eigenvalues as
the appropriate quantiles of the estimated distribution. We use a fundamental result of random matrix
theory, the Marčenko-Pastur equation, to formulate our estimation problem. We propose a practical
method to solve this estimation problem, using tools from convex optimization.
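As a small illustration of the quantile step just described, here is a sketch of how p eigenvalue estimates can be read off a discrete estimated distribution. The mid-quantile levels (i − 1/2)/p used below are our own convention; the paper only states that appropriate quantiles are used.

import numpy as np

def eigenvalues_from_distribution(atoms, weights, p):
    """Return p estimated eigenvalues: the (i - 1/2)/p quantiles of the discrete measure."""
    atoms, weights = np.asarray(atoms, float), np.asarray(weights, float)
    order = np.argsort(atoms)
    atoms, weights = atoms[order], weights[order] / weights.sum()
    cdf = np.cumsum(weights)
    probs = (np.arange(1, p + 1) - 0.5) / p              # mid-point quantile levels
    idx = np.searchsorted(cdf, probs, side="left")
    return atoms[np.minimum(idx, atoms.size - 1)]

# Example: a distribution putting mass .5 at 1 and .5 at 2 gives back
# 50 eigenvalues equal to 1 and 50 equal to 2 when p = 100.
print(eigenvalues_from_distribution([1.0, 2.0], [0.5, 0.5], p=100))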
The estimator has good practical properties: it is fast to compute on modern computers (we use the
software (MOSEK, 2006) to solve our optimization problem) and scales well with the number of parameters
to estimate. We show that our estimator of the distribution of interest is consistent, where the appropriate
notion of convergence is weak convergence of distributions.
The estimator performs a non-linear shrinkage of the sample eigenvalues. It is basis independent, and we hope it will help improve the estimation of eigenvectors of large dimensional covariance matrices. To the best of our knowledge, our method is the first that harnesses deep results of random matrix theory to practically solve estimation problems. We have seen in simulations that the improvements it leads to are often dramatic. In particular, it enables us to find structure in the data when it exists, and to conclude that it is absent when there is none, even when classical methods would point to different conclusions.

APPENDIX

A.1 Implementation details
We plan to release the software we used to create the figures appearing in the simulation and data
analysis section in the near future. However, we want to mention here the choices of parameters we made
to implement our algorithm. The justifications for them are based on intuitions coming from studying the
equation (M-P).

Scaling of the eigenvalues If all the entries of the data matrix are multiplied by a constant a, then the
eigenvalues of Σp are multiplied by a2 , and so are the eigenvalues of Sp . Hence, if the eigenvalues of Sp are
divided by a factor a, Equation (M-P) remains valid if we change H∞ (x) into H∞ (ax). In practice, we scale
the empirical eigenvalues by l1, the largest eigenvalue of Sp. We solve our convex optimization problem
with the scaled eigenvalues to obtain H∞ (l1 x), from which we get H∞ (x) through easy manipulations.
The subsequent details describe how we solve our convex optimization problem, after rescaling of the
eigenvalues.

Choice of (zj, v(zj))  We have found that using 100 pairs (zj, v(zj)) was generally sufficient to obtain good and quick (10s-60s) results in simulations. Using more points is of course better. With 200 points, solving the problem took more time, but was still doable (40s-3mins). In the simulations and data analysis presented afterwards, we first chose the v(zj) and numerically found the corresponding zj using Matlab's optimization toolbox. We took v(zj) to have a real part equally spaced (every .02) on [0, 1], and an imaginary part of 10^{-2} or 10^{-3}. In other words, our v(zj)'s consisted of two (discretized) segments in C+, the second one being obtained from the first one by a vertical translation of 9 · 10^{-3}.
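To make the preceding description concrete, here is a sketch (ours, not the released Matlab code) of how one such pair (zj, v(zj)) can be produced. Given the sample eigenvalues, vFp is known in closed form, since the spectrum of XX∗/n consists of the p eigenvalues of X∗X/n together with n − p zeros (n > p here); a chosen value v(zj) is then mapped back to zj by solving vFp(zj) = v(zj) numerically, below with a coarse grid search followed by Newton steps. The target value and search region are illustrative only.

import numpy as np

rng = np.random.default_rng(0)
n, p = 500, 100
X = rng.standard_normal((n, p))
lam = np.linalg.eigvalsh(X.T @ X / n)                 # sample eigenvalues of S_p = X*X/n

def v_Fp(z):
    """Stieltjes transform of the spectrum of XX*/n: p sample eigenvalues plus n - p zeros."""
    return (np.sum(1.0 / (lam - z)) - (n - p) / z) / n

def d_v_Fp(z):
    return (np.sum(1.0 / (lam - z) ** 2) + (n - p) / z ** 2) / n

def invert_v(target):
    """Find z in C+ with v_Fp(z) = target: coarse grid search, then Newton refinement."""
    grid = [x + 1j * y for x in np.arange(-60.0, 5.0, 0.5) for y in np.arange(0.1, 5.1, 0.2)]
    z = min(grid, key=lambda g: abs(v_Fp(g) - target))
    for _ in range(100):
        z = z - (v_Fp(z) - target) / d_v_Fp(z)
    return z

target = 0.10 + 0.01j                                  # a prescribed value of v(z_j)
z_j = invert_v(target)
print(z_j, abs(v_Fp(z_j) - target))                    # the residual should be essentially zero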

Choice of interval to focus on  The largest (resp. smallest) eigenvalue of a p × p symmetric matrix S is a convex (resp. concave) function of the entries of the matrix. This is because l1(S) = sup_{‖u‖2=1} u′Su, where u is a vector in Rp. Hence l1(S) is the supremum of linear functionals of the entries of the matrix. Similarly, lp(S) = inf_{‖u‖2=1} u′Su, so lp(S) is a concave function of the entries of S. Note that the sample
covariance matrix Sp is an unbiased estimator of Σp . By Jensen’s inequality, we therefore have E(l1 (Sp )) ≥
l1 (E(Sp )) = λ1 (Σp ). In other words, l1 (Sp ) is a biased estimator of λ1 (Σp ), and tends to overestimate it.
Similarly, lp (Sp ) is a biased estimator of λp (Σp ) and tends to underestimate it. More detailed studies of
l1 and lp indicate that they do not fluctuate too much around their mean. Practically, as n → ∞, we
will have with large probability, lp ≤ λp and l1 ≥ λ1 . (In certain cases, concentration bounds can make
the previous statement rigorous.) Hence, after rescaling of the eigenvalues, it will be enough to focus on
probability measures supported on the interval [lp /l1 , 1] when decomposing H∞ (l1 x).

Choice of dictionary In the “smallest” implementation, we limit ourselves to a dictionary consisting of


point masses on the interval [lp /l1 , 1], with equal spacing of .005. We call ζp the length of this interval. In
larger implementations, we split the interval [lp/l1, 1] into dyadic intervals, getting, at scale k, 2^k intervals: [lp/l1 + j 2^{-k} ζp, lp/l1 + (j + 1) 2^{-k} ζp], for j = 0, . . . , 2^k − 1. We store the end points of all the intervals at all the scales from k = 2 to k = 8 for the coarsest implementation, and up to k = 10 for the finest. We implemented
dictionaries containing:
1. Point masses every .005 on [lp /l1 , 1], and probability measures supported on the dyadic intervals
described above that have constant density on these intervals.
2. Point masses every .005 on [lp /l1 , 1], and probability measures supported on the dyadic intervals
described above that have constant density on these intervals, as well as probability measures on
those dyadic intervals that have linearly increasing and linearly decreasing densities.
The simulations presented above were made with this latter choice of dictionary using scales up to 8.
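To make the overall procedure concrete, the following is a self-contained sketch (ours, in Python; the paper's implementation is in Matlab with MOSEK) of the core optimization over a dictionary consisting of point masses only. The modulus in the objective is bounded here by max(|real part|, |imaginary part|), which differs from it by at most a factor of √2 and turns the problem into a linear program; the zj are also chosen directly rather than through the v(zj)-first parametrization described above, and the rescaling by l1 is skipped. All of these simplifications are ours.

import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
n, p = 500, 100
X = rng.standard_normal((n, p))                       # Sigma_p = Id, so the target H_p is delta_1
lam = np.linalg.eigvalsh(X.T @ X / n)                 # sample eigenvalues

def v_Fp(z):                                          # Stieltjes transform of the spectrum of XX*/n
    return (np.sum(1.0 / (lam - z)) - (n - p) / z) / n

z_pts = np.linspace(0.05, 3.0, 40) + 0.1j             # points z_j in C+, close to the sample spectrum
v_pts = np.array([v_Fp(z) for z in z_pts])

atoms = np.arange(0.05, 2.55, 0.05)                   # dictionary: point masses delta_{x_k}
K, J = atoms.size, z_pts.size

# residual_j(w) = c_j - (A w)_j is affine in the weights w; we minimize its worst case over j.
c_vec = 1.0 / v_pts + z_pts
A_mat = (p / n) * atoms[None, :] / (1.0 + atoms[None, :] * v_pts[:, None])

# LP variables (w_1, ..., w_K, t): minimize t s.t. |Re residual_j| <= t and |Im residual_j| <= t.
obj = np.r_[np.zeros(K), 1.0]
rows, rhs = [], []
for M, c in ((A_mat.real, c_vec.real), (A_mat.imag, c_vec.imag)):
    rows.append(np.c_[M, -np.ones(J)]);  rhs.append(c)     #  (M w)_j - t <=  c_j
    rows.append(np.c_[-M, -np.ones(J)]); rhs.append(-c)    # -(M w)_j - t <= -c_j
A_ub, b_ub = np.vstack(rows), np.concatenate(rhs)
A_eq, b_eq = np.r_[np.ones(K), 0.0][None, :], [1.0]        # the weights sum to one
res = linprog(obj, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
              bounds=[(0, None)] * (K + 1), method="highs")
w = res.x[:K]
print("worst-case residual:", res.x[-1])
print("atoms with non-negligible weight:", atoms[w > 0.01])   # should cluster around 1

The released implementation instead solves the problem with MOSEK, uses the richer dictionaries described above, and works with the rescaled eigenvalues and the v(zj)-first parametrization.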

References
Akhiezer, N. I. (1965). The classical moment problem and some related questions in analysis. Translated
by N. Kemmer. Hafner Publishing Co., New York.

Anderson, T. W. (1963). Asymptotic theory for principal component analysis. Ann. Math. Statist. 34,
122–148.

Anderson, T. W. (2003). An introduction to multivariate statistical analysis. Wiley Series in Probability


and Statistics. Wiley-Interscience [John Wiley & Sons], Hoboken, NJ, third edition.

Bai, Z. D. (1999). Methodologies in spectral analysis of large-dimensional random matrices, a review.


Statist. Sinica 9, 611–677. With comments by G. J. Rodgers and Jack W. Silverstein; and a rejoinder
by the author.

Baik, J., Ben Arous, G., and Péché, S. (2005). Phase transition of the largest eigenvalue for non-null
complex sample covariance matrices. Ann. Probab. 33, 1643–1697.

Baik, J. and Silverstein, J. (2004). Eigenvalues of large sample covariance matrices of spiked population
models. arXiv:math.ST/0408165 .

Bickel, P. J. and Levina, E. (2004). Some theory of Fisher’s linear discriminant function, ‘naive Bayes’,
and some alternatives when there are many more variables than observations. Bernoulli 10, 989–1010.

Bickel, P. J. and Levina, E. (2006). Regularized estimation of large covariance matrices. Forthcoming
Technical Report .

Böttcher, A. and Silbermann, B. (1999). Introduction to large truncated Toeplitz matrices. Universitext.
Springer-Verlag, New York.

Boyd, S. and Vandenberghe, L. (2004). Convex optimization. Cambridge University Press, Cambridge.

Burda, Z., Görlich, A., Jarosz, A., and Jurkiewicz, J. (2004). Signal and noise in correlation matrix.
Physica A 343, 295–310.

Burda, Z., Jurkiewicz, J., and Waclaw, B. (2005). Spectral moments of correlated Wishart matrices.
Phys. Rev. E 71.

Campbell, J., Lo, A., and MacKinlay, C. (1996). The Econometrics of Financial Markets. Princeton
University Press, Princeton, NJ.

Chen, S. S., Donoho, D. L., and Saunders, M. A. (1998). Atomic decomposition by basis pursuit. SIAM
J. Sci. Comput. 20, 33–61 (electronic).

Durrett, R. (1996). Probability: theory and examples. Duxbury Press, Belmont, CA, second edition.

El Karoui, N. (To Appear). Tracy-Widom limit for the largest eigenvalue of a large class of complex
sample covariance matrices. The Annals of Probability. See also arxiv.PR/0503109.

Geman, S. (1980). A limit theorem for the norm of random matrices. Ann. Probab. 8, 252–261.

Geronimo, J. S. and Hill, T. P. (2003). Necessary and sufficient condition that the limit of Stieltjes
transforms is a Stieltjes transform. J. Approx. Theory 121, 54–60.

Gibbs, A. L. and Su, F. (2001). On choosing and bounding probability metrics. International Statistical
Review 70, 419–435.

Gray, R. M. (2002). Toeplitz and circulant matrices: A review. Available at http://ee.stanford.edu/~gray/toeplitz.pdf.

Grenander, U. and Szegö, G. (1958). Toeplitz forms and their applications. California Monographs in
Mathematical Sciences. University of California Press, Berkeley.

Hastie, T., Tibshirani, R., and Friedman, J. (2001). The Elements of Statistical Learning. Springer
Series in Statistics. Springer-Verlag, New York. Data mining, inference, and prediction.

Hiai, F. and Petz, D. (2000). The semicircle law, free random variables and entropy, volume 77 of
Mathematical Surveys and Monographs. American Mathematical Society, Providence, RI.

Johnstone, I. (2001). On the distribution of the largest eigenvalue in principal component analysis. Ann.
Statist. 29, 295–327.

Jonsson, D. (1982). Some limit theorems for the eigenvalues of a sample covariance matrix. J. Multivariate
Anal. 12, 1–38.

Laloux, L., Cizeau, P., Bouchaud, J.-P., and Potters, M. (1999). Noise dressing of financial correlation
matrices. Phys. Rev. Lett. 83, 1467–1470.

Lax, P. D. (2002). Functional analysis. Pure and Applied Mathematics (New York). Wiley-Interscience
[John Wiley & Sons], New York.

Ledoit, O. and Wolf, M. (2004). A well-conditioned estimator for large-dimensional covariance matrices.
J. Multivariate Anal. 88, 365–411.

Marčenko, V. A. and Pastur, L. A. (1967). Distribution of eigenvalues in certain sets of random


matrices. Mat. Sb. (N.S.) 72 (114), 507–536.

Mardia, K. V., Kent, J. T., and Bibby, J. M. (1979). Multivariate analysis. Academic Press [Harcourt
Brace Jovanovich Publishers], London. Probability and Mathematical Statistics: A Series of Monographs
and Textbooks.

MOSEK (2006). MOSEK Optimization Toolbox. Available at www.mosek.com.

Paul, D. (To Appear). Asymptotics of sample eigenstructure for a large dimensional spiked covariance
model. Statistica Sinica .

Silverstein, J. W. (1995). Strong convergence of the empirical distribution of eigenvalues of large-


dimensional random matrices. J. Multivariate Anal. 55, 331–339.

Silverstein, J. W. and Bai, Z. D. (1995). On the empirical distribution of eigenvalues of a class of


large-dimensional random matrices. J. Multivariate Anal. 54, 175–192.

Wachter, K. W. (1978). The strong limits of random matrix spectra for sample matrices of independent
elements. Ann. Probability 6, 1–18.

Yin, Y. Q., Bai, Z. D., and Krishnaiah, P. R. (1988). On the limit of the largest eigenvalue of the
large-dimensional sample covariance matrix. Probab. Theory Related Fields 78, 509–521.
