The Use of Discrete Data in PCA:
Theory, Simulations, and Applications to Socioeconomic Indices
Stanislav Kolenikov
Gustavo Angeles
Abstract
The last several years have seen a growth in the number of publications in
economics that use principal component analysis (PCA), especially in the area of
welfare studies. This paper gives an introduction to principal component analysis,
describes how discrete data can be incorporated into it, and reviews the effects of
the discreteness of the observed variables on PCA. The concepts of polychoric and
polyserial correlations are introduced, with references to the existing literature
establishing their statistical properties. A large simulation study is carried out to
shed light on some of the issues raised in the theoretical part of the paper. The
simulation results show that the currently used method of running PCA on a set of
dummy variables, as proposed by Filmer & Pritchett (2001), is inferior to other
methods of analyzing discrete data, both simple ones, such as using the ordinal
variables directly, and more sophisticated ones, such as using the polychoric
correlations.
Keywords: welfare indices, principal component analysis, PCA, polychoric
correlations, rank correlations, living standards, socio-economic status
JEL classification: C19, C49, I32
Introduction
One of the recurrent ideas and needs in development economics at the micro
level is to assess the socio-economic status (SES) of a household or an individual. Such
estimates usually serve as an input to further analyses, such as inequality or poverty
analysis, tabulation of population characteristics by quintiles or deciles, or regressions
that involve welfare as an explanatory or dependent variable and aim at explaining
household health status or certain behaviors.
Broadly speaking, socio-economic status involves many dimensions: the education
and occupation of family members, their access to goods and services, and the welfare
of the household as measured by the accessibility of those goods and services. In this
paper, we concentrate on the economic components of socio-economic status.
Often, straightforward numeric measures of welfare such as household income
or consumption are not available or not reliable, especially in non-market economies
where a large fraction of economic activity is carried out outside of the market. This
does not have to be illegal black-market activity: family farming, where family members
are not paid a salary but rather consume a large portion of their produce, is a good
example.

In such situations, the researcher has to rely on other proxies for household
wealth and/or consumption and use those to derive an index of household welfare.
Such proxies must be easier to observe than income, and the possession of durable goods
and living conditions are used more and more often as such proxies: the interviewer
can simply observe and record the household's status, or ask sufficiently simple questions
such as "Do you own a TV set?" or "What is the source of the drinking water in your
house?" Such variables, with a small number of clear response categories, suffer much
less measurement and reporting error than income or expenditure does, although they
still contain a lot of measurement error in terms of measuring socio-economic status.
The use of a single proxy is likely to lead to unreliable and/or unstable results, so a
natural idea is to combine a number of such proxies to compensate for the various
measurement errors that stand between each proxy and the concept it is supposed to
measure. Fortunately, the researcher can observe the possession by a household of
several durable goods, as well as record the characteristics of the dwelling. That way,
between 10 and 20 characteristics (possibly with several levels each) can be observed,
and the analyst then needs a method for aggregating such proxies. By far the most
popular method is to assign coefficients, or weights, to the observed variables and sum
them up. The weights may come from economic considerations, such as assigning a
monetary value to the durable goods; from statistical considerations, such as principal
component analysis; or from other considerations, such as setting all coefficients to one.
That way, the researcher obtains a univariate measure of welfare. It may not have a
direct interpretation, say, in dollar terms, unless some measures of income, expenditure,
or wealth are used explicitly in the analysis as nominal anchors and have their
coefficients set to one. Such a measure therefore cannot be used directly for poverty
analysis in terms of relating somebody's disposable resources to an absolute figure like
$1 a day, but it finds use in ranking individuals, in making decisions regarding the
allocation of projects that are to benefit the poor, or as an input to other research
problems where the researcher is interested in the relation between SES and observed
behaviors.
A thorough review of the existing methods of SES assessment in application to
fertility studies is given in Bollen, Glanville & Stecklov (2001) and Bollen, Glanville
& Stecklov (2002). They note that measures of SES ". . . vary widely within and between
disciplines regardless of the outcome," and that empirical implementations of SES
". . . are often driven by data availability and the empirical performance of indicators
as much as they are by theoretical groundwork." Upon providing a thorough review of
the methods and concepts related to SES, such as Friedman's permanent income thesis,
they compare the performance, in terms of external validity (the explanatory power in
a regression with fertility as a dependent variable), of a simple sum of the assets (i.e.,
the total number of durable goods possessed by the household), a sum of current values
(as assessed by the household itself), a sum of median values (where the median value of
the asset across all households is taken as the market price of an item), principal
components, as well as measures based on single variables such as occupational prestige
or expenditure per adult. They found that the best fitting measures were the principal
component measure and a simple sum of asset indicators.
Principal component analysis was developed in the early 20th century (Pearson
1901b, Hotelling 1933) in psychometrics and multivariate statistical analysis for similar
purposes of aggregating information scattered across many numeric measures, such as
student scores on several tests. It is a standard multivariate technique described in such
textbooks as Anderson (2003), Mardia, Kent & Bibby (1980), Flury (1988), Jolliffe
(2002) and Rencher (2002). In economics, the method has been applied to studies of
cointegration and spatial convergence (Harris 1997, Drakos 2002), development
(Caudill, Zanella & Mixon 2000), panel data (Bai 1993, Reichlin 2002), forecasting
(Stock & Watson 2002), simultaneous equations (Choi 2002), and the economics of
education (Webster 2001). See also the reviews of factor models in Bai (1993) and
Wansbeek & Meijer (2000). Krelle (1997) gives a review of a number of methods aimed
at the estimation of unobservable variables, including PCA.
One of the earliest and most influential papers in development economics and
population studies to use PCA for the construction of socio-economic indices was Filmer
& Pritchett (2001) (and its earlier working paper version, Filmer & Pritchett (1998)).
They used data on household assets (durable goods of primary importance, such as a
clock, bicycle, radio, television, sewing machine, motorcycle, refrigerator, or car), the
type of access to hygienic facilities (sources of drinking water, types of toilet), the
number of rooms in the dwelling, and the construction materials used in the dwelling.
The methodology was quickly accepted by the World Bank (Gwatkin, Rustein,
Johnson, Suliman & Wagstaff 2003a, Gwatkin, Rustein, Johnson, Suliman & Wagstaff
2003b) and the ORC/Macro Demographic and Health Surveys¹ as the way to assess the
socioeconomic status of a household based on the household assets (electricity, radio,
television, telephone, refrigerator, bicycle, motorcycle, car or truck) and facilities
(source of drinking water, toilet type, source of heat for cooking, and materials used for
flooring, walls, and roofing).

¹ See http://www.measuredhs.com.
Despite the nice title, "Estimating wealth effects without expenditure data, or
tears," the paper by Filmer & Pritchett (2001) has not quite solved all of the
methodological problems. The primary criticism that can be raised against the method
used in the aforementioned papers is that the use of dummy variables in the PCA is not
justified, as PCA as such is only suitable for continuous data. It was developed for
samples from the multivariate normal distribution (Hotelling 1933, Anderson 2003,
Mardia et al. 1980), and most of the theoretical results, including the implicitly used
consistency of the estimates of the factor loadings, were derived under the normality
assumption. See Appendix A for the technical results.
In fact, Filmer & Pritchett (2001) go further than just using the discrete welfare
indicators as if they were continuous. A separate dummy variable is used for each
category of a discrete variable, so the variable "source of drinking water," with
categories 1 for lake or stream, 2 for tube well, 3 for pipe outside the dwelling, and 4
for pipe inside the dwelling, will be represented by four dummies (or three, if perfect
collinearity is to be avoided; see the argument about numerical stability in Appendix A).
The reason for doing so may have been the common recommendation to use individual
binary indicators whenever a categorical variable is to be used in regression analysis.
The recommendation is certainly warranted when the variable is an explanatory one.
For the purposes of PCA, however, we want to stress that the input variables should be
treated as dependent ones. The analysis must be modified accordingly, since the assets
used in the PCA are indicators, or outcomes, of welfare, rather than determinants of it.
One consequence of using the dummy indicators in PCA for the construction of
welfare indices is that doing so introduces a lot of spurious correlations whenever there
are more than two categories for a variable (and thus more than one dummy variable per
categorical factor is created). The dummy variables produced from the same factor are
negatively correlated, although the strength of the dependence declines with the number
of categories. The PCA method then gets confused as to whether the main source of the
common variation in the data is the correlation with the unobserved welfare (as we want
it to be, and we want PCA to capture this cross-dependence) or the correlation among the
variables that belong to the same categorical variable (as introduced by the researcher
through the use of dummy indicators). Even if the former is a strong relation, it may get
blurred by the latter, and thus these spurious correlations may generate incorrect
estimates of the socioeconomic index. The goodness of fit measures will deteriorate,
too, since the PCA sees a noisier covariance matrix.
Besides, the Filmer-Pritchett procedure loses all of the ordinal information, if there
was any. It can be argued that one of the strengths of the Filmer-Pritchett method is
that it does not make any assumptions regarding the ordering of the categories. We
tend to think, however, that if additional information is available, it can and should be
incorporated, as it helps produce more accurate results. As is always the case in
statistics and econometrics, model-based methods produce more efficient results for
well-specified models than semi- or non-parametric methods do. Here, the weakest form
of the model assumptions used is that the researcher can provide an ordering of the
categories based on substantive knowledge of the problem.
There is a substantial literature on the use of discrete data in multivariate methods.
The foundations for the use of ordinal data and the foundations of principal component
analysis were developed at the same time and by the same person. The latter was done
in Pearson (1901b), while Pearson (1901a) introduced the tetrachoric correlation for
a two-by-two contingency table as an improved measure of correlation between two
binary variables. Further work, with major contributions by Pearson & Pearson (1922)
and Olsson (1979), introduced the concepts of polychoric and polyserial correlations as
the maximum likelihood estimates of the underlying correlation between unobserved
normally distributed continuous variables based on their discretized versions. Other
literature, such as Bollen & Barb (1981), Babakus, Ferguson, Jr. & Joereskog (1987),
Dolan (1994), and DiStefano (2002), among others, has looked at the effects of
categorization in the closely related area of structural equation modelling with latent
variables, also known as linear structural relations. We have found only one application
of the polychoric correlations in economic indexing systems (Bartolo 2000).
Yet another aspect of PCA that we shall only briefly mention in passing is the effect
of complex sample design on principal component analysis. Skinner, Holmes & Smith
(1986) show that an analysis that does not take the design into account leads to biased
estimates for disproportionate designs (i.e., those where the probabilities of selection
differ across members of the population).
The goal of this paper is to provide an overview of PCA and examine how discrete
data can appropriately be used with it. The other purpose of the paper is to examine
the performance of different procedures for using PCA with the discrete data typically
available in household health surveys for measuring SES. We present a number of small
analytical examples that demonstrate the results of PCA with simple discrete
distributions. For more complex situations, we designed and carried out a large
simulation project, which also confirmed that the Filmer-Pritchett procedure yields
results inferior to other methods.
The remainder of the paper is organized as follows. The next section reviews the
main procedures of principal component analysis, including the general formulation
(section 2.1); the specific features of discrete data that make it more difficult to
carry out PCA with such data (section 2.2); and the definitions and properties of the
polychoric and polyserial correlations (section 2.3). Then section 3 introduces our
Monte Carlo study. Section 3.1 presents the setup of the simulations comparing the
polychoric correlation and related approaches with the Filmer-Pritchett approach of
PCA on dummy variables. Section 3.2 presents the numeric findings, and section 3.3
demonstrates some of them visually. Section 4 concludes.
2 Principal component analysis

2.1 The general formulation

Let x be a p-dimensional random vector with a finite covariance matrix Σ. Principal
component analysis solves the sequence of optimization problems

$$a_1 = \arg\max_{a:\, \|a\|=1} \mathbb{V}[a'x], \quad \ldots, \quad
a_k = \arg\max_{\substack{a:\, \|a\|=1,\\ a \perp a_1, \ldots, a_{k-1}}} \mathbb{V}[a'x],
\quad \ldots \qquad (1)$$
The maxima are those of a convex function over a compact set, and thus exist; they are
unique, up to the change of the sign of all elements of a_k, if there are no perfect
collinearities in the data. The linear combination a_k'x is referred to as the k-th
principal component (PC).

The motivation behind this problem is that the directions of greatest variability give
the most information about the configuration of the data in multidimensional space².
The first principal component will have the greatest variance and extract the largest
amount of information from the data; the second component will be orthogonal to the
first one, will have the greatest variance in the subspace orthogonal to the first
component, and will extract the greatest information in that subspace; and so on. Also,
the principal components minimize the L² norm (sum of squared deviations) of the
residuals from the projection onto linear subspaces of dimensions 1, 2, etc. The first PC
gives a line such that the projections of the data onto this line have the smallest sum of
squared deviations among all possible lines. The first two PCs define a plane that
minimizes the sum of squared deviations of the residuals, and so on.

The principal component analysis can be carried out both for theoretical distributions
and for actual data. In the latter case, one analyzes the empirical covariance matrix.
Plotting the first several components against each other can often give good insight
into the structure of the data: the presence of clusters, nonlinearities, outliers, etc.

There are a number of practical choices that researchers have to make when performing
the principal component analysis. The first one is which variables to include in the
analysis. The desirable choice is that all variables describe a common phenomenon, and
the primary application we are looking at in this paper is welfare analysis. Since PCA
was originally developed for the multivariate normal distribution and samples from it,
PCA will work best on variables that are continuous and at least approximately normal.

² This statement makes exact sense when the vector x has a multivariate normal distribution, and the
information is the Kullback-Leibler information (Kullback 1997) between the original distribution and the
joint distribution of the first, . . . , k-th components.
Another important choice to be made is whether the data need to be standardized.
If the original variables have wildly different scales, then the PCA will simply pick
the variable that has the highest variance as the direction of greatest variability.
Most of the time, however, the researcher wants to find relations between the variables,
and for that purpose one should analyze the standardized data, in which each variable has
mean zero and variance 1, so that the principal component analysis indeed analyzes the
dependencies among the variables rather than the differences in measurement scales. The
analysis of the standardized data is equivalent to the analysis of the correlation matrix
of the original data. This is also the default option in most statistical packages.³
The solution to equation (1) is found by solving the eigenproblem for the covariance
(or correlation) matrix Σ: find the λ's and a's (with the identification condition
‖a‖ = 1) such that

$$\Sigma a = \lambda a \qquad (2)$$
Some background on eigenproblems is provided in Appendix B.
The solution of the eigenproblem (2) for the covariance or, more commonly, the
correlation matrix gives the set of principal component weights a (also referred to as
factor loadings), the linear combinations a'x (referred to as scores), and the
eigenvalues λ₁ ≥ . . . ≥ λ_p. It is easy to establish that V[a_k'x] = λ_k given that
V[x_j] = 1 (which is the case for standardized data, or the correlation matrix), so that
the eigenvalues are the variances of the corresponding linear combinations. The linear
combination that corresponds to the largest eigenvalue is then the one that has the
greatest variance. Note that the principal components are only defined up to a sign, so
in applied work it is always worth checking whether the principal components correspond
to the desired direction of the feature variation; for instance, higher values of the
score should represent richer households. Quite often the first component can be
interpreted as a measure of size, or of the degree of expression of a certain feature,
while the second, the third, and subsequent components may be interpreted as describing
some structure of that feature.
³ See also the discussion in Appendix E. Anderson (1963) finds deriving asymptotic results for PCA on
a correlation matrix more difficult, and attributes that to the difficulties of interpreting correlations
as compared to the interpretation of covariances.
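To make the computation concrete, here is a minimal sketch of PCA on the correlation
matrix in Python (the simulations in this paper were run in Stata; the function name
pca_from_correlation and the toy data below are ours, for illustration only):

    import numpy as np

    def pca_from_correlation(X):
        """PCA of standardized data via the eigendecomposition of the
        correlation matrix, as in equation (2)."""
        R = np.corrcoef(X, rowvar=False)           # p x p correlation matrix
        eigvals, eigvecs = np.linalg.eigh(R)       # eigh returns ascending order
        order = np.argsort(eigvals)[::-1]          # re-sort descending
        eigvals, eigvecs = eigvals[order], eigvecs[:, order]
        Z = (X - X.mean(axis=0)) / X.std(axis=0)   # standardized data
        scores = Z @ eigvecs                       # principal component scores
        explained = eigvals / eigvals.sum()        # shares of explained variance
        return eigvals, eigvecs, scores, explained

    rng = np.random.default_rng(0)
    X = rng.multivariate_normal([0, 0, 0],
                                [[1, .6, .3], [.6, 1, .4], [.3, .4, 1]], size=500)
    eigvals, eigvecs, scores, explained = pca_from_correlation(X)
    print(explained)   # the first entry is the share explained by the first PC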
2.2 Discrete data
A practically important violation of the normality assumption underlying PCA occurs
when the data are discrete. There are several kinds of discrete data one can encounter
in empirical analysis. Most often the discrete data are binary, i.e., variables that can
take only one of two values, such as gender (male/female) or ownership of a car.
In the PCA case, however, one can find an additional justification for this approach by
noting that using the ordered categories can be viewed as computing Spearman's rank
correlation ρ_S instead of Pearson's moment correlation in the analysis. Then, to be
consistent, one should compute Spearman's ρ_S for each pair of variables, and run the
PCA on the matrix of rank correlations (Lebart, Morineau & Warwick 1984, Sec. I.3.4).
Rank correlations are robust to non-normality of the variables, which is important both
for discrete data and for income data, which are usually heavily skewed unless
transformed. They are also robust to outliers, which may not be much of an issue for
discrete variables, but may be one for skewed distributions such as that of raw income
data. See Appendix C for some details on rank correlations.
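A minimal sketch of this variant follows; it assumes scipy.stats.spearmanr, which
returns the full p × p rank correlation matrix when given a data matrix with more than
two columns:

    import numpy as np
    from scipy.stats import spearmanr

    def pca_on_rank_correlations(X):
        """PCA on the matrix of Spearman rank correlations rho_S
        instead of the Pearson moment correlations."""
        R_s, _ = spearmanr(X)               # p x p rank correlation matrix
        eigvals, eigvecs = np.linalg.eigh(R_s)
        order = np.argsort(eigvals)[::-1]   # descending eigenvalues
        return eigvals[order], eigvecs[:, order]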
$$\mathbb{V}[\delta] = \Theta = \operatorname{diag}[\theta_1, \ldots, \theta_K],
\qquad \mathbb{V}[\xi] = 1 \qquad (6)$$
If the observed x_k's are ordinal with categories 1, . . . , K_k, then it is assumed that
they are obtained by discretizing the underlying x*_k according to the set of thresholds
α_{k,1}, . . . , α_{k,K_k−1}:

$$x_k = r \quad \text{if} \quad \alpha_{k,r-1} < x_k^* \le \alpha_{k,r} \qquad (7)$$

where α_{k,0} = −∞ and α_{k,K_k} = +∞. So the x's are dependent variables, and if the
ξ's were observed, we would use ordered dependent variable models to analyze the
relations between ξ and x.
Let us illustrate the above notation with a simple example, which will also show some
of the problems that arise from the discrete character of the data, such as excessive
skewness and kurtosis.
Example 1. Consider two ordinal variables x1 (four categories) and x2 (three
categories) obtained by discretizing a bivariate normal distribution, as shown on Fig. 1.

Table 1: Cell proportions and marginal distributions for Example 1.

                x1 = 1    x1 = 2    x1 = 3    x1 = 4    Marginal
    x2 = 3      0.2%      2.1%      6.9%      6.7%      15.9%
    x2 = 2      0.8%      8.2%      20.6%     14.4%     44.0%
    x2 = 1      1.3%      10.1%     19.0%     9.7%      40.1%
    Marginal    2.3%      20.4%     46.5%     30.8%     100%
On Fig. 1, the parameters are as follows: α_{1,1} = −2, α_{1,2} = −0.75, α_{1,3} = 0.5;
α_{2,1} = −0.25, α_{2,2} = 1, and the correlation of the underlying bivariate normal is
0.2. The cell proportions and the marginals are given in Table 1. Note that the joint
distribution of x1 and x2 is an example of opposite skewness: x1 is skewed to the left
(with skewness −0.40), while x2 is skewed to the right (with skewness 0.39). Another
important feature of these discretized bivariate data is high kurtosis: the kurtosis of
x1 is 2.52, and the kurtosis of x2 is 2.04.
Figure 1: Example 1. The underlying bivariate normal distribution with the
discretization thresholds; the axes mark the categories x1 = 1, . . . , 4 and
x2 = 1, 2, 3.
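The entries of Table 1 can be reproduced directly from the bivariate normal CDF, each
cell probability being a difference of four CDF values (formalized as equation (9) in
Example 2 below). A sketch in Python, with ±8 standing in for ±∞:

    import numpy as np
    from scipy.stats import multivariate_normal

    rho = 0.2
    biv = multivariate_normal(mean=[0.0, 0.0], cov=[[1.0, rho], [rho, 1.0]])

    # Thresholds of Example 1, padded with +/-8 in place of +/-infinity
    a1 = np.array([-8.0, -2.0, -0.75, 0.5, 8.0])   # x1: four categories
    a2 = np.array([-8.0, -0.25, 1.0, 8.0])         # x2: three categories

    # P[x1 = i, x2 = j] as a difference of four CDF values
    P = np.empty((4, 3))
    for i in range(4):
        for j in range(3):
            P[i, j] = (biv.cdf([a1[i + 1], a2[j + 1]])
                       - biv.cdf([a1[i], a2[j + 1]])
                       - biv.cdf([a1[i + 1], a2[j]])
                       + biv.cdf([a1[i], a2[j]]))

    print(np.round(100 * P.T[::-1], 1))      # the body of Table 1
    print(np.round(100 * P.sum(axis=1), 1))  # marginal distribution of x1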
There are a number of implications of the discrete character of the data if the observed
discrete x_k's are used directly in the standard principal component analysis. The
problems related to discrete data have received considerable attention in quantitative
sociology (Olsson 1979, Bollen & Barb 1981, Johnson & Creech 1983, Babakus et al. 1987,
Dolan 1994, DiStefano 2002).

First, the distributional assumptions (normality) are seriously violated. Obviously,
discrete data do not have a density (at least with respect to Lebesgue measure). Also,
despite their finite range, discrete data tend to have high skewness and kurtosis,
especially if the majority of the data points are concentrated in a single category.
Even with the moderate discretization in Example 1, the skewness and kurtosis were not
negligible. PCA only addresses the second moments of the data, in essence approximating
the real data with the normal distribution with the same mean and covariance matrix.
Since the distribution of the data is not multivariate normal, the standard results on
the asymptotic distributions of the eigenvalues and eigenvectors⁵ need to be modified
(Davis 1977). The normality will still hold, as the eigenvalues and eigenvectors are
functions of the covariance matrix, whose entries are asymptotically normal since they
are sums of i.i.d. variables. The parameters of the resulting asymptotic normal
distribution, however, depend crucially on the fourth moments of the data generating
distribution.

⁵ See Appendix A for the existing results on the asymptotic distributions of the eigenvalues and
eigenvectors.
A second, and maybe even more important, consequence of the discreteness is that the
covariances or correlations between the discretized versions x1, x2 of the variables of
interest are not equal to the true covariances or correlations of the (unobserved)
underlying variables x*1, x*2. They mostly tend to be biased towards 0, as the following
example demonstrates.
Example 2. Consider two binary variables x1, x2 obtained by dichotomizing the
components of a pair (x*1, x*2) that came from a bivariate normal distribution with
standard normal marginals. Denote corr(x*1, x*2) = ρ, and let

$$\Phi_2(s, t; \rho) = \int_{-\infty}^{s} \int_{-\infty}^{t}
\frac{1}{2\pi\sqrt{1-\rho^2}}
\exp\left[ -\frac{u^2 - 2\rho u v + v^2}{2(1-\rho^2)} \right] du \, dv \qquad (8)$$

be the cdf of the bivariate standard normal distribution. If the thresholds are given by
α_{1,1} and α_{2,1} (α_{i,0} = −∞, α_{i,2} = +∞, i = 1, 2), then the proportion in cell
(i, j) is

$$\pi_{i,j} = \pi(i, j; \rho, \alpha) = \operatorname{Prob}[x_1 = i, x_2 = j]
= \Phi_2(\alpha_{1,i}, \alpha_{2,j}; \rho) - \Phi_2(\alpha_{1,i-1}, \alpha_{2,j}; \rho)
- \Phi_2(\alpha_{1,i}, \alpha_{2,j-1}; \rho)
+ \Phi_2(\alpha_{1,i-1}, \alpha_{2,j-1}; \rho) \qquad (9)$$
Coding the binary variables as 0/1, the moments of the observed variables are

$$\mathbb{E}[x_1] = \pi_{10} + \pi_{11}, \quad
\mathbb{V}[x_1] = (\pi_{10} + \pi_{11})(\pi_{01} + \pi_{00}), \quad
\mathbb{E}[x_2] = \pi_{01} + \pi_{11}, \quad
\mathbb{V}[x_2] = (\pi_{01} + \pi_{11})(\pi_{10} + \pi_{00}) \qquad (10)$$
The dependence of the observed correlation on the underlying correlation and the
threshold structure in the 2×2 case is shown on Fig. 2. The lines from top to bottom are
based on thresholds α_{1,1} = 0 and α_{2,1} = 0 (and hence marginal proportions of 0's
and 1's equal to one half, labelled "Half-half" on the plot); 0.67 and 0.67 (which gives
a proportion of 0's of about 3/4 and a proportion of 1's of about 1/4, labelled "Upper Q
- Upper Q"); 0 and 0.67 (labelled "Half - Upper Q"); and −0.67 and 0.67, which gives
opposite skewness (labelled "Upper Q - Lower Q"). All of the lines lie below the
diagonal, which is in agreement with the above suggestion that the correlations are
biased towards zero. If the thresholds are not the same, the observed correlations do
not reach 1 even if the underlying correlation is 1. The worst is the case of opposite
skewness of the binary indicators: the correlations then do not exceed 0.33.
Example 2 shows that in the extreme case of dichotomizing a continuous distribution, the
correlations are substantially underestimated. Even if the underlying variables x*1, x*2
are perfectly related (i.e., the correlation between them is 1), their discretized
manifestations may show a correlation that is far from 1 unless the categorization
thresholds match exactly. For all values of the correlation, however, the correlation
based on the discretized versions x1, x2 is lower than the correlation of the underlying
"starred" variables. In the more general case of more than two categories, categorization
can be viewed as a measurement error with nonlinear properties, and the authors are not
aware of a general result in the literature showing that correlations decline because of
discretization.
Figure 2: Correlation of the bivariate binary data obtained from a bivariate normal
distribution, plotted against the underlying correlation, for the threshold settings
"Half-half", "Half - Upper Q", "Upper Q - Upper Q", and "Upper Q - Lower Q".
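The curves of Fig. 2 are easy to trace numerically. A sketch, assuming the setup of
Example 2 (the helper observed_corr is ours):

    import numpy as np
    from scipy.stats import multivariate_normal, norm

    def observed_corr(rho, t1, t2):
        """Correlation of the binary indicators 1{x1* > t1}, 1{x2* > t2}
        when (x1*, x2*) is standard bivariate normal with correlation rho."""
        # P[x1* > t1, x2* > t2] by inclusion-exclusion on the CDF
        p11 = (1 - norm.cdf(t1) - norm.cdf(t2)
               + multivariate_normal(cov=[[1, rho], [rho, 1]]).cdf([t1, t2]))
        p1, p2 = 1 - norm.cdf(t1), 1 - norm.cdf(t2)   # marginal means
        return (p11 - p1 * p2) / np.sqrt(p1 * (1 - p1) * p2 * (1 - p2))

    # Opposite skewness ("Upper Q - Lower Q"): the observed correlation
    # stays near 1/3 even as the underlying correlation approaches 1
    print(observed_corr(0.99, -0.67, 0.67))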
off-diagonal entry of the correlation matrix. Thus the combination of weights that gives
larger weights to those categories (weights of different signs, since the correlation in
question is negative) will produce a larger variance. This also seems to be a general
result supported by empirical evidence on data sets with dummy variables: the first
principal component tends to connect the most populated categories, and the following
components then add the next most populated ones.
Finally, the natural ordering of the categories is not generally reproduced by principal
component analysis, so the only condition that identifies the ordering is the use of
monotone variables for which higher values really do mean higher SES. Continuous
variables such as income, expenditure, the value of the property, etc., will serve best,
although even binary ownership indicators tend to produce reasonable results in practice.
Otherwise, unless the two largest categories are the poorest and the richest members of
the population, the first principal component will fail to give a meaningful direction
of the welfare change, and the scores with low counts will not be well reproduced.
2.3 Polychoric and polyserial correlations
This section introduces an alternative approach to the analysis of discrete data in PCA,
and in particular to computing the correlations between two ordinal variables. The
approach originated in Pearson (1901a) and was further developed in Pearson & Pearson
(1922) and Olsson (1979). In fact, it is very similar to the assumptions one makes in
deriving an ordered probit model (Maddala 1983, Wooldridge 2002). A general treatment is
given in Joreskog (2004b) for the LISREL software (SSI 2004), which he originated and
which remains the leader in the multivariate analysis of ordinal variables.
Suppose two ordinal variables x1, x2 are obtained by categorizing two variables x*1, x*2
with distribution

$$\begin{pmatrix} x_1^* \\ x_2^* \end{pmatrix}
\sim N\left( 0, \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix} \right) \qquad (11)$$

The categorizing thresholds for the two variables are given by α_{1,0} = −∞ < α_{1,1} <
. . . < α_{1,K₁−1} < α_{1,K₁} = ∞ and α_{2,0} = −∞ < α_{2,1} < . . . < α_{2,K₂−1} <
α_{2,K₂} = ∞, so that x_i = k when α_{i,k−1} < x*_i ≤ α_{i,k}, i = 1, 2. Then the
theoretical proportions of the data in each cell can be found as in (9). (See also
Example 1.)
Assuming that the observations are i.i.d., the likelihood can be written down as

$$L(\rho, \alpha) = \prod_{i=1}^{N} \pi(x_{i,1}, x_{i,2}; \rho, \alpha)
= \prod_{m=1}^{K_1} \prod_{l=1}^{K_2} \pi(m, l; \rho, \alpha)^{n_{ml}} \qquad (12)$$

$$\ln L = \sum_{i=1}^{N} \ln \pi(x_{i,1}, x_{i,2}; \rho, \alpha) \qquad (13)$$
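For illustration, a minimal sketch of the estimation in Python follows. It uses the
two-step approach of Olsson (1979), estimating the thresholds from the cumulative
marginal proportions and then maximizing (13) over ρ alone, rather than the full ML over
all parameters; the helper polychoric is ours:

    import numpy as np
    from scipy.stats import norm, multivariate_normal
    from scipy.optimize import minimize_scalar

    def polychoric(table):
        """Two-step polychoric correlation from a K1 x K2 contingency
        table: thresholds from the marginals, then rho maximizes (13)."""
        table = np.asarray(table, dtype=float)
        n = table.sum()
        # Step 1: alpha = Phi^{-1}(cumulative marginal proportions);
        # clip to +/-8, which stands in for +/-infinity
        a1 = norm.ppf(np.concatenate(([0.0], np.cumsum(table.sum(axis=1)) / n)))
        a2 = norm.ppf(np.concatenate(([0.0], np.cumsum(table.sum(axis=0)) / n)))
        a1, a2 = np.clip(a1, -8, 8), np.clip(a2, -8, 8)

        def negloglik(rho):
            biv = multivariate_normal(cov=[[1.0, rho], [rho, 1.0]])
            ll = 0.0
            for i in range(table.shape[0]):
                for j in range(table.shape[1]):
                    # cell probability by the rectangle formula (9)
                    pij = (biv.cdf([a1[i + 1], a2[j + 1]])
                           - biv.cdf([a1[i], a2[j + 1]])
                           - biv.cdf([a1[i + 1], a2[j]])
                           + biv.cdf([a1[i], a2[j]]))
                    ll += table[i, j] * np.log(max(pij, 1e-300))
            return -ll

        # Step 2: one-dimensional likelihood maximization over rho
        return minimize_scalar(negloglik, bounds=(-0.999, 0.999),
                               method="bounded").x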
Two tests are available for the distributional assumptions underlying the polychoric
correlation coefficient. The first test is the likelihood ratio test of the saturated
model that does not make any distributional assumptions (the cell proportions) against
the normality-implied one:

$$LR = 2 \sum_{m=1}^{K_1} \sum_{l=1}^{K_2}
n_{ml} \ln \frac{n_{ml}}{n \, \pi(m, l; \hat\rho, \hat\alpha)} \qquad (15)$$

where n_{ml} = |{i : x_{i,1} = m, x_{i,2} = l}| is the number of observations in the
cell identified by the m-th and l-th categories of variables x1 and x2. The second test
is the Pearson goodness of fit test for distributions:

$$X^2 = \sum_{m=1}^{K_1} \sum_{l=1}^{K_2}
\frac{(n_{ml} - n \, \pi(m, l; \hat\rho, \hat\alpha))^2}
{n \, \pi(m, l; \hat\rho, \hat\alpha)} \qquad (16)$$
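A sketch of both tests, given the observed table and the cell probabilities π̂ implied by
(9) at the estimates; the degrees of freedom K₁K₂ − K₁ − K₂ count the K₁K₂ − 1 free
cells minus the K₁ + K₂ − 1 estimated thresholds and ρ:

    import numpy as np
    from scipy.stats import chi2

    def fit_tests(table, pi_hat):
        """LR test (15) and Pearson X^2 test (16) of the normality-implied
        cell probabilities pi_hat against the saturated multinomial."""
        table = np.asarray(table, dtype=float)
        n = table.sum()
        pos = table > 0                  # 0 * log(0) is treated as 0
        lr = 2 * np.sum(table[pos] * np.log(table[pos] / (n * pi_hat[pos])))
        x2 = np.sum((table - n * pi_hat) ** 2 / (n * pi_hat))
        k1, k2 = table.shape
        df = k1 * k2 - k1 - k2
        return lr, x2, chi2.sf(lr, df), chi2.sf(x2, df)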
An option that lies between the full-fledged polychoric correlation analysis and the
analysis based on the ordinal indicators is to estimate the mean of the underlying
normal variable x* conditional on a particular category of the observed ordinal
indicator, x = j:

$$\mathbb{E}[x^* \mid x = j]
= \frac{\int_{\alpha_{j-1}}^{\alpha_j} u \, \phi(u) \, du}
{\Phi(\alpha_j) - \Phi(\alpha_{j-1})}
= \frac{\phi(\alpha_{j-1}) - \phi(\alpha_j)}{\Phi(\alpha_j) - \Phi(\alpha_{j-1})},
\qquad \phi(z) = \frac{1}{\sqrt{2\pi}} e^{-z^2/2} \qquad (18)$$

This value can be used instead of x = j to make the variable less skewed and/or
kurtotic, as well as to make the distances between the categories more informative,
rather than assuming that the distance between categories 1 and 2 is the same as the
distance between categories 2 and 3, or 3 and 4.
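A sketch of this "group means" transformation, with the thresholds estimated from the
empirical category proportions (the helper normal_group_means is ours):

    import numpy as np
    from scipy.stats import norm

    def normal_group_means(x):
        """Replace each category of an ordinal variable by E[x*|x = j]
        from (18), assuming a standard normal underlying variable."""
        x = np.asarray(x)
        cats, counts = np.unique(x, return_counts=True)
        cum = np.concatenate(([0.0], np.cumsum(counts) / len(x)))
        alpha = norm.ppf(cum)            # thresholds; alpha_0 = -infinity
        means = (norm.pdf(alpha[:-1]) - norm.pdf(alpha[1:])) / np.diff(cum)
        return means[np.searchsorted(cats, x)]

    x = np.array([1, 1, 1, 2, 2, 3, 3, 3, 3, 4])
    print(normal_group_means(x))   # monotone scores replacing 1, 2, 3, 4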
3 The simulation study

3.1 The setup of the simulations

This section describes a large simulation project undertaken to examine the behavior of
different PCA procedures with discrete data. The measures of performance are chosen to
address the accuracy of PCA in the applications of the method in development economics,
as in Filmer & Pritchett (2001), i.e., for ranking households by their welfare. The main
theme of the simulation was to set up a model of the form (5) with different
distributions of the underlying welfare index ξ, various coefficients λ, various
proportions of variance explained by the first PC, and other controls, as explained
below.
The following parameters and settings of the simulation were used (a sketch generating
one such sample is given after this list):

- Total number of indicators: from 1 to 12.
- The fraction of discrete variables: from 50% (1 discrete, 1 continuous) to 100%.
- The distribution of the underlying factor ξ: normal; uniform; lognormal; bimodal (a
  mixture of two normals).
- The proportion of the variance explained: 80%; 60%; 50% if the total number of
  indicators was greater than 4; 40% and 30% if the total number of indicators was
  greater than 7.
- The values of λ: all ones; one or two of the discrete variables have λ = 3; one or two
  of the continuous variables have λ = 3; one discrete and one continuous variable have
  λ = 3. (See the discussion in Appendix A on the implications for PCA.)
- The number of categories of the discrete variables: from 2 to 12.
- The threshold settings: uniform (each category has the same number of observations);
  half of the observations in the bottom category (heavy skewness and kurtosis, at least
  for a large number of categories); half of the observations in the central category
  (high kurtosis with low skewness); half of the observations in the top category; and
  random thresholds (if Prob[x* < z] = F(z), u₁, . . . , u_{K−1} ~ U[0, 1], and
  u₍₁₎, . . . , u₍K−1₎ is the set of order statistics from u₁, . . . , u_{K−1}, then
  α_k = F⁻¹(u₍k₎)).
- The sample sizes: 100, 500, 2000, 10000.
- Finally, and most importantly for the objective of the paper, the analyses performed:
  PCA on the ordinal categorical variables; PCA on the dummy variables corresponding to
  the individual categories, as in Filmer & Pritchett (2001); PCA on the ordinal
  variables with the number of the category replaced by the group means given by (18);
  PCA of the polychoric correlation matrix; and, as the benchmark, PCA on the original
  continuous variables x*₁, . . . , x*_p (which cannot be performed in field
  applications).
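As a rough illustration of a single cell of this design, the sketch below draws a latent
welfare ξ, builds continuous indicators, and discretizes some of them with uniform
thresholds. The calibration of the noise variance through θ = (1 − R²)/R² per indicator
is our simplification for illustration, not the paper's exact calibration of the
theoretical share of explained variance:

    import numpy as np

    rng = np.random.default_rng(42)

    def simulate_sample(n=2000, p_disc=6, p_cont=2, n_cat=4, r2=0.6):
        """One Monte Carlo sample: latent welfare xi ~ N(0, 1), indicators
        x*_k = xi + sqrt(theta) * delta_k, and uniform-threshold
        discretization of the first p_disc indicators into 1..n_cat."""
        p = p_disc + p_cont
        theta = (1 - r2) / r2            # noise variance per indicator
        xi = rng.standard_normal(n)
        x_star = xi[:, None] + np.sqrt(theta) * rng.standard_normal((n, p))
        X = x_star.copy()
        for k in range(p_disc):
            # interior quantiles give (roughly) equal category counts
            cuts = np.quantile(x_star[:, k],
                               np.linspace(0, 1, n_cat + 1)[1:-1])
            X[:, k] = np.digitize(x_star[:, k], cuts) + 1
        return xi, x_star, X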
A non-proportional random sample of all possible combinations was taken. The probability
of selection of a particular combination of the simulation parameters was

$$\operatorname{Prob}[\text{select} \mid \text{simulation settings}]
= \exp\bigl(-(3 + 0.25 p_d + 0.03 p_c)\bigr) \qquad (19)$$

where p_d is the number of discrete variables and p_c is the number of continuous
variables. An increase in the number of variables leads to an increase in computational
time, both due to the increased number of polychoric (p_d(p_d − 1)/2) and polyserial
(p_c p_d) correlations to be computed, and due to the increase in the number of
combinations arising for each extra discrete variable. This sampling procedure resulted
in an approximately 1% sample of all settings combinations, with a total sample size of
947434 observations and a sum of weights (the estimate of the total population size) of
99.744 million. (The latter would be the total sample size had we run the simulation for
each combination of parameters.) Those observations came from 189756 unique samples
(combinations of settings). Some observations were lost due to difficulties with the
numerical likelihood maximization in the polychoric correlation estimation; the error
messages mainly had to do with flat likelihoods, and also with the correlation matrix
not being positive definite. Fifty-five variables describe the settings and the outcomes
(the Stata file size is 277 Mbytes).
The simulation was performed on the statistical applications server at UNC⁶ as well as
on several personal computers that the authors had access to. The software platform was
Stata Special Edition, version 8.2 (Kolenikov 2001, Stata Corporation 2003). The project
was spread over 41 separate threads. On average, a thread took about 2 to 4 days on a
Pentium IV 1 GHz 256 Mb RAM PC (running as a single task), or 5 to 10 days on the
multitask server, the workload being due to the nonlinear maximization involving
numerical integration of the bivariate normal probabilities in the maximum likelihood
estimation step. Also, some matrix manipulations, such as product matrix accumulation,
could take quite long for large sample sizes and large numbers of variables, especially
when the ordinal variables were expanded into sets of dummy variables. Each run required
no more than 10 Mbytes of RAM.
3.2 Results
This section describes the basic analysis of the simulation results. We performed
regression analyses of several performance measures on the simulation settings to
characterize numerically the differences between the PCA methods for discrete data.

The primary outcome variables we consider are the internally and externally defined
goodness of fit measures. The internally defined goodness of fit is what the researcher
has at her disposal upon running the PCA. As discussed in Section 2.1 and Appendix A,
the most popular measure is the proportion of the explained variance.

The external measures of performance are those relating the estimated first PC to the
truth, i.e., ξ, in the context of SES applications where the scores are used to classify
individuals into quintiles, or other rank groups, used for poverty and service use
analysis and, ultimately, for policy advice. We examine the correlation of the rankings
produced by the different PCA procedures with the underlying score ξ, and compare the
quintile groups produced by the two scores. Thus the first of our measures is the
Spearman rank correlation⁷ of the empirical first PC with the original factor ξ. As
discussed in Appendix C, Kendall's τ might have been a more interpretable measure.
⁶ See http://www.unc.edu/atn/statistical/. The domain within a Sun E15K server has 20 processors with
a clock speed of 1.05 GHz and 40 GB of memory.
⁷ The definitions and useful facts regarding the rank correlations are given in Appendix C. Rank
correlations show how similar the rankings of individuals produced by two variables are.
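A sketch of how these external measures can be computed for a given empirical score
(the helper performance is ours; quintile groups are cut at the 20/40/60/80 percent
quantiles):

    import numpy as np
    from scipy.stats import spearmanr

    def performance(score, xi):
        """Spearman correlation of the empirical score with the true
        welfare xi, plus overall and first-quintile misclassification."""
        rho, _ = spearmanr(score, xi)
        cuts = [0.2, 0.4, 0.6, 0.8]
        q_hat = np.digitize(score, np.quantile(score, cuts))  # groups 0..4
        q_true = np.digitize(xi, np.quantile(xi, cuts))
        overall = np.mean(q_hat != q_true)
        q1 = np.mean(q_hat[q_true == 0] != 0)  # misclassification in Q1
        return rho, overall, q1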
Table 2: Regressions of the performance measures on the simulation settings.

                                 No.      Share of     Rank          Overall        Misclassi-
                                 obs.     explained    correlation   misclassifi-   fication
                                          variance     with ξ        cation rate    in Q1
Theoretical explained                     -623.493     1927.794      -1402.261      -1268.915
  proportion                              (1.469)**    (1.706)**     (1.446)**      (2.450)**
Analysis type
  Original: base                 189756   0 (.)        0 (.)         0 (.)          0 (.)
  Filmer-Pritchett               189511   -556.864     -332.814      236.595        243.223
                                          (2.506)**    (1.963)**     (1.249)**      (2.742)**
  Group means                    189328   -339.320     -226.089      172.133        143.568
                                          (1.157)**    (0.976)**     (0.860)**      (1.441)**
  Ordinal                        189511   -345.287     -231.128      175.534        147.047
                                          (1.159)**    (0.976)**     (0.863)**      (1.444)**
  Polychoric                     189328   -157.689     -223.402      169.980        142.112
                                          (1.125)**    (0.983)**     (0.861)**      (1.440)**
Distribution of ξ
  Normal: base                   233688   0 (.)        0 (.)         0 (.)          0 (.)
  Lognormal                      237222   -203.238     -734.501      421.746        836.829
                                          (0.694)**    (0.764)**     (0.622)**      (1.021)**
  Bimodal                        234612   5.190        -97.136       36.580         209.515
                                          (0.480)**    (0.690)**     (0.562)**      (1.063)**
  Uniform                        241912   23.847       65.141        -50.568        47.110
                                          (0.481)**    (0.642)**     (0.586)**      (1.079)**
Average number of categories ×
  Filmer-Pritchett               189511   -134.764     20.040        7.846          9.797
                                          (0.444)**    (0.169)**     (0.507)**      (0.171)**
  Other discrete                 568167   15.007       -3.673        -11.223        -6.202
                                          (0.183)**    (0.230)**     (0.252)**      (0.144)**
Log samplesize                            0.741        -15.211       -1.420         -10.365
                                          (0.375)*     (0.148)**     (0.137)**      (0.249)**
R-squared                                 0.93         0.93          0.90           0.81

Notes: Cluster-corrected standard errors in parentheses; ** significant at 1%, * at 5%.
Other controls include: the threshold structure; the factor loadings; the number of
discrete and continuous variables. The total number of observations is 947434. The
number of clusters is 189756.
The list of explanatory variables includes the simulation settings and their functions
and combinations. The ones not shown in the table are⁹:

- the threshold structure proportions (the proportions of discrete variables with
  bottom-, center-, and top-dominated categories, and with random thresholds; uniform
  thresholds are the base);
- the threshold structure dummy variables¹⁰;
- the factor loadings (whether there were any variables with λ_k = 3, as opposed to the
  default λ_k = 1 for all k, and whether those variables were discrete, continuous, or
  both);
- dummy indicators of each particular combination of discrete and continuous variables.

⁹ The complete tables are available at
http://www.unc.edu/skolenik/cpc/polychnoric-technical-regtable.pdf.
¹⁰ The categories were defined as follows: "mostly uniform": a uniform distribution of the thresholds,
as explained in Section 3.1, for at least 3/4 of the discrete variables; "uniform or random": all ordinal
variables have either uniform or random threshold settings; "center dominated": in at least one of the
variables, the middle category holds 50% of the population, and the remaining half of the observations is
distributed uniformly across the other categories (high kurtosis with low skewness); "skewed": in at least
one of the variables, either the top or the bottom category holds 50% of the population, with the remaining
half distributed uniformly across the other categories (both high kurtosis and high skewness); "opposite
extremes": two variables have domination in different parts of their distributions, like top- and
bottom-dominated, or center- and bottom-dominated; this is known to be the most difficult case for the
estimation of the polychoric correlation coefficient.

The first column additionally shows the number of observations for which a particular
value of the explanatory variable is observed. The variation in the number of available
observations across analysis types is due to computational failures with either the
Filmer-Pritchett or the polychoric procedure (non-positive definite matrices or lack of
convergence, respectively); for all other variables, it is due to the randomness of the
Monte Carlo procedure.

The reported results are for the regressions on all observations in the data set, with
probability weights given by the selection probabilities (19), and with the covariance
matrix of the estimates corrected for clustering. The latter is necessary because
observations based on the same Monte Carlo sample (but differing in the type of analysis
performed) are strongly related to each other.

Let us list the findings we consider interesting. A short note on the interpretation of
the coefficients may be in place here: a coefficient value of 100 means that a unit
change in the explanatory variable shifts the linear prediction x'β by 0.1, and the
ultimate response, in its original [0, 1] scale, by about 0.04 near the middle of that
range; by 0.03 near the points 0.25 and 0.75 (i.e., misclassification rates of 25%, or
rank correlations of 0.75, which are rather reasonable values for some of the
combinations in our data set); by 0.02 near the points 0.1 and 0.9; and by 0.01 near the
points 0.05 and 0.95.
First, it was very reassuring that regressions with relatively few explanatory variables
(64 in our regressions) yield an R² above 0.90 (0.81 for the first quintile
misclassification rate). Most of the performance of the PCA is thus explained by the
factors used in the regression.
Second, the comparison of the different methods is directly accessible through the
"Analysis type" block of the table. In all four regressions, the Filmer and Pritchett
procedure performs worse than any of the other methods, as evidenced by its coefficient
being the largest across all methods. The baseline for the analysis type was the PCA
based on the original unobserved x*, so the regression coefficients show the
deterioration of performance relative to that case. The three other methods based on
discrete indicators¹¹ performed about the same, and the top performing method (albeit
infeasible in field applications) is the PCA based on the original continuous variables.

¹¹ I.e., the ordinal PCA based on the ordinal variables scaled to the standard Likert scale
(1, 2, 3, . . . ); the polychoric PCA; and the group means PCA based on the ordinal variables assigned
scores derived from the underlying standard normal distribution.
Third, we can identify the most important explanatory variables, as evidenced by their
t-statistics (not reported in the table; the analysis is based on the Stata output). The
t-statistics may seem too big by any reasonable standards, but the data came from a
controlled experiment, and the data set size is about 1 million observations, so one
should not be surprised to see both tiny standard errors and strong effects. The
t-statistics still convey important information on the relative importance of the
variables in the regression.
The most important explanatory variable is the theoretical share of the explained
variance (the first line of regressor coefficients). This is not surprising, since this
is the primary variable that controls the closeness of the indicators x* and x to the
underlying welfare ξ. The absolute t-statistics vary between 424 (the reported explained
variance) and 1130 (the rank correlation).

The next most important factor is the distribution of the underlying ξ, or rather the
fact that it is lognormal (in the "Distribution of ξ" block). Lognormality leads to a
substantial deterioration of the performance of the empirical PC. The t-statistics range
between 292 (the reported explained variance) and 961 (the rank correlation). We believe
this has more to do with the high kurtosis than with the high skewness of this
distribution, as another asymmetric distribution used in our analysis, the mixture of
two normal distributions, did not produce such bad results. This is also in agreement
with the theoretical results on PCA in the non-normal case (Davis 1977, as also reported
in Appendix B).
Those two variables were the two most significant in all four regressions. The next
group of variables that came up among the most significant ones are the analysis type
indicators (see the "Analysis type" block of coefficient estimates). The coefficients in
the table are the differences in performance from the PCA based on the original
(unobserved) continuous data. For the ordinal and group means analyses, the t-statistics
were around 300 in the explained variance regression, around 230 in the rank correlation
regression, around 200 in the overall misclassification rate regression, and around 100
in the first quintile misclassification rate regression. The t-statistics of the
polychoric PCA were lower, at 140, 227, 197, and 99, respectively, so the real
difference is only in the share of explained variance, which is estimated consistently
by the polychoric PCA but not by the other methods. The polychoric PCA tends to produce
results closer to the benchmark PCA on the original variables, while the standard errors
are essentially the same as for the ordinal and group means analyses. For the
Filmer-Pritchett procedure, the t-statistics were 222 in the explained variance
regression, 190 in the overall misclassification rate regression, 170 in the rank
correlation regression, and 90 in the Q1 misclassification rate regression. Even though
the coefficients of the Filmer-Pritchett procedure dummy variable are greater in
absolute value than those of the other discrete data procedures, they also have standard
errors that are about twice as large as those of the other discrete methods. This is an
indication not only that the Filmer-Pritchett procedure gives worse results on average,
but also that it is less stable in performance, with the residual variance being notably
greater for the Filmer-Pritchett subsample.
The interaction of the number of categories and the Filmer-Pritchett procedure indicator
(in the "Average number of categories" block of the table) also came out as the second
most significant variable in the explained variance regression (t = −303: more
categories lead to a smaller explained variance). A possible interpretation is as
follows: the number of variables in the denominator of (3) in the Filmer-Pritchett
procedure increases, while the information they can explain, or the variability of the
first PC they can produce, remains the same.
The lack of information about ξ contained in a small number of variables is also an
important factor for the external measures. Among the dummy indicators for the variable
combinations (reported in the complete tables), the extreme case of 2 discrete and 0
continuous variables yields t-statistics of 274 in the rank correlation regression, 255
in the overall misclassification rate regression, and 83 in the Q1 misclassification
rate regression.
Finally, the bimodal distribution dummy was one of the most significant regressors
in the Q1 misclassification rate regression, with t = 197.
The comparison of the absolute values of the coefficients, aimed at assessing the
magnitudes of the effects under consideration, gives rather similar results, since the
standard errors of most variables were quite close to each other. This is due to the
fact that the simulation design was nearly orthogonal: most of the settings were used
independently of each other, except for the theoretical explained variance, which
depended on the number of indicators. The settings for the number of categories and the
threshold structure were also strongly related to the number of discrete variables, but
were randomized in order to achieve the overall balance.

Table 3: The largest differences in the estimated coefficients across models with
different numbers of indicators, and the effects of adding one extra discrete or
continuous variable, for the four performance measures (share of explained variance;
rank correlation with ξ; overall misclassification rate; misclassification in Q1).
Cluster-corrected standard errors in parentheses. The ⟨8, 4⟩ notation means a model with
8 discrete and 4 continuous variables, etc. See also Table 2 for additional explanations
and the primary effects.

When interpreting those coefficients and their t-statistics, one should keep in mind
that only the share of explained variance is a stand-alone regressor, while all others
are dummy variables relating the factor of interest to the base. Thus, the lognormal
distribution is compared to the normal distribution of ξ, and the types of analysis are
compared to the PCA based on the original continuous variables.
Let us return to the most important findings. The fourth of those, as noted above, is
that the number of variables, and whether those variables are discrete or continuous,
plays a key role in the performance of PCA. Table 3 shows the magnitudes of those
effects. Its first line is the largest difference across the estimated coefficients,
usually between a model with two indicators and a model with 12 indicators. The reported
figure and the corresponding standard error are for the difference of the coefficients
of the first and the second models mentioned, with the notation ⟨p_d, p_c⟩ for a data
set with p_d discrete and p_c continuous variables. Most of the results in the share of
explained variance column are not readily interpretable. In the next column, the linear
predictor x'β from the probit regression for the rank correlation is lower by 1.368 for
the model with 2 discrete and no continuous variables than for the model with 8 discrete
and 4 continuous variables. This may translate into a difference in the rank
correlations as large as 0.60 for the weaker model vs. 0.96 for the model with 12
variables¹². Likewise, from the differences between the coefficient estimates, the
differences between the misclassification rates for different numbers of variables may
be as large as 64% vs. 26% for the overall rate, and 50% vs. 17% for the first quintile.
Table 3 also compares the effects of adding an extra variable. The improvement due to a
continuous variable is larger than that due to a discrete one by some 60-80%. This can
be viewed as a crude measure of the losses due to discreteness: roughly speaking, 10
discrete variables contain about as much information, for the purposes of PCA, as 6
continuous ones do.
Fifth, there were a number of rather surprising findings. The first is that the sample
size does not matter much, at least for the levels of the dependent variables: the
coefficient of the log sample size variable translates to only about a 2% change in the
indicator from the smallest sample size (100) to the largest (10000). To compare, the
losses due to the lognormal distribution may be as large as 30% in the Q1
misclassification rate, and those due to the Filmer-Pritchett procedure, 20% of the
reported explained variance. The threshold structure had a mixed effect: the
concentration of half of the observations in a single category usually has a negative
effect, but not for the misclassification rates in Q1, where the concentration of
observations in the wealthier categories gives more resolution in the left tail of the
welfare distribution. The "opposite extremes" case, in which the marginal distributions
are concentrated on different tails for two different ordinal variables, although known
to pose difficulties for the estimation of the polychoric correlation coefficient, does
not have overly detrimental consequences. Likewise, the differences in the factor
loadings, which translate into the strength of the relation between the latent ξ and the
corresponding indicator, and thus into a greater explanatory power for that variable,
have predictable directions but are not very large in absolute value.
¹² See Appendix C for the interpretation of the values of rank correlations and their relation to the
misclassification rates.
3.3 Graphical representation
A saying goes that one picture is worth a thousand words, so let us complement the
regression analysis with a graphical illustration of our findings. Since the selection
probability weights differ between settings with different numbers of discrete and
continuous variables, those differing probabilities of selection would make it hard to
produce clearly interpretable graphs comparing results for different numbers of
variables. We thus confine our attention to a specific setting with 8 discrete and 0
continuous variables: for this setting, the differences between the methods are quite
pronounced, since discrete variables are present without any continuous ones. Also, this
is a setting with considerably many observations (12880).
The graphical representation is complementary to the results reported in the previous
section. The primary value of graphs is, of course, an easy grasp of the distributional
features, and we shall draw the reader's attention to those features in our
interpretation of the graphs we report. On the one hand, the graphs do not provide any
inferential measures such as p-values; but due to the large sample sizes used for each
graph, the features detected by eye can probably be interpreted as those of the
population. Also, there may be some confounding due to the complex simulation design,
since even with the most involved graphics it is difficult to visualize more than three
or four dimensions, or sources of performance variability in our case. Due to the nearly
orthogonal simulation design, we expect the influence of those confounding factors to be
minimal. We control for the strongest effects reported in the previous section, such as
the lognormality of ξ and the underlying proportion of explained variance, so that the
graphs really demonstrate the differences between the methods along with their sampling
variability.
Figure 3 shows the box-and-whisker plots¹³ of the four performance indicators discussed
in Section 3.2. As in the explanation of Table 2, in the first row of panels, less is
better; in panel (c), more is better; and for panel (d), the target is shown by the
horizontal line at 0.5.

The best performance is demonstrated by the field-infeasible analysis of the original x*
variables. Also, the Filmer-Pritchett procedure is clearly inferior in all of the
analyses. For instance, for the Q1 misclassification rate, the median of the
distribution for the Filmer-Pritchett procedure is above the 75th percentiles of the
other methods, and the upper quartile of this performance measure is above all of the
observations for the other methods (i.e., in 25% of the worst cases, it does a poorer
job than you would ever
¹³ The central line of the plot shows the median of the data. The boundaries of the box are the lower
and upper quartiles. The length of each whisker is three times the distance between the median and the
corresponding quartile, which leaves about 0.7% of the normal distribution outside the whiskers.

Figure 3: Box plots for the different PCA methods. (a) Overall misclassification rate;
(b) misclassification rate in the first quintile; (c) Spearman's ρ between the
theoretical and empirical welfare measures; (d) share of explained variance.
Restrictions: 8 discrete variables, no continuous variables, sample sizes 2000 or 10000,
lognormal distribution excluded, theoretical share of explained variance is 0.5.
expect from other methods). Likewise, for the overall quintile misclassification rate, a
misclassification rate better than about 43% occurs in about 75% of the cases for the
ordinal, normal means, and polychoric methods, but only in 25% of the cases for the
Filmer-Pritchett method. The other three discrete methods show practically
indistinguishable performance, with the ordinal PCA giving slightly larger variability,
as evidenced by the size of its box.

As for the internal measure of fit, i.e., the reported proportion of explained variance,
only the analysis of the original variables and the polychoric PCA show consistency of
the reported explained proportion (the graphs are drawn for the large sample sizes of
2000 and 10000). The other methods understate the explained variance of the first PC,
and the Filmer-Pritchett procedure shows a particularly bad bias, with no observations
higher than 0.3 even though the target explained variance is 0.5. Based on all four
characteristics together, the polychoric method gives the most accurate picture.
As was claimed in Section 3.2, the most important factor in the performance of the PCA, or the most important setting of the simulation study, with the highest t-statistics in Table 2, is the underlying proportion of explained variance (not to be confused with the reported proportion of explained variance used as a performance measure!). Fig. 4 shows the relation of our performance measures to the underlying theoretical proportion of explained variance. The misclassification rates (panels (a) and (b)) show an almost linear decline for the methods other than Filmer-Pritchett. The latter surprisingly shows increased variability over the whole range of explained variances for Q1 misclassification, and at the proportions of explained variance of 0.65 and 0.80 for the overall misclassification rate. The reported share of explained variance (panel (d)), although approximately unbiased for the original PCA and the polychoric PCA, is underestimated by the ordinal and group means PCA, and severely biased downwards by the Filmer-Pritchett procedure. The rank correlation with the underlying welfare (panel (c)) does go up with the underlying proportion of explained variance for all methods, although the distribution of the correlations for the Filmer-Pritchett procedure has quite an extended lower tail, which protrudes most at the proportion of explained variance of 0.80.
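For concreteness, the three external performance measures can be computed as in the following Python sketch; the arrays true_index and est_index are hypothetical stand-ins for the theoretical welfare variable and its PCA-based estimate, so this only illustrates the definitions and is not the simulation code itself.

    import numpy as np
    from scipy.stats import spearmanr

    def quintile(x):
        # quintile labels 0..4 from the empirical 20/40/60/80 percentiles of x
        return np.searchsorted(np.quantile(x, [0.2, 0.4, 0.6, 0.8]), x)

    rng = np.random.default_rng(1)
    true_index = rng.normal(size=2000)                          # hypothetical true welfare
    est_index = true_index + rng.normal(scale=0.5, size=2000)   # hypothetical PCA-based index

    q_true, q_est = quintile(true_index), quintile(est_index)
    overall = np.mean(q_true != q_est)             # overall quintile misclassification rate
    q1 = np.mean(q_est[q_true == 0] != 0)          # misclassification rate in the first quintile
    rho_s = spearmanr(true_index, est_index)[0]    # Spearman rank correlation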
The next set of findings is related to the number of categories of the discrete variables used in PCA. Those are depicted in Fig. 5. Note that with just two categories (binary indicators like ownership of an asset), the Filmer-Pritchett and ordinal PCA coincide. But as extra categories are added, the performance of the methods differs notably. For the methods other than Filmer-Pritchett, the four measures approach their continuous case limits approximately exponentially and come to saturation at about 5 or 6 categories (except for the proportion of explained variance), consistent with recommendations from the quantitative sociology literature (Dolan 1994).
Figure 4: Relation of the performance measures to the underlying proportion of explained variance. (a) Overall misclassification rate; (b) misclassification rate in the first quintile; (c) Spearman's ρ between the theoretical and empirical welfare measures; (d) share of explained variance. Restrictions: 8 discrete variables, no continuous variables, sample sizes 2000 or 10000, lognormal distribution excluded. Jitter added to show structure.
Figure 5: Relation of the performance measures to the number of categories of discrete variables. (a) Overall misclassification rate; (b) misclassification rate in the first quintile; (c) Spearman's ρ between the theoretical and empirical welfare measures; (d) share of explained variance. Restrictions: 8 discrete variables, no continuous variables, sample sizes 2000 or 10000, lognormal distribution excluded, theoretical share of explained variance is 0.5.
The performance of the Filmer-Pritchett procedure also improves with a larger number of categories, but does not catch up with the other methods until there are as many as 8 categories per variable, on average.
The most striking result is the performance of the Filmer-Pritchett procedure in terms of the reported explained variance. It declines steadily as the number of categories increases; the explanation we can propose is that more and more spurious and irrelevant negative correlation structure is added to the correlation matrix used as an input to PCA. Also, the amount of information that can be explained by a univariate summary stays about the same as more dummy variables are generated, while the number of variables increases. The former serves as the numerator of (3), and the latter as its denominator, so the resulting shape is approximately hyperbolic in the number of categories, which is what is observed in panel (d). See also the discussion in Section 2.2. As for the other discrete PCA methods, the share of explained variance reported by the polychoric PCA stays on target for any number of categories, while the ordinal and group means methods underestimate it, although they improve with more categories.
In all of the above, the observations with the lognormal distribution of ξ were excluded, as they led to a substantial deterioration of the performance of every method. If they were shown on the graphs in Figs. 3-5, they would appear as an extra cloud of points in the direction of deteriorating performance: somewhat below the others on the rank correlation and explained variance plots, and somewhat above the others on the misclassification rate plots.
Other combinations of the number of discrete and continuous variables produced
qualitatively similar results, although with more continuous variables, the differences
between the methods were not as distinct as in the reported case.
Conclusion
This paper was motivated by recent examples of the use of principal component analysis in the development economics literature, starting from Filmer & Pritchett (2001), and investigated several ways to use categorical (in particular, ordinal and binary) variables in the principal component analysis. When the distributions of the indicators are non-normal, some of the asymptotic properties of the principal components no longer hold or need to be modified, as the variances and covariances of both eigenvalues and eigenvectors depend on the fourth moments of the data. Other complications to the principal component analysis due to the categorical nature of the variables include biases to the covariance structure, and hence the factor loadings, and a smaller reported proportion of explained variance.

We developed several analytical examples demonstrating that (i) categorical variables do have excess skewness and kurtosis (Example 1); (ii) correlations between categorical variables are attenuated (Example 2); and (iii) naive principal component analysis based on the dummy variables aims at placing the two largest groups at the opposite ends of the first principal component score spectrum, and underestimates the proportion of explained variance (Examples 4 and 5).
We then discussed several options that may be useful in performing the principal component analysis in the presence of categorical variables: using the ordinal variables per se; using the group means implied by a normal distribution; using the dummy variables for categories as suggested by Filmer & Pritchett (2001); and using the polychoric correlations. We designed and conducted a large simulation study to compare the performance of the different discrete PCA methods under different scenarios. The performance measures used were the quintile misclassification rates (overall and in the first quintile); the Spearman rank correlation between the true welfare index used to generate the data and the empirical one obtained through the versions of PCA (as an overall measure of the conformance of the rankings of individual observations produced by the two welfare indices); and the reported proportion of explained variance, as the main (and often the only) measure of performance available to the researcher.
Our main conclusions stemming from the analysis of the simulation data are as
follows.
If there are several categories related to a single factor, such as access to hygienic facilities or the materials used in roofing, dividing the variable into a set of dummy indicators as suggested by Filmer & Pritchett (2001) leads to a deterioration of performance according to all of the performance measures we used. The explained variance is most heavily affected (underestimated), and the more so the more categories there are in the original variables. Even though the goodness of fit of the Filmer-Pritchett procedure improves as more variables are added, the method does not achieve the performance characteristic of the other methods. We thus believe that the researcher will be better off using the ordinal variables as inputs to PCA. If the variables do not come coded in a standard way such as 1, 2, . . . (a Likert scale) with roughly equal distances between categories, it is worth recoding them that way. Model-based category weights (referred to as group means in our analysis) show only a slight improvement in performance over the standard Likert-scale ordinal coding, so the naive coding is strikingly robust to the arbitrary assumption that the distance between categories is 1.
Acknowledgements
Ken Bollen made suggestions that were crucial in the development of the paper, and Nash Herndon provided useful editorial comments. The authors are also grateful for the comments on their poster presentation at the Joint Statistical Meetings (Toronto, 2004). Financial support was provided by the U.S. Agency for International Development through the MEASURE Evaluation project of the Carolina Population Center, University of North Carolina at Chapel Hill, under the terms of Cooperative Agreement #GPO-A-00-03-00003-00.
Appendices
A
The principal component analysis is aimed at solving the (conditional) variance maximization problem (1). This problem turns out to be identical (Anderson 2003, Mardia et al. 1980) to the eigenvalue problem (2) discussed in Appendix B. Along with the theoretical properties of this linear algebra problem, a researcher is usually interested in the statistical properties of the procedure, and in the practical uses of its results. This appendix highlights some distributional results available for PCA, and discusses the choice of the number of significant components.
The issue of selecting an appropriate model dimensionality does not usually arise in the construction of welfare indices in household studies, as the first component is the only one used; but it is at least worth checking that the first component really stands out relative to the second one and the others. If the first two eigenvalues are relatively close to each other, then the first component may not be very stable, and thus the resulting rankings of the households by their estimated welfare may be misleading.
In other applications, such as exploratory data analysis, the researcher often faces the problem of what constitutes a good description of the data in the most concise terms. For the PCA, this is the question of choosing the number of components that the analyst will use further in her analysis. Most often, this is done graphically, by plotting the eigenvalues and eye-balling the place on the graph where the decline in eigenvalues switches from roughly exponential to roughly linear. The plot is referred to as the scree plot. An example is shown in Fig. 6. The first principal component really stands apart, while the last four or five show a linear trend in eigenvalues. The conclusion from this particular scree plot might be that two or three PCs are significant while the others represent noise.

Figure 6: An example of a scree plot: eigenvalues plotted against the component number.
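A scree plot of this kind takes only a few lines to produce; the following Python sketch uses a hypothetical data matrix X with one dominant latent factor, so both the data and the matplotlib calls are illustrative assumptions.

    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(0)
    f = rng.normal(size=(500, 1))                # one hypothetical latent factor
    X = 0.8 * f + rng.normal(size=(500, 8))      # eight indicators loading on it
    evals = np.sort(np.linalg.eigvalsh(np.corrcoef(X, rowvar=False)))[::-1]

    plt.plot(np.arange(1, 9), evals, "o-")
    plt.xlabel("Number")                         # component number
    plt.ylabel("Eigenvalues")                    # sorted eigenvalues of the correlation matrix
    plt.show()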
If a sample from a multivariate normal distribution is taken and PCA is performed on the sample covariance matrix, then the resulting λ̂_i and v̂_i are the maximum likelihood estimates of the corresponding population parameters λ_i, v_i (Mardia et al. 1980). A number of theoretical results on the asymptotic distributions of the eigenvalues and eigenvectors can be established under the assumption of normality when the dimension p is fixed and the number of observations n → ∞ (Anderson 1963, Mardia et al. 1980, Theorem 8.3.3):

√n (λ̂_i − λ_i) →_d N(0, 2λ_i²),   (20)

√n (v̂_i − v_i) →_d N(0, V_i),   V_i = λ_i Σ_{j≠i} λ_j/(λ_j − λ_i)² v_j v_j′,   (21)

n Cov[λ̂_i, v̂_j] → 0 for all i, j.   (22)
The factor loadings, however, are not uncorrelated. Also, their variances involve terms of the form λ_j λ_k/(λ_j − λ_k)², which are undefined in the case of multiple eigenvalues (when, as we already know, there are no unique eigenvectors, but only a unique eigenspace), and are large for close eigenvalues. The practical implication of this is the need to check whether the second largest eigenvalue is distant enough from the largest one. If it is not, the weights of the variables will be unstable.
Based on those asymptotic results, likelihood ratio type tests of the significance of the last p − k components can be constructed (Mardia et al. 1980, Section 8.4.3). The null hypothesis is that the first k components have eigenvalues distinctly greater than the remaining p − k, and that the latter are equal to each other (which is interpreted as having k significant factors, the rest being white noise). The test statistic is

LR = n (p − k) ln(a₀/g₀),   (23)

a₀ = (1/(p − k)) Σ_{i=k+1}^p λ̂_i,   (24)

ln g₀ = (1/(p − k)) Σ_{i=k+1}^p ln λ̂_i,   (25)

so that a₀ and g₀ are the arithmetic and geometric means of the eigenvalues that are hypothesized to be equal. The test statistic has an asymptotic χ² distribution with ½(p − k + 2)(p − k − 1) degrees of freedom. It can be Bartlett corrected by replacing n in (23) with n − (2p + 11)/6.
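A direct implementation of (23)-(25) with the Bartlett correction is sketched below in Python; evals, k and n stand for the sample eigenvalues, the hypothesized number of significant components, and the sample size. This is a transcription of the formulas above, not a packaged routine.

    import numpy as np
    from scipy.stats import chi2

    def last_pc_test(evals, k, n):
        """LR test that the last p-k eigenvalues are equal (k significant components)."""
        evals = np.sort(np.asarray(evals))[::-1]
        p = len(evals)
        tail = evals[k:]                            # the p-k smallest eigenvalues
        a0 = tail.mean()                            # arithmetic mean, (24)
        ln_g0 = np.mean(np.log(tail))               # log geometric mean, (25)
        n_corr = n - (2 * p + 11) / 6               # Bartlett correction
        lr = n_corr * (p - k) * (np.log(a0) - ln_g0)   # (23)
        df = (p - k + 2) * (p - k - 1) / 2
        return lr, df, chi2.sf(lr, df)              # statistic, d.f., p-value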
If the distribution of the original data is not normal, Davis (1977) establishes that the asymptotic distribution of the eigenvalues and eigenvectors is still multivariate normal, but the variances and covariances involve the fourth order cumulants

κ_ijkl = ∂⁴ ln φ_X(t) / ∂t_i ∂t_j ∂t_k ∂t_l |_{t=0} = E[x_i x_j x_k x_l] − σ_ij σ_kl − σ_ik σ_jl − σ_il σ_jk   (26)

that are identically zero for the normal distribution. (Here, φ_X(t) = E e^{iX′t} is the characteristic function of the random vector X and its associated distribution.) In particular,

√n (λ̂_i − λ_i) →_d N(0, 2λ_i² + κ_iiii),   (27)

and λ̂_i is correlated with the other eigenvalues through the fourth order cumulants.
An alternative asymptotic framework, sometimes referred to as Kolmogorov asymptotics, is to let both the dimension and the number of observations increase in a coherent way, so that n/p → const (Johnstone 2001). Then the consistency of the eigenvalues no longer holds, and the spectrum of the estimated eigenvalues is non-degenerate even for the spherical Gaussian distribution with identity covariance matrix.
Additional difficulties can arise due to complex sample design. The researcher needs to make sure that aspects of it such as weights, clustering, and stratification are accounted for properly. Skinner et al. (1986) compare the results of PCA based on: the naive estimate of the sample covariance matrix, as if the data were i.i.d.; the model-based estimate that assumes multivariate normality and accounts for the data known prior to sampling (and used in stratification); and the design-unbiased estimator. They show that the first estimator gives biased estimates of both the eigenvalues and the eigenvectors when the design calls for weighting, and both the direction and the magnitude of the bias depend on the specific design. The maximum likelihood estimate achieves the best results in their simulations, which is not surprising given that they sampled from a multivariate normal distribution. The design-unbiased estimator was unconditionally unbiased, although it showed substantial variability, performing well in some samples and poorly in others (which is a small sample effect, as asymptotically it gives the correct answer). Skinner et al. (1986) provide a Taylor series expansion that predicts the deviations from the true eigenvalues (and, eventually, the bias) quite well, but it requires the true covariance matrix to be known. Also, the bias depends on the correlation between the stratification variables and the principal component.

The message for the particular application we are considering here (that of socioeconomic status assessment based on DHS data) is that the design does have an effect on the estimates, both through (i) design features such as weights, and (ii) the correlation between the stratification variables (geography) and the substantive first principal component, as there usually are quite distinct differences in SES levels between regions. The existing software (the polychoric Stata module) does allow for weights, which are the main source of discrepancies in Skinner et al. (1986).
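A minimal illustration of how sampling weights can enter the first step of the analysis is sketched below in Python; it computes a probability-weighted correlation matrix and its eigendecomposition. The data X and weights w are hypothetical, and the sketch deliberately does not address clustering or stratification.

    import numpy as np

    def weighted_pca(X, w):
        # probability-weighted mean, covariance, and correlation matrix
        w = w / w.sum()
        mu = w @ X
        Xc = X - mu
        S = (Xc * w[:, None]).T @ Xc        # weighted covariance matrix
        d = np.sqrt(np.diag(S))
        R = S / np.outer(d, d)              # weighted correlation matrix
        evals, evecs = np.linalg.eigh(R)
        order = np.argsort(evals)[::-1]     # sort from largest to smallest
        return evals[order], evecs[:, order]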
The geometry of the principal components may be represented graphically in various ways. An obvious way is to use the scores of the observations on the first few components and draw the scatterplots of one principal component against another. For a large data set with more than a hundred or so observations, the picture may start looking messy, so one may consider plotting the centers of reasonably grouped observations instead. The usual way to show the variables graphically is to plot the factor loadings, showing the relation between the principal components and the original variables. This sort of graph allows clustering, by eye, of the variables that convey similar information.
B

The general formulation of an eigenproblem for a matrix R with real entries is to find the scalars λ and non-zero vectors v such that

R v = λ v,   ‖v‖ = 1.   (28)

This is a standard linear algebra problem (Parlett 1980, Horn & Johnson 1990, Weisstein 2004), with applications ranging from acoustics to quantum mechanics, and from statistics to nonlinear optimization. The numbers λ_k are called eigenvalues, and the vectors v_k, eigenvectors. A number of theoretical properties can be established for them:
1. The eigenvalues are the solutions of the characteristic equation

det[R − λ I_p] = 0.   (29)

It implies that there are p such λ's, although some may repeat, and for an arbitrary matrix R the eigenvalues may be complex.

2. The eigenvectors v_i, v_j corresponding to distinct eigenvalues λ_i ≠ λ_j are orthogonal: v_i′ v_j = 0. If λ is a multiple root of equation (29) of order l (such eigenvalues are also referred to as degenerate), then there is a linear subspace (an eigenspace) of dimension l corresponding to that eigenvalue. Each vector in this subspace satisfies (28), except for normalization, and l orthonormal eigenvectors can be chosen as a basis of that subspace.

3. If R′ = R (i.e., R is real and symmetric), all λ's are real.

4. If R is positive (semi)definite, then all eigenvalues are positive (non-negative).

5. det R = Π_i λ_i,   tr R = Σ_i λ_i,   Σ_{i,j} r_ij² = Σ_i λ_i².   (30)
6. If R is positive definite, and all off-diagonal entries are non-negative, then the
components of the eigenvector corresponding to the largest eigenvalue are all
positive.
Most of the time, the eigenvectors are taken to have unit length for identification, and the eigenvalues are ordered from the largest to the smallest: λ_1 ≥ λ_2 ≥ … ≥ λ_p. The set {λ_k} is also referred to as the spectrum of the matrix R.
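The identities in (30) are easily verified numerically; a small Python check on a random symmetric matrix:

    import numpy as np

    rng = np.random.default_rng(2)
    A = rng.normal(size=(5, 5))
    R = (A + A.T) / 2                 # a real symmetric matrix
    lam = np.linalg.eigvalsh(R)

    assert np.isclose(np.linalg.det(R), np.prod(lam))   # det R = product of eigenvalues
    assert np.isclose(np.trace(R), np.sum(lam))         # tr R = sum of eigenvalues
    assert np.isclose(np.sum(R**2), np.sum(lam**2))     # sum of squared entries = sum of squared eigenvalues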
Some numeric linear algebra considerations (Demmel 1997) should be taken into account in the applied analysis. The most important one is how well the matrix is conditioned. The ratio λ_1/λ_p is referred to as the condition number14 of the matrix. The condition number is the (upper bound of the) multiplier for the relative error of a linear algebra algorithm (such as the solution of a system of linear equations, matrix inversion, or an eigenproblem). In other words, it shows by how much the relative error can be expected to go up from the inputs of the algorithm to its output. The relative error in double precision arithmetic is about 10^{-15}, so a condition number of the order 10^{15} means that the linear algebra problems cannot be solved in double precision for this matrix. Ill-conditioned matrices should thus be avoided, and an obvious example of such a matrix is the covariance or correlation matrix of dummy variables that sum up to one (i.e., no category was taken as the base and excluded). For such a matrix, the condition number is infinite. In practice, the solution of the eigenproblem may still yield small non-zero eigenvalues due to round-off errors, but the condition number would still be very high, signalling the problem. Round-off errors can also make some of the zero eigenvalues of a positive semi-definite matrix negative, which is another signal of numeric problems in PCA. Modern software is likely to have ways around such problems, such as early automatic detection of zero eigenvalues15, but the researcher should not rely on this completely, and should make the computations more efficient by unlinking the dummy variables from each other.
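The ill-conditioning produced by a complete set of dummy variables can be demonstrated directly; in the following Python sketch, the hypothetical categorical variable x has four categories, and all four indicators are retained:

    import numpy as np

    rng = np.random.default_rng(3)
    x = rng.integers(0, 4, size=1000)                 # categorical variable with K = 4 categories
    D = (x[:, None] == np.arange(4)).astype(float)    # all four dummies: each row sums to 1

    R_full = np.corrcoef(D, rowvar=False)             # singular correlation matrix
    R_base = np.corrcoef(D[:, 1:], rowvar=False)      # one category dropped as the base

    print(np.linalg.cond(R_full))   # astronomically large (infinite up to round-off)
    print(np.linalg.cond(R_base))   # moderate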
From the statistical point of view, the condition number can be viewed as a crude measure of dependence between standardized variables when the PCA is performed on the correlation matrix: if the variables are independent, all eigenvalues are equal to 1, and the condition number is 1. For dependent data, the condition number will be greater than 1. It may also be useful for collinearity diagnostics in the regression context.
The following example shows the relation between the eigenvalues of two correlation matrices with attenuated correlations.
Example 3. If C₁ and C₂ are correlation matrices of the same size such that their off-diagonal entries are proportional to each other,

c_ij^(1) = λ c_ij^(2),   i ≠ j,   0 < λ < 1,   (31)

then C₁ is a convex combination of C₂ and the identity matrix:

C₁ = λ C₂ + (1 − λ) I.   (32)

Hence, if v is an eigenvector of C₂ with eigenvalue μ, then

C₁ v = λ C₂ v + (1 − λ) v = [λμ + (1 − λ)] v,   (33)

so that the eigenvalues of the two matrices are related by

μ_i(C₁) = λ μ_i(C₂) + (1 − λ).   (34)

15 This is what the Stata software does in its collinearity diagnostics, by dropping some of the variables it deems responsible for the collinearity.
In other words, the spectrum of C₁ is shrunk towards 1 relative to C₂. In particular, the largest eigenvalue of C₁ is greater than one, but not by as much as the largest eigenvalue of C₂. So in the PCA on the two matrices, the first PC will explain more variance for the matrix C₂ than for C₁.

If there is no strict proportionality between the entries of the two matrices, a similar argument can still be made through the Gershgorin circle theorem and its extensions (Brualdi & Mellendorf 1994).
The convex combination (32) would become

C₁ = λ C₂ + (1 − λ) A,   (35)

where A is a symmetric matrix with unit diagonal and off-diagonal entries a_ij. Let

R_λ = max |a_ij|,  i, j = 1, …, n,   λ* = arg min_λ R_λ.   (36)

For the original problem, R_λ can be brought down to zero. In general, the relation (34) between the eigenvalues of the two matrices would need to be attenuated by the rather crude upper bounds R_λ on the absolute values of the off-diagonal elements.
C  Rank correlations
A standard measure of the relation between two variables is their correlation coefficient

ρ = E[(X − E[X])(Y − E[Y])] / (V[X] V[Y])^{1/2}   (37)

(also referred to as the Pearson moment correlation), and its sample analogue

r = (1/(n − 1)) Σ_{i=1}^n ((x_i − x̄)/s_x)((y_i − ȳ)/s_y).   (38)
Rank correlations deal with the ranks r_x and r_y given to the observations by two variables x and y, rather than with the values (x_i, y_i) themselves. Hence the rank correlations are invariant under any monotone transformation of the original variables, not only under linear transformations, as is the moment correlation. Thus the rank correlation of, say, acres of land owned with income will be the same as the rank correlation of the acreage with log income. This is also more useful in our analysis if the underlying welfare and its empirical analogue produced by a version of PCA have a curvilinear rather than a linear relation: in that case, the two will produce the same distribution quintiles, even though the moment correlation will signal a departure of one from the other. Finally, our interest in the rank correlations is due to the use of the PCA indices to rank the observations (e.g., into quintiles).
Spearman's rank correlation ρ_S is the moment correlation of the ranks; with no ties, it can be computed as

ρ_S = 1 − (6/(n(n² − 1))) Σ_{i=1}^n (r_{x,i} − r_{y,i})².   (39)

Kendall's τ is based on the statistic S, the number of concordant pairs of observations minus the number of discordant ones; its two common versions are

τ_a = S/n₀,   τ_b = S/√((n₀ − U)(n₀ − V)),   n₀ = n(n − 1)/2,   U = Σ_i u_i(u_i − 1)/2,   V = Σ_i v_i(v_i − 1)/2,   (40)

where u_i is the multiplicity of the value x_i, and v_i is the multiplicity of the value y_i. Thus τ_b corrects for the ties in the data set.
The interpretation of Kendall's τ is the (relative) number of pairwise transpositions one would need to make to reorder the data so that the two variables agree; and, as is clearly seen from the definition, the balance of concordant vs. discordant pairs of observations.
All the aforementioned correlations can be embedded into the following general formula:

ρ(d) = Σ_{i,j} d(x_i, x_j) d(y_i, y_j) / [ Σ_{i,j} d²(x_i, x_j) · Σ_{i,j} d²(y_i, y_j) ]^{1/2},   (41)

where the generalized measure of discrepancy d(·) between the two observations i, j, as given by the variables x and/or y, is:

– d(x_i, x_j) = sign(x_i − x_j) for Kendall's τ (+1 if the rankings agree, and −1 if they disagree);

– the difference of the ranks, d(x_i, x_j) = r_{x,i} − r_{x,j}, for Spearman's ρ_S;

– d(x_i, x_j) = x_i − x_j for the Pearson moment correlation.
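The embedding (41) can be verified numerically against the standard routines; a Python sketch (the helper general_corr is, of course, hypothetical and written only to mirror the formula):

    import numpy as np
    from scipy.stats import kendalltau, pearsonr, spearmanr, rankdata

    def general_corr(x, y, d):
        """Generalized correlation (41) for a discrepancy function d."""
        dx = np.array([[d(xi, xj) for xj in x] for xi in x])
        dy = np.array([[d(yi, yj) for yj in y] for yi in y])
        return (dx * dy).sum() / np.sqrt((dx**2).sum() * (dy**2).sum())

    rng = np.random.default_rng(4)
    x, y = rng.normal(size=50), rng.normal(size=50)
    rx, ry = rankdata(x), rankdata(y)

    print(general_corr(x, y, np.subtract), pearsonr(x, y)[0])     # Pearson
    print(general_corr(rx, ry, np.subtract), spearmanr(x, y)[0])  # Spearman
    print(general_corr(x, y, lambda a, b: np.sign(a - b)),
          kendalltau(x, y)[0])                                    # Kendall's tau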
The distributions of both rank correlations under the null hypothesis of independence can be derived by noting that all n! permutations are equally likely, and then counting the number of permutations that give a particular value of ρ_S or τ. Those are examples of discrete distributions, even though the values of the random variable are not integers. The asymptotic distribution of either quantity is normal, since each is based on sums of identically distributed random variables; in particular, those correlation coefficients are combinations of U-statistics, which are known to be asymptotically normal (Hoeffding 1948, van der Vaart 1998).

It follows from (41) that Kendall's τ gives smaller weights than Spearman's ρ_S to the pairs of observations that have drastically different ranks. Kendall (1955) shows that in large samples, −1 ≤ 3τ − 2ρ_S ≤ 1, although the limits are attained only for rather peculiar rankings. Another limiting inequality for large samples and τ > 0 is (3/2)τ − 1/2 ≤ ρ_S ≤ 1/2 + τ − (1/2)τ². Those inequalities show that τ usually tends to be smaller in magnitude than ρ_S, and it is possible for the two rank correlations to come out with different signs.
The rank correlations can be linked to the more easily understandable quintile misclassification rates. The worst case scenario is depicted in Fig. 7. In this case, the overall misclassification is the highest, although Kendall's τ is not that badly affected. If a fraction 0 ≤ δ ≤ 1/2 of the observations in the bottom of each quintile are wrongly attributed to the previous quintile, and the same number of observations from the top of each quintile are attributed to the next quintile, then the overall misclassification rate is 1.6δ (so at the maximum, when δ = 1/2, the misclassification rate is 80%: only the 10% at the very bottom and the 10% at the very top are correctly classified), while the rank correlation is 1 − 1.28δ² (and even in the worst case it is still quite high, at 0.68). The upper bound on the misclassification rate (i.e., the worst case relation between the misclassification rate and the rank correlation) can be obtained from these expressions.

Figure 7: The worst case misclassification pattern with the highest Kendall's τ (true rank vs. estimated rank).
D

This appendix considers the results of the principal component analysis performed on a set of dummy variables obtained as indicators of the categories of another variable. Suppose there is a single factor ξ, a single indicator, and a categorical version of that indicator with K categories. This would be the case if one uses the Filmer-Pritchett procedure to obtain weights when only a single categorical variable is observed, or it can be the case when many categorical variables are coded in such a way that each unique combination is represented by a corresponding binary indicator of that combination.
The categorical variable in this example will have a multinomial distribution16. If the proportion of the data in the k-th category is π_k, then the covariance matrix of the dummy variables corresponding to the individual categories is

Σ = [ π₁(1−π₁)   −π₁π₂      …    −π₁π_K     ]
    [ −π₁π₂      π₂(1−π₂)   …    −π₂π_K     ]
    [    ⋮           ⋮       ⋱       ⋮       ]
    [ −π₁π_K     −π₂π_K     …    π_K(1−π_K) ].   (43)

16 A categorical variable x is said to have a multinomial distribution with categories 1, …, K and probabilities p₁, …, p_K if, for a sample of size n,

Prob[ |{i : x_i = 1}| = n₁, …, |{i : x_i = K}| = n_K ] = (n!/(n₁! ⋯ n_K!)) p₁^{n₁} ⋯ p_K^{n_K}.   (42)

This is a natural generalization of the binomial distribution to more than two categories; see Johnson, Kotz & Balakrishnan (1997), Chapter 35.
The next example deals with the case when there is the same number of observations in each category.

Example 4. In the special case when all categories are equally populated (π₁ = π₂ = … = π_K = 1/K), the diagonal elements are (K − 1)/K², and the off-diagonal elements are −1/K². The correlation matrix is then

P = [ 1  ρ  …  ρ ]
    [ ρ  1  …  ρ ]
    [ ⋮  ⋮  ⋱  ⋮ ]
    [ ρ  ρ  …  1 ],   (44)

where

ρ = (−1/K²) / ((K − 1)/K²) = −1/(K − 1).   (45)

Then, by symmetry arguments and direct verification17, the eigenvalues and the corresponding eigenvectors are: zero, with the eigenvector18 K^{−1/2}(1, …, 1)′; and K − 1 eigenvalues of K/(K − 1), with an eigenspace19 generated by the vectors of the form u_i = (1, …, 0, −1, 0, …, 0)′ that have −1 in their i-th position. The proportion explained by the first principal component (and, in fact, by every non-trivial component) will be 1/(K − 1). The first principal component is not well defined in this case: any weights that sum up to zero can be taken as the weights for the first PC. As a result, the sample first PC will be extremely unstable. (The problem may be diagnosed in the way described in Appendix A: the analyst would need to have a look at the second component as well, and would find that the first two sample eigenvalues are close to each other. This result can also be arrived at from the asymptotic distributions viewpoint described in Appendix A.)

17 See Appendix E, p. 52.

18 It corresponds to the bookkeeping condition that the dummy variables sum to one. The variance of this condition is identically zero, and that is shown by the zero eigenvalue. If the dummy variables are coded in such a way that some of them sum up to 1, then one or more eigenvalues will be equal to zero. From the theoretical point of view, this does not constitute a significant problem, as the eigenvalues and the scores would be the same should one of the categories be dropped. The zero eigenvalues, however, may be a problem for the numerical stability of the eigenproblem algorithms, as explained in Appendix B.

19 See Appendix B, p. 41 for an explanation of eigenvectors and eigenspaces.
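The eigenstructure claimed in Example 4 can be checked in a few lines of Python (K = 4 is an arbitrary choice):

    import numpy as np

    K = 4
    rho = -1.0 / (K - 1)
    P = rho * np.ones((K, K)) + (1 - rho) * np.eye(K)   # correlation matrix (44)

    evals = np.sort(np.linalg.eigvalsh(P))
    print(evals)                    # one zero eigenvalue, K-1 eigenvalues equal to K/(K-1)
    print(evals[-1] / evals.sum())  # proportion explained by the first PC: 1/(K-1)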
In the general case, the proportions of the categories will be different, and the next example gives a basic analysis of this case.

Example 5. Let us now consider a more realistic setting where π₁ > π₂ > … > π_K (the ordering is assumed for the sake of transparency of the analysis). This will be the general case for discrete data if we simply consider all possible categories together and create dummy variables for each of them. If there is a natural ordering of the categories, it is disregarded in this analysis.
Now, the correlation matrix becomes

P = [ 1     ρ₁₂   ρ₁₃   …   ρ₁K        ]
    [ ρ₁₂   1     ρ₂₃   …   ρ₂K        ]
    [ ⋮     ⋮     ⋱          ⋮         ]
    [ ρ₁K   ρ₂K   …   ρ_{K−1,K}   1    ].   (46)

Within each row or column, the values of ρ_ij are decreasing, in absolute value, as one moves further away from the diagonal:

ρ_ij = −π_i π_j / √(π_i(1 − π_i) π_j(1 − π_j)) = −√( π_i π_j / ((1 − π_i)(1 − π_j)) ).   (47)
Consider, for example, K = 3 categories with the proportions perturbed away from equal shares21: π = ((1 + δ)/3, 1/3, (1 − δ)/3) for a small δ > 0. To the first order in δ, the correlation matrix becomes

P(δ) = [ 1              −1/2 − 3δ/8     −1/2          ]
       [ −1/2 − 3δ/8    1               −1/2 + 3δ/8   ]
       [ −1/2           −1/2 + 3δ/8     1             ].   (48)

By solving the eigenproblem for this matrix, one finds that the double eigenvalue of 3/2 splits into 3/2 + √3 δ/4 and 3/2 − √3 δ/4, with the zero order terms of the eigenvectors proportional to (0.789, −0.577, −0.211) and (−0.211, −0.577, 0.789), respectively20. The third eigenvalue is still identically zero, which reflects the fact that the sum of the dummy variables related to a single factor is 1, so that the covariance and correlation matrices are singular. The null space eigenvector is (1/√3)(1 + δ/4, 1, 1 − δ/4). The proportion of the variance explained by the first principal component is then (3/2 + √3 δ/4)/3 = 1/2 + √3 δ/12.

20 The first order analysis of the perturbed correlation matrix cannot give the corrections to the eigenvectors a. If the linear system (P − λI)a = 0 is perturbed (P is changed into P + ΔP), then the first order approximation for Δa solves (P − λI)Δa = −(ΔP − Δλ I)a. The matrix on the left hand side, however, is not invertible, so there is no unique solution for Δa. Higher order expansions lead to nonlinear matrix problems. The statistical implication of this is the high sampling variability of the empirical eigenvectors that is indeed found in practice: the observed eigenvector may be quite far from (0.789, −0.577, −0.211) even for fairly small deviations from P.

21 The fact that the middle category is not perturbed is not particularly important: the main issue is that the three categories are not equally populated.
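The first order approximations of this example can be compared with the exact eigenvalues; the following Python sketch builds the exact correlation matrix from (47) for π = ((1+δ)/3, 1/3, (1−δ)/3):

    import numpy as np

    delta = 0.05
    pi = np.array([(1 + delta) / 3, 1 / 3, (1 - delta) / 3])
    P = np.eye(3)
    for i in range(3):
        for j in range(3):
            if i != j:
                P[i, j] = -np.sqrt(pi[i] * pi[j] / ((1 - pi[i]) * (1 - pi[j])))  # (47)

    evals = np.sort(np.linalg.eigvalsh(P))
    print(evals)   # ~ (0, 3/2 - sqrt(3)*delta/4, 3/2 + sqrt(3)*delta/4)
    print(1.5 - np.sqrt(3) * delta / 4, 1.5 + np.sqrt(3) * delta / 4)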
E
E.1
Suppose the (continuous, fully observed) data x₁, …, x_p come from the model with one latent variable (cf. (5)):

x_k = λ_k ξ + ε_k,   V[ξ] = φ,   V[ε] = diag(σ₁², …, σ_p²),   (49)

λ_k = b,   k = 1, …, p₁,   (50)

λ_k = 1,   k = p₁ + 1, …, p₁ + p₂ = p.   (51)

Taking, for simplicity, a common error variance σ_k² = σ², the covariance matrix of the data is

V[x] = [ b²φ+σ²   …   b²φ      bφ     …    bφ   ]
       [   ⋮      ⋱    ⋮       ⋮           ⋮   ]
       [  b²φ     …  b²φ+σ²    bφ     …    bφ   ]
       [  bφ      …   bφ      φ+σ²    …    φ    ]
       [   ⋮           ⋮       ⋮      ⋱    ⋮   ]
       [  bφ      …   bφ       φ      …   φ+σ²  ],   (52)

and the corresponding correlation matrix is

C = Corr[x] = [ 1  u  …  w  …  w ]
              [ u  1  …  w  …  w ]
              [ ⋮  ⋮  ⋱  ⋮     ⋮ ]
              [ w  w  …  1  …  v ]
              [ ⋮  ⋮     ⋮  ⋱  ⋮ ]
              [ w  w  …  v  …  1 ],

u = b²φ/(b²φ + σ²),   v = φ/(φ + σ²),   w = bφ/√((b²φ + σ²)(φ + σ²)) = √(uv).   (53)
E.2  Optimal prediction
Consider a linear index of the observed variables,

ξ̂ = Σ_{k=1}^p a_k x_k.   (54)

Although it may look like a regression-type prediction, it really is not. The model (49)-(51) defines p equations with the x_k being the dependent variables, and ξ being the only explanatory variable. Rather, this is a problem of inverse regression: given the values of the dependent variable(s), construct the best estimate of the explanatory variable.

If x₁, …, x_p, ξ have a multivariate normal distribution (as would be the case if ξ ~ N(0, φ) and ε_k ~ N(0, σ²)):

(x₁, …, x_p, ξ)′ ~ N( 0, [ Σ      σ_xξ ]
                         [ σ_xξ′   φ   ] ),   (55)

where Σ = V[x] and σ_xξ = Cov[x, ξ], then by the properties of the normal distribution (Mardia et al. 1980),

E[ξ | x] = E[ξ] + σ_xξ′ Σ^{−1} (x − E[x]) = σ_xξ′ Σ^{−1} x.   (56)

The best (in mean squared error) linear prediction problem is

E[ ξ − Σ_{k=1}^p a_k x_k ]² → min over a₁, …, a_p.   (57)

Note further that the first p₁ variables are the same in their statistical properties, and a permutation of them would not change the covariance matrix in (55), so neither would it change the weights resulting from (56). Thus, the first p₁ values a₁, …, a_{p₁} are identical: a₁ = … = a_{p₁}. Likewise, the last p₂ entries are also equal to each other: a_{p₁+1} = … = a_p. Let us denote a₁ = β and a_p = γ. Then the projection problem becomes

E[ ξ − β Σ_{k=1}^{p₁} x_k − γ Σ_{k=p₁+1}^{p₁+p₂} x_k ]² → min.   (58)

Then

E[ ξ − β Σ_{k=1}^{p₁} x_k − γ Σ_{k=p₁+1}^{p₁+p₂} x_k ]² = E[ ξ − β Σ_{k=1}^{p₁} (bξ + ε_k) − γ Σ_{k=p₁+1}^{p₁+p₂} (ξ + ε_k) ]²   (59)

= E[ (1 − p₁bβ − p₂γ) ξ − β Σ_{k=1}^{p₁} ε_k − γ Σ_{k=p₁+1}^{p₁+p₂} ε_k ]² = β² p₁ σ² + γ² p₂ σ² + (p₁bβ + p₂γ − 1)² φ.   (60)

The first order conditions are

∂V/∂β = 2β p₁ σ² + 2 p₁ b φ (p₁bβ + p₂γ − 1) = 0,   ∂V/∂γ = 2γ p₂ σ² + 2 p₂ φ (p₁bβ + p₂γ − 1) = 0,   (61)

whence

β = bγ,   (62)

γ = φ / ((p₁b² + p₂)φ + σ²),   ξ̂ = γ ( b Σ_{k=1}^{p₁} x_k + Σ_{k=p₁+1}^{p₁+p₂} x_k ).   (63)
The most important implication is that the ratio of the weights in (63) is equal to b, the
ratio of the original factor loadings in (50) and (51).
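This implication is easy to verify numerically; the Python sketch below builds V[x] from (52) for hypothetical parameter values and computes the weights of the conditional expectation (56):

    import numpy as np

    b, phi, sigma2, p1, p2 = 0.7, 1.0, 0.5, 3, 4        # hypothetical parameter values
    lam = np.concatenate([np.full(p1, b), np.ones(p2)]) # loadings (50)-(51)
    Sigma = phi * np.outer(lam, lam) + sigma2 * np.eye(p1 + p2)   # V[x] from (52)

    a = np.linalg.solve(Sigma, phi * lam)   # weights of E[xi | x] in (56)
    print(a[0] / a[-1])                     # equals b = 0.7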
E.3

Let us first consider a somewhat simpler case of the eigenproblem for the p × p equicorrelation matrix

C(u) = [ 1  u  …  u ]
       [ u  1  …  u ]
       [ ⋮  ⋮  ⋱  ⋮ ]
       [ u  u  …  1 ].   (64)

Direct verification shows that the vectors e₁ − e_i are eigenvectors corresponding to the eigenvalue 1 − u:

C(u)(e₁ − e_i) = (1 − u)(e₁ − e_i),   i = 2, …, p,   (65)

and together they span the (p − 1)-dimensional eigenspace {a : Σ_j a_j = 0},   (66)

while the vector of ones is the remaining eigenvector, with the eigenvalue 1 + (p − 1)u:

C(u)(1, …, 1)′ = [1 + (p − 1)u](1, …, 1)′.   (67)

Returning to the correlation matrix C in (53), its largest eigenvalue solves the variational problem

λ₁ = max_{a : ‖a‖ = 1} a′ C a.   (68)
By the symmetry argument similar to the one in the preceding section, the first p₁ elements of a should be identical, and the last p₂ elements of a should also be identical. Denoting them α and γ, respectively, the quadratic form becomes

a′ C a = p₁α²[1 + (p₁ − 1)u] + 2 p₁ p₂ αγ w + p₂γ²[1 + (p₂ − 1)v].   (69)

Since w = √(uv), the dominant part of this quadratic form (for large p₁ and p₂) is (p₁α√u + p₂γ√v)², so the problem becomes, approximately,

(p₁α√u + p₂γ√v)² → max subject to p₁α² + p₂γ² = 1.   (70)
The Lagrangian is

L(α, γ, λ) = (p₁α√u + p₂γ√v)² − λ(p₁α² + p₂γ² − 1),   (71)

with the first order conditions

∂L/∂α = 2p₁√u (p₁α√u + p₂γ√v) − 2λp₁α = 0,   ∂L/∂γ = 2p₂√v (p₁α√u + p₂γ√v) − 2λp₂γ = 0.   (72)

Dividing one condition by the other,

α/γ = √(u/v) = b √((φ + σ²)/(b²φ + σ²)), which is ≥ b for b ≤ 1 and ≤ b for b ≥ 1;   α = √u/√(p₁u + p₂v),   γ = √v/√(p₁u + p₂v).   (73)

Thus the weights differ from being proportional to (b, …, b, 1, …, 1), although not very greatly if b is close to 1. The value of the quadratic form is the largest eigenvalue:

λ₁ = (p₁α√u + p₂γ√v)² = p₁u + p₂v,   (74)

so that the proportion of the explained variance is

λ₁/p = (p₁u + p₂v)/(p₁ + p₂).   (75)
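The accuracy of the approximations (73)-(75) can be assessed by comparing them with the exact eigendecomposition of C built from (53); a Python sketch with hypothetical parameter values:

    import numpy as np

    b, phi, sigma2, p1, p2 = 0.7, 1.0, 0.5, 20, 30   # hypothetical parameter values
    u = b**2 * phi / (b**2 * phi + sigma2)
    v = phi / (phi + sigma2)
    w = np.sqrt(u * v)

    p = p1 + p2
    C = np.empty((p, p))
    C[:p1, :p1] = u
    C[p1:, p1:] = v
    C[:p1, p1:] = C[p1:, :p1] = w
    np.fill_diagonal(C, 1.0)

    evals, evecs = np.linalg.eigh(C)
    lead = evecs[:, -1]                       # first PC weights
    print(evals[-1], p1 * u + p2 * v)         # exact vs. approximate lambda_1, (74)
    print(lead[0] / lead[-1], np.sqrt(u / v)) # weight ratio vs. (73)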
If the PCA is performed on the original variables (or their covariance matrix), then we can show that the first principal component will be the same as that from (63). Indeed, upon invoking the symmetry argument once again, the problem now becomes

R₁ = V[ β Σ_{k=1}^{p₁} x_k + γ Σ_{k=p₁+1}^{p₁+p₂} x_k ] → max subject to p₁β² + p₂γ² = 1.   (76)

Then

V[ β Σ_{k=1}^{p₁} x_k + γ Σ_{k=p₁+1}^{p₁+p₂} x_k ] = V[ β Σ_{k=1}^{p₁} (bξ + ε_k) + γ Σ_{k=p₁+1}^{p₁+p₂} (ξ + ε_k) ]
= V[ (p₁bβ + p₂γ) ξ + β Σ_{k=1}^{p₁} ε_k + γ Σ_{k=p₁+1}^{p₁+p₂} ε_k ] = (p₁bβ + p₂γ)² φ + β² p₁ σ² + γ² p₂ σ².   (77)

The Lagrangian for the problem is

L(β, γ, λ) = (p₁bβ + p₂γ)² φ + β² p₁ σ² + γ² p₂ σ² − λ(p₁β² + p₂γ² − 1),   (78)

with the first order conditions

∂L/∂β = 2p₁bφ (p₁bβ + p₂γ) + 2βp₁σ² − 2λp₁β = 0,   (79)

∂L/∂γ = 2p₂φ (p₁bβ + p₂γ) + 2γp₂σ² − 2λp₂γ = 0.   (80)

Dividing the two conditions shows that β = bγ once again, so that, up to normalization, the weights of the first principal component of the covariance matrix coincide with the optimal prediction weights in (63).
References
Anderson, T. W. (1963), Asymptotic theory for principal component analysis, The Annals of Mathematical Statistics 34, 122–148.

Anderson, T. W. (2003), An Introduction to Multivariate Statistical Analysis, 3rd edn, John Wiley and Sons, New York.

Babakus, E., Ferguson, Jr., C. E. & Joereskog, K. G. (1987), The sensitivity of confirmatory maximum likelihood factor analysis to violations of measurement scale and distributional assumptions, Journal of Marketing Research 24, 222–228.

Bai, J. (2003), Inferential theory for factor models of large dimensions, Econometrica 71, 135–171.

Bartholomew, D. & Knott, M. (1999), Latent Variable Models and Factor Analysis, Kendall's Library of Statistics 7, Arnold Publishers.

Bartolo, A. D. (2000), Human capital estimation through structural equation models with some categorical observed variables, Working paper, IRISS at CEPS/INSTEAD. RePEc handle: RePEc:irs:iriswp:2000-02.

Bollen, K. (1989), Structural Equations with Latent Variables, Wiley and Sons, New York.

Bollen, K. A. & Barb, K. H. (1981), Pearson's R and coarsely categorized measures, American Sociological Review 46, 232–239.

Bollen, K. A., Glanville, J. L. & Stecklov, G. (2001), Socioeconomic status and class in studies of fertility and health in developing countries, Annual Review of Sociology 27, 153–185.

Bollen, K. A., Glanville, J. L. & Stecklov, G. (2002), Economic status proxies in studies of fertility in developing countries: Does the measure matter?, Population Studies 56, 81–96. DOI: 10.1080/00324720213796.
Bollen, K. A. & Long, J. S., eds (1993), Testing Structural Equation Models, SAGE Publications, Thousand Oaks, CA.

Brualdi, R. A. & Mellendorf, S. (1994), Regions in the complex plane containing the eigenvalues of a matrix, The American Mathematical Monthly 101, 975–985.

Caudill, S. B., Zanella, F. C. & Mixon, F. G. (2000), Is economic freedom one dimension? A factor analysis of some common measures of economic freedom, Journal of Economic Development 25, 17–40.

Choi, I. (2002), Structural changes and seemingly unidentified structural equations, Econometric Theory 18, 744–775.

Davis, A. W. (1977), Asymptotic theory for principal component analysis: Non-normal case, Australian Journal of Statistics 19, 206–212.

Demmel, J. W. (1997), Applied Numerical Linear Algebra, SIAM, Philadelphia.

DiStefano, C. (2002), The impact of categorization with confirmatory factor analysis, Structural Equation Modeling 9, 327–346.

Dolan, C. V. (1994), Factor analysis with 2, 3, 5 and 7 response categories: A comparison of categorical variable estimators using simulated data, British Journal of Mathematical and Statistical Psychology 47, 309–326.

Drakos, K. (2002), Common factor in eurocurrency rates: A dynamic analysis, Journal of Economic Integration 17, 164–184.

Filmer, D. & Pritchett, L. (1998), Estimating wealth effects without expenditure data or tears: An application to educational enrollments in states of India, World Bank Policy Research Working Paper No. 1994, The World Bank, Washington, DC.

Filmer, D. & Pritchett, L. (2001), Estimating wealth effects without expenditure data or tears: An application to educational enrollments in states of India, Demography 38, 115–132.

Flury, B. (1988), Common Principal Components and Related Multivariate Methods, John Wiley and Sons, New York.

Gwatkin, D. R., Rustein, S., Johnson, K., Suliman, E. A. & Wagstaff, A. (2003a), Socio-economic differences in health, nutrition, and population, Technical report, World Bank. Volume 1: Armenia – Kyrgyz Republic.
Gwatkin, D. R., Rustein, S., Johnson, K., Suliman, E. A. & Wagstaff, A. (2003b), Socio-economic differences in health, nutrition, and population, Technical report, World Bank. Volume 2: Madagascar – Zimbabwe.

Harris, D. (1997), Principal component analysis of cointegrated time series, Econometric Theory 13, 529–557.

Hoeffding, W. (1948), A class of statistics with asymptotically normal distribution, Annals of Mathematical Statistics 19, 293–325.

Horn, R. A. & Johnson, C. R. (1990), Matrix Analysis, Cambridge University Press, Cambridge, UK.

Hotelling, H. (1933), Analysis of a complex of statistical variables into principal components, Journal of Educational Psychology 24, 417–441, 498–520.

Huber, P. J. (2003), Robust Statistics, John Wiley and Sons, New York.

Johnson, D. R. & Creech, J. C. (1983), Ordinal measures in multiple indicator models: A simulation study of categorization error, American Sociological Review 48, 398–407.

Johnson, N. L., Kotz, S. & Balakrishnan, N. (1997), Discrete Multivariate Distributions, John Wiley and Sons, New York.

Johnstone, I. M. (2001), On the distribution of the largest eigenvalue in principal components analysis, Annals of Statistics 29, 295–327.

Jolliffe, I. T. (2002), Principal Component Analysis, 2nd edn, Springer, Heidelberg and New York.

Joreskog, K. (2004a), Structural equation modeling with ordinal variables.

Joreskog, K. (2004b), Structural Equation Modeling With Ordinal Variables using LISREL. Notes on LISREL 8.52. http://www.ssicentral.com/lisrel/ordinal.pdf.

Judd, K. L. (1998), Numerical Methods in Economics, MIT Press, Cambridge, MA.

Kaplan, D. (2000), Structural Equation Modeling: Foundations and Extensions, SAGE Publications, Thousand Oaks, CA.

Kendall, M. G. (1955), Rank Correlation Methods, 2nd edn, Charles Griffin & Co., London.
Rencher, A. C. (2002), Methods of Multivariate Analysis, John Wiley and Sons, New York.

Skinner, C. J., Holmes, D. J. & Smith, T. M. F. (1986), The effect of sample design on principal component analysis, Journal of the American Statistical Association 81, 789–798.

SSI (2004), LISREL software, Release 8.52 for Windows, Scientific Software International, Lincolnwood, IL.

Stata Corporation (2003), Stata Software, Release 8, Stata Corporation, College Station, TX.

Stock, J. H. & Watson, M. W. (2002), Forecasting using principal components from a large number of predictors, Journal of the American Statistical Association 97, 1167–1179.

van der Vaart, A. W. (1998), Asymptotic Statistics, Cambridge University Press, Cambridge, UK.

Wansbeek, T. & Meijer, E. (2000), Measurement Error and Latent Variables in Econometrics, North-Holland, Amsterdam.

Webster, T. J. (2001), A principal component analysis of the U.S. News & World Report tier rankings of colleges and universities, Economics of Education Review 20, 235–244.

Weisstein, E. W. (2004), Eigenvalue. From MathWorld - A Wolfram Web Resource. http://mathworld.wolfram.com/Eigenvalue.html.

Wooldridge, J. M. (2002), Econometric Analysis of Cross Section and Panel Data, The MIT Press, Cambridge, MA.