Factor Analysis
X = Fw (19.1)
[n × p] = [n × p][ p × p]
where I’ve checked the dimensions of the matrices underneath. This is an exact equa-
tion involving no noise, approximation or error, but it’s kind of useless; we’ve re-
placed p-dimensional vectors in X with p-dimensional vectors in F. If we keep only
the q < p largest principal components, that corresponds to dropping columns from
F and rows from w. Let’s say that the truncated matrices are Fq and wq . Then
X ≈ Fq wq (19.2)
[n × p] = [n × q][q × p]
The error of approximation — the difference between the left- and right- hand-sides
of Eq. 19.2 — will get smaller as we increase q. (The line below the equation is a
sanity-check that the matrices are the right size, which they are. Also, at this point
the subscript qs get too annoying, so I’ll drop them.) We can of course make the two
sides match exactly by adding an error or residual term on the right:
X = Fw + ε (19.3)
1. All of the observable random variables Xi have mean zero and variance 1.
2. All of the latent factors have mean zero and variance 1.
3. The noise terms ε all have mean zero.
4. The factors are uncorrelated across individuals (rows of F) and across variables (columns).
5. The noise terms are uncorrelated across individuals, and across observable variables.
6. The noise terms are uncorrelated with the factor scores.
Item (1) isn’t restrictive, because we can always center and standardize our data. Item
(2) isn’t restrictive either — we could always center and standardize the factor vari-
ables without really changing anything. Item (3) actually follows from (1) and (2).
The substantive assumptions — the ones which will give us predictive power but
could also go wrong, and so really define the factor model — are the others, about
lack of correlation. Where do they come from?
Remember what the model looks like:
X = Fw + ε (19.4)
All of the systematic patterns in the observations X should come from the first term
on the right-hand side. The residual term ε should, if the model is working, be un-
predictable noise. Items (4) through (6) express a very strong form of this idea. In
particular it’s vital that the noise be uncorrelated with the factor scores.
was about the only way people had of measuring relationships between variables. Since then it’s been
propagated by statistics courses where it is the only way people are taught to measure relationships. The
great statistician John Tukey once wrote “Does anyone know when the correlation coefficient is useful, as
opposed to when it is used? If so, why not tell us?” (Tukey, 1954, p. 721).
[Figure 19.1 appears here: factor nodes F1, F2, F3 (circles) above observable nodes X1–X6 (boxes), with noise nodes E1–E6 (circles) below; arrows run from factors and noises to the observables.]
Figure 19.1: Graphical model form of a factor model. Circles stand for the unob-
served variables (factors above, noises below), boxes for the observed features. Edges
indicate non-zero coefficients — entries in the factor loading matrix w, or specific
variances ψi . Arrows representing entries in w are decorated with those entries. Note
that it is common to omit the noise variables in such diagrams, with the implicit un-
derstanding that every variable with an incoming arrow also has an incoming noise
term.
Notice that not every observable is connected to every factor: this depicts the fact
that some entries in w are zero. In the figure, for instance, X1 has an arrow only from
F1 and not the other factors; this means that while w11 = 0.87, w21 = w31 = 0.
Drawn this way, one sees how the factor model is generative — how it gives us
a recipe for producing new data. In this case, it’s: draw new, independent values for
the factor scores F1 , F2 , . . . Fq ; add these up with weights from w; and then add on the
final noises ε1 , ε2 , . . . ε p . If the model is right, this is a procedure for generating new,
synthetic data with the same characteristics as the real data. In fact, it’s a story about
how the real data came to be — that there really are some latent variables (the factor
scores) which linearly cause the observables to have the values they do.
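To make the recipe concrete, here is a minimal R sketch of that procedure. It is mine, not part of the text: the sizes and the loading matrix are invented for illustration (only the 0.87 entry echoes Figure 19.1).

# Minimal sketch of the generative recipe; all parameter values are illustrative
n <- 1000                            # number of individuals
q <- 3                               # number of factors
p <- 6                               # number of observable features
w <- rbind(c(0.87, 0.60, 0.00, 0.00, 0.00, 0.00),   # made-up q x p loading matrix,
           c(0.00, 0.30, 0.70, 0.50, 0.00, 0.00),   # sparse like the graph in
           c(0.00, 0.00, 0.00, 0.40, 0.90, 0.60))   # Figure 19.1
psi <- runif(p, 0.1, 0.5)            # specific (noise) variances
F <- matrix(rnorm(n * q), n, q)      # independent factor scores, mean 0, variance 1
eps <- matrix(rnorm(n * p), n, p) %*% diag(sqrt(psi))   # noises with variances psi
X <- F %*% w + eps                   # synthetic data, as in Eq. 19.3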
so long as i ≠ j.
The jargon says that observable i loads on factor k when w_ki ≠ 0. If two observ-
ables do not load on to any of the same factors, if they do not share any common
factors, then they will be independent. If we could condition on (“control for”) the
factors, all of the observables would be conditionally independent.
Graphically, we draw an arrow from a factor node to an observable node if and
only if the observable loads on the factor. So then we can just see that two observables
are correlated if they both have in-coming arrows from the same factors. (To find
the actual correlation, we multiply the weights on all the edges connecting the two
observable nodes to the common factors; that’s Eq. 19.11.) Conversely, even though
the factors are marginally independent of each other, if two factors both send arrows
to the same observable, then they are dependent conditional on that observable.2
2 To see that this makes sense, suppose that X_1 = F_1 w_11 + F_2 w_21 + ε_1. If we know the value of X_1, we know what F_1, F_2 and ε_1 have to add up to, so they are conditionally dependent.
X = ε + Gw (19.12)
(v_ij / v_kj) / (v_il / v_kl) = (w_i w_j / w_k w_j) / (w_i w_l / w_k w_l)   (19.14)
= (w_i / w_k) / (w_i / w_k)   (19.15)
= 1   (19.16)
The relationship
v_ij v_kl = v_il v_kj   (19.17)
is called the “tetrad equation”, and we will meet it again later when we consider meth-
ods for causal discovery in Part III. In Spearman’s model, this is one tetrad equation
for every set of four distinct variables.
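As a quick numerical illustration (mine, with made-up loadings), a covariance matrix built from a one-factor model satisfies the tetrad equation exactly:

# Tetrad equation check for a one-factor covariance matrix (illustrative loadings)
w <- c(0.9, 0.8, 0.7, 0.6)           # one-factor loadings for four observables
psi <- 1 - w^2                       # specific variances, so every variance is 1
v <- outer(w, w) + diag(psi)         # v[i,j] = w_i w_j off the diagonal
v[1, 2] * v[3, 4]                    # v_ij v_kl ...
v[1, 4] * v[3, 2]                    # ... equals v_il v_kj, as Eq. 19.17 requires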
Spearman found that the tetrad equations held in his data on school grades (to a
good approximation), and concluded that a single general factor of intelligence must
exist. This was, of course, logically fallacious.
Later work, using large batteries of different kinds of intelligence tests, showed
that the tetrad equations do not hold in general, or more exactly that departures
from them are too big to explain away as sampling noise. (Recall that the equations
are about the true correlations between the variables, but we only get to see sample
correlations, which are always a little off.) The response, done in an ad hoc way
by Spearman and his followers, and then more systematically by Thurstone, was to
introduce multiple factors. This breaks the tetrad equation, but still accounts for
the correlations among features by saying that features are really directly correlated
with factors, and uncorrelated conditional on the factor scores.3 Thurstone’s form of
factor analysis is basically the one people still use — there have been refinements, of
course, but it’s mostly still his method.
19.4 Estimation
The factor model introduces a whole bunch of new variables to explain the observ-
ables: the factor scores F, the factor loadings or weights w, and the observable-specific
variances ψi . The factor scores are specific to each individual, and individuals by as-
sumption are independent, so we can’t expect them to really generalize. But the
loadings w are, supposedly, characteristic of the population. So it would be nice if we
could separate estimating the population parameters from estimating the attributes
of individuals; here’s how.
Since the variables are centered, we can write the covariance matrix in terms of
the data frames:

v = E[(1/n) X^T X]   (19.18)
3 You can (and should!) read the classic “The Vectors of Mind” paper (Thurstone, 1934) online.
(This is the true, population covariance matrix on the left.) But the factor model tells
us that
X = Fw + ε (19.19)
This involves the factor scores F, but remember that when we looked at the correlations
between individual variables, those went away, so let’s substitute Eq. 19.19 into Eq.
19.18 and see what happens:
E[(1/n) X^T X]   (19.20)
= (1/n) E[(ε^T + w^T F^T)(Fw + ε)]   (19.21)
= (1/n) (E[ε^T ε] + w^T E[F^T ε] + E[ε^T F] w + w^T E[F^T F] w)   (19.22)
= ψ + 0 + 0 + (1/n) w^T (nI) w   (19.23)
= ψ + w^T w   (19.24)
Behold:
v = ψ + w^T w   (19.25)
The individual-specific variables F have gone away, leaving only population parame-
ters on both sides of the equation.
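A small simulation (my own, with arbitrary parameter values) shows Eq. 19.25 at work: the sample covariance of data drawn from the model settles down to ψ + w^T w as n grows.

# Numerical check of Eq. 19.25 (v = psi + w^T w) on simulated data
set.seed(1)
n <- 1e5; q <- 2; p <- 4
w <- rbind(c(0.8, 0.6, 0.0, 0.0),            # illustrative q x p loadings
           c(0.0, 0.3, 0.7, 0.5))
psi <- c(0.36, 0.55, 0.51, 0.75)             # chosen so every feature has variance 1
X <- matrix(rnorm(n * q), n, q) %*% w +
     matrix(rnorm(n * p), n, p) %*% diag(sqrt(psi))
max(abs(cov(X) - (diag(psi) + t(w) %*% w)))  # small, and shrinks as n grows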
More equations (constraints) than unknowns (free parameters) This is more in-
teresting. In general, systems of equations like this are overdetermined, meaning
that there is no way to satisfy all the constraints at once, and there isn’t even a single
solution. It’s just not possible to write an arbitrary covariance matrix v among, say,
seven variables in terms of, say, a one-factor model (as p( p −1)/2 = 7(7−1)/2 = 21 >
7(1)−1(1−1)/2 = 7 = pq −q(q −1)/2). But it is possible for special covariance matri-
ces. In these situations, the factor model actually has testable implications for the data
— it says that only certain covariance matrices are possible and not others. For ex-
ample, we saw above that the one-factor model implies the tetrad equations must hold
among the observable covariances; the constraints on v for multiple-factor models
are similar in kind but more complicated algebraically. By testing these implications,
we can check whether or not our favorite factor model is right.5
Now we don’t know the true, population covariance matrix v, but we can esti-
mate it from data, getting an estimate v̂. The natural thing to do then is to equate this
with the parameters and try to solve for the latter:

v̂ = ψ̂ + ŵ^T ŵ   (19.26)
The book-keeping for degrees of freedom here is the same as for Eq. 19.25. If q is too
large relative to p, the model is unidentifiable; if it is too small, the matrix equation
can only be solved if v̂ is of the right, restricted form, i.e., if the model is right. Of
course even if the model is right, the sample covariances are the true covariances plus
noise, so we shouldn’t expect to get an exact match, but we can try in various ways to
minimize the discrepancy between the two sides of the equation.
5 Actually, we need to be a little careful here. If we find that the tetrad equations don’t hold, we know
a one-factor model must be wrong. We could only conclude that the one-factor model must be right if we
found that the tetrad equations held, and that there were no other models which implied those equations;
but, as we’ll see, there are.
The exception is that v_ii = w_i^2 + ψ_i, rather than w_i^2. However, if we look at u = v − ψ, that's the same as v off the diagonal, and a little algebra shows that its diagonal entries are, in fact, just w_i^2. So if we look at any two rows of u, they're proportional to each other:

u_ij = (w_i / w_k) u_kj   (19.28)

This means that, when Spearman's model holds true, there is actually only one linearly-independent row in u.
Recall from linear algebra that the rank of a matrix is how many linearly inde-
pendent rows it has.6 Ordinarily, the matrix is of full rank, meaning all the rows are
linearly independent. What we have just seen is that when Spearman’s model holds,
the matrix u is not of full rank, but rather of rank 1. More generally, when the factor
model holds with q factors, the matrix u = w^T w has rank q. The diagonal entries of
u, called the common variances or commonalities, are no longer automatically 1,
but rather show how much of the variance in each observable is associated with the
variances of the latent factors. Like v, u is a positive symmetric matrix.
Because u is a positive symmetric matrix, we know from linear algebra that it can
be written as
u = c d c^T   (19.29)
where c is the matrix whose columns are the eigenvectors of u, and d is the diagonal
matrix whose entries are the eigenvalues. That is, if we use all p eigenvectors, we can
reproduce the covariance matrix exactly. Suppose we instead use c_q, the p × q matrix whose columns are the eigenvectors going with the q largest eigenvalues, and likewise make d_q the diagonal matrix of those eigenvalues. Then c_q d_q c_q^T will be a symmetric positive p × p matrix. This is a matrix of rank q, and so can only equal u if the
latter also has rank q. Otherwise, it’s an approximation which grows more accurate
as we let q grow towards p, and, at any given q, it’s a better approximation to u than
any other rank-q matrix. This, finally, is the precise sense in which factor analysis
tries to preserve correlations, as opposed to principal components trying to preserve
variance.
To resume our algebra, define d_q^{1/2} as the q × q diagonal matrix of the square roots of the eigenvalues. Clearly d_q = d_q^{1/2} d_q^{1/2}. So

c_q d_q c_q^T = c_q d_q^{1/2} d_q^{1/2} c_q^T = (c_q d_q^{1/2})(c_q d_q^{1/2})^T   (19.30)

So we have

u ≈ (c_q d_q^{1/2})(c_q d_q^{1/2})^T   (19.31)
6 We could also talk about the columns; it wouldn’t make any difference.
but at the same time we know that u = w^T w. So we just identify w with (c_q d_q^{1/2})^T:

w = (c_q d_q^{1/2})^T   (19.32)

and we are done with our algebra.
Let’s think a bit more about how well we’re approximating v. The approximation
will always be exact when q = p, so that there is one factor for each feature (in which
case ψ = 0 always). Then all factor analysis does for us is to rotate the coordinate
axes in feature space, so that the new coordinates are uncorrelated. (This is the same
as what PCA does with p components.) The approximation can also be exact with
fewer factors than features if the reduced covariance matrix is of less than full rank,
and we use at least as many factors as the rank.
û = v̂ − ψ̂   (19.33)

We can't actually calculate û until we know, or have a guess as to, ψ̂. A reasonable and common starting-point is to do a linear regression of each feature j on all the other features, and then set ψ̂_j to the mean squared error for that regression. (We'll come back to this guess later.)
Once we have the reduced correlation matrix, find its top q eigenvalues and eigenvectors, getting matrices ĉ_q and d̂_q as above. Set the factor loadings accordingly, and re-calculate the specific variances:

ŵ = (ĉ_q d̂_q^{1/2})^T   (19.34)

ψ̂_j = 1 − Σ_{r=1}^{q} ŵ_{rj}^2   (19.35)

ṽ ≡ ψ̂ + ŵ^T ŵ   (19.36)
The “predicted” covariance matrix ṽ in the last line is exactly right on the diagonal (by
construction), and should be closer off-diagonal than anything else we could do with
the same number of factors. However, our guess as to u depended on our initial guess
about ψ, which has in general changed, so we can try iterating this (i.e., re-calculating
c_q and d_q), until we converge.
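Here is a rough R transcription of that iteration, written from the description above. It is my sketch, not the class's accompanying code, and it assumes the columns of X have already been standardized (so the sample covariance is a correlation matrix).

# Sketch of the "principal factors" iteration (assumes standardized X)
principal.factors <- function(X, q, max.iter = 100, tol = 1e-6) {
  v <- cor(X)
  p <- ncol(X)
  # starting guess: psi_j = mean squared error of regressing feature j on the others
  psi <- sapply(1:p, function(j) mean(residuals(lm(X[, j] ~ X[, -j]))^2))
  for (iter in 1:max.iter) {
    u <- v - diag(psi)                       # reduced covariance matrix, Eq. 19.33
    eig <- eigen(u, symmetric = TRUE)
    cq <- eig$vectors[, 1:q, drop = FALSE]
    dq <- pmax(eig$values[1:q], 0)           # guard against tiny negative eigenvalues
    w <- t(cq %*% diag(sqrt(dq), q))         # factor loadings, Eq. 19.34 (q x p)
    psi.new <- 1 - colSums(w^2)              # specific variances, Eq. 19.35
    if (max(abs(psi.new - psi)) < tol) break
    psi <- psi.new
  }
  list(loadings = w, uniquenesses = psi)
}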
with immensely large samples.) On the other hand, we know from our elementary
statistics courses that maximum likelihood estimates are generally consistent, unless
we choose a spectacularly bad model. Can we use that here?
We can, but at a cost. We have so far got away with just making assumptions
about the means and covariances of the factor scores F. To get an actual likelihood,
we need to assume something about their distribution as well.
The usual assumption is that F_ik ∼ N(0, 1), and that the factor scores are independent across factors k = 1, . . . q and individuals i = 1, . . . n. With this assumption, the features have a multivariate normal distribution X⃗_i ∼ N(0, ψ + w^T w). This means that the log-likelihood is

L = −(np/2) log 2π − (n/2) log |ψ + w^T w| − (n/2) tr[(ψ + w^T w)^{-1} v̂]   (19.37)

where tr a is the trace of the matrix a, the sum of its diagonal elements. Notice that the likelihood only involves the data through the sample covariance matrix v̂ — the actual factor scores F are not needed for the likelihood.
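Eq. 19.37 is easy to transcribe directly. The following function is my own sketch; it takes the sample covariance v̂, the loadings w and the specific variances ψ as given.

# Gaussian factor-model log-likelihood of Eq. 19.37 (direct transcription)
factor.loglike <- function(v.hat, w, psi, n) {
  sigma <- diag(psi) + t(w) %*% w            # implied covariance, psi + w^T w
  p <- ncol(sigma)
  as.numeric(-n * p / 2 * log(2 * pi) -
             n / 2 * determinant(sigma, logarithm = TRUE)$modulus -
             n / 2 * sum(diag(solve(sigma, v.hat))))  # last term is tr(sigma^{-1} v.hat)
}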
One can either try direct numerical maximization, or use a two-stage procedure.
Starting, once again, with a guess as to ψ, one finds that the optimal choice of ψ^{1/2} w^T is given by the matrix whose columns are the q leading eigenvectors of ψ^{1/2} v̂ ψ^{1/2}. Starting from a guess as to w, the optimal choice of ψ is given by the diagonal entries of v̂ − w^T w. So again one starts with a guess about the unique variances (e.g., the residuals of the regressions) and iterates to convergence.7
The differences between the maximum likelihood estimates and the “principal
factors” approach can be substantial. If the data appear to be normally distributed
(as shown by the usual tests), then the additional efficiency of maximum likelihood
estimation is highly worthwhile. Also, as we’ll see below, it is a lot easier to test the
model assumptions if one uses the MLE.
and seeks the weights b_ij which will minimize the mean squared error, E[(F̂_ir − F_ir)^2]. You can work out the b_ij as an exercise, assuming you know w.
The inverse to this matrix must be the one which rotates through the angle −α, r_α^{-1} = r_{−α}, but trigonometry tells us that r_{−α} = r_α^T.
To see why this matters to us, go back to the matrix form of the factor model,
and insert an orthogonal q × q matrix and its transpose:
X = ε + Fw   (19.40)
= ε + F o o^T w   (19.41)
= ε + Hy   (19.42)
We’ve changed the factor scores to H ≡ Fo, and we’ve changed the factor loadings to y ≡ o^T w, but nothing about the features has changed at all. We can do as many or-
thogonal transformations of the factors as we like, with no observable consequences
whatsoever.8
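A short numerical illustration of this invariance (mine, with arbitrary matrices): rotating the factors changes w, but not w^T w, and hence not the implied covariance of the features.

# Rotating the factors leaves w^T w, and so the implied covariance, unchanged
set.seed(2)
q <- 3; p <- 6
w <- matrix(rnorm(q * p), q, p)              # arbitrary loadings
o <- qr.Q(qr(matrix(rnorm(q * q), q, q)))    # a random orthogonal q x q matrix
y <- t(o) %*% w                              # rotated loadings
max(abs(t(w) %*% w - t(y) %*% y))            # numerically zero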
Statistically, the fact that different parameter settings give us the same observa-
tional consequences means that the parameters of the factor model are unidentifi-
able. The rotation problem is, as it were, the revenant of having an ill-posed problem:
we thought we’d slain it through heroic feats of linear algebra, but it’s still around and
determined to have its revenge.9
8 Notice that the log-likelihood only involves w^T w, which is equal to w^T o o^T w = y^T y, so even assuming Gaussian distributions doesn't let us tell the difference between the original and the transformed variables. In fact, if F⃗ ∼ N(0, I), then F⃗ o ∼ N(0o, o^T I o) = N(0, I) — in other words, the rotated factor scores still satisfy our distributional assumptions.
9 Remember that we obtained the loading matrix w as a solution to w^T w = u, that is to say we got w as a kind of matrix square root of the reduced correlation matrix. For a real number u there are two square roots, i.e., two numbers w such that w × w = u, namely the usual w = √u and w = −√u, because (−1) × (−1) = 1. Similarly, whenever we find one solution to w^T w = u, o^T w is another solution, because o o^T = I. So while the usual "square root" of u is w = d_q^{1/2} c_q^T, for any orthogonal matrix o, o^T d_q^{1/2} c_q^T will always work just as well.
Mathematically, this should not be surprising at all. The factors live in a q-dimensional
vector space of their own. We should be free to set up any coordinate system we feel
like on that space. Changing coordinates in factor space will just require a compensat-
ing change in how factor space coordinates relate to feature space (the factor loadings
matrix w). That’s all we’ve done here with our orthogonal transformation.
Substantively, this should be rather troubling. If we can rotate the factors as much
as we like without consequences, how on Earth can we interpret them?
X⃗ ∼ N(0, w^T w + ψ)   (19.43)

X⃗ | F⃗ ∼ N(F⃗ w, ψ)   (19.44)

but the problem is that there is no way to guess at or estimate the factor scores F⃗ until after we've seen X⃗, at which point anyone can predict X perfectly. So the actual forecast is given by Eq. 19.43.10
Now, without going through the trouble of factor analysis, one could always just
postulate that
X⃗ ∼ N(0, v)   (19.45)

and estimate v; the maximum likelihood estimate of it is the observed covariance matrix, but really we could use any estimator of the covariance matrix. The closer our estimate is to the true v, the better our predictions. One way to think of factor analysis is that it looks for the maximum likelihood estimate, but constrained to matrices of the form w^T w + ψ.
On the plus side, the constrained estimate has a faster rate of convergence. That
is, both the constrained and unconstrained estimates are consistent and will converge
on their optimal, population values as we feed in more and more data, but for the
same amount of data the constrained estimate is probably closer to its limiting value.
In other words, the constrained estimate ŵ^T ŵ + ψ̂ has less variance than the unconstrained estimate v̂.
On the minus side, maybe the true, population v just can’t be written in the form
w^T w + ψ. Then we’re getting biased estimates of the covariance and the bias will not
10 A subtlety is that we might get to see some but not all of X⃗, and use that to predict the rest. Say X⃗ = (X_1, X_2), and we see X_1. Then we could, in principle, compute the conditional distribution of the factors, p(F|X_1), and use that to predict X_2. Of course one could do the same thing using the correlation matrix, factor model or no factor model.
go away, even with infinitely many samples. Using factor analysis rather than just
fitting a multivariate Gaussian means betting that either this bias is really zero, or
that, with the amount of data on hand, the reduction in variance outweighs the bias.
(I haven’t talked about estimated errors in the parameters of a factor model. With
large samples and maximum-likelihood estimation, one could use the usual asymp-
totic theory. For small samples, one bootstraps as usual.)
1. Log-likelihood ratio tests Sample covariances will almost never be exactly equal
to population covariances. So even if the data comes from a model with q
factors, we can’t expect the tetrad equations (or their multi-factor analogs) to
hold exactly. The question then becomes whether the observed covariances are
compatible with sampling fluctuations in a q-factor model, or are too big for
that.
We can tackle this question by using log likelihood ratio tests. The crucial
observations are that a model with q factors is a special case of a model with
q + 1 factors (just set a row of the weight matrix to zero), and that in the most
general case, q = p, we can get any covariance matrix v into the form w^T w.
(Set ψ = 0 and proceed as in the “principal factors” estimation method.)
As explained in Appendix B, if θ̂ is the maximum likelihood estimate in a restricted model with s parameters, and Θ̂ is the MLE in a more general model with r > s parameters, containing the former as a special case, and finally ℓ is the log-likelihood function,

2[ℓ(Θ̂) − ℓ(θ̂)] ⇝ χ²_{r−s}   (19.46)
when the data came from the small model. The general regularity conditions
needed for this to hold apply to Gaussian factor models, so we can test whether
one factor is enough, two, etc.
(Said another way, adding another factor never reduces the likelihood, but the
equation tells us how much to expect the log-likelihood to go up when the new
factor really adds nothing and is just over-fitting the noise.)
Determining q by getting the smallest one without a significant result in a like-
lihood ratio test is fairly traditional, but statistically messy.11 To raise a subject
we’ll return to, if the true q > 1 and all goes well, we’ll be doing lots of hypoth-
esis tests, and making sure this compound procedure works reliably is harder
11 Suppose q is really 1, but by chance that gets rejected. Whether q = 2 gets rejected in turn is not
independent of this!
than controlling any one test. Perhaps more worrisomely, calculating the like-
lihood relies on distributional assumptions for the factor scores and the noises,
which are hard to check for latent variables.
2. If you are comfortable with the distributional assumptions, use Eq. 19.43 to
predict new data, and see which q gives the best predictions — for compara-
bility, the predictions should be compared in terms of the log-likelihood they
assign to the testing data. If genuinely new data is not available, use cross-validation; a rough sketch of both approaches follows this list.
Comparative prediction, and especially cross-validation, seems to be somewhat
rare with factor analysis. There is no good reason why this should be so.
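Here is a minimal sketch of both strategies. It is my own code, not the text's: X is assumed to be a standardized n × p data matrix, the 5% level and the 5-fold split are arbitrary, and the cross-validation leans on the mvtnorm package for the Gaussian density of Eq. 19.43.

# 1. Sequential likelihood ratio tests: smallest q that is not rejected
choose.q.lrt <- function(X, q.max = 5, alpha = 0.05) {
  for (q in 1:q.max) {
    if (factanal(X, factors = q)$PVAL > alpha) return(q)
  }
  q.max
}

# 2. Cross-validated log-likelihood under Eq. 19.43; pick the q that maximizes it
cv.loglike <- function(X, q, folds = 5) {
  fold <- sample(rep(1:folds, length.out = nrow(X)))
  sum(sapply(1:folds, function(f) {
    fit <- factanal(X[fold != f, ], factors = q)
    sigma <- diag(fit$uniquenesses) + tcrossprod(fit$loadings[, ])
    sum(mvtnorm::dmvnorm(X[fold == f, ], sigma = sigma, log = TRUE))
  }))
}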
(Remember that the factors are, by design, uncorrelated with each other, and that the
entries of w are the correlations between factors and observables.) People sometimes
select the number of factors by looking at how much variance they “explain” — really,
how much variance is kept after smoothing on to the plane. As usual with model
selection by R2 , there is little good to be said for this, except that it is fast and simple.
In particular, R2 should not be used to assess the goodness-of-fit of a factor model.
The bluntest way to see this is to simulate data which does not come from a factor
model, fit a small number of factors, and see what R2 one gets. This was done by
Peterson (2000), who found that it was easy to get R2 of 0.4 or 0.5, and sometimes
even higher.12 The same paper surveyed values of R2 from the published literature on
factor models, and found that the typical value was also somewhere around 0.5; no
doubt this was just a coincidence.13
Instead of looking at R2, it is much better to check goodness-of-fit by actual
goodness-of-fit tests. We looked at some tests of multivariate goodness-of-fit in Chap-
ter 14. In the particular case of factor models with the Gaussian assumption, we can
use a log-likelihood ratio test, checking the null hypothesis that the number of factors
= q against the alternative of an arbitrary multivariate Gaussian (which is the same
as p factors). This test is automatically performed by factanal in R.
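For example (an illustrative call; X here stands for whatever data matrix is being analyzed, and the choice of two factors is arbitrary):

fit <- factanal(X, factors = 2)   # test "2 factors suffice" vs. an unrestricted Gaussian
fit$STATISTIC                     # the likelihood-ratio test statistic
fit$PVAL                          # a small p-value means rejecting the 2-factor model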
If the Gaussian assumption is dubious but we want a factor model and goodness-
of-fit anyway, we can look at the difference between the empirical covariance matrix
v̂ and the one estimated by the factor model, ψ̂ + ŵ^T ŵ. There are several notions of distance between matrices (matrix norms) which could be used as test statistics; one could also use the sum of squared differences between the entries of v̂ and those of
12 See also http://bactra.org/weblog/523.html for a similar experiment, with R code.
13 Peterson (2000) also claims that reported values of R2 for PCA are roughly equal to those of factor
analysis, but by this point I hope that none of you take that as an argument in favor of PCA.
ψ̂ + ŵ^T ŵ. Sampling distributions would have to come from bootstrapping, where
we would want to simulate from the factor model.
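A rough version of that parametric bootstrap (my own sketch; it uses the sum-of-squared-differences statistic and again assumes standardized data):

# Parametric bootstrap of the discrepancy between the sample matrix and psi + w^T w
fa.discrepancy <- function(X, q) {
  fit <- factanal(X, factors = q)
  sigma <- diag(fit$uniquenesses) + tcrossprod(fit$loadings[, ])
  list(stat = sum((cor(X) - sigma)^2), sigma = sigma)
}
fa.boot.test <- function(X, q, B = 200) {
  obs <- fa.discrepancy(X, q)
  boot <- replicate(B, {
    # simulate from the fitted factor model, then refit and recompute the statistic
    X.sim <- matrix(rnorm(nrow(X) * ncol(X)), nrow(X)) %*% chol(obs$sigma)
    fa.discrepancy(X.sim, q)$stat
  })
  mean(boot >= obs$stat)                     # bootstrap p-value
}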
[Figure 19.2 appears here: the graph of Figure 19.1 with factor nodes G1, G2, F3 above observables X1–X6 and noise nodes E1–E6 below; the rotated loadings decorating the edges are 0.13, −0.45, 0.86, −0.13, −0.69, 0.02, −0.20, 0.03, 0.73, 0.10, 0.15, 0.45.]
Figure 19.2: The model from Figure 19.1, after rotating the first two factors by 30
degrees around the third factor’s axis. The new factor loadings are rounded to two
decimal places.
before this: each distribution in the mixture is basically a cluster, and the mixing
weights are the probabilities of drawing a new sample from the different clusters.14
I bring up mixture models here because there is a very remarkable result: any
linear, Gaussian factor model with k factors is equivalent to some mixture model with
k + 1 clusters, in the sense that the two models have the same means and covariances
(Bartholomew, 1987, pp. 36–38). Recall from above that the likelihood of a factor
model depends on the data only through the correlation matrix. If the data really
were generated by sampling from k + 1 clusters, then a model with k factors can
match the covariance matrix very well, and so get a very high likelihood. This means
it will, by the usual test, seem like a very good fit. Needless to say, however, the
causal interpretations of the mixture model and the factor model are very different.
The two may be distinguishable if the clusters are well-separated (by looking to see
whether the data are unimodal or not), but that’s not exactly guaranteed.
All of which suggests that factor analysis can’t really tell us whether we have
k continuous hidden causal variables, or one discrete hidden variable taking k + 1
values.
14 We will get into mixtures in considerable detail in the next lecture.
where q ≫ p. Suppose further that the latent variables A_ik are totally independent of one another, but they all have mean 0 and variance 1; and that the noises η_ij are independent of each other and of the A_ik, with variance φ_j; and the T_kj are independent of everything. What then is the covariance between X_ia and X_ib? Well, because E[X_ia] = E[X_ib] = 0, it will just be the expectation of the product of the features:
E[X_ia X_ib]   (19.50)
= E[(η_ia + A⃗_i · T⃗_a)(η_ib + A⃗_i · T⃗_b)]   (19.51)
= E[η_ia η_ib] + E[η_ia A⃗_i · T⃗_b] + E[η_ib A⃗_i · T⃗_a] + E[(A⃗_i · T⃗_a)(A⃗_i · T⃗_b)]   (19.52)
= 0 + 0 + 0 + E[(Σ_{k=1}^q A_ik T_ka)(Σ_{l=1}^q A_il T_lb)]   (19.53)
= E[Σ_{k,l} A_ik A_il T_ka T_lb]   (19.54)
= Σ_{k,l} E[A_ik A_il T_ka T_lb]   (19.55)
= Σ_{k,l} E[A_ik A_il] E[T_ka T_lb]   (19.56)
= Σ_{k=1}^q E[T_ka T_kb]   (19.57)

where to get the last line I use the fact that E[A_ik A_il] = 1 if k = l and = 0 otherwise.
If the coefficients T are fixed, then the last expectation goes away and we merely have
the same kind of sum we’ve seen before, in the factor model.
Instead, however, let’s say that the coefficients T are themselves random (but independent of A and η). For each feature X_ia, we fix a proportion z_a between 0 and 1. We then set T_ka ∼ Bernoulli(z_a), with T_ka independent of T_lb unless k = l and a = b. Then

E[T_ka T_kb] = E[T_ka] E[T_kb] = z_a z_b   (19.58)
and

E[X_ia X_ib] = q z_a z_b   (19.59)
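A bare-bones version of this random-coefficient model in R (my own sketch, not the rthomson function from the accompanying code; the sizes and the z_a are invented):

# Bare-bones Thomson-style model: many independent abilities, random 0/1 coefficients
set.seed(3)
n <- 2000; p <- 11; q <- 500           # many more latent abilities than features
z <- runif(p, 0.3, 0.7)                # per-feature inclusion probabilities z_a
Tmat <- matrix(rbinom(q * p, 1, rep(z, each = q)), q, p)  # Tmat[k,a] ~ Bernoulli(z[a])
A <- matrix(rnorm(n * q), n, q)        # independent abilities, mean 0, variance 1
eta <- matrix(rnorm(n * p), n, p)      # feature-specific noises
X <- A %*% Tmat + eta                  # each feature mixes hundreds of abilities
cov(X)[1, 2]; q * z[1] * z[2]          # roughly equal, as Eq. 19.59 predicts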
generate the coefficient T_kj. For each feature he drew a uniform integer between 1 and q, call it q_j, and then sampled the integers from 1 to q without replacement until he had q_j random numbers; these were the values of k where T_kj = 1. This is basically similar to what I describe, setting z_j = q_j/q, but a bit harder to analyze in an elementary way. — Thomson (1916), the original paper, includes what we would now call a simulation study of the model, where Thomson stepped through the procedure to produce simulated data, calculate the empirical correlation matrix of the features, and check the fit to the tetrad equations. Not having a computer, Thomson generated the values of T_kj with a deck of cards, and of the A_ik and η_ij by rolling 5220 dice.
[Figure 19.3 appears here: the empirical CDF of the goodness-of-fit p-values (x-axis: p value, y-axis: Empirical CDF), from 200 replicates of 50 subjects each, produced by the code below.]
> plot(ecdf(replicate(200,factanal(rthomson(50,11,500,50)$data,1)$PVAL)),
xlab="p value",ylab="Empirical CDF",
main="Sampling distribution of FA p-value under Thomson model",
sub="200 replicates of 50 subjects each")
> abline(0,1,lty=2)
Figure 19.3: Mimicry of the one-factor model by the Thomson model. The Thom-
son model was simulated 200 times with the parameters given above; each time, the
simulated data was then fit to a factor model with one factor, and the p-value of the
goodness-of-fit test extracted. The plot shows the empirical cumulative distribution
function of the p-values. If the null hypothesis were exactly true, then p ∼ Unif(0, 1),
and the theoretical CDF would be the diagonal line (dashed).
where the T_kj coefficients are uncorrelated with the R_kj, and so forth. In expectation, if there are r such pools, this exactly matches the factor model with r factors, and any particular realization is overwhelmingly likely to match if the q_1, q_2, . . . q_r are large enough.16
It’s not feasible to estimate the T of the Thomson model in the same way that
we estimate factor loadings, because q > p. This is not the point of considering the
model, which is rather to make it clear that we actually learn very little about where
the data come from when we learn that a factor model fits well. It could mean that
the features arise from combining a small number of factors, or on the contrary from
combining a huge number of factors in a random fashion. A lot of the time the latter
is a more plausible-sounding story.17
For example, a common application of factor analysis is in marketing: you survey
consumers and ask them to rate a bunch of products on a range of features, and then
do factor analysis to find attributes which summarize the features. That’s fine, but it
may well be that each of the features is influenced by lots of aspects of the product you
don’t include in your survey, and the correlations are really explained by different
features being affected by many of the same small aspects of the product. Similarly for
psychological testing: answering any question is really a pretty complicated process
involving lots of small processes and skills (of perception, several kinds of memory,
problem-solving, attention, etc.), which overlap partially from question to question.
Exercises
1. Prove Eq. 19.13.
2. Why is it fallacious to go from “the data have the kind of correlations predicted
by a one-factor model” to “the data were generated by a one-factor model”?
3. Show that the correlation between the jth feature and G, in the one-factor model, is w_j.
5. Find the weights bi j for the Thomson estimator, assuming you know w. Do
you need to assume a Gaussian distribution?
6. Step through the examples in the accompanying R code on the class website.
16 A recent paper on the Thomson model (Bartholomew et al., 2009) proposes just this modification
to multiple factors and to Bernoulli sampling. However, I proposed this independently, in the fall 2008
version of these notes, about a year before their paper.
17 Thomson (1939) remains one of the most insightful books on factor analysis, though obviously there
have been a lot of technical refinements since he wrote. It’s strongly recommended for anyone who plans
to make much use of factor analysis. While out of print, used copies are reasonably plentiful and cheap,
and at least one edition is free online (URL in the bibliography).