Factor Analysis
14.1 INTRODUCTION
Factor analysis is a method for investigating whether a number of variables
of interest Y1, Y2, ..., Yl are linearly related to a smaller number of
unobservable factors F1, F2, ..., Fk.
The fact that the factors are not observable disqualifies regression and
other methods previously examined. We shall see, however, that under
certain conditions the hypothesized factor model has certain implications,
and these implications in turn can be tested against the observations.
Exactly what these conditions and implications are, and how the model can be
tested, must be explained with some care.
14.2 AN EXAMPLE
Factor analysis is best explained in the context of a simple example.
Students entering a certain MBA program must take three required courses in
finance, marketing and business policy. Let Y1, Y2, and Y3, respectively,
represent a student's grades in these courses. The available data consist of
the grades of five students (on a 10-point numerical scale above the passing
mark), as shown in Table 14.1.
Table 14.1
Student grades
Student    Grade in:
no.        Finance, Y1    Marketing, Y2    Policy, Y3
1          3              6                5
2          7              3                3
3          10             9                8
4          3              9                7
5          10             6                5
© Peter Tryfos, 1997. This version printed: 14-3-2001.
It has been suggested that these grades are functions of two underlying
factors, F1 and F2, tentatively and rather loosely described as quantitative
ability and verbal ability, respectively. It is assumed that each Y variable is
linearly related to the two factors, as follows:
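$$
\begin{aligned}
Y_1 &= \beta_{10} + \beta_{11} F_1 + \beta_{12} F_2 + e_1 \\
Y_2 &= \beta_{20} + \beta_{21} F_1 + \beta_{22} F_2 + e_2 \\
Y_3 &= \beta_{30} + \beta_{31} F_1 + \beta_{32} F_2 + e_3
\end{aligned}
\tag{14.1}
$$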
The error terms e1, e2, and e3 serve to indicate that the hypothesized
relationships are not exact.
In the special vocabulary of factor analysis, the parameters βij are
referred to as loadings. For example, β12 is called the loading of variable Y1
on factor F2.
In this MBA program, finance is highly quantitative, while marketing
and policy have a strong qualitative orientation. Quantitative skills should
help a student in finance, but not in marketing or policy. Verbal skills should
be helpful in marketing or policy but not in finance. In other words, it is
expected that the loadings have roughly the following structure:
                Loading on:
Variable, Yi    F1, βi1    F2, βi2
Y1              +          0
Y2              0          +
Y3              0          +
One can think of each ei as the outcome of a random draw with replacement
from a population of ei-values having mean 0 and a certain variance
σi². A similar assumption was made in regression analysis (Section 3.2).
A2: The unobservable factors Fj are independent of one another
and of the error terms, and are such that E(Fj) = 0 and Var(Fj) = 1.
In the context of the present example, this means in part that there is
no relationship between quantitative and verbal ability. In more advanced
models of factor analysis, the condition that the factors are independent
of one another can be relaxed. As for the factor means and variances, the
assumption is that the factors are standardized. It is an assumption made
for mathematical convenience; since the factors are not observable, we might
as well think of them as measured in standardized form.
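Let us now examine some implications of these assumptions. Each
observable variable is a linear function of independent factors and error
terms, and can be written as
$$
Y_i = \beta_{i0} + \beta_{i1} F_1 + \beta_{i2} F_2 + e_i .
$$
Because the factors and the error term are independent of one another, and
each factor has variance 1, the variance of Yi is
$$
\operatorname{Var}(Y_i) = \beta_{i1}^2 + \beta_{i2}^2 + \sigma_i^2 ,
$$
which consists of two parts.
The first, the communality of the variable, βi1² + βi2², is the part that is
explained by the common factors F1 and F2. The second, the specific variance,
σi², is the part of the variance of Yi that is not accounted for by the common
factors. If the two factors were perfect predictors of grades, then
e1 = e2 = e3 = 0 always, and σ1² = σ2² = σ3² = 0.
To calculate the covariance of any two observable variables, Yi and Yj,
we can write
$$
\operatorname{Cov}(Y_i, Y_j)
  = \beta_{i1}\beta_{j1}\operatorname{Var}(F_1) + \beta_{i2}\beta_{j2}\operatorname{Var}(F_2)
  = \beta_{i1}\beta_{j1} + \beta_{i2}\beta_{j2} .
$$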
We can arrange all the variances and covariances in the form of the
following table:
             Variable:
Variable:    Y1                    Y2                    Y3
Y1           β11² + β12² + σ1²     β21β11 + β22β12       β31β11 + β32β12
Y2           β11β21 + β12β22       β21² + β22² + σ2²     β21β31 + β22β32
Y3           β11β31 + β12β32       β21β31 + β22β32       β31² + β32² + σ3²
             Variable:
Variable:    Y1     Y2     Y3
Y1           S1²    S12    S13
Y2           S21    S2²    S23
Y3           S31    S32    S3²
Thus, S1² is the observed variance of Y1, S12 the observed covariance
of Y1 and Y2, and so on. It is understood, of course, that S12 = S21, S13 =
S31, and so on; the matrix, in other words, is symmetric. It can be easily
confirmed that the observed variance covariance matrix for the data of Table
14.1 is as follows:
$$
\begin{pmatrix}
 9.84 & -0.36 & 0.44 \\
-0.36 &  5.04 & 3.84 \\
 0.44 &  3.84 & 3.04
\end{pmatrix}
$$
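The matrix above can be reproduced with a few lines of code. The following is a minimal sketch in Python (the text itself uses SAS); the names `grades`, `mean` and `covariance` are illustrative choices, and the divisor n rather than n - 1 is used because that is what reproduces the figures quoted in the text.

```python
# Grades from Table 14.1: one list per course, one entry per student.
grades = {
    "Y1": [3, 7, 10, 3, 10],  # finance
    "Y2": [6, 3, 9, 9, 6],    # marketing
    "Y3": [5, 3, 8, 7, 5],    # policy
}


def mean(values):
    """Arithmetic mean of a list of numbers."""
    return sum(values) / len(values)


def covariance(x, y):
    """Covariance with divisor n (not n - 1); covariance(x, x) is the variance."""
    mx, my = mean(x), mean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / len(x)


names = list(grades)
cov_matrix = [[covariance(grades[r], grades[c]) for c in names] for r in names]

# Print the matrix row by row, rounded to two decimals as in the text.
for name, row in zip(names, cov_matrix):
    print(name, [round(v, 2) for v in row])
```

The diagonal entries are the observed variances and the off-diagonal entries the observed covariances, so the printed rows can be compared directly with the matrix in the text.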
On the one hand, therefore, we have the observed variances and covariances
of the variables; on the other, the variances and covariances implied
by the factor model. If the model's assumptions are true, we should be able
to estimate the loadings βij so that the resulting estimates of the theoretical
variances and covariances are close to the observed ones. We shall soon see
how these estimates can be obtained, but first let us examine an important
feature of the factor model.
14.3 FACTOR LOADINGS ARE NOT UNIQUE
Model A:
$$
\begin{aligned}
Y_1 &= 0.5 F_1 + 0.5 F_2 + e_1 \\
Y_2 &= 0.3 F_1 + 0.3 F_2 + e_2 \\
Y_3 &= 0.5 F_1 - 0.5 F_2 + e_3
\end{aligned}
$$
For example, Var(Y1) = (0.5)² + (0.5)² + σ1² = 0.5 + σ1²; Cov(Y1, Y2) =
(0.5)(0.3) + (0.5)(0.3) = 0.3; and so on.
Next, consider Model B, having a different set of βij:
$$
\begin{aligned}
Y_1 &= (\sqrt{2}/2) F_1 + 0 F_2 + e_1 \\
Y_2 &= (0.3\sqrt{2}) F_1 + 0 F_2 + e_2 \\
Y_3 &= 0 F_1 - (\sqrt{2}/2) F_2 + e_3
\end{aligned}
$$
It can again be easily confirmed that the theoretical variances and
covariances are identical to those of Model A. For example,
Var(Y1) = (√2/2)² + (0)² + σ1² = 0.5 + σ1²; Cov(Y1, Y2) = (√2/2)(0.3√2) +
(0)(0) = 0.3; and so on.
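The claim that Models A and B imply the same variances and covariances can be checked mechanically. The sketch below assumes only the formulas already used in the text: each covariance is βi1βj1 + βi2βj2, and each variance is the same expression with i = j plus the specific variance. The function name `common_part` is an illustrative choice.

```python
import math

# Loadings (coefficient of F1, coefficient of F2) for Y1, Y2, Y3.
model_a = [(0.5, 0.5), (0.3, 0.3), (0.5, -0.5)]
root2 = math.sqrt(2)
model_b = [(root2 / 2, 0.0), (0.3 * root2, 0.0), (0.0, -root2 / 2)]


def common_part(loadings):
    """Matrix with entries b_i1 * b_j1 + b_i2 * b_j2.

    Off-diagonal entries are the theoretical covariances. Diagonal entries
    are the communalities; the theoretical variance is obtained by adding
    the specific variance sigma_i^2, which is the same in both models.
    """
    return [[bi1 * bj1 + bi2 * bj2 for (bj1, bj2) in loadings]
            for (bi1, bi2) in loadings]


part_a = common_part(model_a)
part_b = common_part(model_b)

# Compare the two matrices entry by entry, allowing for rounding error.
same = all(
    math.isclose(a, b, abs_tol=1e-12)
    for row_a, row_b in zip(part_a, part_b)
    for a, b in zip(row_a, row_b)
)
print(same)
```

Since the specific variances are unaffected by the choice of loadings, equality of the two matrices is enough to conclude that the two models cannot be distinguished on the basis of variances and covariances.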
Examine now panel (a) of Figure 14.1. Along the horizontal axis we
plot the coefficient of F1, and along the vertical axis the coefficient of F2
for each equation of Model A. The coefficients of F1 and F2 in the first
equation are plotted as the point with coordinates (0.5, 0.5); those of the
second equation as the point (0.3, 0.3), and those of the third as the point
(0.5, -0.5).
Figure 14.1
Rotation of loadings illustrated
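The link between the two sets of loadings can also be explored numerically. The sketch below treats each pair of loadings of Model A as a point in the plane and expresses it relative to axes turned counter-clockwise by 45 degrees. The function `rotate` and the choice of 45 degrees are assumptions made for illustration; the resulting coordinates can be compared with the loadings of Model B.

```python
import math


def rotate(loadings, degrees):
    """Coordinates of each point relative to axes rotated counter-clockwise
    by the given angle (the points themselves do not move)."""
    t = math.radians(degrees)
    c, s = math.cos(t), math.sin(t)
    return [(x * c + y * s, -x * s + y * c) for x, y in loadings]


# Loadings of Model A: (coefficient of F1, coefficient of F2) per equation.
model_a = [(0.5, 0.5), (0.3, 0.3), (0.5, -0.5)]

rotated = rotate(model_a, 45)
for point in rotated:
    print(tuple(round(v, 4) for v in point))
```

Because a rotation of the axes leaves the length of each point and the angles between points unchanged, the quantities βi1² + βi2² and βi1βj1 + βi2βj2 are the same before and after the rotation.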
14.4 FIRST FACTOR SOLUTIONS
* This is not the only method for factor analysis. Among others are
the principal factor (also called principal axis) and maximum likelihood
methods. See, for example, Johnson and Wichern (1992, Ch. 9), Rencher
(1995, Ch. 13).
Table 14.2
Elements of principal component method
Variable,              Observed         Communality,
Yi                     variance, Si²    βi1² + βi2²
Finance grade, Y1      S1²              β11² + β12²
Marketing grade, Y2    S2²              β21² + β22²
Policy grade, Y3       S3²              β31² + β32²
Total                  To               Tt
Table 14.3
Principal component solution, data of Table 14.1, unstandardized variables
Variable,        Observed         Loadings on                   Communality,    Percent
Yi               variance, Si²    F1, bi1        F2, bi2        bi1² + bi2²     explained
(1)              (2)              (3)            (4)            (5)             (6) = 100 × (5)/(2)
Finance, Y1      9.84             3.136773       0.023799       9.8399          99.999
Marketing, Y2    5.04             -0.132190      2.237858       5.0255          99.712
Policy, Y3       3.04             0.127697       1.731884       3.0157          99.201
Overall          17.92            9.873125 (a)   8.007997 (a)   17.8811         99.783
(a) Sum of squared loadings
* Based on the output of program SAS with the statements proc factor
n = 2 cov vardef=n eigenvectors; and additional calculations by hand.
and
$$
\sum_i b_{i2}^2 = (0.023799)^2 + \cdots + (1.731884)^2 = 8.007997 .
$$
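The derived columns of Table 14.3 follow from the loadings by simple arithmetic. The sketch below recomputes them from the loadings and observed variances listed in the table; the dictionary layout and variable names are illustrative only.

```python
# Observed variances and loadings (b_i1, b_i2) as listed in Table 14.3.
rows = {
    "Finance":   {"var": 9.84, "b": (3.136773, 0.023799)},
    "Marketing": {"var": 5.04, "b": (-0.132190, 2.237858)},
    "Policy":    {"var": 3.04, "b": (0.127697, 1.731884)},
}

# Communality of each variable: sum of its squared loadings (column 5).
communality = {k: sum(b * b for b in v["b"]) for k, v in rows.items()}

# Percent of each observed variance explained by the factors (column 6).
percent = {k: 100 * communality[k] / v["var"] for k, v in rows.items()}

# Contribution of each factor: sum of squared loadings down a column.
factor_totals = [sum(v["b"][j] ** 2 for v in rows.values()) for j in range(2)]

total_variance = sum(v["var"] for v in rows.values())
total_percent = 100 * sum(communality.values()) / total_variance

for k in rows:
    print(k, round(communality[k], 4), round(percent[k], 3))
print("Overall", [round(t, 6) for t in factor_totals], round(total_percent, 3))
```

Small differences in the last printed digit are possible, because the loadings in the table are themselves rounded to six decimals.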
It can be shown that the covariances of the standardized variables are equal
to the correlation coefficients of the original variables (the variances of the
standardized variables are, of course, equal to 1).
This last result can be easily verified using the data of Table 14.1. First, we
calculate
                  Y1        Y2        Y3
Mean, Ȳi:         6.6       6.6       5.6
Variance, Si²:    9.84      5.04      3.04
Std. dev., Si:    3.1369    2.2450    1.7436
The observations of the standardized variables are shown in the following table:
$$
Y'_{1j} = \frac{Y_{1j} - 6.6}{3.1369} .
$$
It can be confirmed that the means of the standardized variables are equal to 0,
and their variances and standard deviations equal to 1.
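The covariance of Y1 and Y2 is
$$
S_{12} = \frac{1}{n} \sum Y_1 Y_2 - \bar{Y}_1 \bar{Y}_2
       = \frac{1}{5}(216) - (6.6)(6.6) = -0.36 ,
$$
their correlation coefficient is
$$
r_{12} = \frac{S_{12}}{S_1 S_2} = \frac{-0.36}{(3.1369)(2.245)} = -0.0511 ,
$$
and the covariance of the standardized variables Y1′ and Y2′ is
$$
S'_{12} = \frac{1}{n} \sum Y'_1 Y'_2 - \bar{Y}'_1 \bar{Y}'_2
        = \frac{1}{5}(-0.2556) - (0)(0) = -0.0511 .
$$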
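The equality of the last two results can be checked directly from the data. The sketch below standardizes the finance and marketing grades of Table 14.1 and compares the covariance of the standardized values with the correlation coefficient of the original ones. The helper names are illustrative, and the divisor n is assumed throughout, as in the text.

```python
import math

y1 = [3, 7, 10, 3, 10]  # finance grades, Table 14.1
y2 = [6, 3, 9, 9, 6]    # marketing grades, Table 14.1


def mean(values):
    return sum(values) / len(values)


def cov(x, y):
    """Covariance with divisor n; cov(x, x) is the variance."""
    mx, my = mean(x), mean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / len(x)


def standardize(values):
    """Subtract the mean and divide by the standard deviation."""
    m, s = mean(values), math.sqrt(cov(values, values))
    return [(v - m) / s for v in values]


z1, z2 = standardize(y1), standardize(y2)

r12 = cov(y1, y2) / math.sqrt(cov(y1, y1) * cov(y2, y2))
s12_standardized = cov(z1, z2)

print([round(v, 4) for v in z1])
print(round(r12, 4), round(s12_standardized, 4))
```

The first printed line lists the standardized finance grades; the second shows the correlation coefficient and the covariance of the standardized variables side by side.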
Table 14.4
Principal component solution, data of Table 14.1, standardized variables
Standardized      Observed          Loadings on                   Communality,    Percent
variable, Yi′     variance, Si′²    F1, bi1        F2, bi2        bi1² + bi2²     explained
(1)               (2)               (3)            (4)            (5)             (6) = 100 × (5)/(2)
Finance, Y1′      1                 0.02987        0.99951        0.99991         99.991
Marketing, Y2′    1                 0.99413        -0.08153       0.99494         99.494
Policy, Y3′       1                 0.99613        0.05139        0.99492         99.492
Overall           3                 1.981463 (a)   1.008306 (a)   2.98977         99.659
(a) Sum of squared loadings
* Readers familiar with linear algebra may want to know that the princi-
pal component solution involves the eigenvalues (characteristic values) and
eigenvectors (characteristic vectors) of the observed variance covariance or
correlation matrix. Hence the appearance of these terms in the output of
computer programs. For a clear mathematical exposition of the principal
component method see, for example, Johnson and Wichern, ibid.
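For readers who wish to see the connection with eigenvalues and eigenvectors at work, the following sketch computes loadings for the standardized grades as the square root of an eigenvalue times the corresponding unit eigenvector of the correlation matrix. This is an illustration under stated assumptions, not the SAS procedure: the eigenpairs are found by plain power iteration with deflation (adequate for a small symmetric matrix), all function names are hypothetical, and the signs of the loadings depend on the arbitrary starting vector.

```python
import math

# Grades from Table 14.1, one list per variable (Y1, Y2, Y3).
data = [[3, 7, 10, 3, 10], [6, 3, 9, 9, 6], [5, 3, 8, 7, 5]]


def cov(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / len(x)


def correlation_matrix(cols):
    sd = [math.sqrt(cov(c, c)) for c in cols]
    return [[cov(a, b) / (sa * sb) for b, sb in zip(cols, sd)]
            for a, sa in zip(cols, sd)]


def top_eigenpair(m, iterations=500):
    """Largest eigenvalue and unit eigenvector of a symmetric matrix,
    found by power iteration from a vector of equal components."""
    n = len(m)
    v = [1.0 / math.sqrt(n)] * n
    for _ in range(iterations):
        w = [sum(m[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    value = sum(v[i] * m[i][j] * v[j] for i in range(n) for j in range(n))
    return value, v


def principal_component_loadings(m, k):
    """Loadings on the first k factors: sqrt(eigenvalue) * unit eigenvector."""
    m = [row[:] for row in m]  # work on a copy
    columns = []
    for _ in range(k):
        value, vec = top_eigenpair(m)
        columns.append([math.sqrt(value) * x for x in vec])
        # Deflate: remove the component just found before the next search.
        for i in range(len(m)):
            for j in range(len(m)):
                m[i][j] -= value * vec[i] * vec[j]
    return columns


R = correlation_matrix(data)
loadings = principal_component_loadings(R, 2)
for name, b1, b2 in zip(("Y1", "Y2", "Y3"), loadings[0], loadings[1]):
    print(name, round(b1, 5), round(b2, 5), round(b1 * b1 + b2 * b2, 5))
```

The printed loadings and communalities can be compared with columns (3) to (5) of Table 14.4. In practice a library routine for symmetric eigenvalue problems would normally replace the hand-written power iteration.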
Figure 14.2
SAS output, data of Table 14.1
and the rest as small as possible in absolute value. The varimax method
encourages the detection of factors each of which is related to few variables.
It discourages the detection of factors influencing all variables.
The quartimax criterion, on the other hand, seeks to maximize the
variance of the squared loadings for each variable, and tends to produce
factors with high loadings for all variables.
Figure 14.3
SAS output continued, data of Table 14.1
Figure 14.3 shows the output produced by the SAS program, instructed
to apply the varimax rotation to the first set of loadings shown in Figure
14.2.
This output is translated and interpreted in Table 14.5.
The estimates of the communality of each variable and of the total
communality are the same as in Table 14.4, but the contributions of each
factor differ slightly. In this example, rotation did not alter appreciably the
first estimates of the loadings or the proportions of the sum of the observed
variances explained by the two factors.
Table 14.5
Varimax rotation, data of Table 14.1, standardized variables
Standardized      Observed          Loadings on                   Communality,    Percent
variable, Yi′     variance, Si′²    F1, bi1        F2, bi2        bi1² + bi2²     explained
(1)               (2)               (3)            (4)            (5)             (6) = 100 × (5)/(2)
Finance, Y1′      1                 0.00723        0.99993        0.99991         99.991
Marketing, Y2′    1                 0.99572        -0.05900       0.99494         99.494
Policy, Y3′       1                 0.99471        0.07393        0.99492         99.492
Overall           3                 1.980964 (a)   1.008805 (a)   2.98977         99.659
(a) Sum of squared loadings
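The idea behind a two-factor varimax rotation can be sketched as a search over rotation angles. The code below starts from the first loadings of Table 14.4 and looks for the angle that maximizes a raw varimax criterion, taken here to be the sum over factors of the variance of the squared loadings. Both the criterion without any normalization and the brute-force grid search are assumptions of this sketch; statistical programs use their own algorithms and may normalize the loadings first, so the result need not coincide exactly with Table 14.5.

```python
import math

# First (unrotated) loadings of the standardized variables, Table 14.4.
first = [(0.02987, 0.99951), (0.99413, -0.08153), (0.99613, 0.05139)]


def rotate(loadings, angle):
    """Loadings relative to axes rotated counter-clockwise by angle (radians)."""
    c, s = math.cos(angle), math.sin(angle)
    return [(x * c + y * s, -x * s + y * c) for x, y in loadings]


def varimax_criterion(loadings):
    """Sum over the two factors of the variance of the squared loadings."""
    total = 0.0
    for j in range(2):
        squared = [row[j] ** 2 for row in loadings]
        m = sum(squared) / len(squared)
        total += sum((v - m) ** 2 for v in squared) / len(squared)
    return total


# Try every angle from -45 to +45 degrees in steps of 0.01 degree.
angles = [math.radians(k / 100) for k in range(-4500, 4501)]
best = max(angles, key=lambda a: varimax_criterion(rotate(first, a)))
rotated = rotate(first, best)

print(round(math.degrees(best), 2))
for row in rotated:
    print(tuple(round(v, 5) for v in row))
```

Whatever angle is selected, the communality of each variable is unchanged by the rotation, which is why column (5) is identical in Tables 14.4 and 14.5.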
14.6 ON THE NUMBER AND INTERPRETATION OF FACTORS
In the preceding illustration, the number of factors and their nature were
hypothesized in advance. It was reasonable to assume that verbal and
quantitative ability were two factors influencing course performance and
grades. In other situations, however, the number of factors involved and their
interpretation may not be clear.
Some computer programs, unless instructed otherwise, only identify
("extract") factors explaining a given proportion of the sum of the variances
of the variables of interest. For example, when the variables are standardized
a common default is to identify factors whose contribution is greater than
1.
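The default just described amounts to a simple filter on the factor contributions. The sketch below applies it to the contributions reported in Table 14.4. The third contribution is not reported in the text; it is obtained here by subtraction, on the assumption that with three standardized variables the contributions of all three possible principal component factors sum to 3.

```python
# Contributions (sums of squared loadings) of the first two factors, Table 14.4.
contributions = [1.981463, 1.008306]

# Assumption of this sketch: the contributions of all three possible factors
# sum to the number of standardized variables, so the third is 3 minus the rest.
n_variables = 3
contributions.append(n_variables - sum(contributions))

# Keep only the factors whose contribution exceeds the threshold of 1.
threshold = 1.0
retained = [c for c in contributions if c > threshold]

# Share of the total variance attributed to each factor.
share = [c / n_variables for c in contributions]

print(len(retained), [round(s, 4) for s in share])
```

Under this rule two factors would be retained for the grades data, the second one only narrowly, which illustrates how sensitive the number of extracted factors can be to the threshold.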
It is common practice in factor analysis to examine the results of
assuming that one, two, three, etc. factors are involved, and to tailor the
hypothesis to fit the results of these analyses.
There is, however, some subjectivity in declaring loadings to be "high"
or "close to zero" in absolute value. There could thus be disagreement among
investigators as to whether or not the hypothesized structure of loadings is
indeed supported by the data.
It should always be borne in mind that there are several methods for
obtaining first and subsequent factor solutions, and each combination of first
solution and rotation method may give rise to entirely different
interpretations.
Example 14.1 The file realest.dat contains the prices and features of
100 residential real estate properties selected at random from among those
sold in a large metropolitan area over a three-month period. Four of these
features are:
AREA: Floor area, in square feet
LOTSZ: Size of the lot, in square feet
ROOMS: Number of rooms in the house
BATHS: Number of separate bathrooms in the house
The variables are described in more detail in the case City West York in
Part II of this text. Table 14.6 lists the features of the first few properties
in the file.
Table 14.6
Partial listing, realest.dat file
Property no.    AREA    LOTSZ    ROOMS    BATHS
1               740     1854     6        1
2               914     1256     7        2
3               968     1198     7        3
...             ...     ...      ...      ...
It would appear that all these features are functions of a single factor,
the size of the property. After all, is it not true that large houses tend to
be built on large lots, and to have large floor area, more rooms and more
bathrooms?
The data outlined in Table 14.6 were processed using proc factor of
the SAS program. By default, the program used standardized variables, the
built-in criterion for determining the number of factors, and the principal
component method; the varimax rotation was requested. The estimated
rotated loadings and other statistics are shown in Table 14.7.
Table 14.7
Results of factor analysis, Example 14.1
Standardized      Observed          Loadings on             Communality,    Percent
variable, Yi′     variance, Si′²    F1, bi1     F2, bi2     bi1² + bi2²     explained
(1)               (2)               (3)         (4)         (5)             (6) = 100 × (5)/(2)
AREA              1                 0.90        -0.01       0.82            82
LOTSZ             1                 0.03        0.99        0.98            98
ROOMS             1                 0.87        -0.06       0.76            76
BATHS             1                 0.78        0.17        0.64            64
Overall           4                 2.18 (a)    1.01 (a)    3.20            80
(a) Sum of squared loadings
14.7 TO SUM UP
• Factor analysis is a method for investigating whether a number of
variables of interest are linearly related to a smaller number of unobservable
factors.
• In the special vocabulary of factor analysis, the parameters of these
linear functions are referred to as loadings.
• Under certain conditions (A1 and A2 in the text), the theoretical
variance of each variable and the covariance of each pair of variables can be
expressed in terms of the loadings and the variance of the error terms.
• The communality of a variable is the part of its variance that is
explained by the common factors. The specific variance is the part of the
variance of the variable that is not accounted for by the common factors.
• There exist an infinite number of sets of loadings yielding the same
theoretical variances and covariances.
• Factor analysis usually proceeds in two stages. In the first, one set of
loadings is calculated which yields theoretical variances and covariances that
fit the observed ones as closely as possible according to a certain criterion.
These loadings, however, may not agree with the prior expectations, or may
not lend themselves to a reasonable interpretation. Thus, in the second
stage, the first loadings are "rotated" in an effort to arrive at another set
of loadings that fit equally well the observed variances and covariances, but
are more consistent with prior expectations or more easily interpreted.
• A method widely used for determining a first set of loadings is the
principal component method. This method seeks values of the loadings that
bring the estimate of the total communality as close as possible to the total
of the observed variances.
• When the variables are not measured in the same units, it is customary
to standardize them prior to subjecting them to the principal component
method so that all have mean equal to zero and variance equal to one.
• The varimax rotation method encourages the detection of factors each
of which is related to few variables. It discourages the detection of factors
influencing all variables.
PROBLEMS
14.1 Confirm the results presented in Tables 14.4 and 14.5 using the data of
Table 14.1 and a statistical program for factor analysis.
14.2 Confirm the results given in Table 14.7 using the data in the file realest.dat
and a statistical program for factor analysis.
14.3 Using the data of Table 14.1 and in the manner of Section 14.4, confirm
that the covariances of the standardized variables Y1′, Y2′, and Y3′ are equal to the
correlation coefficients of the original variables Y1, Y2, and Y3.
14.4 Two observable variables, Y1 and Y2, are thought to be linearly related to
a common unobservable factor F:
$$
\begin{aligned}
Y_1 &= \beta_{10} + \beta_{11} F + e_1 \\
Y_2 &= \beta_{20} + \beta_{21} F + e_2
\end{aligned}
$$
14.7 (a) "In order to apply factor analysis, one does not need the values of the
observations for the variables of interest but only the variance covariance matrix or,
when the variables are to be standardized, the correlation matrix of the variables."
Comment.
(b) Some statistical programs seem to agree with the statement in (a) because
they allow the user to input directly the observed variance covariance or correlation
matrix for factor analysis. If possible, make use of this feature to analyze the
following correlation matrix of four variables:
      Y1       Y2       Y3       Y4
Y1    1.00     -0.01    0.81     0.02
Y2    -0.01    1.00     0.03     -0.97
Y3    0.81     0.03     1.00     -0.02
Y4    0.02     -0.97    -0.02    1.00
The grades are expressed as integers from 9 (A+) to 1 (C-) and 0 (Fail). A partial
listing of the data is given in Table 14.8.
Table 14.8
Data for Problem 14.8
ID No.  FACTG  MACTG  ECON  FIN  MKTG  SKILLS  ENVIR  MIS  QM   OPSM  OB
1       7      1      7     6    6     .       6      7    5    5     6
2       7      6      7     7    6     5       6      7    4    7     7
3       3      6      6     6    6     4       5      4    6    7     8
4       8      7      .     .    8     6       7      8    7    8     9
5       5      .      .     .    .     3       5      .    .    .     .
...     ...    ...    ...   ...  ...   ...     ...    ...  ...  ...   ...
50      3      3      3     3    .     3       5      6    7    7     7
File grades.dat
Figure 14.4
Factor analysis results, Problem 14.8(a)
Many students had not completed all required courses; missing grades are
indicated by a period.
(a) Figure 14.4 shows the output of a computer program for factor analysis
directed to extract only one factor (program SAS with the statement proc factor
n=1). Interpret and comment on the results.
(b) Can the analysis be improved? If so, carry out your suggestions using the
file grades.dat and a program for factor analysis.
14.9 The file bridge.dat is described in Problem 4.12 and includes the following
features of 45 bridges constructed by the Department of Transportation:
A statistical program for factor analysis routinely processed the data
according to its built-in defaults (standardization, principal component estimation,
varimax rotation). It extracted two factors and produced the loadings shown in
Table 14.9.
Table 14.9
Rotated factor loadings,
Problem 14.9
Variable Factor 1 Factor 2
TIME 0.69732 0.47572
DAREA 0.74797 0.44545
CCOST 0.83123 0.35001
DWGS 0.59594 0.64808
LENGTH 0.93742 0.16039
SPANS 0.86564 0.20127
DDIFF 0.16549 0.93573
(a) Calculate the communality of each variable and the percentage of its
variance that is explained by the factors. Calculate the percentage of the total
variance that is explained by each factor and by both factors jointly.
(b) Interpret the results of factor analysis.
(c) Confirm the results using the data in the file bridge.dat and a program
for factor analysis.
(d) Of what possible use is this type of analysis? Can it be improved? If
so, carry out your recommendations using the data in the file bridge.dat and a
program for factor analysis.
14.10 The file mpg.dat is described in Problem 5.15 and includes the following
features of 116 car models:
The data on these variables were processed by a program for factor analysis
according to its default features (standardization, principal component estimation
and varimax rotation). The program extracted one factor but indicated it could
not rotate the loadings shown in Table 14.10.
(a) Calculate the communality of each variable and the percentage of its
variance that is explained by the factor. Calculate the percentage of the total
variance explained by the factor.
(b) Interpret the results of factor analysis.
(c) Confirm the results using the data in the file mpg.dat and a program for
factor analysis.
(d) Of what possible use is this type of analysis? Can it be improved? If so,
carry out your recommendations using the data in the file mpg.dat and a program
for factor analysis.
Table 14.10
Unrotated factor loadings,
Problem 14.10
Variable Factor 1
ED 0.91337
CYL 0.90924
HP 0.83956
WEIGHT 0.83456
MPG -0.92294
14.11 The file stocks.dat, described in Problem 10.14, contains the daily closing
price of five stocks over a period of 378 consecutive trading days. A partial listing
of the file can be found in Table 10.8.
The factors influencing the price of a stock are usually categorized as those
that are common to all stocks (e.g., general economic conditions), those that
are specific to the industry in which the firm operates (e.g., conditions in the
lumber industry), and those that are specific to the firm itself (e.g., quality of its
management).
Two of the five stocks in the file belong to one industry, and the remaining
three to another.
Apply factor analysis to the data in the file stocks.dat to investigate whether
they are consistent with the above categorization. Explain carefully your results
and any additional assumptions or special treatment you considered appropriate. Of
what possible use is this type of analysis?
14.12 The file mutfunds.dat contains the share prices of 15 mutual funds at the
end of each of 25 consecutive months. Also included in the file are the interest
rate and the value of a stock market index at the end of each month. The file has
the format shown in Table 14.11.
Table 14.11
Data for Problem 14.12
Month    MF1      MF2     ...    MF15    IRATE     MINDEX
1        48.25    7.59    ...    7.60    0.0833    3028.20
2        47.18    7.37    ...    7.42    0.0835    2999.04
...      ...      ...     ...    ...     ...       ...
25       50.63    6.39    ...    7.30    0.0985    3285.82
File mutfunds.dat
MF1 to MF15 are the share prices of the mutual funds, IRATE is the interest
rate, and MINDEX the market index.
(a) The share price of a mutual fund is equal to the current value of its
assets divided by the number of outstanding shares. If all funds carried similar
portfolios of assets their share prices would vary in a similar fashion. To investigate
the degree to which a single factor explains the observed variation in share prices,
a factor model was estimated. The loadings obtained by the principal component
method are shown in Table 14.12.
Table 14.12
Factor loadings, Problem 14.12
Variable Factor 1
MF1 0.94156
MF2 0.85979
MF3 0.99148
MF4 0.98685
MF5 0.72733
MF6 0.76779
MF7 0.67069
MF8 0.97711
MF9 0.92559
MF10 0.95107
MF11 0.50267
MF12 0.90661
MF13 0.99303
MF14 0.98972
MF15 0.97688