CLUSTER ANALYSIS AND FACTOR ANALYSIS
Structure
18.0 Objectives
18.1 Introduction
18.1.1 Matrix Representation of Data
18.1.2 Analysis of Multidimensional Data
18.2 Cluster Analysis
18.2.1 Hierarchical Cluster Analysis
18.2.2 Non-hierarchical Cluster Analysis
18.2.3 Key Steps in Cluster Analysis
18.2.4 Measurement of Distances
18.2.5 Clustering Algorithms
18.2.6 Examples
18.3 Factor Analysis
18.3.1 Introduction
18.3.2 Principal Components Analysis
18.3.3 Factor Analysis
18.3.4 Examples of Factor Analysis
18.4 Appendices
18.5 Summary
18.6 Key Words
18.7 Answers to Self Check Exercises
18.8 References and Further Reading
18.0 OBJECTIVES
After going through this Unit, you will be able to appreciate and understand:
• the basic tools of data reduction, namely Cluster Analysis and Factor Analysis, and their applications using standard statistical packages such as SPSS.
18.1 INTRODUCTION
Bibliometric data are often large and complex. For instance, if one wishes to
examine the contributions of research institutions in a country, there would be
hundreds of institutions and several research fields. Such data sets are called
Multidimensional Data. Then, the question is how can we comprehend
multidimensional data?
[Data matrix: the rows are the institutions (Inst-1, Inst-2, ..., Inst-m) and the columns are the research fields.]
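As a minimal illustration (a Python/NumPy sketch with invented institution names and counts; the Unit's own analyses use SPSS), such a data set is simply an n x m matrix whose rows are the institutions and whose columns are the research fields:

import numpy as np

# Hypothetical publication counts: rows are institutions, columns are research fields.
fields = ["Physics", "Chemistry", "Biology"]
institutions = ["Inst-1", "Inst-2", "Inst-3", "Inst-4"]

X = np.array([
    [120,  35,  10],   # Inst-1
    [ 15,  80,  60],   # Inst-2
    [ 90,  40,  20],   # Inst-3
    [  5,  10, 150],   # Inst-4
])

print(X.shape)   # (4, 3): four objects (rows) measured on three variables (columns)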
The major ways of comprehending multidimensional data are:

• Cluster analysis - grouping the data into subsets (called clusters) based on some optimization criterion.

• Principal components analysis and factor analysis - replacing the original variables by a smaller number of derived factors or components.
18.2 CLUSTER ANALYSIS
Cluster analysis is essentially concerned with the following general problem:
Given a set of objects (institutions, authors, libraries, etc.) find subsets, called
clusters, which are both homogeneous and well separated. The aim of cluster
analysis is to classify objects into groups that are internally cohesive and
externally isolated. Thus clustering is a bi-criterion problem:
1) Objects within the same cluster should be homogeneous (as far as possible).
2) Objects in one cluster should differ (as much as possible) from those in
other clusters.
The classification has the effect of reducing the size of a data set by reducing
the number of rows (cases).
• Hierarchical Agglomeration: Initially each object is considered as a cluster, and then, iteratively, two clusters are chosen according to some criterion and merged into a new cluster. The procedure is continued till all the objects belong to one big cluster. The number of iterations is equal to the number of objects minus 1.
• Each object must belong to only one group (However, in fuzzy clustering,
this criterion is relaxed).
These conditions imply that k < n, where k is the number of clusters and n is the total number of objects.
Distance (proximity) measures fall into three broad classes:

• Euclidean metrics
• Non-Euclidean metrics
• Semi-metrics
Euclidean Metrics
Euclidean metrics measure true straight line distances in the Euclidean space.
In the case of univariate data, the Euclidean distance between two values A and B is the absolute arithmetic difference, |A - B|. In the case of bivariate data, the straight-line distance is the hypotenuse of a right triangle formed from the points. For three variables the hypotenuse extends through a three-dimensional space. An extension of the Pythagoras theorem gives the distance between two points in an n-dimensional space. The distance between two points X and Y in n-dimensional space can be computed as

d(X, Y) = \sqrt{ \sum_{i=1}^{n} (x_i - y_i)^2 }
We can use many different measures, all of which define different kinds of metric space; that is, a space where distance has meaning. For a space to be metric, it must satisfy the following conditions. Let d(A,B) denote the distance between two objects A and B.

1) d(A,B) is non-negative: it must be 0 or positive.

2) d(A,B) = d(B,A): the distance from A to B is the same as that from B to A.

3) d(A,A) = 0: the distance from an object to itself is zero.

4) d(A,C) <= d(A,B) + d(B,C): the triangle inequality; the direct distance between two objects is never greater than the distance via a third object.
Non-Euclidean Metrics
These distances are not straight line distances, but they obey the above
mentioned rules.
Semi-metrics
These distance measures obey the first three rules, but may not obey the triangle rule. An example of a semi-metric measure is the cosine measure.
b) Euclidean Distance

This distance is the square root of the squared Euclidean distance discussed above.
c) Minkowski Distance

A useful general form of distance metric is the Minkowski distance. The Minkowski row distance is defined as

d(i, j) = \left[ \sum_{k} |x_{ik} - x_{jk}|^{p} \right]^{1/p}

The summation is from k = 1 to the number of columns. The column distance is computed similarly, but the summation is over the number of rows rather than the number of columns. |.| denotes the absolute value, so that |-2| = 2.
The Minkowski distance is the pth root of the sum of the absolute differences to the pth power between corresponding elements of the rows (or columns). The Euclidean distance is the special case of the Minkowski distance with p = 2.
d) Block Distance

The block row distance is defined as

d(i, j) = \sum_{k} |x_{ik} - x_{jk}|
The sum is from k = 1 to the number of columns. The column distance is
computed in a similar manner, but the summation is over the number of
rows rather than the number of columns.
The block distance is the sum of the absolute differences between
corresponding elements of the rows (or columns). The block distance is
also known as the city block or Manhattan distance. This distance measure
is a special case of the Minkowski distance with p = 1.
e) Chebychev Distance

The Chebychev row distance is defined as

d(i, j) = \max_{k} |x_{ik} - x_{jk}|

The maximum is taken over k = 1 to the number of columns. The column distance is computed similarly, but the maximum is taken over the number of rows rather than the number of columns.
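The sketch below (Python with NumPy and SciPy, two invented rows of data) computes these distances directly from the formulas above and checks them against SciPy's built-in functions:

import numpy as np
from scipy.spatial import distance

x_i = np.array([2.0, 7.0, 4.0, 1.0])   # row i of a data matrix (invented values)
x_j = np.array([5.0, 3.0, 4.0, 9.0])   # row j

def minkowski_distance(u, v, p):
    # pth root of the sum of the absolute differences raised to the pth power
    return (np.abs(u - v) ** p).sum() ** (1.0 / p)

print(minkowski_distance(x_i, x_j, 2), distance.euclidean(x_i, x_j))   # Euclidean, p = 2
print(minkowski_distance(x_i, x_j, 1), distance.cityblock(x_i, x_j))   # block/Manhattan, p = 1
print(np.abs(x_i - x_j).max(), distance.chebyshev(x_i, x_j))           # Chebychev, p -> infinity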
Non-Euclidean distances for interval variables

a) Cosine

Cosine of vectors of variables. This is a pattern similarity measure. For mean-centred data, the cosine of the angle between two vectors equals their correlation coefficient.

similarity(X, Y) = \frac{\sum_i x_i y_i}{\sqrt{\sum_i x_i^2} \sqrt{\sum_i y_i^2}}
b) Correlation

The correlation between two vectors of values X and Y is defined as

r(X, Y) = \frac{\sum_i (x_i - \bar{X})(y_i - \bar{Y})}{(n - 1) S_X S_Y}

where

\bar{X} is the mean value of X,
\bar{Y} is the mean value of Y,
S_X is the standard deviation of X, and
S_Y is the standard deviation of Y.
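A small illustration (Python/NumPy, invented values) of the two measures; the correlation is simply the cosine computed after mean-centring the vectors:

import numpy as np

x = np.array([12.0, 30.0, 7.0, 51.0])   # invented interval-scaled values
y = np.array([10.0, 25.0, 9.0, 47.0])

# Cosine similarity: sum(x*y) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))
cosine = (x * y).sum() / (np.sqrt((x ** 2).sum()) * np.sqrt((y ** 2).sum()))

# Pearson correlation: the same cosine computed after mean-centring both vectors
xc, yc = x - x.mean(), y - y.mean()
r = (xc * yc).sum() / (np.sqrt((xc ** 2).sum()) * np.sqrt((yc ** 2).sum()))

print(cosine, r, np.corrcoef(x, y)[0, 1])   # the last two values agree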
Count Data
1) The proximity matrix is scanned to find the pair of cases with the highest
similarity (or lowest distance). These will be the most similar cases and
should be clustered most closely together.
2) The cluster formed by these two cases can now be considered a single object. The proximity matrix is recalculated so that all the other cases are compared with this new group, rather than the original two cases.

3) The matrix is then scanned (as in step 2) to find the pair of cases or clusters that now have the highest similarity.

4) Steps 2 and 3 are repeated until all the objects have been combined into
a single group.
The result is a dendrogram that shows the most similar cases linked most
closely together. The level of the vertical lines joining two cases or clusters
indicates the level of similarity between them. It is important to note that the
branching hierarchy and the level of similarity are the only important features
of the dendrogram. The exact sequence of the cases along the vertical axis is
not significant.
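These four steps are what standard agglomerative-clustering routines implement. A minimal sketch (Python with SciPy and matplotlib, invented data and labels; cosine distance and between-groups average linkage are chosen to mirror the options used in the example of Section 18.2.6):

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.spatial.distance import pdist

# Invented objects measured on three variables
labels = ["A", "B", "C", "D", "E"]
X = np.array([[1.0, 2.0, 0.5],
              [1.2, 1.8, 0.4],
              [6.0, 0.5, 3.0],
              [5.8, 0.7, 2.9],
              [3.0, 3.0, 3.0]])

# Proximity matrix as cosine distances (1 - cosine similarity)
d = pdist(X, metric="cosine")

# Agglomeration schedule with between-groups (average) linkage
Z = linkage(d, method="average")
print(Z)   # each row: the two clusters merged, the fusion distance, the new cluster size

dendrogram(Z, labels=labels)   # tree showing the most similar cases linked most closely
plt.show()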
In the single linkage method the distance between any two clusters is the
distance between the two closest elements of the two clusters. This is also
commonly referred to as the "nearest neighbor" technique.
This method tends to yield clusters in long chains, where the initial element of the cluster may be very different from the final element of the cluster. This method, though computationally very efficient, is ill regarded because of this tendency.
In this method, the distance between any two clusters is the distance between
the two farthest elements of the two clusters. This method is also called
"farthest neighbor method". Similar to the single linkage in efficiency, the
complete linkage method does not suffer from chaining. It is actually biased
against any sort of chaining.
The figure below shows how the distances for these two methods are calculated in a two-dimensional example.

Fig. 2: Calculation of distances by the single linkage and complete linkage methods
These two methods are quite simple, but they can distort the results, since the
distances between groups are calculated based on unusual outlying points
rather than the properties of the whole cluster. Nearest neighbor clustering is
also susceptible to a phenomenon called "chaining" in which there is a tendency
to repeatedly add new individuals onto a single cluster rather than making
several separate clusters. This gives the dendrogram a staircase-like appearance.
Group Average Method - The distance between two clusters is the average
of the distances between every pair of objects in opposing clusters. This
method is also called paired group method. This method is non-monotonic,
which essentially means that the elements of two clusters may be more distant
than the clusters themselves.
Centroid Method - In this method, the centroid (the center of gravity of the points) of each group is calculated and the distance between the groups is the distance between their centroids. The centroid itself can be described as the average point of the cluster. It is calculated by taking the mean value of the coordinates on each axis for all the points in the cluster. The centroid method is non-monotonic, i.e. elements of two clusters may be more distant than the clusters themselves.
Both methods have two variants: weighted and unweighted. The unweighted
methods give equal weight to each point in each cluster. The weighted methods
on the other hand give unequal weights to each cluster; if one cluster has
fewer points than another, then points in the smaller cluster are given higher
weightage to make the two groups equal. In general, the unweighted versions
are used unless the data are expected to have some clusters that are much
smaller than others.
One of the biggest problems with hierarchical cluster analysis is how to identify
the optimum number of clusters. As the fusion process continues, increasingly
dissimilar clusters are fused, i.e. the classification becomes increasingly
artificial. Deciding upon the optimum number of clusters is largely subjective,
although looking at a graph of the level of similarity at fusion versus number
of clusters may help. There will be sudden jumps in the level of similarity as
dissimilar groups are fused.
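One way to examine this is to plot the fusion level recorded in the agglomeration schedule against the number of clusters remaining, and look for a sudden jump. A rough sketch (Python with SciPy and matplotlib, random data purely for illustration):

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist

X = np.random.default_rng(1).normal(size=(20, 4))   # invented data: 20 objects, 4 variables
Z = linkage(pdist(X), method="average")

# Column 2 of the linkage matrix is the distance at which each fusion took place;
# after the s-th fusion there are n - s clusters left.
n = X.shape[0]
clusters_left = np.arange(n - 1, 0, -1)
fusion_level = Z[:, 2]

plt.plot(clusters_left, fusion_level, marker="o")
plt.xlabel("Number of clusters")
plt.ylabel("Level of similarity/distance at fusion")
plt.gca().invert_xaxis()   # read the plot from many clusters down to one
plt.show()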
Partitioning Methods
18.2.6 Examples
Proximity Matrix
Cosine of Vectors of Values
Case     1:USA  2:JPN  3:SUN  4:UKD  5:FRG  6:FRA  7:IND  8:CAN  9:ITA  10:NLD  11:POL
1:USA           .902   .727   .898   .884   .892   .838   .992   .806   .954    .891
2:JPN    .902          .733   .791   .786   .932   .843   .903   .692   .884    .842
3:SUN    .727   .733          .902   .799   .828   .847   .724   .915   .727    .576
4:UKD    .898   .791   .902          .954   .874   .931   .900   .978   .892    .739
5:FRG    .884   .786   .799   .954          .853   .944   .918   .904   .892    .793
6:FRA    .892   .932   .828   .874   .853          .841   .896   .782   .922    .752
7:IND    .838   .843   .847   .931   .944   .841          .866   .912   .854    .792
8:CAN    .992   .903   .724   .900   .918   .896   .866          .804   .961    .916
9:ITA    .806   .692   .915   .978   .904   .782   .912   .804          .804    .653
10:NLD   .954   .884   .727   .892   .892   .922   .854   .961   .804           .886
11:POL   .891   .842   .576   .739   .793   .752   .792   .916   .653   .886
Average Linkage (Between Groups)

Agglomeration Schedule

         Cluster Combined                    Stage Cluster First Appears
Stage    Cluster 1   Cluster 2   Coefficients   Cluster 1   Cluster 2   Next Stage
1        1           8           .992           0           0           3
2        4           9           .978           0           0           6
3        1           10          .958           1           0           7
4        5           7           .944           0           0           6
5        2           6           .932           0           0           7
6        4           5           .925           2           4           8
7        1           2           .900           3           5           9
8        3           4           .866           0           6           10
9        1           11          .857           7           0           10
10       1           3           .804           9           8           0
Dendrogram
***** HIERARCHICAL CLUSTER ANALYSI
S * * * * *
Dendrogram using Average Linkage (Between Groups)
Rescaled Distance Cluster Combine
CAS E 0 5 10 15 20 25
Label Num + +.~ + + + +
USA 1 ..+ +
CAN 8 ..+ + +
NLD 10 + + +
JPN 2 + + + +
FRA 6 + I I
POL 11 + I
UKD 4 + + I
ITA 9 + + + I
FRG 5 + + + +
IND 7 + I
SUN 3 +
Comments:

1) The SPSS module CLASSIFY (Hierarchical Cluster) was used with the following options:

   Proximity Measure: COSINE (this is a similarity measure)
   Clustering Method: BETWEEN-GROUP AVERAGE

2) The proximity matrix shows similarities between pairs of countries. Note that the diagonal of the matrix is empty, since the proximity of a country with itself is of no interest; our primary interest is the proximity between different countries.
Most similar countries are USA and Canada; the proximity value is the highest (.992).

Least similar countries are Poland and Ex-USSR (SUN); the proximity value is the lowest (.576).
Cluster Membership

          Cluster
          1        2        3        4        5
MAT       .00      2.27     3.38     2.00     .96
PHY       .00      31.43    48.69    18.69    73.60
CHM       .00      17.89    6.64     7.76     5.02
BIO       22.22    6.50     4.53     4.04     1.34

Cluster   1        2        3        4        5
1                  58.600   68.068   53.507   90.115
ANOVA

        Cluster                  Error                    F         Sig.
        Mean Square     df       Mean Square     df
MAT     8.193           4        6.641           30       1.234     .318
PHY     3704.185        4        83.661          30       44.276    .000
CHM     267.983         4        39.421          30       6.798     .001
BIO     110.891         4        10.408          30       10.654    .000
EAS     20.227          4        15.275          30       1.324     .284
Comments:
Principal components analysis and its cousin factor analysis operate by replacing the original data matrix X by an estimate composed of the product of two matrices. The left matrix in the product contains a small number of columns corresponding to the factors or components, whereas the right matrix of the product provides the information that relates the components to the original variables. A scatter plot based on the left matrix is useful for relating the n objects of X with respect to the new factors. A plot based on the rows of the right matrix can be used to relate the components to the original variables. The decomposition of X into a product of two matrices is a special case of a matrix approximation procedure called singular value decomposition.
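A brief sketch of this idea (Python/NumPy, random data purely for illustration): the singular value decomposition is truncated to k components, giving a left matrix that relates the objects to the components and a right matrix that relates the components to the original variables:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 6))   # invented data matrix: 10 objects, 6 variables

U, s, Vt = np.linalg.svd(X, full_matrices=False)

k = 2                           # number of components retained
left = U[:, :k] * s[:k]         # n x k: relates the objects to the components
right = Vt[:k, :]               # k x p: relates the components to the original variables

X_hat = left @ right            # rank-k estimate of X
print(np.linalg.norm(X - X_hat))   # approximation error; it shrinks as k grows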
Unique variance = specific variance + error variance
Factor analysis analyzes only the common variance of the observed variables;
principal components analysis (PCA) considers the total variance and makes
no distinction between common and unique variance (specific plus error).
Thus, the difference between these two techniques is related to matrix
computations. PCA assumes that all variance is common, with all unique
factors set equal to zero; while factor analysis assumes that there is some
unique variance. The level of unique variance is dictated by the factor analysis
model.
18.3.2 Principal Components Analysis
Principal components analysis can be defined as follows. Consider the data matrix

X = \| x_{ij} \|

in which the columns represent the p variables and the rows represent the measurements of n objects or individuals on those variables. The data can be represented by a cloud of n points in a p-dimensional space, each axis corresponding to a measured variable. We can then look for a line (say, Y_1) in this space such that the dispersion of the n points, when projected onto this line, is a maximum. This operation defines a derived variable of the form

Y_1 = a_{11} x_1 + a_{12} x_2 + \ldots + a_{1p} x_p

where the constants a_{11}, a_{12}, \ldots, a_{1p} are determined by the requirement that the variance of Y_1 is a maximum, subject to the constraint of orthogonality as well as

\sum_{j=1}^{p} a_{ij}^2 = 1 for each i.

The Y_i thus obtained are called Principal Components of the system and the process of obtaining them is called Principal Components Analysis.
The variance of a linear composite Y = a_1 x_1 + a_2 x_2 + \ldots + a_p x_p is given by

Var(Y) = \sum_{i} \sum_{j} a_i a_j s_{ij}

where s_{ij} is the covariance between variables i and j. The variance of a linear composite can also be expressed in the notation of matrix algebra as

a^T S a

where a is the vector of the variable weights, S is the covariance matrix, and a^T is the transpose of a.
Principal components analysis finds the weight vector a_1 that maximizes

a_1^T S a_1

subject to the normalization constraint

\sum_{i=1}^{p} a_{1i}^2 = a_1^T a_1 = 1.

The first principal component is then the derived variable Y_1 = \sum_{i=1}^{p} a_{1i} x_i.

We can then define the second principal component as the weight vector a_2
which maximizes the variance of

Y_2 = \sum_{i=1}^{p} a_{2i} x_i

subject to the normalization constraint

\sum_{i=1}^{p} a_{2i}^2 = a_2^T a_2 = 1

and the orthogonality constraint

\sum_{i=1}^{p} a_{2i} a_{1i} = 0.

We can define the third principal component as the weight vector a_3 which maximizes the variance of Y_3 = \sum_{i=1}^{p} a_{3i} x_i, subject to a_3^T a_3 = 1 and the orthogonality constraints

\sum_{i=1}^{p} a_{3i} a_{1i} = 0 and \sum_{i=1}^{p} a_{3i} a_{2i} = 0.

This process can be continued till the last (i.e., the p-th) principal component is derived. The sum of the variances of the principal components is equal to the total variance of the original variables.
Here a_1 = (a_{11}, a_{12}, \ldots, a_{1p}) is the set of weights for the first principal component, which maximizes the variance of Y_1. The i-th latent root (eigenvalue) \lambda_i of the correlation matrix R and its associated weight vector a_i satisfy the matrix equation

R a_i = \lambda_i a_i

Pre-multiplying this equation by a_i^T leads to

a_i^T R a_i = a_i^T \lambda_i a_i = \lambda_i

since a_i^T a_i = 1.

The variance of the first principal component is therefore \lambda_1; similarly for the second principal component, and so on. The last latent root, \lambda_p, is the variance of the last (smallest) principal component. Thus:

R a_1 = \lambda_1 a_1
R a_2 = \lambda_2 a_2
\ldots
R a_p = \lambda_p a_p

In matrix notation,

R A = A \Lambda

where A is the matrix whose columns are the weight vectors and \Lambda is the diagonal matrix of latent roots.
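These relations can be checked numerically. A sketch (Python/NumPy, random data purely for illustration) that extracts the latent roots and weight vectors of the correlation matrix R and verifies R a_i = lambda_i a_i:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))                # invented data matrix (n = 50 objects, p = 4 variables)

R = np.corrcoef(X, rowvar=False)            # p x p correlation matrix
lam, A = np.linalg.eigh(R)                  # latent roots and weight vectors
lam, A = lam[::-1], A[:, ::-1]              # order from the largest root to the smallest

print(np.allclose(R @ A, A * lam))          # each column a_i satisfies R a_i = lambda_i a_i
print(lam.sum(), R.trace())                 # sum of latent roots = total (standardised) variance

# Component scores: project the standardised data onto the weight vectors
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
Y = Z @ A
print(Y.var(axis=0, ddof=1))                # component variances, approximately the latent roots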
The communality of variable i is the sum of its squared loadings over all the factors:

h_i^2 = \sum_{j} l_{ij}^2

where l_{ij} is the loading of variable i on factor j.
Communality: The sum of the squared factor loadings for all factors for a
given variable is the variance in that variable accounted for by all the factors,
and this is called the communality. In other words, it is the proportion of a
variable's variance explained by a factor structure. In complete principal
components analysis, with no factors dropped, communality is equal to 1.0, or
100% of the variance of the given variable.
Eigenvalues: The eigenvalue for a given factor reflects the variance in all the
variables, which is accounted for by that factor. A factor's eigenvalue may be
computed as the sum of its squared factor loadings for all the variables. If a
factor has a low eigenvalue, then it is contributing little to the explanation of
variances in the variables and may be ignored. The sum of the eigenvalues of all the factors is equal to the total variance in the data. The importance of a factor is equal to the proportion of the total variance explained by it, i.e. its eigenvalue divided by the number of variables (when the variables are standardised).
The factor analysis model does not extract all the variance; it extracts only that
proportion of variance, which is due to the common factors and shared by
several items. The proportion of variance of a particular item that is due to
common factors (shared with other items) is called communality. The proportion
of variance that is unique to each item is then the respective item's total
variance minus the communality.
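A small numerical illustration (Python/NumPy) of communalities and eigenvalues computed from a loading matrix; the loadings used here are the illustrative six-variable, two-factor values tabulated below:

import numpy as np

# Loading matrix: rows = variables V1..V6, columns = two factors
L = np.array([[0.81, 0.45],
              [0.84, 0.31],
              [0.80, 0.29],
              [0.89, 0.37],
              [0.79, 0.51],
              [0.45, 0.43]])

communalities = (L ** 2).sum(axis=1)   # variance of each variable explained by the factors
eigenvalues = (L ** 2).sum(axis=0)     # variance in all variables explained by each factor

print(communalities)
print(eigenvalues)
print(eigenvalues / L.shape[0])        # proportion of total variance explained by each factor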
The solution of equation (2) is not unique (unless the number of factors = 1),
which means that the factor loadings are inherently indeterminate. Any solution
can be rotated arbitrarily to obtain a new factor structure.
Variable   Factor 1   Factor 2
V1         .81        .45
V2         .84        .31
V3         .80        .29
V4         .89        .37
V5         .79        .51
V6         .45        .43
This table shows the difficulty of interpreting an unrotated factor solution. All
the significant loadings are shown in boldface type. Obviously, the results are
ambiguous. Variables V1, V2, V5 and V6 have significant loadings on Factors
1 and 2. This is a common pattern. One way to obtain more interpretable
results is to rotate the factor axes.
• Each factor should have several variables with strong loadings. Remember that strong loadings can be positive or negative.

• Each variable should have a strong loading for only one factor.

• Each variable should have a large communality. This implies that the factor structure accounts for most of its variance.
Factor Rotation: Given a Cartesian coordinate system where the axes are the factors and the points are the variables, factor rotation is the process of holding the points constant and moving (rotating) the factor axes. The rotation is done in such a manner that the points are highly correlated with the axes and provide a more meaningful interpretation of the factors.
Rotation can be orthogonal or oblique. In orthogonal rotation, the axes are
perpendicular to each other, whereas in oblique rotation, the axes are not
perpendicular. Major strategies for orthogonal rotation are: Varimax, Quartimax
and Equimax, but the most common rotation strategy is the Varimax rotation.
Variable   Factor 1   Factor 2
V1         .68        .17
V2         .87        .24
V3         .65        .07
V4         .16        .76
V5         .30        .83
V6         .19        .69
Note that the eigenvalues associated with the unrotated and rotated solution
will differ, but their total will be the same. The communalities of the variables do not change with rotation.
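A brief sketch (Python/NumPy) of this invariance: an arbitrary orthogonal rotation (not the Varimax solution) is applied to the illustrative unrotated loadings shown earlier, and the communalities are unchanged:

import numpy as np

# Unrotated loadings for six variables on two factors (illustrative values)
L = np.array([[0.81, 0.45],
              [0.84, 0.31],
              [0.80, 0.29],
              [0.89, 0.37],
              [0.79, 0.51],
              [0.45, 0.43]])

theta = np.deg2rad(30)                       # an arbitrary rigid rotation of the factor axes
T = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

L_rot = L @ T                                # loadings after the orthogonal rotation

print((L ** 2).sum(axis=1))                  # communalities before rotation
print((L_rot ** 2).sum(axis=1))              # identical communalities after rotation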
Interpretation of Factors
This involves the following: (i) How many principal components or factors to
choose and retain for interpretation, and (ii) which variables have strong loading
on which factors?
Fig. 3: Scree Plot (eigenvalues plotted against the number of factors)
Communalities

Variable     Initial    Extraction
VAR00002     1.000      .717
VAR00004     1.000      .933
VAR00005     1.000      .895
VAR00006     1.000      .694
VAR00007     1.000      .861
VAR00008     1.000      .889
VAR00009     1.000      .795
VAR00010     1.000      .854
VAR00011     1.000      .804
VAR00012     1.000      .778
VAR00014     1.000      .571
VAR00015     1.000      .622
VAR00016     1.000      .914
VAR00019     1.000      .878
VAR00021     1.000      .908
VAR00024     1.000      .446
VAR00025     1.000      .918

Extraction Method: Principal Component Analysis.
Total Variance Explained

[Table of initial eigenvalues, extraction sums of squared loadings, and rotation sums of squared loadings for each component.]
Comments
1) Syntax - Not all the journals listed in Appendix 3 are included in the analysis, because all the rows and columns of the excluded journals are empty.
2) Communalities -As discussed earlier, communality is the proportion of
a variable's variance explained by the factor structure. In complete principal
components analysis, with no factors dropped, communality is equal to
1.0, or 100% of the variance of the given variable. For extracted factors,
communality ranges from 0.446 to 0.933.
3) Percentage of Variance Explained - The first factor explains more than
25% of the total variance. The first six factors together explain more than
79% of the total variance. Six- factor solution has been retained for
interpretation.
4) Rotated Component Matrix- This matrix shows rotated factor loadings,
sorted by their absolute values.
5) Interpretation of Factors - Look at the rotated component matrix.
Factor 1: The following variables have high loadings on this factor.

Var00016   Lect Notes Comp Sci
Var00005   Commun ACM
Var00008   Computer
Var00007   Comput Surveys
Var00015   Journal ACM
18.4 APPENDICES
Appendix - 1
SPECIALIZATION PATTERNS IN CHEMISTRY
V1 Analytical Chemistry
V2 Applied Chemistry
V3 Biochemistry
V4 Electrochemistry
V5 Inorganic Chemistry
V6 Organic Chemistry
V7 Physical Chemistry
V8 Polymers
V9 Chemical Engineering
V1 V2 V3 V4 V5 V6 V7 V8 V9
Appendix - 2

INDIA'S SCIENTIFIC COOPERATION LINKS

The following data show India's cooperation links with its 35 most significant partner countries in eleven scientific fields during 1990-1994:
Scientific fields
1) Mathematics
2) Physics
3) Chemistry
4) Biology
5) Earth & Space
6) Agriculture
7) Clinical Medicine
8) Biomedicine
9) Engineering & Technology
10) Material science
11) Computer science
Countries
1) United States of America (USA)
2) United Kingdom (UKD)
3) Germany (FRG)
4) Canada (CAN)
5) France (FRA)
6) Japan (JPN)
7) Italy (ITA)
8) Ex-USSR (SUN)
9) Australia (AUS)
10) Switzerland (CHE)
11) Netherlands (NLD)
12) Sweden (SWE)
13) Spain (ESP)
14) China (PRC)
15) Belgium (BEL)
16) Hungary (HUN)
17) Brazil (BRA)
18) Bangladesh (BND)
19) Denmark (DNK)
20) Austria (AUT)
21) Bulgaria (BGR)
22) Poland (POL)
23) Korea (KOR)
24) Czechoslovakia (CSK)
25) Mexico (MEX)
26) Philippines (PHL)
27) Finland (FIN)
28) Israel (ISR)
29) Egypt (EGY)
30) Greece (GRC)
31) Taiwan (TWN)
32) Norway (NOR)
33) Nigeria (NGR)
34) Chile (CHL)
35) Romania (ROM)
3.86 33.09 11.54 4.77 5.25 2.31 12.13 12.45 7.48 1.95 3.10
U4 35.37 8.38 6.41 5.38 3.52 19.13 9.62 5.07 .31 3.41
1.28 44.90 10.79 5.68 4.52 2.67 9.16 10.90 6.03 .46 1.86
8.91 37.03 7.13 6.34 5.74 2.57 7.92 7.72 10.30 2.38 2.57
.92 51.61 10.83 3.46 5.99 1.61 4.84 9.22 3.00 1.38 4.61
1.84 29.95 17.74 9.68 7.37 1.38 10.14 12.90 4.15 .69 2.30
2.92 67.11 9.55 1.06 1.59 .53 6.37 4.24 3.45 .27 2.39
2.65 27.51 17.99 8.99 7.94 10.05 11.11 4.76 5.82 .00 .53
2.76 59.67 6.63 .00 1.10 1.66 12.71 4.42 8.29 .00 .00
9.04 49.40 3.61 3.01 6.02 3.01 8.43 9.04 4.82 1.81 .60
.00 32.52 7.32 2.44 6.50 1.63 28.46 13.01 5.69 .81 .00
.85 66.95 11.02 .85 .85 .00 7.63 3.39 .85 .00 5.93
.93 70.37 1.85 3.70 2.78 .93 8.33 3.70 3.70 .00 1.85
3.13 40.63 9.38 10.42 9.38 .00 8.33 7.29 3.13 1.04 1.04
2.30 59.77 10.34 4.60 4.60 .00 4.60 4.60 5.75 1.15 1.15
2.67 61.33 1.33 2.67 9.33 .00 8.00 6.67 6.67 .00 .00
10.00 21.67 10.00 3.33 1.67 3.33 16.67 11.67 21.67 .00 .00
.00 21.15 26.92 13.46 1.92 .00 17.31 3.85 5.77 5.77 .00
2.13 23.40 29.79 4.26 2.13 .00 21.28 12.77 4.26 .00 .00
.00 85.11 4.26 2.13 .00 .00 2.13 4.26 2.13 .00 .00
4.35 36.96 15.22 10.87 .00 2.17 15.22 8.70 4.35 2.17 .00
.00 37.84 35.14 .00 5.41 2.70 5.41 5.41 .00 .00 8.11
2.78 36.11 2.78 5.56 .00 11.11 8.33 30.56 2.78 .00 .00
.00 .00 .00 22.22 5.56 44.44 8.33 19.44 .00 .00 .00
2.94 64.71 8.82 -.00 5.88 .00 5.88 8.82 .00 .00 .00
3.03 33.33 9.09 3.03 3.03 3.03 9.09 24.24 9.09 .00 3.03
.00 13.33 6.67 3.33 13.33 .00 20.00 6.67 20.00 .00 .00
3.57 35.71 7.14 3.57 .00 3.57 10.71 10.71 10.71 3.57 3.57
.00 46.43 3.57 3.57 3.57 .00 10.71 17.86 10.71 3.57 .00
.00 11.11 3.70 11.11 14.81 3.70 22.22 18.52 14.81 .00 .00
.00 14.81 11.11 .00 .00 11.11 33.33 11.11 3.70 3.70 .00
.00 73.08 .00 3.85 .00 .00 7.69 7.69 .00 3.85 .00
.00 72.00 .00 .00 .00 .00 .00 .00 28.00 .00 .00
.00 00.00 .00 .00 .00 .00 .00 .00 .00 .00 .00
Appendix - 3

ARTIFICIAL INTELLIGENCE
The following table shows the citation relationships among a set of 24 journals in the area of Artificial Intelligence. The table is divided into three parts: Table A contains columns 1-8, Table B columns 9-16, and Table C columns 17-24. If these tables are placed side by side, we get the journal-to-journal citation matrix.

In this matrix, the cell (i, j) gives the number of times the articles in journal j cite articles in journal i in a given year. Equivalently, the cell (i, j) gives the number of times articles in journal i are cited by articles in journal j. This matrix represents the citation environment of Artificial Intelligence. The columns of the matrix represent the citing dimension, whereas the rows represent the cited dimension of Artificial Intelligence.
Table A : Journal to Journal Citation Matrix

Journal Name        1      2      3      4       5       6      7      8
1. Artif Intel      46.00  .00    7.00   7.00    5.00    22.00  12.00  7.00
2. Auto-Theo        4.00   .00    .00    .00     .00     .00    .00    .00
4. Comm ACM         11.00  .00    .00    182.00  34.00   35.00  98.00  17.00
5. Comp Graph       .00    .00    .00    8.00    101.00  5.00   19.00  15.00
6. Comp Survey      .00    .00    .00    24.00   11.00   13.00  18.00  .00
9. IEEE SMC         6.00   .00    .00    .00     24.00   .00    33.00  53.00
10. Man Machine     .00    .00    .00    12.00   .00     .00    .00    13.00
11. J. Exp Psy      .00    .00    3.00   .00     .00     .00    .00    .00
13. J Pragmat       .00    .00    .00    .00     .00     .00    .00    .00
14. J ACM           12.00  .00    .00    16.00   15.00   7.00   11.00  8.00
15. Le Comp Sci     5.00   .00    .00    7.00    .00     .00    .00
16. Mach Intel      12.00  .00    .00    .00     .00     .00    .00
18. P IEEE          .00    .00    .00    .00     18.00   .00    17.00  12.00
19. P IJCAI         22.00  .00    .00    .00     18.00   .00    .00    6.00
20. P Photo Opt     .00    .00    .00    .00     8.00    .00    .00    .00
21. Patt Direct     9.00   .00    .00    .00     .00     .00    .00    .00
22. Problem Sol     5.00   .00    .00    .00     .00     .00    .00    .00
23. Psychol B       .00    .00    .00    .00     .00     .00    .00    8.00
24. Tai Tec Sci     .00    .00    .00    .00     .00     .00    .00    .00
Table B : Journal to Journal Citation Matrix

Journal Name        9      10      11      12     13     14      15     16
1. Artif Intel      13.00  7.00    4.00    .00    10.00  10.00   .00    .00
2. Auto-Theo        .00    .00     .00     .00    .00    .00     .00    .00
3. Cognitive        .00    .00     4.00    .00    24.00  .00     .00    .00
4. Comm ACM         25.00  28.00   .00     .00    6.00   58.00   83.00  .00
5. Comp Graph       71.00  .00     .00     .00    .00    .00     .00    .00
6. Comp Survey      .00    .00     .00     .00    .00    .00     .00    .00
7. Computer         .00    .00     .00     .00    .00    .00     .00    .00
8. IEEE PA          80.00  .00     .00     .00    .00    .00     .00    .00
9. IEEE SMC         60.00  7.00    .00     .00    .00    .00     .00    .00
10. Man Machine     .00    103.00  .00     .00    .00    .00     .00    .00
11. J. Exp Psy      .00    .00     154.00  .00    .00    .00     .00    .00
12. J Philos        .00    .00     .00     .00    .00    .00     .00    .00
13. J Pragmat       .00    .00     .00     .00    11.00  .00     .00    .00
14. J ACM           32.00  .00     .00     .00    .00    102.00  60.00  .00
15. Le Comp Sci     .00    .00     .00     .00    .00    30.00   92.00  .00
16. Mach Intel      7.00   .00     .00     .00    .00    .00     .00    .00
17. Othello Q       .00    .00     .00     .00    .00    .00     .00    .00
18. P IEEE          34.00  8.00    .00     .00    .00    .00     .00    .00
19. P IJCAI         .00    10.00   .00     .00    .00    .00     .00    .00
20. P Photo Opt     .00    .00     .00     .00    .00    .00     .00    .00
21. Patt Direct     .00    .00     .00     .00    .00    .00     .00    .00
22. Problem Sol     .00    .00     .00     .00    .00    .00     .00    .00
23. Psychol B       .00    .00     .00     .00    .00    .00     .00    .00
24. Tai Tec Sci     .00    .00     .00     .00    .00    .00     .00    .00
Note: Numbers in columns refer to journal names in rows
Table C : Journal to Journal Citation Matrix

Journal Name        17     18      19     20      21     22     23      24
14. J ACM           .00    15.00   .00    .00     .00    .00    .00     .00
15. Le Comp Sci     .00    .00     .00    .00     .00    .00    .00     .00
16. Mach Intel      .00    .00     .00    .00     .00    .00    .00     .00
17. Othello Q       .00    .00     .00    .00     .00    .00    .00     .00
18. P IEEE          .00    222.00  .00    112.00  .00    .00    .00     .00
19. P IJCAI         .00    .00     .00    .00     .00    .00    .00     .00
20. P Photo Opt     .00    6.00    .00    68.00   .00    .00    .00     .00
21. Patt Direct     .00    .00     .00    .00     .00    .00    .00     .00
22. Problem Sol     .00    .00     .00    .00     .00    .00    .00     .00
23. Psychol B       .00    .00     .00    .00     .00    .00    211.00  .00
24. Tai Tec Sci     .00    .00     .00    .00     .00    .00    .00     .00
18.5 SUMMARY
In this Unit we have discussed two major techniques of data reduction, viz. Cluster analysis and Factor analysis.
Cluster Analysis
Factor Analysis
Partitional clustering attempts to directly decompose the data set into a set of
disjoint clusters according to an optimization criterion, which involves
minimizing within-cluster dissimilarities and maximizing inter-cluster
dissimilarities.
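The most familiar partitioning method is k-means, which assigns each object to the cluster with the nearest centre. A minimal sketch (Python with scikit-learn and random data purely for illustration; the Unit's own outputs come from SPSS):

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 5))       # invented data: 30 objects measured on 5 variables

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

print(km.labels_)                  # cluster membership of each object (disjoint clusters)
print(km.cluster_centers_)         # centroid of each cluster
print(km.inertia_)                 # within-cluster sum of squares, the quantity minimised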
Principal Factor Analysis (PFA): Also called principal axis factoring and common factor analysis, PFA is a form of factor analysis which seeks the least number of factors which can account for the common variance of a set of variables, whereas the more common principal components analysis (PCA) in its full form seeks the set of factors which can account for all the common and unique variance in a set of variables. PFA uses a PCA strategy but applies it to a correlation matrix in which the diagonal elements are not 1's, as in PCA, but estimates of the communalities (see below). These estimates are the squared multiple correlations of each variable with the remaining variables in the matrix (or, less commonly, the highest absolute correlation in the matrix row).
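A short sketch (Python/NumPy, random data purely for illustration) of the squared-multiple-correlation estimate of the communalities mentioned above, using the identity SMC_i = 1 - 1/r^ii, where r^ii is the i-th diagonal element of the inverse correlation matrix:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))               # invented data
R = np.corrcoef(X, rowvar=False)            # correlation matrix of the variables

# Squared multiple correlation of each variable with the remaining variables:
# SMC_i = 1 - 1 / (i-th diagonal element of the inverse of R)
smc = 1.0 - 1.0 / np.diag(np.linalg.inv(R))

R_reduced = R.copy()
np.fill_diagonal(R_reduced, smc)            # reduced correlation matrix with communality estimates
print(smc)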
Unique variance: The variance of a variable which is not explained by common factors. Unique variance is composed of specific and error variances.
a) Euclidean Metrics
b) Non - Euclidean Metrics
c) Semi - Metrics
2) There are five types of agglomerative clustering methods commonly in
use. They are:
a) Single linkage clustering
b) Complete linkage clustering
c) Average linkage clustering
d) Within group clustering
e) Minimum variance clustering
3) In both principal components analysis (PCA) and factor analysis, the original
explanatory variables are replaced by components or factors. The objective
is to reduce the number of variables and classify the variables. The PCA
provides a unique solution so that the original data can be reconstructed
from the results. There are as many components as variables. The sum of
the variances of the principal components is equal to the variance of the
original variables. However, we retain only those components which explain
the variation in the dependent variable substantially. In factor analysis we
identify the underlying latent factors. Here again we take only those factors
which explain the variance in the dependent variable substantially.
4) The objective of rotation of principal axes in factor analysis is to obtain a
clear pattern of loadings. The rotation is done in a manner so that the
points are highly correlated with the axes and provide a more meaningful
interpretation of factors. Rotation can be orthogonal or oblique.
5) The correlations of variables with principal components are called loadings,
which provide the influence of the original variables on the principal
components. Thus a high coefficient corresponds to a high loading. The
general pattern of loadings is called 'simple structure'. It means that each
factor should have several variables with strong loadings. Each variable
should have a strong loading for only one factor. Each factor should account
for the variance in the dependent variable.
6) The objective of cluster analysis is to classify (not reduce) the explanatory
variables into homogeneous groups. The unit of measurement strongly
affects the resulting clustering. Thus if the variables are in different units,
then we standardise the variables so that

x_i' = (x_i - \bar{x}) / \sigma_x

where \bar{x} is the mean and \sigma_x the standard deviation of the variable.
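A one-step sketch of this standardisation (Python/NumPy, invented values in deliberately different units):

import numpy as np

# Two invented variables measured in very different units
X = np.array([[200.0, 3.2],
              [150.0, 4.1],
              [320.0, 2.5],
              [280.0, 5.0]])

# Standardise each variable: subtract its mean and divide by its standard deviation,
# so that the unit of measurement no longer dominates the distance calculations
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

print(Z.mean(axis=0))           # approximately 0 for every column
print(Z.std(axis=0, ddof=1))    # exactly 1 for every column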
18.8 REFERENCES AND FURTHER READING

Kim, Jae-On and Mueller, C.W. (1978). Introduction to Factor Analysis. Sage Publications, Inc.

Nagpaul, P.S. (2001). Guide to Advanced Data Analysis Using IDAMS Software (Chapters 6 and 7), http://www.unesco.org/idams (only the electronic version is available).