
UNIT 18 CLUSTER ANALYSIS AND

FACTOR ANALYSIS
Structure
18.0 Objectives
18.1 Introduction
18.1.1 Matrix Representation of Data
18.1.2 Analysis of Multidimensional Data
18.2 Cluster Analysis
18.2.1 Hierarchical Cluster Analysis
18.2.2 Non-hierarchical Cluster Analysis
18.2.3 Key Steps in Cluster Analysis
18.2.4 Measurement of Distances
18.2.5 Clustering Algorithms
18.2.6 Examples
18.3 Factor Analysis
18.3.1 Introduction
18.3.2 Principal Components Analysis
18.3.3 Factor Analysis
18.3.4 Examples of Factor Analysis
18.4 Appendices
18.5 Summary
18.6 Key Words
18.7 Answers to Self Check Exercises
18.8 References and Further Reading

18.0 OBJECTIVES
After going through this Unit, you will be able to appreciate and understand:

• the need for data reduction in bibliometrics;

• the basic tools of data reduction, namely, Cluster Analysis and Factor
Analysis and their applications using standard statistical packages, such
as, SPSS.

18.1 INTRODUCTION
Bibliometric data are often large and complex. For instance, if one wishes to
examine the contributions of research institutions in a country, there would be
hundreds of institutions and several research fields. Such data sets are called
Multidimensional Data. Then, the question is how can we comprehend
multidimensional data?

18.1.1 Matrix Representation of Data

Typically, a dataset is represented in the form of a matrix, wherein, usually,
rows represent the cases (i.e. institutions) and columns represent the research
fields. Suppose a dataset comprises m institutions and n research fields; it can
be represented as an m x n matrix {Pij}, where Pij indicates the number of
publications by institution i in field j. There are m rows and n columns in this
matrix.

Institution   Field-1   Field-2   ...   Field-n
Inst-1        P11       P12       ...   P1n
Inst-2        P21       P22       ...   P2n
...           ...       ...       ...   ...
Inst-m        Pm1       Pm2       ...   Pmn

A column of this matrix is called a column vector (m-vector, as there are m
rows), whereas a row of this matrix is called a row vector (n-vector, as there
are n columns). A column vector indicates the distribution of publications in
a field among the set of institutions. Similarly, a row vector indicates the
distribution of publications of an institution in different fields. The dimension
of the matrix is equal to m x n.
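As a small illustration (a sketch only, with made-up counts), such a publication matrix can be held as a two-dimensional array in Python; row sums then give institution totals and column sums give field totals.

import numpy as np

# Hypothetical 4 x 3 publication matrix: 4 institutions, 3 research fields.
P = np.array([
    [12,  3,  0],   # Inst-1
    [ 5,  9,  2],   # Inst-2
    [ 0,  1, 14],   # Inst-3
    [ 7,  6,  5],   # Inst-4
])

print(P.shape)          # (4, 3) -- the dimension m x n
print(P.sum(axis=1))    # row vectors summed: publications per institution
print(P.sum(axis=0))    # column vectors summed: publications per field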

18.1.2 Analysis of Multidimensional Data


How do we analyze or find information in multidimensional data? Some of
the common techniques are mentioned below.

• Compute summary statistics - Reduce the data to summarizing values such as the mean and variance of columns or of the entire matrix.

• Multivariate regression analysis - Find the line, plane or hyperplane which summarizes the data with minimum error.

• Cluster analysis - Group the data into subsets (called clusters) based on a certain optimization criterion. Generally speaking, cluster analysis reduces the number of cases (rows): instead of a large number of cases we can represent the data by the centroids of clusters.

• Reduce the dimensionality of the data - Here, the goal is to reduce an m x n matrix to an m x p matrix (p < n), which can be done by projecting the data on to a space of lower dimensionality. Commonly used projection techniques are "Principal Components Analysis" and its twin "Factor Analysis".

1) Principal Components Analysis - Find a small number of derived dimensions (linear combinations of the columns) which explain the majority of the variance in all columns of the matrix.

2) Factor Analysis - Find a small number of underlying factors which explain the majority of the correlation among the columns of the matrix.

In this Unit, we discuss the following techniques of data reduction:

• Cluster analysis,
• Principal components analysis and factor analysis.

18.2 CLUSTER ANALYSIS
Cluster analysis is essentially concerned with the following general problem:
Given a set of objects (institutions, authors, libraries, etc.) find subsets, called
clusters, which are both homogeneous and well separated. The aim of cluster
analysis is to classify objects into groups that are internally cohesive and
externally isolated. Thus clustering is a bi-criterion problem:

1) Objects within the same cluster should be homogeneous (as far as possible).

2) Objects in one cluster should differ (as much as possible) from those in
other clusters.

The classification has the effect of reducing the size of a data set by reducing
the number of rows (cases).

Two kinds of clustering algorithms (procedures) are reported in the literature,


namely partitioning and hierarchical methods. Partitioning methods are also
called non- hierarchical methods.

18.2.1 Hierarchical Cluster Analysis


Hierarchical methods arrange the clusters into a hierarchy so that the
relationships between the different groups are apparent. This technique yields
classification that has an increasing number of nested classes, which can be
represented by an inverted tree (called "dendrogram"). Hierarchical clustering
proceeds successively by either merging smaller clusters into larger ones, or
by splitting larger clusters. The end result of the algorithm is a tree of clusters
called a dendrogram, which shows how the clusters are related. By cutting the
dendrogram at a desired level, a clustering of the data items into disjoint
groups is obtained.

Fig. 1: Hierarchical Cluster

Hierarchical clustering methods can be classified into two categories: "top


down" or "bottom up". Top down methods use logical division (Divisive
clustering), whereas bottom up methods use agglomeration (Agglomerative
clustering).

• Hierarchical Agglomeration: Initially each object is considered as a cluster, and then, iteratively, two clusters are chosen according to some criterion and merged into a new cluster. The procedure is continued till all the objects belong to one big cluster. The number of iterations is equal to the number of objects minus 1.

• Hierarchical Divisive Clustering: Initially all objects belong to the same


cluster; then, iteratively, a cluster of the current partition is chosen according
to a selection criterion and bipartitioned according to a local criterion (i.e.,
a criterion based only on information concerning the objects of the chosen
clusters). The procedure is continued till all clusters comprise single objects.
However, divisive clustering is computationally more demanding.

18.2.2 Non-hierarchical Cluster Analysis


A non-hierarchical or partitioning method classifies objects into a specified
number of groups (say k), which together satisfy the following criteria:

• Each group must contain at least one object;

• Each object must belong to only one group (However, in fuzzy clustering,
this criterion is relaxed).

These conditions imply that k < n, where n is the total number of objects.

18.2.3 Key Steps in Cluster Analysis


Cluster analysis essentially involves two key steps:

1) Measurement of distances between objects, which indicate similarity (or


dissimilarity) between objects.

2) Classification of objects into groups based upon the resultant distances.

18.2.4 Measurement of Distances


We need to measure the similarity of objects or cases so that we can determine
as to which cluster each object should be associated with. This section discusses
a number of common metrics employed to measure this similarity.

Distances can be measured in a variety of ways, depending upon the


measurement level of variables. There are distances that are Euclidean (can be
measured with a 'ruler') and there are other distances based on similarity.
Three general classes of distance measures can be recognized:

• Euclidean metrics
• Non-Euclidean metrics
• Semi-metrics

Euclidean Metrics

Euclidean metrics measure true straight line distances in the Euclidean space.
In the case of univariate data, the Euclidean distance between two values A and B is the arithmetic difference, i.e. |A - B|. In the case of bivariate data, the distance is the hypotenuse of the right triangle formed from the points. For three variables the hypotenuse extends through a three-dimensional space. An extension of the Pythagoras theorem gives the distance between two points in an n-dimensional space. The distance between two points X, Y in n-dimensional space can be computed as follows:

d = [(X1 - Y1)^2 + (X2 - Y2)^2 + (X3 - Y3)^2 + ... + (Xn - Yn)^2]^(1/2)

We can use many different measures all of which define different kinds of
metric space; that is a space where distance has meaning. For a space to be
metric, it must satisfy the following conditions:
Let d(A,B) denote the distance between two objects, A and B.

1) d(A,B) is non-negative. It must be 0 or positive.
2) d(A,B) = d(B,A). The distance from A to B is the same as that from B to A.
3) d(A,A) = d(B,B) = 0: An object is identical to itself.
4) d(A,C) ≤ d(A,B) + d(B,C): When considering three objects, the distance between any two of them cannot exceed the sum of the distances between the other two pairs. This is the so-called 'triangle rule'.

Non-Euclidean Metrics

These distances are not straight line distances, but they obey the above
mentioned rules.

Semi-metrics

These distance measures obey the first three rules, but may not obey the
triangle rule. An example of semi-metric measure is the Cosine measure.

Computation Formulae for distance measures


Euclidean Distances for Interval Variables

Given an n x p data matrix X, we compute a distance matrix D. For row distances, the d(i,j) element of the distance matrix is the distance between row i and row j, which results in an n x n matrix. For column distances, the d(i,j) element of the distance matrix is the distance between column i and column j, which results in a p x p matrix.

a) Squared Euclidean Distance


The squared Euclidean row distance is defined as

d(i,j) = Σ (X(ik) - X(jk))^2

where the summation is relative to k over columns 1 to p.

The squared Euclidean column distance is defined as

d(i,j) = Σ (X(ki) - X(kj))^2

where the summation is relative to k over rows 1 to n.
The squared Euclidean distance is simply the sum of the squared differences
between corresponding elements of the rows (or columns). This is probably
the most commonly used distance metric.

b) Euclidean Distance
This distance is the square root of the squared Euclidean distance discussed above.
c) Minkowsky Distance
A useful general form of distance metric is the Minkowsky distance.
Minkowsky row distance is defined as
d(i,j) = [Σ (ABS(X(ik) - X(jk)))^p]^(1/p)

The summation is from k = 1 to the number of columns. The column distance is computed similarly, but the summation is over the number of rows rather than the number of columns. ABS refers to the absolute value, such that ABS(-2) = 2.

The Minkowsky distance is the pth root of the sum of the absolute differences to the pth power between corresponding elements of the rows (or columns). The Euclidean distance is the special case of the Minkowsky distance with p = 2.
d) Block Distance
The block row distance is defined as
d(i,j) = Σ ABS(X(ik) - X(jk))
The sum is from k = 1 to the number of columns. The column distance is
computed in a similar manner, but the summation is over the number of
rows rather than the number of columns.
The block distance is the sum of the absolute differences between
corresponding elements of the rows (or columns). The block distance is
also known as the city block or Manhattan distance. This distance measure
is a special case of the Minkowsky distance with p = 1.
e) Chebychev Distance
The Chebychev row distance is defined as
d(i,j) = MAX ABS(X(ik) - X(jk))
The maximum is taken over k = 1 to the number of columns. The column distance is computed similarly, but the maximum is taken over the number of rows rather than the number of columns.
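A minimal sketch of these distance computations (using SciPy's distance helpers on a pair of hypothetical row vectors) is given below; the Euclidean, block and Chebychev distances are all special or limiting cases of the Minkowsky family.

import numpy as np
from scipy.spatial import distance

x = np.array([12.0, 3.0, 0.0, 7.0])     # hypothetical row i of the data matrix
y = np.array([ 5.0, 9.0, 2.0, 6.0])     # hypothetical row j

print(distance.sqeuclidean(x, y))       # squared Euclidean distance
print(distance.euclidean(x, y))         # Euclidean distance (Minkowsky, p = 2)
print(distance.minkowski(x, y, p=3))    # general Minkowsky distance, here p = 3
print(distance.cityblock(x, y))         # block / Manhattan distance (p = 1)
print(distance.chebyshev(x, y))         # Chebychev distance (the limit as p grows large)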
Non-Euclidean distances for interval variables
a) Cosine
Cosine of vectors of variables. This is a pattern similarity measure. For mean-centred variables, the cosine of the angle between two vectors coincides with their correlation coefficient.

Similarity(X, Y) = Σ(XY) / sqrt(ΣX^2 ΣY^2)
b) Correlation
The Pearson correlation between the two vectors of values is another pattern similarity measure:

Similarity(X, Y) = Σ(Xi - Mx)(Yi - My) / [(n - 1) Sx Sy]

where
Mx is the mean value of X,
My is the mean value of Y,
Sx is the standard deviation of X, and
Sy is the standard deviation of Y.
Count Data

a) Chi-Square Measure: This measure is based on the chi-square test of


equality for two sets of frequencies.

b) Phi-Square Measure: This measure is equal to the chi-square measure


normalized by the square root of the combined frequency.
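The cosine and correlation similarity measures described above can be computed in the same way; the sketch below uses SciPy on two hypothetical count vectors (note that SciPy returns cosine and correlation distances, i.e. one minus the similarity).

import numpy as np
from scipy.spatial import distance

x = np.array([96.0, 71.0, 129.0, 85.0])    # hypothetical publication counts, object A
y = np.array([107.0, 56.0, 115.0, 82.0])   # hypothetical publication counts, object B

cosine_sim = 1.0 - distance.cosine(x, y)            # Σ(XY) / sqrt(ΣX² ΣY²)
correlation_sim = 1.0 - distance.correlation(x, y)  # Pearson correlation of the two profiles
print(round(cosine_sim, 3), round(correlation_sim, 3))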

18.2.5 Clustering Algorithms


Hierarchical Agglomeration involves the following steps:

1) The proximity matrix is scanned to find the pair of cases with the highest
similarity (or lowest distance). These will be the most similar cases and
should be clustered most closely together.

2) The cluster formed by these two cases can now be considered a single object. The proximity matrix is recalculated so that all the other cases are compared with this new group, rather than the original two cases.

3) The matrix is then scanned (as in step 2) to find the pair of cases or clusters that now have the highest similarity.

4) Steps 2 and 3 are repeated until all the objects have been combined into a single group.

The result is a dendrogram that shows the most similar cases linked most
closely together. The level of the vertical lines joining two cases or clusters
indicates the level of similarity between them. It is important to note that the
branching hierarchy and the level of similarity are the only important features
of the dendrogram. The exact sequence of the cases along the vertical axis is
not significant.
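The sketch below (SciPy, on a small hypothetical data matrix) carries out this agglomerative procedure and draws the resulting dendrogram; the linkage method and distance metric are choices left to the analyst.

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

# Hypothetical 6 x 3 data matrix: 6 objects measured on 3 variables.
X = np.array([
    [1.0, 2.0, 1.5],
    [1.1, 1.9, 1.6],
    [5.0, 5.2, 4.8],
    [5.1, 5.0, 5.0],
    [9.0, 0.5, 0.3],
    [8.8, 0.7, 0.2],
])

Z = linkage(X, method="average", metric="euclidean")   # agglomeration schedule
dendrogram(Z, labels=["A", "B", "C", "D", "E", "F"])   # inverted-tree representation
plt.show()

print(fcluster(Z, t=3, criterion="maxclust"))          # cut the tree into 3 disjoint groups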

There are five types of agglomerative clustering methods commonly in use.


These all follow the basic algorithm outlined above, varying only in the manner
in which the similarity between clusters is calculated.
1) Single linkage clustering
2) Complete linkage clustering
3) Average linkage clustering
4) Within group clustering

5) Minimum variance clustering

Single Linkage Method

In the single linkage method the distance between any two clusters is the
distance between the two closest elements of the two clusters. This is also
commonly referred to as the "nearest neighbor" technique.

This method tends to yield clusters in long chains, where the initial element of the cluster may be very different from the final element of the cluster. This method, though computationally very efficient, is ill regarded because of this tendency.

Complete Linkage Method

In this method, the distance between any two clusters is the distance between
the two farthest elements of the two clusters. This method is also called
"farthest neighbor method". Similar to the single linkage in efficiency, the
complete linkage method does not suffer from chaining. It is actually biased
against any sort of chaining.

The figure below shows how the distances for these two methods are calculated
in a two dimensional example.

1. Single linkage
2. Complete linkage
Fig.2: Calculation of distances by single linkage and complete linkage methods

These two methods are quite simple, but they can distort the results, since the
distances between groups are calculated based on unusual outlying points
rather than the properties of the whole cluster. Nearest neighbor clustering is
also susceptible to a phenomenon called "chaining" in which there is a tendency
to repeatedly add new individuals onto a single cluster rather than making
several separate clusters. This gives the dendrogram a staircase-like appearance.

Average Linkage Clustering

Average linkage techniques provide a more balanced approach to clustering.


With these techniques, the distances between clusters are represented as some
sort of an average distance. There are two basic approaches:

Group Average Method - The distance between two clusters is the average
of the distances between every pair of objects in opposing clusters. This
method is also called paired group method. This method is non-monotonic,
which essentially means that the elements of two clusters may be more distant
than the clusters themselves.

Centroid Method - In this method, the centroid (the center of gravity of the points) of each group is calculated and the distance between the groups is the distance between their centroids. The centroid itself can be described as the average point of the cluster. It is calculated by taking the mean value of the coordinates on each axis for all the points in the cluster. The centroid method is non-monotonic, i.e. elements of two clusters may be more distant than the clusters themselves.

Both methods have two variants: weighted and unweighted. The unweighted
methods give equal weight to each point in each cluster. The weighted methods
on the other hand give unequal weights to each cluster; if one cluster has
fewer points than another, then points in the smaller cluster are given higher
weightage to make the two groups equal. In general, the unweighted versions
are used unless the data are expected to have some clusters that are much
smaller than others.

Median Method - This method is a variation of the centroid method, except that equal weight is given to both clusters in determining the new centroid. This tends to increase sensitivity to outliers and makes chaining more likely. Like the centroid method, the median method is non-monotonic.

Ward's Clustering Method - Like the centroid method, Ward's method


represents clusters by their central point. Instead of using a Euclidean distance
measure, Ward's method measures the distance as the increase in the total sum
of squares that would result from clustering together two objects. The clusters
to be joined in the next round of clustering are chosen by determining which
two would give the smallest increase in within-cluster variation. In this way,
the clusters will tend to be as distinct as possible, since the criterion for
clustering is to have the least amount of variation.

Table 1: Compatibility of distance metrics with clustering algorithms

Algorithm    Euclidean metric    Non-Euclidean metric    Semi-metric
Single       Yes                 Yes                     Yes
Complete     Yes                 Yes                     Yes
Average      Yes                 Yes                     Yes
Median       Yes                 Yes                     Yes
Centroid     Yes                 ?                       ?
Ward's       Yes                 No                      No

Optimum Number of Clusters

One of the biggest problems with hierarchical cluster analysis is how to identify
the optimum number of clusters. As the fusion process continues, increasingly
dissimilar clusters are fused, i.e. the classification becomes increasingly
artificial. Deciding upon the optimum number of clusters is largely subjective,
although looking at a graph of the level of similarity at fusion versus number
of clusters may help. There will be sudden jumps in the level of similarity as
dissimilar groups are fused.

Partitioning Methods

Partitioning methods are based on specifying an initial number of groups, and iteratively reallocating observations between groups until some equilibrium is attained.
K-Means Clustering

One of the most well-known partitioning methods is k-means. In the k-means algorithm the observations are classified as belonging to one of k groups. Group membership is determined by calculating the multidimensional centroid for each group and assigning each observation to the group with the closest centroid. The k-means algorithm alternates between calculating the centroids based on the current group memberships, and reassigning observations to groups based on the new centroids. Centroids are calculated using least-squares, and observations are assigned to the closest centroid based on least-squares.

Note that the choice of measurement units strongly affects the resulting clustering. The variable with the largest dispersion will have the largest impact on the clustering. Standardizing the clustering variables consists of subtracting the variable mean and dividing by the variable standard deviation. The data should be standardized if the clustering variables are in different units (for example, age in years, height in centimetres, etc.) to avoid giving greater weight to variables with greater variance. In the case of hierarchical cluster analysis, standardizing the clustering variables tends to reduce the effect of outliers on the cluster solution.
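A brief sketch of this procedure with scikit-learn (hypothetical data, and a hypothetical choice of k = 3) follows; the variables are standardized first, for the reason just discussed.

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Hypothetical data: 6 cases measured on 2 variables in different units.
X = np.array([
    [25.0, 160.0],    # age in years, height in centimetres
    [30.0, 175.0],
    [22.0, 158.0],
    [55.0, 180.0],
    [60.0, 172.0],
    [58.0, 169.0],
])

X_std = StandardScaler().fit_transform(X)      # subtract the mean, divide by the std. deviation
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_std)

print(km.labels_)            # cluster membership of each case
print(km.cluster_centers_)   # final cluster centres (in standardized units)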

18.2.6 Examples

Example of Hierarchical Cluster Analysis

Research Question : Classification of 11 major countries according to their specialization patterns in chemistry.
Methodology       : Agglomerative clustering
Dataset           : Appendix-1 Specialization Patterns in Chemistry

Extracts from Computer Output (SPSS package)


Case Processing Summary(a)
Cases
Valid Missing Total
N Percent N Percent N Percent
11 100.0 0 .0 11 100.0
a Average Linkage (Between Groups)

Proximity Matrix (Cosine of Vectors of Values)

Case    1:USA  2:JPN  3:SUN  4:UKD  5:FRG  6:FRA  7:IND  8:CAN  9:ITA  10:NLD  11:POL
1:USA   -      .902   .727   .898   .884   .892   .838   .992   .806   .954    .891
2:JPN   .902   -      .733   .791   .786   .932   .843   .903   .692   .884    .842
3:SUN   .727   .733   -      .902   .799   .828   .847   .724   .915   .727    .576
4:UKD   .898   .791   .902   -      .954   .874   .931   .900   .978   .892    .739
5:FRG   .884   .786   .799   .954   -      .853   .944   .918   .904   .892    .793
6:FRA   .892   .932   .828   .874   .853   -      .841   .896   .782   .922    .752
7:IND   .838   .843   .847   .931   .944   .841   -      .866   .912   .854    .792
8:CAN   .992   .903   .724   .900   .918   .896   .866   -      .804   .961    .916
9:ITA   .806   .692   .915   .978   .904   .782   .912   .804   -      .804    .653
10:NLD  .954   .884   .727   .892   .892   .922   .854   .961   .804   -       .886
11:POL  .891   .842   .576   .739   .793   .752   .792   .916   .653   .886    -
Average Linkage (Between Groups)

Agglomeration Schedule
Stage  Cluster Combined       Coefficients  Stage Cluster First Appears  Next Stage
       Cluster 1  Cluster 2                 Cluster 1  Cluster 2
1      1          8           .992          0          0                 3
2      4          9           .978          0          0                 6
3      1          10          .958          1          0                 7
4      5          7           .944          0          0                 6
5      2          6           .932          0          0                 7
6      4          5           .925          2          4                 8
7      1          2           .900          3          5                 9
8      3          4           .866          0          6                 10
9      1          11          .857          7          0                 10
10     1          3           .804          9          8                 0

Dendrogram

***** HIERARCHICAL CLUSTER ANALYSIS *****
Dendrogram using Average Linkage (Between Groups), Rescaled Distance Cluster Combine
(The ASCII dendrogram is not reproduced legibly here; reading from top to bottom, the cases appear in the order USA, CAN, NLD, JPN, FRA, POL, UKD, ITA, FRG, IND, SUN.)
Comments:
1) SPSS module CLASSIFY (Hierarchical Cluster) was used with the following options:
Proximity Measure: COSINE (this is a similarity measure)
Clustering Method: BETWEEN GROUP AVERAGE
2) The proximity matrix shows similarities between pairs of countries. Note that the diagonal of the matrix is empty, since our primary interest is proximity between countries.
The most similar countries are USA and Canada. The proximity value is the highest (.992).
The least similar countries are Poland and Ex-USSR (SUN) (proximity measure = .576).

3) The agglomeration schedule shows how different clusters (countries) are fused together.
Stage 1 - The most similar countries, Clusters 1 (USA) and 8 (CAN), are fused together to form a new cluster.
Stage 2 - At this stage, the second most similar countries, Clusters 4 (UKD) and 9 (ITA), are fused together to form a new cluster.
Stage 3 - At this stage, Cluster 10 (NLD) is fused with the cluster comprising USA and CAN, which was formed at Stage 1. The proximity between NLD and the cluster formed at Stage 1 is .958.
Note that the proximity between NLD and USA is .954 (see the Proximity Matrix) and that between NLD and CAN is .961.
4) The dendrogram is a graphical representation of the agglomeration schedule.
5) Cutting the dendrogram is a subjective decision. If you cut the dendrogram at a dissimilarity level = 20, you get two clusters:
Cluster 1 - USA CAN NLD JPN FRA POL
Cluster 2 - UKD ITA FRG IND SUN
If you cut the dendrogram at a dissimilarity level = 15, you get four clusters:
Cluster 1 - USA CAN NLD JPN FRA
Cluster 2 - UKD ITA FRG IND
Cluster 3 - POL
Cluster 4 - SUN
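A hedged sketch of the same analysis in Python is shown below; it assumes the Appendix-1 counts have been loaded into an 11 x 9 matrix (the file name is hypothetical), uses cosine distance with between-group average linkage as in the SPSS run, and extracts the two- and four-cluster solutions.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

countries = ["USA", "JPN", "SUN", "UKD", "FRG", "FRA", "IND", "CAN", "ITA", "NLD", "POL"]
X = np.loadtxt("appendix1_chemistry.txt")    # hypothetical file holding the 11 x 9 counts

D = pdist(X, metric="cosine")                # 1 - cosine similarity between pairs of countries
Z = linkage(D, method="average")             # between-group average linkage

two_clusters = fcluster(Z, t=2, criterion="maxclust")
four_clusters = fcluster(Z, t=4, criterion="maxclust")
for name, g2, g4 in zip(countries, two_clusters, four_clusters):
    print(name, g2, g4)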
Example of Cluster Analysis by K-Means Algorithm

Research Question : Classify 35 major countries into five clusters according to the similarity of their collaboration profiles with India in different fields of science.
Methodology       : Cluster analysis using the k-means algorithm (Module: Quick Cluster of the SPSS package)
Dataset           : Appendix-2 India's cooperation links with 35 significant partner countries in 11 scientific fields

Extract from Computer Printout
INITIAL CLUSTER CENTERS

        Cluster
        1       2       3       4       5
MAT     .00     .00     .00     .00     .00
PHY     .00     37.84   46.43   14.81   100.00
CHM     .00     35.14   3.57    11.11   .00
BIO     22.22   .00     3.57    .00     .00
EAS     5.56    5.41    3.57    .00     .00
AGR     44.44   2.70    .00     11.11   .00
CLI     8.33    5.41    10.71   33.33   .00
BIM     19.44   5.41    17.86   11.11   .00
ENT     .00     .00     10.71   3.70    .00
MTL     .0000   .0000   3.5714  3.7037  .0000
COM     .0000   8.1081  .0000   .0000   .0000

Cluster Membership

Case Number Country Cluster Distance


1 USA 2 7.728
2 UKD 2 12.213
3 FRG 3 6.284
4 CAN 3 14.178
5 FRA 3 8.424
6 JPN 2 6.687
7 ITA 5 8.566
8 SUN 5 15.162
9 AUS 2 11.542
10 CHE 3 15.001
11 NLD 3 7.159
12 SWE 4 16.716
13 ESP 5 11.335
14 PRC 5 6.534
15 BEL 3 12.365
16 HUN 3 13.991
17 BRA 3 15.368
18 BND 4 15.462
19 DNK 2 18.418
20 AUT 2 17.407
21 BGR 5 12.385
22 POL 2 9.594
23 KOR 2 23.266
24 CSK 3 26.103
25 MEX 1 .000
26 PHL 5 12.485
27 FIN 2 17.688
28 ISR 4 13.488
29 EGY 2 14.028
30 GRC 3 10.831
31 TWN 4 15.234
32 NOR 4 18.273
33 NIG 5 9.724
34 CHL 5 25.155
35 ROM 5 28.095
Final Cluster Centers

Cluster
1 2 3 4 5
MAT .00 2.27 3.38 2.00 .96
PHY .00 31.43 48.69 18.69 73.60
CHM .00 17.89 6.64 7.76 5.02
BIO 22.22 6.50 4.53 4.04 1.34

EAS 5.56 3.84 5.03 7.26 2.53


AGR 44.44 2.87 2.26 3.95 .16
CLI 8.33 13.15 8.30 24.14 4.50

BIM 19.44 10.54 10.83 12.19 4.11


ENT .00 5.67 6.15 13.18 4.40

MTL .0000 1.4465 1.1793 .9033 .4568

COM .0000 2.4058 1.1832 .0000 1.2380

Distances between Final Cluster Centers

Cluster 1 2 3 4 5
1 58.600 68.068 53.507 90.115

2 58.600 21.370 21.677 45.844


3 68.068 21.370 34.850 26.708
4 53.507 21.677 34.850 59.984
5 90.115 45.844 26.708 59.984

ANOVA
Cluster Error F Sig.
Mean Square df Mean Square df
MAT 8.193 4 6.641 30 1.234 .318
PHY 3704.185 4 83.661 30 44.276 .000
CHM 267.983 4 39.421 30 6.798 .001
BIO 110.891 4 10.408 30 10.654 .000
EAS 20.227 4 15.275 30 1.324 .284

AGR 449.033 4 8.202 30 54.747 .000


CLI 342.403 4 18.075 30 18.943 .000
BIM 106.328 4 33.910 30 3.136 .029
ENT 78.132 4 35.476 30 2.202 .093
MTL 1.504 4 2.326 30 .647 .634
COM 5.618 4 3.607 30 1.558 .211
Warning: The F tests should be used only for descriptive purposes because the
clusters have been chosen to maximize the differences among cases in different
clusters. The observed significance levels are not corrected for this and thus
cannot be interpreted as tests of the hypothesis that the cluster means are equal.
Number of Cases in Each Cluster

Cluster 1    1.000
        2    10.000
        3    10.000
        4    5.000
        5    9.000
Valid        35.000
Missing      1.000

Comments:

• The collaboration profile of a country is determined as follows. Let N1, N2, N3, ..., N11 denote the number of India's cooperation links with a partner country in the eleven fields (Mathematics through Computers), and let N = N1 + N2 + ... + N11. Its collaboration profile is then

  Collaboration Profile = (100 x N1/N, 100 x N2/N, ..., 100 x N11/N)
• Initial Cluster Centres - First estimate of the variable means for each of
the clusters. For example, Cluster 1 has relatively higher values for bio-
sciences - BIO, AGR, CLI, and BIM.
By default, a number of well-spaced cases equal to the number of clusters
are selected from the data. Initial cluster centres are used for a first round
of classification and are then updated.
• Final Cluster Centres - Final estimates of variable means after maximum
number of iterations set by the researcher. We had set the default value of
5 iterations.
• Distance between Clusters - These are the Euclidean distances between
final cluster centres. Note that the distance between Clusters 2 and 3 is
minimum, which means that these two clusters have greater similarity than
other clusters. Similarly, the distance between Clusters 1 and 5 is maximum.
Cluster 1 is characterized by Biosciences, whereas Cluster 5 is characterized
by Physics.
• Cluster Membership -This table indicates the assignments of countries to
different clusters:
Cluster 1 - MEX
Cluster 2 - USA UKD JPN AUS DNK AUT POL KOR FIN EGY
Cluster 3 - FRG CAN FRA CHE NLD BEL HUN BRA CSK GRC
Cluster 4 - SWE BND ISR TWN NOR
Cluster 5 - ITA SUN ESP PRC BGR PHL NIG CHL ROM
• Number of Cases in Each Cluster - Cluster 1 is the smallest cluster
comprising only one country. Cluster 2 is the largest cluster comprising
ten countries.
• ANOVA - Analysis of variance is usually carried out to test the statistical significance of observed differences among different groups. However, in this case, the fundamental assumptions of hypothesis testing are violated, so the statistical significance of inter-cluster differences cannot be tested.

Self Check Exercises


1) What are the general classes for the measurement of distances?
2) What are the types of agglomerative clustering methods in use?
Note: i) Write your answers in the space given below.
ii) Check your answers with the answers given at end of this Unit.


18.3 FACTOR ANALYSIS


18.3.1 Introduction
Factor analysis is a generic term for a family of statistical techniques concerned
with the reduction of a set of observable variables in terms of a small number
of latent factors. The main objectives of factor analytic techniques described
in this module are:

• To reduce the number of variables.


• To detect structure in the relationships between variables, i.e., to classify
variables.
Therefore, factor analysis is applied as a data reduction or structure detection
method.
Consider a data set, an n x p matrix

X = {Xij}

in which the columns (j) represent the variables and rows (i) represent measurements of specific objects or individuals on the variables. Such a data set, particularly when p is large, is unwieldy and difficult to comprehend. It is usually desirable to obtain a smaller set of variables that can be used to approximate the original data matrix X. The new variables, called principal components or factors, are designed to carry most of the information in the columns of X. The greater the correlation between the columns of X, the fewer the number of new variables required.

Principal components analysis and its cousin factor analysis operate by replacing the original data matrix X by an estimate composed of the product of two matrices. The left matrix in the product contains a small number of columns corresponding to the factors or components, whereas the right matrix of the product provides the information that relates the components to the original variables. A scatter plot based on the left matrix is useful for relating the n objects of X with respect to the new factors. A plot based on the rows of the right matrix can be used to relate the components to the original variables. The decomposition of X into a product of two matrices is a special case of a matrix approximation procedure called singular value decomposition.

There are two types of factor analysis:

• Principal Components Analysis - This method provides a unique


solution, so that the original data can be reconstructed from the results. It
looks at the total variance among the variables, so the solution generated
will include as many factors as there are variables, although it is unlikely
that they will all meet the criteria for retention. There is only one method
for completing a principal component analysis; this is not true of any of
the other multidimensional methods described here.

• Common Factor Analysis - This is what people generally mean when they say "factor analysis". The term "common" in common factor analysis describes the variance that is analyzed. It is assumed that the variance of a single variable is composed of three components: common, specific and error.

Total variance of a variable = common variance + specific variance + error variance

Reliable variance = common variance + specific variance
Unique variance   = specific variance + error variance

The common variance of a variable is denoted h² and is called its communality.

Factor analysis analyzes only the common variance of the observed variables;
principal components analysis (PCA) considers the total variance and makes
no distinction between common and unique variance (specific plus error).
Thus, the difference between these two techniques is related to matrix
computations. PCA assumes that all variance is common, with all unique
factors set equal to zero; while factor analysis assumes that there is some
unique variance. The level of unique variance is dictated by the factor analysis
model.

18.3.2 Principal Components Analysis
Principal components analysis can be defined as follows.

Consider a data matrix

X = {Xij}

in which the columns represent the p variables and rows represent measurements of n objects or individuals on those variables. The data can be represented by a cloud of n points in a p-dimensional space, each axis corresponding to a measured variable. We can then look for a line OY1 in this space such that the dispersion of the n points when projected onto this line is a maximum. This operation defines a derived variable of the form

Y1 = a11 X1 + a12 X2 + ... + a1p Xp

with coefficients satisfying the condition

a11^2 + a12^2 + ... + a1p^2 = 1

After obtaining OY1, consider the (p-1)-dimensional subspace orthogonal to OY1 and look for the line OY2 in this subspace such that the dispersion of points when projected onto this line is a maximum. This is equivalent to seeking a line OY2 perpendicular to OY1 such that the dispersion of points when they are projected onto this line is the maximum. Having obtained OY2, consider a line in the (p-2)-dimensional subspace, which is orthogonal to both OY1 and OY2, such that the dispersion of points when projected onto this line is as large as possible. The process can be continued until p mutually orthogonal lines are determined. Each of these lines defines a derived variable

Yi = ai1 X1 + ai2 X2 + ... + aip Xp

where the constants are determined by the requirement that the variance of Yi is a maximum, subject to the constraint of orthogonality as well as

ai1^2 + ai2^2 + ... + aip^2 = 1

for each i.

The Yi thus obtained are called Principal Components of the system and the process of obtaining them is called Principal Components Analysis.

The p-dimensional geometric model defined above can be considered as the true picture of the data. If we wish to obtain the best q-dimensional representation of the p-dimensional true picture, then we simply have to project the points onto the q-dimensional subspace defined by the first q principal components Y1, Y2, ..., Yq.

The variance of a linear composite

Y = a1 X1 + a2 X2 + ... + ap Xp

is given by

Var(Y) = Σi Σj ai aj sij

where sij is the covariance between variables i and j. The variance of a linear composite can also be expressed in the notation of matrix algebra as

a^T S a

where a is the vector of the variable weights, S is the covariance matrix, and a^T is the transpose of a.
Principal components analysis finds the weight vector a that maximizes

a^T S a

subject to the constraint that

Σ ai^2 = a^T a = 1

It is essential to constrain the size of a, otherwise the variance of the linear composite can become arbitrarily large by selecting large weights.

It is important to note that the principal components decomposition is not scale invariant. We would get different decompositions, depending upon whether the principal components are calculated from the unscaled cross-products matrix (SSCP) or the covariance matrix. The magnitudes of the diagonal elements of a cross-products matrix or a covariance matrix influence the nature of the principal components. Hence standardized variables are commonly used. The X^T X matrix based on standardized variables is proportional to a correlation matrix. The covariance matrix can be viewed as a partial step between the SSCP and the correlation matrix. Since the covariance matrix is based on the deviations of the variables from their respective means, it corrects the differences in the magnitudes of the elements of SSCP for the overall level, but it does not correct for the differences in the variances among the variables.

If we have a set of n observations (objects/cases) on p variables, then we can find the largest principal component (of a cross-products matrix, covariance matrix or correlation matrix) as the weight vector

a1 = (a11, a12, ..., a1p)

which maximizes the variance of

Σ a1i Xi   (summation over i = 1 to p)

subject to the constraint

Σ a1i^2 = 1

We can then define the second largest principal component as the weight vector

a2 = (a21, a22, ..., a2p)

which maximizes the variance of

Σ a2i Xi

subject to the constraints:

• Σ a2i^2 = 1

• Principal component 2 is linearly independent of principal component 1, i.e.

  Σ a1i a2i = 0

We can define the third largest principal component as the weight vector

a3 = (a31, a32, ..., a3p)

which maximizes the variance of

Σ a3i Xi

subject to the constraints:

Σ a3i^2 = 1

The third principal component is orthogonal to the first two principal components. These two orthogonality conditions are:

Σ a3i a1i = 0
Σ a3i a2i = 0

This process can be continued till the last (i.e., the p-th) principal component is derived.

The sum of the variances of the principal components is equal to the sum of the variances of the original variables:

Σ λi = Σ Var(Xi)

where λi is the variance of the i-th principal component. If the variables are standardized then

Σ λi = p   (summation over i = 1 to p)

In matrix notation, the above definition of principal components leads to the following equation:

R A = A Λ
where A is the matrix of eigenvectors as column vectors and Λ is a diagonal matrix of the corresponding latent roots (or eigenvalues) of the correlation matrix R, rank-ordered from the largest to the smallest. The elements of Λ have to be in the same order as their associated latent vectors (or eigenvectors). The largest latent root (λ1) of R is the variance of the first or largest principal component of R, and its associated vector

a1 = (a11, a12, ..., a1p)

is the set of weights for the first principal component, which maximizes the variance of

Σ a1i Xi

Similarly for the second principal component, and so on.


The last latent root (Ap) is the variance of the last or the smallest principal
component.

The i th latent root and its associated weight vector satisfy the matrix equation:
Raj = \al
Pre-multiplying the above equation by aT leads to
TR - T"\ - "\
aj aj - a, /\,jal - /\,j
since

The variance of the first principal component = AI' Similarly for the second
principal component, and so on. The last latent root (= AI') is the variance of
the last or the smallest principal component. Thus:
Ra, = Ajal
Ra, = A2a2

Rap = Apap

In matrix notation,
RA=AA

Where A is the matrix of eigenvectors, as column vectors, and A is a matrix


of corresponding latent roots ordered from the largest to the smallest.
Since RA = A A
Pre-rnultiplying by AT leads to
ATRA = AI' A
lOO =A
because ATA = I Cluster Analysis and
Factor Analysis
This means that we can decompose R into a product of three matrices, involving
eigenvectors and eigenroots. In other words, the variation in R is expressed in
terms of the weighting vectors (eigenvectors) of the principal components and
variances (eigenvalues) of the principal components. This is called the singular
value decomposition of the correlation matrix R. This is the key concept
underlying "Principal Components Analysis."

18.3.3 Factor Analysis


Factor analysis is used to uncover the latent structure (dimensions) of a set of
variables. It reduces attribute space from a larger number of variables to a
smaller number of factors.

In many scientific fields, particularly behavioral and social sciences, variables


such as 'intelligence' or 'leadership quality' cannot be measured directly.
Such variables, called latent variables, can be measured by other 'quantifiable'
variables, which reflect the underlying variables of interest. Factor analysis
attempts to explain the correlations between the observations in terms of the
underlying factors, which are not directly observable.

Factor analysis closely resembles principal components analysis. Both


techniques use linear combinations of variables to explain sets of observations
on many variables. In principal components, the intrinsic interest is in the
observed variables. The combination of variables is primarily a tool for
simplifying the interpretation of the observed variables. In factor analysis, the
intrinsic interest is in the 'underlying factors', the observed variables are
relatively of little interest. Linear combinations are formed to derive the factors.

Factor Analysis Model


The factor analysis model can be expressed in matrix notation:

X = μ + Λ f + u                    (1)

where

Λ = {λij} is a p x k matrix of constants, called the matrix of factor loadings,
f = a random vector representing the k common factors,
u = a random vector representing the p unique factors associated with the original variables.

The common factors F1, F2, ..., Fk are common to all X variables, and are assumed to have mean = 0 and variance = 1. The unique factors are unique to Xi. The unique factors are also assumed to have mean = 0 and to be uncorrelated with the common factors.

Equivalently, the covariance matrix S can be decomposed into a factor covariance matrix and an error covariance matrix:

S = Λ Λ^T + Ψ                      (2)

where

Ψ = Var{u}
Λ^T is the transpose of Λ
The diagonal of the factor covariance matrix is called the vector of communalities h², where

hi^2 = λi1^2 + λi2^2 + ... + λik^2
The correlations of variables with principal components are called loadings.


They provide a convenient summary of the influence of the original variables
on the principal components, and thus a useful basis for interpretation. A large
coefficient (in absolute value) corresponds to a high loading, while a coefficient
near zero has a low loading. Factor loadings are also called factor pattern
coefficients or factor structure coefficients.

Analogous to Pearson's r, the squared factor loading is the percentage of


variance in the variable, explained by a factor.

Table 2: Factor Structure Matrix

Variables   Factor 1   Factor 2   Factor 3   ...   Factor p
V1          λ11        λ12        λ13              λ1p
V2          λ21        λ22        λ23              λ2p
V3          λ31        λ32        λ33              λ3p
V4          λ41        λ42        λ43              λ4p
V5          λ51        λ52        λ53              λ5p
V6          λ61        λ62        λ63              λ6p
V7          λ71        λ72        λ73              λ7p
...
Vn          λn1        λn2        λn3              λnp

Communality: The sum of the squared factor loadings for all factors for a
given variable is the variance in that variable accounted for by all the factors,
and this is called the communality. In other words, it is the proportion of a
variable's variance explained by a factor structure. In complete principal
components analysis, with no factors dropped, communality is equal to 1.0, or
100% of the variance of the given variable.

Communality of V1 = λ11^2 + λ12^2 + λ13^2 + ... + λ1p^2

The variance explained by a factor is equal to the sum of squared loadings of


all variables on that factor. It is also equal to its eigenvalue.

Eigenvalues: The eigenvalue for a given factor reflects the variance in all the
variables, which is accounted for by that factor. A factor's eigenvalue may be
computed as the sum of its squared factor loadings for all the variables. If a
factor has a low eigenvalue, then it is contributing little to the explanation of
variances in the variables and may be ignored. The sum of the eigenvalues of all the factors is equal to the total variance in the data (for standardized variables, this equals the number of variables). The importance of a factor is equal to the proportion of total variance explained by it.

The factor analysis model does not extract all the variance; it extracts only that
proportion of variance, which is due to the common factors and shared by
several items. The proportion of variance of a particular item that is due to
common factors (shared with other items) is called communality. The proportion
of variance that is unique to each item is then the respective item's total
variance minus the communality.
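A minimal sketch of extracting common factors in Python is shown below; it uses scikit-learn's FactorAnalysis on hypothetical standardized data (other packages, such as factor_analyzer, offer similar facilities). The loadings, communalities and unique variances correspond to Λ, h² and Ψ in equations (1) and (2).

import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 6))                  # hypothetical 200 cases x 6 variables
X = (X - X.mean(axis=0)) / X.std(axis=0)       # standardize the variables

fa = FactorAnalysis(n_components=2).fit(X)     # extract k = 2 common factors
loadings = fa.components_.T                    # p x k matrix of factor loadings (Lambda)

communalities = (loadings ** 2).sum(axis=1)    # h² = sum of squared loadings per variable
unique_variances = fa.noise_variance_          # Psi: specific plus error variance per variable

print(np.round(loadings, 2))
print(np.round(communalities, 2))
print(np.round(unique_variances, 2))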

The solution of equation (2) is not unique (unless the number of factors = 1),
which means that the factor loadings are inherently indeterminate. Any solution
can be rotated arbitrarily to obtain a new factor structure.

Consider for example the following pattern of factor loadings.

Table 3: Un rotated factor matrix

Variables   Factor 1   Factor 2

V1          .81        .45
V2          .84        .31
V3          .80        .29
V4          .89        .37
V5          .79        .51
V6          .45        .43

This table shows the difficulty of interpreting an unrotated factor solution. All
the significant loadings are shown in boldface type. Obviously, the results are
ambiguous. Variables V1, V2, V5 and V6 have significant loadings on Factors
1 and 2. This is a common pattern. One way to obtain more interpretable
results is to rotate the factor axes.

Rotation of Factorial Axes

The goal of these rotation strategies is to obtain a clear pattern of loadings, i.e. the factors are clearly marked by high loadings for some variables and low loadings for other variables. This general pattern is called 'Simple Structure', which means:

• Each factor should have several variables with strong loadings. Remember that strong loadings can be positive or negative.

• Each variable should have a strong loading for only one factor.

• Each variable should have a large communality. This implies that its factor membership accounts for most of its variance.

Factor Rotation: Given a Cartesian coordinate system where the axes are the factors and the points are the variables, factor rotation is the process of holding the points constant and moving (rotating) the factor axes. The rotation is done in a manner so that the points are highly correlated with the axes and provide a more meaningful interpretation of the factors.

Rotation can be orthogonal or oblique. In orthogonal rotation, the axes are perpendicular to each other, whereas in oblique rotation, the axes are not perpendicular. Major strategies for orthogonal rotation are Varimax, Quartimax and Equimax, but the most common rotation strategy is the Varimax rotation.

Varimax rotation seeks to maximize the variances of the squared normalized factor loadings across variables for each factor. This is equivalent to maximizing the variances in the columns of the matrix of squared normalized factor loadings.

Table 4 shows the results of Varimax rotation of the factor solution in Table 3. The pattern of loadings is unambiguous.

Table 4: Rotated factor matrix

Variables   Factor 1   Factor 2

V1          .68        .17
V2          .87        .24
V3          .65        .07
V4          .16        .76
V5          .30        .83
V6          .19        .69

Note that the eigenvalues associated with the unrotated and rotated solutions will differ, but their total will be the same. The communalities of the variables do not change with rotation.
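The sketch below is one common implementation of the (raw) varimax criterion, applied here to the unrotated loadings of Table 3; it is given only to illustrate the mechanics of rotation, and the rotated values it produces will not match Table 4 exactly, since Tables 3 and 4 are themselves illustrative.

import numpy as np

def varimax(L, gamma=1.0, max_iter=100, tol=1e-6):
    # Rotate a p x k loading matrix L so as to maximize the varimax criterion.
    p, k = L.shape
    R = np.eye(k)
    d = 0.0
    for _ in range(max_iter):
        Lr = L @ R
        # Gradient of the criterion with respect to the rotation matrix.
        B = L.T @ (Lr ** 3 - (gamma / p) * Lr @ np.diag(np.sum(Lr ** 2, axis=0)))
        u, s, vt = np.linalg.svd(B)
        R = u @ vt
        d_new = np.sum(s)
        if d_new < d * (1 + tol):
            break
        d = d_new
    return L @ R

# Unrotated loadings from Table 3 (six variables, two factors).
L = np.array([[.81, .45], [.84, .31], [.80, .29],
              [.89, .37], [.79, .51], [.45, .43]])
print(np.round(varimax(L), 2))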

Interpretation of Factors

This involves the following: (i) How many principal components or factors to
choose and retain for interpretation, and (ii) which variables have strong loading
on which factors?

How Many Principal Components?


The purpose of principal components analysis or factor analysis is to reduce
the complexity of the multivariate data into the principal components space
and then choose the first q principal component (q < p) that explains most of
the variation in the original variables. Determining the optimal number of factors to extract is not a straightforward task, since the decision is ultimately subjective. There are several criteria for the number of factors to be extracted, but these are just empirical guidelines rather than an exact quantitative solution.
Some of the most commonly used criteria are:
• Kaiser's Criterion - Exclude those principal components with eigenvalues below the average. For principal components calculated from the correlation matrix, the average eigenvalue is 1, so this criterion excludes principal components with eigenvalues less than 1. This is the most commonly used criterion, because of its simplicity and availability in most computer packages (a short computational sketch follows this discussion).
• Scree Plot - Plot the eigenvalues in the sequence of principal factors (j = 1, 2, ..., p). The resulting plot, called the scree plot, provides a convenient visual method for separating the important components from the less important components. The plot is called a scree plot since it resembles a mountainside with a jumble of boulders at the base (scree is a geological term referring to the debris which collects on the lower part of a rocky slope). The number of factors is chosen where the plot levels off to a linear decreasing pattern.
Fig. 3: Scree Plot (eigenvalues plotted against factor number)

• Percentage of Variance Criterion - Include just enough components to explain some arbitrary amount of the variance (typically 80%).
Usually, the first approach includes too many components, whereas the second
approach includes too few components. The 80% criterion can be a good
compromise.
Another very important but often overlooked criterion for determining the
number of factors is the interpretability of the factors extracted. Factor solutions
should be evaluated not only according to empirical criteria but also according
to the criterion of theoretical meaningfulness.
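As noted in the Kaiser's Criterion item above, the sketch below (NumPy and Matplotlib, hypothetical data) computes the eigenvalues of the correlation matrix, applies the Kaiser rule and the 80% rule, and draws a scree plot for visual inspection.

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(2)
X = rng.normal(size=(150, 10))                       # hypothetical 150 cases x 10 variables

R = np.corrcoef(X, rowvar=False)
eigenvalues = np.sort(np.linalg.eigvalsh(R))[::-1]   # eigenvalues, largest first

kaiser_k = int(np.sum(eigenvalues > 1.0))            # Kaiser's criterion: eigenvalues above 1
cum_var = np.cumsum(eigenvalues) / eigenvalues.sum()
pct80_k = int(np.searchsorted(cum_var, 0.80) + 1)    # smallest q explaining at least 80%
print(kaiser_k, pct80_k)

plt.plot(range(1, len(eigenvalues) + 1), eigenvalues, marker="o")
plt.xlabel("Number of Factor")
plt.ylabel("Eigenvalue")
plt.title("Scree Plot")
plt.show()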
Which variables have strong loadings on which factors?

This requires deciding a cut-off value for factor loadings. This is purely arbitrary, but common social science practice uses a minimum cut-off of 0.3 or 0.35. Another arbitrary rule-of-thumb terms loadings as "weak" if less than 0.4, "strong" if more than 0.6, and otherwise as "moderate".

18.3.4 Examples of Factor Analysis

Research Question : Explore the intellectual base of Artificial Intelligence by examining the citing environment of a significant journal in the area.
Methodology       : Factor analysis of the journal-to-journal citation matrix
Dataset           : Appendix-3 Artificial Intelligence
Extract from Computer Printout

Comments
Data           D:\IGNOU\JOURNAL.SAV
Filter         <none>
Weight         <none>
Input          Split File <none>
               No. of Rows in Working Data File: 25
Missing Value  Definition of Missing: MISSING=EXCLUDE: User-defined missing values are treated as missing.
Handling       Cases Used: LISTWISE: Statistics are based on cases with no missing values for any variable used.
Syntax         FACTOR
               /VARIABLES var00002 var00004 var00005 var00006 var00007 var00008 var00009 var00010 var00011 var00012 var00014 var00015 var00016 var00019 var00021 var00024 var00025
               /MISSING LISTWISE
               /ANALYSIS var00002 var00004 var00005 var00006 var00007 var00008 var00009 var00010 var00011 var00012 var00014 var00015 var00016 var00019 var00021 var00024 var00025
               /PRINT UNIVARIATE INITIAL CORRELATION SIG DET KMO INV REPR AIC EXTRACTION ROTATION
               /FORMAT SORT
               /PLOT EIGEN
               /CRITERIA MINEIGEN(1) ITERATE(25)
               /EXTRACTION PC
               /CRITERIA ITERATE(25)
               /ROTATION VARIMAX
               /METHOD=CORRELATION .
Communalities
            Initial   Extraction
VAR00002    1.000     .717
VAR00004    1.000     .933
VAR00005    1.000     .895
VAR00006    1.000     .694
VAR00007    1.000     .861
VAR00008    1.000     .889
VAR00009    1.000     .795
VAR00010    1.000     .854
VAR00011    1.000     .804
VAR00012    1.000     .778
VAR00014    1.000     .571
VAR00015    1.000     .622
VAR00016    1.000     .914
VAR00019    1.000     .878
VAR00021    1.000     .908
VAR00024    1.000     .446
VAR00025    1.000     .918

Extraction Method: Principal Component Analysis.
Total Variance Explained

            Initial Eigenvalues               Extraction Sums of                Rotation Sums of
                                              Squared Loadings                  Squared Loadings
Component   Total      % of       Cumu-      Total    % of      Cumu-          Total    % of      Cumu-
                       variance   lative %            variance  lative %                variance  lative %
1           4.348      25.576     25.576     4.348    25.576    25.576         4.065    23.912    23.912
2           2.866      16.857     42.432     2.866    16.857    42.432         2.810    16.531    40.443
3           2.331      13.711     56.143     2.331    13.711    56.143         2.382    14.010    54.452
4           1.727      10.159     66.302     1.727    10.159    66.302         1.954    11.491    65.944
5           1.131      6.655      72.957     1.131    6.655     72.957         1.150    6.767     72.711
6           1.075      6.321      79.278     1.075    6.321     79.278         1.116    6.567     79.278
7           .946       5.564      84.841
8           .873       5.138      89.979
9           .689       4.056      94.035
10          .598       3.518      97.553
11          .183       1.079      98.632
12          .114       .669       99.301
13          8.345E-02  .491       99.792
14          2.267E-02  .133       99.926
15          1.260E-02  7.413E-02  100.000
16          4.281E-05  2.518E-04  100.000
17          -2.383E-16 -1.402E-15 100.000

Extraction Method: Principal Component Analysis.

Component Matrix(a)

           Component
           1           2           3           4           5           6
VAR00008   .930        .115        -7.280E-02  1.321E-02   2.704E-02   7.167E-02
VAR00005   .869        -.223       -.277       -6.603E-03  6.340E-02   9.791E-02
VAR00007   .863        -.303       7.980E-02   .128        1.368E-02   -2.394E-02
VAR00016   .858        -.240       -.335       -2.329E-03  -7.792E-02  -3.859E-02
VAR00015   .687        -.224       -.248       1.656E-02   -.147       -.132
VAR00010   .470        .729        .217        -1.558E-02  -.227       -4.516E-02
VAR00006   .471        .602        8.575E-02   -2.188E-02  -.314       -5.390E-02
VAR00009   .273        .580        .450        -.346       .229        9.836E-02
VAR00014   .103        -.492       .463        .297        5.953E-02   -.108
VAR00025   .256        .462        .761        -.209       .127        1.736E-02
VAR00004   9.640E-03   -.492       .738        .361        -9.594E-02  8.480E-02
VAR00002   .309        -.404       .608        .208        .140        -.161
VAR00019   -3.518E-02  .419        -.201       .799        .149        7.983E-03
VAR00021   -5.964E-02  .541        -.133       .766        8.957E-02   -6.709E-03
VAR00011   .157        -6.664E-02  -.167       -.120       .701        .491
VAR00012   -.172       -.185       7.725E-02   1.126E-02   -.567       .622
VAR00024   -.165       -6.593E-02  -.110       -.213       6.233E-02   -.594

Extraction Method: Principal Component Analysis.
a) 6 components extracted.
Rotated Component Matrix(a)

           Component
           1           2           3           4           5           6
VAR00016   .952        -5.212E-02  -3.081E-02  -5.765E-02  7.630E-03   -1.807E-02
VAR00005   .924        7.542E-03   5.365E-03   -4.440E-02  .193        3.070E-02
VAR00008   .854        .373        -3.715E-03  8.306E-02   .110        3.039E-02
VAR00007   .842        .108        .369        -2.863E-02  5.607E-02   8.069E-03
VAR00015   .773        -6.372E-02  1.514E-02   -5.892E-02  -.118       -5.209E-02
VAR00025   -.112       .899        .290        -8.638E-02  6.625E-02   -3.012E-02
VAR00009   -4.401E-02  .855        -4.845E-02  -8.823E-02  .221        -6.622E-02
VAR00010   .239        .797        -.186       .243        -.256       5.295E-02
VAR00006   .313        .633        -.230       .191        -.314       8.263E-02
VAR00004   -9.152E-02  -1.854E-02  .920        -3.930E-02  -8.483E-02  .263
VAR00002   .186        .114        .808        -7.642E-02  2.208E-02   -.102
VAR00014   6.839E-02   -.126       .741        -2.587E-02  -2.317E-02  -1.338E-02
VAR00021   -9.247E-02  9.120E-02   -7.938E-02  .939        -4.557E-02  7.648E-03
VAR00019   -2.785E-02  -2.979E-02  -4.690E-02  .934        2.429E-02   -5.757E-03
VAR00011   .129        -1.443E-02  -8.833E-02  -9.984E-03  .882        3.523E-02
VAR00012   -.115       -.142       5.905E-03   -.167       -.158       .832
VAR00024   -.104       -.135       -7.930E-02  -.185       -.230       -.568

Extraction Method: Principal Component Analysis.
Rotation Method: Varimax with Kaiser Normalization.
a) Rotation converged in 7 iterations.

Comments
1) Syntax - Not all the journals listed in Appendix 3 are included in the analysis,
because the rows and columns of the excluded journals are empty (they neither
cite nor are cited within this set).
2) Communalities - As discussed earlier, communality is the proportion of
a variable's variance explained by the factor structure. In a complete principal
components analysis, with no factors dropped, communality is equal to
1.0, or 100% of the variance of the given variable. For the extracted factors,
communality ranges from 0.446 to 0.933.
3) Percentage of Variance Explained - The first factor explains more than
25% of the total variance. The first six factors together explain more than
79% of the total variance. A six-factor solution has therefore been retained for
interpretation.
4) Rotated Component Matrix - This matrix shows the rotated factor loadings,
sorted by their absolute values.
5) Interpretation of Factors - Look at the rotated component matrix.
Factor 1: The following variables have high loadings on this factor.
Var00016  Lect Notes Comp Sci
Var00005  Commun ACM
Var00008  Computer
Var00007  Comput Surveys
Var00015  Journal ACM
Thus, the first factor can be labeled as Computer Science.

Factor 2: The following variables have high loadings on this factor.
Var00025  Tai Tech Inf Sci
Var00009  IEEE Tr Pattern Analysis
Var00010  IEEE Systems Man & Cybernetics
Var00006  Comp Graph Image
The second factor can be labeled as Pattern Analysis.

Factor 3: The following variables have high loadings on this factor.
Var00004  Cognitive Science
Var00002  Artificial Intelligence
Var00014  Journal Pragmatics
This factor can be labeled as Artificial Intelligence.

Factor 4: The following variables have high loadings on this factor.
Var00021  Proc Soc Photo-Opt Instruments
Var00019  Proc IEEE
This factor can be labeled as Optics.

Factor 5: Only one variable has a high loading on this factor.
Var00011  Int J Man Machine Stud
This factor can be labeled as Man-Machine Studies.

Factor 6: The following variables have high loadings on this factor.
Var00012  J Exp Psychology
Var00024  Psychol Bull
Psychol Bull has a negative loading on this factor. This factor can be labeled
as Psychology.
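
The same extraction-and-rotation procedure can be reproduced outside SPSS. The
following Python sketch (an illustration only, assuming numpy is available; the
data matrix X, the random seed and the choice of six components are invented for
the example and are not taken from the journal data) extracts principal
components from a correlation matrix and applies a varimax rotation:

    import numpy as np

    def varimax(loadings, gamma=1.0, max_iter=100, tol=1e-6):
        # Orthogonal varimax rotation of a (variables x factors) loadings matrix
        p, k = loadings.shape
        rotation = np.eye(k)
        criterion = 0.0
        for _ in range(max_iter):
            rotated = loadings @ rotation
            gradient = loadings.T @ (rotated ** 3
                       - (gamma / p) * rotated @ np.diag((rotated ** 2).sum(axis=0)))
            u, s, vt = np.linalg.svd(gradient)
            rotation = u @ vt
            new_criterion = s.sum()
            if criterion > 0 and new_criterion / criterion < 1 + tol:
                break
            criterion = new_criterion
        return loadings @ rotation

    # Illustrative data: rows are cases, columns are variables
    # (in the journal example, each column would be one journal's citation profile)
    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 17))

    R = np.corrcoef(X, rowvar=False)              # correlation matrix of the variables
    eigenvalues, eigenvectors = np.linalg.eigh(R)
    order = np.argsort(eigenvalues)[::-1]         # sort eigenvalues in descending order
    eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

    n_factors = 6                                 # number of components retained
    loadings = eigenvectors[:, :n_factors] * np.sqrt(eigenvalues[:n_factors])
    rotated = varimax(loadings)

    print("Communalities:", np.round((rotated ** 2).sum(axis=1), 3))
    print("% of variance:", np.round(100 * eigenvalues / eigenvalues.sum(), 2))

The rotated loadings can then be inspected column by column, exactly as in the
rotated component matrix above, to label the factors.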
Self Check Exercises
3) What is the difference between Principal Components Analysis and Factor
Analysis?
4) Why do we rotate the principal axes in factor analysis?
5) What is a simple structure?
6) What will you do in cluster analysis if different variables are measured in
different units? Would you do the same thing in factor analysis? If not,
why?
Note: i) Write your answers in the space given below.
ii) Check your answers with the answers given at end of this Unit.

18.4 APPENDICES
Appendix - 1
SPECIALIZATION PATTERNS IN CHEMISTRY

Code labels for subfields of chemistry

V1 Analytical Chemistry

V2 Applied Chemistry

V3 Biochemistry

V4 Electrochemistry

V5 Inorganic Chemistry

V6 Organic Chemistry

V7 Physical Chemistry

V8 Polymers

V9 Chemical Engineering

     V1   V2   V3   V4   V5   V6   V7   V8   V9

USA 96 71 129 85 68 67 85 73 137

JPN 113 25 101 143 41 115 72 161 101

SUN 64 259 42 176 146 145 177 159 4

UK 89 152 102 70 143 106 98 75 72

DEU 99 84 81 44 189 116 108 80 106

FRA 93 39 105 126 69 122 122 88 40

IND 140 128 47 65 174 164 67 162 98

CAN 107 56 115 82 93 69 96 81 143


ITA 97 233 84 71 175 130 85 75 62

NLD 156 56 112 74 73 82 104 56 91

POL 235 50 47 123 86 65 106 114 272

Appendix - 2

INDIA'S SCIENTIFIC COOPERATION LINKS

The following data show India's cooperation links with its 35 most significant
partner countries in eleven scientific fields during 1990-1994:

Scientific fields
1) Mathematics
2) Physics
3) Chemistry
4) Biology
5) Earth & Space
6) Agriculture
7) Clinical Medicine
8) Biomedicine
9) Engineering & Technology
10) Material science
11) Computer science
Countries
1) United States of America (USA)
2) United Kingdom (UKD)
3) Germany (FRG)
4) Canada (CAN)
5) France (FRA)
6) Japan (JPN)
7) Italy (ITA)
8) Ex-USSR (SUN)
9) Australia (AUS)
10) Switzerland (CHE)
11) Netherlands (NLD)
12) Sweden (SWE)
13) Spain (ESP)
14) China (PRC)
15) Belgium (BEL)
16) Hungary (HUN)
17) Brazil (BRA)
18) Bangladesh (BND)
19) Denmark (DNK)
20) Austria (AUT)
21) Bulgaria (BGR)
22) Poland (POL)
23) Korea (KOR)
24) Czechoslovakia (CSK)
25) Mexico (MEX)
26) Philippines (PHL)
27) Finland (FIN)
28) Israel (ISR)
29) Egypt (EGY)
30) Greece (GRC)
31) Taiwan (TWN)
32) Norway (NOR)
33) Nigeria (NGR)
34) Chile (CHL)
35) Romania (ROM)
3.86 33.09 11.54 4.77 5.25 2.31 12.13 12.45 7.48 1.95 3.10
U4 35.37 8.38 6.41 5.38 3.52 19.13 9.62 5.07 .31 3.41
1.28 44.90 10.79 5.68 4.52 2.67 9.16 10.90 6.03 .46 1.86
8.91 37.03 7.13 6.34 5.74 2.57 7.92 7.72 10.30 2.38 2.57
.92 51.61 10.83 3.46 5.99 1.61 4.84 9.22 3.00 1.38 4.61
1.84 29.95 17.74 9.68 7.37 1.38 10.14 12.90 4.15 .69 2.30
2.92 67.11 9.55 1.06 1.59 .53 6.37 4.24 3.45 .27 2.39
2.65 27.51 17.99 8.99 7.94 10.05 11.11 4.76 5.82 .00 .53
2.76 59.67 6.63 .00 1.10 1.66 12.71 4.42 8.29 .00 .00
9.04 49.40 3.61 3.01 6.02 3.01 8.43 9.04 4.82 1.81 .60
.00 32.52 7.32 2.44 6.50 1.63 28.46 13.01 5.69 .81 .00
.85 66.95 11.02 .85 .85 .00 7.63 3.39 .85 .00 5.93
.93 70.37 1.85 3.70 2.78 .93 8.33 3.70 3.70 .00 1.85
3.13 40.63 9.38 10.42 9.38 .00 8.33 7.29 3.13 1.04 1.04
2.30 59.77 10.34 4.60 4.60 .00 4.60 4.60 5.75 1.15 1.15
2.67 61.33 1.33 2.67 9.33 .00 8.00 6.67 6.67 .00 .00
10.00 21.67 10.00 3.33 1.67 3.33 16.67 11.67 21.67 .00 .00
.00 21.15 26.92 13.46 1.92 .00 17.31 3.85 5.77 5.77 .00
2.13 23.40 29.79 4.26 2.13 .00 21.28 12.77 4.26 .00 .00
.00 85.11 4.26 2.13 .00 .00 2.13 4.26 2.13 .00 .00
4.35 36.96 15.22 10.87 .00 2.17 15.22 8.70 4.35 2.17 .00
.00 37.84 35.14 .00 5.41 2.70 5.41 5.41 .00 .00 8.11
2.78 36.11 2.78 5.56 .00 11.11 8.33 30.56 2.78 .00 .00
.00 .00 .00 22.22 5.56 44.44 8.33 19.44 .00 .00 .00
2.94 64.71 8.82 -.00 5.88 .00 5.88 8.82 .00 .00 .00
3.03 33.33 9.09 3.03 3.03 3.03 9.09 24.24 9.09 .00 3.03
.00 13.33 6.67 3.33 13.33 .00 20.00 6.67 20.00 .00 .00
3.57 35.71 7.14 3.57 .00 3.57 10.71 10.71 10.71 3.57 3.57
.00 46.43 3.57 3.57 3.57 .00 10.71 17.86 10.71 3.57 .00
.00 11.11 3.70 11.11 14.81 3.70 22.22 18.52 14.81 .00 .00
.00 14.81 11.11 .00 .00 11.11 33.33 11.11 3.70 3.70 .00
.00 73.08 .00 3.85 .00 .00 7.69 7.69 .00 3.85 .00
.00 72.00 .00 .00 .00 .00 .00 .00 28.00 .00 .00
.00 .00 .00 .00 .00 .00 .00 .00 .00 .00 .00
Appendix - 3

ARTIFICIAL INTELLIGENCE

The following table shows the citation relationships among a set of 24 journals
in the area of Artificial Intelligence. The table is divided into three parts:
Table A contains columns 1-8, Table B columns 9-16, and Table C columns 17-24.
If these tables are placed side by side, we obtain the full journal-to-journal
citation matrix.

In this matrix, the cell (i, j) gives the number of times articles in journal
j cite articles in journal i in a given year; equivalently, it gives the number
of times articles in journal i are cited by articles in journal j. This matrix
represents the citation environment of Artificial Intelligence. The columns of
the matrix represent the citing dimension, whereas the rows represent the
cited dimension.
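
To relate this matrix to the factor analysis reported earlier, one would
typically correlate the citing profiles of the journals (the columns). A
minimal sketch, assuming Python with numpy; the file name is hypothetical and
stands for the 24 x 24 matrix listed below:

    import numpy as np

    # C[i, j] = number of times articles in journal j cite articles in journal i;
    # the file name is hypothetical and stands for the 24 x 24 matrix listed below
    C = np.loadtxt("ai_citation_matrix.txt")

    # Correlate the citing profiles (columns): journals that cite the same sources
    # become highly correlated and tend to load on the same factor
    R = np.corrcoef(C, rowvar=False)
    print(R.shape)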
Table A : Journal to Journal Citation Matrix

Journal Name I 2 3 4 5 6 7 8

1. Artif Intel 46.00 .00 7.00 7.00 5.00 22.00 12.00 7.00

2. Auto- Theo 4.00 .00 .00 .00 .00 .00 .00 .00

3. Cognitive 8.00 .00 5.00 .00 .00 .00 .00 .00

4. Comm ACM 11.00 .00 .00 182.00 34.00 35.00 98.00 17.00

5. Comp Graph .00 .00 .00 8.00 101.00 5.00 19.00 15.00

6. Comp Survey .00 .00 .00 24.00 11.00 13.00 18.00 .00

7. Computer .00 .00 .00 .00 .00 .00 .00 .00

8. IEEE PA .00 .00 .00 6.00 47.00 4.00 23.00 18.00

9. IEEE SMC 6.00 .00 .00 .00 24.00 .00 33.00 53.00

10 Man Machine .00 .00 .00 12.00 .00 .00 .00 13.00

11. J.Exp Psy .00 .00 3.00 .00 .00 .00 .00 .00

12 J Philos 6.00 .00 .00 .00 .00 .00 .00 .00

13. J Pragmat .00 .00 .00 .00 .00 .00 .00 .00

14. J ACM 12.00 .00 .00 16.00 15.00 7.00 11.00 8.00

15. Le Comp Sci 5.00 .00 .00 7.00 .00 .00 .00

16. Mach Intel 12.00 .00 .00 .00 .00 .00 .00

17. Othello Q 10.00 .00 .00 .00 .00 .00 .00

18. P IEEE .00 .00 .00 .00 18.00 .00 17.00 12.00

19. P IJCAI 22.00 .00 .00 .00 18.00 .00 .00 6.00

20. P Photo Opt .00 .00 .00 .00 8.00 .00 .00 .00

21. Patt Direct 9.00 .00 .00 .00 .00 .00 .00 .00

22. Problem Sol 5.00 .00 .00 .00 .00 .00 .00 .00

23. Psychol B .00 .00 .00 .00 .00 .00 .00 8.00

24.Tai Tec Sci .00 .00 .00 .00 .00 .00 .00 .00

Note: Numbers in columns refer to journal names in rows

Table B : Journal to Journal Citation Matrix

Journal Name 9 10 11 12 13 14 15 16
1. Artif Intel 13.00 7.00 4.00 .00 10.00 10.00 .00 .00
2. Auto- Theo .00 .00 .00 .00 .00 .00 .00 .00
3. Cognitive .00 .00 4.00 .00 24.00 .00 .00 .00
4. Comm ACM 25.00 28.00 .00 .00 6.00 58.00 83.00 .00
5. Comp Graph 71.00 .00 .00 .00 .00 .00 .00 .00
6. Comp Survey .00 .00 .00 .00 .00 .00 .00 .00
7. Computer .00 .00 .00 .00 .00 .00 .00 .00
8. IEEE PA 80.00 .00 .00 .00 .00 .00 .00 .00
9. IEEE SMC 60.00 7.00 .00 .00 .00 .00 .00 .00
10. Man Machine .00 103.00 .00 .00 .00 .00 .00 .00
11. J.Exp Psy .00 .00 154.00 .00 .00 .00 .00 .00
12 J Philos .00 .00 .00 .00 .00 .00 .00 .00
13 J Pragmat .00 .00 .00 .00 11.00 .00 .00 .00
14 J ACM 32.00 .00 .00 .00 .00 102.00 60.00 .00
15. Le Comp Sci .00 .00 .00 .00 .00 30.00 92.00 .00
16. Mach Intel 7.00 .00 .00 .00 .00 .00 .00 .00
17. Othello Q .00 .00 .00 .00 .00 .00 .00 .00
18. P IEEE 34.00 8.00 .00 .00 .00 .00 .00 .00
19. P IJCAI .00 10.00 .00 .00 .00 .00 .00 .00
20. P Photo Opt .00 .00 .00 .00 .00 .00 .00 .00
21. Patt Direct .00 .00 .00 .00 .00 .00 .00 .00
22. Problem Sol .00 .00 .00 .00 .00 .00 .00 .00
23. Psychol B .00 .00 .00 .00 .00 .00 .00 .00
24. Tai Tec Sci .00 .00 .00 .00 .00 .00 .00 .00
Note: Numbers in columns refer to journal names in rows

Table C : Journal to Journal Citation Matrix


Journal Name 17 18 19 20 21 22 23 24
1. Artif Intel .00 6.00 .00 11.00 .00 .00 7.00 9.00
2. Auto- Theo .00 .00 .00 .00 .00 .00 .00 .00
3. Cognitive .00 .00 .00 .00 .00 .00 3.00 .00
4. Comm ACM .00 .00 .00 .00 .00 .00 .00 .00
5. Comp Graph .00 .00 .00 28.00 .00 .00 .00 7.00
6. Comp Survey .00 .00 .00 .00 .00 .00 .00 .00
7. Computer .00 .00 .00 .00 .00 .00 .00 .00
8. IEEE PA .00 4.00 .00 9.00 .00 .00 .00 .00
9. IEEE SMC .00 .00 .00 8.00 .00 .00 .00 17.00
10. Man Machine .00 .00 .00 .00 .00 .00 .00 .00
11. J.Exp Psy .00 .00 .00 .00 .00 .00 .00 .00
12. J Philos .00 .00 .00 .00 .00 .00 .00 .00
13. J Pragmat .00 .00 .00 .00 .00 .00 .00 .00

14. J ACM .00 15.00 .00 .00 .00 .00 .00 .00
15. Le Comp Sci .00 .00 .00 .00 .00 .00 .00 .00
16. Mach Intel .00 .00 .00 .00 .00 .00 .00 .00

17. Othello Q .00 .00 .00 .00 .00 .00 .00 .00

18. P IEEE .00 222.00 .00 112.00 .00 .00 .00 .00

19. P IJCAI .00 .00 .00 .00 .00 .00 .00 .00

20 P Photo Opt .00 6.00 .00 68.00 .00 .00 .00 .00

21. Patt Direct .00 .00 .00 .00 .00 .00 .00 .00

22. Problem Sol .00 .00 .00 .00 .00 .00 .00 .00

23. Psychol B .00 .00 .00 .00 .00 .00 211.00 .00

24. Tai Tec Sci .00 .00 .00 .00 .00 .00 .00 .00

Note : Numbers in columns refer to journal names in rows

18.5 SUMMARY
In this Unit we have discussed two major techniques of data reduction, viz.
Cluster analysis and Factor analysis.

Cluster Analysis

The goal of cluster analysis is to reduce the amount of data by categorizing


or grouping similar data items together. Clustering can be divided into two
basic types: hierarchical and partitional clustering.

Hierarchical clustering proceeds successively by either merging smaller clusters


into larger ones, or by splitting larger clusters. Clustering methods differ in the
rule by which two small clusters are merged or which large cluster is split.
The end result of hierarchical clustering is a tree of clusters called a dendrogram,
which shows how the clusters are related. By cutting the dendrogram at a
desired level a clustering of the data items into disjoint groups is obtained.
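
As an illustration of these steps, the sketch below (assuming Python with scipy
and matplotlib installed; the profile matrix, labels and the cut into three
groups are invented for the example) builds a Ward-linkage dendrogram and cuts
it into disjoint clusters:

    import numpy as np
    import matplotlib.pyplot as plt
    from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

    # Illustrative profile matrix: rows are objects (e.g. countries),
    # columns are variables (e.g. activity indices in subfields)
    rng = np.random.default_rng(1)
    profiles = rng.random((11, 9))

    # Ward's (minimum variance) agglomerative clustering on Euclidean distances
    Z = linkage(profiles, method="ward", metric="euclidean")

    # Draw the dendrogram, then cut it to obtain, say, three disjoint groups
    dendrogram(Z, labels=[f"obj{i}" for i in range(11)])
    plt.tight_layout()
    plt.show()

    groups = fcluster(Z, t=3, criterion="maxclust")
    print(groups)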

Partitional clustering attempts to directly decompose the data set into a set of
disjoint clusters according to an optimization criterion, which involves
minimizing within-cluster dissimilarities and maximizing inter-cluster
dissimilarities.
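
A minimal sketch of partitional clustering with the k-means algorithm, assuming
Python with scikit-learn (the data dimensions and the number of clusters are
arbitrary choices for illustration):

    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(2)
    profiles = rng.random((35, 11))        # e.g. countries x scientific fields

    # k-means minimizes the within-cluster sum of squared distances
    kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(profiles)
    print(kmeans.labels_)                  # cluster membership of each object
    print(kmeans.inertia_)                 # total within-cluster sum of squares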

Factor Analysis

Factor analysis includes both component analysis and common factor analysis.

Principal Components Analysis (PCA): By far the most common form of


factor analysis, PCA seeks a linear combination of variables such that the
maximum variance is extracted from the variables. It then removes this variance
and seeks a second linear combination which explains the maximum proportion
of the remaining variance, and so on. This is called the principal axis method
and results in orthogonal (uncorrelated) factors.
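
A compact sketch of this successive extraction of variance, assuming Python with
scikit-learn (the data matrix is random and purely illustrative):

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.preprocessing import StandardScaler

    rng = np.random.default_rng(3)
    X = rng.random((50, 9))                        # cases x variables, illustrative

    X_std = StandardScaler().fit_transform(X)      # work on standardized variables
    pca = PCA().fit(X_std)

    # Each successive component explains the maximum remaining variance,
    # and the components are mutually orthogonal (uncorrelated)
    print(np.round(pca.explained_variance_ratio_.cumsum(), 3))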

Principal Factor Analysis (PFA): Also called principal axis factoring or
common factor analysis, PFA is a form of factor analysis which seeks the
least number of factors which can account for the common variance of a set
of variables, whereas the more common principal components analysis (PCA)
in its full form seeks the set of factors which can account for all the common
and unique variance in a set of variables. PFA uses a PCA strategy but applies
it to a correlation matrix in which the diagonal elements are not 1's, as in
PCA, but estimates of the communalities (see below). These estimates are the
squared multiple correlations of each variable with the remainder of the
variables in the matrix (or, less commonly, the highest absolute correlation in
the matrix row).

• Principal Component Analysis - This method provides a unique solution,


so that the original data can be reconstructed from the results. It looks at
the total variance among the variables, so the solution generated will include
as many factors as there are variables, although it is unlikely that they will
all meet the criteria for retention. There is only one method for completing
a principal components analysis; this is not true of any of the other
multidimensional methods described here.
• Common Factor Analysis - This is what people generally mean when
they say "factor analysis." The term "common" in common factor analysis
describes the variance that is analyzed. It is assumed that the variance of
a single variable is composed of three components: common,
specific and error.

18.6 KEY WORDS


City-block Distance A distance measure computed as the sum (equivalently,
the average) of the absolute differences across
dimensions; also known as the Manhattan distance.

Cluster Analysis It is a collection of statistical techniques for


creating homogeneous groups of cases or variables.

Common Factor A statistical technique which uses the correlations


Analysis between observed variables to estimate common
factors and the structural relationships linking
factors to observed variables.

Common Factor A factor on which two or more variables load.


Common Variance Variance in a variable shared with common factors.
Factor analysis assumes that a variable's variance
is composed of three components: common,
specific and error.
Communality It is the proportion of variance that each item has
in common with other items. The proportion of
variance that is unique to each item is then the
respective item's total variance minus the
communality.
Dendrogram It is a "tree-like diagram" that summarizes the
process of clustering. Similar cases are joined by
links whose position in the diagram is determined
by the level of similarity between the cases.
Distance The distance between two objects is a measure of
the interval between them. Remember that
distances are not always measured by rulers or
their equivalents. They could be similarities or
dissimilarities.
Eigen Value Eigen values can be found for square symmetric
matrices. There are as many eigen values as there
are rows (or columns) in the matrix. A rigorous
description of an eigen value demands a sound
knowledge of linear algebra. However,
conceptually they can be considered to measure
the strength (relative length) of an axis (derived
from the square symmetric matrix).

Eigen Vector Each eigen value has an associated eigen vector.


An eigen value indicates the length of an axis, the
eigen vector determines its orientation in space.

Error Variance Unreliable and inexplicable variation in a variable.


Error variance is assumed to be independent of
common variance, and a component of the unique
variance of a variable.

Euclidean Distance This is probably the most commonly used type of


distance. It simply is the geometric distance in the
multidimensional space.

Factor Analysis It is a generic term for a family of statistical


techniques concerned with the reduction of a set
of observable variables in terms of a small number
of latent factors. It has been developed primarily
for analyzing relationships among a number of
measurable entities (such as survey items or test
scores). The main applications of factor analytic
techniques are: (1) to reduce the number of
variables and (2) to detect structure in the
relationships between variables, that is to classify
variables.

Factor Loading A Pearson correlation between a variable and a


factor. It is also called factor pattern coefficient or
factor structure coefficient.

Principal Components A linear dimensionality reduction technique which
Analysis identifies orthogonal directions of maximum
variance in the original data, and projects the data
into a lower-dimensional space spanned by a
subset of the highest-variance components.

Principal Factor It is a method of factor analysis using a priori
Analysis communality estimates. Successive factors are
extracted which explain the most variation in a set
of variables. The first factor accounts for the most
variance. Then the second factor accounts for the
most variance in the variables residualized for the
first factor, and so on. The factors are uncorrelated.

Specific Variance The component of unique variance which is reliable


but not explained by common factors.

Unique Variance The variance of a variable which is not explained
by common factors. Unique variance is composed
of specific and error variances.

Varimax Rotation An orthogonal rotation criterion which maximizes


the variance of the squared elements in the columns
of a factor matrix. Varimax is the most common
rotational criterion.

18.7 ANSWERS TO SELF CHECK EXERCISES


1) There are three general classes for the measurement of distances. They are:

a) Euclidean Metrics
b) Non - Euclidean Metrics
c) Semi - Metrics
2) There are five types of agglomerative clustering methods commonly in
use. They are:
a) Single linkage clustering
b) Complete linkage clustering
c) Average linkage clustering
d) Within group clustering
e) Minimum variance clustering
3) In both principal components analysis (PCA) and factor analysis, the original
explanatory variables are replaced by components or factors. The objective
is to reduce the number of variables and classify the variables. The PCA
provides a unique solution so that the original data can be reconstructed
from the results. There are as many components as variables. The sum of
the variances of the principal components is equal to the variance of the
original variables. However, we retain only those components which explain
the variation in the dependent variable substantially. In factor analysis we
identify the underlying latent factors. Here again we take only those factors
which explain the variance in dependent variable substantially.
4) The objective of rotation of principal axes in factor analysis is to obtain a
clear pattern of loadings. The rotation is done in a manner so that the
points are highly correlated with the axes and provide a more meaningful
interpretation of factors. Rotation can be orthogonal or oblique.
5) The correlations of variables with principal components are called loadings,
which indicate the influence of the original variables on the principal
components. Thus a high coefficient corresponds to a high loading. The ideal
pattern of loadings is called 'simple structure'. It means that each factor
should have several variables with strong loadings, each variable should have
a strong loading on only one factor, and each factor should account for a
substantial proportion of the variance.
6) The objective of cluster analysis is to classify (not reduce) the explanatory
variables into homogeneous groups. The unit of measurement strongly
affects the resulting clustering. Thus if the variables are in different units,

then we standardise the variables so that x_i* = (x_i - x̄) / σ_x. Principal
components decomposition is also dependent upon the scale of measurement.
Hence, here also standardised variables are commonly used.
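
As a small numerical illustration (a sketch, assuming numpy; the values are
arbitrary), standardisation simply centres each variable and divides it by its
standard deviation:

    import numpy as np

    X = np.array([[2.0, 300.0],
                  [4.0, 500.0],
                  [6.0, 100.0]])                   # two variables on very different scales

    X_std = (X - X.mean(axis=0)) / X.std(axis=0)   # x* = (x - mean) / std, per column
    print(X_std.mean(axis=0), X_std.std(axis=0))   # approximately 0 and 1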

18.8 REFERENCES AND FURTHER READING


Blashfield, R. and Aldenderfer, M.S. (1984). Cluster Analysis. Sage Publications,
Inc.

Everitt, B.S. (1993). Cluster Analysis, 3rd edition. Edward Arnold.

Kim, Jae-On and Mueller, C.W. (1978). Introduction to Factor Analysis. Sage
Publications, Inc.

Nagpaul, P.S. (2001). Guide to Advanced Data Analysis Using IDAMS Software
(Chapters 6 and 7), http://www.unesco.org/idams (only the electronic version
is available).
