Cluster Analysis
Cluster analysis is a group of multivariate techniques whose primary
purpose is to group objects (e.g., respondents, products, or other
entities) based on the characteristics they possess.
It is a means of grouping records based upon attributes that make
them similar. If plotted geometrically, the objects within a cluster
will lie close together, while the clusters themselves will be farther
apart.
* Cluster Variate
- the mathematical representation of the selected set of
variables used to compare the objects' similarities.
Cluster Analysis vs. Factor Analysis
- Cluster analysis: grouping is based on distance (proximity).
- Factor analysis: grouping is based on patterns of variation (correlation).
Information Retrieval - The World Wide Web consists of billions of Web pages,
and the results of a query to a search engine can return thousands of pages.
Clustering can be used to group these search results into a small number of
clusters, each of which captures a particular aspect of the query. For
instance, a query of "movie" might return Web pages grouped into
categories such as reviews, trailers, stars, and theaters. Each category
(cluster) can be broken into subcategories (sub-clusters), producing a
hierarchical structure that further assists a user's exploration of the query
results.
Hypothesis Generation
- Cluster analysis is also useful when a researcher
wishes to develop hypotheses concerning the nature of
the data or to examine previously stated hypotheses.
Cluster analysis is descriptive, atheoretical, and noninferential. Cluster analysis has no
statistical basis upon which to draw inferences from a sample to a population, and many
contend that it is only an exploratory technique. Nothing guarantees unique solutions,
because the cluster membership for any number of solutions is dependent upon many
elements of the procedure, and many different solutions can be obtained by varying one or
more elements.
Cluster analysis will always create clusters, regardless of the actual existence of any structure
in the data. When using cluster analysis, the researcher is making an assumption of some
structure among the objects. The researcher should always remember that just because
clusters can be found does not validate their existence. Only with strong conceptual support
and then validation are the clusters potentially meaningful and relevant.
The cluster solution is not generalizable because it is totally dependent upon the variables
used as the basis for the similarity measure. This criticism can be made of any
statistical technique, but cluster analysis is generally considered more dependent on the
measures used to characterize the objects than other multivariate techniques, because the
cluster variate is completely specified by the researcher. As a result, the researcher must be
especially cognizant of the variables used in the analysis, ensuring that they have strong
conceptual support.
Cluster analysis is used for:
◦ Taxonomy Description. Identifying groups within the data.
◦ Data Simplification. The ability to analyze groups of similar
observations instead of all individual observations.
◦ Relationship Identification. The simplified structure from cluster
analysis portrays relationships not revealed otherwise.
Theoretical, conceptual, and practical considerations must be
observed when selecting clustering variables for cluster analysis:
◦ Only variables that relate specifically to the objectives of the
cluster analysis are included.
◦ The variables selected characterize the individuals (objects) being
clustered.
The primary objective of cluster analysis is to define the
structure of the data by placing the most similar
observations into groups. To accomplish this task, we must
address three basic questions: How do we measure similarity?
How do we form clusters? How many groups do we form?

Interobject similarity can be measured in two main ways:
◦ Correlational measures.
- Less frequently used; large correlations indicate similarity
of pattern across the variables.
◦ Distance measures.
- Most often used as a measure of similarity, with higher values
representing greater dissimilarity (distance between cases), not
similarity.
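
To see the difference, consider a small sketch: two profiles with the
same pattern but different levels are identical by correlation yet far
apart by distance. The values below are made up for illustration (Python):

    import numpy as np

    # Two hypothetical profiles: same pattern, different levels.
    a = np.array([1.0, 5.0, 2.0, 6.0])
    b = a + 3.0                         # identical shape, shifted upward

    r = np.corrcoef(a, b)[0, 1]         # correlational measure
    d = np.linalg.norm(a - b)           # Euclidean distance measure

    print(f"correlation r = {r:.3f}")   # 1.000 -> 'similar' by pattern
    print(f"distance d   = {d:.3f}")    # 6.000 -> 'dissimilar' by proximity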
[Figure: Graph 1 and Graph 2 plot two observation profiles across
Categories 1-4; Graph 1 represents a higher level of similarity.]
Distances Between Observations

Observation   A       B       C       D       E       F       G
A             ---
B             3.162   ---
C             5.099   2.000   ---
D             5.099   2.828   2.000   ---
E             5.000   2.236   2.236   4.123   ---
F             6.403   3.606   3.000   5.000   1.414   ---
G             3.606   2.236   3.606   5.000   2.000   3.162   ---
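
A proximity matrix like this can be computed directly. The coordinates
below are hypothetical, chosen so that their Euclidean distances
reproduce the matrix above (a sketch using SciPy):

    import numpy as np
    from scipy.spatial.distance import pdist, squareform

    # Hypothetical 2-D coordinates whose Euclidean distances
    # reproduce the matrix above (order A..G).
    X = np.array([(3, 2), (4, 5), (4, 7), (2, 7),
                  (6, 6), (7, 7), (6, 4)], dtype=float)

    D = squareform(pdist(X, metric="euclidean"))
    print(np.round(D, 3))   # e.g. D[0, 1] = 3.162 is the A-B distance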
SIMPLE RULE: at each step, join the two observations or clusters
separated by the smallest distance.
Step   Min. Distance Between      Observation   Cluster Membership       Number of   Overall Similarity Measure
       Unclustered Observations   Pair                                   Clusters    (Average Within-Cluster Distance)
Initial solution                                (A)(B)(C)(D)(E)(F)(G)    7           0
1      1.414                      E-F           (A)(B)(C)(D)(E-F)(G)     6           1.414
2      2.000                      E-G           (A)(B)(C)(D)(E-F-G)      5           2.192
3      2.000                      C-D           (A)(B)(C-D)(E-F-G)       4           2.144
4      2.000                      B-C           (A)(B-C-D)(E-F-G)        3           2.234
5      2.236                      B-E           (A)(B-C-D-E-F-G)         2           2.896
6      3.162                      A-B           (A-B-C-D-E-F-G)          1           3.420
In steps 1, 2, 3, and 4, the overall similarity measure does not change
substantially, which indicates that we are forming other clusters with
essentially the same heterogeneity as the existing clusters.
When we get to step 5, we see a large increase. This indicates that joining
clusters (B-C-D) and (E-F-G) resulted in a single cluster that was markedly
less homogeneous.
Therefore, the three-cluster solution of Step 4 seems the
most appropriate for a final cluster solution, with two equally
sized clusters, (B-C-D) and (E-F-G), and a single outlying
observation (A).
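
This agglomeration schedule corresponds to nearest-neighbor (single)
linkage. A sketch that reproduces the merge distances with SciPy,
reusing the hypothetical coordinates from the distance-matrix example:

    import numpy as np
    from scipy.cluster.hierarchy import linkage

    # The same hypothetical coordinates as in the earlier sketch (A..G).
    X = np.array([(3, 2), (4, 5), (4, 7), (2, 7),
                  (6, 6), (7, 7), (6, 4)], dtype=float)

    # Each row of Z is one step: the two clusters joined, the distance
    # at which they merged, and the size of the new cluster.
    Z = linkage(X, method="single", metric="euclidean")
    print(np.round(Z, 3))   # merge distances 1.414, 2.0, 2.0, 2.0, 2.236, 3.162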
◦ Single Linkage
◦ Complete Linkage
◦ Average Linkage
◦ Centroid Method
◦ Ward's Method
◦ Mahalanobis Distance
Single Linkage
◦ Also called the nearest-neighbor method; defines
similarity between clusters as the shortest distance from
any object in one cluster to any object in the other.
Complete Linkage
◦ Also known as the farthest-neighbor method.
◦ The opposite of single linkage: it assumes
that the distance between two clusters is based on
the maximum distance between any two members of
the two clusters.
Average Linkage
◦ The distance between two clusters is defined as the
average distance between all pairs of the two clusters'
members.
Centroid Method
◦ Cluster centroids are the mean values of the observations on the
variables in the cluster variate; the distance between two clusters
is the distance between their centroids.
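
Each of these rules can be tried on the same data. A brief sketch
comparing the three-cluster solutions they produce (method names as
used by SciPy; the data are made up):

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(m, 0.7, (25, 2))
                   for m in ((0, 0), (5, 0), (2.5, 4))])

    # Ward and centroid linkage require raw observations with
    # Euclidean distance; the others accept any proximity matrix.
    for method in ("single", "complete", "average", "centroid", "ward"):
        Z = linkage(X, method=method)
        labels = fcluster(Z, t=3, criterion="maxclust")
        print(f"{method:>8}: cluster sizes {np.bincount(labels)[1:].tolist()}")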
K-Means Method
◦ A nonhierarchical procedure: the researcher specifies the number of
clusters, cases are assigned to the nearest cluster center, and the
centers are recomputed iteratively (see the SPSS options and output
described below).
In stage 8 of the agglomeration schedule, observations 5 and 7 were
joined. The resulting cluster next appears in stage 13.
Table 23.2 is a reformatted table that shows the changes in the coefficients as the
number of clusters increases. The final column, headed 'Change', enables us to
determine the optimum number of clusters. In this case it is 3 clusters, as
succeeding clustering adds very much less to the distinction between cases.
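
The same 'Change' logic can be applied to a linkage result outside SPSS:
look for the largest jump in the agglomeration coefficients. A sketch,
assuming Ward linkage on made-up data containing three groups:

    import numpy as np
    from scipy.cluster.hierarchy import linkage

    rng = np.random.default_rng(1)
    X = np.vstack([rng.normal(m, 0.8, (30, 2))
                   for m in ((0, 0), (6, 0), (3, 5))])

    Z = linkage(X, method="ward")
    coef = Z[:, 2]               # agglomeration coefficients, ascending
    change = np.diff(coef)       # extra heterogeneity added by each merge
    i = int(np.argmax(change))   # the costliest merge...
    k = len(X) - (i + 1)         # ...takes k clusters down to k - 1
    print(f"largest change going from {k} to {k - 1} clusters; keep {k}")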
Repeat steps 1 to 3 to place cases into one of three clusters.
The number you place in the box is the number of clusters that seems best to
represent the clustering solution in a parsimonious way.
Finally, click OK.
A new variable has been generated at the end of your SPSS data file
called clu3_1 (labelled Ward Method in variable view). This provides
the cluster membership for each case in your sample.
Clustering variables used in this example:
Multiple lines
Voice mail
Paging service
Internet
Caller ID
Call waiting
Call forwarding
3-way calling
Electronic billing
◦ Number of iterations (repetitions) of combining different clusters.
◦ Specify the number of clusters.
◦ The convergence criterion determines when iterations cease; it
represents a proportion of the minimum distance between initial
cluster centers.
STATISTICS - shows the information for each group.
The initial cluster centers are the variable values of the k well-
spaced observations.
Iteration History
The iteration history shows the progress of the clustering process at
each step.
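
To make the procedure concrete, here is a minimal k-means sketch
(plain NumPy, made-up data). It seeds the centers with k well-spaced
observations, as described above, and prints an iteration history of
how far the centers move at each step:

    import numpy as np

    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(m, 0.6, (40, 2))
                   for m in ((0, 0), (5, 0), (2.5, 4))])
    k = 3

    # Initial centers: k well-spaced observations (greedy farthest-point).
    seeds = [0]
    while len(seeds) < k:
        d = ((X[:, None] - X[seeds]) ** 2).sum(-1).min(axis=1)
        seeds.append(int(np.argmax(d)))
    centers = X[seeds]

    # Iteration history: reassign cases, recompute centers, repeat.
    for it in range(1, 101):
        labels = ((X[:, None] - centers) ** 2).sum(-1).argmin(axis=1)
        centers_new = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        shift = np.abs(centers_new - centers).max()
        print(f"iteration {it}: max center shift = {shift:.4f}")
        centers = centers_new
        if shift < 1e-4:         # convergence criterion (see dialog above)
            break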
The ANOVA table indicates which variables contribute the most to your
cluster solution.
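
A rough equivalent of that ANOVA table can be computed by hand: a
one-way F statistic per variable across the cluster groups. A sketch
with made-up data; note these F tests are descriptive only, since the
clusters were formed precisely to maximize the group differences:

    import numpy as np
    from scipy.stats import f_oneway
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(2)
    # Variable 0 separates the groups strongly; variable 1 is pure noise.
    X = np.vstack([rng.normal((m, 0), 1.0, (50, 2)) for m in (0, 6, 12)])

    labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
    for v in range(X.shape[1]):
        F = f_oneway(*(X[labels == j, v] for j in range(3))).statistic
        print(f"variable {v}: F = {F:.1f}")   # larger F = bigger contribution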
Cluster 2 is approximately
equally similar to clusters
1 and 3.
A large number of cases were assigned to the third
cluster, which unfortunately is the least profitable group.
The two steps of the TwoStep Cluster Analysis procedure's algorithm can be summarized as follows:
Step 1. The procedure begins with the construction of a Cluster Features (CF) Tree. The tree begins by
placing the first case at the root of the tree in a leaf node that contains variable information about that
case. Each successive case is then added to an existing node or forms a new node, based upon its
similarity to existing nodes and using the distance measure as the similarity criterion. A node that
contains multiple cases contains a summary of variable information about those cases. Thus, the CF tree
provides a capsule summary of the data file.
Step 2. The leaf nodes of the CF tree are then grouped using an agglomerative clustering algorithm. The
agglomerative clustering can be used to produce a range of solutions. To determine which number of
clusters is "best", each of these cluster solutions is compared using Schwarz's Bayesian Criterion (BIC) or
the Akaike Information Criterion (AIC) as the clustering criterion.
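
The second step can be approximated outside SPSS: generate a range of
agglomerative solutions and score each with a BIC. The sketch below
skips the CF-tree step and assumes one spherical Gaussian per cluster,
so it illustrates only the model-selection idea, not the TwoStep
algorithm itself:

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    rng = np.random.default_rng(3)
    X = np.vstack([rng.normal(m, 0.5, (50, 2))
                   for m in ((0, 0), (4, 0), (0, 5))])
    n, d = X.shape

    Z = linkage(X, method="ward")

    def bic(labels):
        """BIC assuming one spherical Gaussian per cluster, shared variance."""
        groups = [X[labels == g] for g in np.unique(labels)]
        sse = sum(((g - g.mean(axis=0)) ** 2).sum() for g in groups)
        var = sse / (n * d)                  # pooled variance estimate
        loglik = -0.5 * n * d * (np.log(2 * np.pi * var) + 1)
        n_params = len(groups) * d + 1       # centroids plus shared variance
        return -2 * loglik + n_params * np.log(n)

    for k in range(1, 7):
        labels = fcluster(Z, t=k, criterion="maxclust")
        print(f"{k} clusters: BIC = {bic(labels):.1f}")   # smallest is 'best'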
Car manufacturers need to be able to appraise the current market to
determine the likely competition for their vehicles. If cars can be
grouped according to available data, this task can be largely
automated using cluster analysis.