Cluster Analysis
Collections….
Cluster analysis
Suppose one wants to understand the consumers in the finance market. With regard to the finance market, there are a few outstanding aspects like risk, returns and liquidity. One could use any two aspects at a time to form clusters of consumers, or could use more than two attributes.

Cluster I: Consumers who believe in high-risk, high-return instruments (equity).
Cluster II: Consumers who believe in average-risk, average-return instruments (mutual funds, company fixed deposits).
Cluster III: Consumers who believe in low-risk, low-return instruments (savings bank accounts, fixed deposits).

[Figure: scatter plot of consumers (labelled A to O) on Risk (X1) against Returns (X2), grouped into the three clusters I, II and III.]
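The same grouping can be sketched in code. The snippet below is purely illustrative: it simulates consumer scores for risk (X1) and returns (X2) on an arbitrary scale and forms three clusters with k-means; none of the data or parameter choices come from the example above.

```python
# A minimal sketch (synthetic data): clustering consumers on two attributes,
# risk (X1) and returns (X2), into three groups with k-means.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
low    = rng.normal(loc=[2, 2], scale=0.6, size=(20, 2))   # low risk, low returns
medium = rng.normal(loc=[5, 5], scale=0.6, size=(20, 2))   # average risk and returns
high   = rng.normal(loc=[8, 8], scale=0.6, size=(20, 2))   # high risk, high returns
X = np.vstack([low, medium, high])

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print("Cluster centres (risk, returns):")
print(np.round(km.cluster_centers_, 2))
print("Cluster sizes:", np.bincount(km.labels_))
```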
Cluster analysis
Market segmentation
Other uses of cluster analysis
Product characteristics and the identification of new product opportunities: clustering similar brands or products according to their characteristics allows one to identify competitors, potential market opportunities and available niches.
Data reduction (see the sketch below):
• Factor analysis and principal component analysis allow one to reduce the number of variables.
• Cluster analysis allows one to reduce the number of observations, by grouping them into homogeneous clusters.
Maps that simultaneously profile consumers and products, market opportunities and preferences, as in preference or perceptual mapping.
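The data-reduction use is easy to sketch: a large set of observations is replaced by a small number of cluster centroids. The snippet below is illustrative only; the data and the choice of 10 clusters are assumptions.

```python
# A minimal sketch (synthetic data): cluster analysis as data reduction,
# replacing 1000 observations with 10 representative cluster centroids
# (much as factor/principal component analysis reduces the number of variables).
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 5))            # 1000 observations, 5 variables

km = KMeans(n_clusters=10, n_init=10, random_state=1).fit(X)
reduced = km.cluster_centers_             # 10 "representative" observations
print("Original shape:", X.shape)         # (1000, 5)
print("Reduced shape :", reduced.shape)   # (10, 5)
```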
Quality: What Is Good Clustering?
Steps in Cluster Analysis
Similarity and Dissimilarity
Between Objects
Distances are normally used to measure the
similarity or dissimilarity between two data objects
Some popular ones include the Minkowski distance:

d(i, j) = \left( |x_{i1} - x_{j1}|^q + |x_{i2} - x_{j2}|^q + \cdots + |x_{ip} - x_{jp}|^q \right)^{1/q}
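For a quick illustration (values chosen arbitrarily, not taken from the slides), consider two objects measured on p = 2 variables, x_i = (1, 4) and x_j = (4, 8):

d_{q=1}(i, j) = |1 - 4| + |4 - 8| = 3 + 4 = 7 \quad \text{(city-block distance)}

d_{q=2}(i, j) = \sqrt{|1 - 4|^2 + |4 - 8|^2} = \sqrt{9 + 16} = 5 \quad \text{(Euclidean distance)}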
Similarity and Dissimilarity Between
Objects : Interval Measure
Euclidean distance. The square root of the sum of the squared x difference and the squared y difference.
Squared Euclidean distance removes the sign and also places greater emphasis on objects that are further apart, thus increasing the effect of outliers. This is the default for interval data.
Cosine. Interval-level similarity based on the cosine of the angle between two vectors of values.
Pearson correlation. Interval-level similarity based on the product-moment correlation. For case clustering, as opposed to variable clustering, the researcher transposes the normal data table in which columns are variables and rows are cases. By using columns as cases and rows as variables instead, the correlation is between cases, and these correlations may constitute the cells of the similarity matrix.
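These interval-level measures are straightforward to compute; the sketch below uses SciPy and NumPy on two made-up profiles (all values are arbitrary).

```python
# A minimal sketch (illustrative data): interval-level proximity measures
# between two profiles measured on five variables.
import numpy as np
from scipy.spatial import distance

x = np.array([2.0, 4.0, 3.0, 5.0, 1.0])
y = np.array([3.0, 1.0, 4.0, 4.0, 2.0])

print("Euclidean          :", distance.euclidean(x, y))
print("Squared Euclidean  :", distance.sqeuclidean(x, y))
# SciPy's cosine() returns a dissimilarity, so similarity = 1 - distance
print("Cosine similarity  :", 1 - distance.cosine(x, y))
print("Pearson correlation:", np.corrcoef(x, y)[0, 1])
```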
Similarity and Dissimilarity Between
Objects : Interval Measure
Chebychev distance is the maximum absolute difference between a pair of cases on any one of the two or more dimensions (variables) which are being used to define distance. Pairs will be defined as different according to their difference on a single dimension, ignoring their similarity on the remaining dimensions.
Block distance, also known as Manhattan or city-block distance, is the sum of the absolute differences across the two or more dimensions which are used to define distance.
Minkowski distance is the generalized distance function: the pth root of the sum of the absolute differences between the values for the items, each raised to the pth power:

d_{ij} = \left[ \sum_k |x_{ik} - x_{jk}|^p \right]^{1/p}

When p = 1, Minkowski distance is city-block distance. In the case of binary data, when p = 1 Minkowski distance is Hamming distance; when p = 2, Minkowski distance is Euclidean distance.
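A sketch of these distances with SciPy, on the same illustrative profiles used above (all values are made up):

```python
# A minimal sketch (illustrative data): Chebychev, city-block and Minkowski
# distances, including the p = 1 and p = 2 special cases.
import numpy as np
from scipy.spatial import distance

x = np.array([2.0, 4.0, 3.0, 5.0, 1.0])
y = np.array([3.0, 1.0, 4.0, 4.0, 2.0])

print("Chebychev    :", distance.chebyshev(x, y))       # max |x_k - y_k|
print("City-block   :", distance.cityblock(x, y))       # sum |x_k - y_k|
print("Minkowski p=1:", distance.minkowski(x, y, p=1))  # equals city-block
print("Minkowski p=2:", distance.minkowski(x, y, p=2))  # equals Euclidean
print("Euclidean    :", distance.euclidean(x, y))
```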
Similarity and Dissimilarity Between
Objects : Count Measure
Chi-square measure. Based on the chi-square test of
equality for two sets of frequencies, this measure is
the default for count data.
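A sketch of a chi-square based dissimilarity between two frequency profiles; the square-root normalisation used below is an assumption for illustration, not a definition taken from the slides.

```python
# A minimal sketch (illustrative counts): chi-square based dissimilarity
# between two sets of frequencies, built on the chi-square test of equality.
import numpy as np
from scipy.stats import chi2_contingency

f1 = np.array([20, 35, 15, 30])   # counts for object 1 across 4 categories
f2 = np.array([25, 25, 25, 25])   # counts for object 2

chi2, p, dof, expected = chi2_contingency(np.vstack([f1, f2]))
print("Chi-square statistic:", round(chi2, 3))
print("Chi-square measure  :", round(float(np.sqrt(chi2)), 3))  # assumed normalisation
```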
Similarity and Dissimilarity Between
Objects : Binary Measure
Squared Euclidean distance is also an option for binary data, as it is for interval data.
Size difference. An asymmetry index that ranges from 0 to 1.
Pattern difference. A dissimilarity measure that also ranges from 0 to 1. Computed from a 2x2 table as bc/(n**2), where b and c are the diagonal cells (cases present on one item but absent on the other) and n is the number of observations.
Variance. Computed from a 2x2 table as (b+c)/4n. It also ranges from 0 to 1.
Dispersion. A similarity measure with a range of -1 to 1.
Shape. A distance measure with a range of 0 to 1; it penalizes asymmetry of mismatches.
Simple matching. The ratio of matches to the total number of values.
Phi 4-point correlation. A binary analog of the Pearson correlation coefficient, with a range of -1 to 1.
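Most of these binary measures are functions of the 2x2 (fourfold) table of matches and mismatches. The sketch below builds that table for two illustrative binary profiles and computes simple matching and the phi 4-point correlation (the data are made up).

```python
# A minimal sketch (illustrative data): the fourfold table for two binary
# profiles (a = joint presences, b/c = mismatches, d = joint absences),
# then simple matching and the phi 4-point correlation.
import numpy as np

x = np.array([1, 1, 0, 1, 0, 0, 1, 0])
y = np.array([1, 0, 0, 1, 1, 0, 1, 0])

a = int(np.sum((x == 1) & (y == 1)))   # present on both items
b = int(np.sum((x == 1) & (y == 0)))   # present on x only
c = int(np.sum((x == 0) & (y == 1)))   # present on y only
d = int(np.sum((x == 0) & (y == 0)))   # absent on both items
n = a + b + c + d

simple_matching = (a + d) / n
phi = (a * d - b * c) / np.sqrt((a + b) * (c + d) * (a + c) * (b + d))
print("a, b, c, d =", a, b, c, d)
print("Simple matching:", round(simple_matching, 3))
print("Phi 4-point    :", round(float(phi), 3))
```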
Similarity and Dissimilarity Between
Objects : Binary Measure
Lambda. Goodman and Kruskal's lambda, which is interpreted as the proportional reduction of error (PRE) when using one item to predict the other (predicting in both directions). It ranges from 0 to 1, with 1 being perfect predictive monotonicity.
Anderberg's D. A variant on lambda; D is the actual reduction of error when using one item to predict the other (predicting in both directions). It ranges from 0 to 1.
Czekanowski or Sorensen measure. An index in which joint absences are excluded from the computation and matches are weighted double.
Hamann. The number of matches minus the number of nonmatches, divided by the total number of items; it ranges from -1 to 1.
Jaccard. An index in which joint absences are excluded from consideration. Equal weight is given to matches and nonmatches. Also known as the similarity ratio. Jaccard is commonly recommended for binary data.
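Continuing the fourfold-table sketch (same illustrative counts as above, a = 3, b = 1, c = 1, d = 3), the Jaccard, Czekanowski/Sorensen and Hamann coefficients follow directly:

```python
# A minimal sketch: Jaccard, Czekanowski/Sorensen and Hamann similarities
# from the fourfold table (illustrative counts).
a, b, c, d = 3, 1, 1, 3
n = a + b + c + d

jaccard     = a / (a + b + c)           # joint absences excluded, equal weights
czekanowski = 2 * a / (2 * a + b + c)   # joint absences excluded, matches doubled
hamann      = ((a + d) - (b + c)) / n   # matches minus nonmatches, over total

print("Jaccard             :", round(jaccard, 3))
print("Czekanowski/Sorensen:", round(czekanowski, 3))
print("Hamann              :", round(hamann, 3))
```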
Similarity and Dissimilarity Between
Objects : Binary Measure
Kulczynski 1. The ratio of joint presences to all nonmatches. Its lower bound is 0 and it is unbounded above; it is theoretically undefined when there are no nonmatches.
Kulczynski 2. The conditional probability that the characteristic is present in one item, given that it is present in the other.
Lance and Williams index. Also called the Bray-Curtis nonmetric coefficient, it is based on a 2x2 table, using the formula (b+c)/(2a+b+c), where a represents the cell corresponding to cases present on both items, and b and c represent the diagonal cells corresponding to cases present on one item but absent on the other. It ranges from 0 to 1.
Ochiai index. The binary equivalent of the cosine similarity measure; it ranges from 0 to 1.
Rogers and Tanimoto index. Gives double weight to nonmatches.
Russell and Rao index. Equivalent to the inner (dot) product, giving equal weight to matches and nonmatches. This is a common choice for binary similarity data.
Similarity and Dissimilarity Between
Objects : Binary Measure
Sokal and Sneath 1. An index that also gives double weight to matches.
Sokal and Sneath 2. An index that gives double weight to nonmatches, with joint absences excluded from consideration.
Sokal and Sneath 3. The ratio of matches to nonmatches. Its minimum is 0 but it is unbounded above. It is theoretically undefined when there are no nonmatches; the software assigns an arbitrary value of 9999.999 when the value is undefined or greater than this value.
Sokal and Sneath 4. Based on the conditional probability that the characteristic in one item matches the value in the other, taking the mean of the predictions in either direction.
Sokal and Sneath 5. The squared geometric mean of the conditional probabilities of positive and negative matches. It ranges from 0 to 1.
Yule's Y. Also called the coefficient of colligation, a similarity measure based on the cross-ratio for a 2x2 table and independent of the marginal totals. It ranges from -1 to +1.
Yule's Q. A similarity measure which is a special 2x2 case of Goodman and Kruskal's gamma, based on the cross-ratio.
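Both Yule coefficients come straight from the cross-ratio ad/bc of the fourfold table; a sketch with the same illustrative counts:

```python
# A minimal sketch: Yule's Q and Yule's Y (coefficient of colligation) from
# the cross-ratio of a 2x2 table (illustrative counts).
import math

a, b, c, d = 3, 1, 1, 3

yules_q = (a * d - b * c) / (a * d + b * c)
yules_y = (math.sqrt(a * d) - math.sqrt(b * c)) / (math.sqrt(a * d) + math.sqrt(b * c))

print("Yule's Q:", round(yules_q, 3))   # (9 - 1) / (9 + 1) = 0.8
print("Yule's Y:", round(yules_y, 3))   # (3 - 1) / (3 + 1) = 0.5
```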
Clustering procedures
Hierarchical procedures
• Agglomerative (start from n clusters to get to 1 cluster)
• Linkage Methods (Single, Complete, Average)
• Variance Methods (Ward’s method)
• Centroid Methods
• Divisive (start from 1 cluster to get to n clusters)
Non-hierarchical procedures
• Sequential Threshold
• Parallel Threshold
• Optimising Threshold
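The agglomerative linkage methods listed above can be compared with SciPy; the sketch below uses synthetic data, and the method names map onto SciPy's `method` argument.

```python
# A minimal sketch (synthetic data): agglomerative hierarchical clustering with
# different linkage methods. Each method only changes how the distance between
# two clusters is defined.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (20, 2)),
               rng.normal(5, 1, (20, 2)),
               rng.normal(10, 1, (20, 2))])

for method in ["single", "complete", "average", "ward", "centroid"]:
    Z = linkage(X, method=method)                     # agglomeration schedule
    labels = fcluster(Z, t=3, criterion="maxclust")   # cut the tree at 3 clusters
    print(f"{method:8s} cluster sizes: {np.bincount(labels)[1:]}")
```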
Hierarchical clustering
Agglomerative:
• Each of the n observations constitutes a separate cluster
• The two clusters that are most similar according to some distance rule are aggregated, so that in step 1 there are n-1 clusters
• In the second step another cluster is formed (n-2 clusters), by nesting the two clusters that are most similar, and so on
• There is a merging in each step until all observations end up in a single cluster in the final step
Divisive:
• All observations are initially assumed to belong to a single cluster
• The most dissimilar observation is extracted to form a separate cluster
• In step 1 there will be 2 clusters, in the second step three clusters, and so on, until the final step produces as many clusters as the number of observations
The number of clusters determines the stopping rule for the algorithms
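The agglomerative sequence is exactly what SciPy's linkage matrix records; a sketch (synthetic data) printing the schedule of merges:

```python
# A minimal sketch (synthetic data): the agglomeration sequence. Each row of
# the linkage matrix records one merging step: the two clusters joined, the
# merging distance, and the size of the newly formed cluster.
import numpy as np
from scipy.cluster.hierarchy import linkage

rng = np.random.default_rng(1)
X = rng.normal(size=(10, 3))                 # n = 10 observations

Z = linkage(X, method="average")
print("step  cluster_a  cluster_b  distance  new_size")
for step, (ca, cb, dist, size) in enumerate(Z, start=1):
    # after step k there are n - k clusters left
    print(f"{step:4d}  {int(ca):9d}  {int(cb):9d}  {dist:8.3f}  {int(size):8d}")
```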
Non-hierarchical clustering
Distance between clusters
Linkage methods
Hierarchical Clustering
Single Linkage
Clustering criterion based on the shortest distance
Complete Linkage
Clustering criterion based on the longest distance
Hierarchical Clustering (Contd.)
Average Linkage
Clustering criterion based on the average distance
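These three criteria can be computed directly from the pairwise distances between two clusters; the sketch below uses two small made-up clusters.

```python
# A minimal sketch (illustrative data): single, complete and average linkage
# distances between two clusters, taken from all pairwise distances.
import numpy as np
from scipy.spatial.distance import cdist

cluster_a = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.0]])
cluster_b = np.array([[6.0, 5.0], [7.0, 6.0]])

pairwise = cdist(cluster_a, cluster_b)   # all distances between the two clusters
print("Single linkage   (shortest):", round(float(pairwise.min()), 3))
print("Complete linkage (longest) :", round(float(pairwise.max()), 3))
print("Average linkage  (average) :", round(float(pairwise.mean()), 3))
```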
Ward algorithm
Hierarchical Clustering (Contd.)
Ward's Method
Based on the loss of information resulting from grouping of
the objects into clusters (minimize within cluster variation)
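The "loss of information" can be made concrete as the increase in the within-cluster sum of squares caused by a merge, which Ward's method minimises at each step; the sketch below uses made-up clusters.

```python
# A minimal sketch (illustrative data): the increase in within-cluster sum of
# squares produced by merging two clusters, the quantity Ward's method seeks
# to keep as small as possible at each step.
import numpy as np

def within_ss(points):
    """Sum of squared distances of the points from their centroid."""
    centroid = points.mean(axis=0)
    return float(((points - centroid) ** 2).sum())

cluster_a = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.0]])
cluster_b = np.array([[2.5, 1.5], [3.0, 2.0]])

merged = np.vstack([cluster_a, cluster_b])
increase = within_ss(merged) - (within_ss(cluster_a) + within_ss(cluster_b))
print("Increase in within-cluster SS if merged:", round(increase, 3))
```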
Centroid method
Non-hierarchical Clustering
Sequential Threshold
Parallel Threshold
Non-hierarchical clustering:
K-means method
1. The number k of clusters is fixed
2. An initial set of k “seeds” (aggregation centres) is
provided
• First k elements
• Other seeds (randomly selected or explicitly defined)
3. Given a certain fixed threshold, all units are assigned to
the nearest cluster seed
4. New seeds are computed
5. Go back to step 3 until no reclassification is necessary
Units can be reassigned in successive steps (optimising partitioning), as in the sketch below.
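A sketch of steps 1-5 written out explicitly on synthetic data (seeds taken as the first k elements, as in the first option above):

```python
# A minimal sketch (synthetic data): the k-means steps listed above.
import numpy as np

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (30, 2)), rng.normal(6, 1, (30, 2))])

k = 2                                            # step 1: number of clusters fixed
seeds = X[:k].copy()                             # step 2: first k elements as seeds
labels = np.full(len(X), -1)                     # no unit assigned yet

while True:
    dists = np.linalg.norm(X[:, None, :] - seeds[None, :, :], axis=2)
    new_labels = dists.argmin(axis=1)            # step 3: assign to nearest seed
    if np.array_equal(new_labels, labels):
        break                                    # step 5: no reclassification needed
    labels = new_labels
    seeds = np.array([X[labels == j].mean(axis=0) for j in range(k)])  # step 4: new seeds

print("Cluster sizes:", np.bincount(labels))
print("Final seeds (centres):\n", np.round(seeds, 2))
```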
Non-hierarchical threshold methods
Hierarchical vs. non-hierarchical
methods
Hierarchical Methods Non-hierarchical methods
Description of Variables
Export Data – K-means Clustering Results
Cluster centres (two panels, clusters 1-3):

                 Cluster                     Cluster
              1       2       3           1       2       3
Will          4       1       5           4       2       5
Govt          5       1       4           4       2       4
Train         1       0       0           1       0       1
Experience    1       0       1           1       0       1
Years         6.0     7.0     4.5         6.3     6.7     6.0
Prod          5       2       11          6       3       10
Mod_size      5.80    2.80    2.90        4.97    3.60    5.42
Mod_Rev       1.00    .90     .90         1.76    1.74    1.21

Distances between cluster centres:

Cluster       1        2        3
1                    3.899    4.193
2           3.899             7.602
3           4.193    7.602
Export Data – K-means Clustering
Results (contd.)
ANOVA

                 Cluster                 Error
            Mean Square    df     Mean Square    df          F     Sig.
Will            58.540      2          .683     117     85.710    .000
Govt            34.297      2          .750     117     45.717    .000
Train            2.228      2          .177     117     12.565    .000
Experience       3.640      2          .142     117     25.590    .000
Years            4.091      2          .690     117      5.932    .004
Prod           298.924      2         1.377     117    217.038    .000
Mod_size        32.451      2          .537     117     60.391    .000
Mod_Rev          2.252      2          .873     117      2.580    .080

Number of cases in each cluster:
Cluster 1: 56    Cluster 2: 46    Cluster 3: 18    Valid: 120    Missing: 0
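The ANOVA table can be reproduced in spirit with a one-way ANOVA of each clustering variable across the cluster groups. The sketch below uses synthetic data and hypothetical variable names, not the Export Data set.

```python
# A minimal sketch (synthetic data, hypothetical variable names): one-way
# ANOVA of each clustering variable across the k-means groups. Small p-values
# flag variables that separate the clusters strongly.
import numpy as np
from scipy.stats import f_oneway
from sklearn.cluster import KMeans

rng = np.random.default_rng(6)
X = np.vstack([rng.normal(0, 1, (40, 3)), rng.normal(3, 1, (40, 3))])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

for j, name in enumerate(["var1", "var2", "var3"]):
    groups = [X[labels == k, j] for k in np.unique(labels)]
    F, p = f_oneway(*groups)
    print(f"{name}: F = {F:8.2f}, Sig. = {p:.3f}")
```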
Determining the optimal number of clusters from hierarchical methods
Graphical
• dendrogram
• scree diagram
Statistical
• Arnold’s criterion
• pseudo F statistic (see the sketch after this list)
• pseudo t2 statistic
• cubic clustering criterion (CCC)
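Of the statistical criteria above, the pseudo F statistic is the most readily available in standard libraries, where it appears as the Calinski-Harabasz index; the sketch below (synthetic data) compares candidate numbers of clusters.

```python
# A minimal sketch (synthetic data): the pseudo F statistic (Calinski-Harabasz
# index) for candidate numbers of clusters taken from a Ward solution.
# Higher values indicate better-separated, more compact clusters.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.metrics import calinski_harabasz_score

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 1, (30, 2)),
               rng.normal(5, 1, (30, 2)),
               rng.normal(10, 1, (30, 2))])

Z = linkage(X, method="ward")
for k in range(2, 7):
    labels = fcluster(Z, t=k, criterion="maxclust")
    print(f"k = {k}: pseudo F = {calinski_harabasz_score(X, labels):8.1f}")
```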
Dendrogram
[Figure: SPSS dendrogram listing the cases (label and number) against the rescaled distance cluster combine axis, running from 0 to 25. A dotted line marks the distance between clusters; up to that point the merging distance is relatively small.]
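The same plot can be drawn with SciPy; a sketch on synthetic data:

```python
# A minimal sketch (synthetic data): drawing a dendrogram, the counterpart of
# the SPSS plot described above.
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 1, (10, 2)), rng.normal(6, 1, (10, 2))])

Z = linkage(X, method="average")
dendrogram(Z)                      # merging distance on the vertical axis
plt.xlabel("Case")
plt.ylabel("Merging distance")
plt.show()
```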
Scree diagram
[Figure: scree diagram with the merging distance on the y-axis (0 to 12) and the number of clusters on the x-axis (11 down to 1). When one moves from 7 to 6 clusters, the merging distance increases noticeably.]
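A scree diagram like this can be built directly from the agglomeration schedule; the sketch below uses synthetic data.

```python
# A minimal sketch (synthetic data): a scree diagram from the agglomeration
# schedule, plotting merging distance against the number of clusters left.
# A marked jump ("elbow") suggests the number of clusters to retain.
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(0, 1, (25, 2)),
               rng.normal(5, 1, (25, 2)),
               rng.normal(10, 1, (25, 2))])

Z = linkage(X, method="ward")
merge_distances = Z[:, 2]                  # one merging distance per step
clusters_left = np.arange(len(X) - 1, 0, -1)

plt.plot(clusters_left[-11:], merge_distances[-11:], marker="o")
plt.gca().invert_xaxis()                   # read from many clusters to few
plt.xlabel("Number of clusters")
plt.ylabel("Merging distance")
plt.show()
```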
Statistical tests
Statistical criteria to detect the optimal partition
Suggested approach:
two-step procedures
The SPSS two-step procedure
Evaluation and validation
Cluster analysis in SPSS
Hierarchical cluster analysis
• Variables selected for the analysis
• Clustering method
• Statistics required in the analysis
• Graphs (dendrogram) and options (advice: no plots)
Statistics
• The agglomeration schedule is a table which shows the steps of the clustering procedure, indicating which cases (clusters) are merged and the merging distance.
• Cluster membership shows the cluster membership of individual cases, only for a sub-set of solutions.
• The proximity matrix contains all distances between cases (it may be huge).
Plots
• The dendrogram shows the clustering process, indicating which cases are aggregated and the merging distance. With many cases, the dendrogram is hardly readable.
• The icicle plot (which can be restricted to cover a small range of clusters) shows at what stage cases are clustered. The plot is cumbersome and slows down the analysis (advice: no icicle).
Method
Choose a
hierarchical
algorithm
The example: agglomeration schedule
Last 10 stages of the process (10 to 1 clusters).
[Table: SPSS agglomeration schedule showing, at each stage, the clusters combined and the coefficient (merging distance).]
Scree diagram
The scree diagram (not provided by SPSS but created from the agglomeration schedule) shows a larger distance increase when the cluster number goes below 4.
[Figure: distance (roughly 590 to 840) plotted against the number of clusters (7 down to 1), with a possible elbow at 4 clusters.]
Hierarchical solution with 4 clusters

Ward Method                                     1        2        3        4    Total
Case Number (N %)                           26.6%    20.2%    23.8%    29.4%   100.0%
Household size                                1.4      3.2      1.9      3.1      2.4
Gross current income of household           238.0   1158.9    333.8    680.3    576.9
Age of household reference person              72       44       40       48       52
EFS: Total food & non-alcoholic beverage     28.8     64.4     29.2     60.6     45.4
EFS: Total clothing and footwear              8.8     64.3      9.2     19.0     23.1
EFS: Total housing, water, electricity       25.1     77.7     33.5     39.1     41.8
EFS: Total transport costs                   17.7    147.8     24.6     57.1     57.2
EFS: Total recreation                        29.6    146.2     39.4     63.0     65.3
(All rows except Case Number are cluster means.)
K-means solution (4 clusters)
Variables
K-means options
Creates a new variable with cluster membership for each case.
Results from k-means
(initial seeds chosen by SPSS)
Results from k-means: initial seeds from hierarchical clustering
Cluster Number of Case                          1        2        3        4    Total
Case Number (N %)                           32.6%    10.2%    33.6%    23.6%   100.0%
Household size                                1.7      3.1      2.5      2.9      2.4
Gross current income of household           163.5   1707.3    431.8    865.9    576.9
Age of household reference person              60       45       50       46       52
EFS: Total food & non-alcoholic beverage     31.3     65.5     45.1     56.8     45.4
EFS: Total clothing and footwear             12.3     48.4     19.1     32.7     23.1
EFS: Total housing, water, electricity       29.8     65.3     41.9     48.1     41.8
EFS: Total transport costs                   24.6    156.8     37.4     87.5     57.2
EFS: Total recreation                        30.3    126.8     67.9     83.4     65.3
(All rows except Case Number are cluster means.)
The first cluster is now larger, but it still represents older and poorer households. The
other clusters are not very different from the ones obtained with the Ward algorithm,
indicating a certain robustness of the results.
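The same seeding idea can be sketched in code (synthetic data, not the EFS survey): the centroids of a Ward solution are passed to k-means as the initial aggregation centres.

```python
# A minimal sketch (synthetic data): k-means initialised with the centroids of
# a hierarchical (Ward) 4-cluster solution, instead of software-chosen seeds.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.cluster import KMeans

rng = np.random.default_rng(5)
X = np.vstack([rng.normal(0, 1, (40, 3)),
               rng.normal(4, 1, (40, 3)),
               rng.normal(8, 1, (40, 3)),
               rng.normal(12, 1, (40, 3))])

# Hierarchical (Ward) solution with 4 clusters, then its centroids as seeds
ward_labels = fcluster(linkage(X, method="ward"), t=4, criterion="maxclust")
seeds = np.array([X[ward_labels == k].mean(axis=0) for k in range(1, 5)])

km = KMeans(n_clusters=4, init=seeds, n_init=1, random_state=0).fit(X)
print("Cluster sizes:", np.bincount(km.labels_))
```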
2-step clustering
• It is possible to make a distinction between categorical and continuous variables.
• The information criterion used to choose the optimal partition can be selected.
• The search for the optimal number of clusters may be constrained.
• One may also ask for plots and descriptive statistics.
Options
• It is advisable to control for outliers, because the analysis is usually sensitive to them.
• It is possible to choose which variables should be standardized prior to running the analysis.
• More advanced options are available for better control of the procedure.
Output
Discussion
Limitations
It is difficult to evaluate the quality of the clustering
It is difficult to know exactly which clusters are very similar
and which objects are difficult to assign.
It is difficult to select a clustering criterion and program on
any basis other than availability.