
Cluster Analysis

1
Collections….

2
Cluster analysis

Suppose one wants to understand the consumers in the finance market. With regard to the finance market, there are a few outstanding aspects like Risk, Returns, Liquidity. One could use any two aspects at a time to form clusters of consumers, or could use more than two attributes.

[Figure: scatter plot of consumers (A-M) on RISK (X1) vs RETURNS (X2) axes, partitioned into clusters I, II and III]

Cluster I: Consumers believe in high-risk, high-return instruments (Equity)

Cluster II: Consumers believe in average risk and average returns (Mutual funds, Company fixed deposits)

Cluster III: Consumers believe in low risk and low returns (Savings bank, Fixed deposits)

3
Cluster analysis

• It is a class of techniques used to classify cases into groups that are
  - relatively homogeneous within themselves, and
  - heterogeneous between each other
• Homogeneity (similarity) and heterogeneity (dissimilarity) are measured on the basis of a defined set of variables
• These groups are called clusters

4
Market segmentation

• Cluster analysis is especially useful for market segmentation
• Segmenting a market means dividing its potential consumers into separate sub-sets where
  - consumers in the same group are similar with respect to a given set of characteristics, and
  - consumers belonging to different groups are dissimilar with respect to the same set of characteristics
• This allows one to calibrate the marketing mix differently according to the target consumer group

5
Other uses of cluster analysis

• Product characteristics and the identification of new product opportunities
  - Clustering similar brands or products according to their characteristics allows one to identify competitors, potential market opportunities and available niches
• Data reduction
  - Factor analysis and principal component analysis allow one to reduce the number of variables
  - Cluster analysis allows one to reduce the number of observations, by grouping them into homogeneous clusters
• Maps profiling consumers and products, market opportunities and preferences simultaneously, as in preference or perceptual mappings

6
Quality: What Is Good Clustering?

• A good clustering method will produce high-quality clusters with
  - high intra-class similarity
  - low inter-class similarity
• The quality of a clustering result depends on both the similarity measure used by the method and its implementation
• The quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns

7
Steps in Cluster Analysis

• Formulate the problem
• Select the distance measure
• Select the clustering procedure
• Decide the number of clusters
• Interpret and profile the clusters
• Assess the validity of clustering

8
Formulate the problem

• Selecting the variables on which the clustering is based is the most important part of formulating the clustering problem
• Inclusion of even one or two irrelevant variables may distort an otherwise useful clustering solution
• The set of selected variables should be able to describe the similarity between objects in terms that are relevant to the marketing research problem
• The variables should be selected on the basis of past research, theory, or a consideration of the hypotheses being tested

9
Similarity and Dissimilarity Between Objects

• Distances are normally used to measure the similarity or dissimilarity between two data objects
• A popular choice is the Minkowski distance:

  d(i, j) = ( |x_i1 - x_j1|^q + |x_i2 - x_j2|^q + ... + |x_ip - x_jp|^q )^(1/q)

  where i = (x_i1, x_i2, ..., x_ip) and j = (x_j1, x_j2, ..., x_jp) are two p-dimensional data objects, and q is a positive integer
• If q = 1, d is the Manhattan distance:

  d(i, j) = |x_i1 - x_j1| + |x_i2 - x_j2| + ... + |x_ip - x_jp|
10
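As a quick illustration of the formulas above, here is a minimal Python (NumPy) sketch; the two example vectors are invented for the illustration:

import numpy as np

def minkowski(x_i, x_j, q=2):
    # Minkowski distance of order q between two p-dimensional points
    return np.sum(np.abs(x_i - x_j) ** q) ** (1.0 / q)

# Two hypothetical consumers measured on p = 3 variables (e.g., risk, return, liquidity scores)
i = np.array([4.0, 2.0, 1.0])
j = np.array([1.0, 5.0, 3.0])

print(minkowski(i, j, q=1))  # q = 1: Manhattan (city-block) distance = 3 + 3 + 2 = 8
print(minkowski(i, j, q=2))  # q = 2: Euclidean distance = sqrt(9 + 9 + 4), about 4.69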
Similarity and Dissimilarity
Between Objects
Distance measures how far apart two observations are.
Similarity measures how alike two cases are.
When two or more variables are used to define distance, the one with the
larger magnitude will dominate.
To avoid this it is common to first standardize all variables.
SPSS hierarchical clustering supports these types of measures:
Interval. Available alternatives are Euclidean distance, squared Euclidean
distance, cosine, Pearson correlation, Chebychev, block, Minkowski, and
customized.
Counts. Available alternatives are chi-square measure and phi-square measure.
Binary. Binary matching is a common type of similarity data, where 1 indicates
a match and 0 indicates no match between any pair of cases. There are multiple
matched attributes and the similarity score is the number of matches divided by
the number of attributes being matched.

11
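To illustrate the standardization point made above, a minimal sketch on hypothetical data (NumPy and SciPy assumed available): z-standardizing the variables keeps the large-magnitude income column from dominating the Euclidean distances.

import numpy as np
from scipy.spatial.distance import pdist, squareform

# Hypothetical cases: column 0 = household size, column 1 = income (much larger magnitude)
X = np.array([[1.0,  240.0],
              [3.0, 1160.0],
              [2.0,  330.0],
              [3.0,  680.0]])

# Raw Euclidean distances are driven almost entirely by income
raw = squareform(pdist(X, metric='euclidean'))

# Standardize each column to zero mean and unit variance, then recompute
Z = (X - X.mean(axis=0)) / X.std(axis=0)
standardized = squareform(pdist(Z, metric='euclidean'))

print(np.round(raw, 1))
print(np.round(standardized, 2))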
Similarity and Dissimilarity Between
Objects : Interval Measure
Euclidean distance. The square root of the sum of the squared differences across all variables (e.g., the squared x difference plus the squared y difference).
Squared Euclidean distance removes the sign and also places greater emphasis on objects further apart, thus increasing the effect of outliers. This is the default for interval data.
Cosine. Interval-level similarity based on the cosine of the angle between two vectors of values.
Pearson correlation. Interval-level similarity based on the product-moment correlation. For case clustering, as opposed to variable clustering, the researcher transposes the normal data table in which columns are variables and rows are cases. By using columns as cases and rows as variables instead, the correlation is between cases, and these correlations may constitute the cells of the similarity matrix.
12
Similarity and Dissimilarity Between
Objects : Interval Measure
Chebychev distance is the maximum absolute difference between a pair of cases on any one of the two or more dimensions (variables) used to define distance. Pairs are treated as different according to their difference on a single dimension, ignoring their similarity on the remaining dimensions.
Block distance (Manhattan or city-block distance) is the sum of the absolute differences across the two or more dimensions used to define distance.
Minkowski distance is the generalized distance function: the pth root of the sum of the absolute differences raised to the pth power between the values for the items,

  d_ij = [ sum_k |x_ik - x_jk|^p ]^(1/p)

When p = 1, Minkowski distance is the city-block distance. In the case of binary data, when p = 1 Minkowski distance is the Hamming distance; when p = 2, Minkowski distance is the Euclidean distance.
13
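For readers working outside SPSS, the interval measures above can be reproduced with SciPy's pdist; the small data matrix is made up for the sketch.

import numpy as np
from scipy.spatial.distance import pdist

X = np.array([[1.0, 2.0, 3.0],
              [4.0, 0.0, 5.0],
              [2.0, 2.0, 2.0]])

print(pdist(X, metric='chebyshev'))        # max absolute difference on any one dimension
print(pdist(X, metric='cityblock'))        # block / Manhattan: sum of absolute differences
print(pdist(X, metric='minkowski', p=3))   # generalized Minkowski distance with p = 3
print(pdist(X, metric='sqeuclidean'))      # squared Euclidean (SPSS default for interval data)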
Similarity and Dissimilarity Between
Objects : Count Measure
Chi-square measure. Based on the chi-square test of equality for two sets of frequencies, this measure is the default for count data.

Phi-square measure. Phi-square normalizes the chi-square measure by the square root of the combined frequency.
14
Similarity and Dissimilarity Between
Objects : Binary Measure
Squared Euclidean distance is also an option for binary data, as it is for interval data.
Size difference. This asymmetry index ranges from 0 to 1.
Pattern difference. This dissimilarity measure also ranges from 0 to 1. Computed from a 2x2 table as bc/(n**2), where b and c are the diagonal cells and n is the number of observations.
Variance. Computed from a 2x2 table as (b+c)/4n. It also ranges from 0 to 1.
Dispersion is a similarity measure with a range of -1 to 1.
Shape is a distance measure with a range of 0 to 1. Shape penalizes asymmetry of mismatches.
Simple matching is the ratio of matches to the total number of values.
Phi 4-point correlation is a binary analog of the Pearson correlation coefficient and has a range of -1 to 1.
15
Similarity and Dissimilarity Between
Objects : Binary Measure
Lambda. Goodman and Kruskal's lambda, interpreted as the proportional reduction of error (PRE) when using one item to predict the other (predicting in both directions). It ranges from 0 to 1, with 1 being perfect predictive monotonicity.
Anderberg's D is a variant on lambda. D is the actual reduction of error using one item to predict the other (predicting in both directions). It ranges from 0 to 1.
Czekanowski or Sorensen measure is an index in which joint absences are excluded from computation and matches are weighted double.
Hamann is the number of matches minus the number of nonmatches, divided by the total number of items; it ranges from -1 to 1.
Jaccard. This is an index in which joint absences are excluded from consideration. Equal weight is given to matches and nonmatches. Also known as the similarity ratio. Jaccard is commonly recommended for binary data.
16
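A minimal sketch of how two of the binary measures above are obtained from the 2x2 match counts (the two binary profiles are invented): simple matching uses all four cells, while Jaccard excludes joint absences.

import numpy as np

x = np.array([1, 1, 0, 1, 0, 0, 1, 0])
y = np.array([1, 0, 0, 1, 0, 1, 1, 0])

a = np.sum((x == 1) & (y == 1))   # joint presences
b = np.sum((x == 1) & (y == 0))   # present in x only
c = np.sum((x == 0) & (y == 1))   # present in y only
d = np.sum((x == 0) & (y == 0))   # joint absences

simple_matching = (a + d) / (a + b + c + d)  # matches / all attributes
jaccard = a / (a + b + c)                    # joint absences excluded

print(simple_matching, jaccard)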
Similarity and Dissimilarity Between
Objects : Binary Measure
Kulczynski 1 is the ratio of joint presences to all nonmatches. Its lower
bound is 0 and it is unbounded above. It is theoretically undefined
when there are no nonmatches.
Kulczynski 2. is the conditional probability that the characteristic is
present in one item, given that it is present in the other.
Lance and Williams index, also called the Bray-Curtis nonmetric
coefficient, is based on a 2x2 table, using the formula (b+c)/(2a+b+c),
where a represents the cell corresponding to cases present on both
items, and b and c represent the diagonal cells corresponding to cases
present on one item but absent on the other. It ranges from 0 to 1.
Ochiai index is the binary equivalent of the cosine similarity measure and
ranges from 0 to 1.
Rogers and Tanimoto index gives double weight to nonmatches.
Russell and Rao index is equivalent to the inner (dot) product, giving
equal weight to matches and nonmatches. This is a common choice for
binary similarity data.
17
Similarity and Dissimilarity Between
Objects : Binary Measure
Sokal and Sneath 1 index also gives double weight to matches.
Sokal and Sneath 2 index gives double weight to nonmatches, but joint absences are excluded from consideration.
Sokal and Sneath 3 index is the ratio of matches to nonmatches. Its minimum is 0 but it is unbounded above. It is theoretically undefined when there are no nonmatches; however, the software assigns an arbitrary value of 9999.999 when the value is undefined or is greater than this value.
Sokal and Sneath 4 index is based on the conditional probability that the characteristic in one item matches the value in the other, taking the mean of the predictions in either direction.
Sokal and Sneath 5 index is the squared geometric mean of the conditional probabilities of positive and negative matches. It ranges from 0 to 1.
Yule's Y, also called the coefficient of colligation, is a similarity measure based on the cross-ratio for a 2x2 table and is independent of the marginal totals. It ranges from -1 to +1.
Yule's Q is a similarity measure which is a special 2x2 case of Goodman and Kruskal's gamma. It is also based on the cross-ratio and ranges from -1 to +1.
18
Clustering procedures

• Hierarchical procedures
  - Agglomerative (start from n clusters to get to 1 cluster)
    · Linkage Methods (Single, Complete, Average)
    · Variance Methods (Ward's method)
    · Centroid Methods
  - Divisive (start from 1 cluster to get to n clusters)
• Non-hierarchical procedures
  - Sequential Threshold
  - Parallel Threshold
  - Optimising Threshold

19
Hierarchical clustering

• Agglomerative:
  - Each of the n observations constitutes a separate cluster
  - The two clusters that are most similar according to some distance rule are aggregated, so that in step 1 there are n-1 clusters
  - In the second step another cluster is formed (n-2 clusters), by nesting the two clusters that are most similar, and so on
  - There is a merging in each step until all observations end up in a single cluster in the final step
• Divisive:
  - All observations are initially assumed to belong to a single cluster
  - The most dissimilar observation is extracted to form a separate cluster
  - In step 1 there will be 2 clusters, in the second step three clusters, and so on, until the final step produces as many clusters as the number of observations
• The number of clusters determines the stopping rule for the algorithms

20
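A short sketch of the agglomerative process using SciPy (random data generated for the example): the linkage matrix records which two clusters are merged at each step and at what distance, going from n singleton clusters down to one, and fcluster applies the stopping rule.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 2))          # n = 10 observations, 2 variables

Z = linkage(X, method='average')      # agglomerative clustering, average linkage
# Each row of Z: [cluster a, cluster b, merging distance, size of the new cluster]
print(Z)

labels = fcluster(Z, t=3, criterion='maxclust')  # cut the tree at 3 clusters (stopping rule)
print(labels)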
Non-hierarchical clustering

• These algorithms do not follow a hierarchy and produce a single partition
• Knowledge of the number of clusters (c) is required
• In the first step, initial cluster centres (the seeds) are determined for each of the c clusters, either by the researcher or by the software (usually the first c observations, or c observations chosen at random)
• Each iteration allocates observations to each of the c clusters, based on their distance from the cluster centres
• Cluster centres are computed again and observations may be reallocated to the nearest cluster in the next iteration
• When no observations can be reallocated or a stopping rule is met, the process stops

21
Distance between clusters

• Algorithms vary according to the way the distance between two clusters is defined
• The most common algorithms for hierarchical methods include
  - single linkage method
  - complete linkage method
  - average linkage method
  - Ward algorithm
  - centroid method

22
Linkage methods

• Single linkage method (nearest neighbour): the distance between two clusters is the minimum distance among all possible distances between observations belonging to the two clusters.

• Complete linkage method (furthest neighbour): nests two clusters using as a basis the maximum distance between observations belonging to the separate clusters.

• Average linkage method: the distance between two clusters is the average of all distances between observations in the two clusters.

23
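To make the three definitions concrete, the sketch below takes two small made-up clusters and computes the single, complete and average linkage distances as the minimum, maximum and mean of all pairwise distances between them.

import numpy as np
from scipy.spatial.distance import cdist

cluster_1 = np.array([[1.0, 1.0], [2.0, 1.5]])
cluster_2 = np.array([[5.0, 4.0], [6.0, 5.0], [5.5, 4.5]])

D = cdist(cluster_1, cluster_2)   # all pairwise distances between the two clusters

print(D.min())    # single linkage (nearest neighbour)
print(D.max())    # complete linkage (furthest neighbour)
print(D.mean())   # average linkage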
Hierarchical Clustering
 Single Linkage
 Clustering criterion based on the shortest distance

 Complete Linkage
 Clustering criterion based on the longest distance

24
Hierarchical Clustering (Contd.)

 Average Linkage
 Clustering criterion based on the average distance

25
Ward algorithm

1. The sum of squared distances is computed within each of the clusters, considering all distances between observations within the same cluster
2. The algorithm proceeds by choosing the aggregation of two clusters which generates the smallest increase in the total sum of squared distances
• It is a computationally intensive method, because at each step all the sums of squared distances need to be computed, together with all potential increases in the total sum of squared distances for each possible aggregation of clusters (a small numerical sketch follows this slide)

26
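A sketch, on two invented clusters, of the quantity Ward's method minimizes in its usual centroid-based (error sum of squares) formulation: the increase in the total within-cluster sum of squared deviations that a candidate merge would cause.

import numpy as np

def within_ss(cluster):
    # Sum of squared distances of the observations from their cluster centroid
    centroid = cluster.mean(axis=0)
    return np.sum((cluster - centroid) ** 2)

A = np.array([[1.0, 1.0], [1.5, 2.0], [1.2, 1.1]])
B = np.array([[5.0, 4.0], [5.5, 4.5]])

merged = np.vstack([A, B])
increase = within_ss(merged) - (within_ss(A) + within_ss(B))
print(increase)   # Ward chooses the merge with the smallest such increase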
Hierarchical Clustering (Contd.)

 Ward's Method
 Based on the loss of information resulting from grouping of
the objects into clusters (minimize within cluster variation)

27
Centroid method

• The distance between two clusters is the distance between the two centroids
• Centroids are the cluster averages for each of the variables
  - each cluster is defined by a single set of coordinates, the averages of the coordinates of all individual observations belonging to that cluster
• Difference between the centroid and the average linkage method:
  - Centroid: computes the average of the coordinates of the observations belonging to an individual cluster
  - Average linkage: computes the average of the distances between observations in two separate clusters

28
Hierarchical Clustering (Contd.)

 Centroid Method

 Based on the distance between the group centroids (the point


whose coordinates are the means of all the observations in the
cluster)

29
Non-hierarchical Clustering

• Sequential Threshold
  - A cluster center is selected and all objects within a prespecified threshold value are grouped
• Parallel Threshold
  - Several cluster centers are selected and objects within the threshold level are assigned to the nearest center
• Optimizing
  - Objects can later be reassigned to clusters on the basis of optimizing some overall criterion measure

30
Non-hierarchical clustering:
K-means method

1. The number k of clusters is fixed
2. An initial set of k "seeds" (aggregation centres) is provided
   - the first k elements, or
   - other seeds (randomly selected or explicitly defined)
3. Given a certain fixed threshold, all units are assigned to the nearest cluster seed
4. New seeds are computed
5. Go back to step 3 until no reclassification is necessary
Units can be reassigned in successive steps (optimising partitioning), as in the sketch below.

31
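A compact sketch of the k-means steps listed above in plain NumPy (synthetic data; the first k observations serve as seeds, and empty-cluster handling is omitted for brevity).

import numpy as np

rng = np.random.default_rng(1)
# Two synthetic groups of 20 observations each, measured on 2 variables
X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(5, 1, (20, 2))])

k = 2
centers = X[:k].copy()                  # step 2: the first k elements serve as initial seeds
labels = np.full(len(X), -1)            # no unit is assigned yet

for _ in range(100):                    # safeguard on the number of iterations
    # step 3: assign every unit to the nearest cluster seed
    dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    new_labels = dists.argmin(axis=1)
    if np.array_equal(new_labels, labels):
        break                           # step 5: stop when no reclassification occurs
    labels = new_labels
    # step 4: recompute the seeds as the means of the current clusters
    centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])

print(centers)                          # final cluster centres
print(np.bincount(labels))              # cluster sizes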
Non-hierarchical threshold methods

• Sequential threshold methods
  - a prior threshold is fixed and units within that distance are allocated to the first seed
  - a second seed is selected and the remaining units are allocated, etc.
• Parallel threshold methods
  - more than one seed is considered simultaneously
• When reallocation is possible after each stage, the methods are termed optimizing procedures

32
Hierarchical vs. non-hierarchical methods

Hierarchical methods:
• No decision about the number of clusters is needed
• Problems when data contain a high level of error
• Can be very slow; preferable with small data-sets
• Initial decisions are more influential (one-step only)
• At each step they require computation of the full proximity matrix

Non-hierarchical methods:
• Faster, more reliable, work with large data sets
• Need to specify the number of clusters
• Need to set the initial seeds
• Only cluster distances to seeds need to be computed in each iteration

33
Description of Variables

Variable Description | Name in Output | Scale Values
Willingness to Export (Y1) | Will | 1 (definitely not interested) to 5 (definitely interested)
Level of Interest in Seeking Govt Assistance (Y2) | Govt | 1 (definitely not interested) to 5 (definitely interested)
Employee Size (X1) | Size | Greater than zero
Firm Revenue (X2) | Rev | In millions of dollars
Years of Operation in the Domestic Market (X3) | Years | Actual number of years
Number of Products Currently Produced by the Firm (X4) | Prod | Actual number
Training of Employees (X5) | Train | 0 (no formal program) or 1 (existence of a formal program)
Management Experience in International Operation (X6) | Exp | 0 (no experience) or 1 (presence of experience)

34
Export Data – K-means Clustering Results

Initial Cluster Centers (Cluster 1 / 2 / 3):
  Will 4 / 1 / 5; Govt 5 / 1 / 4; Train 1 / 0 / 0; Experience 1 / 0 / 1; Years 6.0 / 7.0 / 4.5; Prod 5 / 2 / 11; Mod_size 5.80 / 2.80 / 2.90; Mod_Rev 1.00 / .90 / .90

Final Cluster Centers (Cluster 1 / 2 / 3):
  Will 4 / 2 / 5; Govt 4 / 2 / 4; Train 1 / 0 / 1; Experience 1 / 0 / 1; Years 6.3 / 6.7 / 6.0; Prod 6 / 3 / 10; Mod_size 4.97 / 3.60 / 5.42; Mod_Rev 1.76 / 1.74 / 1.21

(Mod_size is Size rescaled by 10; Mod_Rev is Revenue rescaled by 1000)

Distances between Final Cluster Centers:
  Clusters 1-2: 3.899; Clusters 1-3: 4.193; Clusters 2-3: 7.602
35
Export Data – K-means Clustering
Results (contd.)
ANOVA (Cluster Mean Square, df = 2; Error Mean Square, df = 117)

Variable | Cluster Mean Square | Error Mean Square | F | Sig.
Will | 58.540 | .683 | 85.710 | .000
Govt | 34.297 | .750 | 45.717 | .000
Train | 2.228 | .177 | 12.565 | .000
Experience | 3.640 | .142 | 25.590 | .000
Years | 4.091 | .690 | 5.932 | .004
Prod | 298.924 | 1.377 | 217.038 | .000
Mod_size | 32.451 | .537 | 60.391 | .000
Mod_Rev | 2.252 | .873 | 2.580 | .080

Number of Cases in each Cluster:
  Cluster 1: 56; Cluster 2: 46; Cluster 3: 18; Valid: 120; Missing: 0
36
Determining the optimal number of clusters from hierarchical methods

• Graphical
  - dendrogram
  - scree diagram
• Statistical
  - Arnold's criterion
  - pseudo F statistic
  - pseudo t2 statistic
  - cubic clustering criterion (CCC)

37
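A sketch of where the graphical criteria come from when working with SciPy rather than SPSS: the last column of the linkage matrix holds the merging distances, from which a scree/elbow reading can be made (data simulated for the example).

import numpy as np
from scipy.cluster.hierarchy import linkage

rng = np.random.default_rng(2)
# Three well-separated groups, so the final merges should show a clear jump in distance
X = np.vstack([rng.normal(c, 0.3, (15, 2)) for c in (0, 4, 8)])

Z = linkage(X, method='ward')
merge_dist = Z[:, 2]                       # merging distance at each step

# Merging distance of the step that leaves k clusters, for the last few steps
for k, d in zip(range(6, 0, -1), merge_dist[-6:]):
    print(f"{k} clusters: merge distance {d:.2f}")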
Dendrogram

[Figure: dendrogram of the rescaled distance at which clusters are combined (scale 0-25). The individual cases are listed on the left; for example, case 231 and case 275 are merged first, and the merging distance is relatively small. A dotted line represents the distance between clusters. As the algorithm proceeds, the merging distances become larger.]

38
Scree diagram

[Figure: scree diagram with the merging distance on the y-axis and the number of clusters (11 down to 1) on the x-axis. When one moves from 7 to 6 clusters, the merging distance increases noticeably.]

39
Statistical tests

• The rationale is that in an optimal partition, variability within clusters should be as small as possible, while variability between clusters should be maximized
• This principle is similar to the ANOVA F-test
• However, since hierarchical algorithms proceed sequentially, the probability distribution of statistics relating within-cluster and between-cluster variability is unknown and differs from the F distribution

40
Statistical criteria to detect the optimal partition

• Arnold's criterion: find the minimum of the determinant of the within-cluster sum of squares matrix W
• Pseudo F, CCC and pseudo t2: the ideal number of clusters should correspond to
  - a local maximum for the pseudo F and CCC, and
  - a small value of the pseudo t2 which increases in the next step (preferably a local minimum)
• These criteria are rarely consistent with each other, so the researcher should also rely on meaningful (interpretable) criteria
• Non-parametric methods (SAS) also allow one to determine the number of clusters
  - k-th nearest neighbour method:
    · the researcher sets a parameter (k)
    · for each k the method returns the optimal number of clusters
    · if this optimal number is the same for several values of k, then the determination of the number of clusters is relatively robust

41
Suggested approach:
a 2-step procedure

1. First perform a hierarchical method to define the number of clusters
2. Then use the k-means procedure to actually form the clusters (see the sketch after this slide)

The reallocation problem
• Rigidity of hierarchical methods: once a unit is classified into a cluster, it cannot be moved to other clusters in subsequent steps
• The k-means method allows a reclassification of all units in each iteration
• If some uncertainty about the number of clusters remains after running the hierarchical method, one may also run several k-means clustering procedures and apply the previously discussed statistical tests to choose the best partition

42
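A sketch of the suggested two-step approach with SciPy and scikit-learn (simulated data): Ward's hierarchical clustering is used to settle on the number of clusters and to provide initial centroids, which are then passed to k-means so that units can be reallocated.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.cluster import KMeans

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(c, 0.5, (30, 2)) for c in (0, 3, 6, 9)])

# Step 1: hierarchical (Ward) clustering to settle on k and obtain provisional groups
Z = linkage(X, method='ward')
k = 4                                      # chosen e.g. from the dendrogram / scree of merge distances
hier_labels = fcluster(Z, t=k, criterion='maxclust')
seeds = np.array([X[hier_labels == j].mean(axis=0) for j in range(1, k + 1)])

# Step 2: k-means started from the Ward centroids; units can now be reallocated
km = KMeans(n_clusters=k, init=seeds, n_init=1, random_state=0).fit(X)
print(np.bincount(km.labels_))
print(km.cluster_centers_)

Starting k-means from hierarchical centroids rather than arbitrary seeds mirrors the robustness point made in the k-means example slides later on.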
The SPSS two-step procedure

• The observations are preliminarily aggregated into clusters using a hybrid hierarchical procedure named the cluster feature tree
• This first step produces a number of pre-clusters, which is higher than the final number of clusters, but much smaller than the number of observations
• In the second step, a hierarchical method is used to classify the pre-clusters, obtaining the final classification
• During this second clustering step, it is possible to determine the number of clusters
• The user can either fix the number of clusters or let the algorithm search for the best one according to information criteria which are also based on goodness-of-fit measures

43
Evaluation and validation

• Goodness-of-fit of a cluster analysis can be assessed with
  - the ratio between the sum of squared errors and the total sum of squares (similar to R2), as in the sketch below
  - the root mean square standard deviation within clusters
• Validation: if the identified cluster structure (number of clusters and cluster characteristics) is real, it should not change when the analysis is repeated under different conditions
• Validation approaches
  - use of different samples to check whether the final output is similar
  - split the sample into two groups when no other samples are available
  - check for the impact of initial seeds / order of cases (hierarchical approach) on the final partition
  - check for the impact of the selected clustering method

44
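A sketch of the R2-like goodness-of-fit measure mentioned above, using scikit-learn's k-means on simulated data: the within-cluster sum of squared errors relative to the total sum of squares, whose complement can be read as the share of variability accounted for by the partition.

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(c, 0.6, (40, 2)) for c in (0, 4)])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

ss_total = np.sum((X - X.mean(axis=0)) ** 2)  # total sum of squares
ss_within = km.inertia_                       # within-cluster sum of squared errors
r_squared = 1 - ss_within / ss_total          # share of variability explained by the partition
print(round(r_squared, 3))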
Cluster analysis in SPSS

Three types of cluster analysis are available in SPSS.

[Screenshot of the corresponding SPSS menu]

45
Hierarchical cluster analysis

[Screenshot of the SPSS Hierarchical Cluster Analysis dialog, with annotations:]
• Variables selected for the analysis
• Create a new variable with cluster membership for each case
• Clustering method
• Statistics required in the analysis
• Graphs (dendrogram) and options; advice: no plots

46
Statistics

[Screenshot of the Statistics dialog, with annotations:]
• The agglomeration schedule is a table which shows the steps of the clustering procedure, indicating which cases (clusters) are merged and the merging distance
• Cluster membership: shows the cluster membership of individual cases, only for a sub-set of solutions
• The proximity matrix contains all distances between cases (it may be huge)

47
Plots

[Screenshot of the Plots dialog, with annotations:]
• The dendrogram shows the clustering process, indicating which cases are aggregated and the merging distance; with many cases, the dendrogram is hardly readable
• The icicle plot (which can be restricted to cover a small range of clusters) shows at what stage cases are clustered; the plot is cumbersome and slows down the analysis (advice: no icicle)

48
Method

[Screenshot of the Method dialog, with annotations:]
• Choose a hierarchical algorithm
• Choose the type of data (interval, counts, binary) and the appropriate measure
• Specify whether the variables (values) should be standardized before the analysis; z-scores return variables with zero mean and unit variance, and other standardizations are possible; distance measures can also be transformed

49
Cluster memberships

If the number of clusters has been decided (or at least a range of solutions), it is possible to save the cluster membership for each case into new variables.

50
The example: agglomeration schedule

Last 10 stages of the process (10 to 1 clusters). As the algorithm proceeds towards the end, the distance increases.

Stage | Number of clusters | Cluster 1 | Cluster 2 | Distance | Diff. Dist
490 | 10 | 8 | 12 | 544.4 |
491 | 9 | 8 | 11 | 559.3 | 14.9
492 | 8 | 3 | 7 | 575.0 | 15.7
493 | 7 | 3 | 366 | 591.6 | 16.6
494 | 6 | 3 | 6 | 610.6 | 19.0
495 | 5 | 3 | 37 | 636.6 | 26.0
496 | 4 | 13 | 23 | 663.7 | 27.1
497 | 3 | 3 | 13 | 700.8 | 37.1
498 | 2 | 1 | 8 | 754.1 | 53.3
499 | 1 | 1 | 3 | 864.2 | 110.2

51
Scree diagram

[Figure: scree diagram (not provided by SPSS but created from the agglomeration schedule), plotting the merging distance (about 590 to 840) against the number of clusters (7 down to 1). It shows a larger distance increase when the cluster number goes below 4, suggesting an elbow at 4 clusters.]

52
Hierarchical solution with 4 clusters (Ward method)

Mean profiles by cluster (Ward Method; clusters 1 / 2 / 3 / 4 / Total):
Case Number (N%) | 26.6% / 20.2% / 23.8% / 29.4% / 100.0%
Household size | 1.4 / 3.2 / 1.9 / 3.1 / 2.4
Gross current income of household | 238.0 / 1158.9 / 333.8 / 680.3 / 576.9
Age of Household Reference Person | 72 / 44 / 40 / 48 / 52
EFS: Total Food & non-alcoholic beverage | 28.8 / 64.4 / 29.2 / 60.6 / 45.4
EFS: Total Clothing and Footwear | 8.8 / 64.3 / 9.2 / 19.0 / 23.1
EFS: Total Housing, Water, Electricity | 25.1 / 77.7 / 33.5 / 39.1 / 41.8
EFS: Total Transport costs | 17.7 / 147.8 / 24.6 / 57.1 / 57.2
EFS: Total Recreation | 29.6 / 146.2 / 39.4 / 63.0 / 65.3

53
K-means solution (4 clusters)

[Screenshot of the SPSS K-Means Cluster Analysis dialog, with annotations:]
• Variables to be used
• Number of clusters (fixed)
• Ask for one iteration only (classify only) or more iterations before stopping the algorithm
• It is possible to read a file with initial seeds or write the final seeds to a file

54
K-means options

[Screenshot of the K-means options, with annotations:]
• Creates a new variable with cluster membership for each case
• Improve the algorithm by allowing for more iterations and running means (seeds are recomputed at each stage)
• More options, including an ANOVA table with statistics

55
Results from k-means (initial seeds chosen by SPSS)

Final Cluster Centers (clusters 1 / 2 / 3 / 4):
Household size | 2.0 / 2.0 / 2.8 / 3.2
Gross current income of household | 264.5 / 241.1 / 791.2 / 1698.1
Age of Household Reference Person | 56 / 75 / 46 / 45
EFS: Total Food & non-alcoholic beverage | 37.3 / 22.2 / 54.1 / 66.2
EFS: Total Clothing and Footwear | 14.0 / 28.0 / 31.7 / 48.4
EFS: Total Housing, Water, Electricity | 34.7 / 100.3 / 47.3 / 64.5
EFS: Total Transport costs | 28.4 / 10.4 / 78.3 / 156.8
EFS: Total Recreation | 39.6 / 3013.1 / 74.4 / 125.9

Number of Cases in each Cluster: 1: 292; 2: 1; 3: 155; 4: 52; Valid: 500; Missing: 0

The k-means algorithm is sensitive to outliers, and SPSS chose an improbable amount for recreation expenditure as an initial seed for cluster 2 (probably an outlier due to misrecording or an exceptional expenditure).

56
Results from k-means: initial seeds from hierarchical clustering

Mean profiles by cluster (clusters 1 / 2 / 3 / 4 / Total):
Case Number (N%) | 32.6% / 10.2% / 33.6% / 23.6% / 100.0%
Household size | 1.7 / 3.1 / 2.5 / 2.9 / 2.4
Gross current income of household | 163.5 / 1707.3 / 431.8 / 865.9 / 576.9
Age of Household Reference Person | 60 / 45 / 50 / 46 / 52
EFS: Total Food & non-alcoholic beverage | 31.3 / 65.5 / 45.1 / 56.8 / 45.4
EFS: Total Clothing and Footwear | 12.3 / 48.4 / 19.1 / 32.7 / 23.1
EFS: Total Housing, Water, Electricity | 29.8 / 65.3 / 41.9 / 48.1 / 41.8
EFS: Total Transport costs | 24.6 / 156.8 / 37.4 / 87.5 / 57.2
EFS: Total Recreation | 30.3 / 126.8 / 67.9 / 83.4 / 65.3

The first cluster is now larger, but it still represents older and poorer households. The other clusters are not very different from the ones obtained with the Ward algorithm, indicating a certain robustness of the results.

57
2-step clustering

[Screenshot of the SPSS two-step clustering dialog, with annotations:]
• It is possible to make a distinction between categorical and continuous variables
• The search for the optimal number of clusters may be constrained
• An information criterion is used to choose the optimal partition
• One may also ask for plots and descriptive statistics

58
Options

[Screenshot of the Options dialog, with annotations:]
• It is advisable to control for outliers (OLs), because the analysis is usually sensitive to OLs
• It is possible to choose which variables should be standardized prior to running the analysis
• More advanced options are available for better control of the procedure

59
Output

• Results are not satisfactory
• With no prior decision on the number of clusters, two clusters are found, one with a single observation and the other with the remaining 499 observations
• Allowing for outlier treatment does not improve results
• Setting the number of clusters to four produces these results:

Cluster Distribution (N, % of Combined, % of Total):
  Cluster 1: 2 (.4%, .4%); Cluster 2: 5 (1.0%, 1.0%); Cluster 3: 490 (98.2%, 98.2%); Cluster 4: 2 (.4%, .4%); Combined: 499 (100.0%, 100.0%); Total: 499 (100.0%)

It seems that the two-step clustering is biased towards finding a macro-cluster. This might be due to the fact that the number of observations is relatively small, but the combination of the Ward algorithm with the k-means algorithm is more effective.

60
Discussion

• It might seem that cluster analysis is too sensitive to the researcher's choices
• This is partly due to the relatively small data-set and possibly to correlation between variables
• However, all outputs point to a segment with older and poorer households and another with younger and larger households with high expenditures
• By intensifying the search and adjusting some of the properties, cluster analysis does help identify homogeneous groups
• "Moral": cluster analysis needs to be adequately validated, and it may be risky to run a single cluster analysis and take the results as truly informative, especially in the presence of outliers

61
Assumptions and Limitations of Cluster Analysis

• Assumptions
  - The basic measure of similarity on which the clustering is based is a valid measure of the similarity between the objects
  - There is theoretical justification for structuring the objects into clusters
• Limitations
  - It is difficult to evaluate the quality of the clustering
  - It is difficult to know exactly which clusters are very similar and which objects are difficult to assign
  - It is difficult to select a clustering criterion and program on any basis other than availability
62
