Cluster Analysis: Abu Bashar
Cluster analysis
It is a class of techniques used to classify cases into groups that are relatively homogeneous within themselves and heterogeneous between each other.
Homogeneity (similarity) and heterogeneity (dissimilarity) are measured on the basis of a defined set of variables.
Market segmentation
Cluster analysis is especially useful for market segmentation.
Segmenting a market means dividing its potential consumers into separate sub-sets where:
Consumers in the same group are similar with respect to a given set of characteristics
Consumers belonging to different groups are dissimilar with respect to the same set of characteristics
This allows one to calibrate the marketing mix differently according to the target consumer group
The best-known measure of distance is the Euclidean distance, which is the concept we use in everyday life for spatial coordinates.
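As a minimal sketch of this measure, the function below computes the Euclidean distance between two cases described by the same set of variables. The function name and the two example consumers are illustrative assumptions, not part of the original slides.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two cases measured on the same variables."""
    if len(a) != len(b):
        raise ValueError("both cases must have the same number of variables")
    # Square root of the sum of squared differences, variable by variable
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Two hypothetical consumers described by two variables each
consumer_a = (3.0, 4.0)
consumer_b = (0.0, 0.0)
print(euclidean_distance(consumer_a, consumer_b))  # 5.0
```

In practice the variables are often standardised first, so that one variable with a large scale does not dominate the distance.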
Clustering procedures
Hierarchical procedures
Agglomerative (start from n clusters to get to 1 cluster)
Divisive (start from 1 cluster to get to n clusters)
Hierarchical clustering
Agglomerative:
Each of the n observations initially constitutes a separate cluster.
The two clusters that are most similar according to some distance rule are aggregated, so that in step 1 there are n-1 clusters.
In the second step another cluster is formed (n-2 clusters) by nesting the two clusters that are most similar, and so on.
There is a merging in each step until all observations end up in a single cluster in the final step.
Divisive
All observations are initially assumed to belong to a single cluster.
The most dissimilar observation is extracted to form a separate cluster.
In step 1 there will be 2 clusters, in the second step 3 clusters, and so on, until the final step produces as many clusters as the number of observations.
The number of clusters determines the stopping rule for the algorithms
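The agglomerative procedure described above can be sketched as follows. This is only an illustration under stated assumptions: the slides do not name a distance rule, so the sketch assumes Euclidean distance with single linkage (the distance between two clusters is the smallest distance between any pair of their members), and the function names and sample points are hypothetical.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def single_linkage(c1, c2, points):
    # Assumed distance rule: smallest distance between any two members
    return min(euclidean(points[i], points[j]) for i in c1 for j in c2)

def agglomerate(points, n_clusters=1):
    """Merge the two closest clusters repeatedly until n_clusters remain.

    Returns the clusters (lists of point indices) and the merging
    distance recorded at each step.
    """
    clusters = [[i] for i in range(len(points))]  # start: n separate clusters
    merge_distances = []
    while len(clusters) > n_clusters:             # stopping rule: number of clusters
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = single_linkage(clusters[a], clusters[b], points)
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        clusters[a] = clusters[a] + clusters[b]   # nest cluster b into cluster a
        del clusters[b]
        merge_distances.append(d)
    return clusters, merge_distances

# Hypothetical data: two tight pairs and one isolated point
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (20.0, 20.0)]
clusters, distances = agglomerate(points, n_clusters=3)
print(clusters)   # [[0, 1], [2, 3], [4]]
print(distances)  # [1.0, 1.0]
```

The recorded merging distances are the quantities a dendrogram or scree diagram would plot. Other linkage rules (complete, average, Ward) would change only the `single_linkage` function.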
Non-hierarchical clustering
These algorithms do not follow a hierarchy and produce a single partition.
Knowledge of the number of clusters (c) is required.
In the first step, initial cluster centres (the seeds) are determined for each of the c clusters, either by the researcher or by the software (usually the first c observations, or observations chosen at random).
Each iteration allocates observations to each of the c clusters, based on their distance from the cluster centres.
Cluster centres are computed again and observations may be reallocated to the nearest cluster in the next iteration.
When no observations can be reallocated or a stopping rule is met, the process stops.
1. The number k of clusters is fixed
2. An initial set of k seeds (aggregation centres) is provided:
   First k elements
   Other seeds (randomly selected or explicitly defined)
3. Given a certain fixed threshold, all units are assigned to the nearest cluster seed
4. New seeds are computed
5. Go back to step 3 until no reclassification is necessary
Units can be reassigned in successive steps (optimising partitioning)
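The steps above can be sketched in a few lines. This is a minimal illustration rather than a production implementation: it assumes Euclidean distance, takes the first k elements as seeds, recomputes each seed as the mean of its assigned units, and leaves out the fixed threshold mentioned in step 3. The function names, the `max_iter` safeguard and the sample data are hypothetical.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def mean_point(members):
    # Coordinate-wise mean of a non-empty list of points
    return tuple(sum(coords) / len(members) for coords in zip(*members))

def k_means(points, k, max_iter=100):
    seeds = [tuple(p) for p in points[:k]]        # step 2: first k elements as seeds
    assignment = None
    for _ in range(max_iter):                     # safeguard against endless looping
        # Step 3: assign every unit to its nearest seed
        new_assignment = [
            min(range(k), key=lambda c: euclidean(p, seeds[c])) for p in points
        ]
        if new_assignment == assignment:          # step 5: no reclassification needed
            break
        assignment = new_assignment
        # Step 4: compute new seeds as the mean of each cluster's units
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:                           # keep the old seed if a cluster is empty
                seeds[c] = mean_point(members)
    return assignment, seeds

# Hypothetical data: two visibly separate groups, interleaved
points = [(1.0, 1.0), (9.0, 9.0), (1.0, 2.0), (8.0, 9.0), (2.0, 1.0), (9.0, 8.0)]
assignment, seeds = k_means(points, k=2)
print(assignment)  # [0, 1, 0, 1, 0, 1]
```

Because the result depends on the initial seeds, it is common to run the procedure from several different starting seeds and compare the partitions.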
Non-hierarchical methods
Faster, more reliable, works with large data sets
Need to specify the number of clusters
Need to set the initial seeds
Only cluster distances to seeds need to be computed in each iteration
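A rough count shows why computing only the distances to the seeds matters for large data sets. The sketch below compares the number of pairwise distances among n observations, n(n-1)/2, which a hierarchical procedure needs at the outset, with the n × c distances to the seeds computed in each non-hierarchical iteration. The sample size, number of clusters and number of iterations are made-up values for illustration.

```python
def pairwise_distance_count(n):
    # Every pair of observations once: n(n-1)/2
    return n * (n - 1) // 2

def seed_distance_count(n, c, iterations):
    # Each observation against each of the c seeds, once per iteration
    return n * c * iterations

# Hypothetical study: 10,000 consumers, 5 segments, 20 iterations
n, c, iterations = 10_000, 5, 20
print(pairwise_distance_count(n))            # 49995000
print(seed_distance_count(n, c, iterations)) # 1000000
```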
In segmentation studies, c represents the number of potential separate segments.
Preferable approach: let the data speak
Hierarchical approach, with the optimal partition identified through statistical tests (stopping rule for the algorithm)
However, the detection of the optimal number of clusters is subject to a high degree of uncertainty
If the research objectives allow a choice rather than estimating the number of clusters, non-hierarchical methods are the way to go.
Dendrogram
[Figure: dendrogram. The horizontal axis is the rescaled distance at which clusters are combined (0 to 25). The dotted line represents the distance between clusters. Case labels visible in the figure: 275, 145, 181, 333, 117, 336, 337, 209, 431, 178.]
Scree diagram
[Figure: scree diagram. Merging distance on the y-axis (0 to 12), number of clusters on the x-axis (11 down to 1).]
When one moves from 7 to 6 clusters, the merging distance increases noticeably
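Reading a scree diagram this way can be mimicked with a simple rule: look for the largest increase in merging distance and stop just before it. The sketch below is only a heuristic under that assumption, not one of the statistical tests mentioned earlier, and the merging distances in it are invented for illustration (they are not the values from the figure).

```python
def suggest_n_clusters(merge_distance):
    """Return the number of clusters just before the largest jump.

    merge_distance maps 'number of clusters after the merge' to the
    merging distance at which that solution was reached.
    """
    counts = sorted(merge_distance, reverse=True)  # e.g. 10, 9, ..., 1
    best_jump, best_n = 0.0, None
    for before, after in zip(counts, counts[1:]):
        jump = merge_distance[after] - merge_distance[before]
        if jump > best_jump:
            best_jump, best_n = jump, before
    return best_n

# Hypothetical merging distances with a sharp rise between 7 and 6 clusters
merge_distance = {10: 0.5, 9: 0.7, 8: 0.9, 7: 1.2, 6: 4.0,
                  5: 4.5, 4: 5.2, 3: 6.0, 2: 7.5, 1: 9.0}
print(suggest_n_clusters(merge_distance))  # 7
```

With these made-up values the rule suggests keeping 7 clusters, since the move to 6 clusters would join groups that are much further apart than in any earlier merge.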