Cluster Analysis - Part A
Overview
What is Cluster Analysis?
Why Cluster Analysis?
Clustering Process
Cluster Analysis
Clustering Algorithms
Cluster Validity Analysis
Difficulties and drawbacks
Conclusions
Cluster Analysis Defined
Cluster analysis . . . groups objects (respondents, products, firms,
variables, etc.) so that each object is similar to the other objects in
the cluster and different from objects in all the other clusters.
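This definition can be illustrated with a small pure-Python sketch (hypothetical 2-D points, not the airline dataset): objects within a cluster should be close to one another and far from objects in other clusters.

```python
from itertools import combinations
from math import dist  # Euclidean distance (Python 3.8+)

# Hypothetical objects described by two variables (illustrative values only)
cluster_a = [(1.0, 1.0), (1.5, 1.2), (1.2, 0.8)]
cluster_b = [(8.0, 8.0), (8.4, 7.6), (7.8, 8.3)]

def mean_intra_distance(cluster):
    """Average pairwise distance between objects of the same cluster."""
    pairs = list(combinations(cluster, 2))
    return sum(dist(p, q) for p, q in pairs) / len(pairs)

def mean_inter_distance(c1, c2):
    """Average distance between objects of two different clusters."""
    return sum(dist(p, q) for p in c1 for q in c2) / (len(c1) * len(c2))

# Objects are similar within a cluster and different across clusters:
print(mean_intra_distance(cluster_a) < mean_inter_distance(cluster_a, cluster_b))
```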
Clustering Applications (Use Cases)
City-planning: Identifying groups of houses according
to their house type, value, and geographical location
Earthquake studies: Observed earthquake epicenters should be clustered along continent faults
Climate: Understanding the Earth's climate by finding patterns in atmospheric and ocean data
Economic Science: Grouping countries on various
economic parameters
Insurance: Identifying groups of motor insurance
policy holders with a high average claim cost
Clustering as a Preprocessing Tool (Utility)
Summarization:
Preprocessing for regression, PCA, classification, and
association analysis
Compression:
Image processing: vector quantization
Finding K-nearest Neighbors
Localizing search to one or a small number of clusters
Outlier Detection
Outliers are often viewed as those “far away” from any cluster
Criticisms of Cluster Analysis
[Figure: intra-cluster distances are minimized, while inter-cluster distances are maximized.]
Why Cluster Analysis?
Clustering is a process by which you can explore your
data in an efficient manner.
Visualization of data can help you review the data
quality.
Assumption: “Guilt by association” – similar gene
expression patterns may indicate a biological
relationship.
What Can We Do With Cluster Analysis?
Select a measure of similarity or dissimilarity (distance)
The Clustering Process
Objectives of Cluster Analysis
Selection of Clustering Variables
Dataset Variables for Clustering
With CoolAir Airways you will arrive on time (e1)
CoolAir Airways provides you with a very pleasant
travel experience (e5)
CoolAir Airways gives you a sense of safety (e9)
CoolAir Airways makes traveling uncomplicated (e21)
CoolAir Airways provides you with interesting on-board entertainment, service, and information sources (e22)
Research Design in Cluster Analysis
Four Questions:
Is the sample size adequate?
Can outliers be detected and, if so, should they be
deleted?
Should the data be standardized?
How should object similarity be measured?
Sample Size Requirement
The sample size required is not based on statistical
considerations for inference testing, but rather:
Sufficient size is needed to ensure representativeness of the
population and its underlying structure, particularly small
groups within the population.
Minimum group sizes are based on the relevance of each
group to the research question and the confidence needed in
characterizing that group.
Sample Size Requirement
From a statistical perspective, every additional variable
requires an over-proportional increase in observations to
ensure valid results.
Recent rules-of-thumb are as follows:
Qiu and Joe (2009) recommend a sample size at least ten times the
number of clustering variables multiplied by the number of clusters.
Dolnicar et al. (2014 & 2016) recommend using a sample size of 70
times the number of clustering variables.
They say increasing the sample size from 10 to 30 times the number
of clustering variables substantially improves the clustering solution.
This improvement levels off subsequently, but is still noticeable up
to a sample size of approximately 100 times the number of clustering
variables.
Sample Size Requirement
With five clustering variables, our analysis meets even
the most conservative rule-of-thumb regarding
minimum sample size requirements.
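The rules-of-thumb above are simple arithmetic. A short sketch, assuming the five clustering variables of this dataset and the three-cluster solution adopted later in the analysis:

```python
# Rule-of-thumb minimum sample sizes for k clustering variables.
# Values taken from the text; a 3-cluster solution is assumed.
n_vars, n_clusters = 5, 3

qiu_joe = 10 * n_vars * n_clusters   # Qiu and Joe (2009): 10 x variables x clusters
dolnicar = 70 * n_vars               # Dolnicar et al. (2014/2016): 70 x variables

print(qiu_joe)   # 150
print(dolnicar)  # 350
```

With 350 observations as the stricter requirement, a dataset of several hundred respondents satisfies both guidelines.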
Data Pre-processing Decisions
Data Scaling
Decide whether data scaling should be done or not:
If all variables are of the same type and measured in the same units, data scaling is not required.
If the variables are of the same type but measured in vastly different units, the data must be standardised and scaled.
In cluster analysis, however, range standardization
(normalization) (e.g., to a range of 0 to 1) typically
works better.
Missing values should be estimated or the affected observations eliminated.
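The two scaling options can be sketched in pure Python (hypothetical ratings, not the airline dataset): range standardization rescales each variable to [0, 1], while classic z-standardization gives mean 0 and standard deviation 1.

```python
from statistics import mean, stdev

def range_standardize(values):
    """Range standardization (min-max normalization) to the [0, 1] interval."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def z_standardize(values):
    """Classic z-scores: mean 0, (sample) standard deviation 1."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

scores = [61, 75, 88, 94, 100]  # hypothetical expectation ratings (0-100 scale)
print(range_standardize(scores))  # minimum maps to 0.0, maximum to 1.0
print(z_standardize(scores))
```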
Data Pre-processing Decisions
Assessing Multicollinearity
If there is strong correlation between the variables, they are not
sufficiently unique to identify distinct market segments.
Assessing Multicollinearity
The results show that collinearity is not at a critical
level.
The variables e1 and e21 show the highest correlation
of 0.613, which is clearly lower than the 0.70
threshold.
We can therefore proceed with the analysis, using all
five clustering variables.
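The collinearity check amounts to inspecting pairwise Pearson correlations against the 0.70 threshold mentioned above. A sketch with hypothetical responses (the variable names e1, e21, e22 follow the dataset; the values are made up):

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient between two equally long lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical responses to three expectation items (illustrative values only)
data = {
    "e1":  [80, 90, 70, 95, 60],
    "e21": [78, 88, 72, 90, 65],
    "e22": [30, 85, 40, 20, 75],
}

# Flag any pair of clustering variables above the 0.70 threshold
for a, b in combinations(data, 2):
    r = pearson(data[a], data[b])
    if abs(r) > 0.70:
        print(f"{a} and {b} are highly collinear (r = {r:.2f})")
```

In this toy data e1 and e21 would be flagged; in the actual analysis their correlation of 0.613 stays below the threshold.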
Deciding Distance Metric
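Common choices for metric data can be sketched as follows (a minimal illustration; squared Euclidean distance is the usual default for Ward's method discussed later):

```python
from math import sqrt

def euclidean(p, q):
    """Straight-line distance, the most common metric for metric data."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def squared_euclidean(p, q):
    """Squared Euclidean distance, typically used with Ward's method."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def city_block(p, q):
    """Manhattan (city-block) distance: sum of absolute differences."""
    return sum(abs(a - b) for a, b in zip(p, q))

p, q = (80, 90, 70), (60, 85, 75)  # hypothetical rating profiles
print(euclidean(p, q), squared_euclidean(p, q), city_block(p, q))
```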
Clustering Algorithms
There are innumerable clustering algorithms.
However, the traditional clustering algorithms can be divided into three main categories:
1. Partitional Clustering
Our Focus
2. Hierarchical Clustering
3. Model-based Clustering
Classification of Clustering Procedures
Clustering Procedures
Hierarchical: Agglomerative (e.g., AGNES, Ward's Method) and Divisive (e.g., DIANA)
Nonhierarchical (partitioning methods)
[Dendrogram figure: the vertical axis shows % of similarity (50 to 100); the height of a node in the dendrogram represents the similarity of the two child clusters.]
Hierarchical Clustering
Analyze >> Classify >> Hierarchical
43
Hierarchical Clustering – Analysis
Statistics Option
Hierarchical Clustering
Plots Option
Hierarchical Clustering
Method Option
Hierarchical Clustering
Save Option
Generating a Scree Plot
Right Click on the Agglomeration Schedule
Scree Plot
Deciding on the Number of Clusters
The scree plot shows that there is no clear elbow indicating a suitable number
of clusters to retain. This result is quite common for datasets with several
hundred objects.
Dendrogram
Reading the dendrogram from left to right, we find that the vast majority of
objects are merged at very small distances. The dendrogram also shows that
the step from a three-cluster solution to a two-cluster solution occurs at a
greatly increased distance. Hence, we assume a three-cluster solution and
continue with the analysis.
Saving the Cluster Membership
Once you have identified that there are three clusters, rerun the clustering
process with the following change
Cluster Size
A new variable called CLU3_1 will be added at the end of your dataset. This variable records each object's cluster membership.
Agglomerative Hierarchical Clustering
1. Calculate the distances between all data points
2. Assign each data point to its own initial cluster
3. Calculate the distance metrics between all clusters
4. Merge the two most similar clusters into a higher-level cluster
5. Repeat steps 3 and 4 until all objects are merged into a single cluster
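Steps 1 to 5 can be sketched in pure Python, here with average linkage and hypothetical 2-D points (not the airline dataset), stopping once a desired number of clusters remains:

```python
from math import dist

def average_linkage(c1, c2):
    """Average pairwise distance between two clusters of points."""
    return sum(dist(p, q) for p in c1 for q in c2) / (len(c1) * len(c2))

def agglomerate(points, n_clusters):
    """Start with singleton clusters, then repeatedly merge the closest
    pair of clusters until only n_clusters remain."""
    clusters = [[p] for p in points]   # each point is its own initial cluster
    while len(clusters) > n_clusters:
        # find the pair of clusters with the smallest average distance
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: average_linkage(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]  # merge the closest pair
        del clusters[j]
    return clusters

pts = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)]
print(agglomerate(pts, 3))  # three well-separated groups
```

Recording the merge distance at each step would reproduce the agglomeration schedule used earlier for the scree plot.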
[Figure: Average linkage. The distance between Cluster 1 and Cluster 2 is the average distance between all pairs of objects, one from each cluster.]
Agglomerative Clustering - Variance Method
The variance methods attempt to generate clusters to minimize the
within-cluster variance.
A commonly used variance method is Ward's procedure. For each cluster, the means of all variables are computed. Then, for each object, the squared Euclidean distance to the cluster means is calculated, and these distances are summed over all objects. At each stage, the two clusters whose merger produces the smallest increase in the overall within-cluster sum of squares are combined.
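The merge criterion just described can be sketched directly (hypothetical 2-D points): compute each cluster's within-cluster sum of squared distances to its mean (SSE), and merge the pair whose combination increases the total SSE the least.

```python
def sse(cluster):
    """Within-cluster sum of squared Euclidean distances to the cluster mean."""
    d = len(cluster[0])
    mean = [sum(p[k] for p in cluster) / len(cluster) for k in range(d)]
    return sum(sum((p[k] - mean[k]) ** 2 for k in range(d)) for p in cluster)

def ward_increase(c1, c2):
    """Increase in total within-cluster SSE caused by merging c1 and c2.
    Ward's procedure merges the pair with the smallest increase."""
    return sse(c1 + c2) - sse(c1) - sse(c2)

a = [(0.0, 0.0), (0.0, 2.0)]
b = [(0.0, 1.0)]   # lies between the points of a: merging costs nothing
c = [(9.0, 9.0)]   # far away: merging is expensive
print(ward_increase(a, b) < ward_increase(a, c))  # nearby merges are cheaper
```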
Centroid Method: the distance between two clusters is the distance between their centroids.
Cluster Validity
• For supervised classification we have a variety of measures to
evaluate how good our model is
– Accuracy, precision, recall
Range of SC (Silhouette Coefficient) Interpretation
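The silhouette coefficient ranges from -1 (the object is likely misassigned) through 0 (on a cluster boundary) to +1 (firmly inside its cluster). A minimal sketch for a single object, using hypothetical points:

```python
from math import dist

def silhouette(point, own, others):
    """Silhouette coefficient for one point: (b - a) / max(a, b), where
    a = mean distance to the point's own cluster and b = mean distance
    to the nearest other cluster."""
    a = sum(dist(point, p) for p in own if p != point) / max(len(own) - 1, 1)
    b = min(sum(dist(point, p) for p in c) / len(c) for c in others)
    return (b - a) / max(a, b)

own = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0)]
other = [(8.0, 8.0), (9.0, 8.0)]
s = silhouette((0.0, 0.0), own, [other])
print(round(s, 2))  # close to +1: the point sits firmly in its cluster
```

Averaging this value over all objects gives the overall SC for a clustering solution.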
Validate & Interpret the Clustering Solution
Next, go to ► Analyze ► Descriptive Statistics ► Descriptives
Add clustering variables e1, e5, e9, e21, and e22 and ask for mean,
min, max and std dev
Validate & Interpret the Clustering Solution
Comparing the variable means across the three clusters, we find
that
Respondents in the first cluster have extremely high
expectations regarding all five performance features, as
evidenced in average values of around 90 and higher.
Respondents in the second cluster strongly emphasize
punctuality (e1), while comfort (e5) and, particularly,
entertainment aspects (e22) are less important.
Respondents in the third cluster do not express high expectations in general, except in terms of safety (e9).
Validate & Interpret the Clustering Solution
Based on these results, we could label
the first cluster “The Demanding Traveler,”
the second cluster “Ontime is Enough,” and
the third cluster “No Thrills.”
Analyze >> Compare Means >> One-Way ANOVA
Options button
Validate & Interpret the Clustering Solution
Since all the values in the final column Sig. are below 0.05, we can conclude that all the clustering variables' means differ significantly across at least two of the three segments.
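SPSS produces the ANOVA table directly, but the F statistic behind the Sig. column can be sketched in pure Python (hypothetical e1 ratings per cluster, not the real data):

```python
def one_way_anova_f(groups):
    """F statistic for a one-way ANOVA: between-group mean square
    divided by within-group mean square."""
    all_values = [v for g in groups for v in g]
    grand = sum(all_values) / len(all_values)
    k, n = len(groups), len(all_values)
    ss_between = sum(len(g) * (sum(g) / len(g) - grand) ** 2 for g in groups)
    ss_within = sum(sum((v - sum(g) / len(g)) ** 2 for v in g) for g in groups)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Hypothetical e1 ratings in three clusters (illustrative values only)
clusters = [[95, 92, 97], [90, 88, 91], [60, 65, 58]]
f = one_way_anova_f(clusters)
print(f > 1)  # a large F suggests the cluster means differ
```

A large F (and hence a small Sig. value) indicates that at least two cluster means differ on that variable.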
Analyze >> Descriptive Statistics >> Crosstabs
Validate & Interpret the Clustering Solution
The results show that the first cluster
primarily consists of leisure travelers,
whereas the majority of respondents in
the second and third cluster are business
travelers.
Requirements and Challenges
• Scalability
– Clustering all the data instead of only on samples
• Ability to deal with different types of attributes
– Numerical, binary, categorical, ordinal, linked, and mixture of these
• Constraint-based clustering
– Users may provide constraints as input