Cluster Analysis - Part A

Cluster analysis is a multivariate technique that groups similar objects based on their characteristics, facilitating data exploration and pattern recognition. It has various applications across fields such as biology, marketing, and city planning, and involves a systematic process of selecting variables, preprocessing data, and choosing appropriate clustering algorithms. Despite its utility, cluster analysis faces criticisms regarding its descriptive nature and dependency on variable selection, which can affect the generalizability of results.

Cluster Analysis

Overview
 What is Cluster Analysis?
 Why Cluster Analysis?
 Clustering Process
 Cluster Analysis
  Clustering Algorithms
  Cluster Validity Analysis
 Difficulties and drawbacks
 Conclusions
Cluster Analysis Defined
Cluster analysis . . . groups objects (respondents, products, firms, variables, etc.) so that each object is similar to the other objects in the cluster and different from objects in all the other clusters.
Cluster analysis . . . is a group of multivariate techniques whose primary purpose is to group objects based on the characteristics they possess.
 It has been referred to as Q analysis, typology construction, classification analysis, and numerical taxonomy.
 The essence of all clustering approaches is the classification of data as suggested by “natural” groupings of the data themselves.
What is Clustering?
Cluster: A collection of data objects
 similar (or related) to one another within the same group
 dissimilar (or unrelated) to the objects in other groups
Cluster analysis (or clustering, data segmentation, …)
 Finding similarities between data according to the characteristics found in the data and grouping similar data objects into clusters
What is Clustering?
 Clustering can be done on any data: genes, samples, documents, time points in a time series, etc.
 The algorithm will treat all inputs as a set of n numbers or an n-dimensional vector.
 Clustering is appropriate when there is no a priori knowledge about the data.
Clustering Applications (Use Cases)
 Biology: taxonomy of living things: kingdom, phylum, class, order, family, genus and species
 Information retrieval: document clustering
 Land use: identification of areas of similar land use in an earth observation database
 Marketing: help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs
Clustering Applications (Use Cases)
 City planning: identifying groups of houses according to their house type, value, and geographical location
 Earthquake studies: observed earthquake epicenters should be clustered along continent faults
 Climate: understanding the earth's climate by finding patterns in atmospheric and ocean data
 Economic science: grouping countries on various economic parameters
 Insurance: identifying groups of motor insurance policy holders with a high average claim cost
Clustering as a Preprocessing Tool (Utility)
 Summarization: preprocessing for regression, PCA, classification, and association analysis
 Compression: image processing, e.g., vector quantization
 Finding k-nearest neighbors: localizing search to one or a small number of clusters
 Outlier detection: outliers are often viewed as those “far away” from any cluster
Criticisms of Cluster Analysis
The following must be addressed by conceptual rather than empirical support:
 Cluster analysis is descriptive, atheoretical, and non-inferential.
 It will always create clusters, regardless of the actual existence of any structure in the data.
 The cluster solution is not generalizable because it is totally dependent upon the variables used as the basis for the similarity measure.
Quality: What Is Good Clustering?
A good clustering method will produce high quality clusters with
 high intra-class similarity: cohesive within clusters
 low inter-class similarity: distinctive between clusters
The quality of a clustering method depends on
 the similarity measure used by the method,
 its implementation, and
 its ability to discover some or all of the hidden patterns.
Intra-cluster and Inter-cluster Distances
[Figure: intra-cluster distances are minimized, while inter-cluster distances are maximized.]
Why Cluster Analysis?
 Clustering is a process by which you can explore your
data in an efficient manner.
 Visualization of data can help you review the data
quality.
 Assumption: “Guilt by association” – similar gene
expression patterns may indicate a biological
relationship.
What Can We Do With Cluster Analysis?
 Determine if statistically different clusters exist.
 Identify the meaning of the clusters.
 Explain how the clusters can be used.


The Clustering Process
1. Select the clustering variables
2. Data preprocessing
3. Select the clustering procedure
4. Select a measure of similarity or dissimilarity / distance
The Clustering Process (continued)
5. Decide on the number of clusters
6. Validate and interpret the cluster solution
Objectives of Cluster Analysis
Primary goal = to partition a set of objects into two or more groups based on the similarity of the objects for a set of specified characteristics (the cluster variate).
There are two key issues:
 The research questions being addressed, and
 The variables used to characterize objects in the clustering process.
Selection of Clustering Variables
 Even though this choice is critical, it is rarely treated as such. Instead, a mixture of intuition and data availability guides most analyses in marketing practice.
 However, faulty assumptions may lead to improper market segmentation and, consequently, to deficient marketing strategies.
 The types of variables used for cluster analysis provide different solutions and, thereby, influence targeting strategies.
Thus, great care should be taken when selecting the clustering variables!
Selection of Clustering Variables
Theoretical, conceptual and practical considerations must be observed when selecting clustering variables for cluster analysis:
 Only variables that relate specifically to the objectives of the cluster analysis should be included, since “irrelevant” variables cannot be excluded from the analysis once it begins.
 Inclusion of even one or two irrelevant variables may distort a clustering solution.
 Variables selected should describe the similarity between objects in terms that are relevant to the problem.
 Variables should be selected based on past research, theory, or a consideration of the hypotheses being tested.
Selection of Clustering Variables
To facilitate the choice of clustering variables, we should consider the following guiding questions:
 Do the variables differentiate sufficiently between the clusters?
 Is the relation between the sample size and the number of clustering variables reasonable?
 Are the clustering variables highly correlated?
 Are the data underlying the clustering variables of high quality?
Selection of Clustering Variables
 Generally, clusters based on psychometric variables are more homogeneous, and these consumers respond more consistently to marketing actions.
 However, consumers in these clusters are frequently hard to identify, as such variables are not easily measured.
Selection of Clustering Variables
 Conversely, clusters determined by sociodemographic variables are easy to identify but are also more heterogeneous, which complicates targeting efforts.
 Consequently, researchers frequently combine different variables such as lifestyle characteristics and demographic variables, benefiting from each one’s strengths.
DataSet Variables for Clustering
 With CoolAir Airways you will arrive on time (e1)
 CoolAir Airways provides you with a very pleasant travel experience (e5)
 CoolAir Airways gives you a sense of safety (e9)
 CoolAir Airways makes traveling uncomplicated (e21)
 CoolAir Airways provides you with interesting on-board entertainment, service, and information sources (e22)
Research Design in Cluster Analysis

Four Questions:
Is the sample size adequate?
Can outliers be detected and, if so, should they be
deleted?
Should the data be standardized?
How should object similarity be measured?
Sample Size Requirement
The sample size required is not based on statistical
considerations for inference testing, but rather:
 Sufficient size is needed to ensure representativeness of the
population and its underlying structure, particularly small
groups within the population.
 Minimum group sizes are based on the relevance of each
group to the research question and the confidence needed in
characterizing that group.

Sample Size Requirement
From a statistical perspective, every additional variable requires an over-proportional increase in observations to ensure valid results. Recent rules-of-thumb are as follows:
 Qiu and Joe (2009) recommend a sample size at least ten times the number of clustering variables multiplied by the number of clusters.
 Dolnicar et al. (2014 & 2016) recommend using a sample size of 70 times the number of clustering variables.
 They find that increasing the sample size from 10 to 30 times the number of clustering variables substantially improves the clustering solution. This improvement levels off subsequently, but is still noticeable up to a sample size of approximately 100 times the number of clustering variables.
Sample Size Requirement
With five clustering variables, our analysis meets even the most conservative rule-of-thumb regarding minimum sample size requirements.
 Specifically, according to Dolnicar et al., the cluster analysis should draw on 100 times the number of clustering variables to optimize cluster recovery.
 As our sample size of 1065 is clearly higher than 5 * 100 = 500, we can proceed with the analysis.
Data Pre-processing Decisions
Handling Outliers
Outliers can severely distort the representativeness of the results if they appear as structure (clusters) inconsistent with the research objectives.
They should be removed if the outlier represents:
 Aberrant observations not representative of the population.
 Observations of small or insignificant segments within the population which are of no interest to the research objectives.
They should be retained if they represent an under-sampling/poor representation of relevant groups in the population; in this case, the sample should be augmented to ensure representation of these groups.
Data Pre-processing Decisions
Handling Outliers
Outliers can be identified based on the similarity
measure by:
 Finding observations with large distances from all
other observations
 Graphic profile diagrams highlighting outlying cases
 Their appearance in cluster solutions as single-member
or very small clusters

Data Pre-processing Decisions
Data Scaling
Decide whether data scaling needs to be done:
 If all the variables are of the same type and measured in the same units, data scaling is not required.
 If the variables are of the same type but measured in vastly different units, the data have to be standardized and scaled.
In cluster analysis, however, range standardization (normalization), e.g., to a range of 0 to 1, typically works better than standard z-score standardization.
Missing values should be estimated or the affected cases eliminated.
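As an illustration, here is a minimal Python sketch of range standardization using scikit-learn's MinMaxScaler; the DataFrame df and its values are hypothetical stand-ins for the CoolAir variables.

import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical stand-in for the five clustering variables
df = pd.DataFrame({"e1": [85, 90, 60], "e5": [70, 95, 50],
                   "e9": [88, 92, 85], "e21": [80, 91, 62],
                   "e22": [75, 89, 40]})

# Range standardization: rescale every variable to the interval [0, 1]
X = MinMaxScaler(feature_range=(0, 1)).fit_transform(df)
print(X)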
Data Pre-processing Decisions
Assessing Multicollinearity
If there is strong correlation between the variables, they are not sufficiently unique to identify distinct market segments.
 If highly correlated variables are used for cluster analysis, the specific aspects that these variables cover will be overrepresented in the clustering solution.
Input variables should be examined for substantial multicollinearity and, if present:
✓ Reduce the variables to equal numbers in each set of correlated measures, or
✓ Use a distance measure that compensates for the correlation, like Mahalanobis distance.
Assessing Multicollinearity
Analyze >> Correlate >> Bivariate
[Screenshots: the Bivariate Correlations dialog and the resulting correlation matrix.]
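The same check can be sketched in Python with pandas; df is a random stand-in for the dataset, and the 0.70 cut-off anticipates the threshold used on the next slide.

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((200, 5)) * 100,          # stand-in data
                  columns=["e1", "e5", "e9", "e21", "e22"])

corr = df.corr()                     # pairwise Pearson correlations
print(corr.round(3))

# Flag any off-diagonal correlation above the 0.70 threshold
mask = (corr.abs() > 0.70) & (corr.abs() < 1.0)
print(corr.where(mask).dropna(how="all").dropna(axis=1, how="all"))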
Assessing Multicollinearity
 The results show that collinearity is not at a critical level.
 The variables e1 and e21 show the highest correlation of 0.613, which is clearly lower than the 0.70 threshold.
 We can therefore proceed with the analysis, using all five clustering variables.
Deciding Distance Metric
Similarity measures calculated across the entire set of clustering variables allow for the grouping of observations and their comparison to each other.
Distance measures are most often used as measures of similarity, with higher values representing greater dissimilarity (distance between cases), not similarity.
 Euclidean (straight-line) distance is the most common measure of distance.
 Squared Euclidean distance is the sum of squared differences and is the recommended measure for the centroid and Ward's methods of clustering.
Deciding Distance Metric
 Mahalanobis distance accounts for variable intercorrelations and weights each variable equally. When variables are highly intercorrelated, Mahalanobis distance is most appropriate.
 Less frequently used are correlational measures, where large values do indicate similarity.
Given the sensitivity of some procedures to the similarity measure used, the researcher should employ several distance measures and compare the results from each with other results or theoretical/known patterns.
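For concreteness, a minimal scipy sketch of the three distance measures discussed above; the two respondent vectors and the covariance data are hypothetical.

import numpy as np
from scipy.spatial import distance

x = np.array([90.0, 85.0, 88.0, 92.0, 80.0])   # hypothetical respondent A
y = np.array([60.0, 70.0, 85.0, 65.0, 50.0])   # hypothetical respondent B

print(distance.euclidean(x, y))      # straight-line distance
print(distance.sqeuclidean(x, y))    # sum of squared differences

# Mahalanobis distance weights by the inverse covariance of the data
data = np.random.default_rng(0).normal(70, 15, size=(200, 5))  # stand-in data
VI = np.linalg.inv(np.cov(data, rowvar=False))
print(distance.mahalanobis(x, y, VI))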
Clustering Algorithms
There are innumerable clustering algorithms. However, the traditional algorithms for clustering can be divided into three main categories:
1. Partitional Clustering
2. Hierarchical Clustering (our focus)
3. Model-based Clustering
Classification of Clustering Procedures
Clustering Procedures
 Hierarchical
  Agglomerative (AGNES)
   Linkage methods: single linkage, complete linkage, average linkage
   Variance methods: Ward’s method
   Centroid methods
  Divisive (DIANA)
 Nonhierarchical
  K-Means, PAM, CLARA
Hierarchical Clustering
Clusters are created in levels actually creating sets of
clusters at each level.
 Agglomerative
– Initially each item in its own cluster
– Iteratively clusters are merged together
– Bottom Up
 Divisive
– Initially all items in one cluster
– Large clusters are successively divided
– Top Down
Hierarchical Clustering
Hierarchical clustering aims at the more ambitious task of obtaining a hierarchy of clusters, called a dendrogram, that shows how the clusters are related to each other.
[Figure: dendrogram with a % similarity axis from 50 to 100; the height of a node in the dendrogram represents the similarity of the two children clusters.]
Hierarchical Clustering in SPSS
Analyze >> Classify >> Hierarchical
[Screenshots: the Hierarchical Cluster Analysis dialog and its Statistics, Plots, Method, and Save options.]

Generating a Scree Plot
Right-click on the Agglomeration Schedule.
[Screenshot: the agglomeration schedule output.]
Scree Plot
Deciding on the Number of Clusters
[Screenshot: scree plot of merge distances.]
The scree plot shows that there is no clear elbow indicating a suitable number of clusters to retain. This result is quite common for datasets with several hundred objects.
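Outside SPSS, the same scree plot can be built from scipy's agglomeration schedule; X is a random stand-in for the standardized data.

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage

X = np.random.default_rng(0).random((200, 5))   # stand-in data
Z = linkage(X, method="ward")                   # agglomeration schedule
merge_dist = Z[:, 2]                            # distance at each merge

# Plot the distances of the last 20 merges against the clusters remaining
n_clusters = np.arange(len(merge_dist), 0, -1)
plt.plot(n_clusters[-20:], merge_dist[-20:], marker="o")
plt.gca().invert_xaxis()          # read left to right: 20 ... 1 clusters
plt.xlabel("Number of clusters")
plt.ylabel("Merge distance")
plt.show()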
Dendrogram
Deciding on the Number of Clusters
Reading the dendrogram from left to right, we find that the vast majority of objects are merged at very small distances. The dendrogram also shows that the step from a three-cluster solution to a two-cluster solution occurs at a greatly increased distance. Hence, we assume a three-cluster solution and continue with the analysis.
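The dendrogram and the three-cluster cut can be reproduced with scipy as well, on the same stand-in data.

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

X = np.random.default_rng(0).random((200, 5))    # stand-in data
Z = linkage(X, method="ward")

dendrogram(Z, truncate_mode="lastp", p=30)       # show only the last 30 merges
plt.ylabel("Merge distance")
plt.show()

labels = fcluster(Z, t=3, criterion="maxclust")  # cut the tree into 3 clusters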
Saving the Cluster Membership
Once you have identified that there are three clusters, rerun the clustering process, this time saving the cluster membership.
[Screenshots: the rerun clustering dialogs with the cluster membership saved.]

Cluster Size
A new variable called CLU3_1 will be added to your dataset. This variable shows the cluster membership of each observation.
To learn about the size of the clusters, go to ► Analyze ► Descriptive Statistics ► Frequencies, enter CLU3_1 into the Variable(s) box, and click OK.
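A pandas counterpart to the CLU3_1 variable and the frequency table; the column name CLU3_1 mirrors SPSS, and the data are random stand-ins.

import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((200, 5)) * 100,
                  columns=["e1", "e5", "e9", "e21", "e22"])

Z = linkage(df.to_numpy(), method="ward")
df["CLU3_1"] = fcluster(Z, t=3, criterion="maxclust")  # cluster membership

print(df["CLU3_1"].value_counts().sort_index())        # cluster sizes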
Agglomerative Hierarchical Clustering
1. Calculate the distances between all data points.
2. Assign each data point to its own initial cluster.
3. Calculate the distance metrics between all clusters.
4. Repeatedly merge the most similar clusters into a higher-level cluster.
5. Repeat steps 3 and 4 until only the highest-level cluster remains.

The key operation is the computation of the proximity of two clusters. Different approaches to defining the distance between clusters distinguish the different algorithms.
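To make the five steps concrete, here is a deliberately naive single-linkage implementation; it is illustrative only, and scipy's optimized linkage is the practical choice.

import numpy as np

def naive_agglomerative(X, n_clusters):
    # Steps 1-2: pairwise point distances; each point starts as its own cluster
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    clusters = [[i] for i in range(len(X))]
    # Steps 3-5: repeatedly merge the two closest clusters
    while len(clusters) > n_clusters:
        best, best_dist = (0, 1), np.inf
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single linkage: cluster distance = closest pair of points
                dist = min(d[i, j] for i in clusters[a] for j in clusters[b])
                if dist < best_dist:
                    best_dist, best = dist, (a, b)
        a, b = best
        clusters[a].extend(clusters[b])   # merge cluster b into cluster a
        del clusters[b]
    return clusters

print(naive_agglomerative(np.random.default_rng(1).random((12, 2)), 3))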
Conducting Cluster Analysis
Select a Clustering Procedure – Linkage Methods
 The single linkage method is based on minimum distance, or the nearest neighbor rule. At every stage, the distance between two clusters is the distance between their two closest points.
 The complete linkage method is similar to single linkage, except that it is based on the maximum distance, or the furthest neighbor approach. In complete linkage, the distance between two clusters is calculated as the distance between their two furthest points.
 The average linkage method works similarly. However, in this method, the distance between two clusters is defined as the average of the distances between all pairs of objects, where one member of the pair is from each of the clusters.
Agglomerative Clustering – Linkage Methods
[Figure: single linkage uses the minimum distance between Cluster 1 and Cluster 2, complete linkage the maximum distance, and average linkage the average distance.]
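Because the linkage rule defines the cluster-to-cluster distance, it changes the resulting solution; a quick comparison on stand-in data:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.default_rng(0).random((200, 5))   # stand-in data
for method in ("single", "complete", "average"):
    labels = fcluster(linkage(X, method=method), t=3, criterion="maxclust")
    print(method, np.bincount(labels)[1:])      # cluster sizes under each rule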
Agglomerative Clustering – Variance Methods
The variance methods attempt to generate clusters that minimize the within-cluster variance.
 A commonly used variance method is Ward's procedure. For each cluster, the means for all the variables are computed. Then, for each object, the squared Euclidean distance to the cluster means is calculated. These distances are summed for all the objects. At each stage, the two clusters combined are those with the smallest increase in the overall sum of squared within-cluster distances.
 In the centroid methods, the distance between two clusters is the distance between their centroids (means for all the variables). Every time objects are grouped, a new centroid is computed.
Of the hierarchical methods, average linkage and Ward's method have been shown to perform better than the other procedures.
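Ward's procedure is also available directly in scikit-learn; a minimal sketch on stand-in data:

import numpy as np
from sklearn.cluster import AgglomerativeClustering

X = np.random.default_rng(0).random((200, 5))   # stand-in data
ward = AgglomerativeClustering(n_clusters=3, linkage="ward")
labels = ward.fit_predict(X)   # merges that minimize within-cluster variance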
Agglomerative Clustering – Variance Methods
[Figure: Ward’s procedure and the centroid method.]
Cluster Validity
• For supervised classification we have a variety of measures to evaluate how good our model is:
 – accuracy, precision, recall
• For cluster analysis, the analogous question is how to evaluate the “goodness” of the resulting clusters.
• But “clusters are in the eye of the beholder”!
• Then why do we want to evaluate them?
 – To avoid finding patterns in noise
 – To compare clustering algorithms
 – To compare two sets of clusters
 – To compare two clusters
Assess Reliability and Validity
1. Perform cluster analysis on the same data using
different distance measures. Compare the results
across measures to determine the stability of the
solutions.
2. Use different methods of clustering and compare the
results.
3. Split the data randomly into halves. Perform
clustering separately on each half. Compare cluster
centroids across the two subsamples.
Assess Reliability and Validity
4. Delete variables randomly. Perform clustering based
on the reduced set of variables. Compare the results
with those obtained by clustering based on the entire
set of variables.
5. In nonhierarchical clustering, the solution may depend on the order of cases in the data set. Make multiple runs using different orderings of the cases until the solution stabilizes.
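Agreement between two of the solutions above (e.g., check 2, different methods) can be quantified with the adjusted Rand index, where 1.0 means identical partitions; a sketch on stand-in data:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.metrics import adjusted_rand_score

X = np.random.default_rng(0).random((200, 5))   # stand-in data
labels_ward = fcluster(linkage(X, method="ward"), t=3, criterion="maxclust")
labels_avg = fcluster(linkage(X, method="average"), t=3, criterion="maxclust")

# 1.0 = identical partitions; values near 0 = chance-level agreement
print(adjusted_rand_score(labels_ward, labels_avg))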
The Silhouette Plot Is Very Useful in Locating Groups in a Cluster Solution

Range of SC    Interpretation
0.71 – 1.0     A strong structure has been found
0.51 – 0.70    A reasonable structure has been found
0.26 – 0.50    The structure is weak and could be artificial
< 0.25         No substantial structure has been found
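The average silhouette coefficient (SC) for a solution can be computed with scikit-learn and read against the table above; stand-in data again.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.metrics import silhouette_score

X = np.random.default_rng(0).random((200, 5))   # stand-in data
labels = fcluster(linkage(X, method="ward"), t=3, criterion="maxclust")
print(silhouette_score(X, labels))              # average SC for the solution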
Interpreting and Profiling Clusters
 Involves examining the cluster centroids. The centroids enable us to describe each cluster by assigning it a name or label.
 Profile the clusters in terms of variables that were not used for clustering. These may include demographic, psychographic, product usage, media usage, or other variables.
Validate & Interpret the Clustering Solution
Next, we would like to compute the centroids of our clustering variables.
 To do so, split up the dataset using the Split File command (► Data ► Split File).
 Select the option Compare groups and choose CLU3_1 as the grouping variable.
[Screenshot: the Split File dialog.]
Validate & Interpret the Clustering Solution
Next, go to ► Analyze ► Descriptive Statistics ► Descriptives. Add the clustering variables e1, e5, e9, e21, and e22 and request the mean, minimum, maximum, and standard deviation.
[Screenshots: the Descriptives dialog and the resulting per-cluster output.]
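The Split File + Descriptives combination amounts to a grouped aggregation; a pandas sketch with stand-in data and the CLU3_1 column from the earlier sketch:

import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((200, 5)) * 100,
                  columns=["e1", "e5", "e9", "e21", "e22"])
X = df.to_numpy()
df["CLU3_1"] = fcluster(linkage(X, method="ward"), t=3, criterion="maxclust")

# Per-cluster mean, min, max, and standard deviation of each variable
print(df.groupby("CLU3_1").agg(["mean", "min", "max", "std"]))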
Validate & Interpret the Clustering Solution
Comparing the variable means across the three clusters, we find that:
 Respondents in the first cluster have extremely high expectations regarding all five performance features, as evidenced in average values of around 90 and higher.
 Respondents in the second cluster strongly emphasize punctuality (e1), while comfort (e5) and, particularly, entertainment aspects (e22) are less important.
 Respondents in the third cluster do not express high expectations in general, except in terms of security (e9).
Validate & Interpret the Clustering Solution
Based on these results, we could label:
 the first cluster “The Demanding Traveler,”
 the second cluster “Ontime is Enough,” and
 the third cluster “No Thrills.”
We could further check whether these differences in means are significant by using a one-way ANOVA.
Analyze >> Compare Means >> One-Way ANOVA
[Screenshots: the One-Way ANOVA dialog and its Options button.]

Validate & Interpret the Clustering Solution
Since all the values in the final column Sig. are below 0.05, we can conclude that all the clustering variables’ means differ significantly across at least two of the three segments.
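The same one-way ANOVAs can be sketched with scipy, testing each clustering variable across the three segments; with random stand-in data the Sig. values will of course not match the slide.

import numpy as np
import pandas as pd
from scipy.stats import f_oneway
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((200, 5)) * 100,
                  columns=["e1", "e5", "e9", "e21", "e22"])
X = df.to_numpy()
df["CLU3_1"] = fcluster(linkage(X, method="ward"), t=3, criterion="maxclust")

# One F-test per clustering variable, comparing the three clusters
for var in ["e1", "e5", "e9", "e21", "e22"]:
    groups = [g[var].to_numpy() for _, g in df.groupby("CLU3_1")]
    f_stat, p = f_oneway(*groups)
    print(f"{var}: F = {f_stat:.2f}, Sig. = {p:.4f}")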
Analyze >> Descriptive Statistics >> Crosstabs
[Screenshots: the Crosstabs dialog and the resulting output.]
Validate & Interpret the Clustering Solution
The results show that the first cluster primarily consists of leisure travelers, whereas the majority of respondents in the second and third clusters are business travelers.
 With a p-value of 0.003, the χ²-test statistic indicates a significant relationship between these two variables.
 However, the strength of the variables’ association is rather small, as indicated by the Contingency Coefficient of 0.108.
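The crosstab and χ²-test map onto pandas.crosstab and scipy's chi2_contingency; the traveler-type column here is a hypothetical stand-in for the variable used on the slide.

import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "CLU3_1": rng.integers(1, 4, size=200),                     # stand-in clusters
    "traveler": rng.choice(["business", "leisure"], size=200),  # hypothetical column
})

ct = pd.crosstab(df["CLU3_1"], df["traveler"])  # cluster-by-traveler-type table
chi2, p, dof, expected = chi2_contingency(ct)
print(ct)
print(f"chi2 = {chi2:.2f}, p = {p:.4f}")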
Requirements and Challenges
• Scalability
 – Clustering all the data instead of only samples
• Ability to deal with different types of attributes
 – Numerical, binary, categorical, ordinal, linked, and mixtures of these
• Constraint-based clustering
 – Users may give inputs on constraints
 – Use domain knowledge to determine input parameters
• Interpretability and usability
• Others
 – Discovery of clusters with arbitrary shape
 – Ability to deal with noisy data
 – Incremental clustering and insensitivity to input order
 – High dimensionality
ANY QUESTIONS ???
