
Cluster Analysis
• Used to classify objects (cases) into homogeneous groups called clusters.
• Objects within a cluster tend to be similar to one another and dissimilar to objects in other clusters.
• In cluster analysis, the groups are suggested by the data rather than defined in advance.
An Ideal Clustering Situation
[Scatterplot of Variable 1 vs. Variable 2: clusters are compact and clearly separated.]

More Common Clustering Situation
[Scatterplot of Variable 1 vs. Variable 2: clusters overlap, with no clear boundaries.]
Statistics Associated with Cluster Analysis
• Agglomeration schedule. Gives information on the objects or cases being combined at each stage of a hierarchical clustering process.
• Cluster centroid. Mean values of the variables for all the cases in a particular cluster.
• Cluster centers. Initial starting points in nonhierarchical clustering. Clusters are built around these centers, or seeds.
• Cluster membership. Indicates the cluster to which each object or case belongs.
Statistics Associated with Cluster Analysis
• Dendrogram (tree graph). A graphical device for displaying clustering results.
  - Vertical lines represent clusters that are joined together.
  - The position of the line on the scale indicates the distance at which clusters were joined.
• Distances between cluster centers. These distances indicate how separated the individual pairs of clusters are. Clusters that are widely separated are distinct, and therefore desirable.
Conducting Cluster Analysis
Formulate the Problem

Select a Distance Measure

Select a Clustering Procedure

Decide on the Number of Clusters

Interpret and Profile Clusters

Assess the Validity of Clustering


Formulating the Problem
• The most important task is selecting the variables on which the clustering is based.
• Inclusion of even one or two irrelevant
variables may distort a clustering solution.
• Variables selected should describe the
similarity between objects in terms that are
relevant to the marketing research problem.
• Should be selected based on past research,
theory, or a consideration of the hypotheses
being tested.
Select a Similarity Measure
• Similarity can be measured by correlation coefficients or by distances.
• The most commonly used measure of similarity is the Euclidean distance. The city-block (Manhattan) distance is also used.
• If the variables are measured in vastly different units, the data must be standardized. Outliers should also be eliminated.
• Use of different similarity/distance measures may lead to different clustering results.
• Hence, it is advisable to use different measures and compare the results.
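The two distance measures and z-score standardization can be sketched in a few lines of pure Python (an illustration, not part of the original slides; the two cases come from the attitudinal data shown later in the deck):

```python
import math

def euclidean(x, y):
    # Square root of the sum of squared differences across variables.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def city_block(x, y):
    # Sum of absolute differences (Manhattan distance).
    return sum(abs(a - b) for a, b in zip(x, y))

def standardize(column):
    # Z-scores: mean 0, standard deviation 1, so no variable dominates
    # the distance simply because of its measurement units.
    n = len(column)
    mean = sum(column) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in column) / n)
    return [(v - mean) / sd for v in column]

case_1 = [6, 4, 7, 3, 2, 3]  # case 1 of the attitudinal data
case_2 = [2, 3, 1, 4, 5, 4]  # case 2
print(euclidean(case_1, case_2))   # 8.0
print(city_block(case_1, case_2))  # 16
```

As the slides note, the two measures can order pairs of cases differently, which is one reason to compare results across measures.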
Hierarchical Clustering Methods
• Hierarchical clustering is characterized by the
development of a hierarchy or tree-like structure.
  - Agglomerative clustering starts with each object in a separate cluster. Clusters are formed by grouping objects into bigger and bigger clusters.
  - Divisive clustering starts with all the objects grouped in a single cluster. Clusters are divided or split until each object is in a separate cluster.
• Agglomerative methods are commonly used in marketing
research. They consist of linkage methods, variance
methods, and centroid methods.
Hierarchical Agglomerative Clustering - Linkage Methods
• The single linkage method is based on the minimum distance, or the nearest-neighbor rule.
• The complete linkage method is based on the maximum distance, or the furthest-neighbor approach.
• In the average linkage method, the distance between two clusters is defined as the average of the distances between all pairs of objects, one drawn from each cluster.
Linkage Methods of Clustering
[Diagram: single linkage joins Cluster 1 and Cluster 2 by the minimum distance between their members; complete linkage by the maximum distance; average linkage by the average distance over all pairs of members.]
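The three linkage rules can be sketched as follows (a minimal illustration, not from the slides; the one-dimensional points are made up for the demo):

```python
# Two small clusters of one-dimensional points (made up for illustration).
cluster_1 = [1.0, 2.0, 3.0]
cluster_2 = [6.0, 8.0]

# All distances between one object from each cluster.
pair_distances = [abs(a - b) for a in cluster_1 for b in cluster_2]

single_linkage = min(pair_distances)                         # nearest neighbors
complete_linkage = max(pair_distances)                       # furthest neighbors
average_linkage = sum(pair_distances) / len(pair_distances)  # mean over all pairs
print(single_linkage, complete_linkage, average_linkage)     # 3.0 7.0 5.0
```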
Hierarchical Agglomerative Clustering - Variance and Centroid Methods
• Variance methods generate clusters so as to minimize the within-cluster variance.
• Ward's procedure is a commonly used variance method. For each cluster, the sum of squared distances to the centroid is calculated. At each step, the two clusters whose merger produces the smallest increase in the overall within-cluster sum of squares are combined.
• In the centroid methods, the distance between two clusters is the distance between their centroids (the means for all the variables).
• Of the hierarchical methods, average linkage and Ward's procedure have been shown to perform better than the other procedures.
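Ward's merge criterion can be sketched as the increase in total within-cluster sum of squares caused by a candidate merge (an illustration, not from the slides; the two-dimensional clusters are made up for the demo):

```python
def centroid(cluster):
    # Mean of each variable across the cases in the cluster.
    n = len(cluster)
    return [sum(case[j] for case in cluster) / n for j in range(len(cluster[0]))]

def within_ss(cluster):
    # Sum of squared Euclidean distances from each case to the centroid.
    c = centroid(cluster)
    return sum(sum((v - m) ** 2 for v, m in zip(case, c)) for case in cluster)

def ward_merge_cost(c1, c2):
    # Increase in total within-cluster SS if c1 and c2 were merged.
    # Ward's procedure merges the pair with the smallest such increase.
    return within_ss(c1 + c2) - within_ss(c1) - within_ss(c2)

a = [[1.0, 1.0], [2.0, 2.0]]  # a tight cluster (made-up points)
b = [[8.0, 8.0], [9.0, 9.0]]  # a distant tight cluster
print(ward_merge_cost(a, b))  # 98.0: merging distant clusters is costly
```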
Other Agglomerative Clustering Methods
[Diagrams of Ward's procedure and the centroid method.]
Idea Behind K-Means
• Algorithm for K-means clustering:
1. Partition the items into K initial clusters.
2. Assign each item to the cluster with the nearest centroid (mean).
3. Recalculate the centroids of the clusters receiving and losing the item.
4. Repeat steps 2 and 3 until no more reassignments occur.
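The steps above can be sketched as a minimal K-means in pure Python (not from the slides; this is the common "batch" variant that recalculates all centroids once per pass rather than after every single reassignment, and the example points and starting seeds are assumptions for the demo):

```python
import math

def kmeans(items, k, seeds, max_iter=100):
    # seeds: assumed starting centroids standing in for step 1's partition.
    centroids = [list(s) for s in seeds]
    for _ in range(max_iter):
        # Step 2: assign each item to the cluster with the nearest centroid.
        clusters = [[] for _ in range(k)]
        for item in items:
            j = min(range(k), key=lambda c: math.dist(item, centroids[c]))
            clusters[j].append(item)
        # Step 3: recalculate the centroid of every cluster.
        new = [[sum(col) / len(cl) for col in zip(*cl)] if cl else centroids[i]
               for i, cl in enumerate(clusters)]
        # Step 4: repeat until no centroid (hence no assignment) changes.
        if new == centroids:
            break
        centroids = new
    return centroids, clusters

points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]  # made-up data
centroids, clusters = kmeans(points, 2, seeds=[(0, 0), (10, 10)])
print(centroids)  # one centroid near each tight group of points
```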
Select a Clustering Procedure
• The hierarchical and nonhierarchical methods should be used in tandem.
  - First, an initial clustering solution is obtained using a hierarchical procedure (e.g., Ward's).
  - The number of clusters and cluster centroids so obtained are then used as inputs to the optimizing partitioning method.
• The choice of a clustering method and the choice of a distance measure are interrelated. For example, squared Euclidean distances should be used with Ward's and the centroid methods. Several nonhierarchical procedures also use squared Euclidean distances.
Decide on the Number of Clusters
• Theoretical, conceptual, or practical considerations can suggest the number of clusters.
• In hierarchical clustering, the distances at which clusters are combined (from the agglomeration schedule) can be used.
• Stop at the step where the distance measure makes a sudden jump: a large jump means two relatively dissimilar clusters are being merged.
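The "sudden jump" rule can be sketched as follows (an illustration, not from the slides; the agglomeration-schedule distances are hypothetical):

```python
# Distance at which two clusters were combined at each step
# (n cases produce n - 1 merge steps). Hypothetical schedule:
merge_distances = [0.5, 0.7, 0.9, 1.1, 1.4, 4.8, 6.0]

# Size of the jump between consecutive merge steps.
jumps = [b - a for a, b in zip(merge_distances, merge_distances[1:])]

# Stop just before the largest jump: that 1-based step is the last "cheap"
# merge, and cutting the dendrogram there leaves n_cases - step clusters.
step = jumps.index(max(jumps)) + 1
n_cases = len(merge_distances) + 1
n_clusters = n_cases - step
print(n_clusters)  # 3
```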
Interpreting and Profiling Clusters
• Involves examining the cluster centroids. The centroids enable us to describe each cluster by assigning it a name or label.
• Profile the clusters in terms of variables that were not used for clustering. These may include demographic, psychographic, product-usage, media-usage, or other variables.
Assess Reliability and Validity
1. Perform cluster analysis on the same data using different
distance measures. Compare the results across measures
to determine the stability of the solutions.
2. Use different methods of clustering and compare the results.
3. Split the data randomly into halves. Perform clustering
separately on each half. Compare cluster centroids across
the two subsamples.
4. Delete variables randomly. Perform clustering based on the
reduced set of variables. Compare the results with those
obtained by clustering based on the entire set of variables.
5. In nonhierarchical clustering, the solution may depend on the order of cases in the data set. Make multiple runs using different orderings of the cases until the solution stabilizes.
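One simple way to compare two solutions (steps 1, 2, and 3 above) is the Rand index, the fraction of case pairs on which the two solutions agree. This index is a standard tool rather than one prescribed by the slides, and the membership vectors below are made up for illustration:

```python
from itertools import combinations

def rand_index(labels_a, labels_b):
    # A pair of cases "agrees" if both solutions put them in the same
    # cluster, or both solutions put them in different clusters.
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = sum((labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
                for i, j in pairs)
    return agree / len(pairs)

# Hypothetical memberships for 8 cases from two different procedures.
ward_labels = [1, 1, 1, 2, 2, 3, 3, 3]
kmeans_labels = [1, 1, 2, 2, 2, 3, 3, 3]
print(rand_index(ward_labels, kmeans_labels))  # close to 1 → stable solution
```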
Example of Cluster Analysis
• Consumers were asked about their attitudes toward shopping. Six variables were selected:
• V1: Shopping is fun
V2: Shopping is bad for your budget
V3: I combine shopping with eating out
V4: I try to get the best buys when shopping
V5: I don’t care about shopping
V6: You can save money by comparing prices
• Responses were on a 7-pt scale (1=disagree;
7=agree)
Attitudinal Data For Clustering
Case No. V1 V2 V3 V4 V5 V6
1 6 4 7 3 2 3
2 2 3 1 4 5 4
3 7 2 6 4 1 3
4 4 6 4 5 3 6
5 1 3 2 2 6 4
6 6 4 6 3 3 4
7 5 3 6 3 3 4
8 7 3 7 4 1 4
9 2 4 3 3 6 3
10 3 5 3 6 4 6
11 1 3 2 3 5 3
12 5 4 5 4 2 4
13 2 2 1 5 4 4
14 4 6 4 6 4 7
15 6 5 4 2 1 4
16 3 5 4 6 4 7
17 4 4 7 2 2 5
18 3 7 2 6 4 3
19 4 6 3 7 2 7
20 2 3 2 4 7 2
Dendrogram
[Dendrogram of the hierarchical clustering solution for the 20 cases.]
The Elbow Method
• The elbow method runs the clustering for a range of values of K and computes the Within-Cluster Sum of Squares (WCSS) for each: the sum of squared distances between each point and the centroid of the cluster it belongs to.
• As K increases, WCSS falls; the value of K at the "elbow," where the rate of decrease drops sharply, is taken as the number of clusters.
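WCSS for a given clustering can be sketched as follows (an illustration, not from the slides; the point assignments are made up for the demo):

```python
import math

def wcss(clusters):
    # Within-Cluster Sum of Squares: for each cluster, add the squared
    # distance from every point to that cluster's centroid.
    total = 0.0
    for cluster in clusters:
        centroid = [sum(col) / len(cluster) for col in zip(*cluster)]
        total += sum(math.dist(p, centroid) ** 2 for p in cluster)
    return total

points_tight = [[(1, 1), (1, 2)], [(8, 8), (9, 8)]]  # good 2-cluster split
points_loose = [[(1, 1), (9, 8)], [(1, 2), (8, 8)]]  # same points, bad split
print(wcss(points_tight))  # small
print(wcss(points_loose))  # much larger
```

Plotting WCSS against K and looking for the bend in the curve gives the elbow.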
Cluster Centroids (Means of Variables)

Cluster No.   V1      V2      V3      V4      V5      V6
1             5.750   3.625   6.000   3.125   1.750   3.875
2             1.667   3.000   1.833   3.500   5.500   3.333
3             3.500   5.833   3.333   6.000   3.500   6.000