
Cluster Analysis

Abu Bashar

Cluster analysis
Cluster analysis is a class of techniques used to classify cases into groups that are relatively homogeneous within themselves and heterogeneous with respect to each other. Homogeneity (similarity) and heterogeneity (dissimilarity) are measured on the basis of a defined set of variables.

These groups are called clusters

Market segmentation
Cluster analysis is especially useful for market segmentation. Segmenting a market means dividing its potential consumers into separate sub-sets where:
Consumers in the same group are similar with respect to a given set of characteristics
Consumers belonging to different groups are dissimilar with respect to the same set of characteristics

This allows one to calibrate the marketing mix differently according to the target consumer group

Other uses of cluster analysis


Product characteristics and the identification of new product opportunities: clustering similar brands or products according to their characteristics allows one to identify competitors, potential market opportunities and available niches.

Data reduction: factor analysis and principal component analysis allow one to reduce the number of variables; cluster analysis allows one to reduce the number of observations, by grouping them into homogeneous clusters.
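As a minimal sketch of data reduction by clustering (the data and parameter values below are hypothetical, and scikit-learn is assumed to be available), a large set of observations can be summarized by the centroids of a few homogeneous clusters:

    import numpy as np
    from sklearn.cluster import KMeans

    # 500 hypothetical observations on 2 variables, reduced to 5 clusters
    rng = np.random.default_rng(1)
    X = rng.normal(size=(500, 2))

    km = KMeans(n_clusters=5, n_init=10, random_state=1).fit(X)
    print(km.cluster_centers_)  # 5 centroid rows now summarize 500 observations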

Steps to conduct a cluster analysis


1. Select a distance measure
2. Select a clustering algorithm
3. Define the distance between two clusters
4. Determine the number of clusters
5. Validate the analysis

Distance measures for individual observations


To measure similarity between two observations, a distance measure is needed. With a single variable, similarity is straightforward.
Example: income. Two individuals are similar if their income levels are similar, and the level of dissimilarity increases as the income gap increases.

Multiple variables require an aggregate distance measure


With many characteristics (e.g. income, age, consumption habits, family composition, car ownership, education level, job), it becomes more difficult to define similarity with a single value.

The best-known measure of distance is the Euclidean distance, the same concept we use in everyday life for spatial coordinates.
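As a minimal sketch (the variables and values below are hypothetical), the Euclidean distance between two observations is the square root of the sum of squared differences across all variables:

    import numpy as np

    # Two hypothetical consumers: income (k$), age, household size
    a = np.array([54.0, 38.0, 3.0])
    b = np.array([61.0, 45.0, 2.0])

    # Euclidean distance: square root of the sum of squared differences
    d = np.sqrt(np.sum((a - b) ** 2))
    print(d)  # equivalently: np.linalg.norm(a - b)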

Other distance measures


Other distance measures include the Chebychev, Minkowski and Mahalanobis distances.
An alternative approach is to use correlation measures, where correlations are computed not between variables but between observations. Each observation is characterized by a set of measurements (one for each variable), and bivariate correlations can be computed between two observations.
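A sketch of these alternatives, assuming SciPy is available (the observations are hypothetical, and the Mahalanobis distance additionally needs the inverse covariance matrix of the data set, here estimated from made-up data purely for illustration):

    import numpy as np
    from scipy.spatial import distance

    a = np.array([54.0, 38.0, 3.0])   # hypothetical observation 1
    b = np.array([61.0, 45.0, 2.0])   # hypothetical observation 2

    print(distance.chebyshev(a, b))        # largest coordinate difference
    print(distance.minkowski(a, b, p=3))   # generalizes Euclidean (p = 2)

    # Mahalanobis requires the inverse covariance of the full data set
    X = np.random.default_rng(0).normal(loc=50, scale=10, size=(100, 3))
    VI = np.linalg.inv(np.cov(X, rowvar=False))
    print(distance.mahalanobis(a, b, VI))

    # Correlation between the two observations (not between variables)
    print(np.corrcoef(a, b)[0, 1])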

Clustering procedures
Hierarchical procedures
Agglomerative (start from n clusters to get to 1 cluster)
Divisive (start from 1 cluster to get to n clusters)

Non-hierarchical procedures
K-means clustering

Hierarchical clustering
Agglomerative:
Each of the n observations constitutes a separate cluster
The two clusters that are most similar according to some distance rule are aggregated, so that in step 1 there are n-1 clusters
In the second step another cluster is formed (n-2 clusters), by nesting the two clusters that are most similar, and so on
There is a merging in each step until all observations end up in a single cluster in the final step

Divisive:
All observations are initially assumed to belong to a single cluster
The most dissimilar observation is extracted to form a separate cluster
In step 1 there will be 2 clusters, in the second step 3 clusters, and so on, until the final step produces as many clusters as there are observations

The number of clusters determines the stopping rule for the algorithms
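A minimal agglomerative sketch using SciPy (the data are randomly generated for illustration, and Ward linkage is just one of several possible distance rules):

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    # Hypothetical data: 10 observations measured on 3 variables
    rng = np.random.default_rng(42)
    X = rng.normal(size=(10, 3))

    # Agglomerative merging: Ward linkage on Euclidean distances
    Z = linkage(X, method="ward")

    # Stopping rule: cut the hierarchy at, say, 3 clusters
    labels = fcluster(Z, t=3, criterion="maxclust")
    print(labels)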


Non-hierarchical clustering
These algorithms do not follow a hierarchy and produce a single partition
Knowledge of the number of clusters (c) is required
In the first step, initial cluster centres (the seeds) are determined for each of the c clusters, either by the researcher or by the software (usually the first c observations, or c observations chosen randomly)
Each iteration allocates observations to each of the c clusters, based on their distance from the cluster centres
Cluster centres are computed again, and observations may be reallocated to the nearest cluster in the next iteration
When no observations can be reallocated or a stopping rule is met, the process stops

Non-hierarchical clustering: K-means method


1. The number k of clusters is fixed
2. An initial set of k seeds (aggregation centres) is provided
   First k elements, or other seeds (randomly selected or explicitly defined)
3. Given a certain fixed threshold, all units are assigned to the nearest cluster seed
4. New seeds are computed
5. Go back to step 3 until no reclassification is necessary

Units can be reassigned in successive steps (optimising partitioning), as sketched below
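A minimal sketch of this procedure using scikit-learn (data and parameter values are hypothetical; n_init controls how many random seed sets are tried):

    import numpy as np
    from sklearn.cluster import KMeans

    # Hypothetical data: 200 units measured on 4 variables
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 4))

    # k is fixed in advance; the seeds are chosen by the software here
    km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
    print(km.labels_[:10])       # cluster membership of the first 10 units
    print(km.cluster_centers_)   # final aggregation centres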


Hierarchical vs. non-hierarchical methods


Hierarchical methods
No decision about the number of clusters is needed
Problems when data contain a high level of error
Can be very slow; preferable with small data sets
Initial decisions are more influential (one-step only)
At each step they require computation of the full proximity matrix

Non-hierarchical methods
Faster and more reliable; work with large data sets
Need to specify the number of clusters
Need to set the initial seeds
Only the distances to the cluster seeds need to be computed in each iteration


The number of clusters c


Two alternatives:
Determined by the analysis
Fixed by the researcher

In segmentation studies, c represents the number of potential separate segments. The preferable approach is to let the data speak:
Use a hierarchical approach and identify the optimal partition through statistical tests (the stopping rule for the algorithm)
However, the detection of the optimal number of clusters is subject to a high degree of uncertainty

If the research objectives allow the number of clusters to be chosen rather than estimated, non-hierarchical methods are the way to go.


Example: fixed number of clusters


A retailer wants to identify several shopping profiles in order to open new, targeted retail outlets
The budget only allows him to open three types of outlets
A partition into three clusters follows naturally, although it is not necessarily the optimal one
Fixed number of clusters and a non-hierarchical (k-means) approach


Dendrogram

[Figure: dendrogram of an agglomerative clustering. The individual cases are listed along one axis; the other axis shows the rescaled cluster-combine distance (0 to 25). Cases 231 and 275 are merged first, at a relatively small distance; a dotted line marks the distance at which clusters are combined. As the algorithm proceeds, the merging distances become larger.]
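A dendrogram like this can be drawn from a linkage matrix; a minimal sketch with SciPy and matplotlib, assuming the Z computed in the agglomerative example above:

    import matplotlib.pyplot as plt
    from scipy.cluster.hierarchy import dendrogram

    dendrogram(Z)  # leaves are the individual cases; height is the merging distance
    plt.ylabel("Merging distance")
    plt.show()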

Scree diagram

[Figure: scree diagram with the merging distance on the y-axis (0 to 12) and the number of clusters on the x-axis (from 11 down to 1). When one moves from 7 to 6 clusters, the merging distance increases noticeably.]
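The same information can be read off the linkage matrix, whose third column holds the merging distance of each agglomeration step; a sketch, again assuming the Z computed above:

    import matplotlib.pyplot as plt

    heights = Z[:, 2]                # merging distance of each step
    n = Z.shape[0] + 1               # number of observations
    clusters = range(n - 1, 0, -1)   # clusters remaining after each merge

    plt.plot(list(clusters), heights, marker="o")
    plt.gca().invert_xaxis()         # read from many clusters down to one
    plt.xlabel("Number of clusters")
    plt.ylabel("Merging distance")
    plt.show()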

Thank You Very Much
