
INTRODUCTION TO CLUSTERING

Cluster analysis is a class of techniques used to classify cases into groups that are relatively homogeneous within themselves and heterogeneous between each other, on the basis of a defined set of variables. These groups are called clusters. Cluster analysis is also referred to as segmentation analysis, taxonomy analysis, or simply clustering.
• Cluster analysis is a group of multivariate techniques whose primary purpose is to group objects based on the characteristics they possess.
• Factor analysis is primarily concerned with grouping variables based on patterns of variation (correlation) in the data, whereas cluster analysis forms groupings on the basis of distance (proximity).
• Cluster analysis classifies objects (i.e., respondents, products, or other entities) on a set of user-selected characteristics, i.e., clustering variables. The resulting clusters should exhibit high internal (within-cluster) homogeneity and high external (between-cluster) heterogeneity. If the classification is successful, the objects within a cluster will be close together when plotted geometrically, and different clusters will be far apart.
Conceptual development with cluster analysis

• Data reduction – a researcher may be faced with a large number of observations that are meaningless unless classified into manageable groups. Cluster analysis can perform this data reduction procedure objectively by reducing the information from an entire population or sample to information about specific groups.

Steps to conduct a Cluster Analysis
1. Select a distance measure
2. Select a clustering algorithm
3. Determine the number of clusters
4. Validate the analysis

The primary objective of cluster analysis is to define the structure of the data by placing the most similar observations into groups. To do this, we must address three basic questions:
1. How do we measure similarity? Several methods are possible, including the correlation between objects, or a measure of their proximity in two-dimensional space such that the distance between observations indicates similarity.
2. How do we form clusters? No matter how similarity is measured, the procedure must group the observations that are most similar into a cluster.
3. How many groups do we form?
Measuring similarity

The first task is to develop some measure of similarity between each pair of objects, to be used in the clustering process.
Non-overlapping clusters
Clusters in which each observation belongs to only one cluster. Non-overlapping clustering is the most frequently used technique in practice.
Overlapping clusters
An observation may belong to more than one cluster.
Probabilistic clusters
An observation belongs to a cluster according to a probability distribution.
Hierarchical clustering
Hierarchical clustering creates subsets of the data in a tree-like structure in which the root node corresponds to the complete data set. Branches are created from the root node to split the data into heterogeneous subsets (clusters).
Euclidean Distance
Euclidean distance is one of the most frequently used distance measures when the data are on an interval or ratio scale.

The Euclidean distance between two n-dimensional observations $X_1 = (x_{11}, x_{12}, \ldots, x_{1n})$ and $X_2 = (x_{21}, x_{22}, \ldots, x_{2n})$ is given by

$$D(X_1, X_2) = \sqrt{(x_{11} - x_{21})^2 + (x_{12} - x_{22})^2 + \cdots + (x_{1n} - x_{2n})^2}$$
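As a quick illustration (a minimal NumPy sketch, not from the slides), the formula can be computed directly:

```python
import numpy as np

def euclidean_distance(x1, x2):
    """Euclidean distance between two n-dimensional observations."""
    x1, x2 = np.asarray(x1, dtype=float), np.asarray(x2, dtype=float)
    return np.sqrt(np.sum((x1 - x2) ** 2))

# Wines 1 and 2 from the table below: (alcohol, alkalinity of ash)
print(euclidean_distance([14.8, 28.0], [11.05, 12.0]))  # sqrt(3.75^2 + 16^2) ~ 16.43
```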


Example

The table below lists 20 wines sold in the market along with their alcohol and alkalinity of ash content.

Wine   Alcohol   Alkalinity of Ash      Wine   Alcohol   Alkalinity of Ash
 1     14.8      28                      11    10.7      12.2
 2     11.05     12                      12    14.3      27
 3     12.2      21                      13    12.4      19.5
 4     12        20                      14    14.85     29.2
 5     14.5      29.5                    15    10.9      13.6
 6     11.2      13                      16    13.9      29.7
 7     11.5      12                      17    10.4      12.2
 8     12.8      19                      18    10.8      13.6
 9     14.75     28.8                    19    14        28.8
10     10.5      14                      20    12.47     22.8

[Figure: clusters of wine based on alcohol and alkalinity of ash content.]
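The figure referenced above is a scatter of the 20 wines; as an illustrative sketch (SciPy assumed, not part of the slides), the pairwise Euclidean distances behind such a grouping can be computed directly:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# (alcohol, alkalinity of ash) for the 20 wines in the table above
wines = np.array([
    [14.8, 28.0], [11.05, 12.0], [12.2, 21.0], [12.0, 20.0], [14.5, 29.5],
    [11.2, 13.0], [11.5, 12.0], [12.8, 19.0], [14.75, 28.8], [10.5, 14.0],
    [10.7, 12.2], [14.3, 27.0], [12.4, 19.5], [14.85, 29.2], [10.9, 13.6],
    [13.9, 29.7], [10.4, 12.2], [10.8, 13.6], [14.0, 28.8], [12.47, 22.8],
])

dist = squareform(pdist(wines, metric="euclidean"))  # 20 x 20 distance matrix

# Find the most similar pair of wines (ignore the zero diagonal)
np.fill_diagonal(dist, np.inf)
i, j = np.unravel_index(np.argmin(dist), dist.shape)
print(f"closest pair: wines {i + 1} and {j + 1}, distance {dist[i, j]:.2f}")
```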
Standardized Euclidean Distance
Let X1k and X2k be two attributes of the data (where k stands for the kth observation in the data set). It is possible that the range of X1k is much smaller than that of X2k, resulting in a skewed Euclidean distance value. An easy way of handling this potential bias is to standardize the data using the following equation:

$$\text{Standardized value of the attribute} = \frac{X_{ik} - \bar{X}_i}{\sigma_{X_i}}$$

where $\bar{X}_i$ and $\sigma_{X_i}$ are, respectively, the mean and standard deviation of the ith attribute.
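A minimal sketch of this idea (assuming plain per-column z-scores followed by ordinary Euclidean distance):

```python
import numpy as np

def standardized_euclidean(X):
    """Pairwise Euclidean distances after z-scoring each attribute (column)."""
    X = np.asarray(X, dtype=float)
    Z = (X - X.mean(axis=0)) / X.std(axis=0)   # (X_ik - mean_i) / std_i
    diff = Z[:, None, :] - Z[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

# Without standardization, alkalinity of ash (range ~17.5) would dominate
# alcohol (range ~4.45) in the wine example above.
```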
Manhattan Distance (City Block Distance)

Euclidean distance may not be appropriate when measuring the distance between different locations (for example, the distance between two shops in a city). In such cases, we use the Manhattan distance, which is given by

$$D_M(X_1, X_2) = \sum_{i=1}^{n} |X_{1i} - X_{2i}|$$

It is not based on squared (Euclidean) differences; instead it uses the sum of the absolute differences of the variables. It is simple to calculate, but may lead to invalid clusters if the clustering variables are highly correlated.
Minkowski Distance
Minkowski distance is the generalized distance measure between two cases in the data set and is given by

$$D(X_1, X_2) = \left( \sum_{i=1}^{n} |X_{1i} - X_{2i}|^p \right)^{1/p}$$

When p = 1, the Minkowski distance is the same as the Manhattan distance.
For p = 2, the Minkowski distance is the same as the Euclidean distance.
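A small sketch (my own helper, not from the slides) that makes this relationship explicit:

```python
import numpy as np

def minkowski_distance(x1, x2, p=2):
    """Minkowski distance; p=1 reduces to Manhattan, p=2 to Euclidean."""
    x1, x2 = np.asarray(x1, dtype=float), np.asarray(x2, dtype=float)
    return np.sum(np.abs(x1 - x2) ** p) ** (1.0 / p)

x1, x2 = [14.8, 28.0], [11.05, 12.0]    # wines 1 and 2 again
print(minkowski_distance(x1, x2, p=1))  # Manhattan: 3.75 + 16 = 19.75
print(minkowski_distance(x1, x2, p=2))  # Euclidean: ~ 16.43
```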
Jaccard Similarity Coefficient (Jaccard Index)
The Jaccard similarity coefficient (JSC) or Jaccard index (Real and Vargas, 1996) is a measure used when the data are qualitative, especially when the attributes can be represented in binary form.
The JSC for two n-dimensional observations (n attributes), X1 and X2, is given by

$$\text{Jaccard}(X_1, X_2) = \frac{n(X_1 \cap X_2)}{n(X_1 \cup X_2)}$$

where $n(X_1 \cap X_2)$ is the number of attributes that belong to both X1 and X2, and $n(X_1 \cup X_2)$ is the number of attributes that belong to either X1 or X2.
Example
Consider movie DVD purchases made by two customers, as given by the following sets:
Customer 1 = {Jungle Book (JB), Iron Man (IM), Kung Fu Panda (KFP), Before Sunrise (BS), Bridge of Spies (BoS), Forrest Gump (FG)}
Customer 2 = {Casablanca (C), Jungle Book (JB), Forrest Gump (FG), Iron Man (IM), Kung Fu Panda (KFP), Schindler’s List (SL), The Godfather (TGF)}
In this case, each movie is an attribute. The purchases made by the two customers are shown in the table below.

Movie Title   BS   BoS   C   FG   IM   JB   KFP   SL   TGF
Customer 1     1    1    0    1    1    1    1     0    0
Customer 2     0    0    1    1    1    1    1     1    1
The JSC is given by

$$\text{JSC} = \frac{n(\text{customer 1} \cap \text{customer 2})}{n(\text{customer 1} \cup \text{customer 2})} = \frac{4}{9} = 0.44$$

The higher the Jaccard coefficient, the higher the similarity between the two observations being compared. The value of the JSC lies between 0 and 1.
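The same calculation as a short Python sketch over the movie sets:

```python
def jaccard(a, b):
    """Jaccard index: |A intersection B| / |A union B|."""
    return len(a & b) / len(a | b)

customer1 = {"JB", "IM", "KFP", "BS", "BoS", "FG"}
customer2 = {"C", "JB", "FG", "IM", "KFP", "SL", "TGF"}
print(round(jaccard(customer1, customer2), 2))  # 4/9 = 0.44
```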
Cosine Similarity
The cosine similarity between X1 and X2 is given by

$$\text{Similarity}(X_1, X_2) = \cos(\theta) = \frac{X_1 \cdot X_2}{\|X_1\| \, \|X_2\|} = \frac{\sum_{i=1}^{n} X_{1i} X_{2i}}{\sqrt{\sum_{i=1}^{n} X_{1i}^2} \, \sqrt{\sum_{i=1}^{n} X_{2i}^2}}$$

In cosine similarity, X1 and X2 are two n-dimensional vectors, and the measure is the cosine of the angle between the two vectors (hence the name vector space model).

[Figure: cosine similarity for different values of θ.]
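A minimal sketch of cosine similarity, reusing the binary purchase vectors from the Jaccard example:

```python
import numpy as np

def cosine_similarity(x1, x2):
    """cos(theta) = (X1 . X2) / (||X1|| * ||X2||)."""
    x1, x2 = np.asarray(x1, dtype=float), np.asarray(x2, dtype=float)
    return (x1 @ x2) / (np.linalg.norm(x1) * np.linalg.norm(x2))

# Binary purchase vectors, attribute order (BS, BoS, C, FG, IM, JB, KFP, SL, TGF)
c1 = [1, 1, 0, 1, 1, 1, 1, 0, 0]
c2 = [0, 0, 1, 1, 1, 1, 1, 1, 1]
print(round(cosine_similarity(c1, c2), 3))  # 4 / sqrt(6 * 7) ~ 0.617
```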
Gower’s Similarity Coefficient
Gower’s similarity coefficient (Gower, 1971) is used when the data contain both quantitative and qualitative attributes. Gower’s coefficient between two n-dimensional observations i and j is given by

$$D_{ij} = \frac{\sum_{k=1}^{n} D_{ijk} W_{ijk}}{\sum_{k=1}^{n} W_{ijk}}$$

where $D_{ijk}$ is the score between observations i and j on the kth variable, and $W_{ijk}$ is a binary weight that captures whether the comparison between the observations is valid on the kth variable.
Example
The table below shows 5 customers and their movie downloads from a portal. The data consist of the genre of the movies downloaded, the maximum rating given by the customer, and the marital status (code 1 implies married and 0 otherwise). For example, customer 1 downloaded 23 action, 5 romance, 15 comedy, and 0 sci-fi movies, and his maximum rating was 4.

Customer   Action    Romance   Comedy    Sci-fi    Maximum Rating   Married
           (k = 1)   (k = 2)   (k = 3)   (k = 4)   (k = 5)          (k = 6)
1          23        5         15        0         4                0
2          5         18        16        2         5                1
3          25        0         0         15        5                0
4          2         30        15        0         4                1
5          45        0         0         10        5                0
Solution
The Gower scores between customers 1 and 2 can be calculated as shown in the table below:

        k = 1    k = 2    k = 3    k = 4    k = 5    k = 6   Sum
Dijk    0.5814   0.5667   0.9375   0.8667   0.0000   0       2.952
Wijk    1        1        1        1        1        1       6

$$D_{ij} = \frac{\sum_{k=1}^{n} D_{ijk} W_{ijk}}{\sum_{k=1}^{n} W_{ijk}}$$

Gower’s coefficient between customers 1 and 2 is therefore 2.952/6 = 0.492.
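A sketch that reproduces this worked example. It assumes the usual Gower convention behind the table's values: for a quantitative attribute the per-attribute score is 1 − |difference|/range, for the binary attribute it is an exact-match indicator, and all weights W_ijk are 1 here.

```python
import numpy as np

# Customers from the table: action, romance, comedy, sci-fi, max rating, married
X = np.array([
    [23,  5, 15,  0, 4, 0],
    [ 5, 18, 16,  2, 5, 1],
    [25,  0,  0, 15, 5, 0],
    [ 2, 30, 15,  0, 4, 1],
    [45,  0,  0, 10, 5, 0],
], dtype=float)
quantitative = [True, True, True, True, True, False]  # k = 6 (married) is binary

def gower(X, i, j, quantitative):
    """Gower coefficient between rows i and j, all weights W_ijk = 1."""
    ranges = X.max(axis=0) - X.min(axis=0)
    scores = []
    for k, quant in enumerate(quantitative):
        if quant:  # 1 - |x_ik - x_jk| / range_k
            scores.append(1 - abs(X[i, k] - X[j, k]) / ranges[k])
        else:      # exact-match indicator for qualitative attributes
            scores.append(1.0 if X[i, k] == X[j, k] else 0.0)
    return sum(scores) / len(scores)

print(round(gower(X, 0, 1, quantitative), 3))  # 2.952 / 6 ~ 0.492
```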
Quality and Optimal Number of Clusters
Milligan and Cooper (1985) analysed over 30 procedures for determining the optimal number of clusters and recommended the index proposed by Calinski and Harabasz (1974), which is given by

$$CH(k) = \frac{B(k)/(k-1)}{W(k)/(n-k)}$$

where CH(k) is the Calinski and Harabasz index with k clusters (k > 1), and B(k) and W(k) are the between-cluster and within-cluster sums of squared variation with k clusters.
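scikit-learn ships this index as calinski_harabasz_score; a sketch that picks k by maximizing CH(k) on synthetic data (illustrative data, not the wine example):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import calinski_harabasz_score

X, _ = make_blobs(n_samples=200, centers=3, random_state=42)

# CH(k) is largest for well-separated, compact clusterings; here it
# should peak at k = 3
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    print(k, round(calinski_harabasz_score(X, labels), 1))
```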
Clustering Algorithms
Clustering algorithms group data into a finite number of mutually exclusive subsets.

Steps followed in clustering algorithms:

• Variable selection.
• Deciding the distance/similarity measure for measuring the distance/dissimilarity between observations.
• Deciding the number of clusters.
• Validation of the clusters.
Variable Selection

Ketchen and Shook (1996) suggest inductive, deductive, and cognitive approaches for variable selection.

• Inductive selection is basically an exploratory approach and starts with as many variables as possible.

• In deductive variable selection, the suitability of the variables and their theoretical basis influence the selection.

• Under cognitive variable selection, expert opinion plays a major role.
Deciding Distance/Similarity Measures
Choosing the right distance/similarity measure plays an
important role in developing clusters.

Number of Clusters

Several approaches are available for deciding the number of clusters, such as the CH index, the Hartigan statistic, the Silhouette statistic, and the elbow method, in which the ideal number of clusters is given by the position of the elbow (bend) in an L-shaped curve.
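A minimal elbow-method sketch (synthetic data assumed): the within-cluster sum of squares (scikit-learn's inertia_) drops sharply until the true number of clusters, then flattens.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=200, centers=4, random_state=1)

# Plotting k vs inertia gives the L-shaped curve; the bend ("elbow")
# suggests the number of clusters
for k in range(1, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=1).fit(X)
    print(k, round(km.inertia_, 1))
```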
Cluster Validation

The clusters created should be validated for consistency using different algorithms, to ensure that the clusters represent structures that exist in the population.
Halkidi et al. (2001) suggest the following measures to validate the clusters:

• Compactness: closeness of the members of a cluster, which can be measured through the variance.

• Separation: distance between different clusters.
K-Means Clustering

• K-means clustering is one of the most frequently used clustering algorithms.

• It is a non-hierarchical clustering method in which the number of clusters (K) is decided a priori.
K-Means Clustering - Steps
1) Choose K observations from the data that are likely to be in different clusters. There are many ways of choosing these initial K values; the easiest approach is to choose observations that are farthest apart (on one of the parameters of the data).
2) The K observations chosen in step 1 become the centroids of those clusters.
3) For each remaining observation, find the closest centroid, based on an appropriate distance measure, and add the observation (say, observation j) to that cluster. Adjust the centroid after adding the new observation to the cluster.
4) Repeat step 3 until all observations are assigned to a cluster.
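A hedged sketch applying K-means to the wine data from the earlier example (assuming K = 2, and using scikit-learn's implementation, whose k-means++ initialization differs from the farthest-point scheme in step 1 above):

```python
import numpy as np
from sklearn.cluster import KMeans

# (alcohol, alkalinity of ash) for the 20 wines
wines = np.array([
    [14.8, 28.0], [11.05, 12.0], [12.2, 21.0], [12.0, 20.0], [14.5, 29.5],
    [11.2, 13.0], [11.5, 12.0], [12.8, 19.0], [14.75, 28.8], [10.5, 14.0],
    [10.7, 12.2], [14.3, 27.0], [12.4, 19.5], [14.85, 29.2], [10.9, 13.6],
    [13.9, 29.7], [10.4, 12.2], [10.8, 13.6], [14.0, 28.8], [12.47, 22.8],
])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(wines)
print(km.labels_)           # cluster membership of each wine
print(km.cluster_centers_)  # final centroids
```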
Hierarchical Clustering
Hierarchical clustering is a clustering algorithm that uses the following steps to develop clusters:

1) Start with each data point in its own cluster.

2) Find the two clusters with the shortest distance between them (using an appropriate distance measure) and merge them to form a new cluster.

3) Repeat step 2 until all data points are merged into a single cluster.

The above procedure is called agglomerative hierarchical clustering.

[Figure: dendrogram for movie clustering.]
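A sketch of agglomerative clustering with SciPy (synthetic two-group data assumed; single linkage matches the "merge the closest pair" rule above):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (10, 2)), rng.normal(5, 1, (10, 2))])

Z = linkage(X, method="single", metric="euclidean")  # merge closest pairs first
labels = fcluster(Z, t=2, criterion="maxclust")      # cut the tree into 2 clusters
print(labels)

dendrogram(Z)  # the tree-like merge structure; the root is the full data set
plt.show()
```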
Summary

• Clustering is an unsupervised learning algorithm that divides the data set into mutually exclusive and exhaustive subsets (in the case of non-overlapping clustering) that are homogeneous within the group and heterogeneous between the groups.

• Clustering is one of the most frequently used techniques in practice; practitioners often first cluster the data and then develop predictive models for each cluster for better management.

• Several distance measures, such as the Euclidean distance and the Gower distance, are used in clustering algorithms. Similarity coefficients such as the Jaccard coefficient and cosine similarity are used depending on the data type.

• K-means clustering and hierarchical clustering are two popular clustering techniques.

• One of the decisions to be taken during clustering is the number of clusters. Usually this is decided using the elbow curve: the cluster number at which the elbow (bend) occurs is taken as the optimal number of clusters.
• A Kalyani Jeweler designer wishes to know if the population of teenage girls aged 13–19 can be divided into smaller groups who might be looking at jewellery very differently.
• The following questions were given to a group of 10 girls to understand what jewellery meant to them. The questionnaire used a 5-point Likert scale ranging from 1 (strongly agree) to 5 (strongly disagree).

Questionnaire

1. I like to wear jewellery that glitters.
2. My jewellery should match my dress.
3. I want everyone to admire my jewellery.
4. I take my friends with me when I go jewellery shopping.
5. Beautiful jewellery adds to a girl's beauty.
(Each item is rated: 1. Strongly agree  2. Agree  3. Neutral  4. Disagree  5. Strongly disagree)

• Cluster 1 = {5, 9, 4, 10}
• Cluster 2 = {6, 8, 3, 2, 7, 1}

• Cluster 1 seems to be a socially concerned group, as its members show a high degree of agreement with X3 and X4.

• Cluster 2 seems to be more self-driven, as its members show a high degree of agreement with X1, X2, and X3.
• The Different Types of Cluster Analysis
There are three primary methods used to perform cluster analysis:

• Hierarchical Cluster
This is the most common method of clustering. It creates a series of models with cluster solutions from 1 (all cases in one cluster) to n (each case is an individual cluster). This approach also works with variables instead of cases: hierarchical clustering can group variables together in a manner similar to factor analysis. Finally, hierarchical cluster analysis can handle nominal, ordinal, and scale data; just remember not to mix different levels of measurement in your study.

• K-Means Cluster
This method is used to quickly cluster large data sets. Here, researchers define the number of clusters prior to performing the analysis. This approach is useful when testing different models with a different assumed number of clusters.

• Two-Step Cluster
This method identifies groupings by performing pre-clustering first and then applying hierarchical methods. Two-step clustering is best for handling larger data sets that would otherwise take too long to calculate with strictly hierarchical methods. Essentially, two-step cluster analysis is a combination of hierarchical and k-means cluster analysis. It can handle both scale and ordinal data, and it automatically selects the number of clusters.
• Steps for Cluster Analysis
• Formulate the problem – Select the variables on which the clustering will be based. The variables should describe the similarity between objects in terms that are relevant to the research problem, and should be selected based on past research, theory, the hypotheses being tested, or the judgment of the researcher.
• Select a distance measure – An appropriate measure of distance needs to be selected to determine how similar or dissimilar the objects being clustered are. The most commonly used measure is the Euclidean distance.
• Select a clustering procedure – Several clustering procedures have been developed, and the one most appropriate for the problem at hand should be chosen.
• Decide on the number of clusters – The number of clusters can be based on theoretical, conceptual, or practical considerations.
• Interpret and profile the clusters – This involves examining the cluster centroids. The centroids represent the mean values of the objects contained in the cluster on each of the variables.
• Assess the validity of the clustering – Methods to validate the clusters include using different clustering methods and comparing the results, or clustering on a smaller set of variables (randomly deleted) and comparing the results with those from the entire set of variables.
