Clustering Today
The table below lists 20 wines sold in the market along with their
alcohol content and alkalinity of ash.
Wine    Alcohol    Alkalinity of Ash    Wine    Alcohol    Alkalinity of Ash
$$D_M(X_1, X_2) = \sum_{i=1}^{n} \left| X_{1i} - X_{2i} \right|$$
It is not based on Euclidean distance; instead, it uses the sum of the absolute
differences of the variables. It is simple to calculate but may lead to invalid
clusters if the clustering variables are highly correlated.
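As a minimal sketch of the sum-of-absolute-differences measure described above, the function below computes it for two observations. The function name and the two wine measurements are hypothetical values chosen for illustration; they are not taken from the wine table.

```python
def manhattan_distance(x1, x2):
    """Sum of absolute differences between two equal-length observations."""
    if len(x1) != len(x2):
        raise ValueError("observations must have the same number of variables")
    return sum(abs(a - b) for a, b in zip(x1, x2))


# Hypothetical (alcohol, alkalinity of ash) values for two wines.
wine_a = (14.0, 15.5)
wine_b = (13.0, 11.0)

# |14.0 - 13.0| + |15.5 - 11.0| = 1.0 + 4.5
print(manhattan_distance(wine_a, wine_b))  # 5.5
```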
Minkowski Distance
Minkowski distance is the generalized distance measure
between two cases in the dataset and is given by
$$D_{\text{Minkowski}}(X_1, X_2) = \left( \sum_{i=1}^{n} \left| X_{1i} - X_{2i} \right|^{p} \right)^{1/p}$$
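The Minkowski formula can be sketched in a few lines of Python. Setting p = 1 gives the sum of absolute differences, and p = 2 gives Euclidean distance. The function name and the sample points are illustrative assumptions.

```python
def minkowski_distance(x1, x2, p):
    """Minkowski distance of order p between two equal-length observations."""
    if len(x1) != len(x2):
        raise ValueError("observations must have the same number of variables")
    if p < 1:
        raise ValueError("p must be at least 1")
    return sum(abs(a - b) ** p for a, b in zip(x1, x2)) ** (1.0 / p)


# Illustrative points, not taken from the slides.
a = (0.0, 0.0)
b = (3.0, 4.0)

print(minkowski_distance(a, b, 1))  # 3 + 4 = 7.0
print(minkowski_distance(a, b, 2))  # sqrt(9 + 16) = 5.0
```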
$$\text{Jaccard}(X_1, X_2) = \frac{n(X_1 \cap X_2)}{n(X_1 \cup X_2)}$$
where n(X1 ∩ X2) is the number of attributes that belong to both
X1 and X2 (that is, X1 ∩ X2), and n(X1 ∪ X2) is the number of
attributes that belong to either X1 or X2 (that is, X1 ∪ X2).
Example
Consider movie DVD purchases made by two customers as given by
the following sets
Customer 1 = {Jungle Book (JB), Iron Man (IM), Kung Fu Panda
(KFP), Before Sunrise (BS), Bridge of Spies (BoS), Forrest Gump (FG)}
Customer 2 = {Casablanca (C), Jungle Book (JB), Forrest Gump (FG), Iron
Man (IM), Kung Fu Panda (KFP), Schindler’s List (SL), The
Godfather (TGF)}
In this case, each movie is an attribute. The purchases made by the two
customers are shown in Table
             BS  BoS   C  FG  IM  JB  KFP  SL  TGF
Customer 1    1    1   0   1   1   1    1   0    0
Customer 2    0    0   1   1   1   1    1   1    1
• The JSC is given by
$$\text{JSC} = \frac{n(\text{customer 1} \cap \text{customer 2})}{n(\text{customer 1} \cup \text{customer 2})} = \frac{4}{9} \approx 0.44$$
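The same calculation can be checked with Python sets. This is a small sketch that uses the movie abbreviations from the example; the function name is an assumption made for illustration.

```python
def jaccard_similarity(a, b):
    """Size of the intersection divided by the size of the union."""
    union = a | b
    if not union:
        raise ValueError("Jaccard similarity is undefined for two empty sets")
    return len(a & b) / len(union)


# DVD purchases from the example, using the abbreviations given there.
customer_1 = {"JB", "IM", "KFP", "BS", "BoS", "FG"}
customer_2 = {"C", "JB", "FG", "IM", "KFP", "SL", "TGF"}

jsc = jaccard_similarity(customer_1, customer_2)
print(sorted(customer_1 & customer_2))  # ['FG', 'IM', 'JB', 'KFP']
print(round(jsc, 2))                    # 4 shared / 9 distinct titles = 0.44
```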
$$D_{ij} = \frac{\sum_{k=1}^{n} D_{ijk} W_{ijk}}{\sum_{k=1}^{n} W_{ijk}}$$
Customer  (k = 1)  (k = 2)  (k = 3)  (k = 4)  (k = 5)  (k = 6)
1              23        5       15        0        4        0
2               5       18       16        2        5        1
3              25        0        0       15        5        0
4               2       30       15        0        4        1
5              45        0        0       10        5        0
Solution
Gower's distance between customers 1 and 2 can be
calculated as shown in the table below:
Wijk    1    1    1    1    1    1    (sum = 6)
$$D_{ij} = \frac{\sum_{k=1}^{n} D_{ijk} W_{ijk}}{\sum_{k=1}^{n} W_{ijk}}$$
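The weighted average in this formula can be sketched as follows. The weights (all equal to 1, summing to 6) match the Wijk row above. The per-variable distances Dijk are hypothetical placeholders, because the slides do not show them or say how each variable's distance is scaled, so the printed value is not the solution for customers 1 and 2.

```python
def gower_distance(d_ijk, w_ijk):
    """Weighted average of per-variable distances between cases i and j."""
    if len(d_ijk) != len(w_ijk):
        raise ValueError("need one weight per variable")
    total_weight = sum(w_ijk)
    if total_weight == 0:
        raise ValueError("at least one variable must have a non-zero weight")
    return sum(d * w for d, w in zip(d_ijk, w_ijk)) / total_weight


# Weights as in the Wijk row: every variable counts, so each weight is 1.
weights = [1, 1, 1, 1, 1, 1]

# HYPOTHETICAL per-variable distances in [0, 1]; not values from the slides.
hypothetical_d = [0.5, 0.25, 0.0, 1.0, 0.25, 1.0]

print(gower_distance(hypothetical_d, weights))  # 3.0 / 6 = 0.5
```

A weight of 0 for a variable drops it from both the numerator and the denominator, which is how a variable that cannot be compared for a given pair would be excluded in this sketch.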
$$CH(k) = \frac{B(k) / (k - 1)}{W(k) / (n - k)}$$
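The sketch below evaluates this ratio for a labelled dataset. It assumes, as in the usual Calinski-Harabasz definition, that B(k) is the between-cluster sum of squares, W(k) is the within-cluster sum of squares, n is the number of observations, and k is the number of clusters; the slides do not define these symbols. The data points are illustrative.

```python
def calinski_harabasz(points, labels):
    """CH(k) = [B(k) / (k - 1)] / [W(k) / (n - k)] for a given labelling."""
    n = len(points)
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    k = len(clusters)
    if k < 2 or k >= n:
        raise ValueError("need at least 2 clusters and fewer clusters than points")

    dims = len(points[0])
    overall = [sum(p[d] for p in points) / n for d in range(dims)]

    between = 0.0  # B(k): size-weighted squared distance of centroids to the overall mean
    within = 0.0   # W(k): squared distance of points to their own centroid
    for members in clusters.values():
        centroid = [sum(p[d] for p in members) / len(members) for d in range(dims)]
        between += len(members) * sum((c - o) ** 2 for c, o in zip(centroid, overall))
        within += sum((p[d] - centroid[d]) ** 2 for p in members for d in range(dims))

    if within == 0:
        raise ValueError("W(k) is zero, so the index is undefined")
    return (between / (k - 1)) / (within / (n - k))


# Illustrative data: two tight groups that are far apart.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
labels = [0, 0, 1, 1]

# B = 100, W = 4, k = 2, n = 4, so CH = (100 / 1) / (4 / 2)
print(calinski_harabasz(points, labels))  # 50.0
```

Larger values indicate clusters that are compact relative to their separation, so the index can be compared across candidate values of k.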
• Variable selection.
• Deciding the distance/similarity measure for measuring
distance/dissimilarity between the observations.
• Deciding the number of clusters.
• Validation of the clusters.
Variable Selection
Number of Clusters
3) Repeat step 2 until all data points are merged to form a single
cluster
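The repeat-until-one-cluster loop can be sketched as below. Steps 1 and 2 are not shown in this part of the slides, so the sketch assumes the usual agglomerative scheme: every point starts as its own cluster, and each pass merges the two closest clusters. Single linkage with Euclidean distance is an assumption made here for illustration, as are the sample points.

```python
from itertools import combinations


def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def agglomerate(points):
    """Return the merge history of single-linkage agglomerative clustering."""
    # Assumed step 1: every point starts as its own cluster (lists of indices).
    clusters = [[i] for i in range(len(points))]
    merges = []
    # Step 3: repeat the merge until a single cluster remains.
    while len(clusters) > 1:
        # Assumed step 2: pick the pair of clusters whose closest members are nearest.
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: min(
                euclidean(points[i], points[j])
                for i in clusters[pair[0]]
                for j in clusters[pair[1]]
            ),
        )
        merged = sorted(clusters[a] + clusters[b])
        merges.append(merged)
        clusters = [c for idx, c in enumerate(clusters) if idx not in (a, b)] + [merged]
    return merges


# Illustrative points: two nearby pairs.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.5)]
for step, merged in enumerate(agglomerate(points), start=1):
    print(step, merged)
# 1 [0, 1]
# 2 [2, 3]
# 3 [0, 1, 2, 3]
```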
• Cluster 2 seems to be more self-driven, as its members show a high degree of agreement
with X1, X2, and X3.
• The Different Types of Cluster Analysis
• There are three primary methods used to perform cluster analysis:
• Hierarchical Cluster
• This is the most common method of clustering. It creates a series of models with
cluster solutions from 1 (all cases in one cluster) to n (each case is an individual
cluster). This approach also works with variables instead of cases. Hierarchical
clustering can group variables together in a manner similar to factor analysis.
• Finally, hierarchical cluster analysis can handle nominal, ordinal, and scale data. However,
do not mix different levels of measurement in the same study.
• K-Means Cluster
• This method is used to quickly cluster large datasets. Here, researchers define the
number of clusters prior to performing the actual study. This approach is useful when
testing different models with a different assumed number of clusters.
• Two-Step Cluster
• This method identifies groupings by first performing pre-clustering and then
applying hierarchical methods. Two-step clustering is best
for handling larger datasets that would otherwise take too long to compute with
strictly hierarchical methods.
• Essentially, two-step cluster analysis is a combination of hierarchical and k-means
cluster analysis. It can handle both scale and ordinal data, and it automatically selects
the number of clusters.
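To make the K-means description above concrete, here is a minimal sketch of the standard assign-and-update loop, with the number of clusters k fixed in advance. Using the first k points as initial centroids is a simplifying assumption made so the run is deterministic; practical implementations usually choose starting centroids more carefully. The data are illustrative.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, max_iter=100):
    """Cluster points into k groups; returns (labels, centroids)."""
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    # Assumption for this sketch: the first k points seed the centroids.
    centroids = [tuple(p) for p in points[:k]]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return labels, centroids


# Illustrative data: two well-separated groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
labels, centroids = kmeans(points, k=2)
print(labels)  # [0, 0, 0, 1, 1, 1]
```

Rerunning the function with different values of k is one way to compare models that assume different numbers of clusters.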
• Steps for Cluster Analysis
• Formulate the problem – Select the variables on which the clustering will be based.
The variables should describe the similarity between objects in terms that are relevant
to the research problem. The variables should be selected based on past research,
theory, the hypotheses being tested, or the judgment of the researcher.
• Select a distance measure – An appropriate measure of distance needs to be selected
to determine how similar or dissimilar the objects being clustered are. The most
commonly used measure is Euclidean distance.
• Select a clustering procedure – Several clustering procedures have been developed
and the one most appropriate for the problem at hand should be chosen.
• Decide on the number of clusters – The number of clusters can be based on
theoretical, conceptual, or practical considerations.
• Interpret and profile clusters – This involves examining cluster centroids. The
centroids represent the mean values of the objects contained in the cluster on each of
the variables.
• Assess the validity of clustering – Some methods to validate the clustering include
using different methods of clustering and comparing the results, or clustering on a
smaller set of variables (randomly deleted) and comparing the results with the entire
set of variables.
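The profiling step in the list above relies on cluster centroids, that is, the mean of each variable within each cluster. A minimal sketch follows; the survey-style responses on three variables and the cluster labels are hypothetical values used only to show the calculation.

```python
from collections import defaultdict


def cluster_centroids(rows, labels):
    """Mean of every variable within each cluster, keyed by cluster label."""
    groups = defaultdict(list)
    for row, label in zip(rows, labels):
        groups[label].append(row)
    return {
        label: [sum(column) / len(column) for column in zip(*members)]
        for label, members in groups.items()
    }


# HYPOTHETICAL responses on three variables (X1, X2, X3) and their cluster labels.
rows = [(5, 4, 5), (1, 2, 1), (4, 5, 4), (2, 1, 2)]
labels = [2, 1, 2, 1]

for label, centroid in sorted(cluster_centroids(rows, labels).items()):
    print(label, centroid)
# 1 [1.5, 1.5, 1.5]
# 2 [4.5, 4.5, 4.5]
```

Comparing the centroids variable by variable shows which variables separate the clusters, which is the basis for describing each cluster.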