Clustering
Unsupervised Learning:
• Types of Unsupervised Learning
• Challenges in Unsupervised Learning
• Pre-processing and Scaling
• Applying Data Transformation
• K-Means Clustering
• Euclidean distance, Manhattan distance and Minkowski distance
• Case Study: Recommender system
• Introduction to Artificial Neural Networks and Deep Learning
• What is Unsupervised learning?
» Key concept: Clustering
Goal: Automatically partition unlabeled data into groups of similar data points.
Useful for:
• Automatically organizing data.
• Understanding hidden structure in data: finding patterns, structure, or sub-populations ("knowledge discovery").
• Preprocessing for further analysis, e.g. PCA and other dimensionality-reduction techniques that represent high-dimensional data in a low-dimensional space (for visualization purposes).
The training data does not include desired outputs, so this is a less well-defined problem with no obvious error metrics.
Examples: market segmentation, clustering of hand-written digits, news clustering (Google News), etc.
Applications..
• Cluster news articles, web pages, or search results by topic (topic modeling).
• Cluster users of social networks by interest (community detection).
• Cluster customers according to purchase history.
• Cluster galaxies or nearby stars (e.g. the Sloan Digital Sky Survey).
• Derive underlying rules, recurring patterns, topics, etc.
Clustering..
• We need a distance metric and a method that uses that metric to find self-similar groups.
• Clustering is a ubiquitous procedure in any field that deals with high-dimensional data, e.g. bioinformatics.
• Due to this ubiquity and general usefulness, it is an essential technique to learn.
Types of Unsupervised Learning..
• k-Means Clustering
Aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean; it minimizes within-cluster variances.
• Hierarchical Clustering
Hierarchical clustering typically works by sequentially merging the most similar clusters, which is known as agglomerative hierarchical clustering. It can also be done by initially grouping all the observations into one cluster and then successively splitting these clusters; this is known as divisive hierarchical clustering.
Linkage criteria: single linkage, complete linkage, and average linkage (see the sketch below).
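A minimal sketch of agglomerative clustering under the three linkage criteria; the library choice (scikit-learn) and the toy data are illustrative assumptions, not part of the slides:

```python
# Minimal sketch: agglomerative clustering with different linkage criteria.
# The toy data (two obvious groups of 2-D points) is illustrative.
import numpy as np
from sklearn.cluster import AgglomerativeClustering

X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]])

for linkage in ("single", "complete", "average"):
    labels = AgglomerativeClustering(n_clusters=2, linkage=linkage).fit_predict(X)
    print(linkage, labels)
```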
Hierarchical clustering
Distance between the Clusters
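For reference, the standard definitions of the three linkage criteria, where d(a, b) is the chosen point-to-point distance:

```latex
% Single, complete, and average linkage between clusters A and B
d_{\mathrm{single}}(A,B)   = \min_{a \in A,\, b \in B} d(a,b) \\
d_{\mathrm{complete}}(A,B) = \max_{a \in A,\, b \in B} d(a,b) \\
d_{\mathrm{average}}(A,B)  = \frac{1}{|A|\,|B|} \sum_{a \in A} \sum_{b \in B} d(a,b)
```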
DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
• Given a set of points in some space, DBSCAN groups together points that are closely packed (points in high-density regions) and marks as outliers the points that lie alone in low-density regions, i.e. points whose nearest neighbors are too far away (a minimal sketch follows).
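A minimal sketch using scikit-learn's DBSCAN; the eps and min_samples values here are illustrative, not tuned recommendations:

```python
# Minimal sketch: DBSCAN marks low-density points as noise (label -1).
import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[1, 2], [2, 2], [2, 3],
              [8, 7], [8, 8], [25, 80]])

labels = DBSCAN(eps=3, min_samples=2).fit_predict(X)
print(labels)  # [0 0 0 1 1 -1]: the isolated point is labeled as noise
```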
Challenges in Unsupervised learning..
• Higher risk of inaccurate results.
• Longer training times.
• Computational complexity due to the high volume of training data.
• Lack of transparency into the basis on which data are clustered.
• Human intervention needed to validate the outputs.
• Large sets of hyperparameters to tune.
Pre-processing and Scaling..
• Data Transformation
• Data must be numeric and scaled.
• Missing-value, outlier, and skewness detection
• Data encoding:
Label encoding, one-hot encoding (see the sketch below)
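A minimal sketch of the two encodings with pandas; the "color" column is an illustrative example, not from the slides:

```python
# Minimal sketch: label encoding vs. one-hot encoding with pandas.
import pandas as pd

df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# Label encoding: each category becomes an integer code
df["color_label"] = df["color"].astype("category").cat.codes

# One-hot encoding: one binary indicator column per category
one_hot = pd.get_dummies(df["color"], prefix="color")
print(pd.concat([df, one_hot], axis=1))
```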
Pre-processing and Scaling..
• Import data
• Summarize/plot raw data
• Impute missing values
• Normalize/Standardize data (a minimal sketch of these two steps follows the list)
• Handle outliers
• Data analysis
• Interpretation & Validation
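A minimal sketch of the imputation and standardization steps with scikit-learn; the toy data and the mean-imputation strategy are illustrative choices:

```python
# Minimal sketch: impute missing values, then standardize features.
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0, 200.0],
              [2.0, np.nan],   # a missing value to impute
              [3.0, 400.0]])

X = SimpleImputer(strategy="mean").fit_transform(X)  # fill NaN with column mean
X = StandardScaler().fit_transform(X)                # zero mean, unit variance
print(X)
```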
K-Means Clustering..
• The k-Means clustering algorithm was proposed by J. Hartigan and M. A. Wong [1979].
• Given a set of n distinct objects, the k-Means clustering algorithm partitions the objects into k clusters such that intra-cluster similarity is high but inter-cluster similarity is low.
• The user has to specify k, the number of clusters. The objects are assumed to be described by numeric attributes, so any one of the distance metrics can be used to demarcate the clusters.
K-Means Clustering..
Algorithm: k-Means clustering
1. Choose k objects as the initial cluster centers (centroids).
2. Assign each object to the cluster whose centroid is nearest.
3. Compute the "cluster center" (centroid) as the mean of all points in each cluster.
4. Repeat steps 2-3 until the cluster assignments no longer change.
5. Stop.
K-Means Clustering (Pseudocode; a runnable sketch follows)
• Given unlabeled feature vectors
  D = {x(1), x(2), …, x(N)}
• Initialize cluster centers c = {c(1), …, c(K)}
  and cluster assignments z = {z(1), z(2), …, z(N)}
• Repeat until convergence:
  – for j in {1, …, K}:
      c(j) = mean of all points assigned to cluster j
  – for i in {1, …, N}:
      z(i) = index j of the cluster center nearest to x(i)
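A runnable NumPy version of the pseudocode above; initializing the centers from randomly chosen data points is one common choice, not prescribed by the slide:

```python
# Minimal NumPy implementation of the k-means pseudocode above.
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Assignment step: z(i) = index of the center nearest to x(i)
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        z = dists.argmin(axis=1)
        # Update step: c(j) = mean of the points assigned to cluster j
        new_centers = np.array([X[z == j].mean(axis=0) if np.any(z == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):  # converged
            break
        centers = new_centers
    return centers, z

X = np.array([[2.0, 3.0], [6.0, 1.0], [1.0, 2.0], [3.0, 0.0]])
print(kmeans(X, k=2))
```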
Euclidean distance, Manhattan distance and Minkowski distance…
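The standard definitions, for points x = (x1, …, xn) and y = (y1, …, yn):

```latex
% Standard distance definitions for x, y in R^n
d_{\mathrm{Euclidean}}(x,y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2} \\
d_{\mathrm{Manhattan}}(x,y) = \sum_{i=1}^{n} \lvert x_i - y_i \rvert \\
d_{\mathrm{Minkowski}}(x,y) = \Bigl( \sum_{i=1}^{n} \lvert x_i - y_i \rvert^{p} \Bigr)^{1/p}
```

Minkowski distance generalizes the other two: p = 1 gives Manhattan distance and p = 2 gives Euclidean distance.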
Cosine Similarity..
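The standard definition, which compares the angle between two vectors rather than their magnitudes:

```latex
% Cosine similarity between vectors x and y
\cos(x,y) = \frac{x \cdot y}{\lVert x \rVert \, \lVert y \rVert}
          = \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2}\; \sqrt{\sum_{i=1}^{n} y_i^2}}
```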
Distance Measures:
Optimal K..
One method of choosing the value of K is the elbow method. In this method we run K-Means clustering for a range of K values, say K = 1 to 10, and for each K calculate the Sum of Squared Error (SSE): the sum of squared distances between each data point and its cluster centroid. Plotting SSE against K, a good value of K lies at the "elbow" where the curve begins to flatten (a minimal sketch follows).
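A minimal sketch of the elbow method with scikit-learn's KMeans, whose inertia_ attribute is exactly this SSE; the random data is illustrative:

```python
# Minimal sketch: compute SSE for K = 1..10 and look for the elbow.
import numpy as np
from sklearn.cluster import KMeans

X = np.random.default_rng(0).normal(size=(300, 2))  # illustrative data

for k in range(1, 11):
    sse = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
    print(k, sse)  # plot k vs. SSE and pick the "elbow" where it flattens
```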
Example..
f1 f2 f3 f4 f5 f6
A 3 5
B 2 2 3 4
C 5 4 4 5
D 5 5 2 5 4
Example..
    X1  X2
A    2   3
B    6   1
C    1   2
D    3   0
Centroids:
AB = ((2+6)/2, (3+1)/2) = (4, 2) and CD = ((1+3)/2, (2+0)/2) = (2, 1)
The criteria of the objective function with different proximity measures
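With Euclidean distance as the proximity measure, the standard k-means objective is the within-cluster sum of squares (the SSE above), stated here for reference:

```latex
% k-means objective (SSE / WSS) with Euclidean distance,
% where c_j denotes the centroid of cluster C_j
J = \sum_{j=1}^{k} \sum_{x \in C_j} \lVert x - c_j \rVert^{2}
```

With Manhattan distance, the analogous objective sums absolute deviations, and the best cluster representative becomes the component-wise median (the k-medians variant).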
Strengths:
• Uses simple principles without the need for any complex statistical terms.
• Once clusters and their associated centroids are identified, it is easy to assign new objects (for example, new customers) to a cluster based on the object's distance from the closest centroid.
• Because the method is unsupervised, using k-means helps to eliminate subjectivity from the analysis.
Weaknesses:
• How to choose K?
• The k-means algorithm is sensitive to the starting positions of the initial centroids. Thus, it is important to rerun the k-means analysis several times for a particular value of k to ensure the cluster results provide the overall minimum WSS (within-cluster sum of squares), as sketched below.
• Susceptible to the curse of dimensionality.
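A minimal sketch of the rerun strategy using scikit-learn, whose n_init parameter retries k-means from different random centroids and keeps the run with the lowest WSS; the data is illustrative:

```python
# Minimal sketch: rerun k-means n_init times, keep the lowest-WSS solution.
import numpy as np
from sklearn.cluster import KMeans

X = np.random.default_rng(0).normal(size=(200, 2))  # illustrative data

best = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(best.inertia_)  # lowest within-cluster sum of squares over the 10 runs
```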