Clustering

Unsupervised learning techniques like k-means clustering are used to automatically organize unlabeled data points into groups. K-means clustering works by assigning data points to the cluster with the nearest centroid and recalculating centroids as cluster assignments change, minimizing within-cluster variance. Choosing the optimal number of clusters k is challenging, and results can be sensitive to initial centroid positions, requiring multiple runs. Distance metrics like Euclidean distance are used to assign points to centroids.


UNIT-V

Unsupervised Learning:
• Types of Unsupervised Learning
• Challenges in Unsupervised Learning
• Pre-processing and Scaling
• Applying Data Transformation
• K-Means Clustering
• Euclidean distance, Manhattan distance and Minkowski distance
• Case Study: Recommender system
• Introduction to Artificial Neural Networks and Deep Learning
• What is Unsupervised Learning?
» Key concept: Clustering
Goal: Automatically partition unlabeled data into groups of similar data points.
Useful for:
• Automatically organizing data.
• Understanding hidden structure in data: finding patterns, structure and sub-populations ("knowledge discovery").
• Preprocessing for further analysis, e.g. PCA and dimensionality reduction, which represent high-dimensional data in a low-dimensional space (for instance, for visualization).
The training data does not include desired outputs, so the problem is less well defined and has no obvious error metric.
Examples: market segmentation, clustering of hand-written digits, news clustering (Google News), etc.
Applications..
• Cluster news articles, web pages or search results by topic (topic modeling).
• Cluster users of social networks by interest (community detection).
• Cluster customers according to purchase history.
• Cluster galaxies or nearby stars (e.g. the Sloan Digital Sky Survey).
• Derive underlying rules, recurring patterns, topics, etc.
Clustering..
• We need a distance metric and a method that uses that metric to find self-similar groups.
• Clustering is a ubiquitous procedure in any field that deals with high-dimensional data, e.g. bioinformatics.
• Because of this ubiquity and general usefulness, it is an essential technique to learn.

Types of Unsupervised Learning..
• K-Means Clustering
Aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean (centroid). It minimizes within-cluster variances.
• Hierarchical Clustering
Hierarchical clustering typically works by sequentially merging the most similar clusters; this is known as agglomerative hierarchical clustering. It can also be done by initially grouping all the observations into one cluster and then successively splitting these clusters; this is known as divisive hierarchical clustering.
Linkage criteria: single linkage, complete linkage and average linkage (see the sketch below).
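A minimal sketch of agglomerative clustering, assuming SciPy is available; the toy data array X and the choice of average linkage are illustrative only, not part of the slides.

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    # Toy data: four two-dimensional observations.
    X = np.array([[2, 3], [6, 1], [1, 2], [3, 0]], dtype=float)

    # Build the merge tree with a chosen linkage criterion:
    # 'single', 'complete' or 'average'.
    Z = linkage(X, method="average", metric="euclidean")

    # Cut the tree so that at most two flat clusters remain.
    labels = fcluster(Z, t=2, criterion="maxclust")
    print(labels)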
Hierarchical clustering

Distance between the Clusters
DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
• Given a set of points in some space, DBSCAN groups together points that are closely packed (points in high-density regions) and marks as outliers those points that lie alone in low-density regions, i.e. points whose nearest neighbors are too far away.
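A minimal sketch using scikit-learn's DBSCAN (assumed available); the toy data and the eps/min_samples values are illustrative, not recommendations.

    import numpy as np
    from sklearn.cluster import DBSCAN

    X = np.array([[1.0, 1.1], [1.2, 0.9], [1.1, 1.0],   # a dense group
                  [8.0, 8.2], [8.1, 7.9],               # another dense group
                  [25.0, 80.0]])                        # an isolated point

    # eps: neighborhood radius; min_samples: points needed to form a dense region.
    labels = DBSCAN(eps=1.0, min_samples=2).fit_predict(X)
    print(labels)   # points labeled -1 are treated as noise/outliers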

Challenges in Unsupervised Learning..
• Higher risk of inaccurate results.
• Longer training times.
• Computational complexity due to the high volume of training data.
• Lack of transparency into the basis on which data are clustered.
• Human intervention is needed to validate the output.
• A large set of hyperparameters to tune.

Pre-processing and Scaling..

• Data transformation
• Data must be numeric and scaled.
• Missing values, outlier and skewness detection
• Data encoding:
 Label encoding, one-hot encoding (see the sketch below)
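A minimal sketch of scaling and encoding, assuming pandas and scikit-learn are available; the toy DataFrame and column names are illustrative.

    import pandas as pd
    from sklearn.preprocessing import StandardScaler, LabelEncoder

    df = pd.DataFrame({"income": [25000, 54000, 31000],
                       "city": ["Pune", "Delhi", "Pune"]})

    # Label encoding: map each category to an integer code.
    df["city_label"] = LabelEncoder().fit_transform(df["city"])

    # One-hot encoding: one binary column per category.
    df = pd.get_dummies(df, columns=["city"])

    # Standardize the numeric feature to zero mean and unit variance.
    df[["income"]] = StandardScaler().fit_transform(df[["income"]])
    print(df)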

Pre-processing and Scaling..

• Import data
• Summarize/plot raw data
• Impute missing values
• Normalize/Standardize data
• Handle outliers
• Data analysis
• Interpretation & Validation
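A minimal sketch of the impute-then-standardize steps from this pipeline, assuming scikit-learn; the small array with a missing value is illustrative.

    import numpy as np
    from sklearn.impute import SimpleImputer
    from sklearn.preprocessing import StandardScaler

    X = np.array([[1.0, 200.0],
                  [2.0, np.nan],    # a missing value to be imputed
                  [3.0, 260.0]])

    X_imputed = SimpleImputer(strategy="mean").fit_transform(X)   # fill with column mean
    X_scaled = StandardScaler().fit_transform(X_imputed)          # zero mean, unit variance
    print(X_scaled)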

K-Means Clustering..
• The k-means clustering algorithm was proposed by J. Hartigan and M. A. Wong [1979].
• Given a set of n distinct objects, the k-means clustering algorithm partitions the objects into k clusters such that intra-cluster similarity is high while inter-cluster similarity is low.
• The user has to specify k, the number of clusters. The objects are assumed to be described by numeric attributes, so any one of the distance metrics can be used to demarcate the clusters.

K-Means Clustering..
Algorithm: k-means clustering

Input: D, a dataset containing n objects; k, the number of clusters
Output: A set of k clusters
Steps:
1. Randomly choose k objects from D as the initial cluster centroids.
2. For each object in D:
   • Compute the distance between the current object and the k cluster centroids.
   • Assign the current object to the cluster to which it is closest.
3. Compute the new cluster centroids, i.e. the mean of all points in each cluster.
4. Repeat steps 2-3 until the convergence criterion is satisfied.
5. Stop.

K-Means Clustering (Pseudocode)
• Given unlabeled feature vectors D = {x(1), x(2), …, x(N)}
• Initialize cluster centers c = {c(1), …, c(K)} and cluster assignments z = {z(1), z(2), …, z(N)}
• Repeat until convergence:
  – for j in {1, …, K}:
      c(j) = mean of all points assigned to cluster j
  – for i in {1, …, N}:
      z(i) = index j of cluster center nearest to x(i)
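A minimal NumPy sketch of this k-means loop (Lloyd's algorithm); here the assignment step precedes the centroid update, and the initialization, toy data and choice of k are illustrative assumptions, not part of the slides.

    import numpy as np

    def kmeans(X, k, n_iters=100, seed=0):
        rng = np.random.default_rng(seed)
        # Pick k points from X as the initial centroids.
        centers = X[rng.choice(len(X), size=k, replace=False)]
        for _ in range(n_iters):
            # Assignment step: each point goes to its nearest centroid (Euclidean).
            dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            # Update step: each centroid becomes the mean of its assigned points.
            # (Empty clusters are not handled in this sketch.)
            new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
            if np.allclose(new_centers, centers):   # convergence check
                break
            centers = new_centers
        return centers, labels

    X = np.array([[2, 3], [6, 1], [1, 2], [3, 0]], dtype=float)
    centers, labels = kmeans(X, k=2)
    print(centers, labels)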

Euclidean distance, Manhattan distance and Minkowski distance…
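A minimal sketch of these metrics, assuming NumPy; the Minkowski distance generalizes both of the others, since p = 1 gives the Manhattan distance and p = 2 gives the Euclidean distance. The two sample vectors are illustrative.

    import numpy as np

    def minkowski(a, b, p):
        # (sum_i |a_i - b_i|^p)^(1/p)
        return np.sum(np.abs(a - b) ** p) ** (1.0 / p)

    a, b = np.array([3.0, 5.0]), np.array([2.0, 2.0])
    print(minkowski(a, b, p=1))   # Manhattan (L1) distance: 4.0
    print(minkowski(a, b, p=2))   # Euclidean (L2) distance: ~3.16
    print(minkowski(a, b, p=3))   # a general Minkowski distance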

Cosine Similarity..
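A minimal sketch, assuming NumPy: cosine similarity is the dot product of two vectors divided by the product of their norms, so it measures the angle between them rather than their magnitudes. The sample vectors are illustrative.

    import numpy as np

    def cosine_similarity(a, b):
        # Dot product of the vectors divided by the product of their norms.
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

    print(cosine_similarity(np.array([3.0, 5.0]), np.array([2.0, 2.0])))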

Distance Measures:

Optimal K..
One method of choosing the value of K is the elbow method. In this method we run k-means clustering for a range of K values, say K = 1 to 10, and for each K calculate the Sum of Squared Errors (SSE), i.e. the sum of the squared distances between the data points and their cluster centroids. Plotting SSE against K, the "elbow" where the decrease in SSE levels off suggests a good value of K.
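A minimal sketch of the elbow method, assuming NumPy and scikit-learn; the synthetic blob data is illustrative.

    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)
    # Three well-separated blobs, so the "elbow" should appear around K = 3.
    X = np.vstack([rng.normal(loc, 0.5, size=(30, 2)) for loc in (0.0, 5.0, 10.0)])

    for k in range(1, 11):
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
        # inertia_ is the SSE: sum of squared distances to the closest centroid.
        print(k, round(km.inertia_, 1))
    # Plot SSE against K and look for the point where the curve flattens.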

Example..

   f1 f2 f3 f4 f5 f6
A  3 5
B  2 2 3 4
C  5 4 4 5
D  5 5 2 5 4

Compute the Manhattan, Euclidean and Chebyshev distances and the cosine similarity between these objects.

Example..
X1 X2
A 2 3
B 6 1
C 1 2
D 3 0

Centroids:
AB= (4,2) and CD= (2,1)
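A minimal sketch of one assignment-and-update step on these points, assuming NumPy: it computes the Euclidean distance from each point to the two centroids, reassigns the points, and recomputes the centroids.

    import numpy as np

    points = {"A": (2, 3), "B": (6, 1), "C": (1, 2), "D": (3, 0)}
    centroids = {"AB": np.array([4.0, 2.0]), "CD": np.array([2.0, 1.0])}

    # Assignment step: each point goes to the nearest centroid (Euclidean distance).
    assignment = {}
    for name, p in points.items():
        p = np.array(p, dtype=float)
        dists = {c: np.linalg.norm(p - mu) for c, mu in centroids.items()}
        assignment[name] = min(dists, key=dists.get)
    print(assignment)   # A, C and D end up nearest to the CD centroid; B stays with AB

    # Update step: recompute each centroid as the mean of its assigned points.
    for c in centroids:
        members = [np.array(points[n], dtype=float) for n in points if assignment[n] == c]
        if members:   # empty clusters are not handled further in this sketch
            centroids[c] = np.mean(members, axis=0)
    print(centroids)    # AB becomes (6, 1); CD becomes (2, 1.67)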

The criteria of the objective function with different proximity measures

1. SSE (using the L2 norm): minimize the SSE.
2. SAE (using the L1 norm): minimize the SAE.
3. TC (using cosine similarity): maximize the TC.
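A minimal sketch of these three criteria, assuming NumPy and assuming TC denotes total cohesion (the sum of cosine similarities between points and their cluster centroid); the data, labels and centers are illustrative.

    import numpy as np

    def objectives(X, labels, centers):
        diffs = X - centers[labels]                 # each point minus its own centroid
        sse = np.sum(diffs ** 2)                    # L2-based: sum of squared errors
        sae = np.sum(np.abs(diffs))                 # L1-based: sum of absolute errors
        cos = np.sum(X * centers[labels], axis=1) / (
            np.linalg.norm(X, axis=1) * np.linalg.norm(centers[labels], axis=1))
        tc = np.sum(cos)                            # sum of cosine similarities (cohesion)
        return sse, sae, tc

    X = np.array([[2, 3], [6, 1], [1, 2], [3, 0]], dtype=float)
    labels = np.array([0, 0, 1, 1])
    centers = np.array([[4.0, 2.0], [2.0, 1.0]])
    print(objectives(X, labels, centers))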

Strengths:
• Uses simple principles, without the need for complex statistical machinery.
• Once the clusters and their associated centroids are identified, it is easy to assign new objects (for example, new customers) to a cluster based on the object's distance from the closest centroid (see the sketch after this list).
• Because the method is unsupervised, using k-means helps to eliminate subjectivity from the analysis.
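A minimal sketch of assigning a new object to the nearest learned centroid, assuming scikit-learn; the fitted data and the new customer vector are illustrative.

    import numpy as np
    from sklearn.cluster import KMeans

    X = np.array([[2, 3], [6, 1], [1, 2], [3, 0]], dtype=float)
    km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

    new_customer = np.array([[5.0, 2.0]])
    print(km.predict(new_customer))   # index of the closest learned centroid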

Weaknesses:
• How to choose K?
• The k-means algorithm is sensitive to the starting positions of the initial centroids. Thus, it is important to rerun the k-means analysis several times for a particular value of k to ensure the cluster results provide the overall minimum WSS (within-cluster sum of squares).
• Susceptible to the curse of dimensionality.

