Lecture 23 - Clustering

Clustering involves grouping data into clusters so that objects within a cluster are similar to each other but dissimilar to objects in other clusters. It is an unsupervised learning technique with no predefined classes. Clustering is used for data visualization, as a preprocessing step for other algorithms, and in applications like image processing, web mining, and bioinformatics. Common clustering algorithms include k-means, which assigns objects to the closest of k randomly selected centroids, iteratively updating the centroids until clusters stabilize. The quality of clustering depends on achieving high intra-cluster similarity and low inter-cluster similarity.


Clustering

What Is Clustering?
• Group data into clusters
• Similar to one another within the same cluster
• Dissimilar to the objects in other clusters
• Unsupervised learning: no predefined classes

Outliers
• Outliers are objects that do not belong to any cluster or form clusters of very small cardinality

[Figure: a scatter plot showing Cluster 1, Cluster 2, and a few outlying points outside both clusters]

• In some applications we are interested in discovering outliers, not clusters (outlier analysis)
Why do we cluster?
• Clustering: given a collection of data objects, group them so that objects are
• Similar to one another within the same cluster
• Dissimilar to the objects in other clusters

• Clustering results are used:
• As a stand-alone tool to get insight into data distribution
• Visualization of clusters may unveil important information
• As a preprocessing step for other algorithms
• Efficient indexing or compression often relies on clustering
Applications of clustering?
• Image Processing
• cluster images based on their visual content
• Web
• Cluster groups of users based on their access patterns on
webpages
• Cluster webpages based on their content
• Bioinformatics
• Cluster similar proteins together (similarity with respect to chemical
structure and/or functionality, etc.)
• Many more…
Observations to cluster
• Real-value attributes/variables
• e.g., salary, height

• Binary attributes
• e.g., gender (M/F), has_cancer(T/F)

• Nominal (categorical) attributes
• e.g., religion (Christian, Muslim, Buddhist, Hindu, etc.)

• Ordinal/Ranked attributes
• e.g., military rank (soldier, sergeant, lieutenant, captain, etc.)

• Variables of mixed types
• multiple attributes with various types
What Is A Good Clustering?
• High intra-class similarity and low inter-class similarity
• Depending on the similarity measure
• The ability to discover some or all of the hidden patterns
How Good Is A Clustering?
• Dissimilarity/similarity depends on distance function
• Different applications have different functions
• Judgment of clustering quality is typically highly subjective
Similarity and Dissimilarity Between Objects
• Distances are normally used as measures
• Minkowski distance: a generalization

  d(i, j) = (|x_i1 - x_j1|^q + |x_i2 - x_j2|^q + … + |x_ip - x_jp|^q)^(1/q),  q > 0

• If q = 2, d is the Euclidean distance
• If q = 1, d is the Manhattan distance
• Weighted distance: each term |x_ik - x_jk|^q can be multiplied by a weight w_k
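As a quick illustration, the Minkowski distance above can be computed in a few lines of Python (a sketch; the function name `minkowski` is ours, not from any library):

```python
# Minkowski distance between two points x and y.
# q = 2 gives Euclidean distance, q = 1 gives Manhattan distance.
def minkowski(x, y, q=2):
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1 / q)

x, y = (1, 2), (4, 6)
print(minkowski(x, y, q=2))  # Euclidean: 5.0
print(minkowski(x, y, q=1))  # Manhattan: 7.0
```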
K-means algorithm
• Step-1: Select the value of K, to decide the number of clusters to be formed.
• Step-2: Select K random points which will act as centroids.
• Step-3: Assign each data point to the nearest/closest centroid, based on its distance from the randomly selected points (centroids); this forms the initial clusters.
• Step-4: Compute a new centroid (the mean) of each cluster.
• Step-5: Repeat Step-3, reassigning each data point to the new closest centroid.
• Step-6: If any reassignment occurred, go to Step-4; else go to Step-7.
• Step-7: FINISH
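The steps above can be sketched in plain Python (a minimal illustration under the assumption of squared-Euclidean distance, not an optimized implementation; the function name `kmeans` and the sample points are ours):

```python
import random

def kmeans(points, k, max_iter=100, seed=0):
    """Sketch of the steps above: random centroids, assign, recompute, repeat."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # Step 2: pick K random points as centroids
    for _ in range(max_iter):
        # Steps 3/5: assign each point to its closest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda i: sum((a - b) ** 2
                                            for a, b in zip(p, centroids[i])))
            clusters[nearest].append(p)
        # Step 4: recompute each centroid as the mean of its cluster
        new = [tuple(sum(vals) / len(c) for vals in zip(*c)) if c else centroids[i]
               for i, c in enumerate(clusters)]
        if new == centroids:  # Step 6: no centroid moved, so we are done
            break
        centroids = new
    return centroids, clusters

points = [(1, 1), (1.5, 2), (8, 8), (9, 9)]
centers, groups = kmeans(points, k=2)
print(sorted(centers))  # [(1.25, 1.5), (8.5, 8.5)]
```

On this toy data the algorithm converges to the same two centroids regardless of which points are drawn as the initial centers.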
K-Means: Example
[Figure: K = 2. Arbitrarily choose K objects as the initial cluster centers; assign each object to the most similar center; update the cluster means; reassign objects; repeat updating and reassigning until the clusters stabilize.]
[Figure: K-means clustering with K = 4]
Implementation in Python
• from sklearn.cluster import KMeans
• kmeans = KMeans(n_clusters=2)
• kmeans.fit(X)
• https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html
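A minimal end-to-end run might look like this (the sample data `X` is made up for illustration; `random_state` and `n_init` are optional parameters used here for reproducibility):

```python
import numpy as np
from sklearn.cluster import KMeans

# Small made-up dataset: two obvious groups of 2-D points
X = np.array([[1, 1], [1.5, 2], [8, 8], [9, 9]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
kmeans.fit(X)

print(kmeans.labels_)           # cluster index assigned to each point
print(kmeans.cluster_centers_)  # one centroid (mean) per cluster
```

After fitting, `kmeans.predict` can assign new points to the learned clusters.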
