UNIT4-UNSUPERVISED LEARNING
Introduction to clustering, K-Means clustering, K-Modes clustering,
Distance-based clustering, Clustering around medoids,
Silhouette analysis, Hierarchical clustering
Clustering is the task of dividing the population or data points into a number of groups such that data points in
the same group are more similar to each other than to data points in other groups. In other words, it is a
grouping of objects on the basis of the similarity and dissimilarity between them. A clustering algorithm tries to
find natural groups in the data on the basis of some similarity measure. Clustering divides data points into
homogeneous classes or clusters.
K-Means Clustering is an unsupervised learning algorithm which groups an unlabelled dataset into different
clusters. Here K defines the number of pre-defined clusters that need to be created in the process: if K=2,
there will be two clusters, for K=3 there will be three clusters, and so on. It is an iterative algorithm that
divides the unlabelled dataset into k different clusters in such a way that each data point belongs to only one
group of points with similar properties.
It is a centroid-based algorithm, where each cluster is associated with a centroid. The main aim of the
algorithm is to minimize the sum of distances between the data points and their corresponding cluster centroids.
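As a quick illustration, the following is a minimal K-Means sketch in Python using scikit-learn; the synthetic two-blob dataset and the choice K=2 are assumptions made only for this example.

import numpy as np
from sklearn.cluster import KMeans

# Toy unlabelled data: two Gaussian blobs in 2-D
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])

# K=2 clusters; the algorithm iteratively moves the centroids to minimize
# the sum of squared distances from points to their assigned centroid
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.cluster_centers_)   # one centroid per cluster
print(kmeans.labels_[:10])       # cluster assignment for the first 10 points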
K-mode clustering is an unsupervised machine-learning technique used to group a set of data objects into a
specified number of clusters, based on their categorical attributes. The algorithm is called “K-Mode” because it
uses modes (i.e. the most frequent values) instead of means or medians to represent the clusters.
K-Modes clustering is an iterative algorithm that starts by selecting k initial data points as centroids of the
cluster. After that, each data point in the dataset is assigned to a cluster based on its similarity with the
centroids. After creating the clusters for the first time, we select a new centroid in each cluster using the mode of
each feature in the cluster's data. After selecting the new centroids, we calculate their dissimilarity from each data
point and reassign the points to clusters. This process continues until it converges, i.e. there is no change to the
clusters in two consecutive iterations.
K-Modes clustering partitions the data into k mutually exclusive groups. Hence, it is termed a
partitioning clustering algorithm.
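A minimal sketch in Python, assuming the third-party kmodes package is installed (pip install kmodes); the toy categorical data and the chosen parameters are illustrative only.

import numpy as np
from kmodes.kmodes import KModes

# Toy categorical data: colour, size, shape
X = np.array([
    ["red",  "small", "round"],
    ["red",  "small", "oval"],
    ["blue", "large", "round"],
    ["blue", "large", "square"],
])

km = KModes(n_clusters=2, init="Huang", n_init=5)
labels = km.fit_predict(X)
print(labels)                  # cluster index per row
print(km.cluster_centroids_)   # the mode of each feature in each cluster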
K-MEDOIDS CLUSTERING:
K-Medoid clustering is a partitioning method used in cluster analysis, a technique used to classify a set of
objects into groups (or clusters) such that objects in the same group are more similar to each other than to
those in other groups.
K-Medoid clustering is an extension of the K-Means clustering algorithm, with the main difference being that K-
Medoid uses actual data points as cluster centers (medoids) instead of the means of the points in the cluster.
This makes K-Medoid more robust to outliers and noise in the data.
K-Medoids and K-Means are two types of partitioning clustering mechanisms. Clustering is the process of
breaking down an abstract group of data points/objects into classes of similar objects, such that all the objects
in one cluster have similar traits. In partitioning clustering, a group of n objects is broken down into k clusters
based on their similarities.
Medoid: A Medoid is a point in the cluster from which the sum of distances to other data points is minimal.
(or)
A Medoid is a point in the cluster from which dissimilarities with all the other points in the clusters are
minimal.
K-Medoids is an unsupervised method that works on unlabelled data. It is an improved version of the
K-Means algorithm, mainly designed to deal with K-Means' sensitivity to outliers. Compared to other partitioning
algorithms, the algorithm is simple, fast, and easy to implement.
Steps:
1. Initially select k random points as the medoids from the given n data points.
2. Assign each data point to the closest medoid using any distance metric, such as Manhattan distance.
3. Calculate the cost as the total sum of distances of the data points from their assigned medoids: c = ∑ d(Pi, Ci).
4. Swap one medoid point with one of the non-medoid points and repeat steps 2 and 3.
5. If the new cost is greater than the previous cost, undo the swap, conclude the process, and finalize the clusters;
otherwise repeat step 4.
Instead of centroids as reference points in K-Means algorithms, the K-Medoids algorithm takes a Medoid as a
reference point.
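The following is a minimal NumPy sketch of these steps (a PAM-style swap search with Manhattan distance); the function name, the toy data, and k=2 are assumptions made for illustration, not an optimized implementation.

import numpy as np

def k_medoids(X, k, max_iter=100, rng=np.random.default_rng(0)):
    n = len(X)
    medoids = rng.choice(n, size=k, replace=False)           # step 1: k random medoids
    dist = np.abs(X[:, None, :] - X[None, :, :]).sum(-1)     # Manhattan distances (metric for step 2)

    def cost(meds):
        return dist[:, meds].min(axis=1).sum()               # step 3: total distance to nearest medoid

    best = cost(medoids)
    for _ in range(max_iter):
        improved = False
        for i in range(k):                                    # step 4: try swapping each medoid
            for p in range(n):
                if p in medoids:
                    continue
                trial = medoids.copy()
                trial[i] = p
                c = cost(trial)
                if c < best:                                  # step 5: keep a swap only if cost decreases
                    best, medoids, improved = c, trial, True
        if not improved:
            break                                             # converged: no swap lowers the cost
    labels = dist[:, medoids].argmin(axis=1)                  # final cluster assignment
    return medoids, labels

# Example usage on toy 2-D data
X = np.array([[1.0, 2.0], [1.5, 1.8], [8.0, 8.0], [8.2, 7.9], [0.9, 2.1], [7.9, 8.1]])
medoids, labels = k_medoids(X, k=2)
print(medoids, labels)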
Hierarchical Clustering:
Hierarchical clustering is another unsupervised machine learning algorithm, used to group unlabeled datasets
into clusters; it is also known as hierarchical cluster analysis (HCA). In this algorithm, we develop the hierarchy
of clusters in the form of a tree, and this tree-shaped structure is known as a dendrogram. The results of
K-means clustering and hierarchical clustering may sometimes look similar, but the two differ in how they work.
1. Agglomerative: Agglomerative is a bottom-up approach, in which the algorithm starts with taking all data
points as single clusters and merging them until one cluster is left.
2. Divisive: Divisive algorithm is the reverse of the agglomerative algorithm as it is a top-down approach.
Agglomerative Clustering: This is the most common type of hierarchical clustering. It starts by considering
each data point as a single cluster and then successively merges or combines the closest pairs of clusters until
only one cluster remains. The algorithm proceeds iteratively, at each stage merging the two most similar
clusters, until all data points belong to a single cluster.
The choice of distance metric (to determine the similarity between clusters) and linkage criterion (to
determine which clusters to merge) are important decisions in agglomerative clustering.
• Calculate the distance matrix: Compute the pairwise distances between all data points. The choice of
distance metric (such as Euclidean distance, Manhattan distance, or others) depends on the nature of the data.
• Create clusters: Start by considering each data point as a single cluster.
• Merge or split clusters: For agglomerative clustering, merge the two closest clusters, and for divisive
clustering, split the cluster into smaller clusters.
• Update the distance matrix: Recalculate the distances between the new cluster(s) and the existing clusters
or data points.
• Repeat: Repeat the merge (or split) and distance-update steps until only a single cluster remains
(agglomerative) or until each data point is in its own cluster (divisive).
• Dendrogram: Create a dendrogram, which is a tree-like diagram that shows the arrangement of the clusters
produced by the hierarchical clustering algorithm. The height at which branches merge in the dendrogram
represents the distance between the clusters.
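As an illustration of these steps, here is a minimal agglomerative clustering sketch in Python using SciPy; the toy 2-D data, the Euclidean metric, and Ward linkage are example choices, not prescribed by the text.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram

# Toy 2-D data: two obvious groups
X = np.array([[1.0, 2.0], [1.5, 1.8], [0.9, 2.1],
              [8.0, 8.0], [8.2, 7.9], [7.9, 8.1]])

# Build the merge hierarchy bottom-up (agglomerative)
Z = linkage(X, method="ward", metric="euclidean")

# Cut the tree into 2 flat clusters
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)

# dendrogram(Z) draws the tree; the height at which branches merge
# reflects the distance between the merged clusters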
Agglomerative hierarchical clustering is a versatile clustering algorithm that can be applied to various fields
where grouping similar items together is important. Here are some suitable applications for agglomerative
hierarchical clustering:
Biology and Bioinformatics: Clustering genes or proteins based on their expression patterns or sequences can
help in understanding genetic similarities and evolutionary relationships.
Document Clustering: Grouping similar documents together based on their content can be useful in
information retrieval, topic modeling, and document organization.
Market Segmentation: Businesses can use hierarchical clustering to segment customers into different groups
based on their purchasing behaviour, preferences, or demographics.
Image Segmentation: In image processing, clustering can be used to segment images into meaningful regions
based on pixel similarities, aiding tasks like object recognition and tracking.
Social Network Analysis: Clustering users in social networks based on their interactions and interests can
reveal community structures and help in targeted marketing or content recommendation.
Anomaly Detection: Identifying outliers or anomalies in a dataset can be approached as a clustering problem.
Agglomerative hierarchical clustering can help identify clusters of normal behavior, making it easier to spot
unusual patterns.
Customer Segmentation: Businesses can use hierarchical clustering to segment their customers into different
groups based on their behaviour, purchasing patterns, and preferences. This information can be used for
targeted marketing strategies.
Speech Recognition: Clustering phonemes or speech patterns can help improve speech recognition systems by
grouping similar sounds together.
Recommendation Systems: Grouping users or items based on their preferences and behaviours can enhance
recommendation algorithms by suggesting products or content that similar users have liked.
Medicine and Healthcare: Agglomerative hierarchical clustering can be applied to medical data for patient
stratification, identifying subgroups of patients with similar disease characteristics, genetics, or treatment
responses.
Fraud Detection: Clustering credit card transactions or financial data can help in detecting unusual patterns
that might indicate fraudulent activities.
In the Silhouette algorithm, we assume that the data has already been clustered into k clusters by a clustering
technique. Silhouette analysis is used to check the quality of a clustering model by measuring the separation
between clusters. It basically provides a way to assess parameters such as the number of clusters with the
help of the silhouette score. This score measures how close each point in one cluster is to points in the
neighbouring clusters.
Silhouette analysis is a common method because it is more straightforward than many alternatives. Silhouette
analysis, or a silhouette plot, is often used with the K-Means algorithm to measure the separation between
clusters. K-Means clustering is a simple and popular unsupervised machine learning algorithm, and it can be
evaluated in two ways: the elbow technique and the silhouette method.
The silhouette score describes the nature of the clusters formed, taking values in the range [-1, 1].
A silhouette score of +1 indicates that a specific data point is distant from its neighbouring cluster and
very close to the cluster it is assigned to. In contrast, a value of -1 indicates that the point is closer to its
neighbouring cluster than to the cluster it is assigned to. A value of 0 means the data point most likely lies on
the boundary between the two clusters. A value of +1 is the ideal score for good clustering performance,
whereas -1 is the least preferred. However, a silhouette score of +1 is hard to achieve in real life when dealing
with unstructured and complex data.
The silhouette score is calculated from the mean intra-cluster distance a and the mean nearest-cluster
distance b for each sample, as s = (b - a) / max(a, b), with the condition that the number of labels is at least 2
and smaller than the number of samples.
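A minimal sketch of silhouette evaluation for K-Means in Python with scikit-learn; the synthetic data and the range of k values tried are illustrative.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Toy data: two Gaussian blobs
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(6, 1, (50, 2))])

# Compare candidate numbers of clusters by their mean silhouette score
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, round(silhouette_score(X, labels), 3))  # higher = better-separated clusters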
1. Euclidean Distance:
Formula: d(x, y) = ((x2 - x1)^2 + (y2 - y1)^2)^(1/2)
Example:
Points: A(1, 2), B(4, 6)
Distance = ((4 - 1)^2 + (6 - 2)^2)^(1/2)
= (9 + 16)^(1/2)
= 5
Used in:
K-Means
Agglomerative Clustering
t-SNE (for visualization)
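A quick check of the worked example in Python (points A and B as in the text):

import numpy as np

A = np.array([1, 2])
B = np.array([4, 6])
print(np.linalg.norm(A - B))   # Euclidean distance -> 5.0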
3. Minkowski Distance
Use when:
You want flexibility: it generalizes both Euclidean (p=2) and Manhattan (p=1)
You're experimenting with different "flavors" of distance
Tuning Tip:
Try p = 1.5 or p = 3 and compare clustering results.
Used in:
Any clustering that supports custom metrics (e.g., K-Medoids)
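A small sketch in Python comparing Minkowski distances for different p values using SciPy; the two points are illustrative.

from scipy.spatial.distance import minkowski

a, b = [1, 2], [4, 6]
for p in (1, 1.5, 2, 3):
    print(p, round(minkowski(a, b, p=p), 3))   # p=1 -> Manhattan, p=2 -> Euclidean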
4. Cosine Distance
Use when:
Your data is directional (angle matters, not magnitude)
Mostly used in text mining or recommender systems
Example:
Two documents with similar word distribution → Small angle → High similarity
Used in:
Text clustering (TF-IDF vectors)
Chatbot intent grouping
News/article topic clustering
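A minimal Python sketch of cosine distance on toy document vectors (the counts are illustrative stand-ins for TF-IDF weights):

from scipy.spatial.distance import cosine

doc1 = [3, 0, 1, 2]        # term weights for document 1
doc2 = [6, 0, 2, 4]        # same direction, larger magnitude
print(cosine(doc1, doc2))  # ~0.0 -> small angle, high similarity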
5. Hamming Distance
Use when:
Your data is binary or categorical
You want to compare bitstrings, options, or labels
Example:
A = "10101", B = "10011" → Distance = 2 (only 2 bits are different)
Used in:
DNA sequence comparison
Spam detection
Sensor failure detection (on/off signals)
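A minimal Python sketch of Hamming distance on the bitstrings from the example above:

def hamming(a: str, b: str) -> int:
    # Count positions where the two equal-length strings differ
    assert len(a) == len(b)
    return sum(x != y for x, y in zip(a, b))

print(hamming("10101", "10011"))   # -> 2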