
MACHINE LEARNING

UNIT4-UNSUPERVISED LEARNING
Introduction to clustering, K-Means clustering, K-Mode clustering,
Distance-based clustering, Clustering around medoids,
Silhouette clustering, Hierarchical clustering

Introduction to Clustering: Clustering is a type of unsupervised learning method. An unsupervised learning method is one in which we draw inferences from datasets consisting of input data without labeled responses. Generally, it is used as a process to find meaningful structure, explanatory underlying processes, generative features, and groupings inherent in a set of examples.

Clustering is the task of dividing the population or data points into a number of groups such that data points in the same group are more similar to each other than to the data points in other groups. It is basically a grouping of objects on the basis of the similarity and dissimilarity between them. A clustering algorithm tries to find natural groups of data on the basis of some similarity. Clustering divides data points into homogeneous classes or clusters:

• Points in the same group are as similar as possible


• Points in different groups are as dissimilar as possible
When a collection of objects is given, we put the objects into groups based on similarity.
Clustering is very important as it determines the intrinsic grouping among the unlabelled data. There is no single criterion for good clustering; it depends on the user and on what criteria satisfy their needs. For instance, we could be interested in finding representatives for homogeneous groups (data reduction), finding "natural clusters" and describing their unknown properties ("natural" data types), finding useful and suitable groupings ("useful" data classes), or finding unusual data objects (outlier detection).
Clustering Methods:
• Density-Based Methods: These methods consider clusters to be dense regions of the space that have some similarity to each other and differ from the lower-density regions around them. These methods have good accuracy and the ability to merge two clusters. Examples: DBSCAN (Density-Based Spatial Clustering of Applications with Noise), OPTICS (Ordering Points To Identify the Clustering Structure), etc.
• Hierarchical Based Methods: The clusters formed in this method form a tree-type structure based on the hierarchy. New clusters are formed using the previously formed ones. It is divided into two categories:
• Agglomerative (bottom-up approach)
• Divisive (top-down approach)
K-means clustering algorithm:
It is the simplest unsupervised learning algorithm that solves the clustering problem. The K-means algorithm partitions n observations into k clusters, where each observation belongs to the cluster with the nearest mean, which serves as a prototype of the cluster.

K-Means Clustering is an unsupervised learning algorithm which groups an unlabelled dataset into different clusters. Here K defines the number of pre-defined clusters that need to be created in the process; for example, if K=2 there will be two clusters, for K=3 there will be three clusters, and so on. It is an iterative algorithm that divides the unlabelled dataset into K different clusters in such a way that each data point belongs to only one group with similar properties.

It is a centroid-based algorithm, where each cluster is associated with a centroid. The main aim of this
algorithm is to minimize the sum of distances between the data point and their corresponding clusters.

The working of the K-Means algorithm is explained in the below steps:


Step-1: Select the number K to decide the number of clusters.
Step-2: Select K random points or centroids. (These may be points other than those from the input dataset.)
Step-3: Assign each data point to its closest centroid, which will form the predefined K clusters.
Step-4: Calculate the variance and place a new centroid for each cluster.
Step-5: Repeat the third step, i.e., reassign each data point to the new closest centroid of each cluster.
Step-6: If any reassignment occurs, then go to Step-4; otherwise go to FINISH.
Step-7: The model is ready.

In the K-means algorithm we use the Euclidean distance:

Euclidean distance = ((x2-x1)^2 + (y2-y1)^2)^1/2
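
A minimal Python sketch of K-Means using scikit-learn; the toy data, K=2 and random_state are illustrative assumptions, not part of the original notes:

import numpy as np
from sklearn.cluster import KMeans

# Toy 2-D points; in practice X is your own feature matrix.
X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]])

# K=2 pre-defined clusters; n_init restarts the algorithm with fresh random centroids.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

print(kmeans.labels_)           # cluster index assigned to each point (Steps 3-6)
print(kmeans.cluster_centers_)  # final centroids (the cluster means)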
K-Mode Clustering:
In machine learning, we often need to analyse datasets having categorical variables. Generally, K-Means
clustering is used as the partitioning clustering technique for numerical data. However, we cannot apply k-
means clustering to categorical data.
K-Modes clustering is an unsupervised machine learning technique. It is a partition clustering algorithm used
to group a dataset into K clusters.

K-mode clustering is an unsupervised machine-learning technique used to group a set of data objects into a
specified number of clusters, based on their categorical attributes. The algorithm is called “K-Mode” because it
uses modes (i.e. the most frequent values) instead of means or medians to represent the clusters.

K-Modes clustering is an iterative algorithm that starts by selecting K initial data points as the centroids of the clusters. After that, each data point in the dataset is assigned to a cluster based on its similarity with the centroids. After creating the clusters for the first time, we select a new centroid in each cluster using the mode of each feature in the data. After selecting the new centroids, we calculate their dissimilarity from each data point and regroup the clusters. This process continues until it converges and there is no change to the clusters in two consecutive iterations.

K-Modes clustering partitions the data into K mutually exclusive groups. Hence, it is termed a partitioning clustering algorithm.
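
A minimal Python sketch of K-Modes, assuming the third-party kmodes package is installed (pip install kmodes); the toy categorical data and the choice of K=2 are illustrative assumptions:

import numpy as np
from kmodes.kmodes import KModes  # assumed third-party package

# Toy categorical data (colour, size); replace with your own categorical features.
X = np.array([['red',   'small'],
              ['red',   'medium'],
              ['blue',  'large'],
              ['blue',  'large'],
              ['green', 'small']])

# K=2 clusters; 'Huang' initialization picks the initial modes from the data.
km = KModes(n_clusters=2, init='Huang', n_init=5, random_state=0)
labels = km.fit_predict(X)

print(labels)                  # cluster index for each row
print(km.cluster_centroids_)   # the modes (most frequent values) of each cluster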
K-MEDOIDS CLUSTERING:
K-Medoid clustering is a partitioning method used in cluster analysis, a technique used to classify a set of
objects into groups (or clusters) such that objects in the same group are more similar to each other than to
those in other groups.
K-Medoid clustering is an extension of the K-Means clustering algorithm, with the main difference being that K-
Medoid uses actual data points as cluster centers (medoids) instead of the means of the points in the cluster.
This makes K-Medoid more robust to outliers and noise in the data.

K-Medoids and K-Means are two types of clustering mechanisms in partition clustering. Clustering is the process of breaking down an abstract group of data points/objects into classes of similar objects such that all the objects in one cluster have similar traits. Here, a group of n objects is broken down into k clusters based on their similarities.

Medoid: A Medoid is a point in the cluster from which the sum of distances to other data points is minimal.
(or)
A Medoid is a point in the cluster from which dissimilarities with all the other points in the clusters are
minimal.

K-medoids is an unsupervised method for clustering unlabelled data. It is an improved version of the K-Means algorithm, mainly designed to deal with its sensitivity to outlier data. Compared to other partitioning algorithms, the algorithm is simple, fast, and easy to implement.

Steps:
1. Initially select k random points as the medoids from the given n data points.
2. Map each data point to the closest medoid by using any distance metric, such as the Manhattan distance.
3. Calculate the cost as the total sum of the distances of the data points from their assigned medoids, c = ∑ |Ci – Pi|, where Ci is the medoid assigned to data point Pi.
4. Swap one medoid point with one of the non-medoid points and repeat steps 2 and 3.
5. If the new cost is greater than the previous cost, we undo the swap, conclude the process, and finalize the clusters; otherwise repeat step 4.

Instead of centroids as reference points in K-Means algorithms, the K-Medoids algorithm takes a Medoid as a
reference point.
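
A minimal NumPy sketch of the swap-based procedure listed above; the toy data, k=2 and the Manhattan distance are illustrative assumptions rather than a reference implementation:

import numpy as np

def manhattan(a, b):
    # Manhattan (L1) distance between point(s) a and point b.
    return np.abs(a - b).sum(axis=-1)

def k_medoids(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    medoids = rng.choice(len(X), size=k, replace=False)       # step 1: random medoids

    def assign(m):
        d = np.stack([manhattan(X, X[i]) for i in m])          # steps 2-3: distances to medoids
        return d.min(axis=0).sum(), d.argmin(axis=0)           # total cost, cluster labels

    best_cost, labels = assign(medoids)
    for _ in range(n_iter):
        improved = False
        for mi in range(k):                                    # step 4: try medoid/non-medoid swaps
            for p in range(len(X)):
                if p in medoids:
                    continue
                candidate = medoids.copy()
                candidate[mi] = p
                cost, lab = assign(candidate)
                if cost < best_cost:                           # step 5: keep only cost-reducing swaps
                    medoids, best_cost, labels = candidate, cost, lab
                    improved = True
        if not improved:                                       # no swap helps: clusters are final
            break
    return X[medoids], labels

X = np.array([[1, 2], [2, 3], [2, 2], [8, 8], [8, 9], [25, 80]])  # note the outlier
centers, labels = k_medoids(X, k=2)
print(centers, labels)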
Hierarchical Clustering:
Hierarchical clustering is another unsupervised machine learning algorithm, which is used to group unlabelled datasets into clusters; it is also known as hierarchical cluster analysis or HCA. In this algorithm, we develop the hierarchy of clusters in the form of a tree, and this tree-shaped structure is known as the dendrogram. Sometimes the results of K-means clustering and hierarchical clustering may look similar, but they differ in how they work.
1. Agglomerative: Agglomerative is a bottom-up approach, in which the algorithm starts with taking all data
points as single clusters and merging them until one cluster is left.
2. Divisive: Divisive algorithm is the reverse of the agglomerative algorithm as it is a top-down approach.

Agglomerative Clustering: This is the most common type of hierarchical clustering. It starts by considering
each data point as a single cluster and then successively merges or combines the closest pairs of clusters until
only one cluster remains. The algorithm proceeds iteratively, at each stage merging the two most similar
clusters, until all data points belong to a single cluster.

The steps for agglomerative clustering are as follows:


• Start with n clusters, each containing a single data point.
• Find the two closest clusters and merge them into a single cluster.
• Repeat the previous step until only one cluster remains.

The choice of distance metric (to determine the similarity between clusters) and linkage criterion (to
determine which clusters to merge) are important decisions in agglomerative clustering.
• Calculate the distance matrix: Compute the pairwise distances between all data points. The choice of
distance metric (such as Euclidean distance, Manhattan distance, or others) depends on the nature of the data.
• Create clusters: Start by considering each data point as a single cluster.
• Merge or split clusters: For agglomerative clustering, merge the two closest clusters, and for divisive
clustering, split the cluster into smaller clusters.
• Update the distance matrix: Recalculate the distances between the new cluster(s) and the existing clusters
or data points.
• Repeat: Repeat the merging (or splitting) and distance-update steps until only a single cluster remains (agglomerative) or until each data point is in its own cluster (divisive).
• Dendrogram: Create a dendrogram, which is a tree-like diagram that shows the arrangement of the clusters
produced by the hierarchical clustering algorithm. The height at which branches merge in the dendrogram
represents the distance between the clusters.
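
A minimal Python sketch of agglomerative clustering and its dendrogram using SciPy; the toy data, the Ward linkage and the cut at two clusters are illustrative assumptions:

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

X = np.array([[1, 2], [2, 2], [2, 3],
              [8, 7], [8, 8], [9, 7]])

# Bottom-up merging; 'ward' merges the pair of clusters that least increases within-cluster variance.
Z = linkage(X, method='ward')

# Cut the tree to obtain a flat clustering with 2 clusters.
labels = fcluster(Z, t=2, criterion='maxclust')
print(labels)

# Dendrogram: the height of each merge shows the distance at which the clusters were joined.
dendrogram(Z)
plt.show()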

The applications of hierarchical clustering are given below:

Agglomerative hierarchical clustering is a versatile clustering algorithm that can be applied to various fields
where grouping similar items together is important. Here are some suitable applications for agglomerative
hierarchical clustering:
Biology and Bioinformatics: Clustering genes or proteins based on their expression patterns or sequences can
help in understanding genetic similarities and evolutionary relationships.
Document Clustering: Grouping similar documents together based on their content can be useful in
information retrieval, topic modeling, and document organization.

Market Segmentation: Businesses can use hierarchical clustering to segment customers into different groups
based on their purchasing behaviour, preferences, or demographics.

Image Segmentation: In image processing, clustering can be used to segment images into meaningful regions
based on pixel similarities, aiding tasks like object recognition and tracking.

Social Network Analysis: Clustering users in social networks based on their interactions and interests can
reveal community structures and help in targeted marketing or content recommendation.

Anomaly Detection: Identifying outliers or anomalies in a dataset can be approached as a clustering problem.
Agglomerative hierarchical clustering can help identify clusters of normal behavior, making it easier to spot
unusual patterns.

Customer Segmentation: Businesses can use hierarchical clustering to segment their customers into different
groups based on their behaviour, purchasing patterns, and preferences. This information can be used for
targeted marketing strategies.

Speech Recognition: Clustering phonemes or speech patterns can help improve speech recognition systems by
grouping similar sounds together.

Recommendation Systems: Grouping users or items based on their preferences and behaviours can enhance
recommendation algorithms by suggesting products or content that similar users have liked.

Medicine and Healthcare: Agglomerative hierarchical clustering can be applied to medical data for patient
stratification, identifying subgroups of patients with similar disease characteristics, genetics, or treatment
responses.

Fraud Detection: Clustering credit card transactions or financial data can help in detecting unusual patterns
that might indicate fraudulent activities.

Silhouette technique/ approach:


One of the fundamental steps of an unsupervised learning algorithm is to determine the number of clusters
into which the data may be divided. The silhouette algorithm is one of the many algorithms to determine the
optimal number of clusters for an unsupervised learning technique.

In the silhouette algorithm, we assume that the data has already been clustered into k clusters by a clustering technique. Silhouette analysis is used to check the quality of a clustering model by measuring the distance between the clusters. It basically provides us a way to assess parameters like the number of clusters with the help of the silhouette score. This score measures how close each point in one cluster is to points in the neighbouring clusters.

Silhouette analysis is the most common method as it is more straightforward compared to others. Silhouette analysis, or the silhouette plot, is often used with the K-Means algorithm to measure the separation distance between clusters. We know that K-means clustering is a simple and popular unsupervised machine learning algorithm. We can evaluate the algorithm in two ways: one is the elbow technique and the other is the silhouette method.

The silhouette score lies within the range [-1, 1] and reflects how well separated the clusters formed are.

A silhouette score of "+1" indicates that a specific data point is far away from its neighbouring cluster and very close to the cluster it is assigned to. In contrast, a value of "-1" indicates that the point is closer to its neighbouring cluster than to the cluster it is assigned to. A value of "0" means the data point most likely lies on the boundary between the two clusters. A value of "+1" is the ideal score for good clustering performance, whereas "-1" is the least preferred. However, a silhouette score of "+1" is hard to achieve in real life when dealing with unstructured and complex data.

The silhouette score is calculated using the mean intra-cluster distance, a, and the mean nearest-cluster distance, b, for each sample, with the condition that the number of labels is at least 2 and smaller than the number of samples.

Silhouette coefficient = (b - a) / max(a, b), which equals 1 - (a/b) = (b - a)/b when b > a.
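
A minimal Python sketch that computes the silhouette score for several K-Means solutions using scikit-learn; the toy data and the candidate values of K are illustrative assumptions:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]])

# Try several values of K and prefer the one with the highest mean silhouette score.
for k in (2, 3, 4):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, silhouette_score(X, labels))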


Distance based clustering:
It is a technique where data points that are close together (based on some distance metric) are assigned to
the same cluster. The goal is to minimize the distance within clusters and maximize the distance between
clusters.

1. Euclidean Distance:
Formula: d(x,y)=((x2-x1)^2+(y2-y1)^2)^1/2

Example:
Points: A(1, 2), B(4, 6)

Distance = ((4-1)^2 + (6-2)^2)^1/2
= (9 + 16)^1/2
= 25^1/2
= 5
Used in:
 K-Means
 Agglomerative Clustering
 t-SNE (for visualization)
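
The worked example can be verified in Python with the standard library; the points are those from the example above:

from math import dist   # Python 3.8+: Euclidean distance between two points

A, B = (1, 2), (4, 6)
print(dist(A, B))        # sqrt(3^2 + 4^2) = 5.0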

2. Manhattan Distance (L1 Norm)
Formula: d(x,y) = |x2-x1| + |y2-y1|

Use when:
 Your data is sparse (many 0s)
 You want less influence from outliers
 You're dealing with grid-based data (e.g., pixels or city blocks)
Example:
A(1, 2), B(4, 6) → |4-1| + |6-2| = 3 + 4 = 7
Used in:
 L1-regularized models
 Image processing
 DBSCAN (with some tuning)

3. Minkowski Distance
Use when:
 You want flexibility: it generalizes both Euclidean (p=2) and Manhattan (p=1)
 You're experimenting with different "flavors" of distance
Tuning Tip:
Try p = 1.5 or p = 3 and compare clustering results.
Used in:
 Any clustering that supports custom metrics (e.g., K-Medoids)
4. Cosine Distance
Use when:
 Your data is directional (angle matters, not magnitude)
 Mostly used in text mining or recommender systems
Example:
Two documents with similar word distribution → Small angle → High similarity
Used in:
 Text clustering (TF-IDF vectors)
 Chatbot intent grouping
 News/article topic clustering

5. Hamming Distance
Use when:
 Your data is binary or categorical
 You want to compare bitstrings, options, or labels
Example:
A = "10101", B = "10011" → Distance = 2 (only 2 bits are different)
Used in:
 DNA sequence comparison
 Spam detection
 Sensor failure detection (on/off signals)
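
A small Python sketch computing the remaining distances with SciPy; the example points, the value p=1.5 and the cosine/Hamming inputs are illustrative assumptions:

from scipy.spatial.distance import cityblock, minkowski, cosine, hamming

A, B = [1, 2], [4, 6]
print(cityblock(A, B))           # Manhattan (L1): |4-1| + |6-2| = 7
print(minkowski(A, B, p=1.5))    # Minkowski; p=1 gives Manhattan, p=2 gives Euclidean

# Cosine distance = 1 - cosine similarity; small when vectors point the same way.
print(cosine([1, 1, 0], [2, 2, 0]))   # 0.0 -> identical direction

# SciPy's hamming() returns the *fraction* of differing positions.
a, b = list("10101"), list("10011")
print(hamming(a, b) * len(a))    # 2 positions differ, as in the example above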
