FPA Unit 3
CLUSTERING
Important questions:
1. What is clustering? Explain in detail.
2. Explain the different types of clustering methods and give an example of each.
3. Define the core aspects of clustering. What are the types of clustering algorithms applied in business?
4. Explain clustering. Write the steps of a distribution-based clustering algorithm.
5. Explain the role of clustering in data mining / write the applications of clustering.
6. Explain the k-means algorithm.
7. Differentiate between clustering and classification.
1. Algorithm
2. Metrics used in finding accuracy in k-means
3. Centroid point in k-means
4. Which algorithm is best for clustering? Explain with an example.
Clustering
Clustering is a way of grouping similar objects together based on their characteristics. It is a type of
unsupervised learning in machine learning, meaning that it does not require labelled data to find patterns.
Instead, it explores the data and forms groups (clusters) based on similarities. The main objective of
clustering is to group similar items together so that objects in the same group are more alike than those in
different groups. This helps in understanding patterns in data, making decisions, and identifying trends.
For example, in a shopping website, clustering can be used to group customers based on their shopping
habits. Some customers may prefer budget-friendly products, while others may buy premium items. By
identifying these groups, the website can recommend products that match each customer's interests,
improving their shopping experience and increasing sales.
Clustering vs. Classification

Feature | Classification | Clustering
Data Type | Requires labelled data (training data with known outputs). | Works with unlabelled data (no predefined labels).
Output | A discrete label (e.g., cat or dog in image classification). | A set of clusters where similar data points are grouped together.
Decision Boundaries | Creates clear decision boundaries between different categories. | Groups data based on similarity, but clusters may overlap.
Interpretability | More interpretable, since it assigns clear labels to data points. | Less interpretable, as clusters may not have clear meanings.
Flexibility | Less flexible, as it depends on predefined labels. | More flexible, as it can discover new patterns or structures.
Handling New Data | Requires retraining when new categories appear. | Can naturally adjust to new data by forming new clusters.
Scalability | Can be computationally expensive for large datasets. | Can be more scalable, depending on the clustering algorithm used.
Example scenario:
Classification Example:
A teacher wants to categorize students as "Pass" or "Fail" based on their exam scores. The model is trained
with past student data and their results, then predicts whether a new student will pass or fail.
Clustering Example:
A shop owner wants to group customers based on their shopping habits. Since there are no predefined
categories, a clustering algorithm forms groups like "frequent shoppers" and "occasional shoppers" to offer
better discounts and promotions.
Applications of clustering:
• Customer Segmentation
Clustering is used by businesses to group customers based on purchasing habits, demographics, or
preferences. This helps in personalized marketing, better product recommendations, and improved
customer engagement.
Example: An online retailer clusters customers into "budget shoppers" and "premium buyers" to send
targeted promotions.
• Anomaly Detection
Clustering helps identify unusual patterns in data, making it useful for fraud detection, cybersecurity, and
system monitoring. Anomalies are flagged when data points significantly differ from normal patterns.
Example: A bank detects fraudulent transactions when a customer’s spending suddenly spikes in an
unusual location.
• Image Segmentation
In image processing, clustering divides an image into regions with similar colours, textures, or features,
aiding in object detection and medical imaging. This simplifies image analysis and enhances recognition
accuracy.
Example: In healthcare, clustering is used to separate tumours from normal tissue in MRI scans.
• Document Categorization
Clustering helps organize large text datasets by grouping similar documents together, making it easier to
retrieve relevant information. It is widely used in search engines and content management.
Example: A news website automatically groups articles into categories like politics, sports, and
entertainment.
• Healthcare & Medical Diagnosis
In healthcare, clustering is used to group patients based on symptoms, medical history, or genetic factors.
This aids in disease classification, treatment planning, and early diagnosis.
Example: Hospitals use clustering to classify cancer patients into different risk levels for personalized
treatments.
• Finance & Fraud Detection
Banks and financial institutions use clustering to analyse customer behaviour and detect fraudulent
activities. It helps in identifying unusual transactions and segmenting customers based on risk levels.
Example: Credit card fraud detection systems flag suspicious transactions by identifying unusual
spending patterns.
• Social Media & Recommendation Systems
Social media platforms and streaming services use clustering to group users with similar interests. This
helps in personalized content recommendations and community detection.
Example: Netflix clusters users based on viewing history to suggest relevant movies and shows.
• Supply Chain & Logistics
Logistics companies use clustering to optimize delivery routes and supplier networks. It helps reduce
transportation costs and improve efficiency.
Example: DHL clusters delivery locations to create the most efficient delivery routes.
• Retail & Inventory Management
Clustering helps retailers optimize product placement, manage inventory, and analyse customer buying
patterns. It improves store layout design and demand forecasting.
Example: Supermarkets group products frequently bought together to arrange store shelves efficiently.
• Text Mining & Natural Language Processing (NLP)
In NLP, clustering is used for document classification, topic modeling, and sentiment analysis. It helps
organize large amounts of text data for efficient retrieval.
Example: Google News clusters articles into categories like politics, sports, and technology.
Core aspects
Clustering is a way of grouping similar things together based on patterns in the data. The groups, called
clusters, should have closely related items and clear boundaries. They should also be easy to describe using
just a few important features.
The main purpose of clustering is to find hidden patterns in data without needing labels. Machine learning
algorithms analyse the dataset and assign data points into clusters where similar items are grouped together.
Since clustering is an unsupervised learning method, it does not require predefined labels, making it useful
for discovering hidden patterns in various applications such as customer segmentation, anomaly detection,
and pattern recognition. By effectively clustering data, organizations can extract valuable insights and make
data-driven decisions.
Clustering methods
Clustering methods are techniques used to group similar data points based on different criteria such as data
structure, distance, and density. These methods help in identifying meaningful patterns in datasets and are
broadly divided based on how clusters are formed. Clustering techniques are broadly classified into six
major types:
1. Partition-Based Clustering
2. Hierarchical Clustering
3. Density-Based Clustering
4. Grid-Based Clustering
5. Model-Based Clustering
6. Constraint-based clustering
Partition-based clustering, also known as centroid-based clustering, is a method that divides a dataset into a fixed number of clusters, K, where each data point belongs to exactly one cluster. It ensures that the clusters are distinct and do not overlap. This method works by selecting central points called centroids for each cluster and grouping data points based on their closeness to these centroids. The process continues iteratively, adjusting the centroids until the clusters become stable.
A key challenge in this method is deciding the number of clusters (K) beforehand, which can affect the final results. It also assumes that all groups are evenly sized and round in shape, which is not always true of real data. Despite these challenges, it is a popular and widely used clustering method. The K-Means algorithm falls into this category.
Example: A shopping mall groups its visitors into three clusters based on the time they spend in the mall:
short visits (less than 30 minutes), medium visits (30-90 minutes), and long visits (more than 90 minutes). The
algorithm assigns each visitor to the nearest group and adjusts the group averages until the clusters are stable.
Steps:
1. Choose the number of clusters, K, and randomly select K initial centroids.
2. Assign each data point to the nearest centroid (the mean value of the cluster's data points).
3. Recalculate each centroid as the average of the points assigned to it.
4. Repeat steps 2-3 until the centroids no longer change.
Connectivity-based Clustering, also known as Hierarchical Clustering, is a method that groups data points
based on their similarity, forming a tree-like structure called a dendrogram. Similar objects have small
distances and are grouped together early, while dissimilar objects are farther apart in the hierarchy. This
method helps in understanding relationships between data points at different levels of similarity, through
techniques like cross-tabulation. Unlike partition-based clustering, it does not require the number of clusters
to be predefined.
The clustering process can follow different approaches depending on how the data is structured. It starts with
either a single cluster containing all data points (Divisive Clustering) or each data point as its own cluster
(Agglomerative Clustering) and then merges or splits clusters step by step.
The main goal of hierarchical clustering is to create a structured hierarchy of clusters, ranging from one large
cluster to many smaller clusters. This allows users to analyse the data at different levels of relationship and
choose the number of clusters based on the problem’s needs.
Example: A company groups its employees based on experience levels. It starts with all employees in one
group and gradually splits them into beginners, intermediate, and advanced professionals.
Steps:
Hierarchical clustering follows different approaches based on how clusters are built:
The Divisive (Top-Down) approach in hierarchical clustering starts with one large cluster containing all data
points and gradually splits it into smaller clusters. The process continues until a predefined condition is met,
such as achieving the desired number of clusters or ensuring each cluster has meaningful distinctions. This
method helps in breaking down broad groups into meaningful subcategories.
The Agglomerative (Bottom-Up) approach starts with each data point as an individual cluster. The algorithm merges the closest clusters step by step based on their similarities until all data points form one large cluster or the desired number of clusters is achieved. This approach is widely used as it provides a clear hierarchical structure, making it easier to visualize relationships between data points.
▪ Density-based clustering (or) Density-Based Spatial Clustering
(The first two methods discussed above depend on a distance (similarity/proximity) metric.)
Density-based clustering groups data points based on regions of high density, separating them from areas of
low density. Instead of relying on distance alone, it identifies clusters as dense areas of connected points,
allowing for clusters of various shapes and sizes. Unlike traditional clustering methods, it can handle noise
and outliers effectively, making it useful for real-world datasets where data may not follow a strict geometric
shape.
This method assumes that data contains some inconsistencies (noise) and that clusters may not always be circular or elliptical. By allowing flexible cluster shapes, density-based clustering ensures that no important data points are ignored: it identifies groups of closely packed data points while separating sparse regions as noise, giving it better handling of outliers.
Example Algorithm: DBSCAN (Density-Based Spatial Clustering of Applications with Noise), DENCAST.
Example: A shopping mall analyses customer movement using density-based clustering. Areas where many
customers gather, like near popular stores or food courts, are identified as clusters. Shoppers moving alone in
less crowded areas are marked as noise, helping the mall optimize store placement.
• Eps (ε): the radius that defines the neighbourhood around a data point.
• MinPts: the minimum number of points required in the Eps-neighbourhood of a point for it to count as a dense (core) point.
Steps:
1. For each data point, find all neighbours within the Eps radius.
2. Mark points with at least MinPts neighbours as core points.
3. Form a cluster from each core point together with all points that are density-reachable from it.
4. Label points that belong to no cluster as noise (outliers).
▪ Grid-based clustering (or) Cell-based clustering
Grid-based clustering divides the data space into a grid of cells and groups them based on density rather than
distance between individual points. The entire data space is overlaid with a grid structure, and each cell is
analysed for the number of data points it contains. Cells with a higher density of points are merged to form
clusters, making the process efficient and scalable for large datasets.
This method is useful for handling high-dimensional data and works faster than traditional clustering
approaches since it operates on grid cells instead of individual data points. It is especially effective in spatial
data analysis and geographic information systems. Grid-based clustering aims to organize large datasets
efficiently by grouping dense regions of a grid, making clustering faster and more scalable. Its fixed grid
structure allows for easy handling of new data without reprocessing the entire dataset, making it ideal for real-
time applications.
Example: A city traffic management system divides a map into grid cells and analyses vehicle density in each
cell. Areas with high traffic are grouped into clusters to optimize traffic signals and reduce congestion.
Steps:
1. Divide the data space into a finite number of grid cells.
2. Count the number of data points falling in each cell to estimate its density.
3. Discard cells whose density falls below a chosen threshold.
4. Merge adjacent dense cells to form the final clusters.
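The following toy base-R sketch illustrates only the counting step of this idea on random 2-D points; the 10 × 10 grid size and the density threshold of 5 are arbitrary assumptions, and merging adjacent dense cells is left out for brevity.

  set.seed(1)
  x <- runif(200); y <- runif(200)        # 200 random 2-D points
  cellx <- cut(x, breaks = 10)            # overlay a 10 x 10 grid on the data space
  celly <- cut(y, breaks = 10)
  counts <- table(cellx, celly)           # number of points falling in each cell
  which(counts >= 5, arr.ind = TRUE)      # cells dense enough to seed clusters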
Model-based clustering is a method that assumes data is generated by a mixture of probability distributions and groups data points based on statistical models. It assumes that each cluster follows a specific probability distribution, such as a Gaussian (normal) distribution. Instead of measuring distances between points, it calculates the likelihood that a point belongs to a cluster. This helps in handling overlapping clusters, where data points may have partial membership in multiple groups. The main objective of model-based clustering is to identify hidden patterns in data through these assumed distributions.
One key advantage of model-based clustering is that it can automatically determine the number of clusters by
analysing probabilities. Unlike traditional clustering, which requires setting the number of clusters manually,
this method estimates them based on data patterns. It is widely used in fields where data has complex
structures. The results are based on parameters like mean and variance, making the clusters more interpretable
and mathematically sound.
Example: A hospital uses model-based clustering to classify patients based on symptoms and medical history.
It analyses factors like age, test results, and disease patterns, assigning each patient a probability of belonging
to risk groups. This helps doctors provide personalized treatment plans.
Steps:
1. Assume each cluster follows a probability distribution (e.g., Gaussian) and choose the number of distributions.
2. Initialize the parameters of each distribution (e.g., mean and variance).
3. Compute the probability that each data point belongs to each cluster.
4. Update the distribution parameters to better fit the data.
5. Repeat steps 3-4 until the parameters stabilize, then assign each point to its most likely cluster.
Constraint-based clustering, also called semi-supervised clustering, incorporates user-defined rules to guide the clustering process. Unlike traditional methods that rely only on data structure, this approach allows specifying conditions like must-link (points that must be grouped together) and cannot-link (points that must remain separate).
This method ensures clusters follow specific rules, making results more accurate and interpretable. Once
boundaries are established, new data points can be classified consistently, making it useful for fraud
detection, customer segmentation, and medical diagnosis. The main objective of constraint-based clustering
is to create meaningful clusters while respecting predefined rules, ensuring results align with user-specific
needs.
Example Algorithm: COP K-Means (Constrained K-Means).
Example: A school groups students into study teams using constraint-based clustering. Students who prefer
the same subjects are placed together (must-link), while those with different learning styles are kept in separate
groups (cannot-link). This helps create balanced and effective study teams.
Steps:
1. Collect the user-defined constraints (must-link and cannot-link pairs).
2. Initialize clusters as in a standard algorithm (e.g., pick K centroids).
3. Assign each data point to the nearest cluster that does not violate any constraint.
4. Update the clusters and repeat until the assignments are stable.
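These notes name COP K-Means but no R function for it, so here is only a small base-R sketch of how constraints can be represented and checked against a candidate clustering; the pairs and labels are made-up illustrative values.

  must_link   <- rbind(c(1, 2), c(3, 4))   # each row: a pair of points that must share a cluster
  cannot_link <- rbind(c(1, 5))            # each row: a pair of points that must be separated
  labels <- c(1, 1, 2, 2, 3)               # a candidate cluster assignment for 5 data points
  all(labels[must_link[, 1]] == labels[must_link[, 2]])      # TRUE: all must-link pairs respected
  all(labels[cannot_link[, 1]] != labels[cannot_link[, 2]])  # TRUE: all cannot-link pairs respected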
Clustering algorithms
An algorithm is a step-by-step set of instructions or rules designed to perform a specific task or solve a
problem. It follows a logical sequence of operations and can be implemented in programming to automate
processes. Algorithms are widely used in computing, mathematics, and daily life, such as in search engines,
data processing, and decision-making. They can be simple, like sorting numbers, or complex, like machine
learning models. Efficiency and correctness are key aspects of a good algorithm.
A clustering algorithm is a machine learning technique used to group similar data points together based on
their characteristics without prior labels. It helps in identifying patterns and structures within data by dividing
it into clusters where members of the same cluster are more similar to each other than to those in other clusters.
Clustering algorithms are widely used in market segmentation, image recognition, anomaly detection, and
recommendation systems. The choice of algorithm therefore depends on what the data looks like. A suitable clustering algorithm helps in finding valuable insights for the industry.
Clustering algorithms can be categorized based on their purpose into two types:
1. Monothetic Clustering: In this type, all members of a cluster share a common characteristic. For
example, if 25% of patients experience side effects from a vaccine, they would be grouped together
because of this shared trait. Here, clustering is done based on a single feature.
2. Polythetic Clustering: In this type, members of a cluster are similar to each other in multiple ways, but
they don’t necessarily share a single common trait. For example, a group of customers may be similar
based on overall shopping behaviour rather than one specific factor. This type of clustering considers
all features when grouping data.
Algorithm | Key Characteristics | Example Application
Gaussian Mixture Model (GMM) | Probabilistic distributions, flexible shapes | Image processing
Fuzzy C-Means (FCM) | Data points can belong to multiple clusters | Medical diagnosis
Clustering is a powerful technique used in marketing, finance, healthcare, image processing, and anomaly
detection. Choosing the right method depends on data size, shape, and structure.
▪ K-Means algorithm
K-Means is a popular and simple unsupervised learning algorithm used for clustering data into groups. It works by dividing a dataset into 'k' clusters, where 'k' is the predefined number of clusters to be created. For example, if K=2 there will be two clusters, and if K=4 there will be four clusters, and so on. Each cluster has a central point called a centroid, which represents the average position of all data points within that cluster. Initially, the centroids are placed randomly, and data points are assigned to the nearest centroid. The centroids are then recalculated based on the newly formed clusters, and this process continues iteratively until the centroids no longer change. In R, there is a built-in function kmeans().
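As a minimal sketch of that built-in function, using the iris dataset that ships with R (the column choice and K = 3 are illustrative assumptions, not part of these notes):

  set.seed(42)                          # fixed seed so the random centroid starts are reproducible
  km <- kmeans(iris[, c("Sepal.Length", "Petal.Length")], centers = 3, nstart = 25)
  km$centers                            # the final centroid of each of the 3 clusters
  table(km$cluster)                     # how many points were assigned to each cluster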
The algorithm seeks to minimize the objective function, the Squared Error Function F(V), defined as:

F(V) = \sum_{i=1}^{k} \sum_{x_j \in C_i} \lVert x_j - v_i \rVert^2

Where:
• k = number of clusters
• C_i = the set of data points assigned to cluster i
• v_i = the centroid (mean) of cluster i
• x_j = a data point belonging to cluster C_i
The Elbow Method helps by plotting the total variation (sum of squared distances) within clusters for
different values of K. The point where the decrease in variation slows down (forming an "elbow" shape) is
the best K value. This ensures the model is neither over-clustered nor under-clustered.
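A hedged sketch of the Elbow Method in R (the range K = 1..10 and the iris data are illustrative choices):

  set.seed(42)
  # total within-cluster sum of squares for K = 1..10
  wss <- sapply(1:10, function(k) kmeans(iris[, 1:4], centers = k, nstart = 25)$tot.withinss)
  plot(1:10, wss, type = "b",
       xlab = "K (number of clusters)", ylab = "Total within-cluster sum of squares")
  # the K at which the curve bends (the "elbow") is a reasonable choice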
Step-1: Choose the number of clusters, K.
Step-2: Randomly select K initial centroids.
Step-3: Assign each data point to the nearest centroid, forming K clusters.
Step-4: Recalculate the centroids by finding the average position of points in each cluster.
Step-5: Repeat steps 3-4 until the centroids no longer change.
Advantages:
1. Can be applied to any form of data – as long as the data has numerical (continuous) entities.
2. Much faster than many other clustering algorithms.
Drawbacks:
1. Fails for non-linear data.
2. The number of clusters, K, must be decided beforehand.
3. Assumes clusters are evenly sized and roughly spherical, which is not always true of real data.
Until now, the clustering techniques we have seen are based on either proximity (similarity/distance) or composition (density). There is a family of clustering algorithms that takes a totally different metric into consideration – probability. Distribution-based clustering creates and groups data points based on their likelihood of belonging to the same probability distribution (e.g., Gaussian or binomial) in the data.
Each cluster has a central point, and the probability of a data point being included decreases as its distance from the centre increases. Unlike density- or boundary-based methods, this approach does not require defining cluster shapes beforehand. However, it requires setting specific parameters, which can affect results if not set correctly.
Steps:
Step-1: Choose the number of clusters (distributions) to fit.
Step-2: Initialize the parameters of each distribution (e.g., mean and variance).
Step-3: Calculate how likely each data point belongs to a cluster.
Step-4: Update cluster values to improve accuracy.
Step-5: Recalculate probabilities for each data point.
Step-6: Repeat the process until clusters do not change.
Step-7: Assign each data point to the most likely cluster.
Step-8: Clustering is complete and ready for analysis.
Hierarchical clustering methods follow two approaches – Divisive and Agglomerative. Two algorithms implement these approaches, respectively:
DIANA (Divisive Analysis) is a hierarchical clustering method that starts with all data points in a single large
cluster. It then gradually splits the cluster into smaller groups based on how far apart the data points are from
each other. The splitting continues until each data point is in its own individual cluster. The method uses
different ways to measure distances between points, such as Ward’s Distance, Centroid Distance, Average
Linkage, Complete Linkage, and Single Linkage. These measures help decide how the data should be divided
at each step. DIANA is useful when the data has a clear hierarchical structure and needs to be broken down
into meaningful subgroups. In R, the algorithm can be implemented using the diana() function.
AGNES (Agglomerative Nesting) is a hierarchical clustering method that starts with each data point as its
own cluster. If there are n data points, the algorithm begins with n clusters. It then gradually merges the closest
clusters based on distance measures like those used in DIANA (e.g., Ward’s Distance, Centroid Distance,
Average Linkage, Complete Linkage, or Single Linkage). This merging process continues until all data points
are combined into one large cluster. AGNES is useful when grouping similar data points together step by step.
In R, it can be implemented using the agnes() function or the built-in hclust() function.
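A short sketch of both functions, assuming the cluster package is installed (the built-in USArrests data and average linkage are illustrative choices):

  library(cluster)                             # provides agnes() and diana()
  ag <- agnes(USArrests, method = "average")   # agglomerative: merge from n singleton clusters upward
  di <- diana(USArrests)                       # divisive: split from one all-inclusive cluster downward
  plot(ag, which.plots = 2)                    # dendrogram of the agglomerative hierarchy
  cutree(as.hclust(ag), k = 3)                 # cut the tree to obtain 3 clusters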
Mean shift clustering is a powerful algorithm that works by finding dense regions in the data. It does this by shifting data points toward areas of higher density using a sliding-window technique. Unlike k-means, which requires a predefined number of clusters, mean shift automatically detects the number of clusters based on the data distribution. This makes it especially useful for cases where the number of groups is unknown.
Another advantage of mean shift clustering is its ability to detect clusters of any shape, rather than being limited to circular or spherical clusters like k-means. It is also less sensitive to outliers, as it focuses on dense areas instead of strict distance measures. However, it can be computationally expensive for large datasets, as it requires multiple iterations to converge. In R, the bmsClustering() function from the MeanShift package performs the clustering (MeanShift::bmsClustering()).
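A hedged usage sketch of that function – the assumptions here, worth checking against the package documentation, are that bmsClustering() expects a matrix with one data point per column, takes a bandwidth argument h, and returns the cluster labels in a labels component:

  library(MeanShift)
  X <- t(as.matrix(iris[, 1:4]))     # assumption: one data point per column
  ms <- bmsClustering(X, h = 1.0)    # h: bandwidth of the sliding window (assumed parameter name)
  table(ms$labels)                   # the number of clusters is discovered automatically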
Gaussian Mixture Model (GMM) is a clustering method that groups data by assuming it comes from
multiple Gaussian (bell-shaped) distributions. Unlike k-means, which assigns each point to a single cluster,
GMM gives each point a probability of belonging to multiple clusters. This makes it useful for handling
overlapping groups in data. Each cluster is defined by its centre (mean), spread (variance), and weight
(importance).
GMM is more flexible than other clustering methods because it can find clusters of different shapes and
sizes. It uses a process called the Expectation-Maximization (EM) algorithm to improve cluster assignments
over multiple steps. However, GMM requires choosing the number of clusters in advance and can take
longer to compute for large datasets.
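A sketch assuming the mclust package is installed (the iris data and the range of 1 to 5 components are illustrative assumptions):

  library(mclust)
  gmm <- Mclust(iris[, 1:4], G = 1:5)   # fits GMMs with 1..5 components via EM, picks the best by BIC
  summary(gmm)                          # chosen number of clusters and the fitted parameters
  head(gmm$z)                           # each row: the point's probability of belonging to each cluster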
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a clustering algorithm that
groups data points based on density. Unlike k-means or GMM, it does not require specifying the number of
clusters beforehand. Instead, it identifies dense regions of data and expands clusters from them. Points in
less dense areas are labelled as noise or outliers.
The algorithm relies on two key parameters: epsilon (ε), which defines the neighbourhood radius, and
minPts, which sets the minimum number of points required to form a cluster.
There are two major underlying concepts in DBSCAN – Density Reachability and Density Connectivity. These help the algorithm differentiate and separate regions with varying degrees of density – hence creating clusters.
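A sketch assuming the dbscan package is installed (eps = 0.5 and minPts = 5 are illustrative values that would normally be tuned to the data):

  library(dbscan)
  X <- as.matrix(iris[, 1:4])
  db <- dbscan(X, eps = 0.5, minPts = 5)   # eps: neighbourhood radius; minPts: density threshold
  table(db$cluster)                        # cluster 0 collects the points labelled as noise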
▪ Fuzzy C-Means algorithm
The Fuzzy C-Means (FCM) algorithm is a clustering method similar to k-means, but with one key
difference: a data point can belong to more than one cluster instead of just one. This means that instead of
assigning a data point strictly to a single group, FCM gives each point a degree of belonging to multiple
clusters.
Fuzzy clustering follows three main steps: initialization, iteration, and termination. The algorithm calculates membership values based on the distance between a data point and the cluster centres – closer points have higher membership values. These values and the cluster centres are continuously updated until they are stable. This method is especially useful when data points fall between multiple clusters or have ambiguous boundaries, as it assigns degrees of membership rather than strict all-or-nothing cluster assignments.
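A sketch assuming the e1071 package is installed (3 centers and the fuzziness parameter m = 2 are illustrative choices):

  library(e1071)
  fcm <- cmeans(iris[, 1:4], centers = 3, m = 2)   # m > 1 controls how fuzzy the memberships are
  head(fcm$membership)                             # each row: degrees of belonging to the 3 clusters
  head(fcm$cluster)                                # crisp label = cluster with the highest membership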