FPA Unit 3
CLUSTERING
Important questions:
1. What is clustering? Explain in detail.
2. Explain the different types of clustering methods and give an example of each.
3. Define the core aspects of clustering. What are the types of clustering algorithms applied in business?
4. Explain clustering. Write the steps of a distribution-based clustering algorithm.
5. Explain the role of clustering in data mining / write the applications of clustering.
6. Explain the k-means algorithm.
7. Differentiate between clustering and classification.
1. Algorithm
2. Metrics used in finding accuracy in k-means
3. Centroid point in k-means
4. Which algorithm is best for clustering? Explain with an example.
Clustering
Clustering is a way of grouping similar objects together based on their characteristics. It is a type of
unsupervised learning in machine learning, meaning that it does not require labelled data to find patterns.
Instead, it explores the data and forms groups (clusters) based on similarities. The main objective of
clustering is to group similar items together so that objects in the same group are more alike than those in
different groups. This helps in understanding patterns in data, making decisions, and identifying trends.
For example, in a shopping website, clustering can be used to group customers based on their shopping
habits. Some customers may prefer budget-friendly products, while others may buy premium items. By
identifying these groups, the website can recommend products that match each customer's interests,
improving their shopping experience and increasing sales.
Clustering vs. Classification

Feature | Classification | Clustering
Data Type | Requires labelled data (training data with known outputs). | Works with unlabelled data (no predefined labels).
Output | A discrete label (e.g., cat or dog in image classification). | A set of clusters where similar data points are grouped together.
Decision Boundaries | Creates clear decision boundaries between different categories. | Groups data based on similarity, but clusters may overlap.
Interpretability | More interpretable, since it assigns clear labels to data points. | Less interpretable, as clusters may not have clear meanings.
Flexibility | Less flexible, as it depends on predefined labels. | More flexible, as it can discover new patterns or structures.
Handling New Data | Requires retraining when new categories appear. | Can naturally adjust to new data by forming new clusters.
Scalability | Can be computationally expensive for large datasets. | Can be more scalable, depending on the clustering algorithm used.
Example scenario:
Classification Example:
A teacher wants to categorize students as "Pass" or "Fail" based on their exam scores. The model is trained
with past student data and their results, then predicts whether a new student will pass or fail.
Clustering Example:
A shop owner wants to group customers based on their shopping habits. Since there are no predefined
categories, a clustering algorithm forms groups like "frequent shoppers" and "occasional shoppers" to offer
better discounts and promotions.
Applications of clustering:
• Customer Segmentation
Clustering is used by businesses to group customers based on purchasing habits, demographics, or
preferences. This helps in personalized marketing, better product recommendations, and improved
customer engagement.
Example: An online retailer clusters customers into "budget shoppers" and "premium buyers" to send
targeted promotions.
• Anomaly Detection
Clustering helps identify unusual patterns in data, making it useful for fraud detection, cybersecurity, and
system monitoring. Anomalies are flagged when data points significantly differ from normal patterns.
Example: A bank detects fraudulent transactions when a customer’s spending suddenly spikes in an
unusual location.
• Image Segmentation
In image processing, clustering divides an image into regions with similar colours, textures, or features,
aiding in object detection and medical imaging. This simplifies image analysis and enhances recognition
accuracy.
Example: In healthcare, clustering is used to separate tumours from normal tissue in MRI scans.
• Document Categorization
Clustering helps organize large text datasets by grouping similar documents together, making it easier to
retrieve relevant information. It is widely used in search engines and content management.
Example: A news website automatically groups articles into categories like politics, sports, and
entertainment.
• Healthcare & Medical Diagnosis
In healthcare, clustering is used to group patients based on symptoms, medical history, or genetic factors.
This aids in disease classification, treatment planning, and early diagnosis.
Example: Hospitals use clustering to classify cancer patients into different risk levels for personalized
treatments.
• Finance & Fraud Detection
Banks and financial institutions use clustering to analyse customer behaviour and detect fraudulent
activities. It helps in identifying unusual transactions and segmenting customers based on risk levels.
Example: Credit card fraud detection systems flag suspicious transactions by identifying unusual
spending patterns.
• Social Media & Recommendation Systems
Social media platforms and streaming services use clustering to group users with similar interests. This
helps in personalized content recommendations and community detection.
Example: Netflix clusters users based on viewing history to suggest relevant movies and shows.
• Supply Chain & Logistics
Logistics companies use clustering to optimize delivery routes and supplier networks. It helps reduce
transportation costs and improve efficiency.
Example: DHL clusters delivery locations to create the most efficient delivery routes.
• Retail & Inventory Management
Clustering helps retailers optimize product placement, manage inventory, and analyse customer buying
patterns. It improves store layout design and demand forecasting.
Example: Supermarkets group products frequently bought together to arrange store shelves efficiently.
• Text Mining & Natural Language Processing (NLP)
In NLP, clustering is used for document classification, topic modeling, and sentiment analysis. It helps
organize large amounts of text data for efficient retrieval.
Example: Google News clusters articles into categories like politics, sports, and technology.
Core aspects
Clustering is a way of grouping similar things together based on patterns in the data. The groups, called
clusters, should have closely related items and clear boundaries. They should also be easy to describe using
just a few important features.
The main purpose of clustering is to find hidden patterns in data without needing labels. Machine learning
algorithms analyse the dataset and assign data points into clusters where similar items are grouped together.
Since clustering is an unsupervised learning method, it does not require predefined labels, making it useful
for discovering hidden patterns in various applications such as customer segmentation, anomaly detection,
and pattern recognition. By effectively clustering data, organizations can extract valuable insights and make
data-driven decisions.
Clustering methods
Clustering methods are techniques used to group similar data points based on different criteria such as data
structure, distance, and density. These methods help in identifying meaningful patterns in datasets and are
broadly divided based on how clusters are formed. Clustering techniques are broadly classified into six
major types:
1. Partition-Based Clustering
2. Hierarchical Clustering
3. Density-Based Clustering
4. Grid-Based Clustering
5. Model-Based Clustering
6. Constraint-based clustering
Partition-based clustering, also known as centroid-based clustering, is a method that divides a dataset into a fixed number of clusters, K, where each data point belongs to exactly one cluster. It ensures that the clusters are distinct and do not overlap. This method works by selecting central points called centroids for each cluster and grouping data points based on their closeness to these centroids. The process continues iteratively, adjusting the centroids until the clusters become stable.
A key challenge in this method is deciding the number of clusters (K) beforehand, which can affect the final results. It also assumes that all groups are evenly sized and round in shape, which is not always true of real data. Despite these challenges, it is a popular and widely used clustering method. The K-Means algorithm falls into this category.
Example: A shopping mall groups its visitors into three clusters based on the time they spend in the mall:
short visits (less than 30 minutes), medium visits (30-90 minutes), and long visits (more than 90 minutes). The
algorithm assigns each visitor to the nearest group and adjusts the group averages until the clusters are stable.
Steps:
1. Choose the number of clusters, K, and randomly select K initial centroids.
2. Assign each data point to the nearest centroid (the mean value of the cluster's data points).
3. Recalculate each centroid as the average of the points assigned to it.
4. Repeat steps 2-3 until the centroids no longer change.
Connectivity-based Clustering, also known as Hierarchical Clustering, is a method that groups data points
based on their similarity, forming a tree-like structure called a dendrogram. Similar objects have small
distances and are grouped together early, while dissimilar objects are farther apart in the hierarchy. This
method helps in understanding relationships between data points at different levels of similarity, through
techniques like cross-tabulation. Unlike partition-based clustering, it does not require the number of clusters
to be predefined.
The clustering process can follow different approaches depending on how the data is structured. It starts with
either a single cluster containing all data points (Divisive Clustering) or each data point as its own cluster
(Agglomerative Clustering) and then merges or splits clusters step by step.
The main goal of hierarchical clustering is to create a structured hierarchy of clusters, ranging from one large
cluster to many smaller clusters. This allows users to analyse the data at different levels of relationship and
choose the number of clusters based on the problem’s needs.
Example: A company groups its employees based on experience levels. It starts with all employees in one
group and gradually splits them into beginners, intermediate, and advanced professionals.
Steps:
Hierarchical clustering follows different approaches based on how clusters are built:
The Divisive (Top-Down) approach in hierarchical clustering starts with one large cluster containing all data
points and gradually splits it into smaller clusters. The process continues until a predefined condition is met,
such as achieving the desired number of clusters or ensuring each cluster has meaningful distinctions. This
method helps in breaking down broad groups into meaningful subcategories.
The Agglomerative (Bottom-Up) approach starts with each data point as an individual cluster. The algorithm merges the closest clusters step by step based on their similarities until all data points form one large cluster or the desired number of clusters is achieved. This approach is widely used as it provides a clear hierarchical structure, making it easier to visualize relationships between data points.
▪ Density-based clustering (or) Density-Based Spatial Clustering
(The first two methods discussed above depend on a distance (similarity/proximity) metric.)
Density-based clustering groups data points based on regions of high density, separating them from areas of
low density. Instead of relying on distance alone, it identifies clusters as dense areas of connected points,
allowing for clusters of various shapes and sizes. Unlike traditional clustering methods, it can handle noise
and outliers effectively, making it useful for real-world datasets where data may not follow a strict geometric
shape.
This method assumes that data contains some inconsistencies (noise) and that clusters may not always be circular or elliptical. By allowing flexible cluster shapes, density-based clustering ensures that no important data points are ignored: it identifies groups of closely packed data points while separating sparse regions as noise, giving it better handling of outliers.
Example Algorithm: DBSCAN (Density-Based Spatial Clustering of Applications with Noise), DENCAST.
Example: A shopping mall analyses customer movement using density-based clustering. Areas where many
customers gather, like near popular stores or food courts, are identified as clusters. Shoppers moving alone in
less crowded areas are marked as noise, helping the mall optimize store placement.
• Eps (ε): the radius that defines the neighbourhood around a data point.
• MinPts: the minimum number of points required in the Eps-neighbourhood of a point for it to count as a dense (core) point.
Steps:
1. For each data point, find all neighbours within the Eps radius.
2. Mark points with at least MinPts neighbours as core points.
3. Form a cluster from each core point together with all points that are density-reachable from it.
4. Label points that belong to no cluster as noise (outliers).
▪ Grid-based clustering (or) Cell-based clustering
Grid-based clustering divides the data space into a grid of cells and groups them based on density rather than
distance between individual points. The entire data space is overlaid with a grid structure, and each cell is
analysed for the number of data points it contains. Cells with a higher density of points are merged to form
clusters, making the process efficient and scalable for large datasets.
This method is useful for handling high-dimensional data and works faster than traditional clustering
approaches since it operates on grid cells instead of individual data points. It is especially effective in spatial
data analysis and geographic information systems. Grid-based clustering aims to organize large datasets
efficiently by grouping dense regions of a grid, making clustering faster and more scalable. Its fixed grid
structure allows for easy handling of new data without reprocessing the entire dataset, making it ideal for real-
time applications.
Example: A city traffic management system divides a map into grid cells and analyses vehicle density in each
cell. Areas with high traffic are grouped into clusters to optimize traffic signals and reduce congestion.
Steps:
1. Divide the data space into a finite number of grid cells.
2. Count the number of data points falling in each cell to estimate its density.
3. Discard cells whose density falls below a chosen threshold.
4. Merge adjacent dense cells to form the final clusters.
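The following toy base-R sketch illustrates only the counting step of this idea on random 2-D points; the 10 × 10 grid size and the density threshold of 5 are arbitrary assumptions, and merging adjacent dense cells is left out for brevity.

  set.seed(1)
  x <- runif(200); y <- runif(200)        # 200 random 2-D points
  cellx <- cut(x, breaks = 10)            # overlay a 10 x 10 grid on the data space
  celly <- cut(y, breaks = 10)
  counts <- table(cellx, celly)           # number of points falling in each cell
  which(counts >= 5, arr.ind = TRUE)      # cells dense enough to seed clusters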
Model-based clustering is a method that assumes data is generated by a mixture of probability distributions and groups data points based on statistical models. It assumes that each cluster follows a specific probability distribution, such as a Gaussian (normal) distribution. Instead of measuring distances between points, it calculates the likelihood that a point belongs to a cluster. This helps in handling overlapping clusters, where data points may have partial membership in multiple groups. The main objective of model-based clustering is to identify hidden patterns in data through these assumed distributions.
One key advantage of model-based clustering is that it can automatically determine the number of clusters by
analysing probabilities. Unlike traditional clustering, which requires setting the number of clusters manually,
this method estimates them based on data patterns. It is widely used in fields where data has complex
structures. The results are based on parameters like mean and variance, making the clusters more interpretable
and mathematically sound.
Example: A hospital uses model-based clustering to classify patients based on symptoms and medical history.
It analyses factors like age, test results, and disease patterns, assigning each patient a probability of belonging
to risk groups. This helps doctors provide personalized treatment plans.
Steps:
1. Assume each cluster follows a probability distribution (e.g., Gaussian) and choose the number of distributions.
2. Initialize the parameters of each distribution (e.g., mean and variance).
3. Compute the probability that each data point belongs to each cluster.
4. Update the distribution parameters to better fit the data.
5. Repeat steps 3-4 until the parameters stabilize, then assign each point to its most likely cluster.
Constraint-based clustering, also called semi-supervised clustering, incorporates user-defined rules to guide the clustering process. Unlike traditional methods that rely only on data structure, this approach allows specifying conditions like must-link (points that must be grouped together) and cannot-link (points that must remain separate).
This method ensures clusters follow specific rules, making results more accurate and interpretable. Once
boundaries are established, new data points can be classified consistently, making it useful for fraud
detection, customer segmentation, and medical diagnosis. The main objective of constraint-based clustering
is to create meaningful clusters while respecting predefined rules, ensuring results align with user-specific
needs.
Example Algorithm: COP K-Means (Constrained K-Means).
Example: A school groups students into study teams using constraint-based clustering. Students who prefer
the same subjects are placed together (must-link), while those with different learning styles are kept in separate
groups (cannot-link). This helps create balanced and effective study teams.
Steps:
1. Collect the user-defined constraints (must-link and cannot-link pairs).
2. Initialize clusters as in a standard algorithm (e.g., pick K centroids).
3. Assign each data point to the nearest cluster that does not violate any constraint.
4. Update the clusters and repeat until the assignments are stable.
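These notes name COP K-Means but no R function for it, so here is only a small base-R sketch of how constraints can be represented and checked against a candidate clustering; the pairs and labels are made-up illustrative values.

  must_link   <- rbind(c(1, 2), c(3, 4))   # each row: a pair of points that must share a cluster
  cannot_link <- rbind(c(1, 5))            # each row: a pair of points that must be separated
  labels <- c(1, 1, 2, 2, 3)               # a candidate cluster assignment for 5 data points
  all(labels[must_link[, 1]] == labels[must_link[, 2]])      # TRUE: all must-link pairs respected
  all(labels[cannot_link[, 1]] != labels[cannot_link[, 2]])  # TRUE: all cannot-link pairs respected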
Clustering algorithms
An algorithm is a step-by-step set of instructions or rules designed to perform a specific task or solve a
problem. It follows a logical sequence of operations and can be implemented in programming to automate
processes. Algorithms are widely used in computing, mathematics, and daily life, such as in search engines,
data processing, and decision-making. They can be simple, like sorting numbers, or complex, like machine
learning models. Efficiency and correctness are key aspects of a good algorithm.
A clustering algorithm is a machine learning technique used to group similar data points together based on
their characteristics without prior labels. It helps in identifying patterns and structures within data by dividing
it into clusters where members of the same cluster are more similar to each other than to those in other clusters.
Clustering algorithms are widely used in market segmentation, image recognition, anomaly detection, and
recommendation systems. The choice of algorithm therefore depends on what the data looks like. A suitable clustering algorithm helps in finding valuable insights for the industry.
Clustering algorithms can be categorized based on their purpose into two types:
1. Monothetic Clustering: In this type, all members of a cluster share a common characteristic. For
example, if 25% of patients experience side effects from a vaccine, they would be grouped together
because of this shared trait. Here, clustering is done based on a single feature.
2. Polythetic Clustering: In this type, members of a cluster are similar to each other in multiple ways, but
they don’t necessarily share a single common trait. For example, a group of customers may be similar
based on overall shopping behaviour rather than one specific factor. This type of clustering considers
all features when grouping data.
Algorithm | Key Characteristics | Example Application
Gaussian Mixture Model (GMM) | Probabilistic distributions, flexible shapes | Image processing
Fuzzy C-Means (FCM) | Data points can belong to multiple clusters | Medical diagnosis
Clustering is a powerful technique used in marketing, finance, healthcare, image processing, and anomaly
detection. Choosing the right method depends on data size, shape, and structure.
▪ K-Means algorithm
K-Means is a popular and simple unsupervised learning algorithm used for clustering data into groups. It works by dividing a dataset into 'k' clusters, where 'k' is the predefined number of clusters to be created. For example, if K=2 there will be two clusters, and if K=4 there will be four clusters, and so on. Each cluster has a central point called a centroid, which represents the average position of all data points within that cluster. Initially, the centroids are placed randomly, and data points are assigned to the nearest centroid. The centroids are then recalculated based on the newly formed clusters, and this process continues iteratively until the centroids no longer change. In R, there is a built-in function kmeans().
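As a minimal sketch of that built-in function, using the iris dataset that ships with R (the column choice and K = 3 are illustrative assumptions, not part of these notes):

  set.seed(42)                          # fixed seed so the random centroid starts are reproducible
  km <- kmeans(iris[, c("Sepal.Length", "Petal.Length")], centers = 3, nstart = 25)
  km$centers                            # the final centroid of each of the 3 clusters
  table(km$cluster)                     # how many points were assigned to each cluster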
The algorithm seeks to minimize the objective function, the Squared Error Function F(V), defined as:

F(V) = \sum_{i=1}^{k} \sum_{x_j \in C_i} \lVert x_j - v_i \rVert^2

Where:
• k = number of clusters
• C_i = the set of data points assigned to cluster i
• v_i = the centroid (mean) of cluster i
• x_j = a data point belonging to cluster C_i
The Elbow Method helps by plotting the total variation (sum of squared distances) within clusters for
different values of K. The point where the decrease in variation slows down (forming an "elbow" shape) is
the best K value. This ensures the model is neither over-clustered nor under-clustered.
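A hedged sketch of the Elbow Method in R (the range K = 1..10 and the iris data are illustrative choices):

  set.seed(42)
  # total within-cluster sum of squares for K = 1..10
  wss <- sapply(1:10, function(k) kmeans(iris[, 1:4], centers = k, nstart = 25)$tot.withinss)
  plot(1:10, wss, type = "b",
       xlab = "K (number of clusters)", ylab = "Total within-cluster sum of squares")
  # the K at which the curve bends (the "elbow") is a reasonable choice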
Step-1: Choose the number of clusters, K.
Step-2: Randomly select K initial centroids.
Step-3: Assign each data point to the nearest centroid, forming K clusters.
Step-4: Recalculate the centroids by finding the average position of points in each cluster.
Step-5: Repeat steps 3-4 until the centroids no longer change.
Advantages:
1. Can be applied to any form of data – as long as the data has numerical (continuous) entities.
2. Much faster than many other clustering algorithms.
Drawbacks:
1. Fails for non-linear data.
2. The number of clusters, K, must be decided beforehand.
3. Assumes clusters are evenly sized and roughly spherical, which is not always true of real data.
Until now, the clustering techniques we have seen are based on either proximity (similarity/distance) or composition (density). There is a family of clustering algorithms that takes a totally different metric into consideration – probability. Distribution-based clustering creates and groups data points based on their likelihood of belonging to the same probability distribution (e.g., Gaussian or binomial) in the data.
Each cluster has a central point, and the probability of a data point being included decreases as its distance from the centre increases. Unlike density- or boundary-based methods, this approach does not require defining cluster shapes beforehand. However, it requires setting specific parameters, which can affect results if not set correctly.
Steps:
Step-1: Choose the number of clusters (distributions) to fit.
Step-2: Initialize the parameters of each distribution (e.g., mean and variance).
Step-3: Calculate how likely each data point belongs to a cluster.
Step-4: Update cluster values to improve accuracy.
Step-5: Recalculate probabilities for each data point.
Step-6: Repeat the process until clusters do not change.
Step-7: Assign each data point to the most likely cluster.
Step-8: Clustering is complete and ready for analysis.
Hierarchical clustering methods follow two approaches – Divisive and Agglomerative. Two algorithms implement these approaches, respectively:
DIANA (Divisive Analysis) is a hierarchical clustering method that starts with all data points in a single large
cluster. It then gradually splits the cluster into smaller groups based on how far apart the data points are from
each other. The splitting continues until each data point is in its own individual cluster. The method uses
different ways to measure distances between points, such as Ward’s Distance, Centroid Distance, Average
Linkage, Complete Linkage, and Single Linkage. These measures help decide how the data should be divided
at each step. DIANA is useful when the data has a clear hierarchical structure and needs to be broken down
into meaningful subgroups. In R, the algorithm can be implemented using the diana() function.
AGNES (Agglomerative Nesting) is a hierarchical clustering method that starts with each data point as its
own cluster. If there are n data points, the algorithm begins with n clusters. It then gradually merges the closest
clusters based on distance measures like those used in DIANA (e.g., Ward’s Distance, Centroid Distance,
Average Linkage, Complete Linkage, or Single Linkage). This merging process continues until all data points
are combined into one large cluster. AGNES is useful when grouping similar data points together step by step.
In R, it can be implemented using the agnes() function or the built-in hclust() function.
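A short sketch of both functions, assuming the cluster package is installed (the built-in USArrests data and average linkage are illustrative choices):

  library(cluster)                             # provides agnes() and diana()
  ag <- agnes(USArrests, method = "average")   # agglomerative: merge from n singleton clusters upward
  di <- diana(USArrests)                       # divisive: split from one all-inclusive cluster downward
  plot(ag, which.plots = 2)                    # dendrogram of the agglomerative hierarchy
  cutree(as.hclust(ag), k = 3)                 # cut the tree to obtain 3 clusters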
Mean shift clustering is a powerful algorithm that works by finding dense regions in the data. It does this by shifting data points toward areas of higher density using a sliding-window technique. Unlike k-means, which requires a predefined number of clusters, mean shift automatically detects the number of clusters based on the data distribution. This makes it especially useful for cases where the number of groups is unknown.
Another advantage of mean shift clustering is its ability to detect clusters of any shape, rather than being limited to circular or spherical clusters like k-means. It is also less sensitive to outliers, as it focuses on dense areas instead of strict distance measures. However, it can be computationally expensive for large datasets, as it requires multiple iterations to converge. In R, the bmsClustering() function from the MeanShift package performs the clustering (MeanShift::bmsClustering()).
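A hedged usage sketch of that function – the assumptions here, worth checking against the package documentation, are that bmsClustering() expects a matrix with one data point per column, takes a bandwidth argument h, and returns the cluster labels in a labels component:

  library(MeanShift)
  X <- t(as.matrix(iris[, 1:4]))     # assumption: one data point per column
  ms <- bmsClustering(X, h = 1.0)    # h: bandwidth of the sliding window (assumed parameter name)
  table(ms$labels)                   # the number of clusters is discovered automatically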
Gaussian Mixture Model (GMM) is a clustering method that groups data by assuming it comes from
multiple Gaussian (bell-shaped) distributions. Unlike k-means, which assigns each point to a single cluster,
GMM gives each point a probability of belonging to multiple clusters. This makes it useful for handling
overlapping groups in data. Each cluster is defined by its centre (mean), spread (variance), and weight
(importance).
GMM is more flexible than other clustering methods because it can find clusters of different shapes and
sizes. It uses a process called the Expectation-Maximization (EM) algorithm to improve cluster assignments
over multiple steps. However, GMM requires choosing the number of clusters in advance and can take
longer to compute for large datasets.
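A sketch assuming the mclust package is installed (the iris data and the range of 1 to 5 components are illustrative assumptions):

  library(mclust)
  gmm <- Mclust(iris[, 1:4], G = 1:5)   # fits GMMs with 1..5 components via EM, picks the best by BIC
  summary(gmm)                          # chosen number of clusters and the fitted parameters
  head(gmm$z)                           # each row: the point's probability of belonging to each cluster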
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a clustering algorithm that
groups data points based on density. Unlike k-means or GMM, it does not require specifying the number of
clusters beforehand. Instead, it identifies dense regions of data and expands clusters from them. Points in
less dense areas are labelled as noise or outliers.
The algorithm relies on two key parameters: epsilon (ε), which defines the neighbourhood radius, and
minPts, which sets the minimum number of points required to form a cluster.
There are two major underlying concepts in DBSCAN – Density Reachability and Density Connectivity. These help the algorithm differentiate and separate regions with varying degrees of density – hence creating clusters.
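A sketch assuming the dbscan package is installed (eps = 0.5 and minPts = 5 are illustrative values that would normally be tuned to the data):

  library(dbscan)
  X <- as.matrix(iris[, 1:4])
  db <- dbscan(X, eps = 0.5, minPts = 5)   # eps: neighbourhood radius; minPts: density threshold
  table(db$cluster)                        # cluster 0 collects the points labelled as noise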
▪ Fuzzy C-Means algorithm
The Fuzzy C-Means (FCM) algorithm is a clustering method similar to k-means, but with one key
difference: a data point can belong to more than one cluster instead of just one. This means that instead of
assigning a data point strictly to a single group, FCM gives each point a degree of belonging to multiple
clusters.
Fuzzy clustering follows three main steps: initialization, iteration, and termination. The algorithm calculates membership values based on the distance between a data point and the cluster centres – closer points have higher membership values. These values and the cluster centres are continuously updated until they are stable. This method is especially useful when data points fall between multiple clusters or have ambiguous boundaries, as it assigns degrees of membership rather than strict all-or-nothing cluster assignments.
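A sketch assuming the e1071 package is installed (3 centers and the fuzziness parameter m = 2 are illustrative choices):

  library(e1071)
  fcm <- cmeans(iris[, 1:4], centers = 3, m = 2)   # m > 1 controls how fuzzy the memberships are
  head(fcm$membership)                             # each row: degrees of belonging to the 3 clusters
  head(fcm$cluster)                                # crisp label = cluster with the highest membership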