Clustering Methods
The goal of clustering models is to partition the records of a dataset into clusters: homogeneous groups of observations that are similar to one another and different from the observations in other groups. The human brain naturally organizes objects through a form of reasoning called affinity grouping, and clustering models have accordingly been used for a long time in a variety of fields, including the social sciences, biology, astronomy, statistics, image recognition, digital data processing, marketing, and data mining.
Clustering models serve several purposes. In some applications, the clusters themselves offer useful insight into the phenomenon of interest. For instance, segmenting clients by their purchasing patterns may reveal a cluster corresponding to a market niche on which it is worthwhile to focus promotional efforts. Clustering can also serve as a preliminary phase of a data mining project, followed by different techniques applied within each cluster; in a retention study, for example, an initial partition into clusters may be followed by the development of a separate classification model for each cluster, with the goal of better identifying clients with a high churn likelihood. Finally, during exploratory data analysis, grouping records into clusters helps highlight outliers and can identify a single observation that stands in for an entire cluster, thereby reducing the size of the dataset.
Clustering is a technique in Business Analytics used for grouping a set of objects or data points
in such a way that objects in the same group (called a cluster) are more similar to each other than
to those in other groups. It is a form of unsupervised learning, where the goal is to identify
patterns or structures in data without predefined labels or outcomes.
Different clustering methods have unique approaches for creating clusters. Here are the primary
types:
1. Partitioning Methods
Description: These methods divide the dataset into a set of non-overlapping clusters by
optimizing a given criterion, like minimizing the distance of points from their cluster
centers.
Examples:
o K-Means: Divides the dataset into K clusters by assigning each data point to the
cluster with the nearest mean; the cluster centers (centroids) are updated iteratively (see the sketch after this list).
o K-Medoids (PAM): Similar to K-Means but uses medoids (actual data points) as cluster
centers, making it less sensitive to outliers.
Application: Useful when the number of clusters K is known or can be estimated.
Advantages:
o Easy to implement and computationally efficient.
o Works well with large datasets.
Limitations:
o Sensitive to the choice of initial clusters.
o Assumes clusters are spherical and similar in size.
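As a minimal sketch of the K-Means idea described above, the following uses scikit-learn; the blob data, K = 3, and all parameter values are illustrative assumptions, not values from the text.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Toy dataset: 300 points scattered around 3 centers.
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# n_init reruns the algorithm from several initial centroids to reduce
# the sensitivity to initialization noted in the limitations above.
km = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = km.fit_predict(X)

print(km.cluster_centers_)   # final centroids
print(labels[:10])           # cluster assignments of the first 10 points
```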
2. Hierarchical Methods
Description: These methods build a nested hierarchy of clusters, either bottom-up, by
successively merging the closest clusters (agglomerative), or top-down, by successively
splitting clusters (divisive); the result is often visualized as a dendrogram.
Examples:
o Agglomerative Hierarchical Clustering: Commonly uses linkage criteria such as single-
linkage (minimum distance), complete-linkage (maximum distance), or average-linkage (mean distance); a sketch follows this list.
Application: Useful when the data has a natural hierarchy or when the number of clusters
is unknown.
Advantages:
o Does not require specifying the number of clusters in advance.
o Can capture complex nested structures.
Limitations:
o Computationally intensive for large datasets.
o Merging or splitting decisions are irreversible.
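A small sketch of agglomerative clustering with SciPy, cutting the dendrogram into a flat partition; the sample points, the average linkage choice, and the cut at 3 clusters are illustrative assumptions.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Five toy points forming two tight pairs and one isolated point.
X = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.1, 4.9], [9.0, 1.0]])

# method can be "single" (minimum distance), "complete" (maximum),
# or "average" (mean), matching the linkage criteria listed above.
Z = linkage(X, method="average")

# Cut the dendrogram to obtain a flat partition into 3 clusters.
labels = fcluster(Z, t=3, criterion="maxclust")
print(labels)
```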
3. Density-Based Methods
Description: These methods create clusters based on the density of data points in a
region. A cluster is formed when data points are closely packed together, and regions of
low density are treated as noise or outliers.
Examples:
o DBSCAN (Density-Based Spatial Clustering of Applications with Noise): Clusters based
on a neighborhood radius (ε) and a minimum number of points; it can identify arbitrarily
shaped clusters and outliers (see the sketch after this list).
o OPTICS (Ordering Points To Identify the Clustering Structure): Similar to DBSCAN, but
produces a reachability ordering of the points from which clusters at varying density levels can be extracted.
Application: Effective for datasets with noise, outliers, or when clusters are of arbitrary
shape.
Advantages:
o Can identify clusters of various shapes.
o Handles noise and outliers well.
Limitations:
o Requires careful tuning of parameters like ε and minimum points.
o May struggle with varying densities.
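A brief DBSCAN sketch with scikit-learn on non-spherical data; the eps and min_samples values below are illustrative guesses and would need tuning for real data, as the limitations above note.

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two interleaving half-moons: arbitrarily shaped clusters that
# centroid-based methods like K-Means handle poorly.
X, _ = make_moons(n_samples=200, noise=0.05, random_state=42)

# eps is the neighborhood radius (the ε above); min_samples is the
# minimum number of points required to form a dense region.
db = DBSCAN(eps=0.3, min_samples=5)
labels = db.fit_predict(X)

# Points labeled -1 are treated as noise/outliers.
print(set(labels))
```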
4. Model-Based Methods
Description: These methods assume that the data is generated by a mixture of underlying
probability distributions (e.g., Gaussian). Each cluster is treated as a component of a
mixture model, and the goal is to estimate the parameters of these distributions.
Examples:
o Gaussian Mixture Model (GMM): Assumes that data points are generated from a
mixture of Gaussian distributions with different means and covariances (a sketch follows this list).
o Expectation-Maximization (EM): Iteratively estimates the parameters of the model to
maximize the likelihood of the observed data.
Limitations:
o Sensitive to initialization.
o Computationally intensive with a large number of clusters.
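A minimal Gaussian mixture sketch with scikit-learn, which fits the mixture via the EM algorithm internally; the blob data and the choice of 3 components are illustrative assumptions.

```python
from sklearn.mixture import GaussianMixture
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# Each component is a Gaussian with its own mean and covariance;
# the parameters are estimated by Expectation-Maximization.
gmm = GaussianMixture(n_components=3, covariance_type="full", random_state=42)
gmm.fit(X)

labels = gmm.predict(X)        # hard cluster assignments
probs = gmm.predict_proba(X)   # soft (probabilistic) memberships
print(gmm.means_)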
5. Grid-Based Methods
Description: The data space is divided into a finite number of grid cells, and clustering is
performed on these cells instead of individual data points (a toy sketch of this idea follows the list below).
Examples:
o CLIQUE (Clustering In QUEst): A combination of grid-based and density-based
approaches.
o STING (Statistical Information Grid): Uses a hierarchical grid structure for clustering.
Limitations:
o Resolution of clustering depends on the grid size.
o May not handle arbitrary shapes well.
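The following is a toy illustration of the grid-based idea, not an implementation of CLIQUE or STING: points are binned into grid cells, cells above a density threshold are kept, and adjacent dense cells are joined into clusters. The cell size, the threshold of 5 points, and the synthetic data are all illustrative assumptions.

```python
import numpy as np
from scipy import ndimage

rng = np.random.default_rng(42)
X = np.vstack([rng.normal(loc=[2, 2], scale=0.3, size=(100, 2)),
               rng.normal(loc=[7, 7], scale=0.3, size=(100, 2))])

cell = 0.5                                # grid resolution
idx = np.floor(X / cell).astype(int)      # grid-cell index of each point
idx -= idx.min(axis=0)                    # shift indices to start at 0

counts = np.zeros(tuple(idx.max(axis=0) + 1), dtype=int)
np.add.at(counts, tuple(idx.T), 1)        # histogram: points per cell

dense = counts >= 5                       # density threshold per cell
cluster_grid, n_clusters = ndimage.label(dense)  # join adjacent dense cells

labels = cluster_grid[tuple(idx.T)]       # map each point to its cell's cluster
print(n_clusters, set(labels.tolist()))   # label 0 = point in a sparse cell
```

Note how the grid resolution governs the outcome, reflecting the limitation above that the resolution of clustering depends on the grid size.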
Conclusion
Clustering is a powerful tool in Business Analytics for uncovering hidden patterns and
segmenting data into meaningful groups. The choice of clustering method depends on factors
like the shape and density of the data, the presence of noise, and the need for interpretability. The
application of clustering requires careful consideration of criteria such as distance measures,
scalability, and the desired outcome. Each method has its strengths and limitations, making it
essential to match the technique to the specific characteristics of the dataset and the business
problem at hand.