
Cluster analysis

Cluster analysis is a data analysis method that clusters (or groups) objects that are closely
associated within a given data set. When performing cluster analysis, we assign
characteristics (or properties) to each group. Then we create what we call clusters based on
those shared properties. Thus, clustering is a process that organizes items into groups using
unsupervised machine learning algorithms.

Let’s look at an example to get a sense of how cluster analysis groups a data set.

Consider a data set of eight countries—India, the U.S., Germany, Australia, the U.K., France, China, and Canada. Using this form of analysis, we can split the countries into four clusters.

 The first cluster will consist of Canada and the U.S.

 Australia alone forms the second cluster

 The third one will consist of France, U.K., and Germany

 China and India form the fourth cluster

At first glance, we can conclude that the clusters are divided by continent. This is clear from the cluster composition: the first cluster consists of countries from North America, the second comprises the continental region of Australia, the third contains European nations, and the fourth consists of Asian countries. From this, it is evident that the main feature underlying this cluster analysis is the geographical proximity of the countries.

Applications of Cluster Analysis:

1. Customer Segmentation: Grouping customers by behavior or preferences for targeted marketing.

2. Document Classification: Organizing documents (e.g., news articles) by topic in text mining.

3. Image Segmentation: Separating regions or objects in images for medical imaging or object detection.

4. Anomaly Detection: Identifying outliers for fraud detection or cybersecurity.


5. Healthcare: Grouping patients by symptoms or genetic data for personalized
treatments.

Types of clustering methods:

 Partitioning clustering

 Hierarchical clustering

 Density-based clustering

 Model-based clustering

 Grid-based clustering

 Fuzzy clustering

1. Partitioning Clustering

Partitioning clustering works by dividing a dataset into a predefined number of clusters, k, where each point belongs to exactly one cluster. The goal is to minimize the distance between the data points and their cluster center.

K-means algorithm:

The most widely used partitioning method. It assigns each data point to the cluster with the
nearest centroid and iteratively updates the centroids to minimize the variance within clusters.

Steps in K-means:

 Choose k (number of clusters).

 Initialize k centroids randomly or based on some method.

 Assign each data point to the nearest centroid.

 Update centroids by calculating the mean of the data points in each cluster.

 Repeat until convergence (when centroids no longer change significantly).
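To make these steps concrete, here is a minimal K-means sketch in Python using scikit-learn (assumed to be installed); the two synthetic blobs and all parameter values are invented for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic data: two blobs, one around (0, 0) and one around (5, 5).
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])

# k = 2; n_init restarts guard against a bad random initialization.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

print(kmeans.cluster_centers_)  # final centroids, one per cluster
print(kmeans.labels_[:10])      # cluster index assigned to the first 10 points
```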

K-medoids: Similar to K-means, but instead of using the mean to represent a cluster, it uses the most centrally located data point (the "medoid"). K-medoids is more robust to noise and outliers than K-means.

PAM (Partitioning Around Medoids): This is the classic algorithm behind K-medoids. Because it works from pairwise dissimilarities, it is often used for datasets with non-Euclidean distances, unlike K-means, which is tied to a (usually Euclidean) mean.
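Below is a simplified K-medoids sketch in plain NumPy. It uses an alternating assign/update scheme rather than the full PAM swap procedure, and the data, k, and function name are invented for illustration.

```python
import numpy as np

def k_medoids(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Precompute all pairwise Euclidean distances.
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    medoids = rng.choice(len(X), size=k, replace=False)
    for _ in range(n_iter):
        # Assign each point to its nearest medoid.
        labels = np.argmin(dist[:, medoids], axis=1)
        new_medoids = medoids.copy()
        for j in range(k):
            members = np.where(labels == j)[0]
            if len(members) == 0:
                continue
            # New medoid: the member with the smallest total distance
            # to all other members of its cluster.
            within = dist[np.ix_(members, members)].sum(axis=1)
            new_medoids[j] = members[np.argmin(within)]
        if np.array_equal(new_medoids, medoids):
            break  # converged
        medoids = new_medoids
    return medoids, labels

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (30, 2)), rng.normal(6, 1, (30, 2))])
medoid_idx, labels = k_medoids(X, k=2)
print(X[medoid_idx])  # the two medoids are actual data points
```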

2. Hierarchical Clustering

Hierarchical clustering is another popular technique in data science used to build a hierarchy
of clusters. Unlike partitioning methods (such as K-means), hierarchical clustering doesn’t
require the number of clusters to be specified beforehand. It produces a tree-like structure
called a dendrogram that can be cut at different levels to form clusters.

Types of Hierarchical Clustering:

There are two main types of hierarchical clustering:

1. Agglomerative (Bottom-up approach):

o Start with each data point as its own cluster.

o At each step, merge the closest clusters based on a distance metric (e.g.,
Euclidean distance, Manhattan distance).

o Repeat this process until all data points are merged into one single cluster.

2. Divisive (Top-down approach):

o Start with all data points in a single cluster.

o At each step, split the most dissimilar cluster into smaller clusters.

o Repeat the process until each data point is its own cluster.

Agglomerative clustering is more commonly used than divisive due to its simplicity.

Dendrogram

A dendrogram is a tree-like diagram that shows the order in which clusters were merged (for
agglomerative clustering) or split (for divisive clustering). By cutting the dendrogram at
different levels, you can determine the number of clusters formed.

 Height on the dendrogram represents the distance or dissimilarity between merged clusters.

 Cutting the dendrogram at a specific level (height) results in a specific number of clusters.
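A minimal agglomerative-clustering sketch with SciPy is shown below; the synthetic data, the "ward" linkage choice, and the cut into two clusters are invented for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Two well-separated synthetic groups.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (20, 2)), rng.normal(4, 0.5, (20, 2))])

# linkage() records the full bottom-up merge history (the dendrogram data).
Z = linkage(X, method="ward")

# Cut the tree so that exactly 2 clusters remain.
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)
# scipy.cluster.hierarchy.dendrogram(Z) would draw the merge tree with matplotlib.
```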

3. Density-Based Clustering

Density-based clustering is a method used in data science to discover clusters of arbitrary shapes by identifying regions in the data where points are densely packed together, separated by areas of low point density. Unlike partitioning and hierarchical methods, density-based clustering does not require you to predefine the number of clusters.

The core idea is that clusters are dense regions in space, separated by sparser regions. The
most common algorithm for density-based clustering is DBSCAN (Density-Based Spatial
Clustering of Applications with Noise).

DBSCAN Algorithm

DBSCAN is the most well-known density-based clustering algorithm. The key concepts in
DBSCAN are:

1. Core Points: Points that have at least a minimum number of neighbors (denoted as
MinPts) within a specified radius (denoted as ε).
2. Border Points: Points that are not core points but are within the neighborhood of a
core point.
3. Noise Points: Points that are neither core points nor border points and do not belong
to any cluster.
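The sketch below runs DBSCAN via scikit-learn on synthetic data; the eps and min_samples values are invented and would need tuning on real data.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two dense blobs plus some uniformly scattered points (likely noise).
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.3, (50, 2)),
               rng.normal(3, 0.3, (50, 2)),
               rng.uniform(-2, 5, (10, 2))])

# eps plays the role of the radius ε, min_samples the role of MinPts.
db = DBSCAN(eps=0.5, min_samples=5).fit(X)
print(db.labels_)  # a label of -1 marks noise points
```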

4. Model-Based Clustering

Model-based clustering is a probabilistic approach used in data science where the data is assumed to be generated from a mixture of underlying probability distributions. The aim is to identify these distributions and assign data points to the most likely cluster based on the parameters of these distributions. Unlike other clustering methods like K-means or DBSCAN, model-based clustering provides a more flexible framework and can accommodate clusters of various shapes, sizes, and densities.

Gaussian Mixture Models (GMM):

 The most widely used model-based clustering technique.

 Assumes that the data is generated from a mixture of Gaussian distributions.

 Each cluster is represented by a multivariate Gaussian distribution, defined by its mean and covariance matrix.

 The task is to find the parameters of these Gaussian distributions (mean, covariance) and the mixing coefficients (which define the proportion of points in each distribution).

The Expectation-Maximization (EM) algorithm is often used to fit GMMs. The EM algorithm works in two main steps:

 E-step: Calculate the probability (responsibility) of each point belonging to each cluster.

 M-step: Update the parameters of the Gaussian distributions (mean, covariance) to maximize the likelihood of the data given the current responsibilities.

GMMs allow for soft clustering, meaning data points are assigned to clusters with
probabilities, rather than hard assignments as in K-means.
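As a concrete sketch, scikit-learn's GaussianMixture fits a GMM with EM and exposes the soft assignments directly; the synthetic data and component count below are invented for illustration.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Two synthetic Gaussian blobs with different spreads.
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(6, 2, (100, 2))])

# fit() runs the EM algorithm internally.
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)

print(gmm.means_)                # fitted Gaussian means
print(gmm.predict_proba(X[:5]))  # responsibilities: soft cluster memberships
```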

5. Grid-Based Clustering

Grid-based clustering is a clustering technique in data science where the data space is divided
into a finite number of cells or grids, and the clustering is performed on these grids rather
than directly on the data points. The main idea is to partition the space into a grid structure
and then group adjacent dense grids (those with a sufficient number of points) into clusters.

STING (Statistical Information Grid):

 STING is a hierarchical grid-based method. It divides the data space into a hierarchical grid structure, where cells at higher levels of the hierarchy cover larger areas of the data space.

 The algorithm uses statistical information about each cell (such as the number of points, mean, and variance) to form clusters, making it efficient for large datasets.

 At the higher levels of the hierarchy, the grid cells are large; at the lower levels, they are smaller and more granular. Clustering decisions are made at different levels of this hierarchy.

CLIQUE (Clustering in Quest):

 CLIQUE is a grid-based clustering method designed for high-dimensional data. It combines grid-based and density-based approaches.

 It first partitions the data space into non-overlapping rectangular units (grid cells) and identifies dense cells (those that contain a large number of points).

 It then combines adjacent dense cells to form clusters.

 CLIQUE can automatically identify subspaces that contain clusters, making it useful for high-dimensional clustering.
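Neither STING nor CLIQUE ships with the common Python libraries, so the sketch below only illustrates the core grid idea: bin points into cells, keep the dense cells, and merge adjacent dense cells into clusters. The grid size and density threshold are invented.

```python
import numpy as np
from scipy.ndimage import label

# Two synthetic blobs in 2-D.
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(1, 0.4, (80, 2)), rng.normal(4, 0.4, (80, 2))])

# Partition the space into a 10x10 grid and count points per cell.
counts, xedges, yedges = np.histogram2d(X[:, 0], X[:, 1], bins=10)

# A cell is "dense" if it holds at least 3 points (arbitrary threshold).
dense = counts >= 3

# Merge adjacent dense cells into connected components = clusters.
cell_labels, n_clusters = label(dense)
print(f"{n_clusters} grid-based clusters found")
```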

6. Fuzzy Clustering

Fuzzy clustering, also known as soft clustering, is a technique in data science where each
data point can belong to more than one cluster. Unlike traditional clustering methods (like K-
means) where each point is assigned to exactly one cluster (hard clustering), fuzzy clustering
assigns each data point a membership degree for each cluster, typically ranging between 0
and 1. This makes it especially useful for datasets where clusters overlap or when data points
don’t neatly fit into distinct groups.

Fuzzy C-Means (FCM) Algorithm

The Fuzzy C-Means (FCM) algorithm is the most widely used fuzzy clustering method. It is
a generalization of the K-means algorithm but with soft clustering assignments. FCM
minimizes an objective function based on the membership values and distances between data
points and cluster centers.

Steps in Fuzzy C-Means:

1. Initialize Cluster Centers: Start by randomly initializing the cluster centers.

2. Calculate Membership Values: For each data point, calculate its membership value for each cluster, based on the distance between the point and the cluster center. The closer a data point is to a cluster center, the higher its membership value for that cluster.

3. Update Cluster Centers: The cluster centers are updated as the weighted average of all data points, where the weights are the membership values of the points for that cluster.

4. Repeat Until Convergence: Repeat the process of updating membership values and cluster centers until the changes in the cluster centers are smaller than a specified threshold.
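These steps translate into a short NumPy sketch of FCM using the standard update equations (this variant initializes the memberships rather than the centers, which is equivalent); the synthetic data and the choices c = 2 and m = 2 are invented for illustration.

```python
import numpy as np

def fuzzy_c_means(X, c=2, m=2.0, n_iter=100, tol=1e-5, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1 (variant): random initial memberships, rows normalized to sum to 1.
    U = rng.random((len(X), c))
    U /= U.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        # Step 3: centers are the membership-weighted means of the points.
        W = U ** m
        centers = (W.T @ X) / W.sum(axis=0)[:, None]
        # Step 2: memberships from distances, u_ij proportional to d_ij^(-2/(m-1)).
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=-1)
        dist = np.fmax(dist, 1e-10)  # avoid division by zero
        inv = dist ** (-2.0 / (m - 1.0))
        U_new = inv / inv.sum(axis=1, keepdims=True)
        # Step 4: stop when memberships barely change.
        if np.abs(U_new - U).max() < tol:
            U = U_new
            break
        U = U_new
    return centers, U

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
centers, U = fuzzy_c_means(X)
print(centers)  # cluster centers
print(U[:5])    # membership degrees of the first five points
```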

Objective Function:

The objective function in FCM aims to minimize the weighted sum of squared distances
between each data point and the cluster centers, where the weights are the membership
values. Mathematically, it can be expressed as:

J_m = \sum_{i=1}^{n} \sum_{j=1}^{c} u_{ij}^m \, ||x_i - c_j||^2

Where:

 u_{ij} is the membership degree of point i in cluster j,

 m is the fuzziness parameter,

 x_i is the i-th data point,

 c_j is the center of the j-th cluster,

 ||x_i - c_j||^2 is the squared Euclidean distance between point x_i and cluster center c_j.

Fuzziness Index (m):

The fuzziness parameter m typically takes values greater than 1. A higher value of m increases the fuzziness (or overlap) of clusters:

 As m approaches 1, FCM behaves like hard clustering (similar to K-means).

 Typical values of m range from 1.5 to 3. The choice of m can affect the clustering results.
