Clustering

Dr. C Santhosh Kumar


What is Clustering?
Types of Clustering
Common Distance Measures
K-means Clustering
K-means Clustering - Algorithm
Example
Applications of K-means Clustering
Imputation

When some examples in a cluster have missing
feature data, you can infer the missing data from
other examples in the cluster. This is called
imputation.

For example, less popular videos can be clustered
with more popular videos to improve video
recommendations.
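
A minimal sketch of cluster-based imputation, assuming NumPy and scikit-learn (neither library is named in the slides); the data, the missing-value pattern, and the choice of clustering on the fully observed features with k = 4 are purely illustrative:

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))        # 100 examples, 3 features (illustrative data)
X[::10, 2] = np.nan                  # every 10th example is missing feature 2

# Cluster on the fully observed features only.
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X[:, :2])

# Impute each missing value with the mean of that feature within its cluster.
for k in np.unique(labels):
    in_cluster = labels == k
    cluster_mean = np.nanmean(X[in_cluster, 2])
    missing = in_cluster & np.isnan(X[:, 2])
    X[missing, 2] = cluster_mean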
Data Compression

As discussed, the relevant cluster ID can replace other features for all examples in
that cluster. This substitution reduces the number of features and therefore also
reduces the resources needed to store, process, and train models on that data. For
very large datasets, these savings become significant.


To give an example, a single YouTube video can have feature data including:

viewer location, time, and demographics
comment timestamps, text, and user IDs
video tags

Clustering YouTube videos replaces this set of features with a single cluster ID, thus
compressing the data.
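
As a rough illustration of this substitution, assuming scikit-learn (not named in the slides) and an entirely hypothetical feature matrix, clustering reduces each example to a single small integer:

import numpy as np
from sklearn.cluster import KMeans

# Hypothetical dense feature matrix: 10,000 examples x 50 features.
features = np.random.default_rng(1).normal(size=(10_000, 50))

kmeans = KMeans(n_clusters=100, n_init=10, random_state=0).fit(features)
cluster_ids = kmeans.labels_         # one small integer per example

# Downstream systems can store and train on cluster_ids (plus the 100
# centroids in kmeans.cluster_centers_) instead of the full 10,000 x 50 matrix.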
Privacy preservation

You can preserve privacy somewhat by clustering users and associating
user data with cluster IDs instead of user IDs.

To give one possible example, say you want to train a model on YouTube
users' watch history.

Instead of passing user IDs to the model, you could cluster users and
pass only the cluster ID.

This keeps individual watch histories from being attached to individual
users. Note that the cluster must contain a sufficiently large number of
users in order to preserve privacy.
Centroid based clustering

The centroid of a cluster is the arithmetic mean of all the
points in the cluster.

Centroid-based clustering organizes the data into non-
hierarchical clusters.

Centroid-based clustering algorithms are efficient but
sensitive to initial conditions and outliers.

Of these, k-means is the most widely used. It requires
users to define the number of centroids, k, and works well
with clusters of roughly equal size.
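
To make the assignment and update steps concrete, here is a minimal NumPy sketch of the k-means (Lloyd's) iteration; the function name, the random initialisation, and the demo data are illustrative, and empty clusters are not handled:

import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Initialise centroids as k randomly chosen data points.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Assignment step: each point goes to its nearest centroid (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid becomes the arithmetic mean of its assigned points.
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):   # stop when centroids no longer move
            break
        centroids = new_centroids
    return centroids, labels

X_demo = np.random.default_rng(1).normal(size=(200, 2))
centroids, labels = kmeans(X_demo, k=3)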
Density based clustering

Density-based clustering connects contiguous areas of
high example density into clusters.

This allows for the discovery of any number of clusters
of any shape. Outliers are not assigned to clusters.

These algorithms have difficulty with clusters of
different density and data with high dimensions.
Use cases
Clustering is useful in a variety of industries. Some common
applications for clustering:


Market segmentation

Social network analysis

Search result grouping

Medical imaging

Image segmentation

Anomaly detection

For example, gene sequencing that shows previously unknown genetic similarities and dissimilarities between species has led to the revision of taxonomies previously based on appearance.
DBSCAN (Density-Based Spatial Clustering of Applications with Noise)

ɛ: The radius of our neighborhoods around a
data point p.

minPts: The minimum number of data points we
want in a neighborhood to define a cluster.
DBSCAN

Core Points: A data point p is a core point if Nbhd(p, ɛ) [the ɛ-neighborhood of p] contains at least minPts points, i.e., |Nbhd(p, ɛ)| >= minPts.

Border Points: A data point q is a border point if Nbhd(q, ɛ) contains fewer than minPts data points, but q is reachable from some core point p.

Outlier: A data point o is an outlier if it is neither a core point
nor a border point. Essentially, this is the “other” class.
Core Points

Core points are the foundations of our clusters, and they are based on the density approximation.

We use the same ɛ to compute the neighborhood for each point, so the volume of all the
neighborhoods is the same.

However, the number of other points in each neighborhood is what differs.

The number of data points in a neighborhood is its mass. The volume of each neighborhood is constant, while the mass of each neighborhood is variable.

By keeping a threshold on the minimum amount of mass needed to be a core point, we are
essentially setting a minimum density threshold.

Therefore, core points are data points that satisfy a minimum density requirement.

Our clusters are built around our core points (hence the core part), so by adjusting our minPts
parameter, we can fine-tune how dense our clusters' cores must be.
Border Points

Border Points are the points in our clusters that are not core points.

Density-reachable - Consider a neighborhood with ɛ = 0.15 and a point r that lies outside the point p's neighborhood.

All the points inside the point p's neighborhood are said to be directly reachable from p.

Now, let's explore the neighborhood of a point q that is directly reachable from p.

While our target point r is not in our starting point p's neighborhood, it is contained in the point q's neighborhood.

If we can get to the point r by jumping from neighborhood to neighborhood, starting at the point p, then the point r is density-reachable from the point p.

If the directly reachable points of a core point p are its "friends", then the density-reachable points, the points in the neighborhoods of the "friends" of p, are the "friends of its friends".

"Friends of a friend of a friend ... of a friend" are included as well.
DBSCAN Algorithm

The steps to the DBSCAN algorithm are:

Pick a point at random that has not been assigned to a cluster or been
designated as an outlier. Compute its neighborhood to determine if it’s a core
point. If yes, start a cluster around this point. If no, label the point as an outlier.

Once we find a core point and thus a cluster, expand the cluster by adding all
directly-reachable points to the cluster. Perform “neighborhood jumps” to find all
density-reachable points and add them to the cluster. If an outlier is added,
change that point’s status from outlier to border point.

Repeat these two steps until all points are either assigned to a cluster or
designated as an outlier.
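
A minimal sketch of DBSCAN using scikit-learn (not named in the slides); the toy dataset and the eps / min_samples values are illustrative:

import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two crescent-shaped clusters that centroid-based methods handle poorly.
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

# eps plays the role of ɛ (neighborhood radius); min_samples plays the role of minPts.
db = DBSCAN(eps=0.15, min_samples=5).fit(X)

labels = db.labels_                  # cluster index per point; -1 marks outliers (noise)
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print(f"found {n_clusters} clusters and {(labels == -1).sum()} outliers")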
Distribution-based clustering

This clustering approach assumes data is composed of
probabilistic distributions, such as Gaussian distributions.

For example, a distribution-based algorithm might cluster the data into three Gaussian distributions.

As distance from a distribution's center increases, the probability that a point belongs to that distribution decreases; bands of decreasing probability can be drawn around each center.

When you're not comfortable assuming a particular underlying
distribution of the data, you should use a different algorithm.
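
A minimal sketch of distribution-based clustering with a Gaussian mixture model in scikit-learn (not named in the slides); the synthetic data and the choice of three components are illustrative:

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Three blobs drawn from Gaussians centred at -3, 0, and 3.
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(100, 2)) for c in (-3, 0, 3)])

gmm = GaussianMixture(n_components=3, random_state=0).fit(X)
labels = gmm.predict(X)        # hard assignment: the most probable Gaussian for each point
probs = gmm.predict_proba(X)   # soft assignment: per-point probability of each Gaussian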
Gaussian Distribution
Multi-dimensional Gaussian
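
For reference (standard definitions, not taken from the slides): the Gaussian density for a scalar x with mean \mu and standard deviation \sigma, and the multivariate Gaussian density for a d-dimensional vector \mathbf{x} with mean \boldsymbol{\mu} and covariance matrix \Sigma, are:

f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)

\mathcal{N}(\mathbf{x};\, \boldsymbol{\mu}, \Sigma) = (2\pi)^{-d/2}\, |\Sigma|^{-1/2} \exp\!\left(-\tfrac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^{\top}\Sigma^{-1}(\mathbf{x}-\boldsymbol{\mu})\right)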
Hierarchical Clustering

Hierarchical clustering creates a tree of clusters.

Hierarchical clustering, not surprisingly, is well suited to
hierarchical data, such as taxonomies.

Hierarchical clustering uses the clustering techniques above to find a hierarchy of clusters; this hierarchy resembles a tree structure, called a dendrogram.

Any number of clusters can be chosen by cutting the tree at the
right level.
Hierarchical Clustering

Agglomerative clustering uses a bottom-up approach, wherein each
data point starts in its own cluster. These clusters are then joined
greedily, by taking the two most similar clusters together and
merging them.

Divisive clustering uses a top-down approach, wherein all data
points start in the same cluster. You can then use a parametric
clustering algorithm like K-Means to divide the cluster into two
clusters. For each cluster, you further divide it down to two clusters
until you hit the desired number of clusters.
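
A minimal sketch of agglomerative (bottom-up) clustering with SciPy (not named in the slides); the synthetic data, the Ward linkage, and the choice of three clusters are illustrative:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
# Three well-separated blobs in 2-D.
X = np.vstack([rng.normal(loc=c, size=(30, 2)) for c in (0, 5, 10)])

Z = linkage(X, method="ward")                      # greedy bottom-up merges; Z encodes the dendrogram
labels = fcluster(Z, t=3, criterion="maxclust")    # "cut the tree" so that at most 3 clusters remain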
Clustering – work flow

To cluster your data, you'll follow these steps:

Prepare data.

Create similarity metric.

Run clustering algorithm.

Interpret results and adjust your clustering.
Data Preparation

Normalising data
Z-scores: Whenever you see a dataset roughly shaped like a Gaussian distribution, you should calculate z-scores for the data. Z-scores are the number of standard deviations a value is from the mean. You can also use z-scores when the dataset isn't large enough for quantiles.

A Z-score is the number of standard deviations a value is from the mean. For example, a value
that is 2 standard deviations greater than the mean has a Z-score of +2.0. A value that is 1.5
standard deviations less than the mean has a Z-score of -1.5.
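
A minimal sketch of z-score normalisation in NumPy (not named in the slides); the values are illustrative:

import numpy as np

values = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
z_scores = (values - values.mean()) / values.std()   # standard deviations from the mean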

Normalising the data

Log Scaling
 Log scaling computes the logarithm of the raw
value. In theory, the logarithm could be any base; in
practice, log scaling usually calculates the natural
logarithm (ln).
Log Scaling

Log scaling is helpful when the data conforms to a power law distribution. Casually speaking, a power law
distribution looks as follows:

 Low values of X have very high values of Y.


 As the values of X increase, the values of Y quickly decrease. Consequently, high values of X have very low values of Y.


Movie ratings are a good example of a power law distribution. Notice:

 A few movies have lots of user ratings. (Low values of X have high values of Y.)
 Most movies have very few user ratings. (High values of X have low values of Y.)


Log scaling changes the distribution, which helps train a model that will make better predictions.
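
A minimal sketch of log scaling in NumPy (not named in the slides); the heavy-tailed counts are illustrative:

import numpy as np

ratings_count = np.array([1, 3, 7, 52, 880, 151_000])   # power-law-like raw values
log_scaled = np.log(ratings_count)                      # natural logarithm compresses the long tail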
