Clustering

Dr. C Santhosh Kumar


What is Clustering?
Types of Clustering
Common Distance Measures
K-means Clustering
K-means Clustering - Algorithm
Example
Applications of K-means Clustering
Imputation

When some examples in a cluster have missing
feature data, you can infer the missing data from
other examples in the cluster. This is called
imputation.

For example, less popular videos can be clustered
with more popular videos to improve video
recommendations.
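
A minimal sketch of cluster-based imputation, assuming NumPy and scikit-learn (neither library is named in the slides); the data, the missing-value pattern, and the choice of clustering on the fully observed features with k = 4 are purely illustrative:

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))        # 100 examples, 3 features (illustrative data)
X[::10, 2] = np.nan                  # every 10th example is missing feature 2

# Cluster on the fully observed features only.
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X[:, :2])

# Impute each missing value with the mean of that feature within its cluster.
for k in np.unique(labels):
    in_cluster = labels == k
    cluster_mean = np.nanmean(X[in_cluster, 2])
    missing = in_cluster & np.isnan(X[:, 2])
    X[missing, 2] = cluster_mean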
Data Compression

As discussed, the relevant cluster ID can replace other features for all examples in
that cluster. This substitution reduces the number of features and therefore also
reduces the resources needed to store, process, and train models on that data. For
very large datasets, these savings become significant.


To give an example, a single YouTube video can have feature data including:

viewer location, time, and demographics
comment timestamps, text, and user IDs
video tags

Clustering YouTube videos replaces this set of features with a single cluster ID, thus
compressing the data.
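
As a rough illustration of this substitution, assuming scikit-learn (not named in the slides) and an entirely hypothetical feature matrix, clustering reduces each example to a single small integer:

import numpy as np
from sklearn.cluster import KMeans

# Hypothetical dense feature matrix: 10,000 examples x 50 features.
features = np.random.default_rng(1).normal(size=(10_000, 50))

kmeans = KMeans(n_clusters=100, n_init=10, random_state=0).fit(features)
cluster_ids = kmeans.labels_         # one small integer per example

# Downstream systems can store and train on cluster_ids (plus the 100
# centroids in kmeans.cluster_centers_) instead of the full 10,000 x 50 matrix.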
Privacy preservation

You can preserve privacy somewhat by clustering users and associating
user data with cluster IDs instead of user IDs.

To give one possible example, say you want to train a model on YouTube
users' watch history.

Instead of passing user IDs to the model, you could cluster users and
pass only the cluster ID.

This keeps individual watch histories from being attached to individual
users. Note that the cluster must contain a sufficiently large number of
users in order to preserve privacy.
Centroid based clustering

The centroid of a cluster is the arithmetic mean of all the
points in the cluster.

Centroid-based clustering organizes the data into non-
hierarchical clusters.

Centroid-based clustering algorithms are efficient but
sensitive to initial conditions and outliers.

Of these, k-means is the most widely used. It requires
users to define the number of centroids, k, and works well
with clusters of roughly equal size.
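
To make the assignment and update steps concrete, here is a minimal NumPy sketch of the k-means (Lloyd's) iteration; the function name, the random initialisation, and the demo data are illustrative, and empty clusters are not handled:

import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Initialise centroids as k randomly chosen data points.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Assignment step: each point goes to its nearest centroid (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid becomes the arithmetic mean of its assigned points.
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):   # stop when centroids no longer move
            break
        centroids = new_centroids
    return centroids, labels

X_demo = np.random.default_rng(1).normal(size=(200, 2))
centroids, labels = kmeans(X_demo, k=3)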
Density based clustering

Density-based clustering connects contiguous areas of
high example density into clusters.

This allows for the discovery of any number of clusters
of any shape. Outliers are not assigned to clusters.

These algorithms have difficulty with clusters of
different density and data with high dimensions.
Use cases
Clustering is useful in a variety of industries. Some common
applications for clustering:


Market segmentation

Social network analysis

Search result grouping

Medical imaging

Image segmentation

Anomaly detection

For example, gene sequencing that shows previously unknown genetic similarities and dissimilarities between species has led to the revision of taxonomies previously based on appearance.
DBSCAN (Density-Based Spatial Clustering of Applications with Noise)

ɛ: The radius of our neighborhoods around a
data point p.

minPts: The minimum number of data points we
want in a neighborhood to define a cluster.
DBSCAN

Core Points: A data point p is a core point if Nbhd(p, ɛ) [the ɛ-neighborhood of p] contains at least minPts points, i.e., |Nbhd(p, ɛ)| >= minPts.

Border Points: A data point q is a border point if Nbhd(q, ɛ) contains fewer than minPts data points, but q is reachable from some core point p.

Outlier: A data point o is an outlier if it is neither a core point
nor a border point. Essentially, this is the “other” class.
Core Points

Core points are the foundations of our clusters, and they are based on the density approximation.

We use the same ɛ to compute the neighborhood for each point, so the volume of all the
neighborhoods is the same.

However, the number of other points in each neighborhood is what differs.

The number of data points in a neighborhood is its mass. The volume of each neighborhood is constant, while the mass of each neighborhood is variable.

By keeping a threshold on the minimum amount of mass needed to be a core point, we are
essentially setting a minimum density threshold.

Therefore, core points are data points that satisfy a minimum density requirement.

Our clusters are built around our core points (hence the core part), so by adjusting our minPts
parameter, we can fine-tune how dense our clusters' cores must be.
Border Points

Border Points are the points in our clusters that are not core points.

Density-reachable - Consider a neighborhood with ɛ = 0.15 and a point r that lies outside the point p's neighborhood.

All the points inside the point p's neighborhood are said to be directly reachable from p.

Now, let's explore the neighborhood of a point q that is directly reachable from p.

While our target point r is not in our starting point p's neighborhood, it is contained in the point q's neighborhood.

If we can get to the point r by jumping from neighborhood to neighborhood, starting at the point p, then the point r is density-reachable from the point p.

If the directly reachable points of a core point p are its "friends", then the density-reachable points, the points in the neighborhoods of the "friends" of p, are the "friends of its friends".

"Friends of a friend of a friend ... of a friend" are included as well.
DBSCAN Algorithm

The steps to the DBSCAN algorithm are:

Pick a point at random that has not been assigned to a cluster or been
designated as an outlier. Compute its neighborhood to determine if it’s a core
point. If yes, start a cluster around this point. If no, label the point as an outlier.

Once we find a core point and thus a cluster, expand the cluster by adding all
directly-reachable points to the cluster. Perform “neighborhood jumps” to find all
density-reachable points and add them to the cluster. If an outlier is added,
change that point’s status from outlier to border point.

Repeat these two steps until all points are either assigned to a cluster or
designated as an outlier.
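
A minimal sketch of DBSCAN using scikit-learn (not named in the slides); the toy dataset and the eps / min_samples values are illustrative:

import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two crescent-shaped clusters that centroid-based methods handle poorly.
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

# eps plays the role of ɛ (neighborhood radius); min_samples plays the role of minPts.
db = DBSCAN(eps=0.15, min_samples=5).fit(X)

labels = db.labels_                  # cluster index per point; -1 marks outliers (noise)
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print(f"found {n_clusters} clusters and {(labels == -1).sum()} outliers")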
Distribution-based clustering

This clustering approach assumes data is composed of
probabilistic distributions, such as Gaussian distributions.

For example, a distribution-based algorithm might cluster the data into three Gaussian distributions.

As distance from a distribution's center increases, the probability that a point belongs to that distribution decreases; bands of decreasing probability can be drawn around each center.

When you're not comfortable assuming a particular underlying
distribution of the data, you should use a different algorithm.
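
A minimal sketch of distribution-based clustering with a Gaussian mixture model in scikit-learn (not named in the slides); the synthetic data and the choice of three components are illustrative:

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Three blobs drawn from Gaussians centred at -3, 0, and 3.
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(100, 2)) for c in (-3, 0, 3)])

gmm = GaussianMixture(n_components=3, random_state=0).fit(X)
labels = gmm.predict(X)        # hard assignment: the most probable Gaussian for each point
probs = gmm.predict_proba(X)   # soft assignment: per-point probability of each Gaussian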
Gaussian Distribution
Multi-dimensional Gaussian
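
For reference (standard definitions, not taken from the slides): the Gaussian density for a scalar x with mean \mu and standard deviation \sigma, and the multivariate Gaussian density for a d-dimensional vector \mathbf{x} with mean \boldsymbol{\mu} and covariance matrix \Sigma, are:

f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)

\mathcal{N}(\mathbf{x};\, \boldsymbol{\mu}, \Sigma) = (2\pi)^{-d/2}\, |\Sigma|^{-1/2} \exp\!\left(-\tfrac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^{\top}\Sigma^{-1}(\mathbf{x}-\boldsymbol{\mu})\right)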
Hierarchical Clustering

Hierarchical clustering creates a tree of clusters.

Hierarchical clustering, not surprisingly, is well suited to
hierarchical data, such as taxonomies.

Hierarchical clustering uses the clustering techniques above to find a hierarchy of clusters; this hierarchy resembles a tree structure, called a dendrogram.

Any number of clusters can be chosen by cutting the tree at the
right level.
Hierarchical Clustering

Agglomerative clustering uses a bottom-up approach, wherein each
data point starts in its own cluster. These clusters are then joined
greedily, by taking the two most similar clusters together and
merging them.

Divisive clustering uses a top-down approach, wherein all data
points start in the same cluster. You can then use a parametric
clustering algorithm like K-Means to divide the cluster into two
clusters. For each cluster, you further divide it down to two clusters
until you hit the desired number of clusters.
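
A minimal sketch of agglomerative (bottom-up) clustering with SciPy (not named in the slides); the synthetic data, the Ward linkage, and the choice of three clusters are illustrative:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
# Three well-separated blobs in 2-D.
X = np.vstack([rng.normal(loc=c, size=(30, 2)) for c in (0, 5, 10)])

Z = linkage(X, method="ward")                      # greedy bottom-up merges; Z encodes the dendrogram
labels = fcluster(Z, t=3, criterion="maxclust")    # "cut the tree" so that at most 3 clusters remain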
Clustering – work flow

To cluster your data, you'll follow these steps:

Prepare data.

Create similarity metric.

Run clustering algorithm.

Interpret results and adjust your clustering.
Data Preparation

Normalising data
Z-scores: Whenever you see a dataset roughly shaped like a Gaussian distribution, you should calculate z-scores for the data. Z-scores are the number of standard deviations a value is from the mean. You can also use z-scores when the dataset isn't large enough for quantiles.

A Z-score is the number of standard deviations a value is from the mean. For example, a value
that is 2 standard deviations greater than the mean has a Z-score of +2.0. A value that is 1.5
standard deviations less than the mean has a Z-score of -1.5.
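
A minimal sketch of z-score normalisation in NumPy (not named in the slides); the values are illustrative:

import numpy as np

values = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
z_scores = (values - values.mean()) / values.std()   # standard deviations from the mean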

Normalising the data

Log Scaling
 Log scaling computes the logarithm of the raw
value. In theory, the logarithm could be any base; in
practice, log scaling usually calculates the natural
logarithm (ln).
Log Scaling

Log scaling is helpful when the data conforms to a power law distribution. Casually speaking, a power law
distribution looks as follows:

 Low values of X have very high values of Y.


 As the values of X increase, the values of Y quickly decrease. Consequently, high values of X have very low values of Y.


Movie ratings are a good example of a power law distribution. Notice:

 A few movies have lots of user ratings. (Low values of X have high values of Y.)
 Most movies have very few user ratings. (High values of X have low values of Y.)


Log scaling changes the distribution, which helps train a model that will make better predictions.
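
A minimal sketch of log scaling in NumPy (not named in the slides); the heavy-tailed counts are illustrative:

import numpy as np

ratings_count = np.array([1, 3, 7, 52, 880, 151_000])   # power-law-like raw values
log_scaled = np.log(ratings_count)                      # natural logarithm compresses the long tail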
