Clustering
Dr Akashdeep,
UIET, Panjab University
Chandigarh
[email protected],
[email protected]
Agenda
• What is clustering?
• Different types of Clustering
• Clustering workflow: a walk through the practical process
• Implementing various clustering algorithms in Python (10 in total)
What is Clustering?
• Clustering: the process of grouping a set of objects into classes of similar
objects
• Documents within a cluster should be similar.
• Documents from different clusters should be dissimilar.
• The commonest form of unsupervised learning
• Unsupervised learning = learning from raw data, as opposed to
supervised learning, where a classification of the examples is given
• Grouping unlabeled examples is called clustering.
• As the examples are un-labeled, clustering relies on unsupervised
machine learning. If the examples are labeled, then clustering
becomes classification.
• You can measure similarity between examples by combining the
examples' feature data into a metric, called a similarity measure.
• For example, you can group similar books by their authors.
• You can create a similarity measure in different scenarios.
• As the number of features increases, creating a similarity measure
becomes more complex.
Classification vs Clustering
Applications of Clustering
More Applications
Identify the natural grouping among these?
The 0-norm (or L0 norm) counts the number of non-zero elements in a vector. The corresponding distance
counts the number of positions in which vectors x and y differ. This is not strictly a Minkowski distance;
however, we can define it as the number of indices i at which x_i ≠ y_i.
L-infinity norm:
Gives the largest magnitude among each element of a vector.
For the vector X = [-6, 4, 2], the L-infinity norm is 6.
In L-infinity norm, only the largest element has any effect.
So, for example, if your vector represents the costs of constructing buildings, minimizing the L-infinity
norm reduces the cost of the most expensive building.
2-norm (Euclidean norm):
Gives the square root of the sum of the squared elements of a vector; the corresponding distance is the familiar Euclidean distance.
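A minimal NumPy sketch of these norms (NumPy is an assumed choice; x is the vector from the L-infinity example above, and y is a made-up second vector just to illustrate the 0-norm distance):

# Minimal sketch of the 0-norm distance, L-infinity norm and 2-norm with NumPy.
import numpy as np

x = np.array([-6, 4, 2])
y = np.array([-6, 0, 2])

l0_distance = np.count_nonzero(x - y)   # positions where x and y differ -> 1
linf_norm = np.max(np.abs(x))           # largest magnitude in x -> 6
l2_norm = np.linalg.norm(x)             # sqrt(36 + 16 + 4) ≈ 7.48

print(l0_distance, linf_norm, l2_norm)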
Clustering Types?
a) Centroid-based Clustering
• Centroid-based clustering organizes
the data into non-hierarchical
clusters, in contrast to hierarchical
clustering defined below.
• k-means is the most widely-used
centroid-based clustering algorithm.
• Centroid-based algorithms are
efficient but sensitive to initial
conditions and outliers.
https://www.naftaliharris.com/blog/visualizing-k-means-clustering/
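A minimal k-means sketch in Python with scikit-learn (the library choice and the toy data are assumptions; the slide itself only links to the interactive demo above):

import numpy as np
from sklearn.cluster import KMeans

# Toy data: 200 random 2-D points (made up for illustration).
X = np.random.RandomState(0).rand(200, 2)

# Fit k-means with k = 3; multiple n_init restarts guard against bad initial centroids.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

print(kmeans.cluster_centers_)   # one centroid per cluster
print(kmeans.labels_[:10])       # cluster index assigned to the first 10 points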
b) Density-based Clustering
• Density-based clustering connects areas of
high example density into clusters.
• This allows for arbitrary-shaped distributions
as long as dense areas can be connected.
• These algorithms have difficulty with data of
varying densities and high dimensions.
• Further, by design, these algorithms do not
assign outliers to clusters.
• These algorithms do not require the number of
clusters to be specified.
https://www.naftaliharris.com/blog/visualizing-dbscan-clustering/
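A minimal density-based sketch with scikit-learn's DBSCAN (an assumed choice, matching the demo linked above); note that no number of clusters is passed, only a neighbourhood radius and a density threshold:

from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two crescent-shaped clusters that centroid-based methods struggle with.
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

# eps = neighbourhood radius, min_samples = density threshold (values are illustrative).
db = DBSCAN(eps=0.2, min_samples=5).fit(X)

print(set(db.labels_))   # cluster ids; -1 marks outliers left unassigned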
c) Distribution-based Clustering
• This clustering approach assumes data is
composed of distributions, such as Gaussian
distributions.
• In the figure, the distribution-based algorithm
clusters data into three Gaussian distributions.
• As distance from the distribution's center
increases, the probability that a point belongs
to the distribution decreases.
• The bands show that decrease in probability.
• When you do not know the type of distribution
in your data, you should use a different
algorithm.
• The expectation-maximization (EM) algorithm is typically used to fit these mixture models, as sketched below.
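A minimal distribution-based sketch using scikit-learn's GaussianMixture, which fits the Gaussians with EM (the library and the toy blobs are assumptions for illustration):

from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Toy data drawn from three blobs, then modelled as three Gaussians.
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
gmm = GaussianMixture(n_components=3, random_state=0).fit(X)

labels = gmm.predict(X)        # hard assignment: most likely Gaussian per point
probs = gmm.predict_proba(X)   # soft assignment: membership probability per Gaussian
print(probs[:3].round(3))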
d) Hierarchical Clustering
• Hierarchical clustering creates a
tree of clusters. Hierarchical
clustering, not surprisingly, is well
suited to hierarchical data, such as
taxonomies.
• See Comparison of 61 Sequenced
Escherichia coli Genomes by
Oksana Lukjancenko, Trudy
Wassenaar & Dave Ussery for an
example.
• Another advantage is that any number of clusters can
be chosen by cutting the tree at
the right level.
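A minimal hierarchical-clustering sketch with SciPy (an assumed library choice): the tree is built once, then cut at different levels to obtain any number of clusters, as noted above:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy data: 20 random 2-D points (made up for illustration).
X = np.random.RandomState(0).rand(20, 2)

Z = linkage(X, method="ward")                   # build the cluster tree (dendrogram)

print(fcluster(Z, t=2, criterion="maxclust"))   # cut the tree into 2 clusters
print(fcluster(Z, t=4, criterion="maxclust"))   # cut the same tree into 4 clusters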
Clustering Workflow
• Prepare data.
• Create similarity metric.
• Run clustering algorithm.
• Interpret results and adjust your clustering.
Clustering Workflow
• Prepare Data
• As with any ML problem, you must normalize, scale, and transform feature data.
• In clustering, however, you must additionally ensure that the prepared data lets you accurately calculate the
similarity between examples.
• Sometimes, a data set conforms to a power-law distribution that clumps data at the low end.
• In Figure 2, red is closer to yellow than blue.
Using Quantiles
• Normalization and log transforms address specific data distributions.
What if data doesn’t conform to a Gaussian or power-law
distribution? Is there a general approach that applies to any data
distribution?
Intuitively, if two examples have only a few other examples between them, then these two examples are similar
irrespective of their values.
Conversely, if two examples have many examples between them, then the two examples are less
similar. Thus, the similarity between two examples decreases as the number of examples between
them increases.
• Normalizing the data simply reproduces the data distribution because normalization is a linear transform.
• Applying a log transform doesn't reflect your intuition on how similarity works either, as shown below.
Instead, divide the data into intervals where each interval contains an equal number of examples. These
interval boundaries are called quantiles.
Convert your data into quantiles by performing the following steps:
1. Decide the number of intervals.
2. Define intervals such that each interval has an equal number of examples.
3. Replace each example by the index of the interval it falls in.
4. Bring the indexes to the same range as other feature data by scaling the index values to [0, 1].
• After converting data to quantiles, the similarity between two examples is inversely proportional to the
number of examples between those two examples
https://developers.google.com/machine-learning/clustering/prepare-data
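A minimal sketch of the four quantile-conversion steps above, using NumPy and pandas (assumed choices; the skewed feature values are made up for illustration):

import numpy as np
import pandas as pd

# A skewed, power-law-like feature (made up for illustration).
values = np.random.RandomState(0).lognormal(size=1000)

n_intervals = 10                                         # step 1: choose the number of intervals
indexes = pd.qcut(values, q=n_intervals, labels=False)   # steps 2-3: equal-count bins, keep the bin index
scaled = indexes / (n_intervals - 1)                     # step 4: scale the indexes to [0, 1]

print(scaled[:5])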
2. Define Similarity Measures
Similarity measures can be categorized into two types:
1. Manual similarity measure
2. Supervised similarity measure
2.a) Create a Manual Similarity Measure
• Combine all the feature data for two examples into a single numeric value.
• Consider a shoe data set with only one feature: shoe size.
• You can quantify how similar two shoes are by calculating the difference between
their sizes.
• The smaller the numerical difference between sizes, the greater the similarity
between shoes. Such a handcrafted similarity measure is called a manual similarity
measure.
• What if you wanted to find similarities between shoes by using both size
and color?
• Color is categorical data, and is harder to combine with the numerical size data.
• As data becomes more complex, creating a manual
similarity measure becomes harder.
• In that case, switch to a supervised similarity measure, where a supervised machine
learning model calculates the similarity.
Manual Similarity: Example
• Suppose the model has two features: shoe size and shoe price data.
• Since both features are numeric, you can combine them into a single number representing similarity as follows.
• Size (s): Shoe size probably forms a Gaussian distribution. Confirm this. Then normalize the data.
• Price (p): The data is probably a Poisson distribution. Confirm this. If you have enough data, convert the data
to quantiles and scale to [0,1].
• Combine the data by using root mean squared error (RMSE).
Let's calculate the similarity for two shoes with US sizes 8 and 11, and prices
120 and 150.
Action | Method
Scale the size. | Assume a maximum possible shoe size of 20. Divide 8 and 11 by the maximum size 20 to get 0.4 and 0.55.
Scale the price. | Divide 120 and 150 by the maximum price 150 to get 0.8 and 1.
Find the difference in size. | 0.55 − 0.4 = 0.15
Find the difference in price. | 1 − 0.8 = 0.2
Find the RMSE. | √((0.2² + 0.15²) / 2) ≈ 0.17
Intuitively, your measured similarity should increase when feature data becomes
similar. Instead, your measured similarity actually decreases. Make your measured
similarity follow your intuition by subtracting it from 1.
Similarity=1−0.17=0.83
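A minimal sketch of the worked example above (the scaling constants 20 and 150 come from the slide; the helper name shoe_similarity is made up for illustration):

import math

def shoe_similarity(size_a, price_a, size_b, price_b,
                    max_size=20.0, max_price=150.0):
    ds = abs(size_a - size_b) / max_size        # scaled size difference
    dp = abs(price_a - price_b) / max_price     # scaled price difference
    rmse = math.sqrt((ds ** 2 + dp ** 2) / 2)   # root mean squared difference
    return 1 - rmse                             # invert so that higher = more similar

# Sizes 8 and 11, prices 120 and 150, as in the slide.
print(round(shoe_similarity(8, 120, 11, 150), 2))   # ≈ 0.82 (0.83 if the RMSE is first rounded to 0.17)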
What if you have categorical data?
• Categorical data can either be:
• Single valued (univalent), such as a car's color ("white" or "blue" but never both). If univalent data
matches, the similarity is 1; otherwise, it's 0.
• Multi-valued (multivalent), such as a movie's genre (can be "action" and "comedy" simultaneously, or just
"action"). Multivalent data is harder to deal with. For example, movie genres can be a challenge to work
with.
• To handle this problem, suppose movies are assigned genres from a fixed set of genres. Calculate similarity
using the ratio of common values, called Jaccard similarity.
• Examples:
• [“comedy”,”action”] and [“comedy”,”action”] = 1
• [“comedy”,”action”] and [“action”] = ½
• [“comedy”,”action”] and [“action”, "drama"] = ⅓
• [“comedy”,”action”] and [“non-fiction”,”biographical”] = 0
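A minimal Jaccard-similarity sketch reproducing the genre examples above (plain Python, no extra libraries assumed):

def jaccard(a, b):
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0                   # convention for two empty sets
    return len(a & b) / len(a | b)   # shared values / all distinct values

print(jaccard(["comedy", "action"], ["comedy", "action"]))             # 1.0
print(jaccard(["comedy", "action"], ["action"]))                       # 0.5
print(jaccard(["comedy", "action"], ["action", "drama"]))              # 0.333...
print(jaccard(["comedy", "action"], ["non-fiction", "biographical"]))  # 0.0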
Postal code | Postal codes representing areas that are close to each other should have a higher similarity. Convert the postal codes into latitude and longitude. For a pair of postal codes, separately calculate the difference between their latitudes and their longitudes, then add the differences to get a single numeric value.
Color | Convert the textual values into numeric RGB values. Find the differences in red, green, and blue values for two colors, and combine the differences into a single numeric value by using the Euclidean distance.
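A minimal sketch of the colour row above (the named RGB values and the rescaling to [0, 1] are assumptions for illustration):

import math

RGB = {"white": (255, 255, 255), "blue": (0, 0, 255), "navy": (0, 0, 128)}

def color_similarity(c1, c2):
    distance = math.dist(RGB[c1], RGB[c2])                # Euclidean distance in RGB space
    max_distance = math.dist((0, 0, 0), (255, 255, 255))  # black vs white, the farthest pair
    return 1 - distance / max_distance                    # 1 = identical, 0 = maximally different

print(round(color_similarity("blue", "navy"), 2))   # close colours -> high similarity
print(round(color_similarity("white", "blue"), 2))  # distant colours -> low similarity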
Manual Similarity Measure: Exercise (5 minutes)
Dataset features:
Feature | Type
Price | Positive integer
Size | Positive floating-point value in units of square meters
• Libraries make life easier, since all of these measures are directly available.
Generating Embeddings Example
• To better understand how vector length changes the similarity measure, normalize
the vector lengths to 1 and notice that the three measures become proportional to
each other.
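A minimal sketch of that point, assuming the three measures are Euclidean distance, cosine, and dot product: after normalizing two vectors to unit length, cosine equals the dot product, and the Euclidean distance becomes a simple monotonic function of it:

import numpy as np

# Two made-up embedding vectors.
a = np.array([3.0, 4.0])
b = np.array([1.0, 2.0])

a_hat = a / np.linalg.norm(a)   # normalize to length 1
b_hat = b / np.linalg.norm(b)

dot = np.dot(a_hat, b_hat)                 # dot product == cosine once lengths are 1
euclid = np.linalg.norm(a_hat - b_hat)     # Euclidean distance between the unit vectors

print(dot, euclid, np.sqrt(2 - 2 * dot))   # the last two values match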
Programming Exercise: Clustering with a Supervised Similarity Measure
Comparison: Manual vs Supervised
Requirement | Manual | Supervised
Eliminate redundant information in correlated features. | No, you need to separately investigate correlations between features. | Yes, the DNN eliminates redundant information.