
Understanding the inners of

Clustering
Dr Akashdeep,
UIET, Panjab University
Chandigarh
[email protected]
Agenda
• What is clustering?
• Different types of Clustering
• Clustering Workflow: walk through the practical process
• Implementing various clustering algorithms in Python: a total of 10
What is Clustering?
• Clustering: the process of grouping a set of objects into classes of similar
objects
• Documents within a cluster should be similar.
• Documents from different clusters should be dissimilar.
• The commonest form of unsupervised learning
• Unsupervised learning = learning from raw data, as opposed to
supervised learning, where a classification of examples is given
• Grouping unlabeled examples is called clustering.
• As the examples are un-labeled, clustering relies on unsupervised
machine learning. If the examples are labeled, then clustering
becomes classification.
• You can measure similarity between examples by combining the
examples' feature data into a metric, called a similarity measure.
• For example, you can group similar books by their authors.
• You can create a similarity measure in different scenarios.
• As the number of features increases, creating a similarity measure
becomes more complex.
Classification vs Clustering
Applications of Clustering
More Applications
Identify the natural grouping among these?

Identify the clusters in the given data set?
• How would you design an algorithm for finding the three clusters in this case?
How to group elements?
• Most obvious answer: group similar points?
• But is similarity so easy to find?
Are These Similar??
• We may need a notion of similarity that depends on representation and algorithm: Distance
Pic courtesy: https://www.slideserve.com


Measure (dis)similarity?
Edit or Transformation Distance:
• Transform one of the objects into the other and measure the effort.
What can distance measures actually measure?
• Can measure
• Symmetry
• Self-similarity
• Separation
• Triangular inequality
• …….
Intuition for distance-based models
• A set of (x, y) points
• Two classes
• Is the box red or blue? How can we do it?
  - use Bayes' rule
  - use a decision tree
  - fit a hyperplane
• Or, since the nearest points are red, use this as the basis of your algorithm
Distance measure
There are many ways to measure distance.
(Minkowski distance). If X = R^d, the Minkowski distance of order p > 0 is defined as

Dis_p(x, y) = ( Σ_j |x_j − y_j|^p )^(1/p) = ‖x − y‖_p

where ‖z‖_p = ( Σ_j |z_j|^p )^(1/p) is the p-norm (sometimes denoted the Lp norm) of the vector z. We will often refer to Dis_p simply as the p-norm.
The 2-norm refers to the familiar Euclidean distance.
How distance matters?
• A is the testing point
• For Euclidean distance, points B and C are at the same distance from A:
• it can't capture the fact that B differs from A on only one attribute, while C differs on two
• But for different values of p, things change
1-norm and 0-norm
The 1-norm denotes the Manhattan distance, also called cityblock distance:

Dis_1(x, y) = Σ_j |x_j − y_j| = ‖x − y‖_1

The 0-norm (or L0 norm) counts the number of non-zero elements in a vector. The corresponding distance counts the number of positions in which vectors x and y differ. This is not strictly a Minkowski distance; however, we can define it as

Dis_0(x, y) = Σ_j |x_j − y_j|^0 = ‖x − y‖_0

under the convention that x^0 = 0 for x = 0 and 1 otherwise.

If x and y are binary strings, this is also called the Hamming distance.
We can see the Hamming distance as the number of bits that need to be flipped to change x into y.
Use of the L0 norm
- Consider two two-element vectors holding a submitted and a stored (username, password) pair, and take the L0 norm of their difference.
- If the L0 norm is 0, the login is successful.
- If the L0 norm is 1, either the username or the password is incorrect, but not both.
- If the L0 norm is 2, both the username and the password are incorrect.
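A minimal Python sketch of this idea; the `l0_distance` helper and the concrete strings are illustrative assumptions, not part of the slides:

```python
import numpy as np

def l0_distance(x, y):
    """Count the positions in which two vectors differ (Hamming distance)."""
    x, y = np.asarray(x), np.asarray(y)
    return int(np.count_nonzero(x != y))

# Hypothetical login example: each "vector" has two components,
# the stored (username, password) and the submitted (username, password).
stored  = np.array(["alice", "s3cret"])
attempt = np.array(["alice", "wrong"])

errors = l0_distance(stored, attempt)
if errors == 0:
    print("login successful")
elif errors == 1:
    print("either the username or the password is incorrect, but not both")
else:
    print("both username and password are incorrect")
```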
Example of various distances: L1 and L2 norm
For the vector X = [3, 4] (in the city-block picture, you can't go directly):
The L1 norm is ‖X‖_1 = |3| + |4| = 7
The L2 norm is ‖X‖_2 = √(|3|² + |4|²) = √(9 + 16) = √25 = 5
As you can see in the graphic, the L2 norm is the most direct route.
L-infinity norm:
Gives the largest magnitude among each element of a vector.
Having the vector X= [-6, 4, 2], the L-infinity norm is 6.
In L-infinity norm, only the largest element has any effect.
So, for example, if your vector represents the costs of constructing several buildings, then by minimizing the L-infinity
norm we are reducing the cost of the most expensive building.
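These norms can be checked quickly with NumPy, using the same example vectors as above:

```python
import numpy as np

x = np.array([3, 4])
print(np.linalg.norm(x, ord=1))                # L1 (Manhattan): |3| + |4| = 7
print(np.linalg.norm(x, ord=2))                # L2 (Euclidean): sqrt(9 + 16) = 5
print(np.linalg.norm([-6, 4, 2], ord=np.inf))  # L-infinity: largest magnitude = 6
```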
Clustering Types?
a) Centroid-based Clustering
• Centroid-based clustering organizes
the data into non-hierarchical
clusters, in contrast to hierarchical
clustering defined below.
• k-means is the most widely-used
centroid-based clustering algorithm.
• Centroid-based algorithms are
efficient but sensitive to initial
conditions and outliers.

https://www.naftaliharris.com/blog/visualizing-k-means-clustering/
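A minimal scikit-learn sketch of centroid-based clustering, assuming synthetic blob data; the parameter choices are illustrative, not from the slides:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Toy data with three roughly spherical groups
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=42)

# n_init re-runs k-means from different random centroids, easing the
# sensitivity to initial conditions mentioned above
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)

print(kmeans.cluster_centers_)   # one centroid per cluster
print(labels[:10])               # cluster index assigned to each example
```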
b) Density-based Clustering
• Density-based clustering connects areas of
high example density into clusters.
• This allows for arbitrary-shaped distributions
as long as dense areas can be connected.
• These algorithms have difficulty with data of
varying densities and high dimensions.
• Further, by design, these algorithms do not
assign outliers to clusters.
• These algorithms do not need the number of clusters to be specified.
https://www.naftaliharris.com/blog/visualizing-dbscan-clustering/
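A short scikit-learn sketch of density-based clustering on synthetic, arbitrarily shaped data; the `eps` and `min_samples` values are assumptions for illustration:

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two interleaving half-circles: arbitrary-shaped clusters that k-means handles poorly
X, _ = make_moons(n_samples=300, noise=0.05, random_state=42)

# eps = neighbourhood radius, min_samples = points needed to form a dense region;
# note that no number of clusters is specified
db = DBSCAN(eps=0.2, min_samples=5).fit(X)

print(set(db.labels_))   # cluster ids; -1 marks outliers left unassigned
```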
c) Distribution-based Clustering
• This clustering approach assumes data is
composed of distributions, such as Gaussian
distributions.
• In the figure, the distribution-based algorithm clusters the data into three Gaussian distributions.
• As distance from the distribution's center
increases, the probability that a point belongs
to the distribution decreases.
• The bands show that decrease in probability.
• When you do not know the type of distribution
in your data, you should use a different
algorithm.
• The EM (expectation-maximization) algorithm is typically used to fit such models.
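A small scikit-learn sketch of distribution-based clustering with a Gaussian mixture fitted by EM; the data and the number of components are assumed for illustration:

```python
from sklearn.mixture import GaussianMixture
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=3, random_state=7)

# GaussianMixture is fitted with the EM algorithm mentioned above
gmm = GaussianMixture(n_components=3, covariance_type="full", random_state=7).fit(X)

labels = gmm.predict(X)        # hard assignment to the most likely Gaussian
probs = gmm.predict_proba(X)   # soft assignment: probability of each Gaussian,
                               # which decreases with distance from its centre
print(probs[:3].round(3))
```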
d)Hierarchical
• Hierarchical clustering creates a
tree of clusters. Hierarchical
clustering, not surprisingly, is well
suited to hierarchical data, such as
taxonomies.
• See Comparison of 61 Sequenced
Escherichia coli Genomes by
Oksana Lukjancenko, Trudy
Wassenaar & Dave Ussery for an
example.
• In addition, another advantage is
that any number of clusters can
be chosen by cutting the tree at
the right level.
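A brief SciPy sketch of hierarchical clustering, showing how cutting the tree at different levels yields any number of clusters; the data and the linkage choice are illustrative:

```python
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=50, centers=3, random_state=0)

# Build the full tree of clusters (Ward linkage on Euclidean distance)
Z = linkage(X, method="ward")

# "Cutting the tree at the right level": ask for any number of clusters after the fact
labels_3 = fcluster(Z, t=3, criterion="maxclust")
labels_5 = fcluster(Z, t=5, criterion="maxclust")
print(len(set(labels_3)), len(set(labels_5)))
```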
Clustering Workflow
• Prepare data.
• Create similarity metric.
• Run clustering algorithm.
• Interpret results and adjust your clustering.

Clustering Workflow
• Prepare Data
• As with any ML problem, you must normalize, scale, and transform feature data.
• In clustering, however, you must additionally ensure that the prepared data lets you accurately calculate the
similarity between examples.

• Create Similarity metric


• Before a clustering algorithm can group data, it needs to know how similar pairs of examples are.
• You quantify the similarity between examples by creating a similarity metric.
• Creating a similarity metric requires you to carefully understand your data and how to derive similarity from your
features.

• Run Clustering Algorithm


• A clustering algorithm uses the similarity metric to cluster data.

• Interpret Results and Adjust


• Checking the quality of your clustering output is iterative and exploratory because clustering lacks “truth” that
can verify the output.
• You verify the result against expectations at the cluster-level and the example-level.
• Improving the result requires iteratively experimenting with the previous steps to see how they affect the
clustering.
1. Prepare Data
• In clustering, you calculate the similarity between two examples by
combining all the feature data for those examples into a numeric
value.
• Combining feature data requires that the data have the same scale.
• We discuss normalizing, transforming, and creating quantiles, and why
quantiles are the best default choice for transforming any data distribution.
• Having a default choice lets you transform your data without
inspecting the data's distribution.
Normalizing Data
• Transform data for multiple features to the same scale by normalizing the
data.
• Well-suited to processing the most common data distribution,
the Gaussian distribution.
• Compared to quantiles, normalization requires significantly less data to
calculate.
• Normalize data by calculating its z-score as follows:
  z = (x − μ) / σ, where μ and σ are the mean and standard deviation of the feature.
• Let's look at similarity between examples with and without normalization.
• In Figure you find that red appears to be more similar to blue than
yellow. However, the features on the x- and y-axes do not have the
same scale.
• After normalization using z-score, all the features have the same
scale.
• Now, you find that red is actually more similar to yellow. Thus, after
normalizing data, you can calculate similarity more accurately.
Use normalization when either of the following is true:
• Your data has a Gaussian distribution.
• Your data set lacks enough data to create quantiles.
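A minimal sketch of z-score normalization in NumPy; the shoe-size and price values are made up for illustration:

```python
import numpy as np

def z_score(feature):
    """Normalize one feature column to mean 0 and standard deviation 1."""
    feature = np.asarray(feature, dtype=float)
    return (feature - feature.mean()) / feature.std()

sizes  = np.array([8, 9, 10, 11, 14])      # hypothetical shoe sizes
prices = np.array([40, 55, 60, 80, 300])   # hypothetical prices on a very different scale

X = np.column_stack([z_score(sizes), z_score(prices)])
print(X.round(2))   # both features now share the same scale, ready for a distance metric
```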
Using the Log Transform
• Sometimes, a data set conforms to a power-law distribution that clumps data at the low end.
• In Figure 2, red is closer to yellow than blue.
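A short sketch of a log transform followed by scaling, assuming a hypothetical skewed feature:

```python
import numpy as np

# Hypothetical power-law-like feature (e.g. view counts) clumped at the low end
views = np.array([3, 5, 8, 12, 40, 90, 500, 20000], dtype=float)

log_views = np.log1p(views)   # log transform spreads out the clumped low end (log1p handles zeros)
scaled = (log_views - log_views.min()) / (log_views.max() - log_views.min())  # scale to [0, 1]
print(scaled.round(2))
```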
Using Quantiles
• Normalization and log transforms address specific data distributions.
What if data doesn’t conform to a Gaussian or power-law
distribution? Is there a general approach that applies to any data
distribution?

Intuitively, if two examples have only a few examples between them, then these two examples are similar
irrespective of their values.
Conversely, if two examples have many examples between them, then the two examples are less
similar. Thus, the similarity between two examples decreases as the number of examples between
them increases.
• Normalizing the data simply reproduces the data distribution because normalization is a linear transform.
• Applying a log transform doesn't reflect your intuition on how similarity works either, as shown below.

Instead, divide the data into intervals where each interval contains an equal number of examples. These
interval boundaries are called quantiles.
Convert your data into quantiles by performing the following steps:
1. Decide the number of intervals.
2. Define intervals such that each interval has an equal number of examples.
3. Replace each example by the index of the interval it falls in.
4. Bring the indexes to the same range as other feature data by scaling the index values to [0,1].
• After converting data to quantiles, the similarity between two examples is inversely proportional to the
number of examples between those two examples

• Quantiles are your best default choice to transform data.


• However, to create quantiles that are reliable indicators of the underlying data distribution, you
need a lot of data.
• As a rule of thumb, to create n quantiles, you should have at least 10n examples. If you don't have
enough data, stick to normalization.
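A rough sketch of these quantile steps using pandas; the number of intervals and the synthetic data are assumptions for illustration:

```python
import numpy as np
import pandas as pd

values = pd.Series(np.random.exponential(scale=100, size=1000))  # arbitrary skewed distribution

n_quantiles = 10   # rule of thumb above: you need at least 10 * n_quantiles examples

# qcut defines intervals holding an equal number of examples and returns the interval index
indexes = pd.qcut(values, q=n_quantiles, labels=False)

scaled = indexes / (n_quantiles - 1)   # bring the indexes into the [0, 1] range
print(scaled[:5])
```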
Let's Revise

https://developers.google.com/machine-learning/clustering/prepare-data
2. Define Similarity Measures
Similarity measures can be categorized into two types:
1. Manual similarity measure
2. Supervised similarity measure
2.a) Create a Manual Similarity Measure
• Combine all the feature data for two examples into a single numeric value.
• Consider a shoe data set with only one feature: shoe size.
• You can quantify how similar two shoes are by calculating the difference between
their sizes.
• The smaller the numerical difference between sizes, the greater the similarity
between shoes. Such a handcrafted similarity measure is called a manual similarity
measure.
• What if you wanted to find similarities between shoes by using both size
and color?
• Color is categorical data, and is harder to combine with the numerical size data.
• As the data becomes more complex, creating a manual similarity measure becomes harder.
• In that case, switch to a supervised similarity measure, where a supervised machine
learning model calculates the similarity.
Manual Similarity: Example
• Suppose the model has two features: shoe size and shoe price data.
• Since both features are numeric, you can combine them into a single number representing similarity as follows.
• Size (s): Shoe size probably forms a Gaussian distribution. Confirm this. Then normalize the data.
• Price (p): The data is probably a Poisson distribution. Confirm this. If you have enough data, convert the data
to quantiles and scale to [0,1].
• Combine the data by using root mean squared error (RMSE).
Let's calculate the similarity for two shoes with US sizes 8 and 11, and prices 120 and 150.

Action → Method
• Scale the size: assume a maximum possible shoe size of 20. Divide 8 and 11 by the maximum size 20 to get 0.4 and 0.55.
• Scale the price: divide 120 and 150 by the maximum price 150 to get 0.8 and 1.
• Find the difference in size: 0.55 − 0.4 = 0.15
• Find the difference in price: 1 − 0.8 = 0.2
• Find the RMSE: √((0.2² + 0.15²) / 2) ≈ 0.17

Intuitively, your measured similarity should increase when feature data becomes
similar. Instead, your measured similarity actually decreases. Make your measured
similarity follow your intuition by subtracting it from 1.
Similarity=1−0.17=0.83
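The same calculation as a small Python sketch. The maxima of 20 and 150 come from the example above; the printed value differs slightly from 0.83 only because the slide rounds the intermediate RMSE to 0.17:

```python
import math

MAX_SIZE, MAX_PRICE = 20, 150          # assumed maxima from the example above

def shoe_similarity(size_a, price_a, size_b, price_b):
    ds = abs(size_a / MAX_SIZE - size_b / MAX_SIZE)      # scaled size difference
    dp = abs(price_a / MAX_PRICE - price_b / MAX_PRICE)  # scaled price difference
    rmse = math.sqrt((ds**2 + dp**2) / 2)                # combine per-feature differences
    return 1 - rmse                                      # flip so that higher = more similar

print(round(shoe_similarity(8, 120, 11, 150), 2))   # ~0.82 (0.83 in the slide after rounding)
```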
What if you have categorical data?
• Categorical data can either be:
• Single valued (univalent), such as a car's color ("white" or "blue" but never both). If univalent data
matches, the similarity is 1; otherwise, it's 0.
• Multi-valued (multivalent), such as a movie's genre (can be "action" and "comedy" simultaneously, or just
"action"). Multivalent data is harder to deal with. For example, movie genres can be a challenge to work
with.
• To handle this problem, suppose movies are assigned genres from a fixed set of genres. Calculate similarity
using the ratio of common values, called Jaccard similarity.
• Examples:
• [“comedy”,”action”] and [“comedy”,”action”] = 1
• [“comedy”,”action”] and [“action”] = ½
• [“comedy”,”action”] and [“action”, "drama"] = ⅓
• [“comedy”,”action”] and [“non-fiction”,”biographical”] = 0
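A minimal sketch of Jaccard similarity reproducing the examples above; `jaccard_similarity` is an illustrative helper, not a library function:

```python
def jaccard_similarity(a, b):
    """Ratio of common values: |A ∩ B| / |A ∪ B|."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

print(jaccard_similarity(["comedy", "action"], ["comedy", "action"]))              # 1.0
print(jaccard_similarity(["comedy", "action"], ["action"]))                        # 0.5
print(jaccard_similarity(["comedy", "action"], ["action", "drama"]))               # 0.333...
print(jaccard_similarity(["comedy", "action"], ["non-fiction", "biographical"]))   # 0.0
```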

Postal code: Postal codes representing areas that are close to each other should have a higher similarity. Convert the postal codes into latitude and longitude. For a pair of postal codes, separately calculate the difference between their latitudes and their longitudes, then add the differences to get a single numeric value.
Color: Convert the textual values into numeric RGB values. Find the difference in red, green, and blue values for two colors, and combine the differences into a numeric value by using the Euclidean distance.
Manual Similarity Measure: Exercise (5 minutes)
Dataset → Feature / Type
• Price: positive integer
• Size: positive floating-point value in units of square meters
• Postal code: integer
• Number of bedrooms: integer
• Type of house: a text value from “single_family,” “multi-family,” “apartment,” “condo”
• Garage: 0/1 for no/yes
• Colors: multivalent categorical: one or more values from standard colors “white,” “yellow,” “green,” etc.
Pre-processing
• Pre-process the numerical features: price, size, number of bedrooms, and
postal code.
• For each of these features you will have to perform a different operation.
• For example, in this case, assume that pricing data follows a bimodal
distribution.
• What should you do next? Pick one:
1. Log transform and scale to [0,1].
   • This is actually the step to take when data follows a power-law distribution.
2. Normalize and scale to [0,1].
   • This is the step you would take when data follows a Gaussian distribution. Try again.
3. Create quantiles from the data and scale to [0,1].
   • This is the correct step to take when data follows a bimodal distribution. Correct answer.
• How would you process size data?
• Check whether size follows a power-law, Poisson, or Gaussian distribution.
• Power-law: Log transform and scale to [0,1].
• Poisson: Create quantiles and scale to [0,1].
• Gaussian: Normalize and scale to [0,1].
• How would you process data on the number of bedrooms?
• Check the distribution for number of bedrooms.
• Most likely, clipping outliers and scaling to [0,1] will be adequate, but if you
find a power-law distribution then a log-transform might be necessary.
• How should you represent postal codes?
• Convert postal codes to longitude and latitude. Then process those values as
you would process other numeric values.
Next :- Calculating Similarity per Feature
• For numeric features, you simply find the difference.
• For binary features, such as if a house has a garage, you can also find the difference to get 0 or 1.
• But what about categorical features?
1. Which of these features is multivalent (can have multiple values)?
- Postal code
- Type
- Color
2. Which type of similarity measure should you use for calculating the similarity for a multivalent feature?
• Jaccard similarity
• Suppose homes are assigned colors from a fixed set of colors. Then, calculate similarity using the ratio of common values (Jaccard similarity). Correct
answer.
• Euclidean distance
• For the features “postal code” and “type” that have only one value (univalent features), the per-feature difference is 0 if the values match; otherwise, it is 1.
Calculating Overall Similarity
Calculate the overall similarity between a pair of houses by combining the per-feature similarities using root mean squared error (RMSE):

similarity = √((s₁² + s₂² + … + sₙ²) / n)

where s₁, s₂, …, sₙ are the similarities for the n features.
Programming: Clustering with a Manual Similarity Measure
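The original programming exercise is not reproduced here. As one hedged sketch of the idea, a handcrafted distance can be packed into a pairwise matrix and passed to an algorithm that accepts precomputed distances; the house rows and the choice of agglomerative clustering are assumptions for illustration:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Hypothetical pre-processed houses: each row = [price, size, bedrooms], already scaled to [0, 1]
houses = np.array([
    [0.20, 0.30, 0.25],
    [0.25, 0.35, 0.25],
    [0.80, 0.90, 0.75],
    [0.85, 0.80, 1.00],
])

def manual_distance(a, b):
    """Per-feature differences combined with RMSE (i.e. 1 - similarity)."""
    return np.sqrt(np.mean((a - b) ** 2))

n = len(houses)
dist = np.array([[manual_distance(houses[i], houses[j]) for j in range(n)] for i in range(n)])

# metric="precomputed" lets the algorithm consume the handcrafted distance matrix
# (older scikit-learn versions call this parameter "affinity")
model = AgglomerativeClustering(n_clusters=2, metric="precomputed", linkage="average")
print(model.fit_predict(dist))   # e.g. [0 0 1 1]
```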


2. b) Supervised Similarity Measure
• reduce the feature data to representations called embeddings, and
then compare the embeddings.
• Embeddings are generated by training a supervised deep neural
network (DNN) on the feature data itself.
• Embeddings map the feature data to vectors in an embedding
space, which typically has fewer dimensions than the original feature data.
• The embedding vectors for similar examples, such as YouTube videos watched by
the same users, end up close together in the embedding space

Process for creating Supervised Similarity Measure


Choose DNN Based on Training Labels
• Reduce feature data to embeddings by training a DNN that uses the same feature data both as
input and as the labels.
• For example, in the case of house data, the DNN would use the features—such as price, size, and
postal code—to predict those features themselves.
• In order to use the feature data to predict the same feature data, the DNN is forced to reduce the
input feature data to embeddings and use these embeddings to calculate similarity.
• Two ways to implement the DNN:
• Autoencoders
• Predictors: if one feature is more important than the others for determining similarity, train the DNN to predict only that feature (e.g., price is the most important feature in the housing problem).
Loss Function for DNN
• To train the DNN, you need to create a loss function by following
these steps:
1. Calculate the loss for every output of the DNN. For outputs that are:
• Numeric, use mean square error (MSE).
• Univalent categorical, use log loss.
• Multivalent categorical, use softmax cross entropy loss.
2. Calculate the total loss by summing the loss for every output.

• Libraries make life easy, since all of these loss functions are directly available.
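A plain NumPy sketch of the three per-output losses and their sum; in practice you would use the equivalents provided by your deep learning library, as noted above:

```python
import numpy as np

def mse(y_true, y_pred):                       # numeric outputs
    return np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)

def log_loss(y_true, p_pred, eps=1e-12):       # univalent categorical (binary) outputs
    y, p = np.asarray(y_true), np.clip(p_pred, eps, 1 - eps)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def softmax_cross_entropy(y_onehot, logits):   # multivalent categorical outputs
    z = np.asarray(logits) - np.max(logits, axis=1, keepdims=True)
    probs = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    return -np.mean(np.sum(np.asarray(y_onehot) * np.log(probs), axis=1))

# Total loss = sum of the per-output losses
total = (mse([0.4], [0.35])
         + log_loss([1], [0.8])
         + softmax_cross_entropy([[0, 1, 0]], [[0.2, 2.0, -1.0]]))
print(total)
```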
Generating Embeddings Example
• Let's use the same data:

Feature / Type
• Price: positive integer
• Size: positive floating-point value in units of square meters
• Postal code: integer
• Number of bedrooms: integer
• Type of house: a text value from “single_family,” “multi-family,” “apartment,” “condo”
• Garage: 0/1 for no/yes
• Colors: multivalent categorical: one or more values from standard colors “white,” “yellow,” “green,” etc.

After pre-processing — Feature / Type or Distribution / Action:
• Price: Poisson distribution → quantize and scale to [0,1].
• Size: Poisson distribution → quantize and scale to [0,1].
• Postal code: categorical → convert to longitude and latitude, quantize and scale to [0,1].
• Number of bedrooms: integer → clip outliers and scale to [0,1].
• Type of house: categorical → convert to one-hot encoding.
• Garage: 0 or 1 → leave as is.
• Colors: categorical → convert to RGB values and process as numeric data.
Choose Predictor or Autoencoder??
• Case of Predictor:
• You need to choose those features as training labels for your DNN that are important in
determining similarity between your examples.
• Let's assume price is most important in determining similarity between houses.
• Choose price as the training label, and remove it from the input feature data to the DNN.
• Train the DNN by using all other features as input data. For training, the loss function is
simply the MSE between predicted and actual price.
• Case of Autoencoder
• Train an autoencoder on our dataset by following these steps:
• Ensure the hidden layers of the autoencoder are smaller than the input and output layers.
• Calculate the loss for each output.
• Create the loss function by summing the losses for each output. Ensure you weight the loss
equally for every feature. For example, because color data is processed into RGB, weight each
of the RGB outputs by 1/3rd.
• Train the DNN.
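A minimal tf.keras autoencoder sketch under these assumptions: the features are already pre-processed into a numeric matrix X, the layer sizes are illustrative, and the bottleneck layer provides the embeddings (the last lines preview the extraction step described on the next slide):

```python
import numpy as np
import tensorflow as tf

X = np.random.rand(1000, 8).astype("float32")   # stand-in for the prepared house features
d = X.shape[1]

inputs = tf.keras.Input(shape=(d,))
hidden = tf.keras.layers.Dense(16, activation="relu")(inputs)
embedding = tf.keras.layers.Dense(3, activation="relu", name="embedding")(hidden)  # bottleneck smaller than input/output
hidden2 = tf.keras.layers.Dense(16, activation="relu")(embedding)
outputs = tf.keras.layers.Dense(d)(hidden2)      # reconstruct the input features

autoencoder = tf.keras.Model(inputs, outputs)
autoencoder.compile(optimizer="adam", loss="mse")   # reconstruction loss over all outputs
autoencoder.fit(X, X, epochs=5, batch_size=32, verbose=0)

# After training, read the bottleneck layer to get the embedding vector per example
encoder = tf.keras.Model(inputs, embedding)
embeddings = encoder.predict(X, verbose=0)
print(embeddings.shape)   # (1000, 3)
```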
Extracting Embeddings from the DNN
• Train your DNN
• After training your DNN, whether predictor or autoencoder, extract
the embedding for an example from the DNN using the feature data
of the example as input, and read the outputs of the final hidden
layer.
• These outputs form the embedding vector.
• Remember, the vectors for similar houses should be closer together
than vectors for dissimilar houses.
Measuring Similarity from Embeddings
• A similarity measure takes these embeddings and returns a number
measuring their similarity.
• Remember that embeddings are simply vectors of numbers.
• To find the similarity between two embedding vectors, you have three
similarity measures to choose from: Euclidean distance, cosine, and dot product.
Choosing a Similarity Measure
• In contrast to the cosine, the dot product is proportional to the vector length.
• Important for examples that appear very frequently in the training set
• for example, popular YouTube videos) tend to have embedding vectors with large lengths.
• If you want to capture popularity, then choose dot product.
• Risk is that popular examples may skew the similarity metric.
• To balance this skew, you can raise the vector lengths to an exponent α < 1 and
calculate the dot product as |a|^α · |b|^α · cos(θ).
• To better understand how vector length changes the similarity measure, normalize
the vector lengths to 1 and notice that the three measures become proportional to
each other.
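A short NumPy sketch comparing the three measures, including the length-damped dot product with an assumed α = 0.5:

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 2.0, 4.0])

euclidean = np.linalg.norm(a - b)                         # smaller = more similar
cosine = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))  # ignores vector length
dot = a @ b                                               # grows with vector length (popularity)

alpha = 0.5   # damping exponent < 1, as described above
damped_dot = (np.linalg.norm(a) ** alpha) * (np.linalg.norm(b) ** alpha) * cosine

print(euclidean, cosine, dot, damped_dot)
```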
Programming Exercise: Clustering with a Supervised Similarity Measure
Comparison: Manual vs Supervised

Requirement: Eliminate redundant information in correlated features.
• Manual: No, you need to separately investigate correlations between features.
• Supervised: Yes, the DNN eliminates redundant information.

Requirement: Provide insight into calculated similarities.
• Manual: Yes.
• Supervised: No, embeddings cannot be deciphered.

Requirement: Suitable for small datasets with few features.
• Manual: Yes, designing a manual measure with a few features is easy.
• Supervised: No, small datasets do not provide enough training data for a DNN.

Requirement: Suitable for large datasets with many features.
• Manual: No, manually eliminating redundant information from multiple features and then combining them is very difficult.
• Supervised: Yes, the DNN automatically eliminates redundant information and combines features.
Similarity Measure Summary

Manual
• Create by: manually combining feature data.
• Use when: datasets are small and features are easily combined.
• Implication: gain insight into the results of similarity calculations, but if feature data changes, then you must update the similarity measure.

Supervised
• Create by: measuring distance between embeddings generated via a supervised DNN.
• Use when: datasets are large and features are hard to combine.
• Implication: no insight into results, but the DNN can automatically adapt to changing feature data.
Reviewing Clustering Process
Step One: Quality of Clustering
• perform a visual check that the clusters look as expected, and that
examples that you consider similar do appear in the same cluster
• Then check these commonly-used metrics as described in the
following sections:
• Cluster cardinality
• Cluster magnitude
• Performance of downstream system
• Cluster cardinality is the number of examples per cluster. Plot the cluster cardinality and investigate clusters that are major outliers (for example, cluster 5).
• Cluster magnitude is the sum of distances from all examples to the centroid of the cluster. See how the magnitude varies across the clusters, and investigate anomalies (for example, cluster 0).
• Plot cardinality vs. magnitude: clusters are anomalous when cardinality doesn't correlate with magnitude relative to the other clusters (for example, cluster 0 is anomalous).
• Performance of Downstream System
• Since clustering output is often used in downstream ML
systems, check if the downstream system’s performance
improves when your clustering process changes.
• The impact on your downstream performance provides a real-
world test for the quality of your clustering. The disadvantage is
that this check is complex to perform.
• Questions to Investigate If Problems are Found
• If you find problems, then check your data preparation and
similarity measure, asking yourself the following questions:
• Is your data scaled?
• Is your similarity measure correct?
• Is your algorithm performing semantically meaningful operations on the
data?
• Do your algorithm’s assumptions match the data?
Step Two: Performance of the Similarity
Measure
• Your clustering algorithm is only as good as your similarity measure.
• Make sure your similarity measure returns sensible results.
• The simplest check is to identify pairs of examples that are known to
be more or less similar than other pairs.
• Calculate the similarity measure for each pair of examples. Ensure
that the similarity measure for more similar examples is higher than
the similarity measure for less similar examples
Step Three: Optimum Number of Clusters
Programming various clustering techniques:-
• Affinity Propagation
• Agglomerative Clustering
• BIRCH
• DBSCAN
• K-Means
• Mini-Batch K-Means
• Mean Shift
• OPTICS
• Spectral Clustering
• Mixture of Gaussians
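As a rough sketch, all ten algorithms listed above are available in scikit-learn behind a common interface, so they can be compared in a single loop; the parameters shown are illustrative defaults, not tuned values:

```python
from sklearn import cluster, mixture
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=1)

models = {
    "KMeans": cluster.KMeans(n_clusters=3, n_init=10),
    "MiniBatchKMeans": cluster.MiniBatchKMeans(n_clusters=3, n_init=10),
    "AgglomerativeClustering": cluster.AgglomerativeClustering(n_clusters=3),
    "BIRCH": cluster.Birch(n_clusters=3),
    "DBSCAN": cluster.DBSCAN(eps=0.5),
    "OPTICS": cluster.OPTICS(min_samples=10),
    "MeanShift": cluster.MeanShift(),
    "AffinityPropagation": cluster.AffinityPropagation(random_state=1),
    "SpectralClustering": cluster.SpectralClustering(n_clusters=3, random_state=1),
    "GaussianMixture": mixture.GaussianMixture(n_components=3, random_state=1),
}

for name, model in models.items():
    labels = model.fit_predict(X)    # every estimator here supports fit_predict
    print(f"{name}: {len(set(labels))} clusters found")
```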
