
Understanding the inners of

Clustering
Dr Akashdeep,
UIET, Panjab University
Chandigarh
[email protected]
Agenda
• What is clustering?
• Different types of Clustering
• Clustering Workflow: walk through the practical process
• Implementing various clustering algorithms in Python: a total of 10
What is Clustering?
• Clustering: the process of grouping a set of objects into classes of similar
objects
• Documents within a cluster should be similar.
• Documents from different clusters should be dissimilar.
• The commonest form of unsupervised learning
• Unsupervised learning = learning from raw data, as opposed to
supervised learning, where a classification of examples is given
• Grouping unlabeled examples is called clustering.
• As the examples are un-labeled, clustering relies on unsupervised
machine learning. If the examples are labeled, then clustering
becomes classification.
• You can measure similarity between examples by combining the
examples' feature data into a metric, called a similarity measure.
• For example, you can group similar books by their authors.
• You can create a similarity measure in different scenarios.
• As the number of features increases, creating a similarity measure
becomes more complex.
Classification vs Clustering
Applications of Clustering
More Applications
Identify the natural grouping among these?

Identify the clusters in the given data set?
• How would you design an algorithm for finding the three clusters in this case?
How to group elements?
• Most obvious answer: group similar points?
• But is similarity so easy to find?
Are These Similar??
• We may need a notion of similarity that depends on representation and algorithm: Distance
Pic courtesy: https://www.slideserve.com


Measure (dis)similarity?
Edit or Transformation Distance:
• Transform one of the objects into the other and measure the effort.
What can distance measures actually measure?
• Can measure
• Symmetry
• Self-similarity
• Separation
• Triangular inequality
• …….
Intuition for distance-based models
• A set of (x, y) points
• Two classes
• Is the box red or blue? How can we do it?
  - use Bayes' rule
  - use a decision tree
  - fit a hyperplane
• Or, since the nearest points are red, use this as the basis of your algorithm
Distance measure
There are many ways to measure distance.
(Minkowski distance). If X = R^d, the Minkowski distance of order p > 0 is defined as

Dis_p(x, y) = ( Σ_j |x_j − y_j|^p )^(1/p) = ‖x − y‖_p

where ‖z‖_p = ( Σ_j |z_j|^p )^(1/p) is the p-norm (sometimes denoted the Lp norm) of the vector z. We will often refer to Dis_p simply as the p-norm.
The 2-norm refers to the familiar Euclidean distance.
How distance matters?
• A is the testing point
• For Euclidean distance, points B and C are at the same distance from A:
• it can't capture the fact that B differs from A on only one attribute, while C differs on two
• But for different values of p, things change
1-norm and 0-norm
The 1-norm denotes the Manhattan distance, also called cityblock distance:

Dis_1(x, y) = Σ_j |x_j − y_j| = ‖x − y‖_1

The 0-norm (or L0 norm) counts the number of non-zero elements in a vector. The corresponding distance counts the number of positions in which vectors x and y differ. This is not strictly a Minkowski distance; however, we can define it as

Dis_0(x, y) = Σ_j |x_j − y_j|^0 = ‖x − y‖_0

under the convention that x^0 = 0 for x = 0 and 1 otherwise.

If x and y are binary strings, this is also called the Hamming distance.
We can see the Hamming distance as the number of bits that need to be flipped to change x into y.
Use of the L0 norm
- Consider two two-element vectors holding a submitted and a stored (username, password) pair, and take the L0 norm of their difference.
- If the L0 norm is 0, the login is successful.
- If the L0 norm is 1, either the username or the password is incorrect, but not both.
- If the L0 norm is 2, both the username and the password are incorrect.
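A minimal Python sketch of this idea; the `l0_distance` helper and the concrete strings are illustrative assumptions, not part of the slides:

```python
import numpy as np

def l0_distance(x, y):
    """Count the positions in which two vectors differ (Hamming distance)."""
    x, y = np.asarray(x), np.asarray(y)
    return int(np.count_nonzero(x != y))

# Hypothetical login example: each "vector" has two components,
# the stored (username, password) and the submitted (username, password).
stored  = np.array(["alice", "s3cret"])
attempt = np.array(["alice", "wrong"])

errors = l0_distance(stored, attempt)
if errors == 0:
    print("login successful")
elif errors == 1:
    print("either the username or the password is incorrect, but not both")
else:
    print("both username and password are incorrect")
```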
Example of various distances: L1 and L2 norm
For the vector X = [3, 4] (in the city-block picture, you can't go directly):
The L1 norm is ‖X‖_1 = |3| + |4| = 7
The L2 norm is ‖X‖_2 = √(|3|² + |4|²) = √(9 + 16) = √25 = 5
As you can see in the graphic, the L2 norm is the most direct route.
L-infinity norm:
Gives the largest magnitude among each element of a vector.
Having the vector X= [-6, 4, 2], the L-infinity norm is 6.
In L-infinity norm, only the largest element has any effect.
So, for example, if your vector represents the costs of constructing several buildings, then by minimizing the L-infinity
norm we are reducing the cost of the most expensive building.
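These norms can be checked quickly with NumPy, using the same example vectors as above:

```python
import numpy as np

x = np.array([3, 4])
print(np.linalg.norm(x, ord=1))                # L1 (Manhattan): |3| + |4| = 7
print(np.linalg.norm(x, ord=2))                # L2 (Euclidean): sqrt(9 + 16) = 5
print(np.linalg.norm([-6, 4, 2], ord=np.inf))  # L-infinity: largest magnitude = 6
```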
Clustering Types?
a) Centroid-based Clustering
• Centroid-based clustering organizes
the data into non-hierarchical
clusters, in contrast to hierarchical
clustering defined below.
• k-means is the most widely-used
centroid-based clustering algorithm.
• Centroid-based algorithms are
efficient but sensitive to initial
conditions and outliers.

https://www.naftaliharris.com/blog/visualizing-k-means-clustering/
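A minimal scikit-learn sketch of centroid-based clustering, assuming synthetic blob data; the parameter choices are illustrative, not from the slides:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Toy data with three roughly spherical groups
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=42)

# n_init re-runs k-means from different random centroids, easing the
# sensitivity to initial conditions mentioned above
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)

print(kmeans.cluster_centers_)   # one centroid per cluster
print(labels[:10])               # cluster index assigned to each example
```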
b) Density-based Clustering
• Density-based clustering connects areas of
high example density into clusters.
• This allows for arbitrary-shaped distributions
as long as dense areas can be connected.
• These algorithms have difficulty with data of
varying densities and high dimensions.
• Further, by design, these algorithms do not
assign outliers to clusters.
• These algorithms do not need the number of clusters to be specified.
https://www.naftaliharris.com/blog/visualizing-dbscan-clustering/
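A short scikit-learn sketch of density-based clustering on synthetic, arbitrarily shaped data; the `eps` and `min_samples` values are assumptions for illustration:

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two interleaving half-circles: arbitrary-shaped clusters that k-means handles poorly
X, _ = make_moons(n_samples=300, noise=0.05, random_state=42)

# eps = neighbourhood radius, min_samples = points needed to form a dense region;
# note that no number of clusters is specified
db = DBSCAN(eps=0.2, min_samples=5).fit(X)

print(set(db.labels_))   # cluster ids; -1 marks outliers left unassigned
```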
c) Distribution-based Clustering
• This clustering approach assumes data is
composed of distributions, such as Gaussian
distributions.
• In the figure, the distribution-based algorithm clusters the data into three Gaussian distributions.
• As distance from the distribution's center
increases, the probability that a point belongs
to the distribution decreases.
• The bands show that decrease in probability.
• When you do not know the type of distribution
in your data, you should use a different
algorithm.
• The EM (expectation-maximization) algorithm is typically used to fit such models.
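A small scikit-learn sketch of distribution-based clustering with a Gaussian mixture fitted by EM; the data and the number of components are assumed for illustration:

```python
from sklearn.mixture import GaussianMixture
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=3, random_state=7)

# GaussianMixture is fitted with the EM algorithm mentioned above
gmm = GaussianMixture(n_components=3, covariance_type="full", random_state=7).fit(X)

labels = gmm.predict(X)        # hard assignment to the most likely Gaussian
probs = gmm.predict_proba(X)   # soft assignment: probability of each Gaussian,
                               # which decreases with distance from its centre
print(probs[:3].round(3))
```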
d)Hierarchical
• Hierarchical clustering creates a
tree of clusters. Hierarchical
clustering, not surprisingly, is well
suited to hierarchical data, such as
taxonomies.
• See Comparison of 61 Sequenced
Escherichia coli Genomes by
Oksana Lukjancenko, Trudy
Wassenaar & Dave Ussery for an
example.
• In addition, another advantage is
that any number of clusters can
be chosen by cutting the tree at
the right level.
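A brief SciPy sketch of hierarchical clustering, showing how cutting the tree at different levels yields any number of clusters; the data and the linkage choice are illustrative:

```python
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=50, centers=3, random_state=0)

# Build the full tree of clusters (Ward linkage on Euclidean distance)
Z = linkage(X, method="ward")

# "Cutting the tree at the right level": ask for any number of clusters after the fact
labels_3 = fcluster(Z, t=3, criterion="maxclust")
labels_5 = fcluster(Z, t=5, criterion="maxclust")
print(len(set(labels_3)), len(set(labels_5)))
```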
Clustering Workflow
• Prepare data.
• Create similarity metric.
• Run clustering algorithm.
• Interpret results and adjust your clustering.

Clustering Workflow
• Prepare Data
• As with any ML problem, you must normalize, scale, and transform feature data.
• In clustering, however, you must additionally ensure that the prepared data lets you accurately calculate the
similarity between examples.

• Create Similarity metric


• Before a clustering algorithm can group data, it needs to know how similar pairs of examples are.
• You quantify the similarity between examples by creating a similarity metric.
• Creating a similarity metric requires you to carefully understand your data and how to derive similarity from your
features.

• Run Clustering Algorithm


• A clustering algorithm uses the similarity metric to cluster data.

• Interpret Results and Adjust


• Checking the quality of your clustering output is iterative and exploratory because clustering lacks “truth” that
can verify the output.
• You verify the result against expectations at the cluster-level and the example-level.
• Improving the result requires iteratively experimenting with the previous steps to see how they affect the
clustering.
1. Prepare Data
• In clustering, you calculate the similarity between two examples by
combining all the feature data for those examples into a numeric
value.
• Combining feature data requires that the data have the same scale.
• We discuss normalizing, transforming, and creating quantiles, and why
quantiles are the best default choice for transforming any data distribution.
• Having a default choice lets you transform your data without
inspecting the data's distribution.
Normalizing Data
• Transform data for multiple features to the same scale by normalizing the
data.
• Well-suited to processing the most common data distribution,
the Gaussian distribution.
• Compared to quantiles, normalization requires significantly less data to
calculate.
• Normalize data by calculating its z-score as follows:
  z = (x − μ) / σ, where μ and σ are the mean and standard deviation of the feature.
• Let's look at similarity between examples with and without normalization.
• In Figure you find that red appears to be more similar to blue than
yellow. However, the features on the x- and y-axes do not have the
same scale.
• After normalization using z-score, all the features have the same
scale.
• Now, you find that red is actually more similar to yellow. Thus, after
normalizing data, you can calculate similarity more accurately.
Use normalization when either of the following is true:
• Your data has a Gaussian distribution.
• Your data set lacks enough data to create quantiles.
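A minimal sketch of z-score normalization in NumPy; the shoe-size and price values are made up for illustration:

```python
import numpy as np

def z_score(feature):
    """Normalize one feature column to mean 0 and standard deviation 1."""
    feature = np.asarray(feature, dtype=float)
    return (feature - feature.mean()) / feature.std()

sizes  = np.array([8, 9, 10, 11, 14])      # hypothetical shoe sizes
prices = np.array([40, 55, 60, 80, 300])   # hypothetical prices on a very different scale

X = np.column_stack([z_score(sizes), z_score(prices)])
print(X.round(2))   # both features now share the same scale, ready for a distance metric
```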
Using the Log Transform
• Sometimes, a data set conforms to a power-law distribution that clumps data at the low end.
• In Figure 2, red is closer to yellow than blue.
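A short sketch of a log transform followed by scaling, assuming a hypothetical skewed feature:

```python
import numpy as np

# Hypothetical power-law-like feature (e.g. view counts) clumped at the low end
views = np.array([3, 5, 8, 12, 40, 90, 500, 20000], dtype=float)

log_views = np.log1p(views)   # log transform spreads out the clumped low end (log1p handles zeros)
scaled = (log_views - log_views.min()) / (log_views.max() - log_views.min())  # scale to [0, 1]
print(scaled.round(2))
```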
Using Quantiles
• Normalization and log transforms address specific data distributions.
What if data doesn’t conform to a Gaussian or power-law
distribution? Is there a general approach that applies to any data
distribution?

Intuitively, if two examples have only a few examples between them, then these two examples are similar
irrespective of their values.
Conversely, if two examples have many examples between them, then the two examples are less
similar. Thus, the similarity between two examples decreases as the number of examples between
them increases.
• Normalizing the data simply reproduces the data distribution because normalization is a linear transform.
• Applying a log transform doesn't reflect your intuition on how similarity works either, as shown below.

Instead, divide the data into intervals where each interval contains an equal number of examples. These
interval boundaries are called quantiles.
Convert your data into quantiles by performing the following steps:
1. Decide the number of intervals.
2. Define intervals such that each interval has an equal number of examples.
3. Replace each example by the index of the interval it falls in.
4. Bring the indexes to the same range as other feature data by scaling the index values to [0,1].
• After converting data to quantiles, the similarity between two examples is inversely proportional to the
number of examples between those two examples

• Quantiles are your best default choice to transform data.


• However, to create quantiles that are reliable indicators of the underlying data distribution, you
need a lot of data.
• As a rule of thumb, to create n quantiles, you should have at least 10n examples. If you don't have
enough data, stick to normalization.
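A rough sketch of these quantile steps using pandas; the number of intervals and the synthetic data are assumptions for illustration:

```python
import numpy as np
import pandas as pd

values = pd.Series(np.random.exponential(scale=100, size=1000))  # arbitrary skewed distribution

n_quantiles = 10   # rule of thumb above: you need at least 10 * n_quantiles examples

# qcut defines intervals holding an equal number of examples and returns the interval index
indexes = pd.qcut(values, q=n_quantiles, labels=False)

scaled = indexes / (n_quantiles - 1)   # bring the indexes into the [0, 1] range
print(scaled[:5])
```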
Let's Revise

https://developers.google.com/machine-learning/clustering/prepare-data
2. Define Similarity Measures
Similarity measures can be categorized into two types:
1. Manual similarity measure
2. Supervised similarity measure
2.a) Create a Manual Similarity Measure
• Combine all the feature data for two examples into a single numeric value.
• Consider a shoe data set with only one feature: shoe size.
• You can quantify how similar two shoes are by calculating the difference between
their sizes.
• The smaller the numerical difference between sizes, the greater the similarity
between shoes. Such a handcrafted similarity measure is called a manual similarity
measure.
• What if you wanted to find similarities between shoes by using both size
and color?
• Color is categorical data, and is harder to combine with the numerical size data.
• As the data becomes more complex, creating a manual similarity measure becomes harder.
• In that case, switch to a supervised similarity measure, where a supervised machine
learning model calculates the similarity.
Manual Similarity: Example
• Suppose the model has two features: shoe size and shoe price data.
• Since both features are numeric, you can combine them into a single number representing similarity as follows.
• Size (s): Shoe size probably forms a Gaussian distribution. Confirm this. Then normalize the data.
• Price (p): The data is probably a Poisson distribution. Confirm this. If you have enough data, convert the data
to quantiles and scale to [0,1].
• Combine the data by using root mean squared error (RMSE).
Let's calculate the similarity for two shoes with US sizes 8 and 11, and prices 120 and 150.

Action → Method
• Scale the size: assume a maximum possible shoe size of 20. Divide 8 and 11 by the maximum size 20 to get 0.4 and 0.55.
• Scale the price: divide 120 and 150 by the maximum price 150 to get 0.8 and 1.
• Find the difference in size: 0.55 − 0.4 = 0.15
• Find the difference in price: 1 − 0.8 = 0.2
• Find the RMSE: √((0.2² + 0.15²) / 2) ≈ 0.17

Intuitively, your measured similarity should increase when feature data becomes
similar. Instead, your measured similarity actually decreases. Make your measured
similarity follow your intuition by subtracting it from 1.
Similarity=1−0.17=0.83
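The same calculation as a small Python sketch. The maxima of 20 and 150 come from the example above; the printed value differs slightly from 0.83 only because the slide rounds the intermediate RMSE to 0.17:

```python
import math

MAX_SIZE, MAX_PRICE = 20, 150          # assumed maxima from the example above

def shoe_similarity(size_a, price_a, size_b, price_b):
    ds = abs(size_a / MAX_SIZE - size_b / MAX_SIZE)      # scaled size difference
    dp = abs(price_a / MAX_PRICE - price_b / MAX_PRICE)  # scaled price difference
    rmse = math.sqrt((ds**2 + dp**2) / 2)                # combine per-feature differences
    return 1 - rmse                                      # flip so that higher = more similar

print(round(shoe_similarity(8, 120, 11, 150), 2))   # ~0.82 (0.83 in the slide after rounding)
```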
What if you have categorical data?
• Categorical data can either be:
• Single valued (univalent), such as a car's color ("white" or "blue" but never both). If univalent data
matches, the similarity is 1; otherwise, it's 0.
• Multi-valued (multivalent), such as a movie's genre (can be "action" and "comedy" simultaneously, or just
"action"). Multivalent data is harder to deal with. For example, movie genres can be a challenge to work
with.
• To handle this problem, suppose movies are assigned genres from a fixed set of genres. Calculate similarity
using the ratio of common values, called Jaccard similarity.
• Examples:
• [“comedy”,”action”] and [“comedy”,”action”] = 1
• [“comedy”,”action”] and [“action”] = ½
• [“comedy”,”action”] and [“action”, "drama"] = ⅓
• [“comedy”,”action”] and [“non-fiction”,”biographical”] = 0
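A minimal sketch of Jaccard similarity reproducing the examples above; `jaccard_similarity` is an illustrative helper, not a library function:

```python
def jaccard_similarity(a, b):
    """Ratio of common values: |A ∩ B| / |A ∪ B|."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

print(jaccard_similarity(["comedy", "action"], ["comedy", "action"]))              # 1.0
print(jaccard_similarity(["comedy", "action"], ["action"]))                        # 0.5
print(jaccard_similarity(["comedy", "action"], ["action", "drama"]))               # 0.333...
print(jaccard_similarity(["comedy", "action"], ["non-fiction", "biographical"]))   # 0.0
```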

Postal code: Postal codes representing areas that are close to each other should have a higher similarity. Convert the postal codes into latitude and longitude. For a pair of postal codes, separately calculate the difference between their latitudes and their longitudes, then add the differences to get a single numeric value.
Color: Convert the textual values into numeric RGB values. Find the difference in red, green, and blue values for two colors, and combine the differences into a numeric value by using the Euclidean distance.
Manual Similarity Measure: Exercise (5 minutes)
Dataset → Feature / Type
• Price: positive integer
• Size: positive floating-point value in units of square meters
• Postal code: integer
• Number of bedrooms: integer
• Type of house: a text value from “single_family,” “multi-family,” “apartment,” “condo”
• Garage: 0/1 for no/yes
• Colors: multivalent categorical: one or more values from standard colors “white,” “yellow,” “green,” etc.
Pre-processing
• Pre-process the numerical features: price, size, number of bedrooms, and
postal code.
• For each of these features you will have to perform a different operation.
• For example, in this case, assume that pricing data follows a bimodal
distribution.
• What should you do next? Pick one:
1. Log transform and scale to [0,1].
   • This is actually the step to take when data follows a power-law distribution.
2. Normalize and scale to [0,1].
   • This is the step you would take when data follows a Gaussian distribution. Try again.
3. Create quantiles from the data and scale to [0,1].
   • This is the correct step to take when data follows a bimodal distribution. Correct answer.
• How would you process size data?
• Check whether size follows a power-law, Poisson, or Gaussian distribution.
• Power-law: Log transform and scale to [0,1].
• Poisson: Create quantiles and scale to [0,1].
• Gaussian: Normalize and scale to [0,1].
• How would you process data on the number of bedrooms?
• Check the distribution for number of bedrooms.
• Most likely, clipping outliers and scaling to [0,1] will be adequate, but if you
find a power-law distribution then a log-transform might be necessary.
• How should you represent postal codes?
• Convert postal codes to longitude and latitude. Then process those values as
you would process other numeric values.
Next :- Calculating Similarity per Feature
• For numeric features, you simply find the difference.
• For binary features, such as if a house has a garage, you can also find the difference to get 0 or 1.
• But what about categorical features?
1. Which of these features is multivalent (can have multiple values)?
- Postal code
- Type
- Color
2. Which type of similarity measure should you use for calculating the similarity for a multivalent feature?
• Jaccard similarity
• Suppose homes are assigned colors from a fixed set of colors. Then, calculate similarity using the ratio of common values (Jaccard similarity). Correct
answer.
• Euclidean distance
• For the features “postal code” and “type” that have only one value (univalent features), the per-feature difference is 0 if the values match; otherwise, it is 1.
Calculating Overall Similarity
Calculate the overall similarity between a pair of houses by combining the per-feature similarities using root mean squared error (RMSE):

similarity = √((s₁² + s₂² + … + sₙ²) / n)

where s₁, s₂, …, sₙ are the similarities for the n features.
Programming: Clustering with a Manual Similarity Measure
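The original programming exercise is not reproduced here. As one hedged sketch of the idea, a handcrafted distance can be packed into a pairwise matrix and passed to an algorithm that accepts precomputed distances; the house rows and the choice of agglomerative clustering are assumptions for illustration:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Hypothetical pre-processed houses: each row = [price, size, bedrooms], already scaled to [0, 1]
houses = np.array([
    [0.20, 0.30, 0.25],
    [0.25, 0.35, 0.25],
    [0.80, 0.90, 0.75],
    [0.85, 0.80, 1.00],
])

def manual_distance(a, b):
    """Per-feature differences combined with RMSE (i.e. 1 - similarity)."""
    return np.sqrt(np.mean((a - b) ** 2))

n = len(houses)
dist = np.array([[manual_distance(houses[i], houses[j]) for j in range(n)] for i in range(n)])

# metric="precomputed" lets the algorithm consume the handcrafted distance matrix
# (older scikit-learn versions call this parameter "affinity")
model = AgglomerativeClustering(n_clusters=2, metric="precomputed", linkage="average")
print(model.fit_predict(dist))   # e.g. [0 0 1 1]
```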


2. b) Supervised Similarity Measure
• reduce the feature data to representations called embeddings, and
then compare the embeddings.
• Embeddings are generated by training a supervised deep neural
network (DNN) on the feature data itself.
• Embeddings map the feature data to vectors in an embedding
space, which typically has fewer dimensions than the original feature data.
• The embedding vectors for similar examples, such as YouTube videos watched by
the same users, end up close together in the embedding space

Process for creating Supervised Similarity Measure


Choose DNN Based on Training Labels
• Reduce feature data to embeddings by training a DNN that uses the same feature data both as
input and as the labels.
• For example, in the case of house data, the DNN would use the features—such as price, size, and
postal code—to predict those features themselves.
• In order to use the feature data to predict the same feature data, the DNN is forced to reduce the
input feature data to embeddings and use these embeddings to calculate similarity.
• Two ways to implement the DNN:
• Autoencoders
• Predictors: if one feature is more important than the others for determining similarity, train the DNN to predict only that feature (e.g., price is the most important feature in the housing problem).
Loss Function for DNN
• To train the DNN, you need to create a loss function by following
these steps:
1. Calculate the loss for every output of the DNN. For outputs that are:
• Numeric, use mean square error (MSE).
• Univalent categorical, use log loss.
• Multivalent categorical, use softmax cross entropy loss.
2. Calculate the total loss by summing the loss for every output.

• Libraries make life easy, since all of these loss functions are directly available.
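A plain NumPy sketch of the three per-output losses and their sum; in practice you would use the equivalents provided by your deep learning library, as noted above:

```python
import numpy as np

def mse(y_true, y_pred):                       # numeric outputs
    return np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)

def log_loss(y_true, p_pred, eps=1e-12):       # univalent categorical (binary) outputs
    y, p = np.asarray(y_true), np.clip(p_pred, eps, 1 - eps)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def softmax_cross_entropy(y_onehot, logits):   # multivalent categorical outputs
    z = np.asarray(logits) - np.max(logits, axis=1, keepdims=True)
    probs = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    return -np.mean(np.sum(np.asarray(y_onehot) * np.log(probs), axis=1))

# Total loss = sum of the per-output losses
total = (mse([0.4], [0.35])
         + log_loss([1], [0.8])
         + softmax_cross_entropy([[0, 1, 0]], [[0.2, 2.0, -1.0]]))
print(total)
```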
Generating Embeddings Example
• Let's use the same data:

Feature / Type
• Price: positive integer
• Size: positive floating-point value in units of square meters
• Postal code: integer
• Number of bedrooms: integer
• Type of house: a text value from “single_family,” “multi-family,” “apartment,” “condo”
• Garage: 0/1 for no/yes
• Colors: multivalent categorical: one or more values from standard colors “white,” “yellow,” “green,” etc.

After pre-processing — Feature / Type or Distribution / Action:
• Price: Poisson distribution → quantize and scale to [0,1].
• Size: Poisson distribution → quantize and scale to [0,1].
• Postal code: categorical → convert to longitude and latitude, quantize and scale to [0,1].
• Number of bedrooms: integer → clip outliers and scale to [0,1].
• Type of house: categorical → convert to one-hot encoding.
• Garage: 0 or 1 → leave as is.
• Colors: categorical → convert to RGB values and process as numeric data.
Choose Predictor or Autoencoder??
• Case of Predictor:
• You need to choose those features as training labels for your DNN that are important in
determining similarity between your examples.
• Let's assume price is most important in determining similarity between houses.
• Choose price as the training label, and remove it from the input feature data to the DNN.
• Train the DNN by using all other features as input data. For training, the loss function is
simply the MSE between predicted and actual price.
• Case of Autoencoder
• Train an autoencoder on our dataset by following these steps:
• Ensure the hidden layers of the autoencoder are smaller than the input and output layers.
• Calculate the loss for each output.
• Create the loss function by summing the losses for each output. Ensure you weight the loss
equally for every feature. For example, because color data is processed into RGB, weight each
of the RGB outputs by 1/3rd.
• Train the DNN.
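A minimal tf.keras autoencoder sketch under these assumptions: the features are already pre-processed into a numeric matrix X, the layer sizes are illustrative, and the bottleneck layer provides the embeddings (the last lines preview the extraction step described on the next slide):

```python
import numpy as np
import tensorflow as tf

X = np.random.rand(1000, 8).astype("float32")   # stand-in for the prepared house features
d = X.shape[1]

inputs = tf.keras.Input(shape=(d,))
hidden = tf.keras.layers.Dense(16, activation="relu")(inputs)
embedding = tf.keras.layers.Dense(3, activation="relu", name="embedding")(hidden)  # bottleneck smaller than input/output
hidden2 = tf.keras.layers.Dense(16, activation="relu")(embedding)
outputs = tf.keras.layers.Dense(d)(hidden2)      # reconstruct the input features

autoencoder = tf.keras.Model(inputs, outputs)
autoencoder.compile(optimizer="adam", loss="mse")   # reconstruction loss over all outputs
autoencoder.fit(X, X, epochs=5, batch_size=32, verbose=0)

# After training, read the bottleneck layer to get the embedding vector per example
encoder = tf.keras.Model(inputs, embedding)
embeddings = encoder.predict(X, verbose=0)
print(embeddings.shape)   # (1000, 3)
```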
Extracting Embeddings from the DNN
• Train your DNN
• After training your DNN, whether predictor or autoencoder, extract
the embedding for an example from the DNN using the feature data
of the example as input, and read the outputs of the final hidden
layer.
• These outputs form the embedding vector.
• Remember, the vectors for similar houses should be closer together
than vectors for dissimilar houses.
Measuring Similarity from Embeddings
• A similarity measure takes these embeddings and returns a number
measuring their similarity.
• Remember that embeddings are simply vectors of numbers.
• To find the similarity between two embedding vectors, you have three
similarity measures to choose from: Euclidean distance, cosine, and dot product.
Choosing a Similarity Measure
• In contrast to the cosine, the dot product is proportional to the vector length.
• Important for examples that appear very frequently in the training set
• for example, popular YouTube videos) tend to have embedding vectors with large lengths.
• If you want to capture popularity, then choose dot product.
• Risk is that popular examples may skew the similarity metric.
• To balance this skew, you can raise the vector lengths to an exponent α < 1 and
calculate the dot product as |a|^α · |b|^α · cos(θ).
• To better understand how vector length changes the similarity measure, normalize
the vector lengths to 1 and notice that the three measures become proportional to
each other.
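A short NumPy sketch comparing the three measures, including the length-damped dot product with an assumed α = 0.5:

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 2.0, 4.0])

euclidean = np.linalg.norm(a - b)                         # smaller = more similar
cosine = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))  # ignores vector length
dot = a @ b                                               # grows with vector length (popularity)

alpha = 0.5   # damping exponent < 1, as described above
damped_dot = (np.linalg.norm(a) ** alpha) * (np.linalg.norm(b) ** alpha) * cosine

print(euclidean, cosine, dot, damped_dot)
```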
Programming Exercise: Clustering with a Supervised Similarity Measure
Comparison: Manual vs Supervised

Requirement: Eliminate redundant information in correlated features.
• Manual: No, you need to separately investigate correlations between features.
• Supervised: Yes, the DNN eliminates redundant information.

Requirement: Provide insight into calculated similarities.
• Manual: Yes.
• Supervised: No, embeddings cannot be deciphered.

Requirement: Suitable for small datasets with few features.
• Manual: Yes, designing a manual measure with a few features is easy.
• Supervised: No, small datasets do not provide enough training data for a DNN.

Requirement: Suitable for large datasets with many features.
• Manual: No, manually eliminating redundant information from multiple features and then combining them is very difficult.
• Supervised: Yes, the DNN automatically eliminates redundant information and combines features.
Similarity Measure Summary

Manual
• Create by: manually combining feature data.
• Use when: datasets are small and features are easily combined.
• Implication: gain insight into the results of similarity calculations, but if feature data changes, then you must update the similarity measure.

Supervised
• Create by: measuring distance between embeddings generated via a supervised DNN.
• Use when: datasets are large and features are hard to combine.
• Implication: no insight into results, but the DNN can automatically adapt to changing feature data.
Reviewing Clustering Process
Step One: Quality of Clustering
• perform a visual check that the clusters look as expected, and that
examples that you consider similar do appear in the same cluster
• Then check these commonly-used metrics as described in the
following sections:
• Cluster cardinality
• Cluster magnitude
• Performance of downstream system
• Cluster cardinality is the number of examples per cluster. Plot the cluster cardinality and investigate clusters that are major outliers (for example, cluster 5).
• Cluster magnitude is the sum of distances from all examples to the centroid of the cluster. See how the magnitude varies across the clusters, and investigate anomalies (for example, cluster 0).
• Plot cardinality vs. magnitude: clusters are anomalous when cardinality doesn't correlate with magnitude relative to the other clusters (for example, cluster 0 is anomalous).
• Performance of Downstream System
• Since clustering output is often used in downstream ML
systems, check if the downstream system’s performance
improves when your clustering process changes.
• The impact on your downstream performance provides a real-
world test for the quality of your clustering. The disadvantage is
that this check is complex to perform.
• Questions to Investigate If Problems are Found
• If you find problems, then check your data preparation and
similarity measure, asking yourself the following questions:
• Is your data scaled?
• Is your similarity measure correct?
• Is your algorithm performing semantically meaningful operations on the
data?
• Do your algorithm’s assumptions match the data?
Step Two: Performance of the Similarity
Measure
• Your clustering algorithm is only as good as your similarity measure.
• Make sure your similarity measure returns sensible results.
• The simplest check is to identify pairs of examples that are known to
be more or less similar than other pairs.
• Calculate the similarity measure for each pair of examples. Ensure
that the similarity measure for more similar examples is higher than
the similarity measure for less similar examples
Step Three: Optimum Number of Clusters
Programming various clustering techniques:-
• Affinity Propagation
• Agglomerative Clustering
• BIRCH
• DBSCAN
• K-Means
• Mini-Batch K-Means
• Mean Shift
• OPTICS
• Spectral Clustering
• Mixture of Gaussians
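As a rough sketch, all ten algorithms listed above are available in scikit-learn behind a common interface, so they can be compared in a single loop; the parameters shown are illustrative defaults, not tuned values:

```python
from sklearn import cluster, mixture
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=1)

models = {
    "KMeans": cluster.KMeans(n_clusters=3, n_init=10),
    "MiniBatchKMeans": cluster.MiniBatchKMeans(n_clusters=3, n_init=10),
    "AgglomerativeClustering": cluster.AgglomerativeClustering(n_clusters=3),
    "BIRCH": cluster.Birch(n_clusters=3),
    "DBSCAN": cluster.DBSCAN(eps=0.5),
    "OPTICS": cluster.OPTICS(min_samples=10),
    "MeanShift": cluster.MeanShift(),
    "AffinityPropagation": cluster.AffinityPropagation(random_state=1),
    "SpectralClustering": cluster.SpectralClustering(n_clusters=3, random_state=1),
    "GaussianMixture": mixture.GaussianMixture(n_components=3, random_state=1),
}

for name, model in models.items():
    labels = model.fit_predict(X)    # every estimator here supports fit_predict
    print(f"{name}: {len(set(labels))} clusters found")
```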
