D3IT Clustering April 2023

The document discusses unsupervised machine learning, focusing on clustering as a key technique for identifying inherent groupings in data without labeled outputs. It covers various clustering methods, including hierarchical, partitioning, density-based, and grid-based approaches, along with their applications in business intelligence, anomaly detection, and more. Additionally, it outlines considerations for selecting appropriate clustering methods based on data attributes, scalability, and interpretability.


Clustering

Unsupervised Machine Learning


• No labels are given to the learning algorithm, leaving it on its own to find structure in its input.

Source: https://medium.com/analytics-vidhya/beginners-guide-to-unsupervised-learning-76a575c4e942
Unsupervised Machine Learning
• Unsupervised learning is where you only have input data
(X) and no corresponding output variables.
• The goal for unsupervised learning is to model the
underlying structure or distribution in the data in order to
learn more about the data.
• Algorithms are left to their own devices to discover and present the interesting structure in the data.
• Further grouped into:
• Clustering: A clustering problem is where you want to
discover the inherent groupings in the data, such as
grouping customers by purchasing behavior.
• Association: An association rule learning problem is where
you want to discover rules that describe large portions of
your data, such as people that buy A also tend to buy B.
Source: Hands-On Machine Learning with Scikit-Learn and TensorFlow by Aurélien Géron
Unsupervised Machine Learning
• Unsupervised learning has a wide range of
applications in areas such as
• image recognition,
• anomaly detection,
• natural language processing, and
• customer segmentation,
• For example, let's say we have a dataset of customer
purchases at a grocery store, including the types of
items they bought, the amount spent, and the time
of purchase. We don't have any labels or categories
for the customers, but we want to group them into
clusters based on their purchasing behavior.
Source: Hands-On Machine Learning with Scikit-Learn and TensorFlow by Aurélien Géron
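As a rough illustration of this grocery-store example, the sketch below clusters a handful of made-up customers with k-means on three assumed features (amount spent, number of items, hour of purchase); the feature choices, values, and k = 2 are all illustrative assumptions, not from the source.

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Hypothetical features per customer: [amount_spent, n_items, hour_of_purchase]
X = np.array([
    [12.5,  4, 18],
    [230.0, 41, 11],
    [15.0,  5, 19],
    [180.0, 35, 10],
    [22.0,  6, 20],
])

# Features are on very different scales, so standardize before clustering.
X_scaled = StandardScaler().fit_transform(X)

# Assume two behavioural segments; in practice k must be chosen by the analyst.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_scaled)
print(kmeans.labels_)          # cluster id per customer
print(kmeans.cluster_centers_) # segment centroids (in scaled space)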
Semi-supervised Machine Learning
• Problems where you have a large amount of input data (X) and only some of the data is labeled (Y) are called semi-supervised learning problems.
• These problems sit in between supervised and unsupervised learning.
• A good example is a photo archive where only some of the images are labeled (e.g. dog, cat, person) and the majority are unlabeled.
• This is because it can be expensive or time consuming to label data, as it may require access to domain experts, whereas unlabeled data is cheap and easy to collect and store.
• You can use unsupervised learning techniques to discover and learn the structure in the input variables. You can also use supervised learning techniques to make best-guess predictions for the unlabeled data, feed that data back into the supervised learning algorithm as training data, and use the model to make predictions on new, unseen data.

Source: Hands-On Machine Learning with Scikit-Learn and TensorFlow by Aurélien Géron
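The last bullet above describes a simple self-training (pseudo-labelling) loop. A minimal sketch, assuming a synthetic dataset, 50 labelled points, and a 0.95 confidence threshold (all illustrative, not from the source):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, random_state=0)
labeled = np.zeros(len(y), dtype=bool)
labeled[:50] = True                      # pretend only 50 points are labelled

# 1) Train on the small labelled subset.
clf = LogisticRegression(max_iter=1000).fit(X[labeled], y[labeled])

# 2) Predict pseudo-labels for the unlabelled points, keep the confident ones.
proba = clf.predict_proba(X[~labeled])
confident = proba.max(axis=1) > 0.95
pseudo_y = proba.argmax(axis=1)[confident]

# 3) Retrain on labelled + confidently pseudo-labelled data.
X_new = np.vstack([X[labeled], X[~labeled][confident]])
y_new = np.concatenate([y[labeled], pseudo_y])
clf = LogisticRegression(max_iter=1000).fit(X_new, y_new)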
Unsupervised Machine Learning
• Clustering
– k-Means
– Hierarchical Cluster Analysis (HCA)
– Expectation Maximization
• Visualization and dimensionality reduction
– Principal Component Analysis (PCA)
– Kernel PCA
– Locally-Linear Embedding (LLE)
• Association rule learning
– Apriori
Source: Hands-On Machine Learning with Scikit-Learn and TensorFlow by Aurélien Géron
Clustering
• Clustering is a machine learning technique for analyzing data and dividing it into groups of similar data.
• These groups or sets of similar data are known as
clusters.
• Cluster analysis looks at clustering algorithms
that can identify clusters automatically.
• The goal of clustering is to discover both the
dense and sparse regions in the data set.

Source: Benjamin Lam, Spring 2007, SJSU


Clustering
• Clustering is known as unsupervised learning
because the class label information is not
present.
• Clustering is a form of learning by observation, rather than learning by examples.

Source: Data Mining: Concepts and Techniques, by Jiawei Han, Micheline Kamber, and Jian Pei
Clustering
• Clustering can be considered the most important unsupervised learning technique; like every other problem of this kind, it deals with finding a structure in a collection of unlabeled data.
• Clustering is “the process of organizing objects into groups whose members are similar in some way”.
• A cluster is therefore a collection of objects which are “similar” to each other and “dissimilar” to the objects belonging to other clusters.

Source: Benjamin Lam, Spring 2007, SJSU


Clustering

Source: Benjamin Lam, Spring 2007, SJSU


Applications of Clustering
• In business intelligence, clustering can be used to organize a large number of customers into groups, where customers within a group share strongly similar characteristics.
• This facilitates the development of business strategies for enhanced customer relationship management.
• Consider a consulting company with a large number of projects. To improve project management, clustering can be applied to partition projects into categories based on similarity so that project auditing and diagnosis (to improve project delivery and outcomes) can be conducted effectively.

Source: Data Mining: Concepts and Techniques, by Jiawei Han, Micheline Kamber, and Jian Pei
Applications of Clustering
• Clustering is also called data segmentation in some
applications because clustering partitions large data
sets into groups according to their similarity.
• Clustering can also be used for outlier detection,
where outliers (values that are “far away” from any
cluster) may be more interesting than common cases.
• Example: detection of credit card fraud and the
monitoring of criminal activities in electronic
commerce. For example, exceptional cases in credit
card transactions, such as very expensive and
infrequent purchases, may be of interest as possible
fraudulent activities.

Source: Data Mining: Concepts and Techniques, by Jiawei Han, Micheline Kamber, and Jian Pei
Applications of Clustering
• Pattern recognition
• Information retrieval
• Text mining
• Web analysis
• Marketing
• Medical diagnosis

Source: Benjamin Lam, Spring 2007, SJSU


Which method should I use?
• Type of attributes in data – continuous or
categorical
• Scalability to larger dataset - Some
clustering methods may be more scalable
than others, allowing you to analyze larger
datasets or run the algorithm more
efficiently.
• Ability to work with irregular or noisy data -
Some clustering methods may be more
robust to noise and outliers in the data,
while others may be more sensitive to them.
Source: Benjamin Lam, Spring 2007, SJSU
Which method should I use?
• Number of clusters: Some methods are
better suited for identifying a specific number of
clusters, while others may be more flexible in
the number of clusters they can identify.

• Cluster shape and density: Different clustering methods may be better suited for
identifying clusters with different shapes and
densities. For example, hierarchical clustering
may be better for identifying clusters with
irregular shapes, while K-means clustering may
be better for identifying clusters with a
spherical shape.
Source: Benjamin Lam, Spring 2007, SJSU
Which method should I use?
• Time, cost, size of data and complexity
- The speed and accuracy of different
clustering algorithms may vary depending on
the size and complexity of your data. Some
clustering methods may not be suitable for
large datasets due to their computational
complexity or memory requirements.
• Interpretability and usability - Some
clustering methods may produce results that
are easier to interpret and explain than
others, which may be important depending
on the goals of your analysis.
Source: Benjamin Lam, Spring 2007, SJSU
Pseudo Code of Clustering
1. Pick an arbitrary number of groups/segments
to be created
2. Start with some initial randomly chosen center
values for groups
3. Classify instances to closest groups
4. Compute new values for the group centers
5. Repeat steps 3 and 4 till groups converge
6. If clusters are not satisfactory, go to step 1
and pick a different number of
groups/segments

Source: Data Analytics by Anil Maheshwari
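A minimal NumPy translation of the pseudo code above (steps 2 to 5); the sample points, k, and the convergence test are illustrative choices, and step 6 (trying a different k) is left to the user.

import numpy as np

def cluster(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]    # step 2: random initial centers
    for _ in range(n_iter):
        # step 3: assign each instance to its closest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # step 4: recompute each center as the mean of its members
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centers, centers):             # step 5: stop on convergence
            break
        centers = new_centers
    return labels, centers

X = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0], [8.0, 8.0], [1.0, 0.6], [9.0, 11.0]])
labels, centers = cluster(X, k=2)
print(labels, centers)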


Example (worked example shown as figures in the original slides)

Source: Data Analytics by Anil Maheshwari
Types of Clustering
• A distinction among different types of clusterings is whether the
set of clusters is nested or unnested.
• Hierarchical Clustering
– works by grouping data objects into a hierarchy or “tree” of clusters
– creates a tree-like structure of clusters by either starting with each
data point as a separate cluster (agglomerative clustering) or starting
with all data points in one cluster and recursively splitting them
(divisive clustering).
– Representing data objects in the form of a hierarchy is useful for data
summarization and visualization
– Ex: as the manager of human resources at AllElectronics, you may organize your employees into major groups such as executives, managers, and staff
– You can further partition these groups into smaller subgroups. For instance, the general group of staff can be further divided into subgroups of senior officers, officers, and trainees. All these groups form a hierarchy. We can easily summarize or characterize the data that are organized into a hierarchy, which can be used to find, say, the average salary of managers and of officers.

Source: Benjamin Lam, Spring 2007, SJSU


Types of Clustering
• Partitioning Clustering
– Partitioning algorithms divide the data set into mutually
disjoint partitions or exclusive groups or clusters
– Selection of number of clusters is the starting point for
partitioning methods.
– It works by iteratively assigning data points to the nearest
centroid and recalculating the centroids until convergence
• Formally, given a data set, D, of n objects, and k, the number
of clusters to form, a partitioning algorithm organizes the
objects into k partitions where each partition represents a
cluster
• The clusters are formed on the basis of a distance-based dissimilarity function, so that the objects within a cluster are “similar” to one another and “dissimilar” to objects in other clusters in terms of the data set attributes, e.g., k-means.
Source: Benjamin Lam, Spring 2007, SJSU
Types of Clustering
• Density-based methods: Most partitioning methods cluster
objects based on the distance between objects. Such methods
can find only spherical-shaped clusters and encounter difficulty
in discovering clusters of arbitrary shapes.
• Other clustering methods have been developed based on the
notion of density.
• This method identifies regions of high density in the data and
groups points that are within these regions. Examples of density-
based clustering algorithms include DBSCAN and OPTICS.
• Idea is to continue growing a given cluster as long as the density
(number of objects or data points) in the “neighborhood”
exceeds some threshold.
• For example, for each data point within a given cluster, the
neighborhood of a given radius has to contain at least a
minimum number of points.

Source: Benjamin Lam, Spring 2007, SJSU
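A brief sketch of density-based clustering with scikit-learn's DBSCAN on two crescent-shaped (non-spherical) clusters; the eps and min_samples values are illustrative density-threshold choices.

import numpy as np
from sklearn.datasets import make_moons
from sklearn.cluster import DBSCAN

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)   # two non-spherical clusters

# A point needs at least min_samples neighbours within radius eps to seed a dense region.
db = DBSCAN(eps=0.2, min_samples=5).fit(X)
print(set(db.labels_))   # cluster ids; -1 marks noise points outside any dense region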


Types of Clustering
• Grid-based methods: Grid-based methods
quantize the object space into a finite number
of cells that form a grid structure.
• All the clustering operations are performed on
the grid structure (i.e., on the quantized space).
• The main advantage of this approach is its fast
processing time, which is typically independent
of the number of data objects and dependent
only on the number of cells in each dimension
in the quantized space.
Source: Benjamin Lam, Spring 2007, SJSU
Types of Clustering
• Model-based methods: This method assumes
that the data is generated from a mixture of
probability distributions and seeks to estimate
the parameters of these distributions to identify
clusters.
• Examples of model-based clustering algorithms
include Gaussian Mixture Models (GMM) and
Latent Dirichlet Allocation.

Source: Benjamin Lam, Spring 2007, SJSU


Hierarchical Clustering
• Hierarchical clustering algorithms repeat the cycle of either merging smaller clusters into larger ones or dividing larger clusters into smaller ones
• Creates hierarchical decomposition of the database
• Decomposition of clusters is represented by a
dendrogram
• Two types of Hierarchical clustering
– Agglomerative
– Divisive
• In either agglomerative or divisive hierarchical clustering, a user can specify the desired number of clusters as a termination condition.

Source: Benjamin Lam, Spring 2007, SJSU
Hierarchical Clustering

Source: Benjamin Lam, Spring 2007, SJSU


Hierarchical Clustering
• Agglomerative clustering
– bottom-up approach of merging clusters into larger ones
• Divisive clustering
– top-down approach of splitting clusters into smaller ones
• Typically, a greedy approach is used in deciding which clusters to merge or divide
• Euclidean distance, Manhattan distance, and cosine similarity are some of the most commonly used similarity metrics for numeric data
• For non-numeric data, metrics such as the Hamming distance are used

Source: Benjamin Lam, Spring 2007, SJSU
Hierarchical Clustering
• AGNES (AGglomerative NESting),
• Step 1: Create a cluster for each data object, i.e., creation of n clusters for n objects
• Step 2: Merge the two closest clusters, i.e., n-1 clusters
• Step 3: Repeat step 2 until there is only one cluster

Source: Data Mining: Concepts and Techniques, by Jiawei Han, Micheline Kamber, and Jian Pei
Hierarchical Clustering
• AGNES (AGglomerative NESting),
• Ex.: a data set of five objects, a, b, c, d, e.
• Initially, AGNES, the agglomerative method, places each object into a cluster of its own.
• The clusters are then merged step-by-step according to some criterion; for example, clusters C1 and C2 may be merged if an object in C1 and an object in C2 form the minimum Euclidean distance between any two objects from different clusters.
• Euclidean distance between two points: d = sqrt((x2 - x1)^2 + (y2 - y1)^2)
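As a rough sketch of this AGNES-style merging, scikit-learn's agglomerative implementation with single linkage merges the two clusters whose closest members are at minimum Euclidean distance; the five 2-D points standing in for a, b, c, d, e are assumed coordinates.

import numpy as np
from sklearn.cluster import AgglomerativeClustering

X = np.array([[1, 1], [1.5, 1], [5, 5], [5.5, 5], [10, 10]])   # a, b, c, d, e (assumed coordinates)

# linkage='single' merges the two clusters with the minimum distance between
# their closest members, matching the criterion described above.
agnes = AgglomerativeClustering(n_clusters=2, linkage='single').fit(X)
print(agnes.labels_)   # cluster id for each of the five objects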
Hierarchical Clustering
• Distance between two clusters (linkage measures), see the sketch below:
– Closest points (single linkage)
– Farthest points (complete linkage)
– Average distance (average linkage)
– Distance between centroids (centroid linkage)
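A minimal sketch computing the four linkage measures listed above for two small clusters A and B (the points are arbitrary illustrative data).

import numpy as np

A = np.array([[1.0, 1.0], [2.0, 1.5]])
B = np.array([[6.0, 5.0], [7.0, 6.5], [6.5, 5.5]])

pairwise = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)   # all |a - b| distances

single   = pairwise.min()                                          # closest points
complete = pairwise.max()                                          # farthest points
average  = pairwise.mean()                                         # average distance
centroid = np.linalg.norm(A.mean(axis=0) - B.mean(axis=0))         # distance between centroids
print(single, complete, average, centroid)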
Hierarchical Clustering
• DIANA (DIvisive ANAlysis), the divisive method, proceeds
in the contrasting way. All the objects are used to form
one initial cluster.
• The cluster is split according to some principle such as the
maximum Euclidean distance between the closest
neighboring objects in the cluster
• A challenge with divisive methods is how to partition a
large cluster into several smaller ones.
• When n is large, it is computationally prohibitive to
examine all possibilities.
• Consequently, a divisive method typically uses heuristics
in partitioning, which can lead to inaccurate results
Source: Data Mining: Concepts and Techniques, by Jiawei Han, Micheline Kamber, and Jian Pei
Hierarchical Clustering

Source: Data Mining: Concepts and Techniques, by Jiawei Han, Micheline Kamber, and Jian Pei


Hierarchical Clustering
• tree structure called a dendrogram is commonly used to
represent the process of hierarchical clustering.
• It shows how objects are grouped together (in an
agglomerative method) or partitioned (in a divisive
method) step-by-step.

Source: Data Mining: Concepts and Techniques, by Jiawei Han, Micheline Kamber, and Jian Pei
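A short sketch of building and plotting a dendrogram with SciPy; the six points and the 'average' linkage choice are illustrative assumptions.

import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
import matplotlib.pyplot as plt

X = np.array([[1, 2], [1.5, 1.8], [5, 8], [8, 8], [1, 0.6], [9, 11]])

Z = linkage(X, method='average')   # bottom-up (agglomerative) merge sequence
dendrogram(Z)                      # tree showing the step-by-step merges
plt.xlabel('object index')
plt.ylabel('merge distance')
plt.show()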
Hierarchical Clustering

Source: Data Mining: Concepts and Techniques, by Jiawei Han, Micheline Kamber, and Jian Pei


Dendrogram (figure slides)
Partitioning clustering
• generate various partitions and then evaluate
them by some criterion
• also referred to as nonhierarchical as each instance
is placed in exactly one of k mutually exclusive
clusters
• Because only one set of clusters is the output of a
typical partitioning clustering algorithm, the user is
required to input the desired number of clusters
(usually called k)
• most commonly used partitioning clustering
algorithms is the k-means clustering algorithm
• User is required to provide the number of clusters
(k) before starting
Source: Benjamin Lam, Spring 2007, SJSU
Partitioning clustering
• User is required to provide the number of clusters
(k) before starting and the algorithm first initiates
the centers (or centroids) of the k partitions
• In a nutshell, k-means clustering algorithm then
assigns members based on the current centers and
re-estimates centers based on the current
members
• These two steps are repeated until a certain intra-
cluster similarity objective function and inter-
cluster dissimilarity objective function are
optimized
• Therefore, sensible initialization of centers is a very important factor in obtaining quality results from partitional clustering algorithms

Source: Benjamin Lam, Spring 2007, SJSU
Difference between Hierarchical and
Partitioning Clustering?
• key differences in running time, assumptions, input parameters
and resultant clusters.
• Typically, partitioning clustering is faster than hierarchical
clustering
• Hierarchical clustering requires only a similarity measure, while
partitioning clustering requires stronger assumptions such as
number of clusters and the initial centers
• Hierarchical clustering does not require any input parameters,
while partitional clustering algorithms require the number of
clusters to start running
• Hierarchical clustering returns a much more meaningful and
subjective division of clusters but partitional clustering results
in exactly k clusters.
Source: Benjamin Lam, Spring 2007, SJSU
Exercise
Data about the height and weight of a few customers is available. Create a set of clusters for the available data to decide how many sizes of T-shirt should be ordered.
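One possible way to attempt this exercise, assuming k = 3 T-shirt sizes (S, M, L); the height/weight values below are invented sample data.

import numpy as np
from sklearn.cluster import KMeans

# columns: height (cm), weight (kg)
customers = np.array([
    [158, 58], [160, 59], [163, 61],
    [165, 63], [168, 66], [170, 68],
    [173, 72], [175, 75], [180, 80],
])

km = KMeans(n_clusters=3, n_init=10, random_state=42).fit(customers)
print(km.labels_)            # size group per customer
print(km.cluster_centers_)   # representative height/weight per size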
K-Means Clustering
• Simplest and most commonly used method
• Is a geometrical model
• Tries to find cluster centers that are representative
of certain regions of the data
• The k-means clustering algorithm alternates
between two steps:
– assigning each data point to its closest cluster center
– setting each cluster center as the mean of the data
points that are assigned to it
• The algorithm finishes when the assignment of
instances to clusters no longer changes
K-Means Clustering
• k-Means clustering algorithm proposed by
J. Hartigan and M. A. Wong [1979].
• Given a set of n distinct objects, the k-Means
clustering algorithm partitions the objects into k
number of clusters such that intracluster similarity
is high but the intercluster similarity is low.
• In this algorithm, the user has to specify k, the number of clusters. The objects are assumed to have numeric attributes, so any distance metric can be used to demarcate the clusters.
K-Means Clustering
• Choose the number of clusters (k)
• Select centroids (at random), equal in number to the number of clusters (k)
• Assign each data point to its closest centroid; this will create k clusters
• Compute the new centroid of each cluster based on the data points within the cluster
• Reassign each data point to the new centroids; if any reassignment occurs, go to the previous step, otherwise quit the algorithm
• The assignment and update procedure is repeated until it reaches some stopping criterion (such as number of iterations, centroids remaining unchanged, or no reassignment)
K-Means Clustering
Step-01:
Choose the number of clusters K.
Step-02:
Randomly select any K data points as cluster centers.
Select cluster centers in such a way that they are as far as possible from each other.
Step-03:
Calculate the Euclidean distance between each data point and each cluster center.
Step-04:
Assign each data point to some cluster.
A data point is assigned to that cluster whose center is nearest to that data point.
Step-05:
Re-compute the center of newly formed clusters.
The center of a cluster is computed by taking mean of all the data points contained in
that cluster.
Step-06:
Keep repeating the procedure from Step-03 to Step-05 until any of the following
stopping criteria is met-
-Centers of newly formed clusters do not change
-Data points remain in the same cluster
-Maximum number of iterations is reached
K-Means Clustering
Input: D is a dataset containing n objects, k is the number of cluster
Output: A set of k clusters
Steps:
1. Randomly choose k objects from D as the initial cluster centroids.
2. For each of the objects in D do
• Compute distance between the current objects and k cluster centroids
• Assign the current object to that cluster to which it is closest.

3. Compute the “cluster centers” of each cluster. These become the new cluster centroids.
4. Repeat step 2-3 until the convergence criterion is satisfied
5. Stop
Source: Dr. Debasis Samanta, IIT Kharagpur
K-Means Clustering
k-Means clustering
A1    A2
6.8   12.6
0.8    9.8
1.2   11.6
2.8    9.6
3.8    9.9
4.4    6.5
4.8    1.1
6.0   19.9
6.2   18.5
7.6   17.4
7.8   12.2
6.6    7.7
8.2    4.5
8.4    6.9
9.0    3.4
9.6   11.1

Suppose k = 3. Three objects are chosen at random (shown as circled in the scatter plot of A2 vs A1).

Source: Dr. Debasis Samanta, IIT Kharagpur


k-Means clustering
Initial Centroids chosen randomly
Centroid Objects
A1 A2
c1 3.8 9.9

c2 7.8 12.2

c3 6.2 18.5

• Let us consider the Euclidean distance measure (L2 Norm) as the distance
measurement in our illustration.
• Let d1, d2 and d3 denote the distance from an object to c1, c2 and c3
respectively.
• Assignment of each object to the respective centroid is shown in the right-most column, and the clustering so obtained is shown in the figure.
k-Means clustering
A1 A2 d1 d2 d3 cluster
6.8 12.6 4.0 1.1 5.9 2
0.8 9.8 3.0 7.4 10.2 1
1.2 11.6 3.1 6.6 8.5 1
2.8 9.6 1.0 5.6 9.5 1
3.8 9.9 0.0 4.6 8.9 1
4.4 6.5 3.5 6.6 12.1 1
4.8 1.1 8.9 11.5 17.5 1
6.0 19.9 10.2 7.9 1.4 3
6.2 18.5 8.9 6.5 0.0 3
7.6 17.4 8.4 5.2 1.8 3
7.8 12.2 4.6 0.0 6.5 2
6.6 7.7 3.6 4.7 10.8 1
8.2 4.5 7.0 7.7 14.1 1
8.4 6.9 5.5 5.3 11.8 2
9.0 3.4 8.3 8.9 15.4 1
9.6 11.1 5.9 2.1 8.1 2

Source: Dr. Debasis Samanta, IIT Kharagpur
k-Means clustering
The calculation of the new centroids of the three clusters, using the mean of the attribute values of A1 and A2, is shown in the table below.

Calculation of new centroids

New Centroid   A1    A2
c1 4.6 7.1
c2 8.2 10.7
c3 6.6 18.6

Source: Dr. Debasis Samanta, IIT Kharagpur
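The first iteration above can be reproduced with a few lines of NumPy; the sketch below uses the same 16 objects and the same three initial centroids and should recover the new centroids shown in the table.

import numpy as np

X = np.array([
    [6.8, 12.6], [0.8,  9.8], [1.2, 11.6], [2.8,  9.6],
    [3.8,  9.9], [4.4,  6.5], [4.8,  1.1], [6.0, 19.9],
    [6.2, 18.5], [7.6, 17.4], [7.8, 12.2], [6.6,  7.7],
    [8.2,  4.5], [8.4,  6.9], [9.0,  3.4], [9.6, 11.1],
])
centroids = np.array([[3.8, 9.9], [7.8, 12.2], [6.2, 18.5]])   # c1, c2, c3

# Euclidean distance of every object to every centroid, then assign to the nearest.
d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
cluster = d.argmin(axis=1)          # 0-based: 0 -> c1, 1 -> c2, 2 -> c3

# Recompute centroids as the mean of the members of each cluster;
# this should match c1 = (4.6, 7.1), c2 = (8.2, 10.7), c3 = (6.6, 18.6) above.
new_centroids = np.array([X[cluster == j].mean(axis=0) for j in range(3)])
print(np.round(new_centroids, 1))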


Illustration of k-Means clustering
algorithms
We next reassign the 16 objects to three clusters by determining which centroid
is closest to each one.
Note that point p moves from cluster C2 to cluster C1.

Source: Dr. Debasis Samanta, IIT Kharagpur


Illustration of k-Means clustering
algorithms
• The newly obtained centroids after second iteration are given in the table
below. Note that the centroid c3 remains unchanged, while c2 and c1 changed a little.
• With respect to newly obtained cluster centres, 16 points are reassigned again.
These are the same clusters as before. Hence, their centroids also remain
unchanged.
• Considering this as the termination criteria, the k-means algorithm stops here.
Hence, the final cluster.

Cluster centres after second iteration

Centroid   A1    A2
c1 5.0 7.1
c2 8.1 12.0
c3 6.6 18.6

Source: Dr. Debasis Samanta, IIT Kharagpur


Example 2
• Cluster the following eight points (with (x, y)
representing locations) into three clusters:
• A1(2, 10), A2(2, 5), A3(8, 4), A4(5, 8), A5(7, 5), A6(6, 4),
A7(1, 2), A8(4, 9)

• Initial cluster centers are: A1(2, 10), A4(5, 8) and A7(1, 2).
• The distance function between two points a = (x1, y1)
and b = (x2, y2) is defined as-
• Ρ(a, b) = |x2 – x1| + |y2 – y1|

• Use the K-Means algorithm to find the three cluster centers after the second iteration.
Example 2
• Iteration-01:

• We calculate the distance of each point from each of the centers of the three clusters.
• The distance is calculated by using the given distance function.
• The following illustration shows the calculation of the distance between point A1(2, 10) and each of the centers of the three clusters-

• Calculating Distance Between A1(2, 10) and C1(2, 10)-

• Ρ(A1, C1)
• = |x2 – x1| + |y2 – y1|
• = |2 – 2| + |10 – 10|
• =0
Example 2
• Calculating Distance Between A1(2, 10) and C2(5, 8)-

• Ρ(A1, C2)
• = |x2 – x1| + |y2 – y1|
• = |5 – 2| + |8 – 10|
• =3+2
• =5

• Calculating Distance Between A1(2, 10) and C3(1, 2)-

• Ρ(A1, C3)
• = |x2 – x1| + |y2 – y1|
• = |1 – 2| + |2 – 10|
• =1+8
• =9
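A compact sketch of Example 2: it runs two iterations of k-means with the Manhattan distance from the given initial centers and prints the resulting cluster centers. The data and initial centers are exactly those stated above.

import numpy as np

points = np.array([[2, 10], [2, 5], [8, 4], [5, 8], [7, 5], [6, 4], [1, 2], [4, 9]], dtype=float)
centers = np.array([[2, 10], [5, 8], [1, 2]], dtype=float)      # A1, A4, A7

for _ in range(2):                                              # two iterations
    # rho(a, b) = |x2 - x1| + |y2 - y1|
    d = np.abs(points[:, None, :] - centers[None, :, :]).sum(axis=2)
    labels = d.argmin(axis=1)
    centers = np.array([points[labels == j].mean(axis=0) for j in range(3)])

print(labels)
print(centers)    # cluster centers after the second iteration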
K-Modes Clustering
• K-Means is one of the most commonly used clustering methods, but it does not perform well on categorical data or features
• For example categorical input variables such as
“designation of the employee” or “branch of a student”
• It creates clusters based on the number of matching
categories (while K-Means works on the basis of some
distance measures such as “Euclidean Distance”
between the data points)
• K-Modes attempts to minimize a dissimilarity measure
K-Modes Clustering
• The changes to the k-Means clustering are –
• using a simple matching dissimilarity measure for
categorical objects,
• replacing means of clusters by modes, and
• using a frequency-based method to update the
modes.
– Let X = {x11, x12, …, xnm} be a data set consisting of n objects with m attributes each. The main objective of the k-modes clustering algorithm is to group the data objects X into K clusters by minimizing a cost function.
K-Modes Clustering
• Input: Data objects X, number of clusters K.
• Step 1: Randomly select K initial modes from the data objects, Cj, j = 1, 2, …, K
• Step 2: Find the matching dissimilarity between each of the K initial cluster modes and each data object
• Step 3: Evaluate the fitness
• Step 4: Find the minimum mode value for each data object, i.e., find the initial cluster mode nearest to each object.
• Step 5: Assign the data objects to the nearest cluster mode.
• Step 6: Update the modes by applying the frequency-based method to the newly formed clusters.
• Step 7: Recalculate the dissimilarity between the data objects and the updated modes.
• Step 8: Repeat steps 4 and 5 until there are no changes in the cluster membership of the data objects.
• Output: Clustered data objects
K-Modes Clustering
import numpy as np
from kmodes.kmodes import KModes

# Random categorical data: 100 objects, 10 attributes, category labels 0-19
data = np.random.choice(20, (100, 10))
print(data)

# Cluster into 4 groups using Huang initialization, keeping the best of 5 runs
km = KModes(n_clusters=4, init='Huang', n_init=5, verbose=1)
clusters = km.fit_predict(data)

# Print the cluster centroids (the mode of each attribute per cluster)
print(km.cluster_centroids_)
Gaussian Mixture Models (GMMs)
• distribution-based model
• Gaussian Mixture Models (GMMs) assume that
there are a certain number of Gaussian
distributions, and each of these distributions
represent a cluster. Hence, a Gaussian Mixture
Model tends to group the data points belonging to
a single distribution together.
• Gaussian Mixture Models are probabilistic
models and use the soft clustering approach for
distributing the points in different clusters
Hard vs Soft Clustering
𝐾-Means vs GMM
• 𝐾-means algorithm performs a hard assignment of
data points to clusters, in which each data point is
associated uniquely with one cluster,
• GMM algorithm makes a soft assignment based
on posterior probabilities based on EM Algorithm.
• 𝐾-means is based only on Euclidean distances,
• whereas a classic GMM uses Mahalanobis distances, which can deal with non-spherical distributions.
• The Mahalanobis distance is unitless and scale-invariant, and takes into account the correlations of the data set.
Gaussian Mixture Models (GMMs)
• Example: three Gaussian distributions– GD1,
GD2, and GD3.
• each having a certain mean (μ1, μ2, μ3) and variance (σ1, σ2, σ3) value, respectively.
• For a given set of data points, our GMM would
identify the probability of each data point
belonging to each of these distributions.
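A minimal sketch of soft clustering with scikit-learn's GaussianMixture; the blob data and the choice of three components are illustrative assumptions.

import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

gmm = GaussianMixture(n_components=3, covariance_type='full', random_state=0).fit(X)

print(gmm.means_)                    # estimated mean of each Gaussian (mu1, mu2, mu3)
print(gmm.covariances_.shape)        # one full covariance matrix per component
print(gmm.predict_proba(X[:5]))      # soft assignment: P(cluster | point) for 5 points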
Gaussian Mixture Models (GMMs)

Image Source: WikiPedia
