D3IT Clustering April 2023

The document discusses unsupervised machine learning, focusing on clustering as a key technique for identifying inherent groupings in data without labeled outputs. It covers various clustering methods, including hierarchical, partitioning, density-based, and grid-based approaches, along with their applications in business intelligence, anomaly detection, and more. Additionally, it outlines considerations for selecting appropriate clustering methods based on data attributes, scalability, and interpretability.


Clustering

Unsupervised Machine Learning


• No labels are given to the learning algorithm, leaving it on its own to find structure in its input.

Source: https://medium.com/analytics-vidhya/beginners-guide-to-unsupervised-learning-76a575c4e942
Unsupervised Machine Learning
• Unsupervised learning is where you only have input data
(X) and no corresponding output variables.
• The goal for unsupervised learning is to model the
underlying structure or distribution in the data in order to
learn more about the data.
• Algorithms are left to their own devices to discover and present the interesting structure in the data.
• Further grouped into:
• Clustering: A clustering problem is where you want to
discover the inherent groupings in the data, such as
grouping customers by purchasing behavior.
• Association: An association rule learning problem is where
you want to discover rules that describe large portions of
your data, such as people that buy A also tend to buy B.
Source: Hands-On Machine Learning with Scikit-Learn and TensorFlow by Aurélien Géron
Unsupervised Machine Learning
• Unsupervised learning has a wide range of
applications in areas such as
• image recognition,
• anomaly detection,
• natural language processing, and
• customer segmentation,
• For example, let's say we have a dataset of customer
purchases at a grocery store, including the types of
items they bought, the amount spent, and the time
of purchase. We don't have any labels or categories
for the customers, but we want to group them into
clusters based on their purchasing behavior.
Source: Hands-On Machine Learning with Scikit-Learn and TensorFlow by Aurélien Géron
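As a rough illustration of this grocery-store example, the sketch below clusters a handful of made-up customers with k-means on three assumed features (amount spent, number of items, hour of purchase); the feature choices, values, and k = 2 are all illustrative assumptions, not from the source.

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Hypothetical features per customer: [amount_spent, n_items, hour_of_purchase]
X = np.array([
    [12.5,  4, 18],
    [230.0, 41, 11],
    [15.0,  5, 19],
    [180.0, 35, 10],
    [22.0,  6, 20],
])

# Features are on very different scales, so standardize before clustering.
X_scaled = StandardScaler().fit_transform(X)

# Assume two behavioural segments; in practice k must be chosen by the analyst.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_scaled)
print(kmeans.labels_)          # cluster id per customer
print(kmeans.cluster_centers_) # segment centroids (in scaled space)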
Semi-supervised Machine Learning
• Problems where you have a large amount of input data (X) and only some of the data is labeled (Y) are called semi-supervised learning problems.
• These problems sit in between supervised and unsupervised learning.
• A good example is a photo archive where only some of the images are labeled (e.g. dog, cat, person) and the majority are unlabeled.
• This is because it can be expensive or time consuming to label data, as it may require access to domain experts, whereas unlabeled data is cheap and easy to collect and store.
• You can use unsupervised learning techniques to discover and learn the structure in the input variables. You can also use supervised learning techniques to make best-guess predictions for the unlabeled data, feed that data back into the supervised learning algorithm as training data, and use the model to make predictions on new, unseen data.

Source: Hands-On Machine Learning with Scikit-Learn and TensorFlow by Aurélien Géron
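The last bullet above describes a simple self-training (pseudo-labelling) loop. A minimal sketch, assuming a synthetic dataset, 50 labelled points, and a 0.95 confidence threshold (all illustrative, not from the source):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, random_state=0)
labeled = np.zeros(len(y), dtype=bool)
labeled[:50] = True                      # pretend only 50 points are labelled

# 1) Train on the small labelled subset.
clf = LogisticRegression(max_iter=1000).fit(X[labeled], y[labeled])

# 2) Predict pseudo-labels for the unlabelled points, keep the confident ones.
proba = clf.predict_proba(X[~labeled])
confident = proba.max(axis=1) > 0.95
pseudo_y = proba.argmax(axis=1)[confident]

# 3) Retrain on labelled + confidently pseudo-labelled data.
X_new = np.vstack([X[labeled], X[~labeled][confident]])
y_new = np.concatenate([y[labeled], pseudo_y])
clf = LogisticRegression(max_iter=1000).fit(X_new, y_new)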
Unsupervised Machine Learning
• Clustering
– k-Means
– Hierarchical Cluster Analysis (HCA)
– Expectation Maximization
• Visualization and dimensionality reduction
– Principal Component Analysis (PCA)
– Kernel PCA
– Locally-Linear Embedding (LLE)
• Association rule learning
– Apriori
Source: Hands-On Machine Learning with Scikit-Learn and TensorFlow by Aurélien Géron
Clustering
• Clustering is a machine learning technique for analyzing data and dividing it into groups of similar data.
• These groups or sets of similar data are known as
clusters.
• Cluster analysis looks at clustering algorithms
that can identify clusters automatically.
• The goal of clustering is to discover both the
dense and sparse regions in the data set.

Source: Benjamin Lam, Spring 2007, SJSU


Clustering
• Clustering is known as unsupervised learning
because the class label information is not
present.
• Clustering is a form of learning by observation, rather than learning by examples.

Source: Data Mining: Concepts and Techniques, by Jiawei Han, Micheline Kamber, and Jian Pei
Clustering
• Clustering can be considered the most important unsupervised learning technique; like every other problem of this kind, it deals with finding a structure in a collection of unlabeled data.
• Clustering is “the process of organizing objects into groups whose members are similar in some way”.
• A cluster is therefore a collection of objects which are “similar” to each other and “dissimilar” to the objects belonging to other clusters.

Source: Benjamin Lam, Spring 2007, SJSU


Clustering

Source: Benjamin Lam, Spring 2007, SJSU


Applications of Clustering
• In business intelligence, clustering can be used to organize a large number of customers into groups, where customers within a group share strongly similar characteristics.
• This facilitates the development of business strategies for enhanced customer relationship management.
• Consider a consulting company with a large number of projects. To improve project management, clustering can be applied to partition projects into categories based on similarity so that project auditing and diagnosis (to improve project delivery and outcomes) can be conducted effectively.

Source: Data Mining: Concepts and Techniques, by Jiawei Han, Micheline Kamber, and Jian Pei
Applications of Clustering
• Clustering is also called data segmentation in some
applications because clustering partitions large data
sets into groups according to their similarity.
• Clustering can also be used for outlier detection,
where outliers (values that are “far away” from any
cluster) may be more interesting than common cases.
• Example: detection of credit card fraud and the
monitoring of criminal activities in electronic
commerce. For example, exceptional cases in credit
card transactions, such as very expensive and
infrequent purchases, may be of interest as possible
fraudulent activities.

Source: Data Mining: Concepts and Techniques, by Jiawei Han, Micheline Kamber, and Jian Pei
Applications of Clustering
• Pattern recognition
• Information retrieval
• Text mining
• Web analysis
• Marketing
• Medical diagnosis

Source: Benjamin Lam, Spring 2007, SJSU


Which method should I use?
• Type of attributes in data – continuous or
categorical
• Scalability to larger dataset - Some
clustering methods may be more scalable
than others, allowing you to analyze larger
datasets or run the algorithm more
efficiently.
• Ability to work with irregular or noisy data -
Some clustering methods may be more
robust to noise and outliers in the data,
while others may be more sensitive to them.
Source: Benjamin Lam, Spring 2007, SJSU
Which method should I use?
• Number of clusters: Some methods are
better suited for identifying a specific number of
clusters, while others may be more flexible in
the number of clusters they can identify.

• Cluster shape and density: Different clustering methods may be better suited for
identifying clusters with different shapes and
densities. For example, hierarchical clustering
may be better for identifying clusters with
irregular shapes, while K-means clustering may
be better for identifying clusters with a
spherical shape.
Source: Benjamin Lam, Spring 2007, SJSU
Which method should I use?
• Time, cost, size of data and complexity
- The speed and accuracy of different
clustering algorithms may vary depending on
the size and complexity of your data. Some
clustering methods may not be suitable for
large datasets due to their computational
complexity or memory requirements.
• Interpretability and usability - Some
clustering methods may produce results that
are easier to interpret and explain than
others, which may be important depending
on the goals of your analysis.
Source: Benjamin Lam, Spring 2007, SJSU
Pseudo Code of Clustering
1. Pick an arbitrary number of groups/segments
to be created
2. Start with some initial randomly chosen center
values for groups
3. Classify instances to closest groups
4. Compute new values for the group centers
5. Repeat steps 3 and 4 till groups converge
6. If clusters are not satisfactory, go to step 1
and pick a different number of
groups/segments

Source: Data Analytics by Anil Maheshwari
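A minimal NumPy translation of the pseudo code above (steps 2 to 5); the sample points, k, and the convergence test are illustrative choices, and step 6 (trying a different k) is left to the user.

import numpy as np

def cluster(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]    # step 2: random initial centers
    for _ in range(n_iter):
        # step 3: assign each instance to its closest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # step 4: recompute each center as the mean of its members
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centers, centers):             # step 5: stop on convergence
            break
        centers = new_centers
    return labels, centers

X = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0], [8.0, 8.0], [1.0, 0.6], [9.0, 11.0]])
labels, centers = cluster(X, k=2)
print(labels, centers)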


Example (worked example shown as figures in the original slides)

Source: Data Analytics by Anil Maheshwari
Types of Clustering
• A distinction among different types of clusterings is whether the
set of clusters is nested or unnested.
• Hierarchical Clustering
– works by grouping data objects into a hierarchy or “tree” of clusters
– creates a tree-like structure of clusters by either starting with each
data point as a separate cluster (agglomerative clustering) or starting
with all data points in one cluster and recursively splitting them
(divisive clustering).
– Representing data objects in the form of a hierarchy is useful for data
summarization and visualization
– Ex: as the manager of human resources at AllElectronics, you may organize your employees into major groups such as executives, managers, and staff
– You can further partition these groups into smaller subgroups. For instance, the general group of staff can be further divided into subgroups of senior officers, officers, and trainees. All these groups form a hierarchy. We can easily summarize or characterize the data that are organized into a hierarchy, which can be used to find, say, the average salary of managers and of officers.

Source: Benjamin Lam, Spring 2007, SJSU


Types of Clustering
• Partitioning Clustering
– Partitioning algorithms divide the data set into mutually
disjoint partitions or exclusive groups or clusters
– Selection of number of clusters is the starting point for
partitioning methods.
– It works by iteratively assigning data points to the nearest
centroid and recalculating the centroids until convergence
• Formally, given a data set, D, of n objects, and k, the number
of clusters to form, a partitioning algorithm organizes the
objects into k partitions where each partition represents a
cluster
• The clusters are formed on the basis of a distance-based dissimilarity function, so that the objects within a cluster are “similar” to one another and “dissimilar” to objects in other clusters in terms of the data set attributes, e.g., k-means.
Source: Benjamin Lam, Spring 2007, SJSU
Types of Clustering
• Density-based methods: Most partitioning methods cluster
objects based on the distance between objects. Such methods
can find only spherical-shaped clusters and encounter difficulty
in discovering clusters of arbitrary shapes.
• Other clustering methods have been developed based on the
notion of density.
• This method identifies regions of high density in the data and
groups points that are within these regions. Examples of density-
based clustering algorithms include DBSCAN and OPTICS.
• Idea is to continue growing a given cluster as long as the density
(number of objects or data points) in the “neighborhood”
exceeds some threshold.
• For example, for each data point within a given cluster, the
neighborhood of a given radius has to contain at least a
minimum number of points.

Source: Benjamin Lam, Spring 2007, SJSU
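A brief sketch of density-based clustering with scikit-learn's DBSCAN on two crescent-shaped (non-spherical) clusters; the eps and min_samples values are illustrative density-threshold choices.

import numpy as np
from sklearn.datasets import make_moons
from sklearn.cluster import DBSCAN

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)   # two non-spherical clusters

# A point needs at least min_samples neighbours within radius eps to seed a dense region.
db = DBSCAN(eps=0.2, min_samples=5).fit(X)
print(set(db.labels_))   # cluster ids; -1 marks noise points outside any dense region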


Types of Clustering
• Grid-based methods: Grid-based methods
quantize the object space into a finite number
of cells that form a grid structure.
• All the clustering operations are performed on
the grid structure (i.e., on the quantized space).
• The main advantage of this approach is its fast
processing time, which is typically independent
of the number of data objects and dependent
only on the number of cells in each dimension
in the quantized space.
Source: Benjamin Lam, Spring 2007, SJSU
Types of Clustering
• Model-based methods: This method assumes
that the data is generated from a mixture of
probability distributions and seeks to estimate
the parameters of these distributions to identify
clusters.
• Examples of model-based clustering algorithms
include Gaussian Mixture Models (GMM) and
Latent Dirichlet Allocation.

Source: Benjamin Lam, Spring 2007, SJSU


Hierarchical Clustering
• Hierarchical clustering algorithms repeat the cycle of either merging smaller clusters into larger ones or dividing larger clusters into smaller ones
• Creates hierarchical decomposition of the database
• Decomposition of clusters is represented by a
dendrogram
• Two types of Hierarchical clustering
– Agglomerative
– Divisive
• In either agglomerative or divisive hierarchical clustering, a user can specify the desired number of clusters as a termination condition.

Source: Benjamin Lam, Spring 2007, SJSU
Hierarchical Clustering

Source: Benjamin Lam, Spring 2007, SJSU


Hierarchical Clustering
• Agglomerative clustering
– bottom-up approach of merging clusters into larger ones
• Divisive clustering
– top-down approach of splitting clusters into smaller ones
• Typically, a greedy approach is used in deciding which clusters to merge or divide
• Euclidean distance, Manhattan distance, and cosine similarity are some of the most commonly used similarity metrics for numeric data
• For non-numeric data, metrics such as the Hamming distance are used

Source: Benjamin Lam, Spring 2007, SJSU
Hierarchical Clustering
• AGNES (AGglomerative NESting),
• Step 1: Create a cluster for each data object, i.e., creation of n clusters for n objects
• Step 2: Merge the two closest clusters, i.e., n-1 clusters
• Step 3: Repeat step 2 until there is only one cluster

Source: Data Mining: Concepts and Techniques, by Jiawei Han, Micheline Kamber, and Jian Pei
Hierarchical Clustering
• AGNES (AGglomerative NESting),
• Ex.: a data set of five objects, a, b, c, d, e.
• Initially, AGNES, the agglomerative method, places each object into a cluster of its own.
• The clusters are then merged step-by-step according to some criterion; for example, clusters C1 and C2 may be merged if an object in C1 and an object in C2 form the minimum Euclidean distance between any two objects from different clusters.
• Euclidean distance between two points: d = sqrt((x2 - x1)^2 + (y2 - y1)^2)
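As a rough sketch of this AGNES-style merging, scikit-learn's agglomerative implementation with single linkage merges the two clusters whose closest members are at minimum Euclidean distance; the five 2-D points standing in for a, b, c, d, e are assumed coordinates.

import numpy as np
from sklearn.cluster import AgglomerativeClustering

X = np.array([[1, 1], [1.5, 1], [5, 5], [5.5, 5], [10, 10]])   # a, b, c, d, e (assumed coordinates)

# linkage='single' merges the two clusters with the minimum distance between
# their closest members, matching the criterion described above.
agnes = AgglomerativeClustering(n_clusters=2, linkage='single').fit(X)
print(agnes.labels_)   # cluster id for each of the five objects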
Hierarchical Clustering
• Distance between two clusters (linkage measures), see the sketch below:
– Closest points (single linkage)
– Farthest points (complete linkage)
– Average distance (average linkage)
– Distance between centroids (centroid linkage)
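A minimal sketch computing the four linkage measures listed above for two small clusters A and B (the points are arbitrary illustrative data).

import numpy as np

A = np.array([[1.0, 1.0], [2.0, 1.5]])
B = np.array([[6.0, 5.0], [7.0, 6.5], [6.5, 5.5]])

pairwise = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)   # all |a - b| distances

single   = pairwise.min()                                          # closest points
complete = pairwise.max()                                          # farthest points
average  = pairwise.mean()                                         # average distance
centroid = np.linalg.norm(A.mean(axis=0) - B.mean(axis=0))         # distance between centroids
print(single, complete, average, centroid)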
Hierarchical Clustering
• DIANA (DIvisive ANAlysis), the divisive method, proceeds
in the contrasting way. All the objects are used to form
one initial cluster.
• The cluster is split according to some principle such as the
maximum Euclidean distance between the closest
neighboring objects in the cluster
• A challenge with divisive methods is how to partition a
large cluster into several smaller ones.
• When n is large, it is computationally prohibitive to
examine all possibilities.
• Consequently, a divisive method typically uses heuristics
in partitioning, which can lead to inaccurate results
Source: Data Mining: Concepts and Techniques, by Jiawei Han, Micheline Kamber, and Jian Pei
Hierarchical Clustering

Source: Data Mining: Concepts and Techniques, by Jiawei Han, Micheline Kamber, and Jian Pei


Hierarchical Clustering
• tree structure called a dendrogram is commonly used to
represent the process of hierarchical clustering.
• It shows how objects are grouped together (in an
agglomerative method) or partitioned (in a divisive
method) step-by-step.

Source: Data Mining: Concepts and Techniques, by Jiawei Han, Micheline Kamber, and Jian Pei
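A short sketch of building and plotting a dendrogram with SciPy; the six points and the 'average' linkage choice are illustrative assumptions.

import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
import matplotlib.pyplot as plt

X = np.array([[1, 2], [1.5, 1.8], [5, 8], [8, 8], [1, 0.6], [9, 11]])

Z = linkage(X, method='average')   # bottom-up (agglomerative) merge sequence
dendrogram(Z)                      # tree showing the step-by-step merges
plt.xlabel('object index')
plt.ylabel('merge distance')
plt.show()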
Hierarchical Clustering

Source: Data Mining: Concepts and Techniques, by Jiawei Han, Micheline Kamber, and Jian Pei


Dendrogram (figure slides)
Partitioning clustering
• generate various partitions and then evaluate
them by some criterion
• also referred to as nonhierarchical as each instance
is placed in exactly one of k mutually exclusive
clusters
• Because only one set of clusters is the output of a
typical partitioning clustering algorithm, the user is
required to input the desired number of clusters
(usually called k)
• most commonly used partitioning clustering
algorithms is the k-means clustering algorithm
• User is required to provide the number of clusters
(k) before starting
Source: Benjamin Lam, Spring 2007, SJSU
Partitioning clustering
• User is required to provide the number of clusters
(k) before starting and the algorithm first initiates
the centers (or centroids) of the k partitions
• In a nutshell, k-means clustering algorithm then
assigns members based on the current centers and
re-estimates centers based on the current
members
• These two steps are repeated until a certain intra-
cluster similarity objective function and inter-
cluster dissimilarity objective function are
optimized
• Therefore, sensible initialization of centers is a very important factor in obtaining quality results from partitional clustering algorithms

Source: Benjamin Lam, Spring 2007, SJSU
Difference between Hierarchical and
Partitioning Clustering?
• key differences in running time, assumptions, input parameters
and resultant clusters.
• Typically, partitioning clustering is faster than hierarchical
clustering
• Hierarchical clustering requires only a similarity measure, while
partitioning clustering requires stronger assumptions such as
number of clusters and the initial centers
• Hierarchical clustering does not require any input parameters,
while partitional clustering algorithms require the number of
clusters to start running
• Hierarchical clustering returns a much more meaningful and
subjective division of clusters but partitional clustering results
in exactly k clusters.
Source: Benjamin Lam, Spring 2007, SJSU
Exercise
Data about the height and weight of a few customers is available. Create a set of clusters for the available data to decide how many sizes of T-shirt should be ordered.
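One possible way to attempt this exercise, assuming k = 3 T-shirt sizes (S, M, L); the height/weight values below are invented sample data.

import numpy as np
from sklearn.cluster import KMeans

# columns: height (cm), weight (kg)
customers = np.array([
    [158, 58], [160, 59], [163, 61],
    [165, 63], [168, 66], [170, 68],
    [173, 72], [175, 75], [180, 80],
])

km = KMeans(n_clusters=3, n_init=10, random_state=42).fit(customers)
print(km.labels_)            # size group per customer
print(km.cluster_centers_)   # representative height/weight per size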
K-Means Clustering
• Simplest and most commonly used method
• Is a geometrical model
• Tries to find cluster centers that are representative
of certain regions of the data
• The k-means clustering algorithm alternates
between two steps:
– assigning each data point to its closest cluster center
– setting each cluster center as the mean of the data
points that are assigned to it
• The algorithm finishes when the assignment of
instances to clusters no longer changes
K-Means Clustering
• k-Means clustering algorithm proposed by
J. Hartigan and M. A. Wong [1979].
• Given a set of n distinct objects, the k-Means
clustering algorithm partitions the objects into k
number of clusters such that intracluster similarity
is high but the intercluster similarity is low.
• In this algorithm, the user has to specify k, the number of clusters. The objects are assumed to have numeric attributes, so any distance metric can be used to demarcate the clusters.
K-Means Clustering
• Choose the number of clusters (k)
• Select centroids (at random), equal in number to the number of clusters (k)
• Assign each data point to its closest centroid; this will create k clusters
• Compute the new centroid of each cluster based on the data points within the cluster
• Reassign each data point to the new centroids; if any reassignment occurs, go to the previous step, otherwise quit the algorithm
• The assignment and update procedure is repeated until it reaches some stopping criterion (such as number of iterations, centroids remaining unchanged, or no reassignment)
K-Means Clustering
Step-01:
Choose the number of clusters K.
Step-02:
Randomly select any K data points as cluster centers.
Select cluster centers in such a way that they are as far as possible from each other.
Step-03:
Calculate the Euclidean distance between each data point and each cluster center.
Step-04:
Assign each data point to some cluster.
A data point is assigned to that cluster whose center is nearest to that data point.
Step-05:
Re-compute the center of newly formed clusters.
The center of a cluster is computed by taking mean of all the data points contained in
that cluster.
Step-06:
Keep repeating the procedure from Step-03 to Step-05 until any of the following
stopping criteria is met-
-Centers of newly formed clusters do not change
-Data points remain in the same cluster
-Maximum number of iterations is reached
K-Means Clustering
Input: D is a dataset containing n objects, k is the number of cluster
Output: A set of k clusters
Steps:
1. Randomly choose k objects from D as the initial cluster centroids.
2. For each of the objects in D do
• Compute distance between the current objects and k cluster centroids
• Assign the current object to that cluster to which it is closest.

3. Compute the “cluster centers” of each cluster. These become the new cluster centroids.
4. Repeat step 2-3 until the convergence criterion is satisfied
5. Stop
Source: Dr. Debasis Samanta, IIT Kharagpur
K-Means Clustering
k-Means clustering
A1    A2
6.8   12.6
0.8    9.8
1.2   11.6
2.8    9.6
3.8    9.9
4.4    6.5
4.8    1.1
6.0   19.9
6.2   18.5
7.6   17.4
7.8   12.2
6.6    7.7
8.2    4.5
8.4    6.9
9.0    3.4
9.6   11.1

Suppose k = 3. Three objects are chosen at random (shown as circled in the scatter plot of A2 vs A1).

Source: Dr. Debasis Samanta, IIT Kharagpur


k-Means clustering
Initial Centroids chosen randomly
Centroid Objects
A1 A2
c1 3.8 9.9

c2 7.8 12.2

c3 6.2 18.5

• Let us consider the Euclidean distance measure (L2 Norm) as the distance
measurement in our illustration.
• Let d1, d2 and d3 denote the distance from an object to c1, c2 and c3
respectively.
• Assignment of each object to the respective centroid is shown in the right-most column, and the clustering so obtained is shown in the figure.
k-Means clustering
A1 A2 d1 d2 d3 cluster
6.8 12.6 4.0 1.1 5.9 2
0.8 9.8 3.0 7.4 10.2 1
1.2 11.6 3.1 6.6 8.5 1
2.8 9.6 1.0 5.6 9.5 1
3.8 9.9 0.0 4.6 8.9 1
4.4 6.5 3.5 6.6 12.1 1
4.8 1.1 8.9 11.5 17.5 1
6.0 19.9 10.2 7.9 1.4 3
6.2 18.5 8.9 6.5 0.0 3
7.6 17.4 8.4 5.2 1.8 3
7.8 12.2 4.6 0.0 6.5 2
6.6 7.7 3.6 4.7 10.8 1
8.2 4.5 7.0 7.7 14.1 1
8.4 6.9 5.5 5.3 11.8 2
9.0 3.4 8.3 8.9 15.4 1
9.6 11.1 5.9 2.1 8.1 2

Source: Dr. Debasis Samanta, IIT Kharagpur
k-Means clustering
The calculation of the new centroids of the three clusters, using the mean of the attribute values of A1 and A2, is shown in the table below.

Calculation of new centroids

New Centroid   A1    A2
c1 4.6 7.1
c2 8.2 10.7
c3 6.6 18.6

Source: Dr. Debasis Samanta, IIT Kharagpur
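The first iteration above can be reproduced with a few lines of NumPy; the sketch below uses the same 16 objects and the same three initial centroids and should recover the new centroids shown in the table.

import numpy as np

X = np.array([
    [6.8, 12.6], [0.8,  9.8], [1.2, 11.6], [2.8,  9.6],
    [3.8,  9.9], [4.4,  6.5], [4.8,  1.1], [6.0, 19.9],
    [6.2, 18.5], [7.6, 17.4], [7.8, 12.2], [6.6,  7.7],
    [8.2,  4.5], [8.4,  6.9], [9.0,  3.4], [9.6, 11.1],
])
centroids = np.array([[3.8, 9.9], [7.8, 12.2], [6.2, 18.5]])   # c1, c2, c3

# Euclidean distance of every object to every centroid, then assign to the nearest.
d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
cluster = d.argmin(axis=1)          # 0-based: 0 -> c1, 1 -> c2, 2 -> c3

# Recompute centroids as the mean of the members of each cluster;
# this should match c1 = (4.6, 7.1), c2 = (8.2, 10.7), c3 = (6.6, 18.6) above.
new_centroids = np.array([X[cluster == j].mean(axis=0) for j in range(3)])
print(np.round(new_centroids, 1))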


Illustration of k-Means clustering
algorithms
We next reassign the 16 objects to three clusters by determining which centroid
is closest to each one.
Note that point p moves from cluster C2 to cluster C1.

Source: Dr. Debasis Samanta, IIT Kharagpur


Illustration of k-Means clustering
algorithms
• The newly obtained centroids after second iteration are given in the table
below. Note that the centroid c3 remains unchanged, while c2 and c1 changed a little.
• With respect to newly obtained cluster centres, 16 points are reassigned again.
These are the same clusters as before. Hence, their centroids also remain
unchanged.
• Considering this as the termination criteria, the k-means algorithm stops here.
Hence, the final cluster.

Cluster centres after second iteration

Centroid   A1    A2
c1 5.0 7.1
c2 8.1 12.0
c3 6.6 18.6

Source: Dr. Debasis Samanta, IIT Kharagpur


Example 2
• Cluster the following eight points (with (x, y)
representing locations) into three clusters:
• A1(2, 10), A2(2, 5), A3(8, 4), A4(5, 8), A5(7, 5), A6(6, 4),
A7(1, 2), A8(4, 9)

• Initial cluster centers are: A1(2, 10), A4(5, 8) and A7(1, 2).
• The distance function between two points a = (x1, y1)
and b = (x2, y2) is defined as-
• Ρ(a, b) = |x2 – x1| + |y2 – y1|

• Use the K-Means algorithm to find the three cluster centers after the second iteration.
Example 2
• Iteration-01:

• We calculate the distance of each point from each of the centers of the three clusters.
• The distance is calculated by using the given distance function.
• The following illustration shows the calculation of the distance between point A1(2, 10) and each of the centers of the three clusters-

• Calculating Distance Between A1(2, 10) and C1(2, 10)-

• Ρ(A1, C1)
• = |x2 – x1| + |y2 – y1|
• = |2 – 2| + |10 – 10|
• =0
Example 2
• Calculating Distance Between A1(2, 10) and C2(5, 8)-

• Ρ(A1, C2)
• = |x2 – x1| + |y2 – y1|
• = |5 – 2| + |8 – 10|
• =3+2
• =5

• Calculating Distance Between A1(2, 10) and C3(1, 2)-

• Ρ(A1, C3)
• = |x2 – x1| + |y2 – y1|
• = |1 – 2| + |2 – 10|
• =1+8
• =9
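A compact sketch of Example 2: it runs two iterations of k-means with the Manhattan distance from the given initial centers and prints the resulting cluster centers. The data and initial centers are exactly those stated above.

import numpy as np

points = np.array([[2, 10], [2, 5], [8, 4], [5, 8], [7, 5], [6, 4], [1, 2], [4, 9]], dtype=float)
centers = np.array([[2, 10], [5, 8], [1, 2]], dtype=float)      # A1, A4, A7

for _ in range(2):                                              # two iterations
    # rho(a, b) = |x2 - x1| + |y2 - y1|
    d = np.abs(points[:, None, :] - centers[None, :, :]).sum(axis=2)
    labels = d.argmin(axis=1)
    centers = np.array([points[labels == j].mean(axis=0) for j in range(3)])

print(labels)
print(centers)    # cluster centers after the second iteration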
K-Modes Clustering
• K-Means is one of the most commonly used clustering methods, but it does not perform well on categorical data or features
• For example categorical input variables such as
“designation of the employee” or “branch of a student”
• It creates clusters based on the number of matching
categories (while K-Means works on the basis of some
distance measures such as “Euclidean Distance”
between the data points)
• K-Modes attempts to minimize a dissimilarity measure
K-Modes Clustering
• The changes to the k-Means clustering are –
• using a simple matching dissimilarity measure for
categorical objects,
• replacing means of clusters by modes, and
• using a frequency-based method to update the
modes.
– Let X = {x11, x12, …, xnm} be a data set consisting of n objects with m attributes each. The main objective of the k-modes clustering algorithm is to group the data objects X into K clusters by minimizing a cost function.
K-Modes Clustering
• Input: Data objects X, number of clusters K.
• Step 1: Randomly select K initial modes from the data objects, Cj, j = 1, 2, …, K
• Step 2: Find the matching dissimilarity between each of the K initial cluster modes and each data object
• Step 3: Evaluate the fitness
• Step 4: Find the minimum mode value for each data object, i.e., find the initial cluster mode nearest to each object.
• Step 5: Assign the data objects to the nearest cluster mode.
• Step 6: Update the modes by applying the frequency-based method to the newly formed clusters.
• Step 7: Recalculate the dissimilarity between the data objects and the updated modes.
• Step 8: Repeat steps 4 and 5 until there are no changes in the cluster membership of the data objects.
• Output: Clustered data objects
K-Modes Clustering
import numpy as np
from kmodes.kmodes import KModes

# Random categorical data: 100 objects, 10 attributes, category labels 0-19
data = np.random.choice(20, (100, 10))
print(data)

# Cluster into 4 groups using Huang initialization, keeping the best of 5 runs
km = KModes(n_clusters=4, init='Huang', n_init=5, verbose=1)
clusters = km.fit_predict(data)

# Print the cluster centroids (the mode of each attribute per cluster)
print(km.cluster_centroids_)
Gaussian Mixture Models (GMMs)
• distribution-based model
• Gaussian Mixture Models (GMMs) assume that
there are a certain number of Gaussian
distributions, and each of these distributions
represent a cluster. Hence, a Gaussian Mixture
Model tends to group the data points belonging to
a single distribution together.
• Gaussian Mixture Models are probabilistic
models and use the soft clustering approach for
distributing the points in different clusters
Hard vs Soft Clustering
𝐾-Means vs GMM
• 𝐾-means algorithm performs a hard assignment of
data points to clusters, in which each data point is
associated uniquely with one cluster,
• GMM algorithm makes a soft assignment based
on posterior probabilities based on EM Algorithm.
• 𝐾-means is based only on Euclidean distances,
• whereas a classic GMM uses Mahalanobis distances, which can deal with non-spherical distributions.
• The Mahalanobis distance is unitless and scale-invariant, and takes into account the correlations of the data set.
Gaussian Mixture Models (GMMs)
• Example: three Gaussian distributions– GD1,
GD2, and GD3.
• each having a certain mean (μ1, μ2, μ3) and variance (σ1, σ2, σ3) value, respectively.
• For a given set of data points, our GMM would
identify the probability of each data point
belonging to each of these distributions.
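A minimal sketch of soft clustering with scikit-learn's GaussianMixture; the blob data and the choice of three components are illustrative assumptions.

import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

gmm = GaussianMixture(n_components=3, covariance_type='full', random_state=0).fit(X)

print(gmm.means_)                    # estimated mean of each Gaussian (mu1, mu2, mu3)
print(gmm.covariances_.shape)        # one full covariance matrix per component
print(gmm.predict_proba(X[:5]))      # soft assignment: P(cluster | point) for 5 points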
Gaussian Mixture Models (GMMs)

Image Source: WikiPedia
