D3IT Clustering April 2023
Source: https://medium.com/analytics-vidhya/beginners-guide-to-unsupervised-learning-76a575c4e942
Unsupervised Machine Learning
• Unsupervised learning is where you only have input data
(X) and no corresponding output variables.
• The goal of unsupervised learning is to model the
underlying structure or distribution in the data in order to
learn more about the data.
• Algorithms are left to their own devices to discover and
present the interesting structure in the data.
• Further grouped into:
• Clustering: A clustering problem is where you want to
discover the inherent groupings in the data, such as
grouping customers by purchasing behavior.
• Association: An association rule learning problem is where
you want to discover rules that describe large portions of
your data, such as people that buy A also tend to buy B.
Hands-On Machine Learning with Scikit-
Learn and TensorFlow by Aurélien Géron
Unsupervised Machine Learning
• Unsupervised learning has a wide range of
applications in areas such as
• image recognition,
• anomaly detection,
• natural language processing, and
• customer segmentation.
• For example, let's say we have a dataset of customer
purchases at a grocery store, including the types of
items they bought, the amount spent, and the time
of purchase. We don't have any labels or categories
for the customers, but we want to group them into
clusters based on their purchasing behavior.
Semi-supervised Machine Learning
• Problems where you have a large amount of input data (X) and only
some of the data is labeled (Y ) are called semi-supervised learning
problems.
• These problems sit in between supervised and unsupervised
learning.
• A good example is a photo archive where only some of the images
are labeled (e.g. dog, cat, person) and the majority are unlabeled.
• Labeling data can be expensive or time consuming, as it
may require access to domain experts,
• whereas unlabeled data is cheap and easy to collect and store.
• You can use unsupervised learning techniques to discover and learn
the structure in the input variables. You can also use supervised
learning techniques to make best guess predictions for the
unlabeled data, feed that data back into the supervised learning
algorithm as training data and use the model to make predictions
on new unseen data.
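The self-training loop described above can be sketched with a minimal numpy example, assuming a toy 1-nearest-neighbour classifier as the supervised learner (the function name and the sample data are illustrative, not from the source):

```python
import numpy as np

def self_train_1nn(X_labeled, y_labeled, X_unlabeled):
    """Pseudo-label each unlabeled point with the label of its nearest
    labeled neighbour, then return the enlarged training set."""
    X_l, y_l = np.asarray(X_labeled, float), np.asarray(y_labeled)
    X_u = np.asarray(X_unlabeled, float)
    # Euclidean distance from every unlabeled point to every labeled point
    d = np.linalg.norm(X_u[:, None, :] - X_l[None, :, :], axis=2)
    y_u = y_l[d.argmin(axis=1)]  # best-guess labels for the unlabeled data
    return np.vstack([X_l, X_u]), np.concatenate([y_l, y_u])

# Two labeled photos ("cat", "dog") and two unlabeled ones
X_train, y_train = self_train_1nn(
    [[0.0, 0.0], [10.0, 10.0]], ["cat", "dog"],
    [[0.5, 0.2], [9.0, 9.5]])
```

The enlarged `(X_train, y_train)` set can then be fed back into any supervised learner, as the slide describes.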
Unsupervised Machine Learning
• Clustering
– k-Means
– Hierarchical Cluster Analysis (HCA)
– Expectation Maximization
• Visualization and dimensionality reduction
– Principal Component Analysis (PCA)
– Kernel PCA
– Locally-Linear Embedding (LLE)
• Association rule learning
– Apriori
Clustering
• Clustering is a machine learning technique for
analyzing data and dividing it into groups of similar
data.
• These groups or sets of similar data are known as
clusters.
• Cluster analysis looks at clustering algorithms
that can identify clusters automatically.
• The goal of clustering is to discover both the
dense and sparse regions in the data set.
k-Means clustering
[Scatter plot of the 16 data objects: attribute A2 (vertical) against A1 (horizontal)]
Suppose k = 3. Three objects are chosen at random as the
initial centroids, shown circled in the plot:
Initial
centroid A1 A2
c1 3.8 9.9
c2 7.8 12.2
c3 6.2 18.5
• Let us consider the Euclidean distance measure (L2 Norm) as the distance
measurement in our illustration.
• Let d1, d2 and d3 denote the distance from an object to c1, c2 and c3
respectively.
• The assignment of each object to its nearest centroid is shown in the right-
most column of the table, and the resulting clustering is shown in the figure.
k-Means clustering
A1 A2 d1 d2 d3 cluster
6.8 12.6 4.0 1.1 5.9 2
0.8 9.8 3.0 7.4 10.2 1
1.2 11.6 3.1 6.6 8.5 1
2.8 9.6 1.0 5.6 9.5 1
3.8 9.9 0.0 4.6 8.9 1
4.4 6.5 3.5 6.6 12.1 1
4.8 1.1 8.9 11.5 17.5 1
6.0 19.9 10.2 7.9 1.4 3
6.2 18.5 8.9 6.5 0.0 3
7.6 17.4 8.4 5.2 1.8 3
7.8 12.2 4.6 0.0 6.5 2
6.6 7.7 3.6 4.7 10.8 1
8.2 4.5 7.0 7.7 14.1 1
8.4 6.9 5.5 5.3 11.8 2
9.0 3.4 8.3 8.9 15.4 1
9.6 11.1 5.9 2.1 8.1 2
Source: Dr. Debasis Samanta, IIT Kharagpur
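The first-iteration table above can be reproduced with a short numpy sketch (variable names are illustrative; the data points and initial centroids are taken from the table, with c1 = (3.8, 9.9), c2 = (7.8, 12.2), c3 = (6.2, 18.5)):

```python
import numpy as np

# The 16 objects with attributes A1, A2 (from the table above)
X = np.array([
    [6.8, 12.6], [0.8, 9.8], [1.2, 11.6], [2.8, 9.6],
    [3.8, 9.9],  [4.4, 6.5], [4.8, 1.1],  [6.0, 19.9],
    [6.2, 18.5], [7.6, 17.4], [7.8, 12.2], [6.6, 7.7],
    [8.2, 4.5],  [8.4, 6.9],  [9.0, 3.4],  [9.6, 11.1]])

# Initial centroids c1, c2, c3 (the three randomly chosen objects)
C = np.array([[3.8, 9.9], [7.8, 12.2], [6.2, 18.5]])

# Euclidean (L2) distance from every object to every centroid
d = np.linalg.norm(X[:, None, :] - C[None, :, :], axis=2)

# Assign each object to its nearest centroid (1-based, as in the table)
cluster = d.argmin(axis=1) + 1
```

The columns of `d` match d1, d2 and d3 in the table, and `cluster` matches the right-most column.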
k-Means clustering
The calculation of the new centroids of the three clusters, using the mean of the
attribute values A1 and A2, is shown in the table below.
New centroid A1 A2
c1 4.6 7.1
c2 8.2 10.7
c3 6.6 18.6
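The centroid-update step can be sketched as follows, reusing the first-iteration cluster assignments from the previous table (variable names are illustrative):

```python
import numpy as np

# The 16 objects with attributes A1, A2 (from the table above)
X = np.array([
    [6.8, 12.6], [0.8, 9.8], [1.2, 11.6], [2.8, 9.6],
    [3.8, 9.9],  [4.4, 6.5], [4.8, 1.1],  [6.0, 19.9],
    [6.2, 18.5], [7.6, 17.4], [7.8, 12.2], [6.6, 7.7],
    [8.2, 4.5],  [8.4, 6.9],  [9.0, 3.4],  [9.6, 11.1]])

# Cluster assigned to each object after the first iteration
cluster = np.array([2, 1, 1, 1, 1, 1, 1, 3, 3, 3, 2, 1, 1, 2, 1, 2])

# New centroid of each cluster = mean of its members' A1 and A2 values
new_C = np.array([X[cluster == k].mean(axis=0) for k in (1, 2, 3)])
# Rounded to one decimal these agree with the table:
# c1 ≈ (4.6, 7.1), c2 ≈ (8.2, 10.7), c3 ≈ (6.6, 18.6)
```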
Example 2 (using the Manhattan distance, L1 norm)
• Calculating Distance Between A1(2, 10) and C1(2, 10)-
• Ρ(A1, C1)
• = |x2 – x1| + |y2 – y1|
• = |2 – 2| + |10 – 10|
• =0
• Calculating Distance Between A1(2, 10) and C2(5, 8)-
• Ρ(A1, C2)
• = |x2 – x1| + |y2 – y1|
• = |5 – 2| + |8 – 10|
• =3+2
• =5
•
• Calculating Distance Between A1(2, 10) and C3(1, 2)-
• Ρ(A1, C3)
• = |x2 – x1| + |y2 – y1|
• = |1 – 2| + |2 – 10|
• =1+8
• =9
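The three Manhattan-distance calculations in Example 2 can be checked with a few lines of Python (a minimal sketch; the helper name is illustrative):

```python
def manhattan(p, q):
    """L1 (Manhattan) distance: |x2 - x1| + |y2 - y1|."""
    return abs(q[0] - p[0]) + abs(q[1] - p[1])

A1 = (2, 10)
d1 = manhattan(A1, (2, 10))  # distance to C1(2, 10) -> 0
d2 = manhattan(A1, (5, 8))   # distance to C2(5, 8)  -> 3 + 2 = 5
d3 = manhattan(A1, (1, 2))   # distance to C3(1, 2)  -> 1 + 8 = 9
```

With these distances, A1 would be assigned to C1, its nearest centroid.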
K-Modes Clustering
• K-Means is one of the most commonly used clustering
methods, but it does not perform well on categorical data
or features,
• for example categorical input variables such as
“designation of the employee” or “branch of a student”.
• K-Modes creates clusters based on the number of matching
categories (while K-Means works on the basis of distance
measures such as the “Euclidean distance”
between the data points).
• K-Modes attempts to minimize a dissimilarity measure.
K-Modes Clustering
• The changes to the k-Means clustering are –
• using a simple matching dissimilarity measure for
categorical objects,
• replacing means of clusters by modes, and
• using a frequency-based method to update the
modes.
– Let X = {x1, x2, …, xn} be the data set consisting of n
objects, each with m attributes. The main objective of the
k-modes clustering algorithm is to group the data objects X
into K clusters by minimizing the cost function.
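The simple matching dissimilarity mentioned above can be sketched as the count of attributes on which two objects disagree (a minimal illustration; the function name and sample categorical values are hypothetical):

```python
def matching_dissimilarity(x, y):
    """Number of attributes on which two categorical objects disagree."""
    return sum(a != b for a, b in zip(x, y))

# Hypothetical objects with attributes (designation, branch)
d_same  = matching_dissimilarity(("manager", "CS"), ("manager", "CS"))  # 0
d_one   = matching_dissimilarity(("manager", "CS"), ("manager", "EE"))  # 1
d_both  = matching_dissimilarity(("clerk", "CS"), ("manager", "EE"))    # 2
```

The mode of a cluster then plays the role the mean plays in k-Means: the object whose attribute values are the most frequent within the cluster.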
K-Modes Clustering
• Input: Data objects X, number of clusters K.
• Step 1: Randomly select the K initial modes from the data objects,
Cj, j = 1, 2, …, K.
• Step 2: Find the matching dissimilarity between each of the K initial
cluster modes and each data object.
• Step 3: Evaluate the fitness.
• Step 4: Find the minimum mode value for each data object, i.e. find
the object's nearest initial cluster mode.
• Step 5: Assign each data object to the cluster of its nearest mode.
• Step 6: Update the modes by applying the frequency-based method to the
newly formed clusters.
• Step 7: Recalculate the dissimilarity between the data objects and the
updated modes.
• Step 8: Repeat steps 4 and 5 until there are no changes in the cluster
membership of the data objects.
• Output: Clustered data objects
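The steps above can be sketched as a toy implementation (a minimal sketch, not canonical K-Modes code: numpy is assumed available, ties are broken by `argmin`, and the function name and sample data are illustrative):

```python
import numpy as np
from collections import Counter

def k_modes(X, K, n_iter=10, seed=0):
    """Toy K-Modes: matching dissimilarity + frequency-based mode update."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=object)
    # Step 1: randomly select K initial modes from the data objects
    modes = X[rng.choice(len(X), size=K, replace=False)].copy()
    labels = np.full(len(X), -1)
    for it in range(n_iter):
        # Steps 2-5: matching dissimilarity (number of mismatching
        # attributes) to each mode; assign each object to the nearest mode
        d = np.array([[int(np.sum(x != m)) for m in modes] for x in X])
        new_labels = d.argmin(axis=1)
        if np.array_equal(new_labels, labels):  # Step 8: no change -> stop
            break
        labels = new_labels
        # Step 6: replace each mode by the most frequent value per attribute
        for k in range(K):
            members = X[labels == k]
            if len(members):
                modes[k] = [Counter(col).most_common(1)[0][0]
                            for col in members.T]
    return labels, modes

# Hypothetical categorical data: (designation, branch)
X = [["manager", "CS"], ["manager", "CS"], ["manager", "EE"],
     ["clerk", "ME"], ["clerk", "ME"], ["peon", "ME"]]
labels, modes = k_modes(X, K=2)
```

Identical objects always land in the same cluster, since their dissimilarities to every mode are equal.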
K-Modes Clustering
import numpy as np
from kmodes.kmodes import KModes  # third-party 'kmodes' package

km = KModes(n_clusters=3)         # choose the number of clusters K
clusters = km.fit_predict(data)   # data: array of categorical attributes