
Clustering

Clustering refers to the task of identifying similar instances and assigning them to unlabelled groups called clusters.
It is an unsupervised task.
This is important for applications like:
1. Customer segmentation: for example, group students based on their participation and activity in the class WhatsApp group.
2. Data analysis: when analyzing a large dataset (think of inferential statistics), you can run a clustering algorithm to discover the distinct groups in the data, and that will play an important role in your sampling strategy.
3. As a dimensionality reduction technique: once a dataset with N features has been clustered into k clusters (k < N), each instance can be described by its affinity to each of the k clusters. (For example, if I need to detect an object in a picture, sometimes the color will not be necessary.)
4. Anomaly detection: an instance that has low affinity to all the clusters (i.e. does not belong to any of them) is likely an anomaly.
5. Semi-supervised learning: what would you do if asked to label 200,000,000,000 pictures? (I leave this to the consideration of the reader.)
6. Image segmentation: cluster pixels according to their colors, then replace each pixel with the mean color of its cluster. This reduces the number of colors in the image (used for object detection).
7. Search engines: first apply a clustering algorithm to all the images in the database. When a user searches with an image, return the other members of its cluster.
We will look at two popular clustering algorithms,
K-Means and DBSCAN, and explore some of their
applications, such as nonlinear dimensionality
reduction, semi-supervised learning, and anomaly
detection.
1. K-Means clustering (a.k.a. Lloyd-Forgy): this algorithm was first proposed by Stuart Lloyd at Bell Labs in 1957 as a technique for pulse-code modulation. Edward W. Forgy published virtually the same algorithm in 1965.
This algorithm works as follows:
 Given the dataset to perform clustering on, select the number of clusters k.
 Randomly place k points as the initial centroids.
 Assign each data point to the cluster of its nearest centroid.
 Keep updating the centroids and re-assigning the points until the centroids stop moving (convergence).
Note: a poor centroid initialization can make the algorithm converge to a suboptimal solution, so you must learn how to initialize the centroids properly.
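Here is a minimal usage sketch of Scikit-Learn's KMeans class (assuming X is a NumPy array of training instances; k = 5 is just an example value):
Code
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=5)      # k = 5 chosen for illustration
y_pred = kmeans.fit_predict(X)     # cluster index assigned to each instance
print(kmeans.cluster_centers_)     # the final centroids
print(kmeans.labels_)              # same labels as y_pred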

Centroid initialization methods:

Suppose you ran the algorithm earlier and happen to know approximately where the centroids should be. In that case you can set the init hyperparameter to a NumPy array containing the list of centroids and set n_init=1 (the coordinates in the example below are just placeholders).
Code
from sklearn.cluster import KMeans
import numpy as np

# Placeholder centroid coordinates for a 2-D dataset; replace with your own
good_init_array = np.array([[-3, 3], [-3, 2], [-3, 1], [-1, 2], [0, 2]])
kmeans = KMeans(n_clusters=5,
                init=good_init_array,
                n_init=1)  # runs only 1 time
kmeans.fit(X)

Another solution is to run the algorithm multiple times with different random initializations and keep the best solution.

The number of random initializations is controlled by the n_init hyperparameter.
By default n_init=10, which means the whole algorithm runs 10 times when you call the fit() function, and Scikit-Learn keeps the best solution.
Problem: how does it know which solution is the best?
Answer: it uses a performance metric called the model's inertia, which is the sum of the squared distances between each instance and its closest centroid.
The KMeans class runs the algorithm n_init times and keeps the model with the lowest inertia.
Code

kmeans.inertia_    # the model's inertia
kmeans.score(X)    # returns the negative of the inertia
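As a sanity check, here is a small sketch (assuming kmeans has already been fit on X) that recomputes the inertia by hand:
Code
import numpy as np

# Squared distances from each instance to its assigned (closest) centroid
closest_centroids = kmeans.cluster_centers_[kmeans.labels_]
manual_inertia = ((X - closest_centroids) ** 2).sum()

print(manual_inertia)     # should match kmeans.inertia_
print(-kmeans.score(X))   # score(X) is the negative inertia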

K-Means++
In a 2006 paper, David Arthur and Sergei Vassilvitskii proposed a smarter initialization step that tends to select centroids that are distant from one another, and this improvement makes the K-Means algorithm much less likely to converge to a suboptimal solution.
They showed that although this method requires an additional step for the smarter initialization, it is worth it because it drastically reduces the number of times the algorithm needs to be run to find the optimal solution.
The k-means++ initialization algorithm:
1. Take one centroid c(1), chosen uniformly at random from the dataset.
2. Take a new centroid c(i), choosing an instance x(j) with probability D(x(j))² / Σl D(x(l))², where D(x) is the distance between the instance x and the closest centroid that was already chosen.
3. This probability distribution ensures that instances farther away from the already chosen centroids are much more likely to be selected as centroids.
4. Repeat until all k centroids have been chosen.
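Scikit-Learn's KMeans class uses this initialization by default (init="k-means++"). Purely for illustration, here is a minimal NumPy sketch of the seeding step (assuming X is a 2-D array of instances):
Code
import numpy as np

def kmeans_pp_init(X, k, rng=None):
    rng = rng or np.random.default_rng()
    # Step 1: pick the first centroid uniformly at random from the dataset
    centroids = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        # D(x)^2: squared distance from each instance to its closest chosen centroid
        diffs = X[:, None, :] - np.array(centroids)[None, :, :]
        d2 = ((diffs ** 2).sum(axis=2)).min(axis=1)
        # Step 2: sample the next centroid with probability proportional to D(x)^2
        centroids.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centroids)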

Accelerated K-Means and mini-batch K-Means

An accelerated version of the algorithm was proposed in 2003 by Charles Elkan. It considerably speeds up the algorithm by avoiding many unnecessary distance calculations.
Elkan achieved this by exploiting the triangle inequality (i.e. the fact that a straight line is always the shortest path between two points) and by keeping track of lower and upper bounds for the distances between instances and centroids.
This is the algorithm the KMeans class uses by default.
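The default may vary between Scikit-Learn versions, but you can request Elkan's variant explicitly through the algorithm hyperparameter (a sketch, assuming X is already loaded):
Code
from sklearn.cluster import KMeans

# Explicitly request Elkan's accelerated variant
kmeans = KMeans(n_clusters=5, algorithm="elkan")
kmeans.fit(X)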

Yet another variant of the algorithm was proposed in a 2010 paper by David Sculley. Instead of using the whole dataset at each iteration, the algorithm is capable of using mini-batches, moving the centroids just slightly at each iteration.

This speeds up the algorithm typically by a factor of three or four and makes it possible to cluster huge datasets that do not fit in memory. Scikit-Learn implements this algorithm in the MiniBatchKMeans class. You can just use this class like the KMeans class:

from sklearn.cluster import MiniBatchKMeans

minibatch_kmeans = MiniBatchKMeans(n_clusters=5)
minibatch_kmeans.fit(X)
Although the Mini-batch K-Means algorithm is much faster
than the regular KMeans algorithm, its inertia is generally
slightly worse, especially as the number of
clusters increases.
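If the dataset really does not fit in memory, one simple option (a sketch, assuming the data can be loaded one chunk at a time, for example with np.memmap) is to feed the batches to partial_fit yourself:
Code
from sklearn.cluster import MiniBatchKMeans

minibatch_kmeans = MiniBatchKMeans(n_clusters=5)
# X_chunks is a hypothetical iterable that yields one NumPy batch at a time
for X_batch in X_chunks:
    minibatch_kmeans.partial_fit(X_batch)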

Finding the optimal number of clusters

It is generally not easy to know the right number of clusters for your algorithm, and setting a wrong value of k will lead to bad results.
What if we simply pick the value of k that gives the smallest inertia?
That will not work, because the inertia keeps decreasing as k increases: even after you exceed the optimal value of k, the inertia keeps shrinking, until eventually every data point becomes a cluster of its own and the inertia reaches 0.

What actually happens is that as you increase k, the inertia drops very quickly until you reach the optimal value of k; beyond that point it keeps decreasing, but much more slowly (the curve has an "elbow"). Increasing k further would split perfectly good clusters for no good reason, so the inertia on its own is not a good performance metric for choosing k.
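A common heuristic (a sketch, assuming X is already loaded) is to compute the inertia for a range of k values, plot inertia against k, and pick the k at the elbow of the curve:
Code
from sklearn.cluster import KMeans

# Inertia for k = 1..9; plot these values against k and look for the elbow
inertias = [KMeans(n_clusters=k, n_init=10, random_state=42).fit(X).inertia_
            for k in range(1, 10)]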
Since the inertia fails here, a more precise (but more computationally expensive) approach is to use the silhouette score, which is the mean silhouette coefficient over all instances.
For a single instance:
 Compute the mean distance from that instance to the other members of its own cluster and assign this value to a variable a.
 Find the closest neighboring cluster.
 Compute the mean distance from the instance to all the instances of this neighboring cluster and assign it to b.
 Then the silhouette coefficient = (b − a) / max(a, b).
 The silhouette coefficient varies between −1 and +1.
 Close to +1 means the instance is well inside its own cluster.
 Close to 0 means the instance is near a cluster boundary.
 Close to −1 means the instance may have been assigned to the wrong cluster.
Computing the silhouette score in Scikit-Learn is easy:
Code

from sklearn.metrics import silhouette_score

silhouette_score(X, kmeans.labels_)
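For example (a sketch, assuming X is already loaded), you can compare the silhouette score for several values of k and keep the one with the highest score:
Code
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

scores = {}
for k in range(2, 10):                       # silhouette needs at least 2 clusters
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    scores[k] = silhouette_score(X, labels)
best_k = max(scores, key=scores.get)         # k with the highest silhouette score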
Limitations of K-Means
It is not always easy to specify the number of clusters.
The algorithm performs poorly if the clusters are non-spherical (for example, clusters with elongated shapes or very different sizes and densities).
