
Clustering

Clustering refers to the task of identifying similar instances and assigning them to unlabelled groups called clusters.
It is an unsupervised task.
This is important for applications like:
1. Customer segmentation: for example, group students based on their participation and activity in the class WhatsApp group.
2. Data analysis: when analyzing a large dataset (think of inferential statistics), you can run a clustering algorithm to discover the distinct groups in the data, and that will play an important role in your sampling strategy.
3. As a dimensionality reduction technique: once a dataset with N features has been clustered into k clusters (k < N), each instance can be described by its affinity to each of the k clusters. (For example, if I need to detect an object in a picture, sometimes the color will not be necessary.)
4. Anomaly detection: an instance that has low affinity to all the clusters (i.e. does not belong to any of them) is likely an anomaly.
5. Semi-supervised learning: what would you do if asked to label 200,000,000,000 pictures? (I leave this to the consideration of the reader.)
6. Image segmentation: cluster pixels according to their colors, then replace each pixel with the mean color of its cluster. This reduces the number of colors in the image (used for object detection).
7. Search engines: first apply a clustering algorithm to all the images in the database. When a user searches with an image, return the other members of its cluster.
We will look at two popular clustering algorithms,
K-Means and DBSCAN, and explore some of their
applications, such as nonlinear dimensionality
reduction, semi-supervised learning, and anomaly
detection.
1. K-Means clustering (a.k.a. Lloyd-Forgy): this algorithm was first proposed by Stuart Lloyd at Bell Labs in 1957 as a technique for pulse-code modulation. Edward W. Forgy published virtually the same algorithm in 1965.
This algorithm works as follows:
 Given the dataset to perform clustering on, select the number of clusters k.
 Randomly place k points as the initial centroids.
 Assign each data point to the cluster of its nearest centroid.
 Keep updating the centroids and re-assigning the points until the centroids stop moving (convergence).
Note: a poor centroid initialization can make the algorithm converge to a suboptimal solution, so you must learn how to initialize the centroids properly.
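Here is a minimal usage sketch of Scikit-Learn's KMeans class (assuming X is a NumPy array of training instances; k = 5 is just an example value):
Code
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=5)      # k = 5 chosen for illustration
y_pred = kmeans.fit_predict(X)     # cluster index assigned to each instance
print(kmeans.cluster_centers_)     # the final centroids
print(kmeans.labels_)              # same labels as y_pred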

Centroid initialization methods:

Suppose you ran the algorithm earlier and happen to know approximately where the centroids should be. In that case you can set the init hyperparameter to a NumPy array containing the list of centroids and set n_init=1 (the coordinates in the example below are just placeholders).
Code
from sklearn.cluster import KMeans
import numpy as np

# Placeholder centroid coordinates for a 2-D dataset; replace with your own
good_init_array = np.array([[-3, 3], [-3, 2], [-3, 1], [-1, 2], [0, 2]])
kmeans = KMeans(n_clusters=5,
                init=good_init_array,
                n_init=1)  # runs only 1 time
kmeans.fit(X)

Another solution is to run the algorithm multiple times with different random initializations and keep the best solution.

The number of random initializations is controlled by the n_init hyperparameter.
By default n_init=10, which means the whole algorithm runs 10 times when you call the fit() function, and Scikit-Learn keeps the best solution.
Problem: how does it know which solution is the best?
Answer: it uses a performance metric called the model's inertia, which is the sum of the squared distances between each instance and its closest centroid.
The KMeans class runs the algorithm n_init times and keeps the model with the lowest inertia.
Code

kmeans.inertia_    # the model's inertia
kmeans.score(X)    # returns the negative of the inertia
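As a sanity check, here is a small sketch (assuming kmeans has already been fit on X) that recomputes the inertia by hand:
Code
import numpy as np

# Squared distances from each instance to its assigned (closest) centroid
closest_centroids = kmeans.cluster_centers_[kmeans.labels_]
manual_inertia = ((X - closest_centroids) ** 2).sum()

print(manual_inertia)     # should match kmeans.inertia_
print(-kmeans.score(X))   # score(X) is the negative inertia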

K-Means++
In a 2006 paper, David Arthur and Sergei Vassilvitskii proposed a smarter initialization step that tends to select centroids that are distant from one another, and this improvement makes the K-Means algorithm much less likely to converge to a suboptimal solution.
They showed that although this method requires an additional step for the smarter initialization, it is worth it because it drastically reduces the number of times the algorithm needs to be run to find the optimal solution.
The k-means++ initialization algorithm:
1. Take one centroid c(1), chosen uniformly at random from the dataset.
2. Take a new centroid c(i), choosing an instance x(j) with probability D(x(j))² / Σl D(x(l))², where D(x) is the distance between the instance x and the closest centroid that was already chosen.
3. This probability distribution ensures that instances farther away from the already chosen centroids are much more likely to be selected as centroids.
4. Repeat until all k centroids have been chosen.
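Scikit-Learn's KMeans class uses this initialization by default (init="k-means++"). Purely for illustration, here is a minimal NumPy sketch of the seeding step (assuming X is a 2-D array of instances):
Code
import numpy as np

def kmeans_pp_init(X, k, rng=None):
    rng = rng or np.random.default_rng()
    # Step 1: pick the first centroid uniformly at random from the dataset
    centroids = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        # D(x)^2: squared distance from each instance to its closest chosen centroid
        diffs = X[:, None, :] - np.array(centroids)[None, :, :]
        d2 = ((diffs ** 2).sum(axis=2)).min(axis=1)
        # Step 2: sample the next centroid with probability proportional to D(x)^2
        centroids.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centroids)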

Accelerated K-Means and mini-batch K-Means

An accelerated version of the algorithm was proposed in 2003 by Charles Elkan. It considerably speeds up the algorithm by avoiding many unnecessary distance calculations.
Elkan achieved this by exploiting the triangle inequality (i.e. the fact that a straight line is always the shortest path between two points) and by keeping track of lower and upper bounds for the distances between instances and centroids.
This is the algorithm the KMeans class uses by default.
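The default may vary between Scikit-Learn versions, but you can request Elkan's variant explicitly through the algorithm hyperparameter (a sketch, assuming X is already loaded):
Code
from sklearn.cluster import KMeans

# Explicitly request Elkan's accelerated variant
kmeans = KMeans(n_clusters=5, algorithm="elkan")
kmeans.fit(X)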

Yet another variant of the algorithm was proposed in a 2010 paper by David Sculley. Instead of using the whole dataset at each iteration, the algorithm is capable of using mini-batches, moving the centroids just slightly at each iteration.

This speeds up the algorithm typically by a factor of three or four and makes it possible to cluster huge datasets that do not fit in memory. Scikit-Learn implements this algorithm in the MiniBatchKMeans class. You can just use this class like the KMeans class:

from sklearn.cluster import MiniBatchKMeans

minibatch_kmeans = MiniBatchKMeans(n_clusters=5)
minibatch_kmeans.fit(X)
Although the Mini-batch K-Means algorithm is much faster
than the regular KMeans algorithm, its inertia is generally
slightly worse, especially as the number of
clusters increases.
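If the dataset really does not fit in memory, one simple option (a sketch, assuming the data can be loaded one chunk at a time, for example with np.memmap) is to feed the batches to partial_fit yourself:
Code
from sklearn.cluster import MiniBatchKMeans

minibatch_kmeans = MiniBatchKMeans(n_clusters=5)
# X_chunks is a hypothetical iterable that yields one NumPy batch at a time
for X_batch in X_chunks:
    minibatch_kmeans.partial_fit(X_batch)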

Finding the optimal number of clusters

It is generally not easy to know the right number of clusters for your algorithm, and setting a wrong value of k will lead to bad results.
What if we simply pick the value of k that gives the smallest inertia?
That will not work, because the inertia keeps decreasing as k increases: even after you exceed the optimal value of k, the inertia keeps shrinking, until eventually every data point becomes a cluster of its own and the inertia reaches 0.

What actually happens is that as you increase k, the inertia drops very quickly until you reach the optimal value of k; beyond that point it keeps decreasing, but much more slowly (the curve has an "elbow"). Increasing k further would split perfectly good clusters for no good reason, so the inertia on its own is not a good performance metric for choosing k.
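A common heuristic (a sketch, assuming X is already loaded) is to compute the inertia for a range of k values, plot inertia against k, and pick the k at the elbow of the curve:
Code
from sklearn.cluster import KMeans

# Inertia for k = 1..9; plot these values against k and look for the elbow
inertias = [KMeans(n_clusters=k, n_init=10, random_state=42).fit(X).inertia_
            for k in range(1, 10)]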
Since the inertia fails here, a more precise (but more computationally expensive) approach is to use the silhouette score, which is the mean silhouette coefficient over all instances.
For a single instance:
 Compute the mean distance from that instance to the other members of its own cluster and assign this value to a variable a.
 Find the closest neighboring cluster.
 Compute the mean distance from the instance to all the instances of this neighboring cluster and assign it to b.
 Then the silhouette coefficient = (b − a) / max(a, b).
 The silhouette coefficient varies between −1 and +1.
 Close to +1 means the instance is well inside its own cluster.
 Close to 0 means the instance is near a cluster boundary.
 Close to −1 means the instance may have been assigned to the wrong cluster.
Computing the silhouette score in Scikit-Learn is easy:
Code

from sklearn.metrics import silhouette_score

silhouette_score(X, kmeans.labels_)
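For example (a sketch, assuming X is already loaded), you can compare the silhouette score for several values of k and keep the one with the highest score:
Code
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

scores = {}
for k in range(2, 10):                       # silhouette needs at least 2 clusters
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    scores[k] = silhouette_score(X, labels)
best_k = max(scores, key=scores.get)         # k with the highest silhouette score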
Limitations of K-Means
It is not always easy to specify the number of clusters.
The algorithm performs poorly if the clusters are non-spherical (for example, clusters with elongated shapes or very different sizes and densities).
