Machine Learning Notes-1 (Clustering-1)

Uploaded by

rwt91848

CLUSTERING

• Clustering is one of the most useful tasks in the data mining process for discovering groups and identifying interesting distributions and patterns in the underlying data.
• The clustering problem is about partitioning a given data set into groups (clusters) such that the data points in a cluster are more similar to each other than to points in different clusters.
• In the clustering process there are no predefined classes and no examples that show what kind of relations should hold among the data, which is why it is regarded as an unsupervised process.
• Classification, by contrast, is the procedure of assigning a data item to one of a predefined set of categories.
Figure: Types of clustering algorithms. Abbreviations: bNMF = Bayesian non-negative matrix factorization; DBSCAN = Density-Based Spatial Clustering of Applications with Noise.
Source: https://www.researchgate.net/figure/Types-of-clustering-algorithms_fig3_362507241
Partitional Clustering

• It aims to partition a dataset into K clusters.
• It groups similar data points together while maximizing the differences between the clusters.
• Partitioning methods work by iteratively refining the cluster centroids until convergence is reached.
• These algorithms are popular for their speed and scalability in handling large datasets.

https://medium.com/analytics-vidhya/partitional-clustering-181d42049670
https://www.scaler.com/topics/data-mining-tutorial/partitioning-methods-in-data-mining/
Hierarchical Clustering

In hierarchical clustering, we build a hierarchy of clusters in the form of a tree; this tree-shaped structure is known as a dendrogram.

Agglomerative Clustering
Agglomerative clustering is a bottom-up approach: the algorithm starts by treating every data point as its own cluster and repeatedly merges the two closest clusters until only one cluster is left.

Divisive Clustering
Divisive clustering is the reverse of the agglomerative approach: it is top-down, starting with all data points in a single cluster and splitting it recursively.

https://www.google.com/search?sca_esv=ad2aacbf6bcb86d2&q=hierarchical+clustering&tbm=isch&source=lnms&sa=X&ved=2ahUKEwiDr9vr4dmEAxXufGwGHQdVDC0Q0pQJegQIDRAB&biw=1366&bih=587&dpr=1#imgrc=UR-6ylprlb0lAM
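
The bottom-up merging described for agglomerative clustering can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the notes do not say how the distance between two clusters is measured, so single linkage (distance between the closest pair of points) is assumed here, and the function names `single_linkage` and `agglomerate` are hypothetical.

```python
from itertools import combinations
from math import dist


def single_linkage(a, b):
    """Distance between two clusters: the closest pair of points (assumed linkage)."""
    return min(dist(p, q) for p in a for q in b)


def agglomerate(points, n_clusters=1):
    """Start with one cluster per point and merge the two closest clusters
    until only n_clusters remain. Returns the clusters and the merge history."""
    clusters = [[p] for p in points]
    merges = []  # each entry: (cluster_a, cluster_b, distance); this is the dendrogram data
    while len(clusters) > n_clusters:
        # Pick the pair of clusters with the smallest linkage distance.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: single_linkage(clusters[ij[0]], clusters[ij[1]]),
        )
        merges.append((clusters[i], clusters[j],
                       single_linkage(clusters[i], clusters[j])))
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return clusters, merges


points = [(0, 0), (0, 1), (5, 5), (5, 6), (20, 20)]
clusters, merges = agglomerate(points, n_clusters=2)
```

Stopping at `n_clusters=2` corresponds to cutting the dendrogram at the level where two clusters remain; running to `n_clusters=1` records the full merge history.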
Density-Based Clustering

• Density-based clustering is an unsupervised machine learning approach that groups the data points of a dataset based on their density.
• The algorithm identifies core points, which have at least a minimum number of neighboring points within a specified distance (known as the epsilon radius).
• It expands clusters by connecting these core points to their neighboring points until the density falls below the threshold.
• Points that do not belong to any cluster are considered outliers or noise.
• Core: a point that has at least m points within distance n of itself.
• Border: a point that is not a core point but has at least one core point within distance n.
• Noise: a point that is neither a core nor a border point; it has fewer than m points within distance n of itself.
https://www.graduatetutor.com/statistics-tutor/k-means-clustering-hierarchical-clustering-density-based-clustering-partitional-clustering/
https://www.kdnuggets.com/2020/04/dbscan-clustering-algorithm-machine-learning.html
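
The core / border / noise definitions can be checked directly on a small data set. The sketch below is a minimal illustration, not a full DBSCAN implementation: it only labels points and does not build clusters. It assumes Euclidean distance and assumes a point does not count itself among its m neighbors (conventions differ between implementations); the function name `classify_points` and the sample data are hypothetical.

```python
from math import dist


def classify_points(points, n, m):
    """Label each point as 'core', 'border' or 'noise'.

    n: neighborhood radius (the epsilon radius)
    m: minimum number of neighbors for a core point
    Assumption: a point is not counted as its own neighbor.
    """
    neighbors = {
        i: [j for j, q in enumerate(points) if j != i and dist(p, q) <= n]
        for i, p in enumerate(points)
    }
    core = {i for i, nb in neighbors.items() if len(nb) >= m}
    labels = []
    for i in range(len(points)):
        if i in core:
            labels.append("core")
        elif any(j in core for j in neighbors[i]):
            labels.append("border")  # not dense itself, but next to a core point
        else:
            labels.append("noise")
    return labels


points = [(0, 0), (0, 1), (1, 0), (1, 1), (2, 1), (10, 10)]
labels = classify_points(points, n=1.1, m=2)
```

Here the four corner points of the unit square are core points, (2, 1) is a border point because its only neighbor (1, 1) is a core point, and (10, 10) is noise.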
Steps of clustering process

Figure 1: Steps of Clustering process



The basic steps of the clustering process are presented in Figure 1 and can be summarized as follows:
• Feature selection: The goal is to properly select the features on which clustering is to be performed, so as to encode as much information as possible about the task of interest.
• Clustering algorithm: This step refers to the choice of an algorithm that results in a good clustering scheme for the data set. This choice involves:
i) Proximity measure: A measure that quantifies how “similar” two data points (i.e. feature vectors) are.
ii) Clustering criterion: The clustering criterion can be expressed via a cost function or some other type of rule. We choose a criterion that defines a “good” clustering, leading to a partitioning that fits the data set well.

• Validation of the results: Since clustering algorithms define clusters that are not known a priori, the final partition of the data requires some kind of evaluation in most applications, irrespective of the clustering method.
• Interpretation of the results: In many cases, experts in the application area have to integrate the clustering results with other experimental evidence and analysis in order to draw the right conclusions.
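
As a rough illustration of the proximity measure, clustering criterion and validation steps above, the sketch below compares two candidate partitions of the same four points. The specific choices are assumptions, not something the notes prescribe: Euclidean distance as the proximity measure, the within-cluster sum of squares as the cost function, and the mean silhouette coefficient as the validation index. All function names and data are hypothetical.

```python
from math import dist  # proximity measure: Euclidean distance (assumed)


def centroid(cluster):
    """Coordinate-wise mean of the points in a cluster."""
    return tuple(sum(c) / len(cluster) for c in zip(*cluster))


def wcss(clusters):
    """Clustering criterion (assumed): within-cluster sum of squared distances."""
    return sum(dist(p, centroid(c)) ** 2 for c in clusters for p in c)


def mean_silhouette(clusters):
    """Validation index (assumed): mean silhouette over all points.
    Needs at least two clusters; singleton clusters score 0."""
    scores = []
    for ci, cluster in enumerate(clusters):
        for p in cluster:
            if len(cluster) == 1:
                scores.append(0.0)
                continue
            # a: mean distance to the other points of the same cluster
            a = sum(dist(p, q) for q in cluster) / (len(cluster) - 1)
            # b: mean distance to the points of the nearest other cluster
            b = min(
                sum(dist(p, q) for q in other) / len(other)
                for cj, other in enumerate(clusters) if cj != ci
            )
            scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)


good = [[(0, 0), (0, 1)], [(5, 5), (5, 6)]]
bad = [[(0, 0), (5, 5)], [(0, 1), (5, 6)]]
print(wcss(good), wcss(bad))
print(mean_silhouette(good), mean_silhouette(bad))
```

The partition that groups nearby points has the lower cost and the higher silhouette, which is the kind of evidence the validation step looks for.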
https://techvidvan.com/tutorials/cluster-analysis-in-r/
https://www.linkedin.com/pulse/k-means-clustering-its-use-cases-security-domain-gaurav-sharma
Clustering applications
• Data reduction: Clustering can be used to partition a data set into a number of “interesting” clusters. Then, instead of processing the data set as a whole, we work with the representatives of the defined clusters.
• Prediction based on groups: Assume, for example, that cluster analysis is applied to a data set of patients infected by the same disease. The result is a number of clusters of patients, grouped according to their reaction to specific drugs. For a new patient, we identify the cluster to which he/she belongs and choose his/her medication based on that.
• Business: Clustering may help marketers discover significant groups in their customer database and characterize them based on purchasing patterns.
• Biology: Clustering can be used to define taxonomies and to categorize genes with similar functionality.
• Spatial data analysis: Huge amounts of spatial data may be obtained from satellite images, medical equipment, Geographical Information Systems (GIS), image database exploration, etc.; clustering can help automate the analysis of such data.
• Web mining: Clustering is used to discover significant groups of documents in the Web's huge collection of semi-structured documents.
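
The "data reduction" and "prediction based on groups" uses both come down to working with one representative per cluster. A minimal sketch, assuming each cluster is represented by a single point (for example its centroid) and that a new item is assigned to the cluster with the nearest representative; the cluster names, coordinates and the function `nearest_cluster` are made up for illustration.

```python
from math import dist


def nearest_cluster(point, representatives):
    """Return the name of the cluster whose representative is closest to point."""
    return min(representatives, key=lambda name: dist(point, representatives[name]))


# Hypothetical cluster representatives obtained from an earlier clustering run.
representatives = {"group_A": (0.0, 0.0), "group_B": (10.0, 10.0)}

new_item = (1.0, 2.0)
assigned = nearest_cluster(new_item, representatives)
```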
k-Means Clustering

• k-means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, which serves as a prototype of the cluster.
• The computational complexity of the algorithm is O(ndkT), where n is the number of observations, d the number of features, k the number of clusters, and T the number of iterations.

Given a set of observations (x1, x2, …, xn), where each observation is a d-dimensional real vector, k-means clustering aims to partition the n observations into k (≤ n) sets S = {S1, S2, …, Sk} so as to minimize the within-cluster sum of squares (WCSS). In other words, its objective is to find:

$$\underset{S}{\arg\min} \sum_{i=1}^{k} \sum_{x \in S_i} \lVert x - \mu_i \rVert^2$$

where μi is the mean of the points in Si.
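
The objective above is usually minimized with an iterative heuristic that alternates between assigning each observation to its nearest mean and recomputing the means. The notes do not spell out the procedure, so the sketch below is an assumption: it follows the standard alternating scheme (often called Lloyd's algorithm), uses Euclidean distance, and initializes the centroids naively with the first k observations so that the run is deterministic. The names `k_means`, `mean` and `wcss` are hypothetical.

```python
from math import dist


def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(c) / len(points) for c in zip(*points))


def k_means(observations, k, max_iter=100):
    """Alternate assignment and update steps until the centroids stop changing.
    Assumption: the first k observations serve as initial centroids."""
    centroids = list(observations[:k])
    for _ in range(max_iter):
        # Assignment step: each observation joins the cluster with the nearest mean.
        clusters = [[] for _ in range(k)]
        for x in observations:
            nearest = min(range(k), key=lambda i: dist(x, centroids[i]))
            clusters[nearest].append(x)
        # Update step: recompute each mean; keep the old one if a cluster is empty.
        new_centroids = [mean(c) if c else centroids[i]
                         for i, c in enumerate(clusters)]
        if new_centroids == centroids:
            break  # converged
        centroids = new_centroids
    return centroids, clusters


def wcss(centroids, clusters):
    """Within-cluster sum of squares, the quantity k-means tries to minimize."""
    return sum(dist(x, mu) ** 2 for mu, c in zip(centroids, clusters) for x in c)


data = [(1.0, 1.0), (1.0, 2.0), (8.0, 8.0), (9.0, 8.0), (2.0, 1.0), (8.0, 9.0)]
centroids, clusters = k_means(data, k=2)
print(centroids, wcss(centroids, clusters))
```

Each pass over the data costs on the order of n·d·k distance computations, which is where the O(ndkT) bound for T iterations comes from. The result depends on the initial centroids, so practical implementations typically use a randomized initialization and several restarts.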


References
https://www.javatpoint.com/k-means-clustering-algorithm-in-machine-learning
https://www.researchgate.net/figure/Clustering-algorithms-and-their-applications_fig1_309461986
https://www.linkedin.com/pulse/k-means-clustering-its-use-cases-security-domain-gaurav-sharma
