Clustering
Clustering, or cluster analysis, is a machine learning technique that groups an unlabelled dataset. It can be defined as "a way of grouping the data points into different clusters consisting of similar data points. The objects with possible similarities remain in a group that has few or no similarities with another group."
It does this by finding similar patterns in the unlabelled dataset, such as shape, size, color, behavior, etc., and divides the data according to the presence or absence of those patterns.
It is an unsupervised learning method; no supervision is provided to the algorithm, and it works with an unlabelled dataset.
After applying the clustering technique, each cluster or group is assigned a cluster ID, which the ML system can use to simplify the processing of large and complex datasets.
Example: Let's understand the clustering technique with the real-world example of a shopping mall: when we visit a mall, we can observe that items with similar uses are grouped together. T-shirts are grouped in one section and trousers in another; similarly, in the vegetable section, apples, bananas, mangoes, etc., are grouped separately so that we can easily find things. The clustering technique works in the same way. Another example of clustering is grouping documents by topic.
The clustering technique is used in a wide variety of tasks. Some of the most common uses of this technique are:
Market segmentation
Image segmentation
Apart from these general uses, Amazon applies clustering in its recommendation system to provide recommendations based on a user's past product searches, and Netflix uses it to recommend movies and web series based on watch history. The diagram below illustrates how a clustering algorithm works: different fruits are divided into several groups with similar properties.
Types of Clustering Methods
Clustering methods are broadly divided into hard clustering (each data point belongs to only one group) and soft clustering (a data point can belong to more than one group). Various other approaches also exist. Below are the main clustering methods used in machine learning:
Partitioning Clustering
Density-Based Clustering
Distribution Model-Based Clustering
Hierarchical Clustering
Fuzzy Clustering
Partitioning Clustering
It is a type of clustering that divides the data into non-hierarchical groups. It is also known as the centroid-based method. The most common example of partitioning clustering is the K-Means clustering algorithm.
In this type, the dataset is divided into a set of k pre-defined groups. The cluster centers are chosen so that each data point is closer to its own cluster centroid than to the centroid of any other cluster.
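As a minimal sketch of partitioning clustering, the snippet below runs scikit-learn's KMeans on a small made-up dataset; the data and the choice of k = 3 are illustrative assumptions, not part of the original example:

import numpy as np
from sklearn.cluster import KMeans

# Nine illustrative 2-D points forming three loose groups
X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0],
              [5, 8], [6, 9], [5, 9]])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)           # cluster ID assigned to each point
print(kmeans.cluster_centers_)  # the learned centroids

Each point receives the ID of the centroid it is closest to, which is exactly the cluster-ID assignment described earlier.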
Density-Based Clustering
The density-based clustering method connects highly dense areas into clusters, so arbitrarily shaped clusters are formed as long as the dense regions can be connected. The algorithm does this by identifying the dense regions in the data space and joining them into clusters; the dense areas are separated from each other by sparser areas.
These algorithms can struggle to cluster the data points if the dataset has varying densities or high dimensionality.
Distribution Model-Based Clustering
In the distribution model-based clustering method, the data is divided based on the probability that a data point belongs to a particular distribution. The grouping is done by assuming some distributions, most commonly the Gaussian distribution.
An example of this type is the Expectation-Maximization clustering algorithm, which uses Gaussian Mixture Models (GMM).
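A minimal sketch of distribution model-based clustering with scikit-learn's GaussianMixture follows; the two synthetic Gaussian blobs are an illustrative assumption:

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.RandomState(0)
# Two illustrative Gaussian blobs centred at 0 and 5
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
labels = gmm.predict(X)       # hard assignment to the most likely component
probs = gmm.predict_proba(X)  # per-component membership probabilities

The predict_proba output makes the probabilistic nature of the method visible: each point gets a probability of belonging to each Gaussian component rather than only a single hard label.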
Hierarchical Clustering
Hierarchical clustering can be used as an alternative to partitioning clustering, as there is no requirement to pre-specify the number of clusters. In this technique, the dataset is divided into clusters to create a tree-like structure, also called a dendrogram. Any number of clusters can then be selected by cutting the tree at the appropriate level. The most common example of this method is the agglomerative hierarchical algorithm.
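The following sketch, using SciPy's hierarchical-clustering routines on illustrative random data, shows the dendrogram-and-cut workflow described above:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.RandomState(0).rand(10, 2)  # illustrative data
Z = linkage(X, method='ward')             # bottom-up (agglomerative) merging
# "Cut" the tree to obtain a chosen number of flat clusters
labels = fcluster(Z, t=2, criterion='maxclust')
# scipy.cluster.hierarchy.dendrogram(Z) would draw the tree itself

Cutting at a different level (a different t) yields a different number of clusters without re-running the algorithm, which is the practical advantage mentioned above.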
Fuzzy Clustering
Fuzzy clustering is a type of soft clustering in which a data object may belong to more than one group or cluster. Each data object has a set of membership coefficients that express its degree of membership in each cluster. The Fuzzy C-means algorithm is the best-known example of this type of clustering; it is sometimes also called the fuzzy k-means algorithm.
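Fuzzy C-means is not part of scikit-learn, so the sketch below implements its two alternating update steps directly in NumPy; the parameter values (c clusters, fuzzifier m) are illustrative assumptions:

import numpy as np

def fuzzy_c_means(X, c=2, m=2.0, n_iter=100, seed=0):
    rng = np.random.RandomState(seed)
    u = rng.rand(len(X), c)
    u /= u.sum(axis=1, keepdims=True)        # memberships sum to 1 per point
    for _ in range(n_iter):
        um = u ** m
        # Update centres as the membership-weighted mean of the points
        centers = um.T @ X / um.sum(axis=0)[:, None]
        # Update memberships from the distances to the new centres
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-10
        u = 1.0 / d ** (2.0 / (m - 1.0))
        u /= u.sum(axis=1, keepdims=True)
    return centers, u  # u[i, j] = degree to which point i belongs to cluster j

Unlike k-means, the returned membership matrix u assigns each point a degree of belonging to every cluster, which is what makes the method "soft".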
K-Means algorithm: The k-means algorithm is one of the most popular clustering algorithms. It partitions the dataset by dividing the samples into clusters of roughly equal variance. The number of clusters must be specified in advance. It is fast, requiring relatively few computations, with linear complexity O(n).
Mean-shift algorithm: The mean-shift algorithm tries to find dense areas in a smooth density of data points. It is an example of a centroid-based model that works by updating candidate centroids to be the mean of the points within a given region.
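A minimal mean-shift sketch with scikit-learn; the random data and the quantile used to estimate the bandwidth (the size of the region mentioned above) are illustrative:

import numpy as np
from sklearn.cluster import MeanShift, estimate_bandwidth

X = np.random.RandomState(0).randn(100, 2)
bw = estimate_bandwidth(X, quantile=0.3)  # radius of the region around each candidate
ms = MeanShift(bandwidth=bw).fit(X)
print(ms.cluster_centers_)  # the density modes the candidates converged to

Note that the number of clusters is not specified; it emerges from how many modes the candidate centroids converge to.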
DBSCAN algorithm: DBSCAN stands for Density-Based Spatial Clustering of Applications with Noise. It is an example of a density-based model, similar to mean-shift but with some notable advantages. In this algorithm, areas of high density are separated by areas of low density, so clusters can be found in any arbitrary shape.
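A minimal DBSCAN sketch with scikit-learn; eps and min_samples, which together define what counts as a "dense" region, are illustrative values:

import numpy as np
from sklearn.cluster import DBSCAN

X = np.random.RandomState(0).rand(100, 2)
db = DBSCAN(eps=0.1, min_samples=5).fit(X)
print(db.labels_)  # label -1 marks noise points that belong to no cluster

The -1 labels are one of the notable advantages mentioned above: points in low-density areas are explicitly flagged as noise instead of being forced into a cluster.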
Expectation-Maximization clustering using GMM: This algorithm can be used as an alternative to the k-means algorithm, or for cases where k-means fails. In GMM, it is assumed that the data points are Gaussian distributed.
Agglomerative hierarchical algorithm: The agglomerative hierarchical algorithm performs bottom-up hierarchical clustering. Each data point is treated as a single cluster at the outset, and clusters are then successively merged. The cluster hierarchy can be represented as a tree structure.
Affinity Propagation: It differs from other clustering algorithms in that it does not require the number of clusters to be specified. Each pair of data points exchanges messages until convergence. Its O(N²T) time complexity (for N points and T iterations) is the main drawback of this algorithm.
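A minimal sketch with scikit-learn's AffinityPropagation on illustrative data, showing that no cluster count is passed in:

import numpy as np
from sklearn.cluster import AffinityPropagation

X = np.random.RandomState(0).randn(60, 2)
ap = AffinityPropagation(random_state=0).fit(X)  # no n_clusters argument
print(len(ap.cluster_centers_indices_), "clusters found")
print(ap.labels_)  # cluster assignment for each point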
Applications of Clustering
Below are some commonly known applications of the clustering technique in Machine Learning:
Identification of cancer cells: Clustering algorithms are widely used for the identification of cancerous cells. They divide cancerous and non-cancerous data points into different groups.
Search engines: Search engines also work on the clustering technique. Search results appear based on the objects closest to the search query; similar data objects are grouped together, far from dissimilar objects. The accuracy of a query's results depends on the quality of the clustering algorithm used.
Customer segmentation: Clustering is used in market research to segment customers based on their choices and preferences.
Biology: It is used in biology to classify different species of plants and animals using image recognition techniques.
Land use: The clustering technique is used to identify areas of similar land use in a GIS database. This is very useful for determining the purpose for which a particular plot of land is most suitable.
Spectral Clustering
Spectral clustering is a variant of clustering that uses the connectivity between data points to form clusters. It uses the eigenvalues and eigenvectors of a matrix derived from the data to project the data into a lower-dimensional space in which the points are clustered. It is based on a graph representation of the data, in which data points are represented as nodes and the similarity between data points is represented by weighted edges.
Building the similarity graph of the data: This step builds the similarity graph in the form of an adjacency matrix, represented by A. The adjacency matrix can be built in the following ways:
Epsilon-neighbourhood graph: A parameter epsilon is fixed beforehand. Each point is then connected to all points that lie within its epsilon radius. If all distances between pairs of points are on a similar scale, then the weights of the edges (i.e., the distances between the points) are typically not stored, since they provide no additional information. In this case, the graph built is undirected and unweighted.
K-nearest neighbours: A parameter k is fixed beforehand. Then, for two vertices u and v, an edge is directed from u to v only if v is among the k-nearest neighbours of u. Note that this leads to a weighted and directed graph, because it is not always the case that when u has v among its k-nearest neighbours, v also has u among its k-nearest neighbours. To make this graph undirected, one of the following approaches is used (a NumPy sketch of these constructions follows the next item):
Connect u and v with an undirected edge if v is among the k-nearest neighbours of u OR u is among the k-nearest neighbours of v.
Connect u and v with an undirected edge if v is among the k-nearest neighbours of u AND u is among the k-nearest neighbours of v (the mutual k-NN graph).
Fully connected graph: To build this graph, each point is connected to every other point by an undirected edge weighted by the distance between the two points. Since this approach is used to model local neighbourhood relationships, the Gaussian similarity metric, w(u, v) = exp(−‖u − v‖² / (2σ²)), is typically used to compute the edge weights.
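The sketch below builds all three similarity graphs described above in NumPy; the data and the epsilon, k, and sigma values are illustrative assumptions:

import numpy as np
from scipy.spatial.distance import cdist

X = np.random.RandomState(0).rand(20, 2)
D = cdist(X, X)  # pairwise Euclidean distances

# Epsilon-neighbourhood graph: undirected and unweighted
eps = 0.3
A_eps = (D <= eps).astype(float) - np.eye(len(X))  # drop self-loops

# K-nearest-neighbour graph, made undirected by the two rules above
k = 3
nn = np.argsort(D, axis=1)[:, 1:k + 1]  # k nearest neighbours of each point
B = np.zeros_like(D)
B[np.repeat(np.arange(len(X)), k), nn.ravel()] = 1
A_knn_or = np.maximum(B, B.T)   # OR rule: either direction suffices
A_knn_and = B * B.T             # AND rule: the mutual k-NN graph

# Fully connected graph with Gaussian similarity; sigma controls locality
sigma = 0.5
A_full = np.exp(-D ** 2 / (2 * sigma ** 2))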
1) Preprocessing: Construct the matrix representation of the graph (the adjacency matrix A and the graph Laplacian derived from it).
2) Decomposition: Compute the eigenvalues and eigenvectors of that matrix, and embed each data point in a lower-dimensional space using the eigenvectors.
3) Grouping: Assign the points to clusters, typically by running k-means on the embedded representation (a sketch of the full pipeline follows).
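Putting the three steps together, the following sketch performs (unnormalised) spectral clustering by hand; in practice sklearn.cluster.SpectralClustering wraps this whole pipeline, and the data and parameters here are illustrative:

import numpy as np
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans

X = np.random.RandomState(0).rand(30, 2)

# 1) Preprocessing: similarity matrix A and graph Laplacian L = Deg - A
A = np.exp(-cdist(X, X) ** 2 / (2 * 0.5 ** 2))
L = np.diag(A.sum(axis=1)) - A

# 2) Decomposition: eigenvectors for the k smallest eigenvalues of L
k = 2
eigvals, eigvecs = np.linalg.eigh(L)
embedding = eigvecs[:, :k]  # each row embeds one data point

# 3) Grouping: assign points to clusters by k-means on the embedding
labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(embedding)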
What is subspace clustering?
Subspace clustering is a technique that finds clusters within different subspaces (a selection of one or more dimensions). The underlying assumption is that we can find valid clusters defined by only a subset of dimensions (agreement across all N features is not required). For example, if we take as input patient data recording gene expression levels (there can be more than 20,000 features), a cluster of patients suffering from Alzheimer's may be found by looking at the expression data of a subset of only 100 genes; stated differently, the cluster exists in a 100-dimensional subspace. Subspace clustering is thus an extension of traditional N-dimensional cluster analysis that allows features and observations to be grouped simultaneously by creating both row and column clusters. The resulting clusters may overlap in both the space of features and the space of observations.
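Classic subspace-clustering algorithms (e.g., CLIQUE, PROCLUS) are not available in scikit-learn, but the closely related idea of simultaneously clustering rows and columns is: the sketch below uses SpectralCoclustering on an illustrative observations-by-features matrix:

import numpy as np
from sklearn.cluster import SpectralCoclustering

X = np.random.RandomState(0).rand(20, 15)  # 20 observations, 15 features
model = SpectralCoclustering(n_clusters=3, random_state=0).fit(X)
print(model.row_labels_)     # a cluster per observation (row)
print(model.column_labels_)  # a cluster per feature (column)

Each bicluster pairs a subset of observations with the subset of features on which they agree, mirroring the row and column clusters described above.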
High-dimensional data consists of input having from a few dozen to many thousands of features (or dimensions). This context is typically encountered, for instance, in bioinformatics (all sorts of sequencing data) or in NLP, where the size of the vocabulary is very high. High-dimensional data is challenging because:
It makes visualization, and thus understanding of the input, difficult; it often requires applying a dimensionality reduction technique beforehand.
It leads to the 'curse of dimensionality', which means that the complete enumeration of all subspaces becomes intractable with increasing dimensionality.
Most underlying clustering techniques depend on the results and the choice of the dimensionality reduction technique.
Many dimensions may be irrelevant and can mask existing clusters in noisy data.
One common technique is to perform feature selection (removing irrelevant dimensions), but there are cases where identifying redundant dimensions is not easy.
Bottom-up approaches start by finding clusters in low-dimensional (1-D) spaces and iteratively merge them to process higher-dimensional spaces (up to N-D).
Top-down approaches find clusters in the full set of dimensions and then evaluate the subspace of each cluster.