
Cluster Analysis

• Cluster analysis is a statistical method for processing
data. It works by organising items into groups – or
clusters – based on how closely associated they are.
• The objective of cluster analysis is to find similar groups
of subjects, where the "similarity" between each pair of
subjects reflects a characteristic of the group relative
to the larger population or sample. Strong differentiation
between groups shows up as well-separated clusters;
a single cluster indicates extremely homogeneous data.
Cluster analysis algorithms

• Your choice of cluster analysis algorithm is important,
particularly when you have mixed data. Major statistics
packages offer a range of preset algorithms ready to
number-crunch your matrices.
• K-means and K-medoids are two of the most widely used
clustering methods. In both cases, K is the number of
clusters.
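As a concrete illustration, here is a minimal K-means sketch in pure Python; the data points, the value of K, and the random initialisation below are assumptions for this example, not taken from any particular statistics package:

```python
import math
import random

def kmeans(points, k, iters=20, seed=0):
    """Minimal K-means: alternate assignment and centroid-update steps."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)          # pick k distinct initial centroids
    for _ in range(iters):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[j].append(p)
        # Update step: move each centroid to the mean of its cluster.
        for j, cl in enumerate(clusters):
            if cl:
                centroids[j] = tuple(sum(x) / len(cl) for x in zip(*cl))
    return centroids, clusters

# Two well-separated groups of three points each (assumed toy data).
pts = [(1, 1), (1.5, 2), (1, 0), (8, 8), (9, 9), (8, 9)]
centroids, clusters = kmeans(pts, k=2)
```

K-medoids follows the same assign/update loop, but the update step picks the cluster member minimising total distance to the others instead of the mean, which makes it more robust to outliers.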
Properties of Clustering:
1. Scalability: Modern applications involve huge databases, so the clustering algorithm
should be scalable. An algorithm that cannot handle large datasets may produce
misleading results.
2. High dimensionality: The algorithm should be able to handle high-dimensional data,
even when the number of samples is small.
3. Usability with multiple data types: The algorithm should be capable of dealing with
different kinds of data, such as discrete, categorical, interval-based, and binary data.
4. Dealing with unstructured data: Some databases contain missing values or noisy,
erroneous data. Algorithms that are sensitive to such data may produce poor-quality
clusters, so a clustering method should tolerate it and still organise the data into groups
of similar objects. This makes it easier for the analyst to process the data and discover
new patterns.
5. Interpretability: The clustering outcomes should be interpretable, comprehensible, and
usable; interpretability reflects how easily the results are understood.
Advantages of Cluster Analysis:

1. It can be used for exploratory data analysis and can help with feature selection.

2. It can be used to reduce the dimensionality of the data.

3. It can be used for anomaly detection and outlier identification.

4. It can be used for market segmentation and customer profiling.

Disadvantages of Cluster Analysis:

1. It can be sensitive to the choice of initial conditions and the number of clusters.

2. It can be sensitive to the presence of noise or outliers in the data.

3. It can be difficult to interpret the results of the analysis if the clusters are not well defined.

4. It can be computationally expensive for large datasets.

5. The results of the analysis can be affected by the choice of clustering algorithm used.

Note that the success of cluster analysis depends on the data, the goals of the
analysis, and the ability of the analyst to interpret the results.
Density-based clustering
• Density-based clustering refers to methods that are
based on a local cluster criterion, such as density-
connected points. This section discusses density-based
clustering with examples.
What is Density-based clustering?
• Density-based clustering is one of the most popular
unsupervised learning methodologies used in model
building and machine learning. Data points lying in the
low-density region that separates two clusters are
treated as noise. The neighborhood within a radius ε of
a given object is known as the ε-neighborhood of the
object. If the ε-neighborhood of an object contains at
least a minimum number of objects, MinPts, the object
is called a core object.
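The ε-neighborhood and core-object definitions can be sketched directly in Python; the point set, eps, and MinPts values below are assumed for illustration:

```python
import math

def eps_neighborhood(p, D, eps):
    """All points of D within distance eps of p (p itself included)."""
    return [q for q in D if math.dist(p, q) <= eps]

def is_core_object(p, D, eps, min_pts):
    """p is a core object if its eps-neighborhood holds at least MinPts points."""
    return len(eps_neighborhood(p, D, eps)) >= min_pts

# Assumed toy dataset: a tight square of four points plus one isolated point.
D = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5)]
```

With eps = 1.5 and MinPts = 4, each point of the square is a core object (its neighborhood contains all four square points), while the isolated point (5, 5) is not.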
• Parameters Required For DBSCAN Algorithm

1. eps: Defines the neighborhood around a data point: if the
distance between two points is less than or equal to eps,
they are considered neighbors. If eps is chosen too small,
a large part of the data will be treated as outliers; if it is
chosen too large, clusters will merge and most data points
will fall into the same cluster. One way to choose eps is
the k-distance graph.

2. MinPts: The minimum number of neighbors (data points)
within the eps radius. The larger the dataset, the larger the
value of MinPts that should be chosen. As a rule of thumb,
MinPts can be derived from the number of dimensions D in
the dataset as MinPts >= D + 1, and MinPts should be at
least 3.
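The k-distance heuristic mentioned above can be sketched as follows: compute each point's distance to its k-th nearest neighbor and sort the values in descending order; the "elbow" in the resulting curve suggests a value for eps. The data points here are an assumed toy example:

```python
import math

def k_distances(points, k):
    """Distance from each point to its k-th nearest neighbor, sorted descending."""
    out = []
    for p in points:
        d = sorted(math.dist(p, q) for q in points if q is not p)
        out.append(d[k - 1])
    return sorted(out, reverse=True)

# Two dense 2x2 squares plus one far outlier (assumed data).
pts = [(1, 1), (1, 2), (2, 1), (2, 2),
       (8, 8), (8, 9), (9, 8), (9, 9),
       (25, 25)]
kd = k_distances(pts, k=3)
# Plotting kd shows a sharp drop after the first value: the outlier's
# 3rd-neighbor distance is huge, while cluster points sit near sqrt(2).
# Reading eps just below the jump separates clusters from noise.
```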
Density-Based Clustering - Background
Two parameters govern density-based clustering:
Eps: the maximum radius of the neighborhood.
MinPts: the minimum number of points in an Eps-neighborhood of a point.
The Eps-neighborhood of a point i is NEps(i) = { k in D | dist(i, k) <= Eps }.
Directly density-reachable:
A point i is directly density-reachable from a point k with respect to Eps and
MinPts if
i belongs to NEps(k), and
k satisfies the core point condition: |NEps(k)| >= MinPts.
Density-reachable:
A point i is density-reachable from a point j with respect to Eps and MinPts
if there is a chain of points p1, ..., pn with p1 = j and pn = i such that each
pm+1 is directly density-reachable from pm.
Density-connected:
A point i is density-connected to a point j with respect to Eps and MinPts if
there is a point o such that both i and j are density-reachable from o with
respect to Eps and MinPts.
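The reachability definitions above can be checked mechanically. The following sketch, on an assumed toy point set, implements density-reachability as a breadth-first search that extends chains only through core points, and density-connectedness as reachability of both points from some common point o:

```python
import math

def n_eps(p, D, eps):
    """Eps-neighborhood of p (p itself included)."""
    return [q for q in D if math.dist(p, q) <= eps]

def is_core(p, D, eps, min_pts):
    """Core point condition: |N_Eps(p)| >= MinPts."""
    return len(n_eps(p, D, eps)) >= min_pts

def density_reachable(i, j, D, eps, min_pts):
    """True if i is density-reachable from j: a chain of directly
    density-reachable points leads from j to i."""
    frontier, seen = [j], {j}
    while frontier:
        p = frontier.pop()
        if not is_core(p, D, eps, min_pts):
            continue                       # only core points extend the chain
        for q in n_eps(p, D, eps):
            if q == i:
                return True
            if q not in seen:
                seen.add(q)
                frontier.append(q)
    return False

def density_connected(i, j, D, eps, min_pts):
    """True if some point o density-reaches both i and j."""
    return any(density_reachable(i, o, D, eps, min_pts) and
               density_reachable(j, o, D, eps, min_pts) for o in D)

# Assumed data: a dense square, one border point (2, 1), one isolated point.
D = [(0, 0), (0, 1), (1, 0), (1, 1), (2, 1), (5, 5)]
```

Note the asymmetry the slides imply: with eps = 1.2 and MinPts = 3, the border point (2, 1) is density-reachable from the core point (0, 0), but not vice versa, because (2, 1) is not core.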
Major Features of Density-Based Clustering
• The primary features of density-based clustering are
given below.
• It needs only a single scan of the data.
• It requires density parameters as a termination
condition.
• It can handle noise in the data.
• It can discover clusters of arbitrary shape and size.
Example
• MinPts: 4
• Eps: 1.9
• Distance measure: Euclidean
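The slide's data points are not reproduced here, so this sketch runs a minimal pure-Python DBSCAN with the stated parameters (MinPts = 4, Eps = 1.9, Euclidean distance) on an assumed point set:

```python
import math

def dbscan(points, eps, min_pts):
    """Minimal DBSCAN: label each point with a cluster id, or -1 for noise."""
    NOISE = -1
    labels = {p: None for p in points}
    cid = 0
    for p in points:
        if labels[p] is not None:
            continue
        neigh = [q for q in points if math.dist(p, q) <= eps]
        if len(neigh) < min_pts:
            labels[p] = NOISE              # may be relabelled later as a border point
            continue
        labels[p] = cid                    # start a new cluster from this core point
        seeds = [q for q in neigh if q != p]
        while seeds:
            q = seeds.pop()
            if labels[q] == NOISE:
                labels[q] = cid            # border point reached from a core point
            if labels[q] is not None:
                continue
            labels[q] = cid
            q_neigh = [r for r in points if math.dist(q, r) <= eps]
            if len(q_neigh) >= min_pts:    # q is also core: expand through it
                seeds.extend(q_neigh)
        cid += 1
    return labels

# Assumed data: two dense squares and one isolated point.
pts = [(1, 1), (1, 2), (2, 1), (2, 2),
       (8, 8), (8, 9), (9, 8), (9, 9),
       (15, 1)]
labels = dbscan(pts, eps=1.9, min_pts=4)
```

With these parameters each square forms its own cluster (every square point has four neighbors within Eps = 1.9, itself included, so all are core points), while the isolated point (15, 1) is labelled noise.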
