
Unit 3 Clustering Approaches

Dr. M. Thamarai
Professor/ECE
SVEC
Introduction
• Cluster analysis is a technique for partitioning a collection of unlabelled objects/data with many attributes into meaningful disjoint groups, or clusters.
• Cluster analysis is a fundamental task of unsupervised learning.
• Visually identifying and grouping similar data points is easy if the dataset has few attributes (only two features).
• But for a dataset with n features, the clustering process requires automatic clustering algorithms.
Clustering..
• "A way of grouping the data points into
different clusters, consisting of similar data
points. The objects with the possible
similarities remain in a group that has less or
no similarities with another group.“
• It does it by finding some similar patterns in the
unlabelled dataset such as shape, size, color,
behavior, etc., and divides them as per the
presence and absence of those similar patterns.
Clustering…
• Each cluster is represented by a centroid.
• For example, the data points (3,3), (2,6) and (7,9) have the centroid (4,6).
• Clusters should not overlap, and every cluster should represent only one class.
• Clustering algorithms form clusters by trial and error, and the clusters can then be converted into labels.
• After applying a clustering technique, each cluster or group is given a cluster-ID.
• An ML system can use this ID to simplify the processing of large and complex datasets.
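• As a quick check of the centroid above, the sketch below (assuming NumPy is available) computes the coordinate-wise mean of the three example points:

    import numpy as np

    # The three example data points from this slide.
    points = np.array([[3, 3], [2, 6], [7, 9]])

    # A centroid is the coordinate-wise mean of the cluster's points.
    centroid = points.mean(axis=0)
    print(centroid)  # [4. 6.] -- matches the centroid (4, 6) given above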
Difference between classification and clustering

S.No. | Clustering | Classification
1 | Unsupervised learning; clusters are formed by trial and error, as there is no supervisor. | Supervised learning, with a supervisor present to provide training.
2 | Unlabelled data. | Labelled data.
3 | No prior knowledge is required for clustering. | Knowledge of the domain is a must to label unseen data.
4 | Cluster results are dynamic. | Once a label is assigned, it does not change.
Applications of clustering
• Grouping customers based on buying patterns
• Profiling customers based on lifestyle
• Document indexing
• Taxonomy of animals and plants in biology
• Market segmentation
• Statistical data analysis
• Social network analysis
• Image segmentation
• Anomaly detection, etc.
Types of Clustering Methods

• Partitioning Clustering
• Density-Based Clustering
• Distribution Model-Based Clustering
• Hierarchical Clustering
• Fuzzy Clustering
Partitioning Clustering
• A type of clustering that divides the data into non-hierarchical groups; it is also known as the centroid-based clustering method.
• The most common example of partitioning clustering is the k-means clustering algorithm.
• In this type, the dataset is divided into a set of k groups, where k defines the number of pre-defined groups.
• The cluster centres are chosen so that the distance between a data point and its own cluster centroid is minimal compared with the other cluster centroids.
Density-Based Clustering

• The density-based clustering method connects highly dense areas into clusters, and arbitrarily shaped distributions are formed as long as the dense regions can be connected.
• The algorithm does this by identifying different clusters in the dataset and connecting areas of high density into clusters.
• The dense areas in data space are separated from each other by sparser areas.
• These algorithms can have difficulty clustering the data points if the dataset has varying densities and high dimensionality.
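• A minimal sketch of density-based clustering with scikit-learn's DBSCAN (the data, eps and min_samples values are illustrative assumptions):

    import numpy as np
    from sklearn.cluster import DBSCAN

    # Two dense blobs plus one far-away point that should be treated as noise.
    X = np.array([[1.0, 1.0], [1.1, 1.0], [0.9, 1.2],
                  [8.0, 8.0], [8.1, 7.9], [7.9, 8.2],
                  [30.0, 30.0]])

    # eps: radius of the dense neighbourhood; min_samples: points needed
    # inside that radius for a point to count as a core point.
    db = DBSCAN(eps=0.5, min_samples=2).fit(X)
    print(db.labels_)  # [0 0 0 1 1 1 -1]; noise points are labelled -1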
Distribution Model-Based Clustering

• In the distribution-model-based clustering method, the data is divided based on the probability that a data point belongs to a particular distribution.
• The grouping is done by assuming some distribution, most commonly the Gaussian distribution.
• The example of this type is the Expectation-Maximization clustering algorithm, which uses Gaussian mixture models (GMM).
Hierarchical Clustering

• Hierarchical clustering can be used as an alternative to partitional clustering, as there is no requirement to pre-specify the number of clusters to be created.
• In this technique, the dataset is divided into clusters to create a tree-like structure, which is also called a dendrogram.
• The observations, or any number of clusters, can be selected by cutting the tree at the correct level. The most common example of this method is the agglomerative hierarchical algorithm.
Fuzzy Clustering

• Fuzzy clustering is a type of soft method in which a data object may belong to more than one group or cluster.
• Each data point has a set of membership coefficients, which express its degree of membership in each cluster.
• The fuzzy c-means algorithm is the example of this type of clustering; it is sometimes also known as the fuzzy k-means algorithm (a small sketch follows below).
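• Since library support for fuzzy c-means varies, the sketch below hand-codes the standard membership/centroid updates; the data, the fuzzifier m, and the iteration count are illustrative assumptions:

    import numpy as np

    rng = np.random.default_rng(0)
    X = np.array([[1.0, 1.0], [1.2, 0.8], [8.0, 8.0], [7.8, 8.2]])
    c, m = 2, 2.0                                # number of clusters, fuzzifier
    u = rng.dirichlet(np.ones(c), size=len(X))   # memberships sum to 1 per point

    for _ in range(100):
        w = u ** m                               # fuzzified memberships
        centroids = (w.T @ X) / w.sum(axis=0)[:, None]
        # Distance of every point to every centroid (epsilon avoids divide-by-zero).
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2) + 1e-12
        inv = d ** (-2.0 / (m - 1.0))            # closer centroid -> larger value
        u = inv / inv.sum(axis=1, keepdims=True) # renormalize memberships

    print(u.round(2))  # each row: degrees of membership in the two clusters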
Clustering algorithms
• K-Means algorithm
• Mean-shift algorithm
• DBSCAN Algorithm: It stands for Density-Based
Spatial Clustering of Applications with Noise.
• Expectation-Maximization Clustering using
GMM
• Agglomerative Hierarchical algorithm
• Affinity Propagation
Distance measures used in clustering
• Euclidean distance
• City block distance
• Chebyshev distance
• For binary attributes: simple matching coefficient
• Jaccard coefficient
• Hamming distance
• Distance measures for categorical variables
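• The common measures above are all available in scipy.spatial.distance; a quick sketch:

    from scipy.spatial import distance

    a, b = [0, 3, 4, 5], [7, 6, 3, 1]
    print(distance.euclidean(a, b))  # straight-line distance
    print(distance.cityblock(a, b))  # city block (Manhattan) distance
    print(distance.chebyshev(a, b))  # largest coordinate difference

    # Binary attributes:
    p, q = [1, 0, 1, 1], [1, 1, 0, 1]
    print(distance.hamming(p, q))    # fraction of positions that differ
    print(distance.jaccard(p, q))    # 1 - Jaccard coefficient
    # The simple matching coefficient of binary vectors is 1 - hamming(p, q).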
Hierarchical Clustering Algorithms
• Hierarchical methods produce a nested partition of objects with hierarchical relationships among them. These relationships are shown in the form of dendrograms.
• Two categories of methods are used here: agglomerative methods and divisive methods.
Agglomerative Clustering Algorithm
• Place each of the N samples or data instances into a separate cluster, so initially N clusters are available.
• Repeat the following steps until a single cluster is formed (see the sketch below):
• (i) Determine the two most similar clusters.
• (ii) Merge the two clusters into a single cluster, reducing the number of clusters to N−1, then N−2, and so on.
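• A minimal sketch of this bottom-up merging with SciPy; the linkage method names correspond to the types listed on the next slide (the data and cluster count are illustrative assumptions):

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    X = np.array([[1, 1], [1.5, 1], [5, 5], [5.5, 5], [9, 1]])

    # 'single' = MIN linkage; 'complete' = MAX; 'average' is also available.
    Z = linkage(X, method='single')   # Z records the sequence of merges

    # Cut the hierarchy to obtain, say, 3 flat clusters.
    labels = fcluster(Z, t=3, criterion='maxclust')
    print(labels)
    # scipy.cluster.hierarchy.dendrogram(Z) would draw the tree (dendrogram).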
Agglomerative Clustering Algorithm types
• Single linkage or MIN algorithm
• Complete linkage or MAX or clique
• Average linkage
Mean shift clustering
• Mean shift is a centroid-based, mode-seeking clustering algorithm that assigns the data points to clusters iteratively by shifting points towards the mode (in the context of mean shift, the mode is the region of highest density of data points).
• As such, it is also known as the mode-seeking algorithm or sliding-window algorithm.
• The mean-shift algorithm has applications in the fields of image processing and computer vision.
Mean shift clustering
• Mean-shift clustering is a non-parametric clustering algorithm that can be used to identify clusters in a dataset.
• No prior knowledge of the number of clusters, or of the shapes of the clusters present in the dataset, is needed.
• The algorithm slowly moves each point from its initial position towards the dense regions.
• Mean-shift clustering can be applied to various types of data, including image and video processing, object tracking and bioinformatics.
Mean shift clustering
• The algorithm uses a window with an associated weighting function; the entire window is called a kernel.
• A Gaussian window is one example of such a window.
• The radius of the kernel is called the bandwidth.
• The window is based on the concept of a kernel density function and is used to find the underlying data distribution.
• The method of calculating the mean depends on the type of window.
The process of mean-shift clustering
• 1. Design a window.
• 2. Place the window on a set of data points.
• 3. Compute the mean of all points that fall inside the window.
• 4. Move the centre of the window to the mean computed in step 3. Thus the window moves towards the dense region.
• 5. The movement towards the dense region is controlled by the mean shift vector V_s. For the set S of points inside the window centred at x, it is given as

  V_s = (1/|S|) Σ_{x_i ∈ S} (x_i − x)

• The centroid is updated as x = x + V_s.
• Repeat steps 3–4 until convergence. Once convergence is achieved, no further points can be accommodated.
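• A minimal sketch of the window-shifting loop above, using a flat window of the given bandwidth (the data and bandwidth values are illustrative assumptions):

    import numpy as np

    X = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1],
                  [6.0, 6.0], [6.2, 5.9], [5.8, 6.1]])
    bandwidth = 1.5

    def shift_to_mode(x, X, bandwidth, tol=1e-4):
        """Slide the window centred at x towards the nearest dense region."""
        while True:
            inside = X[np.linalg.norm(X - x, axis=1) <= bandwidth]  # points in window
            vs = inside.mean(axis=0) - x      # mean shift vector Vs
            x = x + vs                        # centroid update x = x + Vs
            if np.linalg.norm(vs) < tol:      # converged: window stops moving
                return x

    modes = np.array([shift_to_mode(x, X, bandwidth) for x in X])
    print(np.round(modes, 2))  # each point collapses onto its cluster's mode

• scikit-learn's MeanShift class implements the same idea with a flat kernel and can estimate the bandwidth automatically.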
Advantages
• No model assumptions.
• Suitable for all convex cluster shapes.
• Only one parameter, the window bandwidth, is required.
• Robust to noise.
• No issues of local minima or premature termination.
Disadvantages
• Selecting the bandwidth is a challenging task: if it is too large, many clusters are missed; if it is too small, many points are missed and convergence becomes a problem.
• The number of clusters cannot be specified, and the user has no control over this parameter.
Problem link and formulas
• https://fdslive.oup.com/asiaed/interactive/9780190127275/chapter_13/0
3_Section_13.3.4_QR_Code_Content.docx
k-MEANS CLUSTERING
• An iterative, partitional clustering algorithm.
• k stands for the user-specified number of clusters.
• Clusters do not overlap in this method.
• The algorithm detects cluster shapes that are circular or spherical.
• The algorithm needs initialization: it randomly selects k data points as initial cluster centroids.
• The algorithm assigns each data point to one of the k clusters based on the distance of the point from the centroid of the cluster.
k-Means Clustering Algorithm
• 1. Determine the number of clusters before the algorithm is started.
• 2. Choose k instances randomly; these are the initial cluster centres.
• 3. Compute the mean of each initial cluster and assign each remaining sample to the closest cluster, based on the Euclidean distance (or any other distance measure) between the instance and the centroids of the clusters.
k-Means Clustering Algorithm
• 4. Compute the new centroids again, considering the newly added samples.
• 5. Repeat steps 3–4 until the algorithm becomes stable, with no more changes in the assignment of instances to clusters.
• SSE (sum of squared errors) is a metric that measures the error as the sum of the squared Euclidean distances from each data point to its closest centroid c:

  SSE = Σ_k Σ_{x ∈ C_k} ||x − c_k||²
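• A minimal sketch with scikit-learn, whose inertia_ attribute is exactly this SSE (the toy data is an illustrative assumption):

    import numpy as np
    from sklearn.cluster import KMeans

    X = np.array([[1, 2], [1, 4], [1, 0],
                  [10, 2], [10, 4], [10, 0]], dtype=float)

    km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
    print(km.labels_)           # cluster assignment of each point
    print(km.cluster_centers_)  # final centroids
    print(km.inertia_)          # SSE: squared distances to the closest centroids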
How to find the optimum value of k
• The algorithm is run for different values of k.
• For each k, compute the SSE within the groups and plot a line graph.
• This plot is called the elbow curve.
• The optimal value of k is identified at the "elbow", the point where the curve starts to flatten out.
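• A sketch of the elbow search, reusing inertia_ as the per-k SSE (the data and the range of k are illustrative assumptions):

    import numpy as np
    from sklearn.cluster import KMeans

    X = np.array([[1, 2], [1, 4], [1, 0],
                  [10, 2], [10, 4], [10, 0]], dtype=float)

    # SSE for each candidate k; plot k vs. SSE and pick the "elbow",
    # i.e. the k where the drop in SSE starts to flatten out.
    sse = {k: KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
           for k in range(1, 6)}
    print(sse)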
Advantages and Disadvantages
• Advantages: simple and easy to implement.
• Disadvantages:
• It is sensitive to the initialization process, as a change of initial points leads to different clusters.
• If the number of samples is large, the algorithm takes a long time to form clusters.
Problem
• Consider the set of data given in the table below. Cluster it using the k-means algorithm, with objects 2 and 5, having coordinates (4,6) and (12,4), as the initial seeds.

Object | X-Coordinate | Y-Coordinate
1 | 2 | 4
2 | 4 | 6
3 | 6 | 8
4 | 10 | 4
5 | 12 | 4
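• A sketch that runs k-means on exactly this data, seeding the two clusters with objects 2 and 5 as stated:

    import numpy as np
    from sklearn.cluster import KMeans

    X = np.array([[2, 4], [4, 6], [6, 8], [10, 4], [12, 4]], dtype=float)
    seeds = np.array([[4, 6], [12, 4]], dtype=float)  # objects 2 and 5

    km = KMeans(n_clusters=2, init=seeds, n_init=1).fit(X)
    print(km.labels_)           # [0 0 0 1 1]: objects 1-3 vs. objects 4-5
    print(km.cluster_centers_)  # final centroids (4, 6) and (11, 4)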
Expectation Maximization algorithm
• A soft clustering algorithm.
• Clustering is done via a statistical model.
• A statistical model is described in terms of a distribution and a set of parameters.
• The data is assumed to be generated by a process, and the focus is on describing the data by finding a model that fits it.
• The data is assumed to be generated by multiple distributions, as in a Gaussian mixture model.
Expectation Maximization..
• The Gaussian distribution is a bell-shaped curve.
• The distribution is characterized by two parameters, the mean and the standard deviation (sometimes the variance is used instead).
• The peak of the bell-shaped curve occurs at the mean.
• The standard deviation determines the spread of the curve.
• In the multi-dimensional Gaussian function, the mean is a vector and the variance takes the form of a covariance matrix.
Expectation Maximization..
• Assume that:
• K = number of distributions
• n = number of samples
• Θ = [θ_1, θ_2, ..., θ_K] is the set of parameters associated with the distributions.
• θ_j is the parameter of the jth distribution.
• Then p(x_i | θ_j) is the probability of the ith object coming from the jth distribution.
Expectation Maximization..
• If the probability of the jth distribution being chosen is given by the weight w_j, 1 ≤ j ≤ K, then

  p(x_i | Θ) = Σ_{j=1}^{K} w_j p_j(x_i | θ_j)

• If all the points are generated independently, then the probability of the entire set of objects is

  p(X | Θ) = Π_{i=1}^{n} p(x_i | Θ) = Π_{i=1}^{n} Σ_{j=1}^{K} w_j p_j(x_i | θ_j)
Expectation Maximization..
• Every data point is assumed to be generated by a distribution.
• To describe a data point, the corresponding distribution and its parameters should be known.
• If a Gaussian distribution is assumed, then the probability that the data belong to that distribution should be learnt. This is given as

  p(X | μ, σ) = Π_{i=1}^{n} (1 / √(2πσ²)) e^{−(x_i − μ)² / (2σ²)}
Expectation Maximization..
• The parameters μ and σ should be chosen so that the above equation is maximized.
• This is known as the maximum likelihood principle.
• The objective of the EM algorithm is to maximize the likelihood of the observations by selecting the proper parameters.
• The EM algorithm works in two stages:
• 1. Expectation step
• 2. Maximization step
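• A tiny numerical check of the maximum likelihood principle: among candidate values of μ, the log of the likelihood above peaks at the sample mean (the data and σ are illustrative assumptions):

    import numpy as np

    x = np.array([4.8, 5.1, 5.3, 4.9, 5.4])
    sigma = 0.5

    def log_likelihood(mu):
        # log of prod_i (1/sqrt(2*pi*sigma^2)) * exp(-(x_i - mu)^2 / (2*sigma^2))
        return np.sum(-0.5 * np.log(2 * np.pi * sigma**2)
                      - (x - mu)**2 / (2 * sigma**2))

    candidates = np.linspace(4.0, 6.0, 201)
    best = candidates[np.argmax([log_likelihood(mu) for mu in candidates])]
    print(best, x.mean())  # the best candidate sits at the sample mean (5.1)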
Expectation Maximization..

• Expectation step, the probability of each data


point generated by k-Gaussian function is
computed.
• Maximization step, the parameters are
updated.
EM Algorithm
• 1. Select the parameters randomly.
• 2. In the expectation stage, the conditional probability is computed for each point.
• 3. In the maximization stage, the new parameters are computed.
• 4. Repeat steps 2–3 until the change is minimal (within a threshold value) or the parameters do not change at all.
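• A minimal sketch with scikit-learn's GaussianMixture, which runs exactly this E-step/M-step loop internally (the data and component count are illustrative assumptions):

    import numpy as np
    from sklearn.mixture import GaussianMixture

    rng = np.random.default_rng(0)
    # Data drawn from two Gaussians, as the mixture model assumes.
    X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(6, 1, (100, 2))])

    gmm = GaussianMixture(n_components=2, random_state=0).fit(X)  # EM to convergence
    print(gmm.means_)                # learnt means (M-step result)
    print(gmm.weights_)              # mixture weights w_j
    print(gmm.predict_proba(X[:3]))  # E-step output: soft memberships per point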
