
Unit 3 Clustering Approaches

Dr. M. Thamarai
Professor/ECE
SVEC
Introduction
• Cluster analysis is a technique for partitioning a collection of unlabelled objects/data with many attributes into meaningful disjoint groups, or clusters.
• Cluster analysis is a fundamental task of unsupervised learning.
• Visually identifying and grouping similar data points is easy if the dataset has few attributes (only two features).
• But for a dataset with n features, the clustering process requires automatic clustering algorithms.
Clustering..
• "A way of grouping the data points into
different clusters, consisting of similar data
points. The objects with the possible
similarities remain in a group that has less or
no similarities with another group.“
• It does it by finding some similar patterns in the
unlabelled dataset such as shape, size, color,
behavior, etc., and divides them as per the
presence and absence of those similar patterns.
Clustering…
• Each cluster is represented by a centroid.
• For example, the data points (3,3), (2,6) and (7,9) have the centroid (4,6).
• Clusters should not overlap, and every cluster should represent only one class.
• Clustering algorithms form clusters by trial and error, and the clusters can then be converted into labels.
• After applying a clustering technique, each cluster or group is given a cluster-ID.
• An ML system can use this ID to simplify the processing of large and complex datasets.
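• As a quick check of the centroid above, the sketch below (assuming NumPy is available) computes the coordinate-wise mean of the three example points:

    import numpy as np

    # The three example data points from this slide.
    points = np.array([[3, 3], [2, 6], [7, 9]])

    # A centroid is the coordinate-wise mean of the cluster's points.
    centroid = points.mean(axis=0)
    print(centroid)  # [4. 6.] -- matches the centroid (4, 6) given above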
Difference between classification and clustering

S.No. | Clustering | Classification
1 | Unsupervised learning; clusters are formed by trial and error, as there is no supervisor. | Supervised learning, with a supervisor present to provide training.
2 | Unlabelled data. | Labelled data.
3 | No prior knowledge is required for clustering. | Knowledge of the domain is a must to label unseen data.
4 | Cluster results are dynamic. | Once a label is assigned, it does not change.
Applications of clustering
• Grouping customers based on buying patterns
• Profiling customers based on lifestyle
• Document indexing
• Taxonomy of animals and plants in biology
• Market segmentation
• Statistical data analysis
• Social network analysis
• Image segmentation
• Anomaly detection, etc.
Types of Clustering Methods

• Partitioning Clustering
• Density-Based Clustering
• Distribution Model-Based Clustering
• Hierarchical Clustering
• Fuzzy Clustering
Partitioning Clustering
• A type of clustering that divides the data into non-hierarchical groups; it is also known as the centroid-based clustering method.
• The most common example of partitioning clustering is the k-means clustering algorithm.
• In this type, the dataset is divided into a set of k groups, where k defines the number of pre-defined groups.
• The cluster centres are chosen so that the distance between a data point and its own cluster centroid is minimal compared with the other cluster centroids.
Density-Based Clustering

• The density-based clustering method connects highly dense areas into clusters, and arbitrarily shaped distributions are formed as long as the dense regions can be connected.
• The algorithm does this by identifying different clusters in the dataset and connecting areas of high density into clusters.
• The dense areas in data space are separated from each other by sparser areas.
• These algorithms can have difficulty clustering the data points if the dataset has varying densities and high dimensionality.
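• A minimal sketch of density-based clustering with scikit-learn's DBSCAN (the data, eps and min_samples values are illustrative assumptions):

    import numpy as np
    from sklearn.cluster import DBSCAN

    # Two dense blobs plus one far-away point that should be treated as noise.
    X = np.array([[1.0, 1.0], [1.1, 1.0], [0.9, 1.2],
                  [8.0, 8.0], [8.1, 7.9], [7.9, 8.2],
                  [30.0, 30.0]])

    # eps: radius of the dense neighbourhood; min_samples: points needed
    # inside that radius for a point to count as a core point.
    db = DBSCAN(eps=0.5, min_samples=2).fit(X)
    print(db.labels_)  # [0 0 0 1 1 1 -1]; noise points are labelled -1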
Distribution Model-Based Clustering

• In the distribution-model-based clustering method, the data is divided based on the probability that a data point belongs to a particular distribution.
• The grouping is done by assuming some distribution, most commonly the Gaussian distribution.
• The example of this type is the Expectation-Maximization clustering algorithm, which uses Gaussian mixture models (GMM).
Hierarchical Clustering

• Hierarchical clustering can be used as an alternative to partitional clustering, as there is no requirement to pre-specify the number of clusters to be created.
• In this technique, the dataset is divided into clusters to create a tree-like structure, which is also called a dendrogram.
• The observations, or any number of clusters, can be selected by cutting the tree at the correct level. The most common example of this method is the agglomerative hierarchical algorithm.
Fuzzy Clustering

• Fuzzy clustering is a type of soft method in which a data object may belong to more than one group or cluster.
• Each data point has a set of membership coefficients, which express its degree of membership in each cluster.
• The fuzzy c-means algorithm is the example of this type of clustering; it is sometimes also known as the fuzzy k-means algorithm (a small sketch follows below).
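• Since library support for fuzzy c-means varies, the sketch below hand-codes the standard membership/centroid updates; the data, the fuzzifier m, and the iteration count are illustrative assumptions:

    import numpy as np

    rng = np.random.default_rng(0)
    X = np.array([[1.0, 1.0], [1.2, 0.8], [8.0, 8.0], [7.8, 8.2]])
    c, m = 2, 2.0                                # number of clusters, fuzzifier
    u = rng.dirichlet(np.ones(c), size=len(X))   # memberships sum to 1 per point

    for _ in range(100):
        w = u ** m                               # fuzzified memberships
        centroids = (w.T @ X) / w.sum(axis=0)[:, None]
        # Distance of every point to every centroid (epsilon avoids divide-by-zero).
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2) + 1e-12
        inv = d ** (-2.0 / (m - 1.0))            # closer centroid -> larger value
        u = inv / inv.sum(axis=1, keepdims=True) # renormalize memberships

    print(u.round(2))  # each row: degrees of membership in the two clusters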
Clustering algorithms
• K-Means algorithm
• Mean-shift algorithm
• DBSCAN Algorithm: It stands for Density-Based
Spatial Clustering of Applications with Noise.
• Expectation-Maximization Clustering using
GMM
• Agglomerative Hierarchical algorithm
• Affinity Propagation
Distance measures used in clustering
• Euclidean distance
• City block distance
• Chebyshev distance
• For binary attributes: simple matching coefficient
• Jaccard coefficient
• Hamming distance
• Distance measures for categorical variables
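• The common measures above are all available in scipy.spatial.distance; a quick sketch:

    from scipy.spatial import distance

    a, b = [0, 3, 4, 5], [7, 6, 3, 1]
    print(distance.euclidean(a, b))  # straight-line distance
    print(distance.cityblock(a, b))  # city block (Manhattan) distance
    print(distance.chebyshev(a, b))  # largest coordinate difference

    # Binary attributes:
    p, q = [1, 0, 1, 1], [1, 1, 0, 1]
    print(distance.hamming(p, q))    # fraction of positions that differ
    print(distance.jaccard(p, q))    # 1 - Jaccard coefficient
    # The simple matching coefficient of binary vectors is 1 - hamming(p, q).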
Hierarchical Clustering Algorithms
• Hierarchical methods produce a nested partition of objects with hierarchical relationships among them. These relationships are shown in the form of dendrograms.
• Two categories of methods are used here: agglomerative methods and divisive methods.
Agglomerative Clustering Algorithm
• Place each of the N samples or data instances into a separate cluster, so initially N clusters are available.
• Repeat the following steps until a single cluster is formed (see the sketch below):
• (i) Determine the two most similar clusters.
• (ii) Merge the two clusters into a single cluster, reducing the number of clusters to N−1, then N−2, and so on.
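• A minimal sketch of this bottom-up merging with SciPy; the linkage method names correspond to the types listed on the next slide (the data and cluster count are illustrative assumptions):

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    X = np.array([[1, 1], [1.5, 1], [5, 5], [5.5, 5], [9, 1]])

    # 'single' = MIN linkage; 'complete' = MAX; 'average' is also available.
    Z = linkage(X, method='single')   # Z records the sequence of merges

    # Cut the hierarchy to obtain, say, 3 flat clusters.
    labels = fcluster(Z, t=3, criterion='maxclust')
    print(labels)
    # scipy.cluster.hierarchy.dendrogram(Z) would draw the tree (dendrogram).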
Agglomerative Clustering Algorithm types
• Single linkage or MIN algorithm
• Complete linkage or MAX or clique
• Average linkage
Mean shift clustering
• Mean shift is a centroid-based, mode-seeking clustering algorithm that assigns the data points to clusters iteratively by shifting points towards the mode (in the context of mean shift, the mode is the region of highest density of data points).
• As such, it is also known as the mode-seeking algorithm or sliding-window algorithm.
• The mean-shift algorithm has applications in the fields of image processing and computer vision.
Mean shift clustering
• Mean-shift clustering is a non-parametric clustering algorithm that can be used to identify clusters in a dataset.
• No prior knowledge of the number of clusters, or of the shapes of the clusters present in the dataset, is needed.
• The algorithm slowly moves each point from its initial position towards the dense regions.
• Mean-shift clustering can be applied to various types of data, including image and video processing, object tracking and bioinformatics.
Mean shift clustering
• The algorithm uses a window with an associated weighting function; the entire window is called a kernel.
• A Gaussian window is one example of such a window.
• The radius of the kernel is called the bandwidth.
• The window is based on the concept of a kernel density function and is used to find the underlying data distribution.
• The method of calculating the mean depends on the type of window.
The process of mean-shift clustering
• 1. Design a window.
• 2. Place the window on a set of data points.
• 3. Compute the mean of all points that fall inside the window.
• 4. Move the centre of the window to the mean computed in step 3. Thus the window moves towards the dense region.
• 5. The movement towards the dense region is controlled by the mean shift vector V_s. For the set S of points inside the window centred at x, it is given as

  V_s = (1/|S|) Σ_{x_i ∈ S} (x_i − x)

• The centroid is updated as x = x + V_s.
• Repeat steps 3–4 until convergence. Once convergence is achieved, no further points can be accommodated.
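• A minimal sketch of the window-shifting loop above, using a flat window of the given bandwidth (the data and bandwidth values are illustrative assumptions):

    import numpy as np

    X = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1],
                  [6.0, 6.0], [6.2, 5.9], [5.8, 6.1]])
    bandwidth = 1.5

    def shift_to_mode(x, X, bandwidth, tol=1e-4):
        """Slide the window centred at x towards the nearest dense region."""
        while True:
            inside = X[np.linalg.norm(X - x, axis=1) <= bandwidth]  # points in window
            vs = inside.mean(axis=0) - x      # mean shift vector Vs
            x = x + vs                        # centroid update x = x + Vs
            if np.linalg.norm(vs) < tol:      # converged: window stops moving
                return x

    modes = np.array([shift_to_mode(x, X, bandwidth) for x in X])
    print(np.round(modes, 2))  # each point collapses onto its cluster's mode

• scikit-learn's MeanShift class implements the same idea with a flat kernel and can estimate the bandwidth automatically.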
Advantages
• No model assumptions.
• Suitable for all convex cluster shapes.
• Only one parameter, the window bandwidth, is required.
• Robust to noise.
• No issues of local minima or premature termination.
Disadvantages
• Selecting the bandwidth is a challenging task: if it is too large, many clusters are missed; if it is too small, many points are missed and convergence becomes a problem.
• The number of clusters cannot be specified, and the user has no control over this parameter.
Problem link and formulas
• https://fdslive.oup.com/asiaed/interactive/9780190127275/chapter_13/0
3_Section_13.3.4_QR_Code_Content.docx
k-MEANS CLUSTERING
• An iterative, partitional clustering algorithm.
• k stands for the user-specified number of clusters.
• Clusters do not overlap in this method.
• The algorithm detects cluster shapes that are circular or spherical.
• The algorithm needs initialization: it randomly selects k data points as initial cluster centroids.
• The algorithm assigns each data point to one of the k clusters based on the distance of the point from the centroid of the cluster.
k-Means Clustering Algorithm
• 1. Determine the number of clusters before the algorithm is started.
• 2. Choose k instances randomly; these are the initial cluster centres.
• 3. Compute the mean of each initial cluster and assign each remaining sample to the closest cluster, based on the Euclidean distance (or any other distance measure) between the instance and the centroids of the clusters.
k-Means Clustering Algorithm
• 4. Compute the new centroids again, considering the newly added samples.
• 5. Repeat steps 3–4 until the algorithm becomes stable, with no more changes in the assignment of instances to clusters.
• SSE (sum of squared errors) is a metric that measures the error as the sum of the squared Euclidean distances from each data point to its closest centroid c:

  SSE = Σ_k Σ_{x ∈ C_k} ||x − c_k||²
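• A minimal sketch with scikit-learn, whose inertia_ attribute is exactly this SSE (the toy data is an illustrative assumption):

    import numpy as np
    from sklearn.cluster import KMeans

    X = np.array([[1, 2], [1, 4], [1, 0],
                  [10, 2], [10, 4], [10, 0]], dtype=float)

    km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
    print(km.labels_)           # cluster assignment of each point
    print(km.cluster_centers_)  # final centroids
    print(km.inertia_)          # SSE: squared distances to the closest centroids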
How to find the optimum value of k
• The algorithm is run for different values of k.
• For each k, compute the SSE within the groups and plot a line graph.
• This plot is called the elbow curve.
• The optimal value of k is identified at the "elbow", the point where the curve starts to flatten out.
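• A sketch of the elbow search, reusing inertia_ as the per-k SSE (the data and the range of k are illustrative assumptions):

    import numpy as np
    from sklearn.cluster import KMeans

    X = np.array([[1, 2], [1, 4], [1, 0],
                  [10, 2], [10, 4], [10, 0]], dtype=float)

    # SSE for each candidate k; plot k vs. SSE and pick the "elbow",
    # i.e. the k where the drop in SSE starts to flatten out.
    sse = {k: KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
           for k in range(1, 6)}
    print(sse)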
Advantages and Disadvantages
• Advantages: simple and easy to implement.
• Disadvantages:
• It is sensitive to the initialization process, as a change of initial points leads to different clusters.
• If the number of samples is large, the algorithm takes a long time to form clusters.
Problem
• Consider the set of data given in the table below. Cluster it using the k-means algorithm, with objects 2 and 5, having coordinates (4,6) and (12,4), as the initial seeds.

Object | X-Coordinate | Y-Coordinate
1 | 2 | 4
2 | 4 | 6
3 | 6 | 8
4 | 10 | 4
5 | 12 | 4
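• A sketch that runs k-means on exactly this data, seeding the two clusters with objects 2 and 5 as stated:

    import numpy as np
    from sklearn.cluster import KMeans

    X = np.array([[2, 4], [4, 6], [6, 8], [10, 4], [12, 4]], dtype=float)
    seeds = np.array([[4, 6], [12, 4]], dtype=float)  # objects 2 and 5

    km = KMeans(n_clusters=2, init=seeds, n_init=1).fit(X)
    print(km.labels_)           # [0 0 0 1 1]: objects 1-3 vs. objects 4-5
    print(km.cluster_centers_)  # final centroids (4, 6) and (11, 4)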
Expectation Maximization algorithm
• A soft clustering algorithm.
• Clustering is done via a statistical model.
• A statistical model is described in terms of a distribution and a set of parameters.
• The data is assumed to be generated by a process, and the focus is on describing the data by finding a model that fits it.
• The data is assumed to be generated by multiple distributions, as in a Gaussian mixture model.
Expectation Maximization..
• The Gaussian distribution is a bell-shaped curve.
• The distribution is characterized by two parameters, the mean and the standard deviation (sometimes the variance is used instead).
• The peak of the bell-shaped curve occurs at the mean.
• The standard deviation determines the spread of the curve.
• In the multi-dimensional Gaussian function, the mean is a vector and the variance takes the form of a covariance matrix.
Expectation Maximization..
• Assume that:
• K = number of distributions
• n = number of samples
• Θ = [θ_1, θ_2, ..., θ_K] is the set of parameters associated with the distributions.
• θ_j is the parameter of the jth distribution.
• Then p(x_i | θ_j) is the probability of the ith object coming from the jth distribution.
Expectation Maximization..
• If the probability of the jth distribution being chosen is given by the weight w_j, 1 ≤ j ≤ K, then

  p(x_i | Θ) = Σ_{j=1}^{K} w_j p_j(x_i | θ_j)

• If all the points are generated independently, then the probability of the entire set of objects is

  p(X | Θ) = Π_{i=1}^{n} p(x_i | Θ) = Π_{i=1}^{n} Σ_{j=1}^{K} w_j p_j(x_i | θ_j)
Expectation Maximization..
• Every data point is assumed to be generated by a distribution.
• To describe a data point, the corresponding distribution and its parameters should be known.
• If a Gaussian distribution is assumed, then the probability that the data belong to that distribution should be learnt. This is given as

  p(X | μ, σ) = Π_{i=1}^{n} (1 / √(2πσ²)) e^{−(x_i − μ)² / (2σ²)}
Expectation Maximization..
• The parameters μ and σ should be chosen so that the above equation is maximized.
• This is known as the maximum likelihood principle.
• The objective of the EM algorithm is to maximize the likelihood of the observations by selecting the proper parameters.
• The EM algorithm works in two stages:
• 1. Expectation step
• 2. Maximization step
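• A tiny numerical check of the maximum likelihood principle: among candidate values of μ, the log of the likelihood above peaks at the sample mean (the data and σ are illustrative assumptions):

    import numpy as np

    x = np.array([4.8, 5.1, 5.3, 4.9, 5.4])
    sigma = 0.5

    def log_likelihood(mu):
        # log of prod_i (1/sqrt(2*pi*sigma^2)) * exp(-(x_i - mu)^2 / (2*sigma^2))
        return np.sum(-0.5 * np.log(2 * np.pi * sigma**2)
                      - (x - mu)**2 / (2 * sigma**2))

    candidates = np.linspace(4.0, 6.0, 201)
    best = candidates[np.argmax([log_likelihood(mu) for mu in candidates])]
    print(best, x.mean())  # the best candidate sits at the sample mean (5.1)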
Expectation Maximization..

• Expectation step, the probability of each data


point generated by k-Gaussian function is
computed.
• Maximization step, the parameters are
updated.
EM Algorithm
• 1. Select the parameters randomly.
• 2. In the expectation stage, the conditional probability is computed for each point.
• 3. In the maximization stage, the new parameters are computed.
• 4. Repeat steps 2–3 until the change is minimal (within a threshold value) or the parameters do not change at all.
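• A minimal sketch with scikit-learn's GaussianMixture, which runs exactly this E-step/M-step loop internally (the data and component count are illustrative assumptions):

    import numpy as np
    from sklearn.mixture import GaussianMixture

    rng = np.random.default_rng(0)
    # Data drawn from two Gaussians, as the mixture model assumes.
    X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(6, 1, (100, 2))])

    gmm = GaussianMixture(n_components=2, random_state=0).fit(X)  # EM to convergence
    print(gmm.means_)                # learnt means (M-step result)
    print(gmm.weights_)              # mixture weights w_j
    print(gmm.predict_proba(X[:3]))  # E-step output: soft memberships per point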
