
Cluster Analysis

Clustering:

The process of grouping a set of physical or abstract objects into classes of similar objects is
called clustering.

Cluster:

A cluster is a collection of data objects that are similar to one another within the same cluster and
are dissimilar to the objects in other clusters.

Cluster Analysis:

• It is an important human activity. Automated clustering is used to identify dense and
sparse regions in object space and can therefore discover overall distribution patterns and
interesting correlations among data attributes.

• It has been widely used in numerous applications, including market research, pattern
recognition, data analysis and image processing.

Requirements of Clustering in data mining:

• Scalability

• Ability to deal with different types of attributes

• Discovery of clusters with arbitrary shape

• Minimal requirements for domain knowledge to determine input parameters

• Ability to deal with noisy data

• Incremental clustering and insensitivity to the order of input records

• High dimensionality

• Constraint-based clustering

• Interpretability and usability

The major clustering methods can be classified into the following categories.

• Partitioning methods

• Hierarchical methods

• Density-based methods

• Grid-based methods

• Model-based methods

Partitioning methods:

Given a database of n objects or data tuples, a partitioning method constructs k partitions of the
data, where each partition represents a cluster and k<=n.

Hierarchical methods:

• These methods create a hierarchical decomposition of the given set of data objects.

• They can be classified as either agglomerative or divisive, based on how the hierarchical
decomposition is formed.

Density-based methods:

• These methods are based on the notion of density.

• The general idea is to continue growing a given cluster as long as the density (the number of
objects or data points) in the “neighborhood” exceeds some threshold.
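To make this concrete, here is a minimal Python sketch of the region-growing idea (a simplified, DBSCAN-style illustration; the eps and min_pts parameters, the example points and all function names are assumptions for illustration, not part of the notes):

from math import dist

def region_query(points, p, eps):
    """Return indices of all points within distance eps of points[p]."""
    return [i for i, q in enumerate(points) if dist(points[p], q) <= eps]

def grow_cluster(points, seed, eps, min_pts):
    """Grow one cluster from a seed point: keep expanding while the
    neighborhood density (number of points within eps) meets the threshold."""
    cluster, frontier, visited = set(), [seed], {seed}
    while frontier:
        p = frontier.pop()
        neighbors = region_query(points, p, eps)
        if len(neighbors) >= min_pts:            # dense enough, so keep growing
            cluster.add(p)
            for n in neighbors:
                if n not in visited:
                    visited.add(n)
                    frontier.append(n)
    return cluster

points = [(0, 0), (0, 1), (1, 0), (1, 1), (8, 8)]
print(grow_cluster(points, seed=0, eps=1.5, min_pts=3))   # finds the dense square, not (8, 8)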

Grid-based methods:

• These methods quantize the object space into a finite number of cells that form a grid structure.

• All of the clustering operations are performed on the grid structure.

Model-based methods:

• These methods hypothesize a model for each of the clusters and find the best fit of the data
to the given model.

• In addition, there are two classes of clustering tasks: clustering high-dimensional data and
constraint-based clustering.

• Clustering high-dimensional data: This is an important task in cluster analysis because many
applications require the analysis of objects containing a large number of features or
dimensions.

• Constraint-based clustering: This performs clustering by incorporating user-specified or
application-oriented constraints.
Partitioning Methods

Given D, a set of n objects, and k, the number of clusters to form, a partitioning algorithm
organizes the objects into k partitions (k<=n), where each partition represents a cluster.

Classical Partitioning Methods:

1. k-means

2. k-medoids

1. Centroid-based techniques: The k-means method:

• The k-means algorithm takes the input parameter, k, and partitions a set of n
objects into k clusters so that the resulting intracluster similarity is high but the
intercluster similarity is low.

• Cluster similarity is measured in regard to the mean value of the objects in a
cluster, which can be viewed as the cluster’s centroid or center of gravity.

Algorithm:

Input:

• k: the number of clusters

• D: a data set containing n objects

Output: A set of k clusters

Method:

• Arbitrarily choose k objects from D as the initial cluster centers;

• Repeat

• (re)assign each object to the cluster to which the object is the most similar,
based on the mean value of the objects in the cluster;

• Update the cluster means; i.e., calculate the mean value of the objects for
each cluster;

• Until no change;
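These steps translate almost directly into code. Below is a minimal Python sketch on 2-D points (standard library only; the function name k_means, the small example data set and the max_iter safeguard are illustrative assumptions, not part of the notes):

from math import dist
from random import sample

def k_means(D, k, max_iter=100):
    """Partition the points in D into k clusters around their mean values."""
    centers = sample(D, k)                       # arbitrarily choose k initial centers
    for _ in range(max_iter):
        # (Re)assign each object to the cluster whose mean value is nearest.
        clusters = [[] for _ in range(k)]
        for p in D:
            nearest = min(range(k), key=lambda i: dist(p, centers[i]))
            clusters[nearest].append(p)
        # Update the cluster means, i.e. recompute the mean of each cluster.
        new_centers = [
            tuple(sum(xs) / len(pts) for xs in zip(*pts)) if pts else centers[i]
            for i, pts in enumerate(clusters)
        ]
        if new_centers == centers:               # until no change
            break
        centers = new_centers
    return centers, clusters

D = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
centers, clusters = k_means(D, k=2)
print(centers)                                   # typically near (1.33, 1.33) and (8.33, 8.33)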

Strength:

This method is relatively scalable and efficient in processing large data sets.
Weakness:

• Applicable only when the mean is defined.

• Need to specify k, the number of clusters, in advance.

• Unable to handle noisy data and outliers.

• Not suitable for discovering clusters with non-convex shapes.

Problems with the k-means method:

• The k-means algorithm is sensitive to outliers, since an object with an
extremely large value may substantially distort the distribution of the data.

• k-medoids: Instead of taking the mean value of the objects in a cluster as a
reference point, a medoid can be used, which is the most centrally located object in
a cluster.

2. Representative object-based technique: The k-medoids method:

• Pick actual objects to represent the clusters, using one representative
object per cluster. Each remaining object is clustered with the representative
object to which it is the most similar.

• The partitioning is then performed based on the principle of
minimizing the sum of the dissimilarities between each object and its
corresponding reference point.

PAM (Partitioning Around Medoids):

• PAM was one of the first k-medoids algorithms. PAM starts from an initial set
of medoids and iteratively replaces one of the medoids with one of the non-medoids
if it improves the total distance of the resulting clustering.

• It works effectively for small data sets, but does not scale well for large
data sets.

• It is a typical k-medoids algorithm.

Algorithm: k-medoids. PAM, a k-medoids algorithm for partitioning based on medoids or central
objects.

Input:

1. k: the number of clusters

2. D: a data set containing n objects

Output: A set of k clusters

Method:

1. Arbitrarily choose k objects in D as the initial representative objects or seeds;

2. Repeat

3. Assign each remaining object to the cluster with the nearest representative object;

4. Randomly select a non-representative object, O_random;

5. Compute the total cost, S, of swapping representative object O_j with O_random;

6. If S < 0 then swap O_j with O_random to form the new set of k representative objects;

7. Until no change.
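A minimal Python sketch of these steps is given below (standard library only; a fixed number of random swap trials, n_trials, stands in for the "until no change" test, and the function names and example data are illustrative assumptions):

from math import dist
from random import sample, choice

def total_cost(D, medoids):
    """Sum of dissimilarities between each object and its nearest medoid."""
    return sum(min(dist(p, m) for m in medoids) for p in D)

def pam(D, k, n_trials=200):
    """PAM-style swap search over k representative objects (medoids)."""
    medoids = sample(D, k)                        # arbitrarily choose k seeds
    cost = total_cost(D, medoids)
    for _ in range(n_trials):
        o_j = choice(medoids)                     # a current representative object
        o_random = choice([p for p in D if p not in medoids])  # a non-representative object
        candidate = [o_random if m == o_j else m for m in medoids]
        s = total_cost(D, candidate) - cost       # S < 0 means the swap improves the clustering
        if s < 0:
            medoids, cost = candidate, cost + s
    # Assign each remaining object to the cluster of its nearest representative object.
    clusters = {m: [p for p in D if min(medoids, key=lambda x: dist(p, x)) == m]
                for m in medoids}
    return medoids, clusters

D = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (50, 50)]    # (50, 50) is an outlier
medoids, clusters = pam(D, k=2)
print(medoids)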

Strengths and weaknesses of PAM:

• PAM is more robust than k-means in the presence of noise and outliers,
because a medoid is less influenced by outliers or other extreme values than a
mean.

• PAM works efficiently for small data sets but does not scale well for large
data sets.

Partitioning Methods in Large Databases:

Sampling-based methods are used to deal with larger data sets.

CLARA (Clustering LARge Applications)

• Instead of taking the whole set of data into consideration, a small portion
of the actual data is chosen as a representative of the data.

• Medoids are then chosen from this sample using PAM (a small code sketch follows
the weaknesses below).

Strength: Deals with larger data sets than PAM

Weakness:

• Efficiency depends on the sample size.

• A good clustering based on samples will not necessarily represent a good
clustering of the whole data set if the sample is biased.
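As referenced above, here is a minimal Python sketch of the CLARA idea, reusing the pam() function sketched earlier (the sample_size and n_samples parameters and the function name are assumptions chosen for illustration):

from math import dist
from random import sample

def clara(D, k, sample_size=40, n_samples=5):
    """Run PAM on a few small random samples and keep the medoid set that
    gives the lowest total dissimilarity over the whole data set."""
    best_medoids, best_cost = None, float("inf")
    for _ in range(n_samples):
        S = sample(D, min(sample_size, len(D)))   # a small portion of the actual data
        medoids, _ = pam(S, k)                    # medoids are chosen from the sample using PAM
        # Judge the sample-based medoids on the full data set; a biased sample
        # can still produce a poor clustering of the whole data set.
        cost = sum(min(dist(p, m) for m in medoids) for p in D)
        if cost < best_cost:
            best_medoids, best_cost = medoids, cost
    return best_medoids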
CLARANS (Clustering Large Applications based upon RANdomized Search)

• It combines sampling techniques with PAM.

• It draws a sample of neighbors dynamically. The clustering process can be
represented as searching a graph where every node is a potential solution, that is, a
set of k medoids.

• If a local optimum is found, CLARANS starts with a new randomly selected
node in search of a new local optimum.

• It is more efficient and scalable than both PAM and CLARA.
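The graph-search idea can be sketched in Python as follows (each node is a set of k medoids, and a neighbor differs in exactly one medoid; the num_local and max_neighbor parameters, the cost function and all other names are assumptions for illustration, not the interface of the original CLARANS algorithm):

from math import dist
from random import sample, choice

def cost(D, medoids):
    return sum(min(dist(p, m) for m in medoids) for p in D)

def clarans(D, k, num_local=5, max_neighbor=20):
    """Randomized search over the graph whose nodes are sets of k medoids;
    two nodes are neighbors if they differ in exactly one medoid."""
    best, best_cost = None, float("inf")
    for _ in range(num_local):                    # restart from a new randomly selected node
        current = sample(D, k)
        current_cost = cost(D, current)
        tries = 0
        while tries < max_neighbor:
            # Draw one neighbor dynamically: swap one medoid for one non-medoid.
            i = choice(range(k))
            o_random = choice([p for p in D if p not in current])
            neighbor = current[:i] + [o_random] + current[i + 1:]
            neighbor_cost = cost(D, neighbor)
            if neighbor_cost < current_cost:      # move to the better neighbor
                current, current_cost, tries = neighbor, neighbor_cost, 0
            else:
                tries += 1
        if current_cost < best_cost:              # current node is treated as a local optimum
            best, best_cost = current, current_cost
    return best, best_cost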
