Clustering in data mining
Uploaded by Bhavani Viswa

CLUSTERING

 Cluster Analysis
 Types of Data in Cluster Analysis
 A Categorization of Major Clustering Methods

Cluster Analysis
 A cluster is a collection of data objects that are:
 similar to one another within the same cluster,
 dissimilar to the objects in other clusters.

 Cluster analysis finds similarities between data objects according to the characteristics found in the data and groups similar objects into clusters.

 Cluster analysis is used to form groups (clusters) of similar records based on several measures made on these records.

 Clustering is unsupervised learning, i.e., there are no predefined classes.

 Cluster analysis has been applied in many areas, including astronomy, archaeology, medicine, chemistry, education, psychology, linguistics and sociology.
Application of Clustering
 Medical image databases
 Pattern recognition
 Spatial data analysis
 Detecting spatial clusters and other spatial mining tasks
 Image processing
 Economic science
 WWW
Requirements of Clustering in Data Mining

 The following are typical requirements of clustering in data mining:

 Scalability.
 Ability to deal with different types of attributes.
 Discovery of clusters with arbitrary shape.
 Ability to deal with noisy data.
 Minimal requirements for domain knowledge to determine input parameters.
 Insensitivity to the order of input records.
 Ability to handle high dimensionality.
 Constraint-based clustering.
 Interpretability and usability.
TYPES OF DATA IN CLUSTER ANALYSIS

 Types of data in cluster analysis are:

 Interval-scaled variables. Example: salary, height.
 Binary variables. Example: gender (Male/Female), has_cancer (True/False).
 Nominal (categorical) variables. Example: religion (Christian, Muslim, Buddhist, Hindu, etc.).
 Ordinal variables. Example: military rank (soldier, sergeant, captain, etc.).
 Ratio-scaled variables. Example: population growth (1, 10, 100, 1000, ...).
 Variables of mixed types.

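As an illustration (not from the original slides), the dissimilarity measure depends on the variable type. A minimal Python sketch for two of the types above, interval-scaled and nominal variables, with invented example values:

```python
import math

def euclidean(p, q):
    """Dissimilarity for interval-scaled variables:
    Euclidean distance between two data objects."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def simple_matching_dissimilarity(p, q):
    """Dissimilarity for nominal/binary variables:
    fraction of attributes on which the two objects differ."""
    return sum(a != b for a, b in zip(p, q)) / len(p)

# Interval-scaled: (salary in $1000s, height in cm)
print(euclidean((50, 170), (54, 173)))  # 5.0
# Nominal: (gender, religion)
print(simple_matching_dissimilarity(("M", "Hindu"), ("F", "Hindu")))  # 0.5
```

In practice, interval-scaled variables are usually standardized first so that one large-ranged attribute (e.g. salary) does not dominate the distance.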

CATEGORISATION OF MAJOR CLUSTERING METHODS

 We can divide the clustering methods into two main groups:
 Hierarchical methods
 Partitioning methods

 Three additional main categories:
 Density-based methods
 Model-based methods
 Grid-based methods
Partitioning Methods
Definition: Given a database of n objects, a partitioning method constructs k partitions of the data, where each partition represents a cluster and k <= n.

It classifies the data into k groups, which satisfy the following requirements:
1. Each group must contain at least one object.
2. Each object must belong to exactly one group.

Partitioning algorithms:
 K-means
 K-medoids
K-means Partitioning
 Each cluster is represented by the mean value of the objects in the cluster. Hence, it is known as a centroid-based technique.

Working method:
 First, it randomly selects k of the objects, each of which initially represents a cluster mean.
 Each of the remaining objects is assigned to the cluster to which it is most similar, based on the distance between the object and the cluster mean.
 Then, it computes the new mean for each cluster.
 This process continues until the criterion function converges.
K-Means Clustering
The k-means algorithm is implemented as below:
Input: number of clusters k, and a database of n objects
Output: a set of k clusters that minimizes the squared error

Choose k objects as the initial cluster centers
Repeat
  (Re)assign each object to the cluster to which it is most similar, based on the mean value of the objects in the cluster
  Update the cluster means
Until no change
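The loop above can be sketched in plain Python. This is a minimal illustration, not a production implementation; the tuple point representation and the toy data set are invented for the example:

```python
import math
import random

def kmeans(objects, k, seed=0):
    """Plain k-means sketch: points are tuples; returns k cluster lists."""
    random.seed(seed)
    means = random.sample(objects, k)          # choose k initial centers
    while True:
        # (Re)assign each object to the nearest cluster mean
        clusters = [[] for _ in range(k)]
        for p in objects:
            i = min(range(k), key=lambda j: math.dist(p, means[j]))
            clusters[i].append(p)
        # Update the cluster means (keep the old mean if a cluster is empty)
        new_means = [
            tuple(sum(c) / len(c) for c in zip(*cl)) if cl else means[j]
            for j, cl in enumerate(clusters)
        ]
        if new_means == means:                 # until no change
            return clusters
        means = new_means

data = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
print(sorted(sorted(c) for c in kmeans(data, k=2)))
```

On this toy data the two tight groups of points end up in separate clusters regardless of which two objects are drawn as initial centers.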
K-Means Clustering Method
(Illustration, K = 2: arbitrarily choose K objects as initial cluster centers; assign each object to the most similar center; update the cluster means; reassign and update until no object changes cluster.)
K-Means Method

Strength: relatively efficient: O(tkn), where n is the number of objects, k the number of clusters, and t the number of iterations. Normally, k, t << n.
Comment: often terminates at a local optimum. The global optimum may be found using techniques such as deterministic annealing and genetic algorithms.

Weaknesses:
 Applicable only when the mean is defined, so not directly usable for categorical data
 Need to specify k, the number of clusters, in advance
 Unable to handle noisy data and outliers
 Not suitable for discovering clusters with non-convex shapes
Variations of the K-Means Method
A few variants of k-means differ in:
 Selection of the initial k means
 Dissimilarity calculations
 Strategies to calculate cluster means

Handling categorical data: k-modes
 Replacing means of clusters with modes
 Using new dissimilarity measures to deal with categorical objects
 A mixture of categorical and numerical data: k-prototype method

Expectation Maximization
 Assigns objects to clusters based on the probability of membership

Scalability of k-means
 Compressible, discardable, or to be maintained in main memory
 Clustering features
Problem of the K-Means Method
The k-means algorithm is sensitive to outliers:
 an object with an extremely large value may substantially distort the distribution of the data.

 K-Medoids: instead of taking the mean value of the objects in a cluster as a reference point, a medoid can be used, which is the most centrally located object in a cluster.

(Illustration: effect of an outlier on the cluster mean vs. the medoid.)
K-Medoids Clustering Method
PAM (Partitioning Around Medoids)
 starts from an initial set of medoids and iteratively replaces one of the medoids by one of the non-medoids if it improves the total distance of the resulting clustering
 all pairs are analyzed for replacement
 PAM works effectively for small data sets, but does not scale well to large data sets

CLARA
CLARANS
K-Medoids
Input: k, and a database of n objects
Output: a set of k clusters
Method:
 Arbitrarily choose k objects as initial medoids
 Repeat
  Assign each remaining object to the cluster with the nearest medoid
  Randomly select a non-medoid object orandom
  Compute the cost S of swapping a medoid oj with orandom
  If S < 0, swap oj with orandom to form the new set of k medoids
 Until no change

Working principle: minimize the sum of the dissimilarities between each object and its corresponding reference point; that is, an absolute-error criterion is used:

E = Σ(j=1..k) Σ(p ∈ Cj) |p − oj|
K-Medoids
 Case 1: p currently belongs to medoid oj. If oj is replaced by orandom as a medoid and p is closest to one of the oi with i ≠ j, then p is reassigned to oi.
 Case 2: p currently belongs to medoid oj. If oj is replaced by orandom as a medoid and p is closest to orandom, then p is reassigned to orandom.
 Case 3: p currently belongs to medoid oi (i ≠ j). If oj is replaced by orandom as a medoid and p is still closest to oi, the assignment does not change.
 Case 4: p currently belongs to medoid oi (i ≠ j). If oj is replaced by orandom as a medoid and p is closest to orandom, then p is reassigned to orandom.
K-Medoids

 After reassignment, the difference in squared error E is calculated. The total cost of swapping is the sum of the costs incurred by all non-medoid objects.
 If the total cost is negative, oj is replaced with orandom, since E will be reduced.
K-medoids Algorithm
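A simplified sketch of the PAM-style swap loop in Python. Instead of the per-object case analysis above, this greedy variant simply recomputes the full absolute-error cost for each candidate swap and accepts any swap that lowers it; the data set is invented for illustration:

```python
import math
from itertools import product

def total_cost(objects, medoids):
    """Absolute-error criterion: sum of distances from each object
    to its nearest medoid."""
    return sum(min(math.dist(p, m) for m in medoids) for p in objects)

def pam(objects, k):
    """PAM sketch: swap a medoid with a non-medoid whenever the
    swap reduces the total cost; stop when no swap improves it."""
    medoids = objects[:k]                       # arbitrary initial medoids
    improved = True
    while improved:
        improved = False
        for m, o in product(list(medoids), objects):
            if o in medoids:
                continue
            candidate = [o if x == m else x for x in medoids]
            if total_cost(objects, candidate) < total_cost(objects, medoids):
                medoids = candidate             # cost decreased: accept swap
                improved = True
    return medoids

data = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
print(sorted(pam(data, 2)))
```

Because every accepted swap strictly decreases the cost and there are finitely many medoid sets, the loop terminates; the quadratic cost per pass is why PAM does not scale to large data sets.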
Hierarchical Methods
Agglomerative hierarchical clustering:
 Each object initially represents a cluster of its own. Clusters are then successively merged until the desired cluster structure is obtained.

Divisive hierarchical clustering:
 All objects initially belong to one cluster. The cluster is then divided into sub-clusters, which are successively divided into their own sub-clusters. This process continues until the desired cluster structure is obtained.
Hierarchical Methods

 Hierarchical clustering methods can be further divided according to the way the similarity measure is calculated:

 Single-link clustering (also called the connectedness, minimum, or nearest-neighbour method)
 Complete-link clustering (also called the diameter, maximum, or furthest-neighbour method)
 Average-link clustering (also called the minimum variance method)
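The agglomerative scheme with single-link distance can be sketched as follows. This is a naive illustration with invented data (real implementations maintain an updatable distance matrix instead of rescanning all pairs):

```python
import math

def single_link(points, target_k):
    """Agglomerative clustering sketch with single-link (nearest
    neighbour) distance: start with one cluster per object and
    merge the closest pair until target_k clusters remain."""
    clusters = [[p] for p in points]
    while len(clusters) > target_k:
        # Find the pair of clusters with the smallest single-link
        # distance, i.e. the minimum distance over all cross pairs
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(math.dist(p, q)
                        for p in clusters[i] for q in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters[j]     # merge the closest pair
        del clusters[j]
    return clusters

pts = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)]
print(sorted(sorted(c) for c in single_link(pts, 3)))
```

Swapping the inner `min` for `max` gives complete-link clustering, and for the mean of all cross-pair distances gives average-link.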
Density-Based Clustering Methods

 These methods are used to discover clusters with arbitrary shape. There are three methods:

 DBSCAN: grows clusters according to a density-based connectivity analysis.
 OPTICS: produces a cluster ordering obtained from a wide range of parameter settings.
 DENCLUE: clusters objects based on a set of density distribution functions.
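A minimal sketch of the DBSCAN idea (eps-neighbourhood, core points, noise); the parameter values and data set below are invented for illustration:

```python
import math

def dbscan(points, eps, min_pts):
    """DBSCAN sketch: a core point has at least min_pts points within
    eps; clusters grow from core points, the rest become noise (-1)."""
    labels = {p: None for p in points}

    def neighbours(p):
        return [q for q in points if math.dist(p, q) <= eps]

    cluster_id = 0
    for p in points:
        if labels[p] is not None:
            continue
        seeds = neighbours(p)
        if len(seeds) < min_pts:
            labels[p] = -1              # noise (may become a border point later)
            continue
        labels[p] = cluster_id          # p is a core point: start a new cluster
        queue = [q for q in seeds if q != p]
        while queue:
            q = queue.pop()
            if labels[q] == -1:
                labels[q] = cluster_id  # former noise becomes a border point
            if labels[q] is not None:
                continue
            labels[q] = cluster_id
            qn = neighbours(q)
            if len(qn) >= min_pts:      # only core points extend the cluster
                queue.extend(qn)
        cluster_id += 1
    return labels

pts = [(0, 0), (0, 1), (1, 0), (1, 1), (8, 8), (8, 9), (9, 8), (20, 20)]
labels = dbscan(pts, eps=1.5, min_pts=3)
print(labels[(20, 20)])                 # -1: the isolated point is noise
```

Unlike k-means, no cluster count is specified in advance, and arbitrarily shaped dense regions are found while isolated objects are labelled as noise.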
Grid-Based Clustering
 It uses a multiresolution grid data structure.
 It quantizes the object space into a finite number of cells that form a grid structure.
 All clustering operations are performed on that grid structure.

 Approaches to grid-based clustering:
 STING: explores statistical information stored in the grid cells.
 WaveCluster: uses a wavelet transform to cluster objects.
 CLIQUE: represents a grid- and density-based approach for clustering in high-dimensional data space.
STING: STatistical Information Grid

 In this technique, the spatial area is divided into rectangular cells.
 Each cell at a high level is partitioned to form a number of cells at the next lower level.
 Statistical information regarding the attributes in each grid cell is precomputed and stored.
 These statistical parameters are useful for query processing.
Wave Cluster

 It is a multiresolution clustering algorithm.
 It first summarizes the data by imposing a multidimensional grid structure on the data space.
 It then transforms the original feature space with a wavelet transformation to find dense regions.
