Clustering Part2

Clustering-Part2

Uploaded by

G.M. Ravindu Dulshan

Distance between Clusters

■ Single link: smallest distance between an element in one cluster
and an element in the other, i.e., dist(K_i, K_j) = min_{p,q} dist(t_ip, t_jq)
■ Complete link: largest distance between an element in one cluster
and an element in the other, i.e., dist(K_i, K_j) = max_{p,q} dist(t_ip, t_jq)
■ Average: average distance between an element in one cluster and an
element in the other, i.e., dist(K_i, K_j) = avg_{p,q} dist(t_ip, t_jq)
■ Centroid: distance between the centroids of two clusters, i.e.,
dist(K_i, K_j) = dist(C_i, C_j)
■ Medoid: distance between the medoids of two clusters, i.e.,
dist(K_i, K_j) = dist(M_i, M_j)
■ A medoid is a chosen, centrally located object in the cluster
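
The five inter-cluster distances above can be sketched in a few lines of Python. This is a minimal illustration, assuming Euclidean distance and clusters given as lists of coordinate tuples; the function names are chosen here and do not come from the slides. The medoid is taken as the member with the smallest total distance to the other members, which is one common reading of "centrally located".

```python
import math
from itertools import product


def single_link(Ki, Kj):
    # Smallest distance between any cross-cluster pair.
    return min(math.dist(p, q) for p, q in product(Ki, Kj))


def complete_link(Ki, Kj):
    # Largest distance between any cross-cluster pair.
    return max(math.dist(p, q) for p, q in product(Ki, Kj))


def average_link(Ki, Kj):
    # Mean distance over all cross-cluster pairs.
    return sum(math.dist(p, q) for p, q in product(Ki, Kj)) / (len(Ki) * len(Kj))


def centroid(K):
    # Coordinate-wise mean of the cluster.
    return tuple(sum(axis) / len(K) for axis in zip(*K))


def medoid(K):
    # Assumption: the member minimizing the summed distance to all members.
    return min(K, key=lambda m: sum(math.dist(m, x) for x in K))


def centroid_link(Ki, Kj):
    return math.dist(centroid(Ki), centroid(Kj))


def medoid_link(Ki, Kj):
    return math.dist(medoid(Ki), medoid(Kj))


A = [(0.0, 0.0), (1.0, 0.0)]
B = [(3.0, 0.0), (5.0, 0.0)]
print(single_link(A, B), complete_link(A, B), average_link(A, B))  # 2.0 5.0 3.5
```

Single link tends to chain elongated clusters together, while complete link favours compact ones, so the choice of measure changes which pair of clusters is merged next.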

Centroid, Radius and Diameter of a Cluster
(for numerical data sets)
■ Centroid: the “middle” of a cluster, i.e., the mean of its points

■ Radius: square root of the average squared distance from any point
of the cluster to its centroid

■ Diameter: square root of the average squared distance between all
pairs of points in the cluster
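
As a rough sketch of these three quantities, assuming Euclidean distance and a cluster given as a list of coordinate tuples (the function names are illustrative), the radius averages squared distances to the centroid and the diameter averages squared distances over all pairs of distinct points:

```python
import math
from itertools import combinations


def centroid(points):
    # Coordinate-wise mean of the cluster.
    return tuple(sum(axis) / len(points) for axis in zip(*points))


def radius(points):
    # Square root of the mean squared distance to the centroid.
    c = centroid(points)
    return math.sqrt(sum(math.dist(p, c) ** 2 for p in points) / len(points))


def diameter(points):
    # Square root of the mean squared distance over all pairs of
    # distinct points; needs at least two points.
    pairs = list(combinations(points, 2))
    return math.sqrt(sum(math.dist(p, q) ** 2 for p, q in pairs) / len(pairs))


square = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0)]
print(centroid(square))  # (1.0, 1.0)
print(radius(square), diameter(square))
```

Averaging over unordered pairs gives the same value as averaging over the N(N-1) ordered pairs, since each pair is simply counted twice in the latter.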

Extensions to Hierarchical Clustering
■ Major weaknesses of agglomerative clustering methods
■ Can never undo what was done previously
■ Do not scale well: time complexity of at least O(n^2),
where n is the total number of objects
■ Integration of hierarchical & distance-based clustering
■ BIRCH (1996): uses CF-tree and incrementally adjusts
the quality of sub-clusters
■ CHAMELEON (1999): hierarchical clustering using
dynamic modeling
Chapter 10. Cluster Analysis: Basic Concepts and
Methods
■ Cluster Analysis: Basic Concepts
■ Partitioning Methods
■ Hierarchical Methods
■ Density-Based Methods
■ Grid-Based Methods
■ Evaluation of Clustering
■ Summary

Density-Based Clustering Methods

■ Clustering based on density (local cluster criterion), such
as density-connected points
■ Major features:
■ Discover clusters of arbitrary shape
■ Handle noise
■ One scan
■ Need density parameters as termination condition
■ Several interesting studies:
■ DBSCAN: Ester, et al. (KDD’96)
■ OPTICS: Ankerst, et al. (SIGMOD’99)
■ DENCLUE: Hinneburg & Keim (KDD’98)
■ CLIQUE: Agrawal, et al. (SIGMOD’98) (more grid-based)
Density-Based Clustering: Basic Concepts
■ Two parameters:
■ Eps: maximum radius of the neighbourhood
■ MinPts: minimum number of points in an
Eps-neighbourhood of that point
■ N_Eps(p) = {q ∈ D | dist(p, q) ≤ Eps}
■ Directly density-reachable: a point p is directly
density-reachable from a point q w.r.t. Eps, MinPts if
■ p belongs to N_Eps(q)
■ core point condition: |N_Eps(q)| ≥ MinPts
[Figure: points p and q, with MinPts = 5 and Eps = 1 cm]
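
A small sketch of these definitions, assuming Euclidean distance and a dataset D held as a list of coordinate tuples (the function names and the sample data are illustrative only). Note that the neighbourhood of a point contains the point itself, and that direct density-reachability is not symmetric:

```python
import math


def eps_neighborhood(D, p, eps):
    # N_Eps(p): every point of D within distance eps of p (p included).
    return [q for q in D if math.dist(p, q) <= eps]


def is_core(D, q, eps, min_pts):
    # Core point condition: |N_Eps(q)| >= MinPts.
    return len(eps_neighborhood(D, q, eps)) >= min_pts


def directly_density_reachable(D, p, q, eps, min_pts):
    # p is directly density-reachable from q if p lies in N_Eps(q)
    # and q satisfies the core point condition.
    return math.dist(p, q) <= eps and is_core(D, q, eps, min_pts)


D = [(0.0, 0.0), (0.5, 0.0), (0.0, 0.5), (-0.5, 0.0), (0.0, -0.5), (1.4, 0.0)]
border, core = (1.4, 0.0), (0.5, 0.0)
print(directly_density_reachable(D, border, core, eps=1.0, min_pts=5))  # True
print(directly_density_reachable(D, core, border, eps=1.0, min_pts=5))  # False
```

The second call returns False because (1.4, 0.0) has only two points in its neighbourhood and so is not a core point.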

Density-Reachable and Density-Connected

■ Density-reachable:
■ A point p is density-reachable from a point q w.r.t. Eps, MinPts
if there is a chain of points p_1, …, p_n, with p_1 = q and p_n = p,
such that p_{i+1} is directly density-reachable from p_i
■ Density-connected:
■ A point p is density-connected to a point q w.r.t. Eps, MinPts
if there is a point o such that both p and q are density-reachable
from o w.r.t. Eps and MinPts
DBSCAN: Density-Based Spatial Clustering of
Applications with Noise
■ Relies on a density-based notion of cluster: A cluster is
defined as a maximal set of density-connected points
■ Discovers clusters of arbitrary shape in spatial databases
with noise

[Figure: core, border, and outlier points for Eps = 1 cm, MinPts = 5]

DBSCAN: The Algorithm
■ Arbitrarily select a point p
■ Retrieve all points density-reachable from p w.r.t. Eps and MinPts
■ If p is a core point, a cluster is formed
■ If p is a border point, no points are density-reachable
from p and DBSCAN visits the next point of the database
■ Continue the process until all of the points have been processed
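
The steps above can be sketched as follows. This is a simplified, quadratic-time illustration rather than a reference implementation: it assumes Euclidean distance, scans the whole dataset for every neighbourhood query, and uses the label -1 for noise (a convention chosen here). A point first marked as noise is relabelled as a border point if a later core point reaches it.

```python
import math

NOISE = -1  # illustrative label for points that belong to no cluster


def region_query(D, i, eps):
    # Indices of all points within eps of D[i] (including i itself).
    return [j for j in range(len(D)) if math.dist(D[i], D[j]) <= eps]


def dbscan(D, eps, min_pts):
    labels = [None] * len(D)  # None means "not yet processed"
    cluster = -1
    for i in range(len(D)):
        if labels[i] is not None:
            continue
        neighbors = region_query(D, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # not a core point; may become a border point later
            continue
        cluster += 1  # D[i] is a core point, so a new cluster is formed
        labels[i] = cluster
        seeds = list(neighbors)
        while seeds:
            j = seeds.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(D, j, eps)
            if len(j_neighbors) >= min_pts:
                seeds.extend(j_neighbors)  # expand only through core points
    return labels


D = [(0, 0), (0, 1), (1, 0), (1, 1),
     (10, 10), (10, 11), (11, 10), (11, 11),
     (5, 5)]
print(dbscan(D, eps=1.5, min_pts=3))  # [0, 0, 0, 0, 1, 1, 1, 1, -1]
```

In practice the neighbourhood query is usually backed by a spatial index so that the whole run does not cost O(n^2).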

DBSCAN: Sensitive to Parameters

OPTICS: A Cluster-Ordering Method (1999)

■ OPTICS: Ordering Points To Identify the Clustering Structure
■ Ankerst, Breunig, Kriegel, and Sander (SIGMOD’99)
■ Produces a special order of the database w.r.t. its
density-based clustering structure
■ This cluster ordering contains information equivalent to the
density-based clusterings corresponding to a broad range of
parameter settings
■ Good for both automatic and interactive cluster analysis,
including finding intrinsic clustering structure
■ Can be represented graphically or using visualization techniques

[Figure: reachability plot; x-axis: cluster order of the objects, y-axis: reachability-distance, with an "undefined" mark]
Grid-Based Clustering Method

■ Using multi-resolution grid data structure
■ Several interesting methods
■ STING (a STatistical INformation Grid approach) by
Wang, Yang and Muntz (1997)
■ WaveCluster by Sheikholeslami, Chatterjee, and
Zhang (VLDB’98)
■ A multi-resolution clustering approach using
wavelet method
■ CLIQUE: Agrawal, et al. (SIGMOD’98)
■ Both grid-based and subspace clustering

STING: A Statistical Information Grid Approach

■ Wang, Yang and Muntz (VLDB’97)

■ The spatial area is divided into rectangular cells
■ There are several levels of cells corresponding to different
levels of resolution

The STING Clustering Method
■ Each cell at a high level is partitioned into a number of
smaller cells in the next lower level
■ Statistical information of each cell is calculated and stored
beforehand and is used to answer queries
■ Parameters of higher-level cells can be easily calculated
from the parameters of lower-level cells
■ count, mean, s (standard deviation), min, max
■ type of distribution: normal, uniform, etc.
■ Use a top-down approach to answer spatial data queries
■ Start from a pre-selected layer, typically with a small
number of cells
■ For each cell in the current level, compute the confidence
interval
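
To illustrate how a higher-level cell's parameters can be derived from its children without revisiting the raw data, here is a minimal sketch for a single numeric attribute. The `CellStats` class and `merge_cells` function are hypothetical names introduced for this example, s is treated as the population standard deviation, and at least one child is assumed to be non-empty. The distribution type is left out because combining it needs extra rules.

```python
import math
from dataclasses import dataclass


@dataclass
class CellStats:
    count: int
    mean: float
    std: float  # s, taken here as the population standard deviation
    minimum: float
    maximum: float


def merge_cells(children):
    # Combine the stored statistics of child cells into the parent cell.
    children = [c for c in children if c.count > 0]
    n = sum(c.count for c in children)
    mean = sum(c.count * c.mean for c in children) / n
    # Each child's mean of squares is std^2 + mean^2; pool them by count.
    mean_sq = sum(c.count * (c.std ** 2 + c.mean ** 2) for c in children) / n
    std = math.sqrt(max(mean_sq - mean ** 2, 0.0))
    return CellStats(n, mean, std,
                     min(c.minimum for c in children),
                     max(c.maximum for c in children))


a = CellStats(count=2, mean=2.0, std=1.0, minimum=1.0, maximum=3.0)  # values 1, 3
b = CellStats(count=1, mean=5.0, std=0.0, minimum=5.0, maximum=5.0)  # value 5
parent = merge_cells([a, b])
print(parent.count, parent.mean, parent.minimum, parent.maximum)  # 3 3.0 1.0 5.0
```

Because only these summaries are needed, answering a query never has to touch the individual objects, which is where the method's speed comes from.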
STING Algorithm and Its Analysis

■ Remove the irrelevant cells from further consideration
■ When finished examining the current layer, proceed to the
next lower level
■ Repeat this process until the bottom layer is reached
■ Advantages:
■ Query-independent, easy to parallelize, incremental
update
■ Query processing time is O(K), where K is the number of grid
cells at the lowest level
■ Disadvantages:
■ All cluster boundaries are either horizontal or
vertical; no diagonal boundary is detected

Assessing Clustering Tendency
■ Assess if non-random structure exists in the data by measuring the
probability that the data is generated by a uniform data distribution
■ Test spatial randomness by a statistical test: the Hopkins Statistic
■ Given a dataset D regarded as a sample of a random variable o,
determine how far away o is from being uniformly distributed in the
data space
■ Sample n points, p_1, …, p_n, uniformly from the data space of D.
For each p_i, find its nearest neighbor in D: x_i = min{dist(p_i, v)}
where v in D
■ Sample n points, q_1, …, q_n, uniformly from D. For each q_i, find its
nearest neighbor in D – {q_i}: y_i = min{dist(q_i, v)} where v in D and
v ≠ q_i
■ Calculate the Hopkins Statistic: H = ∑ y_i / (∑ x_i + ∑ y_i)
■ If D is uniformly distributed, ∑ x_i and ∑ y_i will be close to each
other and H is close to 0.5. If D is highly skewed, H is close to 0
Hopkins Statistic
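
A minimal sketch of the statistic in the form H = ∑ y_i / (∑ x_i + ∑ y_i), where x_i are nearest-neighbour distances from points drawn uniformly in the data space and y_i are nearest-neighbour distances from sampled data points to the rest of D. Two assumptions are made here: the data space is approximated by the bounding box of D, and distances are Euclidean. The function names are illustrative.

```python
import math
import random


def nearest_distance(p, D, skip=None):
    # Distance from p to its nearest neighbour in D, ignoring index `skip`.
    return min(math.dist(p, v) for idx, v in enumerate(D) if idx != skip)


def hopkins(D, n, rng):
    dims = len(D[0])
    lo = [min(v[d] for v in D) for d in range(dims)]
    hi = [max(v[d] for v in D) for d in range(dims)]
    # x_i: points drawn uniformly from the bounding box (assumed data space).
    x_sum = 0.0
    for _ in range(n):
        p = tuple(rng.uniform(lo[d], hi[d]) for d in range(dims))
        x_sum += nearest_distance(p, D)
    # y_i: points sampled from D itself, nearest neighbour in D - {q_i}.
    y_sum = 0.0
    for i in rng.sample(range(len(D)), n):
        y_sum += nearest_distance(D[i], D, skip=i)
    return y_sum / (x_sum + y_sum)


rng = random.Random(0)
clustered = ([(rng.gauss(0, 0.01), rng.gauss(0, 0.01)) for _ in range(50)]
             + [(rng.gauss(10, 0.01), rng.gauss(10, 0.01)) for _ in range(50)])
print(hopkins(clustered, 20, rng))  # close to 0 for strongly clustered data
```

Some texts define the statistic the other way round, with the x sum in the numerator, so that clustered data scores close to 1; it is worth checking which convention a given tool uses before interpreting the value.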

Determine the Number of Clusters
■ Empirical method
■ # of clusters ≈ √(n/2) for a dataset of n points
■ Elbow method
■ Use the turning point in the curve of the sum of within-cluster
variance w.r.t. the # of clusters
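
The elbow method can be sketched by computing the within-cluster sum of squares for increasing k and looking for the point where the improvement flattens out. The k-means routine below is a deliberately small stand-in written for this example: it uses a deterministic farthest-first initialisation (a choice made here, not something the slides prescribe) and assumes Euclidean distance and k no larger than the number of distinct points.

```python
import math


def kmeans(points, k, iters=50):
    # Farthest-first initialisation (an assumption of this sketch).
    cents = [points[0]]
    while len(cents) < k:
        cents.append(max(points, key=lambda p: min(math.dist(p, c) for c in cents)))
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: math.dist(p, cents[j]))
            groups[nearest].append(p)
        new = [tuple(sum(axis) / len(g) for axis in zip(*g)) if g else cents[j]
               for j, g in enumerate(groups)]
        if new == cents:
            break
        cents = new
    return cents


def within_cluster_ss(points, cents):
    # Sum of squared distances from each point to its closest centroid.
    return sum(min(math.dist(p, c) for c in cents) ** 2 for p in points)


def elbow_curve(points, max_k):
    return [within_cluster_ss(points, kmeans(points, k)) for k in range(1, max_k + 1)]


offsets = [(-0.5, -0.5), (0.5, -0.5), (-0.5, 0.5), (0.5, 0.5)]
points = [(cx + dx, cy + dy) for cx, cy in [(0, 0), (10, 0), (0, 10)]
          for dx, dy in offsets]
for k, ss in enumerate(elbow_curve(points, 4), start=1):
    print(k, round(ss, 3))
```

On this toy data, which has three well-separated groups, the curve drops sharply up to k = 3 and only slightly afterwards, which is the "turning point" the method looks for.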

Determine the Number of Clusters (2)
■ Cross validation method
■ Divide a given data set into m parts
■ Use m – 1 parts to obtain a clustering model
■ Use the remaining part to test the quality of the clustering
■ E.g., for each point in the test set, find the closest centroid, and
use the sum of squared distances between all points in the test set
and the closest centroids to measure how well the model fits the
test set
■ For any k > 0, repeat it m times, compare the overall quality measure
w.r.t. different k’s, and find the # of clusters that fits the data the best

Cross validation method
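
A rough sketch of the procedure: split the data into m parts, fit centroids on m – 1 parts, and score the held-out part by the sum of squared distances to the closest centroid. The helper `fit_centroids` is a hypothetical, minimal k-means with farthest-first initialisation written only to keep the example self-contained; any clustering routine that returns centroids could take its place. Folds are formed by taking every m-th point, which assumes the data is not ordered in a way that biases the split.

```python
import math


def fit_centroids(train, k, iters=20):
    # Hypothetical stand-in for a clustering model (tiny k-means).
    cents = [train[0]]
    while len(cents) < k:
        cents.append(max(train, key=lambda p: min(math.dist(p, c) for c in cents)))
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in train:
            groups[min(range(k), key=lambda j: math.dist(p, cents[j]))].append(p)
        cents = [tuple(sum(axis) / len(g) for axis in zip(*g)) if g else cents[j]
                 for j, g in enumerate(groups)]
    return cents


def held_out_sse(held_out, cents):
    # Sum of squared distances from held-out points to their closest centroid.
    return sum(min(math.dist(p, c) for c in cents) ** 2 for p in held_out)


def cross_validated_sse(points, k, m):
    total = 0.0
    for fold in range(m):
        held_out = points[fold::m]
        train = [p for i, p in enumerate(points) if i % m != fold]
        total += held_out_sse(held_out, fit_centroids(train, k))
    return total


offsets = [(-0.5, -0.5), (0.5, -0.5), (-0.5, 0.5), (0.5, 0.5)]
points = [(cx + dx, cy + dy) for cx, cy in [(0, 0), (10, 0), (0, 10)]
          for dx, dy in offsets]
for k in (1, 2, 3):
    print(k, round(cross_validated_sse(points, k, m=4), 3))
```

Comparing the totals across k shows which number of clusters generalises best to points the model has not seen.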

Measuring Clustering Quality

■ Two methods: extrinsic vs. intrinsic

■ Extrinsic: supervised, i.e., the ground truth is available
■ Compare a clustering against the ground truth using
a clustering quality measure
■ Ex. BCubed precision and recall metrics
■ Intrinsic: unsupervised, i.e., the ground truth is unavailable
■ Evaluate the goodness of a clustering by considering
how well the clusters are separated, and how compact
the clusters are
■ Ex. Silhouette coefficient

Silhouette coefficient
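
For each object o, let a(o) be the average distance from o to the other objects in its own cluster and b(o) the smallest average distance from o to the objects of any other cluster; the silhouette coefficient is s(o) = (b(o) – a(o)) / max(a(o), b(o)), which lies between –1 and 1. The sketch below averages s(o) over all objects. It assumes Euclidean distance, at least two clusters, and no cluster made up entirely of identical points; giving singleton clusters a score of 0 is a common convention adopted here.

```python
import math


def silhouette(points, labels):
    clusters = {}
    for p, label in zip(points, labels):
        clusters.setdefault(label, []).append(p)
    scores = []
    for p, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # convention for singleton clusters
            continue
        # a: mean distance to the other members of p's own cluster
        # (p contributes distance 0 to the sum, hence len(own) - 1).
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(sum(math.dist(p, q) for q in other) / len(other)
                for key, other in clusters.items() if key != label)
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
print(silhouette(points, [0, 0, 1, 1]))  # compact, well-separated: near 1
print(silhouette(points, [0, 1, 0, 1]))  # clusters mixed up: negative
```

Values near 1 indicate compact, well-separated clusters, while negative values suggest that objects sit closer to another cluster than to their own.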

Measuring Clustering Quality: Extrinsic Methods

■ Clustering quality measure: Q(C, C_g), for a clustering C
given the ground truth C_g
■ Q is good if it satisfies the following 4 essential criteria
■ Cluster homogeneity: the purer, the better
■ Cluster completeness: should assign objects that belong to
the same category in the ground truth to the same cluster
■ Rag bag: putting a heterogeneous object into a pure
cluster should be penalized more than putting it into a
rag bag (i.e., “miscellaneous” or “other” category)
■ Small cluster preservation: splitting a small category
into pieces is more harmful than splitting a large
category into pieces
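
BCubed precision and recall are one example of an extrinsic measure. In the usual formulation, the precision of an object is the fraction of objects in its cluster that share its ground-truth category, and its recall is the fraction of objects in its category that share its cluster; both are then averaged over all objects. A minimal sketch, assuming cluster assignments and ground-truth categories are given as two parallel lists (the function name is illustrative):

```python
from collections import Counter


def bcubed(clusters, truth):
    n = len(clusters)
    cluster_size = Counter(clusters)
    category_size = Counter(truth)
    # Number of objects sharing both the cluster and the category.
    both = Counter(zip(clusters, truth))
    precision = sum(both[(c, g)] / cluster_size[c]
                    for c, g in zip(clusters, truth)) / n
    recall = sum(both[(c, g)] / category_size[g]
                 for c, g in zip(clusters, truth)) / n
    return precision, recall


clusters = [0, 0, 0, 1, 1]
truth = ["a", "a", "b", "b", "b"]
print(bcubed(clusters, truth))
print(bcubed([0, 0, 1, 1, 1], truth))  # (1.0, 1.0) for a perfect clustering
```

Each object counts itself in both fractions, so a clustering that matches the ground truth exactly scores 1.0 on both metrics.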
Summary
■ Cluster analysis groups objects based on their similarity and has
wide applications
■ Measure of similarity can be computed for various types of data
■ Clustering algorithms can be categorized into partitioning methods,
hierarchical methods, density-based methods, grid-based methods,
and model-based methods
■ K-means and K-medoids algorithms are popular partitioning-based
clustering algorithms
■ BIRCH and CHAMELEON are interesting hierarchical clustering
algorithms, and there are also probabilistic hierarchical clustering
algorithms
■ DBSCAN, OPTICS, and DENCLUE are interesting density-based
algorithms
■ STING and CLIQUE are grid-based methods, where CLIQUE is also
a subspace clustering algorithm
■ Quality of clustering results can be evaluated in various ways