
Clustering

• Unsupervised learning
• Generating “classes”
• Distance/similarity measures
• Agglomerative methods
• Divisive methods

What is Clustering?
• Form of unsupervised learning - no information
from teacher
• The process of partitioning a set of data into a set
of meaningful (hopefully) sub-classes, called
clusters
• Cluster:
– a collection of data points that are "similar" to
one another and should collectively be treated
as a group
– as a collection, sufficiently different from
other groups
Clusters

Characterizing Cluster Methods
• Class - label applied by clustering algorithm
– hard versus fuzzy:
• hard - a point either is or is not a member of a cluster
• fuzzy - a point is a member of a cluster with some probability
• Distance (similarity) measure - value indicating
how similar data points are
• Deterministic versus stochastic
– deterministic - same clusters produced every time
– stochastic - different clusters may result
• Hierarchical - points connected into clusters using
a hierarchical structure
Basic Clustering Methodology
Two approaches:
Agglomerative: pairs of items/clusters are successively
linked to produce larger clusters
Divisive (partitioning): items are initially placed in one
cluster and successively divided into separate groups

Cluster Validity
• One difficult question: how good are the clusters
produced by a particular algorithm?
• Difficult to develop an objective measure
• Some approaches:
– external assessment: compare clustering to a priori
clustering
– internal assessment: determine if clustering intrinsically
appropriate for data
– relative assessment: compare one clustering method's
results to another's
Basic Questions
• Data preparation - getting/setting up data for
clustering
– extraction
– normalization
• Similarity/Distance measure - how is the distance
between points defined
• Use of domain knowledge (prior knowledge)
– can influence preparation and the similarity/distance measure
• Efficiency - how to construct clusters in a
reasonable amount of time
Distance/Similarity Measures
• Key to grouping points
distance = inverse of similarity
• Often based on representation of objects as feature vectors

An Employee DB:

    ID  Gender  Age  Salary
    1   F       27   19,000
    2   M       51   64,000
    3   M       52   100,000
    4   F       33   55,000
    5   M       45   45,000

Term Frequencies for Documents:

          T1  T2  T3  T4  T5  T6
    Doc1   0   4   0   0   0   2
    Doc2   3   1   4   3   1   2
    Doc3   3   0   0   0   3   0
    Doc4   0   1   0   3   0   0
    Doc5   2   2   2   3   1   4

Which objects are more similar?
Distance/Similarity Measures
Properties of measures (based on feature values $x_{i,f}$ for instance $i$ and feature $f$):
– for all objects $x_i, x_j$: $dist(x_i, x_j) \ge 0$ and $dist(x_i, x_j) = dist(x_j, x_i)$
– for any object $x_i$: $dist(x_i, x_i) = 0$
– triangle inequality: $dist(x_i, x_j) \le dist(x_i, x_k) + dist(x_k, x_j)$

Manhattan distance: $dist(x_i, x_j) = \sum_{f=1}^{|features|} |x_{i,f} - x_{j,f}|$

Euclidean distance: $dist(x_i, x_j) = \sqrt{\sum_{f=1}^{|features|} (x_{i,f} - x_{j,f})^2}$
Distance/Similarity Measures
Minkowski distance (order $p$): $dist_p(x_i, x_j) = \left( \sum_{f=1}^{|features|} |x_{i,f} - x_{j,f}|^p \right)^{1/p}$

Mahalanobis distance: $dist(x_i, x_j) = (x_i - x_j)\,\Sigma^{-1}\,(x_i - x_j)^T$

where $\Sigma^{-1}$ is the inverse of the covariance matrix of the patterns

More complex measures:
– Mutual Neighbor Distance (MND) - based on a count of the number of neighbors
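The measures above can be made concrete with a short Python sketch (NumPy assumed; the function names are ours, not from any particular library):

    import numpy as np

    def minkowski(xi, xj, p=2):
        # Minkowski distance of order p: p=1 gives Manhattan,
        # p=2 gives Euclidean.
        return float(np.sum(np.abs(xi - xj) ** p) ** (1.0 / p))

    def mahalanobis_sq(xi, xj, cov_inv):
        # (xi - xj) Sigma^-1 (xi - xj)^T as defined above; cov_inv
        # is the inverse covariance matrix of the patterns.
        d = np.asarray(xi) - np.asarray(xj)
        return float(d @ cov_inv @ d)

For example, minkowski(np.array([0, 4, 0]), np.array([3, 1, 4]), p=1) gives the Manhattan distance 10.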
Distance (Similarity) Matrix
• Similarity (Distance) Matrix
– based on the distance or similarity measure we can construct a
symmetric matrix of distance (or similarity values)
– (i, j) entry in the matrix is the distance (similarity) between items i and j

         I1    I2    ...   In
    I1         d12   ...   d1n
    I2   d21         ...   d2n
    ...
    In   dn1   dn2   ...

where $d_{ij}$ = similarity (or distance) of item $i$ to item $j$.

Note that $d_{ij} = d_{ji}$ (i.e., the matrix is symmetric), so we only need the
lower triangle part of the matrix. The diagonal is all 1's (similarity) or
all 0's (distance).
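A sketch of how such a matrix might be built from data points (Euclidean distance assumed as the default measure):

    import numpy as np

    def distance_matrix(points, dist=None):
        # Fill only the lower triangle, then mirror it, since
        # dist(i, j) == dist(j, i); the diagonal stays all 0's.
        if dist is None:
            dist = lambda a, b: float(np.linalg.norm(a - b))
        n = len(points)
        D = np.zeros((n, n))
        for i in range(n):
            for j in range(i):
                D[i, j] = D[j, i] = dist(points[i], points[j])
        return D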
Example: Term Similarities in Documents
          T1  T2  T3  T4  T5  T6  T7  T8
    Doc1   0   4   0   0   0   2   1   3
    Doc2   3   1   4   3   1   2   0   1
    Doc3   3   0   0   0   3   0   3   0
    Doc4   0   1   0   3   0   0   2   0
    Doc5   2   2   2   3   1   4   0   2

$sim(T_i, T_j) = \sum_{k=1}^{N} w_{ik} \cdot w_{jk}$, where $w_{ik}$ is the frequency of term $T_i$ in document $k$ and $N$ is the number of documents.

Term-Term Similarity Matrix:

         T1  T2  T3  T4  T5  T6  T7
    T2    7
    T3   16   8
    T4   15  12  18
    T5   14   3   6   6
    T6   14  18  16  18   6
    T7    9   6   0   6   9   2
    T8    7  17   8   9   3  16   3
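Since sim(Ti, Tj) is a dot product of term columns, the whole matrix can be computed at once; a sketch that reproduces the numbers above:

    import numpy as np

    # Document-term frequencies from the table above
    # (rows = Doc1..Doc5, columns = T1..T8).
    A = np.array([
        [0, 4, 0, 0, 0, 2, 1, 3],
        [3, 1, 4, 3, 1, 2, 0, 1],
        [3, 0, 0, 0, 3, 0, 3, 0],
        [0, 1, 0, 3, 0, 0, 2, 0],
        [2, 2, 2, 3, 1, 4, 0, 2],
    ])

    S = A.T @ A        # S[i, j] = sum over documents of w_ik * w_jk
    print(S[2, 0])     # sim(T3, T1) = 16, matching the matrix above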
Similarity (Distance) Thresholds
– A similarity (distance) threshold may be used to mark pairs
that are “sufficiently” similar
         T1  T2  T3  T4  T5  T6  T7
    T2    7
    T3   16   8
    T4   15  12  18
    T5   14   3   6   6
    T6   14  18  16  18   6
    T7    9   6   0   6   9   2
    T8    7  17   8   9   3  16   3

Using a threshold value of 10 in the previous example:

         T1  T2  T3  T4  T5  T6  T7
    T2    0
    T3    1   0
    T4    1   1   1
    T5    1   0   0   0
    T6    1   1   1   1   0
    T7    0   0   0   0   0   0
    T8    0   1   0   0   0   1   0
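Continuing the previous sketch, the 0/1 matrix follows from a single comparison (using >= 10 as the threshold test, an assumption since the slide does not say whether the bound itself counts as "sufficiently" similar):

    T = (S >= 10).astype(int)   # 1 where the pair meets the threshold
    np.fill_diagonal(T, 0)      # ignore each term's similarity to itself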
Graph Representation
• The similarity matrix can be visualized as an undirected graph
– each item is represented by a node, and edges represent the fact that two
items are similar (a one in the similarity threshold matrix)

         T1  T2  T3  T4  T5  T6  T7
    T2    0
    T3    1   0
    T4    1   1   1
    T5    1   0   0   0
    T6    1   1   1   1   0
    T7    0   0   0   0   0   0
    T8    0   1   0   0   0   1   0

[Figure: the corresponding undirected graph on nodes T1..T8; T7 has no edges]

If no threshold is used, then the matrix can be represented as a weighted graph.
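The thresholded matrix converts directly into adjacency sets for the graph; a small sketch reusing T from above:

    n = T.shape[0]
    adj = {i: {j for j in range(n) if T[i, j]} for i in range(n)}
    # adj[6] is empty: T7 is the isolated node in the figure.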
Agglomerative Single-Link
• Single-link: connect all points together that are
within a threshold distance
• Algorithm:
1. place all points in the graph
2. pick a point to start a cluster
3. for each point in the current cluster:
add all points within the threshold that are not already in the cluster;
repeat until no more points can be added
4. remove the points in the current cluster from the graph
5. repeat from step 2 until no points remain in the graph
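One way to realize this algorithm, treating each cluster as a connected component of the threshold graph (adjacency sets as in the earlier sketch):

    from collections import deque

    def single_link_clusters(adj):
        remaining = set(adj)                  # step 1: all points in graph
        clusters = []
        while remaining:                      # step 5: until graph is empty
            seed = remaining.pop()            # step 2: start a new cluster
            cluster, frontier = {seed}, deque([seed])
            while frontier:                   # step 3: grow the cluster
                p = frontier.popleft()
                for q in adj[p] & remaining:  # within threshold, not yet in cluster
                    remaining.discard(q)      # step 4: remove from graph
                    cluster.add(q)
                    frontier.append(q)
            clusters.append(cluster)
        return clusters

On the adjacency sets built earlier, this yields one cluster holding every term except T7, which ends up alone, matching the example on the next slide.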
Example

Using the term-term similarity matrix and threshold graph from the previous
slides (threshold = 10): all points except T7 end up in one cluster.

[Figure: the threshold graph; T7 is isolated while the remaining terms form a single connected cluster]
Agglomerative Complete-Link (Clique)
• Complete-link (clique): all of the points in a
cluster must be within the threshold distance
• In the threshold distance matrix, a clique is a
complete graph
• Algorithms are based on finding maximal cliques
(once a point is chosen, pick the largest clique it is
part of)
– not an easy problem (finding a maximum clique is NP-hard)
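A sketch of the greedy clique approach, assuming the networkx library for maximal-clique enumeration (which can take exponential time in the worst case, hence "not an easy problem"):

    import networkx as nx

    def complete_link_clusters(T):
        # Repeatedly take the largest maximal clique in the threshold
        # graph as a cluster, then remove its nodes and continue.
        G = nx.from_numpy_array(T)
        clusters = []
        while G.number_of_nodes() > 0:
            best = max(nx.find_cliques(G), key=len)
            clusters.append(set(best))
            G.remove_nodes_from(best)
        return clusters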
Example
Using the same similarity matrix and threshold graph as in the single-link
example: different clusters are possible based on where the cliques are started.

[Figure: the threshold graph from the previous example]
Hierarchical Methods
• Based on some method of representing hierarchy
of data points
• One idea: hierarchical dendrogram (connects points
based on similarity)

Example, using the term-term similarity matrix from before:

[Figure: hierarchical dendrogram over the terms, with leaf order T5 T1 T3 T4 T2 T6 T8 T7]
Hierarchical Agglomerative
• Compute distance matrix
• Put each data point in its own cluster
• Find the most similar pair of clusters
– merge the pair (record the merger in the dendrogram)
– update the proximity matrix
– repeat until all patterns are in one cluster
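These steps are what standard libraries implement directly; a sketch assuming SciPy is available:

    import numpy as np
    from scipy.cluster.hierarchy import dendrogram, linkage
    from scipy.spatial.distance import pdist

    def hierarchical(points, method="single"):
        # pdist computes the (condensed) distance matrix; linkage then
        # repeatedly merges the closest pair of clusters, recording each
        # merger so that dendrogram() can draw the tree.
        return linkage(pdist(np.asarray(points, dtype=float)), method=method)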
Partitional Methods
• Divide data points into a number of clusters
• Difficult questions:
– how many clusters?
– how to divide the points?
– how to represent a cluster?
• Representing a cluster: often done in terms of its centroid
– the centroid of a cluster minimizes the squared distance between
itself and all points in the cluster
k-Means Clustering
1. Choose k cluster centers (randomly pick k data
points as centers, or distribute them randomly in the space)
2. Assign each pattern to the closest cluster center
3. Recompute the cluster centers using the current
cluster memberships (moving the centers may change
memberships)
4. If a convergence criterion is not met, go to step 2

Convergence criteria:
– no reassignment of patterns
– minimal change in the cluster centers
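A minimal NumPy sketch of these four steps (initial centers picked as random data points; "no center moves" as the convergence criterion):

    import numpy as np

    def k_means(X, k, max_iter=100, seed=0):
        rng = np.random.default_rng(seed)
        centers = X[rng.choice(len(X), size=k, replace=False)]  # step 1
        for _ in range(max_iter):
            # Step 2: assign each pattern to the closest center.
            labels = np.argmin(
                np.linalg.norm(X[:, None] - centers[None, :], axis=2), axis=1)
            # Step 3: recompute centers from the current memberships
            # (an empty cluster keeps its old center).
            new_centers = np.array([
                X[labels == c].mean(axis=0) if np.any(labels == c)
                else centers[c] for c in range(k)])
            # Step 4: stop once no center moves.
            if np.allclose(new_centers, centers):
                break
            centers = new_centers
        return labels, centers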
k-Means Clustering

[Figure: k-Means clustering example]
k-Means Variations
• What if there are too many or not enough clusters?
• After some convergence:
– any cluster whose members are too far apart is split
– any clusters that are too close together are combined
– any cluster not corresponding to any points is moved
– the thresholds are decided empirically
An Incremental Clustering Algorithm
1. Assign the first data point to a cluster
2. Consider the next data point: either assign it to an
existing cluster or create a new cluster; the
assignment is based on a threshold
3. Repeat step 2 until all points are clustered

Useful for efficient clustering
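A one-pass sketch; the nearest-centroid rule and the running centroid update are our assumptions, since the slide only specifies that assignment is threshold-based:

    import numpy as np

    def incremental_clusters(points, threshold):
        centroids, clusters = [], []
        for p in points:
            p = np.asarray(p, dtype=float)
            if centroids:
                d = [float(np.linalg.norm(p - c)) for c in centroids]
                i = int(np.argmin(d))
                if d[i] <= threshold:          # close enough: join cluster i
                    clusters[i].append(p)
                    centroids[i] = np.mean(clusters[i], axis=0)
                    continue
            centroids.append(p)                # otherwise start a new cluster
            clusters.append([p])
        return clusters

Each point is examined once, which is what makes the method useful when efficiency matters.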
Clustering Summary
• Unsupervised learning method
– generation of “classes”
• Based on similarity/distance measure
– Manhattan, Euclidean, Minkowski, Mahalanobis, etc.
– distance matrix
– threshold distance matrix
• Hierarchical representation
– hierarchical dendrogram
• Agglomerative methods
– single link
– complete link (clique)
Clustering Summary
• Partitional method
– representing clusters
• centroids and “error”
– k-Means clustering
• combining/splitting k-Means

• Incremental clustering
– one-pass clustering

