0% found this document useful (0 votes)

25 views20 pages

Hierarchicalclustering

clustering density based models

Uploaded by

prathap badam

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

25 views20 pages

Hierarchicalclustering

clustering density based models

Uploaded by

prathap badam

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 20

Hierarchical Clustering

Mehta Ishani
130040701003
What is Clustering in Data Mining?

Clustering is a process of partitioning a set of data (or objects) in

a set of meaningful sub-classes, called clusters

 Cluster:
 a collection of data objects that are
“similar” to one another and thus can
be treated collectively as one group

 but as a collection, they are

sufficiently different from other groups

2 8/23/2014
Distance or Similarity Measures
 Measuring Distance
 In order to group similar items, we need a way to measure
the distance between objects (e.g., records)
 Note: distance = inverse of similarity
 Often based on the representation of objects as “feature
vectors”

An Employee DB Term Frequencies for Documents

ID Gender Age Salary T1 T2 T3 T4 T5 T6

1 F 27 19,000 Doc1 0 4 0 0 0 2
2 M 51 64,000 Doc2 3 1 4 3 1 2
3 M 52 100,000 Doc3 3 0 0 0 3 0
4 F 33 55,000 Doc4 0 1 0 3 0 0
5 M 45 45,000 Doc5 2 2 2 3 1 4

Which objects are more similar?

3 8/23/2014
Distance or Similarity Measures

 Common Distance Measures:

X  x1 , x2 , , xn Y  y1 , y2 , , yn
 Manhattan distance:

dist ( X , Y )  x1  y1  x2  y2   xn  yn
Can be normalized
 Euclidean distance: to make values fall
between 0 and 1.
dist ( X , Y )   x1  y1     xn  yn 
2 2

 Cosine similarity:

dist ( X ,Y )  1  sim( X ,Y )
 ( xi  yi )
sim( X , Y )  i

 xi   yi
2 2

i i

4 8/23/2014
Distance or Similarity Measures
 Weighting Attributes
 in some cases we want some attributes to count more than
others
 associate a weight with each of the attributes in calculating
distance, e.g.,
dist ( X , Y )  w1  x1  y1    wn  xn  yn 
2 2

 Nominal (categorical) Attributes

can use simple matching: distance=1 if values match, 0

otherwise
 or convert each nominal attribute to a set of binary attribute,
then use the usual distance measure
 if all attributes are nominal, we can normalize by dividing the
number of matches by the total number of attributes
xi  min xi
x 'i 
 Normalization: max xi  min xi
 want values to fall between 0 an 1:
 other variations possible
5 8/23/2014
Distance or Similarity Measures
 Example
 max distance for salary: 100000-19000 = 79000
 max distance for age: 52-27 = 25 xi  min xi
x' 
max xi  min xi
i

ID Gender Age Salary ID Gender Age Salary

1 F 27 19,000 1 1 0.00 0.00
2 M 51 64,000 2 0 0.96 0.56
3 M 52 100,000 3 0 1.00 1.00
4 F 33 55,000 4 1 0.24 0.44
5 M 45 45,000 5 0 0.72 0.32

 dist(ID2, ID3) = SQRT( 0 + (0.04)2 + (0.44)2 ) = 0.44

 dist(ID2, ID4) = SQRT( 1 + (0.72)2 + (0.12)2 ) = 1.24

6 8/23/2014
Domain Specific Distance Functions
 For some data sets, we may need to use specialized functions
 we may want a single or a selected group of attributes to be used
in the computation of distance - same problem as “feature
selection”
 may want to use special properties of one or more attribute in the
data Example: Zip Codes
distzip(A, B) = 0, if zip codes are identical
distzip(A, B) = 0.1, if first 3 digits are
identical
distzip(A, B) = 0.5, if first digits are identical
distzip(A, B) = 1, if first digits are different
Example: Customer Solicitation
distsolicit(A, B) = 0, if both A and B responded
distsolicit(A, B) = 0.1, both A and B were chosen but did not respond
distsolicit(A, B) = 0.5, both A and B were chosen, but only one
responded
distsolicit(A, B) = 1, one was chosen, but the other was not

 natural distance functions may exist in the data

7 8/23/2014
Distance (Similarity) Matrix
 Similarity (Distance) Matrix
 based on the distance or similarity measure we can construct a
symmetric matrix of distance (or similarity values)
 (i, j) entry in the matrix is the distance (similarity) between items
i and j

I1 I2 In
I1  d12 d1n Note that dij = dji (i.e., the matrix is
I2 d 21  d2n symmetric. So, we only need the
lower triangle part of the matrix.
 The diagonal is all 1’s (similarity) or
In d n1 dn2  all 0’s (distance)

dij  similarity (or distance) of Di to D j

8 8/23/2014
Example: Term Similarities in Documents
T1 T2 T3 T4 T5 T6 T7 T8
Doc1 0 4 0 0 0 2 1 3
Doc2 3 1 4 3 1 2 0 1
Doc3 3 0 0 0 3 0 3 0
Doc4 0 1 0 3 0 0 2 0
Doc5 2 2 2 3 1 4 0 2

N
sim(Ti , Tj )   ( wik  w jk )
k 1

T1 T2 T3 T4 T5 T6 T7
T2 7
T3 16 8
Term-Term T4 15 12 18
Similarity Matrix T5 14 3 6 6
T6 14 18 16 18 6
T7 9 6 0 6 9 2
T8 7 17 8 9 3 16 3
9 8/23/2014
Similarity (Distance) Thresholds
 A similarity (distance) threshold may be used to mark pairs that are
“sufficiently”
T1 similar
T2 T3 T4 T5 T6 T7
T2 7
T3 16 8
T4 15 12 18
T5 14 3 6 6
T6 14 18 16 18 6
T7 9 6 0 6 9 2
T8 7 17 8 9 3 16 3 Using a
threshold value
T1 T2 T3 T4 T5 T6 T7 of 10 in the
T2 0 previous
T3 1 0 example
T4 1 1 1
T5 1 0 0 0
T6 1 1 1 1 0
T7 0 0 0 0 0 0
T8 0 1 0 0 0 1 0

10 8/23/2014
Graph Representation

 The similarity matrix can be visualized as an undirected

graph
 each item is represented by a node, and edges represent the
factT1that
T2 two
T3 items
T4 T5are
T6similar
T7 (a one in the similarity threshold
T2matrix)
0 T1 T3

T3 1 0
T4 1 1 1
T5 1 0 0 0 T5
T6 1 1 1 1 0
T7 0 0 0 0 0 0
T8 0 1 0 0 0 1 0 T4 T2

If no threshold is used, then

T7
matrix can be represented as
T6
a weighted graph T8

11 8/23/2014
Clustering Methodologies
 Two general methodologies
 Partitioning Based Algorithms
 Hierarchical Algorithms

 Partitioning Based
 divide a set of N items into K clusters (top-down)

 Hierarchical
 agglomerative: pairs of items or clusters are successively
linked to produce larger clusters
 divisive: start with the whole set as a cluster and
successively divide sets into smaller partitions

12 8/23/2014
Hierarchical Clustering

 Use distance matrix as clustering criteria.

Step 0 Step 1 Step 2 Step 3 Step 4 agglomerative

(AGNES)
a
ab
b
abcde
c
cde
d
de
e
divisive
(DIANA)
Step 4 Step 3 Step 2 Step 1 Step 0

13 8/23/2014
AGNES (Agglomerative Nesting)

 Introduced in Kaufmann and Rousseeuw (1990)

 Use the dissimilarity matrix.
 Merge nodes that have the least dissimilarity
 Go on in a non-descending fashion
 Eventually all nodes belong to the same cluster

10 10 10

9 9 9

8 8 8

7 7 7

6 6 6

5 5 5

4 4 4

3 3 3

2 2 2

1 1 1

0 0 0
0 1 2 3 4 5 6 7 8 9 10 0 1 2 3 4 5 6 7 8 9 10 0 1 2 3 4 5 6 7 8 9 10

14 8/23/2014
Algorithmic steps for Agglomerative
Hierarchical clustering
Let X = {x1, x2, x3, ..., xn} be the set of data points.
(1)Begin with the disjoint clustering having level L(0) = 0 and sequence number m =
0.

(2)Find the least distance pair of clusters in the current clustering, say pair (r), (s),
according
to d[(r),(s)] = min d[(i),(j)] where the minimum is over all pairs of clusters in
the current clustering.

(3)Increment the sequence number: m = m +1.Merge clusters (r) and (s) into a
single cluster to form the next clustering m. Set the level of this clustering to
L(m) = d[(r),(s)].

(4)Update the distance matrix, D, by deleting the rows and columns corresponding
to clusters (r) and (s) and adding a row and column corresponding to the
newly formed cluster. The distance between the new cluster, denoted (r,s) and
old cluster(k) is defined in this way: d[(k), (r,s)] = min (d[(k),(r)], d[(k),(s)]).

(5)If
15 all the data points are in one cluster then stop, else repeat8/23/2014
from step 2).
A Dendrogram Shows How the
Clusters are Merged Hierarchically

16 8/23/2014
DIANA (Divisive Analysis)

 Introduced in Kaufmann and Rousseeuw (1990)

 Implemented in statistical analysis packages, e.g., Splus
 Inverse order of AGNES
 Eventually each node forms a cluster on its own

10 10
10

9 9
9

8 8
8

7 7
7

6 6
6

5 5
5

4 4
4

3 3
3

2 2
2

1 1
1

0 0
0
0 1 2 3 4 5 6 7 8 9 10 0 1 2 3 4 5 6 7 8 9 10
0 1 2 3 4 5 6 7 8 9 10

17 8/23/2014
Algorithmic steps for Divisive Hierarchical
clustering
1. Start with one cluster that contains all samples.
2. Calculate diameter of each cluster. Diameter is the
maximal distance between samples in the cluster.
Choose one cluster C having maximal diameter of all
clusters to split.
3. Find the most dissimilar sample x from cluster C. Let x
depart from the original cluster C to form a new
independent cluster N (now cluster C does not include
sample x). Assign all members of cluster C to MC.
4. Repeat step 6 until members of cluster C and N do not
change.
5. Calculate similarities from each member of MC to cluster
C and N, and let the member owning the highest
similarities in MC move to its similar cluster C or N.
Update members of C and N.
6. Repeat the step 2, 3, 4, 5 until the number of clusters
18becomes the number of samples or as specified8/23/2014 by the
user.
Pros and Cons
Advantages
1) No a priori information about the number of clusters required.
2) Easy to implement and gives best result in some cases.

Disadvantages
1) Algorithm can never undo what was done previously.
2) Time complexity of at least O(n2 log n) is required, where ‘n’ is the
number of data points.
3) Based on the type of distance matrix chosen for merging different
algorithms can suffer with one or more of the following:
i) Sensitivity to noise and outliers
ii) Breaking large clusters
iii) Difficulty handling different sized clusters and convex shapes
4) No objective function is directly minimized
5) Sometimes it is difficult to identify the correct number of clusters by
the dendogram.

19 8/23/2014
20 8/23/2014

Pattern Recognition - Clustering - Classification
No ratings yet
Pattern Recognition - Clustering - Classification
177 pages
Class 10th - Complete Map Work - Board Exam-1
100% (1)
Class 10th - Complete Map Work - Board Exam-1
20 pages
Entry Test Preparation
100% (1)
Entry Test Preparation
23 pages
Femfat 52 Max Manual e
50% (2)
Femfat 52 Max Manual e
103 pages
K Medoids
No ratings yet
K Medoids
101 pages
Honda Shine 125 Spare Parts Price List and Accessories Fixerbolt
No ratings yet
Honda Shine 125 Spare Parts Price List and Accessories Fixerbolt
2 pages
Lecture - 11 Hierarchical Clustering
No ratings yet
Lecture - 11 Hierarchical Clustering
28 pages
Final ML Unit3 May24
No ratings yet
Final ML Unit3 May24
154 pages
DIALux Evo Manual
No ratings yet
DIALux Evo Manual
91 pages
Clustering Class
No ratings yet
Clustering Class
103 pages
Clustering
No ratings yet
Clustering
28 pages
ML-Module 5-P1
No ratings yet
ML-Module 5-P1
45 pages
Module 5
No ratings yet
Module 5
43 pages
Hierarchical Clustering
No ratings yet
Hierarchical Clustering
32 pages
Module-5 Clustering Algorithms
No ratings yet
Module-5 Clustering Algorithms
44 pages
4.1 Clustering
No ratings yet
4.1 Clustering
80 pages
Chapter 3 Unsupervised Learning
No ratings yet
Chapter 3 Unsupervised Learning
45 pages
Agglomerative Clustering
No ratings yet
Agglomerative Clustering
44 pages
Week6 Clustering Regression
No ratings yet
Week6 Clustering Regression
101 pages
DM & W - Unit - 3
No ratings yet
DM & W - Unit - 3
34 pages
Clustering
No ratings yet
Clustering
64 pages
DM 4
No ratings yet
DM 4
76 pages
Presentation 28128 Content Document 20241126014005PM
No ratings yet
Presentation 28128 Content Document 20241126014005PM
80 pages
ML Module5
No ratings yet
ML Module5
37 pages
Flat Unit 1
No ratings yet
Flat Unit 1
80 pages
Clustering, A Tool To Analyze Data Points
No ratings yet
Clustering, A Tool To Analyze Data Points
61 pages
Grouping
No ratings yet
Grouping
98 pages
Clustering
No ratings yet
Clustering
75 pages
Cluster Analysis
No ratings yet
Cluster Analysis
60 pages
Chapter 6
No ratings yet
Chapter 6
62 pages
Module 5 - Clustering - Afterclassb
No ratings yet
Module 5 - Clustering - Afterclassb
49 pages
ML12 Clustering
No ratings yet
ML12 Clustering
34 pages
ML4 Unsupervised Learning
No ratings yet
ML4 Unsupervised Learning
60 pages
AIMLB PGP 2024 Session 12
No ratings yet
AIMLB PGP 2024 Session 12
46 pages
DEU CSC5045 Intelligent System Applications Using Fuzzy - 4+clustering
No ratings yet
DEU CSC5045 Intelligent System Applications Using Fuzzy - 4+clustering
61 pages
Clustering
No ratings yet
Clustering
27 pages
Slide TIF311 DM 10 11
No ratings yet
Slide TIF311 DM 10 11
49 pages
Week 07 Lecture Material
No ratings yet
Week 07 Lecture Material
49 pages
Clustering Basics
No ratings yet
Clustering Basics
39 pages
ML Co4 Session 29
No ratings yet
ML Co4 Session 29
36 pages
Lecture 10
No ratings yet
Lecture 10
26 pages
CV w4 - Recognition - Statistical Based
No ratings yet
CV w4 - Recognition - Statistical Based
42 pages
ML Clustering Algorithm
No ratings yet
ML Clustering Algorithm
29 pages
Data Mining: Hierarchical Clustering, DBSCAN The EM Algorithm
No ratings yet
Data Mining: Hierarchical Clustering, DBSCAN The EM Algorithm
63 pages
DM 10,11 Clustering PDF
No ratings yet
DM 10,11 Clustering PDF
65 pages
AIML Chapter 13
No ratings yet
AIML Chapter 13
26 pages
P 3.1.3 Hierarchical
No ratings yet
P 3.1.3 Hierarchical
30 pages
Clustering: Sridhar S Department of IST Anna University
No ratings yet
Clustering: Sridhar S Department of IST Anna University
91 pages
Clustering Slides
No ratings yet
Clustering Slides
22 pages
Topic 6d - Hierarchical Algorithm
No ratings yet
Topic 6d - Hierarchical Algorithm
38 pages
Introduction To Data Science: Tom A S Horv Ath
No ratings yet
Introduction To Data Science: Tom A S Horv Ath
39 pages
Introduction To Clustering: Alka Arora Sr. Scientist
No ratings yet
Introduction To Clustering: Alka Arora Sr. Scientist
57 pages
Action Plan in Mathematics Grade 4-Integrity
100% (1)
Action Plan in Mathematics Grade 4-Integrity
2 pages
SAT Suite Question Bank - Results
No ratings yet
SAT Suite Question Bank - Results
279 pages
Understanding The Inners of Clustering: DR Akashdeep, UIET, Panjab University Chandigarh, Maivriklab@pu - Ac.in
No ratings yet
Understanding The Inners of Clustering: DR Akashdeep, UIET, Panjab University Chandigarh, Maivriklab@pu - Ac.in
61 pages
ML Module 4 2022 1 PDF
No ratings yet
ML Module 4 2022 1 PDF
31 pages
Unit-7 Finalized
No ratings yet
Unit-7 Finalized
20 pages
Module 4 ML
No ratings yet
Module 4 ML
11 pages
Flat Unit 5 Qa
No ratings yet
Flat Unit 5 Qa
33 pages
Classification Algorithm
No ratings yet
Classification Algorithm
78 pages
Cluster Analysis Introduction
No ratings yet
Cluster Analysis Introduction
23 pages
EN 10202-2001冷轧镀锡钢制品
No ratings yet
EN 10202-2001冷轧镀锡钢制品
48 pages
Unit 1 1.: Discuss The Challenges of The Distributed Systems With Their Examples?
No ratings yet
Unit 1 1.: Discuss The Challenges of The Distributed Systems With Their Examples?
18 pages
Clustering Lecture
No ratings yet
Clustering Lecture
46 pages
Cluster
100% (1)
Cluster
72 pages
DWDM Lecture Notes
No ratings yet
DWDM Lecture Notes
139 pages
Clustering: EE-671 Prof L. Behera, IITK
No ratings yet
Clustering: EE-671 Prof L. Behera, IITK
33 pages
Kinetic Study of Esterification Reaction For The Synthesis of Butyl Acetate IJERTV3IS10139
No ratings yet
Kinetic Study of Esterification Reaction For The Synthesis of Butyl Acetate IJERTV3IS10139
9 pages
Jalali@mshdiua - Ac.ir Jalali - Mshdiau.ac - Ir: Data Mining
No ratings yet
Jalali@mshdiua - Ac.ir Jalali - Mshdiau.ac - Ir: Data Mining
53 pages
Supervised Learning vs. Unsupervised Learning
No ratings yet
Supervised Learning vs. Unsupervised Learning
7 pages
Clustering Hierarchical PDF
No ratings yet
Clustering Hierarchical PDF
31 pages
Ranbaxy: Taking The ERP Pill: By: Group A6
100% (2)
Ranbaxy: Taking The ERP Pill: By: Group A6
5 pages
23: Economies of Scale
No ratings yet
23: Economies of Scale
55 pages
What Is Cluster Analysis?: - Cluster: A Collection of Data Objects
No ratings yet
What Is Cluster Analysis?: - Cluster: A Collection of Data Objects
51 pages
Flat Unit 4 Qa
No ratings yet
Flat Unit 4 Qa
37 pages
Flat Unit 2 Qa
No ratings yet
Flat Unit 2 Qa
30 pages
Flat Unit 3 Qa
No ratings yet
Flat Unit 3 Qa
35 pages
Flat Unit 1 Qa
No ratings yet
Flat Unit 1 Qa
25 pages
Masonry Wall Design at Horizontal Bending, Based On ACI 530-05 Input Data & Design Summary
No ratings yet
Masonry Wall Design at Horizontal Bending, Based On ACI 530-05 Input Data & Design Summary
1 page
Study Material PDF
0% (1)
Study Material PDF
69 pages
(IDR) Building Maintenance and Management (TUGINO) (INDONESIA) (Online Learning) (February 2022) Ika
No ratings yet
(IDR) Building Maintenance and Management (TUGINO) (INDONESIA) (Online Learning) (February 2022) Ika
6 pages
LCD SPEC - XT1965 - 6.24-FHD-BOE-M6205G6F1-SD68C37284-LCD Spec-A0 - 20181009
No ratings yet
LCD SPEC - XT1965 - 6.24-FHD-BOE-M6205G6F1-SD68C37284-LCD Spec-A0 - 20181009
57 pages
Abhishek - Chhatrola Case Study
No ratings yet
Abhishek - Chhatrola Case Study
7 pages
CMPT 413/713: Natural Language Processing: Nat Langlab
No ratings yet
CMPT 413/713: Natural Language Processing: Nat Langlab
43 pages
Alia AMF601 Electromagnetic Flowmeter
No ratings yet
Alia AMF601 Electromagnetic Flowmeter
4 pages
Practice Set 1
No ratings yet
Practice Set 1
1 page
Common Modification Request Form
No ratings yet
Common Modification Request Form
2 pages
R2 - Hina Rizvi
No ratings yet
R2 - Hina Rizvi
7 pages
Public Sculpture Examining The Differences Between Contemporary Sculpture Inside and Outside The Art Institution
No ratings yet
Public Sculpture Examining The Differences Between Contemporary Sculpture Inside and Outside The Art Institution
20 pages
WJ 1995 05 s153 PDF
No ratings yet
WJ 1995 05 s153 PDF
7 pages
Bio Concrete
No ratings yet
Bio Concrete
4 pages
Hedy Rahadian, Iwan Zarkasi, Ariono Dhanis
No ratings yet
Hedy Rahadian, Iwan Zarkasi, Ariono Dhanis
15 pages
Hadoop Vs Spark Vs Kafka - Comparing Big Data & Distributed Streaming Tools
No ratings yet
Hadoop Vs Spark Vs Kafka - Comparing Big Data & Distributed Streaming Tools
4 pages
Track Consignment
No ratings yet
Track Consignment
1 page
ALLISON TOURIGNY - Resume
No ratings yet
ALLISON TOURIGNY - Resume
2 pages
Gmail - HISTORY 100 PDF
No ratings yet
Gmail - HISTORY 100 PDF
2 pages
Homework We
No ratings yet
Homework We
1 page
Multiple Integrals, A Collection of Solved Problems
From Everand
Multiple Integrals, A Collection of Solved Problems
Steven Tan
No ratings yet
Geometric functions in computer aided geometric design
From Everand
Geometric functions in computer aided geometric design
Oscar Ruiz
No ratings yet

Hierarchicalclustering

Uploaded by

Hierarchicalclustering

Uploaded by

Hierarchical Clustering

Clustering is a process of partitioning a set of data (or objects) in

 but as a collection, they are

An Employee DB Term Frequencies for Documents

ID Gender Age Salary T1 T2 T3 T4 T5 T6

Which objects are more similar?

 Common Distance Measures:

 Nominal (categorical) Attributes

ID Gender Age Salary ID Gender Age Salary

 dist(ID2, ID3) = SQRT( 0 + (0.04)2 + (0.44)2 ) = 0.44

 natural distance functions may exist in the data

dij  similarity (or distance) of Di to D j

 The similarity matrix can be visualized as an undirected

If no threshold is used, then

 Use distance matrix as clustering criteria.

Step 0 Step 1 Step 2 Step 3 Step 4 agglomerative

 Introduced in Kaufmann and Rousseeuw (1990)

 Introduced in Kaufmann and Rousseeuw (1990)

You might also like