
Clustering

Ch. 16
What is clustering?
 Clustering: the process of grouping a set of objects into classes of
similar objects
 Objects within a cluster should be similar.
 Objects from different clusters should be dissimilar.
 The commonest form of unsupervised learning
 Unsupervised learning = learning from raw data, as opposed to
supervised learning, where a classification of the examples is given
 Applications in Search engines:
 Structuring search results
 Suggesting related pages
 Automatic directory construction/update
 Finding near identical/duplicate pages
Classification vs. Clustering

Classification:
• Supervised learning
• Learns a method for predicting the instance class from pre-labeled (classified) instances

Clustering:
• Unsupervised learning
• Finds “natural” grouping of instances given un-labeled data
Classification vs. Clustering (cont.)

 There is no target variable for clustering
 Clustering does not try to classify or predict the values of a
target variable.
 Instead, clustering algorithms seek to segment the entire data
set into relatively homogeneous subgroups or clusters,
 where the similarity of the records within a cluster is maximized,
and
 similarity to records outside the cluster is minimized.
Goal of Clustering

 Identification of groups of records such that similarity
within a group is very high while the similarity to records
in other groups is very low.
 Group data points that are close (or similar) to each other
 Identify such groupings (or clusters) in an unsupervised manner
 Unsupervised: no information is provided to the algorithm
on which data points belong to which clusters
 In other words,
 A clustering algorithm seeks to construct clusters of records such
that the between-cluster variation (BCV) is large compared to the
within-cluster variation (WCV)
Goal of Clustering

 Within-cluster variation (intra-cluster distance): the sum of distances
between objects in the same cluster; this should be minimized.
 Between-cluster variation (inter-cluster distance): the distances
between different clusters; this should be maximized.
 A good clustering therefore has a BCV that is large compared to the WCV.
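To make the BCV/WCV idea concrete, here is a minimal sketch (not from the slides, and assuming NumPy is available) that takes a literal reading of the two definitions: within-cluster variation as the sum of pairwise distances inside each cluster, and between-cluster variation as the sum of pairwise distances across clusters.

import numpy as np

def wcv(X, labels):
    # Within-cluster variation: sum of distances between objects in the same cluster
    total = 0.0
    for i in range(len(X)):
        for j in range(i + 1, len(X)):
            if labels[i] == labels[j]:
                total += np.linalg.norm(X[i] - X[j])
    return total

def bcv(X, labels):
    # Between-cluster variation: sum of distances between objects in different clusters
    total = 0.0
    for i in range(len(X)):
        for j in range(i + 1, len(X)):
            if labels[i] != labels[j]:
                total += np.linalg.norm(X[i] - X[j])
    return total

# Two well-separated groups: BCV comes out large compared to WCV
X = np.array([[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]], dtype=float)
labels = np.array([0, 0, 0, 1, 1, 1])
print("WCV:", round(wcv(X, labels), 2), "BCV:", round(bcv(X, labels), 2))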
Type of Clustering
 Partitional clustering: Partitional algorithms determine all clusters at
once. They include:
 K-Means Clustering
 Fuzzy c-means clustering
 QT clustering
 Hierarchical Clustering:
 Agglomerative ("bottom-up"): Agglomerative algorithms begin with
each element as a separate cluster and merge them into successively
larger clusters.
 Divisive ("top-down"): Divisive algorithms begin with the whole set
and proceed to divide it into successively smaller clusters.
Hard vs. soft clustering

 Hard clustering: Each document belongs to exactly one cluster
 More common and easier to do
 Soft clustering: A document can belong to more than one
cluster.
 Makes more sense for applications like creating browsable
hierarchies
 You may want to put a pair of sneakers in two clusters: (i) sports
apparel and (ii) shoes
 You can only do that with a soft clustering approach.
Sec. 16.2

Representation for clustering

 How to measure similarity
 Euclidean Distance
 City-block Distance
 Minkowski Distance
 How many clusters?
 Fixed a priori? -> partitional algorithms
 Completely data-driven? -> hierarchical algorithms
 Avoid “trivial” clusters - too large or too small
 If a cluster is too large, then for navigation purposes you've wasted an
extra user click without whittling down the set of documents much.
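The three distance measures listed above can be written in a few lines. This is an illustrative sketch (assuming NumPy; the function names are ours, not from the chapter).

import numpy as np

def euclidean(x, y):
    return np.sqrt(np.sum((x - y) ** 2))

def city_block(x, y):            # also known as Manhattan distance
    return np.sum(np.abs(x - y))

def minkowski(x, y, p=3):        # p = 1 gives city-block, p = 2 gives Euclidean
    return np.sum(np.abs(x - y) ** p) ** (1.0 / p)

x, y = np.array([1.0, 3.0]), np.array([4.0, 2.0])
print(euclidean(x, y), city_block(x, y), minkowski(x, y, p=3))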

1- Partitional clustering
k-Means Clustering

 Input: n objects (or points) and a number k
 Algorithm steps:
1) Randomly assign k records to be the initial cluster
center locations
2) Assign each object to the group that has the closest
centroid
3) When all objects have been assigned, recalculate the
positions of the k centroids
4) Repeat steps 2 and 3 until convergence or termination
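A minimal sketch of the four steps above, assuming NumPy. It is an illustration of the algorithm rather than a production implementation (for instance, it does not guard against a cluster becoming empty).

import numpy as np

def k_means(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]        # step 1: random initial centers
    for _ in range(max_iter):
        # step 2: assign each object to the group with the closest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # step 3: recalculate the positions of the k centroids (assumes no cluster is empty)
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):                    # converged: centroids unchanged
            break
        centroids = new_centroids                                     # step 4: repeat steps 2-3
    return labels, centroids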
K-Means Clustering
Termination Conditions

 A maximal number of iterations is reached
 The algorithm terminates when the centroids no longer change
 The SSE (sum of squared errors) value does not change significantly:

  SSE = Σ_k Σ_{obj ∈ Ck} d(obj, centk)²

where obj represents each data point in cluster Ck and centk is the centroid of
cluster Ck
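As a companion to the definition above, a short sketch of the SSE criterion (assuming NumPy and the k_means sketch shown earlier; the name sse is illustrative).

import numpy as np

def sse(X, labels, centroids):
    # Sum of squared distances of each point to the centroid of its own cluster
    return sum(np.sum((X[labels == k] - centroids[k]) ** 2)
               for k in range(len(centroids)))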
K-means example (illustrated):
 Step 1: Pick 3 initial cluster centers (randomly)
 Step 2: Assign each point to the closest cluster center
 Step 3: Move each cluster center to the mean of its cluster
 Step 4a: Reassign the points that are now closest to a different cluster center
   (Q: Which points are reassigned?)
 Step 4b: Re-compute cluster means
 Step 5: Move cluster centers to cluster means
Example 2:
 Suppose that we have eight data points in two-dimensional space
as follows:
   a (1,3), b (3,3), c (4,3), d (5,3), e (1,2), f (4,2), g (1,1), h (2,1)
 And suppose that we are interested in uncovering k = 2 clusters.
 The initial cluster centers are m1 = (1,1) for cluster C1 and m2 = (2,1) for cluster C2.
 Using Euclidean distance, each point is assigned to the cluster whose center is closer:
Point     Distance from m1 (1,1)   Distance from m2 (2,1)   Cluster membership
a (1,3)          2.00                     2.24                     C1
b (3,3)          2.83                     2.24                     C2
c (4,3)          3.61                     2.83                     C2
d (5,3)          4.47                     3.61                     C2
e (1,2)          1.00                     1.41                     C1
f (4,2)          3.16                     2.24                     C2
g (1,1)          0.00                     1.00                     C1
h (2,1)          1.00                     0.00                     C2

SSE = 33.64
Centroid of cluster 1 is
[(1+1+1)/3, (3+2+1)/3] = (1, 2)
Centroid of cluster 2 is
[(3+4+5+4+2)/5, (3+3+3+2+1)/5] = (3.6, 2.4)
With the new centroids m1 = (1, 2) and m2 = (3.6, 2.4), the distances are
recomputed and each point is reassigned:
Point     Distance from m1 (1,2)   Distance from m2 (3.6,2.4)   Old membership   New membership
a (1,3)          1.00                     2.67                        C1               C1
b (3,3)          2.24                     0.85                        C2               C2
c (4,3)          3.16                     0.72                        C2               C2
d (5,3)          4.12                     1.52                        C2               C2
e (1,2)          0.00                     2.63                        C1               C1
f (4,2)          3.00                     0.57                        C2               C2
g (1,1)          1.00                     2.95                        C1               C1
h (2,1)          1.41                     2.13                        C2               C1
SSE=30.42
Centroid of cluster 1 is
[(1+1+1+2)/4, (3+2+1+1)/4] = (1.25, 1.75)
Centroid of cluster 2 is
[(3+4+5+4)/4, (3+3+3+2)/4] = (4, 2.75)
With the new centroids m1 = (1.25, 1.75) and m2 = (4, 2.75), the distances are
recomputed once more:
Point     Distance from m1 (1.25,1.75)   Distance from m2 (4,2.75)   Old membership   New membership
a (1,3)            1.27                          3.01                     C1               C1
b (3,3)            2.15                          1.03                     C2               C2
c (4,3)            3.02                          0.25                     C2               C2
d (5,3)            3.95                          1.03                     C2               C2
e (1,2)            0.35                          3.09                     C1               C1
f (4,2)            2.76                          0.75                     C2               C2
g (1,1)            0.79                          3.47                     C1               C1
h (2,1)            1.06                          2.66                     C1               C1

SSE = 30.64: no reduction, stop.

Final results: C1 = {a, e, g, h}, C2 = {b, c, d, f}
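The hand computation above can be replayed in a few lines. This is an illustrative sketch (assuming NumPy) that prints the distance table and the updated centroids for three passes, which is enough for this data set to stabilise; it reproduces the assignments and centroids in the tables above (the SSE bookkeeping is omitted).

import numpy as np

points = {"a": (1, 3), "b": (3, 3), "c": (4, 3), "d": (5, 3),
          "e": (1, 2), "f": (4, 2), "g": (1, 1), "h": (2, 1)}
m1, m2 = np.array([1.0, 1.0]), np.array([2.0, 1.0])      # initial cluster centers

for it in range(3):                                       # three passes suffice here
    members = {"C1": [], "C2": []}
    print(f"-- pass {it + 1}: m1 = {m1}, m2 = {m2}")
    for name, p in points.items():
        p = np.array(p, dtype=float)
        d1, d2 = np.linalg.norm(p - m1), np.linalg.norm(p - m2)
        cluster = "C1" if d1 <= d2 else "C2"              # assign to the nearer center
        members[cluster].append(p)
        print(f"{name} {tuple(p)}  d(m1)={d1:.2f}  d(m2)={d2:.2f}  -> {cluster}")
    m1 = np.mean(members["C1"], axis=0)                   # recompute the centroids
    m2 = np.mean(members["C2"], axis=0)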
How to decide k?

 Unless the analyst has prior knowledge of the
number of underlying clusters:
 Clustering solutions for each value of K are compared
 The value of K resulting in the smallest SSE is
selected
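One common way to carry out this comparison is to compute the SSE for a range of K values and look for the point where it stops dropping sharply (the "elbow"). A sketch, assuming scikit-learn is available (its KMeans exposes the final SSE as inertia_), run on the eight points from the earlier example:

import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 3], [3, 3], [4, 3], [5, 3],
              [1, 2], [4, 2], [1, 1], [2, 1]], dtype=float)

for k in range(1, 6):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(f"k={k}  SSE={km.inertia_:.2f}")
# SSE always decreases as k grows, so look for the k after which the drop flattens out.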
Sec. 16.3
What Is A Good Clustering?

 Internal criterion: A good clustering will produce high
quality clusters in which:
 the intra-class (that is, intra-cluster) similarity is high
 the inter-class similarity is low
 The measured quality of a clustering depends on both the
document representation and the similarity measure used
Summary of k-means
The K-means algorithm is a simple yet popular method for clustering analysis
 Low complexity: O(nkt), where n = #objects, k = #clusters, t = #iterations
 Its performance is determined by the initialisation and an appropriate distance
measure
 There are several variants of K-means to overcome its weaknesses
 K-Medoids: resistance to noise and/or outliers (data that do not comply with
the general behaviour or model of the data)
 K-Modes: extension to categorical data clustering analysis
 CLARA: extension to deal with large data sets
 Gaussian Mixture models (EM algorithm): handling uncertainty of clusters
2. Hierarchical Clustering
Hierarchical clustering and dendrograms

 A hierarchical clustering on a set of objects D is a set of nested
partitions of D. It is represented by a binary tree such that:
 The root node is a cluster that contains all data points
 Each (parent) node is a cluster made of two subclusters (children)
 Each leaf node represents one data point (a singleton, i.e. a cluster with only one
item)
 A hierarchical clustering scheme is also called a taxonomy. In data
clustering the binary tree is called a dendrogram.
 A dendrogram is a tree diagram frequently used to illustrate the
arrangement of the clusters produced by hierarchical clustering.
Dendrogram: Hierarchical Clustering

• Clustering is obtained by cutting the dendrogram at a desired level:
each connected component forms a cluster.
• Does not require the number of clusters k in advance
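A sketch of this idea, assuming SciPy is available: build the hierarchy, lay out the dendrogram, then cut it so that a desired number of clusters remains.

import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

X = np.array([[1, 2], [1, 2.5], [3, 1], [4, 0.5], [4, 2]], dtype=float)
Z = linkage(X, method='single')                   # hierarchical (single-link) clustering
tree = dendrogram(Z, no_plot=True)                # dendrogram layout (plot it with matplotlib if desired)
labels = fcluster(Z, t=2, criterion='maxclust')   # cut the dendrogram so that 2 clusters remain
print(labels)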
Hierarchical clustering: forming clusters
 Forming clusters from dendrograms
Hierarchical clustering
 There are two styles of hierarchical clustering algorithms to build a tree from the
input set S:
 Agglomerative (bottom-up):
 Begins with singletons (sets with 1 element)
 Merges them until S is achieved as the root
 In each step, the two closest clusters are aggregated into a new combined cluster
 In this way, the number of clusters in the data set is reduced at each step
 Eventually, all records/elements are combined into a single huge cluster
 It is the most common approach.
 Divisive (top-down):
 All records are combined into one big cluster
 Then the most dissimilar records are split off, recursively partitioning S until singleton
sets are reached.
Two types of hierarchical clustering algorithms:
Agglomerative: “bottom-up”
Divisive: “top-down”

Hierarchical Agglomerative Clustering (HAC) Algorithm

• Assumes a similarity function for determining the similarity of two
instances.
• Starts with all instances in a separate cluster and then repeatedly
joins the two clusters that are most similar until there is only one
cluster.

Start with all instances in their own cluster.
Until there is only one cluster:
  Among the current clusters, determine the two
  clusters, ci and cj, that are most similar.
  Replace ci and cj with a single cluster ci ∪ cj.
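A naive sketch of this loop (assuming NumPy), using the minimum pairwise distance between clusters as the similarity (single link); the names are illustrative, and real implementations are far more efficient than this O(n^3) version.

import numpy as np

def hac(X):
    clusters = [[i] for i in range(len(X))]       # start: each instance in its own cluster
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):            # find the two most similar clusters ci, cj
            for j in range(i + 1, len(clusters)):
                d = min(np.linalg.norm(X[a] - X[b])
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]        # replace ci and cj with ci ∪ cj
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges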
Sec. 17.2
Closest pair of clusters

 Many variants to defining the closest pair of clusters
 Single-link
 Similarity of the most cosine-similar pair of documents (one from each cluster)
 Complete-link
 Similarity of the “furthest” points, the least cosine-similar
 Centroid
 Clusters whose centroids (centers of gravity) are the most cosine-similar
 Average-link
 Average cosine between pairs of elements
Lance-Williams Algorithm
Definition (Lance-Williams formula)
In AHC algorithms, the Lance-Williams formula
[Lance and Williams, 1967] is a recurrence equation used to calculate
the dissimilarity between a cluster Ck and a cluster formed by
merging two other clusters Cl ∪ Cj:

  d(Cl ∪ Cj, Ck) = αl d(Cl, Ck) + αj d(Cj, Ck) + β d(Cl, Cj) + γ |d(Cl, Ck) - d(Cj, Ck)|

where αl, αj, β, γ are real numbers.

AHC methods and the Lance-Williams formula
Each AHC method corresponds to a particular choice of coefficients; for example,
single link uses αl = αj = 1/2, β = 0, γ = -1/2, and complete link uses
αl = αj = 1/2, β = 0, γ = +1/2.
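A direct transcription of the recurrence (illustrative only): given the current dissimilarities and a set of coefficients, it returns the dissimilarity between Ck and the merged cluster. Plugging in the single-link and complete-link coefficients shows that the formula reduces to the minimum and maximum of the two distances, respectively.

def lance_williams(d_lk, d_jk, d_lj, alpha_l, alpha_j, beta, gamma):
    # d(Cl ∪ Cj, Ck) from d(Cl,Ck), d(Cj,Ck), d(Cl,Cj) and the four coefficients
    return alpha_l * d_lk + alpha_j * d_jk + beta * d_lj + gamma * abs(d_lk - d_jk)

# Single link (alpha_l = alpha_j = 1/2, beta = 0, gamma = -1/2) -> min(d_lk, d_jk)
print(lance_williams(2.0, 3.0, 1.0, 0.5, 0.5, 0.0, -0.5))   # 2.0
# Complete link (alpha_l = alpha_j = 1/2, beta = 0, gamma = +1/2) -> max(d_lk, d_jk)
print(lance_williams(2.0, 3.0, 1.0, 0.5, 0.5, 0.0, 0.5))    # 3.0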
Cluster distance measure
 Single link
 Distance between closest elements in clusters

 Complete link
 Distance between farthest elements in clusters

 Centroids
 Distance between the centroids (means) of two clusters
Single link method

 Also known as the nearest neighbor method, since it
employs the nearest neighbor to measure the
dissimilarity between two clusters:

  d(Ci, Cj) = min { d(x, y) : x ∈ Ci, y ∈ Cj }
Single-link clustering

Figure: nested clusters over points 1-6 and the corresponding dendrogram
(merge heights on the vertical axis, between 0 and 0.2).
Example 1 - Single link method
You can cut the dendrogram at any level: if you cut at a lower
level you will get many clusters;
if you cut at a higher level you will get fewer clusters.
Example 2 - Single link method
• x1 = (1, 2)
• x2 = (1, 2.5)
• x3 = (3, 1)
• x4 = (4, 0.5)
• x5 = (4, 2)

Merge x1 and x2
Merge x3 and x4
Merge {x3, x4} and x5
Merge {x1, x2} and {x3, x4, x5}
Example 3 - Complete link method
• x1 = (1, 2)
• x2 = (1, 2.5)
• x3 = (3, 1)
• x4 = (4, 0.5)
• x5 = (4, 2)

Merge x1 and x2
Merge x3 and x4
Merge {x3, x4} and x5
Merge {x1, x2} and {x3, x4, x5}
The dendrogram:
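Both examples can be checked with SciPy (an assumption; the chapter does not prescribe a library). Each row of the linkage matrix records one merge: the two merged items/clusters, the merge height, and the new cluster size. For these five points the merge order is the same for single and complete link, but the merge heights differ.

import numpy as np
from scipy.cluster.hierarchy import linkage

X = np.array([[1, 2], [1, 2.5], [3, 1], [4, 0.5], [4, 2]], dtype=float)  # x1..x5
print(linkage(X, method='single'))     # merges {x1,x2}, {x3,x4}, {x3,x4,x5}, then everything
print(linkage(X, method='complete'))   # same merge order here, larger merge heights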
Pros and Cons of Hierarchical Clustering
 Advantages
 Dendrograms are great for visualization
 Provides hierarchical relations between clusters
 Disadvantages
 Not easy to define levels for clusters
 Can never undo what was done previously
 Sensitive to cluster distance measures and noise/outliers
 Experiments showed that other clustering techniques outperform
hierarchical clustering
 There are several variants to overcome its weaknesses
 BIRCH: scalable to a large data set
 ROCK: clustering categorical data
 CHAMELEON: hierarchical clustering using dynamic modelling
