
Clustering

Ch. 16
What is clustering?
 Clustering: the process of grouping a set of objects into classes of
similar objects
 Objects within a cluster should be similar.
 Objects from different clusters should be dissimilar.
 The commonest form of unsupervised learning
 Unsupervised learning = learning from raw data, as opposed to
supervised learning, where a classification of the examples is given
 Applications in Search engines:
 Structuring search results
 Suggesting related pages
 Automatic directory construction/update
 Finding near identical/duplicate pages
Classification vs. Clustering

Classification:
• Supervised learning
• Learns a method for predicting the instance class from pre-labeled (classified) instances

Clustering:
• Unsupervised learning
• Finds “natural” grouping of instances given un-labeled data
Classification vs. Clustering (cont.)

 There is no target variable for clustering
 Clustering does not try to classify or predict the values of a
target variable.
 Instead, clustering algorithms seek to segment the entire data
set into relatively homogeneous subgroups or clusters,
 where the similarity of the records within a cluster is maximized,
and
 similarity to records outside the cluster is minimized.
Goal of Clustering

 Identification of groups of records such that similarity
within a group is very high while the similarity to records
in other groups is very low.
 Group data points that are close (or similar) to each other
 Identify such groupings (or clusters) in an unsupervised manner
 Unsupervised: no information is provided to the algorithm
on which data points belong to which clusters
 In other words,
 A clustering algorithm seeks to construct clusters of records such
that the between-cluster variation (BCV) is large compared to the
within-cluster variation (WCV)
Goal of Clustering

 Within-cluster variation (intra-cluster distance): the sum of distances
between objects in the same cluster; this should be minimized.
 Between-cluster variation (inter-cluster distance): the distances
between different clusters; this should be maximized.
 A good clustering therefore has a BCV that is large compared to the WCV.
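To make the BCV/WCV idea concrete, here is a minimal sketch (not from the slides, and assuming NumPy is available) that takes a literal reading of the two definitions: within-cluster variation as the sum of pairwise distances inside each cluster, and between-cluster variation as the sum of pairwise distances across clusters.

import numpy as np

def wcv(X, labels):
    # Within-cluster variation: sum of distances between objects in the same cluster
    total = 0.0
    for i in range(len(X)):
        for j in range(i + 1, len(X)):
            if labels[i] == labels[j]:
                total += np.linalg.norm(X[i] - X[j])
    return total

def bcv(X, labels):
    # Between-cluster variation: sum of distances between objects in different clusters
    total = 0.0
    for i in range(len(X)):
        for j in range(i + 1, len(X)):
            if labels[i] != labels[j]:
                total += np.linalg.norm(X[i] - X[j])
    return total

# Two well-separated groups: BCV comes out large compared to WCV
X = np.array([[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]], dtype=float)
labels = np.array([0, 0, 0, 1, 1, 1])
print("WCV:", round(wcv(X, labels), 2), "BCV:", round(bcv(X, labels), 2))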
Type of Clustering
 Partitional clustering: Partitional algorithms determine all clusters at
once. They include:
 K-Means Clustering
 Fuzzy c-means clustering
 QT clustering
 Hierarchical Clustering:
 Agglomerative ("bottom-up"): Agglomerative algorithms begin with
each element as a separate cluster and merge them into successively
larger clusters.
 Divisive ("top-down"): Divisive algorithms begin with the whole set
and proceed to divide it into successively smaller clusters.
Hard vs. soft clustering

 Hard clustering: Each document belongs to exactly one cluster
 More common and easier to do
 Soft clustering: A document can belong to more than one
cluster.
 Makes more sense for applications like creating browsable
hierarchies
 You may want to put a pair of sneakers in two clusters: (i) sports
apparel and (ii) shoes
 You can only do that with a soft clustering approach.
Sec. 16.2

Representation for clustering

 How to measure similarity
 Euclidean Distance
 City-block Distance
 Minkowski Distance
 How many clusters?
 Fixed a priori? -> partitional algorithms
 Completely data-driven? -> hierarchical algorithms
 Avoid “trivial” clusters - too large or too small
 If a cluster is too large, then for navigation purposes you've wasted an
extra user click without whittling down the set of documents much.
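The three distance measures listed above can be written in a few lines. This is an illustrative sketch (assuming NumPy; the function names are ours, not from the chapter).

import numpy as np

def euclidean(x, y):
    return np.sqrt(np.sum((x - y) ** 2))

def city_block(x, y):            # also known as Manhattan distance
    return np.sum(np.abs(x - y))

def minkowski(x, y, p=3):        # p = 1 gives city-block, p = 2 gives Euclidean
    return np.sum(np.abs(x - y) ** p) ** (1.0 / p)

x, y = np.array([1.0, 3.0]), np.array([4.0, 2.0])
print(euclidean(x, y), city_block(x, y), minkowski(x, y, p=3))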

1- Partitional clustering
k-Means Clustering

 Input: n objects (or points) and a number k
 Algorithm steps:
1) Randomly assign k records to be the initial cluster
center locations
2) Assign each object to the group that has the closest
centroid
3) When all objects have been assigned, recalculate the
positions of the k centroids
4) Repeat steps 2 and 3 until convergence or termination
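A minimal sketch of the four steps above, assuming NumPy. It is an illustration of the algorithm rather than a production implementation (for instance, it does not guard against a cluster becoming empty).

import numpy as np

def k_means(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]        # step 1: random initial centers
    for _ in range(max_iter):
        # step 2: assign each object to the group with the closest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # step 3: recalculate the positions of the k centroids (assumes no cluster is empty)
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):                    # converged: centroids unchanged
            break
        centroids = new_centroids                                     # step 4: repeat steps 2-3
    return labels, centroids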
K-Means Clustering
Termination Conditions

 A maximal number of iterations is reached
 The algorithm terminates when the centroids no longer change
 The SSE (sum of squared errors) value does not change significantly:

  SSE = Σ_k Σ_{obj ∈ Ck} d(obj, centk)²

where obj represents each data point in cluster Ck and centk is the centroid of
cluster Ck
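As a companion to the definition above, a short sketch of the SSE criterion (assuming NumPy and the k_means sketch shown earlier; the name sse is illustrative).

import numpy as np

def sse(X, labels, centroids):
    # Sum of squared distances of each point to the centroid of its own cluster
    return sum(np.sum((X[labels == k] - centroids[k]) ** 2)
               for k in range(len(centroids)))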
K-means example (illustrated):
 Step 1: Pick 3 initial cluster centers (randomly)
 Step 2: Assign each point to the closest cluster center
 Step 3: Move each cluster center to the mean of its cluster
 Step 4a: Reassign the points that are now closest to a different cluster center
   (Q: Which points are reassigned?)
 Step 4b: Re-compute cluster means
 Step 5: Move cluster centers to cluster means
Example 2:
 Suppose that we have eight data points in two-dimensional space
as follows:
   a (1,3), b (3,3), c (4,3), d (5,3), e (1,2), f (4,2), g (1,1), h (2,1)
 And suppose that we are interested in uncovering k = 2 clusters.
 The initial cluster centers are m1 = (1,1) for cluster C1 and m2 = (2,1) for cluster C2.
 Using Euclidean distance, each point is assigned to the cluster whose center is closer:
Point     Distance from m1 (1,1)   Distance from m2 (2,1)   Cluster membership
a (1,3)          2.00                     2.24                     C1
b (3,3)          2.83                     2.24                     C2
c (4,3)          3.61                     2.83                     C2
d (5,3)          4.47                     3.61                     C2
e (1,2)          1.00                     1.41                     C1
f (4,2)          3.16                     2.24                     C2
g (1,1)          0.00                     1.00                     C1
h (2,1)          1.00                     0.00                     C2

SSE = 33.64
Centroid of cluster 1 is
[(1+1+1)/3, (3+2+1)/3] = (1, 2)
Centroid of cluster 2 is
[(3+4+5+4+2)/5, (3+3+3+2+1)/5] = (3.6, 2.4)
With the new centroids m1 = (1, 2) and m2 = (3.6, 2.4), the distances are
recomputed and each point is reassigned:
Point     Distance from m1 (1,2)   Distance from m2 (3.6,2.4)   Old membership   New membership
a (1,3)          1.00                     2.67                        C1               C1
b (3,3)          2.24                     0.85                        C2               C2
c (4,3)          3.16                     0.72                        C2               C2
d (5,3)          4.12                     1.52                        C2               C2
e (1,2)          0.00                     2.63                        C1               C1
f (4,2)          3.00                     0.57                        C2               C2
g (1,1)          1.00                     2.95                        C1               C1
h (2,1)          1.41                     2.13                        C2               C1
SSE=30.42
Centroid of cluster 1 is
[(1+1+1+2)/4, (3+2+1+1)/4] = (1.25, 1.75)
Centroid of cluster 2 is
[(3+4+5+4)/4, (3+3+3+2)/4] = (4, 2.75)
With the new centroids m1 = (1.25, 1.75) and m2 = (4, 2.75), the distances are
recomputed once more:
Point     Distance from m1 (1.25,1.75)   Distance from m2 (4,2.75)   Old membership   New membership
a (1,3)            1.27                          3.01                     C1               C1
b (3,3)            2.15                          1.03                     C2               C2
c (4,3)            3.02                          0.25                     C2               C2
d (5,3)            3.95                          1.03                     C2               C2
e (1,2)            0.35                          3.09                     C1               C1
f (4,2)            2.76                          0.75                     C2               C2
g (1,1)            0.79                          3.47                     C1               C1
h (2,1)            1.06                          2.66                     C1               C1

SSE = 30.64: no reduction, stop.

Final results: C1 = {a, e, g, h}, C2 = {b, c, d, f}
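The hand computation above can be replayed in a few lines. This is an illustrative sketch (assuming NumPy) that prints the distance table and the updated centroids for three passes, which is enough for this data set to stabilise; it reproduces the assignments and centroids in the tables above (the SSE bookkeeping is omitted).

import numpy as np

points = {"a": (1, 3), "b": (3, 3), "c": (4, 3), "d": (5, 3),
          "e": (1, 2), "f": (4, 2), "g": (1, 1), "h": (2, 1)}
m1, m2 = np.array([1.0, 1.0]), np.array([2.0, 1.0])      # initial cluster centers

for it in range(3):                                       # three passes suffice here
    members = {"C1": [], "C2": []}
    print(f"-- pass {it + 1}: m1 = {m1}, m2 = {m2}")
    for name, p in points.items():
        p = np.array(p, dtype=float)
        d1, d2 = np.linalg.norm(p - m1), np.linalg.norm(p - m2)
        cluster = "C1" if d1 <= d2 else "C2"              # assign to the nearer center
        members[cluster].append(p)
        print(f"{name} {tuple(p)}  d(m1)={d1:.2f}  d(m2)={d2:.2f}  -> {cluster}")
    m1 = np.mean(members["C1"], axis=0)                   # recompute the centroids
    m2 = np.mean(members["C2"], axis=0)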
How to decide k?

 Unless the analyst has prior knowledge of the
number of underlying clusters:
 Clustering solutions for each value of K are compared
 The value of K resulting in the smallest SSE is
selected
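One common way to carry out this comparison is to compute the SSE for a range of K values and look for the point where it stops dropping sharply (the "elbow"). A sketch, assuming scikit-learn is available (its KMeans exposes the final SSE as inertia_), run on the eight points from the earlier example:

import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 3], [3, 3], [4, 3], [5, 3],
              [1, 2], [4, 2], [1, 1], [2, 1]], dtype=float)

for k in range(1, 6):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(f"k={k}  SSE={km.inertia_:.2f}")
# SSE always decreases as k grows, so look for the k after which the drop flattens out.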
Sec. 16.3
What Is A Good Clustering?

 Internal criterion: A good clustering will produce high
quality clusters in which:
 the intra-class (that is, intra-cluster) similarity is high
 the inter-class similarity is low
 The measured quality of a clustering depends on both the
document representation and the similarity measure used
Summary of k-means
The K-means algorithm is a simple yet popular method for clustering analysis
 Low complexity: O(nkt), where n = #objects, k = #clusters, t = #iterations
 Its performance is determined by the initialisation and an appropriate distance
measure
 There are several variants of K-means to overcome its weaknesses
 K-Medoids: resistance to noise and/or outliers (data that do not comply with
the general behaviour or model of the data)
 K-Modes: extension to categorical data clustering analysis
 CLARA: extension to deal with large data sets
 Gaussian Mixture models (EM algorithm): handling uncertainty of clusters
2. Hierarchical Clustering
Hierarchical clustering and dendrograms

 A hierarchical clustering on a set of objects D is a set of nested
partitions of D. It is represented by a binary tree such that:
 The root node is a cluster that contains all data points
 Each (parent) node is a cluster made of two subclusters (children)
 Each leaf node represents one data point (a singleton, i.e. a cluster with only one
item)
 A hierarchical clustering scheme is also called a taxonomy. In data
clustering the binary tree is called a dendrogram.
 A dendrogram is a tree diagram frequently used to illustrate the
arrangement of the clusters produced by hierarchical clustering.
Dendrogram: Hierarchical Clustering

• Clustering is obtained by cutting the dendrogram at a desired level:
each connected component forms a cluster.
• Does not require the number of clusters k in advance
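A sketch of this idea, assuming SciPy is available: build the hierarchy, lay out the dendrogram, then cut it so that a desired number of clusters remains.

import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

X = np.array([[1, 2], [1, 2.5], [3, 1], [4, 0.5], [4, 2]], dtype=float)
Z = linkage(X, method='single')                   # hierarchical (single-link) clustering
tree = dendrogram(Z, no_plot=True)                # dendrogram layout (plot it with matplotlib if desired)
labels = fcluster(Z, t=2, criterion='maxclust')   # cut the dendrogram so that 2 clusters remain
print(labels)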
Hierarchical clustering: forming clusters
 Forming clusters from dendrograms
Hierarchical clustering
 There are two styles of hierarchical clustering algorithms to build a tree from the
input set S:
 Agglomerative (bottom-up):
 Begins with singletons (sets with 1 element)
 Merges them until S is achieved as the root
 In each step, the two closest clusters are aggregated into a new combined cluster
 In this way, the number of clusters in the data set is reduced at each step
 Eventually, all records/elements are combined into a single huge cluster
 It is the most common approach.
 Divisive (top-down):
 All records are combined into one big cluster
 Then the most dissimilar records are split off, recursively partitioning S until singleton
sets are reached.
Two types of hierarchical clustering algorithms:
Agglomerative: “bottom-up”
Divisive: “top-down”

Hierarchical Agglomerative Clustering (HAC) Algorithm

• Assumes a similarity function for determining the similarity of two
instances.
• Starts with all instances in a separate cluster and then repeatedly
joins the two clusters that are most similar until there is only one
cluster.

Start with all instances in their own cluster.
Until there is only one cluster:
  Among the current clusters, determine the two
  clusters, ci and cj, that are most similar.
  Replace ci and cj with a single cluster ci ∪ cj.
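A naive sketch of this loop (assuming NumPy), using the minimum pairwise distance between clusters as the similarity (single link); the names are illustrative, and real implementations are far more efficient than this O(n^3) version.

import numpy as np

def hac(X):
    clusters = [[i] for i in range(len(X))]       # start: each instance in its own cluster
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):            # find the two most similar clusters ci, cj
            for j in range(i + 1, len(clusters)):
                d = min(np.linalg.norm(X[a] - X[b])
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]        # replace ci and cj with ci ∪ cj
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges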
Sec. 17.2
Closest pair of clusters

 Many variants to defining the closest pair of clusters
 Single-link
 Similarity of the most cosine-similar pair of documents (one from each cluster)
 Complete-link
 Similarity of the “furthest” points, the least cosine-similar
 Centroid
 Clusters whose centroids (centers of gravity) are the most cosine-similar
 Average-link
 Average cosine between pairs of elements
Lance-Williams Algorithm
Definition (Lance-Williams formula)
In AHC algorithms, the Lance-Williams formula
[Lance and Williams, 1967] is a recurrence equation used to calculate
the dissimilarity between a cluster Ck and a cluster formed by
merging two other clusters Cl ∪ Cj:

  d(Cl ∪ Cj, Ck) = αl d(Cl, Ck) + αj d(Cj, Ck) + β d(Cl, Cj) + γ |d(Cl, Ck) - d(Cj, Ck)|

where αl, αj, β, γ are real numbers.

AHC methods and the Lance-Williams formula
Each AHC method corresponds to a particular choice of coefficients; for example,
single link uses αl = αj = 1/2, β = 0, γ = -1/2, and complete link uses
αl = αj = 1/2, β = 0, γ = +1/2.
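A direct transcription of the recurrence (illustrative only): given the current dissimilarities and a set of coefficients, it returns the dissimilarity between Ck and the merged cluster. Plugging in the single-link and complete-link coefficients shows that the formula reduces to the minimum and maximum of the two distances, respectively.

def lance_williams(d_lk, d_jk, d_lj, alpha_l, alpha_j, beta, gamma):
    # d(Cl ∪ Cj, Ck) from d(Cl,Ck), d(Cj,Ck), d(Cl,Cj) and the four coefficients
    return alpha_l * d_lk + alpha_j * d_jk + beta * d_lj + gamma * abs(d_lk - d_jk)

# Single link (alpha_l = alpha_j = 1/2, beta = 0, gamma = -1/2) -> min(d_lk, d_jk)
print(lance_williams(2.0, 3.0, 1.0, 0.5, 0.5, 0.0, -0.5))   # 2.0
# Complete link (alpha_l = alpha_j = 1/2, beta = 0, gamma = +1/2) -> max(d_lk, d_jk)
print(lance_williams(2.0, 3.0, 1.0, 0.5, 0.5, 0.0, 0.5))    # 3.0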
Cluster distance measure
 Single link
 Distance between closest elements in clusters

 Complete link
 Distance between farthest elements in clusters

 Centroids
 Distance between the centroids (means) of two clusters
Single link method

 Also known as the nearest neighbor method, since it
employs the nearest neighbor to measure the
dissimilarity between two clusters:

  d(Ci, Cj) = min { d(x, y) : x ∈ Ci, y ∈ Cj }
Single-link clustering

Figure: nested clusters over points 1-6 and the corresponding dendrogram
(merge heights on the vertical axis, between 0 and 0.2).
Example 1 - Single link method
You can cut the dendrogram at any level: if you cut at a lower
level you will get many clusters;
if you cut at a higher level you will get fewer clusters.
Example 2 - Single link method
• x1 = (1, 2)
• x2 = (1, 2.5)
• x3 = (3, 1)
• x4 = (4, 0.5)
• x5 = (4, 2)

Merge x1 and x2
Merge x3 and x4
Merge {x3, x4} and x5
Merge {x1, x2} and {x3, x4, x5}
Example 3 - Complete link method
• x1 = (1, 2)
• x2 = (1, 2.5)
• x3 = (3, 1)
• x4 = (4, 0.5)
• x5 = (4, 2)

Merge x1 and x2
Merge x3 and x4
Merge {x3, x4} and x5
Merge {x1, x2} and {x3, x4, x5}
The dendrogram:
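Both examples can be checked with SciPy (an assumption; the chapter does not prescribe a library). Each row of the linkage matrix records one merge: the two merged items/clusters, the merge height, and the new cluster size. For these five points the merge order is the same for single and complete link, but the merge heights differ.

import numpy as np
from scipy.cluster.hierarchy import linkage

X = np.array([[1, 2], [1, 2.5], [3, 1], [4, 0.5], [4, 2]], dtype=float)  # x1..x5
print(linkage(X, method='single'))     # merges {x1,x2}, {x3,x4}, {x3,x4,x5}, then everything
print(linkage(X, method='complete'))   # same merge order here, larger merge heights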
Pros and Cons of Hierarchical Clustering
 Advantages
 Dendrograms are great for visualization
 Provides hierarchical relations between clusters
 Disadvantages
 Not easy to define levels for clusters
 Can never undo what was done previously
 Sensitive to cluster distance measures and noise/outliers
 Experiments showed that other clustering techniques outperform
hierarchical clustering
 There are several variants to overcome its weaknesses
 BIRCH: scalable to a large data set
 ROCK: clustering categorical data
 CHAMELEON: hierarchical clustering using dynamic modelling
