Clustering Algorithms

The document discusses different clustering techniques, including hierarchical and partitional clustering. It describes hierarchical agglomerative clustering and three linkage methods - single, complete, and average linkage. It also explains k-means clustering, including how it works, the algorithm, updating cluster means, and stopping criteria. Clustering and biclustering of microarray data are also briefly mentioned.

Clustering

Dr. Zoya Khalid


[email protected]
Clustering techniques

• Hierarchical: Organizes elements into a tree; the leaves represent objects (genes, etc.), and the length of the path between two leaves represents the distance between those objects. Similar objects lie within the same subtrees. It has
two types:
• Agglomerative (Bottom-Up): Start with every element in its own cluster,
and iteratively join clusters together
• Divisive (Top-Down): Start with one cluster and iteratively divide it into
smaller clusters
Cont'd…
Measures of similarity and dissimilarity (distance)
• There are many different ways of calculating similarity and distance
• Knowing your data is important
• When working with distances, pay attention to three properties:
positivity, symmetry, and the triangle inequality.
• Examples
• Euclidean distance
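As a quick check of the three properties just listed, here is a plain-Python Euclidean distance (the function name and sample points are illustrative, not from the slides):

```python
def euclidean(x, y):
    """Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, y)) ** 0.5

a, b, c = (0.0, 0.0), (3.0, 4.0), (6.0, 0.0)

# Positivity: d(x, y) >= 0, and d(x, x) == 0
assert euclidean(a, b) == 5.0 and euclidean(a, a) == 0.0
# Symmetry: d(x, y) == d(y, x)
assert euclidean(a, b) == euclidean(b, a)
# Triangle inequality: d(x, z) <= d(x, y) + d(y, z)
assert euclidean(a, c) <= euclidean(a, b) + euclidean(b, c)
```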
Hierarchical Agglomerative Clustering
Most hierarchical clustering algorithms are agglomerative.
Three linkage techniques are used to recompute distances between clusters: single, complete, and average linkage.
Hierarchical clustering: Recomputing distances
d_min(C, C*) = min d(x, y)  over all x ∈ C, y ∈ C*
• Distance between two clusters is the smallest distance between any pair of their elements
(single-linkage)

d_max(C, C*) = max d(x, y)  over all x ∈ C, y ∈ C*
• Distance between two clusters is the largest distance between any pair of their elements
(complete-linkage)

d_avg(C, C*) = (1 / (|C| · |C*|)) Σ d(x, y)  over all x ∈ C, y ∈ C*
• Distance between two clusters is the average distance over all pairs of their elements
(average-linkage)
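The three linkage rules above can be sketched in plain Python (the `euclidean` helper, `cluster_distance` name, and sample clusters are illustrative, not from the slides):

```python
from itertools import product

def euclidean(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y)) ** 0.5

def cluster_distance(C, C_star, linkage="single"):
    """Distance between clusters C and C* under a linkage rule."""
    pair_dists = [euclidean(x, y) for x, y in product(C, C_star)]
    if linkage == "single":    # d_min: smallest pairwise distance
        return min(pair_dists)
    if linkage == "complete":  # d_max: largest pairwise distance
        return max(pair_dists)
    if linkage == "average":   # d_avg: mean over all |C|·|C*| pairs
        return sum(pair_dists) / len(pair_dists)
    raise ValueError(f"unknown linkage: {linkage}")

C = [(0.0, 0.0), (1.0, 0.0)]
C_star = [(4.0, 0.0), (5.0, 0.0)]
# single = 3.0, complete = 5.0, average = 4.0 for these clusters
```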
Single Linkage example
Single Linkage continued

[Figure: single-linkage dendrogram over points A, B, D, F]
Continued
Complete Linkage Method
Cont'd… [worked complete-linkage example continues in figures]
Which Distance Measure is Better?
• Each method has advantages and disadvantages; the choice is application-dependent. Single-link and complete-link are the most common
methods
• Single-link:
• Can find irregular-shaped clusters
• Sensitive to outliers
• Complete-link, Average-link:
• Robust to outliers
• Tend to break large clusters
• Prefer spherical (smaller-sized) clusters
Partitional clustering
• It determines all clusters at once, rather than building a hierarchy
Partitional methods include:
• K-means and derivatives
• Fuzzy c-means clustering
• QT clustering algorithm
K-Means Clustering
• Consider an example in which our vectors have 2 dimensions
[Figure: 2-D profiles plotted as points, with each cluster center marked "+"]
K-Means clustering
• each iteration involves two steps
• assignment of profiles to clusters
• re-computation of the cluster centers (means)
[Figure: left, profiles assigned to the nearest cluster center; right, cluster centers recomputed as the means of their assigned profiles]


Example
Distance between two clusters
Distance from Cluster 1 – Cluster 2
Tabulate them
Tabulate the new dataset
[Worked example figures: x-y scatter plots of the points, with distance tables recomputed after each update]
Elbow Method
• It involves running the algorithm in a loop with an increasing number of clusters k, and then plotting a clustering score (e.g., the within-cluster sum of squares) as a function of k; the bend ("elbow") in the curve suggests a good number of clusters.
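A minimal sketch of the elbow loop, assuming a simplified 1-D k-means and the within-cluster sum of squares (WCSS) as the clustering score; all function names and the sample data below are illustrative:

```python
import random

def kmeans_1d(xs, k, iters=50, seed=0):
    """Minimal 1-D k-means; returns the final cluster centers."""
    centers = random.Random(seed).sample(xs, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for x in xs:
            nearest = min(range(k), key=lambda j: (x - centers[j]) ** 2)
            clusters[nearest].append(x)
        new_centers = [sum(c) / len(c) if c else centers[j]
                       for j, c in enumerate(clusters)]
        if new_centers == centers:  # converged
            break
        centers = new_centers
    return centers

def wcss(xs, centers):
    """Clustering score: within-cluster sum of squared distances."""
    return sum(min((x - c) ** 2 for c in centers) for x in xs)

def elbow_scores(xs, max_k, restarts=5):
    """WCSS for k = 1..max_k, keeping the best of several random starts."""
    return [min(wcss(xs, kmeans_1d(xs, k, seed=s)) for s in range(restarts))
            for k in range(1, max_k + 1)]

# Three well-separated 1-D groups: the score should drop sharply until k = 3
data = [1.0, 1.2, 0.9, 5.0, 5.1, 4.8, 9.0, 9.2, 8.9]
scores = elbow_scores(data, max_k=4)
```

Plotting `scores` against k, the curve falls steeply up to the true number of groups and flattens afterwards; that bend is the elbow.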
How K-means algorithm works
K-Means clustering algorithm
• Input: K, the number of clusters, and a set X = {x1, …, xN} of data points, where
each xi is a p-dimensional vector
• Initialize
• Select initial cluster means f1, ….., fK
• Repeat until convergence
• Assign each xi to cluster C(i) such that

C(i) = argmin_{1≤k≤K} ‖xi − fk‖²

• Re-estimate the mean of each cluster based on new members


K-means: updating the mean
• To compute the mean of the kth cluster

fk = (1 / Nk) Σ_{i : C(i) = k} xi

where Nk is the number of points in cluster k, and the sum runs over all points assigned to cluster k


K-means stopping criteria
1. The assignment of objects to clusters doesn't change (convergence)

2. The maximum number of iterations is reached
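The assignment step, mean update, and both stopping criteria above can be combined into a minimal plain-Python sketch (the `kmeans` name, the choice of the first K points as initial means, and the sample data are illustrative, not from the slides):

```python
def sq_dist(x, f):
    """Squared Euclidean distance ||x - f||^2."""
    return sum((a - b) ** 2 for a, b in zip(x, f))

def kmeans(X, K, max_iters=100):
    means = [X[i] for i in range(K)]  # initial means f1..fK: first K points
    assign = None
    for _ in range(max_iters):        # stopping criterion 2: max iterations
        # Assignment step: C(i) = argmin_k ||xi - fk||^2
        new_assign = [min(range(K), key=lambda k: sq_dist(x, means[k]))
                      for x in X]
        if new_assign == assign:      # stopping criterion 1: no change
            break
        assign = new_assign
        # Mean update: fk = (1/Nk) * sum of all xi with C(i) = k
        for k in range(K):
            members = [X[i] for i in range(len(X)) if assign[i] == k]
            if members:               # keep the old mean if a cluster empties
                dims = len(members[0])
                means[k] = tuple(sum(m[d] for m in members) / len(members)
                                 for d in range(dims))
    return means, assign

means, assign = kmeans([(0, 0), (0, 1), (10, 10), (10, 11)], K=2)
# assign == [0, 0, 1, 1]; means == [(0.0, 0.5), (10.0, 10.5)]
```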


Microarray data
Clustering of microarray data
Clustering and Biclustering
• Biclustering - identifies groups of genes with similar/coherent
expression patterns under a specific subset of the conditions.
• Clustering - identifies groups of genes (or conditions) that show similar
activity patterns across all of the conditions (or all of the genes)
under analysis.
