Clustering
Luis Tari
Motivation
One of the important goals in the post-
genomic era is to discover the functions of
genes.
High-throughput technologies allow us to
speed up the process of finding the functions
of genes.
But there are tens of thousands of genes
involved in a microarray experiment.
Questions:
How do we analyze the data?
Which genes should we start exploring?
Why clustering?
Let’s look at the problem from a different angle
The issue here is dealing with high-dimensional data
How do people deal with high-dimensional data?
Start by finding interesting patterns associated with the
data
Clustering is a well-known technique that has been applied
successfully in many domains for finding patterns
Some successes in applying clustering to
microarray data
Golub et al. (1999) used clustering techniques to discover
subclasses of AML and ALL from microarray data
Eisen et al. (1998) used clustering techniques to group
genes of similar function together.
But what is clustering?
Introduction
The goal of clustering is to
group data points that are close (or similar) to each other
identify such groupings (or clusters) in an unsupervised
manner
Unsupervised: no information is provided to the algorithm
on which data points belong to which clusters
Example
[Figure: scatter plot of unlabeled data points, each marked x]
What should the clusters be for these data points?
What can we do with
clustering?
One of the major applications of clustering in
bioinformatics is grouping genes with similar expression
patterns in microarray data
Hypotheses:
Genes with similar expression patterns are taken to be
coexpressed
Coexpressed genes can imply that
they are involved in similar functions
they are somehow related, for instance because their proteins
directly/indirectly interact with each other
It is widely believed that coexpressed genes are involved in
similar functions
But still, what can we really gain from doing
clustering?
Purpose of clustering on
microarray data
Suppose genes A and B are grouped in the
same cluster, then we hypothesize that genes
A and B are involved in similar functions.
If we know that gene A is involved in apoptosis
but we do not know whether gene B is,
we can do experiments to confirm whether gene B
is indeed involved in apoptosis.
Purpose of clustering on
microarray data
Suppose genes A and B are grouped in the
same cluster, then we hypothesize that
proteins A and B might interact with each
other.
So we can do experiments to confirm if such
interaction exists.
So clustering microarray data in a way helps
us make hypotheses about:
potential functions of genes
potential protein-protein interactions
Does clustering always work?
Does coexpression always imply that genes
have similar functions?
Not necessarily
housekeeping genes
genes that are always expressed or never expressed
regardless of conditions
there can be noise in microarray data
But clustering is useful in:
visualization of data
hypothesis generation
Overview of clustering
\[ se = \sum_{i=1}^{k} \sum_{p \in c_i} \lVert p - m_i \rVert^2 \]
where mi is the mean of all instances in cluster ci
se(j) <
Properties of k-means
Guaranteed to converge
Guaranteed to reach a local optimum, not necessarily
the global optimum.
Example:
http://www.kdnuggets.com/dmcourse/data_mining_course/mod-13-clustering.ppt
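A minimal sketch of k-means, written to illustrate the two properties above. It assumes squared Euclidean distance and random selection of k data points as initial centroids; the function names, the `seed` parameter, and the sample points are illustrative assumptions, not part of the original slides. It records the squared error around the cluster means after each iteration so that the non-increasing behaviour behind the convergence guarantee can be observed.

```python
import random

def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def kmeans(points, k, seed=0, max_iter=100):
    """Return (centroids, assignment, sse_history) for a basic k-means run."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # assumption: start from k random data points
    assignment = None
    sse_history = []
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        # Update step: each centroid becomes the mean of its members.
        for j in range(k):
            members = [p for p, a in zip(points, new_assignment) if a == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = mean_point(members)
        # Squared error of every point around its cluster mean.
        sse = sum(squared_distance(p, centroids[a])
                  for p, a in zip(points, new_assignment))
        sse_history.append(sse)
        if new_assignment == assignment:  # no point changed cluster: converged
            break
        assignment = new_assignment
    return centroids, assignment, sse_history

# Illustrative data: two well-separated groups of three points each.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5),
          (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, assignment, sse_history = kmeans(points, k=2, seed=0)
print(assignment)
print(sse_history)
```

The squared error never increases from one iteration to the next, which is why the loop terminates. On less cleanly separated data, different seeds can stop at different local optima.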
K-means
Pros:
Low complexity
complexity is O(nkt), where n = #data points, k = #clusters,
t = #iterations
Cons:
Necessity of specifying k
Sensitive to noise and outlier data points
Outliers: a small number of such data points can
substantially influence the mean value
Clusters are sensitive to initial assignment of centroids
K-means is not a deterministic algorithm
Clusters can be inconsistent from one run to another
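The sensitivity to outliers can be seen directly on the mean, which is what k-means uses as a cluster centroid. The sketch below uses made-up expression values (an assumption for illustration only) and shows how a single extreme value shifts the mean of an otherwise tight cluster.

```python
def mean(values):
    """Arithmetic mean, the quantity k-means uses as a cluster centroid."""
    return sum(values) / len(values)

# Hypothetical expression values for one cluster (illustrative numbers only).
cluster = [2.0, 3.0, 2.0, 3.0]
cluster_with_outlier = cluster + [40.0]

print(mean(cluster))               # 2.5
print(mean(cluster_with_outlier))  # 10.0
```

One outlier out of five values moves the centroid from 2.5 to 10.0, far from every original member of the cluster.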
Fuzzy c-means
An extension of k-means
Hierarchical clustering and k-means generate partitions
each data point can be assigned to only one
cluster
Fuzzy c-means allows data points to be
assigned into more than one cluster
each data point has a degree of membership (or
probability) of belonging to each cluster
Fuzzy c-means algorithm
Let xi be a vector of values for data point gi.
1. Initialize the membership matrix U(0) = [ uij ] randomly, where
uij is the membership of data point gi in cluster clj
2. At the k-th step, compute the fuzzy centroids C(k) =
[ cj ] for j = 1, .., nc, where nc is the number of
clusters, using
\[ c_j = \frac{\sum_{i=1}^{n} (u_{ij})^m x_i}{\sum_{i=1}^{n} (u_{ij})^m} \]
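The centroid computation in step 2 can be sketched as follows. This covers only that step: each centroid is the average of the data points weighted by membership raised to the power m. The function name, the default m = 2, and the sample data are assumptions made for illustration.

```python
def fuzzy_centroids(data, membership, m=2.0):
    """Compute one fuzzy centroid per cluster.

    data:       list of n points (tuples of equal length), the vectors x_i
    membership: n rows, one per point; membership[i][j] is u_ij
    m:          exponent applied to the memberships (assumed default 2.0)
    """
    n_clusters = len(membership[0])
    n_dims = len(data[0])
    centroids = []
    for j in range(n_clusters):
        # Weight of each data point for cluster j: (u_ij)^m
        weights = [row[j] ** m for row in membership]
        total = sum(weights)
        centroids.append(tuple(
            sum(w * x[d] for w, x in zip(weights, data)) / total
            for d in range(n_dims)
        ))
    return centroids

# Illustrative data: two points, two clusters.
data = [(0.0, 0.0), (2.0, 2.0)]
crisp = [[1.0, 0.0], [0.0, 1.0]]   # each point belongs fully to one cluster
shared = [[0.5, 0.5], [0.5, 0.5]]  # each point is shared equally

print(fuzzy_centroids(data, crisp))   # [(0.0, 0.0), (2.0, 2.0)]
print(fuzzy_centroids(data, shared))  # [(1.0, 1.0), (1.0, 1.0)]
```

With crisp memberships the result reduces to the ordinary k-means centroids. With equal memberships every centroid falls on the overall mean.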
Manhattan distance
\[ d(g_1, g_2) = \sum_{i=1}^{n} |x_i - y_i| \]
Minkowski distance
\[ d(g_1, g_2) = \sqrt[m]{\sum_{i=1}^{n} |x_i - y_i|^m} \]
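A short sketch of the Manhattan and Minkowski distances between two expression profiles, assuming each profile is a sequence of numbers of equal length. The function names and sample vectors are illustrative. Absolute differences are used so that the result is non-negative for any m.

```python
def manhattan_distance(x, y):
    """Sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(x, y))

def minkowski_distance(x, y, m):
    """m-th root of the sum of absolute differences raised to the power m."""
    return sum(abs(a - b) ** m for a, b in zip(x, y)) ** (1.0 / m)

g1 = (0.0, 0.0)
g2 = (3.0, 4.0)
print(manhattan_distance(g1, g2))     # 7.0
print(minkowski_distance(g1, g2, 1))  # 7.0, same as Manhattan
print(minkowski_distance(g1, g2, 2))  # 5.0, the Euclidean distance
```

Minkowski distance generalizes the others: m = 1 gives Manhattan distance and m = 2 gives Euclidean distance.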
Correlation distance
\[ r_{xy} = \frac{\mathrm{Cov}(X, Y)}{\sqrt{\mathrm{Var}(X)\,\mathrm{Var}(Y)}} \]
\[ \mathrm{Cov}(X, Y) = \frac{\sum_{i=1}^{n} (x_i - \bar{X})(y_i - \bar{Y})}{n - 1} \]
Positive covariance
two variables vary in the same way
Negative covariance
one variable might increase when the other decreases
Covariance is only suitable for heterogeneous pairs
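The sign behaviour of covariance can be checked with a small sketch. It uses the sample covariance with the n - 1 denominator; the function name and the sample values are assumptions for illustration.

```python
def covariance(xs, ys):
    """Sample covariance of two equal-length sequences (n - 1 denominator)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

rising = [1.0, 2.0, 3.0]
also_rising = [2.0, 4.0, 6.0]
falling = [6.0, 4.0, 2.0]

print(covariance(rising, also_rising))  # 2.0, the variables move together
print(covariance(rising, falling))      # -2.0, one rises as the other falls
```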
Correlation distance
Correlation
\[ r_{xy} = \frac{\mathrm{Cov}(X, Y)}{\sqrt{\mathrm{Var}(X)\,\mathrm{Var}(Y)}} \]
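A sketch of the correlation coefficient computed from covariance and variance. The slides do not state how the coefficient is turned into a distance; the `correlation_distance` function below assumes the common convention 1 - r, and all names and sample profiles are illustrative.

```python
import math

def covariance(xs, ys):
    """Sample covariance of two equal-length sequences (n - 1 denominator)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

def pearson_correlation(xs, ys):
    """r_xy = Cov(X, Y) / sqrt(Var(X) * Var(Y)); Var(X) is Cov(X, X)."""
    return covariance(xs, ys) / math.sqrt(covariance(xs, xs) * covariance(ys, ys))

def correlation_distance(xs, ys):
    """Assumed convention: distance = 1 - r, so it ranges from 0 to 2."""
    return 1.0 - pearson_correlation(xs, ys)

# Hypothetical expression profiles across four conditions.
gene_a = [1.0, 2.0, 3.0, 4.0]
gene_b = [10.0, 20.0, 30.0, 40.0]  # same shape as gene_a, ten times the scale
gene_c = [4.0, 3.0, 2.0, 1.0]      # opposite trend

print(pearson_correlation(gene_a, gene_b))   # close to 1
print(pearson_correlation(gene_a, gene_c))   # close to -1
print(correlation_distance(gene_a, gene_b))  # close to 0
```

Unlike covariance, the correlation coefficient is unaffected by the scale of either profile, so gene_a and gene_b are treated as near-identical patterns despite the tenfold difference in magnitude.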