Clustering
Georg Gerber
Lecture #6, 2/6/02
Lecture Overview
Motivation – why do clustering? Examples
from research papers
Choosing (dis)similarity measures – a critical
step in clustering
Euclidean distance
Pearson Linear Correlation
Clustering algorithms
Hierarchical agglomerative clustering
K-means clustering and quality measures
Self-organizing maps (if time)
What is clustering?
A way of grouping together data samples that
are similar in some way - according to some
criteria that you pick
A form of unsupervised learning – you
generally don’t have examples demonstrating
how the data should be grouped together
So, it’s a method of data exploration – a
way of looking for patterns or structure in the
data that are of interest
Why cluster?
Cluster genes = rows
Measure expression at multiple time-points,
different conditions, etc.
Similar expression patterns may suggest similar
functions of genes (is this always true?)
Cluster samples = columns
e.g., expression levels of thousands of genes for
each tumor sample
Similar expression patterns may suggest biological
relationship among samples
Example 1: clustering genes
P. Tamayo et al., Interpreting patterns of
gene expression with self-organizing maps:
methods and application to hematopoietic
differentiation, PNAS 96: 2907-12, 1999.
Treatment of HL-60 cells (myeloid leukemia cell
line) with PMA leads to differentiation into
macrophages
Measured expression of genes at 0, 0.5, 4 and 24
hours after PMA treatment
Used SOM technique; shown are cluster averages
Clusters contain a number of known related genes involved in
macrophage differentiation
e.g., late induction cytokines, cell-cycle genes (down-regulated
since PMA induces terminal differentiation), etc.
Example 2: clustering genes
E. Furlong et al., Patterns of Gene Expression During
Drosophila Development, Science 293: 1629-33, 2001.
Use clustering to look for patterns of gene expression
change in wild-type vs. mutants
Collect data on gene expression in Drosophila wild-type
and mutants (twist and Toll) at three stages of
development
twist is critical in mesoderm and subsequent muscle
development; mutants have no mesoderm
Toll mutants over-express twist
Take ratio of mutant over wt expression levels at
corresponding stages
Find general trends in the data – e.g., a group of genes with high
expression in twist mutants and not elevated in Toll mutants
contains many known neuroectodermal genes (presumably
over-expression of twist suppresses ectoderm)
Example 3: clustering samples
A. Alizadeh et al., Distinct types of diffuse large B-cell
lymphoma identified by gene expression profiling,
Nature 403: 503-11, 2000.
Response to treatment of patients w/ diffuse large B-
cell lymphoma (DLBCL) is heterogeneous
Try to use expression data to discover finer
distinctions among tumor types
Collected gene expression data for 42 DLBCL tumor
samples + normal B-cells in various stages of
differentiation + various controls
Found some tumor samples have expression more similar to
germinal center B-cells and others to peripheral blood activated
B-cells
Patients with “germinal center type” DLBCL generally had higher
five-year survival rates
Lecture Overview
Motivation – why do clustering? Examples
from research papers
Choosing (dis)similarity measures – a
critical step in clustering
Euclidean distance
Pearson Linear Correlation
Clustering algorithms
Hierarchical agglomerative clustering
K-means clustering and quality measures
Self-Organizing Maps (if time)
How do we define “similarity”?
Recall that the goal is to group together
“similar” data – but what does this mean?
No single answer – it depends on what we
want to find or emphasize in the data; this is
one reason why clustering is an “art”
The similarity measure is often more
important than the clustering algorithm used
– don’t overlook this choice!
(Dis)similarity measures
Instead of talking about similarity measures,
we often equivalently refer to dissimilarity
measures (I’ll give an example of how to
convert between them in a few slides…)
Jagota defines a dissimilarity measure as a
function f(x,y) such that f(x,y) > f(w,z) if and
only if x is less similar to y than w is to z
This is always a pair-wise measure
Think of x, y, w, and z as gene expression
profiles (rows or columns)
Euclidean distance

d_euc(x, y) = √( Σ_{i=1}^{n} (x_i − y_i)² )

The Pearson linear correlation ρ (discussed on the next slides) is

ρ(x, y) = [ Σ_{i=1}^{n} (x_i − x̄)(y_i − ȳ) ] / √( Σ_{i=1}^{n} (x_i − x̄)² · Σ_{i=1}^{n} (y_i − ȳ)² )

where x̄ = (1/n) Σ_{i=1}^{n} x_i and ȳ = (1/n) Σ_{i=1}^{n} y_i

For ρ, we’re shifting the expression profiles down (subtracting the
means) and scaling by the standard deviations (i.e., making the
data have mean = 0 and std = 1)
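As a concrete illustration, the Euclidean distance between two expression profiles can be computed directly from its definition (a minimal NumPy sketch; the two four-time-point profiles are made-up values, not data from the lecture):

```python
import numpy as np

def euclidean_distance(x, y):
    """d_euc(x, y) = sqrt( sum_i (x_i - y_i)^2 )."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return np.sqrt(np.sum((x - y) ** 2))

# Two hypothetical expression profiles measured at 4 time points
x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 2.0, 4.0, 4.0]
print(euclidean_distance(x, y))  # sqrt(1 + 0 + 1 + 0) = sqrt(2) ~ 1.414
```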
Pearson Linear Correlation
Pearson linear correlation (PLC) is a measure that is
invariant to scaling and shifting (vertically) of the
expression values
Always between –1 and +1 (perfectly anti-correlated
and perfectly correlated)
This is a similarity measure, but we can easily make
it into a dissimilarity measure:
d_p(x, y) = (1 − ρ(x, y)) / 2
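This conversion, and the invariance properties claimed above, can be checked numerically (a sketch; `pearson` is a helper implementing ρ directly from its definition, and the profile values are made up):

```python
import numpy as np

def pearson(x, y):
    """Pearson linear correlation rho(x, y)."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    xc, yc = x - x.mean(), y - y.mean()
    return np.sum(xc * yc) / np.sqrt(np.sum(xc ** 2) * np.sum(yc ** 2))

def d_p(x, y):
    """Dissimilarity derived from PLC: (1 - rho) / 2, always in [0, 1]."""
    return (1.0 - pearson(x, y)) / 2.0

x = np.array([1.0, 2.0, 3.0, 4.0])
print(d_p(x, x))            # perfectly correlated -> 0.0
print(d_p(x, -x))           # perfectly anti-correlated -> 1.0
print(d_p(x, 3.0 * x + 5))  # invariant to scaling and shifting -> 0.0
```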
PLC (cont.)
PLC only measures the degree of a linear relationship
between two expression profiles!
If you want to measure other relationships, there are
many other possible measures (see Jagota book and
project #3 for more examples)
ρ = 0.0249, so d_p = 0.4876
The green curve is the
square of the blue curve –
this relationship is not
captured with PLC
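The same effect is easy to reproduce numerically: a profile and its square are deterministically related, yet on a symmetric range their PLC is essentially zero (a sketch with made-up values; `pearson` is a helper implementing ρ from its definition):

```python
import numpy as np

def pearson(x, y):
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    xc, yc = x - x.mean(), y - y.mean()
    return np.sum(xc * yc) / np.sqrt(np.sum(xc ** 2) * np.sum(yc ** 2))

t = np.linspace(-1.0, 1.0, 101)  # symmetric "time" axis
blue = t                          # a linear profile
green = t ** 2                    # its square: perfectly related, but nonlinearly
print(pearson(blue, green))       # ~0: PLC misses the quadratic relationship
```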
More correlation examples
K-means Clustering
Pick k initial cluster centers, e.g., k randomly chosen data
points
Or could randomly assign points to clusters and take
means of clusters
For each data point, compute the cluster center it is
closest to (using some distance measure) and assign the
data point to this cluster
Re-compute cluster centers (mean of data points in
cluster)
Stop when there are no new re-assignments
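The steps above can be sketched directly (a minimal NumPy implementation under the stated algorithm, using Euclidean distance and the random-data-points initialization; the two-blob test data are made up):

```python
import numpy as np

def kmeans(data, k, max_iter=100, seed=0):
    """Basic K-means: data has shape (n_points, n_features)."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data, dtype=float)
    # Initialize centers with k randomly chosen data points
    centers = data[rng.choice(len(data), size=k, replace=False)]
    assignments = None
    for _ in range(max_iter):
        # Assign each point to its closest center (Euclidean distance)
        dists = np.linalg.norm(data[:, None, :] - centers[None, :, :], axis=2)
        new = dists.argmin(axis=1)
        if assignments is not None and np.array_equal(new, assignments):
            break  # no re-assignments -> stop
        assignments = new
        # Re-compute each center as the mean of its assigned points
        for j in range(k):
            if np.any(assignments == j):
                centers[j] = data[assignments == j].mean(axis=0)
    return centers, assignments

# Two well-separated blobs of 5 points each
data = np.vstack([np.zeros((5, 2)), np.full((5, 2), 10.0)])
centers, labels = kmeans(data, k=2)
```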
K-means Clustering (cont.)
The algorithm tries to minimize the sum of squared distances of points
to their cluster centers:

E = Σ_{j=1}^{k} Σ_{x ∈ C_j} d(x, μ_j)²   (μ_j = center of cluster j)
Cluster Quality (cont.)
The Q measure given in Jagota takes into account
homogeneity within clusters, but not separation
between clusters
Other measures try to combine these two
characteristics (e.g., the Davies-Bouldin measure)
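To make the homogeneity-vs-separation distinction concrete, both quantities can be computed for a given clustering (a sketch using generic definitions — mean within-cluster distance to the center for homogeneity, minimum between-center distance for separation — not necessarily the exact formulas in Jagota; the data are made up):

```python
import numpy as np

def homogeneity(data, labels, centers):
    """Mean distance from each point to its own cluster center (lower = tighter)."""
    data, centers = np.asarray(data, float), np.asarray(centers, float)
    return float(np.mean(np.linalg.norm(data - centers[labels], axis=1)))

def separation(centers):
    """Smallest distance between two distinct centers (higher = better separated)."""
    centers = np.asarray(centers, float)
    d = np.linalg.norm(centers[:, None, :] - centers[None, :, :], axis=2)
    return float(d[np.triu_indices(len(centers), k=1)].min())

# Hypothetical 2-cluster result: two tight pairs of points
data = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
labels = np.array([0, 0, 1, 1])
centers = np.array([[0.0, 0.5], [10.0, 10.5]])
print(homogeneity(data, labels, centers))  # 0.5
print(separation(centers))                 # sqrt(200) ~ 14.14
```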
An alternate approach is to look at cluster stability:
Add random noise to the data many times and