K-Means and Kohonen Maps Unsupervised Clustering Techniques: Steve Hookway 4/8/04
What is a DNA Microarray?
An experiment on the order of 10,000 elements
A way to explore the function of a gene
A snapshot of the expression levels of an entire genome under given test conditions
Some Microarray Terminology
Probe: ssDNA printed on the solid substrate (nylon or glass). These are the genes we are going to be testing.
Target: cDNA which has been labeled and is to be washed over the probe.
Microarray Fabrication
Deposition of DNA fragments
Deposition of PCR-amplified cDNA clones
Printing of already synthesized oligonucleotides
In Situ synthesis
Photolithography
Ink Jet Printing
Electrochemical Synthesis
Deposited sequences: more variability. In situ synthesized sequences: more reliable data.
Clustering Limitations
Any data can be clustered, therefore we must be careful what conclusions we draw from our results.
Clustering is non-deterministic and can and will produce different results on different runs.
K-means Clustering
Given a set of n data points in d-dimensional space and an integer k, we want to find the set of k points in d-dimensional space that minimizes the mean squared distance from each data point to its nearest center.
No exact polynomial-time algorithms are known for this problem.
“A Local Search Approximation Algorithm for k-Means Clustering” by Kanungo et al.
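In symbols, the objective described above can be written as follows (a standard formulation, supplied here for clarity rather than taken from the slide):

```latex
\min_{c_1,\ldots,c_k \in \mathbb{R}^d} \;
\frac{1}{n} \sum_{j=1}^{n} \min_{1 \le i \le k} \lVert x_j - c_i \rVert^2
```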
K-means Algorithm
(Lloyd’s Algorithm)
Has been shown to converge to a locally optimal solution, but can converge to a solution arbitrarily bad compared to the optimal solution.
[Figure: data points with optimal centers and heuristic centers, K=3]
“K-means-type algorithms: A generalized convergence theorem and characterization of local optimality” by Selim and Ismail
“A Local Search Approximation Algorithm for k-Means Clustering” by Kanungo et al.
Euclidean Distance
$$ d_E(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2} $$

For example: $d_E(O, A) = \sqrt{3^2 + 4^2} = 5$
The center point CP of k points is the component-wise mean:

$$ CP(x_1, x_2, \ldots, x_k) = \left( \frac{\sum_{i=1}^{k} x_i^{(1)}}{k}, \frac{\sum_{i=1}^{k} x_i^{(2)}}{k}, \ldots, \frac{\sum_{i=1}^{k} x_i^{(n)}}{k} \right) $$

Let’s find the midpoint between 3 2-D points, say (2,4), (5,2), (8,9):

$$ CP = \left( \frac{2+5+8}{3}, \frac{4+2+9}{3} \right) = (5, 5) $$
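The distance and center formulas above can be sketched in Python (the function names are my own, not from the slides):

```python
import math

def euclidean(x, y):
    # d_E(x, y) = sqrt(sum_i (x_i - y_i)^2)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def centroid(points):
    # Component-wise mean of a cluster of points
    k = len(points)
    return tuple(sum(coords) / k for coords in zip(*points))

print(euclidean((0, 0), (3, 4)))           # 5.0
print(centroid([(2, 4), (5, 2), (8, 9)]))  # (5.0, 5.0)
```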
K-means Algorithm
1. Choose k initial center points randomly
2. Cluster data using Euclidean distance (or another distance metric)
3. Calculate new center points for each cluster using only the points within that cluster
4. Re-cluster all data using the new center points
   1. This step could cause data points to be placed in a different cluster
5. Repeat steps 3 & 4 until the center points have moved such that in step 4 no data points are moved from one cluster to another, or some other convergence criterion is met
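The five steps above can be sketched as a minimal Lloyd's-algorithm implementation (the function name and the toy data are illustrative assumptions, not from the slides):

```python
import random

def kmeans(points, k, max_iter=100, seed=0):
    """Lloyd's algorithm: returns (centers, cluster index per point)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # step 1: k random initial centers
    for _ in range(max_iter):
        # steps 2 and 4: assign each point to its nearest center
        labels = [min(range(k),
                      key=lambda i: sum((p - c) ** 2
                                        for p, c in zip(pt, centers[i])))
                  for pt in points]
        # step 3: recompute each center as the mean of its cluster's points
        new_centers = []
        for i in range(k):
            members = [pt for pt, lab in zip(points, labels) if lab == i]
            if members:
                new_centers.append(tuple(sum(xs) / len(members)
                                         for xs in zip(*members)))
            else:
                new_centers.append(centers[i])  # keep an empty cluster's center
        if new_centers == centers:  # step 5: centers stopped moving
            break
        centers = new_centers
    return centers, labels

pts = [(1.0, 1.0), (1.5, 2.0), (3.0, 4.0), (5.0, 7.0), (3.5, 5.0), (4.5, 5.0)]
centers, labels = kmeans(pts, k=2)
```

Because the initial centers are random (step 1), different seeds can yield different clusterings, which is exactly the non-determinism the earlier slide warns about.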
1. We pick k=2 centers at random
2. We cluster our data around these center points
Figure reproduced from “Data Analysis Tools for DNA Microarrays” by Sorin Draghici
K-means example with k=2
3. We recalculate centers based on our current clusters
4. We re-cluster our data around our new center points
[Figure: self-organizing map illustration showing distance=5, size=5, and distance=20]
From: http://davis.wpi.edu/~matt/courses/soms/
An Example Using Color
Three dimensional data: red, blue, green
From http://davis.wpi.edu/~matt/courses/soms/
An Example Using Color
Each color in the map is associated with a weight
From http://davis.wpi.edu/~matt/courses/soms/
An Example Using Color
1. Initialize the weights
From http://davis.wpi.edu/~matt/courses/soms/
An Example Using Color Continued
2. Get best matching unit
From http://davis.wpi.edu/~matt/courses/soms/
An Example Using Color Continued
2. Getting the best matching unit, continued…
For example, let’s say we chose green as the sample. Then it can be shown that light green is closer to green than red is:
Green: (0,6,0)  Light green: (3,6,3)  Red: (6,0,0)
From http://davis.wpi.edu/~matt/courses/soms/
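The claim above can be checked numerically (a quick sketch, not part of the original tutorial):

```python
import math

def dist(a, b):
    # Euclidean distance between two RGB triples
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

green, light_green, red = (0, 6, 0), (3, 6, 3), (6, 0, 0)
print(dist(green, light_green))  # sqrt(18) ≈ 4.24
print(dist(green, red))          # sqrt(72) ≈ 8.49
```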
An Example Using Color Continued
3. Scale neighbors
   1. Determine which weights are considered neighbors
   2. Determine how much each weight can become more like the sample vector
From http://davis.wpi.edu/~matt/courses/soms/
An Example Using Color Continued
2. How much each weight can become more like the sample
From http://davis.wpi.edu/~matt/courses/soms/
An Example Using Color Continued
NewColorValue = CurrentColor*(1-t)+sampleVector*t
For the first iteration t=1, since t can range from 0 to 1. For following iterations the value of t used in this formula decreases, so each weight moves a smaller fraction of the way toward the sample vector.
From http://davis.wpi.edu/~matt/courses/soms/
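A minimal sketch of the update rule above (the function name and example vectors are my own; the tutorial only specifies the formula and that t decreases):

```python
def blend(current, sample, t):
    # NewColorValue = CurrentColor*(1-t) + sampleVector*t, per component
    return tuple(c * (1 - t) + s * t for c, s in zip(current, sample))

# First iteration, t = 1: the weight jumps all the way to the sample
print(blend((6.0, 0.0, 0.0), (0.0, 6.0, 0.0), 1.0))   # (0.0, 6.0, 0.0)
# Later iteration with a smaller t: the weight moves only part of the way
print(blend((6.0, 0.0, 0.0), (0.0, 6.0, 0.0), 0.25))  # (4.5, 1.5, 0.0)
```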
Conclusion of Example
From http://davis.wpi.edu/~matt/courses/soms/
SOFM Applied to Genetics
Consider clustering 10,000 genes
Each gene was measured in 4 experiments
Input vectors are therefore 4-dimensional
Initial pattern: 10,000 genes, each described by a 4-D vector
Each of the 10,000 genes is chosen one at a time to train the SOM
“Interpreting patterns of gene expression with self-organizing maps: Methods and application to hematopoietic differentiation” by Tamayo et al.
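Putting the pieces together, a toy SOM training loop over random 4-D "gene" vectors might look like this (grid size, iteration count, and both decay schedules are illustrative assumptions, not taken from Tamayo et al.):

```python
import math
import random

def train_som(data, rows=4, cols=4, iters=200, seed=0):
    """Minimal SOM sketch: a rows x cols grid of weight vectors,
    trained by presenting one input vector at a time."""
    rng = random.Random(seed)
    dim = len(data[0])
    # 1. Initialize every grid cell's weight vector randomly
    grid = [[[rng.random() for _ in range(dim)] for _ in range(cols)]
            for _ in range(rows)]
    for step in range(iters):
        x = rng.choice(data)  # one gene vector at a time
        # 2. Best matching unit: the cell whose weights are closest to x
        br, bc = min(((r, c) for r in range(rows) for c in range(cols)),
                     key=lambda rc: sum((w - v) ** 2
                                        for w, v in zip(grid[rc[0]][rc[1]], x)))
        # 3. Scale neighbors: rate and neighborhood both shrink over time
        t = 1.0 - step / iters
        radius = max(1.0, (rows / 2) * t)
        for r in range(rows):
            for c in range(cols):
                if math.hypot(r - br, c - bc) <= radius:
                    cell = grid[r][c]
                    for i in range(dim):
                        # Pull the neighbor's weights toward the sample
                        cell[i] += t * (x[i] - cell[i])
    return grid

rng = random.Random(42)
genes = [[rng.random() for _ in range(4)] for _ in range(50)]
som = train_som(genes)
```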
Benefits of SOFM
The SOFM contains the set of features extracted from the input patterns (reduces dimensions)
The SOFM yields a set of clusters
A gene will always be more similar to a gene in its immediate neighborhood than to a gene further away
“Interpreting patterns of gene expression with self-organizing maps: Methods and application to hematopoietic differentiation” by Tamayo et al.
References
“Basic microarray analysis: grouping and feature reduction” by Soumya Raychaudhuri, Patrick D. Sutphin, Jeffery T. Chang and Russ B. Altman; Trends in Biotechnology Vol. 19 No. 5, May 2001
“Self Organizing Maps” by Tom Germano; http://davis.wpi.edu/~matt/courses/soms
“Data Analysis Tools for DNA Microarrays” by Sorin Draghici; Chapman & Hall/CRC, 2003
“Self-Organizing-Feature-Maps versus Statistical Clustering Methods: A Benchmark” by A. Ultsch and C. Vetter; FG Neuroinformatik & Künstliche Intelligenz, Research Report 0994
References
“Interpreting patterns of gene expression with self-organizing maps: Methods and application to hematopoietic differentiation” by Tamayo et al.
“A Local Search Approximation Algorithm for k-Means Clustering” by Kanungo et al.
“K-means-type algorithms: A generalized convergence theorem and characterization of local optimality” by Selim and Ismail