0% found this document useful (0 votes)
109 views

K-Means and Kohonen Maps Unsupervised Clustering Techniques: Steve Hookway 4/8/04

K-means clustering is an unsupervised machine learning algorithm that groups unlabeled data points into a specified number of clusters (k) based on their similarity. It works by randomly selecting k data points as initial cluster centers, then assigning each remaining point to the nearest center, and recalculating the centers as the means of each cluster's points. This process repeats until the centers stabilize or a maximum number of iterations is reached. Choosing the right number of clusters k can be difficult, and the results are affected by the random initial center selection, so multiple runs may be needed. Cluster quality is assessed based on intra-cluster similarity versus inter-cluster distance.

Uploaded by

Raghav Venkat
Copyright
© Attribution Non-Commercial (BY-NC)
Available Formats
Download as PPT, PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
109 views

K-Means and Kohonen Maps Unsupervised Clustering Techniques: Steve Hookway 4/8/04

K-means clustering is an unsupervised machine learning algorithm that groups unlabeled data points into a specified number of clusters (k) based on their similarity. It works by randomly selecting k data points as initial cluster centers, then assigning each remaining point to the nearest center, and recalculating the centers as the means of each cluster's points. This process repeats until the centers stabilize or a maximum number of iterations is reached. Choosing the right number of clusters k can be difficult, and the results are affected by the random initial center selection, so multiple runs may be needed. Cluster quality is assessed based on intra-cluster similarity versus inter-cluster distance.

Uploaded by

Raghav Venkat
Copyright
© Attribution Non-Commercial (BY-NC)
Available Formats
Download as PPT, PDF, TXT or read online on Scribd
You are on page 1/ 53

K-means and Kohonen Maps

Unsupervised Clustering Techniques

Steve Hookway
4/8/04
What is a DNA Microarray?
 An experiment on the order of 10k
elements
 A way to explore the function of a gene
 A snapshot of the expression level of an
entire phenotype under given test
conditions
Some Microarray Terminology
 Probe: ssDNA printed on the solid
substrate (nylon or glass) These are
the genes we are going to be testing
 Target: cDNA which has been labeled
and is to be washed over the probe
Microarray Fabrication
 Deposition of DNA fragments
 Deposition of PCR-amplified cDNA clones
 Printing of already synthesized
oligonucleotieds
 In Situ synthesis
 Photolithography
 Ink Jet Printing
 Electrochemical Synthesis

From “Data Analysis Tools for DNA Microarrays” by Sorin Draghici


cDNA Microarrays and Oligonucleotide
Probes

cDNA Arrays Oligonucleotide


Arrays
Long Sequences Short Sequences

Spot Unknown Spot Known

Sequences Sequences
More variability More reliable data

From “Data Analysis Tools for DNA Microarrays” by Sorin Draghici


In Situ Synthesis
 Photochemically synthesized on the chip
 Reduces noise caused by PCR, cloning,
and Spotting
 As previously mentioned, three kinds of
In Situ Synthesis
 Photolithography
 Ink Jet Printing
 Electrochemical Synthesis

From “Data Analysis Tools for DNA Microarrays” by Sorin Draghici


Photolithography
Photodeprotection
 Similar to process used to
build VLSI circuits
 Photolithographic masks
are used to add each base
mask
 If base is present, there
will be a hole in the
corresponding mask
 Can create high density
C
arrays, but sequence
length is limited

From “Data Analysis Tools for DNA Microarrays” by Sorin Draghici


Ink Jet Printing
 Four cartridges are loaded with the four
nucleotides: A, G, C,T
 As the printer head moves across the
array, the nucleotides are deposited
where they are needed

From “Data Analysis Tools for DNA Microarrays” by Sorin Draghici


Electrochemical Synthesis
 Electrodes are embedded in the substrate to
manage individual reaction sites
 Electrodes are activated in necessary
positions in a predetermined sequence that
allows the sequences to be constructed base
by base
 Solutions containing specific bases are
washed over the substrate while the
electrodes are activated
From “Data Analysis Tools for DNA Microarrays” by Sorin Draghici
http://www.bio.davidson.edu/courses/genomics/chip/chip.html
Application of Microarrays
 We only know the
function of about
20% of the 30,000
genes in the Human
Genome
 Gene exploration
 Faster and better
 Can be used for
DNA computing
http://www.gene-chips.com/sample1.html
From “Data Analysis Tools for DNA Microarrays” by Sorin Draghici
A Data Mining Problem
 On a given Microarray we test on the
order of 10k elements at a time
 Data is obtained faster than it can be
processed
 We need some ways to work through
this large data set and make sense of
the data
Grouping and Reduction
 Grouping: discovers patterns in the data
from a microarray
 Reduction: reduces the complexity of
data by removing redundant probes
(genes) that will be used in subsequent
assays
Unsupervised Grouping: Clustering
 Pattern discovery via grouping similarly
expressed genes together
 Three techniques most often used
 k-Means Clustering
 Hierarchical Clustering

 Kohonen Self Organizing Feature

Maps
Clustering Limitations
 Any data can be clustered, therefore we
must be careful what conclusions we
draw from our results
 Clustering is non-deterministic and can
and will produce different results on
different runs
K-means Clustering
 Given a set of n data points in d-
dimensional space and an integer k
 We want to find the set of k points in d-
dimensional space that minimizes the
mean squared distance from each data
point to its nearest center
 No exact polynomial-time algorithms
are known for this problem
“A Local Search Approximation Algorithm for k-Means Clustering” by Kanungo et. al
K-means Algorithm
(Lloyd’s Algorithm)
 Has been shown to Data
Points
converge to a locally
Optimal
optimal solution Centers
 But can converge to Heuristic
a solution arbitrarily Centers

bad compared to
the optimal solution K=3

•“K-means-type algorithms: A generalized convergence theorem and characterization of local optimality” by Selim and Ismail
•“A Local Search Approximation Algorithm for k-Means Clustering” by Kanungo et al.
Euclidean Distance
n
d E ( x, y )   i i
( x
i 1
 y ) 2

Now to find the distance between two points, say


the origin and the point (3,4):

d E (O, A)  3 4  5 2 2

Simple and Fast! Remember this when we consider


the complexity!
Finding a Centroid
We use the following equation to find the n dimensional
centroid point amid k n dimensional points:
k k k

 x1st  x2nd
i i  xnth i
CP ( x1 , x 2 ,...,x k )  ( i 1 , i 1
,..., i 1
)
k k k

Let’s find the midpoint between 3 2D points, say: (2,4) (5,2) (8,9)

258 4 29
CP  ( , )  (5,5)
3 3
K-means Algorithm
1. Choose k initial center points randomly
2. Cluster data using Euclidean distance (or other
distance metric)
3. Calculate new center points for each cluster using only
points within the cluster
4. Re-Cluster all data using the new center points
1. This step could cause data points to be placed in a different
cluster
5. Repeat steps 3 & 4 until the center points have moved
such that in step 4 no data points are moved from one
cluster to another or some other convergence criteria
is met

From “Data Analysis Tools for DNA Microarrays” by Sorin Draghici


An example with k=2

1. We Pick k=2
centers at
random
2. We cluster our
data around
these center
points
Figure Reproduced From “Data Analysis Tools for DNA
Microarrays” by Sorin Draghici
K-means example with k=2

3. We recalculate
centers based on
our current clusters

Figure Reproduced From “Data Analysis Tools for DNA


Microarrays” by Sorin Draghici
K-means example with k=2

4. We re-cluster our
data around our
new center points

Figure Reproduced From “Data Analysis Tools for DNA


Microarrays” by Sorin Draghici
K-means example with k=2

5. We repeat the last


two steps until no
more data points
are moved into a
different cluster

Figure Reproduced From “Data Analysis Tools for DNA


Microarrays” by Sorin Draghici
Choosing k
 Use another clustering method
 Run algorithm on data with several
different values of k
 Use advance knowledge about the
characteristics of your test
 Cancerous vs Non-Cancerous
Cluster Quality
 Since any data can be clustered, how do we
know our clusters are meaningful?
 The size (diameter) of the cluster vs. The inter-
cluster distance
 Distance between the members of a cluster and
the cluster’s center
 Diameter of the smallest sphere

From “Data Analysis Tools for DNA Microarrays” by Sorin Draghici


Cluster Quality Continued

distance=5
size=5
distance=20

Quality of cluster assessed by


ratio of distance to nearest
size=5
cluster and cluster diameter
Figure Reproduced From “Data Analysis Tools for DNA
Microarrays” by Sorin Draghici
Cluster Quality Continued
Quality can be
assessed simply by
looking at the
diameter of a cluster

A cluster can be formed even when there is


no similarity between clustered patterns. This
occurs because the algorithm forces k clusters
to be created.
From “Data Analysis Tools for DNA Microarrays” by Sorin
Draghici
Characteristics of k-means Clustering
 The random selection of initial center points
creates the following properties
 Non-Determinism

 May produce clusters without patterns

 One solution is to choose the centers randomly


from existing patterns

From “Data Analysis Tools for DNA Microarrays” by Sorin Draghici


Algorithm Complexity
 Linear in the number of data points, N
 Can be shown to have time of cN
 c does not depend on N, but rather the
number of clusters, k
 Low computational complexity
 High speed

From “Data Analysis Tools for DNA Microarrays” by Sorin Draghici


The Need for a New Algorithm
-Each data point is
assigned to the
correct cluster
-Data points that
seem to be far away
from each other in
heuristic are in
reality very closely
related to each other

Figure Reproduced From “Data Analysis Tools for DNA


Microarrays” by Sorin Draghici
The Need for a New Algorithm

Eisen et al., 1998


Kohonen Self Organizing Feature Maps
(SOFM)
 Creates a map in which similar patterns are
plotted next to each other
 Data visualization technique that reduces n
dimensions and displays similarities
 More complex than k-means or hierarchical
clustering, but more meaningful
 Neural Network Technique
 Inspired by the brain
From “Data Analysis Tools for DNA Microarrays” by Sorin Draghici
SOFM Description
 Each unit of the SOFM Output Layer
has a weighted
connection to all inputs
 As the algorithm
progresses,
neighboring units are
grouped by similarity
Input Layer
From “Data Analysis Tools for DNA Microarrays” by Sorin Draghici
SOFM Algorithm
Initialize Map
For t from 0 to 1 //t is the learning factor
Randomly select a sample
Get best matching unit
Scale neighbors
Increase t a small amount //decrease learning factor
End for

From: http://davis.wpi.edu/~matt/courses/soms/
An Example Using Color
Three dimensional data: red, blue, green

Will be converted into 2D image map with


clustering of Dark Blue and Greys together and
Yellow close to Both the Red and the Green

From http://davis.wpi.edu/~matt/courses/soms/
An Example Using Color

Each color in
the map is
associated with
a weight

From http://davis.wpi.edu/~matt/courses/soms/
An Example Using Color
1. Initialize the weights

Random Colors in the Equidistant


Values Corners

From http://davis.wpi.edu/~matt/courses/soms/
An Example Using Color Continued
2. Get best matching unit

After randomly selecting a sample, go through all


weight vectors and calculate the best match (in this
case using Euclidian distance)
Think of colors as 3D points each component (red,
green, blue) on an axis

From http://davis.wpi.edu/~matt/courses/soms/
An Example Using Color Continued
2. Getting the best matching unit continued…
For example, lets say we chose green as the
sample. Then it can be shown that light
green is closer to green than red:
Green: (0,6,0) Light Green: (3,6,3) Red(6,0,0)

LightGreen  32 0 2 32  4.24


Re d  6 2  (6) 2 0 2  8.49
This step is repeated for entire map, and the weight with
the shortest distance is chosen as the best match

From http://davis.wpi.edu/~matt/courses/soms/
An Example Using Color Continued
3. Scale neighbors
1. Determine which weights are considred
nieghbors
2. How much each weight can become
more like the sample vector

1. Determine which weights are considered neighbors


In the example, a gaussian function is used where every
point above 0 is considered a neighbor
 6.66666667 x 2  y 2
f ( x, y )  e

From http://davis.wpi.edu/~matt/courses/soms/
An Example Using Color Continued
2. How much each weight can become more like the sample

When the weight with the smallest distance is chosen


and the neighbors are determined, it and its
neighbors ‘learn’ by changing to become more like the
sample…The farther away a neighbor is, the less it
learns

From http://davis.wpi.edu/~matt/courses/soms/
An Example Using Color Continued

NewColorValue = CurrentColor*(1-t)+sampleVector*t
For the first iteration t=1 since t can range from 0 to 1,
for following iterations the value of t used in this
formula decreases because there are fewer values in
the range (as t increases in the for loop)

From http://davis.wpi.edu/~matt/courses/soms/
Conclusion of Example

Samples continue to be chosen


at random until t becomes 1
(learning stops)
At the conclusion of the
algorithm, we have a nicely
clustered data set. Also note
that we have achieved our goal:
Similar colors are grouped
closely together

From http://davis.wpi.edu/~matt/courses/soms/
SOFM Applied to Genetics
 Consider clustering 10,000 genes
 Each gene was measured in 4
experiments
 Input vectors are 4 dimensional
 Initial pattern of 10,000 each described by
a 4D vector
 Each of the 10,000 genes is chosen one
at a time to train the SOM

From “Data Analysis Tools for DNA Microarrays” by Sorin Draghici


SOFM Applied to Genetics
 The pattern found to be closest to the current
gene (determined by weight vectors) is
selected as the winner
 The weight is then modified to become more
similar to the current gene based on the
learning rate (t in the previous example)
 The winner then pulls its neighbors closer to
the current gene by causing a lesser change
in weight
From “Data Analysis Tools for DNA Microarrays” by Sorin Draghici
SOFM Applied to Genetics
 This process continues for all 10,000
genes
 Process is repeated until over time the
learning rate is reduced to zero

From “Data Analysis Tools for DNA Microarrays” by Sorin Draghici


Our Favorite Example With Yeast
 Reduce data set to 828 genes
 Clustered data into 30 clusters using a
SOFM
 Each pattern is represented by its
average (centroid) pattern
 Clustered data has same behavior
 Neighbors exhibit similar behavior
“Interpresting patterns of gene expression with self-organizing maps: Methods and application
to hematopoietic differentiation” by Tamayo et al.
A SOFM Example With Yeast

“Interpresting patterns of gene expression with self-organizing maps: Methods and application to hematopoietic
differentiation” by Tamayo et al.
Benefits of SOFM
 SOFM contains the set of features
extracted from the input patterns
(reduces dimensions)
 SOFM yields a set of clusters
 A gene will always be most similar to a
gene in its immediate neighborhood
than a gene further away

From “Data Analysis Tools for DNA Microarrays” by Sorin Draghici


Conclusion
 K-means is a simple yet effective
algorithm for clustering data
 Self-organizing feature maps are slightly
more computationally expensive, but
they solve the problem of spatial
relationship

“Interpresting patterns of gene expression with self-organizing maps: Methods and application
to hematopoietic differentiation” by Tamayo et al.
References
 Basic microarray analysis: grouping and feature reduction by
Soumya Raychaudhuri, Patrick D. Sutphin, Jeffery T. Chang and
Russ B. Altman; Trends in Biotechnology Vol. 19 No. 5 May
2001
 Self Organizing Maps, Tom Germano,
http://davis.wpi.edu/~matt/courses/soms
 “Data Analysis Tools for DNA Microarrays” by Sorin Draghici;
Chapman & Hall/CRC 2003
 Self-Organizing-Feature-Maps versus Statistical Clustering
Methods: A Benchmark by A. Ultsh, C. Vetter; FG
Neuroinformatik & Kunstliche Intelligenz Research Report 0994
References
 Interpreting patterns of gene expression with self-
organizing maps: Methods and application to
hematopoietic differentiation by Tamayo et al.
 A Local Search Approximation Algorithm for k-Means
Clustering by Kanungo et al.
 K-means-type algorithms: A generalized convergence
theorem and characterization of local optimality by
Selim and Ismail

You might also like