Hierarchical Clustering Implementation
http://www.cs.Princeton.EDU/IntroCS
Iteration.
Closest pair of clusters (i, j) is the one with the smallest dist value.
Replace row i by the elementwise min of row i and row j.
Infinity out row j and column j.
Update dmin[i], and change dmin[i'] to i if previously dmin[i'] = j.
Closest pair: gene1 is closest to gene3, dist = 2.14, so i = 1, j = 3.

Before the merge:

         0      1      2      3      4
dmin     1      3      4      1      3
dist     5.5    2.14   5.6    2.14   5.5

gene     0      1      2      3      4
0        -      5.5    7.3    8.9    5.8
1        5.5    -      6.1    2.14   5.6
2        7.3    6.1    -      7.8    5.6
3        8.9    2.14   7.8    -      5.5
4        5.8    5.6    5.6    5.5    -

After the merge (row 1 becomes node1; row 3 and column 3 are infinitied out, shown as -):

         0      1      2      3      4
dmin     1      0      4      -      1
dist     5.5    5.5    5.6    -      5.5

         0      node1  2      3      4
0        -      5.5    7.3    -      5.8
node1    5.5    -      6.1    -      5.5
2        7.3    6.1    -      -      5.6
3        -      -      -      -      -
4        5.8    5.5    5.6    -      -

New min dist: 5.5
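
The iteration above can be sketched in Python as follows. This is a minimal sketch, not the course's reference implementation: the function names (`nearest_neighbors`, `closest_pair`, `merge`) are hypothetical, the matrix is assumed symmetric (so column i is updated together with row i), and infinity stands in for the diagonal and for merged-away rows.

```python
import math

INF = math.inf

def nearest_neighbors(dist):
    """dmin[k] = index of the smallest entry in row k (diagonal holds INF)."""
    n = len(dist)
    return [min((c for c in range(n) if c != k), key=lambda c: dist[k][c])
            for k in range(n)]

def closest_pair(dist, dmin):
    """Return (i, j) with the smallest dist value among active rows."""
    active = [k for k in range(len(dist)) if dmin[k] is not None]
    i = min(active, key=lambda k: dist[k][dmin[k]])
    return i, dmin[i]

def merge(dist, dmin, i, j):
    """Merge cluster j into cluster i, updating dist and dmin in place."""
    n = len(dist)
    # Replace row i (and, by symmetry, column i) by the min of rows i and j.
    for k in range(n):
        if k != i and k != j:
            dist[i][k] = dist[k][i] = min(dist[i][k], dist[j][k])
    # Infinity out row j and column j.
    for k in range(n):
        dist[j][k] = dist[k][j] = INF
    dmin[j] = None  # row j is no longer an active cluster
    # Rows whose nearest neighbor was j now point at i.
    for k in range(n):
        if k != i and dmin[k] == j:
            dmin[k] = i
    # Recompute the nearest neighbor of the merged row.
    dmin[i] = min((c for c in range(n) if c != i), key=lambda c: dist[i][c])

# The five-gene matrix from the worked example.
dist = [
    [INF, 5.5, 7.3, 8.9, 5.8],
    [5.5, INF, 6.1, 2.14, 5.6],
    [7.3, 6.1, INF, 7.8, 5.6],
    [8.9, 2.14, 7.8, INF, 5.5],
    [5.8, 5.6, 5.6, 5.5, INF],
]
dmin = nearest_neighbors(dist)
i, j = closest_pair(dist, dmin)
merge(dist, dmin, i, j)
print(i, j, dmin)
```

Taking the min of the two rows corresponds to single-link clustering; ties are broken toward the lower index, which is also an assumption.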
Sequence!
Bio-Sequences
Complete genomes of >1000 organisms
www.ncbi.nlm.nih.gov/Genomes/index.html
Biomolecules as Strings
Macromolecules are the chemical
building blocks of cells
Proteins
20 amino acids
Nucleic acids
Role of Evolution
Molecular structures and mechanisms are
reused and changed during evolution
Often mechanisms that are conserved can be
detected based on sequence similarity
Powerful tool for annotation
LKKHGTVVLTALGGILKKKGHHEAELKPLAQSHATKHKIPIKYLEFISDAIIHVLHSKHP
LKKHGATVLTALGGILKKKGHHEAEIKPLAQSHATKHKIPVKYLEFISECIIQVLQSKHP
GDFGADAQGAMTKALELFRNDIAAKYKELGFQG
GDFGADAQGAMNKALELFRKDMASNYKELGFQG
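
One simple way to quantify how similar two such sequences are is percent identity over equal-length strings. The sketch below is illustrative only: `percent_identity` is a hypothetical helper, and it assumes the two sequences are already aligned position by position with no gaps.

```python
def percent_identity(a, b):
    """Percentage of positions at which two equal-length sequences agree."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    if not a:
        raise ValueError("sequences must be non-empty")
    same = sum(1 for x, y in zip(a, b) if x == y)
    return 100.0 * same / len(a)

# The second pair of protein fragments shown above.
s = "GDFGADAQGAMTKALELFRNDIAAKYKELGFQG"
t = "GDFGADAQGAMNKALELFRKDMASNYKELGFQG"
print(round(percent_identity(s, t), 1))
```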
Given:
a pair (or more) of sequences (DNA or
protein)
a method for scoring the similarity of a
pair of characters (=bases or amino acids)
Determine: correspondences between
characters in the sequences such that the
similarity score is maximized
AACAGTTACC
TA-AGGT-CA
Score = ?
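
The score depends on the scoring method chosen. As a sketch, assume a hypothetical scheme of +1 for a match, -1 for a mismatch, and -2 for a gap (these values are not given in the slides); the alignment above can then be scored column by column:

```python
def alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Score two aligned strings of equal length; '-' marks a gap.

    The default scoring values are illustrative assumptions.
    """
    if len(a) != len(b):
        raise ValueError("aligned strings must have the same length")
    total = 0
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            total += gap        # one character aligned against a gap
        elif x == y:
            total += match
        else:
            total += mismatch
    return total

print(alignment_score("AACAGTTACC", "TA-AGGT-CA"))  # -2 under the assumed scheme
```

Under this scheme the alignment has 5 matches, 3 mismatches, and 2 gaps, giving 5 - 3 - 4 = -2. A different scoring method would give a different answer.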
Pairwise Alignment
Three choices for the first column when aligning AAAC with AGC:

  A  AAC        -  AAAC        A  AAC
  A  GC         A  GC          -  AGC

In each case: score of aligning the characters in the first column, plus the best alignment of the remaining suffixes.
If we knew the answers to
these three subproblems,
then we'd know the best
alignment score between
AAAC and AGC
Consider the maximum of
these
three cases (the similarity score is being maximized)
[Figure: alignment matrix with rows and columns labeled by sequence characters; cell annotations: "Best alignment score of AC with GC" and "Best alignment score of AAAG with C"]
sim[i, j] is computed from three neighboring entries:
  sim[i+1, j] + g                   (g = gap cost)
  sim[i, j+1] + g
  sim[i+1, j+1] + sc(s[i], t[j])    (sc = similarity score between s[i] and t[j])
There are nm
entries in the
matrix.
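
The suffix recurrence can be filled in bottom-up. The sketch below is one possible implementation under stated assumptions: it takes the maximum of the three cases (consistent with maximizing similarity), treats the gap cost g as a negative number, and uses a hypothetical match/mismatch function for sc. It also adds one extra row and column for the empty suffix, so the table has (n+1)(m+1) entries rather than nm.

```python
def global_alignment_score(s, t, match=1, mismatch=-1, gap=-2):
    """Best global alignment score of s and t via the suffix recurrence.

    sim[i][j] is the best score for aligning s[i:] with t[j:].
    The scoring values are illustrative assumptions.
    """
    n, m = len(s), len(t)
    sim = [[0] * (m + 1) for _ in range(n + 1)]
    # Boundary: aligning a suffix against the empty string costs one gap per character.
    for i in range(n - 1, -1, -1):
        sim[i][m] = sim[i + 1][m] + gap
    for j in range(m - 1, -1, -1):
        sim[n][j] = sim[n][j + 1] + gap
    # Fill from the bottom-right corner toward sim[0][0].
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            sc = match if s[i] == t[j] else mismatch
            sim[i][j] = max(
                sim[i + 1][j] + gap,     # s[i] aligned against a gap
                sim[i][j + 1] + gap,     # t[j] aligned against a gap
                sim[i + 1][j + 1] + sc,  # s[i] aligned with t[j]
            )
    return sim[0][0]

print(global_alignment_score("AAAC", "AGC"))  # -1 under the assumed scheme
```

Each entry is computed once from three already-filled neighbors, so the running time is proportional to the number of entries in the matrix.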
Local Alignment
Database search
Given a sequence of interest, can you
find other similar sequences (to get a
hint about structure/function)?
Speeding up searches
Give up optimality, use heuristics
BLAST algorithm
Very important!
BLAST Notes
May fail to find all high-scoring segment pairs
-Heuristic approach
Empirically, more than an order of magnitude faster
than Smith-Waterman
Large impact:
NCBI's BLAST server handles thousands of
queries a day
Most used (and cited) bioinformatics program