10 - Chapter 3
Bio Sequences
3.1 Introduction
shape, which in turn confers its function. Segments of the protein that are
segments are often lethal to the organism. These critical "active sites" tend to be
conserved over time and so can be found in many organisms and proteins that
in the late 1970s, the amount of data about the protein and DNA sequence of
humans and other organisms has been growing at an exponential rate. Databases
of these sequences will contain a wealth of information about the nature of life
the case of DNA, there are 4 types of monomers, the nucleotides, each having a
different side chain. For proteins, there are 20 types of monomers, the amino
acids. With just a few exceptions, the sequence of monomers, that is, the
correlation between sequence and structure implies that to know the function of
everything one needs to know about the DNA strand or protein in question. If
conjecture that they perform the same function. Because DNA's principal role is
similarity of two segments of DNA suggests that they code similar things.
the replication of DNA can cause a change in the nucleotide at a given position.
of DNA that codes for protein, these changes cause related changes in the
primary sequence and, hence, the shape and activity of the protein. The impact
of a particular mutation depends on the degree to which the original and new
amino acid sequences differ in their physical and chemical properties. Mutations
that result in proteins that are so altered that they function improperly or not at
all tend to be lethal to the organism. Nature is biased against mutations in those
critical regions central to a protein's function and is more lenient toward changes
in other regions.
two proteins in two organisms evolved from a common precursor, one will
regions. If the proteins are very recent derivatives, one might expect to see
similarity over the entire length of the sequences. While proteins can be similar
considerations. It appears that nature not only conserves the critical parts of a
protein's conformation and function, but also reuses such motifs as modular
units in fashioning the spectrum of known proteins. One finds strong similarities
between the v-sis oncogene and a growth-stimulating hormone was the key to
discovering that the v-sis oncogene causes cancer by deregulating cell
growth. In that case, the similarity involved the entirety of the sequence. In other
with a simple, core problem of finding the best alignment between the entirety
aligned with the same letter has a score of 1. A letter aligned with any different
letter or a gap has a score of 0. The total score is the sum of the scores for the
Figure 3.1. Under this unit-cost scheme, the score of an alignment is equal to the
alignment, that is, an alignment that achieves this highest score by aligning five
alignment, but in general there can be many. For example, '^' " " " ' also has a
score of 5.
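The unit-cost scoring described above can be sketched in a few lines. This is only an illustration: the names `unit_delta` and `alignment_score`, the representation of an alignment as two equal-length strings padded with "-", and the sample alignment itself are assumptions made for the example.

```python
def unit_delta(a, b):
    """Unit-cost scheme: 1 for two identical letters, 0 for a mismatch or a gap."""
    return 1 if a == b and a != "-" else 0


def alignment_score(top, bottom, delta=unit_delta):
    """Sum the column scores of an alignment given as two gapped rows."""
    if len(top) != len(bottom):
        raise ValueError("both rows of an alignment must have the same length")
    return sum(delta(a, b) for a, b in zip(top, bottom))


# Hypothetical alignment of ATTACG with ATATCG: five identical pairs, two gaps.
print(alignment_score("AT-TACG", "ATAT-CG"))
```

Passing a different `delta` function is all that is needed to score the same alignment under another scheme.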
The unit-cost scoring scheme of Figure 3.1 is not the only possible
scheme. Later in this chapter, we will see a much more complex scoring scheme
used in the comparison of proteins (20-letter alphabet). In that scheme and other
scoring schemes, the scores in the table are real numbers assigned on the basis
δ    -   A   C   G   T
-    0   0   0   0   0
A    0   1   0   0   0
C    0   0   1   0   0
G    0   0   0   1   0
T    0   0   0   0   1
whose symbols range over some alphabet ψ, for example, ψ = {A, C, G, T} for
DNA sequences. Let δ(a, b) be the score for aligning a with b, let δ(a, -) be the
score of leaving symbol a unaligned in sequence A, and let δ(-, b) be the score
of leaving b unaligned in B. Here a and b range over the symbols in ψ and the
gap symbol "-". The score of an alignment is simply the sum of the scores δ
assigns to each pair of aligned symbols, for example, the score of A-lAtSQ is
Proceeding formally, the edit graph G_{A,B} for comparing sequences A and
rectangular grid or matrix, so that (i, j) designates the vertex in column i and
row j (where the numbering starts at 0). The following edges, and only these
The edit graph has the property that paths and alignments between
segments of A and B are in isomorphic correspondence. That is, any path from
vertex (g, h) to vertex (i, j) for g ≤ i and h ≤ j models an alignment between the
substrings a_{g+1} a_{g+2} ... a_i and b_{h+1} b_{h+2} ... b_j, and vice versa. The alignment modeled
by a path is the sequence of aligned pairs given by labels on its edges. For
example, in Figure 3.2 the two highlighted paths, both from vertex (0,0) to (6,6)
We focus on computing just the score for the moment, and return to the goal of
delivering an alignment achieving that score at the end of this subsection. First,
observe that, in terms of the edit graph formulation, we seek the score of a
maximal-score path from the vertex Θ at the upper left-hand corner of the graph
some given vertex (i, j) in the graph. Because there are only three edges directed
into vertex (i, j), it follows that any optimal path P to (i, j) must fit one of the
following three cases: (1) P is an optimal path to (i - 1, j) followed by the A-gap
edge into (i, j); (2) P is an optimal path to (i, j - 1) followed by the B-gap edge
into (i, j); or (3) P is an optimal path to (i - 1, j - 1) followed by the alignment
edge into (i, j). It is critical to note that the subpath preceding the last edge of P
must also be optimal, for, if it is not, then it is easy to show that P cannot be
optimal, a contradiction.
[Figure 3.2: edit graph with two highlighted paths from vertex (0,0) to (6,6)]
$$
S(i,j) = \max\{\, S(i-1,j-1) + \delta(a_i,b_j),\;
                 S(i-1,j) + \delta(a_i,-),\;
                 S(i,j-1) + \delta(-,b_j) \,\},
$$
which states that the maximal score of a path to (i, j) is the largest of (1) the
maximal score of a path to (i - 1, j) plus the score of the A-gap edge to (i, j), (2)
the maximal score of a path to (i, j - 1) plus the score of the B-gap edge to (i, j),
or (3) the maximal score of a path to (i - 1, j - 1) plus the score of the alignment
edge to (i, j).
many possible orders. Three simple alternatives are (1) column by column from
left to right, top to bottom in each column, (2) row by row from top to bottom.
the vertices (i, j) such that i + j = k. Using the first sample ordering leads to
the algorithm of Figure 3.3. In this algorithm, M denotes the length of A and N
The algorithm of Figure 3.3 computes S(i, j) for every vertex (i, j) in an (M
+ 1) x (N + 1) matrix in the indicated order of i and j. Along the left and upper
boundaries of the edit graph (that is, vertices with i = 0 or j = 0, respectively),
the algorithm utilizes the recurrence, except that terms referencing nonexistent
vertices are omitted (that is, in lines 3 and 5, respectively). The algorithm of
Figure 3.3 takes O(MN) time; that is, when M and N are sufficiently large, the
time taken by the algorithm grows no faster than a constant multiple of MN. If one
stores the whole (M + 1) x (N + 1) matrix S, then the algorithm also requires
O(MN) space.
8. }
9. write "Maximum score is" S[M, N]
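A minimal Python sketch of the column-by-column computation follows. It is not a transcription of Figure 3.3: the function name `global_score`, the 0-based string indexing, and the two scoring functions (`unit_delta`, and `biased_delta` with match 1, mismatch -0.33, gap -1) are assumptions chosen for illustration.

```python
def unit_delta(a, b):
    """Unit-cost scheme: 1 for two identical letters, 0 for a mismatch or a gap."""
    return 1 if a == b and a != "-" else 0


def biased_delta(a, b):
    """Hypothetical negatively biased scheme: match 1, mismatch -0.33, gap -1."""
    if a == "-" or b == "-":
        return -1.0
    return 1.0 if a == b else -0.33


def global_score(A, B, delta=unit_delta):
    """Return S(M, N), the best score of a global alignment of A and B."""
    M, N = len(A), len(B)
    S = [[0] * (N + 1) for _ in range(M + 1)]
    for i in range(M + 1):            # columns, left to right
        for j in range(N + 1):        # top to bottom within a column
            if i == 0 and j == 0:
                continue              # the start vertex has score 0
            terms = []
            if i > 0 and j > 0:       # alignment edge
                terms.append(S[i - 1][j - 1] + delta(A[i - 1], B[j - 1]))
            if i > 0:                 # A-gap edge: a_i left unaligned
                terms.append(S[i - 1][j] + delta(A[i - 1], "-"))
            if j > 0:                 # B-gap edge: b_j left unaligned
                terms.append(S[i][j - 1] + delta("-", B[j - 1]))
            S[i][j] = max(terms)      # terms for nonexistent vertices are omitted
    return S[M][N]


print("Maximum score is", global_score("ATTACG", "ATATCG"))
```

The two nested loops visit each of the (M + 1)(N + 1) vertices once and do constant work per vertex, which is where the O(MN) time bound comes from.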
requires that the optimal answer to a given subproblem be expressible in terms
problem does yield to this principle: the optimal answer S(i, j) for the problem
formula describes the relationship of each subproblem to a larger subproblem.
alignments can be recovered by tracing the paths backwards from Φ to Θ with
the aid of the now complete matrix S. Specifically, an edge from vertex v1 to Φ
is on an optimal path if S(v1) plus the score of its edge equals S(Φ). If v1 is on
S(v2) plus the score of the edge equals S(v1). In this way, one can follow an
optimal path back to the start vertex Θ. In essence, this traceback procedure
moves backwards from a vertex to the preceding vertex whose term in the
three-way maximum of the recurrence yielded the maximum. Ties can produce
more than a single optimal path. Unfortunately, this
traceback technique for identifying one or more optimal paths requires that the
entire matrix S be retained, giving an algorithm that takes O(MN) space as well
as time.
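The traceback idea can be sketched as follows. This is an illustrative version under stated assumptions: the names `fill_matrix` and `traceback` are hypothetical, scores are integers (so the equality tests are exact), and ties are broken in favor of the alignment edge, so only one of possibly many optimal alignments is returned.

```python
def unit_delta(a, b):
    """Unit-cost scheme: 1 for two identical letters, 0 for a mismatch or a gap."""
    return 1 if a == b and a != "-" else 0


def fill_matrix(A, B, delta=unit_delta):
    """Compute and keep the whole (M + 1) x (N + 1) matrix S."""
    M, N = len(A), len(B)
    S = [[0] * (N + 1) for _ in range(M + 1)]
    for i in range(M + 1):
        for j in range(N + 1):
            if i == 0 and j == 0:
                continue
            terms = []
            if i > 0 and j > 0:
                terms.append(S[i - 1][j - 1] + delta(A[i - 1], B[j - 1]))
            if i > 0:
                terms.append(S[i - 1][j] + delta(A[i - 1], "-"))
            if j > 0:
                terms.append(S[i][j - 1] + delta("-", B[j - 1]))
            S[i][j] = max(terms)
    return S


def traceback(A, B, S, delta=unit_delta):
    """Walk from the sink (M, N) back to the source (0, 0) along one optimal path."""
    i, j = len(A), len(B)
    top, bottom = [], []
    while i > 0 or j > 0:
        if i > 0 and j > 0 and S[i][j] == S[i - 1][j - 1] + delta(A[i - 1], B[j - 1]):
            top.append(A[i - 1]); bottom.append(B[j - 1]); i -= 1; j -= 1
        elif i > 0 and S[i][j] == S[i - 1][j] + delta(A[i - 1], "-"):
            top.append(A[i - 1]); bottom.append("-"); i -= 1
        else:  # the only remaining predecessor is (i, j - 1)
            top.append("-"); bottom.append(B[j - 1]); j -= 1
    return "".join(reversed(top)), "".join(reversed(bottom))


A, B = "ATTACG", "ATATCG"
S = fill_matrix(A, B)
row_a, row_b = traceback(A, B, S)
print(row_a)
print(row_b)
```

With real-valued scores the exact equality tests would need a tolerance, or the choice made at each vertex could be recorded while the matrix is filled.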
with the observation that if only the score of an optimal alignment is desired
they have been used in computing the values that depend on them. Observing
that one need only know the previous column in order to compute the next
one, it follows that only two columns need be retained at any instant, and so
only O(N) space is required. Such a score-only algorithm can be used as a sub
alignment using only O(M + N) space. The divide step consists of finding the
on the first half of B and the reverse of the second half of B. The conquer step
consists of determining the two halves of this path by recursively reapplying the
divide step to the two halves. Myers and Miller [105] have shown this strategy
algorithms. This refinement is very important, since space, not time, is often the
CPU time, but would require 10 billion units of memory if optimal alignments
were delivered using the simple O(MN) space traceback approach. This is well
subsegments of A and B that align with maximal score. Local alignments can be
visualized as paths in the edit graph, G_{A,B}. Unlike the global alignment problem,
the path may start and end at any vertices, not just at Θ and Φ. Intrinsic to
(under some suitable stochastic model of the sequences) the score of a path must
on the average be negative. If this were not the case, then longer paths would
tend to have higher scores, and one would generally end up reporting a global
alignment between two sequences as the optimal local alignment. For example,
the simple scoring scheme of Figure 3.1 is not negatively biased, whereas the
scheme of Figure 3.4 is. Note that under this new scheme, the gap-free alignment of
ATTACG with ATATCG is now optimal with score 3.34, whereas the earlier alignment
now has the lesser score 3. In this
case, the optimal alignment happened to be global, but for longer sequences this
is generally not the case. For example, the best local alignment between
nccT^A^ score 4.34 between the underlined substrings of score 4.34 between the
underlined substrings.
δ      -      A      C      G      T
-     -1     -1     -1     -1     -1
A     -1      1   -.33   -.33   -.33
C     -1   -.33      1   -.33   -.33
G     -1   -.33   -.33      1   -.33
T     -1   -.33   -.33   -.33      1
since combinations of 20 letters and the gap symbol represent proteins, the table
of scores is now 21 x 21. These scores may be chosen by users to fit the notion
of similarity they have in mind for the comparison. For example, Dayhoff [36]
into another over a fixed period of time and from these built a table of aligned
likelihood that one segment has mutated into the other. Figure 3.5 is a scaled
The basic issue in local alignment, just as in the case of global alignment,
is to find a path of maximal score. However, there are more degrees of freedom
in the local alignment problem: where the paths begin and where they end is not
given a priori but is part of the problem. Note that if we knew the vertex (g, h) at
which the best path began, we could find its score and end-vertex by setting
S(g, h) to 0 and then applying the fundamental recurrence to all vertices (i, j) for
which i ≥ g and j ≥ h. We can capture all potential start vertices simultaneously
$$
S(i,j) = \max\{\, 0,\; S(i-1,j-1) + \delta(a_i,b_j),\;
                 S(i-1,j) + \delta(a_i,-),\;
                 S(i,j-1) + \delta(-,b_j) \,\}.
$$
Indeed, with this simple modification, S(i, j) is now the score of the
highest-scoring path to (i, j) that begins at some vertex (g, h) for which g ≤ i and
h ≤ j. The best score of a path in the edit graph is then the maximum over all
vertices in the graph of their S-values. A vertex achieving this maximum is the
end of an optimal path. This basic result is often referred to as the
Smith-Waterman algorithm after its inventors [146]. The beginning of the path, the
segments it aligns, and the alignment between these segments can all be
global alignments. If one uses such a comparison algorithm with the scoring
scheme of Figure 3.5, one sees the three regions of similarity shown in Figure
3.6 between the sequence of the monkey somatotropin protein and the
cases the aligned symbols are identical, they do not have to be.
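The local-alignment recurrence, with its extra 0 term, can be sketched as below. The names `biased_delta` and `local_score` are hypothetical, and the scoring function (match 1, mismatch -1/3, gap -1) is a toy negatively biased scheme chosen for illustration; it is not the protein scoring table of Figure 3.5.

```python
def biased_delta(a, b):
    """Toy negatively biased scheme: match 1, mismatch -1/3, gap -1."""
    if a == "-" or b == "-":
        return -1.0
    return 1.0 if a == b else -1.0 / 3.0


def local_score(A, B, delta=biased_delta):
    """Return (best score, end vertex) of a highest-scoring local alignment."""
    M, N = len(A), len(B)
    S = [[0.0] * (N + 1) for _ in range(M + 1)]
    best, end = 0.0, (0, 0)
    for i in range(M + 1):
        for j in range(N + 1):
            terms = [0.0]             # the 0 term: a path may start at (i, j)
            if i > 0 and j > 0:
                terms.append(S[i - 1][j - 1] + delta(A[i - 1], B[j - 1]))
            if i > 0:
                terms.append(S[i - 1][j] + delta(A[i - 1], "-"))
            if j > 0:
                terms.append(S[i][j - 1] + delta("-", B[j - 1]))
            S[i][j] = max(terms)
            if S[i][j] > best:        # remember a vertex achieving the maximum
                best, end = S[i][j], (i, j)
    return best, end


# The shared substring ACGT gives a local alignment of score 4 ending at (7, 6).
print(local_score("TTTACGTAAA", "CCACGTGG"))
```

Tracing back from the reported end vertex until a vertex with S-value 0 is reached would recover the start of the path and the aligned segments.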
[Figure 3.5: scaled Dayhoff scoring table for amino acid pairs]
Thus far we have presented the local similarity problem as one of finding
two subsegments of the sequences that align with maximal score. But as
illustrated in Figure 3.6, the ultimate goal is to expose not a single such
as in the somatotropin example of Figure 3.6. To this end, Waterman and Eggert
edge in the edit graph that is adjacent to a vertex on the path of this local
alignment. Now find a highest-scoring path over the remaining graph. Eliminate
the edges adjacent to this second-best path, and proceed to find a third-best path.
score whose underlying paths do not intersect. Note that this procedure may
space, but recent refinements by Chao and Miller [24] have reduced both storage
and computing time and have permitted the comparison of two sequences of
sequences become larger and larger, one must "step back" from the details of the
edit graph. Figure 3.7 is a depiction of the edit graph of the monkey and rainbow
trout somatotropin sequences of Figure 3.6 where only the paths corresponding
to the three aligned segment pairs are drawn. At this level of resolution, the
small gaps in the alignments of the second and third segment pairs appear as
small discontinuities in paths that otherwise follow the direction of the diagonal
of the edit graph grid or matrix. When the sequences become very large, say on
the order of 100,000 nucleotides, then small local alignments are not seen, and
neither are gaps in large alignments unless they are very large. Nonetheless,
such dot plots give a meaningful visualization of all the similarities between
How to assign scores to alignment gaps has always been more problematic
than scoring aligned symbols, because the statistical effect of gaps is not well
models in which the score of a gap is not just the sum of scores assigned to the
individual symbols in the gap, as was used in the previous two sections, but
gap(x) = r + sx, where r > 0 is the penalty for the introduction of the gap and s
> 0 is the penalty for each symbol in the gap. Such affine gap costs are
8 + 4x works well in conjunction with the aligned symbol scores of Figure 3.5.
Because a gap is viewed as detracting from similarity, its score is a penalty that
central recurrence. For each subproblem, A_i versus B_j, one develops recurrences
for (1) the best alignment that ends with an A-gap, Ag(i, j), (2) the best
alignment that ends with a B-gap, Bg(i, j), and (3) the best overall alignment, S(i
gap is being initiated from that term. Ag terms contributing to Ag values and Bg
terms contributing to Bg values are penalized only s because the gap is just
being extended. Applying these recurrences at each (i, j) leads to
an O(MN) time algorithm for global alignments with affine gap costs. Simply
adding a 0 term to the S-recurrence gives an algorithm for local alignments with
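A sketch of a global alignment score with affine gap costs gap(x) = r + sx is given below. It uses a commonly taught three-matrix formulation, which is assumed here rather than copied from the text: a gap opened from an S term is charged r + s, and a gap extended from an Ag or Bg term is charged only s. The names `affine_global_score` and `simple_match` are hypothetical.

```python
NEG_INF = float("-inf")


def simple_match(a, b):
    """Hypothetical aligned-symbol score: +1 for a match, -1 for a mismatch."""
    return 1 if a == b else -1


def affine_global_score(A, B, r, s, match=simple_match):
    """Best global alignment score when a gap of length x costs r + s * x."""
    M, N = len(A), len(B)
    S = [[NEG_INF] * (N + 1) for _ in range(M + 1)]   # best overall
    Ag = [[NEG_INF] * (N + 1) for _ in range(M + 1)]  # best ending in an A-gap
    Bg = [[NEG_INF] * (N + 1) for _ in range(M + 1)]  # best ending in a B-gap
    S[0][0] = 0
    for i in range(M + 1):
        for j in range(N + 1):
            if i == 0 and j == 0:
                continue
            if i > 0:  # open a gap from S (cost r + s) or extend one (cost s)
                Ag[i][j] = max(S[i - 1][j] - (r + s), Ag[i - 1][j] - s)
            if j > 0:
                Bg[i][j] = max(S[i][j - 1] - (r + s), Bg[i][j - 1] - s)
            diag = NEG_INF
            if i > 0 and j > 0:
                diag = S[i - 1][j - 1] + match(A[i - 1], B[j - 1])
            S[i][j] = max(diag, Ag[i][j], Bg[i][j])
    return S[M][N]


# Six matches and one gap of length 3 with r = 2, s = 1: 6 - (2 + 3) = 1.
print(affine_global_score("AAACCCTTT", "AAATTT", r=2, s=1))
```

Adding 0 as a further candidate in the line that computes `S[i][j]`, and tracking the maximum entry, would turn this into a local alignment variant.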
Summation and affine functions are not the only options available for
scoring gaps. The gap cost function gap {x) can be taken to be a concave (flat or
cupped downward) function of length, that is, a function such that gap(x + 1) -
gap(x) ≤ gap(x) - gap(x - 1) for all x > 0. The class of concave gap cost functions
includes affine functions but is much wider than just affine functions. For
example, for positive a and b, the function gap(x) = a log x + b is a concave
postulated that such a model is natural for biological sequences where gap costs
For this model, investigators have been able to design algorithms that take
algorithm takes O(MN(M + N)) time [163]. For this reason and because the
more restricted affine and concave models appear adequate to most needs, the
exposing the similarity between two sequences and thus have naturally thought
think about how a sequence A may have evolved into sequence B over time. In
this context, one seeks alignments that reveal the minimum number of
mutational events that might have effected the transformation. In this view, an
represents the differences rather than the similarities between symbols. Note that
model, the goal is to find an alignment of minimal score, that is, one that
indicates the minimum scoring set of changes needed to go from one sequence
A and B. Indeed the measure, D, between sequences forms a metric space over
sequences if the underlying scoring function δ forms a metric space over the
and formula. Also, one could simply take a δ for a difference problem and
multiply every score by -1. Applying the similarity algorithm with the modified
scores would produce optimal alignments for the original difference problem,
and multiplying the resultant similarity score by -1 would give the distance
same functionality, it is much more likely that the similarity that gives this
common function will be more evident among the group than between two
must have taken place along each branch of the tree. A third application for
errors are detected and corrected by sequencing a given stretch several times and
CTCCCC-CACAT-ATitSGCG-OTC-CGHCA-C*- TAQQCkAtSQC
CTCCJCCGCA-ATTCGGGCG-GTCTCGAGA-GACrAGGCItAQCX:
CTCGCGrCACArrCGEGC(?fGTCTCGA<^-GACTAGGC&ACCC
CrCTiCgCci'M^^T^^GrTGCG-GT-TCG-GA-GACTAGGCiyVKX
' ^ v"" is of length A^'. As for the basic problem, we wish to arrange the
sequences into a tableau using dashes to force the alignment of certain
characters in given columns. For example, in Figure 3.8 the dashes are placed so
as to arrange columns consisting of primarily one symbol. For each column, the
consensus of the column is the symbol that occurs the greatest number of times
dashes, gives the consensus sequence for the five experimental trials. As for pair
symbols in the column not equal to the majority symbol of the column (which
can be a dash), then the multi-alignment of Figure 3.8 has score 13, and this is
the minimum possible score over all possible multi-alignments of the five
array S, where S(i) is the score of the best alignment among the prefix sequences
A' A^ A'K
. The central recurrence now becomes
dimensional space where each vertex i has 2^K - 1 edges directed into it, each
corresponding to a column that when appended to the alignment for the edge's
tail gives rise to the alignment for the prefix sequences represented by i.
which means that it is almost surely the case that any algorithm for this problem
must exhibit time behavior that is exponential in K. Thus many authors have
sought heuristic approximations, the most popular of which is to take O(K^2 N^2)
time to compute all pairwise optimal alignments between the K sequences, and
between a given pair of sequences (take the two rows of the tableau and remove
any columns consisting of just dashes). However, given all of the possible K(K -
to arrange a multi-alignment consistent with them all. Try, for example, merging
1 alignments relating all the sequences (that is, a spanning tree of the complete
graph of sequence pairs), it is always possible to do so. Feng and Doolittle [43]
algorithms utilize the natural choice of the K - 1 alignments whose scores sum to
the minimal possible amount (that is, a minimum spanning tree of the complete
graph of sequence pairs). However, such merges do not always lead to optimal
defined in terms of an underlying pairwise scoring function δ'. For example, the
sum of the scores of the K(K - 1)/2 pairwise alignments it induces. Another
max/min{Σ_i δ'(c, a_i) : c ∈ ψ ∪ {-}}. The symbol c that gives the best score is said to
be the consensus symbol for the column, and the concatenation of these symbols
sum of the scores of the K pairwise alignments of the sequences versus the
consensus. The example of Figure 3.8 is such a scoring scheme where δ' is the
scoring scheme of Figure 3.1. While we do not show it here, the problem of
choosing a δ for columns that suitably encodes the tree relating the sequences
[134]. However, the more general phylogeny problem requires that one also
determine the tree that produces the minimal score. This daunting task
essentially requires the exploration of the space of all possible trees with K
into designing heuristic algorithms for the phylogeny problem, and there is
which scoring schemes are "correct" for a given comparison domain. This
uncertainty has suggested the problem of listing all alignments near the
From the point of view of the edit graph formulation, the K-best problem
the operations research literature. Indeed, there is an O(MN + KN) time and
space algorithm, immediately available from this literature [45], that delivers the
K-best paths over an edit graph. The algorithm delivers these paths/alignments
in order of score, and K does not need to be known a priori: the next best
keep, at each vertex v, an ordered list of the score of the next best path to the
ordered lists and is extracted, and the lists are appropriately updated.
alignments that are within ε of the optimal difference D(A, B), then a simpler
method is available that requires only the matrix S of the dynamic programming
computation. While not any faster in time, the simpler alternative below does
require only O(MN) space. One can imagine tracing back all paths from the sink
limit the traceback to only those paths of score not greater than D(A, B) + ε.
Suppose one reaches vertex (i, j) and the score of the path thus far traversed
from the sink to this vertex is T(i, j). Then one traces back to predecessor
vertices (i - 1, j), (i - 1, j - 1), and (i, j - 1) if and only if: respectively. This
A classic example of the need for affine gap costs was presented in a paper
by Smith and Fitch [146] comparing the α and β chicken hemoglobin chains.
For a setting of the gap costs that gave the biologically correct alignment, there
growth suggests that perhaps rather than list alignments, one should report the
best possible scores in order or give a color-coded visualization of the edit graph
that colors edges according to the score of the best path utilizing the edge.
scheme [162].
approximate match problem. For this problem, imagine that A is a very long
substrings of A, called match sites, that align with the entirety of B with a score
sequence element (B) or some sequence like it occurs. It is not hard to see that
this problem is equivalent to finding sufficiently high scoring paths that begin at
a vertex in row 0 and end at row N of the edit graph for A and B. By simply
values in row N, one obtains the desired modification of the basic dynamic
programming algorithm.
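One way to realize this modification is sketched below, under assumptions that go beyond the surviving text: every vertex in row 0 is given score 0 so that a path may begin anywhere along A, the values in row N are compared against a threshold, and scores equal to the threshold are reported. The names `match_sites` and `simple_delta`, and the scoring function itself, are hypothetical.

```python
def simple_delta(a, b):
    """Hypothetical scheme: match +1, mismatch -1, gap -1."""
    if a == "-" or b == "-":
        return -1
    return 1 if a == b else -1


def match_sites(A, B, threshold, delta=simple_delta):
    """Return (end position in A, score) for paths from row 0 to row N."""
    N = len(B)
    prev = [0] * (N + 1)              # column i = 0
    for j in range(1, N + 1):
        prev[j] = prev[j - 1] + delta("-", B[j - 1])
    sites = []
    if prev[N] >= threshold:
        sites.append((0, prev[N]))
    for i in range(1, len(A) + 1):
        cur = [0] * (N + 1)           # cur[0] = 0: free start anywhere in row 0
        for j in range(1, N + 1):
            cur[j] = max(prev[j - 1] + delta(A[i - 1], B[j - 1]),
                         prev[j] + delta(A[i - 1], "-"),
                         cur[j - 1] + delta("-", B[j - 1]))
        if cur[N] >= threshold:       # a match site ends at position i of A
            sites.append((i, cur[N]))
        prev = cur
    return sites


# With threshold 4, only the two exact copies of ACGT qualify.
print(match_sites("TTACGTTTACGTT", "ACGT", threshold=4))
```

Only two columns are kept at a time, so the sketch uses space proportional to the length of B regardless of how long A is.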
applications have long studied the problem of finding exact matches to a pattern
in a long text. That is, given a pattern as a query, and a text as a database, one
seeks substrings of the database text that match the pattern (exactly). Pattern
types that have been much studied include the cases of a simple sequence, a
regular expression, and a context-free language. Such patterns are notations that
match the pattern. For example, the regular expression A(T|C)G* denotes the set
G's, that is, the set {AT, AC, ATG, ACG, ATGG, ACGG, ATGGG, ...}.
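The set denoted by this regular expression can be checked with Python's standard `re` module. The helper name `denotes` and the sample text are assumptions made for the example.

```python
import re

# The pattern from the text: an A, then a T or a C, then zero or more G's.
PATTERN = re.compile(r"A(T|C)G*")


def denotes(s):
    """True if the whole string s is in the set denoted by the pattern."""
    return PATTERN.fullmatch(s) is not None


for s in ["AT", "AC", "ATG", "ACG", "ATGG", "ACGG", "ATGGG", "AG", "ATC"]:
    print(s, denotes(s))

# Exact matches inside a longer, hypothetical text (leftmost, non-overlapping).
text = "TTACGGTATTT"
print([(m.start(), m.group()) for m in PATTERN.finditer(text)])
```

`fullmatch` tests membership in the denoted set, while `finditer` corresponds to the exact-match search problem over a text.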
Assuming the pattern takes P symbols to specify and the text is of length N,
there are algorithms that solve the text searching problem in O(P + N), O(PN),
exact pattern matching and sequence comparison gives rise to the class of
scheme, and a threshold, one seeks all substrings of the database that align to
some sequence denoted by the pattern with score better than the threshold. In
essence, one is looking for substrings that are within a given similarity
the pattern is a simple sequence. We showed earlier that this problem can be
solved in O(PN) time. For the case of regular expressions, the approximate
match problem can also be solved in O(PN) time [112], and, for context-free
quadratic, O(MN) time algorithms. This has led several investigators to study
the use of parallel computers to achieve greater efficiency. As stated above, the
S-matrix can be computed in any order consistent with the data dependencies of
by-column evaluation, but we pointed out as a third alternative that one could
that each entry in this antidiagonal can be computed independently of the other
entries in the antidiagonal, a fact not true of the row-by-row and column-by-
and compute its result independently of the others. With O(M) processors, each
antidiagonal can be computed in constant time, for O(N) total elapsed
time. Note that total work, which is the product of processors and time per
processor, is still O(MN). The improvement in time stems from the use of more
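The independence of entries within an antidiagonal can be illustrated with a sequential sketch. Nothing here runs in parallel: the list comprehension merely stands in for the work that separate processors could do at the same time. The names `antidiagonal_score` and `unit_delta` are hypothetical.

```python
def unit_delta(a, b):
    """Unit-cost scheme: 1 for two identical letters, 0 for a mismatch or a gap."""
    return 1 if a == b and a != "-" else 0


def antidiagonal_score(A, B, delta=unit_delta):
    """Global alignment score computed one antidiagonal (i + j = k) at a time."""
    M, N = len(A), len(B)
    S = [[0] * (N + 1) for _ in range(M + 1)]

    def entry(i, j):
        # Reads only entries on antidiagonals k - 1 and k - 2.
        terms = []
        if i > 0 and j > 0:
            terms.append(S[i - 1][j - 1] + delta(A[i - 1], B[j - 1]))
        if i > 0:
            terms.append(S[i - 1][j] + delta(A[i - 1], "-"))
        if j > 0:
            terms.append(S[i][j - 1] + delta("-", B[j - 1]))
        return max(terms)

    for k in range(1, M + N + 1):
        cells = [(i, k - i) for i in range(max(0, k - N), min(M, k) + 1)]
        # No entry below depends on another entry of the same antidiagonal,
        # so these evaluations could be carried out simultaneously.
        values = [entry(i, j) for i, j in cells]
        for (i, j), v in zip(cells, values):
            S[i][j] = v
    return S[M][N]


print(antidiagonal_score("ATTACG", "ATATCG"))
```

All values of an antidiagonal are computed before any of them is stored, which makes the absence of dependencies within the antidiagonal explicit.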
This observation about antidiagonals has been used to design custom VLSI
(very large scale integration) chips configured in what is called a systolic array.
performs a dedicated computation, and communicates only with its left and right
neighbors, making it easy to lay out physically on a silicon wafer. For sequence
comparisons, processor / computes the entries for row / and contains three
respectively, and the characters of B flow through their U registers. That is, L(i)_k
= S(i, k - i - 1), V(i)_k = S(i, k - i), and U(i)_k = b_{k-i}, where X(i)_k denotes the value
of register X at the end of the kth step. It follows from the basic recurrence for
S-values that the following recurrences correctly express the values of the
registers at the end of step k + 1 in terms of their values at the end of step k:
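$$
\begin{aligned}
L(i)_{k+1} &= V(i)_k,\\
V(i)_{k+1} &= \max\{\, L(i-1)_k + \delta(a_i, U(i-1)_k),\;
                      V(i-1)_k + \delta(a_i, -),\;
                      V(i)_k + \delta(-, U(i-1)_k) \,\},\\
U(i)_{k+1} &= U(i-1)_k.
\end{aligned}
$$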
must pass its register values to processor i and each processor must have just
loaded with the scores for δ(a_i, ?), δ(-, ?), and δ(a_i, -) where ? is any symbol in
the underlying alphabet ψ. The beauty of the systolic array is that it can perform
the target sequences in constant time per symbol. With current technology, chips
entries, and the PIR database [12] of protein sequences contains about 21
million amino acids of data in about 71,000 protein entries. Whenever a new
to search these databases to see if the new sequence shares any similarities with
existing entries. In the event that the new sequence is of unknown function, an
In the case of protein databases, each entry is for a protein between 100
and 1,500 amino acids long, the average length being about 300. The entries in
DNA databases have tended to be for segments of an organism's DNA that are
of interest, such as stretches that code for proteins. These segments vary in
length from 100 to 10,000 nucleotides. The limited length here is not intrinsic to
the object as in the case of proteins, but because of limitations in the technology
and the cost of obtaining long DNA sequences. In the early 1980s the longest
sequences.
a protein or DNA fragment and searches every entry in the database for
and the entries in the database are typically of similar sizes. In DNA databases,
the entries are typically much longer than the query sequence, and one is
The problem of searching for protein similarities efficiently has led many
of the problem has become too large) and instead consider designing very fast
produce answers that are "nearly" correct with respect to a formally stated
optimization criterion. One of the most popular database searching tools of this
genre is FASTA. FASTA looks for entries that share a significant number of
short identical subsequences of symbols with the query sequence. Any entry
meeting this criterion is then compared via dynamic programming with the
query sequence. In this way, the vast majority of entries are eliminated from
some matches and also reports some spurious matches. On the other hand,
BLAST, the Basic Local Alignment Search Tool, was born in the first
BLAST programs have been the fruit of much hard work by scores of talented
of FASTA. Given a query A and an entry B, BLAST searches for segment pairs
equal length, and the score of the pair is that of the no-gap alignment between
them. One can argue that the presence of a high-scoring segment pair or pairs is
deletion events tend to significantly change the shape of a protein and hence its
function. Note that segment pairs embody a local similarity concept. What is
particularly useful is that there is a formula for the probability that two
sequences have a segment pair above a certain score. Thus BLAST can give an
assessment of the statistical significance of any match that it reports. For a given
threshold, T, BLAST returns to the user all database entries that have a segment
pair with the query of score greater than T ranked according to probability.
BLASTA may miss some such matches, although in practice it misses very few.
neighborhood of a sequence S is the set of all sequences that align with S with
those sequences of equal length that form a segment pair of score higher than t
under the Dayhoff scoring scheme (see Figure 3.5). This concept suggests a
simple strategy for finding all entries that have segment pairs of length k and
score greater than t with the query: generate the set of all sequences that are in
the t-neighborhood of some k-substring of the query and see if an entry contains
one of these strings. Scanning for an exact match to one of the strings in the
problem, the length of the segment pair is not known in advance, and even more
To circumvent this difficulty, BLASTA uses the fast scanning strategy above to
find short segment pairs of length k above a score t, and then checks each of
these to see if they are a portion of a segment pair of score T or greater. This
approach is heuristic (that is, may miss some segment pairs) because it is
possible for every length k subsegment pair of a segment pair of score T to have
score less than t. Nonetheless, with k = 4 and t = 17 such misses are very rare,
and BLASTA takes about 3 seconds for every 1 million characters of data
searched.
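The neighborhood idea can be illustrated with a deliberately naive sketch. It enumerates every word of length k by brute force, which is practical only for tiny alphabets and small k, and it makes no claim about how the actual program generates or scans for these words. The names `neighborhood`, `seed_hits`, and `toy_score`, the DNA alphabet, and the scoring values are all assumptions; the text uses the Dayhoff protein scores instead.

```python
from itertools import product


def toy_score(a, b):
    """Hypothetical symbol score: +2 for a match, -1 for a mismatch."""
    return 2 if a == b else -1


def neighborhood(word, alphabet, t, score=toy_score):
    """All words of the same length whose gap-free score against word exceeds t."""
    return {
        "".join(cand)
        for cand in product(alphabet, repeat=len(word))
        if sum(score(a, b) for a, b in zip(word, cand)) > t
    }


def seed_hits(query, entry, k, alphabet, t, score=toy_score):
    """Positions in entry where a k-word lies in the t-neighborhood of some
    k-substring of the query."""
    words = set()
    for p in range(len(query) - k + 1):
        words |= neighborhood(query[p:p + k], alphabet, t, score)
    return [q for q in range(len(entry) - k + 1) if entry[q:q + k] in words]


# ACG itself scores 6 and each one-mismatch variant scores 3, so with t = 2
# the neighborhood holds 1 + 9 = 10 words.
print(len(neighborhood("ACG", "ACGT", t=2)))
print(seed_hits("ACG", "TTACTTT", k=3, alphabet="ACGT", t=2))
```

Each reported position is only a candidate seed; it would still have to be checked for membership in a longer segment pair of score T or greater.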
SPECint workstation and a search against a typical protein query. The dynamic
programming algorithm for local similarities presented above (also known as the
database with a total of N characters in it. On the other hand, FASTA takes
20.0N microseconds, and BLASTA only about 2.0N microseconds. At the other
end of the spectrum, the systolic array chip described above takes only 0.3N
MHSKVTIICIRFLFWFLLLCMLIGKSHTEDDIIIATKNGKVRGMNLTVFG
GTVTAFLGIPYAQPPLGRLRFKKPQSLTKWSDIWNATKYANSCCQNIDQ
SFPGFHGSEMWNPNTDLSEDCLYLNVWIPAPKPKNATVLIWIYGGGFQT
GTSSLHVYDGKFLARVERVIVVSMNYRVGALGFLALPGNPEAPGNMGL
FDQQLALQWVQKNIAAFGGNPKSVTLFGESAGAASVSLHLLSPGSHSLF
TRAILQSGSFNAPWAVTSLYEARNRTLNLAKLTGCSRENETEIIKCLRNK
DPQEILLNEAFVVPYGTPLSVNFGPTVDGDFLTDMPDILLELGQFKKTQI
LVGVNKDEGTAFLVYGAPGFSKDNNSIITRKEFQEGLKIFFPGVSEFGKE
SILFHYTDWVDDQRPENYREALGDVVGDYNFICPALEFTKKFSEWGNN
AFFYYFEHRSSKLPWPEWMGVMHGYEIEFVFGLPLERRDNYTKAEEILS
RSIVKRWANFAKYGNPNETQNNSTSWPVFKSTEQKYLTLNTESTRIMTK
LRAQQCRFWTSFFPKVLEMTGNIDEAEWEWKAGFHRWNNYMMDWKN
QFNDYTSKKESCVGL
sapiens (Human).
MGLLLPLALCILVLCCGKLSPPQLALNPSALLSRGCNDSDVLAVAGFAL
RDINKDRKDGYVLRLNRVNDAQEYRRGGLGSLFYLTLDVLETDCHVLR
KKAWQDCGMRIFFESVYGQCKAIFYMNNPSRVLYLAAYNCTLRPVSKK
SSQWVVGPSYFVEYLIKESPCTKSQASSCSLQSSDSVPVGLCKGSLTRTH
WEKFVSVTCDFFESQAPATGSENSAVNQKPTNLPKVEESQQKNTPPTDS
PSKAGPRGSVQYLPDLDDKNSQEKGPQEAFPVHLDLTTNPQGETLDISFL
FLEPMEEKLVVLPFPKEKARTAECPGPAQNASPLVLPP
sapiens (Human).
MKVLWAALLVTFLAGCQAKVEQAVETEPEPELRQQTEWQSGQRWELA
LGRFWDYLRWVQTLSEQVQEELLSSQVTQELRALMDETMKELKAYKS
ELEEQLTPVAEETRARLSKELQAAQARLGADMEDVCGRLVQYRGEVQ
AMLGQSTEELRVRLASHLRKLRKRLLRDADDLQKRLAVYQAGAREGA
ERGLSAIRERLGPLVEQGRVRAATVGSLAGQPLQERAQAWGERLRARM
EEMGSRTRDRLDEVKEQVAEVRAKLEEQAQQIRLQAEAFQARLKSWFE
PLVEDMQRQWAGLVEKVQAAVGTSAAPVPSDNH
• After bringing the two proteins into FASTA format, their sequences are
RAD_HUMAN MTLNGGGSGAGG 12
Q7ZVA2_BRARE MTLN 4
Q6LCT9_HUMAN
IAPP_HUMAN
CHLE_HUMAN MHSKVTIICIRFLFWFLLLCMLIGKSHTEDDIIIATKNGKVRGMNLT 47
SOX13_HUMAN
MVNCTIKSEEKKEPCHEAPQGSATAAEPQPGDPARASQDSADPQAPAQGN50
Q53Y25_HUMAN MLDDRARMEAAKKEKVEQILAEFQ 24
RAD_HUMAN
SRGGGQERERRRGSTPWGPAPPLHRRSMPVDERDLQAALTPGALTAAAAG62
Q7ZVA2_BRARE TQKEGKEPLRRRASTP1PSSRQAGRGDRDPSTDPYHPPLAQSAS--YHPG 52
Q6LCT9_HUMAN MSG 3
IAPP_HUMAN
CHLE_HUMAN VFGGTVTAFLGIPYAQPPLGRLRFKKPQSLTKWSDIWNATKYANSCCQNI 97
FRG5WDCSSPEGNGSPEPKRPGVSEAASGSQEKLDFNRNLKEVVPAIEKL 100
Q53Y25_HUMAN
LQEEDLKKVMRRMQKEMDRGLRLETHEEASVKMLPTYVRSTPEGSEVGDF74
RAD_HUMAN
TGTQGPRLDWPEDSEDSLSSGGSDSDESVYKVLLLGAPGVGKSALARIFG 112
Q7ZVA2_BRARE DKSIHSRANWSSDSES--DSSGS---ECLYRVVLLGDHGVGKSSLANIFA 97
IAPP_HUMAN MGILK 5
CHLE_HUMAN
SOX13_HUMAN
LSSDWKERFLGRNSMEAKDVKGTQESLAEKELQLLVMIHQLSTLRDQLLT 150
Q53Y25_HUMAN
RAD_HUMAN
Q7ZVA2_BRARE
GIQEKDAHKHIGEDAYERTLMVDGEDTTLVVMDPWETDKQEDDEKFLQDY 147
CHLE_HUMAN
SOX13_HUMAN
AHSEQKNMAAMLFEKQQQQMELARQQQEQIAKQQQQLIQQQHKINLLQQQ200
Q53Y25_HUMAN
YISECISDFLDKHQMKHKKLPLGFTFSFPVRHEDIDKGILLNWTKGFKAS174
SOX13_HUMAN
PPAPVVKRPGAMATHHPLQEPSQPLNLTAICPKAPELPNTSSSPSLKMSSC300
Q53Y25_HUMAN
DHQCEVGMIVGTGCNACYMEEMQNVELVEGDEGRMCVNTEWG 258
Q6LCT9_HUMAN
IAPP_HUMAN LNYLPL 89
CHLE_HUMAN
IIKCLRNKDPQEILLNEAFVVPYGTPLSVNFGPTVDGDFLTDMPDILLEL337
SOX13_HUMAN
VPRPPSHGGPTRDLQSSPPSLPLGFLGEGDAVTKA1QDARQLLHSHSGAL350
IAPP_HUMAN
SOX13_HUMAN
DGSPNTPFRKDLISLDSSPAKERLEDGCVHPLEEAMLSCDMDGSRHFPES400
Q6LCT9_HUMAN
IAPP_HUMAN
CHLE_HUMAN DNNSIITRKEFQEGLKIFFPGVSEFGKESILFH 400
SOX13_HUMAN
RNSSHIKRPMNAFMVWAKDERRKILQAFPDMHNSSISKILGSRWKSMTNQ450
Q53Y25_HUMAN ELVRLVLLRLVDENLLFHGEASEQLRTRGAFET332
Q6LCT9_HUMAN
IAPP_HUMAN
CHLE_HUMAN
YTDWVDDQRPENYREALGDVVGDYNFICPALEFTKKFSEWGNNAFFYYFE450
SOX13_HUMAN
EKQPYYEEQARLSRQHLEKYPDYKYKPRPKRTCIVEGKRLRVGEYKALMR500
Q53Y25_HUMAN
RFVSQVESDTGDRKQIYNILSTLGLRPSTTDCDIVRRACESVSTRAAHMC382
RAD_HUMAN
Q7ZVA2_BRARE
Q6LCT9_HUMAN
CHLE_HUMAN
HRSSKLPWPEWMGVMHGYEIEFVFGLPLERRDNYTKAEEILSRSIVKRWA500
SOX13_HUMAN
TRRQDARQSYVIPPQAGQVQMSSSDVLYPRAAGMPLAQPLVEHYVPRSLD550
Q53Y25_HUMAN
SAGLAGVINRMRESRSEDVMRITVGVDGSVYKLHPSFKERFHASVRRLTP432
RAD_HUMAN
Q7ZVA2_BRARE
Q6LCT9_HUMAN
IAPP_HUMAN
CHLE_HUMAN
NFAKYGNPNETQNNSTSWPVFKSTEQKYLTLNTESTRIMTKLRAQQCRFW550
SOX13_HUMAN
PNMPVIVNTCSLREEGEGTDDRHSVADGEMYRYSEDEDSEGEEKSDGSWW 600
RAD_HUMAN
Q7ZVA2_BRARE
Q6LCT9_HUMAN
IAPP_HUMAN
CHLE_HUMAN
TSFFPKVLEMTGNIDEAEWEWKAGFHRWNNYMMDWKNQFNDYTSKKESCV600
SOX13_HUMAN
CSQTDPRLGGPGPFSSGEDLVPTRWAQPANLRLCWYLDLFVPQKMGKAVH650
Q53Y25_HUMAN
RAD_HUMAN
Q7ZVA2_BRARE
IAPP_HUMAN
CHLE_HUMAN GL 602
SOX13_HUMAN
LADTFMRGEAPSLPEERVGLGGQELQYGHGLSRLSTSAPRAYGQGTLYDS700
Q53Y25_HUMAN
RAD_HUMAN
Q7ZVA2_BRARE
Q6LCT9_HUMAN
IAPP_HUMAN
CHLE_HUMAN
SOX13_HUMAN
PLLQVSIHLGYG1YRPVSLGSHALFPFLSWLDQPLWDQHPSHTPPDCSSI750
Q53Y25_HUMAN
RAD_HUMAN
Q7ZVA2_BRARE
Q6LCT9_HUMAN
IAPP_HUMAN
CHLE_HUMAN
SOX13_HUMAN
Q53Y25_HUMAN
RAD_HUMAN
Q7ZVA2_BRARE
Q6LCT9_HUMAN
IAPP_HUMAN
CHLE_HUMAN
SAHYVPGTVAEFLWVCLSMPLLLLWGPLSVLLFVPKLLPLCQSGCLRFCV850
Q53Y25_HUMAN
RAD_HUMAN
Q7ZVA2_BRARE
Q6LCT9_HUMAN
IAPP_HUMAN
CHLE_HUMAN
Q53Y25_HUMAN
growing exponentially. On the other hand, the size of the query sequences is
principally concerned with how the time to perform such a search grows as a
function of N. Yet all currently used methods take an amount of time that grows
linearly in N; that is, they are O(N) algorithms. This includes not only rigorous
also the popular heuristics FASTA and BLASTA. Even the systolic array chips
factor of 1,000, all of these O(N) methods take 1,000 times longer to search that
database. Using the timing estimates given above, it follows that while a custom
chip may take about 3 seconds to search 10 million amino acids or nucleotides,
parallelism, but such machinery is beyond the budget of most investigators, and
with computing time sublinear in N, that is, O(N^a) for some a < 1. For
example, suppose there is an algorithm that takes O(N^0.5) time, which is to say
that as N grows, the time taken grows as the square root of N. For example, if
minutes. Note that while an O(N^0.5) algorithm may be slower than an O(N)
illustrates this "crossover": in this figure, the size of N at which the O(N^0.5)
O(N^0.2) algorithm that takes, say, 15 seconds on 10 million symbols, will take
point, we chose to let our examples be slower at N = 10 million than the
does exist that is actually already much faster on databases of size 10 million.
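The scaling argument above can be checked numerically. The following Python sketch is not taken from the source. The function names are hypothetical, the 3-second linear baseline is the custom-chip estimate quoted earlier for 10 million symbols, and the 10-second figure for the O(N^0.5) method at the same size is an assumed value chosen only to illustrate a method that starts out slower.

```python
REF_SIZE = 10_000_000  # reference database size: 10 million symbols


def linear_time(n, seconds_at_ref, n_ref=REF_SIZE):
    """Running time of an O(N) method, scaled from a measurement at n_ref."""
    return seconds_at_ref * (n / n_ref)


def power_time(n, seconds_at_ref, exponent, n_ref=REF_SIZE):
    """Running time of an O(N^exponent) method, scaled from n_ref."""
    return seconds_at_ref * (n / n_ref) ** exponent


def crossover_size(t_linear, t_sublinear, exponent, n_ref=REF_SIZE):
    """Database size at which the sublinear method matches the linear one.

    Solves t_linear * r = t_sublinear * r**exponent for r = n / n_ref.
    """
    if not 0 <= exponent < 1:
        raise ValueError("exponent must satisfy 0 <= exponent < 1")
    ratio = (t_sublinear / t_linear) ** (1.0 / (1.0 - exponent))
    return n_ref * ratio


if __name__ == "__main__":
    t_chip = 3.0   # seconds at 10 million symbols (estimate quoted in the text)
    t_sqrt = 10.0  # seconds at 10 million symbols (assumed for illustration)
    for n in (10**7, 10**8, 10**9, 10**10):
        print(n, linear_time(n, t_chip), power_time(n, t_sqrt, 0.5))
    print("crossover near", crossover_size(t_chip, t_sqrt, 0.5), "symbols")
```

Under these assumed constants the square-root method starts out slower but overtakes the linear one once the database grows by roughly an order of magnitude. The exact crossover depends entirely on the two constants chosen.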
The other important thing to note is that we are not considering heuristic
algorithms here. What we desire is nothing less than algorithms that accomplish
algorithms for the general problem. For relatively stringent matches, this new
case of the more biologically relevant computation that involves more general
scoring schemes such as the ones in Figures 3.4 and 3.5, and a sublinear
length of the query and is called the mismatch ratio. Searching for such an
O(DN^pow(ε) log N) expected time with the new algorithm. The exponent is an
pow(ε) < 1. For example, pow(ε) is less than 1 when ε < 33 percent for |ψ| = 4
(DNA alphabet) and when ε < 56 percent for |ψ| = 20 (protein alphabet). More
specifically, pow(ε) < 0.22 + 2.3ε when |ψ| = 4 and pow(ε) < 0.17 + 1.4ε when
|ψ| = 20. So, for DNA, the algorithm takes a maximum of O(N^0.5) time when ε is
The logic used to prove these bounds is coarse, and, in practice, the performance
of these methods is much better than the bounds indicate. If these results can be
extended to handle the more general problem of arbitrary scoring tables, the
namely Altschul and Erickson, have looked at nonadditive scoring schemes that
fundamental change in the optimization criterion for alignment creates a new set
of algorithmic problems.
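To make the exponent bounds quoted earlier in this section concrete, the sketch below evaluates the two linear upper bounds on the exponent (0.22 + 2.3 times the mismatch ratio for an alphabet of size 4, and 0.17 + 1.4 times the mismatch ratio for size 20) and finds the mismatch ratio at which each bound reaches 1. It is an illustration only. The function names are hypothetical, and because the bounds are described as coarse, the thresholds they imply need not match the 33 and 56 percent figures given in the text exactly.

```python
# Coefficients (intercept, slope) of the linear upper bounds on the exponent,
# as quoted in the text for DNA (alphabet size 4) and protein (size 20).
BOUND_COEFFICIENTS = {4: (0.22, 2.3), 20: (0.17, 1.4)}


def exponent_upper_bound(mismatch_ratio, alphabet_size):
    """Linear upper bound on the exponent for a given mismatch ratio."""
    if alphabet_size not in BOUND_COEFFICIENTS:
        raise ValueError("bounds are quoted only for alphabet sizes 4 and 20")
    intercept, slope = BOUND_COEFFICIENTS[alphabet_size]
    return intercept + slope * mismatch_ratio


def sublinear_threshold(alphabet_size):
    """Mismatch ratio at which the linear bound on the exponent reaches 1."""
    intercept, slope = BOUND_COEFFICIENTS[alphabet_size]
    return (1.0 - intercept) / slope


if __name__ == "__main__":
    for size, label in ((4, "DNA"), (20, "protein")):
        print(label, "bound reaches 1 at ratio", sublinear_threshold(size))
        for ratio in (0.05, 0.10, 0.20):
            print("  ratio", ratio, "exponent <", exponent_upper_bound(ratio, size))
```

Smaller mismatch ratios give smaller exponents, which is why the text singles out relatively stringent matches as the case where the sublinear behaviour is most pronounced.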
lower bounds placed on algorithms for comparing two sequences of length N is
O(N log N), yet the fastest algorithm takes O(N^2 / log^2 N) time (Masek and
Paterson, 1980). Can this gap be narrowed, either from above (finding faster
Andhra University, Visakhapatnam
algorithms) or below (finding lower bounds that are higher)? Can we perform
suggested by the results given above for the approximate string matching
developed for finding approximate repeats, for example, finding a pattern that
matches some string X and then 5 to 10 symbols to the right matches the same
separated by a given range of spacing. More intricate patterns for protein motifs
and secondary structure are suggested by the systems QUEST [4, 79, 104], all of
which pose problems that could use algorithmic refinement. Finally, biologists
compare objects other than sequences. For example, the partial sequence
placed a large number of beads of, say, eight colors, at various positions along
the string. Given two such maps, are they similar? This problem has been
what the measure of similarity should be and how to design efficient algorithms
for each. There has also been work on comparing phylogenetic trees and
chromosome staining patterns [176]. Indubitably the list will continue to grow.