10 - Chapter 3

Uploaded by

Saif Hasan
Copyright
© © All Rights Reserved
3 Use of Algorithms to Detect Similarities between Bio Sequences
3.1 Introduction

The sequence of amino acids in a protein determines its three-dimensional

shape, which in turn confers its function. Segments of the protein that are

critical to its function resist evolutionary pressures because mutations of such

segments are often lethal to the organism. These critical "active sites" tend to be

conserved over time and so can be found in many organisms and proteins that

have similar function.

Since the advent of deoxyribonucleic acid (DNA) sequencing technologies

in the late 1970s, the amount of data about the protein and DNA sequence of

humans and other organisms has been growing at an exponential rate. Databases

of these sequences will contain a wealth of information about the nature of life

at the molecular level, if we can decipher their meaning.

Proteins and DNA sequences are polymers consisting of a chain of

monomers with a common backbone substructure that links them together. In

the case of DNA, there are 4 types of monomers, the nucleotides, each having a

different side chain. For proteins, there are 20 types of monomers, the amino

acids. With just a few exceptions, the sequence of monomers, that is, the

primary structure, of a given protein or DNA strand completely determines the

three-dimensional shape of the biopolymer. Because the function of a molecule

is determined by the position of its atoms in space, this almost perfect

correlation between sequence and structure implies that to know the function of

a biopolymer, it in principle suffices to know its primary sequence.

Andhra University, Visakhapatnam


The primary sequence of a DNA segment is denoted by a string consisting

of the four letters A, C, G, and T. Analogously, the primary sequence of a

protein is denoted by a string consisting of 20 letters of the alphabet, one for

each type of amino acid. In principle, these strings of symbols encode

everything one needs to know about the DNA strand or protein in question. If

the primary sequences of two proteins are similar, then it is reasonable to

conjecture that they perform the same function. Because DNA's principal role is

one of encoding information (including all of an organism's proteins), the

similarity of two segments of DNA suggests that they code similar things.

Mutation in a DNA sequence is a natural evolutionary process. Errors in

the replication of DNA can cause a change in the nucleotide at a given position.

Less often, a nucleotide is deleted or inserted. If the mutation occurs in a region

of DNA that codes for protein, these changes cause related changes in the

primary sequence and, hence, the shape and activity of the protein. The impact

of a particular mutation depends on the degree to which the original and new

amino acid sequences differ in their physical and chemical properties. Mutations

that result in proteins that are so altered that they function improperly or not at

all tend to be lethal to the organism. Nature is biased against mutations in those

critical regions central to a protein's function and is more lenient toward changes

in other regions.

Similarity of DNA sequences is a clue to common evolutionary origin. If

two proteins in two organisms evolved from a common precursor, one will

generally find highly similar segments, reflecting strongly conserved critical

regions. If the proteins are very recent derivatives, one might expect to see

similarity over the entire length of the sequences. While proteins can be similar

because of evolution from a common precursor, similarity of protein sequences

can also be a clue to common function, independent of evolutionary

considerations. It appears that nature not only conserves the critical parts of a

protein's conformation and function, but also reuses such motifs as modular

units in fashioning the spectrum of known proteins. One finds strong similarities

between segments of proteins that have similar functions. A strong similarity

between the v-sis oncogene and a growth-stimulating hormone was the key to

discovering that the v-sis oncogene causes cancer by deregulating cell

growth. In that case, the similarity involved the entirety of the sequence. In other

cases, functionally related proteins are similar only in segments corresponding

to active sites or other functionally critical stretches.

3.1.1 Finding Global Similarities

To illustrate the underlying techniques of sequence comparison, we begin

with a simple, core problem of finding the best alignment between the entirety

of two sequences. Such an alignment is called a global alignment because it

aligns the entire sequences, as opposed to a local alignment, which aligns

portions of the sequences.

As an example, consider finding the best global alignment of A =
ATTACG and B = ATATCG under the following scoring scheme. A letter
aligned with the same letter has a score of 1. A letter aligned with any different
letter or a gap has a score of 0. The total score is the sum of the scores for the
alignment. A matrix depicting this "unit-cost" scoring scheme is shown in
Figure 3.1. Under this unit-cost scheme, the score of an alignment is equal to the
number of identical aligned characters. The obvious alignment

    ATTACG
    ATATCG

has a score of 4. However, because gaps are allowed, a higher score can be achieved,
namely, 5, which can be shown to be the highest score possible. An optimal
alignment, that is, an alignment that achieves this highest score by aligning five
symbols, is

    ATTA-CG
    A-TATCG

In some cases, there is only one, unique optimal alignment, but in general there
can be many. For example,

    AT-TACG
    ATAT-CG

also has a score of 5.

The unit-cost scoring scheme of Figure 3.1 is not the only possible

scheme. Later in this chapter, we will see a much more complex scoring scheme

used in the comparison of proteins (20-letter alphabet). In that scheme and other

scoring schemes, the scores in the table are real numbers assigned on the basis

of various interpretations of empirical evidence. Let us introduce here a formal

framework to assist our thinking.

δ   -   A   C   G   T
-   0   0   0   0   0
A   0   1   0   0   0
C   0   0   1   0   0
G   0   0   0   1   0
T   0   0   0   0   1

Figure 3.1: Unit Cost Scoring Scheme

Consider comparing sequence A = a_1 a_2 ... a_M and sequence B = b_1 b_2 ... b_N,
whose symbols range over some alphabet Ψ, for example, Ψ = {A, C, G, T} for
DNA sequences. Let δ(a, b) be the score for aligning a with b, let δ(a, -) be the
score of leaving symbol a unaligned in sequence A, and let δ(-, b) be the score
of leaving b unaligned in B. Here a and b range over the symbols in Ψ and the
gap symbol "-". The score of an alignment is simply the sum of the scores δ
assigns to each pair of aligned symbols, for example, the score of

    ATTA-CG
    A-TATCG

is δ(A,A) + δ(T,-) + δ(T,T) + δ(A,A) + δ(-,T) + δ(C,C) + δ(G,G), which for
the scoring scheme of Figure 3.1 equals 5. An optimal alignment under a given
scoring scheme is an alignment that yields the highest sum.
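
The definition of an alignment score can be checked with a few lines of Python. This is only a sketch: the names `unit_delta` and `alignment_score`, and the choice to write an alignment as two equal-length gapped strings, are assumptions made for illustration and do not come from the text.

```python
GAP = "-"

def unit_delta(a, b):
    """Unit-cost scheme of Figure 3.1: 1 for two identical letters, else 0."""
    return 1 if a == b and a != GAP else 0

def alignment_score(row_a, row_b, delta=unit_delta):
    """Sum delta over the columns of an alignment given as two gapped rows."""
    if len(row_a) != len(row_b):
        raise ValueError("both rows of an alignment must have the same length")
    return sum(delta(a, b) for a, b in zip(row_a, row_b))

ungapped = alignment_score("ATTACG", "ATATCG")    # the obvious alignment
gapped = alignment_score("ATTA-CG", "A-TATCG")    # one gap in each row
print(ungapped, gapped)
```

Passing a different `delta` function is enough to score the same alignment under another scheme.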

3.1.1.1 Visualizing Alignments: Edit Graphs

Many investigators have found it illuminating to convert the problem of

finding similarities into one of finding certain paths in an edit graph.

Proceeding formally, the edit graph G_{A,B} for comparing sequences A and
B is an edge-labeled directed graph, as illustrated in Figure 3.2 for the example
mentioned above. The vertices of the graph are arranged in an (M + 1) by (N + 1)
rectangular grid or matrix, so that (i, j) designates the vertex in column i and
row j (where the numbering starts at 0). The following edges, and only these
edges, are in G_{A,B}:

1. If i ∈ [1, M] and j ∈ [0, N], then there is an A-gap edge
   (i - 1, j) → (i, j) labeled [a_i / -] whose score is δ(a_i, -).
2. If i ∈ [0, M] and j ∈ [1, N], then there is a B-gap edge
   (i, j - 1) → (i, j) labeled [- / b_j] whose score is δ(-, b_j).
3. If i ∈ [1, M] and j ∈ [1, N], then there is an alignment edge
   (i - 1, j - 1) → (i, j) labeled [a_i / b_j] whose score is δ(a_i, b_j).
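
The three edge rules can be enumerated directly. The generator below is a minimal sketch; the function name `edit_graph_edges`, the tuple layout, and the `unit_delta` helper are illustrative assumptions rather than anything defined in the text.

```python
def edit_graph_edges(A, B, delta):
    """Yield (tail, head, label, score) for every edge of the edit graph of A and B."""
    M, N = len(A), len(B)
    for i in range(M + 1):
        for j in range(N + 1):
            if i >= 1:
                # Rule 1, A-gap edge: a_i is left unaligned.
                yield (i - 1, j), (i, j), (A[i - 1], "-"), delta(A[i - 1], "-")
            if j >= 1:
                # Rule 2, B-gap edge: b_j is left unaligned.
                yield (i, j - 1), (i, j), ("-", B[j - 1]), delta("-", B[j - 1])
            if i >= 1 and j >= 1:
                # Rule 3, alignment edge: a_i is aligned with b_j.
                yield (i - 1, j - 1), (i, j), (A[i - 1], B[j - 1]), delta(A[i - 1], B[j - 1])

def unit_delta(a, b):
    """Unit-cost scheme of Figure 3.1."""
    return 1 if a == b and a != "-" else 0

edges = list(edit_graph_edges("ATTACG", "ATATCG", unit_delta))
print(len(edges))
```

For two sequences of length 6 this gives 42 A-gap edges, 42 B-gap edges, and 36 alignment edges.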

The edit graph has the property that paths and alignments between
segments of A and B are in isomorphic correspondence. That is, any path from
vertex (g, h) to vertex (i, j) for g ≤ i and h ≤ j models an alignment between the
substrings a_{g+1} a_{g+2} ... a_i and b_{h+1} b_{h+2} ... b_j, and vice versa. The alignment modeled
by a path is the sequence of aligned pairs given by labels on its edges. For
example, in Figure 3.2 the two highlighted paths, both from vertex (0,0) to (6,6),
correspond to two optimal global alignments: ATTA-CG over A-TATCG, and
AT-TACG over ATAT-CG.

3.1.1.2 The Basic Dynamic Programming Algorithm

We now turn to devising an algorithm, or computational procedure, for
finding the score of an optimal global alignment between sequences A and B.
We focus on computing just the score for the moment, and return to the goal of
delivering an alignment achieving that score at the end of this subsection. First,
observe that, in terms of the edit graph formulation, we seek the score of a
maximal-score path from the source vertex Θ at the upper left-hand corner of the graph
G_{A,B} to the sink vertex Φ at the lower right-hand corner.

Consider computing S(i, j), the score of a maximal-score path from Θ to
some given vertex (i, j) in the graph. Because there are only three edges directed
into vertex (i, j), it follows that any optimal path P to (i, j) must fit one of the
following three cases: (1) P is an optimal path to (i - 1, j) followed by the A-gap
edge into (i, j); (2) P is an optimal path to (i, j - 1) followed by the B-gap edge
into (i, j); or (3) P is an optimal path to (i - 1, j - 1) followed by the alignment
edge into (i, j). It is critical to note that the subpath preceding the last edge of P
must also be optimal, for, if it is not, then it is easy to show that P cannot be
optimal, a contradiction.

Figure 3.2: The edit graph G_{A,B} for A = ATTACG and B = ATATCG.

This observation immediately leads to the fundamental recurrence:

    S(i, j) = max { S(i - 1, j - 1) + δ(a_i, b_j),
                    S(i - 1, j) + δ(a_i, -),
                    S(i, j - 1) + δ(-, b_j) },

which states that the maximal score of a path to (i, j) is the largest of (1) the
maximal score of a path to (i - 1, j) plus the score of the A-gap edge to (i, j), (2)
the maximal score of a path to (i, j - 1) plus the score of the B-gap edge to (i, j),
or (3) the maximal score of a path to (i - 1, j - 1) plus the score of the alignment
edge to (i, j).

All that is needed to have an effective computational procedure based on
this recurrence is to determine an order in which to compute S-values. There are
many possible orders. Three simple alternatives are (1) column by column from
left to right, top to bottom in each column, (2) row by row from top to bottom,
left to right in each row, and (3) antidiagonal by antidiagonal from the upper left
to the lower right, in any order within an antidiagonal (antidiagonal k consists of
the vertices (i, j) such that i + j = k). Using the first sample ordering leads to
the algorithm of Figure 3.3. In this algorithm, M denotes the length of A and N
denotes the length of B.

The algorithm of Figure 3.3 computes S(i, j) for every vertex (i, j) in an
(M + 1) × (N + 1) matrix in the indicated order of i and j. Along the left and upper
boundaries of the edit graph (that is, vertices with i = 0 or j = 0, respectively),
the algorithm utilizes the recurrence, except that terms referencing nonexistent
vertices are omitted (that is, in lines 3 and 5, respectively). The algorithm of
Figure 3.3 takes O(MN) time; that is, when M and N are sufficiently large, the
time taken by the algorithm does not grow faster than the quantity MN. If one
stores the whole (M + 1) × (N + 1) matrix S, then the algorithm also requires
O(MN) space.

The algorithm of Figure 3.3 is a dynamic programming algorithm that

utilizes the fundamental recurrence. Dynamic programming is a general

computational paradigm of wide applicability [64].

0. var S: array [0..M, 0..N] of real
1. S[0,0] ← 0
2. for j ← 1 to N do
3.     S[0,j] ← S[0,j-1] + δ(-, b_j)
4. for i ← 1 to M do
5. {   S[i,0] ← S[i-1,0] + δ(a_i, -)
6.     for j ← 1 to N do
7.         S[i,j] ← max { S[i-1,j-1] + δ(a_i, b_j),
                          S[i-1,j] + δ(a_i, -),
                          S[i,j-1] + δ(-, b_j) }
8. }
9. write "Maximum score is" S[M,N]

Figure 3.3: The classical dynamic programming algorithm
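
The pseudocode of Figure 3.3 translates almost line for line into Python. The version below is a sketch under stated assumptions: the function name `global_score`, the 0-based string indexing (so a_i is `A[i - 1]`), and the `unit_delta` helper for the scheme of Figure 3.1 are choices made here for illustration.

```python
def unit_delta(a, b):
    """Unit-cost scheme of Figure 3.1: 1 for two identical letters, else 0."""
    return 1 if a == b and a != "-" else 0

def global_score(A, B, delta=unit_delta):
    """Score of an optimal global alignment, filling the full (M+1) x (N+1) matrix."""
    M, N = len(A), len(B)
    S = [[0] * (N + 1) for _ in range(M + 1)]
    for j in range(1, N + 1):                  # boundary i = 0: only B-gap edges exist
        S[0][j] = S[0][j - 1] + delta("-", B[j - 1])
    for i in range(1, M + 1):
        S[i][0] = S[i - 1][0] + delta(A[i - 1], "-")   # boundary j = 0: only A-gap edges
        for j in range(1, N + 1):
            S[i][j] = max(S[i - 1][j - 1] + delta(A[i - 1], B[j - 1]),
                          S[i - 1][j] + delta(A[i - 1], "-"),
                          S[i][j - 1] + delta("-", B[j - 1]))
    return S[M][N]

print("Maximum score is", global_score("ATTACG", "ATATCG"))
```

Both nested loops run over every vertex once, which matches the O(MN) time and space stated in the text.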

A problem can be solved by dynamic programming if the final answer can
be efficiently determined by computing a tableau of optimal answers to
progressively larger and larger subproblems. The principle of optimality
requires that the optimal answer to a given subproblem be expressible in terms
of optimal answers to smaller subproblems. Our basic sequence comparison
problem does yield to this principle: the optimal answer S(i, j) for the problem
of comparing prefix A_i = a_1 a_2 ... a_i and prefix B_j = b_1 b_2 ... b_j can be found by
computing optimal answers for smaller prefixes of A and B. The recurrence
formula describes the relationship of each subproblem to a larger subproblem.

The algorithm of Figure 3.3 computes only the score of a maximum-scoring
global alignment between A and B. One or all of these optimal
alignments can be recovered by tracing the paths backwards from Φ to Θ with
the aid of the now complete matrix S. Specifically, an edge from vertex v_1 to Φ
is on an optimal path if S(v_1) plus the score of its edge equals S(Φ). If v_1 is on
an optimal path, then, in turn, an edge from v_2 to v_1 is on an optimal path if
S(v_2) plus the score of the edge equals S(v_1). In this way, one can follow an
optimal path back to the start vertex Θ. In essence, this traceback procedure
moves backwards from a vertex to the preceding vertex whose term in the three-way
maximum of the recurrence yielded the maximum. The possibility of ties
creates the possibility of more than a single optimal path. Unfortunately, this
traceback technique for identifying one or more optimal paths requires that the
entire matrix S be retained, giving an algorithm that takes O(MN) space as well
as time.
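
The traceback described above can be sketched as follows. This is an illustrative version, not the thesis's own code: the name `global_align` is hypothetical, and ties are broken in a fixed order (alignment edge, then A-gap edge, then B-gap edge), so only one of the possibly many optimal alignments is returned.

```python
def unit_delta(a, b):
    """Unit-cost scheme of Figure 3.1."""
    return 1 if a == b and a != "-" else 0

def global_align(A, B, delta=unit_delta):
    """Return (score, row_a, row_b) for one optimal global alignment."""
    M, N = len(A), len(B)
    S = [[0] * (N + 1) for _ in range(M + 1)]
    for j in range(1, N + 1):
        S[0][j] = S[0][j - 1] + delta("-", B[j - 1])
    for i in range(1, M + 1):
        S[i][0] = S[i - 1][0] + delta(A[i - 1], "-")
        for j in range(1, N + 1):
            S[i][j] = max(S[i - 1][j - 1] + delta(A[i - 1], B[j - 1]),
                          S[i - 1][j] + delta(A[i - 1], "-"),
                          S[i][j - 1] + delta("-", B[j - 1]))
    # Walk back from the sink (M, N) to the source (0, 0), following at each
    # step an incoming edge whose term produced the maximum.
    i, j, row_a, row_b = M, N, [], []
    while i > 0 or j > 0:
        if i > 0 and j > 0 and S[i][j] == S[i - 1][j - 1] + delta(A[i - 1], B[j - 1]):
            row_a.append(A[i - 1]); row_b.append(B[j - 1]); i -= 1; j -= 1
        elif i > 0 and S[i][j] == S[i - 1][j] + delta(A[i - 1], "-"):
            row_a.append(A[i - 1]); row_b.append("-"); i -= 1
        else:
            row_a.append("-"); row_b.append(B[j - 1]); j -= 1
    return S[M][N], "".join(reversed(row_a)), "".join(reversed(row_b))

score, row_a, row_b = global_align("ATTACG", "ATATCG")
print(score)
print(row_a)
print(row_b)
```

Changing the order of the three tests changes which optimal alignment is reported, but not its score.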

A more space-efficient approach to delivering an optimal alignment begins
with the observation that if only the score of an optimal alignment is desired
then only the value of S(M, N) is needed, and so S-values can be discarded once
they have been used in computing the values that depend on them. Observing
that one need only know the previous column in order to compute the next
one, it follows that only two columns need be retained at any instant, and so
only O(N) space is required. Such a score-only algorithm can be used as a
subprocedure in a divide-and-conquer algorithm that determines an optimal
alignment using only O(M + N) space. The divide step consists of finding the
midpoint of an optimal source-to-sink path by running the score-only algorithm
on the first half of B and the reverse of the second half of B. The conquer step
consists of determining the two halves of this path by recursively reapplying the
divide step to the two halves. Myers and Miller [105] have shown this strategy
to apply to most comparison algorithms that have linear-space score-only
algorithms. This refinement is very important, since space, not time, is often the
limiting factor in computing optimal alignments between large sequences. For
example, two sequences of length 100,000 can be compared in several hours of
CPU time, but would require 10 billion units of memory if optimal alignments
were delivered using the simple O(MN) space traceback approach. This is well
beyond the memory capacity of any conventional machine.
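
The score-only idea, keeping just the previous and the current column, can be sketched as below. Only the linear-space score computation is shown; the divide-and-conquer recovery of the alignment itself is not attempted here. The name `global_score_linear_space` and the `unit_delta` helper are assumptions for illustration.

```python
def unit_delta(a, b):
    """Unit-cost scheme of Figure 3.1."""
    return 1 if a == b and a != "-" else 0

def global_score_linear_space(A, B, delta=unit_delta):
    """Optimal global alignment score using two arrays of length N + 1."""
    N = len(B)
    prev = [0] * (N + 1)                       # S(0, j) for all j
    for j in range(1, N + 1):
        prev[j] = prev[j - 1] + delta("-", B[j - 1])
    for a in A:                                # advance one column index i at a time
        cur = [prev[0] + delta(a, "-")] + [0] * N
        for j in range(1, N + 1):
            cur[j] = max(prev[j - 1] + delta(a, B[j - 1]),
                         prev[j] + delta(a, "-"),
                         cur[j - 1] + delta("-", B[j - 1]))
        prev = cur                             # older values are no longer needed
    return prev[N]

print(global_score_linear_space("ATTACG", "ATATCG"))
```

For two sequences of length 100,000 this keeps roughly 200,000 values in memory instead of 10 billion.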

3.1.1.3 Finding Local Similarities

We now turn to the problem of finding local alignments, that is,
subsegments of A and B that align with maximal score. Local alignments can be
visualized as paths in the edit graph, G_{A,B}. Unlike the global alignment problem,
the path may start and end at any vertices, not just at Θ and Φ. Intrinsic to
determining local similarities is the requirement that the scoring scheme δ be
designed with a negative bias. That is, for alignment of unrelated sequences
(under some suitable stochastic model of the sequences) the score of a path must
on the average be negative. If this were not the case, then longer paths would
tend to have higher scores, and one would generally end up reporting a global
alignment between two sequences as the optimal local alignment. For example,
the simple scoring scheme of Figure 3.1 is not negatively biased, whereas the
scheme of Figure 3.4 is. Note that under this new scheme, the alignment

    ATTACG
    ATATCG

is now optimal with score 3.34, whereas

    ATTA-CG
    A-TATCG

now has lesser score 3. In this case, the optimal alignment happened to be global,
but for longer sequences this is generally not the case. For example, the best
local alignment between GAGGTTGCTGAGAA and ACTCTTCTTCCTTA has score 4.34
and pairs only a substring of each sequence.

δ      -      A      C      G      T
-     -1     -1     -1     -1     -1
A     -1      1   -.33   -.33   -.33
C     -1   -.33      1   -.33   -.33
G     -1   -.33   -.33      1   -.33
T     -1   -.33   -.33   -.33      1

Figure 3.4: A local-alignments scoring scheme

The design of scoring schemes that properly weigh alignments to expose
biologically meaningful local similarities is the subject of much investigation.
The score of alignments between protein sequences is the sum of scores
assigned to individual pairs of aligned symbols, just as for DNA. However,
since combinations of 20 letters and the gap symbol represent proteins, the table
of scores is now 21 x 21. These scores may be chosen by users to fit the notion
of similarity they have in mind for the comparison. For example, Dayhoff [36]
compiled statistics on the frequency with which one amino acid would mutate
into another over a fixed period of time and from these built a table of aligned
symbol scores consisting of the logarithm of the normalized frequencies. Under
Dayhoff's scoring scheme, the score of an alignment is a coarse estimate of the
likelihood that one segment has mutated into the other. Figure 3.5 is a scaled
integer approximation of Dayhoff's matrix that is much used in practice today.

The basic issue in local alignment, just as in the case of global alignment,
is to find a path of maximal score. However, there are more degrees of freedom
in the local alignment problem: where the paths begin and where they end is not
given a priori but is part of the problem. Note that if we knew the vertex (g, h) at
which the best path began, we could find its score and end-vertex by setting
S(g, h) to 0 and then applying the fundamental recurrence to all vertices (i, j) for
which i ≥ g and j ≥ h. We can capture all potential start vertices simultaneously
by modifying the central recurrence so that 0 is a term in the computation of the
maximum; that is,

    S(i, j) = max { 0,
                    S(i - 1, j - 1) + δ(a_i, b_j),
                    S(i - 1, j) + δ(a_i, -),
                    S(i, j - 1) + δ(-, b_j) }.

Indeed, with this simple modification, S(i, j) is now the score of the
highest-scoring path to (i, j) that begins at some vertex (g, h) for which g ≤ i and
h ≤ j. The best score of a path in the edit graph is then the maximum over all
vertices in the graph of their S-values. A vertex achieving this maximum is the
end of an optimal path. This basic result is often referred to as the
Smith-Waterman algorithm after its inventors [146]. The beginning of the path, the
segments it aligns, and the alignment between these segments can all be
delivered in linear space by further extensions of the treatment given above for
global alignments. If one uses such a comparison algorithm with the scoring
scheme of Figure 3.5, one sees the three regions of similarity shown in Figure
3.6 between the sequence of the monkey somatotropin protein and the
somatotropin precursor protein of a rainbow trout. Note that while in many
cases the aligned symbols are identical, they do not have to be.

Figure 3.5: A protein local-alignment scoring scheme (a 21 x 21 table of integer scores whose rows and columns are indexed by the gap symbol and the 20 amino acid letters A R N D C Q E G H I L K M F P S T W Y V)
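
The local-alignment recurrence, with 0 as an extra term in the maximum, can be sketched in Python as follows. The sketch uses the DNA scheme of Figure 3.4 (match 1, mismatch -0.33, gap -1) rather than the protein table, and it assumes that gap scores are never positive, so the boundary values S(i, 0) and S(0, j) are simply 0. The names `fig34_delta` and `local_score` are illustrative, and only the score and end vertex are returned, not the alignment.

```python
def fig34_delta(a, b):
    """Scheme of Figure 3.4: match 1, mismatch -0.33, any gap -1."""
    if a == "-" or b == "-":
        return -1.0
    return 1.0 if a == b else -0.33

def local_score(A, B, delta=fig34_delta):
    """Return (best score, end vertex) of a highest-scoring local alignment."""
    M, N = len(A), len(B)
    S = [[0.0] * (N + 1) for _ in range(M + 1)]   # boundaries stay 0
    best, end = 0.0, (0, 0)
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            S[i][j] = max(0.0,                     # a path may start afresh at (i, j)
                          S[i - 1][j - 1] + delta(A[i - 1], B[j - 1]),
                          S[i - 1][j] + delta(A[i - 1], "-"),
                          S[i][j - 1] + delta("-", B[j - 1]))
            if S[i][j] > best:                     # track the maximum over all vertices
                best, end = S[i][j], (i, j)
    return best, end

best, end = local_score("CCCCAGACCCC", "TTTTAGATTTT")
print(best, end)
```

In this small example the two sequences share only the substring AGA, so the best path covers those three matches and ends at vertex (7, 7).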

Thus far we have presented the local similarity problem as one of finding
two subsegments of the sequences that align with maximal score. But as
illustrated in Figure 3.6, the ultimate goal is to expose not a single such
alignment, but all the significantly conserved segments, ideally nonoverlapping
as in the somatotropin example of Figure 3.6. To this end, Waterman and Eggert
[161] proposed the following simple algorithm. Find a highest-scoring local
alignment by the method indicated in the previous paragraph. Eliminate every
edge in the edit graph that is adjacent to a vertex on the path of this local
alignment. Now find a highest-scoring path over the remaining graph. Eliminate
the edges adjacent to this second-best path, and proceed to find a third-best path,
and so on. In this way, one produces a series of local alignments of decreasing
score whose underlying paths do not intersect. Note that this procedure may
generate local alignments whose substrings overlap. Nonetheless, this procedure
is very effective in identifying the biologically relevant local homologies
between two sequences. As originally presented, the algorithm requires O(MN)
space, but recent refinements by Chao and Miller [24] have reduced both storage
and computing time and have permitted the comparison of two sequences of
length 100,000 on a conventional workstation in several hours.

The output of such a problem could be displayed as a sequence of

alignments, as in Figure 3.6. It is also convenient and illuminating to depict all

the alignments as paths in an edit graph, as in Figure 3.2. However, as the

sequences become larger and larger, one must "step back" from the details of the

edit graph. Figure 3.7 is a depiction of the edit graph of the monkey and rainbow

trout somatotropin sequences of Figure 3.6 where only the paths corresponding

to the three aligned segment pairs are drawn. At this level of resolution, the

small gaps in the alignments of the second and third segment pairs appear as

small discontinuities in paths that otherwise follow the direction of the diagonal

of the edit graph grid or matrix. When the sequences become very large, say on

the order of 100,000 nucleotides, then small local alignments are not seen, and

neither are gaps in large alignments unless they are very large. Nonetheless,

such dot plots give a meaningful visualization of all the similarities between

segments in a single snapshot and are ubiquitous.

Figure 3.6: Conserved regions of two somatotropin proteins (three aligned segment pairs between the monkey and rainbow trout sequences, each listed with its score and positions).

3.1.1.4 Variations on Sequence Comparison

In this section a number of the most important variations on sequence

comparison are examined. The survey is by no means exhaustive.

3.1.1.4.1 Variations in Gap Cost Penalties

How to assign scores to alignment gaps has always been more problematic
than scoring aligned symbols, because the statistical effect of gaps is not well
understood. Nature frequently deletes or inserts entire substrings as a unit, as
opposed to individual polymer elements. It is thus natural to think of cost
models in which the score of a gap is not just the sum of scores assigned to the
individual symbols in the gap, as was used in the previous two sections, but
rather a more general function, gap(x), of its length x.

Figure 3.7: Dot plot of somatotropin alignments (monkey sequence positions on one axis, rainbow trout on the other)

For example, it is common to score a gap according to the affine function
gap(x) = r + sx, where r > 0 is the penalty for the introduction of the gap and s
> 0 is the penalty for each symbol in the gap. Such affine gap costs are
particularly important when comparing proteins. For example, a gap penalty of
8 + 4x works well in conjunction with the aligned symbol scores of Figure 3.5.
Because a gap is viewed as detracting from similarity, its score is a penalty that
is subtracted from the total.

Accommodating affine gap scores involves the following variation on the
central recurrence. For each subproblem, A_i versus B_j, one develops recurrences
for (1) the best alignment that ends with an A-gap, Ag(i, j), (2) the best
alignment that ends with a B-gap, Bg(i, j), and (3) the best overall alignment,
S(i, j). This leads to the following system of recurrence equations:

    Ag(i, j) = max { Ag(i - 1, j) - s, S(i - 1, j) - (r + s) }

    Bg(i, j) = max { Bg(i, j - 1) - s, S(i, j - 1) - (r + s) }

    S(i, j) = max { S(i - 1, j - 1) + δ(a_i, b_j), Ag(i, j), Bg(i, j) }.

S terms contributing to an Ag or Bg value are penalized r + s because a

gap is being initiated from that term. Ag terms contributing to Ag values and Bg

terms contributing to Bg values are penalized only s because the gap is just

being extended. An algorithm that applies these recurrences at each (i, j) leads to

an 0{MN) time algorithm for global alignments with affine gap costs. Simply

adding a 0 term to the S-recurrence gives an algorithm for local alignments with

affine gap costs.
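
The three recurrences for affine gap costs can be sketched as follows. Several details are assumptions made here because the text does not spell them out: the boundary handling (undefined Ag and Bg values are set to minus infinity, and S(0, 0) = 0), the hypothetical substitution scores (match 1, mismatch -1), and the function names.

```python
NEG_INF = float("-inf")

def simple_match(a, b):
    """Hypothetical substitution scores: 1 for identical letters, -1 otherwise."""
    return 1 if a == b else -1

def affine_global_score(A, B, r, s, match=simple_match):
    """Global alignment score when a gap of length x costs r + s * x."""
    M, N = len(A), len(B)
    S = [[NEG_INF] * (N + 1) for _ in range(M + 1)]
    Ag = [[NEG_INF] * (N + 1) for _ in range(M + 1)]   # best alignment ending in an A-gap
    Bg = [[NEG_INF] * (N + 1) for _ in range(M + 1)]   # best alignment ending in a B-gap
    S[0][0] = 0
    for i in range(M + 1):
        for j in range(N + 1):
            if i == 0 and j == 0:
                continue
            if i > 0:   # extend an existing A-gap (cost s) or open a new one (cost r + s)
                Ag[i][j] = max(Ag[i - 1][j] - s, S[i - 1][j] - (r + s))
            if j > 0:   # same choice for a B-gap
                Bg[i][j] = max(Bg[i][j - 1] - s, S[i][j - 1] - (r + s))
            diag = NEG_INF
            if i > 0 and j > 0:
                diag = S[i - 1][j - 1] + match(A[i - 1], B[j - 1])
            S[i][j] = max(diag, Ag[i][j], Bg[i][j])
    return S[M][N]

print(affine_global_score("AAATTT", "AAACCCTTT", r=2, s=1))
```

In the example the best alignment has six matches and a single gap of length 3, so its score is 6 - (2 + 3 * 1) = 1.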

Summation and affine functions are not the only options available for
scoring gaps. The gap cost function gap(x) can be taken to be a concave (flat or
cupped downward) function of length, that is, a function such that gap(x + 1) -
gap(x) ≤ gap(x) - gap(x - 1) for all x > 0. The class of concave gap cost functions
includes affine functions but is much wider than just affine functions. For
example, for positive a and b, the function gap(x) = a log x + b is a concave
function that finds occasional use in sequence comparison. It has been
postulated that such a model is natural for biological sequences where gap costs
would be expected to have a decreasing marginal penalty as a function of length.
For this model, investigators have been able to design algorithms that take
O(MN(log N + log M)) time or less [105,41].

It is also possible to design an algorithm for completely arbitrary gap cost

functions. However, such generality comes at a price: the best available

algorithm takes O(MN(M + N)) time [163]. For this reason and because the

more restricted affine and concave models appear adequate to most needs, the

general algorithm is rarely used.

3.1.1.4.2 The Duality between Similarity and Difference Measures

Thus far we have considered the comparison problem to be one of
exposing the similarity between two sequences and thus have naturally thought
in terms of maximizing the score of alignments. Another natural perspective is to
think about how a sequence A may have evolved into sequence B over time. In
this context, one seeks alignments that reveal the minimum number of
mutational events that might have effected the transformation. In this view, an
aligned symbol of B is substituted for its counterpart in A, an unaligned symbol
in A is deleted, and an unaligned symbol in B is inserted. For example, in the
alignment

    ATTA-CG
    A-TATCG

the first T in ATTACG is deleted, and the second T in ATATCG is inserted. In the
alignment

    ATTACG
    ATATCG

T is mutated into A, and A is mutated into T. As before, the scoring scheme δ
assigns a score to each evolutionary event modeled by a column, but now the
interpretation is that δ represents the differences rather than the similarities
between symbols. Note that for formal purposes it is assumed that an A mutates
into an A in the alignments above at no cost; that is, one chooses δ(A,A) to be 0.

Given a scoring scheme δ reflecting an evolutionary or difference-based
model, the goal is to find an alignment of minimal score, that is, one that
indicates the minimum scoring set of changes needed to go from one sequence
to a related sequence. Let D(A,B) be the score of a minimal cost alignment
between sequences A and B. In honor of its inventor, this score is formally
known as the generalized Levenshtein measure or distance between sequences
A and B. Indeed the measure, D, between sequences forms a metric space over
sequences if the underlying scoring function δ forms a metric space over the
underlying alphabet. Thus calling this measure a distance is formally correct for
a wide class of scoring schemes δ.

Immediately note that the distance and similarity perspectives are

complementary. To solve a difference problem, we need only to revise our

previous discussions by replacing maximum with minimum in every sentence

and formula. Also, one could simply take a δ for a difference problem and

multiply every score by -1. Applying the similarity algorithm with the modified

scores would produce optimal alignments for the original difference problem,

and multiplying the resultant similarity score by -1 would give the distance

between the two sequences.
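The complementarity of the two perspectives can be checked on the example sequences used above. The following is a minimal sketch, assuming a unit-cost δ (0 for identical symbols, 1 for any substitution, insertion, or deletion); the function names `align_score`, `unit_cost`, and `negated_cost` are illustrative and do not come from the text.

```python
def align_score(a, b, delta, better):
    """Score of the best global alignment of a and b.

    delta(x, y) scores one column ('-' marks a gap); better is min for a
    difference problem and max for a similarity problem.
    """
    M, N = len(a), len(b)
    S = [[0] * (N + 1) for _ in range(M + 1)]
    for i in range(1, M + 1):                      # first column: deletions only
        S[i][0] = S[i - 1][0] + delta(a[i - 1], "-")
    for j in range(1, N + 1):                      # first row: insertions only
        S[0][j] = S[0][j - 1] + delta("-", b[j - 1])
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            S[i][j] = better(
                S[i - 1][j - 1] + delta(a[i - 1], b[j - 1]),  # substitution
                S[i - 1][j] + delta(a[i - 1], "-"),           # deletion
                S[i][j - 1] + delta("-", b[j - 1]),           # insertion
            )
    return S[M][N]


def unit_cost(x, y):
    """Assumed difference scheme: identical symbols cost 0, anything else 1."""
    return 0 if x == y else 1


def negated_cost(x, y):
    """The same scheme multiplied by -1, to be used with maximization."""
    return -unit_cost(x, y)


distance = align_score("ATTACG", "ATATCG", unit_cost, min)
similarity = align_score("ATTACG", "ATATCG", negated_cost, max)
print(distance, similarity)
```

Under these assumptions the maximizing run on the negated scores returns exactly the negative of the distance, as the paragraph above states.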

3.1.1.4.3 Aligning More Than Two Sequences at a Time

Molecular biologists are frequently interested in comparing more than two

sequences simultaneously. For instance, given a number of sequences of the

same functionality, it is much more likely that the similarity that gives this

common function will be more evident among the group than between two

sequences from the group. A closely related problem is to discover the

evolutionary relationships between a set of sequences by constructing an

evolutionary tree, or phylogeny, that minimizes the evolutionary changes that

must have taken place along each branch of the tree. A third application for

aligning a collection of sequences is to correct errors in the "raw" experimental

data obtained in DNA sequencing experiments. Typically, 1 to 10 percent of the

symbols in a sequenced fragment are incorrect, missing, or spurious. These

errors are detected and corrected by sequencing a given stretch several times and

then forming a consensus by aligning the sequences. Figure 3.8 illustrates a

multi-alignment of such sequence data.

[Figure 3.8: A multi-alignment of five DNA sequences]

Suppose we wish to align K sequences A^1, A^2, ..., A^K, where

A^i = a^i_1 a^i_2 ... a^i_{N_i} is of length N_i. As for the basic problem, we wish to arrange the
sequences into a tableau using dashes to force the alignment of certain

characters in given columns. For example, in Figure 3.8 the dashes are placed so

as to arrange columns consisting of primarily one symbol. For each column, the

consensus of the column is the symbol that occurs the greatest number of times

in that column. Concatenating these consensus characters together, ignoring

dashes, gives the consensus sequence for the five experimental trials. As for pairwise

alignments, each column of K symbols of the multi-alignment is scored

according to a user-supplied function δ. For example, if δ is the number of

symbols in the column not equal to the majority symbol of the column (which

can be a dash), then the multi-alignment of Figure 3.8 has score 13, and this is

the minimum possible score over all possible multi-alignments of the five

sequences. The problem of finding a maximum (minimum)-scoring alignment

among K sequences can be solved by extending the dynamic programming

recurrence for the basic problem from a recurrence over a two-dimensional

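matrix to a recurrence over a K-dimensional matrix. Let i = (i_1, i_2, ..., i_K) be a

vector in K-dimensional Cartesian space. Now we compute a K-dimensional

array S, where S(i) is the score of the best alignment among the prefix sequences

A^1_{i_1}, A^2_{i_2}, ..., A^K_{i_K}. The central recurrence now becomes

$$S(\mathbf{i}) = \max_{\mathbf{e} \in \{0,1\}^K \setminus \{\mathbf{0}\}} \left\{ S(\mathbf{i} - \mathbf{e}) + \delta\left(e_1 \cdot a^1_{i_1},\, e_2 \cdot a^2_{i_2},\, \ldots,\, e_K \cdot a^K_{i_K}\right) \right\}$$

(with min in place of max for a difference-based scheme), where e · a means "if e = 1 then a else '-'".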

In terms of an edit graph model, imagine a grid of vertices in K-dimensional

space where each vertex i has 2^K - 1 edges directed into it, each

corresponding to a column that when appended to the alignment for the edge's

tail gives rise to the alignment for the prefix sequences represented by i.

Computing the S values in some topological ordering requires a total of O(N^K)

time, where N = max_i N_i. While multiple sequence comparison algorithms of

this genre are conceptually straightforward, they take an exponential amount of

time in K and are thus generally impractical for K > 3.
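The K-dimensional recurrence can be sketched directly for very small inputs. The code below is an illustration only: it assumes the majority-symbol column cost described above (the number of symbols in a column that differ from its most frequent symbol), minimizes rather than maximizes, and uses the hypothetical names `column_cost` and `multi_alignment_cost`. Its running time grows as N^K, so it is usable only for a handful of short sequences.

```python
from collections import Counter
from functools import lru_cache
from itertools import product


def column_cost(column):
    """Symbols in the column that differ from its majority symbol (a dash counts as a symbol)."""
    return len(column) - max(Counter(column).values())


def multi_alignment_cost(seqs):
    """Minimum total column cost over all multi-alignments of the sequences in seqs."""
    K = len(seqs)
    # Each nonzero 0/1 vector e is one of the 2^K - 1 edges entering a vertex.
    moves = [e for e in product((0, 1), repeat=K) if any(e)]

    @lru_cache(maxsize=None)
    def S(i):
        if not any(i):
            return 0                      # all prefixes are empty
        best = None
        for e in moves:
            prev = tuple(ik - ek for ik, ek in zip(i, e))
            if min(prev) < 0:
                continue                  # this edge would leave the grid
            # e_k = 1 consumes the next symbol of sequence k, otherwise a dash.
            column = tuple(seqs[k][i[k] - 1] if e[k] else "-" for k in range(K))
            cost = S(prev) + column_cost(column)
            if best is None or cost < best:
                best = cost
        return best

    return S(tuple(len(s) for s in seqs))


print(multi_alignment_cost(["ACG", "AG", "ACG"]))
```

Memoizing on the index vector i plays the role of the K-dimensional array S; a production implementation would fill the array iteratively in a topological order instead of recursing.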

Multiple sequence comparison has been shown to be NP-complete [50],

which means that it is almost surely the case that any algorithm for this problem

must exhibit time behavior that is exponential in K. Thus many authors have

sought heuristic approximations, the most popular of which is to take O(K^2 N^2)

time to compute all pairwise optimal alignments between the K sequences, and

then produce a multiple sequence alignment by merging these pairwise

alignments. Note that any multiple sequence alignment induces an alignment

between a given pair of sequences (take the two rows of the tableau and remove

any columns consisting of just dashes). However, given all of the possible

K(K - 1)/2 pairwise alignments between K sequences, it is almost always impossible

to arrange a multi-alignment consistent with them all. Try, for example, merging

the best pairwise alignments among ACG, CGA, and GAC. But, given any K - 1

alignments relating all the sequences (that is, a spanning tree of the complete

graph of sequence pairs), it is always possible to do so. Feng and Doolittle [43]

compare a number of methods based on this approach. The most recent

algorithms utilize the natural choice of the K - 1 alignments whose scores sum to

the minimal possible amount (that is, a minimum spanning tree of the complete

graph of sequence pairs). However, such merges do not always lead to optimal

alignments, as is illustrated by the following example:

G-CACA       GG-CAA         G--CACA        G-CACA
GGCA-A  and  GGACA-  yield  GG-CA-A,  but  GGCA-A  is better.
                            GGACA--        GG-ACA

While the choice of δ for a multi-alignment scoring scheme is

conceptually a function of K arguments, it is often the case that δ is effectively

defined in terms of an underlying pairwise scoring function δ'. For example, the

sum-of-pairs score is defined as $\delta(a_1, a_2, \ldots, a_K) = \sum_{i<j} \delta'(a_i, a_j)$, where one

must let δ'(-,-) = 0. In essence, the sum-of-pairs multi-alignment score is the

sum of the scores of the K(K - 1)/2 pairwise alignments it induces. Another

common scheme is the consensus score, which defines δ(a_1, a_2, ..., a_K) as

$\max/\min \{ \sum_{i} \delta'(c, a_i) : c \in \Psi \cup \{-\} \}$. The symbol c that gives the best score is said to

be the consensus symbol for the column, and the concatenation of these symbols

is the consensus sequence. In effect, the consensus multi-alignment score is the

sum of the scores of the K pairwise alignments of the sequences versus the

consensus. The example of Figure 3.8 is such a scoring scheme where δ' is the

scoring scheme of Figure 3.1. While we do not show it here, the problem of

determining minimal phylogenies mentioned at the start of this subsection can

also be modeled as an instance of a multiple sequence alignment problem by

choosing a δ for columns that suitably encodes the tree relating the sequences

[134]. However, the more general phylogeny problem requires that one also

determine the tree that produces the minimal score. This daunting task

essentially requires the exploration of the space of all possible trees with K

vertices. So in practice, evolutionary biologists have put a great deal of effort

into designing heuristic algorithms for the phylogeny problem, and there is

much debate about which of these is best.
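The two column-scoring schemes defined earlier in this subsection, sum-of-pairs and consensus, can be made concrete with a small sketch. It assumes a DNA alphabet and a hypothetical unit-cost pairwise function (named `pair_cost` here) in which identical symbols, including two dashes, cost 0 and everything else costs 1; both scores are therefore minimized.

```python
from itertools import combinations

ALPHABET = "ACGT"  # assumed alphabet for the illustration


def pair_cost(x, y):
    """Hypothetical pairwise scheme: 0 for identical symbols (so dash/dash is 0), else 1."""
    return 0 if x == y else 1


def sum_of_pairs(column):
    """Sum of the pairwise costs over all K(K - 1)/2 pairs of symbols in the column."""
    return sum(pair_cost(x, y) for x, y in combinations(column, 2))


def consensus(column):
    """Return (score, symbol) for the symbol c, possibly a dash, that minimizes
    the total pairwise cost against the column; ties go to the smaller symbol."""
    return min((sum(pair_cost(c, x) for x in column), c) for c in ALPHABET + "-")


print(sum_of_pairs("AAC"), consensus("AAC"))
print(sum_of_pairs("A--"), consensus("A--"))
```

Summing either function over the columns of a tableau gives the corresponding multi-alignment score under these assumed costs.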

3.1.1.4.4 K-Best Alignments

The alignment algorithm in the section "The Basic Dynamic Programming

Algorithm" above reports an optimal alignment that is clearly a function of the

choice of scoring scheme. Unfortunately, biologists have not yet ascertained

which scoring schemes are "correct" for a given comparison domain. This

uncertainty has suggested the problem of listing all alignments near the

optimum in the hope of generating the biologically correct alignment.

From the point of view of the edit graph formulation, the K-best problem

is to deliver the K-best shortest source-to-sink paths, a problem much studied in

the operations research literature. Indeed, there is an O(MN + KN) time and

space algorithm, immediately available from this literature [45], that delivers the

K-best paths over an edit graph. The algorithm delivers these paths/alignments

in order of score, and K does not need to be known a priori: the next best

alignment is available in O(N) time. The essential idea of the algorithm is to

keep, at each vertex v, an ordered list of the score of the next best path to the

sink through each edge out of v. The next best path is traceable using these

ordered lists and is extracted, and the lists are appropriately updated.

If all one desires is an enumeration, not necessarily in order of score, of all

alignments that are within e of the optimal difference D(A, B), then a simpler

method is available that requires only the matrix S of the dynamic programming

computation. While not any faster in time, the simpler alternative below does

require only O(MN) space. One can imagine tracing back all paths from the sink

to the source in a recursive fashion. The essential idea of the algorithm is to

limit the trace back to only those paths of score not greater than D(A, B) + e.

Suppose one reaches vertex (i, j) and the score of the path thus far traversed

from the sink to this vertex is T(i, j). Then one traces back to predecessor

vertices (i - 1, j), (i - 1, j - 1), and (i, j - 1) if and only if

$$S(i-1, j) + \delta(a_i, -) + T(i, j) \le D(A, B) + e,$$

$$S(i-1, j-1) + \delta(a_i, b_j) + T(i, j) \le D(A, B) + e,$$

$$S(i, j-1) + \delta(-, b_j) + T(i, j) \le D(A, B) + e,$$

respectively. This procedure is very simple, space economical, and quite fast.

A classic example of the need for affine gap costs was presented in a paper

by Smith and Fitch [146] comparing the α and β chicken hemoglobin chains.

For a setting of the gap costs that gave the biologically correct alignment, there

were 17 optimal alignments, 1,317 alignments within 5 percent of the optimum,

and 20,137,655 within 20 percent of the optimum. This kind of exponential

growth suggests that perhaps rather than list alignments, one should report the

best possible scores in order or give a color-coded visualization of the edit graph

that colors edges according to the score of the best path utilizing the edge.

Another interesting variation is to explore the range of solutions not by

enumerating near-optimal answers, but by studying the range of optimal

answers produced by parametrically varying aspects of the underlying scoring

scheme [162].

3.1.1.4.5 Approximate Pattern Matching

A variation on the local alignments problem discussed above is the

approximate match problem. For this problem, imagine that A is a very long

sequence and B a comparatively short query sequence. The problem is to find

substrings of A, called match sites, that align with the entirety of B with a score

greater than some user-specified threshold. An example might be to find all

locations in a chromosome's DNA sequence (A) where a particular DNA

sequence element (B) or some sequence like it occurs. It is not hard to see that

this problem is equivalent to finding sufficiently high scoring paths that begin at

a vertex in row 0 and end at row N of the edit graph for A and B. By simply

permitting 0 to be a term in the computation of S-values in row 0 and checking

values in row N, one obtains the desired modification of the basic dynamic

programming algorithm.
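A minimal sketch of this modification is given below. It assumes a simple similarity scheme (+1 for a match, -1 for a mismatch or a gap) that is not taken from the text, and in this sketch the rows are indexed by the short query and the columns by the long text, so that row 0 is the free starting row and the last row is the one that is checked. The name `match_sites` is hypothetical.

```python
def match_sites(text, query, threshold, match=1, mismatch=-1, gap=-1):
    """End positions j (1-based) in text at which some substring ending at j
    aligns with the whole query with score greater than threshold."""
    prev = [0] * (len(text) + 1)          # row 0: a match may start anywhere for free
    for i in range(1, len(query) + 1):
        cur = [prev[0] + gap]             # query prefix aligned to nothing
        for j in range(1, len(text) + 1):
            sub = match if query[i - 1] == text[j - 1] else mismatch
            cur.append(max(prev[j - 1] + sub,   # align query[i-1] with text[j-1]
                           prev[j] + gap,       # query symbol against a dash
                           cur[j - 1] + gap))   # text symbol against a dash
        prev = cur
    # The last row holds the best score of a match site ending at each position.
    return [j for j, score in enumerate(prev) if score > threshold]


print(match_sites("GGATTACAGG", "ATTACA", 5))
```

Only two rows are kept at a time, so the sketch uses space proportional to the length of the text; recovering the start of each match site would require an additional trace back.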

The problem is taken to another level by generalizing B, the query, from a

sequence to a pattern (that describes a set of sequences). This variation is called

approximate pattern matching. Computer scientists working on text-searching

applications have long studied the problem of finding exact matches to a pattern

in a long text. That is, given a pattern as a query, and a text as a database, one

seeks substrings of the database text that match the pattern (exactly). Pattern

types that have been much studied include the cases of a simple sequence, a

regular expression, and a context-free language. Such patterns are notations that

denote a possibly infinite set of sequences, each of which is said to (exactly)

match the pattern. For example, the regular expression A(T|C)G* denotes the set

of sequences that start with an A followed by a T or a C and then zero or more

G's, that is, the set {AT, AC, ATG, ACG, ATGG, ACGG, ATGGG, ...}.

Assuming the pattern takes P symbols to specify and the text is of length N,

there are algorithms that solve the text searching problem in O(P + N), O(PN),

and O(PN^3) time, depending on whether the pattern is a simple sequence, a

regular expression, or context-free language, respectively. Fusing the concept of

exact pattern matching and sequence comparison gives rise to the class of

approximate pattern matching problems. Given a pattern, a database, a scoring

scheme, and a threshold, one seeks all substrings of the database that align to

some sequence denoted by the pattern with score better than the threshold. In

essence, one is looking for substrings that are within a given similarity

neighborhood of an exact match to the pattern. Within this framework, the

similarity search problem is an approximate pattern-matching problem where

the pattern is a simple sequence. We showed earlier that this problem can be

solved in O(PN) time. For the case of regular expressions, the approximate

match problem can also be solved in O(PN) time [112], and, for context-free

languages, an O(PN^3) algorithm is known. While the cost of searching for

approximate matches to context-free languages is prohibitive, searching for

approximate matches to regular expressions is well within reason and finds

applications in searching for matches to structural patterns that occur in proteins.
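The exact-matching half of this framework can be tried out with an ordinary regular expression library. The sketch below uses Python's standard `re` module on the example pattern A(T|C)G* from the text; it covers only exact membership in the set denoted by the pattern, not the approximate matching algorithm of [112], and the helper name `matches_exactly` is an illustrative assumption.

```python
import re

# The example pattern from the text: an A, then a T or a C, then zero or more G's.
PATTERN = re.compile(r"A(T|C)G*")


def matches_exactly(sequence):
    """True if the whole sequence belongs to the set denoted by the pattern."""
    return PATTERN.fullmatch(sequence) is not None


for candidate in ["AT", "AC", "ATG", "ACG", "ATGG", "ACGG", "A", "AG", "ATC"]:
    print(candidate, matches_exactly(candidate))
```

An approximate matcher would additionally accept sequences that align to some member of this set with a score better than the chosen threshold.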

3.1.1.4.6 Parallel Computing

The basic problem of comparing sequences has resisted better than

quadratic, O(MN) time algorithms. This has led several investigators to study

the use of parallel computers to achieve greater efficiency. As stated above, the

S-matrix can be computed in any order consistent with the data dependencies of

the fundamental recurrence. One naturally thinks of a row-by-row or

column-by-column evaluation, but we pointed out as a third alternative that one could

proceed in order of antidiagonals. Let antidiagonal k be the set of entries

{(i, j) : i + j = k}. Note that to compute antidiagonal k, one only needs

antidiagonals k - 1 and k - 2. The critical observation for parallel processing is

that each entry in this antidiagonal can be computed independently of the other

entries in the antidiagonal, a fact not true of the row-by-row and column-by-

column evaluation procedures. For large SIMD (single-instruction, multiple-

data) machines, a processor can be assigned to each entry in a fixed antidiagonal

and compute its result independently of the others. With O(M) processors, each

antidiagonal can be computed in constant time, for a total of O(N) elapsed

time. Note that total work, which is the product of processors and time per

processor, is still O(MN). The improvement in time stems from the use of more

processors, not from an intrinsically more efficient algorithm.
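The antidiagonal order of evaluation can be illustrated with a sequential sketch. It assumes unit edit costs, and the function name is hypothetical. The code itself runs on one processor; the point is that the iterations of the inner loop read only antidiagonals k - 1 and k - 2 and never each other, which is the property a SIMD machine would exploit.

```python
def edit_distance_by_antidiagonals(a, b):
    """Unit-cost edit distance with the matrix filled one antidiagonal at a time."""
    M, N = len(a), len(b)
    S = [[0] * (N + 1) for _ in range(M + 1)]
    for k in range(M + N + 1):                         # antidiagonal i + j = k
        # Every entry below depends only on antidiagonals k - 1 and k - 2,
        # so these iterations could be assigned to separate processors.
        for i in range(max(0, k - N), min(M, k) + 1):
            j = k - i
            if i == 0 or j == 0:
                S[i][j] = i + j                        # boundary row or column
            else:
                S[i][j] = min(S[i - 1][j - 1] + (a[i - 1] != b[j - 1]),
                              S[i - 1][j] + 1,
                              S[i][j - 1] + 1)
    return S[M][N]


print(edit_distance_by_antidiagonals("ATTACG", "ATATCG"))
```

There are M + N + 1 antidiagonals, so with one processor per entry the elapsed time is proportional to M + N while the total work remains O(MN).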

This observation about antidiagonals has been used to design custom VLSI

(very large scale integration) chips configured in what is called a systolic array.

The "array" consists of a vector of processors, each of which is identical,

performs a dedicated computation, and communicates only with its left and right

neighbors, making it easy to lay out physically on a silicon wafer. For sequence

comparisons, processor i computes the entries for row i and contains three

registers that we will call L(i), V(i), and U(i). At the completion of the kth step,

the processors contain antidiagonals k - 1 and k in their L and V registers,

respectively, and the characters of B flow through their U registers. That is,

L(i)_k = S(i, k - i - 1), V(i)_k = S(i, k - i), and U(i)_k = b_{k-i}, where X(i)_k denotes the value

of register X at the end of the kth step. It follows from the basic recurrence for

S-values that the following recurrences correctly express the values of the

registers at the end of step k + 1 in terms of their values at the end of step k:

$$L(i)_{k+1} = V(i)_k$$

$$U(i)_{k+1} = U(i-1)_k$$

$$V(i)_{k+1} = \max \left\{ L(i-1)_k + \delta(a_i, U(i-1)_k),\; V(i-1)_k + \delta(a_i, -),\; V(i)_k + \delta(-, U(i-1)_k) \right\}$$

These recurrences reveal that to accomplish step k + 1, processor i - 1

must pass its register values to processor i and each processor must have just

enough hardware to perform three additions and a three-term maximum.

Moreover, each processor must have a (2|Ψ|+1)-element memory that can be

loaded with the scores for δ(a_i, ?), δ(-, ?), and δ(a_i, -) where ? is any symbol in

the underlying alphabet Ψ. The beauty of the systolic array is that it can perform

comparisons of A against a stream of B sequences, processing each symbol of

the target sequences in constant time per symbol. With current technology, chips

of this kind operate at rates of 3 million to 4 million symbols per second. A

systolic array of 1,000 of these simple processors computes an aggregate of 3

billion to 4 billion dynamic programming entries per second.

3.1.1.5 Comparing One Sequence against a Database

The current GENBANK database [15] of DNA sequences contains

approximately 191 million nucleotides of sequence in about 183,000 sequence

entries, and the PIR database [12] of protein sequences contains about 21

million amino acids of data in about 71,000 protein entries. Whenever a new

DNA or protein sequence is produced in a laboratory, it is now routine practice

to search these databases to see if the new sequence shares any similarities with

existing entries. In the event that the new sequence is of unknown function, an

interesting global or local similarity to an already-studied sequence may suggest

possible functions. Thousands of such searches are performed every day.

In the case of protein databases, each entry is for a protein between 100

and 1,500 amino acids long, the average length being about 300. The entries in

DNA databases have tended to be for segments of an organism's DNA that are

of interest, such as stretches that code for proteins. These segments vary in

length from 100 to 10,000 nucleotides. The limited length here is not intrinsic to

the object, as it is in the case of proteins, but results from limitations in the technology

and the cost of obtaining long DNA sequences. In the early 1980s the longest

consecutive stretches being sequenced were up to 5,000 nucleotides long. Today

the sequences of some viruses of length 50,000 to 100,000 have been

determined. Ultimately, what we will have is the entire sequence of DNA in a

chromosome (100 million to 10 billion nucleotides), and entries in the database

will simply be annotations describing interesting parts of these massive

sequences.

A similarity search of a database takes a relatively short query sequence of

a protein or DNA fragment and searches every entry in the database for

evidence of similarity with the query. In protein databases, the query sequence

and the entries in the database are typically of similar sizes. In DNA databases,

the entries are typically much longer than the query sequence, and one is

looking for subsegments of the entry that match the query.

3.1.1.5.1 Heuristic Algorithms

The problem of searching for protein similarities efficiently has led many

investigators to abandon dynamic programming algorithms (for which the size

of the problem has become too large) and instead consider designing very fast

heuristic procedures: simple, often ad hoc, computational procedures that

produce answers that are "nearly" correct with respect to a formally stated

optimization criterion. One of the most popular database searching tools of this

genre is FASTA. FASTA looks for entries that share a significant number of

short identical subsequences of symbols with the query sequence. Any entry

meeting this criterion is then compared via dynamic programming with the

query sequence. In this way, the vast majority of entries are eliminated from

consideration quickly. FASTA reports most of the alignments that would be

identified by an equivalent dynamic programming calculation, but it misses

some matches and also reports some spurious matches. On the other hand,

FASTA is very fast.
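The filtering idea described above, looking for entries that share many short identical subsequences with the query, can be sketched as follows. This is a simplified illustration and not the FASTA program itself: it only counts shared k-tuples per diagonal (entry position minus query position), and the function name `ktuple_diagonals` is hypothetical.

```python
from collections import Counter, defaultdict


def ktuple_diagonals(query, entry, k):
    """Count identical k-tuples shared by query and entry, grouped by diagonal."""
    positions = defaultdict(list)             # k-tuple -> start positions in the query
    for q in range(len(query) - k + 1):
        positions[query[q:q + k]].append(q)
    diagonals = Counter()
    for p in range(len(entry) - k + 1):
        for q in positions.get(entry[p:p + k], ()):
            diagonals[p - q] += 1             # same diagonal = same relative offset
    return diagonals


counts = ktuple_diagonals("ACGTACGT", "TTACGTACGTTT", 4)
print(counts.most_common(1))
```

An entry whose best diagonal collects many shared k-tuples is a candidate for the more expensive dynamic programming comparison; entries with no well-populated diagonal can be discarded quickly.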

BLAST, the Basic Local Alignment Search Tool, was born in the first

months of 1989 at the National Center for Biotechnology Information. The

BLAST programs have been the fruit of much hard work by scores of talented

programmers and scientists. This work continues, linking BLAST output to

other databases, improving alignment formatting options, and refining the types of

queries that may be performed. BLAST is faster than FASTA but is capable of

detecting biologically meaningful similarities with accuracy comparable to that

of FASTA. Given a query A and an entry B, BLAST searches for segment pairs

of high score. A segment pair is a substring from A and a substring from B of

equal length, and the score of the pair is that of the no-gap alignment between

them. One can argue that the presence of a high-scoring segment pair or pairs is

evidence of functional similarity between proteins, because insertion and

deletion events tend to significantly change the shape of a protein and hence its

function. Note that segment pairs embody a local similarity concept. What is

particularly useful is that there is a formula for the probability that two

sequences have a segment pair above a certain score. Thus BLAST can give an

assessment of the statistical significance of any match that it reports. For a given

threshold, T, BLAST returns to the user all database entries that have a segment

pair with the query of score greater than T ranked according to probability.

BLASTA may miss some such matches, although in practice it misses very few.

The central idea used in BLASTA is the notion of a neighborhood. The t-

neighborhood of a sequence S is the set of all sequences that align with S with

score better than t. In the case of BLASTA, the t-neighborhood of S is exactly

those sequences of equal length that form a segment pair of score higher than t

under the Dayhoff scoring scheme (see Figure 3.5). This concept suggests a

simple strategy for finding all entries that have segment pairs of length k and

score greater than t with the query: generate the set of all sequences that are in

the t-neighborhood of some k-substring of the query and see if an entry contains

one of these strings. Scanning for an exact match to one of the strings in the

neighborhood can be performed very efficiently: on the order of 0.5 million

characters per second on a 20 SPECint computer. Unfortunately, for the general

problem, the length of the segment pair is not known in advance, and even more

devastating is the fact that the number of sequences in a neighborhood grows

exponentially in both k and t, rendering it impractical for reasonable values of t.

To circumvent this difficulty, BLASTA uses the fast scanning strategy above to

find short segment pairs of length k above a score t, and then checks each of

these to see if they are a portion of a segment pair of score T or greater. This

approach is heuristic (that is, may miss some segment pairs) because it is

possible for every length k subsegment pair of a segment pair of score T to have

score less than t. Nonetheless, with k = 4 and t = 17 such misses are very rare,

and BLASTA takes about 3 seconds for every 1 million characters of data

searched.
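The neighborhood strategy can be sketched on a toy scale. The code below is an illustration under stated assumptions rather than the BLAST implementation: it uses a four-letter DNA alphabet and a hypothetical scoring of +2 for a match and -1 for a mismatch in place of the Dayhoff scheme, and it enumerates neighborhoods by brute force, which is feasible only for very small k. The names `pair_score`, `neighborhood`, and `word_hits` are hypothetical.

```python
from itertools import product

ALPHABET = "ACGT"  # assumed alphabet; protein searches would use 20 amino acids


def pair_score(x, y):
    """Hypothetical substitution score standing in for a real scoring matrix."""
    return 2 if x == y else -1


def neighborhood(word, t):
    """All words of the same length whose no-gap alignment with word scores above t."""
    return {"".join(w) for w in product(ALPHABET, repeat=len(word))
            if sum(pair_score(x, y) for x, y in zip(word, w)) > t}


def word_hits(query, entry, k, t):
    """Sorted (query position, entry position) pairs where a length-k word of the
    entry lies in the t-neighborhood of the length-k query word at that position."""
    index = {}
    for q in range(len(query) - k + 1):
        for w in neighborhood(query[q:q + k], t):
            index.setdefault(w, []).append(q)
    # Scanning the entry is now an exact lookup per position.
    return sorted((q, p) for p in range(len(entry) - k + 1)
                  for q in index.get(entry[p:p + k], ()))


print(len(neighborhood("ACG", 2)))
print(word_hits("ACGT", "GGACTT", 3, 2))
```

Each hit is a short segment pair of length k scoring above t; the remaining step described in the text is to extend each hit and test whether it belongs to a segment pair of score T or greater.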

To get an idea of the relative efficiency of various similarity searching

approaches, consider the following rough timing estimates for a typical 20

SPECint workstation and a search against a typical protein query. The dynamic

programming algorithm for local similarities presented above (also known as the

Smith-Waterman algorithm) takes roughly 1000.0N microseconds to search a

database with a total of N characters in it. On the other hand, FASTA takes

20.0N microseconds, and BLASTA only about 2.0N microseconds. At the other

end of the spectrum, the systolic array chip described above takes only 0.3N

microseconds to perform the Smith-Waterman algorithm with its special-

purpose (and expensive) hardware.

3.1.2 Application of tools FASTA and BLASTA to compare the

amino acid sequences of two proteins

Comparison of BChE with FETUB_HUMAN (Fetuin-B precursor)

• Both the proteins must be brought into FASTA format as below.

>sp|P06276|CHLE_HUMAN Cholinesterase precursor (EC 3.1.1.8)

(Acylcholine acylhydrolase) (Choline esterase II) (Butyrylcholine esterase)

(Pseudocholinesterase) - Homo sapiens (Human).

MHSKVTIICIRFLFWFLLLCMLIGKSHTEDDIIIATKNGKVRGMNLTVFG

GTVTAFLGIPYAQPPLGRLRFKKPQSLTKWSDIWNATKYANSCCQNIDQ

SFPGFHGSEMWNPNTDLSEDCLYLNVWIPAPKPKNATVLIWIYGGGFQT

GTSSLHVYDGKFLARVERVIVVSMNYRVGALGFLALPGNPEAPGNMGL

FDQQLALQWVQKNIAAFGGNPKSVTLFGESAGAASVSLHLLSPGSHSLF

TRAILQSGSFNAPWAVTSLYEARNRTLNLAKLTGCSRENETEIIKCLRNK

DPQEILLNEAFVVPYGTPLSVNFGPTVDGDFLTDMPDILLELGQFKKTQI

LVGVNKDEGTAFLVYGAPGFSKDNNSIITRKEFQEGLKIFFPGVSEFGKE

SILFHYTDWVDDQRPENYREALGDVVGDYNFICPALEFTKKFSEWGNN

AFFYYFEHRSSKLPWPEWMGVMHGYEIEFVFGLPLERRDNYTKAEEILS

RSIVKRWANFAKYGNPNETQNNSTSWPVFKSTEQKYLTLNTESTRIMTK

LRAQQCRFWTSFFPKVLEMTGNIDEAEWEWKAGFHRWNNYMMDWKN

QFNDYTSKKESCVGL

>sp|Q9UGM5|FETUB_HUMAN Fetuin-B precursor (IRL685) (16G2) - Homo

sapiens (Human).

MGLLLPLALCILVLCCGKLSPPQLALNPSALLSRGCNDSDVLAVAGFAL

RDINKDRKDGYVLRLNRVNDAQEYRRGGLGSLFYLTLDVLETDCHVLR

KKAWQDCGMRIFFESVYGQCKAIFYMNNPSRVLYLAAYNCTLRPVSKK

KIYMTCPDCPSSIPTDSSNHQVLEAATESLAKYNNENTSKQYSLFKVTRA

SSQWVVGPSYFVEYLIKESPCTKSQASSCSLQSSDSVPVGLCKGSLTRTH

WEKFVSVTCDFFESQAPATGSENSAVNQKPTNLPKVEESQQKNTPPTDS

PSKAGPRGSVQYLPDLDDKNSQEKGPQEAFPVHLDLTTNPQGETLDISFL

FLEPMEEKLVVLPFPKEKARTAECPGPAQNASPLVLPP

>sp|P02649|APOE_HUMAN Apolipoprotein E precursor (Apo-E) - Homo

sapiens (Human).

MKVLWAALLVTFLAGCQAKVEQAVETEPEPELRQQTEWQSGQRWELA

LGRFWDYLRWVQTLSEQVQEELLSSQVTQELRALMDETMKELKAYKS

ELEEQLTPVAEETRARLSKELQAAQARLGADMEDVCGRLVQYRGEVQ

AMLGQSTEELRVRLASHLRKLRKRLLRDADDLQKRLAVYQAGAREGA

ERGLSAIRERLGPLVEQGRVRAATVGSLAGQPLQERAQAWGERLRARM

EEMGSRTRDRLDEVKEQVAEVRAKLEEQAQQIRLQAEAFQARLKSWFE

PLVEDMQRQWAGLVEKVQAAVGTSAAPVPSDNH
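Records in this format are straightforward to read programmatically. The following is a minimal sketch of a reader, assuming that each header occupies a single line beginning with ">" (the headers above are wrapped only by the page layout) and that the first whitespace-delimited token of the header serves as the record name; the function name `parse_fasta` is hypothetical.

```python
def parse_fasta(text):
    """Map each record name (first token after '>') to its concatenated sequence."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue                       # skip blank lines between rows
        if line.startswith(">"):
            fields = line[1:].split()
            name = fields[0] if fields else ""
            records[name] = []
        elif name is not None:
            records[name].append(line)     # sequence rows are joined later
    return {n: "".join(parts) for n, parts in records.items()}


sample = """>sp|P06276|CHLE_HUMAN Cholinesterase precursor
MHSKVTIICIRFLFWFLLLCMLIGKSHTEDDIIIATKNGKVRGMNLTVFG
GTVTAFLGIPYAQPPLGRLRFKKPQSLTKWSDIWNATKYANSCCQNIDQ
"""
parsed = parse_fasta(sample)
print({name: len(seq) for name, seq in parsed.items()})
```

The sample reuses the first two sequence rows of the cholinesterase record shown above; a full record would simply contain more rows.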

• After bringing the two proteins into FASTA format, their sequences are

compared by the application of BLAST.

SeqA  Name           Len(aa)  SeqB  Name           Len(aa)  Score
1     CHLE_HUMAN     602      2     IAPP_HUMAN     89       19
1     CHLE_HUMAN     602      3     RAD_HUMAN      308      8
1     CHLE_HUMAN     602      4     SOX13_HUMAN    889      3
1     CHLE_HUMAN     602      5     Q53Y25_HUMAN   465      2
1     CHLE_HUMAN     602      6     Q6LCT9_HUMAN   87       5
1     CHLE_HUMAN     602      7     Q7ZVA2_BRARE   298      3
2     IAPP_HUMAN     89       3     RAD_HUMAN      308      7
2     IAPP_HUMAN     89       4     SOX13_HUMAN    889      15
2     IAPP_HUMAN     89       5     Q53Y25_HUMAN   465      11
2     IAPP_HUMAN     89       6     Q6LCT9_HUMAN   87       6
2     IAPP_HUMAN     89       7     Q7ZVA2_BRARE   298      6
3     RAD_HUMAN      308      4     SOX13_HUMAN    889      12
3     RAD_HUMAN      308      5     Q53Y25_HUMAN   465      4
3     RAD_HUMAN      308      6     Q6LCT9_HUMAN   87       6
3     RAD_HUMAN      308      7     Q7ZVA2_BRARE   298      52
4     SOX13_HUMAN    889      5     Q53Y25_HUMAN   465      11
4     SOX13_HUMAN    889      6     Q6LCT9_HUMAN   87       10
4     SOX13_HUMAN    889      7     Q7ZVA2_BRARE   298      6
5     Q53Y25_HUMAN   465      6     Q6LCT9_HUMAN   87       9
5     Q53Y25_HUMAN   465      7     Q7ZVA2_BRARE   298      6
6     Q6LCT9_HUMAN   87       7     Q7ZVA2_BRARE   298      19

RAD_HUMAN MTLNGGGSGAGG 12

Q7ZVA2_BRARE MTLN 4

Q6LCT9_HUMAN

IAPP_HUMAN

CHLE_HUMAN MHSKVTIICIRFLFWFLLLCMLIGKSHTEDDIIIATKNGKVRGMNLT 47

SOX13_HUMAN

MVNCTIKSEEKKEPCHEAPQGSATAAEPQPGDPARASQDSADPQAPAQGN50

Q53Y25_HUMAN MLDDRARMEAAKKEKVEQILAEFQ 24

RAD_HUMAN

SRGGGQERERRRGSTPWGPAPPLHRRSMPVDERDLQAALTPGALTAAAAG62

Q7ZVA2_BRARE TQKEGKEPLRRRASTP1PSSRQAGRGDRDPSTDPYHPPLAQSAS--YHPG 52

Q6LCT9_HUMAN MSG 3

IAPP_HUMAN

CHLE_HUMAN VFGGTVTAFLGIPYAQPPLGRLRFKKPQSLTKWSDIWNATKYANSCCQNI 97

SOX13_HUMAN

FRG5WDCSSPEGNGSPEPKRPGVSEAASGSQEKLDFNRNLKEVVPAIEKL 100

Q53Y25_HUMAN

LQEEDLKKVMRRMQKEMDRGLRLETHEEASVKMLPTYVRSTPEGSEVGDF74

RAD_HUMAN

TGTQGPRLDWPEDSEDSLSSGGSDSDESVYKVLLLGAPGVGKSALARIFG 112

Q7ZVA2_BRARE DKSIHSRANWSSDSES--DSSGS—ECLYRVVLLGDHGVGKSSLANIFA 97

Q6LCT9_HUMAN HKCS YP—WD LQDRYA 17

IAPP_HUMAN MGILK 5

CHLE_HUMAN

DQSFPGFHGSEMWNPNTDLSEDCL YLN VWIPAPKPKNATVLI Wl YGGGFQ 147

SOX13_HUMAN

LSSDWKERFLGRNSMEAKDVKGTQESLAEKELQLLVMIHQLSTLRDQLLT 150

Q53Y25_HUMAN

LSLDLGGTNFRVMLVKVGEGEEGQWS VKTKHQM YSIPEDAMTGTAEMLFD 124

RAD_HUMAN

GVED-GPEAEAAGHTYDRSIVVDGEEASLMV YDIWE—QDGGRWLPGH 157

Q7ZVA2_BRARE

GIQEKDAHKHIGEDAYERTLMVDGEDTTLVVMDPWETDKQEDDEKFLQDY 147

Q6LCT9_HUMAN QDKSVVNK MQQKYWE-—TKQAF1K--39

IAPP_HUMAN LQVFLIVLS-—VALNHLK ATPIESH 28

CHLE_HUMAN

TGTSSLH V YDGKFLARVERVIV VSMN YRVG ALGFLALPGNPEAPGNMGLF 197

SOX13_HUMAN

AHSEQKNMAAMLFEKQQQQMELARQQQEQIAKQQQQLIQQQHKINLLQQQ200

Q53Y25_HUMAN

YISECISDFLDKHQMKHKKLPLGFTFSFPVRHED1DKGILLNWTKGFKAS174

RAD_HUMAN CMAMGDAYVIV YSVTDKGSFEK ASELRVQLRRARQ 192

Q7ZVA2_BRARE CMQVGNAYIIV-- YSITDRSSFES ASELRIQLRRIRQ 182

Q6LCT9_HUMAN --ATGKK EDEHVVAS DADLDAKLELFHS 65

IAPP_HUMAN --QVEKR KCNTAT CATQRLANFLVHS 52

CHLE_HUMAN DQQLALQWVQKNIAAFGGNPKSVTLFGESAG AASVSLHLLSPGS 241

SOX13_HUMAN IQQVNMpYVMIPAFPPSHQPLPVTPDSQLALPIQPIPCKPVEYPLQLLHS 250

Q53Y25_HUMAN GAEGNNVVGLLR- DAIKRRGDFEMDVVAMVNDTVATMISCYYE 216

RAD_HUMAN —TDDVPIILVGNKSDLVRSREVSVDEGRACAVVFDCKF 229

Q7ZVA2_BRARE —AENlPHLVGNKSDLVRSREVAVEEGRACAVMFDCKF 219

Q6LCT9_HUMAN —IQRTCLDLS KAIVLYQKRICSF 87

IAPP_HUMAN —SNNFGAILSSTN VGSNTYGKRNAVEVLKREP 83

CHLE_HUMAN —HSLFTRAILQSGSFNAPWAVTSLYEARNRTLNLAKLTGCSRENETE 287

SOX13_HUMAN

PPAPVVKRPGAMATHHPLQEPSQPLNLTAICPKAPELPNTSSSPSLKMSSC300

Q53Y25_HUMAN

DHQCEVGMIVGTGCNACYMEEMQNVELVEGDEGRMCVNTEWG 258

RAD_HUMAN IETSAALHHNV 240

Q7ZVA2_BRARE IETSASLHHNV 230

Q6LCT9_HUMAN

IAPP_HUMAN LNYLPL 89

CHLE_HUMAN

IIKCLRNKDPQEILLNEAFVVPYGTPLSVNFGPTVDGDFLTDMPDILLEL337

SOX13_HUMAN

VPRPPSHGGPTRDLQSSPPSLPLGFLGEGDAVTKA1QDARQLLHSHSGAL350

Q53Y25_HUMAN AFGDSGELDE 268

RAD_HUMAN QALFEGVVRQIRLRRDSKEANARRQAGTRR 270

Q7ZVA2_BRARE HELFEOTVRQIRLRRDSKEINERRRSVYKR 260

Q6LCT9_HUMAN

IAPP_HUMAN

CHLE_HUMAN GQFKKTQILVGVNKDEGTAFLVYGAPGFSK 367

SOX13_HUMAN

DGSPNTPFRKDLISLDSSPAKERLEDGCVHPLEEAMLSCDMDGSRHFPES400

Q53Y25_HUMAN FLLEYDRLVDESSANPGQQLYEKLIGGKYMG 299

RAD_HUMAN - -—RESLGKKAKRFLGRIVARNSRKMAFRAKSKSCH 303

Q7ZVA2_BRARE KESITKKARRFLDRLVAKNNKKMALKVRSKSCH 293

Q6LCT9_HUMAN

IAPP_HUMAN

CHLE_HUMAN D N N S I I T R K E F Q E G L K I F F P G V S E F G K E S I L F H 400

SOX13_HUMAN

RNSSHIKRPMNAFMVWAKDERRKILQAFPDMHNSSISKJLGSRWKSMTNQ450

Q53Y25_HUMAN ELVRLVLLRLVDENLLFHGEASEQLRTRGAFET332

RAD_HUMAN DLSVL-- 308

Q7ZVA2_BRARE DLAVL 298

Q6LCT9_HUMAN

IAPP_HUMAN

CHLE_HUMAN

YTDWVDDQRPENYREALGDVVGDYNFICPALEFTKKFSEWGNNAFFYYFE450

SOX13_HUMAN

EKQPYYEEQARLSRQHLEKYPDYKYKPRPKRTCIVEGKRLRVGEYKALMR500

Q53Y25_HUMAN

RFVSQVESDTGDRKQIYN1L,STLGLRPSTTDCDIVRRACESVSTRAAHMC382

RAD_HUMAN

Q7ZVA2_BRARE

Q6LCT9_HUMAN

IAPP_HUMAN

CHLE_HUMAN

HRSSKLPWPEWMGVMHGYEIEFVFGLPLERRDNYTKAEEILSRSIVKRWA500

SOX13_HUMAN

TRRQDARQSYVIPPQAGQVQMSSSDVLYPRAAGMPLAQPLVEHYVPRSLD550

Q53Y25_HUMAN

SAGLAGV1NRMRESRSEDVMRITVGVDGSVYKLHPSFKERFHASVRRLTP432

RAD_HUMAN

Q7ZVA2_BRARE

Q6LCT9_HUMAN

IAPP_HUMAN

CHLE_HUMAN

NFAKYGNPNETQNNSTSWPVFKSTEQKYLTLNTESTR1MTKLRAQQCRFW550

SOX13_HUMAN

PNMPVIVNTCSLREEGEGTDDRHSVADGEMYRYSEDEDSEGEEKSDGSWW 600

Q53Y25_HUMAN SCEITFIESEEGSGRGAALVSAVACKKACMLGQ 465

RAD_HUMAN

Q7ZVA2_BRARE

Q6LCT9_HUMAN

IAPP_HUMAN

CHLE_HUMAN

TSFFPKVLEMTGNIDEAEWEWK.AGFHRWNNYMMDWKNQFNDYTSKKESCV600

SOX13_HUMAN

CSQTDPRLGGPGPFSSGEDLVPTRWAQPANLRLCWYLDLFVPQKMGKAVH650

Q53Y25_HUMAN

RAD_HUMAN

Q7ZVA2_BRARE

Q6LCT9_HUMAN

IAPP_HUMAN -

CHLE_HUMAN GL ,.-602

S0X13_HUMAN

LADTFMRGEAPSLPEERVGLGGQELQYGHGLSRLSTSAPRAYGQGTLYDS700

Q53Y25_HUMAN ..

RAD_HUMAN

Q7ZVA2_BRARE

Q6LCT9_HUMAN ,..

IAPP_HUMAN

CHLE_HUMAN ,

S0X13_HUMAN

PLLQVSIHLGYG1YRPVSLGSHALFPFLSWLDQPLWDQHPSHTPPDCSSI750

Q53Y25_HUMAN

RAD_HUMAN

Q7ZVA2_BRARE — ^.

Q6LCT9_HUMAN

IAPP_HUMAN

CHLE_HUMAN

S0X13_HUMAN

TRl AL.YF VQKGL A VPCCFHLCQA YC ALAA VC VRVH VCVpHLFIHCTRYLL 800

Q53Y25_HUMAN ..

RAD_HUMAN

Q7ZVA2_BRARE -

Q6LCT9_HUMAN -

IAPP_HUMAN

CHLE_HUMAN -


S0X13_HUMAN

SAHYVPGTVAEFLWVCLSMPLLLLWGPLSVLLFVPKLLPLCQSGCLRFCV850

Q53Y25_HUMAN

RAD_HUMAN

Q7ZVA2_BRARE

Q6LCT9_HUMAN

IAPP_HUMAN

CHLE_HUMAN

S0X13_HUMAN SLCAFLSLSVLVSLQGPLFLSYLGVCPLPPVPSGFSGSM 889

Q53Y25_HUMAN

3.1.2.1 Sublinear Similarity Searches

The total number $N$ of characters of sequence in biosequence databases is growing exponentially. On the other hand, the size of the query sequences is basically fixed; for example, a protein sequence's length is bounded by 1,500 and averages 300. So designers of efficient computational methods should be principally concerned with how the time to perform such a search grows as a function of $N$. Yet all currently used methods take an amount of time that grows linearly in $N$; that is, they are $O(N)$ algorithms. This includes not only rigorous methods such as the dynamic programming algorithms mentioned above but also the popular heuristics FASTA and BLASTA. Even the systolic array chips described above do not change this. When a database increases in size by a factor of 1,000, all of these $O(N)$ methods take 1,000 times longer to search that database. Using the timing estimates given above, it follows that while a custom chip may take about 3 seconds to search 10 million amino acids or nucleotides, it will take 3,000 seconds, or about 50 minutes, to search 10 billion symbols. And this is the fastest of the linear methods: BLASTA will take hours, and the Smith-Waterman algorithm will take months. One could resort to massive parallelism, but such machinery is beyond the budget of most investigators, and it is unlikely that speedups due to improvements in hardware technology will keep up with sequencing rates in the next decade.

What would be very desirable, if not essential, is to have search methods with computing time sublinear in $N$, that is, $O(N^{\alpha})$ for some $\alpha < 1$. For example, suppose there is an algorithm that takes $O(N^{0.5})$ time, which is to say that as $N$ grows, the time taken grows as the square root of $N$. If the algorithm takes about 10 seconds on a 10 million symbol database, then on 10 billion symbols it will take about $1{,}000^{0.5} \approx 31$ times longer, or about 5 minutes. Note that while an $O(N^{0.5})$ algorithm may be slower than an $O(N)$ algorithm on 10 million symbols, it may be faster on 10 billion. Figure 3.9 illustrates this "crossover": in this figure, the size of $N$ at which the $O(N^{0.5})$ algorithm overtakes the $O(N)$ algorithm is approximately $1 \times 10^{8}$. Similarly, an $O(N^{0.2})$ algorithm that takes, say, 15 seconds on 10 million symbols will take about 1 minute, or only 4 times longer, on 10 billion. To forcefully illustrate the point, we chose to let our examples be slower at $N$ = 10 million than the competing $O(N)$ algorithm. As will be seen in a moment, a sublinear algorithm does exist that is actually already much faster on databases of size 10 million.
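The scaling arithmetic in the two preceding paragraphs can be checked with a short calculation. The sketch below assumes that running time is exactly proportional to $N^{\alpha}$, ignoring constant overheads and lower-order terms. The function names `scaled_time` and `crossover_size` are illustrative and do not come from the text.

```python
def scaled_time(t0: float, n0: float, n: float, alpha: float) -> float:
    """Extrapolate a running time t0 measured at size n0 to size n,
    assuming time grows as N**alpha."""
    return t0 * (n / n0) ** alpha


def crossover_size(t_lin: float, t_sub: float, n0: float, alpha: float) -> float:
    """Database size at which an O(N**alpha) method (t_sub seconds at n0)
    takes the same time as an O(N) method (t_lin seconds at n0).

    Solves t_lin * (N / n0) = t_sub * (N / n0) ** alpha for N.
    """
    return n0 * (t_sub / t_lin) ** (1.0 / (1.0 - alpha))


N0 = 10_000_000        # 10 million symbols
N1 = 10_000_000_000    # 10 billion symbols

linear = scaled_time(3.0, N0, N1, 1.0)    # custom chip: 3 s at N0
sqrt_n = scaled_time(10.0, N0, N1, 0.5)   # hypothetical O(N^0.5) method
fifth = scaled_time(15.0, N0, N1, 0.2)    # hypothetical O(N^0.2) method
cross = crossover_size(3.0, 10.0, N0, 0.5)

print(f"O(N)     at 10 billion symbols: {linear / 60:.1f} minutes")
print(f"O(N^0.5) at 10 billion symbols: {sqrt_n / 60:.1f} minutes")
print(f"O(N^0.2) at 10 billion symbols: {fifth / 60:.1f} minutes")
print(f"O(N^0.5) overtakes O(N) near N = {cross:.2e}")
```

Under these assumptions the crossover lands slightly above $10^{8}$ symbols, in line with the value read off Figure 3.9.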

The other important thing to note is that we are not considering heuristic algorithms here. What we desire is nothing less than algorithms that accomplish exactly the same computational task of complete comparison as the dynamic programming algorithms, but are much faster because the computation is performed in a clever way.


A recent result on the approximate string-matching problem under the simple unit-cost scheme of Figure 3.1 portends the possibility of truly sublinear algorithms for the general problem. For relatively stringent matches, this new algorithm is 3 to 4 orders of magnitude more efficient than the equivalent dynamic programming computation on a database of 1 million characters.
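For reference, the dynamic programming computation that serves as the baseline here can be sketched as follows. This is a minimal version of the standard column-by-column recurrence for approximate string matching under unit costs, taking time proportional to the query length times the database length. It is not the sublinear algorithm itself, and the function name `approx_match_ends` is an assumption made for illustration.

```python
def approx_match_ends(query: str, text: str, max_diff: int) -> list[int]:
    """Return the 1-based end positions in `text` at which some substring
    matches `query` with at most `max_diff` insertions, deletions, or
    substitutions (each costing 1)."""
    p = len(query)
    # col[i] holds the fewest differences between query[:i] and any
    # substring of the text ending at the current text position.
    col = list(range(p + 1))  # before any text: i deletions for query[:i]
    ends = []
    for j, ch in enumerate(text, start=1):
        prev_diag = col[0]    # entry for (i - 1, j - 1)
        col[0] = 0            # a match may start anywhere in the text
        for i in range(1, p + 1):
            cost = 0 if query[i - 1] == ch else 1
            best = min(prev_diag + cost,  # match or substitution
                       col[i] + 1,        # extra character in the text
                       col[i - 1] + 1)    # extra character in the query
            prev_diag = col[i]            # save (i, j - 1) before overwriting
            col[i] = best
        if col[p] <= max_diff:
            ends.append(j)
    return ends


# Exact search, then a search allowing one difference.
print(approx_match_ends("abc", "xxabcxx", 0))
print(approx_match_ends("abc", "xxabdxx", 1))
```

Every text character triggers a full pass over the query, which is why the cost of this baseline grows linearly with the database size.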


Figure 3.9: Sublinear versus linear algorithms

On the other hand, the approximate string-matching problem is a special case of the more biologically relevant computation that involves more general scoring schemes such as the ones in Figures 3.4 and 3.5, and a sublinear algorithm for the general problem has yet to be achieved.

We conclude with a few more details on this sublinear algorithm. For a search of matches to a query of length $P$ with $D$ or fewer differences, the quantity $\varepsilon = D/P$ is the maximum fraction of differences permitted per unit length of the query and is called the mismatch ratio. Searching for such an approximate string match over a database of length $N$ can be accomplished in $O(D N^{\mathrm{pow}(\varepsilon)} \log N)$ expected time with the new algorithm. The exponent $\mathrm{pow}(\varepsilon)$ is an increasing and concave function of $\varepsilon$ that is 0 when $\varepsilon = 0$ and depends on the size $|\psi|$ of the underlying alphabet. The algorithm is superior to the $O(N)$ algorithms and truly sublinear in $N$ when $\varepsilon$ is small enough to guarantee that $\mathrm{pow}(\varepsilon) < 1$. For example, $\mathrm{pow}(\varepsilon)$ is less than 1 when $\varepsilon < 33$ percent for $|\psi| = 4$ (DNA alphabet) and when $\varepsilon < 56$ percent for $|\psi| = 20$ (protein alphabet). More specifically, $\mathrm{pow}(\varepsilon) \le 0.22 + 2.3\varepsilon$ when $|\psi| = 4$ and $\mathrm{pow}(\varepsilon) \le 0.17 + 1.4\varepsilon$ when $|\psi| = 20$. So, for DNA, the algorithm takes a maximum of $O(N^{0.5})$ time when $\varepsilon$ is 12 percent, and for proteins, a maximum of $O(N^{0.5})$ time when $\varepsilon$ is 26 percent. The logic used to prove these bounds is coarse, and, in practice, the performance of these methods is much better than the bounds indicate. If these results can be extended to handle the more general problem of arbitrary scoring tables, the impact on the field could be great.
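The linear upper bounds on the exponent quoted above are easy to evaluate for a given query. The sketch below is only a calculator for those two bounds; it does not implement the search algorithm, and the names `mismatch_ratio`, `pow_upper_bound`, and `expected_time_growth` are hypothetical. The growth term is reported up to an unknown constant factor.

```python
import math

# Linear upper bounds on pow(eps) quoted in the text, keyed by alphabet size:
# (intercept, slope) such that pow(eps) <= intercept + slope * eps.
POW_BOUNDS = {4: (0.22, 2.3), 20: (0.17, 1.4)}


def mismatch_ratio(max_differences: int, query_length: int) -> float:
    """eps = D / P, the fraction of differences allowed per query symbol."""
    return max_differences / query_length


def pow_upper_bound(eps: float, alphabet_size: int) -> float:
    """Upper bound on the exponent pow(eps) for DNA (4) or protein (20)."""
    if alphabet_size not in POW_BOUNDS:
        raise ValueError("bounds are only quoted for alphabet sizes 4 and 20")
    intercept, slope = POW_BOUNDS[alphabet_size]
    return intercept + slope * eps


def expected_time_growth(max_differences: int, n: float, eps: float,
                         alphabet_size: int) -> float:
    """Growth term D * N**pow(eps) * log N, up to a constant factor."""
    exponent = pow_upper_bound(eps, alphabet_size)
    return max_differences * n ** exponent * math.log(n)


# A protein-length query (P = 300) allowing D = 36 differences, as DNA.
eps = mismatch_ratio(36, 300)
print(f"eps = {eps:.2f}, exponent bound (DNA) = {pow_upper_bound(eps, 4):.3f}")
print(f"exponent bound (protein, eps = 0.26) = {pow_upper_bound(0.26, 20):.3f}")
```

Both bounds stay below 1 at the thresholds given in the text (33 percent for DNA, 56 percent for proteins), which is the condition for the search to be sublinear in $N$.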

3.1.2.2 Open Problems

While progress is continually being made on existing problems in sequence comparison, new problems continue to arise. A fundamental issue is the definition of similarity. We have focused here only on the insertion-deletion-substitution model of comparison and some small variations. Some authors, namely Altschul and Erickson, have looked at nonadditive scoring schemes that are intended to reflect the probability of finding a given alignment by chance. A fundamental change in the optimization criterion for alignment creates a new set of algorithmic problems.

What about fundamentally speeding up sequence comparisons? The best lower bound placed on algorithms for comparing two sequences of length $N$ is $\Omega(N \log N)$, yet the fastest algorithm takes $O(N^2 / \log^2 N)$ time (Masek and Paterson, 1980). Can this gap be narrowed, either from above (finding faster algorithms) or below (finding lower bounds that are higher)? Can we perform faster database searches for the case of generalized Levenshtein scores, as is suggested by the results given above for the approximate string-matching problem? Speeding up database searches is very important. Are there other effective ways to parallelize such searches or to exploit preprocessing of the databases, such as an index?

Biologists are interested in searching databases for patterns other than given strings or regular expressions. Recently, fast algorithms have been developed for finding approximate repeats, for example, finding a pattern that matches some string X and then, 5 to 10 symbols to the right, matches the same string modulo 5 percent differences. Many DNA structures are induced by base pairing, and locating them can be viewed as finding approximate palindromes separated by a given range of spacing. More intricate patterns for protein motifs and secondary structure are suggested by systems such as QUEST [4, 79, 104], all of which pose problems that could use algorithmic refinement. Finally, biologists compare objects other than sequences. For example, the partial sequence information of a restriction map can be viewed as a string on which one has placed a large number of beads of, say, eight colors, at various positions along the string. Given two such maps, are they similar? This problem has been examined by several authors [106]. There are still fundamental questions as to what the measure of similarity should be and how to design efficient algorithms for each. There has also been work on comparing phylogenetic trees and chromosome staining patterns [176]. Indubitably the list will continue to grow.
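To make the approximate-repeat pattern concrete, the following is a naive sketch of such a search. It is not one of the fast algorithms referred to above: it simply tries every window and every allowed gap, and it counts substitutions only (no insertions or deletions). The function name `approximate_repeats` and its parameters are assumptions chosen for illustration.

```python
def approximate_repeats(text: str, length: int, max_mismatches: int,
                        min_gap: int = 5, max_gap: int = 10):
    """Find windows of `length` symbols that recur min_gap..max_gap symbols
    to the right with at most `max_mismatches` substituted positions.

    Returns a list of (first_start, second_start, mismatches), 0-based.
    """
    hits = []
    last_start = len(text) - 2 * length - min_gap
    for i in range(last_start + 1):
        first = text[i:i + length]
        for gap in range(min_gap, max_gap + 1):
            start = i + length + gap
            if start + length > len(text):
                break  # larger gaps would run past the end of the text
            second = text[start:start + length]
            mismatches = sum(a != b for a, b in zip(first, second))
            if mismatches <= max_mismatches:
                hits.append((i, start, mismatches))
    return hits


# A 20-symbol unit, a 7-symbol spacer, then a copy with one substitution.
# One mismatch in 20 symbols corresponds to the 5 percent level in the text.
unit = "ACGGTCATTGCAAGCTTAGC"
copy = unit[:9] + "A" + unit[10:]
sequence = unit + "TTTTTTT" + copy
for hit in approximate_repeats(sequence, length=20, max_mismatches=1):
    print(hit)
```

The work done is proportional to the text length times the window length times the number of gaps tried, so this version is suitable only for short sequences.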
