Biological Sequence Analysis
Biological Sequence Analysis
Department of Computer Science and Engineering University of Washington Box 352350 Seattle, Washington, U.S.A. 98195-2350
This material is based upon work supported in part by the National Science Foundation and DARPA under grant DBI-9601046, and by the National Science Foundation under grant DBI-9974498.
Contents
Preface 1 Basics of Molecular Biology 1.1 1.2 Proteins . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1.1.1 1.2.1 1.2.2 1.2.3 1.3 1.4 1.5 1.6 Classication of the Amino Acids . . . . . . . . . . . . . . . . . . . . . . . . . . . Structure of a Nucleotide . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Base Pair Complementarity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Size of DNA molecules . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . DNA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 2 2 3 4 4 5 5 5 6 6 6 7 7 9 9 9
RNA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Residues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . DNA Replication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Synthesis of RNA and Proteins . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1.6.1 1.6.2 Transcription in Prokaryotes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Translation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Basics of Molecular Biology (continued) 2.1 2.2 2.3 2.4 2.5 2.6 2.7 2.8 Course Projects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Translation (continued) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Prokaryotic Gene Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 Prokaryotic Genome Organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 Eukaryotic Gene Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 Eukaryotic Genome Organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 Goals and Status of Genome Projects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 Sequence Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 13
Sequence Similarity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 i
CONTENTS
3.2
ii
Biological Motivation for Studying Sequence Similarity . . . . . . . . . . . . . . . . . . . . 13 3.2.1 3.2.2 Hypothesizing the Function of a New Sequence . . . . . . . . . . . . . . . . . . . . 13 Researching the Effects of Multiple Sclerosis . . . . . . . . . . . . . . . . . . . . . 14
The String Alignment Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 An Obvious Algorithm for Optimal Alignment . . . . . . . . . . . . . . . . . . . . . . . . 15 Asymptotic Analysis of Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 18
Alignment by Dynamic Programming 4.1 4.1.1 4.1.2 4.1.3 4.2 4.2.1 4.2.2
Computing an Optimal Alignment by Dynamic Programming . . . . . . . . . . . . . . . . . 18 Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 Recovering the Alignments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 Time Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20 An Obvious Local Alignment Algorithm . . . . . . . . . . . . . . . . . . . . . . . 20 Set-Up for Local Alignment by Dynamic Programming . . . . . . . . . . . . . . . . 21 22
Local Alignment, and Gap Penalties 5.1 5.1.1 5.1.2 5.2 5.3
Space Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24 Optimal Alignment with Gaps . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24 5.3.1 5.3.2 5.3.3 5.3.4 Motivations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24 Afne Gap Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25 Dynamic Programming Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . 25 Time Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
5.4 6
Multiple Sequence Alignment 6.1 6.1.1 6.1.2 6.2 6.3 6.4 6.5
Biological Motivation for Multiple Sequence Alignment . . . . . . . . . . . . . . . . . . . 27 Representing Protein Families . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27 Repetitive Sequences in DNA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
Formulation of the Multiple String Alignment Problem . . . . . . . . . . . . . . . . . . . . 28 Computing an Optimal Multiple Alignment by Dynamic Programming . . . . . . . . . . . . 29 NP-completeness . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29 An Approximation Algorithm for Multiple String Alignment . . . . . . . . . . . . . . . . . 31
CONTENTS
6.5.1 6.5.2 6.5.3 6.5.4 6.6 6.7 7
Weight Matrices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39 A Simple Site Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39 How Informative is the Log Likelihood Ratio Test? . . . . . . . . . . . . . . . . . . . . . . 40 Nonnegativity of Relative Entropy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42 44
Experimental Determination of Binding Energy . . . . . . . . . . . . . . . . . . . . . . . . 44 Computational Estimation of Binding Energy . . . . . . . . . . . . . . . . . . . . . . . . . 45 Finding Instances of an Unknown Site . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45 47
10.1 Greedy Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48 10.2 Gibbs Sampler . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48 10.3 Other Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50 11 Correlation of Positions in Sequences 51
11.1 Nonuniform Versus Uniform Distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . 51 11.2 Dinucleotide Frequencies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52 11.3 Disymbol Frequencies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53 11.4 Coding Sequence Biases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53 11.4.1 Codon Biases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54 11.4.2 Recognizing Genes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54 12 Maximum Subsequence Problem 55
CONTENTS
iv
12.2 Maximum Subsequence Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55 12.3 Finding All High Scoring Subsequences . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56 13 Markov Chains 59
13.1 Introduction to Markov Chains . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59 13.2 Biological Application of Markov Chains . . . . . . . . . . . . . . . . . . . . . . . . . . . 60 13.3 Using Markov Chains to Find Genes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61 14 Using Interpolated Context Models to Find Genes 62
14.1 Problems with Markov Chains for Finding Genes . . . . . . . . . . . . . . . . . . . . . . . 62 14.2 Glimmer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62 14.2.1 Training Phase . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62 14.2.2 Identication Phase . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63 14.2.3 Resolving Overlap . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64 15 Start Codon Prediction 65
15.1 Experimental Results of Glimmer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65 15.2 Start Codon Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65 15.3 Finding SD Sites . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67 16 RNA Secondary Structure Prediction 16.1 RNA Secondary Structure 69
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
16.2 Notation and Denitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70 16.3 Anatomy of Secondary Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70 16.4 Free Energy Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71 16.5 Dynamic Programming Arrays . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72 17 RNA Secondary Structure Prediction (continued) 17.1.1 17.1.2 17.1.3 17.1.4 73
17.2 Order of Computation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74 17.3 Speeding Up the Multibranched Computation . . . . . . . . . . . . . . . . . . . . . . . . . 75 17.4 Running Time . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
CONTENTS
18 Speeding Up Internal Loop Computations
v 77
18.1 Assumptions About Internal Loop Free Energy . . . . . . . . . . . . . . . . . . . . . . . . 77 18.2 Asymmetry Penalty . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77 18.3 Comparing Interior Pairs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78 Bibliography 81
Preface
These are the lecture notes from CSE 527, a graduate course on computational molecular biology I taught at the University of Washington in Winter 2000. The topic of the course was Biological Sequence Analysis. These notes are not intended to be a survey of that area, however, as there are numerous important results that I would have liked to cover but did not have time. I am grateful to Phil Green, Dick Karp, Rune Lyngs, Larry Ruzzo, and Rimli Sengupta, who helped me both with overview and with technical points. I am thankful for the students who attended faithfully, served as notetakers, asked embarassing questions, made perceptive comments, carried out exciting projects, and generally make teaching exciting and rewarding. Martin Tompa
Lecture 1
1.1. Proteins
Proteins have a variety of roles that they must fulll: 1. They are the enzymes that rearrange chemical bonds. 2. They carry signals to and from the outside of the cell, and within the cell. 3. They transport small molecules. 4. They form many of the cellular structures. 5. They regulate cell processes, turning them on and off and controlling their rates. This variety of roles is accomplished by the variety of proteins, which collectively can assume a variety of three-dimensional shapes.
A proteins three-dimensional shape, in turn, is determined by the particular one-dimensional composition of the protein. Each protein is a linear sequence made of smaller constituent molecules called amino acids. The constituent amino acids are joined by a backbone composed of a regularly repeating sequence of bonds. (See [31, Figure 1.4].) There is an asymmetric orientation to this backbone imposed by its chemical structure: one end is called the N-terminus and the other end the C-terminus. This orientation imposes directionality on the amino acid sequence. There are 20 different types of amino acids. The three-dimensional shape the protein assumes is determined by the specic linear sequence of amino acids from N-terminus to C-terminus. Different sequences of amino acids fold into different three-dimensional shapes. (See, for example, [10, Figure 1.1].) Protein size is usually measured in terms of the number of amino acids that comprise it. Proteins can range from fewer than 20 to more than 5000 amino acids in length, although an average protein is about 350 amino acids in length. Each protein that an organism can produce is encoded in a piece of the DNA called a gene (see Section 1.6). To give an idea of the variety of proteins one organism can produce, the single-celled bacterium E. coli has about 4300 different genes. Humans are believed to have about 50,000 different genes (the exact number as yet unresolved), so a human has only about 10 times as many genes as E. coli. The number of proteins that can be produced by humans greatly exceeds the number of genes, however, because a substantial fraction of the human genes can each produce many different proteins through a process called alternative splicing.
2. Negatively charged (and therefore acidic) amino acids (2). Aspartic acid Glutamic acid Asp Glu D E
3. Polar amino acids (7). Though uncharged overall, these amino acids have an uneven charge distribution. Because of this uneven charge distribution, these amino acids can form hydrogen bonds with water. As a consequence, polar amino acids are called hydrophilic, and are often found on the outer surface of folded proteins, in contact with the watery environment of the cell.
4. Nonpolar amino acids (8). These amino acids are uncharged and have a uniform charge distribution. Because of this, they do not form hydrogen bonds with water, are called hydrophobic, and tend to be found on the inside surface of folded proteins. Alanine Isoleucine Leucine Methionine Phenylalanine Proline Tryptophan Valine Ala Ile Leu Met Phe Pro Trp Val A I L M F P W V
Although each amino acid is different and has unique properties, certain pairs have more similar properties than others. The two nonpolar amino acids leucine and isoleucine, for example, are far more similar to each other in their chemical and physical properties than either is to the charged glutamic acid. In algorithms for comparing proteins to be discussed later, the question of amino acid similarity will be important.
1.2. DNA
DNA contains the instructions needed by the cell to carry out its functions. DNA consists of two long interwoven strands that form the famous double helix. (See [14, Figure 3-3].) Each strand is built from a small set of constituent molecules called nucleotides.
#
" ! %$
Two hydrogen bonds form between A and T, whereas three form between C and G. (See [14, Figure 3-5].) This makes C-G bonds stronger than A-T bonds. If two DNA strands consist of complementary bases, under normal cellular conditions they will hybridize and form a stable double helix. However, the two strands will only hybridize if they are in antiparallel conguration. This means that the sequence of one strand, when read from the end to the end, must be complementary, base for base, to the sequence of the other strand read from to . (See [14, Figure 3-4(b) and 3-3].)
1.3. RNA
Chemically, RNA is very similar to DNA. There are two main differences: 1. RNA uses the sugar ribose instead of deoxyribose in its backbone (from which RNA, RiboNucleic Acid, gets its name). 2. RNA uses the base uracil (U) instead of thymine (T). U is chemically similar to T, and in particular is also complementary to A. RNA has two properties important for our purposes. First, it tends to be single-stranded in its normal cellular state. Second, because RNA (like DNA) has base-pairing capability, it often forms intramolecular hydrogen bonds, partially hybridizing to itself. Because of this, RNA, like proteins, can fold into complex three-dimensional shapes. (For an example, see http://www.ibc.wustl.edu/zuker/rna/hammerhead.html.) RNA has some of the properties of both DNA and proteins. It has the same information storage capability as DNA due to its sequence of nucleotides. But its ability to form three-dimensional structures allows it to
"
!
"
"
!
A with T G with C
have enzymatic properties like those of proteins. Because of this dual functionality of RNA, it has been conjectured that life may have originated from RNA alone, DNA and proteins having evolved later.
1.4. Residues
The term residue refers to either a single base constituent from a nucleotide sequence, or a single amino acid constituent from a protein. This is a useful term when one wants to speak collectively about these two types of biological sequences.
"
"
!
"
"
"
!
!
1.6.2. Translation
How is protein synthesized from mRNA? This process, called translation, is not as simple as transcription, because it proceeds from a 4 letter alphabet to the 20 letter alphabet of proteins. Because there is not a oneto-one correspondence between the two alphabets, amino acids are encoded by consecutive sequences of 3 possible permutations, nucleotides, called codons. (Taking 2 nucleotides at a time would give only whereas taking 3 nucleotides yields possible permutations, more than sufcient to encode the 20 different amino acids.) The decoding table is given in Table 1.1, and is called the genetic code. It is rather amazing that this same code is used almost universally by all organisms. U Phe Phe Leu Leu Leu Leu Leu Leu Ile Ile Ile Met Val Val Val Val C Ser Ser Ser Ser Pro Pro Pro Pro Thr Thr Thr Thr Ala Ala Ala Ala A UAU Tyr [Y] UAC Tyr [Y] UAA STOP UAG STOP CAU His [H] CAC His [H] CAA Gln [Q] CAG Gln [Q] AAU Asn [N] AAC Asn [N] AAA Lys [K] AAG Lys [K] GAU Asp [D] GAC Asp [D] GAA Glu [E] GAG Glu [E]
UUU UUC UUA UUG CUU CUC CUA CUG AUU AUC AUA AUG GUU GUC GUA GUG
[F] [F] [L] [L] [L] [L] [L] [L] [I] [I] [I] [M] [V] [V] [V] [V]
UCU UCC UCA UCG CCU CCC CCA CCG ACU ACC ACA ACG GCU GCC GCA GCG
[S] [S] [S] [S] [P] [P] [P] [P] [T] [T] [T] [T] [A] [A] [A] [A]
UGU UGC UGA UGG CGU CGC CGA CGG AGU AGC AGA AGG GGU GGC GGA GGG
There is a necessary redundancy in the code, since there are 64 possible codons and only 20 amino acids. Thus each amino acid (with the exceptions of Met and Trp) is encoded by synonymous codons, which are interchangeable in the sense of producing the same amino acid. Only 61 of the 64 codons are used to encode amino acids. The remaining 3, called STOP codons, signify the end of the protein.
G Cys [C] Cys [C] STOP Trp [W] Arg [R] Arg [R] Arg [R] Arg [R] Ser [S] Ser [S] Arg [R] Arg [R] Gly [G] Gly [G] Gly [G] Gly [G]
U C A G U C A G U C A G U C A G
Ribosomes are the molecular structures that read mRNA and produce the encoded protein according to the genetic code. Ribosomes are large complexes consisting of both proteins and a type of RNA called ribosomal RNA (rRNA). The process by which ribosomes translate mRNA into protein makes use of yet a third type of RNA called transfer RNA (tRNA). There are 61 different transfer RNAs, one for each nontermination codon. Each tRNA folds (see Section 1.3) to form a cloverleaf-shaped structure. This structure produces a pocket that complexes uniquely with the amino acid encoded by the tRNAs associated codon, according to Table 1.1. The unique t is accomplished analogously to a key and lock mechanism. Elsewhere on the tRNA is the anticodon, three consecutive bases that are complementary and antiparallel to the associated codon, and exposed for use by the ribosome. The ribosome brings together each codon of the mRNA with its corresponding anticodon on some tRNA, and hence its encoded amino acid. (See [14, Figure 4-4].)
Lecture 2
!
!
"
"
"
!
! "
!
!
"
!
"
10
process continues until the ribosome detects one of the STOP codons, at which point it releases the mRNA and the completed protein.
!
"
$ #
!
"
11
be many kilobases in size. One fact that is relevant to our later computational studies is that the presence of introns makes it much more difcult to identify the locations of genes computationally, given the genome sequence. Another important difference between prokaryotic and higher eukaryotic genes is that, in the latter, there can be multiple regulatory regions that can be quite far from the coding region, can be either upstream or downstream from it, and can even be in the introns.
12
Lecture 3
13
14
The result was the identication of certain bacterial and viral proteins that were confused with the myelin sheath proteins.
The special character represents the insertion of a space, representing a deletion from its sequence (or, equivalently, an insertion in the other sequence). We can evaluate the goodness of such an alignment using a scoring function. For example, if an exact match between two characters scores , and every mismatch or deletion (space) scores , then the alignment above has score
This example shows only one possible alignment for the given strings. For any pair of strings, there are many possible alignments. The following denitions generalize this example. Denition 3.1: If and are each a single character or space, then and . is called the scoring function.
In the example above, for any two distinct characters and , , and . If one were designing a scoring function for comparing amino acid sequences, one would certainly want to incorporate into it the physico-chemical similarities and differences among the amino acids, such as those described in Section 1.1.1.
! "
the myelin sheath proteins were sequenced, a protein database was searched for similar bacterial and viral sequences, and laboratory tests were performed to determine if the T-cells attacked these same proteins.
a -
c c
b b
c -
d d
b -
For example, if
acbcdb, then
and
b.
Finding an optimal alignment of and is the way in which we will measure their similarity. For the two strings given in the example above, is the alignment shown optimal? We will next present some algorithms for computing optimal alignments, which will allow us to answer that question.
The obvious algorithm for optimal alignment is given in Figure 3.1. This algorithm works correctly, but is it a good algorithm? If you tried running this algorithm on a pair of strings each of length 20 (which is ridiculously modest by biology standards), you would nd it much too slow to be practical. The program would run for an hour on such inputs, even if the computer can perform a billion basic operations per second.
Suppose we are given strings and , and assume for the moment that , subject to the reasonable restriction that an arbitrary scoring function restriction, there is never a reason to align a pair of spaces.
"# !
and
acbcdb and
cadbd, then
ac--bcdb and
where
is
1.
, and and (without changing the order of the remaining characters) leaves
Denition 3.3: Let and be strings. An alignment contain space characters, where
maps
and
into strings
and
that may
"
16
Figure 3.1: Enumerating all Alignments to Find the Optimal subseThe running time analysis of this algorithm proceeds as follows. A string of length has 1 Thus, there are quences of length . pairs of subsequences each of length . Consider one such pair. Since there are characters in , only of which are matched with characters in , there will be characters in unmatched to characters in . Thus, the alignment has length . We must look up and add the score of each pair in the alignment, so the total number of basic operations is at least
(The equality has a pretty combinatorial explanation that is worth discovering. The last inequality follows from Stirlings approximation [39].) Thus, for , this algorithm requires more than basic operations.
Denition 3.5: Let and be functions. Then such that, for all sufciently large, .
1
denotes the number of combinations of The notation combinatorial mathematics, for instance Roberts [39].
$! #
"
for
"
"
"
"
" "
17 are both
and works.
Lecture 4
The value of an optimal alignment of and is then . The crux of dynamic programming is to solve the more general problems of computing all values with and , in order of increasing and . Each of these will be relatively simple to compute, given the values already computed and/or for smaller and/or , using a recurrence relation. To start the process, we need a basis for .
BASIS :
This formula can be understood by considering an optimal alignment of the rst characters from and the rst characters from . In particular, consider the last aligned pair of characters in such an alignment. This last pair must be one of the following:
of 2.
, in which case the remaining alignment excluding this pair must have value 18
1.
, in which case the remaining alignment excluding this pair must be an optimal alignment and (i.e., must have value ), or , or
R ECURRENCE : For
and
The basis for says that if characters of are to be aligned with 0 characters of all be matched with spaces. The basis for is analogous.
for
for
"
"
" "
and , our goal is to compute an optimal alignment of as the value of an optimal alignment of the strings
and and
19
The optimal alignment chooses whichever among these three possibilities has the greatest value.
4.1.1. Example
In aligning acbcdb and cadbd, the dynamic programming algorithm lls in the following values for from top to bottom and left to right, simply applying the basis and recurrence formulas. (As in the example of Section 3.3, assume that matches score , and mismatches and spaces score .) For instance, in the table below, the entry in row 4 and column 1 is obtained by computing .
The value of the optimal alignment is , and so can be read from the entry in the last row and last column. Thus, there is an alignment of acbcdb and cadbd that has value 2, so the alignment proposed in Section 3.3 with value 1 is not optimal. But how can one determine the optimal alignment itself, and not just its value?
"
"
" "
"
"
20
a -
c c
b -
c a
d d
b b
a -
c c
b a
c -
d d
b b
, and
a a
c d
b b
c -
d d
b -
. , the
Each of these has three matches, one mismatch, and three spaces, for a value of optimal alignment value.
Proof: This algorithm requires an table to be completed. Any particular entry is computed with a maximum of 6 table lookups, 3 additions, and a three-way maximum, that is, in time , a . Reconstructing a constant. Thus, the complexity of the algorithm is at most single alignment can then be done in time .
Theorem 4.1: The dynamic programming algorithm computes an optimal alignment in time
"
21
choices of , and choices of (excluding the length 0 substrings as choices). There are Using Theorem 4.1, it is not difcult to show that the time taken by this algorithm is . We will see in Section 5.1, however, that it is possible to compute the optimal local alignment in time , that is, the same time used for the optimal global alignment.
For example, let abcxdex. The prexes of empty string is both a prex and a sufx of .
For example, suppose abcxdex and xxxcde. Score a match as . Then , with cxd, cd, and alignment as
The dynamic programming algorithm for optimal local alignment is similar to the dynamic programming algorithm for optimal global alignment given in Section 4.1. It proceeds by lling in a table with the values , with increasing. The value of each entry is calculated according to a new basis and recurrence of for , given in Section 5.1. Unlike the global alignment algorithm, however, the value of the optimal local alignment can be any entry, whichever contains the maximum of all values of . The reason for this is that each entry represents an optimal pair of sufxes of a given pair of prexes. Since a sufx of a prex is just a substring, we nd the optimal pair of substrings by maximizing over all possible pairs .
c c +2
x -
d d +2
Denition 4.5: Let and be strings with and be the maximum value of an optimal (global) alignment of and all sufxes of .
" "
" "
" ! !
Denition 4.4: .
is a sufx of
if and only if
or
, for some
"
"
is a prex of
if and only if
or
, for some
, where
"
"
, let
Lecture 5
and
Then
and
since the optimal sufx to align with a string of length 0 is the empty sufx.
The formula looks very similar to the recurrence for the optimal global alignment in Section 4.1. Of course, the meaning is somewhat different and we have an additional term in the function. The recurrence is explained as follows. Consider an optimal alignment of a sufx of and a sufx of . There are four possible cases:
1.
and
value
The optimal alignment chooses whichever of these cases has greatest value.
22
4.
is
3.
is
2.
and
is
has
R ECURRENCE : for
and
"
"
23
5.1.1. Example
For example, let abcxdex and xxxcde, and suppose a match scores , and a mismatch or a space scores . The dynamic programming algorithm lls in the table of values from top to bottom and left to right, as follows:
The value of the optimal local alignment is . We can reconstruct optimal alignments as in Section 4.1.2, by retracing from any maximum entry to any zero entry:
Both alignments have three matches and one space, for a value of . You can also see from this diagram how the value was derived in the example following Denition 4.5, which said that for the same strings and , .
entries requires at most 6 table lookups, 3 Proof: Computing the value for each of the additions, and 1 max calculation. Reconstructing a single alignment can then be done in time .
!"
"
"
!
"
"
"
"
! !
"
24
5.3.1. Motivations
In certain applications, we may not want to have a penalty proportional to the length of a gap. 1. Mutations causing insertion or deletion of large substrings may be considered a single evolutionary event, and may be nearly as likely as insertion or deletion of a single residue. 2. cDNA matching: Biologists are very interested in learning which genes are expressed in which types of specialized cells, and where those genes are located in the chromosomal DNA. Recall from Section 2.5 that eukaryotic genes often consist of alternating exons and introns. The mature mRNA that leaves the nucleus after transcription has the introns spliced out. To study gene expression within specialized cells, one procedure is as follows: (a) Capture the mature mRNA as it leaves the nucleus. (b) Make complementary DNA (abbreviated cDNA) from the mRNA using an enzyme called reverse transcriptase. The cDNA is thus a concatenation of the genes exons. (c) Sequence the cDNA. (d) Match the sequenced cDNA against sequenced chromosomal DNA to nd the region of chromosomal DNA from which the cDNA derives. In this process we do not want to penalize heavily for the introns, which will match gaps in the cDNA. In general, the gap penalty may be some arbitrary function of the gap length . The best choice of this function, like the best choice of a scoring function, depends on the application. In the cDNA matching application, we would like the penalty to reect what is known about the common lengths of introns. In the next section we will see an time algorithm for the case when is an arbitrary linear afne function, and this is adequate for many applications. There are programs that use piecewise linear functions as gap penalties, and these may be more suitable in the cDNA matching application. There are time algorithms for the case when is concave downward (Galil and Giancarlo [17], Miller and Myers [35]). We could even implement an arbitrary function as a gap penalty function, but the known algorithm
25
for this requires cubic time (Needleman and Wunsch [37]), and such an algorithm is probably not useful in practice.
# gaps
# spaces .
. whose last pair matches whose last pair matches whose last pair matches
is the value of an optimal alignment of with . is the value of an optimal alignment of with a space.
BASIS :
R ECURRENCE : For
and
for
for
for
for
where
and
are
and
, since the spaces will be penalized as part of the gap. Our goal
and
26
The equation for (and analogously ) can be understood as taking the maximum of two cases: adding another space to an existing gap, and starting a new gap. To understand why starting a new gap can use , which includes the possibility of an alignment ending in a gap, consider that , so that is always dominated by , so will never be chosen by the max.
Proof: The algorithm proceeds as those we have studied before, but in this case there are three or four or calculate them matrices to ll in simultaneously, depending on whether you store the values of from the other three matrices when needed.
Theorem 5.3: An optimal global alignment with afne gap penalty can be computed in time
.
Lecture 6
27
28
leaves
, for
The question that arises next is how to assign a value to such an alignment. In a pairwise alignment, we simply summed the similarity score of corresponding characters. In the case of multiple string alignment, there are various scoring methods, and controversy around the question of which is best. We focus here on a scoring method called the sum-of-pairs score. Other methods are explored in the homework. Until now, we have been using a scoring function that assigns higher values to better alignments and lower values to worse alignments, and we have been trying to nd alignments with maximum value. For the that measures the distance between characters remainder of this lecture, we will switch to a function and . That is, it will assign higher values the more distant two strings are. In the case of two strings, we will thus be trying to minimize
In this denition we assume that the scoring function is symmetric. For simplicity, we will not discuss the issue of a separate gap penalty. Example 6.3: Consider the following alignment:
c c -
c a c
d d d
b b a
d d
Denition 6.4: An optimal SP (global) alignment of strings minimum possible sum-of-pairs value for these strings.
, and
for
Denition 6.2: The sum-of-pairs (SP) value for a multiple global alignment of the values of all pairwise alignments induced by .
of
! "
where
" "
1.
, and
29
that is, entries. Each entry depends on adjacent entries, corresponding to the possibilities for the last match in an optimal alignment: any of the subsets of the strings could participate in that match, except for the empty subset. The details of the algorithm itself and the recurrence are left as exercises for the reader. Because each of the entries can be computed in time proportional to , the running time of the algorithm is . If (as is typical for the length of proteins), it would be practical only for very small values of , perhaps 3 or 4. However, typical protein families have hundreds of members, so this algorithm is of no use in the motivational problem posed in Section 6.1. We would like an algorithm that works for in the hundreds too, which would be possible only if the running time were polynomial in both and . (In particular, should not appear in the exponent as it does in the expression .) Unfortunately, we are very unlikely to nd such an algorithm, which is a consequence of the following theorem: Theorem 6.5 (Wang and Jiang [50]): The optimal SP alignment problem is NP-complete. What NP-completeness means and what its consequences are will be discussed in the following section.
6.4. NP-completeness
In this section we give a brief introduction to NP-completeness, and how problems can be proved to be NP-complete. Denition 6.6: A problem has a polynomial time solution if and only if there is some algorithm that , where is a constant and is the size of the input. solves it in time Many familiar computational problems have polynomial time solutions:
(Theorem 4.1) ,
The last entry illustrates that having a polynomial time solution does not mean that the algorithm is practical. In most cases the converse, though, is true: an algorithm whose running time is not polynomial is likely to be impractical for all but the smallest size inputs. NP-complete problems are equivalent in the sense that if any one of them has a polynomial time solution, then all of them do. One of the major open questions in computer science is whether there is a polynomial
(Section 6.3).
2. sorting:
[12],
(Section 5.3.1),
! "
30
time solution for any of the NP-complete problems. Almost all experts conjecture strongly that the answer to this question is no. The bulk of the evidence supporting this conjecture, however, is only the failure to nd such a polynomial time solution in thirty years. In 1971, Cook dened the notion of NP-completeness and showed the NP-completeness of a small collection of problems, most of them from the domain of mathematical logic [11]. Roughly speaking, he dened NP-complete problems to be problems that have the property that we can verify in polynomial time whether a supplied solution is correct. For instance, if you did not have to compute an optimal SP alignment, but simply had to verify that a given alignment had SP value at most , a given integer, it would be easy to write a polynomial time algorithm to do so. Shortly after Cooks work, Karp recognized the wide applicability of the concept of NP-completeness. He showed that a diverse host of problems are each NP-complete [28]. Since then, many hundreds of natural problems from many areas of computer science and mathematics such as graph theory, combinatorial optimization, scheduling, and symbolic computation have been proven NP-complete; see Garey and Johnson [18] for details. Proving a problem to be NP-complete proceeds in the following way. Choose a known NP-complete problem . Show that has a polynomial time algorithm if it is allowed to invoke a polynomial time subroutine for , and vice versa. There are many computational biology problems that are NP-complete, yet in practice we still need to solve them somehow. There are different ways to deal with an NP-complete problem: 1. We might give up on the possibility of solving the problem on anything but small inputs, by using an exhaustive (nonpolynomial time) search algorithm. We can sometimes use dynamic programming or branch-and-bound techniques to cut down the running time of such a brute force exhaustive search. 2. We might give up guaranteed efciency by settling for an algorithm that is sufciently efcient on inputs that arise in practice, but is nonpolynomial on some worst-case inputs that (hopefully) do not arise in practice. There may be an algorithm that runs in polynomial time on average inputs, being careful to dene the input distribution so that the practical inputs are highly probable. 3. We might give up guaranteed optimality of solution quality by settling for an approximate algorithm that gives a suboptimal solution, especially if the suboptimal solution is provably not much worse than the optimal solution. (An example is given in Section 6.5.) 4. Heuristics (local search, simulated annealing, genetic algorithms, and many others) can also be used to improve the quality of solution or running time in practice. We will see several examples throughout the remaining lectures. However, rigorous analysis of heuristic algorithms is generally unavailable. 5. The problem to be solved in practice may be more specialized than the general one that was proved NP-complete. In the following section we will look at the approximation approach to nd a solution for the multiple string alignment problem.
31
The triangle inequality says that the distance along one edge of a triangle is at most the sum of the distances along the other two edges. Although intuitively plausible, be aware that not all distance measures used in biology obey the triangle inequality.
6.5.1. Algorithm
This can be done by running the dynamic programming algorithm of Section 4.1 on each of the pairs of strings in . Call the remaining strings in . Add these strings one at a time to a multiple alignment that initially contains only , as follows. Suppose are already aligned as ming algorithm of Section 4.1 on and to produce from those columns where spaces were added to get . To add , run the dynamic programand . Adjust by adding spaces to . Replace by .
Theorem 6.8: The approximation algorithm of Section 6.5.1 runs in time each of length at most .
Proof: By Theorem 4.1, each of the values required to compute can be computed in , so the total time for this portion is . After adding to the multiple string time alignment, the length of is at most , so the time to add all strings to the multiple string alignment is
of
strings. First nd
and , dene
2. Triangle Inequality:
"
1.
that
32
for all . This is because the algorithm used an optimal alignment of and Then , and , since . If the algorithm later adds spaces to both and , it does so in the same columns. Let be the optimal alignment, be the distance
, and
Theorem 6.9: That is, the algorithm of Section 6.5.1 produces an alignment whose SP value is less than twice that of the optimal SP alignment.
(triangle inequality)
(explained below)
occurs in
(Denition 6.7)
"
"
Note that
33
(denition of
Note that for small values of , the approximation is signicantly better than a factor of 2. Furthermore, the error analysis does not mean that the approximation solution is always times the optimal solution. It means that the quality of the solution is never worse than this, and may be better in practice.
"
34
6.7. Summary
Multiple sequence alignment is a very important problem in computational biology. It appears to be impossible to obtain exact solutions in polynomial time, even with very simple scoring functions. A variety of (provably) bounded approximation algorithms are known, and a number of heuristic algorithms have been suggested, but it still remains largely an open problem.
Lecture 7
36
TTGTGGC TTTTGAT AAGTGTC ATTTGCA CTGTGAG ATGCAAA GTGTTAA ATTTGAA TTGTGAT ATTTATT ACGTGAT ATGTGAG TTGTGAG CTGTAAC CTGTGAA TTGTGAC GCCTGAC TTGTGAT TTGTGAT GTGTGAA CTGTGAC ATGAGAC TTGTGAG Table 7.1: Positions 39 from 23 CRP Binding Sites [47]
A C G T
Table 7.2: Prole for CRP Binding Sites Given in Table 7.1
37
were given thousands of sites rather than just 23.) In order to do this, suppose that the sequence residues are from an alphabet of size . Consider a matrix where is the fraction of sequences in that have residue in position . Table 7.2 shows the matrix for the CRP sites given in Table 7.1. Such a matrix is called a prole. The prole shows the distribution of residues in each of the positions. For example, in column 1 of the matrix the residues are quite mixed, in column 2, T occurs of the time, etc.
is a site
For example, suppose we want to know the probability that a randomly chosen CRP binding site will be TTGTGAC. By using Equation (7.1) and Table 7.2,
Although this probability is small, it is the largest probability of any site sequence, because each position contains the most probable residue. Now form the prole from the sample of nonsites in the same way. Using the proles and , let us return to the question of whether a given sequence is more likely to be a site or nonsite. In order to do this, we dene the likelihood ratio.
1 "0
6%
5%
is a site is a nonsite
'%
'%
34
$
TTGTGAC
is a site
!
"
% '&
is a site
is a site
!
(7.1)
, is
38
Table 7.3: Log Likelihood Weight Matrix for CRP Binding Sites To illustrate, let , the set of all length seven sequences. The corresponding prole for all and . Then for TTGTGAC,
If is not small and some entries in and are small, then the likelihood ratio may be intractably large or small, causing numerical problems in the calculation. To alleviate this, we dene the log likelihood ratio.
1 0 "0
To test for sites, it is convenient to create a scoring matrix whose entries are the log likelihood ratios, that is, . Table 7.3 shows the weight matrix for the example CRP samples and we have been discussing. In order to compute , Denition 7.3 says to add the corresponding scores from : . A technical difculty arises when an entry is 0, because the corresponding entry is then . If the residue cannot possibly occur in position of any site for biological reasons, then there is no problem. More often, though, this is a result of having too small a sample of sites. In this case, there by a small positive number (see, for are various small sample correction formulas, which replace example, Lawrence et al. [29]), but we will not discuss them here.
1 0 "0
6%
5%
5% 6%
1 0 "0
5%
1 "0
! ! ! )( )( ! !! ! ! ! ! ! !
to a prespecied constant cutoff
has , and declare more ,
"
! !
( )
6%
5%
1 0
1 0
1 "0
1 0 0
!
Lecture 8
Relative Entropy
January 27, 2000 Notes: Anne-Louise Leutenegger
ATG ATG ATG ATG ATG GTG GTG TTG Table 8.1: Eight Hypothetical Translation Start Sites
39
"
!
!
5%
40
(b) 2 (c)
0.701
Table 8.2: (a) Prole, (b) Log Likelihood Weight Matrix, and (c) Positional Relative Entropies, for the Sites in Table 8.1, with Respect to Uniform Background Distribution , and
is the set of all possible values of some random variable . for a sample space assigns a probability
, and .
In our application, the sample space is the set of all length sequences. The site prole induces a probability distribution on this sample space according to Equation (7.1), as does the nonsite prole . Denition 8.4: Let and be probability distributions on the same sample space . The relative entropy (or information content, or Kullback-Leibler measure) of with respect to is denoted and is dened as follows:
, meaning both distributions have to every
"
"
41 to be 0 whenever
In these terms, the relative entropy is the expected value of when is picked randomly according to . That is, it is the expected log likelihood score of a randomly chosen site. Note that when and are the same distribution, the relative entropy will be zero. In general, the relative entropy measures how different the distributions and are. Since we want to be able to distinguish between sites and nonsites, we want the relative entropy to be large, and will use relative entropy as our measure of how informative the log likelihood ratio test is. When the sample space is all length sequences, and we assume independence of the not difcult to prove that the relative entropy satises
is the distribution
imposes on the th
When , the relative entropy is measured in bits. This will be the usual case, unless specically stated otherwise. Continuing Example 8.1, Table 8.2(c) shows the relative entropies for each nucleotide position separately. For instance, looking at position 2, residues A, C, and G do not contribute to the relative entropy (see Table 8.2(a)). Residue T contributes (see Tables 8.2(a) and (b)). Hence, . This means that there are 2 bits of information in position 2. If the residues were coded with 0 and 1 so that 00 = A, 01 = C, 10 = G, and 11 = T, only 2 bits (11) would be necessary to encode the fact that this residue is always T. Position 3 has the same relative entropy of 2. For position 1, the relative entropy is 0.7 so there are 0.7 bits of information, indicating that column 1 of Table 8.2(a) is more similar to the background distribution than columns 2 and 3 are. The total relative entropy of all three positions is 4.7. Example 8.6: Let us now modify Example 8.1 to see the effect of a nonuniform background distribution. Consider the same eight translation start sites of Table 8.1, but change the background distribution to , . The site prole matrix remains unchanged (Table 8.2(a)). The new weight matrix and relative entropies are given in Table 8.3. Note that the relative entropy of each position has changed and, in particular, the last two columns no longer have equal relative entropy. The site distribution in position 2 is now more similar to the background distribution than the site distribution in position 3 is, since G is rarer in the background distribution. Thus, the relative entropy of position 3 is greater than that of position 2. An interpretation of is times more likely to occur in the third position of a site than a nonsite. The total that the residue G is relative entropy of all three positions is 4.93.
"
1 0 "0
!
positions, it is
!
! "
on
42
0.512
Table 8.3: (b) Log Likelihood Weight Matrix, and (c) Positional Relative Entropies, for the Sites in Table 8.1, with Respect to a Nonuniform Background Distribution 0.12 1.3 1.1 1.5 1.2 1.1 0.027
Table 8.4: Positional Relative Entropy for CRP Binding Sites of Tables 7.1 7.3 Example 8.7: Finally, returning to the more interesting CRP binding sites of Table 7.1, the seven positional relative entropies are given in Table 8.4. Note that 1.5 (middle position) is the highest relative entropy and corresponds to the most biased column (see Table 7.2). The value 0.027 (last position) is the lowest relative entropy because the distribution in this last position is the closest to the uniform background distribution (see Table 7.2).
and
Proof: First, it is true that The reason is that the curve . Thus, with :
for all real numbers , with equality if and only if . is concave downward, and its tangent at is the straight line . In the following derivation, we will use this inequality
!
, with
"
since only if
for all
, by Denition 8.3. Note that the relative entropy is equal to 0 if and , that is, and are identical probability distributions.
43
Lecture 9
3. For each such sequence , experimentally measure the difference in binding energy between ing with and binding with . 4. Record the results in a matrix substituted at position in . , where
Stormo and Fields then make the approximating assumption that changes in energy are additive. That is, the change in binding energy for any collection of substitutions is the sum of the changes in binding
44
such bindis
45 to
Thus, is a weight matrix that assigns a score to each sequence formula given in Section 8.1.
according to the usual weight matrix
!
46
as in Section 8.3, and use that as a measure of how good the From and , we can compute collection is. The goal is to nd the collection that maximizes . In particular, if we are looking for unknown binding sites, then the argument of Section 9.2 suggests that a relative entropy around would be encouraging. A version of the computational problem, then, is to take as inputs sequences and an integer , and output one length substring from each input sequence, such that the resulting relative entropy is maximized. Let us call this the relative entropy site selection problem. Unfortunately, this problem is likely to be computationally intractable (Section 6.4): Theorem 9.1 (Akutsu [1, 2]): The relative entropy site selection problem is NP-complete. Akutsu also proved that selecting instances so as maximize the sum-of-pairs score (Section 6.2) rather than the relative entropy is NP-complete.
Lecture 10
47
48
2. For each set retained so far, add each possible length substring from an input sequence not yet represented in . Compute the prole and relative entropy with respect to the background for each new set. Retain the sets with the highest relative entropy. 3. Repeat step 2 until each set has members.
A small example from Hertz and Stormo [23] is shown in Figure 10.1. From this example it is clear that pruning the number of sets to is crucial, in order to avoid the exponentially many possible sets. The greedy nature of this pruning biases the selection from the remaining input sequences. High scoring proles chosen from the rst few sequences may not be well represented in the remaining sequences, whereas medium scoring proles may be well represented in most of the sequences, and thus would have yielded superior scores. Note that one may modify the algorithm to circumvent the assumption of a single site per sequence, by permitting multiple substrings to be chosen from the same sequence. In this case, a different stopping condition is needed. Hertz and Stormo applied their technique to nd CRP binding sites (see Section 7.1) with some success. With 18 genes containing 24 known CRP binding sites, their best solution contained 19 correct sites, plus 3 more that overlap correct sites.
I NPUT: sequences
1. Create a singleton set (i.e., only one member) for each possible length input sequences.
I NPUT: sequences
49
Figure 10.1: Example of Hertz and Stormos greedy algorithm. seq denotes the relative entropy. A LGORITHM: Initialize set to contain substrings , where is a substring of chosen randomly and uniformly. Now perform a series of iterations, each of which consists of the following steps:
(a) Let
be the length
substring of
2. For every in
and remove
from .
50
We iterate until a stopping condition is met, either a xed number of iterations or relative stability of the scores, and return the best solution set seen in all iterations. The hope with this approach is that the random choices help to avoid some of the local optima of greedy algorithms. Note that the Gibbs sampler may discard a substring that yields a higher scoring prole than the one that replaces it, or may restore the substring that was discarded itself. Neither of these occurrences is particularly signicant, since the sampling will tend toward higher scoring proles due to the probabilistic weighting of the substitutions by relative entropy. The Gibbs sampler does retain some degree of greediness (which is desirable), so that there may be cases where a strong signal in only a few sequences incorrectly outweighs a weaker signal in all of the sequences. Lawrence et al. applied their technique to nd motifs in protein families. In particular, they successfully discovered a helix-turn-helix motif, as well as motifs in lipocalins and prenyltransferases.
1. Weight which
2. Use simulated annealing (see, for example, Johnson et al. [25]) where, as time progresses, the probability decreases that you make a substitution that worsens the relative entropy score, yielding a more stable set . Another technique that has been used to solve the site selection problem is expectation maximization (for example, in the MEME system [4]).
3. Randomly choose
to be
with probability
, and add
to .
Lecture 11
52
strands of DNA equally often, any bias of such a region on one strand over the other (see Section 11.4) is canceled out. This phenomenon, together with the fact that the bases occur in complementary pairs, explains why the frequencies of As and Ts are similar and the frequencies of Gs and Cs are similar. Let be the uniform probability distribution, and let be the frequency distribution. Notice that the frequency of residue is equal to the probability of randomly selecting residue from the distribution . How much better does model the actual genome than ? More quantitatively, how much more information is there using rather than ? Recall from Section 8.3 that the relative entropy is dened as follows:
In Example 11.1 for M. jannaschii, . This implies that there are 0.103 more bits of information per position in the sequence by using distribution over distribution . The value 0.103 might seem insignicant, but it means that a sequence of 100 bases has ten bits of extra information when chosen according to distribution . Suppose a random sequence of length 100 is selected according to the probability distribution . Since the relative entropy is the expected log likelihood ratio for , the sequence is approximately times more likely to have been generated by than by . The mathematics leading to this observation is awed, since the log function and expectation do not commute: that is, for it to be correct we would need the expected log likelihood ratio to equal the log of the expected likelihood ratio, which is not true in general. However, the intuition is helpful. The next section explores the application of this relative entropy method to the question of dependence of nucleotides.
%$
"
53
polymerase, there is a small chance that the repeats will persist after a copy mistake. (In a similar way, dinucleotide repeats might occur during the replication process.)
Table 11.2: The ratios of the observed dinucleotide frequency to the expected dinucleotide frequency (assuming independence) in M. jannaschii
If the probability distribution is the joint distribution of and , and is the distribution of and assuming independence, then . By Theorem 8.8, then, , with equality if and only if and are independent, since in the equality case . to be the rst base and to be the second base of a pair, the value By setting the random variable for M. jannaschii is 0.03. For a sequence of 100 bases, there are three bits of extra information when the sequence is chosen from the dinucleotide frequency distribution rather than the independence model. Thus, a random sequence of length 100 generated by a process according to dinucleotide distribution is eight times more likely to have been generated by than by the independent nucleotide distribution .
of random variables is
54
there are 0.082 bits of information in the rst codon position relative to the background distribution for H. inuenzae. Since most of the H. inuenzae genome consists of coding regions, it makes little difference if the background distribution is measured genome-wide or coding-region-wide. There are 0.175 bits of information per residue in the rst position of codons for M. jannaschii. The total relative entropy for the entire codon is simply the sum of the relative entropies for the three positions. (See Section 8.3.) The number of bits per codon for the organisms H. inuenzae, M. jannaschii, C. elegans, and H. sapiens are 0.12, 0.21, 0.09, and 0.12, respectively. For H. inuenzae the number of bits of information for the second position is close to zero. In humans there is more information is in the second position.
The score of a sequence of codons is defined to be the sum of the scores of each codon. When recognizing genes, one approach would be to identify sequences with high scores. Each reading frame must be examined, since moving the frame window to the right by one or two positions results in a different sequence of codons. One drawback to this technique is that we must know the coding regions (in order to estimate the codon distribution) before recognizing (those same) genes in the genome. There are simpler methods for finding long coding regions, and once these are known they can be used to estimate the codon distribution and thus used to find more genes.
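As an illustration of this scoring scheme, the following sketch scores a window in each of the three reading frames using hypothetical coding and background codon distributions; in practice the distributions would be estimated from known coding regions, as discussed above.

```python
import math

def codon_score(window, coding_freq, background_freq, frame):
    """Sum of per-codon log-likelihood ratios (coding vs. background)
    for one reading frame of the window."""
    score = 0.0
    for i in range(frame, len(window) - 2, 3):
        codon = window[i:i+3]
        p = coding_freq.get(codon, 1e-6)      # small floor avoids log(0)
        q = background_freq.get(codon, 1e-6)
        score += math.log2(p / q)
    return score

def best_frame(window, coding_freq, background_freq):
    """Examine all three reading frames and return (best frame, its score)."""
    scores = [codon_score(window, coding_freq, background_freq, f) for f in range(3)]
    best = max(range(3), key=lambda f: scores[f])
    return best, scores[best]

# Hypothetical, tiny example distributions, for illustration only.
coding = {"ATG": 0.03, "AAA": 0.04, "GCT": 0.03}
background = {"ATG": 0.015, "AAA": 0.02, "GCT": 0.015}
print(best_frame("ATGAAAGCTAAA", coding, background))
```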
Lecture 12
The following algorithm for finding a maximum subsequence was given by Bates and Constable [5] and Bentley [7, Column 7]. Suppose we already knew that the maximum subsequence of x_1, x_2, ..., x_j has score M_j. How can we find the maximum subsequence of x_1, x_2, ..., x_{j+1}? If x_{j+1} is not included in that maximum subsequence, then the answer is unchanged. If x_{j+1} is included, then the maximum subsequence ends at position j+1, so in addition to M_j we will have to keep track of S_j, the score of the maximum suffix of x_1, x_2, ..., x_j: the suffix that maximizes the sum of its scores. Let us assume that S_j is also known. We are now given x_{j+1}, and we want to update M and S accordingly:
Set S_{j+1} to S_j + x_{j+1} if that quantity is positive, and otherwise set S_{j+1} to 0, corresponding to the empty suffix; then replace M_j by S_{j+1} if S_{j+1} > M_j. In practice, one could optimize this slightly by processing a consecutive series of positive scores as a single combined score.
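A minimal sketch of this update in Python, using M and S as above and an arbitrary example input:

```python
def maximum_subsequence_score(scores):
    """Bentley-style scan: M is the best subsequence score seen so far,
    S is the best score of a suffix ending at the current position."""
    M = S = 0.0                # the empty subsequence has score 0
    for x in scores:
        S = max(S + x, 0.0)    # extend the best suffix, or fall back to the empty suffix
        M = max(M, S)          # the best overall is the best suffix seen anywhere
    return M

print(maximum_subsequence_score([4, -5, 3, -3, 1, 2, -2, 2, -2, 1, 5]))  # 7.0
```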
Figure 12.1: An example of the algorithm. Bold segments indicate the score subsequences currently on the algorithm's list; each plot shows cumulative score against sequence position. The left figure shows the state prior to adding the last three scores, and the right figure shows the state after.

Ruzzo and Tompa [40] extend this idea to find all maximum subsequences in a single pass. The algorithm reads the scores from left to right, maintaining the cumulative score of the input read so far and a list of disjoint subsequences I_1, ..., I_k, recording for each the cumulative score just before its start and the cumulative score at its end. Each newly read score updates the cumulative total; each positive score additionally becomes a new single-element subsequence I_{k+1}, which is integrated into the list by the following process.

1. The list is searched from right to left for the maximum value of j such that the cumulative score just before the start of I_j is less than the cumulative score just before the start of I_{k+1}.

2. If there is no such j, then add I_{k+1} to the end of the list.

3. If there is such a j, and the cumulative score at the end of I_j is at least the cumulative score at the end of I_{k+1}, then add I_{k+1} to the end of the list.
4. Otherwise (i.e., there is such a j, but the cumulative score at the end of I_j is smaller), extend I_{k+1} to the left to encompass everything up to and including the leftmost score in I_j. Delete subsequences I_j, ..., I_k from the list (none of them is maximum) and reconsider the newly extended subsequence (now renumbered I_j) as in Step 1.

After the end of the input is reached, all subsequences remaining on the list are maximum; output them.

As an example of the algorithm, consider the input of Figure 12.1, and suppose that after the first eight scores the list consists of the disjoint subsequences shown in bold in the left plot. At this point, the cumulative score is 2. If the ninth input is -2, the list of subsequences is unchanged, but the cumulative score becomes 0. If the tenth input is 1, Step 1 produces the rightmost subsequence whose starting cumulative score is below that of the new subsequence. Now Step 3 applies, so the new single-element subsequence is added to the list, and the cumulative score becomes 1. If the eleventh input is 5, Step 1 again produces a match, and this time Step 4 applies, replacing the new subsequence by one extended to the left. The algorithm returns to Step 1 without reading further input, this time producing an earlier subsequence on the list; Step 4 again applies, this time merging three of the subsequences into a single new one. The algorithm again returns to Step 1, but this time Step 2 applies. If there are no further input scores, the complete list of maximum subsequences is then the one shown in bold in the right plot of Figure 12.1. The fact that this algorithm correctly finds all maximum subsequences is not obvious; see Ruzzo and Tompa [40] for the details.

Analysis. There is an important optimization that may be made to the algorithm. In the case that Step 2 applies, all subsequences already on the list are maximum subsequences, and so may be output before reading any more of the
input. Thus, Step 2 of the algorithm may be replaced by the following, which substantially reduces the memory requirements of the algorithm.

2. If there is no such j, then all subsequences on the list are maximum; output them, delete them from the list, and reinitialize the list to contain only I_{k+1}.
The algorithm as given does not run in linear time, because several successive executions of Step 1 might re-examine a number of list items. This problem is avoided by storing with each subsequence added during Step 3 a pointer to the subsequence that was discovered in Step 1. The resulting linked list of subsequences will have monotonically decreasing starting cumulative scores, and can be searched in Step 1 in lieu of searching the full list. Once a list element has been bypassed by this chain, it will be examined again only if it is being deleted from the list, either in Step 2 or Step 4. The work done in the reconsider loop of Step 4 can be amortized over the list item(s) being deleted. Hence, in effect, each list item is examined a bounded number of times, and the total running time is linear. The worst case memory complexity is also linear, although one would expect on average that the subsequence list would remain fairly short in the optimized version incorporating the revised Step 2. Empirically, a few hundred stack entries suffice for processing sequences of a few million residues, for either synthetic or real genomic data.
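For concreteness, here is a Python sketch of the list algorithm as reconstructed above, including the optimized Step 2 but without the linked-list pointer optimization (so Step 1 is a plain right-to-left scan rather than the linear-time version). It should be checked against Ruzzo and Tompa [40] before serious use; the input scores are arbitrary.

```python
def all_maximum_subsequences(scores):
    """Return (start, end, score) of every maximum-scoring subsequence,
    0-based and end-exclusive, following the four steps above."""
    out = []     # maximum subsequences output so far (optimized Step 2)
    lst = []     # disjoint candidate subsequences: (start, end, L, R),
                 # L = cumulative total just before start, R = total at end
    total = 0.0
    for i, x in enumerate(scores):
        if x <= 0:
            total += x
            continue                          # nonpositive scores never start a candidate
        start, end, L, R = i, i + 1, total, total + x
        total += x
        while True:
            # Step 1: rightmost list entry whose L value is smaller than ours.
            j = next((k for k in range(len(lst) - 1, -1, -1) if lst[k][2] < L), None)
            if j is None:
                # Optimized Step 2: everything on the list is maximum; output it.
                out.extend((s, e, r - l) for s, e, l, r in lst)
                lst = [(start, end, L, R)]
                break
            if lst[j][3] >= R:
                # Step 3: the new candidate is added to the end of the list.
                lst.append((start, end, L, R))
                break
            # Step 4: extend the candidate left over list entries j..k and delete them.
            start, L = lst[j][0], lst[j][2]
            del lst[j:]
    out.extend((s, e, r - l) for s, e, l, r in lst)   # remaining entries are maximum
    return out

print(all_maximum_subsequences([4, -5, 3, -3, 1, 2, -2, 2, -2, 1, 5]))
```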
Lecture 13
Markov Chains
February 17, 2000 Notes: Jonathan Schaefer
In Lecture 11 we discovered that correlations between sequence positions are significant, and should often be taken into account. In particular, in Section 11.4 we noted that codons displayed a significant bias, and that this could be used as a basis for finding coding regions. Lecture 12 then explored algorithms for doing exactly that. In some sense, Lecture 12 regressed from the lesson of Lecture 11. Although it was using codon bias to score codons, it did not exploit the possible correlation between adjacent codons. Even worse, each codon was scored independently and the scores added, so that the codon score does not even depend on the position the codon occupies. This lecture rectifies these shortcomings by taking codon correlation into account in predicting coding regions. In order to do so, we first introduce Markov chains as a model of correlation.
Definition 13.1: Let X_1, X_2, X_3, ... be a sequence of random variables, each taking values in a finite set of states. This sequence is a kth order Markov chain if, for all i > k,

Pr[X_i = a | X_1 = x_1, ..., X_{i-1} = x_{i-1}] = Pr[X_i = a | X_{i-k} = x_{i-k}, ..., X_{i-1} = x_{i-1}]

for every state a and all values x_1, ..., x_{i-1}.
In words, in a kth order Markov chain, the distribution of X_i depends only on the k variables immediately preceding it. In a 1st order Markov chain, for example, the distribution of X_i depends only on X_{i-1}. Thus, a 1st order Markov chain models diresidue dependencies, as discussed in Section 11.2. A 0th order Markov chain is just the familiar independence model, where X_i does not depend on any other variables. Markov chains are not restricted to modeling positional dependencies in sequences. In fact, the more usual applications are to time dependencies, as in the following illustrative example.
Example 13.2: This example is called a random walk on the infinite 2-dimensional grid. Imagine an infinite grid of streets and intersections, where all the streets run either east-west or north-south. Suppose you are trying to find a friend who is standing at one specific intersection, but you are lost and all street signs are missing. You decide to use the following algorithm: if your friend is not standing at your current intersection, choose one of the four directions (N, E, S, or W) randomly and uniformly, and walk one block in that direction. Repeat until you find your friend. This is an example of a 1st order Markov chain, where each intersection is a state, and X_t is the intersection where you stand after t steps. Notice that the distribution of X_{t+1} depends only on the value of X_t, and is completely independent of the path that you took to arrive at X_t.

Definition 13.3: A kth order Markov chain is said to be stationary if, for all i and j,

Pr[X_i = a | X_{i-k} = x_1, ..., X_{i-1} = x_k] = Pr[X_j = a | X_{j-k} = x_1, ..., X_{j-1} = x_k]

for every state a and all values x_1, ..., x_k.
That is, in a stationary Markov chain, the distribution of X_i is independent of the value of i, and depends only on the k previous variables. The random walk of Example 13.2 is an example of a stationary 1st order Markov chain.
Consider, for simplicity, a stationary 1st order Markov chain X_1, X_2, X_3, .... Let P be the matrix whose entry P(a, b) = Pr[X_{i+1} = b | X_i = a]; P is called the probability transition matrix for the chain. The dimensions of the matrix are |S| x |S|, where S is the state space. For nucleotide sequences, for example, P is 4 x 4. Each transition probability P(a, b) is estimated by n_{ab} / n_a, where n_a is the number of occurrences of the residue a in the genome, and n_{ab} is the number of occurrences in the genome of the diresidue ab. Given these estimates, the probability that a given sequence was generated by the chain is the probability of its first residue times the product of the transition probabilities for its successive diresidues.
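As a concrete sketch (with an arbitrary stand-in training sequence), the following Python code estimates a 4 x 4 transition matrix from diresidue counts and uses it, together with the initial residue frequencies, to compute the log probability that a sequence was generated by the chain:

```python
import math
from collections import Counter

def estimate_transition_matrix(genome):
    """P[a][b] = (# occurrences of diresidue ab) / (# occurrences of a),
    estimated from the training sequence."""
    mono = Counter(genome[:-1])               # counts of first residues of each pair
    di = Counter(genome[i:i+2] for i in range(len(genome) - 1))
    return {a: {b: di[a + b] / mono[a] for b in "ACGT"} for a in "ACGT" if mono[a]}

def log2_probability(seq, P, initial):
    """log2 Pr(sequence | 1st order chain) = log2 Pr(x1) + sum of log2 P(x_i, x_{i+1})."""
    logp = math.log2(initial[seq[0]])
    for a, b in zip(seq, seq[1:]):
        logp += math.log2(P[a][b])
    return logp

genome = "ACGTACGTTTTAAACGCGAATTCCGG" * 100   # stand-in training sequence
P = estimate_transition_matrix(genome)
initial = {r: genome.count(r) / len(genome) for r in "ACGT"}
print(log2_probability("ACGTTTAA", P, initial))
```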
There are several respects in which a Markov chain is an imperfect model of the dependencies in genomic sequence:

1. Unidirectionality: a residue may be equally dependent on the residues on both sides of it, yet a Markov chain only models its dependence on the residues on one side.

2. Mononucleotide repeats are not adequately modeled. They are much more frequent in biological sequences than predicted by a Markov chain. This frequency is likely due to DNA polymerase slippage during replication, as discussed in Example 11.2.

3. Codon position biases (as discussed in Section 11.4) are not accurately modeled.
Lecture 14
14.2. Glimmer
Glimmer [13, 41] is a gene prediction tool that uses a model somewhat more general than a Markov chain. In particular, Glimmer 2.0 [13] uses what the authors call the interpolated context model (ICM). The context of a particular residue consists of the characters immediately preceding it. A typical context size might be 12. For a given context, the interpolated context model assigns a probability distribution for the residue that follows, using only as many residues from the context as the training data supports. Furthermore, those residues need not be consecutive in the context. Glimmer has three phases for finding genes: training, identification, and resolving overlaps.
Figure 14.1: Interpolated context model tree. (Internal nodes are labeled by context positions such as b10, b8, b12, b4, and b3, and their outgoing edges by the residues A, C, G, and T.)

In the training phase, Glimmer must decide which context position is most informative and use it to predict the residue that follows the context. The mutual information of Definition 11.3 is used to make this determination. We first find the maximum among the mutual information values between each individual context position and the predicted residue.
To determine which position has the next highest correlation, we do not simply take the second highest mutual information from the list above. The identity of the next position instead depends on the value of the first selected residue, as illustrated in Figure 14.1. Glimmer builds a tree of influences as follows. The k-mers from this reading frame position are partitioned into four subsets according to the residue in the first selected position. Then we repeat the mutual information calculation above for each of these subsets. In the example of Figure 14.1, one context position was found to have the greatest mutual information with the predicted residue, and the k-mers were partitioned according to the value of that residue. For those k-mers in which that residue is A, a second context position was found to have the greatest mutual information, and they were further partitioned into four subsets according to the value of that second residue. A branch is terminated when the remaining subset of k-mers becomes too small to support further partitioning. Each such leaf of the tree is labeled with the probability distribution of the predicted residue, given the residue values along the path from the root to that leaf. For example, in the tree shown in Figure 14.1, the leaf shaded gray would be labeled with such a conditional distribution.
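As a sketch of this training step, the following Python code assumes the training data is presented as a list of (context, target residue) pairs, selects the context position with maximum mutual information with the target, and partitions the data by that position's residue; applying it recursively to the four subsets would grow a tree like that of Figure 14.1. The sample data and function names are illustrative, not Glimmer's.

```python
import math
from collections import Counter

def mutual_information(pairs):
    """I(X;Y) in bits for a list of (x, y) samples."""
    n = len(pairs)
    joint = Counter(pairs)
    px = Counter(x for x, _ in pairs)
    py = Counter(y for _, y in pairs)
    return sum((c / n) * math.log2((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in joint.items())

def best_context_position(samples):
    """samples: list of (context string, target residue).
    Return the context position most informative about the target."""
    k = len(samples[0][0])
    mi = [mutual_information([(ctx[m], y) for ctx, y in samples]) for m in range(k)]
    best = max(range(k), key=lambda m: mi[m])
    return best, mi[best]

def partition_by_position(samples, m):
    """Split the samples into four subsets by the residue at position m,
    as in the tree of Figure 14.1."""
    subsets = {r: [] for r in "ACGT"}
    for ctx, y in samples:
        subsets[ctx[m]].append((ctx, y))
    return subsets

# Tiny made-up training set: position 2 of the context determines the target.
samples = [("AAG", "C"), ("ACG", "C"), ("GTG", "C"), ("CAA", "T"),
           ("TGA", "T"), ("ACA", "T"), ("GGG", "C"), ("TTA", "T")]
m, score = best_context_position(samples)
print(m, round(score, 3))        # expect position 2 with 1.0 bit
```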
Note how this tree generalizes the notion of Markov chain given in Definition 13.1.
end of the sequence.
3. Trace down the tree, selecting the edges according to the particular residues in the sequence, until reaching a leaf. Read the probability of the next residue from that leaf.

4. Shift to the next residue in the sequence and repeat.
The product of these probabilities (times the probability of the first few residues of the candidate sequence, as in Section 13.2) yields the probability that the sequence was generated by this interpolated context model. To take the length of the sequence into account, combine these probabilities by adding their logarithms, and normalize by dividing by the length. Select the sequence for further investigation if this score is above some predetermined threshold.

Note: Glimmer actually stores a probability distribution for the predicted residue at every node of the tree, not just the leaves, and uses a combination of the distributions along a branch to make the prediction. This is where the modifier interpolated enters, but we will not discuss this added complication. See the papers [13, 41] for details.
Lecture 15
start codon (usually AUG). That is, the ribosome binding site contains not only the first few codons to be translated, but also part of the untranslated region of the mRNA (see Section 2.3). The ribosome identifies where to bind to the mRNA at initiation not only by recognizing the start codon, but also by recognizing a short sequence in the untranslated region within the ribosome binding site. This short mRNA sequence will be called the SD site, for reasons that will become clear below. The mechanism by which the ribosome recognizes the SD site is relatively simple base-pairing: the SD site is complementary to a short sequence near the 3′ end of the ribosome's 16S rRNA, one of its ribosomal RNAs. The SD site was first postulated by Shine and Dalgarno [44] for E. coli. Subsequent experiments demonstrated that the SD site in E. coli mRNA usually matches at least 4 or 5 consecutive bases in the sequence AAGGAGG, and is usually separated from the translation start site by approximately 7 nucleotides, although this distance is variable. Numerous other researchers such as Vellanoweth and Rabinowitz [49] and Mikkonen et al. [34] describe very similar SD sites in the mRNA of other prokaryotes. It is not too surprising that SD sites should be so similar in various prokaryotes, since the 3′ end of the 16S rRNA of all these prokaryotes is well conserved (Mikkonen et al. [34]). Table 15.1 shows a number of these rRNA sequences. Note their similarity, and in particular the omnipresence of the sequence CCUCCU, complementary to the Shine-Dalgarno sequence AGGAGG.

This SD site can be used to improve start codon prediction. The simplest way to identify whether a candidate start codon is likely to be correct is by checking for approximate base pair complementarity between the 3′ end of the 16S rRNA sequence and the DNA sequence just upstream of the candidate codon. We say approximate complementarity because the ribosome just needs sufficient binding energy between the 16S rRNA and the mRNA, not necessarily perfect complementarity. Several papers do use this SD site information to improve translation start site prediction. These papers are described briefly below.

Hayes and Borodovsky [22] found candidate SD sites by running a Gibbs sampler (Section 10.2) on the DNA sequences just upstream of a given genome's purported start codons. They then used the 3′ end of the genome's annotated 16S rRNA sequence to validate the SD site so found.

Frishman et al. [16] used a greedy version of the Gibbs sampler to find likely SD sites. In addition, they

Table 15.1: The 3′ ends of the 16S rRNA sequences of a number of prokaryotes.

Bacillus subtilis                       CUGGAUCACCUCCUUUCUA
Lactobacillus delbrueckii               CUGGAUCACCUCCUUUCUA
Mycoplasma pneumoniae                   GUGGAUCACCUCCUUUCUA
Mycobacterium bovis                     CUGGAUCACCUCCUUUCU
Aquifex aeolicus                        CUGGAUCACCUCCUUUA
Synechocystis sp.                       CUGGAUCACCUCCUUU
Escherichia coli                        UUGGAUCACCUCCUUA
Haemophilus influenzae                  UUGGAUCACCUCCUUA
Helicobacter pylori                     UUGGAUCACCUCCU
Archaeoglobus fulgidus                  CUGGAUCACCUCCU
Methanobacterium thermoautotrophicum    CUGGAUCACCUCCU
Pyrococcus horikoshii                   CUCGAUCACCUCCU
Methanococcus jannaschii                CUGGAUCACCUCC
Mycoplasma genitalium                   GUGGAUCACCUC
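As an illustration of this idea (and not the specific method of any of the papers described here), the following Python sketch scores the region upstream of a candidate start codon by its best ungapped match to the Shine-Dalgarno consensus AAGGAGG, the reverse complement of the CCUCCU-containing 3′ tail shown in Table 15.1; the genome string and window size are hypothetical.

```python
def best_sd_match(upstream, consensus="AAGGAGG"):
    """Slide the SD consensus over the upstream region and return the best
    (number of matching bases, offset) over all ungapped alignments."""
    best = (0, 0)
    for offset in range(len(upstream) - len(consensus) + 1):
        window = upstream[offset:offset + len(consensus)]
        matches = sum(a == b for a, b in zip(window, consensus))
        best = max(best, (matches, offset))
    return best

def score_start_codon(genome, start, window=20):
    """Score a candidate start codon at 0-based position `start` by the
    SD-like signal in the `window` bases upstream of it."""
    upstream = genome[max(0, start - window):start]
    return best_sd_match(upstream)

# Hypothetical example: an AGGAGG-like site several bases upstream of ATG.
genome = "TTTTAAAGGAGGTTAACATTATGAAACGT"
print(score_start_codon(genome, genome.find("ATG")))
```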
Hannenhalli et al. [21] used multiple features to score potential start codons. The features used were the following:
1. the binding energy between the SD site and the 3′ end of the 16S rRNA, allowing gaps (insertions and deletions) in the binding,
2. the identity of the start codon (AUG, UUG, or GUG),
3. coding potential downstream from the start codon and noncoding potential upstream, using GeneMark's scoring function (Section 13.3),
4. the distance from the SD site to the start codon, and
5. the distance from the start codon to the maximal start codon, which is the one as far upstream in this ORF as possible.

They took the score of any start codon to be a weighted linear combination of the scores on these five features. The coefficients of the linear combination were obtained using mixed integer programming.
The measure z is the number of standard deviations by which the observed value exceeds its expectation, and is sometimes called the normal deviate or deviation in standard units. See Leung et al. [30] for a detailed discussion of this statistic. The measure is normalized to have mean 0 and standard deviation 1, making it suitable for comparing different motifs. The algorithm was run on fourteen prokaryotic genomes. Those motifs with the highest z-scores showed a strong predominance of motifs complementary to the end of their genome's 16S rRNA. For the bacteria, these were usually a standard Shine-Dalgarno sequence consisting of 4 or 5 consecutive bases from AAGGAGG. For the thermophilic archaea A. fulgidus, M. jannaschii, M. thermoautotrophicum, and P. horikoshii, however, the significant SD sites uncovered were somewhat different. What is interesting about these is that their highest scoring sequences display a predominance of the pattern GGTGA or GGTG, which satisfies the requirement of complementarity to a substring near the end of the 16S rRNA (see Table 15.1). However, that 16S substring is shifted a few nucleotides upstream compared to the bacterial sites discussed above.
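For completeness, the z-score computation itself is simple once the expectation and standard deviation of a motif's count are in hand; the sketch below uses a crude binomial approximation for a word's occurrence count, whereas Leung et al. [30] derive more careful variance formulas. The example genome and word are made up.

```python
import math

def z_score(observed, expected, std_dev):
    """Number of standard deviations by which the observation exceeds expectation."""
    return (observed - expected) / std_dev

def word_z_score(genome, word, base_freq):
    """Crude z-score for the count of `word`, treating the (overlapping)
    window positions as independent Bernoulli trials."""
    n = len(genome) - len(word) + 1
    p = math.prod(base_freq[b] for b in word)
    observed = sum(genome[i:i + len(word)] == word for i in range(n))
    return z_score(observed, n * p, math.sqrt(n * p * (1 - p)))

genome = "AAGGAGGTTT" * 200 + "ACGT" * 500
freq = {b: genome.count(b) / len(genome) for b in "ACGT"}
print(round(word_z_score(genome, "AAGGAGG", freq), 1))
```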
"
"
Lecture 16
Figure 16.1: RNA secondary structure, with its elements labeled: hairpin loop, bulge, multi-branched loop, stacked pair, internal loop, and external base. The solid line indicates the backbone, and the jagged lines indicate paired bases.
The configuration shown in Figure 16.2 is known as a pseudoknot. For the prediction algorithm that follows, we will assume that the secondary structure does not contain any pseudoknots. The ostensible justifications for this are that pseudoknots do not occur as often as the more common types of loops, and secondary structure prediction is moderately successful even if pseudoknots are prohibited. However, the real justification for this assumption is that it greatly simplifies the model and algorithm. (Certain types of pseudoknots are handled by the algorithm of Rivas and Eddy [38], but the general problem was shown NP-complete by Lyngsø and Pedersen [32].)

Definition 16.2: A pseudoknot in a secondary structure is a pair of base pairs (i, j) and (i', j') with i < i' < j < j'.
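Definition 16.2 translates directly into a small check. The following sketch (with hypothetical input pair lists) reports whether a set of base pairs, given as (i, j) index pairs with i < j, contains a pseudoknot.

```python
def has_pseudoknot(pairs):
    """True iff some two base pairs (i, j) and (i2, j2) interleave,
    i.e., i < i2 < j < j2 (Definition 16.2)."""
    pairs = sorted(pairs)
    for a, (i, j) in enumerate(pairs):
        for i2, j2 in pairs[a + 1:]:
            if i < i2 < j < j2:
                return True
    return False

nested = [(0, 20), (1, 19), (5, 12)]   # hairpin-like, properly nested
knotted = [(0, 10), (5, 15)]           # interleaved pairs: a pseudoknot
print(has_pseudoknot(nested), has_pseudoknot(knotted))   # False True
```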
A bulge is an internal loop with one base from each of its two base pairs adjacent on the backbone. A stacked pair is a loop formed by two base pairs (i, j) and (i+1, j-1), thus having both ends adjacent on the backbone. (This is the only type of loop that stabilizes the secondary structure. All other loops are destabilizing, to varying degrees.)
Definition 16.1: A secondary structure of an RNA sequence s_1 s_2 ... s_n is a set of base pairs (i, j) with i < j, in which each base is paired at most once. More precisely, for all pairs (i, j) and (i', j') in the set, if (i, j) and (i', j') are distinct then i, j, i', and j' are four distinct positions.
Definition 16.4: Given a loop, one base pair in the loop is closest to the ends of the RNA strand. This is known as the exterior or closing pair. All other pairs are interior. More precisely, the exterior pair is the one that maximizes j - i over all pairs (i, j) in the loop. Note that one base pair may be the exterior pair of one loop and an interior pair of another.
A multibranched loop is a loop that contains more than two base pairs. An external base is a base not contained in any loop.
Despite the similarity in their number and descriptions, it is important to understand the distinction between the free energy functions of Section 16.4 and these dynamic programming arrays. The free energy functions give the energy of a single specified loop. The arrays will generally contain free energy values for a collection of consecutive loops. For example, referring to Figure 16.1, the internal loop energy function gives the free energy of the single internal loop closed by its exterior pair, whereas the corresponding array entry gives the total free energy of all the loops to the right of that pair, including the stacked pairs and the hairpin.
Lecture 17
17.1.1.
The terms in the second equation correspond to choosing, for the bases considered so far, the structure having the lesser free energy of two possible structures:

The last base does not pair with any other base and is therefore an external base (see Figure 16.1). The recurrence makes the implicit assumption that the external bases do not contribute to the overall free energy of the structure. In this case the total energy is therefore simply the optimal energy for the preceding bases.

The last base pairs with some other base, and we must consider each such choice and the resulting free energy. That energy is the sum of the energy of the structure closed by the chosen pair, plus the energy of the remainder of the sequence to its left.
17.1.2.
The terms in the second equation correspond to choosing the minimum free energy structure among the following possible solutions:

The pair is the exterior pair of a hairpin loop, whose free energy is therefore given by the hairpin loop energy function.
The pair is the exterior pair of a stacked pair. In this case the free energy is the energy of the stacked pair, plus the energy of the compound structure closed by the next pair inward; we know in this case that those two bases form a base pair precisely because the outer pair is the exterior pair of a stacked pair.

The pair is the exterior pair of a bulge or internal loop, whose free energy is therefore given by the corresponding loop energy term.
The pair is the exterior pair of a multibranched loop, whose free energy is therefore given by the multibranched loop term.
17.1.3.
In this case, the pair under consideration is the exterior pair of a bulge or interior loop, and we must search all possible interior pairs for the one that results in the minimum free energy. For each such interior pair, the resulting free energy is the sum of the energy of the bulge or internal loop, plus the energy of the compound structure closed by the interior pair. It is easy to see that this search for the best interior pair is computationally intensive, simply because of the number of possibilities that must be considered. We will see later how to speed up this calculation, which is the new contribution of Lyngsø et al. [33].
17.1.4.
In the same way that the recurrence for bulges and internal loops requires a search for the best structure among all the possible interior pairs, the calculation for multibranched loops is even more intensive, requiring a search over collections of interior pairs, each of which closes its own branch out of the multibranched loop and contributes its own free energy. A direct implementation of the calculation as written is infeasibly slow. Section 17.3 will discuss simplifying assumptions about multibranched loops that allow us to speed this up substantially.
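The full energy-minimizing recurrences are lengthy, so as a structural illustration here is a much simpler dynamic program in the same spirit: Nussinov-style base-pair maximization (maximize the number of allowed pairs, ignoring loop energies and the hairpin/internal/multibranched distinctions), with a minimum hairpin loop size. It shows the same pattern of choosing, for each interval, between "end base unpaired" and "end base paired with some interior position", but it is not the thermodynamic model discussed in this lecture.

```python
def max_base_pairs(seq, min_loop=3):
    """Nussinov-style DP: best[i][j] = maximum number of allowed base pairs
    in seq[i..j], with at least min_loop unpaired bases in each hairpin."""
    # Allowed pairs: Watson-Crick plus the G-U wobble pair.
    comp = {("A", "U"), ("U", "A"), ("C", "G"), ("G", "C"), ("G", "U"), ("U", "G")}
    n = len(seq)
    best = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):            # interval length minus one
        for i in range(n - span):
            j = i + span
            value = best[i][j - 1]                 # case 1: s_j is unpaired
            for k in range(i, j - min_loop):       # case 2: s_j pairs with s_k
                if (seq[k], seq[j]) in comp:
                    left = best[i][k - 1] if k > i else 0
                    value = max(value, 1 + left + best[k + 1][j - 1])
            best[i][j] = value
    return best[0][n - 1]

print(max_base_pairs("GGGAAAUCC"))   # a small hairpin-forming sequence; prints 3
```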
where the slope, intercept, and cutoff are constants. (Lyngsø et al. [33] suggest that it would be more accurate to approximate the free energy as a logarithmic function of the loop size.) Assuming this linear approximation, we can devise a much more efficient dynamic programming solution for computing the multibranched loop energies than the one given in Section 17.1.4. This solution requires an additional array, whose entries give the free energy of an optimal structure for a subsequence under the assumption that its two ends lie on a multibranched loop. The array is defined by the following recurrence relation:
The terms in the second equation correspond to the following possible solutions: the two ends form a base pair and therefore define one of the branches, whose free energy is that of the structure closed by this pair.
The two ends are not paired with each other, so the free energy is given by the minimum over all partitions of the sequence into two contiguous subsequences.
The multibranched loop computation then reduces to partitioning the loop into at least two pieces with the minimum total free energy.
With the speedup of the multibranched loop computation described in Section 17.3, the new bottleneck has become the time-consuming computation of the free energy of bulges and internal loops. We will see next how to eliminate this bottleneck.
Lecture 18
The free energy of a bulge or internal loop is the sum of 3 contributions:

1. a term that is a function of the size of the loop, plus

2. a term for the unpaired bases adjacent on the loop to the two base pairs, plus

3. an asymmetry penalty, where n1 is the number of unpaired bases between the two base pairs on one side of the loop, and n2 is the number on the other side.

The free energy function for bulges and internal loops is thus given by the sum of these contributions (Equation 18.1), with the asymmetry penalty taking the particular form of Equation (18.2).
What is important for our purposes is that this asymmetry penalty grows linearly in the asymmetry, provided n1 and n2 are large enough. In particular, the only assumption we will need to make about the penalty is that Equation (18.3) holds for all sufficiently large n1 and n2. This is certainly true for the particular form given in Equation (18.2), whose ingredients are a cap on the maximum asymmetry penalty assessed, a function whose details need not concern us, and a small constant (equal to 5 and 1, respectively, in two cited thermodynamics studies).
We are going to save time by not searching through all the interior pairs. Suppose that, for a given exterior pair, one interior pair is at least as good as a second interior pair, and that both of these loops have the same size. Then, under the assumptions from Sections 18.1 and 18.2, Theorem 18.1 below demonstrates that the first interior pair is also at least as good for an enlarged exterior pair. The intuition behind the theorem is that the asymmetry penalty for each interior pair is the same for the two different exterior pairs by Equation (18.3), and neither interior pair gains an advantage in loop size or stacking energies when you change from the smaller to the bigger loop.

Theorem 18.1: Consider two interior pairs that give internal loops of identical size for a common exterior pair, with every loop side long enough that Equation (18.3) applies. If the first loop has free energy no greater than the second, then the same comparison holds when the two interior pairs are paired with an enlarged exterior pair. The proof is a direct calculation from Equations (18.1) and (18.3).
Instead of using a two-dimensional array indexed by the exterior pair alone, use a three-dimensional array whose third index is the loop size. This array will be filled in using dynamic programming. Each entry records not only the free energy, but also the best interior pair (subject to the side-length restrictions) that gives this energy.

Now suppose that the entry for a given exterior pair and loop size has been calculated, and we want to calculate the entry for the enlarged exterior pair and the same loop size. By Theorem 18.1, the interior pair stored in the first entry is the best interior pair for the second, with only two possible exceptions. These possible exceptions are the loops with the enlarged exterior pair and the given size that have one or the other loop side too short for Equation (18.3) to apply.
Thus, for each of the entries in the array, it is necessary to compare 3 loop energies and store the minimum, which takes constant time. It is also necessary to compare those loops with the given exterior pair
having one or the other loop side shorter than the small constant of Section 18.2, but there are only a constant number of these.
Bibliography
[1] Tatsuya Akutsu. Hardness results on gapless local multiple sequence alignment. Technical Report 98-MPS-24-2, Information Processing Society of Japan, 1998. [2] Tatsuya Akutsu, Hiroki Arimura, and Shinichi Shimozono. On approximation algorithms for local multiple alignment. In RECOMB00: Proceedings of the Fourth Annual International Conference on Computational Molecular Biology, Tokyo, Japan, April 2000. [3] S. F. Altschul. A protein alignment scoring system sensitive at all evolutionary distances. Journal of Molecular Evolution, 36(3):290300, March 1993. [4] Timothy L. Bailey and Charles Elkan. Unsupervised learning of multiple motifs in biopolymers using expectation maximization. Machine Learning, 21(1-2):5180, October 1995. [5] Joseph L. Bates and Robert L. Constable. Proofs as programs. ACM Transactions on Programming Languages and Systems, 7(1):113136, January 1985. [6] Richard Bellman. Dynamic Programming. Princeton University Press, 1957. [7] Jon Bentley. Programming Pearls. Addison-Wesley, 1986. [8] M. Borodovsky and J. McIninch. GeneMark: Parallel gene recognition for both DNA strands. Comp. Chem., 17(2):123132, 1993. [9] M. Borodovsky, J. McIninch, E. Koonin, K. Rudd, C. Medigue, and A. Danchin. Detection of new genes in a bacterial genome using Markov models for three gene classes. Nucleic Acids Research, 23(17):35543562, 1995. [10] C. Branden and J. Tooze. An Introduction to Protein Structure. Garland, 1998. [11] Stephen A. Cook. The complexity of theorem proving procedures. In Conference Record of Third Annual ACM Symposium on Theory of Computing, pages 151158, Shaker Heights, OH, May 1971. [12] Thomas H. Cormen, Charles E. Leiserson, and Ronald L. Rivest. Introduction to Algorithms. MIT Press, 1990. [13] Arthur L. Delcher, Douglas Harmon, Simon Kasif, Owen White, and Steven L. Salzberg. Improved microbial gene identication with G LIMMER. Nucleic Acids Research, 27(23):46364641, 1999. [14] Karl Drlica. Understanding DNA and Gene Cloning. John Wiley & Sons, second edition, 1992. 81
[15] R. D. Fleischmann, M. D. Adams, O. White, R. A. Clayton, E. F. Kirkness, A. R. Kerlavage, C. J. Bult, J. F. Tomb, B. A. Dougherty, J. M. Merrick, et al. Whole-genome random sequencing and assembly of Haemophilus inuenzae rd. Science, 269:496512, July 1995. [16] Dmitrij Frishman, Andrey Mironov, Hans-Werner Mewes, and Mikhail Gelfand. Combining diverse evidence for gene recognition in completely sequenced bacterial genomes. Nucleic Acids Research, 26(12):29412947, 1998. [17] Zvi Galil and Raffaele Giancarlo. Speeding up dynamic programming with applications to molecular biology. Theoretical Computer Science, 64:107118, 1989. [18] Michael R. Garey and David S. Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness. W. H. Freeman and Company, 1979. [19] D. Guseld. Efcient methods for multiple sequence alignment with guaranteed error bounds. Bulletin of Mathematical Biology, 55:141154, 1993. [20] Dan Guseld. Algorithms on Strings, Trees, and Sequences. Cambridge University Press, 1997. [21] Sridhar S. Hannenhalli, William S. Hayes, Artemis G. Hatzigeorgiou, and James W. Fickett. Bacterial start site prediction. 1999. [22] William S. Hayes and Mark Borodovsky. Deriving ribosomal binding site (RBS) statistical models from unannotated DNA sequences and the use of the RBS model for N-terminal prediction. In Pacic Symposium on Biocomputing, pages 279290, 1998. [23] Gerald Z. Hertz and Gary D. Stormo. Identifying DNA and protein patterns with statistically signicant alignments of multiple sequences. Bioinformatics, 15(7/8):563577, July/August 1999. [24] D. S. Hirschberg. A linear-space algorithm for computing maximal common subsequences. Communications of the ACM, 18:341343, June 1975. [25] David S. Johnson, Cecilia R. Aragon, Lyle A. McGeoch, and Catherine Schevon. Optimization by simulated annealing: an experimental evaluation; part I, graph partitioning. Operations Research, 37(6):865892, Nov.Dec. 1989. [26] Jerzy Jurka and Mark A. Batzer. Human repetitive elements. In Robert A. Meyers, editor, Encyclopedia of Molecular Biology and Molecular Medicine, volume 3, pages 240246. Weinheim, Germany, 1996. [27] Samuel Karlin and Stephen F. Altschul. Methods for assessing the statistical signicance of molecular sequence features by using general scoring schemes. Proceedings of the National Academy of Science USA, 87(6):22642268, March 1990. [28] Richard M. Karp. Reducibility among combinatorial problems. In R. E. Miller and J. W. Thatcher, editors, Complexity of Computer Computations, pages 85104. Plenum Press, New York, 1972. [29] Charles E. Lawrence, Stephen F. Altschul, Mark S. Boguski, Jun S. Liu, Andrew F. Neuwald, and John C. Wootton. Detecting subtle sequence signals: a Gibbs sampling strategy for multiple alignment. Science, 262:208214, 8 October 1993. [30] Ming-Ying Leung, Genevieve M. Marsh, and Terence P. Speed. Over- and underrepresentation of short DNA words in herpesvirus genomes. Journal of Computational Biology, 3(3):345360, 1996.
[31] Benjamin Lewin. Genes VI. Oxford University Press, 1997.
[32] Rune B. Lyngs and Christian N. S. Pedersen. Pseudoknots in RNA secondary structures. In RECOMB00: Proceedings of the Fourth Annual International Conference on Computational Molecular Biology, Tokyo, Japan, April 2000. [33] Rune B. Lyngs, Michael Zuker, and Christian N. S. Pedersen. Internal loops in RNA secondary structure prediction. In RECOMB99: Proceedings of the Third Annual International Conference on Computational Molecular Biology, pages 260267, Lyon, France, April 1999. [34] Merja Mikkonen, Jussi Vuoristo, and Tapani Alatossava. Ribosome binding site consensus sequence of Lactobacillus delbrueckii subsp. lactis. FEMS Microbiology Letters, 116:315320, 1994. [35] Webb Miller and Eugene W. Myers. Sequence comparison with concave weighting functions. Bulletin of Mathematical Biology, 50(2):97120, 1988. [36] Eugene W. Myers and Webb Miller. Optimal alignments in linear space. Computer Applications in the Biosciences, 4:1117, 1988. [37] S. B. Needleman and C. D. Wunsch. A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology, 48:443453, 1970. [38] Elena Rivas and Sean R. Eddy. A dynamic programming algorithm for RNA structure prediction including pseudoknots. Journal of Molecular Biology, 285(5):20532068, February 1999. [39] Fred S. Roberts. Applied Combinatorics. Prentice-Hall, 1984. [40] Walter L. Ruzzo and Martin Tompa. A linear time algorithm for nding all maximal scoring subsequences. In Proceedings of the Seventh International Conference on Intelligent Systems for Molecular Biology, pages 234241, Heidelberg, Germany, August 1999. AAAI Press. [41] Steven L. Salzberg, Arthur L. Delcher, Simon Kasif, and Owen White. Microbial gene identication using interpolated Markov models. Nucleic Acids Research, 26(2):544548, 1998. [42] Steven L. Salzberg, David B. Searls, and Simon Kasif, editors. Computational Methods in Molecular Biology. Elsevier, 1998. [43] Jo o Setubal and Jo o Meidanis. Introduction to Computational Molecular Biology. PWS Publishing a a Company, 1997. [44] J. Shine and L. Dalgarno. The -terminal sequence of E. coli 16S ribosomal RNA: Complementarity to nonsense triplets and ribosome binding sites. Proceedings of the National Academy of Science USA, 71:13421346, 1974. [45] Temple F. Smith and Michael S. Waterman. Identication of common molecular subsequences. Journal of Molecular Biology, 147(1):195197, March 1981. [46] Gary D. Stormo and Dana S. Fields. Specicity, free energy and information content in protein-DNA interactions. Trends in Biochemical Sciences, 23:109113, 1998. [47] Gary D. Stormo and George W. Hartzell III. Identifying protein-binding sites from unaligned DNA fragments. Proceedings of the National Academy of Science USA, 86:11831187, 1989.
"
[48] Martin Tompa. An exact method for nding short motifs in sequences, with application to the ribosome binding site problem. In Proceedings of the Seventh International Conference on Intelligent Systems for Molecular Biology, pages 262271, Heidelberg, Germany, August 1999. AAAI Press. [49] Robert Luis Vellanoweth and Jesse C. Rabinowitz. The inuence of ribosome-binding-site elements on translational efciency in Bacillus subtilis and Escherichia coli in vivo. Molecular Microbiology, 6(9):11051114, 1992. [50] L. Wang and T. Jiang. On the complexity of multiple sequence alignment. Journal of Computational Biology, 1:337348, 1994. [51] Michael S. Waterman. Introduction to Computational Biology. Chapman & Hall, 1995. [52] James D. Watson, Michael Gilman, Jan Witkowski, and Mark Zoller. Recombinant DNA. Scientic American Books (Distributed by W. H. Freeman), second edition, 1992.