
Lecture Notes on Biological Sequence Analysis

Martin Tompa
Technical Report #2000-06-01
Winter 2000

Department of Computer Science and Engineering
University of Washington
Box 352350
Seattle, Washington, U.S.A. 98195-2350

This material is based upon work supported in part by the National Science Foundation and DARPA under grant DBI-9601046, and by the National Science Foundation under grant DBI-9974498.

© Martin Tompa, 2000

Contents
Preface

1 Basics of Molecular Biology
  1.1 Proteins
    1.1.1 Classification of the Amino Acids
  1.2 DNA
    1.2.1 Structure of a Nucleotide
    1.2.2 Base Pair Complementarity
    1.2.3 Size of DNA molecules
  1.3 RNA
  1.4 Residues
  1.5 DNA Replication
  1.6 Synthesis of RNA and Proteins
    1.6.1 Transcription in Prokaryotes
    1.6.2 Translation

2 Basics of Molecular Biology (continued)
  2.1 Course Projects
  2.2 Translation (continued)
  2.3 Prokaryotic Gene Structure
  2.4 Prokaryotic Genome Organization
  2.5 Eukaryotic Gene Structure
  2.6 Eukaryotic Genome Organization
  2.7 Goals and Status of Genome Projects
  2.8 Sequence Analysis

3 Introduction to Sequence Similarity
  3.1 Sequence Similarity
  3.2 Biological Motivation for Studying Sequence Similarity
    3.2.1 Hypothesizing the Function of a New Sequence
    3.2.2 Researching the Effects of Multiple Sclerosis
  3.3 The String Alignment Problem
  3.4 An Obvious Algorithm for Optimal Alignment
  3.5 Asymptotic Analysis of Algorithms

4 Alignment by Dynamic Programming
  4.1 Computing an Optimal Alignment by Dynamic Programming
    4.1.1 Example
    4.1.2 Recovering the Alignments
    4.1.3 Time Analysis
  4.2 Searching for Local Similarity
    4.2.1 An Obvious Local Alignment Algorithm
    4.2.2 Set-Up for Local Alignment by Dynamic Programming

5 Local Alignment, and Gap Penalties
  5.1 Computing an Optimal Local Alignment by Dynamic Programming
    5.1.1 Example
    5.1.2 Time Analysis
  5.2 Space Analysis
  5.3 Optimal Alignment with Gaps
    5.3.1 Motivations
    5.3.2 Affine Gap Model
    5.3.3 Dynamic Programming Algorithm
    5.3.4 Time Analysis
  5.4 Bibliographic Notes on Alignments

6 Multiple Sequence Alignment
  6.1 Biological Motivation for Multiple Sequence Alignment
    6.1.1 Representing Protein Families
    6.1.2 Repetitive Sequences in DNA
  6.2 Formulation of the Multiple String Alignment Problem
  6.3 Computing an Optimal Multiple Alignment by Dynamic Programming
  6.4 NP-completeness
  6.5 An Approximation Algorithm for Multiple String Alignment
    6.5.1 Algorithm
    6.5.2 Time Analysis
    6.5.3 Error Analysis
    6.5.4 Other Approaches
  6.6 The Consensus String
  6.7 Summary

7 Finding Instances of Known Sites
  7.1 How to Summarize Known Sites
  7.2 Using Probabilities to Test for Sites

8 Relative Entropy
  8.1 Weight Matrices
  8.2 A Simple Site Example
  8.3 How Informative is the Log Likelihood Ratio Test?
  8.4 Nonnegativity of Relative Entropy

9 Relative Entropy and Binding Energy
  9.1 Experimental Determination of Binding Energy
  9.2 Computational Estimation of Binding Energy
  9.3 Finding Instances of an Unknown Site

10 Finding Instances of Unknown Sites
  10.1 Greedy Algorithm
  10.2 Gibbs Sampler
  10.3 Other Methods

11 Correlation of Positions in Sequences
  11.1 Nonuniform Versus Uniform Distributions
  11.2 Dinucleotide Frequencies
  11.3 Disymbol Frequencies
  11.4 Coding Sequence Biases
    11.4.1 Codon Biases
    11.4.2 Recognizing Genes

12 Maximum Subsequence Problem
  12.1 Scoring Regions of Sequences
  12.2 Maximum Subsequence Problem
  12.3 Finding All High Scoring Subsequences

13 Markov Chains
  13.1 Introduction to Markov Chains
  13.2 Biological Application of Markov Chains
  13.3 Using Markov Chains to Find Genes

14 Using Interpolated Context Models to Find Genes
  14.1 Problems with Markov Chains for Finding Genes
  14.2 Glimmer
    14.2.1 Training Phase
    14.2.2 Identification Phase
    14.2.3 Resolving Overlap

15 Start Codon Prediction
  15.1 Experimental Results of Glimmer
  15.2 Start Codon Prediction
  15.3 Finding SD Sites

16 RNA Secondary Structure Prediction
  16.1 RNA Secondary Structure
  16.2 Notation and Definitions
  16.3 Anatomy of Secondary Structure
  16.4 Free Energy Functions
  16.5 Dynamic Programming Arrays

17 RNA Secondary Structure Prediction (continued)
  17.1 Recurrence Relations
  17.2 Order of Computation
  17.3 Speeding Up the Multibranched Computation
  17.4 Running Time

18 Speeding Up Internal Loop Computations
  18.1 Assumptions About Internal Loop Free Energy
  18.2 Asymmetry Penalty
  18.3 Comparing Interior Pairs

Bibliography

Preface
These are the lecture notes from CSE 527, a graduate course on computational molecular biology I taught at the University of Washington in Winter 2000. The topic of the course was Biological Sequence Analysis. These notes are not intended to be a survey of that area, however, as there are numerous important results that I would have liked to cover but did not have time. I am grateful to Phil Green, Dick Karp, Rune Lyngsø, Larry Ruzzo, and Rimli Sengupta, who helped me both with overview and with technical points. I am thankful to the students who attended faithfully, served as notetakers, asked embarrassing questions, made perceptive comments, carried out exciting projects, and generally made teaching exciting and rewarding.

Martin Tompa

Lecture 1

Basics of Molecular Biology


January 4, 2000 Notes: Michael Gates
We begin with a review of the basic molecules responsible for the functioning of all organisms' cells. Much of the material here comes from the introductory textbooks by Drlica [14], Lewin [31], and Watson et al. [52]. Later in the course, when we discuss the computational aspects of molecular biology, some useful textbooks will be those by Gusfield [20], Salzberg et al. [42], Setubal and Meidanis [43], and Waterman [51]. What sorts of molecules perform the required functions of the cells of organisms? Cells have a basic tension in the roles they need those molecules to fulfill:

1. The molecules must perform the wide variety of chemical reactions necessary for life. To perform these reactions, cells need diverse three-dimensional structures of interacting molecules.

2. The molecules must pass on the instructions for creating their constitutive components to their descendants. For this purpose, a simple one-dimensional information storage medium is the most effective.

We will see that proteins provide the three-dimensional diversity required by the first role, and DNA provides the one-dimensional information storage required by the second. Another cellular molecule, RNA, is an intermediary between DNA and proteins, and plays some of each of these two roles.

1.1. Proteins
Proteins have a variety of roles that they must fulfill:

1. They are the enzymes that rearrange chemical bonds.
2. They carry signals to and from the outside of the cell, and within the cell.
3. They transport small molecules.
4. They form many of the cellular structures.
5. They regulate cell processes, turning them on and off and controlling their rates.

This variety of roles is accomplished by the variety of proteins, which collectively can assume a variety of three-dimensional shapes.


A protein's three-dimensional shape, in turn, is determined by the particular one-dimensional composition of the protein. Each protein is a linear sequence made of smaller constituent molecules called amino acids. The constituent amino acids are joined by a backbone composed of a regularly repeating sequence of bonds. (See [31, Figure 1.4].) There is an asymmetric orientation to this backbone imposed by its chemical structure: one end is called the N-terminus and the other end the C-terminus. This orientation imposes directionality on the amino acid sequence. There are 20 different types of amino acids. The three-dimensional shape the protein assumes is determined by the specific linear sequence of amino acids from N-terminus to C-terminus. Different sequences of amino acids fold into different three-dimensional shapes. (See, for example, [10, Figure 1.1].) Protein size is usually measured in terms of the number of amino acids that comprise it. Proteins can range from fewer than 20 to more than 5000 amino acids in length, although an average protein is about 350 amino acids in length. Each protein that an organism can produce is encoded in a piece of the DNA called a gene (see Section 1.6). To give an idea of the variety of proteins one organism can produce, the single-celled bacterium E. coli has about 4300 different genes. Humans are believed to have about 50,000 different genes (the exact number as yet unresolved), so a human has only about 10 times as many genes as E. coli. The number of proteins that can be produced by humans greatly exceeds the number of genes, however, because a substantial fraction of the human genes can each produce many different proteins through a process called alternative splicing.

1.1.1. Classification of the Amino Acids


Each of the 20 amino acids consists of two parts:

1. a part that is identical among all 20 amino acids; this part is used to link one amino acid to another to form the backbone of the protein.

2. a unique side chain (or R group) that determines the distinctive physical and chemical properties of the amino acid.

Although each of the 20 different amino acids has unique properties, they can be classified into four categories based upon their major chemical properties. Below are the names of the amino acids, their 3 letter abbreviations, and their standard one letter symbols.

1. Positively charged (and therefore basic) amino acids (3).

   Arginine    Arg   R
   Histidine   His   H
   Lysine      Lys   K

2. Negatively charged (and therefore acidic) amino acids (2).

   Aspartic acid   Asp   D
   Glutamic acid   Glu   E

3. Polar amino acids (7). Though uncharged overall, these amino acids have an uneven charge distribution. Because of this uneven charge distribution, these amino acids can form hydrogen bonds with water. As a consequence, polar amino acids are called hydrophilic, and are often found on the outer surface of folded proteins, in contact with the watery environment of the cell.

   Asparagine   Asn   N
   Cysteine     Cys   C
   Glutamine    Gln   Q
   Glycine      Gly   G
   Serine       Ser   S
   Threonine    Thr   T
   Tyrosine     Tyr   Y

4. Nonpolar amino acids (8). These amino acids are uncharged and have a uniform charge distribution. Because of this, they do not form hydrogen bonds with water, are called hydrophobic, and tend to be found on the inside surface of folded proteins.

   Alanine         Ala   A
   Isoleucine      Ile   I
   Leucine         Leu   L
   Methionine      Met   M
   Phenylalanine   Phe   F
   Proline         Pro   P
   Tryptophan      Trp   W
   Valine          Val   V

Although each amino acid is different and has unique properties, certain pairs have more similar properties than others. The two nonpolar amino acids leucine and isoleucine, for example, are far more similar to each other in their chemical and physical properties than either is to the charged glutamic acid. In algorithms for comparing proteins to be discussed later, the question of amino acid similarity will be important.

1.2. DNA
DNA contains the instructions needed by the cell to carry out its functions. DNA consists of two long interwoven strands that form the famous double helix. (See [14, Figure 3-3].) Each strand is built from a small set of constituent molecules called nucleotides.

1.2.1. Structure of a Nucleotide


A nucleotide consists of three parts [14, Figure 3-2]. The first two parts are used to form the ribbon-like backbone of the DNA strand, and are identical in all nucleotides. These two parts are (1) a phosphate group and (2) a sugar called deoxyribose (from which DNA, DeoxyriboNucleic Acid, gets its name). The third part of the nucleotide is the base. There are four different bases, which define the four different nucleotides: thymine (T), cytosine (C), adenine (A), and guanine (G). Note in [14, Figure 3-2] that the five carbon atoms of the sugar molecule are numbered 1' through 5'. The base is attached to the 1' carbon. The two neighboring phosphate groups are attached to the 3' and 5' carbons. As is the case in the protein backbone (Section 1.1), the asymmetry of the sugar molecule imposes an orientation on the backbone, one end of which is called the 5' end and the other the 3' end. (See [14, Figure 3-4(a)].)


1.2.2. Base Pair Complementarity


Why is DNA double-stranded? This is due to base pair complementarity. If specific bases of one strand are aligned with specific bases on the other strand, the aligned bases can hybridize via hydrogen bonds, weak attractive forces between hydrogen and either nitrogen or oxygen. The specific complementary pairs are

A with T
G with C

Two hydrogen bonds form between A and T, whereas three form between C and G. (See [14, Figure 3-5].) This makes C-G bonds stronger than A-T bonds. If two DNA strands consist of complementary bases, under normal cellular conditions they will hybridize and form a stable double helix. However, the two strands will only hybridize if they are in antiparallel configuration. This means that the sequence of one strand, when read from the 5' end to the 3' end, must be complementary, base for base, to the sequence of the other strand read from 3' to 5'. (See [14, Figure 3-4(b) and 3-3].)
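Here is a minimal Python sketch of this rule: given one strand read 5' to 3', it produces the complementary strand, also read 5' to 3'. The function name and test string are illustrative only.

    # Watson-Crick complementary bases.
    COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}

    def reverse_complement(strand):
        # Complement each base and reverse the result, because the two
        # strands of the double helix are antiparallel.
        return "".join(COMPLEMENT[base] for base in reversed(strand))

    print(reverse_complement("ACGTT"))  # prints AACGT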

1.2.3. Size of DNA molecules


An E. coli bacterium contains one circular, double-stranded molecule of DNA consisting of approximately 5 million nucleotides. Often the length of double-stranded DNA is expressed in the units of basepairs (bp), kilobasepairs (kb), or megabasepairs (Mb), so that this size could be expressed equivalently as 5,000,000 bp, 5000 kb, or 5 Mb. Each human cell contains 23 pairs of chromosomes, each of which is a long, double-stranded DNA molecule. Collectively, the 46 chromosomes in one human cell consist of approximately 6 x 10^9 bp of DNA. Note that a human has about 1000 times more DNA than E. coli does, yet only about 10 times as many genes. (See Section 1.1.) The reason for this will be explained shortly.

1.3. RNA
Chemically, RNA is very similar to DNA. There are two main differences:

1. RNA uses the sugar ribose instead of deoxyribose in its backbone (from which RNA, RiboNucleic Acid, gets its name).

2. RNA uses the base uracil (U) instead of thymine (T). U is chemically similar to T, and in particular is also complementary to A.

RNA has two properties important for our purposes. First, it tends to be single-stranded in its normal cellular state. Second, because RNA (like DNA) has base-pairing capability, it often forms intramolecular hydrogen bonds, partially hybridizing to itself. Because of this, RNA, like proteins, can fold into complex three-dimensional shapes. (For an example, see http://www.ibc.wustl.edu/zuker/rna/hammerhead.html.) RNA has some of the properties of both DNA and proteins. It has the same information storage capability as DNA due to its sequence of nucleotides. But its ability to form three-dimensional structures allows it to have enzymatic properties like those of proteins. Because of this dual functionality of RNA, it has been conjectured that life may have originated from RNA alone, DNA and proteins having evolved later.

1.4. Residues
The term residue refers to either a single base constituent from a nucleotide sequence, or a single amino acid constituent from a protein. This is a useful term when one wants to speak collectively about these two types of biological sequences.

1.5. DNA Replication


What is the purpose of double-strandedness in DNA? One answer is that this redundancy of information is key to how the one-dimensional instructions of the cell are passed on to its descendant cells. During the cell cycle, the DNA double strand is split into its two separate strands. As it is split, each individual strand is used as a template to synthesize its complementary strand, to which it hybridizes. (See [14, Figure 5-2 and 5-1].) The result is two exact copies of the original double-stranded DNA. In more detail, an enzymatic protein called DNA polymerase splits the DNA double strand and synthesizes the complementary strand of DNA. It synthesizes this complementary strand by adding free nucleotides available in the cell onto the 3' end of the new strand being synthesized [14, Figure 5-3]. The DNA polymerase will only add a nucleotide if it is complementary to the opposing base on the template strand. Because the DNA polymerase can only add new nucleotides to the 3' end of a DNA strand (i.e., it can only synthesize DNA in the 5' to 3' direction), the actual mechanism of copying both strands is somewhat more complicated. One strand can be synthesized continuously in the 5' to 3' direction. The other strand must be synthesized in short 5'-to-3' fragments. Another enzymatic protein, DNA ligase, glues these synthesized fragments together into a single long DNA molecule. (See [14, Figure 5-4].)

1.6. Synthesis of RNA and Proteins


The one-dimensional storage of DNA contains the information needed by the cell to produce all its RNA and proteins. In this section, we describe how the information is encoded, and how these molecules are synthesized. Proteins are synthesized in a two-step process. First, an RNA copy of a portion of the DNA is synthesized in a process called transcription, described in Section 1.6.1. Second, this RNA sequence is read and interpreted to synthesize a protein in a process called translation, described in Section 1.6.2. Together, these two steps are called gene expression. A gene is a sequence of DNA that encodes a protein or an RNA molecule. Gene structure and the exact expression process are somewhat dependent on the organism in question. The prokaryotes, which consist of the bacteria and the archaea, are single-celled organisms lacking nuclei. Because prokaryotes have the simplest gene structure and gene expression process, we will start with them. The eukaryotes, which include plants and animals, have a somewhat more complex gene structure that we will discuss afterward.

"

"

!

"

"

"

!

!

LECTURE 1. BASICS OF MOLECULAR BIOLOGY

1.6.1. Transcription in Prokaryotes


How do prokaryotes synthesize RNA from DNA? This process, called transcription, is similar to the way DNA is replicated (Section 1.5). An enzyme called RNA polymerase copies one strand of the DNA gene into a messenger RNA (mRNA), sometimes called the transcript. The RNA polymerase temporarily splits the double-stranded DNA, and uses one strand as a template to build the complementary strand of RNA. (See [14, Figure 4-1].) It incorporates U opposite A, A opposite T, G opposite C, and C opposite G. The RNA polymerase begins this transcription at a short DNA pattern it recognizes called the transcription start site. When the polymerase reaches another DNA sequence called the transcription stop site, signalling the end of the gene, it drops off.

1.6.2. Translation
How is protein synthesized from mRNA? This process, called translation, is not as simple as transcription, because it proceeds from a 4 letter alphabet to the 20 letter alphabet of proteins. Because there is not a one-to-one correspondence between the two alphabets, amino acids are encoded by consecutive sequences of 3 nucleotides, called codons. (Taking 2 nucleotides at a time would give only 4^2 = 16 possible permutations, whereas taking 3 nucleotides yields 4^3 = 64 possible permutations, more than sufficient to encode the 20 different amino acids.) The decoding table is given in Table 1.1, and is called the genetic code. It is rather amazing that this same code is used almost universally by all organisms.

          U               C               A               G
   U   UUU Phe [F]    UCU Ser [S]    UAU Tyr [Y]    UGU Cys [C]    U
       UUC Phe [F]    UCC Ser [S]    UAC Tyr [Y]    UGC Cys [C]    C
       UUA Leu [L]    UCA Ser [S]    UAA STOP       UGA STOP       A
       UUG Leu [L]    UCG Ser [S]    UAG STOP       UGG Trp [W]    G
   C   CUU Leu [L]    CCU Pro [P]    CAU His [H]    CGU Arg [R]    U
       CUC Leu [L]    CCC Pro [P]    CAC His [H]    CGC Arg [R]    C
       CUA Leu [L]    CCA Pro [P]    CAA Gln [Q]    CGA Arg [R]    A
       CUG Leu [L]    CCG Pro [P]    CAG Gln [Q]    CGG Arg [R]    G
   A   AUU Ile [I]    ACU Thr [T]    AAU Asn [N]    AGU Ser [S]    U
       AUC Ile [I]    ACC Thr [T]    AAC Asn [N]    AGC Ser [S]    C
       AUA Ile [I]    ACA Thr [T]    AAA Lys [K]    AGA Arg [R]    A
       AUG Met [M]    ACG Thr [T]    AAG Lys [K]    AGG Arg [R]    G
   G   GUU Val [V]    GCU Ala [A]    GAU Asp [D]    GGU Gly [G]    U
       GUC Val [V]    GCC Ala [A]    GAC Asp [D]    GGC Gly [G]    C
       GUA Val [V]    GCA Ala [A]    GAA Glu [E]    GGA Gly [G]    A
       GUG Val [V]    GCG Ala [A]    GAG Glu [E]    GGG Gly [G]    G

Table 1.1: The Genetic Code

There is a necessary redundancy in the code, since there are 64 possible codons and only 20 amino acids. Thus each amino acid (with the exceptions of Met and Trp) is encoded by synonymous codons, which are interchangeable in the sense of producing the same amino acid. Only 61 of the 64 codons are used to encode amino acids. The remaining 3, called STOP codons, signify the end of the protein.
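The decoding in Table 1.1 is mechanical enough to express as a short program. Here is a minimal Python sketch of translation under the standard genetic code; the compact AMINO_ACIDS string simply lists the one-letter symbols of Table 1.1 with the first codon base varying slowest, and '*' marks the three STOP codons. The function and variable names are illustrative only.

    from itertools import product

    BASES = "UCAG"
    AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

    # Codon table in the same order as Table 1.1: first base U, C, A, G,
    # then second base, then third base.
    CODON_TABLE = {
        "".join(codon): aa
        for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
    }

    def translate(mrna):
        # Read codon by codon from the start of the coding sequence until a
        # STOP codon (or the end of the sequence) is reached.
        protein = []
        for i in range(0, len(mrna) - 2, 3):
            aa = CODON_TABLE[mrna[i:i + 3]]
            if aa == "*":          # STOP codon: release the finished protein
                break
            protein.append(aa)
        return "".join(protein)

    print(translate("AUGGCUUGGUAA"))  # Met-Ala-Trp, printed as "MAW"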


Ribosomes are the molecular structures that read mRNA and produce the encoded protein according to the genetic code. Ribosomes are large complexes consisting of both proteins and a type of RNA called ribosomal RNA (rRNA). The process by which ribosomes translate mRNA into protein makes use of yet a third type of RNA called transfer RNA (tRNA). There are 61 different transfer RNAs, one for each nontermination codon. Each tRNA folds (see Section 1.3) to form a cloverleaf-shaped structure. This structure produces a pocket that complexes uniquely with the amino acid encoded by the tRNA's associated codon, according to Table 1.1. The unique fit is accomplished analogously to a key and lock mechanism. Elsewhere on the tRNA is the anticodon, three consecutive bases that are complementary and antiparallel to the associated codon, and exposed for use by the ribosome. The ribosome brings together each codon of the mRNA with its corresponding anticodon on some tRNA, and hence its encoded amino acid. (See [14, Figure 4-4].)

Lecture 2

Basics of Molecular Biology (continued)


January 6, 2000 Notes: Tory McGrath

2.1. Course Projects


A typical course project might be to take some existing biological sequences from the public databases on the web, and design and run some sequence analysis experiments, using either publicly available software or your own program. For example, there is reason to believe that some of the existing bacterial genomes may be misannotated, in the sense that the identified genes are not actually located exactly as annotated. There is existing software to identify gene locations. We will discuss more such suggested projects as the course proceeds, but the choice of topic is quite flexible, and is open to suggestion, provided there is a large computational aspect. You will be required to check your project topic with the instructor before embarking. You may work on the project in groups of up to four people. For maximum effectiveness, it is recommended to have a mix of biology and math/computer participants in each group. The project will entail a short write-up as well as a short presentation of your problem, methods, and results.

2.2. Translation (continued)


In prokaryotes, which have no cell nucleus, translation begins while transcription is still in progress, the 5' end of the transcript being translated before the RNA polymerase has transcribed the 3' end. (See Drlica [14, Figure 4-4].) In eukaryotes, the DNA is inside the nucleus, whereas the ribosomes are in the cytoplasm outside the nucleus. Hence, transcription takes place in the nucleus, the completed transcript is exported from the nucleus, and translation then takes place in the cytoplasm. The ribosome forms a complex near the 5' end of the mRNA, binding around the start codon, also called the translation start site. The start codon is most often 5'-AUG-3', and the corresponding anticodon is 5'-CAU-3'. (Less often, the start codon is 5'-GUG-3' or 5'-UUG-3'.) The ribosome now brings together this start codon on the mRNA and its exposed anticodon on the corresponding tRNA, which hybridize to each other. (See [14, Figure 4-4].) The tRNA brings with it the encoded amino acid; in the case of the usual start codon 5'-AUG-3', this is methionine. Having incorporated the first amino acid of the synthesized protein, the ribosome shifts the mRNA three bases to the next codon. A second tRNA complexed with its specific amino acid hybridizes to the second codon via its anticodon, and the ribosome bonds this second amino acid to the first. At this point the ribosome releases the first tRNA, moves on to the third codon, and repeats. (See [14, Figure 4-5].) This process continues until the ribosome detects one of the STOP codons, at which point it releases the mRNA and the completed protein.

2.3. Prokaryotic Gene Structure


Recall from Section 1.6 that a gene is a relatively short sequence of DNA that encodes a protein or RNA molecule. In this section we restrict our attention to protein-coding genes in prokaryotes. The portion of the gene containing the codons that ultimately will be translated into the protein is called the coding region, or open reading frame. The transcription start site (see Section 1.6.1) is somewhat upstream from the start codon, where upstream means in the 5' direction. Similarly, the transcription stop site is somewhat downstream from the stop codon, where downstream means in the 3' direction. That is, the mRNA transcript contains sequence at both its ends that has been transcribed, but will not be translated. The sequence between the transcription start site and the start codon is called the 5' untranslated region. The sequence between the stop codon and the transcription stop site is called the 3' untranslated region. Upstream from the transcription start site is a relatively short sequence of DNA called the regulatory region. It contains promoters, which are specific DNA sites where certain regulatory proteins bind and regulate expression of the gene. These proteins are called transcription factors, since they regulate the transcription process. A common way in which transcription factors regulate expression is to bind to the DNA at a promoter and from there affect the ability (either positively or negatively) of RNA polymerase to perform its task of transcription. (There is also the analogous possibility of translational regulation, in which regulatory factors bind to the mRNA and affect the ability of the ribosome to perform its task of translation.)

2.4. Prokaryotic Genome Organization


The genome of an organism is the entire complement of DNA in any of its cells. In prokaryotes, the genome typically consists of a single chromosome of double-stranded DNA, and it is often circularized (its 5' and 3' ends attached) as opposed to being linear. A typical prokaryotic genome size would be in the millions of base pairs. Typically 90% of the prokaryotic genome consists of coding regions. For instance, the E. coli genome has size about 5 Mb and approximately 4300 coding regions, each of average length around 1000 bp. The genes are relatively densely and uniformly distributed throughout the genome.

2.5. Eukaryotic Gene Structure


An important difference between prokaryotic and eukaryotic genes is that the latter may contain introns. In more detail, the transcribed sequence of a general eukaryotic gene is an alternation between DNA sequences called exons and introns, where the introns are sequences that ultimately will be spliced out of the mRNA before it leaves the nucleus. Transcription in the nucleus produces an RNA molecule called pre-mRNA, produced as described in Section 1.6.1, that contains both the exons and introns. The introns are spliced out of the pre-mRNA by structures called spliceosomes to produce the mature mRNA that will be transported out of the nucleus for translation. A eukaryotic gene may contain numerous introns, and each intron may be many kilobases in size. One fact that is relevant to our later computational studies is that the presence of introns makes it much more difficult to identify the locations of genes computationally, given the genome sequence. Another important difference between prokaryotic and higher eukaryotic genes is that, in the latter, there can be multiple regulatory regions that can be quite far from the coding region, can be either upstream or downstream from it, and can even be in the introns.

2.6. Eukaryotic Genome Organization


Unlike prokaryotic genomes, many eukaryotic genomes consist of multiple linear chromosomes as opposed to single circular chromosomes. Depending on how simple the eukaryote is, very little of the genome may be coding sequence. In humans, less than 3% of the genome is believed to be coding sequence, and the genes are distributed quite nonuniformly over the genome.

2.7. Goals and Status of Genome Projects


Molecular biology has the following two broad goals:

1. Identify all key molecules of a given organism, particularly the proteins, since they are responsible for the chemical reactions of the cells.

2. Identify all key interactions among molecules.

Traditionally, molecular biologists have tackled these two goals simultaneously in selected small systems within selected model organisms. The genome projects today differ by focusing primarily on the first goal, but for all the systems of a given model organism. They do this by sequencing the genome, which means determining the entire DNA sequence of the organism. They then perform a computational analysis (to be discussed in later lectures) on the genome sequence to identify (most of) the genes. Having done this, (most of) the proteins of the organism will have been identified. With recent advances in sequencing technology, the genome projects have progressed very rapidly over the past five years. The first free-living organism to be completely sequenced was the bacterium H. influenzae [15], with a genome of size 1.8 Mb. Since that time, 18 bacterial, 6 archaeal, and 2 eukaryotic genomes have been sequenced. Presently there are approximately an additional 95 prokaryotic and 27 eukaryotic genomes in the process of being sequenced. (See, for example, the Genomes On Line Database at http://geta.life.uiuc.edu/nikos/genomes.html for the status of ongoing genome projects.) The human genome is expected to be sequenced within the next two years or so. Although every human is a unique individual, the genome sequences of any two humans are about 99.9% identical, so that it makes some sense to talk about sequencing the human genome, which will really be an amalgamation of a small collection of individuals. Once that is done, one of the interesting challenges is to identify the common polymorphisms, which are genomic variations that occur in a nonnegligible fraction of the population.


2.8. Sequence Analysis


Once a genome is completely sequenced, what sorts of analyses are performed on it? Some of the goals of sequence analysis are the following:

1. Identify the genes.

2. Determine the function of each gene. One way to hypothesize the function is to find another gene (possibly from another organism) whose function is known and to which the new gene has high sequence similarity. This assumes that sequence similarity implies functional similarity, which may or may not be true.

3. Identify the proteins involved in the regulation of gene expression.

4. Identify sequence repeats.

5. Identify other functional regions, for example origins of replication (sites at which DNA polymerase binds and begins replication; see Section 1.5), pseudogenes (sequences that look like genes but are not expressed), sequences responsible for the compact folding of DNA, and sequences responsible for nuclear anchoring of the DNA.

Many of these tasks are computational in nature. Given the incredible rate at which sequence data is being produced, the integration of computer science, mathematics, and biology will be integral to analyzing those sequences.

Lecture 3

Introduction to Sequence Similarity


January 11, 2000 Notes: Martin Tompa

3.1. Sequence Similarity


The next few lectures will deal with the topic of sequence similarity, where the sequences under consideration might be DNA, RNA, or amino acid sequences. This is likely the most frequently performed task in computational biology. Its usefulness is predicated on the assumption that a high degree of similarity between two sequences often implies similar function and/or three-dimensional structure. Most of the content of these lectures on sequence similarity is from Gusfield [20]. Why are we starting here, rather than with a discussion of how biologists determine the sequence in the first place? The reason is that the problems and algorithms of sequence similarity are reasonably simple to state. This makes it a good context in which to ensure that we agree on the language we will be using to discuss computing and algorithms. To begin that process, the word algorithm simply means an unambiguously specified method for solving a problem. In this context, an algorithm may be thought of as a computer program, although algorithms are usually expressed in a somewhat more abstract language than real programming languages.

3.2. Biological Motivation for Studying Sequence Similarity


We start with two motivating applications in which sequence similarity is utilized.

3.2.1. Hypothesizing the Function of a New Sequence


When a new genome is sequenced, the usual first analysis performed is to identify the genes and hypothesize their functions. Hypothesizing their functions is most often done using sequence similarity algorithms, as follows. One first translates the coding regions into their corresponding amino acid sequences, using the genetic code of Table 1.1. One then searches for similar sequences in a protein database that contains sequenced proteins (from related organisms) and their functions. Close matches allow one to make strong conjectures about the function of each matched gene. In a similar way, sequence similarity can be used to predict the three-dimensional structure of a newly sequenced protein, given a database of known protein sequences and their structures.


3.2.2. Researching the Effects of Multiple Sclerosis


Multiple sclerosis is an autoimmune disease in which the immune system attacks nerve cells in the patient. More specifically, the immune system's T-cells, which normally identify foreign bodies for immune system attacks, mistakenly identify proteins in the nerves' myelin sheaths as foreign. It was conjectured that the myelin sheath proteins identified by the T-cells were similar to viral and/or bacterial sheath proteins from an earlier infection. In order to test this hypothesis, the following steps were carried out:

1. the myelin sheath proteins were sequenced,
2. a protein database was searched for similar bacterial and viral sequences, and
3. laboratory tests were performed to determine if the T-cells attacked these same proteins.

The result was the identification of certain bacterial and viral proteins that were confused with the myelin sheath proteins.

3.3. The String Alignment Problem


The first task is to make the problem of sequence similarity more precise. A string is a sequence of characters from some alphabet. Given two strings acbcdb and cadbd, how should we measure how similar they are? Similarity is witnessed by finding a good alignment between the two strings. Here is one possible alignment of these two strings:

    a c - - b c d b
    - c a d b - d -

The special character "-" represents the insertion of a space, representing a deletion from its sequence (or, equivalently, an insertion in the other sequence). We can evaluate the goodness of such an alignment using a scoring function. For example, if an exact match between two characters scores +2, and every mismatch or deletion (space) scores -1, then the alignment above has score

    3(+2) + 5(-1) = +1.

This example shows only one possible alignment for the given strings. For any pair of strings, there are many possible alignments. The following definitions generalize this example.

Definition 3.1: If a and b are each a single character or space, then σ(a, b) denotes the score of aligning a and b. σ is called the scoring function.

In the example above, σ(x, x) = +2 for any character x, and σ(x, y) = σ(x, -) = σ(-, y) = -1 for any two distinct characters x and y. If one were designing a scoring function for comparing amino acid sequences, one would certainly want to incorporate into it the physico-chemical similarities and differences among the amino acids, such as those described in Section 1.1.1.

Definition 3.2: If S is a string, then |S| denotes the length of S and S(i) denotes the ith character of S (where the first character is S(1) rather than, say, S(0)).

For example, if S = acbcdb, then |S| = 6 and S(3) = b.

Definition 3.3: Let S and T be strings. An alignment A maps S and T into strings S' and T' that may contain space characters, where

1. |S'| = |T'|, and
2. the removal of spaces from S' and T' (without changing the order of the remaining characters) leaves S and T, respectively.

The value of the alignment A is

    sum_{i=1}^{l} σ(S'(i), T'(i)),

where l = |S'| = |T'|. In the example alignment above, if S = acbcdb and T = cadbd, then S' = ac--bcdb and T' = -cadb-d-.

Definition 3.4: An optimal alignment of S and T is one that has the maximum possible value for these two strings.

Finding an optimal alignment of S and T is the way in which we will measure their similarity. For the two strings given in the example above, is the alignment shown optimal? We will next present some algorithms for computing optimal alignments, which will allow us to answer that question.
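Definitions 3.1-3.4 can be made concrete with a few lines of Python. The sketch below scores a given alignment (S', T') under the example scoring function above (+2 for a match, -1 for a mismatch or space); the function names are illustrative only.

    MATCH, MISMATCH_OR_SPACE = 2, -1

    def sigma(a, b):
        # Scoring function of Definition 3.1 for the running example.
        return MATCH if a == b and a != "-" else MISMATCH_OR_SPACE

    def alignment_value(s_prime, t_prime):
        # Value of an alignment (Definition 3.3): the sum of sigma over all
        # aligned pairs.  s_prime and t_prime are the two rows of the
        # alignment, with '-' marking spaces.
        assert len(s_prime) == len(t_prime)
        return sum(sigma(a, b) for a, b in zip(s_prime, t_prime))

    print(alignment_value("ac--bcdb", "-cadb-d-"))  # the example alignment: value 1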

3.4. An Obvious Algorithm for Optimal Alignment

The most obvious algorithm is to try all possible alignments, and output any alignment with maximum value. We will examine this approach in more detail. A subsequence of a string S means a sequence of characters of S that need not be consecutive in S, but do retain their order as given in S. For instance, acd is a subsequence of acbcdb.

Suppose we are given strings S and T, and assume for the moment that |S| = |T| = n. Also, consider an arbitrary scoring function σ, subject to the reasonable restriction that σ(-, -) < 0. With this restriction, there is never a reason to align a pair of spaces.

The obvious algorithm for optimal alignment is given in Figure 3.1. This algorithm works correctly, but is it a good algorithm? If you tried running this algorithm on a pair of strings each of length 20 (which is ridiculously modest by biology standards), you would find it much too slow to be practical. The program would run for an hour on such inputs, even if the computer can perform a billion basic operations per second.

    for all k, 0 <= k <= n, do
        for all subsequences A of S with |A| = k do
            for all subsequences B of T with |B| = k do
                Form an alignment that matches A with B, and matches
                  all other characters with spaces;
                Determine the value of this alignment;
                Retain the alignment with maximum value;
            end;
        end;
    end;

Figure 3.1: Enumerating all Alignments to Find the Optimal

The running time analysis of this algorithm proceeds as follows. A string of length n has C(n, k) subsequences of length k, where C(n, k) denotes the number of combinations of n distinguishable objects taken k at a time. (See any textbook on combinatorial mathematics, for instance Roberts [39].) Thus, there are C(n, k)^2 pairs of subsequences each of length k. Consider one such pair. Since there are n characters in S, only k of which are matched with characters in T, there will be n - k characters in S unmatched to characters in T. Thus, the alignment has length 2n - k. We must look up and add the score of each pair in the alignment, so the total number of basic operations is at least

    sum_{k=0}^{n} C(n, k)^2 (2n - k)  >=  n sum_{k=0}^{n} C(n, k)^2  =  n C(2n, n).

(The equality sum_{k} C(n, k)^2 = C(2n, n) has a pretty combinatorial explanation that is worth discovering. By Stirling's approximation [39], the last quantity exceeds 2^(2n) for all n >= 4.) Thus, for n = 20, this algorithm requires more than 2^40 basic operations.
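For concreteness, here is a Python sketch of the enumeration in Figure 3.1: every alignment corresponds to a choice of k positions of S paired, in order, with k positions of T, with all remaining characters aligned with spaces. The sketch is deliberately naive, and its running time grows as described above; the names are illustrative only.

    from itertools import combinations

    def brute_force_alignment_value(s, t, match=2, mismatch_or_space=-1):
        # Enumerate every alignment as in Figure 3.1 and return the best value.
        n, m = len(s), len(t)
        best = None
        for k in range(min(n, m) + 1):
            for a in combinations(range(n), k):          # positions of s kept
                for b in combinations(range(m), k):      # positions of t kept
                    paired = sum(match if s[i] == t[j] else mismatch_or_space
                                 for i, j in zip(a, b))
                    spaces = (n - k) + (m - k)
                    value = paired + mismatch_or_space * spaces
                    best = value if best is None else max(best, value)
        return best

    print(brute_force_alignment_value("acbcdb", "cadbd"))  # prints 2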

3.5. Asymptotic Analysis of Algorithms


In Section 4.1 we will see a cleverer algorithm that runs in time proportional to n^2. For large n, it is clear that 2^(2n) is greater than n^2. As a demonstration that an algorithm that requires time proportional to n^2 is far more desirable than one that requires time 2^(2n), consider at what value of n these two functions cross. Suppose, for example, that the actual running time of the cleverer algorithm is 100 n^2. The value of 2^(2n) already exceeds this quadratic at n = 6. Suppose instead that the running time is 10000 n^2. Despite the fact that we increased the constant of proportionality by a factor of 100, 2^(2n) already exceeds this quadratic at n = 10. This demonstration should make it clear that the rate of growth of the high order term is the most important determinant of how long an algorithm runs on large inputs, independent of the constant of proportionality and any lower order terms. To formalize this notion, we introduce big O notation.

Definition 3.5: Let f(n) and g(n) be functions. Then f(n) = O(g(n)) if and only if there is a constant c > 0 such that, for all n sufficiently large, f(n) <= c g(n).

For example, 10 n^2 + 5n and n^2 / 2 are both O(n^2). For the former, c = 11 works, and for the latter, c = 1 works.

Lecture 4

Alignment by Dynamic Programming


January 13, 2000 Notes: Martin Tompa

4.1. Computing an Optimal Alignment by Dynamic Programming

Given strings S and T, with |S| = n and |T| = m, our goal is to compute an optimal alignment of S and T. Toward this goal, define V(i, j) as the value of an optimal alignment of the strings S(1)S(2)...S(i) and T(1)T(2)...T(j). The value of an optimal alignment of S and T is then V(n, m). The crux of dynamic programming is to solve the more general problems of computing all values V(i, j), with 0 <= i <= n and 0 <= j <= m, in order of increasing i and j. Each of these will be relatively simple to compute, given the values already computed for smaller i and/or j, using a recurrence relation. To start the process, we need a basis for i = 0 or j = 0.

BASIS:

    V(i, 0) = sum_{k=1}^{i} σ(S(k), -)   for 0 <= i <= n
    V(0, j) = sum_{k=1}^{j} σ(-, T(k))   for 0 <= j <= m

The basis for V(i, 0) says that if i characters of S are to be aligned with 0 characters of T, then they must all be matched with spaces. The basis for V(0, j) is analogous.

RECURRENCE: For 1 <= i <= n and 1 <= j <= m,

    V(i, j) = max( V(i-1, j-1) + σ(S(i), T(j)),
                   V(i-1, j)   + σ(S(i), -),
                   V(i, j-1)   + σ(-, T(j)) ).

This formula can be understood by considering an optimal alignment of the first i characters from S and the first j characters from T. In particular, consider the last aligned pair of characters in such an alignment. This last pair must be one of the following:

1. (S(i), T(j)), in which case the remaining alignment excluding this pair must be an optimal alignment of S(1)...S(i-1) and T(1)...T(j-1) (i.e., must have value V(i-1, j-1)), or

2. (S(i), -), in which case the remaining alignment excluding this pair must have value V(i-1, j), or

3. (-, T(j)), in which case the remaining alignment excluding this pair must have value V(i, j-1).

The optimal alignment chooses whichever among these three possibilities has the greatest value.
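Here is a minimal Python sketch of the basis and recurrence just described; it fills in the table V row by row and returns V(n, m). The helper sigma implements the +2/-1 scoring of the running example, s[i - 1] plays the role of S(i) in the notes' 1-based notation, and all names are illustrative.

    def global_alignment_value(s, t, sigma):
        # Fill in the dynamic programming table V of Section 4.1 and return
        # V(n, m), the value of an optimal (global) alignment of s and t.
        n, m = len(s), len(t)
        V = [[0] * (m + 1) for _ in range(n + 1)]
        for i in range(1, n + 1):                 # basis: s[0..i) against spaces
            V[i][0] = V[i - 1][0] + sigma(s[i - 1], "-")
        for j in range(1, m + 1):                 # basis: t[0..j) against spaces
            V[0][j] = V[0][j - 1] + sigma("-", t[j - 1])
        for i in range(1, n + 1):                 # recurrence, increasing i and j
            for j in range(1, m + 1):
                V[i][j] = max(V[i - 1][j - 1] + sigma(s[i - 1], t[j - 1]),
                              V[i - 1][j] + sigma(s[i - 1], "-"),
                              V[i][j - 1] + sigma("-", t[j - 1]))
        return V[n][m]

    def sigma(a, b):
        # +2 for a match, -1 for a mismatch or space, as in the running example.
        return 2 if a == b and a != "-" else -1

    print(global_alignment_value("acbcdb", "cadbd", sigma))  # prints 2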

4.1.1. Example
In aligning acbcdb and cadbd, the dynamic programming algorithm fills in the following values of V(i, j) from top to bottom and left to right, simply applying the basis and recurrence formulas. (As in the example of Section 3.3, assume that matches score +2, and mismatches and spaces score -1.) For instance, in the table below, the entry in row 4 and column 1 is obtained by computing max(-3 + 2, 0 - 1, -4 - 1) = -1.

             c    a    d    b    d
        0   -1   -2   -3   -4   -5
    a  -1   -1    1    0   -1   -2
    c  -2    1    0    0   -1   -2
    b  -3    0    0   -1    2    1
    c  -4   -1   -1   -1    1    1
    d  -5   -2   -2    1    0    3
    b  -6   -3   -3    0    3    2

The value of the optimal alignment is V(6, 5) = 2, and so can be read from the entry in the last row and last column. Thus, there is an alignment of acbcdb and cadbd that has value 2, so the alignment proposed in Section 3.3 with value 1 is not optimal. But how can one determine the optimal alignment itself, and not just its value?

4.1.2. Recovering the Alignments


The solution is to retrace the dynamic programming steps back from the (6, 5) entry, determining which preceding entries were responsible for the current one. For instance, in the table above, the (4, 2) entry could have followed from either the (3, 1) or the (3, 2) entry, since both yield the value -1. We can then follow any such path of predecessor entries from (6, 5) back to (0, 0), tracing out an optimal alignment. For this example there are three such paths. The optimal alignments corresponding to these three paths are

    a c b c d b -      a c b c d b -      - a c b c d b
    - c - a d b d      - c a - d b d      c a d b - d -

Each of these has three matches, one mismatch, and three spaces, for a value of 3(+2) + 1(-1) + 3(-1) = 2, the optimal alignment value.
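The traceback can also be sketched in a few lines of Python. The version below recomputes the table V (so that it stands on its own) and then retraces one path from (n, m) back to (0, 0), preferring the diagonal predecessor when several account for an entry's value; for the running example it recovers the third of the three alignments shown above. The names are illustrative only.

    def optimal_alignment(s, t, sigma):
        # Fill in the table V of Section 4.1, then trace one optimal path back
        # from (n, m) to (0, 0), rebuilding an optimal alignment (S', T').
        n, m = len(s), len(t)
        V = [[0] * (m + 1) for _ in range(n + 1)]
        for i in range(1, n + 1):
            V[i][0] = V[i - 1][0] + sigma(s[i - 1], "-")
        for j in range(1, m + 1):
            V[0][j] = V[0][j - 1] + sigma("-", t[j - 1])
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                V[i][j] = max(V[i - 1][j - 1] + sigma(s[i - 1], t[j - 1]),
                              V[i - 1][j] + sigma(s[i - 1], "-"),
                              V[i][j - 1] + sigma("-", t[j - 1]))
        # Traceback: step to any predecessor entry that accounts for the value.
        i, j, s_row, t_row = n, m, [], []
        while i > 0 or j > 0:
            if i > 0 and j > 0 and V[i][j] == V[i - 1][j - 1] + sigma(s[i - 1], t[j - 1]):
                s_row.append(s[i - 1]); t_row.append(t[j - 1]); i, j = i - 1, j - 1
            elif i > 0 and V[i][j] == V[i - 1][j] + sigma(s[i - 1], "-"):
                s_row.append(s[i - 1]); t_row.append("-"); i -= 1
            else:
                s_row.append("-"); t_row.append(t[j - 1]); j -= 1
        return V[n][m], "".join(reversed(s_row)), "".join(reversed(t_row))

    sigma = lambda a, b: 2 if a == b and a != "-" else -1
    print(optimal_alignment("acbcdb", "cadbd", sigma))
    # prints (2, '-acbcdb', 'cadb-d-')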

4.1.3. Time Analysis

Theorem 4.1: The dynamic programming algorithm computes an optimal alignment of S and T in time O(nm).

Proof: This algorithm requires an (n+1) x (m+1) table to be completed. Any particular entry is computed with a maximum of 6 table lookups, 3 additions, and a three-way maximum, that is, in time O(1), a constant. Thus, the complexity of the algorithm is at most O(nm). Reconstructing a single alignment can then be done in time O(n + m).

4.2. Searching for Local Similarity


Next we will discuss some variants of the dynamic programming approach to string alignment. We do this to demonstrate the versatility of the approach, and because the variants themselves arise in biological applications. In the variant called local similarity, we are searching for regions of similarity between two strings, within contexts that may be dissimilar. An example in which this arises is if we have two long DNA sequences that each contain a given gene, or perhaps closely related genes. Certainly the global alignment problem of Definitions 3.3 and 3.4 will not in general identify these genes. We can formulate this problem as the local alignment problem: Given two strings S and T, with |S| = n and |T| = m, find substrings (i.e., contiguous subsequences) α of S and β of T such that the optimal (global) alignment of α and β has the maximum value over all such substrings α and β. In other words, the optimal alignment of α and β must have at least as great a value as the optimal alignment of any other substrings of S and of T.

4.2.1. An Obvious Local Alignment Algorithm


The definition above immediately suggests an algorithm for local alignment:

    for all substrings α of S do
        for all substrings β of T do
            Find an optimal alignment of α and β by dynamic programming;
            Retain the α and β with maximum alignment value, and their alignment;
        end;
    end;
    Output the retained α, β, and alignment;

There are n(n+1)/2 choices of α, and m(m+1)/2 choices of β (excluding the length 0 substrings as choices). Using Theorem 4.1, it is not difficult to show that the time taken by this algorithm is O(n^3 m^3). We will see in Section 5.1, however, that it is possible to compute the optimal local alignment in time O(nm), that is, the same time used for the optimal global alignment.

4.2.2. Set-Up for Local Alignment by Dynamic Programming


Definition 4.2: The empty string λ is the string with |λ| = 0.

Definition 4.3: α is a prefix of S if and only if α = λ or α = S(1)S(2)...S(k), for some k, where 1 <= k <= |S|.

Definition 4.4: β is a suffix of S if and only if β = λ or β = S(k)S(k+1)...S(|S|), for some k, where 1 <= k <= |S|.

For example, let S = abcxdex. The prefixes of S include ab. The suffixes of S include xdex. The empty string λ is both a prefix and a suffix of S.

Definition 4.5: Let S and T be strings with |S| = n and |T| = m. For 0 <= i <= n and 0 <= j <= m, let v(i, j) be the maximum value of an optimal (global) alignment of α and β over all suffixes α of S(1)S(2)...S(i) and all suffixes β of T(1)T(2)...T(j).

For example, suppose S = abcxdex and T = xxxcde. Score a match as +2 and a mismatch or space as -1. Then v(5, 5) = 3, with α = cxd, β = cd, and alignment as

     c  x  d
     c  -  d
    +2 -1 +2

The dynamic programming algorithm for optimal local alignment is similar to the dynamic programming algorithm for optimal global alignment given in Section 4.1. It proceeds by filling in a table with the values v(i, j), with i and j increasing. The value of each entry is calculated according to a new basis and recurrence for v(i, j), given in Section 5.1. Unlike the global alignment algorithm, however, the value of the optimal local alignment can be any entry, whichever contains the maximum of all values of v(i, j). The reason for this is that each entry v(i, j) represents an optimal pair of suffixes of a given pair of prefixes. Since a suffix of a prefix is just a substring, we find the optimal pair of substrings by maximizing over all possible pairs (i, j).

Lecture 5

Local Alignment, and Gap Penalties


January 18, 2000 Notes: Martin Tompa

5.1. Computing an Optimal Local Alignment by Dynamic Programming


BASIS: For simplicity, we will make the reasonable assumption that σ(x, -) <= 0 and σ(-, y) <= 0 for all characters x and y. Then

    v(i, 0) = 0   for 0 <= i <= n,   and   v(0, j) = 0   for 0 <= j <= m,

since the optimal suffix to align with a string of length 0 is the empty suffix.

RECURRENCE: for 1 <= i <= n and 1 <= j <= m,

    v(i, j) = max( 0,
                   v(i-1, j-1) + σ(S(i), T(j)),
                   v(i-1, j)   + σ(S(i), -),
                   v(i, j-1)   + σ(-, T(j)) ).

The formula looks very similar to the recurrence for the optimal global alignment in Section 4.1. Of course, the meaning is somewhat different and we have an additional term 0 in the max function. The recurrence is explained as follows. Consider an optimal alignment A of a suffix α of S(1)...S(i) and a suffix β of T(1)...T(j). There are four possible cases:

1. α = λ and β = λ, in which case the alignment has value 0.

2. α ≠ λ and β ≠ λ, and the last matched pair in A is (S(i), T(j)), in which case the remainder of A has value v(i-1, j-1).

3. α ≠ λ, and the last matched pair in A is (S(i), -), in which case the remainder of A has value v(i-1, j).

4. β ≠ λ, and the last matched pair in A is (-, T(j)), in which case the remainder of A has value v(i, j-1).

The optimal alignment chooses whichever of these cases has greatest value.

5.1.1. Example
For example, let abcxdex and xxxcde, and suppose a match scores , and a mismatch or a space scores . The dynamic programming algorithm lls in the table of values from top to bottom and left to right, as follows:

The value of the optimal local alignment is . We can reconstruct optimal alignments as in Section 4.1.2, by retracing from any maximum entry to any zero entry:

The optimal local alignments corresponding to these paths are c c x d d e e and x x c d d e e .

Both alignments have three matches and one space, for a value of . You can also see from this diagram how the value was derived in the example following Denition 4.5, which said that for the same strings and , .

5.1.2. Time Analysis


Theorem 5.1: The dynamic programming algorithm computes an optimal local alignment in time .

entries requires at most 6 table lookups, 3 Proof: Computing the value for each of the additions, and 1 max calculation. Reconstructing a single alignment can then be done in time .

  !"

"    

" 

!  

 

    

"

"

 

  

  

"

"

  

! !

"

LECTURE 5. LOCAL ALIGNMENT, AND GAP PENALTIES

24

5.2. Space Analysis


The space required for either the global or local optimal alignment algorithm is also quadratic in the length of the strings being compared. This could be prohibitive for comparing long DNA sequences. There is a modication of the dynamic programming algorithm that computes an optimal alignment in space and still runs in time. If one were interested only in the value of an optimal alignment, this could be done simply by retaining only two consecutive rows of the dynamic programming table at any time. Reconstructing an alignment is somewhat more complicated, but can be done in space and time with a divide and conquer approach (Hirschberg [24], Myers and Miller [36]).

5.3. Optimal Alignment with Gaps


Denition 5.2: A gap in an alignment of and is a maximal substring of either or consisting and are and with spaces inserted as dictated by only of spaces. (Recall from Denition 3.3 that the alignment.)

5.3.1. Motivations
In certain applications, we may not want to have a penalty proportional to the length of a gap. 1. Mutations causing insertion or deletion of large substrings may be considered a single evolutionary event, and may be nearly as likely as insertion or deletion of a single residue. 2. cDNA matching: Biologists are very interested in learning which genes are expressed in which types of specialized cells, and where those genes are located in the chromosomal DNA. Recall from Section 2.5 that eukaryotic genes often consist of alternating exons and introns. The mature mRNA that leaves the nucleus after transcription has the introns spliced out. To study gene expression within specialized cells, one procedure is as follows: (a) Capture the mature mRNA as it leaves the nucleus. (b) Make complementary DNA (abbreviated cDNA) from the mRNA using an enzyme called reverse transcriptase. The cDNA is thus a concatenation of the genes exons. (c) Sequence the cDNA. (d) Match the sequenced cDNA against sequenced chromosomal DNA to nd the region of chromosomal DNA from which the cDNA derives. In this process we do not want to penalize heavily for the introns, which will match gaps in the cDNA. In general, the gap penalty may be some arbitrary function of the gap length . The best choice of this function, like the best choice of a scoring function, depends on the application. In the cDNA matching application, we would like the penalty to reect what is known about the common lengths of introns. In the next section we will see an time algorithm for the case when is an arbitrary linear afne function, and this is adequate for many applications. There are programs that use piecewise linear functions as gap penalties, and these may be more suitable in the cDNA matching application. There are time algorithms for the case when is concave downward (Galil and Giancarlo [17], Miller and Myers [35]). We could even implement an arbitrary function as a gap penalty function, but the known algorithm

LECTURE 5. LOCAL ALIGNMENT, AND GAP PENALTIES

25

for this requires cubic time (Needleman and Wunsch [37]), and such an algorithm is probably not useful in practice.

5.3.2. Afne Gap Model


We will study a model in which the penalty for a gap has two parts: a penalty for initiating a gap, and another where and penalty that depends linearly on the length of a gap. That is, the gap penalty is are both constants, , , and is the length of the gap. (Note that the model with a constant penalty regardless of gap length is the special case with .) For simplicity, assume we are modifying the global alignment algorithm of Section 4.1 to accommodate an afne gap penalty. Similar ideas would work for local alignment as well.

# gaps

# spaces .

5.3.3. Dynamic Programming Algorithm


Once again, the algorithm proceeds by aligning , dene the following variables: 1. 2. 3. 4. with . For these prexes of

is the value of an optimal alignment of

and and and and

. whose last pair matches whose last pair matches whose last pair matches

is the value of an optimal alignment of with . is the value of an optimal alignment of with a space.

is the value of an optimal alignment of a space with .

BASIS :

      

R ECURRENCE : For

and

for

for

for

for

 

    

 

 

where

and

are

and

with spaces inserted, and

    

We will assume then is to maximize

, since the spaces will be penalized as part of the gap. Our goal

 

 

 

   

and

LECTURE 5. LOCAL ALIGNMENT, AND GAP PENALTIES

26

The equation for (and analogously ) can be understood as taking the maximum of two cases: adding another space to an existing gap, and starting a new gap. To understand why starting a new gap can use , which includes the possibility of an alignment ending in a gap, consider that , so that is always dominated by , so will never be chosen by the max.

5.3.4. Time Analysis

Proof: The algorithm proceeds as those we have studied before, but in this case there are three or four or calculate them matrices to ll in simultaneously, depending on whether you store the values of from the other three matrices when needed.

5.4. Bibliographic Notes on Alignments


Bellman [6] began the systematic study of dynamic programming. The original paper on global alignment is that of Needleman and Wunsch [37]. Smith and Waterman [45] introduced the local alignment problem, and the algorithm to solve it. A number of authors have studied the question of how to construct a good scoring function for sequence comparison, including Karlin and Altschul [27] and Altschul [3].

Theorem 5.3: An optimal global alignment with afne gap penalty can be computed in time


.

   

         

   

   

 

Lecture 6

Multiple Sequence Alignment


January 20, 2000 Notes: Martin Tompa
While previous lectures discussed the problem of determining the similarity between two strings, this lecture turns to the problem of determining the similarity among multiple strings.

6.1. Biological Motivation for Multiple Sequence Alignment


6.1.1. Representing Protein Families
An important motivation for studying the similarity among multiple strings is the fact that protein databases are often categorized by protein families. A protein family is a collection of proteins with similar structure (i.e., three-dimensional shape), similar function, or similar evolutionary history. When we have a newly sequenced protein, we would like to know to which family it belongs, as this provides hypotheses about its structure, function, or evolutionary history. (See Section 3.2.1.) The new protein might not be particularly similar to a single protein in the database, yet might still share considerable similarity with the collective members of a family of proteins. One approach is to construct a representation for each protein family, for example a good multiple sequence alignment of all its members. Then, when we have a newly sequenced protein and want to nd its family, we only have to compare it to the representation of each family. Common structure, function, or origin of a molecule may only be weakly reected in its sequence. For example, the three-dimensional structure of a protein is very difcult to infer from its sequence, and yet is very important to predict its function. Multiple sequence comparisons may help highlight weak sequence similarity, and shed light on structure, function, or origin.

6.1.2. Repetitive Sequences in DNA


In the DNA domain, a motivation for multiple sequence alignment arises in the study of repetitive sequences. These are sequences of DNA, often without clearly understood biological function, that are repeated many times throughout the genome. The repetitions are generally not exact, but differ from each other in a small number of insertions, deletions, and substitutions. As an example, the Alu repeat is approximately 300 bp long, and appears over 600,000 times in the human genome. It is believed that as much as 60% of the human genome may be attributable to repetitive sequences without known biological function. (See Jurka and Batzer [26].) In order to highlight the similarities and differences among the instances of such a repeat family, one would like to display a good multiple sequence alignment of its constituent sequences.

27

LECTURE 6. MULTIPLE SEQUENCE ALIGNMENT

28

6.2. Formulation of the Multiple String Alignment Problem


We now dene the problem more precisely.

Denition 6.1: Given strings that may contain spaces, where

2. the removal of spaces from

leaves

, for

The question that arises next is how to assign a value to such an alignment. In a pairwise alignment, we simply summed the similarity score of corresponding characters. In the case of multiple string alignment, there are various scoring methods, and controversy around the question of which is best. We focus here on a scoring method called the sum-of-pairs score. Other methods are explored in the homework. Until now, we have been using a scoring function that assigns higher values to better alignments and lower values to worse alignments, and we have been trying to nd alignments with maximum value. For the that measures the distance between characters remainder of this lecture, we will switch to a function and . That is, it will assign higher values the more distant two strings are. In the case of two strings, we will thus be trying to minimize

In this denition we assume that the scoring function is symmetric. For simplicity, we will not discuss the issue of a separate gap penalty. Example 6.3: Consider the following alignment:

a a Using the distance function value .

c c -

c a c

d d d

b b a

d d

Denition 6.4: An optimal SP (global) alignment of strings minimum possible sum-of-pairs value for these strings.



, and

for

, this alignment has a sum-of-pairs

is an alignment that has the

Denition 6.2: The sum-of-pairs (SP) value for a multiple global alignment of the values of all pairwise alignments induced by .

of

  

  



! "

where

    

  

" " 

1.

, and



a multiple (global) alignment maps them to strings

       

strings is the sum

LECTURE 6. MULTIPLE SEQUENCE ALIGNMENT

29

6.3. Computing an Optimal Multiple Alignment by Dynamic Programming


Given strings each of length , there is a generalization of the dynamic programming algorithm of Section 4.1 that nds an optimal SP alignment. Instead of a -dimensional table, it lls in a -dimensional table. This table has dimensions

that is, entries. Each entry depends on adjacent entries, corresponding to the possibilities for the last match in an optimal alignment: any of the subsets of the strings could participate in that match, except for the empty subset. The details of the algorithm itself and the recurrence are left as exercises for the reader. Because each of the entries can be computed in time proportional to , the running time of the algorithm is . If (as is typical for the length of proteins), it would be practical only for very small values of , perhaps 3 or 4. However, typical protein families have hundreds of members, so this algorithm is of no use in the motivational problem posed in Section 6.1. We would like an algorithm that works for in the hundreds too, which would be possible only if the running time were polynomial in both and . (In particular, should not appear in the exponent as it does in the expression .) Unfortunately, we are very unlikely to nd such an algorithm, which is a consequence of the following theorem: Theorem 6.5 (Wang and Jiang [50]): The optimal SP alignment problem is NP-complete. What NP-completeness means and what its consequences are will be discussed in the following section.

6.4. NP-completeness
In this section we give a brief introduction to NP-completeness, and how problems can be proved to be NP-complete. Denition 6.6: A problem has a polynomial time solution if and only if there is some algorithm that , where is a constant and is the size of the input. solves it in time Many familiar computational problems have polynomial time solutions:


1. two-string optimal alignment problem:

(Theorem 4.1) ,

The last entry illustrates that having a polynomial time solution does not mean that the algorithm is practical. In most cases the converse, though, is true: an algorithm whose running time is not polynomial is likely to be impractical for all but the smallest size inputs. NP-complete problems are equivalent in the sense that if any one of them has a polynomial time solution, then all of them do. One of the major open questions in computer science is whether there is a polynomial

4. 100-string optimal alignment problem:

(Section 6.3).

3. two-string alignment with arbitrary gap penalty function:

2. sorting:

[12],

(Section 5.3.1),

! " 

LECTURE 6. MULTIPLE SEQUENCE ALIGNMENT

30

time solution for any of the NP-complete problems. Almost all experts conjecture strongly that the answer to this question is no. The bulk of the evidence supporting this conjecture, however, is only the failure to nd such a polynomial time solution in thirty years. In 1971, Cook dened the notion of NP-completeness and showed the NP-completeness of a small collection of problems, most of them from the domain of mathematical logic [11]. Roughly speaking, he dened NP-complete problems to be problems that have the property that we can verify in polynomial time whether a supplied solution is correct. For instance, if you did not have to compute an optimal SP alignment, but simply had to verify that a given alignment had SP value at most , a given integer, it would be easy to write a polynomial time algorithm to do so. Shortly after Cooks work, Karp recognized the wide applicability of the concept of NP-completeness. He showed that a diverse host of problems are each NP-complete [28]. Since then, many hundreds of natural problems from many areas of computer science and mathematics such as graph theory, combinatorial optimization, scheduling, and symbolic computation have been proven NP-complete; see Garey and Johnson [18] for details. Proving a problem to be NP-complete proceeds in the following way. Choose a known NP-complete problem . Show that has a polynomial time algorithm if it is allowed to invoke a polynomial time subroutine for , and vice versa. There are many computational biology problems that are NP-complete, yet in practice we still need to solve them somehow. There are different ways to deal with an NP-complete problem: 1. We might give up on the possibility of solving the problem on anything but small inputs, by using an exhaustive (nonpolynomial time) search algorithm. We can sometimes use dynamic programming or branch-and-bound techniques to cut down the running time of such a brute force exhaustive search. 2. We might give up guaranteed efciency by settling for an algorithm that is sufciently efcient on inputs that arise in practice, but is nonpolynomial on some worst-case inputs that (hopefully) do not arise in practice. There may be an algorithm that runs in polynomial time on average inputs, being careful to dene the input distribution so that the practical inputs are highly probable. 3. We might give up guaranteed optimality of solution quality by settling for an approximate algorithm that gives a suboptimal solution, especially if the suboptimal solution is provably not much worse than the optimal solution. (An example is given in Section 6.5.) 4. Heuristics (local search, simulated annealing, genetic algorithms, and many others) can also be used to improve the quality of solution or running time in practice. We will see several examples throughout the remaining lectures. However, rigorous analysis of heuristic algorithms is generally unavailable. 5. The problem to be solved in practice may be more specialized than the general one that was proved NP-complete. In the following section we will look at the approximation approach to nd a solution for the multiple string alignment problem.

LECTURE 6. MULTIPLE SEQUENCE ALIGNMENT

31

6.5. An Approximation Algorithm for Multiple String Alignment


In this section we will show that there is a polynomial time algorithm (called the Center Star Alignment Algorithm) that produces multiple string alignments whose SP values are less than twice that of the optimal solutions. This result is due to Guseld [19]. Although the factor of 2 may be unacceptable in some applications, the result will serve to illustrate how approximation algorithms work. In this section we will make the following assumptions about the distance function:

The triangle inequality says that the distance along one edge of a triangle is at most the sum of the distances along the other two edges. Although intuitively plausible, be aware that not all distance measures used in biology obey the triangle inequality.

6.5.1. Algorithm

This can be done by running the dynamic programming algorithm of Section 4.1 on each of the pairs of strings in . Call the remaining strings in . Add these strings one at a time to a multiple alignment that initially contains only , as follows. Suppose are already aligned as ming algorithm of Section 4.1 on and to produce from those columns where spaces were added to get . To add , run the dynamic programand . Adjust by adding spaces to . Replace by .

6.5.2. Time Analysis




Theorem 6.8: The approximation algorithm of Section 6.5.1 runs in time each of length at most .

when given strings

Proof: By Theorem 4.1, each of the values required to compute can be computed in , so the total time for this portion is . After adding to the multiple string time alignment, the length of is at most , so the time to add all strings to the multiple string alignment is



                        

The approximation algorithm is as follows. The input is a set minimizes

of

strings. First nd

    



  



Denition 6.7: For strings distance of and .

and , dene

to be the value of the minimum (global) alignment

 

2. Triangle Inequality:

, for all characters , , and , and

  "     

1.

 

  

, for all characters .

that

LECTURE 6. MULTIPLE SEQUENCE ALIGNMENT

32

6.5.3. Error Analysis


What remains to be shown is that the algorithm produces a solution that is less than a factor of 2 worse than the optimal solution. Let be the alignment produced by this algorithm, let be the distance induces on the pair , and let

for all . This is because the algorithm used an optimal alignment of and Then , and , since . If the algorithm later adds spaces to both and , it does so in the same columns. Let be the optimal alignment, be the distance

induces on the pair

, and

Theorem 6.9: That is, the algorithm of Section 6.5.1 produces an alignment whose SP value is less than twice that of the optimal SP alignment.

Proof: We will derive an upper bound on quotient.

and a lower bound on

, and then take their

(triangle inequality)

(explained below)

The third line follows because each

occurs in

terms of the second line.

(Denition 6.7)



          

 

      

       

     

       

       

     

"

 

 "

      

     

Note that

is exactly twice the SP score of

, since every pair of strings is counted twice.

       

LECTURE 6. MULTIPLE SEQUENCE ALIGNMENT

33

(denition of

Note that for small values of , the approximation is signicantly better than a factor of 2. Furthermore, the error analysis does not mean that the approximation solution is always times the optimal solution. It means that the quality of the solution is never worse than this, and may be better in practice.

6.5.4. Other Approaches


In the Center Star Algorithm discussed in Section 6.5.1, we always try to align the chosen center string with the unaligned strings. However, there might be cases in which some of the strings are very near to each other and form clusters. It might be an advantage to align strings in the same cluster rst, and then merge the clusters of strings. The problem with this is how to dene near and how to dene clusters. There are many variants on this idea, which sometimes are called iterative pairwise alignment methods. Here is one version: an unaligned string nearest to any aligned string is picked and aligned with the previously aligned group. (For those who have seen it before, note the similarity to Prims minimum spanning tree algorithm [12].) The nearest string is chosen based on optimal pairwise alignments between individual strings in the multiple alignment and unaligned strings, without regard to spaces inserted in the multiple alignment. Now the problem is to specify how to align a string with a group of strings. One possible method is to mimic the technique that was used to add to the center star alignment in Section 6.5.1.

6.6. The Consensus String


Given a multiple string alignment, it is sometimes useful to derive from it a consensus string that can be used to represent the entire set of strings in the alignment. of strings , the consensus character of Denition 6.10: Given a multiple alignment column of is the character that minimizes the sum of distances to it from all the characters in column ; that is, it minimizes . Let be this minimum sum. The consensus string is the concatenation of all the consensus characters, where . The alignment error of is then dened to be . For instance, the consensus string for the multiple string alignment in Example 6.3 is ac-cdbd, and its alignment error is 6, the number of characters in the aligned strings that differ from the consensus character in the corresponding position.



 "

Combining these inequalities,

      

 

      

LECTURE 6. MULTIPLE SEQUENCE ALIGNMENT

34

6.7. Summary
Multiple sequence alignment is a very important problem in computational biology. It appears to be impossible to obtain exact solutions in polynomial time, even with very simple scoring functions. A variety of (provably) bounded approximation algorithms are known, and a number of heuristic algorithms have been suggested, but it still remains largely an open problem.

Lecture 7

Finding Instances of Known Sites


January 25, 2000 Notes: Elisabeth Rosenthal
With this lecture we begin a study of how to identify functional regions from biological sequence data. This includes the problem of how to identify relatively long functional regions such as genes, but we begin instead with the problem of identifying shorter functional regions. A site is a short sequence that contains some signal, that signal often being recognized by some enzyme. Examples of nucleotide sequence sites include the following: 1. origins of replication, where DNA polymerase initially binds (Section 1.5), 2. transcription start and stop sites (Section 1.6.1), 3. ribosome binding sites in prokaryotes (Section 2.2), 4. promoters, or transcription factor binding sites (Section 2.3), and 5. intron splice sites (Section 2.5). We will further subdivide the problem of identifying sites into the problems of nding instances of a known site, and nding instances of unknown sites. We begin with the former. What makes all these problems interesting and challenging is that instances of a single site will generally not be identical, but will instead vary slightly.

7.1. How to Summarize Known Sites


Suppose that we have a large sample of length sites, and a large sample of length nonsites. Given a new sequence of length , is more likely to be a site or a nonsite? If we can derive an efcient way to determine this, we can screen an entire genome, testing every length sequence, and thereby generate a complete list of candidate sites (excepting sequences where the test gives the wrong answer). To illustrate, the cyclic AMP receptor protein (CRP) is a transcription factor (see Section 2.3) in E. coli. Its binding sites are DNA sequences of length approximately 22. Table 7.1, taken from Stormo and Hartzell [47], shows just positions 39 (out of the 22 sequence positions) in 23 bona de CRP binding sites. The signal in Table 7.1 is not easy to detect at rst glance. Notice, though, that in the second column T predominates and in the third column G predominates, for example. Our rst goal is to capture the most relevant information from these 23 sites in a concise form. (This would clearly be more important if we 35

LECTURE 7. FINDING INSTANCES OF KNOWN SITES

36

TTGTGGC TTTTGAT AAGTGTC ATTTGCA CTGTGAG ATGCAAA GTGTTAA ATTTGAA TTGTGAT ATTTATT ACGTGAT ATGTGAG TTGTGAG CTGTAAC CTGTGAA TTGTGAC GCCTGAC TTGTGAT TTGTGAT GTGTGAA CTGTGAC ATGAGAC TTGTGAG Table 7.1: Positions 39 from 23 CRP Binding Sites [47]

A C G T

0.35 0.17 0.13 0.35

0.043 0.087 0 0.87

0 0.043 0.78 0.17

0.043 0.043 0 0.91

0.13 0 0.83 0.043

0.83 0.043 0.043 0.087

0.26 0.3 0.17 0.26

Table 7.2: Prole for CRP Binding Sites Given in Table 7.1

LECTURE 7. FINDING INSTANCES OF KNOWN SITES

37

were given thousands of sites rather than just 23.) In order to do this, suppose that the sequence residues are from an alphabet of size . Consider a matrix where is the fraction of sequences in that have residue in position . Table 7.2 shows the matrix for the CRP sites given in Table 7.1. Such a matrix is called a prole. The prole shows the distribution of residues in each of the positions. For example, in column 1 of the matrix the residues are quite mixed, in column 2, T occurs of the time, etc.

7.2. Using Probabilities to Test for Sites


An alternative way to think of is in terms of probability. Let be chosen randomly and uniformly from . Then . In words, this says, is the probability that the -th residue of is the residue , given that is chosen randomly from . For instance, . For the time being, we will make the following Independence Assumption: which residue occurs at position is independent of the residues occurring at other positions. In other words, residues at any two different positions are uncorrelated. Although this assumption is not always realistic, it can be justied in some circumstances. The rst justication is that it keeps the model and resulting analysis simple. The second justication is its predictive power in some (but admittedly not all) situations. The independence assumption can be made precise in probabilistic terms: Denition 7.1: Two probabilistic events and are said to be independent if the probability that they . both occur is the product of their individual probabilities, that is, Under the independence assumption, the probability that a randomly chosen site has a specied sequence is determined by Denition 7.1 as follows:
      

is a site

For example, suppose we want to know the probability that a randomly chosen CRP binding site will be TTGTGAC. By using Equation (7.1) and Table 7.2,
 

Although this probability is small, it is the largest probability of any site sequence, because each position contains the most probable residue. Now form the prole from the sample of nonsites in the same way. Using the proles and , let us return to the question of whether a given sequence is more likely to be a site or nonsite. In order to do this, we dene the likelihood ratio.
1 "0

6%

5%

is a site is a nonsite

'%

'%

Denition 7.2: Given the sequence dened to be


2

, the likelihood ratio, denoted by

  34

$      

TTGTGAC

is a site

 ! 

"  "  "   )(   ! " 

 "

% '&

  

 

is a site

is a site

 

!



     

  

  

(7.1)

, is

LECTURE 7. FINDING INSTANCES OF KNOWN SITES

38

Table 7.3: Log Likelihood Weight Matrix for CRP Binding Sites To illustrate, let , the set of all length seven sequences. The corresponding prole for all and . Then for TTGTGAC,

If is not small and some entries in and are small, then the likelihood ratio may be intractably large or small, causing numerical problems in the calculation. To alleviate this, we dene the log likelihood ratio.
1 0 "0

To test for sites, it is convenient to create a scoring matrix whose entries are the log likelihood ratios, that is, . Table 7.3 shows the weight matrix for the example CRP samples and we have been discussing. In order to compute , Denition 7.3 says to add the corresponding scores from : . A technical difculty arises when an entry is 0, because the corresponding entry is then . If the residue cannot possibly occur in position of any site for biological reasons, then there is no problem. More often, though, this is a result of having too small a sample of sites. In this case, there by a small positive number (see, for are various small sample correction formulas, which replace example, Lawrence et al. [29]), but we will not discuss them here.

 

1 0 "0

The corresponding test of is that is more likely to be a site if

6%

5%

5% 6%

Denition 7.3: Given the sequence is dened to be

, the log likelihood ratio, denoted by

1 0 "0

5%

1 "0

To test a sequence , compare likely to be a site if


 

  !    !   !  )(         )(   !  !!    !  !  !  !    !   !   
to a prespecied constant cutoff

 
has , and declare more ,

"

 

!  ! 

( )

6%

5%

 

1 0

1 0

 

1 "0

1 0 0

! 

Lecture 8

Relative Entropy
January 27, 2000 Notes: Anne-Louise Leutenegger

8.1. Weight Matrices


A weight matrix is any matrix that assigns a score to each sequence according to the formula . The log likelihood ratio matrix described at the end of Section 7.2, and illustrated in Table 7.3, is an example of a weight matrix. In computing log likelihood ratios, we often take to be the background distribution of residue in the entire genome, or a large portion of the genome. That is, is the frequency with which residue appears in the genome as a whole. In this case, is independent of , that is, for all and . Note, however, that this does not mean that in the case of nucleotides. Although this uniform distribution is a fair estimate for the nucleotide composition of E. coli, it is not for other organisms. For instance, the nucleotide composition for the archaeon M. jannaschii is approximately and .

8.2. A Simple Site Example


Example 8.1: As a simpler example of a collection of sites than the CRP binding sites of Table 7.1, Table 8.1 shows eight hypothetical translation start sites. For this example, we will assume a uniform background distribution . Table 8.2(a) shows the site prole matrix, and Table 8.2(b) the log likelihood ratio weight matrix, for this example. As illustrations of the log likelihood ratio calculations,

ATG ATG ATG ATG ATG GTG GTG TTG Table 8.1: Eight Hypothetical Translation Start Sites

39

"

! 

! 

5%



LECTURE 8. RELATIVE ENTROPY


A C G T 0.625 0 0.25 0.125 (a) A C G T 1.32 0 0 0 0 1 0 0 1 0

40

(b) 2 (c)

0.701

Table 8.2: (a) Prole, (b) Log Likelihood Weight Matrix, and (c) Positional Relative Entropies, for the Sites in Table 8.1, with Respect to Uniform Background Distribution , and

the same frequency for G in position 1.

8.3. How Informative is the Log Likelihood Ratio Test?


The next question to ask is how informative is a given weight matix for distinguishing between sites and nonsites. If the distributions for sites and nonsites were identical, then every entry in the weight matrix would be 0, and it would be totally uninformative.

Denition 8.2: A sample space

is the set of all possible values of some random variable . for a sample space assigns a probability

Denition 8.3: A probability distribution , satisfying 1. 2.

, and .

In our application, the sample space is the set of all length sequences. The site prole induces a probability distribution on this sample space according to Equation (7.1), as does the nonsite prole . Denition 8.4: Let and be probability distributions on the same sample space . The relative entropy (or information content, or Kullback-Leibler measure) of with respect to is denoted and is dened as follows:

   


, meaning both distributions have to every

  

 

 

 

 

 "

"

LECTURE 8. RELATIVE ENTROPY


By convention, we dene calculus that

41 to be 0 whenever

In these terms, the relative entropy is the expected value of when is picked randomly according to . That is, it is the expected log likelihood score of a randomly chosen site. Note that when and are the same distribution, the relative entropy will be zero. In general, the relative entropy measures how different the distributions and are. Since we want to be able to distinguish between sites and nonsites, we want the relative entropy to be large, and will use relative entropy as our measure of how informative the log likelihood ratio test is. When the sample space is all length sequences, and we assume independence of the not difcult to prove that the relative entropy satises

where is the distribution position.

imposes on the th position and

is the distribution

imposes on the th

When , the relative entropy is measured in bits. This will be the usual case, unless specically stated otherwise. Continuing Example 8.1, Table 8.2(c) shows the relative entropies for each nucleotide position separately. For instance, looking at position 2, residues A, C, and G do not contribute to the relative entropy (see Table 8.2(a)). Residue T contributes (see Tables 8.2(a) and (b)). Hence, . This means that there are 2 bits of information in position 2. If the residues were coded with 0 and 1 so that 00 = A, 01 = C, 10 = G, and 11 = T, only 2 bits (11) would be necessary to encode the fact that this residue is always T. Position 3 has the same relative entropy of 2. For position 1, the relative entropy is 0.7 so there are 0.7 bits of information, indicating that column 1 of Table 8.2(a) is more similar to the background distribution than columns 2 and 3 are. The total relative entropy of all three positions is 4.7. Example 8.6: Let us now modify Example 8.1 to see the effect of a nonuniform background distribution. Consider the same eight translation start sites of Table 8.1, but change the background distribution to , . The site prole matrix remains unchanged (Table 8.2(a)). The new weight matrix and relative entropies are given in Table 8.3. Note that the relative entropy of each position has changed and, in particular, the last two columns no longer have equal relative entropy. The site distribution in position 2 is now more similar to the background distribution than the site distribution in position 3 is, since G is rarer in the background distribution. Thus, the relative entropy of position 3 is greater than that of position 2. An interpretation of is times more likely to occur in the third position of a site than a nonsite. The total that the residue G is relative entropy of all three positions is 4.93.

"

 

1 0 "0

  

!

 

Denition 8.5: The expected value of a function sample space is

with respect to probability distribution

positions, it is

! 

! "

Since with weights

is the log likelihood ratio, .

is a weighted average of the log likelihood ratio

 

, in agreement with the fact from

 

on

LECTURE 8. RELATIVE ENTROPY


A C G T 0.737

42

1.42 (b) 1.42 (c)

0.512

Table 8.3: (b) Log Likelihood Weight Matrix, and (c) Positional Relative Entropies, for the Sites in Table 8.1, with Respect to a Nonuniform Background Distribution 0.12 1.3 1.1 1.5 1.2 1.1 0.027

Table 8.4: Positional Relative Entropy for CRP Binding Sites of Tables 7.1 7.3 Example 8.7: Finally, returning to the more interesting CRP binding sites of Table 7.1, the seven positional relative entropies are given in Table 8.4. Note that 1.5 (middle position) is the highest relative entropy and corresponds to the most biased column (see Table 7.2). The value 0.027 (last position) is the lowest relative entropy because the distribution in this last position is the closest to the uniform background distribution (see Table 7.2).

8.4. Nonnegativity of Relative Entropy


In these examples, the relative entropy has always been nonnegative. It is by no means obvious that this should be, since it is the expected value of the log likelihood ratio, which can take negative values. For instance, why should the expected value of the last column of Table 7.3 be positive (0.027, according to Table 8.4)? The following theorem demonstrates that this must, indeed, be the case. Theorem 8.8: For any probability distributions equality if and only if and are identical.

and

over a sample space ,

 

Proof: First, it is true that The reason is that the curve . Thus, with :

for all real numbers , with equality if and only if . is concave downward, and its tangent at is the straight line . In the following derivation, we will use this inequality

   

   

!  

 

, with

  " 

   

 

 

 

 




     

 

since only if

LECTURE 8. RELATIVE ENTROPY

for all

, by Denition 8.3. Note that the relative entropy is equal to 0 if and , that is, and are identical probability distributions.

43

Lecture 9

Relative Entropy and Binding Energy


February 1, 2000 Notes: Neil Spring
Binding energy is a measure of the afnity between two molecules. Because it is an expression of free energy released rather than absorbed, a large negative number conventionally represents a strong afnity, and suggests that these molecules are likely to bind. The binding energy depends on a number of factors such as temperature and salinity, which we will assume are not varying. This lecture describes a paper of Stormo and Fields [46], which investigates the binding energy between a given DNA-binding protein and various short DNA sequences. In particular, it discusses an interesting relationship between binding energy and log likelihood weight matrices, shedding a new light on the relative entropy.

9.1. Experimental Determination of Binding Energy


binds to all Given a DNA-binding protein , we would like to determine with what binding energy possible length DNA sequences. The binary question of whether or not will bind to a particular DNA sequence oversimplies a more complicated process: more realistically, binds to most such sequences, but will occupy preferred sites for a greater fraction of time than others. Binding energies reect this reality more clearly. If is the alphabet size, then one cannot hope to perform all the experiments to measure the binding sequences of length . Instead, Stormo and Fields proposed the energy of with each of the possible following experimental method for estimating the binding energy of with each length sequence. 1. Choose some good site of length . that differ from in only one residue. There are

2. Construct all sequences of length sequences.

3. For each such sequence , experimentally measure the difference in binding energy between ing with and binding with . 4. Record the results in a matrix substituted at position in . , where

is the change in binding energy when residue

Stormo and Fields then make the approximating assumption that changes in energy are additive. That is, the change in binding energy for any collection of substitutions is the sum of the changes in binding

44

 

such bindis

LECTURE 9. RELATIVE ENTROPY AND BINDING ENERGY


energy of those individual substitutions. With this assumption, one can predict the binding energy of any length sequence by the following formula:

45 to

9.2. Computational Estimation of Binding Energy


Unfortunately, creating the matrix for every DNA-binding protein in every organism of interest still requires an infeasible amount of experimental work. This motivated Stormo and Fields to ask how to computationally, given a collection of good binding sites for and a collection of approximate nonsites. to be the log likelihood ratio weight matrix for with respect to assigns the highest Choosing scores to the sites in . Since also assigns high (negative) scores to the sites in , there is good reason to expect that approximates well (after the appropriate scaling). Recall from Section 8.3 that the relative entropy is the expected score assigned by to a randomly chosen site. If approximates well, the relative entropy then approximates the expected binding energy of to a randomly chosen site. This provides us a new interpretation of relative entropy. It also provides an estimate of how great we should expect the relative entropy to be for a good collection of binding sites. There is some probability that a good site will appear in the genomic background simply by chance. This probability increases with the size of the genome. If the relative entropy is too small with respect to , the expected binding energy at true sites will be too small, and the protein will spend too much time occupying nonsites. Stormo and Fields suggest from experience that the relative entropy for binding sites will be close to . A simple scenario suggests some intuition for this particular estimate: Assume a uniform background distribution , and assume that the site prole has a 1 in each column, that is, all sites are identical. This imples that the relative entropy is 2 bits per position (as in two of the columns of Table 8.2), . In a random sequence generated according to , one would so the total relative entropy is expect this site sequence to appear once every residues. In order for not to bind to too many random locations in the background, must be not much less than , so must be not . much less than

9.3. Finding Instances of an Unknown Site


This leads us into our next topic. Suppose we are not given a sample of known sites. We want to nd sequences that are signicantly similar to each other, without any a priori knowledge of what those sequences look like. A little more precisely, given a set of biological sequences, nd instances of a short site that occur more often than you would expect by chance, with no a priori knowledge about the site. Given a collection of such instances (ignoring, for the moment, how to nd them), this induces a prole as described in Section 7.1. As usual, we compute a prole from the background distribution.

Thus, is a weight matrix that assigns a score to each sequence formula given in Section 8.1.


according to the usual weight matrix







! 

LECTURE 9. RELATIVE ENTROPY AND BINDING ENERGY

46

as in Section 8.3, and use that as a measure of how good the From and , we can compute collection is. The goal is to nd the collection that maximizes . In particular, if we are looking for unknown binding sites, then the argument of Section 9.2 suggests that a relative entropy around would be encouraging. A version of the computational problem, then, is to take as inputs sequences and an integer , and output one length substring from each input sequence, such that the resulting relative entropy is maximized. Let us call this the relative entropy site selection problem. Unfortunately, this problem is likely to be computationally intractable (Section 6.4): Theorem 9.1 (Akutsu [1, 2]): The relative entropy site selection problem is NP-complete. Akutsu also proved that selecting instances so as maximize the sum-of-pairs score (Section 6.2) rather than the relative entropy is NP-complete.

Lecture 10

Finding Instances of Unknown Sites


February 8, 2000 Notes: Dylan Chivian
In order to nd instances of unknown sites, we would like to be able to solve the relative entropy site selection problem (Section 9.3) exactly and efciently. Unfortunately, Theorem 9.1 shows that the relative entropy site selection problem is NP-complete, so we are unlikely to nd an algorithm that will compute an optimal solution efciently. However, if we relax the optimality constraint, it may be possible to develop algorithms that compute good solutions efciently. Because of the problem abstraction required to model the biological problem mathematically, the mathematically optimal solution need not necessarily be the most biologically signicant. Lower scoring solutions are potentially the correct answer in their biological context. Therefore, giving up on the mathematical optimality of solutions to the relative entropy site selection problem seems the right compromise. As an example of a typical application of nding instances of unknown sites, consider the genes involved in digestion in yeast. It is likely that many of these genes have some transcription factors in common, and therefore similarities in their promoter regions. Applying the site selection problem to the 1Kb DNA sequences upstream of known digestion genes may well yield some of these transcription factor binding sites. As another example, we could use the site selection problem to nd common motifs in a protein family. As dened, the relative entropy site selection problem limits its solution to contain exactly one site per input sequence, which may not be realistic in all applications. In some applications, there may be zero or many such sites in some of the input sequences. The algorithms discussed below are described in terms of the single site assumption, but can be modied to handle the general case as well. But in the context of this general case, this is a good point at which to consider the effects on relative entropy of increasing either the number of sites or the length of each site. Increasing the number of sites will of sites containing each residue not increase the relative entropy, which is a function only of the fraction , and not the absolute number of such sites. For instance, a perfectly conserved position has , regardless of whether it is present in all 10 sites or all 100 sites. This aspect of relative entropy is both a strength and a weakness. The strength is that it measures the degree of conservation, but the weakness is that we would like the measure to increase with more instances of a conserved residue. However, increasing the length of each site does increase the relative entropy, as it is additive and always nonnegative (Theorem 8.8). If comparing relative entropies of different length sites is important, one may normalize by dividing by the length of the site or, alternatively, subtracting the expected relative entropy from each position.

47

LECTURE 10. FINDING INSTANCES OF UNKNOWN SITES

48

10.1. Greedy Algorithm


Hertz and Stormo [23] described an efcient algorithm for the relative entropy site selection problem that uses a greedy approach. Greedy algorithms pick the locally best choice at each step, without concern for the impact on future choices. In most applications, the greedy method will result in solutions that are far from optimal, for some input instances. However, it does work efciently, and may produce good solutions on many of its input instances. Hertz and Stormos algorithm for the relative entropy site selection problem proceeds as follows. The user species the length of sites. The user also species a maximum number of proles to retain at each step. Proles with lower relative entropy scores than the top will be discarded; this is precisely the greedy aspect of the algorithm. A LGORITHM:

2. For each set retained so far, add each possible length substring from an input sequence not yet represented in . Compute the prole and relative entropy with respect to the background for each new set. Retain the sets with the highest relative entropy. 3. Repeat step 2 until each set has members.

A small example from Hertz and Stormo [23] is shown in Figure 10.1. From this example it is clear that pruning the number of sets to is crucial, in order to avoid the exponentially many possible sets. The greedy nature of this pruning biases the selection from the remaining input sequences. High scoring proles chosen from the rst few sequences may not be well represented in the remaining sequences, whereas medium scoring proles may be well represented in most of the sequences, and thus would have yielded superior scores. Note that one may modify the algorithm to circumvent the assumption of a single site per sequence, by permitting multiple substrings to be chosen from the same sequence. In this case, a different stopping condition is needed. Hertz and Stormo applied their technique to nd CRP binding sites (see Section 7.1) with some success. With 18 genes containing 24 known CRP binding sites, their best solution contained 19 correct sites, plus 3 more that overlap correct sites.

10.2. Gibbs Sampler


Lawrence et al. [29] developed a different approach to the relative entropy site selection problem based on Gibbs sampling. The idea behind this technique is to start with a complete set of substrings (candidate sites), from which we iteratively remove one at random, and then add a new one at random with probability proportional to its score, hopefully resulting in an improved score. In the following description, we again make the assumption that we choose one site per input sequence, but this method also can be extended to permit any number of sites per sequence.

I NPUT: sequences

, , and the background distribution.

1. Create a singleton set (i.e., only one member) for each possible length input sequences.

substring of each of the



I NPUT: sequences

, and , , and the background distribution.



LECTURE 10. FINDING INSTANCES OF UNKNOWN SITES

49

Figure 10.1: Example of Hertz and Stormos greedy algorithm. seq denotes the relative entropy. A LGORITHM: Initialize set to contain substrings , where is a substring of chosen randomly and uniformly. Now perform a series of iterations, each of which consists of the following steps:

(a) Let

be the length

substring of

that starts at position .

2. For every in

1. Choose randomly and uniformly from

and remove

from .





   

LECTURE 10. FINDING INSTANCES OF UNKNOWN SITES


(b) Compute (c) Let

50

, the relative entropy of

We iterate until a stopping condition is met, either a xed number of iterations or relative stability of the scores, and return the best solution set seen in all iterations. The hope with this approach is that the random choices help to avoid some of the local optima of greedy algorithms. Note that the Gibbs sampler may discard a substring that yields a higher scoring prole than the one that replaces it, or may restore the substring that was discarded itself. Neither of these occurrences is particularly signicant, since the sampling will tend toward higher scoring proles due to the probabilistic weighting of the substitutions by relative entropy. The Gibbs sampler does retain some degree of greediness (which is desirable), so that there may be cases where a strong signal in only a few sequences incorrectly outweighs a weaker signal in all of the sequences. Lawrence et al. applied their technique to nd motifs in protein families. In particular, they successfully discovered a helix-turn-helix motif, as well as motifs in lipocalins and prenyltransferases.

10.3. Other Methods


Possible extensions to the Gibbs sampler technique of Section 10.2 include the following:

1. Weight which

to discard in step 1 (analogously to weighting which to add in step 3).

2. Use simulated annealing (see, for example, Johnson et al. [25]) where, as time progresses, the probability decreases that you make a substitution that worsens the relative entropy score, yielding a more stable set . Another technique that has been used to solve the site selection problem is expectation maximization (for example, in the MEME system [4]).

3. Randomly choose

to be

with probability

, and add

with respect to the background.

to .

Lecture 11

Correlation of Positions in Sequences


February 10, 2000 Notes: Tammy Williams
This lecture explores the validity of the assumption that the residues appearing at different positions in a sequence are independent. In previous lectures the computations assumed such positional independence. (See Section 7.2.) Here we describe a method to determine the level of dependence among residues in a sequence. By calculating the relative entropy of two models, one modeling dependence and the other modeling independence of positions, we can quantify the validity of the positional independence assumption. Most of the material for this lecture is from Phil Greens MBT 599C lecture notes, Autumn 1996.

11.1. Nonuniform Versus Uniform Distributions


We will begin with a warmup to the method that still assumes positional independence, and proceed to the dependence question in Section 11.2. Given the genome of an organism, a simple calculation determines the frequency of each nucleotide. It is reasonable to suspect that these frequencies are more informative than the uniform nucleotide distribution, in which the probability of each nucleotide is 0.25. One can compare the frequency distribution (the nonuniform distribution in which the probability of residue is equal to the frequency of residue in the genome as a whole) to the uniform distribution. Example 11.1: This example calculates the relative entropy of the frequency distribution to the uniform distribution for the archaeon M. jannaschii. M. jannaschii is a thermophilic prokaryote, meaning that it lives in extremely high temperature environments such as thermal springs. The frequency distribution of residues for M. jannaschii is given in Table 11.1. A: 0.344 C: 0.155 G: 0.157 T: 0.343 Table 11.1: The frequency distribution of residues in M. jannaschii Notice that the frequencies of residues A and T are very similar but not equal. Likewise, the residues C and G have similar frequencies. When calculating these frequencies, only one strand of DNA was used. (Had both strands been used, base pair complementarity would have ensured that these frequencies would be exactly equal rather than just similar.) Because genes and other functional regions tend to occur on both 51

LECTURE 11. CORRELATION OF POSITIONS IN SEQUENCES

52

strands of DNA equally often, any bias of such a region on one strand over the other (see Section 11.4) is canceled out. This phenomenon, together with the fact that the bases occur in complementary pairs, explains why the frequencies of As and Ts are similar and the frequencies of Gs and Cs are similar. Let be the uniform probability distribution, and let be the frequency distribution. Notice that the frequency of residue is equal to the probability of randomly selecting residue from the distribution . How much better does model the actual genome than ? More quantitatively, how much more information is there using rather than ? Recall from Section 8.3 that the relative entropy is dened as follows:

In Example 11.1 for M. jannaschii, . This implies that there are 0.103 more bits of information per position in the sequence by using distribution over distribution . The value 0.103 might seem insignicant, but it means that a sequence of 100 bases has ten bits of extra information when chosen according to distribution . Suppose a random sequence of length 100 is selected according to the probability distribution . Since the relative entropy is the expected log likelihood ratio for , the sequence is approximately times more likely to have been generated by than by . The mathematics leading to this observation is awed, since the log function and expectation do not commute: that is, for it to be correct we would need the expected log likelihood ratio to equal the log of the expected likelihood ratio, which is not true in general. However, the intuition is helpful. The next section explores the application of this relative entropy method to the question of dependence of nucleotides.

11.2. Dinucleotide Frequencies


How much dependence is there between adjacent nucleotides in a DNA sequence? Since there are four nucleotides, there are 16 possible pairs of nucleotides. To calculate the frequencies of each such pair $(x, y)$ in a sequence, a simple algorithm computes the total number of observed occurrences of $x$ followed immediately by $y$, and divides by the total number of pairs, which is the length of the sequence minus one. Let $f(x, y)$ be the frequency of the residue $x$ immediately followed by the residue $y$. In addition let $f(x)$ be the frequency of residue $x$ in the single nucleotide distribution. The value $f(x, y) / (f(x) f(y))$ gives a score representing the validity of the positional independence assumption. If $f(x, y) / (f(x) f(y)) = 1$, then the independence assumption is valid for residue $x$ followed by residue $y$. (See Definition 7.1.) As the deviation from one increases, the independence assumption becomes less valid.

Example 11.2: Let us return to Example 11.1 involving the organism M. jannaschii. The values $f(x, y) / (f(x) f(y))$ are given in Table 11.2, with the residue $x$ indexed by row and the residue $y$ indexed by column, where residue $x$ immediately precedes $y$ in the sequence. Upon examination of the table, one can see that there are sizable deviations from one. For example, the pairs (C,C) and (G,G) occur much more often than expected if they were independent, and (A,C), (C,G), and (G,T) occur much less often. Also, the diagonal entries show that two consecutive occurrences of the same residue occur more often than expected. Such repeats of the same residue might result from the slippage of the DNA polymerase during the replication process (see Section 1.5). The DNA polymerase inserts an extra copy of the base or misses a copy while duplicating one of the DNA strands. Even though there is a post-replication repair system to repair mistakes produced by the DNA polymerase, there is a small chance that the repeats will persist after a copy mistake. (In a similar way, dinucleotide repeats might occur during the replication process.)

Table 11.2: The ratios of the observed dinucleotide frequency to the expected dinucleotide frequency (assuming independence) in M. jannaschii
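A sketch of the dinucleotide ratio computation behind such a table is shown below; a ratio near 1 indicates approximate independence of adjacent positions. The example sequence is a placeholder, not M. jannaschii data.

```python
from collections import Counter

def dinucleotide_ratios(seq):
    """Return r[(x, y)] = f(x, y) / (f(x) * f(y)) for all 16 dinucleotides."""
    n = len(seq)
    single = Counter(seq)                    # counts of A, C, G, T
    pairs = Counter(zip(seq, seq[1:]))       # counts of adjacent pairs
    f1 = {x: single[x] / n for x in "ACGT"}
    f2 = {(x, y): pairs[(x, y)] / (n - 1) for x in "ACGT" for y in "ACGT"}
    return {(x, y): f2[(x, y)] / (f1[x] * f1[y])
            for x in "ACGT" for y in "ACGT" if f1[x] > 0 and f1[y] > 0}

ratios = dinucleotide_ratios("ACGTTTTAACCGGTTAACCGTA" * 100)   # placeholder sequence
print(ratios[("C", "C")], ratios[("A", "C")])
```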

Definition 11.3: The mutual information of a pair $(X, Y)$ of random variables is
$$M(X, Y) = \sum_{x, y} \Pr(x, y) \log_2 \frac{\Pr(x, y)}{\Pr(x) \Pr(y)}.$$

If the probability distribution $P$ is the joint distribution of $X$ and $Y$, and $Q$ is the distribution of $X$ and $Y$ assuming independence, then $M(X, Y) = H(P \| Q)$. By Theorem 8.8, then, $M(X, Y) \geq 0$, with equality if and only if $X$ and $Y$ are independent, since in the equality case $P = Q$. By setting the random variable $X$ to be the first base and $Y$ to be the second base of a pair, the value $M(X, Y)$ for M. jannaschii is 0.03. For a sequence of 100 bases, there are three bits of extra information when the sequence is chosen from the dinucleotide frequency distribution rather than the independence model. Thus, a random sequence of length 100 generated by a process according to dinucleotide distribution $P$ is roughly eight times more likely to have been generated by $P$ than by the independent nucleotide distribution $Q$.

11.3. Disymbol Frequencies


A generalization of the dinucleotide frequency is called the disymbol frequency, in which the two positions are not restricted to be adjacent. For example, one could study the dependence relationship between pairs of nucleotides separated by ten positions. The extension of methods presented above to this generalized setting is straightforward. Studies have revealed that the mutual information between DNA nucleotides separated by more than one base is lower than for adjacent residues. In fact, for separations of length 2, 3, and 4, the mutual information is an order of magnitude less than for adjacent residues.

11.4. Coding Sequence Biases


A similar application of relative entropy is finding biases in coding sequences. Recall that coding sequences consist of codons that are three consecutive bases: see Section 1.6.2. Do the three positions each have the same distribution as the background distribution? If such statistical features of protein-coding regions are known, they can be exploited by algorithms that locate genes. In the bacterium H. influenzae, the residues A and G are more likely to appear in the first position of codons than in the genomic background. Using an analysis analogous to that used in Sections 11.1 and 11.2, there are 0.082 bits of information in the first codon position relative to the background distribution for H. influenzae. Since most of the H. influenzae genome consists of coding regions, it makes little difference whether the background distribution is measured genome-wide or coding-region-wide. There are 0.175 bits of information per residue in the first position of codons for M. jannaschii. The total relative entropy for the entire codon is simply the sum of the relative entropies for the three positions. (See Section 8.3.) The numbers of bits per codon for the organisms H. influenzae, M. jannaschii, C. elegans, and H. sapiens are 0.12, 0.21, 0.09, and 0.12, respectively. For H. influenzae the number of bits of information for the second position is close to zero. In humans there is more information in the second position.

11.4.1. Codon Biases


Recall from Section 1.6.2 that there are 64 possible mRNA sequences of length three, but there are only 20 amino acids plus the stop codon. Thus, there exist synonymous codons that encode the same amino acid. Another statistical clue for locating genes is whether an organism uses synonymous codons equally often or has a bias toward certain codons in its genome. For example, in H. influenzae the codon TTT is used about four times as often as TTC, although both TTT and TTC encode the amino acid phenylalanine. One conjecture as to why this occurs is that the tRNA for TTT is more abundant than the tRNA for TTC. Recall from Section 2.2 that the tRNA carries an amino acid to the ribosome during translation. There is selective pressure on the organism to choose the codon that is most efficiently translated, which would be affected by tRNA abundance. A similar study investigates whether organisms prefer one amino acid over another, since some amino acids such as leucine and isoleucine are chemically similar. (See Section 1.1.1.)

11.4.2. Recognizing Genes


Codon bias can be applied to the problem of recognizing genes in a DNA sequence. Define a score for codon $C$ as follows:
$$s(C) = \log_2 \frac{f(C)}{g(C)},$$
where $f(C)$ is the frequency of codon $C$ in the coding regions and $g(C)$ is the frequency of codon $C$ in the background.

The score of a sequence $C_1 C_2 \cdots C_n$ of codons is defined to be the sum of the scores of each $C_i$. When recognizing genes, one facet would be to identify sequences with high scores. Each reading frame must be examined, since moving the frame window to the right one position or two positions results in different sequences of codons. One drawback to this technique is that we must know the coding regions (in order to estimate $f$) before recognizing (those same) genes in the genome. There are simpler methods for finding long coding regions, and once these are known they can be used to estimate $f$ and thus used to find more genes.
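A sketch of this scoring scheme follows, assuming codon frequency dictionaries for the coding regions and the background have already been estimated; the function names and the choice to score unseen codons as 0 are illustrative assumptions.

```python
import math

def codon_scores(coding_freq, background_freq):
    """s(C) = log2(f(C)/g(C)) for each codon C, given two frequency dictionaries."""
    return {c: math.log2(coding_freq[c] / background_freq[c])
            for c in coding_freq if background_freq.get(c, 0) > 0}

def score_region(dna, scores, frame=0):
    """Sum of codon scores over one reading frame of a candidate region."""
    total = 0.0
    for i in range(frame, len(dna) - 2, 3):
        total += scores.get(dna[i:i+3], 0.0)   # unseen codons contribute 0 here
    return total
```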

Lecture 12

Maximum Subsequence Problem


February 15, 2000 Notes: Mathieu Blanchette

12.1. Scoring Regions of Sequences


We have studied a variety of methods to score a DNA sequence so that regions of interest obtain a high score. For example, Section 11.4.2 suggested codon bias as a means for finding coding regions. If $C$ is a codon, the score associated with $C$ is $\log_2 (f(C)/g(C))$, where $f(C)$ is the frequency of $C$ in known coding regions (in the correct reading frame), and $g(C)$ is the frequency of $C$ in noncoding regions (usually taken as the background distribution). Notice that our goal is to identify new coding regions, but the method requires that we already know some coding regions in order to estimate $f$.

There are a few easy ways one can identify a subset of likely coding regions. First, one could look for long open reading frames (ORFs), that is, long contiguous reading frames without STOP codons. Since 3 of the 64 codons are STOP codons (see Table 1.1), in random sequences one would expect a STOP codon every 64/3 (approximately 21) triplets, i.e., approximately every 64 bases, if codons are distributed uniformly. Since most genes are at least hundreds of bases long, very long ORFs are likely to be coding regions. This method will work well if we assume that the new genome contains no introns (or at least many very long exons), and if the codon distribution in long genes is similar to that in all genes, so that we can use it to estimate $f$. Another easy way to find a training set of coding sequences is by sequence similarity: compare the sequence of interest with a genome in which many genes are known, and extract the regions with high sequence similarity to known genes.

Then, if we assume that different triplets in the sequence are independent, we would like to find contiguous stretches of triplets with high total score (and thus with high log likelihood ratio). These regions would be good candidates for coding regions, to be subjected to further testing. Another relevant question is in which reading frame to look for codons. There are 6 possible reading frames: 3 on each of the 2 strands of DNA. When looking for coding regions, one would search for high scoring regions in each of these 6 reading frames.

12.2. Maximum Subsequence Problem


We can distill the following general computational problem from the preceding discussion. We are given a sequence $x_1, x_2, \ldots, x_n$ of real numbers, where $x_i$ corresponds to the score of the $i$th element of the sequence. The problem is to find a contiguous subsequence $x_i, x_{i+1}, \ldots, x_j$ that maximizes $\sum_{k=i}^{j} x_k$. We will call this a maximum subsequence. Note that, if all the $x_i$'s are nonnegative, the problem is not interesting, since the maximum subsequence will always be the entire sequence, so the interesting case is when some of the scores are negative.

The following algorithm for finding a maximum subsequence was given by Bates and Constable [5] and Bentley [7, Column 7]. Suppose we already knew that the maximum subsequence of $x_1, \ldots, x_k$ has score $M$. How can we find the maximum subsequence of $x_1, \ldots, x_{k+1}$? If $x_{k+1}$ is not included in the new maximum subsequence, then the answer is unchanged. But what if $x_{k+1}$ is included in it? In that case, in addition to $M$ we will have to keep track of the score $S$ of the maximum suffix of $x_1, \ldots, x_k$: that is, the suffix $x_i, \ldots, x_k$ that maximizes $\sum_{j=i}^{k} x_j$. Let us assume that $S$ is also known for $x_1, \ldots, x_k$. We are now given $x_{k+1}$, and we want to update $M$ and $S$ accordingly:

if $S + x_{k+1} > M$ then add $x_{k+1}$ to the maximum suffix and replace $M$ by $S + x_{k+1}$;
else if $S + x_{k+1} > 0$ then add $x_{k+1}$ to the maximum suffix;
else reset the maximum suffix to be empty (that is, $S = 0$).

The complexity of the algorithm is $O(n)$, since a constant amount of work is done for every new element $x_{k+1}$, and there are $n$ such elements.
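A sketch of this incremental update in Python is shown below; here the empty subsequence is allowed and scores 0, and the variable names are illustrative.

```python
def maximum_subsequence_score(xs):
    """Best score of any contiguous subsequence, tracking the best suffix S
    and the overall best M as each new score is read (empty subsequence = 0)."""
    best = 0.0       # M: score of the best subsequence seen so far
    suffix = 0.0     # S: score of the best suffix ending at the current position
    for x in xs:
        suffix = max(suffix + x, 0.0)   # extend the suffix, or reset it to empty
        best = max(best, suffix)        # the new suffix may beat the old maximum
    return best

print(maximum_subsequence_score([4, -5, 3, -3, 2, -2, 2, -2, 1, 1, 5]))
```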

12.3. Finding All High Scoring Subsequences


The algorithm described in Section 12.2 works very well if we are interested in finding one maximum subsequence. However, we are generally looking for all high scoring regions, for instance, all good candidates for coding regions. We could repeatedly use the previous algorithm to find them all: find the maximum subsequence, remove it, and repeat on the two remaining parts of the sequence. We will call the problem of finding exactly these disjoint maximum subsequences the all maximum subsequences problem. (In practice, one would only want to retain those reported maximum subsequences with scores sufficiently high to be interesting.) The problem with repeatedly running the previous algorithm is that it will take $O(n)$ operations per subsequence reported, and thus possibly $O(n^2)$ operations to identify all high scoring regions. Intuitively, one might hope to do better, since much of the work done to find the first maximum subsequence could be reused to find the second one, and so on. We now present an algorithm that solves the all maximum subsequences problem in time $O(n)$, the same as the time to find just one maximum subsequence. This algorithm is due to Ruzzo and Tompa [40]. We first describe the algorithm, and then discuss its performance.

Algorithm. The algorithm reads the scores from left to right, and maintains the cumulative total of the scores read so far. Additionally, it maintains a certain ordered list $I_1, I_2, \ldots, I_{k-1}$ of disjoint subsequences. For each such subsequence $I_j$, it records the cumulative total $L_j$ of all scores up to but not including the leftmost score of $I_j$, and the total $R_j$ up to and including the rightmost score of $I_j$. The list is initially empty. Input scores are processed as follows. A nonpositive score requires no special processing when read. A positive score is incorporated into a new subsequence $I_k$ of length one (in practice, one could optimize this slightly by processing a consecutive series of positive scores as a single positive score) that is then integrated into the list by the following process.

1. The list is searched from right to left for the maximum value of $j$ satisfying $L_j < L_k$.

2. If there is no such $j$, then add $I_k$ to the end of the list.

3. If there is such a $j$, and $R_j \geq R_k$, then add $I_k$ to the end of the list.

4. Otherwise (i.e., there is such a $j$, but $R_j < R_k$), extend the subsequence $I_k$ to the left to encompass everything up to and including the leftmost score in $I_j$. Delete subsequences $I_j, I_{j+1}, \ldots, I_{k-1}$ from the list (none of them is maximum) and reconsider the newly extended subsequence (now renumbered $I_j$) as in step 1.

After the end of the input is reached, all subsequences remaining on the list are maximum; output them.

Figure 12.1: An example of the algorithm. Bold segments indicate score subsequences currently in the algorithm's list. The left figure shows the state prior to adding the last three scores, and the right figure shows the state after.

As an example of the execution of the algorithm, consider the input sequence shown in Figure 12.1. After reading the first eight scores, suppose the list of disjoint subsequences is $I_1, I_2, I_3, I_4$. (See Figure 12.1.) At this point, the cumulative score is 2. If the ninth input is $-2$, the list of subsequences is unchanged, but the cumulative score becomes 0. If the tenth input is 1, Step 1 produces the rightmost list entry whose $L$ value is less than the current cumulative total, and Step 3 applies, so the new length-one subsequence is added to the end of the list; the cumulative score becomes 1. If the eleventh input is 5, Step 1 again succeeds, but this time Step 4 applies, extending the newest subsequence to the left and deleting the entries it absorbs. The algorithm returns to Step 1 without reading further input; Step 4 again applies, merging further entries of the list into a single new subsequence. The algorithm again returns to Step 1, but this time Step 2 applies. If there are no further input scores, the complete list of maximum subsequences is then the list shown on the right of Figure 12.1. The fact that this algorithm correctly finds all maximum subsequences is not obvious; see Ruzzo and Tompa [40] for the details.

Analysis. There is an important optimization that may be made to the algorithm. In the case that Step 2 applies, all the subsequences $I_1, \ldots, I_{k-1}$ currently on the list are maximum subsequences, and so may be output before reading any more of the input. Thus, Step 2 of the algorithm may be replaced by the following, which substantially reduces the memory requirements of the algorithm.

2'. If there is no such $j$, all subsequences $I_1, \ldots, I_{k-1}$ on the list are maximum. Output them, delete them from the list, and reinitialize the list to contain only $I_k$ (now renumbered $I_1$).

The algorithm as given does not run in linear time, because several successive executions of Step 1 might re-examine a number of list items. This problem is avoided by storing with each subsequence added during Step 3 a pointer to the subsequence $I_j$ that was discovered in Step 1. The resulting linked list of subsequences will have monotonically decreasing $L$ values, and can be searched in Step 1 in lieu of searching the full list. Once a list element has been bypassed by this chain, it will be examined again only if it is being deleted from the list, either in Step 2 or Step 4. The work done in the reconsider loop of Step 4 can be amortized over the list item(s) being deleted. Hence, in effect, each list item is examined a bounded number of times, and the total running time is linear. The worst case memory complexity is also linear, although one would expect on average that the subsequence list would remain fairly short in the optimized version incorporating Step 2'. Empirically, a few hundred stack entries suffice for processing sequences of a few million residues, for either synthetic or real genomic data.
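A sketch of the list-based procedure just described is given below, without the pointer optimization that makes it run in linear time; the representation of list entries and the output format are illustrative assumptions.

```python
def all_maximal_subsequences(xs):
    """Disjoint maximal-scoring subsequences of xs, following the list-based
    procedure above. Returns (start, end, score) triples, with end exclusive."""
    stack = []   # entries (start, end, L, R): L = cumulative total before start,
                 #                             R = cumulative total through end-1
    total = 0.0
    for i, x in enumerate(xs):
        if x <= 0:                    # nonpositive scores need no special processing
            total += x
            continue
        start, end, L, R = i, i + 1, total, total + x
        total += x
        while True:
            # Step 1: rightmost list entry j with L_j < L of the new subsequence
            j = len(stack) - 1
            while j >= 0 and stack[j][2] >= L:
                j -= 1
            if j < 0 or stack[j][3] >= R:
                stack.append((start, end, L, R))   # Steps 2 and 3
                break
            # Step 4: extend left over entry j and everything after it, then retry
            start, L = stack[j][0], stack[j][2]
            del stack[j:]
    return [(s, e, R - L) for (s, e, L, R) in stack]

print(all_maximal_subsequences([4, -5, 3, -3, 1, -2, 2, -2, 1, 1, 5]))
```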

Lecture 13

Markov Chains
February 17, 2000 Notes: Jonathan Schaefer
In Lecture 11 we discovered that correlations between sequence positions are significant, and should often be taken into account. In particular, in Section 11.4 we noted that codons displayed a significant bias, and that this could be used as a basis for finding coding regions. Lecture 12 then explored algorithms for doing exactly that. In some sense, Lecture 12 regressed from the lesson of Lecture 11. Although it was using codon bias to score codons, it did not exploit the possible correlation between adjacent codons. Even worse, each codon was scored independently and the scores added, so that the codon score does not even depend on the position the codon occupies. This lecture rectifies these shortcomings by taking codon correlation into account in predicting coding regions. In order to do so, we first introduce Markov chains as a model of correlation.

13.1. Introduction to Markov Chains


The end of Section 11.2 mentioned "a random sequence ... generated by a process according to dinucleotide distribution P", without giving any indication of what such a random process might look like. Such a random process is called a Markov chain, and is more complex than a process that draws successive elements independently from a probability distribution. The definition of Markov chain will actually generalize dinucleotide dependence to the case in which the identity of the current residue depends on the $k$ previous residues, rather than just the previous one.

Definition 13.1: Let $S$ be a set of states (e.g., $S = \{A, C, G, T\}$). Let $X_1, X_2, X_3, \ldots$ be a sequence of random variables, each with sample space $S$. A $k$th order Markov chain satisfies
$$\Pr(X_i = x_i \mid X_{i-1} = x_{i-1}, \ldots, X_1 = x_1) = \Pr(X_i = x_i \mid X_{i-1} = x_{i-1}, \ldots, X_{i-k} = x_{i-k})$$
for any $i$ and any $x_1, x_2, \ldots, x_i \in S$.

In words, in a $k$th order Markov chain, the distribution of $X_i$ depends only on the $k$ variables immediately preceding it. In a 1st order Markov chain, for example, the distribution of $X_i$ depends only on $X_{i-1}$. Thus, a 1st order Markov chain models diresidue dependencies, as discussed in Section 11.2. A 0th order Markov chain is just the familiar independence model, where $X_i$ does not depend on any other variables.

Markov chains are not restricted to modeling positional dependencies in sequences. In fact, the more usual applications are to time dependencies, as in the following illustrative example.

Example 13.2: This example is called a random walk on the infinite 2-dimensional grid. Imagine an infinite grid of streets and intersections, where all the streets run either east-west or north-south. Suppose you are trying to find a friend who is standing at one specific intersection, but you are lost and all street signs are missing. You decide to use the following algorithm: if your friend is not standing at your current intersection, choose one of the four directions (N, E, S, or W) randomly and uniformly, and walk one block in that direction. Repeat until you find your friend. This is an example of a 1st order Markov chain, where each intersection is a state, and $X_t$ is the intersection where you stand after $t$ steps. Notice that the distribution of $X_t$ depends only on the value of $X_{t-1}$, and is completely independent of the path that you took to arrive at $X_{t-1}$.

Definition 13.3: A $k$th order Markov chain is said to be stationary if, for all $i$ and $j$,
$$\Pr(X_i = x \mid X_{i-1} = x_1, \ldots, X_{i-k} = x_k) = \Pr(X_j = x \mid X_{j-1} = x_1, \ldots, X_{j-k} = x_k).$$

That is, in a stationary Markov chain, the distribution of $X_i$ is independent of the value of $i$, and depends only on the $k$ previous variables. The random walk of Example 13.2 is an example of a stationary 1st order Markov chain.

13.2. Biological Application of Markov Chains


Markov chains can be used to model biological sequences. We will assume a directional dependence and always work in one direction, for example, from 5' to 3', or N-terminal to C-terminal. Given a sequence $x = x_1 x_2 \cdots x_n$, and given a Markov chain $M$, a basic question to answer is, "What is the probability that the sequence $x$ was generated by the Markov chain $M$?" For instance, if we were modeling diresidue dependencies with a 1st order Markov chain $M$, we would need to be able to determine what probabilities $M$ assigns to various sequences.

Consider, for simplicity, a stationary 1st order Markov chain $M$. Let $A[r, s] = \Pr(X_i = s \mid X_{i-1} = r)$. $A$ is called the probability transition matrix for $M$. The dimensions of the matrix $A$ are $|S| \times |S|$, where $S$ is the state space. For nucleotide sequences, for example, $A$ is $4 \times 4$.

Then the probability that the sequence $x = x_1 x_2 \cdots x_n$ was generated by $M$ is
$$\Pr(x_1) \prod_{i=2}^{n} A[x_{i-1}, x_i].$$
In this equation,

1. $\Pr(x_1)$ is estimated by the frequency of $x_1$ in the genome, and

2. $A[r, s]$ is estimated by $c(rs) / \sum_{t} c(rt)$, where $c(rs)$ is the number of occurrences in the genome of the diresidue $rs$.

Markov chains have some weaknesses as models of biological sequences:

1. Unidirectionality: the residue $x_i$ is equally dependent on both $x_{i-1}$ and $x_{i+1}$, yet the Markov chain only models its dependence on the residues on one side of $x_i$.

2. Mononucleotide repeats are not adequately modeled. They are much more frequent in biological sequences than predicted by a Markov chain. This frequency is likely due to DNA polymerase slippage during replication, as discussed in Example 11.2.

3. Codon position biases (as discussed in Section 11.4) are not accurately modeled.
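A sketch of the probability computation for a stationary 1st order chain is shown below, with the transition matrix estimated from diresidue counts as above. The add-one pseudocounts are an assumption introduced here to avoid zero probabilities; they are not part of the estimates described in the text.

```python
import math
from collections import Counter

def train_first_order(genome):
    """Estimate initial frequencies and the 4x4 transition matrix A[r][s]
    from single-residue and diresidue counts (add-one pseudocounts assumed)."""
    single = Counter(genome)
    pairs = Counter(zip(genome, genome[1:]))
    init = {r: (single[r] + 1) / (len(genome) + 4) for r in "ACGT"}
    trans = {r: {s: (pairs[(r, s)] + 1) / (sum(pairs[(r, t)] for t in "ACGT") + 4)
                 for s in "ACGT"} for r in "ACGT"}
    return init, trans

def log_probability(seq, init, trans):
    """log2 Pr(seq | model) = log2 Pr(x1) + sum_i log2 A[x_{i-1}, x_i]."""
    logp = math.log2(init[seq[0]])
    for a, b in zip(seq, seq[1:]):
        logp += math.log2(trans[a][b])
    return logp
```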

13.3. Using Markov Chains to Find Genes


We will consider two gene finding algorithms, GeneMark [8, 9] and Glimmer [13, 41]. Both are commonly used to find intron-free protein-coding regions (usually in prokaryotes), and both are based on the ideas of Markov chains. As in Section 12.1, both assume that a training set of coding regions is available, but unlike that method, the training set is used to train a Markov chain.

GeneMark [8, 9] uses a 5th order Markov chain to find coding regions. This choice allows any residue to depend on all the residues in its codon and the immediately preceding codon. The training set consists of coding sequences identified by either long open reading frames or high sequence similarity to known genes. Three separate Markov chains are constructed from the training set, one for each of the three possible positions in the reading frame. For any one of these reading frame positions, the Markov chain is built by tabulating the frequencies of all 6-mers (that is, all length-6 substrings) that end in that reading frame position. These three Markov chains are then alternated to form a single nonstationary 5th order Markov chain $M$ that models the training set.

Given a candidate ORF $x$, we can compute the probability $\Pr(x \mid M)$ that $x$ was generated by $M$, as described in Section 13.2. This ORF will be selected for further consideration if $\Pr(x \mid M)$ is above some predetermined threshold. The further consideration will deal with possible pairwise overlaps of such selected ORFs, in a way to be described in the next lecture.
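GeneMark's actual model is more involved, but the sketch below illustrates the idea of three frame-specific chains trained on 6-mer counts and alternated to score an ORF. The function names, the assumption that training sequences begin in frame, the neglect of the first k residues' probability, and the add-one pseudocounts are all illustrative assumptions.

```python
import math
from collections import defaultdict

def train_frame_chains(coding_seqs, k=5):
    """For each of the 3 reading-frame positions, tabulate (k+1)-mer counts whose
    final base falls in that frame position (training sequences assumed in frame)."""
    counts = [defaultdict(lambda: defaultdict(int)) for _ in range(3)]
    for seq in coding_seqs:
        for i in range(k, len(seq)):
            frame = i % 3                       # frame position of the predicted base
            counts[frame][seq[i - k:i]][seq[i]] += 1
    return counts

def score_orf(orf, counts, k=5):
    """Log-probability of an ORF under the three alternating chains,
    with add-one pseudocounts for unseen contexts."""
    logp = 0.0
    for i in range(k, len(orf)):
        ctx = counts[i % 3][orf[i - k:i]]
        total = sum(ctx.values()) + 4
        logp += math.log2((ctx[orf[i]] + 1) / total)
    return logp
```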

Lecture 14

Using Interpolated Context Models to Find Genes


February 22, 2000 Notes: Gretta Bartels

14.1. Problems with Markov Chains for Finding Genes


The Markov chain is an effective model for finding genes, as described in Section 13.3. However, such a tool is not 100% accurate. The problem is that a $k$th order Markov chain requires $4^{k+1}$ probabilities in each of the three reading frame positions. There is a tension between needing large $k$ to produce a good gene model, and needing small $k$ because there is insufficient data in the training set. For example, when $k = 5$ as in GeneMark, we need roughly 12,000 6-mers to build the model. Each of these 12,000 6-mers must occur often enough in the training set to support a statistically reliable sample. Some 5-mers are too infrequent in microbial training sets, yet some 8-mers are frequent enough to be statistically reliable. Section 14.2 describes a gene finder that was designed to have the flexibility to deal with these extremes.

14.2. Glimmer
Glimmer [13, 41] is a gene prediction tool that uses a model somewhat more general than a Markov chain. In particular, Glimmer 2.0 [13] uses what the authors call the interpolated context model (ICM). The context of a particular residue consists of the $k$ characters immediately preceding it. A typical context size might be $k = 12$. For a context $c = b_1 b_2 \cdots b_k$, the interpolated context model assigns a probability distribution for the next residue, using only as many residues from $c$ as the training data supports. Furthermore, those residues need not be consecutive in the context. Glimmer has three phases for finding genes: training, identification, and resolving overlaps.

14.2.1. Training Phase


As in Sections 12.1 and 13.3, Glimmer uses long ORFs and sequences similar to known genes from other organisms as training data for the model. For each of the three reading frame positions, consider all mers that end in that reading frame position. Let be a random variable whose distribution is given by the -mers. frequencies of the residues in position of these In general, we will not have sufcient training data to use all residues from this context to predict the st residue . Our goal is to determine which variable has most correlation with , and use

62



"

LECTURE 14. USING INTERPOLATED CONTEXT MODELS TO FIND GENES


b7
A C G T

63

b10
A C G T A C

b8
G T

b10
A C G T

b12
A C G T

b4
A C G T

b3

Figure 14.1: Interpolated context model tree. it to predict . The mutual information of Denition 11.3 is used to make this determination. We rst nd the maximum among the mutual information values

To determine which position has the next highest correlation, we do not simply take the second highest mutual information from the list above. The identity of the next position instead depends on the value of the , illustrated in Figure 14.1, as follows. rst selected residue . Glimmer builds a tree of inuences on -mers from this reading frame position are partitioned into four subsets according to the residue The . Then we repeat the mutual information calculation above for each of these subsets. In the example of Figure 14.1, was found to have the greatest mutual information with , and the -mers were partitioned according to the value of residue . For those with = A, was found to have the greatest mutual information with , and they were further partitioned into four subsets according to the value of . residue A branch is terminated when the remaining subset of -mers becomes too small to support further partitioning. Each such leaf of the tree is labeled with the probability distribution of , given the residue values along the path from the root to that leaf. For example, in the tree shown in Figure 14.1, the leaf shaded gray would be labeled with the distribution

Note how this tree generalizes the notion of Markov chain given in Denition 13.1.
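A sketch of the mutual information calculation used to pick the most informative context position is given below; the functions are illustrative and assume the $(k+1)$-mers for one reading frame position have already been collected.

```python
import math
from collections import Counter

def mutual_information(pairs):
    """M(X,Y) = sum_{x,y} P(x,y) log2( P(x,y) / (P(x)P(y)) ) from a list of (x, y) pairs."""
    n = len(pairs)
    joint = Counter(pairs)
    px = Counter(x for x, _ in pairs)
    py = Counter(y for _, y in pairs)
    return sum((c / n) * math.log2((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in joint.items())

def best_context_position(kmers):
    """Given aligned (k+1)-mers, find the context position j (0-based) whose residue
    has the greatest mutual information with the final residue."""
    k = len(kmers[0]) - 1
    scores = {j: mutual_information([(w[j], w[-1]) for w in kmers]) for j in range(k)}
    return max(scores, key=scores.get)
```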

14.2.2. Identification Phase


Once the interpolated context model has been trained, the identification phase begins. Given a candidate ORF $x$, compute the probability of each residue $b$ of $x$ by following the appropriate path in the tree that corresponds to $b$'s position in the reading frame. Here is a rough algorithm for the identification phase:

1. Pick the tree for the correct reading frame position.

2. Number the residues in the context of $b$ as $b_1, b_2, \ldots, b_k$, so that the previous residue is $b_k$, the one before that $b_{k-1}$, and so on.

3. Trace down the tree, selecting the edges according to the particular residues in the sequence, until reaching a leaf. Read the probability of $b$ from that leaf.

4. Shift to the next residue in the sequence and repeat.

The product of these probabilities (times the probability of the first $k$ residues of $x$, as in Section 13.2) yields the probability that $x$ was generated by this interpolated context model. To take the length of $x$ into account, combine these probabilities by adding their logarithms, and normalize by dividing by the length of $x$. Select $x$ for further investigation if this score is above some predetermined threshold.

Note: Glimmer actually stores a probability distribution for the predicted residue at every node of the tree, not just the leaves, and uses a combination of the distributions along a branch to predict that residue. This is where the modifier "interpolated" would enter, but we will not discuss this added complication. See the papers [13, 41] for details.

14.2.3. Resolving Overlap


In the final phase, Glimmer resolves candidate gene sequences that overlap. The main flexibility in resolving overlap is the possibility of shortening an ORF by choosing a different start codon. Suppose sequences $A$ and $B$ overlap, and that $A$ has the greater score. There are four possibilities for the way they overlap, as depicted in Figure 14.2. Each arrow in that figure points to the 3' end of the sequence.

(a) Suppose $A$ and $B$ overlap as shown in Figure 14.2(a). Moving either start codon cannot eliminate the overlap in this case. If $A$ is significantly longer than $B$, then reject $B$. Otherwise, accept both $A$ and $B$ with an annotation that they have a suspicious overlap.

(b) Suppose $A$ and $B$ overlap as shown in Figure 14.2(b). If moving $B$'s start codon resolves the overlap in such a way that $B$ still has great enough score and great enough length, accept both. If not, proceed as in Case (a).

(c) Suppose $A$ and $B$ overlap as shown in Figure 14.2(c). If the score of the overlap is a small fraction of $A$'s score, and moving the appropriate start codon resolves the overlap in such a way that the shortened candidate still has great enough length, accept both. Otherwise reject $B$.

(d) Suppose $A$ and $B$ overlap as shown in Figure 14.2(d). Move one candidate's start codon until the overlap scores higher for the other candidate; then move the other's start codon in the same manner. Repeat until there is no more overlap, and accept both.

If there are more than two overlapping sequences, treat the overlaps in decreasing order of score. In this way, if $A$ overlaps $B$ and $B$ overlaps $C$, and $A$ has the highest score, then $B$ may be rejected before its overlap with $C$ would cause $C$ to be rejected.

Figure 14.2: Overlapping gene candidates. The arrow points to the 3' end of the sequence.

Lecture 15

Start Codon Prediction


February 24, 2000 Notes: Mingzhou Song

15.1. Experimental Results of Glimmer


The experimental results of Glimmer were presented by Delcher et al. [13]. They used the method described in Section 14.2 to predict genes in ten sequenced microbial genomes. The procedure was automated so as not to require human intervention. For each of the ten microbial genomes, the procedure was as follows. In the training phase, they constructed a training set consisting of all ORFs longer than 500 bp with no overlap. The authors state that this set has more than enough data to train the interpolated context model accurately. They then trained the interpolated context model on the training set, as described in Section 14.2.1. The identification phase and overlap resolution were then carried out as described in Sections 14.2.2 and 14.2.3.

In each of the ten genomes, 99% of the annotated genes were correctly identified. The authors did not mention whether the start codons had been correctly identified in all cases. Glimmer thus achieved a false negative rate of 1%, but also a false positive rate of 7% to 25% on each genome. The false negative rate is the percentage of annotated genes that were not identified by Glimmer. The false positive rate is the percentage of ORFs identified by Glimmer as genes, but not so annotated in the database. Of course, some of the annotations could be incorrect.

15.2. Start Codon Prediction


The accurate prediction of the translation start site, that is, the correct start codon, is important in order to analyze the putative protein product of a gene. Given the quality of the rest of the process, accurate start codon prediction is the most difficult remaining part of prokaryotic gene prediction. The gene-finding techniques discussed so far do little to predict the correct start codon among all the candidates. Among the possible start codon candidates, what extra evidence can be used to identify the true translation start site?

Recall from Section 2.2 that the ribosome is the structure that translates mRNA into protein and, at the initiation of that translation, is responsible for identifying the true translation start site. How does the ribosome perform this identification? Can we improve start codon prediction by mimicking the ribosome's process? At the initiation of protein synthesis, the ribosome binds to the mRNA at a region near the 5' end of the mRNA called the ribosome binding site. This is a region of approximately 30 nucleotides of the mRNA that is protected by the ribosome during initiation. The ribosome binding site is approximately centered on the start codon (usually AUG). That is, the ribosome binding site contains not only the first few codons to be translated, but also part of the untranslated region of the mRNA (see Section 2.3).

The ribosome identifies where to bind to the mRNA at initiation not only by recognizing the start codon, but also by recognizing a short sequence in the untranslated region within the ribosome binding site. This short mRNA sequence will be called the SD site, for reasons that will become clear below. The mechanism by which the ribosome recognizes the SD site is relatively simple base-pairing: the SD site is complementary to a short sequence near the 3' end of the ribosome's 16S rRNA, one of its ribosomal RNAs. The SD site was first postulated by Shine and Dalgarno [44] for E. coli. Subsequent experiments demonstrated that the SD site in E. coli mRNA usually matches at least 4 or 5 consecutive bases in the sequence AAGGAGG, and is usually separated from the translation start site by approximately 7 nucleotides, although this distance is variable. Numerous other researchers such as Vellanoweth and Rabinowitz [49] and Mikkonen et al. [34] describe very similar SD sites in the mRNA of other prokaryotes. It is not too surprising that SD sites should be so similar in various prokaryotes, since the 3' end of the 16S rRNA of all these prokaryotes is well conserved (Mikkonen et al. [34]). Table 15.1 shows a number of these rRNA sequences. Note their similarity, and in particular the omnipresence of the sequence CCUCCU, complementary to the Shine-Dalgarno sequence AGGAGG.

Table 15.1: 3' end of the 16S rRNA for various prokaryotes

Bacillus subtilis                         CUGGAUCACCUCCUUUCUA
Lactobacillus delbrueckii                 CUGGAUCACCUCCUUUCUA
Mycoplasma pneumoniae                     GUGGAUCACCUCCUUUCUA
Mycobacterium bovis                       CUGGAUCACCUCCUUUCU
Aquifex aeolicus                          CUGGAUCACCUCCUUUA
Synechocystis sp.                         CUGGAUCACCUCCUUU
Escherichia coli                          UUGGAUCACCUCCUUA
Haemophilus influenzae                    UUGGAUCACCUCCUUA
Helicobacter pylori                       UUGGAUCACCUCCU
Archaeoglobus fulgidus                    CUGGAUCACCUCCU
Methanobacterium thermoautotrophicum      CUGGAUCACCUCCU
Pyrococcus horikoshii                     CUCGAUCACCUCCU
Methanococcus jannaschii                  CUGGAUCACCUCC
Mycoplasma genitalium                     GUGGAUCACCUC

This SD site can be used to improve start codon prediction. The simplest way to identify whether a candidate start codon is likely to be correct is by checking for approximate base pair complementarity between the 3' end of the 16S rRNA sequence and the DNA sequence just upstream of the candidate codon. We say approximate complementarity because the ribosome just needs sufficient binding energy between the 16S rRNA and the mRNA, not necessarily perfect complementarity. Several papers do use this SD site information to improve translation start site prediction. These papers are described briefly below.

Hayes and Borodovsky [22] found candidate SD sites by running a Gibbs sampler (Section 10.2) on the DNA sequences just upstream of a given genome's purported start codons. They then used the 3' end of the genome's annotated 16S rRNA sequence to validate the SD site so found.

Frishman et al. [16] used a greedy version of the Gibbs sampler to find likely SD sites. In addition, they took into account the distance from the SD site to the start codon, which should be about 7 bp.

Hannenhalli et al. [21] used multiple features to score potential start codons. The features used were the following:

1. the binding energy between the SD site and the 3' end of the 16S rRNA, allowing bulges (that is, insertions and deletions) in the binding,

2. the identity of the start codon (AUG, UUG, or GUG),

3. coding potential downstream from the start codon and noncoding potential upstream, using GeneMark's scoring function (Section 13.3),

4. the distance from SD site to start codon, and

5. the distance from the start codon to the maximal start codon, which is as far upstream in this ORF as possible.

They took the score of any start codon to be a weighted linear combination of the scores on these five features. The coefficients of the linear combination were obtained using mixed integer programming.

15.3. Finding SD Sites


Do all prokaryotes have SD sites very similar to the Shine-Dalgarno sequences of E. coli? Given the collection of DNA sequences upstream from its putative genes, how can we identify a prokaryote's SD site, without reliance on the annotation of its 16S rRNA?

Tompa [48] proposed a method to discover SD sites by looking for statistically significant patterns (or motifs) in the sequences upstream from the putative genes. The method is reminiscent of the relative entropy site selection problem of Lecture 10 but, unlike the algorithms discussed there, this one is exhaustive, and guaranteed to find the most statistically significant motif. The statistical significance is measured by the z-score, defined below. The sites with the highest z-scores are very unlikely to be from the background and very likely to be potential SD sites.

For each possible $k$-mer $s$, this approach takes into account both the absolute number of upstream sequences containing (an approximation of) $s$, and the background distribution. It then calculates the unlikelihood of seeing such occurrences, if the sequences had been drawn at random from the background distribution. The random process used in this calculation is a 1st order Markov chain based on the sequences' dinucleotide frequencies. (See Section 13.2.) The measure of unlikelihood used is based on the z-score, defined as follows. Let $N$ be the number of upstream sequences that are input, and $p_s$ the probability that a single random upstream sequence contains at least one occurrence of (an approximation of) $s$. (See Tompa [48] for a description of how to compute $p_s$.) Then $N p_s$ is the expected number of input sequences containing $s$, and $\sqrt{N p_s (1 - p_s)}$ is its standard deviation. If $N_s$ is the observed number of input sequences containing $s$, the z-score is defined as
$$z_s = \frac{N_s - N p_s}{\sqrt{N p_s (1 - p_s)}}.$$

The measure $z_s$ is the number of standard deviations by which the observed value exceeds its expectation, and is sometimes called the normal deviate or deviation in standard units. See Leung et al. [30] for a detailed discussion of this statistic. The measure $z_s$ is normalized to have mean 0 and standard deviation 1, making it suitable for comparing different motifs $s$.

The algorithm was run on fourteen prokaryotic genomes. Those motifs with highest z-score showed a strong predominance of motifs complementary to the 3' end of their genome's 16S rRNA. For the bacteria, these were usually a standard Shine-Dalgarno sequence consisting of 4 or 5 consecutive bases from AAGGAGG. For the thermophilic archaea A. fulgidus, M. jannaschii, M. thermoautotrophicum, and P. horikoshii, however, the significant SD sites uncovered were somewhat different. What is interesting about these is that their highest scoring sequences display a predominance of the pattern GGTGA or GGTG, which satisfies the requirement of complementarity to a substring near the 3' end of the 16S rRNA (see Table 15.1). However, that 16S substring is shifted a few nucleotides upstream compared to the bacterial sites discussed above.
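A minimal sketch of the z-score computation, assuming $N$, the observed count $N_s$, and the single-sequence probability $p_s$ have already been determined (the numbers in the example are illustrative, not taken from [48]):

```python
import math

def z_score(N, observed, p):
    """z = (observed - N*p) / sqrt(N*p*(1-p)), where N is the number of upstream
    sequences and p the probability that one random sequence contains the motif."""
    expected = N * p
    sd = math.sqrt(N * p * (1 - p))
    return (observed - expected) / sd

# e.g., 1000 upstream sequences, motif seen in 80 of them, background probability 0.05
print(z_score(1000, 80, 0.05))
```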

"

"

Lecture 16

RNA Secondary Structure Prediction


February 29, 2000 Notes: Matthew Cary

16.1. RNA Secondary Structure


Recall from Section 1.3 that RNA is usually single-stranded in its normal state, and this strand folds into a functional shape by forming intramolecular base pairs among some of its bases. (See Figure 16.1 for an illustration.) The geometry of this base-pairing is known as the secondary structure of the RNA. When RNA is folded, some bases are paired with others while others remain free, forming loops in the molecule. Speaking qualitatively, bases that are bonded tend to stabilize the RNA (i.e., have negative free energy), whereas unpaired bases form destabilizing loops (positive free energy). Through thermodynamics experiments, it has been possible to estimate the free energy of some of the common types of loops that arise.

Because the secondary structure is related to the function of the RNA, we would like to be able to predict the secondary structure. Given an RNA sequence, the RNA Folding Problem is to predict the secondary structure that minimizes the total free energy of the folded RNA molecule. The prediction algorithm that will be described is by Lyngsø et al. [33].

Figure 16.1: RNA secondary structure, with the loop types (hairpin loop, bulge, internal loop, stacked pair, multi-branched loop, external base) labeled. The solid line indicates the backbone, and the jagged lines indicate paired bases.


Figure 16.2: A Pseudoknot

16.2. Notation and Definitions

If $s = s_1 s_2 \cdots s_n$ is an RNA sequence and $1 \leq i < j \leq n$, then $i \cdot j$ denotes the base-pairing of $s_i$ with $s_j$.

Definition 16.1: A secondary structure of $s$ is a set $S$ of base pairs $i \cdot j$ such that each base is paired at most once. More precisely, for all $i \cdot j, \, i' \cdot j' \in S$, $i = i'$ if and only if $j = j'$.

The configuration shown in Figure 16.2 is known as a pseudoknot. For the prediction algorithm that follows, we will assume that the secondary structure does not contain any pseudoknots. The ostensible justifications for this are that pseudoknots do not occur as often as the more common types of loops, and secondary structure prediction is moderately successful even if pseudoknots are prohibited. However, the real justification for this assumption is that it greatly simplifies the model and algorithm. (Certain types of pseudoknots are handled by the algorithm of Rivas and Eddy [38], but the general problem was shown NP-complete by Lyngsø and Pedersen [32].)

Definition 16.2: A pseudoknot in a secondary structure $S$ is a pair of base pairs $i \cdot j$ and $i' \cdot j'$ with $i < i' < j < j'$.

16.3. Anatomy of Secondary Structure


Given the assumption of no pseudoknots, the secondary structure can be decomposed into a few types of simple loops, described as follows and illustrated in Figure 16.1.

Definition 16.3:

A hairpin loop contains exactly one base pair.

An internal loop contains exactly two base pairs.

A bulge is an internal loop with one base from each of its two base pairs adjacent on the backbone.

A stacked pair is a loop formed by two base pairs $i \cdot j$ and $(i+1) \cdot (j-1)$, thus having both ends adjacent on the backbone. (This is the only type of loop that stabilizes the secondary structure. All other loops are destabilizing, to varying degrees.)

A multibranched loop is a loop that contains more than two base pairs.

An external base is a base not contained in any loop.

Definition 16.4: Given a loop, one base pair in the loop is closest to the ends of the RNA strand. This is known as the exterior or closing pair. All other pairs are interior. More precisely, the exterior pair is the one that maximizes $j - i$ over all pairs $i \cdot j$ in the loop. Note that one base pair may be the exterior pair of one loop and the interior pair of another.

16.4. Free Energy Functions


The assumption of no pseudoknots leads to the following related assumptions:

1. The free energy of a secondary structure is the sum of the free energies of its loops.

2. The free energy of a loop is independent of all other loops.

These assumptions imply that, to evaluate the free energy of a given secondary structure, all that is needed is a set of functions that provide the free energies of the allowable constituent loop types. These functions are the free energy functions, which we will assume are provided by experimentalists and are available for the algorithm's use. See http://www.ibc.wustl.edu/zuker/rna/energy/node2.html#SECTION20 for typical tables and formulas that can be used.

Definition 16.5: There are four free energy functions:

$eS(i, j)$. This function gives the free energy of a stacked pair that consists of $i \cdot j$ and $(i+1) \cdot (j-1)$. $eS$ depends on all the bases involved in the stack, namely $s_i$, $s_j$, $s_{i+1}$, and $s_{j-1}$. Because stacked complementary base pairs are stabilizing, $eS$ values will be negative if both stacked base pairs are complementary. In addition to the usual complementary pairs A-U and C-G, the pair G-U forms a weak bond in RNA, and is sometimes called a wobble pair. The $eS$ values involving such pairs will also be negative.

$eH(i, j)$. This function gives the free energy of a hairpin loop closed by $i \cdot j$. This function depends on several factors, including the length of the loop, $s_i$ and $s_j$, and the unpaired bases adjacent to $s_i$ and $s_j$ on the loop.

$eL(i, j, i', j')$. This function gives the free energy of an internal loop or bulge with exterior pair $i \cdot j$ and interior pair $i' \cdot j'$. Similar to $eH$, this function depends on the loop size, the four paired bases, and the unpaired bases adjacent to the paired bases on the loop.

$eM(i, j, i_1, j_1, \ldots, i_k, j_k)$. This function gives the free energy of a multibranched loop closed by $i \cdot j$ with interior pairs $i_1 \cdot j_1, \ldots, i_k \cdot j_k$. This function is the least well understood at this time.


16.5. Dynamic Programming Arrays


The algorithm described by Lyngsø et al. [33] uses dynamic programming, the technique that was used to find optimal alignments (Section 4.1). Like the affine gap penalty algorithm of Section 5.3.3, this one fills in several tables simultaneously. The five tables used are described below.

$W(i)$: the free energy of the optimal structure of the first $i$ residues $s_1, \ldots, s_i$. This is the key array: if we can compute $W(n)$ (and find its associated secondary structure), we are done.

$V(i, j)$: the free energy of the optimal structure for $s_i, \ldots, s_j$, assuming $i \cdot j$ forms a base pair in that structure.

$VBI(i, j)$: the free energy of the optimal structure for $s_i, \ldots, s_j$, assuming $i \cdot j$ closes a bulge or internal loop.

$VM(i, j)$: the free energy of the optimal structure for $s_i, \ldots, s_j$, assuming $i \cdot j$ closes a multibranched loop.

$WM(i, j)$: used to compute $VM$, in a manner to be revealed later.

Despite the similarity in their number and descriptions, it is important to understand the distinction between the free energy functions of Section 16.4 and these dynamic programming arrays. The free energy functions give the energy of a single specified loop. The arrays will generally contain free energy values for a collection of consecutive loops. For example, referring to Figure 16.1, $eL(a, b, c, d)$ gives the free energy of the internal loop closed by $a \cdot b$ and $c \cdot d$, whereas $V(a, b)$ gives the total free energy of all the loops to the right of $a \cdot b$, including the stacked pairs and the hairpin.
Lecture 17

RNA Secondary Structure Prediction (continued)


March 2, 2000 Notes: Don Patterson

17.1. Recurrence Relations


The core of the dynamic programming algorithm for RNA secondary structure prediction lies in the recurrence relations used to fill the arrays introduced in Section 16.5. This section develops the recurrence relations for $W$, $V$, $VBI$, and $VM$, which are interdependent.

17.1.1. The recurrence for $W$

$W(0) = 0$, and for $i \geq 1$,
$$W(i) = \min \Big\{ W(i-1), \; \min_{1 \leq j < i} \big( W(j-1) + V(j, i) \big) \Big\}.$$

The terms in the second equation correspond to choosing the structure for bases $s_1, \ldots, s_i$ having the lesser free energy of two possible structures:

The base $s_i$ does not pair with any other base and is therefore an external base (see Figure 16.1). The recurrence for $W$ makes the implicit assumption that the external bases do not contribute to the overall free energy of the structure. In this case the total energy is therefore $W(i-1)$.

The base $s_i$ pairs with some other base $s_j$ in $s_1, \ldots, s_{i-1}$, where $j$ is chosen to minimize the resulting free energy. That energy is the sum of the energy $V(j, i)$ of the compound structure closed by $j \cdot i$, plus the energy $W(j-1)$ of the remainder $s_1, \ldots, s_{j-1}$.

17.1.2. The recurrence for $V$

$$V(i, j) = \min \big\{ eH(i, j), \; eS(i, j) + V(i+1, j-1), \; VBI(i, j), \; VM(i, j) \big\}.$$

The terms in this equation correspond to choosing the minimum free energy structure among the following possible solutions:

$i \cdot j$ is the exterior pair in a hairpin loop, whose free energy is therefore given by $eH(i, j)$.

$i \cdot j$ is the exterior pair of a stacked pair. In this case the free energy is the energy $eS(i, j)$ of the stacked pair, plus the energy $V(i+1, j-1)$ of the compound structure closed by $(i+1) \cdot (j-1)$. We know in this case that $(i+1) \cdot (j-1)$ forms a base pair because $i \cdot j$ is the exterior pair of a stacked pair.

$i \cdot j$ is the exterior pair of a bulge or internal loop, whose free energy is therefore given by $VBI(i, j)$.

$i \cdot j$ is the exterior pair of a multibranched loop, whose free energy is therefore given by $VM(i, j)$.

17.1.3. The recurrence for $VBI$

$$VBI(i, j) = \min_{\substack{i < i' < j' < j \\ i' - i + j - j' > 2}} \big( eL(i, j, i', j') + V(i', j') \big).$$

In this case, $i \cdot j$ is the exterior pair of a bulge or internal loop, and we must search all possible interior pairs $i' \cdot j'$ for the pair that results in the minimum free energy. For each such interior pair, the resulting free energy is the sum of the energy $eL(i, j, i', j')$ of the bulge or internal loop, plus the energy $V(i', j')$ of the compound structure closed by $i' \cdot j'$. It is easy to see that this search for the best interior pair is computationally intensive, simply because of the number of possibilities that must be considered. We will see later how to speed up this calculation, which is the new contribution of Lyngsø et al. [33].

17.1.4. The recurrence for $VM$

$$VM(i, j) = \min_{\substack{k \geq 2 \\ i < i_1 < j_1 < \cdots < i_k < j_k < j}} \Big( eM(i, j, i_1, j_1, \ldots, i_k, j_k) + \sum_{l=1}^{k} V(i_l, j_l) \Big).$$

In the same way that the recurrence for $VBI$ requires a search for the best structure among all the possible interior pairs, the calculation for $VM$ is even more intensive, requiring a search for interior pairs $i_1 \cdot j_1, \ldots, i_k \cdot j_k$, each of which closes its own branch out of the multibranched loop and contributes free energy $V(i_l, j_l)$. A direct implementation of the calculation shown for $VM$ is infeasibly slow. Section 17.3 will discuss simplifying assumptions about multibranched loops that allow us to speed this up substantially.
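These recurrences are all computed over intervals $s_i, \ldots, s_j$, filled in order of increasing $j - i$ (see Section 17.2). As a much simpler illustration of this style of interval dynamic programming, the sketch below maximizes the number of complementary base pairs (the classic Nussinov-style recurrence) rather than minimizing free energy. It is a stand-in for intuition only, and not the algorithm of Lyngsø et al. [33]; the minimum hairpin length and the allowed pair set are illustrative choices.

```python
def max_base_pairs(seq, min_loop=3):
    """Nussinov-style DP: N[i][j] = max number of pairs in s_i..s_j, with a
    minimum hairpin loop length and only Watson-Crick / wobble pairs allowed."""
    pairs = {("A", "U"), ("U", "A"), ("C", "G"), ("G", "C"), ("G", "U"), ("U", "G")}
    n = len(seq)
    N = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):            # fill in order of increasing j - i
        for i in range(n - span):
            j = i + span
            best = N[i + 1][j]                      # s_i left unpaired
            if (seq[i], seq[j]) in pairs:
                best = max(best, N[i + 1][j - 1] + 1)       # s_i pairs with s_j
            for k in range(i + 1, j):               # s_i pairs with some s_k, i < k < j
                if k - i > min_loop and (seq[i], seq[k]) in pairs:
                    best = max(best, N[i + 1][k - 1] + 1 + N[k + 1][j])
            N[i][j] = best
    return N[0][n - 1]

print(max_base_pairs("GGGAAAUCC"))   # 3 pairs for this small example
```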

17.2. Order of Computation


The interdependence of these recurrences requires a careful ordering of the calculations to ensure that we only rely on array entries whose values have already been determined. Specifically, the entries are computed in order from interior pairs to exterior pairs. This corresponds to filling the arrays $V$, $VBI$, and $VM$ in order of increasing values of $j - i$. An inspection of the recurrences in Sections 17.1.2 through 17.1.4 reveals that this order will always guarantee that the needed array entries have been computed. Within the calculations involving a given value of $j - i$, we compute $VBI(i, j)$ and $VM(i, j)$ before $V(i, j)$, in order to accommodate the recurrence in Section 17.1.2. Note that the calculations for the three tables are interleaved: we calculate the entry in each table for a given pair $(i, j)$ before advancing to the next pair. Because none of these entries depend on the values of entries in $W$, the computation of $W$ can be deferred until the other three tables have been completed.

17.3. Speeding Up the Multibranched Computation


As mentioned in Section 16.4, the actual free energy values of multibranched loops are not yet well understood. Given this state, the approximation we will describe is driven more by a desire to reduce the running time of the dynamic program than to produce a very accurate physical model of the loop. For this approximation, we assume that the free energy of a multibranched loop is given by an affine linear function of the number of branches and the size of the loop (measured as the number of unpaired bases):
$$eM(i, j, i_1, j_1, \ldots, i_k, j_k) = a + b \cdot k + c \cdot k',$$
where $k$ is the number of branches, $k'$ is the number of unpaired bases in the loop, and $a$, $b$, and $c$ are constants. (Lyngsø et al. [33] suggest that it would be more accurate to approximate the free energy as a logarithmic function of the loop size.)

Assuming this linear approximation, we can devise a much more efficient dynamic programming solution for computing $VM$ than the one given in Section 17.1.4. This solution requires an additional array $WM$, where $WM(i, j)$ gives the free energy of an optimal structure for $s_i, \ldots, s_j$, assuming that $s_i$ and $s_j$ are on a multibranched loop. $WM$ is defined by the following recurrence relation:
$$WM(i, j) = \min \Big\{ V(i, j) + b, \;\; WM(i+1, j) + c, \;\; WM(i, j-1) + c, \;\; \min_{i < r \leq j} \big( WM(i, r-1) + WM(r, j) \big) \Big\}.$$

The terms in this equation correspond to the following possible solutions:

$i \cdot j$ forms a base pair and therefore defines one of the branches, whose free energy is $V(i, j) + b$.

$s_i$ and $s_j$ are not paired with each other, so the free energy is given by the minimum partition of the sequence into two contiguous subsequences, $WM(i, r-1) + WM(r, j)$. (The terms $WM(i+1, j) + c$ and $WM(i, j-1) + c$ account for leaving $s_i$ or $s_j$ unpaired in the loop, at a cost of $c$ per unpaired base.)

Calculating $VM(i, j)$ then reduces to partitioning the loop interior into at least two pieces with the minimum total free energy:
$$VM(i, j) = a + \min_{i+1 < r \leq j-1} \big( WM(i+1, r-1) + WM(r, j-1) \big).$$

17.4. Running Time


The running time to fill in each of the complete tables (assuming the values on which it depends have already been computed and stored in their tables, and that we are using the multibranched approximation of Section 17.3) is determined as follows:

$W$: $O(n^2)$. Each of $n$ entries requires the computation of the min of $O(n)$ terms.

$V$: $O(n^2)$. Each of $O(n^2)$ entries requires the computation of the min of 4 terms.

$VBI$: $O(n^4)$. Each of $O(n^2)$ entries requires the computation of the min of $O(n^2)$ terms.

$VM$: $O(n^3)$. Each of $O(n^2)$ entries requires the computation of the min of $O(n)$ terms.

$WM$: $O(n^3)$. Each of $O(n^2)$ entries requires the computation of the min of $O(n)$ terms.

With the speedup of the multibranched loop computation described in Section 17.3, the new bottleneck has become the $O(n^4)$ time computation of the free energy of bulges and internal loops. We will see next how to eliminate this bottleneck.

Lecture 18

Speeding Up Internal Loop Computations


March 7, 2000 Notes: Kellie Plow
Recall from Section 17.4 that the running time for the internal loop free energy calculation is $O(n^4)$: each of the $O(n^2)$ exterior pairs requires a search through $O(n^2)$ interior pairs for one that minimizes the resulting free energy. We will address this $O(n^4)$ time computation of the free energy of bulges and internal loops, and show that it can be decreased to $O(n^3)$. This is the main result of Lyngsø et al. [33] and, combined with the remaining analysis in Section 17.4, shows that the entire RNA secondary structure prediction problem can be solved in time $O(n^3)$. An $O(n^3)$ running time is not practical for long RNA sequences, but it does allow secondary structure prediction for RNA sequences that are hundreds of bases in length, which would be prohibitive with an $O(n^4)$ time algorithm.

18.1. Assumptions About Internal Loop Free Energy


To speed up the running time it is necessary to make some assumptions about the form of the internal loop free energy function $eL$ (see Definition 16.5). The authors cite thermodynamics studies that support the fact that these assumptions are realistic. The authors first assume that $eL(i, j, i', j')$ is the sum of 3 contributions:

1. a term $size(n_1 + n_2)$ that is a function of the size of the loop, plus

2. stacking energies $stacking(i, j)$ and $stacking(i', j')$ for the unpaired bases adjacent on the loop to the two base pairs, plus

3. an asymmetry penalty $asym(n_1, n_2)$, where $n_1$ is the number of unpaired bases between the two base pairs on one side of the loop, and $n_2$ the number on the other side.

The free energy function for bulges and internal loops is thus given by
$$eL(i, j, i', j') = size(n_1 + n_2) + stacking(i, j) + stacking(i', j') + asym(n_1, n_2). \qquad (18.1)$$

18.2. Asymmetry Penalty


The currently used asymmetry functions are of the form
$$asym(n_1, n_2) = \min \big( K, \; |n_1 - n_2| \cdot f(m) \big), \quad \text{where } m = \min(n_1, n_2, c), \qquad (18.2)$$
where $K$ is the maximum asymmetry penalty assessed, $f$ is a function whose details need not concern us, and $c$ is a small constant (equal to 5 and 1, respectively, in two cited thermodynamics studies).

What is important for our purposes is that this asymmetry penalty grows linearly in $|n_1 - n_2|$, provided $n_1 \geq c$ and $n_2 \geq c$. In particular, the only assumption we will need to make about the penalty is that
$$asym(n_1 + 1, n_2 + 1) = asym(n_1, n_2) \qquad (18.3)$$
for all $n_1 \geq c$ and $n_2 \geq c$. This is certainly true for the particular form given in Equation (18.2).

18.3. Comparing Interior Pairs

Recall from Section 17.1.3 the recurrence
$$VBI(i, j) = \min_{i < i' < j' < j} \big( eL(i, j, i', j') + V(i', j') \big).$$

We are going to save time by not searching through all the interior pairs $i' \cdot j'$. Suppose that, for exterior pair $i \cdot j$, the interior pair $i' \cdot j'$ is better than $i'' \cdot j''$, and that both of these loops have the same size. Then, under the assumptions from Sections 18.1 and 18.2, Theorem 18.1 below demonstrates that $i' \cdot j'$ is also better for exterior pair $(i-1) \cdot (j+1)$. The intuition behind this theorem is that the asymmetry penalty for $i' \cdot j'$ is the same for the two different exterior pairs by Equation (18.3), as is the asymmetry penalty for $i'' \cdot j''$, and neither interior pair gains an advantage in loop size or stacking energies when you change from the smaller to the bigger loop.

Theorem 18.1: Let $j' - i' = j'' - i''$ (so as to compare internal loops of identical size). Let the number of unpaired bases on each side of each of the two loops be at least $c$ (so that Equation (18.3) applies to both loops). Suppose that
$$eL(i, j, i', j') + V(i', j') \leq eL(i, j, i'', j'') + V(i'', j''). \qquad (18.4)$$
Then
$$eL(i-1, j+1, i', j') + V(i', j') \leq eL(i-1, j+1, i'', j'') + V(i'', j''). \qquad (18.5)$$

Proof: By Equation (18.1), enlarging the exterior pair from $i \cdot j$ to $(i-1) \cdot (j+1)$ changes $eL$ by the same amount for both interior pairs: the size term changes from $size(n_1 + n_2)$ to $size(n_1 + n_2 + 2)$ (the two loops have identical size), the stacking term at the exterior pair changes from $stacking(i, j)$ to $stacking(i-1, j+1)$, and by Equation (18.3) the asymmetry penalty is unchanged, since each side length increases by one and both sides were already at least $c$. Adding this common amount to both sides of Equation (18.4) yields Equation (18.5).

Instead of using a two-dimensional array $VBI$, use a three-dimensional array $VBI'(i, j, s)$, where $s$ is the loop size. This array will be filled in using dynamic programming. The entry $VBI'(i, j, s)$ will store not only the free energy, but also the best interior pair (subject to the restriction that each loop side has at least $c$ unpaired bases) that gives this energy.

Now suppose that the entry $VBI'(i+1, j-1, s-2)$ has been calculated, and we want to calculate the entry $VBI'(i, j, s)$. By Theorem 18.1, the interior pair stored in $VBI'(i+1, j-1, s-2)$ is the best interior pair for $VBI'(i, j, s)$, with only two possible exceptions. These possible exceptions are the loops with exterior pair $i \cdot j$, length $s$, and having one or the other loop side of length exactly $c$.

Thus, for each of the $O(n^3)$ entries in $VBI'$, it is necessary to compare 3 loop energies and store the minimum, which takes constant time. It is also necessary to compare those loops with exterior pair $i \cdot j$ and length $s$ having one or the other loop side of length less than $c$, but there are only a constant number of these.

Bibliography
[1] Tatsuya Akutsu. Hardness results on gapless local multiple sequence alignment. Technical Report 98-MPS-24-2, Information Processing Society of Japan, 1998.

[2] Tatsuya Akutsu, Hiroki Arimura, and Shinichi Shimozono. On approximation algorithms for local multiple alignment. In RECOMB00: Proceedings of the Fourth Annual International Conference on Computational Molecular Biology, Tokyo, Japan, April 2000.

[3] S. F. Altschul. A protein alignment scoring system sensitive at all evolutionary distances. Journal of Molecular Evolution, 36(3):290-300, March 1993.

[4] Timothy L. Bailey and Charles Elkan. Unsupervised learning of multiple motifs in biopolymers using expectation maximization. Machine Learning, 21(1-2):51-80, October 1995.

[5] Joseph L. Bates and Robert L. Constable. Proofs as programs. ACM Transactions on Programming Languages and Systems, 7(1):113-136, January 1985.

[6] Richard Bellman. Dynamic Programming. Princeton University Press, 1957.

[7] Jon Bentley. Programming Pearls. Addison-Wesley, 1986.

[8] M. Borodovsky and J. McIninch. GeneMark: Parallel gene recognition for both DNA strands. Comp. Chem., 17(2):123-132, 1993.

[9] M. Borodovsky, J. McIninch, E. Koonin, K. Rudd, C. Medigue, and A. Danchin. Detection of new genes in a bacterial genome using Markov models for three gene classes. Nucleic Acids Research, 23(17):3554-3562, 1995.

[10] C. Branden and J. Tooze. An Introduction to Protein Structure. Garland, 1998.

[11] Stephen A. Cook. The complexity of theorem proving procedures. In Conference Record of Third Annual ACM Symposium on Theory of Computing, pages 151-158, Shaker Heights, OH, May 1971.

[12] Thomas H. Cormen, Charles E. Leiserson, and Ronald L. Rivest. Introduction to Algorithms. MIT Press, 1990.

[13] Arthur L. Delcher, Douglas Harmon, Simon Kasif, Owen White, and Steven L. Salzberg. Improved microbial gene identification with GLIMMER. Nucleic Acids Research, 27(23):4636-4641, 1999.

[14] Karl Drlica. Understanding DNA and Gene Cloning. John Wiley & Sons, second edition, 1992.

BIBLIOGRAPHY

82

[15] R. D. Fleischmann, M. D. Adams, O. White, R. A. Clayton, E. F. Kirkness, A. R. Kerlavage, C. J. Bult, J. F. Tomb, B. A. Dougherty, J. M. Merrick, et al. Whole-genome random sequencing and assembly of Haemophilus influenzae Rd. Science, 269:496–512, July 1995.
[16] Dmitrij Frishman, Andrey Mironov, Hans-Werner Mewes, and Mikhail Gelfand. Combining diverse evidence for gene recognition in completely sequenced bacterial genomes. Nucleic Acids Research, 26(12):2941–2947, 1998.
[17] Zvi Galil and Raffaele Giancarlo. Speeding up dynamic programming with applications to molecular biology. Theoretical Computer Science, 64:107–118, 1989.
[18] Michael R. Garey and David S. Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness. W. H. Freeman and Company, 1979.
[19] D. Gusfield. Efficient methods for multiple sequence alignment with guaranteed error bounds. Bulletin of Mathematical Biology, 55:141–154, 1993.
[20] Dan Gusfield. Algorithms on Strings, Trees, and Sequences. Cambridge University Press, 1997.
[21] Sridhar S. Hannenhalli, William S. Hayes, Artemis G. Hatzigeorgiou, and James W. Fickett. Bacterial start site prediction. 1999.
[22] William S. Hayes and Mark Borodovsky. Deriving ribosomal binding site (RBS) statistical models from unannotated DNA sequences and the use of the RBS model for N-terminal prediction. In Pacific Symposium on Biocomputing, pages 279–290, 1998.
[23] Gerald Z. Hertz and Gary D. Stormo. Identifying DNA and protein patterns with statistically significant alignments of multiple sequences. Bioinformatics, 15(7/8):563–577, July/August 1999.
[24] D. S. Hirschberg. A linear-space algorithm for computing maximal common subsequences. Communications of the ACM, 18:341–343, June 1975.
[25] David S. Johnson, Cecilia R. Aragon, Lyle A. McGeoch, and Catherine Schevon. Optimization by simulated annealing: an experimental evaluation; part I, graph partitioning. Operations Research, 37(6):865–892, Nov.–Dec. 1989.
[26] Jerzy Jurka and Mark A. Batzer. Human repetitive elements. In Robert A. Meyers, editor, Encyclopedia of Molecular Biology and Molecular Medicine, volume 3, pages 240–246. Weinheim, Germany, 1996.
[27] Samuel Karlin and Stephen F. Altschul. Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes. Proceedings of the National Academy of Science USA, 87(6):2264–2268, March 1990.
[28] Richard M. Karp. Reducibility among combinatorial problems. In R. E. Miller and J. W. Thatcher, editors, Complexity of Computer Computations, pages 85–104. Plenum Press, New York, 1972.
[29] Charles E. Lawrence, Stephen F. Altschul, Mark S. Boguski, Jun S. Liu, Andrew F. Neuwald, and John C. Wootton. Detecting subtle sequence signals: a Gibbs sampling strategy for multiple alignment. Science, 262:208–214, 8 October 1993.
[30] Ming-Ying Leung, Genevieve M. Marsh, and Terence P. Speed. Over- and underrepresentation of short DNA words in herpesvirus genomes. Journal of Computational Biology, 3(3):345–360, 1996.
[31] Benjamin Lewin. Genes VI. Oxford University Press, 1997.
[32] Rune B. Lyngsø and Christian N. S. Pedersen. Pseudoknots in RNA secondary structures. In RECOMB00: Proceedings of the Fourth Annual International Conference on Computational Molecular Biology, Tokyo, Japan, April 2000.
[33] Rune B. Lyngsø, Michael Zuker, and Christian N. S. Pedersen. Internal loops in RNA secondary structure prediction. In RECOMB99: Proceedings of the Third Annual International Conference on Computational Molecular Biology, pages 260–267, Lyon, France, April 1999.
[34] Merja Mikkonen, Jussi Vuoristo, and Tapani Alatossava. Ribosome binding site consensus sequence of Lactobacillus delbrueckii subsp. lactis. FEMS Microbiology Letters, 116:315–320, 1994.
[35] Webb Miller and Eugene W. Myers. Sequence comparison with concave weighting functions. Bulletin of Mathematical Biology, 50(2):97–120, 1988.
[36] Eugene W. Myers and Webb Miller. Optimal alignments in linear space. Computer Applications in the Biosciences, 4:11–17, 1988.
[37] S. B. Needleman and C. D. Wunsch. A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology, 48:443–453, 1970.
[38] Elena Rivas and Sean R. Eddy. A dynamic programming algorithm for RNA structure prediction including pseudoknots. Journal of Molecular Biology, 285(5):2053–2068, February 1999.
[39] Fred S. Roberts. Applied Combinatorics. Prentice-Hall, 1984.
[40] Walter L. Ruzzo and Martin Tompa. A linear time algorithm for finding all maximal scoring subsequences. In Proceedings of the Seventh International Conference on Intelligent Systems for Molecular Biology, pages 234–241, Heidelberg, Germany, August 1999. AAAI Press.
[41] Steven L. Salzberg, Arthur L. Delcher, Simon Kasif, and Owen White. Microbial gene identification using interpolated Markov models. Nucleic Acids Research, 26(2):544–548, 1998.
[42] Steven L. Salzberg, David B. Searls, and Simon Kasif, editors. Computational Methods in Molecular Biology. Elsevier, 1998.
[43] João Setubal and João Meidanis. Introduction to Computational Molecular Biology. PWS Publishing Company, 1997.
[44] J. Shine and L. Dalgarno. The 3'-terminal sequence of E. coli 16S ribosomal RNA: Complementarity to nonsense triplets and ribosome binding sites. Proceedings of the National Academy of Science USA, 71:1342–1346, 1974.
[45] Temple F. Smith and Michael S. Waterman. Identification of common molecular subsequences. Journal of Molecular Biology, 147(1):195–197, March 1981.
[46] Gary D. Stormo and Dana S. Fields. Specificity, free energy and information content in protein-DNA interactions. Trends in Biochemical Sciences, 23:109–113, 1998.
[47] Gary D. Stormo and George W. Hartzell III. Identifying protein-binding sites from unaligned DNA fragments. Proceedings of the National Academy of Science USA, 86:1183–1187, 1989.
[48] Martin Tompa. An exact method for finding short motifs in sequences, with application to the ribosome binding site problem. In Proceedings of the Seventh International Conference on Intelligent Systems for Molecular Biology, pages 262–271, Heidelberg, Germany, August 1999. AAAI Press.
[49] Robert Luis Vellanoweth and Jesse C. Rabinowitz. The influence of ribosome-binding-site elements on translational efficiency in Bacillus subtilis and Escherichia coli in vivo. Molecular Microbiology, 6(9):1105–1114, 1992.
[50] L. Wang and T. Jiang. On the complexity of multiple sequence alignment. Journal of Computational Biology, 1:337–348, 1994.
[51] Michael S. Waterman. Introduction to Computational Biology. Chapman & Hall, 1995.
[52] James D. Watson, Michael Gilman, Jan Witkowski, and Mark Zoller. Recombinant DNA. Scientific American Books (Distributed by W. H. Freeman), second edition, 1992.