Chapter 1

The document discusses the process of finding replication origins (ori) in bacterial genomes, focusing on the computational challenges associated with identifying these regions. It outlines the hidden messages within the ori, specifically how the DnaA protein binds to DnaA boxes to initiate replication. Additionally, the document introduces algorithms for counting frequent substrings in DNA sequences to aid in locating these origins.

Finding Replication Origins in Bacterial Genomes

Algorithmic Warmup

Phillip Compeau

Bioinformatics Algorithms: An Active Learning Approach. © 2020 Compeau and Pevzner.

Outline

• An Intro to DNA Replication
• Hidden Messages in the Replication Origin
• Hunting for Frequent Words
• A Faster Frequent Words Approach
• Some Hidden Messages are More Surprising than Others
• An Explosion of Hidden Messages
• Replication Asymmetry Leads Us to the Replication Origin

The “Copying Mechanism”

What a Biologist Sees...

What a Computer Scientist Sees...

String: a contiguous collection of symbols.

...ACTGATAACCCAGTATCAGACCAGTATCGAGGACGATACGTA...
DNA String

Complicated Biological Process

Copy 1: ...ACTGATAACCCAGTATCAGACCAGTATCGAGGACGATACGTA...
Copy 2: ...ACTGATAACCCAGTATCAGACCAGTATCGAGGACGATACGTA...

Origin of Replication

Replication begins in a region called the replication origin (denoted ori).

The Finding ori Problem

Origin of Replication Problem

• Input: A DNA string genome.
• Output: The location of ori in genome.

STOP: Is the Origin of Replication Problem a computational problem?

Finding the Origin of Replication
How can we find ori in a genome?

Let’s hack out this DNA fragment. Can the genome replicate without it?

I need more information before I can hack this problem.

Looking for ori

Verified ori of Vibrio cholerae, the bacterium that causes cholera (~500 nucleotides):
atcaatgatcaacgtaagcttctaagcatgatcaaggtgctcacacagtttatccacaac
ctgagtggatgacatcaagataggtcgttgtatctccttcctctcgtactctcatgacca
cggaaagatgatcaagagaggatgatttcttggccatatcgcaatgaatacttgtgactt
gtgcttccaattgacatcttcagcgccatattgcgctggccaaggtgacggagcgggatt
acgaaagcatgatcatggctgttgttctgtttatcttgttttgactgagacttgttagga
tagacggtttttcatcactgactagccaaagccttactctgcctgacatcgaccgtaaat
tgataatgaatttacatgcttccgcgacgatttacctcttgatcatcgatccgattgaag
atcttcaattgttaattctcttgcctcgactcatagccatgatgagctcttgatcatgtt
tccttaaccctctattttttacggaagaatgatcaagctgctgctcttgatcatcgtttc

There must be a hidden message telling the cell to start replication here.

The Hidden Message Problem

Hidden Message Problem

• Input: A string text (representing ori).
• Output: A hidden message in text.

STOP: Is the Hidden Message Problem a computational problem?

We Have Two Scientific Problems

1. Given a bacterial genome (~3 Mbp), where is ori?
2. Given ori (~500 bp), what is the “hidden message” saying that replication should start here?

Hidden Message Problem Revisited

Hidden Message Problem

• Input: A string text (representing ori).
• Output: A hidden message in text.

The notion of “hidden message” is not defined.

Replication initiation is mediated by a protein called DnaA.

DnaA binds to a short segment in ori known as a DnaA box, a hidden message saying: “bind here!”

STOP: Would it make sense for an organism to have multiple DnaA boxes, or just one?

Counting Words

atcaatgatcaacgtaagcttctaagcatgatcaaggtgctcacacagtttatccacaac
ctgagtggatgacatcaagataggtcgttgtatctccttcctctcgtactctcatgacca
cggaaagatgatcaagagaggatgatttcttggccatatcgcaatgaatacttgtgactt
gtgcttccaattgacatcttcagcgccatattgcgctggccaaggtgacggagcgggatt
acgaaagcatgatcatggctgttgttctgtttatcttgttttgactgagacttgttagga
tagacggtttttcatcactgactagccaaagccttactctgcctgacatcgaccgtaaat
tgataatgaatttacatgcttccgcgacgatttacctcttgatcatcgatccgattgaag
atcttcaattgttaattctcttgcctcgactcatagccatgatgagctcttgatcatgtt
tccttaaccctctattttttacggaagaatgatcaagctgctgctcttgatcatcgtttc

We are looking for surprisingly frequent substrings of this ori. (A substring is a contiguous string of symbols appearing within a larger string.)

First: let’s count how often a given substring occurs.

Counting Words Problem

Substring Counting Problem

• Input: A string pattern and a longer string text.
• Output: The number of times pattern occurs in text.

STOP: How many times does ATA occur in CGATATATCCATAG?

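Answer: It can be 2 or 3, depending on whether overlapping occurrences are counted. ATA starts at positions 2, 4, and 10 (0-indexed), and the occurrences at positions 2 and 4 overlap. For this application, we will go with 3; that is, we count overlaps.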
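The difference between the two answers can be checked with a short Python sketch. The helper name count_overlapping is hypothetical (it is not from the slides); Python's built-in str.count counts only non-overlapping occurrences, so it gives the other answer.

```python
def count_overlapping(pattern: str, text: str) -> int:
    """Count occurrences of pattern in text, allowing overlaps."""
    count = 0
    start = text.find(pattern)
    while start != -1:
        count += 1
        # Resume the search one symbol later so overlapping matches are found.
        start = text.find(pattern, start + 1)
    return count


text = "CGATATATCCATAG"
non_overlapping = text.count("ATA")            # built-in count skips overlaps
overlapping = count_overlapping("ATA", text)   # the convention used here
print(non_overlapping, overlapping)
```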

Substring Indexing

Key Point: We think of a string as just an array of symbols. (So it should be 0-indexed.)

The notation we use for the substring of text occupying positions 7 through 9 is:

text[7, 10]

That’s weird … why not text[7, 9]?!?

STOP: How would we refer to this substring?

Answer: text[0, 3]

STOP: What about this substring?

Answer: text[3, 6]

STOP: What do you notice?

Answer: We can easily get the length of the substring by subtracting the lower index from the upper index. (Here, 6 - 3 = 3.)

STOP: How would we refer to the substring of text of length k starting at position i?

Answer: text[i, i+k]. This will be very useful!
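Python slices follow the same convention: the lower index is included and the upper index is excluded. The sketch below assumes the example string from the counting slide, since the string pictured on these slides is not reproduced here; the helper name substring is hypothetical.

```python
def substring(text: str, i: int, k: int) -> str:
    """Return the substring of text of length k starting at position i."""
    # text[i, i+k] in the slide notation is text[i:i + k] in Python.
    return text[i:i + k]


text = "CGATATATCCATAG"   # assumed example string, 0-indexed

first = text[0:3]         # positions 0, 1, 2
second = text[3:6]        # positions 3, 4, 5
third = substring(text, 7, 3)   # same as text[7:10]: positions 7, 8, 9

# The length is always the upper index minus the lower index.
print(first, second, third, len(second) == 6 - 3)
```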

Note: We use the same notation for “subarrays” if we want to refer to a contiguous collection of values in an array.

Our Idea for Counting Patterns

STOP: How many substrings of length k are there in a string of length n?
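One way to check an answer to this question is to slide a window of length k across the string and list every substring it covers. This is a minimal sketch; the function name all_kmers and the example string are assumptions for illustration. The window can start at positions 0 through n - k, which gives n - k + 1 substrings (counting repeats separately).

```python
def all_kmers(text: str, k: int) -> list[str]:
    """List every length-k substring of text, in order of starting position."""
    n = len(text)
    # range(n - k + 1) yields the start positions 0, 1, ..., n - k.
    return [text[i:i + k] for i in range(n - k + 1)]


text = "CGATATATCCATAG"   # n = 14
kmers = all_kmers(text, 3)
print(len(kmers), len(text) - 3 + 1)
```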

Pattern Counting

PatternCount(pattern, text)
    count ← 0
    k ← len(pattern)
    n ← len(text)
    for every integer i between 0 and n - k
        if text[i, i+k] = pattern
            count ← count + 1
    return count

len(): A (typically built-in) function that returns the length (number of symbols) of a string; it also works for counting the elements in an array.
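The pseudocode above translates almost line for line into Python. This is a sketch rather than a reference implementation; the snake_case name pattern_count is an assumption, and "between 0 and n - k" is read as inclusive of both ends.

```python
def pattern_count(pattern: str, text: str) -> int:
    """Count occurrences of pattern in text, including overlapping ones."""
    count = 0
    k = len(pattern)
    n = len(text)
    # i runs over 0, 1, ..., n - k inclusive.
    for i in range(n - k + 1):
        # text[i, i+k] in the slide notation is the slice text[i:i + k].
        if text[i:i + k] == pattern:
            count += 1
    return count


result = pattern_count("ATA", "CGATATATCCATAG")
print(result)
```

Because the window advances by one position at a time, overlapping occurrences are counted, which matches the convention chosen earlier.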

The Frequent Words Problem

k-mer: A string of length k.

A k-mer pattern is a most frequent k-mer in a string if no other k-mer is more frequent than pattern.

Frequent Words Problem

• Input: A string text and an integer k.
• Output: All most frequent k-mers in text.

STOP: Now is this problem clearly stated?

Solving the Frequent Words Problem

Example: If text = ACGTTTCACGTTTTACGG and k = 3, then the most frequent words are ACG and TTT (both occur three times).

Exercise: How might we solve this problem with an array? What subroutines would you find useful?
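One possible answer, sketched in Python under stated assumptions: keep an array counts in which counts[i] holds the number of occurrences of the k-mer starting at position i, using pattern counting as the subroutine. The names frequent_words and pattern_count are hypothetical, and this straightforward version recounts each k-mer from scratch, so it is not meant to be efficient.

```python
def pattern_count(pattern: str, text: str) -> int:
    """Count occurrences of pattern in text, including overlapping ones."""
    k = len(pattern)
    return sum(1 for i in range(len(text) - k + 1) if text[i:i + k] == pattern)


def frequent_words(text: str, k: int) -> set[str]:
    """Return all most frequent k-mers in text."""
    n = len(text)
    # counts[i] is the number of times the k-mer starting at i occurs in text.
    counts = [pattern_count(text[i:i + k], text) for i in range(n - k + 1)]
    if not counts:
        return set()   # text is shorter than k, so it has no k-mers
    max_count = max(counts)
    # A set removes the duplicates that arise when a k-mer occurs repeatedly.
    return {text[i:i + k] for i in range(n - k + 1) if counts[i] == max_count}


words = frequent_words("ACGTTTCACGTTTTACGG", 3)
print(sorted(words))
```

On the example string with k = 3, this returns ACG and TTT, in agreement with the example above.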