Chapter 1
Algorithmic Warmup
Phillip Compeau
...ACTGATAACCCAGTATCAGACCAGTATCGAGGACGATACGTA...
DNA String
Copy 1
...ACTGATAACCCAGTATCAGACCAGTATCGAGGACGATACGTA...
...ACTGATAACCCAGTATCAGACCAGTATCGAGGACGATACGTA...
Copy 2
Bioinformatics Algorithms: An Active Learning Approach. © 2020 Compeau and Pevzner.
Origin of Replication
atcaatgatcaacgtaagcttctaagcatgatcaaggtgctcacacagtttatccacaac
ctgagtggatgacatcaagataggtcgttgtatctccttcctctcgtactctcatgacca
cggaaagatgatcaagagaggatgatttcttggccatatcgcaatgaatacttgtgactt
gtgcttccaattgacatcttcagcgccatattgcgctggccaaggtgacggagcgggatt
acgaaagcatgatcatggctgttgttctgtttatcttgttttgactgagacttgttagga
tagacggtttttcatcactgactagccaaagccttactctgcctgacatcgaccgtaaat
tgataatgaatttacatgcttccgcgacgatttacctcttgatcatcgatccgattgaag
atcttcaattgttaattctcttgcctcgactcatagccatgatgagctcttgatcatgtt
tccttaaccctctattttttacggaagaatgatcaagctgctgctcttgatcatcgtttc
[Figure: substrings of text written as text[i, j], for example text[7, 10]. Answer: text[0, 3]. Answer: text[3, 6].]
PatternCount(pattern, text)
    count ← 0
    k ← len(pattern)
    n ← len(text)
    for every integer i between 0 and n - k
        if text[i, i+k] = pattern
            count ← count + 1
    return count
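The PatternCount pseudocode can be sketched in Python as follows. This is a minimal translation, assuming that text[i, i+k] denotes the k symbols starting at position i (Python's text[i:i+k]); the name pattern_count and the sample strings are illustrative and do not come from the slides.

```python
def pattern_count(pattern: str, text: str) -> int:
    """Count occurrences of pattern in text, including overlapping ones."""
    count = 0
    k = len(pattern)
    n = len(text)
    # i ranges over every start position where a k-mer still fits: 0 .. n - k.
    for i in range(n - k + 1):
        if text[i:i + k] == pattern:
            count += 1
    return count


if __name__ == "__main__":
    # Illustrative input (an assumption, not taken from the slides).
    print(pattern_count("ATA", "CGATATATCCATAG"))
```

Overlapping matches are counted because the window advances one position at a time rather than jumping past each match.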
i      0  1  2  3  4  5  6  7  8  9  10  11  12  13  14  15
count  3  2  2  3  1  1  1  3  2  2  3   3   1   1   3   1
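One reading of the i/count table is that count at position i holds the number of times the k-mer starting at i occurs in the whole text. Under that assumption, the table could be built as in the sketch below; the function names and the sample string are hypothetical, and the text behind the slide's table is not recovered here.

```python
def pattern_count(pattern: str, text: str) -> int:
    """Count (possibly overlapping) occurrences of pattern in text."""
    k = len(pattern)
    return sum(1 for i in range(len(text) - k + 1) if text[i:i + k] == pattern)


def count_array(text: str, k: int) -> list[int]:
    """For each start position i, count occurrences of the k-mer text[i:i+k]."""
    return [pattern_count(text[i:i + k], text) for i in range(len(text) - k + 1)]


if __name__ == "__main__":
    # Hypothetical input chosen only to show the shape of the output.
    print(count_array("ATATA", 2))
```

Each entry calls pattern_count over the whole text, so the work grows roughly with the square of the text length; this is the motivation for counting every k-mer in a single pass instead.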
symbol  H  i  ␣  T  h  e  r  e  !
index   0  1  2  3  4  5  6  7  8

value   1  1  2  3  5  8  13  21  34  55  89
index   0  1  2  3  4  5  6   7   8   9   10
Pattern  count
“AA”     17
“AC”     4
“CG”     15
…
“GA”     23
“GG”     3
“GT”     30
“TA”     18
“TG”     2
“TT”     24
Would make things easier when finding frequent words
FrequencyMap(text, k)
    freqMap ← an empty map
    n ← len(text)
    for every integer i between 0 and n - k
        pattern ← text[i, i+k]
        if freqMap[pattern] doesn’t exist
            freqMap[pattern] ← 1
        else
            freqMap[pattern] ← freqMap[pattern] + 1
    return freqMap
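A minimal Python sketch of FrequencyMap, assuming a built-in dict plays the role of the map; the name frequency_map and the sample string are illustrative, not part of the slides.

```python
def frequency_map(text: str, k: int) -> dict[str, int]:
    """Map every k-mer of text to the number of times it occurs."""
    freq_map: dict[str, int] = {}
    n = len(text)
    for i in range(n - k + 1):
        pattern = text[i:i + k]
        if pattern not in freq_map:
            freq_map[pattern] = 1      # first time this k-mer is seen
        else:
            freq_map[pattern] += 1     # k-mer seen before
    return freq_map


if __name__ == "__main__":
    # Illustrative input (an assumption, not taken from the slides).
    for kmer, count in frequency_map("CGATATATCCATAG", 3).items():
        print(kmer, count)
```

The text is scanned once, and each k-mer costs one dictionary lookup, in contrast to calling PatternCount separately for every position.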
Outline
Where in the Genome Does DNA Replication Begin?
Frequent words in Vibrio cholerae
Figure 1.3 reveals the most frequent k-mers in the oriC region from Vibrio cholerae.
k        3     4      5              6       7        8         9
count    25    12     8              8       5        4         3
k-mers   tga   atga   gatca, tgatc   tgatca  atgatca  atgatcaa  atgatcaag, cttgatcat, tcttgatca, ctcttgatc
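A table like the one above could be produced by finding, for each k, the highest k-mer count and every k-mer that reaches it. The sketch below uses collections.Counter from the Python standard library; the function name most_frequent_kmers and the short sample string are assumptions for illustration, and the oriC sequence itself is not embedded here.

```python
from collections import Counter


def most_frequent_kmers(text: str, k: int) -> tuple[int, list[str]]:
    """Return (highest count, sorted list of k-mers reaching that count)."""
    counts = Counter(text[i:i + k] for i in range(len(text) - k + 1))
    if not counts:
        return 0, []
    best = max(counts.values())
    return best, sorted(kmer for kmer, c in counts.items() if c == best)


if __name__ == "__main__":
    # Illustrative input (an assumption, not the oriC region from the slides).
    sample = "ACGTTGCATGTCGCATGATGCATGAGAGCT"
    for k in range(3, 6):
        count, kmers = most_frequent_kmers(sample, k)
        print(k, count, " ".join(kmers))
```

Running the same loop over the oriC string for k from 3 to 9 would be one way to check the counts in the table.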
atcaatgatcaacgtaagcttctaagcATGATCAAGgtgctcacacagtttatccacaacctgagtgg
atgacatcaagataggtcgttgtatctccttcctctcgtactctcatgaccacggaaagATGATCAAG
agaggatgatttcttggccatatcgcaatgaatacttgtgacttgtgcttccaattgacatcttcagc
gccatattgcgctggccaaggtgacggagcgggattacgaaagcatgatcatggctgttgttctgttt
atcttgttttgactgagacttgttaggatagacggtttttcatcactgactagccaaagccttactct
gcctgacatcgaccgtaaattgataatgaatttacatgcttccgcgacgatttacCTCTTGATCATcg
atccgattgaagatcttcaattgttaattctcttgcctcgactcatagccatgatgagCTCTTGATCA
TgtttccttaaccctctattttttacggaagaATGATCAAGctgctgCTCTTGATCATcgtttc
5’ A G T C G C A T A G T 3’
3’ T C A G C G T A T C A 5’
Complementarity of DNA
...TGCGACT. We have not: each DNA strand has a direction, and the complementary strand runs in the opposite direction to the template strand, as shown by the arrows in Figure 1.4. Each strand is read in the 5’ → 3’ direction (see DETOUR: Directionality of DNA Strands to learn why biologists refer to the beginning and end of a strand using the terms 5’ and 3’).
The reverse complement of AGTCGCATAGT is ACTATGCGACT.
Reverse Complement Problem
CCTACCACC
|||||||||
GGATGGTGG
... are candidate hidden messages.
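The reverse complement can be sketched in Python by complementing each nucleotide and then reversing the result, so that the answer is again read in the 5’ → 3’ direction. The name reverse_complement is illustrative, and the sketch assumes the input contains only the letters A, C, G and T in either case.

```python
# Translation table pairing each nucleotide with its complement.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(pattern: str) -> str:
    """Complement every nucleotide, then reverse the string."""
    return pattern.translate(COMPLEMENT)[::-1]


if __name__ == "__main__":
    print(reverse_complement("AGTCGCATAGT"))  # example string from the slide
    print(reverse_complement("CCTACCACC"))    # upper strand of the paired 9-mers
```

Note that the lower strand GGATGGTGG in the pairing above is written 3’ → 5’; read 5’ → 3’ it is GGTGGTAGG, which is what the function returns for CCTACCACC.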
[Figure: Frequency of C (%) versus genome position (MB), with ori and ter marked. Slide note (partially recovered): "take frequency of each nucleotide in 100,000 nucleotide windows of E. ... C on half the genome?"]
[Figure: Frequency of G (%) versus genome position (MB), with ori and ter marked. Slide note (partially recovered): "... nucleotide in 100,000 nucleotide windows of E. ... be opposite when we ..."]
[Figure: further plots versus genome position (MB), with y-axis ticks at -2 and -4; titles not recovered.]
[Figure: chromosome with strands 5’ → 3’ and 3’ → 5’, ori and terC marked.]
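The per-window nucleotide frequencies plotted above could be computed along the following lines. This is a sketch under stated assumptions: the windows are taken to be consecutive and non-overlapping, a trailing partial window is dropped, and the function name window_frequencies is hypothetical. The slides do not say how the windows were chosen.

```python
def window_frequencies(genome: str, symbol: str, window: int) -> list[float]:
    """Percentage of `symbol` in each consecutive, non-overlapping window."""
    symbol = symbol.upper()
    freqs = []
    # Step by a full window; a trailing partial window is ignored (assumption).
    for start in range(0, len(genome) - window + 1, window):
        chunk = genome[start:start + window].upper()
        freqs.append(100.0 * chunk.count(symbol) / window)
    return freqs


if __name__ == "__main__":
    # Toy input; a real genome would use a window of 100,000 nucleotides.
    print(window_frequencies("CCAACGGG", "C", 4))
    print(window_frequencies("CCAACGGG", "G", 4))
```

Plotting the returned list against window start position would give curves of the same kind as the C and G frequency figures.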
Four DNA Polymerases Can Do the Job
[Figure: chromosome with strands 5’ → 3’ and 3’ → 5’, ori and terC marked.]
Continue as Replication Fork Enlarges
[Figure: replication fork enlarging; strands labeled 5’ → 3’ and 3’ → 5’.]
Note: Leading/lagging half-strands are complementary.
No problem replicating reverse half-strands (thick lines).
Wait until the Fork Opens and ...
[Figure: replication fork with strands labeled 5’ → 3’ and 3’ → 5’, and groups of Okazaki fragments marked.]
Many Okazaki fragments are replicated.
[Figure labels: lagging, ...C..., ...G..., leading.]
[Figure: chromosome with 3’ ori 5’ and terC marked; one half is labeled "C high, G low", the other "C low, G high".]
You walk along the genome and see that #G - #C has been decreasing and then suddenly starts increasing.
Exercise: What is the computational problem we are trying to solve here?
CATGGGCATCGGCCATACGCC
Skew Array/Diagram
[Figure: chromosome with strands 5’ → 3’ and 3’ → 5’, ori and terC marked. One half is labeled "C high, G low" (#G - #C is DECREASING); the other is labeled "C low, G high" (#G - #C is INCREASING).]
STOP: What will the skew array of a bacterial genome look like?
You walk along the genome and see that #G - #C has been decreasing and then suddenly starts increasing. Where are you in the genome?
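One way to make the question concrete is to record the running value of #G - #C and look for where it is lowest, since that is where a decreasing trend turns into an increasing one. The sketch below assumes skew[i] is #G - #C over the first i nucleotides, with skew[0] = 0; the function names are illustrative, and the sample string is the short sequence shown on the earlier slide.

```python
def skew_array(genome: str) -> list[int]:
    """skew[i] = (#G - #C) among the first i nucleotides; skew[0] = 0."""
    skew = [0]
    for symbol in genome.upper():
        if symbol == "G":
            step = 1
        elif symbol == "C":
            step = -1
        else:
            step = 0          # A and T leave the skew unchanged
        skew.append(skew[-1] + step)
    return skew


def minimum_skew_positions(genome: str) -> list[int]:
    """All positions i at which the skew reaches its minimum value."""
    skew = skew_array(genome)
    lowest = min(skew)
    return [i for i, value in enumerate(skew) if value == lowest]


if __name__ == "__main__":
    sample = "CATGGGCATCGGCCATACGCC"
    print(skew_array(sample))
    print(minimum_skew_positions(sample))
```

On a real bacterial genome the returned positions would only be candidates for the neighborhood of ori, to be followed by a search for frequent words nearby.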
We Have Now “Solved” Question 1!
aatgatgatgacgtcaaaaggatccggataaaacatggtgattgcctcgcataacgcgg
tatgaaaatggattgaagcccgggccgtggattctactcaactttgtcggcttgagaaa
gacctgggatcctgggtattaaaaagaagatctatttatttagagatctgttctattgt
gatctcttattaggatcgcactgccctgtggataacaaggatccggcttttaagatcaa
caacctggaaaggatcattaactgtgaatgatcggtgatcctggaccgtataagctggg
atcagaatgaggggttatacacaactcaaaaactgaacaacagttgttctttggataac
taccggttgatccaagcttcctgacagagttatccacagtagatcgcacgatctgtata
cttatttgagtaaattaacccacgatcccagccattcttctgccggatcttccggaatg
tcgtgatcaagaatgttgatcttcagtg
aatgatgatgacgtcaaaaggatccggataaaacatggtgattgcctcgcataacgcgg
tatgaaaatggattgaagcccgggccgtggattctactcaactttgtcggcttgagaaa
gacctgggatcctgggtattaaaaagaagatctatttatttagagatctgttctattgt
gatctcttattaggatcgcactgcccTGTGGATAAcaaggatccggcttttaagatcaa
caacctggaaaggatcattaactgtgaatgatcggtgatcctggaccgtataagctggg
atcagaatgaggggTTATACACAactcaaaaactgaacaacagttgttcTTTGGATAAC
taccggttgatccaagcttcctgacagagTTATCCACAgtagatcgcacgatctgtata
cttatttgagtaaattaacccacgatcccagccattcttctgccggatcttccggaatg
tcgtgatcaagaatgttgatcttcagtg
Skew diagram of T. petrophila