Lecture 1
Introduction to Bioinformatics
Mohammed El-Kebir
August 28, 2024
Course Staff
Instructor:
• Mohammed El-Kebir (melkebir)
• Office hours: Wednesdays, 3:30-4:30pm in Siebel 3216
TA:
• TBD
Course Organization
Course website:
https://www.el-kebir.net/teaching/CS466/Fall_2024/CS466.html
Syllabus:
• Prerequisites: CS 225 and its prerequisites
• Textbook
Grading:
• 5 written/programming assignments
• Midterm
• Final
• Research project
• Use Gradescope: ‘VDEE52’
What you will not learn:
• How to run popular bioinformatics packages.
• How to program.
Homework Assignments
• 5 homework assignments
• Each homework assignment is a combination of written/programming exercises
• LaTeX highly recommended for homework assignments
• Hint: use overleaf.com
• Python for programming exercises
Late policy:
• Students may use one 3-day extension in the semester for full credit
• Otherwise, late submission within 3 days: 80% credit
• Otherwise, submission after 3 days: 0% credit
Primer on Molecular Biology
DNA
Each strand is composed of a sequence of covalently bonded nucleotides (bases).
Four nucleotides:
A (adenine)
C (cytosine)
T (thymine)
G (guanine)
Single string from a 4-character alphabet:
5’ …ACGTGACTGAGGACCGTG
CGACTGAGACTGACTGGGT
CTAGCTAGACTACGTTTTA
TATATATATACGTCGTCGT
ACTGATGACTAGATTACAG
TGATTTTAAAAAAATATT… 3’
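Viewing a strand as a string over {A, C, G, T} makes simple checks easy to write down. The following is a minimal Python sketch; the names `is_dna` and `nucleotide_counts` are illustrative and not part of the course material.

```python
from collections import Counter

DNA_ALPHABET = set("ACGT")  # the four nucleotides listed above


def is_dna(s: str) -> bool:
    """Return True if every character of s is one of A, C, G, T."""
    return set(s) <= DNA_ALPHABET


def nucleotide_counts(s: str) -> dict:
    """Count how often each of the four nucleotides occurs in s."""
    counts = Counter(s)
    return {base: counts.get(base, 0) for base in "ACGT"}


fragment = "ACGTGACTGAGGACCGTG"  # first line of the example strand above
print(is_dna(fragment), nucleotide_counts(fragment))
```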
RNA
• Single-stranded
• A (adenine)
• C (cytosine)
• U (uracil)
• G (guanine)
…DTIGDWNSPSFFGIQLVSSVHT
TLWYRENAFPVLGGFSWLSWFNW
HNMGYYYPVYHIGYPMIRCGTHL
VPMQFAFQSIARSFALVHWNAPM
VLKINPHERQDPVFWPCLYYSVD
IRSMHIGYPMIRCYQA…
Protein
• String of amino acids: 20-letter alphabet
• Folds into 3D structures to perform various functions in cells
Primer on Molecular Biology
2. RNA
Old view: Mostly a “messenger”.
New view: Performs many important functions.
3. Protein
Performs most cellular functions (biochemistry, signaling, control, etc.)
Central Dogma of Molecular Biology
Start here
Transcription and Translation
Figure sources:
http://dna-rna.net/wp-content/uploads/2011/08/rna-transcription2.jpg
http://www.frontiers-in-genetics.org/en/pictures/translation_1.jpg
https://www.khanacademy.org/science/biology/gene-expression-central-dogma/transcription-of-dna-into-rna/a/overview-of-transcription
http://bioinfo.bisr.res.in/project/crat/pictures/codon.jpg
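At the string level, transcription and translation can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: transcription is modeled as replacing T with U on the coding strand (the cell actually copies the complementary template strand), and `PARTIAL_CODON_TABLE` is a deliberately incomplete, hypothetical lookup with only four of the 64 codons.

```python
# Deliberately partial codon table (RNA codon -> one-letter amino acid).
# The full genetic code has 64 codons; "*" marks a stop codon.
PARTIAL_CODON_TABLE = {"AUG": "M", "UUU": "F", "GGC": "G", "UAA": "*"}


def transcribe(coding_strand: str) -> str:
    """DNA coding strand -> RNA: same sequence with T replaced by U."""
    return coding_strand.replace("T", "U")


def translate(rna: str, table: dict = PARTIAL_CODON_TABLE) -> str:
    """Read codons (triplets) left to right until a stop codon is reached."""
    protein = []
    for i in range(0, len(rna) - 2, 3):
        amino_acid = table[rna[i:i + 3]]  # KeyError if the codon is not in the table
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


rna = transcribe("ATGTTTGGCTAA")
print(rna, translate(rna))
```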
What is Computational Biology/Bioinformatics?
Technology and Bioinformatics are Transforming Biology
Until late 20th century: hypothesis generation and validation
[Figure: log-scale plot, axis ticks from 1,000 to 100,000,000; dated November 2017]
A Deluge of Data
Question: What does it mean that we can sequence a genome?
Genome: … TATAATTAG … … CGTACCTAG … (millions to billions of nucleotides)
Next-generation DNA sequencing: 10s to 100s of millions of noisy reads
Reads: 30-1000 nucleotides
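To make "noisy reads" concrete, here is a toy Python simulation. Everything in it is an assumption made for illustration: `simulate_reads` is a hypothetical helper, reads start at uniformly random positions, and noise is modeled as substitution errors only.

```python
import random


def simulate_reads(genome, num_reads, read_length, error_rate, seed=0):
    """Sample substrings at random positions and add substitution errors."""
    rng = random.Random(seed)
    reads = []
    for _ in range(num_reads):
        start = rng.randrange(len(genome) - read_length + 1)
        read = list(genome[start:start + read_length])
        for j, base in enumerate(read):
            if rng.random() < error_rate:
                # Replace the base by one of the three other nucleotides.
                read[j] = rng.choice([b for b in "ACGT" if b != base])
        reads.append("".join(read))
    return reads


toy_genome = "TATAATTAGCGTACCTAG" * 5  # made-up genome, far smaller than a real one
for read in simulate_reads(toy_genome, num_reads=3, read_length=30, error_rate=0.05):
    print(read)
```

Sequencing yields only such fragments, without their positions, which is why the genome has to be reconstructed computationally.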
http://www.careercast.com/jobs-rated/jobs-rated-report-2015-ranking-top-200-jobs
Donald Knuth
Professor emeritus of Computer Science at Stanford University
Turing Award winner
“father of the analysis of algorithms.”
Course Topic #1: Sequence Alignment
Question: How do we compare two genes/genomes?
… TATAATTAG … vs. … CGTACCTAG …
Course Topic #3: Phylogenetics
Question: Can we reconstruct the evolutionary history of different species?
https://en.wikipedia.org/wiki/Phylogenetic_tree
Course Topic #4: Pattern Matching
Question: How do we start to make sense of all these sequences?
http://www.genomebiology.com/2009/10/3/R25/figure/F1?highres=y
Suffix Trees
Motif Finding
Course Topic #5: Cancer Genomics
Course Topics
1. Sequence alignment
‘How do we compare two genes/genomes?’
2. Genome assembly
‘How do we put all the pieces back together?’
3. Phylogenetics
‘What is the evolutionary history of different sequences?’
4. Pattern matching
‘How do we start to make sense out of all these sequences?’
5. Cancer genomics
‘How do we identify what drives tumor growth and how to treat/prevent it?’
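For the genome assembly question, one graph formulation named among the course topics is the de Bruijn graph. The sketch below is a minimal Python illustration, not the course's implementation: `de_bruijn_graph` is a hypothetical helper, and the reads are made-up, error-free substrings of the example sequence TATAATTAG.

```python
from collections import defaultdict


def de_bruijn_graph(reads, k):
    """Map each (k-1)-mer to the (k-1)-mers that follow it in some k-mer."""
    graph = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # One edge per k-mer occurrence: prefix -> suffix.
            graph[kmer[:-1]].append(kmer[1:])
    return dict(graph)


reads = ["TATAAT", "TAATTA", "ATTAG"]  # hypothetical reads of TATAATTAG
for node, successors in de_bruijn_graph(reads, k=4).items():
    print(node, "->", successors)
```

A k-mer that occurs in several reads contributes several parallel edges here; collapsing them is a design choice left open in this sketch.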
Course Topics
1. Sequence alignment
Dynamic programming: edit distance
2. Genome assembly
Graphs: de Bruijn graph, Eulerian and Hamiltonian paths
3. Phylogenetics
Trees and distances: distance matrices, neighbor joining, hierarchical clustering.
Phylogenies: Sankoff/Fitch algorithms, perfect phylogeny and compatibility
4. Pattern matching
Suffix trees/arrays, Burrows-Wheeler transform, Hidden Markov Models (HMMs)
5. Cancer genomics
Cancer phylogenies: Integer linear optimization and graph algorithms
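As a preview of the dynamic programming listed under sequence alignment, edit distance can be sketched as follows. This assumes unit cost for every insertion, deletion and substitution; the function name `edit_distance` is illustrative.

```python
def edit_distance(s: str, t: str) -> int:
    """Minimum number of insertions, deletions and substitutions turning s into t."""
    m, n = len(s), len(t)
    # dist[i][j] = edit distance between the prefixes s[:i] and t[:j]
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dist[i][0] = i  # delete all i characters of s[:i]
    for j in range(n + 1):
        dist[0][j] = j  # insert all j characters of t[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            substitution = dist[i - 1][j - 1] + (s[i - 1] != t[j - 1])
            dist[i][j] = min(dist[i - 1][j] + 1, dist[i][j - 1] + 1, substitution)
    return dist[m][n]


print(edit_distance("TATAATTAG", "CGTACCTAG"))
```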
Problem != Algorithm
Problem Π with instance X and solution set Π(X):
• Decision problem: Is Π(X) = ∅?
• Optimization problem: Find y* ∈ Π(X) s.t. f(y*) is optimum.
• Counting problem: Compute |Π(X)|.
• Sampling problem: Sample uniformly from Π(X).
• Enumeration problem: Enumerate all solutions in Π(X).
Algorithms: Set of instructions for solving problem.
• Exact
• Heuristic
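The five problem types can be illustrated on one small instance. The Python sketch below takes the Change Problem (introduced next) as the problem Π, with the instance X = (M, coins) and the objective f being the number of coins; the helper `solutions` and the choice M = 6 are assumptions made for the example.

```python
import random
from itertools import product


def solutions(M, coins):
    """Enumerate the solution set: all count vectors d with sum(c_i * d_i) == M."""
    ranges = [range(M // c + 1) for c in coins]
    return [d for d in product(*ranges)
            if sum(c * k for c, k in zip(coins, d)) == M]


sols = solutions(6, (5, 3, 1))  # enumeration: list every solution
print(len(sols) == 0)           # decision: is the solution set empty?
print(min(sols, key=sum))       # optimization: a solution with the fewest coins
print(len(sols))                # counting: size of the solution set
print(random.choice(sols))      # sampling: uniform draw from the solution set
```

Here all five questions are answered by full enumeration, which is only feasible for tiny instances; each problem type generally calls for its own algorithm.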
The Change Problem
• Suppose we have three coins: 𝐜 = (5 cent, 3 cent, 1 cent)
• What is the minimum number of coins needed to make change for M cents?
Idea #1: Choose largest coin possible
GreedyChange(M, c_1, …, c_n)
for i ← 1 to n
    d_i ← ⌊M / c_i⌋
    M ← M − d_i · c_i
return (d_1, …, d_n)
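A direct Python transcription of the greedy idea, as a sketch that assumes the coins are passed from largest to smallest. The second call shows a known failure case: with coins (5, 4, 1) and M = 8, greedy takes 5 + 1 + 1 + 1 (four coins), although 4 + 4 uses only two.

```python
def greedy_change(M: int, coins: tuple) -> list:
    """Greedy change; coins must be sorted from largest to smallest."""
    counts = []
    for c in coins:
        d = M // c      # d_i <- floor(M / c_i)
        M -= d * c      # M <- M - d_i * c_i
        counts.append(d)
    return counts


print(greedy_change(8, (5, 3, 1)))  # uses 5 + 3
print(greedy_change(8, (5, 4, 1)))  # uses 5 + 1 + 1 + 1, not optimal
```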
Idea #2: When in doubt, apply brute force...
Change Problem: Given amount M ∈ ℕ ∖ {0} and coins 𝐜 = (c_1, …, c_n) ∈ ℕ^n
s.t. c_n = 1 and c_i ≥ c_{i+1} for all i ∈ [n − 1] = {1, …, n − 1},
find 𝐝 = (d_1, …, d_n) ∈ ℕ^n s.t. (i) M = ∑_{i=1}^{n} c_i d_i and (ii) ∑_{i=1}^{n} d_i is minimum
𝐜 = (5 cent, 4 cent, 1 cent)
Correct? yes
Efficient? no
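One way to read "brute force" here is to try every count vector 𝐝 and keep the feasible one with the fewest coins. The Python sketch below does exactly that; bounding each d_i by ⌊M / c_i⌋ is an assumption of this sketch rather than something stated on the slide.

```python
from itertools import product


def brute_force_change(M: int, coins: tuple) -> tuple:
    """Try every count vector d with 0 <= d_i <= M // c_i and keep the best."""
    best = None
    for d in product(*(range(M // c + 1) for c in coins)):
        if sum(c * k for c, k in zip(coins, d)) == M:   # condition (i)
            if best is None or sum(d) < sum(best):      # condition (ii)
                best = d
    return best


print(brute_force_change(8, (5, 4, 1)))
```

The result is always optimal, but the number of candidate vectors is the product of the (⌊M / c_i⌋ + 1) terms, which grows quickly with M and n.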
Value        1  2  3  4  5  6  7  8  9  10  11
Min # coins  ?  ?  ?  ?  ?  ?  ?  ?  ?  ?   ?
[Figure: arrows labeled −1, −3 and −5 between table entries]
Optimal substructure:
Optimal solution is obtained from optimal solutions of subproblems
Idea #3: Recursion
𝐜 = (5, 3, 1)
Value        1  2  3  4  5  6  7  8  9  10  11
Min # coins  ?  ?  ?  ?  ?  ?  ?  ?  ?  ?   ?
[Figure: arrows labeled −1, −3 and −5 between table entries]
minNumCoins(M) = min {
    minNumCoins(M − c_1) + 1,
    minNumCoins(M − c_2) + 1,
    …,
    minNumCoins(M − c_n) + 1
}
Given coins 𝐜 = (1, 3, 7) and amount M = 77, find 𝐝 = (d_1, …, d_n) ∈ ℕ^n
such that: (i) M = ∑_{i=1}^{n} c_i d_i and (ii) ∑_{i=1}^{n} d_i is minimum.
minNumCoins(77) = min {
    minNumCoins(77 − 1) + 1,
    minNumCoins(77 − 3) + 1,
    minNumCoins(77 − 7) + 1
}
minNumCoins(76) = min {
    minNumCoins(76 − 1) + 1,
    minNumCoins(76 − 3) + 1,
    minNumCoins(76 − 7) + 1
}
⋮
minNumCoins(7) = 1
minNumCoins(3) = 1
minNumCoins(1) = 1
RecursiveChange(M, c_1, …, c_n)
if M = 0
    return 0
bestNumCoins ← ∞
for i ← 1 to n
    if M ≥ c_i
        numCoins ← RecursiveChange(M − c_i, c_1, …, c_n)
        if numCoins + 1 < bestNumCoins
            bestNumCoins ← numCoins + 1
return bestNumCoins
Correct but inefficient: the same subproblem is solved many times!
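The recursive pseudocode translates almost line by line into Python. In this sketch the `calls` counter is an addition for illustration only; it records how often each amount is solved, which makes the repeated subproblems visible.

```python
import math
from collections import Counter

calls = Counter()  # calls[m] = number of times amount m was solved (illustration only)


def recursive_change(M: int, coins: tuple):
    """Minimum number of coins for M, computed by plain recursion."""
    calls[M] += 1
    if M == 0:
        return 0
    best_num_coins = math.inf
    for c in coins:
        if M >= c:
            num_coins = recursive_change(M - c, coins)
            if num_coins + 1 < best_num_coins:
                best_num_coins = num_coins + 1
    return best_num_coins


print(recursive_change(11, (5, 3, 1)))
print(calls[1])  # how many times the subproblem M = 1 was solved
```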
Solutions:
• Remember previously computed values: memoization
• Bottom up computation: dynamic programming
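Memoization can be sketched in Python with the standard library's `functools.lru_cache`, which stores the result for each amount the first time it is computed. The wrapper name `memoized_change` is illustrative, and the sketch assumes a 1-cent coin is present so that every amount can be reached.

```python
from functools import lru_cache


def memoized_change(M: int, coins: tuple) -> int:
    """Minimum number of coins for M; each amount is solved only once."""

    @lru_cache(maxsize=None)
    def min_num_coins(m: int) -> int:
        if m == 0:
            return 0
        # Assumes 1 is among the coins, so at least one option always exists.
        return min(min_num_coins(m - c) + 1 for c in coins if m >= c)

    return min_num_coins(M)


print(memoized_change(77, (7, 3, 1)))
```

For M = 77 and coins (7, 3, 1) this returns 11 (eleven 7-cent coins). The recursion is still as deep as M, so very large amounts can hit Python's recursion limit, which is one practical argument for the bottom-up variant.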
Idea #4: Solve recurrence with dynamic programming
Fill in table “bottom up”: from smallest to largest.
𝐜 = (5, 3, 1)
Value        1  2  3  4  5  6  7  8  9  10  11
Min # coins  1  ?  1  ?  1  ?  ?  ?  ?  ?   ?
minNumCoins(M) = min {
    minNumCoins(M − 1) + 1,
    minNumCoins(M − 3) + 1,
    minNumCoins(M − 5) + 1
}
Only one coin is needed to make change for the values 1, 3 and 5
Value        1  2  3  4  5  6  7  8  9  10  11
Min # coins  1  2  1  2  1  2  ?  ?  ?  ?   ?
Two coins are needed to make change for the values 2, 4 and 6
Value        1  2  3  4  5  6  7  8  9  10  11
Min # coins  1  2  1  2  1  2  3  ?  ?  ?   ?
Value        1  2  3  4  5  6  7  8  9  10  11
Min # coins  1  2  1  2  1  2  3  2  ?  ?   ?
Value        1  2  3  4  5  6  7  8  9  10  11
Min # coins  1  2  1  2  1  2  3  2  3  2   3
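The bottom-up table fill can be sketched in Python as follows. The function name `dp_change` is illustrative; it returns the whole table so that the result can be compared with the completed table for 𝐜 = (5, 3, 1).

```python
def dp_change(M: int, coins: tuple) -> list:
    """Return a table where table[m] is the minimum number of coins for amount m."""
    table = [0] + [float("inf")] * M  # table[0] = 0: no coins needed for 0 cents
    for m in range(1, M + 1):         # from smallest to largest amount
        for c in coins:
            if m >= c and table[m - c] + 1 < table[m]:
                table[m] = table[m - c] + 1
    return table


print(dp_change(11, (5, 3, 1))[1:])  # [1, 2, 1, 2, 1, 2, 3, 2, 3, 2, 3]
```

Each of the M table entries looks at n coins, so the running time is proportional to M · n.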
• Problem != algorithm
• Reading:
• “Biology for Computer Scientists” by Lawrence Hunter (http://www.el-kebir.net/teaching/CS466/Hunter_BIO_CS.pdf)
• Jones and Pevzner: Chapters 2.1, 2.3, 2.4, 6.2
Sources
• CS 362 by Layla Oesper (Carleton College)
• CS 1810 by Ben Raphael (Brown/Princeton University)
• An Introduction to Bioinformatics Algorithms book (Jones and Pevzner)