Lecture 1


CS 466

Introduction to Bioinformatics
Lecture 1
Mohammed El-Kebir
August 28, 2024
Course Staff
Instructor:
• Mohammed El-Kebir (melkebir)
• Office hours: Wednesdays, 3:30-4:30pm in Siebel 3216
• Developing combinatorial algorithms to study all stages of cancer progression.

Piazza / Gradescope:
• https://piazza.com/illinois/fall2024/cs466
• Gradescope entry code: ‘VDEE52’

TA:
• TBD
2
Course Organization
Course website:
https://www.el-kebir.net/teaching/CS466/Fall_2024/CS466.html

Syllabus:
• Prerequisites: CS 225 and its prerequisites
• Textbook

Grading:
• 5 written/programming assignments
• Midterm
• Final
• Research project
• Use Gradescope: ‘VDEE52’

Piazza: (please sign up)
• https://piazza.com/illinois/fall2024/cs466
3
Course Objectives
Learn:
• Learn underlying ideas of common algorithms in bioinformatics.
• Learn to translate a biological problem into a computational problem.
• Learn to read scientific papers, propose and conduct independent research.

Not learn:
• Will not learn to run popular bioinformatics packages.
• Will not learn how to program.

4
Homework Assignments
• 5 homework assignments
• Each homework assignment is a combination of written/programming
exercises
• LaTeX highly recommended for homework assignments
• Hint: use overleaf.com
• Python for programming exercises

Late policy:
• Students may use one 3-day extension during the semester for full credit
• Otherwise, a submission up to 3 days late receives 80% credit
• A submission more than 3 days late receives 0% credit

5
Primer on Molecular Biology

Molecular Biology is the field of biology that studies the composition, structure and interactions of cellular molecules – such as nucleic acids and proteins – that carry out the biological processes essential for the cell's functions and maintenance.
https://www.nature.com/subjects/molecular-biology

Cellular molecules:
1. DNA
2. RNA
3. Protein

6
DNA
Each strand is composed of a sequence of covalently bonded nucleotides (bases).

Four nucleotides:
A (adenine)
C (cytosine)
T (thymine)
G (guanine)

A ↔ T, C ↔ G: Watson-Crick base pairing


7
DNA
Each strand is composed of a sequence of covalently bonded nucleotides (bases).

Double strand: pair of strings from a 4-character alphabet

5’ …ACGTGACTGAGGACCGTG… 3’
   …||||||||||||||||||…
3’ …TGCACTGACTCCTGGCAC… 5’

Single strand: single string from a 4-character alphabet

5’ …ACGTGACTGAGGACCGTG
CGACTGAGACTGACTGGGT
CTAGCTAGACTACGTTTTA
TATATATATACGTCGTCGT
ACTGATGACTAGATTACAG
TGATTTTAAAAAAATATT… 3’
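
The base-pairing rule means one strand determines the other. A minimal Python sketch of that idea follows; the function names `complement` and `reverse_complement` are illustrative choices, not part of the lecture.

```python
# Watson-Crick pairing for DNA: A pairs with T, C pairs with G.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def complement(strand: str) -> str:
    """Return the paired strand, read in the same left-to-right order."""
    return strand.translate(COMPLEMENT)


def reverse_complement(strand: str) -> str:
    """Return the paired strand written 5' to 3' (i.e. reversed)."""
    return complement(strand)[::-1]


top = "ACGTGACTGAGGACCGTG"          # 5' to 3' strand from the slide
print(complement(top))              # bottom strand, written 3' to 5'
print(reverse_complement(top))      # bottom strand, written 5' to 3'
```

Because the second strand is fully determined by the first, a double-stranded molecule is usually stored as a single string.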
8
RNA
• Single-stranded
• A (adenine)
• C (cytosine)
• U (uracil)
• G (guanine)

• Can fold into structures due to base complementarity: A ↔ U, C ↔ G

• Comes in many flavors: mRNA, rRNA, tRNA, tmRNA, snRNA, snoRNA, scaRNA, aRNA, asRNA, piwiRNA, etc.
9
Protein
• String of amino acids: 20-letter alphabet

…DTIGDWNSPSFFGIQLVSSVHT
TLWYRENAFPVLGGFSWLSWFNW
HNMGYYYPVYHIGYPMIRCGTHL
VPMQFAFQSIARSFALVHWNAPM
VLKINPHERQDPVFWPCLYYSVD
IRSMHIGYPMIRCYQA…

10
Protein
• String of amino acids: 20-letter alphabet
• Folds into 3D structures to perform various functions in cells

11
Primer on Molecular Biology

Three fundamental molecules:


1. DNA
Information storage.

2. RNA
Old view: Mostly a “messenger”.
New view: Performs many important
functions.

3. Protein
Perform most cellular functions
(biochemistry, signaling, control, etc.)

12
Central Dogma of Molecular Biology
DNA → RNA → Protein:
The process by which cells “read” the genome.
First proposed by Francis Crick in 1956.
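
Viewed as string operations, the two arrows can be sketched in a few lines of Python. This is a simplified illustration, not the lecture's code: it assumes the input is the coding strand (so transcription is just T → U), that the reading frame starts at position 0, and that translation stops at the first stop codon. The table is the standard genetic code; the names `transcribe` and `translate` are hypothetical.

```python
from itertools import product

# Standard genetic code, listed in the conventional U, C, A, G codon order.
# '*' marks a stop codon.
BASES = "UCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(coding_strand: str) -> str:
    """DNA -> RNA: the mRNA matches the coding strand with T replaced by U."""
    return coding_strand.replace("T", "U")


def translate(mrna: str) -> str:
    """RNA -> protein: read codons from position 0 until a stop codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


print(translate(transcribe("ATGGCCTAA")))
```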

13
Transcription and Translation

http://dna-rna.net/wp-content/uploads/2011/08/rna-transcription2.jpg
http://www.frontiers-in-genetics.org/en/pictures/translation_1.jpg
14
Transcription and Translation

https://www.khanacademy.org/science/biology/gene-expression-central-dogma/transcription-of-dna-into-rna/a/overview-of-transcription
http://bioinfo.bisr.res.in/project/crat/pictures/codon.jpg

15
What is Computational Biology/Bioinformatics?

Computational biology and bioinformatics is an interdisciplinary field that develops and applies computational methods to analyze large collections of biological data, such as genetic sequences, cell populations or protein samples, to make new predictions or discover new biology.
https://www.nature.com/subjects/computational-biology-and-bioinformatics

16
Technology and Bioinformatics are Transforming Biology
Until late 20th Century:
• Hypothesis Generation and Validation

21st Century and Beyond:
• High throughput technologies
• Algorithms
• Hypothesis Generation and Validation

17
A Deluge of Data

[Figure: growth of data over time on a log scale (y-axis from 1,000 to 100,000,000), annotated “What happened here?”. November, 2017]
18
A Deluge of Data

19
A Deluge of Data

Outer ring color scheme:
• Red: Completed genome
• Light Blue: Low resolution genome

20
Question: What does it mean that we can sequence a genome?

No technology exists that can sequence a complete (human) genome from end to end!

Genome: millions to billions of nucleotides
Next-generation DNA sequencing: 10s-100s of millions of noisy reads
Reads: 30-1000 nucleotides

… CATTCAGTAG …
… AGCCATTAG …
… GGTAGTTAG …
… GGTAAACTAG …
… TATAATTAG …
… CGTACCTAG …

Making sense of this data absolutely requires the use and development of algorithms!
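
To make the "many short, noisy reads" picture concrete, here is a toy Python simulation. It is only a sketch under simple assumptions that are not from the lecture: reads start at uniformly random positions, all have the same length, and each base is independently replaced by a different base with probability `error_rate`. The function name and parameters are hypothetical.

```python
import random


def sample_reads(genome, num_reads, read_length, error_rate=0.01, seed=0):
    """Draw fixed-length substrings of `genome` and add substitution errors."""
    rng = random.Random(seed)  # seeded so the toy example is reproducible
    reads = []
    for _ in range(num_reads):
        start = rng.randrange(len(genome) - read_length + 1)
        read = list(genome[start:start + read_length])
        for i, base in enumerate(read):
            if rng.random() < error_rate:
                # Substitute with one of the three other nucleotides.
                read[i] = rng.choice([b for b in "ACGT" if b != base])
        reads.append("".join(read))
    return reads


genome = "ACGTGACTGAGGACCGTG"
for read in sample_reads(genome, num_reads=5, read_length=6):
    print(read)
```

The algorithmic challenge is the reverse direction: given only the reads, recover the genome.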
21
Why Study Computational Biology?
Interdisciplinary:
• Biology
• Computer Science
• Mathematics
• Statistics
= FUN!
Why choose just 1?

Best Jobs                       Worst Jobs
1. Actuary                      200. Newspaper reporter
2. Audiologist                  199. Lumberjack
3. Mathematician                198. Enlisted Military Personnel
4. Statistician                 197. Cook
5. Biomedical Engineer          196. Broadcaster
6. Data Scientist               195. Photojournalist
7. Dental Hygienist             194. Corrections Officer
8. Software Engineer            193. Taxi Driver
9. Occupational Therapist       192. Firefighter
10. Computer Systems Analyst    191. Mail Carrier

http://www.careercast.com/jobs-rated/jobs-rated-report-2015-ranking-top-200-jobs
22
Donald Knuth
Professor emeritus of Computer Science at Stanford University
Turing Award winner
“father of the analysis of algorithms.”

“I can’t be as confident about computer science as I can about biology. Biology easily has 500 years of exciting problems to work on. It’s at that level.”

23
Course Topic #1: Sequence Alignment
Question: How do we compare two genes/genomes?

Human Genome: …ACTCGACTGAGAGGATTTCGAGCATGA… ($\approx 3.2 \times 10^{9}$ bp)
vs.
Mouse Genome: …ACTCAACTGAGATTCGAGCTTCAATGA… ($\approx 2.8 \times 10^{9}$ bp)
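
One simple way to quantify how different two sequences are is edit distance: the minimum number of single-character insertions, deletions and substitutions that turn one string into the other. The following is a minimal sketch with unit costs for all three operations (an assumption; alignment scoring schemes can be more general). The function name is illustrative.

```python
def edit_distance(v: str, w: str) -> int:
    """Minimum number of insertions, deletions and substitutions from v to w."""
    n, m = len(v), len(w)
    # d[i][j] = edit distance between the prefixes v[:i] and w[:j].
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i  # delete all i characters of v[:i]
    for j in range(m + 1):
        d[0][j] = j  # insert all j characters of w[:j]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            mismatch = 0 if v[i - 1] == w[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,             # deletion
                d[i][j - 1] + 1,             # insertion
                d[i - 1][j - 1] + mismatch,  # match or substitution
            )
    return d[n][m]


human = "ACTCGACTGAGAGGATTTCGAGCATGA"
mouse = "ACTCAACTGAGATTCGAGCTTCAATGA"
print(edit_distance(human, mouse))
```

The table has (n + 1)(m + 1) entries, so this runs in time proportional to the product of the two lengths.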
24
Course Topic #2: Genome Assembly
… CATTCAGTAG …
… AGCCATTAG …
… GGTAGTTAG … … GGTAAACTAG …

… TATAATTAG … … CGTACCTAG …

Question: How do we put all the pieces back together?
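
One graph-based way to organise overlapping reads is a de Bruijn graph: every length-k substring (k-mer) of a read becomes an edge from its first k − 1 characters to its last k − 1 characters. A minimal sketch follows, assuming error-free reads; the function name and the choice of k are illustrative.

```python
from collections import defaultdict


def de_bruijn_graph(reads, k):
    """Map each (k-1)-mer to the list of (k-1)-mers that follow it in a k-mer."""
    graph = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Edge: prefix of length k-1 -> suffix of length k-1.
            graph[kmer[:-1]].append(kmer[1:])
    return dict(graph)


reads = ["CATTCAGTAG", "AGCCATTAG", "GGTAGTTAG"]
for node, successors in sorted(de_bruijn_graph(reads, k=4).items()):
    print(node, "->", successors)
```

Reconstructing a sequence then corresponds to finding a path through this graph that uses every edge.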

25
Course Topic #3: Phylogenetics

Question: Can we reconstruct the evolutionary history of different species?
https://en.wikipedia.org/wiki/Phylogenetic_tree

Question: Can we recover how a tumor has evolved over time?
https://scientificbsides.wordpress.com/2014/06/09/inferring-tumour-evolution-2-comparison-to-classical-phylogenetics/
26
Course Topic #4: Pattern Matching
Question: How do we start to make
sense of all these sequences?

Burrows Wheeler Transform

http://www.genomebiology.com/2009/10/3/R25/figure/F1?highres=y

Suffix Trees

Motif Finding
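
As a small taste of the first technique, the Burrows-Wheeler transform of a string can be computed by sorting all of its rotations and reading off the last column. The sketch below is the naive construction for illustration only (practical tools build it via suffix arrays), and it assumes the sentinel `$` does not occur in the text and sorts before every other character.

```python
def bwt(text: str, sentinel: str = "$") -> str:
    """Burrows-Wheeler transform via sorted rotations (naive construction)."""
    text += sentinel  # unique end marker, smaller than any letter
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    # The transform is the last character of each sorted rotation.
    return "".join(rotation[-1] for rotation in rotations)


print(bwt("banana"))
print(bwt("ACGTGACTGAGGACCGTG"))
```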

27
Course Topic #5: Cancer Genomics

Question: How can we analyze available data to determine what drives tumor growth and how to treat or prevent it?
29
Course Topics
1. Sequence alignment
‘How do we compare two genes/genomes?’

2. Genome assembly
‘How do we put all the pieces back together?’

3. Phylogenetics
‘What is the evolutionary history of different sequences?’

4. Pattern matching
‘How do we start to make sense out of all these sequences?’

5. Cancer genomics
‘How do we identify what drives tumor growth and how to treat/prevent it?’

30
Course Topics
1. Sequence alignment
Dynamic programming: edit distance

2. Genome assembly
Graphs: de Bruijn graph, Eulerian and Hamiltonian paths

3. Phylogenetics
Trees and distances: distance matrices, neighbor joining, hierarchical clustering.
Phylogenies: Sankoff/Fitch algorithms, perfect phylogeny and compatibility

4. Pattern matching
Suffix trees/arrays. Burrows-Wheeler transform, Hidden Markov Models (HMMs)

5. Cancer genomics
Cancer phylogenies: Integer linear optimization and graph algorithms

31
Problem != Algorithm
Problem $\Pi$ with instance $X$ and solution set $\Pi(X)$:
• Decision problem:
  • Is $\Pi(X) = \emptyset$?
• Optimization problem:
  • Find $y^* \in \Pi(X)$ s.t. $f(y^*)$ is optimum.
• Counting problem:
  • Compute $|\Pi(X)|$.
• Sampling problem:
  • Sample uniformly from $\Pi(X)$.
• Enumeration problem:
  • Enumerate all solutions in $\Pi(X)$.

Algorithms: Set of instructions for solving a problem.
• Exact
• Heuristic
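
The five problem versions can be illustrated on one toy problem. In the sketch below, which is an assumed example rather than one from the lecture, an instance X is a (text, pattern) pair and the solution set Π(X) is the set of positions where the pattern occurs in the text; the objective f for the optimization version is simply the position (leftmost occurrence).

```python
import random


def occurrences(text: str, pattern: str) -> list:
    """Enumeration: all start positions of `pattern` in `text`, i.e. Pi(X)."""
    return [
        i for i in range(len(text) - len(pattern) + 1)
        if text.startswith(pattern, i)
    ]


text, pattern = "ACGTGACTGAGGACCGTG", "GA"
pi_x = occurrences(text, pattern)

is_empty = len(pi_x) == 0        # decision: is Pi(X) empty?
leftmost = min(pi_x)             # optimization: minimize f(y) = y
count = len(pi_x)                # counting: |Pi(X)|
sampled = random.choice(pi_x)    # sampling: uniform over Pi(X)

print(pi_x, is_empty, leftmost, count, sampled)
```

Here every version is answered by enumerating Π(X) first, which is the simplest algorithm but rarely the most efficient one; the same problem can have many algorithms.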

32
The Change Problem
• Suppose we have three coins:

$\mathbf{c}$ = (5 cent, 3 cent, 1 cent)

• What is the minimum number of coins needed to make change for $M$ cents?

Change Problem: Given amount $M \in \mathbb{N} \setminus \{0\}$ and coins $\mathbf{c} = (c_1, \ldots, c_n) \in \mathbb{N}^n$ s.t. $c_n = 1$ and $c_i \ge c_{i+1}$ for all $i \in [n-1] = \{1, \ldots, n-1\}$, find $\mathbf{d} = (d_1, \ldots, d_n) \in \mathbb{N}^n$ s.t. (i) $M = \sum_{i=1}^{n} c_i d_i$ and (ii) $\sum_{i=1}^{n} d_i$ is minimum.

34
Idea #1: Choose largest coin possible
GreedyChange($M, c_1, \ldots, c_n$)
1. for $i \leftarrow 1$ to $n$
2.     $d_i \leftarrow \lfloor M / c_i \rfloor$
3.     $M \leftarrow M - d_i c_i$

Is this a good algorithm? Two properties of a good algorithm:

Correctness: gives the correct output for any input.
• Works for $\mathbf{c} = (5, 3, 1)$ and $M = 8$.
• But what about $\mathbf{c} = (5, 4, 1)$ and $M = 8$?

Efficiency: running time of the algorithm does not increase too rapidly with input size.
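
The greedy idea translates almost line for line into Python. The sketch below assumes the coins are given in decreasing order and returns the vector of coin counts; the function name is an illustrative choice.

```python
def greedy_change(M, coins):
    """Use as many of each coin as possible, from largest to smallest."""
    d = []
    for c in coins:
        d.append(M // c)   # d_i = floor(M / c_i)
        M -= d[-1] * c     # remaining amount
    return d


for coins in ([5, 3, 1], [5, 4, 1]):
    d = greedy_change(8, coins)
    print(coins, d, "coins used:", sum(d))
```

For coins (5, 3, 1) greedy uses 5 + 3, which is optimal. For coins (5, 4, 1) it uses 5 + 1 + 1 + 1 (four coins) although 4 + 4 needs only two, so the greedy algorithm is efficient but not correct in general.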

37
Idea #2: When in doubt, apply brute force...
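Change Problem: Given amount $M \in \mathbb{N} \setminus \{0\}$ and coins $\mathbf{c} = (c_1, \ldots, c_n) \in \mathbb{N}^n$ s.t. $c_n = 1$ and $c_i \ge c_{i+1}$ for all $i \in [n-1] = \{1, \ldots, n-1\}$, find $\mathbf{d} = (d_1, \ldots, d_n) \in \mathbb{N}^n$ s.t. (i) $M = \sum_{i=1}^{n} c_i d_i$ and (ii) $\sum_{i=1}^{n} d_i$ is minimum.

$\mathbf{c}$ = (5 cent, 4 cent, 1 cent)

ExhaustiveChange($M, c_1, \ldots, c_n$)
1. best $\leftarrow$ undefined
2. for $(d_1, \ldots, d_n) \in \{0, \ldots, \lfloor M / c_1 \rfloor\} \times \cdots \times \{0, \ldots, \lfloor M / c_n \rfloor\}$
3.     if $\sum_{i=1}^{n} c_i d_i = M$ and (best is undefined or $\sum_{i=1}^{n} d_i$ is smaller than the number of coins in best)
4.         best $\leftarrow (d_1, \ldots, d_n)$
5. return best

Correct? yes
Efficient? no

• Check all possible solutions:
  • 11 = 5 + 5 + 1
  • 11 = 5 + 4 + 1 + 1
  • 11 = 5 + 1 + 1 + 1 + 1 + 1 + 1
  • 11 = 4 + 4 + 1 + 1 + 1
  • ...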
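
A brute-force search over all candidate coin-count vectors can be sketched in Python with `itertools.product`. This is an illustration under the Change Problem's assumptions (positive coin values, including a 1-cent coin); it keeps the feasible vector with the fewest coins.

```python
from itertools import product


def exhaustive_change(M, coins):
    """Try every vector d with 0 <= d_i <= floor(M / c_i); keep the best."""
    best = None
    ranges = [range(M // c + 1) for c in coins]
    for d in product(*ranges):
        if sum(c * x for c, x in zip(coins, d)) == M:
            if best is None or sum(d) < sum(best):
                best = d
    return best


print(exhaustive_change(11, (5, 4, 1)))
print(exhaustive_change(8, (5, 4, 1)))
```

The number of candidate vectors is the product of the range sizes, which grows roughly like M to the power n, so this approach quickly becomes impractical as M or the number of coins grows.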
38
Idea #3: Recursion
$\mathbf{c}$ = (5 cent, 3 cent, 1 cent)

Value        1  2  3  4  5  6  7  8  9  10  11
Min # coins  ?  ?  ?  ?  ?  ?  ?  ?  ?  ?   ?

[Figure: arrows labeled −1, −3 and −5 point from a value back to the smaller values it depends on]

Optimal substructure:
Optimal solution is obtained from optimal solutions of subproblems
39
Idea #3: Recursion
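$\mathbf{c} = (5, 3, 1)$

Value        1  2  3  4  5  6  7  8  9  10  11
Min # coins  ?  ?  ?  ?  ?  ?  ?  ?  ?  ?   ?

[Figure: arrows labeled −1, −3 and −5 point from a value back to the smaller values it depends on]

• This example can be expressed using a recurrence relation
• Let minNumCoins($M$) be the minimum number of coins to make change for $M$ cents

$$\text{minNumCoins}(M) = \min \begin{cases}
\text{minNumCoins}(M - 1) + 1, \\
\text{minNumCoins}(M - 3) + 1, \\
\text{minNumCoins}(M - 5) + 1.
\end{cases}$$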
40
Idea #3: Recursion
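Change Problem: Given amount $M \in \mathbb{N} \setminus \{0\}$ and coins $\mathbf{c} = (c_1, \ldots, c_n) \in \mathbb{N}^n$ s.t. $c_n = 1$ and $c_i \ge c_{i+1}$ for all $i \in [n-1] = \{1, \ldots, n-1\}$, find $\mathbf{d} = (d_1, \ldots, d_n) \in \mathbb{N}^n$ s.t. (i) $M = \sum_{i=1}^{n} c_i d_i$ and (ii) $\sum_{i=1}^{n} d_i$ is minimum.

$$\text{minNumCoins}(M) = \min \begin{cases}
\text{minNumCoins}(M - c_1) + 1, \\
\text{minNumCoins}(M - c_2) + 1, \\
\quad \vdots \\
\text{minNumCoins}(M - c_n) + 1.
\end{cases}$$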

41
Idea #3: Recursion
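Given coins $\mathbf{c} = (1, 3, 7)$ and amount $M = 77$, find $\mathbf{d} = (d_1, \ldots, d_n) \in \mathbb{N}^n$ such that: (i) $M = \sum_{i=1}^{n} c_i d_i$ and (ii) $\sum_{i=1}^{n} d_i$ is minimum.

$$\text{minNumCoins}(77) = \min \begin{cases}
\text{minNumCoins}(77 - 1) + 1, \\
\text{minNumCoins}(77 - 3) + 1, \\
\text{minNumCoins}(77 - 7) + 1,
\end{cases}$$

$$\text{minNumCoins}(76) = \min \begin{cases}
\text{minNumCoins}(76 - 1) + 1, \\
\text{minNumCoins}(76 - 3) + 1, \\
\text{minNumCoins}(76 - 7) + 1,
\end{cases}$$

$$\vdots$$

$$\text{minNumCoins}(7) = 1, \quad \text{minNumCoins}(3) = 1, \quad \text{minNumCoins}(1) = 1$$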
42
Idea #3: Recursion
RecursiveChange($M, c_1, \ldots, c_n$)
1. if $M = 0$
2.     return 0
3. bestNumCoins $\leftarrow \infty$
4. for $i \leftarrow 1$ to $n$
5.     if $M \ge c_i$
6.         numCoins $\leftarrow$ RecursiveChange($M - c_i, c_1, \ldots, c_n$)
7.         if numCoins + 1 < bestNumCoins
8.             bestNumCoins $\leftarrow$ numCoins + 1
9. return bestNumCoins

Correct but inefficient: Same subproblem is solved many times!

Solutions:
• Remember previously computed values: memoization
• Bottom up computation: dynamic programming
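
The recursion and its memoized variant can be compared side by side in Python. This is a minimal sketch: `recursive_change` follows the RecursiveChange pseudocode, while `memoized_change` is one possible way to remember computed values, here using `functools.lru_cache` (an implementation choice not specified in the slides). Both assume a 1-cent coin is available.

```python
from functools import lru_cache


def recursive_change(M, coins):
    """Plain recursion: re-solves the same amounts many times."""
    if M == 0:
        return 0
    best_num_coins = float("inf")
    for c in coins:
        if M >= c:
            num_coins = recursive_change(M - c, coins)
            if num_coins + 1 < best_num_coins:
                best_num_coins = num_coins + 1
    return best_num_coins


def memoized_change(M, coins):
    """Same recurrence, but each amount is solved only once."""
    coins = tuple(coins)

    @lru_cache(maxsize=None)
    def solve(m):
        if m == 0:
            return 0
        return min(solve(m - c) + 1 for c in coins if m >= c)

    return solve(M)


print(recursive_change(11, (5, 3, 1)))
print(memoized_change(77, (1, 3, 7)))
```

With memoization there are at most M + 1 distinct subproblems, each taking time proportional to the number of coins, whereas the plain recursion explores a call tree that grows exponentially in M.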
44
Idea #4: Solve recurrence with dynamic programming
Fill in table “bottom up”: from smallest to largest.

$\mathbf{c} = (5, 3, 1)$

Value        1  2  3  4  5  6  7  8  9  10  11
Min # coins  1     1     1

$$\text{minNumCoins}(M) = \min \begin{cases}
\text{minNumCoins}(M - 1) + 1, \\
\text{minNumCoins}(M - 3) + 1, \\
\text{minNumCoins}(M - 5) + 1.
\end{cases}$$

Only one coin is needed to make change for the values 1, 3 and 5
45
Idea #4: Solve recurrence with dynamic programming
Fill in table “bottom up”: from smallest to largest.

$\mathbf{c} = (5, 3, 1)$

Value        1  2  3  4  5  6  7  8  9  10  11
Min # coins  1  2  1  2  1  2

$$\text{minNumCoins}(M) = \min \begin{cases}
\text{minNumCoins}(M - 1) + 1, \\
\text{minNumCoins}(M - 3) + 1, \\
\text{minNumCoins}(M - 5) + 1.
\end{cases}$$

Two coins are needed to make change for the values 2, 4 and 6
46
Idea #4: Solve recurrence with dynamic programming
Fill in table “bottom up”: from smallest to largest.

$\mathbf{c} = (5, 3, 1)$

Value        1  2  3  4  5  6  7  8  9  10  11
Min # coins  1  2  1  2  1  2  3

$$\text{minNumCoins}(M) = \min \begin{cases}
\text{minNumCoins}(M - 1) + 1, \\
\text{minNumCoins}(M - 3) + 1, \\
\text{minNumCoins}(M - 5) + 1.
\end{cases}$$

Three coins are needed to make change for the value 7


47
Idea #4: Solve recurrence with dynamic programming
Fill in table “bottom up”: from smallest to largest.

$\mathbf{c} = (5, 3, 1)$

Value        1  2  3  4  5  6  7  8  9  10  11
Min # coins  1  2  1  2  1  2  3  2

$$\text{minNumCoins}(M) = \min \begin{cases}
\text{minNumCoins}(M - 1) + 1, \\
\text{minNumCoins}(M - 3) + 1, \\
\text{minNumCoins}(M - 5) + 1.
\end{cases}$$

Optimal substructure: Optimal solution obtained from optimal subsolutions


48
Idea #4: Solve recurrence with dynamic programming
Fill in table “bottom up”: from smallest to largest.

$\mathbf{c} = (5, 3, 1)$

Value        1  2  3  4  5  6  7  8  9  10  11
Min # coins  1  2  1  2  1  2  3  2  3  2   3

$$\text{minNumCoins}(M) = \min \begin{cases}
\text{minNumCoins}(M - 1) + 1, \\
\text{minNumCoins}(M - 3) + 1, \\
\text{minNumCoins}(M - 5) + 1.
\end{cases}$$

Optimal substructure: Optimal solution obtained from optimal subsolutions


49
Idea #4: Solve recurrence with dynamic programming
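Change Problem: Given amount $M \in \mathbb{N} \setminus \{0\}$ and coins $\mathbf{c} = (c_1, \ldots, c_n) \in \mathbb{N}^n$ s.t. $c_n = 1$ and $c_i \ge c_{i+1}$ for all $i \in [n-1] = \{1, \ldots, n-1\}$, find $\mathbf{d} = (d_1, \ldots, d_n) \in \mathbb{N}^n$ s.t. (i) $M = \sum_{i=1}^{n} c_i d_i$ and (ii) $\sum_{i=1}^{n} d_i$ is minimum.

DPChange($M, c_1, \ldots, c_n$)
1. for $m \leftarrow 1$ to $M$
2.     minNumCoins[$m$] $\leftarrow \infty$
3. for $i \leftarrow 1$ to $n$
4.     minNumCoins[$c_i$] $\leftarrow 1$
5. for $m \leftarrow 1$ to $M$
6.     for $i \leftarrow 1$ to $n$
7.         if $m > c_i$
8.             minNumCoins[$m$] $\leftarrow \min(1 + \text{minNumCoins}[m - c_i], \text{minNumCoins}[m])$
9. return minNumCoins[$M$]

Correct? yes
Efficient? yes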
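
A bottom-up version in Python might look as follows. It is a sketch that deviates slightly from the DPChange pseudocode: instead of initialising minNumCoins[c_i] to 1, it uses the base case "0 coins for amount 0", which also avoids indexing past the table when a coin is larger than M.

```python
def dp_change(M, coins):
    """Fill the table of minimum coin counts from amount 0 up to M."""
    INF = float("inf")
    min_num_coins = [0] + [INF] * M   # base case: amount 0 needs 0 coins
    for m in range(1, M + 1):
        for c in coins:
            if m >= c:
                min_num_coins[m] = min(
                    min_num_coins[m], min_num_coins[m - c] + 1
                )
    return min_num_coins[M]


# Reproduces the table for c = (5, 3, 1) and values 1..11.
print([dp_change(m, (5, 3, 1)) for m in range(1, 12)])
```

The two nested loops give a running time proportional to M times the number of coins.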
50
Different algorithm techniques
Change Problem: Given amount $M \in \mathbb{N} \setminus \{0\}$ and coins $\mathbf{c} = (c_1, \ldots, c_n) \in \mathbb{N}^n$ s.t. $c_n = 1$ and $c_i \ge c_{i+1}$ for all $i \in [n-1] = \{1, \ldots, n-1\}$, find $\mathbf{d} = (d_1, \ldots, d_n) \in \mathbb{N}^n$ s.t. (i) $M = \sum_{i=1}^{n} c_i d_i$ and (ii) $\sum_{i=1}^{n} d_i$ is minimum.

Technique                                    Correct?   Efficient?
Greedy algorithm [GreedyChange]              no         yes
Exhaustive enumeration [ExhaustiveChange]    yes        no
Recursive algorithm [RecursiveChange]        yes        no
Dynamic programming [DPChange]               yes        yes
51
Summary
• DNA, RNA and proteins are sequences
• Central dogma of molecular biology: DNA -> RNA -> protein

• Problem != algorithm

• Different algorithm techniques


• Greedy
• Exhaustive search/brute force
• Recursive algorithm
• Dynamic programming algorithm

• Reading:
• “Biology for Computer Scientists” by Lawrence Hunter
(http://www.el-kebir.net/teaching/CS466/Hunter_BIO_CS.pdf)
• Jones and Pevzner: Chapters 2.1, 2.3, 2.4, 6.2
52
Sources
• CS 362 by Layla Oesper (Carleton College)
• CS 1810 by Ben Raphael (Brown/Princeton University)
• An Introduction to Bioinformatics Algorithms book (Jones and Pevzner)

53
