Lecture 5
Lecture 5
Introduction to Bioinformatics
Lecture 5
Mohammed El-Kebir
September 13, 2024
Course Announcements
Instructor:
• Mohammed El-Kebir (melkebir) HW1: To be released on 09/13
Office hours: Wednesdays, 3:30-4:30pm in and due 09/21 at 11:59pm
Siebel 3216
TA:
• Nicole (nsdong2): Mondays 4-5pm in CS Tutoring Center (Siebel basement)
• Claire (czchou2): Tuesdays 2-3pm in CS Tutoring Center (Siebel basement)
• Mrinmoy (mroddur2): Thursdays 2-3pm in Siebel 2219
2
Global, Fitting and Local Alignment
Global Alignment problem: Given strings 𝐯 ∈ Σ !
and 𝐰 ∈ Σ " and scoring function 𝛿, find alignment
of 𝐯 and 𝐰 with maximum score. 𝐯\𝐰 0 T A C G G C
[Needleman-Wunsch algorithm] 0
4
Time Complexity
Edit graph is a weighed, directed grid
graph 𝐺 = (𝑉, 𝐸) with source vertex
A T C G
(0, 0) and target vertex (𝑚, 𝑛). Each
W 0 1 2 3 n=4 edge ((𝑖, 𝑗), (𝑘, 𝑙)) has weight
V
depending on direction.
0 O O O O O
T m=4
6
Space Complexity
Size of DP table is 𝑚 + 1 × (𝑛 + 1)
A T C G
Thus, space complexity is 𝑂(𝑚𝑛)
W 0 1 2 3 n=4
V [quadratic space]
0
Example:
A 1 To align a short read (𝑚 = 100) to
human genome (𝑛 = 3 5 10!), we need
T 2
300 GB memory.
G 3
7
Space Complexity
Size of DP table is 𝑚 + 1 × (𝑛 + 1)
A T C G
Thus, space complexity is 𝑂(𝑚𝑛)
W 0 1 2 3 n=4
V [quadratic space]
0
Example:
A 1 To align a short read (𝑚 = 100) to
human genome (𝑛 = 3 5 10!), we need
T 2
300 GB memory.
G 3
Reading:
• Jones and Pevzner. Chapters 7.1-7.4
• Lecture notes
9
Space Efficient Alignment
0 𝑗 𝑛
Computing 𝑠[𝑖, 𝑗] requires access to: 0
𝑠 𝑖 − 1, 𝑗 , 𝑠[𝑖, 𝑗 − 1] and 𝑠[𝑖 − 1, 𝑗 − 1]
8
>
> 0, if i = 0 and j = 0,
>
<s[i 1, j] + (v , ),
i if i > 0,
s[i, j] = max
>
> s[i, j 1] + ( , wj ), if j > 0,
>
:
s[i 1, j 1] + (vi , wj ), if i > 0 and j > 0.
𝑖
10
Space Efficient Alignment
0 𝑗 𝑛
Computing 𝑠[𝑖, 𝑗] requires access to: 0
𝑠 𝑖 − 1, 𝑗 , 𝑠[𝑖, 𝑗 − 1] and 𝑠[𝑖 − 1, 𝑗 − 1]
8
>
> 0, if i = 0 and j = 0,
>
<s[i 1, j] + (v , ),
i if i > 0,
s[i, j] = max
>
> s[i, j 1] + ( , wj ), if j > 0,
>
:
s[i 1, j 1] + (vi , wj ), if i > 0 and j > 0.
𝑖
Thus it suffices to store only two columns to
compute optimal alignment score 𝑠 𝑚, 𝑛 ,
i.e., 2 𝑚 + 1 = 𝑂(𝑚) space.
11
Space Efficient Alignment
0 𝑗 𝑛
Computing 𝑠[𝑖, 𝑗] requires access to: 0
𝑠 𝑖 − 1, 𝑗 , 𝑠[𝑖, 𝑗 − 1] and 𝑠[𝑖 − 1, 𝑗 − 1]
8
>
> 0, if i = 0 and j = 0,
>
<s[i 1, j] + (v , ),
i if i > 0,
s[i, j] = max
>
> s[i, j 1] + ( , wj ), if j > 0,
>
:
s[i 1, j 1] + (vi , wj ), if i > 0 and j > 0.
𝑖
Thus it suffices to store only two columns to
compute optimal alignment score 𝑠 𝑚, 𝑛 ,
i.e., 2 𝑚 + 1 = 𝑂(𝑚) space.
12
Space Efficient Alignment – First Attempt
• What if also want optimal alignment?
• Easy: keep best pointers as fill in table.
• No! Do not know which path to keep until computing
recurrence at each step.
w w
v v
Space Efficient Alignment – First Attempt
• What if also want optimal alignment?
• Easy: keep best pointers as fill in table.
• No! Do not know which path to keep until computing
recurrence at each step.
w w
v v
Space Efficient Alignment – First Attempt
• What if also want optimal alignment?
• Easy: keep best pointers as fill in table.
• No! Do not know which path to keep until computing
recurrence at each step.
w w
v v
𝑖∗
Maximum weight path from
(0,0) to (𝑚, 𝑛) passes
through (𝑖 ∗ , 𝑛/2) 𝑖
Question: What is 𝑖 ∗ ? 𝑚
16
Space Efficient Alignment – Second Attempt
𝑛/2
𝑖∗
𝑚
17
Space Efficient Alignment – Second Attempt
𝑛/2
𝑖∗
𝑚
18
Hirschberg(𝑖, 𝑗, 𝑖 " , 𝑗′)
1. if 𝑗 " − 𝑗 > 1
2. 𝑖 ∗ ß arg max wt(𝑖′′)
$%$ !! %$ !
& ! '&
3. Report (𝑖 ∗ , 𝑗 + ( )
& ! '&
4. Hirschberg(𝑖, 𝑗, 𝑖 ∗ , 𝑗 + ( )
& ! '& "
5. Hirschberg(𝑖 ∗ , 𝑗 + , 𝑖 , 𝑗′)
(
19
Hirschberg(𝑖, 𝑗, 𝑖 " , 𝑗′)
1. if 𝑗 " − 𝑗 > 1
2. 𝑖 ∗ ß arg max wt(𝑖′′)
$%$ !! %$ !
& ! '&
3. Report (𝑖 ∗ , 𝑗 + ( )
& ! '&
4. Hirschberg(𝑖, 𝑗, 𝑖 ∗ , 𝑗 + ( )
& ! '& "
5. Hirschberg(𝑖 ∗ , 𝑗 + , 𝑖 , 𝑗′)
(
Time:
area + area/2 + area/4 + …
= area (1 + ½ + ¼ + ⅛ + …)
≤ 2 × area = O(mn)
Space: O(m)
20
Hirschberg(𝑖, 𝑗, 𝑖 " , 𝑗′)
1. if 𝑗 " − 𝑗 > 1
2. 𝑖 ∗ ß arg max wt(𝑖′′)
$%$ !! %$ !
& ! '&
3. Report (𝑖 ∗ , 𝑗 + ( )
& ! '&
4. Hirschberg(𝑖, 𝑗, 𝑖 ∗ , 𝑗 + ( )
& ! '& "
5. Hirschberg(𝑖 ∗ , 𝑗 + , 𝑖 , 𝑗′)
(
Time:
area + area/2 + area/4 + …
= area (1 + ½ + ¼ + ⅛ + …)
≤ 2 × area = O(mn)
Space: O(m)
Question: How to reconstruct
alignment from reported vertices?
21
Hirschberg Algorithm: Reversing Edges Necessary?
Max weight path from (0,0) to (𝑚, 𝑛) through (𝑖 ∗, 𝑛/2)
𝑗
𝑖∗
Compute pre3ix 𝑖 0 ≤ 𝑖 ≤ 𝑚} in O(𝑚𝑗) time and O(𝑚)
space, by starting from (0,0) to 𝑚, 𝑗 keeping only two
columns in memory. [single-source multiple destinations]
𝑖
22
Hirschberg Algorithm: Reversing Edges Necessary?
Max weight path from (0,0) to (𝑚, 𝑛) through (𝑖 ∗, 𝑛/2)
𝑗
𝑖∗
Compute pre3ix 𝑖 0 ≤ 𝑖 ≤ 𝑚} in O(𝑚𝑗) time and O(𝑚)
space, by starting from (0,0) to 𝑚, 𝑗 keeping only two
columns in memory. [single-source multiple destinations]
𝑖
Doing a longest path from each 𝑖, 𝑗 to 𝑚, 𝑛 (for all 0 ≤ 𝑖 ≤ 𝑚) will not achieve desired running time!
Reversing edges enables single-source multiple destination computation in desired time and space bound!
23
Hirschberg Algorithm: Reconstructing Alignment
A T C G C Hirschberg(𝑖, 𝑗, 𝑖 ' , 𝑗′)
W 0 1 2 3 4 5 1. if 𝑗 ' − 𝑗 > 1
V
2. 𝑖 ∗ ß arg max wt(𝑖)
0 0 -1 -2 -3 -4 -5 $%&%!
( ! )(
3. Report (𝑖 ∗, 𝑗 + )
A 1 -1 1 0 -1 -2 -3 *
( ! )(
4. Hirschberg(𝑖, 𝑗, 𝑖 ∗, 𝑗 + )
T 2 -2 0 2 1 0 -1
*
( ! )( '
5. Hirschberg(𝑖 ∗, 𝑗 + , 𝑖 , 𝑗′)
*
G 3 -3 -1 1 1 2 1
Problem: Given reported vertices and
T 4 -4 -2 0 0 1 1 scores { 𝑖$ , 0, 𝑠$ , … , 𝑖" , 𝑛, 𝑠" },
find intermediary vertices.
C 5 -5 -3 -1 1 0 2
Transposing matrix does not help,
A T - G T C because gaps could occur in both
A T C G - C input sequences (and there might be
multiple opt. alignments) 24
Linear Space Alignment – The Hirschberg Algorithm
25
Outline
1. Recap of global, fitting, local and gapped alignment
2. Space-efficient alignment
3. Subquadratic time alignment
Reading:
• Jones and Pevzner. Chapters 7.1-7.4
• Lecture notes
26
Banded Alignment x1
y1 y2 y3 ... ... ... ... yN
x2
x3
Constraint path to band of width 𝑘
around diagonal
.
.
Running time: O(𝑛𝑘) .
xM
x2
x3
Constraint path to band of width 𝑘
around diagonal
.
.
Running time: O(𝑛𝑘) .
29
Introduction to Bioinformatics Algorithms
Block Alignment
Divide input sequences into blocks of length 𝑡
v1, …, vt vt+1, …, v2t … vm-t+1, …, vm
30
Introduction to Bioinformatics Algorithms
Block Alignment
Divide input sequences into blocks of length 𝑡
v1, …, vt vt+1, …, v2t … vm-t+1, …, vm
2t
nt
32
Block Alignment – First Attempt: Pre-compute 𝛽 𝑖, 𝑗
0 ≤ 𝑖, 𝑗 ≤ 𝑛/𝑡 and 𝛽(𝑖, 𝑗) is max score alignment between block 𝑖 of 𝐯 and block 𝑗 of 𝐰
t 2t nt
8
>
> 0, if i = 0 and j = 0,
>
<s[i 1, j] , if i > 0,
s[i, j] = max
>
> s[i, j 1] , if j > 0,
>
:
s[i 1, j 1] + (i, j), if i > 0 and j > 0. t
Question:
How much time to compute all 𝛽(𝑖, 𝑗)? 2t
nt
33
Block Alignment – First Attempt: Pre-compute 𝛽 𝑖, 𝑗
0 ≤ 𝑖, 𝑗 ≤ 𝑛/𝑡 and 𝛽(𝑖, 𝑗) is max score alignment between block 𝑖 of 𝐯 and block 𝑗 of 𝐰
t 2t nt
8
>
> 0, if i = 0 and j = 0,
>
<s[i 1, j] , if i > 0,
s[i, j] = max
>
> s[i, j 1] , if j > 0,
>
:
s[i 1, j 1] + (i, j), if i > 0 and j > 0. t
Question:
How much time to compute all 𝛽(𝑖, 𝑗)? 2t
) ) nt
Total: 𝑂 × × 𝑡( =𝑂 𝑛( time
* * 34
Block Alignment – Four Russians Technique
t 2t nt
Pre-compute and store all βij
t
Pre-compute and store all max
2t
weight alignments 𝑆[𝐯′, 𝐰′] of all
pairs (𝐯 " , 𝐰 " ) of length t strings
nt
Algorithm:
8
1. Precompute 𝑆[𝐯′, 𝐰′] where >
> 0, if i = 0 and j = 0,
>
𝐯 ", 𝐰 " ∈ Σ* <s[i 1, j] , if i > 0,
s[i, j] = max
2. Compute block alignment >
>
> s[i, j 1] , if j > 0,
:
between 𝐯 and 𝐰 using 𝑆 s[i 1, j 1] + S[v(i), w(j)], if i > 0 and j > 0.
35
Block Alignment – Four Russians Technique
t 2t nt
Pre-compute and store all βij
t
Pre-compute and store all max
2t
weight alignments 𝑆[𝐯′, 𝐰′] of all
pairs (𝐯 " , 𝐰 " ) of length t strings
nt
Question: How to choose 𝑡 for DNA?
Algorithm:
8
1. Precompute 𝑆[𝐯′, 𝐰′] where >
> 0, if i = 0 and j = 0,
>
𝐯 ", 𝐰 " ∈ Σ* <s[i 1, j] , if i > 0,
s[i, j] = max
2. Compute block alignment >
>
> s[i, j 1] , if j > 0,
:
between 𝐯 and 𝐰 using 𝑆 s[i 1, j 1] + S[v(i), w(j)], if i > 0 and j > 0.
36
Block Alignment – Four Russians Technique
Question: How to choose 𝑡 for DNA?
37
Fastest Subquadratic Alignment* Algorithm
Edit distance in
O(𝑛N / log 𝑛) time
Barely subquadratic!
Edit distance in
O(𝑛N / log 𝑛) time
Barely subquadratic!
40
41
Take Home Messages
1. Global alignment in O(𝑚𝑛) time and O(𝑚) space
• Hirschberg algorithm
2. Block alignment can be done in subquadratic time
• Four Russians Technique: O(𝑛N / log 𝑛) time
3. Global alignment cannot be done in O(𝑛#$% ) time under SETH
Reading:
• Jones and Pevzner. Chapters 7.1-7.4
• Lecture notes
42