0% found this document useful (0 votes)
7 views

Lecture 5

Uploaded by

chunfeng277
Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
7 views

Lecture 5

Uploaded by

chunfeng277
Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 42

CS 466

Introduction to Bioinformatics
Lecture 5
Mohammed El-Kebir
September 13, 2024
Course Announcements
Instructor:
• Mohammed El-Kebir (melkebir) HW1: To be released on 09/13
Office hours: Wednesdays, 3:30-4:30pm in and due 09/21 at 11:59pm
Siebel 3216

Piazza / Gradescope: Midterm:


• https://piazza.com/illinois/fall2024/cs466 week of 10/09 – 10/11,
• Gradescope entry code: ‘VDEE52’ details to follow

TA:
• Nicole (nsdong2): Mondays 4-5pm in CS Tutoring Center (Siebel basement)
• Claire (czchou2): Tuesdays 2-3pm in CS Tutoring Center (Siebel basement)
• Mrinmoy (mroddur2): Thursdays 2-3pm in Siebel 2219
2
Global, Fitting and Local Alignment
Global Alignment problem: Given strings 𝐯 ∈ Σ !
and 𝐰 ∈ Σ " and scoring function 𝛿, find alignment
of 𝐯 and 𝐰 with maximum score. 𝐯\𝐰 0 T A C G G C
[Needleman-Wunsch algorithm] 0

Fitting Alignment problem: Given strings 𝐯 ∈ Σ ! A


and 𝐰 ∈ Σ " and scoring function 𝛿, find an
alignment of 𝐯 and a substring of 𝐰 with maximum G
global alignment score 𝑠 ∗ among all global
alignments of 𝐯 and all substrings of 𝐰 G

Local Alignment problem: Given strings 𝐯 ∈ Σ !


and 𝐰 ∈ Σ " and scoring function 𝛿, find a substring
of 𝐯 and a substring of 𝐰 whose alignment has Question: How to assess
maximum global alignment score 𝑠 ∗ among all resulting algorithms?
global alignments of all substrings of 𝐯 and 𝐰
[Smith-Waterman algorithm]
3
Time Complexity
Edit graph is a weighed, directed grid
graph 𝐺 = (𝑉, 𝐸) with source vertex
A T C G
(0, 0) and target vertex (𝑚, 𝑛). Each
W 0 1 2 3 n=4 edge ((𝑖, 𝑗), (𝑘, 𝑙)) has weight
V
depending on direction.
0 O O O O O

A 1 O O O O O Alignment is a path from source (0, 0)


to target (𝑚, 𝑛) in edit graph
T 2 O O O O O

G 3 O O O O O Running time is 𝑂(𝑚𝑛)


T [quadratic time]
m=4 O O O O O

4
Time Complexity
Edit graph is a weighed, directed grid
graph 𝐺 = (𝑉, 𝐸) with source vertex
A T C G
(0, 0) and target vertex (𝑚, 𝑛). Each
W 0 1 2 3 n=4 edge ((𝑖, 𝑗), (𝑘, 𝑙)) has weight
V
depending on direction.
0 O O O O O

A 1 O O O O O Alignment is a path from source (0, 0)


to target (𝑚, 𝑛) in edit graph
T 2 O O O O O

G 3 O O O O O Running time is 𝑂(𝑚𝑛)


T [quadratic time]
m=4 O O O O O

Question: Compute alignment faster


than 𝑂(𝑚𝑛) time? [subquadratic time]
5
Space Complexity
Size of DP table is 𝑚 + 1 × (𝑛 + 1)
A T C G
Thus, space complexity is 𝑂(𝑚𝑛)
W 0 1 2 3 n=4
V [quadratic space]
0
Example:
A 1 To align a short read (𝑚 = 100) to
human genome (𝑛 = 3 5 10!), we need
T 2
300 GB memory.
G 3

T m=4

6
Space Complexity
Size of DP table is 𝑚 + 1 × (𝑛 + 1)
A T C G
Thus, space complexity is 𝑂(𝑚𝑛)
W 0 1 2 3 n=4
V [quadratic space]
0
Example:
A 1 To align a short read (𝑚 = 100) to
human genome (𝑛 = 3 5 10!), we need
T 2
300 GB memory.
G 3

T Question: How long is an alignment?


m=4

7
Space Complexity
Size of DP table is 𝑚 + 1 × (𝑛 + 1)
A T C G
Thus, space complexity is 𝑂(𝑚𝑛)
W 0 1 2 3 n=4
V [quadratic space]
0
Example:
A 1 To align a short read (𝑚 = 100) to
human genome (𝑛 = 3 5 10!), we need
T 2
300 GB memory.
G 3

T Question: How long is an alignment?


m=4

Question: Compute alignment in 𝑂(𝑚)


space? [linear space]
8
Outline
1. Recap of global, fitting, local and gapped alignment
2. Space-efficient alignment
3. Subquadratic time alignment

Reading:
• Jones and Pevzner. Chapters 7.1-7.4
• Lecture notes

9
Space Efficient Alignment
0 𝑗 𝑛
Computing 𝑠[𝑖, 𝑗] requires access to: 0
𝑠 𝑖 − 1, 𝑗 , 𝑠[𝑖, 𝑗 − 1] and 𝑠[𝑖 − 1, 𝑗 − 1]
8
>
> 0, if i = 0 and j = 0,
>
<s[i 1, j] + (v , ),
i if i > 0,
s[i, j] = max
>
> s[i, j 1] + ( , wj ), if j > 0,
>
:
s[i 1, j 1] + (vi , wj ), if i > 0 and j > 0.
𝑖

10
Space Efficient Alignment
0 𝑗 𝑛
Computing 𝑠[𝑖, 𝑗] requires access to: 0
𝑠 𝑖 − 1, 𝑗 , 𝑠[𝑖, 𝑗 − 1] and 𝑠[𝑖 − 1, 𝑗 − 1]
8
>
> 0, if i = 0 and j = 0,
>
<s[i 1, j] + (v , ),
i if i > 0,
s[i, j] = max
>
> s[i, j 1] + ( , wj ), if j > 0,
>
:
s[i 1, j 1] + (vi , wj ), if i > 0 and j > 0.
𝑖
Thus it suffices to store only two columns to
compute optimal alignment score 𝑠 𝑚, 𝑛 ,
i.e., 2 𝑚 + 1 = 𝑂(𝑚) space.

11
Space Efficient Alignment
0 𝑗 𝑛
Computing 𝑠[𝑖, 𝑗] requires access to: 0
𝑠 𝑖 − 1, 𝑗 , 𝑠[𝑖, 𝑗 − 1] and 𝑠[𝑖 − 1, 𝑗 − 1]
8
>
> 0, if i = 0 and j = 0,
>
<s[i 1, j] + (v , ),
i if i > 0,
s[i, j] = max
>
> s[i, j 1] + ( , wj ), if j > 0,
>
:
s[i 1, j 1] + (vi , wj ), if i > 0 and j > 0.
𝑖
Thus it suffices to store only two columns to
compute optimal alignment score 𝑠 𝑚, 𝑛 ,
i.e., 2 𝑚 + 1 = 𝑂(𝑚) space.

Question: What if we want alignment itself? 𝑚

12
Space Efficient Alignment – First Attempt
• What if also want optimal alignment?
• Easy: keep best pointers as fill in table.
• No! Do not know which path to keep until computing
recurrence at each step.

w w
v v
Space Efficient Alignment – First Attempt
• What if also want optimal alignment?
• Easy: keep best pointers as fill in table.
• No! Do not know which path to keep until computing
recurrence at each step.

w w
v v
Space Efficient Alignment – First Attempt
• What if also want optimal alignment?
• Easy: keep best pointers as fill in table.
• No! Do not know which path to keep until computing
recurrence at each step.

w w
v v

Best score for column


might not be part of
best alignment!
Space Efficient Alignment – Second Attempt
𝑛/2
Alignment is a path from
source (0, 0) to target (𝑚, 𝑛)
in edit graph

𝑖∗
Maximum weight path from
(0,0) to (𝑚, 𝑛) passes
through (𝑖 ∗ , 𝑛/2) 𝑖

Question: What is 𝑖 ∗ ? 𝑚
16
Space Efficient Alignment – Second Attempt
𝑛/2

𝑖∗

𝑚
17
Space Efficient Alignment – Second Attempt
𝑛/2

𝑖∗

𝑚
18
Hirschberg(𝑖, 𝑗, 𝑖 " , 𝑗′)
1. if 𝑗 " − 𝑗 > 1
2. 𝑖 ∗ ß arg max wt(𝑖′′)
$%$ !! %$ !
& ! '&
3. Report (𝑖 ∗ , 𝑗 + ( )
& ! '&
4. Hirschberg(𝑖, 𝑗, 𝑖 ∗ , 𝑗 + ( )
& ! '& "
5. Hirschberg(𝑖 ∗ , 𝑗 + , 𝑖 , 𝑗′)
(

19
Hirschberg(𝑖, 𝑗, 𝑖 " , 𝑗′)
1. if 𝑗 " − 𝑗 > 1
2. 𝑖 ∗ ß arg max wt(𝑖′′)
$%$ !! %$ !
& ! '&
3. Report (𝑖 ∗ , 𝑗 + ( )
& ! '&
4. Hirschberg(𝑖, 𝑗, 𝑖 ∗ , 𝑗 + ( )
& ! '& "
5. Hirschberg(𝑖 ∗ , 𝑗 + , 𝑖 , 𝑗′)
(
Time:
area + area/2 + area/4 + …
= area (1 + ½ + ¼ + ⅛ + …)
≤ 2 × area = O(mn)
Space: O(m)

20
Hirschberg(𝑖, 𝑗, 𝑖 " , 𝑗′)
1. if 𝑗 " − 𝑗 > 1
2. 𝑖 ∗ ß arg max wt(𝑖′′)
$%$ !! %$ !
& ! '&
3. Report (𝑖 ∗ , 𝑗 + ( )
& ! '&
4. Hirschberg(𝑖, 𝑗, 𝑖 ∗ , 𝑗 + ( )
& ! '& "
5. Hirschberg(𝑖 ∗ , 𝑗 + , 𝑖 , 𝑗′)
(
Time:
area + area/2 + area/4 + …
= area (1 + ½ + ¼ + ⅛ + …)
≤ 2 × area = O(mn)
Space: O(m)
Question: How to reconstruct
alignment from reported vertices?
21
Hirschberg Algorithm: Reversing Edges Necessary?
Max weight path from (0,0) to (𝑚, 𝑛) through (𝑖 ∗, 𝑛/2)
𝑗

𝑖 ∗ = arg max{ pre3ix 𝑖 + suf3ix 𝑖 }


$%&%!

𝑖∗
Compute pre3ix 𝑖 0 ≤ 𝑖 ≤ 𝑚} in O(𝑚𝑗) time and O(𝑚)
space, by starting from (0,0) to 𝑚, 𝑗 keeping only two
columns in memory. [single-source multiple destinations]
𝑖

22
Hirschberg Algorithm: Reversing Edges Necessary?
Max weight path from (0,0) to (𝑚, 𝑛) through (𝑖 ∗, 𝑛/2)
𝑗

𝑖 ∗ = arg max{ pre3ix 𝑖 + suf3ix 𝑖 }


$%&%!

𝑖∗
Compute pre3ix 𝑖 0 ≤ 𝑖 ≤ 𝑚} in O(𝑚𝑗) time and O(𝑚)
space, by starting from (0,0) to 𝑚, 𝑗 keeping only two
columns in memory. [single-source multiple destinations]
𝑖

Want: Compute suf3ix 𝑖 0 ≤ 𝑖 ≤ 𝑚} in O(𝑚𝑗) time


and O(𝑚) space 𝑚

Doing a longest path from each 𝑖, 𝑗 to 𝑚, 𝑛 (for all 0 ≤ 𝑖 ≤ 𝑚) will not achieve desired running time!

Reversing edges enables single-source multiple destination computation in desired time and space bound!
23
Hirschberg Algorithm: Reconstructing Alignment
A T C G C Hirschberg(𝑖, 𝑗, 𝑖 ' , 𝑗′)
W 0 1 2 3 4 5 1. if 𝑗 ' − 𝑗 > 1
V
2. 𝑖 ∗ ß arg max wt(𝑖)
0 0 -1 -2 -3 -4 -5 $%&%!
( ! )(
3. Report (𝑖 ∗, 𝑗 + )
A 1 -1 1 0 -1 -2 -3 *
( ! )(
4. Hirschberg(𝑖, 𝑗, 𝑖 ∗, 𝑗 + )
T 2 -2 0 2 1 0 -1
*
( ! )( '
5. Hirschberg(𝑖 ∗, 𝑗 + , 𝑖 , 𝑗′)
*
G 3 -3 -1 1 1 2 1
Problem: Given reported vertices and
T 4 -4 -2 0 0 1 1 scores { 𝑖$ , 0, 𝑠$ , … , 𝑖" , 𝑛, 𝑠" },
find intermediary vertices.
C 5 -5 -3 -1 1 0 2
Transposing matrix does not help,
A T - G T C because gaps could occur in both
A T C G - C input sequences (and there might be
multiple opt. alignments) 24
Linear Space Alignment – The Hirschberg Algorithm

25
Outline
1. Recap of global, fitting, local and gapped alignment
2. Space-efficient alignment
3. Subquadratic time alignment

Reading:
• Jones and Pevzner. Chapters 7.1-7.4
• Lecture notes

26
Banded Alignment x1
y1 y2 y3 ... ... ... ... yN

x2
x3
Constraint path to band of width 𝑘
around diagonal
.
.
Running time: O(𝑛𝑘) .

Gives a good approximation of .


.
highly identical sequences .

xM

Constrain traceback to band of DP matrix (penalize big gaps)


Figure source: http://jinome.stanford.edu/stat366/pdfs/stat366_win0607_lecture04.pdf 27
Banded Alignment x1
y1 y2 y3 ... ... ... ... yN

x2
x3
Constraint path to band of width 𝑘
around diagonal
.
.
Running time: O(𝑛𝑘) .

Gives a good approximation of .


.
highly identical sequences .

Question: How to change


xM
recurrence to accomplish this?
Constrain traceback to band of DP matrix (penalize big gaps)
Figure source: http://jinome.stanford.edu/stat366/pdfs/stat366_win0607_lecture04.pdf 28
Block Alignment
Divide input sequences into blocks of length 𝑡
v1, …, vt vt+1, …, v2t … vm-t+1, …, vm

w1, …, wt wt+1, …, w2t … wn-t+1, …, wn

29
Introduction to Bioinformatics Algorithms

Block Alignment
Divide input sequences into blocks of length 𝑡
v1, …, vt vt+1, …, v2t … vm-t+1, …, vm

w1, …, wt wt+1, …, w2t … wn-t+1, …, wn

Require that paths in edit graph pass


through corners of blocks

30
Introduction to Bioinformatics Algorithms

Block Alignment
Divide input sequences into blocks of length 𝑡
v1, …, vt vt+1, …, v2t … vm-t+1, …, vm

w1, …, wt wt+1, …, w2t … wn-t+1, …, wn

Require that paths in edit graph pass


through corners of blocks
8
>
> 0, if i = 0 and j = 0,
>
<s[i 1, j] , if i > 0,
s[i, j] = max
>
> s[i, j 1] , if j > 0,
>
:
s[i 1, j 1] + (i, j), if i > 0 and j > 0.

0 ≤ 𝑖, 𝑗 ≤ 𝑡 and 𝛽(𝑖, 𝑗) is max score alignment


between block 𝑖 of 𝐯 and block 𝑗 of 𝐰 31
Block Alignment – First Attempt: Pre-compute 𝛽 𝑖, 𝑗
0 ≤ 𝑖, 𝑗 ≤ 𝑛/𝑡 and 𝛽(𝑖, 𝑗) is max score alignment between block 𝑖 of 𝐯 and block 𝑗 of 𝐰
t 2t nt
8
>
> 0, if i = 0 and j = 0,
>
<s[i 1, j] , if i > 0,
s[i, j] = max
>
> s[i, j 1] , if j > 0,
>
:
s[i 1, j 1] + (i, j), if i > 0 and j > 0. t

2t

nt

32
Block Alignment – First Attempt: Pre-compute 𝛽 𝑖, 𝑗
0 ≤ 𝑖, 𝑗 ≤ 𝑛/𝑡 and 𝛽(𝑖, 𝑗) is max score alignment between block 𝑖 of 𝐯 and block 𝑗 of 𝐰
t 2t nt
8
>
> 0, if i = 0 and j = 0,
>
<s[i 1, j] , if i > 0,
s[i, j] = max
>
> s[i, j 1] , if j > 0,
>
:
s[i 1, j 1] + (i, j), if i > 0 and j > 0. t

Question:
How much time to compute all 𝛽(𝑖, 𝑗)? 2t

nt

33
Block Alignment – First Attempt: Pre-compute 𝛽 𝑖, 𝑗
0 ≤ 𝑖, 𝑗 ≤ 𝑛/𝑡 and 𝛽(𝑖, 𝑗) is max score alignment between block 𝑖 of 𝐯 and block 𝑗 of 𝐰
t 2t nt
8
>
> 0, if i = 0 and j = 0,
>
<s[i 1, j] , if i > 0,
s[i, j] = max
>
> s[i, j 1] , if j > 0,
>
:
s[i 1, j 1] + (i, j), if i > 0 and j > 0. t

Question:
How much time to compute all 𝛽(𝑖, 𝑗)? 2t

Computing 𝛽 𝑖, 𝑗 takes 𝑂 𝑡 ( time

There are 𝑛/𝑡 × 𝑛/𝑡 values 𝛽 𝑖, 𝑗

) ) nt
Total: 𝑂 × × 𝑡( =𝑂 𝑛( time
* * 34
Block Alignment – Four Russians Technique
t 2t nt
Pre-compute and store all βij
t
Pre-compute and store all max
2t
weight alignments 𝑆[𝐯′, 𝐰′] of all
pairs (𝐯 " , 𝐰 " ) of length t strings

nt

Algorithm:
8
1. Precompute 𝑆[𝐯′, 𝐰′] where >
> 0, if i = 0 and j = 0,
>
𝐯 ", 𝐰 " ∈ Σ* <s[i 1, j] , if i > 0,
s[i, j] = max
2. Compute block alignment >
>
> s[i, j 1] , if j > 0,
:
between 𝐯 and 𝐰 using 𝑆 s[i 1, j 1] + S[v(i), w(j)], if i > 0 and j > 0.
35
Block Alignment – Four Russians Technique
t 2t nt
Pre-compute and store all βij
t
Pre-compute and store all max
2t
weight alignments 𝑆[𝐯′, 𝐰′] of all
pairs (𝐯 " , 𝐰 " ) of length t strings

nt
Question: How to choose 𝑡 for DNA?
Algorithm:
8
1. Precompute 𝑆[𝐯′, 𝐰′] where >
> 0, if i = 0 and j = 0,
>
𝐯 ", 𝐰 " ∈ Σ* <s[i 1, j] , if i > 0,
s[i, j] = max
2. Compute block alignment >
>
> s[i, j 1] , if j > 0,
:
between 𝐯 and 𝐰 using 𝑆 s[i 1, j 1] + S[v(i), w(j)], if i > 0 and j > 0.
36
Block Alignment – Four Russians Technique
Question: How to choose 𝑡 for DNA?

37
Fastest Subquadratic Alignment* Algorithm

Edit distance in
O(𝑛N / log 𝑛) time

Barely subquadratic!

Want: O(𝑛NOP ) time


where 𝜀 > 0

*for edit distance 38


Fastest Subquadratic Alignment* Algorithm

Edit distance in
O(𝑛N / log 𝑛) time

Barely subquadratic!

Want: O(𝑛NOP ) time


where 𝜀 > 0

Question: Is 𝑛NOP in O(𝑛N / log 𝑛) for any 𝜀 > 0?


*for edit distance 39
Hardness Result for Edit Distance [STOC 2015]

40
41
Take Home Messages
1. Global alignment in O(𝑚𝑛) time and O(𝑚) space
• Hirschberg algorithm
2. Block alignment can be done in subquadratic time
• Four Russians Technique: O(𝑛N / log 𝑛) time
3. Global alignment cannot be done in O(𝑛#$% ) time under SETH

Reading:
• Jones and Pevzner. Chapters 7.1-7.4
• Lecture notes

42

You might also like