lecture1-2
comparison
Some books
─ protein analysis
! Main assumptions:
─ similar sequences correspond to similar biological functions
─ similar sequences are evidence of phylogenetic proximity
[Figure: pairwise sequence comparisons: elephant vs hamster, elephant vs whale, elephant vs alligator]
Another example
! scoring function:
Score=19, Score=-11, Score=25
Sequence alignment: scoring
-AGGCTCACCTGACT-CCAGGC-CGA--TGCC---
 || |||||  |||  |  ||  |||  ||||
TAG-CTCAC--GAC-GC--GG-TCGATTTGCCGAC
LCS: Longest Common Subsequence
-AGGCTCACCTGACT-CCAGGC-CGA--TGCC---
 || |||||  |||  |  ||  |||  ||||
TAG-CTCAC--GAC-GC--GG-TCGATTTGCCGAC

-AGGCTCACCTGACTCCAGGCCGA--TGCC---
 || |||||  ||| |  || |||  ||||
TAG-CTCAC--GACGC--GGTCGATTTGCCGAC
RefPos:     1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19
Reference:  C  C  A  T  A  C  T  G  A  A  C  T  G  A  C  T  A  A  C
Read:       A  C  T  A  G  A  A  T  G  G  C  T
POS:        5
CIGAR:      3M1I3M1D5M
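To make the POS/CIGAR notation concrete, here is a minimal Python sketch that expands a CIGAR string into the two rows of the alignment it encodes. The function name `render_cigar` is hypothetical, and only the M, I and D operations that occur in the example are handled.

```python
import re


def render_cigar(reference, read, pos, cigar):
    """Return (reference_row, read_row) for a read placed at 1-based `pos`.

    Illustrative helper: only the M, I and D operations are supported.
    """
    ref_row, read_row = [], []
    r = pos - 1  # 0-based cursor in the reference
    q = 0        # 0-based cursor in the read
    for length, op in re.findall(r"(\d+)([MID])", cigar):
        n = int(length)
        if op == "M":    # match or mismatch: consumes both sequences
            ref_row.append(reference[r:r + n])
            read_row.append(read[q:q + n])
            r += n
            q += n
        elif op == "I":  # insertion: bases present only in the read
            ref_row.append("-" * n)
            read_row.append(read[q:q + n])
            q += n
        else:            # "D", deletion: reference bases absent from the read
            ref_row.append(reference[r:r + n])
            read_row.append("-" * n)
            r += n
    return "".join(ref_row), "".join(read_row)


reference = "CCATACTGAACTGACTAAC"
read = "ACTAGAATGGCT"
ref_row, read_row = render_cigar(reference, read, 5, "3M1I3M1D5M")
print(ref_row)   # ACT-GAACTGACT
print(read_row)  # ACTAGAA-TGGCT
```

Note that M covers both matches and mismatches: the last block aligns TGGCT in the read with TGACT in the reference.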
Computing Score(S,T)
For 1 ≤ i ≤ n and 1 ≤ j ≤ m, with n=|S|, m=|T|, strings indexed from 1 and d>0 the gap penalty:
Score[i,j] = max { Score[i-1,j-1] + s(S[i],T[j]),
                   Score[i-1,j] - d,
                   Score[i,j-1] - d }
! initialization: Score[0,0]=0, Score[0,j]=-jd, Score[i,0]=-id
! resulting score: Score[n,m]
! Dynamic Programming!
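The recurrence and its initialization can be sketched in a few lines of Python. The function name `global_alignment_table` is hypothetical. The default scores are the ones used in the worked example of these slides, with the gap penalty passed as a positive d=2 so that each indel contributes -2.

```python
def global_alignment_table(S, T, match=2, mismatch=-1, d=2):
    """Fill the (n+1) x (m+1) table Score[i][j] for prefixes S[:i] and T[:j]."""
    n, m = len(S), len(T)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = -i * d          # S[:i] aligned against gaps only
    for j in range(1, m + 1):
        score[0][j] = -j * d          # T[:j] aligned against gaps only
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Python strings are 0-based, so S[i-1] is the i-th letter of S.
            s = match if S[i - 1] == T[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + s,   # (mis)match
                              score[i - 1][j] - d,       # gap in T
                              score[i][j - 1] - d)       # gap in S
    return score


table = global_alignment_table("ACTGTAT", "ACGGCTAT")
print(table[-1][-1])  # 9
```

Time and space are both O(nm) in this form; only the previous row is needed if the score alone is wanted.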
Example
s(x,x)=2, s(x,y)=-1 for x≠y, d=2
            A   C   G   G   C   T   A   T
        0   1   2   3   4   5   6   7   8
    0
A   1
C   2
T   3
G   4
T   5
A   6
T   7
Example
s(x,x)=2, s(x,y)=-1 for x≠y, d=2
            A   C   G   G   C   T   A   T
        0   1   2   3   4   5   6   7   8
    0   0  -2  -4  -6  -8 -10 -12 -14 -16
A   1  -2
C   2  -4
T   3  -6
G   4  -8
T   5 -10
A   6 -12
T   7 -14
Example
s(x,x)=2, s(x,y)=-1 for x≠y, d=2
            A   C   G   G   C   T   A   T
        0   1   2   3   4   5   6   7   8
    0   0  -2  -4  -6  -8 -10 -12 -14 -16
A   1  -2   2   0  -2  -4  -6  -8 -10 -12
C   2  -4
T   3  -6
G   4  -8
T   5 -10
A   6 -12
T   7 -14
Example
s(x,x)=2, s(x,y)=-1 for x≠y, d=2
            A   C   G   G   C   T   A   T
        0   1   2   3   4   5   6   7   8
    0   0  -2  -4  -6  -8 -10 -12 -14 -16
A   1  -2   2   0  -2  -4  -6  -8 -10 -12
C   2  -4   0   4   2   0  -2  -4  -6  -8
T   3  -6
G   4  -8
T   5 -10
A   6 -12
T   7 -14
Example
s(x,x)=2, s(x,y)=-1 for x≠y, d=2
            A   C   G   G   C   T   A   T
        0   1   2   3   4   5   6   7   8
    0   0  -2  -4  -6  -8 -10 -12 -14 -16
A   1  -2   2   0  -2  -4  -6  -8 -10 -12
C   2  -4   0   4   2   0  -2  -4  -6  -8
T   3  -6  -2   2   3   1  -1   0  -2  -4
G   4  -8  -4   0   4   5   3   1  -1  -3
T   5 -10  -6  -2   2   3   4   5   3   1
A   6 -12  -8  -4   0   1   2   3   7   5
T   7 -14 -10  -6  -2  -1   0   4   5   9   = Score(S,T)
How to recover the alignment?
s(x,x)=2, s(x,y)=-1 for x≠y, d=2
! trace back from Score[n,m] to Score[0,0], moving at each cell to a predecessor that attains the max
T: ACGGCTAT
S: ACTG-TAT
            A   C   G   G   C   T   A   T
        0   1   2   3   4   5   6   7   8
    0   0  -2  -4  -6  -8 -10 -12 -14 -16
A   1  -2   2   0  -2  -4  -6  -8 -10 -12
C   2  -4   0   4   2   0  -2  -4  -6  -8
T   3  -6  -2   2   3   1  -1   0  -2  -4
G   4  -8  -4   0   4   5   3   1  -1  -3
T   5 -10  -6  -2   2   3   4   5   3   1
A   6 -12  -8  -4   0   1   2   3   7   5
T   7 -14 -10  -6  -2  -1   0   4   5   9   = Score(S,T)
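One way to recover the alignment is to walk back from the bottom-right cell and check, at each step, which of the three predecessors produced the current value. The sketch below does this; the function name `global_align` and the tie-breaking order (diagonal, then up, then left) are choices made for illustration, and the gap penalty is passed as a positive d=2.

```python
def global_align(S, T, match=2, mismatch=-1, d=2):
    """Return (score, gapped T, gapped S) for one optimal global alignment."""
    n, m = len(S), len(T)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = -i * d
    for j in range(1, m + 1):
        score[0][j] = -j * d
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if S[i - 1] == T[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + s,
                              score[i - 1][j] - d,
                              score[i][j - 1] - d)

    # Traceback from (n, m) to (0, 0).
    row_t, row_s = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            s = match if S[i - 1] == T[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + s:
                row_t.append(T[j - 1])
                row_s.append(S[i - 1])
                i -= 1
                j -= 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] - d:
            row_t.append("-")          # letter of S against a gap
            row_s.append(S[i - 1])
            i -= 1
        else:
            row_t.append(T[j - 1])     # letter of T against a gap
            row_s.append("-")
            j -= 1
    return score[n][m], "".join(reversed(row_t)), "".join(reversed(row_s))


best, row_t, row_s = global_align("ACTGTAT", "ACGGCTAT")
print(best)   # 9
print(row_t)  # ACGGCTAT
print(row_s)  # ACTG-TAT
```

When several predecessors attain the max, each choice leads to a different optimal alignment; in this example the optimal path is unique.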
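Alignment: graph formulation
[Figure: alignment grid with columns A C G G C T A T and rows A C T G T A T]
! edge weights: diagonal edges 2 (match) or -1 (mismatch), horizontal and vertical edges -2 (indels)
! an optimal alignment is a max-cost path from the top-left to the bottom-right corner; here cost=9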
Exercise
! adapt the algorithm to compute the best suffix-prefix overlap of T and S
Exercises (2)
! Approximate occurrences of P in T
─ compute all alignments such that Score(P,T[i..j])>δ
! Particular cases
─ edit distance (<k): O(kn) [Landau&Vishkin 85, Galil&Park 89, …]
─ Hamming distance: O(n·log(m)) [Fischer&Paterson 73], O(nk) [Galil&Giancarlo 86], O(n√k·log(k)) [Amir&Lewenstein&Porat 04], …
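For the edit-distance case, a simple baseline is the same dynamic programme with a free start in T: row 0 is set to 0 so that an occurrence may begin anywhere. The sketch below is this plain O(nm) baseline, not one of the faster algorithms cited above. The function name `approx_end_positions` is hypothetical, and it reports positions with distance at most k (use k-1 for the strict "<k" convention).

```python
def approx_end_positions(P, T, k):
    """1-based end positions j such that some T[i..j] is within edit distance k of P."""
    m = len(P)
    col = list(range(m + 1))      # D[i][0] = i: P[:i] against the empty text
    hits = []
    for j, c in enumerate(T, start=1):
        prev_diag = col[0]        # D[0][j-1]
        col[0] = 0                # free start: an occurrence may begin anywhere
        for i in range(1, m + 1):
            cur = min(prev_diag + (P[i - 1] != c),  # substitution or match
                      col[i] + 1,                   # text letter unmatched
                      col[i - 1] + 1)               # pattern letter unmatched
            prev_diag = col[i]
            col[i] = cur
        if col[m] <= k:
            hits.append(j)
    return hits


print(approx_end_positions("ACT", "GACTAAGT", 0))  # [4]
print(approx_end_positions("ACT", "GACTAAGT", 1))  # [3, 4, 5, 8]
```

Only one column of the table is kept, so the extra space is O(m).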
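Computing alignment in linear space
! Key observation: an optimal alignment path crosses row n/2 of the table at some column k*, and
Score(S,T) = max over k of ( Score(n/2,k) + ScoreR(n/2,m-k) )
─ Score(n/2,k): score of S[1..n/2] against T[1..k]
─ ScoreR(n/2,m-k): score of S[n/2+1..n] against T[k+1..m], computed on the reversed strings
! Algorithm:
─ compute Score(n/2,k) for all k
─ compute ScoreR(n/2,m-k) for all k
─ compute k*=argmax_k(Score[n/2,k]+ScoreR[n/2,m-k])
─ recurse on (S[1..n/2], T[1..k*]) and (S[n/2+1..n], T[k*+1..m])
[Figure: the n×m table split at row n/2 and column k*; each recursion level keeps only the two sub-rectangles around the path]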
Resulting complexity
! time O(nm): the table areas processed sum to nm + nm/2 + nm/4 + … ≤ 2nm
! space O(n+m)
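The divide-and-conquer scheme above (commonly attributed to Hirschberg) can be sketched as follows. The names `last_row` and `hirschberg` and the base-case handling are illustrative choices, and the default scores are those of the worked global-alignment example, with a positive gap penalty d=2.

```python
def last_row(S, T, match, mismatch, d):
    """Last row of the global alignment table of S vs T, keeping one row at a time."""
    prev = [-j * d for j in range(len(T) + 1)]
    for i, a in enumerate(S, start=1):
        cur = [-i * d]
        for j, b in enumerate(T, start=1):
            s = match if a == b else mismatch
            cur.append(max(prev[j - 1] + s, prev[j] - d, cur[j - 1] - d))
        prev = cur
    return prev


def hirschberg(S, T, match=2, mismatch=-1, d=2):
    """Return one optimal global alignment as (gapped S, gapped T)."""
    if not S:
        return "-" * len(T), T
    if not T:
        return S, "-" * len(S)
    if len(S) == 1:
        # Base case: pair the single letter with its best partner in T,
        # unless two gaps are cheaper than the best substitution.
        k = max(range(len(T)), key=lambda j: match if T[j] == S else mismatch)
        best = match if T[k] == S else mismatch
        if best >= -2 * d:
            return "-" * k + S + "-" * (len(T) - k - 1), T
        return S + "-" * len(T), "-" + T
    mid = len(S) // 2
    m = len(T)
    # fwd[k] = Score(mid, k); bwd[m - k] = score of S[mid:] against T[k:].
    fwd = last_row(S[:mid], T, match, mismatch, d)
    bwd = last_row(S[mid:][::-1], T[::-1], match, mismatch, d)
    k = max(range(m + 1), key=lambda j: fwd[j] + bwd[m - j])
    left = hirschberg(S[:mid], T[:k], match, mismatch, d)
    right = hirschberg(S[mid:], T[k:], match, mismatch, d)
    return left[0] + right[0], left[1] + right[1]


row_s, row_t = hirschberg("ACTGTAT", "ACGGCTAT")
print(row_s)  # ACTG-TAT
print(row_t)  # ACGGCTAT
```

The string slices are copied here for readability; an implementation that cares about constant factors would pass index ranges instead.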
Local alignment
Score[i,j] = max { 0,
                   Score[i-1,j-1] + s(S[i],T[j]),
                   Score[i-1,j] - d,
                   Score[i,j-1] - d }
Smith-Waterman: example
EAWACQGKL vs ERDAWCQPGKWY
s(x,x)=1, s(x,y)=-3 for x≠y, d=1
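The example can be checked with a short sketch of the local-alignment recurrence. The function name `smith_waterman` and the tie-breaking order in the traceback are illustrative choices. The gap penalty is passed as a positive d=1, so each indel contributes -1.

```python
def smith_waterman(S, T, match=1, mismatch=-3, d=1):
    """Return (best local score, gapped part of S, gapped part of T)."""
    n, m = len(S), len(T)
    score = [[0] * (m + 1) for _ in range(n + 1)]  # row 0 and column 0 stay 0
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if S[i - 1] == T[j - 1] else mismatch
            score[i][j] = max(0,
                              score[i - 1][j - 1] + s,
                              score[i - 1][j] - d,
                              score[i][j - 1] - d)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the best cell until a cell with score 0 is reached.
    i, j = best_pos
    row_s, row_t = [], []
    while i > 0 and j > 0 and score[i][j] > 0:
        s = match if S[i - 1] == T[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + s:
            row_s.append(S[i - 1])
            row_t.append(T[j - 1])
            i -= 1
            j -= 1
        elif score[i][j] == score[i - 1][j] - d:
            row_s.append(S[i - 1])
            row_t.append("-")
            i -= 1
        else:
            row_s.append("-")
            row_t.append(T[j - 1])
            j -= 1
    return best, "".join(reversed(row_s)), "".join(reversed(row_t))


best, row_s, row_t = smith_waterman("EAWACQGKL", "ERDAWCQPGKWY")
print(best)  # 4
print(row_s)
print(row_t)
```

With these scores a mismatch (-3) is always worse than two indels (-2), so the best local alignment consists of matches and gaps only.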