DP and Edit Dist
You are free to use these slides. If you do, please sign the
guestbook (www.langmead-lab.org/teaching-materials), or email
me ([email protected]) and tell me briefly how you’re
using them. For original Keynote files, email me.
Beyond approximate matching: sequence similarity
In many settings, Hamming and edit distance are too simple. Biologically-relevant
distances require algorithms. We will expand our tool set accordingly.
X: G T A G C G G C G
   | |   | | | | | |
Y: G T - G C G G C G
Gap in Y, AKA insertion in X or deletion in Y
Alignment
X: G C G T A T G A G G C T A - A C G C
   | |   | | | |   | | | | |   | | | |
Y: G C - T A T G C G G C T A T A C G C
editDistance(x, y) ≤ hammingDistance(x, y)    (when |x| = |y|, so that Hamming distance is defined)
editDistance(x, y) ≥ | |x| - |y| |
x: GCGTATGCGGCTAACGC
y: GCTATGCGGCTATACGC
Operations: M = match, R = replace, I = insert into x, D = delete from x
x: GCGTATGCGGCTAACGC
   ||                    MMD
y: GC-TATGCGGCTATACGC
x: GCGTATGCGGCTA-ACGC
   || ||||||||||         MMDMMMMMMMMMMI
y: GC-TATGCGGCTATACGC
x: GCGTATGCGGCTA-ACGC
   || |||||||||| ||||    MMDMMMMMMMMMMIMMMM
y: GC-TATGCGGCTATACGC
Edit distance
x: GCGTATGCGGCTA-ACGC
   || |||||||||| ||||    MMDMMMMMMMMMMIMMMM    Distance = 2
y: GC-TATGCGGCTATACGC
x: GCGTATGAGGCTA-ACGC
   || |||| ||||| ||||    MMDMMMMRMMMMMIMMMM    Distance = 3
y: GC-TATGCGGCTATACGC
x: the longest----
       |||||||           DDDDMMMMMMMIIII       Distance = 8
y: ----longest day
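The distances on this slide can be read straight off the transcripts: every R, I, or D costs 1 and every M costs 0. A minimal sketch of that count follows; the function name transcript_distance is made up for illustration and does not come from the slides.

```python
def transcript_distance(transcript):
    """Cost of an edit transcript: each R, I, or D costs 1, each M costs 0."""
    return sum(1 for op in transcript if op in "RID")


print(transcript_distance("MMDMMMMMMMMMMIMMMM"))  # 2
print(transcript_distance("MMDMMMMRMMMMMIMMMM"))  # 3
print(transcript_distance("DDDDMMMMMMMIIII"))     # 8
```

Note that this scores one particular transcript. The edit distance is the smallest such score over all transcripts that turn x into y.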
Edit distance
D[i, j]: edit distance between length-i prefix of x and length-j prefix of y
[Figure: x with its length-i prefix marked, above y with its length-j prefix marked]
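Think in terms of edit transcripts. An optimal transcript for D[i, j] can be built
by extending a shorter one by 1 operation. Only 3 options:
  Append D to a transcript for D[i-1, j]
  Append I to a transcript for D[i, j-1]
  Append M or R to a transcript for D[i-1, j-1]
D[i, j] is the minimum of the three, and D[|x|, |y|] is the overall edit distance.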
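The three options translate directly into a recursive function on prefixes. The sketch below is one way to write it, assuming unit cost for R, I, and D. The name ed_dist_recursive is illustrative. It recomputes the same prefix pairs many times, so it is only practical for very short strings.

```python
def ed_dist_recursive(x, y):
    """Edit distance by direct recursion on prefixes (exponential time)."""
    if len(x) == 0:
        return len(y)  # only insertions remain
    if len(y) == 0:
        return len(x)  # only deletions remain
    delt = 1 if x[-1] != y[-1] else 0
    return min(ed_dist_recursive(x[:-1], y[:-1]) + delt,  # append M or R
               ed_dist_recursive(x[:-1], y) + 1,          # append D
               ed_dist_recursive(x, y[:-1]) + 1)          # append I


print(ed_dist_recursive("GCGTA", "GCTA"))  # 1
```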
Edit distance
Much better
Edit distance: dynamic programming
Let n = |x|, m = |y|
D: (n+1) x (m+1) matrix
D[i, j] = edit distance b/t length-i prefix of x and length-j prefix of y
ϵ is the empty string
Rows are indexed by ϵ and the characters of x, columns by ϵ and the characters of y:
ϵ G C T A T G C C A C G C
ϵ
G
C
G
T
A
T
G
C
A
C
G
C
Edit distance: dynamic programming
Value in a cell depends upon its upper, left, and upper-left neighbors.
Example: D[6, 6] depends on D[5, 6] (upper), D[6, 5] (left), and D[5, 5] (upper-left).

$$
D[i, j] = \min
\begin{cases}
D[i-1, j] + 1 & \text{(upper)} \\
D[i, j-1] + 1 & \text{(left)} \\
D[i-1, j-1] + \delta(x[i-1], y[j-1]) & \text{(upper-left)}
\end{cases}
$$

where δ(a, b) is 0 if a = b and 1 otherwise.
Edit distance: dynamic programming
First few lines of edDistDp:
    D = numpy.zeros((len(x)+1, len(y)+1), dtype=int)
    D[0, 1:] = range(1, len(y)+1)
    D[1:, 0] = range(1, len(x)+1)
Initialize D[0, j] to j, D[i, 0] to i
ϵ G C T A T G C C A C G C
ϵ 0 1 2 3 4 5 6 7 8 9 10 11 12
G 1
C 2
G 3
T 4
A 5
T 6
G 7
C 8
A 9
C 10
G 11
C 12
Edit distance: dynamic programming
Loop from edDistDp:
    for i in range(1, len(x)+1):
        for j in range(1, len(y)+1):
            delt = 1 if x[i-1] != y[j-1] else 0
            D[i, j] = min(D[i-1, j-1]+delt, D[i-1, j]+1, D[i, j-1]+1)
Fill remaining cells from top row to bottom and from left to right
ϵ G C T A T G C C A C G C
ϵ 0 1 2 3 4 5 6 7 8 9 10 11 12
G 1
C 2
G 3
T 4
A 5
T 6
G 7
C 8
A 9
C 10
G 11
C 12
Edit distance: dynamic programming
Loop from edDistDp:
    for i in range(1, len(x)+1):
        for j in range(1, len(y)+1):
            delt = 1 if x[i-1] != y[j-1] else 0
            D[i, j] = min(D[i-1, j-1]+delt, D[i-1, j]+1, D[i, j-1]+1)
Fill remaining cells from top row to bottom and from left to right
ϵ G C T A T G C C A C G C
ϵ 0 1 2 3 4 5 6 7 8 9 10 11 12
G 1 ?
C 2
G 3
T 4
A 5
T 6
G 7
C 8
A 9
C 10
G 11
C 12
What goes here in i=1, j=1?
x[i-1] = y[j-1] = 'G', so delt = 0
D[i, j] = min(D[i-1, j-1]+delt, D[i-1, j]+1, D[i, j-1]+1)
        = min(0 + 0, 1 + 1, 1 + 1)
        = 0
Edit distance: dynamic programming
Loop from edDistDp:
    for i in range(1, len(x)+1):
        for j in range(1, len(y)+1):
            delt = 1 if x[i-1] != y[j-1] else 0
            D[i, j] = min(D[i-1, j-1]+delt, D[i-1, j]+1, D[i, j-1]+1)
Fill remaining cells from top row to bottom and from left to right
ϵ G C T A T G C C A C G C
ϵ 0 1 2 3 4 5 6 7 8 9 10 11 12
G 1 0 1 2 3 4 5 6 7 8 9 10 11
C 2 1 0 1 2 3 4 5 6 7 8 9 10
G 3 2 1 1 2 3 3 4 5 6 7 8 9
T 4 3 2 1 2 2 3 4 5 6 7 8 9
A 5 4 3 2 1 2 3 4 5 5 6 7 8
T 6 5 4 3 2 1 2 3 4 5 6 7 8
G 7 6 5 4 3 2 1 2 3 4 5 6 7
C 8 7 6 5 4 3 2 1 2 3 4 5 6
A 9 8 7 6 5 4 3 2 2 2 3 4 5
C 10 9 8 7 6 5 4 3 2 3 2 3 4
G 11 10 9 8 7 6 5 4 3 3 3 2 3
C 12 11 10 9 8 7 6 5 4 4 3 3 2
Edit distance for x, y is the bottom-right cell: D[12, 12] = 2
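Putting the initialization and the fill loop together gives a complete function. The sketch below is a dependency-free variant that stores D as nested Python lists instead of the numpy array used on the slides. The name ed_dist_dp is made up here to keep it distinct from the slides' edDistDp. The recurrence and fill order are the same.

```python
def ed_dist_dp(x, y):
    """Edit distance between x and y via the (n+1) x (m+1) matrix D."""
    n, m = len(x), len(y)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    # First row and first column: distance to/from the empty prefix
    for j in range(1, m + 1):
        D[0][j] = j
    for i in range(1, n + 1):
        D[i][0] = i
    # Fill remaining cells from top row to bottom and from left to right
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            delt = 1 if x[i-1] != y[j-1] else 0
            D[i][j] = min(D[i-1][j-1] + delt,  # upper-left
                          D[i-1][j] + 1,       # upper
                          D[i][j-1] + 1)       # left
    return D[n][m]


print(ed_dist_dp("GCGTATGCACGC", "GCTATGCCACGC"))  # 2
```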
Edit distance: dynamic programming
Loop from edDistDp:
    for i in range(1, len(x)+1):
        for j in range(1, len(y)+1):
            delt = 1 if x[i-1] != y[j-1] else 0
            D[i, j] = min(D[i-1, j-1]+delt, D[i-1, j]+1, D[i, j-1]+1)
Could we have filled the cells in a different order?
ϵ G C T A T G C C A C G C
ϵ 0 1 2 3 4 5 6 7 8 9 10 11 12
G 1
C 2
G 3
T 4
A 5
T 6
G 7
C 8
A 9
C 10
G 11
C 12
Edit distance: dynamic programming
Switched:
    for j in range(1, len(y)+1):
        for i in range(1, len(x)+1):
            delt = 1 if x[i-1] != y[j-1] else 0
            D[i, j] = min(D[i-1, j-1]+delt, D[i-1, j]+1, D[i, j-1]+1)
Yes: e.g. invert the loops
ϵ G C T A T G C C A C G C
ϵ 0 1 2 3 4 5 6 7 8 9 10 11 12
G 1
C 2
G 3
T 4
A 5
T 6
G 7
C 8
A 9
C 10
G 11
C 12
Edit distance: dynamic programming
Or by anti-diagonal
ϵ G C T A T G C C A C G C
ϵ 0 1 2 3 4 5 6 7 8 9 10 11 12
G 1
C 2
G 3
T 4
A 5
T 6
G 7
C 8
A 9
C 10
G 11
C 12
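Filling by anti-diagonal works because every cell with i + j = d depends only on cells with i + j = d - 1 (upper, left) and i + j = d - 2 (upper-left). A minimal sketch, using nested lists rather than numpy and an illustrative name, ed_dist_antidiagonal:

```python
def ed_dist_antidiagonal(x, y):
    """Edit distance, filling D one anti-diagonal (i + j == d) at a time."""
    n, m = len(x), len(y)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for j in range(1, m + 1):
        D[0][j] = j
    for i in range(1, n + 1):
        D[i][0] = i
    # Interior cells have i >= 1 and j >= 1, so d = i + j runs from 2 to n + m
    for d in range(2, n + m + 1):
        # Keep 1 <= i <= n and 1 <= j = d - i <= m
        for i in range(max(1, d - m), min(n, d - 1) + 1):
            j = d - i
            delt = 1 if x[i-1] != y[j-1] else 0
            D[i][j] = min(D[i-1][j-1] + delt, D[i-1][j] + 1, D[i][j-1] + 1)
    return D[n][m]


print(ed_dist_antidiagonal("GCGTATGCACGC", "GCTATGCCACGC"))  # 2
```

Cells on the same anti-diagonal do not depend on each other, so in principle they could be computed in any order, or in parallel.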
Edit distance: dynamic programming
Or blocked
ϵ G C T A T G C C A C G C
ϵ 0 1 2 3 4 5 6 7 8 9 10 11 12
G 1
C 2
G 3
T 4
A 5
T 6
G 7
C 8
A 9
C 10
G 11
C 12
Edit distance: getting the alignment
ϵ G C T A T G C C A C G C
ϵ 0 1 2 3 4 5 6 7 8 9 10 11 12
G 1 0 1 2 3 4 5 6 7 8 9 10 11
C 2 1 0 1 2 3 4 5 6 7 8 9 10
G 3 2 1 1 2 3 3 4 5 6 7 8 9
T 4 3 2 1 2 2 3 4 5 6 7 8 9
A 5 4 3 2 1 2 3 4 5 5 6 7 8
T 6 5 4 3 2 1 2 3 4 5 6 7 8
G 7 6 5 4 3 2 1 2 3 4 5 6 7
C 8 7 6 5 4 3 2 1 2 3 4 5 6
A 9 8 7 6 5 4 3 2 2 2 3 4 5
C 10 9 8 7 6 5 4 3 2 3 2 3 4 A: From here
G 11 10 9 8 7 6 5 4 3 3 3 2 3 Q: How did I get here?
C 12 11 10 9 8 7 6 5 4 4 3 3 2
Edit distance: getting the alignment
ϵ G C T A T G C C A C G C
ϵ 0 1 2 3 4 5 6 7 8 9 10 11 12
G 1 0 1 2 3 4 5 6 7 8 9 10 11
C 2 1 0 1 2 3 4 5 6 7 8 9 10
G 3 2 1 1 2 3 3 4 5 6 7 8 9
T 4 3 2 1 2 2 3 4 5 6 7 8 9
A 5 4 3 2 1 2 3 4 5 5 6 7 8
T 6 5 4 3 2 1 2 3 4 5 6 7 8
G 7 6 5 4 3 2 1 2 3 4 5 6 7
C 8 7 6 5 4 3 2 1 2 3 4 5 6
A 9 8 7 6 5 4 3 2 2 2 3 4 5 A: From here
C 10 9 8 7 6 5 4 3 2 3 2 3 4 Q: How did I get here?
G 11 10 9 8 7 6 5 4 3 3 3 2 3
C 12 11 10 9 8 7 6 5 4 4 3 3 2
Edit distance: getting the alignment
ϵ G C T A T G C C A C G C
ϵ 0 1 2 3 4 5 6 7 8 9 10 11 12
G 1 0 1 2 3 4 5 6 7 8 9 10 11
C 2 1 0 1 2 3 4 5 6 7 8 9 10
G 3 2 1 1 2 3 3 4 5 6 7 8 9
T 4 3 2 1 2 2 3 4 5 6 7 8 9
A 5 4 3 2 1 2 3 4 5 5 6 7 8
T 6 5 4 3 2 1 2 3 4 5 6 7 8 A: From here
G 7 6 5 4 3 2 1 2 3 4 5 6 7 Q: How did I get here?
C 8 7 6 5 4 3 2 1 2 3 4 5 6
A 9 8 7 6 5 4 3 2 2 2 3 4 5
C 10 9 8 7 6 5 4 3 2 3 2 3 4
G 11 10 9 8 7 6 5 4 3 3 3 2 3
C 12 11 10 9 8 7 6 5 4 4 3 3 2
Edit distance: getting the alignment
ϵ G C T A T G C C A C G C
ϵ 0 1 2 3 4 5 6 7 8 9 10 11 12
G 1 0 1 2 3 4 5 6 7 8 9 10 11
C 2 1 0 1 2 3 4 5 6 7 8 9 10
G 3 2 1 1 2 3 3 4 5 6 7 8 9
T 4 3 2 1 2 2 3 4 5 6 7 8 9
A 5 4 3 2 1 2 3 4 5 5 6 7 8
T 6 5 4 3 2 1 2 3 4 5 6 7 8
G 7 6 5 4 3 2 1 2 3 4 5 6 7
C 8 7 6 5 4 3 2 1 2 3 4 5 6
A 9 8 7 6 5 4 3 2 2 2 3 4 5
C 10 9 8 7 6 5 4 3 2 3 2 3 4
G 11 10 9 8 7 6 5 4 3 3 3 2 3
C 12 11 10 9 8 7 6 5 4 4 3 3 2
Alignment:
GCGTATG-CACGC
|| |||| |||||
GC-TATGCCACGC
Edit transcript:
MMDMMMMIMMMMM
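The "how did I get here?" walk can be sketched as a traceback from D[|x|, |y|] to D[0, 0]. The code below is an illustration, not the slides' implementation: the names fill_matrix, traceback, and render are made up, D is stored as nested lists, and ties are broken by trying the upper-left neighbor first, then upper, then left. Other tie-breaking rules can give a different but equally good alignment.

```python
def fill_matrix(x, y):
    """Return the full (len(x)+1) x (len(y)+1) edit distance matrix."""
    n, m = len(x), len(y)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for j in range(1, m + 1):
        D[0][j] = j
    for i in range(1, n + 1):
        D[i][0] = i
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            delt = 1 if x[i-1] != y[j-1] else 0
            D[i][j] = min(D[i-1][j-1] + delt, D[i-1][j] + 1, D[i][j-1] + 1)
    return D


def traceback(D, x, y):
    """Walk from the bottom-right cell to the top-left; return the transcript."""
    i, j = len(x), len(y)
    ops = []
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            delt = 1 if x[i-1] != y[j-1] else 0
            if D[i][j] == D[i-1][j-1] + delt:   # came from upper-left
                ops.append('M' if delt == 0 else 'R')
                i -= 1
                j -= 1
                continue
        if i > 0 and D[i][j] == D[i-1][j] + 1:  # came from upper
            ops.append('D')
            i -= 1
        else:                                   # came from left
            ops.append('I')
            j -= 1
    return ''.join(reversed(ops))


def render(x, y, transcript):
    """Turn a transcript into the two gapped rows of the alignment."""
    xa, ya, i, j = [], [], 0, 0
    for op in transcript:
        if op in 'MR':
            xa.append(x[i]); ya.append(y[j]); i += 1; j += 1
        elif op == 'D':
            xa.append(x[i]); ya.append('-'); i += 1
        else:  # 'I'
            xa.append('-'); ya.append(y[j]); j += 1
    return ''.join(xa), ''.join(ya)


x, y = "GCGTATGCACGC", "GCTATGCCACGC"
t = traceback(fill_matrix(x, y), x, y)
print(t)                # MMDMMMMIMMMMM
print(render(x, y, t))  # ('GCGTATG-CACGC', 'GC-TATGCCACGC')
```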
Edit distance: summary
Filling the matrix takes O(mn) space and time, and yields the edit distance