04 Dynamic Programming 2: Edit Distance
String Comparison
Pavel Pevzner
Department of Computer Science and Engineering
University of California, San Diego
A T G T T A T A
A T C G T C C
A T G T T A T A
A T C G T C C
+1
The Alignment Game
A T G T T A T A
A T C G T C C
+1 +1
The Alignment Game
A T - G T T A T A
A T C G T C C
+1 +1
The Alignment Game
A T - G T T A T A
A T C G T C C
+1 +1 +1
The Alignment Game
A T - G T T A T A
A T C G T C C
+1 +1 +1 +1
The Alignment Game
A T - G T T A T A
A T C G T - C C
+1 +1 +1 +1
The Alignment Game
A T - G T T A T A
A T C G T - C - C
+1 +1 +1 +1
The Alignment Game
A T - G T T A T A
A T C G T - C - C
+1 +1 +1 +1 =4
Sequence Alignment
A T - G T T A T A
A T C G T - C - C
matches: A/A, T/T, G/G, T/T
insertions: -/C
deletions: T/-, T/-
mismatches: A/C, A/C
Alignment Score
Alignment score = #matches − 𝜇 · #mismatches − 𝜎 · #indels
A T - G T T A T A
A T C G T - C - C
+1 +1 -1 +1 +1 -1 +0 -1 +0 = 1
Example: 𝜇 = 0 and 𝜎 = 1
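The scoring in this example can be checked with a few lines of Python. This is a minimal sketch, not part of the original slides: the function name `alignment_score` and the use of an ASCII `-` for gaps are assumptions made for illustration.

```python
def alignment_score(top: str, bottom: str, mu: float, sigma: float) -> float:
    """Score an alignment given as two equal-length rows; '-' marks a gap."""
    if len(top) != len(bottom):
        raise ValueError("alignment rows must have equal length")
    score = 0.0
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            score -= sigma   # insertion or deletion (indel)
        elif a == b:
            score += 1       # match
        else:
            score -= mu      # mismatch
    return score


print(alignment_score("AT-GTTATA", "ATCGT-C-C", mu=0, sigma=1))  # 1.0
```

Four matches contribute +4, three indels contribute −3, and the two mismatches cost nothing because 𝜇 = 0.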
Optimal alignment
Input: Two strings, mismatch penalty 𝜇, and indel penalty 𝜎.
Output: An alignment of the strings maximizing the score.
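One possible way to compute the maximum score is the table-filling approach these slides develop later for edit distance, with `max` in place of `min`. The sketch below rests on that assumption; the name `optimal_alignment_score` is hypothetical, and only the score is returned, not the alignment itself.

```python
def optimal_alignment_score(a: str, b: str, mu: float, sigma: float) -> float:
    """Maximum alignment score: +1 per match, -mu per mismatch, -sigma per indel."""
    n, m = len(a), len(b)
    # s[i][j] = best score of aligning the prefixes a[:i] and b[:j]
    s = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        s[i][0] = -sigma * i          # a[:i] against the empty string
    for j in range(1, m + 1):
        s[0][j] = -sigma * j          # the empty string against b[:j]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diagonal = s[i - 1][j - 1] + (1 if a[i - 1] == b[j - 1] else -mu)
            s[i][j] = max(diagonal, s[i - 1][j] - sigma, s[i][j - 1] - sigma)
    return s[n][m]


print(optimal_alignment_score("EDITING", "DISTANCE", mu=0, sigma=0.5))  # 2.5
```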
Common Subsequence
A T - G T T A T A
A T C G T - C - C
Example
E D I − T I N G −
− D I S T A N C E
matches: D/D, I/I, T/T, N/N
mismatches: I/A, G/C
insertions: −/S, −/E
deletion: E/−
Each match or mismatch column contains two symbols and each insertion or deletion column contains one, so the total length of the two strings is
|A| + |B| = 2·#matches + 2·#mismatches + 1·#insertions + 1·#deletions
= (2·#matches − 1·#insertions − 1·#deletions) + (2·#mismatches + 2·#insertions + 2·#deletions)
= 2 · AlignmentScore (𝜇 = 0, 𝜎 = 1/2) + 2 · EditDistance
Since the total length is fixed, minimizing edit distance = maximizing alignment score.
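The relation between alignment score and edit distance can be checked numerically on the EDITING/DISTANCE alignment. The snippet is an illustrative sketch: `column_counts` is a hypothetical helper, and gaps are written as ASCII `-`.

```python
def column_counts(top: str, bottom: str) -> dict:
    """Count the four column types of an alignment given as two rows."""
    counts = {"matches": 0, "mismatches": 0, "insertions": 0, "deletions": 0}
    for a, b in zip(top, bottom):
        if a == "-":
            counts["insertions"] += 1   # gap in the top row
        elif b == "-":
            counts["deletions"] += 1    # gap in the bottom row
        elif a == b:
            counts["matches"] += 1
        else:
            counts["mismatches"] += 1
    return counts


top, bottom = "EDI-TING-", "-DISTANCE"
c = column_counts(top, bottom)
total_length = len(top.replace("-", "")) + len(bottom.replace("-", ""))
score = c["matches"] - 0.5 * (c["insertions"] + c["deletions"])   # mu = 0, sigma = 1/2
distance = c["mismatches"] + c["insertions"] + c["deletions"]
print(c, score, distance, total_length)
```

For this alignment the counts are 4 matches, 2 mismatches, 2 insertions and 1 deletion, so 2 · 2.5 + 2 · 5 = 15 = 7 + 8.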
Outline
D(i, j) is the edit distance between the prefixes A[1 . . . i] and B[1 . . . j]:
\[
D(i, j) = \min
\begin{cases}
D(i, j-1) + 1 \\
D(i-1, j) + 1 \\
D(i-1, j-1) + 1 & \text{if } A[i] \neq B[j] \\
D(i-1, j-1) & \text{if } A[i] = B[j]
\end{cases}
\]
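The recurrence translates almost line for line into a memoized recursive function. This is a sketch under two assumptions: the base cases D(i, 0) = i and D(0, j) = j, and 0-based Python indexing (so A[i] becomes `a[i - 1]`). The name `edit_distance_recursive` is hypothetical.

```python
from functools import lru_cache


def edit_distance_recursive(a: str, b: str) -> int:
    """Top-down evaluation of the recurrence for D(i, j)."""

    @lru_cache(maxsize=None)
    def d(i: int, j: int) -> int:
        if i == 0:
            return j                  # insert the first j symbols of b
        if j == 0:
            return i                  # delete the first i symbols of a
        # Diagonal move: free on a match, cost 1 on a mismatch.
        diagonal = d(i - 1, j - 1) + (0 if a[i - 1] == b[j - 1] else 1)
        return min(d(i, j - 1) + 1, d(i - 1, j) + 1, diagonal)

    return d(len(a), len(b))


print(edit_distance_recursive("EDITING", "DISTANCE"))  # 5
```

Memoization keeps the number of evaluated subproblems at (n + 1)(m + 1). The recursion depth grows with n + m, so for long strings an iterative table is the safer choice.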
Comparing A[1 . . . n] = EDITING and B[1 . . . m] = DISTANCE. Rows are indexed by i, columns by j, and cell (i, j) holds D(i, j).
D I S T A N C E
0 1 2 3 4 5 6 7 8
0
E 1
D 2
I 3
T 4
I 5
N 6
G 7
D I S T A N C E
0 1 2 3 4 5 6 7 8
0 0
E 1 1
D 2 2
I 3 3
T 4 4
I 5 5
N 6 6
G 7 7
D I S T A N C E
0 1 2 3 4 5 6 7 8
0 0 1 2 3 4 5 6 7 8
E 1 1
D 2 2
I 3 3
T 4 4
I 5 5
N 6 6
G 7 7
D I S T A N C E
0 1 2 3 4 5 6 7 8
0 0 1 2 3 4 5 6 7 8
E 1 1 ?
D 2 2
I 3 3
T 4 4
I 5 5
N 6 6
G 7 7
D(1, 1) = min{D(1, 0) + 1, D(0, 1) + 1, D(0, 0) + 1}
D I S T A N C E
0 1 2 3 4 5 6 7 8
0 0 1 2 3 4 5 6 7 8
E 1 1 ?
D 2 2
I 3 3
T 4 4
I 5 5
N 6 6
G 7 7
D(1, 1) = min{2, 2, 1}
D I S T A N C E
0 1 2 3 4 5 6 7 8
0 0 1 2 3 4 5 6 7 8
E 1 1 1
D 2 2
I 3 3
T 4 4
I 5 5
N 6 6
G 7 7
D I S T A N C E
0 1 2 3 4 5 6 7 8
0 0 1 2 3 4 5 6 7 8
E 1 1 1
D 2 2 ?
I 3 3
T 4 4
I 5 5
N 6 6
G 7 7
D(2, 1) = min{D(2, 0) + 1, D(1, 1) + 1, D(1, 0)}
D I S T A N C E
0 1 2 3 4 5 6 7 8
0 0 1 2 3 4 5 6 7 8
E 1 1 1
D 2 2 1
I 3 3
T 4 4
I 5 5
N 6 6
G 7 7
D I S T A N C E
0 1 2 3 4 5 6 7 8
0 0 1 2 3 4 5 6 7 8
E 1 1 1
D 2 2 1
I 3 3 ?
T 4 4
I 5 5
N 6 6
G 7 7
D(3, 1) = min{D(3, 0) + 1, D(2, 1) + 1, D(2, 0) + 1}
D I S T A N C E
0 1 2 3 4 5 6 7 8
0 0 1 2 3 4 5 6 7 8
E 1 1 1
D 2 2 1
I 3 3 2
T 4 4
I 5 5
N 6 6
G 7 7
D I S T A N C E
0 1 2 3 4 5 6 7 8
0 0 1 2 3 4 5 6 7 8
E 1 1 1
D 2 2 1
I 3 3 2
T 4 4 3
I 5 5 4
N 6 6 5
G 7 7 6
D I S T A N C E
0 1 2 3 4 5 6 7 8
0 0 1 2 3 4 5 6 7 8
E 1 1 1 ?
D 2 2 1
I 3 3 2
T 4 4 3
I 5 5 4
N 6 6 5
G 7 7 6
D(1, 2) = min{D(1, 1) + 1, D(0, 2) + 1, D(0, 1) + 1}
D I S T A N C E
0 1 2 3 4 5 6 7 8
0 0 1 2 3 4 5 6 7 8
E 1 1 1 ?
D 2 2 1
I 3 3 2
T 4 4 3
I 5 5 4
N 6 6 5
G 7 7 6
D(1, 2) = min{2, 3, 2}
D I S T A N C E
0 1 2 3 4 5 6 7 8
0 0 1 2 3 4 5 6 7 8
E 1 1 1 2
D 2 2 1
I 3 3 2
T 4 4 3
I 5 5 4
N 6 6 5
G 7 7 6
D I S T A N C E
0 1 2 3 4 5 6 7 8
0 0 1 2 3 4 5 6 7 8
E 1 1 1 2
D 2 2 1 2
I 3 3 2 1
T 4 4 3 2
I 5 5 4 3
N 6 6 5 4
G 7 7 6 5
D I S T A N C E
0 1 2 3 4 5 6 7 8
0 0 1 2 3 4 5 6 7 8
E 1 1 1 2 3 4 5 6 7 7
D 2 2 1 2 3 4 5 6 7 8
I 3 3 2 1 2 3 4 5 6 7
T 4 4 3 2 2 2 3 4 5 6
I 5 5 4 3 3 3 3 4 5 6
N 6 6 5 4 4 4 4 3 4 5
G 7 7 6 5 5 5 5 4 4 5
EditDistance(A[1 . . . n], B[1 . . . m])
  D(i, 0) ← i and D(0, j) ← j for all i, j
  for j from 1 to m:
    for i from 1 to n:
      insertion ← D(i, j − 1) + 1
      deletion ← D(i − 1, j) + 1
      match ← D(i − 1, j − 1)
      mismatch ← D(i − 1, j − 1) + 1
      if A[i] = B[j]:
        D(i, j) ← min(insertion, deletion, match)
      else:
        D(i, j) ← min(insertion, deletion, mismatch)
  return D(n, m)
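A direct Python rendering of the pseudocode might look as follows. It is a sketch rather than a reference implementation: the name `edit_distance_table` is an assumption, strings are 0-indexed (so A[i] is `a[i - 1]`), and the whole table is returned so individual cells can be compared with the worked example.

```python
def edit_distance_table(a: str, b: str) -> list:
    """Fill the (n + 1) x (m + 1) table D column by column, as in EditDistance."""
    n, m = len(a), len(b)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for j in range(1, m + 1):
        for i in range(1, n + 1):
            insertion = d[i][j - 1] + 1
            deletion = d[i - 1][j] + 1
            # Match costs 0, mismatch costs 1.
            diagonal = d[i - 1][j - 1] + (0 if a[i - 1] == b[j - 1] else 1)
            d[i][j] = min(insertion, deletion, diagonal)
    return d


table = edit_distance_table("EDITING", "DISTANCE")
print(table[7][8])  # 5
```

Both loops do constant work per cell, so the running time and the memory use are proportional to n · m.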
Outline
Any path from (0, 0) to (i, j) spells an alignment of the prefixes A[1 . . . i] and B[1 . . . j].
[Figure: grid with rows E D I T I N G (i = 0 . . . 7) and columns D I S T A N C E (j = 0 . . . 8); the path is extended one edge at a time]
E
D

E −
D I

E − −
D I S

E − − D
D I S −

E − − D I
D I S − −

E − − D I T
D I S − − T

E − − D I T I N − G
D I S − − T A N C E
The constructed path corresponds to distance 8 (3 mismatches + 3 insertions + 2 deletions) and is not optimal (the edit distance is 5).
To construct an optimal alignment, we will use the backtracking pointers.
D I S T A N C E
0 1 2 3 4 5 6 7 8
0 0 1 2 3 4 5 6 7 8
E 1 1 1 2 3 4 5 6 7 7
D 2 2 1 2 3 4 5 6 7 8
I 3 3 2 1 2 3 4 5 6 7
T 4 4 3 2 2 2 3 4 5 6
I 5 5 4 3 3 3 3 4 5 6
N 6 6 5 4 4 4 4 3 4 5
G 7 7 6 5 5 5 5 4 4 5
The edit distance is 5: the value D(7, 8) in the bottom right cell.
We arrived at the bottom right cell by moving along the backtracking pointers.
[Figure: the table above with backtracking pointers drawn between cells]
There exists an optimal alignment whose last column is a mismatch (D(6, 7) + 1 = 5) and an optimal alignment whose last column is an insertion (D(7, 7) + 1 = 5).
Let's consider a mismatch:
G
E
We continue in a similar fashion.
Each step follows one backtracking pointer and prepends one column to the alignment, until cell (0, 0) is reached:
− G
C E

N − G
N C E

I N − G
A N C E

T I N − G
T A N C E

− T I N − G
S T A N C E

I − T I N − G
I S T A N C E

D I − T I N − G
D I S T A N C E

E D I − T I N − G
− D I S T A N C E
OutputAlignment(i, j)
  if i = 0 and j = 0:
    return
  if backtrack(i, j) = ↓:
    OutputAlignment(i − 1, j)
    print A[i]
          −
  else if backtrack(i, j) = →:
    OutputAlignment(i, j − 1)
    print −
          B[j]
  else:
    OutputAlignment(i − 1, j − 1)
    print A[i]
          B[j]
OutputAlignment(i, j)
  if i = 0 and j = 0:
    return
  if i > 0 and D(i, j) = D(i − 1, j) + 1:
    OutputAlignment(i − 1, j)
    print A[i]
          −
  else if j > 0 and D(i, j) = D(i, j − 1) + 1:
    OutputAlignment(i, j − 1)
    print −
          B[j]
  else:
    OutputAlignment(i − 1, j − 1)
    print A[i]
          B[j]
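The pointer-free variant of OutputAlignment can be sketched in Python as a loop that walks from (n, m) back to (0, 0) and compares neighbouring table cells. The function name `optimal_alignment`, the ASCII `-` gap symbol, and returning the two rows as strings instead of printing columns are assumptions made for this example.

```python
def optimal_alignment(a: str, b: str) -> tuple:
    """Return one optimal alignment of a and b as two rows with '-' for gaps."""
    n, m = len(a), len(b)
    # Fill the edit distance table D.
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diagonal = d[i - 1][j - 1] + (0 if a[i - 1] == b[j - 1] else 1)
            d[i][j] = min(d[i][j - 1] + 1, d[i - 1][j] + 1, diagonal)
    # Backtrack, testing the moves in the same order as the pseudocode.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and d[i][j] == d[i - 1][j] + 1:
            top.append(a[i - 1]); bottom.append("-"); i -= 1        # deletion
        elif j > 0 and d[i][j] == d[i][j - 1] + 1:
            top.append("-"); bottom.append(b[j - 1]); j -= 1        # insertion
        else:
            top.append(a[i - 1]); bottom.append(b[j - 1])           # match or mismatch
            i -= 1; j -= 1
    # Columns were collected from right to left, so reverse them.
    return "".join(reversed(top)), "".join(reversed(bottom))


row_a, row_b = optimal_alignment("EDITING", "DISTANCE")
print(row_a)
print(row_b)
```

Ties are broken in the order deletion, insertion, diagonal, so the alignment printed here may differ from the one built on the slides. Every alignment obtained this way has cost D(n, m).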
Comparing Genes, Proteins, and Genomes MOOC (a part of Bioinformatics Specialization on Coursera)
Bioinformatics Algorithms textbook at bioinformaticsalgorithms.org (2nd two-volume edition was published in 2015)