Edit Distance
◦ Insertion
◦ Deletion
◦ Substitution
Minimum Edit Distance
Two strings and their alignment:
Minimum Edit Distance
Entity Coreference
◦ IBM Inc. announced today
◦ IBM profits
◦ Stanford Professor Jennifer Eberhardt announced yesterday
◦ for Professor Eberhardt…
How to find the Min Edit Distance?
Searching for a path (sequence of edits) from the start
string to the final string:
◦ Initial state: the word we are transforming
◦ Operators: insert, delete, substitute
◦ Goal state: the word we are trying to get to
◦ Path cost: what we want to minimize: the number of edits
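The search formulation above can be sketched as a breadth-first search over strings. This is only an illustration, not the method the deck goes on to use: the lowercase alphabet, the helper names, and the unit cost for every operator (including substitution) are assumptions made for the sketch.

```python
from collections import deque

ALPHABET = "abcdefghijklmnopqrstuvwxyz"  # assumed alphabet for this sketch


def neighbors(word):
    """Yield every string reachable from `word` by applying one operator."""
    for i in range(len(word) + 1):
        for ch in ALPHABET:
            yield word[:i] + ch + word[i:]          # insert ch at position i
    for i in range(len(word)):
        yield word[:i] + word[i + 1:]               # delete character i
        for ch in ALPHABET:
            if ch != word[i]:
                yield word[:i] + ch + word[i + 1:]  # substitute character i


def min_edits_by_search(start, goal):
    """Breadth-first search from the initial state to the goal state.

    Path cost is the number of edits; every operator costs 1 here.
    """
    # With unit costs, some optimal path never exceeds the longer string's length.
    max_len = max(len(start), len(goal))
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        word, cost = queue.popleft()
        if word == goal:
            return cost
        for nxt in neighbors(word):
            if len(nxt) <= max_len and nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, cost + 1))
    return None
```

The `seen` set already hints at the problem raised on the next slide: many different edit sequences reach the same intermediate string, and the number of states grows quickly with word length, so this sketch is only practical for very short words.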
Minimum Edit as Search
Space of all edit sequences is huge
◦ Can’t navigate naïvely
◦ Lots of distinct paths wind up at the same state.
Defining Min Edit Distance
For two strings
◦ X of length n
◦ Y of length m
We define D(i,j)
◦ The edit distance between X[1..i] and Y[1..j]
◦ i.e., the first i characters of X and the first j characters of Y
◦ The edit distance between X and Y is thus D(n, m)
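A minimal sketch of the definition above, written as a memoized recursion over prefixes. The function name and the `sub_cost` parameter are assumptions for illustration; insertion and deletion are assumed to cost 1. Caching each D(i,j) means that every state is solved once, no matter how many edit paths reach it.

```python
from functools import lru_cache


def edit_distance(x, y, sub_cost=1):
    """Return D(n, m), where D(i, j) is the distance between x[:i] and y[:j]."""

    @lru_cache(maxsize=None)
    def D(i, j):
        if i == 0:
            return j  # empty prefix of x: j insertions build y[:j]
        if j == 0:
            return i  # empty prefix of y: i deletions remove x[:i]
        same = x[i - 1] == y[j - 1]  # X[i] and Y[j] in the 1-based slide notation
        return min(
            D(i - 1, j) + 1,                              # delete X[i]
            D(i, j - 1) + 1,                              # insert Y[j]
            D(i - 1, j - 1) + (0 if same else sub_cost),  # match or substitute
        )

    return D(len(x), len(y))
```

For example, `edit_distance("intention", "execution")` uses unit costs throughout, while `edit_distance("intention", "execution", sub_cost=2)` charges 2 per substitution. The recursion depth grows with the string lengths, so this form suits short strings only.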
Minimum Edit Distance
COMPUTING MINIMUM EDIT DISTANCE
Dynamic Programming for Minimum Edit Distance
The Edit Distance Table
N   9
O   8
I   7
T   6
N   5
E   4
T   3
N   2
I   1
#   0  1  2  3  4  5  6  7  8  9
    #  E  X  E  C  U  T  I  O  N
(rows are indexed by i, columns by j)
Minimum Edit Distance
[Figure: the edit distance table with the insertion (Ins), substitution (Sub), and deletion (Del) neighbors of a cell D(i,j) marked]
The Edit Distance Table
N   9  8  9 10 11 12 11 10  9  8
O   8  7  8  9 10 11 10  9  8  9
I   7  6  7  8  9 10  9  8  9 10
T   6  5  6  7  8  9  8  9 10 11
N   5  4  5  6  7  8  9 10 11 10
E   4  3  4  5  6  7  8  9 10  9
T   3  4  5  6  7  8  7  8  9  8
N   2  3  4  5  6  7  8  7  8  7
I   1  2  3  4  5  6  7  6  7  8
#   0  1  2  3  4  5  6  7  8  9
    #  E  X  E  C  U  T  I  O  N
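The table above can be reproduced with a short bottom-up computation. This is a sketch under stated assumptions: insertion and deletion cost 1 and substitution costs 2 (the costs given in the recurrence later in this deck), and the function names are hypothetical.

```python
def edit_distance_table(x, y, sub_cost=2):
    """Fill D[i][j] = edit distance between x[:i] and y[:j], bottom-up."""
    n, m = len(x), len(y)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = i  # i deletions
    for j in range(1, m + 1):
        D[0][j] = j  # j insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = 0 if x[i - 1] == y[j - 1] else sub_cost
            D[i][j] = min(
                D[i - 1][j] + 1,         # deletion
                D[i][j - 1] + 1,         # insertion
                D[i - 1][j - 1] + diag,  # substitution or match
            )
    return D


def format_table(x, y, D):
    """Render D with row 0 at the bottom, matching the slide layout."""
    rows = []
    for i in range(len(x), -1, -1):
        label = x[i - 1].upper() if i > 0 else "#"
        rows.append(label + " " + " ".join(f"{v:2d}" for v in D[i]))
    rows.append("  " + " ".join(f"{c:>2}" for c in "#" + y.upper()))
    return "\n".join(rows)


D = edit_distance_table("intention", "execution")
print(format_table("intention", "execution", D))
```

The distance is the top-right cell, `D[len(x)][len(y)]`. Each cell depends only on its lower, left, and lower-left neighbors, so filling the table takes time proportional to n × m.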
Minimum Edit Distance
BACKTRACE FOR COMPUTING ALIGNMENTS
Computing alignments
✓ The edit distance alone isn’t sufficient
◦ We often need to align each character of the two strings to each other
✓ Do this by keeping a “backtrace”
✓ Every time we enter a cell, remember where we came from
✓ When we reach the end,
◦ Trace back the path from the upper right corner to read off the alignment
Adding Backtrace to Minimum Edit Distance
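Base conditions:
◦ D(i,0) = i
◦ D(0,j) = j
Recurrence Relation:
For each i = 1…N
For each j = 1…M
\[
D(i,j) = \min
\begin{cases}
D(i-1,j) + 1 & \text{deletion} \\
D(i,j-1) + 1 & \text{insertion} \\
D(i-1,j-1) +
  \begin{cases}
  2 & \text{if } X(i) \neq Y(j) \\
  0 & \text{if } X(i) = Y(j)
  \end{cases} & \text{substitution}
\end{cases}
\]
\[
\mathrm{ptr}(i,j) =
\begin{cases}
\text{LEFT} & \text{insertion} \\
\text{DOWN} & \text{deletion} \\
\text{DIAG} & \text{substitution}
\end{cases}
\]
Termination:
◦ D(N,M) is distance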
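The recurrence and pointer rules above might be implemented as follows. This is a minimal sketch: the function name, the `*` gap symbol, and the tie-breaking order (DIAG, then DOWN, then LEFT) are assumptions, since the slides do not say which pointer to keep when several moves tie.

```python
def align(x, y, sub_cost=2):
    """Return (distance, aligned_x, aligned_y), using '*' for gaps."""
    n, m = len(x), len(y)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    ptr = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0], ptr[i][0] = i, "DOWN"  # first column: deletions only
    for j in range(1, m + 1):
        D[0][j], ptr[0][j] = j, "LEFT"  # first row: insertions only
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = D[i - 1][j - 1] + (0 if x[i - 1] == y[j - 1] else sub_cost)
            # min() keeps the first minimal candidate, so ties prefer
            # DIAG, then DOWN, then LEFT (an assumed tie-breaking rule).
            D[i][j], ptr[i][j] = min(
                (diag, "DIAG"),
                (D[i - 1][j] + 1, "DOWN"),
                (D[i][j - 1] + 1, "LEFT"),
                key=lambda cand: cand[0],
            )
    # Follow the pointers from D(N,M) back to D(0,0).
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        move = ptr[i][j]
        if move == "DIAG":    # substitution or match: consume both characters
            top.append(x[i - 1])
            bottom.append(y[j - 1])
            i, j = i - 1, j - 1
        elif move == "DOWN":  # deletion: consume a character of x
            top.append(x[i - 1])
            bottom.append("*")
            i -= 1
        else:                 # LEFT, insertion: consume a character of y
            top.append("*")
            bottom.append(y[j - 1])
            j -= 1
    return D[n][m], "".join(reversed(top)), "".join(reversed(bottom))
```

Calling `align("intention", "execution")` returns the distance together with two equal-length strings that can be read column by column. Several alignments can share the minimum cost, so a different tie-breaking rule may give a different but equally cheap alignment.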
MinEdit with Backtrace
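i  n  t  e  *  n  t  i  o  n
*  e  x  e  c  u  t  i  o  n
◦ Delete i: cost 1
◦ Substitute n with e: cost 2
◦ Substitute t with x: cost 2
◦ e aligned with e: 0 cost
◦ Insert c: cost 1
◦ Substitute n with u: cost 2
◦ t, i, o, n aligned with themselves: 0 cost
Total cost: 8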
Result of Backtrace
Two strings and their alignment:
Minimum Edit Distance
WEIGHTED MINIMUM EDIT DISTANCE
Weighted Edit Distance
Why would we add weights to the computation?
◦ Spell Correction: some letters are more likely to be
mistyped than others
Confusion matrix for spelling errors
Weighted Min Edit Distance
Initialization:
D(0,0) = 0
D(i,0) = D(i-1,0) + del[x(i)]; 1 ≤ i ≤ N
D(0,j) = D(0,j-1) + ins[y(j)]; 1 ≤ j ≤ M
Recurrence Relation:
\[
D(i,j) = \min
\begin{cases}
D(i-1,j) + \mathrm{del}[x(i)] \\
D(i,j-1) + \mathrm{ins}[y(j)] \\
D(i-1,j-1) + \mathrm{sub}[x(i), y(j)]
\end{cases}
\]
Termination:
D(N,M) is distance
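A sketch of the weighted computation above, with the cost tables passed in as functions. All weights below are hypothetical, chosen only to illustrate the idea; they are not taken from a real confusion matrix. The sketch also assumes that `sub(a, a)` returns 0 so that matching characters are free.

```python
def weighted_edit_distance(x, y, ins, dele, sub):
    """D(N,M) with per-character costs ins(c), dele(c), and sub(a, b)."""
    n, m = len(x), len(y)
    D = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = D[i - 1][0] + dele(x[i - 1])
    for j in range(1, m + 1):
        D[0][j] = D[0][j - 1] + ins(y[j - 1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = min(
                D[i - 1][j] + dele(x[i - 1]),
                D[i][j - 1] + ins(y[j - 1]),
                D[i - 1][j - 1] + sub(x[i - 1], y[j - 1]),
            )
    return D[n][m]


# Hypothetical weights: a few letter pairs are treated as easy to confuse.
CONFUSABLE = {("a", "e"), ("e", "a"), ("m", "n"), ("n", "m")}


def ins_cost(ch):
    return 1.0


def del_cost(ch):
    return 1.0


def sub_cost(a, b):
    if a == b:
        return 0.0  # matching characters cost nothing
    return 0.5 if (a, b) in CONFUSABLE else 2.0
```

With these weights, `weighted_edit_distance("man", "men", ins_cost, del_cost, sub_cost)` charges only 0.5 for the a/e confusion, whereas an unrelated substitution costs as much as a deletion plus an insertion.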