EditDistance

The document discusses minimum edit distance and how it can be used to measure the similarity between strings. It describes how to calculate minimum edit distance using dynamic programming and how to track the alignment between strings using backtracing. It also discusses extensions like weighted minimum edit distance which accounts for different substitution costs.

Uploaded by

saisuraj1510
Minimum Edit Distance

How similar are two strings?


Spell correction
User typed “graffe”. Which is closest?
◦ graf
◦ graft
◦ grail
◦ giraffe
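A minimal sketch of how the candidates above could be ranked, assuming every insertion, deletion and substitution costs 1. The function name `edit_distance` and the ranking step are illustrative and not part of the slides.

```python
def edit_distance(source: str, target: str) -> int:
    """Unit-cost edit distance (insert, delete, substitute each cost 1)."""
    # prev[j] is the distance between the source prefix processed so far
    # and the first j characters of target.
    prev = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        curr = [i]  # turning i source characters into "" takes i deletions
        for j, t_char in enumerate(target, start=1):
            curr.append(min(
                prev[j] + 1,                       # deletion
                curr[j - 1] + 1,                   # insertion
                prev[j - 1] + (s_char != t_char),  # substitution, or free match
            ))
        prev = curr
    return prev[-1]


typed = "graffe"
candidates = ["graf", "graft", "grail", "giraffe"]
# Smallest distance first; ties keep their original order.
ranked = sorted(candidates, key=lambda word: edit_distance(typed, word))
for word in ranked:
    print(word, edit_distance(typed, word))
```

Under these assumed unit costs "giraffe" comes out closest, since a single insertion of "i" is enough.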

Used in Speech Recognition, Machine Translation (Alignment), Information Extraction (NER)
Edit Distance
The minimum edit distance between two strings is the minimum
number of editing operations needed to transform one into the
other

◦ Insertion
◦ Deletion
◦ Substitution
Minimum Edit Distance
Two strings and their alignment:
I N T E * N T I O N
* E X E C U T I O N
Minimum Edit Distance

If each operation has cost of 1


◦ Distance between these is 5
If substitutions cost 2 (Levenshtein)
◦ Distance between them is 8
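As a worked check of the two figures, assume the alignment uses one deletion, three substitutions and one insertion (the operation counts that the backtrace later in the deck reports for INTENTION and EXECUTION):

```latex
\underbrace{1}_{\text{deletion}} + \underbrace{1 + 1 + 1}_{\text{3 substitutions}} + \underbrace{1}_{\text{insertion}} = 5
\qquad\text{and}\qquad
1 + 3 \cdot 2 + 1 = 8
```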
Uses of Edit Distance in NLP
Evaluating Machine Translation and speech recognition
R  Spokesman confirms     senior government adviser was appointed
H  Spokesman said     the senior            adviser was appointed
             S        I          D
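The same computation can be run over words instead of characters. The sketch below assumes whitespace tokenization and unit costs; `word_edit_distance` is a hypothetical helper, not something named in the slides. Dividing the edit count by the reference length gives the quantity usually reported as word error rate.

```python
def word_edit_distance(reference: str, hypothesis: str) -> int:
    """Unit-cost edit distance over whitespace-separated words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dist[i][j]: edits needed to turn the first i reference words
    # into the first j hypothesis words.
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i
    for j in range(len(hyp) + 1):
        dist[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            dist[i][j] = min(
                dist[i - 1][j] + 1,                              # deletion (D)
                dist[i][j - 1] + 1,                              # insertion (I)
                dist[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]), # substitution (S)
            )
    return dist[-1][-1]


reference = "Spokesman confirms senior government adviser was appointed"
hypothesis = "Spokesman said the senior adviser was appointed"
errors = word_edit_distance(reference, hypothesis)
print(errors, errors / len(reference.split()))
```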

Entity Coreference
◦ IBM Inc. announced today
◦ IBM profits
◦ Stanford Professor Jennifer Eberhardt announced yesterday
◦ for Professor Eberhardt…
How to find the Min Edit Distance?
Searching for a path (sequence of edits) from the start
string to the final string:
◦ Initial state: the word we are transforming
◦ Operators: insert, delete, substitute
◦ Goal state: the word we are trying to get to
◦ Path cost: what we want to minimize: the number of edits
Minimum Edit as Search
Space of all edit sequences is huge
◦ Can’t navigate naïvely
◦ Lots of distinct paths wind up at the same state.
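To get a feel for the gap between paths and states, the sketch below counts only the well-behaved edit paths (each character is deleted, inserted, or substituted/matched exactly once), assuming a state is identified by how many characters of each string have been consumed. That is a small subset of all possible edit sequences, and the function name `count_alignment_paths` is illustrative.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def count_alignment_paths(i: int, j: int) -> int:
    """Number of edit paths that consume i source and j target characters."""
    if i == 0 or j == 0:
        return 1  # only one way: all insertions or all deletions
    return (
        count_alignment_paths(i - 1, j)        # last step was a deletion
        + count_alignment_paths(i, j - 1)      # last step was an insertion
        + count_alignment_paths(i - 1, j - 1)  # last step was a substitution or match
    )


n = m = 9  # e.g. two nine-letter strings
print("distinct states:", (n + 1) * (m + 1))
print("distinct paths: ", count_alignment_paths(n, m))
```

Even for two nine-letter strings there are only 100 such states, while the number of paths through them runs into the millions, which is why remembering the best cost per state pays off.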
Defining Min Edit Distance
For two strings
◦ X of length n
◦ Y of length m
We define D(i,j)
◦ The edit distance between X[1..i] and Y[1..j]
◦ i.e., the first i characters of X and the first j characters of Y
◦ The edit distance between X and Y is thus D(n, m)
Minimum Edit Distance
COMPUTING MINIMUM EDIT DISTANCE
Dynamic Programming for Minimum Edit Distance

Dynamic programming: A tabular computation of D(n,m)


Solving problems by combining solutions to subproblems.
Bottom-up
◦ Compute D(i,j) for small i,j
◦ Compute larger D(i,j) based on previously computed smaller values
◦ i.e., compute D(i,j) for all i (0 ≤ i ≤ n) and j (0 ≤ j ≤ m)
Defining Min Edit Distance (Levenshtein)
Initialization:
D(i,0) = i
D(0,j) = j
Recurrence Relation:
For each i = 1…n
  For each j = 1…m
$$
D(i,j) = \min
\begin{cases}
D(i-1,j) + 1 & \text{deletion} \\
D(i,j-1) + 1 & \text{insertion} \\
D(i-1,j-1) +
  \begin{cases}
  2 & \text{if } X(i) \neq Y(j) \\
  0 & \text{if } X(i) = Y(j)
  \end{cases} & \text{substitution}
\end{cases}
$$
Termination:
D(n,m) is the distance
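The initialization, recurrence and termination steps can be sketched in Python as follows. This is one possible bottom-up rendering, not the slides' own code; the function name and the keyword arguments for the costs are assumptions made for illustration, with the substitution cost defaulting to 2 as in the recurrence.

```python
def min_edit_distance_table(source: str, target: str,
                            ins_cost: int = 1, del_cost: int = 1,
                            sub_cost: int = 2) -> list[list[int]]:
    """Fill the table D, where D[i][j] is the edit distance between
    the first i characters of source and the first j of target."""
    n, m = len(source), len(target)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    # Initialization: D(i,0) is i deletions, D(0,j) is j insertions.
    for i in range(1, n + 1):
        D[i][0] = D[i - 1][0] + del_cost
    for j in range(1, m + 1):
        D[0][j] = D[0][j - 1] + ins_cost
    # Recurrence: each cell depends only on cells filled earlier.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            same = source[i - 1] == target[j - 1]
            D[i][j] = min(
                D[i - 1][j] + del_cost,                       # deletion
                D[i][j - 1] + ins_cost,                       # insertion
                D[i - 1][j - 1] + (0 if same else sub_cost),  # substitution
            )
    return D


D = min_edit_distance_table("intention", "execution")
print(D[-1][-1])  # termination: D(n, m)
```

Passing `sub_cost=1` gives the variant in which every operation costs 1.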
The Edit Distance Table
    N   9
    O   8
    I   7
    T   6
i   N   5
    E   4
    T   3
    N   2
    I   1
    #   0  1  2  3  4  5  6  7  8  9
        #  E  X  E  C  U  T  I  O  N
                       j
Edit Distance
(Figure: the INTENTION/EXECUTION table with only row 0 and column 0 filled in, annotated with the three moves into a cell: Ins, Sub, Del.)
The Edit Distance Table
N   9  8  9 10 11 12 11 10  9  8
O   8  7  8  9 10 11 10  9  8  9
I   7  6  7  8  9 10  9  8  9 10
T   6  5  6  7  8  9  8  9 10 11
N   5  4  5  6  7  8  9 10 11 10
E   4  3  4  5  6  7  8  9 10  9
T   3  4  5  6  7  8  7  8  9  8
N   2  3  4  5  6  7  8  7  8  7
I   1  2  3  4  5  6  7  6  7  8
#   0  1  2  3  4  5  6  7  8  9
    #  E  X  E  C  U  T  I  O  N
Minimum Edit Distance
BACKTRACE FOR COMPUTING ALIGNMENTS
Computing alignments
✓ Edit distance isn’t sufficient
◦ We often need to align each character of the two strings to each other
✓ Do this by keeping a “backtrace”
✓ Every time we enter a cell, remember where we came from
✓ When we reach the end,
◦ Trace back the path from the upper right corner to read off the alignment
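Adding Backtrace to Minimum Edit Distance
Base conditions:
D(i,0) = i
D(0,j) = j
Termination:
D(n,m) is the distance
Recurrence Relation:
For each i = 1…n
  For each j = 1…m
$$
D(i,j) = \min
\begin{cases}
D(i-1,j) + 1 & \text{deletion} \\
D(i,j-1) + 1 & \text{insertion} \\
D(i-1,j-1) +
  \begin{cases}
  2 & \text{if } X(i) \neq Y(j) \\
  0 & \text{if } X(i) = Y(j)
  \end{cases} & \text{substitution}
\end{cases}
\qquad
ptr(i,j) =
\begin{cases}
\text{LEFT} & \text{insertion} \\
\text{DOWN} & \text{deletion} \\
\text{DIAG} & \text{substitution}
\end{cases}
$$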
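A minimal sketch of the table fill with pointers, followed by the trace back. It assumes a single pointer per cell with ties broken in favour of DIAG, then DOWN, then LEFT (an arbitrary choice; other tie-breaks give other, equally cheap alignments). The function name `align` and the `*` gap symbol are illustrative.

```python
def align(source: str, target: str, sub_cost: int = 2):
    """Return (distance, aligned_source, aligned_target, operations)."""
    n, m = len(source), len(target)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    ptr = [[None] * (m + 1) for _ in range(n + 1)]
    # Base conditions: the first column is all deletions, the first row all insertions.
    for i in range(1, n + 1):
        D[i][0], ptr[i][0] = i, "DOWN"
    for j in range(1, m + 1):
        D[0][j], ptr[0][j] = j, "LEFT"
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            same = source[i - 1] == target[j - 1]
            options = [
                (D[i - 1][j - 1] + (0 if same else sub_cost), "DIAG"),
                (D[i - 1][j] + 1, "DOWN"),
                (D[i][j - 1] + 1, "LEFT"),
            ]
            # min() returns the first minimal option, which fixes the tie-break order.
            D[i][j], ptr[i][j] = min(options, key=lambda option: option[0])
    # Trace back from (n, m) to (0, 0), collecting columns in reverse order.
    i, j = n, m
    top, bottom, ops = [], [], []
    while i > 0 or j > 0:
        move = ptr[i][j]
        if move == "DIAG":
            i, j = i - 1, j - 1
            top.append(source[i])
            bottom.append(target[j])
            ops.append(" " if source[i] == target[j] else "s")
        elif move == "DOWN":
            i -= 1
            top.append(source[i])
            bottom.append("*")
            ops.append("d")
        else:  # LEFT
            j -= 1
            top.append("*")
            bottom.append(target[j])
            ops.append("i")
    return (D[n][m], "".join(reversed(top)),
            "".join(reversed(bottom)), "".join(reversed(ops)))


distance, aligned_source, aligned_target, operations = align("intention", "execution")
print(distance)
print(aligned_source)
print(aligned_target)
print(operations)
```

Keeping one pointer per cell recovers a single optimal alignment; storing every minimizing direction instead would allow all of them to be enumerated.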
MinEdit with Backtrace
source     i    n    t    e    *    n    t    i    o    n
target     *    e    x    e    c    u    t    i    o    n
operation  Del  Sub  Sub       Ins  Sub
cost       1    2    2    0    1    2    0    0    0    0
Result of Backtrace
Two strings and their alignment:
I N T E * N T I O N
* E X E C U T I O N
Minimum Edit Distance
WEIGHTED MINIMUM EDIT DISTANCE
Weighted Edit Distance
Why would we add weights to the computation?
◦ Spell Correction: some letters are more likely to be
mistyped than others
Confusion matrix for spelling errors
Weighted Min Edit Distance
Initialization:
D(0,0) = 0
D(i,0) = D(i-1,0) + del[x(i)]; 1 ≤ i ≤ n
D(0,j) = D(0,j-1) + ins[y(j)]; 1 ≤ j ≤ m
Recurrence Relation:
$$
D(i,j) = \min
\begin{cases}
D(i-1,j) + \mathrm{del}[x(i)] \\
D(i,j-1) + \mathrm{ins}[y(j)] \\
D(i-1,j-1) + \mathrm{sub}[x(i), y(j)]
\end{cases}
$$
Termination:
D(n,m) is the distance
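A hedged sketch of the weighted version, with the cost tables passed in as functions. The weights below are made up for illustration (a cheap e/a confusion, everything else 1) and do not come from any real confusion matrix; `weighted_edit_distance`, `toy_sub_cost` and `unit_cost` are hypothetical names.

```python
def weighted_edit_distance(x: str, y: str, ins_cost, del_cost, sub_cost) -> float:
    """ins_cost(c), del_cost(c) and sub_cost(a, b) are caller-supplied costs.
    sub_cost(a, a) is expected to be 0 so that matches are free."""
    n, m = len(x), len(y)
    D = [[0.0] * (m + 1) for _ in range(n + 1)]
    # Initialization: accumulate per-character deletion and insertion costs.
    for i in range(1, n + 1):
        D[i][0] = D[i - 1][0] + del_cost(x[i - 1])
    for j in range(1, m + 1):
        D[0][j] = D[0][j - 1] + ins_cost(y[j - 1])
    # Recurrence with character-specific weights.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = min(
                D[i - 1][j] + del_cost(x[i - 1]),
                D[i][j - 1] + ins_cost(y[j - 1]),
                D[i - 1][j - 1] + sub_cost(x[i - 1], y[j - 1]),
            )
    return D[n][m]


# Hypothetical weights: confusing 'e' and 'a' is cheap, any other substitution costs 1.
CHEAP_PAIRS = {("e", "a"), ("a", "e")}


def toy_sub_cost(a: str, b: str) -> float:
    if a == b:
        return 0.0
    return 0.5 if (a, b) in CHEAP_PAIRS else 1.0


def unit_cost(_char: str) -> float:
    return 1.0


print(weighted_edit_distance("then", "than", unit_cost, unit_cost, toy_sub_cost))
print(weighted_edit_distance("then", "thin", unit_cost, unit_cost, toy_sub_cost))
```

With these toy weights "than" ends up closer to "then" than "thin" does, which is the kind of preference a spelling confusion matrix is meant to encode.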
