String Matching
PRINCIPLES OF
DATA INTEGRATION
ANHAI DOAN ALON HALEVY ZACHARY IVES
Introduction
Problem description
Similarity measures
Sequence-based: edit distance, Needleman-Wunsch, affine gap,
Smith-Waterman, Jaro, Jaro-Winkler
Set-based: overlap, Jaccard, TF/IDF
Hybrid: generalized Jaccard, soft TF/IDF, Monge-Elkan
Phonetic: Soundex
Scaling up string matching
Inverted index, size filtering, prefix filtering, position filtering,
bound filtering
Problem Description
Accuracy Challenges
Solution:
Use a similarity measure s(x,y) ∈ [0,1]
The higher s(x,y), the more likely that x and y match
Declare x and y matched if s(x,y) ≥ t
Distance measure/cost measure have also been used
Same concept
But smaller values indicate higher similarity
Scalability Challenges
Edit Distance
Substituting i by m
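Computing Edit Distance using Dynamic Programming
Define x = x1x2…xn, y = y1y2…ym
d(i,j) = edit distance between x1x2…xi and y1y2…yj,
the i-th and j-th prefixes of x and y
Recurrence equations
d(i,0) = i, d(0,j) = j
d(i,j) = min{ d(i-1,j-1) + c(xi,yj), d(i-1,j) + 1, d(i,j-1) + 1 }
where c(xi,yj) = 0 if xi = yj, and 1 otherwise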
Example
x = dva, y = dave
Completed table d(i,j) (rows are prefixes of x, columns are prefixes of y):

         y0   y1   y2   y3   y4
              d    a    v    e
x0       0    1    2    3    4
x1   d   1    0    1    2    3
x2   v   2    1    1    1    2
x3   a   3    2    1    2    2

Resulting alignment:
x = d-va
y = dave
insert a (after d), substitute a with e
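The table above can be reproduced with a few lines of code. The following is a minimal sketch, assuming unit costs for insertion, deletion, and substitution. The function names and the conversion from distance to similarity, 1 - d(x,y) / max(|x|,|y|), are assumptions made here for illustration.

```python
def edit_distance(x: str, y: str) -> int:
    """Fill the d(i,j) table row by row and return d(n,m)."""
    n, m = len(x), len(y)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i  # delete the first i characters of x
    for j in range(m + 1):
        d[0][j] = j  # insert the first j characters of y
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if x[i - 1] == y[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j - 1] + sub,  # copy or substitute
                d[i - 1][j] + 1,        # delete x[i]
                d[i][j - 1] + 1,        # insert y[j]
            )
    return d[n][m]


def edit_similarity(x: str, y: str) -> float:
    """Assumed normalization: map the distance into [0,1]."""
    longest = max(len(x), len(y))
    return 1.0 if longest == 0 else 1 - edit_distance(x, y) / longest


print(edit_distance("dva", "dave"))  # 2, the lower-right cell of the table
```

The table costs O(|x||y|) time and space, which is one reason the later slides look for ways to avoid comparing all pairs.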
Needleman-Wunsch Measure
Scoring an Alignment
Use a score matrix and a gap penalty
Example
Needleman-Wunsch Generalizes Levenshtein in Three Ways
Computes similarity scores instead of distance values
Generalizes edit costs into a score matrix
allowing for more fine-grained score modeling
e.g., score(o,0) > score(a,0)
e.g., different amino-acid pairs may have different semantic
distance
Generalizes insertion and deletion into gaps, and
generalizes their costs from 1 to Cg
Computing Needleman-Wunsch Score with Dynamic Programming
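As a rough sketch of this computation, the code below fills a score table s(i,j) in the same way as the edit distance table, but maximizes a score instead of minimizing a cost. The recurrence used is the standard Needleman-Wunsch one. The parameter names and the default values (match score 2, mismatch score -1, gap penalty Cg = 1) are assumptions chosen for illustration, and a full score matrix is simplified to a match/mismatch pair.

```python
def needleman_wunsch(x: str, y: str, match: int = 2,
                     mismatch: int = -1, gap: int = 1) -> int:
    """Global alignment score of x and y with a linear gap penalty."""
    n, m = len(x), len(y)
    s = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        s[i][0] = -i * gap  # x prefix aligned against gaps only
    for j in range(1, m + 1):
        s[0][j] = -j * gap  # y prefix aligned against gaps only
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            c = match if x[i - 1] == y[j - 1] else mismatch
            s[i][j] = max(
                s[i - 1][j - 1] + c,  # align x[i] with y[j]
                s[i - 1][j] - gap,    # align x[i] with a gap
                s[i][j - 1] - gap,    # align y[j] with a gap
            )
    return s[n][m]


print(needleman_wunsch("dva", "deeve"))  # 1: d--va aligned with deeve
```

Replacing the `match`/`mismatch` pair with a lookup into a character-pair score table would give the finer-grained scoring described above, for example a higher score for (o,0) than for (a,0).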
The Affine Gap Measure:
Motivation
An extension of Needleman-Wunsch that handles longer gaps
more gracefully
E.g., “David Smith” vs. “David R. Smith”
Needleman-Wunsch is well suited here
it opens a gap of length 2 right after “David”
E.g.,
The Smith-Waterman Measure:
Motivation
Previous measures consider global alignments
attempt to match all characters of x with all characters of y
Not well suited for some cases
e.g., “Prof. John R. Smith, Univ of Wisconsin” and “John R.
Smith, Professor”
similarity score here would be quite low
Better idea: find two substrings of x and y that are most
similar
e.g., find “John R. Smith” in the above case; this is a local alignment
The Smith-Waterman Measure:
Basic Ideas
Find the best local alignment between x and y, and
return its score as the score between x and y
Makes two key changes to Needleman-Wunsch
allows the match to restart at any position in the strings (no
longer limited to just the first position)
if the global match score dips below 0, ignore the prefix and restart the match
after computing the matrix using the recurrence equation, retraces
the arrows from the largest value in the matrix, rather than from
the lower-right corner
this effectively ignores suffixes if the match they produce is not optimal
retracing ends when we reach a cell with value 0, which marks the start of the alignment
Computing Smith-Waterman Score using Dynamic Programming
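The two changes can be sketched in code as follows: every cell is clamped at 0 so that a match can restart anywhere, and the result is the largest cell in the table rather than the lower-right one. This is only an illustrative sketch. The function name and the scoring values (match 2, mismatch -1, gap penalty 1) are assumptions, and the traceback that recovers the aligned substrings is omitted.

```python
def smith_waterman(x: str, y: str, match: int = 2,
                   mismatch: int = -1, gap: int = 1) -> int:
    """Score of the best local alignment between x and y."""
    n, m = len(x), len(y)
    # Row 0 and column 0 stay at 0: a match may start at any position.
    s = [[0] * (m + 1) for _ in range(n + 1)]
    best = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            c = match if x[i - 1] == y[j - 1] else mismatch
            s[i][j] = max(
                0,                    # restart: ignore everything before
                s[i - 1][j - 1] + c,  # align x[i] with y[j]
                s[i - 1][j] - gap,    # align x[i] with a gap
                s[i][j - 1] - gap,    # align y[j] with a gap
            )
            best = max(best, s[i][j])  # largest cell, not the corner
    return best


print(smith_waterman("avd", "dave"))  # 4: the shared substring "av"
```

On the motivating pair, the shared substring “John R. Smith” alone contributes 26 points under these assumed scores, whereas a global alignment would be penalized for every unmatched character around it.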
The Jaro Measure
x = jon, y = john
c = 3 because the common characters are j, o, and n
t=0
jaro(x,y) = (1/3)(3/3 + 3/4 + 3/3) = 0.917
contrast this with 0.75, the similarity score of x and y using edit distance
x = jon, y = ojhn
common char sequence in x is jon
common char sequence in y is ojn
t=2
jaro(x,y) = 0.81
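Both examples can be checked with a short sketch. It assumes the definition jaro(x,y) = (1/3)[c/|x| + c/|y| + (c - t/2)/c], where two equal characters xi and yj count as common if |i - j| ≤ min(|x|,|y|)/2, and t is the number of positions at which the two common-character sequences differ. The greedy left-to-right matching and the function name are choices made here for illustration.

```python
def jaro(x: str, y: str) -> float:
    """Jaro similarity under the assumptions stated above."""
    if not x or not y:
        return 0.0
    window = min(len(x), len(y)) / 2
    used = [False] * len(y)  # characters of y already matched
    common_x = []
    for i, ch in enumerate(x):
        for j, other in enumerate(y):
            if not used[j] and ch == other and abs(i - j) <= window:
                used[j] = True
                common_x.append(ch)
                break
    # Common characters of y, in the order they appear in y.
    common_y = [y[j] for j in range(len(y)) if used[j]]
    c = len(common_x)
    if c == 0:
        return 0.0
    # t counts positions where the two common sequences disagree.
    t = sum(1 for a, b in zip(common_x, common_y) if a != b)
    return (c / len(x) + c / len(y) + (c - t / 2) / c) / 3


print(round(jaro("jon", "john"), 3))  # 0.917
print(round(jaro("jon", "ojhn"), 2))  # 0.81
```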
The Jaro-Winkler Measure
Set-based Similarity Measures
The Overlap Measure
The Jaccard Measure
The TF/IDF Measure: Motivation
Term Frequencies and
Inverse Document Frequencies
Assume x and y are taken from a collection of strings
Each string is converted into a bag of terms called a
document
Define term frequency tf(t,d) = number of times term t
appears in document d
Define inverse document frequency idf(t) = N / Nd, the
number of documents in the collection divided by the number
of documents that contain t
note: in practice, idf(t) is often defined as log(N / Nd); here we
will use the simple formula above to define idf(t)
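These definitions translate directly into code. The sketch below represents each document as a list of terms and uses the simple idf(t) = N / Nd. The feature weight tf(t,d) · idf(t) and the cosine of the two weight vectors as the similarity score are the usual TF/IDF construction, stated here as an assumption. The three-document collection and all function names are hypothetical.

```python
import math
from collections import Counter


def tf(term, doc):
    """Number of times term appears in doc (a list of terms)."""
    return Counter(doc)[term]


def idf(term, docs):
    """N / Nd; assumes term occurs in at least one document of docs."""
    containing = sum(1 for d in docs if term in d)
    return len(docs) / containing


def tfidf_vector(doc, docs):
    """Assumed feature weight for each term t of doc: tf(t,d) * idf(t)."""
    counts = Counter(doc)
    return {t: counts[t] * idf(t, docs) for t in counts}


def tfidf_cosine(p, q, docs):
    """Assumed score: cosine of the angle between the two vectors."""
    vp, vq = tfidf_vector(p, docs), tfidf_vector(q, docs)
    dot = sum(w * vq.get(t, 0.0) for t, w in vp.items())
    norm_p = math.sqrt(sum(w * w for w in vp.values()))
    norm_q = math.sqrt(sum(w * w for w in vq.values()))
    return dot / (norm_p * norm_q) if norm_p and norm_q else 0.0


# Hypothetical collection of three documents.
docs = [["a", "a", "b"], ["a", "c"], ["a"]]
print(tf("a", docs[0]), idf("a", docs), idf("b", docs))  # 2 1.0 3.0
```

Because a appears in every document, idf(a) = 1 and the shared term contributes little to the score, while the rare terms b and c carry most of the weight.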
Example
Feature Vectors
TF/IDF Similarity Score
Generalized Jaccard Measure
Jaccard measure
considers overlapping tokens in both x and y
a token from x and a token from y must be identical to be included in
the set of overlapping tokens
this can be too restrictive in certain cases
Example:
matching taxonomic nodes that describe companies
“Energy & Transportation” vs. “Transportation, Energy, & Gas”
in theory Jaccard is well suited here; in practice Jaccard may not work
well if tokens are commonly misspelled
e.g., energy vs. eneryg
the generalized Jaccard measure can help in such cases
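A small sketch shows how much a single misspelling costs the plain Jaccard measure, |Bx ∩ By| / |Bx ∪ By|. The tokenization (lowercase words, with “&” and punctuation dropped) and the variable names are assumptions made here.

```python
def jaccard(bx, by):
    """Plain Jaccard: tokens count as overlapping only if identical."""
    bx, by = set(bx), set(by)
    union = bx | by
    return len(bx & by) / len(union) if union else 0.0


x = ["energy", "transportation"]
y = ["transportation", "energy", "gas"]
y_typo = ["transportation", "eneryg", "gas"]  # energy misspelled

print(jaccard(x, y))       # 2 shared tokens out of 3
print(jaccard(x, y_typo))  # 0.25: only "transportation" is shared
```

The misspelled token is treated as entirely different from energy, which is the restriction the generalized measure relaxes by allowing tokens to match when they are merely similar.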
The Soft TF/IDF Measure
The Monge-Elkan Measure
Phonetic Similarity Measures
The Soundex Measure
Used primarily to match surnames
maps a surname x into a 4-character code (a letter followed by three digits)
two surnames are judged similar if they share the same code
Algorithm to map x into a code:
Step 1: keep the first letter of x; subsequent steps are performed on the
rest of x
Step 2: remove all occurrences of W and H. Replace the remaining letters
with digits as follows:
replace B, F, P, V with 1; C, G, J, K, Q, S, X, Z with 2; D, T with 3; L with 4;
M, N with 5; R with 6
Step 3: replace each sequence of identical digits by the digit itself
Step 4: drop all non-digit letters (except the first letter), return the first four characters as the
soundex code
Example: x = Ashcraft
after Step 2: A226a13; after Step 3: A26a13; Step 4 converts
this into A2613, then returns A261
Soundex code is padded with 0 if there are not enough digits
Example: Robert and Rupert both map into R163
Soundex fails to map Gough and Goff, or Jawornicki and
Yavornitzky, to the same code
designed primarily for Caucasian names, but found to work
well for names of many different origins
does not work well for names of East Asian origin,
which use vowels to discriminate; Soundex ignores vowels
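The four steps can be sketched as follows. This follows the variant described on these slides, in which the first letter is set aside before coding, rather than any particular library's Soundex. It assumes a non-empty, purely alphabetic name, and the function and table names are made up for illustration.

```python
# Step 2 mapping from letters to digits.
CODES = {}
for letters, digit in [("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")]:
    for letter in letters:
        CODES[letter] = digit


def soundex(name: str) -> str:
    """Soundex code per the four steps above (alphabetic input assumed)."""
    name = name.upper()
    first, rest = name[0], name[1:]          # Step 1: keep first letter
    # Step 2: drop W and H, replace coded letters by their digits.
    encoded = [CODES.get(ch, ch) for ch in rest if ch not in "WH"]
    # Step 3: collapse runs of identical adjacent digits.
    collapsed = []
    for ch in encoded:
        if ch.isdigit() and collapsed and collapsed[-1] == ch:
            continue
        collapsed.append(ch)
    # Step 4: drop remaining letters, pad with 0, keep four characters.
    digits = "".join(ch for ch in collapsed if ch.isdigit())
    return (first + digits + "000")[:4]


print(soundex("Ashcraft"))                   # A261
print(soundex("Robert"), soundex("Rupert"))  # R163 R163
print(soundex("Gough"), soundex("Goff"))     # G200 G100
```

Vowels are kept until Step 4, so they separate repeated digits during Step 3; this is why Ashcraft keeps only one 2 for "sc".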
Scalability Challenges
Applying s(x,y) to all pairs is impractical
Quadratic in size of data
Solution: apply s(x,y) to only the most promising pairs, using a
method FindCands
For each string x ∈ X
use method FindCands to find a candidate set Z ⊆ Y
for each string y ∈ Z
if s(x,y) ≥ t then return (x,y) as a matched pair
This is often called a blocking solution
Set Z is often called the umbrella set of x
We now discuss ways to implement FindCands
using Jaccard and overlap measures for now
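The blocking loop above might look like the sketch below. Here FindCands is implemented, purely as an assumption for illustration, by a token-level inverted index that returns every y sharing at least one word with x. The scoring function is Jaccard over word tokens, and all function names are hypothetical.

```python
from collections import defaultdict


def tokens(s):
    """Assumed tokenization: lowercase, whitespace-separated words."""
    return set(s.lower().split())


def jaccard(x, y):
    bx, by = tokens(x), tokens(y)
    union = bx | by
    return len(bx & by) / len(union) if union else 0.0


def build_index(Y):
    """Map each token to the set of strings in Y that contain it."""
    index = defaultdict(set)
    for y in Y:
        for tok in tokens(y):
            index[tok].add(y)
    return index


def find_cands(x, index):
    """Umbrella set Z of x: strings sharing at least one token with x."""
    cands = set()
    for tok in tokens(x):
        cands |= index.get(tok, set())
    return cands


def match(X, Y, t):
    """Apply s(x,y) only to candidate pairs and keep those with s >= t."""
    index = build_index(Y)
    return [(x, y) for x in X
            for y in sorted(find_cands(x, index))
            if jaccard(x, y) >= t]


X = ["david smith"]
Y = ["david d smith", "joe wilson", "dan smith"]
print(match(X, Y, 0.5))  # [('david smith', 'david d smith')]
```

In this toy run "joe wilson" is never scored, because it shares no token with the query string. Since a pair with no shared token has Jaccard 0, this filter loses no matches for any threshold t > 0.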
Inverted Index over Strings
Example
Limitations
Size Filtering
Example
Prefix Filtering
Example
Selecting the Subset Intelligently
How? (continued)
can prove that if |x ∩ y| ≥ k, then x’ and y’ must overlap, where
x’ is the prefix of size |x| - (k - 1) of x and y’ is the prefix of size
|y| - (k - 1) of y (see notes)
Algorithm
reorder terms in each x ∈ X and y ∈ Y in increasing order of
their frequencies
for each y ∈ Y, create y’, the prefix of size |y| - (k - 1) of y
build an inverted index over all prefixes y’
for each x ∈ X, create x’, the prefix of size |x| - (k - 1) of x, then
use above index to find all y such that x’ overlaps with y’
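The algorithm above can be sketched as follows for the overlap condition |x ∩ y| ≥ k. Strings are represented as sets of tokens. Frequency is taken to be the number of sets in X and Y that contain a token, with ties broken alphabetically so that every set is ordered the same way; both choices, and the function name, are assumptions made for this sketch.

```python
from collections import Counter, defaultdict


def prefix_filter_candidates(X, Y, k):
    """Return index pairs (i, j) whose prefixes x' and y' share a token."""
    # Frequency of each token across both collections.
    freq = Counter(tok for s in list(X) + list(Y) for tok in s)

    def prefix(s):
        # Rarest tokens first, then keep the first |s| - (k - 1) of them.
        ordered = sorted(s, key=lambda tok: (freq[tok], tok))
        return ordered[:max(len(s) - (k - 1), 0)]

    # Inverted index over the prefixes y'.
    index = defaultdict(set)
    for j, y in enumerate(Y):
        for tok in prefix(y):
            index[tok].add(j)

    # Probe the index with each prefix x'.
    cands = set()
    for i, x in enumerate(X):
        for tok in prefix(x):
            for j in index.get(tok, ()):
                cands.add((i, j))
    return cands


X = [{"lake", "mendota"}]
Y = [{"lake", "monona", "area"},
     {"lake", "mendota", "monona", "area"},
     {"dane", "area", "mendota"}]
print(prefix_filter_candidates(X, Y, 2))  # {(0, 1)}
```

The candidates still have to be verified by computing the actual overlap, since sharing a prefix token is necessary but not sufficient for |x ∩ y| ≥ k. Ordering by increasing frequency puts rare tokens in the prefixes, which keeps the inverted lists short.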
Example
Bound Filtering
Computing UB(x,y) and LB(x,y)
Example
Extending Scaling Techniques to
Other Similarity Measures
Discussed Jaccard and overlap so far
To extend a technique T to work for a new similarity
measure s(x,y)
try to translate s(x,y) into constraints on a similarity measure
that already works well with T
The notes discuss examples that involve edit distance
and TF/IDF
Summary
String matching is pervasive in data integration
Two key challenges:
what similarity measure and how to scale up?
Similarity measures
Sequence-based: edit distance, Needleman-Wunsch, affine gap,
Smith-Waterman, Jaro, Jaro-Winkler
Set-based: overlap, Jaccard, TF/IDF
Hybrid: generalized Jaccard, soft TF/IDF, Monge-Elkan
Phonetic: Soundex
Scaling up string matching
Inverted index, size/prefix/position/bound filtering
Acknowledgment