0% found this document useful (0 votes)
17 views66 pages

String Matching

Chapter 4 discusses string matching, focusing on identifying strings that refer to the same real-world entities and the challenges of accuracy and scalability in this process. It outlines various similarity measures, including sequence-based, set-based, hybrid, and phonetic methods, and presents techniques for scaling string matching through filtering methods. The chapter provides detailed explanations of specific measures such as edit distance, Needleman-Wunch, Smith-Waterman, Jaro, and Jaro-Winkler, along with their applications and computational methods.

Uploaded by

Harsh Kumar
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PPT, PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
17 views66 pages

String Matching

Chapter 4 discusses string matching, focusing on identifying strings that refer to the same real-world entities and the challenges of accuracy and scalability in this process. It outlines various similarity measures, including sequence-based, set-based, hybrid, and phonetic methods, and presents techniques for scaling string matching through filtering methods. The chapter provides detailed explanations of specific measures such as edit distance, Needleman-Wunch, Smith-Waterman, Jaro, and Jaro-Winkler, along with their applications and computational methods.

Uploaded by

Harsh Kumar
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PPT, PDF, TXT or read online on Scribd
You are on page 1/ 66

Chapter 4: String Matching

PRINCIPLES OF
DATA INTEGRATION
ANHAI DOAN ALON HALEVY ZACHARY IVES
Introduction

 Find strings that refer to same real-world entities


 “David Smith” and “David R. Smith”
 “1210 W. Dayton St Madison WI” and “1210 West Dayton
Madison WI 53706”
 Play critical roles in many DI tasks
 Schema matching, data matching, information extraction
 This chapter
 Defines the string matching problem
 Describes popular similarity measures
 Discusses how to apply such measures to match a large number
of strings
2
Outline

 Problem description
 Similarity measures
 Sequence-based: edit distance, Needleman-Wunch, affine gap,
Smith-Waterman, Jaro, Jaro-Winkler
 Set-based: overlap, Jaccard, TF/IDF
 Hybrid: generalized Jaccard, soft TF/IDF, Monge-Elkan
 Phonetic: Soundex
 Scaling up string matching
 Inverted index, size filtering, prefix filtering, position filtering,
bound filtering

3
Problem Description

 Given two sets of strings X and Y


 Find all pairs x 2 X and y 2 Y that refer to the same
real-world entity
 We refer to (x,y) as a match
 Example

 Two major challenges: accuracy & scalability


4
Accuracy Challenges

 Matching strings often appear quite differently


 Typing and OCR errors: David Smith vs. Davod Smith
 Different formatting conventions: 10/8 vs. Oct 8
 Custom abbreviation, shortening, or omission:
Daniel Walker Herbert Smith vs. Dan W. Smith
 Different names, nick names: William Smith vs. Bill Smith
 Shuffling parts of strings: Dept. of Computer Science, UW-
Madison vs. Computer Science Dept., UW-Madison

5
Accuracy Challenges

 Solution:
 Use a similarity measure s(x,y) 2 [0,1]
 The higher s(x,y), the more likely that x and y match
 Declare x and y matched if s(x,y) ≥ t
 Distance measure/cost measure have also been used
 Same concept
 But smaller values  higher similarities

6
Scalability Challenges

 Applying s(x,y) to all pairs is impractical


 Quadratic in size of data
 Solution: apply s(x,y) to only most promising pairs, using
a method FindCands
 For each string x 2 X
use method FindCands to find a candidate set Z µ Y
for each string y 2 Z
if s(x,y) ¸ t then return (x,y) as a matched pair
 We discuss ways to implement FindCands later

7
Outline

 Problem description
 Similarity measures
 Sequence-based: edit distance, Needleman-Wunch, affine gap,
Smith-Waterman, Jaro, Jaro-Winkler
 Set-based: overlap, Jaccard, TF/IDF
 Hybrid: generalized Jaccard, soft TF/IDF, Monge-Elkan
 Phonetic: Soundex
 Scaling up string matching
 Inverted index, size filtering, prefix filtering, position filtering,
bound filtering

8
Edit Distance

 Also known as Levenshtein distance


 d(x,y) computes minimal cost of transforming x into y, using a
sequence of operators, each with cost 1
 Delete a character
 Insert a character
 Substitute a character with another
 Example: x = David Smiths, y = Davidd Simth,
d(x,y) = 4, using following sequence
 Inserting a character d (after David)
 Substituting m by i

 Substituting i by m

 Deleting the last character of x, which is s

9
Edit Distance

 Models common editing mistakes


 Inserting an extra character, swapping two characters, etc.
 So smaller edit distance  higher similarity
 Can be converted into a similarity measure
 s(x,y) = 1 - d(x,y) / [max(length(x), length(y))]
 Example
 s(David Smiths, Davidd Simth) = 1 – 4 / max(12, 12) = 0.67

10
Computing Edit Distance using
Dynamic Programming
 Define x = x1x2 xn, y = y1y2 ym
 d(i,j) = edit distance between x1x2 xi and y1y2 yj,
the i-th and j-th prefixes of x and y
 Recurrence equations

11
Example

 x = dva, y = dave
y0 y1 y2 y3 y4 y0 y1 y2 y3 y4
d a v e d a v e x=d–va

x0 0 1 2 3 4 x0 0 1 2 3 4 y=dave
x1 d 1 0 1 x1 d 1 0 1 2 3
substitute a with e
x2 v 2 x2 v 2 1 1 1 2 insert a (after d)
x3 a 3 x3 a 3 2 1 2 2

 Cost of dynamic programming is O(|x||y|)

12
Needleman-Wunch Measure

 Generalizes Levenshtein edit distance


 Basic idea
 defines notion of alignment between x and y
 assigns score to alignment
 return the alignment with highest score
 Alignment: set of correspondences between
characters of x and y, allowing for gaps

13
Scoring an Alignment
 Use a score matrix and a gap penalty
 Example

 alignment score = sum of scores of all correspondences -


sum of penalties of all gaps
 e.g., for the above alignment, it is 2 (for d-d) + 2 (for v-v) -1 (for a-e) -2 (for gap) = 1
 this is the alignment with the highest score, it is returned as the Needleman-Wunch score for dva and deeve.

14
Needleman-Wunch Generalizes
Levenshtein in Three Ways
 Computes similarity scores instead of distance values
 Generalizes edit costs into a score matrix
 allowing for more fine-grained score modeling
 e.g., score(o,0) > score(a,0)
 e.g., different amino-acid pairs may have different semantic
distance
 Generalizes insertion and deletion into gaps, and
generalizes their costs from 1 to Cg

15
Computing Needleman-Wunch
Score with Dynamic
Programming

16
The Affine Gap Measure:
Motivation
 An extension of Needleman-Wunch that handles longer gap
more gracefully
 E.g., “David Smith” vs. “David R. Smith”
 Needleman-Wunch well suited here
 opens gap of length 2 right after “David”
 E.g.,

 Needlement-Wunch not well suited here, gap cost is too high


 If each char corrspondence has score 2, cg = 1, then the above has
score 6*2 – 10 = 2
17
The Affine Gap Measure: Solution

 In practice, gaps tend to be longer than 1 character


 Assigning same penalty to each character unfairly punishes
long gaps
 Solution: define cost of opening a gap vs. cost of
continuing the gap
 cost (gap of length k) = c0 + (k-1)cr
 c0 = cost of opening gap
 cr = cost of continuing gap, c0 > cr
 E.g., “David Smith” vs. “David Richardson Smith”
 c0 = 1, cr = 0.5, alignment cost = 6*2 – 1 - 9*0.5 = 6.5
18
Computing Affine Gap Score
using Dynamic Programming

 The notes detail how these equations are derived

19
The Smith-Waterman Measure:
Motivation
 Previous measures consider global alignments
 attempt to match all characters of x with all characters of y
 Not well suited for some cases
 e.g., “Prof. John R. Smith, Univ of Wisconsin” and “John R.
Smith, Professor”
 similarity score here would be quite low
 Better idea: find two substrings of x and y that are most
similar
 e.g., find “John R. Smith” in the above case  local alignment

20
The Smith-Waterman Measure:
Basic Ideas
 Find the best local alignment between x and y, and
return its score as the score between x and y
 Makes two key changes to Needleman-Wunch
 allows the match to restart at any position in the strings (no
longer limited to just the first position)
 if global match dips below 0, then ignore prefix and restart the match
 after computing matrix using recurrence equation, retracing
the arrows from the largest value in matrix, rather than from
lower-right corner
 this effectively ignores suffixes if the match they produce is not optimal
 retracing ends when we meet a cell with value 0  start of alignment

21
Computing Smith-Waterman
Score using Dynamic
Programming

22
The Jaro Measure

 Mainly for comparing short strings, e.g., first/last names


 To compute jaro(x,y)
 find common characters xi and yj such that
xi = yj and |i-j| · min {|x|,|y|}/2
 intuitively, common characters are identical and positionally
“close to each other”
 if the i-th common character of x does not match the i-th
common character of y, then we have a transposition
 return jaro(x,y) = 1 / 3[c/|x| + c/|y| + (c – t/2)/c], where c is
the number of common characters, and t is the number of
transpositions
23
The Jaro Measure: Examples

 x = jon, y = john
 c = 3 because the common characters are j, o, and n
 t=0
 jaro(x,y) = 1 / 3(3/3 + 3/4 + 3/3) = 0.917
 contrast this to 0.75, the sim score of x and y using edit distance
 x = jon, y = ojhn
 common char sequence in x is jon
 common char sequence in y is ojn
 t=2
 jaro(x,y) = 0.81

24
The Jaro-Winkler Measure

 Captures cases where x and y have a low Jaro score, but


share a prefix  still likely to match
 Computed as
 jaro-winkler(x,y) = (1 – PL*PW)*jaro(x,y) + PL*PW
 PL = length of the longest common prefix
 PW is a weight given to the prefix

25
Outline

 Problem description
 Similarity measures
 Sequence-based: edit distance, Needleman-Wunch, affine gap,
Smith-Waterman, Jaro, Jaro-Winkler
 Set-based: overlap, Jaccard, TF/IDF
 Hybrid: generalized Jaccard, soft TF/IDF, Monge-Elkan
 Phonetic: Soundex
 Scaling up string matching
 Inverted index, size filtering, prefix filtering, position filtering,
bound filtering

26
Set-based Similarity Measures

 View strings as sets or multi-sets of tokens


 Use set-related properties to compute similarity scores
 Common methods to generate tokens
 consider words delimited by space
 possibly stem the words (depending on the application)
 remove common stop words (e.g., the, and, of)

 e.g., given “david smith”  generate tokens “david” and “smith”

 consider q-grams, substrings of length q


 e.g., “david smith”  the set of 3-grams are {##d, #da, dav, avi, …, h##}
 special character # is added to handle the start and end of string

27
The Overlap Measure

 Let Bx = set of tokens generated for string x


 Let By = set of tokens generated for string y
 O(x,y) = |Bx Å By|
 returns the number of common tokens
 E.g., x = dave, y = dav
 Bx = {#d, da, av, ve, e#}, By = {#d, da, av, v#}
 O(x,y) = 3

28
The Jaccard Measure

 J(x,y) = |Bx Å By|/|Bx [ By|


 E.g., x = dave, y = dav
 Bx = {#d, da, av, ve, e#}, By = {#d, da, av, v#}
 J(x,y) = 3/6
 Very commonly used in practice

29
The TF/IDF Measure: Motivation

 uses the TF/IDF notion commonly used in IR


 two strings are similar if they share distinguishing terms
 e.g., x = Apple Corporation, CA
y = IBM Corporation, CA
z = Apple Corp
 s(x,y) > s(x,z) using edit distance or Jaccard measure, so x is
matched with y  incorrect
 TF/IDF measure can recognize that Apple is a distinguishing
term, whereas Corporation and CA are far more common 
correctly match x with z

30
Term Frequencies and
Inverse Document Frequencies
 Assume x and y are taken from a collection of strings
 Each string is coverted into a bag of terms called a
document
 Define term frequency tf(t,d) = number of times term t
appears in document d
 Define inverse document frequency idf(t) = N / Nd,
number of documents in collection devided by number
of documents that contain t
 note: in practice, idf(t) is often defined as log(N / Nd), here we
will use the above simple formula to define idf(t)
31
Example

32
Feature Vectors

 Each document d is converted into a feature vector vd


 vd has a feature vd(t) for each term t
 value of vd(t) is a function of TF and IDF scores
 here we assume vd(t) = tf(t,d) * idf(t)

33
TF/IDF Similarity Score

34
TF/IDF Similarity Score

35
Outline

 Problem description
 Similarity measures
 Sequence-based: edit distance, Needleman-Wunch, affine gap,
Smith-Waterman, Jaro, Jaro-Winkler
 Set-based: overlap, Jaccard, TF/IDF
 Hybrid: generalized Jaccard, soft TF/IDF, Monge-Elkan
 Phonetic: Soundex
 Scaling up string matching
 Inverted index, size filtering, prefix filtering, position filtering,
bound filtering

36
Generalized Jaccard Measure

 Jaccard measure
 considers overlapping tokens in both x and y
 a token from x and a token from y must be identical to be included in
the set of overlapping tokens
 this can be too restrictive in certain cases
 Example:
 matching taxonomic nodes that describe companies
 “Energy & Transportation” vs. “Transportation, Energy, & Gas”
 in theory Jaccard is well suited here, in practice Jaccard may not work
well if tokens are commonly mispelled
 e.g., energy vs. eneryg
 generalized Jaccard measure can help such cases
37
Generalized Jaccard Measure

 Let Bx = {x1, …, xn}, By = {y1, …, ym}


 Step 1: find token pairs that will be in the “softened”
overlap set
 apply a similarity measure s to compute sim score for each pair
(xi, yj)
 keep only those score ¸ a given threshold ®, this forms a
bipartite graph G
 find the maximum-weight matching M in G
 Step 2: return normalized weight of M as generalized
Jaccard score
 GJ(x,y) =  (xi,yj)2 M s(xi,yj) / (|Bx| + |By| - |M|)
38
An Example

 Generalized Jaccard score: (0.7 + 0.9)/(3 + 2 – 2) = 0.53

39
The Soft TF/IDF Measure

 Similar to generalized Jaccard measure, except that it


uses TF/IDF measure as the “higher-level” sim measure
 e.g., “Apple Corporation, CA”, “IBM Corporation, CA”, and
“Aple Corp”, with Apple being mispelt in the last string
 Step 1: compute close(x,y,k): set of all terms t2 Bx that
have at least one close term u2 By, i.e., s’(t,u)¸ k
 s’ is a basic sim measure (e.g., Jaro-Winkler), k prespecified
 Step 2: compute s(x,y) as in traditional TF/IDF score, but
weighing each TF/IDF component using s’
 s(x,y) =  t2 close(x,y,k) vx(t) * vy(u*) * s’(t,u*)
 u*2 By maximizes s’(t,u) 8 u2 By
40
An Example

41
The Monge-Elkan Measure

42
Outline

 Problem description
 Similarity measures
 Sequence-based: edit distance, Needleman-Wunch, affine gap,
Smith-Waterman, Jaro, Jaro-Winkler
 Set-based: overlap, Jaccard, TF/IDF
 Hybrid: generalized Jaccard, soft TF/IDF, Monge-Elkan
 Phonetic: Soundex
 Scaling up string matching
 Inverted index, size filtering, prefix filtering, position filtering,
bound filtering

43
Phonetic Similarity Measures

 Match strings based on their sound, instead of


appearances
 Very effective in matching names, which often appear in
different ways that sound the same
 e.g., Meyer, Meier, and Mire; Smith, Smithe, and Smythe
 Soundex is most commonly used

44
The Soundex Measure
 Used primarily to match surnames
 maps a surname x into a 4-letter code
 two surnames are judged similar if share the same code
 Algorithm to map x into a code:
 Step 1: keep the first letter of x, subsequent steps are performed on the
rest of x
 Step 2: remove all occurences of W and H. Replace the remaining letters
with digits as follows:
 replace B, F, P, V with 1, C, G, J, K, Q, S, X, Z with 2, D, T with 3, L with 4, M, N with
5, R with 6
 Step 3: replace sequence of identical digits by the digit itself
 Step 4: Drop all non-digit letters, return the first four letters as the
soundex code

45
The Soundex Measure

 Example: x = Ashcraft
 after Step 2: A226a13, after Step 3: A26a13, Step 4 converts
this into A2613, then returns A261
 Soundex code is padded with 0 if there is not enough digits
 Example: Robert and Rupert map into R163
 Soundex fails to map Gough and Goff, and Jawornicki and
Yavornitzky
 designed primarily for Caucasian names, but found to work
well for names of many different origins
 does not work well for names of East Asian origins
 which uses vowels to discriminate, Soundex ignores vowels
46
Outline

 Problem description
 Similarity measures
 Sequence-based: edit distance, Needleman-Wunch, affine gap,
Smith-Waterman, Jaro, Jaro-Winkler
 Set-based: overlap, Jaccard, TF/IDF
 Hybrid: generalized Jaccard, soft TF/IDF, Monge-Elkan
 Phonetic: Soundex
 Scaling up string matching
 Inverted index, size filtering, prefix filtering, position filtering,
bound filtering

47
Scalability Challenges
 Applying s(x,y) to all pairs is impractical
 Quadratic in size of data
 Solution: apply s(x,y) to only most promising pairs, using a
method FindCands
 For each string x 2 X
use method FindCands to find a candidate set Z µ Y
for each string y 2 Z
if s(x,y) ¸ t then return (x,y) as a matched pair
 This is often called a blocking solution
 Set Z is often called the umbrella set of x
 We now discuss ways to implement FindCands
 using Jaccard and overlap measures for now

48
Inverted Index over Strings

 Converts each string y\in Y into a document, builds an


inverted index over these documents
 Given term t, use the index to quickly find documents of
Y that contain t

49
Example

50
Limitations

 The inverted list of some terms (e.g., stop words) can be


very long  costly to build and manipulate such lists
 Requires enumerating all pairs of strings that share at
least one term. This set can still be very large in practice.

51
Size Filtering

 Retrieves only strings in Y whose sizes make them match


candidates
 given a string x\in X, infer a constraint on the size of strings in Y
that can possibly match x
 uses a B-tree index to retrieve only strings that satisfy size
constraints
 E.g., for Jaccard measure J(x,y) = |x Å y| / |x [ y|
 assume two strings x and y match if J(x,y) ¸ t
 can show that given a string x2 X, only strings y such that
|x|/t ¸ |y| ¸ |x|*t can possibly match x

52
Example

 Consider x = {lake, mendota}. Suppose t = 0.8


 If y2 Y matches x, we must have
 2/0.8 = 2.5 ¸ |y| ¸ 2* 0.8 = 1.6
 no string in Set Y satisfies this constraint  no match

53
Prefix Filtering

 Key idea: if two sets share many terms  large subsets


of them also share terms
 Consider overlap measure O(x,y) = |x Å y|
 if |x Å y| ¸ k  any subset x’ µ x of size at least |x| - (k – 1)
must overlap y
 To exploit this idea to find pairs (x,y) such that
O(x,y) ¸ k
 given x, construct subset x’ of size |x| - (k – 1)
 use an inverted index to find all y that overlap x’

54
Example

 Consider matching using O(x,y) ¸ 2


 x1 = {lake, mendota}, let x1’ = {lake}
 Use inverted index to find {y4, y6} which contain at least
one token in x1’ 55
Selecting the Subset Intelligently

 Recall that we select a subset x’ of x and check its


overlap with the entire set y
 We can do better by selecting a particular subset x’ and
checking its overlap with only a particular subset y’ of y
 How?
 impose an ordering O over the universe of all possible terms
 e.g., in increasing frequency
 reorder the terms in each x 2 X and y 2 Y according to O
 refer to subset x’ that contains the first n terms of x as the
prefix of size n of x

56
Selecting the Subset Intelligently

 How? (continued)
 can prove that if |x Å y| ¸ k, then x’ and y’ must overlap, where
x’ is the prefix of size |x| - (k – 1) of x and y’ is the prefix of size
|y| - (k – 1) of y (see notes)
 Algorithm
 reorder terms in each x 2 X and y 2 Y in increasing order of
their frequencies
 for each y 2 Y, create y’, the prefix of size |y| - (k – 1) of y
 build an inverted index over all prefixes y’
 for each x 2 X, create x’, the prefix of size |x| - (k – 1) of x, then
use above index to find all y such that x’ overlaps with y’
57
Example

 x = {mendota, lake}  x’ = {mendota}


58
Example

 See the notes for applying prefix filtering to Jaccard


measure
59
Position Filtering

 Further limits the set of candidate matches by deriving


an upper bound on the size of overlap between x and y
 e.g., x = {dane, area, mendota, monona, lake}
y = {research, dane, mendota, monona, lake}
 Suppose we consider J(x,y) ¸ 0.8, in prefix filtering we
consider x’ = {dane, area} and y’ = {research, dane} (see
notes)
 But we can do better than this. Specifically, we can prove
that O(x,y) ¸ [t/(1+t)]*(|x| + |y|) = 4.44 (see notes)
 so can immediately discard the above (x,y) pair

60
Bound Filtering

 Used to optimize the computation of generalized Jaccard


similarity measure
 Recall that
 GJ(x,y) =  (xi,yj)2 M s(xi,yj) / (|Bx| + |By| - |M|)
 Algorithm
 for each (x,y) compute an upper bound UB(x,y) and a lower
bound LB(x,y) on GJ(x,y)
 if UB(x,y) · t  (x,y) can be ignored, it is not a match
if LB(x,y) ¸ t  return (x,y) as a match
otherwise compute GJ(x,y)

61
Computing UB(x,y) and LB(x,y)

 For each xi 2 Bx, find yj 2 By with the highest element-


level similarity, such that s(xi,yj) ¸ ®. Call this set of
pairs S1.
 For each yj 2 By, find xi 2 X with the highest element-
level similarity, such that s(xi,yj) ¸ ®. Call this set of
pairs S2.
 Compute
 UB(x,y) =  (xi,yj)2 S1[ S2 s(xi,yj) / (|Bx| + |By| - |S1 [ S2|)
 LB(x,y) =  (xi,yj)2 S1\ S2 s(xi,yj) / (|Bx| + |By| - |S1 \ S2|)

62
Example

 S1 = {(a,q), (b,q)}, S2 = {(a,p), (b,q)}


 UB(x,y) = (0.8+0.9+0.7+0.9)/(3+2-3) = 1.65
 LB(x,y) = 0.9/(3+2-1) = 0.225

63
Extending Scaling Techniques to
Other Similarity Measures
 Discussed Jaccard and overlap so far
 To extend a technique T to work for a new similarity
measure s(x,y)
 try to translate s(x,y) into constraints on a similarity measure
that already works well with T
 The notes discuss examples that involve edit distance
and TF/IDF

64
Summary
 String matching is pervasive in data integration
 Two key challenges:
 what similarity measure and how to scale up?
 Similarity measures
 Sequence-based: edit distance, Needleman-Wunch, affine gap,
Smith-Waterman, Jaro, Jaro-Winkler
 Set-based: overlap, Jaccard, TF/IDF
 Hybrid: generalized Jaccard, soft TF/IDF, Monge-Elkan
 Phonetic: Soundex
 Scaling up string matching
 Inverted index, size/prefix/position/bound filtering

65
Acknowledgment

 Slides in the scalability section are adapted from


http://pike.psu.edu/p2/wisc09-tech.ppt

66

You might also like