String Matching

Chapter 4 discusses string matching, focusing on identifying strings that refer to the same real-world entities and the challenges of accuracy and scalability in this process. It outlines various similarity measures, including sequence-based, set-based, hybrid, and phonetic methods, and presents techniques for scaling string matching through filtering methods. The chapter provides detailed explanations of specific measures such as edit distance, Needleman-Wunsch, Smith-Waterman, Jaro, and Jaro-Winkler, along with their applications and computational methods.

Uploaded by

Harsh Kumar

Chapter 4: String Matching

PRINCIPLES OF DATA INTEGRATION
AnHai Doan, Alon Halevy, Zachary Ives

Introduction

• Find strings that refer to the same real-world entities
  • “David Smith” and “David R. Smith”
  • “1210 W. Dayton St Madison WI” and “1210 West Dayton Madison WI 53706”
• String matching plays a critical role in many data integration (DI) tasks
  • Schema matching, data matching, information extraction
• This chapter
  • Defines the string matching problem
  • Describes popular similarity measures
  • Discusses how to apply such measures to match a large number of strings
2
Outline

• Problem description
• Similarity measures
  • Sequence-based: edit distance, Needleman-Wunsch, affine gap, Smith-Waterman, Jaro, Jaro-Winkler
  • Set-based: overlap, Jaccard, TF/IDF
  • Hybrid: generalized Jaccard, soft TF/IDF, Monge-Elkan
  • Phonetic: Soundex
• Scaling up string matching
  • Inverted index, size filtering, prefix filtering, position filtering, bound filtering

3
Problem Description

• Given two sets of strings X and Y
  • Find all pairs x ∈ X and y ∈ Y that refer to the same real-world entity
  • We refer to (x, y) as a match
• Example
• Two major challenges: accuracy and scalability

4
Accuracy Challenges

• Matching strings often appear quite different
  • Typing and OCR errors: David Smith vs. Davod Smith
  • Different formatting conventions: 10/8 vs. Oct 8
  • Custom abbreviation, shortening, or omission: Daniel Walker Herbert Smith vs. Dan W. Smith
  • Different names and nicknames: William Smith vs. Bill Smith
  • Shuffling parts of strings: Dept. of Computer Science, UW-Madison vs. Computer Science Dept., UW-Madison

5
Accuracy Challenges

• Solution:
  • Use a similarity measure s(x, y) ∈ [0, 1]
  • The higher s(x, y), the more likely that x and y match
  • Declare x and y matched if s(x, y) ≥ t, for a threshold t
• Distance and cost measures have also been used
  • Same concept
  • But smaller values → higher similarities

6
Scalability Challenges

• Applying s(x, y) to all pairs is impractical
  • Quadratic in the size of the data
• Solution: apply s(x, y) to only the most promising pairs, using a method FindCands
  • For each string x ∈ X
      use method FindCands to find a candidate set Z ⊆ Y
      for each string y ∈ Z
        if s(x, y) ≥ t then return (x, y) as a matched pair
• We discuss ways to implement FindCands later

7
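The FindCands loop on the Scalability Challenges slide can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the chapter's implementation: the names `match_strings` and `find_cands`, and the default that simply returns all of Y, are hypothetical choices made for the example.

```python
from typing import Callable, Iterable, List, Tuple


def match_strings(
    X: Iterable[str],
    Y: List[str],
    sim: Callable[[str, str], float],
    t: float,
    find_cands: Callable[[str, List[str]], Iterable[str]] = lambda x, Y: Y,
) -> List[Tuple[str, str]]:
    """Return all pairs (x, y) with sim(x, y) >= t, scoring only candidates."""
    matches = []
    for x in X:
        # Z is the candidate set for x; the default is all of Y (no filtering).
        for y in find_cands(x, Y):
            if sim(x, y) >= t:
                matches.append((x, y))
    return matches


# Toy similarity for the demo: 1.0 if equal ignoring case, else 0.0.
exact = lambda a, b: 1.0 if a.lower() == b.lower() else 0.0
pairs = match_strings(["David Smith"], ["david smith", "Bill Smith"], exact, 0.5)
```

Passing a cheaper `find_cands` (for example one backed by an inverted index) leaves the loop unchanged and only shrinks the set of pairs that are scored.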
Edit Distance

• Also known as Levenshtein distance
• d(x, y) computes the minimal cost of transforming x into y, using a sequence of operators, each with cost 1
  • Delete a character
  • Insert a character
  • Substitute a character with another
• Example: x = David Smiths, y = Davidd Simth, d(x, y) = 4, using the following sequence
  • Inserting a character d (after David)
  • Substituting m by i
  • Substituting i by m
  • Deleting the last character of x, which is s

9
Edit Distance

• Models common editing mistakes
  • Inserting an extra character, swapping two characters, etc.
  • So smaller edit distance → higher similarity
• Can be converted into a similarity measure
  • s(x, y) = 1 - d(x, y) / max(length(x), length(y))
• Example
  • s(David Smiths, Davidd Simth) = 1 - 4 / max(12, 12) = 0.67

10
Computing Edit Distance using Dynamic Programming

• Define x = x1x2…xn, y = y1y2…ym
• d(i, j) = edit distance between x1x2…xi and y1y2…yj, the i-th and j-th prefixes of x and y
• Recurrence equations

11
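The recurrence for d(i, j) is commonly written as below. This is the standard Levenshtein formulation with unit costs; the helper symbol c(x_i, y_j) is notation introduced here for the substitution cost.

```latex
\begin{aligned}
d(0,0) &= 0, \qquad d(i,0) = i, \qquad d(0,j) = j, \\
d(i,j) &= \min
\begin{cases}
d(i-1,\,j-1) + c(x_i, y_j) & \text{(copy or substitute)} \\
d(i-1,\,j) + 1 & \text{(delete } x_i\text{)} \\
d(i,\,j-1) + 1 & \text{(insert } y_j\text{)}
\end{cases}
\\
c(x_i, y_j) &=
\begin{cases}
0 & \text{if } x_i = y_j \\
1 & \text{otherwise}
\end{cases}
\end{aligned}
```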
Example

• x = dva, y = dave

          y0   y1   y2   y3   y4
               d    a    v    e
  x0      0    1    2    3    4
  x1  d   1    0    1    2    3
  x2  v   2    1    1    1    2
  x3  a   3    2    1    2    2

• Resulting alignment: x = d-va, y = dave
  • insert a (after d)
  • substitute a with e
• Cost of dynamic programming is O(|x||y|)

12
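The dynamic program above can be sketched in Python as follows. This is a minimal version that keeps only two rows of the table; the function names `edit_distance` and `edit_similarity` are chosen here for illustration.

```python
def edit_distance(x: str, y: str) -> int:
    """Levenshtein distance via dynamic programming, O(|x||y|) time."""
    # prev[j] holds d(i-1, j); curr[j] holds d(i, j).
    prev = list(range(len(y) + 1))
    for i in range(1, len(x) + 1):
        curr = [i] + [0] * len(y)
        for j in range(1, len(y) + 1):
            sub_cost = 0 if x[i - 1] == y[j - 1] else 1
            curr[j] = min(
                prev[j - 1] + sub_cost,  # copy or substitute
                prev[j] + 1,             # delete x[i-1]
                curr[j - 1] + 1,         # insert y[j-1]
            )
        prev = curr
    return prev[-1]


def edit_similarity(x: str, y: str) -> float:
    """s(x, y) = 1 - d(x, y) / max(len(x), len(y))."""
    longest = max(len(x), len(y))
    return 1.0 if longest == 0 else 1.0 - edit_distance(x, y) / longest


print(edit_distance("dva", "dave"))                                 # 2
print(round(edit_similarity("David Smiths", "Davidd Simth"), 2))    # 0.67
```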
Needleman-Wunsch Measure

• Generalizes Levenshtein edit distance
• Basic idea
  • defines a notion of alignment between x and y
  • assigns a score to each alignment
  • returns the alignment with the highest score
• Alignment: a set of correspondences between characters of x and y, allowing for gaps

13
Scoring an Alignment

• Use a score matrix and a gap penalty
• Example: aligning x = dva with y = deeve as d--va / deeve
• alignment score = sum of scores of all correspondences - sum of penalties of all gaps
  • e.g., for the above alignment, it is 2 (for d-d) + 2 (for v-v) - 1 (for a-e) - 2 (for the gap) = 1
  • this is the alignment with the highest score, so 1 is returned as the Needleman-Wunsch score for dva and deeve

14
Needleman-Wunsch Generalizes Levenshtein in Three Ways

• Computes similarity scores instead of distance values
• Generalizes edit costs into a score matrix
  • allowing for more fine-grained score modeling
  • e.g., score(o, 0) > score(a, 0)
  • e.g., different amino-acid pairs may have different semantic distance
• Generalizes insertion and deletion into gaps, and generalizes their costs from 1 to cg

15
Computing Needleman-Wunsch Score with Dynamic Programming

16
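A minimal Python sketch of the Needleman-Wunsch dynamic program follows. It assumes a score matrix that gives 2 for identical characters and -1 otherwise, and a gap penalty cg = 1; these values are inferred from the worked dva/deeve example and are not stated as a matrix in the text. The function name and parameters are illustrative.

```python
def needleman_wunsch(x, y, score=None, gap=1.0):
    """Score of the best global alignment of x and y."""
    if score is None:
        # Assumed score matrix: 2 for a match, -1 for a mismatch.
        score = lambda a, b: 2.0 if a == b else -1.0
    # Row 0: aligning the empty prefix of x with y[:j] costs j gap characters.
    prev = [-gap * j for j in range(len(y) + 1)]
    for i in range(1, len(x) + 1):
        curr = [-gap * i] + [0.0] * len(y)
        for j in range(1, len(y) + 1):
            curr[j] = max(
                prev[j - 1] + score(x[i - 1], y[j - 1]),  # align x[i-1] with y[j-1]
                prev[j] - gap,                            # x[i-1] against a gap
                curr[j - 1] - gap,                        # y[j-1] against a gap
            )
        prev = curr
    return prev[-1]


print(needleman_wunsch("dva", "deeve"))  # 1.0
```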
The Affine Gap Measure: Motivation

• An extension of Needleman-Wunsch that handles longer gaps more gracefully
• E.g., “David Smith” vs. “David R. Smith”
  • Needleman-Wunsch is well suited here
  • opens a gap of length 2 right after “David”
• E.g.,
  • Needleman-Wunsch is not well suited here; the gap cost is too high
  • If each character correspondence has score 2 and cg = 1, then the above has score 6*2 - 10 = 2
17
The Affine Gap Measure: Solution

• In practice, gaps tend to be longer than 1 character
• Assigning the same penalty to each character unfairly punishes long gaps
• Solution: define the cost of opening a gap vs. the cost of continuing the gap
  • cost(gap of length k) = c0 + (k - 1)cr
  • c0 = cost of opening a gap
  • cr = cost of continuing a gap, c0 > cr
• E.g., “David Smith” vs. “David Richardson Smith”
  • c0 = 1, cr = 0.5, alignment score = 6*2 - 1 - 9*0.5 = 6.5
18
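The difference between the two gap-cost models can be checked with a few lines of Python. The sketch below only reproduces the arithmetic of the two examples (6 correspondences of score 2 and one gap of length 10); the function names are illustrative and it does not compute an alignment.

```python
def linear_gap_cost(k: int, c_g: float = 1.0) -> float:
    """Needleman-Wunsch style: every gap character costs c_g."""
    return k * c_g


def affine_gap_cost(k: int, c_open: float = 1.0, c_cont: float = 0.5) -> float:
    """Affine model: c_open for the first gap character, c_cont for each further one."""
    return 0.0 if k == 0 else c_open + (k - 1) * c_cont


matched = 6 * 2  # six character correspondences, each with score 2
print(matched - linear_gap_cost(10))  # 2.0
print(matched - affine_gap_cost(10))  # 6.5
```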
Computing Affine Gap Score using Dynamic Programming

• The notes detail how these equations are derived

19
The Smith-Waterman Measure: Motivation

• Previous measures consider global alignments
  • they attempt to match all characters of x with all characters of y
• Not well suited for some cases
  • e.g., “Prof. John R. Smith, Univ of Wisconsin” and “John R. Smith, Professor”
  • the similarity score here would be quite low
• Better idea: find two substrings of x and y that are most similar
  • e.g., find “John R. Smith” in the above case → local alignment

20
The Smith-Waterman Measure: Basic Ideas

• Find the best local alignment between x and y, and return its score as the score between x and y
• Makes two key changes to Needleman-Wunsch
  • allows the match to restart at any position in the strings (no longer limited to just the first position)
    • if the global match dips below 0, then ignore the prefix and restart the match
  • after computing the matrix using the recurrence equation, retraces the arrows from the largest value in the matrix, rather than from the lower-right corner
    • this effectively ignores suffixes if the match they produce is not optimal
    • retracing ends when we meet a cell with value 0 → start of alignment

21
Computing Smith-Waterman Score using Dynamic Programming

22
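The two changes relative to Needleman-Wunsch show up directly in code: every cell is clamped at 0 (restart), and the result is the largest cell rather than the lower-right one. The sketch below assumes a match score of 2, a mismatch score of -1, and a gap penalty of 1; these parameters and the function name are illustrative choices.

```python
def smith_waterman(x, y, match=2, mismatch=-1, gap=1):
    """Score of the best local alignment between substrings of x and y."""
    best = 0
    prev = [0] * (len(y) + 1)  # first row is all zeros: a match may start anywhere
    for i in range(1, len(x) + 1):
        curr = [0] * (len(y) + 1)
        for j in range(1, len(y) + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            curr[j] = max(
                0,                # restart the alignment here
                prev[j - 1] + s,  # extend with a correspondence
                prev[j] - gap,    # gap in y
                curr[j - 1] - gap,  # gap in x
            )
            best = max(best, curr[j])  # track the largest cell in the matrix
        prev = curr
    return best


print(smith_waterman("avd", "dave"))  # 4, from the local alignment "av" / "av"
```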
The Jaro Measure

• Mainly for comparing short strings, e.g., first/last names
• To compute jaro(x, y)
  • find common characters xi and yj such that xi = yj and |i - j| ≤ min{|x|, |y|}/2
    • intuitively, common characters are identical and positionally “close to each other”
  • if the i-th common character of x does not match the i-th common character of y, then we have a transposition
  • return jaro(x, y) = 1/3 [c/|x| + c/|y| + (c - t/2)/c], where c is the number of common characters and t is the number of transpositions
23
The Jaro Measure: Examples

• x = jon, y = john
  • c = 3 because the common characters are j, o, and n
  • t = 0
  • jaro(x, y) = 1/3 (3/3 + 3/4 + 3/3) = 0.917
  • contrast this with 0.75, the similarity score of x and y using edit distance
• x = jon, y = ojhn
  • common character sequence in x is jon
  • common character sequence in y is ojn
  • t = 2
  • jaro(x, y) = 1/3 (3/3 + 3/4 + (3 - 2/2)/3) = 0.81

24
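A minimal Python sketch of the Jaro computation follows. It uses the closeness window exactly as defined above, |i - j| ≤ min{|x|, |y|}/2, which differs from the window used in some library implementations; the greedy left-to-right pairing of common characters is an assumption of this sketch.

```python
def jaro(x: str, y: str) -> float:
    """Jaro similarity with the window |i - j| <= min(|x|, |y|) / 2."""
    if not x or not y:
        return 0.0
    window = min(len(x), len(y)) / 2
    used_y = [False] * len(y)
    common_x = []
    for i, ch in enumerate(x):
        # Pair x[i] with the first unused, identical, close-enough character of y.
        for j, other in enumerate(y):
            if not used_y[j] and ch == other and abs(i - j) <= window:
                used_y[j] = True
                common_x.append(ch)
                break
    common_y = [y[j] for j in range(len(y)) if used_y[j]]
    c = len(common_x)
    if c == 0:
        return 0.0
    # A transposition is a position where the two common sequences disagree.
    t = sum(a != b for a, b in zip(common_x, common_y))
    return (c / len(x) + c / len(y) + (c - t / 2) / c) / 3


print(round(jaro("jon", "john"), 3))  # 0.917
print(round(jaro("jon", "ojhn"), 2))  # 0.81
```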
The Jaro-Winkler Measure

• Captures cases where x and y have a low Jaro score, but share a prefix → still likely to match
• Computed as
  • jaro-winkler(x, y) = (1 - PL*PW)*jaro(x, y) + PL*PW
  • PL = length of the longest common prefix
  • PW is a weight given to the prefix

25
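The adjustment can be sketched as a small function that takes an already computed Jaro score. The defaults PW = 0.1 and a prefix length capped at 4 are common choices in practice, not values given above; they keep PL*PW ≤ 1 so the result stays in [0, 1]. The function name and signature are illustrative.

```python
def jaro_winkler(jaro_score: float, x: str, y: str,
                 pw: float = 0.1, max_prefix: int = 4) -> float:
    """Boost a Jaro score according to the length of the common prefix."""
    pl = 0
    for a, b in zip(x, y):
        if a != b or pl == max_prefix:
            break
        pl += 1
    return (1 - pl * pw) * jaro_score + pl * pw


# jon / john share the prefix "jo", so PL = 2.
boosted = jaro_winkler(0.917, "jon", "john")
# jon / ojhn share no prefix, so the Jaro score is returned unchanged.
unchanged = jaro_winkler(0.81, "jon", "ojhn")
```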
Set-based Similarity Measures

• View strings as sets or multi-sets of tokens
• Use set-related properties to compute similarity scores
• Common methods to generate tokens
  • consider words delimited by space
    • possibly stem the words (depending on the application)
    • remove common stop words (e.g., the, and, of)
    • e.g., given “david smith” → generate tokens “david” and “smith”
  • consider q-grams, substrings of length q
    • e.g., “david smith” → the set of 3-grams is {##d, #da, dav, avi, …, h##}
    • the special character # is added to handle the start and end of the string

27
The Overlap Measure

• Let Bx = set of tokens generated for string x
• Let By = set of tokens generated for string y
• O(x, y) = |Bx ∩ By|
  • returns the number of common tokens
• E.g., x = dave, y = dav
  • Bx = {#d, da, av, ve, e#}, By = {#d, da, av, v#}
  • O(x, y) = 3

28
The Jaccard Measure

• J(x, y) = |Bx ∩ By| / |Bx ∪ By|
• E.g., x = dave, y = dav
  • Bx = {#d, da, av, ve, e#}, By = {#d, da, av, v#}
  • J(x, y) = 3/6 = 0.5
• Very commonly used in practice

29
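The q-gram tokenization and the overlap and Jaccard measures fit in a few lines of Python. In this sketch each string is padded with q - 1 copies of #, which reproduces both the 2-gram and the 3-gram examples above; the function names are illustrative.

```python
def qgrams(s: str, q: int = 2, pad: str = "#") -> set:
    """Set of q-grams of s, padded with q-1 pad characters on each side."""
    padded = pad * (q - 1) + s + pad * (q - 1)
    return {padded[i:i + q] for i in range(len(padded) - q + 1)}


def overlap(bx: set, by: set) -> int:
    """O(x, y) = |Bx ∩ By|."""
    return len(bx & by)


def jaccard(bx: set, by: set) -> float:
    """J(x, y) = |Bx ∩ By| / |Bx ∪ By| (defined as 1.0 for two empty sets)."""
    union = bx | by
    return len(bx & by) / len(union) if union else 1.0


bx, by = qgrams("dave"), qgrams("dav")
print(overlap(bx, by))  # 3
print(jaccard(bx, by))  # 0.5
```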
The TF/IDF Measure: Motivation

• Uses the TF/IDF notion commonly used in information retrieval (IR)
  • two strings are similar if they share distinguishing terms
• e.g., x = Apple Corporation, CA
        y = IBM Corporation, CA
        z = Apple Corp
  • s(x, y) > s(x, z) using edit distance or the Jaccard measure, so x is matched with y → incorrect
  • the TF/IDF measure can recognize that Apple is a distinguishing term, whereas Corporation and CA are far more common → correctly matches x with z

30
Term Frequencies and Inverse Document Frequencies

• Assume x and y are taken from a collection of strings
• Each string is converted into a bag of terms called a document
• Define term frequency tf(t, d) = number of times term t appears in document d
• Define inverse document frequency idf(t) = N / Nd, the number of documents in the collection divided by the number of documents that contain t
  • note: in practice, idf(t) is often defined as log(N / Nd); here we use the simple formula above to define idf(t)
31
Example

32
Feature Vectors

• Each document d is converted into a feature vector vd
• vd has a feature vd(t) for each term t
  • the value of vd(t) is a function of the TF and IDF scores
  • here we assume vd(t) = tf(t, d) * idf(t)

33
TF/IDF Similarity Score

34
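A TF/IDF similarity score is commonly computed as the cosine of the angle between the two feature vectors, s(x, y) = Σ_t vx(t)·vy(t) / (‖vx‖·‖vy‖). The sketch below assumes that definition, together with vd(t) = tf(t, d) * idf(t) and idf(t) = N / Nd as defined earlier. The six-string collection is hypothetical, chosen so that Corporation and CA are common terms.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """One {term: tf * idf} dict per document, with idf(t) = N / N_d."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return [{t: tf * (n / df[t]) for t, tf in Counter(doc).items()} for doc in docs]


def cosine(vx, vy):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * vy.get(t, 0.0) for t, w in vx.items())
    norm = math.sqrt(sum(w * w for w in vx.values())) * \
        math.sqrt(sum(w * w for w in vy.values()))
    return dot / norm if norm else 0.0


# Hypothetical collection; only the first three strings come from the text.
strings = ["Apple Corporation, CA", "IBM Corporation, CA", "Apple Corp",
           "Dell Corporation, CA", "HP Corporation, CA", "Oracle Corporation, CA"]
docs = [s.lower().replace(",", "").split() for s in strings]
vx, vy, vz = tfidf_vectors(docs)[:3]

s_xy = cosine(vx, vy)  # x and y share only the common terms
s_xz = cosine(vx, vz)  # x and z share the distinguishing term "apple"
```

With this collection s_xz exceeds s_xy, so x is matched with z rather than y.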
TF/IDF Similarity Score

35
Generalized Jaccard Measure

• Jaccard measure
  • considers overlapping tokens in both x and y
  • a token from x and a token from y must be identical to be included in the set of overlapping tokens
  • this can be too restrictive in certain cases
• Example:
  • matching taxonomic nodes that describe companies
  • “Energy & Transportation” vs. “Transportation, Energy, & Gas”
  • in theory Jaccard is well suited here; in practice it may not work well if tokens are commonly misspelled
    • e.g., energy vs. eneryg
  • the generalized Jaccard measure can help in such cases
37
Generalized Jaccard Measure

• Let Bx = {x1, …, xn}, By = {y1, …, ym}
• Step 1: find token pairs that will be in the “softened” overlap set
  • apply a similarity measure s to compute a similarity score for each pair (xi, yj)
  • keep only those pairs with score ≥ a given threshold α; this forms a bipartite graph G
  • find the maximum-weight matching M in G
• Step 2: return the normalized weight of M as the generalized Jaccard score
  • GJ(x, y) = Σ_{(xi, yj) ∈ M} s(xi, yj) / (|Bx| + |By| - |M|)
38
An Example

• Generalized Jaccard score: (0.7 + 0.9)/(3 + 2 - 2) = 0.53

39
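The two steps can be sketched in Python as below. To stay dependency-free, the maximum-weight matching is found by brute force over all assignments, which is only reasonable for small token sets; a real implementation would use a proper assignment algorithm. The token names and pairwise scores are hypothetical values chosen to reproduce the arithmetic (0.7 + 0.9)/(3 + 2 - 2).

```python
from itertools import permutations


def generalized_jaccard(bx, by, sim, alpha):
    """Generalized Jaccard score; brute-force matching for small token lists."""
    if not bx and not by:
        return 1.0
    small, large = (bx, by) if len(bx) <= len(by) else (by, bx)
    swapped = len(bx) > len(by)
    best_weight, best_size = 0.0, 0
    # Try every injective assignment of the smaller set into the larger one.
    for assignment in permutations(large, len(small)):
        weight, size = 0.0, 0
        for s_tok, l_tok in zip(small, assignment):
            score = sim(l_tok, s_tok) if swapped else sim(s_tok, l_tok)
            if score >= alpha:  # edges below the threshold are not in G
                weight += score
                size += 1
        if weight > best_weight:
            best_weight, best_size = weight, size
    return best_weight / (len(bx) + len(by) - best_size)


# Hypothetical element-level scores; unlisted pairs score 0.
scores = {("a", "p"): 0.7, ("a", "q"): 0.8, ("b", "q"): 0.9}
sim = lambda s, t: scores.get((s, t), 0.0)
gj = generalized_jaccard(["a", "b", "c"], ["p", "q"], sim, alpha=0.5)
print(round(gj, 2))  # 0.53
```

The matching picks (a, p) and (b, q) rather than the single best edge for a, because the total weight 0.7 + 0.9 beats 0.8 alone.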
The Soft TF/IDF Measure

• Similar to the generalized Jaccard measure, except that it uses the TF/IDF measure as the “higher-level” similarity measure
  • e.g., “Apple Corporation, CA”, “IBM Corporation, CA”, and “Aple Corp”, with Apple being misspelled in the last string
• Step 1: compute close(x, y, k): the set of all terms t ∈ Bx that have at least one close term u ∈ By, i.e., s’(t, u) ≥ k
  • s’ is a basic similarity measure (e.g., Jaro-Winkler), and k is prespecified
• Step 2: compute s(x, y) as in the traditional TF/IDF score, but weighting each TF/IDF component using s’
  • s(x, y) = Σ_{t ∈ close(x, y, k)} vx(t) * vy(u*) * s’(t, u*)
  • u* ∈ By maximizes s’(t, u) over all u ∈ By
40
An Example

41
The Monge-Elkan Measure

42
Phonetic Similarity Measures

• Match strings based on their sound, instead of their appearance
• Very effective in matching names, which often appear in different ways that sound the same
  • e.g., Meyer, Meier, and Mire; Smith, Smithe, and Smythe
• Soundex is most commonly used

44
The Soundex Measure

• Used primarily to match surnames
  • maps a surname x into a 4-character code
  • two surnames are judged similar if they share the same code
• Algorithm to map x into a code:
  • Step 1: keep the first letter of x; subsequent steps are performed on the rest of x
  • Step 2: remove all occurrences of W and H. Replace the remaining letters with digits as follows:
    • replace B, F, P, V with 1; C, G, J, K, Q, S, X, Z with 2; D, T with 3; L with 4; M, N with 5; R with 6
  • Step 3: replace each sequence of identical digits by the digit itself
  • Step 4: drop all non-digit letters, and return the first four characters as the Soundex code

45
The Soundex Measure

• Example: x = Ashcraft
  • after Step 2: A226a13; after Step 3: A26a13; Step 4 converts this into A2613, then returns A261
• The Soundex code is padded with 0 if there are not enough digits
• Example: Robert and Rupert both map into R163
• Soundex fails to map Gough and Goff, and Jawornicki and Yavornitzky, into the same code
• Designed primarily for Caucasian names, but found to work well for names of many different origins
• Does not work well for names of East Asian origin
  • these names use vowels to discriminate, and Soundex ignores vowels
46
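The four steps translate almost line for line into Python. This sketch follows the steps exactly as listed above; some Soundex variants treat the first letter's own code differently, so results can differ from library implementations for a few names. The function and table names are illustrative.

```python
import re

# Step 2 mapping from letters to digits.
CODES = {
    **dict.fromkeys("BFPV", "1"),
    **dict.fromkeys("CGJKQSXZ", "2"),
    **dict.fromkeys("DT", "3"),
    "L": "4",
    **dict.fromkeys("MN", "5"),
    "R": "6",
}


def soundex(name: str) -> str:
    """Soundex code following the four steps listed above."""
    name = name.upper()
    first, rest = name[0], name[1:]                      # Step 1
    rest = rest.replace("W", "").replace("H", "")        # Step 2: drop W and H
    encoded = "".join(CODES.get(ch, ch) for ch in rest)  # Step 2: letters to digits
    collapsed = re.sub(r"(\d)\1+", r"\1", encoded)       # Step 3: collapse runs
    digits = "".join(ch for ch in collapsed if ch.isdigit())  # Step 4: drop letters
    return (first + digits + "000")[:4]                  # pad with 0, keep 4 chars


print(soundex("Ashcraft"))                  # A261
print(soundex("Robert"), soundex("Rupert"))  # R163 R163
print(soundex("Gough"), soundex("Goff"))     # G200 G100
```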
Scalability Challenges

• Applying s(x, y) to all pairs is impractical
  • Quadratic in the size of the data
• Solution: apply s(x, y) to only the most promising pairs, using a method FindCands
  • For each string x ∈ X
      use method FindCands to find a candidate set Z ⊆ Y
      for each string y ∈ Z
        if s(x, y) ≥ t then return (x, y) as a matched pair
  • This is often called a blocking solution
  • Set Z is often called the umbrella set of x
• We now discuss ways to implement FindCands
  • using Jaccard and overlap measures for now

48
Inverted Index over Strings

• Converts each string y ∈ Y into a document, builds an inverted index over these documents
• Given a term t, use the index to quickly find the documents of Y that contain t

49
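An inverted index over Y and the corresponding FindCands lookup can be sketched as follows. The strings in Y are hypothetical, tokens are whitespace-delimited words, and the function names are illustrative.

```python
from collections import defaultdict


def build_inverted_index(Y):
    """Map each term to the set of ids of strings in Y that contain it."""
    index = defaultdict(set)
    for yid, y in enumerate(Y):
        for term in y.split():
            index[term].add(yid)
    return index


def find_cands(x, index):
    """Ids of all strings in Y that share at least one term with x."""
    cands = set()
    for term in x.split():
        cands |= index.get(term, set())
    return cands


Y = ["lake monona", "dane county", "lake mendota area"]  # hypothetical data
index = build_inverted_index(Y)
print(find_cands("lake mendota", index))  # {0, 2}
```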
Example

50
Limitations

• The inverted list of some terms (e.g., stop words) can be very long → costly to build and manipulate such lists
• Requires enumerating all pairs of strings that share at least one term. This set can still be very large in practice.

51
Size Filtering

• Retrieves only strings in Y whose sizes make them match candidates
  • given a string x ∈ X, infer a constraint on the size of strings in Y that can possibly match x
  • uses a B-tree index to retrieve only strings that satisfy the size constraint
• E.g., for the Jaccard measure J(x, y) = |x ∩ y| / |x ∪ y|
  • assume two strings x and y match if J(x, y) ≥ t
  • can show that given a string x ∈ X, only strings y such that |x|/t ≥ |y| ≥ |x|*t can possibly match x

52
Example

• Consider x = {lake, mendota}. Suppose t = 0.8
• If y ∈ Y matches x, we must have
  • 2/0.8 = 2.5 ≥ |y| ≥ 2*0.8 = 1.6
  • no string in set Y satisfies this constraint → no match

53
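The size constraint can be sketched as a simple filter. The bounds are rounded to integers because set sizes are integral; a linear scan stands in for the B-tree index, and the set Y is hypothetical (every string has at least three tokens, so none can match x). Floating-point rounding at the boundaries is ignored in this sketch.

```python
import math


def size_bounds(x_size: int, t: float):
    """Smallest and largest |y| that can satisfy J(x, y) >= t."""
    return math.ceil(x_size * t), math.floor(x_size / t)


def size_filter(x: set, Y: list, t: float) -> list:
    """Keep only the sets in Y whose size is compatible with x."""
    lo, hi = size_bounds(len(x), t)
    return [y for y in Y if lo <= len(y) <= hi]


x = {"lake", "mendota"}
Y = [{"lake", "monona", "area"}, {"lake", "mendota", "monona", "dane"}]  # hypothetical
print(size_bounds(len(x), 0.8))  # (2, 2)
print(size_filter(x, Y, 0.8))    # []
```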
Prefix Filtering

• Key idea: if two sets share many terms → large subsets of them also share terms
• Consider the overlap measure O(x, y) = |x ∩ y|
  • if |x ∩ y| ≥ k → any subset x’ ⊆ x of size at least |x| - (k - 1) must overlap y
• To exploit this idea to find pairs (x, y) such that O(x, y) ≥ k
  • given x, construct a subset x’ of size |x| - (k - 1)
  • use an inverted index to find all y that overlap x’

54
Example

• Consider matching using O(x, y) ≥ 2
• x1 = {lake, mendota}, let x1’ = {lake}
• Use the inverted index to find {y4, y6}, which contain at least one token in x1’

55
Selecting the Subset Intelligently

• Recall that we select a subset x’ of x and check its overlap with the entire set y
• We can do better by selecting a particular subset x’ and checking its overlap with only a particular subset y’ of y
• How?
  • impose an ordering O over the universe of all possible terms
    • e.g., in increasing frequency
  • reorder the terms in each x ∈ X and y ∈ Y according to O
  • refer to the subset x’ that contains the first n terms of x as the prefix of size n of x

56
Selecting the Subset Intelligently

• How? (continued)
  • can prove that if |x ∩ y| ≥ k, then x’ and y’ must overlap, where x’ is the prefix of size |x| - (k - 1) of x and y’ is the prefix of size |y| - (k - 1) of y (see notes)
• Algorithm
  • reorder the terms in each x ∈ X and y ∈ Y in increasing order of their frequencies
  • for each y ∈ Y, create y’, the prefix of size |y| - (k - 1) of y
  • build an inverted index over all prefixes y’
  • for each x ∈ X, create x’, the prefix of size |x| - (k - 1) of x, then use the above index to find all y such that x’ overlaps with y’
57
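The prefix-filtering algorithm can be sketched in Python as below. Term frequencies are counted over X and Y together and ties are broken alphabetically so that the ordering is total; both are assumptions of this sketch, as are the function name and the toy data.

```python
from collections import Counter, defaultdict


def prefix_filter_candidates(X, Y, k):
    """For each x in X, ids of the y in Y whose prefix overlaps the prefix of x."""
    freq = Counter(tok for s in list(X) + list(Y) for tok in set(s))
    order = lambda tok: (freq[tok], tok)  # increasing frequency, ties by name

    def prefix(s):
        ordered = sorted(set(s), key=order)
        return ordered[:max(len(ordered) - (k - 1), 0)]

    # Inverted index over the prefixes y' only.
    index = defaultdict(set)
    for yid, y in enumerate(Y):
        for tok in prefix(y):
            index[tok].add(yid)

    cands = {}
    for xid, x in enumerate(X):
        found = set()
        for tok in prefix(x):
            found |= index.get(tok, set())
        cands[xid] = found
    return cands


X = [{"lake", "mendota"}]
Y = [{"lake", "monona", "area"},
     {"lake", "mendota", "monona", "dane"},
     {"dane", "area", "mendota"}]
cands = prefix_filter_candidates(X, Y, k=2)
print(cands)  # {0: {1}}
```

Only the second set in Y shares two terms with x, and it is the only candidate returned; the other two are pruned without computing their overlap.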
Example

• x = {mendota, lake} → x’ = {mendota}

58
Example

• See the notes for applying prefix filtering to the Jaccard measure
59
Position Filtering

• Further limits the set of candidate matches by deriving an upper bound on the size of the overlap between x and y
  • e.g., x = {dane, area, mendota, monona, lake}
          y = {research, dane, mendota, monona, lake}
• Suppose we consider J(x, y) ≥ 0.8; in prefix filtering we consider x’ = {dane, area} and y’ = {research, dane} (see notes)
• But we can do better than this. Specifically, we can prove that a match requires O(x, y) ≥ [t/(1 + t)]*(|x| + |y|) = 4.44 (see notes)
  • the prefixes overlap only at dane, after which x has 4 remaining terms and y has 3, so O(x, y) ≤ 1 + min(4, 3) = 4 < 4.44
  • so we can immediately discard the above (x, y) pair

60
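The overlap threshold used in position filtering follows from the definition of the Jaccard measure by a short rearrangement, written here with O = O(x, y) = |x ∩ y| and |x ∪ y| = |x| + |y| - O:

```latex
\begin{aligned}
J(x,y) = \frac{O}{|x| + |y| - O} \ge t
&\iff O \ge t\,(|x| + |y|) - t\,O \\
&\iff O\,(1 + t) \ge t\,(|x| + |y|) \\
&\iff O \ge \frac{t}{1+t}\,(|x| + |y|).
\end{aligned}
```

For t = 0.8 and |x| = |y| = 5 this gives O ≥ (0.8/1.8) · 10 ≈ 4.44, the value used in the example.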
Bound Filtering

• Used to optimize the computation of the generalized Jaccard similarity measure
• Recall that
  • GJ(x, y) = Σ_{(xi, yj) ∈ M} s(xi, yj) / (|Bx| + |By| - |M|)
• Algorithm
  • for each (x, y), compute an upper bound UB(x, y) and a lower bound LB(x, y) on GJ(x, y)
  • if UB(x, y) ≤ t → (x, y) can be ignored; it is not a match
  • if LB(x, y) ≥ t → return (x, y) as a match
  • otherwise compute GJ(x, y)

61
Computing UB(x,y) and LB(x,y)

• For each xi ∈ Bx, find yj ∈ By with the highest element-level similarity, such that s(xi, yj) ≥ α. Call this set of pairs S1.
• For each yj ∈ By, find xi ∈ Bx with the highest element-level similarity, such that s(xi, yj) ≥ α. Call this set of pairs S2.
• Compute
  • UB(x, y) = Σ_{(xi, yj) ∈ S1 ∪ S2} s(xi, yj) / (|Bx| + |By| - |S1 ∪ S2|)
  • LB(x, y) = Σ_{(xi, yj) ∈ S1 ∩ S2} s(xi, yj) / (|Bx| + |By| - |S1 ∩ S2|)

62
Example

• S1 = {(a,q), (b,q)}, S2 = {(a,p), (b,q)}
• UB(x, y) = (0.8 + 0.9 + 0.7 + 0.9)/(3 + 2 - 3) = 1.65
• LB(x, y) = 0.9/(3 + 2 - 1) = 0.225

63
Extending Scaling Techniques to Other Similarity Measures

• Discussed Jaccard and overlap so far
• To extend a technique T to work for a new similarity measure s(x, y)
  • try to translate s(x, y) into constraints on a similarity measure that already works well with T
• The notes discuss examples that involve edit distance and TF/IDF

64
Summary

• String matching is pervasive in data integration
• Two key challenges:
  • which similarity measure to use, and how to scale up?
• Similarity measures
  • Sequence-based: edit distance, Needleman-Wunsch, affine gap, Smith-Waterman, Jaro, Jaro-Winkler
  • Set-based: overlap, Jaccard, TF/IDF
  • Hybrid: generalized Jaccard, soft TF/IDF, Monge-Elkan
  • Phonetic: Soundex
• Scaling up string matching
  • Inverted index, size/prefix/position/bound filtering

65
Acknowledgment

• Slides in the scalability section are adapted from http://pike.psu.edu/p2/wisc09-tech.ppt
