Record Linkage Similarity Measures and Algorithms
Nick Koudas (University of Toronto) Sunita Sarawagi (IIT Bombay) Divesh Srivastava (AT&T Labs-Research)
Presenters
U. Toronto
IIT Bombay
AT&T Research
Outline
Part I: Motivation, similarity measures (90 min)
  Data quality, applications
  Linkage methodology, core measures
  Learning core measures
  Linkage based measures
Part II: Efficient algorithms for approximate join (60 min)
Part III: Clustering/partitioning algorithms (30 min)
Inconsistency with reality: 2% of records in customer files become obsolete in one month (deaths, name changes, etc.) [DWI02]
Pricing anomalies: UA tickets selling for $5; 1GB of memory selling for $19.99 at amazon.com
Cost: $611B/year lost in the US due to poor customer data [DWI02]; $2.5B/year lost due to incorrect prices in retail DBs [E00]
Application factors: erroneous applications populating databases; faulty database design (constraints not enforced)
Obsolescence: the real world is dynamic
Different values for the same entity are treated as distinct in analysis; lots of heterogeneity; need approximate joins
Relevant technologies
Example: matching on name alone can deny boarding to the wrong passenger — use more match attributes, obtain more information
Relevant technologies
Approximate join of R1 and R2: a subset of the Cartesian product of R1 and R2, matching specified attributes of R1 and R2, labeled with a similarity score > t > 0
Clustering/partitioning of R: grouping together records of R that refer to the same real-world entity
Formal model for the approximate join: A × B = {(a,b) | a ∈ A, b ∈ B} = M ∪ U
  M = {(a,b) | a = b, a ∈ A, b ∈ B}: matched
  U = {(a,b) | a ≠ b, a ∈ A, b ∈ B}: unmatched
γ(a,b) = (γ_i(a,b)), i = 1..K: the comparison vector, containing comparison features, e.g., same last names, same SSN, etc.
Γ: the range of γ(a,b), the comparison space
A1: match; A2: uncertain; A3: non-match
A linkage rule is a function from Γ to {A1, A2, A3}
Given a distribution D over A × B:
  m(γ) = P(γ(a,b) | (a,b) ∈ M)
  u(γ) = P(γ(a,b) | (a,b) ∈ U)
Fellegi-Sunter Result
Sort the comparison vectors γ by m(γ)/u(γ) in non-increasing order: γ_1, …, γ_N. Choose n < n′ such that
  μ = Σ_{i=1}^{n} u(γ_i)   and   λ = Σ_{i=n′}^{N} m(γ_i)
The best linkage rule with respect to minimizing P(A2), subject to P(A1|U) = μ and P(A3|M) = λ, assigns γ_1, …, γ_n to A1; γ_{n+1}, …, γ_{n′−1} to A2; and γ_{n′}, …, γ_N to A3.
Intuition: swap the i-th vector, declared A1, with a j-th vector in A2. If u(γ_i) = u(γ_j) then m(γ_j) < m(γ_i), so after the swap P(A2) is increased.
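A minimal sketch of this threshold-selection procedure (assuming m(γ) and u(γ) have already been estimated; the comparison vectors and probabilities below are hypothetical illustration values):

def fellegi_sunter_regions(vectors, m, u, mu, lam):
    # Sort comparison vectors by m/u, non-increasing.
    order = sorted(vectors, key=lambda g: m[g] / u[g], reverse=True)
    # Grow the A1 region from the top while the false-match rate stays <= mu.
    n, fm = 0, 0.0
    while n < len(order) and fm + u[order[n]] <= mu:
        fm += u[order[n]]
        n += 1
    # Grow the A3 region from the bottom while the false-non-match rate stays <= lam.
    n_prime, fnm = len(order), 0.0
    while n_prime > n and fnm + m[order[n_prime - 1]] <= lam:
        fnm += m[order[n_prime - 1]]
        n_prime -= 1
    return order[:n], order[n:n_prime], order[n_prime:]  # A1, A2, A3

vectors = ["g1", "g2", "g3", "g4"]
m = {"g1": 0.5, "g2": 0.3, "g3": 0.15, "g4": 0.05}
u = {"g1": 0.01, "g2": 0.09, "g3": 0.3, "g4": 0.6}
print(fellegi_sunter_regions(vectors, m, u, mu=0.05, lam=0.1))
# (['g1'], ['g2', 'g3'], ['g4'])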
Fellegi-Sunter Issues:
Tuning:
  Estimates for m(γ), u(γ)? Training data: active learning for M, U labels; semi- or un-supervised clustering to identify the M, U clusters
  Setting μ, λ?
  Defining the comparison space? Distance metrics between records/fields
Efficiency/scalability: is there a way to avoid quadratic behavior (computing all |A|×|B| pairs)?
Outline
Part I: Motivation, similarity measures (90 min)
  Data quality, applications
  Linkage methodology, core measures
  Learning core measures
  Linkage based measures
Part II: Efficient algorithms for approximate join (60 min)
Part III: Clustering/partitioning algorithms (30 min)
[Timeline figure: field similarity measures, including FMS and hybrid measures]
Attribute Standardization
Several attribute fields in relations have loose but anticipated structure: addresses, names, bibliographic entries (mainly for web data)
Preprocessing standardizes such fields: enforce common abbreviations and titles; extract structure from addresses
Part of ETL tools, commonly using field segmentation and dictionaries; more recently, machine learning approaches — HMMs encode the universe of states [CCZ02]
Field Similarity
The notion of a field is application dependent: a relational attribute, a set of attributes, or an entire tuple
Basic problem: given two field values, quantify their similarity (wlog) in [0..1]
For numeric fields, use numeric methods; the problem is challenging for strings
Soundex Encoding
A phonetic algorithm that indexes names by their sound when pronounced in English; a code consists of the first letter of the name followed by three numbers encoding similar-sounding consonants
Remove all W, H; encode B, F, P, V as 1; C, G, J, K, Q, S, X, Z as 2; D, T as 3; L as 4; M, N as 5; R as 6; remove vowels; concatenate the first letter of the string with the first 3 numerals
Example: great and grate become 6EA3 and 6A3E, and then both G63
More recent: Metaphone, Double Metaphone, etc.
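A minimal sketch of the encoding rules above (it ignores Soundex's extra rule of collapsing the first letter with an identically-coded successor, and zero-pads to the conventional four characters):

CODES = {**dict.fromkeys("BFPV", "1"), **dict.fromkeys("CGJKQSXZ", "2"),
         **dict.fromkeys("DT", "3"), "L": "4", **dict.fromkeys("MN", "5"),
         "R": "6"}

def soundex(name: str) -> str:
    s = name.upper()
    first = s[0]
    # Encode every remaining letter; W and H are dropped entirely.
    digits = [CODES.get(c, "") for c in s[1:] if c not in "WH"]
    out = []
    for d in digits:
        if d and (not out or out[-1] != d):
            out.append(d)          # keep digit, collapsing adjacent duplicates
        elif not d:
            out.append("")         # vowel: contributes nothing but breaks runs
    code = "".join(out)[:3]
    return (first + code).ljust(4, "0")

print(soundex("great"), soundex("grate"))    # both G630 (the slide's G63, padded)
print(soundex("Robert"), soundex("Rupert"))  # both R163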
Minimum-cost sequence of operations transforming s into t.
Example: edit(Error, Eror) = 1, edit(great, grate) = 2
Folklore dynamic programming algorithm computes edit(); both the computation and decision problems are quadratic (in string length) in the worst case.
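A minimal sketch of the folklore dynamic program:

def edit(s: str, t: str) -> int:
    m, n = len(s), len(t)
    # d[i][j] = edit distance between s[:i] and t[:j]
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # delete
                          d[i][j - 1] + 1,         # insert
                          d[i - 1][j - 1] + cost)  # substitute / match
    return d[m][n]

print(edit("Error", "Eror"))   # 1
print(edit("great", "grate"))  # 2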
Edit Distance
Several variants (weighted, block, etc.); the problem can easily become NP-complete
Operation costs can be learned from the source (more later)
String alignment = a sequence of edit operations emitted by a memoryless process [RY97]
Observations:
  May be a costly operation for long strings
  Suitable for common typing mistakes: Comprehensive vs Comprenhensive
  Problematic for specific domains: AT&T Corporation vs AT&T Corp; IBM Corporation vs AT&T Corporation
Edit distance with gaps: handles whole-word insertions — John Smith vs John Edward Smith vs John E. Smith; IBM Corp. vs IBM Corporation
Allow sequences of mismatched characters (gaps) in the alignment of two strings
Penalty: under the affine cost model, cost(g) = s + e·l, where s is the cost of opening a gap, e the cost of extending it, and l the length of the gap; commonly e is lower than s
Computed by a similar dynamic programming algorithm
Let s = a_1 … a_K and t = b_1 … b_L. A character a_i of s is common with t if there is a b_j in t such that a_i = b_j with i − H ≤ j ≤ i + H, where H = min(|s|,|t|)/2.
Let s′ (t′) be the characters of s (t) common with t (s), in order. A transposition for s′, t′ is a position i such that a′_i ≠ b′_i. Let T_{s′,t′} be half the number of transpositions in s′ and t′.
Jaro Rule
Jaro(s,t) = 1/3 · ( |s′|/|s| + |t′|/|t| + (|s′| − T_{s′,t′})/|s′| )
Example:
  Martha vs Marhta: H = 3, s′ = Martha, t′ = Marhta, T_{s′,t′} = 1; Jaro(Martha, Marhta) = 0.9722
  Jonathan vs Janathon: H = 4, s′ = jnathn, t′ = jnathn, T_{s′,t′} = 0; Jaro(Jonathan, Janathon) = 0.5
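A minimal sketch in the standard formulation (window and transposition-counting conventions vary across descriptions; this version yields 0.944 for Martha/Marhta, while the variant on this slide reports 0.9722):

def jaro(s: str, t: str) -> float:
    if not s or not t:
        return 0.0
    H = max(0, max(len(s), len(t)) // 2 - 1)   # matching window
    used = [False] * len(t)
    s_common = []
    for i, a in enumerate(s):
        for j in range(max(0, i - H), min(len(t), i + H + 1)):
            if not used[j] and t[j] == a:
                used[j] = True
                s_common.append(a)
                break
    t_common = [t[j] for j in range(len(t)) if used[j]]
    c = len(s_common)
    if c == 0:
        return 0.0
    # Half the number of positions where the common sequences disagree.
    transpositions = sum(a != b for a, b in zip(s_common, t_common)) / 2
    return (c / len(s) + c / len(t) + (c - transpositions) / c) / 3

print(round(jaro("martha", "marhta"), 3))  # 0.944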
Jaro-Winkler
Let P be the length of the longest common prefix of s and t, and P′ = min(P, 4).
Jaro-Winkler(s,t) = Jaro(s,t) + (P′/10) · (1 − Jaro(s,t))
Observations: rewards agreement on a prefix; designed for short strings such as personal names.
Varying semantics of "term":
  Words in a field: AT&T Corporation → {AT&T, Corporation}
  Q-grams (sequences of q characters of a field): 3-grams of AT&T Corporation → {AT&, T&T, &T , T C,  Co, orp, rpo, por, ora, rat, ati, tio, ion}
Assess similarity by manipulating the sets of terms.
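A minimal q-gram tokenizer sketch; the '#'/'$' padding is one common convention (matching the padded examples later in this tutorial), not the only one:

def qgrams(s: str, q: int = 3) -> list[str]:
    # Pad so that prefixes and suffixes also contribute q-grams.
    padded = "#" * (q - 1) + s + "$" * (q - 1)
    return [padded[i:i + q] for i in range(len(padded) - q + 1)]

print(qgrams("srivastava"))
# ['##s', '#sr', 'sri', 'riv', 'iva', 'vas', 'ast', 'sta', 'tav', 'ava', 'va$', 'a$$']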
Overlap metrics
Given two sets of terms S, T:
Jaccard coefficient: Jaccard(S,T) = |S ∩ T| / |S ∪ T|
Variants: if scores (weights) are available for each term (element in the set), compute Jaccard() only over terms with weight above a specific threshold.
What constitutes a good choice of a term score?
TF/IDF [S83]
Term frequency (tf), inverse document frequency (idf); widely used in traditional IR approaches.
The tf/idf weight of a term in a document: log(tf + 1) · log(idf), where
  tf: number of times the term appears in document d
  idf: (number of documents) / (number of documents containing the term)
Intuitively: rare terms are more important.
TF/IDF
Varying semantics of "term":
  Words in a field: AT&T Corporation → {AT&T, Corporation}
  Q-grams: 3-grams of AT&T Corporation → {AT&, T&T, &T , T C,  Co, orp, rpo, por, ora, rat, ati, tio, ion}
For each term in a field, compute its tf/idf score using the field as the document and the set of field values as the document collection.
Probabilistic view: assume terms are drawn from sets A and B; then
  m(γ_j) = P(γ_j | M) = P_AB(j)
  u(γ_j) = P(γ_j | U) = P_A(j) · P_B(j)
This gives more weight to agreement on rare terms and less weight to common terms.
The IDF measure relates to the Fellegi-Sunter probabilistic notion:
  log( m(str)/u(str) ) = log( P_AB(str) / (P_A(str) · P_B(str)) ) ≈ log( 1/P_A(str) ) = IDF(str)
Cosine similarity
Each field value is transformed via tf/idf weighting into a (sparse) vector of high dimensionality d.
Let a, b be two field values and S_a, S_b their sets of terms; for w in S_a (S_b), denote by W(w, S_a) (W(w, S_b)) its tf/idf score. For two such values:
  Cosine(a,b) = Σ_{z ∈ S_a ∩ S_b} W(z, S_a) · W(z, S_b)
(with the weight vectors normalized to unit length).
Cosine similarity
Suitable to assess the closeness of AT&T Corporation, AT&T Corp, or AT&T Inc: low weights for Corporation, Corp, Inc; a higher weight for AT&T; so overall Cosine(AT&T Corp, AT&T Inc) should be high.
Via q-grams, it may also capture small typing mistakes: Jaccard vs Jacard → {Jac, acc, cca, car, ard} vs {Jac, aca, car, ard}; the common terms Jac, car, ard are enough to yield a high Cosine(Jaccard, Jacard).
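A minimal tf-idf + cosine sketch over a toy collection of field values (the collection and the word-level tokenization are illustrative assumptions):

import math
from collections import Counter

def tfidf_vectors(values):
    docs = [v.lower().split() for v in values]
    n = len(docs)
    df = Counter(t for d in docs for t in set(d))  # document frequency per term
    vecs = []
    for d in docs:
        tf = Counter(d)
        v = {t: math.log(tf[t] + 1) * math.log(n / df[t]) for t in tf}
        norm = math.sqrt(sum(w * w for w in v.values())) or 1.0
        vecs.append({t: w / norm for t, w in v.items()})  # unit-length vector
    return vecs

def cosine(a, b):
    return sum(w * b[t] for t, w in a.items() if t in b)

values = ["AT&T Corporation", "AT&T Corp", "IBM Corporation",
          "Boeing Corporation", "Intel Corporation"]
v = tfidf_vectors(values)
print(round(cosine(v[0], v[1]), 3))  # ~0.48: the rare term "at&t" dominates
print(round(cosine(v[0], v[2]), 3))  # ~0.03: only the common term "corporation" is shared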
Hybrids [CRF03]
Let S = {a_1, …, a_K} and T = {b_1, …, b_L} be sets of terms, and sim() some secondary similarity function:
  C(t, S, T) = { w ∈ S such that there exists v ∈ T with sim(w,v) > t }
  D(w, T) = max_{v ∈ T} sim(w,v), for w ∈ C(t, S, T)
SoftTFIDF: sTFIDF(S,T) = Σ_{w ∈ C(t,S,T)} W(w, S) · W(v_w, T) · D(w, T), where v_w is the term of T closest to w.
Okapi weighting: model within-document term frequencies as a mixture of two Poisson distributions, one for relevant and one for irrelevant documents.
Language models: given Q = t_1, …, t_n, estimate p(Q | M_d). MLE estimate for term t: p(t | M_d) = tf(t,d) / dl_d, where dl_d is the total number of tokens in d. Estimate p_avg(t) and weight it by a risk factor (modeled by a geometric distribution).
HMM-based retrieval models.
FMS transformation costs:
  Replacement cost: edit(s,t) · W(s, S)
  Insertion cost: c_ins · W(s, S) (c_ins between 0 and 1)
  Deletion cost: W(s, S)
Computed by dynamic programming, like edit(); generalizes to multiple sets of terms.
Example: Beoing Corporation vs Boeing Company
S = {Beoing, Corporation}, T = {Boeing, Company}
tc(S,T) = 0.97 (with unit weights for terms), the sum of
  edit(Beoing, Boeing) = 2/6 (normalized)
  edit(Corporation, Company) = 7/11
Approximating FMS with min-hashing: for s ∈ S, let QG(s) be the set of q-grams of s, and sim_mh the min-hash estimate of Jaccard similarity. Then
  fms_apx(S,T) = (1/W(S)) · Σ_{s ∈ S} W(s, S) · max_{t ∈ T} ( sim_mh(QG(s), QG(t)) + d ), with d = (1 − 1/q)
For a suitable ε and min-hash signature size: E(fms_apx(S,T)) ≥ fms(S,T), and fms_apx(S,T) ≥ (1 − ε) · fms(S,T) with high probability.
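A minimal min-hash sketch for the Jaccard estimates that fms_apx relies on; the signature size and hashing scheme are illustrative:

import random

random.seed(42)
K = 64  # signature length
SALTS = [random.getrandbits(32) for _ in range(K)]

def signature(qgram_set):
    # One min-hash coordinate per salted hash function.
    return [min(hash((salt, g)) for g in qgram_set) for salt in SALTS]

def minhash_sim(sig_a, sig_b):
    # Fraction of agreeing coordinates estimates Jaccard similarity.
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

A = {"##s", "#sr", "sri", "riv", "iva", "vas", "ast", "sta", "tav", "ava", "va$", "a$$"}
B = {"##s", "#sh", "shr", "hri", "riv", "iva", "vas", "ast", "sta", "tav", "av$", "v$$"}
print(len(A & B) / len(A | B))                  # exact Jaccard: ~0.41
print(minhash_sim(signature(A), signature(B)))  # estimate near 0.41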
Flexible matching takes multiple attributes into consideration: N orderings of the relation's tuples, each ranked by a similarity score against a query.
9/23/06
38
Voting Theory
Tuple id  custname       address                       location
T1        John Smith     800 Mountain Av. Springfield  (5,5)
T2        Josh Smith     100 Mount Av. Springfield     (8,8)
T3        Nicolas Smith  800 Spring Av. Union          (11,11)
T4        Joseph Smith   555 Mt. Road Springfield      (9,9)
T5        Jack Smith     100 Springhill Lake Park      (6,6)
Query: 100 Mount Rd. Springfield, location (5.1, 5.1)
Ranking by address similarity: T2 (0.95), T1 (0.8), T4 (0.75), T3 (0.3), T5 (0.1)
Comparing orderings (Spearman footrule): let S, T be orderings of the same domain D, with S(i) (resp. T(i)) the position of the i-th element of D in S (resp. T). Then
  F(S,T) = Σ_{i ∈ D} | S(i) − T(i) |
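A minimal footrule sketch over the address ranking above and a hypothetical second ranking:

def footrule(S, T):
    # S(i), T(i): position of element i in each ordering.
    pos_t = {x: i for i, x in enumerate(T)}
    return sum(abs(i - pos_t[x]) for i, x in enumerate(S))

print(footrule(["T2", "T1", "T4", "T3", "T5"],
               ["T1", "T2", "T4", "T5", "T3"]))  # |0-1|+|1-0|+|2-2|+|3-4|+|4-3| = 4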
Historical timeline
[Timeline figure: similarity measures from the Jaccard coefficient (1901) and Soundex (1918), through TF/IDF (1983/9), to FMS (2003)]
Outline
Part I: Motivation, similarity measures (90 min)
  Data quality, applications
  Linkage methodology, core measures
  Learning core measures
  Linkage based measures
Part II: Efficient algorithms for approximate join (60 min)
Part III: Clustering algorithms (30 min)
Learning similarity measures: term based (vector space) and edit based.
Learning constants in character-level distance measures like Levenshtein distance: useful for short strings with systematic errors (e.g., OCR) or domain-specific errors (e.g., st. / street).
Multi-attribute records: useful when the relative importance of a match along different attributes is highly domain dependent. Example: a comparison-shopping website — a match on title is more indicative for books than for electronics; a difference in price is less indicative for books than for electronics.
Learnable vector-space distance:
  d_{A,W}(x, y) = (x − y) A W Aᵀ (x − y)ᵀ
A can be any real matrix (a linear transform of the input); W is a diagonal matrix with non-negative entries (guaranteeing that d is a distance metric).
Learn the entries of W so as to minimize training error. Zero training error means: for every triple (i,j,k) in the training set, d_{A,W}(x_i, x_k) − d_{A,W}(x_i, x_j) > 0.
Select A, W such that d stays as close to the unweighted Euclidean metric as possible.
Learning term weights: the tokens 11th and square in a list of addresses might have the same IDF values, yet addresses on the same street are more relevant than addresses on the same square. Can we make the distinction?
For vectors x, y: Sim(x,y) = Σ_{i=1}^{d} g_i · x_i · y_i / (||x|| · ||y||), with learned per-term weights g_i.
Training data: S = {(x,y) : x similar to y}, D = {(x,y) : x different from y}.
Learning edit parameters with EM:
Input: similar pairs. Parameters: probabilities of edit operations.
E-step: find the highest-probability edit sequences; M-step: re-estimate operation probabilities using the expectations from the E-step.
Pros: FSM representation (generative model). Cons: fails to incorporate negative examples.
[BM03] extends this to learn weights of edit operations with affine gaps; [MBP05] uses a CRF approach (incorporating positive and negative input).
Standard character-level operations: insert, delete, substitute; costs depend on character type (alphabetic, numeric, punctuation).
Word-level operations: insert, delete, match, abbreviation; varying costs for stop words (e.g., The) and lexicons (e.g., Corporation, Road).
Given examples of duplicate and non-duplicate strings, the learner is a Conditional Random Field, which allows flexible overlapping feature sets (e.g., "ends with a dot and appears in a dictionary").
Discriminative training gives higher accuracy than the earlier generative models.
[Figure: a learned edit transducer aligning "Proc. of SIGMOD" with "Proc. Sp. Int. Gr. Management of Data", with word-level operations (W-drop, W-insert, W-D-stop, W-Abbr) and character-level operations (C-D-punct); the match state carries positive parameters (e.g., W-drop 1.0, W-insert 0.5, C-D-punct 0.2, W-D-stop 0.3) and the non-match state the corresponding negative ones (e.g., W-drop −1.0, W-insert −0.5).]
State and transition parameters exist for both match and non-match states; multiple paths through the states are summed over for each pair; training uses an EM-like algorithm.
9/23/06 49
Results
Baseline: the earlier generative approach [BM03] — word-level only, no order, initialized with manual weights.
On citations: edit distance is better than word-level measures; CRFs trained with both duplicates and non-duplicates beat generative approaches that use only duplicates.
Learning domain-specific edit distances can lead to higher accuracy than manually tuned weights.
Learning similarity measures: term based (vector space) and edit based.
Learning constants in character-level distance measures like Levenshtein distance: useful for short strings with systematic errors (e.g., OCR) or domain-specific errors (e.g., st. / street).
Multi-attribute records: useful when the relative importance of a match along different attributes is highly domain dependent. Example: a comparison-shopping website — a match on title is more indicative for books than for electronics; a difference in price is less indicative for books than for electronics.
[Figure: learning a combined similarity function over record pairs. Similarity features (All-Ngrams, AuthorTitleNgrams, AuthorEditDist, YearDifference, PageMatch, TitleIsNull) map each record pair to a feature vector; labeled pairs (duplicate / non-duplicate) train a classifier — e.g., a decision tree splitting on AuthorTitleNgrams and TitleIsNull, or a linear score such as All-Ngrams·0.4 + AuthorTitleNgram·0.2, or 0.3·YearDifference + 1.0·AuthorEditDist + 0.2·PageMatch — which is then applied to the unlabeled pairs.]
Learners: Support Vector Machines (SVM), decision trees, logistic regression, linear regression, perceptron.
Learning approach
Learners used: SVMs (high accuracy with limited data); decision trees (interpretable, efficient to apply); perceptrons (efficient incremental training) (Bilenko et al. 2005, comparison shopping).
Results: learned combination methods beat both averaging of attribute-level similarities and string-based methods like edit distance (Bilenko et al. 2003).
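A minimal perceptron sketch for combining attribute-level similarity features into a duplicate/non-duplicate decision; the features and training pairs are hypothetical:

def train_perceptron(examples, epochs=50, lr=0.1):
    w = [0.0] * (len(examples[0][0]) + 1)  # feature weights + bias (last entry)
    for _ in range(epochs):
        for x, y in examples:              # y in {0, 1}
            pred = int(w[-1] + sum(wi * xi for wi, xi in zip(w, x)) > 0)
            err = y - pred
            for i, xi in enumerate(x):     # standard perceptron update
                w[i] += lr * err * xi
            w[-1] += lr * err
    return w

# (title_ngram_sim, author_edit_sim, year_diff) -> duplicate?
examples = [((0.9, 0.8, 0.0), 1), ((0.2, 0.1, 0.9), 0),
            ((0.8, 0.9, 0.1), 1), ((0.1, 0.3, 0.8), 0)]
w = train_perceptron(examples)
x = (0.85, 0.7, 0.05)
print(int(w[-1] + sum(wi * xi for wi, xi in zip(w, x)) > 0))  # 1: duplicate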
Active Learner
Uses unlabeled data during the classification process: the initial classifier is sure about its predictions on some unlabeled instances and unsure about others (the confusion region); seek labels for the uncertain instances.
Rule learning example:
  Attribute 1 > s1 ⇒ mapped
  Attribute 4 < s4 and Attribute 3 > s3 ⇒ mapped
  Attribute 2 < s2 ⇒ not mapped
Committee of N classifiers, built via data resampling or classifier perturbation.
For each unlabeled instance x: find the predictions y_1, …, y_k from the k classifiers; compute the uncertainty U(x) as the entropy of these predictions; pick the instance with the highest uncertainty (see the sketch below).
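A minimal query-by-committee sketch, with a hypothetical committee of threshold rules over a single similarity score:

import math
from collections import Counter

def entropy(votes):
    counts = Counter(votes)
    n = len(votes)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def most_uncertain(unlabeled, committee):
    # committee: list of classifiers, each a function x -> 0/1 label
    return max(unlabeled, key=lambda x: entropy([clf(x) for clf in committee]))

committee = [lambda x: int(x > 0.4), lambda x: int(x > 0.5), lambda x: int(x > 0.6)]
pairs = [0.1, 0.45, 0.55, 0.9]
print(most_uncertain(pairs, committee))  # 0.45: the committee splits on it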
Active learning does much better than random selection, with only about 100 actively chosen instances to label.
Related work: A Hierarchical Graphical Model for Record Linkage (Ravikumar & Cohen, UAI 2004); exploiting transitivity to learn on groups: T. Finley and T. Joachims, Supervised Clustering with Support Vector Machines, ICML 2005.
Outline
Part I: Motivation, similarity measures (90 min)
  Data quality, applications
  Linkage methodology, core measures
  Learning core measures
  Linkage based measures
Part II: Efficient algorithms for approximate join (60 min)
Part III: Clustering algorithms (30 min)
Linkage-based similarity
[Figure: a co-author graph linking papers P1-P4 to authors — D White, Don White, David White, Anup Gupta / A Gupta, Jane Liu / Liu Jane.]
A path in the graph makes D White more similar to Don White than to David White.
Lots of work on node similarities in graphs: SimRank, conductance models, etc.; RelDC (Kalashnikov et al. 2006).
Given a relationship graph, find the connection strength between any two nodes u, v.
Methods:
  Simple methods: shortest path length or flow
  Diffusion kernels
  Electric circuit conductance model (Faloutsos et al. 2004)
  Walk-based model (WM), probabilistic: treat edge weights as probabilities of transitioning out of a node; connection strength is the probability of reaching u from v via random walks
RelDC extends WM to work on graphs with mutually exclusive choice nodes.
RelDC
Resolve whatever is possible via textual similarity alone; create a relationship graph with the unresolved references connected.
Data: authors (names, affiliations, from HP Search); papers (titles and author names, from CiteSeer).
13% of references are ambiguous (cannot be resolved via text alone); 100% accuracy on 50 random tests.
Outline
Part I: Motivation, similarity measures (90 min)
Part II: Efficient algorithms for approximate join (60 min)
  Use traditional join methods
  Extend traditional join methods
  Commercial systems
Approximate join: a subset of the Cartesian product of R1 and R2, matching specified attributes A_i1, …, A_ik with B_i1, …, B_ik, labeled with a similarity score > t > 0.
Goals: reduce the number of pairs on which similarity is computed; take advantage of efficient relational join methods.
Historical Timelines
1977: index nested-loop and sort-merge joins [BE77]
1991: band join [DNS91]; approximate string matching [JU91]
1995: Merge/Purge [HS95]; FastMap [FL95]
1997-98: union/find for clustering [ME97]; spatial join [HS98]
2001: q-gram joins in SQL [GIJ+01]
2002: BigMatch [Y02]; dimension hierarchies [ACG02]
2003: StringMap [JLM03]; probe count/cluster [GIKS03]; fuzzy match [CGGM04]
2004: set joins [SK04]; IDF-weighted joins [KMS04]
2005: SPIDER [KMS05]; cleaning in SQL Server [CGG+05]
2006: SSJoin [CGK06]; multi-relational approximate joins
Compute a discriminating key per record; sort the records on it.
Slide a fixed-size window through the sorted list; match only within the window.
Use OPS5 rules (an equational theory) to determine matches.
Multiple passes with small windows, based on distinct keys.
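A minimal sorted-neighborhood sketch; the key function and match predicate are crude illustrative placeholders:

def sorted_neighborhood(records, key, matches, w=3):
    ordered = sorted(records, key=key)
    pairs = []
    for i, r in enumerate(ordered):
        # Compare only against the previous w-1 records in sorted order.
        for s in ordered[max(0, i - w + 1):i]:
            if matches(r, s):
                pairs.append((s, r))
    return pairs

records = [("Smith, John", "07932"), ("Smyth, Jon", "07932"),
           ("Smith, J.", "98346"), ("Jones, Amy", "07932")]
key = lambda r: r[0][:3] + r[1][:3]          # crude key: name + ZIP prefixes
matches = lambda a, b: a[0][:2] == b[0][:2]  # crude match predicate
print(sorted_neighborhood(records, key, matches))  # the three Sm* pairs in-window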
[Figure: multi-pass example — pass 1, keyed on ZIP.Name[1..3], brings r4 and r5 into the same window; pass 2, keyed on DOB.Name[1..3], brings r4, r2 and r5, r3 together.]
BigMatch [Y02]
Goal: block/index matching records, based on multiple keys.
Background: indexed nested-loop join [BE77]. Methodology: domain-specific, Jaro-Winkler similarity.
Store the smaller table (100M records) in main memory (4GB); create an index for each set of grouping/blocking criteria; scan the larger table (4B records), repeatedly probing the smaller table.
Avoids matching the same pair multiple times.
BigMatch [Y02]
Example: blocking criteria SS.Name[1..2] and ZIP.Name[1..3]; probe record (123-45, 1960/08/24, 98346) against the in-memory inner table:
ID  Name         SS      DOB         ZIP
r1  Smith, John  123-45  1960/08/24  07932
r2  Smyth, Jon   123-45  1961/08/24  07932
r3  Smith, John  312-54  1995/07/25  98301
r4  Smith, J.    723-45  1960/08/24  98346
r5  Smith, J.    456-78  1975/12/11  98346
Use hierarchical grouping, instead of sorting, to focus the search.
Structural similarity: based on the overlap of children sets. Textual similarity: based on weighted token-set containment.
Top-down processing of the dimension hierarchy for efficiency.
Textual similarity [example figure]
Structural similarity [example figure]
Historical Timelines
1977: index nested-loop and sort-merge joins [BE77]
1991: band join [DNS91]; approximate string matching [JU91]
1995: Merge/Purge [HS95]; FastMap [FL95]
1997-98: union/find for clustering [ME97]; spatial join [HS98]
2001: q-gram joins in SQL [GIJ+01]
2002: BigMatch [Y02]; dimension hierarchies [ACG02]
2003: StringMap [JLM03]; probe count/cluster [GIKS03]; fuzzy match [CGGM04]
2004: set joins [SK04]; IDF-weighted joins [KMS04]
2005: SPIDER [KMS05]; cleaning in SQL Server [CGG+05]
2006: SSJoin [CGK06]; multi-relational approximate joins
Extract the set Q(s) of all overlapping q-grams of string s.
Count filter: ED(s1,s2) ≤ d ⇒ |Q(s1) ∩ Q(s2)| ≥ max(|s1|,|s2|) − 1 − (d−1)·q
Cheap filters (length, count, position) prune non-matches.
Pure SQL solution: cost-based join methods.
Q-gram table Q (one row per (ID, Qg) pair):
r1: ##s, #sr, sri, riv, iva, vas, ast, sta, tav, ava, va$, a$$
r3: ##s, #sh, shr, hri, riv, iva, vas, ast, sta, tav, av$, v$$

Count-filter join in SQL:
SELECT Q1.ID, Q2.ID
FROM Q AS Q1, Q AS Q2
WHERE Q1.Qg = Q2.Qg
GROUP BY Q1.ID, Q2.ID
HAVING COUNT(*) > T
Similarity metric based on IDF-weighted token edit distance.
Approximate the metric using Jaccard similarity on q-gram sets.
Small error-tolerant index (ETI) table; sharing of min-hash q-grams.
Optimistic short-circuiting exploits large token IDF weights.
Fuzzy match example: input record (Beoing Corporation, Seattle, WA, 98004)
Reference table:
ID  OrgName          City     State  ZIP
r1  Boeing Company   Seattle  WA     98004
r2  Bon Corporation  Seattle  WA     98014
r3  Companions       Seattle  WA     98024
ETI table (per q-gram: min-hash coordinate, column, frequency, tuple-id list):
Qg   MHC  Col  Freq  TIDList
ing  2    1    1     {r1}
orp  1    1    1     {r2}
sea  1    2    3     {r1, r2, r3}
004  2    4    1     {r1}
Historical Timelines
1977: index nested-loop and sort-merge joins [BE77]
1991: band join [DNS91]; approximate string matching [JU91]
1995: Merge/Purge [HS95]; FastMap [FL95]
1997-98: union/find for clustering [ME97]; spatial join [HS98]
2001: q-gram joins in SQL [GIJ+01]
2002: BigMatch [Y02]; dimension hierarchies [ACG02]
2003: StringMap [JLM03]; probe count/cluster [GIKS03]; fuzzy match [CGGM04]
2004: set joins [SK04]; IDF-weighted joins [KMS04]
2005: SPIDER [KMS05]; cleaning in SQL Server [CGG+05]
2006: SSJoin [CGK06]; multi-relational approximate joins
Map a string to a set of elements (words, q-grams, etc.); build inverted lists on the individual set elements.
Optimization: process skewed lists in increasing size order.
Optimization: sort lists in decreasing order of record sizes.
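A minimal inverted-index overlap join sketch (probe side only, without the list-ordering optimizations):

from collections import defaultdict

def build_index(sets):
    index = defaultdict(list)
    for rid, elems in sets.items():
        for e in elems:
            index[e].append(rid)
    return index

def probe(index, elems, threshold):
    # Count, per candidate record, how many elements it shares with the probe.
    counts = defaultdict(int)
    for e in elems:
        for rid in index.get(e, []):
            counts[rid] += 1
    return {rid: c for rid, c in counts.items() if c >= threshold}

sets = {"r1": {"##s", "#sr", "sri", "riv", "iva"},
        "r2": {"##s", "#sh", "shr", "hri", "riv"}}
index = build_index(sets)
print(probe(index, {"##s", "#sr", "sri", "riv"}, threshold=3))  # {'r1': 4}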
Inverted index (set element → record ids):
SE   IDs
##s  r1, r2, r3
#sr  r1
#sh  r2, r3
sri  r1
shr  r2, r3
hri  r2, r3
riv  r1, r2, r3
tav  r1, r2, r3
ava  r1, r2
v$$  r3
(After re-sorting each list in decreasing order of record size — e.g., r2 before r1 — probes can stop early.)
Compare strings based on the sets associated with each string. Problem: find pairs with Overlap(s1, s2) ≥ threshold.
Optimization: a high set overlap implies an overlap of (small) ordered subsets.
SQL implementation using equijoins and cost-based plans.
Example: find pairs sharing more than 8 q-grams.
SELECT Q1.ID, Q2.ID
FROM Q AS Q1, Q AS Q2
WHERE Q1.Qg = Q2.Qg
GROUP BY Q1.ID, Q2.ID
HAVING COUNT(*) > 8
r1: ##s, #sr, sri, riv, iva, vas, ast, sta, tav, ava, va$, a$$   (12 q-grams)
r4: ##s, #sr, sri, riv, iva, vas, ast, sta, tav, av$, v$$        (11 q-grams)
Optimization: if the overlap must be at least 9, it suffices to join the first 12 − 9 + 1 = 4 ordered q-grams of r1 with the first 11 − 9 + 1 = 3 ordered q-grams of r4.
Suggested ordering: based on decreasing IDF weights.
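A minimal sketch of the prefix filter behind this optimization, using alphabetical order as a stand-in for a decreasing-IDF global order:

def prefix_filter_survives(q1, q2, overlap, rank):
    # If the sets must overlap in at least `overlap` elements, a witness must
    # appear among the first len(q) - overlap + 1 elements in the global order.
    p1 = sorted(q1, key=rank.get)[:len(q1) - overlap + 1]
    p2 = sorted(q2, key=rank.get)[:len(q2) - overlap + 1]
    return bool(set(p1) & set(p2))  # empty intersection => the pair can be pruned

q1 = ["##s", "#sr", "sri", "riv", "iva", "vas", "ast", "sta", "tav", "ava", "va$", "a$$"]
q2 = ["##s", "#sr", "sri", "riv", "iva", "vas", "ast", "sta", "tav", "av$", "v$$"]
rank = {g: i for i, g in enumerate(sorted(set(q1 + q2)))}  # stand-in for IDF order
print(prefix_filter_survives(q1, q2, overlap=9, rank=rank))  # True: not pruned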
Historical Timelines
1977: index nested-loop and sort-merge joins [BE77]
1991: band join [DNS91]; approximate string matching [JU91]
1995: Merge/Purge [HS95]; FastMap [FL95]
1997-98: union/find for clustering [ME97]; spatial join [HS98]
2001: q-gram joins in SQL [GIJ+01]
2002: BigMatch [Y02]; dimension hierarchies [ACG02]
2003: StringMap [JLM03]; probe count/cluster [GIKS03]; fuzzy match [CGGM04]
2004: set joins [SK04]; IDF-weighted joins [KMS04]
2005: SPIDER [KMS05]; cleaning in SQL Server [CGG+05]
2006: SSJoin [CGK06]; multi-relational approximate joins
Commercial systems (methodology; distance metrics; other facilities):
SQL Server Integration Services 2005 — Fuzzy Lookup, Fuzzy Grouping (uses the Error Tolerant Index) — customized, domain-independent metric: edit distance plus number, order, and frequency of tokens.
OracleBI Warehouse Builder 10gR2 ("Paris") — match-merge rules; deterministic and probabilistic matching — Jaro-Winkler; double metaphone — name & address parsing, matching, standardization (3rd-party vendors); data profiling; data rules; data auditors.
IBM Entity Analytic Solutions, QualityStage — probabilistic matching (information content); multi-pass blocking; rules-based merging — name recognition; identity resolution; relationship resolution (EAS); data profiling; standardization; trends and anomalies.
Outline
Part I: Motivation, similarity measures (90 min)
Part II: Efficient algorithms for approximate join (60 min)
Part III: Clustering/partitioning algorithms (30 min)
Partitioning/collective deduplication
Single entity type: A is the same as B if both are the same as C.
Multiple linked entity types: if paper A is the same as paper B, then the venue of A is the same as the venue of B.
[Figure: a pairwise classifier scores record pairs (6,7), (7,8), (6,8), … over features f1 … fn; thresholded predictions on records 6-11 induce groups G1, G2, G3.]
Creating partitions
Transitive closure [figure: a graph over records 1-10].
Correlation clustering (Bansal et al. 2002): partition to minimize the total disagreements — edges across partitions plus missing edges within partitions [figure: an example partitioning with 3 disagreements].
More appealing than standard clustering: no magic constants (number of clusters, similarity thresholds, diameter, etc.); extends to real-valued scores.
NP-hard; many approximation algorithms.
Practical substitutes (heuristics, no guarantees): agglomerative clustering — repeatedly merge the closest clusters (see the sketch below).
Efficient implementation possible via heaps (BG 2005).
The definition of closeness is subject to tuning: greatest reduction in error; average/max/min similarity.
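A minimal complete-link sketch of this greedy merging (naive O(n^3); the heap-based implementation referenced above is faster; the similarity function is hypothetical):

def complete_link(items, sim, threshold):
    clusters = [[x] for x in items]
    while True:
        best, bi, bj = threshold, None, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Complete link: closeness = minimum pairwise similarity.
                s = min(sim(a, b) for a in clusters[i] for b in clusters[j])
                if s > best:
                    best, bi, bj = s, i, j
        if bi is None:            # no pair exceeds the threshold: stop merging
            return clusters
        clusters[bi] += clusters.pop(bj)

sim = lambda a, b: 1 - abs(a - b) / 10   # toy similarity on numbers
print(complete_link([1, 2, 3, 8, 9], sim, threshold=0.75))  # [[1, 2, 3], [8, 9]]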
[Figure: product clusters — digital cameras, camcorders, luggage (from Bilenko et al. 2005).]
Setup: online comparison shopping; fields: name, model, description, price; learner: online perceptron.
Complete-link clustering >> single-link clustering (transitive closure).
An issue: when to stop merging clusters.
Dense/sparse partitions: partitions are compact and relatively far from other points. A partition has to satisfy several criteria:
  points within the partition are closer to each other than to any points outside;
  the number of points within the p-neighborhood of each partition is < c;
  either the number of points in the partition is < K, or its diameter is below a threshold.
Algorithm
Consider the case where partitions are required to be of size < K; if a partition P_j of size m is in the output, then it satisfies the criteria above.
For each record, do efficient index probes to get: its K nearest neighbors; a count of the points in the p-neighborhood of each of the K nearest neighbors.
Form pairs and perform grouping based on the above.
Summary: partitioning
Transitive closure is a bad idea; no verdict yet on the best alternative.
Correlation clustering: difficult to design the objective and the algorithms.
Greedy agglomerative clustering algorithms are OK: greatest minimum similarity (complete-link) benefits; reasonable performance with a heap-based implementation.
Dense/sparse partitioning — positives: declarative objective, efficient algorithm; negatives: parameters need retuning across domains.
Still needed: a comparison between complete-link, dense/sparse, and correlation clustering.
Collective deduplication (from Parag & Domingos 2005): associate prediction variables A^k_ij for each attribute k of each record pair (i,j), and R_ij for each record pair.
Dependency graph [figure: attribute-match variables A^k_ij connected to record-match variables R_ij, e.g., A^1_34 to R_34]
Scoring functions:
  Independent scores — attribute-level s_k(A^k, a_i, a_j): any classifier on various text similarities of the attribute pair; record-level s(R, b_i, b_j): any classifier on various similarities of all k attribute pairs
  Dependency scores — d_k(A^k, R), over each (record pair, attribute pair)
If d_k(1,1) + d_k(0,0) ≥ d_k(1,0) + d_k(0,1), the optimal assignment can be found via graph MINCUT.
Assigning scores: manually, as in Levy et al.; or by example-based training, as in Domingos et al.
Collective deduplication: example
P1: D White, J Liu, A Gupta
P2: Liu, Jane & J Gupta & White, Don
P3: Anup Gupta
P4: David White
Scoring functions and algorithm:
  s(A_ij), attribute-level text similarity: greedy agglomerative clustering — merge the author clusters with the highest score
  s(A_ij, N_ij), dependency on the labels of the co-author set: redefine similarity between clusters of authors instead of single authors — fraction of the co-author set assigned label 1; max of author-level similarity
Final score: a·s(A_ij) + (1−a)·s(A_ij, N_ij); a is the only parameter
Declarative data cleaning in AJAX [GFS+01]
Q-gram based metrics, SPIDER [GIJ+01, GIKS03, KMS04]
SSJoin [CGK06]
Compact sets, sparse neighborhood [CGM05]
Conclusions
Record linkage is critical when data quality is poor.
Covered: similarity metrics; efficient sub-quadratic approximate join algorithms; efficient clustering/partitioning algorithms.
Looking ahead: sophisticated similarity metrics over massive data sets; it is important to work with real datasets.
References
[ACG02] Rohit Ananthakrishna, Surajit Chaudhuri, Venkatesh Ganti: Eliminating Fuzzy Duplicates in Data Warehouses. VLDB 2002: 586-597
[BD83] Dina Bitton, David J. DeWitt: Duplicate Record Elimination in Large Data Files. ACM Trans. Database Syst. 8(2): 255-265 (1983)
[BE77] Mike W. Blasgen, Kapali P. Eswaran: Storage and Access in Relational Data Bases. IBM Systems Journal 16(4): 362-377 (1977)
[BG04] Indrajit Bhattacharya, Lise Getoor: Iterative record linkage for cleaning and integration. DMKD 2004: 11-18
[C98] William W. Cohen: Integration of Heterogeneous Databases Without Common Domains Using Queries Based on Textual Similarity. SIGMOD Conference 1998: 201-212
[C00] William W. Cohen: Data integration using similarity joins and a word-based information representation language. ACM Trans. Inf. Syst. 18(3): 288-321 (2000)
[CCZ02] Peter Christen, Tim Churches, Xi Zhu: Probabilistic name and address cleaning and standardization. Australasian Data Mining Workshop 2002
[CGGM04] Surajit Chaudhuri, Kris Ganjam, Venkatesh Ganti, Rajeev Motwani: Robust and Efficient Fuzzy Match for Online Data Cleaning. SIGMOD Conference 2003: 313-324
[CGG+05] Surajit Chaudhuri, Kris Ganjam, Venkatesh Ganti, Rahul Kapoor, Vivek R. Narasayya, Theo Vassilakis: Data cleaning in Microsoft SQL Server 2005. SIGMOD Conference 2005: 918-920
[CGK06] Surajit Chaudhuri, Venkatesh Ganti, Raghav Kaushik: A primitive operator for similarity joins in data cleaning. ICDE 2006
[CGM05] Surajit Chaudhuri, Venkatesh Ganti, Rajeev Motwani: Robust Identification of Fuzzy Duplicates. ICDE 2005: 865-876
[CRF03] William W. Cohen, Pradeep Ravikumar, Stephen E. Fienberg: A Comparison of String Distance Metrics for Name-Matching Tasks. IIWeb 2003: 73-78
[DJ03] Tamraparni Dasu, Theodore Johnson: Exploratory Data Mining and Data Cleaning. John Wiley, 2003
[DNS91] David J. DeWitt, Jeffrey F. Naughton, Donovan A. Schneider: An Evaluation of Non-Equijoin Algorithms. VLDB 1991: 443-452
[DWI02] Data Warehousing Institute report, 2002
[E00] Larry English: Plain English on Data Quality: Information Quality Management: The Next Frontier. DM Review Magazine, April 2000. http://www.dmreview.com/article_sub.cfm?articleId=2073
[FL95] Christos Faloutsos, King-Ip Lin: FastMap: A Fast Algorithm for Indexing, Data-Mining and Visualization of Traditional and Multimedia Datasets. SIGMOD Conference 1995: 163-174
[FS69] I. Fellegi, A. Sunter: A theory of record linkage. Journal of the American Statistical Association, Vol. 64, No. 328, 1969
[G98] D. Gusfield: Algorithms on strings, trees and sequences. Cambridge University Press, 1998
[GFS+01] Helena Galhardas, Daniela Florescu, Dennis Shasha, Eric Simon, Cristian-Augustin Saita: Declarative Data Cleaning: Language, Model, and Algorithms. VLDB 2001: 371-380
[GIJ+01] Luis Gravano, Panagiotis G. Ipeirotis, H. V. Jagadish, Nick Koudas, S. Muthukrishnan, Divesh Srivastava: Approximate String Joins in a Database (Almost) for Free. VLDB 2001: 491-500
[GIKS03] Luis Gravano, Panagiotis G. Ipeirotis, Nick Koudas, Divesh Srivastava: Text joins in an RDBMS for web data integration. WWW 2003: 90-101
[GKMS04] S. Guha, N. Koudas, A. Marathe, D. Srivastava: Merging the results of approximate match operations. VLDB 2004
[GKR98] David Gibson, Jon M. Kleinberg, Prabhakar Raghavan: Clustering Categorical Data: An Approach Based on Dynamical Systems. VLDB 1998: 311-322
[HS95] Mauricio A. Hernández, Salvatore J. Stolfo: The Merge/Purge Problem for Large Databases. SIGMOD Conference 1995: 127-138
[HS98] Gísli R. Hjaltason, Hanan Samet: Incremental Distance Join Algorithms for Spatial Databases. SIGMOD Conference 1998: 237-248
[J89] M. A. Jaro: Advances in record linkage methodology as applied to matching the 1985 census of Tampa, Florida. Journal of the American Statistical Association 84: 414-420 (1989)
[JLM03] Liang Jin, Chen Li, Sharad Mehrotra: Efficient Record Linkage in Large Data Sets. DASFAA 2003
[JU91] Petteri Jokinen, Esko Ukkonen: Two Algorithms for Approximate String Matching in Static Texts. MFCS 1991: 240-248
[KL51] S. Kullback, R. Leibler: On information and sufficiency. The Annals of Mathematical Statistics 22(1): 79-86, 1951
[KMC05] Dmitri V. Kalashnikov, Sharad Mehrotra, Zhaoqi Chen: Exploiting Relationships for Domain-Independent Data Cleaning. SDM 2005
[KMS04] Nick Koudas, Amit Marathe, Divesh Srivastava: Flexible String Matching Against Large Databases in Practice. VLDB 2004: 1078-1086
[KMS05] Nick Koudas, Amit Marathe, Divesh Srivastava: SPIDER: flexible matching in databases. SIGMOD Conference 2005: 876-878
[LLL00] Mong-Li Lee, Tok Wang Ling, Wai Lup Low: IntelliClean: a knowledge-based intelligent data cleaner. KDD 2000: 290-294
[ME96] Alvaro E. Monge, Charles Elkan: The Field Matching Problem: Algorithms and Applications. KDD 1996: 267-270
[ME97] Alvaro E. Monge, Charles Elkan: An Efficient Domain-Independent Algorithm for Detecting Approximately Duplicate Database Records. DMKD 1997
[RY97] E. Ristad, P. Yianilos: Learning string edit distance. IEEE Transactions on Pattern Analysis and Machine Intelligence, 1998
[S83] Gerard Salton, Michael J. McGill: Introduction to Modern Information Retrieval. McGraw-Hill, 1983
[SK04] Sunita Sarawagi, Alok Kirpal: Efficient set joins on similarity predicates. SIGMOD Conference 2004: 743-754
[TF95] Howard R. Turtle, James Flood: Query Evaluation: Strategies and Optimizations. Inf. Process. Manage. 31(6): 831-850 (1995)
[TKF01] S. Tejada, C. Knoblock, S. Minton: Learning object identification rules for information integration. Information Systems, Vol. 26, No. 8, 607-633, 2001
[W94] William E. Winkler: Advanced methods for record linkage. Proceedings of the Section on Survey Research Methods, American Statistical Association 1994: 467-472
[W99] William E. Winkler: The state of record linkage and current research problems. IRS publication R99/04 (http://www.census.gov/srd/www/byname.html)
[Y02] William E. Yancey: BigMatch: A program for extracting probable matches from a large file for record linkage. RRC 2002-01, Statistical Research Division, U.S. Bureau of the Census