
Record Linkage: Similarity Measures and Algorithms

Nick Koudas Sunita Sarawagi Divesh Srivastava


University of Toronto IIT Bombay AT&T Labs–Research
[email protected] [email protected] [email protected]

SIGMOD 2006, June 27–29, 2006, Chicago, Illinois, USA.
Copyright 2006 ACM 1-59593-256-9/06/0006 ...$5.00.

1. MOTIVATION

The quality of the data residing in databases gets degraded due to a multitude of reasons. Such reasons include typing mistakes (e.g., lexicographical errors, character transpositions) during insertion, lack of standards for recording database fields (e.g., person names, addresses), and various errors introduced by poor database design (e.g., update anomalies, missing key constraints). Data of poor quality can result in significant impediments to popular business practices: sending products or bills to incorrect addresses, inability to locate customer records during service calls, or inability to correlate customers across multiple services.

In the presence of data quality errors, a central problem is the ability to identify whether two entities (e.g., relational tuples) are approximately the same. Depending on the type of data under consideration, various "similarity metrics" (approximate match predicates) have been defined to quantify the closeness of a pair of data entities in a way that captures common mistakes. Given any specific similarity metric, a key algorithmic problem in this context is the approximate join problem: given two large multi-attribute data sets, identify all pairs of entities (tuples) in the two sets that are approximately the same. This problem is by no means new. Over the years various communities, including the statistics, machine learning, information retrieval, and database communities, have addressed many aspects of the problem, referring to it by a variety of names, including record linkage, entity identification, entity reconciliation and approximate join. We shall use the names record linkage and approximate join in the sequel. The approximate join operation is often followed by a post-processing phase where the tuple pairs produced as join results are used to cluster together tuples that refer to the same entity while minimizing the number of join pairs that get violated. These clusters define the desired entity boundaries. Given the significance and the inherent difficulty of the record linkage problem, a plethora of techniques have been developed in various communities, deploying diverse approximate match predicates.

The objectives of this tutorial are to: (i) formally define the various flavors of the record linkage problem, (ii) identify and compare the various approximate match predicates for attribute value pairs that have been introduced over the years, (iii) identify and contrast the various methodologies to combine approximate match predicates for approximate matching between tuple pairs, (iv) provide a comprehensive and cohesive overview of the key research results, techniques, and tools used for record linkage, (v) present clustering paradigms for consistent partitioning of tuples to identify entities, and (vi) identify key research areas where further work is required.

Recent tutorials in the area of data quality [1, 2] present broad overviews of various aspects of data quality and do not delve into the details of record linkage technology. Our tutorial is a significant extension of a shorter tutorial on this topic [3] and aims to provide a comprehensive overview of this fundamental area of data quality.

2. TUTORIAL OUTLINE

Our tutorial is example driven, and begins by introducing the various issues and problems that one has to cope with when designing record linkage technology. We then present and categorize typical "errors" encountered in real operational databases using concrete examples; such examples motivate the need for the methodologies we shall introduce in the sequel. We shall also formally define the various flavors of the record linkage problem as optimization problems. The bulk of the tutorial is organized as follows.

2.1 Approximate Match Predicates

We shall review a variety of approximate match predicates that have been proposed to quantify the degree of similarity or closeness of two data entities. We shall compare and contrast them based on their applicability to various data types, their algorithmic properties, computational overhead and adaptability. Most approximate match predicates return a score between 0 and 1 (with 1 being assigned to identical entities) that effectively quantifies the degree of similarity between data entities. Our presentation of such approximate match predicates will consist of three parts.

Atomic Similarity Measures: This part will review measures to assess atomic (attribute value) similarity between a pair of data entities. We shall cover several approximate match predicates in detail, including edit distance, phonetic distance (soundex), the Jaro and Winkler measures, tf.idf and many variants thereof. Several approaches to fine-tune parameters of such measures will be presented.

Functions to combine similarity measures: In this part, we shall first review techniques dealing with the following basic decision problem: given a set of pairs of attributes belonging to two entities (tuples), in which each pair is tagged with its own approximate match score (possibly applying distinct approximate match predicates for each attribute pair), how does one combine such scores to decide whether the entire entities (tuples) are approximately the same? For this basic decision
problem, we shall review an array of proposed methodologies. These include statistical and probabilistic, predictive, cost-based, rule-based, user-assisted as well as learning-based methodologies. Moreover, we shall cover several specific functions, including Naive Bayes, the Fellegi-Sunter model, linear support vector machines (SVM) and approaches based on voting theory.

Similarity between linked entities: Often the entities over which we need to resolve duplicates are linked together via foreign keys in a multi-relational database. Links might be of various types, including associative links arising out of co-occurrence in a larger context (e.g., co-authors of a paper), or structural links denoting containment. An inter-linked set of entities calls for richer similarity measures that capture the similarity of the context in which the entity pair appears. We shall present various graph-based similarity measures that capture transitive contextual similarity in combination with the intrinsic similarity between two entities.

We shall compare and contrast these similarity measures based on their features, their complexity, scalability and applicability to various domains. We shall present some optimality results and comment on their applicability in the context of record linkage.

2.2 Record Linkage Algorithms

Once the basic techniques for quantifying the degree of approximate match for a pair (or subsets) of attributes have been identified, the next challenging operation is to embed them into an approximate join framework between two data sets. This is a non-trivial task due to the large (quadratic in the size of the input) number of pairs involved. We shall present a set of algorithmic techniques for this task. A common feature of all such algorithms is the ability to keep the total number of pairs (and subsequent decisions) low by utilizing various pruning mechanisms. These algorithms can be classified into two main categories.

Algorithms inspired by relational duplicate elimination and join techniques, including sort-merge, band join and indexed nested loops. In this context, we shall review techniques like Merge/Purge (based on the concept of sorted neighborhoods), BigMatch (based on indexed nested loops joins) and Dimension Hierarchies (based on the concept of hierarchically clustered neighborhoods).

Algorithms inspired by information retrieval that treat each tuple as a set of tokens, and return those set pairs whose (weighted) overlap exceeds a specified threshold. In this context, we shall review a variety of set join algorithms.

We shall also discuss two alternative ways to realize (i.e., implement) approximate join techniques. The first is concerned with procedural algorithms operating on data, applying approximate match predicates, without a particular storage or query model in mind. The second is concerned with declarative specifications of data cleaning operations. Both approaches have their relative strengths and weaknesses. A non-declarative specification offers greater algorithmic flexibility and possibly improved performance (e.g., implemented on top of a file system without incurring RDBMS overheads). A declarative specification offers unbeatable ease of deployment (as a set of SQL queries), direct processing of data in their native store (RDBMS) and flexible integration with existing applications utilizing an RDBMS. The choice between the two realizations clearly depends on application constraints and requirements. We shall review both approaches, describing a concrete set of methodologies that can assist in each approach. We shall review various data quality tools that deploy record linkage technology and discuss their key functionalities in this area. Where publicly known, we shall discuss the specific algorithms they deploy.

2.3 Creation of Data Partitions

The output of the approximate join needs to be post-processed to cluster together all tuples that refer to the same entity. The approximate join operation above might produce seemingly inconsistent results, like tuple A joins with tuple B, tuple A joins with tuple C, but tuple B does not join with tuple C. A straightforward way to resolve such inconsistencies is to cluster together all tuples via a transitive closure of the join pairs. In practice, this can lead to extremely poor results since unrelated tuples might get connected through noisy links. A number of approaches have been proposed for tackling this problem. In particular, we shall present a newly introduced clustering paradigm called Correlation Clustering for minimizing the number of join pairs violated in creating clusters.

Finally, we shall identify key research questions pertinent to each part of our tutorial. Our tutorial will also contain an extensive survey of related bibliography from various communities.

3. PROFESSIONAL BIOGRAPHIES

Nick Koudas is a faculty member at the University of Toronto, Department of Computer Science. He holds a Ph.D. from the University of Toronto, an M.Sc. from the University of Maryland at College Park, and a B.Tech. from the University of Patras in Greece. He serves as an associate editor for the Information Systems journal, the IEEE TKDE journal and the ACM Transactions on the Web. He is the recipient of the 1998 ICDE Best Paper award. His research interests include core database management, data quality, metadata management and its applications to networking.

Sunita Sarawagi researches in the fields of databases, data mining, machine learning and statistics. She is an associate professor at the Indian Institute of Technology, Bombay. Prior to that she was a research staff member at IBM Almaden Research Center. She received her Ph.D. in databases from the University of California at Berkeley and a B.Tech. from the Indian Institute of Technology, Kharagpur. She has several publications in databases and data mining, including a best paper award at the 1998 ACM SIGMOD conference, and several patents. She is on the editorial board of the ACM TODS and ACM KDD journals and editor-in-chief of the ACM SIGKDD newsletter. She has served as a program committee member for the ACM SIGMOD, VLDB, ACM SIGKDD, IEEE ICDE and ICML conferences.

Divesh Srivastava is the head of the Database Research Department at AT&T Labs–Research. He received his Ph.D. from the University of Wisconsin, Madison, and his B.Tech. from the Indian Institute of Technology, Bombay. His current research interests include data quality, data streams and XML databases.

4. REFERENCES

[1] C. Batini, T. Catarci, and M. Scannapieco. A survey of data quality issues in cooperative information systems. Pre-conference ER tutorial, 2004.
[2] T. Johnson and T. Dasu. Data quality and data cleaning: An overview. SIGMOD tutorial, 2003.
[3] N. Koudas and D. Srivastava. Approximate joins: concepts and techniques. VLDB tutorial, 2005.
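As an editorial addendum: the pipeline outlined in Sections 2.1–2.3 (atomic similarity, score combination, approximate join, partitioning of the join output) can be made concrete with a small toy sketch. The following Python is not code from the tutorial; the normalization of edit distance to a [0, 1] score, the linear combination weights, the 0.75 join threshold and the sample records are all invented for illustration, and the quadratic join and transitive-closure clustering are deliberately the naive baselines that the algorithms surveyed in Sections 2.2 and 2.3 improve upon.

```python
# Toy record-linkage pipeline (illustrative sketch only):
# atomic similarity -> linear score combination -> naive quadratic
# approximate join -> transitive-closure partitioning via union-find.
from itertools import combinations

def edit_distance(a: str, b: str) -> int:
    """Classic Levenshtein edit distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def edit_similarity(a: str, b: str) -> float:
    """Normalize edit distance to a score in [0, 1], 1 = identical."""
    if not a and not b:
        return 1.0
    return 1.0 - edit_distance(a, b) / max(len(a), len(b))

def tuple_similarity(t1, t2, weights=(0.6, 0.4)) -> float:
    """Combine per-attribute scores with a fixed linear rule, one of the
    simplest of the combination functions surveyed; weights are made up."""
    scores = [edit_similarity(x, y) for x, y in zip(t1, t2)]
    return sum(w * s for w, s in zip(weights, scores))

def approximate_join(tuples, threshold=0.75):
    """Naive quadratic join: all pairs deemed approximately the same.
    The algorithms of Section 2.2 exist to prune this O(n^2) scan."""
    return [(i, j) for i, j in combinations(range(len(tuples)), 2)
            if tuple_similarity(tuples[i], tuples[j]) >= threshold]

def transitive_closure_clusters(n, pairs):
    """Union-find over join pairs: the 'straightforward' partitioning
    that Section 2.3 warns can over-merge through noisy links."""
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x
    for i, j in pairs:
        parent[find(i)] = find(j)
    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())

records = [("John Smith", "New York"),
           ("Jon Smith",  "New York"),
           ("Mary Jones", "Chicago")]
pairs = approximate_join(records)
print(pairs)                                          # -> [(0, 1)]
print(transitive_closure_clusters(len(records), pairs))  # -> [[0, 1], [2]]
```

The over-merging hazard is visible in the last step: a single spurious join pair between any two clusters would fuse them entirely under transitive closure, which is exactly what Correlation Clustering, by minimizing violated join pairs instead, is designed to avoid.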

