
Record Linkage: Similarity Measures and Algorithms

Nick Koudas Sunita Sarawagi Divesh Srivastava


University of Toronto IIT Bombay AT&T Labs–Research
[email protected] [email protected] [email protected]

SIGMOD 2006, June 27–29, 2006, Chicago, Illinois, USA.
Copyright 2006 ACM 1-59593-256-9/06/0006 ...$5.00.

1. MOTIVATION

The quality of the data residing in databases gets degraded due to a multitude of reasons. Such reasons include typing mistakes (e.g., lexicographical errors, character transpositions) during insertion, lack of standards for recording database fields (e.g., person names, addresses), and various errors introduced by poor database design (e.g., update anomalies, missing key constraints). Data of poor quality can result in significant impediments to popular business practices: sending products or bills to incorrect addresses, inability to locate customer records during service calls, or inability to correlate customers across multiple services.

In the presence of data quality errors, a central problem is the ability to identify whether two entities (e.g., relational tuples) are approximately the same. Depending on the type of data under consideration, various "similarity metrics" (approximate match predicates) have been defined to quantify the closeness of a pair of data entities in a way that captures common mistakes. Given any specific similarity metric, a key algorithmic problem in this context is the approximate join problem: given two large multi-attribute data sets, identify all pairs of entities (tuples) in the two sets that are approximately the same. This problem is by no means new. Over the years various communities, including the statistics, machine learning, information retrieval, and database communities, have addressed many aspects of the problem, referring to it by a variety of names, including record linkage, entity identification, entity reconciliation and approximate join. We shall use the names record linkage and approximate join in the sequel. The approximate join operation is often followed by a post-processing phase where the tuple pairs produced as join results are used to cluster together tuples that refer to the same entity while minimizing the number of join pairs that get violated. These clusters define the desired entity boundaries. Given the significance and the inherent difficulty of the record linkage problem, a plethora of techniques have been developed in various communities, deploying diverse approximate match predicates.

The objectives of this tutorial are to: (i) formally define the various flavors of the record linkage problem, (ii) identify and compare the various approximate match predicates for attribute value pairs that have been introduced over the years, (iii) identify and contrast the various methodologies to combine approximate match predicates for approximate matching between tuple pairs, (iv) provide a comprehensive and cohesive overview of the key research results, techniques, and tools used for record linkage, (v) present clustering paradigms for consistent partitioning of tuples to identify entities, and (vi) identify key research areas where further work is required.

Recent tutorials in the area of data quality [1, 2] present broad overviews of various aspects of data quality and do not delve into the details of record linkage technology. Our tutorial is a significant extension of a shorter tutorial on this topic [3] and aims to provide a comprehensive overview of this fundamental area of data quality.

2. TUTORIAL OUTLINE

Our tutorial is example driven, and begins by introducing the various issues and problems that one has to cope with when designing record linkage technology. We then present and categorize typical "errors" encountered in real operational databases using concrete examples; such examples motivate the need for the methodologies we shall introduce in the sequel. We shall also formally define the various flavors of the record linkage problem as optimization problems. The bulk of the tutorial is organized as follows.

2.1 Approximate Match Predicates

We shall review a variety of approximate match predicates that have been proposed to quantify the degree of similarity or closeness of two data entities. We shall compare and contrast them based on their applicability to various data types, their algorithmic properties, computational overhead and adaptability. Most approximate match predicates return a score between 0 and 1 (with 1 being assigned to identical entities) that effectively quantifies the degree of similarity between data entities. Our presentation of such approximate match predicates will consist of three parts.

Atomic Similarity Measures: This part will review measures to assess atomic (attribute value) similarity between a pair of data entities. We shall cover several approximate match predicates in detail, including edit distance, phonetic distance (soundex), the Jaro and Winkler measures, tf.idf and many variants thereof. Several approaches to fine-tune parameters of such measures will be presented.

Functions to combine similarity measures: In this part, we shall first review techniques dealing with the following basic decision problem: given a set of pairs of attributes belonging to two entities (tuples), in which each pair is tagged with its own approximate match score (possibly applying distinct approximate match predicates for each attribute pair), how does one combine such scores to decide whether the entire entities (tuples) are approximately the same? For this basic decision
problem, we shall review an array of proposed methodologies. These include statistical and probabilistic, predictive, cost-based, rule-based, user-assisted as well as learning-based methodologies. Moreover, we shall cover several specific functions, including Naive Bayes, the Fellegi-Sunter model, linear support vector machines (SVM) and approaches based on voting theory.

Similarity between linked entities: Often the entities over which we need to resolve duplicates are linked together via foreign keys in a multi-relational database. Links might be of various types, including associative links arising out of co-occurrence in a larger context (e.g., co-authors of a paper), or structural links denoting containment. An inter-linked set of entities calls for richer similarity measures that capture the similarity of the context in which the entity pair appears. We shall present various graph-based similarity measures that capture transitive contextual similarity in combination with the intrinsic similarity between two entities.

We shall compare and contrast these similarity measures based on their features, their complexity, scalability and applicability to various domains. We shall present some optimality results and comment on their applicability in the context of record linkage.

2.2 Record Linkage Algorithms

Once the basic techniques for quantifying the degree of approximate match for a pair (or subsets) of attributes have been identified, the next challenging operation is to embed them into an approximate join framework between two data sets. This is a non-trivial task due to the large (quadratic in the size of the input) number of pairs involved. We shall present a set of algorithmic techniques for this task. A common feature of all such algorithms is the ability to keep the total number of pairs (and subsequent decisions) low by utilizing various pruning mechanisms. These algorithms can be classified into two main categories.

Algorithms inspired by relational duplicate elimination and join techniques, including sort-merge, band join and indexed nested loops. In this context, we shall review techniques like Merge/Purge (based on the concept of sorted neighborhoods), BigMatch (based on indexed nested loops joins) and Dimension Hierarchies (based on the concept of hierarchically clustered neighborhoods).

Algorithms inspired by information retrieval that treat each tuple as a set of tokens, and return those set pairs whose (weighted) overlap exceeds a specified threshold. In this context, we shall review a variety of set join algorithms.

We shall also discuss two alternative ways to realize (i.e., implement) approximate join techniques. The first is concerned with procedural algorithms operating on data, applying approximate match predicates, without a particular storage or query model in mind. The second is concerned with declarative specifications of data cleaning operations. Both approaches have their relative strengths and weaknesses. A non-declarative specification offers greater algorithmic flexibility and possibly improved performance (e.g., implemented on top of a file system without incurring RDBMS overheads). A declarative specification offers unbeatable ease of deployment (as a set of SQL queries), direct processing of data in their native store (RDBMS) and flexible integration with existing applications utilizing an RDBMS. The choice between the two realizations clearly depends on application constraints and requirements. We shall review both approaches, describing a concrete set of methodologies that can assist in each approach. We shall review various data quality tools that deploy record linkage technology and discuss their key functionalities in this area. Where publicly known, we shall discuss the specific algorithms they deploy.

2.3 Creation of Data Partitions

The output of the approximate join needs to be post-processed to cluster together all tuples that refer to the same entity. The approximate join operation above might produce seemingly inconsistent results, like tuple A joins with tuple B, tuple A joins with tuple C, but tuple B does not join with tuple C. A straightforward way to resolve such inconsistencies is to cluster together all tuples via a transitive closure of the join pairs. In practice, this can lead to extremely poor results since unrelated tuples might get connected through noisy links. A number of approaches have been proposed for tackling this problem. In particular, we shall present a newly introduced clustering paradigm called Correlation Clustering for minimizing the number of join pairs violated in creating clusters.

Finally, we shall identify key research questions pertinent to each part of our tutorial. Our tutorial will also contain an extensive survey of related bibliography from various communities.

3. PROFESSIONAL BIOGRAPHIES

Nick Koudas is a faculty member at the University of Toronto, Department of Computer Science. He holds a Ph.D. from the University of Toronto, an M.Sc. from the University of Maryland at College Park, and a B.Tech. from the University of Patras in Greece. He serves as an associate editor for the Information Systems journal, the IEEE TKDE journal and the ACM Transactions on the Web. He is the recipient of the 1998 ICDE Best Paper award. His research interests include core database management, data quality, metadata management and its applications to networking.

Sunita Sarawagi researches in the fields of databases, data mining, machine learning and statistics. She is an associate professor at the Indian Institute of Technology, Bombay. Prior to that she was a research staff member at IBM Almaden Research Center. She received her Ph.D. in databases from the University of California at Berkeley and a B.Tech. from the Indian Institute of Technology, Kharagpur. She has several publications in databases and data mining, including a best paper award at the 1998 ACM SIGMOD conference, and several patents. She is on the editorial board of the ACM TODS and ACM KDD journals and editor-in-chief of the ACM SIGKDD newsletter. She has served as a program committee member for the ACM SIGMOD, VLDB, ACM SIGKDD, IEEE ICDE and ICML conferences.

Divesh Srivastava is the head of the Database Research Department at AT&T Labs–Research. He received his Ph.D. from the University of Wisconsin, Madison, and his B.Tech. from the Indian Institute of Technology, Bombay. His current research interests include data quality, data streams and XML databases.

4. REFERENCES

[1] C. Batini, T. Catarci, and M. Scannapieco. A survey of data quality issues in cooperative information systems. Pre-conference ER tutorial, 2004.
[2] T. Johnson and T. Dasu. Data quality and data cleaning: An overview. SIGMOD tutorial, 2003.
[3] N. Koudas and D. Srivastava. Approximate joins: concepts and techniques. VLDB tutorial, 2005.
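As an editorial addendum: the pipeline outlined in Sections 2.1–2.3 (atomic similarity, score combination, approximate join, partitioning of the join output) can be made concrete with a small toy sketch. The following Python is not code from the tutorial; the normalization of edit distance to a [0, 1] score, the linear combination weights, the 0.75 join threshold and the sample records are all invented for illustration, and the quadratic join and transitive-closure clustering are deliberately the naive baselines that the algorithms surveyed in Sections 2.2 and 2.3 improve upon.

```python
# Toy record-linkage pipeline (illustrative sketch only):
# atomic similarity -> linear score combination -> naive quadratic
# approximate join -> transitive-closure partitioning via union-find.
from itertools import combinations

def edit_distance(a: str, b: str) -> int:
    """Classic Levenshtein edit distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def edit_similarity(a: str, b: str) -> float:
    """Normalize edit distance to a score in [0, 1], 1 = identical."""
    if not a and not b:
        return 1.0
    return 1.0 - edit_distance(a, b) / max(len(a), len(b))

def tuple_similarity(t1, t2, weights=(0.6, 0.4)) -> float:
    """Combine per-attribute scores with a fixed linear rule, one of the
    simplest of the combination functions surveyed; weights are made up."""
    scores = [edit_similarity(x, y) for x, y in zip(t1, t2)]
    return sum(w * s for w, s in zip(weights, scores))

def approximate_join(tuples, threshold=0.75):
    """Naive quadratic join: all pairs deemed approximately the same.
    The algorithms of Section 2.2 exist to prune this O(n^2) scan."""
    return [(i, j) for i, j in combinations(range(len(tuples)), 2)
            if tuple_similarity(tuples[i], tuples[j]) >= threshold]

def transitive_closure_clusters(n, pairs):
    """Union-find over join pairs: the 'straightforward' partitioning
    that Section 2.3 warns can over-merge through noisy links."""
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x
    for i, j in pairs:
        parent[find(i)] = find(j)
    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())

records = [("John Smith", "New York"),
           ("Jon Smith",  "New York"),
           ("Mary Jones", "Chicago")]
pairs = approximate_join(records)
print(pairs)                                          # -> [(0, 1)]
print(transitive_closure_clusters(len(records), pairs))  # -> [[0, 1], [2]]
```

The over-merging hazard is visible in the last step: a single spurious join pair between any two clusters would fuse them entirely under transitive closure, which is exactly what Correlation Clustering, by minimizing violated join pairs instead, is designed to avoid.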

