p802 Koudastutorial
problem, we shall review an array of proposed methodologies. These
include statistical and probabilistic, predictive, cost-based,
rule-based, user-assisted as well as learning-based methodologies.
Moreover, we shall cover several specific functions including Naive
Bayes, the Fellegi-Sunter model, linear support vector machines (SVM)
and approaches based on voting theory.

Similarity between linked entities: Often the entities over which we
need to resolve duplicates are linked together via foreign keys in a
multi-relational database. Links might be of various types, including
associative links arising out of co-occurrence in a larger context
(e.g., co-authors of a paper), or structural links denoting
containment. An inter-linked set of entities calls for richer
similarity measures that capture the similarity of the context in which
the entity pair appears. We shall present various graph-based
similarity measures that capture transitive contextual similarity in
combination with the intrinsic similarity between two entities.

We shall compare and contrast these similarity measures based on their
features, complexity, scalability and applicability to various domains.
We shall present some optimality results and comment on their
applicability in the context of record linkage.

2.2 Record Linkage Algorithms
Once the basic techniques for quantifying the degree of approximate
match for a pair (or subsets) of attributes have been identified, the
next challenging operation is to embed them into an approximate join
framework between two data sets. This is a non-trivial task because the
number of pairs involved is large (quadratic in the size of the input).
We shall present a set of algorithmic techniques for this task. A
common feature of all such algorithms is the ability to keep the total
number of pairs (and subsequent decisions) low by means of various
pruning mechanisms. These algorithms can be classified into two main
categories.

Algorithms inspired by relational duplicate elimination and join
techniques including sort-merge, band join and indexed nested loops. In
this context, we shall review techniques like Merge/Purge (based on the
concept of sorted neighborhoods), BigMatch (based on indexed nested
loops joins) and Dimension Hierarchies (based on the concept of
hierarchically clustered neighborhoods).

Algorithms inspired by information retrieval that treat each tuple as a
set of tokens, and return those set pairs whose (weighted) overlap
exceeds a specified threshold. In this context, we shall review a
variety of set join algorithms.

We shall also discuss two alternative ways to realize (i.e., implement)
approximate join techniques. The first is concerned with procedural
algorithms operating on data, applying approximate match predicates,
without a particular storage or query model in mind. The second is
concerned with declarative specifications of data cleaning operations.
Both approaches have their relative strengths and weaknesses. A
non-declarative specification offers greater algorithmic flexibility
and possibly improved performance (e.g., when implemented on top of a
file system without incurring RDBMS overheads). A declarative
specification offers ease of deployment (as a set of SQL queries),
direct processing of data in their native store (RDBMS) and flexible
integration with existing applications utilizing an RDBMS. The choice
between the two realizations depends on application constraints and
requirements. We shall review both approaches, describing a concrete
set of methodologies that can assist in each approach. We shall review
various data quality tools that deploy record linkage technology and
discuss their key functionalities in this area. Where publicly known,
we shall discuss specific algorithms they deploy.

2.3 Creation of Data Partitions
The output of the approximate join needs to be post-processed to
cluster together all tuples that refer to the same entity. The
approximate join operation above might produce seemingly inconsistent
results: for example, tuple A joins with tuple B and tuple A joins with
tuple C, but tuple B does not join with tuple C. A straightforward way
to resolve such inconsistencies is to cluster together all tuples via a
transitive closure of the join pairs. In practice, this can lead to
extremely poor results since unrelated tuples might get connected
through noisy links. A number of approaches have been proposed for
tackling this problem. In particular, we shall present a newly
introduced clustering paradigm called Correlation Clustering for
minimizing the number of join pairs violated in creating clusters.

Finally, we shall identify key research questions pertinent to each
part of our tutorial. Our tutorial will also contain an extensive
survey of related bibliography from various communities.

3. PROFESSIONAL BIOGRAPHIES
Nick Koudas is a faculty member at the University of Toronto,
Department of Computer Science. He holds a Ph.D. from the University of
Toronto, an M.Sc. from the University of Maryland at College Park, and
a B.Tech. from the University of Patras in Greece. He serves as an
associate editor for the Information Systems journal, the IEEE TKDE
journal and the ACM Transactions on the Web. He is the recipient of the
1998 ICDE Best Paper award. His research interests include core
database management, data quality, metadata management and its
applications to networking.

Sunita Sarawagi conducts research in the fields of databases, data
mining, machine learning and statistics. She is an associate professor
at the Indian Institute of Technology, Bombay. Prior to that she was a
research staff member at IBM Almaden Research Center. She received her
Ph.D. in databases from the University of California at Berkeley and a
B.Tech. from the Indian Institute of Technology, Kharagpur. She has
several publications in databases and data mining, including a best
paper award at the 1998 ACM SIGMOD conference, and several patents. She
is on the editorial board of the ACM TODS and ACM KDD journals and is
editor-in-chief of the ACM SIGKDD newsletter. She has served as a
program committee member for the ACM SIGMOD, VLDB, ACM SIGKDD, IEEE
ICDE and ICML conferences.

Divesh Srivastava is the head of the Database Research Department at
AT&T Labs-Research. He received his Ph.D. from the University of
Wisconsin, Madison, and his B.Tech. from the Indian Institute of
Technology, Bombay. His current research interests include data
quality, data streams and XML databases.

4. REFERENCES
[1] C. Batini, T. Catarci, and M. Scannapieco. A survey of data
    quality issues in cooperative information systems. Pre-conference
    ER tutorial, 2004.
[2] T. Johnson and P. Dasu. Data quality and data cleaning: An
    overview. SIGMOD tutorial, 2003.
[3] N. Koudas and D. Srivastava. Approximate joins: concepts and
    techniques. VLDB tutorial, 2005.
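
Illustration for Section 2.3. The following is a minimal sketch, not taken from the tutorial itself, of clustering tuples by the transitive closure of the join pairs. It uses a union-find structure, which is one common way to compute connected components; the function name `transitive_closure_clusters`, the tuple labels and the example pairs are all hypothetical and chosen only for illustration. The second call shows how a single noisy link can merge two otherwise unrelated groups.

```python
from collections import defaultdict


def transitive_closure_clusters(tuples, join_pairs):
    """Group tuples into the connected components of the join-pair graph.

    `tuples` is an iterable of hashable tuple identifiers and `join_pairs`
    is an iterable of (a, b) pairs reported by an approximate join.
    Both names are illustrative assumptions, not part of any tool's API.
    """
    parent = {t: t for t in tuples}

    def find(x):
        # Follow parent pointers to the representative, halving the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in join_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b  # merge the two components

    clusters = defaultdict(set)
    for t in tuples:
        clusters[find(t)].add(t)
    # Return a deterministic, sorted list of sorted clusters.
    return sorted(sorted(c) for c in clusters.values())


tuples = ["A", "B", "C", "D", "E"]

# A joins with B and with C, but B does not join with C directly.
join_pairs = [("A", "B"), ("A", "C"), ("D", "E")]
clusters = transitive_closure_clusters(tuples, join_pairs)

# One spurious (noisy) join pair between C and D collapses everything.
noisy_clusters = transitive_closure_clusters(tuples, join_pairs + [("C", "D")])

print(clusters)
print(noisy_clusters)
```

In this sketch B and C end up in the same cluster even though they never joined directly, and the single extra pair (C, D) merges all five tuples into one cluster. This is the kind of behaviour that motivates alternatives such as Correlation Clustering.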