
Eliminating Fuzzy Duplicates in Data Warehouses

Rohit Ananthakrishna¹ (Cornell University, [email protected])
Surajit Chaudhuri and Venkatesh Ganti (Microsoft Research, {surajitc, vganti}@microsoft.com)

Abstract

The duplicate elimination problem of detecting multiple tuples, which describe the same real world entity, is an important data cleaning problem. Previous domain-independent solutions to this problem relied on standard textual similarity functions (e.g., edit distance, cosine metric) between multi-attribute tuples. However, such approaches result in large numbers of false positives if we want to identify domain-specific abbreviations and conventions. In this paper, we develop an algorithm for eliminating duplicates in dimensional tables in a data warehouse, which are usually associated with hierarchies. We exploit hierarchies to develop a high quality, scalable duplicate elimination algorithm, and evaluate it on real datasets from an operational data warehouse.

¹ Work done while visiting Microsoft Research.

Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the VLDB copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Very Large Data Base Endowment. To copy otherwise, or to republish, requires a fee and/or special permission from the Endowment.
Proceedings of the 28th VLDB Conference, Hong Kong, China, 2002.

1. Introduction

Decision support analysis on data warehouses influences important business decisions; therefore, the accuracy of such analysis is crucial. However, data received at the data warehouse from external sources usually contains errors: spelling mistakes, inconsistent conventions, etc. Hence, a significant amount of time and money is spent on data cleaning, the task of detecting and correcting errors in data.

The problem of detecting and eliminating duplicated data is one of the major problems in the broad area of data cleaning and data quality [e.g., HS95, ME97, RD00]. Many times, the same logical real world entity may have multiple representations in the data warehouse. For example, when Lisa purchases products from SuperMart twice, she might be entered as two different customers—[Lisa Simpson, Seattle, WA, USA, 98025] and [Lisa Simson, Seattle, WA, United States, 98025]—due to data entry errors. Such duplicated information can significantly increase direct mailing costs because several customers like Lisa may be sent multiple catalogs. Moreover, such duplicates can cause incorrect results in analysis queries (say, the number of SuperMart customers in Seattle), and erroneous data mining models to be built. We refer to this problem of detecting and eliminating multiple distinct records representing the same real world entity as the fuzzy duplicate elimination problem, which is sometimes also called the merge/purge, dedup, or record linkage problem [e.g., HS95, ME97, FS69]. This problem is different from the standard duplicate elimination problem, say for answering "select distinct" queries in relational database systems, which considers two tuples to be duplicates only if they match exactly on all attributes. Data cleaning deals with fuzzy duplicate elimination, which is our focus in this paper. Henceforth, we use duplicate elimination to mean fuzzy duplicate elimination.

Duplicate elimination is hard because it is caused by several types of errors: typographical errors, and equivalence errors—different (non-unique and non-standard) representations of the same logical value. For instance, a user may enter "WA, United States" or "Wash., USA" for "WA, United States of America." Equivalence errors in product tables ("winxp pro" for "windows XP Professional") are different from those encountered in bibliographic tables ("VLDB" for "very large databases"), etc. Also, it is important to detect and clean equivalence errors because a single equivalence error may result in several duplicate tuples.

The class of equivalence errors can be addressed by building sets of rules. For instance, most commercial address cleaning software packages (e.g., Trillium) use rules to detect errors in names and addresses. In this paper, we focus on domain-independent duplicate elimination techniques; domain-specific information, when available, complements these techniques. Previous domain-independent methods for duplicate elimination rely on textual similarity functions (e.g., edit distance or cosine metric), predicting that two tuples whose textual similarity is greater than a pre-specified similarity threshold are duplicates [FS69, KA85, Coh98, HS95, ME96]. However, using these functions to detect duplicates due to equivalence errors (say, "US" and "United States") requires that the threshold be dropped low enough, resulting in a large number of false positives—pairs of tuples incorrectly detected to be duplicates.
Organization (at Level 1)
OrgId  Name                    Address              CityId
O1     Clintstone Assoc.       #1, Lake View Blvd.  C1
O2     Compuware               #20, Main Street     C2
O3     Compuwar                #20, Main Street     C3
O4     Clintstone Associates   #1, Lake View        C4
O5     Ideology Corp.          #10, Vancouver Pl.   C5
O6     Victoria Films          #5, Victoria Av.     C6
O7     Ideology Corporation    #10, Vanc. Pl.       C7
O8     Clark Consultants Ltd.  #8, Cherry Street    C8
O9     Clark Consultants       #8, Cherr St.        C9

City (at Level 2)
CityId  City       StateId
C1      Joplin     S1
C2      Jopin      S2
C3      Joplin     S4
C4      Joplin     S3
C5      Victoria   S5
C6      Victoria   S6
C7      Vancouver  S5
C8      Aberdeen   S7
C9      Aberdeen   S8

State (at Level 3)
StateId  State             CtryId
S1       MO                1
S2       MO                2
S3       MO                3
S4       Missouri          3
S5       BC                4
S6       British Columbia  4
S7       Aberdeen shire    5
S8       Aberdeen          5

Country (at Level 4)
CtryId  Country
1       United States of America
2       United States
3       USA
4       Canada
5       UK

Figure 1: An Example Customer Database
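For readers who want to trace the running example programmatically, the following sketch (ours, not part of the original paper) loads the Figure 1 relations into plain Python dictionaries; the later sketches in this version reuse these structures.

```python
# Figure 1 data as simple in-memory relations. The generated keys (OrgId,
# CityId, StateId, CtryId) are used only to join across levels, mirroring
# the paper's key--foreign key links.

country = {1: "United States of America", 2: "United States", 3: "USA",
           4: "Canada", 5: "UK"}                      # CtryId -> Country
state = {"S1": ("MO", 1), "S2": ("MO", 2), "S3": ("MO", 3),
         "S4": ("Missouri", 3), "S5": ("BC", 4),
         "S6": ("British Columbia", 4), "S7": ("Aberdeen shire", 5),
         "S8": ("Aberdeen", 5)}                       # StateId -> (State, CtryId)
city = {"C1": ("Joplin", "S1"), "C2": ("Jopin", "S2"), "C3": ("Joplin", "S4"),
        "C4": ("Joplin", "S3"), "C5": ("Victoria", "S5"),
        "C6": ("Victoria", "S6"), "C7": ("Vancouver", "S5"),
        "C8": ("Aberdeen", "S7"), "C9": ("Aberdeen", "S8")}  # CityId -> (City, StateId)
organization = {"O1": ("Clintstone Assoc.", "#1, Lake View Blvd.", "C1"),
                "O2": ("Compuware", "#20, Main Street", "C2"),
                "O3": ("Compuwar", "#20, Main Street", "C3"),
                "O4": ("Clintstone Associates", "#1, Lake View", "C4"),
                "O5": ("Ideology Corp.", "#10, Vancouver Pl.", "C5"),
                "O6": ("Victoria Films", "#5, Victoria Av.", "C6"),
                "O7": ("Ideology Corporation", "#10, Vanc. Pl.", "C7"),
                "O8": ("Clark Consultants Ltd.", "#8, Cherry Street", "C8"),
                "O9": ("Clark Consultants", "#8, Cherr St.", "C9")}
```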
For instance, tuple pairs with the values "USSR" and "United States" in the country attribute are also likely to be declared duplicates if we were to detect "US" and "United States" as duplicates using textual similarity.

In this paper, we exploit the dimensional hierarchies typically associated with dimensional tables in data warehouses to develop an efficient, scalable duplicate elimination algorithm called Delphi (Duplicate ELimination in the Presence of HIerarchies), which significantly reduces the number of false positives without missing out on detecting duplicates. We rely on hierarchies to detect an important class of equivalence errors in each relation, and to significantly reduce the number of false positives.

For example, Figure 1 describes the schema maintaining the Customer information in a typical company selling products or services. The dimensional hierarchy here consists of four relations—the Organization, City, State, and Country relations—connected by key—foreign key relationships (also called referential links). We say that the Organization and the Country relations are the bottom and the top relations, respectively. Consider the tuples USA and United States in the Country relation in Figure 1. The state attribute value "MO" appears in tuples in the State relation joining with the countries USA and United States, whereas most state values occur with only one Country tuple. That is, USA and United States co-occur through the state MO. In general, country tuples are associated with sets of state values. The degree of overlap between the sets associated with two countries is a measure of co-occurrence between them, and can be used to detect duplicates (e.g., USA and United States).

The above notion of co-occurrence can also be used for reducing the number of false positives. Consider the two countries "USA" and "UK" in Figure 1. Because they are sufficiently close according to the edit distance function, a commonly used textual similarity function, we might (incorrectly) deduce that they are duplicates. Such problems can occur even with other textual similarity functions like the cosine metric. Using our notion of co-occurrence through the State relation, we observe that the children sets of USA and UK—the sets of states {MO, Missouri} and {Aberdeen, Aberdeen shire} joining with USA and UK, respectively—are disjoint. Hence, we conclude that USA and UK are unlikely to be duplicates.

For reasons of efficiency and scalability, we want to avoid comparing all pairs of tuples in each relation of the hierarchy. Previous approaches have considered the windowing strategy, which sorts a relation on a key and compares all records within a sliding window on the sorted order [HS95]. However, observe that equivalence errors (e.g., UK and Great Britain) may not be adjacent to each other in standard sort orders, e.g., the lexicographical order. We exploit the dimensional hierarchy and propose a grouping strategy, which only compares tuples within small groups of each relation. For instance, we only compare two State tuples if they join with the same Country tuple, or with Country tuples that are duplicates of each other. Since such groups are often much smaller than the entire relation, the grouping strategy allows us to compare all pairs of tuples in each group, and yet be very efficient.

The outline of the paper is as follows. In Section 2, we discuss related work. In Section 3, we discuss key concepts and definitions. In Section 4, we describe Delphi. In Section 5, we discuss a few important issues. In Section 6, we discuss results from a thorough experimental evaluation on real datasets.

2. Related Work

Several earlier proposals exist for the problem of duplicate elimination (e.g., [FS69, KA85, HS95, ME96, ME97, Coh98]). As mentioned earlier, all these methods rely on threshold-based textual similarity functions to detect duplicates, and hence do not detect equivalence errors unless we lower thresholds sufficiently; lower thresholds result in an explosion of the number of false positives.
The record linkage literature also focuses on automatically determining appropriate thresholds [FS69, KA85], but still suffers from the false positive explosion while detecting equivalence errors. Gravano et al. proposed an algorithm for approximate string joins, which in principle can be adapted to detect duplicate records [GIJ+01]. Since they use the edit distance function to measure closeness between tuples, their technique suffers from the drawbacks of strategies relying only on textual similarity functions. In this paper, we exploit hierarchies on dimensional tables to detect an important class of equivalence errors (those which exhibit significant co-occurrence through other relations) without increasing the number of false positives.

A significant amount of work exists on other related aspects of data cleaning: the development of transformational cleaning operations [RH01, GFS+01], the detection and correction of formatting errors in address data [BDS01], and the design of "good" business practices and process flows to prevent problems of deteriorating data quality [Pro, NR99]. The automatic detection of integrity constraints (functional dependencies and key—foreign key relationships) [MR94, KM95, HKPT98], so that they can be enforced in future to improve data quality, is complementary to techniques for cleaning existing data. Because of the commercial importance of the data cleaning problem, several domain-specific industrial tools exist; Galhardas provides a nice survey of many commercial tools [Gal].

Our notion of co-occurrence between tuples is similar to that used for clustering categorical data [e.g., GKR98, GRS99, GGR99] and that used for matching schemas [MBR01].

3. Concepts and Definitions

A dimensional hierarchy consists of a chain of relations linked by key—foreign key dependencies; Figure 1 illustrates an example. An entity described by the hierarchy also consists of a chain of tuples (one from each relation), each of which joins with the tuple from its parent relation. For example, [<o1, Walmart, c1>, <c1, Redmond, s1>, <s1, WA, t1>, <t1, USA>] describes an organization entity where o1, c1, etc. are identifiers typically generated for maintaining referential links. For clarity in notation, we do not explicitly list identifiers in tuples unless required.

Consider two organization entities [<Walmart>, <Redmond>, <WA>, <USA>] and [<Walmart>, <Seattle>, <WA>, <USA>] in the Customer information with the dimensional hierarchy shown in Figure 1. The corresponding pairs of tuples in the Name, State, and Country relations individually are identical. However, they are not duplicates on the City relation, and in fact this difference makes the two entities distinct. This phenomenon is characteristic of dimensional hierarchies. For example, publications with the same title may appear in the proceedings of a conference as well as in a journal, and they are two distinct entities in the publications database. Motivated by these typical scenarios, we consider two entities in a dimensional hierarchy to be duplicates if corresponding pairs of tuples in each relation of the hierarchy either match exactly or are duplicates (according to duplicate detection functions at each level). For example, two entities in Figure 1 are duplicates if the respective pairs of Country, State, City, and Organization tuples of the two entities are duplicates. Below, we formally introduce dimensional hierarchies, the definition of duplicate entities, and our duplicate detection functions.

3.1. Dimensional Hierarchies

Relations R1, …, Rm with keys K1, …, Km constitute a dimensional hierarchy if and only if there is a key—foreign key relationship between Ri-1 and Ri (2 ≤ i ≤ m). Ri is the ith level relation in the hierarchy; R1 and Rm are the bottom and the top relations, respectively, and Ri is the child of Ri+1.

Let the unnormalized dimension table R be the join of R1, …, Rm through the chain of key—foreign key relationships. We say that a tuple vi in Ri joins with a tuple vj in Rj if there exists a tuple v in R such that the projections of v on Ri and Rj equal vi and vj, respectively. Specifically, we say that vi in Ri is a child of vi+1 in Ri+1 if vi joins with vi+1. For example, in Figure 1, [S3, MO, 3] in the State relation is a child of [3, USA] in the Country relation. We say that a tuple combination (or a row in R) [r1, …, rm] is an entity if each ri joins with ri+1.

In typical dimensional tables of data warehouses, the values of the key attributes K1, …, Km are artificially generated by the loading process before a tuple vi is inserted into Ri. Such generated keys are not useful for fuzzily matching two tuples, and can only be used for joining tuples across relations in the hierarchy. From now on, we overload the term "tuple" to also mean only the descriptive attribute values—the set of attribute values not including the generated key attributes—and clarify when it is not clear from the context.

3.2. Definition of Duplicates

We now formally define our notion of duplicate entities assuming duplicate detection functions at each level. Let f1, …, fm be binary functions called duplicate detection functions, where each fi takes a pair of tuples in Ri and returns 1 if they are duplicates, and -1 otherwise. Let r=[r1, …, rm] and s=[s1, …, sm] be two entities. We say that r is a duplicate of s if and only if fi(ri, si)=1 for all i in {1, …, m}.
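As a minimal sketch of this definition (ours, with exact-match detectors standing in for the per-level functions fi), the entity-level test is simply a conjunction of the level-wise tests:

```python
def is_duplicate_entity(r, s, fs):
    # r, s: entities as lists of per-level tuples (bottom R1 to top Rm);
    # fs: per-level duplicate detection functions returning 1 or -1.
    return all(f(ri, si) == 1 for f, ri, si in zip(fs, r, s))

# Exact-match detectors at every level reduce this to standard
# (non-fuzzy) duplicate elimination:
exact = [lambda a, b: 1 if a == b else -1] * 4
r = [("Compuware", "#20, Main Street"), ("Jopin",), ("MO",), ("United States",)]
s = [("Compuwar", "#20, Main Street"), ("Joplin",), ("Missouri",), ("USA",)]
print(is_duplicate_entity(r, s, exact))  # False without fuzzy detectors
```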
For instance, we consider the two entities [<Compuware, #20 Main Street>, <Jopin>, <MO>, <United States>] and [<Compuwar, #20 Main Street>, <Joplin>, <Missouri>, <USA>] in Figure 1 to be duplicates only if the following pairs are duplicates: "United States" and "USA" in the Country relation, "MO" and "Missouri" in the State relation, "Jopin" and "Joplin" in the City relation, and "Compuware, #20 Main Street" and "Compuwar, #20 Main Street" in the Organization relation. Observe that we can easily extend this definition to sub-entities [ri, …, rm] and [si, …, sm].

3.3. Duplicate Detection Functions

We exploit dimensional hierarchies to measure co-occurrence among tuples for detecting equivalence errors and for reducing false positives. This is in conjunction with the textual similarity functions (like the cosine metric and edit distance) which have traditionally been employed for detecting duplicates. Our final duplicate detection function is a weighted vote of the predictions obtained from the co-occurrence and textual similarity functions. Intuitively, the weight of a prediction is indicative of the importance of the information used to arrive at the prediction.

We adopt the standard thresholded similarity function approach to define duplicate detection functions [HS95]. That is, if the textual (or co-occurrence) similarity between two tuples is greater than a threshold, then the two tuples are predicted to be duplicates according to textual (or co-occurrence) similarity. In this section, we assume that thresholds are known; in Section 4.3, we relax this assumption and describe automatic threshold determination. First, we introduce the notion of set containment, which we use to define the similarity functions. We only consider textual attributes for comparing tuples, and assume default conversions from other types to text, e.g., integer zipcodes are converted to varchar.

Given a collection of sets, each defined over some domain of objects, an intuitive notion of how similar a set S is to a set S' is the fraction of S objects contained in S'. This notion of containment similarity has been effectively used to measure document similarity [BGM+97]. We extend this notion to take into account the importance of objects in distinguishing sets. For example, the set {Microsoft, incorporated} is more similar to {Microsoft, inc} than it is to {Boeing, incorporated} because the token Microsoft is more distinguishing than the token incorporated. The IDF (inverse document frequency) value of an object has been successfully used in the information retrieval literature to quantify this notion of importance [BYRN99]. We now formalize this intuition.

Let Ω be a set of objects. Let G be a collection of sets of objects from Ω. Let B(G) be the bag of all objects contained in any set in G. The frequency fG(o) of an object o with respect to G is the frequency of o in B(G). The IDF value of o with respect to G is IDFG(o) = log(|G| / fG(o)). Also, we define the IDF value IDFG(S) of a set S (a subset of Ω) to be IDFG(S) = Σ s∈S IDFG(s).

Containment Metric: We define the containment metric cmG(S1, S2) with respect to G between two sets S1 and S2 to be the ratio of the IDF value IDFG(S1 ∩ S2) of their intersection to the IDF value IDFG(S1) of the first set S1.

For clarity in presentation, we drop the subscript G from the above notation when extending it to define the textual and co-occurrence similarity metrics.

3.3.1. Textual Similarity Function (tcm)

We assume that each tuple v can be split into a set of tokens using a tokenization function (say, based on white spaces). Treating each tuple as a set of tokens, the token containment metric between v and v' is the IDF-weighted fraction of v's tokens that v' contains.

Let G={v1, …, vn} be a set of tuples from Ri. Let TS(v) denote the set of tokens in a tuple v. Let Bt(G) be the bag (multi-set) of all tokens that occur in any tuple in G, and let tf(t) denote the frequency of a token t in Bt(G). The token containment metric tcm(v, v') with respect to G between tuples v and v' in G is given by the containment metric cm(TS(v), TS(v')) with respect to G between their token sets. For example, if all tokens have equal IDF values, then tcm(["MO", "United States"], ["MO", "United States of America"]) is 1.0, and tcm(["MO", "United States of America"], ["MO", "United States"]) is 0.6.

Observe that when two tokens differ slightly due to a typographical error, the token containment metric still treats them as two distinct tokens. To address this shortcoming, we treat two very similar tokens in Bt(G)—those with edit distance less than a small value (say, 0.15)—as synonyms. (The edit distance between tokens t1 and t2 is the minimum number of edit operations—delete, insert, transpose, and replace—required to change t1 to t2; we normalize this value by the sum of their lengths [AEP01].)
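The sketch below (ours; tokenization and IDF handling simplified) shows one way to compute the IDF values and the token containment metric defined above:

```python
import math
from collections import Counter

def token_sets(tuples_in_group):
    # Tokenize each tuple's descriptive attribute values on white space.
    return [set(" ".join(t).split()) for t in tuples_in_group]

def idf_table(tsets):
    # IDF_G(t) = log(|G| / tf(t)), where tf(t) is t's frequency in Bt(G).
    bag = Counter(tok for ts in tsets for tok in ts)
    n = len(tsets)
    return {tok: math.log(n / f) for tok, f in bag.items()}

def idf(tokens, table):
    # The IDF of a set is the sum of the IDF values of its members.
    return sum(table.get(tok, 0.0) for tok in tokens)

def tcm(ts_v, ts_w, table):
    # Token containment metric:
    # cm(TS(v), TS(v')) = IDF(TS(v) & TS(v')) / IDF(TS(v)).
    denom = idf(ts_v, table)
    return idf(ts_v & ts_w, table) / denom if denom else 0.0

# Example: the group of State tuples joining with USA and its duplicates.
group = [("MO",), ("MO",), ("MO",), ("Missouri",)]
tsets = token_sets(group)
table = idf_table(tsets)
print(tcm(tsets[0], tsets[1], table))  # 1.0: identical token sets
```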
3.3.2. Co-occurrence Similarity Function (fkcm)

In a dimensional hierarchy, a tuple in the parent relation Ri joins with a set of tuples in the child relation, which we call its children set. We measure the co-occurrence between two distinct tuples by the amount of overlap between the children sets of the two tuples. An unusually significant co-occurrence (more than the average overlap between pairs of tuples in Ri, or above a certain threshold) is a cause for suspecting that one is a duplicate of the other. For example, in Figure 1, the duplicate states MO and Missouri co-occur with the city "Joplin," whereas other distinct states do not co-occur with any common cities. Informally, our co-occurrence measure—called the foreign key containment metric (fkcm)—between two tuples is the containment metric between the children sets of the first and the second tuples.

If i > 1, we say that two tuples v1 and v2 in Ri co-occur through a tuple v in Ri-1 if they both join with v. In general, two distinct tuples v1 and v2 in Ri join with two sets S1 and S2 (usually with little overlap) of tuples in Ri-1. We call S1 the children set CS(v1) of v1, and S2 the children set CS(v2) of v2. Let G={v1, …, vn} be a set of tuples from Ri. Let Bc(G) be the bag (multi-set) of all children tuples in Ri-1 with any tuple in G as parent. The child frequency cf(c) of a child tuple c with respect to G is the number of times c occurs in Bc(G). The FK-containment metric fkcm(v, v') with respect to G between v and v' in G is the containment metric cm(CS(v), CS(v')) with respect to Bc(G) between the children sets CS(v) and CS(v'). For example, the FK-containment metric between the values "Missouri" (whose State.Id is S4) and "MO" (whose State.Id is S3) in the State relation of Figure 1 is 1.0 because their children sets are identical ({Joplin}).

Note that while measuring co-occurrence between two tuples in Ri, we only use Ri-1 and disregard information from relations further below, for two reasons. First, the restriction improves efficiency, because the number of distinct combinations joining with a tuple in Ri increases as we go further down the hierarchy. For example, the number of state tuples pointing to "United States" in the Country relation is less than the number of [city, state] tuple pairs that point to it. Therefore, the restriction enables efficient computation of our co-occurrence measure between tuples. Second, the co-occurrence information between tuples in Ri provided by relations Rj (j < i-1) is usually already available from Ri-1: tuples in Rj (j < i-1) which join with the same tuple in Ri are also likely to join with the same tuples in Ri-1, if the children sets of distinct tuples are very different from each other. We discuss two exceptional cases in Section 5.

3.3.3. Combination Function

We use thresholded similarity metrics for detecting duplicates. That is, when the similarity cm(v, v') between v and v' is greater than a threshold, the duplicate detection function using cm predicts that v is a duplicate of v'. We now discuss the combination of the predictions obtained from both functions. We adopt a weighted voting of the predictions, where the weight of a prediction is proportional to the "importance of the information" used to arrive at the prediction. (For the lowest relation R1 in the hierarchy, we return the prediction of tcm.) As discussed earlier, the IDF values of the token and children sets capture the amount of information, because sets containing more distinguishing tokens or children tuples have higher IDF values.

For a tuple v in Ri (i > 1), let wt = IDF(TS(v)) and wc = IDF(CS(v)). Let tcm_threshold and fkcm_threshold be the textual and co-occurrence similarity thresholds, respectively. Let pos: R → {1, -1} be a function defined as follows: pos(x) = 1 if x > 0, and -1 otherwise. Our weighted voting combination function is:

pos( wt * pos(tcm(v, v') - tcm_threshold) + wc * pos(fkcm(v, v') - fkcm_threshold) )

Essentially, the combination function returns the prediction, 1 (duplicate) or -1 (not a duplicate), of the similarity function that has the higher weight. Suppose that in Figure 1, "UK" is considered a duplicate of "USA" according to a textual similarity function. Because they do not co-occur through any state tuple, fkcm contradicts this prediction. Since the children set of UK has a higher IDF value than its token set, UK is not declared a duplicate of USA.
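Continuing the sketch above (ours), fkcm applies the same IDF-weighted containment to children sets, and the combination function casts the weighted vote; the dictionary representation of tuples with "tokens" and "children" fields is our assumption for illustration.

```python
def fkcm(cs_v, cs_w, table):
    # FK-containment metric: the same IDF-weighted containment as tcm,
    # applied to children sets instead of token sets.
    denom = idf(cs_v, table)
    return idf(cs_v & cs_w, table) / denom if denom else 0.0

def pos(x):
    # pos maps positive values to 1 and everything else to -1.
    return 1 if x > 0 else -1

def combine(v, w, tok_table, child_table, tcm_threshold, fkcm_threshold):
    # Weighted vote: each metric votes +1/-1 on "duplicate", weighted by
    # the IDF (information content) of the set it compared.
    wt = idf(v["tokens"], tok_table)       # weight of the textual vote
    wc = idf(v["children"], child_table)   # weight of the co-occurrence vote
    vote = (wt * pos(tcm(v["tokens"], w["tokens"], tok_table) - tcm_threshold)
            + wc * pos(fkcm(v["children"], w["children"], child_table)
                       - fkcm_threshold))
    return pos(vote)  # 1: duplicates, -1: not duplicates
```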
4. Delphi

We now describe Delphi. Recall that we consider two entities to be duplicates if the respective pairs of tuples in each relation of the hierarchy are duplicates. That is, two entities in the customer information of Figure 1 are duplicates only if their Organization tuples, City tuples, State tuples, and Country tuples are all duplicates of each other. Therefore, a straightforward duplicate detection algorithm would be to independently determine sets of duplicate tuples at each level of the hierarchy and then determine duplicate entities over the entire hierarchy. For the example in Figure 1, we can process each of the Organization, City, State, and Country relations independently to determine duplicate pairs of tuples in these relations. We may then identify pairs of duplicate entities whose corresponding tuples at each level in the hierarchy (Organization, City, State, and Country) are either equal or duplicates.

We can be more efficient by exploiting the knowledge from already processed relations. Suppose we know that only "United States of America" and "United States" are duplicates of "USA," and that the rest are all unique tuples in the Country relation. While processing the State relation, we do not compare the tuple "BC" with "Missouri" because the former joins with Canada and the latter with (duplicates of) USA. Observe that this usage requires us to process a parent relation in the hierarchy before processing its child. As we move down the hierarchy, the reduction in the number of comparisons is significant. For instance, the Organization relation may have millions of tuples, whereas the number of organizations in Seattle, WA, USA may be a few thousand.

We adopt a top-down traversal of the hierarchy. After we process the topmost relation, we group the child relation below into relatively smaller groups (compared to the entire relation) and compare pairs of tuples within each group. Let Si be the join of Ri+1, …, Rm through key—foreign key attribute pairs. We use the knowledge of duplicates in Si to group the relation Ri such that we place tuples ri1 and ri2, which join with combinations si1 and si2 from Si, in the same group if si1 and si2 are equal or duplicates (i.e., corresponding pairs of tuples in si1 and si2 either match exactly or are duplicates). We then process each group of Ri independently. Observe that we require Si to be grouped into sets of duplicates; due to efficiency considerations, we further require that these sets be disjoint. Otherwise, the same sets of tuples in Ri may be processed in multiple groups, causing repeated comparisons between the same pairs of Ri tuples.

Considering the example in Figure 1, our top-down traversal of the dimensional hierarchy is as follows. We first detect duplicates in the Country relation, then process the State relation grouping it with the processed Country relation, then process the City relation grouping it with the processed [State, Country] combination, and finally process the Organization relation grouping it with the processed [City, State, Country] combination.

The remainder of this section is organized as follows. In Section 4.1, we discuss the procedure for detecting duplicates within a group of tuples from a relation in the hierarchy. In Section 4.2, we discuss the top-down traversal of the hierarchy coordinating the invocation of the group-wise duplicate detection procedure. We do not explicitly discuss the special case of the lowest relation, where we cannot use fkcm; the following discussion can easily be extended to this special case.

4.1. Group-wise Duplicate Detection

We now describe a procedure to detect duplicates among a group G of tuples from a relation in the hierarchy. The output of this procedure is a partition of G into sets such that each set consists of variations of the same tuple. First, we determine pairs of duplicates and then partition G.

As discussed earlier, our duplicate detection function requires the predictions from threshold-based decision functions using the tcm and fkcm metrics. A straightforward procedure is to compare (using tcm and fkcm) all pairs of tuples in a group G, and then to choose pairs whose similarity is greater than the (tcm or fkcm) threshold. We reduce the number of pairwise comparisons between tuples by pruning out many tuples that do not have any duplicates (according to tcm or fkcm) in G. We describe each step in detail below, first assuming that the tcm and fkcm thresholds are known. In Section 4.3, we describe a method to dynamically determine thresholds for each group.

4.1.1. Duplicate Detection using tcm

We want to detect all pairs (v1, v2) of tuples where v1 is a duplicate, according to tcm, of v2; i.e., tcm(v1, v2) > tcm_threshold. To reduce the number of pairwise tuple comparisons, we use a potential duplicate identification filter for efficiently isolating a subset G' consisting of all potential duplicates. That is, a tuple in G-G' is not a duplicate of any tuple in G. Duplicate detection on G consists of: (i) identifying the set G', and (ii) comparing each tuple in G' with the tuples in G it may be a duplicate of.

Since tcm compares token sets of tuples, we abuse the notation and use tcm(v, S) to denote the comparison between the token set of a tuple v and the multi-set union of the token sets of all tuples in the set S. We use similar notation for fkcm as well.

Potential Duplicate Identification Filter
The intuition behind our filtering strategy for determining the set G' of all potentially duplicate tuples is that the tcm value between any two tuples v and v' in G is at most that between v and G-{v}. Therefore, a tuple v for which tcm(v, G-{v}) is less than the specified threshold is not a duplicate of any other tuple v' in G. We only perform |G| comparisons to identify G', which potentially is much smaller than G. Therefore, comparing pairs involving tuples in the filtered set can be significantly more efficient than comparing all pairs of tuples in G.

This intuition is captured by the following observation for tcm (and fkcm), which follows from the fact that the multi-set union of the token sets of all tuples in G-{v} is a superset of the token set of any v' in G-{v}.

Observation 4.1: Let cm denote either the tcm or the fkcm metric, and let v and v' be two tuples in a set G of tuples. Then, cmG(v, v') ≤ cmG(v, G-{v}).

Computing tcm(v, G-{v}) using Token Tables
We now describe a technique to efficiently compute tcm(v, G-{v}) for any tuple v in G. The intuition is that tokens in the intersection of the token set TS(v) of v and the multi-set union of the token sets of all tuples in G-{v} have a frequency of at least 2 in the bag of tokens Bt(G); any other token is unique and has a frequency of 1.

We build a structure called the token table of G containing the following information: (i) the set of tokens whose frequency tf(t) w.r.t. Bt(G) is greater than one, (ii) the frequencies of such tokens, and (iii) the list of (pointers to) tuples in which such a token occurs. The difference between a token table and an inverted index over G is that the token table only contains tokens whose frequency with respect to G is greater than 1, and hence is potentially smaller if a large percentage of tokens in Bt(G) are unique.
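A sketch (ours, reusing the Counter import and idf helper from the earlier sketch) of the token table and the resulting filter; tuple pointers are represented as indices into the group, and the stop-token-frequency refinement described below is omitted.

```python
def build_token_table(tsets):
    # Token table: tokens whose frequency in Bt(G) exceeds one, with
    # their frequencies and the (indices of) tuples containing them.
    freq = Counter(tok for ts in tsets for tok in ts)
    table = {}
    for i, ts in enumerate(tsets):
        for tok in ts:
            if freq[tok] > 1:
                table.setdefault(tok, (freq[tok], []))[1].append(i)
    return table  # token -> (frequency, tupleId list)

def potential_duplicates(tsets, idf_tbl, token_table, tcm_threshold):
    # Section 4.1.1 filter: v can have a duplicate in G only if
    # tcm(v, G-{v}) > threshold. The tokens v shares with G-{v} are
    # exactly v's tokens in the token table (all others occur only once).
    g_prime = []
    for i, ts in enumerate(tsets):
        shared = {tok for tok in ts if tok in token_table}
        denom = idf(ts, idf_tbl)
        if denom and idf(shared, idf_tbl) / denom > tcm_threshold:
            g_prime.append(i)
    return g_prime
```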
We maintain lists of tuple identifiers only for tokens which are not very frequent. The frequency at which we start ignoring a token—called the stop token frequency—is set to 10% of the number of tuples in G. As mentioned earlier, we enhance tcm by treating tokens which are very close to each other according to edit distance (less than 0.15, in our implementation) as synonyms. Due to space constraints, we skip the details of token table construction.

Example 4.1.1: In Figure 1, suppose we are processing the State relation grouped with the Country relation, and that we have detected the set {United States, United States of America, USA} to be duplicates in the Country relation. For the group of State tuples joining with USA and its duplicates, the token table consists of one entry: {[token=MO, frequency=3, tupleId-list=<S1, S2, S3>]}.

The computation of tcm(v, G-{v}) requires the frequencies with respect to Bt(G) of the tokens in TS(v), which can be obtained by looking up the token table; tokens absent from the token table have a frequency of 1. Any tuple v such that tcm(v, G-{v}) is greater than tcm_threshold is a potential duplicate tuple, and is added to G'.

Computing Pairs of Duplicates
We compare each tuple v in G' with a set Sv of tuples, the union of all tuples sharing tokens with v, which can be obtained from the token table. (For any tuple v'' not in Sv, tcm(v, v'') = 0.) For any tuple v' in Sv such that tcm(v, v') > tcm_threshold, we add the pair (v, v') to the pairs of duplicates from G.

4.1.2. Duplicate Detection using fkcm

We predict that a tuple v is a duplicate, according to fkcm, of another tuple v' in G if fkcm(v, v') > fkcm_threshold. Using Observation 4.1, we determine a set of potential duplicates by efficiently computing fkcm(v, G-{v}) using children tables. The computation of the set G' of potential duplicates, and then of the duplicates (according to fkcm) of tuples in G', is the same as for tcm. Hence, we only describe the construction of the children table for a group G of tuples.

Children Tables
The children table of G is a hash table containing a subset of the union of the children sets of all tuples in G. It contains: (i) each child tuple c from Ri-1 joining with some tuple in G whose frequency cf(c) in Bc(G) is greater than one, (ii) the frequencies of such children tuples, and (iii) the list of (pointers to) tuples in G with which c joins. We maintain lists of tuples only for children that have a frequency less than the stop children frequency, fixed at 10% of the number of tuples in G.

Example 4.1.2: Consider the example in Figure 1. We process the State relation grouped with the Country relation, and suppose {United States, United States of America, USA} is a set of duplicates in the Country relation. For the group of State tuples joining with USA and its duplicates, the children table contains one entry: {child=Joplin, frequency=3, tupleId-list=<S1, S3, S4>}.

Note: Recall that the frequency of a child tuple in Bc(G) is based only on its descriptive attribute value combinations and ignores the generated key attributes in Ri-1. In the above example, the tuple Joplin has a frequency of 3 because we ignore the CityId attribute values.

Building the Children Table: The procedure is similar to that of building the token table, except for one difference: the multi-set union of all children sets Bc(G) can be large—e.g., all street addresses in the city [Illinois, Chicago]—and hence may not fit in main memory. Therefore, we follow the steps below, referring to tuples in Bc(G) with frequency greater than one as non-unique tuples.
(i) We fetch all non-unique tuples in Bc(G) into a hash table.
(ii) We fetch tuples in G and their children, one pair at a time, and associate the non-unique tuples in Bc(G) with the list of G tuples they join with.

Combination
After detecting duplicates according to tcm and fkcm, we combine (using the combination function of Section 3.3.3) the predictions for each pair of tuples detected to be duplicates using either tcm or fkcm or both.

4.1.3. Grouping Duplicate Pairs into Sets
Coordinating the top-down traversal of the hierarchy requires us to partition G into sets of duplicates, and to determine a representative tuple—called the canonical tuple—for each set, to be able to exploit database systems for processing. (This issue will become clearer in the next section.) To partition G into sets of duplicates, we adapt a method from [HS95] to handle asymmetric similarity functions. The essential idea is to divide G into connected groups and choose a canonical tuple for each group.

Following the standard approach [HS95, ME96], we elevate the relationship "is a duplicate of" between tuples to be a transitive relation. That is, if v1 is a duplicate of v2 and v2 is a duplicate of v3, we consider v1 to be a duplicate of v3. The intuition behind the partitioning method is to identify maximal connected sets of duplicates such that for any pair of tuples v and v' in each set, we can deduce using transitivity that v is a duplicate of v', or vice versa. A connected set is maximal if we cannot add any more tuples to it without making it disconnected. For each connected set, we choose the tuple with the highest IDF value (of token sets for R1, and of children sets for higher level relations) as the canonical tuple. Because the relationship "is a duplicate of" is asymmetric, a tuple may end up in multiple connected sets; for such a tuple v, we place it in the set with the closest canonical tuple (computed using fkcm at higher levels and tcm at the lowest level).
View definitions:

  Lm = Select * From Rm, …, R1

  Li = Select Li+1.Am, …,
              (Case When Ti+1.Ai+1 Is Null Then Li+1.Ai+1 Else Ti+1.Ai+1),
              Li+1.Ai, …, Li+1.A1
       From Li+1 Left Outer Join Ti+1
       On Li+1.Am = Ti+1.Am, …, Li+1.Ai+1 = Ti+1.Ai+1

Queries:

  Qi  = Select Li.Am, …, Li.Ai+1, Li.Ai-1, count(*)
        From (Select Distinct Li.Am, …, Li.Ai-1)
        Group By Li.Am, …, Li.Ai+1, Li.Ai-1
        Having count(*) > 1
        Order By Li.Am, …, Li.Ai+1, Li.Ai-1

  Qi' = Select Li.Am, …, Li.Ai+1, Li.Ai, Li.Ai-1
        From (Select Distinct Li.Am, …, Li.Ai-1)
        Order By Li.Am, …, Li.Ai, Li.Ai-1

Figure 2: View definitions and queries


4.2. Top-down Traversal

We now describe the top-down traversal of the hierarchy. Starting from the topmost relation, we group each relation and invoke the duplicate detection procedure on each group. Therefore, the primary goal of the traversal is to group each relation appropriately. While grouping a relation Ri by a combination Si (the join of Ri+1, …, Rm) of processed relations, all Ri tuples which join with tuple combinations (equivalently, sub-entities) in Si that are either exactly equal or detected to be duplicates have to be placed in the same group.

A straightforward ordering of the join of Ri and Si by Si does not achieve the desired grouping, because duplicate tuple combinations in Si may not be adjacent to each other in the sorted order. For example, the duplicates UK and Great Britain in the Country relation are unlikely to be adjacent to each other in the sorted order. Therefore, we realize the correct sorted order by considering a new relation Li, which is the join of R1, …, Rm but with the duplicate tuples in the processed relations (Ri+1, …, Rm) replaced by their canonical tuples. We then group (the relevant projection of) Li by the canonical tuple combinations of Si. We avoid explicit materialization of the very large (as large as the database) relations Li by only recording detected duplicates in translation tables. Translation tables can be significantly smaller than the database if the number of duplicates is much less than the number of tuples in the database.

Translation Tables
Informally, the translation table Ti records the mapping between each duplicate tuple in Ri and its canonical tuple, as well as the ancestral combination from the join of Ri+1, …, Rm to which they both point. While storing the ancestral combination, we assume that all duplicate tuples in the relations Ri+1, …, Rm have been replaced with their canonical tuples. For example, if USA is the canonical tuple of the set of duplicates {United States, United States of America}, and MO is that of the set {Missouri} of states pointing to USA (or United States or United States of America), then the translation table at the Country level maps both United States and United States of America to USA, and the translation table at the State level maps [USA, Missouri] to [USA, MO].

Let Canonical_Ri represent the relation Ri where each duplicate tuple has been replaced with its canonical tuple. The translation table Ti has the schema [Ri, Ri AS Canonical_Ri, Canonical_Ri+1, ..., Canonical_Rm]. Ti records each duplicate tuple v and its canonical tuple v', along with the canonical tuple combination sv from the grouping combination [Canonical_Ri+1, ..., Canonical_Rm] of relations with which v and v' join.

Coordination
We form two SQL queries Qi and Qi' whose results contain the information required for processing any group in Ri. We scan portions of these query results, pause to process a group of Ri tuples, and then continue the scans. First, we define the set of views used by these queries.

The sequence of views Lm, …, Li is defined in Figure 2. Informally, Li represents the current state of the unnormalized relation R (the join of R1, …, Rm) after all duplicate tuples (in Ri+1, …, Rm) are collapsed with their canonical tuples. Each Lj has the same schema as the unnormalized dimension relation R. Considering the translation table on the Country relation, an outer join between the original unnormalized relation R and the translation table on the country attribute results in a new unnormalized relation L with a canonical_Country attribute. In L, United States and United States of America are always replaced by USA, their canonical equivalent.

The queries Qi and Qi' are defined in Figure 2, in which Ai denotes the set of descriptive attributes (not including generated keys) in Ri. For the sake of clarity, we omit the key—foreign key join conditions in the where clauses in Figure 2. Both queries order (a projection of) Li on S=[Li.Am, …, Li.Ai+1]. Let s be a tuple combination in S, and let Gs be the group of tuples in Ri joining with s. We invoke the duplicate detection procedure discussed in Section 4.1 for each group Gs as follows. We scan the result of Qi to fetch the group G1 of tuples joining with s, scan the corresponding group G2 from the result of Qi', process Gs using G1 and G2, and then move on to a subsequent group. The group G1 consists of the information required for building the hash table of non-unique children Bc(Gs), and G2 consists of that required for associating non-unique children with parent tuples as well as for building the token table. Note that we do not maintain all of G2 in memory; we only require a tuple at a time.
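A sketch (ours) of this coordination loop; the query results are modeled as row iterators sorted by the canonical ancestor combination s, and, unlike the paper's pause-and-continue scans, the Qi result is materialized here for brevity.

```python
from itertools import groupby

def traverse_level(q_rows, q_prime_rows, process_group):
    # q_rows: result of Qi (one row per non-unique child of a group);
    # q_prime_rows: result of Qi'. Both are assumed sorted by the
    # canonical ancestor combination s; rows are (s, payload) pairs.
    key = lambda row: row[0]
    non_unique = {s: [r[1] for r in rows]
                  for s, rows in groupby(q_rows, key=key)}
    for s, rows in groupby(q_prime_rows, key=key):
        g1 = non_unique.get(s, [])  # no Qi rows: no repeated children
        # g1 seeds the hash table of non-unique children; the Qi' rows
        # are streamed one tuple at a time to build the token table and
        # to associate children with their parents.
        process_group(s, g1, (r[1] for r in rows))
```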
unique children with parent tuples as well as for building be normalized into relations Rm, …, R1. We can adapt
the token table. Note that we do not maintain all of G2 in Delphi to work with an unnormalized relation R (the join
memory and only require a tuple at a time. of Rm,…, R1) as long as the sets of attributes which form
the hierarchy are known.
4.3. Dynamic Thresholding
In many cases, it is difficult for users to set tcm and fkcm FKCM Measurement
thresholds. Hence, we develop a technique to dynamically Recall that the fkcm metric only uses information from one
determine thresholds for each group. Moreover, treating level below. Such a strategy is very efficient and sufficient
each group independently allows us to set qualitatively for most but the following two exceptional cases. We now
better thresholds by adapting to the characteristics of that discuss these two cases.
group. For example, the numbers of tokens may vary
significantly across groups (names in Argentina may be Small children sets: When the children set of a tuple v1 is
longer than they are in USA). so small that even a single erroneous tuple in CS(v1) is a
significant fraction, we may incorrectly believe that v1 is
The intuition behind our threshold determination is that unique when in fact it is a duplicate of v2. If we want to
when the fraction of duplicates in a group is small (say, detect such errors, we modify the children table
around 10%), a duplicate tuple v is likely to have a higher construction and processing as follows. We first add all
value for containment metric (tcm or fkcm) between v and children tuples in Bc(G) (even those with frequency 1) to
G-{v} than a unique tuple. Therefore, we expect them to be the children table. We treat all pairs of duplicate (according
outliers in the distribution of tcm and fkcm. We use to tcm) tuples as synonyms when measuring the FK-
standard outlier detection methods based on Normality containment metrics between their parents. Since we have
assumptions to set thresholds. In Section 6, we demonstrate to temporarily maintain all children tuples—even those
experimentally that our threshold determination procedure with frequency 1—we require additional main memory.
is quite effective.
Correlated errors: Consider two sets of tuples in each
4.4. Resource Requirements relation where one uses abbreviations and the other uses
For processing each relation Ri in the hierarchy, we send expanded versions while reporting the country and state
two queries (Qi and Qi’) to the database system where each values. Then, a tuple (“United States”, “Washington”, **)
query computes the join of relations Rm, …, R1. Key— may be a duplicate of (“USA”, “WA”, **) where **
foreign key joins can be made very efficient if we create represents the same set of values in both tuples. We may
appropriate join indexes. We expect the number of not detect that “United States” is a duplicate of USA
duplicates and hence the translation tables to be small. through co-occurrence unless we look one level below the
Hence, outer joins with translation tables are efficient. States relation. It is possible to overcome this limitation by
measuring, with significant computational overhead, co-
Main Memory Requirements: The group level duplicate occurrence through lower level relations. However, the
elimination procedure ideally requires for each group G, number of combinations may sometimes be too high (e.g.,
the token table, the children table, and the tuples in G to be all organizations in USA) to even fit in main memory.
in main memory. If the frequency distribution of children
or tokens follows the Zipfian distribution, which is true for Definition of Duplicates
most real datasets [Zipf49], then less than half the tokens We now discuss a limitation of our definition of duplicates.
or children tuples have frequencies greater than 1, and are Consider the following pair of entities: [<Smith>,
maintained in memory. In rare cases where a group being <98052>, <WA>, <USA>] and [<Smith>, <98052>,
processed is very large, we may materialize the token and <Washington>, <Canada>]. If the tuples “Canada” and
children tables on disk and build appropriate indexes. “USA” are not (and rightly so) considered duplicates of
each other on the Country relation, then according to our
5. Discussion definition, the two entities are not duplicates. Observe that
the second tuple violates an implicit or explicit functional
We now discuss several interesting issues starting with a dependency or rule: “zipcode=98052 and state=WA
note that we do not require the dimensional information to country=USA.” If we correct the violation and detect that
If we correct the violation and detect that WA and Washington are duplicates (using co-occurrence information), then the two customer entities are duplicates. Thus, even though our definition of duplicates does not directly allow such inconsistencies, we can correct them in conjunction with other cleaning operations.

Potential Duplicate Identification Filter
Imagine a set G of tuples where most of the tokens in Bt(G) occur in at least two tuples in G. In such cases, the filtering strategy is not very effective, because we may mark many tuples as potential duplicates. Our experiments on real data illustrate that such a case does not typically occur in practice. However, developing appropriate filters for such rare cases is still an open issue.

We note that it is possible to consider similarity and combination functions other than the ones we used. However, Observation 4.1, which summarizes our filtering strategy, may not be valid for all similarity functions, and one may have to design suitable filters where possible.

[Figure 3: False Positive Explosion. FP% and FN% (y-axis: percentage, 0 to 500) of the cosine metric at thresholds 0.9, 0.85, and 0.8 (CM(0.9), CM(0.85), CM(0.8)), edit distance at thresholds 0.05, 0.1, and 0.15 (Edit(0.05), Edit(0.1), Edit(0.15)), and the hierarchy-restricted cosine metric (H-CM(0.9), H-CM(0.8)). Chart omitted.]

6. Experimental Evaluation

Using real datasets, we now evaluate the quality and efficiency of Delphi and compare it with earlier work.

6.1. Datasets and Setup

We consider clean Customer information from an internal operational data warehouse and introduce errors. (We observed similar results on the publication information of a bibliography database; we omit those results due to space constraints.) The Customer dimensional hierarchy has four relations: Name (level 1), City (level 2), State (level 3), and Country (level 4) with 269678, 21856, 1250, and 115 tuples, respectively. Because we start from real data, all characteristics of real data—variations in the lengths of strings, the numbers of tokens in and frequencies of attribute values, co-occurrence patterns, etc.—are preserved. Since we know the duplicate tuples and their correct counterparts in the erroneous dataset, we can evaluate duplicate elimination algorithms.

Error Introduction
We introduce two types of errors common in data warehouses [For01]: equivalence errors, and spelling & truncation errors. The generator has three parameters. The first, the percentage error parameter, controls the error to be introduced in each relation; the second (equivalence fraction) and the third (spelling fraction) parameters control the fractions of equivalence errors and of spelling and truncation errors, respectively. For example, if the percentage error is 10% and the equivalence fraction is 50%, then we introduce 10% duplicate tuples into the input table, out of which 50% are due to equivalence errors.

Equivalence Errors: Consider the tuple combination [<Key Associates>, <Joplin>, <MO>, <USA>] in the customer table, and suppose we want to create an equivalence error for "MO" in the State relation. We first garble "MO" into, say, "xMykOz" so that the new value is undetectable by standard textual similarity functions. Since equivalence errors usually occur in multiple tuples, we choose around 5% (between 5-x% and 5+x%) of all entities with R.country="USA" and R.state="MO" and modify the value of MO to "xMykOz." For 10% of these modified tuples, we also introduce errors in the tuple from the child relation, when one exists. We insert these erroneous tuples into R. At the lowest level of the hierarchy, we garble a randomly picked token from the token set and insert the modified tuple into R.

Spelling and Truncation Errors: We modify a token in a tuple by changing, deleting, or adding characters, or by truncating the token. 50% of the time we modify characters, and the remaining 50% we truncate the token. The number of characters modified or truncated is a linearly decreasing function with a maximum of half the token length.

Token Permutation: Consider the example where a user enters the first name followed by the last name instead of the stipulated last name followed by the first name. To reflect such types of errors, we randomly permute the tokens in about 10% of the erroneous tuples being added to R.
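As an illustration, here is a sketch (ours) of the spelling & truncation generator; for simplicity it draws the number of affected characters uniformly rather than from the linearly decreasing distribution described above.

```python
import random, string

def spell_or_truncate(token, rng=random):
    # 50%: truncate; 50%: modify characters by change/delete/add
    # (Section 6.1). k is the number of characters affected, at most
    # half the token length (drawn uniformly here for simplicity).
    k = rng.randint(1, max(1, len(token) // 2))
    if rng.random() < 0.5:
        return token[:-k] if len(token) > k else token  # truncate
    chars = list(token)
    for _ in range(k):
        op = rng.choice(["change", "delete", "add"])
        i = rng.randrange(len(chars)) if chars else 0
        if op == "change" and chars:
            chars[i] = rng.choice(string.ascii_lowercase)
        elif op == "delete" and chars:
            del chars[i]
        else:  # add
            chars.insert(i, rng.choice(string.ascii_lowercase))
    return "".join(chars)

print(spell_or_truncate("Missouri", random.Random(7)))
```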
Algorithms
Table 1 summarizes the algorithms we evaluate in this study. MP-CM and MP-Edit are derivatives of the windowing-based MergePurge (MP) algorithm using the cosine metric and edit distance, respectively [HS95, ME97, Coh98]. Delphi-Global is a variant of Delphi that uses global thresholds for both tcm and fkcm. Delphi-Stripped is a variant of Delphi which only uses tcm and completely ignores co-occurrence information.

MP-CM           Windowing, no hierarchy, no co-occurrence, global thresholds, cosine metric
MP-Edit         Windowing, no hierarchy, no co-occurrence, global thresholds, edit distance
Delphi-Global   Grouping, hierarchy, co-occurrence, global thresholds
Delphi          Grouping, hierarchy, co-occurrence, dynamic thresholding
Delphi-Stripped Grouping, hierarchy, no co-occurrence, dynamic thresholding

Table 1: Algorithms

We run the variants of MP on the unnormalized relation of the Name, City, State, and Country relations, and sort on the key (name, city, state, country). In both MP-CM and MP-Edit, we fix the window size at 20 and vary the thresholds. We use MP-CM(x) (respectively, MP-Edit(x)) to denote that the threshold for the cosine metric (edit distance) is set to x.

[Figure 4: False Positive Percentages. FP% (0 to 250) of Delphi, Delphi-Stripped, Delphi-Global, MP-CM(0.8), and MP-Edit(0.1) at overall error percentages of 4, 8, and 11. Chart omitted.]

[Figure 5: False Negative Percentages. FN% (0 to 120) of the same algorithms at overall error percentages of 4, 8, and 11. Chart omitted.]


Global, we arrived at the global tcm-threshold and the
fkcm-threshold of 0.80 and 0.85, respectively, after several 6.2.2. Quality
trials. To compare the quality of algorithms, we do not In the following two experiments, we generated erroneous
group duplicate tuples for the lowest Name relation and datasets from the input dataset by introducing 4%, 8%, and
output all pairs of duplicates detected by Delphi. 11% errors with relative fractions of equivalence error and
spelling & truncation errors fixed at 0.5.
Quality Metrics
We now describe the quality metrics for evaluating Reduction in False Positive Percentages
algorithms. Figure 4 shows the false positive percentages of each
algorithm. Because Delphi and Delphi-global have
False positives: The percentage of incorrect pairs of tuples significantly lower false positive percentages, we conclude
which an algorithm detects as duplicates relative to the that hierarchies and co-occurrence information together
actual number of duplicates is called the false positive (FP) significantly reduce false positive percentages.
percentage. The false positive percentage can be greater
than 100 if the algorithm produces many incorrect pairs. Reduction in False Negative Percentages
Lower false positive percentage indicates higher From Figure 5, which plots false negative percentages, we
confidence in the algorithm’s results. see that Delphi has the lowest false negative percentages.
Therefore, co-occurrence information is useful in reducing
False negatives: The percentage of undetected duplicates false negatives as well. And, Delphi-Stripped is better than
in the input dataset relative to the number of duplicates is Delphi-Global. Hence, dynamic thresholding helps reduce
called the false negative percentage. Lower false negative false negative percentages. However, its impact on false
percentages indicate good duplicate detection. positive reduction seems unpredictable.
6.2. Analysis of Results

6.2.1. False Positive Explosion
We now demonstrate that using the cosine metric or edit distance can result in large false positive percentages. We consider a dataset with 8% overall error, where the equivalence and spelling & truncation fractions are 0.5 each. Figure 3 shows the results of applying the windowing strategy on four different sort orders: [Name, City, State, Country], [City, State, Country, Name], [State, Country, Name, City], and [Country, Name, City, State]. CM(x) (respectively, Edit(x)) denotes the results from using the cosine metric (edit distance) with a threshold x, and H-CM those from using the cosine metric with the restricted definition of duplicates in the presence of dimensional hierarchies. From Figure 3, we observe that lowering thresholds drastically increases false positive percentages for both the cosine metric and edit distance.
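To make the thresholds concrete, simplified versions of the two similarity functions might look as follows. These are our own sketches: tokenization, token weighting, and the exact normalization of the edit distance differ in the actual metrics, and we express both on a similarity scale in [0, 1] purely for uniformity.

    import math
    from collections import Counter

    def cosine_similarity(s1: str, s2: str) -> float:
        # Cosine metric over word-token frequency vectors.
        v1, v2 = Counter(s1.lower().split()), Counter(s2.lower().split())
        dot = sum(v1[t] * v2[t] for t in v1)
        norm = math.sqrt(sum(c * c for c in v1.values())) * \
               math.sqrt(sum(c * c for c in v2.values()))
        return dot / norm if norm else 0.0

    def edit_similarity(s1: str, s2: str) -> float:
        # Levenshtein distance (single-row dynamic program),
        # normalized by the longer string; 1.0 means identical.
        m, n = len(s1), len(s2)
        if max(m, n) == 0:
            return 1.0
        dp = list(range(n + 1))
        for i in range(1, m + 1):
            prev, dp[0] = dp[0], i
            for j in range(1, n + 1):
                cur = dp[j]
                dp[j] = min(dp[j] + 1,          # deletion
                            dp[j - 1] + 1,      # insertion
                            prev + (s1[i - 1] != s2[j - 1]))  # substitution
                prev = cur
        return 1.0 - dp[n] / max(m, n)

Lowering a threshold on either function admits more string pairs as matches, which is the mechanism behind the false positive explosion discussed above.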
6.2.2. Quality
In the following two experiments, we generated erroneous datasets from the input dataset by introducing 4%, 8%, and 11% errors, with the relative fractions of equivalence errors and spelling & truncation errors fixed at 0.5.

Reduction in False Positive Percentages
Figure 4 shows the false positive percentages of each algorithm. Because Delphi and Delphi-Global have significantly lower false positive percentages, we conclude that hierarchies and co-occurrence information together significantly reduce false positive percentages.

Reduction in False Negative Percentages
From Figure 5, which plots false negative percentages, we see that Delphi has the lowest false negative percentages. Therefore, co-occurrence information is useful in reducing false negatives as well. Further, Delphi-Stripped is better than Delphi-Global; hence, dynamic thresholding helps reduce false negative percentages. However, its impact on false positive reduction seems unpredictable.

6.2.3. Speed and Scalability
We ran Delphi, Delphi-Stripped, and MP-CM on datasets of size 3000, 30000, 300000, and 3000000.6 Table 2 shows that Delphi and MergePurge are both scalable over a wide range of dataset sizes. Running times are normalized with respect to that of Delphi on the 3000-tuple dataset. We also note that the maximum amount of main memory required by Delphi on any of these datasets is less than 25 MB, supporting our argument that the token and children tables fit in memory.

6 Since the scalability characteristics of MP-Edit are similar to those of MP-CM, we do not consider it here.

Table 2: Scalability (running times normalized to Delphi on 3000 tuples)
#Tuples    Delphi   Delphi-Stripped   MP-CM
3000       1        0.8               0.7
30000      5.512    4.2               3.55
300000     52.5     43.7              151.5
3000000    510.4    230.6             1500

Table 3: Filtering (potential duplicates per relation)
Relation   #(TCM; FKCM)
Name       51582; 0
City       9997; 1093
State      434; 441
Country    30; 8
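As a quick sanity check on the normalized numbers in Table 2 (our own observation), between the smallest and largest datasets the input grows by

\[
\frac{3{,}000{,}000}{3{,}000} = 1000 \quad\text{vs.}\quad \frac{510.4}{1} \approx 510,
\]

so Delphi's running time grows sub-linearly with dataset size over this range.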
6.2.4. Potential Duplicate Filter
We now evaluate our potential duplicate filtering technique. The dataset has 8% duplicate tuples. Table 3 shows the total number of potential duplicates over all groups in each relation of the hierarchy. The entry (x; y) denotes that TCM and FKCM returned x and y potential duplicates, respectively. We observe that only 20% of the overall set of tuples (as compared to the minimum of 16% = 8% duplicates + 8% targets) was even considered to be potential duplicates. Hence, potential duplicate filtering enhances efficiency. Also observe that FKCM returns fewer potential duplicates. Hence, we conclude that co-occurrence information is very effective at reducing false positives.
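Spelling out the arithmetic behind the 20% figure (a restatement of the numbers above, assuming each duplicate tuple has one distinct correct target): a filter that loses no true duplicates must pass at least

\[
\underbrace{8\%}_{\text{duplicates}} + \underbrace{8\%}_{\text{targets}} = 16\%
\]

of the tuples, so the observed 20% is within 4 percentage points of the smallest possible candidate set.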
7. Conclusions
In this paper, we exploited dimensional hierarchies in data warehouses to develop a high quality, scalable, and efficient algorithm for detecting fuzzy duplicates in dimensional tables. In future work, we intend to consider multiple hierarchies for detecting fuzzy duplicates.

Acknowledgements
We thank several members of the DMX group at Microsoft Research for their thoughtful comments.