Eliminating Fuzzy Duplicates in Data Warehouses
Figure 1: An Example Customer Database (relations Organization at Level 1, City at Level 2, State at Level 3, and Country at Level 4)
Tuple pairs with values “USSR” and “United States” in the country attribute are also likely to be declared duplicates if we were to detect “US” and “United States” as duplicates using textual similarity.

In this paper, we exploit dimensional hierarchies typically associated with dimensional tables in data warehouses to develop an efficient, scalable duplicate elimination algorithm called Delphi,² which significantly reduces the number of false positives without missing out on detecting duplicates. We rely on hierarchies to detect an important class of equivalence errors in each relation, and to significantly reduce the number of false positives.

² DELPHI: Duplicate ELimination in the Presence of HIerarchies.

For example, Figure 1 describes the schema maintaining the Customer information in a typical company selling products or services. The dimensional hierarchy here consists of four relations (Organization, City, State, and Country) connected by key-foreign key relationships (also called referential links). We say that the Organization and Country relations are the bottom and the top relations, respectively. Consider the tuples USA and United States in the Country relation in Figure 1. The state attribute value “MO” appears in tuples in the State relation joining with both countries USA and United States, whereas most state values occur with only one Country tuple. That is, USA and United States co-occur through the state MO. In general, country tuples are associated with sets of state values. The degree of overlap between the sets associated with two countries is a measure of co-occurrence between them, and can be used to detect duplicates (e.g., USA and United States).

The above notion of co-occurrence can also be used to reduce the number of false positives. Consider the two countries “USA” and “UK” in Figure 1. Because they are sufficiently close according to the edit distance function, a commonly used textual similarity function, we might (incorrectly) deduce that they are duplicates. Such problems can occur even with other textual similarity functions, like the cosine metric. Using our notion of co-occurrence through the State relation, we observe that the children sets of USA and UK, i.e., the sets of states {MO, Missouri} and {Aberdeen, Aberdeenshire} joining with USA and UK, respectively, are disjoint. Hence, we conclude that USA and UK are unlikely to be duplicates.

For reasons of efficiency and scalability, we want to avoid comparing all pairs of tuples in each relation of the hierarchy. Previous approaches have considered the windowing strategy, which sorts a relation on a key and compares all records within a sliding window on the sorted order [HS95]. However, observe that equivalence errors (e.g., UK and Great Britain) may not be adjacent to each other in standard sort orders, e.g., the lexicographic order. We exploit the dimensional hierarchy and propose a grouping strategy, which only compares tuples within small groups of each relation. For instance, we only compare two State tuples if they join with the same Country tuple or with Country tuples that are duplicates of each other. Since such groups are often much smaller than the entire relation, the grouping strategy allows us to compare all pairs of tuples within each group and yet be very efficient.

The outline of the paper is as follows. In Section 2, we discuss related work. In Section 3, we discuss key concepts and definitions. In Section 4, we describe Delphi. In Section 5, we discuss a few important issues. In Section 6, we discuss results from a thorough experimental evaluation on real datasets.

2. Related Work

Several earlier proposals exist for the problem of duplicate elimination (e.g., [FS69, KA85, HS95, ME96, ME97, Coh98]). As mentioned earlier, all these methods rely on threshold-based textual similarity functions to detect duplicates, and hence do not detect equivalence errors unless we lower thresholds sufficiently; lower thresholds result in an explosion in the number of false positives.
The record linkage literature also focuses on automatically determining appropriate thresholds [FS69, KA85], but still suffers from the false positive explosion when detecting equivalence errors. Gravano et al. proposed an algorithm for approximate string joins, which in principle can be adapted to detect duplicate records [GIJ+01]. Since they use the edit distance function to measure closeness between tuples, their technique suffers from the drawbacks of strategies relying only on textual similarity functions. In this paper, we exploit hierarchies on dimensional tables to detect an important class of equivalence errors (those which exhibit significant co-occurrence through other relations) without increasing the number of false positives.
A significant amount of work exists on other related aspects of data cleaning: the development of transformational cleaning operations [RH01, GFS+01], the detection and correction of formatting errors in address data [BDS01], and the design of “good” business practices and process flows to prevent problems of deteriorating data quality [Pro, NR99]. Automatic detection of integrity constraints (functional dependencies and key-foreign key relationships) [MR94, KM95, HKPT98], so that they can be enforced in the future to improve data quality, is complementary to techniques for cleaning existing data. Because of the commercial importance of the data cleaning problem, several domain-specific industrial tools exist; Galhardas provides a survey of many commercial tools [Gal].

Our notion of co-occurrence between tuples is similar to that used for clustering categorical data (e.g., [GKR98, GRS99, GGR99]) and for matching schemas [MBR01].

3. Concepts and Definitions

A dimensional hierarchy consists of a chain of relations linked by key-foreign key dependencies; Figure 1 illustrates an example. An entity described by the hierarchy also consists of a chain of tuples (one from each relation), each of which joins with the tuple from its parent relation. For example, [<o1, Walmart, c1>, <c1, Redmond, s1>, <s1, WA, t1>, <t1, USA>] describes an organization entity, where o1, c1, etc. are identifiers typically generated for maintaining referential links. For clarity in notation, we do not explicitly list identifiers in tuples unless required.

Consider two organization entities [<Walmart>, <Redmond>, <WA>, <USA>] and [<Walmart>, <Seattle>, <WA>, <USA>] in the Customer information with the dimensional hierarchy shown in Figure 1. The corresponding pairs of tuples in the Name, State, and Country relations are individually identical. However, they are not duplicates on the City relation, and in fact this difference makes the two entities distinct. This phenomenon is characteristic of dimensional hierarchies. For example, publications with the same title may appear in the proceedings of a conference as well as in a journal, and they are two distinct entities in the publications database. Motivated by these typical scenarios, we consider two entities in a dimensional hierarchy to be duplicates if corresponding pairs of tuples in each relation of the hierarchy either match exactly or are duplicates (according to duplicate detection functions at each level). For example, two entities in Figure 1 are duplicates if the respective pairs of Country, State, City, and Organization tuples of the two entities are duplicates. Below, we formally introduce dimensional hierarchies, the definition of duplicate entities, and our duplicate detection functions.

3.1. Dimensional Hierarchies

Relations R1, …, Rm with keys K1, …, Km constitute a dimensional hierarchy if and only if there is a key-foreign key relationship between Ri-1 and Ri (2 ≤ i ≤ m). Ri is the ith level relation in the hierarchy. R1 and Rm are the bottom and the top relations, respectively, and Ri is the child of Ri+1.

Let the unnormalized dimension table R be the join of R1, …, Rm through the chain of key-foreign key relationships. We say that a tuple vi in Ri joins with a tuple vj in Rj if there exists a tuple v in R such that the projections of v on Ri and Rj equal vi and vj, respectively. Specifically, we say that vi in Ri is a child of vi+1 in Ri+1 if vi joins with vi+1. For example, in Figure 1, [S3, MO, 3] in the State relation is a child of [3, USA] in the Country relation. We say that a tuple combination (or a row in R) [r1, …, rm] is an entity if each ri joins with ri+1.

In typical dimensional tables of data warehouses, the values of the key attributes K1, …, Km are artificially generated by the loading process before a tuple vi is inserted into Ri. Such generated keys are not useful for fuzzily matching two tuples, and can only be used for joining tuples across relations in the hierarchy. From now on, we overload the term “tuple” to also mean only the descriptive attribute values, i.e., the set of attribute values excluding the generated key attributes; we clarify when it is not clear from the context.
3.2. Definition of Duplicates

We now formally define our notion of duplicate entities, assuming duplicate detection functions at each level. Let f1, …, fm be binary functions, called duplicate detection functions, where each fi takes a pair of tuples in Ri and returns 1 if they are duplicates, and -1 otherwise. Let r=[r1, …, rm] and s=[s1, …, sm] be two entities. We say that r is a duplicate of s if and only if fi(ri, si)=1 for all i in {1, …, m}. For instance, we consider the two entities [<Compuware, #20 Main Street>, <Jopin>, <MO>, <United States>] and [<Compuwar, #20 Main Street>, <Joplin>, <Missouri>, <USA>] in Figure 1 to be duplicates only if the following pairs are duplicates: “United States” and “USA” in the Country relation, “MO” and “Missouri” in the State relation, “Jopin” and “Joplin” in the City relation, and “Compuware, #20 Main Street” and “Compuwar, #20 Main Street” in the Organization relation. Observe that we can easily extend this definition to sub-entities [ri, …, rm] and [si, …, sm].
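To make the definition concrete, here is a minimal Python sketch (our illustration; the function name is_duplicate_entity and the argument names are ours, and the paper prescribes no implementation language) of the duplicate-entity predicate given per-level duplicate detection functions:

    # Sketch of the Section 3.2 predicate: r is a duplicate of s iff
    # fi(ri, si) = 1 at every level i of the hierarchy.
    def is_duplicate_entity(r, s, fs):
        # r, s: entities as lists [r1, ..., rm] of per-level tuples;
        # fs: [f1, ..., fm], each returning 1 (duplicates) or -1.
        return all(f(ri, si) == 1 for f, ri, si in zip(fs, r, s))

    # Example with exact matching at every level.
    exact = lambda a, b: 1 if a == b else -1
    r = ["Compuware, #20 Main Street", "Joplin", "MO", "USA"]
    print(is_duplicate_entity(r, list(r), [exact] * 4))  # True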
3.3. Duplicate Detection Functions

We exploit dimensional hierarchies to measure co-occurrence among tuples, both for detecting equivalence errors and for reducing false positives. This is in conjunction with the textual similarity functions (like the cosine metric and edit distance) that have traditionally been employed for detecting duplicates. Our final duplicate detection function is a weighted vote of the predictions from the co-occurrence and textual similarity functions. Intuitively, the weight of a prediction is indicative of the importance of the information used to arrive at it.

We adopt the standard thresholded similarity function approach to define duplicate detection functions [HS95]. That is, if the textual (or co-occurrence) similarity between two tuples is greater than a threshold, then the two tuples are predicted to be duplicates according to textual (or co-occurrence) similarity. In this section, we assume that thresholds are known; in Section 4.3, we relax this assumption and describe automatic threshold determination. First, we introduce the notion of set containment, which we use to define the similarity functions. We only consider textual attributes for comparing tuples, and assume default conversions from other types to text; e.g., integer zipcodes are converted to varchar.

Given a collection of sets, each defined over some domain of objects, an intuitive notion of how similar a set S is to a set S’ is the fraction of S objects contained in S’. This notion of containment similarity has been effectively used to measure document similarity [BGM+97]. We extend this notion to take into account the importance of objects in distinguishing sets. For example, the set {Microsoft, incorporated} is more similar to {Microsoft, inc} than it is to {Boeing, incorporated}, because the token Microsoft is more distinguishing than the token incorporated. The IDF (inverse document frequency) value of an object has been successfully used in the information retrieval literature to quantify this notion of importance [BYRN99]. We now formalize the intuition.

Let Ω be a set of objects. Let G be a collection of sets of objects from Ω. Let B(G) be the bag of all objects contained in any set in G. The frequency fG(o) of an object o with respect to G is the frequency of o in B(G). The IDF value IDFG(o) of o with respect to G is log(|G| / fG(o)). Also, we define the IDF value IDFG(S) of a set S (a subset of Ω) to be the sum of the IDF values of its members: IDFG(S) = Σs∈S IDFG(s).

Containment Metric: We define the containment metric cmG(S1, S2) with respect to G between two sets S1 and S2 to be the ratio of the IDF value of their intersection to the IDF value of the first set: cmG(S1, S2) = IDFG(S1 ∩ S2) / IDFG(S1).

For clarity in presentation, we drop the subscript G from the above notation when extending it to define the textual and co-occurrence similarity metrics.
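As an illustration, the following Python sketch (ours; the collection G below is made up) computes IDFG and the containment metric exactly as defined above, and reproduces the {Microsoft, incorporated} intuition:

    import math
    from collections import Counter

    def set_idf(S, n_sets, freq):
        # IDF_G(S) = sum over o in S of log(|G| / f_G(o)).
        return sum(math.log(n_sets / freq[o]) for o in S)

    def cm(S1, S2, n_sets, freq):
        # Containment metric: IDF of the intersection over IDF of S1.
        return set_idf(S1 & S2, n_sets, freq) / set_idf(S1, n_sets, freq)

    # A made-up collection G; B(G) frequencies come from its member sets.
    G = [{"Microsoft", "incorporated"}, {"Microsoft", "inc"},
         {"Boeing", "incorporated"}, {"Intel", "incorporated"}]
    freq = Counter(o for S in G for o in S)
    print(cm(G[0], G[1], len(G), freq))  # ~0.71: Microsoft dominates
    print(cm(G[0], G[2], len(G), freq))  # ~0.29: only 'incorporated' shared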
3.3.1. Textual Similarity Function (tcm)

We assume that each tuple v can be split into a set of tokens using a tokenization function (say, based on white space). Treating each tuple as a set of tokens, the token containment metric between v and v’ is the IDF-weighted fraction of v’s tokens that v’ contains.

Let G={v1, …, vn} be a set of tuples from Ri. Let TS(v) denote the set of tokens in a tuple v. Let Bt(G) be the bag (multi-set) of all tokens that occur in any tuple in G, and let tf(t) denote the frequency of a token t in Bt(G). The token containment metric tcm(v, v’) with respect to G between tuples v and v’ in G is given by the containment metric cm(TS(v), TS(v’)) with respect to G between their token sets. For example, if all tokens have equal IDF values, then tcm([“MO”, “United States”], [“MO”, “United States of America”]) is 1.0, and tcm([“MO”, “United States of America”], [“MO”, “United States”]) is 0.6.

Observe that when two tokens differ slightly due to a typographical error, the token containment metric still treats them as two distinct tokens. To address this shortcoming, we treat two very similar tokens in Bt(G), with edit distance³ less than a small value (say, 0.15), as synonyms.

³ The edit distance between tokens t1 and t2 is the minimum number of edit operations (delete, insert, transpose, and replace) required to change t1 into t2; we normalize this value by the sum of their lengths [AEP01].
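The worked numbers above can be checked with a small Python sketch (ours; it assumes whitespace tokenization and, as in the example, equal IDF values for all tokens, so no IDF table is consulted and synonym handling is omitted):

    def tcm_equal_idf(v, v_prime):
        # Token containment metric under equal token IDFs: the fraction
        # of v's tokens contained in v' (cm(TS(v), TS(v')) with IDF = 1).
        TS_v, TS_vp = set(v.split()), set(v_prime.split())
        return len(TS_v & TS_vp) / len(TS_v)

    print(tcm_equal_idf("MO United States", "MO United States of America"))  # 1.0
    print(tcm_equal_idf("MO United States of America", "MO United States"))  # 0.6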
3.3.2. Co-occurrence Similarity Function (fkcm)

In a dimensional hierarchy, a tuple in the parent relation Ri joins with a set of tuples in the child relation, which we call its children set. We measure the co-occurrence between two distinct tuples by the amount of overlap between their children sets. An unusually significant co-occurrence (more than the average overlap between pairs of tuples in Ri, or above a certain threshold) is cause for suspecting that one tuple is a duplicate of the other. For example, in Figure 1, the duplicate states MO and Missouri co-occur through the city “Joplin,” whereas other distinct states do not co-occur through any common cities. Informally, our co-occurrence measure, called the foreign key containment metric (fkcm), between two tuples is the containment metric between the children sets of the first and second tuples.

If i > 1, we say that two tuples v1 and v2 in Ri co-occur through a tuple v in Ri-1 if they both join with v. In general, two distinct tuples v1 and v2 in Ri join with two sets S1 and S2 (usually with little overlap) of tuples in Ri-1. We call S1 the children set CS(v1) of v1, and S2 the children set CS(v2) of v2. Let G={v1, …, vn} be a set of tuples from Ri. Let Bc(G) be the bag (multi-set) of all children tuples in Ri-1 with any tuple in G as parent. The child frequency cf(c) of a child tuple c with respect to G is the number of times c occurs in Bc(G). The FK-containment metric fkcm(v, v’) with respect to G between v and v’ in G is the containment metric cm(CS(v), CS(v’)) with respect to Bc(G) between the children sets CS(v) and CS(v’). For example, the FK-containment metric between the values “Missouri” (whose State.Id is S4) and “MO” (whose State.Id is S3) in the State relation of Figure 1 is 1.0 because their children sets are identical ({Joplin}).
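A minimal sketch of fkcm, assuming the children sets are already in hand (the children mapping and the constant IDF below are our illustrative simplifications; the real metric weights children by their IDF with respect to Bc(G)):

    def fkcm(v, v_prime, children, idf):
        # FK-containment metric: containment of CS(v) in CS(v'),
        # IDF-weighted with respect to the bag of children Bc(G).
        CS_v, CS_vp = children[v], children[v_prime]
        return sum(idf(c) for c in CS_v & CS_vp) / sum(idf(c) for c in CS_v)

    # Figure 1: 'MO' and 'Missouri' share their single child 'Joplin'.
    children = {"MO": {"Joplin"}, "Missouri": {"Joplin"},
                "WA": {"Redmond", "Seattle"}}
    print(fkcm("MO", "Missouri", children, idf=lambda c: 1.0))  # 1.0
    print(fkcm("MO", "WA", children, idf=lambda c: 1.0))        # 0.0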
Note that while measuring co-occurrence between two tuples in Ri, we only use Ri-1 and disregard information from relations further below, for two reasons. First, the restriction improves efficiency, because the number of distinct combinations joining with a tuple in Ri increases as we go further down the hierarchy. For example, the number of state tuples pointing to “United States” in the Country relation is less than the number of [city, state] tuple pairs that point to it. The restriction therefore enables efficient computation of our co-occurrence measure between tuples. Second, the co-occurrence information between tuples in Ri provided by relations Rj (j < i-1) is usually already available from Ri-1: tuples in Rj (j < i-1) that join with the same tuple in Ri are also likely to join with the same tuples in Ri-1, provided the children sets of distinct tuples are very different from each other. We discuss two exceptional cases in Section 5.

3.3.3. Combination Function

We use thresholded similarity metrics for detecting duplicates. That is, when the similarity cm(v, v’) between v and v’ is greater than a threshold, the duplicate detection function using cm predicts that v is a duplicate of v’. We now discuss the combination of the predictions obtained from both functions. We adopt a weighted voting of the predictions, where the weight of a prediction is proportional to the “importance of the information” used to arrive at it.⁴ As discussed earlier, the IDF values of the token and children sets capture this amount of information, because sets containing more distinguishing tokens or children tuples have higher IDF values.

⁴ For the lowest relation R1 in the hierarchy, we return the prediction of tcm.

For a tuple v in Ri (i > 1), let wt = IDF(TS(v)) and wc = IDF(CS(v)). Let tcm_threshold and fkcm_threshold be the textual and co-occurrence similarity thresholds, respectively. Let pos: R → {1, -1} be the function defined by pos(x) = 1 if x > 0, and -1 otherwise. Our weighted voting combination function is:

pos(wt * pos(tcm(v, v’) - tcm_threshold) + wc * pos(fkcm(v, v’) - fkcm_threshold)).

Essentially, the combination function returns the prediction, 1 (duplicate) or -1 (not a duplicate), of the similarity function that has the higher weight. Suppose that in Figure 1, “UK” is considered a duplicate of “USA” according to a textual similarity function. Because they do not co-occur through any state tuple, fkcm contradicts this prediction. Since the children set of UK has a higher IDF value than its token set, UK is not a duplicate of USA.
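The combination function translates directly into code; the following Python sketch mirrors the formula above (the numeric weights and thresholds in the example are made up for illustration):

    def pos(x):
        # pos: R -> {1, -1}; 1 if x > 0, else -1.
        return 1 if x > 0 else -1

    def combine(tcm_v, fkcm_v, wt, wc, tcm_threshold, fkcm_threshold):
        # Weighted vote: the prediction backed by the larger IDF weight
        # (wt for the token set, wc for the children set) prevails.
        return pos(wt * pos(tcm_v - tcm_threshold)
                   + wc * pos(fkcm_v - fkcm_threshold))

    # UK vs. USA: textually close (tcm votes +1) but sharing no state
    # children (fkcm votes -1); with wc > wt the verdict is -1.
    print(combine(0.9, 0.0, wt=1.2, wc=3.5,
                  tcm_threshold=0.8, fkcm_threshold=0.3))  # -1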
4. Delphi

We now describe Delphi. Recall that we consider two entities to be duplicates if the respective pairs of tuples in each relation of the hierarchy are duplicates. That is, two entities in the customer information of Figure 1 are duplicates only if their Organization tuples, City tuples, State tuples, and Country tuples are all duplicates of each other. Therefore, a straightforward duplicate detection algorithm would be to independently determine sets of duplicate tuples at each level of the hierarchy and then determine duplicate entities over the entire hierarchy. For the example in Figure 1, we can process each of the Organization, City, State, and Country relations independently to determine duplicate pairs of tuples in these relations. We may then identify pairs of duplicate entities whose corresponding tuples at each level in the hierarchy (Organization, City, State, and Country) are either equal or duplicates.

We can be more efficient by exploiting the knowledge from already processed relations. Suppose we know that only “United States of America” and “United States” are duplicates of “USA,” and that the rest are all unique tuples in the Country relation. While processing the State relation, we do not compare the tuple “BC” with “Missouri,” because the former joins with Canada and the latter with (duplicates of) USA. Observe that this usage requires us to process a parent relation in the hierarchy before processing its child. As we move down the hierarchy, the reduction in the number of comparisons is significant. For instance, the Organization relation may have millions of tuples, whereas the number in Seattle, WA, USA may be a few thousand.

We adopt a top-down traversal of the hierarchy. After we process the topmost relation, we group the child relation below it into relatively small groups (compared to the entire relation) and compare pairs of tuples within each group. Let Si be the join of Ri+1, …, Rm through key-foreign key attribute pairs. We use the knowledge of duplicates in Si to group the relation Ri: we place tuples ri1 and ri2, which join with combinations si1 and si2 from Si, in the same group if si1 and si2 are equal or duplicates (i.e., corresponding pairs of tuples in si1 and si2 either match exactly or are duplicates). We then process each group of Ri independently. Observe that we require Si to be grouped into sets of duplicates. For efficiency, we further require these sets to be disjoint; otherwise, the same sets of tuples in Ri may be processed in multiple groups, causing repeated comparisons between the same pairs of Ri tuples.

Considering the example in Figure 1, our top-down traversal of the dimensional hierarchy is as follows. We first detect duplicates in the Country relation; then process the State relation, grouping it with the processed Country relation; then process the City relation, grouping it with the processed [State, Country] combination; and finally process the Organization relation, grouping it with the processed [City, State, Country] combination.

The remainder of this section is organized as follows. In Section 4.1, we discuss the procedure for detecting duplicates within a group of tuples from a relation in the hierarchy. In Section 4.2, we discuss the top-down traversal of the hierarchy, which coordinates the invocation of the group-wise duplicate detection procedure. We do not explicitly discuss the special case of the lowest relation, where we cannot use fkcm; the discussion extends easily to this case.
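The grouping step itself is simple once the parent combinations have been partitioned into duplicate sets; a Python sketch (our illustration, with a hypothetical canonical-representative mapping for the processed parent relation) is:

    def group_child_relation(canonical, parent_child_pairs):
        # Place child tuples in the same group when their parent
        # combinations are equal or duplicates, i.e., share a canonical
        # representative in the processed parent relation(s).
        groups = {}
        for parent, child in parent_child_pairs:
            groups.setdefault(canonical[parent], []).append(child)
        return groups

    # Hypothetical Country-level result: both variants map to 'USA'.
    canonical = {"USA": "USA", "United States": "USA", "Canada": "Canada"}
    pairs = [("USA", "MO"), ("United States", "Missouri"), ("Canada", "BC")]
    print(group_child_relation(canonical, pairs))
    # {'USA': ['MO', 'Missouri'], 'Canada': ['BC']}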
4.1. GroupWise Duplicate Detection

We now describe a procedure to detect duplicates within a group G of tuples from a relation in the hierarchy. The output of this procedure is a partition of G into sets such that each set consists of variations of the same tuple. We first determine pairs of duplicates and then partition G.

As discussed earlier, our duplicate detection function requires the predictions from threshold-based decision functions using the tcm and fkcm metrics. A straightforward procedure is to compare (using tcm and fkcm) all pairs of tuples in a group G, and then to choose the pairs whose similarity is greater than the (tcm or fkcm) threshold. We reduce the number of pairwise comparisons between tuples by pruning out many tuples that do not have any duplicates (according to tcm or fkcm) in G. We describe each step in detail below, first assuming that the tcm and fkcm thresholds are known. In Section 4.3, we describe a method to dynamically determine the thresholds for each group.
4.1.1. Duplicate Detection using tcm

We want to detect all pairs (v1, v2) of tuples where v1 is a duplicate, according to tcm, of v2, i.e., tcm(v1, v2) > tcm-threshold. To reduce the number of pairwise tuple comparisons, we use a potential duplicate identification filter to efficiently isolate a subset G’ consisting of all potential duplicates; that is, a tuple in G-G’ is not a duplicate of any tuple in G. Duplicate detection on G then consists of (i) identifying the set G’, and (ii) comparing each tuple in G’ with the tuples in G it may be a duplicate of.

Since tcm compares token sets of tuples, we abuse the notation and use tcm(v, S) to denote the comparison between the token set of a tuple v and the multi-set union of the token sets of all tuples in the set S. We use similar notation for fkcm as well.

Potential Duplicate Identification Filter
The intuition behind our filtering strategy for determining the set G’ of all potentially duplicate tuples is that the tcm value between any two tuples v and v’ in G is at most that between v and G-{v}. Therefore, a tuple v for which tcm(v, G-{v}) is less than the specified threshold is not a duplicate of any other tuple v’ in G. We perform only |G| comparisons to identify G’, which is potentially much smaller than G. Therefore, comparing pairs involving tuples in the filtered set can be significantly more efficient than comparing all pairs of tuples in G.

This intuition is captured by the following observation for tcm (and fkcm). The observation follows from the fact that the multi-set union of the token sets of all tuples in G-{v} is a superset of the token set of any v’ in G-{v}.

Observation 4.1: Let cm denote either the tcm or the fkcm metric, and let v and v’ be two tuples in a set G of tuples. Then, cmG(v, v’) ≤ cmG(v, G-{v}).
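A sketch of the filter using Observation 4.1 (for brevity we compute tcm(v, G-{v}) with equal token IDFs; the real computation uses the IDF values and the token table described next):

    from collections import Counter

    def potential_duplicates(G, threshold):
        # Observation 4.1: cm_G(v, v') <= cm_G(v, G-{v}), so any v with
        # tcm(v, G-{v}) <= threshold has no duplicates in G. |G| checks.
        freq = Counter(t for u in G for t in set(u.split()))
        def tcm_to_rest(v):
            # Fraction of v's tokens occurring in some other tuple of G,
            # i.e., with frequency >= 2 in Bt(G) (equal-IDF simplification).
            TS_v = set(v.split())
            return sum(1 for t in TS_v if freq[t] >= 2) / len(TS_v)
        return [v for v in G if tcm_to_rest(v) > threshold]

    G = ["United States", "United States of America", "Canada"]
    print(potential_duplicates(G, 0.4))
    # ['United States', 'United States of America'] -- 'Canada' is pruned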
Computing tcm(v, G-{v}) using Token Tables
We now describe a technique to efficiently compute tcm(v, G-{v}) for any tuple v in G. The intuition is that tokens in the intersection of the token set TS(v) of v and the multi-set union of the token sets of all tuples in G-{v} have a frequency of at least 2 in the bag of tokens Bt(G) of G. Any other token is unique and has frequency 1.

We build a structure called the token table of G, containing the following information: (i) the set of tokens whose frequency tf(t) w.r.t. Bt(G) is greater than one, (ii) the frequencies of such tokens, and (iii) the list of (pointers to) tuples in which each such token occurs. The difference between a token table and an inverted index over G is that the token table only contains tokens whose frequency with respect to G is greater than 1, and hence is potentially smaller if a large percentage of the tokens in Bt(G) are unique.

We maintain lists of tuple identifiers only for tokens that are not very frequent. The frequency at which we start ignoring a token, called the stop token frequency, is set to 10% of the number of tuples in G. As mentioned earlier, we enhance tcm by treating tokens that are very close to each other according to edit distance (less than 0.15, in our implementation) as synonyms. Due to space constraints, we skip the details of token table construction.

Example 4.1.1: In Figure 1, suppose we are processing the State relation grouped with the Country relation, and that we have detected the set {United States, United States of America, USA} to be duplicates on the Country relation. For the group of State tuples joining with USA and its duplicates, the token table consists of one entry: {[token=MO, frequency=3, tupleId-list=<S1, S2, S3>]}.

The computation of tcm(v, G-{v}) requires the frequencies with respect to Bt(G) of the tokens in TS(v), which can be obtained by looking up the token table; tokens absent from the token table have frequency 1. Now, any tuple v such that tcm(v, G-{v}) is greater than tcm-threshold is a potential duplicate, and is added to G’.
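A compact Python sketch of token table construction (ours; it omits the edit-distance synonym merging, and the stop token frequency is passed in explicitly rather than derived as 10% of |G|):

    from collections import Counter

    def build_token_table(G, stop_token_freq):
        # Token table of group G: only tokens with frequency > 1 in
        # Bt(G), each with its frequency and (below the stop token
        # frequency) the list of ids of the tuples containing it.
        freq = Counter(t for tup in G.values() for t in set(tup.split()))
        table = {t: (f, []) for t, f in freq.items() if f > 1}
        for tid, tup in G.items():
            for t in set(tup.split()):
                if t in table and table[t][0] < stop_token_freq:
                    table[t][1].append(tid)
        return table

    # Example 4.1.1: the group of State tuples joining with USA et al.
    G = {"S1": "MO", "S2": "MO", "S3": "MO"}
    print(build_token_table(G, stop_token_freq=10))
    # {'MO': (3, ['S1', 'S2', 'S3'])}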
Computing Pairs of Duplicates
We compare each tuple v in G’ with a set Sv of tuples, which is the union of all tuples sharing tokens with v. Sv can be obtained from the token table. (For any tuple v’’ not in Sv, tcm(v, v’’) = 0.) For any tuple v’ in Sv such that tcm(v, v’) > tcm-threshold, we add the pair (v, v’) to the pairs of duplicates from G.

4.1.2. Duplicate Detection using fkcm

We predict that a tuple v is a duplicate, according to fkcm, of another tuple v’ in G if fkcm(v, v’) > fkcm-threshold. Using Observation 4.1, we determine a set of potential duplicates by efficiently computing fkcm(v, G-{v}) using children tables. The computation of the set G’ of potential duplicates, and then of the duplicates (according to fkcm) of tuples in G’, is the same as for tcm. Hence, we only describe the construction of the children table for a group G of tuples.

Children Tables
The children table of G is a hash table containing a subset of the union of the children sets of all tuples in G. It contains: (i) each child tuple c from Ri-1 joining with some tuple in G whose frequency cf(c) in Bc(G) is greater than one, (ii) the frequencies of such children tuples, and (iii) the list of (pointers to) tuples in G with which c joins. We maintain lists of tuples only for children whose frequency is less than the stop children frequency, fixed at 10% of the number of tuples in G.
Example 4.1.2: Consider the example in Figure 1. We process the State relation grouped with the Country relation. Suppose {United States, United States of America, USA} is a set of duplicates on the Country relation. For the group of State tuples joining with USA and its duplicates, the children table contains one entry: {child=Joplin, frequency=3, tupleId-list=<S1, S3, S4>}.

Note: Recall that the frequency of a child tuple in Bc(G) is based only on its descriptive attribute value combinations and ignores the generated key attributes in Ri-1. In the above example, the tuple Joplin has frequency 3 because we ignore the CityId attribute values.

Building the Children Table: The procedure is similar to that of building the token table, except for one difference: the multi-set union Bc(G) of all children sets can be large (e.g., all street addresses in the city [Illinois, Chicago]) and hence may not fit in main memory. Therefore, we follow the steps below, referring to tuples in Bc(G) with frequency greater than one as non-unique tuples.
(i) We fetch all non-unique tuples in Bc(G) into a hash table.
(ii) We fetch tuples in G and their children, one pair at a time, and associate the non-unique tuples in Bc(G) with the list of G tuples they join with.
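The two steps translate into the following sketch (ours; a list of parent-child pairs stands in for the fetches from the database, and stop-children-frequency pruning is omitted):

    from collections import Counter

    def build_children_table(parent_child_pairs):
        # Step (i): find non-unique children (frequency > 1 in Bc(G)),
        # counting only descriptive values, never generated keys.
        freq = Counter(child for _, child in parent_child_pairs)
        table = {c: (f, []) for c, f in freq.items() if f > 1}
        # Step (ii): stream (G tuple, child) pairs and attach to each
        # non-unique child the list of G tuples it joins with.
        for parent_id, child in parent_child_pairs:
            if child in table:
                table[child][1].append(parent_id)
        return table

    # Example 4.1.2: Joplin occurs under S1, S3, and S4.
    pairs = [("S1", "Joplin"), ("S3", "Joplin"), ("S4", "Joplin")]
    print(build_children_table(pairs))
    # {'Joplin': (3, ['S1', 'S3', 'S4'])}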
Combination
After detecting duplicates according to tcm and fkcm, we combine (using the combination function of Section 3.3.3) the predictions for each pair of tuples detected to be duplicates by either tcm or fkcm, or both.

4.1.3. Grouping Duplicate Pairs into Sets

Coordinating the top-down traversal of the hierarchy requires us to partition G into sets of duplicates, and to determine a representative tuple, called the canonical tuple, for each set, so that we can exploit database systems for processing. (This issue will become clearer in the next section.) To partition G into sets of duplicates, we adapt a method from [HS95] to handle asymmetric similarity functions. The essential idea is to divide G into connected groups and choose a canonical tuple for each group.

Following the standard approach [HS95, ME96], we elevate the relationship “is a duplicate of” between tuples to a transitive relation; that is, if v1 is a duplicate of v2 and v2 of v3, we consider v1 to be a duplicate of v3. The intuition behind the partitioning method is to identify maximal connected sets of duplicates such that for any pair of tuples v and v’ in each set, we can deduce using transitivity either that v is a duplicate of v’ or vice versa. A connected set is maximal if we cannot add any more tuples to it without making it disconnected. For each connected set, we choose the tuple with the highest IDF value (of token sets for R1, and of children sets for higher-level relations) as the canonical tuple.
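A sketch of this partitioning (ours): union-find computes the transitive closure of the detected duplicate pairs, and the highest-IDF member of each connected set is chosen as canonical (len is a crude stand-in for the IDF value, and the handling of tuples falling in multiple connected sets, discussed next, is omitted):

    def partition_into_duplicate_sets(tuples, duplicate_pairs, idf_value):
        # Union-find over the duplicate pairs yields the maximal
        # connected sets; each set's canonical tuple is its
        # highest-IDF member.
        parent = {t: t for t in tuples}
        def find(t):
            while parent[t] != t:
                parent[t] = parent[parent[t]]  # path halving
                t = parent[t]
            return t
        for a, b in duplicate_pairs:
            parent[find(a)] = find(b)
        sets = {}
        for t in tuples:
            sets.setdefault(find(t), []).append(t)
        return {max(s, key=idf_value): s for s in sets.values()}

    tuples = ["USA", "United States", "United States of America", "Canada"]
    dups = [("United States", "USA"), ("United States of America", "USA")]
    print(partition_into_duplicate_sets(tuples, dups, idf_value=len))
    # {'United States of America': ['USA', 'United States',
    #  'United States of America'], 'Canada': ['Canada']}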
Because the relationship “is a duplicate of” is asymmetric, a tuple may end up in multiple connected sets. For such a tuple v, we place it in
View Definitions:
  Lm = Select * From Rm, …, R1
  Li = Select Li+1.Am, …, (Case When Ti+1.Ai+1 Is Null Then Li+1.Ai+1 Else Ti+1.Ai+1), Li+1.Ai, …, Li+1.A1
       From Li+1 Left Outer Join Ti+1 On Li+1.Am = Ti+1.Am, …, Li+1.Ai+1 = Ti+1.Ai+1

Qi = Select Li.m, …, Li.i+1, Li.i-1, count(*)
     From (Select distinct Li.m, …, Li.i-1)
     Group By Li.m, Li.i+1, Li.i-1
     Having count(*) > 1
     Order By Li.m, …, Li.i+1, Li.i-1

Qi’ = Select Li.m, …, Li.i+1, Li.i, Li.i-1
      From (Select distinct Li.m, …, Li.i-1)
      Order By Li.m, …, Li.i, Li.i-1

Table 1: Algorithms
Figure 3: False Positive Explosion. (FP% and FN% of the cosine metric CM at thresholds 0.9, 0.85, and 0.8; edit distance at 0.05, 0.1, and 0.15; and H-CM at 0.9 and 0.8.)

WA and Washington are duplicates (using co-occurrence information), then the two customer entities are duplicates. Thus, even though our definition of duplicates does not directly allow such inconsistencies, we can correct them in conjunction with other cleaning operations.

Potential Duplicate Identification Filter
Imagine a set G of tuples where most of the tokens in Bt(G) occur in at least two tuples in G. In such cases, the filtering strategy is not very effective, because we may mark many tuples as potential duplicates. Our experiments on real data illustrate that such a case does not typically occur in practice. However, developing appropriate filters for such rare cases is still an open issue.

We note that it is possible to consider similarity and combination functions other than the ones we used. However, Observation 4.1, which summarizes our filtering strategy, may not be valid for all similarity functions, and one may have to design suitable filters where possible.
Error Introduction
We introduce two types of errors common in data warehouses [For01]: equivalence errors, and spelling and truncation errors. The generator has three parameters: the first, the percentage error, controls the amount of error introduced in each relation; the second (equivalence fraction) and third (spelling fraction) control the fractions of equivalence errors and of spelling and truncation errors, respectively. For example, if the percentage error is 10% and the equivalence fraction is 50%, then we introduce 10% duplicate tuples into the input table, of which 50% are due to equivalence errors.

Equivalence Errors: Consider the tuple combination [<Key Associates>, <Joplin>, <MO>, <USA>] in the customer table. Suppose we want to create an equivalence error for “MO” in the State relation. We first garble “MO” into, say, “xMykOz,” so that the new value is undetectable by standard textual similarity functions. Since equivalence errors usually occur in multiple tuples, we choose around 5% (5-x%, 5+x%) of all entities with R.country=“USA” and R.state=“MO” and modify the value of MO to “xMykOz.” For 10% of these modified tuples, we also introduce errors in the tuple from the child relation, when one exists. We insert these erroneous tuples into R. At the lowest level of the hierarchy, we garble a randomly picked token from the token set and insert the modified tuple into R.

Spelling and Truncation Errors: We modify a token in a tuple by changing, deleting, or adding characters, or by truncating the token. 50% of the time we modify characters, and the remaining 50% we just truncate the token. The number of characters modified or truncated is a linearly decreasing function, with a maximum of half the token length.
(Figure: FP% and FN% panels; x-axis values 4, 8, and 11.)