Data Mining and Entity Resolution
— Unit 9 —
ER Techniques
Dr. Asif Sohail
University of the Punjab
Department of Information Technology, FCIT
NOTE: Some slides and/or pictures are adapted from Lecture slides / Books of
Data Matching, Concepts and Techniques for Record Linkage, Entity
Resolution, and Duplicate Detection by Peter Christen
Blocking or Record Partitioning
Blocking distributes the records into multiple partitions with the
objective of placing potential duplicates, called “candidate
record pairs”, in the same partition.
Afterwards, record pairs are compared only among the records
residing in the same partition(s), which significantly reduces
the number of comparisons.
The record partitions can be either disjoint or overlapped
(canopies) [Köp 10].
With disjoint partitions, only the records residing in the same
partition are thoroughly compared with each other, whereas
overlapped partitions also allow comparisons between records
residing in neighboring partitions.
Pros and Cons of disjoint and overlapped partitions?
An inverted index (indexing) is used for record partitioning.
It groups the records whose selected field(s) have a similarity
greater than a certain threshold into the same partition, bucket,
block, cluster, or pocket.
The objectives, or desired outcomes, of blocking are:
Partition the records in a way that reduces the number of
record comparisons while missing very few of the duplicate
records.
Place potential duplicates, called “candidate record pairs”, in
the same bucket. Potential duplicates placed in different buckets
may lead to “False Negatives (FN)”. Similarly, non-duplicates
placed in the same bucket may lead to “False Positives (FP)”.
Balance the number of buckets against the block sizes.
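The trade-off in these objectives can be illustrated with a small sketch. It is only an illustration: the helper names `candidate_pairs` and `blocking_outcome`, the toy blocks, and the set of true duplicate pairs are all assumptions made for the example.

```python
from itertools import combinations

def candidate_pairs(blocks):
    """Return the set of record pairs that share at least one block."""
    pairs = set()
    for record_ids in blocks.values():
        for a, b in combinations(sorted(record_ids), 2):
            pairs.add((a, b))
    return pairs

def blocking_outcome(blocks, true_matches, n_records):
    """Summarise what a blocking scheme costs and what it misses."""
    candidates = candidate_pairs(blocks)
    all_pairs = n_records * (n_records - 1) // 2
    return {
        "all_pairs": all_pairs,
        "compared_pairs": len(candidates),
        # True duplicates split across blocks can never be found (FN).
        "missed_matches": len(true_matches - candidates),
        # Non-duplicates sharing a block must be compared and may become FP.
        "non_match_candidates": len(candidates - true_matches),
    }

# Hypothetical toy data: six records in three disjoint blocks.
blocks = {"k1": [1, 2, 3], "k2": [4, 5], "k3": [6]}
true_matches = {(1, 2), (3, 4)}  # assumed ground truth
outcome = blocking_outcome(blocks, true_matches, n_records=6)
print(outcome)
```

In this toy case blocking cuts the work from 15 comparisons to 4, at the price of one duplicate pair that is never compared.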
Blocking
The groups of records are blocked together in different
buckets on the basis of blocking key(s).
The comparisons of records within a block are made
against the selected fields, called linkage fields.
[Figure: records R1 ... Rn are assigned via blocking keys K1 ... Kb to record blocks RB(K1) ... RB(Kb), annotated with the number of comparisons.]
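As a rough illustration of the saving, the sketch below counts comparisons under the simplifying assumption that n records are spread evenly over b disjoint blocks, so that each block holds n/b records. This assumption and the function names are hypothetical and chosen only for the example.

```python
def pairs_without_blocking(n):
    """All-against-all comparisons for n records."""
    return n * (n - 1) // 2

def pairs_with_equal_blocks(n, b):
    """Comparisons when n records fall evenly into b disjoint blocks.

    Assumes b divides n; each block then needs (n/b)(n/b - 1)/2 comparisons.
    """
    if n % b != 0:
        raise ValueError("this sketch assumes b divides n")
    size = n // b
    return b * (size * (size - 1) // 2)

n, b = 10_000, 100
print(pairs_without_blocking(n))      # 49995000
print(pairs_with_equal_blocks(n, b))  # 495000
```

Under this even-split assumption the comparison count drops by roughly a factor of b.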
Selection of Blocking Key
The selection of the blocking key (BK) is the most critical decision in
the ER process.
The blocking key can be selected either manually by domain
experts or by using a machine learning technique.
The guidelines for selecting a blocking key are:
High data quality: the attribute should contain the fewest errors.
Moderately restrictive: the blocking key should be neither overly
restrictive (such as NIC, soc-sec-id, etc.) nor too unselective
(such as gender, marital status, etc.).
Completeness: values present for 95% or more of the records.
Uniform distribution of the values.
Information value: a positive correlation with duplicates.
Multiple attributes may be combined into a blocking key.
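Some of these guidelines can be checked mechanically before a key is chosen. The following is a minimal sketch, assuming records are dictionaries and that an empty string or `None` marks a missing value; the function name `profile_attribute` and the sample field names are hypothetical.

```python
from collections import Counter

def profile_attribute(records, attribute):
    """Report completeness and value distribution for one candidate key."""
    values = [r.get(attribute) for r in records]
    present = [v for v in values if v not in (None, "")]
    counts = Counter(present)
    n = len(records)
    return {
        "completeness": len(present) / n if n else 0.0,
        "distinct_values": len(counts),
        # Share of all records that fall into the largest block.
        "largest_block_share": max(counts.values()) / n if counts else 0.0,
    }

# Hypothetical records; the field names are made up for the example.
records = [
    {"surname": "khan", "gender": "m", "nic": "35202-1"},
    {"surname": "khan", "gender": "m", "nic": "35202-2"},
    {"surname": "ali", "gender": "f", "nic": "35202-3"},
    {"surname": "", "gender": "f", "nic": "35202-4"},
]
for name in ("surname", "gender", "nic"):
    print(name, profile_attribute(records, name))
```

An attribute with as many distinct values as records (like `nic` here) is overly restrictive, while one with very few distinct values (like `gender`) produces large blocks.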
Standard or Traditional Blocking
Standard blocking, or simply blocking, is regarded as the first
blocking technique; it was proposed by Newcomb [15].
Consider a dataset of records defined over a set of attributes.
Let $A = \{a_1, \dots, a_m\}$ be the set of attributes, and let
$R = \{r_1, \dots, r_n\}$ be the set of records.
A blocking key can be described as $k = h(a_j)$, where $a_j$ is an
attribute of an entity and $h$ is a hashing function.
Let the possible values of $k$ be represented as a set
$V = \{v_1, \dots, v_b\}$. The value of an attribute $a_j$ of an entity
$r_i$ is represented as $r_i.a_j$. An entry $M[i][j]$ of a 2-dimensional
matrix $M$ represents the value of record $r_i$ for attribute $a_j$.
The simplest version of blocking uses an attribute $a_j$ as
a blocking key and puts all the records with identical $h(r_i.a_j)$ in
the same block $B_v$, such that $B_v = \{ r_i \in R : h(r_i.a_j) = v \}$.
Any two records residing in the same block form a candidate
record pair.
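Standard blocking as described above can be sketched with a dictionary acting as the inverted index. This is an illustration under stated assumptions, not a reference implementation: the function names are hypothetical, and lower-casing stands in for the hashing function.

```python
from collections import defaultdict
from itertools import combinations

def standard_blocking(records, attribute, key_func=lambda v: v):
    """Group record ids by the blocking key value of one attribute."""
    index = defaultdict(list)  # inverted index: key value -> record ids
    for record_id, record in records.items():
        index[key_func(record[attribute])].append(record_id)
    return dict(index)

def candidate_record_pairs(blocks):
    """Yield every pair of records that share a block."""
    for record_ids in blocks.values():
        yield from combinations(record_ids, 2)

# Hypothetical records keyed by record id.
records = {
    "r1": {"surname": "Smith"},
    "r2": {"surname": "smith"},
    "r3": {"surname": "Miller"},
    "r4": {"surname": "Smyth"},
}
blocks = standard_blocking(records, "surname", key_func=str.lower)
print(blocks)  # {'smith': ['r1', 'r2'], 'miller': ['r3'], 'smyth': ['r4']}
print(list(candidate_record_pairs(blocks)))  # [('r1', 'r2')]
```

Only records with an identical key value meet, so 'Smyth' is never compared with 'Smith' in this sketch.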
[Figure: Standard Blocking]
Limitations of Standard Blocking
Errors or noise in blocking key values (BKVs) prevent potential
duplicates from being placed in the same block. Such
duplicates will never be compared.
Many non-matching record pairs may be placed in the
same block, which increases the number of record comparisons.
A non-uniform or Zipf distribution of BKVs results in blocks of
varying size. Large blocks defeat the purpose of blocking.
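Two of these limitations are easy to reproduce on toy data. The sketch below is a hypothetical illustration: the key values and the 85/5/5/5 skew are made up, and the helper names are not taken from the source.

```python
from collections import defaultdict

def block_by_exact_key(values):
    """Standard blocking: identical key values share a block."""
    blocks = defaultdict(list)
    for record_id, value in values.items():
        blocks[value].append(record_id)
    return dict(blocks)

def comparisons(blocks):
    """Total pairwise comparisons over all blocks."""
    return sum(len(ids) * (len(ids) - 1) // 2 for ids in blocks.values())

# A single typo ('petre') separates two records of the same person.
noisy = block_by_exact_key({"r1": "peter", "r2": "petre", "r3": "paul"})
print(noisy)  # r1 and r2 end up in different blocks

# Same number of records, uniform versus skewed key values.
uniform = block_by_exact_key({i: i % 4 for i in range(100)})
skewed = block_by_exact_key({i: 0 if i < 85 else i % 3 + 1 for i in range(100)})
print(comparisons(uniform))  # 1200
print(comparisons(skewed))   # 3600
```

With the same 100 records and four blocks, the skewed distribution needs three times as many comparisons because one large block dominates.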
Blocking Variants
1. Q-gram based blocking
2. Suffix Array based blocking
3. Canopy Clustering
Q-gram Blocking
It was proposed to counter errors or noise in the blocking
key values. It places records with similar blocking keys
into multiple blocks, so that such records are placed
together in at least one block.
Blocking key values are transformed into lists of q-grams
(substrings of length q), and all combinations of sub-lists are
then created down to a certain length, determined by a threshold
parameter t. The resulting q-gram sub-lists are converted
back into strings and used as keys in an inverted index.
Example: Let the blocking key value be ‘peter’, q = 2 (bigrams), and
the threshold t = 0.8. The 2-gram list for this value is
[‘pe’,‘et’,‘te’,‘er’] with four elements, and the threshold 0.8
gives 4 × 0.8 = 3.2, rounded to 3, which means all sub-list
combinations of length four and three are generated:
[‘pe’,‘et’,‘te’,‘er’], [‘et’,‘te’,‘er’], [‘pe’,‘te’,‘er’], [‘pe’,‘et’,‘er’], and
[‘pe’,‘et’,‘te’]. This leads to five inverted index lists (blocks) with key
values ‘peetteer’, ‘etteer’, ‘peteer’, ‘peeter’, and ‘peette’.
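The key generation in this example can be sketched as follows. The function names are hypothetical, and the sketch assumes that the minimum sub-list length is obtained by truncating the product of list length and t, which reproduces the 3.2 to 3 step of the example.

```python
from itertools import combinations

def qgrams(value, q=2):
    """Split a string into its overlapping substrings of length q."""
    return [value[i:i + q] for i in range(len(value) - q + 1)]

def qgram_blocking_keys(value, q=2, t=0.8):
    """Return the index keys built from q-gram sub-lists of a key value."""
    grams = qgrams(value, q)
    # Shortest sub-list length, e.g. 4 * 0.8 = 3.2 -> 3 (never below 1).
    min_len = max(1, int(len(grams) * t))
    keys = []
    for length in range(len(grams), min_len - 1, -1):
        # combinations() keeps the q-grams in their original order.
        for sub_list in combinations(grams, length):
            key = "".join(sub_list)
            if key not in keys:
                keys.append(key)
    return keys

print(qgram_blocking_keys("peter"))
```

Each returned string would be used as a key in the inverted index, so two key values are compared whenever they share at least one such string. For instance, ‘pete’ also produces the key ‘peette’ and therefore meets ‘peter’ in one block.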
[Figure: Q-gram Blocking]
Suffix Array Blocking
The basic idea is to insert the BKVs and their suffixes into a
suffix-array-based inverted index [Aiz 05].
Only suffixes down to a minimum length, $l_m$, are inserted into the
suffix array.
For example, for a BKV ‘christen’ and $l_m = 5$, the values ‘christen’,
‘hristen’, ‘risten’ and ‘isten’ will be generated. To limit
the maximum size of blocks, a second parameter, $b_M$, sets the
maximum number of record identifiers allowed in a block.
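A minimal sketch of this idea is given below, assuming a minimum suffix length of 5 (which matches the ‘christen’ example) and assuming that blocks exceeding the size limit are simply discarded. The function and parameter names are hypothetical, and a sorted dictionary stands in for a real suffix array.

```python
from collections import defaultdict

def suffixes(value, min_length):
    """Return the value and its suffixes down to min_length characters."""
    if len(value) < min_length:
        return [value]  # assumption: short values are kept whole
    return [value[i:] for i in range(len(value) - min_length + 1)]

def suffix_blocking(key_values, min_length=5, max_block_size=3):
    """Build an inverted index from suffixes to record identifiers."""
    index = defaultdict(list)
    for record_id, value in key_values.items():
        for suffix in suffixes(value, min_length):
            index[suffix].append(record_id)
    # Assumption: blocks above the size limit are dropped entirely.
    return {s: ids for s, ids in sorted(index.items())
            if len(ids) <= max_block_size}

print(suffixes("christen", 5))  # ['christen', 'hristen', 'risten', 'isten']
index = suffix_blocking({"r1": "christen", "r2": "kristen", "r3": "christian"})
print(index["risten"])  # ['r1', 'r2']
```

Here ‘christen’ and ‘kristen’ differ at the start of the value but still share the blocks ‘risten’ and ‘isten’.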
[Figure: Suffix Array Blocking]
2. Windowing: Sorted array
The records are sorted and a fixed-size window of size w > 1 is slid
over them. Comparisons are made among all the records falling within
the same window.
# of comparisons = $w(w-1)/2 + (n-w)(w-1) = O(wn)$
[Figure: records R1 ... Rn are sorted into the array SR1 ... SRn, over which the window slides.]
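The sliding window over the sorted array can be sketched as below, together with a check against the comparison count w(w-1)/2 + (n-w)(w-1). The function names and the sample names are hypothetical, and the sketch assumes n is at least w.

```python
def sorted_window_pairs(records, sort_key, w):
    """Slide a window of w records over the sorted list and pair them up."""
    if w < 2:
        raise ValueError("window size must be greater than 1")
    ordered = sorted(records, key=sort_key)
    pairs = []
    for i, record in enumerate(ordered):
        # Pair each record with the next w - 1 records in sort order.
        for other in ordered[i + 1:i + w]:
            pairs.append((record, other))
    return pairs

def expected_comparisons(n, w):
    """Closed form: w(w-1)/2 + (n-w)(w-1), valid for n >= w."""
    return w * (w - 1) // 2 + (n - w) * (w - 1)

names = ["smith", "smyth", "miller", "millar", "meyer", "jones", "johns"]
pairs = sorted_window_pairs(names, sort_key=lambda s: s, w=3)
print(len(pairs), expected_comparisons(len(names), 3))  # 11 11
```

Pairing each record with its next w - 1 neighbours yields the same pairs as moving the window one position at a time, without generating any pair twice.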
2. Windowing: Sorted inverted index
The records that have a common sorting key value are blocked
together. A window is then slid across the blocks of records, and
comparisons are made among the records of the blocks falling
within the same window.
# of comparisons = $\frac{wn}{b}\left(\frac{wn}{b}-1\right)/2 + (b-w)\left[\frac{n}{b}\left(\frac{n}{b}-1\right)/2 + (w-1)\frac{n^2}{b^2}\right]$
[Figure: records R1 ... Rn are indexed by sorted key values V1 ... Vb into record blocks RB(V1) ... RB(Vb).]
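The variant with a sorted inverted index can be sketched in the same way. The sketch assumes the window spans w consecutive key values and, for the closed-form check, that all b blocks hold exactly n/b records; the function names and toy data are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def inverted_index_window_pairs(key_values, w):
    """Slide a window of w key values over the sorted inverted index."""
    index = defaultdict(list)
    for record_id, value in key_values.items():
        index[value].append(record_id)
    sorted_keys = sorted(index)
    pairs = set()
    # Guard so fewer than w distinct keys still form one window.
    last_start = max(len(sorted_keys) - w, 0)
    for start in range(last_start + 1):
        window_ids = []
        for key in sorted_keys[start:start + w]:
            window_ids.extend(index[key])
        # Compare all records whose blocks fall inside this window.
        pairs.update(combinations(sorted(window_ids), 2))
    return pairs

def expected_comparisons(n, b, w):
    """Closed form for b equal blocks of n/b records each."""
    s = n // b
    first = (w * s) * (w * s - 1) // 2
    return first + (b - w) * (s * (s - 1) // 2 + (w - 1) * s * s)

# Hypothetical data: 12 records, 4 key values, 3 records per value.
key_values = {i: "abcd"[i % 4] for i in range(12)}
pairs = inverted_index_window_pairs(key_values, w=2)
print(len(pairs), expected_comparisons(12, 4, 2))  # 39 39
```

Collecting the pairs in a set means that pairs seen in overlapping window positions are counted once, which is what the closed form assumes.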