DMER - Unit#9 - Blocking Techniques For ER

Uploaded by

Amina Baig

Data Mining and Entity Resolution

— Unit 9 —
ER Techniques

Dr. Asif Sohail


University of the Punjab
Department of Information Technology, FCIT

NOTE: Some slides and/or pictures are adapted from the lecture slides / book
“Data Matching: Concepts and Techniques for Record Linkage, Entity
Resolution, and Duplicate Detection” by Peter Christen.
Blocking or Records
Partitioning
• Blocking distributes the records into multiple partitions, with the objective of placing potential duplicates (“candidate record pairs”) in the same partition.
• Afterwards, record pairs are compared only among the records residing in the same partition(s), which significantly reduces the number of comparisons.
• The record partitions can be either disjoint or overlapping (canopies) [Köp 10].
• With disjoint partitions, only the records residing in the same partition are compared with each other, whereas overlapping partitions also allow comparisons of records residing in neighboring partitions.
• Pros and cons of disjoint and overlapping partitions?

Blocking or Records
Partitioning
• An inverted index (indexing) is used for record partitioning.
• It groups the records whose similarity on the selected field(s) is greater than a certain threshold into the same partition (also called a bucket, block, cluster, or pocket).
• The objectives, or desired outcomes, of blocking are:
  - Partition the records in a way that reduces the number of record comparisons with very little compromise in the number of duplicate records identified.
  - Potential duplicates, called “candidate record pairs”, should be placed in the same bucket. Potential duplicates placed in different buckets may lead to false negatives (FN). Similarly, non-duplicates placed in the same bucket may lead to false positives (FP).
  - Balance the number of buckets against the block sizes.

Blocking
[Figure: records R1, R2, R3, ..., Rn are mapped through blocking keys K1, K2, ..., Kb into record blocks RB(K1), RB(K2), ..., RB(Kb). The figure also carries the label "# of comparisons".]

The groups of records are blocked together in different buckets on the basis of blocking key(s).

The comparisons of records within a block are made against the selected fields, called linkage fields.

Selection of Blocking
Key
• The selection of the blocking key (BK) is the most critical decision in the ER process.
• The blocking key can be selected either manually by domain experts or by a machine learning technique.
• Guidelines for selecting a blocking key:
  - High data quality: the attribute should contain the fewest errors.
  - Moderately restrictive: the blocking key should be neither overly restrictive (such as NIC, soc-sec-id, etc.) nor too indiscriminate (such as gender, marital status, etc.).
  - Completeness: values present in 95% or more of the records.
  - Uniform distribution of the values.
  - Information value with a positive correlation with duplicates.
  - Multiple attributes may be combined into a blocking key.
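The completeness and distribution guidelines above can be checked on a sample before committing to a key. The following is a minimal sketch, assuming records are plain dictionaries; the function name `profile_blocking_key`, the reported metrics, and the sample data are illustrative and not part of the original slides.

```python
from collections import Counter

def profile_blocking_key(records, attribute):
    """Summarise how well one attribute might serve as a blocking key."""
    values = [rec.get(attribute) for rec in records]
    # Treat None and empty strings as missing values.
    present = [v for v in values if v not in (None, "")]
    counts = Counter(present)
    n = len(records)
    return {
        # Share of records that have a value (guideline: 95% or more).
        "completeness": len(present) / n if n else 0.0,
        # Very few distinct values means the key is too indiscriminate.
        "distinct_values": len(counts),
        # Share of records in the largest block; large means skewed.
        "largest_block_share": max(counts.values()) / n if counts else 0.0,
    }

# Hypothetical sample data.
records = [
    {"surname": "smith", "gender": "f", "zipcode": "54000"},
    {"surname": "smyth", "gender": "f", "zipcode": "54000"},
    {"surname": "khan", "gender": "m", "zipcode": ""},
    {"surname": "baig", "gender": "f", "zipcode": "54590"},
]

gender_profile = profile_blocking_key(records, "gender")
zipcode_profile = profile_blocking_key(records, "zipcode")
```

On this sample, `gender` is complete but puts three of four records in one block, while `zipcode` is more discriminating but incomplete.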

Standard or Traditional
Blocking
• Standard blocking, or simply blocking, is known as the first blocking technique ever proposed, by Newcombe [15].
• Consider a dataset of records defined over a set of attributes. Let $A = \{a_1, \dots, a_m\}$ be the set of attributes and $R = \{r_1, \dots, r_n\}$ be the set of records.
• A blocking key can be described as $BK = h(a_j)$, where $a_j$ is an attribute of an entity and $h$ is a hashing function.
• Let the possible values of $BK$ be represented as a set $V = \{v_1, \dots, v_b\}$. The value of an attribute $a_j$ of an entity $r_i$ is represented as $r_i.a_j$. An entry $M[i, j]$ of a 2-dimensional matrix $M$ represents the value of record $r_i$ for attribute $a_j$. The simplest version of blocking uses an attribute $a_j$ as a blocking key and puts all the records with identical $h(r_i.a_j)$ in the same block $B_k$, such that $B_k = \{ r_i \in R : h(r_i.a_j) = v_k \}$.
• Pairs of records residing in the same block are called candidate record pairs.
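Standard blocking can be sketched as an inverted index from blocking key values to record identifiers. This is a minimal illustration, not the slides' own implementation: the key function (first three letters of the surname, lower-cased) and the sample records are hypothetical choices.

```python
from collections import defaultdict
from itertools import combinations

def blocking_key(record):
    # Hypothetical blocking key: first three letters of the surname.
    return record["surname"][:3].lower()

def standard_blocking(records, key_fn):
    """Build an inverted index: blocking key value -> list of record ids."""
    index = defaultdict(list)
    for rec_id, rec in records.items():
        index[key_fn(rec)].append(rec_id)
    return dict(index)

def candidate_pairs(blocks):
    """Generate every pair of records that share a block."""
    pairs = set()
    for ids in blocks.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs

# Hypothetical sample data.
records = {
    "r1": {"surname": "Christen"},
    "r2": {"surname": "Christensen"},
    "r3": {"surname": "Kristen"},
    "r4": {"surname": "Smith"},
}

blocks = standard_blocking(records, blocking_key)
pairs = candidate_pairs(blocks)
```

Without blocking, four records need six comparisons; here only `r1` and `r2` are compared. The record `r3` ('Kristen') lands in a different block than 'Christen' and is never compared with it.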

Standard Blocking

Limitations of Standard
Blocking
• Errors or noise in blocking key values prevent potential duplicates from being placed in the same block; such duplicates are then never compared.
• Quite a few non-matching record pairs may be placed in the same block, which inflates the number of record comparisons.
• A non-uniform or Zipf distribution of blocking key values (BKVs) results in blocks of varying sizes. Large blocks defeat the purpose of blocking.

Blocking Variants

1. Q-gram based blocking
2. Suffix array based blocking
3. Canopy clustering

Q-gram Blocking
• It was proposed to counter errors or noise in the blocking key values. It places records with similar blocking keys into multiple blocks, so that such records end up together in at least one block.
• Blocking key values are transformed into lists of q-grams (substrings of length q), and all combinations of sub-lists are then created down to a certain length, determined by a threshold parameter t. The resulting q-gram sub-lists are converted back into strings and used as keys in an inverted index.
• Example: Let the blocking key value be ‘peter’, q = 2 (bigrams), and the threshold t = 0.8. The 2-gram list for this value is [‘pe’,‘et’,‘te’,‘er’] with four elements; applying the threshold gives 4 × 0.8 = 3.2, rounded to 3, which means all sub-list combinations of length four and three are generated:
  - [‘pe’,‘et’,‘te’,‘er’], [‘et’,‘te’,‘er’], [‘pe’,‘te’,‘er’], [‘pe’,‘et’,‘er’], and [‘pe’,‘et’,‘te’]. This leads to five inverted index lists (blocks) with key values ‘peetteer’, ‘etteer’, ‘peteer’, ‘peeter’, and ‘peette’.
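The key generation in the 'peter' example can be reproduced with a short sketch. The function names are hypothetical, and using Python's built-in `round` for the "rounded to 3" step is an assumption (it rounds exact halves to the nearest even integer, which may differ from the rounding rule intended on the slide).

```python
from itertools import combinations

def qgrams(value, q=2):
    """Split a string into its overlapping substrings of length q."""
    return [value[i:i + q] for i in range(len(value) - q + 1)]

def qgram_blocking_keys(value, q=2, t=0.8):
    """Generate the index keys for one blocking key value."""
    grams = qgrams(value, q)
    if not grams:
        # Assumption: values shorter than q are indexed as they are.
        return [value]
    # Shortest sub-list length allowed by the threshold t.
    min_len = max(1, round(len(grams) * t))
    keys = []
    for length in range(len(grams), min_len - 1, -1):
        # combinations() keeps the q-grams in their original order.
        for sub_list in combinations(grams, length):
            keys.append("".join(sub_list))
    # Remove duplicate keys while preserving order.
    return list(dict.fromkeys(keys))

peter_keys = qgram_blocking_keys("peter", q=2, t=0.8)
```

Each record is inserted into the inverted index once per generated key, so two values that differ by a single q-gram still share at least one block. The number of keys grows quickly for long values or low thresholds.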
Q-gram Blocking

Suffix Array Blocking
• The basic idea is to insert the BKVs and their suffixes into a suffix-array-based inverted index [Aiz 05].
• Only suffixes down to a minimum length, $l_{min}$, are inserted into the suffix array.
• For example, for the BKV ‘christen’ and $l_{min} = 5$, the values ‘christen’, ‘hristen’, ‘risten’ and ‘isten’ will be generated.
• To limit the maximum size of blocks, a second parameter, $b_{max}$, sets the maximum number of record identifiers allowed in a block.
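A minimal sketch of suffix array blocking follows. The parameter names `min_len` and `max_block_size` are hypothetical stand-ins for the two parameters described above. Two behaviours are assumptions rather than statements from the slides: values shorter than the minimum length are indexed whole, and blocks exceeding the maximum size are dropped entirely.

```python
from collections import defaultdict

def suffixes(value, min_len):
    """Return the value and its suffixes down to min_len characters."""
    # Assumption: a value shorter than min_len is kept as a single entry.
    count = max(1, len(value) - min_len + 1)
    return [value[i:] for i in range(count)]

def suffix_array_blocking(bkvs, min_len=5, max_block_size=3):
    """Map each suffix to the sorted ids of the records that contain it."""
    index = defaultdict(set)
    for rec_id, bkv in bkvs.items():
        for suffix in suffixes(bkv, min_len):
            index[suffix].add(rec_id)
    # Sort the suffixes (as in a suffix array) and, by assumption,
    # discard blocks with more than max_block_size record ids.
    return {
        suffix: sorted(ids)
        for suffix, ids in sorted(index.items())
        if len(ids) <= max_block_size
    }

# Hypothetical blocking key values.
bkvs = {"r1": "christen", "r2": "kristen", "r3": "christian"}
suffix_blocks = suffix_array_blocking(bkvs, min_len=5, max_block_size=3)
```

Here 'christen' and 'kristen' share the suffixes 'risten' and 'isten', so `r1` and `r2` become a candidate pair even though their full key values differ.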

Suffix Array Blocking

2. Windowing: Sorted array

[Figure: records R1, R2, ..., Rn are sorted into SR1, SR2, ..., SRn.]

The records are sorted and a fixed-size window of size w > 1 is slid over them. Comparisons are made among all the records falling within the same window.

# of comparisons = w(w-1)/2 + (n-w)(w-1) = O(wn)
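The sliding window over the sorted records, and the comparison count w(w-1)/2 + (n-w)(w-1), can be checked with a small sketch. The function name and the sample surnames are hypothetical, and the records are sorted on the full value for simplicity.

```python
def sorted_neighbourhood_pairs(records, sort_key, w):
    """Pair each record with the next w-1 records in sorted order."""
    if w < 2:
        raise ValueError("window size w must be greater than 1")
    ordered = sorted(records, key=sort_key)
    pairs = []
    for i, left in enumerate(ordered):
        # Records at positions i+1 .. i+w-1 share a window with `left`.
        for right in ordered[i + 1:i + w]:
            pairs.append((left, right))
    return pairs

# Hypothetical sorting key values (n = 10).
surnames = ["khan", "baig", "smith", "smyth", "sohail",
            "ali", "malik", "raza", "butt", "shah"]
w = 3
window_pairs = sorted_neighbourhood_pairs(surnames, sort_key=str, w=w)

n = len(surnames)
expected_count = w * (w - 1) // 2 + (n - w) * (w - 1)
```

With n = 10 and w = 3 the formula gives 3 + 14 = 17 comparisons, against 45 for an exhaustive comparison of all pairs.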

2. Windowing: Sorted inverted index
[Figure: records R1, R2, ..., Rn are indexed by the sorted key values V1, V2, ..., Vb into record blocks RB(V1), RB(V2), ..., RB(Vb).]

The records that have a common sorting key value are blocked together. Then a window is slid across the blocks of records and the comparisons are made among the records of the blocks inside the window.

# of comparisons
$= \frac{wn}{b}\left(\frac{wn}{b} - 1\right)/2 + (b - w)\left[\frac{n}{b}\left(\frac{n}{b} - 1\right)/2 + (w - 1)\frac{n^2}{b^2}\right]$
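The window over blocks, rather than over individual records, can be sketched as follows. The function names and sample data are hypothetical. The closed-form count assumes that all b blocks hold exactly n/b records, which matches the formula on the slide but rarely holds on real data.

```python
from collections import defaultdict
from itertools import combinations

def inverted_index_window_pairs(records, sort_key, w):
    """Slide a window of w blocks over the sorted key values."""
    index = defaultdict(list)
    for rec_id, rec in records.items():
        index[sort_key(rec)].append(rec_id)
    keys = sorted(index)
    pairs = set()
    for start in range(max(1, len(keys) - w + 1)):
        # Gather the record ids of the w blocks inside the window.
        window_ids = [rid for k in keys[start:start + w] for rid in index[k]]
        # A set keeps pairs seen in overlapping windows only once.
        pairs.update(combinations(sorted(window_ids), 2))
    return pairs

def expected_comparisons(n, b, w):
    """Closed-form count, assuming b equal-sized blocks of n/b records."""
    size = n // b
    first_window = (w * size) * (w * size - 1) // 2
    per_step = size * (size - 1) // 2 + (w - 1) * size * size
    return first_window + (b - w) * per_step

# Hypothetical data: n = 12 records spread evenly over b = 4 key values.
records = {f"r{i:02d}": {"key": "abcd"[i // 3]} for i in range(12)}
index_pairs = inverted_index_window_pairs(records, lambda rec: rec["key"], w=2)
formula_count = expected_comparisons(n=12, b=4, w=2)
```

For n = 12, b = 4 and w = 2 the first window contributes 15 pairs and each of the two remaining window positions adds 12 new pairs, 39 in total.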
