DMER - Unit#9 - Blocking Techniques For ER

Uploaded by

Amina Baig

Data Mining and Entity Resolution

— Unit 9 —
ER Techniques

Dr. Asif Sohail

University of the Punjab
Department of Information Technology, FCIT

NOTE: Some slides and/or pictures are adapted from the lecture slides and book
Data Matching: Concepts and Techniques for Record Linkage, Entity
Resolution, and Duplicate Detection by Peter Christen.
1
Blocking or Record Partitioning
• Blocking distributes the records into multiple partitions so that
potential duplicates, called “candidate record pairs”, land in the
same partition.
• Record pairs are then compared only among the records residing in
the same partition(s), which reduces the number of comparisons
significantly.
• The record partitions can be either disjoint or overlapping
(canopies) [Köp 10].
• With disjoint partitions, only records residing in the same
partition are thoroughly compared with each other, whereas
overlapping partitions also allow comparisons of records residing
in neighboring partitions.
• Pros and cons of disjoint and overlapping partitions?
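The reduction in comparisons can be illustrated with a small sketch. The toy records and the choice of surname as the partitioning field are assumptions made for illustration only; they do not come from the slides.

```python
from collections import defaultdict


def pairs(n: int) -> int:
    """Number of unordered record pairs among n records."""
    return n * (n - 1) // 2


def comparisons_with_blocking(records, blocking_key):
    """Count comparisons when only records in the same (disjoint) block are compared."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[blocking_key(rec)].append(rec)
    return sum(pairs(len(members)) for members in blocks.values())


# Hypothetical toy data: (record id, surname, city)
records = [
    (1, "smith", "lahore"),
    (2, "smith", "lahore"),
    (3, "khan", "karachi"),
    (4, "khan", "karachi"),
    (5, "malik", "multan"),
    (6, "smith", "multan"),
]

without_blocking = pairs(len(records))
with_blocking = comparisons_with_blocking(records, lambda r: r[1])
print(without_blocking, with_blocking)
```

Here every record falls into exactly one block, so this corresponds to the disjoint case; an overlapping scheme would let a record appear in several blocks.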

2
Blocking or Record Partitioning
• An inverted index (indexing) is used for record partitioning.
• It groups records whose selected field(s) have a similarity above a
certain threshold into the same partition (also called a bucket,
block, cluster, or pocket).
• The objectives, or desired outcomes, of blocking are:
  - Partition the records so that the number of record comparisons is
  reduced with very little loss in the number of duplicate records
  identified.
  - Potential duplicates, called “candidate record pairs”, should be
  placed in the same bucket. Potential duplicates placed in different
  buckets may lead to “False Negatives (FN)”. Similarly,
  non-duplicates placed in the same bucket may lead to “False
  Positives (FP)”.
  - Balance the number of buckets against the block sizes.
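A minimal sketch of how a blocking scheme can be checked against these objectives, assuming a small labelled sample is available. The records, the `true_matches` set, and the use of surname as the blocking key are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(records, blocking_key):
    """Return the set of record-id pairs that share a bucket."""
    blocks = defaultdict(list)
    for rec_id, rec in records.items():
        blocks[blocking_key(rec)].append(rec_id)
    result = set()
    for ids in blocks.values():
        result.update(combinations(sorted(ids), 2))
    return result


# Hypothetical records; 1, 2 and 3 refer to the same person.
records = {
    1: {"surname": "christen", "city": "canberra"},
    2: {"surname": "christen", "city": "canberra"},
    3: {"surname": "kristen", "city": "canberra"},  # typo variant
    4: {"surname": "christen", "city": "sydney"},   # a different person
}
true_matches = {(1, 2), (1, 3), (2, 3)}

candidates = candidate_pairs(records, lambda r: r["surname"])
missed = true_matches - candidates   # duplicates split across buckets
extra = candidates - true_matches    # non-duplicates sharing a bucket
print(sorted(missed), sorted(extra))
```

The pairs in `missed` can never be found by later comparison steps, while the pairs in `extra` only cost comparisons and become false positives if the comparison step misclassifies them.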

3
Blocking
• Groups of records are blocked together in different buckets on the
basis of blocking key(s).
• Comparisons of records within a block are made on the selected
fields, called linkage fields.
[Figure: records R1, R2, R3, ... are assigned via blocking keys K1, K2,
..., Kb to record blocks RB(K1), RB(K2), ..., RB(Kb); the figure also
carries a "# of comparisons" label.]

Slide 4
Selection of Blocking Key
• The selection of the blocking key (BK) is the most critical decision
in the ER process.
• The blocking key can be selected either manually by domain experts
or by a machine learning technique.
• Guidelines for selecting a blocking key:
  - High data quality, with the fewest errors.
  - Moderately restrictive: the blocking key should be neither overly
  restrictive (such as NIC, soc-sec-id, etc.) nor too indiscriminate
  (such as gender, marital status, etc.).
  - Completeness (95% or more).
  - Uniform distribution of the values.
  - Information value, with a positive correlation with duplicates.
• Multiple attributes may be combined into a blocking key.
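Some of these guidelines can be checked mechanically. The sketch below profiles an attribute for completeness, number of distinct values, and the share of records falling into the largest block. The function name `profile_attribute` and the sample records are assumptions for illustration.

```python
from collections import Counter


def profile_attribute(records, attribute):
    """Summarise how suitable an attribute looks as a blocking key."""
    values = [rec.get(attribute) for rec in records]
    present = [v for v in values if v not in (None, "")]
    counts = Counter(present)
    return {
        # fraction of records with a non-missing value
        "completeness": len(present) / len(values) if values else 0.0,
        # number of distinct values, i.e. number of blocks
        "distinct": len(counts),
        # share of non-missing records in the largest block
        "largest_block_share": max(counts.values()) / len(present) if present else 0.0,
    }


# Hypothetical sample records
records = [
    {"id": 1, "gender": "f", "postcode": "54000"},
    {"id": 2, "gender": "m", "postcode": "54000"},
    {"id": 3, "gender": "f", "postcode": ""},
    {"id": 4, "gender": "f", "postcode": "75500"},
]

for attr in ("id", "gender", "postcode"):
    print(attr, profile_attribute(records, attr))
```

In this toy sample `id` is overly restrictive (every record gets its own block), `gender` is too indiscriminate (one block holds 75% of the records), and `postcode` falls short of the 95% completeness guideline.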

5
Standard or Traditional Blocking
• Standard blocking, or simply blocking, is known to be the first
technique ever proposed, by Newcomb [15].
• Consider a dataset consisting of a set of records defined over a set
of attributes.
• A blocking key is obtained by applying a hashing function to an
attribute of an entity.
• The blocking key takes its values from a set of possible blocking
key values (BKVs). The dataset can be viewed as a 2-dimensional matrix
in which each entry holds the value of one record for one attribute.
The simplest version of blocking uses a single attribute as the
blocking key and puts all records with an identical blocking key value
into the same block.
• Pairs of records residing in the same block are called candidate
record pairs.
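Standard blocking can be sketched as building an inverted index from blocking key values to record identifiers. The hashing function used here (the first three letters of the surname) and the sample records are assumptions for illustration, not the definition used in the slides.

```python
from collections import defaultdict
from itertools import combinations


def blocking_key(record):
    """Hypothetical hashing function: first three letters of the surname."""
    return record["surname"][:3].lower()


def standard_blocking(records):
    """Build an inverted index: blocking key value -> list of record ids."""
    index = defaultdict(list)
    for rec_id, rec in records.items():
        index[blocking_key(rec)].append(rec_id)
    return dict(index)


def candidate_pairs(index):
    """Yield every pair of record ids that share a block."""
    for ids in index.values():
        yield from combinations(ids, 2)


records = {
    "r1": {"surname": "Miller"},
    "r2": {"surname": "Millar"},
    "r3": {"surname": "Myler"},
    "r4": {"surname": "Smith"},
}

index = standard_blocking(records)
pairs = sorted(candidate_pairs(index))
print(index)
print(pairs)
```

Only `r1` and `r2` become a candidate pair; `r3` has a different key value and is never compared with them.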

6
Standard Blocking

7
Limitations of Standard Blocking
• Errors or noise in the blocking key values prevent potential
duplicates from being placed in the same block, so those duplicates
are never compared.
• Many non-matching record pairs may be placed in the same block, which
increases the number of record comparisons.
• A non-uniform or Zipf distribution of blocking key values (BKVs)
results in blocks of varying size. Large blocks defeat the purpose of
blocking.
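The effect of a skewed distribution can be seen with simple arithmetic. The block sizes below are invented for illustration: 1,000 records split into 10 blocks, once evenly and once with a single dominant block.

```python
def pairs(n: int) -> int:
    """Number of unordered record pairs among n records."""
    return n * (n - 1) // 2


# Hypothetical block sizes for 1,000 records in 10 blocks
uniform_sizes = [100] * 10
skewed_sizes = [910] + [10] * 9

full_comparison = pairs(1000)
uniform_total = sum(pairs(s) for s in uniform_sizes)
skewed_total = sum(pairs(s) for s in skewed_sizes)

print(full_comparison, uniform_total, skewed_total)
```

With evenly sized blocks the number of comparisons drops to about a tenth of the full comparison, whereas the skewed layout still needs more than 80% of it, almost all of it inside the single large block.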

8
Blocking Variants

1. Q-gram based blocking

2. Suffix Array based blocking
3. Canopy Clustering

9
Q-gram Blocking
• It was proposed to counter errors or noise in the blocking key
values. It places records with similar blocking keys into multiple
blocks, so that such records end up together in at least one block.
• Blocking key values are transformed into lists of q-grams
(sub-strings of length q), and all combinations of sub-lists are then
created down to a certain length, determined by a threshold parameter
t. The resulting q-gram sub-lists are converted back into strings and
used as keys in an inverted index.
• Example: let the blocking key value be ‘peter’, with q = 2 (bigrams)
and a threshold value of t = 0.8. The 2-gram list for this value is
[‘pe’,‘et’,‘te’,‘er’] with four elements, and the threshold 0.8 gives
4 × 0.8 = 3.2, rounded to 3, which means all sub-list combinations of
length four and three are generated:
  - [‘pe’,‘et’,‘te’,‘er’], [‘et’,‘te’,‘er’], [‘pe’,‘te’,‘er’],
  [‘pe’,‘et’,‘er’], and [‘pe’,‘et’,‘te’]. This leads to five inverted
  index lists (blocks) with key values ‘peetteer’, ‘etteer’, ‘peteer’,
  ‘peeter’, and ‘peette’.

10
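The ‘peter’ example can be reproduced with a short sketch. The function names are hypothetical, and taking the minimum sub-list length as the floor of the list length times t (with a lower bound of 1) is an assumption; for the example it gives the same value, 3, as the rounding described in the slide.

```python
import math
from itertools import combinations


def qgrams(value, q=2):
    """Split a string into its overlapping q-grams."""
    return [value[i:i + q] for i in range(len(value) - q + 1)]


def qgram_blocking_keys(value, q=2, t=0.8):
    """Generate inverted-index keys from all q-gram sub-lists down to the minimum length."""
    grams = qgrams(value, q)
    if not grams:
        return [value]  # value shorter than q: use it unchanged (assumption)
    # Assumption: minimum sub-list length is floor(len * t), at least 1.
    min_len = max(1, math.floor(len(grams) * t))
    keys = []
    for length in range(len(grams), min_len - 1, -1):
        # combinations() keeps the q-grams in their original order
        for sub_list in combinations(grams, length):
            keys.append("".join(sub_list))
    return keys


keys = qgram_blocking_keys("peter", q=2, t=0.8)
print(keys)
```

Each record is inserted into the inverted index under every key produced for its blocking key value, which is why the number of blocks per record grows quickly for long values or low thresholds.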
Q-gram Blocking

11
Suffix Array Blocking
• The basic idea is to insert the BKVs and their suffixes into a
suffix-array-based inverted index [Aiz 05].
• Only suffixes down to a minimum length are inserted into the suffix
array.
• For example, for the BKV ‘christen’ and a minimum suffix length of 5,
the values ‘christen’, ‘hristen’, ‘risten’ and ‘isten’ are generated.
• To limit the maximum size of blocks, a second parameter sets the
maximum number of record identifiers allowed in a block.
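A minimal sketch of the idea, using a plain dictionary in place of a real suffix array. The parameter names `min_len` and `max_block_size` are hypothetical, and two behaviours are assumptions rather than statements from the slides: values shorter than the minimum length are inserted unchanged, and blocks exceeding the maximum size are dropped.

```python
from collections import defaultdict


def suffixes(bkv, min_len):
    """Return the BKV and all of its suffixes down to min_len characters."""
    if len(bkv) <= min_len:
        return [bkv]  # assumption: short values are inserted as they are
    return [bkv[i:] for i in range(len(bkv) - min_len + 1)]


def suffix_array_blocking(bkvs, min_len=5, max_block_size=3):
    """Map each suffix to the record ids whose BKV has that suffix."""
    index = defaultdict(list)
    for rec_id, bkv in bkvs.items():
        for suffix in suffixes(bkv, min_len):
            index[suffix].append(rec_id)
    # Assumption: blocks larger than max_block_size are discarded.
    kept = {s: ids for s, ids in index.items() if len(ids) <= max_block_size}
    return dict(sorted(kept.items()))  # sorted by suffix, as in a suffix array


# Hypothetical blocking key values
bkvs = {"r1": "christen", "r2": "kristen", "r3": "christian"}
blocks = suffix_array_blocking(bkvs, min_len=5, max_block_size=2)
print(suffixes("christen", 5))
print(blocks["risten"], blocks["isten"])
```

The variants ‘christen’ and ‘kristen’ differ in their first characters, so standard blocking would separate them, but they share the suffixes ‘risten’ and ‘isten’ and therefore meet in two blocks.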

12
Suffix Array Blocking

13
2. Windowing: Sorted array

• The records are sorted and a fixed-size window of size w > 1 is slid
over them.
• Comparisons are made among all the records falling within the same
window.
• Number of comparisons: $\frac{w(w-1)}{2} + (n-w)(w-1) = O(wn)$
[Figure: records R1, R2, ..., Rn are sorted into SR1, SR2, ..., SRn.]
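The window comparison count can be checked with a short sketch. The function name and the sample values are assumptions; the records are plain strings sorted on their own value.

```python
def sorted_neighbourhood_pairs(records, sort_key, w):
    """Pair each record with the next w - 1 records in sorted order."""
    ordered = sorted(records, key=sort_key)
    pairs = []
    for i in range(len(ordered)):
        # records i+1 .. i+w-1 share at least one window with record i
        for j in range(i + 1, min(i + w, len(ordered))):
            pairs.append((ordered[i], ordered[j]))
    return pairs


# Hypothetical sorting key values
records = ["delta", "alpha", "charlie", "bravo", "foxtrot", "echo"]
n, w = len(records), 3

pairs = sorted_neighbourhood_pairs(records, lambda r: r, w)
expected = w * (w - 1) // 2 + (n - w) * (w - 1)
print(len(pairs), expected)
```

The first window contributes w(w-1)/2 pairs, and each of the n-w later window positions adds w-1 new pairs for the record that just entered the window, which is where the formula comes from.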

Slide 14
2. Windowing: Sorted inverted index
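• Records that have a common sorting key value are blocked together.
• A window is then slid across the blocks of records, and comparisons
are made among the records of the blocks that fall within the same
window.
• Number of comparisons, for n records spread evenly over b blocks and
a window of w blocks:
$\frac{1}{2}\cdot\frac{wn}{b}\left(\frac{wn}{b}-1\right) + (b-w)\left[\frac{1}{2}\cdot\frac{n}{b}\left(\frac{n}{b}-1\right) + (w-1)\frac{n^2}{b^2}\right]$
[Figure: records R1, ..., Rn are indexed by sorting key values V1, V2,
..., Vb into record blocks RB(V1), RB(V2), ..., RB(Vb).]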
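A sketch of windowing over a sorted inverted index, under the assumption that the window size w counts blocks (distinct sorting key values) rather than individual records. The function name and the sample data are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def sorted_index_window_pairs(records, sort_key, w):
    """Slide a window of w blocks over the sorted key values and pair all residents."""
    index = defaultdict(list)
    for rec_id, rec in records.items():
        index[sort_key(rec)].append(rec_id)
    keys = sorted(index)
    pairs = set()  # a set, so pairs seen in overlapping windows count once
    for start in range(max(1, len(keys) - w + 1)):
        window_ids = sorted(rid for k in keys[start:start + w] for rid in index[k])
        pairs.update(combinations(window_ids, 2))
    return pairs


# Hypothetical data: 8 records, 4 sorting key values, 2 records per value
records = {1: "a", 2: "a", 3: "b", 4: "b", 5: "c", 6: "c", 7: "d", 8: "d"}
n, b, w = 8, 4, 2

pairs = sorted_index_window_pairs(records, lambda value: value, w)

m = n // b  # records per block
expected = (w * m) * (w * m - 1) // 2 + (b - w) * (m * (m - 1) // 2 + (w - 1) * m * m)
print(len(pairs), expected)
```

The first window holds wn/b records and is compared exhaustively; each of the b-w later positions adds one new block, whose records are compared with each other and with the records of the w-1 blocks still inside the window.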

Slide 15
