DMER - Unit#8 - Introduction To ER
— Unit 8 —
Introduction to ER
NOTE: Some slides and/or pictures are adapted from the lecture slides and the book Data Matching: Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection by Peter Christen.
Entity Resolution
In the modern era of computing, the presence of information about the same concepts or entities in different data repositories is quite common.
Examples: digital libraries, social websites, customer service centers, business enterprises, health care systems, government agencies, etc.
In order to get consolidated and comprehensive information, multiple data sources are linked and integrated together [1].
Without ER, the integrated data will be quite messy, containing many duplicates. The duplicates can either be “exact duplicates” or, more commonly, “near duplicates” [6].
A major challenge in data integration is the identification of records referring to the same entity in different data sources, known as entity resolution (ER), also called record linkage, record matching [2], or duplicate detection [4].
Challenges in Entity Resolution
1. Absence of a Unique Identifier
ER would be a trivial task if a unique identifier were available across the different data sources to be linked together.
In this situation, deterministic linkage is performed, which is simply a matter of a trivial join operation, as sketched below.
Unfortunately, the availability of a unique identifier across different data sources is rare in the real world, as different systems have been developed and evolved independently. Also, the use of a unique personal identifier, such as the SSN, to link personal data is not allowed in many countries because of privacy protection legislation [22].
In the absence of a unique identifier, linkage is performed using a set of attributes available in the data sources, called the linkage key [23].
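A minimal sketch of deterministic linkage in Python, assuming both sources share a hypothetical unique key (ssn); the data, column names, and the use of pandas are illustrative, not from the slides:

import pandas as pd

# Two hypothetical sources that happen to share a unique identifier.
patients = pd.DataFrame({"ssn": ["111", "222"], "name": ["Sohail", "Amna"]})
visits = pd.DataFrame({"ssn": ["222", "333"], "hospital": ["City", "General"]})

# Deterministic linkage: a plain equi-join on the shared identifier.
linked = patients.merge(visits, on="ssn", how="inner")
print(linked)  # only ssn 222 links across the two sources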
Challenges in Entity Resolution
2. Data Heterogeneity
Data heterogeneity refers to the non-conformance of data representations across the different data sources to be integrated. It is of three types: structural heterogeneity, lexical heterogeneity, and semantic heterogeneity [4].
Structural heterogeneity occurs when the attributes of a relational schema have different representations and meanings in different sources.
For example, in one source an address might be stored in a single attribute, whereas in another it might be split across multiple attributes such as House#, Street, City, etc. Similarly, different attribute names, say gender and sex, might have been used to record the same concept.
Challenges in Entity Resolution
2. Data Heterogeneity
Lexical heterogeneity occurs when the same attribute across different sources uses different representations to record the same value. It can happen due to spelling variations, the use of abbreviations in some data sources, or the use of different conventions across the data sources. For two data sources, example data value pairs showing lexical heterogeneity are (Sohail, Suhail), (CS, Computer Science), and (Fourth Division, 4th Division).
Semantic heterogeneity occurs when different values are used to record the same fact: for example, different codes for representing gender, such as 0/1 or M/F, different values recording the same fact, e.g., ‘B grade’ or ‘First division’, or different units for temperature, distance, etc. A common remedy is to map all source-specific codes to one canonical vocabulary, as sketched below.
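A minimal sketch, in Python, of mapping source-specific codes to one canonical value; the lookup table and the choice 0 = F, 1 = M are hypothetical:

# Hypothetical lookup table: every source-specific code maps to M or F.
GENDER_MAP = {"0": "F", "1": "M", "m": "M", "f": "F",
              "male": "M", "female": "F"}

def canonical_gender(raw_value):
    # Normalise the incoming code, then translate it; None if unknown.
    return GENDER_MAP.get(str(raw_value).strip().lower())

print(canonical_gender(1), canonical_gender("F"), canonical_gender("male"))
# -> M F M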
Challenges in Entity Resolution
3. Data Noise
Data noise or data dirtiness means wrong or invalid values in a data source. It arises from data heterogeneity, typographical errors, missing values, outdated values, use of meaningless or default values, use of synonyms, evolution of entities, duplicate records, and so on.
It is reported in [24] that typographical errors in the Name field can occur in 20-30% of the records. The entities of an information system evolve with time, and because of this the data about an entity become outdated with the passage of time. For example, people migrate from one place to another, get married, or even die without these changes being recorded in all the systems where their data are stored. Details about the manifestation of data noise and the reasons contributing to data dirtiness can be found in [25], [26].
Challenges in Entity Resolution
4. Data Size
The size of data sources is normally very large these days. This makes it prohibitively expensive to carry out the pairwise, or quadratic, number of record comparisons for ER.
Consider a dataset consisting of n records. A simple/naive approach for duplicate detection is to compare each record with every other record in the dataset. This approach requires n(n-1)/2 comparisons, which is too high even for moderate-size datasets.
It is highly impractical and inefficient to make these massive numbers of comparisons, the majority of which are already known to be non-matches owing to “Match Rarity” [27].
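Worked example: for n = 1,000,000 records, the naive approach needs n(n-1)/2 ≈ 5 × 10^11 comparisons; even at one million comparisons per second, that is roughly 5.8 days of computation.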
Challenges in Entity Resolution
4. Data Size
For record linkage, consider two datasets consisting of m and n non-duplicate records respectively. A brute-force approach will make pairwise record comparisons between the two datasets, resulting in m × n comparisons, which is again too high even for moderate-size datasets.
Furthermore, the majority of these quadratic record comparisons are for non-duplicates owing to “Match Rarity” [27], and hence useless: since each dataset is duplicate-free, the maximum possible number of true matches is min(m, n), which is much smaller than the m × n record comparisons.
ER Approaches
Naive Approach: Quadratic comparisons
Systematic Approaches: reduce the record comparison space, cutting the number of comparisons significantly while still aiming to identify more than 95% of the matches.
1. Blocking Method
2. Windowing or Sorted Neighborhood Method (SNM)
ER Process
Dataset(s) → Data Pre-Processing → Record Pairs Reduction (Blocking) → Record Pairs Comparisons → Record Pairs Classification → Evaluation
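A minimal end-to-end sketch of this pipeline in Python; the toy records, the blocking key, and the crude similarity rule are all illustrative assumptions, not the slides' method, and step 5 (evaluation) is omitted since it would need known true matches:

from itertools import combinations

# Toy records: (id, name, city).
RECORDS = [(1, "Sohail Ahmed ", "Lahore"),
           (2, "suhail ahmed", "lahore"),
           (3, "Amna Khan", "Karachi")]

def preprocess(rec):
    # Step 1: data pre-processing - trim and lowercase string fields.
    rid, name, city = rec
    return (rid, name.strip().lower(), city.strip().lower())

def blocking_key(rec):
    # Step 2: record pairs reduction - block on first letter + city.
    return rec[1][:1] + "|" + rec[2]

def compare(a, b):
    # Step 3: record pairs comparison - a crude name similarity.
    if a[1] == b[1]:
        return 1.0
    return 0.5 if a[1].split()[-1] == b[1].split()[-1] else 0.0

def classify(sim, threshold=0.4):
    # Step 4: record pairs classification - single-threshold rule.
    return "match" if sim >= threshold else "non-match"

records = [preprocess(r) for r in RECORDS]
blocks = {}
for r in records:
    blocks.setdefault(blocking_key(r), []).append(r)
for block in blocks.values():
    for a, b in combinations(block, 2):  # compare inside blocks only
        print(a[0], b[0], classify(compare(a, b)))  # prints: 1 2 match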
ER Process
2. Record Pairs Reduction (Blocking)
Record pairs reduction avoids the pairwise, or quadratic, number of record comparisons.
Blocking gathers the potential duplicates in the same block using a blocking key or indexing key. All the records with an identical blocking key value are placed in the same block. Afterwards, the detailed comparisons are made only among the records residing in the same block, thereby decreasing the number of record comparisons significantly (see the sketch below).
Let a dataset of size n be partitioned into b blocks, with each block containing nearly n/b records; then the total number of record comparisons using blocking is reduced from n(n-1)/2 to approximately n(n-1)/2b, since each block contributes only (n/b)((n/b) - 1)/2 comparisons.
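A minimal sketch of standard blocking in Python; the blocking key (first three letters of the surname) and the records are illustrative choices:

def blocking_key(record):
    # Hypothetical blocking key: first three letters of the surname.
    return record["surname"][:3].lower()

records = [{"surname": s} for s in
           ["Christen", "Christensen", "Smith", "Smyth", "Khan"]]
blocks = {}
for rec in records:
    blocks.setdefault(blocking_key(rec), []).append(rec)

# Count the comparisons inside blocks versus the naive quadratic count.
blocked = sum(len(b) * (len(b) - 1) // 2 for b in blocks.values())
naive = len(records) * (len(records) - 1) // 2
print(sorted(blocks), blocked, "vs naive", naive)  # 1 comparison vs 10

Note that blocking can be lossy: Smith and Smyth land in different blocks here and would never be compared, which is why blocking keys must be chosen carefully.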
ER Process
3. Record Pairs Comparisons
A record pair is compared using a set of attributes called the linkage key. Field comparison functions are applied to each attribute of the linkage key to compute the similarity between the corresponding attribute values of the record pair.
The similarity between two records a and b can be represented using a comparison vector γ = (γ1, …, γk), where γi shows the similarity between records a and b on the i-th attribute:
γi = 0 means complete disagreement on attribute i,
γi = 1 means complete agreement on attribute i, and
k is the total number of attributes used in the comparison vector.
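A minimal sketch of building a comparison vector in Python; difflib's ratio stands in here for the string comparators the record linkage literature typically uses (edit distance, Jaro-Winkler, etc.), and the linkage key is illustrative:

from difflib import SequenceMatcher

def sim(a, b):
    # Normalised string similarity in [0, 1]: 0 = disagreement, 1 = agreement.
    return SequenceMatcher(None, a, b).ratio()

def comparison_vector(rec_a, rec_b, linkage_key):
    # One similarity value per linkage-key attribute, i.e. (γ1, ..., γk).
    return [sim(rec_a[f], rec_b[f]) for f in linkage_key]

a = {"name": "Sohail", "city": "Lahore"}
b = {"name": "Suhail", "city": "Lahore"}
print(comparison_vector(a, b, ["name", "city"]))  # ≈ [0.83, 1.0]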
ER Process
4. Record Pairs Classification
Using the comparison vectors generated in the previous step, the record pairs are classified into three possible classes: “Matches”, “Non-Matches”, or “Possible Matches”.
The simplest way of classifying a record pair is threshold-based: the similarities in the comparison vector are summed into a matching score S = γ1 + … + γk, and two thresholds t_lower < t_upper divide the pairs:
Match: S ≥ t_upper
Non-Match: S ≤ t_lower
Possible Match: t_lower < S < t_upper (set aside for clerical review)
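A minimal sketch of this two-threshold rule in Python; the threshold values 1.2 and 1.8 are arbitrary illustrative settings:

def classify(gamma, t_lower=1.2, t_upper=1.8):
    # Sum the per-attribute similarities into one matching score.
    score = sum(gamma)
    if score >= t_upper:
        return "Match"
    if score <= t_lower:
        return "Non-Match"
    return "Possible Match"  # set aside for clerical review

print(classify([0.83, 1.0]))   # score 1.83 -> Match
print(classify([0.40, 0.30]))  # score 0.70 -> Non-Match
print(classify([0.80, 0.70]))  # score 1.50 -> Possible Match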
ER Process
5. Evaluation
Each classified record pair falls into one of four outcomes with respect to the set of true duplicates:
True Positives (TP): true duplicates classified as matches
False Positives (FP): non-duplicates classified as matches
False Negatives (FN): true duplicates classified as non-matches
True Negatives (TN): non-duplicates classified as non-matches
Evaluation Measures
4. F-score: the harmonic mean of precision and recall, F = 2 × Precision × Recall / (Precision + Recall). An analogous F-score can be computed for the blocking step, e.g., from the pairs completeness and pairs quality of the candidate record pairs.
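A minimal sketch computing precision, recall, and F-score in Python from the four outcome counts; the counts themselves are illustrative:

def precision(tp, fp):
    # Share of predicted matches that are true duplicates.
    return tp / (tp + fp)

def recall(tp, fn):
    # Share of true duplicates that were found.
    return tp / (tp + fn)

def f_score(p, r):
    # Harmonic mean of precision and recall.
    return 2 * p * r / (p + r)

p, r = precision(tp=80, fp=20), recall(tp=80, fn=40)
print(round(p, 2), round(r, 2), round(f_score(p, r), 2))  # 0.8 0.67 0.73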
1. Blocking Method
The groups of records are blocked together in different buckets on the basis of blocking key(s): records R1, …, Rn are assigned to the record blocks RB(K1), …, RB(Kb) according to their blocking key values K1, …, Kb.
The comparisons of records within a block are made against the selected fields, called linkage fields.
# of comparisons = n(n-1)/2b
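Worked example: for n = 100,000 records split evenly into b = 100 blocks, the comparisons drop from n(n-1)/2 ≈ 5 × 10^9 to n(n-1)/2b ≈ 5 × 10^7, a hundred-fold reduction.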
2. Windowing: Sorted array
The records R1, …, Rn are sorted on a sorting key into SR1, …, SRn, and a fixed-size window of size w > 1 is slid over them. The comparisons are made among all the records falling within the same window (see the sketch below).
# of comparisons = w(w-1)/2 + (n-w)(w-1) = O(wn)
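A minimal sketch of the sorted neighborhood method in Python; the sorting key (plain lowercase value) and window size w = 3 are illustrative choices:

def snm_pairs(records, key, w=3):
    # Sort once on the sorting key, then slide a window of size w.
    srecs = sorted(records, key=key)
    pairs = []
    for i in range(len(srecs)):
        # The record entering the window meets the w-1 records before it.
        for prev in srecs[max(0, i - w + 1):i]:
            pairs.append((prev, srecs[i]))
    return pairs

names = ["Smyth", "Smith", "Christen", "Khan", "Christensen"]
print(snm_pairs(names, key=str.lower, w=3))
# 7 pairs, matching w(w-1)/2 + (n-w)(w-1) = 3 + 4 for n = 5, w = 3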
2. Windowing: Sorted inverted index
The records that have a common sorting key value are blocked together: records R1, …, Rn are assigned to the record blocks RB(V1), …, RB(Vb) according to their sorting key values V1, …, Vb. Then a window of w blocks is slid across the blocks and the comparisons are made among the records residing in the blocks of the same window.
# of comparisons = (wn/b)(wn/b - 1)/2 + (b - w)[(n/b)((n/b) - 1)/2 + (w-1)n²/b²]
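Worked example: for n = 10,000 records in b = 100 blocks (n/b = 100 records each) and a window of w = 2 blocks, the first window of wn/b = 200 records costs 200 × 199 / 2 = 19,900 comparisons, and each of the remaining b - w = 98 window positions adds 100 × 99 / 2 + 1 × 100² = 14,950, giving 19,900 + 98 × 14,950 = 1,485,000 comparisons in total, versus n(n-1)/2 ≈ 5 × 10^7 for the naive approach.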