
Data Mining and Entity Resolution

— Unit 8 —
Introduction to ER

Dr. Asif Sohail


University of the Punjab
Department of Information Technology, FCIT

NOTE: Some slides and/or pictures are adapted from lecture slides and the book Data Matching: Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection by Peter Christen.
Entity Resolution
 In the modern era of computing, information about the same concept or entity is commonly present in multiple data repositories.
 Examples: digital libraries, social websites, customer service centers, business enterprises, health care systems, government agencies, etc.
 In order to get consolidated and comprehensive information, multiple data sources are linked and integrated together [1]. Without ER, the integrated data will be quite messy, containing quite a few duplicates. The duplicates can either be "exact duplicates" or, more commonly, "near duplicates" [6].
 A major challenge in data integration is the identification of records referring to the same entity in different data sources, known as entity resolution, also called record linkage, record matching [2], or duplicate detection [4].
Challenges in Entity Resolution
1. Absence of unique identifier
2. Data heterogeneity
3. Data noise
4. Data size
Challenges in Entity Resolution
1. Absence of unique identifier
 ER would be a trivial task if some unique identifier were available across the different data sources to be linked together.
 In that situation, deterministic linkage is performed, which is simply a matter of a trivial join operation (see the join sketch after this slide).
 Unfortunately, the availability of a unique identifier across different data sources is rare in the real world, as different systems have been developed and have evolved independently. Also, the use of a unique personal identifier, such as the SSN, to link personal data is not allowed in many countries because of privacy protection legislation [22].
 In the absence of a unique identifier, linkage is performed using a set of attributes available in the data sources, called the linkage key [23].
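A minimal Python sketch of deterministic linkage, assuming a shared SSN-like identifier; the field names and values are illustrative, not taken from any real source.

# Deterministic linkage: with a shared unique identifier, linkage
# reduces to a plain join on that key. All names/values are made up.
source_a = [{"ssn": "111-22-3333", "name": "A. Khan"},
            {"ssn": "444-55-6666", "name": "B. Ali"}]
source_b = [{"ssn": "111-22-3333", "city": "Lahore"}]

index_b = {rec["ssn"]: rec for rec in source_b}   # hash index on the key
linked = [{**ra, **index_b[ra["ssn"]]}            # merge matching records
          for ra in source_a if ra["ssn"] in index_b]
print(linked)  # [{'ssn': '111-22-3333', 'name': 'A. Khan', 'city': 'Lahore'}]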
Challenges in Entity Resolution
2. Data Heterogeneity
 Data heterogeneity refers to the non-conformance of data representations across the different data sources to be integrated. It is of three types: structural heterogeneity, lexical heterogeneity, and semantic heterogeneity [4].
 Structural heterogeneity occurs when the attributes of a relational schema have different representations and meanings in different sources.
 For example, in one source an address might be saved in a single attribute, whereas in another it might be saved using multiple attributes such as House#, Street, City, etc. Similarly, different attribute names, say gender and sex, might have been used to record the same concept. A mapping sketch follows below.
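As a small illustration of resolving such structural heterogeneity, the sketch below maps a single address attribute onto (House#, Street, City) fields; the comma-based split rule is an assumption for demonstration only.

def split_address(address: str) -> dict:
    # Naive assumption: components are comma-separated in this order.
    parts = [p.strip() for p in address.split(",")]
    house_no, street, city = (parts + ["", "", ""])[:3]
    return {"house_no": house_no, "street": street, "city": city}

print(split_address("12, Mall Road, Lahore"))
# {'house_no': '12', 'street': 'Mall Road', 'city': 'Lahore'}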
Challenges in Entity Resolution
2. Data Heterogeneity
 Lexical heterogeneity occurs when the same attribute across different sources uses different representations for recording the same value. It can happen due to spelling variations, the use of abbreviations in some data sources, or the use of different conventions across the data sources. For two data sources, example value pairs showcasing lexical heterogeneity are (Sohail, Suhail), (CS, Computer Science), and (Fourth Division, 4th Division).
 Semantic heterogeneity occurs when different values are used to record the same fact. Examples include different codes for representing gender, such as 0/1 or M/F, different values recording the same fact, e.g., 'B grade' or 'First division', and different units for temperature, distance, etc. A normalization sketch follows below.
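A small sketch of normalizing such lexically and semantically heterogeneous values to one canonical form before comparison; the mapping tables are illustrative assumptions, not standard code lists.

GENDER_MAP = {"0": "F", "1": "M", "f": "F", "m": "M",
              "female": "F", "male": "M"}
ABBREV_MAP = {"cs": "computer science", "4th": "fourth"}

def normalize(value: str) -> str:
    value = value.strip().lower()
    value = GENDER_MAP.get(value, value)        # semantic codes -> canonical
    return " ".join(ABBREV_MAP.get(t, t)        # abbreviations -> full form
                    for t in value.split())

print(normalize("CS"))            # computer science
print(normalize("4th Division"))  # fourth division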
Challenges in Entity Resolution
3. Data Noise
 Data noise, or data dirtiness, means wrong or invalid values in a data source. It arises from data heterogeneity, typographical errors, missing values, outdated values, the use of meaningless or default values, the use of synonyms, the evolution of entities, duplicate records, and so on.
 It is reported in [24] that typographical errors in the Name field can occur in 20-30% of the records. The entities of an information system evolve with time, and due to this the data about an entity becomes outdated with the passage of time. For example, people migrate from one place to another, get married, or even die without these changes being recorded in all the systems where their data are stored. Details about the manifestation of data noise and the reasons contributing to data dirtiness can be found in [25], [26].
Challenges in Entity Resolution
4. Data Size
 The size of data sources is normally very large these days. This makes it prohibitively expensive to carry out the pairwise, i.e., quadratic, number of record comparisons for ER.
 Consider a dataset consisting of n records. A simple/naive approach to duplicate detection is to compare each record with every other record in the dataset. This approach requires n(n-1)/2 comparisons, which is too high even for moderate-size datasets; see the worked figures below.
 It is highly impractical and inefficient to make this massive number of comparisons, the majority of which are already known to be non-matches owing to "Match Rarity" [27].
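A minimal worked check of how quickly n(n-1)/2 grows; the dataset sizes are arbitrary examples.

# Growth of the naive comparison count n(n-1)/2.
for n in (1_000, 100_000, 10_000_000):
    print(f"n = {n:>12,}: {n * (n - 1) // 2:,} comparisons")
# n =        1,000: 499,500 comparisons
# n =      100,000: 4,999,950,000 comparisons
# n =   10,000,000: 49,999,995,000,000 comparisons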
Challenges in Entity Resolution
4. Data Size
 The size of data sources is normally very large these days. This makes it prohibitively expensive to carry out the pairwise, i.e., quadratic, number of record comparisons for ER.
 Consider two datasets consisting of m and n non-duplicate records respectively. A brute-force approach to record linkage will make pairwise record comparisons, resulting in m × n comparisons, which is too high even for moderate-size datasets.
 Furthermore, the majority of these record comparisons are for non-duplicates owing to "Match Rarity" [27], and hence useless. The maximum possible number of duplicates, min(m, n), is much smaller than the number of record comparisons.
ER Approaches
Naive Approach: quadratic comparisons
Systematic Approaches:
 Reduce the number of comparisons significantly, while still aiming to identify more than 95% of the matches, i.e., reduction of the record-comparison space.
1. Blocking Method
2. Windowing or Sorted Neighborhood Method (SNM)
ER Process
[Flowchart: Dataset(s) → Data Pre-Processing → Record Pairs Reduction (Blocking) → Record Pairs Comparisons → Record Pairs Classification → Non-Matches / Matches / Possible Matches → Evaluation]
ER Process
[Flowchart: Dataset → Record Pairs Reduction using Blocking / SNM → Compare Linking Variables using Comparison Functions → Record Pairs Classification → Sim >= UT: Matches; LT <= Sim < UT: Possible Matches; Sim < LT: Non-Matches]
ER Process
1. Data Pre-processing
 Data pre-processing involves schema resolution and data standardization.
 Schema resolution discovers the potential associations and mappings between the schemas of the datasets to be integrated [7].
 Consider two datasets A and B defined over sets of attributes {a1, ..., ap} and {b1, ..., bq} respectively. A mapping function f maps an attribute ai of A to a set S of attributes of B, i.e., f : ai → S. The set S may be a singleton or may consist of more than one attribute.
 Data standardization ensures a uniform representation of data. It includes, for example (see the sketch after this list):
 Name and address transformation
 Removal of stop words (if required)
 Stemming and lemmatization
 Validation and correction of data values, data type conversion
 Renaming of attributes, use of common codes, domain constraint checking, dependency checking [29], [30]
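A minimal standardization sketch covering a few of the listed steps: case folding, punctuation removal, stop-word removal, and a small transformation table. The stop-word and transformation lists are illustrative assumptions.

import re

STOP_WORDS = {"mr", "mrs", "dr"}             # assumed title words to drop
TRANSFORMS = {"rd": "road", "st": "street"}  # assumed abbreviation table

def standardize(text: str) -> str:
    text = re.sub(r"[^\w\s]", " ", text.lower())  # case fold, strip punctuation
    tokens = [TRANSFORMS.get(t, t) for t in text.split()
              if t not in STOP_WORDS]
    return " ".join(tokens)

print(standardize("Dr. Asif Sohail, 12 Mall Rd."))
# asif sohail 12 mall road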
ER Process
2. Record Pairs Reduction (Blocking)
 Record pairs reduction avoids the pairwise, i.e., quadratic, number of record comparisons.
 Blocking gathers the potential duplicates in the same block using a blocking key, also called an indexing key.
 All records with identical blocking key values are placed in the same block. Afterwards, detailed comparisons are made only among the records residing in the same block, thereby decreasing the number of record comparisons significantly. A minimal sketch follows below.
 Let a dataset of size n be partitioned into b blocks, each containing nearly n/b records; the total number of record comparisons using blocking is then reduced from n(n-1)/2 to roughly b × (n/b)((n/b) - 1)/2 ≈ n(n-1)/2b.
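A minimal blocking sketch, assuming a blocking key of the first surname letter plus postcode (an illustrative choice, not a prescribed one):

from collections import defaultdict
from itertools import combinations

records = [
    {"id": 1, "surname": "Sohail", "postcode": "54000"},
    {"id": 2, "surname": "Suhail", "postcode": "54000"},
    {"id": 3, "surname": "Baig",   "postcode": "46000"},
]

blocks = defaultdict(list)
for rec in records:
    # Blocking key value: first letter of surname + postcode (assumed).
    blocks[rec["surname"][0].upper() + rec["postcode"]].append(rec)

# Detailed comparisons happen only within a block.
candidate_pairs = [pair for block in blocks.values()
                   for pair in combinations(block, 2)]
print(len(candidate_pairs))  # 1 candidate pair instead of 3 without blocking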
ER Process
3. Record Pairs Comparisons
 A record pair is compared using a set of attributes called the linkage key. Field comparison functions are applied to each attribute of the linkage key to compute the similarity between the corresponding attributes of a record pair.
 The similarity between two records can be represented using a comparison vector γ = (γ1, ..., γk), where γi shows the similarity between the two records on the i-th attribute:
 γi = 0 means complete disagreement on attribute i,
 γi = 1 means complete agreement on attribute i, and
 k is the total number of attributes used in the comparison vector.
A sketch of building such a vector follows below.
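A sketch of building such a comparison vector, using difflib's ratio from the Python standard library as a stand-in for proper field comparison functions such as Jaro-Winkler; the records and linkage key are illustrative.

from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    # Approximate string similarity in [0, 1]; 1 = complete agreement.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

LINKAGE_KEY = ["name", "city"]  # assumed linkage-key attributes
r1 = {"name": "Asif Sohail", "city": "Lahore"}
r2 = {"name": "Asif Suhail", "city": "Lahore"}

gamma = [similarity(r1[attr], r2[attr]) for attr in LINKAGE_KEY]
print(gamma)  # one similarity per attribute, e.g. [0.909..., 1.0]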
ER Process
4. Record Pairs Classification
 Using the comparison vectors generated in the previous step, the record pairs are classified into three possible classes: "Matches", "Non-Matches", or "Possible Matches".
 The simplest way of classifying a record pair is threshold-based classification. Let F be the combined similarity score obtained from the comparison vector of a record pair, LT the lower threshold, and UT the upper threshold. Using threshold-based classification, the record pair is classified as:
 Match: F >= UT
 Non-Match: F < LT
 Possible Match: LT <= F < UT
A classification sketch follows below.
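A threshold-based classification sketch; the combined score F is taken here as the mean of the comparison vector, and the threshold values are illustrative assumptions.

LT, UT = 0.5, 0.85  # assumed lower/upper thresholds

def classify(gamma):
    F = sum(gamma) / len(gamma)  # combined similarity score (assumed: mean)
    if F >= UT:
        return "Match"
    if F < LT:
        return "Non-Match"
    return "Possible Match"

print(classify([0.91, 1.0]))   # Match
print(classify([0.60, 0.70]))  # Possible Match
print(classify([0.10, 0.20]))  # Non-Match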
ER Process
5. Evaluation
[Diagram: among all record pairs, the set of true duplicates and the set of declared duplicates overlap. Pairs in both sets are True Positives (TP); true duplicates not declared are False Negatives (FN); declared duplicates that are not true duplicates are False Positives (FP); all remaining pairs are True Negatives (TN).]
ER Process
5. Evaluation

                        Classified As
Actual                  Match (M*)              Non-match (U*)
Match (M)               True matches:           False non-matches:
                        True Positives (TP)     False Negatives (FN)
Non-match (U)           False matches:          True non-matches:
                        False Positives (FP)    True Negatives (TN)

 M = TP + FN
 U = TN + FP

A sketch of deriving these counts follows below.
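A sketch of deriving the confusion-matrix counts from sets of record-id pairs; the pairs are illustrative data.

# Pairs are modelled as frozensets of record ids (illustrative data).
all_pairs = {frozenset(p) for p in [(1, 2), (1, 3), (2, 3), (3, 4)]}
true_dups = {frozenset(p) for p in [(1, 2), (3, 4)]}
declared  = {frozenset(p) for p in [(1, 2), (2, 3)]}

TP = len(declared & true_dups)              # declared and truly duplicates
FP = len(declared - true_dups)              # declared but not duplicates
FN = len(true_dups - declared)              # missed duplicates
TN = len(all_pairs - declared - true_dups)  # correctly left unlinked
print(TP, FP, FN, TN)  # 1 1 1 1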
Evaluation Measures
1. Reduction Ratio (RR) = 1 - (candidate record pairs after blocking) / (total record pairs)
2. Pairs Completeness (PC) or Recall = (true matches among candidate pairs) / (total true matches)
3. Pairs Quality (PQ) or Precision = (true matches among candidate pairs) / (candidate pairs)
4. F-score = 2 × PC × PQ / (PC + PQ)
A sketch computing these measures follows below.
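A sketch computing the four measures from such counts; the figures plugged in are illustrative.

def evaluate(candidates, total_pairs, true_matches_found, total_true_matches):
    rr = 1 - candidates / total_pairs             # Reduction Ratio
    pc = true_matches_found / total_true_matches  # Pairs Completeness (recall)
    pq = true_matches_found / candidates          # Pairs Quality (precision)
    f = 2 * pc * pq / (pc + pq)                   # F-score
    return rr, pc, pq, f

print(evaluate(candidates=5_000, total_pairs=499_500,
               true_matches_found=90, total_true_matches=100))
# (0.98999..., 0.9, 0.018, 0.0352...)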
1. Blocking Method
[Diagram: records R1 ... Rn are mapped by their blocking key values K1 ... Kb into blocks RB(K1) ... RB(Kb).]
Groups of records are blocked together in different buckets on the basis of the blocking key(s). The comparisons of records within a block are made against the selected fields, called linkage fields.
# of comparisons = n(n-1)/2b
2. Windowing: Sorted Array
[Diagram: records R1 ... Rn are sorted into SR1 ... SRn and a window of size w slides over the sorted list.]
The records are sorted and a fixed-size window of size w > 1 is slid over them. Comparisons are made among all the records falling within the same window; see the sketch below.
# of comparisons = w(w-1)/2 + (n-w)(w-1) = O(wn)
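A minimal Sorted Neighborhood sketch; the sorting key (surname) and window size are illustrative assumptions.

def snm_pairs(records, key, w):
    # Sort on the sorting key, then pair every record with the next
    # w-1 records in sorted order (all pairs sharing a window).
    srecs = sorted(records, key=key)
    pairs = set()
    for i in range(len(srecs) - 1):
        for j in range(i + 1, min(i + w, len(srecs))):
            pairs.add((srecs[i]["id"], srecs[j]["id"]))
    return pairs

records = [{"id": i, "surname": s}
           for i, s in enumerate(["Sohail", "Baig", "Suhail", "Khan"])]
print(snm_pairs(records, key=lambda r: r["surname"], w=2))
# 3 pairs for n=4, w=2, matching w(w-1)/2 + (n-w)(w-1) = 1 + 2 = 3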
2. Windowing: Sorted Inverted Index
[Diagram: records R1 ... Rn with common sorting key values V1 ... Vb are grouped into blocks RB(V1) ... RB(Vb), and a window of size w slides across the blocks.]
The records that have a common sorting key value are blocked together. Then a window is slid across the blocks of records, and comparisons are made among the records residing in the same window.
# of comparisons = (wn/b)((wn/b) - 1)/2 + (b - w)[(n/b)((n/b) - 1)/2 + (w-1)n²/b²]
