DMER - Unit#8 - Introduction To ER
— Unit 8 —
Introduction to ER
NOTE: Some slides and/or pictures are adapted from the lecture slides and the book Data Matching: Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection by Peter Christen.
Entity Resolution
In the modern era of computing, the presence of information about the same concepts or entities in different data repositories is quite common.
Examples: digital libraries, social websites, customer service centers, business enterprises, health care systems, government agencies, etc.
In order to get consolidated and comprehensive information, multiple data sources are linked and integrated together [1].
Without ER, the integrated data will be quite messy, containing many duplicates. The duplicates can either be “exact duplicates” or, more commonly, “near duplicates” [6].
A major challenge in data integration is the identification of records referring to the same entity in different data sources, known as entity resolution (ER), also called record linkage, record matching [2], or duplicate detection [4].
Challenges in Entity Resolution
1. Absence of a Unique Identifier
ER would be a trivial task if a unique identifier were available across the different data sources to be linked together.
In this situation, deterministic linkage is performed, which is simply a matter of a trivial join operation, as sketched below.
Unfortunately, the availability of a unique identifier across different data sources is rare in the real world, as different systems have been developed and evolved independently. Also, the use of a unique personal identifier, such as the SSN, to link personal data is not allowed in many countries because of privacy protection legislation [22].
In the absence of a unique identifier, linkage is performed using a set of attributes available in the data sources, called the linkage key [23].
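A minimal sketch of deterministic linkage in Python, assuming both sources share a hypothetical unique key (ssn); the data, column names, and the use of pandas are illustrative, not from the slides:

import pandas as pd

# Two hypothetical sources that happen to share a unique identifier.
patients = pd.DataFrame({"ssn": ["111", "222"], "name": ["Sohail", "Amna"]})
visits = pd.DataFrame({"ssn": ["222", "333"], "hospital": ["City", "General"]})

# Deterministic linkage: a plain equi-join on the shared identifier.
linked = patients.merge(visits, on="ssn", how="inner")
print(linked)  # only ssn 222 links across the two sources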
Challenges in Entity Resolution
2. Data Heterogeneity
Data heterogeneity refers to the non-conformance of data representations across the different data sources to be integrated. It is of three types: structural heterogeneity, lexical heterogeneity, and semantic heterogeneity [4].
Structural heterogeneity occurs when the attributes of a relational schema have different representations and meanings in different sources.
For example, in one source an address might be stored in a single attribute, whereas in another it might be split across multiple attributes such as House#, Street, City, etc. Similarly, different attribute names, say gender and sex, might have been used to record the same concept.
Challenges in Entity Resolution
2. Data Heterogeneity
Lexical heterogeneity occurs when the same attribute across different sources uses different representations to record the same value. It can happen due to spelling variations, the use of abbreviations in some data sources, or the use of different conventions across the data sources. For two data sources, example data value pairs showing lexical heterogeneity are (Sohail, Suhail), (CS, Computer Science), and (Fourth Division, 4th Division).
Semantic heterogeneity occurs when different values are used to record the same fact: for example, different codes for representing gender, such as 0/1 or M/F, different values recording the same fact, e.g., ‘B grade’ or ‘First division’, or different units for temperature, distance, etc. A common remedy is to map all source-specific codes to one canonical vocabulary, as sketched below.
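A minimal sketch, in Python, of mapping source-specific codes to one canonical value; the lookup table and the choice 0 = F, 1 = M are hypothetical:

# Hypothetical lookup table: every source-specific code maps to M or F.
GENDER_MAP = {"0": "F", "1": "M", "m": "M", "f": "F",
              "male": "M", "female": "F"}

def canonical_gender(raw_value):
    # Normalise the incoming code, then translate it; None if unknown.
    return GENDER_MAP.get(str(raw_value).strip().lower())

print(canonical_gender(1), canonical_gender("F"), canonical_gender("male"))
# -> M F M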
Challenges in Entity Resolution
3. Data Noise
Data noise or data dirtiness means wrong or invalid values in a data source. It arises from data heterogeneity, typographical errors, missing values, outdated values, use of meaningless or default values, use of synonyms, evolution of entities, duplicate records, and so on.
It is reported in [24] that typographical errors in the Name field can occur in 20-30% of the records. The entities of an information system evolve with time, and because of this the data about an entity become outdated with the passage of time. For example, people migrate from one place to another, get married, or even die without these changes being recorded in all the systems where their data are stored. Details about the manifestation of data noise and the reasons contributing to data dirtiness can be found in [25], [26].
Challenges in Entity Resolution
4. Data Size
The size of data sources is normally very large these days. This makes it prohibitively expensive to carry out the pairwise, or quadratic, number of record comparisons for ER.
Consider a dataset consisting of n records. A simple/naive approach for duplicate detection is to compare each record with every other record in the dataset. This approach requires n(n-1)/2 comparisons, which is too high even for moderate-size datasets.
It is highly impractical and inefficient to make these massive numbers of comparisons, the majority of which are already known to be non-matches owing to “Match Rarity” [27].
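Worked example: for n = 1,000,000 records, the naive approach needs n(n-1)/2 ≈ 5 × 10^11 comparisons; even at one million comparisons per second, that is roughly 5.8 days of computation.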
Challenges in Entity Resolution
4. Data Size
For record linkage, consider two datasets consisting of m and n non-duplicate records respectively. A brute-force approach will make pairwise record comparisons between the two datasets, resulting in m × n comparisons, which is again too high even for moderate-size datasets.
Furthermore, the majority of these quadratic record comparisons are for non-duplicates owing to “Match Rarity” [27], and hence useless: since each dataset is duplicate-free, the maximum possible number of true matches is min(m, n), which is much smaller than the m × n record comparisons.
ER Approaches
Naive Approach: Quadratic comparisons
Systematic Approaches: reduce the record comparison space, cutting the number of comparisons significantly while still aiming to identify more than 95% of the matches.
1. Blocking Method
2. Windowing or Sorted Neighborhood Method (SNM)
ER Process
Dataset(s) → Data Pre-Processing → Record Pairs Reduction (Blocking) → Record Pairs Comparisons → Record Pairs Classification → Evaluation
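A minimal end-to-end sketch of this pipeline in Python; the toy records, the blocking key, and the crude similarity rule are all illustrative assumptions, not the slides' method, and step 5 (evaluation) is omitted since it would need known true matches:

from itertools import combinations

# Toy records: (id, name, city).
RECORDS = [(1, "Sohail Ahmed ", "Lahore"),
           (2, "suhail ahmed", "lahore"),
           (3, "Amna Khan", "Karachi")]

def preprocess(rec):
    # Step 1: data pre-processing - trim and lowercase string fields.
    rid, name, city = rec
    return (rid, name.strip().lower(), city.strip().lower())

def blocking_key(rec):
    # Step 2: record pairs reduction - block on first letter + city.
    return rec[1][:1] + "|" + rec[2]

def compare(a, b):
    # Step 3: record pairs comparison - a crude name similarity.
    if a[1] == b[1]:
        return 1.0
    return 0.5 if a[1].split()[-1] == b[1].split()[-1] else 0.0

def classify(sim, threshold=0.4):
    # Step 4: record pairs classification - single-threshold rule.
    return "match" if sim >= threshold else "non-match"

records = [preprocess(r) for r in RECORDS]
blocks = {}
for r in records:
    blocks.setdefault(blocking_key(r), []).append(r)
for block in blocks.values():
    for a, b in combinations(block, 2):  # compare inside blocks only
        print(a[0], b[0], classify(compare(a, b)))  # prints: 1 2 match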
ER Process
2. Record Pairs Reduction (Blocking)
Record pairs reduction avoids the pairwise, or quadratic, number of record comparisons.
Blocking gathers the potential duplicates in the same block using a blocking key or indexing key. All the records with an identical blocking key value are placed in the same block. Afterwards, the detailed comparisons are made only among the records residing in the same block, thereby decreasing the number of record comparisons significantly (see the sketch below).
Let a dataset of size n be partitioned into b blocks, with each block containing nearly n/b records; then the total number of record comparisons using blocking is reduced from n(n-1)/2 to approximately n(n-1)/2b, since each block contributes only (n/b)((n/b) - 1)/2 comparisons.
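A minimal sketch of standard blocking in Python; the blocking key (first three letters of the surname) and the records are illustrative choices:

def blocking_key(record):
    # Hypothetical blocking key: first three letters of the surname.
    return record["surname"][:3].lower()

records = [{"surname": s} for s in
           ["Christen", "Christensen", "Smith", "Smyth", "Khan"]]
blocks = {}
for rec in records:
    blocks.setdefault(blocking_key(rec), []).append(rec)

# Count the comparisons inside blocks versus the naive quadratic count.
blocked = sum(len(b) * (len(b) - 1) // 2 for b in blocks.values())
naive = len(records) * (len(records) - 1) // 2
print(sorted(blocks), blocked, "vs naive", naive)  # 1 comparison vs 10

Note that blocking can be lossy: Smith and Smyth land in different blocks here and would never be compared, which is why blocking keys must be chosen carefully.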
ER Process
3. Record Pairs Comparisons
A record pair is compared using a set of attributes called the linkage key. Field comparison functions are applied to each attribute of the linkage key to compute the similarity between the corresponding attribute values of the record pair.
The similarity between two records a and b can be represented using a comparison vector γ = (γ1, …, γk), where γi shows the similarity between records a and b on the i-th attribute:
γi = 0 means complete disagreement on attribute i,
γi = 1 means complete agreement on attribute i, and
k is the total number of attributes used in the comparison vector.
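A minimal sketch of building a comparison vector in Python; difflib's ratio stands in here for the string comparators the record linkage literature typically uses (edit distance, Jaro-Winkler, etc.), and the linkage key is illustrative:

from difflib import SequenceMatcher

def sim(a, b):
    # Normalised string similarity in [0, 1]: 0 = disagreement, 1 = agreement.
    return SequenceMatcher(None, a, b).ratio()

def comparison_vector(rec_a, rec_b, linkage_key):
    # One similarity value per linkage-key attribute, i.e. (γ1, ..., γk).
    return [sim(rec_a[f], rec_b[f]) for f in linkage_key]

a = {"name": "Sohail", "city": "Lahore"}
b = {"name": "Suhail", "city": "Lahore"}
print(comparison_vector(a, b, ["name", "city"]))  # ≈ [0.83, 1.0]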
ER Process
4. Record Pairs Classification
Using the comparison vectors generated in the previous step, the record pairs are classified into three possible classes: “Matches”, “Non-Matches”, or “Possible Matches”.
The simplest way of classifying a record pair is threshold-based: the similarities in the comparison vector are summed into a matching score S = γ1 + … + γk, and two thresholds t_lower < t_upper divide the pairs:
Match: S ≥ t_upper
Non-Match: S ≤ t_lower
Possible Match: t_lower < S < t_upper (set aside for clerical review)
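A minimal sketch of this two-threshold rule in Python; the threshold values 1.2 and 1.8 are arbitrary illustrative settings:

def classify(gamma, t_lower=1.2, t_upper=1.8):
    # Sum the per-attribute similarities into one matching score.
    score = sum(gamma)
    if score >= t_upper:
        return "Match"
    if score <= t_lower:
        return "Non-Match"
    return "Possible Match"  # set aside for clerical review

print(classify([0.83, 1.0]))   # score 1.83 -> Match
print(classify([0.40, 0.30]))  # score 0.70 -> Non-Match
print(classify([0.80, 0.70]))  # score 1.50 -> Possible Match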
ER Process
5. Evaluation
Each classified record pair falls into one of four outcomes with respect to the set of true duplicates:
True Positives (TP): true duplicates classified as matches
False Positives (FP): non-duplicates classified as matches
False Negatives (FN): true duplicates classified as non-matches
True Negatives (TN): non-duplicates classified as non-matches
Evaluation Measures
4. F-score: the harmonic mean of precision and recall, F = 2 × Precision × Recall / (Precision + Recall). An analogous F-score can be computed for the blocking step, e.g., from the pairs completeness and pairs quality of the candidate record pairs.
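A minimal sketch computing precision, recall, and F-score in Python from the four outcome counts; the counts themselves are illustrative:

def precision(tp, fp):
    # Share of predicted matches that are true duplicates.
    return tp / (tp + fp)

def recall(tp, fn):
    # Share of true duplicates that were found.
    return tp / (tp + fn)

def f_score(p, r):
    # Harmonic mean of precision and recall.
    return 2 * p * r / (p + r)

p, r = precision(tp=80, fp=20), recall(tp=80, fn=40)
print(round(p, 2), round(r, 2), round(f_score(p, r), 2))  # 0.8 0.67 0.73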
1. Blocking Method
The groups of records are blocked together in different buckets on the basis of blocking key(s): records R1, …, Rn are assigned to the record blocks RB(K1), …, RB(Kb) according to their blocking key values K1, …, Kb.
The comparisons of records within a block are made against the selected fields, called linkage fields.
# of comparisons = n(n-1)/2b
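Worked example: for n = 100,000 records split evenly into b = 100 blocks, the comparisons drop from n(n-1)/2 ≈ 5 × 10^9 to n(n-1)/2b ≈ 5 × 10^7, a hundred-fold reduction.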
2. Windowing: Sorted array
The records R1, …, Rn are sorted on a sorting key into SR1, …, SRn, and a fixed-size window of size w > 1 is slid over them. The comparisons are made among all the records falling within the same window (see the sketch below).
# of comparisons = w(w-1)/2 + (n-w)(w-1) = O(wn)
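A minimal sketch of the sorted neighborhood method in Python; the sorting key (plain lowercase value) and window size w = 3 are illustrative choices:

def snm_pairs(records, key, w=3):
    # Sort once on the sorting key, then slide a window of size w.
    srecs = sorted(records, key=key)
    pairs = []
    for i in range(len(srecs)):
        # The record entering the window meets the w-1 records before it.
        for prev in srecs[max(0, i - w + 1):i]:
            pairs.append((prev, srecs[i]))
    return pairs

names = ["Smyth", "Smith", "Christen", "Khan", "Christensen"]
print(snm_pairs(names, key=str.lower, w=3))
# 7 pairs, matching w(w-1)/2 + (n-w)(w-1) = 3 + 4 for n = 5, w = 3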
2. Windowing: Sorted inverted index
The records that have a common sorting key value are blocked together: records R1, …, Rn are assigned to the record blocks RB(V1), …, RB(Vb) according to their sorting key values V1, …, Vb. Then a window of w blocks is slid across the blocks and the comparisons are made among the records residing in the blocks of the same window.
# of comparisons = (wn/b)(wn/b - 1)/2 + (b - w)[(n/b)((n/b) - 1)/2 + (w-1)n²/b²]
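Worked example: for n = 10,000 records in b = 100 blocks (n/b = 100 records each) and a window of w = 2 blocks, the first window of wn/b = 200 records costs 200 × 199 / 2 = 19,900 comparisons, and each of the remaining b - w = 98 window positions adds 100 × 99 / 2 + 1 × 100² = 14,950, giving 19,900 + 98 × 14,950 = 1,485,000 comparisons in total, versus n(n-1)/2 ≈ 5 × 10^7 for the naive approach.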