
Information Processing and Management 47 (2011) 706–718


An unsupervised heuristic-based approach for bibliographic metadata deduplication

Eduardo N. Borges a,*, Moisés G. de Carvalho b, Renata Galante a, Marcos André Gonçalves b, Alberto H.F. Laender b

a Institute of Informatics, Federal University of Rio Grande do Sul, Porto Alegre, Brazil
b Computer Science Dept., Federal University of Minas Gerais, Belo Horizonte, Brazil

Article history:
Received 22 July 2009
Received in revised form 20 November 2010
Accepted 24 January 2011
Available online 17 February 2011

Keywords:
Digital libraries
Metadata
Deduplication
Similarity

Abstract

Digital libraries of scientific articles contain collections of digital objects that are usually described by bibliographic metadata records. These records can be acquired from different sources and be represented using several metadata standards. These metadata standards may be heterogeneous in both content and structure. All of this implies that many records may be duplicated in the repository, thus affecting the quality of services such as searching and browsing. In this article we present an approach that identifies duplicated bibliographic metadata records in an efficient and effective way. We propose similarity functions especially designed for the digital library domain and experimentally evaluate them. Our results show that the proposed functions improve the quality of metadata deduplication by up to 188% compared to four different baselines. We also show that our approach achieves statistically equivalent results when compared to a state-of-the-art method for replica identification based on genetic programming, without the burden and cost of any training process.

© 2011 Elsevier Ltd. All rights reserved.

1. Introduction

Digital libraries (DLs) are complex information systems built to address the information needs of specific target communities (Gonçalves, Fox, Watson, & Kipp, 2004). DLs are composed of collections of rich (possibly multimedia) digital objects
along with services such as searching, browsing and recommendation, that allow easy access and retrieval of these objects by
the members of the target community (Fox, Akscyn, Furuta, & Leggett, 1995; Gonçalves et al., 2004).
Collections of digital objects are usually described by means of metadata records (usually organized in a metadata catalog) whose function is to describe, organize and specify how these objects can be manipulated and retrieved, including who
has the rights for doing so. In order to promote interoperability among DLs and similar systems, metadata records usually
conform to one or more metadata standards that specify, among others, a standardized set of metadata fields and their
semantics for the description of digital objects. The Dublin Core,¹ for example, is a general descriptive metadata standard
for the representation and storage of information about scientific publications and Web pages.
Although very useful, these standards do not completely solve all the interoperability problems as there is no consensus among all existing digital libraries in terms of a unique ‘de facto’ standard. Moreover, even if such a consensus existed,

* Corresponding author. Tel.: +55 51 33087746; fax: +55 51 33087308.
E-mail addresses: [email protected] (E.N. Borges), [email protected] (M.G. de Carvalho), [email protected] (R. Galante), [email protected] (M.A. Gonçalves), [email protected] (A.H.F. Laender).
¹ http://dublincore.org.

0306-4573/$ - see front matter © 2011 Elsevier Ltd. All rights reserved.
doi:10.1016/j.ipm.2011.01.009

BDBComp
1 <title>A Computer Vision Framework for Remote Eye Gaze Tracking</title>
2 <creator>Carlos H. Morimoto</creator>
3 <source>sibgrapi2003</source>
DBLP
4 <title>A Computer Vision Framework for Eye Gaze Tracking</title>
5 <author>Carlos Hitoshi Morimoto</author>
6 <booktitle>SIBGRAPI</booktitle>
IEEE Xplore
7 <title>A computer vision framework for eye gaze tracking</title>
8 <author>Morimoto, C.H.</author>
9 <pages>406</pages>
Fig. 1. Heterogeneity of metadata.

differences in practices and in the way some metadata elements are filled, not mentioning possible errors in this process
(e.g., misspellings and typos), allow for the existence of several different records describing the same digital object.
Consider the example of Fig. 1 that presents excerpts of metadata records from three distinct digital libraries: BDBComp,² DBLP³ and IEEE Xplore.⁴ All records refer to the same digital object. The field source in the BDBComp metadata record (line 3) corresponds to the field booktitle from DBLP (line 6). The metadata structures are different, but both refer to the same information, i.e., the publication venue of a specific paper. Also, the author of the paper, which is represented by the creator and author
fields, has the value ‘‘Carlos H. Morimoto’’ in the BDBComp record (line 2), ‘‘Carlos Hitoshi Morimoto’’ in the DBLP record (line 5)
and ‘‘Morimoto, C.H.’’ in the IEEE Xplore record (line 8). The value of the title field also differs in the word ‘‘Remote’’ (lines 1, 4
and 7).
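To make such structural heterogeneity concrete, the sketch below aligns the three records of Fig. 1 to a common schema before any content comparison. It is only an illustration: the dictionary contents and the target field names (author, venue) are hypothetical, since the article assumes such a mapping is simply provided (see Section 3).

# Hypothetical field mappings aligning each source to a common schema;
# the paper assumes such a mapping is given, so these names are illustrative.
FIELD_MAP = {
    "BDBComp":     {"title": "title", "creator": "author", "source": "venue"},
    "DBLP":        {"title": "title", "author": "author", "booktitle": "venue"},
    "IEEE Xplore": {"title": "title", "author": "author"},
}

def normalize(record, source):
    """Rename a record's fields to the common schema; unmapped fields (e.g., pages) are dropped."""
    mapping = FIELD_MAP[source]
    return {mapping[key]: value for key, value in record.items() if key in mapping}

# normalize({"title": "...", "creator": "Carlos H. Morimoto", "source": "sibgrapi2003"}, "BDBComp")
# -> {"title": "...", "author": "Carlos H. Morimoto", "venue": "sibgrapi2003"}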
Deduplication is the task of identifying in a data repository duplicated records that refer to the same real-world entity.
These records may be hard to identify due to, as mentioned before, variations in spelling, writing style, metadata standard
use, or even typos (Carvalho, Gonçalves, Laender, & da Silva, 2006). Deduplication is also known as record linkage, object
matching or instance matching (Doan, Noy, & Halevy, 2004). In fact, this is not a new problem but a long-standing issue,
for example, in the Library and Information Science field, in the context of Online Public Access Catalogs (OPACs) (Large
& Beheshti, 1997), as well as in the database realm.
Several approaches to record deduplication have been proposed in recent years (Bilenko & Mooney, 2003; Dorneles et al.,
2009; Carvalho et al., 2006; Carvalho, Laender, Gonçalves, & da Silva, 2008; Chaudhuri, Ganjam, Ganti, & Motwani, 2003;
Cohen & Richman, 2002; Tejada, Knoblock, & Minton, 2001). Most of these works focus on the deduplication task in the context of integration of relational databases. Few automatic approaches, if any, have been specifically developed for the realm
of digital libraries or in a more general sense, for bibliographic metadata records. For example, metadata fields that specify
the authors of a digital object are some of the most discriminative fields of a record and this information should be used as strong evidence for the deduplication process. In fact, there may be several objects with similar titles but there is a very small
chance that they will also have authors with similar names and be a different real-world object. For instance, Baeza-Yates
and Ribeiro-Neto as well as Manning have published books with similar titles (Baeza-Yates & Ribeiro-Neto, 1999; Manning,
Raghavan, & Schütze, 2008). Another specific problem to deal with is the variation in the way author names are represented
in bibliographic citations. Variations include abbreviations, inversions of names, different spellings and omission of suffixes
such as Jr. (Ley, 2002). Deduplication techniques applied to the digital libraries domain should therefore take into special consideration the fields that refer to author names to correctly identify duplicated metadata records. These techniques may even
explore a number of other sources of information such as authority files to help with the task of comparing author names,
although this is not the focus of this work.
This article presents an approach to identifying duplicated bibliographic metadata records. We assume that a mapping
between the metadata fields in different standards is provided and we focus on the application of specially designed similarity functions for the metadata content. We are aware of the problem of schema matching (Rahm & Bernstein, 2001); however, it is not the focus of our work; recent solutions in the literature for the problem could be used (Fagin et al., 2009). Here,
instead, we are interested in the instance matching problem, specifically for the realm of digital libraries. In this context, the
main contributions of this work are:

• an efficient and effective approach for metadata record deduplication that is based on a set of similarity functions specially designed for the digital library domain;
• the identification and analysis of the failure cases of the evaluated deduplication functions, which are valuable for the development of new approaches for automatic bibliographic metadata deduplication.

² http://www.lbd.dcc.ufmg.br/bdbcomp.
³ http://www.informatik.uni-trier.de/~ley/db.
⁴ http://www.ieeexplore.ieee.org.

Table 1
Features of proper nouns functions.

Features Guth Acronyms Fragments


Variations of spelling Limited Yes Yes
Abbreviations No Limited Yes
Inversions of names No No Partial
Complexity Linear Linear Quadratic
Language Independence Yes Yes Yes

The quality of the proposed deduplication functions is evaluated in experiments with two real datasets and compared
with four other related approaches. The results of the experiments show that the proposed functions improve the deduplication quality from 2% (with regard to an already very accurate result which is difficult to improve) to 62% when identifying
replicas in a dataset containing portions of the metadata records from two real digital libraries, and from 7% to 188% in a
dataset with article citation data. We also show that our approach achieves slightly superior results when compared to a
state-of-the-art method for replica identification based on genetic programming, without the burden and cost of any training
process.
The rest of this article is organized as follows. In Section 2, we discuss related work. In Section 3, we present our approach
to deduplicate bibliographic metadata. There we define a set of functions and algorithms that are specific for our deduplication approach, which is especially designed for bibliographic metadata. In Section 4, we give details on the performed
experiments and discuss the obtained results. Finally, in Section 5, we draw our conclusions and point out some future work
directions.

2. Related work

Chaudhuri et al. (2003) explore a vector space model representation of records and propose a similarity function that considers the weight of words using the Inverse Document Frequency (IDF) (Baeza-Yates & Ribeiro-Neto, 1999). Carvalho and da Silva (2003) also use the vector space model to calculate the similarity between objects from multiple sources. Their approach can be used to deduplicate objects with complex structures such as XML documents.
Dorneles, Heuser, Lima, da Silva, and de Moura (2004) propose a set of similarity metrics that handle collections of values that occur in XML documents. The authors define two types of metric: metrics for atomic values (MAV) and metrics for
complex values (MCV). MAV are dependent on the application domain while MCV are defined according to the child nodes’
features. The MCV Set is proposed for sets of values without a specific order. Dorneles et al. (2009) extend their previous
work so that instead of setting a threshold of similarity based on the scores returned by a similarity function, the user can
specify the expected record match precision. This approach maps similarity scores into precision values using a training
set.
Other works have proposed strategies based on machine learning techniques to estimate record similarities. The Active
Atlas system (Tejada et al., 2001) performs mappings between records to integrate different data sources. Attribute mapping
rules are specified based on a training process that uses decision trees (Quinlan, 1986). Cohen and Richman (2002) propose a
scalable and adaptive technique to group objects based on the string similarity of different records. The MARLIN system
(Bilenko & Mooney, 2003) explores a framework for identifying duplicated records using adaptive string similarity metrics
applied to each field, according to the domain of their values. The system defines two similarity metrics: one based on edit
distance and another based on Support Vector Machines (SVMs) (Boser, Guyon, & Vapnik, 1992).
Carvalho et al. (2006) apply genetic programming (GP) to automatically generate similarity functions to identify record
replicas in a given repository. These functions improve the deduplication task when combined with a traditional statistical
method proposed by Fellegi and Sunter (1969). The problem of bibliographic citation deduplication is explicitly discussed in
(Lawrence, Giles, & Bollacker, 1999). The authors propose algorithms for matching citations from different sources based on
edit-distance, word matching, phrase matching and subfield extraction. Carvalho et al. (2008) extend their previous work by
proposing a GP-based approach which is independent of Fellegi and Sunter's statistical method in order to find suitable
similarity functions based on the combination of multiple pieces of evidence (i.e., attribute/similarity function pairs), being
capable of effectively identifying whether two entries in a repository are replicas or not. The suggested functions are also
computationally less demanding than those generated by other approaches since they potentially use less evidence. The
experiments show that this approach outperforms the SVM-based method used by the MARLIN system (Bilenko & Mooney,
2003) by at least 6.5%. For this reason, in our experiments, we compare the effectiveness of our unsupervised heuristic-based
approach with this GP-based approach (Carvalho et al., 2008), which is currently the state-of-the-art method for replica
identification.
Most of the papers addressed in this section present general solutions for record deduplication, but do not specifically treat
proper nouns, which, as discussed, are essential for bibliographic records. Identifying variations in author names present in
bibliographic citations can be considered as a subproblem of deduplication. Some similarity functions have been proposed
specifically to compare proper nouns. For instance, Guth’s function (or Guth) (Guth, 1976) supports small spelling variations.

Acronyms (Lima, 2002) supports abbreviations of names, but is limited in the sense that it uses all capital letters to produce
an acronym. For instance, the acronym of John A.B.C. Smith is JABCS while the acronym of John A. Smith is JAS. Acronyms does
not support the omission of some names. Comparison by Fragments (or Fragments) (Cota, Gonçalves, & Laender, 2007) also
supports abbreviations as well as inversions of names. The solution adopted in Fragments considers inversions of middle
names, but inversions of last names are detected only in the presence of the comma character as an indicator of inversion.
Moreover, the complexity of the algorithm is quadratic in terms of both the number and the length of the fragments. All
these similarity functions for proper names can be considered as language independent. Table 1 summarizes the features
of each function.

3. The metadata deduplication approach

This section presents our approach to deduplicate bibliographic metadata records. We define as duplicates or replicas two or more metadata records that are semantically equivalent, i.e., records that describe the same publication item (digital object) indexed by a digital library. We assume that the mapping between metadata standards or structures is provided. Again, we are aware of the problem of schema matching but this is out of the scope of this work⁵ (Rahm & Bernstein, 2001). The metadata content is compared using similarity functions, which are chosen according to the domain of each metadata field. We specify three similarity functions called IniSim, NameMatch and MetadataMatch. These functions compare the content of the main metadata fields that describe the digital objects, as we shall see.

⁵ Notice also that due to the size of most metadata standards currently used (i.e., number of defined fields) manual matchings are a reasonable choice. Moreover, since we are dealing with standards such as MARC or Dublin Core, it is very plausible that some of these matchings have already been specified and can be used by our approach.

3.1. IniSim

The similarity function IniSim identifies variations in the representation of an author name considering misspellings,
inversions, abbreviations and omissions of names. Only the initials of the author names are compared, enabling an efficient implementation of this function through a linear algorithm.
Let C be a set of proper nouns and N_S = {0, 1} be the set of natural numbers 0 and 1; IniSim : C × C → N_S calculates the similarity between the initials of two proper nouns (i.e., author names). The function IniSim checks whether the proper noun initials of a, b ∈ C potentially represent the name of the same person. Comparisons are made only among the first, second and last initials since the first and last author names usually appear in these positions, even if inversions occur. When the initials belong to a compatible author name, i.e., when both representations may correspond to the same real-world entity, the function IniSim returns s ∈ N_S | s = 1. Otherwise, s = 0 is returned. IniSim is defined by Eq. (1) as

IniSim(a, b) = 1 if (a₁ = b₁ ∧ aₘ = bₙ) ∨ (a₁ = b₂ ∧ aₘ = b₁) ∨ (a₁ = b₁ ∧ a₂ = b₂) ∨ (a₁ = bₙ ∧ a₂ = b₁), and 0 otherwise.   (1)

where a, b are strings formed by the initials of the author names, aᵢ is the ith letter of string a, i.e., the ith name initial, bᵢ is the ith letter of string b, i.e., the ith name initial, m is the length of string a, and n is the length of string b.
For example, consider a function Initials that extracts the initials of a proper name. IniSim(Initials(John Winston Lennon), Initials(Lennon, John)) = IniSim(JWL, LJ) = 1 (a₁ = b₂ ∧ aₘ = b₁) and IniSim(Initials(John Winston), Initials(Lennon, John)) = IniSim(JW, LJ) = 0.
Notice that IniSim can handle misspellings of proper names with the exception of misspellings of the first, second and last initials. However, this is unlikely to happen. For example, in our experimental datasets, we were unable to find any case in
which such an error occurs. Other works such as (Convis, Glickman, & Rosenbaum, 1982) also rely on the fact that the initials
are usually the most reliable evidence for misspelling cases.
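As an illustration, Eq. (1) translates directly into code. The following sketch is ours, not the authors' implementation; the Initials helper and the fallback for single-initial strings are assumptions:

def initials(name):
    """Extract the initials of a proper name; 'Lennon, John' -> 'LJ'."""
    return "".join(word[0].upper() for word in name.replace(",", " ").split())

def ini_sim(a, b):
    """Eq. (1): compare the first, second and last initials of two names."""
    a1, a2, am = a[0], a[min(1, len(a) - 1)], a[-1]   # single-initial fallback is our assumption
    b1, b2, bn = b[0], b[min(1, len(b) - 1)], b[-1]
    compatible = ((a1 == b1 and am == bn) or
                  (a1 == b2 and am == b1) or
                  (a1 == b1 and a2 == b2) or
                  (a1 == bn and a2 == b1))
    return 1 if compatible else 0

# ini_sim(initials("John Winston Lennon"), initials("Lennon, John"))  -> 1
# ini_sim(initials("John Winston"), initials("Lennon, John"))         -> 0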

3.2. NameMatch

NameMatch is a similarity function that compares sets of proper names. In our case, it is specifically applied to compare
author names associated with two distinct digital objects.
Let C be a set of proper names, R_S be the set of real numbers in the interval [0, 1] and N_S be the set of natural numbers {0, 1}. NameMatch : C × C × R_S → N_S is defined by the algorithm


NAMEMATCH(K, L, tN)
1   m ← LENGTH(K);
2   n ← LENGTH(L);
3   for i ← 1 to m
4       do for j ← 1 to n
5           do if INISIM(K_i, L_j) = 1
6               then counter ← counter + 1;
7                    L_j ← null;
8   if counter / MAX(m, n) ≥ tN
9       then return 1;
10  else return 0;

where K, L are lists composed of the initials of the author names of the digital objects being compared, K_i ∈ C is the ith element of the K list, i.e., the initials of the ith author name of the first object, L_j ∈ C is the jth element of the L list, i.e., the initials of the jth author name of the second object and tN ∈ R_S is a similarity threshold. In addition, LENGTH is a function that returns the length of a list, counter indicates the number of matches found by function IniSim and MAX is a function that returns the length of the largest list of words.
The algorithm requires three parameters: the lists K and L of initials of author names and the minimum matching threshold tN between two author names. Both lists K and L are traversed forward (lines 3–4) in order to find matches between the author names, which is performed by the function IniSim. When two author names match (line 5), the function sets the initials of the jth author name of the second object to a null value, avoiding future comparisons (line 7). The variable counter counts the number of matches found (line 6). When the minimal threshold is achieved (line 8), NameMatch returns a similarity score s = 1, otherwise it returns s = 0.
The threshold value serves, besides other functions, to deal with errors and problems in data acquisition, e.g., when the list of author names of a publication is not complete. In such cases, the threshold tN, passed as parameter, helps adjust the function to the correct replica identification. For instance, consider an author name omission in the second parameter when using the function NameMatch as follows: NameMatch([AB, CD, EF], [FE, BA], 0.6). Values 3 and 2 are assigned respectively to variables m and n. The list [AB, CD, EF] is processed and each element is compared to all unmatched elements from the list [FE, BA]. The function IniSim returns 1 for the pairs (AB, BA) and (EF, FE). The variable counter is incremented to 2. As the minimum threshold passed as a parameter (0.6) is reached, the function returns 1. The similarity threshold used in this case allows two metadata records with different numbers of authors to be possible candidates for a match.⁶
Errors are also possible in the initials of author names. However, the threshold tN used by function NameMatch minimizes the effect of this kind of error, because it is unlikely that there will be errors in the initials of several author names of a digital object at the same time. Our empirical experimental results demonstrate the effectiveness of this heuristic as applied by functions IniSim and NameMatch.

⁶ The choice of appropriate threshold values is discussed in the literature (Stasiu, Heuser, & da Silva, 2005) and is out of the scope of this paper.
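A possible rendering of NAMEMATCH in the same spirit, reusing ini_sim from the previous sketch; copying the second list and moving to the next author after a match are our reading of the pseudocode (which nulls matched entries):

def name_match(k, l, t_n):
    """Return 1 if the fraction of matched author initials reaches threshold t_n."""
    remaining = list(l)               # work on a copy; matched entries are nulled out (line 7)
    counter = 0
    for ki in k:
        for j, lj in enumerate(remaining):
            if lj is not None and ini_sim(ki, lj) == 1:
                counter += 1
                remaining[j] = None   # avoid future comparisons
                break                 # move on to the next author of the first object
    return 1 if counter / max(len(k), len(l)) >= t_n else 0

# Worked example from the text:
# name_match(["AB", "CD", "EF"], ["FE", "BA"], 0.6) -> 1   (2/3 >= 0.6)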

3.3. MetadataMatch

The function MetadataMatch tries to match two metadata records aiming at determining whether they are replicas or not.
Let M be the metadata record of a digital object, R_S be the set of real numbers in the interval [0, 1] and N_S be the set of natural numbers {0, 1}. The function MetadataMatch : M × M × N × R_S × R_S → N_S is defined by the algorithm

METADATAMATCH(a, b, tY, tN, tL)
1   if |YEAR(a) − YEAR(b)| ≤ tY
2       then for all AUTHOR(a)
3           do ADD(iniList_a, INITIALS(AUTHOR(a)));
4       for all AUTHOR(b)
5           do ADD(iniList_b, INITIALS(AUTHOR(b)));
6       if NAMEMATCH(iniList_a, iniList_b, tN) = 1
7           then if LEVENSHTEIN(TITLE(a), TITLE(b)) ≥ tL
8               then return 1;
9   return 0;

where a and b are metadata records of two digital objects, tY ∈ N is the maximum difference between their publication years, tN ∈ R_S is the minimum matching threshold between author names, tL ∈ R_S is the minimum similarity threshold between their titles and iniList_i is the list of author name initials of object i. In addition, the algorithm uses the following functions: YEAR returns the publication year of a digital object, AUTHOR returns the author names of a digital object, INITIALS returns the


initials of an author name, ADD inserts the initials of an author name into a list, NAMEMATCH is the name matching function defined in Section 3.2, TITLE returns the title of a digital object and LEVENSHTEIN calculates the normalized edit-distance between the titles of two objects by using Levenshtein's method (Levenshtein, 1966).
The MetadataMatch algorithm starts by checking whether the publication years of the digital objects are within the defined difference (line 1). We assume that in most cases the year is correct and available. However, if the publication year is
missing, we can replace the function YEAR by a preprocessing function that extracts the year from other metadata fields such
as venue or source (Dublin Core). We can also check whether there are typos in the publication year by comparing it with the
value extracted. Only objects whose absolute value of the difference between their publication years is less than or equal to a
provided threshold tY will have their author names compared. This strategy is used to significantly reduce the number of
further (more expensive) comparisons. Specifically for this parameter, we think that 1 year of difference is a good suggestion
since it covers, for example, the cases of events that sometimes have their formal proceedings published in the following
year. Then, the initials of the author names are extracted (lines 2–5) and added to the lists iniListi. The function NameMatch
matches the objects’ author names (line 6). Only objects that reach the author name matching threshold will have their titles
compared by the function Levenshtein.
If the condition tested in line 1 holds and the functions NameMatch and Levenshtein both evaluate to 1, then the algorithm returns 1 (line 8). At any time, if a condition is evaluated as false, then the algorithm returns 0 (line 9).
The proposed function NameMatch aims at producing a maximum recall. The majority of the relevant matches will be
contained in the set of returned matches. Although the function IniSim can misidentify author names with the same initials, a
pair of metadata records will rarely be associated with several authors with the same initials but different names. The high
precision scores in the results of our experiments confirm this hypothesis.
The deduplication process may also achieve high levels of precision due to the comparison of titles by the function
Levenshtein, after a positive result of the function NameMatch. The title of a digital object is only compared when function
NameMatch returns 1. As mentioned before, the strategy of comparing first the publication years and then the author names
(function NameMatch) aims at largely reducing the number of title comparisons performed by function Levenshtein, which is
much more expensive than NameMatch. This reduces the whole deduplication processing time due to the restricted number
of performed comparisons. In our experiments, we also quantified the reduction in the number of comparisons for each
tested condition and applied function.
Finally, the MetadataMatch algorithm can be seen as a template for a general strategy for metadata record deduplication
in which specific similarity functions may be replaced by more effective or efficient ones in specific scenarios. For example,
the function NameMatch may be replaced by another one that implements a specific strategy for matching author names
from a given rationality and the function Levenshtein may be replaced by another string comparison function. This was exactly what we did for the comparison with the baselines in the experiments we describe in the next section.
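A sketch of METADATAMATCH under the same assumptions (records as dictionaries with year, authors and title fields), reusing initials and name_match from the previous sketches, with a plain dynamic-programming Levenshtein distance normalized by the longer title length; the normalization choice and lowercasing are ours:

def levenshtein_sim(s, t):
    """Normalized edit-distance similarity between two strings, in [0, 1]."""
    m, n = len(s), len(t)
    if max(m, n) == 0:
        return 1.0
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution
        prev = curr
    return 1.0 - prev[n] / max(m, n)

def metadata_match(a, b, t_y, t_n, t_l):
    """Return 1 if records a and b (dicts with 'year', 'authors', 'title') look like replicas."""
    if abs(a["year"] - b["year"]) <= t_y:               # cheapest test first
        ini_a = [initials(name) for name in a["authors"]]
        ini_b = [initials(name) for name in b["authors"]]
        if name_match(ini_a, ini_b, t_n) == 1:          # then author initials
            if levenshtein_sim(a["title"].lower(), b["title"].lower()) >= t_l:
                return 1                                # titles compared last
    return 0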

4. Experimental evaluation

In this section we describe the experiments we conducted in order to empirically validate and check the quality of our
bibliographic metadata deduplication approach. Two real datasets were used in the experiments. The first dataset contains
a portion of the metadata records of two real digital libraries and the second one is composed of article citation data. The
experiments are divided into two groups:

1. The proposed functions are compared with four different baselines (Guth, Acronyms, Fragments and MCV Set) discussed in Section 2. The experimental results show that the proposed functions significantly improve the quality of metadata deduplication from 2% to 62% in the digital library dataset and from 7% to 188% in the article citation dataset.
2. The effectiveness of our unsupervised heuristic-based approach is compared to a supervised genetic programming deduplication approach (Carvalho et al., 2008), which is currently the state-of-the-art method for replica identification. The experimental results show that the proposed functions produced statistically equivalent results without the burden and cost of any training process.

The following metrics were used to evaluate the results: precision, recall, balanced F-measure (Baeza-Yates & Ribeiro-Neto, 1999; Manning et al., 2008) and the Wilcoxon signed-rank test (Siegel, 1956; Wilcoxon, 1945).
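For reference, precision, recall and the balanced F-measure can be computed from the sets of predicted and relevant (gold) pairs; the balanced F-measure is the harmonic mean of precision and recall. A minimal sketch:

def precision_recall_f1(predicted, relevant):
    """Set-based scores over pairs of records flagged as replicas."""
    tp = len(predicted & relevant)                  # correctly identified pairs
    p = tp / len(predicted) if predicted else 0.0   # precision
    r = tp / len(relevant) if relevant else 0.0     # recall
    f1 = 2 * p * r / (p + r) if p + r > 0 else 0.0  # balanced F-measure
    return p, r, f1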
The experiments were performed on a personal computer with a 1.86 GHz dual core processor and 2.0 GB of DDR2 memory. Our implementation required only 60 MB of memory.

4.1. Datasets

The first dataset used in our experiments was created with metadata records extracted from the digital libraries BDBComp and DBLP. When these metadata records were collected (March 1st, 2007), BDBComp contained about four thousand references to scientific papers published in Brazil. The metadata records were harvested by means of the OAI-PMH protocol in the Dublin Core format. In DBLP, there were more than 800,000 references to scientific papers published in several
countries. The metadata records were collected from the digital library website in a specific XML format.

Table 2
Libraries dataset.

Conference   Year interval   BDBComp   DBLP   Knowledge area               Coverage
ERBD         2005            18        –      Databases                    Brazilian
SBBD         2001–2005       122       143    Databases                    Brazilian
WIDM         2001–2004       –         76     Databases                    International
ER           2001–2004       –         214    Databases                    International
CAiSE        2001–2004       –         192    Information systems          International
SIBGRAPI     2001–2004       242       274    Computer graphics            Brazilian
CGI          2001–2004       –         196    Computer graphics            International
SVR          2001–2004       107       –      Virtual reality              Brazilian
INTERACT     2003            –         215    Human–computer interaction   International
SBIA         2002–2004       93        93     Artificial intelligence      Brazilian
Σ            2001–2005       582       1403

Table 3
Duplicated records from the libraries dataset.

Field Content
Title Mining reliable models of associations in dynamic databases
Authors Adriano Veloso, Wagner Meira Jr., Márcio de Carvalho
Year 2002
Booktitle SBBD
Pages 263-277
Title Mining reliable models of associations in dynamic databases
Creators Adriano A. Veloso, Wagner Meira Jr., Márcio Luiz Bunte de Carvalho
Date 2002
Source sbbd2002
Language por
Coverage Gramado, RS, Brasil
Rights Sociedade Brasileira de Computação

Table 4
Duplicated records from the Cora dataset.

Field Content
Title Applying the weak learning framework to understand and improve C4.5
Author Tom Dietterich, Michael Kearns, and Yishay Mansour
Year 1996.
Venue In Machine learning: Proceedings of the thirteenth international conference,
Class dietterich1996
Title Applying the weak learning framework to understand and improve C4.5
Author Dietterich, T. G., Kearns, M., & Mansour, Y.
Year 1996.
Venue In Proceedings of the 13th International Conference on Machine Learning,
Other (pp. 96–104). Morgan Kaufmann
Class dietterich1996

We selected metadata records describing scientific papers published in some Brazilian and international conferences. We
call this dataset Libraries. Table 2 indicates the number of records from each digital library for each conference, totaling 1985 metadata records. This table also shows the selected interval of publication years. The last line of the table presents
the total of selected digital objects in the Libraries dataset.
The conferences SBBD, SBIA and SIBGRAPI are indexed by both digital libraries, so there exist duplicated metadata records in both of them. A specialist user identified 415 true matches coming from these three conferences in the dataset.
The first specific goal of our experiments was to deduplicate these 415 pairs of metadata records. Table 3 shows an example
of duplicated records in this first dataset.
The second dataset used in our experiments was extracted from the Cora collection. Cora is a collection of 1295 distinct
citations to 122 Computer Science research papers from the Cora Computer Science research paper search engine (McCallum,
Nigam, Rennie, & Seymore, 2000). The citations were segmented into multiple fields by an information extraction system,
resulting in some crossover noise among the fields (Bilenko & Mooney, 2003). For instance, some dates were decomposed
into some fields other than year. There are 2191 records distributed in 305 distinct classes. This collection has been used


before for experimental evaluations in other related works (Bilenko & Mooney, 2003; Carvalho et al., 2006; Carvalho et al.,
2008). Table 4 shows an example of duplicated records in this collection.
The field class is a label whose function is to identify which group of duplicated records a specific record belongs to. Thus,
the second specific goal of our experiments was to deduplicate the 305 citation records replicated in the Cora collection, totaling 29,611 pairs of metadata records. We extracted only the fields class, title, author and year to compose the dataset
used in our experiments.

4.2. Results of the first group of experiments

This section describes the first group of experiments: the comparison of our functions with the baselines Guth, Acronyms, Fragments and MCV Set. In this group of experiments we varied the following parameters of MetadataMatch (a grid-evaluation sketch follows the list):

• NameFunc – name similarity function, which can assume Guth, Acronyms, Fragments or the proposed function IniSim;
• MatchAlg – author name matching algorithm, which can assume MCV Set or the proposed NameMatch algorithm;
• tY – publication year difference;
• tL – threshold applied to the function Levenshtein;
• tF – threshold applied to the function Fragments;
• tN – threshold applied to the function NameMatch.

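A minimal sketch of such a grid evaluation, reusing metadata_match (Section 3.3) and precision_recall_f1 (Section 4) from the earlier sketches; records and gold_pairs are placeholders for a loaded dataset and its specialist-identified matches:

from itertools import combinations, product

def evaluate_grid(records, gold_pairs, year_diffs, name_thresholds, title_thresholds):
    """Score MetadataMatch over all record pairs for every threshold combination."""
    for t_y, t_n, t_l in product(year_diffs, name_thresholds, title_thresholds):
        predicted = {
            (i, j)
            for (i, a), (j, b) in combinations(enumerate(records), 2)
            if metadata_match(a, b, t_y, t_n, t_l) == 1
        }
        yield (t_y, t_n, t_l), precision_recall_f1(predicted, set(gold_pairs))

# e.g., for params, (p, r, f1) in evaluate_grid(records, gold_pairs,
#           [0, 1], [0.75, 1.0], [0.5, 0.6, 0.7]): print(params, f1)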
4.2.1. Libraries experiments


For the experiments with the Libraries dataset, the thresholds assume the following values:

• tY = 0 (same year) or tY = 1 (1 year before/after);
• 0.5 ≤ tL ≤ 0.7;
• 2 ≤ tF ≤ 4;
• 0.75 ≤ tN ≤ 1.0.

The Libraries experimental results are summarized in Table 5, which presents precision, recall and the balanced F-measure for each combination of parameters. We show the worst and best F-measures for each combination between the name similarity function and the author name matching function. The quality measures were determined based on the 415 pairs of metadata records identified by the specialist.

Table 5
Deduplication results for the Libraries dataset.

   NameFunc    MatchAlg    tY   tL    tF   tN     Precision (%)   Recall (%)   F-measure (%)
1  Guth        MCV Set     1    0.7   NA   NA     100.00          42.89        60.03
2  Guth        MCV Set     0    0.5   NA   NA     99.44           43.13        60.17
3  Acronyms    MCV Set     1    0.5   NA   NA     98.51           63.86        77.49
4  Acronyms    MCV Set     0    0.7   NA   NA     100.00          63.61        77.76
5  Fragments   MCV Set     1    0.5   2    NA     99.19           88.92        93.77
6  Fragments   MCV Set     0    0.5   4    NA     99.48           91.81        95.49
7  IniSim      MCV Set     1    0.5   NA   NA     98.74           94.70        96.68
8  IniSim      MCV Set     0    0.5   NA   NA     99.49           94.70        97.04
9  Guth        NameMatch   1    0.7   NA   1.0    100.00          42.89        60.03
10 Guth        NameMatch   0    0.5   NA   0.75   99.03           49.40        65.92
11 Acronyms    NameMatch   1    0.5   NA   1.0    98.46           61.69        75.85
12 Acronyms    NameMatch   0    0.5   NA   0.75   99.31           69.40        81.70
13 Fragments   NameMatch   1    0.5   2    1.0    99.19           88.92        93.77
14 Fragments   NameMatch   0    0.5   4    0.75   99.49           94.22        96.78
15 IniSim      NameMatch   1    0.5   NA   1.0    98.73           93.49        96.04
16 IniSim      NameMatch   0    0.5   NA   0.75   99.50           95.42        97.42

NA = Not applicable.

4.2.2. Cora experiments


For the experiments involving the Cora dataset, the thresholds assume the following values:

• tY = 0 (same year) or tY = 1 (1 year before/after);
• 0.5 ≤ tL ≤ 0.7;
• 1 ≤ tF ≤ 4;
• 0.6 ≤ tN ≤ 1.0.

Table 6
Deduplication results for the Cora dataset.

NameFunc MatchAlg tY tL tF tN Precision (%) Recall (%) F-measure (%)


1 Guth MCV Set 0 0.7 NA NA 87.31 15.34 26.09
2 Guth MCV Set 1 0.5 NA NA 80.52 17.39 28.61
3 Acronyms MCV Set 0 0.7 NA NA 87.75 32.78 47.73
4 Acronyms MCV Set 1 0.6 NA NA 82.92 36.94 51.11
5 Fragments MCV Set 0 0.7 1 NA 88.87 63.67 74.19
6 Fragments MCV Set 1 0.6 4 NA 83.59 71.67 77.17
7 IniSim MCV Set 0 0.7 NA NA 88.89 69.02 77.71
8 IniSim MCV Set 1 0.6 NA NA 83.88 78.47 81.08
9 Guth NameMatch 0 0.7 NA 1.0 87.31 15.34 26.09
10 Guth NameMatch 1 0.5 NA 0.6 82.42 21.61 34.24
11 Acronyms NameMatch 0 0.7 NA 1.0 87.75 32.75 47.70
12 Acronyms NameMatch 1 0.6 NA 0.6 84.33 45.76 59.33
13 Fragments NameMatch 0 0.7 1 1.0 88.87 63.67 74.19
14 Fragments NameMatch 1 0.6 4 0.6 84.01 76.80 80.24
15 IniSim NameMatch 0 0.7 NA 1.0 88.86 68.80 77.55
16 IniSim NameMatch 1 0.6 NA 0.6 83.90 81.33 82.60

NA = Not applicable.

The Cora experimental results are summarized in Table 6, which presents the same columns shown in Table 5. The quality
measures were determined based on the 305 distinct bibliographic citations identified by means of the class field.

4.2.3. Analysis of the results


Observing Table 5, we notice that, for the Libraries dataset, all experiments achieved precision values higher than 98%, i.e.,
all tested functions are very effective for correctly identifying duplicate objects. Meanwhile, the experiments with the functions Guth and Acronyms showed recall values lower than 70% (lines 1–4; 9–12). Experiments with Fragments and the proposed function IniSim achieved recall values between 88.92% and 95.42% (lines 5–8; 13–16). In Table 6, we can observe a similar behavior for the Cora dataset. All experiments achieved precision values higher than 80%. Experiments with the functions Guth and Acronyms showed recall values lower than 46% (lines 1–4; 9–12). Experiments with the functions Fragments
and IniSim showed recall values between 63.67% and 81.33% (lines 5–8; 13–16).
The overall quality of the functions can be assessed by the F-measure. Observing Table 5, the best result with the proposed functions NameMatch and IniSim achieved an F-measure equal to 97.42% (line 16). The best result using only functions
previously described in the literature is the combination of Fragments and MCV Set, which scored an F-measure equal to
95.49% (line 6). In Table 6, we can observe again a similar behavior. The best result with the proposed functions achieved
an F-measure equal to 82.60% (line 16) while the best result with functions previously described in the literature scored only
77.17% (line 6). In sum, when compared to the best combination of existing functions, the quality improvement of the proposed functions NameMatch and IniSim is around 2% in the Libraries dataset and 7% in the Cora dataset, both improvements
being statistically significant, as we shall see. Notice also that, in the case of the Libraries dataset, improvements are very
hard to obtain given the already high values of the F-measure obtained by the existing functions. Comparing our functions
with all combinations of existing functions, we obtained improvements up to 62% (Libraries dataset) and 188% (Cora
dataset).
To better understand the sensitivity of the functions with respect to the number of authors to be matched, we ran a
few additional experiments. With the threshold tN equal to 0.75 (1/4 of unmatched authors; F = 97.42%) for the Libraries
dataset and with tN equal to 0.8 (1/5 of unmatched authors; F = 80.18%) and 0.6 (2/5 of unmatched authors; F = 82.6%) for
the Cora dataset, the experiments produced better results than using tN equal to 1.0 (F = 96.04% at most for Libraries and
F = 77.5% for Cora). This was due to the increase in recall without a major loss in precision. Lower values for these thresholds did not produce higher F-measure values. This fact shows that NameMatch is a useful function to deduplicate bibliographic metadata records because it does not need to match the exact number of authors. It is worth noticing, however, that providing general guidelines/indications about how to set the similarity threshold for the number of ‘‘IniSim-compatible’’ authors is very difficult since this is a collection-dependent task. In any case, since we found that comparing all authors is hardly necessary and in fact can produce worse results, one could start with the best values we used in our experiments (respectively 1/4 and 2/5 of unmatched authors) and adjust these values empirically using a small portion of the collection.
Regarding the effect of the tY parameter, for the Libraries dataset changes from tY = 0 to tY = 1 did not produce significant
differences in the results. In the case of the Cora dataset, small improvements of about 2% on average were obtained using tY = 1, showing that a few instances in this dataset do have differences in the publication year of at most 1 year. Full results for these experiments are omitted due to space constraints.
To check whether our improvements are in fact statistically significant, we performed a paired Wilcoxon test (Siegel,
1956; Wilcoxon, 1945) comparing the proposed functions IniSim and NameMatch with the functions Fragments and MCV

Set. We used the Wilcoxon test because the samples are not normally distributed. The values of 1-tailed p in the Libraries
dataset (415 observations) and in the Cora dataset (1000 observations) were lower than 0.0001. Therefore, our approach
has a performance that is statistically superior to that of the best combination of existing functions, since the 1-tailed p values calculated are lower than the statistical significance threshold α = 0.01.
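For reproducibility, a one-tailed paired Wilcoxon test of this kind can be run with SciPy (assuming a version recent enough to support the alternative argument); the score arrays below are placeholders, not the paper's data:

from scipy.stats import wilcoxon

# Placeholder per-observation scores (e.g., 1 for a correctly resolved pair, 0 otherwise);
# the real test uses the 415 (Libraries) and 1000 (Cora) paired observations.
ours = [1, 1, 0, 1, 1, 1, 0, 1, 1, 0]
baseline = [1, 0, 0, 1, 0, 1, 0, 0, 0, 0]

stat, p_value = wilcoxon(ours, baseline, alternative="greater")  # 1-tailed test
print(p_value < 0.01)  # significant at alpha = 0.01?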

4.2.4. Analysis of the cases of failure


In order to better understand our results, we have analyzed the cases of failure of the functions. This analysis might be very useful for developers of new techniques for automatic bibliographic metadata deduplication because it shows the difficulties of matching some metadata fields such as proper names. We compared the best results of the proposed functions (Table 5, line 16; Table 6, line 16) with the best results of the functions already proposed in the literature: Fragments and MCV Set (Table 5, line 6; Table 6, line 6).
Table 7 summarizes this analysis. It presents the number of relevant pairs of metadata records which were not identified for each case of failure and function combination. Analyzing the experiments with the combination of the proposed functions (called ‘‘I/N’’ in the table), the number of relevant pairs of duplicate metadata records that were not identified is 19 out of 415 in the Libraries dataset and 5528 out of the 29,611 in the Cora dataset. For the combination of functions already proposed in the literature (called ‘‘F/S’’ in the table), 34 (Libraries dataset) and 8408 (Cora dataset) relevant pairs were not identified.
The problems presented by the function combinations fall into one of the following cases:

• Omission of suffixes like Junior (or Jr.) (line 1 of Table 7) – function IniSim considers the last initial of the author names. For example, ‘‘Roberto Marcondes Cesar Junior’’ and ‘‘R. Cesar’’ were not identified as variations of the same author name.
This problem occurred in seven pairs of non-identified duplicate metadata records in the Libraries dataset. This problem
affected the results of the combination ‘‘F/S’’ similarly for the same records and five additional ones, totaling 12 errors,
due to the need of MCV Set to match the exact number of authors (some misidentifications of the ‘‘I/N’’ approach were
compensated by the flexibility of being able to match fewer authors). This problem can be partially corrected by suffix removal.
• Omission of the first or last name (line 2 of Table 7) – function IniSim considers the first and the last initial of the author
names. For example, ‘‘Gabriel P. Lopes’’ and ‘‘Jose Gabriel Pereira Lopes’’ were not identified as variants of the same author
name. This problem occurred in four pairs of non-identified duplicate metadata records in the Libraries dataset and in 579
pairs in the Cora dataset when we used our ‘‘I/N’’ combination. This problem is even worse for the ‘‘F/S’’ combination, even
if it happens with only one author of a record, since, as explained before, the function MCV Set needs to match the exact
number of authors. For the Cora dataset, this resulted in additional pairs not being identified, reaching 1120 cases of failure for the known functions. In the case of the Libraries dataset, one additional pair was not identified.
• Inversion of the penultimate and last names (line 3 of Table 7) – IniSim allows inversions only if the last name appears as
the first name. For example, ‘‘Schubert R. Carvalho’’ and ‘‘Schubert Carvalho Ribeiro’’ have the penultimate and last names
inverted. This problem occurred in only one pair of the non-identified duplicate metadata records for the Libraries dataset, affecting both combinations of functions.
• Different number of authors (line 4 of Table 7) – when NameMatch uses the threshold tN = 100%, it does not allow two records to have a different number of authors. Even using a lower threshold (e.g., 75%), some pairs of duplicate metadata records differ widely in the number of authors (more than 25%). This problem occurred in five pairs of non-identified duplicate metadata records for the Libraries dataset and in 214 pairs for the Cora dataset when using the ‘‘I/N’’ combination. Given the inflexibility of the combination ‘‘F/S’’, which requires the match of the exact number of authors, this problem caused errors in additional pairs, reaching nine cases in the Libraries dataset and 666 cases in total for Cora.



• Omission of words in the title or different titles (line 5 of Table 7) – in these cases the function Levenshtein did not correctly identify the similarity between the titles. For example, ‘‘3D Reconstruction of Tomographic Images of Coronal Loops based on Image Metamorphosis’’ and ‘‘Image Morphing Applied to 3D Reconstruction of Coronal Loops’’ differ in several words. This problem occurred in two pairs of non-identified duplicate metadata records for the Libraries dataset and in 404 pairs for the Cora dataset.
• Abbreviation of names (line 6 of Table 7) – the function Fragments uses edit distance to compare fragments of names. This function fails when the fragments differ more than the specified threshold. For example, ‘‘Florentino Fdez-Riverola’’ and ‘‘Florentino Fernandez Riverola’’ are not identified as variations of the same author name because the edit distance between ‘‘Fdez’’ and ‘‘Fernandez’’ is 5. The experiments were performed using thresholds varying between 2 and 4. This problem occurred in only one pair of non-identified duplicate metadata records for the Libraries dataset.
• Inversion of the last name (line 7 of Table 7) – the function Fragments allows inversions only of middle names. Inversions of last names are detected only in the presence of the comma character as an indicator of inversion. For example, ‘‘Hee Cho Zang’’ and ‘‘Zang Hee Cho’’ are not identified as variations of the same author name because there is an inversion of the last name. This problem occurred in four pairs of non-identified duplicate metadata records for the Libraries dataset and in 1887 pairs for the Cora dataset when using the ‘‘F/S’’ combination. The ‘‘I/N’’ combination, due to its flexibility regarding the position of the names, did not suffer from this problem.
• Omission of publication year (line 8 of Table 7) – the function MetadataMatch needs the publication years to return a correct value, but these were missing in 3974 pairs of non-identified duplicate metadata records for the Cora dataset.
• tY > 1 (line 9 of Table 7) – the experimental configuration allows the difference in the values of the publication year present in the records to be at most 1. Records with a year difference greater than 1 were not identified as possible replicas. This problem occurred in 357 pairs of non-identified duplicate metadata records for the Cora dataset.

Table 7
Cases of failure of the deduplication functions.

                                                         Libraries dataset      Cora dataset
   Case of failure                                       I/N       F/S          I/N       F/S
1  Omission of suffixes like Junior (or Jr.)             7         12           –         –
2  Omission of the first or last name                    4         5            579       1120
3  Inversion of the penultimate and last names           1         1            –         –
4  Different number of authors                           5         9            214       666
5  Omission of words in the title or different titles    2         2            404       404
6  Abbreviation of names                                 0         1            –         –
7  Inversion of the last name                            0         4            0         1887
8  Omission of publication year                          –         –            3974      3974
9  tY > 1                                                –         –            357       357
Σ                                                        19        34           5528      8408

I = IniSim, N = NameMatch, F = Fragments, S = MCV Set.

To summarize the discussion, when compared to the best combination of functions found in the literature, our proposed
functions have solved around 44% of the cases of failure in the Libraries dataset and 34% of them in the Cora dataset.

4.3. Results of the second group of experiments

This section describes the second group of experiments: the comparison of our approach with a state-of-the-art supervised genetic programming method for record deduplication (Carvalho et al., 2008).
For our approach, we experimented with the same parameters used in the first group of experiments (Section 4.2) except
the name similarity and the author name matching functions: we used the proposed functions IniSim and NameMatch. The
similarity thresholds were also the same.
The content of the datasets was shuffled and divided into two separate subsets: a subset for training the genetic programming algorithm and a larger subset for the tests. We used for training 250 out of the 2191 records present in the Cora dataset and 250 out of the 1985 records present in the Libraries dataset. This process of generating the training and test subsets was repeated five times, totaling five random datasets. We performed the experiments in all random datasets and calculated the mean of the results of all runs. This experimental setup is similar to that used in (Carvalho et al., 2008).
The Cora and Libraries experimental results are summarized in Table 8, which presents the mean balanced F-measure and the standard deviation for the best results of each deduplication method. We used the following threshold values for the function MetadataMatch: tY = 1, tN = 0.6, tL = 0.6 when using the Cora dataset and tY = 0, tN = 0.75, tL = 0.7 when using the Libraries dataset. The metrics used to evaluate the results are the same adopted in the previous experiments. Notice that some of the parameters that produced the best results are a little different from those in the previous experiments because the experimental configuration is also different.

Table 8
Deduplication results for the Cora and Libraries datasets.

   Method          Dataset     F-measure (%)   Standard deviation (%)
1  GP-based        Libraries   94.53           2.06
2  MetadataMatch   Libraries   97.32           0.46
3  GP-based        Cora        82.21           1.10
4  MetadataMatch   Cora        82.42           0.16
The overall quality of the deduplication methods can be assessed by the F-measure and the standard deviation. Observing
Table 8, MetadataMatch achieved F-measure equal to 97.32% ± 0.46% in the Libraries dataset (line 2) and 82.42% ± 0.16% in
the Cora dataset (line 4). The GP-based method scored F-measure and standard deviation values equal to 94.53% ± 2.06% (line
1) and 82.21% ± 1.10% (line 3), respectively. In sum, there was a slight advantage for our approach in the Libraries dataset and
a statistical tie in the Cora dataset. However, differently from the GP-based method, our functions do not require any training
data which, in some cases, for instance, large repositories, may be very expensive or even impossible to obtain.
We also quantified the reduction in the number of comparisons for each condition and function internally performed by
the function MetadataMatch. Analyzing the results for the Libraries dataset, the average number of comparisons made
between the publication years was 1,504,245. NameMatch was applied to 363,637 out of the 1,504,245 pairs since the


authors of a metadata record are compared only if the absolute value of the difference between the publication years is less
than or equal to the threshold tY. Similarly, titles are compared only if the function NameMatch finds enough matches between author names. The average number of comparisons made by the function Levenshtein was only 383 out of the
1,504,245 pairs, i.e., the number of pairs for which the function NameMatch returns s = 1. Therefore, the average number
of comparisons when using the string matching function for titles, by far the most expensive type of comparison, was largely
reduced to 0.025% of the total possible number.
Analyzing the results for the Cora dataset, the average number of comparisons made between publication years was
1,882,770. The function NameMatch was applied to 401,429 pairs and the function Levenshtein was applied to only 31,623
pairs (1.68% of total possible number).

5. Conclusions

This paper proposes an effective and efficient heuristic-based approach to deduplicating bibliographic metadata. We propose a set of functions that are applied together to identify duplicate records with high precision and recall. Differently from
several related approaches presented in Section 2, which are based on machine learning techniques, ours does not require
any type of training, which sometimes is very expensive to carry out.
The experimental results show that the performance of the proposed functions IniSim and NameMatch is statistically superior to that of the baselines. Our functions improve the quality of the metadata deduplication task from 2% (in an already very accurate result which is difficult to improve) to 62% in a dataset containing metadata from two digital libraries, and from 7% to 188% in a dataset with article citation data. When compared to the best combination of functions in the literature, our proposed functions have solved around 44% of failure cases in the digital libraries dataset and 34% in the citation dataset.
When compared to a state-of-the-art method for replica identification, our approach performed slightly better in the digital libraries dataset and presented a statistical tie in the article citation dataset, without the burden and cost of any training process. The strategy of comparing first the publication years, followed by author names and then titles, largely reduced the processing time because only a few comparisons are required when a string matching algorithm is applied, which is by far the most expensive of the functions.
In sum, the main contributions of our work are the proposal of an efficient and effective approach for bibliographic metadata record deduplication and the analysis of the failure cases of the evaluated deduplication functions. This analysis is valuable for the development of new approaches because it shows the difficulties of matching some metadata elements such as proper names. Our approach was specifically designed for the digital libraries realm, and correctly and efficiently identifies variations in the representation of author names by using a linear algorithm, but it can be used to deduplicate metadata records in other domains where the deduplication of proper names is essential.
As future work, we intend to conduct experiments with additional datasets, including synthetic ones. We notice that the use of synthetic datasets makes it possible to vary other parameters like the number of replicas and the distance between the original and replicated values present in the repository. We also plan to conduct new experiments in other domains to confirm our intuition that our approach can be used to deduplicate metadata records where the deduplication of proper names is essential.

6. Acknowledgments

This research is partially supported by the Brazilian National Institute for Web Research (Grant number 573871/2008-6), CNPq Universal project ApproxMatch (Grant number 481055/2007-0), MCT/CNPq/CT-INFO projects Gestão de Grandes Volumes de Documentos Textuais (Grant number 550891/2007-2) and InfoWeb (Grant number 550874/2007-0), and by the authors' scholarships and individual research grants from CAPES, CNPq and FAPEMIG.

References

Baeza-Yates, R., & Ribeiro-Neto, B. (1999). Modern information retrieval. ACM Press/Addison-Wesley.
Bilenko, M., & Mooney, R. J. (2003). Adaptive duplicate detection using learnable string similarity measures. In Proceedings of the 9th ACM SIGKDD
International conference on knowledge discovery and data mining (pp. 39–48). Washington, DC, USA.
Boser, B. E., Guyon, I. M., & Vapnik, V. N. (1992). A training algorithm for optimal margin classifiers. In Proceedings of the 5th Annual workshop on
computational learning theory (pp. 144–152). Pittsburgh, PA, USA.
Carvalho, J. C. P., & da Silva, A. S. (2003). Finding similar identities among objects from multiple web sources. In Proceedings of the 5th ACM International
workshop on web information and data management (pp. 90–93). New Orleans, LA, USA.
Carvalho, M. G., Gonçalves, M. A., Laender, A. H. F., & da Silva, A. S. (2006). Learning to deduplicate. In Proceedings of the 6th Joint conference on digital libraries
(pp. 41–50). Chapel Hill, NC, USA.
Carvalho, M. G., Laender, A. H. F., Gonçalves, M. A., & da Silva, A. S. (2008). Replica identification using genetic programming. In Proceedings of the 23rd ACM
Symposium on applied computing (pp. 1801–1806). Fortaleza, CE, Brazil.
Chaudhuri, S., Ganjam, K., Ganti, V., & Motwani, R. (2003). Robust and efficient fuzzy match for online data cleaning. In Proceedings of ACM SIGMOD
international conference on management of data (pp. 313–324). San Diego, CA, USA.
Cohen, W. W., & Richman, J. (2002). Learning to match and cluster large high-dimensional data sets for data integration. In Proceedings of the 8th ACM SIGKDD
International conference on knowledge discovery and data mining (pp. 475–480). Edmonton, AB, Canada.
Convis, D. B., Glickman, D., & Rosenbaum, W. S. (1982). Alpha content match prescan method for automatic spelling error correction. US Patent 4 328 561.

Cota, R., Gonçalves, M. A., & Laender, A. H. F. (2007). A heuristic-based hierarchical clustering method for author name disambiguation in digital libraries. In
Proceedings of the 21st Brazilian symposium on databases (pp. 20–34). João Pessoa, PB, Brazil.
Doan, A., Noy, N. F., & Halevy, A. Y. (2004). Introduction to the special issue on semantic integration. SIGMOD Record, 33(4), 11–13.
Dorneles, C. F., Heuser, C. A., Lima, A. E. N., da Silva, A. S., & de Moura, E. S. (2004). Measuring similarity between collection of values. In Proceedings of the 6th
ACM International workshop on web information and data management (pp. 56–63). Washington, DC, USA.
Dorneles, C. F., Nunes, M. F., Heuser, C. A., Moreira, V. P., da Silva, A. S., & de Moura, E. S. (2009). A strategy for allowing meaningful and comparable scores in
approximate matching. Information Systems, 34(8), 673–689.
Fagin, R., Haas, L. M., Hernández, M. A., Miller, R. J., Popa, L., & Velegrakis, Y. (2009). Clio: Schema mapping creation and data exchange. In Conceptual
modeling: Foundations and applications, Heidelberg, Germany. LNCS (5600, pp. 198–236). Springer.
Fellegi, I. P., & Sunter, A. B. (1969). A theory for record linkage. Journal of the American Statistical Association, 64(328), 1183–1210.
Fox, E. A., Akscyn, R. M., Furuta, R. K., & Leggett, J. J. (1995). Digital libraries. Communications of the ACM, 38(4), 22–28.
Gonçalves, M. A., Fox, E. A., Watson, L. T., & Kipp, N. A. (2004). Streams, structures, spaces, scenarios, societies (5s): A formal model for digital libraries. ACM
Transactions on Information Systems, 22(2), 270–312.
Guth, G. J. A. (1976). Surname spellings and computerized record linkage. Historical Methods Newsletter, 10(1), 10–19.
Large, A., & Beheshti, J. (1997). OPACS: A research review. Library & Information Science Research, 19(2), 111–133.
Lawrence, S., Giles, C. L., & Bollacker, K. (1999). Digital libraries and autonomous citation indexing. Computer, 32(6), 67–71.
Levenshtein, V. I. (1966). Binary codes capable of correcting deletions, insertions and reversals. Soviet Physics Doklady, 10(8), 707–710.
Ley, M. (2002). The DBLP computer science bibliography: Evolution, research issues, perspectives. In Proceedings of the 9th International string processing and
information retrieval symposium, Lisbon, Portugal. LNCS (2476, pp. 1–10). Springer.
Lima, A. E. N. (2002). Pesquisa de similaridade em XML. Monografia de Graduação em Ciência da Computação, Instituto de Informática, UFRGS, Porto Alegre,
RS, Brazil (in Portuguese).
Manning, C. D., Raghavan, P., & Schütze, H. (2008). Introduction to information retrieval. Cambridge University Press.
McCallum, A. K., Nigam, K., Rennie, J., & Seymore, K. (2000). Automating the construction of internet portals with machine learning. Information Retrieval,
3(2), 127–163.
Quinlan, J. R. (1986). Induction of decision trees. Machine Learning, 1(1), 81–106.
Rahm, E., & Bernstein, P. A. (2001). A survey of approaches to automatic schema matching. The VLDB Journal, 10(4), 334–350.
Siegel, S. (1956). Nonparametric statistics for the behavioral sciences. McGraw-Hill.
Stasiu, R. K., Heuser, C. A., & da Silva, R. (2005). Estimating recall and precision for vague queries in databases. In Proceedings of the 17th international
conference on advanced information systems engineering, Porto, Portugal. LNCS (vol. 3520, pp. 187–200). Springer.
Tejada, S., Knoblock, C. A., & Minton, S. (2001). Learning object identification rules for information integration. Information Systems, 26(8), 607–633.
Wilcoxon, F. (1945). Individual comparisons by ranking methods. Biometrics, 1, 80–83.
