Combining Classifiers and Relevance Feedback for the Ambiguous Author Name Problem of Scientific Papers in Digital Libraries

Emilia A. de Souza
Dept. of Computer Science
Federal University of Ouro Preto
[email protected]
Anderson A. Ferreira
Dept. of Computer Science
Federal University of Ouro Preto
[email protected]

ABSTRACT
This paper presents a method that combines classifiers and uses
user feedback to solve the ambiguous author name problem.
A clustering method groups references into pure groups, which
provide the training examples. In the next step, the classifiers
are combined to produce a stronger classifier that learns a
similarity function for assigning authors to the correct groups.
We use three classification algorithms: Support Vector Machines
(SVM), K-Nearest Neighbor (KNN), and Random Forest. There is
also a phase in which the user iteratively attempts to identify
the author of the references. Experiments demonstrate that the
proposed method yields better results than state-of-the-art
disambiguation methods on two traditional datasets.
1. INTRODUCTION
A great challenge in information retrieval is the ambiguity of
human language, which occurs when a word or set of words has
more than one sense. The ambiguity problem arises in several
domains, and the scope of name ambiguity generally depends on
the specific problem being addressed. In the context of
scientific papers, the author name ambiguity problem stems from
the lack of common standards for representing papers and from
their storage in different digital libraries.
An author name is ambiguous when the same author appears under
different names or when different authors have similar names.
Author name disambiguation approaches can be classified
according to the attributes they explore as evidence in
citations. The most common kinds of evidence are the author and
coauthor names (those who share the authorship of a
publication), the work title, and the publication venue title.
To remove the ambiguity, we first have to group the occurrences
of similar author names and then assign the name of an author to
each reference.

The state-of-the-art method for author name disambiguation,
named SAND (Self-training Associative Name Disambiguator),
works in two steps. The first step produces pure clusters of
citations that are associated with the same author. The second
step generates a disambiguation function, using association
rules, that predicts the correct author for new citations.
This work therefore aims to form clusters of citations that
correspond to the same author. Citations are grouped together if
they have similar author names and at least one coauthor in
common. We extract similarity vectors from the created clusters.
Each vector is based on a comparison of the evidence found in
all the citations grouped in a cluster.
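To make the grouping rule concrete, the following Python sketch clusters citations that have compatible author names and share at least one coauthor, and then builds a similarity vector between two clusters. It is only an illustration under stated assumptions: the name test (same last name and first initial), the Jaccard coefficient, the dictionary fields `author`, `coauthors`, `title`, and `venue`, and all function names are hypothetical choices, not necessarily the procedure used in this work.

```python
from itertools import combinations


def name_key(name):
    """Reduce a name to (first initial, last name). Assumed heuristic."""
    parts = name.lower().replace(".", " ").split()
    return (parts[0][0], parts[-1])


def compatible(a, b):
    """Hypothetical 'similar author name' test."""
    return name_key(a) == name_key(b)


def cluster_citations(citations):
    """Union citations with compatible author names and a shared coauthor."""
    parent = list(range(len(citations)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(citations)), 2):
        ci, cj = citations[i], citations[j]
        shared = set(ci["coauthors"]) & set(cj["coauthors"])
        if compatible(ci["author"], cj["author"]) and shared:
            parent[find(i)] = find(j)

    clusters = {}
    for i in range(len(citations)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


def jaccard(a, b):
    """Jaccard coefficient of two collections (0.0 if both are empty)."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0


def similarity_vector(cluster_a, cluster_b, citations):
    """One similarity per kind of evidence: coauthors, title, venue."""
    def pool(cluster, field):
        items = []
        for idx in cluster:
            value = citations[idx][field]
            # Coauthors are a list of names; titles and venues are split into words.
            items.extend(value if isinstance(value, list) else value.lower().split())
        return items

    fields = ("coauthors", "title", "venue")
    return [jaccard(pool(cluster_a, f), pool(cluster_b, f)) for f in fields]


# Toy data, invented for illustration only.
citations = [
    {"author": "A. Gupta", "coauthors": ["M. Silva"],
     "title": "Name disambiguation in digital libraries", "venue": "JCDL"},
    {"author": "Anil Gupta", "coauthors": ["M. Silva", "J. Lee"],
     "title": "Clustering citations in digital libraries", "venue": "JCDL"},
    {"author": "A. Gupta", "coauthors": ["R. Chen"],
     "title": "Protein folding kinetics", "venue": "Bioinformatics"},
]
clusters = cluster_citations(citations)
print(clusters)  # [[0, 1], [2]]
print(similarity_vector(clusters[0], clusters[1], citations))
```

In this toy run the third citation stays in its own cluster because it shares no coauthor with the others, which illustrates how a strict coauthor rule favors pure but fragmented clusters. The similarity vector between two clusters is the kind of input a learned similarity function could use to decide whether they should be linked.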
The clusters are split into two sets: the most dissimilar
clusters are selected for training, and the remaining clusters
compose the test set, so that the data is stratified and
representative. The training data is used to train our
classifier (SVM), and the learned similarity function infers the
correct author for the citations in the test set and is used to
link the clusters. For the evaluation, we use citations
extracted from the DBLP collection. The results show that the
method, which is unsupervised, tends to extract highly pure
clusters but increases fragmentation. The best results in terms
of fragmentation are achieved after the clusters are linked.
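The trade-off between purity and fragmentation can be quantified. The sketch below assumes the average cluster purity (ACP), average author purity (AAP), and K metrics that are common in the name disambiguation literature. The text above does not name its metrics, so this choice, the function names, and the toy labels are assumptions for illustration only.

```python
from collections import Counter
from math import sqrt


def average_cluster_purity(clusters, true_author):
    """ACP: 1.0 when every cluster holds citations of a single author."""
    total = sum(len(c) for c in clusters)
    score = 0.0
    for cluster in clusters:
        counts = Counter(true_author[c] for c in cluster)
        score += sum(n * n for n in counts.values()) / len(cluster)
    return score / total


def average_author_purity(clusters, true_author):
    """AAP: 1.0 when each author's citations sit in a single cluster."""
    total = sum(len(c) for c in clusters)
    author_size = Counter(true_author[c] for cluster in clusters for c in cluster)
    score = 0.0
    for cluster in clusters:
        counts = Counter(true_author[c] for c in cluster)
        score += sum(n * n / author_size[a] for a, n in counts.items())
    return score / total


def k_metric(clusters, true_author):
    """Geometric mean of ACP and AAP."""
    acp = average_cluster_purity(clusters, true_author)
    aap = average_author_purity(clusters, true_author)
    return sqrt(acp * aap)


# Toy ground truth, invented for illustration: citation id -> real author.
true_author = {0: "gupta_1", 1: "gupta_1", 2: "gupta_1", 3: "gupta_2"}

fragmented = [[0, 1], [2], [3]]  # pure, but gupta_1 is split in two
linked = [[0, 1, 2], [3]]        # the same clusters after linking

for name, clustering in (("fragmented", fragmented), ("linked", linked)):
    print(name,
          round(average_cluster_purity(clustering, true_author), 3),
          round(average_author_purity(clustering, true_author), 3),
          round(k_metric(clustering, true_author), 3))
```

In the fragmented clustering ACP is 1.0 while AAP falls to about 0.667, because one author is spread over two clusters. After linking, both values reach 1.0, which mirrors the observation that fragmentation improves once the clusters are linked.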
This paper is organized as follows. Section 2 discusses related
work on author name disambiguation. Section 3 details the
proposed method. Section 4 reports the evaluation of the
proposed method and compares its effectiveness with that of
SAND. Finally, Section 5 presents the conclusion.
