International Journal of Computational Intelligence Systems, Vol. 5, No. 6 (November, 2012), 1052-1067

NAMED ENTITY DISAMBIGUATION: A HYBRID APPROACH

Hien T. Nguyen
Ton Duc Thang University, Viet Nam
E-mail: hien@tdt.edu.vn

Tru H. Cao
Ho Chi Minh City University of Technology, Viet Nam
E-mail: [email protected]

Received 2 January 2012


Accepted 27 July 2012

Abstract
Semantic annotation of named entities for enriching unstructured content is a critical step in the development of the Semantic Web and many Natural Language Processing applications. To this end, this paper addresses the named entity disambiguation problem, which aims at detecting entity mentions in a text and then linking them to entries in a knowledge base. We propose a hybrid method, combining heuristics and statistics, for named entity disambiguation. The novelty is that the disambiguation process is incremental and includes several rounds that filter the candidate referents, by exploiting previously identified entities and extending the text with the attributes of those entities every time they are successfully resolved in a round. Experiments are conducted to evaluate the proposed method and show its advantages. The experimental results show that our approach achieves high accuracy and can be used to construct a robust entity disambiguation system.
Keywords: Entity disambiguation. Entity linking. Named entity. Knowledge base. Wikipedia.

Published by Atlantis Press
Copyright: the authors

1. Introduction

In the Information Extraction (IE) and Natural Language Processing (NLP) areas, named entities (NEs) are people, organizations, locations, and other things that are referred to by proper names. Having arisen from research in those areas, named entities have also become a key issue in the development of the Semantic Web [37]. That is because, in many domains, in particular news articles, the information and semantics of the article texts center around the named entities and their relations mentioned therein. In 2001, Berners-Lee et al. [37] described the evolution from a Web of documents for humans to read to a Web of data where information is given well-defined meaning for computers to manipulate. The Semantic Web is an extension of the current Web that adds new data and metadata to existing Web documents so that computers can automatically integrate and re-use data across various applications. In that spirit, extracting named entities in texts and adding semantics and metadata about those entities to the texts themselves, with the support of ontologies or knowledge bases (KB) such as KIM [45] or Wikipedia (http://www.wikipedia.org), has increasingly attracted researchers’ attention.

Over the past decade, Named Entity Recognition (NER) has become an interesting topic, attracting much research effort, with various approaches introduced for different domains, scopes, and purposes [35, 36, 38, 39]. Some work on NER addresses the classification of NEs into broad categories such as Person, Organization, or Location [34, 36, 38], while other work classifies NEs into more fine-grained categories that are specified by a given ontology [35, 39]. In recent years, some well-known systems such as SemTag [46] and KIM have attempted not only fine-grained categorization but also identification of NEs with respect to a given ontology.

One great challenge in dealing with named entities is that one name may refer to different entities in different occurrences, and one entity may have different names that may be written in different ways and with spelling errors. For example, the name “John McCarthy” in different occurrences may refer to different NEs such as a computer scientist from Stanford University, a linguist from the University of Massachusetts Amherst, an Australian ambassador, a British journalist who was kidnapped by Iranian terrorists in Lebanon in April 1986, etc. Such ambiguity makes the identification of NEs more difficult and raises the NE disambiguation (NED) problem as one of the main challenges to research not only in the Semantic Web but also in natural language processing in general.

Indeed, over the past five years, many approaches have been proposed for NED [1-23, 27, 28]. Moreover, since 2009, the Entity Linking (EL) shared task held at the Text Analysis Conference (TAC) [1, 9] has attracted more and more attention to linking entity mentions to knowledge base entries [1, 3, 4, 6, 7, 8, 9, 12, 15]. In the EL task, given a query that consists of a named entity (PER, ORG, or geographical entity) and a background document containing that named entity, the system is required to provide the ID of the KB entry describing that named entity, or NIL if there is no such KB entry [9]. The KB used is Wikipedia. Even though those approaches to EL exploited diverse features and employed many learning models [1, 8, 9, 12, 15], a hybrid approach that combines rules and statistics has not been proposed.

In this paper, we present our work that aims at detecting named entities in a text, disambiguating them, and linking them to the right ones in Wikipedia. The proposed method is both rule-based and statistical. It utilizes NEs and related terms co-occurring with the target entity in a text and in Wikipedia for disambiguation, the intuition being that these respectively convey its relationships and attributes. For example, suppose that in a KB there are two entities named “Jim Clark”, one of which has a relation with the Formula One car racing championship and the other with Netscape. Then, if in a text where the name appears there are occurrences of Netscape or web-related referents and terms, it is more likely that the name refers to the one related to Netscape in the KB.

The contribution of this paper is three-fold. First, we propose a hybrid method that combines heuristics and a learning model for disambiguation and identification of NEs in a text with respect to Wikipedia. Second, the proposed disambiguation process is iterative and incremental: each round exploits the previously identified entities and extends the text with the attributes of those identified entities in order to disambiguate the remaining named entities. Third, our method makes use of disambiguation texts in Wikipedia article titles as an important feature for resolving the right entities for some mentions in a text, and then the identifiers of those entities are exploited as anchors to disambiguate the others. Note that this work is based on [21], [22], and [23].

The rest of the paper is organized as follows. Section 2 presents Wikipedia and related work. Section 3 presents the disambiguation method in detail. Section 4 presents experiments and evaluation. Finally, we draw a conclusion in Section 5. Note that in the rest of this paper we use mention in the sense of a reference to an entity. The entity of a reference is called its referent. Therefore, we use the terms name and mention interchangeably, as well as the terms entity and referent.

2. Background

NED can be considered an important special case of Word Sense Disambiguation (WSD) [26]. The aim of WSD is to identify which sense of a word is used in a given context when several possible senses of that word exist. In WSD, words to be disambiguated may appear in either a plain text or an existing knowledge base. Techniques for the latter use a dictionary, thesaurus, or ontology as a sense inventory that defines possible senses of words. Having emerged recently as the largest and most widely used encyclopedia in existence, Wikipedia is used as a knowledge source not only for WSD [25], but also for IE, NLP, Ontology Building, Information Retrieval, and so on [24].

This paper proposes a method that also makes use of available knowledge sources of entities for NED, besides exploiting the context of a text where mentions of named entities occur. Exploiting an external source of knowledge for NED is natural and reasonable, as it mirrors the way humans do it. Indeed, when we ask a person to identify which entity a name in a text refers to, he or she may rely on knowledge accumulated from diverse sources, experiences, etc.

In the literature, the knowledge sources used for NED can be divided into two kinds: closed ontologies and open ontologies. Closed ontologies are built by experts following a top-down approach, with a hierarchy of concepts based on a controlled vocabulary and strict constraints, e.g., KIM, WordNet. These knowledge
sources are generally of high reliability, but their size and coverage are restricted. Furthermore, not only is the building of the sources labor-intensive and costly, but they are also not kept up to date with new discoveries and topics that arise daily. Meanwhile, open ontologies are built by collaborations of volunteers following a bottom-up approach, with concepts formed by a free vocabulary and community agreements, e.g., Wikipedia. Many open ontologies grow fast, cover a wide range of diverse topics, and are updated daily by volunteers, but some doubt the quality of their information content. Wikipedia is considered an open ontology whose articles are of high quality. Indeed, in [47], Giles investigated the accuracy of the content of articles in Wikipedia in comparison to those of articles in Encyclopedia Britannica, and showed that both sources were equally prone to significant errors.

2.1. Wikipedia

Wikipedia is a free encyclopedia written by a collaborative effort of a large number of volunteer contributors. We describe here some of its information resources for disambiguation. A basic entry in Wikipedia is a page (or article) that defines and describes a single entity or concept. It is uniquely identified by its title. When the name is ambiguous, the title may contain further information, which we call disambiguation text, to distinguish the entity described from others. The disambiguation text is separated from the name by parentheses, e.g., John McCarthy (computer scientist), or a comma, e.g., Columbia, South Carolina.

In Wikipedia, every entity page is associated with one or more categories, each of which can have subcategories expressing meronymic or hyponymic relations. Each page may have several incoming links (henceforth inlinks), outgoing links (henceforth outlinks), and redirect pages. A redirect page typically contains only a reference to an entity or a concept page. The title of the redirect page is an alternative name of that entity or concept. For example, from redirect pages of the United States, we extract alternative names of the United States such as “US”, “USA”, “United States of America”, etc. Other resources are disambiguation pages. They are created for ambiguous names, each of which denotes two or more entities in Wikipedia. Based on disambiguation pages, one may detect all entities that have the same name in Wikipedia.

Note that when searching for an entity by its name using the search tool of Wikipedia, if this name occurs in Wikipedia, it appears that Wikipedia ranks pages whose titles contain the name, and returns either the most relevant entity page or the disambiguation page for that name. For those cases when the returned page describes an entity, we set this entity as the default referent for that name. For example, when one queries “Oxford” from Wikipedia, it returns the page that describes the city Oxford in South East England. Therefore, in this case, for the name “Oxford”, we set its default referent to the city Oxford in South East England. For another example, when one queries “John McCarthy” from Wikipedia, the disambiguation page of the name “John McCarthy” is returned. In the case of “John McCarthy”, we do not set any default referent for this name.

2.2. Related Problems

In this section, we review related work on Entity Disambiguation. We are interested in locating in a KB the entity that a name in a text refers to. However, we start out by summarizing work on Record Linkage, which aims at detecting records within or across databases or files that refer to the same entity, and then links or merges them together. We then describe and summarize work on Cross-Document Co-reference Resolution, which aims at grouping mentions of entities in different documents into equivalence classes by determining whether any two mentions refer to the same entity. Next, we focus on two simplified cases of NED, namely Toponym Resolution and Person Disambiguation. Finally, we survey disambiguation solutions for NED.

Record Linkage

Record Linkage (RL) is a means of combining information from different sources such as databases or structured files in general. It has been known for more than five decades across research communities (i.e., AI and databases) under multiple names such as entity matching [51], entity resolution [53], duplicate detection [54], name disambiguation [56, 57], etc. The basic method of RL is to compare values of fields to identify whether any pair of records is associated with the same entity. NED is different from RL in that it analyses free texts to capture entity mentions and then links them to KB entries, rather than linking entity mentions from structured data sources.

A typical method proposed for RL involves two main phases, namely data preparation and matching [52]. The former is to improve the data quality, as well as to make the data comparable and more usable, such as by transforming those data from different sources into a
common form or standardizing the information represented in certain fields to a specific content format. The latter is to match records to identify whether they refer to the same real-world entity. Conventional matching approaches to RL focused on discovering independent pair-wise matches of records using a variety of attribute-similarity measures such as [54]. State-of-the-art matching methods are collective matches [51, 53, 55] that rely on sophisticated machine learning models such as the Latent Dirichlet Allocation topic model or Markov Logic Networks.

Cross-Document Co-reference Resolution

Cross-Document Co-reference Resolution (CDC) aims at grouping mentions of entities across documents into clusters, each of which consists of mentions that refer to the same real-world entity, rather than identifying what the actual entities are. Most approaches to this problem use clustering techniques. This paper addresses the NED problem, which aims at locating in a KB the entity that a mention in a document refers to. NED is different from CDC in that it takes a further step that links each mention in a document to a KB entry. If this step is ignored, one can consider NED as CDC. Motivated by the need to find information about persons on the World Wide Web, the Web People task, which has emerged as a challenging topic and attracted the attention of researchers in recent years, is a simplified case of CDC [44].

A typical solution to CDC usually contains three basic steps: (i) exploiting textual contexts where mentions of entities occur to extract contextual features for creating the profiles of those entities; (ii) calculating the similarity between profiles using similarity metrics; and (iii) applying clustering algorithms to group mentions of the same entities together. The profiles contain a mixture of collocation and other information that may denote attributes (personal information) and relations of those entities.

In general, the two main types of information that are often used for CDC are personal and relational information [43]. Personal information gives biographical information about each entity such as birthday, career, occupation, alias, and so on. Relational information specifies relations between entities such as the membership relation between Barack Obama and the Democratic Party of the United States. The relational information can be expressed explicitly or implicitly in documents. The explicit relational information of an entity may be captured by exploiting the local contexts where the mentions occur, whereas the implicit relational information lies far from the local contexts.

In particular, some solutions to CDC exploit features, which denote attributes of target entities to be disambiguated, in local contexts, such as token features [40, 50], bigrams [42], biographical information [48], or co-occurring NE phrases and NE relationships [50], whereas others try to extract information related to the NEs under consideration beyond local contexts [41, 43, 49]. After that, clustering algorithms are employed to cluster mentions of the same entities based on some similarity metric such as cosine, gain ratio, likelihood ratio, Kullback-Leibler divergence, etc. In general, the most popular clustering algorithm used by those methods is the Hierarchical Agglomerative Clustering (HAC) algorithm, although the choice of linkage varies, e.g., single-link or complete-link.

When applying clustering techniques to group mentions of entities together, since the number of clusters is not known in advance, the cluster-stopping criterion is a challenging issue. To deal with this issue when using techniques like HAC, the number of clusters in the output is determined by a fixed similarity threshold. Besides HAC, some works employ other models such as the classifiers in [49].

Toponym Resolution

Toponym Resolution (TR) is the task of identifying whether an entity mention refers to a place and mapping it to a geographic latitude/longitude footprint or a unique identifier in a KB. A conventional approach to TR typically involves two main sub-tasks: place name extraction and place name disambiguation. The former is to identify geographical mentions in a text. The latter first looks up candidate referents of a mention from an external source such as a constructed gazetteer or a particular ontology, and then disambiguates it by examining the context where the mention appears to choose the most contextually similar candidate referent as the right one.

In the literature, many methods have been proposed for TR, most of which are rule-based or machine learning methods. A complete survey of rule-based methods is given in [32]. Machine learning methods employed for TR consist of bootstrapping learning [30], unsupervised learning [31], or supervised learning [29].

In summary, although various methods have been introduced since 1999, an important issue of TR is that
those methods are usually evaluated on different corpora, under different conditions. The shortcoming of the methods proposed for TR is that they omit relationships between named entities of different classes, such as between persons and organizations, or organizations and locations, etc. Therefore, they are not suitable for NED, where entities belong to different types.

2.3. Related Work

Many approaches have been proposed for NED. All of them fit into three disambiguation strategies: local, global, and collective. Local methods disambiguate each mention independently based on local context compatibility between the mention and its candidate entities using some contextual features. Global and collective methods assume that disambiguation decisions are interdependent and that there is coherence between co-occurring entities in a text, enabling the use of measures of semantic relatedness for disambiguation. While collective methods perform disambiguation decisions simultaneously, global methods disambiguate each mention in turn.

Local approaches

A typical local approach to NED focused on local context compatibility between a mention and its candidate entities. First, contextual features of entities were extracted from their text descriptions. Then those extracted features were weighted and represented in a vector model. Finally, each mention in a text was linked to the candidate entity having the highest contextual similarity with it. Bunescu and Paşca [19] proposed a method that uses an SVM kernel to compare the lexical context around the ambiguous mention to that of its candidate entities, in combination with estimating the correlation of the contextual words with the categories of the candidate entities. Each candidate entity is a Wikipedia article and its lexical context is the content of the article. Mihalcea and Csomai [27] implemented and evaluated two different disambiguation algorithms. The first one is based on the measure of contextual overlap between the local context of the ambiguous mention and the contents of candidate Wikipedia articles to identify the most likely candidate entity. The second one trains a Naïve Bayes classifier for each ambiguous mention using three words to the left and the right of outlinks in Wikipedia articles, with their parts-of-speech, as contextual features. Zhang et al. [13] employed classification algorithms to learn context compatibility for disambiguation. Zheng et al. [14], Dredze et al. [15] and Zhou et al. [16] employed learning-to-rank techniques to rank all candidate entities and link the mention to the most likely one. Zhang et al. [7, 8] improved their approach in [13] with a learning model for automatically generating a very large training set and training a statistical classifier to detect name variants. The main drawback of the local approaches is that they do not take into account the interdependence between disambiguation decisions. Han and Sun [6] proposed a generative probabilistic model that combines three kinds of evidence: the distribution of entities in documents, the distribution of possible names of a specific entity, and the distribution of possible contexts of a specific entity.

Global approaches

Global approaches assumed interdependence between disambiguation decisions and exploited two main kinds of information: disambiguation context and semantic relatedness. Cucerzan [20] was the first to model interdependence among disambiguation decisions. In [20], the disambiguation context consists of all Wikipedia contexts that occur in the text, and semantic relatedness is based on overlap in categories of entities that may be referred to in the text. Wikipedia contexts comprise inlink labels, outlink labels, and appositives in titles of all Wikipedia articles.

Milne and Witten [28] proposed a learning-based method that ranks each candidate based on three factors: the candidate’s semantic relatedness to contextual entities, the candidate’s commonness, defined as the number of times it is used as a destination in Wikipedia, and a measure of the overall quality of contextual entities. A contextual entity is identified based on a disambiguation context, which is the set of unambiguous mentions having only one candidate in Wikipedia. Guo et al. [4] built a directed graph G = (E, V), where V contains name mentions and all of their candidates. Each edge connects an entity to a mention or vice versa, and there is no edge connecting two mentions or two entities. Then the approach ranks candidates of a certain mention based on their in-degree and out-degree. Hachey et al. [5] first built a seed graph G = (E, V) where V contains candidates of all unambiguous mentions. The graph was then expanded by traversing length-limited paths via links in both entity and category pages in Wikipedia, and adding nodes as well as establishing edges
as required. Finally, the approach ranks candidate entities using cosine and degree centrality. Ratinov et al. [10] proposed an approach that combines both local and global approaches by extending methods proposed in [19] and [28]. Kataria et al. [11] proposed a weakly semi-supervised LDA to model correlations among words and among topics for disambiguation.

Collective approaches

Kulkarni et al. [17] proposed the first collective entity disambiguation approach that can simultaneously link entity mentions in a text to corresponding KB entries and introduced the collective optimization problem to this end. The approach combines local compatibility between mentions and their candidate entities and semantic relatedness between entities. Since joint optimization of the overall linking is NP-hard, the authors proposed two approximation solutions to resolve it. Kleb and Abecker [18] proposed an approach that exploits an RDF(s)-graph structure and co-occurrence among entities in a text for disambiguation. The approach applies the Spreading Activation method to rank and generate the most optimal Steiner graph based on activation values. The resulting graph contains KB entities that are actually referred to in the text.

Some research works [2, 3] built a referent graph for a text and proposed a collective inference method for entity disambiguation. A referent graph is a weighted and undirected graph G = (E, V) where V contains all mentions in the text and all possible candidates of these mentions. Each node represents a mention or an entity. The graph has two kinds of edges:
• A mention-entity edge is established between a mention and an entity, and weighted based on context similarity, or a combination of popularity and context similarity;
• An entity-entity edge is established between two entities and weighted using semantic relatedness between them.

Based on a referent graph, one can propose a method that performs collective inference of the KB entities referred to in a text. Han and Sun [3] and Hoffart et al. [2] proposed approaches that exploit local context compatibility and coherence among entities to build a referent graph and then proposed a collective inference based on the graph in combination with popularity measures of mentions or entities for simultaneously identifying KB entries of all mentions in the text. Note that exploiting the popularity of mentions is based on a popular assumption that some mentions or entities in a text are more important than others, which was used in previous work [27, 28].

Hoffart et al. [2] proposed a method for collective disambiguation based on a closed ontology, the YAGO ontology. The authors calculated the weight of each mention-entity edge based on the popularity of entities and context similarity, which comprises keyphrase-based and syntax-based similarity, and calculated the weight of each entity-entity edge based on Wikipedia inlink overlap between entities. Then they proposed a graph-based algorithm to find a dense subgraph, which is a graph where each mention node has only one edge connecting it with an entity.

Han and Sun [3] first built a referent graph where the local context compatibility was calculated based on a bag-of-words model as in [19] and semantic relatedness adopted the formula presented in [28]. Second, the authors proposed a collective algorithm for disambiguation. The collective algorithm collects initial evidence for each mention and then reinforces the evidence by propagating it via edges of the referent graph. The initial evidence of each mention shows its popularity over the other mentions, and its value is the TF-IDF score normalized by the sum of the TF-IDF scores of all mentions in the text.

In our method, we exploit not only tokens around mentions, but also their co-occurring named entities in a text. In particular, for those named entities that are already disambiguated, we use their identifiers, which are more informative and precise than entity names, as essential disambiguation features of co-occurring mentions. We also introduce a rule-based method and combine it with a statistical one. The experimental results show that the rule-based phase enhances the disambiguation precision and recall significantly. Both the statistical and rule-based phases in our algorithm are iterative, exploiting the identifiers of the named entities resolved in a round for disambiguation of the remaining mentions in the next round.

In fact, the incremental mechanism of our method is similar to the way humans disambiguate mentions based on previously known ones. That is, the proposed method exploits both the flow of information as it progresses in a news article and the way humans read and understand which entities the mentions in the news article refer to. Indeed, an entity occurring first in a news article is usually introduced in an unambiguous way, except when it occurs in the headline of the news article. Like humans, our method disambiguates named entities in a text in turn from the top to the bottom of the text. When the referent of a mention in a text is identified, it is considered an anchor, and its identifier and own features are used to disambiguate others. Also, when encountering an ambiguous mention in a text, a reader usually links it to the previously resolved named entities and his/her background knowledge to identify which entity that mention refers to. Similarly, our method exploits the coreference chain of mentions in a text and information from an encyclopedic knowledge base like Wikipedia for resolving ambiguous mentions. Furthermore, both humans and our method explore contexts at several levels, from a local one to the whole text, where diverse clues are used for the disambiguation task.

3. Proposed method

In a news article, co-occurring entities are usually related to the same context. Furthermore, the identity of a named entity is inferable from nearby and previously identified NEs in the text. For example, when the name “Georgia” occurs with “Atlanta” in a text and “Atlanta” is already recognized as a city in the United States, it is more likely that “Georgia” refers to a state of the United States than the country Georgia. Meanwhile, if “Georgia” occurs with the capital “Tbilisi”, as in the text “TBILISI (CNN) -- Most Russian troops have withdrawn from eastern and western Georgia”, it is “Tbilisi” that helps to identify “Georgia” as referring to the country next to Russia. In addition, the words surrounding ambiguous mentions may denote attributes of the NEs they refer to. If those words are automatically recognized, the ambiguous mentions may be disambiguated. For example, in the text “John McCarthy, an American computer scientist pioneer and inventor, was known as the father of Artificial Intelligence (AI)”, the phrase “computer scientist” can help to discriminate the John McCarthy who invented the Lisp programming language from other ones.

When analyzing the structure of news articles, we observe that when first referring to a named entity, except in the headline, journalists usually either implicitly or explicitly introduce it in an unambiguous way by using its main alias or giving more information for readers to understand clearly the entity they mean. For instance, the news article with the headline “U.S. on Palestinian government: Hamas is sticking point” on CNN (March 04, 2009) has the lead “JERUSALEM (CNN) -- U.S. Secretary of State Hillary Clinton on Tuesday ruled out working with any Palestinian unity government that includes Hamas if Hamas does not agree to recognize Israel”, in which the journalist refers to the wife of the 42nd President of the United States clearly by the phrase “U.S. Secretary of State Hillary Clinton”. Then in the body of the story, s/he writes “Clinton said Hamas must do what the Palestine Liberation Organization has done”, where “Clinton” refers to Hillary Clinton without introducing more information to differentiate her from the former president Bill Clinton of the United States. In particular, for a well-known location entity, although its name may be ambiguous, a journalist can still leave the name alone. However, in other cases, s/he may clarify an ambiguous location name by mentioning some related locations in the text. For instance, when using “Oxford” to refer to a city in Mississippi in the United States, a journalist may write “Oxford, Mississippi” whereas, when using this name to refer to the well-known city Oxford in South East England, s/he may just write “Oxford”.

From those observations, we propose a method with the following essential points. Firstly, it is a hybrid method containing two phases. The first phase is a rule-based phase that filters candidates and, if possible, disambiguates named entities with high reliability. The second phase employs a statistical learning model to rank the candidates of each remaining mention and choose the one with the highest ranking as the right referent of that mention. Secondly, each phase is an iterative and incremental process that makes use of the identifiers of the previously resolved named entities to disambiguate others. Finally, it exploits both entity identifiers and keywords for named entity disambiguation in the two phases. The specific steps in the two phases of our disambiguation process are presented below.
• Step 1: identifies whether there exist entities in Wikipedia that a mention in a text may refer to and then retrieves those entities as candidate referents of the mention.
• Step 2: applies some heuristics to filter candidates of each mention and, if possible, chooses the right one for the mention. The earlier a mention is resolved in this step, the more reliable the identified entity is. As a result, when an entity in Wikipedia is identified as the actual entity that a mention in a text refers to, its identifier will be considered as an anchor that the method exploits to resolve others.

• Step 3: employs the vector space model, in which the cosine similarity is used as a scoring function to rank the candidates of the mention, and chooses the one with the highest score as the right entity that the mention refers to.
As mentioned above, the disambiguation process involves two stages. The first stage is rule-based and includes Step 1 and Step 2. The second stage is statistical and includes Step 3.

3.1. Heuristic

In this section, we propose some heuristics that are used in the first stage and are based on the local contexts of mentions to identify their correct referents. The local context of a location mention is its preceding and succeeding mentions in the text. For example, if “Paris” is a location mention and is followed by “France”, then the country France is in the local context of this “Paris”. The local context of a person or an organization mention comprises the keywords and unambiguous mentions occurring in the same sentence where the mention occurs. We exploit such a local context of a mention to narrow down its candidates and disambiguate its referent if possible, using the following heuristics in the sequence as listed.

H1. Disambiguation text following

For a location mention, its right referent is the candidate whose disambiguation text is identical to the succeeding mention. For example, in the text “Columbia, South Carolina”, for the mention “Columbia”, the candidate Columbia, South Carolina, the largest city of South Carolina, in Wikipedia is chosen because the disambiguation text of the candidate is “South Carolina”, which is identical to the succeeding mention of “Columbia”.

H2. Next to disambiguation text

For a location mention, its right referent is the candidate whose name is identical to the disambiguation text of the referent of its preceding unambiguous mention. For example, in the text “Atlanta, Georgia”, assume that the referent of “Atlanta” has already been resolved as Atlanta, Georgia, a major city of the state of Georgia in the United States. Then, for the mention “Georgia”, the candidate Georgia (U.S. state) is chosen because the referent of its preceding mention “Atlanta” is Atlanta, Georgia, whose disambiguation text is identical to “Georgia”.

H3. Disambiguation text in the same window

For a person or an organization mention, the chosen candidate referent is the one whose disambiguation text occurs in the local context of that mention, or in the local contexts of the mentions in its coreference chain. After this step, if there is only one candidate in the result, the referent is considered resolved. For example, in the text “Veteran referee (Big) John McCarthy, one of the most recognizable faces of mixed martial arts”, the word “referee” helps to choose the candidate John McCarthy (referee) as the right one instead of John McCarthy (computer scientist) or John McCarthy (linguist) in Wikipedia.

To show in more detail how our method exploits the local contexts in the coreference chain of a mention, we describe here the example “Sen. John McCain said Monday that Rep. John Lewis controversial remarks were "so disturbing" that they "stopped me in my tracks." [...] Lewis, a Georgia representative and veteran of the civil rights movement, on Saturday compared the feeling at recent Republican rallies to those of segregationist George Wallace.” In this example, “John Lewis” and “Lewis” are actually co-referent and, in the local context of the mention “Lewis”, there occurs the word “Georgia”, which is the disambiguation text of the entity John Lewis (Georgia) in Wikipedia. Therefore, in this context, after applying heuristic H3, our method identifies that both mentions “John Lewis” and “Lewis” refer to the same entity John Lewis (Georgia) in Wikipedia.

H4. Coreference relation

For each coreference chain, we propagate the resolved referent of a mention in it to the others. For example, assume that in a text there are occurrences of the coreferent mentions “Denny Hillis” and “Hillis”, where “Hillis” may refer to Ali Hillis, an American actress, Horace Hillis, an American politician, or W. Daniel Hillis, an American inventor. If “Denny Hillis” is recognized as referring to W. Daniel Hillis in Wikipedia, then “Hillis” also refers to W. Daniel Hillis. As another example, for the text “About three-quarters of white, college-educated men age over 65 use the Internet, says Susannah Fox, […] John McCain is an outlier when you compare him to his peers, Fox says.”,
there are 164 entities with the same name “Fox” in the Wikipedia version used. However, “Susannah Fox” does not exist in Wikipedia yet and is coreferent with “Fox” in the text, so our method recognizes “Fox” as referring to an out-of-Wikipedia entity.

We note that a coreference chain might not be correctly constructed in the pre-processing steps, due to errors of the employed NE coreference resolution module. Moreover, for a correct coreference chain, if more than one mention has already been resolved, then it matters which one is chosen to be propagated. Therefore, for high reliability, before propagating the referent of a mention that has already been resolved to other mentions in its coreference chain, our method checks whether that mention satisfies one of the following criteria:
(i) The mention occurs in the text prior to all the others in its coreference chain and is one of the longest mentions in its coreference chain (except for those mentions occurring in the headline of the text), or
(ii) The mention occurs in the text prior to all the others in its coreference chain and is the main alias of the corresponding referent in Wikipedia (except for those mentions occurring in the headline of the text). A mention is considered as the main alias of a referent if it occurs in the title of the entity page that describes the corresponding entity in Wikipedia. For example, “United States” is the main alias of the referent the United States because it is the title of the entity page describing the United States.

H5. Default referents

After applying all the above heuristics, for each location mention that has not been resolved yet, our method chooses its default referent as the right one. For instance, in the context “McCain's willingness to disassociate himself with Bush is not a new strategy. The two men are not close and right now McCain is fighting for the support of undecided, independent voters in states such as Pennsylvania, Ohio and Florida.”, the entities Pennsylvania, Ohio, and Florida, states of the United States, in Wikipedia are chosen because they are respectively the default referents of the mentions “Pennsylvania”, “Ohio”, and “Florida”.

3.2. Statistical Ranking Model

Maximizing the accuracy of mapping NEs referred to in a text to the right ones in a given KB raises a significant question: how the contexts in which the mentions of the NEs occur are exploited, and how the corresponding NEs in the KB can be represented. In our case, we represent NEs in the KB by their attributes and relations. For NEs referred to in a text, we extract those features that likely represent their attributes and relations in the contexts where those NEs occur. The attributes are birthday, career, occupation, alias, first name, last name, and so on. The relations of an entity represent its relations to others, such as part-of and located-in. The way we exploit a context is based on Harris’ Distributional Hypothesis [58], stating that words occurring in similar contexts tend to have similar senses. We adapt that hypothesis to NE disambiguation instead of word sense disambiguation. After exploring meaningful features for representing NEs in texts and a KB, our method assigns each NE referred to in a text to the most contextually similar referent in the KB.

In this section, we present a statistical ranking model where we employ the Vector Space Model (VSM) to represent entity mentions in a text and entities in Wikipedia by their features. The VSM considers the set of features of an entity as a bag of words. Firstly, we present what contextual features are extracted and how we normalize them. Then we present how to weight words in the VSM and calculate the similarity between the feature vectors of mentions and entities. Based on the calculated similarity, our disambiguation method ranks the candidate entities of each mention and chooses the best one. The quality of the ranking depends on the features used.

Text features

To construct the feature vector of a mention in a text, we extract all mentions co-occurring with it in the whole text, local words in a context window, and words in the context windows of those mentions that are co-referent with the mention to be disambiguated. Those features are presented below.
• Entity mentions (EM). After named entity recognition, mentions referring to named entities are detected. We extract these mentions in the whole text. After extracting the mentions, for the ones that are identical, we keep only one and remove the others. For instance, if “U.S” occurs twice in a text, we remove one.
• Local words (LW). All the words found inside a specified context window around the mention to be disambiguated. The window size is set to 55 words, not including special tokens such as $, #, ?, etc., which is the value that was observed to give optimum performance in the related task of cross-document coreference resolution [40]. Then we remove those local words that are part of mentions occurring in the context window, to avoid extracting duplicate features.
• Coreferential words (CW). All the words found inside the context windows around those mentions that are co-referent with the mention to be disambiguated in the text. For instance, if “John McCarthy” and “McCarthy” co-occur in the same text and are co-referent, we extract words not only around “John McCarthy” but also those around “McCarthy”. The size of those context windows is also set to 55 words. Note that, when the context windows of co-referent mentions overlap, the words in the overlapping areas are extracted only once. We also remove those extracted words that are part of mentions occurring in the context windows, to avoid extracting duplicate features.

Wikipedia features

For each entity in Wikipedia serving as a candidate entity for an ambiguous mention in a text, we extract the following information to construct its feature vector.
• Entity title (ET). Each entity in Wikipedia has a title. For instance, “John McCarthy (computer scientist)” is the title of the page describing Prof. John McCarthy, who is the inventor of the Lisp programming language. We extract “John McCarthy (computer scientist)” for the corresponding entity.
• Titles of redirect pages (RT). Each entity in Wikipedia may have some redirect pages whose titles contain different names, i.e. aliases, of that entity. To illustrate, from the redirect pages of an entity John Williams in Wikipedia, we extract their titles: Williams, John Towner; Johnny Williams; Williams, John; John Williams (composer); etc.
• Category labels (CAT). Each entity in Wikipedia belongs to one or more categories. We extract the labels of all its categories. For instance, from the categories of the entity John McCarthy (computer scientist) in Wikipedia, we extract the following category labels: Turing Award laureates; Computer pioneers; Stanford University faculty; Lisp programming language; Artificial intelligence researchers; etc.
• Outlink labels (OL). In the page describing an entity in Wikipedia there are some links pointing to other Wikipedia entities. We extract the labels (anchor texts) of those outlinks as features of that entity.

Note that the infoboxes of pages in Wikipedia are meaningful resources for disambiguation. However, these resources may be missing in many pages, or the information in many infoboxes is quite poor. Moreover, the information in the infobox of each page can be distilled from the content of the page. Therefore, our disambiguation method does not extract information from infoboxes for disambiguation.

Normalization

After extracting features for a mention in a text or an entity, we put them into a ‘bag of words’. Then we normalize the bag of words as follows: (i) removing special characters in some tokens, such as normalizing U.S to US, D.C (in “Washington, D.C” for instance) to DC, and so on; (ii) removing punctuation marks and special tokens such as commas, periods, question marks, $, @, etc.; and (iii) removing stop words such as a, an, the, etc., and stemming words using the Porter stemming algorithm. After normalizing the bag of words, we are ready to convert it into a token-based feature vector.

Term weighting

For a mention in a text, suppose there are N candidate entities for it in Wikipedia. We use the tf-idf weighting scheme, viewing each ‘bag of words’ as a document, and use cosine similarity to calculate the similarity between the bag of words of the mention and the bag of words of each of the candidate entities. Given two vectors S_1 and S_2 for two bags of words, the similarity of the two bags of words is computed as:

$$\mathrm{Sim}(S_1, S_2) = \sum_{\text{common word } t_j} w_{1j} \times w_{2j} \qquad (1)$$

where t_j is a term present in both S_1 and S_2, w_1j is the weight of the term t_j in S_1, and w_2j is the weight of the term t_j in S_2.

The weight of a term t_j in vector S_i is given by:

$$w_{ij} = \frac{\log(tf_j + 1) \times \log(N / df_j)}{\sqrt{s_{i1}^2 + s_{i2}^2 + \dots + s_{iN}^2}} \qquad (2)$$

where tf_j is the frequency of the term t_j in vector S_i, N is the total number of candidate entities, df_j is the number of bags of words representing candidate entities in which the term t_j occurs, and s_ij = log(tf_j + 1) × log(N/df_j).
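
To make the term weighting concrete, the following is a minimal Python sketch of Eqs. (1) and (2), assuming each bag of words is given as a list of already normalized tokens. The function names term_weights and sim are hypothetical and are not part of the described system. Two points are our assumptions for illustration only: the denominator of Eq. (2) is read as the Euclidean norm of the unnormalized weights s_ij, and terms that occur in no candidate bag (for which log(N/df_j) is undefined) are skipped.

```python
import math
from collections import Counter


def term_weights(bag, candidate_bags):
    """Normalized tf-idf weights of one bag of words, in the spirit of Eq. (2)."""
    n = len(candidate_bags)  # N: number of candidate entities
    raw = {}
    for term, tf in Counter(bag).items():
        # df: number of candidate bags in which the term occurs
        df = sum(1 for cand in candidate_bags if term in cand)
        if df == 0:
            continue  # assumption: skip terms unseen in every candidate bag
        raw[term] = math.log(tf + 1) * math.log(n / df)
    norm = math.sqrt(sum(s * s for s in raw.values()))
    if norm == 0.0:
        return {}  # no informative term, so every similarity is zero
    return {term: s / norm for term, s in raw.items()}


def sim(w1, w2):
    """Eq. (1): sum of weight products over the terms common to both vectors."""
    return sum(w * w2[term] for term, w in w1.items() if term in w2)
```

With this sketch, a mention would be scored against a candidate by calling term_weights once for the mention's bag and once for the candidate's bag, both against the same list of candidate bags, and passing the two results to sim. Note that a term occurring in every candidate bag receives weight zero, since log(N/df_j) = 0 in that case.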

Algorithm

For a mention m that we want to disambiguate, let C be the set of its candidate entities. We cast the named entity disambiguation problem as a ranking problem, with the assumption that there is an appropriate scoring function to calculate the semantic similarity between the feature vectors of an entity c ∈ C and the mention m. We build a ranking function that takes as input the feature vectors of the entities in C and the feature vector of the mention m, and then, based on the scoring function, returns the entity c ∈ C with the highest score. We use the Sim function given in Eq. 1 as the scoring function.

What we have just described is implemented in Algorithm 1. Sim is used at Line 3 of the algorithm. The FVector function in the algorithms returns the feature vector of a mention or an entity.

Algorithm 1 Statistical-Based Entity Ranking
1: let C be a set of candidate entities of m
2: for each candidate c do
3:     score[c] ← Sim(FVector(c), FVector(m))
4: end for
5: c* ← argmax_{c_i ∈ C} score[c_i]
6: if score[c*] > τ then return c*
7: return NIL

3.3. Disambiguating process

Prior to looking up candidates in Wikipedia, we perform some pre-processing steps. In particular, we perform NE recognition and NE coreference resolution using the natural language processing resources of an Information Extraction engine based on GATE [34], a general architecture for developing natural language processing applications. The NE recognition applies pattern-matching rules written in the JAPE grammar of GATE, in order to identify the class of an entity in the text. After performing NE recognition and detecting all mentions of entities occurring in the text, we perform NE coreference resolution using the method presented in [33] and implemented in the GATE system. After these pre-processing steps, for each name in the text, we send it as a query to Wikipedia to retrieve its candidate referents. Finally, we run our disambiguating algorithm, namely Algorithm 2. Algorithm 2 takes as input a set of mentions and returns a set of mention-entity mappings. During the disambiguation process, if a mention is disambiguated, the entity corresponding to it is immediately used to disambiguate the others. The function revised(.) makes use of coreference relations among mentions of named entities to adjust the disambiguated results. Line 2 to Line 17 show the first stage, using the heuristics presented above, and Line 18 to Line 31 show the second stage, employing the statistical ranking model for disambiguation.

Algorithm 2 Iterative and Incremental NED
1: let M be a set of mentions and E be an empty set
2: E ← ∅
3: flag ← false
4: loop until M empty or flag is true
5:     M' ← M
6:     for each n ∈ M' do
7:         C ← a set of candidate entities of n
8:         apply H1, H2, H3 respectively for n
9:         if sizeof(C) = 1 then
10:            map n to γ* ∈ C
11:            E ← revised(E ∪ {<n → γ*>})
12:            remove n from M
13:        end if
14:    end for
15:    if E no change then flag ← true
16: end loop
17: apply H5
18: flag ← false
19: loop until M empty or flag is true
20:    M' ← M
21:    for each n ∈ M' do
22:        C ← a set of candidate entities of n
23:        γ* ← run Algorithm 1 for n
24:        if γ* is not NIL then
25:            map n to γ*
26:            E ← revised(E ∪ {<n → γ*>})
27:            remove n from M
28:        end if
29:    end for
30:    if E no change then flag ← true
31: end loop

4. Experiments and evaluation

For evaluating the performance of our disambiguation method, we have built a corpus in which named entities of the types Person, Location, and Organization are manually annotated with their attributes using Wikipedia data. We first downloaded the top two or three articles in each of the eleven CNN news categories, namely, Top Stories, Politics, Entertainment, Tech, Travel, Africa, World, World Sport, World Business, Middle East, and Americas, on July 22, 2008. Then we downloaded 10 articles on Oct 17, 2008 in the Top Stories category of the CNN news agency to build a dataset D with 40 articles for evaluation.
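
As an illustration of the incremental procedure of Algorithm 2 in Section 3.3, the following Python sketch shows one possible way to organize the two stages. All names (rank, disambiguate, get_candidates, heuristics, default_referent, score) are hypothetical. The heuristics and the scoring function are passed in as plain callables, the coreference-based adjustment revised(.) is left out for brevity, and mapping the mentions that remain unresolved to NIL at the end is our assumption. It is therefore a simplified sketch under these assumptions rather than the implemented system.

```python
NIL = None  # marks a mention whose referent is not in the knowledge base


def rank(mention, candidates, score, tau):
    """Algorithm 1: best-scoring candidate, or NIL if its score is not above tau."""
    best, best_score = NIL, float("-inf")
    for cand in candidates:
        s = score(cand, mention)
        if s > best_score:
            best, best_score = cand, s
    return best if best_score > tau else NIL


def disambiguate(mentions, get_candidates, heuristics, default_referent, score, tau):
    """Two-stage loop in the spirit of Algorithm 2 (revised(.) is omitted)."""
    pending = list(mentions)  # mentions that are still unresolved
    resolved = {}             # E: mention -> entity mappings found so far

    # Stage 1 (Lines 2-16): apply the heuristics until no new mapping appears.
    changed = True
    while pending and changed:
        changed = False
        for m in list(pending):
            cands = set(get_candidates(m))
            for h in heuristics:               # e.g. H1, H2, H3, in this order
                cands = h(m, cands, resolved)  # earlier mappings act as anchors
            if len(cands) == 1:
                resolved[m] = next(iter(cands))
                pending.remove(m)
                changed = True

    # Line 17: H5, fall back to a default referent where one is defined.
    for m in list(pending):
        default = default_referent(m)
        if default is not NIL:
            resolved[m] = default
            pending.remove(m)

    # Stage 2 (Lines 18-31): statistical ranking of the remaining mentions.
    changed = True
    while pending and changed:
        changed = False
        for m in list(pending):
            best = rank(m, get_candidates(m),
                        lambda c, x: score(c, x, resolved), tau)
            if best is not NIL:
                resolved[m] = best
                pending.remove(m)
                changed = True

    for m in pending:  # assumption: whatever is left is mapped to NIL
        resolved[m] = NIL
    return resolved
```

Each heuristic here takes the mention, its current candidate set, and the mappings resolved so far, and returns a (possibly smaller) candidate set. Passing the resolved mappings to both the heuristics and the scoring function is what makes the process incremental: a mention that cannot be resolved in one pass may become resolvable in a later pass, once a neighbouring mention has been mapped.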

We divide entity names in the dataset into four categories as follows:
• Category 1: names that occur in Wikipedia and refer to entities in Wikipedia.
• Category 2: names that occur in Wikipedia but refer to entities that are not in Wikipedia.
• Category 3: names that do not occur in Wikipedia and refer to entities that are not in Wikipedia.
• Category 4: names that do not occur in Wikipedia but refer to entities in Wikipedia.
The annotation process focuses on named entities of three types: Person, Location, and Organization. Finally, we obtain a golden standard corpus in which each named entity is annotated with the following four pieces of information:
• TYPE: represents the type of the named entity, which is Person, Location, or Organization.
• ID: uniquely identifies the corresponding referent, if existing, in Wikipedia. If the name of the entity belongs to Category 1 or Category 4, the ID is the title of the corresponding referent in Wikipedia. For instance, if the entity name “John McCarthy” in a text actually refers to the John McCarthy who is the inventor of the Lisp programming language, then its ID is John McCarthy (computer scientist). Otherwise, if the name of the entity belongs to Category 2 or Category 3, then ID receives the NIL value.
• CAT: represents the category of the name of the entity. That is, CAT is either Category 2 or Category 3 when the entity name actually refers to an out-of-Wikipedia entity, or it is either Category 1 or Category 4 when the entity name refers to an entity in Wikipedia.
• POS: represents the character position where the named entity occurs. For instance, in the text “Sen. Barack Obama says Sen. John McCain will not bring the change the country needs”, the position where “John McCain” occurs is 28.
The corpus size is 30,699 tokens. There are in total 1,852 mentions of named entities in the corpus, which refer to 526 distinct real-world entities and comprise 664 distinct names. There are 1,706 mentions having corresponding entities in Wikipedia, among which there are 967 mentions having two or more candidates, adding up to 6,885 as the total number of matched candidates. Therefore, the average number of candidates per name in those 664 distinct names is 6885/664 ≈ 10.37.

In more detail, Table 1 shows the statistics of the named entities for each entity type in the golden standard corpus. Column #1 gives the number of mentions for each entity type in the corpus. Column #2 gives the number of mentions in the corpus that actually refer to entities in Wikipedia. Column #3 gives the number of mentions in the corpus that refer to out-of-Wikipedia entities. Column #4 gives the number of mentions that have two or more candidate referents in Wikipedia.

Table 1. Statistics of mentions in the datasets.
Entity type     #1      #2      #3     #4
Person          863     736     127    409 (out of 736)
Location        665     655     10     402 (out of 655)
Organization    324     315     9      156 (out of 315)
Total           1852    1706    146    967 (out of 1706)

To evaluate, we first define the measures of the performance of the proposed method, whose outcome is a mapping from the mentions in a text to entities in Wikipedia or to NIL. Table 2 defines whether a mapping for a mention is correct or not, depending on the category of that mention. Specifically, for a mention of Category 1 or Category 4, which actually refers to an entity in Wikipedia, the mapping is correct if and only if the mention is mapped to the right entity in Wikipedia. For a mention of Category 2 or Category 3, which does not refer to any entity in Wikipedia, the mapping is correct if and only if the mention is mapped to NIL.

Table 2. Correct and incorrect mention-entity mappings with respect to mention categories
              Correct mapping                     Incorrect mapping
Category 1    to the right entity in Wikipedia    to a wrong entity in Wikipedia or NIL
Category 2    to NIL                              to an entity in Wikipedia
Category 3    to NIL                              to an entity in Wikipedia
Category 4    to the right entity in Wikipedia    to a wrong entity in Wikipedia or NIL

We evaluate our method in two scenarios. In the first scenario, we use GATE 3.0 (http://gate.ac.uk/download/) to detect and tag the boundaries of names occurring in the dataset and then categorize the corresponding referents as Person, Location, and Organization. This yields the dataset D1. We found the following kinds of errors in D1:
• GATE fails to detect the boundaries of some names. For example, “African National Congress” is recognized as “African National”, “Andersen Air Force Base” as “Air Force”, and “Luis Moreno-Ocampo” as “Luis Moreno-”.
• GATE detects some names (12 cases in our constructed corpus) as two different names. For example, “Omar al-Bashir” is recognized as the separate names “Omar” and “al-Bashir”, and “Sony Ericsson” as “Sony” and “Ericsson”.
• There are many names (145 cases) that GATE fails to recognize. For example, “Darfur”, “Qunu”, “Soweto”, “Interfax”, and “Rosoboronexport” are not recognized as entity names.
• GATE fails to identify the types of some named entities, e.g. “Robben Island Prison” is recognized as a person.
• GATE wrongly recognizes mentions such as Iranian, Young (meaning the young people), and Christian as named entities.
• GATE wrongly produces some coreference chains.
Then we manually fix all such errors in the dataset D1, obtaining the dataset D2 with no error. Table 3 presents the statistics of mentions recognized by GATE in the dataset D1. We note that the figures in Table 3 are not necessarily the same as those in Table 1 for the ground-truth corpus, due to GATE’s errors as pointed out above.

Table 3. Statistics of mentions in the dataset D1
Entity type     #1      #2      #3     #4
Person          794     613     180    403 (out of 613)
Location        625     597     28     373 (out of 597)
Organization    297     253     44     140 (out of 253)
Total           1716    1463    252    916 (out of 1463)

Due to the aforementioned possible error of a named entity recognition module splitting a name into two separate ones, we introduce the notion of partially correct mappings. That is, if a mention is correctly disambiguated but it is only part of a full name in a text, then the mapping is only partially correct. For example, if “Barack Obama” (meaning the current President of the United States) in a text is recognized as two separate mentions “Barack” and “Obama”, and the mention “Barack” is mapped to the entity Barack Obama (the same President) in Wikipedia, then the mapping is partially correct. A mention-entity mapping is said to be fully correct if the mention coincides with its full name in a text.

Let T_all be the number of all ground-truth mention-entity mappings in a dataset, T_C be the number of fully correct mappings, T_P be the number of partially correct mappings, and T_I be the number of incorrect mappings by a named entity recognition and disambiguation system. Each fully correct mapping is counted as one point, while each partially correct mapping is counted as only a half. Then the precision and recall of the system on the dataset are defined as follows:
• Precision (P): the ratio of the number of correct mention-entity mappings to the number of all mappings returned by the system.

$$P = \frac{T_C + \frac{1}{2} T_P}{T_C + \frac{1}{2} T_P + T_I} \qquad (3)$$

• Recall (R): the ratio of the number of correct mention-entity mappings to the number of all ground-truth mappings.

$$R = \frac{T_C + \frac{1}{2} T_P}{T_{all}} \qquad (4)$$

Table 4. Precision and Recall after running Algorithm 2 in the three modes on the dataset D2
Mode           P&R    PER       LOC       ORG       ALL
Random         P=R    52.65%    38.34%    55.75%    48.09%
Rule-based     P      97.48%    97.07%    93.42%    96.78%
Rule-based     R      85.10%    89.92%    60.30%    82.42%
Statistical    P=R    89.95%    65.86%    83.03%    80.11%
Hybrid         P=R    95.38%    92.78%    87.27%    93.01%

In order to evaluate the effect of each phase in our proposed method, we run Algorithm 2 in three modes. The first mode, named Rule-based, only employs the heuristics presented in Section 3.1 to disambiguate named entities, i.e., running the algorithm from Line 1 to Line 17. The second mode, named Statistical, only employs the vector space model for ranking candidates as presented in Section 3.2, i.e., running Line 1, 2, and Line 18 to Line 31 in the algorithm. The last mode, named Hybrid, runs the whole algorithm. Also, for separately evaluating the performance of the system with and without the errors incurred by the preceding named entity recognition module, we run the three modes on both datasets D1 and D2. All the results are matched against the golden standard corpus.

Table 4 presents the precision and recall calculated when we randomly assign a KB entry to each entity mention in D2 and when we run Algorithm 2 on D2 in the three modes. Since our disambiguation method maps all
available mentions in an input dataset, i.e., D2 in this thod on the dataset D2 without errors from pre-
case, the number of returned mappings is equal to the processing phases and evaluated the method on D1 with
number of mappings in the corresponding gold standard errors accumulated from pre-processing phases. The
corpus. Therefore, P and R are the same for each run- experiment results presented respectively in Table 4 and
ning mode on D2 except in the Rule-based mode. Table 5 show that our method achieves good perfor-
When running the Algorithm 2 in the Rule-based mance.
mode, because there are not any heuristics that fire for We note that although we utilize information from
some entity mentions, P and R are different. Wikipedia for named entity disambiguation, our method
Table 4 also shows that the proposed heuristics give can be adapted for an ontology or a knowledge base in
high precision. One can therefore adopt these heuristics to improve the performance of related work such as [3], [10], or [28]. Indeed, the disambiguation context in [28] consists only of the candidate entities of unambiguous mentions, and the disambiguation context in [10] consists of the candidate entities having the highest local compatibility with the context of their mentions. In our opinion, these disambiguation contexts are not really reliable, due to the low performance of disambiguation systems based on local compatibility and the fact that the only candidate entity of an unambiguous mention may not be the one to which the mention actually refers. Therefore, our proposed heuristics can produce more reliable disambiguation contexts than those proposed in [10] and [28]. These heuristics can also be employed to reduce the size of the referent graph proposed in [3], which in turn reduces the computational cost of the collective inference algorithm.

Table 5. Precision (P), recall (R), and F-measure (F) after running Algorithm 2 in the three modes on the dataset D1

       PER       LOC       ORG       ALL
P      76,58%    88,00%    73,06%    80,12%
R      70,85%    82,70%    66,97%    74,43%
F      73,60%    85,26%    69,88%    77,17%

Table 5 presents the precision and recall obtained when we run Algorithm 2 on D1. One can observe that, due to errors in the preceding named entity recognition and coreference resolution phases performed by GATE, all the precision and recall measures are lower than those on D2.

In summary, the failures in the results come from different sources. First, some are due to errors of the employed named entity recognition and coreference resolution modules, i.e., those of GATE in this experiment. Second, some are due to the incompleteness of Wikipedia, such as missing entity aliases and real-world entities and poor descriptions of some entities, which cause failures in the lookup and ranking steps. Third, some are due to our method itself. We isolated and evaluated our me-

general. In particular, one can generate a profile for each KB entity by making use of the ontology concepts and properties of the entity. For instance, one can extract the direct class and parent classes of an entity from the given class hierarchy as some of its features. The values of the entity's properties are also exploited: for attributes, the values are extracted directly; for relation properties, one can use the names and identifiers of the corresponding entities. All the extracted features of an entity are then concatenated into a text snippet, which can be considered a profile of that entity for further processing.

5. Conclusion

We have proposed a method for named entity disambiguation. It is a hybrid and incremental process that utilizes previously identified named entities and related terms co-occurring with ambiguous names in a text for the disambiguation task. Our method is robust to free texts without well-defined structures or templates. It can also be adapted for other languages using the freely available language versions of Wikipedia, as well as for any ontology and knowledge base in general. Moreover, the proposed method can map a name to an entity that is missing that name in the knowledge base of discourse. As such, it helps to discover new aliases of named entities from texts and to automatically enrich the knowledge base with them. The experimental results have shown that our method achieves good performance in terms of the precision and recall measures.

This work focuses on named entities in texts. However, general concepts also play an important role in forming the meaning of those texts. Therefore, in future work, we will investigate new features and disambiguation methods that are suitable for both named entities and general concepts. For this line of research, we find it possible to adapt the Latent Dirichlet Allocation Category Language Model proposed in [59] for disambiguating both named entities and general concepts, in combination with the method proposed in this paper.
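As an illustration of the profile generation described above for ontology-based knowledge bases, the following minimal Python sketch concatenates an entity's direct class, its parent classes, its attribute values, and the names and identifiers of related entities into one text snippet. The record layout (`Entity`), the `parent_of` mapping, and the helper names are assumptions made only for this example; they are not part of the method's implementation, and a real ontology API would look different.

```python
from dataclasses import dataclass, field


@dataclass
class Entity:
    """Hypothetical KB entity record; all field names are illustrative."""
    identifier: str
    name: str
    direct_class: str
    attributes: dict = field(default_factory=dict)  # attribute name -> literal value
    relations: dict = field(default_factory=dict)   # relation name -> target entity id


def ancestor_classes(cls, parent_of):
    """Return the parent classes of cls, nearest first, from a child->parent map."""
    chain, seen = [], {cls}
    while cls in parent_of:
        cls = parent_of[cls]
        if cls in seen:  # guard against a cyclic hierarchy
            break
        seen.add(cls)
        chain.append(cls)
    return chain


def build_profile(entity, parent_of, kb):
    """Concatenate class, attribute, and relation features into a text snippet."""
    features = [entity.direct_class]
    features.extend(ancestor_classes(entity.direct_class, parent_of))
    # Attribute values are taken directly.
    features.extend(str(value) for value in entity.attributes.values())
    # For relation properties, use the name and identifier of the target entity.
    for target_id in entity.relations.values():
        target = kb.get(target_id)
        if target is not None:
            features.append(target.name)
        features.append(target_id)
    return " ".join(features)


# Toy data, invented for the example.
parent_of = {"City": "Location", "Location": "Entity"}
kb = {
    "e1": Entity("e1", "Hanoi", "City", {"alias": "Ha Noi"}, {"capitalOf": "e2"}),
    "e2": Entity("e2", "Vietnam", "Country"),
}
print(build_profile(kb["e1"], parent_of, kb))
# prints: City Location Entity Ha Noi Vietnam e2
```

The resulting snippet can then be treated like any other entity description text, for example as input to the same feature extraction that is applied to Wikipedia articles.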


References

1. H. Ji, R. Grishman, and H. T. Dang, An Overview of the TAC2011 Knowledge Base Population Track, in Proc. of Text Analysis Conference (TAC2011).
2. Hoffart et al., Robust Disambiguation of Named Entities in Text, in Proc. of the 2011 Conference on Empirical Methods in Natural Language Processing (Edinburgh, Scotland, UK, July 27–31, 2011), pp. 782–792.
3. X. Han, L. Sun, and J. Zhao, Collective Entity Linking in Web Text: A Graph-Based Method, in Proc. of the 34th Annual ACM SIGIR Conference (Beijing, China, July 24–28, 2011), pp. 765–774.
4. Y. Guo, W. Che, T. Liu, and S. Li, A Graph-based Method for Entity Linking, in Proc. of the 5th International Joint Conference on Natural Language Processing (IJCNLP 2011, Chiang Mai, Thailand, November 8–13, 2011), pp. 1010–1018.
5. B. Hachey, W. Radford, and J. Curran, Graph-based Named Entity Linking with Wikipedia, in Proc. of the 12th International Conference on Web Information System Engineering (Sydney, NSW, Australia, 2011), pp. 213–226.
6. X. Han and L. Sun, A Generative Entity-Mention Model for Linking Entities with Knowledge Base, in Proc. of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (Portland, Oregon, USA, June 19–24, 2011), pp. 945–954.
7. W. Zhang, J. Su, and C.-L. Tan, A Wikipedia-LDA Model for Entity Linking with Batch Size Changing Instance Selection, in Proc. of International Joint Conference for Natural Language Processing (IJCNLP 2011, Chiang Mai, Thailand, November 8–13, 2011), pp. 562–570.
8. W. Zhang, Y. C. Sim, J. Su, and C.-L. Tan, Entity Linking with Effective Acronym Expansion, Instance Selection and Topic Modeling, in Proc. of International Joint Conferences on Artificial Intelligence 2011 (IJCAI 2011, Barcelona, Spain, July 16–22, 2011), pp. 1909–1914.
9. H. Ji and R. Grishman, Knowledge Base Population: Successful Approaches and Challenges, in Proc. of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (Portland, Oregon, USA, June 19–24, 2011), pp. 1148–1158.
10. L. Ratinov, D. Roth, D. Downey, and M. Anderson, Local and Global Algorithms for Disambiguation to Wikipedia, in Proc. of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (Portland, Oregon, USA, June 19–24, 2011), pp. 1375–1384.
11. S. Kataria, K. Kumar, R. Rastogi, P. Sen, and S. Sengamedu, Entity Disambiguation with Hierarchical Topic Models, in Proc. of the 17th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD 2011, August 21–24, 2011, San Diego, CA), pp. 1037–1045.
12. S. Gottipati and J. Jiang, Linking Entities to a Knowledge Base with Query Expansion, in Proc. of the 2011 Conference on Empirical Methods in Natural Language Processing (Edinburgh, Scotland, UK, July 27–31, 2011), pp. 804–813.
13. W. Zhang, J. Su, C.-L. Tan, and W. Wang, Entity Linking Leveraging Automatically Generated Annotation, in Proc. of the 23rd International Conference on Computational Linguistics (COLING 2010, Beijing, China, August 23–27, 2010), pp. 1290–1298.
14. Z. Zheng, F. Li, M. Huang, and X. Zhu, Learning to Link Entities with Knowledge Base, in Human Language Technologies 2010: The Annual Conference of the North American Chapter of the Association for Computational Linguistics (HLT/NAACL 2010).
15. M. Dredze, P. McNamee, D. Rao, A. Gerber, and T. Finin, Entity Disambiguation for Knowledge Base Population, in Proc. of the 23rd International Conference on Computational Linguistics (COLING 2010, Beijing, China, August 23–27, 2010), pp. 277–285.
16. Y. Zhou, L. Nie, O. Rouhani-Kalleh, F. Vasile, and S. Gaffney, Resolving Surface Forms to Wikipedia Topics, in Proc. of the 23rd International Conference on Computational Linguistics (COLING 2010, Beijing, China, August 23–27, 2010), pp. 1335–1343.
17. S. Kulkarni, A. Singh, G. Ramakrishnan, and S. Chakrabarti, Collective Annotation of Wikipedia Entities in Web Text, in Proc. of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 2009), pp. 457–466.
18. J. Kleb and A. Abecker, Entity Reference Resolution via Spreading Activation on RDF-Graphs, in Proc. of the 2010 Extended Semantic Web Conference (ESWC 2010).
19. R. Bunescu and M. Paşca, Using encyclopedic knowledge for named entity disambiguation, in Proc. of the 11th Conference of EACL, pp. 9–16, 2006.
20. S. Cucerzan, Large-Scale Named Entity Disambiguation Based on Wikipedia data, in Proc. of EMNLP-CoNLL Joint Conference 2007, pp. 708–716, 2007.
21. H.T. Nguyen and T.H. Cao, A Knowledge-Based Approach to Named Entity Disambiguation in News Articles, in Proc. of the 20th Australian Joint Conference on Artificial Intelligence (AI 2007); LNAI, vol. 4830, Springer-Verlag, pp. 619–624.
22. H.T. Nguyen and T.H. Cao, Exploring Wikipedia and text features for named entity disambiguation, in Proc. of the 2nd Asian Conference on Intelligent Information and Database Systems (ACIIDS 2010); LNCS, vol. 5991, Springer-Verlag, pp. 11–20.
23. H.T. Nguyen and T.H. Cao, Named entity disambiguation on an ontology enriched by Wikipedia, in Proc. of the 6th IEEE International Conference on Research, Innovation and Vision for the Future (RIVF 2008, Ho Chi Minh City, Viet Nam), pp. 247–254.
24. O. Medelyan, D. Milne, C. Legg, and I.H. Witten, Mining Meaning from Wikipedia, in International Journal of Human-Computer Studies, 67(9):716–754.
25. R. Mihalcea, Using Wikipedia for Automatic Word Sense Disambiguation, in Human Language Technologies 2007: The Annual Conference of the North American Chapter of the Association for Computational Linguistics (HLT/NAACL 2007, Rochester, New York, April 2007).


26. R. Navigli, Word Sense Disambiguation: A Survey, in ACM Computing Surveys, 41(2):1–69.
27. R. Mihalcea and A. Csomai, Wikify!: Linking Documents to Encyclopedic Knowledge, in Proc. of the 16th ACM Conference on Information and Knowledge Management (CIKM 2007), pp. 233–242.
28. D. Milne and I.H. Witten, Learning to Link with Wikipedia, in Proc. of the 17th ACM Conference on Information and Knowledge Management (CIKM 2008), pp. 509–518.
29. S. Overell and S. Rüger, Using co-occurrence models for placename disambiguation, in International Journal of Geographical Information Science, 22(3):265–287, 2008.
30. D. Smith and G. Mann, Bootstrapping Toponym Classifiers, in HLT-NAACL Workshop on Analysis of Geographic References, pp. 45–49.
31. E. Garbin and I. Mani, Disambiguating Toponyms in News, in Proc. of the Conference on Human Language Technology and Empirical Methods in Natural Language, pp. 363–370.
32. J. Leidner, Toponym Resolution in Text: Annotation, Evaluation and Applications of Spatial Grounding of Place Names, Ph.D. thesis, School of Informatics, University of Edinburgh, 2007.
33. K. Bontcheva, M. Dimitrov, D. Maynard, V. Tablan, and H. Cunningham, Shallow Methods for Named Entity Coreference Resolution, in Proc. of TALN 2002 Workshop, Nancy, France, 2002.
34. H. Cunningham, D. Maynard, K. Bontcheva, and V. Tablan, GATE: A Framework and Graphical Development Environment for Robust NLP Tools and Applications, in Proc. of the 40th Annual Meeting of the Association for Computational Linguistics, 2002.
35. P. Cimiano and J. Völker, Towards Large-scale, Open-domain and Ontology-based Named Entity Classification, in Proc. of Recent Advances in Natural Language Processing - 2005, pp. 166–172, 2005.
36. E.F. Tjong Kim Sang and F. De Meulder, Introduction to the CoNLL-2003 Shared Task: Language Independent Named Entity Recognition, in Proc. of CoNLL-2003, pp. 142–147, 2003.
37. T. Berners-Lee, J. Hendler, and O. Lassila, The Semantic Web, in Scientific American, pp. 34–43, 2001.
38. D.M. Bikel, R.L. Schwartz, and R.M. Weischedel, An Algorithm That Learns What's in a Name, in Machine Learning, 34(1-3):211–231, 1999.
39. M. Fleischman and E. Hovy, Fine grained Classification of Named Entities, in Proc. of the 19th International Conference on Computational Linguistics, pp. 1–7, 2002.
40. C.H. Gooi and J. Allan, Cross-document Coreference on a Large-scale Corpus, in Proc. of HLT-NAACL for Computational Linguistics Annual Meeting, pp. 9–16, 2004.
41. Y. Chen and J. Martin, Towards robust unsupervised personal name disambiguation, in Proc. of EMNLP-CoNLL Joint Conference 2007, pp. 190–198, 2007.
42. T. Pedersen, A. Purandare, and A. Kulkarni, Name Discrimination by Clustering Similar Contexts, in Proc. of the Sixth International Conference on Intelligent Text Processing and Computational Linguistics, pp. 226–237, 2005.
43. B. Malin, Unsupervised Name Disambiguation via Social Network Similarity, in Proc. of SIAM Conference on Data Mining 2005.
44. J. Artiles, J. Gonzalo, and S. Sekine, WePS 2 Evaluation Campaign: overview of the Web People Search Clustering Task, in Proc. of 2nd Web People Search Evaluation Workshop, 18th WWW Conference.
45. A. Kiryakov, B. Popov, I. Terziev, D. Manov, and D. Ognyanoff, Semantic Annotation, Indexing, and Retrieval, in Journal of Web Semantics, 2(1), 2005.
46. S. Dill et al., SemTag and Seeker: Bootstrapping the Semantic Web via Automated Semantic Annotation, in Proc. of 12th WWW Conference, pp. 178–186, 2003.
47. J. Giles, Internet encyclopedias go head to head, in Nature, 438(7070), pp. 900–901, 2005.
48. G. Mann and D. Yarowsky, Unsupervised Personal Name Disambiguation, in Proc. of the 17th Conference on Natural Language Learning, pp. 33–40, 2003.
49. X. Li, P. Morie, and D. Roth, Robust Reading: Identification and Tracing of Ambiguous Names, in Proc. of HLT-NAACL 2004, pp. 17–24, 2004.
50. C. Niu, W. Li, and R. K. Srihari, Weakly Supervised Learning for Cross-document Person Name Disambiguation Supported by Information Extraction, in Proc. of the 42nd Annual Meeting on Association for Computational Linguistics, 2004.
51. V. Rastogi, N. N. Dalvi, and M. N. Garofalakis, Large-scale Collective Entity Matching, in The Proceedings of the VLDB Endowment (PVLDB), 4(4):208–218, 2011.
52. A. K. Elmagarmid, P. G. Ipeirotis, and V. S. Verykios, Duplicate Record Detection: A Survey, in IEEE Transactions on Knowledge and Data Engineering, 19(1):1–16, 2007.
53. I. Bhattacharya and L. Getoor, Collective Entity Resolution in Relational Data, in TKDD, 1(1), 2007.
54. M. Bilenko and R. J. Mooney, Adaptive Duplicate Detection Using Learnable String Similarity Measures, in Proc. of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 2003), pp. 39–48, 2003.
55. I. Bhattacharya and L. Getoor, A Latent Dirichlet Model for Unsupervised Entity Resolution, in The SIAM Conference on Data Mining (SIAM-SDM), 2006.
56. N. R. Smalheiser and V. I. Torvik, Author Name Disambiguation, in Annual Review of Information Science and Technology, 43, 287–313.
57. X. Wang, J. Tang, H. Cheng, and P. S. Yu, ADANA: Active name disambiguation, in Proc. of 2011 IEEE International Conference on Data Mining (ICDM 2011).
58. Z. Harris, Distributional structure, in Word, 10(23):146–162, 1954.
59. S. Zhou, K. Li, and Y. Liu, Text Categorization Based on Topic Model, in International Journal of Computational Intelligence Systems, 2(4):398–409, 2009.
