EDIUM: Improving Entity Disambiguation Via User Modeling: Abstract
Author
Organization
Introduction
Named Entity Disambiguation (NED) is the task of identifying the correct entity reference in a knowledge base (such as DBpedia, Freebase or YAGO) for a given mention. On microblogging sites like Twitter, NED is an important step in understanding the user's intent and in topic detection and tracking, search personalization and recommendation.
Many NED techniques have been proposed in the past. Some use the contextual information around an entity, while others use the popularity of the candidates for disambiguation. But tweets, being short and noisy, lack sufficient context for these systems to disambiguate precisely. For this reason, the underlying user interests are modeled to disambiguate the entities[1][2]. However, creating user models may require external knowledge (like Wikipedia edits[1]), which is computationally expensive. Also, some users (e.g. news channels) tweet randomly (based on recent events or trending hashtags), while others follow their interests closely. So using the same configuration (like a sliding window of the same length[2]) for distinct users might not be effective. We address these issues by modeling the user's tweeting behavior over time.
Our proposed system, EDIUM, links the entities in tweets by simultaneously disambiguating the entities and creating the user model. The user's behavior is analyzed with respect to the model built so far, and an appropriate weight is assigned to the user model. When the entities of a new tweet are disambiguated, the user model contributes in proportion to this weight. The approach can also be used for modeling users and disambiguating entities in other streaming documents such as emails or query logs. The next section describes the EDIUM system in detail.
Notation            Description
$C_{ij}$            j-th candidate entity for the i-th mention in a given tweet
$Score_C(C_{ij})$   Contextual similarity score for $C_{ij}$
$Par(C)$            Parent of C: the set of categories that are the immediate ancestors of category C
$G(C)$              Grandparent of C: the set of all categories such that $G(C) \subseteq Par(Par(C))$ and $G(C) \cap Par(C) = \emptyset$
$N_r(C)$            Set of categories in the r-th neighborhood of category C
$IC_u^i$            Set of categories in the i-th interest cluster for user u
$Score(IC_u^i)$     Score for the cluster $IC_u^i$

2.1 Context Model (CM) disambiguates the entities based on the text around them. The text around the mention is compared with the text on the Wikipedia page of each candidate entity, and each candidate referent is given a disambiguation weight according to this similarity. The candidate referent with the maximum weight is taken as the disambiguated entity for the given mention.
We improve the referents' disambiguation scores by combining the context-based scores with the user's interest-based scores. For linking and disambiguating the entities based on context, we use existing entity linking systems such as DBpedia Spotlight[3] and Wikipedia Miner[4].
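The final score $Score_C(C_{ij})$ given by the context model is the candidate score normalized over all the possible alignments for the given mention:

$$Score_C(C_{ij}) = \frac{c_i^j}{\sum_{j \in C_i} c_i^j} \tag{1}$$

where $c_i^j$ is the score of the j-th candidate for the i-th mention and $C_i$ is the set of candidates for that mention.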
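As a minimal sketch of the normalization in Eq. 1, assuming the underlying entity linker returns one raw score per candidate: the function name `normalize_context_scores`, the uniform fallback for all-zero scores and the example entities are illustrative assumptions, not part of EDIUM.

```python
def normalize_context_scores(raw_scores):
    """Normalize the raw candidate scores of one mention so they sum to 1.

    raw_scores maps a candidate entity to its raw context score
    (hypothetical output of an entity linker).
    """
    total = sum(raw_scores.values())
    if total == 0:
        # Assumption: with no evidence at all, treat candidates as equally likely.
        return {cand: 1.0 / len(raw_scores) for cand in raw_scores}
    return {cand: score / total for cand, score in raw_scores.items()}


# Hypothetical raw scores for the mention "apple".
raw = {"Apple_Inc.": 3.0, "Apple_(fruit)": 1.0}
context_scores = normalize_context_scores(raw)
```

Normalizing per mention keeps the context scores on the same scale as the user-model scores they are later combined with.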
2.2 User Model (UM) models the user's interests and behavior over time.
UM Creation: We use cluster-weighted models¹ for modeling the user's interests. The following assumptions are made while creating the user models.
- Users only tweet on topics that interest them.
- The amount of interest in a topic is proportional to the information shared by the user on the topic.
Based on these assumptions, we model each user as a set of weighted sub-clusters of semantic Wikipedia categories. Each sub-cluster represents the user's interest in a specific topic, and its weight represents the user's overall interest in that topic.
The UM is updated for the user's future tweets based on the categories in the user's current tweet. The tweet categories are extracted using the following steps.
1. The entities in the current tweet are discovered via the Disambiguation Model (DM) system (Section 2.3).
2. The entities with sufficiently high confidence are shortlisted to prevent the UM from learning incorrect information about the user. We consider only those entities for which the ratio of the score of the second-ranked entity to the score of the disambiguated entity is at most a threshold².
3. Tweet categories for the shortlisted entities are extracted using Wikipedia. The score of each tweet category equals the number of tweet entities inherited by the category.
4. Considering the graph of semantic Wikipedia categories, the tweet categories are smoothed to include their parent categories. Each parent receives a score from each child category in inverse proportion to the parent's out-degree, so a common parent gets a smaller contribution from the child's score than a rare parent.
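The steps above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the function names, the `threshold` parameter, the toy category graph and the reading of step 4 as "child score divided by the parent's out-degree" are all hypothetical.

```python
from collections import defaultdict


def confident_entities(ranked, threshold):
    """Step 2: keep top candidates whose runner-up/top score ratio is <= threshold.

    ranked holds one list of (entity, score) pairs per mention, best first.
    """
    kept = []
    for candidates in ranked:
        top_entity, top_score = candidates[0]
        second_score = candidates[1][1] if len(candidates) > 1 else 0.0
        if top_score > 0 and second_score / top_score <= threshold:
            kept.append(top_entity)
    return kept


def tweet_categories(entities, entity_cats, parents, out_degree):
    """Steps 3 and 4: score categories by entity count, then smooth to parents."""
    scores = defaultdict(float)
    for entity in entities:
        for category in entity_cats.get(entity, ()):
            scores[category] += 1.0  # one unit per tweet entity in the category

    smoothed = defaultdict(float, scores)
    for category, score in scores.items():
        for parent in parents.get(category, ()):
            # Assumed reading: a parent with many children gets a smaller share.
            smoothed[parent] += score / max(out_degree.get(parent, 1), 1)
    return dict(smoothed)


# Toy data (hypothetical): two mentions, only the first is confident.
ranked = [
    [("Apple_Inc.", 0.9), ("Apple_(fruit)", 0.1)],
    [("Paris", 0.5), ("Paris_Hilton", 0.45)],
]
kept = confident_entities(ranked, threshold=0.5)
entity_cats = {"Apple_Inc.": ["Technology companies"]}
parents = {"Technology companies": ["Companies"]}
out_degree = {"Companies": 4}
cats = tweet_categories(kept, entity_cats, parents, out_degree)
```

In this toy run the ambiguous "Paris" mention is dropped, so it cannot push wrong categories into the user model.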
The UM is built from the tweet categories. If a category is already present in the model, its score is updated to the sum of its existing score and the tweet category score. Otherwise, the category and its score are added to the model. As new tweet category scores are added, the UM evolves to better represent the newly processed tweets.
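The update rule described here amounts to an additive merge of category scores. A minimal sketch, assuming the user model is a plain mapping from category name to score (the name `update_user_model` and the example categories are hypothetical):

```python
def update_user_model(user_model, tweet_cats):
    """Add the tweet category scores to the user model in place."""
    for category, score in tweet_cats.items():
        # Existing categories accumulate; unseen categories start from zero.
        user_model[category] = user_model.get(category, 0.0) + score
    return user_model


user_model = {"Smartphones": 2.0}
update_user_model(user_model, {"Smartphones": 1.0, "Companies": 0.25})
```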
To find the user's topics of interest, each category is mapped to a single interest cluster. We form clusters from categories that share similar parent and grandparent categories. The score of the k-th interest cluster for user u, $ic_u^k$, is the sum of the weights of the categories in cluster k.
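The cluster score can be sketched as a grouped sum. The clustering rule itself is not reproduced here: the sketch assumes a precomputed, hypothetical mapping `cluster_of` from each category to its interest cluster.

```python
from collections import defaultdict


def cluster_scores(user_model, cluster_of):
    """Sum the category weights of the user model per interest cluster."""
    scores = defaultdict(float)
    for category, weight in user_model.items():
        scores[cluster_of[category]] += weight
    return dict(scores)


# Hypothetical user model and category-to-cluster assignment.
um = {"Smartphones": 3.0, "Mobile phones": 1.5, "Football clubs": 1.0}
cluster_of = {
    "Smartphones": "tech",
    "Mobile phones": "tech",
    "Football clubs": "sports",
}
ic = cluster_scores(um, cluster_of)
```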
Twitter users exhibit different interest behaviors, from highly specific to nearly random. Some users tweet in response to the situation, such as trending hashtags or popular news, while others tweet only about highly specific products or companies. Entity disambiguation depends strongly on this behavior: interest models may be useful in the latter case but much less effective in the former. To handle this issue, we introduce the concept of relatedness between the user's learnt model and the Disambiguation Model (DM).

¹ http://en.wikipedia.org/wiki/Cluster-weighted_modeling
² The threshold value depends on the use case and the performance of the underlying CM system. High values ensure a large learning rate, while low values ensure the performance of the UM system.
The similarity between the UM and the DM is defined as the cosine similarity between the tweet category vectors obtained when the DM is used and when only the user model is used for disambiguation:

$$Sim(UM, DM) = \cos\big(Score_\lambda(C_i),\ Score_U(C_i)\big) \tag{2}$$

Similarly, the similarity between the CM and the DM is defined as the cosine similarity between the tweet category vectors obtained when the DM is used and when only the CM is used for disambiguation:

$$Sim(CM, DM) = \cos\big(Score_\lambda(C_i),\ Score_C(C_i)\big) \tag{3}$$

Here $Score_\lambda$, $Score_U$ and $Score_C$ denote the scores of the DM, the UM and the CM respectively.
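A minimal sketch of these two similarities, assuming each tweet category vector is stored sparsely as a mapping from category to score. The function name `cosine` and the example vectors are illustrative assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse vectors given as dicts."""
    dot = sum(weight * v.get(category, 0.0) for category, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an empty vector is treated as unrelated
    return dot / (norm_u * norm_v)


# Hypothetical tweet category vectors from the three models for one tweet.
dm_cats = {"Smartphones": 1.0, "Companies": 1.0}
um_cats = {"Smartphones": 1.0}
cm_cats = {"Smartphones": 1.0, "Companies": 1.0}

sim_um_dm = cosine(um_cats, dm_cats)
sim_cm_dm = cosine(cm_cats, dm_cats)
```

In this example the context model agrees fully with the DM, while the user model covers only part of what the DM found.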
$$R = \frac{(1-\lambda)\, Sim(UM, DM)}{(1-\lambda)\, Sim(UM, DM) + \lambda\, Sim(CM, DM)} \tag{4}$$

$$\lambda = \frac{1}{n} \sum_{t=0} R_t \tag{5}$$
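The following sketch shows one possible reading of Eqs. 4 and 5, and it rests on assumptions: that R weights Sim(UM, DM) by one minus the user-model weight and Sim(CM, DM) by the user-model weight itself, and that the weight is then taken as the plain mean of the R values observed so far. The names `relatedness`, `average_relatedness` and `lam` are hypothetical.

```python
def relatedness(sim_um_dm, sim_cm_dm, lam):
    """Relatedness R under the assumed reading of Eq. 4."""
    numerator = (1.0 - lam) * sim_um_dm
    denominator = numerator + lam * sim_cm_dm
    return numerator / denominator if denominator > 0 else 0.0


def average_relatedness(history):
    """Assumed reading of Eq. 5: the mean of the R values seen so far."""
    return sum(history) / len(history) if history else 0.0


r = relatedness(sim_um_dm=0.6, sim_cm_dm=0.9, lam=0.5)
new_lam = average_relatedness([0.4, 0.6])
```

Under this reading, a user whose model keeps agreeing with the final disambiguation accumulates a higher weight over time.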
$$Score(IC_u^i) = \frac{ic_u^i}{\sum_{k=0}^{n} ic_u^k} \tag{7}$$

$$Sim(C_{ij}, IC_u^i) = \frac{|IC_u^i \cap N_3(C_{ij})|}{|IC_u^i \cup N_3(C_{ij})|} \tag{8}$$
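A minimal sketch of these two quantities, assuming Eq. 7 normalizes the raw cluster scores to sum to 1 and Eq. 8 is the Jaccard overlap between an interest cluster and the 3rd neighborhood of a candidate. Computing the neighborhood from the category graph is left out; the sets below are hypothetical inputs.

```python
def normalized_cluster_scores(raw_scores):
    """Divide each raw cluster score by the sum over all clusters."""
    total = sum(raw_scores.values())
    if total == 0:
        return {cluster: 0.0 for cluster in raw_scores}
    return {cluster: score / total for cluster, score in raw_scores.items()}


def jaccard(cluster_categories, neighborhood):
    """Intersection size over union size of two category sets."""
    union = cluster_categories | neighborhood
    if not union:
        return 0.0
    return len(cluster_categories & neighborhood) / len(union)


norm = normalized_cluster_scores({"tech": 4.5, "sports": 1.5})

tech_cluster = {"Smartphones", "Mobile phones", "Apple Inc. hardware"}
n3_candidate = {"Smartphones", "Mobile phones", "IOS"}
sim = jaccard(tech_cluster, n3_candidate)
```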
2.3 Disambiguation Model (DM) disambiguates the entities based on the textual context as well as the user's interests. The DM combines the score of the context-based model with the score of the user-based model using the parameter $\lambda$, which relates to how stable the user is with respect to previously tweeted topics. The final score predicted by the DM is

$$Score_\lambda(C_{ij}) = \lambda\, Score_U(C_{ij}) + (1-\lambda)\, Score_C(C_{ij}) \tag{9}$$

The model selects the entity that maximizes $Score_\lambda(C_{ij})$ for the given mention i:

$$Entity_i = \arg\max_{j} Score_\lambda(C_{ij}) \tag{10}$$
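The combination and selection step can be sketched as a weighted sum followed by an argmax over the candidates of one mention. The function names, the weight name `lam` and the example scores are illustrative assumptions.

```python
def combined_score(score_u, score_c, lam):
    """Weighted combination of the user-model and context-model scores."""
    return lam * score_u + (1.0 - lam) * score_c


def disambiguate(candidates, lam):
    """Pick the candidate with the highest combined score.

    candidates maps an entity to a (user_score, context_score) pair.
    """
    def score(entity):
        score_u, score_c = candidates[entity]
        return combined_score(score_u, score_c, lam)

    return max(candidates, key=score)


# Hypothetical scores for the mention "apple".
candidates = {
    "Apple_Inc.": (0.8, 0.4),
    "Apple_(fruit)": (0.1, 0.6),
}
with_user_model = disambiguate(candidates, lam=0.5)
context_only = disambiguate(candidates, lam=0.0)
```

With a weight of zero the choice falls back to the context model alone, which matches the intent of giving less influence to the user model for users who tweet randomly.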
Conclusion
In this paper, we modeled entity disambiguation based on the user's past interest information. We proposed a way to model the user's interests using entity linking techniques and then use this model to improve disambiguation in entity linking systems. The gain in precision is proportional to the accuracy of the underlying entity linking system.
More analysis is required on the user modeling aspect of the system. Currently, the user's past tweets are used for building the user model, and the model's quality depends heavily on the underlying context model. We are including network and demographic information about users to improve user modeling. In the future, this would help us disambiguate entities better by understanding more aspects of user behavior.
References
1. Murnane, E.L., Haslhofer, B., Lagoze, C.: Reslve: leveraging user interest to improve entity disambiguation on short text. In: Proceedings of the 22nd international conference on World Wide Web companion. WWW '13 Companion, Republic and Canton of Geneva, Switzerland, International World Wide Web Conferences Steering Committee (2013) 81–82
2. Shen, W., Wang, J., Luo, P., Wang, M.: Linking named entities in tweets with knowledge base via user interest modeling. In: Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining. KDD '13, New York, NY, USA, ACM (2013) 68–76
3. Mendes, P.N., Jakob, M., García-Silva, A., Bizer, C.: Dbpedia spotlight: shedding light on the web of documents. In: Proceedings of the 7th International Conference on Semantic Systems. I-Semantics '11, New York, NY, USA, ACM (2011) 1–8
4. Milne, D., Witten, I.H.: An open-source toolkit for mining wikipedia. Artif. Intell. 194 (2013) 222–239