
EDIUM: Improving Entity Disambiguation via User Modeling
Author
Organization

Abstract. Entity Disambiguation is the task of associating name mentions in text with the correct referent entities in a knowledge base, with the goal of understanding and extracting useful information from the document. Entity disambiguation has become an important task for harnessing the information shared by users on microblogging sites like Twitter. However, noise and lack of context in tweets make disambiguation a difficult task. In this paper, we describe an Entity Disambiguation system, EDIUM, which uses User interest Models to disambiguate the mentions in a user's tweets. Our system jointly models the user's interest scores and the context disambiguation scores, thus compensating for the sparse context in the tweets of a given user. We evaluated the system's entity linking capabilities on user tweets and showed that improvement can be achieved by combining the user models and the context-based models.

Introduction

Named Entity Disambiguation (NED) is the task of identifying the correct entity reference in a knowledge base (such as DBpedia, Freebase or YAGO) for a given mention. On microblogging sites like Twitter, NED is an important task for understanding the user's intent and for topic detection & tracking, search personalization and recommendations.
In the past, many NED techniques have been proposed. Some utilize contextual information of an entity, while others use candidate popularity for disambiguation. But tweets, being short and noisy, lack sufficient context for these systems to disambiguate precisely. Due to this, the underlying user interests are modeled to disambiguate the entities [1][2]. However, the creation of user models might require external knowledge (like Wikipedia edits [1]), which is computationally expensive. Also, some users (e.g. news channels) tweet randomly (based on recent events or trending hashtags), while others follow their interests very closely. So, using the same configuration (like the same sliding-window length [2]) for distinct users might not be effective. We address these issues by modeling the user's tweeting behavior over time.
Our proposed system, EDIUM, links the entities in tweets by simultaneously disambiguating the entities and creating the user models. The user's behavior is analyzed with respect to the model built, and an appropriate weight is assigned to the user model. The user model then contributes in proportion to this weight while disambiguating entities in new tweets. This approach can also be used for modeling users and disambiguating entities in other streaming documents like emails or query logs. The next section describes the EDIUM system in detail.

The EDIUM System

EDIUM works by modeling a user's interests as a distribution over semantic Wikipedia categories. EDIUM has three sub-systems: the context modeling system (Section 2.1), the user modeling system (Section 2.2) and the disambiguation system (Section 2.3). The user's interests are modeled based on the categories of the user's tweets. These interests, along with the local context, are used by the disambiguation system for linking entities in new tweets. The final results are fed back to the system to improve the user model.
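The feedback loop described above can be sketched as follows; the sub-system interfaces (`score`, `link`, `update`) are illustrative assumptions, not the paper's API.

```python
# Minimal sketch of the EDIUM loop: the context and user models score a tweet,
# the disambiguation system combines the scores, and the linked entities are
# fed back to refine the user model. All method names are assumptions.
def process_tweet(tweet, context_model, user_model, disambiguator):
    context_scores = context_model.score(tweet)     # Section 2.1
    interest_scores = user_model.score(tweet)       # Section 2.2
    entities = disambiguator.link(context_scores, interest_scores)  # Section 2.3
    user_model.update(entities)  # feedback step: refine the user model
    return entities
```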
Every tweet may contain multiple mentions, and each mention can be aligned to multiple entities. Table 1 lists the notations used in describing the system.
Table 1. Notations used for describing the system

Symbol     Description
C_i^j      j-th candidate entity for the i-th mention in a given tweet
c_i^j      Contextual similarity score for C_i^j
Par(C)     Parents of C: the set of categories that are the immediate ancestors of category C
G(C)       Grandparents of C: the set of categories in Par(Par(C)) that are not in Par(C)
N_r(C)     Set of categories in the r-th neighborhood of category C
IC_u^i     Set of categories in the i-th interest cluster for user u
ic_u^i     Score for the cluster IC_u^i

2.1 Contextual Modeling System

The Context Model (CM) disambiguates the entities based on the text around them. The similarity between the text around the mention and the text on the Wikipedia page of an entity is computed, and an appropriate disambiguation weight is given to each candidate referent. The candidate referent with the maximum weight is considered the disambiguated entity for the given mention. We improve the referents' disambiguation scores by combining the context-based scores with the user's interest-based scores in an appropriate manner. We used existing entity linking systems, DBpedia Spotlight [3] and Wikipedia Miner [4], for linking and disambiguating the entities based on the context.
The final score Score_C(C_i^j) given by the context model is the candidate score normalized over all possible alignments for the given mention:

    Score_C(C_i^j) = c_i^j / Σ_{j ∈ C_i} c_i^j    (1)
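As a minimal sketch of Eq. (1), assuming a mention's candidate scores are held in a dict (names illustrative):

```python
def context_scores(candidates):
    """Normalize the contextual similarity scores c_i^j over all candidate
    entities of one mention, as in Eq. (1). `candidates` maps each
    candidate entity to its raw contextual similarity score."""
    total = sum(candidates.values())
    if total == 0:
        return {entity: 0.0 for entity in candidates}
    return {entity: score / total for entity, score in candidates.items()}
```

For example, raw scores 3 and 1 for two candidates normalize to 0.75 and 0.25.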

2.2 User Modeling System

The User Model (UM) captures the user's interests and behavior over time.

UM Creation: We used cluster-weighted models¹ for modeling the user's interests. The following assumptions were made while creating the user models:
- Users only tweet on topics that interest them.
- The amount of interest in a topic is proportional to the information shared by the user on that topic.
Based on these assumptions, we modeled each user as a set of weighted sub-clusters of semantic Wikipedia categories. Each sub-cluster represents the user's interest in a specific topic, and its weight represents the user's overall interest in that topic.
The UM is updated for the user's future tweets based on the categories in the user's current tweet. The tweet categories are extracted using the following steps:
1. The current tweet's entities are discovered via the disambiguation modeling (DM) system (Section 2.3).
2. The entities with sufficiently high confidence are shortlisted, to prevent the UM from learning incorrect information about the user. We considered only those entities where the ratio of the score of the second-ranked entity to that of the disambiguated entity is at most a threshold value².
3. The tweet categories for the shortlisted entities are extracted using Wikipedia. The score of each tweet category equals the number of tweet entities inherited by the category.
4. Considering the graph of semantic Wikipedia categories, the tweet categories are smoothed to include the parent categories. Parents are given scores in inverse proportion to their out-degree for each child category: a common parent gets a smaller contribution from the child's score than a rare parent.
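The smoothing in step 4 can be sketched as follows, under the assumption that the category graph is supplied as parent lists and out-degrees (all names illustrative):

```python
from collections import defaultdict

def smooth_with_parents(tweet_categories, parents_of, out_degree):
    """Spread each tweet category's score to its parent categories in
    inverse proportion to the parent's out-degree (step 4): a common
    parent with many children receives a smaller share of the child's
    score than a rare parent."""
    smoothed = defaultdict(float, tweet_categories)
    for category, score in tweet_categories.items():
        for parent in parents_of.get(category, []):
            smoothed[parent] += score / out_degree[parent]
    return dict(smoothed)
```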
The UM is created from the tweet categories. If a category is already present in the model, its score is updated to the sum of the existing score and the tweet category score; otherwise the category and its score are added to the model. As new tweet category scores are added to the UM, the model evolves to better represent the newly processed tweets.
To find the topics of interest for the user, each category is mapped to a single interest cluster. We formed clusters based on the shared parent and grandparent categories of a given category. The score of the k-th interest cluster, ic_u^k, for user u is the sum of the weights of the categories in cluster k.
Twitter users exhibit different interest behaviors, from highly specific to nearly random. This can be seen from the fact that some users tweet based on situations like trending hashtags or popular news, while others tweet only about highly specific products or companies. Disambiguating entities depends heavily on the behavior of the user: while interest models might be useful in the latter case, they might not be as effective in the former. To handle this issue, we introduce the concept of relatedness between the user's learnt model and the Disambiguation Model (DM).

¹ https://en.wikipedia.org/wiki/Cluster-weighted_modeling
² The threshold value depends on the use case and the performance of the underlying CM system. High values allow larger learning rates, while low values protect the quality of the UM system.

The similarity between the UM and the DM is defined as the cosine similarity between the tweet category vectors obtained when the DM is used vs. when only the user model is used for disambiguation.
    Sim(UM, DM) = cos(Score_α(C_i), Score_U(C_i))    (2)

Similarly, the similarity between the CM and the DM is defined as the cosine similarity between the tweet category vectors obtained when the DM is used vs. when only the CM is used for disambiguation.

    Sim(CM, DM) = cos(Score_α(C_i), Score_C(C_i))    (3)

Now we define the relatedness R as the ratio of the similarity between the UM & DM to that between the CM & DM, with each similarity weighted in inverse proportion to its model's contribution during disambiguation:

    R = [(1 - α) · Sim(UM, DM)] / [(1 - α) · Sim(UM, DM) + α · Sim(CM, DM)]    (4)

α is the measure of the consistency of the user's behavior with respect to the learnt user model: it tells how consistent the user is about his or her interests (and how stable the user model is). The higher the value of α, the more consistent the user.
We update α after each tweet based on the contribution the user model had in deciding the tweet categories. Since we do not want to decide α based on the user's behavior on a single tweet alone, we consider the previous n relatedness values when computing the new α: α is the average of the last n relatedness values.

    α = (1/n) Σ_{t=1}^{n} R_t    (5)

where R_t is the t-th most recent relatedness value.

To deal with changing user interests, we decrease α by a factor of 0.9 each day. This lowers the dependency of the DM on the UM over time. Also, α is capped at 0.7 to keep the model from learning an incorrect user model and making decisions irrespective of the context used. This also enables the model to discover new entities even for highly interest-focused Twitter users.
The user model is committed to the database³ after each transaction and is used whenever a new tweet from the same user arrives. This lets us track a huge number of users and build a streaming disambiguation system for Twitter streams.
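The α bookkeeping of Eqs. (4)-(5), together with the daily decay and the 0.7 cap just described, can be sketched as follows; the class and method names are illustrative, and the initial α of 0.001 follows the evaluation setup.

```python
from collections import deque

class AlphaTracker:
    """Illustrative sketch: maintain alpha as the mean of the last n
    relatedness values, decayed by 0.9 per day and capped at 0.7."""

    def __init__(self, n=20, alpha0=0.001, cap=0.7, daily_decay=0.9):
        self.window = deque([alpha0], maxlen=n)
        self.cap = cap
        self.daily_decay = daily_decay
        self.alpha = alpha0

    def relatedness(self, sim_um_dm, sim_cm_dm):
        # Eq. (4): UM and CM similarities weighted inversely to their
        # contributions (alpha and 1 - alpha) in the final score.
        a = self.alpha
        num = (1 - a) * sim_um_dm
        den = num + a * sim_cm_dm
        return num / den if den > 0 else 0.0

    def update(self, sim_um_dm, sim_cm_dm):
        # Eq. (5): alpha is the average of the last n relatedness values,
        # capped at 0.7 so context is never ignored entirely.
        self.window.append(self.relatedness(sim_um_dm, sim_cm_dm))
        self.alpha = min(self.cap, sum(self.window) / len(self.window))
        return self.alpha

    def end_of_day(self):
        # Decay alpha by 0.9 per day to track drifting user interests.
        self.alpha *= self.daily_decay
```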
Disambiguation: For each candidate C_i^j in a tweet, the final score given by the user model is

    Score_U(C_i^j) = Σ_k Sim(C_i^j, IC_u^k) · Score(IC_u^k)    (6)

    Score(IC_u^i) = ic_u^i / Σ_k ic_u^k    (7)

    Sim(C_i^j, IC_u^i) = |IC_u^i ∩ N_3(C_i^j)| / |IC_u^i ∪ N_3(C_i^j)|    (8)

where the sums in Eqs. (6) and (7) run over the user's interest clusters. The Score_U(C_i^j) is normalized relative to all possible Score_U(C_i).

³ MongoDB is used as the database.
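A sketch of Eqs. (6)-(8), reading Eq. (8) as the Jaccard overlap between an interest cluster and the candidate category's 3-neighborhood (names illustrative):

```python
def jaccard(a, b):
    """Eq. (8): overlap between an interest cluster and the candidate
    category's neighborhood, as intersection size over union size."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def user_score(neighborhood, clusters):
    """Eq. (6): sum of cluster similarities weighted by the normalized
    cluster scores of Eq. (7). `neighborhood` is the set N_3(C_i^j);
    `clusters` is a list of (category_set, raw_score) pairs."""
    total = sum(raw for _, raw in clusters)   # normalizer of Eq. (7)
    if total == 0:
        return 0.0
    return sum(jaccard(neighborhood, cats) * (raw / total)
               for cats, raw in clusters)
```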


2.3 Entity Disambiguation System (DM)

The DM disambiguates the entities based on the textual context as well as the user's interests. The DM system combines the context-based model's score and the user-based model's score using the parameter α, which reflects how stable the user is with respect to previously tweeted topics. The final score predicted by the DM is

    Score_α(C_i^j) = α · Score_U(C_i^j) + (1 - α) · Score_C(C_i^j)    (9)

The model selects the entity that maximizes Score_α(C_i^j) for the given mention i.

    Entity_i = arg max_j Score_α(C_i^j)    (10)
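Eqs. (9)-(10) amount to a convex combination of the two scores followed by an argmax over the candidates; a minimal sketch (names illustrative):

```python
def disambiguate(candidates, alpha):
    """Pick the entity maximizing the blended score of Eq. (9) for one
    mention. `candidates` maps entity -> (user_score, context_score)."""
    final = {entity: alpha * su + (1 - alpha) * sc      # Eq. (9)
             for entity, (su, sc) in candidates.items()}
    return max(final, key=final.get)                    # Eq. (10)
```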

Results and Discussions

We evaluated the performance of EDIUM on a manually annotated dataset of 100 tweets from 15 different Twitter users. α is initialized to 0.001 for each user because the UM has no prior information about the user. We experimented with n = 20 and the shortlisting threshold set to 0.95. As the UM improves with each of the user's tweets, the precision at 1 (P@1) score is calculated at intervals of 20 tweets for each user. The system is evaluated with both DBpedia Spotlight and Wikipedia Miner as the context modeling system. Fig. 1 reports the performance of the system over time when the proposed model is used vs. when just the CM is used or just the UM (built from previous tweets with the proposed model) is used for disambiguation. We observed that EDIUM started to outperform the CM after 60 tweets (of each user) were processed by the system. The maximum performance is achieved when the proposed model is used with Wikipedia Miner as the CM system.
EDIUM is observed to perform better with Wikipedia Miner (WM) than with DBpedia Spotlight (DS). This is because the system depends on the underlying context models for learning the user interests: context models that are more precise lead to faster and more accurate user models, which in turn significantly help the context model to disambiguate the entities.

Table 2. Average scores

Method        Avg.
EDIUM (WM)    0.49
EDIUM (DS)    0.38

Fig. 1. P@1 score of EDIUM under different configurations: (a) performance with Wikipedia Miner; (b) performance with DBpedia Spotlight


Conversely, if the underlying context models have low entity linking and disambiguation accuracies the user models usually takes much longer to learn the
user interests (with low values) and use them for entity disambiguation. It
can be seen that UM alone can also disambiguate the entities from the users
tweet and achieve significant performance.

Conclusion and Future Work

In this paper, we have modeled entity disambiguation based on the user's past interest information. The paper proposed a way to model a user's interests using entity linking techniques and then to use this model to improve disambiguation in entity linking systems. The gain in precision is proportional to the accuracy of the underlying entity linking system.
More analysis is required on the user modeling aspect of the system. Currently, the user's past tweets are used for building the user model, and the model's quality depends heavily on the underlying context model. We are including network and demographic information of users to improve user modeling. In the future, this would help us better disambiguate the entities by understanding more aspects of user behavior.

References
1. Murnane, E.L., Haslhofer, B., Lagoze, C.: RESLVE: leveraging user interest to improve entity disambiguation on short text. In: Proceedings of the 22nd International Conference on World Wide Web Companion (WWW '13 Companion), Republic and Canton of Geneva, Switzerland, International World Wide Web Conferences Steering Committee (2013) 81-82
2. Shen, W., Wang, J., Luo, P., Wang, M.: Linking named entities in tweets with knowledge base via user interest modeling. In: Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '13), New York, NY, USA, ACM (2013) 68-76
3. Mendes, P.N., Jakob, M., García-Silva, A., Bizer, C.: DBpedia Spotlight: shedding light on the web of documents. In: Proceedings of the 7th International Conference on Semantic Systems (I-Semantics '11), New York, NY, USA, ACM (2011) 1-8
4. Milne, D., Witten, I.H.: An open-source toolkit for mining Wikipedia. Artif. Intell. 194 (2013) 222-239
