Word Sense Disambiguation Methods Applied To English and Romanian
Radu ION
Advisor
prof. dr. Dan TUFIŞ
Corresponding Member of the Romanian Academy
1. Introduction
Word Sense Disambiguation (WSD) has become useful in many areas of Natural
Language Processing, such as question answering systems, speech transcription,
document classification and, especially, natural language understanding systems.
WSD is known to be AI-complete. That is, it cannot be solved unless the other “hard”
problems of AI are solved. Among these, Knowledge Representation plays a central
role with a special emphasis on the so-called “common sense” knowledge. Existing
WSD methods can only approximate the human capacity to disambiguate words in
their contexts by modeling empirical assumptions about this capacity. One of the most
influential assumptions is that the meaning of a word depends on its context of
occurrence.
Context identification and formalization are among the main problems to be dealt
with when implementing a WSD algorithm. Some WSD methods consider that the
context of a word can be viewed as a window of words centered on the target word
(the “bag of words” model of context). Others impose restrictions on this window,
such as the order in which context words appear with respect to the target word or
the relevance of these words to the target word. Parallel WSD has one great
advantage over the bag of words formalization of the context: the context of the
target word becomes its translation equivalent(s) in the other language(s) of the
parallel corpus, which eliminates much of the noise of the bag of words model.
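To make the bag of words model concrete, here is a minimal Python sketch that
extracts such a window context; the window size and the whitespace tokenization are
illustrative assumptions, not settings prescribed by this work.

    def bag_of_words_context(tokens, target_index, window=3):
        # Return the words within +/- window positions of the target word,
        # ignoring their order, as the bag of words model does.
        left = max(0, target_index - window)
        right = min(len(tokens), target_index + window + 1)
        return [t for i, t in enumerate(tokens[left:right], start=left)
                if i != target_index]

    sentence = "the bank raised the interest rate again".split()
    print(bag_of_words_context(sentence, sentence.index("interest")))
    # ['bank', 'raised', 'the', 'rate', 'again']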
This work is concerned both with traditional WSD on plain (monolingual) texts and
with WSD performed on parallel corpora. On the traditional side of the problem, we
are interested in studying the effect of using a form of syntactic analysis of the
sentence as the context formalization. On the parallel side, we explore the possibility
of specifying contexts as lists of translation equivalents.
A dependency analysis links the words of a sentence directly to one another,
thus specifying the correspondences between words. By their very nature, constituent
grammars are generative grammars that ultimately try to explain the surface form of
the sentences without much consideration as to the (formal) correspondence between
syntax and semantics. On the other hand, dependency formalisms such as Mel’cuk’s
Meaning Text Model (MTM) treat the surface form of the syntactic analysis as a
means to the final goal: the semantic representation of the sentence.
In this work, we will approximate the dependency syntactic relation of MTM with a
constrained Lexical Attraction Model that produces a link structure of a sentence (a
connected, planar, acyclic and undirected graph with the sentence words as vertices).
This approximation will certainly not be a better representation than a proper
dependency analysis, but it has one undeniable advantage: it can be automatically
obtained from free running text with little or no preprocessing.
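As an illustration of the structure just described, the following minimal Python
sketch checks whether a set of links over a sentence forms a valid link structure
(a connected, acyclic, undirected and planar graph); it is only a validity check
under these definitions, not the author's LexPar implementation.

    def is_valid_link_structure(n_words, links):
        # links: set of (i, j) pairs with i < j, linking word positions.
        # A connected acyclic undirected graph on n vertices has exactly
        # n - 1 edges, so the edge count plus connectedness suffices.
        if len(links) != n_words - 1:
            return False
        adj = {i: [] for i in range(n_words)}
        for i, j in links:
            adj[i].append(j)
            adj[j].append(i)
        seen, stack = {0}, [0]
        while stack:
            for k in adj[stack.pop()]:
                if k not in seen:
                    seen.add(k)
                    stack.append(k)
        if len(seen) != n_words:
            return False
        # Planarity over the surface form: links (i, j) and (k, l) cross
        # exactly when i < k < j < l.
        return not any(i < k < j < l for (i, j) in links for (k, l) in links)

    # "the cat sat" with links the-cat and cat-sat is a valid link structure.
    print(is_valid_link_structure(3, {(0, 1), (1, 2)}))  # True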
To perform WSD, one needs to properly identify sentences, words, their parts of
speech (POS) and their lemmas, since sense inventories record senses by lemma. This
chapter presents TTL, a Perl module developed by the author, which does all of the
above and also performs Named Entity Recognition and chunking for English and
Romanian (and more recently for Bulgarian, although without expert validation of its
performance).
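The sketch below illustrates, with deliberately naive rules, the order of the
preprocessing stages a WSD system needs; the toy sentence splitter, tokenizer and
lemmatizer merely stand in for TTL's components and are not its actual algorithms.

    import re

    def split_sentences(text):
        # Toy rule: a sentence ends at ., ! or ? followed by whitespace.
        return re.split(r"(?<=[.!?])\s+", text.strip())

    def tokenize(sentence):
        # Toy rule: words and punctuation marks are separate tokens.
        return re.findall(r"\w+|[^\w\s]", sentence)

    def lemmatize(token):
        # Toy English lemmatizer: strip a final plural/3rd-person "s".
        return token[:-1] if token.endswith("s") and len(token) > 3 else token

    for sent in split_sentences("The banks closed. Rates rose."):
        print([lemmatize(t.lower()) for t in tokenize(sent)])
    # ['the', 'bank', 'closed', '.']
    # ['rate', 'rose', '.']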
This chapter also deals with the construction and processing of the parallel English-
Romanian corpus SemCor. The Romanian translation was done at “Alexandru Ioan
Cuza” University of Iaşi. The corpus was preprocessed with TTL and then word-
aligned with the YAWA lexical aligner. The sense transfer from English to Romanian
closely follows the WSDTool procedure. Out of a total of 88874 occurrences of
content words in Romanian, 54.54% received a sense annotation through the transfer
procedure.
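A minimal Python sketch of the transfer step follows; the word positions, the
alignment format and the sense label are illustrative assumptions, not the actual
WSDTool data structures.

    def transfer_senses(en_senses, alignment):
        # en_senses: {english_word_position: sense_label}
        # alignment: set of (english_position, romanian_position) links
        ro_senses = {}
        for en_pos, ro_pos in alignment:
            if en_pos in en_senses:
                ro_senses[ro_pos] = en_senses[en_pos]
        return ro_senses

    # English word 2 carries an (illustrative) sense label and is aligned
    # to Romanian word 1, which therefore receives the same label.
    print(transfer_senses({2: "ili-00001"}, {(2, 1)}))  # {1: 'ili-00001'}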
Lastly, the Romanian WordNet lexical semantic network (developed within the
BalkaNet EC-funded project IST-2000-29388) is described, along with its conceptual
alignment to the reference lexical semantic network of English, the Princeton
WordNet (version 2.0), developed by George Miller and his colleagues at Princeton
University in the USA. This network serves as the sense inventory with which the
WSD algorithms described here operate.
This chapter introduces YAWA, a lexical aligner developed by the author, and
WSDTool, an unsupervised, knowledge-based WSD algorithm also developed by the
author, which runs on parallel corpora and relies on the lexical alignments produced
by YAWA.
WSDTool assumes that, given a parallel corpus, the translations of the target word1
into the other languages of the parallel corpus reduce the semantic field2 of the target
word, since translation preserves the meaning of the source word. Thus, the
translation equivalents of the target word become a materialization of its context,
because the translator(s) chose them based on this context.
1 The word to be disambiguated.
2 The set of meanings that a word can have.
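The following minimal Python sketch illustrates this assumption: the plausible
senses of the target word are those it shares, through the interlingual alignment,
with its translation equivalent. The mock sense sets are invented for the example;
the actual algorithm consults the aligned wordnets.

    # Mock sense inventories: each word maps to the interlingual identifiers
    # of the concepts it can lexicalize (invented values, for illustration).
    EN_SENSES = {"bank": {"ili-1", "ili-2"}}   # river bank vs. financial bank
    RO_SENSES = {"banca": {"ili-2", "ili-3"}}  # financial bank vs. school bench

    def filter_senses(en_word, ro_word):
        # Keep only the senses of the target word that its translation
        # equivalent can also express: translation preserves meaning.
        return EN_SENSES[en_word] & RO_SENSES[ro_word]

    print(filter_senses("bank", "banca"))  # {'ili-2'}: the financial sense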
The sense inventory used by WSDTool is the collection of conceptually aligned
lexical semantic networks of English and Romanian and the sense annotation is done
using Inter-Lingual Indexes (ILI) that are unique identifiers of the EQ-SYNONYM
related English and Romanian concepts. This way, the sense annotation is performed
uniformly in both English and Romanian, and the associated concepts can be
retrieved simply by following the ILI pointer. The performance of WSDTool is
presented in Figure 1. The measures were computed on the SemCor parallel corpus,
for 79595 occurrences of content words in English and 48392 occurrences of content
words in Romanian.
Figure 1 WSDTool performance measures on three sense inventories: ILI, SUMO categories and IRST
domains. P is the precision, R is the recall and F is the usual combination of P and R (their harmonic
mean). S/C is the average number of sense labels assigned per content word and is a measure of the
discriminating power of the algorithm.
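For reference, and assuming (as is standard) that F here denotes the balanced
F-measure, the three scores are related by:

    F = \frac{2 \cdot P \cdot R}{P + R}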
One of the most widely used context formalizations is that of context features. In a
window of words centered on the target word, some of the following properties of
the window words have been used as context features3 (see the sketch after this list):
• The words themselves either reduced to their lemma forms or not;
• The POS tags of the words;
• Collocations with the target word.
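A minimal Python sketch of such feature extraction follows; the window size, the
lemma/POS encoding and the mock tagged sentence are illustrative assumptions.

    def context_features(tagged, target_index, window=2):
        # tagged: list of (lemma, pos) pairs for the sentence words.
        feats = []
        for i in range(max(0, target_index - window),
                       min(len(tagged), target_index + window + 1)):
            if i != target_index:
                lemma, pos = tagged[i]
                feats.append(lemma + "/" + pos)
        return feats

    tagged = [("the", "DET"), ("bank", "NOUN"), ("raise", "VERB"),
              ("the", "DET"), ("rate", "NOUN")]
    print(context_features(tagged, 1))  # ['the/DET', 'raise/VERB', 'the/DET']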
The syntactic representation of the context of the target word is not new in the realm
of WSD algorithms. It has the advantage that the target word is related only to the
“relevant” words in its context, the words that have a direct influence on determining
the target word’s meaning. The syntactic dependency relation of the MTM model best
describes this situation. However, we will not be able to make use of it here because a
dependency parser is not available for Romanian. So, instead of a fully-fledged
syntactic dependency analysis, we will make use of a dependency-like structure of the
sentence obtained through the use of a Lexical Attraction Model (LAM).
3 This is, of course, not an exhaustive list.
4 Edges drawn over the surface form of the sentence do not cross.
Structural constraints such as planarity4 are also imposed on the resulting graph.
LexPar is the author’s implementation of a constrained LAM, that is, a LAM in which
not all possible links are allowed, due to syntactic combination restrictions such as
agreement.
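To illustrate the “lexical attraction” behind a LAM, the sketch below computes the
pointwise mutual information of a word pair, the quantity a LAM seeks to maximize
over the links it draws; the counts are toy values, not corpus statistics from this
work.

    import math

    def pmi(pair_count, count_x, count_y, total_tokens):
        # Pointwise mutual information: how much more often the two words
        # co-occur than they would if they were independent.
        p_xy = pair_count / total_tokens
        p_x = count_x / total_tokens
        p_y = count_y / total_tokens
        return math.log2(p_xy / (p_x * p_y))

    # Toy counts: "interest" and "rate" co-occur 50 times in 10000 tokens.
    print(round(pmi(50, 200, 100, 10000), 2))  # 4.64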
Figure 2 gives the performance measures of SynWSD, the author’s WSD algorithm
that exploits this dependency-like structure, on the same data on which WSDTool
was tested.
Figure 2 SynWSD performance figures on English-Romanian SemCor. There are four meaning affinity
functions (dice, prob, pointwise mutual information and log-likelihood) and three combination methods
(intersection, majority voting and union).
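A minimal Python sketch of the three combination methods named in Figure 2,
applied to the sense sets proposed by the different meaning affinity functions;
reading “majority voting” as keeping the senses proposed by more than half of the
functions is an assumption about the exact scheme.

    from collections import Counter

    def combine(sense_sets, method):
        # sense_sets: one set of proposed sense labels per affinity function.
        if method == "intersection":
            return set.intersection(*sense_sets)
        if method == "union":
            return set.union(*sense_sets)
        # Majority voting: keep senses proposed by more than half the sets.
        votes = Counter(s for ss in sense_sets for s in ss)
        return {s for s, v in votes.items() if v > len(sense_sets) / 2}

    proposals = [{"s1", "s2"}, {"s1"}, {"s1", "s3"}]
    print(combine(proposals, "intersection"))  # {'s1'}
    print(combine(proposals, "majority"))      # {'s1'}
    print(combine(proposals, "union"))         # {'s1', 's2', 's3'}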
5. Conclusions
This work focused on presenting all the algorithms one needs in order to perform
WSD on free running text. The author’s contributions are:
• The assembly of the SemCor English-Romanian parallel corpus with
Romanian meaning annotation. SemCor is the reference corpus for testing
English WSD algorithms;
• The development of the TTL free running text processing module capable of
named entity recognition, sentence and token splitting, POS tagging,
lemmatization and chunking. Sentence and token splitting, POS tagging and
lemmatization are needed by any WSD algorithm;
• The development of YAWA, a lexical aligner operating in four stages. This
aligner is needed by WSDTool to find the translation equivalents. YAWA is
part of the COWAL combined lexical alignment platform that won first place
in the lexical alignment competition held at the workshop “Building and
Using Parallel Texts: Data Driven Machine Translation and Beyond” of the
43rd Annual Meeting of the Association for Computational Linguistics
(ACL’05), Ann Arbor, USA;
• The development of WSDTool, an unsupervised, knowledge-based WSD
solution that is able to disambiguate parallel corpora using three different
sense inventories: WordNet concepts, SUMO categories and IRST domains.
WSDTool was also used to validate the correctness of the conceptual
alignment between the Romanian WordNet and the Princeton English WordNet;
• The development of LexPar, a constrained LAM that provides the
dependency-like graph of the sentence to be disambiguated by SynWSD;
• The development of SynWSD, an unsupervised, knowledge-based WSD
solution that is able to disambiguate plain texts using, like WSDTool, the
three sense inventories. Recently, SynWSD participated in the fourth
evaluation of WSD algorithms, SensEval-4/SemEval-2007, where it ranked
8th out of 14 competing systems in the English All Words task, being
outperformed mostly by supervised WSD systems.
During his doctoral studies, the author published 23 papers and 3 abstracts in
leading conference proceedings and journals in the field of Computational
Linguistics and Natural Language Processing. Some of these are:
• The Association for Computational Linguistics (ACL)5;
• The North American Chapter of the Association for Computational Linguistics
(NAACL)6;
• The International Conference on Computational Linguistics (COLING)7;
• The European Chapter of the Association for Computational Linguistics
(EACL)8;
• The Language Resources and Evaluation Conference (LREC)9;
• The International Florida Artificial Intelligence Research Society Conference
(FLAIRS)10;
• The Language Resources and Evaluation Journal, ISSN 1574-020X (print
version), Journal no. 10579, Springer Netherlands.
5 http://www.aclweb.org/
6 http://www.cs.cornell.edu/home/llee/naacl/
7 http://www.issco.unige.ch/coling2004/
8 http://eacl.coli.uni-saarland.de/
9 http://www.lrec-conf.org/
10 http://www.flairs.com/