
Chapter 4

Semantic Parsing

Sameer Pradhan

Semantics by its dictionary definition is the study of meaning, and parsing is the examination of something in a minute way, that is, identifying and relating the pieces of information being parsed. When we put the two of these concepts together, we get semantic parsing, which, in the broadest sense of the phrase, is the process of identifying meaning chunks contained in an information signal in an attempt to transform it into some data structure that can be manipulated by a computer to perform higher-level tasks. In our case, the information signal is human language text. Unfortunately, in the natural language processing community, the term semantic parsing is somewhat ambiguous. Over the years researchers have used it to represent various levels of granularity of meaning representation. Because semantics is such a vague term, it has been used to represent various depths of representations, from something as basic as identifying domain-specific relations between entities, to the more intermediate task of identifying the roles that various entities and artifacts play in an event, to converting a text to a series of specific logical expressions. Within the context of this chapter, we restrict its interpretation to the study of mapping naturally occurring text to some representation that is amenable to manipulation by computers for the purpose of achieving some goals, such as retrieving information, answering a question, populating a database, or taking an action.

4.1 Introduction
The holy grail of research in language understanding is the identification of a meaning representation that is detailed enough to allow reasoning systems to make deductions but, at the same time, is general enough that it can be used across many domains with little to no adaptation. It is not clear whether a final, low-level, detailed semantic representation covering various applications that use some form of language interface can be achieved or whether an ontology can be created that can capture the various granularities and aspects of meanings that are embodied in such a variety of applications; none has yet been created. Therefore, two compromise approaches have emerged in the natural language processing community for language understanding.
In the first approach, a specific, rich meaning representation is created for a limited domain for use by applications that are restricted to that domain, such as air travel reservations, football game simulations, or querying a geographic database. Systems are then
crafted to generate output from text in this rich, domain-specific meaning representation.
In the second approach, a related set of intermediate meaning representations is created, going from a low-level analysis to a midlevel analysis, and the bigger understanding task is divided into multiple, smaller pieces that are more manageable, such as word sense disambiguation followed by predicate-argument structure recognition. By dividing the problem up this way, each intermediate representation is only responsible for capturing a relatively small component of overall meaning, thereby making the task of defining and modeling each representation easier. Unlike the first approach, each meaning representation, while covering only a small part of the overall meaning, is not tied to a specific domain, and so the data and methods created for it are similarly general purpose.
Unfortunately, we do not yet have the holy grail in the form of a detailed overall representation that would at once be easily learnable and have high coverage across domains. So, in this chapter, we treat the world as though it has exactly two types of meaning representations: a domain-dependent, deeper representation and a set of relatively shallow but general-purpose, low-level, and intermediate representations. The task of producing the output of the first type is often called deep semantic parsing, and the task of producing the output of the second type is often called shallow semantic parsing. We discuss algorithms for producing both kinds of output.
Both of these approaches are fraught with issues; the first approach is so specific that porting to every new domain can require anywhere from a few modifications to almost reworking the solution from scratch. In other words, the reusability of the representation across domains is very limited. The problem with the latter approach is that it is extremely difficult to construct a general-purpose ontology and create symbols that are shallow enough to be learnable but detailed enough to be useful for all possible applications. Therefore, an application-specific translation layer between the more general representation and the more specific representation becomes necessary. However, this translational component can be relatively small compared to the total effort required to adapt a more specific representation to a new domain. None of this even begins to consider the implications of using such systems across different languages or the role played by the structure of different languages in affecting these meaning representations or their learnability. For these reasons, over the history of language processing, the community has generally moved away from the more detailed, deep, domain-dependent representations to the more shallow ones.

4.2 Semantic Interpretation

Semantic parsing can be considered as part of a larger process, semantic interpretation, which involves various components that together let us define a representation of text that can be fed into a computer to allow further computational manipulations and search, which are prerequisite for any language understanding system or application. The following sections talk about some of the main components of this process.
We begin this discussion with the seminal work by Chomsky, Syntactic Structures [1], which introduced the concept of a transformational phrase structure grammar to provide an operational definition for the combinatorial formations of meaningful natural language

sentences by humans. Shortly after Chomsky's 1957 book, Katz and Fodor [2] published the
first work treating semantics within the generative grammar paradigm. They found that
Chomsky's transformational grammar was not a complete description of language because
it did not account for meaning. In their 1963 paper "The Structure of a Semantic Theory," Katz and Fodor put forward what they thought were the properties a semantic theory should possess. A semantic theory should be able to:

1. Explain sentences having ambiguous meanings. For example, it should account for the
fact that the word bill in the sentence The bill is large is ambiguous in the sense that
it could represent money or the beak of a bird.
2. Resolve the ambiguities of words in context. For example, if the same sentence is
extended to form The bill is large but need not be paid, then the theory should be able
to disambiguate the monetary meaning of bill.
3. Identify meaningless but syntactically well-formed sentences, such as the famous
example by Chomsky: Colorless green ideas sleep furiously.
4. Identify syntactically or transformationally unrelated paraphrases of a concept having
the same semantic content.
In the following subsections we look at some requirements for achieving a semantic
representation.

4.2.1 Structural Ambiguity


When we talk of structure, we generally refer to the syntactic structure of sentences. This is a sentence-level phenomenon and essentially means transforming a sentence into its underlying syntactic representation. Because syntax and semantics have such a strong interaction, most theories of semantic interpretation refer to the underlying syntactic representation. Conventionally, syntax has become the first stage of processing, followed by various other stages in the process of semantic interpretation (see Chapter 3 for information on syntactic processing).

4.2.2 Word Sense


In any given language, it is almost certainly the case that the same word type, or word lemma, is used in different contexts and with different morphological variants to represent different entities or concepts in the world. For example, we use the word nail to represent a part of the human anatomy and also to represent the generally metallic object used to secure other objects. Humans are adept at identifying, through context, which sense of the word is intended by the author or speaker. Let's take the following four examples. The presence of words such as hammer and hardware store in sentences 1 and 2, and of clipped and manicure in sentences 3 and 4, enable humans to easily disambiguate the sense in which nail is used:

1. He nailed the loose arm of the chair with a hammer.

2. He bought a box of nails from the hardware store.



3. He went to the beauty salon to get his nails clipped.

4. He went to get a manicure. His nails had grown very long.

Resolving the sense of words in a discourse, therefore, constitutes one of the steps in the process of semantic interpretation. We discuss it in greater depth in Section 4.4.

4.2.3 Entity and Event Resolution

Any discourse inevitably consists of a set of entities participating in a series of explicit or implicit events over a period of time. The next important component of semantic interpretation is the identification of the various entities that are sprinkled across the discourse using the same or different phrases. Reconciling what type of entity or event is being considered, along with disambiguating the various ways in which the same entity is referred to over a discourse, is critical to creating a semantic representation. Two predominant tasks have become popular over the years: named entity recognition and coreference resolution. These two tasks fall under the umbrella of information extraction and are discussed in more detail in Chapter 8.

4.2.4 Predicate-Argument Structure

Once we have the word senses, entities, and events identified, another level of semantic structure comes into play: identifying the participants of the entities in these events. Resolving the argument structure of the predicates in a sentence is where we identify which entities play what part in which event. Generally, this process can be defined as the identification of who did what to whom, when, where, why, and how.
Figure 4-1 shows the participants of the say and acquire events in the sentence: Bell Atlantic Corp. said it will acquire one of Control Data Corp.'s computer-maintenance businesses.

Figure 4-1: A representation of who did what to whom, when, where, why, and how
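To make the "who did what to whom" structure of Figure 4-1 concrete, the following sketch encodes the two predicate-argument frames as plain Python dictionaries. The role names follow the figure; the encoding itself and the helper `roles_of` are illustrative choices, not a representation defined in this chapter.

```python
# Predicate-argument frames for the sentence in Figure 4-1, as plain
# dictionaries. Role labels ("who", "what", "whom") follow the figure.
predicates = [
    {
        "predicate": "say",
        "who": "Bell Atlantic Corp.",
        "what": "it will acquire one of Control Data Corp.'s "
                "computer-maintenance businesses",
    },
    {
        "predicate": "acquire",
        "who": "it",  # coreferent with Bell Atlantic Corp.
        "whom": "one of Control Data Corp.'s computer-maintenance businesses",
    },
]

def roles_of(pred_name, frames):
    """Return the role -> filler mapping for a given predicate, or None."""
    for frame in frames:
        if frame["predicate"] == pred_name:
            return {k: v for k, v in frame.items() if k != "predicate"}
    return None

print(roles_of("acquire", predicates))
```

Note that resolving "it" back to Bell Atlantic Corp. is the coreference resolution task of Section 4.2.3; predicate-argument recognition alone only identifies the role fillers as they appear in the sentence.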
4.3 System Paradigms 101

4.2.5 Meaning Representation


The final process of the semantic interpretation is to build a semantic representation or meaning representation that can then be manipulated by algorithms to various application ends. This process is sometimes called the deep representation. Unfortunately, as we mentioned earlier, due to the lack of a general-purpose representation that is also deep enough for any given application, most studies in this area have been application dependent, or dependent on the domain of particular applications. The following two examples show sample sentences and their meaning representations for the RoboCup and GeoQuery domains (described in §4.6.1):

(1) If our player 2 has the ball, then position our player 5 in the midfield.
    ((bowner (player our 2)) (do (player our 5) (pos (midfield))))
(2) Which river is the longest?
    answer(x1, longest(x1, river(x1)))
This is a domain-specific approach; the remainder of this chapter focuses on domain-independent approaches.
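The point of a meaning representation such as example (2) is that, once produced, it can be executed against a database. The following toy sketch illustrates this for the GeoQuery-style form answer(x1, longest(x1, river(x1))); the mini-database, the river lengths, and the helper names are all invented for illustration and are not part of the GeoQuery system.

```python
# Toy denotational reading of answer(x1, longest(x1, river(x1))).
# The database and all values below are invented for illustration.
RIVERS = {"mississippi": 3734, "rio grande": 3057, "colorado": 2330}  # lengths

def river(_var):
    """Denotation of river(x): the set of all river entities."""
    return set(RIVERS)

def longest(entities):
    """Denotation of longest(x, S): the member of S with greatest length."""
    return max(entities, key=lambda e: RIVERS[e])

def answer(denotation):
    """answer(x, ...) simply returns the denoted entity."""
    return denotation

# Executing the logical form of "Which river is the longest?"
print(answer(longest(river("x1"))))  # -> mississippi
```

The semantic parser's job is to map the question to the nested logical form; executing that form against the database is then mechanical, which is exactly why such representations are "amenable to manipulation by computers."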

4.3 System Paradigms


The problems discussed in this chapter are familiar to the computational linguistics and linguistics communities. Researchers from these communities have examined meaning representations and methods to recover them at different levels of granularity and generality, exploring the space of numerous languages. For many of the potential experimental conditions, no hand-annotated data is available. Therefore, it is important to get a perspective on the various primary dimensions on which the problem of semantic interpretation has been tackled. It is impossible to cover all these dimensions in this chapter, so while we mention many of the historic approaches, we try to focus on the more prevalent and successful approaches that lend themselves to practical applications. The approaches generally fall into the following three categories.

1. System Architectures
(a) Knowledge based: As the name suggests, these systems use a predefined set of rules or a knowledge base to obtain a solution to a new problem.
(b) Unsupervised: These systems tend to require minimal human intervention to be functional, using existing resources that can be bootstrapped for a particular application or problem domain.
(c) Supervised: These systems involve the manual annotation of some phenomena that appear in a sufficient quantity of data so that machine learning algorithms can be applied. Typically, researchers create feature functions that allow each problem instance to be projected into a space of features. A model is trained to use these features to predict labels, and then it is applied to unseen data.
(d) Semi-supervised: Manual annotation is usually very expensive and does not yield enough data to completely capture a phenomenon. In such instances, researchers can automatically expand the data set on which their models are trained, either by employing machine-generated output directly or by bootstrapping off of an existing model by having humans correct its output. In many cases, a model from one domain is used to quickly adapt to a new domain.

2. Scope
(a) Domain Dependent: These systems are specific to certain domains, such as air travel reservations or simulated football coaching.
(b) Domain Independent: These systems are general enough that the techniques can be applicable to multiple domains with little or no change.

3. Coverage
(a) Shallow: These systems tend to produce an intermediate representation that can then be converted to one that a machine can base its actions on.
(b) Deep: These systems usually create a terminal representation that is directly consumed by a machine or application.

4.4 Word Sense

In a compositional approach to semantics, where the meaning of the whole is composed of the meaning of the parts, the smallest parts under consideration in textual discourse are typically the words themselves: either tokens as they appear in the text or their lemmatized forms. Word sense has been examined and studied for a very long time [3, 4, 5], but its true nature still eludes researchers. It is not clear whether it is possible to identify a finite set of senses that each word in a language exhibits in various contexts. Even if it were possible to do so, it is not clear whether a given word evokes a single discrete sense in a given context or whether the word is associated with a distribution of some subset of all its senses.
Attempts to solve this problem range from rule-based and knowledge-based methods to completely unsupervised, supervised, and semi-supervised learning methods. Very early systems were predominantly rule based or knowledge based and used dictionary definitions of the senses of words. Unsupervised word sense induction or disambiguation techniques try to induce the senses of a word as it appears in various corpora. These systems perform either a hard or soft clustering of words and tend to allow the tuning of these clusters to suit a particular application. Most recent supervised approaches to word sense disambiguation, on the other hand, primarily assume that a word can evoke only one particular sense in a given context, and at a predefined, usually application-independent, level of granularity, although the output of supervised approaches can still be amenable to generating a ranking, or distribution, of sense memberships. In the case of supervised word sense disambiguation, where human annotation is necessary, a delicate balance often exists between making fine-grained distinctions between word senses and maintaining good interannotator agreement, given the inventory of senses. The coarser the granularity of senses for a word, the more consistent the annotation and the more learnable they become. However, there is an increased chance that this lower granularity might not identify nuances that are fine enough for the consuming application. An observed win in learning and annotation might not translate directly into the depth of representation of meaning expected by the application. Palmer, Dang, and Fellbaum [6] discussed this issue in great detail.
Although theoretically assumed to be an important aspect of language understanding, the applicability of word sense disambiguation seems to be an issue of much debate. The inherent difficulty of generating huge corpora of manually sense-tagged text, complicated in part by the prevailing agnostic or ambivalent status of the applicability of word sense in various applications, is also a probable cause of why few computational resources have been generated to support the creation of better automatic systems. Also, the absence of standard criteria has prevented the merging of various resources that have sense information. Some attempts are being made to create mappings between such resources.
and Yarow sky
One of the princi ple reason s for this ambiv alence, as observ ed by Resnik
inform ation
[7] , is that in many of the more matur e areas of langua ge proces sing, such as
ques tend to be
retrieval and speech recogn ition, either the sense disam biguat ion techni
ation retriev al it is
redundant or cheap er and better altern atives are available. In inform
multip le words in
a well-accepted fact that the multip le words in a query match ing with
is hard to beat with
the docum ent conte xt tend to provid e an implic it disam biguat ion that
s [9, 10] have alway s
perfect word sense inform ation [8]. In speech recogn ition, contex t classe
genres of text tend
proven to be more applic able than word classe s [11]. Specific domai ns or
word. There fore, in
to invoke a smalle r subse t, or even just one sense of a given conten t
and some doma in-
light of the fact that some seman tic parsin g system s are domai n specific
than in the forme r.
independent, the disam biguat ion of sense is more necess ary in the latter
a uniqu e conce pt,
Furthermore, in domai n-spec ific applic ations , a word usuall y maps to
proble m, furthe r
and thus, finding a good mappi ng from words to the concep ts is an easier
sky [7] pointe d out
diminishing the necess ity for sense disam biguat ion. Resnik and Yarow
of standa rdized
several reasons for the lack of progre ss in word sense disam biguat ion: a lack
as compa red to
evaluation~, the range of resour ces neede d to provid e requir ed knowl edge
data sets. Follow ing
other tasks, and the difficu lty of obtain ing adequ ately large sense- tagged
severa l e.xercises
that study, the Specia l Intere st Group on LEXic on (SIGL EX) has held
have been very
called SENSEVAL 1 2 and 3 and SEME VAL 1 and 2. These compe titions
well as identi fying
successful in genera~in~ stand ard datase ts and evalua tion criteri a, as
biguat ion and its
related tasks that have advan ced the under standi ng of word sense disam
applications.
How to measure the performance of automatic word sense disambiguation systems is an important issue. Gale, Church, and Yarowsky [12] discussed it in great detail. Their proposal, which is still commonly followed, is that the lower bound on the performance of a system for disambiguating a word should be the one in which every instance of the lemma is assigned the most frequent sense it exhibits in a sufficiently large corpus. This is commonly known as the most frequent sense, or MFS, baseline. A good property of a gold-standard sense-tagged corpus is that it should be replicable to a high degree. In other words, multiple annotators should be able to annotate the same corpus with a sufficiently high agreement. Let's say this agreement is x%. We would generally view x% as an upper bound on the performance of any automatic system.
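The MFS baseline described above is simple enough to sketch directly. In the following illustration, the tiny sense-tagged corpus and its sense labels are invented; the logic, assigning every instance of a lemma its corpus-wide most frequent sense and scoring the result, follows the Gale, Church, and Yarowsky proposal.

```python
from collections import Counter

# Toy sense-tagged corpus: (lemma, gold sense) pairs. Labels are invented.
tagged = [
    ("bank", "financial"), ("bank", "financial"), ("bank", "river"),
    ("nail", "anatomy"), ("nail", "fastener"), ("nail", "fastener"),
]

def mfs_table(corpus):
    """Map each lemma to its most frequent sense in the tagged corpus."""
    counts = {}
    for lemma, sense in corpus:
        counts.setdefault(lemma, Counter())[sense] += 1
    return {lemma: c.most_common(1)[0][0] for lemma, c in counts.items()}

def mfs_accuracy(corpus, table):
    """Accuracy of always predicting the most frequent sense."""
    correct = sum(1 for lemma, sense in corpus if table[lemma] == sense)
    return correct / len(corpus)

table = mfs_table(tagged)
print(table["bank"])                 # most frequent sense of "bank"
print(mfs_accuracy(tagged, table))   # 4 of 6 instances are the MFS here
```

On this toy corpus the baseline scores 4/6: it is right whenever the instance carries the majority sense and wrong otherwise, which is exactly why MFS is a lower bound that real systems must beat.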

Word sense ambiguities can be of three principal types: (i) homonymy, (ii) polysemy, and (iii) categorial ambiguity [13]. Homonymy indicates that the words share the same spelling, but the meanings are quite disparate. Each homonymous partition, however, may contain finer sense nuances that could be assigned to the word depending on the context, and this phenomenon is called polysemy. For example, these two senses of the word bank are orthogonal: financial bank and river bank. Further, bank has some other, somewhat finer, and related subsenses that indicate a collection of things: for example, financial bank and bank of clouds. To illustrate categorial ambiguity, the word book can mean a book such as the one in which this chapter appears or to enter charges against someone in a police register. The former belongs to the grammatical category of noun, and the latter, verb. Distinguishing between these two categories effectively helps disambiguate these two senses. Therefore, categorial ambiguity can be resolved with syntactic information (part of speech) alone, but polysemy and homonymy need more than syntax.
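The asymmetry between categorial ambiguity and polysemy can be shown with a minimal sketch. The sense inventory below is invented for illustration; the point is only that conditioning on part of speech collapses the candidates for book to one but leaves both noun senses of bank in play.

```python
# Invented mini sense inventory keyed on (lemma, part of speech).
SENSES = {
    ("book", "NOUN"): ["bound volume of printed pages"],
    ("book", "VERB"): ["enter charges against someone in a police register"],
    ("bank", "NOUN"): ["financial institution", "sloping side of a river"],
}

def senses_given_pos(lemma, pos):
    """Candidate senses remaining once the part of speech is known."""
    return SENSES.get((lemma, pos), [])

print(senses_given_pos("book", "VERB"))  # one candidate: POS alone sufficed
print(senses_given_pos("bank", "NOUN"))  # two candidates remain: polysemy
```

Part-of-speech tagging thus acts as a cheap first filter, after which a genuine word sense disambiguation method is still needed for homonymy and polysemy within a category.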
Traditionally, in English, word senses have been annotated for each part of speech separately, whereas in Chinese, the sense annotation has been done per lemma and so ranges across all parts of speech. Part of the reason is that the distinction between a noun and a verb is much more obscure in Chinese.

4.4.1 Resources
As with any language understanding task, the availability of resources is a key factor in
the disambiguation of word senses in corpora. Unfortunately, the community has not seen
the development of a significant amount of hand-tagged sense data, at least not until very recently. Early work on word sense disambiguation used machine-readable dictionaries or thesauruses as knowledge sources. Two prominent sources were the Longman Dictionary of Contemporary English (LDOCE) [14] and Roget's Thesaurus [15]. The late 1980s gave birth to a significant lexicographical resource, WordNet [16], which has been very influential. In addition to being a lexical resource with inventories of senses provided for most words in English across multiple parts of speech, it also has a rich taxonomy connecting words across many different relationships, such as hypernymy, homonymy, meronymy, and so on. In addition, to facilitate research in automatic sense disambiguation, a small portion of the Brown Corpus [17] has been annotated with WordNet senses to create a semantic concordance, the SEMCOR corpus [18]. More recently, WordNet has been extended by adding syntactic information to the glosses, disambiguating them with manual and automatic methods, and generating logical forms to allow better incorporation in applications such as question answering [19]. Another resource, the DSO Corpus of Sense-Tagged English, was created by tagging WordNet version 1.5 senses on the Brown and Wall Street Journal (WSJ) corpora for the 121 nouns and 70 verbs that are among the most frequent and ambiguous words in English [20]. Further, the SENSEVAL [21] competitions held over the past decade have created many corpora for testing systems on word sense and related problems. The largest sense annotation effort so far has been the OntoNotes corpus [22, 23, 24], released through the Linguistic Data Consortium (LDC), in which a significant number of verb (~2,700) and noun (~2,200) lemmas, covering roughly 85% of multiple corpora spanning multiple genres, have been tagged with coarse-grained senses at a very high interannotator agreement. Pradhan et al. [25] based a lexical sample task in SEMEVAL 2007 on this corpus.

Cyc [26] is another good example of a useful resource that creates a formalized represen-
tation of common sense knowledge about objects and events in the world to overcome the
so-called knowledge bottleneck that is so crucial to word sense disambiguation and many
other natural language tasks. Even after a couple of decades of handcrafting this knowledge
base, it leaves much to be desired, which underscores the difficulty of such an endeavor.
Fortunately, English seems to have the most highly developed lexicons with various
semantic features associated with words and words grouped together to form coherent
semantic classes. Efforts are underway to create resources for other languages as well.
For example, HowNet [27] is a network of words for Chinese similar to WordNet. The
Global WordNet Association (http://www.globalwordnet.org) keeps track of WordNet
development across various languages. Researchers are also using semiautomatic methods
for expanding coverage of existing languages [28, 29, 30, 31] and for other languages such
as Greek [32]. In addition to such corpora annotated with sense information, there are
also many resources such as WordNet Domains (http://wndomains.fbk.eu/) that provide
structured knowledge to help overcome the knowledge bottleneck in sense disambiguation.

4.4.2 Systems
Now that we have looked at the problem and some resources, we turn to some sense disambiguation systems. As mentioned earlier, researchers have explored various system architectures to address the sense disambiguation problem. We can classify these systems into four main categories: (i) rule based or knowledge based, (ii) supervised, (iii) unsupervised, and (iv) semi-supervised.
In the following sections, we look at each of these types of systems in order.

Rule Based
The first generation of word sense disambiguation systems was primarily based on dictionary sense definitions and glosses [33, 34]. Most of these techniques were handcrafted and used resources that are not necessarily accessible today. Also, access to the exact rules and systems was very limited, and most information was only available from archived publications and discussions of the specific lexical items and senses that were considered during those experiments. In short, much of this information is historical and cannot readily be translated and made available for building systems today. However, some valuable techniques and algorithms are still accessible, and we look at these in this section. Probably the simplest and oldest dictionary-based sense disambiguation algorithm was introduced by Lesk [35]. The first-generation word sense disambiguation algorithms were mostly based on computerized dictionaries; for example, see Calzolari and Picchi [33].
The first SENSEVAL evaluations [36] used a simplified version of the Lesk algorithm as a baseline for comparing word sense disambiguation performance. The pseudocode for the algorithm is shown in Algorithm 4-1. The core of the algorithm is that the sense of a word in a given context is most likely to be the dictionary sense whose terms most closely overlap with the terms in the context. There have since been further modifications to the algorithm to make it more robust to variation in term usages, context, and definition. Banerjee and Pedersen [37], for example, modified the Lesk algorithm so that synonyms, hypernyms, hyponyms, meronyms, and so on of the words in the context as well as in the dictionary definition are used to get a more accurate overlap statistic. The score

Algorithm 4-1 Pseudocode of the simplified Lesk algorithm
The function COMPUTEOVERLAP(signature, context) returns the number of words common to the two sets
Procedure: SIMPLIFIED_LESK(word, sentence) returns best sense of word

1: best-sense ← most frequent sense of word
2: max-overlap ← 0
3: context ← set of words in sentence
4: for all sense ∈ senses of word do
5:   signature ← set of words in gloss and examples of sense
6:   overlap ← COMPUTEOVERLAP(signature, context)
7:   if overlap > max-overlap then
8:     max-overlap ← overlap
9:     best-sense ← sense
10:  end if
11: end for
12: return best-sense

associated with each match is measured as the square of the longest common subsequence between the context and the gloss.¹ Using a context window of five words (two on each side of the target, as well as the target itself), they report a twofold increase in performance, from 16% to 32%, over the vanilla Lesk algorithm when used on the SENSEVAL-2 lexical sample dataset. This performance improvement is considerable given the simplicity of the algorithm.
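The pseudocode in Algorithm 4-1 translates almost line for line into Python. The two-sense inventory for bank below is an invented stand-in for real dictionary glosses; the procedure itself follows the pseudocode, including the MFS initialization of best-sense.

```python
def compute_overlap(signature, context):
    """Number of words common to the two sets (COMPUTEOVERLAP)."""
    return len(signature & context)

def simplified_lesk(word, sentence, sense_inventory):
    """sense_inventory: list of (sense, gloss-and-examples) pairs, with the
    most frequent sense first, mirroring line 1 of Algorithm 4-1."""
    best_sense = sense_inventory[0][0]      # most frequent sense of word
    max_overlap = 0
    context = set(sentence.lower().split())
    for sense, gloss in sense_inventory:
        signature = set(gloss.lower().split())
        overlap = compute_overlap(signature, context)
        if overlap > max_overlap:
            max_overlap = overlap
            best_sense = sense
    return best_sense

# Invented toy glosses for two senses of "bank".
bank_senses = [
    ("bank.financial", "a financial institution that accepts deposits money"),
    ("bank.river", "sloping land beside a body of water river"),
]
print(simplified_lesk("bank", "he sat on the bank of the river", bank_senses))
# -> bank.river
```

Note that a real implementation would strip stop words and punctuation before computing overlap; here the raw token sets are enough to show the mechanism.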
Another dictionary-based algorithm was suggested by Yarowsky [38]. This study used Roget's Thesaurus categories and classified unseen words into one of these 1,042 categories based on a statistical analysis of 100-word concordances for each member of each category over a large corpus, in this case the 10-million-word Grolier's Encyclopedia. The method performed quite well on a set of 12 words for which there had been some previous quantitative studies. Although the instances and corpora used in this study were not the same as the ones reported previously, it still gives an idea of the success of a relatively simple method. The method consists of three steps, as shown in Figure 4-2. The first step is a collection of contexts. The second step computes weights for each of the salient words. One thing to note is that the amount of context used was 50 words on each side of the target word, which is much higher than the context windows found to be useful for this kind of broad, topical classification by Gale et al. [12]. P(w|RCat) is the probability of a word w occurring in the context of a Roget's Thesaurus category RCat. Finally, in the third step, the unseen words in the test set are classified into the category that has the maximum weight.
More recently, Navigli and Velardi [39, 40] suggested a knowledge-based algorithm that uses a graphical representation of the senses of words in context to disambiguate the term under

1. Multiple subsequences in the same gloss are allowed; however, subsequences of only noncontent words, such as pronouns, prepositions, articles, and conjunctions, are not considered. For example, the subsequence of the is not considered in the calculation of a score.
4.4 Word Sense

1. Collect contexts for each of the Roget's Thesaurus categories.

2. Determine weights for each of the salient words in the context:

P(w_i|RCat) / P(w_i)

3. Use the weights for predicting the appropriate category of the word in the test corpus:

arg max_{RCat} Σ_{w_i} log [ P(w_i|RCat) P(RCat) / P(w_i) ]

Figure 4-2: Algorithm for disambiguating words into Roget's Thesaurus categories

consideration. This is called the structural semantic interconnections (SSI) algorithm. It


uses various sources of information, including WordNet, domain labels [41], and all possible
annotated corpora to form structural specifications of concepts, or semantic graphs. The
algorithm consists of two steps: an initialization step and an iterative step, in which the
algorithm attempts to disambiguate all the words in context iteratively until it cannot dis-
ambiguate any further or until all the terms are successfully disambiguated. Its performance
is very close to that of supervised learning algorithms. Although it does not technically have
a training phase, it surpasses the best unsupervised algorithm in the SENSEVAL-3 all-words
task. Figure 4-3 shows the semantic graphs for two senses of the term bus. The first one is
the vehicle sense and the second one is the connector sense.
Notation:

• T (the lexical context) is the list of terms in the context of the term t to be disambiguated: T = [t_1, t_2, ..., t_n].
• S_1^t, S_2^t, ..., S_n^t are structural specifications of the possible concepts (or senses) of t.
• I (the semantic context) is the list of structural specifications of the concepts associated with each of the terms in T \ {t} (except t): I = [S^{t_1}, S^{t_2}, ..., S^{t_n}], that is, the semantic interpretation of T.
• G is the grammar defining the various relations between the structural specifications (or semantic interconnections among the graphs).
• Determine how well the structural specifications in I match those of S_1^t, S_2^t, ..., S_n^t using G.
• Select the best matching S_i^t.

The algorithm works as follows. A set of pending terms in the context, P = {t_i | S^{t_i} = null},
is maintained, and I is used in each iteration to disambiguate terms in P. The procedure
iterates, and each iteration either disambiguates one term in P and removes it from the
pending list or stops because no more terms can be disambiguated. The output I is updated
with the sense of t. Initially, I contains structures for the monosemous terms in T \ {t} and any

Figure 4-3: The graphs for sense 1 and 2 of the noun bus as generated by the SSI algorithm

possible disambiguated synsets (since we do use sense-tagged data).2 If this is a null set, then
the algorithm makes an initial guess at what the most likely sense of the least ambiguous
term in the context is. During an iteration, the algorithm selects those terms t in P that
show semantic interconnections with at least one sense S of t and one or more senses in I.
A function f_I(S, t) determines the likelihood of S being the correct interpretation of t and
is defined as:

f_I(S, t) = ρ({φ(S, S') | S' ∈ I})  if S ∈ Senses(t), and 0 otherwise        (4.1)

where Senses(t) are the senses associated with the term t, and

φ(S, S') = ρ'({w(e_1 · e_2 ⋯ e_n) | S →(e_1) S_1 →(e_2) ⋯ →(e_{n-1}) S_{n-1} →(e_n) S'})        (4.2)

that is, a function (ρ') of the weights (w) of each path connecting S and S', where S and
S' are semantic graphs, and edges e_1 to e_n are the edges connecting them. A good choice
for ρ and ρ' would be a sum or average function.
Finally, a context-free grammar G = (E, N, S_G, P_G) encodes all the meaningful semantic
patterns, where:

E = {e_kind-of, e_has-kind, e_part-of, ...}

are the edge labels,

N = {S_G, S_s, S_g, S_1, S_2, ..., E_1, E_2, ...}

are nonterminal symbols that encode paths between the senses,

S_G

is the start symbol of the graph G, and

P_G = {S_G → S_s | S_g, S_s → S_1 | S_2 | S_3, S_1 → E_1 S_1 | E_1, E_1 → e_kind-of | e_part-of, S_g → e_gloss S_s | S_4 | S_5, ...}

are the productions (roughly 40 in the reported study).

2. A synset is a set of lemmas that all have the same word sense. The term was coined by the creators of
WordNet [16].


The hierarchical concept information in WordNet has been successfully utilized by many
approaches. Refer to Patwardhan, Banerjee, and Pedersen [42] for a comparison of several
semantic similarity measures based on WordNet. The recent emergence of unstructured
knowledge bases such as Wikipedia has led to a new generation of algorithms that extract
the implicit knowledge encoded in them to generate wider-coverage, multilingual knowledge
bases, a task that previously relied mostly on WordNet-like resources, and to further the
state of the art in many tasks such as word sense disambiguation.
Strube and Ponzetto [43, 44] provided an algorithm called WikiRelate! to
estimate the distance between two concepts using the Wikipedia taxonomy. Even more
recently, Navigli and Ponzetto [45] introduced a novel method for automatically creating
a multilingual lexical knowledge base that establishes a mapping between the large mul-
tilingual resource Wikipedia and the English computational lexicon WordNet. It currently
includes six languages (German, Spanish, Catalan, Italian, French, and English). The map-
ping to freely available WordNets in those languages can be easily generated using English
WordNet as the interlingua. The continued growth of Wikipedia will enable the genera-
tion of resources for many other languages using this methodology. As a starting point,
Ponzetto and Navigli [46] have already shown that the English information in BabelNet
can be used to create a word sense disambiguation system that rivals previous methods
on the task of coarse-grained sense disambiguation as well as domain-specific sense
disambiguation.

Supervised
Ironically, the simpler form of word sense disambiguation systems, the supervised approach,
which tends to transfer all the complexity to the machine learning machinery while still
requiring hand annotation, tends to be superior to unsupervised methods and performs
best when tested on annotated data [21]. The downside to this approach is that the sense
inventory has to be predetermined, and any change in the inventory might necessitate a
round of expensive reannotation.
These systems typically consist of a machine learning classifier trained on various features
extracted for words that have been manually disambiguated in a given corpus and the
application of the resulting models to disambiguate words in unseen test sets. A good feature of
these systems is that the user can incorporate rules and knowledge in the form of features,
and possibly semiautomatically generate training data to augment the set that has been
manually annotated, in an attempt to achieve the best of all three approaches. Of course, a
particular knowledge source and/or classifier combination may have issues that make it less
amenable to deriving the most optimal feature representation, and the semiautomatically
generated sense-tagged data could be noisy to varying degrees. Nevertheless, state-of-the-art
systems usually tend to be a combination of rich features and exploitation of redundancy in
language.

110 Chapter 4 Semantic Pars·


Ing

We look at some of the typical systems and features in this section. Brown et al. [47]
were probably the first to use machine learning for word sense disambiguation, using
information in parallel corpora. Yarowsky [48] was among the first to use a rich set of features
in a machine learning framework (decision lists) to tackle the word sense problem. Several
other researchers, such as Ng and Lee [20, 49], have used and refined those features in several
variations, including different levels of context and granularities: sentence, paragraph,
microcontext, and so on. In this section, we look at some of the more popular methods and
features that are relatively easy to obtain.

Classifier Probably the most common and high-performing classifiers are support vector
machines (SVMs) and maximum entropy (MaxEnt) classifiers. Many good-quality, freely
available distributions of each are available and can be used to train word sense disambiguation
models. Typically, because each lemma has a separate sense inventory, it is almost
always the case that a separate model is trained for each lemma and POS combination (i.e.,
if the language, as in the case of English, has separate sense inventories for various parts of
speech).
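The one-model-per-(lemma, POS) organization might look like the following sketch. The simple bag-of-words scorer here is a stand-in for the SVM or MaxEnt classifier a real system would plug in, and the sense labels are hypothetical:

```python
from collections import Counter, defaultdict

class PerLemmaWSD:
    """One model per (lemma, POS) combination, as is typical in supervised WSD.
    A simple bag-of-words scorer stands in for the SVM or MaxEnt classifier
    a real system would use."""

    def __init__(self):
        # (lemma, pos) -> {sense: Counter of context words seen in training}
        self.models = {}

    def train(self, lemma, pos, examples):
        model = defaultdict(Counter)
        for context_words, sense in examples:
            model[sense].update(context_words)
        self.models[(lemma, pos)] = model

    def predict(self, lemma, pos, context_words):
        model = self.models[(lemma, pos)]
        def score(sense):
            counts = model[sense]
            total = sum(counts.values())
            return sum(counts[w] / total for w in context_words)
        return max(model, key=score)

# hypothetical sense labels, for illustration only
wsd = PerLemmaWSD()
wsd.train("bank", "noun", [
    (["money", "deposit"], "bank%finance"),
    (["river", "shore"], "bank%river"),
])
pred = wsd.predict("bank", "noun", ["money", "deposit"])
```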

Features We discuss a more commonly found subset of features that have been useful in
supervised learning of word sense. These are not exhaustive by any means, but they are
time-tested and provide a very good base that can be used to achieve nearly
state-of-the-art performance.

Lexical context-This feature comprises the words and lemmas of words occurring in
the entire paragraph or a smaller window of usually five words.
Parts of speech-This feature comprises the POS information for words in the window
surrounding the word that is being sense tagged.
Bag of words context-This feature comprises an unordered set of words in
the context window. A threshold is typically tuned to include the most informative
words in the larger context.

Local collocations-Local collocations are an ordered sequence of phrases near the
target word that provide semantic context for disambiguation. Usually, a very small
window of about three tokens on each side of the target word, most often in contiguous
pairs or triplets, are added as a list of features. For example, if the target word is w,
then C_{i,j} would be a collocation, where i and j refer to the start and end offsets with
respect to the word w. A positive sign indicates words on the right, and a negative
sign indicates words on the left of the target.
The following set of 11 features is the union of the collocation features used by Ng
and Lee [20, 50]: C_{-1,-1}, C_{1,1}, C_{-2,-2}, C_{2,2}, C_{-2,-1}, C_{-1,1}, C_{1,2}, C_{-3,-1}, C_{-2,1}, C_{-1,2}, and
C_{1,3}. To illustrate a few of these, let's take our earlier example for disambiguation:
He bought a box of nails from the hardware store. In this example, the collocation C_{1,1}
would be the word from, and C_{1,3} would be the string from_the_hardware, and so on.
Usually, stop-words and punctuation are not removed before creating the collocations.
Boundary conditions are treated by adding a null word in a collocation. Researchers
could also experiment using root forms of the words and other variations that might

help better generalize the context. A guideline on what criteria to consider in choosing
the number and context of collocations is discussed by Gale et al. [12].
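A possible implementation of these collocation features, assuming (as one common convention) that the target word itself is excluded and sentence boundaries are padded with a NULL token; the offset list follows the 11 collocations as reconstructed from Ng and Lee:

```python
def collocation(tokens, t, i, j):
    """C_{i,j}: the words from offset i to offset j relative to the target at
    index t, joined with underscores. The target itself is skipped (one common
    convention), and sentence boundaries are padded with a NULL token."""
    parts = []
    for off in range(i, j + 1):
        if off == 0:
            continue
        k = t + off
        parts.append(tokens[k] if 0 <= k < len(tokens) else "NULL")
    return "_".join(parts)

# the 11 collocation offset pairs, as reconstructed from Ng and Lee
OFFSETS = [(-1, -1), (1, 1), (-2, -2), (2, 2), (-2, -1), (-1, 1),
           (1, 2), (-3, -1), (-2, 1), (-1, 2), (1, 3)]

tokens = "he bought a box of nails from the hardware store".split()
t = tokens.index("nails")
features = {f"C{i},{j}": collocation(tokens, t, i, j) for i, j in OFFSETS}
```

Run on the running example with target nails, C1,1 yields from and C1,3 yields from_the_hardware, matching the illustration in the text.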
Syntactic relations-If the parse of the sentence containing the target word is available,
then we can use syntactic features. One set of features that was proposed by Lee
and Ng [49] is listed in Algorithm 4-2.
Topic features-The broad topic, or domain, of the article that the word belongs to
is also a good indicator of what sense of the word might be most frequent.

Chen and Palmer [51] recently proposed some additional rich features for disambiguation:

Voice of the sentence-This ternary feature indicates whether the sentence in which
the word occurs is a passive, semipassive,3 or active sentence.
Presence of subject/object-This binary feature indicates whether the target word
has a subject or object. Given a large amount of training data, we could also use
the actual lexeme and possibly the semantic roles rather than the syntactic subject/
objects.
Sentential complement-This binary feature indicates whether the word has a sentential
complement.
Prepositional phrase adjunct-This feature indicates whether the target word has a
prepositional phrase, and if so, selects the head of the noun phrase inside the prepo-
sitional phrase.

Algorithm 4- 2 Rules for selecting syntactic relations as features


1: if w is a noun then
2: select parent head word (h)
3: select part of speech of h
4: select voice of h
5: select position of h (left, right)
6: else if w is a verb then
7: select nearest word l to the left of w such that w is the parent head word of l
8: select nearest word r to the right of w such that w is the parent head word of r
9: select part of speech of l
10: select part of speech of r
11: select part of speech of w
12: select voice of w
13: else if w is an adjective then
14: select parent head word (h)
15: select part of speech of h
16: end if
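Algorithm 4-2 translates directly into a feature-extraction function. The parse-information fields used below (head word, its POS, voice, and position) are hypothetical stand-ins for what a real parser would supply:

```python
def syntactic_features(w_pos, parse):
    """Feature selection rules of Algorithm 4-2 (Lee and Ng). `parse` is a
    hypothetical dict of head/dependent information; key names are illustrative."""
    feats = {}
    if w_pos == "noun":
        feats["head"] = parse["head"]                    # parent head word h
        feats["head_pos"] = parse["head_pos"]            # part of speech of h
        feats["head_voice"] = parse["head_voice"]        # voice of h
        feats["head_position"] = parse["head_position"]  # left or right of w
    elif w_pos == "verb":
        feats["left_dep"] = parse["left_dep"]      # nearest l headed by w
        feats["right_dep"] = parse["right_dep"]    # nearest r headed by w
        feats["left_dep_pos"] = parse["left_dep_pos"]
        feats["right_dep_pos"] = parse["right_dep_pos"]
        feats["pos"] = parse["pos"]
        feats["voice"] = parse["voice"]
    elif w_pos == "adjective":
        feats["head"] = parse["head"]
        feats["head_pos"] = parse["head_pos"]
    return feats

# e.g., features for the noun "box" in "He bought a box of nails"
feats = syntactic_features("noun", {"head": "bought", "head_pos": "VBD",
                                    "head_voice": "active",
                                    "head_position": "left"})
```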

3. Verbs that are past participles and not preceded by be or have verbs are semipassive.

Named entity-This feature is the named entity type of the proper nouns and certain
types of common nouns.
WordNet-WordNet synsets of the hypernyms of head nouns of the noun phrase
arguments of verbs and prepositions.
More recently, following research in semantic role labeling, Dligach and Palmer [52]
proposed the following features for verb sense disambiguation:

Path-This feature is the path from the target verb to the verb's arguments.
Subcategorization-The subcategorization frame is essentially the string formed by
joining the verb phrase type with that of its children.

Most likely, developers will have to perform a feature selection per word to get the best
set of features for a particular word.

Unsupervised
Progress in word sense disambiguation is stymied by the dearth of labeled training data to
train a classifier for every sense of each word in a given language. There are a few solutions
to this problem:

1. Devise a way to cluster instances of a word so that each cluster effectively constrains
the examples of the word to a certain sense. This could be considered sense induction
through clustering.
2. Use some metrics to identify the proximity of a given known instance with some sets of
senses of a word and select the closest to be the sense of that instance.
3. Start with seeds of examples of certain senses, then iteratively grow them into
clusters.
We do not discuss in much detail the mostly clustering-based sense induction methods
here. We assume that there is already a predefined sense inventory for a word and that the
unsupervised methods use very few, if any, hand-annotated examples, and then attempt to
classify unseen test instances into one of their predetermined sense categories.
We first look at the category of algorithms that use some form of distance measure
to identify senses. Rada et al. [53] introduced a metric for computing the shortest distance
between two pairs of senses in WordNet. This metric assumes that co-occurring
words are likely to exhibit senses that would minimize the distance between them in a
semantic network of hierarchical relations, for example, IS-A, from WordNet. Resnik [54]
proposed a new measure of semantic similarity: information content in an IS-A taxonomy,
which produces much better results than the edge-counting measure. Agirre and Rigau
[55] further refined this measure, calling it conceptual density, which not only depends
on the number of separating edges but is also sensitive to the depth of the hierarchy and
the density of its concepts, and is independent of the number of concepts being measured.
Conceptual density is defined for each of the subhierarchies in Figure 4-4. The sense that

Figure 4-4: Conceptual density (W is the word to be disambiguated; w1 w2 w3 w4 are the context words)

falls in the subhierarchy with the highest conceptual density is chosen to be the correct
sense.

CD(c, m) = ( Σ_{i=0}^{m-1} hyponyms^(i^0.20) ) / descendants_c        (4.3)

In Figure 4-4, Sense 2 is the one with the highest conceptual density and is therefore
the chosen sense.
Resnik [56] observed that selectional constraints and word sense are closely related and
identified a measure by which to compute the sense of a word on the basis of predicate-
argument statistics. Note that this algorithm is primarily limited to the disambiguation of
nouns that are arguments of verb predicates.
Let A_R be the selectional association of the predicate p to the concept c with respect to
argument R. A_R is defined as:

A_R(p, c) = (1 / S_R(p)) P(c|p) log [ P(c|p) / P(c) ]

where S_R(p) is the selectional preference strength of p. If n is the noun that is in an
argument relation R to predicate p, and {s_1, s_2, ..., s_k} are its possible senses, then, for
i from 1 to k, compute:

C_i = {c | c is an ancestor of s_i}        (4.4)
a_i = max_{c ∈ C_i} A_R(p, c)        (4.5)

where a_i is the score for sense s_i. The sense s_i that has the largest value of a_i is the sense for
the word. Ties are broken by random choice.
Leacock, Miller, and Chodorow [58] provide another algorithm that makes use of corpus
statistics and WordNet relations, and show that monosemous relatives can be exploited for
disambiguating words.

Algorithms Motivated by Crosslinguistic Evidence


There is a family of unsupervised algorithms based on crosslinguistic information or
evidence. Brown et al. [47] were probably the first to make use of this information for purposes
of sense disambiguation. They were particularly interested not only in the sense distinctions
that were restricted to the ones made by monolingual dictionary resources but also in sense
differences that required translating into other languages. They provide a method to use
the context information for a given word to identify its most likely translation in the target
language. This idea was further explored by Dagan and Itai [59], who use a bilingual lexicon
paired with a monolingual corpus to acquire statistics on word senses automatically. They
also propose that syntactic relations along with word co-occurrence statistics provide a good
source to resolve lexical ambiguity. Further experiments were performed by Diab [60] using
machine-translated English-to-Arabic translations to extract sense information for training
a supervised classifier. These experiments compared favorably with other purely unsupervised
methods. The algorithm SALAAM, which requires a word-aligned parallel corpus, is
described in Figure 4-5.

Semisupervised
The next category of algorithms we look at are those that start from a small seed of examples
and an iterative algorithm that identifies more training examples using a classifier. This
additional, automatically labeled data can then be used to augment the training data of the
classifier to provide better predictions for the next selection cycle, and so on. The Yarowsky
algorithm [61] is the classic case of such an algorithm and was seminal in introducing
semisupervised methods to the word sense disambiguation problem. The algorithm is based
on the assumption that two strong properties are exhibited by corpora:

1. One sense per collocation: Syntactic relationship and the types of words occurring
nearby a given word tend to provide a strong indication as to the sense of that
word.

1. L1 words that translate into the same L2 word are grouped into clusters.

2. SALAAM (Sense Assignment Leveraging Alignment and Multilinguality) identifies the
appropriate senses for the words in those clusters according to the words' sense proximity
in WordNet. The word sense proximity is measured in information theoretic terms
on the basis of an algorithm by Resnik [57].

3. A sense selection criterion is applied to choose the appropriate sense label or set of
sense labels for each word in the cluster.

4. The chosen sense tags for the words in the cluster are propagated back to their respective
contexts in the parallel corpus. Simultaneously, SALAAM projects the propagated
sense tags for L1 words onto their corresponding translations.

Figure 4-5: SALAAM algorithm for creating training data using parallel English-to-Arabic machine translations

Figure 4-6: The three stages of the Yarowsky algorithm

2. One sense per discourse: Usually, in a given discourse, all instances of the same
lemma tend to invoke the same sense.

Based on the assumption that these properties exist, the Yarowsky algorithm iteratively
disambiguates most of the words in a given discourse.
Figure 4-6 shows the three stages of the algorithm. In the first box, life and manufacturing
are used as collocates to identify the two senses of plant. Then, in the next iteration, a
new collocate, cell, is identified, and the final block shows the small residual remaining at the
end of the algorithm cycle. This algorithm, as described in Figure 4-7, has been shown to
perform well on a small number of examples. For it to be successful, it is important to select
a good way to identify seed examples and to devise a way to identify potential corruption of
the labeled pool by wrong examples. More recently, Galley and McKeown [62] showed that
the one-sense-per-discourse assumption improves performance.
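A minimal sketch of this bootstrapping loop, with one-rule-per-collocation-word labeling standing in for Yarowsky's full decision-list learner; the contexts and seed collocations below are invented:

```python
from collections import Counter, defaultdict

def yarowsky(instances, seeds, threshold=0.9, max_iter=10):
    """instances: list of context-word sets; seeds: collocation word -> sense.
    Iteratively labels instances and grows the rule list (a stand-in for a
    decision list) until a stable residual remains."""
    labels = [None] * len(instances)
    rules = dict(seeds)  # collocation word -> sense
    for _ in range(max_iter):
        changed = False
        # label any unlabeled instance whose context matches exactly one sense
        for k, ctx in enumerate(instances):
            if labels[k] is None:
                hits = {rules[w] for w in ctx if w in rules}
                if len(hits) == 1:
                    labels[k] = hits.pop()
                    changed = True
        # learn new rules: words that strongly indicate a single sense
        stats = defaultdict(Counter)
        for ctx, lab in zip(instances, labels):
            if lab is not None:
                for w in ctx:
                    stats[w][lab] += 1
        for w, counts in stats.items():
            sense, n = counts.most_common(1)[0]
            if w not in rules and n / sum(counts.values()) >= threshold:
                rules[w] = sense
        if not changed:
            break
    return labels

instances = [
    {"life", "plant", "growth"},
    {"plant", "manufacturing", "factory"},
    {"plant", "growth"},
    {"plant", "factory"},
]
seeds = {"life": "organism", "manufacturing": "building"}
labels = yarowsky(instances, seeds)
```

The two seed collocates label the first two instances; the new collocates growth and factory learned from them then label the residual in the next iteration, mirroring the three stages in Figure 4-6.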
Another variation of semisupervised systems is the use of unsupervised methods for
the creation of data combined with supervised methods to learn models for that data. The
presumption is that the potential noise of wrong examples selected from a corpus during this
process would be low enough so as not to affect learnability. Another presumption is that the
overall discriminative ability of the model is superior to purely unsupervised methods or to
situations in which not enough hand-annotated data is available to train a purely supervised
system. Mihalcea and Moldovan [63] describe one such system in which the algorithm in
Figure 4-8 is used to obtain examples from large corpora for particular senses in WordNet.
Mihalcea [64] proposes the following method using Wikipedia for automatic word sense
disambiguation.

• Extract all the sentences in Wikipedia in which the word under consideration is a
link. There are two types of links: a simple link, such as [[bar]], or a piped link, such
as [[musical_notation|bar]].
• Filter those links that point to a disambiguation page. This means that we need
further information to disambiguate the word. If the word does not point to a disambiguation
page, then the word itself can be the label. For all piped links, the string
before the pipe serves as the label.
• Collect all the labels associated with the word, and then map them to possible WordNet
senses. Sometimes they might all map to the same sense, essentially making the

Step 1. In a sufficiently large corpus, identify all the instances of a particular polysemous
word that needs to be disambiguated, storing its context alongside.

Step 2. Identify a small set of instances that are strongly representative of one of the
senses of the word. This can either be done in a completely unsupervised fashion by
identifying collocations that give a strong indication of the sense usage for the word
under consideration or by manually tagging a small portion of the data. In this example,
we assume a polysemous word with only two senses, but this algorithm can be extended
to n senses.

Step 3.
Step 3a. Train a supervised classifier on this set of examples.
Step 3b. Using these classifiers, classify the remaining instances of the word in the
corpus and select those that are classified above a certain level of confidence.
Step 3c. Filter out the possible misclassifications using the one-sense-per-discourse
constraint, and identify possible new collocations to be added to the list of seed
collocations.
Step 3d. Repeat step 3 iteratively, thereby slowly shrinking the residual.

Step 4. Stop. At some point, a small, stable residual will remain.

Step 5. The trained classifier can now be used to classify new data, and that in turn can
be used to annotate the original corpus with sense tags and probabilities.

Figure 4-7: The Yarowsky algorithm

verb monosemous and not useful for this purpose. Often, the categories can be mapped
to a significant number of WordNet categories, thereby providing sense-disambiguated
data for training. The manual mapping is a relatively inexpensive process.

This algorithm provides a cheap way of extracting sense information for many of the words
that display the required properties, and it can alleviate the manually intensive process of
sense tagging. Depending on how many words in the entire Wikipedia exhibit this property,
it could be very useful for generating sense-tagged data. A rough idea of the coverage of
this method can be gleaned from the fact that roughly 30 of 49 nouns that were used in
SENSEVAL-2 and SENSEVAL-3 were found to have more than two senses for which data
could be extracted from Wikipedia. The average disambiguation accuracy on these senses
was in the mid-80% range. The interannotator agreement for mapping the senses to WordNet
was around 91%.
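The link-label extraction step can be sketched with a regular expression over wiki markup. This is a simplification that ignores nested templates and other markup:

```python
import re

# matches [[target]] and [[target|surface]] wiki links
LINK = re.compile(r"\[\[([^\]|]+)(?:\|([^\]|]+))?\]\]")

def sense_labels(wikitext, word):
    """Collect the labels for occurrences of `word` that appear as links: for
    a simple link the label is the link target itself; for a piped link the
    string before the pipe serves as the label."""
    labels = []
    for m in LINK.finditer(wikitext):
        target, surface = m.group(1), m.group(2)
        if (surface or target) == word:
            labels.append(target)
    return labels

sentence = "He sat at the [[bar]] while reading a [[musical notation|bar]] of music."
labels = sense_labels(sentence, "bar")
```

The simple link yields the label bar, and the piped link yields the label musical notation; these labels would then be mapped to WordNet senses as described above.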

4.4.3 Software

Several software programs are made available by the research community for word sense
disambiguation, ranging from similarity measure modules to full disambiguation systems. It
is not possible to list all of them here, so we list a selected few.
Step 1. Preprocessing

- For each sense of a word W, determine the synsets of WordNet in which it appears.
- For each such synset, determine the monosemous words included in that synset. Parse
the gloss definition attached to each synset.

Step 2. Search

- Form search phrases using the following procedures in order of preference:
1. If they exist, extract monosemous synonyms from the synsets selected in step 1.
2. Select each of the unambiguous parsed constituents in the gloss as a search
phrase.
3. After parsing the gloss, replace all stop-words with a NEAR operator and create
a query from the words in the current synset. For example, if the synset for
produce#6 is grow, raise, farm, produce, and the gloss is cultivate by growing,
then the query will look like: cultivate NEAR growing AND (grow OR raise OR
farm OR produce).
4. Use only the head phrase combined by words in the synset using the AND
operator. For example, if the definition for company#5 is band of people and its
synset is (party, company), then the query becomes: band of people AND (party
OR company).
- Search the Internet with the phrases determined in the previous step and gather
matching documents.
- From these documents, extract the sentences containing these words.

Step 3. Postprocessing

- Keep only those sentences in which the word under consideration belongs to the
same part of speech as the selected sense, and delete the others.

Figure 4-8: Mihalcea and Moldovan [63] algorithm for generating examples for words tagged with
particular senses by querying a very large corpus

• IMS (It Makes Sense): http://nlp.comp.nus.edu.sg/software
This is a complete word sense disambiguation system.

• WordNet-Similarity-2.05: http://search.cpan.org/dist/WordNet-Similarity
These WordNet-Similarity modules for Perl provide a quick way of computing various
word similarity measures.

• WikiRelate!: http://www.h-its.org/english/research/nlp/download
This is a word similarity measure based on the categories in Wikipedia.
