Unit-3 NLP Notes
Semantic Parsing
Sameer Pradhan
Semantics by its dictionary definition is the study of meaning, and parsing is the examination of something in a minute way, that is, identifying and relating the pieces of information being parsed. When we put the two of these concepts together, we get semantic parsing, which, in the broadest sense of the phrase, is the process of identifying meaning chunks contained in an information signal in an attempt to transform it into some data structure that can be manipulated by a computer to perform higher-level tasks. In our case, the information signal is human language text. Unfortunately, in the natural language processing community, the term semantic parsing is somewhat ambiguous. Over the years researchers have used it to represent various levels of granularity of meaning representation. Because semantics is such a vague term, it has been used to represent various depths of representation, from something as basic as identifying domain-specific relations between entities, to the more intermediate task of identifying the roles that various entities and artifacts play in an event, to converting a text to a series of specific logical expressions. Within the context of this chapter, we restrict its interpretation to the study of mapping naturally occurring text to some representation that is amenable to manipulation by computers for the purpose of achieving some goal, such as retrieving information, answering a question, populating a database, or taking an action.
4.1 Introduction
The holy grail of research in language understanding is the identification of a meaning representation that is detailed enough to allow reasoning systems to make deductions but, at the same time, is general enough that it can be used across many domains with little to no adaptation. It is not clear whether a final, low-level, detailed semantic representation covering various applications that use some form of language interface can be achieved, or whether an ontology can be created that can capture the various granularities and aspects of meaning that are embodied in such a variety of applications; none has yet been created. Therefore, two compromise approaches have emerged in the natural language processing community for language understanding.
In the first approach, a specific, rich meaning representation is created for a limited domain for use by applications that are restricted to that domain, such as air travel reservations, football game simulations, or querying a geographic database. Systems are then ...

... sentences by humans. Shortly after Chomsky's 1957 book, Katz and Fodor [2] published the
first work treating semantics within the generative grammar paradigm. They found that
Chomsky's transformational grammar was not a complete description of language because
it did not account for meaning. In their 1963 paper "The Structure of a Semantic Theory,"
Katz and Fodor put forward what they thought were the properties a semantic theory should
possess. A semantic theory should be able to:
1. Explain sentences having ambiguous meanings. For example, it should account for the
fact that the word bill in the sentence The bill is large is ambiguous in the sense that
it could represent money or the beak of a bird.
2. Resolve the ambiguities of words in context. For example, if the same sentence is
extended to form The bill is large but need not be paid, then the theory should be able
to disambiguate the monetary meaning of bill.
3. Identify meaningless but syntactically well-formed sentences, such as the famous
example by Chomsky: Colorless green ideas sleep furiously.
4. Identify syntactically or transformationally unrelated paraphrases of a concept having
the same semantic content.
In the following subsections we look at some requirements for achieving a semantic
representation.
Figure 4-1: A representation of who did what to whom, when, where, why, and how (an event graph linking Bell Atlantic Corp. and an Acquire event through who, whom, when, where, why, and how edges)
(1) If our player 2 has the ball, then position our player 5 in the midfield.
((bowner (player our 2)) (do (player our 5) (pos (midfield))))
(2) Which river is the longest?
answer(x1, longest(x1, river(x1)))
This is a domain-specific approach; the remainder of this chapter focuses on domain-
independent approaches.
4.3 System Paradigms

1. System Architectures
(a) Knowledge based: As the name suggests, these systems use a predefined set of rules or a knowledge base to obtain a solution to a new problem.

(b) Unsupervised: These systems tend to require minimal human intervention to be functional by using existing resources that can be bootstrapped for a particular application or problem domain.

(c) Supervised: These systems involve the manual annotation of some phenomena that appear in a sufficient quantity of data so that machine learning algorithms can be applied. Typically, researchers create feature functions that allow each problem instance to be projected into a space of features. A model is trained to use these features to predict labels, and then it is applied to unseen data.
2. Scope

(a) Domain dependent: These systems are specific to certain domains, such as air travel reservations or simulated football coaching.

(b) Domain independent: These systems are general enough that the techniques can be applicable to multiple domains with little or no change.

3. Coverage

(a) Shallow: These systems tend to produce an intermediate representation that can then be converted to one that a machine can base its actions on.

(b) Deep: These systems usually create a terminal representation that is directly consumed by a machine or application.
4.4 Word Sense

Word sense ambiguities can be of three principal types: (i) homonymy, (ii) polysemy, and (iii) categorial ambiguity [13]. Homonymy indicates that the words share the same spelling, but the meanings are quite disparate. Each homonymous partition, however, may contain finer sense nuances that could be assigned to the word depending on the context, and this phenomenon is called polysemy. For example, these two senses of the word bank are orthogonal: financial bank and river bank. Further, bank has some other, somewhat finer, and related subsenses that indicate a collection of things: for example, financial bank and bank of clouds. To illustrate categorial ambiguity, the word book can mean a book such as the one in which this chapter appears or to enter charges against someone in a police register. The former belongs to the grammatical category of noun, and the latter, verb. Distinguishing between these two categories effectively helps disambiguate these two senses. Therefore, categorial ambiguity can be resolved with syntactic information (part of speech) alone, but polysemy and homonymy need more than syntax.
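These distinctions are easy to see programmatically. Below is a minimal sketch using NLTK's WordNet interface (an assumption; any sense inventory would serve) that lists the noun senses of bank and shows how a part-of-speech filter separates the noun and verb senses of book:

from nltk.corpus import wordnet as wn  # assumes nltk.download('wordnet')

# Homonymy/polysemy: 'bank' has several noun senses, including the
# orthogonal financial-institution and river-slope meanings.
for synset in wn.synsets('bank', pos=wn.NOUN):
    print(synset.name(), '-', synset.definition())

# Categorial ambiguity: part of speech alone separates the noun 'book'
# from the verb 'book' (as in entering charges in a police register).
print(len(wn.synsets('book', pos=wn.NOUN)), 'noun senses')
print(len(wn.synsets('book', pos=wn.VERB)), 'verb senses')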
Traditionally, in English, word senses have been annotated for each part of speech separately, whereas in Chinese, the sense annotation has been done per lemma and so ranges across all parts of speech. Part of the reason is that the distinction between a noun and a verb is much more obscure in Chinese.
4.4.1 Resources
As with any language understanding task, the availability of resources is a key factor in the disambiguation of word senses in corpora. Unfortunately, the community has not seen the development of a significant amount of hand-tagged sense data, at least not until very recently. Early work on word sense disambiguation used machine-readable dictionaries or thesauruses as knowledge sources. Two prominent sources were the Longman Dictionary of Contemporary English (LDOCE) [14] and Roget's Thesaurus [15]. The late 1980s gave birth to a significant lexicographical resource, WordNet [16], which has been very influential. In addition to being a lexical resource with inventories of senses provided for most words in English across multiple parts of speech, it also has a rich taxonomy connecting words across many different relationships such as hypernymy, homonymy, meronymy, and so on. In addition, to facilitate research in automatic sense disambiguation, a small portion of the Brown Corpus [17] has been annotated with WordNet senses to create a semantic concordance (SEMCOR) corpus [18]. More recently, WordNet has been extended by adding syntactic information to the glosses, disambiguating them with manual and automatic methods, and generating representations that allow better incorporation in applications such as question answering [19]. Another resource, the DSO Corpus of Sense-Tagged English, was created by tagging WordNet version 1.5 senses on the Brown and Wall Street Journal (WSJ) corpora for the 121 nouns and 70 verbs that are the most frequent and ambiguous words in English [20]. Further, the SENSEVAL [21] competitions held over the past decade created many corpora for testing systems on word sense and related problems. The largest sense annotation effort so far has been the OntoNotes corpus [22, 23, 24], released through the Linguistic Data Consortium (LDC), in which a significant number of verb (~2,700) and noun (~2,200) lemmas have been tagged with coarse-grained senses, covering roughly 85% of multiple corpora spanning multiple genres, and with a very high interannotator agreement. Pradhan et al. [25] based a lexical sample task in SEMEVAL-2007 on this corpus.
Cyc [26] is another good example of a useful resource that creates a formalized represen-
tation of common sense knowledge about objects and events in the world to overcome the
so-called knowledge bottleneck that is so crucial to word sense disambiguation and many
other natural language tasks. Even after a couple of decades of handcrafting this knowledge
base, it leaves much to be desired, which underscores the difficulty of such an endeavor.
Fortunately, English seems to have the most highly developed lexicons with various
semantic features associated with words and words grouped together to form coherent
semantic classes. Efforts are underway to create resources for other languages as well.
For example, HowNet [27] is a network of words for Chinese similar to WordNet. The
Global WordNet Association (http://www.globalwordnet.org) keeps track of WordNet
development across various languages. Researchers are also using semiautomatic methods
for expanding coverage of existing languages [28, 29, 30, 31] and for other languages such
as Greek [32]. In addition to such corpora annotated with sense information, there are
also many resources such as WordNet Domains (http://wndomains.fbk.eu/) that provide
structured knowledge to help overcome the knowledge bottleneck in sense disambiguation.
4.4.2 Systems
Now that we have looked at the problem and some resources, we turn to some sense disam-
biguation systems. As mentioned earlier, researchers have explored various system architec-
tures to address the sense disambiguation problem. We can classify these systems into four
main categories: (i) rule based or knowledge based, (ii) supervised, (iii) unsupervised, and
(iv) semisupervised.
In the following four sections, we look at each of these systems in order.
Rule Based
The first generation of word sense disambiguation systems was primarily based on dictionary
sense definitions and glosses [33, 34]. Most of these techniques were handcrafted and used
resources that are not necessarily accessible today. Also, access to the exact rules and sys-
tems was very limited, and most information was only available from archived publications
and discussions of the specific lexical items and senses that were considered during those
experiments. In short, much of this information is historical and cannot readily be trans-
lated and made available for building systems today. However, some valuable techniques and
algorithms are still accessible, and we look at these in this section. Probably the simplest and
oldest dictionary-based sense disambiguation algorithm was introduced by Lesk [35]. The
first-generation word sense disambiguation algorithms were mostly based on computerized
dictionaries; for example, see Calzolari and Picchi [33].
The first SENSEVAL evaluations [36] used a simplified version of the Lesk algorithm
as a baseline for comparing word sense disambiguation performance. The pseudocode for
the algorithm is shown in Algorithm 4-1. The core of the algorithm is that the sense of
a word in a given context is most likely to be the dictionary sense whose terms most
closely overlap with the terms in the context. There have since been further modifications to the algorithm to make it more robust to variation in term usage, context, and definition. Banerjee and Pedersen [37], for example, modified the Lesk algorithm so that synonyms, hypernyms, hyponyms, meronyms, and so on of the words in the context as well as in the dictionary definition are used to get a more accurate overlap statistic. The score associated with each match is measured as the square of the longest common subsequence between the context and the gloss.1 Using a context window of five words (two on each side of the target, as well as the target itself), they report a twofold increase in performance, from 16% to 32%, over the vanilla Lesk algorithm when used on the SENSEVAL-2 lexical sample dataset. This performance improvement is considerable given the simplicity of the algorithm.

Algorithm 4-1 Pseudocode of the simplified Lesk algorithm
Procedure: SIMPLIFIED_LESK(word, sentence) returns the best sense of word
The function COMPUTEOVERLAP(signature, context) returns the number of words common to the two sets
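To make the procedure concrete, here is a minimal Python sketch of simplified Lesk, using NLTK's WordNet glosses and example sentences as the dictionary (an assumption; the original baseline used dictionary definitions directly) and the most frequent sense as the fallback:

from nltk.corpus import wordnet as wn  # assumes nltk.download('wordnet')

def compute_overlap(signature, context):
    # Number of words common to the two sets.
    return len(signature & context)

def simplified_lesk(word, sentence):
    # Return the sense whose gloss and examples best overlap the sentence.
    context = set(sentence.lower().split())
    senses = wn.synsets(word)
    if not senses:
        return None
    best_sense, max_overlap = senses[0], 0  # fallback: most frequent sense
    for sense in senses:
        # Signature: words in the gloss plus the example sentences of the sense.
        signature = set(sense.definition().lower().split())
        for example in sense.examples():
            signature |= set(example.lower().split())
        overlap = compute_overlap(signature, context)
        if overlap > max_overlap:
            max_overlap, best_sense = overlap, sense
    return best_sense

print(simplified_lesk('bank', 'he sat on the bank of the river and watched the water'))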
Another dictionary-based algorithm was suggested by Yarowsky [38]. This study used Roget's Thesaurus categories and classified unseen words into one of these 1,042 categories based on a statistical analysis of 100-word concordances for each member of each category over a large corpus, in this case the 10-million-word Grolier's Encyclopedia. The method performed quite well on a set of 12 words for which there had been some previous quantitative studies. Although the instances and corpora used in this study were not the same as the ones reported previously, it still gives an idea of the success of a relatively simple method. The method consists of three steps, as shown in Figure 4-2. The first step is a collection of contexts. The second step computes weights for each of the salient words. One thing to note is that the amount of context used was 50 words on each side of the target word, which is much higher than the context windows found to be useful for this kind of broad-topic classification by Gale et al. [12]. P(w|RCat) is the probability of a word w occurring in the context of a Roget's Thesaurus category RCat. Finally, in the third step, the unseen words in the test set are classified into the category that has the maximum weight.

1. Multiple subsequences in the same gloss are allowed; however, subsequences of only non-content words such as pronouns, prepositions, articles, and conjunctions are not considered. For example, the subsequence of the is not considered in the calculation of a score.
1. Collect the contexts representative of each Roget's Thesaurus category.

2. Compute a weight for each of the salient words w:

    log( P(w | RCat) / P(w) )

3. Use the weights for predicting the appropriate category of the word in the test corpus:

    argmax_{RCat} Σ_i log( P(w_i | RCat) · P(RCat) / P(w_i) )

Figure 4-2: Algorithm for disambiguating words into Roget's Thesaurus categories
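The prediction step is essentially a Naive Bayes decision over category-specific word weights. A minimal sketch, where the probability tables are hypothetical stand-ins for statistics gathered from the category concordances:

import math

def roget_category(context_words, categories, p_w_given_cat, p_cat, p_w):
    # Hypothetical probability tables, estimated from the 100-word
    # concordances of each category and from the whole corpus:
    #   p_w_given_cat[(w, cat)] = P(w|RCat), p_cat[cat] = P(RCat), p_w[w] = P(w)
    def score(cat):
        return sum(math.log(p_w_given_cat.get((w, cat), 1e-10) * p_cat[cat] / p_w[w])
                   for w in context_words if w in p_w)
    # Pick the category with the maximum summed weight over the context.
    return max(categories, key=score)

In practice, the 50-word windows on each side of the target described above would supply context_words.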
More recently, Navigli and Velardi [39, 40] suggested a knowledge-based algorithm, structural semantic interconnections (SSI), that uses graphical representations of the senses of words in context to disambiguate them. The algorithm uses the following definitions:

• T (the lexical context) is the list of terms in the context of the term t to be disambiguated: T = [t1, t2, ..., tn].

• S_1^t, S_2^t, ..., S_n^t are structural specifications of the possible concepts (or senses) of t.

• I (the semantic context) is the list of structural specifications of the concepts associated with each of the terms in T \ {t} (that is, except t): I = [S^{t1}, S^{t2}, ..., S^{tn}], the semantic interpretation of T.

• G is the grammar defining the various relations between the structural specifications (the semantic interconnections among the graphs).

• Determine how well the structural specifications in I match those of S_1^t, S_2^t, ..., S_n^t, using G.

• Select the best-matching S_i^t.
The algorithm works as follows. A set of pending terms in the context, P = {t_i | S^{t_i} = null}, is maintained, and I is used in each iteration to disambiguate terms in P. The procedure iterates, and each iteration either disambiguates one term in P and removes it from the pending list or stops because no more terms can be disambiguated. The output I is updated with the sense of t. Initially, I contains structures for the monosemous terms in T \ {t} and any possible disambiguated synsets (since we do use sense-tagged data).2 If this is a null set, then the algorithm makes an initial guess at what the most likely sense of the least ambiguous term in the context is. During an iteration, the algorithm selects those terms t in P that show semantic interconnections between at least one sense S of t and one or more senses in I.

Figure 4-3: The graphs for senses 1 and 2 of the noun bus as generated by the SSI algorithm
A function f_I(S, t) determines the likelihood of S being the correct interpretation of t and is defined as:

    f_I(S, t) = ρ({ φ(S, S') : S' ∈ I }) if S ∈ Senses(t), and 0 otherwise

where Senses(t) are the senses associated with the term t, and

    φ(S, S') = ρ'({ w(e1) · w(e2) · ... · w(en) : S →e1 S1 →e2 ... →en S' })

that is, a function (ρ') of the weights (w) of each path connecting S and S', where S and S' are semantic graphs, and edges e1 to en are the edges connecting them. A good choice for ρ and ρ' would be a sum or average-sum function.
Finally, a context-free grammar G = (E, N, S_G, P_G) encodes all the meaningful semantic patterns.

2. A synset is a set of lemmas that all have the same word sense. The term was coined by the creators of WordNet [16].
Supervised
Ironically, the simpler form of word sense disambiguation systems, the supervised approach, which tends to transfer all the complexity to the machine learning machinery while still requiring hand annotation, tends to be superior to unsupervised methods and performs best when tested on annotated data [21]. The downside to this approach is that the sense inventory has to be predetermined, and any change in the inventory might necessitate a round of expensive reannotation.

These systems typically consist of a machine learning classifier trained on various features extracted for words that have been manually disambiguated in a given corpus and the application of the resulting models to disambiguate words in unseen test sets. A good feature of these systems is that the user can incorporate rules and knowledge in the form of features, and possibly semiautomatically generate training data to augment the set that has been manually annotated, in an attempt to achieve the best of all three approaches. Of course, a particular knowledge source and/or classifier combination may have issues that make it less amenable to deriving the most optimal feature representation, and the semiautomatically generated sense-tagged data could be noisy to varying degrees. Nevertheless, state-of-the-art systems usually tend to be a combination of rich features and exploitation of redundancy in language.
We look at some of the typical systems and features in this section. Brown et al. [?] were probably the first to use machine learning for word sense disambiguation, using information in parallel corpora. Yarowsky [48] was among the first to use a rich set of features in a machine learning framework (decision lists) to tackle the word sense problem. Several other researchers, such as Ng and Lee [20, 49], have used and refined those features in several variations, including different levels of context and granularities: sentence, paragraph, microcontext, and so on. In this section, we look at some of the more popular methods and features that are relatively easy to obtain.
Classifier Probably the most common and high-performing classifiers are support vector machines (SVMs) and maximum entropy (MaxEnt) classifiers. Many good-quality, freely available distributions of each are available and can be used to train word sense disambiguation models. Typically, because each lemma has a separate sense inventory, it is almost always the case that a separate model is trained for each lemma and POS combination (i.e., if the language, as in the case of English, has separate sense inventories for various parts of speech).
Features We discuss a more commonly found subset of features that have been useful in supervised learning of word sense. These are not exhaustive by any means, but they are time-tested and provide a very good base that can be used to achieve nearly state-of-the-art performance.

Lexical context: This feature comprises the words and lemmas of words occurring in the entire paragraph or a smaller window of usually five words.

Parts of speech: This feature comprises the POS information for words in the window surrounding the word that is being sense tagged.

Bag of words context: This feature comprises an unordered set of words in the context window. A threshold is typically tuned to include the most informative words in the larger context.
Local collocations: Local collocations are an ordered sequence of phrases near the target word that provide semantic context for disambiguation. Usually, a very small window of about three tokens on each side of the target word, most often in contiguous pairs or triplets, is added as a list of features. For example, if the target word is w, then Ci,j would be a collocation where i and j refer to the start and end offsets with respect to the word w. A positive sign indicates words on the right, and a negative sign indicates words on the left of the target.

The following set of 11 features is the union of the collocation features used by Ng and Lee [20, 50]: C-1,-1, C1,1, C-2,-2, C2,2, C-2,-1, C-1,1, C1,2, C-3,-1, C-2,1, C-1,2, and C1,3. To illustrate a few of these, let's take our earlier example for disambiguating nails in He bought a box of nails from the hardware store. In this example, the collocation C1,1 would be the word from, and C1,3 would be the string from_the_hardware. Usually, stop-words and punctuation are not removed before creating the collocations. Boundary conditions are treated by adding a null word in a collocation (see the sketch following this feature list). Researchers could also experiment using root forms of the words and other variations that might help better generalize the context. A guideline on what criteria to consider in choosing the number and context of collocations is discussed by Gale et al. [12].
Syntactic relations: If the parse of the sentence containing the target word is available, then we can use syntactic features. One set of features that was proposed by Lee and Ng [49] is listed in Algorithm 4-2.

Topic features: The broad topic, or domain, of the article that the word belongs to is also a good indicator of what sense of the word might be most frequent.
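As referenced in the local-collocations feature above, here is a minimal sketch of extracting the Ci,j features from a tokenized sentence; the NULL padding for boundary conditions and the underscore-joined strings follow the description given earlier, and including the target token when a span crosses offset 0 is a simplifying assumption:

def collocation_feature(tokens, target_index, i, j):
    # C_{i,j}: underscore-joined tokens from offset i to offset j relative
    # to the target word, padding with NULL at sentence boundaries.
    span = []
    for offset in range(i, j + 1):
        pos = target_index + offset
        span.append(tokens[pos] if 0 <= pos < len(tokens) else 'NULL')
    return '_'.join(span)

# The 11 offsets from Ng and Lee [20, 50].
OFFSETS = [(-1, -1), (1, 1), (-2, -2), (2, 2), (-2, -1), (-1, 1),
           (1, 2), (-3, -1), (-2, 1), (-1, 2), (1, 3)]

tokens = 'He bought a box of nails from the hardware store .'.split()
target = tokens.index('nails')
features = {f'C{i},{j}': collocation_feature(tokens, target, i, j)
            for i, j in OFFSETS}
print(features['C1,1'])  # from
print(features['C1,3'])  # from_the_hardware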
Chen and Palmer [51] recently proposed some additional rich features for disambiguation:

Voice of the sentence: This ternary feature indicates whether the sentence in which the word occurs is a passive, semipassive,3 or active sentence.

Presence of subject/object: This binary feature indicates whether the target word has a subject or object. Given a large amount of training data, we could also use the actual lexeme and possibly the semantic roles rather than the syntactic subjects/objects.

Sentential complement: This binary feature indicates whether the word has a sentential complement.

Prepositional phrase adjunct: This feature indicates whether the target word has a prepositional phrase, and if so, selects the head of the noun phrase inside the prepositional phrase.

3. Verbs that are past participles and not preceded by be or have verbs are semipassive.
Unsupervised
Progress in word sense disambiguation is stymied by the dearth of labeled training data to train a classifier for every sense of each word in a given language. There are a few solutions to this problem:

1. Devise a way to cluster instances of a word so that each cluster effectively constrains the examples of the word to a certain sense. This could be considered sense induction through clustering.

2. Use some metric to identify the proximity of a given known instance to some sets of senses of a word and select the closest to be the sense of that instance.

3. Start with seeds of examples of certain senses, then iteratively grow them to form clusters.

We do not discuss in much detail the mostly clustering-based sense induction methods here. We assume that there is already a predefined sense inventory for a word and that the unsupervised methods use very few, if any, hand-annotated examples, and then classify unseen test instances into one of their predetermined sense categories.
We first look at the category of algorithms that use some form of distance measure to identify senses. Rada et al. [53] introduced a metric for computing the shortest distance between two pairs of senses in WordNet. This metric assumes that co-occurring words are likely to exhibit senses that would minimize the distance between them in a semantic network of hierarchical relations, for example, IS-A, from WordNet. Resnik [54] proposed a new measure of semantic similarity, information content in an IS-A taxonomy, which produces much better results than the edge-counting measure. Agirre and Rigau [55] further refined this measure, calling it conceptual density, which not only depends on the number of separating edges but is also sensitive to the depth of the hierarchy and the density of its concepts, and is independent of the number of concepts being measured. Conceptual density is defined for each of the subhierarchies in Figure 4-4; the sense that falls in the subhierarchy with the highest conceptual density is chosen to be the correct sense.

Figure 4-4: Conceptual density subhierarchies (word to be disambiguated: W; context words: w1, w2, w3, w4)
    CD(c, m) = ( Σ_{i=0}^{m-1} hyponyms^{i^0.20} ) / descendants_c    (4.3)

In Figure 4-4, Sense 2 is the one with the highest conceptual density and is therefore the chosen sense.
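A small sketch of Equation 4.3 over the WordNet hierarchy; estimating nhyp as the mean number of hyponyms per node of the subhierarchy is an assumption, and the exact estimation in [55] differs in detail:

from nltk.corpus import wordnet as wn  # assumes nltk.download('wordnet')

def conceptual_density(concept, m):
    # Eq. 4.3: CD(c, m) = sum_{i=0}^{m-1} nhyp**(i**0.20) / descendants_c,
    # where nhyp is taken here as the mean number of hyponyms per node
    # in the subhierarchy rooted at `concept` (an assumption).
    nodes = [concept] + list(concept.closure(lambda s: s.hyponyms()))
    descendants = len(nodes)
    nhyp = sum(len(s.hyponyms()) for s in nodes) / descendants
    return sum(nhyp ** (i ** 0.20) for i in range(m)) / descendants

# Example: density of the financial-institution subhierarchy if it
# contains m = 4 senses of the words being disambiguated.
print(conceptual_density(wn.synset('financial_institution.n.01'), 4))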
Resnik [56] observed that selectional constraints and word sense are closely related and
identified a measure by which to compute the sense of a word on the basis of predicate-
argument statistics. Note that this algorithm is primarily limited to the disambiguation of
nouns that are arguments of verb predicates.
Let AR be the selectional association of the predicate p to the concept c with respect to argument R. AR is defined as:

    AR(p, c) = (1 / SR(p)) · P(c | p) · log( P(c | p) / P(c) )
If n is the noun that is in an argument relation R to predicate p, and {s1, s2, ..., sk} are its possible senses, then, for i from 1 to k, compute:

    Ci = {c | c is an ancestor of si}    (4.4)

    ai = max_{c ∈ Ci} AR(p, c)    (4.5)

where ai is the score for sense si. The sense si that has the largest value of ai is the sense for the word. Ties are broken by random choice.
Leacock, Miller, and Chodorow [58] provide another algorithm that makes use of corpus statistics and WordNet relations, and show that monosemous relatives can be exploited for disambiguating words.
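In the same spirit as these distance-based methods, here is a minimal sketch that selects the noun sense closest to the context words, using Resnik's information-content similarity as implemented in NLTK (the Brown Corpus information-content file is an assumption):

from nltk.corpus import wordnet as wn
from nltk.corpus import wordnet_ic  # assumes nltk.download('wordnet_ic')

# Resnik similarity needs an information-content file, here one
# estimated from the Brown Corpus.
brown_ic = wordnet_ic.ic('ic-brown.dat')

def disambiguate_noun(noun, context_nouns):
    # Pick the sense of `noun` with the highest total Resnik similarity
    # to the closest sense of each context noun.
    def score(sense):
        total = 0.0
        for other in context_nouns:
            sims = [sense.res_similarity(s, brown_ic)
                    for s in wn.synsets(other, pos=wn.NOUN)]
            total += max(sims, default=0.0)
        return total
    return max(wn.synsets(noun, pos=wn.NOUN), key=score)

print(disambiguate_noun('bank', ['money', 'deposit']))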
Semisupervised
The next category of algorithms we look at are those that start from a small seed of examples and an iterative algorithm that identifies more training examples using a classifier. This additional, automatically labeled data can then be used to augment the training data of the classifier to provide better predictions for the next selection cycle, and so on. The Yarowsky algorithm [61] is the classic case of such an algorithm and was seminal in introducing semisupervised methods to the word sense disambiguation problem. The algorithm is based on the assumption that two strong properties are exhibited by corpora:
1. One sense per collocation: Syntactic relationships and the types of words occurring nearby a given word tend to provide a strong indication as to the sense of that word.
2. One sense per discourse: Usually, in a given discourse, all instances of the same
lemma tend to invoke the same sense.
Based on the assumption that these properties exist, the Yarowsky algorithm iteratively disambiguates most of the words in a given discourse.

Figure 4-6 shows the three stages of the algorithm. In the first box, life and manufacturing are used as collocates to identify the two senses of plant. Then, in the next iteration, a new collocate cell is identified, and the final block shows the small residual remaining at the end of the algorithm cycle. This algorithm, as described in Figure 4-7, has been shown to perform well on a small number of examples. For it to be successful, it is important to select a good way to identify seed examples and to devise a way to identify potential corruption of the labeled pool by wrong examples. More recently, Galley and McKeown [62] showed that the one-sense-per-discourse assumption improves performance.
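A skeletal sketch of the bootstrapping loop follows; the seed rules, the classifier training function, and the confidence threshold are hypothetical placeholders rather than the exact components of [61]:

def yarowsky(instances, seed_rules, train, threshold=0.95, max_iters=100):
    # `instances` are unlabeled (hashable) contexts; `seed_rules(x)` returns
    # a sense or None; `train(labeled)` returns a model whose
    # predict(x) -> (sense, confidence). All are placeholders.
    labeled = {}
    for x in instances:
        sense = seed_rules(x)  # e.g., collocates such as 'life'/'manufacturing'
        if sense is not None:
            labeled[x] = sense
    for _ in range(max_iters):
        model = train(labeled)
        new_labeled = dict(labeled)
        for x in instances:
            sense, confidence = model.predict(x)
            if confidence >= threshold:  # one sense per collocation
                new_labeled[x] = sense
        if new_labeled == labeled:  # converged; residual stays unlabeled
            break
        labeled = new_labeled
    return train(labeled), labeled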
Another variation of semisupervised systems is the use of unsupervised methods for the creation of data combined with supervised methods to learn models for that data. The presumption is that the potential noise of wrong examples selected from a corpus during this process would be low enough so as not to affect learnability. Another presumption is that the overall discriminative ability of the model is superior to purely unsupervised methods or to situations in which not enough hand-annotated data is available to train a purely supervised system. Mihalcea and Moldovan [63] describe one such system in which the algorithm in Figure 4-8 is used to obtain examples from large corpora for particular senses in WordNet. Mihalcea [64] proposes the following method using Wikipedia for automatic word sense disambiguation.
• Extract all the sentences in Wikipedia in which the word under consideration is a link. There are two types of links: a simple link, such as [[bar]], or a piped link, such as [[musical notation|bar]].

• Filter those links that point to a disambiguation page. This means that we need further information to disambiguate the word. If the word does not point to a disambiguation page, then the word itself can be the label. For all piped links, the string before the pipe serves as the label.

• Collect all the labels associated with the word, and then map them to possible WordNet senses. Sometimes they might all map to the same sense, essentially making the word monosemous and not useful for this purpose. Often, the categories can be mapped to a significant number of WordNet categories, thereby providing sense-disambiguated data for training. The manual mapping is a relatively inexpensive process.
This algorithm provides a cheap way of extracting sense information for many of the words that display the required properties, and it can alleviate the manually intensive process of sense tagging. Depending on how many words in the entire Wikipedia exhibit this property, it could be very useful for generating sense-tagged data. A rough idea of the coverage of this method can be gleaned from the fact that roughly 30 of the 49 nouns that were used in SENSEVAL-2 and SENSEVAL-3 were found to have more than two senses for which uses could be extracted from Wikipedia. The average disambiguation accuracy on these senses was in the mid-80% range. The interannotator agreement for mapping the senses to WordNet was around 91%.
Figure 4-8: Algorithm for obtaining examples from large corpora for WordNet senses [63]

Step 1. Preprocessing
Step 2. Search
- Search the Internet with the phrases determined in the previous step and gather matching documents.
- From these documents, extract the sentences containing these words.
Step 3. Postprocessing
- Keep only those sentences in which the word under consideration belongs to the same part of speech as the selected sense, and delete the others.

4.4.3 Software

Several software programs are made available by the research community for word sense disambiguation, ranging from similarity measure modules to full disambiguation systems. It is not possible to list all of them here, so we list a selected few.
• IMS (It Makes Sense): http://nlp.comp.nus.edu.sg/software
This is a complete word sense disambiguation system.

• WordNet-Similarity-2.05: http://search.cpan.org/dist/WordNet-Similarity
These WordNet similarity modules for Perl provide a quick way of computing various word similarity measures.

• WikiRelate!: http://www.h-its.org/english/research/nlp/download/wikipediasimilarity.php
This is a word similarity measure based on the categories in Wikipedia.