same. Two words can be homonyms in a different way if they are spelled differently
but pronounced the same, like write and right, or piece and peace. We call these
homophones; they are one cause of real-word spelling errors.
Homonymy causes problems in other areas of language processing as well. In
question answering or information retrieval, we can better help a user who typed "bat
care" if we know whether they are interested in vampires or just want to play baseball.
The two senses will also have different translations; in Spanish the animal bat is a murciélago
while the baseball bat is a bate. Homographs that are pronounced differently cause
problems for speech synthesis (Chapter 28), such as these homographs of the word
bass, the fish pronounced [b ae s] and the instrument pronounced [b ey s].
(C.3) The expert angler from Dora, Mo., was fly-casting for bass rather than the
traditional trout.
(C.4) The curtain rises to the sound of angry dogs baying and ominous bass chords
sounding.
Sometimes there is also some semantic connection between the senses of a word.
Consider the following example:
(C.5) While some banks furnish blood only to hospitals, others are less restrictive.
Although this is clearly not a use of the “sloping mound” meaning of bank, it just
as clearly is not a reference to a charitable giveaway by a financial institution. Rather,
bank has a whole range of uses related to repositories for various biological entities,
as in blood bank, egg bank, and sperm bank. So we could call this “biological
repository” sense bank3 . Now this new sense bank3 has some sort of relation to
bank1 ; both bank1 and bank3 are repositories for entities that can be deposited and
taken out; in bank1 the entity is monetary, whereas in bank3 the entity is biological.
When two senses are related semantically, we call the relationship between them
polysemy rather than homonymy. In many cases of polysemy, the semantic relation
between the senses is systematic and structured. For example, consider yet another
sense of bank, exemplified in the following sentence:
(C.6) The bank is on the corner of Nassau and Witherspoon.
This sense, which we can call bank4 , means something like “the building be-
longing to a financial institution”. It turns out that these two kinds of senses (an
organization and the building associated with an organization ) occur together for
many other words as well (school, university, hospital, etc.). Thus, there is a sys-
tematic relationship between senses that we might represent as
BUILDING ↔ ORGANIZATION
This particular subtype of polysemy relation is often called metonymy. Metonymy
is the use of one aspect of a concept or entity to refer to other aspects of the entity
or to the entity itself. Thus, we are performing metonymy when we use the phrase
the White House to refer to the administration whose office is in the White House.
Other common examples of metonymy include the relation between the following
pairings of senses:
Author (Jane Austen wrote Emma) ↔ Works of Author (I really love Jane Austen)
Tree (Plums have beautiful blossoms) ↔ Fruit (I ate a preserved plum yesterday)
While it can be useful to distinguish polysemy from unrelated homonymy, there
is no hard threshold for how related two senses must be to be considered polyse-
mous. Thus, the difference is really one of degree. This fact can make it very difficult
to decide how many senses a word has, that is, whether to make separate senses for
closely related usages. There are various criteria for deciding that the differing uses
of a word should be represented with discrete senses. We might consider two senses
discrete if they have independent truth conditions, different syntactic behavior, and
independent sense relations, or if they exhibit antagonistic meanings.
Consider the following uses of the verb serve from the WSJ corpus:
(C.7) They rarely serve red meat, preferring to prepare seafood.
(C.8) He served as U.S. ambassador to Norway in 1976 and 1977.
(C.9) He might have served his time, come out and led an upstanding life.
The serve of serving red meat and that of serving time clearly have different truth
conditions and presuppositions; the serve of serve as ambassador has the distinct
subcategorization structure serve as NP. These heuristics suggest that these are prob-
ably three distinct senses of serve. One practical technique for determining if two
senses are distinct is to conjoin two uses of a word in a single sentence; this kind of
conjunction of antagonistic readings is called zeugma. Consider the following ATIS
examples:
(C.10) Which of those flights serve breakfast?
(C.11) Does Midwest Express serve Philadelphia?
(C.12) ?Does Midwest Express serve breakfast and Philadelphia?
We use (?) to mark those examples that are semantically ill-formed. The oddness of
the invented third example (a case of zeugma) indicates there is no sensible way to
make a single sense of serve work for both breakfast and Philadelphia. We can use
this as evidence that serve has two different senses in this case.
Dictionaries tend to use many fine-grained senses so as to capture subtle meaning
differences, a reasonable approach given that the traditional role of dictionaries is
aiding word learners. For computational purposes, we often don’t need these fine
distinctions, so we may want to group or cluster the senses; we have already done
this for some of the examples in this chapter.
How can we define the meaning of a word sense? We introduced in Chapter 6 the
standard computational approach of representing a word as an embedding, a point in
semantic space. The intuition was that words were defined by their co-occurrences,
the counts of words that often occur nearby.
Thesauri offer an alternative way of defining words. But we can’t just look at
the definition itself. Consider the following fragments from the definitions of right,
left, red, and blood from the American Heritage Dictionary (Morris, 1985).
right adj. located nearer the right hand esp. being on the right when
facing the same direction as the observer.
left adj. located nearer to this side of the body than the right.
red n. the color of blood or a ruby.
blood n. the red liquid that circulates in the heart, arteries and veins of
animals.
Note the circularity in these definitions. The definition of right makes two direct
references to itself, and the entry for left contains an implicit self-reference in the
phrase this side of the body, which presumably means the left side. The entries for
red and blood reference each other in their definitions. Such circularity is inherent
in all dictionary definitions. For humans, such entries are still useful since the user
of the dictionary has sufficient grasp of these other terms.
For computational purposes, one approach to defining a sense is—like the dic-
tionary definitions—defining a sense through its relationship with other senses. For
example, the above definitions make it clear that right and left are similar kinds of
lemmas that stand in some kind of alternation, or opposition, to one another. Simi-
larly, we can glean that red is a color, that it can be applied to both blood and rubies,
and that blood is a liquid. Sense relations of this sort are embodied in on-line
databases like WordNet. Given a sufficiently large database of such relations, many
applications are quite capable of performing sophisticated semantic tasks (even if
they do not really know their right from their left).
Synonymy We introduced in Chapter 6 the idea that when two senses of two dif-
ferent words (lemmas) are identical, or nearly identical, we say the two senses are
synonyms. Synonyms include such pairs as
couch/sofa vomit/throw up filbert/hazelnut car/automobile
Note that in the WordNet entry for the lemma bass there are eight senses for the noun and one for the adjective, each of
which has a gloss (a dictionary-style definition), a list of synonyms for the sense, and
sometimes also usage examples (shown for the adjective sense). Unlike dictionaries,
WordNet doesn’t represent pronunciation, so doesn’t distinguish the pronunciation
[b ae s] in bass4 , bass5 , and bass8 from the other senses pronounced [b ey s].
The set of near-synonyms for a WordNet sense is called a synset (for synonym
set); synsets are an important primitive in WordNet. The entry for bass includes
synsets like {bass1 , deep6 }, or {bass6 , bass voice1 , basso2 }. We can think of a
synset as representing a concept of the type we discussed in Chapter 14. Thus,
instead of representing concepts in logical terms, WordNet represents them as lists
of the word senses that can be used to express the concept. Here’s another synset
example:
{chump1 , fool2 , gull1 , mark9 , patsy1 , fall guy1 ,
sucker1 , soft touch1 , mug2 }
The gloss of this synset describes it as a person who is gullible and easy to take
advantage of. Each of the lexical entries included in the synset can, therefore, be
used to express this concept. Synsets like this one actually constitute the senses
associated with WordNet entries, and hence it is synsets, not wordforms, lemmas, or
individual senses, that participate in most of the lexical sense relations in WordNet.
WordNet represents all the kinds of sense relations discussed in the previous sec-
tion, as illustrated in Fig. C.2 and Fig. C.3. WordNet hyponymy relations correspond
Sense 3
bass, basso --
(an adult male singer with the lowest voice)
=> singer, vocalist, vocalizer, vocaliser
=> musician, instrumentalist, player
=> performer, performing artist
=> entertainer
=> person, individual, someone...
=> organism, being
=> living thing, animate thing,
=> whole, unit
=> object, physical object
=> physical entity
=> entity
=> causal agent, cause, causal agency
=> physical entity
=> entity
Sense 7
bass --
(the member with the lowest range of a family of
musical instruments)
=> musical instrument, instrument
=> device
=> instrumentality, instrumentation
=> artifact, artefact
=> whole, unit
=> object, physical object
=> physical entity
=> entity
Figure C.4 Hyponymy chains for two separate senses of the lemma bass. Note that the
chains are completely distinct, only converging at the very abstract level whole, unit.
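The hyponymy chains above can be reproduced programmatically. Below is a minimal sketch using NLTK's WordNet interface; it assumes the nltk package with its 'wordnet' data installed, and the NLTK sense name 'bass.n.03' is an assumption that may not line up exactly with the sense numbering used in this appendix.

```python
# A minimal sketch using NLTK's WordNet interface (assumes the nltk package and
# the 'wordnet' data, installed e.g. via nltk.download('wordnet')).
from nltk.corpus import wordnet as wn

# List the synsets (senses) of the lemma "bass" with their glosses.
for synset in wn.synsets('bass'):
    print(synset.name(), '-', synset.definition())

# Walk the hypernym chains for one sense, analogous to Fig. C.4; the NLTK
# sense name 'bass.n.03' is an assumption and may denote a different sense
# than the one shown above.
sense = wn.synset('bass.n.03')
for path in sense.hypernym_paths():
    print(' => '.join(s.name() for s in reversed(path)))
```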
We might say that the financial sense is similar to one of the senses of fund and the
riparian sense is more similar to one of the senses of slope. In the next few sections
of this chapter, we will compute these relations over both words and senses.
The thesaurus-based algorithms use the structure of the thesaurus to define word
similarity. In principle, we could measure similarity by using any information avail-
able in a thesaurus (meronymy, glosses, etc.). In practice, however, thesaurus-based
word similarity algorithms generally use only the hypernym/hyponym (is-a or sub-
sumption) hierarchy. In WordNet, verbs and nouns are in separate hypernym hier-
archies, so a thesaurus-based algorithm for WordNet can thus compute only noun-
noun similarity, or verb-verb similarity; we can’t compare nouns to verbs or do
anything with adjectives or other parts of speech.
The simplest thesaurus-based algorithms are based on the intuition that words
or senses are more similar if there is a shorter path between them in the thesaurus
graph, an intuition dating back to Quillian (1969). A word/sense is most similar to
itself, then to its parents or siblings, and least similar to words that are far away. We
make this notion operational by measuring the number of edges between the two
concept nodes in the thesaurus graph and adding one. Figure C.5 shows an intuition;
the concept dime is most similar to nickel and coin, less similar to money, and even
less similar to Richter scale. A formal definition:
pathlen(c1, c2) = 1 + the number of edges in the shortest path in the thesaurus graph between the two concept nodes c1 and c2

Figure C.5 A fragment of the WordNet hypernym hierarchy, showing path lengths (number
of edges plus 1) from nickel to coin (2), dime (3), money (6), and Richter scale (8).
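To make the edge-counting idea concrete, here is a small self-contained sketch that computes pathlen by breadth-first search over a toy graph; the node names and edges are assumptions that only loosely mirror the fragment in Fig. C.5.

```python
# Toy breadth-first-search implementation of pathlen; the graph below is an
# assumption that only loosely mirrors the Fig. C.5 fragment.
from collections import deque

EDGES = [
    ('nickel', 'coin'), ('dime', 'coin'), ('coin', 'coinage'),
    ('coinage', 'currency'), ('currency', 'medium of exchange'),
    ('money', 'medium of exchange'), ('medium of exchange', 'standard'),
    ('Richter scale', 'scale'), ('scale', 'standard'),
]
graph = {}
for a, b in EDGES:
    graph.setdefault(a, set()).add(b)
    graph.setdefault(b, set()).add(a)

def pathlen(c1, c2):
    """1 + the number of edges on the shortest path between c1 and c2."""
    queue, seen = deque([(c1, 0)]), {c1}
    while queue:
        node, edges = queue.popleft()
        if node == c2:
            return edges + 1
        for neighbor in graph[node]:
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, edges + 1))
    return None

for target in ('coin', 'dime', 'money', 'Richter scale'):
    # prints 2, 3, 6, 8 -- the values in the Fig. C.5 caption
    print('pathlen(nickel, %s) = %d' % (target, pathlen('nickel', target)))
```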
The basic path-length algorithm makes the implicit assumption that each link
in the network represents a uniform distance. In practice, this assumption is not
appropriate. Some links (e.g., those that are deep in the WordNet hierarchy) often
seem to represent an intuitively narrow distance, while other links (e.g., higher up
in the WordNet hierarchy) represent an intuitively wider distance. For example, in
Fig. C.5, the distance from nickel to money (5) seems intuitively much shorter than
the distance from nickel to an abstract word standard; the link between medium of
exchange and standard seems wider than that between, say, coin and coinage.
It is possible to refine path-based algorithms with normalizations based on depth
in the hierarchy (Wu and Palmer, 1994), but in general we’d like an approach that
lets us independently represent the distance associated with each edge.
A second class of thesaurus-based similarity algorithms attempts to offer just
such a fine-grained metric. These information-content word-similarity algorithms
still rely on the structure of the thesaurus but also add probabilistic information
derived from a corpus.
Following Resnik (1995) we’ll define P(c) as the probability that a randomly
selected word in a corpus is an instance of concept c (i.e., a separate random variable,
ranging over words, associated with each concept). This implies that P(root) = 1
since any word is subsumed by the root concept. Intuitively, the lower a concept
in the hierarchy, the lower its probability. We train these probabilities by counting
in a corpus; each word in the corpus counts as an occurrence of each concept that
contains it. For example, in Fig. C.5 above, an occurrence of the word dime would
count toward the frequency of coin, currency, standard, etc. More formally, Resnik
computes P(c) as follows:
P(c) = ( Σw∈words(c) count(w) ) / N        (C.19)
where words(c) is the set of words subsumed by concept c, and N is the total number
of words in the corpus that are also present in the thesaurus.
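A rough sketch of this counting scheme follows, using NLTK's WordNet hypernym paths to find every concept that contains a word; the toy word counts, the restriction to noun senses, and the helper name concept_counts are illustrative assumptions rather than Resnik's exact procedure.

```python
# Rough sketch of concept probabilities P(c) (Eq. C.19): every occurrence of a
# word counts toward each concept that subsumes one of its senses. The toy
# word counts and the restriction to noun senses are illustrative assumptions.
from collections import Counter
from nltk.corpus import wordnet as wn

def concept_counts(word_counts):
    counts, total = Counter(), 0
    for word, n in word_counts.items():
        synsets = wn.synsets(word, pos=wn.NOUN)
        if not synsets:
            continue                      # word not in the thesaurus
        total += n
        subsumers = set()
        for s in synsets:
            for path in s.hypernym_paths():
                subsumers.update(path)    # the synset itself plus all ancestors
        for concept in subsumers:
            counts[concept] += n
    return counts, total

word_counts = {'dime': 10, 'nickel': 8, 'hill': 3, 'coast': 5}   # toy corpus
counts, N = concept_counts(word_counts)
P = {c: counts[c] / N for c in counts}
print(P[wn.synset('entity.n.01')])    # 1.0: the root subsumes every word
```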
Figure C.6, from Lin (1998), shows a fragment of the WordNet concept hierar-
chy augmented with the probabilities P(c).
entity (0.395) → inanimate-object (0.167) → natural-object (0.0163) → geological-formation (0.00176) → . . .
We now need two additional definitions. First, following basic information the-
ory, we define the information content (IC) of a concept c as

IC(c) = − log P(c)

Second, we define the lowest common subsumer LCS(c1, c2) as the lowest node in the hierarchy
that subsumes (is a hypernym of) both c1 and c2. The Resnik similarity of two concepts
is then just the information content of their lowest common subsumer, − log P(LCS(c1, c2)).
Lin (1998) then measures the similarity of two objects A and B as the ratio between the
information that characterizes their commonality and the information that fully describes
what they are:

simLin(A, B) = common(A, B) / description(A, B)        (C.24)
Applying this idea to the thesaurus domain, Lin shows (in a slight modification
of Resnik’s assumption) that the information in common between two concepts is
twice the information in the lowest common subsumer LCS(c1 , c2 ). Adding in the
above definitions of the information content of thesaurus concepts, the final Lin
Lin similarity similarity function is
simLin(c1, c2) = 2 × log P(LCS(c1, c2)) / ( log P(c1) + log P(c2) )        (C.25)
For example, using simLin , Lin (1998) shows that the similarity between the
concepts of hill and coast from Fig. C.6 is
simLin(hill, coast) = 2 × log P(geological-formation) / ( log P(hill) + log P(coast) ) = 0.59        (C.26)
A similar formula, Jiang-Conrath distance (Jiang and Conrath, 1997), although
derived in a completely different way from Lin and expressed as a distance rather
than a similarity function, has been shown to work as well as or better than all the
other thesaurus-based methods:

distJC(c1, c2) = 2 × log P(LCS(c1, c2)) − ( log P(c1) + log P(c2) )
Let RELS be the set of possible WordNet relations whose glosses we compare;
assuming a basic overlap measure as sketched above, we can then define the Extended
Lesk overlap measure as

simeLesk(c1, c2) = Σr,q∈RELS overlap(gloss(r(c1)), gloss(q(c2)))        (C.28)
simpath(c1, c2) = 1 / pathlen(c1, c2)

simResnik(c1, c2) = − log P(LCS(c1, c2))

simLin(c1, c2) = 2 × log P(LCS(c1, c2)) / ( log P(c1) + log P(c2) )

simJC(c1, c2) = 1 / ( 2 × log P(LCS(c1, c2)) − (log P(c1) + log P(c2)) )

simeLesk(c1, c2) = Σr,q∈RELS overlap(gloss(r(c1)), gloss(q(c2)))
Figure C.7 summarizes the five similarity measures we have described in this
section.
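Several of these measures are available in NLTK and can serve as a quick sanity check. The sketch below assumes the nltk package with the 'wordnet' and 'wordnet_ic' data installed; the synset names are assumptions, and the printed values depend on the Brown-corpus information-content file, so the Lin score will not exactly reproduce the 0.59 of Eq. C.26.

```python
# Sanity-check several of the measures with NLTK (assumes the 'wordnet' and
# 'wordnet_ic' data packages). Values depend on the Brown-corpus IC file.
from nltk.corpus import wordnet as wn
from nltk.corpus import wordnet_ic

brown_ic = wordnet_ic.ic('ic-brown.dat')

hill = wn.synset('hill.n.01')     # the sense numbers here are assumptions
coast = wn.synset('coast.n.01')

print('sim_path  :', hill.path_similarity(coast))            # 1 / pathlen
print('sim_resnik:', hill.res_similarity(coast, brown_ic))   # -log P(LCS)
print('sim_lin   :', hill.lin_similarity(coast, brown_ic))
print('sim_jc    :', hill.jcn_similarity(coast, brown_ic))   # 1 / JC distance
```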
It is useful to distinguish two WSD tasks. In the lexical sample task, a small
pre-selected set of target words is chosen, along with an inventory of senses for each
word from some lexicon. Since the set of words and the set of senses are small,
simple supervised classification approaches are used.

2 The WordNet database includes eight senses; we have arbitrarily selected two for this example; we
have also arbitrarily selected one of the many Spanish fishes that could translate English sea bass.
In the all-words task, systems are given entire texts and a lexicon with an inven-
tory of senses for each entry and are required to disambiguate every content word in
the text. The all-words task is similar to part-of-speech tagging, except with a much
larger set of tags since each lemma has its own set. A consequence of this larger set
of tags is data sparseness; it is unlikely that adequate training data for every word in
the test set will be available. Moreover, given the number of polysemous words in
reasonably sized lexicons, approaches based on training one classifier per term are
unlikely to be practical.
C.5.2 Evaluation
To evaluate WSD algorithms, it's better to consider extrinsic, task-based, or end-to-end
evaluation, in which we see whether some new WSD idea actually improves
performance in some end-to-end application like question answering or machine
translation. Nonetheless, because extrinsic evaluations are difficult and slow, WSD
systems are typically evaluated with intrinsic evaluation, in which a WSD compo-
nent is treated as an independent system. Common intrinsic evaluations are either
exact-match sense accuracy—the percentage of words that are tagged identically
with the hand-labeled sense tags in a test set—or precision and recall if sys-
tems are permitted to pass on the labeling of some instances. In general, we evaluate
by using held-out data from the same sense-tagged corpora that we used for training,
such as the SemCor corpus discussed above or the various corpora produced by the
SENSEVAL effort.
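A minimal sketch of these intrinsic metrics follows; the use of None to mark instances the system passes on, and the toy sense tags, are illustrative assumptions.

```python
# Minimal sketch of intrinsic WSD scoring; None marks an instance the system
# declines to label. When the system labels everything, accuracy = precision = recall.
def wsd_scores(gold, predicted):
    attempted = [(g, p) for g, p in zip(gold, predicted) if p is not None]
    correct = sum(1 for g, p in attempted if g == p)
    accuracy = correct / len(gold)                       # exact-match accuracy
    precision = correct / len(attempted) if attempted else 0.0
    recall = correct / len(gold)
    return accuracy, precision, recall

gold = ['bass.n.07', 'bass.n.03', 'bank.n.01']
pred = ['bass.n.07', None, 'bank.n.02']                  # one abstention, one error
print(wsd_scores(gold, pred))                            # (0.33, 0.5, 0.33)
```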
Many aspects of sense evaluation have been standardized by the SENSEVAL and
SEMEVAL efforts (Palmer et al. 2006, Kilgarriff and Palmer 2000). This framework
provides a shared task with training and testing materials along with sense invento-
ries for all-words and lexical sample tasks in a variety of languages.
The normal baseline is to choose the most frequent sense for each word from the
senses in a labeled corpus (Gale et al., 1992a). For WordNet, this corresponds to the
first sense, since senses in WordNet are generally ordered from most frequent to least
frequent. WordNet sense frequencies come from the SemCor sense-tagged corpus
described above; WordNet senses that don't occur in SemCor are ordered arbitrarily
after those that do. The most frequent sense baseline can be quite accurate, and is
therefore often used as a default, to supply a word sense when a supervised algorithm
has insufficient training data.
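Under the assumption that NLTK's synset ordering reflects the SemCor-derived frequencies just described, a sketch of the baseline looks like this:

```python
# Most-frequent-sense baseline: take the first WordNet synset, which reflects
# the SemCor-derived ordering described above (assumes the nltk 'wordnet' data).
from nltk.corpus import wordnet as wn

def most_frequent_sense(word, pos=None):
    synsets = wn.synsets(word, pos=pos)
    return synsets[0] if synsets else None        # None if the word is unknown

print(most_frequent_sense('bass', wn.NOUN))
```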
Figure C.9 The Simplified Lesk algorithm. The C OMPUTE OVERLAP function returns the
number of words in common between two sets, ignoring function words or other words on a
stop list. The original Lesk algorithm defines the context in a more complex way. The Cor-
pus Lesk algorithm weights each overlapping word w by its − log P(w) and includes labeled
training corpus data in the signature.
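Since the pseudocode of Fig. C.9 is not reproduced here, the following is a hedged Python sketch of Simplified Lesk: WordNet glosses and examples form the sense signature, a tiny inline stop list stands in for a real one, and the bank sentence is a context along the lines of example (C.31).

```python
# A sketch of Simplified Lesk: score each WordNet sense by the overlap between
# its signature (gloss plus examples) and the context, ignoring stop words.
# The tiny stop list and the whitespace tokenization are simplifications.
from nltk.corpus import wordnet as wn

STOP = {'a', 'an', 'the', 'of', 'in', 'on', 'to', 'and', 'or', 'is', 'are',
        'it', 'that', 'will', 'can', 'because'}

def compute_overlap(signature, context):
    return len((signature & context) - STOP)

def simplified_lesk(word, sentence):
    context = set(sentence.lower().split())
    senses = wn.synsets(word)
    if not senses:
        return None
    best_sense, best_overlap = senses[0], 0      # default: most frequent sense
    for sense in senses:
        signature = set(sense.definition().lower().split())
        for example in sense.examples():
            signature |= set(example.lower().split())
        overlap = compute_overlap(signature, context)
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense

context = ('The bank can guarantee deposits will eventually cover future '
           'tuition costs because it invests in adjustable-rate mortgage securities')
print(simplified_lesk('bank', context))
```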
Sense bank1 has two non-stopwords overlapping with the context in (C.31):
deposits and mortgage, while sense bank2 has zero words, so sense bank1 is chosen.
There are many obvious extensions to Simplified Lesk. The original Lesk algo-
rithm (Lesk, 1986) is slightly more indirect. Instead of comparing a target word’s
signature with the context words, the target signature is compared with the signatures
of each of the context words. For example, consider Lesk’s example of selecting the
appropriate sense of cone in the phrase pine cone given the following definitions for
pine and cone.
pine 1 kinds of evergreen tree with needle-shaped leaves
2 waste away through sorrow or illness
cone 1 solid body which narrows to a point
2 something of this shape whether solid or hollow
3 fruit of certain evergreen trees
In this example, Lesk’s method would select cone3 as the correct sense since two
of the words in its entry, evergreen and tree, overlap with words in the entry for pine,
whereas neither of the other entries has any overlap with words in the definition of
pine. In general Simplified Lesk seems to work better than original Lesk.
The primary problem with either the original or simplified approaches, how-
ever, is that the dictionary entries for the target words are short and may not provide
enough chance of overlap with the context.3 One remedy is to expand the list of
words used in the classifier to include words related to, but not contained in, their
individual sense definitions. But the best solution, if any sense-tagged corpus data
like SemCor is available, is to add all the words in the labeled corpus sentences for a
word sense into the signature for that sense. This version of the algorithm, the
Corpus Lesk algorithm, is the best-performing of all the Lesk variants (Kilgarriff and
Rosenzweig 2000, Vasilescu et al. 2004) and is used as a baseline in the SENSEVAL
competitions. Instead of just counting up the overlapping words, the Corpus Lesk
algorithm also applies a weight to each overlapping word. The weight is the inverse
document frequency or IDF, a standard information-retrieval measure introduced
in Chapter 6. IDF measures how many different "documents" (in this case, glosses
and examples) a word occurs in and is thus a way of discounting function words.
Since function words like the, of, etc., occur in many documents, their IDF is very
low, while the IDF of content words is high. Corpus Lesk thus uses IDF instead of a
stop list.
Formally, the IDF for a word i can be defined as
idfi = log ( Ndoc / ndi )        (C.32)
3 Indeed, Lesk (1986) notes that the performance of his system seems to roughly correlate with the
length of the dictionary entries.
where Ndoc is the total number of “documents” (glosses and examples) and ndi is
the number of these documents containing word i.
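A quick sketch of Eq. C.32, treating each gloss or example as a "document", might look as follows; the toy documents are assumptions.

```python
# Quick sketch of Eq. C.32: each gloss or example is treated as a "document".
import math

def idf(word, documents):
    nd = sum(1 for doc in documents if word in doc)
    return math.log(len(documents) / nd) if nd else 0.0

documents = [{'the', 'color', 'of', 'blood', 'or', 'a', 'ruby'},
             {'a', 'financial', 'institution'},
             {'the', 'red', 'liquid', 'that', 'circulates'}]   # toy glosses
print(idf('the', documents), idf('blood', documents))   # function word vs. content word
```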
Finally, we can combine the Lesk and supervised approaches by adding new
Lesk-like bag-of-words features. For example, the glosses and example sentences
for the target sense in WordNet could be used to compute the supervised bag-of-
words features in addition to the words in the SemCor context sentence for the sense
(Yuret, 2004).
[Figure C.10 residue: the graph includes sense nodes such as food1n, liquid1n, helping1n, beverage1n, milk1n, toast4n, drink1n, drink1v, sip1v, sup1v, consume1v, drinking1n, drinker1n, consumer1n, potation1n, and consumption1n.]
Figure C.10 Part of the WordNet graph around drink1v , after Navigli and Lapata (2010).
There are various ways to use the graph for disambiguation, some using the
whole graph, some using only a subpart. For example the target word and the words
in its sentential context can all be inserted as nodes in the graph via a directed edge
to each of its senses. If we consider the sentence She drank some milk, Fig. C.11
shows a portion of the WordNet graph between the senses drink1v and milk1n .
[Figure C.11 residue: the word nodes "drink" and "milk" are each linked to their candidate senses (drink2v . . . drink5v, milk2n . . . milk4n), with sense nodes such as drinker1n, food1n, nutriment1n, and boozing1n lying on paths between drink1v and milk1n.]
Figure C.11 Part of the WordNet graph between drink1v and milk1n , for disambiguating a
sentence like She drank some milk, adapted from Navigli and Lapata (2010).
The correct sense is then the one which is the most important or central in some
way in this graph. There are many different methods for deciding centrality. The
simplest is degree, the number of edges into the node, which tends to correlate
with the most frequent sense. Another algorithm for assigning probabilities across
nodes is personalized page rank, a version of the well-known pagerank algorithm
which uses some seed nodes. By inserting a uniform probability across the word
nodes (drink and milk in the example) and computing the personalized page rank of
the graph, the result will be a pagerank value for each node in the graph, and the
sense with the maximum pagerank can then be chosen. See Agirre et al. (2014) and
Navigli and Lapata (2010) for details.
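The following sketch illustrates the seeding idea with networkx; the miniature graph is an assumed stand-in for the WordNet graph of Fig. C.11, and recent networkx versions allow the personalization dictionary to name only the seed (word) nodes.

```python
# Sketch of sense disambiguation by personalized PageRank with networkx. The
# miniature graph is an assumed stand-in for the WordNet graph of Fig. C.11.
import networkx as nx

G = nx.Graph()
G.add_edges_from([
    ('drink.v.01', 'beverage.n.01'), ('beverage.n.01', 'milk.n.01'),
    ('drink.v.01', 'consume.v.02'), ('drink.v.02', 'booze.n.01'),
    ('milk.n.02', 'river.n.01'),                 # an unrelated sense of milk
])
# Word nodes linked to each of their candidate senses.
G.add_edges_from([('drink', 'drink.v.01'), ('drink', 'drink.v.02'),
                  ('milk', 'milk.n.01'), ('milk', 'milk.n.02')])

# Seed the random walk uniformly at the word nodes of the sentence.
rank = nx.pagerank(G, personalization={'drink': 0.5, 'milk': 0.5})
best = max(['milk.n.01', 'milk.n.02'], key=lambda sense: rank[sense])
print(best)   # the sense with the highest pagerank is chosen
```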
Figure C.12 The Yarowsky algorithm disambiguating “plant” at two stages; “?” indicates an unlabeled ob-
servation, A and B are observations labeled as SENSE-A or SENSE-B. The initial stage (a) shows only seed
sentences Λ0 labeled by collocates (“life” and “manufacturing”). An intermediate stage is shown in (b) where
more collocates have been discovered (“equipment”, “microscopic”, etc.) and more instances in V0 have been
moved into Λ1 , leaving a smaller unlabeled set V1 . Figure adapted from Yarowsky (1995).
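The figure's labeling-and-growing loop can be sketched roughly as follows; this toy version labels instances by collocate lookup and harvests new collocates by counting, whereas Yarowsky's actual algorithm trains a decision-list classifier and ranks collocates by log-likelihood ratio at each iteration. The seed collocates, instances, and threshold are illustrative assumptions.

```python
# Rough sketch of the bootstrapping loop in Fig. C.12. Instances are sets of
# context words for "plant"; seeds, instances, and min_count are illustrative.
from collections import Counter

SEEDS = {'life': 'A', 'manufacturing': 'B'}

def label_by_collocates(instances, collocates):
    labeled, unlabeled = [], []
    for context in instances:
        senses = {collocates[w] for w in context if w in collocates}
        if len(senses) == 1:                       # label only unambiguous matches
            labeled.append((context, senses.pop()))
        else:
            unlabeled.append(context)
    return labeled, unlabeled

def learn_new_collocates(labeled, min_count=2):
    counts = {'A': Counter(), 'B': Counter()}
    for context, sense in labeled:
        counts[sense].update(context)
    new = {}
    for sense, counter in counts.items():
        other = counts['B' if sense == 'A' else 'A']
        for word, c in counter.items():
            if c >= min_count and other[word] == 0:   # one sense per collocation
                new[word] = sense
    return new

instances = [
    {'microscopic', 'plant', 'life'},
    {'plant', 'equipment', 'manufacturing'},
    {'animal', 'and', 'plant', 'life'},
    {'plant', 'and', 'equipment', 'manufacturing'},
    {'plant', 'equipment', 'employee'},
]
labeled, unlabeled = label_by_collocates(instances, SEEDS)               # Λ0, V0
collocates = dict(SEEDS, **learn_new_collocates(labeled))                # adds "equipment"
labeled, unlabeled = label_by_collocates(instances, collocates)          # Λ1, V1
print(len(labeled), 'labeled,', len(unlabeled), 'unlabeled')             # 5 labeled, 0 unlabeled
```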
We need more good teachers – right now, there are only a half a dozen who can play
the free bass with ease.
An electric guitar and bass player stand off to one side, not really part of the scene,
The researchers said the worms spend part of their life cycle in such fish as Pacific
salmon and striped bass and Pacific rockfish or snapper.
And it all started when fishermen decided the striped bass in Lake Mead were...
Figure C.13 Samples of bass sentences extracted from the WSJ by using the simple corre-
lates play and fish.
C.9 Summary
This chapter has covered a wide range of issues concerning the meanings associated
with lexical items. The following are among the highlights:
• A word sense is the locus of word meaning; definitions and meaning relations
are defined at the level of the word sense rather than wordforms.
• Homonymy is the relation between unrelated senses that share a form, and
polysemy is the relation between related senses that share a form.
• Hyponymy and hypernymy relations hold between words that are in a class-
inclusion relationship.
• WordNet is a large database of lexical relations for English.
• Word-sense disambiguation (WSD) is the task of determining the correct
sense of a word in context. Supervised approaches make use of sentences in
which individual words (lexical sample task) or all words (all-words task)
are hand-labeled with senses from a resource like WordNet.
• Classifiers for supervised WSD are generally trained on features of the sur-
rounding words.
• An important baseline for WSD is the most frequent sense, equivalent, in
WordNet, to taking the first sense.
• The Lesk algorithm chooses the sense whose dictionary definition shares the
most words with the target word’s neighborhood.
• Graph-based algorithms view the thesaurus as a graph and choose the sense
that is most central in some way.
• Word similarity can be computed by measuring the link distance in a the-
saurus or by various measures of the information content of the two nodes.
Bibliographical and Historical Notes

Lesk (1986) was the first to use a machine-readable dictionary for word sense disambiguation.
The problem of dictionary senses being too fine-grained has been addressed by clustering
word senses into coarse senses (Dolan 1994, Chen and Chang 1998, Mihalcea and
Moldovan 2001, Agirre and de Lacalle 2003, Chklovski and Mihalcea 2003, Palmer
et al. 2004, Navigli 2006, Snow et al. 2007). Corpora with clustered word senses for
training clustering algorithms include Palmer et al. (2006) and OntoNotes (Hovy
et al., 2006).
Supervised approaches to disambiguation began with the use of decision trees by
Black (1988). The need for large amounts of annotated text in these methods led to
investigations into the use of bootstrapping methods (Hearst 1991, Yarowsky 1995).
Diab and Resnik (2002) give a semi-supervised algorithm for sense disambigua-
tion based on aligned parallel corpora in two languages. For example, the fact that
the French word catastrophe might be translated as English disaster in one instance
and tragedy in another instance can be used to disambiguate the senses of the two
English words (i.e., to choose senses of disaster and tragedy that are similar). Ab-
ney (2002) and Abney (2004) explore the mathematical foundations of the Yarowsky
algorithm and its relation to co-training. The most-frequent-sense heuristic is an ex-
tremely powerful one but requires large amounts of supervised training data.
The earliest use of clustering in the study of word senses was by Sparck Jones
(1986); Pedersen and Bruce (1997), Schütze (1997), and Schütze (1998) applied
distributional methods. Recent work on word sense induction has applied Latent
Dirichlet Allocation (LDA) (Boyd-Graber et al. 2007, Brody and Lapata 2009, Lau
et al. 2012) and large co-occurrence graphs (Di Marco and Navigli, 2013).
A collection of work concerning WordNet can be found in Fellbaum (1998).
Early work using dictionaries as lexical resources includes Amsler's (1981) use of the
Merriam Webster dictionary and Longman’s Dictionary of Contemporary English
(Boguraev and Briscoe, 1989).
Early surveys of WSD include Agirre and Edmonds (2006) and Navigli (2009).
See Pustejovsky (1995), Pustejovsky and Boguraev (1996), Martin (1986), and
Copestake and Briscoe (1995), inter alia, for computational approaches to the representation
of polysemy. Pustejovsky's theory of the generative lexicon, and in
particular his theory of the qualia structure of words, is another way of accounting
for the dynamic systematic polysemy of words in context.
Another important recent direction is the addition of sentiment and connotation
to knowledge bases (Wiebe et al. 2005, Qiu et al. 2009, Velikovich et al. 2010)
including SentiWordNet (Baccianella et al., 2010) and ConnotationWordNet (Kang
et al., 2014).
Exercises
C.1 Collect a small corpus of example sentences of varying lengths from any
newspaper or magazine. Using WordNet or any standard dictionary, deter-
mine how many senses there are for each of the open-class words in each sen-
tence. How many distinct combinations of senses are there for each sentence?
How does this number seem to vary with sentence length?
C.2 Using WordNet or a standard reference dictionary, tag each open-class word
in your corpus with its correct tag. Was choosing the correct sense always a
straightforward task? Report on any difficulties you encountered.
C.3 Using your favorite dictionary, simulate the original Lesk word overlap dis-
ambiguation algorithm described on page 16 on the phrase Time flies like an
arrow. Assume that the words are to be disambiguated one at a time, from
left to right, and that the results from earlier decisions are used later in the
process.
C.4 Build an implementation of your solution to the previous exercise. Using
WordNet, implement the original Lesk word overlap disambiguation algo-
rithm described on page 16 on the phrase Time flies like an arrow.
Abney, S. P. (2002). Bootstrapping. In ACL-02, pp. 360–367.

Abney, S. P. (2004). Understanding the Yarowsky algorithm. Computational Linguistics, 30(3), 365–395.

Agirre, E. and de Lacalle, O. L. (2003). Clustering WordNet word senses. In RANLP 2003.

Agirre, E., Banea, C., Cardie, C., Cer, D., Diab, M., Gonzalez-Agirre, A., Guo, W., Lopez-Gazpio, I., Maritxalar, M., Mihalcea, R., Rigau, G., Uria, L., and Wiebe, J. (2015). SemEval-2015 Task 2: Semantic Textual Similarity, English, Spanish and Pilot on Interpretability. In SemEval-15, pp. 252–263.

Agirre, E., Diab, M., Cer, D., and Gonzalez-Agirre, A. (2012). SemEval-2012 Task 6: A pilot on semantic textual similarity. In SemEval-12, pp. 385–393.

Agirre, E. and Edmonds, P. (Eds.). (2006). Word Sense Disambiguation: Algorithms and Applications. Kluwer.

Agirre, E., López de Lacalle, O., and Soroa, A. (2014). Random walks for knowledge-based word sense disambiguation. Computational Linguistics, 40(1), 57–84.

Amsler, R. A. (1981). A taxonomy of English nouns and verbs. In ACL-81, Stanford, CA, pp. 133–138.

Atkins, S. (1993). Tools for computer-aided corpus lexicography: The Hector project. Acta Linguistica Hungarica, 41, 5–72.

Baccianella, S., Esuli, A., and Sebastiani, F. (2010). SentiWordNet 3.0: An enhanced lexical resource for sentiment analysis and opinion mining. In LREC-10, pp. 2200–2204.

Banerjee, S. and Pedersen, T. (2003). Extended gloss overlaps as a measure of semantic relatedness. In IJCAI 2003, pp. 805–810.

Black, E. (1988). An experiment in computational discrimination of English word senses. IBM Journal of Research and Development, 32(2), 185–194.

Boguraev, B. and Briscoe, T. (Eds.). (1989). Computational Lexicography for Natural Language Processing. Longman.

Boyd-Graber, J., Blei, D. M., and Zhu, X. (2007). A topic model for word sense disambiguation. In EMNLP/CoNLL 2007.

Brody, S. and Lapata, M. (2009). Bayesian word sense induction. In EACL-09, pp. 103–111.

Bruce, R. F. and Wiebe, J. (1994). Word-sense disambiguation using decomposable models. In ACL-94, Las Cruces, NM, pp. 139–145.

Chen, J. N. and Chang, J. S. (1998). Topical clustering of MRD senses based on information retrieval techniques. Computational Linguistics, 24(1), 61–96.

Chklovski, T. and Mihalcea, R. (2003). Exploiting agreement and disagreement of human annotators for word sense disambiguation. In RANLP 2003.

Copestake, A. and Briscoe, T. (1995). Semi-productive polysemy and sense extension. Journal of Semantics, 12(1), 15–68.

Cottrell, G. W. (1985). A Connectionist Approach to Word Sense Disambiguation. Ph.D. thesis, University of Rochester, Rochester, NY. Revised version published by Pitman, 1989.

Di Marco, A. and Navigli, R. (2013). Clustering and diversifying web search results with graph-based word sense induction. Computational Linguistics, 39(3), 709–754.

Diab, M. and Resnik, P. (2002). An unsupervised method for word sense tagging using parallel corpora. In ACL-02, pp. 255–262.

Dolan, W. B. (1994). Word sense ambiguation: Clustering related senses. In COLING-94, Kyoto, Japan, pp. 712–716.

Duda, R. O. and Hart, P. E. (1973). Pattern Classification and Scene Analysis. John Wiley and Sons.

Fellbaum, C. (Ed.). (1998). WordNet: An Electronic Lexical Database. MIT Press.

Finkelstein, L., Gabrilovich, E., Matias, Y., Rivlin, E., Solan, Z., Wolfman, G., and Ruppin, E. (2002). Placing search in context: The concept revisited. ACM Transactions on Information Systems, 20(1), 116–131.

Gale, W. A., Church, K. W., and Yarowsky, D. (1992a). Estimating upper and lower bounds on the performance of word-sense disambiguation programs. In ACL-92, Newark, DE, pp. 249–256.

Gale, W. A., Church, K. W., and Yarowsky, D. (1992b). One sense per discourse. In Proceedings DARPA Speech and Natural Language Workshop, pp. 233–237.

Hearst, M. A. (1991). Noun homograph disambiguation. In Proceedings of the 7th Conference of the University of Waterloo Centre for the New OED and Text Research, pp. 1–19.

Hill, F., Reichart, R., and Korhonen, A. (2015). SimLex-999: Evaluating semantic models with (genuine) similarity estimation. Computational Linguistics, 41(4), 665–695.

Hirst, G. (1987). Semantic Interpretation and the Resolution of Ambiguity. Cambridge University Press.

Hirst, G. (1988). Resolving lexical ambiguity computationally with spreading activation and polaroid words. In Small, S. L., Cottrell, G. W., and Tanenhaus, M. K. (Eds.), Lexical Ambiguity Resolution, pp. 73–108. Morgan Kaufmann.

Hirst, G. and Charniak, E. (1982). Word sense and case slot disambiguation. In AAAI-82, pp. 95–98.

Hovy, E. H., Marcus, M. P., Palmer, M., Ramshaw, L. A., and Weischedel, R. (2006). OntoNotes: The 90% solution. In HLT-NAACL-06.

Huang, E. H., Socher, R., Manning, C. D., and Ng, A. Y. (2012). Improving word representations via global context and multiple word prototypes. In ACL 2012, pp. 873–882.

Iacobacci, I., Pilehvar, M. T., and Navigli, R. (2016). Embeddings for word sense disambiguation: An evaluation study. In ACL 2016, pp. 897–907.

Jiang, J. J. and Conrath, D. W. (1997). Semantic similarity based on corpus statistics and lexical taxonomy. In ROCLING X, Taiwan.

Jurgens, D. and Klapaftis, I. P. (2013). SemEval-2013 Task 13: Word sense induction for graded and non-graded senses. In *SEM, pp. 290–299.

Kang, J. S., Feng, S., Akoglu, L., and Choi, Y. (2014). ConnotationWordNet: Learning connotation over the word+sense network. In ACL 2014.

Kawamoto, A. H. (1988). Distributed representations of ambiguous words and their resolution in connectionist networks. In Small, S. L., Cottrell, G. W., and Tanenhaus, M. (Eds.), Lexical Ambiguity Resolution, pp. 195–228. Morgan Kaufmann.

Kelly, E. F. and Stone, P. J. (1975). Computer Recognition of English Word Senses. North-Holland.

Kilgarriff, A. (2001). English lexical sample task description. In Proceedings of Senseval-2: Second International Workshop on Evaluating Word Sense Disambiguation Systems, Toulouse, France, pp. 17–20.

Kilgarriff, A. and Palmer, M. (Eds.). (2000). Computing and the Humanities: Special Issue on SENSEVAL, Vol. 34. Kluwer.

Kilgarriff, A. and Rosenzweig, J. (2000). Framework and results for English SENSEVAL. Computers and the Humanities, 34, 15–48.

Krovetz, R. (1998). More than one sense per discourse. In Proceedings of the ACL-SIGLEX SENSEVAL Workshop.

Landauer, T. K. and Dumais, S. T. (1997). A solution to Plato's problem: The Latent Semantic Analysis theory of acquisition, induction, and representation of knowledge. Psychological Review, 104, 211–240.

Landes, S., Leacock, C., and Tengi, R. I. (1998). Building semantic concordances. In Fellbaum, C. (Ed.), WordNet: An Electronic Lexical Database, pp. 199–216. MIT Press.

Lau, J. H., Cook, P., McCarthy, D., Newman, D., and Baldwin, T. (2012). Word sense induction for novel sense detection. In EACL-12, pp. 591–601.

Leacock, C. and Chodorow, M. S. (1998). Combining local context and WordNet similarity for word sense identification. In Fellbaum, C. (Ed.), WordNet: An Electronic Lexical Database, pp. 265–283. MIT Press.

Leacock, C., Towell, G., and Voorhees, E. M. (1993). Corpus-based statistical sense resolution. In HLT-93, pp. 260–265.

Lesk, M. E. (1986). Automatic sense disambiguation using machine readable dictionaries: How to tell a pine cone from an ice cream cone. In Proceedings of the 5th International Conference on Systems Documentation, Toronto, CA, pp. 24–26.

Lin, D. (1998). An information-theoretic definition of similarity. In ICML 1998, San Francisco, pp. 296–304.

Madhu, S. and Lytel, D. (1965). A figure of merit technique for the resolution of non-grammatical ambiguity. Mechanical Translation, 8(2), 9–13.

Manandhar, S., Klapaftis, I. P., Dligach, D., and Pradhan, S. (2010). SemEval-2010 Task 14: Word sense induction & disambiguation. In SemEval-2010, pp. 63–68.

Martin, J. H. (1986). The acquisition of polysemy. In ICML 1986, Irvine, CA, pp. 198–204.

Masterman, M. (1957). The thesaurus in syntax and semantics. Mechanical Translation, 4(1), 1–2.

Mihalcea, R. (2007). Using Wikipedia for automatic word sense disambiguation. In NAACL-HLT 07, pp. 196–203.

Mihalcea, R. and Moldovan, D. (2001). Automatic generation of a coarse grained WordNet. In NAACL Workshop on WordNet and Other Lexical Resources.

Miller, G. A. and Charles, W. G. (1991). Contextual correlates of semantic similarity. Language and Cognitive Processes, 6(1), 1–28.

Miller, G. A., Leacock, C., Tengi, R. I., and Bunker, R. T. (1993). A semantic concordance. In Proceedings ARPA Workshop on Human Language Technology, pp. 303–308.

Morris, W. (Ed.). (1985). American Heritage Dictionary (2nd College Edition Ed.). Houghton Mifflin.

Navigli, R. (2006). Meaningful clustering of senses helps boost word sense disambiguation performance. In COLING/ACL 2006, pp. 105–112.

Navigli, R. (2009). Word sense disambiguation: A survey. ACM Computing Surveys, 41(2).

Navigli, R. and Lapata, M. (2010). An experimental study of graph connectivity for unsupervised word sense disambiguation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(4), 678–692.

Navigli, R. and Vannella, D. (2013). SemEval-2013 Task 11: Word sense induction & disambiguation within an end-user application. In *SEM, pp. 193–201.

Palmer, M., Babko-Malaya, O., and Dang, H. T. (2004). Different sense granularities for different applications. In HLT-NAACL Workshop on Scalable Natural Language Understanding, Boston, MA, pp. 49–56.

Palmer, M., Dang, H. T., and Fellbaum, C. (2006). Making fine-grained and coarse-grained sense distinctions, both manually and automatically. Natural Language Engineering, 13(2), 137–163.

Palmer, M., Fellbaum, C., Cotton, S., Delfs, L., and Dang, H. T. (2001). English tasks: All-words and verb lexical sample. In Proceedings of Senseval-2: 2nd International Workshop on Evaluating Word Sense Disambiguation Systems, Toulouse, France, pp. 21–24.

Palmer, M., Ng, H. T., and Dang, H. T. (2006). Evaluation of WSD systems. In Agirre, E. and Edmonds, P. (Eds.), Word Sense Disambiguation: Algorithms and Applications. Kluwer.

Pedersen, T. and Bruce, R. (1997). Distinguishing word senses in untagged text. In EMNLP 1997, Providence, RI.

Ponzetto, S. P. and Navigli, R. (2010). Knowledge-rich word sense disambiguation rivaling supervised systems. In ACL 2010, pp. 1522–1531.

Pustejovsky, J. (1995). The Generative Lexicon. MIT Press.

Pustejovsky, J. and Boguraev, B. (Eds.). (1996). Lexical Semantics: The Problem of Polysemy. Oxford University Press.

Qiu, G., Liu, B., Bu, J., and Chen, C. (2009). Expanding domain sentiment lexicon through double propagation. In IJCAI-09, pp. 1199–1204.

Quillian, M. R. (1968). Semantic memory. In Minsky, M. (Ed.), Semantic Information Processing, pp. 227–270. MIT Press.

Quillian, M. R. (1969). The teachable language comprehender: A simulation program and theory of language. Communications of the ACM, 12(8), 459–476.

Resnik, P. (1995). Using information content to evaluate semantic similarity in a taxonomy. In International Joint Conference for Artificial Intelligence (IJCAI-95), pp. 448–453.

Riesbeck, C. K. (1975). Conceptual analysis. In Schank, R. C. (Ed.), Conceptual Information Processing, pp. 83–156. American Elsevier, New York.

Rubenstein, H. and Goodenough, J. B. (1965). Contextual correlates of synonymy. Communications of the ACM, 8(10), 627–633.