18 Word Senses and WordNet
All rights reserved. Draft of December 29, 2021.
Words are ambiguous: the same word can be used to mean different things. In
Chapter 6 we saw that the word “mouse” has (at least) two meanings: (1) a small
rodent, or (2) a hand-operated device to control a cursor. The word “bank” can
mean: (1) a financial institution or (2) a sloping mound. In the quote above from
his play The Importance of Being Earnest, Oscar Wilde plays with two meanings of
“lose” (to misplace an object, and to suffer the death of a close person).
We say that the words ‘mouse’ or ‘bank’ are polysemous (from Greek ‘having
many senses’, poly- ‘many’ + sema, ‘sign, mark’).1 A sense (or word sense) is
a discrete representation of one aspect of the meaning of a word. In this chapter
we discuss word senses in more detail and introduce WordNet, a large online
thesaurus—a database that represents word senses—with versions in many languages.
WordNet also represents relations between senses. For example, there is an IS-A
relation between dog and mammal (a dog is a kind of mammal) and a part-whole
relation between engine and car (an engine is a part of a car).
Knowing the relation between two senses can play an important role in tasks
involving meaning. Consider the antonymy relation. Two words are antonyms if
they have opposite meanings, like long and short, or up and down. Distinguishing
these is quite important; if a user asks a dialogue agent to turn up the music, it
would be unfortunate to instead turn it down. But in fact in embedding models like
word2vec, antonyms are easily confused with each other, because often one of the
closest words in embedding space to a word (e.g., up) is its antonym (e.g., down).
Thesauruses that represent this relationship can help!
We also introduce word sense disambiguation (WSD), the task of determining
which sense of a word is being used in a particular context. We’ll give supervised
and unsupervised algorithms for deciding which sense was intended in a particular
context. This task has a very long history in computational linguistics and many ap-
plications. In question answering, we can be more helpful to a user who asks about
“bat care” if we know which sense of bat is relevant. (Is the user a vampire, or does
the user just want to play baseball?) And the different senses of a word often have
different translations; in Spanish the animal bat is a murciélago while the baseball bat is
a bate, and indeed word sense algorithms may help improve MT (Pu et al., 2018).
Finally, WSD has long been used as a tool for evaluating language processing mod-
els, and understanding how models represent different word senses is an important
1 The word polysemy itself is ambiguous; you may see it used in a different way, to refer only to cases
where a word’s senses are related in some structured way, reserving the word homonymy to mean sense
ambiguities with no relation between the senses (Haber and Poesio, 2020). Here we will use ‘polysemy’
to mean any kind of sense ambiguity, and ‘structured polysemy’ for polysemy with sense relations.
analytic direction.
Glosses are not a formal meaning representation; they are just written for people.
Consider the following fragments from the definitions of right, left, red, and blood
from the American Heritage Dictionary (Morris, 1985).
right adj. located nearer the right hand esp. being on the right when
facing the same direction as the observer.
left adj. located nearer to this side of the body than the right.
red n. the color of blood or a ruby.
blood n. the red liquid that circulates in the heart, arteries and veins of
animals.
Note the circularity in these definitions. The definition of right makes two direct
references to itself, and the entry for left contains an implicit self-reference in the
phrase this side of the body, which presumably means the left side. The entries for
red and blood reference each other in their definitions. For humans, such entries are
useful since the user of the dictionary has sufficient grasp of these other terms.
Yet despite their circularity and lack of formal representation, glosses can still
be useful for computational modeling of senses. This is because a gloss is just a sen-
tence, and from sentences we can compute sentence embeddings that tell us some-
thing about the meaning of the sense. Dictionaries often give example sentences
along with glosses, and these can again be used to help build a sense representation.
The second way thesauruses define a sense is through its relationship with other
senses. For example, the above definitions make it clear that right and left are similar kinds of lemmas
that stand in some kind of alternation, or opposition, to one another. Similarly, we
can glean that red is a color and that blood is a liquid. Sense relations of this sort
(IS-A, or antonymy) are explicitly listed in on-line databases like WordNet. Given
a sufficiently large database of such relations, many applications are quite capable
of performing sophisticated semantic tasks about word senses (even if they do not
really know their right from their left).
Synonymy
We introduced in Chapter 6 the idea that when two senses of two different words
(lemmas) are identical, or nearly identical, we say the two senses are synonyms.
Synonyms include such pairs as
couch/sofa vomit/throw up filbert/hazelnut car/automobile
And we mentioned that in practice, the word synonym is commonly used to
describe a relationship of approximate or rough synonymy. But furthermore, syn-
onymy is actually a relationship between senses rather than words. Consider the
words big and large. These may seem to be synonyms in the following sentences,
since we could swap big and large in either sentence and retain the same meaning:
(18.7) How big is that plane?
(18.8) Would I be flying on a large or small plane?
But note the following sentence in which we cannot substitute large for big:
(18.9) Miss Nelson, for instance, became a kind of big sister to Benjamin.
(18.10) ?Miss Nelson, for instance, became a kind of large sister to Benjamin.
This is because the word big has a sense that means being older or grown up, while
large lacks this sense. Thus, we say that some senses of big and large are (nearly)
synonymous while other ones are not.
Antonymy
Whereas synonyms are words with identical or similar meanings, antonyms are
words with an opposite meaning, like:
long/short big/little fast/slow cold/hot dark/light
rise/fall up/down in/out
Two senses can be antonyms if they define a binary opposition or are at opposite
ends of some scale. This is the case for long/short, fast/slow, or big/little, which are
at opposite ends of the length or size scale. Another group of antonyms, reversives,
describe change or movement in opposite directions, such as rise/fall or up/down.
Antonyms thus differ completely with respect to one aspect of their meaning—
their position on a scale or their direction—but are otherwise very similar, sharing
almost all other aspects of meaning. Thus, automatically distinguishing synonyms
from antonyms can be difficult.
Taxonomic Relations
Another way word senses can be related is taxonomically. A word (or sense) is a
hyponym of another word or sense if the first is more specific, denoting a subclass
of the other. For example, car is a hyponym of vehicle, dog is a hyponym of animal,
and mango is a hyponym of fruit. Conversely, we say that vehicle is a hypernym of
car, and animal is a hypernym of dog. It is unfortunate that the two words (hypernym
and hyponym) are very similar and hence easily confused; for this reason, the word
superordinate is often used instead of hypernym.
We can define hypernymy more formally by saying that the class denoted by
the superordinate extensionally includes the class denoted by the hyponym. Thus,
the class of animals includes as members all dogs, and the class of moving actions
includes all walking actions. Hypernymy can also be defined in terms of entail-
ment. Under this definition, a sense A is a hyponym of a sense B if everything
that is A is also B, and hence being an A entails being a B, or ∀x A(x) ⇒ B(x). Hy-
ponymy/hypernymy is usually a transitive relation; if A is a hyponym of B and B is a
hyponym of C, then A is a hyponym of C. Another name for the hypernym/hyponym
structure is the IS-A hierarchy, in which we say A IS-A B, or B subsumes A.
Hypernymy is useful for tasks like textual entailment or question answering;
knowing that leukemia is a type of cancer, for example, would certainly be useful in
answering questions about leukemia.
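Since hypernymy is (usually) transitive, checking a fact like "leukemia IS-A cancer" only requires following IS-A edges upward. A minimal sketch, using a tiny hand-built IS-A graph rather than actual WordNet data:

```python
# A toy IS-A hierarchy: each word maps to its direct hypernyms.
# The edges below are illustrative examples, not actual WordNet data.
ISA = {
    "dog": ["mammal"],
    "mammal": ["animal"],
    "car": ["vehicle"],
    "leukemia": ["cancer"],
}

def is_a(hyponym, hypernym):
    """Return True if hypernym is reachable from hyponym by
    following IS-A edges (hypernymy is transitive)."""
    frontier = [hyponym]
    seen = set()
    while frontier:
        word = frontier.pop()
        if word == hypernym:
            return True
        if word in seen:
            continue
        seen.add(word)
        frontier.extend(ISA.get(word, []))
    return False

print(is_a("dog", "animal"))  # True: dog IS-A mammal IS-A animal
print(is_a("animal", "dog"))  # False: the relation is not symmetric
```

Note that the graph search implements exactly the transitivity property from the text: if A is a hyponym of B and B of C, then A is a hyponym of C.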
Meronymy
Another common relation is meronymy, the part-whole relation. A leg is part of a
chair; a wheel is part of a car. We say that wheel is a meronym of car, and car is a
holonym of wheel.
Structured Polysemy
The senses of a word can also be related semantically, in which case we call the
relationship between them structured polysemy. Consider this sense of bank:
(18.11) The bank is on the corner of Nassau and Witherspoon.
This sense, perhaps bank4 , means something like “the building belonging to
a financial institution”. These two kinds of senses (an organization and the
building associated with an organization) occur together for many other words as well
(school, university, hospital, etc.). Thus, there is a systematic relationship between
senses that we might represent as
BUILDING ↔ ORGANIZATION
This particular subtype of polysemy relation is called metonymy. Metonymy is
the use of one aspect of a concept or entity to refer to other aspects of the entity or
to the entity itself. We are performing metonymy when we use the phrase the White
House to refer to the administration whose office is in the White House. Other
common examples of metonymy include the relation between the following pairings
of senses:
AUTHOR ↔ WORKS OF AUTHOR
(Jane Austen wrote Emma) (I really love Jane Austen)
FRUIT TREE ↔ FRUIT
(Plums have beautiful blossoms) (I ate a preserved plum yesterday)
Figure 18.1 A portion of the WordNet 3.0 entry for the noun bass.
Note that there are eight senses for the noun and one for the adjective, each of
gloss which has a gloss (a dictionary-style definition), a list of synonyms for the sense, and
sometimes also usage examples (shown for the adjective sense). WordNet doesn’t
represent pronunciation, so doesn’t distinguish the pronunciation [b ae s] in bass4 ,
bass5 , and bass8 from the other senses pronounced [b ey s].
The set of near-synonyms for a WordNet sense is called a synset (for synonym
set); synsets are an important primitive in WordNet. The entry for bass includes
synsets like {bass1 , deep6 }, or {bass6 , bass voice1 , basso2 }. We can think of a
synset as representing a concept of the type we discussed in Chapter 15. Thus,
instead of representing concepts in logical terms, WordNet represents them as lists
of the word senses that can be used to express the concept. Here’s another synset
example:
{chump1 , fool2 , gull1 , mark9 , patsy1 , fall guy1 ,
sucker1 , soft touch1 , mug2 }
The gloss of this synset describes it as:
Gloss: a person who is gullible and easy to take advantage of.
Glosses are properties of a synset, so that each sense included in the synset has the
same gloss and can express this concept. Because they share glosses, synsets like
this one are the fundamental unit associated with WordNet entries, and hence it is
synsets, not wordforms, lemmas, or individual senses, that participate in most of the
lexical sense relations in WordNet.
WordNet also labels each synset with a lexicographic category drawn from a
semantic field, for example the 26 categories for nouns shown in Fig. 18.2, as well
as 15 for verbs (plus 2 for adjectives and 1 for adverbs). These categories are often
called supersenses, because they act as coarse semantic categories or groupings of
senses which can be useful when word senses are too fine-grained (Ciaramita and
Johnson 2003, Ciaramita and Altun 2006). Supersenses have also been defined for
adjectives (Tsvetkov et al., 2014) and prepositions (Schneider et al., 2018).
bass3 , basso (an adult male singer with the lowest voice)
=> singer, vocalist, vocalizer, vocaliser
=> musician, instrumentalist, player
=> performer, performing artist
=> entertainer
=> person, individual, someone...
=> organism, being
=> living thing, animate thing,
=> whole, unit
=> object, physical object
=> physical entity
=> entity
bass7 (member with the lowest range of a family of instruments)
=> musical instrument, instrument
=> device
=> instrumentality, instrumentation
=> artifact, artefact
=> whole, unit
=> object, physical object
=> physical entity
=> entity
Figure 18.5 Hyponymy chains for two separate senses of the lemma bass. Note that the
chains are completely distinct, only converging at the very abstract level whole, unit.
senses in WordNet. For fruit this would mean choosing between the correct answer
fruit1n (the ripened reproductive body of a seed plant), and the other two senses fruit2n
(yield; an amount of a product) and fruit3n (the consequence of some effort or action).
Fig. 18.8 sketches the task.
[Figure 18.8 body omitted: it maps each content word xi of the input “an electric guitar and bass player stand off to one side” to candidate senses yi, e.g., electric1 ‘using electricity’, electric2 ‘tense’, electric3 ‘thrilling’; guitar1; bass1 ‘low range’, bass4 ‘sea fish’, bass7 ‘instrument’; player1 ‘in game’, player2 ‘musician’, player3 ‘actor’; stand1 ‘upright’, stand5 ‘bear’, stand10 ‘put upright’; side1 ‘relative region’, side3 ‘of body’, side11 ‘slope’.]
Figure 18.8 The all-words WSD task, mapping from input words (x) to WordNet senses
(y). Only nouns, verbs, adjectives, and adverbs are mapped, and note that some words (like
guitar in the example) only have one sense in WordNet. Figure inspired by Chaplot and
Salakhutdinov (2018).
For each sense s of any word in the corpus, and for each of the n tokens labeled with
that sense, we average the n contextual representations vi to produce a contextual
sense embedding vs for s:

vs = (1/n) ∑i vi,   ∀vi ∈ tokens(s)                  (18.13)
At test time, given a token of a target word t in context, we compute its contextual
embedding t and choose its nearest neighbor sense from the training set, i.e., the
sense whose sense embedding has the highest cosine with t:

sense(t) = argmax_{s ∈ senses(t)} cosine(t, vs)      (18.14)
[Figure 18.9 body omitted: an ENCODER computes a contextual embedding for found in the input “I found the jar empty”, which is compared against precomputed sense embeddings for find1v, find4v, find5v, and find9v.]
Figure 18.9 The nearest-neighbor algorithm for WSD. In green are the contextual embed-
dings precomputed for each sense of each word; here we just show a few of the senses for
find. A contextual embedding is computed for the target word found, and then the nearest
neighbor sense (in this case find9v ) is chosen. Figure inspired by Loureiro and Jorge (2019).
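Equations 18.13 and 18.14 can be sketched in a few lines; the 2-dimensional vectors below are made-up stand-ins for real contextual embeddings, and the sense names are illustrative:

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def sense_embedding(token_vectors):
    """Eq. 18.13: average the contextual vectors of all tokens
    labeled with a sense to get that sense's embedding vs."""
    n = len(token_vectors)
    dim = len(token_vectors[0])
    return [sum(vec[d] for vec in token_vectors) / n for d in range(dim)]

def disambiguate(t, sense_vectors):
    """Eq. 18.14: pick the sense whose embedding has the highest
    cosine with the target token's contextual embedding t."""
    return max(sense_vectors, key=lambda s: cosine(t, sense_vectors[s]))

# Made-up 2-d "contextual embeddings" of labeled training tokens:
senses = {
    "find1_v": sense_embedding([[0.9, 0.1], [0.8, 0.2]]),
    "find9_v": sense_embedding([[0.1, 0.9], [0.2, 0.8]]),
}
print(disambiguate([0.15, 0.85], senses))  # find9_v
```

In a real system the vectors would come from a contextual encoder run over SemCor, but the averaging and nearest-neighbor steps are exactly as above.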
Since all of the supersenses have some labeled data in SemCor, the algorithm is
guaranteed to have some representation for all possible senses by the time the al-
gorithm backs off to the most general (supersense) information, although of course
with a very coarse model.
18.5 Alternate WSD Algorithms and Tasks
Figure 18.10 The Simplified Lesk algorithm. The C OMPUTE OVERLAP function returns
the number of words in common between two sets, ignoring function words or other words
on a stop list. The original Lesk algorithm defines the context in a more complex way.
Sense bank1 has two non-stopwords overlapping with the context in (18.20):
deposits and mortgage, while sense bank2 has zero words, so sense bank1 is chosen.
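The overlap computation just described can be sketched as below; the glosses are paraphrased (not verbatim WordNet entries), the context sentence is reconstructed for illustration, and the stop list is truncated:

```python
STOP = {"a", "an", "the", "of", "in", "that", "it", "and", "to", "will"}

def overlap(signature, context):
    """COMPUTE OVERLAP: count content words shared between a sense's
    gloss/examples and the context, ignoring stop words."""
    return len((signature - STOP) & (context - STOP))

def simplified_lesk(senses, context_sentence):
    """Choose the sense whose signature overlaps the context most."""
    context = set(context_sentence.lower().split())
    return max(senses,
               key=lambda s: overlap(set(senses[s].lower().split()), context))

# Paraphrased glosses (illustrative, not verbatim WordNet entries):
bank_senses = {
    "bank1": "a financial institution that accepts deposits and channels "
             "the money into lending activities such as mortgage loans",
    "bank2": "sloping land especially beside a body of water",
}
context = ("the bank can guarantee deposits will eventually cover future "
           "tuition costs because it invests in adjustable-rate mortgage "
           "securities")
print(simplified_lesk(bank_senses, context))  # bank1
```

Here bank1 shares deposits and mortgage with the context while bank2 shares nothing, matching the worked example in the text.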
There are many obvious extensions to Simplified Lesk, such as weighting the
overlapping words by IDF (inverse document frequency, Chapter 6) to downweight
frequent words like function words; the best-performing variant uses word embedding
cosine instead of word overlap to compute the similarity between the definition and the
context (Basile et al., 2014). Modern neural extensions of Lesk use the definitions
to compute sense embeddings that can be directly used instead of SemCor-training
embeddings (Kumar et al. 2019, Luo et al. 2018a, Luo et al. 2018b).
In the word-in-context task, a system is given two sentences, each with the same target word but in a different sentential context.
The system must decide whether the target words are used in the same sense in the
two sentences or in a different sense. Fig. 18.11 shows sample pairs from the WiC
dataset of Pilehvar and Camacho-Collados (2019).
The WiC sentences are mainly taken from the example usages for senses in
WordNet. But WordNet senses are very fine-grained. For this reason tasks like
word-in-context first cluster the word senses into coarser clusters, so that the two
sentential contexts for the target word are marked as T if the two senses are in the
same cluster. WiC clusters all pairs of senses if they are first-degree connections in
the WordNet semantic graph, including sister senses, or if they belong to the same
supersense; we point to other sense clustering algorithms at the end of the chapter.
The baseline algorithm to solve the WiC task uses contextual embeddings like
BERT with a simple thresholded cosine. We first compute the contextual embed-
dings for the target word in each of the two sentences, and then compute the cosine
between them. If it is above a threshold tuned on a devset, we respond true (the two
senses are the same); otherwise we respond false.
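This thresholded-cosine baseline can be sketched as follows, with made-up vectors standing in for the two contextual embeddings and an assumed (untuned) threshold:

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def same_sense(emb1, emb2, threshold=0.5):
    """WiC baseline: the target word is judged to be used in the same
    sense in both sentences iff the cosine between its two contextual
    embeddings exceeds a threshold tuned on a dev set."""
    return cosine(emb1, emb2) >= threshold

# Made-up contextual embeddings of one target word in two contexts:
print(same_sense([0.9, 0.1], [0.8, 0.3]))    # True  (high cosine)
print(same_sense([0.9, 0.1], [-0.2, 0.95]))  # False (low cosine)
```

The threshold value of 0.5 here is arbitrary; in practice it would be chosen to maximize accuracy on the WiC dev set.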
category (Ponzetto and Navigli, 2010). The resulting mapping has been used to
create BabelNet, a large sense-annotated resource (Navigli and Ponzetto, 2012).
There are two families of solutions. The first requires retraining: we modify the
embedding training to incorporate thesaurus relations like synonymy, antonymy, or
supersenses. This can be done by modifying the static embedding loss function for
word2vec (Yu and Dredze 2014, Nguyen et al. 2016) or by modifying contextual
embedding training (Levine et al. 2020, Lauscher et al. 2019).
The second family, for static embeddings, is more lightweight: after the embeddings
have been trained, we learn a second mapping based on a thesaurus that shifts the
embeddings of words in such a way that synonyms (according to the thesaurus) are
pushed closer and antonyms further apart. Such methods are called retrofitting
(Faruqui et al. 2015, Lengerich et al. 2018) or counterfitting (Mrkšić et al., 2016).
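As a sketch of the retrofitting idea, the update below is a simplified version of the Faruqui et al. (2015) scheme with all weights set to 1: each word is repeatedly moved toward the average of its original vector and its thesaurus neighbors.

```python
def retrofit(q_hat, neighbors, iterations=10):
    """q_hat: original embeddings; neighbors: thesaurus edges.
    Each iteration pulls a word toward its neighbors' vectors while
    anchoring it to its original vector (all alpha/beta weights = 1)."""
    q = {w: list(v) for w, v in q_hat.items()}
    for _ in range(iterations):
        for w, nbrs in neighbors.items():
            if not nbrs:
                continue
            dim = len(q[w])
            q[w] = [(q_hat[w][d] + sum(q[n][d] for n in nbrs))
                    / (1 + len(nbrs))
                    for d in range(dim)]
    return q

# Toy 2-d embeddings; the synonymy edge pulls "sofa" toward "couch".
q_hat = {"couch": [1.0, 0.0], "sofa": [0.0, 1.0]}
neighbors = {"couch": ["sofa"], "sofa": ["couch"]}
q = retrofit(q_hat, neighbors)
```

With these toy inputs the two synonyms converge toward (2/3, 1/3) and (1/3, 2/3), much closer together than the originals while each stays near its starting vector.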
Since this is an unsupervised algorithm, we don’t have names for each of these
“senses” of w; we just refer to the jth sense of w.
To disambiguate a particular token t of w we again have three steps:
1. Compute a context vector c for t.
2. Retrieve all sense vectors s j for w.
3. Assign t to the sense represented by the sense vector s j that is closest to c.
All we need is a clustering algorithm and a distance metric between vectors.
Clustering is a well-studied problem with a wide number of standard algorithms that
can be applied to inputs structured as vectors of numerical values (Duda and Hart,
1973). A frequently used technique in language applications is known as agglom-
erative clustering. In this technique, each of the N training instances is initially
assigned to its own cluster. New clusters are then formed in a bottom-up fashion by
the successive merging of the two clusters that are most similar. This process con-
tinues until either a specified number of clusters is reached, or some global goodness
measure among the clusters is achieved. In cases in which the number of training
instances makes this method too expensive, random sampling can be used on the
original training set to achieve similar results.
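A bare-bones version of agglomerative clustering over context vectors might look like this; a real system would use an optimized library and high-dimensional embeddings rather than the toy 2-d vectors shown:

```python
def centroid(cluster):
    """Mean vector of a cluster of context vectors."""
    n, dim = len(cluster), len(cluster[0])
    return [sum(v[d] for v in cluster) / n for d in range(dim)]

def dist(u, v):
    """Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5

def agglomerate(vectors, k):
    """Start with one cluster per instance, then repeatedly merge
    the two clusters with the closest centroids until k remain."""
    clusters = [[v] for v in vectors]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = dist(centroid(clusters[i]), centroid(clusters[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # j > i, so this index is still valid
    return clusters

# Toy context vectors for tokens of one ambiguous word (two groups):
contexts = [[0.0, 0.1], [0.1, 0.0], [0.9, 1.0], [1.0, 0.9]]
senses = agglomerate(contexts, k=2)
```

Each resulting cluster stands for one induced "sense" of the word; its centroid is the sense vector s j used in the three disambiguation steps above.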
How can we evaluate unsupervised sense disambiguation approaches? As usual,
the best way is to do extrinsic evaluation embedded in some end-to-end system; one
example used in a SemEval bakeoff is to improve search result clustering and di-
versification (Navigli and Vannella, 2013). Intrinsic evaluation requires a way to
map the automatically derived sense classes into a hand-labeled gold-standard set so
that we can compare a hand-labeled test set with a set labeled by our unsupervised
classifier. Various such metrics have been tested, for example in the SemEval tasks
(Manandhar et al. 2010, Navigli and Vannella 2013, Jurgens and Klapaftis 2013),
including cluster overlap metrics, or methods that map each sense cluster to a pre-
defined sense by choosing the sense that (in some training set) has the most overlap
with the cluster. However, it is fair to say that no evaluation metric for this task has
yet become standard.
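The last evaluation strategy mentioned, mapping each induced cluster to the predefined sense with which it overlaps most, can be sketched as:

```python
from collections import Counter

def map_clusters(cluster_ids, gold_senses):
    """For each induced cluster id, pick the gold sense that occurs
    most often among the tokens assigned to that cluster."""
    mapping = {}
    for c in set(cluster_ids):
        votes = Counter(g for cid, g in zip(cluster_ids, gold_senses)
                        if cid == c)
        mapping[c] = votes.most_common(1)[0][0]
    return mapping

# Toy induced clusters and hand-labeled gold senses for five tokens:
clusters = [0, 0, 0, 1, 1]
gold     = ["bank1", "bank1", "bank2", "bank2", "bank2"]
mapping = map_clusters(clusters, gold)
print(mapping)  # {0: 'bank1', 1: 'bank2'}

# Score the mapped cluster labels against the gold labels:
accuracy = sum(mapping[c] == g for c, g in zip(clusters, gold)) / len(gold)
print(accuracy)  # 0.8
```

This is only one of the metrics tested in the SemEval evaluations cited above; cluster-overlap measures work differently, but the mapping step shown here is the common core of the "map then score" family.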
18.8 Summary
This chapter has covered a wide range of issues concerning the meanings associated
with lexical items. The following are among the highlights:
• A word sense is the locus of word meaning; definitions and meaning relations
are defined at the level of the word sense rather than wordforms.
• Many words are polysemous, having many senses.
• Relations between senses include synonymy, antonymy, meronymy, and
taxonomic relations such as hyponymy and hypernymy.
• WordNet is a large database of lexical relations for English, and WordNets
exist for a variety of languages.
• Word-sense disambiguation (WSD) is the task of determining the correct
sense of a word in context. Supervised approaches make use of a corpus
of sentences in which individual words (lexical sample task) or all words
(all-words task) are hand-labeled with senses from a resource like WordNet.
SemCor is the largest corpus with WordNet-labeled senses.
• The standard supervised algorithm for WSD is nearest neighbors with contex-
tual embeddings.
• Feature-based algorithms using parts of speech and embeddings of words in
the context of the target word also work well.
• An important baseline for WSD is the most frequent sense, equivalent, in
WordNet, to taking the first sense.
• Another baseline is a knowledge-based WSD algorithm called the Lesk al-
gorithm which chooses the sense whose dictionary definition shares the most
words with the target word’s neighborhood.
• Word sense induction is the task of learning word senses unsupervised.
Bibliographical and Historical Notes

of disambiguation rules for 1790 ambiguous English words. Lesk (1986) was the
first to use a machine-readable dictionary for word sense disambiguation. Fellbaum
(1998) collects early work on WordNet. Early work using dictionaries as lexical
resources include Amsler’s 1981 use of the Merriam Webster dictionary and Long-
man’s Dictionary of Contemporary English (Boguraev and Briscoe, 1989).
Supervised approaches to disambiguation began with the use of decision trees
by Black (1988). In addition to the IMS and contextual-embedding based methods
for supervised WSD, recent supervised algorithms include encoder-decoder models
(Raganato et al., 2017a).
The need for large amounts of annotated text in supervised methods led early
on to investigations into the use of bootstrapping methods (Hearst 1991, Yarowsky
1995). The semi-supervised algorithm of Diab and Resnik (2002) is
based on aligned parallel corpora in two languages. For example, the fact that the
French word catastrophe might be translated as English disaster in one instance
and tragedy in another instance can be used to disambiguate the senses of the two
English words (i.e., to choose senses of disaster and tragedy that are similar).
The earliest use of clustering in the study of word senses was by Sparck Jones
(1986); Pedersen and Bruce (1997), Schütze (1997), and Schütze (1998) applied dis-
tributional methods. Clustering word senses into coarse senses has also been used
to address the problem of dictionary senses being too fine-grained (Section 18.5.3)
(Dolan 1994, Chen and Chang 1998, Mihalcea and Moldovan 2001, Agirre and
de Lacalle 2003, Palmer et al. 2004, Navigli 2006, Snow et al. 2007, Pilehvar et al.
2013). Corpora with clustered word senses for training supervised clustering algo-
rithms include Palmer et al. (2006) and OntoNotes (Hovy et al., 2006).
See Pustejovsky (1995), Pustejovsky and Boguraev (1996), Martin (1986), and
Copestake and Briscoe (1995), inter alia, for computational approaches to the rep-
resentation of polysemy. Pustejovsky's theory of the generative lexicon, and in
particular his theory of the qualia structure of words, is a way of accounting for the
dynamic systematic polysemy of words in context.
Historical overviews of WSD include Agirre and Edmonds (2006) and Navigli
(2009).
Exercises
18.1 Collect a small corpus of example sentences of varying lengths from any
newspaper or magazine. Using WordNet or any standard dictionary, deter-
mine how many senses there are for each of the open-class words in each sen-
tence. How many distinct combinations of senses are there for each sentence?
How does this number seem to vary with sentence length?
18.2 Using WordNet or a standard reference dictionary, tag each open-class word
in your corpus with its correct tag. Was choosing the correct sense always a
straightforward task? Report on any difficulties you encountered.
18.3 Using your favorite dictionary, simulate the original Lesk word overlap dis-
ambiguation algorithm described on page 13 on the phrase Time flies like an
arrow. Assume that the words are to be disambiguated one at a time, from
left to right, and that the results from earlier decisions are used later in the
process.
Agirre, E. and O. L. de Lacalle. 2003. Clustering WordNet word senses. RANLP 2003.
Agirre, E. and P. Edmonds, editors. 2006. Word Sense Disambiguation: Algorithms and Applications. Kluwer.
Amsler, R. A. 1981. A taxonomy for English nouns and verbs. ACL.
Basile, P., A. Caputo, and G. Semeraro. 2014. An enhanced Lesk word sense disambiguation algorithm through a distributional semantic model. COLING.
Black, E. 1988. An experiment in computational discrimination of English word senses. IBM Journal of Research and Development, 32(2):185–194.
Boguraev, B. K. and T. Briscoe, editors. 1989. Computational Lexicography for Natural Language Processing. Longman.
Chaplot, D. S. and R. Salakhutdinov. 2018. Knowledge-based word sense disambiguation using topic models. AAAI.
Chen, J. N. and J. S. Chang. 1998. Topical clustering of MRD senses based on information retrieval techniques. Computational Linguistics, 24(1):61–96.
Ciaramita, M. and Y. Altun. 2006. Broad-coverage sense disambiguation and information extraction with a supersense sequence tagger. EMNLP.
Ciaramita, M. and M. Johnson. 2003. Supersense tagging of unknown nouns in WordNet. EMNLP-2003.
Copestake, A. and T. Briscoe. 1995. Semi-productive polysemy and sense extension. Journal of Semantics, 12(1):15–68.
Cottrell, G. W. 1985. A Connectionist Approach to Word Sense Disambiguation. Ph.D. thesis, University of Rochester, Rochester, NY. Revised version published by Pitman, 1989.
Diab, M. and P. Resnik. 2002. An unsupervised method for word sense tagging using parallel corpora. ACL.
Dolan, B. 1994. Word sense ambiguation: Clustering related senses. COLING.
Duda, R. O. and P. E. Hart. 1973. Pattern Classification and Scene Analysis. John Wiley and Sons.
Faruqui, M., J. Dodge, S. K. Jauhar, C. Dyer, E. Hovy, and N. A. Smith. 2015. Retrofitting word vectors to semantic lexicons. NAACL HLT.
Fellbaum, C., editor. 1998. WordNet: An Electronic Lexical Database. MIT Press.
Gale, W. A., K. W. Church, and D. Yarowsky. 1992a. Estimating upper and lower bounds on the performance of word-sense disambiguation programs. ACL.
Gale, W. A., K. W. Church, and D. Yarowsky. 1992b. One sense per discourse. HLT.
Haber, J. and M. Poesio. 2020. Assessing polyseme sense similarity through co-predication acceptability and contextualised embedding distance. *SEM.
Hearst, M. A. 1991. Noun homograph disambiguation. Proceedings of the 7th Conference of the University of Waterloo Centre for the New OED and Text Research.
Henrich, V., E. Hinrichs, and T. Vodolazova. 2012. WebCAGe – a web-harvested corpus annotated with GermaNet senses. EACL.
Hirst, G. 1987. Semantic Interpretation and the Resolution of Ambiguity. Cambridge University Press.
Hirst, G. 1988. Resolving lexical ambiguity computationally with spreading activation and polaroid words. In S. L. Small, G. W. Cottrell, and M. K. Tanenhaus, editors, Lexical Ambiguity Resolution, pages 73–108. Morgan Kaufmann.
Hirst, G. and E. Charniak. 1982. Word sense and case slot disambiguation. AAAI.
Hovy, E. H., M. P. Marcus, M. Palmer, L. A. Ramshaw, and R. Weischedel. 2006. OntoNotes: The 90% solution. HLT-NAACL.
Iacobacci, I., M. T. Pilehvar, and R. Navigli. 2016. Embeddings for word sense disambiguation: An evaluation study. ACL.
Jurgens, D. and I. P. Klapaftis. 2013. SemEval-2013 task 13: Word sense induction for graded and non-graded senses. *SEM.
Kawamoto, A. H. 1988. Distributed representations of ambiguous words and their resolution in connectionist networks. In S. L. Small, G. W. Cottrell, and M. Tanenhaus, editors, Lexical Ambiguity Resolution, pages 195–228. Morgan Kaufmann.
Kelly, E. F. and P. J. Stone. 1975. Computer Recognition of English Word Senses. North-Holland.
Kilgarriff, A. and J. Rosenzweig. 2000. Framework and results for English SENSEVAL. Computers and the Humanities, 34:15–48.
Kumar, S., S. Jat, K. Saxena, and P. Talukdar. 2019. Zero-shot word sense disambiguation using sense definition embeddings. ACL.
Landes, S., C. Leacock, and R. I. Tengi. 1998. Building semantic concordances. In C. Fellbaum, editor, WordNet: An Electronic Lexical Database, pages 199–216. MIT Press.
Lauscher, A., I. Vulić, E. M. Ponti, A. Korhonen, and G. Glavaš. 2019. Informing unsupervised pretraining with external linguistic knowledge. ArXiv preprint arXiv:1909.02339.
Lengerich, B., A. Maas, and C. Potts. 2018. Retrofitting distributional embeddings to knowledge graphs with functional relations. COLING.
Lesk, M. E. 1986. Automatic sense disambiguation using machine readable dictionaries: How to tell a pine cone from an ice cream cone. Proceedings of the 5th International Conference on Systems Documentation.
Levine, Y., B. Lenz, O. Dagan, O. Ram, D. Padnos, O. Sharir, S. Shalev-Shwartz, A. Shashua, and Y. Shoham. 2020. SenseBERT: Driving some sense into BERT. ACL.
Loureiro, D. and A. Jorge. 2019. Language modelling makes sense: Propagating representations through WordNet for full-coverage word sense disambiguation. ACL.
Luo, F., T. Liu, Z. He, Q. Xia, Z. Sui, and B. Chang. 2018a. Leveraging gloss knowledge in neural word sense disambiguation by hierarchical co-attention. EMNLP.
Luo, F., T. Liu, Q. Xia, B. Chang, and Z. Sui. 2018b. Incorporating glosses into neural word sense disambiguation. ACL.
Madhu, S. and D. Lytel. 1965. A figure of merit technique for the resolution of non-grammatical ambiguity. Mechanical Translation, 8(2):9–13.
Manandhar, S., I. P. Klapaftis, D. Dligach, and S. Pradhan. 2010. SemEval-2010 task 14: Word sense induction & disambiguation. SemEval.
Martin, J. H. 1986. The acquisition of polysemy. ICML.
Masterman, M. 1957. The thesaurus in syntax and semantics. Mechanical Translation, 4(1):1–2.
Melamud, O., J. Goldberger, and I. Dagan. 2016. context2vec: Learning generic context embedding with bidirectional LSTM. CoNLL.
Mihalcea, R. 2007. Using Wikipedia for automatic word sense disambiguation. NAACL-HLT.
Mihalcea, R. and D. Moldovan. 2001. Automatic generation of a coarse grained WordNet. NAACL Workshop on WordNet and Other Lexical Resources.
Miller, G. A., C. Leacock, R. I. Tengi, and R. T. Bunker. 1993. A semantic concordance. HLT.
Morris, W., editor. 1985. American Heritage Dictionary, 2nd college edition. Houghton Mifflin.
Mrkšić, N., D. Ó. Séaghdha, B. Thomson, M. Gašić, L. M. Rojas-Barahona, P.-H. Su, D. Vandyke, T.-H. Wen, and S. Young. 2016. Counter-fitting word vectors to linguistic constraints. NAACL HLT.
Navigli, R. 2006. Meaningful clustering of senses helps boost word sense disambiguation performance. COLING/ACL.
Navigli, R. 2009. Word sense disambiguation: A survey. ACM Computing Surveys, 41(2).
Navigli, R. 2016. Chapter 20: Ontologies. In R. Mitkov, editor, The Oxford Handbook of Computational Linguistics.
Pilehvar, M. T., D. Jurgens, and R. Navigli. 2013. Align, disambiguate and walk: A unified approach for measuring semantic similarity. ACL.
Ponzetto, S. P. and R. Navigli. 2010. Knowledge-rich word sense disambiguation rivaling supervised systems. ACL.
Pu, X., N. Pappas, J. Henderson, and A. Popescu-Belis. 2018. Integrating weakly supervised word sense disambiguation into neural machine translation. TACL, 6:635–649.
Pustejovsky, J. 1995. The Generative Lexicon. MIT Press.
Pustejovsky, J. and B. K. Boguraev, editors. 1996. Lexical Semantics: The Problem of Polysemy. Oxford University Press.
Quillian, M. R. 1968. Semantic memory. In M. Minsky, editor, Semantic Information Processing, pages 227–270. MIT Press.
Quillian, M. R. 1969. The teachable language comprehender: A simulation program and theory of language. CACM, 12(8):459–476.
Raganato, A., C. D. Bovi, and R. Navigli. 2017a. Neural sequence learning models for word sense disambiguation. EMNLP.
Raganato, A., J. Camacho-Collados, and R. Navigli. 2017b. Word sense disambiguation: A unified evaluation framework and empirical comparison. EACL.
Riesbeck, C. K. 1975. Conceptual analysis. In R. C. Schank, editor, Conceptual Information Processing, pages 83–156. American Elsevier, New York.
Schneider, N., J. D. Hwang, V. Srikumar, J. Prange, A. Blodgett, S. R. Moeller, A. Stern, A. Bitan, and O. Abend. 2018. Comprehensive supersense disambiguation of English prepositions and possessives. ACL.
Oxford University Press. Schütze, H. 1992. Dimensions of meaning. Proceedings of
Navigli, R. and S. P. Ponzetto. 2012. BabelNet: The auto- Supercomputing ’92. IEEE Press.
matic construction, evaluation and application of a wide- Schütze, H. 1997. Ambiguity Resolution in Language Learn-
coverage multilingual semantic network. Artificial Intel- ing: Computational and Cognitive Models. CSLI Publi-
ligence, 193:217–250. cations, Stanford, CA.
Navigli, R. and D. Vannella. 2013. SemEval-2013 task 11:
Schütze, H. 1998. Automatic word sense discrimination.
Word sense induction and disambiguation within an end-
Computational Linguistics, 24(1):97–124.
user application. *SEM.
Simmons, R. F. 1973. Semantic networks: Their compu-
Nguyen, K. A., S. Schulte im Walde, and N. T. Vu. 2016.
tation and use for understanding English sentences. In
Integrating distributional lexical contrast into word em-
R. C. Schank and K. M. Colby, editors, Computer Models
beddings for antonym-synonym distinction. ACL.
of Thought and Language, pages 61–113. W.H. Freeman
Palmer, M., O. Babko-Malaya, and H. T. Dang. 2004. Dif- and Co.
ferent sense granularities for different applications. HLT-
Small, S. L. and C. Rieger. 1982. Parsing and comprehend-
NAACL Workshop on Scalable Natural Language Under-
ing with Word Experts. In W. G. Lehnert and M. H.
standing.
Ringle, editors, Strategies for Natural Language Process-
Palmer, M., H. T. Dang, and C. Fellbaum. 2006. Making ing, pages 89–147. Lawrence Erlbaum.
fine-grained and coarse-grained sense distinctions, both
manually and automatically. Natural Language Engineer- Snow, R., S. Prakash, D. Jurafsky, and A. Y. Ng. 2007.
ing, 13(2):137–163. Learning to merge word senses. EMNLP/CoNLL.
Pedersen, T. and R. Bruce. 1997. Distinguishing word senses Snyder, B. and M. Palmer. 2004. The English all-words task.
in untagged text. EMNLP. SENSEVAL-3.
Peters, M., M. Neumann, M. Iyyer, M. Gardner, C. Clark, Sparck Jones, K. 1986. Synonymy and Semantic Classifica-
K. Lee, and L. Zettlemoyer. 2018. Deep contextualized tion. Edinburgh University Press, Edinburgh. Republica-
word representations. NAACL HLT. tion of 1964 PhD Thesis.
Pilehvar, M. T. and J. Camacho-Collados. 2019. WiC: the Tsvetkov, Y., N. Schneider, D. Hovy, A. Bhatia, M. Faruqui,
word-in-context dataset for evaluating context-sensitive and C. Dyer. 2014. Augmenting English adjective senses
meaning representations. NAACL HLT. with supersenses. LREC.
22 Chapter 18 • Word Senses and WordNet